html

v0.0.0-...-c9a5dc9 | Published: Jul 24, 2026 | License: Apache-2.0 | Imports: 7 | Imported by: 7

Repository

github.com/cloudwego/eino-ext

README

HTML Parser for Eino

Introduction

This is an HTML parser component for Eino. It implements the Parser interface and can be seamlessly integrated into Eino's document processing workflow to parse HTML content into structured documents.

Features

Implements the github.com/cloudwego/eino/components/document/parser.Parser interface
Parses HTML content into plain text
Extracts metadata from HTML (title, description, language, charset)
Supports a customizable content selector using CSS selector syntax
Sanitizes HTML using bluemonday
Integrates easily into Eino workflows

Installation

go get github.com/cloudwego/eino-ext/components/document/parser/html

Quick Start

package main

import (
	"context"
	"log"
	"strings"

	"github.com/cloudwego/eino-ext/components/document/parser/html"
)

func main() {
	ctx := context.Background()

	parser, err := html.NewParser(ctx, &html.Config{})
	if err != nil {
		log.Fatalf("html.NewParser failed, err=%v", err)
	}

	htmlContent := `
		<!DOCTYPE html>
		<html lang="en">
		<head>
			<meta charset="UTF-8">
			<meta name="description" content="Sample page">
			<title>Sample Page</title>
		</head>
		<body>
			<h1>Hello World</h1>
			<p>This is a sample HTML document.</p>
		</body>
		</html>
	`

	reader := strings.NewReader(htmlContent)
	docs, err := parser.Parse(ctx, reader)
	if err != nil {
		log.Fatalf("parser.Parse failed, err=%v", err)
	}

	log.Printf("Content: %s", docs[0].Content)
	log.Printf("Title: %s", docs[0].MetaData[html.MetaKeyTitle])
	log.Printf("Description: %s", docs[0].MetaData[html.MetaKeyDesc])
	log.Printf("Language: %s", docs[0].MetaData[html.MetaKeyLang])
}

Configuration

The parser can be configured through the Config structure:

type Config struct {
    // Selector is a CSS selector to extract specific content (Optional)
    // Examples: "body" for <body>, "#content" for <div id="content">
    // Default: entire document
    Selector *string
}

Metadata

The parser automatically extracts and adds the following metadata to parsed documents:

_title: Document title from <title> tag
_description: Description from <meta name="description"> tag
_language: Language from <html lang=""> attribute
_charset: Character encoding from <meta charset=""> tag
_source: Source URI (if provided via parser options)
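
Because metadata is looked up by key, reading a value usually involves a type assertion and a fallback for missing keys. The following is a minimal sketch, assuming the metadata map has the shape map[string]any; the metaString helper and the lowercase constants are illustrative stand-ins defined locally, not part of the package.

```go
package main

import "fmt"

// Local copies of two of the metadata keys listed above, so the sketch
// compiles without importing the parser package.
const (
	metaKeyTitle = "_title"
	metaKeyLang  = "_language"
)

// metaString is a hypothetical helper: it returns the value stored under key
// as a string, or fallback when the key is missing, empty, or not a string.
func metaString(meta map[string]any, key, fallback string) string {
	if s, ok := meta[key].(string); ok && s != "" {
		return s
	}
	return fallback
}

func main() {
	// Assumed shape of a parsed document's metadata.
	meta := map[string]any{metaKeyTitle: "Sample Page"}

	fmt.Println(metaString(meta, metaKeyTitle, "(untitled)"))
	fmt.Println(metaString(meta, metaKeyLang, "und"))
}
```

The comma-ok form of the type assertion avoids a panic when a key is absent or holds an unexpected type.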

Using Custom Selector

You can extract content from specific parts of the HTML:

// Extract only the content of <body>.
bodySelector := "body"
bodyParser, err := html.NewParser(ctx, &html.Config{
    Selector: &bodySelector,
})

// Extract only the element with id="main-content".
contentSelector := "#main-content"
contentParser, err := html.NewParser(ctx, &html.Config{
    Selector: &contentSelector,
})
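
Since Selector is a *string, a string literal cannot be assigned to it directly, which is why the snippets above go through a local variable. One possible way to shorten this is a small generic helper. The sketch below is self-contained and assumes only the Config shape shown earlier; the config type, ptr, and describe are hypothetical names defined locally for illustration.

```go
package main

import "fmt"

// config mirrors the shape of the Config struct shown earlier. It is a local
// stand-in so the sketch compiles on its own.
type config struct {
	Selector *string
}

// ptr is a hypothetical generic helper that returns a pointer to a copy of v.
func ptr[T any](v T) *T { return &v }

// describe reports which part of the document a config would select,
// treating a nil Selector as "no selector configured".
func describe(c config) string {
	if c.Selector == nil {
		return "entire document"
	}
	return "selector " + *c.Selector
}

func main() {
	fmt.Println(describe(config{}))
	fmt.Println(describe(config{Selector: ptr("#main-content")}))
}
```

A pointer field lets the configuration distinguish "not set" from an empty string, at the cost of this small amount of ceremony at the call site.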

Using in Chain

import (
    "github.com/cloudwego/eino/compose"
    "github.com/cloudwego/eino/components/document"
    "github.com/cloudwego/eino/schema"

    urlLoader "github.com/cloudwego/eino-ext/components/document/loader/url"
    htmlParser "github.com/cloudwego/eino-ext/components/document/parser/html"
)

// Errors are ignored here for brevity; check them in real code.
parser, _ := htmlParser.NewParser(ctx, &htmlParser.Config{})

// The loader fetches the URL and hands the response body to the parser.
loader, _ := urlLoader.NewURLLoader(ctx, &urlLoader.LoaderConfig{
    Parser: parser,
})

chain := compose.NewChain[document.Source, []*schema.Document]()
chain.AppendLoader(loader)

run, _ := chain.Compile(ctx)
docs, _ := run.Invoke(ctx, document.Source{URI: "https://example.com"})

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Documentation

Index

Constants
Variables
type Config
type Parser
- func NewParser(ctx context.Context, conf *Config) (*Parser, error)
- func (p *Parser) Parse(ctx context.Context, reader io.Reader, opts ...parser.Option) ([]*schema.Document, error)

Constants

const (
	MetaKeyTitle   = "_title"
	MetaKeyDesc    = "_description"
	MetaKeyLang    = "_language"
	MetaKeyCharset = "_charset"
	MetaKeySource  = "_source"
)

Variables

var (
	BodySelector = "body"
)

Types

type Config

type Config struct {
	// Selector is a goquery content selector, e.g. "body" for <body> or "#id" for <div id="id">.
	Selector *string
}

type Parser

type Parser struct {
	// contains filtered or unexported fields
}

Parser implements parser.Parser. It parses HTML content into plain text using goquery: it reads the <body> content as text (with tags removed) and extracts the title, description, language, and charset from the HTML as metadata.
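
To make the metadata step concrete, here is a rough sketch of pulling the same four fields out of a small HTML document. It deliberately swaps in the standard library's encoding/xml decoder in its lenient HTML mode instead of goquery, so it only copes with simple, reasonably well-formed markup and is not how the package itself is implemented. It also takes a string rather than an io.Reader to stay short. The extractMeta and attr functions and the lowercase constants are hypothetical names.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// Local copies of the metadata keys, so the sketch has no external imports.
const (
	metaKeyTitle   = "_title"
	metaKeyDesc    = "_description"
	metaKeyLang    = "_language"
	metaKeyCharset = "_charset"
)

// attr returns the value of the named attribute, or "" if it is absent.
func attr(el xml.StartElement, name string) string {
	for _, a := range el.Attr {
		if strings.EqualFold(a.Name.Local, name) {
			return a.Value
		}
	}
	return ""
}

// extractMeta walks the token stream and records title, description,
// language, and charset when it encounters them.
func extractMeta(doc string) (map[string]any, error) {
	dec := xml.NewDecoder(strings.NewReader(doc))
	dec.Strict = false                // tolerate HTML-style markup
	dec.AutoClose = xml.HTMLAutoClose // treat <meta>, <br>, etc. as self-closing
	dec.Entity = xml.HTMLEntity       // resolve named HTML entities

	meta := map[string]any{}
	inTitle := false
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return meta, nil
		}
		if err != nil {
			return nil, err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			switch strings.ToLower(t.Name.Local) {
			case "html":
				if v := attr(t, "lang"); v != "" {
					meta[metaKeyLang] = v
				}
			case "title":
				inTitle = true
			case "meta":
				if v := attr(t, "charset"); v != "" {
					meta[metaKeyCharset] = v
				}
				if strings.EqualFold(attr(t, "name"), "description") {
					meta[metaKeyDesc] = attr(t, "content")
				}
			}
		case xml.EndElement:
			if strings.EqualFold(t.Name.Local, "title") {
				inTitle = false
			}
		case xml.CharData:
			if inTitle {
				meta[metaKeyTitle] = strings.TrimSpace(string(t))
			}
		}
	}
}

func main() {
	doc := `<html lang="en"><head><meta charset="UTF-8">` +
		`<meta name="description" content="Sample page">` +
		`<title>Sample Page</title></head><body><p>Hi</p></body></html>`
	meta, err := extractMeta(doc)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(meta[metaKeyTitle], meta[metaKeyDesc], meta[metaKeyLang], meta[metaKeyCharset])
}
```

A real HTML parser such as the one behind goquery handles malformed markup far more robustly, which is one reason to prefer the package over a hand-rolled approach like this.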

func NewParser

func NewParser(ctx context.Context, conf *Config) (*Parser, error)

NewParser returns a new parser.

func (*Parser) Parse

func (p *Parser) Parse(ctx context.Context, reader io.Reader, opts ...parser.Option) ([]*schema.Document, error)
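
The variadic opts parameter follows the functional-options pattern, and the README notes that _source is filled in when a source URI is provided via parser options. The sketch below illustrates that flow in isolation. It is a stand-in under stated assumptions: options, option, withURI, and parse are hypothetical local names, not the real parser package API.

```go
package main

import "fmt"

// options collects per-call settings. It is a local stand-in, not the real
// parser options type.
type options struct {
	uri string
}

// option mutates options; this is the general shape of a functional option.
type option func(*options)

// withURI is a hypothetical option that records the source URI of the content.
func withURI(uri string) option {
	return func(o *options) { o.uri = uri }
}

// parse stands in for a Parse method: it applies every option in order and
// copies the URI, if one was given, into the metadata under "_source".
func parse(content string, opts ...option) map[string]any {
	var o options
	for _, apply := range opts {
		apply(&o)
	}
	meta := map[string]any{}
	if o.uri != "" {
		meta["_source"] = o.uri
	}
	return meta
}

func main() {
	withoutURI := parse("<p>hi</p>")
	withSource := parse("<p>hi</p>", withURI("https://example.com"))

	_, ok := withoutURI["_source"]
	fmt.Println("source set without option:", ok)
	fmt.Println("source with option:", withSource["_source"])
}
```

Functional options keep the method signature stable while allowing callers to pass only the settings they care about.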

Source Files

html.go