html

v0.0.0-...-8fe9471
Published: Jun 10, 2026 License: Apache-2.0 Imports: 7 Imported by: 7

README

HTML Parser for Eino

Introduction

This is an HTML parser component for Eino. It implements the Parser interface and can be seamlessly integrated into Eino's document processing workflow to parse HTML content into structured documents.

Features

  • Implements the github.com/cloudwego/eino/components/document/parser.Parser interface
  • Parses HTML content into plain text
  • Extract metadata from HTML (title, description, language, charset)
  • Customizable content selector using CSS selector syntax
  • HTML sanitization using bluemonday
  • Easy integration into Eino workflows

Installation

go get github.com/cloudwego/eino-ext/components/document/parser/html

Quick Start

package main

import (
	"context"
	"log"
	"strings"

	"github.com/cloudwego/eino-ext/components/document/parser/html"
)

func main() {
	ctx := context.Background()

	parser, err := html.NewParser(ctx, &html.Config{})
	if err != nil {
		log.Fatalf("html.NewParser failed, err=%v", err)
	}

	htmlContent := `
		<!DOCTYPE html>
		<html lang="en">
		<head>
			<meta charset="UTF-8">
			<meta name="description" content="Sample page">
			<title>Sample Page</title>
		</head>
		<body>
			<h1>Hello World</h1>
			<p>This is a sample HTML document.</p>
		</body>
		</html>
	`

	reader := strings.NewReader(htmlContent)
	docs, err := parser.Parse(ctx, reader)
	if err != nil {
		log.Fatalf("parser.Parse failed, err=%v", err)
	}

	log.Printf("Content: %s", docs[0].Content)
	log.Printf("Title: %s", docs[0].MetaData[html.MetaKeyTitle])
	log.Printf("Description: %s", docs[0].MetaData[html.MetaKeyDesc])
	log.Printf("Language: %s", docs[0].MetaData[html.MetaKeyLang])
}

Configuration

The parser is configured through the Config struct:

type Config struct {
    // Selector is a CSS selector to extract specific content (Optional)
    // Examples: "body" for <body>, "#content" for <div id="content">
    // Default: entire document
    Selector *string
}

Metadata

The parser automatically extracts and adds the following metadata to parsed documents:

  • _title: Document title from <title> tag
  • _description: Description from <meta name="description"> tag
  • _language: Language from <html lang=""> attribute
  • _charset: Character encoding from <meta charset=""> tag
  • _source: Source URI (if provided via parser options)
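
Because metadata values are read by key, it can help to guard each lookup. The following is a minimal sketch, assuming the document's MetaData field is a map[string]any; metaString is a hypothetical helper, and the constants are local mirrors of the package's key names so that the example runs without the library.

```go
package main

import "fmt"

// Local mirrors of two metadata keys listed above (assumption: in real code
// you would use html.MetaKeyTitle and html.MetaKeyCharset instead).
const (
	metaKeyTitle   = "_title"
	metaKeyCharset = "_charset"
)

// metaString is a hypothetical helper. It returns the string stored under
// key, or fallback when the key is absent, empty, or not a string.
func metaString(meta map[string]any, key, fallback string) string {
	if s, ok := meta[key].(string); ok && s != "" {
		return s
	}
	return fallback
}

func main() {
	// Stand-in for docs[0].MetaData from the Quick Start example.
	meta := map[string]any{metaKeyTitle: "Sample Page"}

	fmt.Println(metaString(meta, metaKeyTitle, "(untitled)"))
	fmt.Println(metaString(meta, metaKeyCharset, "(unknown)"))
}
```

The comma-ok type assertion avoids a panic when a value is missing or has an unexpected type, which is the main reason to prefer it over a bare assertion.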

Using Custom Selector

You can extract content from specific parts of the HTML:

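// Extract only the contents of <body>.
bodySelector := "body"
bodyParser, err := html.NewParser(ctx, &html.Config{
    Selector: &bodySelector,
})

// Extract only the element with id="main-content".
// Distinct variable names let both parsers live in the same scope.
contentSelector := "#main-content"
contentParser, err := html.NewParser(ctx, &html.Config{
    Selector: &contentSelector,
})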
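
Since Selector is a *string, a literal cannot be assigned to it directly, which is why the snippets above go through a named variable. As a rough sketch of one way around that, the example below uses a small generic helper. Config here is a local stand-in that mirrors the documented struct, and ptr and selectorOrDefault are hypothetical names, not part of the package.

```go
package main

import "fmt"

// Config is a local stand-in mirroring the documented html.Config, so the
// example runs without the library.
type Config struct {
	Selector *string
}

// ptr returns a pointer to a copy of v, which lets a literal be used inline.
func ptr[T any](v T) *T { return &v }

// selectorOrDefault describes what a config selects. A nil Selector is
// reported as the documented default, the entire document.
func selectorOrDefault(c Config) string {
	if c.Selector == nil {
		return "(entire document)"
	}
	return *c.Selector
}

func main() {
	fmt.Println(selectorOrDefault(Config{}))
	fmt.Println(selectorOrDefault(Config{Selector: ptr("#main-content")}))
}
```

The package also exports a BodySelector variable with the value "body", so taking its address (&html.BodySelector) should be another way to select the body without declaring a local variable.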

Using in Chain

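import (
    "context"

    // Assumed import path for the URL loader; the original snippet omitted it.
    // Check the constructor name against that package.
    urlLoader "github.com/cloudwego/eino-ext/components/document/loader/url"
    htmlParser "github.com/cloudwego/eino-ext/components/document/parser/html"
    "github.com/cloudwego/eino/components/document"
    "github.com/cloudwego/eino/compose"
    "github.com/cloudwego/eino/schema"
)

ctx := context.Background()

// Errors are ignored for brevity; check them in real code.
parser, _ := htmlParser.NewParser(ctx, &htmlParser.Config{})
loader, _ := urlLoader.NewURLLoader(ctx, &urlLoader.LoaderConfig{
    Parser: parser,
})

// The loader fetches the URI and hands the response body to the HTML parser.
chain := compose.NewChain[document.Source, []*schema.Document]()
chain.AppendLoader(loader)

run, _ := chain.Compile(ctx)
docs, _ := run.Invoke(ctx, document.Source{URI: "https://example.com"})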

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.

Documentation

Constants

const (
	MetaKeyTitle   = "_title"
	MetaKeyDesc    = "_description"
	MetaKeyLang    = "_language"
	MetaKeyCharset = "_charset"
	MetaKeySource  = "_source"
)

Variables

var (
	BodySelector = "body"
)

Types

type Config

type Config struct {
	// Selector is a goquery content selector, e.g. "body" for <body> or "#id" for <div id="id">.
	Selector *string
}

type Parser

type Parser struct {
	// contains filtered or unexported fields
}

Parser implements parser.Parser and converts HTML content to text. It uses goquery to parse the HTML, reads the <body> content as text with tags removed, and extracts the title, description, language, and charset from the HTML as metadata.

func NewParser

func NewParser(ctx context.Context, conf *Config) (*Parser, error)

NewParser returns a new Parser configured by conf.

func (*Parser) Parse

func (p *Parser) Parse(ctx context.Context, reader io.Reader, opts ...parser.Option) ([]*schema.Document, error)
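
The variadic opts parameter follows Go's functional options pattern, which is presumably how a source URI reaches the _source metadata key. The sketch below illustrates that pattern only. parseOptions, Option, withURI, and sourceMeta are hypothetical local stand-ins, not the eino API; consult the eino parser package for the real option constructors.

```go
package main

import "fmt"

// parseOptions is a hypothetical stand-in for the settings that options
// can modify.
type parseOptions struct {
	uri string
}

// Option is a hypothetical stand-in for parser.Option: a function that
// mutates the settings.
type Option func(*parseOptions)

// withURI is a hypothetical option constructor that records a source URI.
func withURI(uri string) Option {
	return func(o *parseOptions) { o.uri = uri }
}

// sourceMeta applies the options in order and returns metadata that carries
// _source only when a URI was provided.
func sourceMeta(opts ...Option) map[string]any {
	o := &parseOptions{}
	for _, opt := range opts {
		opt(o)
	}
	meta := map[string]any{}
	if o.uri != "" {
		meta["_source"] = o.uri
	}
	return meta
}

func main() {
	fmt.Println(sourceMeta())
	fmt.Println(sourceMeta(withURI("https://example.com")))
}
```

Options are applied in order, so in this sketch a later option overrides an earlier one that sets the same field.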
