trafilatura

package
v0.7.1
Published: Feb 16, 2024 | License: GPL-3.0 | Imports: 14 | Imported by: 0

Documentation

Constants

This section is empty.

Variables

var (
	DefaultUserAgent = "Mozilla/5.0 (X11; Linux x86_64; rv:88.0) Gecko/20100101 Firefox/88.0"
	DefaultTimeout   = 30 * time.Second

	DefaultOptions = &Options{
		FallbackConfig: trafilaturaFallback,
		HttpClient:     &http.Client{},
		Timeout:        DefaultTimeout,
		Transport:      nil,
		UserAgent:      fetch.DefaultUserAgent,
	}
)

Functions

func Factory

func Factory(options Options) func() (fetch.URLFetcher, error)

Factory returns a function that creates a new fetcher configured with the given options.

Types

type Options

type Options struct {
	FallbackConfig *trafilatura.FallbackConfig
	HttpClient     *http.Client
	UserAgent      string
	Transport      http.RoundTripper
	Timeout        time.Duration
}

type TrafilaturaFetcher

type TrafilaturaFetcher struct {
	// contains filtered or unexported fields
}

func NewTrafilaturaFetcher

func NewTrafilaturaFetcher(options Options) *TrafilaturaFetcher

func (*TrafilaturaFetcher) Close

func (f *TrafilaturaFetcher) Close() error

func (*TrafilaturaFetcher) Fetch

func (f *TrafilaturaFetcher) Fetch(url *nurl.URL) (*resource.WebPage, error)

Fetch retrieves the given URL and returns a WebPage resource. The page is fetched and parsed with the Trafilatura library, and the returned resource contains the extracted metadata and content text. StatusCode is set to the HTTP status code of the response. If fetching fails, Fetch returns an error and a *resource.WebPage that holds partial data about the request.

func (*TrafilaturaFetcher) Open

func (f *TrafilaturaFetcher) Open(ctx context.Context) error
