scraper

package
v0.1.1 Latest
Warning: This package is not in the latest version of its module.
Published: Nov 27, 2025 | License: MIT | Imports: 16 | Imported by: 0

Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func CloseBrowser added in v0.0.2

func CloseBrowser()

CloseBrowser closes the browser.

func ClosePlaywright

func ClosePlaywright()

ClosePlaywright closes the browser and stops Playwright.

func ExtractContentWithCSS

func ExtractContentWithCSS(content, includeSelector string, excludeSelectors []string) (string, error)

ExtractContentWithCSS extracts content from HTML using the CSS selector includeSelector; excludeSelectors lists selectors to exclude.

func FetchWebpageContent

func FetchWebpageContent(urlStr string) (string, error)

FetchWebpageContent retrieves the content of a webpage using Playwright.

func InitBrowser added in v0.0.2

func InitBrowser() error

InitBrowser initializes the browser.

func InitPlaywright

func InitPlaywright() error

InitPlaywright initializes Playwright and launches the browser.

func NormalizePathForFilename added in v0.1.1

func NormalizePathForFilename(urlPath string) string

NormalizePathForFilename converts a URL path into a valid filename component.
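
The documentation does not state the normalization rules, so the following is only a minimal sketch of what such a conversion could look like. The helper name normalizePath, the "index" fallback for the root path, and the choice of underscore as the replacement character are all assumptions for illustration, not the package's actual behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizePath is a hypothetical stand-in for NormalizePathForFilename.
// It is not the package's implementation; every rule here is an assumption.
func normalizePath(urlPath string) string {
	// Drop leading and trailing slashes so "/docs/api/" becomes "docs/api".
	trimmed := strings.Trim(urlPath, "/")
	if trimmed == "" {
		return "index" // assumed name for the site root
	}
	var b strings.Builder
	lastUnderscore := false
	for _, r := range trimmed {
		isSafe := (r >= 'a' && r <= 'z') || (r >= 'A' && r <= 'Z') ||
			(r >= '0' && r <= '9') || r == '-' || r == '.'
		if isSafe {
			b.WriteRune(r)
			lastUnderscore = false
			continue
		}
		// Collapse any run of unsafe characters into a single underscore.
		if !lastUnderscore {
			b.WriteByte('_')
			lastUnderscore = true
		}
	}
	return b.String()
}

func main() {
	fmt.Println(normalizePath("/docs/api/v1/"))          // docs_api_v1
	fmt.Println(normalizePath("/guide/getting started")) // guide_getting_started
	fmt.Println(normalizePath("/"))                      // index
}
```

Collapsing runs of separators keeps names short and avoids path separators leaking into the filename; the real function may make different choices.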

func ProcessHTMLContent

func ProcessHTMLContent(htmlContent string, config Config) (string, error)

ProcessHTMLContent converts HTML content to Markdown.

func SaveToFiles added in v0.1.1

func SaveToFiles(content map[string]struct {
	content string
	site    SiteConfig
}, config Config) error

SaveToFiles writes the scraped content to files based on output type.

func ScrapeSites added in v0.0.2

func ScrapeSites(config Config) error

func SetupLogger

func SetupLogger(verbose bool)

SetupLogger initializes the logger based on the verbose flag.

Types

type Config

type Config struct {
	Sites      []SiteConfig
	OutputType string
	Verbose    bool
	Scrape     ScrapeConfig
}

Config holds the scraper configuration.

type PathOverride added in v0.0.2

type PathOverride struct {
	Path             string
	CSSLocator       string
	ExcludeSelectors []string
}

PathOverride holds path-specific overrides.
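
How overrides are matched is not documented here. As a rough sketch, one plausible reading is that an override whose Path is a prefix of the page path replaces the site-level CSSLocator and ExcludeSelectors. The helper resolveSelectors, the prefix-matching rule, the first-match-wins rule, and the fallback to site defaults for empty fields are all assumptions; the types below copy only the documented fields that the sketch needs.

```go
package main

import (
	"fmt"
	"strings"
)

// PathOverride mirrors the documented struct.
type PathOverride struct {
	Path             string
	CSSLocator       string
	ExcludeSelectors []string
}

// SiteConfig copies only the documented fields used in this sketch.
type SiteConfig struct {
	CSSLocator       string
	ExcludeSelectors []string
	PathOverrides    []PathOverride
}

// resolveSelectors is a hypothetical helper, not part of the package.
// It returns the selectors that would apply to urlPath under the
// assumptions stated above.
func resolveSelectors(site SiteConfig, urlPath string) (string, []string) {
	locator, excludes := site.CSSLocator, site.ExcludeSelectors
	for _, o := range site.PathOverrides {
		if !strings.HasPrefix(urlPath, o.Path) {
			continue
		}
		if o.CSSLocator != "" {
			locator = o.CSSLocator
		}
		if len(o.ExcludeSelectors) > 0 {
			excludes = o.ExcludeSelectors
		}
		break // assumed: the first matching override wins
	}
	return locator, excludes
}

func main() {
	site := SiteConfig{
		CSSLocator:       "main",
		ExcludeSelectors: []string{"nav"},
		PathOverrides: []PathOverride{
			{Path: "/blog", CSSLocator: "article"},
		},
	}
	locator, excludes := resolveSelectors(site, "/blog/2025/hello")
	fmt.Println(locator, excludes) // article [nav]
	locator, excludes = resolveSelectors(site, "/docs/intro")
	fmt.Println(locator, excludes) // main [nav]
}
```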

type ScrapeConfig added in v0.0.2

type ScrapeConfig struct {
	RequestsPerSecond float64
	BurstLimit        int
}

ScrapeConfig holds the scraping-specific configuration.

type SiteConfig added in v0.0.2

type SiteConfig struct {
	BaseURL          string
	CSSLocator       string
	ExcludeSelectors []string
	AllowedPaths     []string
	ExcludePaths     []string
	FileNamePrefix   string
	PathOverrides    []PathOverride
}

SiteConfig holds configuration for a single site.
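
The semantics of AllowedPaths and ExcludePaths are not spelled out in this documentation. The sketch below illustrates one plausible interpretation: both lists are matched as path prefixes, ExcludePaths takes precedence, an empty AllowedPaths permits every path, and links on a different host than BaseURL are skipped. The helper shouldVisit and each of these rules are assumptions; the struct copies only the documented fields involved.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// SiteConfig copies only the documented fields used in this sketch.
type SiteConfig struct {
	BaseURL      string
	AllowedPaths []string
	ExcludePaths []string
}

// shouldVisit is a hypothetical helper, not part of the package.
// It applies the assumed filtering rules to a single link.
func shouldVisit(site SiteConfig, link string) bool {
	base, err := url.Parse(site.BaseURL)
	if err != nil {
		return false
	}
	u, err := url.Parse(link)
	if err != nil {
		return false
	}
	// Assumed: stay on the configured host.
	if u.Host != base.Host {
		return false
	}
	// Assumed: exclusions are checked first and always win.
	for _, p := range site.ExcludePaths {
		if strings.HasPrefix(u.Path, p) {
			return false
		}
	}
	// Assumed: an empty allow list means every remaining path is allowed.
	if len(site.AllowedPaths) == 0 {
		return true
	}
	for _, p := range site.AllowedPaths {
		if strings.HasPrefix(u.Path, p) {
			return true
		}
	}
	return false
}

func main() {
	site := SiteConfig{
		BaseURL:      "https://example.com",
		AllowedPaths: []string{"/docs"},
		ExcludePaths: []string{"/docs/archive"},
	}
	fmt.Println(shouldVisit(site, "https://example.com/docs/intro"))        // true
	fmt.Println(shouldVisit(site, "https://example.com/docs/archive/2019")) // false
	fmt.Println(shouldVisit(site, "https://example.com/blog"))              // false
	fmt.Println(shouldVisit(site, "https://other.example/docs"))            // false
}
```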

type URLConfig

type URLConfig struct {
	URL              string
	CSSLocator       string
	ExcludeSelectors []string
	FileNamePrefix   string
}

URLConfig holds configuration for a single URL.
