Documentation ¶
Overview ¶
Package defuddle provides web content extraction and demuddling capabilities.
Index ¶
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Defuddle ¶
type Defuddle struct {
	// contains filtered or unexported fields
}
Defuddle represents a document parser instance.
func NewDefuddle ¶
NewDefuddle creates a new Defuddle instance from HTML content. JavaScript original code:
constructor(document: Document, options: DefuddleOptions = {}) {
  this.doc = document;
  this.options = options;
}
func (*Defuddle) Parse ¶
Parse extracts the main content from the document. If the first pass yields fewer than 200 words, it retries with partial-selector removal disabled and keeps whichever result has more words. JavaScript original code:
parse(): DefuddleResponse {
  // Try first with default settings
  const result = this.parseInternal();
  // If result has very little content, try again without clutter removal
  if (result.wordCount < 200) {
    console.log('Initial parse returned very little content, trying again');
    const retryResult = this.parseInternal({
      removePartialSelectors: false
    });
    // Return the result with more content
    if (retryResult.wordCount > result.wordCount) {
      this._log('Retry produced more content');
      return retryResult;
    }
  }
  return result;
}
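The retry logic above can be sketched in Go without touching the package itself. This is a minimal illustration under stated assumptions: `parseResult`, `parseFunc`, `parseWithRetry`, and `fakeParse` are hypothetical names invented for this example and are not part of the defuddle API. Only the 200-word threshold and the "keep the longer result" rule come from the JavaScript original.

```go
package main

import (
	"fmt"
	"strings"
)

// parseResult is a hypothetical stand-in for the package's Result type.
type parseResult struct {
	Content   string
	WordCount int
}

// parseFunc is a hypothetical stand-in for the internal parse step.
// removePartial mirrors the removePartialSelectors option.
type parseFunc func(removePartial bool) parseResult

// minWords is the threshold used in the JavaScript original.
const minWords = 200

// parseWithRetry parses once with partial-selector removal enabled. If the
// result is short, it parses again with removal disabled and returns
// whichever result has more words.
func parseWithRetry(parse parseFunc) parseResult {
	result := parse(true)
	if result.WordCount < minWords {
		retry := parse(false)
		if retry.WordCount > result.WordCount {
			return retry
		}
	}
	return result
}

// fakeParse simulates a page where aggressive clutter removal strips
// most of the article text.
func fakeParse(removePartial bool) parseResult {
	words := 500
	if removePartial {
		words = 50
	}
	content := strings.TrimSpace(strings.Repeat("word ", words))
	return parseResult{Content: content, WordCount: len(strings.Fields(content))}
}

func main() {
	got := parseWithRetry(fakeParse)
	fmt.Println(got.WordCount) // prints 500
}
```

Passing the parse step in as a function keeps the retry policy testable on its own, independent of any HTML parsing.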
type ExtractedContent ¶
type ExtractedContent struct {
	Title       *string             `json:"title,omitempty"`
	Author      *string             `json:"author,omitempty"`
	Published   *string             `json:"published,omitempty"`
	Content     *string             `json:"content,omitempty"`
	ContentHTML *string             `json:"contentHtml,omitempty"`
	Variables   *ExtractorVariables `json:"variables,omitempty"`
}
ExtractedContent represents content extracted by site-specific extractors. Every field is an optional pointer and is omitted from JSON output when nil. JavaScript original code:
export interface ExtractedContent {
  title?: string;
  author?: string;
  published?: string;
  content?: string;
  contentHtml?: string;
  variables?: ExtractorVariables;
}
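Because the fields are pointers tagged with `omitempty`, a nil field disappears from the JSON output while a pointer to an empty string is still encoded. The sketch below shows that behaviour with a local mirror of three of the documented fields. The `extractedContent` type, the `strPtr` helper, and `encode` are assumptions made for this example and are not part of the package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// extractedContent is a local mirror of a subset of the documented
// ExtractedContent fields, defined here so the example is self-contained.
type extractedContent struct {
	Title     *string `json:"title,omitempty"`
	Author    *string `json:"author,omitempty"`
	Published *string `json:"published,omitempty"`
}

// strPtr is a hypothetical helper that returns a pointer to s.
func strPtr(s string) *string { return &s }

// encode marshals c to JSON and returns an empty string on error.
func encode(c extractedContent) string {
	b, err := json.Marshal(c)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	// Published is nil and is omitted. Author points to "" and is kept.
	c := extractedContent{Title: strPtr("Hello"), Author: strPtr("")}
	fmt.Println(encode(c)) // prints {"title":"Hello","author":""}
}
```

This is the usual Go way to model optional TypeScript properties such as `title?: string`: nil means "absent", which a plain `string` field cannot express.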
type ExtractorVariables ¶
ExtractorVariables represents variables extracted by site-specific extractors. JavaScript original code:
export interface ExtractorVariables {
  [key: string]: string;
}
type MetaTag ¶
MetaTag represents a meta tag item from HTML. This is an alias to the internal metadata.MetaTag type.
type Metadata ¶
Metadata represents extracted metadata from a document. This is an alias to the internal metadata.Metadata type.
type Options ¶
type Options struct {
	// Enable debug logging
	Debug bool `json:"debug,omitempty"`
	// URL of the page being parsed
	URL string `json:"url,omitempty"`
	// Convert output to Markdown
	Markdown bool `json:"markdown,omitempty"`
	// Include Markdown in the response
	SeparateMarkdown bool `json:"separateMarkdown,omitempty"`
	// Whether to remove elements matching exact selectors like ads, social buttons, etc.
	// Defaults to true.
	RemoveExactSelectors bool `json:"removeExactSelectors,omitempty"`
	// Whether to remove elements matching partial selectors like ads, social buttons, etc.
	// Defaults to true.
	RemovePartialSelectors bool `json:"removePartialSelectors,omitempty"`
	// Remove images from the extracted content
	// Defaults to false.
	RemoveImages bool `json:"removeImages,omitempty"`
	// Element processing options
	ProcessCode      bool                                 `json:"processCode,omitempty"`
	ProcessImages    bool                                 `json:"processImages,omitempty"`
	ProcessHeadings  bool                                 `json:"processHeadings,omitempty"`
	ProcessMath      bool                                 `json:"processMath,omitempty"`
	ProcessFootnotes bool                                 `json:"processFootnotes,omitempty"`
	ProcessRoles     bool                                 `json:"processRoles,omitempty"`
	CodeOptions      *elements.CodeBlockProcessingOptions `json:"codeOptions,omitempty"`
	ImageOptions     *elements.ImageProcessingOptions     `json:"imageOptions,omitempty"`
	HeadingOptions   *elements.HeadingProcessingOptions   `json:"headingOptions,omitempty"`
	MathOptions      *elements.MathProcessingOptions      `json:"mathOptions,omitempty"`
	FootnoteOptions  *elements.FootnoteProcessingOptions  `json:"footnoteOptions,omitempty"`
	RoleOptions      *elements.RoleProcessingOptions      `json:"roleOptions,omitempty"`
}
Options represents configuration options for Defuddle parsing. The Go struct adds image-removal and element-processing fields that the JavaScript interface below does not have. JavaScript original code:
export interface DefuddleOptions {
  debug?: boolean;
  url?: string;
  markdown?: boolean;
  separateMarkdown?: boolean;
  removeExactSelectors?: boolean;
  removePartialSelectors?: boolean;
}
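One consequence of modelling the options as plain `bool` fields is worth illustrating: a key that is missing from a JSON config decodes to `false`, the Go zero value, and a plain `bool` cannot distinguish "unset" from "explicitly false". The sketch below uses a local mirror of four documented fields. The `options` type and `decodeOptions` function are assumptions made for this example. How the package applies the documented "Defaults to true" behaviour is not shown on this page, so setting these fields explicitly is the cautious choice.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// options is a local mirror of a subset of the documented Options fields,
// defined here so the example does not depend on the package.
type options struct {
	URL                    string `json:"url,omitempty"`
	Markdown               bool   `json:"markdown,omitempty"`
	RemoveExactSelectors   bool   `json:"removeExactSelectors,omitempty"`
	RemovePartialSelectors bool   `json:"removePartialSelectors,omitempty"`
}

// decodeOptions is a hypothetical helper that parses options from JSON.
func decodeOptions(data string) (options, error) {
	var o options
	err := json.Unmarshal([]byte(data), &o)
	return o, err
}

func main() {
	o, err := decodeOptions(`{"url":"https://example.com/post","markdown":true}`)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	// Keys absent from the JSON are left at the zero value.
	fmt.Println(o.Markdown)               // prints true
	fmt.Println(o.RemoveExactSelectors)   // prints false
	fmt.Println(o.RemovePartialSelectors) // prints false
}
```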
type Result ¶
type Result struct {
	Metadata
	Content         string      `json:"content"`
	ContentMarkdown *string     `json:"contentMarkdown,omitempty"`
	ExtractorType   *string     `json:"extractorType,omitempty"`
	MetaTags        []MetaTag   `json:"metaTags,omitempty"`
	DebugInfo       *debug.Info `json:"debugInfo,omitempty"`
}
Result represents the complete response from Defuddle parsing. It embeds Metadata, the Go counterpart of `extends DefuddleMetadata`. JavaScript original code:
export interface DefuddleResponse extends DefuddleMetadata {
  content: string;
  contentMarkdown?: string;
  extractorType?: string;
  metaTags?: MetaTagItem[];
}
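Embedding is what lets the Go struct mirror the TypeScript `extends`: `encoding/json` promotes the fields of an embedded struct to the top level of the encoded object. The sketch below demonstrates this with local types. The fields of the real Metadata type are not listed on this page, so the `Title` and `Author` fields here are purely hypothetical, as are the local `Metadata` and `Result` mirrors and the `encode` helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Metadata is a local, hypothetical stand-in. The real type's fields are
// not shown in this documentation.
type Metadata struct {
	Title  string `json:"title"`
	Author string `json:"author"`
}

// Result is a local mirror of part of the documented Result struct.
type Result struct {
	Metadata
	Content         string  `json:"content"`
	ContentMarkdown *string `json:"contentMarkdown,omitempty"`
}

// encode marshals r to JSON and returns an empty string on error.
func encode(r Result) string {
	b, err := json.Marshal(r)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	r := Result{
		Metadata: Metadata{Title: "Post", Author: "Ann"},
		Content:  "Hello",
	}
	// Embedded fields appear flat, next to content, not nested.
	fmt.Println(encode(r)) // prints {"title":"Post","author":"Ann","content":"Hello"}
	// Promoted fields are also reachable directly on the outer struct.
	fmt.Println(r.Title) // prints Post
}
```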
func ParseFromString ¶ added in v0.2.0
ParseFromString parses HTML content directly from a string. This is useful when you already have the HTML content (e.g., from browser automation).
type StyleChange ¶
StyleChange represents a CSS style change for mobile.
Directories ¶
| Path | Synopsis |
|---|---|
| cmd | |
| cmd/defuddle (command) | Package main provides the defuddle CLI application. |
| examples | |
| examples/advanced (command) | Package main demonstrates advanced defuddle usage. |
| examples/basic (command) | Package main demonstrates basic defuddle usage. |
| examples/custom_extractor (command) | Package main demonstrates custom extractor usage. |
| examples/extractors (command) | Package main demonstrates extractors usage. |
| examples/markdown (command) | Package main demonstrates markdown conversion. |
| extractors | Package extractors provides site-specific content extraction functionality. |
| internal | |
| internal/constants | Package constants provides configuration constants and selectors for the defuddle content extraction system. |
| internal/debug | Package debug provides debugging functionality for the defuddle content extraction system. |
| internal/elements | Package elements provides enhanced element processing functionality. This module handles code block processing including syntax highlighting, language detection, and code formatting. |
| internal/markdown | Package markdown provides HTML to Markdown conversion functionality. |
| internal/metadata | Package metadata provides functionality for extracting and processing document metadata. |
| internal/pool | Package pool provides memory pooling utilities for the defuddle content extraction system. |
| internal/scoring | Package scoring provides content scoring functionality for the defuddle content extraction system. |
| internal/standardize | Package standardize provides content standardization functionality for the defuddle content extraction system. |