Documentation ¶
Overview ¶
Package source defines the pluggable content source interface for Dewey and provides its implementations. Sources provide documents from various origins (local disk, GitHub, web crawl) that are indexed into the knowledge graph.
Index ¶
- func ParseRefreshInterval(interval string) (time.Duration, error)
- func SaveSourcesConfig(path string, sources []SourceConfig) error
- func SetLogLevel(level log.Level)
- func SetLogOutput(w io.Writer, level log.Level)
- type Change
- type ChangeType
- type DiskSource
- type DiskSourceOption
- type Document
- type FetchResult
- type FetchSummary
- type GitHubSource
- type Manager
- type Source
- type SourceConfig
- type SourceMetadata
- type SourcesFile
- type WebSource
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func ParseRefreshInterval ¶
func ParseRefreshInterval(interval string) (time.Duration, error)
ParseRefreshInterval converts a refresh interval string to a time.Duration. Supports named intervals ("daily" = 24h, "weekly" = 168h, "hourly" = 1h) and Go duration strings (e.g., "1h", "30m"). Returns 0 for an empty string (no refresh interval). Returns an error if the string is not a recognized named interval and cannot be parsed as a Go duration.
func SaveSourcesConfig ¶
func SaveSourcesConfig(path string, sources []SourceConfig) error
SaveSourcesConfig writes the sources configuration to the given path (typically .dewey/sources.yaml) with a descriptive YAML header comment. Overwrites any existing file at the path.
Returns an error if the configuration cannot be marshaled to YAML or the file cannot be written.
func SetLogLevel ¶ added in v1.2.0
func SetLogLevel(level log.Level)
SetLogLevel sets the logging level for the source package. Use log.DebugLevel for verbose output during diagnostics.
Types ¶
type Change ¶
type Change struct {
// Type indicates the kind of change.
Type ChangeType
// Document is the changed document (nil for deletions).
Document *Document
// ID is the document identifier (always set, even for deletions).
ID string
}
Change represents a modification detected by a source's Diff method.
type ChangeType ¶
type ChangeType string
ChangeType enumerates the kinds of changes a source can report.
const (
// ChangeAdded indicates a new document was added.
ChangeAdded ChangeType = "added"
// ChangeModified indicates an existing document was modified.
ChangeModified ChangeType = "modified"
// ChangeDeleted indicates a document was removed.
ChangeDeleted ChangeType = "deleted"
)
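A consumer of Change values might look like the sketch below. The types are local copies trimmed to the fields used here, and applyChanges with its map-based index is a hypothetical helper, not part of the package. It illustrates why ID is always set: deletions carry no Document.

```go
package main

import "fmt"

// Local copies of the documented types, trimmed to the fields used here.
type ChangeType string

const (
	ChangeAdded    ChangeType = "added"
	ChangeModified ChangeType = "modified"
	ChangeDeleted  ChangeType = "deleted"
)

type Document struct {
	ID      string
	Content string
}

type Change struct {
	Type     ChangeType
	Document *Document
	ID       string
}

// applyChanges is a hypothetical consumer that folds changes into an
// in-memory index keyed by document ID.
func applyChanges(index map[string]string, changes []Change) error {
	for _, c := range changes {
		switch c.Type {
		case ChangeAdded, ChangeModified:
			if c.Document == nil {
				return fmt.Errorf("change %q (%s) has no document", c.ID, c.Type)
			}
			index[c.ID] = c.Document.Content
		case ChangeDeleted:
			// Document is nil for deletions, so only the ID is used.
			delete(index, c.ID)
		default:
			return fmt.Errorf("unknown change type %q", c.Type)
		}
	}
	return nil
}

func main() {
	index := map[string]string{"old.md": "stale"}
	changes := []Change{
		{Type: ChangeAdded, ID: "new.md", Document: &Document{ID: "new.md", Content: "hello"}},
		{Type: ChangeDeleted, ID: "old.md"},
	}
	if err := applyChanges(index, changes); err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(index)
}
```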
type DiskSource ¶
type DiskSource struct {
// contains filtered or unexported fields
}
DiskSource implements the Source interface for local Markdown files. It scans a directory for .md files and uses content hashing (SHA-256) for change detection, matching the VaultStore pattern.
func NewDiskSource ¶
func NewDiskSource(id, name, basePath string, opts ...DiskSourceOption) *DiskSource
NewDiskSource creates a DiskSource for the given directory path. Returns a ready-to-use source with an empty stored hashes map. Call DiskSource.SetStoredHashes before DiskSource.Diff to enable incremental change detection.
Options are applied after defaults. The default configuration is recursive=true with no extra ignore patterns. The variadic opts parameter ensures backward compatibility — existing callers that pass zero options continue to work unchanged.
func (*DiskSource) Diff ¶
func (d *DiskSource) Diff() ([]Change, error)
Diff returns changes since the last fetch by comparing current file hashes against stored hashes. Uses the same SHA-256 algorithm as VaultStore. Returns a slice of changes categorized as ChangeAdded, ChangeModified, or ChangeDeleted. Returns an error if the directory walk fails.
Decomposed into walkDiskFiles (directory scan) and diffFileChanges (hash comparison) to keep each function under cyclomatic complexity 10.
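The hash comparison step can be sketched in isolation. The helper names hashContent and diffHashes are hypothetical and work on in-memory maps rather than a directory walk; they are not the unexported functions named above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// hashContent returns the SHA-256 hex digest of content.
func hashContent(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

// diffHashes is a hypothetical stand-in for the comparison step. Both maps
// are keyed by relative path with hex digest values.
func diffHashes(stored, current map[string]string) (added, modified, deleted []string) {
	for path, h := range current {
		old, ok := stored[path]
		switch {
		case !ok:
			added = append(added, path)
		case old != h:
			modified = append(modified, path)
		}
	}
	for path := range stored {
		if _, ok := current[path]; !ok {
			deleted = append(deleted, path)
		}
	}
	// Map iteration order is random, so sort for stable results.
	sort.Strings(added)
	sort.Strings(modified)
	sort.Strings(deleted)
	return added, modified, deleted
}

func main() {
	stored := map[string]string{
		"a.md": hashContent([]byte("alpha")),
		"b.md": hashContent([]byte("beta")),
	}
	current := map[string]string{
		"a.md": hashContent([]byte("alpha, edited")),
		"c.md": hashContent([]byte("gamma")),
	}
	added, modified, deleted := diffHashes(stored, current)
	fmt.Println("added:", added, "modified:", modified, "deleted:", deleted)
}
```

With an empty stored map every current path lands in the added list, which is consistent with a Diff that reports all files as added when no hashes were supplied.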
func (*DiskSource) Fetch ¶
func (d *DiskSource) Fetch(id string) (*Document, error)
Fetch retrieves a single document by its relative file path (e.g., "subfolder/page.md"). Returns the document with computed SHA-256 content hash. Returns an error if the file cannot be read.
func (*DiskSource) List ¶
func (d *DiskSource) List() ([]Document, error)
List returns all .md files in the source directory as Documents, skipping ignored entries (hidden directories, .gitignore patterns, and configured ignore patterns) and unreadable files. Updates the source's lastFetched timestamp on success. Returns an error if the directory walk itself fails or the ignore matcher cannot be constructed.
func (*DiskSource) Meta ¶
func (d *DiskSource) Meta() SourceMetadata
Meta returns metadata about this disk source, including its ID, type ("disk"), name, status, and last fetch timestamp.
func (*DiskSource) SetStoredHashes ¶
func (d *DiskSource) SetStoredHashes(hashes map[string]string)
SetStoredHashes sets the previously known content hashes for change detection. The hashes map is keyed by relative file path with SHA-256 hex digest values. Call this before DiskSource.Diff to enable incremental updates; without stored hashes, Diff reports all files as added.
type DiskSourceOption ¶ added in v1.5.0
type DiskSourceOption func(*DiskSource)
DiskSourceOption configures optional behavior for a DiskSource. Use With* constructors to create options.
func WithIgnorePatterns ¶ added in v1.5.0
func WithIgnorePatterns(patterns []string) DiskSourceOption
WithIgnorePatterns returns a DiskSourceOption that sets additional gitignore-compatible patterns for the DiskSource. These patterns are merged with any .gitignore file found in the source's base directory (union merge semantics per FR-005).
func WithRecursive ¶ added in v1.5.0
func WithRecursive(recursive bool) DiskSourceOption
WithRecursive returns a DiskSourceOption that controls whether subdirectories are traversed during List and Diff walks. When false, only files in the base directory are included.
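The option constructors follow the functional options pattern. The sketch below shows the general shape with a hypothetical diskSource struct; the real DiskSource fields are unexported and may be named differently.

```go
package main

import "fmt"

// diskSource is a hypothetical stand-in for the real, unexported fields.
type diskSource struct {
	basePath       string
	recursive      bool
	ignorePatterns []string
}

// option mutates a diskSource during construction.
type option func(*diskSource)

func withRecursive(recursive bool) option {
	return func(d *diskSource) { d.recursive = recursive }
}

func withIgnorePatterns(patterns []string) option {
	return func(d *diskSource) {
		// Copy so later changes to the caller's slice do not leak in.
		d.ignorePatterns = append([]string(nil), patterns...)
	}
}

// newDiskSource sets defaults first, then applies options in order.
func newDiskSource(basePath string, opts ...option) *diskSource {
	d := &diskSource{basePath: basePath, recursive: true}
	for _, opt := range opts {
		opt(d)
	}
	return d
}

func main() {
	plain := newDiskSource("./docs")
	flat := newDiskSource("./docs", withRecursive(false), withIgnorePatterns([]string{"drafts/"}))
	fmt.Println(plain.recursive, flat.recursive, flat.ignorePatterns)
}
```

Because opts is variadic, a call with no options compiles unchanged and simply receives the defaults.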
type Document ¶
type Document struct {
// ID is the source-specific document identifier (e.g., file path, issue number).
ID string
// Title is the human-readable document title.
Title string
// Content is the raw text content of the document.
Content string
// ContentHash is a hash of the content for change detection.
ContentHash string
// SourceID identifies which source produced this document.
SourceID string
// OriginURL is the original URL for external sources (empty for disk).
OriginURL string
// FetchedAt is when this document was last fetched.
FetchedAt time.Time
// Properties holds source-specific metadata (e.g., GitHub labels, web crawl depth).
Properties map[string]any
}
Document represents a content item fetched from a source. Documents are transient — they are converted to Pages, Blocks, and Links during indexing and are not persisted directly.
type FetchResult ¶
type FetchResult struct {
Summaries []FetchSummary
TotalDocs int
TotalErrs int
TotalSkip int
}
FetchResult is the aggregate result of fetching all sources.
func (*FetchResult) FormatSummary ¶
func (r *FetchResult) FormatSummary() string
FormatSummary returns a human-readable, multi-line summary of the fetch result including per-source status (documents fetched, errors, skips) and aggregate totals.
type FetchSummary ¶
type FetchSummary struct {
SourceID string
SourceType string
Documents int
Errors int
Skipped bool
Error string
}
FetchSummary reports the results of a fetch operation.
type GitHubSource ¶
type GitHubSource struct {
// contains filtered or unexported fields
}
GitHubSource implements the Source interface for GitHub repositories. It fetches issues, pull requests, and READMEs via the GitHub REST API.
Token precedence (FR-015a):
- GITHUB_TOKEN or GH_TOKEN environment variable
- `gh auth token` subprocess if gh CLI is available
- Unauthenticated access (60 req/hr rate limit)
Tokens are NEVER logged or persisted (FR-015b).
func NewGitHubSource ¶
func NewGitHubSource(id, name, org string, repos, contentTypes []string) *GitHubSource
NewGitHubSource creates a GitHubSource for the given organization and repositories. If contentTypes is empty, it defaults to ["issues", "pulls", "readme"]. The GitHub token is resolved at creation time using the precedence chain: GITHUB_TOKEN env → GH_TOKEN env → `gh auth token` subprocess → unauthenticated (60 req/hr limit).
Returns a ready-to-use source. The token is held in memory only and never persisted or logged (FR-015b).
func (*GitHubSource) Diff ¶
func (gs *GitHubSource) Diff() ([]Change, error)
Diff returns changes since the last fetch. GitHub source treats all current items as ChangeModified since the API does not support efficient diff detection. Returns an error if the underlying List call fails.
func (*GitHubSource) Fetch ¶
func (gs *GitHubSource) Fetch(id string) (*Document, error)
Fetch retrieves a single document by its source-specific ID. The ID format is "repo/type/number" (e.g., "gaze/issues/42"). Returns the document with full content and metadata. Returns an error if the ID format is invalid or the GitHub API request fails.
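Parsing the "repo/type/number" form might look like the sketch below. parseID is a hypothetical helper; it handles only the three-part form shown in the example and makes no claim about how README documents are identified.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseID splits an ID of the form "repo/type/number", e.g. "gaze/issues/42".
func parseID(id string) (repo, kind string, number int, err error) {
	parts := strings.Split(id, "/")
	if len(parts) != 3 || parts[0] == "" || parts[1] == "" {
		return "", "", 0, fmt.Errorf("invalid ID %q: want repo/type/number", id)
	}
	number, err = strconv.Atoi(parts[2])
	if err != nil || number <= 0 {
		return "", "", 0, fmt.Errorf("invalid ID %q: bad number %q", id, parts[2])
	}
	return parts[0], parts[1], number, nil
}

func main() {
	for _, id := range []string{"gaze/issues/42", "gaze/issues", "gaze/pulls/abc"} {
		repo, kind, n, err := parseID(id)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(repo, kind, n)
	}
}
```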
func (*GitHubSource) List ¶
func (gs *GitHubSource) List() ([]Document, error)
List returns all documents from configured GitHub repositories by fetching issues, pull requests, and/or READMEs based on the configured content types. If a rate limit is hit, returns the documents fetched so far without an error (partial result). Updates status and lastFetched timestamp. Returns an error only if a non-rate-limit API failure occurs on a repository (logged as warning, fetch continues with other repos).
func (*GitHubSource) Meta ¶
func (gs *GitHubSource) Meta() SourceMetadata
Meta returns metadata about this GitHub source, including its ID, type ("github"), name, current status, any error message from the last fetch, and the last fetch timestamp.
type Manager ¶
type Manager struct {
// contains filtered or unexported fields
}
Manager orchestrates fetching across all configured content sources. It checks refresh intervals, handles source failures gracefully (log warning, continue with others per FR-020), and reports summaries.
func NewManager ¶
func NewManager(configs []SourceConfig, basePath, cacheDir string) *Manager
NewManager creates a Manager from source configurations, instantiating the appropriate Source implementation for each config entry (disk, github, or web). Unknown source types are logged as warnings and skipped. The basePath is used as the default directory for disk sources, and cacheDir is used for web source caching.
Returns a Manager ready for Manager.FetchAll calls.
func (*Manager) FetchAll ¶
func (m *Manager) FetchAll(sourceName string, force bool, lastFetchedTimes map[string]time.Time) (*FetchResult, map[string][]Document)
FetchAll fetches content from all configured sources and returns the aggregate result along with a map of source ID → fetched documents. If sourceName is non-empty, only that source is fetched. If force is true, refresh intervals are ignored and all sources are fetched regardless of when they were last refreshed.
Source failures are non-fatal — each failure is logged as a warning and the fetch continues with remaining sources (FR-020). The returned FetchResult contains per-source summaries including document counts, error counts, and skip counts.
type Source ¶
type Source interface {
// List returns all documents available from this source.
List() ([]Document, error)
// Fetch retrieves a single document by its source-specific identifier.
Fetch(id string) (*Document, error)
// Diff returns changes since the last fetch, enabling incremental indexing.
// Returns nil if the source does not support incremental updates.
Diff() ([]Change, error)
// Meta returns metadata about this source (type, name, status).
Meta() SourceMetadata
}
Source represents a pluggable content origin. Implementations fetch documents from a specific backend (disk, GitHub API, web crawl) and support incremental updates via the Diff method.
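A minimal implementation of the interface could look like the sketch below. The types are local copies trimmed for brevity, and memSource with its "memory" type string is hypothetical; it is not one of the package's implementations.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Local copies of the documented types, trimmed to what this sketch needs.
type ChangeType string

type Document struct {
	ID, Title, Content, SourceID string
	FetchedAt                    time.Time
}

type Change struct {
	Type     ChangeType
	Document *Document
	ID       string
}

type SourceMetadata struct {
	ID, Type, Name, Status string
	LastFetchedAt          time.Time
}

type Source interface {
	List() ([]Document, error)
	Fetch(id string) (*Document, error)
	Diff() ([]Change, error)
	Meta() SourceMetadata
}

// memSource is a hypothetical in-memory source keyed by document ID.
type memSource struct {
	id, name    string
	pages       map[string]string
	lastFetched time.Time
}

func (m *memSource) doc(id string) Document {
	return Document{ID: id, Title: id, Content: m.pages[id], SourceID: m.id, FetchedAt: m.lastFetched}
}

func (m *memSource) List() ([]Document, error) {
	ids := make([]string, 0, len(m.pages))
	for id := range m.pages {
		ids = append(ids, id)
	}
	sort.Strings(ids) // map order is random; sort for stable output
	m.lastFetched = time.Now()
	docs := make([]Document, 0, len(ids))
	for _, id := range ids {
		docs = append(docs, m.doc(id))
	}
	return docs, nil
}

func (m *memSource) Fetch(id string) (*Document, error) {
	if _, ok := m.pages[id]; !ok {
		return nil, fmt.Errorf("document %q not found", id)
	}
	d := m.doc(id)
	return &d, nil
}

// Diff returns nil because this source does not track changes.
func (m *memSource) Diff() ([]Change, error) { return nil, nil }

func (m *memSource) Meta() SourceMetadata {
	return SourceMetadata{ID: m.id, Type: "memory", Name: m.name, Status: "active", LastFetchedAt: m.lastFetched}
}

var _ Source = (*memSource)(nil) // compile-time interface check

func main() {
	var s Source = &memSource{id: "mem-1", name: "Scratch", pages: map[string]string{"b.md": "two", "a.md": "one"}}
	docs, _ := s.List()
	for _, d := range docs {
		fmt.Println(d.ID, d.Content)
	}
	fmt.Println(s.Meta().Status)
}
```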
type SourceConfig ¶
type SourceConfig struct {
ID string `yaml:"id"`
Type string `yaml:"type"`
Name string `yaml:"name"`
Config map[string]any `yaml:"config"`
RefreshInterval string `yaml:"refresh_interval,omitempty"`
}
SourceConfig represents a single source entry from .dewey/sources.yaml.
func LoadSourcesConfig ¶
func LoadSourcesConfig(path string) ([]SourceConfig, error)
LoadSourcesConfig reads and parses the sources configuration file at the given path (typically .dewey/sources.yaml). Returns (nil, nil) if the file does not exist. Validates each source entry for required fields and type-specific configuration.
Returns an error if the file cannot be read, the YAML is malformed, or any source entry fails validation.
type SourceMetadata ¶
type SourceMetadata struct {
// ID is the unique source identifier (e.g., "disk-local", "github-gaze").
ID string
// Type is the source type (e.g., "disk", "github", "web").
Type string
// Name is the human-readable source name.
Name string
// Status is the current source status ("active", "error", "disabled").
Status string
// ErrorMessage contains the last error if Status is "error".
ErrorMessage string
// LastFetchedAt is when the source was last successfully fetched.
LastFetchedAt time.Time
// RefreshInterval is how often the source should be re-fetched.
RefreshInterval string
}
SourceMetadata describes a content source's identity and status.
type SourcesFile ¶
type SourcesFile struct {
Sources []SourceConfig `yaml:"sources"`
}
SourcesFile represents the top-level structure of .dewey/sources.yaml.
type WebSource ¶
type WebSource struct {
// contains filtered or unexported fields
}
WebSource implements the Source interface for web crawl content. It fetches HTML pages, converts them to plain text via k3a/html2text, and respects robots.txt directives and rate limits.
Safety constraints (FR-017a/b/c):
- Only http:// and https:// schemes allowed
- Max 1MB response body per page
- Max 100 pages per source
- Follow redirects within same domain only
- Respect robots.txt
- Configurable rate limiting (default: 1s between requests)
func NewWebSource ¶
func NewWebSource(id, name string, urls []string, depth int, rateLimit time.Duration, cacheDir string) *WebSource
NewWebSource creates a WebSource for the given seed URLs with the specified crawl depth and rate limit. Negative depth is clamped to 0 (no crawling beyond seed URLs). Non-positive rateLimit defaults to 1 second between requests. If cacheDir is non-empty, fetched documents are cached to disk.
Returns a ready-to-use source configured with same-domain-only redirect policy (FR-017c) and robots.txt compliance.
func (*WebSource) Diff ¶
func (ws *WebSource) Diff() ([]Change, error)
Diff returns changes since the last fetch. Web sources don't support incremental updates — every fetch is a full crawl, so all documents are reported as ChangeModified. Returns an error if the underlying List call fails.
func (*WebSource) Fetch ¶
func (ws *WebSource) Fetch(id string) (*Document, error)
Fetch retrieves a single document by URL. Checks the disk cache first if cacheDir is configured; falls back to an HTTP fetch. Returns the document with HTML converted to plain text. Returns an error if the page cannot be fetched or has a non-HTML content type.
func (*WebSource) List ¶
func (ws *WebSource) List() ([]Document, error)
List returns all documents from configured web URLs by crawling each seed URL up to the configured depth. Validates URL schemes (http/https only, FR-017a), enforces the max pages per source limit (FR-017b), and respects robots.txt directives. Caches documents to disk if cacheDir is configured. Returns an empty slice (not an error) if all URLs are invalid or blocked. Updates source status and lastFetched timestamp.
func (*WebSource) Meta ¶
func (ws *WebSource) Meta() SourceMetadata
Meta returns metadata about this web source, including its ID, type ("web"), name, current status, any error message from the last fetch, and the last fetch timestamp.