source

package
v1.5.0
Published: Mar 30, 2026 License: MIT Imports: 18 Imported by: 0

Documentation

Overview

Package source defines the pluggable content source interface for Dewey and provides implementations of it. Sources supply documents from various origins (local disk, GitHub, web crawl) that are indexed into the knowledge graph.

Constants

This section is empty.

Variables

This section is empty.

Functions

func ParseRefreshInterval

func ParseRefreshInterval(interval string) (time.Duration, error)

ParseRefreshInterval converts a refresh interval string to a time.Duration. Supports named intervals ("daily" = 24h, "weekly" = 168h, "hourly" = 1h) and Go duration strings (e.g., "1h", "30m"). Returns 0 for an empty string (no refresh interval). Returns an error if the string is not a recognized named interval and cannot be parsed as a Go duration.
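
The parsing rules above can be sketched with the standard library alone. The function below, parseRefreshInterval, is a hypothetical stand-in written for illustration; it mirrors the documented behavior but is not the package's implementation, and the error wording is an assumption.

```go
package main

import (
	"fmt"
	"time"
)

// parseRefreshInterval is a hypothetical stand-in that mirrors the documented
// behavior of ParseRefreshInterval. It is not the package's own code.
func parseRefreshInterval(interval string) (time.Duration, error) {
	switch interval {
	case "":
		return 0, nil // empty string: no refresh interval
	case "hourly":
		return time.Hour, nil
	case "daily":
		return 24 * time.Hour, nil
	case "weekly":
		return 168 * time.Hour, nil
	}
	// Anything else must be a Go duration string such as "1h" or "30m".
	d, err := time.ParseDuration(interval)
	if err != nil {
		return 0, fmt.Errorf("invalid refresh interval %q: %w", interval, err)
	}
	return d, nil
}

func main() {
	for _, s := range []string{"", "daily", "weekly", "30m", "fortnightly"} {
		d, err := parseRefreshInterval(s)
		fmt.Printf("%q -> %v (err: %v)\n", s, d, err)
	}
}
```

Checking the named intervals before calling time.ParseDuration keeps words like "daily" from being reported as malformed durations.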

func SaveSourcesConfig

func SaveSourcesConfig(path string, sources []SourceConfig) error

SaveSourcesConfig writes the sources configuration to the given path (typically .dewey/sources.yaml) with a descriptive YAML header comment. Overwrites any existing file at the path.

Returns an error if the configuration cannot be marshaled to YAML or the file cannot be written.

func SetLogLevel added in v1.2.0

func SetLogLevel(level log.Level)

SetLogLevel sets the logging level for the source package. Use log.DebugLevel for verbose output during diagnostics.

func SetLogOutput added in v1.3.1

func SetLogOutput(w io.Writer, level log.Level)

SetLogOutput replaces the source package logger with one that writes to the given writer at the given level. Used to enable file logging.

Types

type Change

type Change struct {
	// Type indicates the kind of change.
	Type ChangeType

	// Document is the changed document (nil for deletions).
	Document *Document

	// ID is the document identifier (always set, even for deletions).
	ID string
}

Change represents a modification detected by a source's Diff method.

type ChangeType

type ChangeType string

ChangeType enumerates the kinds of changes a source can report.

const (
	// ChangeAdded indicates a new document was added.
	ChangeAdded ChangeType = "added"

	// ChangeModified indicates an existing document was modified.
	ChangeModified ChangeType = "modified"

	// ChangeDeleted indicates a document was removed.
	ChangeDeleted ChangeType = "deleted"
)

type DiskSource

type DiskSource struct {
	// contains filtered or unexported fields
}

DiskSource implements the Source interface for local Markdown files. It scans a directory for .md files and uses content hashing (SHA-256) for change detection, matching the VaultStore pattern.

func NewDiskSource

func NewDiskSource(id, name, basePath string, opts ...DiskSourceOption) *DiskSource

NewDiskSource creates a DiskSource for the given directory path. Returns a ready-to-use source with an empty stored hashes map. Call DiskSource.SetStoredHashes before DiskSource.Diff to enable incremental change detection.

Options are applied after defaults. The default configuration is recursive=true with no extra ignore patterns. The variadic opts parameter ensures backward compatibility — existing callers that pass zero options continue to work unchanged.

func (*DiskSource) Diff

func (d *DiskSource) Diff() ([]Change, error)

Diff returns changes since the last fetch by comparing current file hashes against stored hashes. Uses the same SHA-256 algorithm as VaultStore. Returns a slice of changes categorized as ChangeAdded, ChangeModified, or ChangeDeleted. Returns an error if the directory walk fails.

Decomposed into walkDiskFiles (directory scan) and diffFileChanges (hash comparison) to keep each function under cyclomatic complexity 10.
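
As a rough illustration of hash-based change detection, the sketch below compares a map of stored hashes against a map of current hashes. The names hashContent and diffHashes, and the trimmed-down change type, are hypothetical and exist only for this example; the real Diff walks the directory and returns the package's Change values.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

type changeType string

const (
	added    changeType = "added"
	modified changeType = "modified"
	deleted  changeType = "deleted"
)

// change is a simplified, hypothetical version of the documented Change type.
type change struct {
	Type changeType
	ID   string
}

// hashContent returns the SHA-256 hex digest of the content.
func hashContent(content []byte) string {
	sum := sha256.Sum256(content)
	return hex.EncodeToString(sum[:])
}

// diffHashes classifies each path by comparing stored and current digests.
// Both maps are keyed by relative file path.
func diffHashes(stored, current map[string]string) []change {
	var changes []change
	for id, h := range current {
		old, ok := stored[id]
		switch {
		case !ok:
			changes = append(changes, change{Type: added, ID: id})
		case old != h:
			changes = append(changes, change{Type: modified, ID: id})
		}
	}
	for id := range stored {
		if _, ok := current[id]; !ok {
			changes = append(changes, change{Type: deleted, ID: id})
		}
	}
	// Sort by ID so the result does not depend on map iteration order.
	sort.Slice(changes, func(i, j int) bool { return changes[i].ID < changes[j].ID })
	return changes
}

func main() {
	stored := map[string]string{
		"a.md": hashContent([]byte("alpha")),
		"b.md": hashContent([]byte("beta")),
	}
	current := map[string]string{
		"a.md": hashContent([]byte("alpha, edited")),
		"c.md": hashContent([]byte("gamma")),
	}
	for _, c := range diffHashes(stored, current) {
		fmt.Println(c.Type, c.ID)
	}
}
```

With an empty stored map every current file falls into the "added" branch, which matches the documented behavior when SetStoredHashes has not been called.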

func (*DiskSource) Fetch

func (d *DiskSource) Fetch(id string) (*Document, error)

Fetch retrieves a single document by its relative file path (e.g., "subfolder/page.md"). Returns the document with computed SHA-256 content hash. Returns an error if the file cannot be read.

func (*DiskSource) List

func (d *DiskSource) List() ([]Document, error)

List returns all .md files in the source directory as Documents, skipping ignored entries (hidden directories, .gitignore patterns, and configured ignore patterns) and unreadable files. Updates the source's lastFetched timestamp on success. Returns an error if the directory walk itself fails or the ignore matcher cannot be constructed.

func (*DiskSource) Meta

func (d *DiskSource) Meta() SourceMetadata

Meta returns metadata about this disk source, including its ID, type ("disk"), name, status, and last fetch timestamp.

func (*DiskSource) SetStoredHashes

func (d *DiskSource) SetStoredHashes(hashes map[string]string)

SetStoredHashes sets the previously known content hashes for change detection. The hashes map is keyed by relative file path with SHA-256 hex digest values. Call this before DiskSource.Diff to enable incremental updates; without stored hashes, Diff reports all files as added.

type DiskSourceOption added in v1.5.0

type DiskSourceOption func(*DiskSource)

DiskSourceOption configures optional behavior for a DiskSource. Use With* constructors to create options.

func WithIgnorePatterns added in v1.5.0

func WithIgnorePatterns(patterns []string) DiskSourceOption

WithIgnorePatterns returns a DiskSourceOption that sets additional gitignore-compatible patterns for the DiskSource. These patterns are merged with any .gitignore file found in the source's base directory (union merge semantics per FR-005).

func WithRecursive added in v1.5.0

func WithRecursive(recursive bool) DiskSourceOption

WithRecursive returns a DiskSourceOption that controls whether subdirectories are traversed during List and Diff walks. When false, only files in the base directory are included.
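
The With* constructors follow the functional options pattern. The sketch below shows how such options are typically applied after defaults; the lowercase types and fields (diskSource, recursive, ignore) are hypothetical, since the real DiskSource fields are unexported and not documented here.

```go
package main

import "fmt"

// diskSource is a hypothetical stand-in for DiskSource with illustrative fields.
type diskSource struct {
	basePath  string
	recursive bool
	ignore    []string
}

// option mirrors the shape of DiskSourceOption: a function that mutates the source.
type option func(*diskSource)

func withRecursive(recursive bool) option {
	return func(d *diskSource) { d.recursive = recursive }
}

func withIgnorePatterns(patterns []string) option {
	return func(d *diskSource) { d.ignore = append(d.ignore, patterns...) }
}

// newDiskSource sets the defaults first, then applies options in order.
func newDiskSource(basePath string, opts ...option) *diskSource {
	d := &diskSource{basePath: basePath, recursive: true}
	for _, opt := range opts {
		opt(d)
	}
	return d
}

func main() {
	plain := newDiskSource("notes")
	tuned := newDiskSource("notes",
		withRecursive(false),
		withIgnorePatterns([]string{"drafts/", "*.tmp.md"}),
	)
	fmt.Println(plain.recursive, len(plain.ignore))
	fmt.Println(tuned.recursive, tuned.ignore)
}
```

Because the options are variadic, a call that passes none still compiles and gets the defaults, which is the backward-compatibility property described for NewDiskSource.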

type Document

type Document struct {
	// ID is the source-specific document identifier (e.g., file path, issue number).
	ID string

	// Title is the human-readable document title.
	Title string

	// Content is the raw text content of the document.
	Content string

	// ContentHash is a hash of the content for change detection.
	ContentHash string

	// SourceID identifies which source produced this document.
	SourceID string

	// OriginURL is the original URL for external sources (empty for disk).
	OriginURL string

	// FetchedAt is when this document was last fetched.
	FetchedAt time.Time

	// Properties holds source-specific metadata (e.g., GitHub labels, web crawl depth).
	Properties map[string]any
}

Document represents a content item fetched from a source. Documents are transient — they are converted to Pages, Blocks, and Links during indexing and are not persisted directly.

type FetchResult

type FetchResult struct {
	Summaries []FetchSummary
	TotalDocs int
	TotalErrs int
	TotalSkip int
}

FetchResult is the aggregate result of fetching all sources.

func (*FetchResult) FormatSummary

func (r *FetchResult) FormatSummary() string

FormatSummary returns a human-readable, multi-line summary of the fetch result including per-source status (documents fetched, errors, skips) and aggregate totals.

type FetchSummary

type FetchSummary struct {
	SourceID   string
	SourceType string
	Documents  int
	Errors     int
	Skipped    bool
	Error      string
}

FetchSummary reports the results of a fetch operation.

type GitHubSource

type GitHubSource struct {
	// contains filtered or unexported fields
}

GitHubSource implements the Source interface for GitHub repositories. It fetches issues, pull requests, and READMEs via the GitHub REST API.

Token precedence (FR-015a):

  1. GITHUB_TOKEN or GH_TOKEN environment variable
  2. `gh auth token` subprocess if gh CLI is available
  3. Unauthenticated access (60 req/hr rate limit)

Tokens are NEVER logged or persisted (FR-015b).
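
The precedence chain can be sketched as follows. resolveToken and ghAuthToken are hypothetical helpers written for this example, not the package's code; the environment lookup and the gh call are passed in as functions so that the ordering can be exercised without touching the real environment.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// resolveToken applies the documented order: GITHUB_TOKEN, then GH_TOKEN,
// then the gh CLI, then unauthenticated access. It returns the token and a
// label describing where it came from. Hypothetical helper for illustration.
func resolveToken(getenv func(string) string, ghToken func() (string, error)) (token, origin string) {
	for _, name := range []string{"GITHUB_TOKEN", "GH_TOKEN"} {
		if v := strings.TrimSpace(getenv(name)); v != "" {
			return v, name
		}
	}
	if out, err := ghToken(); err == nil {
		if v := strings.TrimSpace(out); v != "" {
			return v, "gh auth token"
		}
	}
	return "", "unauthenticated"
}

// ghAuthToken runs `gh auth token` if the gh CLI is on PATH.
func ghAuthToken() (string, error) {
	path, err := exec.LookPath("gh")
	if err != nil {
		return "", err
	}
	out, err := exec.Command(path, "auth", "token").Output()
	return string(out), err
}

func main() {
	_, origin := resolveToken(os.Getenv, ghAuthToken)
	// Only the origin is printed; the token itself is never logged.
	fmt.Println("token origin:", origin)
}
```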

func NewGitHubSource

func NewGitHubSource(id, name, org string, repos, contentTypes []string) *GitHubSource

NewGitHubSource creates a GitHubSource for the given organization and repositories. If contentTypes is empty, it defaults to ["issues", "pulls", "readme"]. The GitHub token is resolved at creation time using the precedence chain: GITHUB_TOKEN env → GH_TOKEN env → `gh auth token` subprocess → unauthenticated (60 req/hr limit).

Returns a ready-to-use source. The token is held in memory only and never persisted or logged (FR-015b).

func (*GitHubSource) Diff

func (gs *GitHubSource) Diff() ([]Change, error)

Diff returns changes since the last fetch. The GitHub source reports every current item as ChangeModified because the API does not support efficient diff detection. Returns an error if the underlying List call fails.

func (*GitHubSource) Fetch

func (gs *GitHubSource) Fetch(id string) (*Document, error)

Fetch retrieves a single document by its source-specific ID. The ID format is "repo/type/number" (e.g., "gaze/issues/42"). Returns the document with full content and metadata. Returns an error if the ID format is invalid or the GitHub API request fails.
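
A minimal sketch of validating the "repo/type/number" form is shown below. parseDocID is a hypothetical helper; it covers only the numbered form given in the example above, and how IDs without a number (for instance READMEs) are represented is not documented here, so the sketch rejects them.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseDocID splits an ID of the form "repo/type/number".
// Hypothetical helper; the package's real parsing may differ.
func parseDocID(id string) (repo, kind string, number int, err error) {
	parts := strings.Split(id, "/")
	if len(parts) != 3 || parts[0] == "" || parts[1] == "" {
		return "", "", 0, fmt.Errorf("invalid document ID %q: want repo/type/number", id)
	}
	number, err = strconv.Atoi(parts[2])
	if err != nil || number <= 0 {
		return "", "", 0, fmt.Errorf("invalid document number in %q", id)
	}
	return parts[0], parts[1], number, nil
}

func main() {
	repo, kind, number, err := parseDocID("gaze/issues/42")
	fmt.Println(repo, kind, number, err)

	_, _, _, err = parseDocID("gaze/issues")
	fmt.Println(err)
}
```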

func (*GitHubSource) List

func (gs *GitHubSource) List() ([]Document, error)

List returns all documents from configured GitHub repositories by fetching issues, pull requests, and/or READMEs based on the configured content types. If a rate limit is hit, it returns the documents fetched so far without an error (partial result). A non-rate-limit API failure on a repository is logged as a warning and the fetch continues with the other repos; an error is returned only for such failures. Updates status and lastFetched timestamp.

func (*GitHubSource) Meta

func (gs *GitHubSource) Meta() SourceMetadata

Meta returns metadata about this GitHub source, including its ID, type ("github"), name, current status, any error message from the last fetch, and the last fetch timestamp.

type Manager

type Manager struct {
	// contains filtered or unexported fields
}

Manager orchestrates fetching across all configured content sources. It checks refresh intervals, handles source failures gracefully (log warning, continue with others per FR-020), and reports summaries.

func NewManager

func NewManager(configs []SourceConfig, basePath, cacheDir string) *Manager

NewManager creates a Manager from source configurations, instantiating the appropriate Source implementation for each config entry (disk, github, or web). Unknown source types are logged as warnings and skipped. The basePath is used as the default directory for disk sources, and cacheDir is used for web source caching.

Returns a Manager ready for Manager.FetchAll calls.

func (*Manager) FetchAll

func (m *Manager) FetchAll(sourceName string, force bool, lastFetchedTimes map[string]time.Time) (*FetchResult, map[string][]Document)

FetchAll fetches content from all configured sources and returns the aggregate result along with a map of source ID → fetched documents. If sourceName is non-empty, only that source is fetched. If force is true, refresh intervals are ignored and all sources are fetched regardless of when they were last refreshed.

Source failures are non-fatal — each failure is logged as a warning and the fetch continues with remaining sources (FR-020). The returned FetchResult contains per-source summaries including document counts, error counts, and skip counts.
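
The interaction between force, the last fetch time, and the refresh interval might look like the sketch below. shouldFetch is a hypothetical helper, and two of its rules are assumptions rather than documented behavior: a source that has never been fetched is always due, and a zero interval means the source is fetched on every call.

```go
package main

import (
	"fmt"
	"time"
)

// Sample values used only by this example.
const sampleInterval = time.Hour

var sampleNow = time.Date(2026, time.March, 30, 12, 0, 0, 0, time.UTC)

// shouldFetch reports whether a source is due for a refresh.
// Assumptions: a zero last-fetch time or a non-positive interval means "fetch".
func shouldFetch(last time.Time, interval time.Duration, force bool, now time.Time) bool {
	if force || last.IsZero() || interval <= 0 {
		return true
	}
	return now.Sub(last) >= interval
}

func main() {
	recent := sampleNow.Add(-30 * time.Minute)
	stale := sampleNow.Add(-2 * time.Hour)

	fmt.Println(shouldFetch(recent, sampleInterval, false, sampleNow)) // within the interval
	fmt.Println(shouldFetch(stale, sampleInterval, false, sampleNow))  // past the interval
	fmt.Println(shouldFetch(recent, sampleInterval, true, sampleNow))  // forced
}
```

Passing the current time as a parameter, rather than calling time.Now inside the helper, keeps the check deterministic in tests.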

func (*Manager) Sources

func (m *Manager) Sources() []Source

Sources returns the list of instantiated Source implementations created from the configurations passed to NewManager. Returns nil if no sources were successfully created.

type Source

type Source interface {
	// List returns all documents available from this source.
	List() ([]Document, error)

	// Fetch retrieves a single document by its source-specific identifier.
	Fetch(id string) (*Document, error)

	// Diff returns changes since the last fetch, enabling incremental indexing.
	// Returns nil if the source does not support incremental updates.
	Diff() ([]Change, error)

	// Meta returns metadata about this source (type, name, status).
	Meta() SourceMetadata
}

Source represents a pluggable content origin. Implementations fetch documents from a specific backend (disk, GitHub API, web crawl) and support incremental updates via the Diff method.
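
To show what satisfying the interface involves, the sketch below implements it for an in-memory backend. The Document, Change, and SourceMetadata types are simplified local copies trimmed to a few fields, and memorySource with its "memory" type string is hypothetical; it is not one of the package's source types.

```go
package main

import (
	"fmt"
	"sort"
)

// Simplified local copies of the documented types, for illustration only.
type Document struct{ ID, Title, Content string }

type Change struct {
	Type     string
	Document *Document
	ID       string
}

type SourceMetadata struct{ ID, Type, Name, Status string }

type Source interface {
	List() ([]Document, error)
	Fetch(id string) (*Document, error)
	Diff() ([]Change, error)
	Meta() SourceMetadata
}

// memorySource is a hypothetical backend that serves documents from a map.
type memorySource struct {
	id, name string
	docs     map[string]Document
}

// Compile-time check that memorySource satisfies Source.
var _ Source = (*memorySource)(nil)

func (m *memorySource) List() ([]Document, error) {
	out := make([]Document, 0, len(m.docs))
	for _, d := range m.docs {
		out = append(out, d)
	}
	sort.Slice(out, func(i, j int) bool { return out[i].ID < out[j].ID })
	return out, nil
}

func (m *memorySource) Fetch(id string) (*Document, error) {
	d, ok := m.docs[id]
	if !ok {
		return nil, fmt.Errorf("document %q not found", id)
	}
	return &d, nil
}

// Diff returns nil because this source does not support incremental updates.
func (m *memorySource) Diff() ([]Change, error) { return nil, nil }

func (m *memorySource) Meta() SourceMetadata {
	return SourceMetadata{ID: m.id, Type: "memory", Name: m.name, Status: "active"}
}

func main() {
	var src Source = &memorySource{
		id:   "mem-notes",
		name: "Notes",
		docs: map[string]Document{"a": {ID: "a", Title: "First note"}},
	}
	docs, _ := src.List()
	fmt.Println(len(docs), src.Meta().Type)
}
```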

type SourceConfig

type SourceConfig struct {
	ID              string         `yaml:"id"`
	Type            string         `yaml:"type"`
	Name            string         `yaml:"name"`
	Config          map[string]any `yaml:"config"`
	RefreshInterval string         `yaml:"refresh_interval,omitempty"`
}

SourceConfig represents a single source entry from .dewey/sources.yaml.

func LoadSourcesConfig

func LoadSourcesConfig(path string) ([]SourceConfig, error)

LoadSourcesConfig reads and parses the sources configuration file at the given path (typically .dewey/sources.yaml). Returns (nil, nil) if the file does not exist. Validates each source entry for required fields and type-specific configuration.

Returns an error if the file cannot be read, the YAML is malformed, or any source entry fails validation.
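
The "(nil, nil) if the file does not exist" convention can be sketched with the standard library as below. readIfExists is a hypothetical helper that covers only the file-reading step; YAML parsing and validation are left out, and the error wording is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// readIfExists returns (nil, nil) when the file is missing, so callers can
// treat "no configuration yet" as a normal state. Hypothetical helper.
func readIfExists(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return nil, nil
	}
	if err != nil {
		return nil, fmt.Errorf("read sources config: %w", err)
	}
	return data, nil
}

func main() {
	data, err := readIfExists(".dewey/sources.yaml")
	switch {
	case err != nil:
		fmt.Println("error:", err)
	case data == nil:
		fmt.Println("no sources configured")
	default:
		fmt.Printf("read %d bytes\n", len(data))
	}
}
```

Callers of a function with this contract need to check for a nil result as well as a nil error before using the configuration.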

type SourceMetadata

type SourceMetadata struct {
	// ID is the unique source identifier (e.g., "disk-local", "github-gaze").
	ID string

	// Type is the source type (e.g., "disk", "github", "web").
	Type string

	// Name is the human-readable source name.
	Name string

	// Status is the current source status ("active", "error", "disabled").
	Status string

	// ErrorMessage contains the last error if Status is "error".
	ErrorMessage string

	// LastFetchedAt is when the source was last successfully fetched.
	LastFetchedAt time.Time

	// RefreshInterval is how often the source should be re-fetched.
	RefreshInterval string
}

SourceMetadata describes a content source's identity and status.

type SourcesFile

type SourcesFile struct {
	Sources []SourceConfig `yaml:"sources"`
}

SourcesFile represents the top-level structure of .dewey/sources.yaml.

type WebSource

type WebSource struct {
	// contains filtered or unexported fields
}

WebSource implements the Source interface for web crawl content. It fetches HTML pages, converts them to plain text via k3a/html2text, and respects robots.txt directives and rate limits.

Safety constraints (FR-017a/b/c):

  • Only http:// and https:// schemes allowed
  • Max 1MB response body per page
  • Max 100 pages per source
  • Follow redirects within same domain only
  • Respect robots.txt
  • Configurable rate limiting (default: 1s between requests)
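
Several of these constraints map onto standard library features. The sketch below is one possible arrangement, not the package's implementation: validateScheme, sameHostOnly, and readCapped are hypothetical names, and "same domain" is approximated here by comparing hostnames exactly, which is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"strings"
)

// maxBodyBytes is the 1MB per-page cap from the list above.
const maxBodyBytes = 1 << 20

// validateScheme accepts only http:// and https:// URLs.
func validateScheme(raw string) error {
	u, err := url.Parse(raw)
	if err != nil {
		return err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	return nil
}

// sameHostOnly is an http.Client CheckRedirect function that refuses
// redirects to a host other than the one originally requested.
func sameHostOnly(req *http.Request, via []*http.Request) error {
	if len(via) > 0 && req.URL.Hostname() != via[0].URL.Hostname() {
		return errors.New("redirect leaves the original host")
	}
	return nil
}

// readCapped reads at most maxBodyBytes from r and drops the rest.
func readCapped(r io.Reader) ([]byte, error) {
	return io.ReadAll(io.LimitReader(r, maxBodyBytes))
}

func main() {
	// The client is only constructed here; no request is sent.
	client := &http.Client{CheckRedirect: sameHostOnly}
	_ = client

	fmt.Println(validateScheme("https://example.com/docs"))
	fmt.Println(validateScheme("ftp://example.com/file"))

	body, _ := readCapped(strings.NewReader(strings.Repeat("x", 2<<20)))
	fmt.Println(len(body)) // capped at maxBodyBytes
}
```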

func NewWebSource

func NewWebSource(id, name string, urls []string, depth int, rateLimit time.Duration, cacheDir string) *WebSource

NewWebSource creates a WebSource for the given seed URLs with the specified crawl depth and rate limit. Negative depth is clamped to 0 (no crawling beyond seed URLs). Non-positive rateLimit defaults to 1 second between requests. If cacheDir is non-empty, fetched documents are cached to disk.
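
The argument normalization described above amounts to two small guards. normalizeCrawl is a hypothetical helper shown for illustration only.

```go
package main

import (
	"fmt"
	"time"
)

// normalizeCrawl clamps a negative depth to 0 and replaces a non-positive
// rate limit with the documented 1 second default. Hypothetical helper.
func normalizeCrawl(depth int, rateLimit time.Duration) (int, time.Duration) {
	if depth < 0 {
		depth = 0
	}
	if rateLimit <= 0 {
		rateLimit = time.Second
	}
	return depth, rateLimit
}

func main() {
	fmt.Println(normalizeCrawl(-3, 0))
	fmt.Println(normalizeCrawl(2, 500*time.Millisecond))
}
```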

Returns a ready-to-use source configured with same-domain-only redirect policy (FR-017c) and robots.txt compliance.

func (*WebSource) Diff

func (ws *WebSource) Diff() ([]Change, error)

Diff returns changes since the last fetch. Web sources don't support incremental updates — every fetch is a full crawl, so all documents are reported as ChangeModified. Returns an error if the underlying List call fails.

func (*WebSource) Fetch

func (ws *WebSource) Fetch(id string) (*Document, error)

Fetch retrieves a single document by URL. Checks the disk cache first if cacheDir is configured; falls back to an HTTP fetch. Returns the document with HTML converted to plain text. Returns an error if the page cannot be fetched or has a non-HTML content type.

func (*WebSource) List

func (ws *WebSource) List() ([]Document, error)

List returns all documents from configured web URLs by crawling each seed URL up to the configured depth. Validates URL schemes (http/https only, FR-017a), enforces the max pages per source limit (FR-017b), and respects robots.txt directives. Caches documents to disk if cacheDir is configured. Returns an empty slice (not an error) if all URLs are invalid or blocked. Updates source status and lastFetched timestamp.

func (*WebSource) Meta

func (ws *WebSource) Meta() SourceMetadata

Meta returns metadata about this web source, including its ID, type ("web"), name, current status, any error message from the last fetch, and the last fetch timestamp.
