package store

v0.0.1 Latest
Published: Jan 8, 2026 License: Apache-2.0 Imports: 12 Imported by: 0

Documentation

Index

Constants

const (
	RunStateInProgress = "in_progress" // A crawl is currently running
	RunStatePaused     = "paused"      // Budget hit, more URLs to crawl
	RunStateCompleted  = "completed"   // All URLs crawled or manually completed
)

Run state constants for IncrementalCrawlRun

const (
	CrawlStateInProgress = "in_progress"
	CrawlStatePaused     = "paused" // Deprecated: use RunStatePaused on IncrementalCrawlRun instead
	CrawlStateCompleted  = "completed"
	CrawlStateFailed     = "failed"
)

Crawl state constants. Note: CrawlStatePaused is deprecated; the paused state now lives at the run level.

Variables

This section is empty.

Functions

This section is empty.

Types

type Config

type Config struct {
	ID                     uint   `gorm:"primaryKey"`
	ProjectID              uint   `gorm:"uniqueIndex;not null"`
	Domain                 string `gorm:"not null"`
	JSRenderingEnabled     bool   `gorm:"default:false"`
	InitialWaitMs          int    `gorm:"default:1500"` // Initial wait after page load for JS frameworks to hydrate (in milliseconds)
	ScrollWaitMs           int    `gorm:"default:2000"` // Wait after scrolling to bottom for lazy-loaded content (in milliseconds)
	FinalWaitMs            int    `gorm:"default:1000"` // Final wait before capturing HTML (in milliseconds)
	Parallelism            int    `gorm:"default:5"`
	UserAgent              string `gorm:"type:text;default:'bluesnake/1.0 (+https://snake.blue)'"`
	IncludeSubdomains      bool   `gorm:"default:false"`                                // When true, crawl all subdomains of the project domain
	DiscoveryMechanisms    string `gorm:"type:text;default:'[\"spider\",\"sitemap\"]'"` // JSON array
	SitemapURLs            string `gorm:"type:text"`                                    // JSON array, nullable
	CheckExternalResources bool   `gorm:"default:true"`                                 // When true, validate external resources for broken links
	// Crawler directive configuration (robots.txt, nofollow, noindex, etc.)
	RobotsTxtMode            string `gorm:"default:'respect'"` // "respect", "ignore", or "ignore-report"
	FollowInternalNofollow   bool   `gorm:"default:false"`     // When true, follow links with rel="nofollow" on same domain
	FollowExternalNofollow   bool   `gorm:"default:false"`     // When true, follow links with rel="nofollow" on external domains
	RespectMetaRobotsNoindex bool   `gorm:"default:true"`      // When true, respect <meta name="robots" content="noindex">
	RespectNoindex           bool   `gorm:"default:true"`      // When true, respect X-Robots-Tag: noindex headers
	// Incremental crawling configuration
	IncrementalCrawlingEnabled bool     `gorm:"default:false"` // When true, crawl in chunks and allow resume
	CrawlBudget                int      `gorm:"default:0"`     // Max URLs to crawl per session (0 = unlimited)
	Project                    *Project `gorm:"foreignKey:ProjectID;constraint:OnDelete:CASCADE"`
	CreatedAt                  int64    `gorm:"autoCreateTime"`
	UpdatedAt                  int64    `gorm:"autoUpdateTime"`
}

Config represents the crawl configuration for a domain

func (*Config) GetDiscoveryMechanismsArray

func (c *Config) GetDiscoveryMechanismsArray() []string

GetDiscoveryMechanismsArray deserializes the DiscoveryMechanisms JSON to []string

func (*Config) GetSitemapURLsArray

func (c *Config) GetSitemapURLsArray() []string

GetSitemapURLsArray deserializes the SitemapURLs JSON to []string. Returns nil if empty, which means the defaults are used.

func (*Config) SetDiscoveryMechanismsArray

func (c *Config) SetDiscoveryMechanismsArray(mechanisms []string) error

SetDiscoveryMechanismsArray serializes []string to JSON for DiscoveryMechanisms

func (*Config) SetSitemapURLsArray

func (c *Config) SetSitemapURLsArray(urls []string) error

SetSitemapURLsArray serializes []string to JSON for SitemapURLs
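These accessor pairs are thin wrappers over encoding/json round-trips on the TEXT columns. A minimal, self-contained sketch of the same pattern — the `cfg` type and its methods below are illustrative stand-ins, not the package's actual implementation:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// cfg mirrors the JSON-backed DiscoveryMechanisms TEXT column on Config.
type cfg struct {
	DiscoveryMechanisms string
}

// setMechanisms serializes the slice to JSON, as SetDiscoveryMechanismsArray does.
func (c *cfg) setMechanisms(ms []string) error {
	b, err := json.Marshal(ms)
	if err != nil {
		return err
	}
	c.DiscoveryMechanisms = string(b)
	return nil
}

// getMechanisms deserializes the JSON column back to a slice,
// returning nil when the column is empty or malformed.
func (c *cfg) getMechanisms() []string {
	var ms []string
	if err := json.Unmarshal([]byte(c.DiscoveryMechanisms), &ms); err != nil {
		return nil
	}
	return ms
}

func main() {
	c := &cfg{}
	_ = c.setMechanisms([]string{"spider", "sitemap"})
	fmt.Println(c.DiscoveryMechanisms) // ["spider","sitemap"]
	fmt.Println(c.getMechanisms())     // [spider sitemap]
}
```

Storing the slice as a JSON string keeps the schema simple at the cost of losing SQL-level queryability, which matches the `type:text` column defaults shown in Config.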

type Crawl

type Crawl struct {
	ID             uint                 `gorm:"primaryKey"`
	ProjectID      uint                 `gorm:"not null;index"`
	RunID          *uint                `gorm:"index"` // nullable FK to IncrementalCrawlRun (null for non-incremental crawls)
	CrawlDateTime  int64                `gorm:"not null"`
	CrawlDuration  int64                `gorm:"not null"`
	PagesCrawled   int                  `gorm:"not null"`
	State          string               `gorm:"not null;default:'completed'"` // in_progress, completed, failed
	DiscoveredUrls []DiscoveredUrl      `gorm:"foreignKey:CrawlID;constraint:OnDelete:CASCADE"`
	Run            *IncrementalCrawlRun `gorm:"foreignKey:RunID"`
	CreatedAt      int64                `gorm:"autoCreateTime"`
	UpdatedAt      int64                `gorm:"autoUpdateTime"`
}

Crawl represents a single crawl session for a project

type CrawlHistoryEntry

type CrawlHistoryEntry struct {
	ID            uint // The crawl ID to use for fetching results (latest crawl in run, or the crawl itself)
	ProjectID     uint
	CrawlDateTime int64  // When the crawl/run started
	CrawlDuration int64  // Total duration across all sessions
	PagesCrawled  int    // Total pages across all sessions
	State         string // Current state (from run if incremental, from crawl if standalone)
}

CrawlHistoryEntry represents a unified crawl history entry for the frontend. For runs, this aggregates multiple crawl sessions into one entry. For standalone crawls, this represents a single crawl.
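Assuming the aggregation policy implied by the field comments — earliest session start, summed duration, summed page counts — the fold over a run's sessions might look like this sketch (the `session` type and `aggregate` function are illustrative, not package API):

```go
package main

import "fmt"

// session holds the per-crawl fields relevant to history aggregation.
type session struct {
	CrawlDateTime int64 // session start (unix)
	CrawlDuration int64
	PagesCrawled  int
}

// aggregate folds a run's sessions into one history entry: earliest
// start, summed duration, summed pages. This policy is an assumption
// read off the CrawlHistoryEntry field comments.
func aggregate(sessions []session) (start, duration int64, pages int) {
	for i, s := range sessions {
		if i == 0 || s.CrawlDateTime < start {
			start = s.CrawlDateTime
		}
		duration += s.CrawlDuration
		pages += s.PagesCrawled
	}
	return start, duration, pages
}

func main() {
	start, dur, pages := aggregate([]session{
		{CrawlDateTime: 100, CrawlDuration: 30, PagesCrawled: 500},
		{CrawlDateTime: 200, CrawlDuration: 45, PagesCrawled: 700},
	})
	fmt.Println(start, dur, pages) // 100 75 1200
}
```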

type CrawlQueueItem

type CrawlQueueItem struct {
	ID        uint     `gorm:"primaryKey"`
	ProjectID uint     `gorm:"not null;index:idx_queue_project_visited"`
	URL       string   `gorm:"not null"`
	URLHash   int64    `gorm:"not null;index:idx_queue_hash"` // Stored as int64 for SQLite compatibility
	Source    string   `gorm:"not null"`                      // initial, spider, sitemap, network, resource
	Depth     int      `gorm:"not null;default:0"`
	Visited   bool     `gorm:"not null;default:false;index:idx_queue_project_visited"`
	Project   *Project `gorm:"foreignKey:ProjectID;constraint:OnDelete:CASCADE"`
	CreatedAt int64    `gorm:"autoCreateTime"`
	UpdatedAt int64    `gorm:"autoUpdateTime"`
}

CrawlQueueItem stores URLs discovered for a project across crawl sessions. This is the persistent queue that survives between crawl sessions for incremental crawling.
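The URLHash field is int64 because SQLite's INTEGER columns are signed 64-bit. The package does not document which hash function it uses; the sketch below uses FNV-1a purely as a hypothetical illustration of producing such a value:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// urlHash illustrates producing an int64 URL hash: FNV-1a yields a
// uint64, which is reinterpreted as int64 so SQLite's signed INTEGER
// column can store it. FNV-1a is an assumption here, not necessarily
// the hash the store package actually uses.
func urlHash(url string) int64 {
	h := fnv.New64a()
	h.Write([]byte(url))
	return int64(h.Sum64())
}

func main() {
	fmt.Println(urlHash("https://snake.blue/"))
}
```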

func (CrawlQueueItem) TableName

func (CrawlQueueItem) TableName() string

TableName returns the table name for CrawlQueueItem

type DiscoveredUrl

type DiscoveredUrl struct {
	ID              uint   `gorm:"primaryKey"`
	CrawlID         uint   `gorm:"not null;index"`
	URL             string `gorm:"not null"`
	Visited         bool   `gorm:"not null;default:false;index"` // true = URL was crawled/visited, false = discovered but not visited
	Status          int    `gorm:"not null"`
	Title           string `gorm:"type:text"`
	MetaDescription string `gorm:"type:text"`
	H1              string `gorm:"type:text"`
	H2              string `gorm:"type:text"`
	CanonicalURL    string `gorm:"type:text"`
	WordCount       int    `gorm:"default:0"`
	ContentHash     string `gorm:"type:text;index"`
	Indexable       string `gorm:"not null"`
	ContentType     string `gorm:"type:text"` // MIME type: text/html, image/jpeg, text/css, application/javascript, etc.
	Error           string `gorm:"type:text"`
	CreatedAt       int64  `gorm:"autoCreateTime"`
}

DiscoveredUrl represents a single URL that was discovered during crawling. This includes both URLs that were visited/crawled and URLs that were discovered but not visited (e.g., framework-specific URLs).

type DomainFramework

type DomainFramework struct {
	ID          uint     `gorm:"primaryKey"`
	ProjectID   uint     `gorm:"not null;index:idx_project_domain"`
	Domain      string   `gorm:"not null;index:idx_project_domain;index:idx_domain"`
	Framework   string   `gorm:"not null"`
	ManuallySet bool     `gorm:"default:false"`
	Project     *Project `gorm:"foreignKey:ProjectID;constraint:OnDelete:CASCADE"`
	CreatedAt   int64    `gorm:"autoCreateTime"`
	UpdatedAt   int64    `gorm:"autoUpdateTime"`
}

DomainFramework represents the detected web framework for a specific domain

func (DomainFramework) TableName

func (DomainFramework) TableName() string

TableName returns the table name for DomainFramework. A unique constraint exists on (ProjectID, Domain).

type IncrementalCrawlRun

type IncrementalCrawlRun struct {
	ID        uint     `gorm:"primaryKey"`
	ProjectID uint     `gorm:"not null;index"`
	State     string   `gorm:"not null;default:'in_progress'"` // in_progress, paused, completed
	Project   *Project `gorm:"foreignKey:ProjectID;constraint:OnDelete:CASCADE"`
	Crawls    []Crawl  `gorm:"foreignKey:RunID"`
	CreatedAt int64    `gorm:"autoCreateTime"`
	UpdatedAt int64    `gorm:"autoUpdateTime"`
}

IncrementalCrawlRun groups multiple crawls in a single incremental run. When incremental crawling is enabled, each "run" can contain multiple crawl sessions that together form a complete crawl of the site.
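The three RunState* constants imply a small lifecycle: a running crawl pauses when the budget is hit, a paused run can resume, and either state can be completed. The store does not itself enforce transitions (UpdateRunState accepts any string), so the validator below is purely illustrative:

```go
package main

import "fmt"

// Run state constants, mirroring the package's RunState* values.
const (
	RunStateInProgress = "in_progress"
	RunStatePaused     = "paused"
	RunStateCompleted  = "completed"
)

// validTransitions sketches the lifecycle implied by the constants'
// comments; completed is terminal. This is an assumed policy, not
// something the store enforces.
var validTransitions = map[string][]string{
	RunStateInProgress: {RunStatePaused, RunStateCompleted},
	RunStatePaused:     {RunStateInProgress, RunStateCompleted},
}

func canTransition(from, to string) bool {
	for _, s := range validTransitions[from] {
		if s == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(RunStateInProgress, RunStatePaused)) // true
	fmt.Println(canTransition(RunStateCompleted, RunStatePaused))  // false
}
```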

type PageLink

type PageLink struct {
	ID          uint   `gorm:"primaryKey"`
	CrawlID     uint   `gorm:"not null;index:idx_crawl_source;index:idx_crawl_target"`
	SourceURL   string `gorm:"not null;index:idx_crawl_source"` // Page containing the link
	TargetURL   string `gorm:"not null;index:idx_crawl_target"` // Page being linked to
	LinkType    string `gorm:"not null"`                        // "anchor", "image", "script", etc.
	LinkText    string `gorm:"type:text"`                       // anchor text or alt text
	LinkContext string `gorm:"type:text"`                       // surrounding text context
	IsInternal  bool   `gorm:"not null"`                        // internal vs external
	Status      int    `gorm:"default:0"`                       // HTTP status of target (0 if not crawled)
	Title       string `gorm:"type:text"`                       // Title of target page (if crawled)
	ContentType string `gorm:"type:text"`                       // Content type of target (if crawled)
	Position    string `gorm:"type:text"`                       // Position: "content", "navigation", "header", "footer", "sidebar", "breadcrumbs", "pagination", "unknown"
	DOMPath     string `gorm:"type:text"`                       // Simplified DOM path showing link's location in HTML structure
	URLAction   string `gorm:"type:text;index"`                 // Action: "crawl" (normal), "record" (framework-specific), "skip" (ignored)
	CreatedAt   int64  `gorm:"autoCreateTime"`
}

PageLink represents a link between two pages

type PageLinkData

type PageLinkData struct {
	URL         string
	Type        string
	Text        string
	Context     string
	IsInternal  bool
	Status      int
	Title       string
	ContentType string
	Position    string
	DOMPath     string
	URLAction   string
}

PageLinkData is a simplified structure for passing link data

type Project

type Project struct {
	ID             uint    `gorm:"primaryKey"`
	URL            string  `gorm:"not null"`             // Normalized URL for the project
	Domain         string  `gorm:"uniqueIndex;not null"` // Domain identifier (includes subdomain)
	FaviconPath    string  `gorm:"type:text"`            // Path to cached favicon
	AICrawlerData  string  `gorm:"type:text"`            // JSON data for AI Crawler results
	SSRScreenshot  string  `gorm:"type:text"`            // Path to SSR screenshot
	JSScreenshot   string  `gorm:"type:text"`            // Path to JS-enabled screenshot
	NoJSScreenshot string  `gorm:"type:text"`            // Path to JS-disabled screenshot
	Crawls         []Crawl `gorm:"foreignKey:ProjectID;constraint:OnDelete:CASCADE"`
	CreatedAt      int64   `gorm:"autoCreateTime"`
	UpdatedAt      int64   `gorm:"autoUpdateTime"`
}

Project represents a project (base URL) that can have multiple crawls

type QueueStats

type QueueStats struct {
	Visited int64
	Pending int64
	Total   int64
}

QueueStats contains statistics about the crawl queue.

type Store

type Store struct {
	// contains filtered or unexported fields
}

Store represents the database store

func NewStore

func NewStore() (*Store, error)

NewStore creates a new Store and initializes the database

func NewStoreForTesting

func NewStoreForTesting(dbPath string) (*Store, error)

NewStoreForTesting creates a store with a custom database path (used for testing)

func (*Store) AddAndMarkVisited

func (s *Store) AddAndMarkVisited(projectID uint, url string, urlHash int64, source string) error

AddAndMarkVisited adds a URL to the queue and marks it as visited in one operation. Uses upsert: if URL exists, marks it visited; if not, creates it with visited=true. This ensures ALL crawled URLs are tracked in the queue, regardless of discovery source (spider, sitemap, redirects, etc.).
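The documented upsert semantics can be modeled with an in-memory map standing in for the SQL upsert (`queueItem` and the free function here are illustrative, not package API):

```go
package main

import "fmt"

type queueItem struct {
	URL     string
	Source  string
	Visited bool
}

// addAndMarkVisited models the documented upsert: if the URL already
// exists in the queue it is marked visited (its original Source is
// kept); otherwise a new item is created with Visited=true.
func addAndMarkVisited(queue map[string]*queueItem, url, source string) {
	if item, ok := queue[url]; ok {
		item.Visited = true
		return
	}
	queue[url] = &queueItem{URL: url, Source: source, Visited: true}
}

func main() {
	queue := map[string]*queueItem{
		"https://snake.blue/a": {URL: "https://snake.blue/a", Source: "sitemap"},
	}
	addAndMarkVisited(queue, "https://snake.blue/a", "spider") // existing: marked visited
	addAndMarkVisited(queue, "https://snake.blue/b", "spider") // new: created visited
	fmt.Println(queue["https://snake.blue/a"].Visited, queue["https://snake.blue/a"].Source) // true sitemap
	fmt.Println(queue["https://snake.blue/b"].Visited)                                       // true
}
```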

func (*Store) AddSingleToQueue

func (s *Store) AddSingleToQueue(projectID uint, url string, urlHash int64, source string, depth int) error

AddSingleToQueue adds a single URL to the crawl queue. Uses upsert to handle duplicates.

func (*Store) AddToQueue

func (s *Store) AddToQueue(projectID uint, items []CrawlQueueItem) error

AddToQueue adds URLs to the crawl queue for a project. Uses upsert to handle duplicates - existing URLs are not modified. Batches inserts to avoid SQLite "too many SQL variables" error.
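Keeping a bulk INSERT under SQLite's bound-variable limit (historically 999 variables per statement, so about 999 / columns-per-row rows per batch) reduces to chunking the slice. An illustrative generic helper, not the package's internal code:

```go
package main

import "fmt"

// batch splits items into chunks of at most size elements — the usual
// way to keep a bulk INSERT under SQLite's bound-variable limit.
func batch[T any](items []T, size int) [][]T {
	var out [][]T
	for size < len(items) {
		out = append(out, items[:size])
		items = items[size:]
	}
	if len(items) > 0 {
		out = append(out, items)
	}
	return out
}

func main() {
	urls := make([]string, 10)
	for i := range urls {
		urls[i] = fmt.Sprintf("https://snake.blue/page/%d", i)
	}
	chunks := batch(urls, 4)
	fmt.Println(len(chunks)) // 3 (chunks of 4, 4, and 2)
}
```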

func (*Store) ClearQueue

func (s *Store) ClearQueue(projectID uint) error

ClearQueue removes all URLs from the crawl queue for a project. Used when starting a fresh crawl or disabling incremental crawling.

func (*Store) CreateCrawl

func (s *Store) CreateCrawl(projectID uint, crawlDateTime int64, crawlDuration int64, pagesCrawled int) (*Crawl, error)

CreateCrawl creates a new crawl for a project

func (*Store) CreateCrawlWithRun

func (s *Store) CreateCrawlWithRun(projectID uint, runID uint) (*Crawl, error)

CreateCrawlWithRun creates a new crawl associated with an incremental run

func (*Store) CreateCrawlWithState

func (s *Store) CreateCrawlWithState(projectID uint, crawlDateTime int64, crawlDuration int64, pagesCrawled int, state string) (*Crawl, error)

CreateCrawlWithState creates a new crawl for a project with a specified initial state

func (*Store) CreateIncrementalRun

func (s *Store) CreateIncrementalRun(projectID uint) (*IncrementalCrawlRun, error)

CreateIncrementalRun creates a new incremental crawl run for a project

func (*Store) DB

func (s *Store) DB() *gorm.DB

DB returns the underlying GORM database instance

func (*Store) DeleteCrawl

func (s *Store) DeleteCrawl(crawlID uint) error

DeleteCrawl deletes a crawl and all its crawled URLs (cascade)

func (*Store) DeleteDomainFramework

func (s *Store) DeleteDomainFramework(projectID uint, domain string) error

DeleteDomainFramework deletes a domain framework entry

func (*Store) DeleteProject

func (s *Store) DeleteProject(projectID uint) error

DeleteProject deletes a project and all its crawls (cascade)

func (*Store) GetActiveCrawlStatsAggregated

func (s *Store) GetActiveCrawlStatsAggregated(crawlID uint) (map[string]int, error)

GetActiveCrawlStatsAggregated gets statistics aggregated across a run if applicable

func (*Store) GetActiveOrPausedRun

func (s *Store) GetActiveOrPausedRun(projectID uint) (*IncrementalCrawlRun, error)

GetActiveOrPausedRun returns the most recent in-progress or paused run for a project

func (*Store) GetAllDomainFrameworks

func (s *Store) GetAllDomainFrameworks(projectID uint) ([]DomainFramework, error)

GetAllDomainFrameworks gets all frameworks for a project

func (*Store) GetAllProjects

func (s *Store) GetAllProjects() ([]Project, error)

GetAllProjects returns all projects with their latest crawl info

func (*Store) GetCrawlByID

func (s *Store) GetCrawlByID(id uint) (*Crawl, error)

GetCrawlByID gets a crawl by ID

func (*Store) GetCrawlHistory

func (s *Store) GetCrawlHistory(projectID uint) ([]CrawlHistoryEntry, error)

GetCrawlHistory returns a deduplicated crawl history for a project. Runs are aggregated into single entries, standalone crawls are returned as-is.

func (*Store) GetCrawlResultsPaginatedAggregated

func (s *Store) GetCrawlResultsPaginatedAggregated(crawlID uint, limit int, cursor uint, contentTypeFilter string) ([]DiscoveredUrl, uint, bool, error)

GetCrawlResultsPaginatedAggregated gets paginated discovered URLs, aggregating across a run if applicable
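The (results, next cursor, hasMore) return shape suggests the usual cursor-pagination loop: start with cursor 0 and feed each returned cursor back until hasMore is false. A self-contained sketch, with a mock fetch over pretend URL IDs in place of the store call:

```go
package main

import "fmt"

// fetch mocks the (results, nextCursor, hasMore) shape of
// GetCrawlResultsPaginatedAggregated over 7 pretend URL IDs.
func fetch(cursor uint, limit int) (results []uint, next uint, hasMore bool) {
	const total = 7
	for id := cursor + 1; id <= total && len(results) < limit; id++ {
		results = append(results, id)
	}
	if len(results) > 0 {
		next = results[len(results)-1]
	}
	return results, next, next < total
}

func main() {
	// Cursor pagination: start at 0, feed each returned cursor back
	// until hasMore reports false.
	var cursor uint
	pages := 0
	for {
		results, next, more := fetch(cursor, 3)
		if len(results) == 0 {
			break
		}
		pages++
		fmt.Println(results)
		if !more {
			break
		}
		cursor = next
	}
	fmt.Println(pages) // 3
}
```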

func (*Store) GetDomainFramework

func (s *Store) GetDomainFramework(projectID uint, domain string) (*DomainFramework, error)

GetDomainFramework gets the framework for a specific domain in a project

func (*Store) GetInProgressRun

func (s *Store) GetInProgressRun(projectID uint) (*IncrementalCrawlRun, error)

GetInProgressRun returns the in-progress run for a project, if any

func (*Store) GetIncrementalConfig

func (s *Store) GetIncrementalConfig(projectID uint) (enabled bool, budget int, err error)

GetIncrementalConfig retrieves just the incremental crawling settings for a project

func (*Store) GetLatestCrawl

func (s *Store) GetLatestCrawl(projectID uint) (*Crawl, error)

GetLatestCrawl gets the most recent crawl for a project

func (*Store) GetOrCreateConfig

func (s *Store) GetOrCreateConfig(projectID uint, domain string) (*Config, error)

GetOrCreateConfig retrieves the config for a project or creates one with defaults

func (*Store) GetOrCreateProject

func (s *Store) GetOrCreateProject(urlStr string, domain string) (*Project, error)

GetOrCreateProject gets or creates a project by domain

func (*Store) GetPageLinks

func (s *Store) GetPageLinks(crawlID uint, pageURL string) (inlinks []PageLink, outlinks []PageLink, err error)

GetPageLinks retrieves inbound and outbound links for a specific URL in a crawl

func (*Store) GetPausedRun

func (s *Store) GetPausedRun(projectID uint) (*IncrementalCrawlRun, error)

GetPausedRun returns the most recent paused run for a project, if any

func (*Store) GetPendingURLs

func (s *Store) GetPendingURLs(projectID uint) ([]CrawlQueueItem, error)

GetPendingURLs returns all unvisited URLs from the crawl queue for a project. These are URLs that were discovered but not yet crawled.

func (*Store) GetProjectByDomain

func (s *Store) GetProjectByDomain(domain string) (*Project, error)

GetProjectByDomain gets a project by domain

func (*Store) GetProjectByID

func (s *Store) GetProjectByID(id uint) (*Project, error)

GetProjectByID gets a project by ID

func (*Store) GetProjectCrawls

func (s *Store) GetProjectCrawls(projectID uint) ([]Crawl, error)

GetProjectCrawls returns all crawls for a project ordered by date

func (*Store) GetQueueItemByURL

func (s *Store) GetQueueItemByURL(projectID uint, url string) (*CrawlQueueItem, error)

GetQueueItemByURL retrieves a queue item by its URL.

func (*Store) GetQueueStats

func (s *Store) GetQueueStats(projectID uint) (*QueueStats, error)

GetQueueStats returns statistics about the crawl queue for a project.

func (*Store) GetRunByID

func (s *Store) GetRunByID(runID uint) (*IncrementalCrawlRun, error)

GetRunByID returns a run by its ID

func (*Store) GetRunCrawls

func (s *Store) GetRunCrawls(runID uint) ([]Crawl, error)

GetRunCrawls returns all crawls for a run ordered by date

func (*Store) GetRunWithCrawls

func (s *Store) GetRunWithCrawls(runID uint) (*IncrementalCrawlRun, error)

GetRunWithCrawls returns a run with all its crawls preloaded

func (*Store) GetTotalURLsForCrawl

func (s *Store) GetTotalURLsForCrawl(crawlID uint) (int, error)

GetTotalURLsForCrawl returns the total number of discovered URLs for a crawl

func (*Store) GetVisitedURLHashes

func (s *Store) GetVisitedURLHashes(projectID uint) ([]int64, error)

GetVisitedURLHashes returns the hashes of all visited URLs for a project. Used when resuming a crawl to pre-populate the visited set. Returns int64 because SQLite stores hashes as signed integers.
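On resume, those hashes would typically be loaded into an in-memory set before crawling. A sketch (`resumeVisited` is illustrative, not package API):

```go
package main

import "fmt"

// resumeVisited rebuilds the in-memory visited set from the int64
// hashes persisted in the queue (SQLite stores them as signed
// integers, hence []int64 rather than []uint64).
func resumeVisited(hashes []int64) map[int64]struct{} {
	visited := make(map[int64]struct{}, len(hashes))
	for _, h := range hashes {
		visited[h] = struct{}{}
	}
	return visited
}

func main() {
	// Hashes as they might come back from GetVisitedURLHashes.
	visited := resumeVisited([]int64{-42, 7, 999})
	_, seen := visited[7]
	fmt.Println(seen, len(visited)) // true 3
}
```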

func (*Store) HasPendingURLs

func (s *Store) HasPendingURLs(projectID uint) (bool, error)

HasPendingURLs checks if there are any unvisited URLs in the queue.

func (*Store) MarkURLsVisited

func (s *Store) MarkURLsVisited(projectID uint, urls []string) error

MarkURLsVisited marks the given URLs as visited in the queue. Uses URL matching to find and update the records.

func (*Store) SaveDiscoveredUrl

func (s *Store) SaveDiscoveredUrl(crawlID uint, url string, visited bool, status int, title string, metaDescription string, h1 string, h2 string, canonicalURL string, wordCount int, contentHash string, indexable string, contentType string, errorMsg string) error

SaveDiscoveredUrl saves a discovered URL (whether visited or not)

func (*Store) SaveDomainFramework

func (s *Store) SaveDomainFramework(projectID uint, domain string, framework string, manuallySet bool) error

SaveDomainFramework saves or updates a domain framework

func (*Store) SavePageLinks

func (s *Store) SavePageLinks(crawlID uint, sourceURL string, outboundLinks []PageLinkData, inboundLinks []PageLinkData) error

SavePageLinks saves all links from a crawled page

func (*Store) SearchCrawlResultsPaginatedAggregated

func (s *Store) SearchCrawlResultsPaginatedAggregated(crawlID uint, query string, contentTypeFilter string, limit int, cursor uint) ([]DiscoveredUrl, uint, bool, error)

SearchCrawlResultsPaginatedAggregated searches discovered URLs with pagination, aggregating across a run if applicable

func (*Store) UpdateConfig

func (s *Store) UpdateConfig(projectID uint, jsRendering bool, initialWaitMs, scrollWaitMs, finalWaitMs int, parallelism int, userAgent string, includeSubdomains bool, discoveryMechanisms []string, sitemapURLs []string, checkExternalResources bool, robotsTxtMode string, followInternalNofollow, followExternalNofollow, respectMetaRobotsNoindex, respectNoindex bool) error

UpdateConfig updates the configuration for a project

func (*Store) UpdateCrawlState

func (s *Store) UpdateCrawlState(crawlID uint, state string) error

UpdateCrawlState updates the state of a crawl

func (*Store) UpdateCrawlStats

func (s *Store) UpdateCrawlStats(crawlID uint, crawlDuration int64, pagesCrawled int) error

UpdateCrawlStats updates the crawl statistics

func (*Store) UpdateCrawlStatsAndState

func (s *Store) UpdateCrawlStatsAndState(crawlID uint, crawlDuration int64, pagesCrawled int, state string) error

UpdateCrawlStatsAndState updates crawl statistics and state in one operation

func (*Store) UpdateIncrementalConfig

func (s *Store) UpdateIncrementalConfig(projectID uint, enabled bool, budget int) error

UpdateIncrementalConfig updates only the incremental crawling settings for a project

func (*Store) UpdateProject

func (s *Store) UpdateProject(projectID uint, updates map[string]interface{}) error

UpdateProject updates a project with given fields

func (*Store) UpdateRunState

func (s *Store) UpdateRunState(runID uint, state string) error

UpdateRunState updates the state of an incremental run
