Documentation ¶
Index ¶
- func ContainsCookie(cookies []*http.Cookie, name string) bool
- func StringifyCookies(cookies []*http.Cookie) string
- func UnstringifyCookies(s string) []*http.Cookie
- type CrawlerStore
- func (s *CrawlerStore) AddRedirectDestination(finalURL, intermediateURL string)
- func (s *CrawlerStore) Clear()
- func (s *CrawlerStore) CountActions() int
- func (s *CrawlerStore) CountVisited() int
- func (s *CrawlerStore) GetAction(url string) (interface{}, bool)
- func (s *CrawlerStore) GetAndClearRedirectDestinations(finalURL string) ([]string, bool)
- func (s *CrawlerStore) GetMetadata(url string) (interface{}, bool)
- func (s *CrawlerStore) IsVisited(hash uint64) (bool, error)
- func (s *CrawlerStore) PreMarkVisited(hash uint64)
- func (s *CrawlerStore) StoreAction(url string, action interface{})
- func (s *CrawlerStore) StoreMetadata(url string, metadata interface{})
- func (s *CrawlerStore) VisitIfNotVisited(hash uint64) (bool, error)
- type InMemoryStorage
- func (s *InMemoryStorage) Close() error
- func (s *InMemoryStorage) Cookies(u *url.URL) string
- func (s *InMemoryStorage) GetContentHash(url string) (string, error)
- func (s *InMemoryStorage) Init() error
- func (s *InMemoryStorage) IsContentVisited(contentHash string) (bool, error)
- func (s *InMemoryStorage) SetContentHash(url string, contentHash string) error
- func (s *InMemoryStorage) SetCookies(u *url.URL, cookies string)
- func (s *InMemoryStorage) VisitedContent(contentHash string) error
- type Storage
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func ContainsCookie ¶
ContainsCookie reports whether a cookie with the given name is present in cookies.
func StringifyCookies ¶
StringifyCookies serializes a list of http.Cookie values to a string.
func UnstringifyCookies ¶
UnstringifyCookies deserializes a cookie string back into a list of http.Cookie values.
Types ¶
type CrawlerStore ¶
type CrawlerStore struct {
// contains filtered or unexported fields
}
CrawlerStore manages crawler-specific state during a crawl session. This is separate from the Collector's Storage (which handles HTTP-level concerns like cookies and content hashes). CrawlerStore handles crawl orchestration concerns.
func NewCrawlerStore ¶
func NewCrawlerStore() *CrawlerStore
NewCrawlerStore creates a new CrawlerStore instance.
func (*CrawlerStore) AddRedirectDestination ¶
func (s *CrawlerStore) AddRedirectDestination(finalURL, intermediateURL string)
AddRedirectDestination adds a redirect destination to the list for a final URL. For a redirect chain A→B→C, this is called twice:
- first with finalURL=B (since C is not yet known)
- then with finalURL=C, adding B to C's list
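The two-step bookkeeping for a chain A→B→C can be sketched as a map from the currently-known final URL to its list of intermediate URLs. This is a hypothetical reconstruction: the real CrawlerStore's fields are unexported, and the re-keying of B's entries onto C is an assumption about how the caller stitches the chain together (locking is also omitted for brevity).

```go
package main

import "fmt"

// redirects maps a final URL to the intermediate URLs that redirected to it.
// A hypothetical sketch of CrawlerStore's redirect bookkeeping.
type redirects map[string][]string

func (r redirects) addRedirectDestination(finalURL, intermediateURL string) {
	r[finalURL] = append(r[finalURL], intermediateURL)
}

// getAndClearRedirectDestinations returns and removes the intermediates
// recorded for finalURL, plus whether anything was recorded.
func (r redirects) getAndClearRedirectDestinations(finalURL string) ([]string, bool) {
	urls, ok := r[finalURL]
	if !ok {
		return nil, false
	}
	delete(r, finalURL)
	return urls, true
}

func main() {
	r := redirects{}
	// A redirects to B; C is not yet known, so B is treated as final.
	r.addRedirectDestination("https://example.com/B", "https://example.com/A")
	// B then redirects to C: move B's intermediates under C, and B itself
	// joins C's list.
	moved, _ := r.getAndClearRedirectDestinations("https://example.com/B")
	for _, u := range moved {
		r.addRedirectDestination("https://example.com/C", u)
	}
	r.addRedirectDestination("https://example.com/C", "https://example.com/B")

	chain, ok := r.getAndClearRedirectDestinations("https://example.com/C")
	fmt.Println(ok, chain) // true [https://example.com/A https://example.com/B]
}
```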
func (*CrawlerStore) Clear ¶
func (s *CrawlerStore) Clear()
Clear resets all stored data (useful for testing or restarting crawls).
func (*CrawlerStore) CountActions ¶
func (s *CrawlerStore) CountActions() int
CountActions returns the total number of URLs with stored actions.
func (*CrawlerStore) CountVisited ¶
func (s *CrawlerStore) CountVisited() int
CountVisited returns the number of URLs marked as visited.
func (*CrawlerStore) GetAction ¶
func (s *CrawlerStore) GetAction(url string) (interface{}, bool)
GetAction retrieves the action for a URL (returns nil and false if not found).
func (*CrawlerStore) GetAndClearRedirectDestinations ¶
func (s *CrawlerStore) GetAndClearRedirectDestinations(finalURL string) ([]string, bool)
GetAndClearRedirectDestinations retrieves and removes all redirect destinations for a URL. It returns the list of intermediate URLs and true if found, or an empty list and false otherwise.
func (*CrawlerStore) GetMetadata ¶
func (s *CrawlerStore) GetMetadata(url string) (interface{}, bool)
GetMetadata retrieves metadata for a URL (returns nil and false if not found).
func (*CrawlerStore) IsVisited ¶
func (s *CrawlerStore) IsVisited(hash uint64) (bool, error)
IsVisited checks if a URL hash has been visited (read-only check).
func (*CrawlerStore) PreMarkVisited ¶
func (s *CrawlerStore) PreMarkVisited(hash uint64)
PreMarkVisited marks a URL hash as visited without checking first. Used for pre-populating the visited set when resuming a crawl.
func (*CrawlerStore) StoreAction ¶
func (s *CrawlerStore) StoreAction(url string, action interface{})
StoreAction stores the action for a discovered URL (for memoization).
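The memoization that StoreAction and GetAction describe amounts to a mutex-guarded map from URL to an arbitrary action value. A minimal sketch, assuming that internal shape (actionStore and its methods are hypothetical; CrawlerStore's real fields are unexported):

```go
package main

import (
	"fmt"
	"sync"
)

// actionStore is a hypothetical sketch of the action memoization:
// a mutex-guarded map from URL to an arbitrary action value.
type actionStore struct {
	mu      sync.Mutex
	actions map[string]interface{}
}

func newActionStore() *actionStore {
	return &actionStore{actions: make(map[string]interface{})}
}

// storeAction records the action for a URL, overwriting any previous value.
func (s *actionStore) storeAction(url string, action interface{}) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.actions[url] = action
}

// getAction returns the stored action and whether one was found.
func (s *actionStore) getAction(url string) (interface{}, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	a, ok := s.actions[url]
	return a, ok
}

func main() {
	s := newActionStore()
	s.storeAction("https://example.com/page", "crawl")
	if a, ok := s.getAction("https://example.com/page"); ok {
		fmt.Println(a) // crawl
	}
	_, ok := s.getAction("https://example.com/other")
	fmt.Println(ok) // false
}
```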
func (*CrawlerStore) StoreMetadata ¶
func (s *CrawlerStore) StoreMetadata(url string, metadata interface{})
StoreMetadata stores metadata for a crawled page (for link population).
func (*CrawlerStore) VisitIfNotVisited ¶
func (s *CrawlerStore) VisitIfNotVisited(hash uint64) (bool, error)
VisitIfNotVisited atomically checks if a URL hash has been visited and marks it as visited. Returns true if already visited, false if newly visited. This is the critical method for preventing race conditions in URL visit tracking.
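The point of making the check and the mark a single atomic step is that two goroutines racing on the same hash can never both see "not visited". A minimal sketch of that pattern with a mutex-guarded set (visitSet and its method are hypothetical; the real implementation may differ):

```go
package main

import (
	"fmt"
	"sync"
)

// visitSet is a hypothetical sketch of the visited-hash tracking behind
// VisitIfNotVisited. Holding the mutex across both the lookup and the
// insert makes check-and-mark atomic.
type visitSet struct {
	mu      sync.Mutex
	visited map[uint64]struct{}
}

func newVisitSet() *visitSet {
	return &visitSet{visited: make(map[uint64]struct{})}
}

// visitIfNotVisited reports whether hash was already visited, marking it
// as visited if it was not.
func (v *visitSet) visitIfNotVisited(hash uint64) bool {
	v.mu.Lock()
	defer v.mu.Unlock()
	if _, ok := v.visited[hash]; ok {
		return true
	}
	v.visited[hash] = struct{}{}
	return false
}

func main() {
	v := newVisitSet()
	fmt.Println(v.visitIfNotVisited(42)) // false: newly visited
	fmt.Println(v.visitIfNotVisited(42)) // true: already visited
}
```

A naive two-call version (IsVisited followed by a separate mark) would leave a window in which two goroutines both observe "not visited" and crawl the same URL twice; the single locked method closes that window.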
type InMemoryStorage ¶
type InMemoryStorage struct {
// contains filtered or unexported fields
}
InMemoryStorage is the default storage backend of bluesnake. It keeps cookies and content hashes in memory without persisting any data to disk.
func (*InMemoryStorage) Close ¶
func (s *InMemoryStorage) Close() error
Close implements Storage.Close().
func (*InMemoryStorage) Cookies ¶
func (s *InMemoryStorage) Cookies(u *url.URL) string
Cookies implements Storage.Cookies().
func (*InMemoryStorage) GetContentHash ¶
func (s *InMemoryStorage) GetContentHash(url string) (string, error)
GetContentHash implements Storage.GetContentHash().
func (*InMemoryStorage) Init ¶
func (s *InMemoryStorage) Init() error
Init initializes InMemoryStorage.
func (*InMemoryStorage) IsContentVisited ¶
func (s *InMemoryStorage) IsContentVisited(contentHash string) (bool, error)
IsContentVisited implements Storage.IsContentVisited().
func (*InMemoryStorage) SetContentHash ¶
func (s *InMemoryStorage) SetContentHash(url string, contentHash string) error
SetContentHash implements Storage.SetContentHash().
func (*InMemoryStorage) SetCookies ¶
func (s *InMemoryStorage) SetCookies(u *url.URL, cookies string)
SetCookies implements Storage.SetCookies().
func (*InMemoryStorage) VisitedContent ¶
func (s *InMemoryStorage) VisitedContent(contentHash string) error
VisitedContent implements Storage.VisitedContent().
type Storage ¶
type Storage interface {
// Init initializes the storage
Init() error
// Cookies retrieves stored cookies for a given host
Cookies(u *url.URL) string
// SetCookies stores cookies for a given host
SetCookies(u *url.URL, cookies string)
// SetContentHash stores a content hash for a given URL
SetContentHash(url string, contentHash string) error
// GetContentHash retrieves the stored content hash for a given URL
GetContentHash(url string) (string, error)
// IsContentVisited returns true if content with the given hash has been visited
IsContentVisited(contentHash string) (bool, error)
// VisitedContent marks content with the given hash as visited
VisitedContent(contentHash string) error
}
Storage is an interface that handles the Collector's HTTP-level data. The Collector's default Storage is InMemoryStorage; it can be replaced by calling the Collector.SetStorage() function.
NOTE: This is separate from CrawlerStore (which handles crawl orchestration). Storage handles only HTTP client concerns:
- Cookies (HTTP session state)
- Content hashes (duplicate content detection)
Visit tracking has been removed from Storage; the Crawler owns all visit tracking via CrawlerStore.
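The content-hash methods of the Storage contract combine into a duplicate-content check: record each URL's hash, and mark hashes as seen so that a second URL serving identical content is detected. A hedged sketch of that flow (hashStore and its lowercase methods are hypothetical stand-ins for a Storage implementation, with error returns dropped for brevity):

```go
package main

import (
	"fmt"
	"sync"
)

// hashStore is a hypothetical sketch of the content-hash half of the
// Storage contract: per-URL hashes plus a set of hashes already seen.
type hashStore struct {
	mu      sync.Mutex
	byURL   map[string]string
	visited map[string]struct{}
}

func newHashStore() *hashStore {
	return &hashStore{
		byURL:   make(map[string]string),
		visited: make(map[string]struct{}),
	}
}

// setContentHash records the hash observed for a URL.
func (s *hashStore) setContentHash(url, contentHash string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.byURL[url] = contentHash
}

// isContentVisited reports whether content with this hash was seen before.
func (s *hashStore) isContentVisited(contentHash string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	_, ok := s.visited[contentHash]
	return ok
}

// visitedContent marks content with this hash as seen.
func (s *hashStore) visitedContent(contentHash string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.visited[contentHash] = struct{}{}
}

func main() {
	s := newHashStore()
	// First fetch of a page: hash unseen, so record and mark it.
	if !s.isContentVisited("h1") {
		s.setContentHash("https://example.com/a", "h1")
		s.visitedContent("h1")
	}
	// A mirror URL serving identical content yields the same hash
	// and is flagged as a duplicate.
	fmt.Println(s.isContentVisited("h1")) // true
}
```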