storage

package
v0.0.1
Published: Jan 8, 2026 License: Apache-2.0 Imports: 5 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func ContainsCookie

func ContainsCookie(cookies []*http.Cookie, name string) bool

ContainsCookie reports whether a cookie with the given name is present in cookies.

func StringifyCookies

func StringifyCookies(cookies []*http.Cookie) string

StringifyCookies serializes a list of http.Cookie values to a string.

func UnstringifyCookies

func UnstringifyCookies(s string) []*http.Cookie

UnstringifyCookies deserializes a cookie string into a slice of http.Cookie values.

Types

type CrawlerStore

type CrawlerStore struct {
	// contains filtered or unexported fields
}

CrawlerStore manages crawler-specific state during a crawl session. This is separate from the Collector's Storage (which handles HTTP-level concerns like cookies and content hashes). CrawlerStore handles crawl orchestration concerns.

func NewCrawlerStore

func NewCrawlerStore() *CrawlerStore

NewCrawlerStore creates a new CrawlerStore instance

func (*CrawlerStore) AddRedirectDestination

func (s *CrawlerStore) AddRedirectDestination(finalURL, intermediateURL string)

AddRedirectDestination adds a redirect destination to the list for a final URL. For a redirect chain A→B→C, this is called twice:

  • First with finalURL=B (since we don't know C yet)
  • Then with finalURL=C, adding B to C's list

func (*CrawlerStore) Clear

func (s *CrawlerStore) Clear()

Clear resets all stored data (useful for testing or restarting crawls)

func (*CrawlerStore) CountActions

func (s *CrawlerStore) CountActions() int

CountActions returns the total number of URLs with stored actions

func (*CrawlerStore) CountVisited

func (s *CrawlerStore) CountVisited() int

CountVisited returns the number of URLs marked as visited.

func (*CrawlerStore) GetAction

func (s *CrawlerStore) GetAction(url string) (interface{}, bool)

GetAction retrieves the action for a URL (returns nil if not found)

func (*CrawlerStore) GetAndClearRedirectDestinations

func (s *CrawlerStore) GetAndClearRedirectDestinations(finalURL string) ([]string, bool)

GetAndClearRedirectDestinations retrieves and removes all redirect destinations for a final URL. It returns the list of intermediate URLs and true if found, or an empty list and false otherwise.

func (*CrawlerStore) GetMetadata

func (s *CrawlerStore) GetMetadata(url string) (interface{}, bool)

GetMetadata retrieves metadata for a URL (returns nil if not found)

func (*CrawlerStore) IsVisited

func (s *CrawlerStore) IsVisited(hash uint64) (bool, error)

IsVisited checks if a URL hash has been visited (read-only check)

func (*CrawlerStore) PreMarkVisited

func (s *CrawlerStore) PreMarkVisited(hash uint64)

PreMarkVisited marks a URL hash as visited without checking first. Used for pre-populating the visited set when resuming a crawl.

func (*CrawlerStore) StoreAction

func (s *CrawlerStore) StoreAction(url string, action interface{})

StoreAction stores the action for a discovered URL (for memoization)

func (*CrawlerStore) StoreMetadata

func (s *CrawlerStore) StoreMetadata(url string, metadata interface{})

StoreMetadata stores metadata for a crawled page (for link population)

func (*CrawlerStore) VisitIfNotVisited

func (s *CrawlerStore) VisitIfNotVisited(hash uint64) (bool, error)

VisitIfNotVisited atomically checks whether a URL hash has been visited and marks it as visited. It returns true if the hash was already visited, false if it was newly marked. This is the CRITICAL method for preventing race conditions in URL visit tracking.

type InMemoryStorage

type InMemoryStorage struct {
	// contains filtered or unexported fields
}

InMemoryStorage is the default storage backend of bluesnake. It keeps cookies and content hashes in memory without persisting them to disk.

func (*InMemoryStorage) Close

func (s *InMemoryStorage) Close() error

Close implements Storage.Close()

func (*InMemoryStorage) Cookies

func (s *InMemoryStorage) Cookies(u *url.URL) string

Cookies implements Storage.Cookies()

func (*InMemoryStorage) GetContentHash

func (s *InMemoryStorage) GetContentHash(url string) (string, error)

GetContentHash implements Storage.GetContentHash()

func (*InMemoryStorage) Init

func (s *InMemoryStorage) Init() error

Init initializes InMemoryStorage

func (*InMemoryStorage) IsContentVisited

func (s *InMemoryStorage) IsContentVisited(contentHash string) (bool, error)

IsContentVisited implements Storage.IsContentVisited()

func (*InMemoryStorage) SetContentHash

func (s *InMemoryStorage) SetContentHash(url string, contentHash string) error

SetContentHash implements Storage.SetContentHash()

func (*InMemoryStorage) SetCookies

func (s *InMemoryStorage) SetCookies(u *url.URL, cookies string)

SetCookies implements Storage.SetCookies()

func (*InMemoryStorage) VisitedContent

func (s *InMemoryStorage) VisitedContent(contentHash string) error

VisitedContent implements Storage.VisitedContent()

type Storage

type Storage interface {
	// Init initializes the storage
	Init() error

	// Cookies retrieves stored cookies for a given host
	Cookies(u *url.URL) string
	// SetCookies stores cookies for a given host
	SetCookies(u *url.URL, cookies string)

	// SetContentHash stores a content hash for a given URL
	SetContentHash(url string, contentHash string) error
	// GetContentHash retrieves the stored content hash for a given URL
	GetContentHash(url string) (string, error)
	// IsContentVisited returns true if content with the given hash has been visited
	IsContentVisited(contentHash string) (bool, error)
	// VisitedContent marks content with the given hash as visited
	VisitedContent(contentHash string) error
}

Storage is an interface that handles the Collector's HTTP-level data. The Collector's default Storage is InMemoryStorage; it can be replaced by calling the Collector.SetStorage() function.

NOTE: This is separate from CrawlerStore (which handles crawl orchestration). Storage handles only HTTP client concerns:

  • Cookies (HTTP session state)
  • Content hashes (duplicate content detection)

Visit tracking has been removed from Storage: the Crawler owns all visit tracking via CrawlerStore.
