Documentation
Index ¶
- Variables
- func GetNextIP(availableIPs *availableIPs) net.IP
- func GetSHA1(r io.Reader) string
- func GetSHA256(r io.Reader) string
- func GetSHA256Base16(r io.Reader) string
- func NewDecompressionReader(r io.Reader) (io.Reader, error)
- type CustomHTTPClient
- type DedupeOptions
- type DiscardHook
- type DiscardHookError
- type Error
- type HTTPClientSettings
- type Header
- type Reader
- type Record
- type RecordBatch
- type RotatorSettings
- type WaitGroupWithCount
- type Writer
Constants ¶
This section is empty.
Variables ¶
var (
IPv6 *availableIPs
IPv4 *availableIPs
)
var (
// Create a counter to keep track of the number of bytes written to WARC files
// and the number of bytes deduped
DataTotal *ratecounter.Counter
RemoteDedupeTotal *ratecounter.Counter
LocalDedupeTotal *ratecounter.Counter
)
Functions ¶
func GetSHA256Base16 ¶ added in v0.8.37
Types ¶
type CustomHTTPClient ¶ added in v0.7.0
type CustomHTTPClient struct {
WaitGroup *WaitGroupWithCount
ErrChan chan *Error
WARCWriter chan *RecordBatch
http.Client
TempDir string
WARCWriterDoneChannels []chan bool
DiscardHook DiscardHook
TLSHandshakeTimeout time.Duration
MaxReadBeforeTruncate int
FullOnDisk bool
// MaxRAMUsageFraction is the fraction of system RAM above which we'll force spooling to disk. For example, 0.5 = 50%.
// If set to <= 0, the default value is DefaultMaxRAMUsageFraction.
MaxRAMUsageFraction float64
// contains filtered or unexported fields
}
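The comment on MaxRAMUsageFraction describes a threshold with a fallback default. The following is a minimal sketch of that rule, not the library's code: the constant value 0.5, the helper names, and the way memory usage is passed in are all assumptions made for illustration.

```go
package main

import "fmt"

// defaultMaxRAMUsageFraction is a hypothetical stand-in for
// DefaultMaxRAMUsageFraction; the real value is not shown in this page.
const defaultMaxRAMUsageFraction = 0.5

// effectiveFraction applies the documented rule: values <= 0 select the default.
func effectiveFraction(configured float64) float64 {
	if configured <= 0 {
		return defaultMaxRAMUsageFraction
	}
	return configured
}

// shouldSpoolToDisk reports whether RAM usage is above the threshold.
// How usedBytes and totalBytes are measured is left to the caller.
func shouldSpoolToDisk(usedBytes, totalBytes uint64, configured float64) bool {
	if totalBytes == 0 {
		return true // unknown total: this sketch chooses to be conservative
	}
	return float64(usedBytes)/float64(totalBytes) > effectiveFraction(configured)
}

func main() {
	fmt.Println(shouldSpoolToDisk(6<<30, 8<<30, 0))   // 75% used, default threshold
	fmt.Println(shouldSpoolToDisk(2<<30, 8<<30, 0.5)) // 25% used, explicit threshold
}
```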
func NewWARCWritingHTTPClient ¶ added in v0.7.0
func NewWARCWritingHTTPClient(HTTPClientSettings HTTPClientSettings) (httpClient *CustomHTTPClient, err error)
func (*CustomHTTPClient) Close ¶ added in v0.7.0
func (c *CustomHTTPClient) Close() error
func (*CustomHTTPClient) WriteRecord ¶ added in v0.8.50
func (c *CustomHTTPClient) WriteRecord(WARCTargetURI, WARCType, contentType, payloadString string, payloadReader io.Reader)
type DedupeOptions ¶ added in v0.8.0
type DiscardHook ¶ added in v0.8.76
DiscardHook is a hook function that, if set, is called for each response. It can be used to determine whether the response should be discarded. It returns:
- bool: whether the response should be discarded
- string: (optional) the reason the response was or was not discarded
type DiscardHookError ¶ added in v0.8.76
type DiscardHookError struct {
URL string
Reason string // reason for discarding
Err error // nil: discarded successfully
}
func (*DiscardHookError) Error ¶ added in v0.8.76
func (e *DiscardHookError) Error() string
func (*DiscardHookError) Unwrap ¶ added in v0.8.76
func (e *DiscardHookError) Unwrap() error
type HTTPClientSettings ¶ added in v0.8.14
type HTTPClientSettings struct {
RotatorSettings *RotatorSettings
Proxy string
TempDir string
DNSServer string
DiscardHook DiscardHook
DNSServers []string
DedupeOptions DedupeOptions
DialTimeout time.Duration
ResponseHeaderTimeout time.Duration
DNSResolutionTimeout time.Duration
DNSRecordsTTL time.Duration
DNSCacheSize int
TLSHandshakeTimeout time.Duration
TCPTimeout time.Duration
MaxReadBeforeTruncate int
DecompressBody bool
FollowRedirects bool
FullOnDisk bool
MaxRAMUsageFraction float64
VerifyCerts bool
RandomLocalIP bool
DisableIPv4 bool
DisableIPv6 bool
IPv6AnyIP bool
}
type Header ¶
Header provides information about the WARC record. It stores WARC record field names and their values. Since WARC field names are case-insensitive, the Header methods are case-insensitive as well.
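One common way to get case-insensitive lookups is to normalize keys on every access. The sketch below shows that idea with a local map type; it is an illustration only, and the method names and the lowercase normalization are assumptions about how such a type could work, not a description of Header's internals.

```go
package main

import (
	"fmt"
	"strings"
)

// header is a hypothetical stand-in that normalizes field names to lowercase.
type header map[string]string

// Set stores value under the normalized field name.
func (h header) Set(key, value string) { h[strings.ToLower(key)] = value }

// Get returns the value for key regardless of the case used by the caller.
func (h header) Get(key string) string { return h[strings.ToLower(key)] }

// Del removes the field regardless of the case used by the caller.
func (h header) Del(key string) { delete(h, strings.ToLower(key)) }

func main() {
	h := header{}
	h.Set("WARC-Type", "response")
	fmt.Println(h.Get("warc-type"))
	h.Del("WARC-TYPE")
	fmt.Println(h.Get("WARC-Type") == "")
}
```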
type Reader ¶
type Reader struct {
// contains filtered or unexported fields
}
Reader stores the bufio.Reader and gzip.Reader for a WARC file.
func NewReader ¶
func NewReader(reader io.ReadCloser) (*Reader, error)
NewReader returns a new WARC reader
func (*Reader) ReadRecord ¶
ReadRecord reads the next record from the opened WARC file. It returns:
- Record: if an error occurred, the record **may be** nil. If eol is true, the record **must be** nil.
- bool (eol): if true, all records have been read successfully.
- error: the error encountered, if any.
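The three return values suggest a loop that checks the error first, then eol, and only then touches the record. The sketch below exercises that order against a fake reader; record, recordReader, and sliceReader are hypothetical stand-ins, and the method signature is assumed from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// record is a hypothetical stand-in for a WARC record.
type record struct{ ID string }

// recordReader captures the three-value contract described above.
type recordReader interface {
	ReadRecord() (*record, bool, error)
}

// sliceReader is a fake reader; a nil entry simulates a malformed record.
type sliceReader struct {
	records []*record
	pos     int
}

func (s *sliceReader) ReadRecord() (*record, bool, error) {
	if s.pos >= len(s.records) {
		return nil, true, nil // eol: record must be nil
	}
	rec := s.records[s.pos]
	s.pos++
	if rec == nil {
		return nil, false, errors.New("malformed record")
	}
	return rec, false, nil
}

// readAll checks err, then eol, and uses the record only when both are clear.
func readAll(r recordReader) ([]string, error) {
	var ids []string
	for {
		rec, eol, err := r.ReadRecord()
		if err != nil {
			return ids, err
		}
		if eol {
			return ids, nil
		}
		ids = append(ids, rec.ID)
	}
}

func main() {
	ids, err := readAll(&sliceReader{records: []*record{{ID: "a"}, {ID: "b"}}})
	fmt.Println(ids, err)
}
```

With the real Record type, a caller would likely also want to close Content once done with each record, since it is a ReadWriteSeekCloser.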
type Record ¶
type Record struct {
Header Header
Content spooledtempfile.ReadWriteSeekCloser
Version string // WARC/1.0, WARC/1.1 ...
}
Record represents a WARC record.
type RecordBatch ¶
RecordBatch is a structure that contains a bunch of records to be written at the same time, and a common capture timestamp. FeedbackChan is used to signal when the records have been written.
func NewRecordBatch ¶
func NewRecordBatch(feedbackChan chan struct{}) *RecordBatch
NewRecordBatch creates a record batch and initializes its capture time.
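The feedback channel lets a producer wait until its batch has actually been written. The sketch below shows that handshake with local stand-in types; the batch fields, the timestamp format, and the writer loop are assumptions made for illustration and do not reproduce the library's writer.

```go
package main

import (
	"fmt"
	"time"
)

// batch is a hypothetical stand-in for RecordBatch.
type batch struct {
	feedbackChan chan struct{}
	captureTime  string
	records      []string
}

// newBatch stamps the batch with a capture time (format assumed).
func newBatch(feedbackChan chan struct{}) *batch {
	return &batch{
		feedbackChan: feedbackChan,
		captureTime:  time.Now().UTC().Format(time.RFC3339),
	}
}

// writeBatches drains the channel, "writes" each batch into written,
// then signals the producer through the batch's feedback channel.
func writeBatches(in <-chan *batch, written *[]string) {
	for b := range in {
		*written = append(*written, b.records...)
		if b.feedbackChan != nil {
			b.feedbackChan <- struct{}{}
		}
	}
}

// submit sends one batch and blocks until the writer acknowledges it.
func submit(in chan<- *batch, records ...string) *batch {
	b := newBatch(make(chan struct{}, 1))
	b.records = records
	in <- b
	<-b.feedbackChan
	return b
}

func main() {
	in := make(chan *batch)
	var written []string
	go writeBatches(in, &written)

	b := submit(in, "request", "response")
	close(in)
	fmt.Println(len(written), b.captureTime != "")
}
```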
type RotatorSettings ¶
type RotatorSettings struct {
// Content of the warcinfo record that will be written
// to all WARC files
WarcinfoContent Header
// Prefix used for WARC filenames. The WARC 1.1 specification
// recommends naming files this way:
// Prefix-Timestamp-Serial-Crawlhost.warc.gz
Prefix string
// Compression algorithm to use
Compression string
// Path to a ZSTD compression dictionary to embed (and use) in .warc.zst files
CompressionDictionary string
// Directory where the created WARC files will be stored,
// default will be the current directory
OutputDirectory string
// WarcSize is in Megabytes
WarcSize float64
// WARCWriterPoolSize defines the number of parallel WARC writers
WARCWriterPoolSize int
}
RotatorSettings is used to store the settings needed by recordWriter to write WARC files
func NewRotatorSettings ¶
func NewRotatorSettings() *RotatorSettings
NewRotatorSettings creates a RotatorSettings structure and initializes it with default values.
func (*RotatorSettings) NewWARCRotator ¶
func (s *RotatorSettings) NewWARCRotator() (recordWriterChan chan *RecordBatch, doneChannels []chan bool, err error)
NewWARCRotator creates and returns a channel used to send records to the recordWriter function, which runs in a goroutine and writes them to WARC files.
type WaitGroupWithCount ¶ added in v0.8.18
func (*WaitGroupWithCount) Add ¶ added in v0.8.18
func (wg *WaitGroupWithCount) Add(delta int)
func (*WaitGroupWithCount) Done ¶ added in v0.8.18
func (wg *WaitGroupWithCount) Done()
func (*WaitGroupWithCount) Size ¶ added in v0.8.18
func (wg *WaitGroupWithCount) Size() int
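A wait group that also exposes its current count can be built by pairing sync.WaitGroup with an atomic counter. The sketch below is one possible implementation with the same method set; the type name and internals are assumptions, not the library's code.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// countingWaitGroup is a hypothetical wait group that tracks its own size.
type countingWaitGroup struct {
	sync.WaitGroup
	count int64
}

// Add adjusts both the counter and the embedded WaitGroup.
func (wg *countingWaitGroup) Add(delta int) {
	atomic.AddInt64(&wg.count, int64(delta))
	wg.WaitGroup.Add(delta)
}

// Done decrements both the counter and the embedded WaitGroup.
func (wg *countingWaitGroup) Done() {
	atomic.AddInt64(&wg.count, -1)
	wg.WaitGroup.Done()
}

// Size returns the number of outstanding Add calls not yet matched by Done.
func (wg *countingWaitGroup) Size() int {
	return int(atomic.LoadInt64(&wg.count))
}

func main() {
	var wg countingWaitGroup
	wg.Add(3)
	fmt.Println(wg.Size())
	for i := 0; i < 3; i++ {
		go wg.Done()
	}
	wg.Wait()
	fmt.Println(wg.Size())
}
```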
type Writer ¶
type Writer struct {
GZIPWriter *gzip.Writer
ZSTDWriter *zstd.Encoder
FileWriter *bufio.Writer
FileName string
Compression string
ParallelGZIP bool
}
Writer writes WARC records to WARC files.
func NewWriter ¶
func NewWriter(writer io.Writer, fileName string, compression string, contentLengthHeader string, newFileCreation bool, dictionary []byte) (*Writer, error)
NewWriter creates a new WARC writer.
func (*Writer) CloseCompressedWriter ¶ added in v0.8.20
func (*Writer) WriteInfoRecord ¶
WriteInfoRecord can be used to write a warcinfo record to the WARC file.
func (*Writer) WriteRecord ¶
WriteRecord writes a record to the underlying WARC file. A record consists of a version string, the record header followed by a record content block and two newlines:
Version CRLF
Header-Key: Header-Value CRLF
CRLF
Content CRLF
CRLF