Documentation
Functions ¶
func CreateSnapshot ¶
func CreateSnapshot(opts SnapshotOptions) error
CreateSnapshot implements a three-stage metadata snapshot approach, given snapshot options. This allows creating a current view of Crossref out of a continuously harvested set of files.
On a machine with fast I/O, many parts of this process can be CPU-bound, whereas on spinning disks it will likely be I/O-bound.
An error is returned if the snapshot options do not contain any files to process.
On a 2011 dual-socket Xeon E5645 with spinning disk, the whole process takes about 8 hours of wall-clock time and about 21.7 hours of user CPU time (78189.57user 6229.72system 7:59:19elapsed 293%CPU). On a 2023 i9-13900T with RAID 0 NVMe disks, the process runs in about 3-4 hours.
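The idea of deriving a current view from repeatedly harvested files can be illustrated with a small in-memory sketch. It assumes that "current" means the record with the highest indexed timestamp per DOI and that input is newline-delimited JSON. The helper latestLines is hypothetical and is not the package's implementation, which works in three stages on files and relies on an external sort.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// Record mirrors the package's Record type: only the DOI and the indexed
// timestamp are decoded, all other fields are ignored.
type Record struct {
	DOI     string `json:"DOI"`
	Indexed struct {
		Timestamp int64 `json:"timestamp"`
	} `json:"indexed"`
}

// latestLines is a hypothetical helper. It scans newline-delimited JSON and
// returns, per DOI, the zero-based line number of the record with the
// highest indexed timestamp. On equal timestamps the earlier line is kept.
// Note: bufio.Scanner rejects lines longer than its buffer (64 KiB by
// default), so real metadata records would need a larger buffer.
func latestLines(input string) (map[string]int, error) {
	type seen struct {
		line int
		ts   int64
	}
	best := make(map[string]seen)
	scanner := bufio.NewScanner(strings.NewReader(input))
	for line := 0; scanner.Scan(); line++ {
		var rec Record
		if err := json.Unmarshal(scanner.Bytes(), &rec); err != nil {
			return nil, fmt.Errorf("line %d: %w", line, err)
		}
		if cur, ok := best[rec.DOI]; !ok || rec.Indexed.Timestamp > cur.ts {
			best[rec.DOI] = seen{line: line, ts: rec.Indexed.Timestamp}
		}
	}
	if err := scanner.Err(); err != nil {
		return nil, err
	}
	result := make(map[string]int, len(best))
	for doi, s := range best {
		result[doi] = s.line
	}
	return result, nil
}

func main() {
	// Made-up sample data: DOI 10.1/a was harvested twice.
	input := `{"DOI":"10.1/a","indexed":{"timestamp":100}}
{"DOI":"10.1/b","indexed":{"timestamp":150}}
{"DOI":"10.1/a","indexed":{"timestamp":200}}`
	lines, err := latestLines(input)
	if err != nil {
		panic(err)
	}
	fmt.Println(lines["10.1/a"], lines["10.1/b"]) // prints "2 1"
}
```

Holding one map entry per DOI in memory does not scale to a full harvest, which is presumably why the options expose a temporary directory and a sort buffer size.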
Types ¶
type LineNumberEntry ¶
type LineNumbersFileMap ¶
type LineNumbersFileMap map[string]*LineNumberEntry
LineNumbersFileMap maps a filename to the associated filename that contains the line numbers to extract and the number of lines in that file.
type Record ¶
type Record struct {
	DOI     string `json:"DOI"`
	Indexed struct {
		Timestamp int64 `json:"timestamp"`
	} `json:"indexed"`
}
Record represents the JSON structure we're interested in: the DOI and the indexed timestamp.
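A short sketch of how such a partial struct behaves when decoding a larger document: fields that Record does not declare are skipped by encoding/json. The sample document and the helper parseRecord are made up for illustration, and the conversion assumes the timestamp is in milliseconds since the Unix epoch, as in Crossref API responses.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Record is a local copy of the type documented above.
type Record struct {
	DOI     string `json:"DOI"`
	Indexed struct {
		Timestamp int64 `json:"timestamp"`
	} `json:"indexed"`
}

// parseRecord is a hypothetical helper that decodes a single JSON document
// into a Record. Keys without a matching struct field are ignored.
func parseRecord(line []byte) (Record, error) {
	var rec Record
	err := json.Unmarshal(line, &rec)
	return rec, err
}

func main() {
	// Made-up document with extra fields ("title", "date-time").
	line := []byte(`{"DOI":"10.1234/example","title":["An example"],` +
		`"indexed":{"date-time":"2023-11-14T22:13:20Z","timestamp":1700000000000}}`)
	rec, err := parseRecord(line)
	if err != nil {
		panic(err)
	}
	fmt.Println(rec.DOI, rec.Indexed.Timestamp)
	// Assumption: the timestamp is in milliseconds (time.UnixMilli needs Go 1.17+).
	fmt.Println(time.UnixMilli(rec.Indexed.Timestamp).UTC().Format(time.RFC3339))
}
```

Decoding only two fields keeps per-record work and allocations small, which matters when many millions of records are scanned.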
type SnapshotOptions ¶
type SnapshotOptions struct {
	InputFiles     []string // InputFiles are the files to process, with entries following the Record structure.
	OutputFile     string   // OutputFile is the file the snapshot is written to.
	TempDir        string   // Directory for temporary files.
	BatchSize      int      // BatchSize is the number of records processed at once; affects memory usage.
	NumWorkers     int      // Number of threads; each thread may allocate buffers.
	SortBufferSize string   // For the sort -S parameter (e.g. "25%"); crucial for a faster sort.
	KeepTempFiles  bool     // For debugging.
	Verbose        bool     // Verbose output.
}
SnapshotOptions contains configuration for the snapshot process.
func DefaultSnapshotOptions ¶
func DefaultSnapshotOptions() SnapshotOptions
DefaultSnapshotOptions returns default options.
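The following sketch shows how the options might be filled in and why an empty InputFiles list leads to an error. To stay self-contained it uses a local copy of SnapshotOptions; the function checkOptions, the file names, and the chosen values are hypothetical and do not reflect the package's actual defaults or validation logic. With the real package one would presumably start from DefaultSnapshotOptions(), override fields, and pass the result to CreateSnapshot.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"runtime"
)

// SnapshotOptions is a local copy of the type documented above.
type SnapshotOptions struct {
	InputFiles     []string
	OutputFile     string
	TempDir        string
	BatchSize      int
	NumWorkers     int
	SortBufferSize string
	KeepTempFiles  bool
	Verbose        bool
}

// errNoInput is a hypothetical error value for this sketch.
var errNoInput = errors.New("no input files")

// checkOptions is a hypothetical stand-in for the documented behavior that
// options without any files to process result in an error.
func checkOptions(opts SnapshotOptions) error {
	if len(opts.InputFiles) == 0 {
		return errNoInput
	}
	return nil
}

func main() {
	// Illustrative values only, not the package defaults.
	opts := SnapshotOptions{
		OutputFile:     "snapshot.json",
		TempDir:        os.TempDir(),
		BatchSize:      100000,
		NumWorkers:     runtime.NumCPU(),
		SortBufferSize: "25%",
	}
	fmt.Println(checkOptions(opts)) // prints "no input files"

	opts.InputFiles = []string{"harvest-a.json", "harvest-b.json"}
	fmt.Println(checkOptions(opts)) // prints "<nil>"
}
```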