Documentation
Constants ¶
This section is empty.
Variables ¶
var (
	MaxScanTokenSize  = 104_857_600 // 100MB, note: each thread will allocate a buffer of this size
	Today             = time.Now().Format("2006-01-02")
	TempfilePrefix    = "span-crossref-snapshot"
	DefaultOutputFile = path.Join(os.TempDir(), fmt.Sprintf("%s-%s.json.zst", TempfilePrefix, Today))
)
var NoFilter = func(_ Record) bool { return true }
NoFilter is a no-op filter; it returns true for every record.
Functions ¶
func CreateSnapshot ¶ added in v0.2.6
func CreateSnapshot(opts SnapshotOptions) error
CreateSnapshot implements a three-stage metadata snapshot approach, given snapshot options. This makes it possible to create a current view of Crossref from a continuously harvested set of files.
On a machine with fast i/o, many parts of this process can be cpu bound, whereas on spinning disks, this will likely be i/o bound.
An error is returned if the snapshot options do not contain any files to process.
On a 2011 dual-socket Xeon E5645 with spinning disk, the whole process runs in: 78189.57user 6229.72system 7:59:19elapsed 293%CPU -- that is, about 8 hours of wall-clock time and roughly 21.7 hours of user CPU time. On a 2023 i9-13900T with raid0 nvme disks, the process runs in about 3-4 hours.
The running time depends on the number of input files; in 07/2025 it was about 4 hours.
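The idea of a "current view" can be illustrated in memory. The sketch below assumes that the current record for a DOI is the one with the highest indexed.timestamp; that reading, and the helper name latestPerDOI, are assumptions. It is not the three-stage implementation, which works on files rather than on an in-memory map.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Record repeats the documented Record type so the sketch is self-contained.
type Record struct {
	DOI     string `json:"DOI"`
	Indexed struct {
		Timestamp int64 `json:"timestamp"`
	} `json:"indexed"`
}

// latestPerDOI returns, for each DOI, the index of the input line that
// carries the highest indexed timestamp. Hypothetical helper.
func latestPerDOI(lines []string) (map[string]int, error) {
	type seen struct {
		line int
		ts   int64
	}
	latest := make(map[string]seen)
	for i, line := range lines {
		var rec Record
		if err := json.Unmarshal([]byte(line), &rec); err != nil {
			return nil, fmt.Errorf("line %d: %w", i, err)
		}
		// Keep the line if the DOI is new or this version is more recent.
		if cur, ok := latest[rec.DOI]; !ok || rec.Indexed.Timestamp > cur.ts {
			latest[rec.DOI] = seen{line: i, ts: rec.Indexed.Timestamp}
		}
	}
	result := make(map[string]int, len(latest))
	for doi, s := range latest {
		result[doi] = s.line
	}
	return result, nil
}

func main() {
	lines := []string{
		`{"DOI":"10.1000/a","indexed":{"timestamp":100}}`,
		`{"DOI":"10.1000/b","indexed":{"timestamp":150}}`,
		`{"DOI":"10.1000/a","indexed":{"timestamp":200}}`,
	}
	keep, err := latestPerDOI(lines)
	fmt.Println(keep, err)
}
```

An in-memory map like this only works for small inputs; the SortBufferSize option below suggests that the actual process relies on an external sort instead.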
func ExcludeFilter ¶ added in v0.2.6
ExcludeFilter is a filter that excludes a given list of DOIs.
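The signature of ExcludeFilter is not shown here, so the following is only a sketch of how such a filter could be built. It assumes that FilterFunc has the same type as NoFilter, func(Record) bool, and that a return value of true means "keep the record"; excludeFilter is a hypothetical name.

```go
package main

import "fmt"

// Record repeats the documented Record type so the sketch is self-contained.
type Record struct {
	DOI     string `json:"DOI"`
	Indexed struct {
		Timestamp int64 `json:"timestamp"`
	} `json:"indexed"`
}

// FilterFunc is assumed to match the type of NoFilter.
type FilterFunc func(Record) bool

// excludeFilter returns a filter that rejects records whose DOI is in dois.
// Hypothetical helper; a set makes each lookup O(1).
func excludeFilter(dois []string) FilterFunc {
	excluded := make(map[string]struct{}, len(dois))
	for _, doi := range dois {
		excluded[doi] = struct{}{}
	}
	return func(r Record) bool {
		_, found := excluded[r.DOI]
		return !found // true means: keep the record
	}
}

func main() {
	keep := excludeFilter([]string{"10.1000/excluded"})
	fmt.Println(keep(Record{DOI: "10.1000/excluded"}), keep(Record{DOI: "10.1000/ok"}))
}
```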
func SortFilesBySize ¶ added in v0.2.7
SortFilesBySize takes a slice of filenames and returns them sorted by file size (largest first)
Types ¶
type FilterFunc ¶ added in v0.2.6
FilterFunc can filter a record
type LineNumberEntry ¶ added in v0.2.6
type LineNumbersMap ¶ added in v0.2.6
type LineNumbersMap map[string]*LineNumberEntry
LineNumbersMap maps a filename to the associated filename that contains the line numbers to extract and the number of lines in that file.
type Record ¶ added in v0.2.6
type Record struct {
	DOI     string `json:"DOI"`
	Indexed struct {
		Timestamp int64 `json:"timestamp"`
	} `json:"indexed"`
}
Record is the relevant part of a crossref record.
type SnapshotOptions ¶ added in v0.2.6
type SnapshotOptions struct {
	InputFiles        []string // InputFiles, following a Record structure
	OutputFile        string   // OutputFile is the file the snapshot is written to
	TempDir           string   // TempDir is the directory for temporary files
	BatchSize         int      // BatchSize is the number of records we process at once, affects memory usage
	NumWorkers        int      // NumWorkers is the number of threads
	SortBufferSize    string   // For sort -S parameter (e.g. "25%"), higher values make sort faster
	KeepTempFiles     bool     // For debugging
	Verbose           bool     // Verbose output
	Excludes          []string // List of DOIs to exclude
	ShuffleInputFiles bool     // Randomize processing order
}
SnapshotOptions contains configuration for the snapshot process.
func DefaultSnapshotOptions ¶ added in v0.2.6
func DefaultSnapshotOptions() SnapshotOptions
DefaultSnapshotOptions returns default options.