crossref

package
v0.2.6
Warning

This package is not in the latest version of its module.

Published: Jul 7, 2025 License: GPL-3.0 Imports: 17 Imported by: 0

Documentation


Constants

This section is empty.

Variables

var (
	MaxScanTokenSize  = 104_857_600 // 100MB, note: each thread will allocate a buffer of this size
	Today             = time.Now().Format("2006-01-02")
	TempfilePrefix    = "span-crossref-snapshot"
	DefaultOutputFile = path.Join(os.TempDir(), fmt.Sprintf("%s-%s.json.zst", TempfilePrefix, Today))
)
var NoFilter = func(_ Record) bool { return true }

NoFilter is a no-op filter: it returns true for every record.

Functions

func CreateSnapshot added in v0.2.6

func CreateSnapshot(opts SnapshotOptions) error

CreateSnapshot implements a three-stage metadata snapshot approach, given snapshot options. This allows creating a current view of Crossref out of a continuously harvested set of files.

On a machine with fast I/O, many parts of this process can be CPU-bound, whereas on spinning disks, the process will likely be I/O-bound.

An error is returned if the snapshot options do not contain any files to process.

On a 2011 dual-socket Xeon E5645 with spinning disks, the whole process takes about 8 hours of wall-clock time: 78189.57user 6229.72system 7:59:19elapsed 293%CPU, which corresponds to about 21 hours of user CPU time. On a 2023 i9-13900T with RAID 0 NVMe disks, the process runs in about 3-4 hours.
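
The documentation does not spell out the three stages. As a rough illustration of the goal only, and assuming that a "current view" means keeping the most recently indexed version of each DOI, the idea can be sketched in memory as follows. latestByDOI and newRecord are hypothetical helpers, and this is not the package's file-based implementation, which is designed for inputs far too large to hold in a map.

```go
package main

import "fmt"

// Record mirrors the fields of the package's Record type.
type Record struct {
	DOI     string `json:"DOI"`
	Indexed struct {
		Timestamp int64 `json:"timestamp"`
	} `json:"indexed"`
}

// latestByDOI is a hypothetical helper: for each DOI it keeps the record
// with the highest indexed timestamp.
func latestByDOI(records []Record) map[string]Record {
	latest := make(map[string]Record)
	for _, r := range records {
		cur, seen := latest[r.DOI]
		if !seen || r.Indexed.Timestamp > cur.Indexed.Timestamp {
			latest[r.DOI] = r
		}
	}
	return latest
}

// newRecord is a hypothetical constructor used to keep the example short.
func newRecord(doi string, timestamp int64) Record {
	r := Record{DOI: doi}
	r.Indexed.Timestamp = timestamp
	return r
}

func main() {
	// Made-up harvest: the DOI 10.1000/a appears twice with different timestamps.
	harvested := []Record{
		newRecord("10.1000/a", 100),
		newRecord("10.1000/b", 150),
		newRecord("10.1000/a", 200),
	}
	snapshot := latestByDOI(harvested)
	fmt.Println(len(snapshot), snapshot["10.1000/a"].Indexed.Timestamp) // prints "2 200"
}
```
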

func ExcludeFilter added in v0.2.6

func ExcludeFilter(excludes []string) func(record Record) bool

ExcludeFilter returns a filter that excludes a given list of DOIs.
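
A filter of this shape might be built roughly as follows. This is a sketch, not the package's source: it assumes that a filter returns true for records to keep (consistent with NoFilter, which always returns true), and it uses a reduced, local Record type and a lowercase excludeFilter to avoid implying the real implementation.

```go
package main

import "fmt"

// Record is a reduced stand-in for the package's Record type.
type Record struct {
	DOI string
}

// excludeFilter returns a predicate that reports false for any record whose
// DOI is in excludes and true otherwise (assumption: true means "keep").
func excludeFilter(excludes []string) func(Record) bool {
	// Build a set once so that each lookup is O(1).
	set := make(map[string]struct{}, len(excludes))
	for _, doi := range excludes {
		set[doi] = struct{}{}
	}
	return func(r Record) bool {
		_, excluded := set[r.DOI]
		return !excluded
	}
}

func main() {
	keep := excludeFilter([]string{"10.1000/a"})
	fmt.Println(keep(Record{DOI: "10.1000/a"})) // prints "false"
	fmt.Println(keep(Record{DOI: "10.1000/b"})) // prints "true"
}
```
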

Types

type FilterFunc added in v0.2.6

type FilterFunc func(r Record) bool

FilterFunc can filter a record

type LineNumberEntry added in v0.2.6

type LineNumberEntry struct {
	LineNumbersFilename string
	NumLines            int64
}

type LineNumbersMap added in v0.2.6

type LineNumbersMap map[string]*LineNumberEntry

LineNumbersMap maps a filename to the associated filename that contains the line numbers to extract and the number of lines in that file.

type Record added in v0.2.6

type Record struct {
	DOI     string `json:"DOI"`
	Indexed struct {
		Timestamp int64 `json:"timestamp"`
	} `json:"indexed"`
}

Record is the relevant part of a Crossref record.
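
As an illustration of how such a partial struct behaves, the sketch below decodes one JSON line into a local copy of Record with encoding/json; fields not declared in the struct are simply ignored. The sample line is made up, and parseRecord is a hypothetical helper, not part of the package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Record is a local copy of the package's Record type.
type Record struct {
	DOI     string `json:"DOI"`
	Indexed struct {
		Timestamp int64 `json:"timestamp"`
	} `json:"indexed"`
}

// parseRecord is a hypothetical helper that decodes a single JSON line.
func parseRecord(line []byte) (Record, error) {
	var r Record
	err := json.Unmarshal(line, &r)
	return r, err
}

func main() {
	// Made-up sample line; the "title" field is not part of Record and is skipped.
	line := []byte(`{"DOI": "10.1000/example", "indexed": {"timestamp": 1700000000000}, "title": ["ignored"]}`)
	r, err := parseRecord(line)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(r.DOI, r.Indexed.Timestamp) // prints "10.1000/example 1700000000000"
}
```
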

type SnapshotOptions added in v0.2.6

type SnapshotOptions struct {
	InputFiles     []string // InputFiles, following a Record structure
	OutputFile     string   // OutputFile is the file the snapshot is written to
	TempDir        string   // TempDir
	BatchSize      int      // BatchSize is the number of records processed at once; affects memory usage
	NumWorkers     int      // Threads
	SortBufferSize string   // For sort -S parameter (e.g. "25%"), higher values make sort faster
	KeepTempFiles  bool     // For debugging
	Verbose        bool     // Verbose output
	Excludes       []string // List of DOIs to exclude
}

SnapshotOptions contains configuration for the snapshot process.

func DefaultSnapshotOptions added in v0.2.6

func DefaultSnapshotOptions() SnapshotOptions

DefaultSnapshotOptions returns default options.
