crossref

package
v0.2.12 Latest
Warning

This package is not in the latest version of its module.

Published: Oct 23, 2025 License: GPL-3.0 Imports: 19 Imported by: 0

Documentation

Constants

This section is empty.

Variables

var (
	MaxScanTokenSize  = 104_857_600 // 100MB, note: each thread will allocate a buffer of this size
	Today             = time.Now().Format("2006-01-02")
	TempfilePrefix    = "span-crossref-snapshot"
	DefaultOutputFile = path.Join(os.TempDir(), fmt.Sprintf("%s-%s.json.zst", TempfilePrefix, Today))
)
var NoFilter = func(_ Record) bool { return true }

NoFilter is a no-op filter; it returns true for every record.

Functions

func CreateSnapshot added in v0.2.6

func CreateSnapshot(opts SnapshotOptions) error

CreateSnapshot implements a three-stage metadata snapshot approach, given snapshot options. This makes it possible to create a current view of crossref out of a continuously harvested set of files.

On a machine with fast I/O, many parts of this process can be CPU-bound, whereas on spinning disks it will likely be I/O-bound.

An error is returned if the snapshot options do not contain any files to process.

On a 2011 dual-socket Xeon E5645 with spinning disks, the whole process runs in: 78189.57user 6229.72system 7:59:19elapsed 293%CPU, that is, about 8 hours of wall-clock time (roughly 21.7 hours of user CPU time). On a 2023 i9-13900T with RAID 0 NVMe disks, the process runs in about 3-4 hours.

The running time depends on the number of input files; in 07/2025 it was about 4 hours.
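The notion of a "current view" can be illustrated with a small in-memory sketch: for each DOI, keep only the record with the highest indexed timestamp. This is not the package's three-stage implementation, which is built for inputs far too large to hold in memory. The names latestPerDOI and newRecord are hypothetical, and the sketch assumes that a higher timestamp means a newer record.

```go
package main

import "fmt"

// Record mirrors the Record type documented in this package.
type Record struct {
	DOI     string `json:"DOI"`
	Indexed struct {
		Timestamp int64 `json:"timestamp"`
	} `json:"indexed"`
}

// newRecord is a hypothetical helper for building test records.
func newRecord(doi string, ts int64) Record {
	var r Record
	r.DOI = doi
	r.Indexed.Timestamp = ts
	return r
}

// latestPerDOI is a hypothetical, in-memory illustration: it keeps, for each
// DOI, the record with the highest indexed timestamp (assumed to be the newest).
func latestPerDOI(records []Record) map[string]Record {
	latest := make(map[string]Record)
	for _, r := range records {
		cur, seen := latest[r.DOI]
		if !seen || r.Indexed.Timestamp > cur.Indexed.Timestamp {
			latest[r.DOI] = r
		}
	}
	return latest
}

func main() {
	// Made-up DOIs and timestamps, standing in for repeated harvests.
	harvested := []Record{
		newRecord("10.1000/example.1", 100),
		newRecord("10.1000/example.2", 150),
		newRecord("10.1000/example.1", 200), // newer version of example.1
	}
	for doi, r := range latestPerDOI(harvested) {
		fmt.Println(doi, r.Indexed.Timestamp)
	}
}
```

A map-based approach like this needs memory proportional to the number of distinct DOIs, which is one reason a real snapshot tool would more plausibly rely on external sorting (compare the SortBufferSize option).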

func ExcludeFilter added in v0.2.6

func ExcludeFilter(excludes []string) func(record Record) bool

ExcludeFilter is a filter that excludes a given list of DOIs.
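A filter of this shape can be sketched with a set lookup. The following is a hypothetical reimplementation, not the package's source: it assumes that a filter returns true for records to keep (consistent with NoFilter returning true) and that DOIs are compared by exact string match.

```go
package main

import "fmt"

// Record carries only the field this sketch needs; the documented type
// also has an Indexed.Timestamp field.
type Record struct {
	DOI string `json:"DOI"`
}

// excludeFilter is a hypothetical stand-in for ExcludeFilter. It returns a
// function that reports false for any record whose DOI is in excludes.
// Assumption: true means "keep the record".
func excludeFilter(excludes []string) func(Record) bool {
	set := make(map[string]struct{}, len(excludes))
	for _, doi := range excludes {
		set[doi] = struct{}{}
	}
	return func(r Record) bool {
		_, excluded := set[r.DOI]
		return !excluded
	}
}

func main() {
	// Made-up DOIs for illustration.
	keep := excludeFilter([]string{"10.1000/example.1"})
	fmt.Println(keep(Record{DOI: "10.1000/example.1"})) // false
	fmt.Println(keep(Record{DOI: "10.1000/example.2"})) // true
}
```

Building the set once, outside the returned closure, keeps each per-record check at constant cost even for long exclude lists.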

func SortFilesBySize added in v0.2.7

func SortFilesBySize(filenames []string) ([]string, error)

SortFilesBySize takes a slice of filenames and returns them sorted by file size (largest first).
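One way such a function could work is to stat each file and sort on the reported size. The sketch below is an assumption about the approach, not the package's code; sortFilesBySize, fileInfo and demo are hypothetical names, and the behavior on a missing file (returning the error immediately) is a choice made for this example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
)

// fileInfo pairs a filename with its size, like the documented FileInfo type.
type fileInfo struct {
	Name string
	Size int64
}

// sortFilesBySize is a hypothetical sketch: it stats every file and returns
// the names ordered by size, largest first.
func sortFilesBySize(filenames []string) ([]string, error) {
	infos := make([]fileInfo, 0, len(filenames))
	for _, name := range filenames {
		st, err := os.Stat(name)
		if err != nil {
			return nil, err
		}
		infos = append(infos, fileInfo{Name: name, Size: st.Size()})
	}
	sort.SliceStable(infos, func(i, j int) bool {
		return infos[i].Size > infos[j].Size
	})
	sorted := make([]string, len(infos))
	for i, fi := range infos {
		sorted[i] = fi.Name
	}
	return sorted, nil
}

// demo writes three temporary files of different sizes and returns their
// base names in sorted order.
func demo() ([]string, error) {
	dir, err := os.MkdirTemp("", "sort-by-size-demo")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	var names []string
	for name, size := range map[string]int{"small": 10, "large": 1000, "medium": 100} {
		p := filepath.Join(dir, name)
		if err := os.WriteFile(p, make([]byte, size), 0o644); err != nil {
			return nil, err
		}
		names = append(names, p)
	}
	sorted, err := sortFilesBySize(names)
	if err != nil {
		return nil, err
	}
	for i, p := range sorted {
		sorted[i] = filepath.Base(p)
	}
	return sorted, nil
}

func main() {
	sorted, err := demo()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(sorted) // [large medium small]
}
```

Processing the largest files first is a common scheduling heuristic for worker pools, since it avoids one large file being picked up last and dominating the tail of the run.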

Types

type FileInfo added in v0.2.7

type FileInfo struct {
	Name string
	Size int64
}

FileInfo holds filename and size for sorting

type FilterFunc added in v0.2.6

type FilterFunc func(_ Record) bool

FilterFunc can filter a record

type LineNumberEntry added in v0.2.6

type LineNumberEntry struct {
	LineNumbersFilename string
	NumLines            int64
}

type LineNumbersMap added in v0.2.6

type LineNumbersMap map[string]*LineNumberEntry

LineNumbersMap maps a filename to the associated filename that contains the line numbers to extract and the number of lines in that file.

type Record added in v0.2.6

type Record struct {
	DOI     string `json:"DOI"`
	Indexed struct {
		Timestamp int64 `json:"timestamp"`
	} `json:"indexed"`
}

Record is the relevant part of a crossref record.
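Because Record declares only two fields, decoding a full crossref document into it discards everything else. A minimal sketch, assuming one JSON document per input line; the sample line and the parseRecord helper are made up for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Record mirrors the Record type documented in this package.
type Record struct {
	DOI     string `json:"DOI"`
	Indexed struct {
		Timestamp int64 `json:"timestamp"`
	} `json:"indexed"`
}

// parseRecord is a hypothetical helper that decodes a single JSON line.
// Fields not declared on Record (such as "title" below) are ignored.
func parseRecord(line []byte) (Record, error) {
	var r Record
	err := json.Unmarshal(line, &r)
	return r, err
}

func main() {
	// Made-up sample line, not real crossref data.
	line := []byte(`{"DOI":"10.1000/example.1","title":["ignored"],"indexed":{"timestamp":1700000000000}}`)
	r, err := parseRecord(line)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(r.DOI, r.Indexed.Timestamp) // 10.1000/example.1 1700000000000
}
```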

type SnapshotOptions added in v0.2.6

type SnapshotOptions struct {
	InputFiles        []string // InputFiles are the files to process, following a Record structure
	OutputFile        string   // OutputFile is the file the snapshot is written to
	TempDir           string   // TempDir is the directory used for temporary files
	BatchSize         int      // BatchSize is the number of records we process at once, affects memory usage
	NumWorkers        int      // NumWorkers is the number of threads
	SortBufferSize    string   // For sort -S parameter (e.g. "25%"), higher values make sort faster
	KeepTempFiles     bool     // For debugging
	Verbose           bool     // Verbose output
	Excludes          []string // List of DOIs to exclude
	ShuffleInputFiles bool     // Randomize processing order
}

SnapshotOptions contains configuration for the snapshot process.

func DefaultSnapshotOptions added in v0.2.6

func DefaultSnapshotOptions() SnapshotOptions

DefaultSnapshotOptions returns default options.
