span-crossref-fast-snapshot

command
v0.2.12 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Oct 23, 2025 License: GPL-3.0 Imports: 7 Imported by: 0

Documentation

Overview

Create a snapshot from a list of API slices.

Background: After harvesting daily slices from crossref, we accumulate duplicates and we want to dedupliate and keep the latest version for a DOI.

Ex. 797,724,618 messages; with 64 cores, we are parsing about 215053 json docs/s, or about 3K docs/s/core. Data is about 70GB, so only about 19MB/s. Overall, we save 1-2TB of SSD writes per month, which should extend the lifetime of the storage hardware (est. TBW of ssd: 14,016 TB)

# for d in /dev/nvme{0..3}; do smartctl -A $d | grep "Data Units Written"; done # 2025-07-08 Data Units Written: 114,134,244 [58.4 TB] Data Units Written: 114,188,635 [58.4 TB] Data Units Written: 114,185,851 [58.4 TB] Data Units Written: 114,109,391 [58.4 TB]

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL