Documentation
Overview
span-compact deduplicates NDJSON streams on a user-defined field, keeping one record per key, chosen by a selectable strategy. It is built for the 10M-100M record range: it uses an external sort instead of an in-memory hash map, so memory usage stays bounded. It reads from stdin or a file (automatically decompressing .gz and .zst) and writes NDJSON to stdout or to the destination given with -o.
Strategies:
    first   keep the first record seen for a key
    last    keep the last record seen
    random  keep a uniformly random record (reservoir over the group)
    min     keep the record with the smallest -sort-key value
    max     keep the record with the largest -sort-key value
Examples:
    span-compact -key id -strategy last input.ndj
    zstdcat big.ndj.zst | span-compact -key doi -strategy max \
        -sort-key indexed_at -numeric > out.ndj