span-compact

command
v0.2.33 Latest
Published: Apr 28, 2026 License: GPL-3.0 Imports: 14 Imported by: 0

Documentation

Overview

span-compact deduplicates an NDJSON stream on a user-defined field, keeping one record per key as chosen by a selectable strategy. It is built for the 10M-100M record range: it uses an external sort instead of an in-memory hash map, so memory usage stays bounded. Input comes from stdin or a file (.gz and .zst files are decompressed automatically); output is NDJSON written to stdout or to the path given with -o.
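Why sorting keeps memory bounded can be shown with a minimal sketch. It is not the tool's actual code: it assumes the input has already been sorted on the key field (the job of the external sort), and the names keyOf, compactSorted and compactString are hypothetical. Once records with equal keys are adjacent, the reader only has to remember the previous key, regardless of how many records pass through. The sketch keeps the first record of each run and compares keys by their raw JSON encoding, which is a simplifying assumption.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// keyOf returns the raw JSON encoding of a top-level field in one NDJSON line.
// Assumption: keys are compared by raw encoding, so "a" and "\u0061" would differ.
func keyOf(line []byte, field string) (string, error) {
	var rec map[string]json.RawMessage
	if err := json.Unmarshal(line, &rec); err != nil {
		return "", err
	}
	raw, ok := rec[field]
	if !ok {
		return "", fmt.Errorf("field %q missing", field)
	}
	return string(raw), nil
}

// compactSorted copies the first line of every run of equal keys from r to w.
// The input must already be sorted on field; only the previous key is held in memory.
func compactSorted(r io.Reader, w io.Writer, field string) error {
	sc := bufio.NewScanner(r)
	// Allow long lines (up to 16 MiB here; the limit is an arbitrary choice).
	sc.Buffer(make([]byte, 0, 64*1024), 16*1024*1024)
	prev, first := "", true
	for sc.Scan() {
		line := sc.Bytes()
		if len(line) == 0 {
			continue // skip blank lines
		}
		k, err := keyOf(line, field)
		if err != nil {
			return err
		}
		if first || k != prev {
			if _, err := fmt.Fprintf(w, "%s\n", line); err != nil {
				return err
			}
			prev, first = k, false
		}
	}
	return sc.Err()
}

// compactString is a small convenience wrapper for trying the sketch on a string.
func compactString(in, field string) (string, error) {
	var sb strings.Builder
	err := compactSorted(strings.NewReader(in), &sb, field)
	return sb.String(), err
}
```

Calling compactString on three sorted lines with ids a, a, b and the field name "id" would return the first and third line. An in-memory hash map would instead have to hold every distinct key at once, which is the cost the external sort avoids.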

Strategies:

first    keep the first record seen for a key
last     keep the last record seen for a key
random   keep a uniformly random record (reservoir sampling over the group)
min      keep the record with the smallest -sort-key value
max      keep the record with the largest -sort-key value
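The strategies can be read as a selection function over one group of records that share a key. The following is a rough sketch under stated assumptions, not the command's implementation: the record type and the pick function are hypothetical, the sort key is assumed to be numeric (as with -numeric), and ties for min and max are resolved in favor of the earliest record, which the documentation does not specify. For simplicity the whole group is passed as a slice; a streaming version would apply the same reservoir step one record at a time.

```go
package main

import (
	"fmt"
	"math/rand"
)

// record pairs a raw NDJSON line with the numeric value of its sort key.
// Hypothetical type, used only for this illustration.
type record struct {
	Line    string
	SortVal float64
}

// pick chooses one record from a non-empty group according to strategy.
// rng is only used by the "random" strategy.
func pick(group []record, strategy string, rng *rand.Rand) (record, error) {
	if len(group) == 0 {
		return record{}, fmt.Errorf("empty group")
	}
	switch strategy {
	case "first":
		return group[0], nil
	case "last":
		return group[len(group)-1], nil
	case "random":
		// Reservoir of size one: the record at index i replaces the current
		// choice with probability 1/(i+1), so every record ends up equally likely.
		chosen := group[0]
		for i := 1; i < len(group); i++ {
			if rng.Intn(i+1) == 0 {
				chosen = group[i]
			}
		}
		return chosen, nil
	case "min", "max":
		best := group[0]
		for _, r := range group[1:] {
			smaller := strategy == "min" && r.SortVal < best.SortVal
			larger := strategy == "max" && r.SortVal > best.SortVal
			if smaller || larger {
				best = r // strict comparison: ties keep the earliest record (assumption)
			}
		}
		return best, nil
	}
	return record{}, fmt.Errorf("unknown strategy %q", strategy)
}
```

For the random strategy a caller would pass something like rand.New(rand.NewSource(seed)); the reservoir step needs no knowledge of the group size in advance, which is why it suits a streaming pass.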

Examples:

span-compact -key id -strategy last input.ndj
zstdcat big.ndj.zst | span-compact -key doi -strategy max \
    -sort-key indexed_at -numeric > out.ndj
