scholkit

v0.2.12
Published: Jul 11, 2025 License: MIT Imports: 0 Imported by: 0

README

scholkit

   ,   ,
  /////|
 ///// |
|~~~|  |
|===|  |
|j  |  |
| g |  |
|  s| /
|===|/
'---'

Scratch project, assorted utilities around scholarly metadata formats and tasks.

status: wip, api and cli not stable yet

Try

$ git clone https://github.com/miku/scholkit.git
$ cd scholkit
$ make

This builds several executables, all prefixed with sk. Each executable is designed to work standalone as far as possible, but they share configuration for common settings (e.g. directories).

$ curl -sL https://archive.org/download/arxiv-2024-02-15/arxiv-2024-02-15.xml.zst | \
    zstd -dc | \
    sk-convert -f arxiv

Tools

Conversions

We want conversions from various formats to one single format (e.g. release entities). Source formats include:

  • crossref
  • datacite
  • pubmed
  • arxiv
  • oaiscrape
  • openalex
  • dblp
  • and more

Target:

  • fatcat2 entities

For each format, try to find the smallest conversion unit, e.g. one record. Then add convenience layers on top, e.g. for streams.

No bulk conversion should take longer than about one hour (the slowest currently is openalex, with 250M records, which takes about 45 min).

Clustering

Create a "works" view from releases.

Misc

The sk-cat utility streams content from multiple URLs to stdout. It can help create single-file versions of larger datasets such as pubmed or openalex.

$ curl -s "https://www.gutenberg.org/browse/scores/top" | \
    grep -Eo "/ebooks/[0-9]+" | \
    awk '{print "https://gutenberg.org"$0".txt.utf-8"}' > top100.txt

$ sk-cat < top100.txt > top100books.txt

Notes

TODO

  • implement schema conversions and tests
  • add layer for daily harvests and capturing data on disk
  • cli to interact with the current files on disk
  • cli for basic stats
  • some simplistic index/query structure, e.g. to quickly find a record by id or the like

More:

  • map basic fields to fatcat release entities
  • map all fields to fatcat release entities
  • basic clustering algorithm

Documentation

Constants

const (
	FeedsDir     = "feeds"
	SnapshotsDir = "snapshots"
)

Variables

var (
	AppName = "scholkit"
	Version = "0.2.12"
)

Functions

This section is empty.

Types

This section is empty.

Directories

Path Synopsis
To write to files in a robust way we should:
attic
sk-oai-dctojsonl command
sk-oai-dctojsonl converts a stream of XML records, where each record is separated by a record separator "1E".
cmd
cdxfetch command
sk-cat command
sk-cat takes one or more links to (compressed) files and will stream their content to stdout.
sk-cdx command
sk-cluster command
todo: adjust code
sk-convert command
CLI to convert various metadata formats, mostly to fatcat entities.
sk-crossref-snapshot command
sk-crossref-snapshot creates a snapshot from a set of crossref records, as harvested.
sk-feed command
sk-feed retrieves various upstream data sources.
sk-id command
sk-norm command
TODO: string normalization cli tool
sk-oai-dctojsonl command
sk-oai-dctojsonl converts a stream of XML records, where each record is separated by a record separator "1E".
sk-oai-records command
sk-oai-records was used as a first step to go from concatenated metha OAI XML file (invalid, hard to parse) to properly separated XML documents (using ASCII record separator).
sk-snapshot command
sk-snapshot turns feeds into snapshots, for simplicity often with external tools.
Package dateutil provides interval handling.
exp
notes
journals command
Package parallel implements helpers for fast processing of line oriented inputs.
record
Package scan accepts a bufio.SplitFunc and generalizes batches to non-line oriented input, e.g.
schema
Package xflag adds an additional flag type Array for repeated string flags.
Package xmlstream implements a lightweight XML scanner on top of encoding/xml.
