dcdump

README

Datacite Dump Tool

As of Fall 2019 the DataCite API is a bit flaky: #237, #851, #188, #709, #897, #898.

This tool harvests a data dump via the API, as a stopgap until an official full dump becomes available.

This data was ingested into fatcat via fatcat_import.py in 01/2020.

Install and Build

You'll need the Go toolchain installed.

$ git clone https://git.archive.org/webgroup/dcdump.git
$ cd dcdump
$ make

Or install with the Go tool:

$ go install github.com/miku/dcdump/cmd/dcdump@latest

Usage

$ dcdump -h
Usage of dcdump:
  -d string
	directory, where to put harvested files (default ".")
  -debug
	only print intervals then exit
  -e value
	end date for harvest (default 2019-12-10)
  -i string
	[w]eekly, [d]aily, [h]ourly, [e]very minute (default "d")
  -l int
	upper limit for number of requests (default 16777216)
  -p string
	file prefix for harvested files (default "dcdump-")
  -s value
	start date for harvest (default 2018-01-01)
  -sleep duration
	backoff after HTTP error (default 5m0s)
  -version
	show version
  -w int
	parallel workers (approximate) (default 4)

Examples

The dcdump tool uses version 2 of the DataCite API. It queries by time interval and paginates with a cursor to circumvent the index deep-paging problem (as of 12/2019 a single query is limited to 10000 records, i.e. 400 pages x 25 records per page).
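
A hedged sketch of the kind of first-page link this might translate to, assuming the standard DataCite v2 query and cursor parameters (the exact query string dcdump builds is not shown here):

package main

import (
	"fmt"
	"net/url"
)

func main() {
	v := url.Values{}
	// one update-time slice; cursor paging sidesteps the 10000-record limit
	v.Set("query", "updated:[2019-10-01T00:00:00Z TO 2019-10-01T00:59:59Z]")
	v.Set("page[cursor]", "1")
	v.Set("page[size]", "100")
	fmt.Println("https://api.datacite.org/dois?" + v.Encode())
}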

To list just the intervals (as determined by the -i flag) without harvesting, use the -debug flag:

$ dcdump -i h -s 2019-10-01 -e 2019-10-02 -debug
2019-10-01 00:00:00 +0000 UTC -- 2019-10-01 00:59:59.999999999 +0000 UTC
2019-10-01 01:00:00 +0000 UTC -- 2019-10-01 01:59:59.999999999 +0000 UTC
2019-10-01 02:00:00 +0000 UTC -- 2019-10-01 02:59:59.999999999 +0000 UTC
2019-10-01 03:00:00 +0000 UTC -- 2019-10-01 03:59:59.999999999 +0000 UTC
2019-10-01 04:00:00 +0000 UTC -- 2019-10-01 04:59:59.999999999 +0000 UTC
2019-10-01 05:00:00 +0000 UTC -- 2019-10-01 05:59:59.999999999 +0000 UTC
2019-10-01 06:00:00 +0000 UTC -- 2019-10-01 06:59:59.999999999 +0000 UTC
2019-10-01 07:00:00 +0000 UTC -- 2019-10-01 07:59:59.999999999 +0000 UTC
2019-10-01 08:00:00 +0000 UTC -- 2019-10-01 08:59:59.999999999 +0000 UTC
2019-10-01 09:00:00 +0000 UTC -- 2019-10-01 09:59:59.999999999 +0000 UTC
2019-10-01 10:00:00 +0000 UTC -- 2019-10-01 10:59:59.999999999 +0000 UTC
2019-10-01 11:00:00 +0000 UTC -- 2019-10-01 11:59:59.999999999 +0000 UTC
2019-10-01 12:00:00 +0000 UTC -- 2019-10-01 12:59:59.999999999 +0000 UTC
2019-10-01 13:00:00 +0000 UTC -- 2019-10-01 13:59:59.999999999 +0000 UTC
2019-10-01 14:00:00 +0000 UTC -- 2019-10-01 14:59:59.999999999 +0000 UTC
2019-10-01 15:00:00 +0000 UTC -- 2019-10-01 15:59:59.999999999 +0000 UTC
2019-10-01 16:00:00 +0000 UTC -- 2019-10-01 16:59:59.999999999 +0000 UTC
2019-10-01 17:00:00 +0000 UTC -- 2019-10-01 17:59:59.999999999 +0000 UTC
2019-10-01 18:00:00 +0000 UTC -- 2019-10-01 18:59:59.999999999 +0000 UTC
2019-10-01 19:00:00 +0000 UTC -- 2019-10-01 19:59:59.999999999 +0000 UTC
2019-10-01 20:00:00 +0000 UTC -- 2019-10-01 20:59:59.999999999 +0000 UTC
2019-10-01 21:00:00 +0000 UTC -- 2019-10-01 21:59:59.999999999 +0000 UTC
2019-10-01 22:00:00 +0000 UTC -- 2019-10-01 22:59:59.999999999 +0000 UTC
2019-10-01 23:00:00 +0000 UTC -- 2019-10-01 23:59:59.999999999 +0000 UTC
2019-10-02 00:00:00 +0000 UTC -- 2019-10-02 00:59:59.999999999 +0000 UTC
INFO[0000] 25 intervals
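
These windows are plain slices of the requested range; a minimal sketch reproducing the hourly listing above (dcdump's actual slicing code may differ in detail):

package main

import (
	"fmt"
	"time"
)

func main() {
	// slice [start, end) into hourly windows, each ending one
	// nanosecond before the next window starts
	start := time.Date(2019, 10, 1, 0, 0, 0, 0, time.UTC)
	end := time.Date(2019, 10, 2, 1, 0, 0, 0, time.UTC)
	for t := start; t.Before(end); t = t.Add(time.Hour) {
		fmt.Printf("%v -- %v\n", t, t.Add(time.Hour-time.Nanosecond))
	}
}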

Start and end dates are parsed relatively flexibly, for example (minute slices for a single day):

$ dcdump -s 2019-05-01 -e '2019-05-01 23:59:59' -i e -debug
2019-05-01 00:00:00 +0000 UTC -- 2019-05-01 00:00:59.999999999 +0000 UTC
...
2019-05-01 23:59:00 +0000 UTC -- 2019-05-01 23:59:59.999999999 +0000 UTC
INFO[0000] 1440 intervals
...

Create a temporary directory, so the harvested files do not clutter the current directory.

$ mkdir tmp

Start harvesting (minute intervals, into tmp, with 2 workers).

$ dcdump -i e -d tmp -w 2

The time windows are not adjusted dynamically. Worse, even with a low-profile harvest (two workers, backoffs, retries) and minute intervals, the harvest can still stall (e.g. with a 403 or 500).
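
The -sleep flag points at the retry behavior: after an HTTP error, the tool backs off before trying again. A hedged sketch of such a loop (the helper name and retry count are illustrative, not dcdump's actual code):

package main

import (
	"fmt"
	"net/http"
	"time"
)

// fetchWithBackoff is a hypothetical helper: on an HTTP error, wait
// for the given duration and retry a limited number of times.
func fetchWithBackoff(link string, sleep time.Duration, retries int) (*http.Response, error) {
	var lastErr error
	for i := 0; i < retries; i++ {
		resp, err := http.Get(link)
		if err == nil && resp.StatusCode < 400 {
			return resp, nil
		}
		if err != nil {
			lastErr = err
		} else {
			resp.Body.Close()
			lastErr = fmt.Errorf("got HTTP %d", resp.StatusCode)
		}
		time.Sleep(sleep)
	}
	return nil, lastErr
}

func main() {
	resp, err := fetchWithBackoff("https://api.datacite.org/dois", 5*time.Minute, 3)
	if err != nil {
		fmt.Println("giving up:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}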

If a specific time window fails repeatedly, you can manually touch the file, e.g.

$ touch tmp/dcdump-20190801114700-20190801114759.ndjson

The dcdump tool checks for the existence of the file before harvesting, which makes it possible to skip unfetchable slices.
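
A minimal sketch of that existence check (the helper name is hypothetical):

package main

import (
	"fmt"
	"os"
)

// alreadyDone reports whether the per-slice output file already exists,
// e.g. from a previous run or a manual touch.
func alreadyDone(filename string) bool {
	_, err := os.Stat(filename)
	return err == nil
}

func main() {
	name := "tmp/dcdump-20190801114700-20190801114759.ndjson"
	if alreadyDone(name) {
		fmt.Println("skipping", name)
		return
	}
	fmt.Println("would harvest", name)
}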

After successful runs, concatenate the data to get a single newline-delimited dump of DataCite.

$ cat tmp/*ndjson | sort -u > datacite.ndjson

Again, this is ugly, but it should all become obsolete as soon as a public data dump is available.

Duration

One data point on duration: a complete harvest with minute intervals took about 80 hours.

$ dcdump -version
dcdump 5ae0556 2020-01-21T16:25:10Z

$ dcdump -i e
...
INFO[294683] 1075178 date slices succeeded

real    4911m23.343s
user    930m54.034s
sys     173m7.383s

After 80 hours, the total size amounted to about 78 GB.

Archive Items

Initial snapshot

A DataCite snapshot from 11/2019 is available as part of the Bulk Bibliographic Metadata collection at Datacite Dump 20191122.

18,210,075 items, 72 GB uncompressed.

Updates

$ curl -L https://archive.org/download/datacite_dump_20211022/datacite_dump_20211022.json.zst | \
    zstdcat -c -T0 | jq -rc '.id'

10.1001/jama.289.8.989
10.1001/jama.293.14.1723-a
10.1001/jamainternmed.2013.9245
10.1001/jamaneurol.2015.4885
10.1002/2014gb004975
10.1002/2014gl061020
10.1002/2014jc009965
10.1002/2014jd022411
10.1002/2015gb005314
10.1002/2015gl065259
...
$ xz -T0 -cd datacite.ndjson.xz | wc
18210075 2562859030 72664858976

$ xz -T0 -cd datacite.ndjson.xz | sha1sum
6fa3bbb1fe07b42e021be32126617b7924f119fb  -


Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func HarvestBatch

func HarvestBatch(link string, maxRequests int, sleep time.Duration) (string, error)

HarvestBatch takes a link (like https://is.gd/0pwu5c) and follows subsequent pages, writing everything into a tempfile. Returns the path to the temporary file and an error. Fails if the HTTP status is >= 400; has limited retry capabilities.
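
A minimal sketch of calling HarvestBatch, assuming the import path matches the module path from the install instructions; the link is a placeholder for a real first-page API URL, not the shortened example above:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/miku/dcdump"
)

func main() {
	// placeholder first-page link for one time slice
	link := "https://api.datacite.org/dois?page[cursor]=1&page[size]=100"
	path, err := dcdump.HarvestBatch(link, 1000, 5*time.Minute)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("batch written to temporary file:", path)
}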

Types

type DOIResponse

type DOIResponse struct {
	Data []struct {
		Attributes    interface{} `json:"attributes"`
		Id            string      `json:"id"`
		Relationships struct {
			Client struct {
				Data struct {
					Id   string `json:"id"`
					Type string `json:"type"`
				} `json:"data"`
			} `json:"client"`
		} `json:"relationships"`
		Type string `json:"type"`
	} `json:"data"`
	Included []struct {
		Attributes struct {
			AlternateName interface{}   `json:"alternateName"`
			ClientType    string        `json:"clientType"`
			ContactEmail  string        `json:"contactEmail"`
			Created       string        `json:"created"`
			Description   interface{}   `json:"description"`
			Domains       string        `json:"domains"`
			HasPassword   bool          `json:"hasPassword"`
			IsActive      bool          `json:"isActive"`
			Issn          interface{}   `json:"issn"`
			Language      []interface{} `json:"language"`
			Name          string        `json:"name"`
			Opendoar      interface{}   `json:"opendoar"`
			Re3data       interface{}   `json:"re3data"`
			Symbol        string        `json:"symbol"`
			Updated       string        `json:"updated"`
			Url           interface{}   `json:"url"`
			Year          int64         `json:"year"`
		} `json:"attributes"`
		Id            string `json:"id"`
		Relationships struct {
			Prefixes struct {
				Data []struct {
					Id   string `json:"id"`
					Type string `json:"type"`
				} `json:"data"`
			} `json:"prefixes"`
			Provider struct {
				Data struct {
					Id   string `json:"id"`
					Type string `json:"type"`
				} `json:"data"`
			} `json:"provider"`
		} `json:"relationships"`
		Type string `json:"type"`
	} `json:"included"`
	Links struct {
		Next string `json:"next"`
		Self string `json:"self"`
	} `json:"links"`
	Meta struct {
		Affiliations []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"affiliations"`
		Certificates []interface{} `json:"certificates"`
		Clients      []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"clients"`
		Created []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"created"`
		LinkChecksCitationDoi  int64 `json:"linkChecksCitationDoi"`
		LinkChecksDcIdentifier int64 `json:"linkChecksDcIdentifier"`
		LinkChecksSchemaOrgId  int64 `json:"linkChecksSchemaOrgId"`
		LinkChecksStatus       []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"linkChecksStatus"`
		LinksChecked       int64 `json:"linksChecked"`
		LinksWithSchemaOrg []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"linksWithSchemaOrg"`
		Page     int64 `json:"page"`
		Prefixes []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"prefixes"`
		Providers []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"providers"`
		Registered []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"registered"`
		ResourceTypes []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"resourceTypes"`
		SchemaVersions []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"schemaVersions"`
		Sources []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"sources"`
		States []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"states"`
		Subjects []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"subjects"`
		Total      int64 `json:"total"`
		TotalPages int64 `json:"totalPages"`
	} `json:"meta"`
}

DOIResponse is the https://api.datacite.org/dois endpoint response. TODO(martin): Sort out the interface{} fields, if necessary.
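
A minimal sketch of decoding one page of the endpoint into DOIResponse and reading the next-page link (assumes the import path github.com/miku/dcdump; that Links.Next is empty on the last page is an assumption):

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"

	"github.com/miku/dcdump"
)

func main() {
	resp, err := http.Get("https://api.datacite.org/dois?page[size]=25")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var dr dcdump.DOIResponse
	if err := json.NewDecoder(resp.Body).Decode(&dr); err != nil {
		log.Fatal(err)
	}
	for _, doc := range dr.Data {
		fmt.Println(doc.Id)
	}
	// follow dr.Links.Next to fetch subsequent pages
	fmt.Println("next page:", dr.Links.Next)
}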

Directories

Path	Synopsis
cmd/dcdump (command)	Tool to fetch a full list of DOIs from the datacite.org API, because as of Fall 2019 a full dump is not yet available (https://git.io/Je6bs, https://git.io/Je6Dg).
dateutil	Package dateutil provides a custom flag for dates.
