dcdump

README

Datacite Dump Tool

As of Fall 2019 the DataCite API is a bit flaky: #237, #851, #188, #709, #897, #898.

This tool harvests a data dump via the API, as a stopgap until an official full dump becomes available.

This data was ingested into fatcat via fatcat_import.py in 01/2020.

Install and Build

You'll need the Go toolchain installed.

$ git clone https://git.archive.org/webgroup/dcdump.git
$ cd dcdump
$ make

Or install with the Go tool:

$ go install github.com/miku/dcdump/cmd/dcdump@latest

Usage

$ dcdump -h
Usage of dcdump:
  -d string
	directory, where to put harvested files (default ".")
  -debug
	only print intervals then exit
  -e value
	end date for harvest (default 2019-12-10)
  -i string
	[w]eekly, [d]aily, [h]ourly, [e]very minute (default "d")
  -l int
	upper limit for number of requests (default 16777216)
  -p string
	file prefix for harvested files (default "dcdump-")
  -s value
	start date for harvest (default 2018-01-01)
  -sleep duration
	backoff after HTTP error (default 5m0s)
  -version
	show version
  -w int
	parallel workers (approximate) (default 4)

Examples

The dcdump tool uses version 2 of the DataCite API. It queries by time interval and paginates with a cursor to circumvent the index deep-paging problem (as of 12/2019 a single query is limited to 10000 records, i.e. 400 pages x 25 records per page).
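
A hedged sketch of the kind of first-page link this might translate to, assuming the standard DataCite v2 query and cursor parameters (the exact query string dcdump builds is not shown here):

package main

import (
	"fmt"
	"net/url"
)

func main() {
	v := url.Values{}
	// one update-time slice; cursor paging sidesteps the 10000-record limit
	v.Set("query", "updated:[2019-10-01T00:00:00Z TO 2019-10-01T00:59:59Z]")
	v.Set("page[cursor]", "1")
	v.Set("page[size]", "100")
	fmt.Println("https://api.datacite.org/dois?" + v.Encode())
}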

To list just the intervals (as determined by the -i flag) without harvesting, use the -debug flag:

$ dcdump -i h -s 2019-10-01 -e 2019-10-02 -debug
2019-10-01 00:00:00 +0000 UTC -- 2019-10-01 00:59:59.999999999 +0000 UTC
2019-10-01 01:00:00 +0000 UTC -- 2019-10-01 01:59:59.999999999 +0000 UTC
2019-10-01 02:00:00 +0000 UTC -- 2019-10-01 02:59:59.999999999 +0000 UTC
2019-10-01 03:00:00 +0000 UTC -- 2019-10-01 03:59:59.999999999 +0000 UTC
2019-10-01 04:00:00 +0000 UTC -- 2019-10-01 04:59:59.999999999 +0000 UTC
2019-10-01 05:00:00 +0000 UTC -- 2019-10-01 05:59:59.999999999 +0000 UTC
2019-10-01 06:00:00 +0000 UTC -- 2019-10-01 06:59:59.999999999 +0000 UTC
2019-10-01 07:00:00 +0000 UTC -- 2019-10-01 07:59:59.999999999 +0000 UTC
2019-10-01 08:00:00 +0000 UTC -- 2019-10-01 08:59:59.999999999 +0000 UTC
2019-10-01 09:00:00 +0000 UTC -- 2019-10-01 09:59:59.999999999 +0000 UTC
2019-10-01 10:00:00 +0000 UTC -- 2019-10-01 10:59:59.999999999 +0000 UTC
2019-10-01 11:00:00 +0000 UTC -- 2019-10-01 11:59:59.999999999 +0000 UTC
2019-10-01 12:00:00 +0000 UTC -- 2019-10-01 12:59:59.999999999 +0000 UTC
2019-10-01 13:00:00 +0000 UTC -- 2019-10-01 13:59:59.999999999 +0000 UTC
2019-10-01 14:00:00 +0000 UTC -- 2019-10-01 14:59:59.999999999 +0000 UTC
2019-10-01 15:00:00 +0000 UTC -- 2019-10-01 15:59:59.999999999 +0000 UTC
2019-10-01 16:00:00 +0000 UTC -- 2019-10-01 16:59:59.999999999 +0000 UTC
2019-10-01 17:00:00 +0000 UTC -- 2019-10-01 17:59:59.999999999 +0000 UTC
2019-10-01 18:00:00 +0000 UTC -- 2019-10-01 18:59:59.999999999 +0000 UTC
2019-10-01 19:00:00 +0000 UTC -- 2019-10-01 19:59:59.999999999 +0000 UTC
2019-10-01 20:00:00 +0000 UTC -- 2019-10-01 20:59:59.999999999 +0000 UTC
2019-10-01 21:00:00 +0000 UTC -- 2019-10-01 21:59:59.999999999 +0000 UTC
2019-10-01 22:00:00 +0000 UTC -- 2019-10-01 22:59:59.999999999 +0000 UTC
2019-10-01 23:00:00 +0000 UTC -- 2019-10-01 23:59:59.999999999 +0000 UTC
2019-10-02 00:00:00 +0000 UTC -- 2019-10-02 00:59:59.999999999 +0000 UTC
INFO[0000] 25 intervals
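
These windows are plain slices of the requested range; a minimal sketch reproducing the hourly listing above (dcdump's actual slicing code may differ in detail):

package main

import (
	"fmt"
	"time"
)

func main() {
	// slice [start, end) into hourly windows, each ending one
	// nanosecond before the next window starts
	start := time.Date(2019, 10, 1, 0, 0, 0, 0, time.UTC)
	end := time.Date(2019, 10, 2, 1, 0, 0, 0, time.UTC)
	for t := start; t.Before(end); t = t.Add(time.Hour) {
		fmt.Printf("%v -- %v\n", t, t.Add(time.Hour-time.Nanosecond))
	}
}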

Start and end dates are parsed relatively flexibly, for example (minute slices for a single day):

$ dcdump -s 2019-05-01 -e '2019-05-01 23:59:59' -i e -debug
2019-05-01 00:00:00 +0000 UTC -- 2019-05-01 00:00:59.999999999 +0000 UTC
...
2019-05-01 23:59:00 +0000 UTC -- 2019-05-01 23:59:59.999999999 +0000 UTC
INFO[0000] 1440 intervals
...

Create a temporary directory, so the harvested files do not clutter the current directory.

$ mkdir tmp

Start harvesting (minute intervals, into tmp, with 2 workers).

$ dcdump -i e -d tmp -w 2

The time windows are not adjusted dynamically. Worse, even with a low-profile harvest (two workers, backoffs, retries) and minute intervals, the harvest can still stall (e.g. with a 403 or 500).
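
The -sleep flag points at the retry behavior: after an HTTP error, the tool backs off before trying again. A hedged sketch of such a loop (the helper name and retry count are illustrative, not dcdump's actual code):

package main

import (
	"fmt"
	"net/http"
	"time"
)

// fetchWithBackoff is a hypothetical helper: on an HTTP error, wait
// for the given duration and retry a limited number of times.
func fetchWithBackoff(link string, sleep time.Duration, retries int) (*http.Response, error) {
	var lastErr error
	for i := 0; i < retries; i++ {
		resp, err := http.Get(link)
		if err == nil && resp.StatusCode < 400 {
			return resp, nil
		}
		if err != nil {
			lastErr = err
		} else {
			resp.Body.Close()
			lastErr = fmt.Errorf("got HTTP %d", resp.StatusCode)
		}
		time.Sleep(sleep)
	}
	return nil, lastErr
}

func main() {
	resp, err := fetchWithBackoff("https://api.datacite.org/dois", 5*time.Minute, 3)
	if err != nil {
		fmt.Println("giving up:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}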

If a specific time window fails repeatedly, you can manually touch the file, e.g.

$ touch tmp/dcdump-20190801114700-20190801114759.ndjson

The dcdump tool checks for the existence of the file before harvesting, which makes it possible to skip unfetchable slices.
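
A minimal sketch of that existence check (the helper name is hypothetical):

package main

import (
	"fmt"
	"os"
)

// alreadyDone reports whether the per-slice output file already exists,
// e.g. from a previous run or a manual touch.
func alreadyDone(filename string) bool {
	_, err := os.Stat(filename)
	return err == nil
}

func main() {
	name := "tmp/dcdump-20190801114700-20190801114759.ndjson"
	if alreadyDone(name) {
		fmt.Println("skipping", name)
		return
	}
	fmt.Println("would harvest", name)
}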

After successful runs, concatenate the data to get a single newline-delimited dump of DataCite.

$ cat tmp/*ndjson | sort -u > datacite.ndjson

Again, this is ugly, but it should all become obsolete as soon as a public data dump is available.

Duration

One data point on duration: a complete harvest with minute intervals took about 80 hours.

$ dcdump -version
dcdump 5ae0556 2020-01-21T16:25:10Z

$ dcdump -i e
...
INFO[294683] 1075178 date slices succeeded

real    4911m23.343s
user    930m54.034s
sys     173m7.383s

After 80 hours, the total size amounted to about 78 GB.

Archive Items

Initial snapshot

A DataCite snapshot from 11/2019 is available as part of the Bulk Bibliographic Metadata collection at Datacite Dump 20191122.

18,210,075 items, 72 GB uncompressed.

Updates

$ curl -L https://archive.org/download/datacite_dump_20211022/datacite_dump_20211022.json.zst | \
    zstdcat -c -T0 | jq -rc '.id'

10.1001/jama.289.8.989
10.1001/jama.293.14.1723-a
10.1001/jamainternmed.2013.9245
10.1001/jamaneurol.2015.4885
10.1002/2014gb004975
10.1002/2014gl061020
10.1002/2014jc009965
10.1002/2014jd022411
10.1002/2015gb005314
10.1002/2015gl065259
...
$ xz -T0 -cd datacite.ndjson.xz | wc
18210075 2562859030 72664858976

$ xz -T0 -cd datacite.ndjson.xz | sha1sum
6fa3bbb1fe07b42e021be32126617b7924f119fb  -


Documentation

Constants

This section is empty.

Variables

This section is empty.

Functions

func HarvestBatch

func HarvestBatch(link string, maxRequests int, sleep time.Duration) (string, error)

HarvestBatch takes a link (like https://is.gd/0pwu5c) and follows subsequent pages, writing everything into a tempfile. Returns the path to the temporary file and an error. Fails if the HTTP status is >= 400; has limited retry capabilities.
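
A minimal sketch of calling HarvestBatch, assuming the import path matches the module path from the install instructions; the link is a placeholder for a real first-page API URL, not the shortened example above:

package main

import (
	"fmt"
	"log"
	"time"

	"github.com/miku/dcdump"
)

func main() {
	// placeholder first-page link for one time slice
	link := "https://api.datacite.org/dois?page[cursor]=1&page[size]=100"
	path, err := dcdump.HarvestBatch(link, 1000, 5*time.Minute)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("batch written to temporary file:", path)
}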

Types

type DOIResponse

type DOIResponse struct {
	Data []struct {
		Attributes    interface{} `json:"attributes"`
		Id            string      `json:"id"`
		Relationships struct {
			Client struct {
				Data struct {
					Id   string `json:"id"`
					Type string `json:"type"`
				} `json:"data"`
			} `json:"client"`
		} `json:"relationships"`
		Type string `json:"type"`
	} `json:"data"`
	Included []struct {
		Attributes struct {
			AlternateName interface{}   `json:"alternateName"`
			ClientType    string        `json:"clientType"`
			ContactEmail  string        `json:"contactEmail"`
			Created       string        `json:"created"`
			Description   interface{}   `json:"description"`
			Domains       string        `json:"domains"`
			HasPassword   bool          `json:"hasPassword"`
			IsActive      bool          `json:"isActive"`
			Issn          interface{}   `json:"issn"`
			Language      []interface{} `json:"language"`
			Name          string        `json:"name"`
			Opendoar      interface{}   `json:"opendoar"`
			Re3data       interface{}   `json:"re3data"`
			Symbol        string        `json:"symbol"`
			Updated       string        `json:"updated"`
			Url           interface{}   `json:"url"`
			Year          int64         `json:"year"`
		} `json:"attributes"`
		Id            string `json:"id"`
		Relationships struct {
			Prefixes struct {
				Data []struct {
					Id   string `json:"id"`
					Type string `json:"type"`
				} `json:"data"`
			} `json:"prefixes"`
			Provider struct {
				Data struct {
					Id   string `json:"id"`
					Type string `json:"type"`
				} `json:"data"`
			} `json:"provider"`
		} `json:"relationships"`
		Type string `json:"type"`
	} `json:"included"`
	Links struct {
		Next string `json:"next"`
		Self string `json:"self"`
	} `json:"links"`
	Meta struct {
		Affiliations []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"affiliations"`
		Certificates []interface{} `json:"certificates"`
		Clients      []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"clients"`
		Created []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"created"`
		LinkChecksCitationDoi  int64 `json:"linkChecksCitationDoi"`
		LinkChecksDcIdentifier int64 `json:"linkChecksDcIdentifier"`
		LinkChecksSchemaOrgId  int64 `json:"linkChecksSchemaOrgId"`
		LinkChecksStatus       []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"linkChecksStatus"`
		LinksChecked       int64 `json:"linksChecked"`
		LinksWithSchemaOrg []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"linksWithSchemaOrg"`
		Page     int64 `json:"page"`
		Prefixes []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"prefixes"`
		Providers []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"providers"`
		Registered []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"registered"`
		ResourceTypes []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"resourceTypes"`
		SchemaVersions []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"schemaVersions"`
		Sources []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"sources"`
		States []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"states"`
		Subjects []struct {
			Count int64  `json:"count"`
			Id    string `json:"id"`
			Title string `json:"title"`
		} `json:"subjects"`
		Total      int64 `json:"total"`
		TotalPages int64 `json:"totalPages"`
	} `json:"meta"`
}

DOIResponse is the https://api.datacite.org/dois endpoint response. TODO(martin): Sort out the interface{} fields, if necessary.
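
A minimal sketch of decoding one page of the endpoint into DOIResponse and reading the next-page link (assumes the import path github.com/miku/dcdump; that Links.Next is empty on the last page is an assumption):

package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/http"

	"github.com/miku/dcdump"
)

func main() {
	resp, err := http.Get("https://api.datacite.org/dois?page[size]=25")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	var dr dcdump.DOIResponse
	if err := json.NewDecoder(resp.Body).Decode(&dr); err != nil {
		log.Fatal(err)
	}
	for _, doc := range dr.Data {
		fmt.Println(doc.Id)
	}
	// follow dr.Links.Next to fetch subsequent pages
	fmt.Println("next page:", dr.Links.Next)
}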

Directories

Path	Synopsis
cmd/dcdump (command)	Tool to fetch a full list of DOIs from the datacite.org API, because as of Fall 2019 a full dump is not yet available (https://git.io/Je6bs, https://git.io/Je6Dg).
dateutil	Package dateutil provides a custom flag for dates.
