span

package module v0.1.54
Published: Nov 13, 2015 License: GPL-3.0 Imports: 12 Imported by: 3

README

Span

Span formats.


The span command line tools aim at high-performance, versatile document conversions between a series of metadata formats.

The goal is to move quickly between input formats, such as stored API responses, XML-ish formats or line-delimited JSON bibliographic data, and output formats, such as the finc intermediate format or formats that can be imported directly into SOLR or Elasticsearch.

As a non-goal, the span tools do not care how you obtain your input data. The tools expect a single input file and produce a single output file (or read from stdin and write to stdout, respectively).


Why in Go?

Linux shell scripts have no native XML or JSON support, Python is a bit too slow for the casual processing of 100M or more records, Java is a bit too verbose - which is why we chose Go. Go comes with XML and JSON support in the standard library, nice concurrency primitives and simple single static-binary deployments.


Install with

$ go get github.com/miku/span/cmd/...

or via deb or rpm packages.

Formats

A toolkit approach

  • span-import, anything to intermediate schema
  • span-export, intermediate schema to anything

The span-import tool should require minimal external information (no holdings file, etc.) and be mainly concerned with the transformation of fancy source formats into the catch-all intermediate schema.

The span-export tool may include external sources to create output, e.g. holdings files.

Usage

$ span-import -h
Usage of span-import:
  -cpuprofile="": write cpu profile to file
  -i="": input format
  -list=false: list formats
  -log="": if given log to file
  -members="": path to LDJ file, one member per line
  -v=false: prints current program version
  -verbose=false: more output
  -w=4: number of workers

$ span-export -h
Usage of span-export:
  -any=[]: ISIL
  -b=20000: batch size
  -cpuprofile="": write cpu profile to file
  -dump=false: dump filters and exit
  -f=[]: ISIL:/path/to/ovid.xml
  -l=[]: ISIL:/path/to/list.txt
  -list=false: list output formats
  -o="solr413": output format
  -skip=false: skip errors
  -source=[]: ISIL:SID
  -v=false: prints current program version
  -w=4: number of workers

Examples

List available formats:

$ span-import -list
doaj
genios
crossref
degruyter
jstor

Import crossref LDJ (with cached members API responses) or DeGruyter XML (preprocessed into a single file):

$ span-import -i crossref -members members.ldj crossref.ldj > crossref.is.ldj
$ span-import -i jats degruyter.ldj > degruyter.is.ldj

Concat for convenience:

$ cat crossref.is.ldj degruyter.is.ldj > ai.is.ldj

Export intermediate schema records to a memcache server with memcldj:

$ memcldj ai.is.ldj

Export to finc 1.3 SOLR 4 schema:

$ span-export -o solr413 -f DE-14:DE-14.xml -f DE-15:DE-15.xml ai.is.ldj > ai.ldj

The exported ai.ldj contains all aggregated index records and incorporates all holdings information. It can be indexed quickly with solrbulk:

$ solrbulk ai.ldj

Adding new sources

This is a work in progress; further simplification is planned.

For the moment, a new data source has to implement the span.Source interface:

// Source can emit records given a reader. The channel is of type []Importer,
// to allow the source to send objects over the channel in batches for
// performance (1000 x 1000 docs vs 1000000 x 1 doc).
type Source interface {
        Iterate(io.Reader) (<-chan []Importer, error)
}

Channels in APIs might not be optimal, though we are dealing with a kind of unbounded stream here.

Additionally, the emitted objects must implement span.Importer or span.Batcher, which carry the transformation business logic:

// Importer objects can be converted into an intermediate schema.
type Importer interface {
        ToIntermediateSchema() (*finc.IntermediateSchema, error)
}
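The transformation logic lives in ToIntermediateSchema. A hypothetical implementation might look like the following; note that finc.IntermediateSchema is replaced by a local two-field stand-in, and the article type and its fields are invented for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-in for finc.IntermediateSchema.
type IntermediateSchema struct {
	RecordID string
	Title    string
}

// article is a hypothetical source record.
type article struct {
	DOI   string
	Title string
}

// ToIntermediateSchema maps source fields onto the catch-all schema and
// rejects records that cannot be converted.
func (a article) ToIntermediateSchema() (*IntermediateSchema, error) {
	if a.DOI == "" {
		return nil, errors.New("record without DOI")
	}
	return &IntermediateSchema{RecordID: "ai-" + a.DOI, Title: a.Title}, nil
}

func main() {
	is, err := article{DOI: "10.1000/x", Title: "On Spans"}.ToIntermediateSchema()
	fmt.Println(is.RecordID, err)
}
```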

The exporters need to implement the finc.ExportSchema interface:

// ExportSchema encapsulate an export flavour. This will most likely be a
// struct with fields and methods relevant to the exported format. For the
// moment we assume, the output is JSON. If formats other than JSON are
// requested, move the marshalling into this interface.
type ExportSchema interface {
        // Convert takes an intermediate schema record to export. Returns an
        // error, if conversion failed.
        Convert(IntermediateSchema) error
        // Attach takes a list of strings (here: ISILs) and attaches them to the
        // current record.
        Attach([]string)
}

Licence

GPL-3.0+.
TODO

  • maybe factor out importer interface (like exporter)
  • docs: add example files for each supported data format

A filtering pipeline.

The final processing step from an intermediate schema to an export format includes various decisions.

  • Should an ISIL be attached to a record or not?
  • Should a record be excluded due to an expired or deleted DOI?

Provide a middleware-ish processing interface (similar to flow, metafacture)?

pl := Pipeline{}
pl.Add(DOIFilter)
pl.Add(ISILAttacher)
pl.Add(QualityAssuranceTests)
pl.Add(Exporter)

err := pl.Run(input)

Done

  • decouple batching (performance) from record stream generation (content)
  • write wrappers around common inputs like XML, JSON, CSV ...

Documentation

Overview

Copyright 2015 by Leipzig University Library, http://ub.uni-leipzig.de
               by The Finc Authors, http://finc.info
               by Martin Czygan, <martin.czygan@uni-leipzig.de>

This file is part of some open source application.

Some open source application is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Some open source application is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

@license GPL-3.0+ <http://spdx.org/licenses/GPL-3.0+>


Index

Constants

const (
	// AppVersion of span package. Commandline tools will show this on -v.
	AppVersion = "0.1.54"
	// KeyLengthLimit is a limit imposed by memcached protocol, which is used
	// for blob storage as of June 2015. If we change the key value store,
	// this limit might become obsolete.
	KeyLengthLimit = 250
)

Variables

This section is empty.

Functions

func ByteSink

func ByteSink(w io.Writer, out chan []byte, done chan bool)

ByteSink is a fan in writer for a byte channel. A newline is appended after each object.

func DetectLang3

func DetectLang3(text string) (string, error)

DetectLang3 returns the best guess 3-letter language code for a given text.

func FromJSON

func FromJSON(r io.Reader, decoder JSONDecoderFunc) (chan []Importer, error)

FromJSON returns a channel of slices of importable objects with a default batch size of 20000 docs.

func FromJSONSize

func FromJSONSize(r io.Reader, decoder JSONDecoderFunc, size int) (chan []Importer, error)

FromJSONSize returns a channel of slices of importable values, given a reader, a decoder (for a single value) and the number of documents to batch. Important: Due to fan-out, input and output order will differ.

func FromXML

func FromXML(r io.Reader, name string, decoderFunc XMLDecoderFunc) (chan []Importer, error)

FromXML is like FromXMLSize, with a default batch size of 2000 XML documents.

func FromXMLSize

func FromXMLSize(r io.Reader, name string, decoderFunc XMLDecoderFunc, size int) (chan []Importer, error)

FromXMLSize returns a channel of importable document slices, given a reader over XML, the name of the XML start element, an XMLDecoderFunc callback that deserializes an XML snippet, and a batch size. TODO(miku): more idiomatic error handling, e.g. over an error channel.

func UnescapeTrim

func UnescapeTrim(s string) string

UnescapeTrim unescapes HTML character references and trims the space of a given string.
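The documented behavior can be reproduced with the standard library; this local sketch (lowercased to distinguish it from the package function) shows the expected effect:

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// unescapeTrim mirrors the documented behavior of span.UnescapeTrim:
// resolve HTML character references, then trim surrounding whitespace.
func unescapeTrim(s string) string {
	return strings.TrimSpace(html.UnescapeString(s))
}

func main() {
	fmt.Printf("%q\n", unescapeTrim("  War &amp; Peace  "))
}
```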

Types

type Importer

type Importer interface {
	ToIntermediateSchema() (*finc.IntermediateSchema, error)
}

Importer objects can be converted into an intermediate schema.

type JSONDecoderFunc

type JSONDecoderFunc func(s string) (Importer, error)

JSONDecoderFunc turns a string into a single importable object.
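A decoder of this shape might look like this; doajRecord and decodeDOAJ are invented, and interface{} stands in for span.Importer so the sketch is self-contained:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// doajRecord is a hypothetical line format with a single field.
type doajRecord struct {
	ID string `json:"id"`
}

// decodeDOAJ has the shape of a JSONDecoderFunc: it turns one line of an
// LDJ file into a single importable object.
func decodeDOAJ(s string) (interface{}, error) {
	var rec doajRecord
	if err := json.Unmarshal([]byte(s), &rec); err != nil {
		return nil, err
	}
	return rec, nil
}

func main() {
	v, err := decodeDOAJ(`{"id": "abc"}`)
	fmt.Println(v.(doajRecord).ID, err)
}
```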

type Skip

type Skip struct {
	Reason string
}

Skip marks records to skip.

func (Skip) Error

func (s Skip) Error() string

Error returns the reason for skipping.
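Since Skip implements the error interface, a converter can return it to distinguish skippable records from hard failures; the convert function below is a hypothetical caller, only the Skip type itself comes from the package:

```go
package main

import (
	"fmt"
)

// Skip, as in the span package: an error value marking records that
// should be skipped rather than treated as hard failures.
type Skip struct {
	Reason string
}

// Error returns the reason for skipping.
func (s Skip) Error() string { return s.Reason }

// convert is a hypothetical conversion step that skips deleted records.
func convert(deleted bool) error {
	if deleted {
		return Skip{Reason: "record deleted upstream"}
	}
	return nil
}

func main() {
	err := convert(true)
	// A type assertion lets callers treat skips differently from errors.
	if s, ok := err.(Skip); ok {
		fmt.Println("skipping:", s.Reason)
	}
}
```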

type Source

type Source interface {
	Iterate(io.Reader) (<-chan []Importer, error)
}

Source can emit records given a reader. The channel is of type []Importer, to allow the source to send objects over the channel in batches for performance (1000 x 1000 docs vs 1000000 x 1 doc).

type XMLDecoderFunc

type XMLDecoderFunc func(*xml.Decoder, xml.StartElement) (Importer, error)

XMLDecoderFunc returns an importable document, given an XML decoder and a start element.

Directories

Path Synopsis
cmd
span-export command
span-gh-dump command
span-import command
span-is command
Package holdings contains wrappers for holding files.
