span

package module v0.1.54
Published: Nov 13, 2015 License: GPL-3.0 Imports: 12 Imported by: 3

README

Span

Span formats.


The span command line tools aim at high-performance, versatile document conversions between a series of metadata formats.

The goal is to move quickly between input formats, such as stored API responses, XML-ish formats or line-delimited JSON bibliographic data, and output formats, such as the finc intermediate format or formats that can be imported directly into SOLR or Elasticsearch.

As a non-goal, the span tools do not care how you obtain your input data. The tools expect a single input file and produce a single output file (or read from stdin and write to stdout, respectively).


Why in Go?

Linux shell scripts have no native XML or JSON support, Python is a bit too slow for the casual processing of 100M or more records, Java is a bit too verbose - which is why we chose Go. Go comes with XML and JSON support in the standard library, nice concurrency primitives and simple single static-binary deployments.


Install with

$ go get github.com/miku/span/cmd/...

or via deb or rpm packages.

Formats

A toolkit approach

  • span-import, anything to intermediate schema
  • span-export, intermediate schema to anything

The span-import tool should require minimal external information (no holdings file, etc.) and be mainly concerned with the transformation of fancy source formats into the catch-all intermediate schema.

The span-export tool may include external sources to create output, e.g. holdings files.

Usage

$ span-import -h
Usage of span-import:
  -cpuprofile="": write cpu profile to file
  -i="": input format
  -list=false: list formats
  -log="": if given log to file
  -members="": path to LDJ file, one member per line
  -v=false: prints current program version
  -verbose=false: more output
  -w=4: number of workers

$ span-export -h
Usage of span-export:
  -any=[]: ISIL
  -b=20000: batch size
  -cpuprofile="": write cpu profile to file
  -dump=false: dump filters and exit
  -f=[]: ISIL:/path/to/ovid.xml
  -l=[]: ISIL:/path/to/list.txt
  -list=false: list output formats
  -o="solr413": output format
  -skip=false: skip errors
  -source=[]: ISIL:SID
  -v=false: prints current program version
  -w=4: number of workers

Examples

List available formats:

$ span-import -list
doaj
genios
crossref
degruyter
jstor

Import crossref LDJ (with cached members API responses) or DeGruyter XML (preprocessed into a single file):

$ span-import -i crossref -members members.ldj crossref.ldj > crossref.is.ldj
$ span-import -i jats degruyter.ldj > degruyter.is.ldj

Concat for convenience:

$ cat crossref.is.ldj degruyter.is.ldj > ai.is.ldj

Export intermediate schema records to a memcache server with memcldj:

$ memcldj ai.is.ldj

Export to finc 1.3 SOLR 4 schema:

$ span-export -o solr413 -f DE-14:DE-14.xml -f DE-15:DE-15.xml ai.is.ldj > ai.ldj

The exported ai.ldj contains all aggregated index records and incorporates all holdings information. It can be indexed quickly with solrbulk:

$ solrbulk ai.ldj

Adding new sources

This is a work in progress; further simplification is planned.

For the moment, a new data source has to implement the span.Source interface:

// Source can emit records given a reader. The channel is of type []Importer,
// to allow the source to send objects over the channel in batches for
// performance (1000 x 1000 docs vs 1000000 x 1 doc).
type Source interface {
        Iterate(io.Reader) (<-chan []Importer, error)
}

Channels in APIs might not be optimal, though we are dealing with a kind of unbounded stream here.

Additionally, the emitted objects must implement span.Importer or span.Batcher, which carry the transformation business logic:

// Importer objects can be converted into an intermediate schema.
type Importer interface {
        ToIntermediateSchema() (*finc.IntermediateSchema, error)
}
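The transformation logic lives in ToIntermediateSchema. A hypothetical implementation might look like the following; note that finc.IntermediateSchema is replaced by a local two-field stand-in, and the article type and its fields are invented for illustration:

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-in for finc.IntermediateSchema.
type IntermediateSchema struct {
	RecordID string
	Title    string
}

// article is a hypothetical source record.
type article struct {
	DOI   string
	Title string
}

// ToIntermediateSchema maps source fields onto the catch-all schema and
// rejects records that cannot be converted.
func (a article) ToIntermediateSchema() (*IntermediateSchema, error) {
	if a.DOI == "" {
		return nil, errors.New("record without DOI")
	}
	return &IntermediateSchema{RecordID: "ai-" + a.DOI, Title: a.Title}, nil
}

func main() {
	is, err := article{DOI: "10.1000/x", Title: "On Spans"}.ToIntermediateSchema()
	fmt.Println(is.RecordID, err)
}
```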

The exporters need to implement the finc.ExportSchema interface:

// ExportSchema encapsulate an export flavour. This will most likely be a
// struct with fields and methods relevant to the exported format. For the
// moment we assume, the output is JSON. If formats other than JSON are
// requested, move the marshalling into this interface.
type ExportSchema interface {
        // Convert takes an intermediate schema record to export. Returns an
        // error, if conversion failed.
        Convert(IntermediateSchema) error
        // Attach takes a list of strings (here: ISILs) and attaches them to the
        // current record.
        Attach([]string)
}

Licence

GPL-3.0+.
TODO

  • maybe factor out importer interface (like exporter)
  • docs: add example files for each supported data format

A filtering pipeline.

The final processing step from an intermediate schema to an export format includes various decisions.

  • Should an ISIL be attached to a record or not?
  • Should a record be excluded due to an expired or deleted DOI?

Provide a middleware-ish processing interface (similar to flow, metafacture)?

pl := Pipeline{}
pl.Add(DOIFilter)
pl.Add(ISILAttacher)
pl.Add(QualityAssuranceTests)
pl.Add(Exporter)

err := pl.Run(input)

Done

  • decouple batching (performance) from record stream generation (content)
  • write wrappers around common inputs like XML, JSON, CSV ...

Documentation

Overview

Copyright 2015 by Leipzig University Library, http://ub.uni-leipzig.de
               by The Finc Authors, http://finc.info
               by Martin Czygan, <martin.czygan@uni-leipzig.de>

This file is part of some open source application.

Some open source application is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Some open source application is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

@license GPL-3.0+ <http://spdx.org/licenses/GPL-3.0+>


Index

Constants

const (
	// AppVersion of span package. Commandline tools will show this on -v.
	AppVersion = "0.1.54"
	// KeyLengthLimit is a limit imposed by memcached protocol, which is used
	// for blob storage as of June 2015. If we change the key value store,
	// this limit might become obsolete.
	KeyLengthLimit = 250
)

Variables

This section is empty.

Functions

func ByteSink

func ByteSink(w io.Writer, out chan []byte, done chan bool)

ByteSink is a fan in writer for a byte channel. A newline is appended after each object.

func DetectLang3

func DetectLang3(text string) (string, error)

DetectLang3 returns the best guess 3-letter language code for a given text.

func FromJSON

func FromJSON(r io.Reader, decoder JSONDecoderFunc) (chan []Importer, error)

FromJSON returns a channel of slices of importable objects with a default batch size of 20000 docs.

func FromJSONSize

func FromJSONSize(r io.Reader, decoder JSONDecoderFunc, size int) (chan []Importer, error)

FromJSONSize returns a channel of slices of importable values, given a reader, a decoder (for a single value) and the number of documents to batch. Important: Due to fan-out, input and output order will differ.

func FromXML

func FromXML(r io.Reader, name string, decoderFunc XMLDecoderFunc) (chan []Importer, error)

FromXML is like FromXMLSize, with a default batch size of 2000 XML documents.

func FromXMLSize

func FromXMLSize(r io.Reader, name string, decoderFunc XMLDecoderFunc, size int) (chan []Importer, error)

FromXMLSize returns a channel of importable document slices, given a reader over XML, the name of the XML start element, an XMLDecoderFunc callback that deserializes an XML snippet, and a batch size. TODO(miku): more idiomatic error handling, e.g. over an error channel.

func UnescapeTrim

func UnescapeTrim(s string) string

UnescapeTrim unescapes HTML character references and trims the space of a given string.
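The documented behavior can be reproduced with the standard library; this local sketch (lowercased to distinguish it from the package function) shows the expected effect:

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// unescapeTrim mirrors the documented behavior of span.UnescapeTrim:
// resolve HTML character references, then trim surrounding whitespace.
func unescapeTrim(s string) string {
	return strings.TrimSpace(html.UnescapeString(s))
}

func main() {
	fmt.Printf("%q\n", unescapeTrim("  War &amp; Peace  "))
}
```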

Types

type Importer

type Importer interface {
	ToIntermediateSchema() (*finc.IntermediateSchema, error)
}

Importer objects can be converted into an intermediate schema.

type JSONDecoderFunc

type JSONDecoderFunc func(s string) (Importer, error)

JSONDecoderFunc turns a string into a single importable object.
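A decoder of this shape might look like this; doajRecord and decodeDOAJ are invented, and interface{} stands in for span.Importer so the sketch is self-contained:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// doajRecord is a hypothetical line format with a single field.
type doajRecord struct {
	ID string `json:"id"`
}

// decodeDOAJ has the shape of a JSONDecoderFunc: it turns one line of an
// LDJ file into a single importable object.
func decodeDOAJ(s string) (interface{}, error) {
	var rec doajRecord
	if err := json.Unmarshal([]byte(s), &rec); err != nil {
		return nil, err
	}
	return rec, nil
}

func main() {
	v, err := decodeDOAJ(`{"id": "abc"}`)
	fmt.Println(v.(doajRecord).ID, err)
}
```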

type Skip

type Skip struct {
	Reason string
}

Skip marks records to skip.

func (Skip) Error

func (s Skip) Error() string

Error returns the reason for skipping.
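Since Skip implements the error interface, a converter can return it to distinguish skippable records from hard failures; the convert function below is a hypothetical caller, only the Skip type itself comes from the package:

```go
package main

import (
	"fmt"
)

// Skip, as in the span package: an error value marking records that
// should be skipped rather than treated as hard failures.
type Skip struct {
	Reason string
}

// Error returns the reason for skipping.
func (s Skip) Error() string { return s.Reason }

// convert is a hypothetical conversion step that skips deleted records.
func convert(deleted bool) error {
	if deleted {
		return Skip{Reason: "record deleted upstream"}
	}
	return nil
}

func main() {
	err := convert(true)
	// A type assertion lets callers treat skips differently from errors.
	if s, ok := err.(Skip); ok {
		fmt.Println("skipping:", s.Reason)
	}
}
```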

type Source

type Source interface {
	Iterate(io.Reader) (<-chan []Importer, error)
}

Source can emit records given a reader. The channel is of type []Importer, to allow the source to send objects over the channel in batches for performance (1000 x 1000 docs vs 1000000 x 1 doc).

type XMLDecoderFunc

type XMLDecoderFunc func(*xml.Decoder, xml.StartElement) (Importer, error)

XMLDecoderFunc returns an importable document, given an XML decoder and a start element.

Directories

Path Synopsis
cmd
span-export command
span-gh-dump command
span-import command
span-is command
Package holdings contains wrappers for holding files.
