span

Version: v0.1.58
Published: Dec 16, 2015 License: GPL-3.0 Imports: 14 Imported by: 3

README

Span

Span formats.

The span command line tools aim at fast, versatile document conversions between a range of metadata formats.

The goal is to move quickly between input formats, such as stored API responses, XML-ish or line-delimited JSON bibliographic data, and output formats, such as the finc intermediate format or formats that can be imported directly into SOLR or elasticsearch.

As a non-goal, the span tools do not care how you obtain your input data. The tools expect a single input file and produce a single output file (stdin and stdout, respectively).


Why in Go?

Linux shell scripts have no native XML or JSON support, Python is a bit too slow for the casual processing of 100M or more records, and Java is a bit too verbose, which is why we chose Go. Go comes with XML and JSON support in the standard library, nice concurrency primitives and simple deployment as a single static binary.


Install with

$ go get github.com/miku/span/cmd/...

or via deb or rpm packages.

Formats

A toolkit approach

  • span-import, anything to intermediate schema
  • span-export, intermediate schema to anything

The span-import tool should require minimal external information (no holdings file, etc.) and be mainly concerned with the transformation of fancy source formats into the catch-all intermediate schema.

The span-export tool may include external sources to create output, e.g. holdings files.

Usage

$ span-import -h
Usage of span-import:
  -cpuprofile="": write cpu profile to file
  -i="": input format
  -list=false: list formats
  -log="": if given log to file
  -members="": path to LDJ file, one member per line
  -v=false: prints current program version
  -verbose=false: more output
  -w=4: number of workers

$ span-export -h
Usage of span-export:
  -any=[]: ISIL
  -b=20000: batch size
  -cpuprofile="": write cpu profile to file
  -dump=false: dump filters and exit
  -f=[]: ISIL:/path/to/ovid.xml
  -l=[]: ISIL:/path/to/list.txt
  -list=false: list output formats
  -o="solr413": output format
  -skip=false: skip errors
  -source=[]: ISIL:SID
  -v=false: prints current program version
  -w=4: number of workers

Examples

List available formats:

$ span-import -list
doaj
genios
crossref
degruyter
jstor

Import crossref LDJ (with cached members API responses) or DeGruyter XML (preprocessed into a single file):

$ span-import -i crossref -members members.ldj crossref.ldj > crossref.is.ldj
$ span-import -i jats degruyter.ldj > degruyter.is.ldj

Concat for convenience:

$ cat crossref.is.ldj degruyter.is.ldj > ai.is.ldj

Export intermediate schema records to a memcache server with memcldj:

$ memcldj ai.is.ldj

Export to finc 1.3 SOLR 4 schema:

$ span-export -o solr413 -f DE-14:DE-14.xml -f DE-15:DE-15.xml ai.is.ldj > ai.ldj

The exported ai.ldj contains all aggregated index records and incorporates all holdings information. It can be indexed quickly with solrbulk:

$ solrbulk ai.ldj

Adding new sources

This is work/simplification-in-progress.

For the moment, a new data source has to implement the span.Source interface:

// Source can emit records given a reader. The channel is of type []Importer,
// to allow the source to send objects over the channel in batches for
// performance (1000 x 1000 docs vs 1000000 x 1 doc).
type Source interface {
        Iterate(io.Reader) (<-chan []Importer, error)
}

Channels in APIs might not be optimal, but we are dealing with a kind of unbounded stream here.

Additionally, the emitted objects must implement span.Importer or span.Batcher, which is where the transformation business logic lives:

// Importer objects can be converted into an intermediate schema.
type Importer interface {
        ToIntermediateSchema() (*finc.IntermediateSchema, error)
}
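
The two interfaces above can be sketched together in a small, self-contained program. This is only an illustration under stated assumptions: IntermediateSchema is a one-field stand-in for finc.IntermediateSchema, and line, lineSource and titles are hypothetical names that do not exist in span.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"strings"
)

// IntermediateSchema is a hypothetical stand-in for finc.IntermediateSchema,
// reduced to a single field so the example stays self-contained.
type IntermediateSchema struct {
	Title string
}

// Importer mirrors the interface shown above, using the stand-in type.
type Importer interface {
	ToIntermediateSchema() (*IntermediateSchema, error)
}

// Source mirrors the interface shown above.
type Source interface {
	Iterate(io.Reader) (<-chan []Importer, error)
}

// line is a hypothetical record type: one input line becomes one title.
type line string

func (l line) ToIntermediateSchema() (*IntermediateSchema, error) {
	return &IntermediateSchema{Title: strings.TrimSpace(string(l))}, nil
}

// lineSource is a hypothetical Source that emits lines in batches.
type lineSource struct {
	batchSize int
}

func (s lineSource) Iterate(r io.Reader) (<-chan []Importer, error) {
	if s.batchSize < 1 {
		return nil, fmt.Errorf("batch size must be positive, got %d", s.batchSize)
	}
	ch := make(chan []Importer)
	go func() {
		defer close(ch)
		var batch []Importer
		scanner := bufio.NewScanner(r)
		for scanner.Scan() {
			batch = append(batch, line(scanner.Text()))
			if len(batch) == s.batchSize {
				ch <- batch
				batch = nil
			}
		}
		// Flush the last, possibly smaller batch. Scanner errors are
		// dropped in this sketch.
		if len(batch) > 0 {
			ch <- batch
		}
	}()
	return ch, nil
}

// titles drains a source over the given input and collects the converted
// titles. The conversion in this sketch never fails, so the early return
// on error is only there for completeness.
func titles(src Source, input string) ([]string, error) {
	ch, err := src.Iterate(strings.NewReader(input))
	if err != nil {
		return nil, err
	}
	var out []string
	for batch := range ch {
		for _, doc := range batch {
			is, err := doc.ToIntermediateSchema()
			if err != nil {
				return nil, err
			}
			out = append(out, is.Title)
		}
	}
	return out, nil
}

func main() {
	got, err := titles(lineSource{batchSize: 2}, "a\nb\nc\n")
	fmt.Println(got, err)
}
```

With a single producer goroutine, as here, batches arrive in input order. A source that converts in several workers would lose that ordering.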

The exporters need to implement the finc.ExportSchema interface:

// ExportSchema encapsulates an export flavour. This will most likely be a
// struct with fields and methods relevant to the exported format. For the
// moment we assume the output is JSON. If formats other than JSON are
// requested, move the marshalling into this interface.
type ExportSchema interface {
        // Convert takes an intermediate schema record to export. Returns an
        // error, if conversion failed.
        Convert(IntermediateSchema) error
        // Attach takes a list of strings (here: ISILs) and attaches them to the
        // current record.
        Attach([]string)
}
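
A minimal sketch of an export flavour, assuming the interface as printed above. IntermediateSchema is again a reduced stand-in for the finc type, and minimalDoc, its JSON field names and the export helper are made up for illustration; they are not one of the formats shipped with span-export.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// IntermediateSchema is a hypothetical stand-in for finc.IntermediateSchema.
type IntermediateSchema struct {
	ID    string
	Title string
}

// ExportSchema mirrors the interface shown above.
type ExportSchema interface {
	Convert(IntermediateSchema) error
	Attach([]string)
}

// minimalDoc is a hypothetical export flavour with made-up field names.
type minimalDoc struct {
	ID           string   `json:"id"`
	Title        string   `json:"title"`
	Institutions []string `json:"institution,omitempty"`
}

// Convert copies the fields this flavour cares about.
func (d *minimalDoc) Convert(is IntermediateSchema) error {
	if is.ID == "" {
		return fmt.Errorf("record without ID")
	}
	d.ID, d.Title = is.ID, is.Title
	return nil
}

// Attach adds the given ISILs to the current record.
func (d *minimalDoc) Attach(isils []string) {
	d.Institutions = append(d.Institutions, isils...)
}

// export runs a record through a schema and marshals the result as JSON,
// following the assumption stated in the interface comment.
func export(schema ExportSchema, is IntermediateSchema, isils []string) (string, error) {
	if err := schema.Convert(is); err != nil {
		return "", err
	}
	schema.Attach(isils)
	b, err := json.Marshal(schema)
	return string(b), err
}

func main() {
	s, err := export(&minimalDoc{}, IntermediateSchema{ID: "ai-1", Title: "Hello"}, []string{"DE-14", "DE-15"})
	fmt.Println(s, err)
}
```

Keeping marshalling outside the interface, as in the export helper, only works while every flavour is JSON. This is the limitation the interface comment already points out.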

Licence

  • GPL3
  • This project uses the Compact Language Detector 2 - CLD2 by Dick Sites, Apache License Version 2.0

Documentation

Overview

Copyright 2015 by Leipzig University Library, http://ub.uni-leipzig.de
                  The Finc Authors, http://finc.info
                  Martin Czygan, <martin.czygan@uni-leipzig.de>

This file is part of some open source application.

Some open source application is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

Some open source application is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with Foobar. If not, see <http://www.gnu.org/licenses/>.

@license GPL-3.0+ <http://spdx.org/licenses/GPL-3.0+>

Index

Constants

const (
	// AppVersion of span package. Commandline tools will show this on -v.
	AppVersion = "0.1.58"
	// KeyLengthLimit is a limit imposed by memcached protocol, which is used
	// for blob storage as of June 2015. If we change the key value store,
	// this limit might become obsolete.
	KeyLengthLimit = 250
)

Variables

var ErrInvalidISSN = errors.New("invalid ISSN")

Functions

func ByteSink

func ByteSink(w io.Writer, out chan []byte, done chan bool)

ByteSink is a fan-in writer for a byte channel. A newline is appended after each object.
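
The fan-in pattern can be sketched as follows. byteSink is a guess at how a function with this signature might behave (drain the channel, write each item plus a newline, then signal on done); it is not the package's implementation, and collect is a hypothetical driver.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"io"
	"sync"
)

// byteSink is a guess at how a fan-in writer with the ByteSink signature
// could work; it is not the package's implementation. Write errors are
// ignored in this sketch.
func byteSink(w io.Writer, out chan []byte, done chan bool) {
	bw := bufio.NewWriter(w)
	for b := range out {
		bw.Write(b)
		bw.WriteByte('\n')
	}
	bw.Flush()
	done <- true
}

// collect starts n workers that each send one message and returns
// everything the sink wrote.
func collect(n int) string {
	var buf bytes.Buffer
	out := make(chan []byte)
	done := make(chan bool)
	go byteSink(&buf, out, done)

	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			out <- []byte(fmt.Sprintf("worker %d", id))
		}(i)
	}
	wg.Wait()  // all workers have sent their data
	close(out) // lets the sink's range loop end
	<-done     // wait until the sink has flushed
	return buf.String()
}

func main() {
	fmt.Print(collect(3))
}
```

Only the sink goroutine touches the writer, so the workers need no locking. The order of the lines depends on goroutine scheduling.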

func DetectLang3

func DetectLang3(text string) (string, error)

DetectLang3 returns the best guess 3-letter language code for a given text.

func FromLines added in v0.1.56

func FromLines(r io.Reader, f ImporterFunc) (chan []Importer, error)

FromLines returns a channel of slices of importable objects with a default batch size of 20000 docs.

func FromLinesSize added in v0.1.56

func FromLinesSize(r io.Reader, f ImporterFunc, size int) (chan []Importer, error)

FromLinesSize returns a channel of slices of importable values, given a reader, f (for a single value) and number of documents to batch. Important: Due to fan-out, input and output order will not be preserved.
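
Why fan-out loses ordering can be seen in a small sketch. It assumes a design where one goroutine batches lines and several workers convert the batches concurrently; whether FromLinesSize is built exactly like this is an assumption. convertFunc stands in for ImporterFunc and returns strings to keep the example short; fromLines and upperAll are hypothetical names.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"sort"
	"strings"
	"sync"
)

// convertFunc is a hypothetical stand-in for ImporterFunc; it returns a
// string instead of an Importer to keep the example short.
type convertFunc func(b []byte) (string, error)

// fromLines batches lines from r, converts the batches in a number of
// workers and sends the results on the returned channel. It expects
// workers >= 1.
func fromLines(r io.Reader, f convertFunc, size, workers int) chan []string {
	batches := make(chan [][]byte)
	results := make(chan []string)

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for batch := range batches {
				var docs []string
				for _, b := range batch {
					// Lines that fail to convert are skipped in this sketch.
					if doc, err := f(b); err == nil {
						docs = append(docs, doc)
					}
				}
				results <- docs
			}
		}()
	}

	go func() {
		var batch [][]byte
		scanner := bufio.NewScanner(r)
		for scanner.Scan() {
			// Copy the line, since the scanner reuses its buffer.
			batch = append(batch, append([]byte(nil), scanner.Bytes()...))
			if len(batch) == size {
				batches <- batch
				batch = nil
			}
		}
		if len(batch) > 0 {
			batches <- batch
		}
		close(batches)
		wg.Wait()
		close(results)
	}()
	return results
}

// upperAll converts every line to upper case and returns the sorted result.
func upperAll(input string, size, workers int) []string {
	f := func(b []byte) (string, error) { return strings.ToUpper(string(b)), nil }
	var out []string
	for docs := range fromLines(strings.NewReader(input), f, size, workers) {
		out = append(out, docs...)
	}
	sort.Strings(out) // batch order is not deterministic, so sort for display
	return out
}

func main() {
	fmt.Println(upperAll("a\nb\nc\nd\ne\n", 2, 3))
}
```

Whichever worker finishes first sends first, so batches can overtake each other. Order inside a single batch is kept.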

func FromXML

func FromXML(r io.Reader, name string, decoderFunc XMLDecoderFunc) (chan []Importer, error)

FromXML is like FromXMLSize, with a default batch size of 2000 XML documents.

func FromXMLSize

func FromXMLSize(r io.Reader, name string, decoderFunc XMLDecoderFunc, size int) (chan []Importer, error)

FromXMLSize returns a channel of importable document slices, given a reader over XML, the name of the XML start element, an XMLDecoderFunc callback that deserializes an XML snippet, and a batch size. TODO(miku): more idiomatic error handling, e.g. over error channel.
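
The element-by-element decoding described here can be sketched with encoding/xml alone. The sketch collects batches in a slice rather than sending them over a channel, and every name in it (article, decodeFunc, batchElements, decodeArticle, parse) is hypothetical; decodeFunc mirrors the shape of XMLDecoderFunc but returns a concrete type instead of an Importer.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strings"
)

// article is a hypothetical record type.
type article struct {
	Title string `xml:"title"`
}

// decodeFunc is a stand-in for XMLDecoderFunc that returns *article
// instead of an Importer.
type decodeFunc func(*xml.Decoder, xml.StartElement) (*article, error)

// batchElements calls f for every start element with the given local name
// and groups the decoded values into batches of the given size.
func batchElements(r io.Reader, name string, f decodeFunc, size int) ([][]*article, error) {
	dec := xml.NewDecoder(r)
	var batches [][]*article
	var batch []*article
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			break
		}
		if err != nil {
			return nil, err
		}
		se, ok := tok.(xml.StartElement)
		if !ok || se.Name.Local != name {
			continue
		}
		doc, err := f(dec, se)
		if err != nil {
			return nil, err
		}
		batch = append(batch, doc)
		if len(batch) == size {
			batches = append(batches, batch)
			batch = nil
		}
	}
	// Keep the last, possibly smaller batch.
	if len(batch) > 0 {
		batches = append(batches, batch)
	}
	return batches, nil
}

// decodeArticle deserializes one <article> element.
func decodeArticle(dec *xml.Decoder, se xml.StartElement) (*article, error) {
	var a article
	err := dec.DecodeElement(&a, &se)
	return &a, err
}

// parse is a small helper that runs batchElements over a string.
func parse(input string, size int) ([][]*article, error) {
	return batchElements(strings.NewReader(input), "article", decodeArticle, size)
}

func main() {
	input := `<articles>
<article><title>A</title></article>
<article><title>B</title></article>
<article><title>C</title></article>
</articles>`
	batches, err := parse(input, 2)
	fmt.Println(len(batches), err)
}
```

Reading token by token keeps memory use bounded by the batch size, which matters when the whole input is one large preprocessed file.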

func UnescapeTrim

func UnescapeTrim(s string) string

UnescapeTrim unescapes HTML character references and trims the space of a given string.
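
A function with this behaviour can be written with the standard library alone. The version below is a sketch of the described behaviour, not a copy of the package source.

```go
package main

import (
	"fmt"
	"html"
	"strings"
)

// unescapeTrim is a sketch of the documented behaviour: resolve HTML
// character references first, then trim surrounding whitespace.
func unescapeTrim(s string) string {
	return strings.TrimSpace(html.UnescapeString(s))
}

func main() {
	fmt.Printf("%q\n", unescapeTrim("  Tom &amp; Jerry "))
}
```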

Types

type ISSN added in v0.1.56

type ISSN string

func (ISSN) String added in v0.1.56

func (s ISSN) String() string

func (ISSN) Validate added in v0.1.56

func (s ISSN) Validate() error
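
The documentation does not say which checks Validate performs. As an illustration only, the sketch below assumes the common NNNN-NNNC form and the standard ISSN check digit (weights 8 down to 2, modulo 11, with X standing for 10); the lowercase issn type and errInvalidISSN are local stand-ins for the exported names.

```go
package main

import (
	"errors"
	"fmt"
)

// errInvalidISSN is a local stand-in for span.ErrInvalidISSN.
var errInvalidISSN = errors.New("invalid ISSN")

// issn is a local stand-in for span.ISSN.
type issn string

// Validate checks the NNNN-NNNC form and the check digit. Which checks the
// real method performs is an assumption of this sketch.
func (s issn) Validate() error {
	if len(s) != 9 || s[4] != '-' {
		return errInvalidISSN
	}
	digits := string(s[:4]) + string(s[5:])
	sum := 0
	for i := 0; i < 7; i++ {
		c := digits[i]
		if c < '0' || c > '9' {
			return errInvalidISSN
		}
		// Weights run from 8 for the first digit down to 2 for the seventh.
		sum += int(c-'0') * (8 - i)
	}
	check := (11 - sum%11) % 11
	last := digits[7]
	switch {
	case check == 10 && (last == 'X' || last == 'x'):
		return nil
	case check < 10 && last == byte('0'+check):
		return nil
	}
	return errInvalidISSN
}

func main() {
	for _, s := range []issn{"0378-5955", "2434-561X", "0378-5956", "03785955"} {
		fmt.Println(s, s.Validate())
	}
}
```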

type Importer

type Importer interface {
	ToIntermediateSchema() (*finc.IntermediateSchema, error)
}

Importer objects can be converted into an intermediate schema.

type ImporterFunc added in v0.1.56

type ImporterFunc func(b []byte) (Importer, error)

ImporterFunc turns a byte slice into a single importable object.

type Skip

type Skip struct {
	Reason string
}

Skip marks records to skip.

func (Skip) Error

func (s Skip) Error() string

Error returns the reason for skipping.
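
Because Skip implements the error interface, a conversion can return it and the caller can tell skipped records apart from real failures. The sketch below redeclares Skip locally and assumes Error returns the Reason field unchanged; convert and process are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Skip mirrors the type documented above. That Error returns Reason
// verbatim is an assumption of this sketch.
type Skip struct {
	Reason string
}

func (s Skip) Error() string { return s.Reason }

// convert is a hypothetical conversion: empty titles are skipped, a lone
// "!" stands for a record that is actually broken.
func convert(title string) (string, error) {
	switch title {
	case "":
		return "", Skip{Reason: "empty title"}
	case "!":
		return "", errors.New("broken record")
	}
	return title, nil
}

// process keeps converted titles, counts skipped records and stops at the
// first error that is not a Skip.
func process(titles []string) ([]string, int, error) {
	var kept []string
	skipped := 0
	for _, t := range titles {
		doc, err := convert(t)
		var skip Skip
		if errors.As(err, &skip) {
			skipped++
			continue
		}
		if err != nil {
			return kept, skipped, err
		}
		kept = append(kept, doc)
	}
	return kept, skipped, nil
}

func main() {
	kept, skipped, err := process([]string{"a", "", "b"})
	fmt.Println(kept, skipped, err)
}
```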

type Source

type Source interface {
	Iterate(io.Reader) (<-chan []Importer, error)
}

Source can emit records given a reader. The channel is of type []Importer, to allow the source to send objects over the channel in batches for performance (1000 x 1000 docs vs 1000000 x 1 doc).

type XMLDecoderFunc

type XMLDecoderFunc func(*xml.Decoder, xml.StartElement) (Importer, error)

XMLDecoderFunc returns an importable document, given an XML decoder and a start element.

Directories

Path Synopsis
cmd
  span-export command
    Converts intermediate schema docs into solr docs.
  span-gh-dump command
    Dump TSV(ISSN, title) from a google holdings file.
  span-import command
    Converts various input formats into an intermediate schema.
Package sets implements basic set types.
Package finc holds finc SolrSchema (SOLR) and intermediate schema related types and methods.
Package holdings contains wrappers for holding files.
sources
