wiki-dump-reader

Published: Sep 11, 2019 License: MIT Imports: 14 Imported by: 0


wiki-dump-parser

Parses wiki-dump XML files and indexes their nodes as a directed graph in a big-data graph DB.


Build it

Binary
go get -u github.com/dgoldstein1/wiki-dump-parser
Docker
docker build . -t dgoldstein1/wikidumpparser

Run it

docker-compose up -d

Or, with the dependencies running locally:

export GRAPH_DB_ENDPOINT="http://localhost:5000" # endpoint of graph database
export TWO_WAY_KV_ENDPOINT="http://localhost:5001" # endpoint of k:v <-> v:k lookup metadata db
export PARALLELISM=20 # number of parallel threads to run
export METRICS_PORT=8002 # port where prom metrics are served
wiki-dump-parser parse enwiki-20190620-pages-articles1.xml-p10p30302 
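
As an illustration of how the four environment variables above could be read and validated at startup, here is a minimal sketch. It is an assumption about a reasonable approach, not the project's actual configuration code. The `config` type and the `loadConfig`, `requireURL`, and `requirePositiveInt` helpers are hypothetical names.

```go
package main

import (
	"fmt"
	"net/url"
	"os"
	"strconv"
)

// config holds the settings named in the README's environment variables.
type config struct {
	GraphDBEndpoint  string
	TwoWayKVEndpoint string
	Parallelism      int
	MetricsPort      int
}

// requireURL returns the variable's value if it is an absolute URL.
func requireURL(getenv func(string) string, name string) (string, error) {
	v := getenv(name)
	u, err := url.Parse(v)
	if v == "" || err != nil || u.Scheme == "" || u.Host == "" {
		return "", fmt.Errorf("%s must be set to an absolute URL, got %q", name, v)
	}
	return v, nil
}

// requirePositiveInt returns the variable's value if it is an integer > 0.
func requirePositiveInt(getenv func(string) string, name string) (int, error) {
	n, err := strconv.Atoi(getenv(name))
	if err != nil || n <= 0 {
		return 0, fmt.Errorf("%s must be a positive integer, got %q", name, getenv(name))
	}
	return n, nil
}

// loadConfig reads every setting through getenv, failing on the first bad one.
func loadConfig(getenv func(string) string) (config, error) {
	var c config
	var err error
	if c.GraphDBEndpoint, err = requireURL(getenv, "GRAPH_DB_ENDPOINT"); err != nil {
		return c, err
	}
	if c.TwoWayKVEndpoint, err = requireURL(getenv, "TWO_WAY_KV_ENDPOINT"); err != nil {
		return c, err
	}
	if c.Parallelism, err = requirePositiveInt(getenv, "PARALLELISM"); err != nil {
		return c, err
	}
	if c.MetricsPort, err = requirePositiveInt(getenv, "METRICS_PORT"); err != nil {
		return c, err
	}
	return c, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "config error:", err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Passing `getenv` as a parameter keeps the loader testable without touching the process environment.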

Development

Local Development
./watch_dev_changes.sh
Testing
go test $(go list ./... | grep -v /vendor/)
Benchmarks
Dump Size                  Execution Time  Number of Nodes  Number of Edges  Nodes Added / Sec
619 MB                     4m52.936s       1280817          2648926          4386.35
27 GB (half of Wikipedia)  1490m53.328s    29559863         124179160        330.45
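
The "Nodes Added / Sec" figure is the node count divided by the elapsed time in seconds. The sketch below shows that arithmetic for the 27 GB run. The `nodesPerSecond` helper is a hypothetical name, and it relies on Go's `time.ParseDuration` accepting the `1490m53.328s` notation used in the benchmark. Results for other rows may differ slightly from the published figures depending on how the elapsed time was rounded.

```go
package main

import (
	"fmt"
	"time"
)

// nodesPerSecond divides a node count by an elapsed time given in Go
// duration notation (for example "4m52.936s").
func nodesPerSecond(nodes int64, elapsed string) (float64, error) {
	d, err := time.ParseDuration(elapsed)
	if err != nil {
		return 0, err
	}
	if d <= 0 {
		return 0, fmt.Errorf("elapsed time must be positive, got %s", d)
	}
	return float64(nodes) / d.Seconds(), nil
}

func main() {
	// Values taken from the 27 GB benchmark run.
	rate, err := nodesPerSecond(29559863, "1490m53.328s")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%.2f nodes/sec\n", rate) // prints "330.45 nodes/sec"
}
```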

Authors

License

This project is licensed under the MIT License; see the LICENSE.md file for details.
