# wiki-dump-parser

Parses Wikipedia XML dump files and indexes articles as nodes in a directed graph, stored in a big-data graph database.

## Build it

### Binary

```sh
go get -u github.com/dgoldstein1/wiki-dump-parser
```

### Docker

```sh
docker build . -t dgoldstein1/wikiDumpParser
```
## Run it

```sh
docker-compose up -d
```

or with dependencies running locally:
```sh
export GRAPH_DB_ENDPOINT="http://localhost:5000"   # endpoint of graph database
export TWO_WAY_KV_ENDPOINT="http://localhost:5001" # endpoint of two-way k:v <-> v:k metadata lookup db
export PARALLELISM=20                              # number of parallel worker threads
export METRICS_PORT=8002                           # port where Prometheus metrics are served
wiki-dump-parser parse enwiki-20190620-pages-articles1.xml-p10p30302
```
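Under the hood, the parser streams `<page>` elements out of the dump and fans link extraction out across `PARALLELISM` workers. The snippet below is a minimal sketch of that pattern using only the standard library; the `page` struct, the `linkRe` regex, and printing edges in place of the real POSTs to `GRAPH_DB_ENDPOINT` are illustrative assumptions, not the repo's actual code.

```go
// Sketch: stream pages from a MediaWiki dump and extract [[wiki links]].
// Usage: go run sketch.go <dump.xml>
package main

import (
	"encoding/xml"
	"fmt"
	"os"
	"regexp"
	"strconv"
	"sync"
)

// page mirrors the subset of the MediaWiki dump schema we care about.
type page struct {
	Title string `xml:"title"`
	Text  string `xml:"revision>text"`
}

// linkRe captures the target of [[Title]] and [[Title|label]] markup.
var linkRe = regexp.MustCompile(`\[\[([^\]|#]+)`)

func main() {
	f, err := os.Open(os.Args[1])
	if err != nil {
		panic(err)
	}
	defer f.Close()

	parallelism, _ := strconv.Atoi(os.Getenv("PARALLELISM"))
	if parallelism < 1 {
		parallelism = 1
	}

	pages := make(chan page)
	var wg sync.WaitGroup
	for i := 0; i < parallelism; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range pages {
				for _, m := range linkRe.FindAllStringSubmatch(p.Text, -1) {
					// The real parser would POST this edge to
					// GRAPH_DB_ENDPOINT; here we just print it.
					fmt.Printf("%s -> %s\n", p.Title, m[1])
				}
			}
		}()
	}

	// Decode one token at a time so the whole dump is never held in memory.
	dec := xml.NewDecoder(f)
	for {
		tok, err := dec.Token()
		if err != nil {
			break // io.EOF on a clean finish
		}
		if se, ok := tok.(xml.StartElement); ok && se.Name.Local == "page" {
			var p page
			if dec.DecodeElement(&p, &se) == nil {
				pages <- p
			}
		}
	}
	close(pages)
	wg.Wait()
}
```

Streaming with `xml.Decoder` keeps memory flat regardless of dump size, which is what makes multi-gigabyte runs like the benchmarks below feasible.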
## Development

### Local Development

```sh
./watch_dev_changes.sh
```
### Testing

```sh
go test $(go list ./... | grep -v /vendor/)
```
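The `grep -v /vendor/` filter skips vendored dependencies. For reference, a table-driven test in the usual Go style might look like the following; `extractLinks` is a hypothetical helper, defined inline so the example is self-contained, not necessarily the repo's actual function.

```go
package main

import (
	"reflect"
	"regexp"
	"testing"
)

// extractLinks is a hypothetical helper matching the parsing sketch above:
// it pulls target titles out of [[wiki link]] markup.
var wikiLink = regexp.MustCompile(`\[\[([^\]|#]+)`)

func extractLinks(text string) []string {
	var out []string
	for _, m := range wikiLink.FindAllStringSubmatch(text, -1) {
		out = append(out, m[1])
	}
	return out
}

func TestExtractLinks(t *testing.T) {
	cases := []struct {
		name string
		text string
		want []string
	}{
		{"simple link", "see [[Go (programming language)]]", []string{"Go (programming language)"}},
		{"piped link", "[[Graph database|graph DB]]", []string{"Graph database"}},
		{"no links", "plain text", nil},
	}
	for _, c := range cases {
		t.Run(c.name, func(t *testing.T) {
			got := extractLinks(c.text)
			if !reflect.DeepEqual(got, c.want) {
				t.Errorf("extractLinks(%q) = %v, want %v", c.text, got, c.want)
			}
		})
	}
}
```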
## Benchmarks

| Dump Size                 | Execution Time | Number of Nodes | Number of Edges | Nodes Added / Sec |
|---------------------------|----------------|-----------------|-----------------|-------------------|
| 619 MB                    | 4m52.936s      | 1280817         | 2648926         | 4386.35           |
| 27 GB (half of Wikipedia) | 1490m53.328s   | 29559863        | 124179160       | 330.45            |
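Nodes Added / Sec is total nodes divided by wall-clock time: for the 27 GB run, 29559863 nodes / (1490 × 60 + 53.328) s ≈ 330.45 nodes/sec.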
## Authors
## License

This project is licensed under the MIT License - see the [LICENSE.md](LICENSE.md) file for details.