mempool-dumpster

Version: v0.5.2
Published: Sep 18, 2023 License: MIT


Mempool Dumpster 🗑️♻️


Archiving mempool transactions in Parquet and CSV format.

The data is freely available at https://mempool-dumpster.flashbots.net

Overview:


Available mempool transaction sources

  1. Generic EL nodes - go-ethereum, Infura, etc. (Websockets, using newPendingTransactions)
  2. Alchemy (Websockets, using alchemy_pendingTransactions, warning - burns a lot of credits)
  3. bloXroute (Websockets and gRPC)
  4. Chainbound Fiber (gRPC)
  5. Eden (Websockets)

Note: Some sources send transactions that are already included on-chain, which are discarded (not added to archive or summary)


Output files

Daily files uploaded by mempool-dumpster (e.g. for September 2023):

  1. Parquet file with transaction metadata and raw transactions (~800MB/day, e.g. 2023-09-08.parquet)
  2. CSV file with only the transaction metadata (~100MB/day zipped, e.g. 2023-09-08.csv.zip)
  3. CSV file with details about when each transaction was received by any source (~100MB/day zipped, e.g. 2023-09-08_sourcelog.csv.zip)
  4. Summary in text format (~2kB, e.g. 2023-09-08_summary.txt)
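The naming pattern of the four daily files can be sketched in Go as follows. This is only an illustration: the helper name dailyFileNames is hypothetical and not part of this module, and the sketch derives file names only, not download locations.

```go
package main

import (
	"fmt"
	"time"
)

// dailyFileNames returns the four per-day archive file names listed above
// for a date given as YYYY-MM-DD. Hypothetical helper, for illustration only.
func dailyFileNames(date string) ([]string, error) {
	// Validate the date so that malformed input does not yield bogus names.
	if _, err := time.Parse("2006-01-02", date); err != nil {
		return nil, fmt.Errorf("invalid date %q: %w", date, err)
	}
	return []string{
		date + ".parquet",           // transaction metadata and raw transactions
		date + ".csv.zip",           // transaction metadata only
		date + "_sourcelog.csv.zip", // per-source receive log
		date + "_summary.txt",       // text summary
	}, nil
}

func main() {
	names, err := dailyFileNames("2023-09-08")
	if err != nil {
		panic(err)
	}
	for _, name := range names {
		fmt.Println(name)
	}
}
```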

FAQ

  • What are exclusive transactions? ... a transaction that was seen from no other source (transaction only provided by a single source)
  • What does "XOF" stand for? ... XOF stands for "exclusive orderflow" (i.e. exclusive transactions)
  • What is a-pool? ... A-Pool is a regular geth node with some optimized peering settings, which the collector subscribes to over the network.
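To make the definition of exclusive transactions concrete, here is a minimal Go sketch that counts them per source from sourcelog-style entries. The type sourcelogEntry, the function exclusiveCounts and the source name "local" are assumptions made for this example and are not taken from the module's code.

```go
package main

import (
	"fmt"
	"sort"
)

// sourcelogEntry is a hypothetical, reduced view of one sourcelog row:
// which source delivered which transaction hash.
type sourcelogEntry struct {
	Hash   string
	Source string
}

// exclusiveCounts returns, per source, the number of transactions that
// were seen from that source and from no other source.
func exclusiveCounts(entries []sourcelogEntry) map[string]int {
	// Collect the set of distinct sources for every transaction hash.
	sourcesByHash := make(map[string]map[string]struct{})
	for _, e := range entries {
		if sourcesByHash[e.Hash] == nil {
			sourcesByHash[e.Hash] = make(map[string]struct{})
		}
		sourcesByHash[e.Hash][e.Source] = struct{}{}
	}
	// A transaction is exclusive when exactly one source delivered it.
	counts := make(map[string]int)
	for _, sources := range sourcesByHash {
		if len(sources) != 1 {
			continue
		}
		for src := range sources {
			counts[src]++
		}
	}
	return counts
}

func main() {
	entries := []sourcelogEntry{
		{Hash: "0xaa", Source: "bloxroute"},
		{Hash: "0xaa", Source: "bloxroute"}, // repeated delivery, same source
		{Hash: "0xbb", Source: "bloxroute"},
		{Hash: "0xbb", Source: "local"},
		{Hash: "0xcc", Source: "local"},
	}
	counts := exclusiveCounts(entries)
	sources := make([]string, 0, len(counts))
	for src := range counts {
		sources = append(sources, src)
	}
	sort.Strings(sources)
	for _, src := range sources {
		fmt.Printf("%s: %d exclusive\n", src, counts[src])
	}
}
```

Note that a hash delivered twice by the same source still counts as exclusive, since only the number of distinct sources matters.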

Working with Parquet

See this post for more details: https://collective.flashbots.net/t/mempool-dumpster-a-free-mempool-transaction-archive/2401

Apache Parquet is a column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk.

We recommend ClickHouse local (as well as DuckDB) for working with Parquet files; it makes it easy to run queries like:

# count rows
$ clickhouse local -q "SELECT count(*) FROM 'transactions.parquet' LIMIT 1;"

# get the first hash+rawTx
$ clickhouse local -q "SELECT hash,hex(rawTx) FROM 'transactions.parquet' LIMIT 1;"

# get details of a particular hash
$ clickhouse local -q "SELECT timestamp,hash,from,to,hex(rawTx) FROM 'transactions.parquet' WHERE hash='0x152065ad73bcf63f68572f478e2dc6e826f1f434cb488b993e5956e6b7425eed';"

# show the schema
$ clickhouse local -q "DESCRIBE TABLE 'transactions.parquet';"
timestamp	Nullable(DateTime64(3))
hash	Nullable(String)
chainId	Nullable(String)
from	Nullable(String)
to	Nullable(String)
value	Nullable(String)
nonce	Nullable(String)
gas	Nullable(String)
gasPrice	Nullable(String)
gasTipCap	Nullable(String)
gasFeeCap	Nullable(String)
dataSize	Nullable(Int64)
data4Bytes	Nullable(String)
rawTx	Nullable(String)
sources	Array(Nullable(String))

# get exclusive transactions from bloxroute
$ clickhouse local -q "SELECT COUNT(*) FROM 'transactions.parquet' WHERE length(sources) == 1 AND sources[1] == 'bloxroute';"

Interesting analyses

  • Number of transactions that eventually land on chain (by source)
  • Transaction quality (e.g. for high-volume XOF sources)

System architecture

  1. Collector: Connects to EL nodes and writes new mempool transactions and sourcelog to hourly CSV files. Multiple collector instances can run without colliding.
  2. Merger: Takes collector CSV files as input, de-duplicates, sorts by timestamp and writes CSV + Parquet output files.
  3. Analyzer: Analyzes sourcelog CSV files and produces summary report.
  4. Website: Website dev-mode as well as build + upload.

Getting started

Mempool Collector

  1. Subscribes to new pending transactions at various data sources
  2. Writes 3 files:
    1. Transactions CSV: timestamp_ms, hash, raw_tx (one file per hour by default)
    2. Sourcelog CSV: timestamp_ms, hash, source (one entry for every single transaction received by any source)
    3. Trash CSV: timestamp_ms, hash, source, reason, note (transactions received from any source that are discarded instead of being added to the transactions CSV; currently this only applies to transactions already included in a previous block)
  3. Note: the collector can store transactions repeatedly, and only the merger will properly deduplicate them later
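As an illustration of the sourcelog layout (timestamp_ms, hash, source), the following Go sketch parses such rows with the standard library. It is not the module's own reader: the names sourcelogRow, parseSourcelog and parseSourcelogString are hypothetical, and the handling of an optional header line is an assumption, since the README does not say whether the files carry one.

```go
package main

import (
	"encoding/csv"
	"errors"
	"fmt"
	"io"
	"strconv"
	"strings"
	"time"
)

// sourcelogRow is a hypothetical in-memory form of one sourcelog line.
type sourcelogRow struct {
	Received time.Time
	Hash     string
	Source   string
}

// parseSourcelog reads rows of the form timestamp_ms,hash,source.
// Assumption: if the first line has a non-numeric first field, it is
// treated as a header and skipped.
func parseSourcelog(r io.Reader) ([]sourcelogRow, error) {
	reader := csv.NewReader(r)
	reader.FieldsPerRecord = 3 // reject rows with a different column count
	var rows []sourcelogRow
	for line := 1; ; line++ {
		rec, err := reader.Read()
		if errors.Is(err, io.EOF) {
			return rows, nil
		}
		if err != nil {
			return nil, err
		}
		ms, err := strconv.ParseInt(rec[0], 10, 64)
		if err != nil {
			if line == 1 {
				continue // assumed header row
			}
			return nil, fmt.Errorf("line %d: bad timestamp_ms %q: %w", line, rec[0], err)
		}
		rows = append(rows, sourcelogRow{
			Received: time.UnixMilli(ms).UTC(),
			Hash:     rec[1],
			Source:   rec[2],
		})
	}
}

// parseSourcelogString is a convenience wrapper for in-memory data.
func parseSourcelogString(data string) ([]sourcelogRow, error) {
	return parseSourcelog(strings.NewReader(data))
}

func main() {
	data := "timestamp_ms,hash,source\n" +
		"1691402400000,0xaa,local\n" +
		"1691402400123,0xaa,bloxroute\n"
	rows, err := parseSourcelogString(data)
	if err != nil {
		panic(err)
	}
	for _, row := range rows {
		fmt.Println(row.Received.Format(time.RFC3339Nano), row.Hash, row.Source)
	}
}
```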

Default filenames:

Transactions

  • Schema: <out_dir>/<date>/transactions/txs_<date>_<uid>.csv
  • Example: out/2023-08-07/transactions/txs_2023-08-07-10-00_collector1.csv

Sourcelog

  • Schema: <out_dir>/<date>/sourcelog/src_<date>_<uid>.csv
  • Example: out/2023-08-07/sourcelog/src_2023-08-07-10-00_collector1.csv

Trash

  • Schema: <out_dir>/<date>/trash/trash_<date>_<uid>.csv
  • Example: out/2023-08-07/trash/trash_2023-08-07-10-00_collector1.csv
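The default file layout above can be sketched in Go as follows. Judging by the examples, the directory uses the day while the file name uses day, hour and minute. The function collectorFile is hypothetical, and the use of UTC and truncation to the full hour are assumptions based on "one file per hour by default".

```go
package main

import (
	"fmt"
	"path/filepath"
	"time"
)

// exampleTime is an arbitrary receive time used for the demonstration.
var exampleTime = time.Date(2023, time.August, 7, 10, 17, 42, 0, time.UTC)

// collectorFile builds <out_dir>/<day>/<kind>/<prefix>_<day-hour-minute>_<uid>.csv
// for the hourly bucket containing t. Hypothetical helper; UTC and hourly
// truncation are assumptions.
func collectorFile(outDir, kind, prefix, uid string, t time.Time) string {
	bucket := t.UTC().Truncate(time.Hour)
	day := bucket.Format("2006-01-02")
	slot := bucket.Format("2006-01-02-15-04")
	name := fmt.Sprintf("%s_%s_%s.csv", prefix, slot, uid)
	return filepath.Join(outDir, day, kind, name)
}

func main() {
	fmt.Println(collectorFile("out", "transactions", "txs", "collector1", exampleTime))
	fmt.Println(collectorFile("out", "sourcelog", "src", "collector1", exampleTime))
	fmt.Println(collectorFile("out", "trash", "trash", "collector1", exampleTime))
}
```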

Running the mempool collector:

# print help
go run cmd/collect/main.go -help

# Connect to ws://localhost:8546 and write CSVs into ./out
go run cmd/collect/main.go -out ./out

# Connect to multiple nodes
go run cmd/collect/main.go -out ./out -nodes ws://server1.com:8546,ws://server2.com:8546

Merger

  • Iterates over collector output directory / CSV files
  • Deduplicates transactions, sorts them by timestamp

go run cmd/merge/main.go -h
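The deduplicate-and-sort step can be sketched in Go as below. This is a simplified illustration, not the merger's actual code: txRecord and mergeRecords are hypothetical names, and keeping the earliest record per hash is an assumption about how duplicates are resolved.

```go
package main

import (
	"fmt"
	"sort"
)

// txRecord is a hypothetical, reduced form of one transactions CSV row.
type txRecord struct {
	TimestampMs int64
	Hash        string
	RawTx       string
}

// mergeRecords combines the outputs of several collectors, keeps the
// earliest record for every hash (an assumption), and returns the result
// sorted by timestamp, with ties broken by hash for a deterministic order.
func mergeRecords(inputs ...[]txRecord) []txRecord {
	earliest := make(map[string]txRecord)
	for _, input := range inputs {
		for _, rec := range input {
			cur, ok := earliest[rec.Hash]
			if !ok || rec.TimestampMs < cur.TimestampMs {
				earliest[rec.Hash] = rec
			}
		}
	}
	merged := make([]txRecord, 0, len(earliest))
	for _, rec := range earliest {
		merged = append(merged, rec)
	}
	sort.Slice(merged, func(i, j int) bool {
		if merged[i].TimestampMs != merged[j].TimestampMs {
			return merged[i].TimestampMs < merged[j].TimestampMs
		}
		return merged[i].Hash < merged[j].Hash
	})
	return merged
}

func main() {
	collector1 := []txRecord{
		{TimestampMs: 200, Hash: "0xbb", RawTx: "0x02"},
		{TimestampMs: 100, Hash: "0xaa", RawTx: "0x01"},
	}
	collector2 := []txRecord{
		{TimestampMs: 150, Hash: "0xbb", RawTx: "0x02"}, // same tx, seen earlier
		{TimestampMs: 300, Hash: "0xcc", RawTx: "0x03"},
	}
	for _, rec := range mergeRecords(collector1, collector2) {
		fmt.Println(rec.TimestampMs, rec.Hash)
	}
}
```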

Architecture

General design goals

  • Keep it simple and stupid
  • Vendor-agnostic (main flow should work on any server, independent of a cloud provider)
  • Downtime-resilience to minimize any gaps in the archive
  • Multiple collector instances can run concurrently, without getting in each other's way
  • Merger produces the final archive (based on the input of multiple collector outputs)
  • The final archive:
    • Includes (1) parquet file with transaction metadata, and (2) compressed file of raw transaction CSV files
    • Compatible with ClickHouse and S3 Select (Parquet using gzip compression)
    • Easily distributable as torrent

Collector

  • NodeConnection
    • One for each EL connection
    • New pending transactions are sent to TxProcessor via a channel
  • TxProcessor
    • Check if it already processed that tx
    • Store it in the output directory
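The NodeConnection to TxProcessor flow described above could look roughly like the following Go sketch. It only mirrors the described shape (one goroutine per connection, a shared channel, a processor that skips already seen hashes). The names pendingTx, processTxs and runCollector are hypothetical, and the in-memory slice stands in for writing to the output directory.

```go
package main

import (
	"fmt"
	"sync"
)

// pendingTx is a hypothetical message sent from a connection to the processor.
type pendingTx struct {
	Hash   string
	Source string
}

// processTxs drains the channel and returns the hashes it stored, in
// first-seen order. Appending to a slice stands in for writing to disk.
func processTxs(txC <-chan pendingTx) []string {
	seen := make(map[string]bool)
	var stored []string
	for tx := range txC {
		if seen[tx.Hash] {
			continue // already processed that tx
		}
		seen[tx.Hash] = true
		stored = append(stored, tx.Hash)
	}
	return stored
}

// runCollector starts one goroutine per simulated connection. Each one
// sends its pending transactions to the processor via a shared channel.
func runCollector(feeds map[string][]string) []string {
	txC := make(chan pendingTx)
	var wg sync.WaitGroup
	for source, hashes := range feeds {
		wg.Add(1)
		go func(source string, hashes []string) {
			defer wg.Done()
			for _, hash := range hashes {
				txC <- pendingTx{Hash: hash, Source: source}
			}
		}(source, hashes)
	}
	// Close the channel once every connection has finished sending.
	go func() {
		wg.Wait()
		close(txC)
	}()
	return processTxs(txC)
}

func main() {
	stored := runCollector(map[string][]string{
		"node1": {"0xaa", "0xbb"},
		"node2": {"0xbb", "0xcc"},
	})
	fmt.Println(len(stored), "unique transactions stored")
}
```

The order in which hashes are stored depends on goroutine scheduling, so only the set of stored hashes is deterministic here.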

Merger

Transaction RLP format


Contributing

Install dependencies

go install mvdan.cc/gofumpt@latest
go install honnef.co/go/tools/cmd/staticcheck@latest
go install github.com/golangci/golangci-lint/cmd/golangci-lint@latest
go install github.com/daixiang0/gci@latest

Lint, test, format

make lint
make test
make fmt

Further notes


License

MIT


Maintainers

Directories

Path Synopsis
cmd
analyze command
collect command
merge command
Loads many source CSV files (produced by the collector), creates summary files in CSV and Parquet, and writes a single CSV file with all raw transactions
website command
Website dev server (-dev) and prod build/upload tool (-build and -upload)
Package collector contains the mempool collector service
Package common contains common functions and variables used by various scripts and services
scripts
get-tx command
subscibe-test command
Package website contains the service delivering the website
