Documentation
Overview
Command backfill drives a Murmur counter pipeline from Spark-aggregated S3 JSON-Lines into a DynamoDB Sum-monoid store. It's the canonical "snapshot then stream" bootstrap step: a Spark job pre-aggregates raw events to hourly summaries, lands them in S3, and this binary scans the prefix and folds every row into the same pipeline a live Kafka/Kinesis worker would feed.
Run it once before flipping the live worker to a fresh DynamoDB table, or repeatedly during a 40-day backfill window; the StableEventID extractor in package backfill keeps re-runs idempotent.
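One way re-runs can stay idempotent is to derive a deterministic ID for each row and skip IDs that were already applied. The sketch below illustrates that idea only; stableID and sumStore are hypothetical names, the object key's file name is invented, and the real StableEventID extractor in package backfill may derive its IDs differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// stableID is a hypothetical stand-in for a stable event ID extractor.
// It hashes the S3 object key and the row's line index, so the same row
// yields the same ID on every run.
func stableID(objectKey string, line int) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s#%d", objectKey, line)))
	return hex.EncodeToString(sum[:])
}

// sumStore is an in-memory toy, not DynamoDB. It adds a delta only the
// first time it sees a given event ID.
type sumStore struct {
	seen   map[string]bool
	totals map[string]int64
}

func newSumStore() *sumStore {
	return &sumStore{seen: map[string]bool{}, totals: map[string]int64{}}
}

// apply reports whether the delta was added (true) or skipped as a replay.
func (s *sumStore) apply(eventID, counter string, delta int64) bool {
	if s.seen[eventID] {
		return false
	}
	s.seen[eventID] = true
	s.totals[counter] += delta
	return true
}

func main() {
	store := newSumStore()
	// Replay the same two rows twice; the second pass is a no-op.
	for run := 1; run <= 2; run++ {
		for line, delta := range []int64{3, 4} {
			id := stableID("counters/bot_interaction/part-0000.jsonl", line)
			store.apply(id, "bot_interaction", delta)
		}
		fmt.Printf("after run %d: %d\n", run, store.totals["bot_interaction"])
	}
}
```

Both runs report the same total, since the second pass finds every ID already recorded. The sketch assumes the Spark output is immutable between runs; if files were rewritten with different row order, an ID based on line position would no longer be stable.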
Required flags:
	-bucket  S3 bucket holding the Spark output
	-prefix  S3 key prefix to scan (e.g. counters/bot_interaction/)
	-table   DynamoDB table for the Sum store (the pipeline's primary state)
	-name    Pipeline name (used for metrics + log lines)
Optional:
	-concurrency  Parallel S3 fetches (default 8)
	-region       AWS region (default from environment)
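The flags above could be declared with the standard flag package roughly as follows. This is a sketch, not the command's source: the config struct and parseFlags helper are hypothetical, the bucket, table, and pipeline names in main are invented, and treating an empty -region as "resolve from the environment" is an assumption about how the default is applied.

```go
package main

import (
	"flag"
	"fmt"
	"io"
	"strings"
)

// config mirrors the documented flags; the struct itself is hypothetical.
type config struct {
	Bucket, Prefix, Table, Name, Region string
	Concurrency                         int
}

// parseFlags parses args and rejects a missing required flag.
func parseFlags(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("backfill", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // return errors instead of printing usage
	fs.StringVar(&c.Bucket, "bucket", "", "S3 bucket holding the Spark output")
	fs.StringVar(&c.Prefix, "prefix", "", "S3 key prefix to scan")
	fs.StringVar(&c.Table, "table", "", "DynamoDB table for the Sum store")
	fs.StringVar(&c.Name, "name", "", "pipeline name for metrics and log lines")
	fs.IntVar(&c.Concurrency, "concurrency", 8, "parallel S3 fetches")
	// Assumption: an empty region means "resolve it from the environment".
	fs.StringVar(&c.Region, "region", "", "AWS region")
	if err := fs.Parse(args); err != nil {
		return c, err
	}

	required := []struct{ name, value string }{
		{"-bucket", c.Bucket}, {"-prefix", c.Prefix},
		{"-table", c.Table}, {"-name", c.Name},
	}
	var missing []string
	for _, r := range required {
		if r.value == "" {
			missing = append(missing, r.name)
		}
	}
	if len(missing) > 0 {
		return c, fmt.Errorf("missing required flags: %s", strings.Join(missing, ", "))
	}
	if c.Concurrency < 1 {
		return c, fmt.Errorf("-concurrency must be at least 1, got %d", c.Concurrency)
	}
	return c, nil
}

func main() {
	// Hypothetical invocation; bucket, table, and name are made-up values.
	cfg, err := parseFlags([]string{
		"-bucket", "example-bucket",
		"-prefix", "counters/bot_interaction/",
		"-table", "example-table",
		"-name", "bot_interaction_backfill",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```

Validating the required flags up front means a typo in the invocation fails before any S3 or DynamoDB call is made.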