backfill

command
v0.1.0 Latest
Published: May 12, 2026 License: Apache-2.0 Imports: 18 Imported by: 0

Documentation

Overview

Command backfill drives a Murmur counter pipeline from Spark-aggregated S3 JSON-Lines into a DynamoDB Sum-monoid store. It's the canonical "snapshot then stream" bootstrap step: a Spark job pre-aggregates raw events to hourly summaries, lands them in S3, and this binary scans the prefix and folds every row into the same pipeline a live Kafka/Kinesis worker would feed.

Run it once before switching the live worker over to a fresh DynamoDB table, or repeatedly during a 40-day backfill window; the StableEventID extractor in package backfill keeps re-runs idempotent.
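One way to see why a stable event ID makes re-runs safe is the sketch below. It does not reproduce StableEventID from package backfill, whose signature is not shown on this page. The stableID helper (a hash of object key plus line index), the dedupStore type, and the sample object key are hypothetical stand-ins.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// stableID is a hypothetical stand-in for an event-ID extractor: the same
// S3 object key and line index always yield the same ID across runs.
func stableID(objectKey string, line int) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%s#%d", objectKey, line)))
	return hex.EncodeToString(sum[:8])
}

// dedupStore is an in-memory stand-in for a store that remembers which
// event IDs it has already applied.
type dedupStore struct {
	seen  map[string]bool
	total int64
}

// Apply adds delta once per ID and reports whether it was applied.
func (s *dedupStore) Apply(id string, delta int64) bool {
	if s.seen[id] {
		return false // already folded in by an earlier run
	}
	s.seen[id] = true
	s.total += delta
	return true
}

func main() {
	s := &dedupStore{seen: map[string]bool{}}
	// Simulate running the backfill twice over the same object.
	for run := 0; run < 2; run++ {
		for line, delta := range []int64{3, 4} {
			s.Apply(stableID("counters/bot_interaction/part-0000.json", line), delta)
		}
	}
	fmt.Println(s.total) // prints "7", not 14
}
```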

Required flags:

-bucket    S3 bucket holding the Spark output
-prefix    S3 key prefix to scan (e.g. counters/bot_interaction/)
-table     DynamoDB table for the Sum store (the pipeline's primary state)
-name      Pipeline name (used for metrics + log lines)

Optional:

-concurrency  Parallel S3 fetches (default 8)
-region       AWS region (default from environment)
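The flags above could be declared with the standard library's flag package roughly as follows. This is a sketch, not the command's source: the flag names and the default of 8 come from the lists above, while the config struct, the parseFlags helper, the sample values, and the choice to treat an empty -region as "defer to the environment" are assumptions.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// config is a hypothetical holder for the parsed flag values.
type config struct {
	Bucket, Prefix, Table, Name, Region string
	Concurrency                         int
}

// parseFlags parses args and rejects any missing required flag.
func parseFlags(args []string) (config, error) {
	var c config
	fs := flag.NewFlagSet("backfill", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep usage output out of the example
	fs.StringVar(&c.Bucket, "bucket", "", "S3 bucket holding the Spark output")
	fs.StringVar(&c.Prefix, "prefix", "", "S3 key prefix to scan")
	fs.StringVar(&c.Table, "table", "", "DynamoDB table for the Sum store")
	fs.StringVar(&c.Name, "name", "", "pipeline name for metrics and logs")
	fs.IntVar(&c.Concurrency, "concurrency", 8, "parallel S3 fetches")
	fs.StringVar(&c.Region, "region", "", "AWS region (empty: use environment)")
	if err := fs.Parse(args); err != nil {
		return c, err
	}
	required := []struct{ name, value string }{
		{"bucket", c.Bucket}, {"prefix", c.Prefix},
		{"table", c.Table}, {"name", c.Name},
	}
	for _, r := range required {
		if r.value == "" {
			return c, fmt.Errorf("missing required flag -%s", r.name)
		}
	}
	return c, nil
}

func main() {
	// Sample arguments invented for the example.
	c, err := parseFlags([]string{
		"-bucket", "example-bucket",
		"-prefix", "counters/bot_interaction/",
		"-table", "example-table",
		"-name", "bot_interaction",
	})
	if err != nil {
		panic(err)
	}
	fmt.Printf("%+v\n", c)
}
```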
