ArrowArc


ArrowArc is an experimental data transport library that uses Apache Arrow for high-performance data manipulation. It is designed to be zero-code, zero-config, and zero-maintenance.

Benchmarks

I'll add more benchmarks as I stabilize the library.

Transport Postgres to Parquet

Transport 4 million records from Postgres to Parquet in under 3 seconds.

{
  "StartTime": "2024-08-31T13:10:54-05:00",
  "EndTime": "2024-08-31T13:10:57-05:00",
  "RecordsProcessed": 4000000,
  "Throughput": "1337039.12 records/second",
  "ThroughputBytes": "172.16 MB/second",
  "TotalBytes": "515.05 MB",
  "TotalDuration": "2.992s"
}
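
The throughput figures are simply the record and byte counts divided by the elapsed wall-clock time. A minimal sketch of that arithmetic in Go, using the numbers from the report above (the duration shown there is rounded, so the printed values differ slightly from the reported ones):

package main

import "fmt"

func main() {
    // Numbers taken from the benchmark report above.
    records := 4_000_000.0
    totalMB := 515.05
    durationSec := 2.992

    fmt.Printf("Throughput: %.2f records/second\n", records/durationSec)
    fmt.Printf("ThroughputBytes: %.2f MB/second\n", totalMB/durationSec)
}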

Getting Started

There are several ways to use ArrowArc:

  1. Use the command line utilities to transport data.
  2. Use the library in your Go program.
  3. Use a YAML configuration file to define your data pipelines (see the sketch after this list).
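
The configuration schema itself is not documented in this README (though there is a validate_config command for checking config files), so the following is only a hypothetical sketch of what a pipeline definition could look like. Every key name here is illustrative, not ArrowArc's actual schema:

# Hypothetical pipeline definition. Key names are illustrative only,
# not ArrowArc's actual configuration schema.
pipeline:
  name: bigquery_to_duckdb
  source:
    type: bigquery
    project_id: my-project
    dataset_id: my_dataset
    table_id: my_table
  sink:
    type: duckdb
    path: ./warehouse.duckdb
    table: my_table
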
Command Line Utilities

Use the arrowarc command to get started. It will display a help menu with available commands, including demos and benchmarks.
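
Running the binary with no arguments prints that menu:

$ arrowarc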

Go Library

Example of setting up a pipeline to transport data from BigQuery to DuckDB:


// Set up the BigQuery client and reader
bq, err := integrations.NewBigQueryReadClient(ctx)
if err != nil {
    log.Fatalf("Failed to create BigQuery read client: %v", err)
}
reader, err := bq.NewBigQueryReader(ctx, projectID, datasetID, tableID)
if err != nil {
    log.Fatalf("Failed to create BigQuery reader: %v", err)
}

// Set up the DuckDB connection and writer
duck, err := integrations.OpenDuckDBConnection(ctx, dbFilePath)
if err != nil {
    log.Fatalf("Failed to open DuckDB connection: %v", err)
}
writer, err := integrations.NewDuckDBRecordWriter(ctx, duck, tableID)
if err != nil {
    log.Fatalf("Failed to create DuckDB writer: %v", err)
}

// Create the data pipeline
p, err := pipeline.NewDataPipeline(reader, writer)
if err != nil {
    log.Fatalf("Failed to create pipeline: %v", err)
}

// Start the pipeline
if err := p.Start(ctx); err != nil {
    log.Fatalf("Failed to start pipeline: %v", err)
}

// Wait for the pipeline to finish
if pipelineErr := <-p.Done(); pipelineErr != nil {
    log.Fatalf("Pipeline encountered an error: %v", pipelineErr)
}

// Print the transport report
fmt.Println(p.Report())

You can expect a report similar to this:

{
  "start_time": "2024-08-31T10:22:23-05:00",
  "end_time": "2024-08-31T10:22:26-05:00",
  "records_processed": 4000000,
  "total_size": "0.63 GB",
  "total_duration": "3.34s",
  "throughput": "1197492.21 records/s",
  "throughput_size": "194.11 MB/s"
}
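
Since the report is printed as JSON, it can also be consumed programmatically. A minimal sketch, assuming you capture the JSON string (for example, the output of p.Report()); the ReportSummary struct below is hypothetical, not a type ArrowArc necessarily exports, with JSON tags mirroring the keys shown above:

package main

import (
    "encoding/json"
    "fmt"
    "log"
)

// ReportSummary is a hypothetical struct; its JSON tags mirror
// the keys in the report shown above.
type ReportSummary struct {
    StartTime        string `json:"start_time"`
    EndTime          string `json:"end_time"`
    RecordsProcessed int64  `json:"records_processed"`
    TotalSize        string `json:"total_size"`
    TotalDuration    string `json:"total_duration"`
    Throughput       string `json:"throughput"`
    ThroughputSize   string `json:"throughput_size"`
}

func main() {
    // In practice this string would come from p.Report().
    report := `{"records_processed": 4000000, "total_duration": "3.34s"}`

    var s ReportSummary
    if err := json.Unmarshal([]byte(report), &s); err != nil {
        log.Fatalf("Failed to decode report: %v", err)
    }
    fmt.Printf("%d records in %s\n", s.RecordsProcessed, s.TotalDuration)
}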

Features

CLI Utilities:

  - Transport Table
  - Rewrite Parquet
  - Generate Parquet
  - Generate IPC
  - Avro To Parquet
  - CSV To Parquet
  - CSV To JSON
  - JSON To Parquet
  - Parquet to CSV
  - Parquet to JSON
  - Flight Server
  - Sync Table
  - Validate Table

Integrations

Database Integrations (🚧 = in progress):

  - PostgreSQL 🚧
  - BigQuery
  - DuckDB
  - Spanner
  - CockroachDB 🚧
  - MySQL 🚧
  - Oracle
  - Snowflake
  - SQLite
  - Flight

Cloud Storage Integrations:

  - Google Cloud Storage (GCS)
  - Amazon S3
  - Azure Blob Storage

Filesystem Formats:

  - Parquet
  - Avro
  - CSV
  - JSON
  - IPC
  - Iceberg

Contributing

We welcome all contributions. Please see the Code of Conduct.

License

ArrowArc is licensed under the MIT License. Please see the LICENSE file for more details.

Directories

cmd
  arrowarc (command)
  avro_to_parquet (command)
  csv_to_json (command)
  csv_to_parquet (command)
  flight (command)
  parquet_to_csv (command)
  parquet_to_json (command)
  rewrite_parquet (command)
  validate_config (command)
integrations
  flight/sqlite: Package example contains a FlightSQL Server implementation using sqlite as the backing engine.
  gcs
internal
  arrjson: Package arrjson provides types and functions to encode and decode ARROW types and data to and from JSON files.
  cli
  debug: Package debug provides APIs for conditional runtime assertions and debug logging.
  ui
pkg
  csv
