ArrowArc


ArrowArc is an experimental data transport library that uses Apache Arrow for high-performance data manipulation. It is designed to be zero-code, zero-config, and zero-maintenance.

Benchmarks

I'll add more benchmarks as I stabilize the library.

Transport Postgres to Parquet

Transport 4 million records from Postgres to Parquet in under 3 seconds.

{
  "StartTime": "2024-08-31T13:10:54-05:00",
  "EndTime": "2024-08-31T13:10:57-05:00",
  "RecordsProcessed": 4000000,
  "Throughput": "1337039.12 records/second",
  "ThroughputBytes": "172.16 MB/second",
  "TotalBytes": "515.05 MB",
  "TotalDuration": "2.992s"
}
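
The throughput figures are simply the record and byte counts divided by the elapsed wall-clock time. A minimal sketch of that arithmetic in Go, using the numbers from the report above (the duration shown there is rounded, so the printed values differ slightly from the reported ones):

package main

import "fmt"

func main() {
    // Numbers taken from the benchmark report above.
    records := 4_000_000.0
    totalMB := 515.05
    durationSec := 2.992

    fmt.Printf("Throughput: %.2f records/second\n", records/durationSec)
    fmt.Printf("ThroughputBytes: %.2f MB/second\n", totalMB/durationSec)
}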

Getting Started

There are several ways to use ArrowArc:

  1. Use the command line utilities to transport data.
  2. Use the library in your Go program.
  3. Use a YAML configuration file to define your data pipelines (see the sketch after this list).
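
The configuration schema itself is not documented in this README (though there is a validate_config command for checking config files), so the following is only a hypothetical sketch of what a pipeline definition could look like. Every key name here is illustrative, not ArrowArc's actual schema:

# Hypothetical pipeline definition. Key names are illustrative only,
# not ArrowArc's actual configuration schema.
pipeline:
  name: bigquery_to_duckdb
  source:
    type: bigquery
    project_id: my-project
    dataset_id: my_dataset
    table_id: my_table
  sink:
    type: duckdb
    path: ./warehouse.duckdb
    table: my_table
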
Command Line Utilities

Use the arrowarc command to get started. It will display a help menu with available commands, including demos and benchmarks.
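
Running the binary with no arguments prints that menu:

$ arrowarc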

Go Library

Example of setting up a pipeline to transport data from BigQuery to DuckDB:


// Set up the BigQuery client and reader
bq, err := integrations.NewBigQueryReadClient(ctx)
if err != nil {
    log.Fatalf("Failed to create BigQuery read client: %v", err)
}
reader, err := bq.NewBigQueryReader(ctx, projectID, datasetID, tableID)
if err != nil {
    log.Fatalf("Failed to create BigQuery reader: %v", err)
}

// Set up the DuckDB connection and writer
duck, err := integrations.OpenDuckDBConnection(ctx, dbFilePath)
if err != nil {
    log.Fatalf("Failed to open DuckDB connection: %v", err)
}
writer, err := integrations.NewDuckDBRecordWriter(ctx, duck, tableID)
if err != nil {
    log.Fatalf("Failed to create DuckDB writer: %v", err)
}

// Create the data pipeline
p, err := pipeline.NewDataPipeline(reader, writer)
if err != nil {
    log.Fatalf("Failed to create pipeline: %v", err)
}

// Start the pipeline
if err := p.Start(ctx); err != nil {
    log.Fatalf("Failed to start pipeline: %v", err)
}

// Wait for the pipeline to finish
if pipelineErr := <-p.Done(); pipelineErr != nil {
    log.Fatalf("Pipeline encountered an error: %v", pipelineErr)
}

// Print the transport report
fmt.Println(p.Report())

You can expect a report similar to this:

{
  "start_time": "2024-08-31T10:22:23-05:00",
  "end_time": "2024-08-31T10:22:26-05:00",
  "records_processed": 4000000,
  "total_size": "0.63 GB",
  "total_duration": "3.34s",
  "throughput": "1197492.21 records/s",
  "throughput_size": "194.11 MB/s"
}
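
Since the report is printed as JSON, it can also be consumed programmatically. A minimal sketch, assuming you capture the JSON string (for example, the output of p.Report()); the ReportSummary struct below is hypothetical, not a type ArrowArc necessarily exports, with JSON tags mirroring the keys shown above:

package main

import (
    "encoding/json"
    "fmt"
    "log"
)

// ReportSummary is a hypothetical struct; its JSON tags mirror
// the keys in the report shown above.
type ReportSummary struct {
    StartTime        string `json:"start_time"`
    EndTime          string `json:"end_time"`
    RecordsProcessed int64  `json:"records_processed"`
    TotalSize        string `json:"total_size"`
    TotalDuration    string `json:"total_duration"`
    Throughput       string `json:"throughput"`
    ThroughputSize   string `json:"throughput_size"`
}

func main() {
    // In practice this string would come from p.Report().
    report := `{"records_processed": 4000000, "total_duration": "3.34s"}`

    var s ReportSummary
    if err := json.Unmarshal([]byte(report), &s); err != nil {
        log.Fatalf("Failed to decode report: %v", err)
    }
    fmt.Printf("%d records in %s\n", s.RecordsProcessed, s.TotalDuration)
}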

Features

CLI Utilities:

  - Transport Table
  - Rewrite Parquet
  - Generate Parquet
  - Generate IPC
  - Avro To Parquet
  - CSV To Parquet
  - CSV To JSON
  - JSON To Parquet
  - Parquet to CSV
  - Parquet to JSON
  - Flight Server
  - Sync Table
  - Validate Table

Integrations

Database Integrations (🚧 = in progress):

  - PostgreSQL 🚧
  - BigQuery
  - DuckDB
  - Spanner
  - CockroachDB 🚧
  - MySQL 🚧
  - Oracle
  - Snowflake
  - SQLite
  - Flight

Cloud Storage Integrations:

  - Google Cloud Storage (GCS)
  - Amazon S3
  - Azure Blob Storage

Filesystem Formats:

  - Parquet
  - Avro
  - CSV
  - JSON
  - IPC
  - Iceberg

Contributing

We welcome all contributions. Please see the Code of Conduct.

License

ArrowArc is licensed under the MIT License. Please see the LICENSE file for more details.

Directories

cmd
  arrowarc (command)
  avro_to_parquet (command)
  csv_to_json (command)
  csv_to_parquet (command)
  flight (command)
  parquet_to_csv (command)
  parquet_to_json (command)
  rewrite_parquet (command)
  validate_config (command)
integrations
  flight/sqlite: Package example contains a FlightSQL Server implementation using sqlite as the backing engine.
  gcs
internal
  arrjson: Package arrjson provides types and functions to encode and decode ARROW types and data to and from JSON files.
  cli
  debug: Package debug provides APIs for conditional runtime assertions and debug logging.
  ui
pkg
  csv
