parity

module v0.1.0
Published: Mar 20, 2025 License: MIT
Parity

Parity is a high-performance dataset comparison tool that detects and reports differences between large datasets. Built on Apache Arrow's in-memory columnar format, it can process massive datasets quickly and with a bounded memory footprint.

Features

  • High-Speed Dataset Diffing: Compare large datasets efficiently using vectorized, batch-wise operations
  • Multiple Data Sources: Support for Arrow IPC, Parquet, CSV files, and ADBC-compatible databases
  • Comprehensive Diff Reports: Identify added, deleted, and modified records with column-level detail
  • Arrow-Powered Analysis: Leverage Arrow's in-memory columnar format for high-performance operations
  • Streaming Processing: Handle multi-terabyte datasets without loading them entirely into memory
  • Parallel Execution: Utilize Go's concurrency model for processing partitions simultaneously
  • Flexible Output: Export results in various formats including Arrow IPC, Parquet, JSON, Markdown, and HTML

Installation

To install Parity, use Go 1.24 or later:

go install github.com/TFMV/parity/cmd/parity@latest

Or clone the repository and build from source:

git clone https://github.com/TFMV/parity.git
cd parity
go build ./cmd/parity

Quick Start

Basic Comparison

Compare two Parquet files:

parity diff data/source.parquet data/target.parquet

Compare with specific key columns:

parity diff --key id,timestamp data/source.parquet data/target.parquet

Export differences to a Parquet file:

parity diff --output diffs.parquet source.parquet target.parquet

Advanced Usage

Compare with a tolerance for numeric values:

parity diff --tolerance 0.0001 --key id financial_data_v1.parquet financial_data_v2.parquet
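
The --tolerance flag treats two numeric values as equal when their difference stays within the given threshold. A minimal sketch of that semantics in Go (withinTolerance is a hypothetical helper for illustration, not part of Parity's API):

```go
package main

import (
	"fmt"
	"math"
)

// withinTolerance reports whether two floating-point values should be
// considered equal under an absolute tolerance, mirroring the intent
// of the --tolerance flag. (Illustrative helper, not Parity's code.)
func withinTolerance(a, b, tol float64) bool {
	return math.Abs(a-b) <= tol
}

func main() {
	// Differ by 1e-5, inside a 1e-4 tolerance.
	fmt.Println(withinTolerance(100.00001, 100.00002, 0.0001))
	// Differ by 1e-2, outside a 1e-4 tolerance.
	fmt.Println(withinTolerance(100.0, 100.01, 0.0001))
}
```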

Ignore specific columns in comparison:

parity diff --ignore updated_at,metadata source.parquet target.parquet

Change output format:

parity diff --format json --output diffs.json source.parquet target.parquet

Architecture

Parity is designed with a modular architecture that separates different concerns:

  1. Core: Core types and interfaces for dataset operations
  2. Readers: Implementations for reading from various data sources
  3. Writers: Implementations for writing data to various formats
  4. Diff: Dataset comparison algorithms and implementations
  5. Util: Utility functions and helpers
  6. CLI: Command-line interface
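
One way to picture how these layers compose is a reader → diff engine → writer pipeline, which is the flow the CLI drives for `parity diff`. The sketch below uses simplified stand-in types (a plain Row map instead of arrow.Record); the names are illustrative, not Parity's actual API:

```go
package main

import "fmt"

// Row stands in for an Arrow record batch in this simplified sketch;
// the real interfaces in pkg/core operate on arrow.Record values.
type Row map[string]string

// Reader and Writer mirror the Readers and Writers layers described
// above. (Illustrative stand-ins, not Parity's actual types.)
type Reader interface{ Read() []Row }
type Writer interface{ Write(rows []Row) }

type memReader struct{ rows []Row }

func (r memReader) Read() []Row { return r.rows }

type stdoutWriter struct{}

func (stdoutWriter) Write(rows []Row) {
	for _, row := range rows {
		fmt.Println(row)
	}
}

// addedRows is a trivial diff engine: rows present in target but not
// in source, matched on the "id" column.
func addedRows(source, target []Row) []Row {
	seen := map[string]bool{}
	for _, r := range source {
		seen[r["id"]] = true
	}
	var out []Row
	for _, r := range target {
		if !seen[r["id"]] {
			out = append(out, r)
		}
	}
	return out
}

// runDiff wires reader -> diff engine -> writer.
func runDiff(src, dst Reader, diff func(a, b []Row) []Row, out Writer) {
	out.Write(diff(src.Read(), dst.Read()))
}

func main() {
	src := memReader{rows: []Row{{"id": "1"}}}
	dst := memReader{rows: []Row{{"id": "1"}, {"id": "2"}}}
	runDiff(src, dst, addedRows, stdoutWriter{}) // prints the row with id 2
}
```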

Dataset Readers

  • ParquetReader: Reads data from Parquet files
  • ArrowReader: Reads data from Arrow IPC files
  • CSVReader: Reads and converts CSV data to Arrow format

Dataset Writers

  • ParquetWriter: Writes data to Parquet files
  • ArrowWriter: Writes data to Arrow IPC files
  • JSONWriter: Writes data to JSON files

Diff Engines

  • ArrowDiffer: Uses Arrow's in-memory columnar format for efficient dataset comparison

Technical Details

Arrow Diffing Process

The Arrow differ works by:

  1. Loading input datasets into memory as Arrow records
  2. Building key arrays for efficient record matching
  3. Comparing columns with type-aware logic and customizable tolerance
  4. Identifying added, deleted, and modified records
  5. Producing detailed output with indicators for which fields were modified

The process is highly optimized for both memory usage and performance, with features like:

  • Streaming record processing to manage memory footprint
  • Efficient key-based record matching
  • Type-aware comparisons with customizable tolerance for floating-point values
  • Parallel comparison of records with configurable worker pools
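
The classification steps above (key matching, then sorting records into added, deleted, and modified) can be sketched with plain Go maps. Rows are modeled as string maps here rather than Arrow records, and all names are illustrative:

```go
package main

import (
	"fmt"
	"reflect"
)

// Row models a single record; Parity runs the same logic over Arrow
// columnar batches rather than per-row maps.
type Row map[string]string

// DiffResult buckets records the way a diff report does.
type DiffResult struct {
	Added, Deleted []Row
	Modified       []string // key values whose other columns changed
}

// diff matches rows on a key column, then classifies each record as
// added, deleted, or modified. (Illustrative sketch, not Parity's code.)
func diff(source, target []Row, key string) DiffResult {
	var res DiffResult
	src := map[string]Row{}
	for _, r := range source {
		src[r[key]] = r
	}
	seen := map[string]bool{}
	for _, t := range target {
		k := t[key]
		seen[k] = true
		s, ok := src[k]
		switch {
		case !ok:
			res.Added = append(res.Added, t) // only in target
		case !reflect.DeepEqual(s, t):
			res.Modified = append(res.Modified, k)
		}
	}
	for _, s := range source {
		if !seen[s[key]] {
			res.Deleted = append(res.Deleted, s) // only in source
		}
	}
	return res
}

func main() {
	source := []Row{{"id": "1", "v": "a"}, {"id": "2", "v": "b"}}
	target := []Row{{"id": "2", "v": "B"}, {"id": "3", "v": "c"}}
	// id 3 is added, id 1 is deleted, id 2 is modified.
	fmt.Println(diff(source, target, "id"))
}
```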

Arrow Optimizations

Parity leverages Arrow's strengths:

  • Zero-copy operations where possible
  • Columnar data representation for efficient comparison
  • Vectorized operations for high throughput
  • Memory-efficient data structures
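
The columnar layout is what makes vectorized comparison cheap: a whole column can be scanned in one tight, cache-friendly loop instead of hopping between rows. A stdlib-only sketch of the idea (diffColumn is a hypothetical helper; Parity operates on Arrow arrays, not plain slices):

```go
package main

import "fmt"

// diffColumn compares two equally-ordered float64 columns in a single
// pass and returns the row indices that differ by more than tol.
// Scanning one contiguous column at a time is the access pattern the
// Arrow columnar format makes efficient. (Illustrative sketch only.)
func diffColumn(a, b []float64, tol float64) []int {
	var changed []int
	for i := range a {
		d := a[i] - b[i]
		if d < 0 {
			d = -d
		}
		if d > tol {
			changed = append(changed, i)
		}
	}
	return changed
}

func main() {
	a := []float64{1.0, 2.0, 3.0}
	b := []float64{1.0, 2.5, 3.0}
	fmt.Println(diffColumn(a, b, 1e-9)) // only index 1 differs
}
```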

Development

Prerequisites
  • Go 1.24 or later
  • Apache Arrow libraries
Building

go build ./cmd/parity

Testing

go test ./...

Adding New Readers/Writers

To add a new data source reader, implement the core.DatasetReader interface:

type DatasetReader interface {
    Read(ctx context.Context) (arrow.Record, error)
    Schema() *arrow.Schema
    Close() error
}

To add a new output format writer, implement the core.DatasetWriter interface:

type DatasetWriter interface {
    Write(ctx context.Context, record arrow.Record) error
    Close() error
}

License

Parity is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Apache Arrow - For the Arrow columnar memory format and efficient data processing capabilities

Directories

Path Synopsis
cmd
parity command
Package main provides the entry point for the Parity dataset comparison tool.
pkg
core
Package core provides the core types and interfaces for the Parity dataset comparison tool.
diff
Package diff provides implementations for computing differences between datasets.
readers
Package readers provides implementations of dataset readers for various data sources.
writers
Package writers provides implementations of dataset writers for various data formats.
