Parity
Parity is a high-performance dataset comparison tool that detects and reports differences between large datasets. Built on Apache Arrow's in-memory columnar format, it compares massive datasets quickly while keeping memory usage under control.
Features
- High-Speed Dataset Diffing: Compare large datasets efficiently using vectorized, batch-wise operations
- Multiple Data Sources: Support for Arrow IPC, Parquet, CSV files, and ADBC-compatible databases
- Comprehensive Diff Reports: Identify added, deleted, and modified records with column-level detail
- Arrow-Powered Analysis: Leverage Arrow's in-memory columnar format for high-performance operations
- Streaming Processing: Handle multi-terabyte datasets without loading them entirely into memory
- Parallel Execution: Utilize Go's concurrency model for processing partitions simultaneously
- Flexible Output: Export results in various formats including Arrow IPC, Parquet, JSON, Markdown, and HTML
Installation
To install Parity, use Go 1.24 or later:
```shell
go install github.com/TFMV/parity/cmd/parity@latest
```
Or clone the repository and build from source:
```shell
git clone https://github.com/TFMV/parity.git
cd parity
go build ./cmd/parity
```
Quick Start
Basic Comparison
Compare two Parquet files:
```shell
parity diff data/source.parquet data/target.parquet
```
Compare with specific key columns:
```shell
parity diff --key id,timestamp data/source.parquet data/target.parquet
```
Export differences to a Parquet file:
```shell
parity diff --output diffs.parquet source.parquet target.parquet
```
Advanced Usage
Compare with a tolerance for numeric values:
```shell
parity diff --tolerance 0.0001 --key id financial_data_v1.parquet financial_data_v2.parquet
```
Ignore specific columns in comparison:
```shell
parity diff --ignore updated_at,metadata source.parquet target.parquet
```
Change output format:
```shell
parity diff --format json --output diffs.json source.parquet target.parquet
```
Architecture
Parity is designed with a modular architecture that separates concerns:
- Core: Core types and interfaces for dataset operations
- Readers: Implementations for reading from various data sources
- Writers: Implementations for writing data to various formats
- Diff: Dataset comparison algorithms and implementations
- Util: Utility functions and helpers
- CLI: Command-line interface
Dataset Readers
- ParquetReader: Reads data from Parquet files
- ArrowReader: Reads data from Arrow IPC files
- CSVReader: Reads and converts CSV data to Arrow format
Dataset Writers
- ParquetWriter: Writes data to Parquet files
- ArrowWriter: Writes data to Arrow IPC files
- JSONWriter: Writes data to JSON files
Diff Engines
- ArrowDiffer: Uses Arrow's in-memory columnar format for efficient dataset comparison
Technical Details
Arrow Diffing Process
The Arrow differ works by:
- Loading input datasets into memory as Arrow records
- Building key arrays for efficient record matching
- Comparing columns with type-aware logic and customizable tolerance
- Identifying added, deleted, and modified records
- Producing detailed output with indicators for which fields were modified
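In simplified form, the key-matching and classification steps above can be sketched as follows. This is an illustration only: plain Go maps stand in for Arrow key arrays, and the `Row` type and field names are hypothetical, not Parity's actual types.

```go
package main

import "fmt"

// Row is a simplified stand-in for one record; Parity itself
// operates on Arrow columnar data, not Go maps.
type Row map[string]any

// diff classifies target rows against source rows by key, returning
// added keys, deleted keys, and a map from modified key to the list
// of fields that changed.
func diff(source, target map[string]Row) (added, deleted []string, modified map[string][]string) {
	modified = map[string][]string{}
	// Keys present in source but missing from target were deleted.
	for key := range source {
		if _, ok := target[key]; !ok {
			deleted = append(deleted, key)
		}
	}
	for key, t := range target {
		s, ok := source[key]
		if !ok {
			// Keys present only in target were added.
			added = append(added, key)
			continue
		}
		// Matched keys: record which fields differ.
		var changed []string
		for col, v := range t {
			if s[col] != v {
				changed = append(changed, col)
			}
		}
		if len(changed) > 0 {
			modified[key] = changed
		}
	}
	return added, deleted, modified
}

func main() {
	source := map[string]Row{
		"1": {"name": "alice", "score": 10},
		"2": {"name": "bob", "score": 20},
	}
	target := map[string]Row{
		"1": {"name": "alice", "score": 15},
		"3": {"name": "carol", "score": 30},
	}
	added, deleted, modified := diff(source, target)
	fmt.Println(added, deleted, modified)
}
```

The real differ performs the same classification over Arrow record batches, where key lookup and column comparison are vectorized rather than done row by row.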
The process is highly optimized for both memory usage and performance, with features like:
- Streaming record processing to manage memory footprint
- Efficient key-based record matching
- Type-aware comparisons with customizable tolerance for floating-point values
- Parallel comparison of records with configurable worker pools
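The last two points can be illustrated with a small sketch: a tolerance-aware float comparison fanned out across a worker pool of goroutines. The `pair` type, worker count, and tolerance value here are illustrative assumptions, not Parity's internals.

```go
package main

import (
	"fmt"
	"math"
	"sync"
)

// pair holds the source and target values for one key-matched record.
type pair struct {
	id             string
	source, target float64
}

// withinTolerance reports whether two floats are equal up to tol --
// the kind of type-aware check applied to floating-point columns.
func withinTolerance(a, b, tol float64) bool {
	return math.Abs(a-b) <= tol
}

// findModified compares pairs concurrently using `workers` goroutines
// and returns the ids whose values differ beyond the tolerance.
func findModified(pairs []pair, tol float64, workers int) []string {
	jobs := make(chan pair)
	var (
		mu       sync.Mutex
		modified []string
		wg       sync.WaitGroup
	)
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for p := range jobs {
				if !withinTolerance(p.source, p.target, tol) {
					mu.Lock()
					modified = append(modified, p.id)
					mu.Unlock()
				}
			}
		}()
	}
	for _, p := range pairs {
		jobs <- p
	}
	close(jobs)
	wg.Wait()
	return modified
}

func main() {
	pairs := []pair{
		{"a", 1.00000, 1.00005}, // within tolerance
		{"b", 2.0, 2.5},         // genuine difference
	}
	fmt.Println(findModified(pairs, 0.0001, 4))
}
```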
Arrow Optimizations
Parity leverages Arrow's strengths:
- Zero-copy operations where possible
- Columnar data representation for efficient comparison
- Vectorized operations for high throughput
- Memory-efficient data structures
Development
Prerequisites
- Go 1.24 or later
- Apache Arrow libraries
Building
```shell
go build ./cmd/parity
```
Testing
```shell
go test ./...
```
Adding New Readers/Writers
To add a new data source reader, implement the core.DatasetReader interface:
```go
type DatasetReader interface {
    Read(ctx context.Context) (arrow.Record, error)
    Schema() *arrow.Schema
    Close() error
}
```
To add a new output format writer, implement the core.DatasetWriter interface:
```go
type DatasetWriter interface {
    Write(ctx context.Context, record arrow.Record) error
    Close() error
}
```
License
Parity is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
- Apache Arrow - For the Arrow columnar memory format and efficient data processing capabilities