parity

module v0.1.0
Published: Mar 20, 2025 License: MIT
Parity

Parity is a high-performance dataset comparison tool that detects and reports differences between large datasets. Built on Apache Arrow's in-memory columnar format, it can process massive datasets quickly and with a bounded memory footprint.

Features

  • High-Speed Dataset Diffing: Compare large datasets efficiently using vectorized, batch-wise operations
  • Multiple Data Sources: Support for Arrow IPC, Parquet, CSV files, and ADBC-compatible databases
  • Comprehensive Diff Reports: Identify added, deleted, and modified records with column-level detail
  • Arrow-Powered Analysis: Leverage Arrow's in-memory columnar format for high-performance operations
  • Streaming Processing: Handle multi-terabyte datasets without loading them entirely into memory
  • Parallel Execution: Utilize Go's concurrency model for processing partitions simultaneously
  • Flexible Output: Export results in various formats including Arrow IPC, Parquet, JSON, Markdown, and HTML

Installation

To install Parity, use Go 1.24 or later:

go install github.com/TFMV/parity/cmd/parity@latest

Or clone the repository and build from source:

git clone https://github.com/TFMV/parity.git
cd parity
go build ./cmd/parity

Quick Start

Basic Comparison

Compare two Parquet files:

parity diff data/source.parquet data/target.parquet

Compare with specific key columns:

parity diff --key id,timestamp data/source.parquet data/target.parquet

Export differences to a Parquet file:

parity diff --output diffs.parquet source.parquet target.parquet

Advanced Usage

Compare with a tolerance for numeric values:

parity diff --tolerance 0.0001 --key id financial_data_v1.parquet financial_data_v2.parquet
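
The --tolerance flag treats two numeric values as equal when their difference stays within the given threshold. A minimal sketch of that semantics in Go (withinTolerance is a hypothetical helper for illustration, not part of Parity's API):

```go
package main

import (
	"fmt"
	"math"
)

// withinTolerance reports whether two floating-point values should be
// considered equal under an absolute tolerance, mirroring the intent
// of the --tolerance flag. (Illustrative helper, not Parity's code.)
func withinTolerance(a, b, tol float64) bool {
	return math.Abs(a-b) <= tol
}

func main() {
	// Differ by 1e-5, inside a 1e-4 tolerance.
	fmt.Println(withinTolerance(100.00001, 100.00002, 0.0001))
	// Differ by 1e-2, outside a 1e-4 tolerance.
	fmt.Println(withinTolerance(100.0, 100.01, 0.0001))
}
```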

Ignore specific columns in comparison:

parity diff --ignore updated_at,metadata source.parquet target.parquet

Change output format:

parity diff --format json --output diffs.json source.parquet target.parquet

Architecture

Parity is designed with a modular architecture that separates different concerns:

  1. Core: Core types and interfaces for dataset operations
  2. Readers: Implementations for reading from various data sources
  3. Writers: Implementations for writing data to various formats
  4. Diff: Dataset comparison algorithms and implementations
  5. Util: Utility functions and helpers
  6. CLI: Command-line interface
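
One way to picture how these layers compose is a reader → diff engine → writer pipeline, which is the flow the CLI drives for `parity diff`. The sketch below uses simplified stand-in types (a plain Row map instead of arrow.Record); the names are illustrative, not Parity's actual API:

```go
package main

import "fmt"

// Row stands in for an Arrow record batch in this simplified sketch;
// the real interfaces in pkg/core operate on arrow.Record values.
type Row map[string]string

// Reader and Writer mirror the Readers and Writers layers described
// above. (Illustrative stand-ins, not Parity's actual types.)
type Reader interface{ Read() []Row }
type Writer interface{ Write(rows []Row) }

type memReader struct{ rows []Row }

func (r memReader) Read() []Row { return r.rows }

type stdoutWriter struct{}

func (stdoutWriter) Write(rows []Row) {
	for _, row := range rows {
		fmt.Println(row)
	}
}

// addedRows is a trivial diff engine: rows present in target but not
// in source, matched on the "id" column.
func addedRows(source, target []Row) []Row {
	seen := map[string]bool{}
	for _, r := range source {
		seen[r["id"]] = true
	}
	var out []Row
	for _, r := range target {
		if !seen[r["id"]] {
			out = append(out, r)
		}
	}
	return out
}

// runDiff wires reader -> diff engine -> writer.
func runDiff(src, dst Reader, diff func(a, b []Row) []Row, out Writer) {
	out.Write(diff(src.Read(), dst.Read()))
}

func main() {
	src := memReader{rows: []Row{{"id": "1"}}}
	dst := memReader{rows: []Row{{"id": "1"}, {"id": "2"}}}
	runDiff(src, dst, addedRows, stdoutWriter{}) // prints the row with id 2
}
```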

Dataset Readers

  • ParquetReader: Reads data from Parquet files
  • ArrowReader: Reads data from Arrow IPC files
  • CSVReader: Reads and converts CSV data to Arrow format

Dataset Writers

  • ParquetWriter: Writes data to Parquet files
  • ArrowWriter: Writes data to Arrow IPC files
  • JSONWriter: Writes data to JSON files

Diff Engines

  • ArrowDiffer: Uses Arrow's in-memory columnar format for efficient dataset comparison

Technical Details

Arrow Diffing Process

The Arrow differ works by:

  1. Loading input datasets into memory as Arrow records
  2. Building key arrays for efficient record matching
  3. Comparing columns with type-aware logic and customizable tolerance
  4. Identifying added, deleted, and modified records
  5. Producing detailed output with indicators for which fields were modified

The process is highly optimized for both memory usage and performance, with features like:

  • Streaming record processing to manage memory footprint
  • Efficient key-based record matching
  • Type-aware comparisons with customizable tolerance for floating-point values
  • Parallel comparison of records with configurable worker pools
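
The classification steps above (key matching, then sorting records into added, deleted, and modified) can be sketched with plain Go maps. Rows are modeled as string maps here rather than Arrow records, and all names are illustrative:

```go
package main

import (
	"fmt"
	"reflect"
)

// Row models a single record; Parity runs the same logic over Arrow
// columnar batches rather than per-row maps.
type Row map[string]string

// DiffResult buckets records the way a diff report does.
type DiffResult struct {
	Added, Deleted []Row
	Modified       []string // key values whose other columns changed
}

// diff matches rows on a key column, then classifies each record as
// added, deleted, or modified. (Illustrative sketch, not Parity's code.)
func diff(source, target []Row, key string) DiffResult {
	var res DiffResult
	src := map[string]Row{}
	for _, r := range source {
		src[r[key]] = r
	}
	seen := map[string]bool{}
	for _, t := range target {
		k := t[key]
		seen[k] = true
		s, ok := src[k]
		switch {
		case !ok:
			res.Added = append(res.Added, t) // only in target
		case !reflect.DeepEqual(s, t):
			res.Modified = append(res.Modified, k)
		}
	}
	for _, s := range source {
		if !seen[s[key]] {
			res.Deleted = append(res.Deleted, s) // only in source
		}
	}
	return res
}

func main() {
	source := []Row{{"id": "1", "v": "a"}, {"id": "2", "v": "b"}}
	target := []Row{{"id": "2", "v": "B"}, {"id": "3", "v": "c"}}
	// id 3 is added, id 1 is deleted, id 2 is modified.
	fmt.Println(diff(source, target, "id"))
}
```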

Arrow Optimizations

Parity leverages Arrow's strengths:

  • Zero-copy operations where possible
  • Columnar data representation for efficient comparison
  • Vectorized operations for high throughput
  • Memory-efficient data structures
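
The columnar layout is what makes vectorized comparison cheap: a whole column can be scanned in one tight, cache-friendly loop instead of hopping between rows. A stdlib-only sketch of the idea (diffColumn is a hypothetical helper; Parity operates on Arrow arrays, not plain slices):

```go
package main

import "fmt"

// diffColumn compares two equally-ordered float64 columns in a single
// pass and returns the row indices that differ by more than tol.
// Scanning one contiguous column at a time is the access pattern the
// Arrow columnar format makes efficient. (Illustrative sketch only.)
func diffColumn(a, b []float64, tol float64) []int {
	var changed []int
	for i := range a {
		d := a[i] - b[i]
		if d < 0 {
			d = -d
		}
		if d > tol {
			changed = append(changed, i)
		}
	}
	return changed
}

func main() {
	a := []float64{1.0, 2.0, 3.0}
	b := []float64{1.0, 2.5, 3.0}
	fmt.Println(diffColumn(a, b, 1e-9)) // only index 1 differs
}
```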

Development

Prerequisites
  • Go 1.24 or later
  • Apache Arrow libraries
Building

go build ./cmd/parity

Testing

go test ./...

Adding New Readers/Writers

To add a new data source reader, implement the core.DatasetReader interface:

type DatasetReader interface {
    Read(ctx context.Context) (arrow.Record, error)
    Schema() *arrow.Schema
    Close() error
}

To add a new output format writer, implement the core.DatasetWriter interface:

type DatasetWriter interface {
    Write(ctx context.Context, record arrow.Record) error
    Close() error
}

License

Parity is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Apache Arrow - For the Arrow columnar memory format and efficient data processing capabilities

Directories

Path Synopsis
cmd
parity command
Package main provides the entry point for the Parity dataset comparison tool.
pkg
core
Package core provides the core types and interfaces for the Parity dataset comparison tool.
diff
Package diff provides implementations for computing differences between datasets.
readers
Package readers provides implementations of dataset readers for various data sources.
writers
Package writers provides implementations of dataset writers for various data formats.
