Icebox
A single-binary playground for Apache Iceberg
Five minutes to first query
Quick Start • Features • Examples • Usage Guide • Contributing
What is Icebox?
Icebox is a zero-configuration data lakehouse that gets you from zero to querying Iceberg tables in under five minutes. Perfect for:
- Experimenting with Apache Iceberg table format
- Learning lakehouse concepts and workflows
- Prototyping data pipelines locally
- Testing Iceberg integrations before production
No servers, no complex setup, no dependencies - just a single binary and your data.
Features
Zero-Setup Experience
- Single binary - No installation complexity
- Embedded catalog - SQLite-based, no external database needed
- REST catalog support - Connect to existing Iceberg REST catalogs
- Embedded MinIO server - S3-compatible storage for testing production workflows
- Local storage - File system integration out of the box
- In-memory filesystem - Lightning-fast testing and development workflows
- Auto-configuration - Sensible defaults, minimal configuration required
Data Operations
- Parquet import with automatic schema inference
- Demo datasets - NYC taxi data with realistic analytics examples
- Iceberg table creation and management
- Namespace organization and operations
- Pack/Unpack - Portable project archives for sharing and backup
- Arrow integration for efficient data processing
- Transaction support with proper ACID guarantees
SQL Querying
- DuckDB integration for high-performance analytics
- Interactive SQL shell with command history and multi-line support
- Time-travel queries - Query tables at any point in their history
- Multiple output formats - table, CSV, JSON
- Auto-registration of catalog tables for immediate querying
- Query performance metrics and optimization features
Demo & Learning
- One-command setup - Instant demo datasets with realistic data
- NYC taxi analytics - Real-world data with 5,000+ trip records
- Sample queries - Pre-built analytics examples for learning
- Partitioned datasets - Explore advanced Iceberg features
- Temporal analysis - Date-based queries and time-series operations
- Business intelligence examples - Revenue, vendor, and trend analysis
Developer-Friendly
- Rich CLI with intuitive commands and helpful output
- Comprehensive table operations - create, list, describe, history
- Namespace management for organized data governance
- Dry-run modes to preview operations
- YAML configuration for reproducible setups
Quick Start
1. Install Icebox
# Build from source (Go 1.21+ required)
git clone https://github.com/TFMV/icebox.git
cd icebox/icebox
go build -o icebox cmd/icebox/main.go
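The build step assumes Go 1.21 or newer. A small pre-flight check along these lines can fail early with a clear message instead of a compiler error. The helper name `go_version_ok` is hypothetical and not part of Icebox; the check relies on `go env GOVERSION`, which prints a string such as `go1.22.3`.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helper (not part of Icebox).
# Succeeds when a Go version string such as "go1.22.3" is at least 1.21.
go_version_ok() {
  local v major rest minor
  v="${1#go}"                 # "go1.22.3" -> "1.22.3"
  major="${v%%.*}"            # "1"
  rest="${v#*.}"              # "22.3"
  minor="${rest%%.*}"         # "22"
  minor="${minor%%[!0-9]*}"   # "21rc1" -> "21"
  [ "$major" -gt 1 ] 2>/dev/null && return 0
  [ "$major" -eq 1 ] 2>/dev/null && [ "${minor:-0}" -ge 21 ] 2>/dev/null
}

# Only run the check when a Go toolchain is on PATH.
if command -v go >/dev/null 2>&1; then
  if go_version_ok "$(go env GOVERSION)"; then
    echo "Go toolchain is new enough to build Icebox"
  else
    echo "Go 1.21+ is required; found $(go env GOVERSION)" >&2
  fi
fi
```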
2. Try the Demo (Fastest Path!)
# Create demo project with NYC taxi data
./icebox init taxi-demo
cd taxi-demo
# Set up demo dataset (one command!)
./icebox demo
Demo setup complete!
Try these commands:
NYC Taxi:
# Count total number of taxi trips
icebox sql "SELECT COUNT(*) as total_trips FROM nyc_taxi"
# Calculate average fare amount
icebox sql "SELECT AVG(fare_amount) as avg_fare FROM nyc_taxi WHERE fare_amount > 0"
# Compare taxi vendors by performance
icebox sql "SELECT vendor_name, COUNT(*) as trips, AVG(fare_amount) as avg_fare FROM nyc_taxi WHERE vendor_name IS NOT NULL GROUP BY vendor_name"
3. Start Querying Real Data
# Query your demo data
./icebox sql "SELECT COUNT(*) as total_trips FROM nyc_taxi"
Registered 1 tables for querying
Query executed in 45ms
1 rows returned
┌─────────────┐
│ total_trips │
├─────────────┤
│ 5476        │
└─────────────┘
# Analyze payment methods
./icebox sql "SELECT payment_type, COUNT(*) as count FROM nyc_taxi GROUP BY payment_type"
Query executed in 67ms
2 rows returned
┌──────────────┬───────┐
│ payment_type │ count │
├──────────────┼───────┤
│ Credit       │ 3891  │
│ Cash         │ 1585  │
└──────────────┴───────┘
# Use the interactive shell for complex analysis
./icebox shell
Icebox SQL Shell v0.1.0
Interactive SQL querying for Apache Iceberg
Type \help for help, \quit to exit
icebox> SELECT vendor_name, AVG(trip_distance) as avg_distance FROM nyc_taxi GROUP BY vendor_name;
Query executed in 23ms
2 rows returned
┌─────────────┬──────────────┐
│ vendor_name │ avg_distance │
├─────────────┼──────────────┤
│ VTS         │ 2.87         │
│ CMT         │ 3.12         │
└─────────────┴──────────────┘
icebox> \quit
4. Import Your Own Data
# Import your own Parquet files
./icebox import sales_data.parquet --table sales
Successfully imported table!
Import Results:
Table: [default sales]
Records: 1,000,000
Size: 45.2 MB
Location: file:///.icebox/data/default/sales
You now have a working Iceberg lakehouse with real data and SQL querying!
New Features
Demo Datasets
Get started immediately with realistic data and pre-built analytics examples:
# Set up demo environment with default name
./icebox init
cd icebox-lakehouse
./icebox demo
# Or set up with custom name
./icebox init my-demo
cd my-demo
./icebox demo
# Explore demo datasets
./icebox demo --list
Available Demo Datasets:
**taxi**
Description: NYC Taxi trip data with partitioning by year and month - perfect for analytics
Namespace: demo
Table: nyc_taxi
Partitioned: Yes
Sample Queries: 6
Usage:
icebox demo # Set up all datasets
icebox demo --dataset taxi # Set up specific dataset
icebox demo --cleanup # Remove all demo data
Demo Features:
- Real NYC Taxi Data - 5,000+ actual trip records with 22 columns
- Temporal Patterns - Data across multiple months for time-series analysis
- Financial Analytics - Fare amounts, tips, and payment methods
- Vendor Comparisons - Multi-vendor data for comparative analysis
- Sample Queries - 6 pre-built analytics examples
- Instant Setup - One command to working analytics environment
Embedded MinIO Server
Test S3-compatible storage workflows locally with zero configuration:
# Initialize with embedded MinIO
./icebox init my-project --storage minio
# Or enable in existing project
cat >> .icebox.yml << EOF
storage:
  type: minio
  minio:
    embedded: true
    console: true # Enable web console at http://localhost:9000
EOF
# MinIO starts automatically with Icebox
./icebox sql "SHOW TABLES"
# Starting embedded MinIO server...
# MinIO server started successfully
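Because `cat >>` appends unconditionally, running the snippet above twice would leave two `storage:` blocks in `.icebox.yml`. A guarded variant is sketched below. The function name `enable_embedded_minio` is hypothetical, the keys are the ones shown above, and their nesting is an assumption about the config layout.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not an Icebox command): append the embedded-MinIO
# settings to a config file only if it has no top-level storage block yet.
enable_embedded_minio() {
  local cfg="${1:-.icebox.yml}"
  if [ -f "$cfg" ] && grep -q '^storage:' "$cfg"; then
    echo "storage block already present in $cfg; edit it by hand" >&2
    return 1
  fi
  # Key nesting below is an assumption based on the README snippet.
  cat >> "$cfg" << 'EOF'
storage:
  type: minio
  minio:
    embedded: true
    console: true
EOF
}
```

Call `enable_embedded_minio` from the project directory, or pass the path to the config file as the first argument.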
Features:
- S3-Compatible API - Test cloud storage workflows locally
- Web Console - Browser-based management interface
- Secure by Default - Configurable authentication and TLS
- Performance Optimized - Modern connection pooling and timeouts
- In-Memory Mode - Lightning-fast testing with temporary storage
Pack & Unpack
Create portable archives of your lakehouse projects:
# Create project archive
./icebox pack my-analytics-project.tar.gz
# Share and distribute
scp my-analytics-project.tar.gz colleague@server:/home/colleague/
# Restore anywhere
./icebox unpack my-analytics-project.tar.gz
Perfect for:
- Sharing projects with colleagues
- Backup and archival
- Distribution of datasets and schemas
- Testing with consistent environments
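For backups, a date-stamped archive name keeps successive packs from overwriting each other. The sketch below only builds the file name and passes it to the `pack` command shown above. `archive_name`, `pack_dated`, and the `ICEBOX` variable are hypothetical conveniences, not Icebox features.

```shell
#!/usr/bin/env bash
# Path to the Icebox binary; override with ICEBOX=/path/to/icebox.
ICEBOX="${ICEBOX:-./icebox}"

# Build "<project>-<YYYYMMDD>.tar.gz". The date defaults to today.
archive_name() {
  local project="$1"
  local stamp="${2:-$(date +%Y%m%d)}"
  printf '%s-%s.tar.gz\n' "$project" "$stamp"
}

# Pack the current project into a date-stamped archive.
pack_dated() {
  local name
  name="$(archive_name "$1")" || return 1
  "$ICEBOX" pack "$name" && echo "wrote $name"
}
```

From inside a project directory, `pack_dated my-analytics-project` would then produce an archive named after the project and the current date.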
Examples
NYC Taxi Analytics Demo
# Quickest path - use default directory
./icebox init && cd icebox-lakehouse
# Set up demo with NYC taxi data
./icebox demo
# Revenue analysis
./icebox sql "SELECT AVG(fare_amount) as avg_fare, AVG(total_amount) as avg_total FROM nyc_taxi WHERE fare_amount > 0"
# Temporal patterns
./icebox sql "SELECT DATE_TRUNC('month', pickup_datetime) as month, COUNT(*) as trips FROM nyc_taxi GROUP BY month ORDER BY month"
# Vendor performance comparison
./icebox sql "SELECT vendor_name, COUNT(*) as trips, AVG(fare_amount) as avg_fare, AVG(trip_distance) as avg_distance FROM nyc_taxi WHERE vendor_name IS NOT NULL GROUP BY vendor_name"
# Busy hour analysis
./icebox sql "SELECT EXTRACT(hour FROM pickup_datetime) as hour, COUNT(*) as trips FROM nyc_taxi GROUP BY hour ORDER BY hour"
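The queries above can also be kept in a file and replayed in one go. This is a minimal sketch that assumes one query per line and relies only on the `icebox sql "<query>"` form shown above; `run_queries` and the `ICEBOX` variable are hypothetical names.

```shell
#!/usr/bin/env bash
# Path to the Icebox binary; override with ICEBOX=/path/to/icebox.
ICEBOX="${ICEBOX:-./icebox}"

# Read SQL statements from stdin, one per line, and run each through
# `icebox sql`. Blank lines and lines starting with "--" are skipped.
run_queries() {
  local q
  while IFS= read -r q || [ -n "$q" ]; do
    case "$q" in
      ''|--*) continue ;;
    esac
    printf '== %s\n' "$q"
    # Redirect stdin so the child process cannot consume the query list.
    "$ICEBOX" sql "$q" </dev/null || return 1
  done
}
```

With the queries saved in a file (say, a hypothetical `taxi_queries.sql`), `run_queries < taxi_queries.sql` runs them in order and stops at the first failure.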
Quick Data Analysis
# Import and analyze customer data
./icebox import customers.parquet --table customers
./icebox sql "SELECT region, AVG(lifetime_value) FROM customers GROUP BY region"
# Time-travel to see historical data
./icebox time-travel customers --as-of "2024-01-01" \
  --query "SELECT COUNT(*) FROM customers"
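To see how a table changed over time, the same query can be run at several points in its history. The loop below is a sketch built on the `time-travel` invocation shown above; the dates are placeholders, and `count_as_of`, `compare_counts`, and `ICEBOX` are hypothetical names.

```shell
#!/usr/bin/env bash
# Path to the Icebox binary; override with ICEBOX=/path/to/icebox.
ICEBOX="${ICEBOX:-./icebox}"

# Run a row count against <table> as it existed at <date>.
count_as_of() {
  local table="$1" as_of="$2"
  "$ICEBOX" time-travel "$table" --as-of "$as_of" \
    --query "SELECT COUNT(*) FROM $table"
}

# Compare row counts at each date given after the table name.
compare_counts() {
  local table="$1" d
  shift
  for d in "$@"; do
    echo "as of $d:"
    count_as_of "$table" "$d" || return 1
  done
}
```

For example, `compare_counts customers 2024-01-01 2024-02-01` prints one labelled count per date.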
REST Catalog Integration
# Connect to production Iceberg REST catalog
./icebox init prod-analytics --catalog rest --uri https://catalog.company.com
# Import data and query immediately
./icebox import events.parquet --table analytics.user_events
./icebox sql "SELECT event_type, COUNT(*) FROM analytics.user_events GROUP BY event_type"
Project Organization
# Create namespaced tables
./icebox import transactions.parquet --table finance.transactions
./icebox import campaigns.parquet --table marketing.campaigns
./icebox import orders.parquet --table sales.orders
# Query across namespaces
./icebox sql "
SELECT f.account_type, SUM(s.amount)
FROM finance.transactions f
JOIN sales.orders s ON f.transaction_id = s.id
GROUP BY f.account_type
"
For more comprehensive examples and detailed usage, see our Usage Guide.
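For the Project Organization example above, the namespace and table name can be derived from the file path instead of typed by hand when many Parquet files need importing. The sketch below assumes a hypothetical layout of `data/<namespace>/<table>.parquet`. `table_for`, `import_all`, and `ICEBOX` are illustrative names, and only the `import <file> --table <namespace.table>` form shown above is used.

```shell
#!/usr/bin/env bash
# Path to the Icebox binary; override with ICEBOX=/path/to/icebox.
ICEBOX="${ICEBOX:-./icebox}"

# Map "data/finance/transactions.parquet" to "finance.transactions".
table_for() {
  local file="$1" ns name
  ns="$(basename "$(dirname "$file")")"
  name="$(basename "$file" .parquet)"
  printf '%s.%s\n' "$ns" "$name"
}

# Import every <root>/<namespace>/<table>.parquet file.
import_all() {
  local root="${1:-data}" f
  for f in "$root"/*/*.parquet; do
    [ -e "$f" ] || continue   # the glob matched nothing
    "$ICEBOX" import "$f" --table "$(table_for "$f")" || return 1
  done
}
```

Running `import_all data` would then issue one `import` per file, stopping at the first failure.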
Storage & Catalog Support
| Storage Type | Description | Use Case |
|---|---|---|
| Local Filesystem | File-based storage | Development, testing |
| In-Memory | Temporary fast storage | Unit testing, experiments |
| Embedded MinIO | S3-compatible local server | Cloud workflow testing |
| External MinIO | Remote MinIO instance | Shared development |
| Catalog Type | Description | Use Case |
|---|---|---|
| SQLite | Embedded local catalog | Single-user development |
| REST | External Iceberg REST catalog | Multi-user, production |
Architecture
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   CLI Layer     │    │  Storage Layer  │    │  Catalog Layer  │
│                 │    │                 │    │                 │
│ • import        │◄──►│ • Local FS      │◄──►│ • SQLite        │
│ • sql/shell     │    │ • MinIO S3      │    │ • REST API      │
│ • table ops     │    │ • Cloud storage │    │ • Authentication│
│ • pack/unpack   │    │ • File:// URIs  │    │ • Multi-user    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         └──────────────────────┼──────────────────────┘
                                ▼
                     ┌─────────────────────┐
                     │   Apache Iceberg    │
                     │                     │
                     │ • Table format      │
                     │ • Time travel       │
                     │ • Transaction log   │
                     │ • DuckDB engine     │
                     └─────────────────────┘
Documentation
- Complete Usage Guide - Comprehensive documentation for all features
- Quick Start - Get up and running in 5 minutes
- Configuration - Complete configuration reference
- Troubleshooting - Common issues and solutions
Feature Documentation
- Embedded MinIO - S3-compatible local storage
- Time-Travel Queries - Query historical table states
- Table Operations - Complete table management
- Namespace Management - Organize your data
- Pack & Unpack - Portable project archives
- REST Catalog - Enterprise catalog integration
Roadmap
Current Version (v0.1.0)
- SQLite & REST catalog support with authentication
- Embedded MinIO server with S3-compatible API
- Parquet import with schema inference
- SQL engine with DuckDB integration
- Interactive SQL shell with rich features
- Time-travel queries for historical data analysis
- Table & namespace management operations
- Pack/Unpack for portable project archives
Future Releases
- Cloud Storage - Native S3, GCS, Azure integration
- Streaming Ingestion - Real-time data processing
- Web UI - Browser-based data exploration
- Advanced Analytics - Enhanced query capabilities
- SDK Libraries - Programmatic access
Contributing
We welcome contributions! Icebox is designed to be approachable for developers at all levels.
Quick Contribution Guide
- Fork the repository and create a feature branch
- Write tests for your changes
- Update documentation as needed
- Ensure tests pass with go test ./...
- Submit a pull request
Development
# Build from source
git clone https://github.com/TFMV/icebox.git
cd icebox/icebox
go mod tidy
go build -o icebox cmd/icebox/main.go
# Run tests
go test ./...
Areas for Contribution
- Bug fixes and stability improvements
- Documentation and examples
- New features and enhancements
- Test coverage improvements
- CLI/UX enhancements
License
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
Made with ❤️ for the data community
Star this project • Usage Guide • Report Issue