olake

v0.4.0

This package is not in the latest version of its module.
Published: Feb 20, 2026 | License: Apache-2.0 | Imports: 7 | Imported by: 6

README

OLake

OLake is a high-performance, open-source data ingestion engine for replicating databases, S3, and Kafka into Apache Iceberg (or plain Parquet).
Built for scalable, real-time pipelines, OLake provides a simple web UI and a CLI for ingesting data into Iceberg tables that are free of vendor lock-in and can be read by the query engines and warehouses that support Iceberg.

Read the docs and benchmarks at olake.io/docs. Join our active community on Slack.

OLake: Super-fast Sync to Apache Iceberg

OLake supports replication from transactional databases such as PostgreSQL, MySQL, MongoDB, Oracle, DB2, and MSSQL, event-streaming systems like Apache Kafka, and object stores like S3 into open data lakehouse formats such as Apache Iceberg or plain Parquet, delivering blazing-fast performance with minimal infrastructure cost.


🚀 Why OLake?

  • 🧠 Smart sync: Full + CDC replication with automatic schema discovery & schema evolution
  • ⚡ High throughput: 580K RPS (Postgres) & 338K RPS (MySQL) on full load
  • ➡️ Exactly-once delivery & Arrow writes: Accuracy with speed
  • 💾 Iceberg-native: Supports Glue, Hive, JDBC, REST catalogs
  • 🖥️ Self-serve UI: Deploy via Docker Compose and sync in minutes
  • 💸 Infra-light: No Spark, no Flink, no Kafka, no Debezium
  • 🗜️ Iceberg Table Optimization (Coming soon): Compaction tailored for CDC ingestion

📊 Benchmarks & possible connections

Full Load

| Source → Destination | Full Load | Relative Performance (Full Load) | Full Report |
|---|---|---|---|
| Postgres → Iceberg | 580,113 RPS | 12.5× faster than Fivetran | Full Report |
| MySQL → Iceberg | 338,005 RPS | 2.83× faster than Fivetran | Full Report |
| MongoDB → Iceberg | 37,879 RPS | - | Full Report |
| Oracle → Iceberg | 526,337 RPS | - | Full Report |
| Kafka → Iceberg | 154,320 RPS (Bounded Incremental) | 1.8× faster than Flink | Full Report |

CDC

| Source → Destination | CDC | Relative Performance (CDC) | Full Report |
|---|---|---|---|
| Postgres → Iceberg | 55,555 RPS | 2× faster than Fivetran | Full Report |
| MySQL → Iceberg | 51,867 RPS | 1.85× faster than Fivetran | Full Report |
| MongoDB → Iceberg | 10,692 RPS | - | Full Report |
| Oracle → Iceberg | - | - | Full Report |

*These are preliminary results. Fully reproducible benchmark scores will be published soon.


🔧 Supported Sources and Destinations

Sources (Databases and S3)

| Source | Full Load | CDC | Incremental | Notes | Documentation |
|---|---|---|---|---|---|
| PostgreSQL | ✅ | ✅ pgoutput | ✅ | wal2json deprecated | Postgres Docs |
| MySQL | ✅ | ✅ | ✅ | Binlog-based CDC | MySQL Docs |
| MongoDB | ✅ | ✅ | ✅ | Oplog-based CDC | MongoDB Docs |
| Oracle | ✅ | WIP | ✅ | JDBC based Full Load & Incremental | Oracle Docs |
| DB2 | ✅ | - | ✅ | JDBC based Full Load & Incremental | DB2 Docs |
| MSSQL | ✅ | ✅ | ✅ | Full Load, CDC & Incremental | MSSQL Docs |
| S3 | ✅ | - | ✅ | Ingests from Amazon S3 or S3-compatible (MinIO, LocalStack) | S3 Docs |

Sources (Kafka)

| Source | Bounded Incremental | Notes | Documentation |
|---|---|---|---|
| Kafka | ✅ | Latest offset bounded incremental sync | Kafka Docs |

Destinations

| Destination | Format | Supported Catalogs |
|---|---|---|
| Iceberg | ✅ | Glue, Hive, JDBC, REST (Nessie, Polaris, Unity, Lakekeeper, AWS S3 tables) |
| Parquet | ✅ | Filesystem |
| Other formats | 🔜 | Planned: Delta Lake, Hudi |

Writer Docs
  1. Apache Iceberg Docs

    1. Catalogs
      1. AWS Glue Catalog
      2. REST Catalog
      3. JDBC Catalog
      4. Hive Catalog
    2. Azure ADLS Gen2
    3. Google Cloud Storage (GCS)
    4. MinIO (local)
    5. Iceberg Table Management
      1. S3 Tables Supported
  2. Parquet Writer

    1. AWS S3 Docs
    2. Google Cloud Storage (GCS)
    3. Local FileSystem Docs

🧪 Quickstart (UI + Docker)

OLake UI is a web-based interface for managing OLake jobs, sources, destinations, and configurations. You can run the entire OLake stack (UI, Backend, and all dependencies) using Docker Compose. This is the recommended way to get started. Run the UI, connect your source DB, and start syncing in minutes.

curl -sSL https://raw.githubusercontent.com/datazip-inc/olake-ui/master/docker-compose.yml | docker compose -f - up -d

Access the UI:

  • OLake UI: http://localhost:8000
  • Log in with default credentials: admin / password.
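
The containers may take a little while to come up after `docker compose` returns, so a script that drives the quickstart might poll the UI address before continuing. Below is a minimal sketch in Go that assumes only the URL given above; `waitForUI` is a hypothetical helper and not part of OLake, and any HTTP response at all is treated as "reachable".

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// waitForUI polls url until it returns any HTTP response or the attempts
// run out. Hypothetical helper for illustration; not part of OLake.
func waitForUI(url string, attempts int, delay time.Duration) error {
	client := &http.Client{Timeout: 2 * time.Second}
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := client.Get(url)
		if err == nil {
			resp.Body.Close()
			return nil // the server answered; status code is not inspected
		}
		lastErr = err
		if i < attempts-1 {
			time.Sleep(delay)
		}
	}
	return fmt.Errorf("UI not reachable at %s after %d attempts: %v", url, attempts, lastErr)
}

func main() {
	// URL taken from the quickstart; attempt count and delay are arbitrary.
	if err := waitForUI("http://localhost:8000", 5, 2*time.Second); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("OLake UI is reachable")
}
```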

A detailed getting-started guide for the OLake UI is available in the documentation at olake.io/docs.

Creating Your First Job

With the UI running, you can create a data pipeline in a few steps:

  1. Create a Job: Navigate to the Jobs tab and click Create Job.
  2. Configure Source: Set up your source connection (e.g., PostgreSQL, MySQL, MongoDB).
  3. Configure Destination: Set up your destination (e.g., Apache Iceberg with a Glue, REST, Hive, or JDBC catalog).
  4. Select Streams: Choose which tables to sync and configure their sync mode (CDC or Full Refresh).
  5. Configure & Run: Give your job a name, set a schedule, and click Create Job to finish.

For a detailed walkthrough, refer to the Jobs documentation.


🛠️ CLI Usage (Advanced)

For advanced users and automation, OLake's core logic is exposed via a powerful CLI. The core framework handles state management, configuration validation, logging, and type detection. It interacts with drivers using four main commands:

  • spec: Returns a renderable JSON Schema for a connector's configuration.
  • check: Validates connection configurations for sources and destinations.
  • discover: Returns all available streams (e.g., tables) and their schemas from a source.
  • sync: Executes the data replication job, extracting from the source and writing to the destination.
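
A wrapper or automation script might validate a command name against these four before invoking the CLI. The sketch below, in Go, models the list as a lookup table; the table and `describeCommand` are hypothetical, the descriptions are paraphrased from this README, and no real OLake flags or invocation syntax are assumed.

```go
package main

import (
	"fmt"
	"sort"
)

// commands mirrors the four CLI commands listed above. The table itself is
// a hypothetical illustration, not something OLake exposes.
var commands = map[string]string{
	"spec":     "return a renderable JSON Schema for a connector's configuration",
	"check":    "validate source and destination connection configurations",
	"discover": "list available streams and their schemas from a source",
	"sync":     "run the replication job from source to destination",
}

// describeCommand returns the description for name, or an error if name is
// not one of the four known commands.
func describeCommand(name string) (string, error) {
	desc, ok := commands[name]
	if !ok {
		return "", fmt.Errorf("unknown command %q", name)
	}
	return desc, nil
}

func main() {
	names := make([]string, 0, len(commands))
	for name := range commands {
		names = append(names, name)
	}
	sort.Strings(names) // map iteration order is random; sort for stable output

	for _, name := range names {
		desc, _ := describeCommand(name)
		fmt.Printf("%-8s %s\n", name, desc)
	}
}
```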

Find out more about the CLI in the documentation.


Install OLake

OLake can be run in several ways:

  1. OLake UI (Recommended)
  2. Kubernetes using Helm
  3. Standalone Docker container
  4. Airflow on EC2
  5. Airflow on Kubernetes

Playground

  1. OLake + Apache Iceberg + REST Catalog + Presto
  2. OLake + Apache Iceberg + AWS Glue + Trino
  3. OLake + Apache Iceberg + AWS Glue + Athena
  4. OLake + Apache Iceberg + AWS Glue + Snowflake
  5. OLake + Apache Iceberg + REST Catalog + Spark

🌍 Use Cases

  • ✅ Migrate from OLTP to Iceberg without Spark or Flink
  • ✅ Enable BI over fresh CDC data using Athena, StarRocks, Trino, Presto, Dremio, Databricks, Snowflake and more!
  • ✅ Build a near real-time data lakehouse on cost-efficient cloud object stores
  • ✅ Move away from vendor-locked warehouses and tools to an open data lakehouse
  • ✅ Keep a single copy of data for both analytics & machine learning

🧭 Roadmap Highlights

  • Oracle Full Load Support
  • Oracle Incremental
  • Filters for Full Load and Incremental
  • Compaction & other table optimisations (In-progress)
  • Iceberg V3 Support

📌 Check out our GitHub Project Roadmap and the Upcoming OLake Roadmap to track what's next. If you have ideas or feedback, please share them in our GitHub Discussions or by opening an issue.


🀝 Contributing

We ❀️ contributions, big or small!

Check out our Bounty Program. A huge thanks to all our amazing contributors!

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func RegisterDriver

func RegisterDriver(driver abstract.DriverInterface)
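
This page shows only the signature, so the following is a self-contained sketch of the general registration pattern rather than OLake's implementation. Everything in it is an assumption made for illustration: `Driver` stands in for `abstract.DriverInterface` (whose method set is not shown here), `registerDriver` is a hypothetical analogue that returns an error where the real function returns nothing, and the one-driver-per-process rule is a guess.

```go
package main

import (
	"errors"
	"fmt"
)

// Driver is a hypothetical stand-in for abstract.DriverInterface; the real
// interface's methods are not shown on this page.
type Driver interface {
	Type() string
}

// registered holds the driver for this process (assumption: one per binary).
var registered Driver

// registerDriver is a hypothetical analogue of olake.RegisterDriver. Unlike
// the real function, it returns an error so the sketch can show failure cases.
func registerDriver(d Driver) error {
	if d == nil {
		return errors.New("driver is nil")
	}
	if registered != nil {
		return fmt.Errorf("driver %q already registered", registered.Type())
	}
	registered = d
	return nil
}

// fakePostgres is a toy driver used only to exercise the sketch.
type fakePostgres struct{}

func (fakePostgres) Type() string { return "postgres" }

func main() {
	if err := registerDriver(fakePostgres{}); err != nil {
		fmt.Println("register failed:", err)
		return
	}
	fmt.Println("registered driver:", registered.Type())
}
```

In a real driver module, the call would presumably be `olake.RegisterDriver(driver)` with a value implementing `abstract.DriverInterface`; consult the package source for the actual contract.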

Types

This section is empty.

Directories

Path  Synopsis
drivers
  db2 (module)
  google-sheets (module)
  hubspot (module)
  kafka (module)
  mongodb (module)
  mssql (module)
  mysql (module)
  oracle (module)
  postgres (module)
  s3 (module)
pkg
  spec  uischema is used to serve the UI specifications of the jsonschema of all the drivers and destinations
