gorilla

package module

v0.4.0 Latest Latest Go to latest Published: Aug 2, 2025 License: Apache-2.0, MIT Imports: 16 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/paveg/gorilla

Links

Open Source Insights

README ¶

Gorilla 🦍

Gorilla is a high-performance, in-memory DataFrame library for Go. Inspired by libraries like Pandas and Polars, it is designed for fast and efficient data manipulation.

At its core, Gorilla is built on Apache Arrow. This allows it to provide excellent performance and memory efficiency by leveraging Arrow's columnar data format.

✨ Key Features

Built on Apache Arrow: Utilizes the Arrow columnar memory format for zero-copy data access and efficient computation.
Lazy Evaluation & Query Optimization: Operations aren't executed immediately. Gorilla builds an optimized query plan with predicate pushdown, filter fusion, and join optimization.
Automatic Parallelization: Intelligent parallel processing for datasets >1000 rows with adaptive worker pools and thread-safe operations.
Comprehensive I/O Operations: Native CSV read/write with automatic type inference, configurable delimiters, and robust error handling.
Advanced Join Operations: Inner, left, right, and full outer joins with multi-key support and automatic optimization strategies.
GroupBy & Aggregations: Efficient grouping with Sum, Count, Mean, Min, Max aggregations and parallel execution for large datasets.
HAVING Clause Support: Full SQL-compatible HAVING clause implementation for filtering grouped data after aggregation with high-performance optimization.
Streaming & Large Dataset Processing: Handle datasets larger than memory with chunk-based processing and automatic disk spilling.
Debug & Profiling: Built-in execution plan visualization, performance profiling, and bottleneck identification with .Debug() mode.
Flexible Configuration: Multi-format config support (JSON/YAML/env) with performance tuning, memory management, and optimization controls.
Rich Data Types: Support for string, bool, int64, int32, int16, int8, uint64, uint32, uint16, uint8, float64, float32 with automatic type coercion.
Expressive API: Chainable operations for filtering, selecting, transforming, sorting, and complex data manipulation.
Core Data Structures:
- Series: A 1-dimensional array of data, representing a single column.
- DataFrame: A 2-dimensional table-like structure, composed of multiple Series.
CLI Tool: gorilla-cli for benchmarking, demos, and performance testing.

🚀 Getting Started

Installation

To add Gorilla to your project, use go get:

go get github.com/paveg/gorilla

Memory Management

Gorilla is built on Apache Arrow, which requires explicit memory management. We strongly recommend using the defer pattern for most use cases as it provides better readability and prevents memory leaks.

✅ Recommended: Defer Pattern

// Create resources and immediately defer their cleanup
mem := memory.NewGoAllocator()
df := gorilla.NewDataFrame(series1, series2)
defer df.Release() // ← Clear resource lifecycle

result, err := df.Lazy().Filter(...).Collect()
defer result.Release() // ← Always clean up results

📋 When to Use MemoryManager

For complex scenarios with many short-lived resources, you can use MemoryManager:

err := gorilla.WithMemoryManager(mem, func(manager *gorilla.MemoryManager) error {
    // Create multiple temporary resources
    for i := 0; i < 100; i++ {
        temp := createTempDataFrame(i)
        manager.Track(temp) // Bulk cleanup at end
    }
    return processData()
})
// All tracked resources automatically released

Quick Example

Here is a quick example to demonstrate the basic usage of Gorilla.

package main

import (
	"fmt"
	"log"

	"github.com/apache/arrow-go/v18/arrow/memory"
	"github.com/paveg/gorilla"
)

func main() {
	// Gorilla uses Apache Arrow for memory management.
	// Always start by creating an allocator.
	mem := memory.NewGoAllocator()

	// 1. Create some Series (columns)
	names := []string{"Alice", "Bob", "Charlie", "Diana", "Eve"}
	ages := []int64{25, 30, 35, 28, 32}
	salaries := []float64{50000.0, 60000.0, 75000.0, 55000.0, 65000.0}
	nameSeries := gorilla.NewSeries("name", names, mem)
	ageSeries := gorilla.NewSeries("age", ages, mem)
	salarySeries := gorilla.NewSeries("salary", salaries, mem)

	// Remember to release the memory when you're done.
	defer nameSeries.Release()
	defer ageSeries.Release()
	defer salarySeries.Release()

	// 2. Create a DataFrame from the Series
	df := gorilla.NewDataFrame(nameSeries, ageSeries, salarySeries)
	defer df.Release()

	fmt.Println("Original DataFrame:")
	fmt.Println(df)

	// 3. Use Lazy Evaluation for powerful transformations
	lazyDf := df.Lazy().
		// Filter for rows where age is greater than 30
		Filter(gorilla.Col("age").Gt(gorilla.Lit(int64(30)))).
		// Add a new column 'bonus'
		WithColumn("bonus", gorilla.Col("salary").Mul(gorilla.Lit(0.1))).
		// Select the columns we want in the final result
		Select("name", "age", "bonus")

	// 4. Execute the plan and collect the results
	result, err := lazyDf.Collect()
	if err != nil {
		log.Fatal(err)
	}
	defer result.Release()
	defer lazyDf.Release()

	fmt.Println("Result after lazy evaluation:")
	fmt.Println(result)
}

📊 Advanced Features

I/O Operations

Gorilla provides comprehensive CSV I/O capabilities with automatic type inference:

import "github.com/paveg/gorilla/internal/io"

// Reading CSV with automatic type inference
mem := memory.NewGoAllocator()
csvData := `name,age,salary,active
Alice,25,50000.5,true
Bob,30,60000.0,false`

reader := io.NewCSVReader(strings.NewReader(csvData), io.DefaultCSVOptions(), mem)
df, err := reader.Read()
defer df.Release()

// Writing DataFrame to CSV
var output strings.Builder
writer := io.NewCSVWriter(&output, io.DefaultCSVOptions())
err = writer.Write(df)

Join Operations

Perform SQL-like joins between DataFrames:

// Inner join on multiple keys
result, err := left.Join(right, gorilla.JoinOptions{
    Type: gorilla.InnerJoin,
    LeftKeys: []string{"id", "dept"},
    RightKeys: []string{"user_id", "department"},
})
defer result.Release()

GroupBy and Aggregations

Efficient grouping and aggregation operations:

// Group by department and calculate aggregations
result, err := df.Lazy().
    GroupBy("department").
    Agg(
        gorilla.Sum(gorilla.Col("salary")).As("total_salary"),
        gorilla.Count(gorilla.Col("*")).As("employee_count"),
        gorilla.Mean(gorilla.Col("age")).As("avg_age"),
    ).
    Collect()
defer result.Release()

HAVING Clause Support

Filter grouped data after aggregation with full SQL-compatible HAVING clause support:

// HAVING with aggregation functions
result, err := df.Lazy().
    GroupBy("department").
    Agg(
        gorilla.Sum(gorilla.Col("salary")).As("total_salary"),
        gorilla.Count(gorilla.Col("*")).As("employee_count"),
    ).
    Having(gorilla.Sum(gorilla.Col("salary")).Gt(gorilla.Lit(100000))).
    Collect()
defer result.Release()

// HAVING with alias references
result, err := df.Lazy().
    GroupBy("department").
    Agg(gorilla.Count(gorilla.Col("*")).As("emp_count")).
    Having(gorilla.Col("emp_count").Gt(gorilla.Lit(5))).
    Collect()
defer result.Release()

// Complex HAVING conditions
result, err := df.Lazy().
    GroupBy("department").
    Agg(
        gorilla.Mean(gorilla.Col("salary")).As("avg_salary"),
        gorilla.Count(gorilla.Col("*")).As("count"),
    ).
    Having(
        gorilla.Mean(gorilla.Col("salary")).Gt(gorilla.Lit(75000)).
        And(gorilla.Col("count").Gte(gorilla.Lit(3))),
    ).
    Collect()
defer result.Release()

Debug and Profiling

Monitor performance and analyze query execution:

// Enable debug mode for detailed execution analysis
debugDf := df.Debug()
result, err := debugDf.Lazy().
    Filter(gorilla.Col("salary").Gt(gorilla.Lit(50000))).
    Sort("name", true).
    Collect()

// Get execution plan and performance metrics
plan := debugDf.GetExecutionPlan()
fmt.Println("Execution plan:", plan)

Configuration

Customize performance and behavior with flexible configuration:

// Load configuration from file or environment
config, err := gorilla.LoadConfig("gorilla.yaml")
df := gorilla.NewDataFrame(series...).WithConfig(config.Operations)

// Or configure programmatically
config := gorilla.OperationConfig{
    ParallelThreshold: 2000,
    WorkerPoolSize: 8,
    EnableQueryOptimization: true,
    TrackMemory: true,
}

Example configuration file (gorilla.yaml):

# Parallel Processing
parallel_threshold: 2000
worker_pool_size: 8
chunk_size: 1000

# Memory Management  
memory_threshold: 1073741824  # 1GB
gc_pressure_threshold: 0.75

# Query Optimization
filter_fusion: true
predicate_pushdown: true
join_optimization: true

# Debugging
enable_profiling: false
verbose_logging: false
metrics_collection: true

Streaming Large Datasets

Process datasets larger than memory:

processor := gorilla.NewStreamingProcessor(gorilla.StreamingConfig{
    ChunkSize: 10000,
    MaxMemoryMB: 1024,
})

err := processor.ProcessLargeCSV("large_dataset.csv", func(chunk *gorilla.DataFrame) error {
    // Process each chunk
    result := chunk.Filter(gorilla.Col("score").Gt(gorilla.Lit(0.8)))
    return saveResults(result)
})

🔧 CLI Tool

Gorilla includes a command-line tool for benchmarking and demonstrations:

# Build the CLI
make build

# Run interactive demo
./bin/gorilla-cli demo

# Run demo with custom data size
./bin/gorilla-cli demo --rows 5000

# Run benchmarks
./bin/gorilla-cli benchmark

⚡ Performance

Gorilla is designed for high-performance data processing with several optimization features:

Benchmarks

Recent performance benchmarks show excellent performance characteristics:

CSV I/O: ~52μs for 100 rows, ~5.3ms for 10,000 rows
Parallel Processing: Automatic scaling with adaptive worker pools
Memory Efficiency: Zero-copy operations with Apache Arrow columnar format
Query Optimization: 20-50% performance improvement with predicate pushdown and filter fusion

Performance Features

Lazy Evaluation: Build optimized execution plans before processing
Automatic Parallelization: Intelligent parallel processing for large datasets
Memory Management: Efficient Arrow-based columnar storage with GC pressure monitoring
Query Optimization: Predicate pushdown, filter fusion, and join optimization
Streaming: Process datasets larger than memory with configurable chunk sizes

Run benchmarks locally:

# Build and run performance tests
make build
./bin/gorilla-cli benchmark

# Run specific package benchmarks
go test -bench=. ./internal/dataframe
go test -bench=. ./internal/io

🤝 Contributing

Contributions are welcome! Please feel free to open an issue or submit a pull request.

To run the test suite, use the standard Go command:

go test ./...

📄 License

This project is licensed under the MIT License.

Documentation ¶

Overview ¶

Package gorilla provides a high-performance, in-memory DataFrame library for Go.

Gorilla is built on Apache Arrow and provides fast, efficient data manipulation with lazy evaluation and automatic parallelization. It offers a clear, intuitive API for filtering, selecting, transforming, and analyzing tabular data.

Core Concepts ¶

DataFrame: A 2-dimensional table of data with named columns (Series). Series: A 1-dimensional array of homogeneous data representing a single column. LazyFrame: Deferred computation that builds an optimized query plan. Expression: Type-safe operations for filtering, transforming, and aggregating data.

Memory Management ¶

Gorilla uses Apache Arrow's memory management. Always call Release() on DataFrames and Series to prevent memory leaks. The recommended pattern is:

df := gorilla.NewDataFrame(series1, series2)
defer df.Release() // Essential for proper cleanup

Basic Usage ¶

package main

import (
	"fmt"
	"github.com/apache/arrow-go/v18/arrow/memory"
	"github.com/paveg/gorilla"
)

func main() {
	mem := memory.NewGoAllocator()

	// Create Series (columns)
	names := gorilla.NewSeries("name", []string{"Alice", "Bob", "Charlie"}, mem)
	ages := gorilla.NewSeries("age", []int64{25, 30, 35}, mem)
	defer names.Release()
	defer ages.Release()

	// Create DataFrame
	df := gorilla.NewDataFrame(names, ages)
	defer df.Release()

	// Lazy evaluation with method chaining
	result, err := df.Lazy().
		Filter(gorilla.Col("age").Gt(gorilla.Lit(int64(30)))).
		Select("name").
		Collect()
	if err != nil {
		panic(err)
	}
	defer result.Release()

	fmt.Println(result)
}

Performance Features ¶

- Zero-copy operations using Apache Arrow columnar format - Automatic parallelization for DataFrames with 1000+ rows - Lazy evaluation with query optimization - Efficient join algorithms with automatic strategy selection - Memory-efficient aggregations and transformations

This package is the sole public API for the library.

Index ¶

Constants
Variables
func BuildInfo() version.BuildInfo
func Case() *expr.CaseExpr
func Coalesce(exprs ...expr.Expr) *expr.FunctionExpr
func Col(name string) *expr.ColumnExpr
func Concat(exprs ...expr.Expr) *expr.FunctionExpr
func Count(column expr.Expr) *expr.AggregationExpr
func Example()
func ExampleDataFrameGroupBy()
func ExampleDataFrameJoin()
func ExampleDataFrameLazyEvaluation()
func ExampleDataFrameMemoryManagement()
func ExampleSQLExecutor()
func If(condition, thenValue, elseValue expr.Expr) *expr.FunctionExpr
func Lit(value interface{}) *expr.LiteralExpr
func Max(column expr.Expr) *expr.AggregationExpr
func Mean(column expr.Expr) *expr.AggregationExpr
func Min(column expr.Expr) *expr.AggregationExpr
func Sum(column expr.Expr) *expr.AggregationExpr
func Version() string
func WithDataFrame(factory func() *DataFrame, fn func(*DataFrame) error) error
func WithMemoryManager(allocator memory.Allocator, fn func(*MemoryManager) error) error
func WithSeries(factory func() ISeries, fn func(ISeries) error) error
type BatchManager
- func NewBatchManager(monitor *MemoryUsageMonitor) *BatchManager
- func (bm *BatchManager) AddBatch(data *DataFrame)
- func (bm *BatchManager) EstimateMemory() int64
- func (bm *BatchManager) ForceCleanup() error
- func (bm *BatchManager) Release()
- func (bm *BatchManager) ReleaseAll()
- func (bm *BatchManager) SpillIfNeeded() error
- func (bm *BatchManager) SpillLRU() error
- func (bm *BatchManager) Track(resource memoryutil.Resource)
type ChunkReader
type ChunkWriter
type DataFrame
- func NewDataFrame(series ...ISeries) *DataFrame
- func (d *DataFrame) Column(name string) (ISeries, bool)
- func (d *DataFrame) Columns() []string
- func (d *DataFrame) Concat(others ...*DataFrame) *DataFrame
- func (d *DataFrame) Drop(names ...string) *DataFrame
- func (d *DataFrame) GroupBy(columns ...string) *GroupBy
- func (d *DataFrame) HasColumn(name string) bool
- func (d *DataFrame) Join(right *DataFrame, options *JoinOptions) (*DataFrame, error)
- func (d *DataFrame) Lazy() *LazyFrame
- func (d *DataFrame) Len() int
- func (d *DataFrame) Release()
- func (d *DataFrame) Select(names ...string) *DataFrame
- func (d *DataFrame) Slice(start, end int) *DataFrame
- func (d *DataFrame) Sort(column string, ascending bool) (*DataFrame, error)
- func (d *DataFrame) SortBy(columns []string, ascending []bool) (*DataFrame, error)
- func (d *DataFrame) String() string
- func (d *DataFrame) Width() int
- func (d *DataFrame) WithConfig(opConfig config.OperationConfig) *DataFrame
type GroupBy
- func (gb *GroupBy) Agg(aggregations ...*expr.AggregationExpr) *DataFrame
type ISeries
- func NewSeries[T any](name string, values []T, mem memory.Allocator) ISeries
type JoinOptions
type JoinType
type LazyFrame
- func (lf *LazyFrame) Collect() (*DataFrame, error)
- func (lf *LazyFrame) Filter(predicate expr.Expr) *LazyFrame
- func (lf *LazyFrame) GroupBy(columns ...string) *LazyGroupBy
- func (lf *LazyFrame) Join(right *LazyFrame, options *JoinOptions) *LazyFrame
- func (lf *LazyFrame) Release()
- func (lf *LazyFrame) Select(columns ...string) *LazyFrame
- func (lf *LazyFrame) Sort(column string, ascending bool) *LazyFrame
- func (lf *LazyFrame) SortBy(columns []string, ascending []bool) *LazyFrame
- func (lf *LazyFrame) String() string
- func (lf *LazyFrame) WithColumn(name string, expression expr.Expr) *LazyFrame
type LazyGroupBy
- func (lgb *LazyGroupBy) Agg(aggregations ...*expr.AggregationExpr) *LazyFrame
- func (lgb *LazyGroupBy) Count(column string) *LazyFrame
- func (lgb *LazyGroupBy) Max(column string) *LazyFrame
- func (lgb *LazyGroupBy) Mean(column string) *LazyFrame
- func (lgb *LazyGroupBy) Min(column string) *LazyFrame
- func (lgb *LazyGroupBy) Sum(column string) *LazyFrame
type MemoryAwareChunkReader
- func NewMemoryAwareChunkReader(reader ChunkReader, monitor *MemoryUsageMonitor) *MemoryAwareChunkReader
- func (mr *MemoryAwareChunkReader) Close() error
- func (mr *MemoryAwareChunkReader) HasNext() bool
- func (mr *MemoryAwareChunkReader) ReadChunk() (*DataFrame, error)
type MemoryManager
- func NewMemoryManager(allocator memory.Allocator) *MemoryManager
- func (m *MemoryManager) Count() int
- func (m *MemoryManager) ReleaseAll()
- func (m *MemoryManager) Track(resource Releasable)
type MemoryStats
type MemoryUsageMonitor
- func NewMemoryUsageMonitor(spillThreshold int64) *MemoryUsageMonitor
- func NewMemoryUsageMonitorFromConfig(cfg config.Config) *MemoryUsageMonitor
- func (m *MemoryUsageMonitor) CurrentUsage() int64
- func (m *MemoryUsageMonitor) GetStats() MemoryStats
- func (m *MemoryUsageMonitor) PeakUsage() int64
- func (m *MemoryUsageMonitor) RecordAllocation(bytes int64)
- func (m *MemoryUsageMonitor) RecordDeallocation(bytes int64)
- func (m *MemoryUsageMonitor) SetCleanupCallback(callback func() error)
- func (m *MemoryUsageMonitor) SetSpillCallback(callback func() error)
- func (m *MemoryUsageMonitor) SpillCount() int64
- func (m *MemoryUsageMonitor) StartMonitoring()
- func (m *MemoryUsageMonitor) StopMonitoring()
type Releasable
type SQLExecutor
- func NewSQLExecutor(mem memory.Allocator) *SQLExecutor
- func (se *SQLExecutor) BatchExecute(queries []string) ([]*DataFrame, error)
- func (se *SQLExecutor) ClearTables()
- func (se *SQLExecutor) Execute(query string) (*DataFrame, error)
- func (se *SQLExecutor) Explain(query string) (string, error)
- func (se *SQLExecutor) GetRegisteredTables() []string
- func (se *SQLExecutor) RegisterTable(name string, df *DataFrame)
- func (se *SQLExecutor) ValidateQuery(query string) error
type SpillableBatch
- func NewSpillableBatch(data *DataFrame) *SpillableBatch
- func (sb *SpillableBatch) EstimateMemory() int64
- func (sb *SpillableBatch) ForceCleanup() error
- func (sb *SpillableBatch) GetData() (*DataFrame, error)
- func (sb *SpillableBatch) Release()
- func (sb *SpillableBatch) Spill() error
- func (sb *SpillableBatch) SpillIfNeeded() error
type StreamingOperation
type StreamingProcessor
- func NewStreamingProcessor(chunkSize int, allocator memory.Allocator, monitor *MemoryUsageMonitor) *StreamingProcessor
- func (sp *StreamingProcessor) Close() error
- func (sp *StreamingProcessor) ProcessStreaming(reader ChunkReader, writer ChunkWriter, operations []StreamingOperation) error

Constants ¶

View Source

const (
	// DefaultCheckInterval is the default interval for memory monitoring.
	DefaultCheckInterval = 5 * time.Second
	// HighMemoryPressureThreshold is the threshold for triggering cleanup.
	HighMemoryPressureThreshold = 0.8
	// DefaultMemoryThreshold is the default memory threshold (1GB).
	DefaultMemoryThreshold = 1024 * 1024 * 1024
)

View Source

const (
	// DefaultChunkSize is the default size for processing chunks.
	DefaultChunkSize = 1000
	// BytesPerValue is the estimated bytes per DataFrame value.
	BytesPerValue = 8
)

Variables ¶

View Source

var ErrEndOfStream = errors.New("end of stream")

ErrEndOfStream indicates the end of the data stream.

Functions ¶

func BuildInfo ¶ added in v0.2.0

func BuildInfo() version.BuildInfo

BuildInfo returns detailed build information.

This includes version, build date, Git commit, Go version, and dependency information. Useful for debugging and support.

Example:

info := gorilla.BuildInfo()
fmt.Printf("Version: %s\n", info.Version)
fmt.Printf("Build Date: %s\n", info.BuildDate)
fmt.Printf("Git Commit: %s\n", info.GitCommit)
fmt.Printf("Go Version: %s\n", info.GoVersion)

func Case ¶

func Case() *expr.CaseExpr

Case starts a case expression.

Returns a concrete *expr.CaseExpr for optimal performance and type safety.

func Coalesce ¶

func Coalesce(exprs ...expr.Expr) *expr.FunctionExpr

Coalesce returns the first non-null expression.

Returns a concrete *expr.FunctionExpr for optimal performance and type safety.

func Col ¶

func Col(name string) *expr.ColumnExpr

Col creates a column expression that references a column by name.

This is the primary way to reference columns in filters, selections, and other DataFrame operations. The column name must exist in the DataFrame when the expression is evaluated.

Returns a concrete *expr.ColumnExpr for optimal performance and type safety.

Example:

// Reference the "age" column
ageCol := gorilla.Col("age")

// Use in filters and operations
result, err := df.Lazy().
	Filter(gorilla.Col("age").Gt(gorilla.Lit(30))).
	Select(gorilla.Col("name"), gorilla.Col("age")).
	Collect()

func Concat ¶

func Concat(exprs ...expr.Expr) *expr.FunctionExpr

Concat concatenates string expressions.

Returns a concrete *expr.FunctionExpr for optimal performance and type safety.

func Count ¶

func Count(column expr.Expr) *expr.AggregationExpr

Count returns an aggregation expression for count.

Returns a concrete *expr.AggregationExpr for optimal performance and type safety.

func Example ¶ added in v0.2.0

func Example()

Example demonstrates basic DataFrame creation and operations.

func ExampleDataFrameGroupBy ¶ added in v0.2.0

func ExampleDataFrameGroupBy()

ExampleDataFrameGroupBy demonstrates GroupBy operations with aggregations.

func ExampleDataFrameJoin ¶ added in v0.2.0

func ExampleDataFrameJoin()

ExampleDataFrameJoin demonstrates join operations between DataFrames.

func ExampleDataFrameLazyEvaluation ¶ added in v0.3.1

func ExampleDataFrameLazyEvaluation()

ExampleDataFrameLazyEvaluation demonstrates lazy evaluation with optimization.

This example shows how operations are deferred and optimized when using lazy evaluation, including automatic parallelization and query optimization.

func ExampleDataFrameMemoryManagement ¶ added in v0.3.1

func ExampleDataFrameMemoryManagement()

ExampleDataFrameMemoryManagement demonstrates proper memory management patterns.

This example shows the recommended patterns for memory management including defer usage, cleanup in error conditions, and resource lifecycle management.

func ExampleSQLExecutor ¶ added in v0.3.1

func ExampleSQLExecutor()

ExampleSQLExecutor demonstrates SQL query execution with DataFrames.

func If ¶

func If(condition, thenValue, elseValue expr.Expr) *expr.FunctionExpr

If returns a conditional expression.

Returns a concrete *expr.FunctionExpr for optimal performance and type safety.

func Lit ¶

func Lit(value interface{}) *expr.LiteralExpr

Lit creates a literal expression that represents a constant value.

This is used to create expressions with constant values for comparisons, arithmetic operations, and other transformations. The value type should match the expected operation type.

Supported types: string, int64, int32, float64, float32, bool

Returns a concrete *expr.LiteralExpr for optimal performance and type safety.

Example:

// Literal values for comparisons
age30 := gorilla.Lit(int64(30))
name := gorilla.Lit("Alice")
score := gorilla.Lit(95.5)
active := gorilla.Lit(true)

// Use in operations
result, err := df.Lazy().
	Filter(gorilla.Col("age").Gt(gorilla.Lit(int64(25)))).
	WithColumn("bonus", gorilla.Col("salary").Mul(gorilla.Lit(0.1))).
	Collect()

func Max ¶

func Max(column expr.Expr) *expr.AggregationExpr

Max returns an aggregation expression for max.

Returns a concrete *expr.AggregationExpr for optimal performance and type safety.

func Mean ¶

func Mean(column expr.Expr) *expr.AggregationExpr

Mean returns an aggregation expression for mean.

Returns a concrete *expr.AggregationExpr for optimal performance and type safety.

func Min ¶

func Min(column expr.Expr) *expr.AggregationExpr

Min returns an aggregation expression for min.

Returns a concrete *expr.AggregationExpr for optimal performance and type safety.

func Sum ¶

func Sum(column expr.Expr) *expr.AggregationExpr

Sum returns an aggregation expression for sum.

Returns a concrete *expr.AggregationExpr for optimal performance and type safety.

func Version ¶ added in v0.2.0

func Version() string

Version returns the current library version.

This returns the semantic version of the Gorilla DataFrame library. During development, this may return "dev". In releases, it returns the tagged version (e.g., "v1.0.0").

Example:

fmt.Printf("Using Gorilla DataFrame Library %s\n", gorilla.Version())

func WithDataFrame ¶

func WithDataFrame(factory func() *DataFrame, fn func(*DataFrame) error) error

WithDataFrame provides automatic resource management for DataFrame operations.

This helper function creates a DataFrame using the provided factory function, executes the given operation, and automatically releases the DataFrame when done. This pattern is useful for operations where you want guaranteed cleanup.

The factory function should create and return a DataFrame. The operation function receives the DataFrame and performs the desired operations. Any error from the operation function is returned to the caller.

Example:

err := gorilla.WithDataFrame(func() *gorilla.DataFrame {
	mem := memory.NewGoAllocator()
	series1 := gorilla.NewSeries("name", []string{"Alice", "Bob"}, mem)
	series2 := gorilla.NewSeries("age", []int64{25, 30}, mem)
	return gorilla.NewDataFrame(series1, series2)
}, func(df *gorilla.DataFrame) error {
	result, err := df.Lazy().Filter(gorilla.Col("age").Gt(gorilla.Lit(25))).Collect()
	if err != nil {
		return err
	}
	defer result.Release()
	fmt.Println(result)
	return nil
})
// DataFrame is automatically released here

func WithMemoryManager ¶

func WithMemoryManager(allocator memory.Allocator, fn func(*MemoryManager) error) error

WithMemoryManager creates a memory manager, executes a function with it, and releases all tracked resources.

func WithSeries ¶

func WithSeries(factory func() ISeries, fn func(ISeries) error) error

WithSeries creates a Series, executes a function with it, and automatically releases it.

Types ¶

type BatchManager ¶

type BatchManager struct {
	// contains filtered or unexported fields
}

BatchManager manages a collection of spillable batches. Implements the memory.ResourceManager interface for consistent resource management.

func NewBatchManager ¶

func NewBatchManager(monitor *MemoryUsageMonitor) *BatchManager

NewBatchManager creates a new batch manager.

func (*BatchManager) AddBatch ¶

func (bm *BatchManager) AddBatch(data *DataFrame)

AddBatch adds a new batch to the manager. Uses the Track method for consistent resource management.

func (*BatchManager) EstimateMemory ¶ added in v0.3.1

func (bm *BatchManager) EstimateMemory() int64

EstimateMemory returns the total estimated memory usage of all managed batches. Implements the memory.Resource interface.

func (*BatchManager) ForceCleanup ¶ added in v0.3.1

func (bm *BatchManager) ForceCleanup() error

ForceCleanup performs cleanup operations on all managed batches. Implements the memory.Resource interface.

func (*BatchManager) Release ¶ added in v0.3.1

func (bm *BatchManager) Release()

Release releases the batch manager and all managed batches. Implements the memory.Resource interface.

func (*BatchManager) ReleaseAll ¶

func (bm *BatchManager) ReleaseAll()

ReleaseAll releases all batches and clears the manager.

func (*BatchManager) SpillIfNeeded ¶ added in v0.3.1

func (bm *BatchManager) SpillIfNeeded() error

SpillIfNeeded checks memory pressure and spills batches if needed. Implements the memory.Resource interface.

func (*BatchManager) SpillLRU ¶

func (bm *BatchManager) SpillLRU() error

SpillLRU spills the least recently used batch to free memory.

func (*BatchManager) Track ¶ added in v0.3.1

func (bm *BatchManager) Track(resource memoryutil.Resource)

Track adds a batch to be managed by the BatchManager. Implements the memory.ResourceManager interface.

type ChunkReader ¶

type ChunkReader interface {
	// ReadChunk reads the next chunk of data
	ReadChunk() (*DataFrame, error)
	// HasNext returns true if there are more chunks to read
	HasNext() bool
	// Close closes the chunk reader and releases resources
	Close() error
}

ChunkReader represents a source of data chunks for streaming processing.

type ChunkWriter ¶

type ChunkWriter interface {
	// WriteChunk writes a processed chunk of data
	WriteChunk(*DataFrame) error
	// Close closes the chunk writer and releases resources
	Close() error
}

ChunkWriter represents a destination for processed data chunks.

type DataFrame ¶

type DataFrame struct {
	// contains filtered or unexported fields
}

DataFrame represents a 2-dimensional table of data with named columns.

A DataFrame is composed of multiple Series (columns) and provides operations for filtering, selecting, transforming, joining, and aggregating data. It supports both eager and lazy evaluation patterns.

Key features:

Zero-copy operations using Apache Arrow columnar format
Automatic parallelization for large datasets (1000+ rows)
Type-safe operations through the Expression system
Memory-efficient storage and computation

Memory management: DataFrames must be released to prevent memory leaks:

df := gorilla.NewDataFrame(series1, series2)
defer df.Release()

Operations can be performed eagerly or using lazy evaluation:

// Eager: operations execute immediately
filtered := df.Filter(gorilla.Col("age").Gt(gorilla.Lit(30)))

// Lazy: operations build a query plan, execute on Collect()
result, err := df.Lazy().
	Filter(gorilla.Col("age").Gt(gorilla.Lit(30))).
	Select("name", "age").
	Collect()

func NewDataFrame ¶

func NewDataFrame(series ...ISeries) *DataFrame

NewDataFrame creates a new DataFrame from one or more Series.

A DataFrame is a 2-dimensional table with named columns. Each Series becomes a column in the DataFrame. All Series must have the same length.

Memory management: The returned DataFrame must be released by calling Release() to prevent memory leaks. Use defer for automatic cleanup:

df := gorilla.NewDataFrame(series1, series2)
defer df.Release()

Example:

mem := memory.NewGoAllocator()
names := gorilla.NewSeries("name", []string{"Alice", "Bob"}, mem)
ages := gorilla.NewSeries("age", []int64{25, 30}, mem)
defer names.Release()
defer ages.Release()

df := gorilla.NewDataFrame(names, ages)
defer df.Release()

fmt.Println(df) // Displays the DataFrame

func (*DataFrame) Column ¶

func (d *DataFrame) Column(name string) (ISeries, bool)

Column returns the column with the given name.

func (*DataFrame) Columns ¶

func (d *DataFrame) Columns() []string

Columns returns the column names in order.

func (*DataFrame) Concat ¶

func (d *DataFrame) Concat(others ...*DataFrame) *DataFrame

Concat concatenates this DataFrame with others.

func (*DataFrame) Drop ¶

func (d *DataFrame) Drop(names ...string) *DataFrame

Drop returns a new DataFrame without the specified columns.

func (*DataFrame) GroupBy ¶

func (d *DataFrame) GroupBy(columns ...string) *GroupBy

GroupBy creates a GroupBy operation for eager evaluation.

func (*DataFrame) HasColumn ¶

func (d *DataFrame) HasColumn(name string) bool

HasColumn returns true if the DataFrame has the given column.

func (*DataFrame) Join ¶

func (d *DataFrame) Join(right *DataFrame, options *JoinOptions) (*DataFrame, error)

Join performs a join operation with another DataFrame.

func (*DataFrame) Lazy ¶

func (d *DataFrame) Lazy() *LazyFrame

Lazy initiates a lazy operation on a DataFrame.

func (*DataFrame) Len ¶

func (d *DataFrame) Len() int

Len returns the number of rows.

func (*DataFrame) Release ¶

func (d *DataFrame) Release()

Release frees the memory used by the DataFrame.

func (*DataFrame) Select ¶

func (d *DataFrame) Select(names ...string) *DataFrame

Select returns a new DataFrame with only the specified columns.

func (*DataFrame) Slice ¶

func (d *DataFrame) Slice(start, end int) *DataFrame

Slice returns a new DataFrame with rows from start to end (exclusive).

func (*DataFrame) Sort ¶

func (d *DataFrame) Sort(column string, ascending bool) (*DataFrame, error)

Sort returns a new DataFrame sorted by the specified column.

func (*DataFrame) SortBy ¶

func (d *DataFrame) SortBy(columns []string, ascending []bool) (*DataFrame, error)

SortBy returns a new DataFrame sorted by multiple columns.

func (*DataFrame) String ¶

func (d *DataFrame) String() string

String returns a string representation of the DataFrame.

func (*DataFrame) Width ¶

func (d *DataFrame) Width() int

Width returns the number of columns.

func (*DataFrame) WithConfig ¶ added in v0.3.1

func (d *DataFrame) WithConfig(opConfig config.OperationConfig) *DataFrame

WithConfig returns a new DataFrame with the specified operation configuration.

This allows per-DataFrame configuration overrides for parallel execution, memory usage, and other operational parameters. The configuration is inherited by lazy operations performed on this DataFrame.

Parameters:

opConfig: The operation configuration to apply to this DataFrame

Returns:

*DataFrame: A new DataFrame with the specified configuration

Example:

config := config.OperationConfig{
	ForceParallel:   true,
	CustomChunkSize: 5000,
	MaxMemoryUsage:  1024 * 1024 * 100, // 100MB
}

configuredDF := df.WithConfig(config)
defer configuredDF.Release()

// All operations on configuredDF will use the custom configuration
result := configuredDF.Lazy().
	Filter(Col("amount").Gt(Lit(1000))).
	Collect()

type GroupBy ¶

type GroupBy struct {
	// contains filtered or unexported fields
}

GroupBy is the public type for eager group by operations.

func (*GroupBy) Agg ¶

func (gb *GroupBy) Agg(aggregations ...*expr.AggregationExpr) *DataFrame

Agg performs aggregation operations on the GroupBy.

Takes aggregation expressions directly without wrapper overhead.

type ISeries ¶

type ISeries interface {
	Name() string                 // Returns the name of the Series
	Len() int                     // Returns the number of elements
	DataType() arrow.DataType     // Returns the Apache Arrow data type
	IsNull(index int) bool        // Checks if the value at index is null
	String() string               // Returns a string representation
	Array() arrow.Array           // Returns the underlying Arrow array
	Release()                     // Releases memory resources
	GetAsString(index int) string // Returns the value at index as a string
}

ISeries provides a type-erased interface for Series of any supported type.

ISeries allows different typed Series to be used together in DataFrames and operations. It wraps Apache Arrow arrays and provides common operations for all Series types.

Supported data types: string, int64, int32, float64, float32, bool

Memory management: Series implement the Release() method and must be released to prevent memory leaks:

series := gorilla.NewSeries("name", []string{"Alice", "Bob"}, mem)
defer series.Release()

The interface provides:

Name() - column name for use in DataFrames
Len() - number of elements in the Series
DataType() - Apache Arrow data type information
IsNull(index) - null value checking
String() - human-readable representation
Array() - access to underlying Arrow array
Release() - memory cleanup

func NewSeries ¶

func NewSeries[T any](name string, values []T, mem memory.Allocator) ISeries

NewSeries creates a new typed Series from a slice of values.

A Series is a 1-dimensional array of homogeneous data that represents a single column in a DataFrame. The type parameter T determines the data type of the Series.

Supported types: string, int64, int32, float64, float32, bool

Parameters:

name: The name of the Series (becomes the column name in a DataFrame)
values: Slice of values to populate the Series
mem: Apache Arrow memory allocator for managing the underlying storage

Memory management: The returned Series must be released by calling Release() to prevent memory leaks. Use defer for automatic cleanup:

series := gorilla.NewSeries("age", []int64{25, 30, 35}, mem)
defer series.Release()

Example:

mem := memory.NewGoAllocator()

// Create different types of Series
names := gorilla.NewSeries("name", []string{"Alice", "Bob", "Charlie"}, mem)
ages := gorilla.NewSeries("age", []int64{25, 30, 35}, mem)
scores := gorilla.NewSeries("score", []float64{95.5, 87.2, 92.1}, mem)
active := gorilla.NewSeries("active", []bool{true, false, true}, mem)

defer names.Release()
defer ages.Release()
defer scores.Release()
defer active.Release()

type JoinOptions ¶

type JoinOptions struct {
	Type      JoinType
	LeftKey   string   // Single join key for left DataFrame
	RightKey  string   // Single join key for right DataFrame
	LeftKeys  []string // Multiple join keys for left DataFrame
	RightKeys []string // Multiple join keys for right DataFrame
}

JoinOptions specifies parameters for join operations.

type JoinType ¶

type JoinType int

JoinType represents the type of join operation.

const (
	InnerJoin JoinType = iota
	LeftJoin
	RightJoin
	FullOuterJoin
)

type LazyFrame ¶

type LazyFrame struct {
	// contains filtered or unexported fields
}

LazyFrame provides deferred computation with query optimization.

LazyFrame builds an Abstract Syntax Tree (AST) of operations without executing them immediately. This allows for query optimization, operation fusion, and efficient memory usage. Operations are only executed when Collect() is called.

Benefits of lazy evaluation:

Query optimization (predicate pushdown, operation fusion)
Memory efficiency (only final results allocated)
Parallel execution planning
Operation chaining with method syntax

Example:

lazyResult := df.Lazy().
	Filter(gorilla.Col("age").Gt(gorilla.Lit(25))).
	WithColumn("bonus", gorilla.Col("salary").Mul(gorilla.Lit(0.1))).
	GroupBy("department").
	Agg(gorilla.Sum(gorilla.Col("salary")).As("total_salary")).
	Select("department", "total_salary")

// Execute the entire plan efficiently
result, err := lazyResult.Collect()
defer result.Release()

func (*LazyFrame) Collect ¶

func (lf *LazyFrame) Collect() (*DataFrame, error)

Collect executes all pending operations and returns the final DataFrame.

func (*LazyFrame) Filter ¶

func (lf *LazyFrame) Filter(predicate expr.Expr) *LazyFrame

Filter adds a filter operation to the LazyFrame.

The predicate can be any expression that implements expr.Expr interface.

func (*LazyFrame) GroupBy ¶

func (lf *LazyFrame) GroupBy(columns ...string) *LazyGroupBy

GroupBy adds a group by operation to the LazyFrame.

func (*LazyFrame) Join ¶

func (lf *LazyFrame) Join(right *LazyFrame, options *JoinOptions) *LazyFrame

Join adds a join operation to the LazyFrame.

func (*LazyFrame) Release ¶

func (lf *LazyFrame) Release()

Release frees the memory used by the LazyFrame.

func (*LazyFrame) Select ¶

func (lf *LazyFrame) Select(columns ...string) *LazyFrame

Select adds a select operation to the LazyFrame.

func (*LazyFrame) Sort ¶

func (lf *LazyFrame) Sort(column string, ascending bool) *LazyFrame

Sort adds a sort operation to the LazyFrame.

func (*LazyFrame) SortBy ¶

func (lf *LazyFrame) SortBy(columns []string, ascending []bool) *LazyFrame

SortBy adds a multi-column sort operation to the LazyFrame.

func (*LazyFrame) String ¶

func (lf *LazyFrame) String() string

String returns a string representation of the LazyFrame operations.

func (*LazyFrame) WithColumn ¶

func (lf *LazyFrame) WithColumn(name string, expression expr.Expr) *LazyFrame

WithColumn adds a new column to the LazyFrame.

The expression can be any expression that implements expr.Expr interface.

type LazyGroupBy ¶

type LazyGroupBy struct {
	// contains filtered or unexported fields
}

LazyGroupBy is the public type for lazy group by operations.

func (*LazyGroupBy) Agg ¶

func (lgb *LazyGroupBy) Agg(aggregations ...*expr.AggregationExpr) *LazyFrame

Agg adds aggregation operations to the LazyGroupBy.

Takes aggregation expressions directly without wrapper overhead.

func (*LazyGroupBy) Count ¶

func (lgb *LazyGroupBy) Count(column string) *LazyFrame

Count adds a count aggregation for the specified column.

func (*LazyGroupBy) Max ¶

func (lgb *LazyGroupBy) Max(column string) *LazyFrame

Max adds a max aggregation for the specified column.

func (*LazyGroupBy) Mean ¶

func (lgb *LazyGroupBy) Mean(column string) *LazyFrame

Mean adds a mean aggregation for the specified column.

func (*LazyGroupBy) Min ¶

func (lgb *LazyGroupBy) Min(column string) *LazyFrame

Min adds a min aggregation for the specified column.

func (*LazyGroupBy) Sum ¶

func (lgb *LazyGroupBy) Sum(column string) *LazyFrame

Sum adds a sum aggregation for the specified column.

type MemoryAwareChunkReader ¶

type MemoryAwareChunkReader struct {
	// contains filtered or unexported fields
}

MemoryAwareChunkReader wraps a ChunkReader with memory monitoring.

func NewMemoryAwareChunkReader ¶

func NewMemoryAwareChunkReader(reader ChunkReader, monitor *MemoryUsageMonitor) *MemoryAwareChunkReader

NewMemoryAwareChunkReader creates a new memory-aware chunk reader.

func (*MemoryAwareChunkReader) Close ¶

func (mr *MemoryAwareChunkReader) Close() error

Close closes the underlying reader.

func (*MemoryAwareChunkReader) HasNext ¶

func (mr *MemoryAwareChunkReader) HasNext() bool

HasNext returns true if there are more chunks to read.

func (*MemoryAwareChunkReader) ReadChunk ¶

func (mr *MemoryAwareChunkReader) ReadChunk() (*DataFrame, error)

ReadChunk reads the next chunk and records memory usage.

type MemoryManager ¶

type MemoryManager struct {
	// contains filtered or unexported fields
}

MemoryManager helps track and release multiple resources automatically.

MemoryManager is useful for complex scenarios where many short-lived resources are created and need bulk cleanup. For most use cases, prefer the defer pattern with individual Release() calls for better readability.

Use MemoryManager when:

Creating many temporary resources in loops
Complex operations with unpredictable resource lifetimes
Bulk operations where individual defer statements are impractical

The MemoryManager is safe for concurrent use from multiple goroutines.

Example:

err := gorilla.WithMemoryManager(mem, func(manager *gorilla.MemoryManager) error {
	for i := 0; i < 1000; i++ {
		temp := createTempDataFrame(i)
		manager.Track(temp) // Will be released automatically
	}
	return processData()
})
// All tracked resources are released here

func NewMemoryManager ¶

func NewMemoryManager(allocator memory.Allocator) *MemoryManager

NewMemoryManager creates a new memory manager with the given allocator.

func (*MemoryManager) Count ¶

func (m *MemoryManager) Count() int

Count returns the number of tracked resources.

func (*MemoryManager) ReleaseAll ¶

func (m *MemoryManager) ReleaseAll()

ReleaseAll releases all tracked resources and clears the tracking list.

func (*MemoryManager) Track ¶

func (m *MemoryManager) Track(resource Releasable)

Track adds a resource to be managed and automatically released.

type MemoryStats ¶

type MemoryStats struct {
	// Total allocated memory in bytes
	AllocatedBytes int64
	// Peak allocated memory in bytes
	PeakAllocatedBytes int64
	// Number of active allocations
	ActiveAllocations int64
	// Number of garbage collections triggered
	GCCount int64
	// Last garbage collection time
	LastGCTime time.Time
	// Memory pressure level (0.0 to 1.0)
	MemoryPressure float64
}

MemoryStats provides detailed memory usage statistics.

type MemoryUsageMonitor ¶

type MemoryUsageMonitor struct {
	// contains filtered or unexported fields
}

MemoryUsageMonitor provides real-time memory usage monitoring and automatic spilling.

func NewMemoryUsageMonitor ¶

func NewMemoryUsageMonitor(spillThreshold int64) *MemoryUsageMonitor

NewMemoryUsageMonitor creates a new memory usage monitor with the specified spill threshold.

func NewMemoryUsageMonitorFromConfig ¶ added in v0.3.1

func NewMemoryUsageMonitorFromConfig(cfg config.Config) *MemoryUsageMonitor

NewMemoryUsageMonitorFromConfig creates a new memory usage monitor using configuration values.

func (*MemoryUsageMonitor) CurrentUsage ¶

func (m *MemoryUsageMonitor) CurrentUsage() int64

CurrentUsage returns the current memory usage in bytes.

func (*MemoryUsageMonitor) GetStats ¶

func (m *MemoryUsageMonitor) GetStats() MemoryStats

GetStats returns current memory usage statistics.

func (*MemoryUsageMonitor) PeakUsage ¶

func (m *MemoryUsageMonitor) PeakUsage() int64

PeakUsage returns the peak memory usage in bytes.

func (*MemoryUsageMonitor) RecordAllocation ¶

func (m *MemoryUsageMonitor) RecordAllocation(bytes int64)

RecordAllocation records a new memory allocation.

func (*MemoryUsageMonitor) RecordDeallocation ¶

func (m *MemoryUsageMonitor) RecordDeallocation(bytes int64)

RecordDeallocation records a memory deallocation.

func (*MemoryUsageMonitor) SetCleanupCallback ¶

func (m *MemoryUsageMonitor) SetCleanupCallback(callback func() error)

SetCleanupCallback sets the callback function to be called for cleanup operations.

func (*MemoryUsageMonitor) SetSpillCallback ¶

func (m *MemoryUsageMonitor) SetSpillCallback(callback func() error)

SetSpillCallback sets the callback function to be called when memory usage exceeds threshold.

func (*MemoryUsageMonitor) SpillCount ¶

func (m *MemoryUsageMonitor) SpillCount() int64

SpillCount returns the number of times spilling was triggered.

func (*MemoryUsageMonitor) StartMonitoring ¶

func (m *MemoryUsageMonitor) StartMonitoring()

StartMonitoring starts the background memory monitoring goroutine.

func (*MemoryUsageMonitor) StopMonitoring ¶

func (m *MemoryUsageMonitor) StopMonitoring()

StopMonitoring stops the background memory monitoring goroutine.

type Releasable ¶

type Releasable interface {
	Release()
}

Releasable represents any resource that can be released to free memory.

This interface is implemented by DataFrames, Series, and other resources that use Apache Arrow memory management. Always call Release() when done with a resource to prevent memory leaks.

The recommended pattern is to use defer for automatic cleanup:

df := gorilla.NewDataFrame(series1, series2)
defer df.Release() // Automatic cleanup

type SQLExecutor ¶ added in v0.3.1

type SQLExecutor struct {
	// contains filtered or unexported fields
}

SQLExecutor provides SQL query execution capabilities for DataFrames.

SQLExecutor allows you to execute SQL queries against registered DataFrames, providing a familiar SQL interface for data analysis and manipulation. It supports standard SQL operations including SELECT, WHERE, GROUP BY, ORDER BY, and LIMIT.

Key features:

Standard SQL syntax support (SELECT, FROM, WHERE, GROUP BY, etc.)
Integration with existing DataFrame operations and optimizations
Query validation and error reporting
Prepared statement support for query reuse
EXPLAIN functionality for query plan analysis

Memory management: SQLExecutor manages memory automatically, but result DataFrames must still be released by the caller:

executor := gorilla.NewSQLExecutor(mem)
result, err := executor.Execute("SELECT name FROM employees WHERE age > 30")
defer result.Release()

Example:

mem := memory.NewGoAllocator()
executor := gorilla.NewSQLExecutor(mem)

// Register DataFrames as tables
executor.RegisterTable("employees", employeesDF)
executor.RegisterTable("departments", departmentsDF)

// Execute SQL queries
result, err := executor.Execute(`
	SELECT e.name, d.department_name, e.salary
	FROM employees e
	JOIN departments d ON e.dept_id = d.id
	WHERE e.salary > 50000
	ORDER BY e.salary DESC
	LIMIT 10
`)
if err != nil {
	panic(err)
}
defer result.Release()

func NewSQLExecutor ¶ added in v0.3.1

func NewSQLExecutor(mem memory.Allocator) *SQLExecutor

NewSQLExecutor creates a new SQL executor for DataFrame queries.

The executor allows you to register DataFrames as tables and execute SQL queries against them. All SQL operations are translated to efficient DataFrame operations and benefit from the same optimizations as direct DataFrame API usage.

Parameters:

mem: Apache Arrow memory allocator for managing query results

Example:

mem := memory.NewGoAllocator()
executor := gorilla.NewSQLExecutor(mem)

// Register a DataFrame as a table
executor.RegisterTable("sales", salesDF)

// Execute SQL queries
result, err := executor.Execute("SELECT * FROM sales WHERE amount > 1000")
defer result.Release()

func (*SQLExecutor) BatchExecute ¶ added in v0.3.1

func (se *SQLExecutor) BatchExecute(queries []string) ([]*DataFrame, error)

BatchExecute executes multiple SQL statements in sequence.

Executes queries in order and returns all results. If any query fails, all previously successful results are cleaned up and an error is returned.

Parameters:

queries: Slice of SQL query strings to execute

Returns:

Slice of DataFrames containing results (all must be released by caller)
Error if any query fails

Example:

queries := []string{
	"SELECT * FROM employees WHERE active = true",
	"SELECT department, COUNT(*) FROM employees GROUP BY department",
	"SELECT AVG(salary) FROM employees",
}
results, err := executor.BatchExecute(queries)
if err != nil {
	return err
}
defer func() {
	for _, result := range results {
		result.Release()
	}
}()

func (*SQLExecutor) ClearTables ¶ added in v0.3.1

func (se *SQLExecutor) ClearTables()

ClearTables removes all registered tables.

Useful for cleaning up or resetting the executor state. After calling this, all previously registered tables will need to be re-registered before use.

Example:

executor.ClearTables()
// Need to re-register tables before executing queries

func (*SQLExecutor) Execute ¶ added in v0.3.1

func (se *SQLExecutor) Execute(query string) (*DataFrame, error)

Execute executes a SQL query and returns the result DataFrame.

Supports standard SQL SELECT syntax including:

Column selection: SELECT col1, col2, *
Computed columns: SELECT col1, col2 * 2 AS doubled
Filtering: WHERE conditions with AND, OR, comparison operators
Aggregation: GROUP BY with SUM, COUNT, AVG, MIN, MAX functions
Sorting: ORDER BY with ASC/DESC
Limiting: LIMIT and OFFSET clauses
String functions: UPPER, LOWER, LENGTH
Math functions: ABS, ROUND
Conditional logic: CASE WHEN expressions

Parameters:

query: SQL query string to execute

Returns:

DataFrame containing query results (must be released by caller)
Error if query parsing, validation, or execution fails

Example:

result, err := executor.Execute(`
	SELECT department, AVG(salary) as avg_salary, COUNT(*) as employee_count
	FROM employees
	WHERE active = true
	GROUP BY department
	HAVING AVG(salary) > 50000
	ORDER BY avg_salary DESC
	LIMIT 5
`)
if err != nil {
	return err
}
defer result.Release()

func (*SQLExecutor) Explain ¶ added in v0.3.1

func (se *SQLExecutor) Explain(query string) (string, error)

Explain returns the execution plan for a SQL query.

Shows how the SQL query will be translated to DataFrame operations, including optimization steps, operation order, and estimated performance characteristics. Useful for query optimization and debugging.

Parameters:

query: SQL query string to explain

Returns:

String representation of the execution plan
Error if query parsing or translation fails

Example:

plan, err := executor.Explain("SELECT name FROM employees WHERE salary > 50000")
if err != nil {
	return err
}
fmt.Println("Execution Plan:")
fmt.Println(plan)

func (*SQLExecutor) GetRegisteredTables ¶ added in v0.3.1

func (se *SQLExecutor) GetRegisteredTables() []string

GetRegisteredTables returns the list of registered table names.

Useful for introspection and debugging to see which tables are available for SQL queries.

Returns:

Slice of table names that can be used in SQL queries

Example:

tables := executor.GetRegisteredTables()
fmt.Printf("Available tables: %v\n", tables)

func (*SQLExecutor) RegisterTable ¶ added in v0.3.1

func (se *SQLExecutor) RegisterTable(name string, df *DataFrame)

RegisterTable registers a DataFrame with a table name for SQL queries.

Once registered, the DataFrame can be referenced by the table name in SQL queries. Multiple DataFrames can be registered to support JOIN operations and complex queries.

Parameters:

name: Table name to use in SQL queries (case-sensitive)
df: DataFrame to register as a table

Example:

executor.RegisterTable("employees", employeesDF)
executor.RegisterTable("departments", departmentsDF)

// Now can use in SQL
result, err := executor.Execute("SELECT * FROM employees WHERE dept_id IN (SELECT id FROM departments)")

func (*SQLExecutor) ValidateQuery ¶ added in v0.3.1

func (se *SQLExecutor) ValidateQuery(query string) error

ValidateQuery validates a SQL query without executing it.

Performs syntax checking, table reference validation, and semantic analysis to ensure the query is valid before execution. Useful for query validation in applications or prepared statement creation.

Parameters:

query: SQL query string to validate

Returns:

nil if query is valid
Error describing validation issues

Example:

if err := executor.ValidateQuery("SELECT * FROM employees WHERE age >"); err != nil {
	fmt.Printf("Query validation failed: %v\n", err)
}

type SpillableBatch ¶

type SpillableBatch struct {
	// contains filtered or unexported fields
}

SpillableBatch represents a batch of data that can be spilled to disk. Implements the memory.Resource interface for consistent resource management.

func NewSpillableBatch ¶

func NewSpillableBatch(data *DataFrame) *SpillableBatch

NewSpillableBatch creates a new spillable batch.

func (*SpillableBatch) EstimateMemory ¶ added in v0.3.1

func (sb *SpillableBatch) EstimateMemory() int64

EstimateMemory returns the estimated memory usage of the batch. Implements the memory.Resource interface.

func (*SpillableBatch) ForceCleanup ¶ added in v0.3.1

func (sb *SpillableBatch) ForceCleanup() error

ForceCleanup performs cleanup operations on the batch. Implements the memory.Resource interface.

func (*SpillableBatch) GetData ¶

func (sb *SpillableBatch) GetData() (*DataFrame, error)

GetData returns the data, loading from spill if necessary. If the data has been spilled to disk, this method attempts to reload it. Returns an error if the data cannot be loaded from spill storage.

func (*SpillableBatch) Release ¶

func (sb *SpillableBatch) Release()

Release releases the batch resources.

func (*SpillableBatch) Spill ¶

func (sb *SpillableBatch) Spill() error

Spill spills the batch to disk and releases memory.

func (*SpillableBatch) SpillIfNeeded ¶ added in v0.3.1

func (sb *SpillableBatch) SpillIfNeeded() error

SpillIfNeeded checks if the batch should be spilled and performs spilling. Implements the memory.Resource interface.

type StreamingOperation ¶

type StreamingOperation interface {
	// Apply applies the operation to a data chunk
	Apply(*DataFrame) (*DataFrame, error)
	// Release releases any resources held by the operation
	Release()
}

StreamingOperation represents an operation that can be applied to data chunks.

type StreamingProcessor ¶

type StreamingProcessor struct {
	// contains filtered or unexported fields
}

StreamingProcessor handles processing of datasets larger than memory.

func NewStreamingProcessor ¶

func NewStreamingProcessor(chunkSize int, allocator memory.Allocator, monitor *MemoryUsageMonitor) *StreamingProcessor

NewStreamingProcessor creates a new streaming processor with specified chunk size.

func (*StreamingProcessor) Close ¶

func (sp *StreamingProcessor) Close() error

Close closes the streaming processor and releases resources.

func (*StreamingProcessor) ProcessStreaming ¶

func (sp *StreamingProcessor) ProcessStreaming(
	reader ChunkReader, writer ChunkWriter, operations []StreamingOperation,
) error

ProcessStreaming processes data in streaming fashion using chunks.

Source Files ¶

View all Source files

Directories ¶

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL

Path	Synopsis
cmd
gorilla-cli command
examples
config command Package main demonstrates how to use Gorilla's configurable processing parameters	Package main demonstrates how to use Gorilla's configurable processing parameters
etl_pipeline command Package main demonstrates ETL (Extract, Transform, Load) pipeline patterns with Gorilla DataFrame.	Package main demonstrates ETL (Extract, Transform, Load) pipeline patterns with Gorilla DataFrame.
financial_analysis command Package main demonstrates complex financial data analysis with Gorilla DataFrame.	Package main demonstrates complex financial data analysis with Gorilla DataFrame.
having_arithmetic command
io command
ml_preprocessing command Package main demonstrates machine learning preprocessing patterns with Gorilla DataFrame.	Package main demonstrates machine learning preprocessing patterns with Gorilla DataFrame.
monitoring command Package main demonstrates the performance monitoring and metrics functionality.	Package main demonstrates the performance monitoring and metrics functionality.
time_series command Package main demonstrates time series analysis with Gorilla DataFrame.	Package main demonstrates time series analysis with Gorilla DataFrame.
internal
common Package common provides shared utilities for string representations and type conversions	Package common provides shared utilities for string representations and type conversions
config Package config provides configuration management for Gorilla DataFrame operations	Package config provides configuration management for Gorilla DataFrame operations
dataframe Package dataframe provides high-performance DataFrame operations	Package dataframe provides high-performance DataFrame operations
errors Package errors provides standardized error types for DataFrame operations.	Package errors provides standardized error types for DataFrame operations.
expr Package expr provides expression evaluation for DataFrame operations	Package expr provides expression evaluation for DataFrame operations
io Package io provides data input/output operations for DataFrames.	Package io provides data input/output operations for DataFrames.
memory Package memory provides integration examples showing how to refactor existing duplicate code patterns to use the consolidated memory utilities.	Package memory provides integration examples showing how to refactor existing duplicate code patterns to use the consolidated memory utilities.
monitoring Package monitoring provides performance monitoring and metrics collection for DataFrame operations.	Package monitoring provides performance monitoring and metrics collection for DataFrame operations.
parallel Package parallel provides parallel processing infrastructure for DataFrame operations.	Package parallel provides parallel processing infrastructure for DataFrame operations.
series Package series provides data structures for column operations	Package series provides data structures for column operations
sql Package sql provides SQL query parsing and execution for DataFrames	Package sql provides SQL query parsing and execution for DataFrames
sql/sqlutil
testutil Package testutil provides common testing utilities to reduce code duplication across test files in the gorilla DataFrame library.	Package testutil provides common testing utilities to reduce code duplication across test files in the gorilla DataFrame library.
validation Package validation provides input validation utilities for DataFrame operations.	Package validation provides input validation utilities for DataFrame operations.
version Package version provides version information for the Gorilla DataFrame library.	Package version provides version information for the Gorilla DataFrame library.