io

package
v0.4.0

Published: Aug 2, 2025 License: Apache-2.0, MIT

Documentation

Overview

Package io provides input/output operations for reading and writing DataFrame data.

This package includes readers and writers for CSV, JSON, and Parquet formats, with automatic type inference, schema handling, and support for streaming large datasets.

Key components:

  • DataReader/DataWriter interfaces for pluggable I/O backends
  • CSVReader/CSVWriter, JSONReader/JSONWriter, and ParquetReader/ParquetWriter for format-specific file operations
  • Type inference for automatic schema detection
  • Configurable options for delimiters, headers, compression, and batch sizes

Memory management: All I/O operations integrate with Apache Arrow's memory management system and require proper cleanup with defer patterns.
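
For example, a minimal end-to-end read with deferred cleanup might look like the sketch below. The module import path, the dfio alias, and the DataFrame's Release method are assumptions for illustration, not part of this package's documented API; the Arrow module path varies by version.

package main

import (
	"fmt"
	"os"

	"github.com/apache/arrow-go/v18/arrow/memory" // Arrow allocator; module path and version may differ

	dfio "example.com/project/io" // hypothetical import path for this package
)

func main() {
	f, err := os.Open("data.csv")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Parse the CSV into a DataFrame using the default options.
	reader := dfio.NewCSVReader(f, dfio.DefaultCSVOptions(), memory.NewGoAllocator())
	df, err := reader.Read()
	if err != nil {
		panic(err)
	}
	defer df.Release() // assumed Arrow-style cleanup method; see the note above

	fmt.Println(df)
}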

Index

Constants

const (
	// DefaultChunkSize is the default chunk size for parallel processing.
	DefaultChunkSize = 1000
	// DefaultBatchSize is the default batch size for I/O operations.
	DefaultBatchSize = 1000
	// DefaultRowGroupSize is the default row group size for Parquet files.
	DefaultRowGroupSize = 100000 // 100K rows per group
	// DefaultPageSize is the default page size for Parquet files.
	DefaultPageSize = 1048576 // 1MB pages
)
const (
	// ArrowTypeInt64 is the Arrow type name for int64 columns.
	ArrowTypeInt64 = "int64"
	// ArrowTypeInt32 is the Arrow type name for int32 columns.
	ArrowTypeInt32 = "int32"
	// ArrowTypeFloat64 is the Arrow type name for float64 columns.
	ArrowTypeFloat64 = "float64"
	// ArrowTypeFloat32 is the Arrow type name for float32 columns.
	ArrowTypeFloat32 = "float32"
	// ArrowTypeBool is the Arrow type name for bool columns.
	ArrowTypeBool = "bool"
	// ArrowTypeString is the Arrow type name for string columns.
	ArrowTypeString = "utf8"
)

Arrow data type name constants for consistent usage across I/O implementations.

Variables

This section is empty.

Functions

This section is empty.

Types

type CSVOptions

type CSVOptions struct {
	// Delimiter is the field delimiter (default: comma)
	Delimiter rune
	// Comment is the comment character (default: 0 = disabled)
	Comment rune
	// Header indicates whether the first row contains headers
	Header bool
	// SkipInitialSpace indicates whether to skip initial whitespace
	SkipInitialSpace bool
	// Parallel indicates whether to use parallel processing
	Parallel bool
	// ChunkSize is the size of chunks for parallel processing
	ChunkSize int
}

CSVOptions contains configuration options for CSV operations.

func DefaultCSVOptions

func DefaultCSVOptions() CSVOptions

DefaultCSVOptions returns default CSV options.
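
A sketch of overriding the defaults for a semicolon-delimited file with a header row (dfio aliases this package; only documented fields are set):

func semicolonOptions() dfio.CSVOptions {
	opts := dfio.DefaultCSVOptions()
	opts.Delimiter = ';'                   // parse semicolon-separated fields
	opts.Header = true                     // treat the first row as column names
	opts.Parallel = true                   // process chunks concurrently
	opts.ChunkSize = dfio.DefaultChunkSize // 1000 rows per chunk
	return opts
}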

type CSVReader

type CSVReader struct {
	// contains filtered or unexported fields
}

CSVReader reads CSV data and converts it to DataFrames.

func NewCSVReader

func NewCSVReader(reader io.Reader, options CSVOptions, mem memory.Allocator) *CSVReader

NewCSVReader creates a new CSV reader with the specified options.

func (*CSVReader) Read

func (r *CSVReader) Read() (*dataframe.DataFrame, error)

Read reads CSV data and returns a DataFrame.

type CSVWriter

type CSVWriter struct {
	// contains filtered or unexported fields
}

CSVWriter writes DataFrames to CSV format.

func NewCSVWriter

func NewCSVWriter(writer io.Writer, options CSVOptions) *CSVWriter

NewCSVWriter creates a new CSV writer with the specified options.

func (*CSVWriter) Write

func (w *CSVWriter) Write(df *dataframe.DataFrame) error

Write writes the DataFrame to CSV format.
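
For instance, persisting a DataFrame to disk (a sketch; dfio and dataframe stand for this module's packages, with import paths assumed):

func writeCSV(df *dataframe.DataFrame, path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// Default options emit comma-delimited output.
	return dfio.NewCSVWriter(f, dfio.DefaultCSVOptions()).Write(df)
}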

type DataReader

type DataReader interface {
	// Read reads data from the source and returns a DataFrame
	Read() (*dataframe.DataFrame, error)
}

DataReader defines the interface for reading data from various sources.

type DataWriter

type DataWriter interface {
	// Write writes the DataFrame to the destination
	Write(df *dataframe.DataFrame) error
}

DataWriter defines the interface for writing data to various destinations.
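
Because both interfaces are format-agnostic, any reader can feed any writer, so a CSV-to-Parquet converter reduces to a few lines (a sketch; Release is an assumed cleanup method):

func convert(r dfio.DataReader, w dfio.DataWriter) error {
	df, err := r.Read()
	if err != nil {
		return err
	}
	defer df.Release() // assumed cleanup; see Memory management in the overview
	return w.Write(df)
}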

type JSONFormat added in v0.4.0

type JSONFormat int

JSONFormat specifies the JSON format type.

const (
	// JSONArray format stores data as a JSON array of objects.
	JSONArray JSONFormat = iota
	// JSONLines format stores data as newline-delimited JSON objects.
	JSONLines
)

type JSONOptions added in v0.4.0

type JSONOptions struct {
	// Format specifies whether to use JSON array or JSON Lines format
	Format JSONFormat
	// TypeInference enables automatic type inference from JSON values
	TypeInference bool
	// DateFormat specifies the format for parsing date strings
	DateFormat string
	// NullValues specifies string values that should be treated as null
	NullValues []string
	// MaxRecords limits the number of records to read (0 = no limit)
	MaxRecords int
	// Parallel enables parallel processing for large JSON files
	Parallel bool
}

JSONOptions contains configuration options for JSON operations.

func DefaultJSONOptions added in v0.4.0

func DefaultJSONOptions() JSONOptions

DefaultJSONOptions returns default JSON options.
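
A sketch of configuring newline-delimited input with common null markers (only documented fields are set):

func jsonLinesOptions() dfio.JSONOptions {
	opts := dfio.DefaultJSONOptions()
	opts.Format = dfio.JSONLines                 // one JSON object per line
	opts.TypeInference = true                    // infer column types from values
	opts.NullValues = []string{"", "null", "NA"} // treat these strings as null
	opts.MaxRecords = 0                          // 0 = no limit
	return opts
}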

type JSONReader added in v0.4.0

type JSONReader struct {
	// contains filtered or unexported fields
}

JSONReader reads JSON data and converts it to DataFrames.

func NewJSONReader added in v0.4.0

func NewJSONReader(reader io.Reader, options JSONOptions, mem memory.Allocator) *JSONReader

NewJSONReader creates a new JSON reader with the specified options.

func (*JSONReader) Read added in v0.4.0

func (r *JSONReader) Read() (*dataframe.DataFrame, error)

Read reads JSON data and returns a DataFrame.
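
For example, parsing JSON Lines records held in memory (a sketch; the dfio and dataframe import paths are assumed, and the allocator comes from the Arrow module):

func readJSONLines() (*dataframe.DataFrame, error) {
	data := `{"id": 1, "name": "a"}
{"id": 2, "name": "b"}`

	opts := dfio.DefaultJSONOptions()
	opts.Format = dfio.JSONLines

	r := dfio.NewJSONReader(strings.NewReader(data), opts, memory.NewGoAllocator())
	return r.Read()
}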

type JSONWriter added in v0.4.0

type JSONWriter struct {
	// contains filtered or unexported fields
}

JSONWriter writes DataFrames to JSON format.

func NewJSONWriter added in v0.4.0

func NewJSONWriter(writer io.Writer, options JSONOptions) *JSONWriter

NewJSONWriter creates a new JSON writer with the specified options.

func (*JSONWriter) Write added in v0.4.0

func (w *JSONWriter) Write(df *dataframe.DataFrame) error

Write writes the DataFrame to JSON format.
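
A sketch of encoding a DataFrame as a JSON array; io here is the standard library package, with this package imported under the dfio alias to avoid the name clash:

func writeJSONArray(df *dataframe.DataFrame, w io.Writer) error {
	opts := dfio.DefaultJSONOptions()
	opts.Format = dfio.JSONArray // [{...}, {...}] rather than one object per line
	return dfio.NewJSONWriter(w, opts).Write(df)
}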

type ParquetOptions

type ParquetOptions struct {
	// Compression type for Parquet files (snappy, gzip, lz4, zstd, uncompressed)
	Compression string
	// BatchSize for reading/writing operations
	BatchSize int
	// ColumnsToRead for selective column reading (nil reads all columns)
	ColumnsToRead []string
	// ParallelDecoding enables parallel decoding for better performance
	ParallelDecoding bool
	// RowGroupSize specifies the target size for row groups in rows
	RowGroupSize int64
	// PageSize specifies the target size for pages in bytes
	PageSize int64
	// EnableDict enables dictionary encoding for string columns
	EnableDict bool
}

ParquetOptions contains configuration options for Parquet operations.

func DefaultParquetOptions

func DefaultParquetOptions() ParquetOptions

DefaultParquetOptions returns default Parquet options.
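
A sketch of tuning the options for write-heavy analytics workloads; the values are illustrative, not recommendations:

func analyticsParquetOptions() dfio.ParquetOptions {
	opts := dfio.DefaultParquetOptions()
	opts.Compression = "zstd"                    // snappy, gzip, lz4, zstd, or uncompressed
	opts.RowGroupSize = dfio.DefaultRowGroupSize // 100K rows per group
	opts.PageSize = dfio.DefaultPageSize         // 1MB pages
	opts.EnableDict = true                       // dictionary-encode string columns
	return opts
}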

type ParquetReader added in v0.2.0

type ParquetReader struct {
	// contains filtered or unexported fields
}

ParquetReader reads Parquet data and converts it to DataFrames.

func NewParquetReader added in v0.2.0

func NewParquetReader(reader io.Reader, options ParquetOptions, mem memory.Allocator) *ParquetReader

NewParquetReader creates a new Parquet reader with the specified options.

func (*ParquetReader) Read added in v0.2.0

func (r *ParquetReader) Read() (*dataframe.DataFrame, error)

Read reads Parquet data and returns a DataFrame.
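
For example, loading only selected columns with parallel decoding (a sketch; import paths assumed):

func readColumns(path string, cols []string) (*dataframe.DataFrame, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	opts := dfio.DefaultParquetOptions()
	opts.ColumnsToRead = cols // nil would read every column
	opts.ParallelDecoding = true

	return dfio.NewParquetReader(f, opts, memory.NewGoAllocator()).Read()
}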

type ParquetWriter added in v0.2.0

type ParquetWriter struct {
	// contains filtered or unexported fields
}

ParquetWriter writes DataFrames to Parquet format.

func NewParquetWriter added in v0.2.0

func NewParquetWriter(writer io.Writer, options ParquetOptions) *ParquetWriter

NewParquetWriter creates a new Parquet writer with the specified options.

func (*ParquetWriter) Write added in v0.2.0

func (w *ParquetWriter) Write(df *dataframe.DataFrame) error

Write writes the DataFrame to Parquet format.
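
And the matching write path with explicit compression (a sketch; import paths assumed):

func writeParquet(df *dataframe.DataFrame, path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	opts := dfio.DefaultParquetOptions()
	opts.Compression = "snappy"

	return dfio.NewParquetWriter(f, opts).Write(df)
}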
