Documentation ¶
Overview ¶
Package io provides I/O operations for reading and writing DataFrame data.

This package includes readers and writers for several data formats (CSV, JSON, and Parquet), with automatic type inference and schema handling. CSV I/O includes support for streaming large datasets.
Key components:
- DataReader/DataWriter interfaces for pluggable I/O backends
- CSVReader/CSVWriter for CSV file operations
- JSONReader/JSONWriter for JSON and JSON Lines operations
- ParquetReader/ParquetWriter for Parquet file operations
- Type inference for automatic schema detection
- Configurable options for delimiters, headers, and batch sizes
Memory management: All I/O operations integrate with Apache Arrow's memory management system and require proper cleanup with defer patterns.
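As a minimal sketch of that pattern, assuming a hypothetical module import path and a Release method on DataFrame (the usual Arrow convention; neither is confirmed by this index):

package main

import (
	"log"
	"os"

	"github.com/apache/arrow/go/v14/arrow/memory" // match the Arrow version this module uses

	dfio "example.com/yourmodule/io" // hypothetical import path for this package
)

func main() {
	f, err := os.Open("records.json")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	mem := memory.NewGoAllocator()
	r := dfio.NewJSONReader(f, dfio.DefaultJSONOptions(), mem)

	df, err := r.Read() // assumes JSONReader satisfies the DataReader interface
	if err != nil {
		log.Fatal(err)
	}
	defer df.Release() // assumed cleanup method, per the Arrow defer pattern

	// ... work with df ...
}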
Index ¶
Constants ¶
const (
	// DefaultChunkSize is the default chunk size for parallel processing.
	DefaultChunkSize = 1000
	// DefaultBatchSize is the default batch size for I/O operations.
	DefaultBatchSize = 1000
	// DefaultRowGroupSize is the default row group size for Parquet files.
	DefaultRowGroupSize = 100000 // 100K rows per group
	// DefaultPageSize is the default page size for Parquet files.
	DefaultPageSize = 1048576 // 1MB pages
)
const (
	// ArrowTypeInt64 is the Arrow type name for int64 columns.
	ArrowTypeInt64 = "int64"
	// ArrowTypeInt32 is the Arrow type name for int32 columns.
	ArrowTypeInt32 = "int32"
	// ArrowTypeFloat64 is the Arrow type name for float64 columns.
	ArrowTypeFloat64 = "float64"
	// ArrowTypeFloat32 is the Arrow type name for float32 columns.
	ArrowTypeFloat32 = "float32"
	// ArrowTypeBool is the Arrow type name for bool columns.
	ArrowTypeBool = "bool"
	// ArrowTypeString is the Arrow type name for string columns.
	ArrowTypeString = "utf8"
)
Arrow data type name constants for consistent usage across I/O implementations.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type CSVOptions ¶
type CSVOptions struct {
// Delimiter is the field delimiter (default: comma)
Delimiter rune
// Comment is the comment character (default: 0 = disabled)
Comment rune
// Header indicates whether the first row contains headers
Header bool
// SkipInitialSpace indicates whether to skip initial whitespace
SkipInitialSpace bool
// Parallel indicates whether to use parallel processing
Parallel bool
// ChunkSize is the size of chunks for parallel processing
ChunkSize int
}
CSVOptions contains configuration options for CSV operations.
func DefaultCSVOptions ¶
func DefaultCSVOptions() CSVOptions
DefaultCSVOptions returns default CSV options.
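For instance, a sketch of starting from the defaults and overriding a few fields for a semicolon-delimited file with a header row (dfio is an assumed import alias for this package):

// csvOptionsForSemicolonFile returns options for a semicolon-delimited,
// header-bearing file, starting from the package defaults.
func csvOptionsForSemicolonFile() dfio.CSVOptions {
	opts := dfio.DefaultCSVOptions()
	opts.Delimiter = ';'  // semicolon-separated fields
	opts.Comment = '#'    // skip lines starting with '#'
	opts.Header = true    // first row contains column names
	opts.Parallel = true  // enable chunked parallel parsing
	opts.ChunkSize = 5000 // rows per chunk (default: DefaultChunkSize)
	return opts
}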
type CSVReader ¶
type CSVReader struct {
// contains filtered or unexported fields
}
CSVReader reads CSV data and converts it to DataFrames.
func NewCSVReader ¶
NewCSVReader creates a new CSV reader with the specified options.
type CSVWriter ¶
type CSVWriter struct {
// contains filtered or unexported fields
}
CSVWriter writes DataFrames to CSV format.
func NewCSVWriter ¶
func NewCSVWriter(writer io.Writer, options CSVOptions) *CSVWriter
NewCSVWriter creates a new CSV writer with the specified options.
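A sketch of writing an existing DataFrame to a CSV file with this constructor (dfio and dataframe are assumed import aliases; that CSVWriter's write method mirrors the DataWriter interface below is an assumption):

// writeCSV writes df to a CSV file at path, emitting a header row.
func writeCSV(path string, df *dataframe.DataFrame) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	opts := dfio.DefaultCSVOptions()
	opts.Header = true // emit column names as the first row

	w := dfio.NewCSVWriter(f, opts)
	return w.Write(df) // assumed to mirror DataWriter.Write
}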
type DataReader ¶
type DataReader interface {
// Read reads data from the source and returns a DataFrame
Read() (*dataframe.DataFrame, error)
}
DataReader defines the interface for reading data from various sources.
type DataWriter ¶
type DataWriter interface {
// Write writes the DataFrame to the destination
Write(df *dataframe.DataFrame) error
}
DataWriter defines the interface for writing data to various destinations.
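Because sources and sinks are expressed as interfaces, format-agnostic code can be written against them. A sketch, assuming the concrete readers and writers in this package satisfy DataReader and DataWriter and that DataFrame has a Release method:

// convert copies data from any DataReader to any DataWriter,
// independent of the underlying format (CSV, JSON, Parquet).
func convert(src dfio.DataReader, dst dfio.DataWriter) error {
	df, err := src.Read()
	if err != nil {
		return err
	}
	defer df.Release() // assumed cleanup method, per the Arrow defer pattern

	return dst.Write(df)
}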
type JSONFormat ¶ added in v0.4.0
type JSONFormat int
JSONFormat specifies the JSON format type.
const (
	// JSONArray format stores data as a JSON array of objects.
	JSONArray JSONFormat = iota
	// JSONLines format stores data as newline-delimited JSON objects.
	JSONLines
)
type JSONOptions ¶ added in v0.4.0
type JSONOptions struct {
// Format specifies whether to use JSON array or JSON Lines format
Format JSONFormat
// TypeInference enables automatic type inference from JSON values
TypeInference bool
// DateFormat specifies the format for parsing date strings
DateFormat string
// NullValues specifies string values that should be treated as null
NullValues []string
// MaxRecords limits the number of records to read (0 = no limit)
MaxRecords int
// Parallel enables parallel processing for large JSON files
Parallel bool
}
JSONOptions contains configuration options for JSON operations.
func DefaultJSONOptions ¶ added in v0.4.0
func DefaultJSONOptions() JSONOptions
DefaultJSONOptions returns default JSON options.
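For example, a sketch of options for reading newline-delimited JSON with custom null markers and a record cap (dfio is an assumed import alias for this package):

// jsonLinesOptions configures reading of newline-delimited JSON.
func jsonLinesOptions() dfio.JSONOptions {
	opts := dfio.DefaultJSONOptions()
	opts.Format = dfio.JSONLines // newline-delimited JSON objects
	opts.TypeInference = true    // infer column types from values
	opts.NullValues = []string{"", "null", "N/A"}
	opts.MaxRecords = 100000 // stop after 100K records (0 = no limit)
	return opts
}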
type JSONReader ¶ added in v0.4.0
type JSONReader struct {
// contains filtered or unexported fields
}
JSONReader reads JSON data and converts it to DataFrames.
func NewJSONReader ¶ added in v0.4.0
func NewJSONReader(reader io.Reader, options JSONOptions, mem memory.Allocator) *JSONReader
NewJSONReader creates a new JSON reader with the specified options.
type JSONWriter ¶ added in v0.4.0
type JSONWriter struct {
// contains filtered or unexported fields
}
JSONWriter writes DataFrames to JSON format.
func NewJSONWriter ¶ added in v0.4.0
func NewJSONWriter(writer io.Writer, options JSONOptions) *JSONWriter
NewJSONWriter creates a new JSON writer with the specified options.
type ParquetOptions ¶
type ParquetOptions struct {
// Compression type for Parquet files (snappy, gzip, lz4, zstd, uncompressed)
Compression string
// BatchSize for reading/writing operations
BatchSize int
// ColumnsToRead for selective column reading (nil reads all columns)
ColumnsToRead []string
// ParallelDecoding enables parallel decoding for better performance
ParallelDecoding bool
// RowGroupSize specifies the target size for row groups in rows
RowGroupSize int64
// PageSize specifies the target size for pages in bytes
PageSize int64
// EnableDict enables dictionary encoding for string columns
EnableDict bool
}
ParquetOptions contains configuration options for Parquet operations.
func DefaultParquetOptions ¶
func DefaultParquetOptions() ParquetOptions
DefaultParquetOptions returns default Parquet options.
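As an illustration, a sketch of options for compressed output with dictionary encoding and a larger row-group size (dfio is an assumed import alias for this package):

// parquetWriteOptions configures compressed Parquet output.
func parquetWriteOptions() dfio.ParquetOptions {
	opts := dfio.DefaultParquetOptions()
	opts.Compression = "zstd"  // one of: snappy, gzip, lz4, zstd, uncompressed
	opts.EnableDict = true     // dictionary-encode string columns
	opts.RowGroupSize = 500000 // rows per row group (default: DefaultRowGroupSize)
	return opts
}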
type ParquetReader ¶ added in v0.2.0
type ParquetReader struct {
// contains filtered or unexported fields
}
ParquetReader reads Parquet data and converts it to DataFrames.
func NewParquetReader ¶ added in v0.2.0
func NewParquetReader(reader io.Reader, options ParquetOptions, mem memory.Allocator) *ParquetReader
NewParquetReader creates a new Parquet reader with the specified options.
type ParquetWriter ¶ added in v0.2.0
type ParquetWriter struct {
// contains filtered or unexported fields
}
ParquetWriter writes DataFrames to Parquet format.
func NewParquetWriter ¶ added in v0.2.0
func NewParquetWriter(writer io.Writer, options ParquetOptions) *ParquetWriter
NewParquetWriter creates a new Parquet writer with the specified options.
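Putting the Parquet reader and writer together, a sketch of an in-memory round trip that reads back only selected columns (dfio and dataframe are assumed import aliases, the column names are hypothetical, and releasing the returned DataFrame remains the caller's responsibility):

// roundTrip writes df to an in-memory Parquet buffer, then reads
// back only the "id" and "score" columns.
func roundTrip(df *dataframe.DataFrame, mem memory.Allocator) (*dataframe.DataFrame, error) {
	var buf bytes.Buffer

	w := dfio.NewParquetWriter(&buf, dfio.DefaultParquetOptions())
	if err := w.Write(df); err != nil {
		return nil, err
	}

	opts := dfio.DefaultParquetOptions()
	opts.ColumnsToRead = []string{"id", "score"} // hypothetical column names

	r := dfio.NewParquetReader(bytes.NewReader(buf.Bytes()), opts, mem)
	return r.Read()
}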