Documentation ¶
Overview ¶
Package csv provides comprehensive CSV file handling for the GoPCA toolkit. It includes parsing, validation, and writing functionality with built-in security measures and support for various CSV formats.
Features ¶
The package supports:
- Multiple delimiters (comma, semicolon, tab)
- Different decimal separators (period, comma)
- Automatic column type detection
- Missing value handling
- Large file streaming
- Security validation against malicious inputs
Security ¶
All file operations include security validations:
- Path traversal prevention
- File size limits (500MB default)
- Field length limits (10,000 characters)
- Row and column count limits
Parse Modes ¶
The package supports four parsing modes:
- ParseNumeric: All data as floating-point numbers (for PCA)
- ParseString: All data as strings (for editing)
- ParseMixed: Automatic type detection
- ParseMixedWithTargets: Type detection with target column identification
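To make the idea of automatic type detection concrete, here is a minimal sketch of one way a column could be classified. The function detectColumnType is hypothetical and is not taken from the package; the labels "numeric", "categorical", and "mixed" are borrowed from the ColumnStatistics.DataType field, and the rule used (try strconv.ParseFloat on every non-missing value) is an assumption.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// detectColumnType classifies a column by trying to parse each non-missing
// value as a float64. Hypothetical helper for illustration only.
func detectColumnType(values []string, nullValues []string) string {
	isNull := make(map[string]bool, len(nullValues))
	for _, n := range nullValues {
		isNull[n] = true
	}

	numeric, text := 0, 0
	for _, v := range values {
		v = strings.TrimSpace(v)
		if isNull[v] {
			continue // missing values do not influence the decision
		}
		if _, err := strconv.ParseFloat(v, 64); err == nil {
			numeric++
		} else {
			text++
		}
	}

	switch {
	case text == 0:
		return "numeric" // also the result when every value is missing
	case numeric == 0:
		return "categorical"
	default:
		return "mixed"
	}
}

func main() {
	nulls := []string{"NA", ""}
	fmt.Println(detectColumnType([]string{"1.5", "2", "NA"}, nulls))
	fmt.Println(detectColumnType([]string{"red", "blue"}, nulls))
	fmt.Println(detectColumnType([]string{"1", "blue"}, nulls))
}
```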
Usage ¶
Basic usage:
opts := csv.DefaultOptions()
data, err := csv.ParseFile("data.csv", opts)
European format:
opts := csv.EuropeanOptions()
data, err := csv.ParseFile("data.csv", opts)
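For readers unfamiliar with the European format, the following standalone sketch shows what a semicolon delimiter combined with a decimal comma means in practice. It uses only the standard library's encoding/csv and strconv; parseEuropean is a hypothetical function and does not reflect how this package parses files internally.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// parseEuropean reads semicolon-delimited, header-less text whose numbers
// use a decimal comma. Hypothetical helper for illustration only.
func parseEuropean(input string) ([][]float64, error) {
	r := csv.NewReader(strings.NewReader(input))
	r.Comma = ';' // field delimiter

	records, err := r.ReadAll()
	if err != nil {
		return nil, err
	}

	out := make([][]float64, 0, len(records))
	for i, rec := range records {
		row := make([]float64, len(rec))
		for j, field := range rec {
			// Swap the decimal comma for a period before parsing.
			normalized := strings.Replace(strings.TrimSpace(field), ",", ".", 1)
			v, err := strconv.ParseFloat(normalized, 64)
			if err != nil {
				return nil, fmt.Errorf("row %d, column %d: %w", i+1, j+1, err)
			}
			row[j] = v
		}
		out = append(out, row)
	}
	return out, nil
}

func main() {
	rows, err := parseEuropean("1,5;2,25\n3;4,0\n")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(rows)
}
```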
Performance ¶
The package is optimized for both small and large datasets. Streaming mode is available for files that exceed memory constraints.
Package csv provides unified CSV parsing, writing, and validation functionality for the GoPCA monorepo. It consolidates previously scattered CSV operations into a single, well-tested package following the DRY principle.
Index ¶
- Constants
- func AnalyzeMissingValues(data *Data) map[string]interface{}
- func ConvertToPCAOutputData(result *types.PCAResult, data *Data, preprocessedData types.Matrix, ...) *types.PCAOutputData
- func ConvertToPCAOutputDataWithMetadata(result *types.PCAResult, data *Data, preprocessedData types.Matrix, ...) *types.PCAOutputData
- func Save(w io.Writer, data *Data, opts Options) error
- func SaveFile(filename string, data *Data, opts Options) error
- func SaveMatrix(filename string, matrix types.Matrix, headers []string, rowNames []string, ...) error
- func ToNumericMatrix(stringData [][]string, nullValues []string) (types.Matrix, [][]bool, error)
- func ToStringMatrix(matrix types.Matrix, precision int) [][]string
- func ValidateStructure(data *Data) error
- type CSVWritable
- type ColumnStatistics
- type Data
- type DataProvider
- type ExportMetadata
- type Options
- type ParseMode
- type Reader
- type ValidationResult
- type Validator
- type Writer
- func (w *Writer) Write(output io.Writer, data *Data) error
- func (w *Writer) WriteFile(filename string, data *Data) error
- func (w *Writer) WriteMatrix(output io.Writer, matrix types.Matrix, headers []string, rowNames []string) error
- func (w *Writer) WriteMatrixFile(filename string, matrix types.Matrix, headers []string, rowNames []string) error
Constants ¶
const (
	MaxFileSize    = security.MaxFileSize    // Use security module's limit
	MaxFieldLength = security.MaxFieldLength // Use security module's limit
)
Security limits for CSV parsing
Functions ¶
func AnalyzeMissingValues ¶
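func AnalyzeMissingValues(data *Data) map[string]interface{}
AnalyzeMissingValues analyzes missing-value patterns in data and returns the findings as a map.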
func ConvertToPCAOutputData ¶ added in v0.9.10
func ConvertToPCAOutputData(result *types.PCAResult, data *Data, preprocessedData types.Matrix, includeMetrics bool, config types.PCAConfig, preprocessor *core.Preprocessor, categoricalData map[string][]string, targetData map[string][]float64) *types.PCAOutputData
ConvertToPCAOutputData converts PCAResult and Data to PCAOutputData for export. This function is shared between the CLI and Desktop applications.
func ConvertToPCAOutputDataWithMetadata ¶ added in v0.9.13
func ConvertToPCAOutputDataWithMetadata(result *types.PCAResult, data *Data, preprocessedData types.Matrix, includeMetrics bool, config types.PCAConfig, preprocessor *core.Preprocessor, categoricalData map[string][]string, targetData map[string][]float64, exportMeta *ExportMetadata) *types.PCAOutputData
ConvertToPCAOutputDataWithMetadata converts PCAResult and Data to PCAOutputData with optional metadata
func SaveMatrix ¶
func SaveMatrix(filename string, matrix types.Matrix, headers []string, rowNames []string, opts Options) error
SaveMatrix is a convenience function for writing a matrix to CSV
func ToNumericMatrix ¶
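func ToNumericMatrix(stringData [][]string, nullValues []string) (types.Matrix, [][]bool, error)
ToNumericMatrix converts string data to a numeric matrix with missing-value tracking. It parses each string value as a float64 and marks a cell as missing when its value appears in the provided null value list. It returns the numeric matrix, the missing-value mask (true = missing), and any parse error.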
func ToStringMatrix ¶
func ToStringMatrix(matrix types.Matrix, precision int) [][]string
ToStringMatrix converts a numeric matrix to its string representation. The precision parameter controls the number of decimal places (-1 for automatic). NaN and Inf values are converted to appropriate string representations.
func ValidateStructure ¶
func ValidateStructure(data *Data) error
ValidateStructure performs basic structural validation on data and returns an error if a check fails.
Types ¶
type CSVWritable ¶
type CSVWritable interface {
DataProvider
WriteHeaders(w io.Writer, opts Options) error
WriteRow(w io.Writer, index int, opts Options) error
}
CSVWritable is an interface for data that can be written to CSV
type ColumnStatistics ¶
type ColumnStatistics struct {
Name string
Index int
DataType string // "numeric", "categorical", "mixed"
NonMissing int
Missing int
MissingPercent float64
Mean float64 // For numeric columns
StdDev float64 // For numeric columns
Min float64 // For numeric columns
Max float64 // For numeric columns
UniqueValues int // For categorical columns
HasZeroVariance bool // Warning flag
}
ColumnStatistics contains statistics for a single column
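The numeric fields above can be derived from a column and its missing-value mask roughly as follows. This is a sketch under stated assumptions, not the Validator's actual computation: columnSummary and summarize are hypothetical names, MissingPercent is assumed to be on a 0 to 100 scale, and StdDev is assumed to be the sample standard deviation (n-1 denominator).

```go
package main

import (
	"fmt"
	"math"
)

// columnSummary mirrors a subset of the ColumnStatistics fields.
// Hypothetical type for illustration only.
type columnSummary struct {
	NonMissing      int
	Missing         int
	MissingPercent  float64
	Mean            float64
	StdDev          float64
	Min             float64
	Max             float64
	HasZeroVariance bool
}

// summarize computes statistics over the non-missing values of one column.
// values and missing are assumed to have the same length.
func summarize(values []float64, missing []bool) columnSummary {
	var s columnSummary
	sum := 0.0
	first := true
	for i, v := range values {
		if missing[i] {
			s.Missing++
			continue
		}
		s.NonMissing++
		sum += v
		if first || v < s.Min {
			s.Min = v
		}
		if first || v > s.Max {
			s.Max = v
		}
		first = false
	}
	if n := len(values); n > 0 {
		s.MissingPercent = 100 * float64(s.Missing) / float64(n)
	}
	if s.NonMissing == 0 {
		return s
	}
	s.Mean = sum / float64(s.NonMissing)
	if s.NonMissing > 1 {
		ss := 0.0
		for i, v := range values {
			if !missing[i] {
				d := v - s.Mean
				ss += d * d
			}
		}
		s.StdDev = math.Sqrt(ss / float64(s.NonMissing-1))
	}
	s.HasZeroVariance = s.StdDev == 0
	return s
}

func main() {
	s := summarize([]float64{2, 4, 0, 6}, []bool{false, false, true, false})
	fmt.Printf("%+v\n", s)
}
```

A zero-variance column is worth flagging because scaling steps in PCA preprocessing typically divide by the standard deviation.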
type Data ¶
type Data struct {
// Core numeric data (always present for PCA)
Matrix types.Matrix // Numeric data matrix
Headers []string // Column names
RowNames []string // Row names
MissingMask [][]bool // Track missing values (true = missing)
Rows int // Number of data rows
Columns int // Number of data columns
// Additional data types (optional)
StringData [][]string // Raw string data (for GoCSV)
CategoricalColumns map[string][]string // Categorical columns by name
NumericTargetColumns map[string][]float64 // Numeric target columns
}
Data represents parsed CSV data with support for different data types
func (*Data) GetMissingValueInfo ¶ added in v0.9.10
func (d *Data) GetMissingValueInfo(selectedColumns []int) *types.MissingValueInfo
GetMissingValueInfo returns information about missing values in selected columns
type DataProvider ¶
type DataProvider interface {
GetHeaders() []string
GetRowNames() []string
GetDimensions() (rows, cols int)
HasNumericData() bool
HasStringData() bool
}
DataProvider is an interface that different data representations can implement to provide consistent access to CSV data regardless of internal structure
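A minimal sketch of a type satisfying this interface is shown below. The interface is redeclared locally so the example compiles on its own; stringTable and describe are hypothetical names that do not exist in the package.

```go
package main

import "fmt"

// DataProvider is redeclared here with the same method set as above so that
// the sketch is self-contained.
type DataProvider interface {
	GetHeaders() []string
	GetRowNames() []string
	GetDimensions() (rows, cols int)
	HasNumericData() bool
	HasStringData() bool
}

// stringTable is a hypothetical string-only data representation.
type stringTable struct {
	headers  []string
	rowNames []string
	cells    [][]string
}

func (t *stringTable) GetHeaders() []string  { return t.headers }
func (t *stringTable) GetRowNames() []string { return t.rowNames }
func (t *stringTable) GetDimensions() (rows, cols int) {
	return len(t.cells), len(t.headers)
}
func (t *stringTable) HasNumericData() bool { return false }
func (t *stringTable) HasStringData() bool  { return len(t.cells) > 0 }

// Compile-time check that *stringTable implements DataProvider.
var _ DataProvider = (*stringTable)(nil)

// describe works with any DataProvider, regardless of internal structure.
func describe(p DataProvider) string {
	rows, cols := p.GetDimensions()
	return fmt.Sprintf("%d rows x %d columns", rows, cols)
}

func main() {
	t := &stringTable{
		headers: []string{"a", "b"},
		cells:   [][]string{{"1", "2"}, {"3", "4"}, {"5", "6"}},
	}
	fmt.Println(describe(t))
}
```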
type ExportMetadata ¶ added in v0.9.13
type ExportMetadata struct {
InputFilename string // Original input file name
Description string // User-provided description
Tags []string // User-defined tags
}
ExportMetadata contains optional metadata for PCA export
type Options ¶
type Options struct {
// Parsing options
Delimiter rune // Field delimiter: ',', ';', '\t'
DecimalSeparator rune // Decimal separator: '.', ','
HasHeaders bool // First row contains column names
HasRowNames bool // First column contains row names
NullValues []string // Strings to treat as missing values
ParseMode ParseMode // How to parse the data
TargetSuffix string // Suffix to identify target columns (e.g., "#target")
TargetCols []string // Explicit list of target column names to exclude from PCA
// Reading options (for large files)
SkipRows int // Number of rows to skip at start
MaxRows int // Maximum rows to read (0 for all)
Columns []int // Specific columns to read (empty for all)
StreamingMode bool // Enable streaming for large files
// Writing options
FloatFormat byte // Format for float output: 'g', 'f', 'e'
Precision int // Decimal precision for float output (-1 for auto)
}
Options provides unified configuration for CSV operations
func DefaultOptions ¶
func DefaultOptions() Options
DefaultOptions returns sensible default options for CSV operations
func EuropeanOptions ¶
func EuropeanOptions() Options
EuropeanOptions returns options for European CSV format (semicolon delimiter, comma decimal). This is commonly used in countries where comma is the decimal separator.
func TabDelimitedOptions ¶
func TabDelimitedOptions() Options
TabDelimitedOptions returns options for tab-delimited files
type Reader ¶
type Reader struct {
// contains filtered or unexported fields
}
Reader provides unified CSV reading functionality
type ValidationResult ¶
type ValidationResult struct {
Valid bool
Errors []string
Warnings []string
ColumnStats []ColumnStatistics
}
ValidationResult contains the results of CSV validation
func ValidateFile ¶
func ValidateFile(filename string, opts Options) (*ValidationResult, error)
ValidateFile validates a CSV file
type Validator ¶
type Validator struct {
// contains filtered or unexported fields
}
Validator provides CSV validation functionality
func NewValidator ¶
NewValidator creates a new CSV validator with the given options
func (*Validator) Validate ¶
func (v *Validator) Validate(data *Data) *ValidationResult
Validate performs comprehensive validation on CSV data
type Writer ¶
type Writer struct {
// contains filtered or unexported fields
}
Writer provides unified CSV writing functionality