dataquality

package
v1.3.1
Published: Apr 24, 2026 | License: MIT | Imports: 6 | Imported by: 0

Documentation

Overview

Package dataquality provides data quality analysis and missing-value handling for tabular datasets, independent of any UI framework such as Wails.

The package operates on plain [][]string data matrices and column metadata maps to avoid circular dependencies with UI-layer packages. Callers convert their domain types (e.g. FileData) into the AnalysisInput struct before calling package functions.
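The conversion step described above might look like the following sketch. AnalysisInput is redeclared locally only so the example compiles on its own, and the FileData type with its Headers, Records, and Kinds fields is hypothetical. Whether Rows and Columns must be set by the caller or are derived by the package is not stated here, so the sketch sets them explicitly.

```go
package main

import "fmt"

// AnalysisInput mirrors the struct documented in this package. It is
// redeclared here only to keep the sketch self-contained.
type AnalysisInput struct {
	Data        [][]string
	Headers     []string
	ColumnTypes map[string]string // "numeric", "categorical", "target"
	RowNames    []string
	Rows        int
	Columns     int
}

// FileData is a hypothetical caller-side domain type, not part of the package.
type FileData struct {
	Headers []string
	Records [][]string
	Kinds   map[string]string
}

// toAnalysisInput copies the plain fields out of the domain type.
// Setting Rows and Columns from the slice lengths is an assumption.
func toAnalysisInput(fd FileData) AnalysisInput {
	return AnalysisInput{
		Data:        fd.Records,
		Headers:     fd.Headers,
		ColumnTypes: fd.Kinds,
		Rows:        len(fd.Records),
		Columns:     len(fd.Headers),
	}
}

func main() {
	fd := FileData{
		Headers: []string{"age", "city"},
		Records: [][]string{{"31", "Oslo"}, {"", "Lima"}},
		Kinds:   map[string]string{"age": "numeric", "city": "categorical"},
	}
	in := toAnalysisInput(fd)
	fmt.Println(in.Rows, in.Columns)
}
```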

Key capabilities:

  • Missing value detection, statistics, and fill strategies (mean, median, mode, forward-fill, backward-fill, custom value)
  • Per-column statistics (mean, median, standard deviation, percentiles, skewness, kurtosis, categorical frequency)
  • Distribution analysis and histogram generation
  • Outlier detection via IQR and Z-score methods
  • Pairwise Pearson correlation matrix
  • Data quality scoring and actionable issue/recommendation generation
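As an illustration of the pairwise correlation capability listed above, here is a minimal sketch of the Pearson coefficient for two numeric columns. The function name pearson and its handling of constant columns are assumptions made for this example; the package's own implementation may differ.

```go
package main

import (
	"fmt"
	"math"
)

// pearson returns the Pearson correlation coefficient of x and y.
// ok is false when the inputs differ in length, have fewer than two
// values, or when either column has zero variance.
func pearson(x, y []float64) (r float64, ok bool) {
	n := len(x)
	if n != len(y) || n < 2 {
		return 0, false
	}
	var meanX, meanY float64
	for i := range x {
		meanX += x[i]
		meanY += y[i]
	}
	meanX /= float64(n)
	meanY /= float64(n)

	var sxy, sxx, syy float64
	for i := range x {
		dx, dy := x[i]-meanX, y[i]-meanY
		sxy += dx * dy
		sxx += dx * dx
		syy += dy * dy
	}
	if sxx == 0 || syy == 0 {
		return 0, false // correlation is undefined for a constant column
	}
	return sxy / math.Sqrt(sxx*syy), true
}

func main() {
	r, ok := pearson([]float64{1, 2, 3, 4}, []float64{2, 4, 6, 8})
	fmt.Println(r, ok)
}
```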

Constants

This section is empty.

Variables

This section is empty.

Functions

func Fill

func Fill(data [][]string, headers []string, columnTypes map[string]string, req FillRequest) ([][]string, error)

Fill applies a fill strategy to a deep copy of data and returns the new matrix. The original data is never modified.

Valid req.Strategy values are "mean", "median", "mode", "forward", "backward", and "custom". With "custom", req.Value supplies the fill value. If req.Column is empty, all columns are processed; otherwise only the named column is filled.
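To make the copy-then-fill behavior concrete, the following sketch applies a forward fill to one column of a deep copy. It is not the package's implementation: the helper names are hypothetical, and treating an empty or whitespace-only cell as missing is an assumption, since the set of tokens Fill recognizes as missing is not documented here.

```go
package main

import (
	"fmt"
	"strings"
)

// isMissing reports whether a cell counts as missing. Treating only
// empty or whitespace-only cells as missing is an assumption.
func isMissing(s string) bool {
	return strings.TrimSpace(s) == ""
}

// forwardFill returns a deep copy of data in which each missing cell in
// column col takes the most recent non-missing value above it. Leading
// missing cells have nothing to copy from and are left unchanged.
func forwardFill(data [][]string, col int) [][]string {
	out := make([][]string, len(data))
	for i, row := range data {
		out[i] = append([]string(nil), row...) // copy each row
	}
	last, have := "", false
	for _, row := range out {
		if col < 0 || col >= len(row) {
			continue // ragged row: nothing to fill
		}
		if isMissing(row[col]) {
			if have {
				row[col] = last
			}
			continue
		}
		last, have = row[col], true
	}
	return out
}

func main() {
	data := [][]string{{"1"}, {""}, {"3"}, {""}}
	filled := forwardFill(data, 0)
	fmt.Println(filled) // the filled copy
	fmt.Println(data)   // the input, still containing empty cells
}
```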

Types

type AnalysisInput

type AnalysisInput struct {
	Data        [][]string
	Headers     []string
	ColumnTypes map[string]string // "numeric", "categorical", "target"
	RowNames    []string
	Rows        int
	Columns     int
}

AnalysisInput carries all data needed for quality analysis. Callers populate this from their own domain type (e.g. FileData) so that this package remains independent of UI/Wails types.

type ColumnAnalysis

type ColumnAnalysis struct {
	Name         string           `json:"name"`
	Type         string           `json:"type"` // "numeric", "categorical", "target"
	Stats        ColumnStatistics `json:"stats"`
	Distribution DistributionInfo `json:"distribution"`
	Outliers     []OutlierInfo    `json:"outliers"`
	QualityScore float64          `json:"qualityScore"`
}

ColumnAnalysis contains detailed analysis for a single column.

type ColumnMissing

type ColumnMissing struct {
	Name           string  `json:"name"`
	TotalValues    int     `json:"totalValues"`
	MissingValues  int     `json:"missingValues"`
	MissingPercent float64 `json:"missingPercent"`
	Pattern        string  `json:"pattern"` // "none", "random", "systematic", "top", "bottom"
}

ColumnMissing contains missing-value statistics for one column.

type ColumnStatistics

type ColumnStatistics struct {
	Count          int            `json:"count"`
	Missing        int            `json:"missing"`
	MissingPercent float64        `json:"missingPercent"`
	Unique         int            `json:"unique"`
	Mean           *float64       `json:"mean,omitempty"`
	Median         *float64       `json:"median,omitempty"`
	Mode           *string        `json:"mode,omitempty"`
	StdDev         *float64       `json:"stdDev,omitempty"`
	Min            *float64       `json:"min,omitempty"`
	Max            *float64       `json:"max,omitempty"`
	Q1             *float64       `json:"q1,omitempty"`
	Q3             *float64       `json:"q3,omitempty"`
	IQR            *float64       `json:"iqr,omitempty"`
	Skewness       *float64       `json:"skewness,omitempty"`
	Kurtosis       *float64       `json:"kurtosis,omitempty"`
	Categories     map[string]int `json:"categories,omitempty"`
}

ColumnStatistics contains statistical measures for a column.
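The pointer fields combined with omitempty mean that a nil statistic is dropped from the JSON output rather than serialized as zero. The sketch below shows this with a trimmed local copy of the struct (only Count and Mean). Which fields the package leaves nil for a given column type is an assumption not confirmed by this page.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// columnStats is a trimmed local copy of ColumnStatistics with the same
// JSON tags, kept small for illustration.
type columnStats struct {
	Count int      `json:"count"`
	Mean  *float64 `json:"mean,omitempty"`
}

// encode marshals s to a JSON string, returning "" on error.
func encode(s columnStats) string {
	b, err := json.Marshal(s)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	mean := 2.5
	fmt.Println(encode(columnStats{Count: 3, Mean: &mean})) // {"count":3,"mean":2.5}
	fmt.Println(encode(columnStats{Count: 3}))              // {"count":3}
}
```

A consumer of the JSON should therefore test for the presence of a key such as "mean" instead of comparing its value with zero.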

type DataProfile

type DataProfile struct {
	Rows               int     `json:"rows"`
	Columns            int     `json:"columns"`
	NumericColumns     int     `json:"numericColumns"`
	CategoricalColumns int     `json:"categoricalColumns"`
	TargetColumns      int     `json:"targetColumns"`
	MissingPercent     float64 `json:"missingPercent"`
	DuplicateRows      int     `json:"duplicateRows"`
	MemorySize         string  `json:"memorySize"`
}

DataProfile contains overall dataset-level statistics.

type DataQualityReport

type DataQualityReport struct {
	DataProfile     DataProfile      `json:"dataProfile"`
	ColumnAnalysis  []ColumnAnalysis `json:"columnAnalysis"`
	QualityScore    float64          `json:"qualityScore"`
	Issues          []QualityIssue   `json:"issues"`
	Recommendations []Recommendation `json:"recommendations"`
}

DataQualityReport is the top-level result of a full data quality analysis.

func AnalyzeDataQuality

func AnalyzeDataQuality(in AnalysisInput) (*DataQualityReport, error)

AnalyzeDataQuality performs comprehensive data quality analysis on the given input and returns a DataQualityReport. It returns an error if the input is empty.

type DistributionInfo

type DistributionInfo struct {
	Histogram       []HistogramBin `json:"histogram,omitempty"`
	IsNormal        bool           `json:"isNormal"`
	NormalityPValue float64        `json:"normalityPValue,omitempty"`
	DistType        string         `json:"distType"` // "normal", "right-skewed", "left-skewed", "bimodal", "unknown"
}

DistributionInfo describes the distribution shape of a numeric column.

type FillRequest

type FillRequest struct {
	Strategy string // "mean", "median", "mode", "forward", "backward", "custom"
	Column   string // Column name, or empty string to process all columns
	Value    string // Custom fill value (used when Strategy == "custom")
}

FillRequest describes a missing-value fill operation.

type HistogramBin

type HistogramBin struct {
	Min   float64 `json:"min"`
	Max   float64 `json:"max"`
	Count int     `json:"count"`
}

HistogramBin represents one bin in a histogram.
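One way to produce such bins is equal-width binning, sketched below with a local copy of HistogramBin. The function name, the choice of equal-width bins, and the rule that the maximum value falls into the last bin are all assumptions; the package may choose bin counts and edges differently.

```go
package main

import "fmt"

// HistogramBin mirrors the documented struct so the sketch is self-contained.
type HistogramBin struct {
	Min   float64 `json:"min"`
	Max   float64 `json:"max"`
	Count int     `json:"count"`
}

// histogram splits [min, max] of values into bins equal-width intervals
// and counts the values in each. The maximum value is placed in the
// last bin.
func histogram(values []float64, bins int) []HistogramBin {
	if len(values) == 0 || bins <= 0 {
		return nil
	}
	lo, hi := values[0], values[0]
	for _, v := range values {
		if v < lo {
			lo = v
		}
		if v > hi {
			hi = v
		}
	}
	if lo == hi {
		// All values are identical: a single degenerate bin.
		return []HistogramBin{{Min: lo, Max: hi, Count: len(values)}}
	}
	width := (hi - lo) / float64(bins)
	out := make([]HistogramBin, bins)
	for i := range out {
		out[i].Min = lo + float64(i)*width
		out[i].Max = lo + float64(i+1)*width
	}
	for _, v := range values {
		i := int((v - lo) / width)
		if i >= bins {
			i = bins - 1 // clamp the maximum into the last bin
		}
		out[i].Count++
	}
	return out
}

func main() {
	for _, b := range histogram([]float64{0, 1, 2, 3, 4, 5, 6, 7, 8, 10}, 2) {
		fmt.Printf("[%g, %g]: %d\n", b.Min, b.Max, b.Count)
	}
}
```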

type MissingValueStats

type MissingValueStats struct {
	TotalCells     int                       `json:"totalCells"`
	MissingCells   int                       `json:"missingCells"`
	MissingPercent float64                   `json:"missingPercent"`
	ColumnStats    map[string]*ColumnMissing `json:"columnStats"`
	RowStats       map[int]*RowMissing       `json:"rowStats"`
}

MissingValueStats contains missing-value statistics for an entire dataset.

func AnalyzeMissing

func AnalyzeMissing(data [][]string, headers []string) *MissingValueStats

AnalyzeMissing returns missing-value statistics for the given data matrix and column headers. An empty or nil data slice returns a zeroed stats struct.
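The dataset-level counts can be pictured with the sketch below. It is not the package's code: the function name is hypothetical, and three choices are assumptions made for illustration (a cell is missing when it is empty or whitespace-only, a cell absent from a short row also counts as missing, and the percentage is on a 0 to 100 scale).

```go
package main

import (
	"fmt"
	"strings"
)

// countMissing scans a data matrix that is expected to have the given
// number of columns and returns the total cell count, the number of
// missing cells, and the missing share as a percentage (0 to 100).
func countMissing(data [][]string, columns int) (total, missing int, percent float64) {
	total = len(data) * columns
	for _, row := range data {
		for c := 0; c < columns; c++ {
			// Cells beyond the end of a short row count as missing.
			if c >= len(row) || strings.TrimSpace(row[c]) == "" {
				missing++
			}
		}
	}
	if total > 0 {
		percent = 100 * float64(missing) / float64(total)
	}
	return total, missing, percent
}

func main() {
	data := [][]string{{"1", ""}, {"", "x"}, {"3"}}
	total, missing, percent := countMissing(data, 2)
	fmt.Println(total, missing, percent)
}
```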

type OutlierInfo

type OutlierInfo struct {
	RowIndex int     `json:"rowIndex"`
	Value    string  `json:"value"`
	Method   string  `json:"method"` // "iqr" or "zscore"
	Score    float64 `json:"score"`
}

OutlierInfo describes one detected outlier value.
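For the "iqr" method, a common formulation flags values outside the range from Q1 - 1.5*IQR to Q3 + 1.5*IQR. The sketch below follows that convention and computes quartiles by linear interpolation. Both the 1.5 multiplier and the quantile rule are assumptions, as are the function names; the package's thresholds and the meaning of its Score field are not documented on this page.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// quantile returns the p-th quantile (0 <= p <= 1) of a non-empty,
// ascending slice using linear interpolation between neighbors.
func quantile(sorted []float64, p float64) float64 {
	pos := p * float64(len(sorted)-1)
	lo := int(math.Floor(pos))
	hi := int(math.Ceil(pos))
	return sorted[lo] + (pos-float64(lo))*(sorted[hi]-sorted[lo])
}

// iqrOutliers returns the indices of values lying more than 1.5*IQR
// below Q1 or above Q3. Inputs with fewer than four values return nil.
func iqrOutliers(values []float64) []int {
	if len(values) < 4 {
		return nil
	}
	sorted := append([]float64(nil), values...) // keep the input order intact
	sort.Float64s(sorted)
	q1, q3 := quantile(sorted, 0.25), quantile(sorted, 0.75)
	iqr := q3 - q1
	lower, upper := q1-1.5*iqr, q3+1.5*iqr

	var idx []int
	for i, v := range values {
		if v < lower || v > upper {
			idx = append(idx, i)
		}
	}
	return idx
}

func main() {
	fmt.Println(iqrOutliers([]float64{10, 12, 11, 13, 12, 100, 11}))
}
```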

type QualityIssue

type QualityIssue struct {
	Severity    string   `json:"severity"` // "error", "warning", "info"
	Category    string   `json:"category"` // "missing", "outlier", "duplicate", "correlation", "variance", "distribution"
	Description string   `json:"description"`
	Affected    []string `json:"affected"`
	Impact      string   `json:"impact"`
}

QualityIssue describes a detected data quality problem.

type Recommendation

type Recommendation struct {
	Priority    string   `json:"priority"` // "high", "medium", "low"
	Category    string   `json:"category"`
	Action      string   `json:"action"`
	Description string   `json:"description"`
	Columns     []string `json:"columns,omitempty"`
}

Recommendation is an actionable suggestion derived from the quality analysis.

type RowMissing

type RowMissing struct {
	Index          int     `json:"index"`
	TotalValues    int     `json:"totalValues"`
	MissingValues  int     `json:"missingValues"`
	MissingPercent float64 `json:"missingPercent"`
}

RowMissing contains missing-value statistics for one row.
