synth

package
v0.11.0 Latest
Warning

This package is not in the latest version of its module.

Published: May 27, 2026 License: MIT Imports: 17 Imported by: 0

Documentation

Overview

Package synth produces synthetic .pulse cohorts from either a schema declaration ("from-schema") or a statistical profile of a real cohort ("from-profile"). The generator is deterministic given a seed and writes directly into the .pulse binary format using the encoding package, so outputs are byte-identical for the same (spec, seed) pair.
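The determinism guarantee can be illustrated with a minimal sketch. It does not use the synth or encoding packages: generate is a hypothetical stand-in that draws every value from a math/rand source built from the seed and serializes the draws with encoding/binary, on the assumption that the real generator likewise derives all randomness from the seed.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"math/rand"
)

// generate is a hypothetical stand-in for a seeded generator. Every random
// draw comes from a source initialized with seed, so the returned bytes
// depend only on (rows, seed).
func generate(rows int, seed int64) []byte {
	rng := rand.New(rand.NewSource(seed))
	var buf bytes.Buffer
	for i := 0; i < rows; i++ {
		// A fixed-width little-endian encoding keeps the layout stable.
		// Writing a float64 into a bytes.Buffer cannot fail.
		_ = binary.Write(&buf, binary.LittleEndian, rng.Float64())
	}
	return buf.Bytes()
}

func main() {
	a := generate(1000, 42)
	b := generate(1000, 42)
	c := generate(1000, 43)
	fmt.Println("same seed, identical bytes:", bytes.Equal(a, b))
	fmt.Println("different seed, identical bytes:", bytes.Equal(a, c))
}
```

Anything that reads the wall clock, iterates a map, or uses the global random source inside the generation loop would break this property, which is why the sketch threads a single seeded source through every draw.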

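The top-level entry points exported by this package are:

Synth(fs afero.Fs, spec *Spec, output string, opts Options) (*Result, error)
SynthBytes(spec *Spec, opts Options) ([]byte, *Result, error)
ProfileFile(fs afero.Fs, path string, opts ProfileOptions) (*Profile, error)
ProfileBytes(data []byte, opts ProfileOptions) (*Profile, error)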

Pulse embedders should use the higher-level pulse.Pulse.Synth and pulse.Pulse.Profile facade methods instead.

Constants

const (
	DistUniform             = "uniform"
	DistNormal              = "normal"
	DistLogNormal           = "lognormal"
	DistExponential         = "exponential"
	DistPoisson             = "poisson"
	DistPareto              = "pareto"
	DistBernoulli           = "bernoulli"
	DistMonotonicFrom       = "monotonic_from"
	DistWeightedCategorical = "weighted_categorical"
	DistUniformDate         = "uniform_date"
	DistRegex               = "regex"
	DistConstant            = "constant"
)

Distribution kind constants.

Variables

This section is empty.

Functions

func AllDistributions

func AllDistributions() []string

AllDistributions returns the registered kind names in sorted order. Used by the manifest and tests.
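A sorted listing of this kind is typically produced from a lookup table. The sketch below assumes a hypothetical map-based registry (the package's real registry is not shown in this documentation) and includes only four of the documented kind names for brevity.

```go
package main

import (
	"fmt"
	"sort"
)

// registry is a hypothetical stand-in for the package's internal set of
// registered distribution kinds. Only a few documented names are listed.
var registry = map[string]struct{}{
	"uniform":   {},
	"normal":    {},
	"bernoulli": {},
	"constant":  {},
}

// allKinds returns the registered names in sorted order. Sorting matters
// because Go map iteration order is unspecified.
func allKinds() []string {
	out := make([]string, 0, len(registry))
	for k := range registry {
		out = append(out, k)
	}
	sort.Strings(out)
	return out
}

// isKnown reports whether kind is present in the registry.
func isKnown(kind string) bool {
	_, ok := registry[kind]
	return ok
}

func main() {
	fmt.Println(allKinds())
	fmt.Println(isKnown("normal"), isKnown("gamma"))
}
```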

Types

type CategoricalProfile

type CategoricalProfile struct {
	Cardinality int           `json:"cardinality"`
	Top         []CategoryHit `json:"top"`
}

CategoricalProfile holds the top-K observed values and the total distinct count.

type CategoryHit

type CategoryHit struct {
	Value  string  `json:"value"`
	Weight float64 `json:"weight"`
}

CategoryHit records a categorical value with its observed weight.
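As a rough illustration of how a top-K summary with a distinct count could be computed, the sketch below counts values and keeps the k most frequent. It assumes Weight is a relative frequency in [0, 1]; the documentation only says "observed weight", so the real package may store raw counts instead. The topK helper is hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// CategoryHit mirrors the documented struct (JSON tags omitted).
type CategoryHit struct {
	Value  string
	Weight float64
}

// topK counts the values and returns the k most frequent ones together with
// the total number of distinct values. Weight is the relative frequency,
// which is an assumption about what "observed weight" means.
func topK(values []string, k int) (top []CategoryHit, cardinality int) {
	counts := make(map[string]int)
	for _, v := range values {
		counts[v]++
	}
	for v, n := range counts {
		top = append(top, CategoryHit{Value: v, Weight: float64(n) / float64(len(values))})
	}
	// Sort by weight descending and break ties by value so the result does
	// not depend on map iteration order.
	sort.Slice(top, func(i, j int) bool {
		if top[i].Weight != top[j].Weight {
			return top[i].Weight > top[j].Weight
		}
		return top[i].Value < top[j].Value
	})
	if len(top) > k {
		top = top[:k]
	}
	return top, len(counts)
}

func main() {
	top, cardinality := topK([]string{"a", "a", "a", "b", "b", "c"}, 2)
	fmt.Println(top, cardinality)
}
```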

type ConstraintSpec

type ConstraintSpec struct {
	Expr string `json:"expr"`
}

ConstraintSpec wraps an expression evaluated against the in-memory row.
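The Spec type documents that rows failing a constraint are rejected and re-drawn, with MaxRejectionRate (default 0.5) capping the rejected fraction. The sketch below shows one way such a loop could work. It is not the package's implementation: a Go predicate stands in for the expression evaluator, Row is a hypothetical two-field row, and the cap is enforced through an attempt budget, which is only one possible reading of "caps the fraction of rejected rows".

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// Row is a hypothetical in-memory row.
type Row struct {
	Age    int
	Income float64
}

// sampleWithConstraint draws rows until `rows` of them satisfy ok. It fails
// once the attempt budget implied by maxRate is exhausted.
func sampleWithConstraint(rows int, maxRate float64, seed int64, ok func(Row) bool) ([]Row, int, error) {
	if maxRate == 0 {
		maxRate = 0.5 // documented default for MaxRejectionRate
	}
	if maxRate < 0 || maxRate >= 1 {
		return nil, 0, errors.New("max rejection rate must be non-negative and below 1")
	}
	// Accepting `rows` rows within maxAttempts draws keeps the rejected
	// fraction at or below maxRate.
	maxAttempts := int(float64(rows) / (1 - maxRate))
	rng := rand.New(rand.NewSource(seed))
	out := make([]Row, 0, rows)
	rejected := 0
	for attempts := 0; len(out) < rows; attempts++ {
		if attempts >= maxAttempts {
			return nil, rejected, fmt.Errorf("rejection rate exceeded %.2f after %d draws", maxRate, attempts)
		}
		r := Row{Age: rng.Intn(100), Income: rng.ExpFloat64() * 40000}
		if ok(r) {
			out = append(out, r)
		} else {
			rejected++
		}
	}
	return out, rejected, nil
}

func main() {
	adults := func(r Row) bool { return r.Age >= 18 }
	rows, rejected, err := sampleWithConstraint(1000, 0, 42, adults)
	fmt.Println(len(rows), rejected, err)
}
```

A constraint that almost never holds makes rejection sampling very slow, so a cap of this kind turns a silent stall into an explicit error.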

type CorrelationSpec

type CorrelationSpec struct {
	A           string  `json:"a"`
	B           string  `json:"b"`
	Correlation float64 `json:"correlation"`
}

CorrelationSpec declares a target Pearson correlation between two numeric fields.

type CorrelationStat

type CorrelationStat struct {
	A   string  `json:"a"`
	B   string  `json:"b"`
	Rho float64 `json:"rho"`
}

CorrelationStat is a captured pairwise correlation entry.
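For reference, the sketch below computes a sample Pearson correlation for every pair of numeric columns and keeps the k pairs with the largest |rho|, in the spirit of the CorrelationTopK option. The pearson and strongest helpers are hypothetical, and returning 0 for a zero-variance column is a choice made for this sketch only.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// CorrelationStat mirrors the documented struct (JSON tags omitted).
type CorrelationStat struct {
	A, B string
	Rho  float64
}

// pearson returns the sample Pearson correlation of two equal-length
// slices, or 0 when either slice has zero variance.
func pearson(xs, ys []float64) float64 {
	n := float64(len(xs))
	var sx, sy float64
	for i := range xs {
		sx += xs[i]
		sy += ys[i]
	}
	mx, my := sx/n, sy/n
	var cov, vx, vy float64
	for i := range xs {
		dx, dy := xs[i]-mx, ys[i]-my
		cov += dx * dy
		vx += dx * dx
		vy += dy * dy
	}
	if vx == 0 || vy == 0 {
		return 0
	}
	return cov / math.Sqrt(vx*vy)
}

// strongest computes all pairwise correlations among the named columns and
// keeps the k entries with the largest absolute correlation.
func strongest(cols map[string][]float64, k int) []CorrelationStat {
	names := make([]string, 0, len(cols))
	for name := range cols {
		names = append(names, name)
	}
	sort.Strings(names) // fixed pair order regardless of map iteration
	var stats []CorrelationStat
	for i := 0; i < len(names); i++ {
		for j := i + 1; j < len(names); j++ {
			rho := pearson(cols[names[i]], cols[names[j]])
			stats = append(stats, CorrelationStat{A: names[i], B: names[j], Rho: rho})
		}
	}
	sort.SliceStable(stats, func(i, j int) bool {
		return math.Abs(stats[i].Rho) > math.Abs(stats[j].Rho)
	})
	if len(stats) > k {
		stats = stats[:k]
	}
	return stats
}

func main() {
	cols := map[string][]float64{
		"x": {1, 2, 3, 4},
		"y": {2, 4, 6, 8},
		"z": {1, -1, 1, -1},
	}
	fmt.Println(strongest(cols, 2))
}
```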

type DateProfile

type DateProfile struct {
	Start    string `json:"start"`
	End      string `json:"end"`
	Weekdays [7]int `json:"weekdays"`
}

DateProfile holds the (start, end) range of date values plus a weekday histogram for Mode-A reconstruction.
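A date summary of this shape can be built in a single pass. The sketch below makes two assumptions that the documentation does not confirm: dates are ISO strings in the form 2006-01-02, and the Weekdays array is indexed like Go's time.Weekday (0 = Sunday). The profileDates helper is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// DateProfile mirrors the documented struct (JSON tags omitted).
type DateProfile struct {
	Start, End string
	Weekdays   [7]int
}

const layout = "2006-01-02" // assumed date format

// profileDates records the earliest and latest date plus a histogram of
// weekdays, indexed by time.Weekday (an assumption).
func profileDates(dates []string) (DateProfile, error) {
	var p DateProfile
	var lo, hi time.Time
	for i, s := range dates {
		t, err := time.Parse(layout, s)
		if err != nil {
			return DateProfile{}, fmt.Errorf("date %q: %w", s, err)
		}
		if i == 0 || t.Before(lo) {
			lo = t
		}
		if i == 0 || t.After(hi) {
			hi = t
		}
		p.Weekdays[t.Weekday()]++
	}
	if len(dates) > 0 {
		p.Start, p.End = lo.Format(layout), hi.Format(layout)
	}
	return p, nil
}

func main() {
	p, err := profileDates([]string{"2024-01-08", "2024-01-01", "2024-01-07"})
	fmt.Println(p, err)
}
```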

type FieldProfile

type FieldProfile struct {
	Name        string              `json:"name"`
	Type        string              `json:"type"`
	Description string              `json:"description,omitempty"`
	NullRate    float64             `json:"null_rate"`
	Numeric     *NumericProfile     `json:"numeric,omitempty"`
	Categorical *CategoricalProfile `json:"categorical,omitempty"`
	Date        *DateProfile        `json:"date,omitempty"`
	// Precision/Scale carry decimal128 metadata so synth-from-profile can
	// reconstruct the original field shape.
	Precision uint8 `json:"precision,omitempty"`
	Scale     uint8 `json:"scale,omitempty"`
}

FieldProfile holds per-field summary statistics. Exactly one of Numeric, Categorical, or Date is populated based on the field's type.

type FieldSpec

type FieldSpec struct {
	Name         string         `json:"name"`
	Type         string         `json:"type"`
	Nullable     bool           `json:"nullable,omitempty"`
	Description  string         `json:"description,omitempty"`
	Distribution string         `json:"distribution"`
	Params       map[string]any `json:"params,omitempty"`

	// Precision and Scale apply to decimal128.
	Precision uint8 `json:"precision,omitempty"`
	Scale     uint8 `json:"scale,omitempty"`

	// NullRate is the per-row probability that the field will be null.
	// Only meaningful when Nullable is true; ignored otherwise.
	NullRate float64 `json:"null_rate,omitempty"`
}

FieldSpec is a single column declaration.

type NumericProfile

type NumericProfile struct {
	Min  float64 `json:"min"`
	Max  float64 `json:"max"`
	Mean float64 `json:"mean"`
	Std  float64 `json:"std"`
	// Percentiles holds {p1, p5, p25, p50, p75, p95, p99} when
	// IncludeStats was on; nil otherwise.
	Percentiles []float64 `json:"percentiles,omitempty"`
}

NumericProfile is the detail block for numeric fields.
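The following sketch shows how the fields of a numeric summary could be derived from raw values. The summarize helper is hypothetical, and two details are assumptions: the standard deviation is the population form, and percentiles use linear interpolation between closest ranks. The package may use different estimators.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// NumericProfile mirrors the documented struct (JSON tags omitted).
type NumericProfile struct {
	Min, Max, Mean, Std float64
	Percentiles         []float64
}

// levels are the percentile levels named in the documentation.
var levels = []float64{1, 5, 25, 50, 75, 95, 99}

// summarize computes min, max and mean, and adds the standard deviation and
// percentiles only when includeStats is true.
func summarize(values []float64, includeStats bool) NumericProfile {
	if len(values) == 0 {
		return NumericProfile{}
	}
	s := append([]float64(nil), values...) // sort a copy, not the input
	sort.Float64s(s)
	n := len(s)
	var sum float64
	for _, v := range s {
		sum += v
	}
	mean := sum / float64(n)
	p := NumericProfile{Min: s[0], Max: s[n-1], Mean: mean}
	if !includeStats {
		return p
	}
	var sq float64
	for _, v := range s {
		sq += (v - mean) * (v - mean)
	}
	p.Std = math.Sqrt(sq / float64(n)) // population form (assumption)
	for _, level := range levels {
		// Linear interpolation between the two closest ranks.
		pos := level * float64(n-1) / 100
		lo, hi := int(math.Floor(pos)), int(math.Ceil(pos))
		p.Percentiles = append(p.Percentiles, s[lo]+(pos-float64(lo))*(s[hi]-s[lo]))
	}
	return p
}

func main() {
	fmt.Printf("%+v\n", summarize([]float64{3, 1, 4, 1, 5, 9, 2, 6}, true))
}
```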

type Options

type Options struct {
	// Seed makes the output deterministic. Same spec + same seed must
	// produce a byte-identical .pulse file.
	Seed int64
}

Options modulates how the spec is realized.

type Profile

type Profile struct {
	RowCount int               `json:"row_count"`
	Fields   []FieldProfile    `json:"fields"`
	Pairwise []CorrelationStat `json:"pairwise,omitempty"`
	Warnings []string          `json:"warnings,omitempty"`
	Meta     map[string]any    `json:"meta,omitempty"`
}

Profile is a serialization-friendly statistical summary of a cohort. It contains everything needed to drive synth from-profile without retaining any individual rows from the source data.

func ProfileBytes

func ProfileBytes(data []byte, opts ProfileOptions) (*Profile, error)

ProfileBytes summarizes a .pulse file given its raw bytes.

func ProfileFile

func ProfileFile(fs afero.Fs, path string, opts ProfileOptions) (*Profile, error)

ProfileFile reads a .pulse file from fs and produces a Profile.

func (*Profile) MarshalJSON

func (p *Profile) MarshalJSON() ([]byte, error)

MarshalJSON serializes the profile.

type ProfileOptions

type ProfileOptions struct {
	// TopK is the number of top categorical values to capture per
	// categorical field. Defaults to 32 when zero.
	TopK int
	// IncludeStats turns on percentile / stdev / kurtosis collection.
	// When false, only mean / min / max / null-rate are recorded for
	// numeric fields. Defaults to true.
	IncludeStats bool
	// IncludeCorrelations enables pairwise Pearson correlation capture
	// between numeric fields. Off by default to keep profile size bounded.
	IncludeCorrelations bool
	// CorrelationTopK caps the number of strongest |rho| pairs retained.
	// Defaults to 16 when IncludeCorrelations is true.
	CorrelationTopK int
	// SampleLimit caps the number of records ingested for the profile.
	// Zero = unlimited.
	SampleLimit int
}

ProfileOptions modulates how Profile summarizes a cohort.
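The documented zero-value defaults can be sketched as a small normalization step. The withDefaults and ingestCount helpers are hypothetical. The sketch leaves IncludeStats untouched: a Go bool's zero value is false, and this documentation does not say how the stated default of true is applied.

```go
package main

import "fmt"

// ProfileOptions mirrors the documented struct.
type ProfileOptions struct {
	TopK                int
	IncludeStats        bool
	IncludeCorrelations bool
	CorrelationTopK     int
	SampleLimit         int
}

// withDefaults fills in the documented defaults for zero-valued fields.
func withDefaults(o ProfileOptions) ProfileOptions {
	if o.TopK == 0 {
		o.TopK = 32
	}
	if o.IncludeCorrelations && o.CorrelationTopK == 0 {
		o.CorrelationTopK = 16
	}
	return o
}

// ingestCount returns how many of `available` records would be ingested
// under sampleLimit, where zero means unlimited.
func ingestCount(available, sampleLimit int) int {
	if sampleLimit > 0 && available > sampleLimit {
		return sampleLimit
	}
	return available
}

func main() {
	fmt.Printf("%+v\n", withDefaults(ProfileOptions{IncludeCorrelations: true}))
	fmt.Println(ingestCount(1000, 0), ingestCount(1000, 250))
}
```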

type Result

type Result struct {
	RowsGenerated int      `json:"rows_generated"`
	RowsRejected  int      `json:"rows_rejected"`
	OutputPath    string   `json:"output_path"`
	Warnings      []string `json:"warnings,omitempty"`
}

Result is what the writer reports after a successful Synth call.

func Synth

func Synth(fs afero.Fs, spec *Spec, output string, opts Options) (*Result, error)

Synth materializes a synthetic .pulse file from a Spec into the given path on the provided filesystem. It is the package-level entry point; pulse.Pulse.Synth wraps it.

func SynthBytes

func SynthBytes(spec *Spec, opts Options) ([]byte, *Result, error)

SynthBytes returns the byte slice that Synth would have written, without touching the filesystem. Useful for round-trip tests and embedders that stream the output elsewhere.

type Spec

type Spec struct {
	// RowCount is the number of rows to generate. Required (>0).
	RowCount int `json:"row_count"`

	// Fields declares each column with its type and distribution.
	Fields []FieldSpec `json:"fields"`

	// Constraints, if non-empty, are evaluated per row by the expression
	// evaluator. Rows that fail any constraint are rejected and re-drawn.
	Constraints []ConstraintSpec `json:"constraints,omitempty"`

	// MaxRejectionRate caps the fraction of rejected rows during
	// constraint-driven rejection sampling. Defaults to 0.5 if zero.
	MaxRejectionRate float64 `json:"max_rejection_rate,omitempty"`

	// Correlations lists optional pairwise correlations to induce via
	// Gaussian copula post-processing. Each entry references two numeric
	// fields by name with a target Pearson correlation in [-1, 1]. Only
	// numeric fields can participate.
	Correlations []CorrelationSpec `json:"correlations,omitempty"`
}

Spec is the parsed top-level synthesis request. It is the in-memory shape of from-schema JSON. A from-profile call builds a Spec internally from the Profile and shares the rest of the writer pipeline.
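The Gaussian copula technique named in the Correlations comment can be sketched for two fields as follows. This is a generic illustration rather than the package's post-processing code: copulaPair is a hypothetical helper that draws correlated standard normals, maps them to uniforms with the normal CDF, and pushes those through the inverse CDFs of an assumed uniform and an assumed exponential marginal. The Pearson correlation of the resulting fields is generally close to, but not exactly, the rho applied in normal space.

```go
package main

import (
	"errors"
	"fmt"
	"math"
	"math/rand"
)

// stdNormalCDF is the standard normal cumulative distribution function.
func stdNormalCDF(x float64) float64 {
	return 0.5 * (1 + math.Erf(x/math.Sqrt2))
}

// copulaPair draws n pairs (a[i], b[i]) with a ~ Uniform(lo, hi) and
// b ~ Exponential(rate), coupled through a Gaussian copula with
// normal-space correlation rho.
func copulaPair(n int, rho, lo, hi, rate float64, seed int64) (a, b []float64, err error) {
	if rho < -1 || rho > 1 {
		return nil, nil, errors.New("rho must be in [-1, 1]")
	}
	rng := rand.New(rand.NewSource(seed))
	a, b = make([]float64, n), make([]float64, n)
	for i := 0; i < n; i++ {
		// Correlated standard normals.
		z1 := rng.NormFloat64()
		z2 := rho*z1 + math.Sqrt(1-rho*rho)*rng.NormFloat64()
		// Map to uniforms, then through each marginal's inverse CDF.
		u1 := stdNormalCDF(z1)
		u2 := math.Min(stdNormalCDF(z2), 1-1e-12) // keep log(1-u2) finite
		a[i] = lo + u1*(hi-lo)
		b[i] = -math.Log(1-u2) / rate
	}
	return a, b, nil
}

func main() {
	a, b, err := copulaPair(5, 0.8, 0, 100, 0.5, 42)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for i := range a {
		fmt.Printf("%.2f\t%.2f\n", a[i], b[i])
	}
}
```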

func ParseSpec

func ParseSpec(raw []byte) (*Spec, error)

ParseSpec parses a Spec from its JSON form. It returns a SERVICE_VALIDATION error if the JSON does not have the expected shape.
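The JSON shape follows from the struct tags on Spec, FieldSpec and ConstraintSpec. The sketch below decodes a sample document into trimmed mirror types using encoding/json; it does not call ParseSpec. The field type names ("int64", "bool"), the params keys ("min", "max", "p") and the constraint expression syntax are hypothetical, since this documentation does not list them. Only the row_count > 0 check is taken from the Spec documentation.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// Trimmed mirrors of the documented types, with the documented JSON tags.
type FieldSpec struct {
	Name         string         `json:"name"`
	Type         string         `json:"type"`
	Nullable     bool           `json:"nullable,omitempty"`
	Distribution string         `json:"distribution"`
	Params       map[string]any `json:"params,omitempty"`
	NullRate     float64        `json:"null_rate,omitempty"`
}

type ConstraintSpec struct {
	Expr string `json:"expr"`
}

type Spec struct {
	RowCount    int              `json:"row_count"`
	Fields      []FieldSpec      `json:"fields"`
	Constraints []ConstraintSpec `json:"constraints,omitempty"`
}

// sample uses hypothetical type names, params keys and expression syntax.
const sample = `{
  "row_count": 100,
  "fields": [
    {"name": "age", "type": "int64", "distribution": "uniform",
     "params": {"min": 18, "max": 90}},
    {"name": "active", "type": "bool", "nullable": true, "null_rate": 0.1,
     "distribution": "bernoulli", "params": {"p": 0.7}}
  ],
  "constraints": [{"expr": "age >= 18"}]
}`

// parse decodes raw and applies the one shape rule the documentation states.
func parse(raw []byte) (*Spec, error) {
	var s Spec
	if err := json.Unmarshal(raw, &s); err != nil {
		return nil, fmt.Errorf("decode spec: %w", err)
	}
	if s.RowCount <= 0 {
		return nil, errors.New("row_count must be > 0")
	}
	return &s, nil
}

func main() {
	s, err := parse([]byte(sample))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(s.RowCount, len(s.Fields), s.Constraints[0].Expr)
}
```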

func SpecFromProfile

func SpecFromProfile(p *Profile, rowCount int) *Spec

SpecFromProfile builds a Spec the synth pipeline can execute. Numeric fields are reconstructed as normal distributions (mean, std clipped at min/max), categorical fields as weighted_categorical, and date fields as uniform_date over the observed range.
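The numeric reconstruction can be pictured as a normal draw limited to the observed range. The sketch below assumes "clipped" means clamping out-of-range draws to min or max; the package could equally re-draw them, which this documentation does not specify. The drawClipped helper is hypothetical.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// drawClipped returns n draws from Normal(mean, std), each clamped to
// [lo, hi]. Clamping (rather than re-drawing) is an assumption.
func drawClipped(n int, mean, std, lo, hi float64, seed int64) []float64 {
	rng := rand.New(rand.NewSource(seed))
	out := make([]float64, n)
	for i := range out {
		v := mean + std*rng.NormFloat64()
		out[i] = math.Max(lo, math.Min(hi, v))
	}
	return out
}

func main() {
	// Parameters as they might appear in a NumericProfile.
	fmt.Println(drawClipped(5, 40, 15, 18, 90, 42))
}
```

Clamping piles probability mass onto the two boundary values, so a profile with a wide std relative to its range yields visible spikes at min and max in the synthetic output.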

func (*Spec) Hash added in v0.11.0

func (s *Spec) Hash() string

Hash returns the canonical content hash of the synth Spec. The same logical spec produces the same hash across processes, and across Pulse versions as long as the spec's semantic meaning is unchanged.
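One common way to obtain a process-independent content hash is to hash a canonical serialization. The sketch below is an assumption about the general approach, not the algorithm behind Hash: the hypothetical hashSpec helper relies on encoding/json emitting struct fields in declaration order and map keys in sorted order, then applies SHA-256.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Trimmed mirrors of the documented types.
type FieldSpec struct {
	Name         string         `json:"name"`
	Distribution string         `json:"distribution"`
	Params       map[string]any `json:"params,omitempty"`
}

type Spec struct {
	RowCount int         `json:"row_count"`
	Fields   []FieldSpec `json:"fields"`
}

// hashSpec returns the hex SHA-256 of the spec's JSON encoding.
// json.Marshal sorts map keys, so the order in which Params entries were
// inserted does not affect the result.
func hashSpec(s *Spec) (string, error) {
	raw, err := json.Marshal(s)
	if err != nil {
		return "", fmt.Errorf("encode spec: %w", err)
	}
	sum := sha256.Sum256(raw)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	s := &Spec{RowCount: 10, Fields: []FieldSpec{
		{Name: "x", Distribution: "uniform", Params: map[string]any{"min": 0, "max": 1}},
	}}
	h, err := hashSpec(s)
	fmt.Println(h, err)
}
```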
