typos

package
v0.0.0-...-ffc4fba Latest
Warning

This package is not in the latest version of its module.

Published: Mar 13, 2026 License: Apache-2.0 Imports: 17 Imported by: 0

README

Typos Dataset Builder

Preface

Clean code is professional code. More than that, typos in identifiers can be bugs waiting to happen (e.g., a misspelled method name that silently fails to override the intended method).

Problem

  • Finding common spelling mistakes in the codebase.
  • Generating datasets for training Machine Learning models to automatically fix typos (the original intent of this analyzer).

How the analyzer solves it

The Typos analyzer looks for "typo-fix" patterns in the commit history. It identifies cases where an identifier was changed to another identifier with a very small Levenshtein distance (e.g., recieve -> receive).

Historical context

This was likely developed to support "Natural Code" research: building tools that autocorrect code the way a spellchecker corrects prose.

Real world examples

  • Dataset Generation: Creating a list of 10,000 real-world typo fixes to train a neural network.
  • QA: Scanning recent commits to catch typos that slipped through review.

How the analyzer works here

  1. Diff Scan: Looks at modified lines in commits.
  2. Identifier Extraction: Uses UAST/Tokenization to find identifiers in the "before" and "after" versions.
  3. Distance Calculation: Computes Levenshtein distance.
  4. Filtering: If the distance is small (e.g., 1 or 2 edits; the upper bound is MaximumAllowedDistance, whose documented default is 4) and the context is similar, it records the pair as a typo fix.
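
The distance and filtering steps can be illustrated with a small standalone sketch. It is not the package's actual code: levenshtein, min3, and isTypoFix are hypothetical helpers written for this example, and the context-similarity check from step 4 is left out.

```go
package main

import "fmt"

// min3 returns the smallest of three ints.
func min3(a, b, c int) int {
	m := a
	if b < m {
		m = b
	}
	if c < m {
		m = c
	}
	return m
}

// levenshtein returns the edit distance between a and b, counting
// single-rune insertions, deletions, and substitutions.
func levenshtein(a, b string) int {
	ra, rb := []rune(a), []rune(b)
	prev := make([]int, len(rb)+1)
	curr := make([]int, len(rb)+1)
	for j := range prev {
		prev[j] = j
	}
	for i := 1; i <= len(ra); i++ {
		curr[0] = i
		for j := 1; j <= len(rb); j++ {
			cost := 1
			if ra[i-1] == rb[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(rb)]
}

// isTypoFix is a hypothetical filter: the identifiers must differ and be
// within maxDistance edits of each other.
func isTypoFix(before, after string, maxDistance int) bool {
	if before == after {
		return false
	}
	return levenshtein(before, after) <= maxDistance
}

func main() {
	fmt.Println(levenshtein("recieve", "receive"))           // 2
	fmt.Println(isTypoFix("recieve", "receive", 2))          // true
	fmt.Println(isTypoFix("parseHeader", "renderFooter", 2)) // false
}
```

Note that a swapped pair of letters, as in recieve -> receive, costs two edits under plain Levenshtein distance, which is one reason a threshold of 1 would miss common typos.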

Limitations

  • False Positives: color -> colour might be a localization change, not a typo. i -> j in a loop is a logic change, not a typo.

Further plans

  • Context-aware validation.

Documentation

Overview

Package typos detects typo-fix identifier pairs across commit history and reports them as per-file and aggregate statistics.

Constants

const (
	// DefaultMaximumAllowedTypoDistance is the default maximum Levenshtein distance for typo detection.
	DefaultMaximumAllowedTypoDistance = 4
	// ConfigTyposDatasetMaximumAllowedDistance is the configuration key for the maximum Levenshtein distance.
	ConfigTyposDatasetMaximumAllowedDistance = "TyposDatasetBuilder.MaximumAllowedDistance"
)
const (
	KindFileTypos = "file_typos"
	KindAggregate = "aggregate"
)

Store record kind constants.

Functions

func ComputeAllMetrics

func ComputeAllMetrics(report analyze.Report) (*common.MetricSet, error)

ComputeAllMetrics runs all typos metrics and returns the results.

func GenerateStoreSections

func GenerateStoreSections(reader analyze.ReportReader) ([]plotpage.Section, error)

GenerateStoreSections reads pre-computed typo data from a ReportReader and builds the same plot sections as GenerateSections, without materializing a full Report or recomputing metrics.

func RegisterPlotSections

func RegisterPlotSections()

RegisterPlotSections registers the typos plot section renderer with the analyze package.

Types

type AggregateData

type AggregateData struct {
	TotalTypos      int `json:"total_typos"      yaml:"total_typos"`
	UniquePatterns  int `json:"unique_patterns"  yaml:"unique_patterns"`
	AffectedFiles   int `json:"affected_files"   yaml:"affected_files"`
	AffectedCommits int `json:"affected_commits" yaml:"affected_commits"`
}

AggregateData contains summary statistics.
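
As a rough illustration of what these four counters could mean, the sketch below derives them from a slice of typo records. The typo and aggregate types are simplified local stand-ins (the commit is a plain string rather than gitlib.Hash), summarize is a hypothetical helper, and the exact counting rules used by the package are an assumption here.

```go
package main

import "fmt"

// typo is a simplified stand-in for the package's Typo type.
type typo struct {
	Wrong, Correct, File, Commit string
}

// aggregate mirrors the fields of AggregateData without the struct tags.
type aggregate struct {
	TotalTypos      int
	UniquePatterns  int
	AffectedFiles   int
	AffectedCommits int
}

// summarize counts all records plus the distinct (wrong, correct) pairs,
// files, and commits among them.
func summarize(typos []typo) aggregate {
	patterns := make(map[[2]string]struct{})
	files := make(map[string]struct{})
	commits := make(map[string]struct{})
	for _, t := range typos {
		patterns[[2]string{t.Wrong, t.Correct}] = struct{}{}
		files[t.File] = struct{}{}
		commits[t.Commit] = struct{}{}
	}
	return aggregate{
		TotalTypos:      len(typos),
		UniquePatterns:  len(patterns),
		AffectedFiles:   len(files),
		AffectedCommits: len(commits),
	}
}

func main() {
	typos := []typo{
		{"recieve", "receive", "a.go", "c1"},
		{"recieve", "receive", "b.go", "c1"},
		{"lenght", "length", "a.go", "c2"},
	}
	fmt.Printf("%+v\n", summarize(typos))
}
```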

type Analyzer

type Analyzer struct {
	*analyze.BaseHistoryAnalyzer[*common.MetricSet]
	common.NoStateHibernation

	UAST      *plumbing.UASTChangesAnalyzer
	FileDiff  *plumbing.FileDiffAnalyzer
	BlobCache *plumbing.BlobCacheAnalyzer

	MaximumAllowedDistance int
	// contains filtered or unexported fields
}

Analyzer detects typo-fix identifier pairs across commit history.

func NewAnalyzer

func NewAnalyzer() *Analyzer

NewAnalyzer creates a new typos analyzer.

func (*Analyzer) ApplySnapshot

func (t *Analyzer) ApplySnapshot(snap analyze.PlumbingSnapshot)

ApplySnapshot restores plumbing state from a previously captured snapshot.

func (*Analyzer) CPUHeavy

func (t *Analyzer) CPUHeavy() bool

CPUHeavy returns true because typo detection performs UAST processing per commit.

func (*Analyzer) Configure

func (t *Analyzer) Configure(facts map[string]any) error

Configure sets up the analyzer with the provided facts.
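
The documentation does not show how Configure reads its facts, so the following is only a sketch of one plausible pattern: look up ConfigTyposDatasetMaximumAllowedDistance in the map and fall back to DefaultMaximumAllowedTypoDistance. The constants are re-declared locally so the example compiles on its own, maxDistanceFromFacts is a hypothetical helper, and the assumption that the value is stored as a positive int is not confirmed by the source.

```go
package main

import "fmt"

// Local copies of the documented constants, so the sketch is self-contained.
const (
	DefaultMaximumAllowedTypoDistance        = 4
	ConfigTyposDatasetMaximumAllowedDistance = "TyposDatasetBuilder.MaximumAllowedDistance"
)

// maxDistanceFromFacts is a hypothetical helper: it returns the configured
// distance when the fact is present as a positive int, and the default
// otherwise (missing key, wrong type, or non-positive value).
func maxDistanceFromFacts(facts map[string]any) int {
	if v, ok := facts[ConfigTyposDatasetMaximumAllowedDistance].(int); ok && v > 0 {
		return v
	}
	return DefaultMaximumAllowedTypoDistance
}

func main() {
	fmt.Println(maxDistanceFromFacts(nil)) // 4
	fmt.Println(maxDistanceFromFacts(map[string]any{
		ConfigTyposDatasetMaximumAllowedDistance: 2,
	})) // 2
}
```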

func (*Analyzer) Consume

func (t *Analyzer) Consume(ctx context.Context, ac *analyze.Context) (analyze.TC, error)

Consume processes a single commit and returns a TC with per-commit typo data. The analyzer does not retain any per-commit state; all output is in the TC.

func (*Analyzer) Fork

func (t *Analyzer) Fork(n int) []analyze.HistoryAnalyzer

Fork creates independent copies of the analyzer for parallel processing.

func (*Analyzer) Initialize

func (t *Analyzer) Initialize(_ *gitlib.Repository) error

Initialize prepares the analyzer for processing commits.

func (*Analyzer) Merge

func (t *Analyzer) Merge(_ []analyze.HistoryAnalyzer)

Merge is a no-op. Per-commit results are emitted as TCs.

func (*Analyzer) NeedsUAST

func (t *Analyzer) NeedsUAST() bool

NeedsUAST returns true to enable the UAST pipeline.

func (*Analyzer) ReleaseSnapshot

func (t *Analyzer) ReleaseSnapshot(snap analyze.PlumbingSnapshot)

ReleaseSnapshot releases UAST trees owned by the snapshot.

func (*Analyzer) SnapshotPlumbing

func (t *Analyzer) SnapshotPlumbing() analyze.PlumbingSnapshot

SnapshotPlumbing captures the current plumbing output state for parallel execution.

func (*Analyzer) WriteToStore

func (t *Analyzer) WriteToStore(ctx context.Context, ticks []analyze.TICK, w analyze.ReportWriter) error

WriteToStore implements analyze.StoreWriter. It extracts typos from TICKs, deduplicates, computes per-file typo counts and aggregate statistics, and streams them as individual records:

  • "file_typos": per-file FileTypoData records (sorted by typo count in descending order).
  • "aggregate": single AggregateData record.
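
The deduplicate, count, and sort sequence described above might look roughly like the sketch below. It uses simplified stand-in types and a hypothetical perFileCounts helper. The deduplication key (all four fields) and the tie-break on file name are assumptions, and FixedTypos is omitted because the documentation does not define how it is computed.

```go
package main

import (
	"fmt"
	"sort"
)

// typo and fileCount are simplified local stand-ins for the package's Typo
// and FileTypoData types.
type typo struct {
	Wrong, Correct, File string
	Line                 int
}

type fileCount struct {
	File      string
	TypoCount int
}

// perFileCounts drops exact duplicates, counts typos per file, and sorts
// the result by count (descending), then by file name for stable output.
func perFileCounts(typos []typo) []fileCount {
	seen := make(map[typo]bool)
	counts := make(map[string]int)
	for _, t := range typos {
		if seen[t] {
			continue
		}
		seen[t] = true
		counts[t.File]++
	}
	out := make([]fileCount, 0, len(counts))
	for f, n := range counts {
		out = append(out, fileCount{File: f, TypoCount: n})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].TypoCount != out[j].TypoCount {
			return out[i].TypoCount > out[j].TypoCount
		}
		return out[i].File < out[j].File
	})
	return out
}

func main() {
	typos := []typo{
		{"adress", "address", "b.go", 5},
		{"recieve", "receive", "a.go", 10},
		{"recieve", "receive", "a.go", 10}, // exact duplicate, ignored
		{"lenght", "length", "a.go", 22},
	}
	for _, fc := range perFileCounts(typos) {
		fmt.Printf("%s: %d\n", fc.File, fc.TypoCount)
	}
}
```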

type FileTypoData

type FileTypoData struct {
	File       string `json:"file"        yaml:"file"`
	TypoCount  int    `json:"typo_count"  yaml:"typo_count"`
	FixedTypos int    `json:"fixed_typos" yaml:"fixed_typos"`
}

FileTypoData contains typo statistics per file.

type ReportData

type ReportData struct {
	Typos []Typo
}

ReportData is the parsed input data for typos metrics computation.

func ParseReportData

func ParseReportData(report analyze.Report) (*ReportData, error)

ParseReportData extracts ReportData from an analyzer report.

type TickData

type TickData struct {
	Typos []Typo
}

TickData is the aggregated payload stored in analyze.TICK.Data.

type Typo

type Typo struct {
	Wrong   string
	Correct string
	File    string
	Commit  gitlib.Hash
	Line    int
}

Typo represents a detected typo-fix pair in source code.

type TypoData

type TypoData struct {
	Wrong   string `json:"wrong"   yaml:"wrong"`
	Correct string `json:"correct" yaml:"correct"`
	File    string `json:"file"    yaml:"file"`
	Line    int    `json:"line"    yaml:"line"`
	Commit  string `json:"commit"  yaml:"commit"`
}

TypoData contains information about a single typo fix.
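
To show how the struct tags shape serialized output, the sketch below copies the TypoData definition locally and encodes one record with encoding/json. The record values (file path, line, and the shortened commit string) are made up for illustration, and encode is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TypoData is copied from the documented definition so the example
// compiles without importing the package.
type TypoData struct {
	Wrong   string `json:"wrong"   yaml:"wrong"`
	Correct string `json:"correct" yaml:"correct"`
	File    string `json:"file"    yaml:"file"`
	Line    int    `json:"line"    yaml:"line"`
	Commit  string `json:"commit"  yaml:"commit"`
}

// encode marshals a single record to its JSON form.
func encode(d TypoData) (string, error) {
	b, err := json.Marshal(d)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	s, err := encode(TypoData{
		Wrong:   "recieve",
		Correct: "receive",
		File:    "net/conn.go",
		Line:    42,
		Commit:  "abc123", // illustrative value, not a real hash
	})
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	fmt.Println(s)
	// {"wrong":"recieve","correct":"receive","file":"net/conn.go","line":42,"commit":"abc123"}
}
```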

type TypoPatternData

type TypoPatternData struct {
	Wrong     string `json:"wrong"     yaml:"wrong"`
	Correct   string `json:"correct"   yaml:"correct"`
	Frequency int    `json:"frequency" yaml:"frequency"`
}

TypoPatternData contains common typo patterns.
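
One way such pattern records could be produced is by counting how often each (wrong, correct) pair occurs and ranking the pairs by frequency. This is a sketch under that assumption; pair, pattern, and patternFrequencies are hypothetical names, not part of the package.

```go
package main

import (
	"fmt"
	"sort"
)

// pair is a (wrong, correct) identifier pair; pattern mirrors the fields of
// TypoPatternData without the struct tags.
type pair struct {
	Wrong, Correct string
}

type pattern struct {
	Wrong, Correct string
	Frequency      int
}

// patternFrequencies counts each distinct pair and returns the patterns
// sorted by frequency (descending), then alphabetically for stable output.
func patternFrequencies(pairs []pair) []pattern {
	counts := make(map[pair]int)
	for _, p := range pairs {
		counts[p]++
	}
	out := make([]pattern, 0, len(counts))
	for p, n := range counts {
		out = append(out, pattern{Wrong: p.Wrong, Correct: p.Correct, Frequency: n})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Frequency != out[j].Frequency {
			return out[i].Frequency > out[j].Frequency
		}
		if out[i].Wrong != out[j].Wrong {
			return out[i].Wrong < out[j].Wrong
		}
		return out[i].Correct < out[j].Correct
	})
	return out
}

func main() {
	pairs := []pair{
		{"lenght", "length"},
		{"recieve", "receive"},
		{"recieve", "receive"},
	}
	for _, p := range patternFrequencies(pairs) {
		fmt.Printf("%s -> %s: %d\n", p.Wrong, p.Correct, p.Frequency)
	}
}
```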
