Documentation ¶
Overview ¶
Package flakereport implements Bayesian commit bisection for identifying which commit most likely introduced a flaky test in a CI system.
Problem ¶
A flaky test is one whose failure probability changes at some point in the commit history — typically because a code change introduced a race condition, timing sensitivity, or environmental dependency. Given a rolling window of CI runs (each associated with the HEAD commit at the time), we want to rank candidate commits by their posterior probability of being the "transition commit" that caused the flakiness.
Data ¶
The raw input is a set of test runs downloaded from GitHub Actions artifacts. Each run records whether a specific test passed or failed, and is tagged with the workflow RunID. RunIDs are mapped to commit SHAs via the GitHub Actions API (WorkflowRun.HeadSHA). Runs are then grouped into CommitObservation records: for each (test, commit SHA) pair we count the number of passing and failing runs.
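The grouping step can be sketched as follows. This is a minimal illustration, not the package's code: the `run`, `key`, and `counts` types and the `groupByCommit` function are hypothetical stand-ins, and the RunID-to-SHA map is assumed to have been built already from WorkflowRun.HeadSHA. Handling of skipped runs is left out.

```go
package main

import "fmt"

// run is a pared-down, hypothetical stand-in for the package's TestRun.
type run struct {
	Name   string
	Failed bool
	RunID  int64
}

// key identifies a (test, commit SHA) pair.
type key struct {
	Test, SHA string
}

// counts holds pass/fail totals for one (test, commit SHA) pair.
type counts struct {
	Passes, Fails int
}

// groupByCommit tallies runs per (test, commit SHA). Runs whose RunID has no
// known SHA are dropped, since they cannot be placed in the commit history.
func groupByCommit(runs []run, shaByRunID map[int64]string) map[key]counts {
	out := make(map[key]counts)
	for _, r := range runs {
		sha, ok := shaByRunID[r.RunID]
		if !ok {
			continue
		}
		k := key{Test: r.Name, SHA: sha}
		c := out[k]
		if r.Failed {
			c.Fails++
		} else {
			c.Passes++
		}
		out[k] = c
	}
	return out
}

func main() {
	runs := []run{
		{"TestA", false, 1},
		{"TestA", true, 2},
		{"TestA", true, 3},
		{"TestA", false, 99}, // unknown run ID, ignored
	}
	shaByRunID := map[int64]string{1: "aaa", 2: "bbb", 3: "bbb"}
	for k, c := range groupByCommit(runs, shaByRunID) {
		fmt.Printf("%s@%s passes=%d fails=%d\n", k.Test, k.SHA, c.Passes, c.Fails)
	}
}
```

Keying the map on a small struct keeps the (test, commit) pairing explicit and avoids string concatenation for composite keys.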
Model ¶
The inference model assumes exactly one transition commit c in the history:
- Before commit c: the test has a background failure probability p_before.
- At commit c and all later commits: the test has an elevated failure probability p_after.
Neither p_before nor p_after is known, so both are assigned independent uniform (Beta(1,1)) priors. The model treats them as nuisance parameters and marginalizes them out, yielding the Beta-Binomial marginal likelihood:
P(data | transition at c) = BetaBinomial(n_before, k_before; 1, 1)
× BetaBinomial(n_after, k_after; 1, 1)
where n_before / k_before are the total runs / failures before commit c, and n_after / k_after are the total runs / failures at and after commit c. The closed form is:
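BetaBinomial(n, k; α, β) = B(k+α, n-k+β) / B(α, β)
where B(·, ·) is the Beta function (not the Beta distribution used for the priors). As written, this is the marginal likelihood of one particular ordered sequence of outcomes; it carries no binomial coefficient C(n, k).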
All arithmetic is performed in log-space to avoid floating-point underflow, with the log-sum-exp trick applied during normalization.
Commit Priors ¶
The prior probability that a given commit is the transition commit is not uniform. Commits that touch only documentation, CI configuration, or test files are unlikely to change production code behaviour, so they receive a reduced prior weight (down to 0.05×). Commits that touch source code keep the default weight of 1.0. The per-commit metadata that drives these heuristics (including the list of changed files) is fetched from the GitHub API in parallel, and the resulting weights are applied before running the inference.
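A prior-weight heuristic of this kind might look like the sketch below. The path patterns in `lowImpact` and the all-or-nothing rule in `priorWeight` are assumptions made for illustration; the package's actual rules and intermediate weights are not documented here, and only the 1.0 default and the 0.05 floor come from the text above.

```go
package main

import (
	"fmt"
	"strings"
)

// lowImpact reports whether a path looks like documentation, CI
// configuration, or a test file. The patterns are illustrative assumptions.
func lowImpact(path string) bool {
	return strings.HasPrefix(path, "docs/") ||
		strings.HasPrefix(path, ".github/") ||
		strings.HasSuffix(path, ".md") ||
		strings.HasSuffix(path, "_test.go")
}

// priorWeight returns 0.05 when every changed file is low-impact and 1.0
// otherwise. A commit with no file information keeps the default weight.
func priorWeight(files []string) float64 {
	if len(files) == 0 {
		return 1.0
	}
	for _, f := range files {
		if !lowImpact(f) {
			return 1.0
		}
	}
	return 0.05
}

func main() {
	fmt.Println(priorWeight([]string{"README.md", ".github/workflows/ci.yml"}))
	fmt.Println(priorWeight([]string{"README.md", "service/handler.go"}))
}
```

A single source file is enough to restore the default weight in this sketch, which errs on the side of not hiding a plausible culprit.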
Inference Algorithm ¶
For N commits with observations, the algorithm runs in O(N) time using prefix sums:
- Compute prefix sums of failures and passes across the commit sequence.
- For each candidate transition commit i, use the prefix sums to split the data into "before" and "after" segments and evaluate the log marginal likelihood.
- Add the log prior weight to each log-likelihood.
- Normalize via log-sum-exp to obtain posterior probabilities that sum to 1.
Output ¶
Commits are ranked by posterior probability. Only commits above a configurable minimum probability threshold (default 0.50) are reported. Results are surfaced in the GitHub Actions step summary, a markdown report artifact, and optionally a Slack message listing the "hottest" commits (those with the highest aggregate posterior probability across all analyzed tests).
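One possible way to derive a "hottest commits" list is sketched below. It assumes that "aggregate" means the sum of a commit's posterior probabilities across tests and that the threshold is applied per suspect before summing; neither detail is confirmed by the text above. The `suspect` type and `hottest` function are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// suspect is a hypothetical, trimmed-down BisectResult.
type suspect struct {
	SHA         string
	Probability float64
}

// hottest sums, per commit, the posterior probabilities of all suspects that
// exceed minProb, and returns the commits sorted by that total, descending.
func hottest(perTest map[string][]suspect, minProb float64) []suspect {
	total := make(map[string]float64)
	for _, suspects := range perTest {
		for _, s := range suspects {
			if s.Probability > minProb {
				total[s.SHA] += s.Probability
			}
		}
	}
	out := make([]suspect, 0, len(total))
	for sha, p := range total {
		out = append(out, suspect{SHA: sha, Probability: p})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Probability != out[j].Probability {
			return out[i].Probability > out[j].Probability
		}
		return out[i].SHA < out[j].SHA // deterministic tie-break
	})
	return out
}

func main() {
	perTest := map[string][]suspect{
		"TestA": {{"aaa", 0.9}, {"bbb", 0.1}},
		"TestB": {{"aaa", 0.6}, {"ccc", 0.7}},
	}
	for _, s := range hottest(perTest, 0.50) {
		fmt.Printf("%s %.2f\n", s.SHA, s.Probability)
	}
}
```

The tie-break on SHA is there only to make the output order stable, since Go map iteration order is randomized.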
Limitations ¶
The model assumes a single transition in the observed window. If multiple commits each contributed independently to flakiness, or if the failure rate oscillates, the model may identify the wrong commit or produce a diffuse posterior with no strong suspect. The model also requires a minimum of 5 failures and 30 total runs before it will attempt bisection on a test, to avoid over-fitting noisy data.
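The eligibility check described above could be expressed as in the sketch below. The `bisectConfig` type is a local, hypothetical copy of the two gating fields of BisectConfig, and wiring the documented minimums (5 failures, 30 runs) into MinFailures and MinRuns is an assumption about how those fields are used.

```go
package main

import "fmt"

// bisectConfig is a local, hypothetical copy of BisectConfig's gating fields.
type bisectConfig struct {
	MinFailures int
	MinRuns     int
}

// qualifies reports whether a test has enough signal to attempt bisection.
func qualifies(cfg bisectConfig, passes, fails int) bool {
	return fails >= cfg.MinFailures && passes+fails >= cfg.MinRuns
}

func main() {
	cfg := bisectConfig{MinFailures: 5, MinRuns: 30}
	fmt.Println(qualifies(cfg, 40, 6)) // true: 6 failures in 46 runs
	fmt.Println(qualifies(cfg, 10, 6)) // false: only 16 runs
	fmt.Println(qualifies(cfg, 97, 3)) // false: only 3 failures
}
```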
Index ¶
- func NewCliApp() *cli.App
- type ArtifactJob
- type ArtifactResult
- type ArtifactsResponse
- type BisectConfig
- type BisectResult
- type CommitMeta
- type CommitObservation
- type FailedTestRecord
- type ReportSummary
- type SlackBlock
- type SlackMessage
- type SlackText
- type SuiteReport
- type TestBisectReport
- type TestFailure
- type TestReport
- type TestRun
- type WorkflowArtifact
- type WorkflowRun
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
Types ¶
type ArtifactJob ¶
type ArtifactJob struct {
Repo string
RunID int64
RunCreatedAt time.Time
Artifact WorkflowArtifact
TempDir string
RunNumber int
TotalRuns int
ArtifactNum int
}
ArtifactJob represents a job to download and process an artifact.
type ArtifactResult ¶
type ArtifactResult struct {
Failures []TestFailure
AllRuns []TestRun
Error error
}
ArtifactResult represents the result of processing an artifact.
type ArtifactsResponse ¶
type ArtifactsResponse struct {
TotalCount int `json:"total_count"`
Artifacts []WorkflowArtifact `json:"artifacts"`
}
ArtifactsResponse represents the GitHub API response for artifacts.
type BisectConfig ¶
type BisectConfig struct {
Repo string
TopN int // max tests to analyze; 0 = all qualifying tests
MinFailures int
MinRuns int
MinProbability float64 // only report tests whose top suspect exceeds this (0–1); 0 = report all
}
BisectConfig holds configuration for a bisect analysis run.
type BisectResult ¶
type BisectResult struct {
CommitSHA string
CommitIdx int
Probability float64 // posterior P(this commit introduced the flakiness)
PassesBefore int
FailsBefore int
PassesAfter int
FailsAfter int
CommitTitle string
CommitAuthor string
CommitDate string // formatted date of the commit, e.g. "2024-01-15"
HeuristicNote string // e.g. "only touches .github/ — deprioritized"
}
BisectResult is one candidate culprit commit with its posterior probability.
type CommitMeta ¶
type CommitMeta struct {
SHA string
Title string
Author string
CommittedAt time.Time
Files []string // relative paths of changed files
}
CommitMeta holds commit metadata and changed-file info fetched from the GitHub API via GET /repos/{owner}/{repo}/commits/{sha}.
type CommitObservation ¶
type CommitObservation struct {
CommitSHA string
CommitIdx int // chronological index (0 = oldest)
Prior float64 // prior weight (1.0 = uniform; adjusted by heuristics)
HeuristicNote string // reason for prior adjustment, if any
Passes int
Fails int
}
CommitObservation holds aggregated pass/fail data for a single (test, commit) pair.
type FailedTestRecord ¶
type FailedTestRecord struct {
SuiteName string `json:"suite_name"`
TestName string `json:"test_name"`
FailureDate string `json:"failure_date"`
Link string `json:"link"`
FailureType string `json:"failure_type"`
}
FailedTestRecord represents a single test failure for the failures.json analytics export.
type ReportSummary ¶
type ReportSummary struct {
FlakyTests []TestReport
Timeouts []TestReport // Tests ending with "(timeout)"
Crashes []TestReport // Tests containing "crash"
CIBreakers []TestReport // Tests that failed all retries (3x) in a single job
Suites []SuiteReport // Per-suite flake breakdown
TotalFailures int // Total raw failure count
TotalTestRuns int // Total test executions (all tests, all runs)
OverallFailureRate float64 // Overall failures per 1000 test runs
TotalFlakyCount int // Total flaky tests (not just top 10)
TotalWorkflowRuns int // Total workflow runs analyzed
SuccessfulRuns int // Workflow runs that succeeded
}
ReportSummary contains all processed report data.
type SlackBlock ¶
type SlackMessage ¶
type SlackMessage struct {
Text string `json:"text"`
Blocks []SlackBlock `json:"blocks"`
}
SlackMessage represents a Slack Block Kit message.
type SuiteReport ¶
type SuiteReport struct {
SuiteName string // Test suite name from JUnit XML
FlakeRate float64 // Percentage of job executions with at least one non-retry failure
FailedRuns int // Number of job executions with at least one non-retry failure
TotalRuns int // Total number of job executions where this suite appeared
LastFailure time.Time // Timestamp of the most recent failure
}
SuiteReport represents aggregated flake data for a test suite.
type TestBisectReport ¶
type TestBisectReport struct {
TestName string
TopSuspects []BisectResult // sorted by Probability descending
TotalObs int // total observations (pass + fail) used
Skipped bool // true if below signal or confidence threshold
}
TestBisectReport is the full bisect output for a single test.
type TestFailure ¶
type TestFailure struct {
ClassName string // Test class/module name
Name string // Test function name
SuiteName string // Top-level test suite name
ArtifactID string // Artifact identifier from GitHub
RunID int64 // GitHub Actions run ID
JobID string // GitHub Actions job ID (or "unknown")
MatrixName string // DB config name from artifact name (e.g. "sqlite", "cassandra")
Timestamp time.Time // When the workflow run was created
}
TestFailure represents a single test failure extracted from JUnit XML.
type TestReport ¶
type TestReport struct {
TestName string // Normalized test name (retry suffix stripped)
FailureCount int // Total number of failures
TotalRuns int // Total number of times this test ran (including successes)
GitHubURLs []string // Up to max_links failure URLs
LastFailure time.Time // Timestamp of the most recent failure
}
TestReport represents aggregated failures for a single test.
type TestRun ¶
type TestRun struct {
SuiteName string // Top-level test suite name
Name string // Test name
Failed bool // Whether the test failed
Skipped bool // Whether the test was skipped
RunID int64 // Workflow run ID
JobID string // GitHub Actions job ID (unique per matrix job/shard)
MatrixName string // DB config name from artifact name (e.g. "sqlite", "cassandra")
}
TestRun represents a single test execution (success or failure).
type WorkflowArtifact ¶
type WorkflowArtifact struct {
ID int64 `json:"id"`
Name string `json:"name"`
CreatedAt time.Time `json:"created_at"`
Expired bool `json:"expired"`
}
WorkflowArtifact represents a downloadable artifact.
type WorkflowRun ¶
type WorkflowRun struct {
ID int64 `json:"id"`
Number int `json:"run_number"`
CreatedAt time.Time `json:"created_at"`
Status string `json:"status"`
Conclusion string `json:"conclusion"`
HeadBranch string `json:"head_branch"`
HeadSHA string `json:"head_sha"`
}
WorkflowRun represents a GitHub Actions workflow run.