categorizer

package
Version: v0.21.1
Published: May 21, 2026 | License: Apache-2.0 | Imports: 6 | Imported by: 0

Documentation

Overview

Package categorizer classifies pod failures into a single named category with an optional subcategory, based on configurable rules. It runs at the executor, where full Kubernetes pod status is available. The resulting category and subcategory are set on the Error proto attached to events.

Configuration

Categories are defined in the executor config under application.errorCategories. Each category has a name and one or more rules. Rules are evaluated in config order across all categories; the first matching rule wins, setting both the category name and the rule's optional subcategory.
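The ordering rule above can be illustrated with a small standalone model. This is only a sketch: the rule and category types and the firstMatch and is helpers are hypothetical stand-ins, not this package's API, and a plain string comparison replaces the real matchers.

```go
package main

import "fmt"

// rule and category are simplified, hypothetical stand-ins for
// CategoryRule and CategoryConfig; match replaces the real matchers.
type rule struct {
	subcategory string
	match       func(signal string) bool
}

type category struct {
	name  string
	rules []rule
}

// firstMatch walks categories, and the rules inside each category, in
// config order and returns the first hit. ok is false when nothing matched.
func firstMatch(cats []category, signal string) (name, sub string, ok bool) {
	for _, c := range cats {
		for _, r := range c.rules {
			if r.match(signal) {
				return c.name, r.subcategory, true
			}
		}
	}
	return "", "", false
}

// is builds a matcher that accepts exactly one signal string.
func is(want string) func(string) bool {
	return func(got string) bool { return got == want }
}

func main() {
	cats := []category{
		{name: "infrastructure", rules: []rule{{subcategory: "oom", match: is("OOMKilled")}}},
		{name: "user_code", rules: []rule{{subcategory: "memory", match: is("OOMKilled")}}},
	}
	name, sub, ok := firstMatch(cats, "OOMKilled")
	fmt.Println(name, sub, ok) // infrastructure oom true
}
```

Both categories contain a rule that would match, but the earlier one in config order decides the result, so rule order matters when categories overlap.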

Each rule uses exactly one matcher:

  • OnConditions: matches Kubernetes failure signals (OOMKilled, Evicted, DeadlineExceeded)
  • OnExitCodes: matches non-zero container exit codes using In/NotIn set operators
  • OnTerminationMessage: matches container termination messages against a regex
  • OnPodError: matches a pod-level error message captured by the executor against a regex; covers failures with no useful container terminationMessage (image pull, missing volume, stuck terminating, deadline exceeded, etc.)

Container-level matchers honor ContainerName scoping when set. OnPodError ignores it because pod-level error text has no container attribution.

Each rule may also set Hint, an optional user-facing string that the executor appends to the failure message. Hints land in lookoutdb.job_run.error and are surfaced to users in Lookout alongside the raw runtime error.

Exit code 0 is always skipped. Both regular and init containers are checked.
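As a rough illustration of the In/NotIn semantics and the exit code 0 rule, the following sketch models an exit code matcher. The exitCodeMatcher type and its matches method are hypothetical; the real errormatch.ExitCodeMatcher may be shaped differently.

```go
package main

import "fmt"

// exitCodeMatcher is a hypothetical stand-in for errormatch.ExitCodeMatcher.
type exitCodeMatcher struct {
	operator string // "In" or "NotIn"
	values   []int32
}

// matches reports whether a container exit code satisfies the matcher.
// Exit code 0 never matches, even under NotIn.
func (m exitCodeMatcher) matches(code int32) bool {
	if code == 0 {
		return false
	}
	found := false
	for _, v := range m.values {
		if v == code {
			found = true
			break
		}
	}
	switch m.operator {
	case "In":
		return found
	case "NotIn":
		return !found
	}
	return false // unknown operator; the real package rejects these up front
}

func main() {
	cuda := exitCodeMatcher{operator: "In", values: []int32{74, 75}}
	notOne := exitCodeMatcher{operator: "NotIn", values: []int32{1}}
	fmt.Println(cuda.matches(74), cuda.matches(1))    // true false
	fmt.Println(notOne.matches(2), notOne.matches(0)) // true false
}
```

Without the explicit zero check, a NotIn rule would match every successful container, which is presumably why exit code 0 is skipped unconditionally.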

Example

application:
  errorCategories:
    enabled: true
    defaultCategory: "uncategorized"
    defaultSubcategory: "unknown"
    categories:
      - name: infrastructure
        rules:
          - onConditions: ["OOMKilled"]
            subcategory: "oom"
            hint: "Increase the memory request in your job spec"
          - onConditions: ["Evicted"]
            subcategory: "eviction"
          - onPodError:
              pattern: "no match for platform in manifest"
            subcategory: "platform_mismatch"
            hint: "Build the image for the cluster's CPU architecture (typically x64/arm64 mismatch)"
      - name: user_code
        rules:
          - onExitCodes:
              operator: In
              values: [74, 75]
            subcategory: "cuda"
          - onTerminationMessage:
              pattern: "(?i)cuda.*error"
            subcategory: "cuda"

Validation

NewClassifier validates the entire config up front. Unknown condition strings, invalid exit code operators, empty value lists, and invalid regexes all produce an error at construction time, as does a rule that does not set exactly one matcher.
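The checks listed above could look roughly like the sketch below. It is not the package's implementation: ruleSpec, exitCodes, knownConditions, and validate are hypothetical names, and the condition set is copied from the overview and may be incomplete.

```go
package main

import (
	"fmt"
	"regexp"
)

// exitCodes and ruleSpec are hypothetical, simplified rule shapes.
type exitCodes struct {
	operator string
	values   []int32
}

type ruleSpec struct {
	onConditions         []string
	onExitCodes          *exitCodes
	onTerminationMessage *string
	onPodError           *string
}

// knownConditions is assumed from the overview's list of failure signals.
var knownConditions = map[string]bool{
	"OOMKilled": true, "Evicted": true, "DeadlineExceeded": true,
}

// validate rejects a rule before any pod is classified.
func validate(r ruleSpec) error {
	set := 0
	if len(r.onConditions) > 0 {
		set++
	}
	if r.onExitCodes != nil {
		set++
	}
	if r.onTerminationMessage != nil {
		set++
	}
	if r.onPodError != nil {
		set++
	}
	if set != 1 {
		return fmt.Errorf("rule must set exactly one matcher, got %d", set)
	}
	for _, c := range r.onConditions {
		if !knownConditions[c] {
			return fmt.Errorf("unknown condition %q", c)
		}
	}
	if ec := r.onExitCodes; ec != nil {
		if ec.operator != "In" && ec.operator != "NotIn" {
			return fmt.Errorf("invalid exit code operator %q", ec.operator)
		}
		if len(ec.values) == 0 {
			return fmt.Errorf("exit code matcher has no values")
		}
	}
	for _, p := range []*string{r.onTerminationMessage, r.onPodError} {
		if p == nil {
			continue
		}
		if _, err := regexp.Compile(*p); err != nil {
			return fmt.Errorf("invalid regex %q: %w", *p, err)
		}
	}
	return nil
}

func main() {
	bad := "(unclosed"
	rules := []ruleSpec{
		{onConditions: []string{"OOMKilled"}},              // valid
		{onConditions: []string{"Evicted"}, onPodError: &bad}, // two matchers
		{onExitCodes: &exitCodes{operator: "In"}},          // empty value list
		{onTerminationMessage: &bad},                       // invalid regex
	}
	for i, r := range rules {
		fmt.Printf("rule %d: %v\n", i, validate(r))
	}
}
```

Failing at construction time means a typo in the executor config surfaces at startup rather than as silently unclassified failures later.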

Usage

classifier, err := categorizer.NewClassifier(config.ErrorCategories)
if err != nil {
    // handle invalid config
}

// Terminated pod: container state carries the relevant termination signals.
result := classifier.ClassifyContainerError(pod)

// Pre-terminal failure: an executor-captured error message is matched
// against onPodError rules in addition to pod state.
result = classifier.ClassifyPodError(pod, podErrorMessage)

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type CategoryConfig

type CategoryConfig struct {
	Name  string         `yaml:"name"`
	Rules []CategoryRule `yaml:"rules"`
}

CategoryConfig defines a named error category with rules that match against pod failure signals. The first matching rule (across all categories, in config order) wins, setting both the category name and the rule's optional subcategory.

type CategoryRule

type CategoryRule struct {
	ContainerName        string                      `yaml:"containerName,omitempty"`
	OnExitCodes          *errormatch.ExitCodeMatcher `yaml:"onExitCodes,omitempty"`
	OnTerminationMessage *errormatch.RegexMatcher    `yaml:"onTerminationMessage,omitempty"`
	OnPodError           *errormatch.RegexMatcher    `yaml:"onPodError,omitempty"`
	OnConditions         []string                    `yaml:"onConditions,omitempty"`
	Subcategory          string                      `yaml:"subcategory,omitempty"`
	// Hint is operator-supplied user-facing copy describing this failure mode.
	// When set, it is appended to the failure message that lands in
	// lookoutdb.job_run.error so end users see actionable guidance alongside the
	// raw runtime error. Optional; empty means no hint is added.
	Hint string `yaml:"hint,omitempty"`
}

CategoryRule defines a single matching condition. Exactly one matcher must be set per rule (validated by NewClassifier). Rules within a category are OR'd.

Container-level matchers (OnConditions, OnExitCodes, OnTerminationMessage) inspect per-container state from pod.Status; ContainerName scopes them to a specific container when set, otherwise any container can match.

OnPodError is pod-level: it matches a regex against the failure message the executor captured for the issue. Use it for failures where no container has a useful terminationMessage, including kubelet/runtime errors (image pull, missing volume, missing config) and Armada-detected conditions (stuck terminating, active deadline exceeded, externally deleted). ContainerName is ignored for OnPodError because the message has no container attribution.

type Classifier

type Classifier struct {
	// contains filtered or unexported fields
}

Classifier evaluates pods against a set of category rules and returns the first matching category and subcategory.

func NewClassifier

func NewClassifier(config ErrorCategoriesConfig) (*Classifier, error)

NewClassifier validates config and compiles regex patterns. It returns an error if any regex is invalid, a condition is unknown, an exit code matcher has an invalid operator or an empty value list, or a rule does not set exactly one matcher.

func (*Classifier) ClassifyContainerError added in v0.20.43

func (c *Classifier) ClassifyContainerError(pod *v1.Pod) ClassifyResult

ClassifyContainerError returns the category and subcategory for a pod whose failure is described by its own state: terminated containers, exit codes, and Kubernetes conditions. Use it for terminated pods (PodFailed phase). Returns an empty result if the receiver is nil or the pod is nil. Returns (defaultCategory, defaultSubcategory) if no rules match.
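The nil-receiver and default behaviour described above follows a common Go pattern, sketched here with hypothetical classifier, pod, and result types. Rule evaluation is deliberately left out, so the non-nil path always falls through to the defaults.

```go
package main

import "fmt"

// pod, result, and classifier are hypothetical stand-ins for *v1.Pod,
// ClassifyResult, and *Classifier.
type pod struct{}

type result struct{ category, subcategory string }

type classifier struct{ defaults result }

// classify is safe to call on a nil *classifier: methods with pointer
// receivers may be invoked on nil as long as they check before dereferencing.
func (c *classifier) classify(p *pod) result {
	if c == nil || p == nil {
		return result{} // empty result: nothing is set on the event
	}
	// Rule evaluation omitted; assume no rule matched.
	return c.defaults
}

func main() {
	var unset *classifier
	fmt.Printf("%+v\n", unset.classify(&pod{}))

	c := &classifier{defaults: result{category: "uncategorized", subcategory: "unknown"}}
	fmt.Printf("%+v\n", c.classify(&pod{}))
}
```

Callers can therefore hold a nil classifier when classification is not configured and still call it unconditionally.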

func (*Classifier) ClassifyPodError added in v0.20.43

func (c *Classifier) ClassifyPodError(pod *v1.Pod, podErrorMessage string) ClassifyResult

ClassifyPodError returns the category and subcategory for a pod-level failure captured by the executor (image pull, missing volume, stuck terminating, active deadline exceeded, etc.). It additionally matches podErrorMessage against onPodError rules (see CategoryRule.OnPodError); all other rule types are evaluated against pod state, preserving first-match-wins across config order. Returns an empty result if the receiver is nil or the pod is nil. Returns (defaultCategory, defaultSubcategory) if no rules match.

type ClassifyResult added in v0.20.41

type ClassifyResult struct {
	Category    string
	Subcategory string
	// Hint is operator-supplied user-facing copy attached to the matching rule.
	// Use AppendHint to attach it to the failure message before emitting events.
	Hint string
}

ClassifyResult holds the classification output for a failed pod.

func (ClassifyResult) AppendHint added in v0.20.43

func (r ClassifyResult) AppendHint(message string) string

AppendHint returns the message with this result's hint appended after a blank line. Returns the message unchanged when no hint is set. Centralizing the format here keeps both event-reporting call sites consistent.
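A minimal sketch of the described behaviour follows. The appendHint function is hypothetical, and the exact separator is an assumption: "after a blank line" is modelled here as two newline characters.

```go
package main

import "fmt"

// appendHint returns message unchanged when hint is empty; otherwise it
// appends hint after a blank line (assumed to be "\n\n").
func appendHint(message, hint string) string {
	if hint == "" {
		return message
	}
	return message + "\n\n" + hint
}

func main() {
	fmt.Println(appendHint("container main was OOMKilled", "Increase the memory request in your job spec"))
	fmt.Println(appendHint("container main exited with code 1", ""))
}
```

Keeping the raw runtime error first and the hint second means existing consumers that read only the start of the message still see the original error text.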

type ErrorCategoriesConfig added in v0.20.41

type ErrorCategoriesConfig struct {
	// Enabled toggles failure classification on pod errors. When false, no
	// failure_category or failure_subcategory is set on error events.
	Enabled bool `yaml:"enabled"`
	// DefaultCategory is the category assigned when no rule matches.
	// If empty, no category is assigned when no rule matches.
	DefaultCategory string `yaml:"defaultCategory"`
	// DefaultSubcategory is the subcategory assigned when no rule matches.
	// If empty, no subcategory is assigned when no rule matches.
	DefaultSubcategory string           `yaml:"defaultSubcategory"`
	Categories         []CategoryConfig `yaml:"categories"`
}

ErrorCategoriesConfig is the top-level config for failure classification.
