Documentation ¶
Overview ¶
Package workflowexecution provides failure analysis for Tekton PipelineRun failures.
This file implements BR-WE-012 (Exponential Backoff) by detecting and categorizing failures to determine appropriate retry strategies.
Failure Analysis:
- Pre-Execution Failures: Configuration errors, permission issues, image pull failures → Apply exponential backoff (DD-WE-004)
- Execution Failures: Task-level failures during PipelineRun execution → Report to user, no automatic retry
Failure Categories:
- OOMKilled: Container out of memory
- DeadlineExceeded: Timeout reached
- Forbidden: Permission denied
- ImagePullBackOff: Container image not available
- ConfigurationError: Invalid workflow configuration
- TaskFailed: Workflow task failed during execution
See: docs/architecture/decisions/DD-WE-004-exponential-backoff.md
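The split between pre-execution and execution failures described above could be sketched as follows. This is a minimal illustration, not the package's implementation: `FailureReason`, the `Reason…` constants, and `shouldBackoff` are hypothetical names, and treating OOMKilled and DeadlineExceeded as execution failures is an assumption (the overview only names configuration errors, permission issues, and image pull failures as pre-execution).

```go
package main

import "fmt"

// FailureReason is a hypothetical stand-in for the package's failure reason enum.
type FailureReason string

const (
	ReasonOOMKilled          FailureReason = "OOMKilled"
	ReasonDeadlineExceeded   FailureReason = "DeadlineExceeded"
	ReasonForbidden          FailureReason = "Forbidden"
	ReasonImagePullBackOff   FailureReason = "ImagePullBackOff"
	ReasonConfigurationError FailureReason = "ConfigurationError"
	ReasonTaskFailed         FailureReason = "TaskFailed"
)

// shouldBackoff reports whether a failure is treated as a pre-execution
// failure, which is the only case where exponential backoff applies.
// Every other reason is assumed to be an execution failure that is
// reported to the user without an automatic retry.
func shouldBackoff(r FailureReason) bool {
	switch r {
	case ReasonForbidden, ReasonImagePullBackOff, ReasonConfigurationError:
		return true
	default:
		return false
	}
}

func main() {
	for _, r := range []FailureReason{ReasonImagePullBackOff, ReasonTaskFailed} {
		fmt.Printf("%s: backoff=%v\n", r, shouldBackoff(r))
	}
}
```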
Package workflowexecution provides the WorkflowExecution CRD controller.
Business Purpose (BR-WE-003): WorkflowExecution orchestrates Tekton PipelineRuns for workflow execution, providing resource locking, exponential backoff, and comprehensive failure reporting.
Key Responsibilities:
- BR-WE-003: Monitor execution status and sync with PipelineRun
- BR-WE-005: Generate audit trail for execution lifecycle
- BR-WE-006: Expose Kubernetes Conditions for status tracking
- BR-WE-008: Emit Prometheus metrics for execution outcomes
- BR-WE-012: Apply exponential backoff for failed executions
Architecture:
- Pure Executor: Only executes workflows (routing handled by RemediationOrchestrator)
- Status Sync: Continuously syncs WFE status with PipelineRun status
- Failure Analysis: Detects Tekton task failures and reports detailed reasons
Design Decisions:
- DD-WE-001: Resource locking safety (prevents concurrent execution on same target)
- DD-WE-002: Dedicated execution namespace (isolates PipelineRuns)
- DD-WE-003: Deterministic lock names (enables resource lock persistence)
- DD-WE-004: Exponential backoff for pre-execution failures
See: docs/services/crd-controllers/03-workflowexecution/ for detailed documentation
Index ¶
- Constants
- type WorkflowExecutionReconciler
- func (r *WorkflowExecutionReconciler) BuildPipelineRunStatusSummary(ctx context.Context, pr *tektonv1.PipelineRun) *workflowexecutionv1alpha1.ExecutionStatusSummary
- func (r *WorkflowExecutionReconciler) CheckCooldownActive(ctx context.Context, targetResource, currentWFEKey string) (time.Duration, bool)
- func (r *WorkflowExecutionReconciler) ExtractFailureDetails(ctx context.Context, pr *tektonv1.PipelineRun, startTime *metav1.Time) *workflowexecutionv1alpha1.FailureDetails
- func (r *WorkflowExecutionReconciler) FindFailedTaskRun(ctx context.Context, pr *tektonv1.PipelineRun) (*tektonv1.TaskRun, int, error)
- func (r *WorkflowExecutionReconciler) FindWFEForOwnedResource(ctx context.Context, obj client.Object) []reconcile.Request
- func (r *WorkflowExecutionReconciler) GenerateNaturalLanguageSummary(wfe *workflowexecutionv1alpha1.WorkflowExecution, ...) string
- func (r *WorkflowExecutionReconciler) HandleAlreadyExists(ctx context.Context, wfe *workflowexecutionv1alpha1.WorkflowExecution, ...) (ctrl.Result, error)
- func (r *WorkflowExecutionReconciler) MarkCompleted(ctx context.Context, wfe *workflowexecutionv1alpha1.WorkflowExecution, ...) (ctrl.Result, error)
- func (r *WorkflowExecutionReconciler) MarkFailed(ctx context.Context, wfe *workflowexecutionv1alpha1.WorkflowExecution, ...) (ctrl.Result, error)
- func (r *WorkflowExecutionReconciler) MarkFailedWithReason(ctx context.Context, wfe *workflowexecutionv1alpha1.WorkflowExecution, ...) error
- func (r *WorkflowExecutionReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error)
- func (r *WorkflowExecutionReconciler) ReconcileDelete(ctx context.Context, wfe *workflowexecutionv1alpha1.WorkflowExecution) (ctrl.Result, error)
- func (r *WorkflowExecutionReconciler) ReconcileTerminal(ctx context.Context, wfe *workflowexecutionv1alpha1.WorkflowExecution) (ctrl.Result, error)
- func (r *WorkflowExecutionReconciler) SetupWithManager(mgr ctrl.Manager) error
- func (r *WorkflowExecutionReconciler) ValidateSpec(wfe *workflowexecutionv1alpha1.WorkflowExecution) error
Constants ¶
const (
	// FinalizerName is the finalizer for WorkflowExecution cleanup
	// Per finalizers-lifecycle.md: domain/resource-cleanup pattern
	FinalizerName = "workflowexecution.kubernaut.ai/workflowexecution-cleanup"
	// DefaultCooldownPeriod is the default time between workflow executions on same target
	DefaultCooldownPeriod = 5 * time.Minute
)
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type WorkflowExecutionReconciler ¶
type WorkflowExecutionReconciler struct {
client.Client
Scheme *runtime.Scheme
Recorder record.EventRecorder
// DD-STATUS-001: APIReader bypasses informer cache for direct API server reads.
// Used in reconcilePending to prevent race conditions from stale cache data:
// - Prevents duplicate audit events (cache lag between concurrent reconciles)
// - Ensures ExecutionRef is fresh for external deletion detection
APIReader client.Reader
// Metrics for observability (DD-005, DD-METRICS-001)
// Per DD-METRICS-001: Metrics MUST be dependency-injected, not global variables
// Initialized in main.go and injected via SetupWithManager()
Metrics *metrics.Metrics
// ========================================
// STATUS MANAGER (DD-PERF-001)
// 📋 Design Decision: DD-PERF-001 | ✅ Atomic Status Updates Pattern
// See: docs/architecture/decisions/DD-PERF-001-atomic-status-updates-mandate.md
// ========================================
//
// StatusManager manages atomic status updates to reduce K8s API calls
// Consolidates multiple status field updates into single atomic operations
//
// BENEFITS:
// - 50%+ API call reduction (2 updates → 1 atomic update)
// - Eliminates race conditions from sequential updates
// - Reduces etcd write load and watch events
//
// WIRED IN: cmd/workflowexecution/main.go
// USAGE: r.StatusManager.AtomicStatusUpdate(ctx, wfe, func() { ... })
StatusManager *status.Manager
// ExecutionNamespace is where PipelineRuns are created (DD-WE-002)
// Default: "kubernaut-workflows"
ExecutionNamespace string
// CooldownPeriod prevents redundant sequential workflows (DD-WE-001)
// Default: 5 minutes
CooldownPeriod time.Duration
// AuditStore for writing audit events (BR-WE-005, ADR-032)
// Uses pkg/audit buffered store via Data Storage Service
// Optional: nil disables audit (graceful degradation)
AuditStore audit.AuditStore
// PhaseManager manages phase state machine logic (P0: Phase State Machine)
// Per CONTROLLER_REFACTORING_PATTERN_LIBRARY.md §1
// Provides validated phase transitions and terminal state checking
PhaseManager *wephase.Manager
// AuditManager manages audit event emission (P3: Audit Manager)
// Per CONTROLLER_REFACTORING_PATTERN_LIBRARY.md §7
// Provides typed audit methods for better testability
AuditManager *weaudit.Manager
// ExecutorRegistry dispatches to the correct execution backend (BR-WE-014)
// Maps execution engine names ("tekton", "job") to Executor implementations.
// When nil, falls back to inline Tekton-only code path.
ExecutorRegistry *weexecutor.Registry
// DD-WE-006: WorkflowQuerier fetches workflow dependencies from DS on demand.
// Optional: nil disables dependency injection (workflows run without mounted deps).
WorkflowQuerier weclient.WorkflowQuerier
// DD-WE-006: DependencyValidator validates that declared dependencies exist
// with non-empty data in the execution namespace (defense in depth).
// Optional: nil disables execution-time validation.
DependencyValidator dsvalidation.DependencyValidator
}
WorkflowExecutionReconciler reconciles a WorkflowExecution object
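The StatusManager comment in the struct describes consolidating several status field changes into one write. A rough sketch of that idea, under stated assumptions: `statusManager`, `countingWriter`, and `object` are hypothetical in-memory stand-ins, and the real `AtomicStatusUpdate` takes a context and may refetch the object or retry on conflict, none of which is modeled here.

```go
package main

import "fmt"

// object is a hypothetical stand-in for a WorkflowExecution with a status block.
type object struct {
	Phase   string
	Message string
}

// countingWriter is a hypothetical status writer that only counts writes.
type countingWriter struct {
	writes int
}

func (w *countingWriter) UpdateStatus(o *object) error {
	w.writes++
	return nil
}

// statusManager applies all mutations first, then issues a single write.
type statusManager struct {
	w *countingWriter
}

// AtomicStatusUpdate runs mutate (which may change any number of status
// fields) and persists the result with exactly one write.
func (m *statusManager) AtomicStatusUpdate(o *object, mutate func()) error {
	mutate()
	return m.w.UpdateStatus(o)
}

func main() {
	w := &countingWriter{}
	m := &statusManager{w: w}
	o := &object{}

	err := m.AtomicStatusUpdate(o, func() {
		o.Phase = "Failed"
		o.Message = "image pull failed"
	})
	fmt.Println(err, w.writes, o.Phase)
}
```

Two field changes result in one write, which is the reduction the struct comment refers to.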
func (*WorkflowExecutionReconciler) BuildPipelineRunStatusSummary ¶
func (r *WorkflowExecutionReconciler) BuildPipelineRunStatusSummary(ctx context.Context, pr *tektonv1.PipelineRun) *workflowexecutionv1alpha1.ExecutionStatusSummary
BuildPipelineRunStatusSummary creates a lightweight status summary from a PipelineRun. It provides visibility into task progress during execution (v3.2).
func (*WorkflowExecutionReconciler) CheckCooldownActive ¶
func (r *WorkflowExecutionReconciler) CheckCooldownActive(ctx context.Context, targetResource, currentWFEKey string) (time.Duration, bool)
CheckCooldownActive checks whether a cooldown is active for a target resource (BR-WE-009: Cooldown Period is Configurable). It returns (remaining duration, is active). currentWFEKey has the format "namespace/name" and uniquely identifies the current WFE.
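The (remaining duration, is active) return shape can be illustrated with a small calculation. This is a sketch only: `cooldownRemaining` and `defaultCooldownPeriod` are hypothetical names, and the lookup of the most recent execution on the target resource, which the real method must perform, is left out.

```go
package main

import (
	"fmt"
	"time"
)

// defaultCooldownPeriod mirrors the documented DefaultCooldownPeriod value.
const defaultCooldownPeriod = 5 * time.Minute

// cooldownRemaining returns how much of the cooldown is left and whether
// it is still active, given the time elapsed since the previous execution
// on the same target finished.
func cooldownRemaining(elapsed, period time.Duration) (time.Duration, bool) {
	if period <= 0 {
		return 0, false // cooldown disabled
	}
	remaining := period - elapsed
	if remaining <= 0 {
		return 0, false // cooldown already expired
	}
	return remaining, true
}

func main() {
	now := time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)
	completedAt := now.Add(-2 * time.Minute)

	remaining, active := cooldownRemaining(now.Sub(completedAt), defaultCooldownPeriod)
	fmt.Println(remaining, active)
}
```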
func (*WorkflowExecutionReconciler) ExtractFailureDetails ¶
func (r *WorkflowExecutionReconciler) ExtractFailureDetails(ctx context.Context, pr *tektonv1.PipelineRun, startTime *metav1.Time) *workflowexecutionv1alpha1.FailureDetails
ExtractFailureDetails extracts structured failure information from a PipelineRun. Day 7: Now includes TaskRun-specific fields (FailedTaskName, FailedTaskIndex, ExitCode). Day 6 Extension (BR-WE-012): Includes WasExecutionFailure for backoff decisions. Maps Tekton failure reasons to our FailureReason enum.
func (*WorkflowExecutionReconciler) FindFailedTaskRun ¶
func (r *WorkflowExecutionReconciler) FindFailedTaskRun(ctx context.Context, pr *tektonv1.PipelineRun) (*tektonv1.TaskRun, int, error)
FindFailedTaskRun finds the first failed TaskRun in a PipelineRun's ChildReferences. Returns the TaskRun, its index in ChildReferences, and any error. Returns (nil, -1, nil) if no failed TaskRun is found.
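The "first failure, or (nil, -1)" contract can be sketched with plain types. `taskRunStatus` and `findFirstFailed` are hypothetical; the real method works on Tekton ChildReferences, has to resolve each reference to a TaskRun through the API, and also returns an error.

```go
package main

import "fmt"

// taskRunStatus is a hypothetical, already-resolved view of one child TaskRun.
type taskRunStatus struct {
	Name   string
	Failed bool
}

// findFirstFailed returns the first failed entry and its index,
// or (nil, -1) when no entry has failed.
func findFirstFailed(children []taskRunStatus) (*taskRunStatus, int) {
	for i := range children {
		if children[i].Failed {
			return &children[i], i
		}
	}
	return nil, -1
}

func main() {
	children := []taskRunStatus{
		{Name: "fetch"},
		{Name: "build", Failed: true},
		{Name: "deploy", Failed: true},
	}
	if tr, idx := findFirstFailed(children); tr != nil {
		fmt.Printf("first failure: %s at index %d\n", tr.Name, idx)
	}
}
```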
func (*WorkflowExecutionReconciler) FindWFEForOwnedResource ¶ added in v1.1.0
func (r *WorkflowExecutionReconciler) FindWFEForOwnedResource(ctx context.Context, obj client.Object) []reconcile.Request
FindWFEForOwnedResource maps owned resource events (PipelineRun, Job) to WorkflowExecution reconcile requests. Both executors label their resources with kubernaut.ai/workflow-execution and kubernaut.ai/source-namespace.
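A label-based mapping of this kind might look like the sketch below. The two label keys come from the description above; `request` is a hypothetical stand-in for reconcile.Request, and returning no requests when a label is missing is an assumption about the behavior.

```go
package main

import "fmt"

const (
	labelWorkflowExecution = "kubernaut.ai/workflow-execution"
	labelSourceNamespace   = "kubernaut.ai/source-namespace"
)

// request is a hypothetical stand-in for reconcile.Request.
type request struct {
	Namespace string
	Name      string
}

// requestsForLabels maps the labels of an owned resource to the
// WorkflowExecution that should be reconciled. It returns nil when
// either label is missing or empty (assumed behavior).
func requestsForLabels(labels map[string]string) []request {
	name := labels[labelWorkflowExecution]
	ns := labels[labelSourceNamespace]
	if name == "" || ns == "" {
		return nil
	}
	return []request{{Namespace: ns, Name: name}}
}

func main() {
	labels := map[string]string{
		labelWorkflowExecution: "wfe-restart-api",
		labelSourceNamespace:   "team-a",
	}
	fmt.Println(requestsForLabels(labels))
}
```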
func (*WorkflowExecutionReconciler) GenerateNaturalLanguageSummary ¶
func (r *WorkflowExecutionReconciler) GenerateNaturalLanguageSummary(wfe *workflowexecutionv1alpha1.WorkflowExecution, details *workflowexecutionv1alpha1.FailureDetails) string
GenerateNaturalLanguageSummary creates a human/LLM-readable failure description for failure reporting and user notifications. Day 9 (v3.5): Handles nil FailureDetails gracefully per Q4 decision.
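Graceful handling of nil details might be sketched as follows. `failureDetails` is a hypothetical reduction of the real FailureDetails type (only FailedTaskName is a field name taken from this page), and the message wording is invented for illustration.

```go
package main

import "fmt"

// failureDetails is a hypothetical, reduced version of FailureDetails.
type failureDetails struct {
	Reason         string
	Message        string
	FailedTaskName string
}

// summarize builds a readable failure description and falls back to a
// generic sentence when no details are available.
func summarize(workflowName string, d *failureDetails) string {
	if d == nil {
		return fmt.Sprintf("Workflow %q failed; no failure details are available.", workflowName)
	}
	if d.FailedTaskName == "" {
		return fmt.Sprintf("Workflow %q failed (%s): %s", workflowName, d.Reason, d.Message)
	}
	return fmt.Sprintf("Workflow %q failed in task %q (%s): %s",
		workflowName, d.FailedTaskName, d.Reason, d.Message)
}

func main() {
	fmt.Println(summarize("restart-pod", nil))
	fmt.Println(summarize("restart-pod", &failureDetails{
		Reason:         "OOMKilled",
		Message:        "container exceeded memory limit",
		FailedTaskName: "apply",
	}))
}
```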
func (*WorkflowExecutionReconciler) HandleAlreadyExists ¶
func (r *WorkflowExecutionReconciler) HandleAlreadyExists(ctx context.Context, wfe *workflowexecutionv1alpha1.WorkflowExecution, resourceName string, err error) (ctrl.Result, error)
HandleAlreadyExists handles the race condition where the PipelineRun already exists. DD-WE-003: Layer 2 - Execution-time collision handling (not routing). V1.0: Fails WFE if race condition detected (RO should have prevented this).
func (*WorkflowExecutionReconciler) MarkCompleted ¶
func (r *WorkflowExecutionReconciler) MarkCompleted(ctx context.Context, wfe *workflowexecutionv1alpha1.WorkflowExecution, pr *tektonv1.PipelineRun, summary ...*workflowexecutionv1alpha1.ExecutionStatusSummary) (ctrl.Result, error)
MarkCompleted transitions WFE to Completed phase. Calculates Duration from StartTime to CompletionTime (v3.2). Day 6 Extension (BR-WE-012): Resets ConsecutiveFailures counter. Records metrics per BR-WE-008 (Day 7).
func (*WorkflowExecutionReconciler) MarkFailed ¶
func (r *WorkflowExecutionReconciler) MarkFailed(ctx context.Context, wfe *workflowexecutionv1alpha1.WorkflowExecution, pr *tektonv1.PipelineRun, summary ...*workflowexecutionv1alpha1.ExecutionStatusSummary) (ctrl.Result, error)
MarkFailed transitions WFE to Failed phase with FailureDetails. Extracts failure information from PipelineRun (v3.2). Day 6 Extension (BR-WE-012): Handles exponential backoff for pre-execution failures. Records metrics per BR-WE-008 (Day 7).
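Exponential backoff driven by a consecutive-failure counter is commonly computed by doubling a base delay up to a cap. The sketch below shows that generic scheme; `backoffDelay`, the 30-second base, and the 10-minute cap are assumptions for illustration, since the actual parameters are defined in DD-WE-004 and are not listed on this page.

```go
package main

import (
	"fmt"
	"time"
)

// Example parameters only; the real values are not documented here.
const (
	exampleBase = 30 * time.Second
	exampleMax  = 10 * time.Minute
)

// backoffDelay returns base * 2^(failures-1), capped at max.
// It returns 0 when there have been no consecutive failures.
func backoffDelay(consecutiveFailures int, base, max time.Duration) time.Duration {
	if consecutiveFailures <= 0 {
		return 0
	}
	delay := base
	for i := 1; i < consecutiveFailures; i++ {
		if delay >= max/2 {
			return max // doubling again would reach or exceed the cap
		}
		delay *= 2
	}
	if delay > max {
		return max
	}
	return delay
}

func main() {
	for failures := 0; failures <= 6; failures++ {
		fmt.Printf("failures=%d delay=%s\n", failures, backoffDelay(failures, exampleBase, exampleMax))
	}
}
```

Resetting the counter on success, as MarkCompleted is documented to do, brings the delay back to zero.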
func (*WorkflowExecutionReconciler) MarkFailedWithReason ¶
func (r *WorkflowExecutionReconciler) MarkFailedWithReason(ctx context.Context, wfe *workflowexecutionv1alpha1.WorkflowExecution, reason, message string) error
MarkFailedWithReason handles pre-execution failures. Used for validation errors and configuration errors that occur before PipelineRun creation.
func (*WorkflowExecutionReconciler) Reconcile ¶
func (r *WorkflowExecutionReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error)
Reconcile handles WorkflowExecution reconciliation. Phase-based reconciliation per implementation plan.
func (*WorkflowExecutionReconciler) ReconcileDelete ¶
func (r *WorkflowExecutionReconciler) ReconcileDelete(ctx context.Context, wfe *workflowexecutionv1alpha1.WorkflowExecution) (ctrl.Result, error)
ReconcileDelete handles deletion with finalizer. DD-WE-003: Use deterministic PipelineRun name. finalizers-lifecycle.md: Event emission.
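A deterministic name lets the controller recompute the PipelineRun name at deletion time instead of storing it. The naming scheme used by DD-WE-003 is not shown on this page; the sketch below assumes a truncated SHA-256 hash of the target resource string with a hypothetical `wfe-` prefix, purely to illustrate the property.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// deterministicName derives a stable, fixed-length name from a target
// resource identifier. The prefix and the 16-character truncation are
// illustrative choices, not the project's actual scheme.
func deterministicName(targetResource string) string {
	sum := sha256.Sum256([]byte(targetResource))
	return "wfe-" + hex.EncodeToString(sum[:])[:16]
}

func main() {
	// The same target always yields the same name, so a second create
	// attempt for that target collides with the existing resource.
	fmt.Println(deterministicName("default/deployment/api"))
	fmt.Println(deterministicName("default/deployment/api"))
	fmt.Println(deterministicName("default/deployment/web"))
}
```

Because two executions against the same target compute the same name, the second create attempt fails with an already-exists error, which is the collision HandleAlreadyExists deals with.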
func (*WorkflowExecutionReconciler) ReconcileTerminal ¶
func (r *WorkflowExecutionReconciler) ReconcileTerminal(ctx context.Context, wfe *workflowexecutionv1alpha1.WorkflowExecution) (ctrl.Result, error)
ReconcileTerminal handles Completed/Failed phases. Day 6: Cooldown enforcement and cleanup. DD-WE-003: Lock Persistence (Deterministic Name).
func (*WorkflowExecutionReconciler) SetupWithManager ¶
func (r *WorkflowExecutionReconciler) SetupWithManager(mgr ctrl.Manager) error
SetupWithManager sets up the controller with the Manager.
func (*WorkflowExecutionReconciler) ValidateSpec ¶
func (r *WorkflowExecutionReconciler) ValidateSpec(wfe *workflowexecutionv1alpha1.WorkflowExecution) error
ValidateSpec validates the WorkflowExecution spec. Returns an error if validation fails (ConfigurationError reason).
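Validation that surfaces a ConfigurationError reason could be sketched as below. The `spec` fields (`ContainerImage`, `TargetResource`) and the sentinel `errConfiguration` are hypothetical; the real WorkflowExecution spec and the checks performed on it are not listed on this page.

```go
package main

import (
	"errors"
	"fmt"
)

// errConfiguration is a hypothetical sentinel standing in for the
// ConfigurationError failure reason.
var errConfiguration = errors.New("ConfigurationError")

// spec is a hypothetical, reduced WorkflowExecution spec.
type spec struct {
	ContainerImage string
	TargetResource string
}

// validateSpec returns a wrapped errConfiguration describing the first
// missing required field, or nil when the spec is acceptable.
func validateSpec(s spec) error {
	if s.ContainerImage == "" {
		return fmt.Errorf("%w: container image is required", errConfiguration)
	}
	if s.TargetResource == "" {
		return fmt.Errorf("%w: target resource is required", errConfiguration)
	}
	return nil
}

func main() {
	err := validateSpec(spec{ContainerImage: "registry.example.com/workflows/restart:v1"})
	fmt.Println(err, errors.Is(err, errConfiguration))
}
```

A caller could pass the resulting reason and message to MarkFailedWithReason, which is documented as the path for failures before PipelineRun creation.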