rca-operator

Module version: v0.0.16 (Latest)
Published: Apr 30, 2026 License: MIT

RCA Operator for Kubernetes

Cluster-native incident detection, durable incident state, CRD-driven correlation rules, notifications, and dashboarding

rca-operator.tech

What RCA Operator Does

RCA Operator is a Kubernetes-native incident detection operator that:

  • collects failure signals from native Kubernetes APIs (pods, events, nodes, deployments)
  • evaluates CRD-driven correlation rules (RCACorrelationRule) to detect multi-signal incidents
  • persists durable incident state in IncidentReport CRDs
  • manages incident lifecycle: Detecting → Active → Resolved
  • notifies humans via Slack and PagerDuty from incident lifecycle state
  • serves a built-in dashboard (light/dark theme) backed only by IncidentReport and RCAAgent CRDs
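
The lifecycle in the list above can be pictured as a small state machine. The following is only a sketch: the `Phase` type and `CanTransition` function are hypothetical names, and it assumes strictly forward movement through the three phases, which the operator's real transition rules may not match.

```go
package main

import "fmt"

// Phase is a hypothetical stand-in for the IncidentReport lifecycle phase.
type Phase string

const (
	Detecting Phase = "Detecting"
	Active    Phase = "Active"
	Resolved  Phase = "Resolved"
)

// next lists the transitions this sketch allows.
// Assumption: phases only move forward, one step at a time.
var next = map[Phase][]Phase{
	Detecting: {Active},
	Active:    {Resolved},
	Resolved:  {},
}

// CanTransition reports whether moving from one phase to another is allowed.
func CanTransition(from, to Phase) bool {
	for _, p := range next[from] {
		if p == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(Detecting, Active)) // true
	fmt.Println(CanTransition(Resolved, Active))  // false
}
```

Keeping the allowed transitions in a table makes it easy to adjust the sketch, for example if resolved incidents are allowed to reopen.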

The operator avoids AI systems, external databases, and log-scraping dependencies so it stays easy to run and reason about in-cluster.

Architecture

More detail lives in Architecture and Phase 2 Release Notes.

Current Feature Set

| Feature | Description |
|---|---|
| Native Kubernetes signal collection | Reads pod, event, node, and workload state from Kubernetes (Deployments, StatefulSets, DaemonSets, Jobs, CronJobs) |
| CRD-driven correlation rules | RCACorrelationRule CRDs define multi-signal rules — no Go code changes needed |
| Automatic rule detection | Mines the correlation buffer for recurring signal patterns and auto-creates RCACorrelationRule CRDs |
| Durable incident records | Deduplicates repeated signals into one IncidentReport per fingerprint |
| Incident lifecycle | Tracks Detecting, Active, and Resolved phases |
| Notifications | Sends Slack and PagerDuty notifications and emits Kubernetes events |
| Dashboard | Built-in incident dashboard with light/dark theme toggle, workload + service topology views, and an inline Jaeger trace detail modal (no Jaeger UI hop) |
| Retention | Automatically prunes old resolved incidents |
| OpenTelemetry | Optional OTLP trace export for the operator's own spans |
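
To illustrate what "one IncidentReport per fingerprint" could mean in practice, here is a minimal sketch of fingerprint-based deduplication. The `Signal` fields, the choice of SHA-256, and the `Fingerprint` and `Dedup` helpers are all assumptions made for illustration; the operator's actual fingerprint inputs are not described in this README.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// Signal is a hypothetical, simplified failure signal.
type Signal struct {
	Namespace string
	Kind      string
	Name      string
	Type      string // e.g. "CrashLoopBackOff"
}

// Fingerprint hashes the fields this sketch assumes identify an incident.
func Fingerprint(s Signal) string {
	key := strings.Join([]string{s.Namespace, s.Kind, s.Name, s.Type}, "/")
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:8]) // short prefix, for readability only
}

// Dedup counts how many signals fall under each fingerprint, so repeated
// signals collapse into a single entry.
func Dedup(signals []Signal) map[string]int {
	counts := make(map[string]int)
	for _, s := range signals {
		counts[Fingerprint(s)]++
	}
	return counts
}

func main() {
	crash := Signal{Namespace: "shop", Kind: "Pod", Name: "cart-0", Type: "CrashLoopBackOff"}
	oom := Signal{Namespace: "shop", Kind: "Pod", Name: "cart-0", Type: "OOMKilled"}
	counts := Dedup([]Signal{crash, crash, crash, oom})
	fmt.Println(len(counts))                // 2
	fmt.Println(counts[Fingerprint(crash)]) // 3
}
```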

Quick Install

# Add repositories (one-time)
helm repo add rca-operator  https://gaurangkudale.github.io/rca-operator.github.io/charts
helm repo add opentelemetry https://open-telemetry.github.io/opentelemetry-helm-charts
helm repo add jaegertracing  https://jaegertracing.github.io/helm-charts
helm repo update

# Install
helm upgrade --install rca-operator rca-operator/rca-operator \
  --namespace rca-system --create-namespace \
  --wait --timeout 10m

--wait is required — the OpenTelemetryCollector and Instrumentation CRs are applied as post-install hooks after the otel-operator webhook is confirmed Ready.

The default chart values are the full profile. Source installs can also use helm/values-minimal.yaml or helm/values-external-observability.yaml; see Installation.

kubectl

kubectl apply -f https://github.com/gaurangkudale/RCA-Operator/releases/latest/download/install.yaml
kubectl apply -f config/samples/rca_v1alpha1_rcaagent.yaml

Documentation

| Section | Description |
|---|---|
| Prerequisites | Cluster and tooling requirements |
| Installation | Helm and kubectl installation |
| Quick Start | Deploy your first agent in minutes |
| Architecture | System design and data flow |
| Phase 2 Release Notes | What's new in the Phase 2 release |
| Production Guide | Production sizing, security, RBAC, network policy, retention, and cardinality guidance |
| Phase 1 Architecture | Historical Kubernetes-native foundation design |
| RCAAgent CRD Reference | RCAAgent schema and examples |
| IncidentReport CRD Reference | IncidentReport schema and fields |
| RCACorrelationRule CRD Reference | Correlation rule schema and examples |
| Auto-Detection | Automatic correlation rule detection |
| OTLP Ingest | In-operator OTLP/HTTP receiver for traces and logs |
| Topology Graph | Incident topology graph (K8s + trace + Jaeger enrichment) |
| Dashboard | Dashboard data model and access patterns |
| Metrics Reference | Prometheus metrics exposed by the operator |
| RBAC Reference | Permissions used by the operator |
| Local Development | Run locally against a cluster |
| Testing Guide | Unit, envtest, and e2e coverage |
| Helm Reference | Override flags, from-source install, upgrade, troubleshooting |
| Helm Upgrade Guide | CRD upgrade and migration steps |

Custom Resources

RCAAgent

The main configuration resource. One agent can watch multiple namespaces and optionally configure notifications and retention.

kubectl get rcaagent -A
kubectl describe rcaagent <name> -n <namespace>
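
As a rough picture of what a retention setting implies, the sketch below drops resolved incidents older than a retention window and keeps everything else. The `Incident` type, the `Prune` function, and the 24-hour window are hypothetical; the real RCAAgent retention fields are documented in the RCAAgent CRD Reference.

```go
package main

import (
	"fmt"
	"time"
)

// Incident is a hypothetical, trimmed-down view of an IncidentReport.
type Incident struct {
	Name       string
	Phase      string
	ResolvedAt time.Time // zero while the incident is unresolved
}

// Prune returns the incidents to keep: everything unresolved, plus resolved
// incidents whose resolution is younger than the retention window.
func Prune(all []Incident, retention time.Duration, now time.Time) []Incident {
	kept := make([]Incident, 0, len(all))
	for _, in := range all {
		if in.Phase == "Resolved" && now.Sub(in.ResolvedAt) > retention {
			continue // old resolved incident: drop it
		}
		kept = append(kept, in)
	}
	return kept
}

// demo runs Prune on a fixed sample and returns the names that survive.
func demo() []string {
	now := time.Date(2026, 4, 30, 12, 0, 0, 0, time.UTC)
	all := []Incident{
		{Name: "a", Phase: "Active"},
		{Name: "b", Phase: "Resolved", ResolvedAt: now.Add(-48 * time.Hour)},
		{Name: "c", Phase: "Resolved", ResolvedAt: now.Add(-1 * time.Hour)},
	}
	var names []string
	for _, in := range Prune(all, 24*time.Hour, now) {
		names = append(names, in.Name)
	}
	return names
}

func main() {
	fmt.Println(demo()) // [a c]
}
```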

IncidentReport

Created automatically for detected incidents. Each report carries the incident fingerprint, lifecycle phase, severity, affected resources, and timeline.

kubectl get incidentreport -A
kubectl describe incidentreport <name> -n <namespace>

RCACorrelationRule

Cluster-scoped rules that define multi-signal correlation logic. Rules are loaded dynamically — no operator restart needed when rules change.

kubectl get rcacorrelationrules
kubectl describe rcacorrelationrule <name>

Four default rules are installed with the Helm chart (defaultRules.enabled: true):

| Rule | Trigger | Condition | Severity |
|---|---|---|---|
| node-plus-eviction | NodeNotReady | PodEvicted on same node | P1 Critical |
| crashloop-plus-oom | CrashLoopBackOff | OOMKilled on same pod | P2 High |
| crashloop-plus-deploy | CrashLoopBackOff | StalledRollout in same namespace | P2 High |
| imagepull-no-history | ImagePullBackOff | No PodHealthy on same pod | P2 High |
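
The trigger and condition columns above can be read as "a trigger signal, plus a second signal that is present (or absent) within some shared scope". The sketch below models that idea in memory. The `Rule` struct, its `Absent` flag, and the `Scope` values are hypothetical and are not the RCACorrelationRule schema; time windows are left out for brevity.

```go
package main

import "fmt"

// Signal is a hypothetical, simplified entry in the correlation buffer.
type Signal struct {
	Type      string
	Namespace string
	Pod       string
	Node      string
}

// Scope names the field the trigger and condition signals must share.
type Scope string

const (
	SamePod       Scope = "pod"
	SameNode      Scope = "node"
	SameNamespace Scope = "namespace"
)

// Rule is an illustrative in-memory rule, not the real CRD schema.
type Rule struct {
	Name      string
	Trigger   string
	Condition string
	Scope     Scope
	Absent    bool // true when the condition signal must NOT be present
}

func sameScope(a, b Signal, s Scope) bool {
	switch s {
	case SamePod:
		return a.Namespace == b.Namespace && a.Pod == b.Pod
	case SameNode:
		return a.Node != "" && a.Node == b.Node
	case SameNamespace:
		return a.Namespace == b.Namespace
	}
	return false
}

// Matches reports whether any trigger signal in buf satisfies the rule.
func (r Rule) Matches(buf []Signal) bool {
	for _, t := range buf {
		if t.Type != r.Trigger {
			continue
		}
		found := false
		for _, c := range buf {
			if c.Type == r.Condition && sameScope(t, c, r.Scope) {
				found = true
				break
			}
		}
		if found != r.Absent { // present when required, or absent when required
			return true
		}
	}
	return false
}

func main() {
	crashOOM := Rule{Name: "crashloop-plus-oom", Trigger: "CrashLoopBackOff", Condition: "OOMKilled", Scope: SamePod}
	buf := []Signal{
		{Type: "CrashLoopBackOff", Namespace: "shop", Pod: "cart-0"},
		{Type: "OOMKilled", Namespace: "shop", Pod: "cart-0"},
	}
	fmt.Println(crashOOM.Matches(buf)) // true
}
```

A rule such as imagepull-no-history would set `Absent: true` in this sketch, so it fires only when no PodHealthy signal exists for the same pod.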

When auto-detection is enabled (--enable-autodetect), the operator also creates rules automatically from observed signal patterns. Auto-generated rules use a fixed priority of 30 (below user rules) and are labeled rca.rca-operator.tech/auto-generated: "true". See Auto-Detection for details.
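
Because auto-generated rules carry a fixed label, tooling can separate them from hand-written rules by inspecting metadata alone. The sketch below assumes a minimal, hypothetical `Rule` view with just a name and labels; only the label key and value come from the text above.

```go
package main

import "fmt"

// AutoGeneratedLabel is the label key quoted in the text above.
const AutoGeneratedLabel = "rca.rca-operator.tech/auto-generated"

// Rule is a hypothetical, minimal view of a rule's metadata.
type Rule struct {
	Name   string
	Labels map[string]string
}

// IsAutoGenerated reports whether a rule carries the auto-generated label.
func IsAutoGenerated(r Rule) bool {
	return r.Labels[AutoGeneratedLabel] == "true"
}

// Split partitions rule names into user-written and auto-generated groups.
func Split(rules []Rule) (user, auto []string) {
	for _, r := range rules {
		if IsAutoGenerated(r) {
			auto = append(auto, r.Name)
		} else {
			user = append(user, r.Name)
		}
	}
	return user, auto
}

func main() {
	rules := []Rule{
		{Name: "crashloop-plus-oom"},
		{Name: "auto-example", Labels: map[string]string{AutoGeneratedLabel: "true"}},
	}
	user, auto := Split(rules)
	fmt.Println(user, auto) // [crashloop-plus-oom] [auto-example]
}
```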

Contributing

Contributions are welcome — bug reports, docs, tests, correlation rules, or features.

  1. Read CONTRIBUTING.md and CODE_OF_CONDUCT.md.
  2. Find a good first issue on the issue tracker, or open a new one to discuss larger changes before coding.
  3. make lint && make test && make build must pass locally.
  4. Open a pull request — the PR template lists the merge checklist.

Community & Support

License

Licensed under the MIT License. See LICENSE.

Directories

| Path | Synopsis |
|---|---|
| api/v1alpha1 | Package v1alpha1 contains API Schema definitions for the rca v1alpha1 API group. |
| internal/correlator/graph | Package graph builds the incident topology graph — a compact JSON document linking the affected Kubernetes resources to the services / spans pulled from the originating distributed trace. |
| internal/jaeger | Package jaeger provides a minimal HTTP client for the Jaeger Query API used by the incident graph builder to resolve trace-id → service-call topology. |
| internal/metrics | Package metrics holds the Prometheus collectors for the RCA Operator's incident lifecycle. |
| internal/otel | Package otel provides OpenTelemetry setup and span helpers for the RCA Operator. |
| internal/otelingest | Package otelingest is an in-operator OTLP/HTTP receiver that turns inbound spans and logs (fanned out from the cluster-wide OTel Collector DaemonSet) into watcher.CorrelatorEvent signals for the correlation engine. |
| internal/reporter | Package reporter handles IncidentReport CR creation, patching, and resolution. |
| internal/rulengine | Package rulengine provides a generic, CRD-driven rule engine that loads RCACorrelationRule resources dynamically and evaluates them at runtime. |
| internal/signals | Package signals implements the explicit signal processing pipeline: Normalize → Enrich → Deduplicate. |
| internal/webhook | Package webhook provides validating and defaulting admission webhooks for RCA Operator CRDs. |
| test | |