configuration-anomaly-detection

module

Version: v0.0.0-...-a5d2bf9. Published: May 29, 2026. License: Apache-2.0

README

Configuration Anomaly Detection

About

Configuration Anomaly Detection (CAD) is responsible for reducing manual SRE effort by pre-investigating alerts, detecting cluster anomalies and sending relevant communications to the cluster owner.

Overview

CAD consists of:

a tekton deployment including a custom tekton interceptor
the cadctl command line tool implementing alert remediations and pre-investigations

Workflows

CAD can be triggered in two ways: by a PagerDuty webhook or by a manual run.

PagerDuty Webhook

PagerDuty Webhooks are used to trigger Configuration-Anomaly-Detection when a PagerDuty incident is created
The webhook routes to a Tekton EventListener
Received webhooks are filtered by a Tekton Interceptor that uses the payload to evaluate whether the alert has an implemented handler function in cadctl or not, and validates the webhook against the X-PagerDuty-Signature header. If there is no handler implemented, the alert is directly forwarded to a human SRE.
If cadctl implements a handler for the received payload/alert, a Tekton PipelineRun is started.
The pipeline runs cadctl which determines the handler function by itself based on the payload.
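The signature check performed by the interceptor can be pictured with the following minimal sketch. It is not CAD's actual interceptor code: the function names `sign` and `verifySignature` are hypothetical, and the sketch assumes the header carries one or more comma-separated `v1=<hex digest>` entries, each an HMAC-SHA256 of the raw request body keyed with the webhook secret.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// sign returns the signature for body in the assumed "v1=<hex digest>" form.
// Hypothetical helper, used here for verification and for the demo in main.
func sign(secret string, body []byte) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(body)
	return "v1=" + hex.EncodeToString(mac.Sum(nil))
}

// verifySignature reports whether any comma-separated entry of the
// X-PagerDuty-Signature header value matches the expected signature.
// hmac.Equal compares in constant time to avoid leaking timing information.
func verifySignature(secret string, body []byte, header string) bool {
	expected := []byte(sign(secret, body))
	for _, candidate := range strings.Split(header, ",") {
		if hmac.Equal([]byte(strings.TrimSpace(candidate)), expected) {
			return true
		}
	}
	return false
}

func main() {
	body := []byte(`{"event":{"event_type":"incident.triggered"}}`)
	// A header with one stale entry and one valid entry.
	header := "v1=deadbeef," + sign("example-secret", body)

	fmt.Println(verifySignature("example-secret", body, header)) // true
	fmt.Println(verifySignature("other-secret", body, header))   // false
}
```

The signature must be computed over the raw request body, so a real handler would read the bytes before any JSON decoding.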

CAD Overview

Manual run

Invoke the cadctl command via the run subcommand; this uses your locally configured credentials:

cadctl run -c <CLUSTER_ID> -i <INVESTIGATION>

Alternatively, invoke a manual investigation via osdctl cluster cad run --cluster <CLUSTER_ID>, which uses the hosted CAD to run your investigation. More information is available in this document.

Contributing

Building

For build targets, see make help.

Adding a new investigation

CAD investigations are triggered by PagerDuty webhooks. Currently, CAD supports the following two formats of webhooks:

WebhookV3
EventOrchestrationWebhook

The required investigation is identified by CAD based on the incident and its payload. As PagerDuty itself does not provide finer granularity for webhooks than service-based, CAD filters out the alerts it should investigate. For more information, please refer to https://support.pagerduty.com/docs/webhooks.

To add a new alert investigation:

Run make bootstrap-investigation to generate boilerplate code in pkg/investigations. This creates the corresponding folder and .go file, and also appends the investigation to the availableInvestigations interface in registry.go.

The Run method of your investigation receives a ResourceBuilder. Use its With... methods to request the resources your investigation needs, then call Build() to get a Resources struct containing them. The builder automatically handles dependencies between resources (e.g., requesting an AWS client will also initialize the cluster object). For example:

func (c *Investigation) Run(rb investigation.ResourceBuilder) (investigation.InvestigationResult, error) {
    result := investigation.InvestigationResult{}
    // Request an AWS client. This will also ensure the Cluster resource is built.
    r, err := rb.WithAwsClient().Build()
    if err != nil {
        return result, err
    }
    // Now you can use r.AwsClient, r.Cluster, r.PdClient, etc.
    _ = r // placeholder so the snippet compiles until r is used
    // ...
    return result, nil
}

The returned Resources struct contains initialized clients and cluster objects. See Integrations for a full list of available resources.
Add test objects or scripts used to recreate the alert symptoms to the pkg/investigations/$INVESTIGATION_NAME/testing/ directory for future use. Be sure to clearly document the testing procedure under the Testing section of the investigation-specific README.md file
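The dependency handling described for the ResourceBuilder (requesting an AWS client also initializes the cluster object) can be illustrated with a small, self-contained sketch. This is not CAD's implementation: the types `Cluster`, `AwsClient`, `Resources` and `builder` below are simplified, hypothetical stand-ins, and only the `With...`/`Build()` shape is taken from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified stand-ins for the real resource types.
type Cluster struct{ ID string }
type AwsClient struct{ ClusterID string }

// Resources holds whatever was requested through the builder.
type Resources struct {
	Cluster   *Cluster
	AwsClient *AwsClient
}

// builder records which resources were requested; nothing is created until Build.
type builder struct {
	clusterID   string
	needCluster bool
	needAws     bool
}

// WithCluster requests the cluster object.
func (b *builder) WithCluster() *builder {
	b.needCluster = true
	return b
}

// WithAwsClient requests an AWS client and, implicitly, the cluster it depends on.
func (b *builder) WithAwsClient() *builder {
	b.needCluster = true
	b.needAws = true
	return b
}

// Build creates the requested resources in dependency order.
func (b *builder) Build() (*Resources, error) {
	r := &Resources{}
	if b.needCluster {
		if b.clusterID == "" {
			return nil, errors.New("cluster ID is required")
		}
		r.Cluster = &Cluster{ID: b.clusterID}
	}
	if b.needAws {
		// The cluster is guaranteed to exist here because WithAwsClient requested it.
		r.AwsClient = &AwsClient{ClusterID: r.Cluster.ID}
	}
	return r, nil
}

func main() {
	r, err := (&builder{clusterID: "example-cluster"}).WithAwsClient().Build()
	if err != nil {
		fmt.Println("build failed:", err)
		return
	}
	fmt.Println(r.Cluster.ID, r.AwsClient != nil) // example-cluster true
}
```

Deferring construction to Build() lets the builder resolve dependencies once, regardless of the order in which the With... methods are called.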

Graduating an investigation

New investigations and their remediation steps should be deployed in advancing stages through a progressive deployment strategy.

Informing Stage (Read-only): The investigation is merely informative through PagerDuty at this stage; remediation does not involve any write operations. Notes are collected throughout the investigation, and upon the investigation's conclusion are posted to PagerDuty.

Aim: Validating the investigation's accuracy and usefulness without performing any write actions.

Validation Criteria:
- The investigation successfully carries out each step on its respective incident type, on both staging and production environments.
- It provides useful information (equivalent to a manual investigation) to SREs through PagerDuty.
- The investigation should be accompanied by unit tests and/or step-by-step manual tests in the investigation's testing README, including:
  - A clear step-by-step process to manually test the investigation (e.g. cluster setup, other expected conditions).
Actioning Stage (Read/Write): The investigation's remediation capabilities, including read and write operations, are performed on all applicable clusters.

Validation Criteria:
- The investigation is verified to conduct remediations on staging as expected.
- The investigation should be locally tested in staging against a live alert.
- E2E testing is desired for actioning investigations; the tests should cover the execution of remediative steps as well as verification of their effectiveness.

Integrations

Note: When writing an investigation, you can use these integrations right away. They are initialized for you and passed to the investigation via investigation.Resources.

AWS -- Logging into the cluster, retrieving instance info and AWS CloudTrail events.
- See pkg/aws
PagerDuty -- Retrieving alert info, escalating or silencing incidents, and adding notes.
- See pkg/pagerduty
OCM -- Retrieving cluster info, sending service logs, and managing (post, delete) limited support reasons.
- See pkg/ocm
- In case of missing permissions to query an OCM resource, add it to the Configuration-Anomaly-Detection role in uhc-account-manager
osd-network-verifier -- Tool to verify the pre-configured networking components for ROSA and OSD CCS clusters.
k8sclient -- Interact with the cluster's kube-api
- Requires RBAC definitions for your investigation to be added to metadata.yaml

Testing locally

Against upstream stage OCM Backplane

Requires an existing cluster. Requires that the metadata.yaml is committed to the main branch of the upstream repo (see below for testing against a local metadata.yaml).

Create a test incident and payload file for your cluster
```
./test/generate_incident.sh <alertname> <clusterid>
```
Export the required env variables from vault

Note: For information on the envs see required env variables.
```
source test/set_stage_env.sh
```
Build cadctl with make build
Run cadctl with the payload file created by test/generate_incident.sh
```
./bin/cadctl investigate --payload-path payload
```

Against local OCM Backplane

Requires existing cluster, same as above. The requests to /backplane/remediate and /backplane/remediation OCM Backplane endpoints are redirected to the local instance of OCM Backplane. This means the metadata.yaml committed to the main branch on your local disk is used to grant permissions (an alternate branch will be available after SREP-636 is complete).

Make sure to install the dependencies first with

dnf install jq openssl tinyproxy haproxy proxytunnel

It will run services on the following local ports: 8001, 8091, 8443, 8888

Create a test incident and payload file for your cluster
```
./test/generate_incident.sh <alertname> <clusterid>
```
In a separate terminal start the local infrastructure

Note: You need to clone the backplane-api code repository to a local directory and copy ocm.json from a staging cluster to its ./configs dir.

OCM_BACKPLANE_REPO_PATH=/home/me/backplane-api ./test/launch_local_env.sh

Export the required env variables from vault

Note: For information on the envs see required env variables.
```
source test/set_stage_env.sh
```
Build cadctl with make build

Run cadctl with the payload file created by test/generate_incident.sh and proxy as well as the backplane URL set to localhost

BACKPLANE_URL=https://localhost:8443 HTTP_PROXY=http://127.0.0.1:8888 HTTPS_PROXY=http://127.0.0.1:8888 BACKPLANE_PROXY=http://127.0.0.1:8888  ./bin/cadctl investigate --payload-path ./payload --log-level debug

Shut down the local infrastructure when done by sending SIGINT (Ctrl+C) to launch_local_env.sh.

Run e2e test manually

See test/e2e/README.md

Logging levels

CAD supports several logging levels (debug, info, warn, error, fatal, panic). The log level is determined through a hierarchy: the CLI flag log-level is checked first, and if it is not set, the optional environment variable LOG_LEVEL is used. If neither is set, the log level defaults to info.
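The precedence described above can be sketched as follows. This is an illustration rather than CAD's code: `resolveLogLevel` is a hypothetical helper, and the environment lookup is passed in as a function so the logic can be exercised without touching the real environment.

```go
package main

import (
	"fmt"
	"os"
)

// resolveLogLevel applies the documented precedence:
// CLI flag first, then the LOG_LEVEL environment variable, then "info".
// An empty flagValue is treated as "flag not set".
func resolveLogLevel(flagValue string, lookupEnv func(string) (string, bool)) string {
	if flagValue != "" {
		return flagValue
	}
	if v, ok := lookupEnv("LOG_LEVEL"); ok && v != "" {
		return v
	}
	return "info"
}

func main() {
	// The flag wins over the environment.
	fmt.Println(resolveLogLevel("debug", os.LookupEnv)) // debug

	// Without a flag, the result depends on LOG_LEVEL in the current environment.
	fmt.Println(resolveLogLevel("", os.LookupEnv))
}
```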

Documentation

Investigations

Every alert managed by CAD corresponds to an investigation, representing the executed code associated with the alert.

Investigation-specific documentation can be found in the corresponding investigation folder, e.g. for ClusterHasGoneMissing.

Integrations

AWS -- Logging into the cluster, retrieving instance info and AWS CloudTrail events.
PagerDuty -- Retrieving alert info, escalating or silencing incidents, and adding notes.
OCM -- Retrieving cluster info, sending service logs, and managing (post, delete) limited support reasons.
osd-network-verifier -- Tool to verify the pre-configured networking components for ROSA and OSD CCS clusters.

Templates

OpenShift -- Used by app-interface to deploy the CAD resources on a target cluster.

Dashboards

Grafana dashboard configmaps are stored in the Dashboards directory. See app-interface for further documentation on dashboards.

Boilerplate

Boilerplate -- Conventions for OSD containers.

PipelinePruner

PipelinePruner -- Documentation about PipelineRun pruning.

Required ENV variables

Note: For local execution, these can be exported from vault with source test/set_stage_env.sh

CAD_OCM_CLIENT_ID: refers to the OCM client ID used by CAD to initialize the OCM client
CAD_OCM_CLIENT_SECRET: refers to the OCM client secret used by CAD to initialize the OCM client
CAD_OCM_URL: refers to the OCM URL used by CAD to initialize the OCM client
CAD_PD_EMAIL: refers to the email for a login via mail/pw credentials
CAD_PD_PW: refers to the password for a login via mail/pw credentials
CAD_PD_TOKEN: refers to the generated private access token for token-based authentication
CAD_PD_USERNAME: refers to the username of CAD on PagerDuty
CAD_SILENT_POLICY: refers to the silent policy CAD should use if the incident shall be silent
PD_SIGNATURE: refers to the PagerDuty webhook signature (HMAC+SHA256)
CAD_PROMETHEUS_PUSHGATEWAY: refers to the URL CAD pushes metrics to
BACKPLANE_URL: refers to the backplane url to use
BACKPLANE_INITIAL_ARN: refers to the initial ARN used for the isolated backplane jumprole flow
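When running locally, it can help to check that every variable in the list above is set before starting an investigation. The following is a minimal sketch under that assumption: `missingEnv` is a hypothetical helper that is not part of cadctl, and the variable names are copied from the list above.

```go
package main

import (
	"fmt"
	"os"
)

// requiredEnv mirrors the required variables documented above.
var requiredEnv = []string{
	"CAD_OCM_CLIENT_ID",
	"CAD_OCM_CLIENT_SECRET",
	"CAD_OCM_URL",
	"CAD_PD_EMAIL",
	"CAD_PD_PW",
	"CAD_PD_TOKEN",
	"CAD_PD_USERNAME",
	"CAD_SILENT_POLICY",
	"PD_SIGNATURE",
	"CAD_PROMETHEUS_PUSHGATEWAY",
	"BACKPLANE_URL",
	"BACKPLANE_INITIAL_ARN",
}

// missingEnv returns the names that are unset or empty, in input order.
// The lookup function is injected so the check can run against a fake environment.
func missingEnv(names []string, lookup func(string) (string, bool)) []string {
	var missing []string
	for _, name := range names {
		if v, ok := lookup(name); !ok || v == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	if missing := missingEnv(requiredEnv, os.LookupEnv); len(missing) > 0 {
		fmt.Println("missing required variables:", missing)
		return
	}
	fmt.Println("all required variables are set")
}
```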

Optional ENV variables

BACKPLANE_PROXY: refers to the proxy CAD uses for the isolated backplane access flow.

Note: BACKPLANE_PROXY is required for local development, as a backplane api is only accessible through the proxy.

CAD_EXPERIMENTAL_ENABLED: enables experimental investigations when set to true, see mapping.go
CAD_ORG_POLICY_MAPPING: JSON configuration for organization-based escalation policy routing. When configured, the interceptor automatically reassigns PagerDuty incidents for clusters belonging to specific organizations to dedicated escalation policies. This enables organization-specific on-call rotations.

Example configuration:
```
{
  "organizations": [
    {
      "name": "Customer Alpha",
      "org_ids": ["org-id-1", "org-id-2"],
      "escalation_policy": "P123ABC"
    },
    {
      "name": "Customer Beta",
      "org_ids": ["org-id-3"],
      "escalation_policy": "P456DEF"
    }
  ]
}
```
Behavior:
- When an incident is triggered, the interceptor queries OCM to get the cluster's organization ID
- If the organization ID matches an entry in the mapping, the incident is reassigned to the specified escalation policy
- A note is added to the PagerDuty incident documenting the reassignment
- If reassignment fails, a note with error details is added for manual routing
- The investigation continues normally after reassignment (or skips reassignment if not configured)
Requirements:
- CAD_OCM_CLIENT_ID, CAD_OCM_CLIENT_SECRET, CAD_OCM_URL must be configured
- Escalation policy IDs must be valid in the PagerDuty account
- Escalation policies must have at least 2 levels (CAD's EscalateIncident() method escalates to level 2)
- This configuration is set as an environment variable in the deployment via app-interface saas file
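The lookup described under Behavior can be sketched as below. This is a hedged illustration and not the interceptor's code: the type and function names (`orgPolicyMapping`, `parseMapping`, `policyFor`) are hypothetical, and only the JSON field names are taken from the example configuration above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// exampleMapping repeats the example configuration shown above.
const exampleMapping = `{
  "organizations": [
    {"name": "Customer Alpha", "org_ids": ["org-id-1", "org-id-2"], "escalation_policy": "P123ABC"},
    {"name": "Customer Beta", "org_ids": ["org-id-3"], "escalation_policy": "P456DEF"}
  ]
}`

// organization is one entry of the mapping.
type organization struct {
	Name             string   `json:"name"`
	OrgIDs           []string `json:"org_ids"`
	EscalationPolicy string   `json:"escalation_policy"`
}

// orgPolicyMapping is the top-level document stored in CAD_ORG_POLICY_MAPPING.
type orgPolicyMapping struct {
	Organizations []organization `json:"organizations"`
}

// parseMapping decodes the JSON value of the environment variable.
func parseMapping(raw string) (orgPolicyMapping, error) {
	var m orgPolicyMapping
	err := json.Unmarshal([]byte(raw), &m)
	return m, err
}

// policyFor returns the escalation policy for orgID, or false if no entry matches.
func (m orgPolicyMapping) policyFor(orgID string) (string, bool) {
	for _, org := range m.Organizations {
		for _, id := range org.OrgIDs {
			if id == orgID {
				return org.EscalationPolicy, true
			}
		}
	}
	return "", false
}

func main() {
	m, err := parseMapping(exampleMapping)
	if err != nil {
		fmt.Println("invalid mapping:", err)
		return
	}
	policy, ok := m.policyFor("org-id-2")
	fmt.Printf("%q %v\n", policy, ok) // "P123ABC" true

	policy, ok = m.policyFor("unknown-org")
	fmt.Printf("%q %v\n", policy, ok) // "" false
}
```

An unmatched organization returns false, which corresponds to the documented case where reassignment is skipped and the investigation continues normally.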

For Red Hat employees, these environment variables can be found in the SRE-P vault.

LOG_LEVEL: refers to the CAD log level; if not set, the default is info. See Logging levels.

Configuration File

CAD also has a configuration file, and over time the environment variables will be moved into it. For the currently supported fields, see the commented example configuration in ./docs/investigation-filter-config.example.yaml.

Directories

Path	Synopsis
cadctl	Package main is the main package
cmd	Package cmd holds the cadctl cobra data
cmd/investigate	Package investigate holds the investigate command
cmd/manual
config
hack
update-template module
interceptor
pkg/interceptor
pkg
aws	Package aws contains functions related to aws sdk
aws/mock	Package awsmock is a generated GoMock package.
backplane	Package backplane provides helper functions for interacting with the backplane-api SDK
backplane/mock
config	Package config provides configuration and tree-based filtering for investigation actions.
controller
executor	Package reporter provides external system update functionality for investigation results
investigations
investigations/aiassisted	Package aiassisted provides AI-powered investigation using AWS AgentCore
investigations/cannotretrieveupdatessre
investigations/ccam	Package ccam Cluster Credentials Are Missing (CCAM) provides a service for detecting missing cluster credentials
investigations/chgm	Package chgm contains functionality for the chgm investigation
investigations/clusterhealthcheck	Package clusterhealthcheck implements a CAD investigation that runs comprehensive health checks against an OpenShift cluster, replicating the functionality of the managed-scripts health/cluster-health-check action.
investigations/clustermonitoringerrorbudgetburn	Package clustermonitoringerrorbudgetburn contains remediation for https://issues.redhat.com/browse/OCPBUGS-33863
investigations/cpd	Package cpd contains functionality for the ClusterProvisioningDelay investigation package cpd
investigations/describenodes	Package describenodes implements a CAD investigation that describes all nodes in a cluster, providing the full output of `oc describe nodes` including pod details.
investigations/etcddatabasequotalowspace	Package etcddatabasequotalowspace takes etcd snapshots and performs database analysis for etcd quota issues
investigations/insightsoperatordown
investigations/investigation
investigations/machinehealthcheckunterminatedshortcircuitsre	machinehealthcheckunterminatedshortcircuitsre defines the investigation logic for the MachineHealthCheckUnterminatedShortCircuitSRE alert
investigations/mustgather	Package mustgather implements an investigation that collects must-gather diagnostics from ROSA classic clusters and uploads them to the Red Hat SFTP server for analysis.
investigations/ocmagentresponsefailure	Package ocmagentresponsefailure implements the investigation logic for the "OCMAgentResponseFailureServiceLogsSRE" alert.
investigations/precheck
investigations/restartcontrolplane	Package restartcontrolplane implements an investigation that restarts an HCP control plane.
investigations/upgradeconfigsyncfailureover4hr	Package upgradeconfigsyncfailureover4hr contains functionality for the UpgradeConfigSyncFailureOver4HrSRE investigation
investigations/utils/machine
investigations/utils/node	node defines investigation utility logic related to node objects
investigations/utils/tarball
investigations/utils/version
k8s	Package k8sclient handles creation and cleanup of backplane remediations meaning a kube-apiserver access to clusters with RBAC defined in an investigations metadata
logging	Package logging wraps the zap logging package to provide easier access and initialization of the logger
managedcloud	Package managedcloud contains functionality to access cloud environments of managed clusters
metrics	Package metrics provides prometheus instrumentation for CAD
networkverifier	Package networkverifier contains functionality for running the network verifier
notewriter
oc
ocm	Package ocm contains ocm api related functions
ocm/mock	Package ocmmock is a generated GoMock package.
pagerduty	Package pagerduty contains wrappers for pagerduty api calls
pagerduty/mock	Package pdmock is a generated GoMock package.
pullsecret	Package pullsecret provides pull secret validation functionality This package validates cluster pull secrets against OCM account data, similar to osdctl's validate-pull-secret-ext command.
types	Package types contains shared types used by both investigations and reporter packages
utils	Package utils contains utility functions
test
e2e/utils