configuration-anomaly-detection

Version: v0.0.0-...-a5d2bf9
Published: May 29, 2026 License: Apache-2.0

Configuration Anomaly Detection


About

Configuration Anomaly Detection (CAD) is responsible for reducing manual SRE effort by pre-investigating alerts, detecting cluster anomalies and sending relevant communications to the cluster owner.

Overview

CAD consists of:

  • a tekton deployment including a custom tekton interceptor
  • the cadctl command line tool implementing alert remediations and pre-investigations
Workflows

CAD can be triggered in two ways: by a PagerDuty webhook or by a manual run.

PagerDuty Webhook
  1. PagerDuty Webhooks are used to trigger Configuration-Anomaly-Detection when a PagerDuty incident is created
  2. The webhook routes to a Tekton EventListener
  3. Received webhooks are filtered by a Tekton Interceptor, which validates the webhook against the X-PagerDuty-Signature header and uses the payload to evaluate whether cadctl implements a handler function for the alert. If no handler is implemented, the alert is forwarded directly to a human SRE.
  4. If cadctl implements a handler for the received payload/alert, a Tekton PipelineRun is started.
  5. The pipeline runs cadctl which determines the handler function by itself based on the payload.
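
The signature check in step 3 can be illustrated with a minimal sketch. It assumes PagerDuty's documented header format, where X-PagerDuty-Signature carries one or more comma-separated `v1=<hex>` entries, each an HMAC-SHA256 of the raw request body. The function names `verifySignature` and `sign` are hypothetical and are not taken from the CAD interceptor.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// verifySignature reports whether any "v1=<hex>" entry in the header value
// matches the HMAC-SHA256 of the raw body computed with the shared secret.
// Hypothetical helper, not CAD's actual interceptor code.
func verifySignature(secret string, body []byte, header string) bool {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(body)
	expected := mac.Sum(nil)

	for _, entry := range strings.Split(header, ",") {
		entry = strings.TrimSpace(entry)
		if !strings.HasPrefix(entry, "v1=") {
			continue // skip unknown signature versions
		}
		got, err := hex.DecodeString(strings.TrimPrefix(entry, "v1="))
		if err != nil {
			continue // skip malformed entries
		}
		// hmac.Equal compares in constant time.
		if hmac.Equal(got, expected) {
			return true
		}
	}
	return false
}

// sign builds a header value for the example; a real sender does this part.
func sign(secret string, body []byte) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write(body)
	return "v1=" + hex.EncodeToString(mac.Sum(nil))
}

func main() {
	body := []byte(`{"event":{"event_type":"incident.triggered"}}`)
	header := sign("example-secret", body)
	fmt.Println(verifySignature("example-secret", body, header)) // prints "true"
	fmt.Println(verifySignature("wrong-secret", body, header))   // prints "false"
}
```

The comparison must run over the raw request bytes; re-serializing the parsed JSON would generally change the digest.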

Figure: CAD Overview

Manual run
  1. Invoke the cadctl command via the run subcommand; this uses your locally configured credentials:

    cadctl run -c <CLUSTER_ID> -i <INVESTIGATION>
    
  2. Invoke a manual investigation via osdctl cluster cad run --cluster <CLUSTER_ID>, which uses the hosted CAD to run your investigation. More information in this document

Contributing

Building

For build targets, see make help.

Adding a new investigation

CAD investigations are triggered by PagerDuty webhooks. Currently, CAD supports the following two formats of webhooks:

  • WebhookV3
  • EventOrchestrationWebhook

The required investigation is identified by CAD based on the incident and its payload. As PagerDuty itself does not provide finer granularity for webhooks than service-based, CAD filters out the alerts it should investigate. For more information, please refer to https://support.pagerduty.com/docs/webhooks.

To add a new alert investigation:

  • Run make bootstrap-investigation to generate boilerplate code in pkg/investigations. This creates the corresponding folder and .go file, and also appends the investigation to the availableInvestigations interface in registry.go.
  • The Run method of your investigation receives a ResourceBuilder. Use its With... methods to request the resources your investigation needs, then call Build() to get a Resources struct containing them. The builder automatically handles dependencies between resources (e.g., requesting an AWS client will also initialize the cluster object). For example:
    func (c *Investigation) Run(rb investigation.ResourceBuilder) (investigation.InvestigationResult, error) {
        result := investigation.InvestigationResult{}
        // Request an AWS client. This will also ensure the Cluster resource is built.
        r, err := rb.WithAwsClient().Build()
        if err != nil {
            return result, err
        }
        // Now you can use r.AwsClient, r.Cluster, r.PdClient, etc.
        // ...
        return result, nil
    }
    
  • The returned Resources struct contains initialized clients and cluster objects. See Integrations for a full list of available resources.
  • Add test objects or scripts used to recreate the alert symptoms to the pkg/investigations/$INVESTIGATION_NAME/testing/ directory for future use. Be sure to clearly document the testing procedure under the Testing section of the investigation-specific README.md file.
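
To make the dependency handling of the builder more concrete, here is a minimal, self-contained sketch of the pattern. Every type and method in it (Cluster, AwsClient, Resources, ResourceBuilder, NewResourceBuilder) is a simplified, hypothetical stand-in written for illustration; CAD's real definitions in the investigation package may differ in shape and behavior.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the real CAD types.
type Cluster struct{ ID string }
type AwsClient struct{ ClusterID string }

// Resources holds whatever the builder was asked to provide.
type Resources struct {
	Cluster   *Cluster
	AwsClient *AwsClient
}

// ResourceBuilder records which resources were requested.
type ResourceBuilder struct {
	clusterID   string
	wantCluster bool
	wantAws     bool
}

func NewResourceBuilder(clusterID string) *ResourceBuilder {
	return &ResourceBuilder{clusterID: clusterID}
}

// WithCluster requests the cluster object.
func (b *ResourceBuilder) WithCluster() *ResourceBuilder {
	b.wantCluster = true
	return b
}

// WithAwsClient requests an AWS client. The client depends on the cluster,
// so the cluster is requested implicitly as well.
func (b *ResourceBuilder) WithAwsClient() *ResourceBuilder {
	b.wantAws = true
	return b.WithCluster()
}

// Build creates the requested resources in dependency order.
func (b *ResourceBuilder) Build() (*Resources, error) {
	r := &Resources{}
	if b.wantCluster {
		if b.clusterID == "" {
			return nil, errors.New("cluster ID is required")
		}
		r.Cluster = &Cluster{ID: b.clusterID}
	}
	if b.wantAws {
		// Safe: wantAws always implies wantCluster.
		r.AwsClient = &AwsClient{ClusterID: r.Cluster.ID}
	}
	return r, nil
}

func main() {
	r, err := NewResourceBuilder("example-cluster").WithAwsClient().Build()
	if err != nil {
		fmt.Println("build failed:", err)
		return
	}
	fmt.Println(r.Cluster.ID, r.AwsClient != nil) // prints "example-cluster true"
}
```

Keeping the dependency rule inside WithAwsClient means an investigation only names what it uses directly, and Build can fail early when a prerequisite cannot be created.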
Graduating an investigation

New investigations and their remediation steps should be deployed in advancing stages through a progressive deployment strategy.

  1. Informing Stage (Read-only): The investigation is merely informative through PagerDuty at this stage; remediation does not involve any write operations. Notes are collected throughout the investigation, and upon the investigation's conclusion are posted to PagerDuty.

    Aim: Validating the investigation's accuracy and usefulness without performing any write actions.

    Validation Criteria:

    • The investigation successfully carries out each step on its respective incident type, on both staging and production environments.
    • It provides useful information (equivalent to a manual investigation) to SREs through PagerDuty.
    • The investigation should be accompanied by unit tests and/or step-by-step manual tests in the investigation's testing README, including:
      • A clear step-by-step process to manually test the investigation (e.g. cluster setup, other expected conditions).
  2. Actioning Stage (Read/Write): The investigation's remediation capabilities, including read and write operations, are performed on all applicable clusters.

    Validation Criteria:

    • The investigation is verified to conduct remediations on staging as expected.
    • The investigation should be locally tested in staging against a live alert.
    • E2E testing is desired for actioning investigations; the tests should cover the execution of remediative steps as well as verification of their effectiveness.
Integrations

Note: When writing an investigation, you can use these integrations right away. They are initialized for you and passed to the investigation via investigation.Resources.

  • AWS -- Logging into the cluster, retrieving instance info and AWS CloudTrail events.
    • See pkg/aws
  • PagerDuty -- Retrieving alert info, escalating or silencing incidents, and adding notes.
    • See pkg/pagerduty
  • OCM -- Retrieving cluster info, sending service logs, and managing (post, delete) limited support reasons.
    • See pkg/ocm
    • If permissions to query an OCM resource are missing, add the resource to the Configuration-Anomaly-Detection role in uhc-account-manager.
  • osd-network-verifier -- Tool to verify the pre-configured networking components for ROSA and OSD CCS clusters.
  • k8sclient -- Interacting with a cluster's kube-api.
    • Requires RBAC definitions for your investigation to be added to metadata.yaml

Testing locally

Against upstream stage OCM Backplane

Requires an existing cluster. Requires that the metadata.yaml is committed to the main branch of the upstream repo (see below for testing against a local metadata.yaml).

  1. Create a test incident and payload file for your cluster

    ./test/generate_incident.sh <alertname> <clusterid>
    
  2. Export the required env variables from vault

    Note: For information on the envs see Required ENV variables.

    source test/set_stage_env.sh
    
  3. make build

  4. Run cadctl with the payload file created by test/generate_incident.sh

    ./bin/cadctl investigate --payload-path payload
    
Against local OCM Backplane

Requires existing cluster, same as above. The requests to /backplane/remediate and /backplane/remediation OCM Backplane endpoints are redirected to the local instance of OCM Backplane. This means the metadata.yaml committed to the main branch on your local disk is used to grant permissions (an alternate branch will be available after SREP-636 is complete).

Make sure to install the dependencies first with

dnf install jq openssl tinyproxy haproxy proxytunnel

The local environment runs services on the following local ports: 8001, 8091, 8443, 8888

  1. Create a test incident and payload file for your cluster

    ./test/generate_incident.sh <alertname> <clusterid>
    
  2. In a separate terminal, start the local infrastructure

    Note: You need to clone the backplane-api code repository to a local directory and copy ocm.json from a staging cluster to its ./configs dir.

    OCM_BACKPLANE_REPO_PATH=/home/me/backplane-api ./test/launch_local_env.sh
    
  3. Export the required env variables from vault

    Note: For information on the envs see Required ENV variables.

    source test/set_stage_env.sh
    
  4. make build

  5. Run cadctl with the payload file created by test/generate_incident.sh, with the proxy and the backplane URL set to localhost

    BACKPLANE_URL=https://localhost:8443 HTTP_PROXY=http://127.0.0.1:8888 HTTPS_PROXY=http://127.0.0.1:8888 BACKPLANE_PROXY=http://127.0.0.1:8888  ./bin/cadctl investigate --payload-path ./payload --log-level debug
    
  6. Close the local infrastructure when done by sending SIGINT (Ctrl+C) to launch_local_env.sh

Run e2e test manually

See test/e2e/README.md

Logging levels

CAD supports the logging levels debug, info, warn, error, fatal and panic. The log level is determined through a hierarchy: the CLI flag log-level is checked first, and if it is not set, the optional environment variable LOG_LEVEL is used. If neither is set, the log level defaults to info.
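
The precedence described above can be sketched in a few lines of Go. The function name `resolveLogLevel` and the injected lookup function are assumptions made for this illustration; they are not CAD's actual implementation, and the sketch does not validate that the value is one of the supported levels.

```go
package main

import (
	"fmt"
	"os"
)

// resolveLogLevel applies the documented precedence: CLI flag value first,
// then the LOG_LEVEL environment variable, then the default "info".
// lookupEnv is injected (os.LookupEnv in production) to keep this testable.
func resolveLogLevel(flagValue string, lookupEnv func(string) (string, bool)) string {
	if flagValue != "" {
		return flagValue
	}
	if v, ok := lookupEnv("LOG_LEVEL"); ok && v != "" {
		return v
	}
	return "info"
}

func main() {
	// With the flag set, the environment is never consulted.
	fmt.Println(resolveLogLevel("debug", os.LookupEnv)) // prints "debug"
	// Without the flag, the result depends on LOG_LEVEL in the environment.
	fmt.Println(resolveLogLevel("", os.LookupEnv))
}
```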

Documentation

Investigations

Every alert managed by CAD corresponds to an investigation, representing the executed code associated with the alert.

Investigation-specific documentation can be found in the corresponding investigation folder, e.g. for ClusterHasGoneMissing.

Integrations
  • AWS -- Logging into the cluster, retrieving instance info and AWS CloudTrail events.
  • PagerDuty -- Retrieving alert info, escalating or silencing incidents, and adding notes.
  • OCM -- Retrieving cluster info, sending service logs, and managing (post, delete) limited support reasons.
  • osd-network-verifier -- Tool to verify the pre-configured networking components for ROSA and OSD CCS clusters.
Templates
  • OpenShift -- Used by app-interface to deploy the CAD resources on a target cluster.
Dashboards

Grafana dashboard configmaps are stored in the Dashboards directory. See app-interface for further documentation on dashboards.

Boilerplate
PipelinePruner
Required ENV variables

Note: For local execution, these can be exported from vault with source test/set_stage_env.sh

  • CAD_OCM_CLIENT_ID: refers to the OCM client ID used by CAD to initialize the OCM client
  • CAD_OCM_CLIENT_SECRET: refers to the OCM client secret used by CAD to initialize the OCM client
  • CAD_OCM_URL: refers to the OCM URL used by CAD to initialize the OCM client
  • CAD_PD_EMAIL: refers to the email for a login via mail/pw credentials
  • CAD_PD_PW: refers to the password for a login via mail/pw credentials
  • CAD_PD_TOKEN: refers to the generated private access token for token-based authentication
  • CAD_PD_USERNAME: refers to the username of CAD on PagerDuty
  • CAD_SILENT_POLICY: refers to the silent policy CAD should use if the incident shall be silent
  • PD_SIGNATURE: refers to the PagerDuty webhook signature (HMAC+SHA256)
  • CAD_PROMETHEUS_PUSHGATEWAY: refers to the URL CAD will push metrics to
  • BACKPLANE_URL: refers to the backplane URL to use
  • BACKPLANE_INITIAL_ARN: refers to the initial ARN used for the isolated backplane jumprole flow
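
For local runs it can be useful to check up front which of these variables are unset. The following is a minimal sketch of such a check, not something CAD ships: the helper `missingEnv` and the `requiredEnv` slice are hypothetical, and the slice simply mirrors the list above without knowing whether every variable is needed in every setup.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// requiredEnv mirrors the "Required ENV variables" list of this README.
var requiredEnv = []string{
	"CAD_OCM_CLIENT_ID", "CAD_OCM_CLIENT_SECRET", "CAD_OCM_URL",
	"CAD_PD_EMAIL", "CAD_PD_PW", "CAD_PD_TOKEN", "CAD_PD_USERNAME",
	"CAD_SILENT_POLICY", "PD_SIGNATURE", "CAD_PROMETHEUS_PUSHGATEWAY",
	"BACKPLANE_URL", "BACKPLANE_INITIAL_ARN",
}

// missingEnv returns the names that are unset or empty, in input order.
// lookup is injected (os.LookupEnv in production) to keep this testable.
func missingEnv(names []string, lookup func(string) (string, bool)) []string {
	var missing []string
	for _, name := range names {
		if v, ok := lookup(name); !ok || v == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	if missing := missingEnv(requiredEnv, os.LookupEnv); len(missing) > 0 {
		fmt.Println("missing required environment variables:", strings.Join(missing, ", "))
		return
	}
	fmt.Println("all required environment variables are set")
}
```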
Optional ENV variables
  • BACKPLANE_PROXY: refers to the proxy CAD uses for the isolated backplane access flow.

Note: BACKPLANE_PROXY is required for local development, as the backplane API is only accessible through the proxy.

  • CAD_EXPERIMENTAL_ENABLED: enables experimental investigations when set to true, see mapping.go

  • CAD_ORG_POLICY_MAPPING: JSON configuration for organization-based escalation policy routing. When configured, the interceptor automatically reassigns PagerDuty incidents for clusters belonging to specific organizations to dedicated escalation policies. This enables organization-specific on-call rotations.

    Example configuration:

    {
      "organizations": [
        {
          "name": "Customer Alpha",
          "org_ids": ["org-id-1", "org-id-2"],
          "escalation_policy": "P123ABC"
        },
        {
          "name": "Customer Beta",
          "org_ids": ["org-id-3"],
          "escalation_policy": "P456DEF"
        }
      ]
    }
    

    Behavior:

    • When an incident is triggered, the interceptor queries OCM to get the cluster's organization ID
    • If the organization ID matches an entry in the mapping, the incident is reassigned to the specified escalation policy
    • A note is added to the PagerDuty incident documenting the reassignment
    • If reassignment fails, a note with error details is added for manual routing
    • The investigation continues normally after reassignment (or skips reassignment if not configured)

    Requirements:

    • CAD_OCM_CLIENT_ID, CAD_OCM_CLIENT_SECRET, CAD_OCM_URL must be configured
    • Escalation policy IDs must be valid in the PagerDuty account
    • Escalation policies must have at least 2 levels (CAD's EscalateIncident() method escalates to level 2)
    • This configuration is set as an environment variable in the deployment via app-interface saas file
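
    As an illustration of the lookup described under Behavior, the sketch below parses the example configuration and resolves an organization ID to an escalation policy. The type and function names (`orgPolicyMapping`, `parseOrgPolicyMapping`, `policyFor`) are hypothetical; only the JSON field names come from the example above, and the OCM query and PagerDuty reassignment are left out.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// organization mirrors one entry of the "organizations" array.
type organization struct {
	Name             string   `json:"name"`
	OrgIDs           []string `json:"org_ids"`
	EscalationPolicy string   `json:"escalation_policy"`
}

// orgPolicyMapping mirrors the top-level JSON object.
type orgPolicyMapping struct {
	Organizations []organization `json:"organizations"`
}

// parseOrgPolicyMapping decodes the value of CAD_ORG_POLICY_MAPPING.
func parseOrgPolicyMapping(raw string) (orgPolicyMapping, error) {
	var m orgPolicyMapping
	if err := json.Unmarshal([]byte(raw), &m); err != nil {
		return orgPolicyMapping{}, fmt.Errorf("parsing CAD_ORG_POLICY_MAPPING: %w", err)
	}
	return m, nil
}

// policyFor returns the escalation policy of the first organization entry
// that lists orgID, and false if no entry matches.
func (m orgPolicyMapping) policyFor(orgID string) (string, bool) {
	for _, org := range m.Organizations {
		for _, id := range org.OrgIDs {
			if id == orgID {
				return org.EscalationPolicy, true
			}
		}
	}
	return "", false
}

const exampleMapping = `{
  "organizations": [
    {"name": "Customer Alpha", "org_ids": ["org-id-1", "org-id-2"], "escalation_policy": "P123ABC"},
    {"name": "Customer Beta", "org_ids": ["org-id-3"], "escalation_policy": "P456DEF"}
  ]
}`

func main() {
	m, err := parseOrgPolicyMapping(exampleMapping)
	if err != nil {
		fmt.Println(err)
		return
	}
	policy, ok := m.policyFor("org-id-3")
	fmt.Println(policy, ok) // prints "P456DEF true"
}
```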

For Red Hat employees, these environment variables can be found in the SRE-P vault.

  • LOG_LEVEL: refers to the CAD log level; if not set, the default is info. See Logging levels.
Configuration File

CAD now also has a configuration file, and over time the environment variables will be moved to it. For the currently supported fields, see the commented example configuration in ./docs/investigation-filter-config.example.yaml.

Directories

Path Synopsis
  • Package main is the main package
  • cmd -- Package cmd holds the cadctl cobra data
  • cmd/investigate -- Package investigate holds the investigate command
  • hack
  • pkg
  • aws -- Package aws contains functions related to aws sdk
  • aws/mock -- Package awsmock is a generated GoMock package.
  • backplane -- Package backplane provides helper functions for interacting with the backplane-api SDK
  • config -- Package config provides configuration and tree-based filtering for investigation actions.
  • executor -- Package reporter provides external system update functionality for investigation results
  • investigations/aiassisted -- Package aiassisted provides AI-powered investigation using AWS AgentCore
  • investigations/ccam -- Package ccam Cluster Credentials Are Missing (CCAM) provides a service for detecting missing cluster credentials
  • investigations/chgm -- Package chgm contains functionality for the chgm investigation
  • investigations/clusterhealthcheck -- Package clusterhealthcheck implements a CAD investigation that runs comprehensive health checks against an OpenShift cluster, replicating the functionality of the managed-scripts health/cluster-health-check action.
  • investigations/clustermonitoringerrorbudgetburn -- Package clustermonitoringerrorbudgetburn contains remediation for https://issues.redhat.com/browse/OCPBUGS-33863
  • investigations/cpd -- Package cpd contains functionality for the ClusterProvisioningDelay investigation
  • investigations/describenodes -- Package describenodes implements a CAD investigation that describes all nodes in a cluster, providing the full output of `oc describe nodes` including pod details.
  • investigations/etcddatabasequotalowspace -- Package etcddatabasequotalowspace takes etcd snapshots and performs database analysis for etcd quota issues
  • investigations/machinehealthcheckunterminatedshortcircuitsre -- machinehealthcheckunterminatedshortcircuitsre defines the investigation logic for the MachineHealthCheckUnterminatedShortCircuitSRE alert
  • investigations/mustgather -- Package mustgather implements an investigation that collects must-gather diagnostics from ROSA classic clusters and uploads them to the Red Hat SFTP server for analysis.
  • investigations/ocmagentresponsefailure -- Package ocmagentresponsefailure implements the investigation logic for the "OCMAgentResponseFailureServiceLogsSRE" alert.
  • investigations/restartcontrolplane -- Package restartcontrolplane implements an investigation that restarts an HCP control plane.
  • investigations/upgradeconfigsyncfailureover4hr -- Package upgradeconfigsyncfailureover4hr contains functionality for the UpgradeConfigSyncFailureOver4HrSRE investigation
  • investigations/utils/node -- node defines investigation utility logic related to node objects
  • k8s -- Package k8sclient handles creation and cleanup of backplane remediations meaning a kube-apiserver access to clusters with RBAC defined in an investigations metadata
  • logging -- Package logging wraps the zap logging package to provide easier access and initialization of the logger
  • managedcloud -- Package managedcloud contains functionality to access cloud environments of managed clusters
  • metrics -- Package metrics provides prometheus instrumentation for CAD
  • networkverifier -- Package networkverifier contains functionality for running the network verifier
  • oc
  • ocm -- Package ocm contains ocm api related functions
  • ocm/mock -- Package ocmmock is a generated GoMock package.
  • pagerduty -- Package pagerduty contains wrappers for pagerduty api calls
  • pagerduty/mock -- Package pdmock is a generated GoMock package.
  • pullsecret -- Package pullsecret provides pull secret validation functionality. This package validates cluster pull secrets against OCM account data, similar to osdctl's validate-pull-secret-ext command.
  • types -- Package types contains shared types used by both investigations and reporter packages
  • utils -- Package utils contains utility functions
test
