observability

package
v1.0.0
Warning: This package is not in the latest version of its module.
Published: Feb 17, 2026 License: GPL-3.0 Imports: 20 Imported by: 0

README

Observability Package

The observability package provides Prometheus metrics collection and OpenTelemetry tracing for cd-operator.

Features

  • Prometheus Metrics: Collect and export metrics about log entries, errors, and log write duration
  • Pull Mode: HTTP endpoint for Prometheus scraping (/metrics)
  • Push Mode: Push metrics to Prometheus Pushgateway on exit or periodically
  • OpenTelemetry Tracing: Distributed tracing with trace ID correlation
  • Metrics Writer: Automatic metrics collection from log output

Components

1. Configuration (config.go)

Define how metrics and tracing are configured:

// Metrics configuration
config := observability.DefaultMetricsConfig()
config.Enabled = true
config.Mode = "pull"           // "pull", "push", or "disabled"
config.PullPort = ":8081"      // HTTP port for /metrics endpoint
config.JobName = "cd-operator"

// Tracing configuration
tracingConfig := observability.DefaultTracingConfig()
tracingConfig.Enabled = true
tracingConfig.ServiceName = "cd-operator"

2. Metrics Collection (metrics.go)

Prometheus metrics for logging observability:

  • cd_operator_log_total - Counter: Total log entries by level
  • cd_operator_error_total - Counter: Total errors by level
  • cd_operator_log_duration_seconds - Histogram: Time spent writing logs

registry := prometheus.NewRegistry()
metrics := observability.NewMetricsWithRegistry(registry)

// Wrap your writer to collect metrics automatically
writer := observability.NewMetricsWriter(os.Stdout, metrics)

3. Metrics Export (exporter.go)

Export metrics via pull or push mode:

Pull Mode (Prometheus scrapes):

config := observability.MetricsConfig{
    Mode:     "pull",
    PullPort: ":8081",
}
exporter := observability.NewMetricsExporter(config, registry)
// Metrics available at http://localhost:8081/metrics

Push Mode (Push to Pushgateway):

config := observability.MetricsConfig{
    Mode:           "push",
    PushGatewayURL: "http://localhost:9091",
    PushOnExit:     true,
    JobName:        "cd-operator",
    InstanceID:     "instance-123",
}
exporter := observability.NewMetricsExporter(config, registry)

4. Tracing (tracer.go, span.go)

OpenTelemetry distributed tracing:

// Initialize tracing (tracingConfig as built in section 1)
tp, err := observability.InitTracing(ctx, tracingConfig)
if err != nil {
    return err
}
defer observability.Shutdown(ctx, tp)

// Create spans with the package-level tracer
ctx, span := observability.GetTracer().Start(ctx, "deploy-application")
defer span.End()

// Extract trace ID for log correlation
traceID := observability.ExtractTraceID(span)
log.Info(ctx, "deploying", "trace_id", traceID)

Usage Example

Complete Setup

package main

import (
    "context"
    "os"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/grhili/cd-operator/pkg/observability"
)

func main() {
    ctx := context.Background()

    // Initialize tracing
    tracingConfig := observability.DefaultTracingConfig()
    tracingConfig.Enabled = true
    tp, err := observability.InitTracing(ctx, tracingConfig)
    if err != nil {
        panic(err)
    }
    defer observability.Shutdown(ctx, tp)

    // Configure metrics (pull mode)
    config := observability.DefaultMetricsConfig()
    config.Enabled = true
    config.Mode = "pull"
    config.PullPort = ":8081"

    // Create metrics and exporter
    registry := prometheus.NewRegistry()
    metrics := observability.NewMetricsWithRegistry(registry)
    exporter := observability.NewMetricsExporter(config, registry)
    metrics.SetExporter(exporter)
    // Shut the exporter down when main returns
    defer exporter.Shutdown(context.Background())

    // Wrap stdout with metrics collection
    writer := observability.NewMetricsWriter(os.Stdout, metrics)

    // Create your logger with this writer
    // logger := logger.New(logger.Config{Writer: writer})
    _ = writer // remove once the logger above is wired in

    // Start a traced operation
    ctx, span := observability.GetTracer().Start(ctx, "main-operation")
    defer span.End()

    // Get trace ID for correlation
    traceID := observability.ExtractTraceID(span)
    _ = traceID

    // Do work...
}

Integration with Logger

// Create metrics
registry := prometheus.NewRegistry()
metrics := observability.NewMetricsWithRegistry(registry)

// Wrap writer
writer := observability.NewMetricsWriter(os.Stdout, metrics)

// Use with logger
logger := logger.New(logger.Config{
    Writer: writer,
    Level:  "info",
})

// All log writes will now record metrics
logger.Info(ctx, "application started")

Metrics Details

Counters

  • cd_operator_log_total{level="info"} - Total INFO logs
  • cd_operator_log_total{level="error"} - Total ERROR logs
  • cd_operator_log_total{level="warn"} - Total WARN logs
  • cd_operator_log_total{level="debug"} - Total DEBUG logs
  • cd_operator_error_total{level="error"} - Total write errors

Histograms

  • cd_operator_log_duration_seconds - Time to write logs (seconds)
    • Buckets: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10
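
Prometheus histogram buckets are cumulative: each bucket counts every observation less than or equal to its upper bound, and an implicit +Inf bucket counts all of them. The sketch below reproduces that bookkeeping for the bounds listed above; the cumulativeCounts helper is illustrative and not part of this package.

```go
package main

import "fmt"

// buckets mirrors the upper bounds listed above, in seconds.
var buckets = []float64{0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10}

// cumulativeCounts returns, for each bound, how many observations are
// less than or equal to it. The final entry is the implicit +Inf bucket.
func cumulativeCounts(observations []float64) []int {
	counts := make([]int, len(buckets)+1)
	for _, v := range observations {
		for i, upper := range buckets {
			if v <= upper {
				counts[i]++
			}
		}
		counts[len(buckets)]++ // +Inf always matches
	}
	return counts
}

func main() {
	counts := cumulativeCounts([]float64{0.003, 0.03, 0.03, 12})
	for i, upper := range buckets {
		fmt.Printf("le=%g count=%d\n", upper, counts[i])
	}
	fmt.Printf("le=+Inf count=%d\n", counts[len(buckets)])
}
```

A write that takes longer than 10 seconds therefore shows up only in the +Inf bucket, which is worth keeping in mind when reading quantile estimates for slow sinks.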

Architecture

Application Code
      |
      v
  Logger/Writer
      |
      v
MetricsWriter (records metrics)
      |
      +---> Prometheus Registry
      |           |
      |           v
      |     MetricsExporter
      |           |
      |           +---> Pull: HTTP Server (:8081/metrics)
      |           +---> Push: Pushgateway
      |
      v
  Underlying Writer (os.Stdout)

Design Principles

Following code-convention.md:

  1. Value types for config - MetricsConfig and TracingConfig are value types
  2. Pointer types for services - Metrics and MetricsExporter are pointer types
  3. Context-aware - All operations accept context.Context
  4. Typed parameters - No raw strings, everything is strongly typed
  5. Best-effort metrics - Metrics failures don't crash the application
  6. Clean shutdown - Exporter.Shutdown() pushes final metrics

Dependencies

  • github.com/prometheus/client_golang - Prometheus metrics
  • go.opentelemetry.io/otel - OpenTelemetry tracing
  • go.opentelemetry.io/otel/exporters/stdout/stdouttrace - Stdout trace exporter

Testing

Run tests:

go test -v ./pkg/observability/...

All tests use isolated registries to avoid conflicts.

Documentation

Overview

Package observability provides OpenTelemetry distributed tracing for cd-operator. This enables end-to-end visibility from PR discovery through deployment.

Constants

const (
	// PR attributes
	AttrPRNumber     = "pr.number"
	AttrPRRepository = "pr.repository"
	AttrPRHeadSHA    = "pr.head.sha"
	AttrPRTitle      = "pr.title"
	AttrPRAuthor     = "pr.author"
	AttrPRState      = "pr.state"
	AttrPRMergeable  = "pr.mergeable"

	// Environment attributes
	AttrEnvSource = "environment.source"
	AttrEnvTarget = "environment.target"
	AttrEnvName   = "environment.name"

	// Cluster attributes
	AttrClusterName     = "cluster.name"
	AttrClusterEndpoint = "cluster.endpoint"

	// Test attributes
	AttrTestProvider = "test.provider"
	AttrTestJobName  = "test.job"
	AttrTestRunID    = "test.run_id"
	AttrTestStatus   = "test.status"
	AttrTestURL      = "test.url"

	// Policy attributes
	AttrPolicyName        = "policy.name"
	AttrPolicyAutoPromote = "policy.auto_promote"

	// ArgoCD attributes
	AttrArgoApplication  = "argocd.application"
	AttrArgoHealthStatus = "argocd.health.status"
	AttrArgoSyncStatus   = "argocd.sync.status"
	AttrArgoRevision     = "argocd.revision"

	// Error attributes
	AttrErrorType    = "error.type"
	AttrErrorMessage = "error.message"

	// Action attributes
	AttrAction = "action"
	AttrResult = "result"
)

Attribute keys for common span metadata. Following OpenTelemetry semantic conventions where applicable.

Variables

This section is empty.

Functions

func AddEvent

func AddEvent(span trace.Span, name string, message string)

AddEvent adds a timestamped event to a span. Use this to mark significant milestones within an operation.

Example:

observability.AddEvent(span, "tests-passed", "All external tests completed successfully")

func ExtractSpanID

func ExtractSpanID(span trace.Span) string

ExtractSpanID extracts the span ID from a span context. Returns empty string if no span ID is present. Use this for correlation in logs and metrics.

func ExtractTraceID

func ExtractTraceID(span trace.Span) string

ExtractTraceID extracts the trace ID from a span context. Returns empty string if no trace ID is present. Use this for correlation in logs and metrics.
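
For readers unfamiliar with the identifier format, a trace ID in W3C Trace Context is 16 bytes rendered as 32 lowercase hex characters, and the all-zero value is invalid. The sketch below shows that convention with the standard library only; the traceID type and traceIDString helper are hypothetical, and treating the invalid ID as "no trace ID" is an assumption about how ExtractTraceID behaves.

```go
package main

import (
	"encoding/hex"
	"fmt"
)

// traceID mirrors the 16-byte trace identifier from W3C Trace Context.
type traceID [16]byte

// traceIDString returns the lowercase hex form of id, or "" for the
// all-zero ID, which the specification defines as invalid.
func traceIDString(id traceID) string {
	if id == (traceID{}) {
		return ""
	}
	return hex.EncodeToString(id[:])
}

func main() {
	id := traceID{0x4b, 0xf9, 0x2f, 0x35, 0x77, 0xb3, 0x4d, 0xa6,
		0xa3, 0xce, 0x92, 0x9d, 0x0e, 0x0e, 0x47, 0x36}
	fmt.Printf("trace_id=%q\n", traceIDString(id))
	fmt.Printf("trace_id=%q\n", traceIDString(traceID{}))
}
```

Logging the hex string as a structured field makes it possible to search the trace backend for the same value.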

func GetTracer

func GetTracer() trace.Tracer

GetTracer returns the global tracer instance for cd-operator. This should be used by all packages to create spans.

func InitTracing

func InitTracing(ctx context.Context, cfg TracingConfig) (*sdktrace.TracerProvider, error)

InitTracing initializes OpenTelemetry distributed tracing with OTLP exporter. Returns a TracerProvider that must be shut down on application exit.

Features:

  • OTLP gRPC exporter for Jaeger/Tempo/etc.
  • W3C TraceContext + Baggage propagation
  • Resource attributes (service name, version, environment)
  • Configurable sampling rate
  • Graceful shutdown support

Example:

cfg := observability.TracingConfig{
    Enabled:      true,
    Endpoint:     "localhost:4317",
    SamplingRate: 0.1,
    ServiceName:  "cd-operator",
}
tp, err := observability.InitTracing(ctx, cfg)
if err != nil {
    log.Fatal("failed to init tracing", zap.Error(err))
}
defer observability.Shutdown(ctx, tp)

func NewMetricsWriter

func NewMetricsWriter(writer io.Writer, metrics *Metrics) io.Writer

NewMetricsWriter wraps a writer to add Prometheus metrics collection. Each write operation updates log counters and duration histograms.

Example:

m := observability.NewMetrics()
writer := observability.NewMetricsWriter(os.Stdout, m)

logger, _ := logger.New(logger.Config{
    Writer: writer,
})

func RecordError

func RecordError(span trace.Span, err error)

RecordError records an error on a span with standardized attributes. This marks the span as failed and includes error details.

Example:

if err != nil {
    observability.RecordError(span, err)
    return err
}

func RecordSuccess

func RecordSuccess(span trace.Span)

RecordSuccess marks a span as successful. Use this at the end of an operation to indicate completion without errors.

Example:

defer span.End()
// ... do work ...
observability.RecordSuccess(span)

func SetActionResult

func SetActionResult(span trace.Span, action string, success bool)

SetActionResult records the result of an action. Use this to track success/failure rates of different operations.

func SetArgoCDAttributes

func SetArgoCDAttributes(span trace.Span, appName string, healthStatus string, syncStatus string, revision string)

SetArgoCDAttributes adds ArgoCD application attributes to an existing span. Use this when querying or updating ArgoCD applications.

func SetEnvironmentAttributes

func SetEnvironmentAttributes(span trace.Span, sourceEnv string, targetEnv string)

SetEnvironmentAttributes adds environment-related attributes to an existing span. Use this for promotion operations.

func SetPRAttributes

func SetPRAttributes(span trace.Span, prNumber int, repository string, headSHA string)

SetPRAttributes adds PR-related attributes to an existing span. Use this when PR information becomes available mid-operation.

func SetPolicyAttributes

func SetPolicyAttributes(span trace.Span, policyName string, autoPromote bool)

SetPolicyAttributes adds promotion policy attributes to an existing span. Use this when loading or applying promotion policies.

func SetTestAttributes

func SetTestAttributes(span trace.Span, provider string, jobName string, runID string, status string)

SetTestAttributes adds test-related attributes to an existing span. Use this when triggering or monitoring external tests.

func Shutdown

func Shutdown(ctx context.Context, tp *sdktrace.TracerProvider) error

Shutdown gracefully shuts down the tracer provider. This flushes any pending spans to the backend before exit. Must be called before application exit to avoid losing traces.

Example:

defer observability.Shutdown(context.Background(), tp)

func StartSpanWithCluster

func StartSpanWithCluster(
	ctx context.Context,
	tracer trace.Tracer,
	name string,
	clusterName string,
) (context.Context, trace.Span)

StartSpanWithCluster creates a span with cluster-specific attributes. Use this for operations that interact with ArgoCD clusters.

Example:

ctx, span := observability.StartSpanWithCluster(ctx, tracer, "query-argocd", "production")
defer span.End()

func StartSpanWithEnv

func StartSpanWithEnv(
	ctx context.Context,
	tracer trace.Tracer,
	name string,
	sourceEnv string,
	targetEnv string,
) (context.Context, trace.Span)

StartSpanWithEnv creates a span with environment-specific attributes. Use this for promotion operations that move between environments.

Example:

ctx, span := observability.StartSpanWithEnv(ctx, tracer, "promote", "dev", "staging")
defer span.End()

func StartSpanWithPR

func StartSpanWithPR(
	ctx context.Context,
	tracer trace.Tracer,
	name string,
	prNumber int,
	headSHA string,
) (context.Context, trace.Span)

StartSpanWithPR creates a span with PR-specific attributes. Use this for operations that process a specific pull request.

Example:

ctx, span := observability.StartSpanWithPR(ctx, tracer, "qualify-pr", 123, "abc123")
defer span.End()

Types

type ComponentMetrics

type ComponentMetrics struct {
	// contains filtered or unexported fields
}

ComponentMetrics holds all Prometheus metrics for cd-operator components. This is the central registry for all custom metrics, following the naming convention: cd_operator_<component>_<metric>_<unit>

All metrics are registered with the default Prometheus registry on package initialization. Metrics are designed to be non-blocking and best-effort to avoid impacting core operations.
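
The naming convention can be expressed as a small helper. This is only an illustration: metricName is hypothetical, and the package may well hard-code its metric names instead of assembling them.

```go
package main

import (
	"fmt"
	"strings"
)

// metricName assembles a name following cd_operator_<component>_<metric>_<unit>.
// It is a hypothetical helper, not part of the observability package.
func metricName(component, metric, unit string) string {
	return strings.Join([]string{"cd_operator", component, metric, unit}, "_")
}

func main() {
	// Yields the name of the log duration histogram documented in the README.
	fmt.Println(metricName("log", "duration", "seconds"))
}
```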

func GetGlobalMetrics

func GetGlobalMetrics() *ComponentMetrics

GetGlobalMetrics returns the global metrics instance. Returns nil if InitGlobalMetrics has not been called yet.

func InitGlobalMetrics

func InitGlobalMetrics(registerer prometheus.Registerer) *ComponentMetrics

InitGlobalMetrics initializes the global metrics instance with the provided registry. This should be called once during application startup.

Example:

observability.InitGlobalMetrics(prometheus.DefaultRegisterer)

func NewComponentMetrics

func NewComponentMetrics(registerer prometheus.Registerer) *ComponentMetrics

NewComponentMetrics creates and registers all component metrics with the provided registry. If registerer is nil, uses prometheus.DefaultRegisterer.

All metrics are registered atomically. If any metric fails to register (e.g., duplicate), the function panics to fail fast during operator startup.

Example:

registry := prometheus.NewRegistry()
metrics := observability.NewComponentMetrics(registry)

func (*ComponentMetrics) DecPRState

func (m *ComponentMetrics) DecPRState(state, repository string)

DecPRState decrements the count of PRs in a given state.

func (*ComponentMetrics) IncPRState

func (m *ComponentMetrics) IncPRState(state, repository string)

IncPRState increments the count of PRs in a given state.

func (*ComponentMetrics) ObserveArgoCDAPIDuration

func (m *ComponentMetrics) ObserveArgoCDAPIDuration(start time.Time, cluster, method string)

ObserveArgoCDAPIDuration records the duration of an ArgoCD API call.

func (*ComponentMetrics) ObserveDriftResolutionDuration

func (m *ComponentMetrics) ObserveDriftResolutionDuration(start time.Time, cluster string)

ObserveDriftResolutionDuration records the duration of a drift resolution operation.

func (*ComponentMetrics) ObserveExternalTestDuration

func (m *ComponentMetrics) ObserveExternalTestDuration(start time.Time, provider string)

ObserveExternalTestDuration records the duration of an external test execution.

func (*ComponentMetrics) ObserveGitHubAPIDuration

func (m *ComponentMetrics) ObserveGitHubAPIDuration(start time.Time, method string)

ObserveGitHubAPIDuration records the duration of a GitHub API call.

func (*ComponentMetrics) ObservePRProcessingDuration

func (m *ComponentMetrics) ObservePRProcessingDuration(start time.Time, action, repository string)

ObservePRProcessingDuration records the duration of a PR processing operation. Use with defer for automatic timing:

defer metrics.ObservePRProcessingDuration(time.Now(), "qualify", "owner/repo")

func (*ComponentMetrics) ObservePromotionDuration

func (m *ComponentMetrics) ObservePromotionDuration(start time.Time, sourceEnv, targetEnv string)

ObservePromotionDuration records the duration of a promotion operation.

func (*ComponentMetrics) RecordArgoCDAPICall

func (m *ComponentMetrics) RecordArgoCDAPICall(cluster, method, status string)

RecordArgoCDAPICall records an ArgoCD API call with its result status.

func (*ComponentMetrics) RecordDriftDetection

func (m *ComponentMetrics) RecordDriftDetection(cluster, result string)

RecordDriftDetection records a drift detection operation result.

func (*ComponentMetrics) RecordExternalTestExecution

func (m *ComponentMetrics) RecordExternalTestExecution(provider, result string)

RecordExternalTestExecution records an external test execution result.

func (*ComponentMetrics) RecordGitHubAPICall

func (m *ComponentMetrics) RecordGitHubAPICall(method, status string)

RecordGitHubAPICall records a GitHub API call with its result status.

func (*ComponentMetrics) RecordPRDiscovery

func (m *ComponentMetrics) RecordPRDiscovery(repository, result string)

RecordPRDiscovery records a PR discovery operation result.

func (*ComponentMetrics) RecordPRMerge

func (m *ComponentMetrics) RecordPRMerge(repository, result string)

RecordPRMerge records a PR merge operation result.

func (*ComponentMetrics) RecordPRQualification

func (m *ComponentMetrics) RecordPRQualification(repository, result, reason string)

RecordPRQualification records a PR qualification operation result.

func (*ComponentMetrics) RecordPromotion

func (m *ComponentMetrics) RecordPromotion(sourceEnv, targetEnv, result string)

RecordPromotion records a promotion operation result.

func (*ComponentMetrics) SetDriftStatus

func (m *ComponentMetrics) SetDriftStatus(cluster, application, status string, value float64)

SetDriftStatus sets the drift status for an application in a cluster.

func (*ComponentMetrics) SetGitHubRateLimitRemaining

func (m *ComponentMetrics) SetGitHubRateLimitRemaining(resource string, remaining float64)

SetGitHubRateLimitRemaining sets the remaining GitHub API rate limit.

func (*ComponentMetrics) SetPRState

func (m *ComponentMetrics) SetPRState(state, repository string, count float64)

SetPRState sets the current count of PRs in a given state.

type Exporter

type Exporter interface {
	// Push sends metrics immediately
	Push(ctx context.Context) error
	// Shutdown gracefully stops the exporter and pushes final metrics
	Shutdown(ctx context.Context) error
}

Exporter defines the interface for metrics exporters. This interface can be mocked for testing.

type Metrics

type Metrics struct {
	// contains filtered or unexported fields
}

Metrics holds Prometheus collectors for logging observability.

func NewMetrics

func NewMetrics() *Metrics

NewMetrics creates and registers Prometheus metrics for logging. Uses the default Prometheus registry.

Exposed metrics:

  • cd_operator_log_total: Total number of log entries by level
  • cd_operator_error_total: Total number of errors by level
  • cd_operator_log_duration_seconds: Time spent writing logs

func NewMetricsWithRegistry

func NewMetricsWithRegistry(registerer prometheus.Registerer) *Metrics

NewMetricsWithRegistry creates metrics with a custom registry. This allows for isolated metrics collection and custom exporters.

func (*Metrics) Exporter

func (m *Metrics) Exporter() Exporter

Exporter returns the configured metrics exporter, if any.

func (*Metrics) RecordError

func (m *Metrics) RecordError(level string)

RecordError increments error counter.

func (*Metrics) RecordLog

func (m *Metrics) RecordLog(level string, duration time.Duration)

RecordLog increments log counters and records duration.

func (*Metrics) SetExporter

func (m *Metrics) SetExporter(exporter Exporter)

SetExporter configures a metrics exporter for this Metrics instance. This enables push mode or other export strategies.

type MetricsConfig

type MetricsConfig struct {
	// Enabled controls whether metrics collection is active
	Enabled bool

	// Mode determines how metrics are exported
	// "pull" - HTTP server for Prometheus scraping (default)
	// "push" - Push to Prometheus Push Gateway on exit
	// "disabled" - Collect but don't export
	Mode string

	// PullPort is the HTTP port for pull mode (e.g., ":8081")
	PullPort string

	// PushGatewayURL is the URL for push mode (e.g., "http://localhost:9091")
	PushGatewayURL string

	// PushOnExit controls whether to push metrics when application exits
	PushOnExit bool

	// JobName identifies this application in the push gateway
	JobName string

	// InstanceID uniquely identifies this process instance
	InstanceID string

	// PushInterval for periodic pushing (0 = disabled, only push on exit)
	PushInterval time.Duration
}

MetricsConfig controls how metrics are collected and exported.
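
Because Mode is a plain string, a typo such as "pul" would only surface at runtime. The sketch below shows one way an application might validate a configuration before using it. The config struct is a trimmed, hypothetical copy of MetricsConfig, and validate is not a function this package is documented to provide.

```go
package main

import (
	"errors"
	"fmt"
)

// config is a trimmed, hypothetical copy of MetricsConfig holding only the
// fields this sketch checks.
type config struct {
	Mode           string
	PullPort       string
	PushGatewayURL string
}

// validate rejects unknown modes and modes that lack their required setting.
func validate(c config) error {
	switch c.Mode {
	case "pull":
		if c.PullPort == "" {
			return errors.New("pull mode requires PullPort")
		}
	case "push":
		if c.PushGatewayURL == "" {
			return errors.New("push mode requires PushGatewayURL")
		}
	case "disabled":
		// Nothing to export, so nothing else to check.
	default:
		return fmt.Errorf("unknown metrics mode %q", c.Mode)
	}
	return nil
}

func main() {
	fmt.Println(validate(config{Mode: "pull", PullPort: ":8081"}))
	fmt.Println(validate(config{Mode: "push"}))
	fmt.Println(validate(config{Mode: "pul"}))
}
```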

func DefaultMetricsConfig

func DefaultMetricsConfig() MetricsConfig

DefaultMetricsConfig returns sensible defaults for cd-operator.

type MetricsExporter

type MetricsExporter struct {
	// contains filtered or unexported fields
}

MetricsExporter handles exporting metrics to various backends.

func NewMetricsExporter

func NewMetricsExporter(config MetricsConfig, registry *prometheus.Registry) *MetricsExporter

NewMetricsExporter creates a new metrics exporter with the given configuration.

func (*MetricsExporter) Push

func (e *MetricsExporter) Push(ctx context.Context) error

Push sends metrics to the push gateway immediately. Safe to call even if push mode is not configured (no-op). Handles transient failures gracefully by logging warnings instead of failing hard.

func (*MetricsExporter) Shutdown

func (e *MetricsExporter) Shutdown(ctx context.Context) error

Shutdown gracefully stops the exporter and pushes final metrics if configured.

func (*MetricsExporter) WithLogger

WithLogger configures the exporter to use the provided logger. This should be called after creating the exporter to integrate with application logging.

type TracingConfig

type TracingConfig struct {
	// Enabled controls whether tracing is active.
	// Default: false (tracing disabled)
	Enabled bool

	// Endpoint is the OTLP gRPC endpoint for trace export.
	// Example: "localhost:4317" (Jaeger), "tempo:4317" (Grafana Tempo)
	// Default: "localhost:4317"
	Endpoint string

	// SamplingRate determines the fraction of traces to record.
	// 0.0 = sample nothing, 1.0 = sample everything.
	// Default: 0.1 (10% sampling)
	SamplingRate float64

	// ServiceName identifies this service in the trace backend.
	// Default: "cd-operator"
	ServiceName string

	// ServiceVersion is the version of the operator (e.g., from git tag).
	// Default: "dev"
	ServiceVersion string

	// Environment identifies the deployment environment (dev, staging, prod).
	// Default: "development"
	Environment string

	// Insecure disables TLS for the OTLP exporter (useful for local dev).
	// Default: true (no TLS)
	Insecure bool
}

TracingConfig controls OpenTelemetry distributed tracing behavior.
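
To make SamplingRate concrete, the sketch below shows trace-ID ratio sampling, the approach used by the OpenTelemetry Go SDK's TraceIDRatioBased sampler: the decision is derived from the trace ID itself, so every service that sees the same trace makes the same choice. Whether InitTracing uses that sampler is an assumption, and shouldSample is a hypothetical helper.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// shouldSample reports whether a trace is kept at the given rate.
// It compares the low 8 bytes of the trace ID (shifted to 63 bits)
// against a bound proportional to the rate.
func shouldSample(traceID [16]byte, rate float64) bool {
	if rate >= 1 {
		return true
	}
	if rate <= 0 {
		return false
	}
	bound := uint64(rate * (1 << 63))
	x := binary.BigEndian.Uint64(traceID[8:16]) >> 1
	return x < bound
}

func main() {
	const rate = 0.1
	sampled := 0
	for i := 0; i < 10000; i++ {
		// Synthetic IDs spread over the 64-bit space by a multiplicative hash.
		var id [16]byte
		binary.BigEndian.PutUint64(id[8:], uint64(i)*0x9E3779B97F4A7C15)
		if shouldSample(id, rate) {
			sampled++
		}
	}
	fmt.Printf("sampled %d of 10000 synthetic trace IDs at rate %.1f\n", sampled, rate)
}
```

With the documented default of 0.1, roughly one trace in ten would be exported under this scheme.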

func DefaultTracingConfig

func DefaultTracingConfig() TracingConfig

DefaultTracingConfig returns sensible defaults for cd-operator.
