Prometheus Monitoring Example
This example demonstrates comprehensive Prometheus metrics collection for LangGraph-Go workflows, including:
- Real-time performance metrics (latency, concurrency, queue depth)
- Retry and error tracking
- Parallel execution monitoring
- HTTP endpoint for Prometheus scraping
Quick Start
# Run the example
cd examples/prometheus_monitoring
go run main.go
# Metrics will be exposed at http://localhost:9090/metrics
# The workflow will execute continuously every 2 seconds
Workflow Structure
The example workflow demonstrates various execution patterns:
fast (1-10ms)
→ medium (50-100ms)
→ slow (500-1000ms)
→ parallel (fan-out)
→ branchA (100-500ms) ⎤
→ branchB (100-500ms) ⎥→ terminal
→ branchC (100-500ms) ⎦
Node characteristics:
- FastNode: Quick execution (1-10ms) - demonstrates low latency
- MediumNode: Medium latency (50-100ms) - typical API call
- SlowNode: Slow execution (500-1000ms) - simulates expensive operations
- ParallelNode: Fan-out to 3 parallel branches - demonstrates concurrency
- BranchNodes: Parallel execution with variable latency
- FlakyNode: Fails 30% of the time - demonstrates retry metrics
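The fan-out stage is where the concurrency metrics get interesting. As a rough illustration (plain goroutines, not the actual LangGraph-Go scheduler), this sketch maintains an in-flight counter the way langgraph_inflight_nodes does and records the peak, which is what max_over_time() would surface:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// fanOut simulates one sequential stage followed by three parallel branches,
// tracking an in-flight count like langgraph_inflight_nodes and returning
// the peak concurrency observed.
func fanOut() int64 {
	var inflight, peak int64

	runNode := func(d time.Duration) {
		cur := atomic.AddInt64(&inflight, 1)
		// Keep the peak up to date (CAS loop, since branches race here).
		for {
			p := atomic.LoadInt64(&peak)
			if cur <= p || atomic.CompareAndSwapInt64(&peak, p, cur) {
				break
			}
		}
		time.Sleep(d) // simulated node work
		atomic.AddInt64(&inflight, -1)
	}

	runNode(time.Millisecond) // "fast" stage runs alone

	var wg sync.WaitGroup // fan-out: branchA/branchB/branchC
	for i := 0; i < 3; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			runNode(50 * time.Millisecond)
		}()
	}
	wg.Wait() // fan-in to the terminal node
	return peak
}

func main() {
	fmt.Println("peak in-flight nodes:", fanOut())
}
```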
Metrics Exposed
1. langgraph_inflight_nodes
Current number of nodes executing concurrently.
Query examples:
# Current concurrency
langgraph_inflight_nodes
# Peak concurrency over 5 minutes
max_over_time(langgraph_inflight_nodes[5m])
2. langgraph_queue_depth
Number of pending work items in the scheduler queue.
Query examples:
# Current queue depth
langgraph_queue_depth
# Queue saturation percentage
(langgraph_queue_depth / 64) * 100
3. langgraph_step_latency_ms
Node execution duration histogram.
Query examples:
# P95 latency for all nodes
histogram_quantile(0.95, rate(langgraph_step_latency_ms_bucket[5m]))
# P99 latency by node
histogram_quantile(0.99,
sum by (node_id, le) (rate(langgraph_step_latency_ms_bucket[5m]))
)
# Average latency for slow node
rate(langgraph_step_latency_ms_sum{node_id="slow"}[5m]) /
rate(langgraph_step_latency_ms_count{node_id="slow"}[5m])
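histogram_quantile estimates a quantile from the cumulative `le` buckets by linear interpolation. This toy reimplementation (a hypothetical helper that ignores Prometheus's special handling of the +Inf bucket) shows the mechanics:

```go
package main

import "fmt"

// bucket mirrors one cumulative histogram bucket, as exported in
// langgraph_step_latency_ms_bucket{le="..."}.
type bucket struct {
	le    float64 // upper bound in ms
	count float64 // cumulative observations <= le
}

// quantile is a minimal sketch of what PromQL's histogram_quantile does:
// find the bucket where the target rank falls and interpolate linearly.
func quantile(q float64, buckets []bucket) float64 {
	total := buckets[len(buckets)-1].count
	rank := q * total
	prevLe, prevCount := 0.0, 0.0
	for _, b := range buckets {
		if b.count >= rank {
			// Linear interpolation within the bucket, like Prometheus.
			return prevLe + (b.le-prevLe)*(rank-prevCount)/(b.count-prevCount)
		}
		prevLe, prevCount = b.le, b.count
	}
	return buckets[len(buckets)-1].le
}

func main() {
	// 100 observations: 50 under 10ms, 40 more under 100ms, 10 under 1000ms.
	buckets := []bucket{{10, 50}, {100, 90}, {1000, 100}}
	fmt.Println(quantile(0.95, buckets)) // 550: interpolated in the 100-1000ms bucket
}
```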
4. langgraph_retries_total
Cumulative retry attempts.
Query examples:
# Retry rate per second
rate(langgraph_retries_total[5m])
# Retries by node (top 5)
topk(5, sum by (node_id) (rate(langgraph_retries_total[5m])))
5. langgraph_merge_conflicts_total
State merge conflicts during concurrent execution.
Query examples:
# Conflict rate
rate(langgraph_merge_conflicts_total[5m])
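What counts as a "conflict" depends on the merge strategy. As a rough illustration only (the `mergeStates` helper and its last-writer-wins rule are hypothetical, not LangGraph-Go's actual merge), two branches writing different values to the same key would increment this counter:

```go
package main

import "fmt"

// mergeStates folds branch results into a base state, counting a conflict
// (what langgraph_merge_conflicts_total would record) whenever two branches
// wrote the same key with different values. Last writer wins.
func mergeStates(base map[string]string, updates []map[string]string) (map[string]string, int) {
	merged := make(map[string]string, len(base))
	for k, v := range base {
		merged[k] = v
	}
	conflicts := 0
	written := map[string]bool{} // keys already written by an earlier branch
	for _, u := range updates {
		for k, v := range u {
			if written[k] && merged[k] != v {
				conflicts++ // overlapping write with a different value
			}
			merged[k] = v
			written[k] = true
		}
	}
	return merged, conflicts
}

func main() {
	base := map[string]string{"status": "running"}
	branchA := map[string]string{"result": "A"}
	branchB := map[string]string{"result": "B"}
	_, conflicts := mergeStates(base, []map[string]string{branchA, branchB})
	fmt.Println("conflicts:", conflicts) // 1: both branches wrote "result"
}
```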
6. langgraph_backpressure_events_total
Queue saturation events that occur when the scheduler queue reaches capacity (T033).
This counter tracks how many times the scheduler had to wait because the execution queue was full. Backpressure is a natural throttling mechanism to prevent unbounded queue growth when nodes execute faster than they can be drained.
Labels:
- run_id: Workflow execution identifier
- reason: Cause of backpressure event (currently "queue_full")
When backpressure occurs:
- A work item is about to be enqueued
- Current queue depth >= queue capacity (default: 64)
- The enqueuer waits for a slot to become available (blocking)
- Metric increments to track the throttling event
- When a slot opens, execution continues
Query examples:
# Backpressure event rate (events per second)
rate(langgraph_backpressure_events_total[5m])
# Total backpressure events by run
sum by (run_id) (langgraph_backpressure_events_total)
# Backpressure events by reason
sum by (reason) (rate(langgraph_backpressure_events_total[5m]))
Example metric output:
# When backpressure occurs (queue is at capacity)
langgraph_backpressure_events_total{reason="queue_full",run_id="run-1"} 3.0
langgraph_backpressure_events_total{reason="queue_full",run_id="run-2"} 7.0
Interpreting backpressure metrics:
- 0 events: Queue never filled up during execution
- Low rate (<0.1/s): Normal, slight queueing under load
- High rate (>1/s): Queue frequently at capacity; consider:
  - Increasing QueueDepth (if memory allows)
  - Increasing MaxConcurrent (if CPU/resources allow)
  - Optimizing slow nodes
  - Distributing load across more instances
Prometheus Configuration
1. Install Prometheus
macOS:
brew install prometheus
Linux:
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvfz prometheus-*.tar.gz
cd prometheus-*
Docker:
docker run -d -p 9091:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
# Note: from inside the container, localhost:9090 is the container itself.
# Point the scrape target at host.docker.internal:9090 (Docker Desktop),
# or run with --network host on Linux.
2. Configure Prometheus Scraper
Create prometheus.yml:
global:
  scrape_interval: 15s     # Scrape metrics every 15 seconds
  evaluation_interval: 15s # Evaluate rules every 15 seconds

scrape_configs:
  - job_name: 'langgraph'
    static_configs:
      - targets: ['localhost:9090'] # Scrape LangGraph metrics endpoint
3. Start Prometheus
# Start Prometheus with config, on port 9091 (the example app already uses 9090)
prometheus --config.file=prometheus.yml --web.listen-address=:9091
# Prometheus UI will be available at http://localhost:9091
Grafana Dashboards
Install Grafana
macOS:
brew install grafana
brew services start grafana
Docker:
docker run -d -p 3000:3000 grafana/grafana
Configure Data Source
- Open Grafana: http://localhost:3000 (admin/admin)
- Add Prometheus data source:
- URL: http://localhost:9091
- Access: Server (default)
- Save & Test
Import Dashboard
Use the provided dashboard JSON (below) or create panels manually.
Recommended Panels
1. Workflow Execution Rate
- Type: Graph
- Query: rate(langgraph_step_latency_ms_count[5m])
- Description: Workflows per second
2. Node Latency Heatmap
- Type: Heatmap
- Query: histogram_quantile(0.95, rate(langgraph_step_latency_ms_bucket[5m]))
- Description: P95 latency distribution by node
3. Retry Rate by Node
- Type: Bar chart
- Query: sum by (node_id) (rate(langgraph_retries_total[5m]))
- Description: Which nodes are retrying most
4. Concurrency Gauge
- Type: Gauge
- Query: langgraph_inflight_nodes
- Thresholds: Warning > 6, Critical > 7 (MaxConcurrent=8)
5. Queue Depth Gauge
- Type: Gauge
- Query: langgraph_queue_depth
- Thresholds: Warning > 50, Critical > 60 (QueueDepth=64)
6. Error Rate
- Type: Graph
- Query: rate(langgraph_step_latency_ms_count{status="error"}[5m])
- Description: Errors per second
7. Backpressure Events (T033)
- Type: Graph
- Query: rate(langgraph_backpressure_events_total[5m])
- Description: Queue saturation events per second
- Alert threshold: > 1 event/s indicates queue at capacity
Dashboard JSON
Save this as langgraph-dashboard.json and import into Grafana:
{
"dashboard": {
"title": "LangGraph Workflow Monitoring",
"panels": [
{
"title": "Workflow Execution Rate",
"targets": [
{
"expr": "rate(langgraph_step_latency_ms_count[5m])"
}
],
"type": "graph"
},
{
"title": "Node Latency (P95)",
"targets": [
{
"expr": "histogram_quantile(0.95, sum by (node_id, le) (rate(langgraph_step_latency_ms_bucket[5m])))"
}
],
"type": "graph"
},
{
"title": "Retry Rate by Node",
"targets": [
{
"expr": "sum by (node_id) (rate(langgraph_retries_total[5m]))"
}
],
"type": "bargauge"
},
{
"title": "Concurrency",
"targets": [
{
"expr": "langgraph_inflight_nodes"
}
],
"type": "gauge",
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"value": 0, "color": "green"},
{"value": 6, "color": "yellow"},
{"value": 7, "color": "red"}
]
}
}
}
},
{
"title": "Queue Depth",
"targets": [
{
"expr": "langgraph_queue_depth"
}
],
"type": "gauge",
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"value": 0, "color": "green"},
{"value": 50, "color": "yellow"},
{"value": 60, "color": "red"}
]
}
}
}
},
{
"title": "Backpressure Events (T033)",
"targets": [
{
"expr": "rate(langgraph_backpressure_events_total[5m])"
}
],
"type": "graph",
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"value": 0, "color": "green"},
{"value": 1, "color": "red"}
]
}
}
}
}
]
}
}
Alert Rules
Prometheus Alert Rules
Create alerts.yml:
groups:
  - name: langgraph_alerts
    interval: 30s
    rules:
      # High latency alert
      - alert: HighNodeLatency
        expr: histogram_quantile(0.95, rate(langgraph_step_latency_ms_bucket[5m])) > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High node latency detected"
          description: "P95 latency is {{ $value }}ms (threshold: 5000ms)"

      # High retry rate alert
      - alert: HighRetryRate
        expr: rate(langgraph_retries_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High retry rate detected"
          description: "Retry rate is {{ $value }}/s (threshold: 0.1/s)"

      # Queue saturation alert
      - alert: QueueSaturated
        expr: (langgraph_queue_depth / 64) * 100 > 80
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Workflow queue is saturated"
          description: "Queue depth is {{ $value }}% of capacity"

      # Backpressure alert
      - alert: FrequentBackpressure
        expr: rate(langgraph_backpressure_events_total[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Frequent backpressure events"
          description: "Backpressure rate is {{ $value }}/s"
Load alerts in Prometheus:
# prometheus.yml
rule_files:
  - "alerts.yml"
Troubleshooting
Metrics Not Showing Up
1. Check metrics endpoint:
   curl http://localhost:9090/metrics | grep langgraph
2. Verify Prometheus scraping:
   - Go to http://localhost:9091/targets
   - Ensure the langgraph target status is UP
3. Check Prometheus logs:
   # Look for scrape errors
   grep "error" prometheus.log
High Queue Depth
If queue depth is consistently high:
1. Increase QueueDepth:
   Options{QueueDepth: 1024} // Default is 64
2. Increase MaxConcurrentNodes:
   Options{MaxConcurrentNodes: 16} // Default is 8
3. Optimize slow nodes:
   - Check P95/P99 latencies
   - Add caching or batching
   - Consider async patterns
High Retry Rate
If the retry rate is unexpectedly high:
1. Check error types:
   sum by (node_id, reason) (rate(langgraph_retries_total[5m]))
2. Increase the retry budget:
   RetryPolicy{MaxAttempts: 5} // More retry attempts
3. Add backoff:
   RetryPolicy{
       BaseDelay: 1 * time.Second,
       MaxDelay:  30 * time.Second,
   }
High Backpressure Events (T033)
If langgraph_backpressure_events_total rate is consistently high (>1/s):
1. Monitor queue saturation:
   # Queue saturation percentage
   (langgraph_queue_depth / 64) * 100
   # When saturation is high, backpressure will trigger
   rate(langgraph_backpressure_events_total[5m]) > 1
2. Increase queue capacity:
   // Increase from default 64 to 256
   graph.WithQueueDepth(256)
3. Increase concurrent node capacity:
   // Increase from default 8 to 16
   graph.WithMaxConcurrent(16)
4. Optimize slow nodes:
   - Identify which nodes cause queue buildup
   - Check langgraph_step_latency_ms per node
   - Add caching or async patterns to slow operations
5. Scale horizontally:
   - Run multiple workflow instances
   - Use a load balancer to distribute work
   - Each instance gets its own queue
Performance Tips
- Scrape interval: 15s is a good balance of granularity vs. load
- Retention: 15 days is typical (adjust based on disk space)
- Label cardinality: Avoid high-cardinality labels like UUID run_ids
- Metric types: Use histograms for latency, gauges for levels, counters for totals
- Query efficiency: Use rate() for counters; avoid avg() on histograms