Prometheus Monitoring Example
This example demonstrates comprehensive Prometheus metrics collection for LangGraph-Go workflows, including:
- Real-time performance metrics (latency, concurrency, queue depth)
- Retry and error tracking
- Parallel execution monitoring
- HTTP endpoint for Prometheus scraping
Quick Start
# Run the example
cd examples/prometheus_monitoring
go run main.go
# Metrics will be exposed at http://localhost:9090/metrics
# The workflow will execute continuously every 2 seconds
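For reference, a metrics endpoint like the one above is typically wired up with prometheus/client_golang's promhttp handler. This is a minimal sketch of that pattern, not a copy of the example's main.go (the port matches the output above; the rest is an assumption):

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // promhttp.Handler serves every metric registered with the default
    // Prometheus registry in the text exposition format.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9090", nil))
}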
Workflow Structure
The example workflow demonstrates various execution patterns:
fast (1-10ms)
→ medium (50-100ms)
→ slow (500-1000ms)
→ parallel (fan-out)
→ branchA (100-500ms) ⎤
→ branchB (100-500ms) ⎥→ terminal
→ branchC (100-500ms) ⎦
Node characteristics:
- FastNode: Quick execution (1-10ms) - demonstrates low latency
- MediumNode: Medium latency (50-100ms) - typical API call
- SlowNode: Slow execution (500-1000ms) - simulates expensive operations
- ParallelNode: Fan-out to 3 parallel branches - demonstrates concurrency
- BranchNodes: Parallel execution with variable latency
- FlakyNode: Fails 30% of the time - demonstrates retry metrics
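To make the retry metrics concrete, here is a minimal sketch of a node that fails roughly 30% of the time, as FlakyNode does. The function signature is a placeholder for illustration, not the actual LangGraph-Go node interface:

package main

import (
    "context"
    "errors"
    "math/rand"
)

// flakyNode fails ~30% of calls so the engine's retry path (and the
// langgraph_retries_total counter) gets exercised.
func flakyNode(ctx context.Context, state map[string]any) (map[string]any, error) {
    if rand.Float64() < 0.3 { // 30% failure rate, per the list above
        return nil, errors.New("flaky: simulated transient failure")
    }
    return state, nil
}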
Metrics Exposed
1. langgraph_inflight_nodes
Current number of nodes executing concurrently.
Query examples:
# Current concurrency
langgraph_inflight_nodes
# Peak concurrency over 5 minutes
max_over_time(langgraph_inflight_nodes[5m])
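If you are instrumenting your own engine, a gauge like this is usually maintained with an Inc/Dec pair around each node execution. A minimal sketch using prometheus/client_golang (only the metric name comes from this README; the surrounding code is an assumption):

package main

import "github.com/prometheus/client_golang/prometheus"

var inflightNodes = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "langgraph_inflight_nodes",
    Help: "Number of nodes executing concurrently.",
})

func init() { prometheus.MustRegister(inflightNodes) }

// runNode brackets a node's execution so the gauge tracks live concurrency.
func runNode(run func() error) error {
    inflightNodes.Inc()       // node started
    defer inflightNodes.Dec() // node finished, even if run() returns an error
    return run()
}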
2. langgraph_queue_depth
Number of pending work items in the scheduler queue.
Query examples:
# Current queue depth
langgraph_queue_depth
# Queue saturation percentage (assuming the default QueueDepth of 64)
(langgraph_queue_depth / 64) * 100
3. langgraph_step_latency_ms
Node execution duration histogram.
Query examples:
# P95 latency for all nodes
histogram_quantile(0.95, rate(langgraph_step_latency_ms_bucket[5m]))
# P99 latency by node
histogram_quantile(0.99,
  sum by (node_id, le) (rate(langgraph_step_latency_ms_bucket[5m]))
)
# Average latency for slow node
rate(langgraph_step_latency_ms_sum{node_id="slow"}[5m]) /
rate(langgraph_step_latency_ms_count{node_id="slow"}[5m])
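The _bucket, _sum, and _count series used above are produced by a Prometheus histogram. A sketch of how such a histogram could be declared and fed with prometheus/client_golang (the metric name and the node_id/status labels appear elsewhere in this README; the bucket layout and the helper are assumptions):

package main

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

var stepLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
    Name:    "langgraph_step_latency_ms",
    Help:    "Node execution duration in milliseconds.",
    Buckets: prometheus.ExponentialBuckets(1, 2, 12), // 1ms .. ~2s (assumed)
}, []string{"node_id", "status"})

func init() { prometheus.MustRegister(stepLatency) }

// observeStep times one node execution and records the duration under the
// node's ID and an ok/error status label.
func observeStep(nodeID string, run func() error) error {
    start := time.Now()
    err := run()
    status := "ok"
    if err != nil {
        status = "error"
    }
    ms := float64(time.Since(start)) / float64(time.Millisecond)
    stepLatency.WithLabelValues(nodeID, status).Observe(ms)
    return err
}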
4. langgraph_retries_total
Cumulative retry attempts.
Query examples:
# Retry rate per second
rate(langgraph_retries_total[5m])
# Retries by node (top 5)
topk(5, sum by (node_id) (rate(langgraph_retries_total[5m])))
5. langgraph_merge_conflicts_total
State merge conflicts during concurrent execution.
Query examples:
# Conflict rate
rate(langgraph_merge_conflicts_total[5m])
6. langgraph_backpressure_events_total
Queue saturation events.
Query examples:
# Backpressure event rate
rate(langgraph_backpressure_events_total[5m])
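The three counters above follow the same declaration pattern. A sketch of plausible definitions and increment sites (the metric names come from this README, and the label set on retries matches the node_id and reason labels used in the queries; everything else is an assumption):

package main

import "github.com/prometheus/client_golang/prometheus"

var (
    retriesTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
        Name: "langgraph_retries_total",
        Help: "Cumulative retry attempts.",
    }, []string{"node_id", "reason"})

    mergeConflictsTotal = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "langgraph_merge_conflicts_total",
        Help: "State merge conflicts during concurrent execution.",
    })

    backpressureTotal = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "langgraph_backpressure_events_total",
        Help: "Queue saturation events.",
    })
)

func init() {
    prometheus.MustRegister(retriesTotal, mergeConflictsTotal, backpressureTotal)
}

// Hypothetical increment sites inside a scheduler:
//   retriesTotal.WithLabelValues(nodeID, "timeout").Inc()
//   mergeConflictsTotal.Inc()
//   backpressureTotal.Inc()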
Prometheus Configuration
1. Install Prometheus
macOS:
brew install prometheus
Linux:
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvfz prometheus-*.tar.gz
cd prometheus-*
Docker:
docker run -d -p 9091:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
# Note: when Prometheus itself runs in Docker, 'localhost' in prometheus.yml
# refers to the container; use host.docker.internal:9090 (Docker Desktop) or
# --network host (Linux) so it can reach the example app on the host.
2. Configure Prometheus Scraper
Create prometheus.yml:
global:
  scrape_interval: 15s     # Scrape metrics every 15 seconds
  evaluation_interval: 15s # Evaluate rules every 15 seconds

scrape_configs:
  - job_name: 'langgraph'
    static_configs:
      - targets: ['localhost:9090'] # The LangGraph metrics endpoint
3. Start Prometheus
# Start Prometheus with the config; use a non-default listen address so its
# UI does not collide with the example app, which already serves :9090
prometheus --config.file=prometheus.yml --web.listen-address=:9091
# The Prometheus UI will be available at http://localhost:9091
Grafana Dashboards
Install Grafana
macOS:
brew install grafana
brew services start grafana
Docker:
docker run -d -p 3000:3000 grafana/grafana
Configure Data Source
- Open Grafana: http://localhost:3000 (admin/admin)
- Add a Prometheus data source:
  - URL: http://localhost:9091
  - Access: Browser
- Save & Test
Import Dashboard
Use the provided dashboard JSON (below) or create panels manually.
Recommended Panels
1. Workflow Execution Rate
- Type: Graph
- Query: rate(langgraph_step_latency_ms_count[5m])
- Description: Node executions per second (a proxy for workflow throughput)
2. Node Latency Heatmap
- Type: Heatmap
- Query: histogram_quantile(0.95, rate(langgraph_step_latency_ms_bucket[5m]))
- Description: P95 latency distribution by node
3. Retry Rate by Node
- Type: Bar chart
- Query: sum by (node_id) (rate(langgraph_retries_total[5m]))
- Description: Which nodes retry most often
4. Concurrency Gauge
- Type: Gauge
- Query: langgraph_inflight_nodes
- Thresholds: Warning > 6, Critical > 7 (MaxConcurrentNodes=8)
5. Queue Depth Gauge
- Type: Gauge
- Query: langgraph_queue_depth
- Thresholds: Warning > 50, Critical > 60 (QueueDepth=64)
6. Error Rate
- Type: Graph
- Query: rate(langgraph_step_latency_ms_count{status="error"}[5m])
- Description: Errors per second
Dashboard JSON
Save this as langgraph-dashboard.json and import into Grafana:
{
"dashboard": {
"title": "LangGraph Workflow Monitoring",
"panels": [
{
"title": "Workflow Execution Rate",
"targets": [
{
"expr": "rate(langgraph_step_latency_ms_count[5m])"
}
],
"type": "graph"
},
{
"title": "Node Latency (P95)",
"targets": [
{
"expr": "histogram_quantile(0.95, sum by (node_id, le) (rate(langgraph_step_latency_ms_bucket[5m])))"
}
],
"type": "graph"
},
{
"title": "Retry Rate by Node",
"targets": [
{
"expr": "sum by (node_id) (rate(langgraph_retries_total[5m]))"
}
],
"type": "bargauge"
},
{
"title": "Concurrency",
"targets": [
{
"expr": "langgraph_inflight_nodes"
}
],
"type": "gauge",
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"value": 0, "color": "green"},
{"value": 6, "color": "yellow"},
{"value": 7, "color": "red"}
]
}
}
}
},
{
"title": "Queue Depth",
"targets": [
{
"expr": "langgraph_queue_depth"
}
],
"type": "gauge",
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"value": 0, "color": "green"},
{"value": 50, "color": "yellow"},
{"value": 60, "color": "red"}
]
}
}
}
}
]
}
}
Alert Rules
Prometheus Alert Rules
Create alerts.yml:
groups:
  - name: langgraph_alerts
    interval: 30s
    rules:
      # High latency alert
      - alert: HighNodeLatency
        expr: histogram_quantile(0.95, rate(langgraph_step_latency_ms_bucket[5m])) > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High node latency detected"
          description: "P95 latency is {{ $value }}ms (threshold: 5000ms)"
      # High retry rate alert
      - alert: HighRetryRate
        expr: rate(langgraph_retries_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High retry rate detected"
          description: "Retry rate is {{ $value }}/s (threshold: 0.1/s)"
      # Queue saturation alert
      - alert: QueueSaturated
        expr: ((langgraph_queue_depth / 64) * 100) > 80
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Workflow queue is saturated"
          description: "Queue depth is {{ $value }}% of capacity"
      # Backpressure alert
      - alert: FrequentBackpressure
        expr: rate(langgraph_backpressure_events_total[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Frequent backpressure events"
          description: "Backpressure rate is {{ $value }}/s"
Load alerts in Prometheus:
# prometheus.yml
rule_files:
  - "alerts.yml"
Troubleshooting
Metrics Not Showing Up
- Check the metrics endpoint:
  curl http://localhost:9090/metrics | grep langgraph
- Verify Prometheus is scraping:
  - Go to http://localhost:9091/targets
  - Ensure the langgraph target status is UP
- Check the Prometheus logs:
  # Look for scrape errors
  grep "error" prometheus.log
High Queue Depth
If queue depth is consistently high:
- Increase QueueDepth:
  Options{QueueDepth: 1024} // Default is 64
- Increase MaxConcurrentNodes:
  Options{MaxConcurrentNodes: 16} // Default is 8
- Optimize slow nodes:
  - Check P95/P99 latencies
  - Add caching or batching
  - Consider async patterns
High Retry Rate
If the retry rate is unexpectedly high:
- Check error types:
  sum by (node_id, reason) (rate(langgraph_retries_total[5m]))
- Increase the retry budget:
  RetryPolicy{MaxAttempts: 5} // More retry attempts
- Add backoff:
  RetryPolicy{
      BaseDelay: 1 * time.Second,
      MaxDelay:  30 * time.Second,
  }
Performance Tips
- Scrape interval: 15s is a good balance of granularity vs. load
- Retention: 15 days is typical (adjust based on disk space)
- Label cardinality: avoid high-cardinality labels such as UUID run_ids
- Metric types: use histograms for latency, gauges for levels, counters for totals
- Query efficiency: use rate() for counters; avoid avg() on histograms