Prometheus Monitoring Example
This example demonstrates comprehensive Prometheus metrics collection for LangGraph-Go workflows, including:
- Real-time performance metrics (latency, concurrency, queue depth)
- Retry and error tracking
- Parallel execution monitoring
- HTTP endpoint for Prometheus scraping
Quick Start
# Run the example
cd examples/prometheus_monitoring
go run main.go
# Metrics will be exposed at http://localhost:9090/metrics
# The workflow will execute continuously every 2 seconds
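For reference, a metrics endpoint like the one above is typically wired up with prometheus/client_golang's promhttp handler. This is a minimal sketch of that pattern, not a copy of the example's main.go (the port matches the output above; the rest is an assumption):

package main

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
    // promhttp.Handler serves every metric registered with the default
    // Prometheus registry in the text exposition format.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":9090", nil))
}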
Workflow Structure
The example workflow demonstrates various execution patterns:
fast (1-10ms)
→ medium (50-100ms)
→ slow (500-1000ms)
→ parallel (fan-out)
→ branchA (100-500ms) ⎤
→ branchB (100-500ms) ⎥→ terminal
→ branchC (100-500ms) ⎦
Node characteristics:
- FastNode: Quick execution (1-10ms) - demonstrates low latency
- MediumNode: Medium latency (50-100ms) - typical API call
- SlowNode: Slow execution (500-1000ms) - simulates expensive operations
- ParallelNode: Fan-out to 3 parallel branches - demonstrates concurrency
- BranchNodes: Parallel execution with variable latency
- FlakyNode: Fails 30% of the time - demonstrates retry metrics
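To make the retry metrics concrete, here is a minimal sketch of a node that fails roughly 30% of the time, as FlakyNode does. The function signature is a placeholder for illustration, not the actual LangGraph-Go node interface:

package main

import (
    "context"
    "errors"
    "math/rand"
)

// flakyNode fails ~30% of calls so the engine's retry path (and the
// langgraph_retries_total counter) gets exercised.
func flakyNode(ctx context.Context, state map[string]any) (map[string]any, error) {
    if rand.Float64() < 0.3 { // 30% failure rate, per the list above
        return nil, errors.New("flaky: simulated transient failure")
    }
    return state, nil
}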
Metrics Exposed
1. langgraph_inflight_nodes
Current number of nodes executing concurrently.
Query examples:
# Current concurrency
langgraph_inflight_nodes
# Peak concurrency over 5 minutes
max_over_time(langgraph_inflight_nodes[5m])
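If you are instrumenting your own engine, a gauge like this is usually maintained with an Inc/Dec pair around each node execution. A minimal sketch using prometheus/client_golang (only the metric name comes from this README; the surrounding code is an assumption):

package main

import "github.com/prometheus/client_golang/prometheus"

var inflightNodes = prometheus.NewGauge(prometheus.GaugeOpts{
    Name: "langgraph_inflight_nodes",
    Help: "Number of nodes executing concurrently.",
})

func init() { prometheus.MustRegister(inflightNodes) }

// runNode brackets a node's execution so the gauge tracks live concurrency.
func runNode(run func() error) error {
    inflightNodes.Inc()       // node started
    defer inflightNodes.Dec() // node finished, even if run() returns an error
    return run()
}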
2. langgraph_queue_depth
Number of pending work items in the scheduler queue.
Query examples:
# Current queue depth
langgraph_queue_depth
# Queue saturation percentage (assuming the default QueueDepth of 64)
(langgraph_queue_depth / 64) * 100
3. langgraph_step_latency_ms
Node execution duration histogram.
Query examples:
# P95 latency for all nodes
histogram_quantile(0.95, rate(langgraph_step_latency_ms_bucket[5m]))
# P99 latency by node
histogram_quantile(0.99,
  sum by (node_id, le) (rate(langgraph_step_latency_ms_bucket[5m]))
)
# Average latency for slow node
rate(langgraph_step_latency_ms_sum{node_id="slow"}[5m]) /
rate(langgraph_step_latency_ms_count{node_id="slow"}[5m])
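The _bucket, _sum, and _count series used above are produced by a Prometheus histogram. A sketch of how such a histogram could be declared and fed with prometheus/client_golang (the metric name and the node_id/status labels appear elsewhere in this README; the bucket layout and the helper are assumptions):

package main

import (
    "time"

    "github.com/prometheus/client_golang/prometheus"
)

var stepLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
    Name:    "langgraph_step_latency_ms",
    Help:    "Node execution duration in milliseconds.",
    Buckets: prometheus.ExponentialBuckets(1, 2, 12), // 1ms .. ~2s (assumed)
}, []string{"node_id", "status"})

func init() { prometheus.MustRegister(stepLatency) }

// observeStep times one node execution and records the duration under the
// node's ID and an ok/error status label.
func observeStep(nodeID string, run func() error) error {
    start := time.Now()
    err := run()
    status := "ok"
    if err != nil {
        status = "error"
    }
    ms := float64(time.Since(start)) / float64(time.Millisecond)
    stepLatency.WithLabelValues(nodeID, status).Observe(ms)
    return err
}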
4. langgraph_retries_total
Cumulative retry attempts.
Query examples:
# Retry rate per second
rate(langgraph_retries_total[5m])
# Retries by node (top 5)
topk(5, sum by (node_id) (rate(langgraph_retries_total[5m])))
5. langgraph_merge_conflicts_total
State merge conflicts during concurrent execution.
Query examples:
# Conflict rate
rate(langgraph_merge_conflicts_total[5m])
6. langgraph_backpressure_events_total
Queue saturation events.
Query examples:
# Backpressure event rate
rate(langgraph_backpressure_events_total[5m])
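The three counters above follow the same declaration pattern. A sketch of plausible definitions and increment sites (the metric names come from this README, and the label set on retries matches the node_id and reason labels used in the queries; everything else is an assumption):

package main

import "github.com/prometheus/client_golang/prometheus"

var (
    retriesTotal = prometheus.NewCounterVec(prometheus.CounterOpts{
        Name: "langgraph_retries_total",
        Help: "Cumulative retry attempts.",
    }, []string{"node_id", "reason"})

    mergeConflictsTotal = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "langgraph_merge_conflicts_total",
        Help: "State merge conflicts during concurrent execution.",
    })

    backpressureTotal = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "langgraph_backpressure_events_total",
        Help: "Queue saturation events.",
    })
)

func init() {
    prometheus.MustRegister(retriesTotal, mergeConflictsTotal, backpressureTotal)
}

// Hypothetical increment sites inside a scheduler:
//   retriesTotal.WithLabelValues(nodeID, "timeout").Inc()
//   mergeConflictsTotal.Inc()
//   backpressureTotal.Inc()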
Prometheus Configuration
1. Install Prometheus
macOS:
brew install prometheus
Linux:
wget https://github.com/prometheus/prometheus/releases/download/v2.45.0/prometheus-2.45.0.linux-amd64.tar.gz
tar xvfz prometheus-*.tar.gz
cd prometheus-*
Docker:
docker run -d -p 9091:9090 \
  -v $(pwd)/prometheus.yml:/etc/prometheus/prometheus.yml \
  prom/prometheus
# Note: when Prometheus itself runs in Docker, 'localhost' in prometheus.yml
# refers to the container; use host.docker.internal:9090 (Docker Desktop) or
# --network host (Linux) so it can reach the example app on the host.
2. Configure Prometheus Scraper
Create prometheus.yml:
global:
  scrape_interval: 15s     # Scrape metrics every 15 seconds
  evaluation_interval: 15s # Evaluate rules every 15 seconds

scrape_configs:
  - job_name: 'langgraph'
    static_configs:
      - targets: ['localhost:9090'] # The LangGraph metrics endpoint
3. Start Prometheus
# Start Prometheus with the config; use a non-default listen address so its
# UI does not collide with the example app, which already serves :9090
prometheus --config.file=prometheus.yml --web.listen-address=:9091
# The Prometheus UI will be available at http://localhost:9091
Grafana Dashboards
Install Grafana
macOS:
brew install grafana
brew services start grafana
Docker:
docker run -d -p 3000:3000 grafana/grafana
Configure Data Source
- Open Grafana: http://localhost:3000 (admin/admin)
- Add a Prometheus data source:
  - URL: http://localhost:9091
  - Access: Browser
- Save & Test
Import Dashboard
Use the provided dashboard JSON (below) or create panels manually.
Recommended Panels
1. Workflow Execution Rate
- Type: Graph
- Query: rate(langgraph_step_latency_ms_count[5m])
- Description: Node executions per second (a proxy for workflow throughput)
2. Node Latency Heatmap
- Type: Heatmap
- Query: histogram_quantile(0.95, rate(langgraph_step_latency_ms_bucket[5m]))
- Description: P95 latency distribution by node
3. Retry Rate by Node
- Type: Bar chart
- Query: sum by (node_id) (rate(langgraph_retries_total[5m]))
- Description: Which nodes retry most often
4. Concurrency Gauge
- Type: Gauge
- Query: langgraph_inflight_nodes
- Thresholds: Warning > 6, Critical > 7 (MaxConcurrentNodes=8)
5. Queue Depth Gauge
- Type: Gauge
- Query: langgraph_queue_depth
- Thresholds: Warning > 50, Critical > 60 (QueueDepth=64)
6. Error Rate
- Type: Graph
- Query: rate(langgraph_step_latency_ms_count{status="error"}[5m])
- Description: Errors per second
Dashboard JSON
Save this as langgraph-dashboard.json and import into Grafana:
{
"dashboard": {
"title": "LangGraph Workflow Monitoring",
"panels": [
{
"title": "Workflow Execution Rate",
"targets": [
{
"expr": "rate(langgraph_step_latency_ms_count[5m])"
}
],
"type": "graph"
},
{
"title": "Node Latency (P95)",
"targets": [
{
"expr": "histogram_quantile(0.95, sum by (node_id, le) (rate(langgraph_step_latency_ms_bucket[5m])))"
}
],
"type": "graph"
},
{
"title": "Retry Rate by Node",
"targets": [
{
"expr": "sum by (node_id) (rate(langgraph_retries_total[5m]))"
}
],
"type": "bargauge"
},
{
"title": "Concurrency",
"targets": [
{
"expr": "langgraph_inflight_nodes"
}
],
"type": "gauge",
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"value": 0, "color": "green"},
{"value": 6, "color": "yellow"},
{"value": 7, "color": "red"}
]
}
}
}
},
{
"title": "Queue Depth",
"targets": [
{
"expr": "langgraph_queue_depth"
}
],
"type": "gauge",
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"value": 0, "color": "green"},
{"value": 50, "color": "yellow"},
{"value": 60, "color": "red"}
]
}
}
}
}
]
}
}
Alert Rules
Prometheus Alert Rules
Create alerts.yml:
groups:
  - name: langgraph_alerts
    interval: 30s
    rules:
      # High latency alert
      - alert: HighNodeLatency
        expr: histogram_quantile(0.95, rate(langgraph_step_latency_ms_bucket[5m])) > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High node latency detected"
          description: "P95 latency is {{ $value }}ms (threshold: 5000ms)"
      # High retry rate alert
      - alert: HighRetryRate
        expr: rate(langgraph_retries_total[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High retry rate detected"
          description: "Retry rate is {{ $value }}/s (threshold: 0.1/s)"
      # Queue saturation alert
      - alert: QueueSaturated
        expr: ((langgraph_queue_depth / 64) * 100) > 80
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Workflow queue is saturated"
          description: "Queue depth is {{ $value }}% of capacity"
      # Backpressure alert
      - alert: FrequentBackpressure
        expr: rate(langgraph_backpressure_events_total[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Frequent backpressure events"
          description: "Backpressure rate is {{ $value }}/s"
Load alerts in Prometheus:
# prometheus.yml
rule_files:
  - "alerts.yml"
Troubleshooting
Metrics Not Showing Up
- Check the metrics endpoint:
  curl http://localhost:9090/metrics | grep langgraph
- Verify Prometheus is scraping:
  - Go to http://localhost:9091/targets
  - Ensure the langgraph target status is UP
- Check the Prometheus logs:
  # Look for scrape errors
  grep "error" prometheus.log
High Queue Depth
If queue depth is consistently high:
- Increase QueueDepth:
  Options{QueueDepth: 1024} // Default is 64
- Increase MaxConcurrentNodes:
  Options{MaxConcurrentNodes: 16} // Default is 8
- Optimize slow nodes:
  - Check P95/P99 latencies
  - Add caching or batching
  - Consider async patterns
High Retry Rate
If the retry rate is unexpectedly high:
- Check error types:
  sum by (node_id, reason) (rate(langgraph_retries_total[5m]))
- Increase the retry budget:
  RetryPolicy{MaxAttempts: 5} // More retry attempts
- Add backoff:
  RetryPolicy{
      BaseDelay: 1 * time.Second,
      MaxDelay:  30 * time.Second,
  }
Performance Tips
- Scrape interval: 15s is a good balance of granularity vs. load
- Retention: 15 days is typical (adjust based on disk space)
- Label cardinality: avoid high-cardinality labels such as UUID run_ids
- Metric types: use histograms for latency, gauges for levels, counters for totals
- Query efficiency: use rate() for counters; avoid avg() on histograms