README
¶
StreamBus Observability Example
This example demonstrates StreamBus's comprehensive observability and monitoring capabilities using Prometheus metrics.
Features Demonstrated
1. Metrics Collection
- Broker Metrics: Uptime, status, connections, active requests
- Message Metrics: Messages produced/consumed, bytes in/out
- Performance Metrics: Produce/consume/replication/commit latencies
- Consumer Group Metrics: Groups, members, lag
- Transaction Metrics: Active, committed, aborted transactions
- Storage Metrics: Used/available bytes, segments, compactions
- Network Metrics: Bytes in/out, requests, errors
- Security Metrics: Authentication/authorization attempts and failures
- Cluster Metrics: Cluster size, leader status, Raft state
- Schema Metrics: Registered schemas, validation stats
2. Prometheus Integration
- HTTP endpoint exposing metrics in Prometheus format
- Proper metric naming and labeling
- Support for Counter, Gauge, and Histogram metric types
- Real-time metric updates
3. Real-World Usage
- Broker lifecycle management
- Message production and consumption
- Metric tracking during operations
- Clean shutdown procedures
Running the Example
# From the examples/observability directory
go run main.go
The example will:
- Create a metrics registry
- Initialize StreamBus-specific metrics
- Start a Prometheus HTTP server on port 9090
- Create and start a broker
- Create a topic with 3 partitions
- Produce 100 messages while tracking metrics
- Consume messages while tracking metrics
- Display a metrics summary
- Wait for Ctrl+C to shutdown
Accessing Metrics
While the example is running, you can access metrics at:
http://localhost:9090/metrics
Using curl
curl http://localhost:9090/metrics
Sample Output
# HELP streambus_broker_uptime_seconds Broker uptime in seconds
# TYPE streambus_broker_uptime_seconds gauge
streambus_broker_uptime_seconds 45.2
# HELP streambus_messages_produced_total Total number of messages produced
# TYPE streambus_messages_produced_total counter
streambus_messages_produced_total 100
# HELP streambus_produce_latency_seconds Produce request latency in seconds
# TYPE streambus_produce_latency_seconds histogram
streambus_produce_latency_seconds_bucket{le="0.001"} 10
streambus_produce_latency_seconds_bucket{le="0.005"} 85
streambus_produce_latency_seconds_bucket{le="0.01"} 100
streambus_produce_latency_seconds_bucket{le="+Inf"} 100
streambus_produce_latency_seconds_sum 0.425
streambus_produce_latency_seconds_count 100
Prometheus Configuration
To scrape metrics with Prometheus, add this to your prometheus.yml:
scrape_configs:
- job_name: 'streambus'
scrape_interval: 15s
static_configs:
- targets: ['localhost:9090']
labels:
env: 'dev'
service: 'streambus'
Grafana Dashboard
Once metrics are in Prometheus, you can create Grafana dashboards to visualize:
Message Throughput
# Messages per second
rate(streambus_messages_produced_total[1m])
rate(streambus_messages_consumed_total[1m])
Latency Percentiles
# 95th percentile produce latency
histogram_quantile(0.95, rate(streambus_produce_latency_seconds_bucket[5m]))
# 99th percentile consume latency
histogram_quantile(0.99, rate(streambus_consume_latency_seconds_bucket[5m]))
Consumer Lag
# Consumer group lag
streambus_consumer_group_lag
Broker Health
# Broker uptime
streambus_broker_uptime_seconds
# Broker status (0=stopped, 1=starting, 2=running, 3=stopping)
streambus_broker_status
# Active connections
streambus_broker_connections
Storage Utilization
# Storage used percentage
(streambus_storage_used_bytes / (streambus_storage_used_bytes + streambus_storage_available_bytes)) * 100
Available Metrics
Broker Metrics
streambus_broker_uptime_seconds- Broker uptime in secondsstreambus_broker_status- Broker status (0-3)streambus_broker_connections- Active client connectionsstreambus_broker_active_requests- Active requests
Message Metrics
streambus_messages_produced_total- Total messages producedstreambus_messages_consumed_total- Total messages consumedstreambus_messages_stored_total- Total messages storedstreambus_bytes_produced_total- Total bytes producedstreambus_bytes_consumed_total- Total bytes consumedstreambus_bytes_stored_total- Total bytes stored
Performance Metrics (Histograms)
streambus_produce_latency_seconds- Produce request latencystreambus_consume_latency_seconds- Consume request latencystreambus_replication_latency_seconds- Replication latencystreambus_commit_latency_seconds- Commit latency
Topic Metrics
streambus_topics_total- Total number of topicsstreambus_partitions_total- Total number of partitionsstreambus_replicas_total- Total number of replicas
Consumer Group Metrics
streambus_consumer_groups_total- Total consumer groupsstreambus_consumer_group_members_total- Consumer group membersstreambus_consumer_group_lag- Consumer group lag
Transaction Metrics
streambus_transactions_active- Active transactionsstreambus_transactions_committed_total- Committed transactionsstreambus_transactions_aborted_total- Aborted transactionsstreambus_transaction_duration_seconds- Transaction duration
Storage Metrics
streambus_storage_used_bytes- Storage space usedstreambus_storage_available_bytes- Storage space availablestreambus_segments_total- Total segmentsstreambus_compactions_total- Total compactions
Network Metrics
streambus_network_bytes_in_total- Total bytes receivedstreambus_network_bytes_out_total- Total bytes sentstreambus_network_requests_total- Total network requestsstreambus_network_errors_total- Total network errors
Security Metrics
streambus_authentication_attempts_total- Authentication attemptsstreambus_authentication_failures_total- Authentication failuresstreambus_authorization_checks_total- Authorization checksstreambus_authorization_denials_total- Authorization denialsstreambus_audit_events_logged_total- Audit events logged
Cluster Metrics
streambus_cluster_size- Number of brokersstreambus_cluster_leader- 1 if leader, 0 otherwisestreambus_raft_term- Current Raft termstreambus_raft_commit_index- Raft commit index
Schema Registry Metrics
streambus_schemas_registered_total- Total registered schemasstreambus_schema_validations_total- Total schema validationsstreambus_schema_validation_errors_total- Schema validation errors
Alerting Examples
High Produce Latency
- alert: HighProduceLatency
expr: histogram_quantile(0.99, rate(streambus_produce_latency_seconds_bucket[5m])) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "High produce latency detected"
description: "99th percentile produce latency is {{ $value }}s"
High Consumer Lag
- alert: HighConsumerLag
expr: streambus_consumer_group_lag > 1000
for: 5m
labels:
severity: warning
annotations:
summary: "High consumer group lag"
description: "Consumer group lag is {{ $value }} messages"
Authentication Failures
- alert: HighAuthenticationFailures
expr: rate(streambus_authentication_failures_total[5m]) > 10
for: 5m
labels:
severity: critical
annotations:
summary: "High authentication failure rate"
description: "{{ $value }} authentication failures per second"
Storage Almost Full
- alert: StorageAlmostFull
expr: (streambus_storage_used_bytes / (streambus_storage_used_bytes + streambus_storage_available_bytes)) > 0.85
for: 10m
labels:
severity: warning
annotations:
summary: "Storage is almost full"
description: "Storage is {{ $value | humanizePercentage }} full"
Best Practices
- Metric Naming: All metrics follow Prometheus naming conventions
- Labels: Use labels for dimensions (broker_id, topic, partition, etc.)
- Cardinality: Be careful with high-cardinality labels
- Histograms: Use appropriate buckets for latency measurements
- Scrape Interval: 15-30 seconds is typical for most use cases
- Retention: Configure Prometheus retention based on your needs
Monitoring Stack
For a complete monitoring solution, consider:
- Prometheus: Time-series database and metrics collection
- Grafana: Visualization and dashboards
- Alertmanager: Alert routing and management
- Node Exporter: Host-level metrics
- Blackbox Exporter: Endpoint monitoring
Next Steps
- Set up Prometheus to scrape StreamBus metrics
- Create custom Grafana dashboards
- Configure alerting rules
- Add OpenTelemetry for distributed tracing
- Integrate with your existing monitoring infrastructure
Documentation
¶
There is no documentation for this package.
Click to show internal directories.
Click to hide internal directories.