graphql monitoring proxy
Creates a passthrough proxy to one or more GraphQL endpoints, allowing you to analyse the queries and responses and produce Prometheus metrics at a fraction of the cost - because, as we know, $0 is a fair price.
This project is in active use by telegram-bot.app, and was tested with 30k queries per second on a single instance, consuming 10 MB of RAM and 0.1% CPU. Benchmarks are available.
Why this project exists
I wanted to monitor the queries and responses of our GraphQL endpoint, but we didn't want to pay the price of the GraphQL server itself (and I will not point fingers at a particular well-known project), as monitoring and basic security features should be standard, free functionality.
Important releases
You should always stick to the latest version of the graphql-proxy to ensure it's as bug-free as possible. The following list will be kept to a maximum of five of the most important bug fixes and enhancements included in recent versions.
- 19/09/2025 - 0.26.x - Major security enhancements: fixed a SQL injection vulnerability in the event cleaner, added path traversal protection, implemented optional API authentication, enhanced log sanitization to prevent sensitive data exposure, and consolidated buffer pool implementations for better performance.
- 06/12/2024 - 0.25.12 - Fixes a bug where deeply nested introspection queries were blocked despite being present on the whitelist. The GraphQL proxy will now inspect queries in depth to find any nested introspection.
- 20/08/2024 - 0.23.21+ - Fixes a bug where timeouts were not respected on the proxy-to-GraphQL connection; affected versions timed out after the 30-second default (thanks to Jurica Železnjak for reporting). It also provides a temporary fix for running within a Kubernetes deployment when the GraphQL server (for example, Hasura) takes longer to start than the proxy, causing an avalanche of "can't proxy the request" errors.
- 19/08/2024 - 0.21.82+ - Fixed an issue where the proxy failed to start if the global cache was disabled (and therefore not initialized) but the proxy still tried to perform cache operations during normal query processing.
How to deploy
You can find example Kubernetes manifests in the example standalone deployment and example combined deployment files. The observed advantage of the combined deployment is that network requests travel via localhost, without leaving the pod, which brings a significant network performance boost.
Note on websocket support
Native WebSocket support is available! Starting with version 0.27.0, the proxy includes native WebSocket support for GraphQL subscriptions. Enable it by setting WEBSOCKET_ENABLE=true.
For backward compatibility or if you prefer routing WebSockets directly to your backend, you can use the Traefik configuration below:
Traefik IngressRoute example for direct WebSocket routing:
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: hasura-internal
spec:
  entryPoints:
    - websecure
  routes:
    # NON WEBSOCKET CONNECTION
    - kind: Rule
      match: Host(`example.com`) && PathPrefix(`/v1/graphql`) && !HeadersRegexp(`Upgrade`, `websocket`)
      services:
        - name: hasura-w-proxy-internal
          port: proxy
      middlewares:
        - name: compression
          namespace: default
    # WEBSOCKET CONNECTION
    - kind: Rule
      match: Host(`example.com`) && PathPrefix(`/v1/graphql`) && HeadersRegexp(`Upgrade`, `websocket`)
      services:
        - name: hasura-w-proxy-internal
          port: hasura
      middlewares:
        - name: compression
          namespace: default
Endpoints
- :8080/* - the graphql passthrough endpoint
- :8080/admin - the admin dashboard (if enabled)
- :9393/metrics - the prometheus metrics endpoint
- :8080/healthz - the healthcheck endpoint
- :8080/livez - the liveness probe endpoint
- :9090/api/* - the monitoring proxy API endpoint
Features
Category | Detail |
---|---|
monitor | Prometheus / VictoriaMetrics metrics |
monitor | Extracting user id from JWT token and adding it as a label to metrics |
monitor | Extracting the query name and type and adding it as a label to metrics |
monitor | Calculating the query duration and adding it to the metrics |
monitor | OpenTelemetry tracing support with configurable endpoint |
monitor | Real-time admin dashboard with live metrics |
speed | Request coalescing to deduplicate concurrent identical queries |
speed | Caching the queries, together with per-query cache and TTL |
speed | Support for READ ONLY graphql endpoint |
speed | Memory-aware caching with compression and eviction |
speed | Native WebSocket support for GraphQL subscriptions |
resilience | Circuit breaker pattern for fault tolerance |
resilience | Retry budget to prevent retry storms |
resilience | Optimized HTTP client with granular timeout controls |
resilience | Structured error responses with retry recommendations |
security | Blocking schema introspection |
security | Rate limiting queries based on user role |
security | Blocking mutations in read-only mode |
security | Allow access only to listed URLs |
security | Ban / unban specific user from accessing the application |
maintenance | Hasura events cleaner |
Configuration
All the environment variables should be prefixed with GMP_
to avoid conflicts with other applications.
If GMP_
prefixed environment variable is present - it will take precedence over the non-prefixed one.
You can still use the non-prefixed environment variables in the spirit of the backward compatibility, but it's not recommended.
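For example, the precedence rule in practice (hypothetical values for illustration):

# The GMP_-prefixed variable wins over the non-prefixed one:
PORT_GRAPHQL=8081
GMP_PORT_GRAPHQL=8080 # the proxy listens on 8080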
Parameter | Description | Default Value |
---|---|---|
MONITORING_PORT | The port to expose the metrics endpoint | 9393 |
PORT_GRAPHQL | The port to expose the graphql endpoint | 8080 |
HOST_GRAPHQL | The host to proxy the graphql endpoint | http://localhost/ |
HOST_GRAPHQL_READONLY | The host to proxy the read-only graphql endpoint | `` |
HEALTHCHECK_GRAPHQL_URL | The URL to check the health of the graphql endpoint | `` |
JWT_USER_CLAIM_PATH | Path to the user claim in the JWT token | `` |
JWT_ROLE_CLAIM_PATH | Path to the role claim in the JWT token | `` |
ROLE_FROM_HEADER | Header name to extract the role from | `` |
ROLE_RATE_LIMIT | Enable request rate limiting based on role | false |
ENABLE_GLOBAL_CACHE | Enable the cache | false |
CACHE_TTL | The cache TTL in seconds | 60 |
CACHE_MAX_MEMORY_SIZE | Maximum memory size for cache in MB | 100 |
CACHE_MAX_ENTRIES | Maximum number of entries in cache | 10000 |
ENABLE_REDIS_CACHE | Enable distributed Redis cache | false |
CACHE_REDIS_URL | URL to redis server / cluster endpoint | localhost:6379 |
CACHE_REDIS_PASSWORD | Redis connection password | `` |
CACHE_REDIS_DB | Redis DB id | 0 |
ENABLE_CIRCUIT_BREAKER | Enable circuit breaker pattern | false |
CIRCUIT_MAX_FAILURES | Consecutive failures before circuit trips | 10 |
CIRCUIT_FAILURE_RATIO | Failure ratio threshold (0.0-1.0) | 0.5 |
CIRCUIT_SAMPLE_SIZE | Min requests for ratio calculation | 100 |
CIRCUIT_TIMEOUT_SECONDS | Seconds circuit stays open | 60 |
CIRCUIT_MAX_HALF_OPEN_REQUESTS | Max requests in half-open state | 5 |
CIRCUIT_RETURN_CACHED_ON_OPEN | Return cached responses when open | true |
CIRCUIT_TRIP_ON_TIMEOUTS | Trip circuit breaker on timeouts | true |
CIRCUIT_TRIP_ON_5XX | Trip circuit breaker on 5XX responses | true |
CIRCUIT_TRIP_ON_4XX | Trip circuit breaker on 4XX responses (except 429) | false |
CIRCUIT_BACKOFF_MULTIPLIER | Exponential backoff multiplier (e.g., 1.5) | 1.0 |
CIRCUIT_MAX_BACKOFF_TIMEOUT | Max timeout in seconds for backoff | 300 |
CLIENT_READ_TIMEOUT | HTTP client read timeout in seconds | `` |
CLIENT_WRITE_TIMEOUT | HTTP client write timeout in seconds | `` |
CLIENT_MAX_IDLE_CONN_DURATION | Max idle connection duration in seconds | 300 |
MAX_CONNS_PER_HOST | Maximum connections per host | 1024 |
CLIENT_DISABLE_TLS_VERIFY | Disable TLS verification | false |
LOG_LEVEL | The log level | info |
BLOCK_SCHEMA_INTROSPECTION | Blocks the schema introspection | false |
ALLOWED_INTROSPECTION | Allow only certain queries in introspection | `` |
ENABLE_ACCESS_LOG | Enable the access log | false |
READ_ONLY_MODE | Enable the read only mode | false |
ALLOWED_URLS | Allow access only to certain URLs | /v1/graphql,/v1/version |
ENABLE_API | Enable the monitoring API | false |
API_PORT | The port to expose the monitoring API | 9090 |
ADMIN_API_KEY | API key for admin endpoint authentication (optional) | `` |
BANNED_USERS_FILE | The path to the file with banned users | /go/src/app/banned_users.json |
PROXIED_CLIENT_TIMEOUT | The timeout for the proxied client in seconds | 120 |
PURGE_METRICS_ON_CRAWL | Purge metrics on each /metrics crawl | false |
PURGE_METRICS_ON_TIMER | Purge metrics every x seconds (0 - disabled) | 0 |
HASURA_EVENT_CLEANER | Enable the hasura event cleaner | false |
HASURA_EVENT_CLEANER_OLDER_THAN | Delete event logs older than this many days | 1 |
HASURA_EVENT_METADATA_DB | URL to the hasura metadata database | postgresql://localhost:5432/hasura |
ENABLE_TRACE | Enable OpenTelemetry tracing | false |
TRACE_ENDPOINT | OpenTelemetry collector endpoint | localhost:4317 |
RETRY_BUDGET_ENABLE | Enable retry budget mechanism | true |
RETRY_BUDGET_TOKENS_PER_SEC | Retry tokens generated per second | 10.0 |
RETRY_BUDGET_MAX_TOKENS | Maximum retry tokens allowed | 100 |
REQUEST_COALESCING_ENABLE | Enable request deduplication | true |
WEBSOCKET_ENABLE | Enable WebSocket support for subscriptions | false |
WEBSOCKET_PING_INTERVAL | WebSocket ping interval in seconds | 30 |
WEBSOCKET_PONG_TIMEOUT | WebSocket pong timeout in seconds | 60 |
WEBSOCKET_MAX_MESSAGE_SIZE | Max WebSocket message size in bytes | 524288 (512KB) |
ADMIN_DASHBOARD_ENABLE | Enable admin dashboard UI | true |
Tracing
The proxy supports OpenTelemetry tracing to help monitor and debug requests. When enabled, it will create spans for each proxied request and send them to the configured OpenTelemetry collector.
To use tracing:
- Enable tracing by setting ENABLE_TRACE=true
- Configure the OpenTelemetry collector endpoint using TRACE_ENDPOINT (defaults to localhost:4317)
- Include trace context in your requests using the X-Trace-Span header with the following format:
{
"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
}
The proxy will extract the trace context from the header and create child spans for each request, allowing you to trace requests through your system.
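For example, a minimal sketch of passing the trace context with curl (assuming the proxy listens on :8080 and serves /v1/graphql):

curl -X POST http://localhost:8080/v1/graphql \
  -H 'Content-Type: application/json' \
  -H 'X-Trace-Span: {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}' \
  -d '{"query":"query { __typename }"}'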
Speed
Request Coalescing
Request coalescing (also known as request deduplication) is a powerful optimization that reduces backend load by combining multiple concurrent identical requests into a single backend call. This feature is enabled by default via REQUEST_COALESCING_ENABLE=true.
How it works:
- When multiple clients send identical GraphQL queries simultaneously, only one request is forwarded to the backend
- All other concurrent identical requests wait for the first request to complete
- Once the response is received, it's shared with all waiting clients
- This can reduce backend load by 50-80% in high-traffic scenarios with repeated queries
Benefits:
- Dramatically reduces backend load during traffic spikes
- Prevents "thundering herd" problems when cache expires
- Improves response times for coalesced requests (they don't need to wait for backend processing)
- Zero additional latency for the primary request
Monitoring:
The admin dashboard (/admin) provides real-time statistics:
- Total requests vs. primary requests
- Number of coalesced requests
- Backend savings percentage
Configuration:
# Enable request coalescing (default: true)
GMP_REQUEST_COALESCING_ENABLE=true
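As a quick sketch, you can observe coalescing by firing a burst of identical requests and then comparing the /admin coalescing statistics; the endpoint and query below are placeholders, adjust them to your schema:

# Fire 50 identical queries concurrently; with coalescing enabled,
# only a fraction of them should reach the backend.
for i in $(seq 1 50); do
  curl -s -o /dev/null -X POST http://localhost:8080/v1/graphql \
    -H 'Content-Type: application/json' \
    -d '{"query":"query PopularProducts { products { id name } }"}' &
done
wait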
Use Cases:
- High-traffic applications with popular queries
- Applications with many concurrent users
- APIs with expensive backend operations
- Mobile/web apps where users often perform the same actions simultaneously
Retry Budget
The retry budget prevents retry storms and cascading failures by limiting the rate at which retries can occur. This is a critical resilience feature enabled by default.
How it works:
- Uses a token bucket algorithm: tokens are generated at a fixed rate
- Each retry attempt consumes one token
- When tokens are exhausted, retries are denied until tokens are refilled
- Automatic refill ensures the system can recover naturally
Benefits:
- Prevents retry storms that can overwhelm recovering backends
- Reduces cascading failures across services
- Maintains predictable load during outages
- Allows graceful degradation instead of complete failure
Configuration:
# Enable retry budget (default: true)
GMP_RETRY_BUDGET_ENABLE=true
# Tokens generated per second (default: 10)
GMP_RETRY_BUDGET_TOKENS_PER_SEC=10.0
# Maximum tokens that can accumulate (default: 100)
GMP_RETRY_BUDGET_MAX_TOKENS=100
Production Recommendations:
- High traffic (1000+ req/s): Set TOKENS_PER_SEC=50, MAX_TOKENS=500
- Medium traffic (100-1000 req/s): Use defaults (10 tokens/s, 100 max)
- Low traffic (<100 req/s): Set TOKENS_PER_SEC=5, MAX_TOKENS=50
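To build intuition for these numbers, here is a minimal token-bucket simulation in bash (illustrative only, not part of the proxy; it mirrors the defaults of 10 tokens/s and 100 max tokens, with a hypothetical retry demand per second):

rate=10; max=100; tokens=$max         # RETRY_BUDGET_* defaults
demand=(5 150 30)                     # hypothetical retries wanted per second
for sec in "${!demand[@]}"; do
  tokens=$(( tokens + rate )); (( tokens > max )) && tokens=$max   # refill
  want=${demand[$sec]}
  grant=$(( want < tokens ? want : tokens ))                       # consume
  tokens=$(( tokens - grant ))
  echo "t=${sec}s demand=${want} granted=${grant} denied=$(( want - grant )) tokens_left=${tokens}"
done

A 150-retry spike in one second is capped at the stored budget plus one second of refill; the remainder is denied instead of hammering a recovering backend.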
Monitoring: The admin dashboard shows:
- Current available tokens
- Total retry attempts
- Denied retries
- Denial rate percentage
WebSocket Support
Native WebSocket support enables GraphQL subscriptions and real-time features. Enable it via WEBSOCKET_ENABLE=true.
Features:
- Bidirectional proxying between client and backend
- Automatic ping/pong keep-alive
- Configurable message size limits
- Connection statistics and monitoring
- Graceful connection handling
Configuration:
# Enable WebSocket support
GMP_WEBSOCKET_ENABLE=true
# Ping interval (seconds)
GMP_WEBSOCKET_PING_INTERVAL=30
# Pong timeout (seconds)
GMP_WEBSOCKET_PONG_TIMEOUT=60
# Max message size (bytes)
GMP_WEBSOCKET_MAX_MESSAGE_SIZE=524288 # 512KB
Example GraphQL Subscription:
subscription OnNewMessage {
messages {
id
content
createdAt
}
}
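A hedged client-side sketch using wscat, assuming your backend speaks the graphql-transport-ws subprotocol (as recent Hasura versions do; the legacy graphql-ws protocol uses different message names):

# Connect through the proxy, then paste the two messages below:
wscat -c ws://localhost:8080/v1/graphql -s graphql-transport-ws
# > {"type":"connection_init","payload":{}}
# > {"id":"1","type":"subscribe","payload":{"query":"subscription OnNewMessage { messages { id content createdAt } }"}}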
Monitoring:
The admin dashboard (/admin) provides:
- Active WebSocket connections
- Total connections handled
- Messages sent/received
- Connection errors
Caching
The cache engine runs in the background by default, using no additional resources. You can start using the cache by setting the ENABLE_GLOBAL_CACHE or ENABLE_REDIS_CACHE environment variable to true, which enables caching for all queries except introspection. Alternatively, you can leave the global cache disabled and enable caching for specific queries by adding the @cached directive to the query.
Important: The cache key is calculated from the entire request body, which includes both the GraphQL query and the variables. This means:
- Identical queries with different variable values get their own cache entries
- This ensures correct caching behaviour for parameterized queries
Example:
# These two requests will have DIFFERENT cache keys:
query GetUser($id: ID!) { user(id: $id) { name } }
variables: { "id": "123" }
query GetUser($id: ID!) { user(id: $id) { name } }
variables: { "id": "456" }
In the case of the @cached directive, you can add parameters that set the cache TTL for a specific query. For example, query MyCachedQuery @cached(ttl: 90) ... will cache the query result for 90 seconds.
You can also set the cache for a specific query by using the X-Cache-Graphql-Query header; for example, X-Cache-Graphql-Query: 90 will cache the query for 90 seconds.
You can also force refresh of the cache by using @cached(refresh: true)
directive in the query, for example:
query MyProducts @cached(refresh: true) {
products {
id
name
}
}
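A per-request equivalent using the header instead of the directive (a sketch; the path and query are placeholders):

curl -X POST http://localhost:8080/v1/graphql \
  -H 'Content-Type: application/json' \
  -H 'X-Cache-Graphql-Query: 90' \
  -d '{"query":"query MyProducts { products { id name } }"}'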
Memory-Aware Caching
Starting with version 0.26.0, the memory cache implementation has been enhanced with memory-aware features to prevent out-of-memory situations:
- Memory limits: Set maximum memory usage via CACHE_MAX_MEMORY_SIZE (default: 100 MB)
- Entry limits: Set maximum number of entries via CACHE_MAX_ENTRIES (default: 10,000)
- Smart eviction: When limits are reached, the cache automatically evicts the least recently used entries
- Compression: Large cache entries are automatically compressed to reduce memory footprint
- Memory monitoring: Memory usage is tracked and reported in metrics
Example configurations:
Basic memory-aware caching:
GMP_ENABLE_GLOBAL_CACHE=true
GMP_CACHE_TTL=60
GMP_CACHE_MAX_MEMORY_SIZE=100
GMP_CACHE_MAX_ENTRIES=10000
High-performance caching for large responses:
GMP_ENABLE_GLOBAL_CACHE=true
GMP_CACHE_TTL=300
GMP_CACHE_MAX_MEMORY_SIZE=500
GMP_CACHE_MAX_ENTRIES=5000
Resource-constrained environment:
GMP_ENABLE_GLOBAL_CACHE=true
GMP_CACHE_TTL=120
GMP_CACHE_MAX_MEMORY_SIZE=50
GMP_CACHE_MAX_ENTRIES=1000
These features ensure the cache runs efficiently even under high load and with large response payloads. The memory-aware cache prevents memory leaks and resource exhaustion while maintaining performance benefits.
Since version 0.5.30 the cache is gzipped in memory, which optimises memory usage quite significantly.
Since version 0.15.48 you can also use the distributed Redis cache.
Read-only endpoint
You can specify a read-only GraphQL endpoint by setting the HOST_GRAPHQL_READONLY environment variable. The default value is empty, which prevents the proxy from using a read-only endpoint and directs all requests to the main endpoint specified in HOST_GRAPHQL. If HOST_GRAPHQL_READONLY is set, the proxy will send query operations to the read-only endpoint and mutation operations to the main endpoint. The format of the read-only endpoint is the same as for HOST_GRAPHQL, for example http://localhost:8080/.
You can check out the example of combined deployment with RW and read-only hasura.
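A minimal sketch of such a split (hostnames are hypothetical):

# Mutations go to the primary, queries to the read-only replica:
GMP_HOST_GRAPHQL=http://hasura-primary:8080/
GMP_HOST_GRAPHQL_READONLY=http://hasura-readonly:8080/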
Resilience
Circuit Breaker Pattern
The proxy implements an advanced circuit breaker pattern to prevent cascading failures when backend services are unstable. When enabled via ENABLE_CIRCUIT_BREAKER=true, the proxy monitors for failures and automatically trips the circuit based on configurable thresholds.
Key features:
- Dual tripping strategies: Trip on consecutive failures OR failure ratio
- Automatic recovery: The circuit breaker will automatically attempt recovery after a timeout period
- Health monitoring endpoint: Check circuit breaker status via /api/circuit-breaker/health
- Configurable thresholds: Set failure thresholds, timeouts, and recovery behavior
- Fallback mechanism: Can serve cached responses when the circuit is open
- Selective error filtering: Configure which HTTP status codes trigger failures
- Exponential backoff: Optional progressive timeout increases for repeated failures
Production-Ready Configuration for High Traffic
For high-traffic production environments, use these recommended settings:
# Basic circuit breaker configuration
GMP_ENABLE_CIRCUIT_BREAKER=true
GMP_CIRCUIT_MAX_FAILURES=10 # Tolerant of transient failures
GMP_CIRCUIT_FAILURE_RATIO=0.5 # Trip at 50% failure rate
GMP_CIRCUIT_SAMPLE_SIZE=100 # Statistically significant sample
GMP_CIRCUIT_TIMEOUT_SECONDS=60 # 1 minute recovery window
GMP_CIRCUIT_MAX_HALF_OPEN_REQUESTS=5 # More probe requests for validation
# Caching fallback
GMP_CIRCUIT_RETURN_CACHED_ON_OPEN=true
# Error type configuration
GMP_CIRCUIT_TRIP_ON_TIMEOUTS=true
GMP_CIRCUIT_TRIP_ON_5XX=true
GMP_CIRCUIT_TRIP_ON_4XX=false # 4xx are usually client errors
# Backoff configuration (optional)
GMP_CIRCUIT_BACKOFF_MULTIPLIER=1.0 # No backoff by default
GMP_CIRCUIT_MAX_BACKOFF_TIMEOUT=300 # 5 minutes maximum
All Circuit Breaker Configuration Options
- ENABLE_CIRCUIT_BREAKER: Enable the circuit breaker pattern (default: false)
- CIRCUIT_MAX_FAILURES: Consecutive failures before circuit trips (default: 10)
- CIRCUIT_FAILURE_RATIO: Failure ratio threshold 0.0-1.0 (default: 0.5)
- CIRCUIT_SAMPLE_SIZE: Minimum requests for ratio calculation (default: 100)
- CIRCUIT_TIMEOUT_SECONDS: Seconds circuit stays open (default: 60)
- CIRCUIT_MAX_HALF_OPEN_REQUESTS: Max requests in half-open state (default: 5)
- CIRCUIT_RETURN_CACHED_ON_OPEN: Return cached responses when open (default: true)
- CIRCUIT_TRIP_ON_TIMEOUTS: Count timeouts as failures (default: true)
- CIRCUIT_TRIP_ON_5XX: Count 5XX responses as failures (default: true)
- CIRCUIT_TRIP_ON_4XX: Count 4XX responses as failures, except 429 (default: false)
- CIRCUIT_BACKOFF_MULTIPLIER: Exponential backoff multiplier, e.g., 1.5 (default: 1.0)
- CIRCUIT_MAX_BACKOFF_TIMEOUT: Maximum timeout in seconds for backoff (default: 300)
Example configurations:
Minimal circuit breaker configuration:
GMP_ENABLE_CIRCUIT_BREAKER=true
GMP_CIRCUIT_MAX_FAILURES=5
GMP_CIRCUIT_TIMEOUT_SECONDS=30
Production-ready circuit breaker with fallback:
GMP_ENABLE_CIRCUIT_BREAKER=true
GMP_CIRCUIT_MAX_FAILURES=3
GMP_CIRCUIT_TIMEOUT_SECONDS=15
GMP_CIRCUIT_MAX_HALF_OPEN_REQUESTS=1
GMP_CIRCUIT_RETURN_CACHED_ON_OPEN=true
GMP_CIRCUIT_TRIP_ON_TIMEOUTS=true
GMP_CIRCUIT_TRIP_ON_5XX=true
Aggressive circuit breaking for critical systems:
GMP_ENABLE_CIRCUIT_BREAKER=true
GMP_CIRCUIT_MAX_FAILURES=1
GMP_CIRCUIT_TIMEOUT_SECONDS=60
GMP_CIRCUIT_MAX_HALF_OPEN_REQUESTS=1
GMP_CIRCUIT_RETURN_CACHED_ON_OPEN=true
GMP_CIRCUIT_TRIP_ON_TIMEOUTS=true
GMP_CIRCUIT_TRIP_ON_5XX=true
Enhanced HTTP Client
The proxy includes an optimized HTTP client with granular controls for timeouts, connection pooling, and TLS verification. This helps improve performance and reliability when communicating with backend GraphQL servers.
Configuration:
- CLIENT_READ_TIMEOUT: HTTP client read timeout in seconds
- CLIENT_WRITE_TIMEOUT: HTTP client write timeout in seconds
- CLIENT_MAX_IDLE_CONN_DURATION: Maximum duration to keep idle connections open (default: 300 seconds)
- MAX_CONNS_PER_HOST: Maximum number of connections per host (default: 1024)
- CLIENT_DISABLE_TLS_VERIFY: Disable TLS certificate verification (default: false)
GraphQL Parsing Optimizations
Version 0.26.0 includes several optimizations to GraphQL query parsing and execution:
- Query parsing cache: Identical queries are parsed only once, improving performance for repeated queries
- Efficient mutation detection: Optimized logic for identifying and routing mutations
- Memory efficiency: Improved memory management during GraphQL operations
- Enhanced introspection handling: Better security for introspection queries
These optimizations are applied automatically with no configuration required, resulting in improved performance and reduced resource usage, especially for high-traffic deployments.
Example configurations:
High-performance client for low-latency environments:
GMP_CLIENT_READ_TIMEOUT=1
GMP_CLIENT_WRITE_TIMEOUT=1
GMP_CLIENT_MAX_IDLE_CONN_DURATION=60
GMP_MAX_CONNS_PER_HOST=2048
Client for high-reliability environments:
GMP_CLIENT_READ_TIMEOUT=5
GMP_CLIENT_WRITE_TIMEOUT=5
GMP_CLIENT_MAX_IDLE_CONN_DURATION=120
GMP_MAX_CONNS_PER_HOST=1024
Connection Resilience and Startup Management
The proxy includes comprehensive connection resilience features to handle backend GraphQL endpoint startup delays and connection recovery scenarios.
Startup Readiness Probe
The proxy can wait for the GraphQL backend to become available before accepting traffic, preventing failed requests during backend startup:
# Wait up to 5 minutes for backend to be ready (default: 300 seconds)
GMP_BACKEND_STARTUP_TIMEOUT=300
When enabled, the proxy will:
- Perform periodic health checks against the GraphQL backend during startup
- Use exponential backoff with jitter for health check retries
- Log startup progress and backend readiness status
- Start accepting traffic only after backend is confirmed healthy
- Continue startup if backend doesn't respond within the timeout (with warnings)
Backend Health Monitoring
Continuous health monitoring runs in the background to detect backend availability:
- Health Check Interval: 5 seconds
- Health Check Method: Minimal GraphQL introspection query ({__typename})
- Failure Tracking: Consecutive failure counting with automatic recovery detection
- Integration: Works with circuit breaker and retry mechanisms
Intelligent Retry with Connection Awareness
Enhanced retry mechanism that adapts based on backend health and error types:
Normal Operation (Healthy Backend):
- 7 retry attempts
- Initial delay: 500ms
- Maximum delay: 10 seconds
- Exponential backoff
Degraded Operation (Unhealthy Backend):
- 10 retry attempts
- Initial delay: 2 seconds
- Maximum delay: 30 seconds
- Longer delays to account for backend recovery time
Error Classification:
- Connection errors (connection refused, reset, etc.): Retryable
- Timeout errors: Limited retries to prevent cascade failures
- 4xx client errors: Generally not retryable (except 429, 503)
- 5xx server errors: Retryable with backoff
Connection Pool with Auto-Recovery
Advanced connection pool management with automatic health monitoring and recovery:
Keep-Alive Mechanism:
- Interval: 15 seconds
- Lightweight GraphQL queries to maintain connection health
- Automatic failure detection and recovery
Connection Recovery:
- Recovery check interval: 60 seconds
- Automatic connection pool reset after 5+ consecutive failures
- Coordinated with backend health status
Connection Statistics Tracking:
- Active connection count
- Total connection attempts
- Failure rate monitoring
- Last recovery attempt timestamp
Graceful Degradation
When the backend is unavailable, the proxy provides graceful degradation:
Cache Fallback (if circuit breaker configured):
- Serve cached responses when backend is unavailable
- Automatic cache hit metrics tracking
Informative Error Responses:
- Standard GraphQL error format with helpful extensions
- Includes retry recommendations and timeout information
- Maintains API contract even during failures
Example Error Response:
{
"errors": [{
"message": "GraphQL backend is temporarily unavailable",
"extensions": {
"code": "SERVICE_UNAVAILABLE",
"retryable": true,
"retry_after": 60
}
}],
"data": null
}
Monitoring and Observability
Connection resilience provides extensive monitoring through API endpoints:
Backend Health Endpoint: /api/backend/health
{
"status": "healthy",
"backend_url": "http://graphql-backend:4000",
"last_health_check": "2024-01-15T10:30:00Z",
"consecutive_failures": 0,
"check_interval": "5s"
}
Connection Pool Health Endpoint: /api/connection-pool/health
{
"status": "healthy",
"active_connections": 12,
"total_connections": 1547,
"connection_failures": 2,
"last_recovery_attempt": "2024-01-15T09:15:00Z",
"cleanup_interval": "30s",
"keepalive_interval": "15s",
"recovery_check_interval": "60s"
}
Production Configuration Example
For high-availability production environments:
# Backend startup management
GMP_BACKEND_STARTUP_TIMEOUT=600 # 10 minutes for complex backends
# Enhanced connection pool
GMP_MAX_CONNS_PER_HOST=2048
GMP_CLIENT_MAX_IDLE_CONN_DURATION=300
# Circuit breaker for graceful degradation
GMP_ENABLE_CIRCUIT_BREAKER=true
GMP_CIRCUIT_RETURN_CACHED_ON_OPEN=true
GMP_CIRCUIT_MAX_FAILURES=5
GMP_CIRCUIT_TIMEOUT_SECONDS=120
# Caching for fallback responses
GMP_ENABLE_GLOBAL_CACHE=true
GMP_CACHE_TTL=300
This configuration provides:
- Extended startup patience for complex GraphQL backends
- High connection capacity with efficient pooling
- Circuit breaker protection with cache fallback
- 5-minute cache retention for fallback scenarios
Maintenance
Hasura event cleaner
When enabled via HASURA_EVENT_CLEANER=true, the proxy needs direct access to the database to execute simple delete queries on a schedule. You can specify the number of days the logs should be kept using HASURA_EVENT_CLEANER_OLDER_THAN; for example, HASURA_EVENT_CLEANER_OLDER_THAN=14 will keep 14 days of event execution logs. The ticker managing the cleaner routine runs every hour.
The following tables are cleaned:
hdb_catalog.event_invocation_logs
hdb_catalog.event_log
hdb_catalog.hdb_action_log
hdb_catalog.hdb_cron_event_invocation_logs
hdb_catalog.hdb_scheduled_event_invocation_logs
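A minimal sketch of enabling the cleaner (the connection string is a placeholder):

GMP_HASURA_EVENT_CLEANER=true
GMP_HASURA_EVENT_CLEANER_OLDER_THAN=14 # keep 14 days of event logs
GMP_HASURA_EVENT_METADATA_DB=postgresql://user:password@postgres:5432/hasura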
Security
Advanced Rate Limiting
The proxy supports multiple rate limiting strategies to protect your GraphQL endpoint from abuse:
Role-based Rate Limiting
Enable rate limiting based on user roles using the ROLE_RATE_LIMIT environment variable. The proxy extracts the role from JWT tokens or headers and applies the appropriate limits.
Configuration:
- JWT_ROLE_CLAIM_PATH: Path to the role claim in the JWT token
- ROLE_FROM_HEADER: Header name to extract the role from (takes priority over JWT)
- ROLE_RATE_LIMIT: Enable role-based rate limiting (default: false)
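A minimal header-based sketch (the header name is an example, as shown below):

# Take the role from a request header instead of the JWT and enforce per-role limits:
GMP_ROLE_RATE_LIMIT=true
GMP_ROLE_FROM_HEADER=X-Hasura-Role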
Features:
- Dynamic configuration reload: Rate limit configuration is automatically reloaded periodically without restart
- Burst control: Optional burst limits for handling traffic spikes
- Per-endpoint limits: Different rate limits for specific GraphQL endpoints
- IP-based limiting: Additional rate limiting by client IP address
Available interval values: nano, micro, milli, second, minute, hour, day, or duration strings like 5s, 10m.
Basic Rate Limit Configuration (ratelimit.json):
{
"ratelimit": {
"admin": {
"req": 100,
"interval": "second"
},
"guest": {
"req": 50,
"interval": "minute"
},
"-": { // Default/fallback role
"req": 100,
"interval": "day"
}
}
}
Production-Ready Rate Limit Configuration for High Traffic
{
"ratelimit": {
"admin": {
"req": 1000,
"interval": "second",
"burst": 2000, // Allow bursts up to 2000 requests
"endpoints": ["/v1/graphql", "/v1/relay"] // Optional endpoint-specific limits
},
"premium": {
"req": 500,
"interval": "second",
"burst": 1000
},
"standard": {
"req": 100,
"interval": "second",
"burst": 200
},
"guest": {
"req": 10,
"interval": "second",
"burst": 20
},
"-": { // Default/fallback role - deny by default for security
"req": 5,
"interval": "second"
}
}
}
If you'd like to change it, mount your configmap as the /app/ratelimit.json file.
Remember to include the - role, which is used for unauthenticated users or when the claim can't be found for any reason.
If the rate limit has been reached, the proxy will return a 429 Too Many Requests error.
Read-only mode
You can enable read-only mode by setting the READ_ONLY_MODE environment variable to true, which will block all mutation queries.
Allowing access to listed URLs
You can allow access only to certain URLs by setting the ALLOWED_URLS environment variable to a comma-separated list of URLs. If enabled, other URLs will return a 403 Forbidden error and the request will not reach the proxied service.
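For example, restricting the proxy to the default set of paths:

# Any other path returns 403 Forbidden before reaching the backend:
GMP_ALLOWED_URLS=/v1/graphql,/v1/version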
Blocking introspection
You can block schema introspection by setting the BLOCK_SCHEMA_INTROSPECTION environment variable to true, which will block all queries containing introspection parts, like:
__schema, __type, __typename, __directive, __directivelocation, __field, __inputvalue, __enumvalue, __typekind, __fieldtype, __inputobjecttype, __enumtype, __uniontype, __scalars, __objects, __interfaces, __unions, __enums, __inputobjects, __directives
If you'd like to keep schema introspection blocking on but allow one or more items from the list above for any reason, you can use the ALLOWED_INTROSPECTION environment variable to specify the list of allowed queries.
ALLOWED_INTROSPECTION="__typename,__type"
Security Best Practices
The GraphQL monitoring proxy implements several security measures to protect your GraphQL endpoints:
- Input Validation: All user inputs are validated and sanitized to prevent injection attacks. File paths are validated to prevent path traversal attacks.
- Parameterized Queries: Database queries use parameterized statements to prevent SQL injection vulnerabilities.
- Log Sanitization: Sensitive data (passwords, tokens, API keys, credit cards, SSNs) is automatically redacted from debug logs to prevent information disclosure.
- Optional API Authentication: Admin endpoints can be protected with API key authentication when needed, while supporting network-level security for internal deployments.
- Rate Limiting: Role-based rate limiting prevents abuse and DDoS attacks.
- GraphQL Query Complexity: The proxy can analyze and limit query complexity to prevent resource exhaustion attacks.
For production deployments, we recommend:
- Running the proxy in a secure network segment (VPC, Kubernetes cluster)
- Using TLS for all connections
- Enabling authentication for admin APIs in less secure environments
- Implementing proper monitoring and alerting
- Regularly updating to the latest version for security patches
API endpoints
Authentication
The admin API endpoints support optional authentication for flexibility in different deployment scenarios:
- Without Authentication (default): When ADMIN_API_KEY or GMP_ADMIN_API_KEY is not set, the API endpoints are accessible without authentication. This is suitable for internal services protected by network segmentation (firewalls, VPCs, Kubernetes network policies, service mesh, etc.).
- With Authentication: When ADMIN_API_KEY or GMP_ADMIN_API_KEY is set, all admin API requests must include the X-API-Key header with the matching key. This provides application-level security for deployments in less secure environments.
Example with authentication enabled:
curl -X POST \
http://localhost:9090/api/cache-clear \
-H 'X-API-Key: your-secret-key-here' \
-H 'Content-Type: application/json'
Ban or unban the user
Your monitoring system can detect a user misbehaving, for example trying to extract or scrape data. To prevent the user from doing so, you can use the simple API to ban them from accessing the application.
To do so, enable the API by setting the environment variable ENABLE_API=true, which will expose the API on the port defined by API_PORT=9090. When deployed internally, keep it secure by not exposing it outside of your cluster. For additional security, set ADMIN_API_KEY to require authentication.
Then you can use the following endpoints:
- POST /api/user-ban - ban the user from accessing the application
- POST /api/user-unban - unban the user from accessing the application
Cache operations
- POST /api/cache-clear - clear the cache
- GET /api/cache-stats - get the cache statistics (hits, misses, size)
Circuit Breaker Health
- GET /api/circuit-breaker/health - get the circuit breaker health status
The circuit breaker health endpoint returns detailed information about the circuit state:
- Current state (healthy/recovering/unhealthy)
- Request counts and failure statistics
- Current configuration
Example response:
{
"status": "healthy",
"state": "closed",
"counts": {
"requests": 1000,
"total_successes": 950,
"total_failures": 50,
"consecutive_successes": 10,
"consecutive_failures": 0
},
"configuration": {
"max_failures": 10,
"failure_ratio": 0.5,
"sample_size": 100,
"timeout_seconds": 60,
"max_half_open_reqs": 5,
"backoff_multiplier": 1.0
}
}
Both ban/unban endpoints require the user_id and reason parameters to be present in the request body.
Example request without authentication (internal deployment):
curl -X POST \
http://localhost:9090/api/user-ban \
-H 'Content-Type: application/json' \
-d '{
"user_id": "1337",
"reason": "Scraping data"
}'
Example request with authentication enabled:
curl -X POST \
http://localhost:9090/api/user-ban \
-H 'X-API-Key: your-secret-key-here' \
-H 'Content-Type: application/json' \
-d '{
"user_id": "1337",
"reason": "Scraping data"
}'
Ban details will be stored in the banned_users.json file, which you can mount as a file or configmap at the /go/src/app/banned_users.json path (or use the BANNED_USERS_FILE environment variable to specify the path). The file persistence is important if you have multiple instances of the proxy running, as it allows you to ban the user across all instances.
Admin Dashboard
The admin dashboard provides a real-time, web-based interface for monitoring proxy performance and health. Access it at /admin or /admin/dashboard on the main proxy port (default: :8080/admin).
Features:
- Real-time metrics: Auto-refreshes every 5 seconds
- System health: Backend GraphQL and Redis connectivity status
- Circuit breaker: Current state, configuration, and statistics
- Request coalescing: Deduplication rate and backend savings
- Retry budget: Available tokens and denial rate
- WebSocket: Active connections and message statistics
- Connection pool: Active connections and health status
- Cache statistics: Hit/miss rates and memory usage
Configuration:
# Enable admin dashboard (default: true)
GMP_ADMIN_DASHBOARD_ENABLE=true
Security Considerations:
- The dashboard is accessible on the main proxy port
- For production, consider:
- Using Kubernetes NetworkPolicies to restrict access
- Adding authentication via ingress/service mesh
- Disabling the dashboard in production if not needed
- Using port-forwarding for administrative access
Dashboard Sections:
- System Health
  - Overall health status (healthy/unhealthy)
  - Backend GraphQL connectivity
  - Redis connectivity (if enabled)
  - Response times for health checks
- Key Metrics
  - Request coalescing rate (% of backend savings)
  - Retry budget tokens available
  - Active WebSocket connections
  - Active connection pool connections
- Circuit Breaker
  - Current state (closed/half-open/open)
  - Configuration (max failures, timeout, etc.)
  - Recent statistics
- Detailed Statistics
  - Request coalescing: Total, primary, and coalesced requests with backend savings percentage
  - Retry budget: Current tokens, max tokens, total attempts, denied retries, and denial rate
  - Control actions: Reset statistics, clear cache
API Endpoints: The dashboard fetches data from these API endpoints:
- GET /admin/api/health - System health status
- GET /admin/api/circuit-breaker - Circuit breaker status
- GET /admin/api/coalescing - Request coalescing statistics
- GET /admin/api/retry-budget - Retry budget statistics
- GET /admin/api/websocket - WebSocket connection statistics
- GET /admin/api/connections - Connection pool statistics
- POST /admin/api/coalescing/reset - Reset coalescing stats
- POST /admin/api/retry-budget/reset - Reset retry budget stats
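For example, a quick command-line check of the same data the dashboard renders (assumes the dashboard is enabled on the default proxy port; jq is optional):

curl -s http://localhost:8080/admin/api/health | jq .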
General
Metrics which matter
You can enable the PURGE_METRICS_ON_CRAWL environment variable to purge the metrics on each /metrics crawl. This lets you see only the current metrics, without potential leftovers from previous crawls, which is useful if you want to monitor the metrics in real time and / or limit the amount of data ingested into the monitoring system. When enabled, you will most likely need to update your monitoring queries.
With PURGE_METRICS_ON_CRAWL enabled, the graphql_proxy_requests_failed, graphql_proxy_requests_skipped and graphql_proxy_requests_succesful metrics will remain between resets.
If you prefer more control over metrics purging, you can enable the PURGE_METRICS_ON_TIMER environment variable and set the interval in seconds. This allows you to purge the metrics on a regular basis, for example every 90 seconds. It can be a better solution if multiple crawlers check the metrics endpoint and you want to avoid metrics being purged by, for example, a healthcheck.
Healthcheck
If you'd like the /healthz endpoint to perform an actual connectivity check against the GraphQL endpoint, set the HEALTHCHECK_GRAPHQL_URL environment variable to the exact URL of the GraphQL endpoint. The query executed will be query { __typename } and if the response is not 200 OK, the healthcheck will fail. Remember that this is the full URL you'd like to check, including protocol, host and path - for example http://localhost:8080/v1/graphql. It is NOT the same as the value of the HOST_GRAPHQL environment variable, which should provide only the host, without a path, ending with a slash.
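A minimal sketch of wiring this up (the backend hostname is hypothetical):

# Full URL of the backend GraphQL endpoint to check:
GMP_HEALTHCHECK_GRAPHQL_URL=http://hasura:8080/v1/graphql
# The proxy's own healthcheck then reflects backend connectivity:
curl -i http://localhost:8080/healthz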
Monitoring endpoint
Example metrics produced by the proxy:
graphql_proxy_timed_query_bucket{cached="false",user_id="-",op_type="mutation",op_name="updateUserDetails",vmrange="1.000e-02...1.136e-02"} 6
graphql_proxy_timed_query_count{op_name="",cached="false",user_id="-",op_type=""} 78
graphql_proxy_timed_query_bucket{op_name="MyQuery",cached="false",user_id="-",op_type="query",vmrange="5.995e+00...6.813e+00"} 1
graphql_proxy_timed_query_sum{op_name="MyQuery",cached="false",user_id="-",op_type="query"} 6
graphql_proxy_timed_query_count{op_name="MyQuery",cached="false",user_id="-",op_type="query"} 1
graphql_proxy_executed_query{user_id="-",op_type="mutation",op_name="updateKnownSpammer",cached="false"} 1486
graphql_proxy_executed_query{user_id="-",op_type="query",op_name="checkIfAdminsNeedRefreshing",cached="false"} 13167
graphql_proxy_executed_query{user_id="1337",op_type="query",op_name="checkIfKnownMedia",cached="false"} 429
graphql_proxy_executed_query{user_id="-",op_type="query",op_name="checkIfSpamAIRequiresUpdate",cached="false"} 8891
graphql_proxy_requests_failed 324
graphql_proxy_requests_skipped 0
graphql_proxy_requests_succesful 454823
graphql_proxy_cache_hit{microservice="graphql_proxy",pod="hasura-w-proxy-internal-6b5f4b4bbb-9xwfc"} 7
graphql_proxy_cache_hit{pod="hasura-w-proxy-internal-6b5f4b4bbb-9xwfc",microservice="graphql_proxy"} 1
graphql_proxy_cache_miss{microservice="graphql_proxy",pod="hasura-w-proxy-internal-6b5f4b4bbb-9xwfc"} 23
Source Files
- admin_dashboard.go
- api.go
- backend_health.go
- buffer_pool.go
- circuit_breaker_metrics.go
- connection_pool.go
- details.go
- errors.go
- events.go
- graphql.go
- lru_cache.go
- main.go
- metrics_aggregator.go
- monitoring.go
- proxy.go
- ratelimit.go
- ratelimit_errors.go
- request_coalescing.go
- retry_budget.go
- rps_tracker.go
- server.go
- shutdown.go
- struct_config.go
- websocket.go