Documentation ¶
Index ¶
Constants ¶
const SelfServiceName = "otelcontext"
SelfServiceName is the value of the OTel service.name attribute that the binary attaches to its own self-instrumentation spans. It mirrors the literal in main.initTracerProvider; keep the two in sync.
Variables ¶
This section is empty.
Functions ¶
This section is empty.
Types ¶
type Config ¶
type Config struct {
Env string
LogLevel string
HTTPPort string
GRPCPort string
DBDriver string
DBDSN string
DLQPath string
DLQReplayInterval string
// Ingestion Filtering
IngestMinSeverity string
IngestAllowedServices string
IngestExcludedServices string
// DB Connection Pool
DBMaxOpenConns int
DBMaxIdleConns int
DBConnMaxLifetime string // e.g. "1h", "30m"
// Postgres-only opt-in: declarative range partitioning of the logs table by
// day. When set to "daily", AutoMigrate provisions logs as a partitioned
// table and the PartitionScheduler creates lookahead partitions and drops
// expired ones (dropping a whole partition beats a row-by-row DELETE for
// retention by orders of magnitude). Greenfield only: startup refuses if
// `logs` already exists as a non-partitioned table. Empty / "none" = legacy
// unpartitioned schema.
DBPostgresPartitioning string
// Number of future daily partitions to maintain ahead of "today" when
// DBPostgresPartitioning=daily. Defaults to 3. Tune up if your retention
// policy is short and ingest spikes around a daily boundary.
DBPartitionLookaheadDays int
// Retention
HotRetentionDays int
// Retention tuning. Defaults (batch=50000, sleep=1ms) work for Postgres at
// 100k logs/sec sustained. Lower on resource-constrained hosts; raise on
// dedicated DB machines. 0/negative values use defaults.
RetentionBatchSize int
RetentionBatchSleepMs int
// TSDB
TSDBRingBufferDuration string // e.g. "1h"
// Smart Observability — Adaptive Sampling
SamplingRate float64
SamplingAlwaysOnErrors bool
SamplingLatencyThresholdMs int
// Smart Observability — Metric Cardinality
MetricAttributeKeys string // comma-separated allowlist
MetricMaxCardinality int
// Per-tenant cardinality cap. 0 = unlimited (only the global cap
// applies, preserving legacy single-tenant behavior). Setting this
// gives every tenant its own series budget so a noisy tenant cannot
// starve siblings of fresh series in the in-memory TSDB. The global
// cap (MetricMaxCardinality) remains a backstop and is checked
// after the per-tenant cap.
MetricMaxCardinalityPerTenant int
// DLQ Safety
DLQMaxFiles int
DLQMaxDiskMB int
DLQMaxRetries int
// DLQMaxReplayPerTick caps how many DLQ files the replay worker attempts
// in a single tick. Without it, an outage that filled the DLQ with 10k
// files would replay all of them in the first post-restart tick,
// hammering the (just-restarted) DB and exhausting connections.
// 0 = unlimited (legacy default).
DLQMaxReplayPerTick int
// API Protection
APIRateLimitRPS int
// MCP Server
MCPEnabled bool
MCPPath string
// MCPMaxConcurrent caps the in-flight tools/call invocations server-wide.
// Beyond this, callers receive a JSON-RPC server-overloaded error. <=0
// disables the cap. Default 32 — sized for tight agent polling loops
// without overrunning the GraphRAG in-memory store.
MCPMaxConcurrent int
// MCPCallTimeoutMs is the per-invocation deadline for tools/call. A tool
// that exceeds it gets cancelled and the client receives an RPC timeout
// error. <=0 disables the deadline. Default 30000 (30s).
MCPCallTimeoutMs int
// MCPCacheTTLMs is the lifetime of a memoized tool result for the cheap
// in-memory GraphRAG tools (get_service_map, impact_analysis, etc.).
// <=0 disables caching. Default 5000 (5s).
MCPCacheTTLMs int
// Compression
CompressionLevel string // "default", "fast", "best"
// Vector Index
VectorIndexMaxEntries int
// VectorIndexSnapshotPath is the on-disk location for periodic vectordb
// snapshots. When empty, persistence is disabled and the index rebuilds
// from DB on every restart (legacy behaviour). Default
// "data/vectordb.snapshot".
VectorIndexSnapshotPath string
// VectorIndexSnapshotInterval, e.g. "5m". When set and
// VectorIndexSnapshotPath is non-empty, the index serializes its state
// to disk on this cadence. "0" / empty disables periodic writes (a
// final snapshot still fires on graceful shutdown). Default "5m".
VectorIndexSnapshotInterval string
// LogFTSEnabled toggles SQLite FTS5 provisioning + querying. The FTS5
// inverted index typically consumes 30-40% of SQLite DB disk for
// log-heavy workloads, while the LIKE fallback (log_repo.go:105) keeps
// search_logs functional without it. Default false; opt in with
// LOG_FTS_ENABLED=true. Only meaningful on SQLite; Postgres uses pg_trgm
// independently of this flag.
LogFTSEnabled bool
// GraphRAG worker count (background consumers of the ingestion event channel).
// Defaults to 4 if unset or <=0. Increase under sustained high ingest.
GraphRAGWorkerCount int
// GraphRAG event channel buffer size. Defaults to 10000 if unset or <=0.
GraphRAGEventQueueSize int
// Async ingest pipeline (Phase 1 robustness work). Decouples OTLP Export
// from synchronous DB writes. When enabled, Export() returns as soon as
// the parsed batch is enqueued; persistence runs on a worker pool.
//
// Backpressure is hybrid, keyed on queue fullness:
//   <90%:      accept all
//   90%-100%:  drop healthy batches (silent); errors/slow always pass
//   100%:      return RESOURCE_EXHAUSTED so OTLP clients back off
IngestAsyncEnabled bool // default true; opt out via INGEST_ASYNC_ENABLED=false
IngestPipelineQueueSize int // default 50000 batches; per-deployment tunable
IngestPipelineWorkers int // default 8 worker goroutines
// IngestPipelinePerTenantCap caps in-flight batches per tenant so a noisy
// tenant cannot starve siblings of fresh queue slots when fullness is
// below the soft-backpressure threshold. 0 (default) disables — single-
// tenant deployments need no cap. Operators on multi-tenant deployments
// should set INGEST_PIPELINE_PER_TENANT_CAP to roughly Capacity/N where
// N is the expected number of concurrently-active tenants, with some
// headroom (e.g. 2× the fair-share value) for short bursts.
IngestPipelinePerTenantCap int
// TLS (HTTP + gRPC). When both paths are set, TLS is enabled on both servers.
// Empty values (default) keep plaintext behavior.
TLSCertFile string
TLSKeyFile string
// TLSAutoSelfsigned enables zero-friction self-signed TLS bootstrap for dev /
// internal deployments. Ignored when TLSCertFile/TLSKeyFile are set (explicit
// cert-file mode wins). Generated material is cached under TLSCacheDir.
TLSAutoSelfsigned bool
TLSCacheDir string
// API key authentication. When empty, auth middleware is a pass-through.
// Loaded from API_KEY env var — never logged.
APIKey string
// OTelExporterEndpoint enables self-instrumentation. When set, the platform
// exports its own spans to the configured OTLP endpoint (e.g. "localhost:4317"
// for self-ingest, or an external collector).
OTelExporterEndpoint string
// DefaultTenant is the tenant ID assigned to rows ingested without an explicit
// X-Tenant-ID header (HTTP) / x-tenant-id gRPC metadata.
DefaultTenant string
// OTLPTrustResourceTenant enables resolving the tenant from the OTLP
// `tenant.id` resource attribute when no transport-level tenant header
// was provided. Disabled by default because resource attributes are
// client-controlled — a compromised SDK could set tenant.id to forge
// another tenant's data. Only turn this on in closed environments where
// all OTLP producers are trusted.
OTLPTrustResourceTenant bool
// APITenantKeysFile, when non-empty, switches API auth from a single
// shared API_KEY into per-tenant bearer tokens. The file contains one
// `key=tenant` pair per line; the matched key's tenant OVERRIDES any
// X-Tenant-ID header so callers cannot cross tenants. Empty = disabled
// (legacy shared-key mode remains available for single-tenant dev).
APITenantKeysFile string
// DevMode disables origin checks for WebSocket and enables dev-friendly defaults.
// Derived from APP_ENV == "development".
DevMode bool
// gRPC server tuning — protects against huge OTLP batches and connection abuse.
GRPCMaxRecvMB int
GRPCMaxConcurrentStreams int
// AllowSqliteProd lets operators explicitly acknowledge that SQLite is
// being used outside dev/test. Without it, a production Env + SQLite
// combination refuses to start.
AllowSqliteProd bool
// WSMaxClients caps simultaneous WebSocket connections to /ws*
// endpoints. 0 = unlimited (default). When set, new connections past
// the cap receive HTTP 503. Sized for the operator's expected dashboard
// audience — small for ops dashboards, larger for read-heavy public UIs.
WSMaxClients int
}
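The hybrid backpressure policy described in the async ingest comment above can be sketched as a pure decision function. This is an illustration under stated assumptions, not the package's implementation: the names admitDecision and decide are hypothetical, the priority flag stands in for "errors/slow" batches, and treating a non-positive capacity as a full queue is a choice made only for this sketch.

```go
package main

import "fmt"

// admitDecision is a hypothetical type naming the three documented outcomes.
type admitDecision int

const (
	accept admitDecision = iota // enqueue the batch
	drop                        // silently discard a healthy batch
	reject                      // report RESOURCE_EXHAUSTED to the client
)

// decide applies the documented thresholds to a queue holding `queued`
// batches out of `capacity`. priority marks batches carrying errors or
// slow spans, which pass through the soft-backpressure band.
func decide(queued, capacity int, priority bool) admitDecision {
	switch {
	case capacity <= 0 || queued >= capacity:
		// Queue is full (or unusable, an assumption of this sketch).
		return reject
	case queued*10 >= capacity*9 && !priority:
		// 90%-100% band: integer math avoids float rounding at the edge.
		return drop
	default:
		return accept
	}
}

func main() {
	const capacity = 50000 // the documented default queue size
	for _, queued := range []int{1000, 46000, 50000} {
		fmt.Printf("queued=%d healthy=%d priority=%d\n",
			queued, decide(queued, capacity, false), decide(queued, capacity, true))
	}
}
```

Keeping the decision free of side effects makes the thresholds easy to unit-test separately from the queue itself.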
func (*Config) GuardSelfInstrumentation ¶
func (c *Config) GuardSelfInstrumentation()
GuardSelfInstrumentation prevents an amplification loop when OTEL_EXPORTER_OTLP_ENDPOINT points at the binary's own gRPC port. Without this, every span the OTel SDK emits would re-enter Export, generate more spans (one per Export call), and re-enter again — unbounded fan-out.
Strategy: when the configured endpoint resolves to a loopback address, the binary's own service name (SelfServiceName) is automatically added to IngestExcludedServices so the ingest filter drops self-emitted batches. Operators can still set the exclusion list explicitly; the guard only adds to it and never removes entries.
No-op when self-instrumentation is disabled (empty endpoint) or the endpoint is non-loopback (a separate collector, the operator's responsibility).
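The guard's two steps, detecting a loopback endpoint and appending the service name without disturbing existing entries, might look roughly like the sketch below. It is not the package's code: isLoopbackEndpoint and addExcluded are hypothetical helpers, the exclusion list is assumed to be comma-separated, and the check only recognises "localhost" and literal loopback IPs (no DNS resolution, no URL schemes, no comparison against the binary's own gRPC port).

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// isLoopbackEndpoint reports whether a host:port endpoint names this machine.
// Assumption: only "localhost" and literal loopback IPs are recognised.
func isLoopbackEndpoint(endpoint string) bool {
	host, _, err := net.SplitHostPort(endpoint)
	if err != nil {
		host = endpoint // no port present; treat the whole value as the host
	}
	if strings.EqualFold(host, "localhost") {
		return true
	}
	ip := net.ParseIP(host)
	return ip != nil && ip.IsLoopback()
}

// addExcluded appends name to a comma-separated list unless it is already
// present. It never removes or rewrites existing entries.
func addExcluded(list, name string) string {
	for _, s := range strings.Split(list, ",") {
		if strings.TrimSpace(s) == name {
			return list
		}
	}
	if strings.TrimSpace(list) == "" {
		return name
	}
	return list + "," + name
}

func main() {
	excluded := "noisy-svc" // stand-in for IngestExcludedServices
	for _, ep := range []string{"localhost:4317", "[::1]:4317", "collector.example.com:4317", ""} {
		if isLoopbackEndpoint(ep) {
			excluded = addExcluded(excluded, "otelcontext")
		}
		fmt.Printf("endpoint=%q loopback=%v excluded=%q\n", ep, isLoopbackEndpoint(ep), excluded)
	}
}
```

Because addExcluded returns the list unchanged when the name is already present, running the guard more than once is harmless.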
func (*Config) TLSCertFileMode ¶
TLSCertFileMode reports whether explicit cert-file TLS is configured. This mode takes precedence over the self-signed bootstrap.
func (*Config) TLSEnabled ¶
TLSEnabled reports whether HTTPS + gRPC-TLS should be served using any mode (explicit files or auto self-signed).
func (*Config) TLSSelfsignedMode ¶
TLSSelfsignedMode reports whether the self-signed bootstrap path should be used. False when explicit cert files are set (cert-file wins).
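The precedence between the three TLS predicates can be illustrated with a small stand-alone type. This is a sketch under assumptions rather than the actual methods: tlsSettings and its lower-case methods are hypothetical, and it assumes cert-file mode requires both the certificate and key paths to be non-empty, as the field comment on TLSCertFile/TLSKeyFile suggests.

```go
package main

import "fmt"

// tlsSettings is a hypothetical stand-in for the TLS fields of Config.
type tlsSettings struct {
	CertFile, KeyFile string
	AutoSelfsigned    bool
}

// certFileMode: explicit files win when both paths are set (assumption).
func (t tlsSettings) certFileMode() bool { return t.CertFile != "" && t.KeyFile != "" }

// selfsignedMode: only when requested and explicit files are not in use.
func (t tlsSettings) selfsignedMode() bool { return t.AutoSelfsigned && !t.certFileMode() }

// enabled: TLS is served if either mode applies.
func (t tlsSettings) enabled() bool { return t.certFileMode() || t.selfsignedMode() }

func main() {
	cases := []tlsSettings{
		{}, // plaintext default
		{AutoSelfsigned: true},
		{CertFile: "server.crt", KeyFile: "server.key", AutoSelfsigned: true},
	}
	for _, c := range cases {
		fmt.Printf("%+v certFile=%v selfsigned=%v enabled=%v\n",
			c, c.certFileMode(), c.selfsignedMode(), c.enabled())
	}
}
```

Deriving enabled from the other two predicates keeps the modes mutually exclusive, so the servers never have to choose between two certificate sources.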
func (*Config) Validate ¶
Validate checks that all configuration values are within valid ranges. Call this once after Load() during startup to catch misconfiguration early.
func (*Config) ValidateDBForEnv ¶
ValidateDBForEnv refuses the combination of the SQLite driver and a production environment unless AllowSqliteProd is explicitly set. SQLite's single-writer lock limits sustained ingestion to roughly 5 services; using it in production will silently throttle ingestion.
Call once during startup after Load + Validate.
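The rule ValidateDBForEnv enforces can be sketched as a three-input check. This is a minimal illustration, not the real method: validateDBForEnv is a hypothetical free function, and the literals "production" and "sqlite" are assumed values for Env and DBDriver that this page does not confirm.

```go
package main

import (
	"errors"
	"fmt"
)

// validateDBForEnv returns an error for SQLite in production unless the
// operator has explicitly opted in.
// Assumption: "production" and "sqlite" are the relevant Env/DBDriver values.
func validateDBForEnv(env, driver string, allowSqliteProd bool) error {
	if env == "production" && driver == "sqlite" && !allowSqliteProd {
		return errors.New("sqlite in production requires an explicit opt-in")
	}
	return nil
}

func main() {
	fmt.Println(validateDBForEnv("production", "sqlite", false))  // refused
	fmt.Println(validateDBForEnv("production", "sqlite", true))   // acknowledged
	fmt.Println(validateDBForEnv("development", "sqlite", false)) // dev is fine
}
```

Running the check at startup turns a throughput problem that would otherwise surface slowly into an immediate, explicit failure.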