Documentation ¶
Overview ¶
Package agenticdetonator implements ADR-043 Phase 3: offline labeling of untrusted text via a sandboxed canary loop with a fake tool registry (WildcardExecutor) plus a markdown-URL scanner. Output records flow into the Phase 2 processor/agentic-governance/injection_corpus loader.
Phase 3 scope:
- 3a (this slice): WildcardExecutor + signal extraction + markdown-URL scanner + Detonation aggregator. Pure-Go module, no LLM. Library only — the canary loop in 3b is the first user.
- 3b: Canary loop wrapping the agentic-model adapter with per-detonation cost caps (max turns / max tool calls / max tokens) and ADR-024 layered timeouts.
- 3c: cmd/detonate-injections/ batch CLI. JSONL in → JSONL out in the existing injectioncorpus.Record format. sha256 idempotency.
- 3d: First-user proof — detonate an unlabeled OSINT-shape batch, run the Phase 2 measurement harness before/after, document the delta.
Storage choice (Phase 3): JSONL files in the existing corpus format. The INJECTION_LABELS_{tenant} KV bucket from ADR-043 §"Cache size and policy" is explicitly Phase 4 and compounds with multi-tenant isolation; not a Phase 3 concern.
Constants ¶
const (
	SignalFilesystemRead      = "filesystem-read"
	SignalCodeExec            = "code-exec"
	SignalNetworkEgress       = "network-egress"
	SignalExfilEmail          = "exfil-email"
	SignalDataAccess          = "data-access"
	SignalSecretAccess        = "secret-access"
	SignalCredEnum            = "cred-enum"
	SignalInstructionOverride = "instruction-override"
	SignalBenign              = "benign"
)
Signal bucket constants — mirror the ADR-043 line 206 enum that the Phase 2 corpus loader and classifier consume. Kept in this package as constants rather than referencing strings inline so a typo surfaces at compile time and the bucket set is greppable.
const (
	ToolReadFile    = "read_file"
	ToolExecute     = "execute"
	ToolBash        = "bash"
	ToolFetch       = "fetch"
	ToolHTTPRequest = "http_request"
	ToolSendEmail   = "send_email"
	ToolSendMessage = "send_message"
	ToolReadEnv     = "read_env"
	ToolDBQuery     = "db_query"
	ToolListUsers   = "list_users"
	ToolGetAPIKey   = "get_api_key"
)
Tool name constants for the attacker surface. Centralised so the signal mapping in signals.go and the synthetic-response builders below share one set of names, and a mistyped name fails at compile time.
Variables ¶
This section is empty.
Functions ¶
func PrimarySignal ¶
PrimarySignal picks the highest-priority signal from a list. The priority order encodes operator severity intuition: code-exec is a higher-severity tickle than network-egress, which is higher than data-access, and so on. Returns SignalBenign when the input is empty so the corpus loader always sees a valid bucket.
Used by the Detonation aggregator when collapsing a multi-signal trajectory into a single Record.Signal. The Record format is single-label by design (Phase 4 may extend to multi-label classification per ADR-043 lines 268-272).
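As an illustration of the priority pick described above, here is a minimal sketch. The function name primarySignal and the priority slice are hypothetical; only the relative order code-exec > network-egress > data-access is stated in this documentation, so the remaining buckets are deliberately left out and the fallback for unranked signals is an assumption.

```go
package main

import "fmt"

// Bucket names copied from the package constants.
const (
	signalCodeExec      = "code-exec"
	signalNetworkEgress = "network-egress"
	signalDataAccess    = "data-access"
	signalBenign        = "benign"
)

// priority is an ASSUMPTION beyond the three documented entries:
// the real package may rank more buckets, in an order not shown here.
var priority = []string{signalCodeExec, signalNetworkEgress, signalDataAccess}

// primarySignal returns the highest-ranked signal present in signals,
// or signalBenign when signals is empty.
func primarySignal(signals []string) string {
	if len(signals) == 0 {
		return signalBenign
	}
	seen := make(map[string]bool, len(signals))
	for _, s := range signals {
		seen[s] = true
	}
	for _, p := range priority {
		if seen[p] {
			return p
		}
	}
	// Assumption: an unranked signal falls back to first-observed.
	return signals[0]
}

func main() {
	fmt.Println(primarySignal([]string{"data-access", "code-exec"}))
	fmt.Println(primarySignal(nil))
}
```

Returning a bucket for empty input is what lets the corpus loader treat every record as validly labeled.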
func SignalFromScannedURLs ¶
func SignalFromScannedURLs(urls []ScannedURL) string
SignalFromScannedURLs returns SignalNetworkEgress when the scanner found any URL the canary surfaced. Empty input → "" so the caller distinguishes "scanned and found nothing" from "scanner contributed a signal."
This is intentionally coarse for Phase 3a — every observed URL is treated as potential exfil. Phase 3b+ can refine (allowlist known-safe domains, weight by form, etc.). Operators tune via the corpus, not via the scanner heuristic.
func SignalsFromTrajectory ¶
func SignalsFromTrajectory(calls []RecordedCall) []string
SignalsFromTrajectory returns the deduped set of signals produced by a trajectory of tool calls. Order follows first-observation order so the dominant attack shape (the first tool the canary reached for) stays at index 0 — useful when downstream wants a "primary" signal.
A trajectory with zero recorded calls returns an empty slice; callers fold that into the markdown-URL scanner output before deciding whether to label the detonation benign.
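The first-observation dedup can be sketched as follows. The recordedCall struct and the toolSignal mapping are hypothetical stand-ins: the real RecordedCall fields and the tool-to-signal table in signals.go are not shown in this documentation.

```go
package main

import "fmt"

// recordedCall is a hypothetical stand-in for RecordedCall.
type recordedCall struct {
	Tool string
}

// toolSignal is an ASSUMED mapping for illustration only.
var toolSignal = map[string]string{
	"read_file": "filesystem-read",
	"bash":      "code-exec",
	"fetch":     "network-egress",
}

// signalsFromTrajectory returns each signal once, in the order
// it was first observed in the trajectory.
func signalsFromTrajectory(calls []recordedCall) []string {
	seen := make(map[string]bool)
	out := []string{}
	for _, c := range calls {
		sig, ok := toolSignal[c.Tool]
		if !ok || seen[sig] {
			continue // unmapped tool or already-recorded signal
		}
		seen[sig] = true
		out = append(out, sig)
	}
	return out
}

func main() {
	calls := []recordedCall{{"fetch"}, {"bash"}, {"fetch"}, {"read_file"}}
	fmt.Println(signalsFromTrajectory(calls))
}
```

A map tracks membership while the slice preserves order, so index 0 stays the first tool the canary reached for.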
Types ¶
type Canary ¶
type Canary struct {
// contains filtered or unexported fields
}
Canary runs bounded LLM-in-loop detonations against untrusted input, observes the tickled tool calls and markdown URLs, and returns a Detonation record. Each detonation uses a fresh WildcardExecutor so trajectories don't bleed across detonations.
Safe for concurrent Detonate calls: the struct holds only the invoker (already concurrent-safe per processor/agentic-model contract) and the config (read-only after NewCanary). All per-detonation state — executor, messages, raw text buffer — is stack-local to each Detonate call.
func NewCanary ¶
func NewCanary(invoker ModelInvoker, cfg CanaryConfig) (*Canary, error)
NewCanary builds a runnable canary from a model invoker plus config. Defaults are filled in from DefaultCanaryConfig for any zero-valued fields the caller didn't set.
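A minimal sketch of zero-value default filling, assuming a trimmed-down config struct. The MaxTokens (1024) and Timeout (60s) defaults come from the CanaryConfig field comments; the MaxTurns and MaxToolCalls values are placeholders, since the actual defaults are not stated here, and withDefaults is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"time"
)

// canaryConfig is a hypothetical subset of CanaryConfig.
type canaryConfig struct {
	Model        string
	MaxTurns     int
	MaxToolCalls int
	MaxTokens    int
	Timeout      time.Duration
}

func defaultConfig() canaryConfig {
	return canaryConfig{
		MaxTurns:     6,                // ASSUMED placeholder
		MaxToolCalls: 16,               // ASSUMED placeholder
		MaxTokens:    1024,             // documented default
		Timeout:      60 * time.Second, // documented default
	}
}

// withDefaults fills every zero-valued bound from defaultConfig
// and leaves caller-supplied values untouched.
func withDefaults(cfg canaryConfig) canaryConfig {
	d := defaultConfig()
	if cfg.MaxTurns == 0 {
		cfg.MaxTurns = d.MaxTurns
	}
	if cfg.MaxToolCalls == 0 {
		cfg.MaxToolCalls = d.MaxToolCalls
	}
	if cfg.MaxTokens == 0 {
		cfg.MaxTokens = d.MaxTokens
	}
	if cfg.Timeout == 0 {
		cfg.Timeout = d.Timeout
	}
	return cfg
}

func main() {
	cfg := withDefaults(canaryConfig{Model: "example-model", MaxTurns: 2})
	fmt.Printf("%+v\n", cfg)
}
```

One consequence of this pattern is that a caller cannot request an explicit zero for a bound; zero always means "use the default".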
func (*Canary) Detonate ¶
Detonate runs one bounded canary loop over input and returns the resulting Detonation. The returned record's PrimarySignal is driven by whichever attacker tools the canary tickled plus any markdown URLs the model emitted in its raw text outputs.
On model-side errors (transport failure, persistent rate-limit) Detonate returns a partial Detonation containing whatever trajectory the canary accumulated before the error plus the error. Callers MAY persist the partial — labels from N-1 turns can still be useful — but typical batch shape is to drop and retry per the CLI's per-batch error budget.
type CanaryConfig ¶
type CanaryConfig struct {
// Model is the model.Registry endpoint name to invoke.
// Captured into the Detonation record's source provenance so
// operators know which canary produced which labels.
Model string
// MaxTurns bounds the outer canary loop. Most injection
// detonations resolve in 1–3 turns; 6 is the practical upper
// bound at which an honest agent has either taken the bait or
// declined.
MaxTurns int
// MaxToolCalls caps the total wildcard tool invocations across
// all turns. A canary chained into a tool-recursive loop
// (rare but adversarially possible) is bounded here.
MaxToolCalls int
// MaxTokens is the per-turn output budget passed to the model.
// Defaults to 1024 — enough for tool args + a short rationale,
// short enough that runaway thinking costs are bounded per turn.
MaxTokens int
// Timeout is the deadline for the ENTIRE detonation
// (all turns combined). Defaults to 60s.
Timeout time.Duration
// SystemPrompt is the canary persona instruction. Defaults to a
// minimal OSINT-analyst persona that licenses tool use; operators
// can override to detonate domain-specific threats (code-review
// agents, customer-support agents, etc.). The default is
// intentionally permissive — the goal is to elicit attacks, not
// to resist them.
SystemPrompt string
}
CanaryConfig bounds one detonation's cost. Defaults are conservative — operators tune for their LLM cost/latency tolerance.
Per ADR-043 line 182, every detonation runs under a "max-turns + max-tool-calls + max-tokens cap" with the canary timeout governed by the ADR-024 layered-timeout convention. We don't pull the system-wide layered-timeout helper here because the detonator is offline batch work, not a serving path. Timeout-on-config is the right primitive for that case; the layered helper is for request-path timeout composition.
func DefaultCanaryConfig ¶
func DefaultCanaryConfig() CanaryConfig
DefaultCanaryConfig returns the conservative-default config. Callers usually set Model + Timeout and accept the rest.
type Detonation ¶
type Detonation struct {
// Input is the untrusted text the canary was asked to process.
// Becomes the corpus record's Text field.
Input string
// CanaryModel records which model.Registry endpoint ran the
// canary. Persisted into the corpus record's source provenance
// so operators can audit which model produced which labels.
CanaryModel string
// StartedAt is the canary's invocation wallclock. Latency and
// trace correlation derive from this.
StartedAt time.Time
// Duration is the canary's wall time end-to-end. Recorded so
// the Phase 3 cost accounting can audit per-detonation budget
// compliance.
Duration time.Duration
// Turns is the number of canary LLM turns consumed before the
// detonation terminated (success or cost-cap hit).
Turns int
// ToolCalls is the executor's trajectory in observation order.
ToolCalls []RecordedCall
// URLs is the markdown-URL scanner output. Populated by the
// Phase 3b canary aggregator (it accumulates each turn's raw
// text and scans the concatenation); Phase 3a treats this as
// pre-aggregated input.
URLs []ScannedURL
}
Detonation aggregates the observable outputs of one canary run over one input. The fields are populated incrementally by the canary loop (Phase 3b); Phase 3a treats it as a pure data structure that converts cleanly into an injectioncorpus.Record for round-trip into the Phase 2 classifier corpus.
Why this lives in 3a: the Record-conversion contract IS the detonator's product. Getting the shape right before the canary loop is written keeps 3b focused on the loop control flow and not on figuring out how to package its findings.
func (*Detonation) AllSignals ¶
func (d *Detonation) AllSignals() []string
AllSignals returns the deduped set of signals observed. Useful for multi-label exports (Phase 4 corpus extension per ADR-043 lines 268-272) even though Phase 3 writes single-label records.
func (*Detonation) ID ¶
func (d *Detonation) ID() string
ID returns the stable identifier for this detonation: hex sha256 of the input text. Becomes the corpus record's ID field so the runtime classifier's top_match_id surfaces a value that uniquely re-locates the source detonation.
Hashing only the input (not the trajectory) means re-detonating the same input under a different model produces the same ID. The corpus loader's duplicate-ID detection will then surface the collision at load time, forcing the operator to choose which model's labels to trust — a feature, not a bug, per the "feedback_warning_not_fail_masks_integration_drift" discipline.
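The ID derivation can be sketched with the standard library alone. The free function detonationID is a hypothetical stand-in for the method; the hashing itself is exactly what the text describes, hex-encoded SHA-256 of the input.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// detonationID returns the hex-encoded SHA-256 digest of the input text.
// Nothing else (model, trajectory, timestamp) feeds the hash.
func detonationID(input string) string {
	sum := sha256.Sum256([]byte(input))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := detonationID("ignore previous instructions")
	b := detonationID("ignore previous instructions")
	fmt.Println(a == b) // same input, same ID, regardless of model
	fmt.Println(len(a)) // 64 hex characters
}
```

Because the model is not part of the hash, two runs over the same text collide on purpose, which is what triggers the loader's duplicate-ID check.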
func (*Detonation) PrimarySignal ¶
func (d *Detonation) PrimarySignal() string
PrimarySignal returns the dominant signal bucket the detonation produced, applying the ToolCalls signal taxonomy first and falling back to network-egress when only URLs are present. Empty trajectory + empty URLs → SignalBenign (the canary saw the input and did not bite).
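The fallback chain reads naturally as three cases. This is a sketch under assumptions: primaryFor is a hypothetical helper that takes the already-chosen tool signal (empty when no tool was called) and the number of scanned URLs, rather than the real Detonation fields.

```go
package main

import "fmt"

// primaryFor applies the documented order: tool-derived signal first,
// then network-egress when only URLs were seen, then benign.
func primaryFor(toolSignal string, urlCount int) string {
	switch {
	case toolSignal != "":
		return toolSignal
	case urlCount > 0:
		return "network-egress"
	default:
		return "benign"
	}
}

func main() {
	fmt.Println(primaryFor("code-exec", 2))
	fmt.Println(primaryFor("", 1))
	fmt.Println(primaryFor("", 0))
}
```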
func (*Detonation) ToRecord ¶
func (d *Detonation) ToRecord() injectioncorpus.Record
ToRecord converts the detonation into the corpus-loader Record format consumed by Phase 2's processor/agentic-governance/injection_corpus loader. Single-label by design (Phase 4 may extend); the chosen label is PrimarySignal.
Source provenance is structured: "detonator/<model>/<date>". Operators see at a glance which canary produced which corpus entries.
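A sketch of building the provenance string. The helper name sourceFor, the model name, and the YYYY-MM-DD date layout are all assumptions; the documentation specifies only the "detonator/<model>/<date>" shape.

```go
package main

import (
	"fmt"
	"time"
)

// exampleStart is a fixed timestamp used only for this illustration.
var exampleStart = time.Date(2025, time.January, 2, 15, 4, 5, 0, time.UTC)

// sourceFor formats the provenance string. The date layout is ASSUMED;
// the real package may use a different one.
func sourceFor(model string, startedAt time.Time) string {
	return fmt.Sprintf("detonator/%s/%s", model, startedAt.UTC().Format("2006-01-02"))
}

func main() {
	fmt.Println(sourceFor("example-model", exampleStart))
}
```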
type ModelInvoker ¶
type ModelInvoker interface {
ChatCompletion(ctx context.Context, req agentic.AgentRequest) (agentic.AgentResponse, error)
}
ModelInvoker is the minimal contract the canary needs from a model client. processor/agentic-model.Client satisfies it. Carved out so tests can mock without spinning up an LLM, and so future canary variants can route through different invocation shapes without churning the canary loop.
type RecordedCall ¶
RecordedCall captures one observed tool invocation for downstream signal extraction.
type ScannedURL ¶
ScannedURL records one URL the scanner observed along with the markdown form that surfaced it. Provenance lets the Phase 3b canary aggregator weigh image-syntax matches more heavily than bare URLs (image markdown is auto-rendered → auto-fetched).
func ScanMarkdownURLs ¶
func ScanMarkdownURLs(text string) []ScannedURL
ScanMarkdownURLs returns the URLs the canary text reveals in text-position order with their markdown form recorded. Deduplicates by URL — if the same URL appears both as an image and as a bare reference, the stronger form (image > link > bare) is retained; text position of the first observation wins.
Pure function — no I/O, no goroutines, no shared state. Safe to call from anywhere with any input. Caller decides what to do with a populated result; the scanner is signal-extraction only.
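The scan-and-dedup behavior can be sketched with a single regular expression. Everything here is illustrative: the form names, the struct fields, and the pattern are assumptions, not the package's actual implementation, and the bare-URL pattern is naive about trailing punctuation.

```go
package main

import (
	"fmt"
	"regexp"
)

type urlForm string

// Form values are ASSUMED; the real URLForm constants are not shown.
const (
	formImage urlForm = "image"
	formLink  urlForm = "link"
	formBare  urlForm = "bare"
)

type scannedURL struct {
	URL  string
	Form urlForm
}

// rank encodes the documented strength order: image > link > bare.
var rank = map[urlForm]int{formBare: 1, formLink: 2, formImage: 3}

// Alternatives are tried in order at each position: image, link, bare.
var urlRE = regexp.MustCompile(
	`!\[[^\]]*\]\((https?://[^\s)]+)\)|\[[^\]]*\]\((https?://[^\s)]+)\)|(https?://[^\s)\]]+)`)

// scanMarkdownURLs returns URLs in first-seen order, keeping the
// strongest form observed for each URL.
func scanMarkdownURLs(text string) []scannedURL {
	index := make(map[string]int) // URL -> position in out
	var out []scannedURL
	for _, m := range urlRE.FindAllStringSubmatch(text, -1) {
		var u string
		var f urlForm
		switch {
		case m[1] != "":
			u, f = m[1], formImage
		case m[2] != "":
			u, f = m[2], formLink
		default:
			u, f = m[3], formBare
		}
		if i, ok := index[u]; ok {
			if rank[f] > rank[out[i].Form] {
				out[i].Form = f // upgrade form, keep first position
			}
			continue
		}
		index[u] = len(out)
		out = append(out, scannedURL{URL: u, Form: f})
	}
	return out
}

func main() {
	text := "see http://a.example/x then ![p](http://a.example/x) and [l](http://b.example/y)"
	for _, s := range scanMarkdownURLs(text) {
		fmt.Println(s.URL, s.Form)
	}
}
```

In the example text the first URL is seen bare and later as an image, so it keeps its original position but is reported with the image form.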
type URLForm ¶
type URLForm string
URLForm distinguishes how the URL appeared in the canary text. Image-syntax URLs are the primary exfil vector; link-syntax and bare URLs are weaker signals but still recorded so operators can audit the full surface.
type WildcardExecutor ¶
type WildcardExecutor struct {
// contains filtered or unexported fields
}
WildcardExecutor implements agentictools.ToolExecutor for the detonator sandbox. Exposes the attacker-target tool surface from ADR-043 lines 184-189 (read_file, execute, bash, fetch, send_email, read_env, db_query, list_users, get_api_key) and a few adjacent shapes the OSINT threat model implicates (send_message, http_request).
Each Execute call:
- Records the (tool, args, timestamp) into an in-memory trajectory for downstream signal extraction.
- Returns a deterministic synthetic ToolResult keyed on sha256(tool_name + canonical_args_json). Same args → same fake result, so canary behavior is reproducible.
- Never touches the filesystem, network, or any external resource. The whole point is to observe what the LLM tries to do, not to do it.
One executor instance per detonation. Cheap to construct (no dependencies) and the trajectory is the natural per-detonation scope. Thread-safe within a single instance.
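The record-then-respond behavior listed above can be sketched as follows. The types, method names, and the synthetic result format are hypothetical; the sketch relies on encoding/json sorting map keys when marshaling, which is what makes the argument JSON canonical for flat and nested maps.

```go
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
	"sync"
	"time"
)

// recordedCall is a hypothetical stand-in for RecordedCall.
type recordedCall struct {
	Tool string
	Args map[string]any
	At   time.Time
}

type executor struct {
	mu    sync.Mutex
	calls []recordedCall
}

// execute records the call and returns a fake result derived from
// sha256(tool + canonical args JSON). It performs no I/O.
func (e *executor) execute(tool string, args map[string]any) (string, error) {
	canon, err := json.Marshal(args) // map keys are emitted sorted
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(append([]byte(tool), canon...))

	e.mu.Lock()
	e.calls = append(e.calls, recordedCall{Tool: tool, Args: args, At: time.Now()})
	e.mu.Unlock()

	// ASSUMED result shape; the real builders return plausible tool output.
	return fmt.Sprintf("synthetic-%s-%x", tool, sum[:4]), nil
}

// trajectory returns a copy so callers cannot race the executor.
func (e *executor) trajectory() []recordedCall {
	e.mu.Lock()
	defer e.mu.Unlock()
	return append([]recordedCall(nil), e.calls...)
}

func main() {
	e := &executor{}
	a, _ := e.execute("read_file", map[string]any{"path": "/etc/passwd", "mode": "r"})
	b, _ := e.execute("read_file", map[string]any{"mode": "r", "path": "/etc/passwd"})
	fmt.Println(a == b, len(e.trajectory()))
}
```

Note that the snapshot copies the slice but not the Args maps inside it, so a caller mutating a map would still affect the recorded entry in this sketch.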
func NewWildcardExecutor ¶
func NewWildcardExecutor() *WildcardExecutor
NewWildcardExecutor returns a fresh executor with the standard ADR-043 attacker tool surface. The tool surface is intentionally fixed — adding a tool is a code change reviewed against the signal-bucket taxonomy, not a config-time decision.
func (*WildcardExecutor) Execute ¶
func (e *WildcardExecutor) Execute(_ context.Context, call agentic.ToolCall) (agentic.ToolResult, error)
Execute records the call into the trajectory and returns a deterministic synthetic result. Never errors on unknown tools because the LLM may try a name we didn't advertise; we record it and return a generic plausible-looking response so the canary keeps exploring.
ctx is accepted to satisfy the agentictools.ToolExecutor interface but not consulted — this executor is pure-Go (no I/O, no waits) and bounded by the model-call surface. The canary loop re-checks ctx.Err() at every turn boundary, so a deadline hit mid-tool-loop converts to a clean timeout at the next turn. Future executors with real I/O must wire ctx through.
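The turn-boundary check can be illustrated with a small bounded loop. This is a sketch, not the canary's actual loop: runTurns and the two demo helpers are hypothetical, and cancellation stands in for a deadline expiring mid-turn.

```go
package main

import (
	"context"
	"fmt"
)

// runTurns calls turn up to maxTurns times, checking ctx only at
// turn boundaries. It reports how many turns ran.
func runTurns(ctx context.Context, maxTurns int, turn func(n int) bool) (int, error) {
	turns := 0
	for turns < maxTurns {
		if err := ctx.Err(); err != nil {
			return turns, err // clean stop at the boundary
		}
		turns++
		if turn(turns) {
			return turns, nil // the model finished on its own
		}
	}
	return turns, nil // turn cap reached
}

// demoCancelled cancels the context during turn 2; the loop notices
// before starting turn 3.
func demoCancelled() (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	return runTurns(ctx, 6, func(n int) bool {
		if n == 2 {
			cancel()
		}
		return false
	})
}

// demoCapped never finishes, so the turn cap ends the loop.
func demoCapped() (int, error) {
	return runTurns(context.Background(), 3, func(int) bool { return false })
}

func main() {
	fmt.Println(demoCancelled())
	fmt.Println(demoCapped())
}
```

Work inside a turn is not interrupted here, which matches an executor that does no I/O: the cost of a late deadline is bounded by one turn.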
func (*WildcardExecutor) ListTools ¶
func (e *WildcardExecutor) ListTools() []agentic.ToolDefinition
ListTools returns the attacker tool surface. Stable order so the LLM sees the same advertised tool list across detonations.
func (*WildcardExecutor) Trajectory ¶
func (e *WildcardExecutor) Trajectory() []RecordedCall
Trajectory returns a snapshot of the recorded calls in observation order. Returned slice is a copy — caller can mutate without racing the executor.