Documentation
¶
Overview ¶
Package injectioncorpus loads labeled injection examples into the shape consumed by graph/query.EmbeddingClassifier.
Phase 2 of ADR-043: one bootstrap source, JSONL on disk, normalized to []*query.DomainExamples on read. JSONL is chosen over single-JSON so the Phase 3 detonator can append labels without rewriting the file.
One record per line:
{"id": "sha256-hex", "text": "...", "signal": "instruction-override", "source": "internal-seed-v0"}
The `text` field becomes Example.Query (the substrate predates injection use; the field name is a query-router legacy). The `signal` field becomes Example.Intent — that is the surface rules match on via governance.injection.signal.
Index ¶
Constants ¶
const ( OptionKeyID = "id" OptionKeySource = "source" )
OptionKey constants name the keys used inside Example.Options for corpus-loaded records. Phase 2b runtime-side accessors read these when building governance.injection.top_match_id triples; centralising the key names here keeps the contract typo-proof.
Variables ¶
This section is empty.
Functions ¶
func Load ¶
func Load(sources []Source) ([]*query.DomainExamples, error)
Load reads one or more JSONL corpus files and returns the result as the []*query.DomainExamples shape the classifier consumes.
Aggregates errors across files via errors.Join so a misconfigured deployment sees every problem on a single boot. Also detects duplicate record IDs across sources — Phase 3 detonator workers and vendored public corpora overlapping the internal seed are realistic collision paths, and silent dedup hides which record actually became the nearest-neighbor.
Types ¶
type Record ¶
type Record struct {
// ID is a stable identifier for the record. Recommended: hex
// sha256 of the text. Becomes governance.injection.top_match_id
// when this record is the nearest neighbor.
ID string `json:"id"`
// Text is the labeled input (the injection attempt, or a
// benign counter-example).
Text string `json:"text"`
// Signal is the bucket the text belongs to (one of the values
// enumerated in ADR-043 line 206). The classifier emits this
// as governance.injection.signal on match.
Signal string `json:"signal"`
// Source identifies the origin of the record (e.g.,
// "internal-seed-v0", "deepset/prompt-injections@<sha>").
// Persisted for provenance; not used in classification.
Source string `json:"source,omitempty"`
}
Record is the on-disk JSONL shape. Kept in this package rather than graph/query so the corpus format can evolve without touching the query-router substrate.
type Source ¶
type Source struct {
// Domain is a free-form tag (e.g., "injection-internal-seed",
// "injection-deepset"). Surfaces in DomainExamples.Domain.
Domain string
// Version is the corpus revision tag (e.g., "v0.1").
Version string
// Path is the JSONL file to load.
Path string
}
Source describes one corpus file plus the domain label that will be attached to its examples. The domain label is metadata only — the classifier aggregates examples across domains.