classifier

package
v1.4.6 Latest
Warning

This package is not in the latest version of its module.

Published: Apr 14, 2026 License: Apache-2.0 Imports: 16 Imported by: 0

Documentation

Constants

const (
	// PIIDirectionRequest marks PII scanning on inbound/request content.
	PIIDirectionRequest = "request"
	// PIIDirectionResponse marks PII scanning on outbound/response content.
	PIIDirectionResponse = "response"
)
const (
	// DefaultMinScore is the Presidio-compatible minimum confidence threshold.
	// Matches below this score are discarded unless boosted by context words.
	DefaultMinScore = 0.5

	// ContextSimilarityFactor is the score boost applied when context words are
	// found near a match. Matches Presidio's default context_similarity_factor.
	ContextSimilarityFactor = 0.35

	// ContextWindowChars is the number of characters to search before and after
	// a match when looking for context words.
	ContextWindowChars = 100
)
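The three constants above describe a Presidio-style context boost. The following is a minimal sketch of how such a boost could be applied, not the package's actual implementation: `boostedScore`, `accepted`, and `demoBoost` are hypothetical names, and the sketch assumes the window is measured in bytes, matching is case-insensitive, and the boosted score is capped at 1.0.

```go
package main

import (
	"fmt"
	"strings"
)

// Local copies of the documented constants, so the sketch compiles on its own.
const (
	defaultMinScore         = 0.5
	contextSimilarityFactor = 0.35
	contextWindowChars      = 100
)

// boostedScore adds the context boost (capped at 1.0) when any context word
// occurs within contextWindowChars bytes before or after text[start:end].
// Assumption: byte-based window and case-insensitive substring matching.
func boostedScore(text string, start, end int, base float64, contextWords []string) float64 {
	lo := start - contextWindowChars
	if lo < 0 {
		lo = 0
	}
	hi := end + contextWindowChars
	if hi > len(text) {
		hi = len(text)
	}
	window := strings.ToLower(text[lo:start] + " " + text[end:hi])
	for _, w := range contextWords {
		if strings.Contains(window, strings.ToLower(w)) {
			boosted := base + contextSimilarityFactor
			if boosted > 1.0 {
				boosted = 1.0
			}
			return boosted
		}
	}
	return base
}

// accepted reports whether a final score clears the minimum threshold.
func accepted(score float64) bool { return score >= defaultMinScore }

// demoBoost scores a weak (0.3) match with and without nearby context words.
func demoBoost() (withContext, withoutContext bool) {
	text := "IBAN: DE89370400440532013000"
	start, end := 6, len(text)
	withContext = accepted(boostedScore(text, start, end, 0.3, []string{"iban"}))
	withoutContext = accepted(boostedScore(text, start, end, 0.3, nil))
	return
}

func main() {
	fmt.Println(demoBoost())
}
```

With these values, a pattern whose base score is 0.3 is dropped on its own but kept when a context word such as "iban" appears nearby.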

Variables

var EUPatterns []PIIPattern

EUPatterns is the compiled default pattern set, built at init time from the embedded YAML. Kept for backward compatibility with code that references this variable directly.

var IBANLengths = map[string]int{
	"AT": 20, "BE": 16, "BG": 22, "CY": 28, "CZ": 24,
	"DE": 22, "DK": 18, "EE": 20, "ES": 24, "FI": 18,
	"FR": 27, "GB": 22, "GR": 27, "HR": 21, "HU": 28,
	"IE": 22, "IT": 27, "LT": 20, "LU": 20, "LV": 21,
	"MT": 31, "NL": 18, "PL": 28, "PT": 25, "RO": 24,
	"SE": 24, "SI": 19, "SK": 24,
}

IBANLengths maps EU+UK country codes to their exact IBAN character length (ISO 13616). Used by ValidateIBAN to reject strings that match the IBAN regex but have the wrong length for their country (e.g. VAT IDs like DE123456789 are 11 chars, not DE's 22).
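As an illustration of the length gate combined with the ISO 13616 MOD-97 check, here is a minimal standalone sketch. `validIBAN` and the three-entry `ibanLengths` table are hypothetical stand-ins; the sketch also assumes that country codes missing from the table are rejected, which the documentation does not state.

```go
package main

import (
	"fmt"
	"strings"
)

// ibanLengths is a small excerpt of the documented table, for illustration.
var ibanLengths = map[string]int{"DE": 22, "GB": 22, "NL": 18}

// validIBAN checks the country-specific length, then the MOD-97 checksum.
func validIBAN(s string) bool {
	s = strings.ToUpper(strings.ReplaceAll(s, " ", ""))
	if len(s) < 4 {
		return false
	}
	// Assumption: unknown country codes are rejected outright.
	if want, ok := ibanLengths[s[:2]]; !ok || len(s) != want {
		return false
	}
	// Move the country code and check digits to the end, then reduce the
	// resulting number modulo 97 digit by digit (letters map to 10..35).
	rearranged := s[4:] + s[:4]
	rem := 0
	for _, r := range rearranged {
		switch {
		case r >= '0' && r <= '9':
			rem = (rem*10 + int(r-'0')) % 97
		case r >= 'A' && r <= 'Z':
			rem = (rem*100 + int(r-'A') + 10) % 97
		default:
			return false
		}
	}
	return rem == 1
}

func main() {
	for _, s := range []string{"DE89 3704 0044 0532 0130 00", "DE123456789"} {
		fmt.Printf("%q valid=%v\n", s, validIBAN(s))
	}
}
```

The VAT-style string `DE123456789` fails at the length gate before the checksum is even computed.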

Functions

func PIIEntitiesToCanonical

func PIIEntitiesToCanonical(entities []PIIEntity) []*entity.CanonicalEntity

PIIEntitiesToCanonical converts a slice of PIIEntity from the scanner to detector-agnostic canonical entities for the enrichment pipeline. IDs are assigned sequentially (1-based). Source is set to entity.SourceCustom.

func RecordEnrichmentAttempt

func RecordEnrichmentAttempt(ctx context.Context, entityType string)

RecordEnrichmentAttempt records one enrichment attempt for an entity type.

func RecordEnrichmentAttribute

func RecordEnrichmentAttribute(ctx context.Context, attrName, attrValue string)

RecordEnrichmentAttribute records one attribute emitted (e.g. gender=female, scope=city).

func RecordEnrichmentFallbackUnknown

func RecordEnrichmentFallbackUnknown(ctx context.Context, entityType string)

RecordEnrichmentFallbackUnknown records fallback to unknown for an attribute.

func RecordPIIDetection

func RecordPIIDetection(ctx context.Context, piiType, direction, action string)

RecordPIIDetection increments the PII detection counter per entity type.

func RecordPIIRedaction

func RecordPIIRedaction(ctx context.Context, piiType, direction string)

RecordPIIRedaction increments the PII redaction counter per entity type.

func WithPIIDirection

func WithPIIDirection(ctx context.Context, direction string) context.Context

WithPIIDirection returns a child context carrying the given PII scan direction ("request" or "response"). Scan and Redact read this to attribute metrics correctly.
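Carrying a value such as the scan direction on a context is usually done with an unexported key type. The sketch below shows that general pattern under assumptions: `directionKey`, `withDirection`, and `directionFrom` are hypothetical names, and returning an empty string when no direction is set is a choice made here, not documented behavior.

```go
package main

import (
	"context"
	"fmt"
)

// directionKey is an unexported key type, so no other package can collide with it.
type directionKey struct{}

// withDirection returns a child context that carries the scan direction.
func withDirection(ctx context.Context, direction string) context.Context {
	return context.WithValue(ctx, directionKey{}, direction)
}

// directionFrom reads the direction back; it returns "" when none was set
// (an assumption made for this sketch).
func directionFrom(ctx context.Context) string {
	if d, ok := ctx.Value(directionKey{}).(string); ok {
		return d
	}
	return ""
}

// demoDirection shows that only the child context carries the value.
func demoDirection() (parentDir, childDir string) {
	parent := context.Background()
	child := withDirection(parent, "response")
	return directionFrom(parent), directionFrom(child)
}

func main() {
	p, c := demoDirection()
	fmt.Printf("parent=%q child=%q\n", p, c)
}
```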

Types

type Classification

type Classification struct {
	HasPII   bool        `json:"has_pii"`
	Entities []PIIEntity `json:"entities"`
	Tier     int         `json:"tier"` // 0-2
	Redacted string      `json:"redacted,omitempty"`
}

Classification holds the result of PII scanning.
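To show what the JSON tags above imply on the wire, here is a small sketch using local mirror types (`classification` and `piiEntity` are copies made for illustration, not imports of this package). It shows that `redacted` is omitted when empty, while a nil `Entities` slice encodes as `null`.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// piiEntity and classification mirror the documented structs and tags.
type piiEntity struct {
	Type        string  `json:"type"`
	Value       string  `json:"value"`
	Position    int     `json:"position"`
	Confidence  float64 `json:"confidence"`
	Sensitivity int     `json:"sensitivity"`
}

type classification struct {
	HasPII   bool        `json:"has_pii"`
	Entities []piiEntity `json:"entities"`
	Tier     int         `json:"tier"`
	Redacted string      `json:"redacted,omitempty"`
}

// encode marshals a classification, returning "" if encoding fails.
func encode(c classification) string {
	b, err := json.Marshal(c)
	if err != nil {
		return ""
	}
	return string(b)
}

func main() {
	hit := classification{
		HasPII:   true,
		Entities: []piiEntity{{Type: "email", Value: "a@b.eu", Position: 8, Confidence: 0.85, Sensitivity: 1}},
		Tier:     1,
	}
	fmt.Println(encode(hit))
	fmt.Println(encode(classification{})) // nil Entities encodes as null
}
```

Consumers that expect an array for `entities` may want to handle `null` as well, depending on how the scanner initializes the slice.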

type EnrichmentConfig

type EnrichmentConfig struct {
	Enabled               bool
	Mode                  string   // off | shadow | enforce
	AllowedAttributes     []string // e.g. ["gender", "scope"]
	ConfidenceThreshold   float64
	EmitUnknownAttributes bool
	DefaultPersonGender   string
	DefaultLocationScope  string
	PreserveTitles        bool
}

EnrichmentConfig holds semantic enrichment settings. Callers (e.g. the runner) populate it from policy; the classifier does not depend on the policy package.

type EnrichmentPolicy

type EnrichmentPolicy interface {
	EmitAttributes(ctx context.Context, mode string, allowed []string, entityType string, attrs map[string]string) []string
}

EnrichmentPolicy is implemented by the caller (e.g. a policy engine adapter) to decide which attributes may be emitted for an entity. The classifier does not import the policy package.
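A caller-side implementation might look like the following sketch. The interface is copied locally so the example compiles on its own; `allowListPolicy` and `demoEmit` are hypothetical. Two points are assumptions, since the documentation does not spell them out: that the returned slice holds the names of attributes cleared for emission, and that only "enforce" mode emits anything.

```go
package main

import (
	"context"
	"fmt"
)

// EnrichmentPolicy is a local copy of the documented interface.
type EnrichmentPolicy interface {
	EmitAttributes(ctx context.Context, mode string, allowed []string, entityType string, attrs map[string]string) []string
}

// allowListPolicy emits only attributes that are both present and allowed.
type allowListPolicy struct{}

func (allowListPolicy) EmitAttributes(_ context.Context, mode string, allowed []string, _ string, attrs map[string]string) []string {
	// Assumption: "shadow" and "off" compute nothing for output.
	if mode != "enforce" {
		return nil
	}
	var out []string
	// Iterating the allow-list (not the map) keeps the order deterministic.
	for _, name := range allowed {
		if _, ok := attrs[name]; ok {
			out = append(out, name)
		}
	}
	return out
}

// demoEmit runs the policy for a fixed entity in the given mode.
func demoEmit(mode string) []string {
	var p EnrichmentPolicy = allowListPolicy{}
	attrs := map[string]string{"gender": "female", "scope": "city", "age": "41"}
	return p.EmitAttributes(context.Background(), mode, []string{"gender", "scope"}, "person", attrs)
}

func main() {
	fmt.Println(demoEmit("enforce"), demoEmit("shadow"))
}
```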

type LanguageContext

type LanguageContext struct {
	Language string   `yaml:"language" json:"language"`
	Context  []string `yaml:"context,omitempty" json:"context,omitempty"`
}

LanguageContext holds context words for a specific language.

type PIIEntity

type PIIEntity struct {
	Type        string  `json:"type"`
	Value       string  `json:"value"`
	Position    int     `json:"position"`
	Confidence  float64 `json:"confidence"`
	Sensitivity int     `json:"sensitivity"` // 1-3 from recognizer; 0 means unset (treated as 1 for tiering)
}

PIIEntity represents a detected PII instance.

type PIIPattern

type PIIPattern struct {
	Name          string
	Type          string
	Pattern       *regexp.Regexp
	Countries     []string
	Sensitivity   int      // 1-3, higher = more sensitive
	Score         float64  // base confidence from YAML (Presidio-compatible)
	ContextWords  []string // merged from all supported_languages[].context
	ValidateLuhn  bool     // Talon extension: ISO/IEC 7812 checksum gate
	ValidateIBAN  bool     // Talon extension: ISO 13616 MOD-97 + country length gate
	ValidateBSN   bool     // Talon extension: Dutch BSN 11-test
	ValidatePESEL bool     // Talon extension: Polish PESEL check digit
}

PIIPattern represents a compiled, ready-to-use PII detection pattern.
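The three checksum gates named in the field comments are standard algorithms. The sketch below shows textbook versions of each; `luhnValid`, `bsnValid`, and `peselValid` are hypothetical helpers, and the package's own validators may differ in details such as input normalization.

```go
package main

import "fmt"

// digitsOf converts an all-digit string to ints; ok is false otherwise.
func digitsOf(s string) (ds []int, ok bool) {
	for i := 0; i < len(s); i++ {
		if s[i] < '0' || s[i] > '9' {
			return nil, false
		}
		ds = append(ds, int(s[i]-'0'))
	}
	return ds, len(ds) > 0
}

// luhnValid implements the Luhn check: double every second digit from the right.
func luhnValid(s string) bool {
	ds, ok := digitsOf(s)
	if !ok {
		return false
	}
	sum := 0
	for i := range ds {
		d := ds[len(ds)-1-i]
		if i%2 == 1 {
			d *= 2
			if d > 9 {
				d -= 9
			}
		}
		sum += d
	}
	return sum%10 == 0
}

// bsnValid implements the Dutch 11-test with weights 9..2 and -1.
func bsnValid(s string) bool {
	ds, ok := digitsOf(s)
	if !ok || len(ds) != 9 {
		return false
	}
	sum := -ds[8]
	for i := 0; i < 8; i++ {
		sum += (9 - i) * ds[i]
	}
	return sum != 0 && sum%11 == 0
}

// peselValid checks the Polish PESEL check digit (weights 1,3,7,9 repeating).
func peselValid(s string) bool {
	ds, ok := digitsOf(s)
	if !ok || len(ds) != 11 {
		return false
	}
	weights := []int{1, 3, 7, 9, 1, 3, 7, 9, 1, 3}
	sum := 0
	for i, w := range weights {
		sum += w * ds[i]
	}
	return (10-sum%10)%10 == ds[10]
}

func main() {
	fmt.Println(luhnValid("4111111111111111"), bsnValid("111222333"), peselValid("44051401359"))
}
```

Gates like these run before any scoring, so a string that merely looks like a card number or BSN is rejected regardless of context words.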

func CompilePIIPatterns

func CompilePIIPatterns(recognizers []RecognizerConfig) ([]PIIPattern, error)

CompilePIIPatterns converts a list of recognizer configs into the compiled []PIIPattern slice used by the Scanner at runtime. Disabled recognizers are skipped. Each regex pattern in a recognizer produces one PIIPattern entry, with the entity type normalized to the lower_snake_case used internally.

type PatternConfig

type PatternConfig struct {
	Name  string   `yaml:"name" json:"name"`
	Regex string   `yaml:"regex" json:"regex"`
	Score *float64 `yaml:"score,omitempty" json:"score,omitempty"`
}

PatternConfig is a single regex pattern within a recognizer. Score is optional; when omitted (nil), DefaultMinScore is used at compile time so that custom patterns are not filtered out by the scanner's minScore threshold.
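The pointer type is what lets an omitted score be told apart from an explicit zero. A minimal sketch of that defaulting rule follows, using a local mirror struct and JSON input for brevity; `effectiveScore` and `scoreFromJSON` are hypothetical helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
)

const defaultMinScore = 0.5 // local copy of the documented constant

// patternConfig mirrors the documented struct (JSON tags only).
type patternConfig struct {
	Name  string   `json:"name"`
	Regex string   `json:"regex"`
	Score *float64 `json:"score,omitempty"`
}

// effectiveScore applies the documented default when Score is nil.
func effectiveScore(p patternConfig) float64 {
	if p.Score == nil {
		return defaultMinScore
	}
	return *p.Score
}

// scoreFromJSON decodes one pattern and returns its effective score,
// or -1 if the input does not parse.
func scoreFromJSON(raw string) float64 {
	var p patternConfig
	if err := json.Unmarshal([]byte(raw), &p); err != nil {
		return -1
	}
	return effectiveScore(p)
}

func main() {
	fmt.Println(scoreFromJSON(`{"name":"ticket","regex":"TCK-[0-9]+"}`))           // score omitted
	fmt.Println(scoreFromJSON(`{"name":"ticket","regex":"TCK-[0-9]+","score":0}`)) // explicit zero
}
```

An explicit `score: 0` is preserved, so such a pattern would only survive the threshold with help from a context boost.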

type RecognizerConfig

type RecognizerConfig struct {
	Name               string            `yaml:"name" json:"name"`
	SupportedEntity    string            `yaml:"supported_entity" json:"supported_entity"`
	Enabled            *bool             `yaml:"enabled,omitempty" json:"enabled,omitempty"`
	Patterns           []PatternConfig   `yaml:"patterns,omitempty" json:"patterns,omitempty"`
	SupportedLanguages []LanguageContext `yaml:"supported_languages,omitempty" json:"supported_languages,omitempty"`
	DenyList           []string          `yaml:"deny_list,omitempty" json:"deny_list,omitempty"`
	DenyListScore      float64           `yaml:"deny_list_score,omitempty" json:"deny_list_score,omitempty"`
	// Talon extensions (safe to include — Presidio ignores unknown fields)
	Sensitivity   int      `yaml:"sensitivity,omitempty" json:"sensitivity,omitempty"`
	Countries     []string `yaml:"countries,omitempty" json:"countries,omitempty"`
	ValidateLuhn  bool     `yaml:"validate_luhn,omitempty" json:"validate_luhn,omitempty"`
	ValidateIBAN  bool     `yaml:"validate_iban,omitempty" json:"validate_iban,omitempty"`
	ValidateBSN   bool     `yaml:"validate_bsn,omitempty" json:"validate_bsn,omitempty"`
	ValidatePESEL bool     `yaml:"validate_pesel,omitempty" json:"validate_pesel,omitempty"`
	// Injection-specific extension (used by attachment scanner only)
	Severity int `yaml:"severity,omitempty" json:"severity,omitempty"`
}

RecognizerConfig mirrors Presidio's YAML recognizer schema with Talon extensions.

func DefaultRecognizers

func DefaultRecognizers() ([]RecognizerConfig, error)

DefaultRecognizers returns the built-in PII recognizers parsed from the embedded pii_eu.yaml file. This is the first layer in the merge chain.

func FilterByEntities

func FilterByEntities(recognizers []RecognizerConfig, enabledEntities, disabledEntities []string) []RecognizerConfig

FilterByEntities applies enabled/disabled entity filters to a recognizer list. If enabledEntities is non-empty, only recognizers with matching supported_entity are kept (whitelist). Then any recognizer in disabledEntities is removed (blacklist).
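The two-step filter described above can be sketched as follows. The `recognizer` type and `filterByEntities` are simplified, hypothetical stand-ins, and the sketch assumes exact, case-sensitive matching on the entity name.

```go
package main

import "fmt"

// recognizer is a trimmed-down stand-in for RecognizerConfig.
type recognizer struct {
	Name            string
	SupportedEntity string
}

// toSet builds a lookup set from a list of entity names.
func toSet(items []string) map[string]bool {
	set := make(map[string]bool, len(items))
	for _, it := range items {
		set[it] = true
	}
	return set
}

// filterByEntities keeps recognizers on the allow-list (if one is given),
// then drops any whose entity is on the deny-list.
func filterByEntities(recs []recognizer, enabled, disabled []string) []recognizer {
	allow, deny := toSet(enabled), toSet(disabled)
	var out []recognizer
	for _, r := range recs {
		if len(allow) > 0 && !allow[r.SupportedEntity] {
			continue
		}
		if deny[r.SupportedEntity] {
			continue
		}
		out = append(out, r)
	}
	return out
}

// sample returns a fixed recognizer list for the demo.
func sample() []recognizer {
	return []recognizer{
		{Name: "Email", SupportedEntity: "EMAIL_ADDRESS"},
		{Name: "IBAN", SupportedEntity: "IBAN_CODE"},
		{Name: "Phone", SupportedEntity: "PHONE_NUMBER"},
	}
}

func main() {
	fmt.Println(filterByEntities(sample(), []string{"EMAIL_ADDRESS", "IBAN_CODE"}, []string{"IBAN_CODE"}))
}
```

Because the deny-list is applied last, an entity that appears in both lists ends up excluded.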

func MergeRecognizers

func MergeRecognizers(layers ...[]RecognizerConfig) []RecognizerConfig

MergeRecognizers performs a 3-layer merge: defaults, then global overrides, then per-agent overrides. Later layers override earlier ones by matching on the recognizer Name field. New recognizers are appended.
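A name-keyed layered merge of this kind can be sketched as below. `recognizer` and `mergeByName` are hypothetical; the sketch assumes a later entry replaces the earlier one wholesale (rather than merging field by field) and keeps the position of the first occurrence.

```go
package main

import "fmt"

// recognizer is a trimmed-down stand-in for RecognizerConfig.
type recognizer struct {
	Name  string
	Score float64
}

// mergeByName applies layers in order: a recognizer with a known Name
// replaces the earlier entry in place; a new Name is appended.
func mergeByName(layers ...[]recognizer) []recognizer {
	var out []recognizer
	index := map[string]int{} // Name -> position in out
	for _, layer := range layers {
		for _, r := range layer {
			if i, seen := index[r.Name]; seen {
				out[i] = r // assumption: whole-entry replacement
				continue
			}
			index[r.Name] = len(out)
			out = append(out, r)
		}
	}
	return out
}

// demoMerge merges defaults, a global override, and a per-agent addition.
func demoMerge() []recognizer {
	defaults := []recognizer{{"Email", 0.5}, {"IBAN", 0.5}}
	global := []recognizer{{"IBAN", 0.8}}
	agent := []recognizer{{"TicketID", 0.6}}
	return mergeByName(defaults, global, agent)
}

func main() {
	fmt.Println(demoMerge())
}
```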

type RecognizerFile

type RecognizerFile struct {
	Recognizers []RecognizerConfig `yaml:"recognizers"`
}

RecognizerFile is the top-level YAML structure for a recognizer config file. Mirrors Presidio's recognizer registry YAML format.

func LoadRecognizerFile

func LoadRecognizerFile(path string) (*RecognizerFile, error)

LoadRecognizerFile reads and parses a recognizer YAML file from disk. If the file does not exist, it returns a nil *RecognizerFile and a nil error, so callers can treat a missing global config as a no-op.
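The "missing file is not an error" convention is commonly built on `errors.Is(err, fs.ErrNotExist)`. The sketch below shows that pattern for raw bytes only; `loadOptional` is a hypothetical helper, and parsing is left out because the YAML library the package uses is not named in the documentation.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
)

// loadOptional reads a file, treating "does not exist" as (nil, nil) so the
// caller can skip an absent config without special-casing the error.
func loadOptional(path string) ([]byte, error) {
	data, err := os.ReadFile(path)
	if errors.Is(err, fs.ErrNotExist) {
		return nil, nil
	}
	if err != nil {
		// Other failures (permissions, I/O) are still reported.
		return nil, fmt.Errorf("read %s: %w", path, err)
	}
	return data, nil
}

func main() {
	data, err := loadOptional("no-such-dir/patterns.yaml")
	fmt.Println(data == nil, err == nil)
}
```

Callers of a function with this contract need a nil check on the result before dereferencing it.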

func ParseRecognizerFile

func ParseRecognizerFile(data []byte) (*RecognizerFile, error)

ParseRecognizerFile parses recognizer YAML bytes into a RecognizerFile.

type Scanner

type Scanner struct {
	// contains filtered or unexported fields
}

Scanner detects PII in text using configurable regex patterns. Optional semantic enrichment: when Enricher, EnrichmentConfig, and EnrichmentPolicy are set and config.Enabled and config.Mode != "off", Redact uses enriched placeholders.

func MustNewScanner

func MustNewScanner(opts ...ScannerOption) *Scanner

MustNewScanner is like NewScanner but panics on error. Useful for zero-config startup where the embedded defaults are expected to always compile.

func NewScanner

func NewScanner(opts ...ScannerOption) (*Scanner, error)

NewScanner creates a PII scanner. Without options it uses the embedded EU defaults. Options layer global overrides and per-agent customization on top.

func (*Scanner) Redact

func (s *Scanner) Redact(ctx context.Context, text string) string

Redact replaces PII with type-based placeholders (e.g. "[EMAIL]"). Uses Scan() for validated detection, then position-based replacement to handle overlapping patterns correctly.
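One way to do position-based replacement when matches overlap is to sort spans and skip any span that starts inside text already replaced. This is a sketch of that idea, not the package's algorithm: `span` and `redactSpans` are hypothetical, spans would correspond to `Position` and `Position+len(Value)` of each entity, and preferring the earliest, longest span is an assumption.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// span is a half-open byte range [Start, End) tagged with an entity type.
type span struct {
	Start, End int
	Type       string
}

// redactSpans replaces each non-overlapping span with "[TYPE]".
func redactSpans(text string, spans []span) string {
	sorted := append([]span(nil), spans...) // do not mutate the caller's slice
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].Start != sorted[j].Start {
			return sorted[i].Start < sorted[j].Start
		}
		return sorted[i].End > sorted[j].End // longer span first on ties
	})
	var b strings.Builder
	pos := 0
	for _, s := range sorted {
		// Skip empty, out-of-range, or overlapping spans.
		if s.Start < pos || s.End > len(text) || s.Start >= s.End {
			continue
		}
		b.WriteString(text[pos:s.Start])
		b.WriteString("[" + strings.ToUpper(s.Type) + "]")
		pos = s.End
	}
	b.WriteString(text[pos:])
	return b.String()
}

func main() {
	text := "mail bob@example.com now"
	spans := []span{{9, 20, "domain"}, {5, 20, "email"}}
	fmt.Println(redactSpans(text, spans))
}
```

Working from offsets rather than string substitution avoids redacting the same text twice or corrupting a placeholder that another pattern also matches.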

func (*Scanner) Scan

func (s *Scanner) Scan(ctx context.Context, text string) *Classification

Scan analyzes text for PII and returns a classification result. Each match goes through hard validation gates (IBAN checksum/length, Luhn) and then Presidio-style score-based context filtering before being accepted.

type ScannerOption

type ScannerOption func(*scannerConfig)

ScannerOption configures a Scanner via the functional options pattern.
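For readers unfamiliar with the functional options pattern, here is a compact standalone sketch, including a `Must`-style wrapper like the one documented above. All names are lowercase, hypothetical stand-ins, and the range check on the score is an assumption added to give the constructor something that can fail.

```go
package main

import "fmt"

// scannerConfig collects settings before the scanner is built.
type scannerConfig struct {
	minScore float64
	disabled []string
}

// option mutates the config; each withX helper returns one.
type option func(*scannerConfig)

func withMinScore(score float64) option {
	return func(c *scannerConfig) { c.minScore = score }
}

func withDisabledEntities(entities []string) option {
	return func(c *scannerConfig) { c.disabled = entities }
}

type scanner struct{ cfg scannerConfig }

// newScanner starts from defaults and applies options in order.
func newScanner(opts ...option) (*scanner, error) {
	cfg := scannerConfig{minScore: 0.5}
	for _, opt := range opts {
		opt(&cfg)
	}
	// Assumption for this sketch: scores outside [0, 1] are rejected.
	if cfg.minScore < 0 || cfg.minScore > 1 {
		return nil, fmt.Errorf("min score %v out of range [0, 1]", cfg.minScore)
	}
	return &scanner{cfg: cfg}, nil
}

// mustNewScanner panics instead of returning an error.
func mustNewScanner(opts ...option) *scanner {
	s, err := newScanner(opts...)
	if err != nil {
		panic(err)
	}
	return s
}

func main() {
	s := mustNewScanner(withMinScore(0.7), withDisabledEntities([]string{"PHONE_NUMBER"}))
	fmt.Println(s.cfg.minScore, s.cfg.disabled)
}
```

The zero-argument call still yields a usable value, which is what makes the pattern suitable for zero-config startup.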

func WithCustomRecognizers

func WithCustomRecognizers(recognizers []RecognizerConfig) ScannerOption

WithCustomRecognizers adds per-agent custom recognizer definitions.

func WithDisabledEntities

func WithDisabledEntities(entities []string) ScannerOption

WithDisabledEntities sets a blacklist of entity types to exclude.

func WithEnabledEntities

func WithEnabledEntities(entities []string) ScannerOption

WithEnabledEntities sets a whitelist of entity types. When non-empty, only recognizers with a matching supported_entity will be active.

func WithMinScore

func WithMinScore(score float64) ScannerOption

WithMinScore overrides the default minimum confidence threshold for matches.

func WithPatternFile

func WithPatternFile(path string) ScannerOption

WithPatternFile loads additional recognizers from a global patterns.yaml file. If the file does not exist, it is silently skipped.

func WithSemanticEnrichment

func WithSemanticEnrichment(enricher enrich.Enricher, config *EnrichmentConfig, policy EnrichmentPolicy) ScannerOption

WithSemanticEnrichment enables semantic enrichment of PII placeholders (e.g. gender, scope). When it is set and config.Mode is "enforce", Redact may produce placeholders of the form <PII type="..." id="..." .../>.

Directories

Path Synopsis
Package entity provides a detector-agnostic canonical representation of PII entities for use by the semantic enricher and placeholder renderer.
