Documentation ¶
Overview ¶
Package extraction implements the regex-driven entity + date extraction that runs at store_memory time. This is the Go-side absorption of what lives in src/ogham/extraction.py on the Python side.
v0.5 Day 1 scope is English-only and covers the four pattern-based categories that need no external data files:
- CamelCase identifiers -> entity:
- file paths -> file:
- error / exception types -> error:
- person names -> person:
Multi-language support (de/fr/es/zh) and the richer enrichment entities (events, emotions, relationships, quantities, preferences, locations via GeoText) are v0.6 scope. Do NOT extend this file with YAML word lists until then.
Index ¶
- Constants
- Variables
- func Dates(content string) []string
- func DatesAt(content string, ref time.Time) []string
- func DatesAtForLang(content string, ref time.Time, lang string) []string
- func Entities(content string) []string
- func EntitiesForLang(content, lang string) []string
- func Importance(content string, tags []string) float64
- func ImportanceForLang(content string, tags []string, lang string) float64
- func ListLanguages() []string
- func Recurrence(content, lang string) (pattern string, tags []string, ok bool)
- func Set(s []string) map[string]struct{}
- func SetLower(s []string) map[string]struct{}
- type LanguageRules
- type RecurrencePattern
Constants ¶
const MaxEntities = 20
MaxEntities is the cap the Python implementation enforces on the final sorted result. Mirrored here so parity tests don't diverge.
Variables ¶
var ErrLanguageNotFound = errLanguageNotFound{}
ErrLanguageNotFound is the sentinel error LoadLanguage returns when a code isn't in the embedded set. Exported so callers can errors.Is() against it rather than string-matching.
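The sentinel pattern described above can be sketched as follows. This is a minimal, self-contained illustration, not the package's code: the error message text, the `loadLanguage` helper, and the `rulesOrEnglish` fallback are hypothetical stand-ins, and only the `errors.Is` comparison against an exported sentinel comes from the description.

```go
package main

import (
	"errors"
	"fmt"
)

// errLanguageNotFound is a hypothetical stand-in for the package's
// unexported error type; the message text is an assumption.
type errLanguageNotFound struct{}

func (errLanguageNotFound) Error() string { return "language not found" }

// ErrLanguageNotFound mirrors the exported sentinel declaration.
var ErrLanguageNotFound = errLanguageNotFound{}

// loadLanguage is a hypothetical loader, not the package's LoadLanguage.
// It wraps the sentinel so callers can still match it with errors.Is.
func loadLanguage(code string) (string, error) {
	known := map[string]string{"en": "english rules", "de": "german rules"}
	rules, ok := known[code]
	if !ok {
		return "", fmt.Errorf("load %q: %w", code, ErrLanguageNotFound)
	}
	return rules, nil
}

// rulesOrEnglish falls back to English when the code is unknown.
func rulesOrEnglish(code string) string {
	rules, err := loadLanguage(code)
	if errors.Is(err, ErrLanguageNotFound) {
		rules, _ = loadLanguage("en")
	}
	return rules
}

func main() {
	fmt.Println(rulesOrEnglish("de"))
	fmt.Println(rulesOrEnglish("xx"))
}
```

Because the sentinel is an empty struct value, `errors.Is` matches it by plain equality even after wrapping with `%w`.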
Functions ¶
func Dates ¶
Dates extracts sorted, deduplicated ISO-format (YYYY-MM-DD) dates from content. Recognises three families and mirrors src/ogham/extraction.py::extract_dates:
- ISO machine dates: 2026-04-20 or 2026/04/20 (slash normalised)
- Natural English: "April 20, 2026" / "20 April 2026" case-insensitive, optional ordinal suffix
- Relative phrases: "yesterday" / "today" / "tomorrow" / "last|next|this <weekday|week|month|year>" / "N days|weeks|months|years ago" / "in N days|weeks|months|years"
Relative phrases resolve only when no absolute date is present -- matches Python behaviour.
Output is always sorted ascending, deduplicated, and every token matches ^\d{4}-\d{2}-\d{2}$.
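As an illustration of the first family only (ISO machine dates), here is a minimal sketch under stated assumptions. The regex, the `isoDates` name, and the calendar validation via `time.Parse` are illustrative choices, not the package's implementation; natural-language and relative dates are not covered.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"time"
)

// isoDate matches 2026-04-20 or 2026/04/20. The pattern is an assumption
// for this sketch; it also accepts mixed separators such as 2026-04/20.
var isoDate = regexp.MustCompile(`\b(\d{4})[-/](\d{2})[-/](\d{2})\b`)

// isoDates returns sorted, deduplicated YYYY-MM-DD strings found in content.
func isoDates(content string) []string {
	seen := map[string]struct{}{}
	for _, m := range isoDate.FindAllStringSubmatch(content, -1) {
		d := m[1] + "-" + m[2] + "-" + m[3] // normalise slashes to dashes
		if _, err := time.Parse("2006-01-02", d); err != nil {
			continue // skip impossible dates such as 2026-13-40
		}
		seen[d] = struct{}{}
	}
	out := make([]string, 0, len(seen))
	for d := range seen {
		out = append(out, d)
	}
	sort.Strings(out) // lexical order equals chronological order for ISO dates
	return out
}

func main() {
	fmt.Println(isoDates("shipped 2026/04/20, planned 2026-01-05, again 2026-04-20"))
}
```

Sorting the strings is sufficient here because zero-padded ISO dates sort chronologically.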
func DatesAt ¶
DatesAt is the testable variant: relative phrases resolve against `ref` instead of time.Now(). Tests use a fixed ref so the committed PICT matrix asserts deterministic expected dates. English-only -- a language-aware variant is available via DatesAtForLang.
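The value of passing `ref` explicitly can be sketched like this. The code below is a hypothetical, English-only resolver for a small subset of relative phrases (day offsets only); `resolveRelative`, `fixedRef`, and both regexes are names invented for the illustration.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
	"time"
)

// Hypothetical patterns covering only "N days ago" and "in N days".
var (
	agoRe = regexp.MustCompile(`\b(\d+) days? ago\b`)
	inRe  = regexp.MustCompile(`\bin (\d+) days?\b`)
)

// resolveRelative resolves one relative phrase against ref instead of
// time.Now(), so the result is deterministic for a fixed ref.
func resolveRelative(content string, ref time.Time) (string, bool) {
	text := strings.ToLower(content)
	var days int
	if m := agoRe.FindStringSubmatch(text); m != nil {
		n, err := strconv.Atoi(m[1])
		if err != nil {
			return "", false
		}
		days = -n
	} else if m := inRe.FindStringSubmatch(text); m != nil {
		n, err := strconv.Atoi(m[1])
		if err != nil {
			return "", false
		}
		days = n
	} else if strings.Contains(text, "yesterday") {
		days = -1
	} else if strings.Contains(text, "tomorrow") {
		days = 1
	} else if !strings.Contains(text, "today") {
		return "", false
	}
	return ref.AddDate(0, 0, days).Format("2006-01-02"), true
}

// fixedRef is the kind of pinned reference time a test would use.
func fixedRef() time.Time {
	return time.Date(2026, time.April, 20, 12, 0, 0, 0, time.UTC)
}

func main() {
	d, ok := resolveRelative("fixed 3 days ago", fixedRef())
	fmt.Println(d, ok)
}
```

With a pinned `ref`, the expected dates in a test table never change from one run to the next.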
func DatesAtForLang ¶
DatesAtForLang resolves dates using the specified language's month names, weekday names, and relative-phrase anchors (today/tomorrow/yesterday equivalents). Unknown language codes fall back to English; see resolveRules for the logging policy.
func Entities ¶
Entities extracts typed tag strings from content and returns a sorted, deduplicated, length-capped slice. Output shape parity with Python's extract_entities(): each element is prefix:value.
Uses English person-name rules. For localised content, callers should use EntitiesForLang so the denylist vocab swaps to the memory's language.
func EntitiesForLang ¶
EntitiesForLang is the language-aware entity extractor. Only the person-name classifier is language-sensitive today -- the CamelCase, file-path, and error-type regexes are universal because their anchors (A-Z, dot segments, Error/Exception suffix) don't vary by locale.
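The three locale-independent categories can be sketched as below. The regexes, the `entities` name, and the choice to tag a token ending in Error/Exception as `error:` rather than also as `entity:` are assumptions made for the illustration; the person-name classifier is omitted.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

const maxEntities = 20 // mirrors the documented MaxEntities cap

// Illustrative patterns, not the package's actual regexes.
var (
	camelRe = regexp.MustCompile(`\b[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)+\b`)
	errorRe = regexp.MustCompile(`\b[A-Z][A-Za-z0-9]*(?:Error|Exception)\b`)
	pathRe  = regexp.MustCompile(`(?:[\w.-]+/)+[\w-]+\.\w+`)
)

// entities returns sorted, deduplicated prefix:value tags, capped in length.
func entities(content string) []string {
	tags := map[string]struct{}{}
	isError := map[string]bool{}
	for _, m := range errorRe.FindAllString(content, -1) {
		isError[m] = true
		tags["error:"+m] = struct{}{}
	}
	for _, m := range camelRe.FindAllString(content, -1) {
		if !isError[m] { // assumption: error types are not tagged twice
			tags["entity:"+m] = struct{}{}
		}
	}
	for _, m := range pathRe.FindAllString(content, -1) {
		tags["file:"+m] = struct{}{}
	}
	out := make([]string, 0, len(tags))
	for t := range tags {
		out = append(out, t)
	}
	sort.Strings(out)
	if len(out) > maxEntities {
		out = out[:maxEntities]
	}
	return out
}

func main() {
	fmt.Println(entities("ValueError raised in MemoryStore, see src/ogham/extraction.py"))
}
```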
func Importance ¶
Importance scores content on a 0.0-1.0 scale. Mirrors src/ogham/extraction.py::compute_importance:
base 0.2
+ 0.3 if content contains a DECISION_WORDS signal
+ 0.2 if ERROR_WORDS signal or an ...Error/...Exception regex match
+ 0.2 if ARCHITECTURE_WORDS signal
+ 0.1 if a file path appears
+ 0.1 if a code fence (```) or inline code (`) marker appears
+ 0.1 if len(content) > 500
+ 0.1 if len(tags) >= 3
capped at 1.0
Backward-compatible entry point. Uses English rules, matching the pre-language-plumbing behaviour to keep parity tests stable.
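The additive scoring rule can be sketched as follows. The word lists, the substring matching, and the regexes are placeholders invented for the example (the real lists live in the language YAML files); the sketch accumulates integer tenths so the returned values are exact, which is a choice of the sketch rather than a documented detail.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// Placeholder signal words; the real lists come from the language files.
var (
	decisionWords     = []string{"decided", "chose", "decision"}
	errorWords        = []string{"error", "failed", "bug"}
	architectureWords = []string{"architecture", "schema", "migration"}
	errorTypeRe       = regexp.MustCompile(`\b[A-Z][A-Za-z0-9]*(?:Error|Exception)\b`)
	pathRe            = regexp.MustCompile(`(?:[\w.-]+/)+[\w-]+\.\w+`)
)

// containsAny reports whether any word occurs as a substring of text.
func containsAny(text string, words []string) bool {
	for _, w := range words {
		if strings.Contains(text, w) {
			return true
		}
	}
	return false
}

// importance applies the additive rule in tenths, capped at 1.0.
func importance(content string, tags []string) float64 {
	lower := strings.ToLower(content)
	tenths := 2 // base 0.2
	if containsAny(lower, decisionWords) {
		tenths += 3
	}
	if containsAny(lower, errorWords) || errorTypeRe.MatchString(content) {
		tenths += 2
	}
	if containsAny(lower, architectureWords) {
		tenths += 2
	}
	if pathRe.MatchString(content) {
		tenths++
	}
	if strings.Contains(content, "`") { // covers code fences and inline code
		tenths++
	}
	if len(content) > 500 {
		tenths++
	}
	if len(tags) >= 3 {
		tenths++
	}
	if tenths > 10 {
		tenths = 10
	}
	return float64(tenths) / 10
}

func main() {
	fmt.Println(importance("We decided to use Postgres", nil))
}
```

All bonuses together sum to 1.3, so the cap is reachable and not merely defensive.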
func ImportanceForLang ¶
ImportanceForLang is the language-aware variant. lang is a 2-letter code ("en", "de", ...) or empty / unknown -- empty or unknown codes fall back to English and emit a single debug-level slog warning so operators notice config drift without failing requests.
The Python reference (compute_importance) unions every language's signal words into one global set; we do the same via Union mode when lang == "all" or "*". Default is per-language because the Go call site knows the memory's language from metadata.
func ListLanguages ¶
func ListLanguages() []string
ListLanguages returns the sorted list of available 2-letter codes. Useful for CLI flag validation + error messages.
func Recurrence ¶
Recurrence detects recurring-event signals in content and returns a normalised pattern string + list of prefixed tags suitable for the store-time metadata merge. Mirrors Python's extract_recurrence in src/ogham/extraction.py but returns a pattern+tags shape so the Go store pipeline can emit a `recurrence:<normalised>` tag without a separate transform.
Detection runs in two stages:
1. YAML-driven explicit phrases (recurrence_patterns block): "daily", "weekly", "biweekly", "monthly", "yearly", German equivalents ("wöchentlich", "monatlich", ...). Each pattern maps to a canonical normalised category.
2. every_words + day_names -- Python's original path. "every monday" / "jeden Dienstag" / adverbial "montags" all fire this branch. Multiple day hits collapse to a single "weekly" pattern + one recurrence:<dayname> tag per matched day.
Returns (pattern, tags, true) on a hit, (_, nil, false) otherwise. Tags are sorted + deduplicated; the canonical pattern is the coarse category ("daily" / "weekly" / "biweekly" / "monthly" / "quarterly" / "yearly"). Callers should:
metadata["recurrence"] = pattern
tags = append(tags, recurrenceTags...)
The pattern is safe to store as-is in JSONB.
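The every_words + day_names stage and the caller-side merge can be sketched as below. This is an English-only simplification: the word lists are placeholders, only an every-word immediately followed by a day name is detected, and the `recurrence` function here is a local stand-in rather than the package's Recurrence.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Placeholder English lists; the real ones come from the language files.
var (
	everyWords = map[string]struct{}{"every": {}, "each": {}}
	dayNames   = map[string]int{
		"sunday": 0, "monday": 1, "tuesday": 2, "wednesday": 3,
		"thursday": 4, "friday": 5, "saturday": 6,
	}
)

// recurrence returns ("weekly", sorted tags, true) when an every-word is
// directly followed by a day name, and ("", nil, false) otherwise.
func recurrence(content string) (pattern string, tags []string, ok bool) {
	tokens := strings.Fields(strings.ToLower(content))
	seen := map[string]struct{}{}
	for i := 0; i+1 < len(tokens); i++ {
		if _, isEvery := everyWords[tokens[i]]; !isEvery {
			continue
		}
		day := strings.Trim(tokens[i+1], ".,;:!?")
		if _, isDay := dayNames[day]; isDay {
			seen["recurrence:"+day] = struct{}{}
		}
	}
	if len(seen) == 0 {
		return "", nil, false
	}
	for t := range seen {
		tags = append(tags, t)
	}
	sort.Strings(tags)
	return "weekly", tags, true
}

func main() {
	metadata := map[string]string{}
	tags := []string{"team:platform"}
	if pattern, recurrenceTags, ok := recurrence("Standup every Monday, retro every friday."); ok {
		metadata["recurrence"] = pattern
		tags = append(tags, recurrenceTags...)
	}
	fmt.Println(metadata, tags)
}
```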
Types ¶
type LanguageRules ¶
type LanguageRules struct {
// Named-day lookup: "monday" -> 1 etc. 0-indexed with Sunday=0, i.e.
// Python's datetime.weekday() convention (Monday=0) shifted by one,
// modulo 7. Preserve the exact mapping or recurrence detection drifts
// across languages.
DayNames map[string]int `yaml:"day_names"`
// Keywords that signal a recurring event (every, each, weekly, ...).
EveryWords []string `yaml:"every_words"`
// Low-recall temporal markers -- when, date, time, ago, last, ...
TemporalKeywords []string `yaml:"temporal_keywords"`
// Direction markers, split by direction (after / before / around).
// The YAML file nests: direction_words: { after: [...], before: [...] }
DirectionWords map[string][]string `yaml:"direction_words"`
// Scoring signal classes. importance += 0.3 if any decision word
// matches, +0.2 for error, +0.2 for architecture, etc.
DecisionWords []string `yaml:"decision_words"`
ErrorWords []string `yaml:"error_words"`
ArchitectureWords []string `yaml:"architecture_words"`
EventWords []string `yaml:"event_words"`
ActivityWords []string `yaml:"activity_words"`
EmotionWords []string `yaml:"emotion_words"`
RelationshipWords []string `yaml:"relationship_words"`
PossessiveTriggers []string `yaml:"possessive_triggers"`
QuantityUnits []string `yaml:"quantity_units"`
PreferenceWords []string `yaml:"preference_words"`
NegationMarkers []string `yaml:"negation_markers"`
CompressionDecisionWords []string `yaml:"compression_decision_words"`
// Month name -> month number (1-12). Parse-assist for "15 March" etc.
MonthNames map[string]int `yaml:"month_names"`
// Numeric spelling: "one" -> 1, "fifty" -> 50. Needed for
// "in two weeks" style relative-date parsing.
WordNumbers map[string]int `yaml:"word_numbers"`
// QueryFiller holds low-information words we strip from queries
// before hitting the FTS index ("how do I X?" -> "X").
QueryFiller []string `yaml:"query_filler"`
// QueryHints is a nested map: { multi_hop: [...], ordering: [...],
// summary: [...] }. Each bucket signals a different intent-detection
// gate; the keys are stable across languages but the values are
// localised.
QueryHints map[string][]string `yaml:"query_hints"`
// --- Date anchors + modifiers (v0.7) --------------------------------
// Today / tomorrow / yesterday equivalents. Single-token phrases --
// multi-word anchors like "the day after tomorrow" are out of scope
// (handled by parsedatetime on the Python side, not ported).
TodayWords []string `yaml:"today_words"`
TomorrowWords []string `yaml:"tomorrow_words"`
YesterdayWords []string `yaml:"yesterday_words"`
// Modifiers for "last/next/this <weekday|period>". English has one
// word per bucket; German has several (letzten/letzte/letzter).
ModifierLast []string `yaml:"modifier_last"`
ModifierNext []string `yaml:"modifier_next"`
ModifierThis []string `yaml:"modifier_this"`
// Periods: "week" / "month" / "year" equivalents.
PeriodWeek []string `yaml:"period_week"`
PeriodMonth []string `yaml:"period_month"`
PeriodYear []string `yaml:"period_year"`
// Units for "N <unit> ago" / "in N <unit>". English includes both
// singular + plural forms ("day", "days"); German inflects
// differently so YAML spells out every surface form.
UnitDay []string `yaml:"unit_day"`
UnitWeek []string `yaml:"unit_week"`
UnitMonth []string `yaml:"unit_month"`
UnitYear []string `yaml:"unit_year"`
// Direction markers: "ago" = past, "in" = future.
AgoMarkers []string `yaml:"ago_markers"`
InMarkers []string `yaml:"in_markers"`
// --- Recurrence (v0.7) ---------------------------------------------
// Additional recurrence anchor phrases beyond EveryWords+DayNames.
// Each entry: { pattern: regex, normalised: weekly|daily|... }.
// Regex is anchored at word boundaries by the recurrence matcher.
RecurrencePatterns []RecurrencePattern `yaml:"recurrence_patterns"`
// --- Person-name regex tightening (v0.7) ---------------------------
// PersonNameDenylist is the set of Capitalised bigrams / unigrams
// that SHOULD NOT be tagged as person: even if they pass the basic
// shape check. Case-insensitive match at lookup site.
PersonNameDenylist []string `yaml:"person_name_denylist"`
// PersonNameContextWords is the set of preceding-context tokens that
// license a standalone Capitalised bigram (e.g. "by Kevin Burns",
// "from John Doe"). Without a matching context word in the 3-token
// window before the candidate, the candidate is rejected. English
// seed: by/from/user/person/with/met/said/told/asked.
PersonNameContextWords []string `yaml:"person_name_context_words"`
}
LanguageRules holds every word list + keyword map a language file ships. The Python YAML keys map 1:1 onto field tags so parity stays honest: when the Python YAML adds a new section, the Go struct gets a compile error until we add the field.
Words are NOT lowercased at load time on purpose -- some languages have case-significant stopwords (German nouns, for example). Callers that want case-insensitive lookup should fold to lowercase at the lookup site. Use the Set() helper on each list to get a map[string]struct{} for O(1) lookups.
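The difference between a case-preserving set and lookup-site folding can be sketched as follows. The `set` and `setLower` functions below are local, hypothetical helpers written for the example; they share the shape of the exported Set and SetLower signatures, but their bodies are an assumption about the behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// set builds a lookup set that preserves the original casing.
func set(s []string) map[string]struct{} {
	out := make(map[string]struct{}, len(s))
	for _, w := range s {
		out[w] = struct{}{}
	}
	return out
}

// setLower builds a lookup set with every entry folded to lowercase.
func setLower(s []string) map[string]struct{} {
	out := make(map[string]struct{}, len(s))
	for _, w := range s {
		out[strings.ToLower(w)] = struct{}{}
	}
	return out
}

// has reports whether key is present in the set.
func has(m map[string]struct{}, key string) bool {
	_, ok := m[key]
	return ok
}

func main() {
	words := []string{"Entscheidung", "decided"}

	exact := set(words)
	fmt.Println(has(exact, "Entscheidung"), has(exact, "entscheidung"))

	// Case-insensitive lookup: fold both the list and the probe.
	folded := setLower(words)
	fmt.Println(has(folded, strings.ToLower("ENTSCHEIDUNG")))
}
```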
func LoadLanguage ¶
func LoadLanguage(code string) (*LanguageRules, error)
LoadLanguage returns the parsed rules for a 2-letter code (or "pt-br" for regional variants). Returns a nil-safe result + ErrLanguageNotFound if the code isn't in the embedded set -- callers can fall back to English rather than crashing.
The registry is built once per process and memoised. Subsequent calls are a map lookup; cheap enough to call on every Importance() invocation.
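One common way to get build-once, lookup-afterwards behaviour is `sync.Once`. The sketch below assumes that approach for illustration only; the source does not say how the registry is memoised, and `rules`, `buildRegistry`, `loadLanguage`, and the `builds` counter are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// rules is a tiny stand-in for LanguageRules.
type rules struct {
	DecisionWords []string
}

var errNotFound = errors.New("language not found")

var (
	once     sync.Once
	registry map[string]*rules
	builds   int // counts registry constructions, for demonstration only
)

// buildRegistry stands in for parsing the embedded language files.
func buildRegistry() {
	builds++
	registry = map[string]*rules{
		"en": {DecisionWords: []string{"decided"}},
		"de": {DecisionWords: []string{"entschieden"}},
	}
}

// loadLanguage builds the registry on first use; later calls are a map lookup.
func loadLanguage(code string) (*rules, error) {
	once.Do(buildRegistry)
	r, ok := registry[code]
	if !ok {
		return nil, errNotFound
	}
	return r, nil
}

func main() {
	for _, code := range []string{"en", "de", "en", "xx"} {
		_, err := loadLanguage(code)
		fmt.Println(code, err)
	}
	fmt.Println("registry builds:", builds)
}
```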
type RecurrencePattern ¶
type RecurrencePattern struct {
Pattern string `yaml:"pattern"`
Normalised string `yaml:"normalised"`
}
RecurrencePattern is one entry in a language's recurrence_patterns block. The regex is compiled once per pattern and held in a package-level cache. Normalised is the canonical form exposed as metadata.recurrence and as a recurrence:<normalised> tag.
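A compile-once cache with word-boundary anchoring might look like the sketch below. The mutex-guarded map, the `(?i)` flag, and the `compiled` and `matchPattern` names are assumptions for the illustration. Note that Go's `\b` is an ASCII word boundary, so a pattern that begins or ends with a non-ASCII letter would need different anchoring.

```go
package main

import (
	"fmt"
	"regexp"
	"sync"
)

// recurrencePattern mirrors the documented struct shape.
type recurrencePattern struct {
	Pattern    string
	Normalised string
}

var (
	mu    sync.Mutex
	cache = map[string]*regexp.Regexp{}
)

// compiled returns the cached regex for p, compiling it on first use and
// wrapping it in word boundaries.
func compiled(p string) (*regexp.Regexp, error) {
	mu.Lock()
	defer mu.Unlock()
	if re, ok := cache[p]; ok {
		return re, nil
	}
	re, err := regexp.Compile(`(?i)\b(?:` + p + `)\b`)
	if err != nil {
		return nil, err
	}
	cache[p] = re
	return re, nil
}

// matchPattern returns the normalised category of the first pattern that hits.
func matchPattern(content string, patterns []recurrencePattern) (string, bool) {
	for _, rp := range patterns {
		re, err := compiled(rp.Pattern)
		if err != nil {
			continue // skip malformed patterns in this sketch
		}
		if re.MatchString(content) {
			return rp.Normalised, true
		}
	}
	return "", false
}

func main() {
	patterns := []recurrencePattern{
		{Pattern: `weekly`, Normalised: "weekly"},
		{Pattern: `bi-?weekly`, Normalised: "biweekly"},
		{Pattern: `wöchentlich`, Normalised: "weekly"},
	}
	fmt.Println(matchPattern("Team sync biweekly", patterns))
	fmt.Println(matchPattern("Bericht wöchentlich senden", patterns))
}
```

The boundary anchoring is what keeps the plain `weekly` pattern from firing inside "biweekly".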