extraction

package
v0.7.3
Published: May 20, 2026 License: MIT Imports: 11 Imported by: 0

Documentation

Overview

Package extraction implements the regex-driven entity + date extraction that runs at store_memory time. This is the Go-side absorption of what lives in src/ogham/extraction.py on the Python side.

v0.5 Day 1 scope is English-only and covers the four pattern-based categories that need no external data files:

  • CamelCase identifiers -> entity:
  • file paths -> file:
  • error / exception types -> error:
  • person names -> person:

Multi-language support (de/fr/es/zh) and the richer enrichment entities (events, emotions, relationships, quantities, preferences, locations via GeoText) are v0.6 scope. Do NOT extend this file with YAML word lists until then.

Constants

const MaxEntities = 20

MaxEntities is the cap the Python implementation enforces on the final sorted result. Mirrored here so parity tests don't diverge.

Variables

var ErrLanguageNotFound = errLanguageNotFound{}

ErrLanguageNotFound is the sentinel error LoadLanguage returns when a code isn't in the embedded set. Exported so callers can errors.Is() against it rather than string-matching.
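
A minimal sketch of the errors.Is() pattern described above. It does not import this package: the sentinel type, its error text, and the loadLanguage and isNotFound helpers are hypothetical stand-ins, assuming the real loader returns or wraps the sentinel in a comparable way.

```go
package main

import (
	"errors"
	"fmt"
)

// errLanguageNotFound is a stand-in for the package's unexported sentinel type.
type errLanguageNotFound struct{}

// Error returns a message; the exact text is an assumption for this sketch.
func (errLanguageNotFound) Error() string { return "language not found" }

// ErrLanguageNotFound mirrors the shape of the exported sentinel.
var ErrLanguageNotFound = errLanguageNotFound{}

// loadLanguage is a hypothetical loader over a tiny in-memory set.
func loadLanguage(code string) (string, error) {
	known := map[string]string{"en": "English", "de": "German"}
	if name, ok := known[code]; ok {
		return name, nil
	}
	// Wrapping with %w keeps the sentinel reachable through errors.Is.
	return "", fmt.Errorf("load language %q: %w", code, ErrLanguageNotFound)
}

// isNotFound matches on the sentinel value rather than on the message text.
func isNotFound(err error) bool {
	return errors.Is(err, ErrLanguageNotFound)
}

func main() {
	if _, err := loadLanguage("xx"); isNotFound(err) {
		fmt.Println("unknown code, falling back to English")
	}
}
```

Because the sentinel is an empty struct value, it is comparable, so errors.Is still finds it after wrapping.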

Functions

func Dates

func Dates(content string) []string

Dates extracts sorted, deduplicated ISO-format (YYYY-MM-DD) dates from content. It recognises three families, mirroring src/ogham/extraction.py::extract_dates:

  • ISO machine dates: 2026-04-20 or 2026/04/20 (slash normalised)
  • Natural English: "April 20, 2026" / "20 April 2026" case-insensitive, optional ordinal suffix
  • Relative phrases: "yesterday" / "today" / "tomorrow" / "last|next|this <weekday|week|month|year>" / "N days|weeks|months|years ago" / "in N days|weeks|months|years"

Relative phrases resolve only when no absolute date is present -- matches Python behaviour.

Output is always sorted ascending, deduplicated, and every token matches ^\d{4}-\d{2}-\d{2}$.
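
The output contract above can be illustrated for the ISO family alone. This is a sketch under stated assumptions, not the package's code: the regex, the isoDates name, and the time.Parse validity check are hypothetical, and the natural-language and relative families are left out.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"time"
)

// isoDateRe matches YYYY-MM-DD or YYYY/MM/DD (hypothetical pattern).
var isoDateRe = regexp.MustCompile(`\b(\d{4})[-/](\d{2})[-/](\d{2})\b`)

// isoDates returns sorted, deduplicated dates with slashes normalised to dashes.
func isoDates(content string) []string {
	seen := map[string]struct{}{}
	out := []string{}
	for _, m := range isoDateRe.FindAllStringSubmatch(content, -1) {
		d := m[1] + "-" + m[2] + "-" + m[3]
		// Assumption: drop impossible dates such as 2026-13-40.
		if _, err := time.Parse("2006-01-02", d); err != nil {
			continue
		}
		if _, dup := seen[d]; dup {
			continue
		}
		seen[d] = struct{}{}
		out = append(out, d)
	}
	// Lexicographic order equals chronological order for YYYY-MM-DD strings.
	sort.Strings(out)
	return out
}

func main() {
	fmt.Println(isoDates("Released 2026/04/20, planned 2026-04-20, drafted 2025-12-01"))
	// prints [2025-12-01 2026-04-20]
}
```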

func DatesAt

func DatesAt(content string, ref time.Time) []string

DatesAt is the testable variant: relative phrases resolve against `ref` instead of time.Now(). Tests use a fixed ref so the committed PICT matrix asserts deterministic expected dates. English-only -- a language-aware variant is available via DatesAtForLang.
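
To show why a fixed ref makes relative phrases deterministic, here is a small sketch. resolveRelative, fixedRef, and the regex are hypothetical names for illustration; the sketch covers only "yesterday" / "today" / "tomorrow" and "N <unit> ago", and it assumes months and years shift by calendar arithmetic.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
	"strings"
	"time"
)

// agoRe matches phrases such as "3 weeks ago" (hypothetical pattern).
var agoRe = regexp.MustCompile(`^(\d+) (day|week|month|year)s? ago$`)

// fixedRef is the kind of pinned reference time a test would pass in.
func fixedRef() time.Time {
	return time.Date(2026, time.April, 20, 12, 0, 0, 0, time.UTC)
}

// resolveRelative resolves one phrase against ref instead of time.Now().
func resolveRelative(phrase string, ref time.Time) (string, bool) {
	const layout = "2006-01-02"
	p := strings.ToLower(strings.TrimSpace(phrase))
	switch p {
	case "yesterday":
		return ref.AddDate(0, 0, -1).Format(layout), true
	case "today":
		return ref.Format(layout), true
	case "tomorrow":
		return ref.AddDate(0, 0, 1).Format(layout), true
	}
	m := agoRe.FindStringSubmatch(p)
	if m == nil {
		return "", false
	}
	n, err := strconv.Atoi(m[1])
	if err != nil {
		return "", false
	}
	switch m[2] {
	case "day":
		return ref.AddDate(0, 0, -n).Format(layout), true
	case "week":
		return ref.AddDate(0, 0, -7*n).Format(layout), true
	case "month":
		return ref.AddDate(0, -n, 0).Format(layout), true
	default: // "year"
		return ref.AddDate(-n, 0, 0).Format(layout), true
	}
}

func main() {
	d, _ := resolveRelative("3 weeks ago", fixedRef())
	fmt.Println(d) // prints 2026-03-30
}
```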

func DatesAtForLang

func DatesAtForLang(content string, ref time.Time, lang string) []string

DatesAtForLang resolves dates using the specified language's month names, weekday names, and relative-phrase anchors (today/tomorrow/yesterday equivalents). Unknown language codes fall back to English; see resolveRules for the logging policy.

func Entities

func Entities(content string) []string

Entities extracts typed tag strings from content and returns a sorted, deduplicated, length-capped slice. Output shape parity with Python's extract_entities(): each element is prefix:value.

Uses English person-name rules. For localised content, callers should use EntitiesForLang so the denylist vocab swaps to the memory's language.
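
The sorted, deduplicated, length-capped output shape can be sketched as a final pass over raw prefix:value tags. finalizeTags and the sample tags are hypothetical; the only details taken from this page are the prefix:value shape and the cap of 20 applied to the sorted result.

```go
package main

import (
	"fmt"
	"sort"
)

// maxEntities mirrors the documented MaxEntities cap.
const maxEntities = 20

// finalizeTags deduplicates, sorts, then truncates to the cap.
func finalizeTags(tags []string) []string {
	seen := make(map[string]struct{}, len(tags))
	out := make([]string, 0, len(tags))
	for _, t := range tags {
		if _, dup := seen[t]; dup {
			continue
		}
		seen[t] = struct{}{}
		out = append(out, t)
	}
	sort.Strings(out)
	// The cap applies to the sorted result, so the survivors are deterministic.
	if len(out) > maxEntities {
		out = out[:maxEntities]
	}
	return out
}

func main() {
	raw := []string{
		"person:Ada Lovelace",
		"file:main.go",
		"error:TimeoutError",
		"file:main.go", // duplicate, dropped
	}
	fmt.Println(finalizeTags(raw))
	// prints [error:TimeoutError file:main.go person:Ada Lovelace]
}
```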

func EntitiesForLang

func EntitiesForLang(content, lang string) []string

EntitiesForLang is the language-aware entity extractor. Only the person-name classifier is language-sensitive today -- the CamelCase, file-path, and error-type regexes are universal because their anchors (A-Z, dot segments, Error/Exception suffix) don't vary by locale.

func Importance

func Importance(content string, tags []string) float64

Importance scores content on a 0.0-1.0 scale. Mirrors src/ogham/extraction.py::compute_importance:

base 0.2
+ 0.3 if content contains a DECISION_WORDS signal
+ 0.2 if ERROR_WORDS signal or an ...Error/...Exception regex match
+ 0.2 if ARCHITECTURE_WORDS signal
+ 0.1 if a file path appears
+ 0.1 if a code fence (```) or inline code (`) marker appears
+ 0.1 if len(content) > 500
+ 0.1 if len(tags) >= 3
capped at 1.0

Backward-compatible entry point. Uses English rules, matching the pre-language-plumbing behaviour to keep parity tests stable.
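
The additive scoring above can be sketched as follows. The word lists and both regexes are hypothetical placeholders (the real lists live in the language YAML files and are not shown on this page); only the weights, the thresholds, and the 1.0 cap come from the table.

```go
package main

import (
	"fmt"
	"math"
	"regexp"
	"strings"
)

// Hypothetical signal lists, for illustration only.
var (
	decisionWords     = []string{"decided", "chose"}
	errorWords        = []string{"error", "failed"}
	architectureWords = []string{"architecture", "schema"}
	errorTypeRe       = regexp.MustCompile(`\b[A-Z]\w*(Error|Exception)\b`)
	filePathRe        = regexp.MustCompile(`[\w./-]+\.(go|py|md|yaml)\b`)
)

// containsAny reports whether lower contains any of the words as a substring.
func containsAny(lower string, words []string) bool {
	for _, w := range words {
		if strings.Contains(lower, w) {
			return true
		}
	}
	return false
}

// importance applies the documented weights and caps the sum at 1.0.
func importance(content string, tags []string) float64 {
	score := 0.2 // base
	lower := strings.ToLower(content)
	if containsAny(lower, decisionWords) {
		score += 0.3
	}
	if containsAny(lower, errorWords) || errorTypeRe.MatchString(content) {
		score += 0.2
	}
	if containsAny(lower, architectureWords) {
		score += 0.2
	}
	if filePathRe.MatchString(content) {
		score += 0.1
	}
	// A single backtick check covers both inline code and ``` fences.
	if strings.Contains(content, "`") {
		score += 0.1
	}
	if len(content) > 500 {
		score += 0.1
	}
	if len(tags) >= 3 {
		score += 0.1
	}
	return math.Min(score, 1.0)
}

func main() {
	fmt.Printf("%.1f\n", importance("We decided to use Postgres", nil)) // prints 0.5
}
```

If every signal fires, the raw sum is 1.3, which is why the cap matters.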

func ImportanceForLang

func ImportanceForLang(content string, tags []string, lang string) float64

ImportanceForLang is the language-aware variant. lang is a 2-letter code ("en", "de", ...). Empty or unknown codes fall back to English and emit a single debug-level slog message, so operators notice config drift without failing requests.

The Python reference (compute_importance) unions every language's signal words into one global set; we do the same via Union mode when lang == "all" or "*". Default is per-language because the Go call site knows the memory's language from metadata.

func ListLanguages

func ListLanguages() []string

ListLanguages returns the sorted list of available 2-letter codes. Useful for CLI flag validation + error messages.

func Recurrence

func Recurrence(content, lang string) (pattern string, tags []string, ok bool)

Recurrence detects recurring-event signals in content and returns a normalised pattern string + list of prefixed tags suitable for the store-time metadata merge. Mirrors Python's extract_recurrence in src/ogham/extraction.py but returns a pattern+tags shape so the Go store pipeline can emit a `recurrence:<normalised>` tag without a separate transform.

Detection runs in two stages:

  1. YAML-driven explicit phrases (recurrence_patterns block): "daily", "weekly", "biweekly", "monthly", "yearly", German equivalents ("wöchentlich", "monatlich", ...). Each pattern maps to a canonical normalised category.

  2. every_words + day_names -- Python's original path. "every monday" / "jeden Dienstag" / adverbial "montags" all fire this branch. Multiple day hits collapse to a single "weekly" pattern + one recurrence:<dayname> tag per matched day.

Returns (pattern, tags, true) on a hit, (_, nil, false) otherwise. Tags are sorted + deduplicated; the canonical pattern is the coarse category ("daily" / "weekly" / "biweekly" / "monthly" / "quarterly" / "yearly"). Callers should:

metadata["recurrence"] = pattern
tags = append(tags, recurrenceTags...)

The pattern is safe to store as-is in JSONB.
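
A rough sketch of the second detection stage (every_words + day_names) and the caller-side merge, assuming English-only hypothetical word lists. The recurrence function here is a simplified stand-in: it ignores the YAML-driven first stage and the lang argument.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"unicode"
)

// Hypothetical English seeds; the real lists come from the language YAML.
var (
	everyWords = map[string]struct{}{"every": {}, "each": {}}
	dayNames   = map[string]struct{}{
		"monday": {}, "tuesday": {}, "wednesday": {}, "thursday": {},
		"friday": {}, "saturday": {}, "sunday": {},
	}
)

// recurrence returns "weekly" plus one recurrence:<dayname> tag per matched
// day when an every-word and at least one day name both appear.
func recurrence(content string) (string, []string, bool) {
	words := strings.FieldsFunc(strings.ToLower(content), func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	hasEvery := false
	seen := map[string]struct{}{}
	for _, w := range words {
		if _, found := everyWords[w]; found {
			hasEvery = true
		}
		if _, found := dayNames[w]; found {
			seen["recurrence:"+w] = struct{}{}
		}
	}
	if !hasEvery || len(seen) == 0 {
		return "", nil, false
	}
	tags := make([]string, 0, len(seen))
	for t := range seen {
		tags = append(tags, t)
	}
	sort.Strings(tags)
	return "weekly", tags, true
}

func main() {
	metadata := map[string]any{}
	tags := []string{"type:meeting"}
	if pattern, recurrenceTags, ok := recurrence("Standup every Monday and Wednesday"); ok {
		metadata["recurrence"] = pattern
		tags = append(tags, recurrenceTags...)
	}
	fmt.Println(metadata["recurrence"], tags)
	// prints weekly [type:meeting recurrence:monday recurrence:wednesday]
}
```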

func Set

func Set(s []string) map[string]struct{}

Set returns a string set (map[string]struct{}) for O(1) membership checks. Uses the input slice as-is -- caller handles case folding if needed. Safe on a nil slice (returns an empty set).

func SetLower

func SetLower(s []string) map[string]struct{}

SetLower is the common case: case-fold to lowercase + deduplicate. Use Set() when case-sensitivity matters (German noun stopwords, etc.).
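
For illustration, one way helpers with these contracts could be written. This is a sketch, not the package source: set and setLower are lowercase stand-ins for the exported functions, and the German sample words are only an example.

```go
package main

import (
	"fmt"
	"strings"
)

// set builds a membership set from s without changing case.
// A nil slice yields an empty, non-nil map.
func set(s []string) map[string]struct{} {
	out := make(map[string]struct{}, len(s))
	for _, v := range s {
		out[v] = struct{}{}
	}
	return out
}

// setLower folds each entry to lowercase, which also merges case variants.
func setLower(s []string) map[string]struct{} {
	out := make(map[string]struct{}, len(s))
	for _, v := range s {
		out[strings.ToLower(v)] = struct{}{}
	}
	return out
}

func main() {
	words := []string{"Der", "der", "Haus"}
	fmt.Println(len(set(words)), len(setLower(words))) // prints 3 2

	// Case-insensitive lookup folds at the lookup site.
	_, ok := setLower(words)[strings.ToLower("HAUS")]
	fmt.Println(ok) // prints true
}
```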

Types

type LanguageRules

type LanguageRules struct {
	// Named-day lookup: "monday" -> 1 etc. 0-indexed, Sunday=0 matching
	// Python's datetime.weekday() convention shifted by one. Preserve
	// the exact mapping or recurrence detection drifts across languages.
	DayNames map[string]int `yaml:"day_names"`

	// Keywords that signal a recurring event (every, each, weekly, ...).
	EveryWords []string `yaml:"every_words"`

	// Low-recall temporal markers -- when, date, time, ago, last, ...
	TemporalKeywords []string `yaml:"temporal_keywords"`

	// Direction markers, split by direction (after / before / around).
	// The YAML file nests: direction_words: { after: [...], before: [...] }
	DirectionWords map[string][]string `yaml:"direction_words"`

	// Scoring signal classes. importance += 0.3 if any decision word
	// matches, +0.2 for error, +0.2 for architecture, etc.
	DecisionWords            []string `yaml:"decision_words"`
	ErrorWords               []string `yaml:"error_words"`
	ArchitectureWords        []string `yaml:"architecture_words"`
	EventWords               []string `yaml:"event_words"`
	ActivityWords            []string `yaml:"activity_words"`
	EmotionWords             []string `yaml:"emotion_words"`
	RelationshipWords        []string `yaml:"relationship_words"`
	PossessiveTriggers       []string `yaml:"possessive_triggers"`
	QuantityUnits            []string `yaml:"quantity_units"`
	PreferenceWords          []string `yaml:"preference_words"`
	NegationMarkers          []string `yaml:"negation_markers"`
	CompressionDecisionWords []string `yaml:"compression_decision_words"`

	// Month name -> month number (1-12). Parse-assist for "15 March" etc.
	MonthNames map[string]int `yaml:"month_names"`

	// Numeric spelling: "one" -> 1, "fifty" -> 50. Needed for
	// "in two weeks" style relative-date parsing.
	WordNumbers map[string]int `yaml:"word_numbers"`

	// QueryFiller holds low-information words we strip from queries
	// before hitting the FTS index ("how do I X?" -> "X").
	QueryFiller []string `yaml:"query_filler"`

	// QueryHints is a nested map: { multi_hop: [...], ordering: [...],
	// summary: [...] }. Each bucket signals a different intent-detection
	// gate; the keys are stable across languages but the values are
	// localised.
	QueryHints map[string][]string `yaml:"query_hints"`

	// --- Date anchors + modifiers (v0.7) --------------------------------
	// Today / tomorrow / yesterday equivalents. Single-token phrases --
	// multi-word anchors like "the day after tomorrow" are out of scope
	// (handled by parsedatetime on the Python side, not ported).
	TodayWords     []string `yaml:"today_words"`
	TomorrowWords  []string `yaml:"tomorrow_words"`
	YesterdayWords []string `yaml:"yesterday_words"`

	// Modifiers for "last/next/this <weekday|period>". English has one
	// word per bucket; German has several (letzten/letzte/letzter).
	ModifierLast []string `yaml:"modifier_last"`
	ModifierNext []string `yaml:"modifier_next"`
	ModifierThis []string `yaml:"modifier_this"`

	// Periods: "week" / "month" / "year" equivalents.
	PeriodWeek  []string `yaml:"period_week"`
	PeriodMonth []string `yaml:"period_month"`
	PeriodYear  []string `yaml:"period_year"`

	// Units for "N <unit> ago" / "in N <unit>". English includes both
	// singular + plural forms ("day", "days"); German inflects
	// differently so YAML spells out every surface form.
	UnitDay   []string `yaml:"unit_day"`
	UnitWeek  []string `yaml:"unit_week"`
	UnitMonth []string `yaml:"unit_month"`
	UnitYear  []string `yaml:"unit_year"`

	// Direction markers: "ago" = past, "in" = future.
	AgoMarkers []string `yaml:"ago_markers"`
	InMarkers  []string `yaml:"in_markers"`

	// --- Recurrence (v0.7) ---------------------------------------------
	// Additional recurrence anchor phrases beyond EveryWords+DayNames.
	// Each entry: { pattern: regex, normalised: weekly|daily|... }.
	// Regex is anchored at word boundaries by the recurrence matcher.
	RecurrencePatterns []RecurrencePattern `yaml:"recurrence_patterns"`

	// --- Person-name regex tightening (v0.7) ---------------------------
	// PersonNameDenylist is the set of Capitalised bigrams / unigrams
	// that SHOULD NOT be tagged as person: even if they pass the basic
	// shape check. Case-insensitive match at lookup site.
	PersonNameDenylist []string `yaml:"person_name_denylist"`

	// PersonNameContextWords is the set of preceding-context tokens that
	// license a standalone Capitalised bigram (e.g. "by Kevin Burns",
	// "from John Doe"). Without a matching context word in the 3-token
	// window before the candidate, the candidate is rejected. English-
	// seed: by/from/user/person/with/met/said/told/asked.
	PersonNameContextWords []string `yaml:"person_name_context_words"`
}

LanguageRules holds every word list + keyword map a language file ships. The Python YAML keys map 1:1 onto field tags so parity stays honest: when the Python YAML adds a new section, the Go struct gets a compile error until we add the field.

Words are NOT lowercased at load time on purpose -- some languages have case-significant stopwords (German nouns, for example). Callers that want case-insensitive lookup should fold to lowercase at the lookup site. Use the Set() helper on each list to get a map[string]struct{} for O(1) lookups.
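
The person-name fields describe a denylist plus a 3-token context window. A minimal sketch of that gate follows, assuming whitespace tokenisation and bigram-only candidates. acceptPerson and acceptPersonIn are hypothetical names; the context words are the English seed quoted in the struct comment, and the denylist entry is invented for the example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
	"unicode/utf8"
)

var (
	// English seed taken from the PersonNameContextWords comment.
	contextWords = map[string]struct{}{
		"by": {}, "from": {}, "user": {}, "person": {}, "with": {},
		"met": {}, "said": {}, "told": {}, "asked": {},
	}
	// Hypothetical denylist entry, stored lowercase for case-insensitive lookup.
	denylist = map[string]struct{}{"new york": {}}
)

// isCapitalised reports whether w starts with an uppercase letter.
func isCapitalised(w string) bool {
	r, _ := utf8.DecodeRuneInString(w)
	return unicode.IsUpper(r)
}

// acceptPerson decides whether tokens[i], tokens[i+1] may be tagged person:.
func acceptPerson(tokens []string, i int) bool {
	if i < 0 || i+1 >= len(tokens) {
		return false
	}
	first, second := tokens[i], tokens[i+1]
	if !isCapitalised(first) || !isCapitalised(second) {
		return false
	}
	if _, denied := denylist[strings.ToLower(first+" "+second)]; denied {
		return false
	}
	// Look back at most three tokens for a licensing context word.
	start := i - 3
	if start < 0 {
		start = 0
	}
	for _, t := range tokens[start:i] {
		if _, ok := contextWords[strings.ToLower(t)]; ok {
			return true
		}
	}
	return false
}

// acceptPersonIn tokenises text on whitespace before applying the gate.
func acceptPersonIn(text string, i int) bool {
	return acceptPerson(strings.Fields(text), i)
}

func main() {
	fmt.Println(acceptPersonIn("written by Kevin Burns", 2)) // prints true
	fmt.Println(acceptPersonIn("Kevin Burns arrived", 0))    // prints false
}
```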

func LoadLanguage

func LoadLanguage(code string) (*LanguageRules, error)

LoadLanguage returns the parsed rules for a 2-letter code (or "pt-br" for regional variants). Returns a nil-safe result + ErrLanguageNotFound if the code isn't in the embedded set -- callers can fall back to English rather than crashing.

The registry is built once per process and memoised. Subsequent calls are a map lookup; cheap enough to call on every Importance() invocation.
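
One common way to get a build-once, memoised registry with an English fallback is sync.Once. The sketch below assumes that shape; the rules type, loadRegistry, rulesOrEnglish, and buildCount are hypothetical and stand in for the embedded-YAML parsing that this page does not show.

```go
package main

import (
	"fmt"
	"sync"
)

// rules is a tiny stand-in for LanguageRules.
type rules struct {
	Code string
}

var (
	registryOnce sync.Once
	registry     map[string]*rules
	buildCount   int // counts how often the registry is actually built
)

// loadRegistry builds the registry on first use and reuses it afterwards.
func loadRegistry() map[string]*rules {
	registryOnce.Do(func() {
		buildCount++
		// Assumption: the real code would parse embedded YAML files here.
		registry = map[string]*rules{
			"en": {Code: "en"},
			"de": {Code: "de"},
		}
	})
	return registry
}

// rulesOrEnglish falls back to English for unknown codes instead of failing.
func rulesOrEnglish(code string) *rules {
	reg := loadRegistry()
	if r, ok := reg[code]; ok {
		return r
	}
	return reg["en"]
}

func main() {
	fmt.Println(rulesOrEnglish("de").Code) // prints de
	fmt.Println(rulesOrEnglish("xx").Code) // prints en
	fmt.Println(buildCount)                // prints 1
}
```

After the first call, each lookup is a single map read, which matches the claim that calling the loader on every scoring request is cheap.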

type RecurrencePattern

type RecurrencePattern struct {
	Pattern    string `yaml:"pattern"`
	Normalised string `yaml:"normalised"`
}

RecurrencePattern is one entry in a language's recurrence_patterns block. The regex is compiled once per pattern and kept in a package-level cache. Normalised is the canonical form exposed as metadata.recurrence and as a recurrence:<normalised> tag.
