extraction

package

v0.7.3 (latest) | Published: May 20, 2026 | License: MIT | Imports: 11 | Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version

Repository

github.com/ogham-mcp/ogham-cli

Documentation ¶

Overview ¶

Package extraction implements the regex-driven entity + date extraction that runs at store_memory time. This is the Go-side absorption of what lives in src/ogham/extraction.py on the Python side.

v0.5 Day 1 scope is English-only and covers the four pattern-based categories that need no external data files:

CamelCase identifiers -> entity:
file paths -> file:
error / exception types -> error:
person names -> person:

Multi-language support (de/fr/es/zh) and the richer enrichment entities (events, emotions, relationships, quantities, preferences, locations via GeoText) are v0.6 scope. Do NOT extend this file with YAML word lists until then.

Index ¶

Constants
Variables
func Dates(content string) []string
func DatesAt(content string, ref time.Time) []string
func DatesAtForLang(content string, ref time.Time, lang string) []string
func Entities(content string) []string
func EntitiesForLang(content, lang string) []string
func Importance(content string, tags []string) float64
func ImportanceForLang(content string, tags []string, lang string) float64
func ListLanguages() []string
func Recurrence(content, lang string) (pattern string, tags []string, ok bool)
func Set(s []string) map[string]struct{}
func SetLower(s []string) map[string]struct{}
type LanguageRules
- func LoadLanguage(code string) (*LanguageRules, error)
type RecurrencePattern

Constants ¶

const MaxEntities = 20

MaxEntities is the cap the Python implementation enforces on the final sorted result. Mirrored here so parity tests don't diverge.

Variables ¶

var ErrLanguageNotFound = errLanguageNotFound{}

ErrLanguageNotFound is the sentinel error LoadLanguage returns when a code isn't in the embedded set. Exported so callers can errors.Is() against it rather than string-matching.

Functions ¶

func Dates ¶

func Dates(content string) []string

Dates extracts sorted, deduplicated ISO-format (YYYY-MM-DD) dates from content. It recognises three families, mirroring src/ogham/extraction.py::extract_dates:

ISO machine dates: 2026-04-20 or 2026/04/20 (slash normalised)
Natural English: "April 20, 2026" / "20 April 2026" case-insensitive, optional ordinal suffix
Relative phrases: "yesterday" / "today" / "tomorrow" / "last|next|this <weekday|week|month|year>" / "N days|weeks|months|years ago" / "in N days|weeks|months|years"

Relative phrases resolve only when no absolute date is present -- matches Python behaviour.

Output is always sorted ascending, deduplicated, and every token matches ^\d{4}-\d{2}-\d{2}$.
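The first family and the output contract can be sketched as follows. This is an illustration only: the regex, the helper name isoDates, and the decision to drop impossible dates are assumptions, not the package's actual implementation.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
	"time"
)

// isoDate matches 2026-04-20 and 2026/04/20. The pattern is an assumption.
var isoDate = regexp.MustCompile(`\b(\d{4})[-/](\d{2})[-/](\d{2})\b`)

// isoDates is a hypothetical helper covering only the ISO family.
func isoDates(content string) []string {
	seen := map[string]struct{}{}
	for _, m := range isoDate.FindAllStringSubmatch(content, -1) {
		d := strings.Join(m[1:], "-") // slash form normalised to dashes
		if _, err := time.Parse("2006-01-02", d); err != nil {
			continue // assumption: skip impossible dates such as 2026-13-40
		}
		seen[d] = struct{}{}
	}
	out := make([]string, 0, len(seen))
	for d := range seen {
		out = append(out, d)
	}
	sort.Strings(out) // lexical order equals chronological order for YYYY-MM-DD
	return out
}

func main() {
	fmt.Println(isoDates("shipped 2026/04/20, planned 2026-04-20 and 2025-12-01"))
}
```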

func DatesAt ¶

func DatesAt(content string, ref time.Time) []string

DatesAt is the testable variant: relative phrases resolve against `ref` instead of time.Now(). Tests use a fixed ref so the committed PICT matrix asserts deterministic expected dates. English-only -- a language-aware variant is available via DatesAtForLang.

func DatesAtForLang ¶

func DatesAtForLang(content string, ref time.Time, lang string) []string

DatesAtForLang resolves dates using the specified language's month names, weekday names, and relative-phrase anchors (today/tomorrow/yesterday equivalents). Unknown language codes fall back to English; see resolveRules for the logging policy.

func Entities ¶

func Entities(content string) []string

Entities extracts typed tag strings from content and returns a sorted, deduplicated, length-capped slice. Output shape parity with Python's extract_entities(): each element is prefix:value.

Uses English person-name rules. For localised content, callers should use EntitiesForLang so the denylist vocab swaps to the memory's language.

func EntitiesForLang ¶

func EntitiesForLang(content, lang string) []string

EntitiesForLang is the language-aware entity extractor. Only the person-name classifier is language-sensitive today -- the CamelCase, file-path, and error-type regexes are universal because their anchors (A-Z, dot segments, Error/Exception suffix) don't vary by locale.
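As a rough illustration of the output shape (prefix:value, sorted, deduplicated, capped), here is a sketch covering two of the universal categories. The regexes, the helper tagEntities, and the choice not to double-tag error types as entity: are assumptions; only the cap of 20 comes from MaxEntities.

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
)

const maxTags = 20 // mirrors MaxEntities

// Both patterns are assumptions for this sketch, not the package's regexes.
var (
	camelCase = regexp.MustCompile(`\b[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)+\b`)
	errorType = regexp.MustCompile(`\b[A-Z][A-Za-z0-9]*(?:Error|Exception)\b`)
)

// tagEntities is a hypothetical helper producing prefix:value tags.
func tagEntities(content string) []string {
	seen := map[string]struct{}{}
	for _, m := range errorType.FindAllString(content, -1) {
		seen["error:"+m] = struct{}{}
	}
	for _, m := range camelCase.FindAllString(content, -1) {
		if errorType.MatchString(m) {
			continue // sketch choice: an error type is not also an entity
		}
		seen["entity:"+m] = struct{}{}
	}
	out := make([]string, 0, len(seen))
	for tag := range seen {
		out = append(out, tag)
	}
	sort.Strings(out)
	if len(out) > maxTags {
		out = out[:maxTags] // cap applies to the final sorted result
	}
	return out
}

func main() {
	fmt.Println(tagEntities("ValueError in DataLoader; DataLoader retried, then KeyError"))
}
```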

func Importance ¶

func Importance(content string, tags []string) float64

Importance scores content on a 0.0-1.0 scale. Mirrors src/ogham/extraction.py::compute_importance:

base 0.2
+ 0.3 if content contains a DECISION_WORDS signal
+ 0.2 if ERROR_WORDS signal or an ...Error/...Exception regex match
+ 0.2 if ARCHITECTURE_WORDS signal
+ 0.1 if a file path appears
+ 0.1 if a code fence (```) or inline code (`) marker appears
+ 0.1 if len(content) > 500
+ 0.1 if len(tags) >= 3
capped at 1.0

Backward-compatible entry point. Uses English rules, matching the pre-language-plumbing behaviour to keep parity tests stable.

func ImportanceForLang ¶

func ImportanceForLang(content string, tags []string, lang string) float64

ImportanceForLang is the language-aware variant. lang is a 2-letter code ("en", "de", ...) or empty / unknown -- empty or unknown codes fall back to English and emit a single debug-level slog warning so operators notice config drift without failing requests.

The Python reference (compute_importance) unions every language's signal words into one global set; we do the same via Union mode when lang == "all" or "*". Default is per-language because the Go call site knows the memory's language from metadata.
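The difference between per-language and union mode might look like this. The map decisionWords, its contents, and the helper decisionSet are hypothetical; the sketch only shows the selection logic described above, with the English fallback for empty or unknown codes.

```go
package main

import "fmt"

// decisionWords is a placeholder for per-language signal lists.
var decisionWords = map[string][]string{
	"en": {"decided", "chose"},
	"de": {"entschieden", "beschlossen"},
}

// decisionSet is a hypothetical helper that picks the signal set for lang.
func decisionSet(lang string) map[string]struct{} {
	out := map[string]struct{}{}
	if lang == "all" || lang == "*" {
		// Union mode: every language's words in one set.
		for _, words := range decisionWords {
			for _, w := range words {
				out[w] = struct{}{}
			}
		}
		return out
	}
	words, ok := decisionWords[lang]
	if !ok {
		words = decisionWords["en"] // empty or unknown code: English fallback
	}
	for _, w := range words {
		out[w] = struct{}{}
	}
	return out
}

func main() {
	fmt.Println(len(decisionSet("de")), len(decisionSet("*")), len(decisionSet("")))
}
```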

func ListLanguages ¶

func ListLanguages() []string

ListLanguages returns the sorted list of available 2-letter codes. Useful for CLI flag validation + error messages.

func Recurrence ¶

func Recurrence(content, lang string) (pattern string, tags []string, ok bool)

Recurrence detects recurring-event signals in content and returns a normalised pattern string + list of prefixed tags suitable for the store-time metadata merge. Mirrors Python's extract_recurrence in src/ogham/extraction.py but returns a pattern+tags shape so the Go store pipeline can emit a `recurrence:<normalised>` tag without a separate transform.

Detection runs in two stages:

YAML-driven explicit phrases (recurrence_patterns block): "daily", "weekly", "biweekly", "monthly", "yearly", German equivalents ("wöchentlich", "monatlich", ...). Each pattern maps to a canonical normalised category.
every_words + day_names -- Python's original path. "every monday" / "jeden Dienstag" / adverbial "montags" all fire this branch. Multiple day hits collapse to a single "weekly" pattern + one recurrence:<dayname> tag per matched day.

Returns (pattern, tags, true) on a hit, (_, nil, false) otherwise. Tags are sorted + deduplicated; the canonical pattern is the coarse category ("daily" / "weekly" / "biweekly" / "monthly" / "quarterly" / "yearly"). Callers should:

metadata["recurrence"] = pattern
tags = append(tags, recurrenceTags...)

The pattern is safe to store as-is in JSONB.
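A sketch of the second stage (every_words + day_names) together with the caller-side merge shown above. The word lists, the helper weeklyRecurrence, and the token-adjacency rule are assumptions for illustration; the first, regex-driven stage is omitted.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Placeholder English lists; the real ones come from the language files.
var everyWords = map[string]struct{}{"every": {}, "each": {}}

var dayNames = map[string]int{
	"sunday": 0, "monday": 1, "tuesday": 2, "wednesday": 3,
	"thursday": 4, "friday": 5, "saturday": 6,
}

// weeklyRecurrence is a hypothetical helper: an every-word directly
// followed by a day name yields one recurrence:<dayname> tag.
func weeklyRecurrence(content string) (pattern string, tags []string, ok bool) {
	tokens := strings.Fields(strings.ToLower(content))
	seen := map[string]struct{}{}
	for i := 0; i+1 < len(tokens); i++ {
		if _, isEvery := everyWords[tokens[i]]; !isEvery {
			continue
		}
		day := strings.Trim(tokens[i+1], ".,;:!?")
		if _, isDay := dayNames[day]; isDay {
			seen["recurrence:"+day] = struct{}{}
		}
	}
	if len(seen) == 0 {
		return "", nil, false
	}
	for tag := range seen {
		tags = append(tags, tag)
	}
	sort.Strings(tags)
	return "weekly", tags, true // multiple days collapse to one pattern
}

func main() {
	metadata := map[string]string{}
	tags := []string{"type:meeting"}
	if pattern, recurrenceTags, ok := weeklyRecurrence("Standup every Monday and every Friday."); ok {
		metadata["recurrence"] = pattern
		tags = append(tags, recurrenceTags...)
	}
	fmt.Println(metadata, tags)
}
```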

func Set ¶

func Set(s []string) map[string]struct{}

Set returns a string set (map[string]struct{}) for O(1) membership checks. Uses the input slice as-is -- caller handles case folding if needed. Safe on a nil slice (returns an empty set).

func SetLower ¶

func SetLower(s []string) map[string]struct{}

SetLower is the common case: case-fold to lowercase + deduplicate. Use Set() when case-sensitivity matters (German noun stopwords, etc.).
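Both helpers are small enough to restate. The lowercase names set and setLower below are local re-implementations written from the descriptions above, so details such as pre-sizing the map are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// set builds a string set from s as-is; a nil slice yields an empty set.
func set(s []string) map[string]struct{} {
	out := make(map[string]struct{}, len(s))
	for _, w := range s {
		out[w] = struct{}{}
	}
	return out
}

// setLower folds each entry to lowercase before inserting, which also
// collapses entries that differ only by case.
func setLower(s []string) map[string]struct{} {
	out := make(map[string]struct{}, len(s))
	for _, w := range s {
		out[strings.ToLower(w)] = struct{}{}
	}
	return out
}

func main() {
	every := setLower([]string{"Every", "EVERY", "each"})
	_, hit := every[strings.ToLower("Each")] // fold at the lookup site too
	fmt.Println(len(every), hit)
}
```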

Types ¶

type LanguageRules ¶

type LanguageRules struct {
	// Named-day lookup: "monday" -> 1 etc. 0-indexed, Sunday=0 matching
	// Python's datetime.weekday() convention shifted by one. Preserve
	// the exact mapping or recurrence detection drifts across languages.
	DayNames map[string]int `yaml:"day_names"`

	// Keywords that signal a recurring event (every, each, weekly, ...).
	EveryWords []string `yaml:"every_words"`

	// Low-recall temporal markers -- when, date, time, ago, last, ...
	TemporalKeywords []string `yaml:"temporal_keywords"`

	// Direction markers, split by direction (after / before / around).
	// The YAML file nests: direction_words: { after: [...], before: [...] }
	DirectionWords map[string][]string `yaml:"direction_words"`

	// Scoring signal classes. importance += 0.3 if any decision word
	// matches, +0.2 for error, +0.2 for architecture, etc.
	DecisionWords            []string `yaml:"decision_words"`
	ErrorWords               []string `yaml:"error_words"`
	ArchitectureWords        []string `yaml:"architecture_words"`
	EventWords               []string `yaml:"event_words"`
	ActivityWords            []string `yaml:"activity_words"`
	EmotionWords             []string `yaml:"emotion_words"`
	RelationshipWords        []string `yaml:"relationship_words"`
	PossessiveTriggers       []string `yaml:"possessive_triggers"`
	QuantityUnits            []string `yaml:"quantity_units"`
	PreferenceWords          []string `yaml:"preference_words"`
	NegationMarkers          []string `yaml:"negation_markers"`
	CompressionDecisionWords []string `yaml:"compression_decision_words"`

	// Month name -> month number (1-12). Parse-assist for "15 March" etc.
	MonthNames map[string]int `yaml:"month_names"`

	// Numeric spelling: "one" -> 1, "fifty" -> 50. Needed for
	// "in two weeks" style relative-date parsing.
	WordNumbers map[string]int `yaml:"word_numbers"`

	// QueryFiller holds low-information words we strip from queries
	// before hitting the FTS index ("how do I X?" -> "X").
	QueryFiller []string `yaml:"query_filler"`

	// QueryHints is a nested map: { multi_hop: [...], ordering: [...],
	// summary: [...] }. Each bucket signals a different intent-detection
	// gate; the keys are stable across languages but the values are
	// localised.
	QueryHints map[string][]string `yaml:"query_hints"`

	// --- Date anchors + modifiers (v0.7) --------------------------------
	// Today / tomorrow / yesterday equivalents. Single-token phrases --
	// multi-word anchors like "the day after tomorrow" are out of scope
	// (handled by parsedatetime on the Python side, not ported).
	TodayWords     []string `yaml:"today_words"`
	TomorrowWords  []string `yaml:"tomorrow_words"`
	YesterdayWords []string `yaml:"yesterday_words"`

	// Modifiers for "last/next/this <weekday|period>". English has one
	// word per bucket; German has several (letzten/letzte/letzter).
	ModifierLast []string `yaml:"modifier_last"`
	ModifierNext []string `yaml:"modifier_next"`
	ModifierThis []string `yaml:"modifier_this"`

	// Periods: "week" / "month" / "year" equivalents.
	PeriodWeek  []string `yaml:"period_week"`
	PeriodMonth []string `yaml:"period_month"`
	PeriodYear  []string `yaml:"period_year"`

	// Units for "N <unit> ago" / "in N <unit>". English includes both
	// singular + plural forms ("day", "days"); German inflects
	// differently so YAML spells out every surface form.
	UnitDay   []string `yaml:"unit_day"`
	UnitWeek  []string `yaml:"unit_week"`
	UnitMonth []string `yaml:"unit_month"`
	UnitYear  []string `yaml:"unit_year"`

	// Direction markers: "ago" = past, "in" = future.
	AgoMarkers []string `yaml:"ago_markers"`
	InMarkers  []string `yaml:"in_markers"`

	// --- Recurrence (v0.7) ---------------------------------------------
	// Additional recurrence anchor phrases beyond EveryWords+DayNames.
	// Each entry: { pattern: regex, normalised: weekly|daily|... }.
	// Regex is anchored at word boundaries by the recurrence matcher.
	RecurrencePatterns []RecurrencePattern `yaml:"recurrence_patterns"`

	// --- Person-name regex tightening (v0.7) ---------------------------
	// PersonNameDenylist is the set of Capitalised bigrams / unigrams
	// that SHOULD NOT be tagged as person: even if they pass the basic
	// shape check. Case-insensitive match at lookup site.
	PersonNameDenylist []string `yaml:"person_name_denylist"`

	// PersonNameContextWords is the set of preceding-context tokens that
	// license a standalone Capitalised bigram (e.g. "by Kevin Burns",
	// "from John Doe"). Without a matching context word in the 3-token
	// window before the candidate, the candidate is rejected. English-
	// seed: by/from/user/person/with/met/said/told/asked.
	PersonNameContextWords []string `yaml:"person_name_context_words"`
}

LanguageRules holds every word list + keyword map a language file ships. The Python YAML keys map 1:1 onto field tags so parity stays honest: when the Python YAML adds a new section, the Go struct gets a compile error until we add the field.

Words are NOT lowercased at load time on purpose -- some languages have case-significant stopwords (German nouns, for example). Callers that want case-insensitive lookup should fold to lowercase at the lookup site. Use the Set() helper on each list to get a map[string]struct{} for O(1) lookups.
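The PersonNameDenylist and PersonNameContextWords comments in the struct describe a two-step gate on capitalised bigrams. A simplified sketch, assuming whitespace tokenisation and placeholder word lists (the helper names and both lists are hypothetical):

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// Placeholder lists; the real ones come from the language files.
var (
	contextWords = map[string]struct{}{"by": {}, "from": {}, "with": {}, "met": {}}
	denylist     = map[string]struct{}{"new york": {}}
)

// isCapitalised reports whether tok looks like "Kevin": one upper-case
// rune followed by at least one lower-case rune and nothing else.
func isCapitalised(tok string) bool {
	r := []rune(tok)
	if len(r) < 2 || !unicode.IsUpper(r[0]) {
		return false
	}
	for _, c := range r[1:] {
		if !unicode.IsLower(c) {
			return false
		}
	}
	return true
}

// personNames is a hypothetical helper applying the denylist and the
// 3-token context window to each capitalised bigram.
func personNames(content string) []string {
	tokens := strings.Fields(content)
	var out []string
	for i := 0; i+1 < len(tokens); i++ {
		first := tokens[i]
		second := strings.TrimRight(tokens[i+1], ".,;:!?")
		if !isCapitalised(first) || !isCapitalised(second) {
			continue
		}
		name := first + " " + second
		if _, denied := denylist[strings.ToLower(name)]; denied {
			continue // case-insensitive denylist lookup
		}
		for j := i - 1; j >= 0 && j >= i-3; j-- {
			if _, ok := contextWords[strings.ToLower(tokens[j])]; ok {
				out = append(out, "person:"+name)
				break
			}
		}
	}
	return out
}

func main() {
	fmt.Println(personNames("Reviewed by Kevin Burns."))
	fmt.Println(personNames("Flew from New York"))
}
```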

func LoadLanguage ¶

func LoadLanguage(code string) (*LanguageRules, error)

LoadLanguage returns the parsed rules for a 2-letter code (or "pt-br" for regional variants). Returns a nil-safe result + ErrLanguageNotFound if the code isn't in the embedded set -- callers can fall back to English rather than crashing.

The registry is built once per process and memoised. Subsequent calls are a map lookup; cheap enough to call on every Importance() invocation.

type RecurrencePattern ¶

type RecurrencePattern struct {
	Pattern    string `yaml:"pattern"`
	Normalised string `yaml:"normalised"`
}

RecurrencePattern is one entry in a language's recurrence_patterns block. The regex is compiled once per pattern and held in a package-level cache. Normalised is the canonical form, exposed as metadata.recurrence and as a recurrence:<normalised> tag.
