punkt

package
v0.23.0
Published: May 21, 2026 License: MIT Imports: 8 Imported by: 0

Documentation

Overview

Package punkt is a forked, allocation-clean subset of the trained Punkt sentence tokenizer from neurosnap/sentences v1.1.2 (https://github.com/neurosnap/sentences). It vendors only what MDS024 needs: Storage, Token, WordTokenizer, TokenGrouper, OrthoContext, DefaultSentenceTokenizer, and the English supervised abbreviations. The non-English language data and IsNonPunct (no call site in upstream's English pipeline, per plan 187) are not vendored.

CJK terminal punctuation is supported at the same level as in upstream's English pipeline: the full-width marks `。 ; ! ?` are word boundaries (`IsCjkPunct`, used inside `TokenizeInto`), and the full-width period `。` flags a sentence break via `HasPeriodFinal` the same way ASCII `.` does. Full-width `!` and `?` do NOT flag sentence breaks on their own: the English pipeline's `HasSentEndChars` set covers only ASCII `!`/`?` plus their quote/paren variants, matching upstream. To get one Sentence per `。`, author CJK paragraphs with `。` between sentences.

The fork is segmentation-equivalent to upstream over the equivalence corpus in internal/mdtext/sentence_equivalence_test.go — that is the gate any drift fails on. The differences from upstream are all allocation-driven:

  • Token carries no regex pointers; the upstream per-token allocation of six *regexp.Regexp is gone. Type-classification regexes (reInitial, reAlpha, reEllipsis, reListNumber, reCoordinateSecondPart, reNumeric) are replaced with byte scanners at package scope.
  • WordTokenizer.Type is a one-pass byte scan into a reusable buffer instead of `reNumeric.ReplaceAllString` + `strings.ToLower` + `strings.Replace`.
  • Collocation lookups rebuild the upstream `typ + "," + nextTyp` key into a reusable byte buffer and hit `Collocations[string(buf)]` — relying on the compiler's `m[string(b)]` elision so the lookup itself does not allocate, instead of `strings.Join` followed by a SetString lookup.
  • TokenGrouper reuses a buffer across passes; one allocation per Tokenize call instead of three.
  • TypeBasedAnnotation's hyphenation check is a strings.LastIndexByte scan over the token (avoiding the `[]byte(tok)` conversion bytes.LastIndexByte would force), replacing strings.Split.
  • The hot per-call buffers (tokens, ptrs, pairs, type-builder bytes) come from a sync.Pool so a sequence of Tokenize calls amortizes their allocations to ~0.
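The pooling pattern in the last bullet can be sketched as follows. This is a minimal illustration, not the fork's code: `callState`, `statePool`, and `countWords` are hypothetical names, and the real per-call state holds more buffers than shown here.

```go
package main

import (
	"fmt"
	"sync"
)

// callState is a hypothetical bundle of per-call scratch buffers.
type callState struct {
	words []string
}

// statePool hands out *callState values; storing pointers avoids an
// extra allocation when a value is put back.
var statePool = sync.Pool{
	New: func() any { return &callState{} },
}

// countWords borrows a callState, splits text on ASCII spaces into the
// borrowed buffer, and returns the state to the pool before returning.
func countWords(text string) int {
	st := statePool.Get().(*callState)
	defer statePool.Put(st)

	st.words = st.words[:0] // keep capacity, drop contents from the last call
	start := -1
	for i := 0; i < len(text); i++ {
		if text[i] == ' ' {
			if start >= 0 {
				st.words = append(st.words, text[start:i])
				start = -1
			}
		} else if start < 0 {
			start = i
		}
	}
	if start >= 0 {
		st.words = append(st.words, text[start:])
	}
	return len(st.words)
}

func main() {
	fmt.Println(countWords("one two three")) // prints 3
}
```

Because each call takes its own state from the pool, concurrent callers never share a buffer, and a buffer that has already grown is reused by later calls instead of being reallocated.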

Plan 193 records the rationale and the per-call allocation budget.

Upstream commit: github.com/neurosnap/sentences@v1.1.2 (https://github.com/neurosnap/sentences/tree/v1.1.2). License: MIT — see the UPSTREAM_LICENSE file in this package for the verbatim upstream copyright and permissions notice. (The file has no extension so mdsmith's content rules do not lint the verbatim license text.)

Multilingual loaders and IsNonPunct are not vendored. The English pipeline never calls IsNonPunct (plan 187 records the negative) and the non-English language data is unrelated to mdsmith's English Markdown corpus. CJK terminal punctuation IS supported in the word tokenizer so non-ASCII paragraphs flowing through MDS024 segment the same way upstream does — exercised by the equivalence harness's CJK paragraphs.

Constants

This section is empty.

Variables

This section is empty.

Functions

func HasPeriodFinal

func HasPeriodFinal(tok string) bool

HasPeriodFinal reports whether tok ends with an ASCII period or the CJK full-width period `。`. Mirrors upstream's DefaultWordTokenizer.HasPeriodFinal byte-for-byte (the period classification governs whether the multi-punct annotator considers a token at all, so the CJK acceptance must survive).
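A minimal sketch of the check described above, assuming it reduces to two suffix tests. `hasPeriodFinalSketch` is a hypothetical stand-in written for illustration; it is not the package's implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// hasPeriodFinalSketch reports whether tok ends with an ASCII period
// or with the CJK full-width period U+3002.
func hasPeriodFinalSketch(tok string) bool {
	return strings.HasSuffix(tok, ".") || strings.HasSuffix(tok, "。")
}

func main() {
	fmt.Println(hasPeriodFinalSketch("etc."))   // true
	fmt.Println(hasPeriodFinalSketch("中文。")) // true
	fmt.Println(hasPeriodFinalSketch("why?"))   // false
}
```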

func IsCjkPunct

func IsCjkPunct(r rune) bool

IsCjkPunct reports whether r is one of the four CJK terminal punctuation marks upstream's tokenizer treats as a word boundary: full-width period, semicolon, exclamation, question. Mirrors upstream sentences.IsCjkPunct verbatim. Used inside TokenizeInto so a string like "中文。English" splits the same way upstream does (`中文。` becomes one token, then `English`).

func MatchAbbrPattern

func MatchAbbrPattern(tok string) bool

MatchAbbrPattern reports whether the upstream regex `((?:[\w]\.)+[\w]*\.)` would find at least one match anywhere in tok. It is the boolean form of `len(reAbbr.FindAllString(tok, 1)) > 0` from `github.com/neurosnap/sentences/english/main.go:15`, with the `regexp` engine's backtracking removed.

The pattern in plain English: at least one `\w\.` pair, optionally followed by more word characters, ending with `\.`. Concretely, any matching substring has the form `\w \. (\w|\.)* \.` where the run of `\w`-or-`\.` between the first and final period may be empty.

Plan 191 introduced this DFA in internal/mdtext/abbr.go; it is promoted here unchanged so the fork pipeline can reach it without importing back into mdtext. The byte-equivalence guarantee against the regex is gated by token_test.go's TestMatchAbbrPattern_* (mirrored from mdtext) and ultimately by the sentence-equivalence harness.
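The idea of replacing the regex with a byte scanner can be sketched as below. This is an illustrative reconstruction, not the package's DFA: `matchAbbrSketch` and `isWordByte` are hypothetical names. It relies on the observation that any match of the pattern contains a substring of the form `\w \. \w* \.`, so a three-state scan suffices. The regex is kept only as a reference to compare against.

```go
package main

import (
	"fmt"
	"regexp"
)

// reAbbr is the upstream pattern quoted above, used here as an oracle.
var reAbbr = regexp.MustCompile(`((?:[\w]\.)+[\w]*\.)`)

// isWordByte reports whether c is in the ASCII-only \w class of Go's regexp.
func isWordByte(c byte) bool {
	return c == '_' ||
		(c >= '0' && c <= '9') ||
		(c >= 'a' && c <= 'z') ||
		(c >= 'A' && c <= 'Z')
}

// matchAbbrSketch reports whether tok contains `\w \. \w* \.` anywhere.
func matchAbbrSketch(tok string) bool {
	const (
		idle = iota // no usable prefix seen
		word        // previous byte was a word byte
		open        // seen `\w \. \w*`, waiting for the closing period
	)
	state := idle
	for i := 0; i < len(tok); i++ {
		c := tok[i]
		switch {
		case isWordByte(c):
			if state == idle {
				state = word
			}
			// word stays word; open stays open (the `\w*` run).
		case c == '.':
			switch state {
			case open:
				return true
			case word:
				state = open
			default:
				state = idle
			}
		default:
			state = idle
		}
	}
	return false
}

func main() {
	for _, tok := range []string{"U.S.", "e.g.", "Mr.", "a..", "3.14", "中文。"} {
		fmt.Printf("%q scanner=%v regex=%v\n", tok, matchAbbrSketch(tok), reAbbr.MatchString(tok))
	}
}
```

The scanner visits each byte once and never backtracks, which is the property the paragraph above attributes to the promoted DFA.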

Types

type OrthoContext

type OrthoContext struct {
	Storage *Storage
}

OrthoContext implements the orthographic-evidence heuristic (section 4.1.1 of the Punkt paper). The reformulation keeps the same three return values as upstream: 1 (sentence starter), 0 (not a sentence starter), -1 (unknown).

The struct holds its Storage as a direct field, so heuristic lookups are plain field accesses (Storage.OrthoContext, SentStarters, etc.) rather than calls through the embedded interfaces upstream uses. The caller passes a typeBuf, which the heuristic resets to length 0 on entry and uses to compute TypeNoSentPeriod without allocating.

type Sentence

type Sentence struct {
	Start int
	End   int
	Text  string
}

Sentence carries the [start, end) byte range of a sentence inside the original text. Tokenize fills in the Text field so the caller need not slice the source itself; Text is a substring of the original string, so no bytes are copied.

type SetString

type SetString map[string]int

SetString is a string-keyed set matching upstream's JSON shape: values are int (always 1 for set membership) so existing training JSON loads without translation. Lookups go through Has, which returns true for any non-zero value.

func (SetString) Add

func (ss SetString) Add(str string)

Add marks str as present in the set.

func (SetString) Has

func (ss SetString) Has(str string) bool

Has reports whether str is present in the set.
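The set semantics described above can be illustrated with a small stand-in type. `setString` below is a hypothetical local copy written for this sketch, assuming Add stores 1 and Has tests for a non-zero value as the documentation states.

```go
package main

import "fmt"

// setString mimics the documented shape: a map whose int values act as
// membership flags.
type setString map[string]int

// Add marks str as present by storing 1.
func (ss setString) Add(str string) { ss[str] = 1 }

// Has treats any non-zero value as membership; a missing key reads as 0.
func (ss setString) Has(str string) bool { return ss[str] != 0 }

func main() {
	ss := setString{"dr": 1, "weighted": 7, "zeroed": 0}
	ss.Add("sgt")
	fmt.Println(ss.Has("dr"), ss.Has("weighted"), ss.Has("sgt")) // true true true
	fmt.Println(ss.Has("zeroed"), ss.Has("missing"))             // false false
}
```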

type Storage

type Storage struct {
	AbbrevTypes  SetString `json:"AbbrevTypes"`
	Collocations SetString `json:"Collocations"`
	SentStarters SetString `json:"SentStarters"`
	OrthoContext SetString `json:"OrthoContext"`
}

Storage holds the trained Punkt model. The JSON-loaded fields (AbbrevTypes, Collocations, SentStarters, OrthoContext) mirror the upstream shape so existing training assets (data/english.json from neurosnap/sentences/data) deserialize unchanged.

Collocations is keyed by the upstream `typ + "," + nextTyp` string. The runtime path in tokenAnnotation reproduces the key into a pooled byte buffer and looks it up with `Collocations[string(buf)]` — relying on the compiler's `m[string(b)]` elision so the lookup itself does not allocate. An earlier draft of plan 193 carried a derived `map[[2]string]` index, but the elision path is allocation-equivalent and keeps Storage one map smaller.
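A minimal sketch of the lookup described above. `lookupPair` is a hypothetical helper, and the sample key is invented for illustration. The point is the shape of the index expression: the `string(buf)` conversion appears directly inside `m[...]`, which is the form the Go compiler can evaluate without copying the bytes into a new string.

```go
package main

import "fmt"

// lookupPair rebuilds the `typ + "," + nextTyp` key in buf and reports
// whether it is present. It returns buf so the caller can reuse any
// capacity gained by append.
func lookupPair(collocations map[string]int, buf []byte, typ, nextTyp string) ([]byte, bool) {
	buf = buf[:0]
	buf = append(buf, typ...)
	buf = append(buf, ',')
	buf = append(buf, nextTyp...)
	// Conversion used only as a map index: no temporary string is retained.
	return buf, collocations[string(buf)] != 0
}

func main() {
	collocations := map[string]int{"##number##,p": 1} // invented sample entry
	buf := make([]byte, 0, 64)

	var ok bool
	buf, ok = lookupPair(collocations, buf, "##number##", "p")
	fmt.Println(ok) // true
	_, ok = lookupPair(collocations, buf, "mr", "smith")
	fmt.Println(ok) // false
}
```

Storing the result of `string(buf)` in a variable first would be a different expression and would generally force a copy, so the conversion is kept inline.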

func LoadTraining

func LoadTraining(data []byte) (*Storage, error)

LoadTraining parses the JSON training data shipped with neurosnap/sentences and returns the corresponding Storage. For empty or malformed input it returns the error from json.Unmarshal.

func NewStorage

func NewStorage() *Storage

NewStorage returns an empty Storage with all maps initialized. Used in tests; LoadTraining is the production constructor.

func (*Storage) IsAbbr

func (s *Storage) IsAbbr(tokens ...string) bool

IsAbbr reports whether any of tokens is a known abbreviation type. Mirrors upstream Storage.IsAbbr.
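The loading and lookup path can be sketched end to end as below. The struct fields and JSON tags are copied from the Storage definition above; everything else (`storage`, `loadTrainingSketch`, `isAbbr`, and the tiny JSON document) is a hypothetical stand-in, not the package's code or its real training data.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type setString map[string]int

// storage mirrors the documented field names and JSON tags.
type storage struct {
	AbbrevTypes  setString `json:"AbbrevTypes"`
	Collocations setString `json:"Collocations"`
	SentStarters setString `json:"SentStarters"`
	OrthoContext setString `json:"OrthoContext"`
}

// loadTrainingSketch decodes data and passes any json.Unmarshal error through.
func loadTrainingSketch(data []byte) (*storage, error) {
	s := &storage{}
	if err := json.Unmarshal(data, s); err != nil {
		return nil, err
	}
	return s, nil
}

// isAbbr reports whether any token is a known abbreviation type.
// Reading from a nil map is safe in Go and yields the zero value.
func (s *storage) isAbbr(tokens ...string) bool {
	for _, tok := range tokens {
		if s.AbbrevTypes[tok] != 0 {
			return true
		}
	}
	return false
}

func main() {
	s, err := loadTrainingSketch([]byte(`{"AbbrevTypes": {"dr": 1, "etc": 1}}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(s.isAbbr("hello", "dr")) // true
	fmt.Println(s.isAbbr("hello"))       // false

	_, err = loadTrainingSketch(nil)
	fmt.Println(err != nil) // true: empty input is a decode error
}
```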

type Token

type Token struct {
	Tok       string
	Position  int
	SentBreak bool
	ParaStart bool
	LineStart bool
	Abbr      bool
}

Token is a tokenized word annotated by the Punkt pipeline. The fork drops the six *regexp.Regexp pointers upstream Token carries — every per-token regex match is replaced by a byte scanner in word_tokenizer.go, so the struct shrinks to just its flag and metadata fields. Pooled allocation lives on the tokenizer's per-call state buffers, not here.

func Tokenize

func Tokenize(text string, onlyPeriodContext bool) []Token

Tokenize splits text into Tokens annotated with line- and paragraph-start markers. Mirrors upstream DefaultWordTokenizer.Tokenize byte-for-byte (including the CJK punctuation branch), except that no regex pointers are allocated per token, since Token has no such fields.

The tokens slice is allocated fresh on every call. Callers that want to pool the result should pass a preallocated slice to TokenizeInto instead.

func TokenizeInto

func TokenizeInto(dst []Token, text string, onlyPeriodContext bool) []Token

TokenizeInto appends tokens parsed from text to dst and returns the extended slice. Used by the pooled-state path on DefaultSentenceTokenizer so the same slice is reused across calls.

Behaviour matches upstream DefaultWordTokenizer.Tokenize: split on unicode.IsSpace OR IsCjkPunct (so `中文。English` produces `中文。` followed by `English`), mark paragraph starts after a blank line, mark line starts after a single newline. With onlyPeriodContext = false the tokenizer emits every word; with true it emits only words near a sentence-ending punctuation character. If this call emits zero new tokens (whitespace-only text, or onlyPeriodContext filtering every word), the upstream fallback fires and a single token equal to the whole input is appended — checked against the call's original len(dst), not zero, so the fallback also fires when the caller passes an already-populated dst.
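The splitting and fallback rules above can be sketched in a simplified form. This is an assumption-laden illustration: `splitInto` and `isCjkPunct` are hypothetical stand-ins that work on plain strings, omit the paragraph and line flags, and ignore onlyPeriodContext. The choice to keep a CJK mark attached to the word it closes follows the `中文。English` example in the text.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// isCjkPunct covers the four full-width marks named in the documentation.
func isCjkPunct(r rune) bool {
	switch r {
	case '。', ';', '!', '?':
		return true
	}
	return false
}

// splitInto appends the words of text to dst and returns the extended slice.
func splitInto(dst []string, text string) []string {
	base := len(dst) // fallback is judged against this, not against zero
	start := -1
	for i, r := range text {
		switch {
		case unicode.IsSpace(r):
			if start >= 0 {
				dst = append(dst, text[start:i])
				start = -1
			}
		case isCjkPunct(r):
			if start < 0 {
				start = i
			}
			// The mark ends the current word and stays part of it.
			dst = append(dst, text[start:i+utf8.RuneLen(r)])
			start = -1
		default:
			if start < 0 {
				start = i
			}
		}
	}
	if start >= 0 {
		dst = append(dst, text[start:])
	}
	if len(dst) == base {
		// Nothing new was emitted: append the whole input as one token.
		dst = append(dst, text)
	}
	return dst
}

func main() {
	fmt.Printf("%q\n", splitInto(nil, "中文。English"))
	fmt.Printf("%q\n", splitInto([]string{"kept"}, " \n "))
}
```

Comparing against `base` rather than zero is what lets the fallback fire even when the caller passes a dst that already holds tokens from an earlier call.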

func (*Token) String

func (t *Token) String() string

String is the fmt.Stringer impl, retained for log/debug inspection and parity with upstream.

type Tokenizer

type Tokenizer struct {
	Storage *Storage
	// contains filtered or unexported fields
}

Tokenizer is the public entry point of the fork. Goroutine-safe: the per-call state lives in a sync.Pool, so concurrent Tokenize calls each get an independent buffer set.

func New

func New(s *Storage) *Tokenizer

New constructs a Tokenizer over an arbitrary Storage. Used by tests that need a hermetic, synthetic Storage; production code goes through NewEnglish.

func NewEnglish

func NewEnglish() *Tokenizer

NewEnglish constructs the fork's equivalent of upstream english.NewSentenceTokenizer(nil): it loads the bundled English training data, applies the same three supervised abbreviations (sgt, gov, no), and assembles the three annotators: TypeBased, TokenBased, and MultiPunct.

func (*Tokenizer) Tokenize

func (t *Tokenizer) Tokenize(text string) []Sentence

Tokenize splits text into Sentences. Mirrors upstream DefaultSentenceTokenizer.Tokenize but reuses pooled state so the per-call allocation count drops to a handful (the result slice itself, plus any buffer growth on a particularly long input).

The Text field of each Sentence points into text via a substring (no copy). Callers that retain Sentences past the lifetime of text must copy Text first.
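One way to copy Text before retaining results is sketched below. `sentence` is a local stand-in with the documented fields, and `detach` is a hypothetical helper. When the copy actually matters is an assumption here: for example, when text aliases a buffer that will be reused, or when holding a short sentence would otherwise keep a large source string alive.

```go
package main

import (
	"fmt"
	"strings"
)

// sentence mirrors the documented Sentence fields.
type sentence struct {
	Start int
	End   int
	Text  string
}

// detach returns copies of in whose Text fields own their bytes and no
// longer share memory with the source string.
func detach(in []sentence) []sentence {
	out := make([]sentence, len(in))
	for i, s := range in {
		s.Text = strings.Clone(s.Text)
		out[i] = s
	}
	return out
}

func main() {
	text := "One. Two."
	shared := []sentence{
		{Start: 0, End: 4, Text: text[0:4]},
		{Start: 5, End: 9, Text: text[5:9]},
	}
	owned := detach(shared)
	fmt.Println(owned[0].Text, owned[1].Text)
}
```

Start and End still index into the original text, so they remain meaningful only alongside that text.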
