Documentation ¶
Index ¶
- func CountCharacters(text string) int
- func CountSentences(text string) int
- func CountWords(text string) int
- func ExtractPlainText(node ast.Node, source []byte) string
- func IsSpace(r rune) bool
- func NonNegativeUTF16RuneLen(r rune) int
- func Slugify(s string) string
- func SplitSentences(text string) []string
- func SplitSentencesInto(dst []string, text string) []string
- func UTF16FromByteOffset(line []byte, byteOff int) int
- func UTF16ToByteOffset(line []byte, target int) int
- type TOCItem
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func CountCharacters ¶
CountCharacters counts letters and digits in text (no spaces or punctuation).
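A minimal sketch of the counting rule described above. The name countLettersAndDigits is hypothetical, and using unicode.IsLetter and unicode.IsDigit as the definition of "letters and digits" is an assumption; the package may classify runes differently.

```go
package main

import "unicode"

// countLettersAndDigits is a hypothetical stand-in for CountCharacters.
// It counts runes that are letters or digits and skips spaces,
// punctuation, and every other rune.
func countLettersAndDigits(text string) int {
	n := 0
	for _, r := range text {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			n++
		}
	}
	return n
}
```

Under these assumptions, countLettersAndDigits("Hi, there 42!") yields 9: the comma, spaces, and exclamation mark are skipped.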
func CountSentences ¶
CountSentences counts sentences by splitting on sentence-ending punctuation (., !, ?) followed by whitespace or end of text. Returns at least 1 for non-empty text.
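One possible reading of this rule, sketched with a hypothetical countBySplitting function. Treating trailing text without closing punctuation as its own sentence, returning 0 for empty text, and returning 1 for whitespace-only text are all assumptions; the description above does not settle them.

```go
package main

import "unicode"

// countBySplitting is a hypothetical illustration, not the package's code.
// A sentence ends at '.', '!' or '?' when the next rune is whitespace or
// the text ends. Segments that contain only whitespace are not counted.
func countBySplitting(text string) int {
	runes := []rune(text)
	count := 0
	inSegment := false // true once the current segment holds a non-space rune
	for i, r := range runes {
		if !unicode.IsSpace(r) {
			inSegment = true
		}
		isEnd := r == '.' || r == '!' || r == '?'
		atBoundary := i+1 == len(runes) || unicode.IsSpace(runes[i+1])
		if isEnd && atBoundary && inSegment {
			count++
			inSegment = false
		}
	}
	if inSegment {
		count++ // trailing text without closing punctuation (assumption)
	}
	if count == 0 && len(runes) > 0 {
		return 1 // non-empty text always counts as at least one sentence
	}
	return count
}
```

Because a period only ends a sentence when whitespace or the end of text follows it, "3.14 is pi." counts as one sentence in this sketch.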
func CountWords ¶
CountWords counts whitespace-delimited words in text. The result is exactly len(strings.Fields(text)): a word is a maximal run of non-space runes, where space is defined by IsSpace (which agrees exactly with unicode.IsSpace). Unlike strings.Fields, it counts in a single rune scan and does not allocate the []string. CountWords is called per sentence, per paragraph, and per file; the slices that strings.Fields built only to be discarded amounted to ~0.48 GB over the 600-file check gate (plan 175 profiling).
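The single-scan idea can be sketched as follows. The function name countWordsSingleScan is hypothetical, and the sketch calls unicode.IsSpace directly rather than the package's IsSpace.

```go
package main

import "unicode"

// countWordsSingleScan is a hypothetical illustration of counting words
// without building a []string. It increments the count each time the scan
// moves from a space (or the start of text) into a non-space rune.
func countWordsSingleScan(text string) int {
	n := 0
	inWord := false
	for _, r := range text {
		if unicode.IsSpace(r) {
			inWord = false
		} else if !inWord {
			inWord = true
			n++
		}
	}
	return n
}
```

Each word start is counted once, so the total matches the number of fields strings.Fields would return for the same input, with no intermediate slice.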
func ExtractPlainText ¶
ExtractPlainText extracts readable text from a goldmark AST node, stripping markdown syntax. Keeps: text content, link display text, emphasis inner text, image alt text, code span text.
func IsSpace ¶ added in v0.21.0
IsSpace reports whether r is a Unicode space. It returns exactly what unicode.IsSpace returns but adds an inlinable ASCII fast path: for r < utf8.RuneSelf the only spaces are ' ' and '\t'..'\r', so two integer comparisons decide the result and only genuine non-ASCII runes pay for unicode.IsSpace's table lookup. IsSpace is called per rune of every word of every file on the check hot path, where unicode.IsSpace alone was ~5.5% of CPU (plan 175 profiling).
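A minimal sketch of the fast path described above. The name isSpaceFast is hypothetical, and the exact form of the comparisons in the real function may differ.

```go
package main

import (
	"unicode"
	"unicode/utf8"
)

// isSpaceFast is a hypothetical illustration of an ASCII fast path.
// Below utf8.RuneSelf (0x80) the only Unicode spaces are ' ' and the
// range '\t'..'\r', so no call into the unicode package is needed.
func isSpaceFast(r rune) bool {
	if r < utf8.RuneSelf {
		return r == ' ' || (r >= '\t' && r <= '\r')
	}
	// Non-ASCII runes (for example U+0085, U+00A0, U+2003) fall back to
	// the standard library.
	return unicode.IsSpace(r)
}
```

Keeping the ASCII branch to plain comparisons is what makes a function of this shape a candidate for inlining at its call sites.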
func NonNegativeUTF16RuneLen ¶ added in v0.23.0
NonNegativeUTF16RuneLen wraps utf16.RuneLen so its negative "invalid code point" return cannot decrement a caller's running UTF-16 unit total. utf8.DecodeRune already maps invalid bytes to RuneError (U+FFFD, width 1), so in practice utf16.RuneLen never returns a negative for runes decoded from real input; the guard is defensive against a future Go change that weakens that invariant. A negative width means the rune is outside [0, MaxRune] or is a surrogate, both of which take one UTF-16 unit when serialized as RuneError.
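The guard can be illustrated with a hand-rolled width function. This sketch does not call utf16.RuneLen, so it is an approximation of the wrapper's behavior and not its implementation; the name nonNegativeWidth is hypothetical, and returning 1 for invalid runes is an assumption drawn from the last sentence above.

```go
package main

import (
	"unicode"
	"unicode/utf16"
)

// nonNegativeWidth is a hypothetical illustration of a UTF-16 width
// function that never returns a negative value.
func nonNegativeWidth(r rune) int {
	switch {
	case r < 0 || r > unicode.MaxRune || utf16.IsSurrogate(r):
		// Invalid code point: assumed to be serialized as U+FFFD,
		// which occupies one UTF-16 unit.
		return 1
	case r >= 0x10000:
		// Supplementary-plane runes need a surrogate pair.
		return 2
	default:
		return 1
	}
}
```

With a width that is always 1 or 2, a caller's running UTF-16 total can only grow as it walks a line.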
func Slugify ¶ added in v0.6.0
Slugify converts heading text to a GitHub-compatible URL anchor slug. It lowercases the text, preserves letters and digits, and turns spaces and hyphens into a single dash.
func SplitSentences ¶
SplitSentences splits text into individual sentences using a Punkt sentence tokenizer. Handles abbreviations, decimals, and ellipses. The actual segmentation is delegated to splitSentences (defined by the active build tag).
The returned slice is freshly allocated. Hot callers that want to pool the destination should use SplitSentencesInto instead.
func SplitSentencesInto ¶ added in v0.23.0
SplitSentencesInto is the pool-friendly variant of SplitSentences: it appends the segmented sentences (trimmed, non-empty) to dst and returns the extended slice. The intended pattern is
    bufPtr := sentBufPool.Get().(*[]string)
    *bufPtr = mdtext.SplitSentencesInto((*bufPtr)[:0], text)
    defer sentBufPool.Put(bufPtr)
so the per-call `make([]string, 0, n)` that plain SplitSentences pays is amortized across a sync.Pool. MDS024's hot path uses this form to stay within the per-rule allocation budget on cold-File runs.
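The pattern above can be fleshed out into a compilable sketch. Everything here except sync.Pool is hypothetical: splitInto stands in for mdtext.SplitSentencesInto and simply splits on newlines instead of running a sentence tokenizer, and the pool's initial capacity of 16 is an arbitrary choice.

```go
package main

import (
	"strings"
	"sync"
)

// sentBufPool holds reusable sentence buffers. Storing a pointer to the
// slice avoids allocating a new slice header on every Put.
var sentBufPool = sync.Pool{
	New: func() any {
		s := make([]string, 0, 16)
		return &s
	},
}

// splitInto is a hypothetical stand-in for SplitSentencesInto: it appends
// trimmed, non-empty lines of text to dst and returns the extended slice.
func splitInto(dst []string, text string) []string {
	for _, part := range strings.Split(text, "\n") {
		if s := strings.TrimSpace(part); s != "" {
			dst = append(dst, s)
		}
	}
	return dst
}

// longestSentenceLen borrows a buffer, fills it, and returns it to the
// pool when done. Only an int escapes, so the buffer is safe to reuse.
func longestSentenceLen(text string) int {
	bufPtr := sentBufPool.Get().(*[]string)
	defer sentBufPool.Put(bufPtr)
	*bufPtr = splitInto((*bufPtr)[:0], text)
	longest := 0
	for _, s := range *bufPtr {
		if len(s) > longest {
			longest = len(s)
		}
	}
	return longest
}
```

Writing the result back through bufPtr matters: if append grows the backing array, the pool receives the larger array on Put. The caller must not retain the slice after the function returns.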
func UTF16FromByteOffset ¶ added in v0.23.0
UTF16FromByteOffset returns the UTF-16 code-unit offset that corresponds to UTF-8 byte offset byteOff within line. The result is clamped to [0, total UTF-16 length of line] so callers cannot receive a negative or past-end position even when given a malformed byte column.
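A sketch of the conversion and clamping described above. The names utf16FromByteOffsetSketch and utf16Units are hypothetical, and counting a rune in full when byteOff lands in the middle of it is an assumption the description does not confirm.

```go
package main

import "unicode/utf8"

// utf16Units returns how many UTF-16 code units a decoded rune occupies.
// utf8.DecodeRune yields U+FFFD for invalid bytes, which takes one unit.
func utf16Units(r rune) int {
	if r >= 0x10000 {
		return 2
	}
	return 1
}

// utf16FromByteOffsetSketch is a hypothetical illustration: it walks the
// runes that start before byteOff and sums their UTF-16 widths.
func utf16FromByteOffsetSketch(line []byte, byteOff int) int {
	if byteOff <= 0 {
		return 0 // clamp negative offsets to the start of the line
	}
	if byteOff > len(line) {
		byteOff = len(line) // clamp past-end offsets to the line length
	}
	units := 0
	for i := 0; i < byteOff; {
		r, size := utf8.DecodeRune(line[i:])
		units += utf16Units(r)
		i += size
	}
	return units
}
```

For the line "a😀b" (1 + 4 + 1 bytes), byte offset 5 maps to UTF-16 offset 3, because the emoji occupies a surrogate pair.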
func UTF16ToByteOffset ¶ added in v0.23.0
UTF16ToByteOffset returns the byte offset in line at the given UTF-16 code-unit count. Offsets past the line's end clamp to len(line) so a defensive guard upstream still sees an in-range value. A target that lands inside a surrogate pair rounds up to the next codepoint boundary.
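The reverse mapping can be sketched the same way. The names utf16ToByteOffsetSketch and unitsFor are hypothetical, and returning 0 for a negative target is an assumption.

```go
package main

import "unicode/utf8"

// unitsFor returns how many UTF-16 code units a decoded rune occupies.
func unitsFor(r rune) int {
	if r >= 0x10000 {
		return 2
	}
	return 1
}

// utf16ToByteOffsetSketch is a hypothetical illustration: it advances rune
// by rune until at least target UTF-16 units have been consumed.
func utf16ToByteOffsetSketch(line []byte, target int) int {
	units := 0
	i := 0
	for i < len(line) && units < target {
		r, size := utf8.DecodeRune(line[i:])
		units += unitsFor(r)
		i += size
	}
	// i is always on a rune boundary. A target inside a surrogate pair
	// therefore lands after the whole rune, and a target past the end
	// stops at len(line).
	return i
}
```

For the line "a😀b", target 2 falls inside the emoji's surrogate pair, so the sketch rounds up to byte offset 5, the boundary just after the emoji.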
Types ¶
type TOCItem ¶ added in v0.6.0
TOCItem represents a single heading entry for table-of-contents generation.
func CollectTOCItems ¶ added in v0.6.0
CollectTOCItems returns all headings from the AST as TOC items, in document order. Anchors are disambiguated by insertion order: the first occurrence keeps the plain slug, and subsequent duplicates get -1, -2, … suffixes, matching the anchor computation in crossfilereferenceintegrity. It tracks used anchors (not just base slugs) to guarantee unique anchors even when a later heading's base slug matches an earlier heading's disambiguated anchor.
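The disambiguation rule can be sketched on plain slugs, without the AST. The function uniqueAnchors and its two-map bookkeeping are hypothetical; they show one way to get the stated behavior and are not the package's code.

```go
package main

import "strconv"

// uniqueAnchors is a hypothetical illustration of suffix-based anchor
// disambiguation. It takes base slugs in document order and returns one
// unique anchor per slug.
func uniqueAnchors(slugs []string) []string {
	used := make(map[string]bool, len(slugs)) // every anchor handed out so far
	next := make(map[string]int, len(slugs))  // last suffix tried per base slug
	out := make([]string, 0, len(slugs))
	for _, base := range slugs {
		anchor := base
		// Keep bumping the suffix until the candidate is unused. Checking
		// used anchors (not only base slugs) handles a heading whose base
		// slug equals an earlier disambiguated anchor.
		for used[anchor] {
			next[base]++
			anchor = base + "-" + strconv.Itoa(next[base])
		}
		used[anchor] = true
		out = append(out, anchor)
	}
	return out
}
```

Given the slugs intro, intro, intro-1, this sketch produces intro, intro-1, intro-1-1: the third heading cannot reuse intro-1 because the second heading already claimed it.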