mdtext

package
v0.23.0 Latest
Published: May 21, 2026 License: MIT Imports: 8 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func CountCharacters

func CountCharacters(text string) int

CountCharacters counts letters and digits in text (no spaces or punctuation).
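
The counting rule above can be sketched with the standard library alone. The helper name countCharacters is hypothetical, and using unicode.IsLetter and unicode.IsDigit to classify runes is an assumption, not necessarily what the package does.

```go
package main

import (
	"fmt"
	"unicode"
)

// countCharacters is a hypothetical stand-in for CountCharacters. It counts
// runes that are letters or digits and skips spaces, punctuation, and
// everything else. The choice of unicode.IsLetter/IsDigit is an assumption.
func countCharacters(text string) int {
	n := 0
	for _, r := range text {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			n++
		}
	}
	return n
}

func main() {
	// "Hello" (5) + "world" (5) + "42" (2)
	fmt.Println(countCharacters("Hello, world! 42")) // prints 12
}
```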

func CountSentences

func CountSentences(text string) int

CountSentences counts sentences by splitting on sentence-ending punctuation (., !, ?) followed by whitespace or end of text. Returns at least 1 for non-empty text.
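
A minimal sketch of the rule described above, using only the standard library. The name countSentences is hypothetical, and treating only the empty string as "empty" is an assumption; the real function may handle whitespace-only input differently.

```go
package main

import (
	"fmt"
	"unicode"
)

// countSentences is a hypothetical stand-in for CountSentences. It counts
// each '.', '!' or '?' that is followed by whitespace or by the end of the
// text, and reports at least 1 for any non-empty input.
func countSentences(text string) int {
	if text == "" {
		return 0 // assumption: only the empty string counts as empty
	}
	runes := []rune(text)
	n := 0
	for i, r := range runes {
		if r != '.' && r != '!' && r != '?' {
			continue
		}
		if i+1 == len(runes) || unicode.IsSpace(runes[i+1]) {
			n++
		}
	}
	if n == 0 {
		return 1 // non-empty text without terminal punctuation
	}
	return n
}

func main() {
	fmt.Println(countSentences("One. Two! Three?"))    // prints 3
	fmt.Println(countSentences("Version 1.5 is out.")) // prints 1
	fmt.Println(countSentences("no punctuation"))      // prints 1
}
```

Under this rule the period in "1.5" is not a boundary because a digit, not whitespace, follows it.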

func CountWords

func CountWords(text string) int

CountWords counts whitespace-delimited words in text. The result is exactly len(strings.Fields(text)): a word is a maximal run of non-space runes, where a space is any rune for which IsSpace (equivalent to unicode.IsSpace) reports true. Unlike strings.Fields, it counts in a single rune scan and does not allocate the []string.

CountWords is called per sentence, per paragraph, and per file. The slice that strings.Fields built only to be discarded accounted for ~0.48 GB over the 600-file check gate (plan 175 profiling).

func ExtractPlainText

func ExtractPlainText(node ast.Node, source []byte) string

ExtractPlainText extracts readable text from a goldmark AST node, stripping markdown syntax. Keeps: text content, link display text, emphasis inner text, image alt text, code span text.

func IsSpace added in v0.21.0

func IsSpace(r rune) bool

IsSpace reports whether r is a Unicode space. It returns exactly what unicode.IsSpace returns but adds an inlinable ASCII fast path: for r < utf8.RuneSelf the only spaces are ' ' and '\t'..'\r', so two integer comparisons decide the result, and only genuine non-ASCII runes pay for unicode.IsSpace's table lookup. It is called per rune of every word of every file on the check hot path, where unicode.IsSpace alone was ~5.5% of CPU (plan 175 profiling).

func NonNegativeUTF16RuneLen added in v0.23.0

func NonNegativeUTF16RuneLen(r rune) int

NonNegativeUTF16RuneLen wraps utf16.RuneLen so its negative "invalid code point" return cannot decrement a caller's running UTF-16 unit total. utf8.DecodeRune already maps invalid bytes to RuneError (U+FFFD, width 1), so in practice utf16.RuneLen never returns a negative for runes decoded from real input; the guard is defensive against a future Go change that weakens that invariant. A negative width means the rune is outside [0, MaxRune] or is a surrogate, both of which take one UTF-16 unit when serialized as RuneError.
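
A minimal sketch of such a guard, assuming the fallback width is 1 as the description of RuneError suggests. The name nonNegativeUTF16RuneLen is hypothetical.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

// nonNegativeUTF16RuneLen is a hypothetical stand-in for
// NonNegativeUTF16RuneLen. utf16.RuneLen returns -1 for surrogates and
// out-of-range values; this wrapper reports 1 instead, the width of
// RuneError (U+FFFD), so a running total can never decrease.
func nonNegativeUTF16RuneLen(r rune) int {
	if n := utf16.RuneLen(r); n > 0 {
		return n
	}
	return 1
}

func main() {
	fmt.Println(nonNegativeUTF16RuneLen('a'))          // prints 1
	fmt.Println(nonNegativeUTF16RuneLen('\U0001F600')) // prints 2
	fmt.Println(nonNegativeUTF16RuneLen(rune(0xD800))) // prints 1 (surrogate)
	fmt.Println(nonNegativeUTF16RuneLen(-1))           // prints 1 (out of range)
}
```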

func Slugify added in v0.6.0

func Slugify(s string) string

Slugify converts heading text to a GitHub-compatible URL anchor slug. The result is lowercased; letters and digits are preserved, and spaces and hyphens become a single dash.

func SplitSentences

func SplitSentences(text string) []string

SplitSentences splits text into individual sentences using a Punkt sentence tokenizer. Handles abbreviations, decimals, and ellipses. The actual segmentation is delegated to splitSentences (defined by the active build tag).

The returned slice is freshly allocated. Hot callers that want to pool the destination should use SplitSentencesInto instead.

func SplitSentencesInto added in v0.23.0

func SplitSentencesInto(dst []string, text string) []string

SplitSentencesInto is the pool-friendly variant of SplitSentences: it appends the segmented sentences (trimmed, non-empty) to dst and returns the extended slice. The intended pattern is

bufPtr := sentBufPool.Get().(*[]string)
*bufPtr = mdtext.SplitSentencesInto((*bufPtr)[:0], text)
defer sentBufPool.Put(bufPtr)

so that the per-call `make([]string, 0, n)` that plain SplitSentences pays is amortized across a sync.Pool. MDS024's hot path uses this form to stay within the per-rule allocation budget on cold-File runs.
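
For a self-contained illustration of the pooling pattern, the sketch below declares the pool and wires it to a stand-in splitter. splitSentencesInto is a hypothetical, deliberately naive replacement for mdtext.SplitSentencesInto (it is not a Punkt tokenizer), and countSentencesPooled is an invented caller.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// sentBufPool stores *[]string rather than []string so that Put does not
// allocate when the slice header is boxed into an interface value.
var sentBufPool = sync.Pool{
	New: func() any {
		buf := make([]string, 0, 16)
		return &buf
	},
}

// splitSentencesInto is a hypothetical stand-in for SplitSentencesInto.
// It naively cuts after every '.', '!' or '?' and appends the trimmed,
// non-empty pieces to dst.
func splitSentencesInto(dst []string, text string) []string {
	start := 0
	for i, r := range text {
		if r == '.' || r == '!' || r == '?' {
			if s := strings.TrimSpace(text[start : i+1]); s != "" {
				dst = append(dst, s)
			}
			start = i + 1
		}
	}
	if s := strings.TrimSpace(text[start:]); s != "" {
		dst = append(dst, s)
	}
	return dst
}

// countSentencesPooled borrows a buffer, reuses its backing array from
// length zero, and stores the possibly grown slice back through the
// pointer before the deferred Put returns it to the pool.
func countSentencesPooled(text string) int {
	bufPtr := sentBufPool.Get().(*[]string)
	defer sentBufPool.Put(bufPtr)
	*bufPtr = splitSentencesInto((*bufPtr)[:0], text)
	return len(*bufPtr)
}

func main() {
	fmt.Println(countSentencesPooled("One. Two! Three")) // prints 3
}
```

The sentences must not be retained after the buffer goes back to the pool, because a later caller overwrites the same backing array.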

func UTF16FromByteOffset added in v0.23.0

func UTF16FromByteOffset(line []byte, byteOff int) int

UTF16FromByteOffset returns the UTF-16 code-unit offset that corresponds to UTF-8 byte offset byteOff within line. The result is clamped to [0, total UTF-16 length of line] so callers cannot receive a negative or past-end position even when given a malformed byte column.
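
One way to sketch the conversion and its clamping with the standard library. The name utf16FromByteOffset is hypothetical, and counting a rune in full when byteOff lands inside its UTF-8 encoding is an assumption that the description above does not settle.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// utf16FromByteOffset is a hypothetical stand-in for UTF16FromByteOffset.
// It clamps byteOff to [0, len(line)] and sums the UTF-16 width of every
// rune that starts before the clamped offset.
func utf16FromByteOffset(line []byte, byteOff int) int {
	if byteOff <= 0 {
		return 0
	}
	if byteOff > len(line) {
		byteOff = len(line)
	}
	units := 0
	for i := 0; i < byteOff; {
		// Invalid bytes decode as RuneError with size 1, which counts
		// as one UTF-16 unit.
		r, size := utf8.DecodeRune(line[i:])
		if r >= 0x10000 {
			units += 2 // supplementary plane: surrogate pair
		} else {
			units++
		}
		i += size
	}
	return units
}

func main() {
	line := []byte("a😀b") // 1 + 4 + 1 bytes, 1 + 2 + 1 UTF-16 units
	fmt.Println(utf16FromByteOffset(line, 5))   // prints 3
	fmt.Println(utf16FromByteOffset(line, 100)) // prints 4 (clamped)
	fmt.Println(utf16FromByteOffset(line, -3))  // prints 0 (clamped)
}
```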

func UTF16ToByteOffset added in v0.23.0

func UTF16ToByteOffset(line []byte, target int) int

UTF16ToByteOffset returns the byte offset in line at the given UTF-16 code-unit count. Offsets past the line's end clamp to len(line) so a defensive guard upstream still sees an in-range value. A target that lands inside a surrogate pair rounds up to the next codepoint boundary.
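
The reverse mapping can be sketched as a scan that stops once enough UTF-16 units have been consumed. The name utf16ToByteOffset is hypothetical; the rounding inside a surrogate pair follows from consuming whole runes.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// utf16ToByteOffset is a hypothetical stand-in for UTF16ToByteOffset.
// It advances rune by rune until target UTF-16 units are covered. Because
// whole runes are consumed, a target inside a surrogate pair ends at the
// next code point boundary, and a past-end target stops at len(line).
func utf16ToByteOffset(line []byte, target int) int {
	units, i := 0, 0
	for i < len(line) && units < target {
		r, size := utf8.DecodeRune(line[i:])
		if r >= 0x10000 {
			units += 2
		} else {
			units++
		}
		i += size
	}
	return i
}

func main() {
	line := []byte("a😀b") // UTF-16 units: a=1, 😀=2, b=1
	fmt.Println(utf16ToByteOffset(line, 1))  // prints 1
	fmt.Println(utf16ToByteOffset(line, 2))  // prints 5 (inside the pair, rounds up)
	fmt.Println(utf16ToByteOffset(line, 99)) // prints 6 (clamped to len(line))
}
```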

Types

type TOCItem added in v0.6.0

type TOCItem struct {
	Level  int
	Text   string
	Anchor string
}

TOCItem represents a single heading entry for table-of-contents generation.

func CollectTOCItems added in v0.6.0

func CollectTOCItems(root ast.Node, source []byte) []TOCItem

CollectTOCItems returns all headings from the AST as TOC items, in document order. Anchors are disambiguated by insertion order: the first occurrence keeps the plain slug, and subsequent duplicates get -1, -2, … suffixes, matching the anchor computation in crossfilereferenceintegrity. The function tracks used anchors (not just base slugs), which guarantees unique anchors even when a later heading's base slug matches an earlier heading's disambiguated anchor.
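
The disambiguation rule can be illustrated on plain slugs, without an AST. The helper uniqueAnchors is hypothetical and restarts the suffix search at 1 for every heading, which is an assumption about the bookkeeping rather than the package's actual code.

```go
package main

import (
	"fmt"
	"strconv"
)

// uniqueAnchors is a hypothetical sketch of the anchor disambiguation.
// It records every anchor it hands out (not only base slugs), so a later
// base slug such as "intro-1" cannot collide with an earlier suffixed one.
func uniqueAnchors(slugs []string) []string {
	used := make(map[string]bool, len(slugs))
	out := make([]string, 0, len(slugs))
	for _, slug := range slugs {
		anchor := slug
		// Try slug, slug-1, slug-2, ... until one is free.
		for n := 1; used[anchor]; n++ {
			anchor = slug + "-" + strconv.Itoa(n)
		}
		used[anchor] = true
		out = append(out, anchor)
	}
	return out
}

func main() {
	// The third heading's base slug equals the second heading's
	// disambiguated anchor.
	fmt.Println(uniqueAnchors([]string{"intro", "intro", "intro-1", "intro"}))
	// prints [intro intro-1 intro-1-1 intro-2]
}
```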
