tokenizer

package module
v0.0.3
Published: Mar 28, 2026 License: Apache-2.0 Imports: 6 Imported by: 2

README

go-tokenizer

A general-purpose tokenizer and Markdown parser with HTML rendering for Go.


Features

  • Lexical Scanner: Tokenizes text into identifiers, numbers, strings, operators, and punctuation
  • Markdown Parser: Converts Markdown text into an Abstract Syntax Tree (AST)
  • HTML Renderer: Renders Markdown AST to HTML with proper escaping
  • Configurable: Optional features like comment parsing, newline handling, and float parsing

Installation

go get github.com/mutablelogic/go-tokenizer

Requires Go 1.23 or later.

Quick Start

Tokenizing Text

package main

import (
    "fmt"
    "strings"
    
    "github.com/mutablelogic/go-tokenizer"
)

func main() {
    scanner := tokenizer.NewScanner(strings.NewReader("hello world 123"), tokenizer.Pos{})
    for {
        tok := scanner.Next()
        if tok.Kind == tokenizer.EOF {
            break
        }
        fmt.Printf("%s: %q\n", tok.Kind, tok.Val)
    }
}

Output:

Ident: "hello"
Space: " "
Ident: "world"
Space: " "
NumberInteger: "123"

Parsing Markdown

package main

import (
    "fmt"
    "strings"
    
    "github.com/mutablelogic/go-tokenizer"
    "github.com/mutablelogic/go-tokenizer/pkg/markdown"
    "github.com/mutablelogic/go-tokenizer/pkg/markdown/html"
)

func main() {
    input := `# Hello World

This is **bold** and _italic_ text.

- Item 1
- Item 2
- Item 3
`
    doc := markdown.Parse(strings.NewReader(input), tokenizer.Pos{})
    output := html.RenderString(doc)
    fmt.Println(output)
}

Output:

<h1>Hello World</h1><p>This is <strong>bold</strong> and <em>italic</em> text.</p><ul><li>Item 1</li><li>Item 2</li><li>Item 3</li></ul>

Packages

tokenizer (root package)

The lexical scanner that breaks input text into tokens.

Token Types:

  • Ident - Identifiers (hello, world)
  • NumberInteger, NumberFloat, NumberHex, NumberOctal, NumberBinary - Numbers
  • String, QuotedString - String literals
  • Hash, Asterisk, Underscore, Backtick, Tilde - Special characters
  • Space, Newline - Whitespace
  • Comment - Comments (when enabled)
  • And more...

Scanner Features:

// Enable features with bitwise OR
scanner := tokenizer.NewScanner(r, pos, 
    tokenizer.HashComment |      // # style comments
    tokenizer.LineComment |      // // style comments  
    tokenizer.BlockComment |     // /* */ style comments
    tokenizer.NewlineToken |     // Emit newlines as separate tokens
    tokenizer.UnderscoreToken |  // Emit underscores as separate tokens
    tokenizer.HyphenIdentToken | // Allow hyphenated identifiers like X-ORIGIN
    tokenizer.NumberFloatToken,  // Parse floating point numbers
)

pkg/ast

Defines the AST node types and tree traversal.

// Node interface
type Node interface {
    Kind() Kind
    Children() []Node
}

// Walk the AST
ast.Walk(doc, func(node ast.Node, depth int) error {
    fmt.Printf("%s%s\n", strings.Repeat("  ", depth), node.Kind())
    return nil
})

pkg/markdown

Parses Markdown text into an AST.

Supported Syntax:

  • Headings: # H1 through ###### H6
  • Paragraphs: Text separated by blank lines
  • Emphasis: _italic_ or *italic*
  • Strong: __bold__ or **bold**
  • Strikethrough: ~~deleted~~
  • Inline code: `code`
  • Code blocks: ```language ... ```
  • Links: [text](url) or <url>
  • Images: ![alt](url)
  • Blockquotes: > quoted text
  • Unordered lists: - item, * item, or + item
  • Ordered lists: 1. item or 1) item
  • Horizontal rules: ---, ***, or ___

pkg/markdown/html

Renders Markdown AST to HTML.

// Render to string
output := html.RenderString(doc)

// Render to io.Writer with indentation
renderer := html.NewRenderer(w).WithIndent(true)
err := renderer.Render(doc)

Features:

  • Proper HTML escaping for XSS prevention
  • Optional indented output for readability
  • Language classes on code blocks: <code class="language-go">

AST Node Types

Kind            Description         HTML Output
Document        Root node           (container)
Paragraph       Text block          <p>...</p>
Heading         H1-H6               <h1>...</h1>
Text            Plain text          (escaped text)
Emphasis        Italic              <em>...</em>
Strong          Bold                <strong>...</strong>
Strikethrough   Deleted             <del>...</del>
Code            Inline code         <code>...</code>
CodeBlock       Fenced code         <pre><code>...</code></pre>
Link            Hyperlink           <a href="...">...</a>
Image           Image               <img src="..." alt="..."/>
Blockquote      Quote               <blockquote>...</blockquote>
List            Ordered/unordered   <ol>...</ol> or <ul>...</ul>
ListItem        List item           <li>...</li>
HorizontalRule  Divider             <hr/>

License

Apache 2.0 - see LICENSE for details.

Documentation

Overview

Package tokenizer implements a generic lexical scanner for tokenizing text input.

The tokenizer breaks input text into tokens such as identifiers, numbers, strings, operators, and punctuation. It supports various number formats (integer, float, hex, octal, binary) and can be configured with optional features like comment parsing and newline handling.

Basic Usage

scanner := tokenizer.NewScanner(strings.NewReader("hello world"), tokenizer.Pos{})
for {
	tok := scanner.Next()
	if tok.Kind == tokenizer.EOF {
		break
	}
	fmt.Println(tok)
}

Features

The scanner supports optional features that can be enabled:

  • HashComment: Enable # style single-line comments
  • LineComment: Enable // style single-line comments
  • BlockComment: Enable block comments
  • UnderscoreToken: Emit underscores as separate tokens (for markdown parsing)
  • NewlineToken: Emit newlines as separate tokens instead of whitespace

Features are combined using bitwise OR:

scanner := tokenizer.NewScanner(r, pos, tokenizer.HashComment|tokenizer.LineComment)

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func NewParseError

func NewParseError(t *Token) error

NewParseError creates a parse error for the given token. The error message includes the token's value and position.

func NewPosError

func NewPosError(err error, pos Pos) error

NewPosError creates a new error with positional information. The resulting error implements the error interface and includes file, line, and column information in its Error() output.

Types

type Error

type Error uint

const (
	ErrSuccess Error = iota
	ErrBadParameter
	ErrDuplicateEntry
	ErrUnexpectedResponse
	ErrNotFound
	ErrNotModified
	ErrInternalAppError
	ErrNotImplemented
)

func (Error) Error

func (e Error) Error() string

func (Error) With

func (e Error) With(args ...any) error

func (Error) Withf

func (e Error) Withf(format string, args ...any) error

type Feature

type Feature uint

Feature represents optional scanner features that can be enabled.

const (
	// HashComment enables # style comments (# comment until end of line)
	HashComment Feature = 1 << iota

	// LineComment enables // style comments (// comment until end of line)
	LineComment

	// BlockComment enables /* */ style comments (/* block comment */)
	BlockComment

	// UnderscoreToken emits underscores as separate Underscore tokens
	// When disabled, underscores are part of identifiers (hello_world)
	UnderscoreToken

	// NewlineToken emits newlines as separate Newline tokens instead of Space
	NewlineToken

	// NumberFloatToken enables parsing of floating point numbers (1.5, 3.14e10)
	// When disabled, "1.5" is parsed as NumberInteger("1") + Punkt(".") + NumberInteger("5")
	NumberFloatToken

	// SingleQuoteToken emits single quotes as separate tokens instead of parsing
	// single-quoted strings.
	SingleQuoteToken

	// DoubleQuoteToken emits double quotes as separate tokens instead of parsing
	// double-quoted strings.
	DoubleQuoteToken

	// HyphenIdentToken allows identifiers to include hyphens (e.g., "hello-world") instead
	// of treating hyphens as separate tokens.
	HyphenIdentToken
)

type Pos

type Pos struct {
	// Path is an optional pointer to the source file path
	Path *string
	// Line is the zero-indexed line number
	Line uint
	// Col is the zero-indexed column number
	Col uint
	// contains filtered or unexported fields
}

Pos represents a position in source text, typically within a file. Line and Col are zero-indexed internally; add 1 when displaying to users. The Path field is optional and can be nil if the source has no associated file.

func (*Pos) String

func (p *Pos) String() string

type PosError

type PosError struct {
	// Err is the underlying error
	Err error
	// Pos indicates where in the source the error occurred
	Pos Pos
}

PosError wraps an error with positional information from the source. This allows error messages to include file, line, and column information.

func (*PosError) Error

func (e *PosError) Error() string

type Scanner

type Scanner struct {
	// contains filtered or unexported fields
}

Scanner represents a lexical scanner.

func NewScanner

func NewScanner(r io.Reader, pos Pos, features ...Feature) *Scanner

NewScanner returns a new instance of Scanner with optional features. Features can be combined using bitwise OR: HashComment|LineComment|BlockComment

func (*Scanner) Err added in v0.0.2

func (s *Scanner) Err() error

Err returns the first scanning error encountered by Peak, Next, or Tokens.

func (*Scanner) NewError

func (s *Scanner) NewError(err error) error

NewError wraps the given error with the scanner's current position. This is useful for creating error messages that include file, line, and column information indicating where the error occurred.

func (*Scanner) Next

func (s *Scanner) Next() *Token

Next returns the next token and advances the scanner position. If the scanner has reached EOF, subsequent calls continue to return EOF. Use Peak() instead if you need to look ahead without consuming the token.

func (*Scanner) Peak

func (s *Scanner) Peak() *Token

Peak returns the next token without advancing the scanner position. This allows looking ahead at upcoming tokens without consuming them. If the scanner has reached EOF, subsequent calls continue to return EOF. Note: The token is buffered, so multiple Peak() calls return the same token.

func (*Scanner) Tokens

func (s *Scanner) Tokens() ([]*Token, error)

Tokens scans all remaining input and returns a slice of tokens. Scanning stops at EOF or when illegal input is encountered. Returns an error with positional information if illegal input is found.

type Token

type Token struct {
	// Kind identifies the type of token (e.g., Ident, String, NumberInteger)
	Kind TokenKind
	// Val contains the literal text of the token
	Val string
	// Pos indicates where in the source the token was found
	Pos Pos
}

Token represents a lexical token produced by the scanner. It contains the token's kind, its literal value as a string, and the position in the source where it was found.

func NewToken

func NewToken(kind TokenKind, val string, pos Pos) *Token

NewToken creates a new Token with the specified kind, value, and position.

func (*Token) String

func (t *Token) String() string

type TokenKind

type TokenKind uint

TokenKind classifies the type of a token produced by the scanner. Each token has a kind that identifies what type of lexical element it represents, such as an identifier, number, string, operator, or punctuation.

const (
	Any TokenKind = iota
	String
	Expr
	Space
	Ident
	NumberInteger
	NumberFloat
	NumberOctal
	NumberHex
	NumberBinary
	Punkt
	Question
	Colon
	SemiColon
	Comma
	OpenParen
	CloseParen
	OpenSquare
	CloseSquare
	OpenBrace
	CloseBrace
	Ampersand
	Equal
	Less
	Greater
	Plus
	Minus
	Multiply
	Divide
	Not
	Backtick
	Tilde
	Pipe
	Backslash
	SingleQuote
	DoubleQuote
	Underscore
	Hash
	At
	Caret
	Percent
	Dollar
	True
	False
	Null
	Comment
	Newline
	EOF
	Lowest = Equal // Lowest precedence
)

func (TokenKind) String

func (k TokenKind) String() string

Directories

Path            Synopsis
pkg
  ast             Package ast defines the abstract syntax tree node types used by parsers.
  markdown        Package markdown provides a parser for converting Markdown text into an AST.
  markdown/html   Package html provides an HTML renderer for Markdown AST nodes.
