tokenizer

package module
v0.0.3
Published: Mar 28, 2026 License: Apache-2.0 Imports: 6 Imported by: 2

README

go-tokenizer

A general-purpose tokenizer and Markdown parser with HTML rendering for Go.


Features

  • Lexical Scanner: Tokenizes text into identifiers, numbers, strings, operators, and punctuation
  • Markdown Parser: Converts Markdown text into an Abstract Syntax Tree (AST)
  • HTML Renderer: Renders Markdown AST to HTML with proper escaping
  • Configurable: Optional features like comment parsing, newline handling, and float parsing

Installation

go get github.com/mutablelogic/go-tokenizer

Requires Go 1.23 or later.

Quick Start

Tokenizing Text

package main

import (
    "fmt"
    "strings"
    
    "github.com/mutablelogic/go-tokenizer"
)

func main() {
    scanner := tokenizer.NewScanner(strings.NewReader("hello world 123"), tokenizer.Pos{})
    for {
        tok := scanner.Next()
        if tok.Kind == tokenizer.EOF {
            break
        }
        fmt.Printf("%s: %q\n", tok.Kind, tok.Val)
    }
}

Output:

Ident: "hello"
Space: " "
Ident: "world"
Space: " "
NumberInteger: "123"

Parsing Markdown

package main

import (
    "fmt"
    "strings"
    
    "github.com/mutablelogic/go-tokenizer"
    "github.com/mutablelogic/go-tokenizer/pkg/markdown"
    "github.com/mutablelogic/go-tokenizer/pkg/markdown/html"
)

func main() {
    input := `# Hello World

This is **bold** and _italic_ text.

- Item 1
- Item 2
- Item 3
`
    doc := markdown.Parse(strings.NewReader(input), tokenizer.Pos{})
    output := html.RenderString(doc)
    fmt.Println(output)
}

Output:

<h1>Hello World</h1><p>This is <strong>bold</strong> and <em>italic</em> text.</p><ul><li>Item 1</li><li>Item 2</li><li>Item 3</li></ul>

Packages

tokenizer (root package)

The lexical scanner that breaks input text into tokens.

Token Types:

  • Ident - Identifiers (hello, world)
  • NumberInteger, NumberFloat, NumberHex, NumberOctal, NumberBinary - Numbers
  • String, QuotedString - String literals
  • Hash, Asterisk, Underscore, Backtick, Tilde - Special characters
  • Space, Newline - Whitespace
  • Comment - Comments (when enabled)
  • And more...

Scanner Features:

// Enable features with bitwise OR
scanner := tokenizer.NewScanner(r, pos, 
    tokenizer.HashComment |      // # style comments
    tokenizer.LineComment |      // // style comments  
    tokenizer.BlockComment |     // /* */ style comments
    tokenizer.NewlineToken |     // Emit newlines as separate tokens
    tokenizer.UnderscoreToken |  // Emit underscores as separate tokens
    tokenizer.HyphenIdentToken | // Allow hyphenated identifiers like X-ORIGIN
    tokenizer.NumberFloatToken,  // Parse floating point numbers
)

pkg/ast

Defines the AST node types and tree traversal.

// Node interface
type Node interface {
    Kind() Kind
    Children() []Node
}

// Walk the AST
ast.Walk(doc, func(node ast.Node, depth int) error {
    fmt.Printf("%s%s\n", strings.Repeat("  ", depth), node.Kind())
    return nil
})

pkg/markdown

Parses Markdown text into an AST.

Supported Syntax:

  • Headings: # H1 through ###### H6
  • Paragraphs: Text separated by blank lines
  • Emphasis: _italic_ or *italic*
  • Strong: __bold__ or **bold**
  • Strikethrough: ~~deleted~~
  • Inline code: `code`
  • Code blocks: ```language ... ```
  • Links: [text](url) or <url>
  • Images: ![alt](url)
  • Blockquotes: > quoted text
  • Unordered lists: - item, * item, or + item
  • Ordered lists: 1. item or 1) item
  • Horizontal rules: ---, ***, or ___

pkg/markdown/html

Renders Markdown AST to HTML.

// Render to string
output := html.RenderString(doc)

// Render to io.Writer with indentation
renderer := html.NewRenderer(w).WithIndent(true)
err := renderer.Render(doc)

Features:

  • Proper HTML escaping for XSS prevention
  • Optional indented output for readability
  • Language classes on code blocks: <code class="language-go">

AST Node Types

Kind            Description         HTML Output
Document        Root node           (container)
Paragraph       Text block          <p>...</p>
Heading         H1-H6               <h1>...</h1>
Text            Plain text          (escaped text)
Emphasis        Italic              <em>...</em>
Strong          Bold                <strong>...</strong>
Strikethrough   Deleted             <del>...</del>
Code            Inline code         <code>...</code>
CodeBlock       Fenced code         <pre><code>...</code></pre>
Link            Hyperlink           <a href="...">...</a>
Image           Image               <img src="..." alt="..."/>
Blockquote      Quote               <blockquote>...</blockquote>
List            Ordered/unordered   <ol>...</ol> or <ul>...</ul>
ListItem        List item           <li>...</li>
HorizontalRule  Divider             <hr/>

License

Apache 2.0 - see LICENSE for details.

Documentation

Overview

Package tokenizer implements a generic lexical scanner for tokenizing text input.

The tokenizer breaks input text into tokens such as identifiers, numbers, strings, operators, and punctuation. It supports various number formats (integer, float, hex, octal, binary) and can be configured with optional features like comment parsing and newline handling.

Basic Usage

scanner := tokenizer.NewScanner(strings.NewReader("hello world"), tokenizer.Pos{})
for {
	tok := scanner.Next()
	if tok.Kind == tokenizer.EOF {
		break
	}
	fmt.Println(tok)
}

Features

The scanner supports optional features that can be enabled:

  • HashComment: Enable # style single-line comments
  • LineComment: Enable // style single-line comments
  • BlockComment: Enable block comments
  • UnderscoreToken: Emit underscores as separate tokens (for markdown parsing)
  • NewlineToken: Emit newlines as separate tokens instead of whitespace

Features are combined using bitwise OR:

scanner := tokenizer.NewScanner(r, pos, tokenizer.HashComment|tokenizer.LineComment)

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func NewParseError

func NewParseError(t *Token) error

NewParseError creates a parse error for the given token. The error message includes the token's value and position.

func NewPosError

func NewPosError(err error, pos Pos) error

NewPosError creates a new error with positional information. The resulting error implements the error interface and includes file, line, and column information in its Error() output.

Types

type Error

type Error uint

const (
	ErrSuccess Error = iota
	ErrBadParameter
	ErrDuplicateEntry
	ErrUnexpectedResponse
	ErrNotFound
	ErrNotModified
	ErrInternalAppError
	ErrNotImplemented
)

func (Error) Error

func (e Error) Error() string

func (Error) With

func (e Error) With(args ...any) error

func (Error) Withf

func (e Error) Withf(format string, args ...any) error

type Feature

type Feature uint

Feature represents optional scanner features that can be enabled.

const (
	// HashComment enables # style comments (# comment until end of line)
	HashComment Feature = 1 << iota

	// LineComment enables // style comments (// comment until end of line)
	LineComment

	// BlockComment enables /* */ style comments (/* block comment */)
	BlockComment

	// UnderscoreToken emits underscores as separate Underscore tokens
	// When disabled, underscores are part of identifiers (hello_world)
	UnderscoreToken

	// NewlineToken emits newlines as separate Newline tokens instead of Space
	NewlineToken

	// NumberFloatToken enables parsing of floating point numbers (1.5, 3.14e10)
	// When disabled, "1.5" is parsed as NumberInteger("1") + Punkt(".") + NumberInteger("5")
	NumberFloatToken

	// SingleQuoteToken emits single quotes as separate tokens instead of parsing
	// single-quoted strings.
	SingleQuoteToken

	// DoubleQuoteToken emits double quotes as separate tokens instead of parsing
	// double-quoted strings.
	DoubleQuoteToken

	// HyphenIdentToken allows identifiers to include hyphens (e.g., "hello-world") instead
	// of treating hyphens as separate tokens.
	HyphenIdentToken
)

type Pos

type Pos struct {
	// Path is an optional pointer to the source file path
	Path *string
	// Line is the zero-indexed line number
	Line uint
	// Col is the zero-indexed column number
	Col uint
	// contains filtered or unexported fields
}

Pos represents a position in source text, typically within a file. Line and Col are zero-indexed internally; add 1 when displaying to users. The Path field is optional and can be nil if the source has no associated file.

func (*Pos) String

func (p *Pos) String() string

type PosError

type PosError struct {
	// Err is the underlying error
	Err error
	// Pos indicates where in the source the error occurred
	Pos Pos
}

PosError wraps an error with positional information from the source. This allows error messages to include file, line, and column information.

func (*PosError) Error

func (e *PosError) Error() string

type Scanner

type Scanner struct {
	// contains filtered or unexported fields
}

Scanner represents a lexical scanner.

func NewScanner

func NewScanner(r io.Reader, pos Pos, features ...Feature) *Scanner

NewScanner returns a new instance of Scanner with optional features. Features can be combined using bitwise OR: HashComment|LineComment|BlockComment

func (*Scanner) Err added in v0.0.2

func (s *Scanner) Err() error

Err returns the first scanning error encountered by Peak, Next, or Tokens.

func (*Scanner) NewError

func (s *Scanner) NewError(err error) error

NewError wraps the given error with the scanner's current position. This is useful for creating error messages that include file, line, and column information indicating where the error occurred.

func (*Scanner) Next

func (s *Scanner) Next() *Token

Next returns the next token and advances the scanner position. If the scanner has reached EOF, subsequent calls continue to return EOF. Use Peak() instead if you need to look ahead without consuming the token.

func (*Scanner) Peak

func (s *Scanner) Peak() *Token

Peak returns the next token without advancing the scanner position. This allows looking ahead at upcoming tokens without consuming them. If the scanner has reached EOF, subsequent calls continue to return EOF. Note: The token is buffered, so multiple Peak() calls return the same token.

func (*Scanner) Tokens

func (s *Scanner) Tokens() ([]*Token, error)

Tokens scans all remaining input and returns a slice of tokens. Scanning stops at EOF or when illegal input is encountered. Returns an error with positional information if illegal input is found.

type Token

type Token struct {
	// Kind identifies the type of token (e.g., Ident, String, NumberInteger)
	Kind TokenKind
	// Val contains the literal text of the token
	Val string
	// Pos indicates where in the source the token was found
	Pos Pos
}

Token represents a lexical token produced by the scanner. It contains the token's kind, its literal value as a string, and the position in the source where it was found.

func NewToken

func NewToken(kind TokenKind, val string, pos Pos) *Token

NewToken creates a new Token with the specified kind, value, and position.

func (*Token) String

func (t *Token) String() string

type TokenKind

type TokenKind uint

TokenKind classifies the type of a token produced by the scanner. Each token has a kind that identifies what type of lexical element it represents, such as an identifier, number, string, operator, or punctuation.

const (
	Any TokenKind = iota
	String
	Expr
	Space
	Ident
	NumberInteger
	NumberFloat
	NumberOctal
	NumberHex
	NumberBinary
	Punkt
	Question
	Colon
	SemiColon
	Comma
	OpenParen
	CloseParen
	OpenSquare
	CloseSquare
	OpenBrace
	CloseBrace
	Ampersand
	Equal
	Less
	Greater
	Plus
	Minus
	Multiply
	Divide
	Not
	Backtick
	Tilde
	Pipe
	Backslash
	SingleQuote
	DoubleQuote
	Underscore
	Hash
	At
	Caret
	Percent
	Dollar
	True
	False
	Null
	Comment
	Newline
	EOF
	Lowest = Equal // Lowest precedence
)

func (TokenKind) String

func (k TokenKind) String() string

Directories

Path            Synopsis
pkg
  ast             Package ast defines the abstract syntax tree node types used by parsers.
  markdown        Package markdown provides a parser for converting Markdown text into an AST.
  markdown/html   Package html provides an HTML renderer for Markdown AST nodes.
