package tokenizer

v0.3.0
Warning

This package is not in the latest version of its module.

Published: Aug 25, 2025 License: Apache-2.0 Imports: 1 Imported by: 1

Documentation

Overview

Package tokenizer provides basic text tokenization functionality.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer provides basic text tokenization functionality. This is a highly simplified example (whitespace tokenization). A feature-complete tokenizer would implement subword algorithms (BPE, WordPiece, SentencePiece).
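Since the struct's fields are unexported, here is a minimal sketch of how such a whitespace tokenizer might be laid out internally. The field names (`tokenToID`, `idToToken`) and the sequential-ID scheme are assumptions for illustration, not the package's actual internals:

```go
package main

import "fmt"

// Tokenizer maps token strings to integer IDs and back.
// Field names here are illustrative; the real package's fields are unexported.
type Tokenizer struct {
	tokenToID map[string]int
	idToToken []string
}

// NewTokenizer creates an empty Tokenizer.
func NewTokenizer() *Tokenizer {
	return &Tokenizer{tokenToID: make(map[string]int)}
}

// AddToken adds a token to the vocabulary if it doesn't exist and
// returns its ID. In this sketch, IDs are assigned sequentially from 0.
func (t *Tokenizer) AddToken(token string) int {
	if id, ok := t.tokenToID[token]; ok {
		return id
	}
	id := len(t.idToToken)
	t.tokenToID[token] = id
	t.idToToken = append(t.idToToken, token)
	return id
}

func main() {
	tok := NewTokenizer()
	fmt.Println(tok.AddToken("hello")) // 0
	fmt.Println(tok.AddToken("world")) // 1
	fmt.Println(tok.AddToken("hello")) // 0 (already in the vocabulary)
}
```

Keeping both a map and a slice gives O(1) lookup in each direction, which is the usual trade-off for a vocabulary of this kind.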

func NewTokenizer

func NewTokenizer() *Tokenizer

NewTokenizer creates a new simple Tokenizer.

func (*Tokenizer) AddToken

func (t *Tokenizer) AddToken(token string) int

AddToken adds a token to the vocabulary if it doesn't exist.

func (*Tokenizer) Decode

func (t *Tokenizer) Decode(tokenIDs []int) string

Decode converts a slice of token IDs back into a text string.

func (*Tokenizer) Encode

func (t *Tokenizer) Encode(text string) []int

Encode converts a text string into a slice of token IDs. This uses simple whitespace tokenization.

func (*Tokenizer) GetToken added in v0.3.0

func (t *Tokenizer) GetToken(id int) string

GetToken returns the token string for a given ID, or "<unk>" if not found.
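The four methods above compose into a simple encode/decode round trip. The sketch below reimplements the documented behavior under stated assumptions: `Encode` splits on whitespace via `strings.Fields` and grows the vocabulary as it goes (the docs only say "simple whitespace tokenization", so auto-adding unseen words is a guess), and `Decode` joins tokens with single spaces:

```go
package main

import (
	"fmt"
	"strings"
)

// Tokenizer is a minimal stand-in for the package's type; the real
// fields are unexported, so these names are assumptions.
type Tokenizer struct {
	tokenToID map[string]int
	idToToken []string
}

func NewTokenizer() *Tokenizer {
	return &Tokenizer{tokenToID: make(map[string]int)}
}

func (t *Tokenizer) AddToken(token string) int {
	if id, ok := t.tokenToID[token]; ok {
		return id
	}
	id := len(t.idToToken)
	t.tokenToID[token] = id
	t.idToToken = append(t.idToToken, token)
	return id
}

// Encode splits text on whitespace and maps each word to an ID,
// adding unseen words to the vocabulary (an assumption of this sketch).
func (t *Tokenizer) Encode(text string) []int {
	var ids []int
	for _, word := range strings.Fields(text) {
		ids = append(ids, t.AddToken(word))
	}
	return ids
}

// GetToken returns the token for an ID, or "<unk>" if not found.
func (t *Tokenizer) GetToken(id int) string {
	if id < 0 || id >= len(t.idToToken) {
		return "<unk>"
	}
	return t.idToToken[id]
}

// Decode joins the looked-up tokens with single spaces.
func (t *Tokenizer) Decode(tokenIDs []int) string {
	words := make([]string, len(tokenIDs))
	for i, id := range tokenIDs {
		words[i] = t.GetToken(id)
	}
	return strings.Join(words, " ")
}

func main() {
	tok := NewTokenizer()
	ids := tok.Encode("the quick brown fox")
	fmt.Println(ids)              // [0 1 2 3]
	fmt.Println(tok.Decode(ids))  // the quick brown fox
	fmt.Println(tok.GetToken(99)) // <unk>
}
```

Note that because `Decode` rejoins with single spaces, the round trip is lossy for any input with tabs, newlines, or runs of spaces, one of the limitations the overview's pointer to subword algorithms (BPE, WordPiece, SentencePiece) is getting at.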
