package tokenizer

v0.3.0
Warning

This package is not in the latest version of its module.

Published: Aug 25, 2025 License: Apache-2.0 Imports: 1 Imported by: 1

Documentation

Overview

Package tokenizer provides basic text tokenization functionality.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer provides basic text tokenization functionality. This is a highly simplified example (whitespace tokenization). A feature-complete tokenizer would implement subword algorithms (BPE, WordPiece, SentencePiece).
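Since the struct's fields are unexported, here is a minimal sketch of how such a whitespace tokenizer might be laid out internally. The field names (`tokenToID`, `idToToken`) and the sequential-ID scheme are assumptions for illustration, not the package's actual internals:

```go
package main

import "fmt"

// Tokenizer maps token strings to integer IDs and back.
// Field names here are illustrative; the real package's fields are unexported.
type Tokenizer struct {
	tokenToID map[string]int
	idToToken []string
}

// NewTokenizer creates an empty Tokenizer.
func NewTokenizer() *Tokenizer {
	return &Tokenizer{tokenToID: make(map[string]int)}
}

// AddToken adds a token to the vocabulary if it doesn't exist and
// returns its ID. In this sketch, IDs are assigned sequentially from 0.
func (t *Tokenizer) AddToken(token string) int {
	if id, ok := t.tokenToID[token]; ok {
		return id
	}
	id := len(t.idToToken)
	t.tokenToID[token] = id
	t.idToToken = append(t.idToToken, token)
	return id
}

func main() {
	tok := NewTokenizer()
	fmt.Println(tok.AddToken("hello")) // 0
	fmt.Println(tok.AddToken("world")) // 1
	fmt.Println(tok.AddToken("hello")) // 0 (already in the vocabulary)
}
```

Keeping both a map and a slice gives O(1) lookup in each direction, which is the usual trade-off for a vocabulary of this kind.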

func NewTokenizer

func NewTokenizer() *Tokenizer

NewTokenizer creates a new simple Tokenizer.

func (*Tokenizer) AddToken

func (t *Tokenizer) AddToken(token string) int

AddToken adds a token to the vocabulary if it doesn't exist.

func (*Tokenizer) Decode

func (t *Tokenizer) Decode(tokenIDs []int) string

Decode converts a slice of token IDs back into a text string.

func (*Tokenizer) Encode

func (t *Tokenizer) Encode(text string) []int

Encode converts a text string into a slice of token IDs. This uses simple whitespace tokenization.

func (*Tokenizer) GetToken added in v0.3.0

func (t *Tokenizer) GetToken(id int) string

GetToken returns the token string for a given ID, or "<unk>" if not found.
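The four methods above compose into a simple encode/decode round trip. The sketch below reimplements the documented behavior under stated assumptions: `Encode` splits on whitespace via `strings.Fields` and grows the vocabulary as it goes (the docs only say "simple whitespace tokenization", so auto-adding unseen words is a guess), and `Decode` joins tokens with single spaces:

```go
package main

import (
	"fmt"
	"strings"
)

// Tokenizer is a minimal stand-in for the package's type; the real
// fields are unexported, so these names are assumptions.
type Tokenizer struct {
	tokenToID map[string]int
	idToToken []string
}

func NewTokenizer() *Tokenizer {
	return &Tokenizer{tokenToID: make(map[string]int)}
}

func (t *Tokenizer) AddToken(token string) int {
	if id, ok := t.tokenToID[token]; ok {
		return id
	}
	id := len(t.idToToken)
	t.tokenToID[token] = id
	t.idToToken = append(t.idToToken, token)
	return id
}

// Encode splits text on whitespace and maps each word to an ID,
// adding unseen words to the vocabulary (an assumption of this sketch).
func (t *Tokenizer) Encode(text string) []int {
	var ids []int
	for _, word := range strings.Fields(text) {
		ids = append(ids, t.AddToken(word))
	}
	return ids
}

// GetToken returns the token for an ID, or "<unk>" if not found.
func (t *Tokenizer) GetToken(id int) string {
	if id < 0 || id >= len(t.idToToken) {
		return "<unk>"
	}
	return t.idToToken[id]
}

// Decode joins the looked-up tokens with single spaces.
func (t *Tokenizer) Decode(tokenIDs []int) string {
	words := make([]string, len(tokenIDs))
	for i, id := range tokenIDs {
		words[i] = t.GetToken(id)
	}
	return strings.Join(words, " ")
}

func main() {
	tok := NewTokenizer()
	ids := tok.Encode("the quick brown fox")
	fmt.Println(ids)              // [0 1 2 3]
	fmt.Println(tok.Decode(ids))  // the quick brown fox
	fmt.Println(tok.GetToken(99)) // <unk>
}
```

Note that because `Decode` rejoins with single spaces, the round trip is lossy for any input with tabs, newlines, or runs of spaces, one of the limitations the overview's pointer to subword algorithms (BPE, WordPiece, SentencePiece) is getting at.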
