tokenizer

package
v0.5.0 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Mar 30, 2026 License: EUPL-1.2 Imports: 3 Imported by: 0

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func NewApproximationTokenizer

func NewApproximationTokenizer() ports.Tokenizer

NewApproximationTokenizer creates a new ApproximationTokenizer with default settings. Uses 4.0 characters per token as the default ratio.

func NewApproximationTokenizerWithRatio

func NewApproximationTokenizerWithRatio(charsPerToken float64) ports.Tokenizer

NewApproximationTokenizerWithRatio creates a new ApproximationTokenizer with a custom characters-per-token ratio. Useful for tuning accuracy for specific languages or domains.

func NewTiktokenTokenizer

func NewTiktokenTokenizer(modelName string) (ports.Tokenizer, error)

NewTiktokenTokenizer creates a new TiktokenTokenizer for the specified model. Common models: "cl100k_base" (GPT-4, GPT-3.5-turbo), "p50k_base" (Codex), "r50k_base" (GPT-3).

Types

type ApproximationTokenizer

type ApproximationTokenizer struct {
	// contains filtered or unexported fields
}

ApproximationTokenizer implements ports.Tokenizer using character-based estimation. Provides fast, approximate token counting as a fallback when exact tokenization is not available or not needed. Uses a simple heuristic: ~4 characters per token.

func (*ApproximationTokenizer) CountTokens

func (a *ApproximationTokenizer) CountTokens(text string) (int, error)

CountTokens returns an approximate token count based on character length. Formula: token_count ≈ len(text) / charsPerToken

func (*ApproximationTokenizer) CountTurnsTokens

func (a *ApproximationTokenizer) CountTurnsTokens(turns []string) (int, error)

CountTurnsTokens returns the total approximate token count across multiple conversation turns. Sums the estimated token counts for each turn.

func (*ApproximationTokenizer) IsEstimate

func (a *ApproximationTokenizer) IsEstimate() bool

IsEstimate returns true because this tokenizer produces approximate counts.

func (*ApproximationTokenizer) ModelName

func (a *ApproximationTokenizer) ModelName() string

ModelName returns the identifier for this approximation method.

type TiktokenTokenizer

type TiktokenTokenizer struct {
	// contains filtered or unexported fields
}

TiktokenTokenizer implements ports.Tokenizer using pkoukk/tiktoken-go library. Provides accurate token counting for OpenAI-compatible models.

func (*TiktokenTokenizer) CountTokens

func (t *TiktokenTokenizer) CountTokens(text string) (int, error)

CountTokens returns the exact number of tokens in the given text.

func (*TiktokenTokenizer) CountTurnsTokens

func (t *TiktokenTokenizer) CountTurnsTokens(turns []string) (int, error)

CountTurnsTokens returns the total token count across multiple conversation turns.

func (*TiktokenTokenizer) IsEstimate

func (t *TiktokenTokenizer) IsEstimate() bool

IsEstimate returns false because tiktoken provides exact counts.

func (*TiktokenTokenizer) ModelName

func (t *TiktokenTokenizer) ModelName() string

ModelName returns the tiktoken model identifier.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL