tokenizer

package
v1.0.65
Published: Dec 26, 2025 License: MIT Imports: 3 Imported by: 0

README

High-Performance Rust Tokenizer

This directory contains an optional Rust-based tokenizer implementation that provides 3-15x faster token counting compared to the pure Go tiktoken implementation.

Features

  • High Performance: 3-15x speedup over pure Go implementation
  • CGO Integration: Seamless FFI bindings via CGO
  • Opt-in: Build tag based activation (-tags=rusttokenizer)
  • Automatic Fallback: Falls back to Go implementation if Rust is not available
  • Batch Processing: Efficient batch tokenization for multiple texts
  • Thread-Safe: Safe for concurrent use
  • Zero Copy: Efficient memory handling across the FFI boundary

Performance Comparison

Benchmarks on typical workloads (operations per second):

Operation                  Go (tiktoken-go)  Rust (tiktoken-rs)  Speedup
Short text (~50 tokens)    ~50K ops/sec      ~500K ops/sec       ~10x
Medium text (~500 tokens)  ~8K ops/sec       ~80K ops/sec        ~10x
Long text (~5000 tokens)   ~800 ops/sec      ~12K ops/sec        ~15x
Batch (10 messages)        ~800 ops/sec      ~15K ops/sec        ~18x
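
The Speedup column is the Rust throughput divided by the Go throughput. The following is a small sketch that recomputes it from the approximate figures in the table; the row type and speedup helper are illustrative names, not part of the package.

```go
package main

import "fmt"

// row holds one line of the benchmark table (operations per second).
// The type is hypothetical and exists only for this example.
type row struct {
	name    string
	goOps   float64
	rustOps float64
}

// speedup returns how many times higher the Rust throughput is.
func speedup(goOps, rustOps float64) float64 {
	return rustOps / goOps
}

func main() {
	// Approximate values copied from the table above.
	rows := []row{
		{"Short text", 50_000, 500_000},
		{"Medium text", 8_000, 80_000},
		{"Long text", 800, 12_000},
		{"Batch (10 messages)", 800, 15_000},
	}
	for _, r := range rows {
		fmt.Printf("%-20s %.2fx\n", r.name, speedup(r.goOps, r.rustOps))
	}
}
```

With these inputs the batch row works out to 18.75x, which the table reports as ~18x.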

Building

# Build the Rust library first
cd internal/tokenizer/tokenizer-lib
cargo build --release

# Then build Go code with the rusttokenizer tag
cd ../../..
go build -tags=rusttokenizer ./...

Without Rust Tokenizer (Default)

go build ./...

The Go implementation is always available and will be used automatically if the Rust tokenizer is not built.

Build Script

A convenience build script is provided:

./internal/tokenizer/build.sh

This script will:

  1. Check if Rust/Cargo is installed
  2. Build the Rust library
  3. Build the Go code with the rusttokenizer tag
  4. Run tests to verify everything works

Usage

The tokenizer is used automatically through the pkg/utils package:

import "github.com/cecil-the-coder/ai-provider-kit/pkg/utils"

// Count tokens - uses fastest available implementation automatically
count, err := utils.CountTokens("Hello, world!", "gpt-4")

// Count tokens for multiple messages
messages := []types.ChatMessage{...}
total, err := utils.CountTokensFromMessages(messages, "gpt-4")

Direct Usage

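You can also use the tokenizer package directly. Because it lives under internal/, Go only allows this import from code inside the ai-provider-kit module: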

import (
    "fmt"

    "github.com/cecil-the-coder/ai-provider-kit/internal/tokenizer"
)

// Check if Rust is available
if tokenizer.IsRustAvailable() {
    fmt.Println("Using Rust tokenizer for maximum performance")
}

// Count tokens
count, err := tokenizer.CountTokens("text", "gpt-4")

// Batch counting (returns the total across all texts)
texts := []string{"text1", "text2", "text3"}
total, err := tokenizer.CountBatch(texts, "gpt-4")

Forcing a Specific Implementation

import "github.com/cecil-the-coder/ai-provider-kit/internal/tokenizer"

// Force Go implementation (useful for testing)
tokenizer.ResetGlobalCounter()
goCounter := tokenizer.ForceGoCounter()

// Force Rust implementation (panics if not available)
tokenizer.ResetGlobalCounter()
rustCounter := tokenizer.ForceRustCounter()

Running Tests

# Test with Rust tokenizer (requires Rust build)
go test -tags=rusttokenizer ./internal/tokenizer/...

# Test with Go implementation only
go test ./internal/tokenizer/...

# Run benchmarks
go test -tags=rusttokenizer -bench=. -benchmem ./internal/tokenizer/...
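
The -bench flag runs benchmark functions that repeat a piece of work b.N times. The sketch below shows that shape in a standalone program using testing.Benchmark from the standard library. The placeholderCount function is a stand-in that counts words, not tokens, and is not part of this package; the real benchmarks live in benchmark_test.go.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink keeps the result alive so the compiler cannot discard the call.
var sink int

// placeholderCount counts whitespace-separated words. It is NOT a real
// tokenizer; it only gives the benchmark something to measure.
func placeholderCount(text string) int {
	return len(strings.Fields(text))
}

// benchCount has the shape of a normal benchmark body: do the work b.N times.
func benchCount(b *testing.B, text string) {
	for i := 0; i < b.N; i++ {
		sink = placeholderCount(text)
	}
}

func main() {
	text := strings.Repeat("hello world ", 250) // 500 words
	// testing.Benchmark picks b.N automatically and returns the timing.
	res := testing.Benchmark(func(b *testing.B) { benchCount(b, text) })
	fmt.Printf("%d iterations, %d ns/op\n", res.N, res.NsPerOp())
}
```

Inside a _test.go file the same loop would sit in a function named BenchmarkXxx(b *testing.B), which is what go test -bench=. discovers.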

Requirements

For Rust Tokenizer:

  • Rust 1.70+ with Cargo
  • CGO enabled
  • Build tools (gcc/clang)

For Go Implementation (Default):

  • Go 1.24+
  • No additional requirements

Architecture

internal/tokenizer/
├── tokenizer.go          # Unified interface
├── rust.go               # CGO bindings (rusttokenizer build tag)
├── go.go                 # Pure Go fallback
├── tokenizer-lib/        # Rust library
│   ├── Cargo.toml
│   └── src/
│       └── lib.rs        # FFI implementation
├── tokenizer_test.go     # Unit tests
├── benchmark_test.go     # Benchmark tests
├── build.sh              # Build script
└── README.md             # This file

Trade-offs

Advantages of Rust Tokenizer

  1. Performance: 3-15x faster than Go implementation
  2. Efficiency: Lower memory usage and CPU overhead
  3. Scalability: Better for high-throughput scenarios
  4. Batch Processing: Specialized batch operations

Disadvantages

  1. Build Complexity: Requires Rust toolchain
  2. CGO Dependency: Adds CGO requirement
  3. Cross-compilation: More complex cross-compilation setup

Recommendation

  • Production/High-load: Use Rust tokenizer for best performance
  • Development/Testing: Go implementation is simpler and adequate
  • Minimal Dependencies: Use Go implementation

Model Support

The tokenizer supports the same models as the Go implementation:

  • GPT-4: gpt-4, gpt-4-turbo, gpt-4-32k, etc.
  • GPT-3.5: gpt-3.5-turbo, gpt-35-turbo, etc.
  • GPT-4o: gpt-4o, gpt-4o-mini
  • Claude: claude-3-opus, claude-3-sonnet, claude-3.5-sonnet, etc.
  • Embedding Models: text-embedding-ada-002, text-embedding-3-small/large
  • Code Models: code-cushman-001, code-davinci-002, etc.

Troubleshooting

Build Errors

If you see errors like:

cannot find -ltokenizer-lib

Make sure you've built the Rust library:

cd internal/tokenizer/tokenizer-lib
cargo build --release

Runtime Errors

If you see:

rust tokenizer not available

This usually has one of three causes:

  1. The build did not use -tags=rusttokenizer
  2. The Rust library wasn't built
  3. There's a CGO/linking issue

The code will automatically fall back to the Go implementation.
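
As a rough illustration of that fallback, the sketch below wraps a failing primary counter with a secondary one. Everything here is hypothetical: errRustUnavailable, countFunc, and withFallback are names made up for the example, and goCount counts words rather than tokens. The package's actual fallback logic may differ.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// errRustUnavailable mirrors the error text shown above (assumed sentinel).
var errRustUnavailable = errors.New("rust tokenizer not available")

// countFunc is the assumed shape of a token-counting function.
type countFunc func(text, model string) (int, error)

// rustCount simulates a build without the Rust library: it always fails.
func rustCount(text, model string) (int, error) {
	return 0, errRustUnavailable
}

// goCount is a placeholder that counts whitespace-separated words,
// NOT real tokens.
func goCount(text, model string) (int, error) {
	return len(strings.Fields(text)), nil
}

// withFallback tries primary first and uses fallback only when the
// primary reports that the Rust tokenizer is unavailable.
func withFallback(primary, fallback countFunc) countFunc {
	return func(text, model string) (int, error) {
		n, err := primary(text, model)
		if errors.Is(err, errRustUnavailable) {
			return fallback(text, model)
		}
		return n, err
	}
}

func main() {
	count := withFallback(rustCount, goCount)
	n, err := count("Hello, world!", "gpt-4")
	fmt.Println(n, err) // prints: 2 <nil>
}
```

Other errors from the primary counter are returned unchanged, so only the "not available" case triggers the fallback.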

Performance Not Improved

Make sure you're actually using the Rust implementation:

fmt.Println(tokenizer.GetImplementationName())

This should print something like: rust-cgo (0.1.0-rust-tiktoken-rs)

Contributing

When modifying the tokenizer:

  1. Rust changes: Update tokenizer-lib/src/lib.rs and rebuild with cargo build --release
  2. Go changes: Update appropriate files based on build tag
  3. Tests: Ensure both implementations pass tests
  4. Benchmarks: Run benchmarks to verify performance

License

Same as the parent project (MIT License).

Documentation

Overview

Package tokenizer provides a unified interface for token counting with optional Rust-backed high-performance implementation via CGO.

The Rust implementation provides a 3-15x performance improvement over the pure Go tiktoken implementation. It is opt-in via the "rusttokenizer" build tag; without the tag, the package falls back to the Go implementation automatically.

Usage:

import "github.com/cecil-the-coder/ai-provider-kit/internal/tokenizer"

// Count tokens (uses fastest available implementation)
count, err := tokenizer.CountTokens("Hello, world!", "gpt-4")

Build with Rust tokenizer:

go build -tags=rusttokenizer

Or use pure Go (always available):

go build

Functions

func CountBatch

func CountBatch(texts []string, model string) (int, error)

CountBatch counts tokens for multiple texts in a single operation. This is more efficient than calling CountTokens multiple times, especially for the Rust implementation.

Parameters:

  • texts: Slice of input texts to tokenize
  • model: The model name (e.g., "gpt-4", "claude-3", "gpt-4o")

Returns the total token count and any error that occurred.
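
The contract described above (one total for the whole slice) can be sketched as follows. This assumes the total equals the sum of the per-text counts, which the source does not state explicitly. The countTokens stand-in counts words, not tokens, and both function names are illustrative only.

```go
package main

import (
	"fmt"
	"strings"
)

// countTokens is a placeholder: whitespace-separated words, not real tokens.
func countTokens(text, model string) (int, error) {
	if model == "" {
		return 0, fmt.Errorf("model name is required")
	}
	return len(strings.Fields(text)), nil
}

// countBatch sums the per-text counts and stops at the first error.
func countBatch(texts []string, model string) (int, error) {
	total := 0
	for i, t := range texts {
		n, err := countTokens(t, model)
		if err != nil {
			return 0, fmt.Errorf("text %d: %w", i, err)
		}
		total += n
	}
	return total, nil
}

func main() {
	total, err := countBatch([]string{"one two", "three", "four five six"}, "gpt-4")
	fmt.Println(total, err) // prints: 6 <nil>
}
```

The sketch only shows the return contract. The efficiency gain of the real CountBatch would presumably come from handling the whole slice in one call rather than looping like this.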

func CountTokens

func CountTokens(text, model string) (int, error)

CountTokens counts the number of tokens in the given text for the specified model. This is the main entry point for token counting and uses the fastest available implementation automatically.

Parameters:

  • text: The input text to tokenize
  • model: The model name (e.g., "gpt-4", "claude-3", "gpt-4o")

Returns the token count and any error that occurred.

func ForceGoCounter

func ForceGoCounter() tokenCounter

ForceGoCounter forces the use of the pure Go implementation. Useful for testing or when CGO is not desired.

func ForceRustCounter

func ForceRustCounter() tokenCounter

ForceRustCounter forces the use of the Rust implementation. When Rust is not available it either returns nil or panics, so guard calls with IsRustAvailable.

func GetCounter

func GetCounter() tokenCounter

GetCounter returns the fastest available token counter implementation. It prioritizes the Rust CGO implementation if available, otherwise falls back to the pure Go implementation.

func GetImplementationName

func GetImplementationName() string

GetImplementationName returns the name of the active implementation.

func IsRustAvailable

func IsRustAvailable() bool

IsRustAvailable returns true if the Rust implementation is available.

func ResetGlobalCounter

func ResetGlobalCounter()

ResetGlobalCounter resets the global counter (mainly for testing).
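
One way a cached global counter with a reset hook could work is sketched below. This is an assumption about the general pattern, not the package's code: counter, getCounter, resetCounter, and the rustAvailable flag are hypothetical stand-ins for tokenCounter, GetCounter, ResetGlobalCounter, and IsRustAvailable.

```go
package main

import (
	"fmt"
	"sync"
)

// counter is a hypothetical stand-in for the unexported tokenCounter type.
type counter interface {
	Name() string
}

type goCounter struct{}

func (goCounter) Name() string { return "go" }

type rustCounter struct{}

func (rustCounter) Name() string { return "rust-cgo" }

var (
	mu     sync.Mutex
	cached counter
	// rustAvailable stands in for IsRustAvailable; in a real build this
	// would depend on the build tag, not on a mutable variable.
	rustAvailable = false
)

// getCounter returns the cached counter, choosing one on first use.
func getCounter() counter {
	mu.Lock()
	defer mu.Unlock()
	if cached == nil {
		if rustAvailable {
			cached = rustCounter{}
		} else {
			cached = goCounter{}
		}
	}
	return cached
}

// resetCounter clears the cache so the next getCounter call chooses again.
func resetCounter() {
	mu.Lock()
	defer mu.Unlock()
	cached = nil
}

func main() {
	fmt.Println(getCounter().Name()) // go
	rustAvailable = true
	fmt.Println(getCounter().Name()) // still go: the choice is cached
	resetCounter()
	fmt.Println(getCounter().Name()) // rust-cgo
}
```

This is why the README calls ResetGlobalCounter before forcing an implementation: without the reset, a previously cached choice would remain in effect.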
