tokenizer package

Version: v1.0.65 (latest) | Published: Dec 26, 2025 | License: MIT | Imports: 3 | Imported by: 0

Repository

github.com/cecil-the-coder/ai-provider-kit

README

High-Performance Rust Tokenizer

This directory contains an optional Rust-based tokenizer implementation that counts tokens 3-15x faster than the pure Go tiktoken implementation.

Features

High Performance: 3-15x speedup over pure Go implementation
CGO Integration: Seamless FFI bindings via CGO
Opt-in: Build tag based activation (-tags=rusttokenizer)
Automatic Fallback: Falls back to Go implementation if Rust is not available
Batch Processing: Efficient batch tokenization for multiple texts
Thread-Safe: Safe for concurrent use
Zero Copy: Efficient memory handling across the FFI boundary
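As a rough illustration of what "safe for concurrent use" allows, the sketch below fans token counting out across goroutines and adds up the results. The `countTokens` function here is a hypothetical whitespace-based stand-in so the example runs on its own; inside the module, `tokenizer.CountTokens` would take its place.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
	"sync/atomic"
)

// countTokens is a hypothetical stand-in for tokenizer.CountTokens.
// It counts whitespace-separated words, not real model tokens.
func countTokens(text, model string) (int, error) {
	return len(strings.Fields(text)), nil
}

// countConcurrently counts each text in its own goroutine and returns
// the sum of all counts.
func countConcurrently(texts []string, model string) int64 {
	var (
		wg    sync.WaitGroup
		total atomic.Int64
	)
	for _, t := range texts {
		wg.Add(1)
		go func(text string) {
			defer wg.Done()
			n, err := countTokens(text, model)
			if err != nil {
				return // a real caller would collect or report the error
			}
			total.Add(int64(n))
		}(t)
	}
	wg.Wait()
	return total.Load()
}

func main() {
	texts := []string{"a b c", "d e", "f"}
	fmt.Println(countConcurrently(texts, "gpt-4")) // 6
}
```

For many small texts, a single batch call may be a better fit than one goroutine per text, since it crosses the FFI boundary fewer times.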

Performance Comparison

Benchmarks on typical workloads (operations per second):

Operation	Go (tiktoken-go)	Rust (tiktoken-rs)	Speedup
Short text (~50 tokens)	~50K ops/sec	~500K ops/sec	~10x
Medium text (~500 tokens)	~8K ops/sec	~80K ops/sec	~10x
Long text (~5000 tokens)	~800 ops/sec	~12K ops/sec	~15x
Batch (10 messages)	~800 ops/sec	~15K ops/sec	~18x
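The speedup column is the ratio of the two throughput figures in each row. A minimal sketch that recomputes it from the approximate values in the table (the `row` type and `speedup` helper are illustrative and not part of the package):

```go
package main

import "fmt"

// row holds one line of the benchmark table (approximate ops/sec).
type row struct {
	name    string
	goOps   float64
	rustOps float64
}

// speedup returns how many times higher the Rust throughput is
// than the Go throughput.
func speedup(goOps, rustOps float64) float64 {
	return rustOps / goOps
}

func main() {
	rows := []row{
		{"Short text", 50_000, 500_000},
		{"Medium text", 8_000, 80_000},
		{"Long text", 800, 12_000},
		{"Batch (10 messages)", 800, 15_000},
	}
	for _, r := range rows {
		fmt.Printf("%-20s %.2fx\n", r.name, speedup(r.goOps, r.rustOps))
	}
}
```

With these inputs the batch row works out to 15000 / 800 = 18.75.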

Building

With Rust Tokenizer (Recommended for production)

# Build the Rust library first
cd internal/tokenizer/tokenizer-lib
cargo build --release

# Then build Go code with the rusttokenizer tag
cd ../../..
go build -tags=rusttokenizer ./...

Without Rust Tokenizer (Default)

go build ./...

The Go implementation is always available and will be used automatically if the Rust tokenizer is not built.

Build Script

A convenience build script is provided:

./internal/tokenizer/build.sh

This script will:

Check if Rust/Cargo is installed
Build the Rust library
Build the Go code with the rusttokenizer tag
Run tests to verify everything works

Usage

The tokenizer is used automatically through the pkg/utils package:

import "github.com/cecil-the-coder/ai-provider-kit/pkg/utils"

// Count tokens - uses fastest available implementation automatically
count, err := utils.CountTokens("Hello, world!", "gpt-4")

// Count tokens for multiple messages
messages := []types.ChatMessage{...}
count, err = utils.CountTokensFromMessages(messages, "gpt-4") // reuse count and err; a second := in the same scope would not compile

Direct Usage

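You can also use the tokenizer package directly. Because it lives under internal/, Go only allows it to be imported from code inside the ai-provider-kit module: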

import "github.com/cecil-the-coder/ai-provider-kit/internal/tokenizer"

// Check if Rust is available
if tokenizer.IsRustAvailable() {
    fmt.Println("Using Rust tokenizer for maximum performance")
}

// Count tokens
count, err := tokenizer.CountTokens("text", "gpt-4")

// Batch counting
texts := []string{"text1", "text2", "text3"}
count, err = tokenizer.CountBatch(texts, "gpt-4") // reuse count and err declared above

Forcing a Specific Implementation

import "github.com/cecil-the-coder/ai-provider-kit/internal/tokenizer"

// Force Go implementation (useful for testing)
tokenizer.ResetGlobalCounter()
goCounter := tokenizer.ForceGoCounter()

// Force Rust implementation (panics if not available)
tokenizer.ResetGlobalCounter()
rustCounter := tokenizer.ForceRustCounter()

Running Tests

# Test with Rust tokenizer (requires Rust build)
go test -tags=rusttokenizer ./internal/tokenizer/...

# Test with Go implementation only
go test ./internal/tokenizer/...

# Run benchmarks
go test -tags=rusttokenizer -bench=. -benchmem ./internal/tokenizer/...
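To relate benchmark output to the ops/sec figures quoted earlier, the sketch below runs a benchmark programmatically with `testing.Benchmark` and converts the result to operations per second. The `countTokens` function is a hypothetical stand-in (whitespace splitting), so the numbers it prints say nothing about either real implementation; the repository's own benchmarks live in benchmark_test.go.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
	"time"
)

// sink keeps the result alive so the call is not optimized away.
var sink int

// countTokens is a hypothetical stand-in for a real token counter.
func countTokens(text string) int {
	return len(strings.Fields(text))
}

// opsPerSec converts an iteration count and elapsed time to ops/sec.
func opsPerSec(n int, elapsed time.Duration) float64 {
	return float64(n) / elapsed.Seconds()
}

func main() {
	text := strings.Repeat("token ", 500)

	// testing.Benchmark picks b.N automatically, as go test -bench does.
	res := testing.Benchmark(func(b *testing.B) {
		for i := 0; i < b.N; i++ {
			sink = countTokens(text)
		}
	})

	fmt.Printf("%d iterations in %s (%.0f ops/sec)\n",
		res.N, res.T, opsPerSec(res.N, res.T))
}
```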

Requirements

For Rust Tokenizer:

Rust 1.70+ with Cargo
CGO enabled
Build tools (gcc/clang)

For Go Implementation (Default):

Go 1.24+
No additional requirements

Architecture

internal/tokenizer/
├── tokenizer.go          # Unified interface
├── rust.go               # CGO bindings (rusttokenizer build tag)
├── go.go                 # Pure Go fallback
├── tokenizer-lib/        # Rust library
│   ├── Cargo.toml
│   └── src/
│       └── lib.rs        # FFI implementation
├── tokenizer_test.go     # Unit tests
├── benchmark_test.go     # Benchmark tests
├── build.sh              # Build script
└── README.md             # This file

Trade-offs

Advantages of Rust Tokenizer

Performance: 3-15x faster than Go implementation
Efficiency: Lower memory usage and CPU overhead
Scalability: Better for high-throughput scenarios
Batch Processing: Specialized batch operations

Disadvantages

Build Complexity: Requires Rust toolchain
CGO Dependency: Adds CGO requirement
Cross-compilation: More complex cross-compilation setup

Recommendation

Production/High-load: Use Rust tokenizer for best performance
Development/Testing: Go implementation is simpler and adequate
Minimal Dependencies: Use Go implementation

Model Support

The Rust tokenizer supports the same models as the Go implementation:

GPT-4: gpt-4, gpt-4-turbo, gpt-4-32k, etc.
GPT-3.5: gpt-3.5-turbo, gpt-35-turbo, etc.
GPT-4o: gpt-4o, gpt-4o-mini
Claude: claude-3-opus, claude-3-sonnet, claude-3.5-sonnet, etc.
Embedding Models: text-embedding-ada-002, text-embedding-3-small/large
Code Models: code-cushman-001, code-davinci-002, etc.

Troubleshooting

Build Errors

If you see errors like:

cannot find -ltokenizer-lib

Make sure you've built the Rust library:

cd internal/tokenizer/tokenizer-lib
cargo build --release

Runtime Errors

If you see:

rust tokenizer not available

Possible causes:

The binary was not built with -tags=rusttokenizer
The Rust library was not built
There is a CGO or linking issue

The code will automatically fall back to the Go implementation.

Performance Not Improved

Make sure you're actually using the Rust implementation:

fmt.Println(tokenizer.GetImplementationName())

This should print something like: rust-cgo (0.1.0-rust-tiktoken-rs)

Contributing

When modifying the tokenizer:

Rust changes: Update tokenizer-lib/src/lib.rs and rebuild with cargo build --release
Go changes: Update the file that matches the build tag (rust.go for rusttokenizer builds, go.go for the pure Go fallback)
Tests: Ensure both implementations pass tests
Benchmarks: Run benchmarks to verify performance

License

Same as the parent project (MIT License).

Documentation

Overview

Package tokenizer provides a unified interface for token counting with optional Rust-backed high-performance implementation via CGO.

The Rust implementation is 3-15x faster than the pure Go tiktoken implementation. It is opt-in via the "rusttokenizer" build tag; without that tag, the package falls back to the Go implementation automatically.

Usage:

import "github.com/cecil-the-coder/ai-provider-kit/internal/tokenizer"

// Count tokens (uses fastest available implementation)
count, err := tokenizer.CountTokens("Hello, world!", "gpt-4")

Build with Rust tokenizer:

go build -tags=rusttokenizer

Or use pure Go (always available):

go build

Index

func CountBatch(texts []string, model string) (int, error)
func CountTokens(text, model string) (int, error)
func ForceGoCounter() tokenCounter
func ForceRustCounter() tokenCounter
func GetCounter() tokenCounter
func GetImplementationName() string
func IsRustAvailable() bool
func ResetGlobalCounter()

Functions

func CountBatch

func CountBatch(texts []string, model string) (int, error)

CountBatch counts tokens for multiple texts in a single operation. This is more efficient than calling CountTokens multiple times, especially for the Rust implementation.

Parameters:

texts: Slice of input texts to tokenize
model: The model name (e.g., "gpt-4", "claude-3", "gpt-4o")

Returns the total token count and any error that occurred.
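As a rough illustration of the contract described above (one total for the whole slice), the sketch below sums per-text counts. The `countFunc` type, the `countBatch` helper, and the whitespace-based `wordCount` stand-in are all hypothetical. The real package uses a model-specific tokenizer, and its Rust implementation is described as handling the batch more efficiently than a loop like this one.

```go
package main

import (
	"fmt"
	"strings"
)

// countFunc mirrors the shape of CountTokens(text, model).
type countFunc func(text, model string) (int, error)

// wordCount is a stand-in counter: whitespace-separated words,
// not real model tokens. The model argument is ignored.
func wordCount(text, model string) (int, error) {
	return len(strings.Fields(text)), nil
}

// countBatch returns the total count across all texts and stops
// at the first error.
func countBatch(count countFunc, texts []string, model string) (int, error) {
	total := 0
	for i, t := range texts {
		n, err := count(t, model)
		if err != nil {
			return 0, fmt.Errorf("text %d: %w", i, err)
		}
		total += n
	}
	return total, nil
}

func main() {
	texts := []string{"hello world", "one two three"}
	total, err := countBatch(wordCount, texts, "gpt-4")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(total) // 5
}
```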

func CountTokens

func CountTokens(text, model string) (int, error)

CountTokens counts the number of tokens in the given text for the specified model. This is the main entry point for token counting and uses the fastest available implementation automatically.

Parameters:

text: The input text to tokenize
model: The model name (e.g., "gpt-4", "claude-3", "gpt-4o")

Returns the token count and any error that occurred.

func ForceGoCounter

func ForceGoCounter() tokenCounter

ForceGoCounter forces the use of the pure Go implementation. Useful for testing or when CGO is not desired.

func ForceRustCounter

func ForceRustCounter() tokenCounter

ForceRustCounter forces the use of the Rust implementation. If Rust is not available, the call does not yield a usable counter (it is documented as returning nil or panicking), so check IsRustAvailable first.

func GetCounter

func GetCounter() tokenCounter

GetCounter returns the fastest available token counter implementation. It prioritizes the Rust CGO implementation if available, otherwise falls back to the pure Go implementation.
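The selection and reset behavior can be pictured with a small sketch. Everything in it is an assumption made for illustration: the `counter` interface stands in for the package's unexported tokenCounter, `rustAvailable` is a plain variable (in the real package availability is decided by the rusttokenizer build tag), and caching the choice in a mutex-guarded global is only a guess suggested by the existence of ResetGlobalCounter.

```go
package main

import (
	"fmt"
	"sync"
)

// counter is a hypothetical stand-in for the unexported tokenCounter.
type counter interface {
	Name() string
}

type goCounter struct{}

func (goCounter) Name() string { return "go" }

type rustCounter struct{}

func (rustCounter) Name() string { return "rust-cgo" }

// rustAvailable is a plain variable here so the sketch can toggle it.
var rustAvailable = false

var (
	mu     sync.Mutex
	global counter // cached choice; nil until first use
)

// getCounter picks the Rust counter when available, otherwise the Go
// one, and caches the choice until resetGlobalCounter is called.
func getCounter() counter {
	mu.Lock()
	defer mu.Unlock()
	if global == nil {
		if rustAvailable {
			global = rustCounter{}
		} else {
			global = goCounter{}
		}
	}
	return global
}

// resetGlobalCounter clears the cached choice.
func resetGlobalCounter() {
	mu.Lock()
	defer mu.Unlock()
	global = nil
}

func main() {
	fmt.Println(getCounter().Name()) // go
	rustAvailable = true
	fmt.Println(getCounter().Name()) // go (choice is cached)
	resetGlobalCounter()
	fmt.Println(getCounter().Name()) // rust-cgo
}
```

Under this assumed design, a reset is what makes a later selection take effect, which would explain why the forcing examples call ResetGlobalCounter first.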

func GetImplementationName

func GetImplementationName() string

GetImplementationName returns the name of the active implementation.

func IsRustAvailable

func IsRustAvailable() bool

IsRustAvailable returns true if the Rust implementation is available.

func ResetGlobalCounter

func ResetGlobalCounter()

ResetGlobalCounter resets the global counter (mainly for testing).
