tokenizer

package

v0.4.1 (latest) | Published: Aug 19, 2025 | License: Apache-2.0 | Imports: 14 | Imported by: 0

Repository

github.com/vllm-project/aibrix

README

Tokenizer Package

The tokenizer package provides a unified interface for text tokenization in AIBrix, supporting both local and remote tokenization implementations. It is designed to work with LLM inference engines such as vLLM and SGLang.

Overview

This package implements a flexible tokenizer architecture that:

Provides a common interface for different tokenization backends
Supports local tokenizers (tiktoken, character-based)
Supports remote tokenizers via HTTP API (vLLM, SGLang, etc.)
Follows Go idioms with a minimal public API surface
Uses type assertions for advanced features (progressive disclosure pattern)

Architecture

Core Design Principles

Minimal Public API: Tokenizer is the only exported interface; the remaining exports are constructors, configuration and data types, and error types
Progressive Disclosure: Basic usage is simple, advanced features available via type assertions
Type Safety: Compile-time checks with Go's type system
Extensibility: Easy to add new tokenizer implementations
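
The progressive disclosure principle above can be illustrated with a small standalone sketch. The Tokenizer interface is redeclared locally so the example compiles on its own; the detokenizer capability and both implementations are hypothetical and are not part of the package.

```go
package main

import "fmt"

// Tokenizer mirrors the package's single-method interface. It is redeclared
// here only so that this sketch is self-contained.
type Tokenizer interface {
	TokenizeInputText(string) ([]byte, error)
}

// detokenizer is a hypothetical optional capability used for illustration.
type detokenizer interface {
	Detokenize(tokens []byte) (string, error)
}

// basicTokenizer supports only the minimal interface.
type basicTokenizer struct{}

func (basicTokenizer) TokenizeInputText(s string) ([]byte, error) { return []byte(s), nil }

// richTokenizer embeds basicTokenizer and adds the optional capability.
type richTokenizer struct{ basicTokenizer }

func (richTokenizer) Detokenize(tokens []byte) (string, error) { return string(tokens), nil }

// roundTrip tokenizes text and, if the tokenizer also supports Detokenize,
// converts the tokens back. The bool reports whether the capability exists.
func roundTrip(tok Tokenizer, text string) (string, bool) {
	tokens, err := tok.TokenizeInputText(text)
	if err != nil {
		return "", false
	}
	d, ok := tok.(detokenizer) // type assertion: ask for the advanced feature
	if !ok {
		return "", false
	}
	out, err := d.Detokenize(tokens)
	if err != nil {
		return "", false
	}
	return out, true
}

func main() {
	for _, tok := range []Tokenizer{basicTokenizer{}, richTokenizer{}} {
		out, ok := roundTrip(tok, "hi")
		fmt.Printf("%T: supported=%v out=%q\n", tok, ok, out)
	}
}
```

Callers that only need basic tokenization never see the optional interface, while callers that need more can probe for it at run time and fall back gracefully.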

Interface Hierarchy

Tokenizer (public interface)
    ├── Local implementations
    │   ├── TiktokenTokenizer
    │   └── CharacterTokenizer
    └── Remote implementation
        └── remoteTokenizerImpl (supports advanced features internally)

Quick Start

Basic Usage

import (
    "time"

    "github.com/vllm-project/aibrix/pkg/utils/tokenizer"
)

// Error checks are omitted below for brevity; check err after every call.

// Option 1: Using the factory function
tok1, err := tokenizer.NewTokenizer("tiktoken", nil)
tokens, err := tok1.TokenizeInputText("Hello, world!")

// Option 2: Using direct constructors
// Create a local tiktoken tokenizer
tok2 := tokenizer.NewTiktokenTokenizer()
tokens, err = tok2.TokenizeInputText("Hello, world!")

// Create a character tokenizer
tok3 := tokenizer.NewCharacterTokenizer()
tokens, err = tok3.TokenizeInputText("Hello, world!")

// Create a remote tokenizer
config := tokenizer.RemoteTokenizerConfig{
    Engine:   "vllm",
    Endpoint: "http://vllm-service:8000",
    Model:    "meta-llama/Llama-2-7b-hf",
    Timeout:  30 * time.Second,
}
tok4, err := tokenizer.NewRemoteTokenizer(config)
tokens, err = tok4.TokenizeInputText("Hello, world!")

// Using factory function for remote tokenizer
tok5, err := tokenizer.NewTokenizer("remote", config)
tokens, err = tok5.TokenizeInputText("Hello, world!")

Advanced Features with Type Assertions

Since advanced interfaces are not exported, use type assertions to access extended functionality:

import (
    "context"
    "fmt"
    "log"
    "time"

    "github.com/vllm-project/aibrix/pkg/utils/tokenizer"
)

// Create a remote tokenizer
config := tokenizer.RemoteTokenizerConfig{
    Engine:             "vllm",
    Endpoint:           "http://vllm-service:8000",
    Model:              "meta-llama/Llama-2-7b-hf",
    Timeout:            30 * time.Second,
    MaxRetries:         3,
    AddSpecialTokens:   true,
    ReturnTokenStrings: true,
}

tok, err := tokenizer.NewRemoteTokenizer(config)
if err != nil {
    log.Fatal(err)
}

// Basic tokenization (always available)
tokens, err := tok.TokenizeInputText("Hello, world!")

// Advanced tokenization features
if extended, ok := tok.(interface {
    TokenizeWithOptions(context.Context, tokenizer.TokenizeInput) (*tokenizer.TokenizeResult, error)
    Detokenize(context.Context, []int) (string, error)
}); ok {
    ctx := context.Background()
    
    // Tokenize with options
    input := tokenizer.TokenizeInput{
        Type:                tokenizer.CompletionInput,
        Text:                "Hello, world!",
        AddSpecialTokens:    true,
        ReturnTokenStrings:  true,
    }
    result, err := extended.TokenizeWithOptions(ctx, input)
    if err != nil {
        log.Fatal(err)
    }
    
    fmt.Printf("Token count: %d\n", result.Count)
    fmt.Printf("Tokens: %v\n", result.Tokens)
    fmt.Printf("Token strings: %v\n", result.TokenStrings)
    
    // Detokenize
    text, err := extended.Detokenize(ctx, result.Tokens)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("Detokenized: %s\n", text)
}

// Remote-specific features
if remote, ok := tok.(interface {
    IsHealthy(context.Context) bool
    GetEndpoint() string
    Close() error
}); ok {
    ctx := context.Background()
    
    // Check health
    if remote.IsHealthy(ctx) {
        fmt.Println("Tokenizer service is healthy")
    }
    
    // Get endpoint
    fmt.Printf("Using endpoint: %s\n", remote.GetEndpoint())
    
    // Clean up
    defer remote.Close()
}

Chat Tokenization

// Chat tokenization requires type assertion to access advanced features
if extended, ok := tok.(interface {
    TokenizeWithOptions(context.Context, tokenizer.TokenizeInput) (*tokenizer.TokenizeResult, error)
}); ok {
    ctx := context.Background()
    
    chatInput := tokenizer.TokenizeInput{
        Type: tokenizer.ChatInput,
        Messages: []tokenizer.ChatMessage{
            {Role: "system", Content: "You are a helpful assistant."},
            {Role: "user", Content: "What is the weather today?"},
        },
        AddGenerationPrompt: true,
    }
    
    result, err := extended.TokenizeWithOptions(ctx, chatInput)
    if err != nil {
        log.Fatal(err)
    }
    fmt.Printf("Token count: %d\n", result.Count)
}

File Structure

tokenizer/
├── Core Components
│   ├── interfaces.go          # Interface definitions (only Tokenizer exported)
│   ├── types.go              # Type definitions
│   ├── errors.go             # Error types
│   ├── utils.go              # Shared utilities
│   └── tokenizer.go          # Main factory function
│
├── Local Implementations
│   ├── local_tiktoken.go     # Tiktoken tokenizer with constructor
│   └── local_characters.go   # Character tokenizer with constructor
│
├── Remote Implementation
│   ├── remote_tokenizer.go   # Generic remote tokenizer
│   └── remote_client.go      # HTTP client with retry logic
│
├── Engine Adapters
│   ├── adapter_vllm.go       # vLLM adapter (internal)
│   └── adapter_sglang.go     # SGLang adapter (internal)
│
└── Tests
    └── remote_client_test.go

Supported Tokenizers

Local Tokenizers

Tiktoken (tiktoken)
- Uses OpenAI's tiktoken library
- Encoding: cl100k_base
- Suitable for GPT models

Character (character)
- Simple byte-based tokenization
- Useful for testing and special cases

Remote Tokenizers

The package currently supports the following remote tokenizer engines:

vLLM (vllm)
- Full support for tokenization and detokenization
- Chat template support
- Special tokens handling
- Multimodal support (via mm_processor_kwargs)

SGLang (sglang)
- Limited support (no tokenize/detokenize endpoints)
- As of 2025-07-18, SGLang Runtime's HTTP service does not expose /tokenize or /detokenize endpoints
- Official documentation only lists endpoints like /generate, /encode, /v1/chat/completions, /v1/embeddings, etc.
- Community has requested tokenizer endpoints (Issues #5653 and #7068) for RL/Agent frameworks, but these remain in discussion/planning phase
- Workarounds if you need tokenization with SGLang:
  - Client-side tokenization: Use HuggingFace Tokenizer or similar tools to convert text to input_ids before calling SGLang endpoints
  - Skip server tokenization: Start SGLang with --skip-tokenizer-init flag and pass input_ids arrays directly in requests
- The adapter is included as a placeholder for future implementation when/if SGLang adds tokenizer endpoints
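
As a rough sketch of the second workaround above, the following builds a JSON body that carries pre-tokenized input_ids for SGLang's /generate endpoint. The struct, the helper name, the sampling parameter, and the token IDs are illustrative assumptions; verify the exact request schema against the SGLang documentation for your version. No request is sent.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// generateRequest models only the subset of a /generate request body used
// in this sketch. The field set is an assumption for illustration.
type generateRequest struct {
	InputIDs       []int          `json:"input_ids"`
	SamplingParams map[string]any `json:"sampling_params,omitempty"`
}

// buildGenerateBody serializes token IDs produced by a client-side tokenizer.
func buildGenerateBody(inputIDs []int, maxNewTokens int) ([]byte, error) {
	if len(inputIDs) == 0 {
		return nil, fmt.Errorf("input_ids must not be empty")
	}
	return json.Marshal(generateRequest{
		InputIDs:       inputIDs,
		SamplingParams: map[string]any{"max_new_tokens": maxNewTokens},
	})
}

func main() {
	// Placeholder token IDs; real values come from the model's own tokenizer.
	body, err := buildGenerateBody([]int{1, 2, 3}, 16)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(body))
}
```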

Extending the Package

Adding a New Local Tokenizer

Create a new tokenizer file (e.g., local_custom.go):

package tokenizer

type CustomTokenizer struct {
    // fields
}

func (t *CustomTokenizer) TokenizeInputText(text string) ([]byte, error) {
    // Implementation
    tokens := tokenizeCustom(text) // Your tokenization logic
    return intToByteArray(tokens), nil
}

func NewCustomTokenizer() Tokenizer {
    return &CustomTokenizer{}
}

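Because Go interfaces are satisfied implicitly, any type with a TokenizeInputText(string) ([]byte, error) method implements the Tokenizer interface; no explicit declaration is needed.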
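
A fuller, self-contained version of such a tokenizer might look like the sketch below. WhitespaceTokenizer is hypothetical, Tokenizer is redeclared locally, and intToByteArray is a stand-in for the package's internal helper: its 4-byte big-endian layout is an assumption, not the package's documented encoding.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"strings"
)

// Tokenizer is redeclared locally so the sketch compiles on its own.
type Tokenizer interface {
	TokenizeInputText(string) ([]byte, error)
}

// intToByteArray is a stand-in for the package helper of the same name.
// Assumption: each ID is written as 4 big-endian bytes.
func intToByteArray(ids []int) []byte {
	out := make([]byte, 4*len(ids))
	for i, id := range ids {
		binary.BigEndian.PutUint32(out[4*i:], uint32(id))
	}
	return out
}

// WhitespaceTokenizer is a hypothetical tokenizer that assigns each
// whitespace-separated word an ID in order of first appearance.
// It is not safe for concurrent use.
type WhitespaceTokenizer struct {
	vocab map[string]int
}

// Compile-time check that the type satisfies the interface.
var _ Tokenizer = (*WhitespaceTokenizer)(nil)

func NewWhitespaceTokenizer() Tokenizer {
	return &WhitespaceTokenizer{vocab: make(map[string]int)}
}

func (t *WhitespaceTokenizer) TokenizeInputText(text string) ([]byte, error) {
	words := strings.Fields(text)
	ids := make([]int, len(words))
	for i, w := range words {
		id, ok := t.vocab[w]
		if !ok {
			id = len(t.vocab) // next unused ID
			t.vocab[w] = id
		}
		ids[i] = id
	}
	return intToByteArray(ids), nil
}

func main() {
	tok := NewWhitespaceTokenizer()
	b, err := tok.TokenizeInputText("to be or not to be")
	if err != nil {
		panic(err)
	}
	fmt.Println("token count:", len(b)/4)
}
```

The `var _ Tokenizer = ...` line costs nothing at run time and turns a missing or mistyped method into a compile error.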

API Reference

Public Interface

Tokenizer

The only exported interface, providing basic tokenization:

type Tokenizer interface {
    TokenizeInputText(string) ([]byte, error)
}

Factory Function

The main factory function for creating tokenizer instances:

func NewTokenizer(tokenizerType string, config interface{}) (Tokenizer, error)

Supported tokenizer types:

"tiktoken" - OpenAI's tiktoken tokenizer (config: nil)
"character" - Simple character-based tokenizer (config: nil)
"remote" - Remote tokenizer via HTTP API (config: RemoteTokenizerConfig)

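Since the factory takes config as interface{}, a mismatched config type can only be detected at run time. The sketch below shows one way such a factory could dispatch and validate its arguments; newTokenizer, stubTokenizer, and the trimmed RemoteTokenizerConfig are local stand-ins, and the package's actual validation rules may differ.

```go
package main

import (
	"fmt"
	"time"
)

// Local stand-ins so the sketch is self-contained.
type Tokenizer interface {
	TokenizeInputText(string) ([]byte, error)
}

type RemoteTokenizerConfig struct {
	Engine   string
	Endpoint string
	Timeout  time.Duration
}

// stubTokenizer stands in for every concrete implementation.
type stubTokenizer struct{ kind string }

func (s stubTokenizer) TokenizeInputText(text string) ([]byte, error) {
	return []byte(text), nil
}

// newTokenizer is a hypothetical factory with the same signature shape as
// NewTokenizer. It validates config with a type assertion.
func newTokenizer(tokenizerType string, config interface{}) (Tokenizer, error) {
	switch tokenizerType {
	case "tiktoken", "character":
		return stubTokenizer{kind: tokenizerType}, nil
	case "remote":
		cfg, ok := config.(RemoteTokenizerConfig)
		if !ok {
			return nil, fmt.Errorf("remote tokenizer requires RemoteTokenizerConfig, got %T", config)
		}
		if cfg.Endpoint == "" {
			return nil, fmt.Errorf("remote tokenizer requires a non-empty Endpoint")
		}
		return stubTokenizer{kind: "remote:" + cfg.Engine}, nil
	default:
		return nil, fmt.Errorf("unsupported tokenizer type %q", tokenizerType)
	}
}

func main() {
	_, err := newTokenizer("remote", nil)
	fmt.Println("nil config:", err)

	cfg := RemoteTokenizerConfig{Engine: "vllm", Endpoint: "http://vllm-service:8000", Timeout: 30 * time.Second}
	_, err = newTokenizer("remote", cfg)
	fmt.Println("valid config:", err)
}
```

In this sketch a pointer (&cfg) would fail the value-type assertion. Whether the real factory accepts pointers is not stated in this README, so passing the config by value, as the Quick Start does, is the safe choice.
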
Advanced Features via Type Assertions

While not exported, advanced features are available through type assertions:

// Extended tokenization features
interface {
    TokenizeWithOptions(ctx context.Context, input TokenizeInput) (*TokenizeResult, error)
    Detokenize(ctx context.Context, tokens []int) (string, error)
}

// Remote-specific features
interface {
    IsHealthy(ctx context.Context) bool
    GetEndpoint() string
    Close() error
}

Key Types

TokenizeInput

type TokenizeInput struct {
    Type                TokenizeInputType // "completion" or "chat"
    Text                string            // For completion input
    Messages            []ChatMessage     // For chat input
    AddSpecialTokens    bool              // Add special tokens
    ReturnTokenStrings  bool              // Return token strings
    AddGenerationPrompt bool              // For chat only
}

TokenizeResult

type TokenizeResult struct {
    Count        int      // Number of tokens
    MaxModelLen  int      // Maximum model length
    Tokens       []int    // Token IDs
    TokenStrings []string // Token strings (optional)
}

ChatMessage

type ChatMessage struct {
    Role    string // "system", "user", "assistant"
    Content string // Message content
}

Configuration

RemoteTokenizerConfig

type RemoteTokenizerConfig struct {
    Engine             string        // "vllm", "sglang", etc.
    Endpoint           string        // Base URL
    Model              string        // Model name (optional)
    Timeout            time.Duration // Request timeout
    MaxRetries         int           // Retry attempts
    AddSpecialTokens   bool          // Default: true
    ReturnTokenStrings bool          // Default: false
}

Error Handling

The package defines several error types for better error handling:

ErrInvalidConfig: Configuration validation errors
ErrTokenizationFailed: Tokenization failures
ErrDetokenizationFailed: Detokenization failures
ErrUnsupportedOperation: Operation not supported by engine
ErrHTTPRequest: HTTP request errors

Example:

// tok is a Tokenizer created earlier, e.g. with tokenizer.NewRemoteTokenizer
tokens, err := tok.TokenizeInputText("text")
if err != nil {
    switch e := err.(type) {
    case tokenizer.ErrUnsupportedOperation:
        log.Printf("Operation %s not supported by %s", e.Operation, e.Engine)
    case tokenizer.ErrHTTPRequest:
        log.Printf("HTTP error %d: %s", e.StatusCode, e.Message)
    default:
        log.Printf("Tokenization error: %v", err)
    }
}
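
A type switch only inspects the outermost error. Because ErrTokenizationFailed and ErrDetokenizationFailed expose Unwrap, errors.As can also find a cause further down the chain. The sketch below uses local stand-ins that mirror the documented fields; whether the package actually wraps an ErrHTTPRequest inside ErrTokenizationFailed is an assumption made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Local stand-ins mirroring the documented error types.
type ErrHTTPRequest struct {
	StatusCode int
	Message    string
	Body       string
}

func (e ErrHTTPRequest) Error() string {
	return fmt.Sprintf("HTTP %d: %s", e.StatusCode, e.Message)
}

type ErrTokenizationFailed struct {
	Message string
	Cause   error
}

func (e ErrTokenizationFailed) Error() string { return e.Message }
func (e ErrTokenizationFailed) Unwrap() error { return e.Cause }

// describe classifies err, looking through wrapped causes via errors.As.
func describe(err error) string {
	var httpErr ErrHTTPRequest
	if errors.As(err, &httpErr) {
		return fmt.Sprintf("http status %d", httpErr.StatusCode)
	}
	var tokErr ErrTokenizationFailed
	if errors.As(err, &tokErr) {
		return "tokenization failed: " + tokErr.Message
	}
	return "other: " + err.Error()
}

func main() {
	err := ErrTokenizationFailed{
		Message: "request failed",
		Cause:   ErrHTTPRequest{StatusCode: 503, Message: "service unavailable"},
	}
	fmt.Println(describe(err))
}
```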

Performance Considerations

Connection Pooling: The HTTP client uses connection pooling for better performance
Retry Logic:
- Automatic retry with exponential backoff for transient failures
- Respects Retry-After header for rate-limited requests (HTTP 429)
- Configurable number of retry attempts
- Retries only on status codes 408, 429, 500, 502, 503, and 504
Timeout Configuration: Configure appropriate timeouts based on your use case
Local vs Remote: Local tokenizers are faster but may not match the model's exact tokenization
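
The retry rules listed above can be sketched as two small helpers. This is not the package's client code: the function names, the base delay, and the cap are assumptions, jitter is left out, and only the delta-seconds form of Retry-After is parsed (the HTTP-date form is ignored).

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// Assumed defaults for this sketch; the package's real values are not documented here.
const (
	defaultBase = 100 * time.Millisecond
	defaultMax  = 2 * time.Second
)

// isRetryable reports whether a status code is in the documented retry set.
func isRetryable(status int) bool {
	switch status {
	case http.StatusRequestTimeout, http.StatusTooManyRequests,
		http.StatusInternalServerError, http.StatusBadGateway,
		http.StatusServiceUnavailable, http.StatusGatewayTimeout:
		return true
	}
	return false
}

// retryDelay returns the wait before the next attempt. A valid Retry-After
// value in seconds wins; otherwise the delay is base*2^attempt, capped at max.
func retryDelay(attempt int, retryAfter string, base, max time.Duration) time.Duration {
	if secs, err := strconv.Atoi(retryAfter); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second
	}
	d := base
	for i := 0; i < attempt && d < max; i++ {
		d *= 2
	}
	if d > max {
		d = max
	}
	return d
}

func main() {
	for attempt := 0; attempt < 6; attempt++ {
		fmt.Println(attempt, retryDelay(attempt, "", defaultBase, defaultMax))
	}
	fmt.Println(isRetryable(http.StatusTooManyRequests), retryDelay(0, "5", defaultBase, defaultMax))
}
```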

Testing

Run tests with:

go test ./pkg/utils/tokenizer/...

The package includes comprehensive tests for:

Basic tokenization functionality
Remote tokenizer with mock servers
Error handling scenarios
Type assertion patterns
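
For readers unfamiliar with the mock-server approach, here is a minimal sketch using net/http/httptest from the standard library. The handler, the fetchTokens helper, the request body shape, and the canned values are hypothetical; the response fields follow the JSON tags documented for TokenizeResult. It does not reproduce the package's own tests.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// TokenizeResult is a local copy of the documented response fields.
type TokenizeResult struct {
	Count       int   `json:"count"`
	MaxModelLen int   `json:"max_model_len"`
	Tokens      []int `json:"tokens"`
}

// newMockTokenizeServer starts a local server whose /tokenize handler
// always answers with the same canned result.
func newMockTokenizeServer() *httptest.Server {
	mux := http.NewServeMux()
	mux.HandleFunc("/tokenize", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(TokenizeResult{Count: 3, MaxModelLen: 4096, Tokens: []int{1, 2, 3}})
	})
	return httptest.NewServer(mux)
}

// fetchTokens posts a prompt to endpoint+"/tokenize" and decodes the reply.
// The {"prompt": ...} body is an assumed request shape.
func fetchTokens(endpoint, prompt string) (*TokenizeResult, error) {
	payload, err := json.Marshal(map[string]string{"prompt": prompt})
	if err != nil {
		return nil, err
	}
	resp, err := http.Post(endpoint+"/tokenize", "application/json", bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %d", resp.StatusCode)
	}
	var res TokenizeResult
	if err := json.NewDecoder(resp.Body).Decode(&res); err != nil {
		return nil, err
	}
	return &res, nil
}

func main() {
	srv := newMockTokenizeServer()
	defer srv.Close()

	res, err := fetchTokens(srv.URL, "Hello, world!")
	if err != nil {
		panic(err)
	}
	fmt.Println(res.Count, res.Tokens)
}
```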

Migration Guide

From Deprecated Components

VLLMTokenizer to RemoteTokenizer:

// Old (no longer supported)
// config := tokenizer.VLLMTokenizerConfig{...}
// tok, _ := tokenizer.NewVLLMTokenizer(config)

// New
config := tokenizer.RemoteTokenizerConfig{
    Engine:   "vllm",
    Endpoint: "http://vllm:8000",
    Model:    "llama2",
    Timeout:  30 * time.Second,
}
tok, _ := tokenizer.NewRemoteTokenizer(config)

ExtendedTokenizer/RemoteTokenizer interfaces:

// Old (no longer exported)
// var tok tokenizer.ExtendedTokenizer

// New - use type assertions
tok, _ := tokenizer.NewRemoteTokenizer(config)
if extended, ok := tok.(interface {
    TokenizeWithOptions(context.Context, tokenizer.TokenizeInput) (*tokenizer.TokenizeResult, error)
}); ok {
    // Use extended features
}

Contributing

When contributing to this package:

Maintain the minimal public API philosophy
Add tests for new features
Update this README for significant changes
Follow Go idioms and conventions
Use type assertions for advanced features
Ensure backward compatibility for the Tokenizer interface

License

This package is part of the AIBrix project and follows the same Apache 2.0 license.

Documentation

Index

type ChatMessage
type ErrDetokenizationFailed
- func (e ErrDetokenizationFailed) Error() string
- func (e ErrDetokenizationFailed) Unwrap() error
type ErrHTTPRequest
- func (e ErrHTTPRequest) Error() string
type ErrInvalidConfig
- func (e ErrInvalidConfig) Error() string
type ErrTokenizationFailed
- func (e ErrTokenizationFailed) Error() string
- func (e ErrTokenizationFailed) Unwrap() error
type ErrUnsupportedOperation
- func (e ErrUnsupportedOperation) Error() string
type RemoteTokenizerConfig
type TokenizeInput
type TokenizeInputType
type TokenizeResult
type Tokenizer

Types

type ChatMessage (added in v0.4.0)

type ChatMessage struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

ChatMessage represents a single message in a chat conversation

type ErrDetokenizationFailed (added in v0.4.0)

type ErrDetokenizationFailed struct {
	Message string
	Cause   error
}

ErrDetokenizationFailed represents a detokenization failure

func (ErrDetokenizationFailed) Error (added in v0.4.0)

func (e ErrDetokenizationFailed) Error() string

func (ErrDetokenizationFailed) Unwrap (added in v0.4.0)

func (e ErrDetokenizationFailed) Unwrap() error

Unwrap returns the underlying error cause

type ErrHTTPRequest (added in v0.4.0)

type ErrHTTPRequest struct {
	StatusCode int
	Message    string
	Body       string
}

ErrHTTPRequest represents an HTTP request error

func (ErrHTTPRequest) Error (added in v0.4.0)

func (e ErrHTTPRequest) Error() string

type ErrInvalidConfig (added in v0.4.0)

type ErrInvalidConfig struct {
	Message string
}

ErrInvalidConfig represents a configuration validation error

func (ErrInvalidConfig) Error (added in v0.4.0)

func (e ErrInvalidConfig) Error() string

type ErrTokenizationFailed (added in v0.4.0)

type ErrTokenizationFailed struct {
	Message string
	Cause   error
}

ErrTokenizationFailed represents a tokenization failure

func (ErrTokenizationFailed) Error (added in v0.4.0)

func (e ErrTokenizationFailed) Error() string

func (ErrTokenizationFailed) Unwrap (added in v0.4.0)

func (e ErrTokenizationFailed) Unwrap() error

Unwrap returns the underlying error cause

type ErrUnsupportedOperation (added in v0.4.0)

type ErrUnsupportedOperation struct {
	Engine    string
	Operation string
}

ErrUnsupportedOperation represents an error when an operation is not supported by the engine

func (ErrUnsupportedOperation) Error (added in v0.4.0)

func (e ErrUnsupportedOperation) Error() string

type RemoteTokenizerConfig (added in v0.4.0)

type RemoteTokenizerConfig struct {
	Engine             string        // "vllm", "sglang", "triton", etc.
	Endpoint           string        // Base URL of the service
	Model              string        // Model identifier (optional)
	Timeout            time.Duration // Request timeout
	MaxRetries         int           // Max retry attempts
	AddSpecialTokens   bool          // Default: true
	ReturnTokenStrings bool          // Default: false
}

RemoteTokenizerConfig represents configuration for a remote tokenizer

type TokenizeInput (added in v0.4.0)

type TokenizeInput struct {
	Type                TokenizeInputType
	Text                string        // For completion input
	Messages            []ChatMessage // For chat input
	AddSpecialTokens    bool
	ReturnTokenStrings  bool
	AddGenerationPrompt bool // For chat input only
}

TokenizeInput represents a unified input structure for tokenization

type TokenizeInputType (added in v0.4.0)

type TokenizeInputType string

TokenizeInputType represents the type of tokenization input

const (
	// CompletionInput represents simple text completion tokenization
	CompletionInput TokenizeInputType = "completion"
	// ChatInput represents chat message tokenization with templates
	ChatInput TokenizeInputType = "chat"
)
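
Because TokenizeInput carries both Text and Messages, callers may want to check that the populated field matches Type before sending a request. The sketch below shows one possible pre-flight check; validateInput is hypothetical, the types are trimmed local copies, and the package may apply different or additional rules.

```go
package main

import (
	"errors"
	"fmt"
)

// Trimmed local copies of the documented types.
type TokenizeInputType string

const (
	CompletionInput TokenizeInputType = "completion"
	ChatInput       TokenizeInputType = "chat"
)

type ChatMessage struct {
	Role    string
	Content string
}

type TokenizeInput struct {
	Type     TokenizeInputType
	Text     string
	Messages []ChatMessage
}

// validateInput is a hypothetical helper: it checks that the field matching
// the declared input type is populated.
func validateInput(in TokenizeInput) error {
	switch in.Type {
	case CompletionInput:
		if in.Text == "" {
			return errors.New("completion input requires Text")
		}
	case ChatInput:
		if len(in.Messages) == 0 {
			return errors.New("chat input requires at least one message")
		}
	default:
		return fmt.Errorf("unknown input type %q", in.Type)
	}
	return nil
}

func main() {
	fmt.Println(validateInput(TokenizeInput{Type: CompletionInput, Text: "Hello"}))
	fmt.Println(validateInput(TokenizeInput{Type: ChatInput}))
}
```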

type TokenizeResult (added in v0.4.0)

type TokenizeResult struct {
	Count        int      `json:"count"`
	MaxModelLen  int      `json:"max_model_len"`
	Tokens       []int    `json:"tokens"`
	TokenStrings []string `json:"token_strs,omitempty"`
}

TokenizeResult represents the result of tokenization
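
The struct tags above determine the JSON field names. The following sketch decodes a sample payload into a local copy of the struct and shows that token_strs is omitted on output when empty. The decodeResult helper and its count check are assumptions added for illustration, and the token IDs are placeholders.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TokenizeResult is a local copy of the documented struct and tags.
type TokenizeResult struct {
	Count        int      `json:"count"`
	MaxModelLen  int      `json:"max_model_len"`
	Tokens       []int    `json:"tokens"`
	TokenStrings []string `json:"token_strs,omitempty"`
}

// decodeResult parses a response body. The count/tokens consistency check
// is an optional sanity check assumed for this sketch.
func decodeResult(data []byte) (*TokenizeResult, error) {
	var res TokenizeResult
	if err := json.Unmarshal(data, &res); err != nil {
		return nil, fmt.Errorf("decode tokenize result: %w", err)
	}
	if res.Count != len(res.Tokens) {
		return nil, fmt.Errorf("count %d does not match %d tokens", res.Count, len(res.Tokens))
	}
	return &res, nil
}

func main() {
	raw := []byte(`{"count":2,"max_model_len":4096,"tokens":[10,20],"token_strs":["a","b"]}`)
	res, err := decodeResult(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(res.Count, res.Tokens, res.TokenStrings)

	// With no token strings, omitempty drops the token_strs key entirely.
	out, err := json.Marshal(TokenizeResult{Count: 2, MaxModelLen: 4096, Tokens: []int{10, 20}})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```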

type Tokenizer

type Tokenizer interface {
	TokenizeInputText(string) ([]byte, error)
}

Tokenizer is the basic tokenizer interface for backward compatibility

func NewCharacterTokenizer

func NewCharacterTokenizer() Tokenizer

NewCharacterTokenizer creates a new character tokenizer instance

func NewRemoteTokenizer (added in v0.4.0)

func NewRemoteTokenizer(config RemoteTokenizerConfig) (Tokenizer, error)

NewRemoteTokenizer creates a new remote tokenizer instance

func NewTiktokenTokenizer

func NewTiktokenTokenizer() Tokenizer

NewTiktokenTokenizer creates a new tiktoken tokenizer instance

func NewTokenizer (added in v0.4.0)

func NewTokenizer(tokenizerType string, config interface{}) (Tokenizer, error)

NewTokenizer creates a tokenizer instance based on the provided type. It supports the "tiktoken", "character", and "remote" tokenizer types.
