tokenizers

package module
v0.1.2
Published: Nov 7, 2025 License: MIT Imports: 23 Imported by: 0

README

pure-tokenizers


CGo-free tokenizers for Go with automatic library management and HuggingFace Hub integration.

  • ✅ No CGo required - Pure Go implementation using purego FFI
  • ✅ HuggingFace Hub integration - Load tokenizers directly from HuggingFace models
  • ✅ Automatic downloads - Platform-specific libraries fetched on demand
  • ✅ Cross-platform - Windows, macOS, Linux (including ARM)
  • ✅ Production ready - Checksum verification and ABI compatibility checks

Quick Start

Load directly from HuggingFace Hub
package main

import (
    "fmt"
    "log"

    "github.com/amikos-tech/pure-tokenizers"
)

func main() {
    // Load tokenizer directly from HuggingFace model
    tokenizer, err := tokenizers.FromHuggingFace("bert-base-uncased")
    if err != nil {
        log.Fatal(err)
    }
    defer tokenizer.Close()

    // Tokenize text
    encoding, err := tokenizer.Encode("Hello, world!", tokenizers.WithAddSpecialTokens())
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println("Tokens:", encoding.Tokens)
    fmt.Println("Token IDs:", encoding.IDs)
}
Or load from a local file
// Load tokenizer from file
tokenizer, err := tokenizers.FromFile("tokenizer.json")
if err != nil {
    log.Fatal(err)
}
defer tokenizer.Close()

That's it! The library automatically downloads the correct binary for your platform on first use.
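
Under the hood, the download step picks an artifact matching your OS and architecture. The sketch below shows one way such a name can be derived; the exact asset names used by the releases are an assumption here, not the library's confirmed naming scheme.

```go
package main

import (
	"fmt"
	"runtime"
)

// libraryFileName sketches how a platform-specific artifact name could be
// derived from GOOS/GOARCH. The naming scheme is illustrative only.
func libraryFileName(goos, goarch string) string {
	arch := map[string]string{"amd64": "x86_64", "arm64": "aarch64"}[goarch]
	switch goos {
	case "darwin":
		return fmt.Sprintf("libtokenizers-darwin-%s.dylib", arch)
	case "windows":
		return fmt.Sprintf("tokenizers-windows-%s.dll", arch)
	default:
		return fmt.Sprintf("libtokenizers-%s-%s.so", goos, arch)
	}
}

func main() {
	// Print the artifact name this machine would request.
	fmt.Println(libraryFileName(runtime.GOOS, runtime.GOARCH))
}
```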

Installation

go get github.com/amikos-tech/pure-tokenizers

Features

🚀 Zero Configuration

The library automatically manages platform-specific binaries. No manual downloads, no build steps, no CGo.

🔒 Secure by Default
  • SHA256 checksum verification for all downloads
  • ABI version compatibility checking
  • Secure HTTPS-only downloads
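
Checksum verification of this kind boils down to hashing the downloaded bytes and comparing against the published digest. A minimal sketch (not the library's actual code):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// verifyChecksum hashes data with SHA256 and compares the hex digest
// against the expected value, as a download-verification step would.
func verifyChecksum(data []byte, expectedHex string) error {
	sum := sha256.Sum256(data)
	if got := hex.EncodeToString(sum[:]); got != expectedHex {
		return fmt.Errorf("checksum mismatch: got %s, want %s", got, expectedHex)
	}
	return nil
}

func main() {
	data := []byte("hello")
	sum := sha256.Sum256(data)
	// Verifying against the digest we just computed succeeds.
	fmt.Println(verifyChecksum(data, hex.EncodeToString(sum[:])) == nil) // true
}
```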
🎯 Platform Native

Optimized binaries for each platform and architecture:

  • macOS (Intel & Apple Silicon)
  • Linux (x86_64, ARM64, including musl)
  • Windows (x86_64)
⚡ High Performance

Native Rust performance without CGo overhead. Direct FFI calls using purego.

Performance Benchmarks

The following benchmarks compare pure-tokenizers (CGo-free) with CGo-based implementations. Results show competitive performance while maintaining the benefits of a CGo-free approach.

Benchmark Comparison

Test Environment:

  • pure-tokenizers: Apple M3 Max, macOS (CGo-free implementation)
  • CGo baseline: Apple M1 Pro, macOS (daulet/tokenizers)

Note: Different hardware affects absolute timings. Focus on relative performance patterns and memory characteristics rather than exact microsecond differences.

Text Characteristics:

  • Short: <50 characters (typical word or phrase)
  • Medium: 100-500 characters (typical sentence or paragraph)
  • Long: >1000 characters (multiple paragraphs)
| Operation | Implementation | Time/op | Memory/op | Allocs/op | Notes |
| --- | --- | --- | --- | --- | --- |
| Encode (Short Text) | pure-tokenizers | 7.80µs | 920 B | 16 | CGo-free |
| Encode (Short Text) | CGo baseline | 10.50µs | 256 B | 12 | HuggingFace tokenizer |
| Encode (Medium Text) | pure-tokenizers | 30.50µs | 1,552 B | 35 | CGo-free |
| Encode (Long Text) | pure-tokenizers | 267.00µs | 6,864 B | 165 | CGo-free |
| Decode Operations | pure-tokenizers | 13.40µs | 740 B | 10 | CGo-free |
| Decode Operations | CGo baseline | 1.50µs | 64 B | 2 | HuggingFace tokenizer |
| Encode/Decode Cycle | pure-tokenizers | 52.50µs | 2,296 B | 45 | Medium text, CGo-free |
Key Performance Characteristics

✅ Advantages of CGo-free approach:

  • No CGo overhead: Eliminates C-Go boundary crossing costs
  • Cross-compilation friendly: No CGo dependencies simplify building
  • Memory safety: Pure Go memory management
  • Deployment simplicity: Single binary with automatic library management

📊 Performance Analysis:

  • Encoding performance: Competitive with CGo implementations, often faster for short texts
  • Memory usage: Higher allocation count due to FFI boundary (16 vs 12 allocs), but predictable patterns
  • Batch processing: Efficient handling of multiple text inputs
  • Platform consistency: Uniform performance across all supported platforms
Advanced Benchmarks
| Feature | Time/op | Memory/op | Allocs/op | Notes |
| --- | --- | --- | --- | --- |
| Batch Processing (5 texts) | 356.00µs | 11,568 B | 261 | Parallel encoding |
| With Options (all attributes) | 34.30µs | 2,160 B | 41 | Full feature set |
| Truncation (128 tokens) | 258.00µs | 5,632 B | 127 | Max length enforcement |
| Padding (256 tokens) | 84.90µs | 16,272 B | 535 | Fixed length output |
| HuggingFace Loading (cached) | 26.20ms | 6.45 MB | 92,188 | Model initialization |
Benchmark Environment
# Run benchmarks locally
make build && go test -bench=. -benchmem

# Compare with different tokenizers
go test -bench=BenchmarkEncode -benchmem
go test -bench=BenchmarkDecode -benchmem

Platform-specific results: Benchmarks run continuously in CI across Linux, macOS, and Windows. See benchmark workflow for automated performance tracking.
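
If you want to produce per-op numbers like the ones above from a plain Go program, `testing.Benchmark` can drive the measurement. The sketch below uses a stand-in function instead of `tokenizer.Encode` so it runs without the native library:

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// encodeStandIn is a placeholder for tokenizer.Encode so the harness
// runs standalone; swap in a real tokenizer call to benchmark the library.
func encodeStandIn(text string) []string {
	return strings.Fields(text)
}

func main() {
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			_ = encodeStandIn("Hello, world! This is a medium-length sentence.")
		}
	})
	// Prints iterations and ns/op, like the table entries above.
	fmt.Println(res.String())
}
```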

Usage Examples

HuggingFace Hub Integration
// Load tokenizer from any public HuggingFace model
tokenizer, err := tokenizers.FromHuggingFace("bert-base-uncased")
tokenizer, err := tokenizers.FromHuggingFace("gpt2")
tokenizer, err := tokenizers.FromHuggingFace("sentence-transformers/all-MiniLM-L6-v2")

// Load from private/gated models with authentication
tokenizer, err := tokenizers.FromHuggingFace("meta-llama/Llama-2-7b-hf",
    tokenizers.WithHFToken(os.Getenv("HF_TOKEN")))

// Configure HuggingFace options
tokenizer, err := tokenizers.FromHuggingFace("bert-base-uncased",
    tokenizers.WithHFToken(token),           // Authentication token
    tokenizers.WithHFRevision("main"),       // Specific revision/branch
    tokenizers.WithHFCacheDir("/custom/cache"), // Custom cache directory
    tokenizers.WithHFTimeout(30*time.Second),   // Download timeout
    tokenizers.WithHFOfflineMode(true),      // Use cached version only
)

// The tokenizer is automatically cached for offline use
// Cache location: ~/.cache/tokenizers/huggingface/ (Linux/macOS)
//                 %APPDATA%/tokenizers/huggingface/ (Windows)
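
For reference, `tokenizer.json` is fetched from the Hub's standard `resolve` endpoint. A small sketch of the URL construction (the path layout follows the public huggingface.co convention; the helper name is illustrative):

```go
package main

import "fmt"

// tokenizerURL builds the Hub "resolve" URL from which tokenizer.json
// is fetched for a given model and revision.
func tokenizerURL(baseURL, modelID, revision string) string {
	return fmt.Sprintf("%s/%s/resolve/%s/tokenizer.json", baseURL, modelID, revision)
}

func main() {
	fmt.Println(tokenizerURL("https://huggingface.co", "bert-base-uncased", "main"))
	// https://huggingface.co/bert-base-uncased/resolve/main/tokenizer.json
}
```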


Basic Tokenization
// Load a tokenizer from file
tokenizer, err := tokenizers.FromFile("path/to/tokenizer.json")
if err != nil {
    log.Fatal(err)
}
defer tokenizer.Close()

// Simple encoding
encoding, err := tokenizer.Encode("Hello, world!")

// With special tokens
encoding, err := tokenizer.Encode("Hello, world!", tokenizers.WithAddSpecialTokens())
Advanced Options
// Encoding with custom options
encoding, err := tokenizer.Encode("Your text here",
    tokenizers.WithAddSpecialTokens(),
    tokenizers.WithReturnTokens(),
    tokenizers.WithReturnAttentionMask(),
    tokenizers.WithReturnTypeIDs(),
)

// Create tokenizer with truncation and padding
tokenizer, err := tokenizers.FromFile("tokenizer.json",
    tokenizers.WithTruncation(512, tokenizers.TruncationDirectionRight, tokenizers.TruncationStrategyLongestFirst),
    tokenizers.WithPadding(true, tokenizers.PaddingStrategy{Tag: tokenizers.PaddingStrategyFixed, FixedSize: 512}),
)

// Access different parts of the encoding result
if encoding.Tokens != nil {
    fmt.Println("Tokens:", encoding.Tokens)
}
if encoding.IDs != nil {
    fmt.Println("Token IDs:", encoding.IDs)
}
if encoding.AttentionMask != nil {
    fmt.Println("Attention mask:", encoding.AttentionMask)
}
Decoding Tokens
// Decode token IDs back to text
ids := []uint32{101, 7592, 1010, 2088, 999, 102}
text, err := tokenizer.Decode(ids, true)
fmt.Println(text)  // "hello, world!"
Loading from Configuration Files
// Load tokenizer from a downloaded tokenizer.json file
tokenizer, err := tokenizers.FromFile("path/to/tokenizer.json")

// Load from byte configuration
configBytes, _ := os.ReadFile("tokenizer.json")
tokenizer, err := tokenizers.FromBytes(configBytes)

// Use with custom library path
tokenizer, err := tokenizers.FromFile("tokenizer.json",
    tokenizers.WithLibraryPath("/custom/path/to/libtokenizers.so"))

Configuration

Environment Variables
| Variable | Description | Default |
| --- | --- | --- |
| TOKENIZERS_LIB_PATH | Custom library path | Auto-detect |
| TOKENIZERS_GITHUB_REPO | GitHub repo for downloads | amikos-tech/pure-tokenizers |
| TOKENIZERS_VERSION | Library version to download | latest |
| GITHUB_TOKEN | GitHub API token (for rate limits) | None |
Library Loading Options
// Use a specific library path
tokenizer, err := tokenizers.FromFile("tokenizer.json",
    tokenizers.WithLibraryPath("/custom/path/to/libtokenizers.so"))

// The library loading priority:
// 1. User-provided path via WithLibraryPath()
// 2. TOKENIZERS_LIB_PATH environment variable
// 3. Cached library in platform directory
// 4. Automatic download from GitHub releases
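
The priority order above can be expressed as a pure function. This is a simplified sketch, not the library's implementation (the real code also validates the cached file before using it):

```go
package main

import "fmt"

// resolveLibraryPath mirrors the documented priority order: user path,
// then environment variable, then cache, then download as a last resort.
func resolveLibraryPath(userPath, envPath, cachedPath string) (path string, mustDownload bool) {
	switch {
	case userPath != "":
		return userPath, false
	case envPath != "":
		return envPath, false
	case cachedPath != "":
		return cachedPath, false
	default:
		return "", true // fall back to downloading from GitHub releases
	}
}

func main() {
	p, download := resolveLibraryPath("", "/opt/libtokenizers.so", "")
	fmt.Println(p, download) // /opt/libtokenizers.so false
}
```
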
Cache Management

For comprehensive cache management documentation, see Cache Management Guide.

Library Cache
// Get the library cache directory
cachePath := tokenizers.GetCachedLibraryPath()

// Clear the library cache
err := tokenizers.ClearLibraryCache()

// Download and cache a specific version
err := tokenizers.DownloadAndCacheLibraryWithVersion("v0.1.0")
HuggingFace Cache
// Get HuggingFace cache information
info, err := tokenizers.GetHFCacheInfo("bert-base-uncased")

// Clear cache for a specific model
err := tokenizers.ClearHFModelCache("bert-base-uncased")

// Clear entire HuggingFace cache
err := tokenizers.ClearHFCache()

// Use offline mode (only use cached tokenizers)
tokenizer, err := tokenizers.FromHuggingFace("bert-base-uncased",
    tokenizers.WithHFOfflineMode(true))

Platform Support

| Platform | Architecture | Binary | Status |
| --- | --- | --- | --- |
| macOS | x86_64 | .dylib | ✅ |
| macOS | aarch64 (M1/M2) | .dylib | ✅ |
| Linux | x86_64 | .so | ✅ |
| Linux | aarch64 | .so | ✅ |
| Linux (musl) | x86_64 | .so | ✅ |
| Linux (musl) | aarch64 | .so | ✅ |
| Windows | x86_64 | .dll | ✅ |

Development

Building from Source
# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Clone the repository
git clone https://github.com/amikos-tech/pure-tokenizers
cd pure-tokenizers

# Build the Rust library
make build

# Run tests
make test

# Run linting
make lint-fix      # Go linting
make lint-rust     # Rust linting
Testing
Unit Tests
# Run all unit tests
make test

# Run with specific library path
make test-lib-path
Integration Tests

Integration tests verify real-world functionality with HuggingFace models:

# Setup for local testing
cp .env.example .env
# Edit .env and add your HF_TOKEN (get from https://huggingface.co/settings/tokens)

# Run all integration tests (requires HF_TOKEN for private models)
make test-integration

# Run only HuggingFace integration tests
make test-integration-hf

The integration tests cover:

  • Public model downloads (BERT, GPT2, DistilBERT)
  • Private model access (with HF_TOKEN)
  • Caching behavior verification
  • Rate limiting handling
  • Offline mode functionality

Note: Integration tests are automatically run in CI for the main branch and PRs with the integration label.

Project Structure
pure-tokenizers/
├── src/           # Rust FFI implementation
├── *.go           # Go bindings
├── download.go    # Auto-download functionality
├── library.go     # Platform-specific FFI loading
└── Makefile       # Build automation
Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'feat: add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Built on top of the excellent Hugging Face Tokenizers library.

Documentation

Index

Constants

const (
	GitHubRepo      = "amikos-tech/pure-tokenizers"
	DefaultTag      = "latest"
	DownloadTimeout = 30 * time.Second
)
const (
	HFDefaultRevision = "main"
	HFDefaultTimeout  = 30 * time.Second
	HFMaxRetries      = 3
	HFRetryDelay      = time.Second
	// HFMaxRetryAfterDelay caps the maximum delay from Retry-After headers
	// to prevent excessive waits from misconfigured or malicious servers
	HFMaxRetryAfterDelay = 5 * time.Minute

	// DefaultMaxTokenizerSize is the default maximum size for tokenizer files (500MB)
	// This prevents OOM errors from excessively large downloads
	DefaultMaxTokenizerSize = 500 * 1024 * 1024
)
const (
	SUCCESS                    = 0
	ErrInvalidUTF8             = -1
	ErrEncodingFailed          = -2
	ErrNullOutput              = -3
	ErrInvalidTokenizerRef     = -4
	ErrNullInput               = -5
	ErrTokenizerCreationFailed = -6
	ErrInvalidPath             = -7
	ErrFileNotFound            = -8
	ErrTruncationFailed        = -9
	ErrPaddingFailed           = -10
	ErrDecodeFailed            = -11
	ErrCStringConversionFailed = -12
	ErrInvalidIDs              = -13
	ErrInvalidOptions          = -14
)
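
A typical way to surface such status codes to callers is a code-to-error mapping. The messages below are illustrative, not the library's own strings:

```go
package main

import "fmt"

// codeToError converts an FFI status code into a Go error; 0 means success,
// negative values map to descriptive errors.
func codeToError(code int32) error {
	if code == 0 { // SUCCESS
		return nil
	}
	messages := map[int32]string{
		-1: "invalid UTF-8 input",
		-2: "encoding failed",
		-8: "tokenizer file not found",
	}
	if msg, ok := messages[code]; ok {
		return fmt.Errorf("tokenizers: %s (code %d)", msg, code)
	}
	return fmt.Errorf("tokenizers: unknown error (code %d)", code)
}

func main() {
	fmt.Println(codeToError(0) == nil) // true
	fmt.Println(codeToError(-8))
}
```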
const AbiCompatibilityConstraint = "^0.1.x"

AbiCompatibilityConstraint defines the required version range for ABI compatibility. The library version from Cargo.toml is used as the ABI version. Update this constraint when making breaking changes to the FFI interface.

const LibName = "tokenizers"
const TruncationMaxLengthDefault uintptr = 512 // Default truncation length, can be overridden by user

Variables

var (
	HFHubBaseURL = "https://huggingface.co" // Variable to allow testing with mock server

	// ErrCacheNotFound is returned when a requested cache file does not exist
	ErrCacheNotFound = errors.New("cache file not found")
)

Functions

func ClearHFCache added in v0.1.1

func ClearHFCache() error

ClearHFCache clears all HuggingFace tokenizer cache

func ClearHFCachePattern added in v0.1.1

func ClearHFCachePattern(pattern string) (int, error)

ClearHFCachePattern clears cache entries matching a glob pattern. The pattern is matched against model IDs (e.g., "bert-*", "huggingface/*"). Patterns use standard glob syntax: * matches any sequence, ? matches any single character.

Examples:

  • "bert-*" matches all BERT model variants
  • "huggingface/*" matches all models from the huggingface organization
  • "*/bert-*" matches BERT models from any organization

For security, patterns containing ".." in path segments or absolute paths are rejected. Returns the number of cache entries cleared and any error encountered.

func ClearHFModelCache added in v0.1.1

func ClearHFModelCache(modelID string) error

ClearHFModelCache clears the cache for a specific model

func ClearLibraryCache

func ClearLibraryCache() error

ClearLibraryCache removes the cached library file

func DownloadAndCacheLibrary

func DownloadAndCacheLibrary() error

DownloadAndCacheLibrary downloads and caches the library for the current platform

func DownloadAndCacheLibraryWithVersion

func DownloadAndCacheLibraryWithVersion(version string) error

DownloadAndCacheLibraryWithVersion downloads and caches a specific version of the library

func DownloadLibraryFromGitHub

func DownloadLibraryFromGitHub(destPath string) error

DownloadLibraryFromGitHub downloads the platform-specific library from GitHub releases

func DownloadLibraryFromGitHubWithVersion

func DownloadLibraryFromGitHubWithVersion(destPath, version string) error

DownloadLibraryFromGitHubWithVersion downloads a specific version of the library

func GetAvailableVersions

func GetAvailableVersions() ([]string, error)

GetAvailableVersions fetches available versions from GitHub releases

func GetCachedLibraryPath

func GetCachedLibraryPath() string

GetCachedLibraryPath returns the path where the library would be cached

func GetHFCacheInfo added in v0.1.1

func GetHFCacheInfo(modelID string) (map[string]interface{}, error)

GetHFCacheInfo returns information about the HuggingFace cache for a model

func GetLibraryInfo

func GetLibraryInfo() map[string]interface{}

GetLibraryInfo returns information about the current library setup

func GetLibraryVersion added in v0.1.1

func GetLibraryVersion() string

GetLibraryVersion returns the current library version used in User-Agent

func IsLibraryCached

func IsLibraryCached() bool

IsLibraryCached checks if the library is already cached and valid

func LoadTokenizerLibrary

func LoadTokenizerLibrary(userPath string) (uintptr, error)

LoadTokenizerLibrary loads the tokenizer shared library from the specified path or attempts to find it through various fallback mechanisms:

  1. User-provided path
  2. TOKENIZERS_LIB_PATH environment variable
  3. Cached library in platform-specific directory
  4. Automatic download from GitHub releases

func MasksFromBuf

func MasksFromBuf(buf Buffer) (special, attention []uint32)

func SetLibraryVersion added in v0.1.1

func SetLibraryVersion(version string)

SetLibraryVersion sets the library version for User-Agent headers

func TokensFromBuf

func TokensFromBuf(buf Buffer) []string

Types

type Buffer

type Buffer struct {
	IDs               *uint32
	TypeIDs           *uint32
	SpecialTokensMask *uint32
	AttentionMask     *uint32
	Tokens            **byte
	Offsets           *uintptr
	Len               uintptr
}

type EncodeOption

type EncodeOption func(eo *EncodeOptions) error

func WithAddSpecialTokens

func WithAddSpecialTokens() EncodeOption

func WithReturnAllAttributes

func WithReturnAllAttributes() EncodeOption

func WithReturnAttentionMask

func WithReturnAttentionMask() EncodeOption

func WithReturnOffsets

func WithReturnOffsets() EncodeOption

func WithReturnSpecialTokensMask

func WithReturnSpecialTokensMask() EncodeOption

func WithReturnTokens

func WithReturnTokens() EncodeOption

func WithReturnTypeIDs

func WithReturnTypeIDs() EncodeOption

type EncodeOptions

type EncodeOptions struct {
	AddSpecialTokens        bool
	ReturnTypeIDs           bool
	ReturnTokens            bool
	ReturnSpecialTokensMask bool
	ReturnAttentionMask     bool
	ReturnOffsets           bool
}

type EncodeResult

type EncodeResult struct {
	IDs               []uint32
	TypeIDs           []uint32
	SpecialTokensMask []uint32
	AttentionMask     []uint32
	Tokens            []string
	Offsets           []uint32
}

type GitHubAsset

type GitHubAsset struct {
	Name               string `json:"name"`
	BrowserDownloadURL string `json:"browser_download_url"`
	Digest             string `json:"digest,omitempty"` // Optional field for checksums
}

type GitHubRelease

type GitHubRelease struct {
	TagName string        `json:"tag_name"`
	Assets  []GitHubAsset `json:"assets"`
}

GitHub API structures

type HFConfig added in v0.1.1

type HFConfig struct {
	Token       string
	Revision    string
	CacheDir    string
	Timeout     time.Duration
	MaxRetries  int
	OfflineMode bool
	// UseLocalCache enables checking the HuggingFace hub cache before downloading
	UseLocalCache bool
	// CacheTTL specifies how long cached tokenizers are considered valid (0 = forever)
	CacheTTL time.Duration
	// MaxTokenizerSize is the maximum allowed size for tokenizer files in bytes
	// (env: HF_MAX_TOKENIZER_SIZE, default: 500MB).
	// When set to 0 (zero value), falls back to HF_MAX_TOKENIZER_SIZE environment variable,
	// or DefaultMaxTokenizerSize (500MB) if the environment variable is not set.
	// Use WithHFMaxTokenizerSize to explicitly set this value.
	MaxTokenizerSize int64
	// BaseURL is the base URL for HuggingFace Hub API (defaults to HFHubBaseURL if empty)
	// This is primarily used for testing with mock servers
	BaseURL string

	// HTTP client pooling configuration
	// These settings control connection reuse for improved performance.
	// Config fields take priority over environment variables.
	//
	// IMPORTANT: The HTTP client is initialized once per process using sync.Once.
	// Changes to these configuration values after the first HuggingFace download
	// will NOT take effect. Set these values before any HuggingFace operations.
	//
	// Performance trade-offs:
	// - Higher values: Better connection reuse, reduced latency for subsequent requests, but increased memory usage
	// - Lower values: Reduced memory footprint, but more connection establishment overhead
	//
	// Recommended configurations:
	// - High-throughput services: Increase HTTPMaxIdleConnsPerHost (e.g., 20-50) for parallel downloads
	// - Resource-constrained environments: Reduce both values (e.g., 50/5) to minimize memory usage
	// - Short-lived scripts: Reduce HTTPIdleTimeout (e.g., 10s) to release resources quickly
	//
	// Note: HTTPMaxIdleConns will be automatically adjusted to be >= HTTPMaxIdleConnsPerHost for logical consistency
	//
	// Debug mode: Set DEBUG=1 environment variable to see actual configuration values being used
	HTTPMaxIdleConns        int           // Maximum idle connections across all hosts (env: HF_HTTP_MAX_IDLE_CONNS, default: 100, max: 1000)
	HTTPMaxIdleConnsPerHost int           // Maximum idle connections per host (env: HF_HTTP_MAX_IDLE_CONNS_PER_HOST, default: 10, max: 100)
	HTTPIdleTimeout         time.Duration // How long to keep idle connections open (env: HF_HTTP_IDLE_TIMEOUT, default: 90s)
}

HFConfig holds HuggingFace-specific configuration

type PaddingOptions

type PaddingOptions struct {
	Enabled  bool
	Strategy PaddingStrategy
}

type PaddingStrategy

type PaddingStrategy struct {
	Tag       PaddingStrategyTag
	FixedSize uintptr // Only valid if Tag == PaddingStrategyFixed
}

type PaddingStrategyTag

type PaddingStrategyTag int
const (
	PaddingStrategyBatchLongest PaddingStrategyTag = iota
	PaddingStrategyFixed
)
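
The two tags imply different target lengths when padding a batch: BatchLongest pads to the longest sequence, Fixed pads to FixedSize. The sketch below redeclares the tag type locally so it runs standalone; `targetLength` is an illustrative helper, not part of the package:

```go
package main

import "fmt"

type PaddingStrategyTag int

const (
	PaddingStrategyBatchLongest PaddingStrategyTag = iota
	PaddingStrategyFixed
)

// targetLength computes the length every sequence in a batch is padded to
// under each strategy.
func targetLength(tag PaddingStrategyTag, fixedSize uintptr, lengths []int) int {
	if tag == PaddingStrategyFixed {
		return int(fixedSize)
	}
	longest := 0
	for _, n := range lengths {
		if n > longest {
			longest = n
		}
	}
	return longest
}

func main() {
	lens := []int{5, 9, 3}
	fmt.Println(targetLength(PaddingStrategyBatchLongest, 0, lens)) // 9
	fmt.Println(targetLength(PaddingStrategyFixed, 128, lens))      // 128
}
```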

type StringResult

type StringResult struct {
	String    *string
	ErrorCode int32
}

type Tokenizer

type Tokenizer struct {
	LibraryPath string // Path to the shared library

	TruncationEnabled   bool
	TruncationDirection TruncationDirection
	TruncationStrategy  TruncationStrategy
	TruncationMaxLength uintptr // Maximum length for truncation
	PaddingEnabled      bool
	PaddingStrategy     PaddingStrategy // Strategy for padding
	// contains filtered or unexported fields
}

func FromBytes

func FromBytes(config []byte, opts ...TokenizerOption) (*Tokenizer, error)

func FromFile

func FromFile(configFile string, opts ...TokenizerOption) (*Tokenizer, error)

func FromHuggingFace added in v0.1.1

func FromHuggingFace(modelID string, opts ...TokenizerOption) (*Tokenizer, error)

FromHuggingFace loads a tokenizer from HuggingFace Hub using the model identifier.

The model identifier can be in the format "organization/model" or just "model". For example: "bert-base-uncased", "google/flan-t5-base", "meta-llama/Llama-2-7b-hf".

By default, it loads from the "main" branch/revision. Use WithHFRevision to specify a different revision (branch, tag, or commit hash).

For private or gated models, authentication is required. Set the HF_TOKEN environment variable or use WithHFToken option.

The tokenizer is cached locally for faster subsequent loads. The cache location is platform-specific and can be overridden with WithHFCacheDir.

Example:

tokenizer, err := FromHuggingFace("bert-base-uncased")
if err != nil {
    log.Fatal(err)
}
defer tokenizer.Close()

func (*Tokenizer) Close

func (t *Tokenizer) Close() error

func (*Tokenizer) Decode

func (t *Tokenizer) Decode(ids []uint32, skipSpecialTokens bool) (string, error)

func (*Tokenizer) Encode

func (t *Tokenizer) Encode(message string, opts ...EncodeOption) (*EncodeResult, error)

func (*Tokenizer) EncodePair added in v0.1.2

func (t *Tokenizer) EncodePair(sequence string, pair string, opts ...EncodeOption) (*EncodeResult, error)

EncodePair encodes a single sequence pair. This is a convenience wrapper around EncodePairs for encoding a single pair.

func (*Tokenizer) EncodePairs added in v0.1.2

func (t *Tokenizer) EncodePairs(sequences []string, pairs []string, opts ...EncodeOption) ([]*EncodeResult, error)

EncodePairs encodes multiple sequence pairs in parallel. This is useful for reranking tasks where you need to encode query-document pairs.

func (*Tokenizer) GetLibraryVersion added in v0.1.1

func (t *Tokenizer) GetLibraryVersion() string

GetLibraryVersion returns the version of the tokenizer library

func (*Tokenizer) VocabSize

func (t *Tokenizer) VocabSize() (uint32, error)

type TokenizerOption

type TokenizerOption func(t *Tokenizer) error

func WithHFCacheDir added in v0.1.1

func WithHFCacheDir(dir string) TokenizerOption

WithHFCacheDir sets a custom cache directory for HuggingFace tokenizers

func WithHFCacheTTL added in v0.1.1

func WithHFCacheTTL(ttl time.Duration) TokenizerOption

WithHFCacheTTL sets the cache time-to-live for cached tokenizers

func WithHFMaxTokenizerSize added in v0.1.1

func WithHFMaxTokenizerSize(maxSize int64) TokenizerOption

WithHFMaxTokenizerSize sets the maximum allowed size for tokenizer files in bytes Default is 500MB. Set to a very large value to effectively disable size validation.

func WithHFOfflineMode added in v0.1.1

func WithHFOfflineMode(offline bool) TokenizerOption

WithHFOfflineMode forces the tokenizer to only use cached versions

func WithHFRevision added in v0.1.1

func WithHFRevision(revision string) TokenizerOption

WithHFRevision sets the model revision (branch, tag, or commit hash)

func WithHFTimeout added in v0.1.1

func WithHFTimeout(timeout time.Duration) TokenizerOption

WithHFTimeout sets the download timeout for HuggingFace requests

func WithHFToken added in v0.1.1

func WithHFToken(token string) TokenizerOption

WithHFToken sets the HuggingFace API token for authentication

func WithHFUseLocalCache added in v0.1.1

func WithHFUseLocalCache(useCache bool) TokenizerOption

WithHFUseLocalCache enables or disables checking the HuggingFace hub cache

func WithLibraryPath

func WithLibraryPath(path string) TokenizerOption

WithLibraryPath sets the path to the shared library for the tokenizer. This must be the path to the .so/dylib/dll file that contains the tokenizer implementation.

func WithPadding

func WithPadding(enabled bool, strategy PaddingStrategy) TokenizerOption

func WithTruncation

func WithTruncation(maxLen uintptr, direction TruncationDirection, strategy TruncationStrategy) TokenizerOption

type TokenizerOptions

type TokenizerOptions struct {
	AddSpecialTokens bool
	Trunc            TruncationOptions
	Pad              PaddingOptions
}

type TokenizerResult

type TokenizerResult struct {
	Tokenizer unsafe.Pointer
	ErrorCode int32
}

type TruncationDirection

type TruncationDirection uint8
const (
	TruncationDirectionLeft TruncationDirection = iota
	TruncationDirectionRight
)
const TruncationDirectionDefault TruncationDirection = TruncationDirectionRight

type TruncationOptions

type TruncationOptions struct {
	Enabled   bool
	MaxLen    uintptr
	Strategy  TruncationStrategy
	Direction TruncationDirection
	Stride    uintptr
}

type TruncationStrategy

type TruncationStrategy uint8
const (
	TruncationStrategyLongestFirst TruncationStrategy = iota
	TruncationStrategyOnlyFirst
	TruncationStrategyOnlySecond
)
const TruncationStrategyDefault TruncationStrategy = TruncationStrategyLongestFirst

type VocabSizeResult

type VocabSizeResult struct {
	VocabSize uint32
	ErrorCode int32
}
