tokenizers

package module
v0.1.2
Published: Nov 7, 2025 License: MIT Imports: 23 Imported by: 0

README

pure-tokenizers


CGo-free tokenizers for Go with automatic library management and HuggingFace Hub integration.

  • ✅ No CGo required - Pure Go implementation using purego FFI
  • ✅ HuggingFace Hub integration - Load tokenizers directly from HuggingFace models
  • ✅ Automatic downloads - Platform-specific libraries fetched on demand
  • ✅ Cross-platform - Windows, macOS, Linux (including ARM)
  • ✅ Production ready - Checksum verification and ABI compatibility checks

Quick Start

Load directly from HuggingFace Hub
package main

import (
    "fmt"
    "log"

    "github.com/amikos-tech/pure-tokenizers"
)

func main() {
    // Load tokenizer directly from HuggingFace model
    tokenizer, err := tokenizers.FromHuggingFace("bert-base-uncased")
    if err != nil {
        log.Fatal(err)
    }
    defer tokenizer.Close()

    // Tokenize text
    encoding, err := tokenizer.Encode("Hello, world!", tokenizers.WithAddSpecialTokens())
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println("Tokens:", encoding.Tokens)
    fmt.Println("Token IDs:", encoding.IDs)
}
Or load from a local file
// Load tokenizer from file
tokenizer, err := tokenizers.FromFile("tokenizer.json")
if err != nil {
    log.Fatal(err)
}
defer tokenizer.Close()

That's it! The library automatically downloads the correct binary for your platform on first use.
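
Under the hood, the download step picks an artifact matching your OS and architecture. The sketch below shows one way such a name can be derived; the exact asset names used by the releases are an assumption here, not the library's confirmed naming scheme.

```go
package main

import (
	"fmt"
	"runtime"
)

// libraryFileName sketches how a platform-specific artifact name could be
// derived from GOOS/GOARCH. The naming scheme is illustrative only.
func libraryFileName(goos, goarch string) string {
	arch := map[string]string{"amd64": "x86_64", "arm64": "aarch64"}[goarch]
	switch goos {
	case "darwin":
		return fmt.Sprintf("libtokenizers-darwin-%s.dylib", arch)
	case "windows":
		return fmt.Sprintf("tokenizers-windows-%s.dll", arch)
	default:
		return fmt.Sprintf("libtokenizers-%s-%s.so", goos, arch)
	}
}

func main() {
	// Print the artifact name this machine would request.
	fmt.Println(libraryFileName(runtime.GOOS, runtime.GOARCH))
}
```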

Installation

go get github.com/amikos-tech/pure-tokenizers

Features

🚀 Zero Configuration

The library automatically manages platform-specific binaries. No manual downloads, no build steps, no CGo.

🔒 Secure by Default
  • SHA256 checksum verification for all downloads
  • ABI version compatibility checking
  • Secure HTTPS-only downloads
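
Checksum verification of this kind boils down to hashing the downloaded bytes and comparing against the published digest. A minimal sketch (not the library's actual code):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// verifyChecksum hashes data with SHA256 and compares the hex digest
// against the expected value, as a download-verification step would.
func verifyChecksum(data []byte, expectedHex string) error {
	sum := sha256.Sum256(data)
	if got := hex.EncodeToString(sum[:]); got != expectedHex {
		return fmt.Errorf("checksum mismatch: got %s, want %s", got, expectedHex)
	}
	return nil
}

func main() {
	data := []byte("hello")
	sum := sha256.Sum256(data)
	// Verifying against the digest we just computed succeeds.
	fmt.Println(verifyChecksum(data, hex.EncodeToString(sum[:])) == nil) // true
}
```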
🎯 Platform Native

Optimized binaries for each platform and architecture:

  • macOS (Intel & Apple Silicon)
  • Linux (x86_64, ARM64, including musl)
  • Windows (x86_64)
⚡ High Performance

Native Rust performance without CGo overhead. Direct FFI calls using purego.

Performance Benchmarks

The following benchmarks compare pure-tokenizers (CGo-free) with CGo-based implementations. Results show competitive performance while maintaining the benefits of a CGo-free approach.

Benchmark Comparison

Test Environment:

  • pure-tokenizers: Apple M3 Max, macOS (CGo-free implementation)
  • CGo baseline: Apple M1 Pro, macOS (daulet/tokenizers)

Note: Different hardware affects absolute timings. Focus on relative performance patterns and memory characteristics rather than exact microsecond differences.

Text Characteristics:

  • Short: <50 characters (typical word or phrase)
  • Medium: 100-500 characters (typical sentence or paragraph)
  • Long: >1000 characters (multiple paragraphs)
| Operation | Implementation | Time/op | Memory/op | Allocs/op | Notes |
| --- | --- | --- | --- | --- | --- |
| Encode (Short Text) | pure-tokenizers | 7.80µs | 920 B | 16 | CGo-free |
| Encode (Short Text) | CGo baseline | 10.50µs | 256 B | 12 | HuggingFace tokenizer |
| Encode (Medium Text) | pure-tokenizers | 30.50µs | 1,552 B | 35 | CGo-free |
| Encode (Long Text) | pure-tokenizers | 267.00µs | 6,864 B | 165 | CGo-free |
| Decode Operations | pure-tokenizers | 13.40µs | 740 B | 10 | CGo-free |
| Decode Operations | CGo baseline | 1.50µs | 64 B | 2 | HuggingFace tokenizer |
| Encode/Decode Cycle | pure-tokenizers | 52.50µs | 2,296 B | 45 | Medium text, CGo-free |
Key Performance Characteristics

✅ Advantages of CGo-free approach:

  • No CGo overhead: Eliminates C-Go boundary crossing costs
  • Cross-compilation friendly: No CGo dependencies simplify building
  • Memory safety: Pure Go memory management
  • Deployment simplicity: Single binary with automatic library management

📊 Performance Analysis:

  • Encoding performance: Competitive with CGo implementations, often faster for short texts
  • Memory usage: Higher allocation count due to FFI boundary (16 vs 12 allocs), but predictable patterns
  • Batch processing: Efficient handling of multiple text inputs
  • Platform consistency: Uniform performance across all supported platforms
Advanced Benchmarks
| Feature | Time/op | Memory/op | Allocs/op | Notes |
| --- | --- | --- | --- | --- |
| Batch Processing (5 texts) | 356.00µs | 11,568 B | 261 | Parallel encoding |
| With Options (all attributes) | 34.30µs | 2,160 B | 41 | Full feature set |
| Truncation (128 tokens) | 258.00µs | 5,632 B | 127 | Max length enforcement |
| Padding (256 tokens) | 84.90µs | 16,272 B | 535 | Fixed length output |
| HuggingFace Loading (cached) | 26.20ms | 6.45 MB | 92,188 | Model initialization |
Benchmark Environment
# Run benchmarks locally
make build && go test -bench=. -benchmem

# Compare with different tokenizers
go test -bench=BenchmarkEncode -benchmem
go test -bench=BenchmarkDecode -benchmem

Platform-specific results: Benchmarks run continuously in CI across Linux, macOS, and Windows. See benchmark workflow for automated performance tracking.
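
If you want to produce per-op numbers like the ones above from a plain Go program, `testing.Benchmark` can drive the measurement. The sketch below uses a stand-in function instead of `tokenizer.Encode` so it runs without the native library:

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// encodeStandIn is a placeholder for tokenizer.Encode so the harness
// runs standalone; swap in a real tokenizer call to benchmark the library.
func encodeStandIn(text string) []string {
	return strings.Fields(text)
}

func main() {
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			_ = encodeStandIn("Hello, world! This is a medium-length sentence.")
		}
	})
	// Prints iterations and ns/op, like the table entries above.
	fmt.Println(res.String())
}
```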

Usage Examples

HuggingFace Hub Integration
// Load tokenizer from any public HuggingFace model
tokenizer, err := tokenizers.FromHuggingFace("bert-base-uncased")
tokenizer, err := tokenizers.FromHuggingFace("gpt2")
tokenizer, err := tokenizers.FromHuggingFace("sentence-transformers/all-MiniLM-L6-v2")

// Load from private/gated models with authentication
tokenizer, err := tokenizers.FromHuggingFace("meta-llama/Llama-2-7b-hf",
    tokenizers.WithHFToken(os.Getenv("HF_TOKEN")))

// Configure HuggingFace options
tokenizer, err := tokenizers.FromHuggingFace("bert-base-uncased",
    tokenizers.WithHFToken(token),           // Authentication token
    tokenizers.WithHFRevision("main"),       // Specific revision/branch
    tokenizers.WithHFCacheDir("/custom/cache"), // Custom cache directory
    tokenizers.WithHFTimeout(30*time.Second),   // Download timeout
    tokenizers.WithHFOfflineMode(true),      // Use cached version only
)

// The tokenizer is automatically cached for offline use
// Cache location: ~/.cache/tokenizers/huggingface/ (Linux/macOS)
//                 %APPDATA%/tokenizers/huggingface/ (Windows)
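
For reference, `tokenizer.json` is fetched from the Hub's standard `resolve` endpoint. A small sketch of the URL construction (the path layout follows the public huggingface.co convention; the helper name is illustrative):

```go
package main

import "fmt"

// tokenizerURL builds the Hub "resolve" URL from which tokenizer.json
// is fetched for a given model and revision.
func tokenizerURL(baseURL, modelID, revision string) string {
	return fmt.Sprintf("%s/%s/resolve/%s/tokenizer.json", baseURL, modelID, revision)
}

func main() {
	fmt.Println(tokenizerURL("https://huggingface.co", "bert-base-uncased", "main"))
	// https://huggingface.co/bert-base-uncased/resolve/main/tokenizer.json
}
```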


Basic Tokenization
// Load a tokenizer from file
tokenizer, err := tokenizers.FromFile("path/to/tokenizer.json")
if err != nil {
    log.Fatal(err)
}
defer tokenizer.Close()

// Simple encoding
encoding, err := tokenizer.Encode("Hello, world!")

// With special tokens
encoding, err := tokenizer.Encode("Hello, world!", tokenizers.WithAddSpecialTokens())
Advanced Options
// Encoding with custom options
encoding, err := tokenizer.Encode("Your text here",
    tokenizers.WithAddSpecialTokens(),
    tokenizers.WithReturnTokens(),
    tokenizers.WithReturnAttentionMask(),
    tokenizers.WithReturnTypeIDs(),
)

// Create tokenizer with truncation and padding
tokenizer, err := tokenizers.FromFile("tokenizer.json",
    tokenizers.WithTruncation(512, tokenizers.TruncationDirectionRight, tokenizers.TruncationStrategyLongestFirst),
    tokenizers.WithPadding(true, tokenizers.PaddingStrategy{Tag: tokenizers.PaddingStrategyFixed, FixedSize: 512}),
)

// Access different parts of the encoding result
if encoding.Tokens != nil {
    fmt.Println("Tokens:", encoding.Tokens)
}
if encoding.IDs != nil {
    fmt.Println("Token IDs:", encoding.IDs)
}
if encoding.AttentionMask != nil {
    fmt.Println("Attention mask:", encoding.AttentionMask)
}
Decoding Tokens
// Decode token IDs back to text
ids := []uint32{101, 7592, 1010, 2088, 999, 102}
text, err := tokenizer.Decode(ids, true)
fmt.Println(text)  // "hello, world!"
Loading from Configuration Files
// Load tokenizer from a downloaded tokenizer.json file
tokenizer, err := tokenizers.FromFile("path/to/tokenizer.json")

// Load from byte configuration
configBytes, _ := os.ReadFile("tokenizer.json")
tokenizer, err := tokenizers.FromBytes(configBytes)

// Use with custom library path
tokenizer, err := tokenizers.FromFile("tokenizer.json",
    tokenizers.WithLibraryPath("/custom/path/to/libtokenizers.so"))

Configuration

Environment Variables
| Variable | Description | Default |
| --- | --- | --- |
| TOKENIZERS_LIB_PATH | Custom library path | Auto-detect |
| TOKENIZERS_GITHUB_REPO | GitHub repo for downloads | amikos-tech/pure-tokenizers |
| TOKENIZERS_VERSION | Library version to download | latest |
| GITHUB_TOKEN | GitHub API token (for rate limits) | None |
Library Loading Options
// Use a specific library path
tokenizer, err := tokenizers.FromFile("tokenizer.json",
    tokenizers.WithLibraryPath("/custom/path/to/libtokenizers.so"))

// The library loading priority:
// 1. User-provided path via WithLibraryPath()
// 2. TOKENIZERS_LIB_PATH environment variable
// 3. Cached library in platform directory
// 4. Automatic download from GitHub releases
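
The priority order above can be expressed as a pure function. This is a simplified sketch, not the library's implementation (the real code also validates the cached file before using it):

```go
package main

import "fmt"

// resolveLibraryPath mirrors the documented priority order: user path,
// then environment variable, then cache, then download as a last resort.
func resolveLibraryPath(userPath, envPath, cachedPath string) (path string, mustDownload bool) {
	switch {
	case userPath != "":
		return userPath, false
	case envPath != "":
		return envPath, false
	case cachedPath != "":
		return cachedPath, false
	default:
		return "", true // fall back to downloading from GitHub releases
	}
}

func main() {
	p, download := resolveLibraryPath("", "/opt/libtokenizers.so", "")
	fmt.Println(p, download) // /opt/libtokenizers.so false
}
```
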
Cache Management

For comprehensive cache management documentation, see Cache Management Guide.

Library Cache
// Get the library cache directory
cachePath := tokenizers.GetCachedLibraryPath()

// Clear the library cache
err := tokenizers.ClearLibraryCache()

// Download and cache a specific version
err := tokenizers.DownloadAndCacheLibraryWithVersion("v0.1.0")
HuggingFace Cache
// Get HuggingFace cache information
info, err := tokenizers.GetHFCacheInfo("bert-base-uncased")

// Clear cache for a specific model
err := tokenizers.ClearHFModelCache("bert-base-uncased")

// Clear entire HuggingFace cache
err := tokenizers.ClearHFCache()

// Use offline mode (only use cached tokenizers)
tokenizer, err := tokenizers.FromHuggingFace("bert-base-uncased",
    tokenizers.WithHFOfflineMode(true))

Platform Support

| Platform | Architecture | Binary | Status |
| --- | --- | --- | --- |
| macOS | x86_64 | .dylib | ✅ |
| macOS | aarch64 (M1/M2) | .dylib | ✅ |
| Linux | x86_64 | .so | ✅ |
| Linux | aarch64 | .so | ✅ |
| Linux (musl) | x86_64 | .so | ✅ |
| Linux (musl) | aarch64 | .so | ✅ |
| Windows | x86_64 | .dll | ✅ |

Development

Building from Source
# Install Rust (if not already installed)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh

# Clone the repository
git clone https://github.com/amikos-tech/pure-tokenizers
cd pure-tokenizers

# Build the Rust library
make build

# Run tests
make test

# Run linting
make lint-fix      # Go linting
make lint-rust     # Rust linting
Testing
Unit Tests
# Run all unit tests
make test

# Run with specific library path
make test-lib-path
Integration Tests

Integration tests verify real-world functionality with HuggingFace models:

# Setup for local testing
cp .env.example .env
# Edit .env and add your HF_TOKEN (get from https://huggingface.co/settings/tokens)

# Run all integration tests (requires HF_TOKEN for private models)
make test-integration

# Run only HuggingFace integration tests
make test-integration-hf

The integration tests cover:

  • Public model downloads (BERT, GPT2, DistilBERT)
  • Private model access (with HF_TOKEN)
  • Caching behavior verification
  • Rate limiting handling
  • Offline mode functionality

Note: Integration tests are automatically run in CI for the main branch and PRs with the integration label.

Project Structure
pure-tokenizers/
├── src/           # Rust FFI implementation
├── *.go           # Go bindings
├── download.go    # Auto-download functionality
├── library.go     # Platform-specific FFI loading
└── Makefile       # Build automation
Contributing

We welcome contributions! Please see our Contributing Guidelines for details.

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'feat: add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Built on top of the excellent Hugging Face Tokenizers library.

Documentation

Index

Constants

const (
	GitHubRepo      = "amikos-tech/pure-tokenizers"
	DefaultTag      = "latest"
	DownloadTimeout = 30 * time.Second
)
const (
	HFDefaultRevision = "main"
	HFDefaultTimeout  = 30 * time.Second
	HFMaxRetries      = 3
	HFRetryDelay      = time.Second
	// HFMaxRetryAfterDelay caps the maximum delay from Retry-After headers
	// to prevent excessive waits from misconfigured or malicious servers
	HFMaxRetryAfterDelay = 5 * time.Minute

	// DefaultMaxTokenizerSize is the default maximum size for tokenizer files (500MB)
	// This prevents OOM errors from excessively large downloads
	DefaultMaxTokenizerSize = 500 * 1024 * 1024
)
const (
	SUCCESS                    = 0
	ErrInvalidUTF8             = -1
	ErrEncodingFailed          = -2
	ErrNullOutput              = -3
	ErrInvalidTokenizerRef     = -4
	ErrNullInput               = -5
	ErrTokenizerCreationFailed = -6
	ErrInvalidPath             = -7
	ErrFileNotFound            = -8
	ErrTruncationFailed        = -9
	ErrPaddingFailed           = -10
	ErrDecodeFailed            = -11
	ErrCStringConversionFailed = -12
	ErrInvalidIDs              = -13
	ErrInvalidOptions          = -14
)
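
A typical way to surface such status codes to callers is a code-to-error mapping. The messages below are illustrative, not the library's own strings:

```go
package main

import "fmt"

// codeToError converts an FFI status code into a Go error; 0 means success,
// negative values map to descriptive errors.
func codeToError(code int32) error {
	if code == 0 { // SUCCESS
		return nil
	}
	messages := map[int32]string{
		-1: "invalid UTF-8 input",
		-2: "encoding failed",
		-8: "tokenizer file not found",
	}
	if msg, ok := messages[code]; ok {
		return fmt.Errorf("tokenizers: %s (code %d)", msg, code)
	}
	return fmt.Errorf("tokenizers: unknown error (code %d)", code)
}

func main() {
	fmt.Println(codeToError(0) == nil) // true
	fmt.Println(codeToError(-8))
}
```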
const AbiCompatibilityConstraint = "^0.1.x"

AbiCompatibilityConstraint defines the required version range for ABI compatibility. The library version from Cargo.toml is used as the ABI version. Update this constraint when making breaking changes to the FFI interface.

const LibName = "tokenizers"
const TruncationMaxLengthDefault uintptr = 512 // Default truncation length, can be overridden by user

Variables

var (
	HFHubBaseURL = "https://huggingface.co" // Variable to allow testing with mock server

	// ErrCacheNotFound is returned when a requested cache file does not exist
	ErrCacheNotFound = errors.New("cache file not found")
)

Functions

func ClearHFCache added in v0.1.1

func ClearHFCache() error

ClearHFCache clears all HuggingFace tokenizer cache

func ClearHFCachePattern added in v0.1.1

func ClearHFCachePattern(pattern string) (int, error)

ClearHFCachePattern clears cache entries matching a glob pattern. The pattern is matched against model IDs (e.g., "bert-*", "huggingface/*"). Patterns use standard glob syntax: * matches any sequence, ? matches any single character.

Examples:

  • "bert-*" matches all BERT model variants
  • "huggingface/*" matches all models from the huggingface organization
  • "*/bert-*" matches BERT models from any organization

For security, patterns containing ".." in path segments or absolute paths are rejected. Returns the number of cache entries cleared and any error encountered.

func ClearHFModelCache added in v0.1.1

func ClearHFModelCache(modelID string) error

ClearHFModelCache clears the cache for a specific model

func ClearLibraryCache

func ClearLibraryCache() error

ClearLibraryCache removes the cached library file

func DownloadAndCacheLibrary

func DownloadAndCacheLibrary() error

DownloadAndCacheLibrary downloads and caches the library for the current platform

func DownloadAndCacheLibraryWithVersion

func DownloadAndCacheLibraryWithVersion(version string) error

DownloadAndCacheLibraryWithVersion downloads and caches a specific version of the library

func DownloadLibraryFromGitHub

func DownloadLibraryFromGitHub(destPath string) error

DownloadLibraryFromGitHub downloads the platform-specific library from GitHub releases

func DownloadLibraryFromGitHubWithVersion

func DownloadLibraryFromGitHubWithVersion(destPath, version string) error

DownloadLibraryFromGitHubWithVersion downloads a specific version of the library

func GetAvailableVersions

func GetAvailableVersions() ([]string, error)

GetAvailableVersions fetches available versions from GitHub releases

func GetCachedLibraryPath

func GetCachedLibraryPath() string

GetCachedLibraryPath returns the path where the library would be cached

func GetHFCacheInfo added in v0.1.1

func GetHFCacheInfo(modelID string) (map[string]interface{}, error)

GetHFCacheInfo returns information about the HuggingFace cache for a model

func GetLibraryInfo

func GetLibraryInfo() map[string]interface{}

GetLibraryInfo returns information about the current library setup

func GetLibraryVersion added in v0.1.1

func GetLibraryVersion() string

GetLibraryVersion returns the current library version used in User-Agent

func IsLibraryCached

func IsLibraryCached() bool

IsLibraryCached checks if the library is already cached and valid

func LoadTokenizerLibrary

func LoadTokenizerLibrary(userPath string) (uintptr, error)

LoadTokenizerLibrary loads the tokenizer shared library from the specified path or attempts to find it through various fallback mechanisms:

  1. User-provided path
  2. TOKENIZERS_LIB_PATH environment variable
  3. Cached library in platform-specific directory
  4. Automatic download from GitHub releases

func MasksFromBuf

func MasksFromBuf(buf Buffer) (special, attention []uint32)

func SetLibraryVersion added in v0.1.1

func SetLibraryVersion(version string)

SetLibraryVersion sets the library version for User-Agent headers

func TokensFromBuf

func TokensFromBuf(buf Buffer) []string

Types

type Buffer

type Buffer struct {
	IDs               *uint32
	TypeIDs           *uint32
	SpecialTokensMask *uint32
	AttentionMask     *uint32
	Tokens            **byte
	Offsets           *uintptr
	Len               uintptr
}

type EncodeOption

type EncodeOption func(eo *EncodeOptions) error

func WithAddSpecialTokens

func WithAddSpecialTokens() EncodeOption

func WithReturnAllAttributes

func WithReturnAllAttributes() EncodeOption

func WithReturnAttentionMask

func WithReturnAttentionMask() EncodeOption

func WithReturnOffsets

func WithReturnOffsets() EncodeOption

func WithReturnSpecialTokensMask

func WithReturnSpecialTokensMask() EncodeOption

func WithReturnTokens

func WithReturnTokens() EncodeOption

func WithReturnTypeIDs

func WithReturnTypeIDs() EncodeOption

type EncodeOptions

type EncodeOptions struct {
	AddSpecialTokens        bool
	ReturnTypeIDs           bool
	ReturnTokens            bool
	ReturnSpecialTokensMask bool
	ReturnAttentionMask     bool
	ReturnOffsets           bool
}

type EncodeResult

type EncodeResult struct {
	IDs               []uint32
	TypeIDs           []uint32
	SpecialTokensMask []uint32
	AttentionMask     []uint32
	Tokens            []string
	Offsets           []uint32
}

type GitHubAsset

type GitHubAsset struct {
	Name               string `json:"name"`
	BrowserDownloadURL string `json:"browser_download_url"`
	Digest             string `json:"digest,omitempty"` // Optional field for checksums
}

type GitHubRelease

type GitHubRelease struct {
	TagName string        `json:"tag_name"`
	Assets  []GitHubAsset `json:"assets"`
}

GitHub API structures

type HFConfig added in v0.1.1

type HFConfig struct {
	Token       string
	Revision    string
	CacheDir    string
	Timeout     time.Duration
	MaxRetries  int
	OfflineMode bool
	// UseLocalCache enables checking the HuggingFace hub cache before downloading
	UseLocalCache bool
	// CacheTTL specifies how long cached tokenizers are considered valid (0 = forever)
	CacheTTL time.Duration
	// MaxTokenizerSize is the maximum allowed size for tokenizer files in bytes
	// (env: HF_MAX_TOKENIZER_SIZE, default: 500MB).
	// When set to 0 (zero value), falls back to HF_MAX_TOKENIZER_SIZE environment variable,
	// or DefaultMaxTokenizerSize (500MB) if the environment variable is not set.
	// Use WithHFMaxTokenizerSize to explicitly set this value.
	MaxTokenizerSize int64
	// BaseURL is the base URL for HuggingFace Hub API (defaults to HFHubBaseURL if empty)
	// This is primarily used for testing with mock servers
	BaseURL string

	// HTTP client pooling configuration
	// These settings control connection reuse for improved performance.
	// Config fields take priority over environment variables.
	//
	// IMPORTANT: The HTTP client is initialized once per process using sync.Once.
	// Changes to these configuration values after the first HuggingFace download
	// will NOT take effect. Set these values before any HuggingFace operations.
	//
	// Performance trade-offs:
	// - Higher values: Better connection reuse, reduced latency for subsequent requests, but increased memory usage
	// - Lower values: Reduced memory footprint, but more connection establishment overhead
	//
	// Recommended configurations:
	// - High-throughput services: Increase HTTPMaxIdleConnsPerHost (e.g., 20-50) for parallel downloads
	// - Resource-constrained environments: Reduce both values (e.g., 50/5) to minimize memory usage
	// - Short-lived scripts: Reduce HTTPIdleTimeout (e.g., 10s) to release resources quickly
	//
	// Note: HTTPMaxIdleConns will be automatically adjusted to be >= HTTPMaxIdleConnsPerHost for logical consistency
	//
	// Debug mode: Set DEBUG=1 environment variable to see actual configuration values being used
	HTTPMaxIdleConns        int           // Maximum idle connections across all hosts (env: HF_HTTP_MAX_IDLE_CONNS, default: 100, max: 1000)
	HTTPMaxIdleConnsPerHost int           // Maximum idle connections per host (env: HF_HTTP_MAX_IDLE_CONNS_PER_HOST, default: 10, max: 100)
	HTTPIdleTimeout         time.Duration // How long to keep idle connections open (env: HF_HTTP_IDLE_TIMEOUT, default: 90s)
}

HFConfig holds HuggingFace-specific configuration

type PaddingOptions

type PaddingOptions struct {
	Enabled  bool
	Strategy PaddingStrategy
}

type PaddingStrategy

type PaddingStrategy struct {
	Tag       PaddingStrategyTag
	FixedSize uintptr // Only valid if Tag == PaddingStrategyFixed
}

type PaddingStrategyTag

type PaddingStrategyTag int
const (
	PaddingStrategyBatchLongest PaddingStrategyTag = iota
	PaddingStrategyFixed
)
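
The two tags imply different target lengths when padding a batch: BatchLongest pads to the longest sequence, Fixed pads to FixedSize. The sketch below redeclares the tag type locally so it runs standalone; `targetLength` is an illustrative helper, not part of the package:

```go
package main

import "fmt"

type PaddingStrategyTag int

const (
	PaddingStrategyBatchLongest PaddingStrategyTag = iota
	PaddingStrategyFixed
)

// targetLength computes the length every sequence in a batch is padded to
// under each strategy.
func targetLength(tag PaddingStrategyTag, fixedSize uintptr, lengths []int) int {
	if tag == PaddingStrategyFixed {
		return int(fixedSize)
	}
	longest := 0
	for _, n := range lengths {
		if n > longest {
			longest = n
		}
	}
	return longest
}

func main() {
	lens := []int{5, 9, 3}
	fmt.Println(targetLength(PaddingStrategyBatchLongest, 0, lens)) // 9
	fmt.Println(targetLength(PaddingStrategyFixed, 128, lens))      // 128
}
```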

type StringResult

type StringResult struct {
	String    *string
	ErrorCode int32
}

type Tokenizer

type Tokenizer struct {
	LibraryPath string // Path to the shared library

	TruncationEnabled   bool
	TruncationDirection TruncationDirection
	TruncationStrategy  TruncationStrategy
	TruncationMaxLength uintptr // Maximum length for truncation
	PaddingEnabled      bool
	PaddingStrategy     PaddingStrategy // Strategy for padding
	// contains filtered or unexported fields
}

func FromBytes

func FromBytes(config []byte, opts ...TokenizerOption) (*Tokenizer, error)

func FromFile

func FromFile(configFile string, opts ...TokenizerOption) (*Tokenizer, error)

func FromHuggingFace added in v0.1.1

func FromHuggingFace(modelID string, opts ...TokenizerOption) (*Tokenizer, error)

FromHuggingFace loads a tokenizer from HuggingFace Hub using the model identifier.

The model identifier can be in the format "organization/model" or just "model". For example: "bert-base-uncased", "google/flan-t5-base", "meta-llama/Llama-2-7b-hf".

By default, it loads from the "main" branch/revision. Use WithHFRevision to specify a different revision (branch, tag, or commit hash).

For private or gated models, authentication is required. Set the HF_TOKEN environment variable or use WithHFToken option.

The tokenizer is cached locally for faster subsequent loads. The cache location is platform-specific and can be overridden with WithHFCacheDir.

Example:

tokenizer, err := FromHuggingFace("bert-base-uncased")
if err != nil {
    log.Fatal(err)
}
defer tokenizer.Close()

func (*Tokenizer) Close

func (t *Tokenizer) Close() error

func (*Tokenizer) Decode

func (t *Tokenizer) Decode(ids []uint32, skipSpecialTokens bool) (string, error)

func (*Tokenizer) Encode

func (t *Tokenizer) Encode(message string, opts ...EncodeOption) (*EncodeResult, error)

func (*Tokenizer) EncodePair added in v0.1.2

func (t *Tokenizer) EncodePair(sequence string, pair string, opts ...EncodeOption) (*EncodeResult, error)

EncodePair encodes a single sequence pair. This is a convenience wrapper around EncodePairs for encoding a single pair.

func (*Tokenizer) EncodePairs added in v0.1.2

func (t *Tokenizer) EncodePairs(sequences []string, pairs []string, opts ...EncodeOption) ([]*EncodeResult, error)

EncodePairs encodes multiple sequence pairs in parallel. This is useful for reranking tasks where you need to encode query-document pairs.

func (*Tokenizer) GetLibraryVersion added in v0.1.1

func (t *Tokenizer) GetLibraryVersion() string

GetLibraryVersion returns the version of the tokenizer library

func (*Tokenizer) VocabSize

func (t *Tokenizer) VocabSize() (uint32, error)

type TokenizerOption

type TokenizerOption func(t *Tokenizer) error

func WithHFCacheDir added in v0.1.1

func WithHFCacheDir(dir string) TokenizerOption

WithHFCacheDir sets a custom cache directory for HuggingFace tokenizers

func WithHFCacheTTL added in v0.1.1

func WithHFCacheTTL(ttl time.Duration) TokenizerOption

WithHFCacheTTL sets the cache time-to-live for cached tokenizers

func WithHFMaxTokenizerSize added in v0.1.1

func WithHFMaxTokenizerSize(maxSize int64) TokenizerOption

WithHFMaxTokenizerSize sets the maximum allowed size for tokenizer files in bytes Default is 500MB. Set to a very large value to effectively disable size validation.

func WithHFOfflineMode added in v0.1.1

func WithHFOfflineMode(offline bool) TokenizerOption

WithHFOfflineMode forces the tokenizer to only use cached versions

func WithHFRevision added in v0.1.1

func WithHFRevision(revision string) TokenizerOption

WithHFRevision sets the model revision (branch, tag, or commit hash)

func WithHFTimeout added in v0.1.1

func WithHFTimeout(timeout time.Duration) TokenizerOption

WithHFTimeout sets the download timeout for HuggingFace requests

func WithHFToken added in v0.1.1

func WithHFToken(token string) TokenizerOption

WithHFToken sets the HuggingFace API token for authentication

func WithHFUseLocalCache added in v0.1.1

func WithHFUseLocalCache(useCache bool) TokenizerOption

WithHFUseLocalCache enables or disables checking the HuggingFace hub cache

func WithLibraryPath

func WithLibraryPath(path string) TokenizerOption

WithLibraryPath sets the path to the shared library for the tokenizer. This must be the path to the .so/dylib/dll file that contains the tokenizer implementation.

func WithPadding

func WithPadding(enabled bool, strategy PaddingStrategy) TokenizerOption

func WithTruncation

func WithTruncation(maxLen uintptr, direction TruncationDirection, strategy TruncationStrategy) TokenizerOption

type TokenizerOptions

type TokenizerOptions struct {
	AddSpecialTokens bool
	Trunc            TruncationOptions
	Pad              PaddingOptions
}

type TokenizerResult

type TokenizerResult struct {
	Tokenizer unsafe.Pointer
	ErrorCode int32
}

type TruncationDirection

type TruncationDirection uint8
const (
	TruncationDirectionLeft TruncationDirection = iota
	TruncationDirectionRight
)
const TruncationDirectionDefault TruncationDirection = TruncationDirectionRight

type TruncationOptions

type TruncationOptions struct {
	Enabled   bool
	MaxLen    uintptr
	Strategy  TruncationStrategy
	Direction TruncationDirection
	Stride    uintptr
}

type TruncationStrategy

type TruncationStrategy uint8
const (
	TruncationStrategyLongestFirst TruncationStrategy = iota
	TruncationStrategyOnlyFirst
	TruncationStrategyOnlySecond
)
const TruncationStrategyDefault TruncationStrategy = TruncationStrategyLongestFirst

type VocabSizeResult

type VocabSizeResult struct {
	VocabSize uint32
	ErrorCode int32
}
