consistent-classifier

v0.1.0 (latest) · Published: Oct 10, 2025 · License: MIT

Consistent Classifier

A high-performance Go package for classifying large volumes of unlabeled text data using LLM-powered classification with intelligent caching and label clustering.

Features

Smart Caching: Vector-based similarity search reduces redundant LLM calls by ~80-95% on similar text
Label Clustering: Automatically merges semantically similar labels (e.g., "technical_question" and "tech_support") using Disjoint Set Union (DSU)
Production Ready: Thread-safe, context-aware, with graceful shutdown and persistent state
Pluggable Adapters: Easily swap embedding providers (Voyage AI), vector stores (Pinecone), or LLMs (OpenAI-compatible)
Zero Config: Works out-of-the-box with environment variables, or fully customize every component

Installation

go get github.com/FrenchMajesty/consistent-classifier

Quick Start

Basic Usage

package main

import (
    "context"
    "log"

    "github.com/FrenchMajesty/consistent-classifier/pkg/classifier"
)

func main() {
    // Create classifier with defaults (reads from environment variables)
    clf, err := classifier.NewClassifier(classifier.Config{})
    if err != nil {
        log.Fatal(err)
    }
    defer clf.Close() // Saves state and waits for background tasks

    // Classify text
    result, err := clf.Classify(context.Background(), "Thanks for the help!")
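    if err != nil {
        // Return instead of calling log.Fatal so the deferred Close still runs
        log.Printf("classify failed: %v", err)
        return
    }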

    log.Printf("Label: %s (cache hit: %v, latency: %v)",
        result.Label, result.CacheHit, result.UserFacingLatency)
}

Environment Variables

Set these to use the default adapters:

export VOYAGEAI_API_KEY="your-voyage-key"
export PINECONE_API_KEY="your-pinecone-key"
export PINECONE_HOST="your-index-host.pinecone.io"
export OPENAI_API_KEY="your-openai-key"
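
A startup check can catch a missing variable before any client is constructed. The sketch below is not part of the package: `requiredEnv` and `missingEnv` are hypothetical helpers, and how the package itself reacts to an unset variable is not covered here.

```go
package main

import (
	"fmt"
	"os"
)

// requiredEnv lists the variables named in the section above.
var requiredEnv = []string{
	"VOYAGEAI_API_KEY",
	"PINECONE_API_KEY",
	"PINECONE_HOST",
	"OPENAI_API_KEY",
}

// missingEnv returns the names of required variables that are unset or empty.
// The lookup function is injected so the check can be exercised without
// touching the real environment. (Hypothetical helper, not a package API.)
func missingEnv(lookup func(string) string) []string {
	var missing []string
	for _, name := range requiredEnv {
		if lookup(name) == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	if missing := missingEnv(os.Getenv); len(missing) > 0 {
		fmt.Fprintf(os.Stderr, "missing environment variables: %v\n", missing)
		os.Exit(1)
	}
	fmt.Println("all default-adapter variables are set")
}
```

Running the check before `classifier.NewClassifier` keeps configuration failures separate from network or API failures.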

Advanced Configuration

Custom Clients

import (
    "github.com/FrenchMajesty/consistent-classifier/pkg/adapters"
    "github.com/FrenchMajesty/consistent-classifier/pkg/classifier"
)

// Create custom clients
embeddingClient, _ := adapters.NewVoyageEmbeddingAdapter(nil)
vectorClientLabel, _ := adapters.NewPineconeVectorAdapter(nil, nil, "prod_labels")
vectorClientContent, _ := adapters.NewPineconeVectorAdapter(nil, nil, "prod_content")
llmClient, _ := adapters.NewDefaultLLMClient(nil, "", "gpt-4o-mini", "")

clf, _ := classifier.NewClassifier(classifier.Config{
    EmbeddingClient:      embeddingClient,
    VectorClientLabel:    vectorClientLabel,
    VectorClientContent:  vectorClientContent,
    LLMClient:            llmClient,
    MinSimilarityContent: 0.90, // Higher threshold = fewer cache hits, more precision
    MinSimilarityLabel:   0.75, // Threshold for merging similar labels
    DSUPersistence:       classifier.NewFileDSUPersistence("./labels.bin"),
})
defer clf.Close()

Custom LLM System Prompt

customPrompt := `Classify the following customer support ticket into one of:
- bug_report
- feature_request
- billing_question
- other

Return only the label.`

llmClient, _ := adapters.NewDefaultLLMClient(nil, customPrompt, "gpt-4o", "")

OpenAI-Compatible Providers

Works with any OpenAI-compatible API (e.g., Groq, Azure, local models). The example targets Groq:

llmClient, _ := adapters.NewDefaultLLMClient(
    nil,
    "",  // system prompt
    "llama-3.3-70b-versatile",
    "https://api.groq.com/openai/v1", // base URL
)

How It Works

Embedding Generation: Text is converted to a vector using Voyage AI (or custom provider)
Cache Check: Searches vector store for similar previously-classified text
On Cache Hit: Returns cached label instantly (typically <100ms)
On Cache Miss: Calls LLM for classification, then:
- Stores text embedding for future lookups
- Searches for similar labels and clusters them using DSU
- Stores label embedding for clustering
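
The cache-check part of this flow can be sketched with a toy in-memory store. This is an illustration only, not the package's implementation: `toyCache`, `entry`, and the hand-written vectors are hypothetical stand-ins for Pinecone and Voyage AI, the LLM is a stub function, and the label-clustering step is left out.

```go
package main

import (
	"fmt"
	"math"
)

// entry pairs a stored embedding with the label it was classified as.
type entry struct {
	vec   []float64
	label string
}

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// toyCache is a hypothetical in-memory stand-in for the vector store.
type toyCache struct {
	entries       []entry
	minSimilarity float64
	llmCalls      int
}

// classify looks for a stored neighbour at or above the threshold.
// On a miss it calls the stub LLM and stores the result for future lookups.
func (c *toyCache) classify(vec []float64, llm func() string) (label string, hit bool) {
	best, bestScore := -1, c.minSimilarity
	for i, e := range c.entries {
		if s := cosine(vec, e.vec); s >= bestScore {
			best, bestScore = i, s
		}
	}
	if best >= 0 {
		return c.entries[best].label, true
	}
	c.llmCalls++
	label = llm()
	c.entries = append(c.entries, entry{vec: vec, label: label})
	return label, false
}

func main() {
	c := &toyCache{minSimilarity: 0.90}
	llm := func() string { return "gratitude" }

	l1, hit1 := c.classify([]float64{1, 0.1, 0}, llm)  // first sighting: miss
	l2, hit2 := c.classify([]float64{1, 0.12, 0}, llm) // near-duplicate: hit
	_, hit3 := c.classify([]float64{0, 0, 1}, llm)     // unrelated vector: miss
	fmt.Println(l1, hit1, l2, hit2, hit3, c.llmCalls)
	// prints: gratitude false gratitude true false 2
}
```

Raising `minSimilarity` in the sketch turns more lookups into misses, which mirrors the trade-off noted for `MinSimilarityContent`.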

Label Clustering Example

If the LLM generates these labels across classifications:

"technical_question" → "tech_question" → "technical_support"

The DSU automatically groups them, so future queries return the root label of the cluster, ensuring consistency.
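
A minimal disjoint-set over string labels shows the idea. This is a generic union-find sketch, not the code in utils/disjoint_set: `labelSet` and its methods are hypothetical names, and the choice of which label becomes the root (here, the first argument to `union`) is an assumption.

```go
package main

import "fmt"

// labelSet is a minimal disjoint-set (union-find) keyed by label string.
type labelSet struct {
	parent map[string]string
}

func newLabelSet() *labelSet {
	return &labelSet{parent: map[string]string{}}
}

// find returns the root label of x's cluster, registering x as its own
// root if it is new and compressing the path to the root otherwise.
func (s *labelSet) find(x string) string {
	p, ok := s.parent[x]
	if !ok {
		s.parent[x] = x
		return x
	}
	if p == x {
		return x
	}
	root := s.find(p)
	s.parent[x] = root
	return root
}

// union merges the clusters of a and b; a's root becomes the shared root.
func (s *labelSet) union(a, b string) {
	ra, rb := s.find(a), s.find(b)
	if ra != rb {
		s.parent[rb] = ra
	}
}

func main() {
	s := newLabelSet()
	s.union("technical_question", "tech_question")
	s.union("tech_question", "technical_support")

	fmt.Println(s.find("technical_support")) // technical_question
	fmt.Println(s.find("billing_question"))  // billing_question (its own cluster)
}
```

Because every member resolves to one root, callers see a single label per cluster no matter which variant the LLM produced.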

API Reference

Core Methods

// Create a new classifier
func NewClassifier(cfg Config) (*Classifier, error)

// Classify text and return result
func (c *Classifier) Classify(ctx context.Context, text string) (*Result, error)

// Get current metrics
func (c *Classifier) GetMetrics() Metrics

// Graceful shutdown (waits for background tasks and saves state)
func (c *Classifier) Close() error

Result Structure

type Result struct {
    Label             string        // Classified label
    CacheHit          bool          // Whether result came from cache
    Confidence        float32       // Similarity score (if cache hit)
    UserFacingLatency time.Duration // Time user waited
    BackgroundLatency time.Duration // Time spent on clustering/caching
}

Metrics

type Metrics struct {
    UniqueLabels    int     // Total unique labels seen
    ConvergedLabels int     // Number of label clusters after merging
    CacheHitRate    float32 // Percentage of cache hits
}
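
One way to read these numbers together is the share of unique labels that clustering has folded into another label. The sketch below mirrors the struct locally so it compiles on its own; `mergeRatio` is a hypothetical helper and the sample values are made up for illustration.

```go
package main

import "fmt"

// metrics mirrors the fields of the Metrics struct above so this sketch
// is self-contained; real code would use classifier.Metrics.
type metrics struct {
	UniqueLabels    int
	ConvergedLabels int
	CacheHitRate    float32
}

// mergeRatio returns the fraction of unique labels that were merged into
// another cluster: 1 - ConvergedLabels/UniqueLabels. (Hypothetical helper.)
func mergeRatio(m metrics) float64 {
	if m.UniqueLabels == 0 {
		return 0
	}
	return 1 - float64(m.ConvergedLabels)/float64(m.UniqueLabels)
}

func main() {
	// Illustrative values only, not measured results.
	m := metrics{UniqueLabels: 40, ConvergedLabels: 10, CacheHitRate: 52.5}
	fmt.Printf("merged away: %.0f%% of labels\n", mergeRatio(m)*100)
	// prints: merged away: 75% of labels
}
```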

Production Considerations

Rate Limiting

The package doesn't enforce rate limits. For production use with high volume:

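import (
    "context"

    "golang.org/x/time/rate"

    "github.com/FrenchMajesty/consistent-classifier/pkg/classifier"
)

// RateLimitedLLM wraps an LLM client with your own rate limiter
type RateLimitedLLM struct {
    limiter *rate.Limiter
    client  classifier.LLMClient
}

// Classify waits for the limiter (or ctx cancellation), then delegates to the wrapped client
func (r *RateLimitedLLM) Classify(ctx context.Context, text string) (string, error) {
    if err := r.limiter.Wait(ctx); err != nil {
        return "", err
    }
    return r.client.Classify(ctx, text)
}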

Monitoring

// Poll metrics periodically
ticker := time.NewTicker(30 * time.Second)
go func() {
    for range ticker.C {
        m := clf.GetMetrics()
        // Send to your metrics system (Prometheus, Datadog, etc.)
        log.Printf("Labels: %d/%d, Cache: %.1f%%",
            m.ConvergedLabels, m.UniqueLabels, m.CacheHitRate)
    }
}()
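
In a long-running service the polling goroutine usually needs a way to stop. A possible variant, sketched with only the standard library, ties the loop to a context and stops the ticker on exit. `poll` and `demo` are hypothetical names, and the report callback stands in for the `clf.GetMetrics()` call.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// poll calls report on every tick until ctx is cancelled, then stops the
// ticker and returns.
func poll(ctx context.Context, every time.Duration, report func()) {
	ticker := time.NewTicker(every)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			report()
		}
	}
}

// demo polls every 10ms for 100ms and returns how many reports fired.
func demo() int {
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	reports := 0
	poll(ctx, 10*time.Millisecond, func() {
		// Real code would read clf.GetMetrics() here and forward the values.
		reports++
	})
	return reports
}

func main() {
	fmt.Println("reports sent:", demo())
}
```

In real use, `poll` would run in its own goroutine with the same context that governs the rest of the service, so that it exits before `clf.Close()` is called.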

Namespace Isolation

For multiple instances or environments, use unique Pinecone namespaces:

vectorLabel, _ := adapters.NewPineconeVectorAdapter(nil, nil, "prod_labels_v2")
vectorContent, _ := adapters.NewPineconeVectorAdapter(nil, nil, "prod_content_v2")

Testing

# Run all tests
go test ./...

# Run with coverage
go test -cover ./pkg/...

# Run benchmarks
go test -bench=. ./...

Example: Classify Replies to Tweets

See cmd/benchmark/vectorize.go for a full example of classifying thousands of tweet replies.

Performance

On a dataset of 10,000 tweet replies in a specific niche (using cmd/benchmark):

Cache hit rate: 25% after the first 500 replies, rising above 50% by 2,000
Avg latency (cache hit): <200ms
Avg latency (cache miss): ~1-2s (LLM dependent)
Cost reduction: 90%+ fewer LLM calls vs naive classification

Architecture

pkg/
├── classifier/         # Core classification logic
│   ├── classifier.go   # Main Classifier implementation
│   ├── config.go       # Configuration and defaults
│   ├── interfaces.go   # Client interfaces
│   └── types.go        # Result and Metrics types
├── adapters/           # External service adapters
│   ├── adapters.go     # Voyage and Pinecone adapters
│   ├── llm_client.go   # OpenAI adapter
│   ├── openai/         # OpenAI client implementation
│   ├── pinecone/       # Pinecone client implementation
│   └── voyage/         # Voyage AI client implementation
└── types/              # Shared types

utils/disjoint_set/     # DSU implementation for label clustering

Contributing

Contributions welcome! Please open an issue or PR.

License

MIT License - see LICENSE for details.

Directories

cmd/benchmark (command)
internal/retry
pkg/adapters
pkg/adapters/openai
pkg/adapters/pinecone
pkg/adapters/voyage
pkg/classifier
pkg/testutil
pkg/types
utils/disjoint_set
