Documentation ¶
Index ¶
- Constants
- func AddBias(out, bias []float32, rows, cols int)
- func EnsureModel(ctx context.Context, repo, model string, onProgress func(string)) error
- func GELU(x []float32)
- func L2Normalize(x []float32)
- func LayerNorm(x, weight, bias []float32, n, dim int, eps float64)
- func MatMul(a, bT []float32, M, K, N int, out []float32)
- func MatMulAdd(a, bT []float32, M, K, N int, bias, out []float32)
- func ModelDir(model string) string
- func Softmax(x []float32, rows, cols int)
- func SoftmaxMasked(x []float32, rows, cols int, mask []int32)
- func ZeroSlice(x []float32)
- type AttentionWeights
- type EmbeddingWeights
- type EncoderLayer
- type LayerNormWeights
- type LinearWeights
- type Model
- type ModelConfig
- type Provider
- type SafeTensors
- type Scratch
- type Tokenizer
Constants ¶
const DefaultMaxLen = 512
DefaultMaxLen is the default maximum sequence length for BERT models.
const DefaultModel = "bge-small-en-v1.5"
DefaultModel is the default BERT model shipped with Gramaton.
const DefaultModelRepo = "BAAI/bge-small-en-v1.5"
DefaultModelRepo is the HuggingFace repository for the default model.
Functions ¶
func EnsureModel ¶
func EnsureModel(ctx context.Context, repo, model string, onProgress func(string)) error
EnsureModel checks whether the model files exist locally and, if not, downloads them from the HuggingFace Hub. It always verifies every file's SHA256 against its sidecar before returning, which catches on-disk corruption between runs.
Integrity model:
- Downloads verify Content-Length and write a SHA256 sidecar on success.
- Subsequent loads recompute the hash and compare against the sidecar.
- Mismatch quarantines the bad file (renamed to .suspect.<unix-ts>) and returns an error; restarting will re-download cleanly while preserving the suspect bytes for forensic analysis.
- A file present without a sidecar (e.g., manually placed) bootstraps the sidecar from its current contents and logs a warning.
This is trust-on-first-use: the first download is whatever HF serves. Subsequent corruption, truncation, or tampering is caught.
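The hash-and-compare step of the integrity model can be sketched as follows. This is a minimal illustration, not the package's code: hashHex and checkSidecar are hypothetical names, and the sidecar layout (a bare hex digest followed by optional whitespace) is an assumption. The sketch hashes an in-memory buffer; quarantine and sidecar bootstrapping are left out.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strings"
)

// hashHex returns the lowercase hex SHA256 digest of data.
func hashHex(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// checkSidecar recomputes the digest of data and compares it with the
// sidecar contents. The sidecar format is an assumption of this sketch.
func checkSidecar(data []byte, sidecar string) error {
	want := strings.TrimSpace(sidecar)
	if got := hashHex(data); got != want {
		return fmt.Errorf("sha256 mismatch: got %s, want %s", got, want)
	}
	return nil
}

func main() {
	payload := []byte("model bytes")
	// Written once, after a successful download (trust on first use).
	sidecar := hashHex(payload) + "\n"

	fmt.Println("clean load error:", checkSidecar(payload, sidecar))
	fmt.Println("corruption detected:", checkSidecar([]byte("model bytez"), sidecar) != nil)
}
```

For multi-megabyte weight files, a streaming variant (io.Copy into sha256.New()) would avoid holding the whole file in memory.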
func GELU ¶
func GELU(x []float32)
GELU applies the Gaussian Error Linear Unit activation in-place. Uses the tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
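A minimal sketch of the tanh approximation above, written as a standalone function. geluTanh is a hypothetical name; computing in float64 before converting back to float32 is a choice made for this sketch and may differ from the package.

```go
package main

import (
	"fmt"
	"math"
)

// geluTanh applies the tanh approximation of GELU to x in place:
// 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
func geluTanh(x []float32) {
	c := math.Sqrt(2 / math.Pi)
	for i, v := range x {
		f := float64(v)
		x[i] = float32(0.5 * f * (1 + math.Tanh(c*(f+0.044715*f*f*f))))
	}
}

func main() {
	x := []float32{-3, -1, 0, 1, 3}
	geluTanh(x)
	fmt.Println(x)
}
```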
func L2Normalize ¶
func L2Normalize(x []float32)
L2Normalize normalizes a vector in-place to unit length.
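As an illustration, unit-length normalization can be sketched like this. l2Normalize is a hypothetical stand-in, and leaving an all-zero vector untouched is an assumption; the package may handle that case differently.

```go
package main

import (
	"fmt"
	"math"
)

// l2Normalize scales x in place so that its Euclidean norm is 1.
// Assumption: an all-zero vector is left unchanged to avoid dividing by zero.
func l2Normalize(x []float32) {
	var sum float64
	for _, v := range x {
		sum += float64(v) * float64(v)
	}
	if sum == 0 {
		return
	}
	inv := float32(1 / math.Sqrt(sum))
	for i := range x {
		x[i] *= inv
	}
}

func main() {
	v := []float32{3, 4}
	l2Normalize(v)
	fmt.Println(v) // components close to 0.6 and 0.8
}
```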
func LayerNorm ¶
func LayerNorm(x, weight, bias []float32, n, dim int, eps float64)
LayerNorm applies layer normalization in-place over the last dimension. x is [n, dim]. weight and bias are [dim]. eps is typically 1e-12 for BERT.
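The per-row computation can be sketched as below: subtract the row mean, divide by the square root of the variance plus eps, then scale by weight and shift by bias. layerNorm here is a hypothetical reference version that accumulates in float64; it is not the package's implementation.

```go
package main

import (
	"fmt"
	"math"
)

// layerNorm normalizes each row of x ([n, dim], row-major) in place,
// then applies the per-feature weight and bias ([dim] each).
func layerNorm(x, weight, bias []float32, n, dim int, eps float64) {
	for r := 0; r < n; r++ {
		row := x[r*dim : (r+1)*dim]

		var mean float64
		for _, v := range row {
			mean += float64(v)
		}
		mean /= float64(dim)

		var variance float64
		for _, v := range row {
			d := float64(v) - mean
			variance += d * d
		}
		variance /= float64(dim)

		inv := 1 / math.Sqrt(variance+eps)
		for i, v := range row {
			row[i] = float32((float64(v)-mean)*inv)*weight[i] + bias[i]
		}
	}
}

func main() {
	x := []float32{1, 2, 3, 5}
	layerNorm(x, []float32{1, 1}, []float32{0, 0}, 2, 2, 1e-12)
	fmt.Println(x) // each row is close to [-1 1]
}
```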
func MatMul ¶
func MatMul(a, bT []float32, M, K, N int, out []float32)
MatMul computes C = A * B^T where A is [M, K] and bT is [N, K] (transposed). Output is written to out [M, N] which must be pre-allocated and zeroed.
On amd64 with AVX2 + FMA3, dispatches the 4x4 tile body to an assembly kernel for ~4-6x speedup over pure Go on BERT-sized matrices. Falls back to the pure-Go tiled implementation for matrices too small to benefit, for remainder rows/columns, and for pre-Haswell/Rosetta hosts without AVX2.
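For reference, the [M, K] x [N, K] layout can be sketched with a naive triple loop. matMulRef is a hypothetical name; it is untiled and has no assembly dispatch. Accumulating into out with += is an assumption made here to show why a pre-zeroed output would matter.

```go
package main

import "fmt"

// matMulRef computes out += A * B^T for row-major A [M, K] and bT [N, K].
// Because bT is stored transposed, both inner operands are read contiguously.
// Assumption: out is accumulated into, so the caller zeroes it first.
func matMulRef(a, bT []float32, M, K, N int, out []float32) {
	for i := 0; i < M; i++ {
		for j := 0; j < N; j++ {
			var acc float32
			for k := 0; k < K; k++ {
				acc += a[i*K+k] * bT[j*K+k]
			}
			out[i*N+j] += acc
		}
	}
}

func main() {
	a := []float32{1, 2, 3, 4}  // A  = [[1 2] [3 4]]
	bT := []float32{5, 6, 7, 8} // bT = [[5 6] [7 8]], i.e. B = [[5 7] [6 8]]
	out := make([]float32, 4)   // make returns zeroed memory
	matMulRef(a, bT, 2, 2, 2, out)
	fmt.Println(out) // [17 23 39 53]
}
```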
func MatMulAdd ¶
func MatMulAdd(a, bT []float32, M, K, N int, bias, out []float32)
MatMulAdd computes out = A * B^T + bias, where bias is [N] and broadcast across rows. out must be pre-zeroed before the matmul.
func ModelDir ¶
func ModelDir(model string) string
ModelDir returns the cache directory for a model. Default: ~/.gramaton/models/<model>/
func Softmax ¶
func Softmax(x []float32, rows, cols int)
Softmax applies row-wise softmax in-place. x is [rows, cols]. Numerically stable: subtracts row max before exponentiation.
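The max-subtraction trick can be illustrated with a short sketch. softmaxRows is a hypothetical name, and the sketch assumes cols > 0. Subtracting the row maximum keeps every exponent at or below zero, so exp cannot overflow even for large logits.

```go
package main

import (
	"fmt"
	"math"
)

// softmaxRows applies a numerically stable softmax to each row of x
// ([rows, cols], row-major) in place. Assumes cols > 0.
func softmaxRows(x []float32, rows, cols int) {
	for r := 0; r < rows; r++ {
		row := x[r*cols : (r+1)*cols]

		rowMax := row[0]
		for _, v := range row[1:] {
			if v > rowMax {
				rowMax = v
			}
		}

		var sum float64
		for i, v := range row {
			e := math.Exp(float64(v - rowMax)) // exponent is <= 0
			row[i] = float32(e)
			sum += e
		}

		inv := float32(1 / sum)
		for i := range row {
			row[i] *= inv
		}
	}
}

func main() {
	x := []float32{1000, 1000, 0, 0} // exp(1000) alone would overflow
	softmaxRows(x, 2, 2)
	fmt.Println(x)
}
```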
func SoftmaxMasked ¶
func SoftmaxMasked(x []float32, rows, cols int, mask []int32)
SoftmaxMasked applies row-wise softmax in-place with an attention mask. x is [rows, cols]. mask is [cols] where 0 means masked (set to -inf before softmax).
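A sketch of the masked variant follows. softmaxMasked is a hypothetical reference version: masked columns are set to -inf so they contribute exp(-inf) = 0. Zeroing a row in which every column is masked is an assumption of this sketch; the package's behavior for that case is not documented here.

```go
package main

import (
	"fmt"
	"math"
)

// softmaxMasked applies row-wise softmax in place; columns where mask is 0
// receive zero probability. mask is [cols] and shared by all rows.
func softmaxMasked(x []float32, rows, cols int, mask []int32) {
	negInf := float32(math.Inf(-1))
	for r := 0; r < rows; r++ {
		row := x[r*cols : (r+1)*cols]

		rowMax := negInf
		for i := range row {
			if mask[i] == 0 {
				row[i] = negInf
			}
			if row[i] > rowMax {
				rowMax = row[i]
			}
		}
		if math.IsInf(float64(rowMax), -1) {
			// Assumption: a fully masked row becomes all zeros.
			for i := range row {
				row[i] = 0
			}
			continue
		}

		var sum float64
		for i, v := range row {
			e := math.Exp(float64(v - rowMax)) // masked entries yield 0
			row[i] = float32(e)
			sum += e
		}
		inv := float32(1 / sum)
		for i := range row {
			row[i] *= inv
		}
	}
}

func main() {
	x := []float32{1, 1, 5}
	softmaxMasked(x, 1, 3, []int32{1, 1, 0})
	fmt.Println(x) // the masked third column gets probability 0
}
```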
Types ¶
type AttentionWeights ¶
type AttentionWeights struct {
Q LinearWeights // [HiddenSize, HiddenSize]
K LinearWeights
V LinearWeights
O LinearWeights // output projection
}
AttentionWeights holds Q, K, V projection weights and the output projection for multi-head self-attention.
type EmbeddingWeights ¶
type EmbeddingWeights struct {
Word []float32 // [VocabSize, HiddenSize]
Position []float32 // [MaxPositionEmbeds, HiddenSize]
TokenType []float32 // [2, HiddenSize]
LNWeight []float32 // [HiddenSize]
LNBias []float32 // [HiddenSize]
}
EmbeddingWeights holds the token, position, and segment embedding tables plus the post-embedding layer norm.
type EncoderLayer ¶
type EncoderLayer struct {
Attn AttentionWeights
AttnLN LayerNormWeights
FFNUp LinearWeights // HiddenSize -> IntermediateSize
FFNDown LinearWeights // IntermediateSize -> HiddenSize
FFNLN LayerNormWeights
}
EncoderLayer holds weights for one transformer encoder layer.
type LayerNormWeights ¶
LayerNormWeights holds weight and bias for layer normalization.
type LinearWeights ¶
type LinearWeights struct {
Weight []float32 // [Out, In] -- already transposed for MatMul
Bias []float32 // [Out]
}
LinearWeights holds weight and bias for a linear layer. Weight is stored as [OutFeatures, InFeatures] (transposed for MatMul).
type Model ¶
type Model struct {
Config ModelConfig
Embedding EmbeddingWeights
Layers []EncoderLayer
}
Model holds the weights and configuration for BERT inference. Weights are backed by mmap'd safetensors data (zero-copy).
Read-only after LoadModel. Concurrent Forward calls are safe when each caller supplies its own Scratch.
func LoadModel ¶
func LoadModel(st *SafeTensors, cfg ModelConfig) (*Model, error)
LoadModel loads a BERT model from a safetensors file using the given config.
func (*Model) Forward ¶
Forward runs the BERT encoder and returns the CLS embedding (L2-normalized). tokenIDs and attentionMask must be the same length (<= MaxPositionEmbeds).
The caller supplies a Scratch sized for the model's MaxPositionEmbeds. Each goroutine calling Forward concurrently must use its own Scratch.
type ModelConfig ¶
type ModelConfig struct {
HiddenSize int `json:"hidden_size"`
NumAttentionHeads int `json:"num_attention_heads"`
IntermediateSize int `json:"intermediate_size"`
NumHiddenLayers int `json:"num_hidden_layers"`
MaxPositionEmbeds int `json:"max_position_embeddings"`
VocabSize int `json:"vocab_size"`
LayerNormEps float64 `json:"layer_norm_eps"`
}
ModelConfig holds the BERT model hyperparameters, typically loaded from config.json in the model directory.
func ParseModelConfig ¶
func ParseModelConfig(data []byte) (ModelConfig, error)
ParseModelConfig reads a HuggingFace config.json file.
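A minimal sketch of such a parser, assuming it is a thin wrapper around encoding/json. modelConfig and parseModelConfig are local stand-ins that mirror the documented fields and tags; the sanity check and the sample values (including vocab_size) are illustrative assumptions, not taken from the package.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// modelConfig mirrors the documented ModelConfig fields and JSON tags.
type modelConfig struct {
	HiddenSize        int     `json:"hidden_size"`
	NumAttentionHeads int     `json:"num_attention_heads"`
	IntermediateSize  int     `json:"intermediate_size"`
	NumHiddenLayers   int     `json:"num_hidden_layers"`
	MaxPositionEmbeds int     `json:"max_position_embeddings"`
	VocabSize         int     `json:"vocab_size"`
	LayerNormEps      float64 `json:"layer_norm_eps"`
}

// sampleConfig is an illustrative config.json; unknown keys are ignored.
const sampleConfig = `{
  "model_type": "bert",
  "hidden_size": 384,
  "num_attention_heads": 12,
  "intermediate_size": 1536,
  "num_hidden_layers": 12,
  "max_position_embeddings": 512,
  "vocab_size": 30522,
  "layer_norm_eps": 1e-12
}`

// parseModelConfig decodes config.json and applies a basic sanity check.
// The check is an assumption of this sketch.
func parseModelConfig(data []byte) (modelConfig, error) {
	var cfg modelConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return modelConfig{}, fmt.Errorf("parse config.json: %w", err)
	}
	if cfg.HiddenSize <= 0 || cfg.NumAttentionHeads <= 0 ||
		cfg.HiddenSize%cfg.NumAttentionHeads != 0 {
		return modelConfig{}, fmt.Errorf("invalid config: hidden_size=%d, num_attention_heads=%d",
			cfg.HiddenSize, cfg.NumAttentionHeads)
	}
	return cfg, nil
}

func main() {
	cfg, err := parseModelConfig([]byte(sampleConfig))
	fmt.Println(cfg.HiddenSize, cfg.NumHiddenLayers, err)
}
```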
type Provider ¶
type Provider struct {
// contains filtered or unexported fields
}
Provider implements embed.Provider using a pure Go BERT inference engine. Default model is bge-small-en-v1.5 (384-dim, 12-layer BERT encoder).
Thread-safety (RWMutex pattern):
- Embed takes RLock for the duration of each per-text Encode + Forward. Multiple goroutines can hold RLocks concurrently; the model is read-only after LoadModel and each Embed iteration uses its own Scratch from the pool, so concurrent Forward is safe.
- Close takes the full Lock and blocks until every in-flight RLock holder releases. After Close returns, model/tokenizer/scratchPool/st are all nil; Embed checks under RLock and returns "bert: provider closed" cleanly instead of segfaulting.
Critical: the RLock must wrap BOTH the nil-check AND Encode + Forward. Releasing RLock between them would let Close Munmap the safetensors region while Forward is mid-read of float32 slices that point into mmap'd memory.
scratchPool holds Scratch instances reused across Forward calls. Each Embed iteration acquires a Scratch from the pool, runs Forward, returns the Scratch to the pool. Concurrent Embed goroutines each get their own Scratch instance.
Memory bound: each Scratch is ~14MB at maxSeq=512, hidden=384, intermediate=1536, heads=12. The pool grows under contention and shrinks during idle (sync.Pool semantics; entries are GC-eligible when not referenced). Under inner-loop fanout, peak live Scratches per Provider is bounded by maxWorkers (default min(GOMAXPROCS, 8), so at most 8 x ~14MB = ~112MB), plus one per concurrent caller goroutine holding RLock.
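The locking discipline described above can be sketched with a toy type. Everything here is hypothetical: toyProvider stands in for Provider, its weights slice plays the role of the mmap'd region, and toyScratch plays the role of Scratch. The point illustrated is that the read lock covers both the nil check and the read of the weights.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errClosed = errors.New("toy: provider closed")

// toyScratch is a reusable per-call buffer.
type toyScratch struct{ buf []float32 }

type toyProvider struct {
	mu      sync.RWMutex
	weights []float32 // stands in for slices pointing into mmap'd memory
	pool    *sync.Pool
}

func newToyProvider() *toyProvider {
	return &toyProvider{
		weights: []float32{1, 2, 3},
		pool:    &sync.Pool{New: func() any { return &toyScratch{buf: make([]float32, 3)} }},
	}
}

// embed holds RLock across the nil check AND the read of weights, so close
// cannot release the backing memory in between.
func (p *toyProvider) embed(scale float32) ([]float32, error) {
	p.mu.RLock()
	defer p.mu.RUnlock()
	if p.weights == nil {
		return nil, errClosed
	}
	s := p.pool.Get().(*toyScratch)
	defer p.pool.Put(s)
	for i, w := range p.weights {
		s.buf[i] = w * scale // every slot is written before it is read
	}
	out := make([]float32, len(s.buf))
	copy(out, s.buf)
	return out, nil
}

// close takes the write lock, which waits for all in-flight embed calls.
func (p *toyProvider) close() {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.weights = nil
	p.pool = nil
}

func main() {
	p := newToyProvider()
	var wg sync.WaitGroup
	for i := 1; i <= 4; i++ {
		wg.Add(1)
		go func(scale float32) {
			defer wg.Done()
			if _, err := p.embed(scale); err != nil {
				fmt.Println("unexpected:", err)
			}
		}(float32(i))
	}
	wg.Wait()
	p.close()
	_, err := p.embed(1)
	fmt.Println(err)
}
```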
func New ¶
func New(cfg config.EmbeddingConfig) (*Provider, error)
New creates a BERT embedding provider. Downloads the model from HuggingFace on first use if not cached locally.
func (*Provider) Close ¶
Close releases the mmap'd safetensors file. Takes the full write Lock; blocks until every concurrent Embed holding RLock has released. Without this guard, a concurrent Forward could read float32 slices that point into the mmap'd region after Munmap, causing a segfault.
Callers must NOT call Embed after Close returns; the model, tokenizer, and scratchPool fields are zeroed so that subsequent misuse returns "bert: provider closed" rather than causing silent corruption.
func (*Provider) ContextWindow ¶
ContextWindow returns the model's maximum sequence length in tokens.
func (*Provider) Embed ¶
Embed generates embeddings for the given texts. Returns one vector per input text in the same order. Returns nil, nil for empty input.
Concurrency model:
- Single text (most common path; called from chunking and search in tight loops): runs inline without spawning goroutines.
- Multiple texts: bounded errgroup fanout. Each text's Encode + Forward runs in its own goroutine, holding RLock for the duration. Worker count bounded by Provider.maxWorkers (default min(GOMAXPROCS, 8)).
On any goroutine error (provider closed, ctx cancelled), the errgroup's context is cancelled, in-flight goroutines exit at their next ctx check, and Embed returns (nil, err).
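The fanout shape can be approximated with the standard library alone. The sketch below swaps errgroup for a buffered-channel semaphore, a sync.WaitGroup, and a cancellable context, so it is not the package's code. fanout, errDemo, and the work callback are hypothetical; a real implementation would derive the context from the caller's ctx rather than context.Background().

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

var errDemo = errors.New("demo failure")

// fanout runs work on every text with at most maxWorkers goroutines in
// flight and returns results in input order, or (nil, first error).
func fanout(texts []string, maxWorkers int, work func(string) (int, error)) ([]int, error) {
	if len(texts) == 0 {
		return nil, nil
	}
	if len(texts) == 1 { // single-text path: run inline, no goroutines
		v, err := work(texts[0])
		if err != nil {
			return nil, err
		}
		return []int{v}, nil
	}
	if maxWorkers < 1 {
		maxWorkers = 1
	}
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	out := make([]int, len(texts))
	sem := make(chan struct{}, maxWorkers)
	var (
		wg       sync.WaitGroup
		once     sync.Once
		firstErr error
	)
	for i, t := range texts {
		sem <- struct{}{} // blocks while maxWorkers goroutines are running
		if ctx.Err() != nil {
			<-sem
			break
		}
		wg.Add(1)
		go func(i int, t string) {
			defer wg.Done()
			defer func() { <-sem }()
			if ctx.Err() != nil { // exit early once a sibling has failed
				return
			}
			v, err := work(t)
			if err != nil {
				once.Do(func() { firstErr = err; cancel() })
				return
			}
			out[i] = v // each goroutine writes only its own slot
		}(i, t)
	}
	wg.Wait()
	if firstErr != nil {
		return nil, firstErr
	}
	return out, nil
}

func main() {
	lengths, err := fanout([]string{"a", "bb", "ccc"}, 2, func(s string) (int, error) {
		return len(s), nil
	})
	fmt.Println(lengths, err)
}
```

Writing each result to its own index keeps output order stable regardless of which goroutine finishes first.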
type SafeTensors ¶
type SafeTensors struct {
// contains filtered or unexported fields
}
SafeTensors provides zero-copy access to tensors stored in the HuggingFace safetensors format. The file is mmap'd read-only; float32 tensor data is accessed directly without copying.
Format: [8-byte header_len (uint64 LE)] [JSON header] [tensor data]
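To make the layout concrete, here is a sketch that parses the header from an in-memory buffer. parseHeader, readF32, buildFile, and tensorInfo are hypothetical names. The per-tensor JSON keys (dtype, shape, data_offsets relative to the start of the data region) follow the published safetensors format. Unlike the package, this sketch copies values out instead of aliasing mmap'd memory.

```go
package main

import (
	"encoding/binary"
	"encoding/json"
	"fmt"
	"math"
)

type tensorInfo struct {
	DType       string   `json:"dtype"`
	Shape       []int    `json:"shape"`
	DataOffsets [2]int64 `json:"data_offsets"` // [begin, end) within the data region
}

// parseHeader returns the tensor index and the data region that follows it.
func parseHeader(buf []byte) (map[string]tensorInfo, []byte, error) {
	if len(buf) < 8 {
		return nil, nil, fmt.Errorf("file too short for header length")
	}
	n := binary.LittleEndian.Uint64(buf[:8])
	if n > uint64(len(buf)-8) {
		return nil, nil, fmt.Errorf("header length %d exceeds file size", n)
	}
	hdrEnd := 8 + int(n)
	var raw map[string]json.RawMessage
	if err := json.Unmarshal(buf[8:hdrEnd], &raw); err != nil {
		return nil, nil, fmt.Errorf("decode header: %w", err)
	}
	index := make(map[string]tensorInfo, len(raw))
	for name, msg := range raw {
		if name == "__metadata__" { // free-form metadata, not a tensor
			continue
		}
		var ti tensorInfo
		if err := json.Unmarshal(msg, &ti); err != nil {
			return nil, nil, fmt.Errorf("decode tensor %q: %w", name, err)
		}
		index[name] = ti
	}
	return index, buf[hdrEnd:], nil
}

// readF32 decodes a little-endian F32 tensor by copying its values.
func readF32(index map[string]tensorInfo, data []byte, name string) ([]float32, error) {
	ti, ok := index[name]
	if !ok {
		return nil, fmt.Errorf("tensor %q not found", name)
	}
	if ti.DType != "F32" {
		return nil, fmt.Errorf("tensor %q has dtype %s, want F32", name, ti.DType)
	}
	begin, end := ti.DataOffsets[0], ti.DataOffsets[1]
	if begin < 0 || end < begin || end > int64(len(data)) || (end-begin)%4 != 0 {
		return nil, fmt.Errorf("tensor %q has invalid offsets", name)
	}
	out := make([]float32, (end-begin)/4)
	for i := range out {
		bits := binary.LittleEndian.Uint32(data[begin+int64(i)*4:])
		out[i] = math.Float32frombits(bits)
	}
	return out, nil
}

// buildFile assembles a tiny safetensors-shaped buffer for the demo.
func buildFile(header string, values []float32) []byte {
	buf := make([]byte, 8, 8+len(header)+4*len(values))
	binary.LittleEndian.PutUint64(buf, uint64(len(header)))
	buf = append(buf, header...)
	var tmp [4]byte
	for _, v := range values {
		binary.LittleEndian.PutUint32(tmp[:], math.Float32bits(v))
		buf = append(buf, tmp[:]...)
	}
	return buf
}

func main() {
	header := `{"__metadata__":{"format":"pt"},"w":{"dtype":"F32","shape":[2],"data_offsets":[0,8]}}`
	index, data, err := parseHeader(buildFile(header, []float32{1.5, -2}))
	if err != nil {
		fmt.Println(err)
		return
	}
	vals, err := readF32(index, data, "w")
	fmt.Println(vals, err)
}
```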
func OpenSafeTensors ¶
func OpenSafeTensors(path string) (*SafeTensors, error)
OpenSafeTensors opens a safetensors file via mmap for zero-copy access.
func (*SafeTensors) Close ¶
func (st *SafeTensors) Close() error
Close unmaps the file and releases resources.
func (*SafeTensors) GetFloat32 ¶
func (st *SafeTensors) GetFloat32(name string) ([]float32, []int, error)
GetFloat32 returns a float32 slice backed by the mmap'd data for the named tensor. The returned slice is valid until Close is called. The tensor must have dtype "F32".
func (*SafeTensors) Has ¶
func (st *SafeTensors) Has(name string) bool
Has reports whether the named tensor exists.
func (*SafeTensors) Names ¶
func (st *SafeTensors) Names() []string
Names returns all tensor names (excluding __metadata__).
type Scratch ¶
type Scratch struct {
// contains filtered or unexported fields
}
Scratch holds pre-allocated buffers used by Forward. Each Forward call writes every buffer location before reading (verified in Layer A audit), so a recycled Scratch is safe to reuse across calls without zeroing.
Each goroutine running Forward must supply its own Scratch. Concurrent reuse of the same Scratch instance corrupts output.
func NewScratch ¶
func NewScratch(maxSeq int, cfg ModelConfig) *Scratch
NewScratch allocates a Scratch sized for the given configuration. Sized for max sequence length so the same Scratch handles any input up to MaxPositionEmbeds tokens.
type Tokenizer ¶
type Tokenizer struct {
// contains filtered or unexported fields
}
Tokenizer implements BERT's WordPiece tokenization pipeline. Supports loading from HuggingFace tokenizer.json or plain vocab.txt.
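The core of WordPiece is a greedy longest-match-first split of each word, with continuation pieces prefixed by "##". The sketch below shows only that step for a single already-normalized word. wordPiece, the toy vocabulary, and the unk parameter are hypothetical; normalization, pre-tokenization, and special tokens are not shown.

```go
package main

import "fmt"

// wordPiece splits one word into vocabulary pieces using greedy
// longest-match-first. If any position cannot be matched, the whole word
// maps to the unknown token.
func wordPiece(word string, vocab map[string]bool, unk string) []string {
	runes := []rune(word)
	var pieces []string
	for start := 0; start < len(runes); {
		end := len(runes)
		match := ""
		for end > start {
			cand := string(runes[start:end])
			if start > 0 {
				cand = "##" + cand // continuation of a word
			}
			if vocab[cand] {
				match = cand
				break
			}
			end-- // try a shorter candidate
		}
		if match == "" {
			return []string{unk}
		}
		pieces = append(pieces, match)
		start = end
	}
	return pieces
}

func main() {
	vocab := map[string]bool{"un": true, "##aff": true, "##able": true}
	fmt.Println(wordPiece("unaffable", vocab, "[UNK]")) // [un ##aff ##able]
	fmt.Println(wordPiece("xyz", vocab, "[UNK]"))       // [[UNK]]
}
```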
func NewTokenizerFromJSON ¶
NewTokenizerFromJSON parses a HuggingFace tokenizer.json file and returns a configured Tokenizer. This is the preferred loading method as it captures the model's exact normalizer and vocab settings.
func NewTokenizerFromVocab ¶
NewTokenizerFromVocab parses a plain vocab.txt file (one token per line, indexed by line number). Uses default BERT normalizer settings.
func (*Tokenizer) Encode ¶
Encode tokenizes a text string and returns input tensors for BERT. Returns token IDs, attention mask (1 for real tokens, 0 for padding), and token type IDs (all 0 for single-segment input). The output is truncated to maxLen and does NOT include padding -- the caller can pad if needed for batching.
func (*Tokenizer) SetMaxLen ¶
SetMaxLen overrides the tokenizer's maximum sequence length. Used by the provider to clamp tokenizer truncation to the model's MaxPositionEmbeds when tokenizer.json declares a larger value than the model can actually process. Without this clamp, model.Forward panics with a slice-bounds error on the scratch buffers.