Documentation ¶
Overview ¶
Package nn implements neural network modules and layers for the Born ML Framework, including activations, attention mechanisms, normalization layers, loss functions, and more.
This package provides building blocks for constructing neural networks:
- Module interface: Base interface for all NN components
- Parameter: Trainable parameters with gradient tracking
- Linear: Fully connected layer
- Activations: ReLU, Sigmoid, Tanh
- Loss functions: MSE, CrossEntropy
- Sequential: Container for stacking layers
Design inspired by PyTorch's nn.Module but adapted for Go generics.
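As a quick orientation, a minimal end-to-end sketch combining these pieces (the concrete backend type name cpu.Backend is an assumption; shapes and target values are illustrative):

	backend := cpu.New()
	model := nn.NewSequential[cpu.Backend](
		nn.NewLinear(784, 128, backend),
		nn.NewReLU[cpu.Backend](),
		nn.NewLinear(128, 10, backend),
	)
	input := tensor.Randn[float32](tensor.Shape{4, 784}, backend)
	targets := tensor.FromSlice([]int32{3, 1, 4, 1}, tensor.Shape{4}, backend)

	logits := model.Forward(input) // [4, 10]
	loss := nn.NewCrossEntropyLoss(backend).Forward(logits, targets)
	acc := nn.Accuracy(logits, targets) // fraction in [0, 1]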
Index ¶
- func Accuracy[B tensor.Backend](logits *tensor.Tensor[float32, B], targets *tensor.Tensor[int32, B]) float32
- func CausalMask[B tensor.Backend](seqLen int, backend B) *tensor.Tensor[float32, B]
- func CrossEntropyBackward[B tensor.Backend](logits *tensor.Tensor[float32, B], targets *tensor.Tensor[int32, B], backend B) *tensor.Tensor[float32, B]
- func GELUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func GLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func GeGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func Ones[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
- func Randn[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
- func ReGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func ReLUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func RepeatKV[B tensor.Backend](kv *tensor.Tensor[float32, B], nRep int) *tensor.Tensor[float32, B]
- func ScaledDotProductAttention[B tensor.Backend](query, key, value *tensor.Tensor[float32, B], mask *tensor.Tensor[float32, B], ...) (*tensor.Tensor[float32, B], *tensor.Tensor[float32, B])
- func SiLUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func SigmoidFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func SwiGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func Xavier[B tensor.Backend](fanIn, fanOut int, shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
- func Zeros[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
- type ALiBi
- type Conv2D
- func (c *Conv2D[B]) ComputeOutputSize(inputH, inputW int) [2]int
- func (c *Conv2D[B]) Forward(input *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func (c *Conv2D[B]) InChannels() int
- func (c *Conv2D[B]) KernelSize() [2]int
- func (c *Conv2D[B]) OutChannels() int
- func (c *Conv2D[B]) Padding() int
- func (c *Conv2D[B]) Parameters() []*Parameter[B]
- func (c *Conv2D[B]) Stride() int
- func (c *Conv2D[B]) String() string
- type CrossEntropyLoss
- type Embedding
- type FFN
- type GQAConfig
- type GroupedQueryAttention
- func (g *GroupedQueryAttention[B]) Forward(x *tensor.Tensor[float32, B], cache *KVCache[B], startPos int) *tensor.Tensor[float32, B]
- func (g *GroupedQueryAttention[B]) ForwardWithMask(x *tensor.Tensor[float32, B], mask *tensor.Tensor[float32, B], ...) *tensor.Tensor[float32, B]
- func (g *GroupedQueryAttention[B]) Parameters() []*Parameter[B]
- type KVCache
- type LayerNorm
- type LearnedPositionalEmbedding
- type Linear
- type MSELoss
- type MaxPool2D
- func (m *MaxPool2D[B]) ComputeOutputSize(inputH, inputW int) [2]int
- func (m *MaxPool2D[B]) Forward(input *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func (m *MaxPool2D[B]) KernelSize() int
- func (m *MaxPool2D[B]) Parameters() []*Parameter[B]
- func (m *MaxPool2D[B]) Stride() int
- func (m *MaxPool2D[B]) String() string
- type Module
- type MultiHeadAttention
- func (m *MultiHeadAttention[B]) Forward(query, key, value *tensor.Tensor[float32, B], mask *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func (m *MultiHeadAttention[B]) ForwardWithCache(query *tensor.Tensor[float32, B], cache *KVCache[B]) *tensor.Tensor[float32, B]
- func (m *MultiHeadAttention[B]) ForwardWithWeights(query, key, value *tensor.Tensor[float32, B], mask *tensor.Tensor[float32, B]) (*tensor.Tensor[float32, B], *tensor.Tensor[float32, B])
- func (m *MultiHeadAttention[B]) Parameters() []*Parameter[B]
- type Normalizer
- type Parameter
- type RMSNorm
- type ReLU
- type ReLUBackend
- type RotaryEncoding
- type RotaryEncodingConfig
- type Sequential
- type SiLU
- type SiLUBackend
- type Sigmoid
- type SigmoidBackend
- type SinusoidalPositionalEncoding
- type SwiGLUFFN
- type SwiGLUFFNConfig
- type Tanh
- type TanhBackend
- type TransformerBlock
- type TransformerConfig
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Accuracy ¶
func Accuracy[B tensor.Backend](
	logits *tensor.Tensor[float32, B],
	targets *tensor.Tensor[int32, B],
) float32
Accuracy computes classification accuracy for a batch.
Parameters:
- logits: Model predictions [batch_size, num_classes]
- targets: Ground truth class indices [batch_size]
Returns:
- Accuracy as a float between 0 and 1.
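For instance (a sketch; model and targets shaped as above):

	logits := model.Forward(images)     // [batch_size, num_classes]
	acc := nn.Accuracy(logits, targets) // e.g., 0.97 → 97% of the batch classified correctly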
func CausalMask ¶ added in v0.4.0
CausalMask creates a causal (autoregressive) attention mask.
In causal attention, each position can only attend to earlier positions (including itself). This is used in autoregressive models like GPT.
Returns a mask tensor where:
- Upper triangle (future positions) = -inf (masked out)
- Lower triangle + diagonal (past + current) = 0 (allowed)
The mask is applied additively to attention scores before softmax:
scores = QK^T / sqrt(d_k) + mask
Shape: [1, 1, seq_len, seq_len] (broadcastable to [batch, heads, seq, seq])
Example:
// For seq_len=4:
// [[0, -inf, -inf, -inf],
//  [0,    0, -inf, -inf],
//  [0,    0,    0, -inf],
//  [0,    0,    0,    0]]
backend := cpu.New()
mask := nn.CausalMask(10, backend) // [1, 1, 10, 10]
func CrossEntropyBackward ¶
func CrossEntropyBackward[B tensor.Backend](
	logits *tensor.Tensor[float32, B],
	targets *tensor.Tensor[int32, B],
	backend B,
) *tensor.Tensor[float32, B]
CrossEntropyBackward computes gradient of CrossEntropyLoss w.r.t. logits.
This function provides a manual backward pass for CrossEntropyLoss. It will be integrated with autodiff in Phase 2.
Gradient Formula:
∂L/∂logits[i] = softmax(logits)[i] - y_one_hot[i]
= probs[i] - (1 if i==target else 0)
For single class target:
∂L/∂logits[i] = probs[i]       if i ≠ target
∂L/∂logits[i] = probs[i] - 1   if i = target
Parameters:
- logits: [batch_size, num_classes]
- targets: [batch_size] (class indices)
Returns:
- grads: [batch_size, num_classes] gradient tensor
Note: Gradients are automatically averaged over batch size.
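A sketch pairing the forward loss with this manual backward (variable names are illustrative):

	loss := nn.NewCrossEntropyLoss(backend).Forward(logits, targets) // scalar, mean over batch
	grads := nn.CrossEntropyBackward(logits, targets, backend)       // [batch_size, num_classes]
	// Row i holds softmax(logits[i]) with 1 subtracted at targets[i], divided by batch size.
	_ = loss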
func GELUFunc ¶ added in v0.5.0
GELUFunc applies GELU (Gaussian Error Linear Unit) activation.
Uses the tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
GELU is used in BERT, GPT-2, and other transformers.
Example:
output := nn.GELUFunc(input)
func GLU ¶ added in v0.5.0
GLU applies Gated Linear Unit: GLU(x, gate) = x * sigmoid(gate).
GLU is the base gating mechanism used in various transformer FFN layers.
Parameters:
- x: input tensor.
- gate: gating tensor (same shape as x).
Returns: x * sigmoid(gate).
Example:
output := nn.GLU(x, gate)
func GeGLU ¶ added in v0.5.0
GeGLU applies GELU-Gated Linear Unit: GeGLU(x, gate) = x * GELU(gate).
GeGLU uses GELU activation for gating instead of SiLU. Used in some transformer variants for different activation characteristics.
Parameters:
- x: input tensor.
- gate: gating tensor.
Returns: x * GELU(gate).
Example:
output := nn.GeGLU(up, gate)
func Ones ¶
Ones creates a tensor filled with ones.
Parameters:
- shape: Shape of the tensor
- backend: Backend to use for tensor creation
Returns a tensor filled with ones.
func Randn ¶
Randn creates a tensor with random values from standard normal distribution.
Values are drawn from N(0, 1).
Parameters:
- shape: Shape of the tensor
- backend: Backend to use for tensor creation
Returns a tensor with random normal values.
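The three tensor helpers side by side (a sketch; the shape is arbitrary):

	zeros := nn.Zeros(tensor.Shape{2, 3}, backend) // all 0
	ones := nn.Ones(tensor.Shape{2, 3}, backend)   // all 1
	noise := nn.Randn(tensor.Shape{2, 3}, backend) // drawn from N(0, 1)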
func ReGLU ¶ added in v0.5.0
ReGLU applies ReLU-Gated Linear Unit: ReGLU(x, gate) = x * ReLU(gate).
ReGLU uses ReLU activation for gating. It's simpler but may have "dead neuron" issues compared to SwiGLU or GeGLU.
Parameters:
- x: input tensor.
- gate: gating tensor.
Returns: x * ReLU(gate).
Example:
output := nn.ReGLU(up, gate)
func ReLUFunc ¶ added in v0.5.0
ReLUFunc applies ReLU activation: f(x) = max(0, x).
Example:
output := nn.ReLUFunc(input)
func RepeatKV ¶ added in v0.5.0
func RepeatKV[B tensor.Backend](
	kv *tensor.Tensor[float32, B],
	nRep int,
) *tensor.Tensor[float32, B]
RepeatKV broadcasts KV heads to match query heads count.
This is the key operation in GQA that allows fewer KV heads than Q heads. Each KV head is repeated nRep times to match the Q head count.
Input:  [batch, n_kv_heads, seq_len, head_dim]
Output: [batch, n_q_heads, seq_len, head_dim], where n_q_heads = n_kv_heads * nRep
Example:
// 8 KV heads -> 32 Q heads (nRep=4)
kv := tensor.Randn[float32](tensor.Shape{2, 8, 100, 128}, backend)
expanded := nn.RepeatKV(kv, 4) // [2, 32, 100, 128]
If nRep=1 (standard MHA), returns the input unchanged.
func ScaledDotProductAttention ¶ added in v0.4.0
func ScaledDotProductAttention[B tensor.Backend](
	query, key, value *tensor.Tensor[float32, B],
	mask *tensor.Tensor[float32, B],
	scale float32,
) (*tensor.Tensor[float32, B], *tensor.Tensor[float32, B])
ScaledDotProductAttention computes attention scores using the scaled dot-product mechanism.
This is the core attention mechanism used in transformers, implementing:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V
Where:
- Q (query): what information we're looking for [batch, heads, seq_q, head_dim]
- K (key): what information is available [batch, heads, seq_k, head_dim]
- V (value): the actual information to retrieve [batch, heads, seq_k, head_dim]
- mask: optional attention mask (additive, -inf for masked positions)
- scale: scaling factor (typically 1/sqrt(head_dim)), 0 for auto-compute
Parameters:
- query: Query tensor [batch, heads, seq_q, head_dim]
- key: Key tensor [batch, heads, seq_k, head_dim]
- value: Value tensor [batch, heads, seq_k, head_dim]
- mask: Optional attention mask [batch, 1, seq_q, seq_k] or nil (additive mask, -inf for masked)
- scale: Scaling factor (0 for auto-compute as 1/sqrt(head_dim))
Returns:
- output: Attended values [batch, heads, seq_q, head_dim]
- weights: Attention weights [batch, heads, seq_q, seq_k]
Example:
backend := autodiff.New(cpu.New())
Q := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend) // batch=2, heads=8, seq=10, dim=64
K := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
V := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
output, weights := nn.ScaledDotProductAttention(Q, K, V, nil, 0) // auto-scale
func SiLUFunc ¶ added in v0.5.0
SiLUFunc applies SiLU (Swish) activation: f(x) = x * sigmoid(x).
This is the functional version of SiLU activation, useful in GLU variants.
Example:
output := nn.SiLUFunc(input)
func SigmoidFunc ¶ added in v0.5.0
SigmoidFunc applies Sigmoid activation: σ(x) = 1 / (1 + exp(-x)).
Example:
output := nn.SigmoidFunc(input)
func SwiGLU ¶ added in v0.5.0
SwiGLU applies Swish-Gated Linear Unit: SwiGLU(x, gate) = x * SiLU(gate).
SwiGLU is used in modern LLMs like LLaMA, Mistral, and DeepSeek. It combines the input with SiLU-activated gate for better gradient flow.
Parameters:
- x: input tensor (typically "up" projection).
- gate: gating tensor (typically "gate" projection).
Returns: x * SiLU(gate) where SiLU(z) = z * sigmoid(z).
Example:
// In LLaMA-style FFN:
up := upProj.Forward(input)
gate := gateProj.Forward(input)
hidden := nn.SwiGLU(up, gate)
func Xavier ¶
func Xavier[B tensor.Backend](fanIn, fanOut int, shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
Xavier (Glorot) initialization for weights.
Initializes weights with values drawn from a uniform distribution: U(-sqrt(6/(fan_in + fan_out)), sqrt(6/(fan_in + fan_out)))
This initialization helps maintain variance of activations across layers.
Parameters:
- fanIn: Number of input units
- fanOut: Number of output units
- shape: Shape of the weight tensor
- backend: Backend to use for tensor creation
Returns a tensor initialized with Xavier distribution.
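For example, a weight for a 256→512 layer stored [out_features, in_features] as Linear does (a sketch):

	w := nn.Xavier(256, 512, tensor.Shape{512, 256}, backend)
	// Entries drawn from U(-sqrt(6/768), sqrt(6/768)) ≈ U(-0.088, 0.088)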
Types ¶
type ALiBi ¶ added in v0.4.0
type ALiBi[B tensor.Backend] struct {
	NumHeads int       // Number of attention heads
	Slopes   []float32 // Slope for each head (geometric sequence)
	// contains filtered or unexported fields
}
ALiBi implements Attention with Linear Biases.
ALiBi is a positional encoding method that adds a linear bias to attention scores based on the distance between query and key positions. This approach is used in BLOOM, MPT, and other models.
Instead of adding positional information to embeddings, ALiBi adds a bias to the attention scores:
attention_scores = Q @ K^T + bias
Where bias[i,j] = -slope * |i - j|, and each attention head has a different slope.
The slopes are determined by a geometric sequence:
slopes = [2^(-8/n), 2^(-16/n), ..., 2^(-8)] for n heads
This allows the model to extrapolate to longer sequences than seen during training.
Example:
alibi := nn.NewALiBi(8, backend) // 8 attention heads
bias := alibi.GetBias(128)       // Bias for sequence length 128
// Shape: [1, 8, 128, 128]

// In attention:
scores := Q.BatchMatMul(K.Transpose()) // [batch, 8, seq, seq]
scores = scores.Add(bias)              // Add ALiBi bias
weights := scores.Softmax(-1)
func NewALiBi ¶ added in v0.4.0
NewALiBi creates a new ALiBi bias generator.
Computes slopes for each attention head using the formula from the paper:
For n heads: slopes = [2^(-8/n * i) for i in 1..n]
Example slopes for 8 heads:
[2^(-1), 2^(-2), 2^(-3), 2^(-4), 2^(-5), 2^(-6), 2^(-7), 2^(-8)]
≈ [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]
Parameters:
- numHeads: Number of attention heads
- backend: Computation backend
Returns a new ALiBi instance with pre-computed slopes.
func (*ALiBi[B]) GetBias ¶ added in v0.4.0
GetBias returns the ALiBi bias matrix for the specified sequence length.
The bias has shape [1, num_heads, seq_len, seq_len], where:
- bias[0, h, i, j] = -slopes[h] * |i - j|
The leading dimension is 1 for broadcasting across batch dimension.
Parameters:
- seqLen: Sequence length for the bias matrix
Returns:
- Bias tensor [1, num_heads, seq_len, seq_len]
Example:
alibi := nn.NewALiBi(8, backend)
bias := alibi.GetBias(64) // [1, 8, 64, 64]

// In attention computation:
scores := Q.BatchMatMul(K.T()) // [batch, 8, 64, 64]
scores = scores.Add(bias)      // Broadcast and add
weights := scores.Softmax(-1)
type Conv2D ¶
Conv2D is a 2D convolutional layer.
Performs convolution: output = Conv2D(input, weight) + bias
Input shape:  [batch, in_channels, height, width]
Weight shape: [out_channels, in_channels, kernel_h, kernel_w]
Bias shape:   [out_channels]
Output shape: [batch, out_channels, out_h, out_w]
Where:
out_h = (height + 2*padding - kernel_h) / stride + 1
out_w = (width + 2*padding - kernel_w) / stride + 1
Example:
// Create 2D conv: 1 channel -> 6 channels, 5x5 kernel
conv := nn.NewConv2D(1, 6, 5, 5, 1, 0, true, backend)
// Forward pass
input := tensor.Zeros[float32](tensor.Shape{32, 1, 28, 28}, backend) // MNIST-like
output := conv.Forward(input) // [32, 6, 24, 24]
func NewConv2D ¶
func NewConv2D[B tensor.Backend](
	inChannels, outChannels int,
	kernelH, kernelW int,
	stride, padding int,
	useBias bool,
	backend B,
) *Conv2D[B]
NewConv2D creates a new 2D convolutional layer with Xavier initialization.
Parameters:
- inChannels: Number of input channels
- outChannels: Number of output channels (number of filters)
- kernelH, kernelW: Kernel dimensions
- stride: Stride for convolution (commonly 1 or 2)
- padding: Zero padding to apply to input (commonly 0, 1, 2)
- useBias: Whether to include bias term
- backend: Backend for computation
Initialization:
- Weights: Xavier/Glorot uniform initialization
- Bias: Zeros
func (*Conv2D[B]) ComputeOutputSize ¶
ComputeOutputSize computes output spatial dimensions for given input size.
Returns: [out_height, out_width].
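Applying the output-size formula above (a sketch):

	conv := nn.NewConv2D(3, 16, 3, 3, 1, 1, true, backend)
	hw := conv.ComputeOutputSize(32, 32) // [32, 32], since (32 + 2*1 - 3)/1 + 1 = 32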
func (*Conv2D[B]) Forward ¶
Forward performs the forward pass.
Input:  [batch, in_channels, height, width]
Output: [batch, out_channels, out_h, out_w]
func (*Conv2D[B]) InChannels ¶
InChannels returns the number of input channels.
func (*Conv2D[B]) KernelSize ¶
KernelSize returns the kernel size [height, width].
func (*Conv2D[B]) OutChannels ¶
OutChannels returns the number of output channels.
func (*Conv2D[B]) Parameters ¶
Parameters returns all trainable parameters.
type CrossEntropyLoss ¶
CrossEntropyLoss computes cross-entropy loss for multi-class classification.
This implementation uses the LogSoftmax + NLLLoss decomposition for numerical stability, following modern best practices (PyTorch, Burn 2025).
Mathematical Formulation:
Loss = -log_probs[target] where log_probs = LogSoftmax(logits)
Gradient (Backward):
∂L/∂logits = Softmax(logits) - y_one_hot
Usage:
criterion := nn.NewCrossEntropyLoss[Backend](backend)
logits := model.Forward(input)             // [batch_size, num_classes]
loss := criterion.Forward(logits, targets) // targets: [batch_size] (class indices)
Key Properties:
- Expects raw logits (unnormalized scores) as input
- Uses log-sum-exp trick for numerical stability
- Prevents overflow when logits > 88 (float32 limit)
- Prevents underflow when all logits are very negative
References:
- "Adam: A Method for Stochastic Optimization" (Kingma & Ba, 2014)
- PyTorch CrossEntropyLoss documentation
- Burn framework loss implementations
func NewCrossEntropyLoss ¶
func NewCrossEntropyLoss[B tensor.Backend](backend B) *CrossEntropyLoss[B]
NewCrossEntropyLoss creates a new cross-entropy loss function.
func (*CrossEntropyLoss[B]) Forward ¶
func (c *CrossEntropyLoss[B]) Forward(
	logits *tensor.Tensor[float32, B],
	targets *tensor.Tensor[int32, B],
) *tensor.Tensor[float32, B]
Forward computes cross-entropy loss.
Parameters:
- logits: Model predictions (unnormalized scores) with shape [batch_size, num_classes]
- targets: Ground truth class indices with shape [batch_size] (values in range [0, num_classes-1])
Returns:
- Scalar loss value (mean over batch)
Note: This is a simplified implementation for Phase 1 (MNIST proof-of-concept). Full autodiff support for Softmax/Log operations will be added in Phase 2.
func (*CrossEntropyLoss[B]) Parameters ¶
func (c *CrossEntropyLoss[B]) Parameters() []*Parameter[B]
Parameters returns an empty slice (loss functions have no trainable parameters).
type Embedding ¶ added in v0.3.0
type Embedding[B tensor.Backend] struct {
	Weight   *Parameter[B] // Embedding weight matrix [NumEmbed, EmbedDim]
	NumEmbed int           // Number of embeddings (vocabulary size)
	EmbedDim int           // Embedding dimension (vector size)
}
Embedding is a lookup table that maps discrete indices to dense vectors.
This is a fundamental layer in NLP and sequence models, converting token IDs to continuous embeddings. The embedding vectors are learnable parameters.
Architecture:
- Weight: [NumEmbed, EmbedDim] learnable parameter
- Forward: indices [batch, seq] -> embeddings [batch, seq, EmbedDim]
- Backward: gradients scatter-add to weight rows
Example:
// Vocabulary of 10000 words, embedding dimension 256
embed := nn.NewEmbedding[B](10000, 256, backend)
// Token IDs for batch of 2 sequences, each 5 tokens
indices := tensor.FromSlice([]int32{1, 2, 3, 4, 5, 10, 11, 12, 13, 14},
tensor.Shape{2, 5}, backend)
// Get embeddings [2, 5, 256]
embeddings := embed.Forward(indices)
func NewEmbedding ¶ added in v0.3.0
NewEmbedding creates a new Embedding layer.
The embedding weights are initialized from a standard normal distribution N(0, 1). For other initialization strategies (Xavier, truncated normal), initialize the weight tensor manually and pass it to NewEmbeddingWithWeight.
Parameters:
- numEmbeddings: Size of the embedding dictionary (e.g., vocabulary size)
- embeddingDim: Dimension of each embedding vector
- backend: Computation backend
Returns a new Embedding layer with randomly initialized weights.
func NewEmbeddingWithWeight ¶ added in v0.3.0
NewEmbeddingWithWeight creates an Embedding layer with pre-initialized weights.
Use this when you want custom initialization (Xavier, truncated normal, pretrained, etc.)
Parameters:
- weight: Pre-initialized weight tensor [numEmbeddings, embeddingDim]
Returns a new Embedding layer using the provided weights.
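A sketch of custom initialization using the package's Xavier helper instead of the default N(0, 1) (the fan values here are illustrative):

	w := nn.Xavier(10000, 256, tensor.Shape{10000, 256}, backend)
	embed := nn.NewEmbeddingWithWeight(w)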
func (*Embedding[B]) Forward ¶ added in v0.3.0
Forward performs embedding lookup.
Maps each index to its corresponding embedding vector.
Parameters:
- indices: Tensor of indices [batch, seq] or any shape [...] of type int32
Returns:
- embeddings: Tensor [..., EmbedDim] with embedding vectors
Example:
indices := tensor.FromSlice([]int32{0, 1, 2}, tensor.Shape{3}, backend)
embeddings := embed.Forward(indices) // Shape: [3, EmbedDim]
Panics if any index is out of bounds [0, NumEmbed).
func (*Embedding[B]) Parameters ¶ added in v0.3.0
Parameters returns the list of trainable parameters.
type FFN ¶ added in v0.4.0
type FFN[B tensor.Backend] struct {
	Linear1 *Linear[B] // [embed_dim → ffn_dim]
	Linear2 *Linear[B] // [ffn_dim → embed_dim]
	SiLU    *SiLU[B]   // Activation function
	// contains filtered or unexported fields
}
FFN implements a Feed-Forward Network (also called MLP - Multi-Layer Perceptron).
Architecture:
FFN(x) = Linear2(SiLU(Linear1(x)))
Where:
- Linear1: [embed_dim → ffn_dim] (expansion)
- SiLU: Activation function (x * sigmoid(x))
- Linear2: [ffn_dim → embed_dim] (projection back)
The FFN is a core component of transformer blocks, typically with ffn_dim = 4 * embed_dim. This expansion-and-projection pattern helps the model learn complex transformations.
Used in all transformer architectures:
- GPT: embed_dim=768, ffn_dim=3072 (4x expansion)
- BERT: embed_dim=768, ffn_dim=3072
- LLaMA: embed_dim=4096, ffn_dim=11008 (~2.7x expansion)
Example:
backend := autodiff.New(cpu.New())
ffn := nn.NewFFN[B](768, 3072, backend) // GPT-2 small
output := ffn.Forward(x) // [batch, seq, 768] -> [batch, seq, 768]
func NewFFN ¶ added in v0.4.0
NewFFN creates a new Feed-Forward Network.
Parameters:
- embedDim: Input/output dimension (e.g., 768 for GPT-2)
- ffnDim: Hidden dimension (typically 4 * embedDim)
- backend: Computation backend
The network expands the input from embedDim to ffnDim, applies SiLU activation, then projects back to embedDim.
Example:
ffn := nn.NewFFN[B](768, 3072, backend) // GPT-2 small
func (*FFN[B]) Forward ¶ added in v0.4.0
Forward computes the FFN output.
Shapes:
- input: [batch, seq, embed_dim] (3D) or [batch, embed_dim] (2D)
- output: same shape as input
Algorithm:
- Expand: x -> Linear1(x) [embed_dim → ffn_dim]
- Activate: x -> SiLU(x)
- Project: x -> Linear2(x) [ffn_dim → embed_dim]
Note: Linear layers expect 2D input [batch, features], so we reshape if needed.
func (*FFN[B]) Parameters ¶ added in v0.4.0
Parameters returns all trainable parameters (Linear1 and Linear2).
type GQAConfig ¶ added in v0.5.0
type GQAConfig struct {
EmbedDim int // Model dimension (d_model)
NQHeads int // Number of query heads
NKVHeads int // Number of key-value heads (must divide NQHeads evenly)
HeadDim int // Dimension per head
Dropout float32 // Dropout rate (not used in inference)
UseRoPE bool // Whether to use Rotary Position Embeddings
MaxSeqLen int // Maximum sequence length (for RoPE)
Theta float64 // RoPE base frequency (default: 10000.0)
}
GQAConfig configures a GroupedQueryAttention layer.
func MQA ¶ added in v0.5.0
MQA creates a Multi-Query Attention config (GQA with n_kv_heads=1).
MQA is the extreme case of GQA where all query heads share a single KV head. This provides maximum memory savings but may reduce model capacity.
Example:
cfg := nn.MQA(4096, 32, 128) // 32 Q heads, 1 KV head
mqa := nn.NewGQA(cfg, backend)
type GroupedQueryAttention ¶ added in v0.5.0
type GroupedQueryAttention[B tensor.Backend] struct {
	QProj   *Linear[B] // Query projection [embed_dim, n_q_heads * head_dim]
	KProj   *Linear[B] // Key projection [embed_dim, n_kv_heads * head_dim]
	VProj   *Linear[B] // Value projection [embed_dim, n_kv_heads * head_dim]
	OutProj *Linear[B] // Output projection [n_q_heads * head_dim, embed_dim]
	// contains filtered or unexported fields
}
GroupedQueryAttention implements Grouped Query Attention (GQA).
GQA is a variant of multi-head attention where the number of key-value heads is less than the number of query heads. This provides significant memory savings for KV-cache during inference while maintaining model quality.
Architecture comparison:
MHA: n_q_heads = n_kv_heads (e.g., 32 Q, 32 K, 32 V)
GQA: n_q_heads > n_kv_heads (e.g., 32 Q, 8 K, 8 V) -> 4x memory savings
MQA: n_kv_heads = 1 (e.g., 32 Q, 1 K, 1 V) -> 32x memory savings (extreme)
GQA is used in LLaMA 2/3, Mistral, DeepSeek, Qwen2, Phi-3, and other modern LLMs.
Example:
cfg := nn.GQAConfig{
EmbedDim: 4096,
NQHeads: 32,
NKVHeads: 8, // 4:1 ratio
HeadDim: 128,
MaxSeqLen: 2048,
UseRoPE: true,
}
gqa := nn.NewGQA(cfg, backend)
output := gqa.Forward(x, cache, startPos)
func NewGQA ¶ added in v0.5.0
func NewGQA[B tensor.Backend](cfg GQAConfig, backend B) *GroupedQueryAttention[B]
NewGQA creates a new GroupedQueryAttention module.
Validates that:
- NQHeads is divisible by NKVHeads
- EmbedDim equals NQHeads * HeadDim
If HeadDim is 0, it's computed as EmbedDim / NQHeads.
Example:
// LLaMA 2 7B style config
cfg := nn.GQAConfig{
EmbedDim: 4096,
NQHeads: 32,
NKVHeads: 8,
HeadDim: 128,
UseRoPE: true,
MaxSeqLen: 4096,
}
gqa := nn.NewGQA(cfg, backend)
func (*GroupedQueryAttention[B]) Forward ¶ added in v0.5.0
func (g *GroupedQueryAttention[B]) Forward(
	x *tensor.Tensor[float32, B],
	cache *KVCache[B],
	startPos int,
) *tensor.Tensor[float32, B]
Forward computes grouped query attention with optional KV-cache.
Args:
- x: Input tensor [batch, seq_len, embed_dim]
- cache: Optional KV-cache for efficient autoregressive generation
- startPos: Position offset for RoPE (used with KV-cache)
Returns:
- Output tensor [batch, seq_len, embed_dim]
The method automatically applies:
- RoPE to Q and K if configured
- KV head broadcasting (repeatKV) to match Q heads
- Causal masking for autoregressive attention
Example:
// Training: process full sequence
output := gqa.Forward(x, nil, 0)

// Inference with KV-cache
cache := nn.NewKVCache[B](1, 8, 512, 128, backend)
output = gqa.Forward(x, cache, 0)              // First token(s)
output = gqa.Forward(nextToken, cache, seqLen) // Subsequent tokens
func (*GroupedQueryAttention[B]) ForwardWithMask ¶ added in v0.5.0
func (g *GroupedQueryAttention[B]) ForwardWithMask(
	x *tensor.Tensor[float32, B],
	mask *tensor.Tensor[float32, B],
	cache *KVCache[B],
	startPos int,
) *tensor.Tensor[float32, B]
ForwardWithMask computes attention with a custom mask.
Args:
- x: Input tensor [batch, seq_len, embed_dim]
- mask: Optional attention mask [batch, 1, seq_q, seq_k] or nil for auto causal mask
- cache: Optional KV-cache
- startPos: Position offset for RoPE
Returns:
- Output tensor [batch, seq_len, embed_dim]
func (*GroupedQueryAttention[B]) Parameters ¶ added in v0.5.0
func (g *GroupedQueryAttention[B]) Parameters() []*Parameter[B]
Parameters returns all trainable parameters.
type KVCache ¶ added in v0.4.0
KVCache stores key-value pairs for efficient autoregressive generation.
Without cache: O(n²) computation for generating n tokens (recompute K,V for all previous tokens).
With cache: O(n) computation (only compute K,V for the new token and append to cache).
This can provide 10-100x speedup for inference depending on sequence length.
Example:
cache := nn.NewKVCache[B](2, 8, 512, 64, backend) // batch=2, heads=8, maxSeq=512, headDim=64
for pos := 0; pos < numTokens; pos++ {
output := mha.ForwardWithCache(queryToken, cache, pos)
}
func NewKVCache ¶ added in v0.4.0
NewKVCache creates a new KV cache.
Parameters:
- batchSize: Batch size (reserved for future use)
- numHeads: Number of attention heads (reserved for future use)
- maxSeqLen: Maximum sequence length
- headDim: Dimension per attention head (reserved for future use)
- backend: Computation backend
The cache starts empty (length=0) and grows as key-value pairs are added. Note: batchSize, numHeads, and headDim are not currently used but kept for API consistency.
Example:
cache := nn.NewKVCache[B](2, 8, 512, 64, backend)
func (*KVCache[B]) Get ¶ added in v0.4.0
Get returns cached keys and values up to the current length.
Returns:
- keys: [batch, num_heads, length, head_dim]
- values: [batch, num_heads, length, head_dim]
Panics if the cache is empty.
Example:
keys, values := cache.Get() // keys/values: [2, 8, 15, 64] if 15 tokens were added
func (*KVCache[B]) Len ¶ added in v0.4.0
Len returns the current sequence length in cache.
Example:
if cache.Len() > 100 {
// Generate summary or truncate
}
func (*KVCache[B]) Reset ¶ added in v0.4.0
func (c *KVCache[B]) Reset()
Reset clears the cache for new generation.
After reset, the cache is empty (length=0) and ready for new sequences.
Example:
cache.Reset() // Clear cache
// Start new generation sequence
func (*KVCache[B]) Update ¶ added in v0.4.0
Update adds new key-value pairs to the cache at the current position.
Parameters:
- key: New key tensor [batch, num_heads, seq_len, head_dim]
- value: New value tensor [batch, num_heads, seq_len, head_dim]
The new tensors are appended to the cache and the length is updated. Panics if the cache would exceed maxLen.
Example:
// Add single token (seq_len=1)
cache.Update(key, value) // key/value: [2, 8, 1, 64]

// Add multiple tokens (seq_len=10)
cache.Update(key, value) // key/value: [2, 8, 10, 64]
type LayerNorm ¶ added in v0.4.0
type LayerNorm[B tensor.Backend] struct {
	Gamma   *Parameter[B] // learnable scale [d_model]
	Beta    *Parameter[B] // learnable shift [d_model]
	Epsilon float32       // numerical stability constant
	// contains filtered or unexported fields
}
LayerNorm applies Layer Normalization over an input tensor along the last dimension.
Formula: Y = gamma * (X - mean(X)) / sqrt(var(X) + eps) + beta
Where:
- X is the input tensor
- Y is the output tensor
- gamma is the learnable scale parameter [d_model]
- beta is the learnable shift parameter [d_model]
- mean and variance are computed along the last dimension
- eps is a small value to avoid division by zero
LayerNorm normalizes activations by computing statistics across features, which helps stabilize training and is widely used in transformers (BERT, GPT, etc.).
Example:
backend := autodiff.New(cpu.New())
layernorm := nn.NewLayerNorm[AutodiffBackend](768, 1e-5, backend)
output := layernorm.Forward(hiddenStates) // [..., 768] -> [..., 768]
func NewLayerNorm ¶ added in v0.4.0
NewLayerNorm creates a new LayerNorm layer.
Parameters:
- normalizedShape: size of the last dimension (feature dimension)
- epsilon: small constant for numerical stability (typically 1e-5 or 1e-6)
- backend: computation backend
The gamma parameter is initialized to ones, beta to zeros.
func (*LayerNorm[B]) Forward ¶ added in v0.4.0
Forward applies LayerNorm to the input tensor.
Shapes:
- input: [..., any, d_model]
- output: [..., any, d_model]
Algorithm:
- Compute mean = mean(x) along last dimension (keepdim=true)
- Subtract mean: x_centered = x - mean
- Compute variance = mean((x - mean)^2) along last dimension
- Normalize: x_norm = x_centered / sqrt(variance + epsilon)
- Scale and shift: output = gamma * x_norm + beta
func (*LayerNorm[B]) Parameters ¶ added in v0.4.0
Parameters returns the learnable parameters (gamma and beta).
type LearnedPositionalEmbedding ¶ added in v0.4.0
type LearnedPositionalEmbedding[B tensor.Backend] struct {
	Embedding *Embedding[B] // Embedding layer for position indices
	MaxLen    int           // Maximum sequence length
	Dim       int           // Embedding dimension
	// contains filtered or unexported fields
}
LearnedPositionalEmbedding implements learned positional embeddings.
Unlike fixed sinusoidal encodings, these embeddings are learned parameters that are updated during training. This approach is used in GPT-2 and other models.
Architecture:
- Embedding matrix: [MaxLen, Dim] - learned parameters
- Forward: returns embeddings for positions [0, seqLen)
Example:
pe := nn.NewLearnedPositionalEmbedding(512, 256, backend)
positions := pe.Forward(100) // Get learned embeddings for first 100 positions
// Shape: [1, 100, 256]
The embeddings are initialized from a normal distribution N(0, 1).
func NewLearnedPositionalEmbedding ¶ added in v0.4.0
func NewLearnedPositionalEmbedding[B tensor.Backend](maxLen, dim int, backend B) *LearnedPositionalEmbedding[B]
NewLearnedPositionalEmbedding creates a new LearnedPositionalEmbedding layer.
The embeddings are initialized from a standard normal distribution N(0, 1).
Parameters:
- maxLen: Maximum sequence length (number of position embeddings)
- dim: Embedding dimension (typically same as model dimension)
- backend: Computation backend
Returns a new LearnedPositionalEmbedding with randomly initialized embeddings.
func (*LearnedPositionalEmbedding[B]) Forward ¶ added in v0.4.0
func (l *LearnedPositionalEmbedding[B]) Forward(seqLen int) *tensor.Tensor[float32, B]
Forward returns learned position embeddings for the specified sequence length.
Parameters:
- seqLen: Length of the sequence (must be <= MaxLen)
Returns:
- Position embeddings with shape [1, seqLen, dim]. The batch dimension is 1 for broadcasting to any batch size.
Example:
pe := nn.NewLearnedPositionalEmbedding(512, 256, backend)
encodings := pe.Forward(100) // [1, 100, 256]

// Add to token embeddings
embeddings := tokenEmbed.Forward(tokens) // [batch, 100, 256]
embeddings = embeddings.Add(encodings)   // Broadcast over batch
Panics if seqLen > MaxLen.
func (*LearnedPositionalEmbedding[B]) Parameters ¶ added in v0.4.0
func (l *LearnedPositionalEmbedding[B]) Parameters() []*Parameter[B]
Parameters returns the trainable parameters (learned embeddings).
type Linear ¶
Linear implements a fully connected (dense) layer.
Performs the transformation: y = x @ W.T + b where:
- x is the input tensor with shape [batch_size, in_features]
- W is the weight matrix with shape [out_features, in_features]
- b is the bias vector with shape [out_features]
- y is the output tensor with shape [batch_size, out_features]
Weights are initialized using Xavier/Glorot initialization. Biases are initialized to zeros.
Example:
backend := cpu.New()
layer := nn.NewLinear(784, 128, backend)
input := tensor.Randn[float32](tensor.Shape{32, 784}, backend) // batch_size=32
output := layer.Forward(input) // shape: [32, 128]
func NewLinear ¶
NewLinear creates a new Linear layer.
Weights are initialized using Xavier/Glorot uniform distribution. Biases are initialized to zeros.
Parameters:
- inFeatures: Number of input features
- outFeatures: Number of output features
- backend: Backend to use for tensor operations
Returns a new Linear layer.
func (*Linear[B]) Forward ¶
Forward computes the output of the linear layer.
Performs: y = x @ W.T + b
Input shape:  [batch_size, in_features]
Output shape: [batch_size, out_features]
Parameters:
- input: Input tensor with shape [batch_size, in_features]
Returns output tensor with shape [batch_size, out_features].
func (*Linear[B]) InFeatures ¶
InFeatures returns the number of input features.
func (*Linear[B]) OutFeatures ¶
OutFeatures returns the number of output features.
func (*Linear[B]) Parameters ¶
Parameters returns the trainable parameters of this layer.
Returns [weight, bias] if bias is present, otherwise [weight].
type MSELoss ¶
MSELoss computes Mean Squared Error loss.
Loss = mean((predictions - targets)²)
MSE is commonly used for regression tasks where the goal is to predict continuous values.
Example:
mse := nn.NewMSELoss[Backend]()
predictions := model.Forward(input)
loss := mse.Forward(predictions, targets)
func NewMSELoss ¶
NewMSELoss creates a new MSE loss function.
func (*MSELoss[B]) Forward ¶
func (m *MSELoss[B]) Forward(predictions, targets *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
Forward computes the MSE loss.
Loss = mean((predictions - targets)²)
Parameters:
- predictions: Model predictions with shape [batch_size, ...]
- targets: Ground truth targets with same shape as predictions
Returns a scalar loss value (shape [1] or []).
func (*MSELoss[B]) Parameters ¶
Parameters returns an empty slice (loss functions have no trainable parameters).
type MaxPool2D ¶
MaxPool2D is a 2D max pooling layer.
Max pooling reduces spatial dimensions by taking the maximum value in each non-overlapping window. Unlike Conv2D, MaxPool2D has no learnable parameters.
Input shape:  [batch, channels, height, width]
Output shape: [batch, channels, out_height, out_width]
Where:
out_height = (height - kernelSize) / stride + 1
out_width  = (width - kernelSize) / stride + 1
Common configurations:
- 2x2 pool, stride=2: Reduces spatial dimensions by half (most common)
- 3x3 pool, stride=2: Aggressive downsampling
- 2x2 pool, stride=1: Overlapping pooling (less common)
Example:
// Create 2x2 max pooling with stride 2
pool := nn.NewMaxPool2D(2, 2, backend)
// Forward pass
input := tensor.Randn[float32](tensor.Shape{32, 64, 28, 28}, backend)
output := pool.Forward(input) // [32, 64, 14, 14]
func NewMaxPool2D ¶
NewMaxPool2D creates a new 2D max pooling layer.
Parameters:
- kernelSize: Size of pooling window (square)
- stride: Stride for pooling (typically same as kernelSize for non-overlapping)
- backend: Backend for computation
Common patterns:
- NewMaxPool2D(2, 2, backend): Standard 2x2 non-overlapping pooling
- NewMaxPool2D(3, 2, backend): Overlapping 3x3 pooling with stride 2
func (*MaxPool2D[B]) ComputeOutputSize ¶
ComputeOutputSize computes output spatial dimensions for given input size.
Returns: [out_height, out_width].
func (*MaxPool2D[B]) Forward ¶
Forward performs the forward pass.
Input:  [batch, channels, height, width]
Output: [batch, channels, out_height, out_width]
func (*MaxPool2D[B]) KernelSize ¶
KernelSize returns the pooling kernel size.
func (*MaxPool2D[B]) Parameters ¶
Parameters returns all trainable parameters (empty for MaxPool2D).
MaxPool2D has no learnable parameters, so this always returns an empty slice.
type Module ¶
type Module[B tensor.Backend] interface {
	// Forward computes the output of the module given an input tensor.
	//
	// The input tensor should have the appropriate shape for this module.
	// For example, Linear expects [batch_size, in_features].
	//
	// Returns the output tensor with shape determined by the module type.
	Forward(input *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

	// Parameters returns all trainable parameters of this module.
	//
	// This includes weights, biases, and any nested module parameters.
	// Returns an empty slice for modules without trainable parameters
	// (e.g., activation functions).
	Parameters() []*Parameter[B]
}
Module is the base interface for all neural network components.
Every NN module must implement:
- Forward: Compute output from input
- Parameters: Return all trainable parameters
Modules can be composed to build complex architectures:
model := nn.Sequential[Backend](
nn.NewLinear(784, 128, backend),
nn.NewReLU(),
nn.NewLinear(128, 10, backend),
)
Type parameter B must satisfy the tensor.Backend interface.
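As a sketch of implementing the interface, a hypothetical residual wrapper (not part of the package) that delegates to an inner module:

	// Residual adds a skip connection around any module: Forward(x) = x + Inner(x).
	type Residual[B tensor.Backend] struct {
		Inner nn.Module[B]
	}

	func (r *Residual[B]) Forward(x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B] {
		return x.Add(r.Inner.Forward(x)) // skip connection
	}

	func (r *Residual[B]) Parameters() []*nn.Parameter[B] {
		return r.Inner.Parameters() // expose the wrapped module's parameters
	}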
type MultiHeadAttention ¶ added in v0.4.0
type MultiHeadAttention[B tensor.Backend] struct {
	WQ *Linear[B] // Query projection [embed_dim, embed_dim]
	WK *Linear[B] // Key projection [embed_dim, embed_dim]
	WV *Linear[B] // Value projection [embed_dim, embed_dim]
	WO *Linear[B] // Output projection [embed_dim, embed_dim]

	NumHeads int
	HeadDim  int
	EmbedDim int
	// contains filtered or unexported fields
}
MultiHeadAttention implements the multi-head attention mechanism.
Architecture:
MHA(Q, K, V) = Concat(head_1, ..., head_h) * W_O
head_i = SDPA(Q*W_Q_i, K*W_K_i, V*W_V_i)
This is the core attention layer used in all transformer architectures including BERT, GPT, LLaMA, and others.
Example:
backend := autodiff.New(cpu.New())
mha := nn.NewMultiHeadAttention[B](768, 12, backend) // 768 dim, 12 heads
output := mha.Forward(x, x, x, nil)   // Self-attention
output = mha.Forward(q, kv, kv, mask) // Cross-attention
func NewMultiHeadAttention ¶ added in v0.4.0
func NewMultiHeadAttention[B tensor.Backend](
	embedDim, numHeads int,
	backend B,
) *MultiHeadAttention[B]
NewMultiHeadAttention creates a new multi-head attention module.
Parameters:
- embedDim: Total embedding dimension (must be divisible by numHeads)
- numHeads: Number of attention heads
- backend: Computation backend
The head dimension is computed as embedDim / numHeads.
Example:
mha := nn.NewMultiHeadAttention[B](768, 12, backend) // embedDim=768, numHeads=12 -> headDim=64
func (*MultiHeadAttention[B]) Forward ¶ added in v0.4.0
func (m *MultiHeadAttention[B]) Forward(
	query, key, value *tensor.Tensor[float32, B],
	mask *tensor.Tensor[float32, B],
) *tensor.Tensor[float32, B]
Forward computes multi-head attention.
Args:
- query: Query tensor [batch, seq_q, embed_dim]
- key: Key tensor [batch, seq_k, embed_dim]
- value: Value tensor [batch, seq_k, embed_dim]
- mask: Optional attention mask [batch, 1, seq_q, seq_k] or nil
Returns:
- output: [batch, seq_q, embed_dim]
For self-attention, pass the same tensor for query, key, and value. For cross-attention, query differs from key/value.
func (*MultiHeadAttention[B]) ForwardWithCache ¶ added in v0.4.0
func (m *MultiHeadAttention[B]) ForwardWithCache(
	query *tensor.Tensor[float32, B],
	cache *KVCache[B],
) *tensor.Tensor[float32, B]
ForwardWithCache computes attention using KV cache for efficient autoregressive generation.
This method is optimized for inference where tokens are generated one at a time. Instead of recomputing K,V for all previous tokens, we cache them and only compute for the new token.
Args:
- query: Query tensor [batch, 1, embed_dim] (typically single token)
- cache: KV cache storing previous key-value pairs
Returns:
- output: [batch, 1, embed_dim]
The cache is automatically updated with new K,V pairs.
Example:
cache := nn.NewKVCache[B](1, 12, 512, 64, backend)
for i := 0; i < 100; i++ {
token := getNextToken(i) // [1, 1, 768]
output := mha.ForwardWithCache(token, cache)
}
func (*MultiHeadAttention[B]) ForwardWithWeights ¶ added in v0.4.0
func (m *MultiHeadAttention[B]) ForwardWithWeights(
	query, key, value *tensor.Tensor[float32, B],
	mask *tensor.Tensor[float32, B],
) (*tensor.Tensor[float32, B], *tensor.Tensor[float32, B])
ForwardWithWeights computes multi-head attention and returns attention weights.
Same as Forward but also returns attention weights for visualization/analysis.
Returns:
- output: [batch, seq_q, embed_dim]
- weights: [batch, num_heads, seq_q, seq_k]
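A sketch of inspecting the weights (self-attention over x; interpretation follows the shapes above):

	output, attn := mha.ForwardWithWeights(x, x, x, nil)
	// attn[b][h][i][j]: how strongly query position i attends to key position j in head h
	_ = output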
func (*MultiHeadAttention[B]) Parameters ¶ added in v0.4.0
func (m *MultiHeadAttention[B]) Parameters() []*Parameter[B]
Parameters returns all trainable parameters (WQ, WK, WV, WO weights and biases).
type Normalizer ¶ added in v0.4.0
type Normalizer[B tensor.Backend] interface {
	Forward(x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
	Parameters() []*Parameter[B]
}
Normalizer is an interface for normalization layers (LayerNorm and RMSNorm).
This allows TransformerBlock to work with both LayerNorm and RMSNorm without caring about the implementation details.
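A sketch of the substitution this enables, mirroring the UseRMSNorm switch in TransformerConfig:

	var norm nn.Normalizer[B]
	if useRMSNorm {
		norm = nn.NewRMSNorm[B](768, 1e-5, backend) // gamma only
	} else {
		norm = nn.NewLayerNorm[B](768, 1e-5, backend) // gamma and beta
	}
	x = norm.Forward(x)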
type Parameter ¶
Parameter represents a trainable parameter in a neural network.
Parameters are tensors that require gradient computation during training. They typically represent weights and biases of layers.
Example:
// Create a weight parameter
weight := nn.NewParameter("weight", weightTensor)
// Access the tensor
w := weight.Tensor()
// Get gradient after backward pass
grad := weight.Grad()
func NewParameter ¶
NewParameter creates a new trainable parameter.
The parameter tensor should be initialized before creating the Parameter. Gradient will be allocated during the first backward pass.
Parameters:
- name: Descriptive name for this parameter (e.g., "linear1.weight")
- tensor: The initialized parameter tensor
Returns a new Parameter.
func (*Parameter[B]) Grad ¶
Grad returns the gradient tensor.
Returns nil if no gradient has been computed yet (before backward pass).
func (*Parameter[B]) SetGrad ¶
SetGrad sets the gradient tensor.
This is typically called by the optimizer or during backward pass.
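For instance, a training loop might clear gradients between steps; a hedged sketch (treating SetGrad(nil) as a reset is an assumption, not documented behavior):

	for _, p := range model.Parameters() {
		p.SetGrad(nil) // assumed reset: Grad() returns nil again until the next backward pass
	}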
type RMSNorm ¶ added in v0.3.0
type RMSNorm[B tensor.Backend] struct {
	Gamma   *Parameter[B] // learnable scale [d_model]
	Epsilon float32       // numerical stability constant
	// contains filtered or unexported fields
}
RMSNorm applies Root Mean Square Normalization over an input tensor along the last dimension.
Formula: Y = X / sqrt(mean(X^2) + eps) * gamma
Where:
- X is the input tensor
- Y is the output tensor
- gamma is the learnable scale parameter [d_model]
- mean is computed along the last dimension
- eps is a small value to avoid division by zero
RMSNorm is simpler and faster than LayerNorm (no mean subtraction), and is widely used in modern LLM architectures (LLaMA, Mistral, Gemma).
Example:
backend := autodiff.New(cpu.New())
rmsnorm := nn.NewRMSNorm[AutodiffBackend](768, 1e-5, backend)
output := rmsnorm.Forward(hiddenStates) // [..., 768] -> [..., 768]
func NewRMSNorm ¶ added in v0.3.0
NewRMSNorm creates a new RMSNorm layer.
Parameters:
- dModel: size of the last dimension (feature dimension)
- epsilon: small constant for numerical stability (typically 1e-5 or 1e-6)
- backend: computation backend
The gamma parameter is initialized to ones.
func (*RMSNorm[B]) Forward ¶ added in v0.3.0
Forward applies RMSNorm to the input tensor.
Shapes:
- input: [..., any, d_model]
- output: [..., any, d_model]
Algorithm:
- Compute variance = mean(x^2) along last dimension (keepdim=true)
- Compute rms = sqrt(variance + epsilon)
- Normalize: x_norm = x / rms
- Scale: output = x_norm * gamma
func (*RMSNorm[B]) Parameters ¶ added in v0.3.0
Parameters returns the learnable parameters (gamma).
type ReLU ¶
ReLU is a Rectified Linear Unit activation module.
Applies the element-wise function: f(x) = max(0, x)
ReLU is the most commonly used activation function in deep learning. It helps with the vanishing gradient problem and is computationally efficient.
Example:
relu := nn.NewReLU[Backend]()
output := relu.Forward(input) // All negative values become 0
func (*ReLU[B]) Parameters ¶
Parameters returns an empty slice (ReLU has no trainable parameters).
type ReLUBackend ¶
ReLUBackend is an interface for backends that support ReLU activation.
type RotaryEncoding ¶ added in v0.4.0
type RotaryEncoding[B tensor.Backend] struct {
	FreqCos *tensor.Tensor[float32, B] // [max_seq_len, d_model/2] - cosine values
	FreqSin *tensor.Tensor[float32, B] // [max_seq_len, d_model/2] - sine values

	MaxSeqLen int // Maximum sequence length
	DModel    int // Model dimension (must be even)
	// contains filtered or unexported fields
}
RotaryEncoding implements Rotary Position Embedding (RoPE).
RoPE is a modern positional encoding used in LLaMA, Mistral, DeepSeek, and other state-of-the-art LLMs. It applies a rotation to query and key embeddings based on their position, allowing the model to capture relative position information.
Mathematical formulation:
For position m and dimension pair (2i, 2i+1):
θ_i = base^(-2i/d) (typically base=10000)
[q'_{2i} ] [cos(m·θ_i) -sin(m·θ_i)] [q_{2i} ]
[q'_{2i+1}] = [sin(m·θ_i) cos(m·θ_i)] [q_{2i+1}]
Architecture:
- Pre-computes cos and sin values for all positions and dimensions
- Applies rotation by splitting input into even/odd pairs
- Supports both training (full sequence) and inference (with offset for KV-cache)
Example:
config := nn.RotaryEncodingConfig{
DModel: 64, // Head dimension (typically 64-128)
MaxSeqLen: 2048, // Maximum sequence length
Theta: 10000.0,
}
rope := nn.NewRotaryEncoding(config, backend)
// During training: apply to full sequence
q := tensor.Randn[float32](tensor.Shape{batch, heads, seq, 64}, backend)
q_rotated := rope.Forward(q)
// During inference with KV-cache: apply with position offset
q_new := tensor.Randn[float32](tensor.Shape{batch, heads, 1, 64}, backend)
q_rotated = rope.ForwardWithOffset(q_new, currentPosition)
func NewRotaryEncoding ¶ added in v0.4.0
func NewRotaryEncoding[B tensor.Backend](cfg RotaryEncodingConfig, backend B) *RotaryEncoding[B]
NewRotaryEncoding creates a new RotaryEncoding layer.
Pre-computes cosine and sine values for all positions and dimension pairs.
Parameters:
- cfg: Configuration for RoPE (dimension, max sequence length, theta base)
- backend: Computation backend
Returns a new RotaryEncoding layer with pre-computed rotation matrices.
Panics if DModel is not even (RoPE requires pairing dimensions).
func (*RotaryEncoding[B]) Forward ¶ added in v0.4.0
Forward applies rotary position embeddings to the input tensor.
Supports both 3D and 4D input tensors:
- 3D: [batch, seq_len, d_model] - applies RoPE to entire sequence
- 4D: [batch, n_heads, seq_len, d_k] - applies RoPE per head (typical for attention)
The rotation is applied to dimension pairs (2i, 2i+1) using pre-computed cos/sin values.
Parameters:
- x: Input tensor [batch, seq_len, d_model] or [batch, n_heads, seq_len, d_k]
Returns tensor with same shape as input, with rotary embeddings applied.
Panics if sequence length exceeds MaxSeqLen or if last dimension doesn't match DModel.
func (*RotaryEncoding[B]) ForwardWithOffset ¶ added in v0.4.0
func (r *RotaryEncoding[B]) ForwardWithOffset(x *tensor.Tensor[float32, B], offset int) *tensor.Tensor[float32, B]
ForwardWithOffset applies rotary embeddings with a position offset.
This is useful for incremental decoding with KV-cache, where new tokens are generated one at a time but need position embeddings that account for previous tokens.
Parameters:
- x: Input tensor [batch, seq_len, d_model] or [batch, n_heads, seq_len, d_k]
- offset: Position offset (e.g., current position in KV-cache)
Returns tensor with rotary embeddings applied at positions [offset, offset+seq_len).
Example (KV-cache inference):
// Initial prompt: positions [0, prompt_len)
q_prompt := rope.Forward(q_prompt_tokens)

// Generate token 1: position [prompt_len]
q_new := rope.ForwardWithOffset(q_new_token, prompt_len)

// Generate token 2: position [prompt_len + 1]
q_new = rope.ForwardWithOffset(q_new_token, prompt_len+1)
Panics if offset + seq_len exceeds MaxSeqLen.
type RotaryEncodingConfig ¶ added in v0.4.0
type RotaryEncodingConfig struct {
DModel int // Dimension per head (typically 64-128, must be even)
MaxSeqLen int // Maximum sequence length (e.g., 2048, 4096)
Theta float64 // Base frequency for rotation (default: 10000.0)
}
RotaryEncodingConfig configures a RotaryEncoding layer.
type Sequential ¶
Sequential is a container module that chains multiple modules together.
Each module's output becomes the next module's input, creating a sequential pipeline of transformations.
Example:
model := nn.NewSequential(
nn.NewLinear(784, 128, backend),
nn.NewReLU(),
nn.NewLinear(128, 10, backend),
)
output := model.Forward(input)
This is equivalent to:
h1 := linear1.Forward(input)
h2 := relu.Forward(h1)
output := linear2.Forward(h2)
func NewSequential ¶
func NewSequential[B tensor.Backend](modules ...Module[B]) *Sequential[B]
NewSequential creates a new Sequential container.
Parameters:
- modules: List of modules to chain together
Returns a new Sequential container.
func (*Sequential[B]) Add ¶
func (s *Sequential[B]) Add(module Module[B])
Add appends a module to the sequence.
This allows building models incrementally:
model := nn.NewSequential[Backend]()
model.Add(nn.NewLinear(784, 128, backend))
model.Add(nn.NewReLU())
model.Add(nn.NewLinear(128, 10, backend))
func (*Sequential[B]) Forward ¶
Forward applies all modules in sequence.
The output of each module becomes the input to the next module.
Parameters:
- input: Input tensor to the first module
Returns the output of the last module.
func (*Sequential[B]) Len ¶
func (s *Sequential[B]) Len() int
Len returns the number of modules in the sequence.
func (*Sequential[B]) Module ¶
func (s *Sequential[B]) Module(index int) Module[B]
Module returns the module at the given index.
Panics if index is out of bounds.
func (*Sequential[B]) Parameters ¶
func (s *Sequential[B]) Parameters() []*Parameter[B]
Parameters returns all trainable parameters from all modules.
Parameters are collected from all modules in the sequence.
type SiLU ¶ added in v0.3.0
SiLU is a SiLU (Swish) activation module.
Applies the element-wise function: f(x) = x * sigmoid(x)
SiLU (Sigmoid Linear Unit), also known as Swish, is widely used in modern transformer architectures like LLaMA, Mistral, and GPT-Neo. It provides smooth, non-monotonic activation that helps with gradient flow.
Example:
silu := nn.NewSiLU[Backend]()
output := silu.Forward(input) // Smooth activation
func (*SiLU[B]) Parameters ¶ added in v0.3.0
Parameters returns an empty slice (SiLU has no trainable parameters).
type SiLUBackend ¶ added in v0.3.0
SiLUBackend is an interface for backends that support SiLU activation.
type Sigmoid ¶
Sigmoid is a sigmoid activation module.
Applies the element-wise function: σ(x) = 1 / (1 + exp(-x))
Sigmoid squashes values to the range (0, 1), making it useful for binary classification and gate mechanisms in LSTMs/GRUs.
Example:
sigmoid := nn.NewSigmoid[Backend]()
output := sigmoid.Forward(input) // Values in range (0, 1)
func NewSigmoid ¶
NewSigmoid creates a new Sigmoid activation module.
func (*Sigmoid[B]) Parameters ¶
Parameters returns an empty slice (Sigmoid has no trainable parameters).
type SigmoidBackend ¶
SigmoidBackend is an interface for backends that support Sigmoid activation.
type SinusoidalPositionalEncoding ¶ added in v0.4.0
type SinusoidalPositionalEncoding[B tensor.Backend] struct {
	Encoding *tensor.Tensor[float32, B] // [max_len, dim] - pre-computed encodings
	MaxLen   int                        // Maximum sequence length
	Dim      int                        // Embedding dimension
	// contains filtered or unexported fields
}
SinusoidalPositionalEncoding implements fixed sinusoidal positional encodings.
This is the original positional encoding from "Attention is All You Need" (Vaswani et al., 2017). It uses sine and cosine functions at different frequencies to encode position information.
Mathematical formulation:
PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Where:
- pos is the position (0 to max_len-1)
- i is the dimension (0 to d/2-1)
- d is the model dimension
These encodings are fixed (not learned) and allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos).
Example:
pe := nn.NewSinusoidalPositionalEncoding(512, 256, backend)
positions := pe.Forward(10) // Get encodings for first 10 positions
// Shape: [1, 10, 256]
func NewSinusoidalPositionalEncoding ¶ added in v0.4.0
func NewSinusoidalPositionalEncoding[B tensor.Backend](maxLen, dim int, backend B) *SinusoidalPositionalEncoding[B]
NewSinusoidalPositionalEncoding creates a new SinusoidalPositionalEncoding layer.
Pre-computes all positional encodings up to maxLen.
Parameters:
- maxLen: Maximum sequence length to pre-compute
- dim: Embedding dimension (typically same as model dimension)
- backend: Computation backend
Returns a new SinusoidalPositionalEncoding with pre-computed encodings.
func (*SinusoidalPositionalEncoding[B]) Forward ¶ added in v0.4.0
func (s *SinusoidalPositionalEncoding[B]) Forward(seqLen int) *tensor.Tensor[float32, B]
Forward returns positional encodings for the specified sequence length.
Parameters:
- seqLen: Length of the sequence (must be <= MaxLen)
Returns:
- Positional encodings with shape [1, seqLen, dim]. The batch dimension is 1 for broadcasting to any batch size.
Example:
pe := nn.NewSinusoidalPositionalEncoding(512, 256, backend)
encodings := pe.Forward(100) // [1, 100, 256]

// Add to token embeddings
embeddings := tokenEmbed.Forward(tokens) // [batch, 100, 256]
embeddings = embeddings.Add(encodings)   // Broadcast over batch
Panics if seqLen > MaxLen.
type SwiGLUFFN ¶ added in v0.5.0
SwiGLUFFN implements a feed-forward network with SwiGLU activation.
Architecture (LLaMA-style):
hidden = SwiGLU(x @ W_up, x @ W_gate)
output = hidden @ W_down
Where SwiGLU(up, gate) = up * SiLU(gate).
This is more parameter-efficient than standard FFN with GELU:
- Standard FFN: 2 * d_model * ffn_dim parameters.
- SwiGLU FFN: 3 * d_model * ffn_dim parameters (but ffn_dim is smaller).
LLaMA uses ffn_dim = 2.7 * d_model (vs 4 * d_model in standard FFN) resulting in similar total parameters but better performance.
Example:
cfg := nn.SwiGLUFFNConfig{
EmbedDim: 4096,
FFNDim: 11008, // LLaMA 7B
}
ffn := nn.NewSwiGLUFFN(cfg, backend)
output := ffn.Forward(x) // [batch, seq, 4096] -> [batch, seq, 4096]
func NewSwiGLUFFN ¶ added in v0.5.0
func NewSwiGLUFFN[B tensor.Backend](cfg SwiGLUFFNConfig, backend B) *SwiGLUFFN[B]
NewSwiGLUFFN creates a new SwiGLUFFN layer.
If GLUVariant is empty, defaults to "swiglu". If FFNDim is 0, it's computed as 8/3 * EmbedDim (LLaMA formula).
Example:
// LLaMA 7B FFN
ffn := nn.NewSwiGLUFFN(nn.SwiGLUFFNConfig{
EmbedDim: 4096,
FFNDim: 11008,
}, backend)
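A variant sketch leaning on the documented defaults (FFNDim computed as 8/3 * EmbedDim when left at 0; GLUVariant defaults to "swiglu"):

	ffn := nn.NewSwiGLUFFN(nn.SwiGLUFFNConfig{
		EmbedDim:   768,
		GLUVariant: "geglu", // GELU gating instead of the default SwiGLU
	}, backend) // FFNDim left 0 -> 8/3 * 768 = 2048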
func (*SwiGLUFFN[B]) Forward ¶ added in v0.5.0
Forward computes the SwiGLU FFN output.
Input:  [batch, seq_len, embed_dim] or [batch*seq_len, embed_dim]
Output: same shape as input
Computation:
gate   = x @ W_gate
up     = x @ W_up
hidden = GLU_variant(up, gate) // e.g., up * SiLU(gate)
output = hidden @ W_down
func (*SwiGLUFFN[B]) Parameters ¶ added in v0.5.0
Parameters returns all trainable parameters.
type SwiGLUFFNConfig ¶ added in v0.5.0
type SwiGLUFFNConfig struct {
EmbedDim int // Model dimension (d_model), e.g., 4096.
FFNDim int // Intermediate/hidden dimension, e.g., 11008 for LLaMA 7B.
GLUVariant string // Variant: "swiglu" (default), "geglu", "reglu", "glu".
UseBias bool // Whether to use bias in linear layers (LLaMA doesn't).
}
SwiGLUFFNConfig configures a SwiGLUFFN layer.
type Tanh ¶
Tanh is a hyperbolic tangent activation module.
Applies the element-wise function: tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))
Tanh squashes values to the range (-1, 1), making it zero-centered which can help with training. Often used in RNNs.
Example:
tanh := nn.NewTanh[Backend]()
output := tanh.Forward(input) // Values in range (-1, 1)
func (*Tanh[B]) Parameters ¶
Parameters returns an empty slice (Tanh has no trainable parameters).
type TanhBackend ¶
TanhBackend is an interface for backends that support Tanh activation.
type TransformerBlock ¶ added in v0.4.0
type TransformerBlock[B tensor.Backend] struct {
	Config TransformerConfig

	AttnNorm  Normalizer[B] // RMSNorm or LayerNorm before/after attention
	Attention *MultiHeadAttention[B]
	FFNNorm   Normalizer[B] // RMSNorm or LayerNorm before/after FFN
	FFN       *FFN[B]
	// contains filtered or unexported fields
}
TransformerBlock implements a complete Transformer Block.
Architecture (Pre-Norm, LLaMA style):
h = x + MHA(Norm(x))   (residual around attention)
y = h + FFN(Norm(h))   (residual around FFN)
Architecture (Post-Norm, original Transformer):
h = Norm(x + MHA(x))   (residual, then normalization)
y = Norm(h + FFN(h))
Pre-Norm is preferred in modern LLMs as it provides:
- Better gradient flow (no need for learning rate warmup)
- More stable training
- Easier to stack many layers (100+ layers possible)
Components:
- AttnNorm: Normalization before/after attention (RMSNorm or LayerNorm)
- Attention: Multi-Head Self-Attention (see MultiHeadAttention)
- FFNNorm: Normalization before/after FFN
- FFN: Feed-Forward Network (2-layer MLP with SiLU activation)
Example:
backend := autodiff.New(cpu.New())
config := nn.TransformerConfig{
EmbedDim: 768,
NumHeads: 12,
FFNDim: 3072,
NormFirst: true, // Pre-Norm
UseRMSNorm: true, // RMSNorm
NormEps: 1e-5,
}
block := nn.NewTransformerBlock(config, backend)
output := block.Forward(x, mask) // [batch, seq, 768] -> [batch, seq, 768]
func NewTransformerBlock ¶ added in v0.4.0
func NewTransformerBlock[B tensor.Backend](config TransformerConfig, backend B) *TransformerBlock[B]
NewTransformerBlock creates a new Transformer Block.
Parameters:
- config: Configuration (embedDim, numHeads, ffnDim, normalization type, etc.)
- backend: Computation backend
The block is initialized with:
- Multi-Head Attention (embedDim, numHeads)
- FFN (embedDim, ffnDim)
- Two normalization layers (RMSNorm or LayerNorm based on config)
Example:
config := nn.TransformerConfig{
EmbedDim: 768,
NumHeads: 12,
FFNDim: 3072,
NormFirst: true, // Pre-Norm (LLaMA style)
UseRMSNorm: true, // RMSNorm (faster than LayerNorm)
NormEps: 1e-5,
}
block := nn.NewTransformerBlock(config, backend)
func (*TransformerBlock[B]) Forward ¶ added in v0.4.0
func (t *TransformerBlock[B]) Forward(x, mask *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
Forward computes the transformer block output.
Args:
- x: Input tensor [batch, seq, embed_dim]
- mask: Optional attention mask [batch, 1, seq, seq] or nil
Returns:
- output: [batch, seq, embed_dim]
The forward pass applies:
- Self-Attention with residual connection
- FFN with residual connection
Normalization is applied either before (Pre-Norm) or after (Post-Norm) each sub-layer.
Example:
x := tensor.Randn[float32](tensor.Shape{2, 16, 768}, backend)
mask := createCausalMask(16, backend) // For autoregressive generation
output := block.Forward(x, mask) // [2, 16, 768]
func (*TransformerBlock[B]) ForwardWithCache ¶ added in v0.4.0
func (t *TransformerBlock[B]) ForwardWithCache(
	x *tensor.Tensor[float32, B],
	cache *KVCache[B],
) *tensor.Tensor[float32, B]
ForwardWithCache computes attention using KV cache for efficient autoregressive generation.
This method is optimized for inference where tokens are generated one at a time. The cache stores previous key-value pairs, avoiding recomputation.
Args:
- x: Query tensor [batch, 1, embed_dim] (typically single token)
- cache: KV cache storing previous key-value pairs
Returns:
- output: [batch, 1, embed_dim]
Note: Only Pre-Norm is supported with cache. Post-Norm would require caching intermediate states which is more complex.
Example:
cache := nn.NewKVCache[B](1, 12, 512, 64, backend)
for i := 0; i < 100; i++ {
token := getNextToken(i) // [1, 1, 768]
output := block.ForwardWithCache(token, cache)
}
func (*TransformerBlock[B]) Parameters ¶ added in v0.4.0
func (t *TransformerBlock[B]) Parameters() []*Parameter[B]
Parameters returns all trainable parameters.
Returns parameters from:
- AttnNorm (gamma, beta or just gamma for RMSNorm)
- Attention (WQ, WK, WV, WO weights and biases)
- FFNNorm (gamma, beta or just gamma for RMSNorm)
- FFN (Linear1, Linear2 weights and biases)
Total parameters for GPT-2 768d/12h:
- Attention: ~2.4M params (4 * 768*768)
- AttnNorm: 768 (RMSNorm) or 1536 (LayerNorm)
- FFN: ~4.7M params (768*3072 + 3072*768)
- FFNNorm: 768 (RMSNorm) or 1536 (LayerNorm)
- Total: ~7.1M per block
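The arithmetic behind these counts, as a sketch (biases omitted, hence the approximate figures):

	const d, ffnDim = 768, 3072
	attn := 4 * d * d           // WQ, WK, WV, WO: 2,359,296
	ffn := 2 * d * ffnDim       // Linear1 + Linear2: 4,718,592
	norms := 2 * d              // two RMSNorm gammas (2 * 2d with LayerNorm)
	total := attn + ffn + norms // 7,079,424 ≈ 7.1M per block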
type TransformerConfig ¶ added in v0.4.0
type TransformerConfig struct {
EmbedDim int // d_model: Embedding dimension (e.g., 768 for GPT-2)
NumHeads int // Number of attention heads (e.g., 12 for GPT-2)
FFNDim int // FFN hidden dimension (typically 4 * EmbedDim = 3072)
Dropout float32 // Dropout rate (0 = no dropout, not implemented yet)
NormFirst bool // true = Pre-Norm (LLaMA), false = Post-Norm (original Transformer)
UseRMSNorm bool // true = RMSNorm (LLaMA), false = LayerNorm (BERT/GPT)
NormEps float32 // Normalization epsilon (1e-5 typical, 1e-6 for RMSNorm)
}
TransformerConfig defines the configuration for a Transformer Block.