Package nn

v0.5.1

Published: Dec 1, 2025 License: Apache-2.0 Imports: 4 Imported by: 0

Documentation

Overview

Package nn provides neural network modules and layers for building deep learning models in the Born ML Framework. It includes activations, attention mechanisms, normalization layers, loss functions, and more.

This package provides building blocks for constructing neural networks:

  • Module interface: Base interface for all NN components
  • Parameter: Trainable parameters with gradient tracking
  • Linear: Fully connected layer
  • Activations: ReLU, Sigmoid, Tanh
  • Loss functions: MSE, CrossEntropy
  • Sequential: Container for stacking layers

Design inspired by PyTorch's nn.Module but adapted for Go generics.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Accuracy

func Accuracy[B tensor.Backend](
	logits *tensor.Tensor[float32, B],
	targets *tensor.Tensor[int32, B],
) float32

Accuracy computes classification accuracy for a batch.

Parameters:

  • logits: Model predictions [batch_size, num_classes]
  • targets: Ground truth class indices [batch_size]

Returns:

  • Accuracy as a float between 0 and 1.
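The computation is argmax-per-row followed by comparison against the targets. A standalone sketch on plain Go slices (the `accuracy` helper here is illustrative, not this package's tensor-based API):

```go
package main

import "fmt"

// accuracy mirrors the argmax-and-compare logic on row-major slices:
// logits is [batch, numClasses] flattened; targets holds one class
// index per row.
func accuracy(logits []float32, targets []int, numClasses int) float32 {
	correct := 0
	for b, t := range targets {
		row := logits[b*numClasses : (b+1)*numClasses]
		best := 0
		for i := 1; i < len(row); i++ {
			if row[i] > row[best] {
				best = i
			}
		}
		if best == t {
			correct++
		}
	}
	return float32(correct) / float32(len(targets))
}

func main() {
	// row 0 predicts class 0 (correct), row 1 predicts class 1 (target is 2)
	logits := []float32{2, 1, 0, 0, 3, 1}
	fmt.Println(accuracy(logits, []int{0, 2}, 3)) // 0.5
}
```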

func CausalMask added in v0.4.0

func CausalMask[B tensor.Backend](seqLen int, backend B) *tensor.Tensor[float32, B]

CausalMask creates a causal (autoregressive) attention mask.

In causal attention, each position can only attend to earlier positions (including itself). This is used in autoregressive models like GPT.

Returns a mask tensor where:

  • Upper triangle (future positions) = -inf (masked out)
  • Lower triangle + diagonal (past + current) = 0 (allowed)

The mask is applied additively to attention scores before softmax:

scores = QK^T / sqrt(d_k) + mask

Shape: [1, 1, seq_len, seq_len] (broadcastable to [batch, heads, seq, seq])

Example:

// For seq_len=4:
// [[0,   -inf, -inf, -inf],
//  [0,   0,    -inf, -inf],
//  [0,   0,    0,    -inf],
//  [0,   0,    0,    0   ]]

backend := cpu.New()
mask := nn.CausalMask(10, backend)  // [1, 1, 10, 10]

func CrossEntropyBackward

func CrossEntropyBackward[B tensor.Backend](
	logits *tensor.Tensor[float32, B],
	targets *tensor.Tensor[int32, B],
	backend B,
) *tensor.Tensor[float32, B]

CrossEntropyBackward computes gradient of CrossEntropyLoss w.r.t. logits.

This function provides a manual backward pass for CrossEntropyLoss. It will be integrated with autodiff in Phase 2.

Gradient Formula:

∂L/∂logits[i] = softmax(logits)[i] - y_one_hot[i]
              = probs[i] - (1 if i==target else 0)

For single class target:

∂L/∂logits[i] = probs[i]         if i ≠ target
∂L/∂logits[i] = probs[i] - 1     if i = target

Parameters:

  • logits: [batch_size, num_classes]
  • targets: [batch_size] (class indices)

Returns:

  • grads: [batch_size, num_classes] gradient tensor

Note: Gradients are automatically averaged over batch size.

func GELUFunc added in v0.5.0

func GELUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

GELUFunc applies GELU (Gaussian Error Linear Unit) activation.

Uses the tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).

GELU is used in BERT, GPT-2, and other transformers.

Example:

output := nn.GELUFunc(input)

func GLU added in v0.5.0

func GLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

GLU applies Gated Linear Unit: GLU(x, gate) = x * sigmoid(gate).

GLU is the base gating mechanism used in various transformer FFN layers.

Parameters:

  • x: input tensor.
  • gate: gating tensor (same shape as x).

Returns: x * sigmoid(gate).

Example:

output := nn.GLU(x, gate)
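All of the GLU variants below share the same elementwise pattern, out = x * act(gate), differing only in the activation. A standalone sketch of the sigmoid-gated base case on plain slices (the `glu` and `sigmoid` helpers are illustrative, not this package's API):

```go
package main

import (
	"fmt"
	"math"
)

func sigmoid(z float64) float64 { return 1 / (1 + math.Exp(-z)) }

// glu applies out[i] = x[i] * sigmoid(gate[i]); swapping sigmoid for
// ReLU, GELU, or SiLU yields ReGLU, GeGLU, and SwiGLU respectively.
func glu(x, gate []float64) []float64 {
	out := make([]float64, len(x))
	for i := range x {
		out[i] = x[i] * sigmoid(gate[i])
	}
	return out
}

func main() {
	// sigmoid(0) = 0.5, so a zero gate halves each input
	fmt.Println(glu([]float64{2, 4}, []float64{0, 0})) // [1 2]
}
```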

func GeGLU added in v0.5.0

func GeGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

GeGLU applies GELU-Gated Linear Unit: GeGLU(x, gate) = x * GELU(gate).

GeGLU uses GELU activation for gating instead of SiLU. Used in some transformer variants for different activation characteristics.

Parameters:

  • x: input tensor.
  • gate: gating tensor.

Returns: x * GELU(gate).

Example:

output := nn.GeGLU(up, gate)

func Ones

func Ones[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]

Ones creates a tensor filled with ones.

Parameters:

  • shape: Shape of the tensor
  • backend: Backend to use for tensor creation

Returns a tensor filled with ones.

func Randn

func Randn[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]

Randn creates a tensor with random values from standard normal distribution.

Values are drawn from N(0, 1).

Parameters:

  • shape: Shape of the tensor
  • backend: Backend to use for tensor creation

Returns a tensor with random normal values.

func ReGLU added in v0.5.0

func ReGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

ReGLU applies ReLU-Gated Linear Unit: ReGLU(x, gate) = x * ReLU(gate).

ReGLU uses ReLU activation for gating. It's simpler but may have "dead neuron" issues compared to SwiGLU or GeGLU.

Parameters:

  • x: input tensor.
  • gate: gating tensor.

Returns: x * ReLU(gate).

Example:

output := nn.ReGLU(up, gate)

func ReLUFunc added in v0.5.0

func ReLUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

ReLUFunc applies ReLU activation: f(x) = max(0, x).

Example:

output := nn.ReLUFunc(input)

func RepeatKV added in v0.5.0

func RepeatKV[B tensor.Backend](
	kv *tensor.Tensor[float32, B],
	nRep int,
) *tensor.Tensor[float32, B]

RepeatKV broadcasts KV heads to match query heads count.

This is the key operation in GQA that allows fewer KV heads than Q heads. Each KV head is repeated nRep times to match the Q head count.

Input: [batch, n_kv_heads, seq_len, head_dim]
Output: [batch, n_q_heads, seq_len, head_dim], where n_q_heads = n_kv_heads * nRep

Example:

// 8 KV heads -> 32 Q heads (nRep=4)
kv := tensor.Randn[float32](tensor.Shape{2, 8, 100, 128}, backend)
expanded := nn.RepeatKV(kv, 4)  // [2, 32, 100, 128]

If nRep=1 (standard MHA), returns the input unchanged.

func ScaledDotProductAttention added in v0.4.0

func ScaledDotProductAttention[B tensor.Backend](
	query, key, value *tensor.Tensor[float32, B],
	mask *tensor.Tensor[float32, B],
	scale float32,
) (*tensor.Tensor[float32, B], *tensor.Tensor[float32, B])

ScaledDotProductAttention computes attention scores using the scaled dot-product mechanism.

This is the core attention mechanism used in transformers, implementing:

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V

Where:

  • Q (query): what information we're looking for [batch, heads, seq_q, head_dim]
  • K (key): what information is available [batch, heads, seq_k, head_dim]
  • V (value): the actual information to retrieve [batch, heads, seq_k, head_dim]
  • mask: optional attention mask (additive, -inf for masked positions)
  • scale: scaling factor (typically 1/sqrt(head_dim)), 0 for auto-compute

Parameters:

  • query: Query tensor [batch, heads, seq_q, head_dim]
  • key: Key tensor [batch, heads, seq_k, head_dim]
  • value: Value tensor [batch, heads, seq_k, head_dim]
  • mask: Optional attention mask [batch, 1, seq_q, seq_k] or nil (additive mask, -inf for masked)
  • scale: Scaling factor (0 for auto-compute as 1/sqrt(head_dim))

Returns:

  • output: Attended values [batch, heads, seq_q, head_dim]
  • weights: Attention weights [batch, heads, seq_q, seq_k]

Example:

backend := autodiff.New(cpu.New())
Q := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)  // batch=2, heads=8, seq=10, dim=64
K := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
V := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
output, weights := nn.ScaledDotProductAttention(Q, K, V, nil, 0)  // auto-scale

func SiLUFunc added in v0.5.0

func SiLUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

SiLUFunc applies SiLU (Swish) activation: f(x) = x * sigmoid(x).

This is the functional version of SiLU activation, useful in GLU variants.

Example:

output := nn.SiLUFunc(input)

func SigmoidFunc added in v0.5.0

func SigmoidFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

SigmoidFunc applies Sigmoid activation: σ(x) = 1 / (1 + exp(-x)).

Example:

output := nn.SigmoidFunc(input)

func SwiGLU added in v0.5.0

func SwiGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

SwiGLU applies Swish-Gated Linear Unit: SwiGLU(x, gate) = x * SiLU(gate).

SwiGLU is used in modern LLMs like LLaMA, Mistral, and DeepSeek. It combines the input with SiLU-activated gate for better gradient flow.

Parameters:

  • x: input tensor (typically "up" projection).
  • gate: gating tensor (typically "gate" projection).

Returns: x * SiLU(gate) where SiLU(z) = z * sigmoid(z).

Example:

// In LLaMA-style FFN:
up := upProj.Forward(input)
gate := gateProj.Forward(input)
hidden := nn.SwiGLU(up, gate)

func Xavier

func Xavier[B tensor.Backend](fanIn, fanOut int, shape tensor.Shape, backend B) *tensor.Tensor[float32, B]

Xavier (Glorot) initialization for weights.

Initializes weights with values drawn from a uniform distribution: U(-sqrt(6/(fan_in + fan_out)), sqrt(6/(fan_in + fan_out)))

This initialization helps maintain variance of activations across layers.

Parameters:

  • fanIn: Number of input units
  • fanOut: Number of output units
  • shape: Shape of the weight tensor
  • backend: Backend to use for tensor creation

Returns a tensor initialized with Xavier distribution.

func Zeros

func Zeros[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]

Zeros creates a tensor filled with zeros.

This is commonly used for bias initialization.

Parameters:

  • shape: Shape of the tensor
  • backend: Backend to use for tensor creation

Returns a zero-filled tensor.

Types

type ALiBi added in v0.4.0

type ALiBi[B tensor.Backend] struct {
	NumHeads int       // Number of attention heads
	Slopes   []float32 // Slope for each head (geometric sequence)
	// contains filtered or unexported fields
}

ALiBi implements Attention with Linear Biases.

ALiBi is a positional encoding method that adds a linear bias to attention scores based on the distance between query and key positions. This approach is used in BLOOM, MPT, and other models.

Instead of adding positional information to embeddings, ALiBi adds a bias to the attention scores:

attention_scores = Q @ K^T + bias

Where bias[i,j] = -slope * |i - j|, and each attention head has a different slope.

The slopes are determined by a geometric sequence:

slopes = [2^(-8/n), 2^(-16/n), ..., 2^(-8)]  for n heads

This allows the model to extrapolate to longer sequences than seen during training.

Example:

alibi := nn.NewALiBi(8, backend)  // 8 attention heads
bias := alibi.GetBias(128)        // Bias for sequence length 128
// Shape: [1, 8, 128, 128]

// In attention:
scores := Q.BatchMatMul(K.Transpose())  // [batch, 8, seq, seq]
scores = scores.Add(bias)               // Add ALiBi bias
weights := scores.Softmax(-1)

func NewALiBi added in v0.4.0

func NewALiBi[B tensor.Backend](numHeads int, backend B) *ALiBi[B]

NewALiBi creates a new ALiBi bias generator.

Computes slopes for each attention head using the formula from the paper:

For n heads: slopes = [2^(-8/n * i) for i in 1..n]

Example slopes for 8 heads:

[2^(-1), 2^(-2), 2^(-3), 2^(-4), 2^(-5), 2^(-6), 2^(-7), 2^(-8)]
≈ [0.5, 0.25, 0.125, 0.0625, 0.03125, 0.015625, 0.0078125, 0.00390625]

Parameters:

  • numHeads: Number of attention heads
  • backend: Computation backend

Returns a new ALiBi instance with pre-computed slopes.

func (*ALiBi[B]) GetBias added in v0.4.0

func (a *ALiBi[B]) GetBias(seqLen int) *tensor.Tensor[float32, B]

GetBias returns the ALiBi bias matrix for the specified sequence length.

The bias has shape [1, num_heads, seq_len, seq_len], where:

  • bias[0, h, i, j] = -slopes[h] * |i - j|

The leading dimension is 1 for broadcasting across batch dimension.

Parameters:

  • seqLen: Sequence length for the bias matrix

Returns:

  • Bias tensor [1, num_heads, seq_len, seq_len]

Example:

alibi := nn.NewALiBi(8, backend)
bias := alibi.GetBias(64)  // [1, 8, 64, 64]

// In attention computation:
scores := Q.BatchMatMul(K.T())  // [batch, 8, 64, 64]
scores = scores.Add(bias)        // Broadcast and add
weights := scores.Softmax(-1)

type Conv2D

type Conv2D[B tensor.Backend] struct {
	// contains filtered or unexported fields
}

Conv2D is a 2D convolutional layer.

Performs convolution: output = Conv2D(input, weight) + bias

Input shape: [batch, in_channels, height, width]
Weight shape: [out_channels, in_channels, kernel_h, kernel_w]
Bias shape: [out_channels]
Output shape: [batch, out_channels, out_h, out_w]

Where:

out_h = (height + 2*padding - kernel_h) / stride + 1
out_w = (width + 2*padding - kernel_w) / stride + 1

Example:

// Create 2D conv: 1 channel -> 6 channels, 5x5 kernel
conv := nn.NewConv2D(1, 6, 5, 5, 1, 0, true, backend)

// Forward pass
input := tensor.Zeros[float32](tensor.Shape{32, 1, 28, 28}, backend) // MNIST-like
output := conv.Forward(input) // [32, 6, 24, 24]

func NewConv2D

func NewConv2D[B tensor.Backend](
	inChannels, outChannels int,
	kernelH, kernelW int,
	stride, padding int,
	useBias bool,
	backend B,
) *Conv2D[B]

NewConv2D creates a new 2D convolutional layer with Xavier initialization.

Parameters:

  • inChannels: Number of input channels
  • outChannels: Number of output channels (number of filters)
  • kernelH, kernelW: Kernel dimensions
  • stride: Stride for convolution (commonly 1 or 2)
  • padding: Zero padding to apply to input (commonly 0, 1, 2)
  • useBias: Whether to include bias term
  • backend: Backend for computation

Initialization:

  • Weights: Xavier/Glorot uniform initialization
  • Bias: Zeros

func (*Conv2D[B]) ComputeOutputSize

func (c *Conv2D[B]) ComputeOutputSize(inputH, inputW int) [2]int

ComputeOutputSize computes output spatial dimensions for given input size.

Returns: [out_height, out_width].
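The computation follows the out_h/out_w formula given for Conv2D above. A standalone sketch (the `convOutSize` helper is illustrative, not this method's implementation):

```go
package main

import "fmt"

// convOutSize computes the output size along one spatial dimension:
// (in + 2*padding - kernel) / stride + 1.
func convOutSize(in, kernel, stride, padding int) int {
	return (in+2*padding-kernel)/stride + 1
}

func main() {
	// matches the MNIST-like example: 28x28 input, 5x5 kernel,
	// stride 1, no padding -> 24x24 output
	fmt.Println(convOutSize(28, 5, 1, 0)) // 24
}
```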

func (*Conv2D[B]) Forward

func (c *Conv2D[B]) Forward(input *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

Forward performs the forward pass.

Input: [batch, in_channels, height, width]
Output: [batch, out_channels, out_h, out_w]

func (*Conv2D[B]) InChannels

func (c *Conv2D[B]) InChannels() int

InChannels returns the number of input channels.

func (*Conv2D[B]) KernelSize

func (c *Conv2D[B]) KernelSize() [2]int

KernelSize returns the kernel size [height, width].

func (*Conv2D[B]) OutChannels

func (c *Conv2D[B]) OutChannels() int

OutChannels returns the number of output channels.

func (*Conv2D[B]) Padding

func (c *Conv2D[B]) Padding() int

Padding returns the padding.

func (*Conv2D[B]) Parameters

func (c *Conv2D[B]) Parameters() []*Parameter[B]

Parameters returns all trainable parameters.

func (*Conv2D[B]) Stride

func (c *Conv2D[B]) Stride() int

Stride returns the stride.

func (*Conv2D[B]) String

func (c *Conv2D[B]) String() string

String returns a string representation of the layer.

type CrossEntropyLoss

type CrossEntropyLoss[B tensor.Backend] struct {
	// contains filtered or unexported fields
}

CrossEntropyLoss computes cross-entropy loss for multi-class classification.

This implementation uses the LogSoftmax + NLLLoss decomposition for numerical stability, following modern best practices (PyTorch, Burn 2025).

Mathematical Formulation:

Loss = -log_probs[target]
where log_probs = LogSoftmax(logits)

Gradient (Backward):

∂L/∂logits = Softmax(logits) - y_one_hot

Usage:

criterion := nn.NewCrossEntropyLoss[Backend](backend)
logits := model.Forward(input)  // [batch_size, num_classes]
loss := criterion.Forward(logits, targets)  // targets: [batch_size] (class indices)

Key Properties:

  • Expects raw logits (unnormalized scores) as input
  • Uses log-sum-exp trick for numerical stability
  • Prevents overflow when logits > 88 (float32 limit)
  • Prevents underflow when all logits are very negative

References:

  • PyTorch CrossEntropyLoss documentation
  • Burn framework loss implementations

func NewCrossEntropyLoss

func NewCrossEntropyLoss[B tensor.Backend](backend B) *CrossEntropyLoss[B]

NewCrossEntropyLoss creates a new cross-entropy loss function.

func (*CrossEntropyLoss[B]) Forward

func (c *CrossEntropyLoss[B]) Forward(
	logits *tensor.Tensor[float32, B],
	targets *tensor.Tensor[int32, B],
) *tensor.Tensor[float32, B]

Forward computes cross-entropy loss.

Parameters:

  • logits: Model predictions (unnormalized scores) with shape [batch_size, num_classes]
  • targets: Ground truth class indices with shape [batch_size] (values in range [0, num_classes-1])

Returns:

  • Scalar loss value (mean over batch)

Note: This is a simplified implementation for Phase 1 (MNIST proof-of-concept). Full autodiff support for Softmax/Log operations will be added in Phase 2.

func (*CrossEntropyLoss[B]) Parameters

func (c *CrossEntropyLoss[B]) Parameters() []*Parameter[B]

Parameters returns an empty slice (loss functions have no trainable parameters).

type Embedding added in v0.3.0

type Embedding[B tensor.Backend] struct {
	Weight   *Parameter[B] // Embedding weight matrix [NumEmbed, EmbedDim]
	NumEmbed int           // Number of embeddings (vocabulary size)
	EmbedDim int           // Embedding dimension (vector size)
}

Embedding is a lookup table that maps discrete indices to dense vectors.

This is a fundamental layer in NLP and sequence models, converting token IDs to continuous embeddings. The embedding vectors are learnable parameters.

Architecture:

  • Weight: [NumEmbed, EmbedDim] learnable parameter
  • Forward: indices [batch, seq] -> embeddings [batch, seq, EmbedDim]
  • Backward: gradients scatter-add to weight rows

Example:

// Vocabulary of 10000 words, embedding dimension 256
embed := nn.NewEmbedding[B](10000, 256, backend)

// Token IDs for batch of 2 sequences, each 5 tokens
indices := tensor.FromSlice([]int32{1, 2, 3, 4, 5, 10, 11, 12, 13, 14},
    tensor.Shape{2, 5}, backend)

// Get embeddings [2, 5, 256]
embeddings := embed.Forward(indices)

func NewEmbedding added in v0.3.0

func NewEmbedding[B tensor.Backend](numEmbeddings, embeddingDim int, backend B) *Embedding[B]

NewEmbedding creates a new Embedding layer.

The embedding weights are initialized from a standard normal distribution N(0, 1). For other initialization strategies (Xavier, truncated normal), initialize the weight tensor manually and pass it to NewEmbeddingWithWeight.

Parameters:

  • numEmbeddings: Size of the embedding dictionary (e.g., vocabulary size)
  • embeddingDim: Dimension of each embedding vector
  • backend: Computation backend

Returns a new Embedding layer with randomly initialized weights.

func NewEmbeddingWithWeight added in v0.3.0

func NewEmbeddingWithWeight[B tensor.Backend](weight *tensor.Tensor[float32, B]) *Embedding[B]

NewEmbeddingWithWeight creates an Embedding layer with pre-initialized weights.

Use this when you want custom initialization (Xavier, truncated normal, pretrained weights, etc.).

Parameters:

  • weight: Pre-initialized weight tensor [numEmbeddings, embeddingDim]

Returns a new Embedding layer using the provided weights.

func (*Embedding[B]) Forward added in v0.3.0

func (e *Embedding[B]) Forward(indices *tensor.Tensor[int32, B]) *tensor.Tensor[float32, B]

Forward performs embedding lookup.

Maps each index to its corresponding embedding vector. This operation is differentiable - gradients flow back to the weight tensor.

Parameters:

  • indices: Tensor of indices [batch, seq] or any shape [...] of type int32

Returns:

  • embeddings: Tensor [..., EmbedDim] with embedding vectors

Example:

indices := tensor.FromSlice([]int32{0, 1, 2}, tensor.Shape{3}, backend)
embeddings := embed.Forward(indices) // Shape: [3, EmbedDim]

Panics if any index is out of bounds [0, NumEmbed).

func (*Embedding[B]) Parameters added in v0.3.0

func (e *Embedding[B]) Parameters() []*Parameter[B]

Parameters returns the list of trainable parameters.

type FFN added in v0.4.0

type FFN[B tensor.Backend] struct {
	Linear1 *Linear[B] // [embed_dim → ffn_dim]
	Linear2 *Linear[B] // [ffn_dim → embed_dim]
	SiLU    *SiLU[B]   // Activation function
	// contains filtered or unexported fields
}

FFN implements a Feed-Forward Network (also called MLP - Multi-Layer Perceptron).

Architecture:

FFN(x) = Linear2(SiLU(Linear1(x)))

Where:

  • Linear1: [embed_dim → ffn_dim] (expansion)
  • SiLU: Activation function (x * sigmoid(x))
  • Linear2: [ffn_dim → embed_dim] (projection back)

The FFN is a core component of transformer blocks, typically with ffn_dim = 4 * embed_dim. This expansion-and-projection pattern helps the model learn complex transformations.

Used in all transformer architectures:

  • GPT: embed_dim=768, ffn_dim=3072 (4x expansion)
  • BERT: embed_dim=768, ffn_dim=3072
  • LLaMA: embed_dim=4096, ffn_dim=11008 (~2.7x expansion)

Example:

backend := autodiff.New(cpu.New())
ffn := nn.NewFFN[B](768, 3072, backend)  // GPT-2 small
output := ffn.Forward(x)  // [batch, seq, 768] -> [batch, seq, 768]

func NewFFN added in v0.4.0

func NewFFN[B tensor.Backend](embedDim, ffnDim int, backend B) *FFN[B]

NewFFN creates a new Feed-Forward Network.

Parameters:

  • embedDim: Input/output dimension (e.g., 768 for GPT-2)
  • ffnDim: Hidden dimension (typically 4 * embedDim)
  • backend: Computation backend

The network expands the input from embedDim to ffnDim, applies SiLU activation, then projects back to embedDim.

Example:

ffn := nn.NewFFN[B](768, 3072, backend)  // GPT-2 small

func (*FFN[B]) Forward added in v0.4.0

func (f *FFN[B]) Forward(x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

Forward computes the FFN output.

Shapes:

  • input: [batch, seq, embed_dim] (3D) or [batch, embed_dim] (2D)
  • output: same shape as input

Algorithm:

  1. Expand: x -> Linear1(x) [embed_dim → ffn_dim]
  2. Activate: x -> SiLU(x)
  3. Project: x -> Linear2(x) [ffn_dim → embed_dim]

Note: Linear layers expect 2D input [batch, features], so we reshape if needed.

func (*FFN[B]) Parameters added in v0.4.0

func (f *FFN[B]) Parameters() []*Parameter[B]

Parameters returns all trainable parameters (Linear1 and Linear2).

type GQAConfig added in v0.5.0

type GQAConfig struct {
	EmbedDim  int     // Model dimension (d_model)
	NQHeads   int     // Number of query heads
	NKVHeads  int     // Number of key-value heads (must divide NQHeads evenly)
	HeadDim   int     // Dimension per head
	Dropout   float32 // Dropout rate (not used in inference)
	UseRoPE   bool    // Whether to use Rotary Position Embeddings
	MaxSeqLen int     // Maximum sequence length (for RoPE)
	Theta     float64 // RoPE base frequency (default: 10000.0)
}

GQAConfig configures a GroupedQueryAttention layer.

func MQA added in v0.5.0

func MQA(embedDim, nQHeads, headDim int) GQAConfig

MQA creates a Multi-Query Attention config (GQA with n_kv_heads=1).

MQA is the extreme case of GQA where all query heads share a single KV head. This provides maximum memory savings but may reduce model capacity.

Example:

cfg := nn.MQA(4096, 32, 128)  // 32 Q heads, 1 KV head
mqa := nn.NewGQA(cfg, backend)

type GroupedQueryAttention added in v0.5.0

type GroupedQueryAttention[B tensor.Backend] struct {
	QProj   *Linear[B] // Query projection [embed_dim, n_q_heads * head_dim]
	KProj   *Linear[B] // Key projection [embed_dim, n_kv_heads * head_dim]
	VProj   *Linear[B] // Value projection [embed_dim, n_kv_heads * head_dim]
	OutProj *Linear[B] // Output projection [n_q_heads * head_dim, embed_dim]
	// contains filtered or unexported fields
}

GroupedQueryAttention implements Grouped Query Attention (GQA).

GQA is a variant of multi-head attention where the number of key-value heads is less than the number of query heads. This provides significant memory savings for KV-cache during inference while maintaining model quality.

Architecture comparison:

MHA: n_q_heads = n_kv_heads (e.g., 32 Q, 32 K, 32 V)
GQA: n_q_heads > n_kv_heads (e.g., 32 Q, 8 K, 8 V) -> 4x memory savings
MQA: n_kv_heads = 1 (e.g., 32 Q, 1 K, 1 V) -> 32x memory savings (extreme)

GQA is used in LLaMA 2/3, Mistral, DeepSeek, Qwen2, Phi-3, and other modern LLMs.

Example:

cfg := nn.GQAConfig{
    EmbedDim:  4096,
    NQHeads:   32,
    NKVHeads:  8,    // 4:1 ratio
    HeadDim:   128,
    MaxSeqLen: 2048,
    UseRoPE:   true,
}
gqa := nn.NewGQA(cfg, backend)
output := gqa.Forward(x, cache, startPos)

func NewGQA added in v0.5.0

func NewGQA[B tensor.Backend](cfg GQAConfig, backend B) *GroupedQueryAttention[B]

NewGQA creates a new GroupedQueryAttention module.

Validates that:

  • NQHeads is divisible by NKVHeads
  • EmbedDim equals NQHeads * HeadDim

If HeadDim is 0, it's computed as EmbedDim / NQHeads.

Example:

// LLaMA 2 7B style config
cfg := nn.GQAConfig{
    EmbedDim:  4096,
    NQHeads:   32,
    NKVHeads:  8,
    HeadDim:   128,
    UseRoPE:   true,
    MaxSeqLen: 4096,
}
gqa := nn.NewGQA(cfg, backend)

func (*GroupedQueryAttention[B]) Forward added in v0.5.0

func (g *GroupedQueryAttention[B]) Forward(
	x *tensor.Tensor[float32, B],
	cache *KVCache[B],
	startPos int,
) *tensor.Tensor[float32, B]

Forward computes grouped query attention with optional KV-cache.

Args:

  • x: Input tensor [batch, seq_len, embed_dim]
  • cache: Optional KV-cache for efficient autoregressive generation
  • startPos: Position offset for RoPE (used with KV-cache)

Returns:

  • Output tensor [batch, seq_len, embed_dim]

The method automatically applies:

  • RoPE to Q and K if configured
  • KV head broadcasting (repeatKV) to match Q heads
  • Causal masking for autoregressive attention

Example:

// Training: process full sequence
output := gqa.Forward(x, nil, 0)

// Inference with KV-cache
cache := nn.NewKVCache[B](1, 8, 512, 128, backend)
output := gqa.Forward(x, cache, 0)  // First token(s)
output = gqa.Forward(nextToken, cache, seqLen)  // Subsequent tokens

func (*GroupedQueryAttention[B]) ForwardWithMask added in v0.5.0

func (g *GroupedQueryAttention[B]) ForwardWithMask(
	x *tensor.Tensor[float32, B],
	mask *tensor.Tensor[float32, B],
	cache *KVCache[B],
	startPos int,
) *tensor.Tensor[float32, B]

ForwardWithMask computes attention with a custom mask.

Args:

  • x: Input tensor [batch, seq_len, embed_dim]
  • mask: Optional attention mask [batch, 1, seq_q, seq_k] or nil for auto causal mask
  • cache: Optional KV-cache
  • startPos: Position offset for RoPE

Returns:

  • Output tensor [batch, seq_len, embed_dim]

func (*GroupedQueryAttention[B]) Parameters added in v0.5.0

func (g *GroupedQueryAttention[B]) Parameters() []*Parameter[B]

Parameters returns all trainable parameters.

type KVCache added in v0.4.0

type KVCache[B tensor.Backend] struct {
	// contains filtered or unexported fields
}

KVCache stores key-value pairs for efficient autoregressive generation.

Without cache: O(n²) computation for generating n tokens (recompute K,V for all previous tokens).
With cache: O(n) computation (only compute K,V for the new token and append to the cache).

This can provide 10-100x speedup for inference depending on sequence length.

Example:

cache := nn.NewKVCache[B](2, 8, 512, 64, backend) // batch=2, heads=8, maxSeq=512, headDim=64
for pos := 0; pos < numTokens; pos++ {
    output := mha.ForwardWithCache(queryToken, cache, pos)
}

func NewKVCache added in v0.4.0

func NewKVCache[B tensor.Backend](
	_, _, maxSeqLen int, _ int,
	backend B,
) *KVCache[B]

NewKVCache creates a new KV cache.

Parameters:

  • batchSize: Batch size (reserved for future use)
  • numHeads: Number of attention heads (reserved for future use)
  • maxSeqLen: Maximum sequence length
  • headDim: Dimension per attention head (reserved for future use)
  • backend: Computation backend

The cache starts empty (length=0) and grows as key-value pairs are added. Note: batchSize, numHeads, and headDim are not currently used but kept for API consistency.

Example:

cache := nn.NewKVCache[B](2, 8, 512, 64, backend)

func (*KVCache[B]) Get added in v0.4.0

func (c *KVCache[B]) Get() (keys, values *tensor.Tensor[float32, B])

Get returns cached keys and values up to the current length.

Returns:

  • keys: [batch, num_heads, length, head_dim]
  • values: [batch, num_heads, length, head_dim]

If the cache is empty, panics.

Example:

keys, values := cache.Get()
// keys/values: [2, 8, 15, 64] if 15 tokens were added

func (*KVCache[B]) Len added in v0.4.0

func (c *KVCache[B]) Len() int

Len returns the current sequence length in cache.

Example:

if cache.Len() > 100 {
    // Generate summary or truncate
}

func (*KVCache[B]) Reset added in v0.4.0

func (c *KVCache[B]) Reset()

Reset clears the cache for new generation.

After reset, the cache is empty (length=0) and ready for new sequences.

Example:

cache.Reset() // Clear cache
// Start new generation sequence

func (*KVCache[B]) Update added in v0.4.0

func (c *KVCache[B]) Update(key, value *tensor.Tensor[float32, B])

Update adds new key-value pairs to the cache at the current position.

Parameters:

  • key: New key tensor [batch, num_heads, seq_len, head_dim]
  • value: New value tensor [batch, num_heads, seq_len, head_dim]

The new tensors are appended to the cache and the length is updated. Panics if the cache would exceed maxLen.

Example:

// Add single token (seq_len=1)
cache.Update(key, value) // key/value: [2, 8, 1, 64]
// Add multiple tokens (seq_len=10)
cache.Update(key, value) // key/value: [2, 8, 10, 64]

type LayerNorm added in v0.4.0

type LayerNorm[B tensor.Backend] struct {
	Gamma   *Parameter[B] // learnable scale [d_model]
	Beta    *Parameter[B] // learnable shift [d_model]
	Epsilon float32       // numerical stability constant
	// contains filtered or unexported fields
}

LayerNorm applies Layer Normalization over an input tensor along the last dimension.

Formula: Y = gamma * (X - mean(X)) / sqrt(var(X) + eps) + beta

Where:

  • X is the input tensor
  • Y is the output tensor
  • gamma is the learnable scale parameter [d_model]
  • beta is the learnable shift parameter [d_model]
  • mean and variance are computed along the last dimension
  • eps is a small value to avoid division by zero

LayerNorm normalizes activations by computing statistics across features, which helps stabilize training and is widely used in transformers (BERT, GPT, etc.).

Example:

backend := autodiff.New(cpu.New())
layernorm := nn.NewLayerNorm[AutodiffBackend](768, 1e-5, backend)
output := layernorm.Forward(hiddenStates)  // [..., 768] -> [..., 768]

func NewLayerNorm added in v0.4.0

func NewLayerNorm[B tensor.Backend](normalizedShape int, epsilon float32, backend B) *LayerNorm[B]

NewLayerNorm creates a new LayerNorm layer.

Parameters:

  • normalizedShape: size of the last dimension (feature dimension)
  • epsilon: small constant for numerical stability (typically 1e-5 or 1e-6)
  • backend: computation backend

The gamma parameter is initialized to ones, beta to zeros.

func (*LayerNorm[B]) Forward added in v0.4.0

func (l *LayerNorm[B]) Forward(x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

Forward applies LayerNorm to the input tensor.

Shapes:

  • input: [..., any, d_model]
  • output: [..., any, d_model]

Algorithm:

  1. Compute mean = mean(x) along last dimension (keepdim=true)
  2. Subtract mean: x_centered = x - mean
  3. Compute variance = mean((x - mean)^2) along last dimension
  4. Normalize: x_norm = x_centered / sqrt(variance + epsilon)
  5. Scale and shift: output = gamma * x_norm + beta

func (*LayerNorm[B]) Parameters added in v0.4.0

func (l *LayerNorm[B]) Parameters() []*Parameter[B]

Parameters returns the learnable parameters (gamma and beta).

type LearnedPositionalEmbedding added in v0.4.0

type LearnedPositionalEmbedding[B tensor.Backend] struct {
	Embedding *Embedding[B] // Embedding layer for position indices
	MaxLen    int           // Maximum sequence length
	Dim       int           // Embedding dimension
	// contains filtered or unexported fields
}

LearnedPositionalEmbedding implements learned positional embeddings.

Unlike fixed sinusoidal encodings, these embeddings are learned parameters that are updated during training. This approach is used in GPT-2 and other models.

Architecture:

  • Embedding matrix: [MaxLen, Dim] - learned parameters
  • Forward: returns embeddings for positions [0, seqLen)

Example:

pe := nn.NewLearnedPositionalEmbedding(512, 256, backend)
positions := pe.Forward(100)  // Get learned embeddings for first 100 positions
// Shape: [1, 100, 256]

The embeddings are initialized from a normal distribution N(0, 1).

func NewLearnedPositionalEmbedding added in v0.4.0

func NewLearnedPositionalEmbedding[B tensor.Backend](maxLen, dim int, backend B) *LearnedPositionalEmbedding[B]

NewLearnedPositionalEmbedding creates a new LearnedPositionalEmbedding layer.

The embeddings are initialized from a standard normal distribution N(0, 1).

Parameters:

  • maxLen: Maximum sequence length (number of position embeddings)
  • dim: Embedding dimension (typically same as model dimension)
  • backend: Computation backend

Returns a new LearnedPositionalEmbedding with randomly initialized embeddings.

func (*LearnedPositionalEmbedding[B]) Forward added in v0.4.0

func (l *LearnedPositionalEmbedding[B]) Forward(seqLen int) *tensor.Tensor[float32, B]

Forward returns learned position embeddings for the specified sequence length.

Parameters:

  • seqLen: Length of the sequence (must be <= MaxLen)

Returns:

  • Position embeddings with shape [1, seqLen, dim]. The batch dimension is 1 for broadcasting to any batch size.

Example:

pe := nn.NewLearnedPositionalEmbedding(512, 256, backend)
encodings := pe.Forward(100)  // [1, 100, 256]

// Add to token embeddings
embeddings := tokenEmbed.Forward(tokens)  // [batch, 100, 256]
embeddings = embeddings.Add(encodings)    // Broadcast over batch

Panics if seqLen > MaxLen.

func (*LearnedPositionalEmbedding[B]) Parameters added in v0.4.0

func (l *LearnedPositionalEmbedding[B]) Parameters() []*Parameter[B]

Parameters returns the trainable parameters (learned embeddings).

type Linear

type Linear[B tensor.Backend] struct {
	// contains filtered or unexported fields
}

Linear implements a fully connected (dense) layer.

Performs the transformation: y = x @ W.T + b where:

  • x is the input tensor with shape [batch_size, in_features]
  • W is the weight matrix with shape [out_features, in_features]
  • b is the bias vector with shape [out_features]
  • y is the output tensor with shape [batch_size, out_features]

Weights are initialized using Xavier/Glorot initialization. Biases are initialized to zeros.

Example:

backend := cpu.New()
layer := nn.NewLinear(784, 128, backend)

input := tensor.Randn[float32](tensor.Shape{32, 784}, backend)  // batch_size=32
output := layer.Forward(input)  // shape: [32, 128]

func NewLinear

func NewLinear[B tensor.Backend](inFeatures, outFeatures int, backend B) *Linear[B]

NewLinear creates a new Linear layer.

Weights are initialized using Xavier/Glorot uniform distribution. Biases are initialized to zeros.

Parameters:

  • inFeatures: Number of input features
  • outFeatures: Number of output features
  • backend: Backend to use for tensor operations

Returns a new Linear layer.
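
The Xavier/Glorot uniform initialization mentioned above draws weights from U(-limit, limit) with limit = sqrt(6 / (fanIn + fanOut)). A minimal sketch of that bound, illustrative only and independent of the package's initializer:

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// xavierUniform samples fanIn*fanOut weights from U(-limit, limit),
// where limit = sqrt(6 / (fanIn + fanOut)) is the Glorot uniform bound.
func xavierUniform(fanIn, fanOut int) []float32 {
	limit := math.Sqrt(6.0 / float64(fanIn+fanOut))
	w := make([]float32, fanIn*fanOut)
	for i := range w {
		w[i] = float32((rand.Float64()*2 - 1) * limit)
	}
	return w
}

func main() {
	w := xavierUniform(784, 128)
	fmt.Println(len(w)) // 100352
}
```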

func (*Linear[B]) Bias

func (l *Linear[B]) Bias() *Parameter[B]

Bias returns the bias parameter.

func (*Linear[B]) Forward

func (l *Linear[B]) Forward(input *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

Forward computes the output of the linear layer.

Performs: y = x @ W.T + b

Input shape: [batch_size, in_features]. Output shape: [batch_size, out_features].

Parameters:

  • input: Input tensor with shape [batch_size, in_features]

Returns output tensor with shape [batch_size, out_features].

func (*Linear[B]) InFeatures

func (l *Linear[B]) InFeatures() int

InFeatures returns the number of input features.

func (*Linear[B]) OutFeatures

func (l *Linear[B]) OutFeatures() int

OutFeatures returns the number of output features.

func (*Linear[B]) Parameters

func (l *Linear[B]) Parameters() []*Parameter[B]

Parameters returns the trainable parameters of this layer.

Returns [weight, bias] if bias is present, otherwise [weight].

func (*Linear[B]) Weight

func (l *Linear[B]) Weight() *Parameter[B]

Weight returns the weight parameter.

type MSELoss

type MSELoss[B tensor.Backend] struct {
	// contains filtered or unexported fields
}

MSELoss computes Mean Squared Error loss.

Loss = mean((predictions - targets)²)

MSE is commonly used for regression tasks where the goal is to predict continuous values.
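
The formula can be written out directly for flat slices; an illustrative sketch, not the module itself:

```go
package main

import "fmt"

// mse returns mean((pred - target)^2); both slices must have equal length.
func mse(pred, target []float32) float32 {
	var sum float32
	for i := range pred {
		d := pred[i] - target[i]
		sum += d * d
	}
	return sum / float32(len(pred))
}

func main() {
	fmt.Println(mse([]float32{1, 2, 3}, []float32{1, 2, 5})) // (0+0+4)/3
}
```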

Example:

mse := nn.NewMSELoss[Backend](backend)
predictions := model.Forward(input)
loss := mse.Forward(predictions, targets)

func NewMSELoss

func NewMSELoss[B tensor.Backend](backend B) *MSELoss[B]

NewMSELoss creates a new MSE loss function.

func (*MSELoss[B]) Forward

func (m *MSELoss[B]) Forward(predictions, targets *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

Forward computes the MSE loss.

Loss = mean((predictions - targets)²)

Parameters:

  • predictions: Model predictions with shape [batch_size, ...]
  • targets: Ground truth targets with same shape as predictions

Returns a scalar loss value (shape [1] or []).

func (*MSELoss[B]) Parameters

func (m *MSELoss[B]) Parameters() []*Parameter[B]

Parameters returns an empty slice (loss functions have no trainable parameters).

type MaxPool2D

type MaxPool2D[B tensor.Backend] struct {
	// contains filtered or unexported fields
}

MaxPool2D is a 2D max pooling layer.

Max pooling reduces spatial dimensions by taking the maximum value in each non-overlapping window. Unlike Conv2D, MaxPool2D has no learnable parameters.

Input shape: [batch, channels, height, width]. Output shape: [batch, channels, out_height, out_width].

Where:

out_height = (height - kernelSize) / stride + 1
out_width = (width - kernelSize) / stride + 1

Common configurations:

  • 2x2 pool, stride=2: Reduces spatial dimensions by half (most common)
  • 3x3 pool, stride=2: Aggressive downsampling
  • 2x2 pool, stride=1: Overlapping pooling (less common)
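
The output-size formulas above reduce to integer arithmetic; a quick sketch:

```go
package main

import "fmt"

// poolOutSize computes (in - kernel)/stride + 1 using integer division,
// matching the out_height/out_width formulas above.
func poolOutSize(in, kernel, stride int) int {
	return (in-kernel)/stride + 1
}

func main() {
	fmt.Println(poolOutSize(28, 2, 2)) // 14: a 2x2/stride-2 pool halves 28
	fmt.Println(poolOutSize(28, 3, 2)) // 13: overlapping 3x3/stride-2
}
```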

Example:

// Create 2x2 max pooling with stride 2
pool := nn.NewMaxPool2D(2, 2, backend)

// Forward pass
input := tensor.Randn[float32](tensor.Shape{32, 64, 28, 28}, backend)
output := pool.Forward(input) // [32, 64, 14, 14]

func NewMaxPool2D

func NewMaxPool2D[B tensor.Backend](kernelSize, stride int, backend B) *MaxPool2D[B]

NewMaxPool2D creates a new 2D max pooling layer.

Parameters:

  • kernelSize: Size of pooling window (square)
  • stride: Stride for pooling (typically same as kernelSize for non-overlapping)
  • backend: Backend for computation

Common patterns:

  • NewMaxPool2D(2, 2, backend): Standard 2x2 non-overlapping pooling
  • NewMaxPool2D(3, 2, backend): Overlapping 3x3 pooling with stride 2

func (*MaxPool2D[B]) ComputeOutputSize

func (m *MaxPool2D[B]) ComputeOutputSize(inputH, inputW int) [2]int

ComputeOutputSize computes output spatial dimensions for given input size.

Returns: [out_height, out_width].

func (*MaxPool2D[B]) Forward

func (m *MaxPool2D[B]) Forward(input *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

Forward performs the forward pass.

Input: [batch, channels, height, width]. Output: [batch, channels, out_height, out_width].

func (*MaxPool2D[B]) KernelSize

func (m *MaxPool2D[B]) KernelSize() int

KernelSize returns the pooling kernel size.

func (*MaxPool2D[B]) Parameters

func (m *MaxPool2D[B]) Parameters() []*Parameter[B]

Parameters returns all trainable parameters (empty for MaxPool2D).

MaxPool2D has no learnable parameters, so this always returns an empty slice.

func (*MaxPool2D[B]) Stride

func (m *MaxPool2D[B]) Stride() int

Stride returns the stride.

func (*MaxPool2D[B]) String

func (m *MaxPool2D[B]) String() string

String returns a string representation of the layer.

type Module

type Module[B tensor.Backend] interface {
	// Forward computes the output of the module given an input tensor.
	//
	// The input tensor should have the appropriate shape for this module.
	// For example, Linear expects [batch_size, in_features].
	//
	// Returns the output tensor with shape determined by the module type.
	Forward(input *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

	// Parameters returns all trainable parameters of this module.
	//
	// This includes weights, biases, and any nested module parameters.
	// Returns an empty slice for modules without trainable parameters
	// (e.g., activation functions).
	Parameters() []*Parameter[B]
}

Module is the base interface for all neural network components.

Every NN module must implement:

  • Forward: Compute output from input
  • Parameters: Return all trainable parameters

Modules can be composed to build complex architectures:

model := nn.NewSequential[Backend](
    nn.NewLinear(784, 128, backend),
    nn.NewReLU(),
    nn.NewLinear(128, 10, backend),
)

Type parameter B must satisfy the tensor.Backend interface.

type MultiHeadAttention added in v0.4.0

type MultiHeadAttention[B tensor.Backend] struct {
	WQ       *Linear[B] // Query projection [embed_dim, embed_dim]
	WK       *Linear[B] // Key projection [embed_dim, embed_dim]
	WV       *Linear[B] // Value projection [embed_dim, embed_dim]
	WO       *Linear[B] // Output projection [embed_dim, embed_dim]
	NumHeads int
	HeadDim  int
	EmbedDim int
	// contains filtered or unexported fields
}

MultiHeadAttention implements the multi-head attention mechanism.

Architecture:

MHA(Q, K, V) = Concat(head_1, ..., head_h) * W_O
head_i = SDPA(Q*W_Q_i, K*W_K_i, V*W_V_i)

This is the core attention layer used in all transformer architectures including BERT, GPT, LLaMA, and others.
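
The SDPA step in the formula above — softmax(QK^T / sqrt(d_k)) V — can be sketched for a single head on plain slices. This is an illustration of the math only, not the package's tensor implementation:

```go
package main

import (
	"fmt"
	"math"
)

// softmax normalizes a row in place, subtracting the max for stability.
func softmax(row []float64) {
	m := row[0]
	for _, v := range row {
		if v > m {
			m = v
		}
	}
	var sum float64
	for i, v := range row {
		row[i] = math.Exp(v - m)
		sum += row[i]
	}
	for i := range row {
		row[i] /= sum
	}
}

// sdpa computes softmax(Q K^T / sqrt(d_k)) V for one head.
func sdpa(q, k, v [][]float64) [][]float64 {
	dk := float64(len(k[0]))
	out := make([][]float64, len(q))
	for i := range q {
		scores := make([]float64, len(k))
		for j := range k {
			for d := range q[i] {
				scores[j] += q[i][d] * k[j][d]
			}
			scores[j] /= math.Sqrt(dk)
		}
		softmax(scores)
		out[i] = make([]float64, len(v[0]))
		for j := range v {
			for d := range v[j] {
				out[i][d] += scores[j] * v[j][d]
			}
		}
	}
	return out
}

func main() {
	q := [][]float64{{1, 0}}
	k := [][]float64{{1, 0}, {0, 1}}
	v := [][]float64{{1, 0}, {0, 1}}
	fmt.Println(sdpa(q, k, v)) // query attends more to the first key
}
```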

Example:

backend := autodiff.New(cpu.New())
mha := nn.NewMultiHeadAttention[B](768, 12, backend)  // 768 dim, 12 heads
output := mha.Forward(x, x, x, nil)  // Self-attention
output = mha.Forward(q, kv, kv, mask)  // Cross-attention

func NewMultiHeadAttention added in v0.4.0

func NewMultiHeadAttention[B tensor.Backend](
	embedDim, numHeads int,
	backend B,
) *MultiHeadAttention[B]

NewMultiHeadAttention creates a new multi-head attention module.

Parameters:

  • embedDim: Total embedding dimension (must be divisible by numHeads)
  • numHeads: Number of attention heads
  • backend: Computation backend

The head dimension is computed as embedDim / numHeads.

Example:

mha := nn.NewMultiHeadAttention[B](768, 12, backend)
// embedDim=768, numHeads=12 -> headDim=64

func (*MultiHeadAttention[B]) Forward added in v0.4.0

func (m *MultiHeadAttention[B]) Forward(
	query, key, value *tensor.Tensor[float32, B],
	mask *tensor.Tensor[float32, B],
) *tensor.Tensor[float32, B]

Forward computes multi-head attention.

Args:

  • query: Query tensor [batch, seq_q, embed_dim]
  • key: Key tensor [batch, seq_k, embed_dim]
  • value: Value tensor [batch, seq_k, embed_dim]
  • mask: Optional attention mask [batch, 1, seq_q, seq_k] or nil

Returns:

  • output: [batch, seq_q, embed_dim]

For self-attention, pass the same tensor for query, key, and value. For cross-attention, query differs from key/value.

func (*MultiHeadAttention[B]) ForwardWithCache added in v0.4.0

func (m *MultiHeadAttention[B]) ForwardWithCache(
	query *tensor.Tensor[float32, B],
	cache *KVCache[B],
) *tensor.Tensor[float32, B]

ForwardWithCache computes attention using KV cache for efficient autoregressive generation.

This method is optimized for inference where tokens are generated one at a time. Instead of recomputing K,V for all previous tokens, we cache them and compute them only for the new token.

Args:

  • query: Query tensor [batch, 1, embed_dim] (typically single token)
  • cache: KV cache storing previous key-value pairs

Returns:

  • output: [batch, 1, embed_dim]

The cache is automatically updated with new K,V pairs.

Example:

cache := nn.NewKVCache[B](1, 12, 512, 64, backend)
for i := 0; i < 100; i++ {
    token := getNextToken(i) // [1, 1, 768]
    output := mha.ForwardWithCache(token, cache)
}

func (*MultiHeadAttention[B]) ForwardWithWeights added in v0.4.0

func (m *MultiHeadAttention[B]) ForwardWithWeights(
	query, key, value *tensor.Tensor[float32, B],
	mask *tensor.Tensor[float32, B],
) (*tensor.Tensor[float32, B], *tensor.Tensor[float32, B])

ForwardWithWeights computes multi-head attention and returns attention weights.

Same as Forward but also returns attention weights for visualization/analysis.

Returns:

  • output: [batch, seq_q, embed_dim]
  • weights: [batch, num_heads, seq_q, seq_k]

func (*MultiHeadAttention[B]) Parameters added in v0.4.0

func (m *MultiHeadAttention[B]) Parameters() []*Parameter[B]

Parameters returns all trainable parameters (WQ, WK, WV, WO weights and biases).

type Normalizer added in v0.4.0

type Normalizer[B tensor.Backend] interface {
	Forward(x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
	Parameters() []*Parameter[B]
}

Normalizer is an interface for normalization layers (LayerNorm and RMSNorm).

This allows TransformerBlock to work with both LayerNorm and RMSNorm without caring about the implementation details.

type Parameter

type Parameter[B tensor.Backend] struct {
	// contains filtered or unexported fields
}

Parameter represents a trainable parameter in a neural network.

Parameters are tensors that require gradient computation during training. They typically represent weights and biases of layers.

Example:

// Create a weight parameter
weight := nn.NewParameter("weight", weightTensor)

// Access the tensor
w := weight.Tensor()

// Get gradient after backward pass
grad := weight.Grad()

func NewParameter

func NewParameter[B tensor.Backend](name string, t *tensor.Tensor[float32, B]) *Parameter[B]

NewParameter creates a new trainable parameter.

The parameter tensor should be initialized before creating the Parameter. Gradient will be allocated during the first backward pass.

Parameters:

  • name: Descriptive name for this parameter (e.g., "linear1.weight")
  • tensor: The initialized parameter tensor

Returns a new Parameter.

func (*Parameter[B]) Grad

func (p *Parameter[B]) Grad() *tensor.Tensor[float32, B]

Grad returns the gradient tensor.

Returns nil if no gradient has been computed yet (before backward pass).

func (*Parameter[B]) Name

func (p *Parameter[B]) Name() string

Name returns the parameter name.

func (*Parameter[B]) SetGrad

func (p *Parameter[B]) SetGrad(grad *tensor.Tensor[float32, B])

SetGrad sets the gradient tensor.

This is typically called by the optimizer or during backward pass.

func (*Parameter[B]) Tensor

func (p *Parameter[B]) Tensor() *tensor.Tensor[float32, B]

Tensor returns the parameter tensor.

func (*Parameter[B]) ZeroGrad

func (p *Parameter[B]) ZeroGrad()

ZeroGrad clears the gradient tensor.

This should be called before each training iteration to avoid accumulating gradients from previous iterations.

type RMSNorm added in v0.3.0

type RMSNorm[B tensor.Backend] struct {
	Gamma   *Parameter[B] // learnable scale [d_model]
	Epsilon float32       // numerical stability constant
	// contains filtered or unexported fields
}

RMSNorm applies Root Mean Square Normalization over an input tensor along the last dimension.

Formula: Y = X / sqrt(mean(X^2) + eps) * gamma

Where:

  • X is the input tensor
  • Y is the output tensor
  • gamma is the learnable scale parameter [d_model]
  • mean is computed along the last dimension
  • eps is a small value to avoid division by zero

RMSNorm is simpler and faster than LayerNorm (no mean subtraction), and is widely used in modern LLM architectures (LLaMA, Mistral, Gemma).

Example:

backend := autodiff.New(cpu.New())
rmsnorm := nn.NewRMSNorm[AutodiffBackend](768, 1e-5, backend)
output := rmsnorm.Forward(hiddenStates)  // [..., 768] -> [..., 768]

func NewRMSNorm added in v0.3.0

func NewRMSNorm[B tensor.Backend](dModel int, epsilon float32, backend B) *RMSNorm[B]

NewRMSNorm creates a new RMSNorm layer.

Parameters:

  • dModel: size of the last dimension (feature dimension)
  • epsilon: small constant for numerical stability (typically 1e-5 or 1e-6)
  • backend: computation backend

The gamma parameter is initialized to ones.

func (*RMSNorm[B]) Forward added in v0.3.0

func (r *RMSNorm[B]) Forward(x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

Forward applies RMSNorm to the input tensor.

Shapes:

  • input: [..., any, d_model]
  • output: [..., any, d_model]

Algorithm:

  1. Compute variance = mean(x^2) along last dimension (keepdim=true)
  2. Compute rms = sqrt(variance + epsilon)
  3. Normalize: x_norm = x / rms
  4. Scale: output = x_norm * gamma
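
For comparison with LayerNorm, the four steps above can be written out for one feature vector. An illustrative sketch, not the package's implementation:

```go
package main

import (
	"fmt"
	"math"
)

// rmsNorm divides x by sqrt(mean(x^2) + eps) and scales by gamma;
// unlike LayerNorm there is no mean subtraction and no shift.
func rmsNorm(x, gamma []float32, eps float32) []float32 {
	var ms float32
	for _, v := range x {
		ms += v * v
	}
	ms /= float32(len(x))
	inv := 1 / float32(math.Sqrt(float64(ms+eps)))
	out := make([]float32, len(x))
	for i, v := range x {
		out[i] = v * inv * gamma[i]
	}
	return out
}

func main() {
	fmt.Println(rmsNorm([]float32{3, 4}, []float32{1, 1}, 1e-5)) // rms = sqrt(12.5)
}
```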

func (*RMSNorm[B]) Parameters added in v0.3.0

func (r *RMSNorm[B]) Parameters() []*Parameter[B]

Parameters returns the learnable parameters (gamma).

type ReLU

type ReLU[B tensor.Backend] struct{}

ReLU is a Rectified Linear Unit activation module.

Applies the element-wise function: f(x) = max(0, x)

ReLU is the most commonly used activation function in deep learning. It helps with the vanishing gradient problem and is computationally efficient.

Example:

relu := nn.NewReLU[Backend]()
output := relu.Forward(input)  // All negative values become 0

func NewReLU

func NewReLU[B tensor.Backend]() *ReLU[B]

NewReLU creates a new ReLU activation module.

func (*ReLU[B]) Forward

func (r *ReLU[B]) Forward(input *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

Forward applies ReLU activation: f(x) = max(0, x).

func (*ReLU[B]) Parameters

func (r *ReLU[B]) Parameters() []*Parameter[B]

Parameters returns an empty slice (ReLU has no trainable parameters).

type ReLUBackend

type ReLUBackend interface {
	ReLU(*tensor.RawTensor) *tensor.RawTensor
}

ReLUBackend is an interface for backends that support ReLU activation.

type RotaryEncoding added in v0.4.0

type RotaryEncoding[B tensor.Backend] struct {
	FreqCos   *tensor.Tensor[float32, B] // [max_seq_len, d_model/2] - cosine values
	FreqSin   *tensor.Tensor[float32, B] // [max_seq_len, d_model/2] - sine values
	MaxSeqLen int                        // Maximum sequence length
	DModel    int                        // Model dimension (must be even)
	// contains filtered or unexported fields
}

RotaryEncoding implements Rotary Position Embedding (RoPE).

RoPE is a modern positional encoding used in LLaMA, Mistral, DeepSeek, and other state-of-the-art LLMs. It applies a rotation to query and key embeddings based on their position, allowing the model to capture relative position information.

Mathematical formulation:

For position m and dimension pair (2i, 2i+1):
  θ_i = base^(-2i/d)  (typically base=10000)

  [q'_{2i}  ]   [cos(m·θ_i)  -sin(m·θ_i)] [q_{2i}  ]
  [q'_{2i+1}] = [sin(m·θ_i)   cos(m·θ_i)] [q_{2i+1}]
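
Numerically, the formulation above applies a 2x2 rotation matrix to each dimension pair; a plain-Go sketch (illustrative, not the package's vectorized implementation):

```go
package main

import (
	"fmt"
	"math"
)

// ropeRotate rotates each dimension pair (2i, 2i+1) of x by angle m*theta_i,
// where theta_i = base^(-2i/d) and m is the token position. len(x) must be even.
func ropeRotate(x []float64, m int, base float64) []float64 {
	d := len(x)
	out := make([]float64, d)
	for i := 0; i < d/2; i++ {
		theta := math.Pow(base, -2*float64(i)/float64(d))
		c, s := math.Cos(float64(m)*theta), math.Sin(float64(m)*theta)
		out[2*i] = c*x[2*i] - s*x[2*i+1]
		out[2*i+1] = s*x[2*i] + c*x[2*i+1]
	}
	return out
}

func main() {
	x := []float64{1, 0, 1, 0}
	fmt.Println(ropeRotate(x, 0, 10000)) // [1 0 1 0]: position 0 is the identity
}
```

Because each pair is rotated, the vector norm is preserved at every position; only the angle encodes position.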

Architecture:

  • Pre-computes cos and sin values for all positions and dimensions
  • Applies rotation by splitting input into even/odd pairs
  • Supports both training (full sequence) and inference (with offset for KV-cache)

Example:

config := nn.RotaryEncodingConfig{
    DModel:    64,     // Head dimension (typically 64-128)
    MaxSeqLen: 2048,   // Maximum sequence length
    Theta:     10000.0,
}
rope := nn.NewRotaryEncoding(config, backend)

// During training: apply to full sequence
q := tensor.Randn[float32](tensor.Shape{batch, heads, seq, 64}, backend)
q_rotated := rope.Forward(q)

// During inference with KV-cache: apply with position offset
q_new := tensor.Randn[float32](tensor.Shape{batch, heads, 1, 64}, backend)
q_rotated = rope.ForwardWithOffset(q_new, currentPosition)

func NewRotaryEncoding added in v0.4.0

func NewRotaryEncoding[B tensor.Backend](cfg RotaryEncodingConfig, backend B) *RotaryEncoding[B]

NewRotaryEncoding creates a new RotaryEncoding layer.

Pre-computes cosine and sine values for all positions and dimension pairs.

Parameters:

  • cfg: Configuration for RoPE (dimension, max sequence length, theta base)
  • backend: Computation backend

Returns a new RotaryEncoding layer with pre-computed rotation matrices.

Panics if DModel is not even (RoPE requires pairing dimensions).

func (*RotaryEncoding[B]) Forward added in v0.4.0

func (r *RotaryEncoding[B]) Forward(x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

Forward applies rotary position embeddings to the input tensor.

Supports both 3D and 4D input tensors:

  • 3D: [batch, seq_len, d_model] - applies RoPE to entire sequence
  • 4D: [batch, n_heads, seq_len, d_k] - applies RoPE per head (typical for attention)

The rotation is applied to dimension pairs (2i, 2i+1) using pre-computed cos/sin values.

Parameters:

  • x: Input tensor [batch, seq_len, d_model] or [batch, n_heads, seq_len, d_k]

Returns tensor with same shape as input, with rotary embeddings applied.

Panics if sequence length exceeds MaxSeqLen or if last dimension doesn't match DModel.

func (*RotaryEncoding[B]) ForwardWithOffset added in v0.4.0

func (r *RotaryEncoding[B]) ForwardWithOffset(x *tensor.Tensor[float32, B], offset int) *tensor.Tensor[float32, B]

ForwardWithOffset applies rotary embeddings with a position offset.

This is useful for incremental decoding with KV-cache, where new tokens are generated one at a time but need position embeddings that account for previous tokens.

Parameters:

  • x: Input tensor [batch, seq_len, d_model] or [batch, n_heads, seq_len, d_k]
  • offset: Position offset (e.g., current position in KV-cache)

Returns tensor with rotary embeddings applied at positions [offset, offset+seq_len).

Example (KV-cache inference):

// Initial prompt: positions [0, prompt_len)
q_prompt := rope.Forward(q_prompt_tokens)

// Generate token 1: position [prompt_len]
q_new := rope.ForwardWithOffset(q_new_token, prompt_len)

// Generate token 2: position [prompt_len + 1]
q_new = rope.ForwardWithOffset(q_new_token, prompt_len + 1)

Panics if offset + seq_len exceeds MaxSeqLen.

type RotaryEncodingConfig added in v0.4.0

type RotaryEncodingConfig struct {
	DModel    int     // Dimension per head (typically 64-128, must be even)
	MaxSeqLen int     // Maximum sequence length (e.g., 2048, 4096)
	Theta     float64 // Base frequency for rotation (default: 10000.0)
}

RotaryEncodingConfig configures a RotaryEncoding layer.

type Sequential

type Sequential[B tensor.Backend] struct {
	// contains filtered or unexported fields
}

Sequential is a container module that chains multiple modules together.

Each module's output becomes the next module's input, creating a sequential pipeline of transformations.

Example:

model := nn.NewSequential(
    nn.NewLinear(784, 128, backend),
    nn.NewReLU(),
    nn.NewLinear(128, 10, backend),
)

output := model.Forward(input)

This is equivalent to:

h1 := linear1.Forward(input)
h2 := relu.Forward(h1)
output := linear2.Forward(h2)

func NewSequential

func NewSequential[B tensor.Backend](modules ...Module[B]) *Sequential[B]

NewSequential creates a new Sequential container.

Parameters:

  • modules: List of modules to chain together

Returns a new Sequential container.

func (*Sequential[B]) Add

func (s *Sequential[B]) Add(module Module[B])

Add appends a module to the sequence.

This allows building models incrementally:

model := nn.NewSequential[Backend]()
model.Add(nn.NewLinear(784, 128, backend))
model.Add(nn.NewReLU())
model.Add(nn.NewLinear(128, 10, backend))

func (*Sequential[B]) Forward

func (s *Sequential[B]) Forward(input *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

Forward applies all modules in sequence.

The output of each module becomes the input to the next module.

Parameters:

  • input: Input tensor to the first module

Returns the output of the last module.

func (*Sequential[B]) Len

func (s *Sequential[B]) Len() int

Len returns the number of modules in the sequence.

func (*Sequential[B]) Module

func (s *Sequential[B]) Module(index int) Module[B]

Module returns the module at the given index.

Panics if index is out of bounds.

func (*Sequential[B]) Parameters

func (s *Sequential[B]) Parameters() []*Parameter[B]

Parameters returns all trainable parameters from all modules.

Parameters are collected from all modules in the sequence.

type SiLU added in v0.3.0

type SiLU[B tensor.Backend] struct{}

SiLU is a SiLU (Swish) activation module.

Applies the element-wise function: f(x) = x * sigmoid(x)

SiLU (Sigmoid Linear Unit), also known as Swish, is widely used in modern transformer architectures like LLaMA, Mistral, and GPT-Neo. It is a smooth, non-monotonic activation that helps with gradient flow.

Example:

silu := nn.NewSiLU[Backend]()
output := silu.Forward(input)  // Smooth activation

func NewSiLU added in v0.3.0

func NewSiLU[B tensor.Backend]() *SiLU[B]

NewSiLU creates a new SiLU activation module.

func (*SiLU[B]) Forward added in v0.3.0

func (s *SiLU[B]) Forward(input *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

Forward applies SiLU activation: f(x) = x * sigmoid(x).

func (*SiLU[B]) Parameters added in v0.3.0

func (s *SiLU[B]) Parameters() []*Parameter[B]

Parameters returns an empty slice (SiLU has no trainable parameters).

type SiLUBackend added in v0.3.0

type SiLUBackend interface {
	SiLU(*tensor.RawTensor) *tensor.RawTensor
}

SiLUBackend is an interface for backends that support SiLU activation.

type Sigmoid

type Sigmoid[B tensor.Backend] struct{}

Sigmoid is a sigmoid activation module.

Applies the element-wise function: σ(x) = 1 / (1 + exp(-x))

Sigmoid squashes values to the range (0, 1), making it useful for binary classification and gate mechanisms in LSTMs/GRUs.

Example:

sigmoid := nn.NewSigmoid[Backend]()
output := sigmoid.Forward(input)  // Values in range (0, 1)

func NewSigmoid

func NewSigmoid[B tensor.Backend]() *Sigmoid[B]

NewSigmoid creates a new Sigmoid activation module.

func (*Sigmoid[B]) Forward

func (s *Sigmoid[B]) Forward(input *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

Forward applies Sigmoid activation: σ(x) = 1 / (1 + exp(-x)).

func (*Sigmoid[B]) Parameters

func (s *Sigmoid[B]) Parameters() []*Parameter[B]

Parameters returns an empty slice (Sigmoid has no trainable parameters).

type SigmoidBackend

type SigmoidBackend interface {
	Sigmoid(*tensor.RawTensor) *tensor.RawTensor
}

SigmoidBackend is an interface for backends that support Sigmoid activation.

type SinusoidalPositionalEncoding added in v0.4.0

type SinusoidalPositionalEncoding[B tensor.Backend] struct {
	Encoding *tensor.Tensor[float32, B] // [max_len, dim] - pre-computed encodings
	MaxLen   int                        // Maximum sequence length
	Dim      int                        // Embedding dimension
	// contains filtered or unexported fields
}

SinusoidalPositionalEncoding implements fixed sinusoidal positional encodings.

This is the original positional encoding from "Attention is All You Need" (Vaswani et al., 2017). It uses sine and cosine functions at different frequencies to encode position information.

Mathematical formulation:

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Where:

  • pos is the position (0 to max_len-1)
  • i is the dimension (0 to d/2-1)
  • d is the model dimension

These encodings are fixed (not learned) and allow the model to easily learn to attend by relative positions, since for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos).
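
The two formulas can be tabulated directly; a small sketch that fills one row of the encoding (illustrative only, independent of the pre-computed tensor):

```go
package main

import (
	"fmt"
	"math"
)

// sinusoidalPE builds the encoding row for one position:
// pe[2i] = sin(pos / 10000^(2i/d)), pe[2i+1] = cos(pos / 10000^(2i/d)).
func sinusoidalPE(pos, d int) []float64 {
	pe := make([]float64, d)
	for i := 0; i < d/2; i++ {
		freq := math.Pow(10000, -2*float64(i)/float64(d))
		pe[2*i] = math.Sin(float64(pos) * freq)
		pe[2*i+1] = math.Cos(float64(pos) * freq)
	}
	return pe
}

func main() {
	fmt.Println(sinusoidalPE(0, 4)) // [0 1 0 1]: sin(0)=0, cos(0)=1
}
```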

Example:

pe := nn.NewSinusoidalPositionalEncoding(512, 256, backend)
positions := pe.Forward(10)  // Get encodings for first 10 positions
// Shape: [1, 10, 256]

func NewSinusoidalPositionalEncoding added in v0.4.0

func NewSinusoidalPositionalEncoding[B tensor.Backend](maxLen, dim int, backend B) *SinusoidalPositionalEncoding[B]

NewSinusoidalPositionalEncoding creates a new SinusoidalPositionalEncoding layer.

Pre-computes all positional encodings up to maxLen.

Parameters:

  • maxLen: Maximum sequence length to pre-compute
  • dim: Embedding dimension (typically same as model dimension)
  • backend: Computation backend

Returns a new SinusoidalPositionalEncoding with pre-computed encodings.

func (*SinusoidalPositionalEncoding[B]) Forward added in v0.4.0

func (s *SinusoidalPositionalEncoding[B]) Forward(seqLen int) *tensor.Tensor[float32, B]

Forward returns positional encodings for the specified sequence length.

Parameters:

  • seqLen: Length of the sequence (must be <= MaxLen)

Returns:

  • Positional encodings with shape [1, seqLen, dim]. The batch dimension is 1 for broadcasting to any batch size.

Example:

pe := nn.NewSinusoidalPositionalEncoding(512, 256, backend)
encodings := pe.Forward(100)  // [1, 100, 256]

// Add to token embeddings
embeddings := tokenEmbed.Forward(tokens)  // [batch, 100, 256]
embeddings = embeddings.Add(encodings)    // Broadcast over batch

Panics if seqLen > MaxLen.

type SwiGLUFFN added in v0.5.0

type SwiGLUFFN[B tensor.Backend] struct {
	// contains filtered or unexported fields
}

SwiGLUFFN implements a feed-forward network with SwiGLU activation.

Architecture (LLaMA-style):

hidden = SwiGLU(x @ W_up, x @ W_gate)
output = hidden @ W_down

Where SwiGLU(up, gate) = up * SiLU(gate).

This is more parameter-efficient than standard FFN with GELU:

  • Standard FFN: 2 * d_model * ffn_dim parameters.
  • SwiGLU FFN: 3 * d_model * ffn_dim parameters (but ffn_dim is smaller).

LLaMA uses ffn_dim ≈ 2.7 * d_model (vs 4 * d_model in a standard FFN), which yields a similar total parameter count with better performance.
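
The gating itself is elementwise; a minimal sketch of SwiGLU(up, gate) = up * SiLU(gate), illustrative and independent of the layer's projections:

```go
package main

import (
	"fmt"
	"math"
)

// silu is x * sigmoid(x).
func silu(x float64) float64 { return x / (1 + math.Exp(-x)) }

// swiGLU gates the up projection elementwise by SiLU of the gate projection.
func swiGLU(up, gate []float64) []float64 {
	out := make([]float64, len(up))
	for i := range up {
		out[i] = up[i] * silu(gate[i])
	}
	return out
}

func main() {
	fmt.Println(swiGLU([]float64{2, 2}, []float64{0, 10})) // gate 0 blocks; large gate passes ~up*gate
}
```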

Example:

cfg := nn.SwiGLUFFNConfig{
    EmbedDim: 4096,
    FFNDim:   11008,  // LLaMA 7B
}
ffn := nn.NewSwiGLUFFN(cfg, backend)
output := ffn.Forward(x)  // [batch, seq, 4096] -> [batch, seq, 4096]

func NewSwiGLUFFN added in v0.5.0

func NewSwiGLUFFN[B tensor.Backend](cfg SwiGLUFFNConfig, backend B) *SwiGLUFFN[B]

NewSwiGLUFFN creates a new SwiGLUFFN layer.

If GLUVariant is empty, defaults to "swiglu". If FFNDim is 0, it's computed as 8/3 * EmbedDim (LLaMA formula).

Example:

// LLaMA 7B FFN
ffn := nn.NewSwiGLUFFN(nn.SwiGLUFFNConfig{
    EmbedDim: 4096,
    FFNDim:   11008,
}, backend)

func (*SwiGLUFFN[B]) DownProj added in v0.5.0

func (f *SwiGLUFFN[B]) DownProj() *Linear[B]

DownProj returns the down projection layer.

func (*SwiGLUFFN[B]) Forward added in v0.5.0

func (f *SwiGLUFFN[B]) Forward(x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

Forward computes the SwiGLU FFN output.

Input: [batch, seq_len, embed_dim] or [batch*seq_len, embed_dim]. Output: same shape as input.

Computation:

gate = x @ W_gate
up = x @ W_up
hidden = GLU_variant(up, gate)  // e.g., up * SiLU(gate)
output = hidden @ W_down

func (*SwiGLUFFN[B]) GateProj added in v0.5.0

func (f *SwiGLUFFN[B]) GateProj() *Linear[B]

GateProj returns the gate projection layer.

func (*SwiGLUFFN[B]) Parameters added in v0.5.0

func (f *SwiGLUFFN[B]) Parameters() []*Parameter[B]

Parameters returns all trainable parameters.

func (*SwiGLUFFN[B]) UpProj added in v0.5.0

func (f *SwiGLUFFN[B]) UpProj() *Linear[B]

UpProj returns the up projection layer.

type SwiGLUFFNConfig added in v0.5.0

type SwiGLUFFNConfig struct {
	EmbedDim   int    // Model dimension (d_model), e.g., 4096.
	FFNDim     int    // Intermediate/hidden dimension, e.g., 11008 for LLaMA 7B.
	GLUVariant string // Variant: "swiglu" (default), "geglu", "reglu", "glu".
	UseBias    bool   // Whether to use bias in linear layers (LLaMA doesn't).
}

SwiGLUFFNConfig configures a SwiGLUFFN layer.

type Tanh

type Tanh[B tensor.Backend] struct{}

Tanh is a hyperbolic tangent activation module.

Applies the element-wise function: tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x))

Tanh squashes values to the range (-1, 1), making it zero-centered which can help with training. Often used in RNNs.

Example:

tanh := nn.NewTanh[Backend]()
output := tanh.Forward(input)  // Values in range (-1, 1)

func NewTanh

func NewTanh[B tensor.Backend]() *Tanh[B]

NewTanh creates a new Tanh activation module.

func (*Tanh[B]) Forward

func (t *Tanh[B]) Forward(input *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

Forward applies Tanh activation.

func (*Tanh[B]) Parameters

func (t *Tanh[B]) Parameters() []*Parameter[B]

Parameters returns an empty slice (Tanh has no trainable parameters).

type TanhBackend

type TanhBackend interface {
	Tanh(*tensor.RawTensor) *tensor.RawTensor
}

TanhBackend is an interface for backends that support Tanh activation.

type TransformerBlock added in v0.4.0

type TransformerBlock[B tensor.Backend] struct {
	Config    TransformerConfig
	AttnNorm  Normalizer[B] // RMSNorm or LayerNorm before/after attention
	Attention *MultiHeadAttention[B]
	FFNNorm   Normalizer[B] // RMSNorm or LayerNorm before/after FFN
	FFN       *FFN[B]
	// contains filtered or unexported fields
}

TransformerBlock implements a complete Transformer Block.

Architecture (Pre-Norm, LLaMA style):

x → LayerNorm → MHA → + → LayerNorm → FFN → + → output
         ↑_______|            ↑_______|
       (residual)           (residual)

Architecture (Post-Norm, original Transformer):

x → MHA → + → LayerNorm → FFN → + → LayerNorm → output
     ↑___|                 ↑___|
   (residual)            (residual)

Pre-Norm is preferred in modern LLMs as it provides:

  • Better gradient flow (often removing the need for learning-rate warmup)
  • More stable training
  • Easier to stack many layers (100+ layers possible)
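The Pre-Norm wiring shown in the diagram above can be sketched with placeholder sublayers over float slices; norm, attn, and ffn here are stand-ins, not the package's types.

```go
package main

import "fmt"

type sublayer func([]float64) []float64

// add sums two vectors elementwise (the residual connection).
func add(a, b []float64) []float64 {
	out := make([]float64, len(a))
	for i := range a {
		out[i] = a[i] + b[i]
	}
	return out
}

// preNormBlock wires the Pre-Norm residual pattern:
// h = x + attn(norm(x)); out = h + ffn(norm(h)).
func preNormBlock(x []float64, norm, attn, ffn sublayer) []float64 {
	h := add(x, attn(norm(x)))
	return add(h, ffn(norm(h)))
}

func main() {
	identity := func(v []float64) []float64 { return v }
	double := func(v []float64) []float64 {
		out := make([]float64, len(v))
		for i, x := range v {
			out[i] = 2 * x
		}
		return out
	}
	// With identity norm/ffn and a doubling "attention":
	// h = x + 2x = 3x, out = 3x + 3x = 6x.
	fmt.Println(preNormBlock([]float64{1}, identity, double, identity))
}
```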

Components:

  • AttnNorm: Normalization before/after attention (RMSNorm or LayerNorm)
  • Attention: Multi-Head Self-Attention (see MultiHeadAttention)
  • FFNNorm: Normalization before/after FFN
  • FFN: Feed-Forward Network (2-layer MLP with SiLU activation)

Example:

backend := autodiff.New(cpu.New())
config := nn.TransformerConfig{
    EmbedDim:   768,
    NumHeads:   12,
    FFNDim:     3072,
    NormFirst:  true,   // Pre-Norm
    UseRMSNorm: true,   // RMSNorm
    NormEps:    1e-5,
}
block := nn.NewTransformerBlock(config, backend)
output := block.Forward(x, mask)  // [batch, seq, 768] -> [batch, seq, 768]

func NewTransformerBlock added in v0.4.0

func NewTransformerBlock[B tensor.Backend](config TransformerConfig, backend B) *TransformerBlock[B]

NewTransformerBlock creates a new Transformer Block.

Parameters:

  • config: Configuration (embedDim, numHeads, ffnDim, normalization type, etc.)
  • backend: Computation backend

The block is initialized with:

  • Multi-Head Attention (embedDim, numHeads)
  • FFN (embedDim, ffnDim)
  • Two normalization layers (RMSNorm or LayerNorm based on config)

Example:

config := nn.TransformerConfig{
    EmbedDim:   768,
    NumHeads:   12,
    FFNDim:     3072,
    NormFirst:  true,   // Pre-Norm (LLaMA style)
    UseRMSNorm: true,   // RMSNorm (faster than LayerNorm)
    NormEps:    1e-5,
}
block := nn.NewTransformerBlock(config, backend)

func (*TransformerBlock[B]) Forward added in v0.4.0

func (t *TransformerBlock[B]) Forward(x, mask *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

Forward computes the transformer block output.

Args:

  • x: Input tensor [batch, seq, embed_dim]
  • mask: Optional attention mask [batch, 1, seq, seq] or nil

Returns:

  • output: [batch, seq, embed_dim]

The forward pass applies:

  1. Self-Attention with residual connection
  2. FFN with residual connection

Normalization is applied either before (Pre-Norm) or after (Post-Norm) each sub-layer.

Example:

x := tensor.Randn[float32](tensor.Shape{2, 16, 768}, backend)
mask := createCausalMask(16, backend)  // For autoregressive generation
output := block.Forward(x, mask)  // [2, 16, 768]

func (*TransformerBlock[B]) ForwardWithCache added in v0.4.0

func (t *TransformerBlock[B]) ForwardWithCache(
	x *tensor.Tensor[float32, B],
	cache *KVCache[B],
) *tensor.Tensor[float32, B]

ForwardWithCache computes attention using KV cache for efficient autoregressive generation.

This method is optimized for inference where tokens are generated one at a time. The cache stores previous key-value pairs, avoiding recomputation.

Args:

  • x: Query tensor [batch, 1, embed_dim] (typically single token)
  • cache: KV cache storing previous key-value pairs

Returns:

  • output: [batch, 1, embed_dim]

Note: Only Pre-Norm is supported with the cache. Post-Norm would require caching intermediate states, which is more complex.

Example:

cache := nn.NewKVCache[B](1, 12, 512, 64, backend)
for i := 0; i < 100; i++ {
    token := getNextToken(i) // [1, 1, 768]
    output := block.ForwardWithCache(token, cache)
}

func (*TransformerBlock[B]) Parameters added in v0.4.0

func (t *TransformerBlock[B]) Parameters() []*Parameter[B]

Parameters returns all trainable parameters.

Returns parameters from:

  • AttnNorm (gamma, beta or just gamma for RMSNorm)
  • Attention (WQ, WK, WV, WO weights and biases)
  • FFNNorm (gamma, beta or just gamma for RMSNorm)
  • FFN (Linear1, Linear2 weights and biases)

Total parameters for GPT-2 768d/12h:

  • Attention: ~2.4M params (4 * 768*768)
  • AttnNorm: 768 (RMSNorm) or 1536 (LayerNorm)
  • FFN: ~4.7M params (768*3072 + 3072*768)
  • FFNNorm: 768 (RMSNorm) or 1536 (LayerNorm)
  • Total: ~7.1M per block
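The per-block estimates above can be checked with plain arithmetic (bias terms ignored, RMSNorm assumed):

```go
package main

import "fmt"

func main() {
	// Per-block parameter counts for embedDim=768, ffnDim=3072 (GPT-2 small).
	const embed, ffn = 768, 3072
	attn := 4 * embed * embed    // WQ, WK, WV, WO: 2,359,296 (~2.4M)
	ffnParams := 2 * embed * ffn // Linear1 + Linear2: 4,718,592 (~4.7M)
	norms := 2 * embed           // two RMSNorm gammas: 1,536
	fmt.Println(attn, ffnParams, norms, attn+ffnParams+norms) // total ~7.1M
}
```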

type TransformerConfig added in v0.4.0

type TransformerConfig struct {
	EmbedDim   int     // d_model: Embedding dimension (e.g., 768 for GPT-2)
	NumHeads   int     // Number of attention heads (e.g., 12 for GPT-2)
	FFNDim     int     // FFN hidden dimension (typically 4 * EmbedDim = 3072)
	Dropout    float32 // Dropout rate (0 = no dropout, not implemented yet)
	NormFirst  bool    // true = Pre-Norm (LLaMA), false = Post-Norm (original Transformer)
	UseRMSNorm bool    // true = RMSNorm (LLaMA), false = LayerNorm (BERT/GPT)
	NormEps    float32 // Normalization epsilon (1e-5 typical, 1e-6 for RMSNorm)
}

TransformerConfig defines the configuration for a Transformer Block.
