Documentation ¶
Overview ¶
Package nn provides neural network layers and building blocks.
This package contains:
- Layers: Linear, Conv2D, MaxPool2D
- Activations: ReLU, Sigmoid, Tanh
- Loss functions: CrossEntropyLoss, MSELoss
- Utilities: Sequential, Module interface, Parameter
- Initialization: Xavier, Zeros, Ones, Randn
Basic Usage ¶
import (
"github.com/born-ml/born/nn"
"github.com/born-ml/born/backend/cpu"
)
func main() {
    backend := cpu.New()

    // Build a simple MLP
    model := nn.NewSequential(
        nn.NewLinear(784, 128, backend),
        nn.NewReLU(),
        nn.NewLinear(128, 10, backend),
    )

    // Forward pass
    output := model.Forward(input)
}
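The same pieces compose into an evaluation step. A minimal sketch, assuming input is a [batch, 784] float32 tensor and labels is a [batch] int32 tensor (both names illustrative):

criterion := nn.NewCrossEntropyLoss(backend)
logits := model.Forward(input)
loss := criterion.Forward(logits, labels)
acc := nn.Accuracy(logits, labels)
fmt.Printf("accuracy: %.2f%%\n", acc*100)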
Layers ¶
Linear: Fully connected layer with Xavier initialization
layer := nn.NewLinear(inFeatures, outFeatures, backend)
Conv2D: 2D convolutional layer using the im2col algorithm
conv := nn.NewConv2D(inChannels, outChannels, kernelH, kernelW, stride, padding, useBias, backend)
MaxPool2D: 2D max pooling layer
pool := nn.NewMaxPool2D(kernelSize, stride, backend)
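These layers compose directly. A minimal sketch of a small CNN stage, assuming Forward methods per the Module interface and an images tensor of shape [batch, 1, H, W]:

conv := nn.NewConv2D(1, 32, 3, 3, 1, 1, true, backend)
pool := nn.NewMaxPool2D(2, 2, backend)
features := pool.Forward(conv.Forward(images))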
Activations ¶
Common activation functions:
relu := nn.NewReLU()
sigmoid := nn.NewSigmoid()
tanh := nn.NewTanh()
Loss Functions ¶
CrossEntropyLoss: For classification tasks (numerically stable)
criterion := nn.NewCrossEntropyLoss(backend)
loss := criterion.Forward(logits, labels)
MSELoss: For regression tasks
criterion := nn.NewMSELoss(backend)
loss := criterion.Forward(predictions, targets)
Sequential Models ¶
Build models by composing layers:
model := nn.NewSequential(
    nn.NewLinear(784, 256, backend),
    nn.NewReLU(),
    nn.NewLinear(256, 128, backend),
    nn.NewReLU(),
    nn.NewLinear(128, 10, backend),
)
Parameter Management ¶
Access model parameters for optimization:
params := model.Parameters()
for _, param := range params {
    fmt.Println(param.Name(), param.Tensor().Shape())
}
The package also provides public wrappers for positional encodings (sinusoidal, learned, rotary, and ALiBi).
Index ¶
- func Accuracy[B tensor.Backend](logits *tensor.Tensor[float32, B], targets *tensor.Tensor[int32, B]) float32
- func CausalMask[B tensor.Backend](seqLen int, backend B) *tensor.Tensor[float32, B]
- func CrossEntropyBackward[B tensor.Backend](logits *tensor.Tensor[float32, B], targets *tensor.Tensor[int32, B], backend B) *tensor.Tensor[float32, B]
- func GELUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func GLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func GeGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func Ones[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
- func Randn[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
- func ReGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func ReLUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func RepeatKV[B tensor.Backend](kv *tensor.Tensor[float32, B], nRep int) *tensor.Tensor[float32, B]
- func ScaledDotProductAttention[B tensor.Backend](query, key, value *tensor.Tensor[float32, B], mask *tensor.Tensor[float32, B], ...) (*tensor.Tensor[float32, B], *tensor.Tensor[float32, B])
- func SiLUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func SigmoidFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func SwiGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func Xavier[B tensor.Backend](fanIn, fanOut int, shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
- func Zeros[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
- type ALiBi
- type Conv2D
- type CrossEntropyLoss
- type Embedding
- type FFN
- type GQAConfig
- type GroupedQueryAttention
- type KVCache
- type LayerNorm
- type LearnedPositionalEmbedding
- type Linear
- type MSELoss
- type MaxPool2D
- type Module
- type MultiHeadAttention
- type Parameter
- type RMSNorm
- type ReLU
- type RotaryEncoding
- type RotaryEncodingConfig
- type Sequential
- type SiLU
- type Sigmoid
- type SinusoidalPositionalEncoding
- type SwiGLUFFN
- type SwiGLUFFNConfig
- type Tanh
- type TransformerBlock
- type TransformerConfig
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Accuracy ¶
func Accuracy[B tensor.Backend](
    logits *tensor.Tensor[float32, B],
    targets *tensor.Tensor[int32, B],
) float32
Accuracy computes the classification accuracy.
Example:
acc := nn.Accuracy(predictions, labels)
fmt.Printf("Accuracy: %.2f%%\n", acc*100)
func CausalMask ¶ added in v0.4.0
func CausalMask[B tensor.Backend](seqLen int, backend B) *tensor.Tensor[float32, B]
CausalMask creates a causal (autoregressive) attention mask.
In causal attention, each position can only attend to earlier positions. This is used in autoregressive models like GPT.
Returns a mask tensor where future positions are masked with -inf. Shape: [1, 1, seq_len, seq_len] (broadcastable to [batch, heads, seq, seq])
Example:
mask := nn.CausalMask(10, backend) // [1, 1, 10, 10]
output, weights := nn.ScaledDotProductAttention(Q, K, V, mask, 0)
func CrossEntropyBackward ¶
func CrossEntropyBackward[B tensor.Backend](
    logits *tensor.Tensor[float32, B],
    targets *tensor.Tensor[int32, B],
    backend B,
) *tensor.Tensor[float32, B]
CrossEntropyBackward computes the backward pass for cross-entropy loss.
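A minimal sketch, assuming logits of shape [batch, num_classes] and int32 targets of shape [batch]; the gradient has the same shape as logits:

gradLogits := nn.CrossEntropyBackward(logits, targets, backend)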
func GELUFunc ¶ added in v0.5.0
func GELUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
GELUFunc applies GELU (Gaussian Error Linear Unit) activation.
Uses the tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
GELU is used in BERT, GPT-2, and other transformers.
Example:
output := nn.GELUFunc(input)
func GLU ¶ added in v0.5.0
func GLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
GLU applies Gated Linear Unit: GLU(x, gate) = x * sigmoid(gate).
GLU is the base gating mechanism used in various transformer FFN layers.
Parameters:
- x: input tensor.
- gate: gating tensor (same shape as x).
Returns: x * sigmoid(gate).
Example:
output := nn.GLU(x, gate)
func GeGLU ¶ added in v0.5.0
func GeGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
GeGLU applies GELU-Gated Linear Unit: GeGLU(x, gate) = x * GELU(gate).
GeGLU uses GELU activation for gating instead of SiLU. Used in some transformer variants for different activation characteristics.
Parameters:
- x: input tensor.
- gate: gating tensor.
Returns: x * GELU(gate).
Example:
output := nn.GeGLU(up, gate)
func Ones ¶
func Ones[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
Ones initializes a tensor with ones.
Example:
backend := cpu.New()
weights := nn.Ones(tensor.Shape{128, 784}, backend)
func Randn ¶
func Randn[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
Randn initializes a tensor with random values from N(0, 1).
Example:
backend := cpu.New()
weights := nn.Randn(tensor.Shape{128, 784}, backend)
func ReGLU ¶ added in v0.5.0
func ReGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
ReGLU applies ReLU-Gated Linear Unit: ReGLU(x, gate) = x * ReLU(gate).
ReGLU uses ReLU activation for gating. It's simpler but may have "dead neuron" issues compared to SwiGLU or GeGLU.
Parameters:
- x: input tensor.
- gate: gating tensor.
Returns: x * ReLU(gate).
Example:
output := nn.ReGLU(up, gate)
func ReLUFunc ¶ added in v0.5.0
func ReLUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
ReLUFunc applies the ReLU activation function element-wise. ReLU(x) = max(0, x).
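Usage mirrors the other functional activations (input is any float32 tensor):

output := nn.ReLUFunc(input)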
func RepeatKV ¶ added in v0.5.0
func RepeatKV[B tensor.Backend](kv *tensor.Tensor[float32, B], nRep int) *tensor.Tensor[float32, B]
RepeatKV broadcasts KV heads to match query heads count.
This is the key operation in GQA that allows fewer KV heads than Q heads. Each KV head is repeated nRep times to match the Q head count.
Input:  [batch, n_kv_heads, seq_len, head_dim]
Output: [batch, n_q_heads, seq_len, head_dim] where n_q_heads = n_kv_heads * nRep
Example:
// 8 KV heads -> 32 Q heads (nRep=4)
kv := tensor.Randn[float32](tensor.Shape{2, 8, 100, 128}, backend)
expanded := nn.RepeatKV(kv, 4) // [2, 32, 100, 128]
If nRep=1 (standard MHA), returns the input unchanged.
func ScaledDotProductAttention ¶ added in v0.4.0
func ScaledDotProductAttention[B tensor.Backend](
    query, key, value *tensor.Tensor[float32, B],
    mask *tensor.Tensor[float32, B],
    scale float32,
) (*tensor.Tensor[float32, B], *tensor.Tensor[float32, B])
ScaledDotProductAttention computes attention scores using the scaled dot-product mechanism.
This is the core attention mechanism used in transformers.
Parameters:
- query: Query tensor [batch, heads, seq_q, head_dim]
- key: Key tensor [batch, heads, seq_k, head_dim]
- value: Value tensor [batch, heads, seq_k, head_dim]
- mask: Optional attention mask [batch, 1, seq_q, seq_k] or nil (additive mask, -inf for masked)
- scale: Scaling factor (0 for auto-compute as 1/sqrt(head_dim))
Returns:
- output: Attended values [batch, heads, seq_q, head_dim]
- weights: Attention weights [batch, heads, seq_q, seq_k]
Example:
Q := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
K := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
V := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
output, weights := nn.ScaledDotProductAttention(Q, K, V, nil, 0)
func SiLUFunc ¶ added in v0.5.0
func SiLUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
SiLUFunc applies SiLU (Swish) activation: f(x) = x * sigmoid(x).
This is the functional version of SiLU activation, useful in GLU variants.
Example:
output := nn.SiLUFunc(input)
func SigmoidFunc ¶ added in v0.5.0
func SigmoidFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
SigmoidFunc applies the sigmoid activation function element-wise. Sigmoid(x) = 1 / (1 + exp(-x)).
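Usage mirrors ReLUFunc and SiLUFunc (input is any float32 tensor):

output := nn.SigmoidFunc(input)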
func SwiGLU ¶ added in v0.5.0
func SwiGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
SwiGLU applies Swish-Gated Linear Unit: SwiGLU(x, gate) = x * SiLU(gate).
SwiGLU is used in modern LLMs like LLaMA, Mistral, and DeepSeek. It combines the input with SiLU-activated gate for better gradient flow.
Parameters:
- x: input tensor (typically "up" projection).
- gate: gating tensor (typically "gate" projection).
Returns: x * SiLU(gate) where SiLU(z) = z * sigmoid(z).
Example:
// In LLaMA-style FFN:
up := upProj.Forward(input)
gate := gateProj.Forward(input)
hidden := nn.SwiGLU(up, gate)
Types ¶
type ALiBi ¶ added in v0.4.0
ALiBi implements Attention with Linear Biases.
ALiBi adds a linear bias to attention scores based on the distance between positions. Used in BLOOM, MPT, and other models. Allows extrapolation to longer sequences.
Example:
backend := cpu.New()
alibi := nn.NewALiBi(8, backend) // 8 attention heads
bias := alibi.GetBias(128)       // [1, 8, 128, 128]

// In attention:
scores := Q.BatchMatMul(K.T())
scores = scores.Add(bias)
weights := scores.Softmax(-1)
func NewALiBi ¶ added in v0.4.0
NewALiBi creates a new ALiBi bias generator.
Computes slopes for each attention head using a geometric sequence.
Parameters:
- numHeads: Number of attention heads
- backend: Computation backend
Example:
alibi := nn.NewALiBi(8, backend)
bias := alibi.GetBias(64) // Get bias for sequence length 64
type Conv2D ¶
Conv2D represents a 2D convolutional layer.
func NewConv2D ¶
func NewConv2D[B tensor.Backend](
    inChannels, outChannels int,
    kernelH, kernelW int,
    stride, padding int,
    useBias bool,
    backend B,
) *Conv2D[B]
NewConv2D creates a new 2D convolutional layer.
Example:
backend := cpu.New()
// in_channels=1, out_channels=32, kernel=3x3, stride=1, padding=1, useBias=true
conv := nn.NewConv2D(1, 32, 3, 3, 1, 1, true, backend)
type CrossEntropyLoss ¶
type CrossEntropyLoss[B tensor.Backend] = nn.CrossEntropyLoss[B]
CrossEntropyLoss represents the cross-entropy loss for classification.
func NewCrossEntropyLoss ¶
func NewCrossEntropyLoss[B tensor.Backend](backend B) *CrossEntropyLoss[B]
NewCrossEntropyLoss creates a new cross-entropy loss function.
Example:
backend := cpu.New()
criterion := nn.NewCrossEntropyLoss(backend)
loss := criterion.Forward(logits, labels)
type Embedding ¶ added in v0.3.0
Embedding represents a lookup table for embeddings.
func NewEmbedding ¶ added in v0.3.0
NewEmbedding creates a new embedding layer.
Example:
backend := cpu.New()
embed := nn.NewEmbedding(50000, 768, backend) // vocab=50000, dim=768
tokenIds := tensor.FromSlice([]int32{1, 5, 10}, tensor.Shape{1, 3}, backend)
embeddings := embed.Forward(tokenIds) // [1, 3, 768]
func NewEmbeddingWithWeight ¶ added in v0.5.0
NewEmbeddingWithWeight creates an embedding layer from an existing weight tensor.
This is useful when loading pre-trained embeddings.
Example:
weights := tensor.Randn[float32](tensor.Shape{50000, 768}, backend)
embed := nn.NewEmbeddingWithWeight(weights)
type FFN ¶ added in v0.4.0
FFN (Feed-Forward Network) is a 2-layer MLP with SiLU activation.
Architecture:
FFN(x) = Linear2(SiLU(Linear1(x)))
Used inside TransformerBlock.
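The FFN type wires this internally; a conceptual sketch of the same computation from the documented Linear and SiLUFunc pieces (the 768/3072 dimensions are illustrative):

lin1 := nn.NewLinear(768, 3072, backend)
lin2 := nn.NewLinear(3072, 768, backend)
output := lin2.Forward(nn.SiLUFunc(lin1.Forward(x)))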
type GQAConfig ¶ added in v0.5.0
GQAConfig configures a GroupedQueryAttention layer.
func MQA ¶ added in v0.5.0
MQA creates a Multi-Query Attention config (GQA with n_kv_heads=1).
MQA is the extreme case of GQA where all query heads share a single KV head. This provides maximum memory savings but may reduce model capacity.
Example:
cfg := nn.MQA(4096, 32, 128) // 32 Q heads, 1 KV head
mqa := nn.NewGQA(cfg, backend)
type GroupedQueryAttention ¶ added in v0.5.0
type GroupedQueryAttention[B tensor.Backend] = nn.GroupedQueryAttention[B]
GroupedQueryAttention implements Grouped Query Attention (GQA).
GQA is a variant of multi-head attention where the number of key-value heads is less than the number of query heads. This provides significant memory savings for KV-cache during inference while maintaining model quality.
Architecture comparison:
MHA: n_q_heads = n_kv_heads (e.g., 32 Q, 32 K, 32 V)
GQA: n_q_heads > n_kv_heads (e.g., 32 Q, 8 K, 8 V) -> 4x memory savings
MQA: n_kv_heads = 1 (e.g., 32 Q, 1 K, 1 V) -> 32x memory savings (extreme)
GQA is used in LLaMA 2/3, Mistral, DeepSeek, Qwen2, Phi-3, and other modern LLMs.
func NewGQA ¶ added in v0.5.0
func NewGQA[B tensor.Backend](cfg GQAConfig, backend B) *GroupedQueryAttention[B]
NewGQA creates a new GroupedQueryAttention module.
Validates that:
- NQHeads is divisible by NKVHeads
- EmbedDim equals NQHeads * HeadDim
If HeadDim is 0, it's computed as EmbedDim / NQHeads.
Example:
// LLaMA 2 7B style config
cfg := nn.GQAConfig{
    EmbedDim:  4096,
    NQHeads:   32,
    NKVHeads:  8,
    HeadDim:   128,
    UseRoPE:   true,
    MaxSeqLen: 4096,
}
gqa := nn.NewGQA(cfg, backend)
output := gqa.Forward(x, cache, startPos)
type KVCache ¶ added in v0.4.0
KVCache is a public alias for the internal KV cache implementation.
KVCache stores key-value pairs for efficient autoregressive generation. See internal/nn/kvcache.go for detailed documentation.
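A decoding-loop sketch, assuming a GQA layer as shown under NewGQA (cache construction is elided here; maxNewTokens, startPos, and x are illustrative names):

for pos := 0; pos < maxNewTokens; pos++ {
    // x holds the embedding of the newest token: [batch, 1, embed_dim]
    out := gqa.Forward(x, cache, startPos+pos)
    // ... sample the next token from out and embed it into x
}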
type LayerNorm ¶ added in v0.4.0
LayerNorm represents Layer Normalization.
func NewLayerNorm ¶ added in v0.4.0
NewLayerNorm creates a new LayerNorm layer.
Example:
backend := cpu.New()
norm := nn.NewLayerNorm(768, 1e-5, backend)
output := norm.Forward(input) // [..., 768] -> [..., 768]
type LearnedPositionalEmbedding ¶ added in v0.4.0
type LearnedPositionalEmbedding[B tensor.Backend] = nn.LearnedPositionalEmbedding[B]
LearnedPositionalEmbedding implements learned positional embeddings.
These embeddings are trainable parameters that are updated during training. Used in GPT-2 and other models.
Example:
backend := cpu.New()
pe := nn.NewLearnedPositionalEmbedding(512, 256, backend)
encodings := pe.Forward(100) // [1, 100, 256]

// Get parameters for optimizer
params := pe.Parameters()
func NewLearnedPositionalEmbedding ¶ added in v0.4.0
func NewLearnedPositionalEmbedding[B tensor.Backend](maxLen, dim int, backend B) *LearnedPositionalEmbedding[B]
NewLearnedPositionalEmbedding creates a new learned positional embedding layer.
The embeddings are initialized from a normal distribution N(0, 1).
Parameters:
- maxLen: Maximum sequence length
- dim: Embedding dimension
- backend: Computation backend
Example:
pe := nn.NewLearnedPositionalEmbedding(512, 256, backend)
type MSELoss ¶
MSELoss represents the mean squared error loss for regression.
func NewMSELoss ¶
NewMSELoss creates a new MSE loss function.
Example:
backend := cpu.New()
criterion := nn.NewMSELoss(backend)
loss := criterion.Forward(predictions, targets)
type MultiHeadAttention ¶ added in v0.4.0
type MultiHeadAttention[B tensor.Backend] = nn.MultiHeadAttention[B]
MultiHeadAttention represents the multi-head attention mechanism.
func NewMultiHeadAttention ¶ added in v0.4.0
func NewMultiHeadAttention[B tensor.Backend](embedDim, numHeads int, backend B) *MultiHeadAttention[B]
NewMultiHeadAttention creates a new multi-head attention module.
Parameters:
- embedDim: Total embedding dimension (must be divisible by numHeads)
- numHeads: Number of attention heads
- backend: Computation backend
Example:
backend := cpu.New()
mha := nn.NewMultiHeadAttention(768, 12, backend) // BERT-base config
output := mha.Forward(x, x, x, nil) // Self-attention
type RotaryEncoding ¶ added in v0.4.0
type RotaryEncoding[B tensor.Backend] = nn.RotaryEncoding[B]
RotaryEncoding implements Rotary Position Embedding (RoPE).
RoPE is used in modern LLMs like LLaMA, Mistral, DeepSeek, and Qwen. It applies a rotation to query and key embeddings based on their position.
Example:
backend := cpu.New()
config := nn.RotaryEncodingConfig{
    DModel:    64,
    MaxSeqLen: 2048,
    Theta:     10000.0,
}
rope := nn.NewRotaryEncoding(config, backend)
// Apply to attention queries/keys
q := tensor.Randn[float32](tensor.Shape{batch, heads, seq, 64}, backend)
qRotated := rope.Forward(q)
func NewRotaryEncoding ¶ added in v0.4.0
func NewRotaryEncoding[B tensor.Backend](cfg RotaryEncodingConfig, backend B) *RotaryEncoding[B]
NewRotaryEncoding creates a new RoPE (Rotary Position Embedding) layer.
Pre-computes cosine and sine values for all positions and dimension pairs.
Parameters:
- cfg: Configuration (DModel, MaxSeqLen, Theta)
- backend: Computation backend
Example:
config := nn.RotaryEncodingConfig{
    DModel:    64,      // Head dimension
    MaxSeqLen: 2048,    // Max sequence length
    Theta:     10000.0,
}
rope := nn.NewRotaryEncoding(config, backend)
type RotaryEncodingConfig ¶ added in v0.4.0
type RotaryEncodingConfig = nn.RotaryEncodingConfig
RotaryEncodingConfig configures a RotaryEncoding layer.
type Sequential ¶
type Sequential[B tensor.Backend] = nn.Sequential[B]
Sequential represents a sequential container of modules.
func NewSequential ¶
func NewSequential[B tensor.Backend](modules ...Module[B]) *Sequential[B]
NewSequential creates a new sequential model.
Example:
backend := cpu.New()
model := nn.NewSequential(
    nn.NewLinear(784, 128, backend),
    nn.NewReLU(),
    nn.NewLinear(128, 10, backend),
)
type SiLU ¶ added in v0.3.0
SiLU represents the Sigmoid Linear Unit (SiLU/Swish) activation function. SiLU(x) = x * sigmoid(x).
type Sigmoid ¶
Sigmoid represents the Sigmoid activation function.
func NewSigmoid ¶
NewSigmoid creates a new Sigmoid activation layer.
Example:
sigmoid := nn.NewSigmoid()
type SinusoidalPositionalEncoding ¶ added in v0.4.0
type SinusoidalPositionalEncoding[B tensor.Backend] = nn.SinusoidalPositionalEncoding[B]
SinusoidalPositionalEncoding implements fixed sinusoidal positional encodings.
This is the original positional encoding from "Attention is All You Need" (Vaswani et al., 2017).
Example:
backend := cpu.New()
pe := nn.NewSinusoidalPositionalEncoding(512, 256, backend)
encodings := pe.Forward(100) // [1, 100, 256]

// Add to embeddings
embeddings = embeddings.Add(encodings)
func NewSinusoidalPositionalEncoding ¶ added in v0.4.0
func NewSinusoidalPositionalEncoding[B tensor.Backend](maxLen, dim int, backend B) *SinusoidalPositionalEncoding[B]
NewSinusoidalPositionalEncoding creates a new sinusoidal positional encoding layer.
Pre-computes all positional encodings up to maxLen using sine and cosine functions.
Parameters:
- maxLen: Maximum sequence length
- dim: Embedding dimension
- backend: Computation backend
Example:
pe := nn.NewSinusoidalPositionalEncoding(512, 256, backend)
type SwiGLUFFN ¶ added in v0.5.0
SwiGLUFFN implements a feed-forward network with SwiGLU activation.
Architecture (LLaMA-style):
hidden = SwiGLU(x @ W_up, x @ W_gate)
output = hidden @ W_down
Where SwiGLU(up, gate) = up * SiLU(gate).
This is more parameter-efficient than a standard FFN with GELU. LLaMA uses ffn_dim ≈ 8/3 * d_model (~2.7x, vs 4x in the standard FFN), so its three projections (up, gate, down) cost roughly the same parameters as the standard FFN's two, with better performance.
Example:
backend := autodiff.New(cpu.New())
cfg := nn.SwiGLUFFNConfig{
    EmbedDim: 4096,
    FFNDim:   11008, // LLaMA 7B
}
ffn := nn.NewSwiGLUFFN(cfg, backend)
output := ffn.Forward(x) // [batch, seq, 4096] -> [batch, seq, 4096]
func NewSwiGLUFFN ¶ added in v0.5.0
func NewSwiGLUFFN[B tensor.Backend](cfg SwiGLUFFNConfig, backend B) *SwiGLUFFN[B]
NewSwiGLUFFN creates a new SwiGLUFFN layer.
If GLUVariant is empty, defaults to "swiglu". If FFNDim is 0, it's computed as 8/3 * EmbedDim (LLaMA formula).
Example:
// LLaMA 7B FFN
ffn := nn.NewSwiGLUFFN(nn.SwiGLUFFNConfig{
    EmbedDim: 4096,
    FFNDim:   11008,
}, backend)
type SwiGLUFFNConfig ¶ added in v0.5.0
type SwiGLUFFNConfig = nn.SwiGLUFFNConfig
SwiGLUFFNConfig configures a SwiGLUFFN layer.
type TransformerBlock ¶ added in v0.4.0
type TransformerBlock[B tensor.Backend] = nn.TransformerBlock[B]
TransformerBlock is a complete Transformer Block with attention and FFN.
Architecture (Pre-Norm):
h      = x + MHA(Norm(x))      (residual around attention)
output = h + FFN(Norm(h))      (residual around FFN)
Used in all transformer models (GPT, BERT, LLaMA, etc.)
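A conceptual sketch of the Pre-Norm wiring (norm1, norm2, mha, and ffn are illustrative names; the block constructs and wires these internally):

n := norm1.Forward(x)
h := x.Add(mha.Forward(n, n, n, mask))      // attention + residual
out := h.Add(ffn.Forward(norm2.Forward(h))) // FFN + residual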
func NewTransformerBlock ¶ added in v0.4.0
func NewTransformerBlock[B tensor.Backend](config TransformerConfig, backend B) *TransformerBlock[B]
NewTransformerBlock creates a new Transformer Block.
Parameters:
- config: Configuration (embedDim, numHeads, ffnDim, etc.)
- backend: Computation backend
Example:
backend := autodiff.New(cpu.New())
config := nn.TransformerConfig{
    EmbedDim:   768,
    NumHeads:   12,
    FFNDim:     3072,
    NormFirst:  true,
    UseRMSNorm: true,
    NormEps:    1e-5,
}
block := nn.NewTransformerBlock(config, backend)
output := block.Forward(x, mask)
type TransformerConfig ¶ added in v0.4.0
type TransformerConfig = nn.TransformerConfig
TransformerConfig defines the configuration for a Transformer Block.
Fields:
- EmbedDim: Embedding dimension (d_model, e.g., 768 for GPT-2)
- NumHeads: Number of attention heads (e.g., 12 for GPT-2)
- FFNDim: FFN hidden dimension (typically 4 * EmbedDim)
- Dropout: Dropout rate (0 = no dropout, not yet implemented)
- NormFirst: true = Pre-Norm (LLaMA), false = Post-Norm (original)
- UseRMSNorm: true = RMSNorm (LLaMA), false = LayerNorm (BERT/GPT)
- NormEps: Normalization epsilon (1e-5 typical)
Example:
config := nn.TransformerConfig{
    EmbedDim:   768,
    NumHeads:   12,
    FFNDim:     3072,
    NormFirst:  true,
    UseRMSNorm: true,
    NormEps:    1e-5,
}