Documentation
Overview ¶
Package nn provides neural network layers and building blocks for the Born ML Framework.
This package contains:
- Module interface: Base interface for all NN components
- Parameter: Trainable parameters with gradient tracking
- Layers: Linear, Conv2D, MaxPool2D, Embedding
- Activations: ReLU, Sigmoid, Tanh, SiLU
- Normalization: RMSNorm, LayerNorm
- Attention: Multi-head attention, grouped-query attention, causal masking
- Positional encodings: Sinusoidal, Learned, Rotary (RoPE), ALiBi
- Loss functions: CrossEntropyLoss, MSELoss
- Containers: Sequential
- Initialization: Xavier, Zeros, Ones, Randn
The design is inspired by PyTorch's nn.Module, adapted for Go generics and type safety.
Basic Usage ¶
import (
    "github.com/born-ml/born/backend/cpu"
    "github.com/born-ml/born/nn"
)

func main() {
    backend := cpu.New()

    // Build a simple MLP
    model := nn.NewSequential(
        nn.NewLinear(784, 128, backend),
        nn.NewReLU(),
        nn.NewLinear(128, 10, backend),
    )

    // Forward pass (input: a [batch, 784] float32 tensor)
    output := model.Forward(input)
}
Layers ¶
Linear: Fully connected layer with Xavier initialization
layer := nn.NewLinear(inFeatures, outFeatures, backend)
Conv2D: 2D convolutional layer using the im2col algorithm
conv := nn.NewConv2D(inChannels, outChannels, kernelH, kernelW, stride, padding, useBias, backend)
MaxPool2D: 2D max pooling layer
pool := nn.NewMaxPool2D(kernelSize, stride, backend)
Activations ¶
Common activation functions:
relu := nn.NewReLU()
sigmoid := nn.NewSigmoid()
tanh := nn.NewTanh()
Loss Functions ¶
CrossEntropyLoss: For classification tasks (numerically stable)
criterion := nn.NewCrossEntropyLoss(backend)
loss := criterion.Forward(logits, labels)
MSELoss: For regression tasks
criterion := nn.NewMSELoss(backend)
loss := criterion.Forward(predictions, targets)
Sequential Models ¶
Build models by composing layers:
model := nn.NewSequential(
    nn.NewLinear(784, 256, backend),
    nn.NewReLU(),
    nn.NewLinear(256, 128, backend),
    nn.NewReLU(),
    nn.NewLinear(128, 10, backend),
)
Parameter Management ¶
Access model parameters for optimization:
params := model.Parameters()
for _, param := range params {
    fmt.Println(param.Name(), param.Tensor().Shape())
}
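Parameters also expose gradient accessors (Grad, SetGrad, ZeroGrad; see the Parameter type below). As a minimal sketch, a typical training step clears stale gradients before the next backward pass:

for _, param := range model.Parameters() {
    param.ZeroGrad() // reset gradients accumulated by the previous step
}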
Index ¶
- func Accuracy[B tensor.Backend](logits *tensor.Tensor[float32, B], targets *tensor.Tensor[int32, B]) float32
- func CausalMask[B tensor.Backend](seqLen int, backend B) *tensor.Tensor[float32, B]
- func CrossEntropyBackward[B tensor.Backend](logits *tensor.Tensor[float32, B], targets *tensor.Tensor[int32, B], backend B) *tensor.Tensor[float32, B]
- func GELUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func GLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func GeGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func Load[B tensor.Backend](path string, backend B, module Module[B]) (serialization.Header, error)
- func Ones[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
- func Randn[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
- func ReGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func ReLUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func RepeatKV[B tensor.Backend](kv *tensor.Tensor[float32, B], nRep int) *tensor.Tensor[float32, B]
- func Save[B tensor.Backend](module Module[B], path, modelType string, metadata map[string]string) error
- func ScaledDotProductAttention[B tensor.Backend](query, key, value *tensor.Tensor[float32, B], mask *tensor.Tensor[float32, B], ...) (*tensor.Tensor[float32, B], *tensor.Tensor[float32, B])
- func SiLUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func SigmoidFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func SwiGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
- func Xavier[B tensor.Backend](fanIn, fanOut int, shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
- func Zeros[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
- type ALiBi
- type Conv2D
- type CrossEntropyLoss
- type Embedding
- type FFN
- type GQAConfig
- type GroupedQueryAttention
- type KVCache
- type LayerNorm
- type LearnedPositionalEmbedding
- type Linear
- type LinearOption
- type MSELoss
- type MaxPool2D
- type Module
- type MultiHeadAttention
- type Parameter
- type RMSNorm
- type ReLU
- type RotaryEncoding
- type RotaryEncodingConfig
- type Sequential
- type SiLU
- type Sigmoid
- type SinusoidalPositionalEncoding
- type SwiGLUFFN
- type SwiGLUFFNConfig
- type Tanh
- type TransformerBlock
- type TransformerConfig
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Accuracy ¶
func Accuracy[B tensor.Backend](
    logits *tensor.Tensor[float32, B],
    targets *tensor.Tensor[int32, B],
) float32
Accuracy computes the classification accuracy.
Example:
acc := nn.Accuracy(predictions, labels)
fmt.Printf("Accuracy: %.2f%%\n", acc*100)
func CausalMask ¶ added in v0.4.0
func CausalMask[B tensor.Backend](seqLen int, backend B) *tensor.Tensor[float32, B]
CausalMask creates a causal (autoregressive) attention mask.
In causal attention, each position can only attend to earlier positions. This is used in autoregressive models like GPT.
Returns a mask tensor where future positions are masked with -inf. Shape: [1, 1, seq_len, seq_len] (broadcastable to [batch, heads, seq, seq])
Example:
mask := nn.CausalMask(10, backend) // [1, 1, 10, 10]
output, weights := nn.ScaledDotProductAttention(Q, K, V, mask, 0)
func CrossEntropyBackward ¶
func CrossEntropyBackward[B tensor.Backend](
    logits *tensor.Tensor[float32, B],
    targets *tensor.Tensor[int32, B],
    backend B,
) *tensor.Tensor[float32, B]
CrossEntropyBackward computes the backward pass for cross-entropy loss.
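The returned tensor is the gradient of the loss with respect to logits. A minimal sketch, assuming logits of shape [batch, classes] and int32 class-index targets:

Example:

grad := nn.CrossEntropyBackward(logits, targets, backend) // same shape as logits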
func GELUFunc ¶ added in v0.5.0
func GELUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
GELUFunc applies GELU (Gaussian Error Linear Unit) activation.
Uses the tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).
GELU is used in BERT, GPT-2, and other transformers.
Example:
output := nn.GELUFunc(input)
func GLU ¶ added in v0.5.0
func GLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
GLU applies Gated Linear Unit: GLU(x, gate) = x * sigmoid(gate).
GLU is the base gating mechanism used in various transformer FFN layers.
Parameters:
- x: input tensor.
- gate: gating tensor (same shape as x).
Returns: x * sigmoid(gate).
Example:
output := nn.GLU(x, gate)
func GeGLU ¶ added in v0.5.0
func GeGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
GeGLU applies GELU-Gated Linear Unit: GeGLU(x, gate) = x * GELU(gate).
GeGLU uses GELU activation for gating instead of SiLU. Used in some transformer variants for different activation characteristics.
Parameters:
- x: input tensor.
- gate: gating tensor.
Returns: x * GELU(gate).
Example:
output := nn.GeGLU(up, gate)
func Load ¶ added in v0.7.7
func Load[B tensor.Backend](path string, backend B, module Module[B]) (serialization.Header, error)
Load loads a module from a .born file.
This is a convenience function that reads a state dictionary from a file and loads it into the provided module.
Parameters:
- path: File path to read from
- backend: Backend to use for tensors
- module: The module to load into (will be modified)
Returns the header and an error if loading fails.
Example:
backend := cpu.New()
model := nn.NewLinear(784, 10, backend)
header, err := nn.Load("model.born", backend, model)
func Ones ¶
func Ones[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
Ones initializes a tensor with ones.
Example:
backend := cpu.New()
weights := nn.Ones(tensor.Shape{128, 784}, backend)
func Randn ¶
func Randn[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
Randn initializes a tensor with random values from N(0, 1).
Example:
backend := cpu.New()
weights := nn.Randn(tensor.Shape{128, 784}, backend)
func ReGLU ¶ added in v0.5.0
func ReGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
ReGLU applies ReLU-Gated Linear Unit: ReGLU(x, gate) = x * ReLU(gate).
ReGLU uses ReLU activation for gating. It's simpler but may have "dead neuron" issues compared to SwiGLU or GeGLU.
Parameters:
- x: input tensor.
- gate: gating tensor.
Returns: x * ReLU(gate).
Example:
output := nn.ReGLU(up, gate)
func ReLUFunc ¶ added in v0.5.0
func ReLUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
ReLUFunc applies the ReLU activation function element-wise. ReLU(x) = max(0, x).
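Example (a minimal sketch; input is any float32 tensor):

output := nn.ReLUFunc(input)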
func RepeatKV ¶ added in v0.5.0
func RepeatKV[B tensor.Backend](kv *tensor.Tensor[float32, B], nRep int) *tensor.Tensor[float32, B]
RepeatKV broadcasts KV heads to match query heads count.
This is the key operation in GQA that allows fewer KV heads than Q heads. Each KV head is repeated nRep times to match the Q head count.
Input:  [batch, n_kv_heads, seq_len, head_dim]
Output: [batch, n_q_heads, seq_len, head_dim] where n_q_heads = n_kv_heads * nRep
Example:
// 8 KV heads -> 32 Q heads (nRep=4)
kv := tensor.Randn[float32](tensor.Shape{2, 8, 100, 128}, backend)
expanded := nn.RepeatKV(kv, 4) // [2, 32, 100, 128]
If nRep=1 (standard MHA), returns the input unchanged.
func Save ¶ added in v0.7.7
func Save[B tensor.Backend](module Module[B], path, modelType string, metadata map[string]string) error
Save saves a module to a .born file.
This is a convenience function that exports the module's state dictionary and writes it to a file using the Born native format.
Parameters:
- module: The module to save
- path: File path to write to
- modelType: Type name of the model (e.g., "Sequential", "Linear")
- metadata: Optional metadata (can be nil)
Returns an error if saving fails.
Example:
backend := cpu.New()
model := nn.NewLinear(784, 10, backend)
err := nn.Save(model, "model.born", "Linear", nil)
func ScaledDotProductAttention ¶ added in v0.4.0
func ScaledDotProductAttention[B tensor.Backend](
    query, key, value *tensor.Tensor[float32, B],
    mask *tensor.Tensor[float32, B],
    scale float32,
) (*tensor.Tensor[float32, B], *tensor.Tensor[float32, B])
ScaledDotProductAttention computes attention scores using the scaled dot-product mechanism.
This is the core attention mechanism used in transformers.
Parameters:
- query: Query tensor [batch, heads, seq_q, head_dim]
- key: Key tensor [batch, heads, seq_k, head_dim]
- value: Value tensor [batch, heads, seq_k, head_dim]
- mask: Optional attention mask [batch, 1, seq_q, seq_k] or nil (additive mask, -inf for masked)
- scale: Scaling factor (0 for auto-compute as 1/sqrt(head_dim))
Returns:
- output: Attended values [batch, heads, seq_q, head_dim]
- weights: Attention weights [batch, heads, seq_q, seq_k]
Example:
Q := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
K := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
V := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
output, weights := nn.ScaledDotProductAttention(Q, K, V, nil, 0)
func SiLUFunc ¶ added in v0.5.0
func SiLUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
SiLUFunc applies SiLU (Swish) activation: f(x) = x * sigmoid(x).
This is the functional version of SiLU activation, useful in GLU variants.
Example:
output := nn.SiLUFunc(input)
func SigmoidFunc ¶ added in v0.5.0
func SigmoidFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
SigmoidFunc applies the sigmoid activation function element-wise. Sigmoid(x) = 1 / (1 + exp(-x)).
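Example (a minimal sketch; input is any float32 tensor):

output := nn.SigmoidFunc(input)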
func SwiGLU ¶ added in v0.5.0
func SwiGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
SwiGLU applies Swish-Gated Linear Unit: SwiGLU(x, gate) = x * SiLU(gate).
SwiGLU is used in modern LLMs like LLaMA, Mistral, and DeepSeek. It combines the input with SiLU-activated gate for better gradient flow.
Parameters:
- x: input tensor (typically "up" projection).
- gate: gating tensor (typically "gate" projection).
Returns: x * SiLU(gate) where SiLU(z) = z * sigmoid(z).
Example:
// In LLaMA-style FFN:
up := upProj.Forward(input)
gate := gateProj.Forward(input)
hidden := nn.SwiGLU(up, gate)
Types ¶
type ALiBi ¶ added in v0.4.0
ALiBi implements Attention with Linear Biases.
ALiBi adds a linear bias to attention scores based on the distance between positions. Used in BLOOM, MPT, and other models. Allows extrapolation to longer sequences.
Example:
backend := cpu.New()
alibi := nn.NewALiBi(8, backend) // 8 attention heads
bias := alibi.GetBias(128)       // [1, 8, 128, 128]

// In attention:
scores := Q.BatchMatMul(K.T())
scores = scores.Add(bias)
weights := scores.Softmax(-1)
func NewALiBi ¶ added in v0.4.0
NewALiBi creates a new ALiBi bias generator.
Computes slopes for each attention head using a geometric sequence.
Parameters:
- numHeads: Number of attention heads
- backend: Computation backend
Example:
alibi := nn.NewALiBi(8, backend)
bias := alibi.GetBias(64) // Get bias for sequence length 64
type Conv2D ¶
Conv2D represents a 2D convolutional layer.
func NewConv2D ¶
func NewConv2D[B tensor.Backend](
    inChannels, outChannels int,
    kernelH, kernelW int,
    stride, padding int,
    useBias bool,
    backend B,
) *Conv2D[B]
NewConv2D creates a new 2D convolutional layer.
Example:
backend := cpu.New()
conv := nn.NewConv2D(1, 32, 3, 3, 1, 1, true, backend)
// in_channels=1, out_channels=32, kernel=3x3, stride=1, padding=1, useBias=true
type CrossEntropyLoss ¶
type CrossEntropyLoss[B tensor.Backend] = nn.CrossEntropyLoss[B]
CrossEntropyLoss represents the cross-entropy loss for classification.
func NewCrossEntropyLoss ¶
func NewCrossEntropyLoss[B tensor.Backend](backend B) *CrossEntropyLoss[B]
NewCrossEntropyLoss creates a new cross-entropy loss function.
Example:
backend := cpu.New()
criterion := nn.NewCrossEntropyLoss(backend)
loss := criterion.Forward(logits, labels)
type Embedding ¶ added in v0.3.0
Embedding represents a lookup table for embeddings.
func NewEmbedding ¶ added in v0.3.0
NewEmbedding creates a new embedding layer.
Example:
backend := cpu.New()
embed := nn.NewEmbedding(50000, 768, backend) // vocab=50000, dim=768
tokenIds := tensor.FromSlice([]int32{1, 5, 10}, tensor.Shape{1, 3}, backend)
embeddings := embed.Forward(tokenIds) // [1, 3, 768]
func NewEmbeddingWithWeight ¶ added in v0.5.0
NewEmbeddingWithWeight creates an embedding layer from an existing weight tensor.
This is useful when loading pre-trained embeddings.
Example:
weights := tensor.Randn[float32](tensor.Shape{50000, 768}, backend)
embed := nn.NewEmbeddingWithWeight(weights)
type FFN ¶ added in v0.4.0
FFN (Feed-Forward Network) is a 2-layer MLP with SiLU activation.
Architecture:
FFN(x) = Linear2(SiLU(Linear1(x)))
Used inside TransformerBlock.
type GQAConfig ¶ added in v0.5.0
GQAConfig configures a GroupedQueryAttention layer.
func MQA ¶ added in v0.5.0
MQA creates a Multi-Query Attention config (GQA with n_kv_heads=1).
MQA is the extreme case of GQA where all query heads share a single KV head. This provides maximum memory savings but may reduce model capacity.
Example:
cfg := nn.MQA(4096, 32, 128) // 32 Q heads, 1 KV head
mqa := nn.NewGQA(cfg, backend)
type GroupedQueryAttention ¶ added in v0.5.0
type GroupedQueryAttention[B tensor.Backend] = nn.GroupedQueryAttention[B]
GroupedQueryAttention implements Grouped Query Attention (GQA).
GQA is a variant of multi-head attention where the number of key-value heads is less than the number of query heads. This provides significant memory savings for KV-cache during inference while maintaining model quality.
Architecture comparison:
MHA: n_q_heads = n_kv_heads (e.g., 32 Q, 32 K, 32 V)
GQA: n_q_heads > n_kv_heads (e.g., 32 Q, 8 K, 8 V) -> 4x memory savings
MQA: n_kv_heads = 1 (e.g., 32 Q, 1 K, 1 V) -> 32x memory savings (extreme)
GQA is used in LLaMA 2/3, Mistral, DeepSeek, Qwen2, Phi-3, and other modern LLMs.
func NewGQA ¶ added in v0.5.0
func NewGQA[B tensor.Backend](cfg GQAConfig, backend B) *GroupedQueryAttention[B]
NewGQA creates a new GroupedQueryAttention module.
Validates that:
- NQHeads is divisible by NKVHeads
- EmbedDim equals NQHeads * HeadDim
If HeadDim is 0, it's computed as EmbedDim / NQHeads.
Example:
// LLaMA 2 7B style config
cfg := nn.GQAConfig{
    EmbedDim:  4096,
    NQHeads:   32,
    NKVHeads:  8,
    HeadDim:   128,
    UseRoPE:   true,
    MaxSeqLen: 4096,
}
gqa := nn.NewGQA(cfg, backend)
output := gqa.Forward(x, cache, startPos)
type KVCache ¶ added in v0.4.0
KVCache is a public alias for internal KV cache implementation.
KVCache stores key-value pairs for efficient autoregressive generation. See internal/nn/kvcache.go for detailed documentation.
type LayerNorm ¶ added in v0.4.0
LayerNorm represents Layer Normalization.
func NewLayerNorm ¶ added in v0.4.0
NewLayerNorm creates a new LayerNorm layer.
Example:
backend := cpu.New()
norm := nn.NewLayerNorm(768, 1e-5, backend)
output := norm.Forward(input) // [..., 768] -> [..., 768]
type LearnedPositionalEmbedding ¶ added in v0.4.0
type LearnedPositionalEmbedding[B tensor.Backend] = nn.LearnedPositionalEmbedding[B]
LearnedPositionalEmbedding implements learned positional embeddings.
These embeddings are trainable parameters that are updated during training. Used in GPT-2 and other models.
Example:
backend := cpu.New()
pe := nn.NewLearnedPositionalEmbedding(512, 256, backend)
encodings := pe.Forward(100) // [1, 100, 256]

// Get parameters for optimizer
params := pe.Parameters()
func NewLearnedPositionalEmbedding ¶ added in v0.4.0
func NewLearnedPositionalEmbedding[B tensor.Backend](maxLen, dim int, backend B) *LearnedPositionalEmbedding[B]
NewLearnedPositionalEmbedding creates a new learned positional embedding layer.
The embeddings are initialized from a normal distribution N(0, 1).
Parameters:
- maxLen: Maximum sequence length
- dim: Embedding dimension
- backend: Computation backend
Example:
pe := nn.NewLearnedPositionalEmbedding(512, 256, backend)
type Linear ¶
Linear represents a fully connected (dense) layer.
func NewLinear ¶
func NewLinear[B tensor.Backend](inFeatures, outFeatures int, backend B, opts ...LinearOption) *Linear[B]
NewLinear creates a new linear layer with Xavier initialization.
Example:
backend := cpu.New()
layer := nn.NewLinear(784, 128, backend)

// Without bias (for LLaMA, attention projections, etc.)
lm_head := nn.NewLinear(hidden_size, vocab_size, backend, nn.WithBias(false))
type LinearOption ¶ added in v0.7.4
type LinearOption = nn.LinearOption
LinearOption is a functional option for configuring a Linear layer.
func WithBias ¶ added in v0.7.4
func WithBias(useBias bool) LinearOption
WithBias sets whether the Linear layer should use bias.
Default is true. Set to false for architectures like LLaMA that don't use bias.
Example:
// Linear layer without bias (LLaMA-style)
lm_head := nn.NewLinear(hidden_size, vocab_size, backend, nn.WithBias(false))

// Linear layer with bias (default)
layer := nn.NewLinear(784, 128, backend) // same as WithBias(true)
type MSELoss ¶
MSELoss represents the mean squared error loss for regression.
func NewMSELoss ¶
NewMSELoss creates a new MSE loss function.
Example:
backend := cpu.New()
criterion := nn.NewMSELoss(backend)
loss := criterion.Forward(predictions, targets)
type Module ¶
type Module[B tensor.Backend] interface {
    // Forward computes the output of the module given an input tensor.
    //
    // The input tensor should have the appropriate shape for this module.
    // For example, Linear expects [batch_size, in_features].
    //
    // Returns the output tensor with shape determined by the module type.
    Forward(input *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

    // Parameters returns all trainable parameters of this module.
    //
    // This includes weights, biases, and any nested module parameters.
    // Returns an empty slice for modules without trainable parameters
    // (e.g., activation functions).
    Parameters() []*Parameter[B]

    // StateDict returns a map of parameter names to raw tensors.
    //
    // This is used for serialization. The returned map contains all
    // trainable parameters with their names as keys.
    StateDict() map[string]*tensor.RawTensor

    // LoadStateDict loads parameters from a state dictionary.
    //
    // This is used for deserialization. The state dictionary should
    // contain parameter names as keys and RawTensors as values.
    //
    // Returns an error if a required parameter is missing or has wrong shape.
    LoadStateDict(stateDict map[string]*tensor.RawTensor) error
}
Module is the base interface for all neural network components.
Every NN module must implement:
- Forward: Compute output from input
- Parameters: Return all trainable parameters
- StateDict: Export parameters for serialization
- LoadStateDict: Import parameters from serialization
Modules can be composed to build complex architectures:
model := nn.NewSequential(
    nn.NewLinear(784, 128, backend),
    nn.NewReLU[Backend](),
    nn.NewLinear(128, 10, backend),
)
Type parameter B must satisfy the tensor.Backend interface.
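To make the contract concrete, here is a minimal sketch of a custom module: a hypothetical no-op Identity layer (the name and behavior are illustrative, not part of the package). It has no trainable parameters, so Parameters, StateDict, and LoadStateDict are trivial:

type Identity[B tensor.Backend] struct{}

// Forward passes the input through unchanged.
func (i *Identity[B]) Forward(input *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B] {
    return input
}

// Parameters returns no trainable parameters.
func (i *Identity[B]) Parameters() []*nn.Parameter[B] { return nil }

// StateDict exports an empty state: there is nothing to serialize.
func (i *Identity[B]) StateDict() map[string]*tensor.RawTensor {
    return map[string]*tensor.RawTensor{}
}

// LoadStateDict accepts any state dictionary, since no parameters are required.
func (i *Identity[B]) LoadStateDict(stateDict map[string]*tensor.RawTensor) error {
    return nil
}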
type MultiHeadAttention ¶ added in v0.4.0
type MultiHeadAttention[B tensor.Backend] = nn.MultiHeadAttention[B]
MultiHeadAttention represents the multi-head attention mechanism.
func NewMultiHeadAttention ¶ added in v0.4.0
func NewMultiHeadAttention[B tensor.Backend](embedDim, numHeads int, backend B) *MultiHeadAttention[B]
NewMultiHeadAttention creates a new multi-head attention module.
Parameters:
- embedDim: Total embedding dimension (must be divisible by numHeads)
- numHeads: Number of attention heads
- backend: Computation backend
Example:
backend := cpu.New()
mha := nn.NewMultiHeadAttention(768, 12, backend) // BERT-base config
output := mha.Forward(x, x, x, nil) // Self-attention
type Parameter ¶
Parameter represents a trainable parameter in a neural network.
Parameters are tensors that require gradient computation during training. They typically represent weights and biases of layers.
Example:
// Create a weight parameter
weight := nn.NewParameter("weight", weightTensor)
// Access the tensor
w := weight.Tensor()
// Get gradient after backward pass
grad := weight.Grad()
Methods:
Name() string
Returns the parameter name (e.g., "weight", "bias").
Tensor() *tensor.Tensor[float32, B]
Returns the parameter tensor.
Grad() *tensor.Tensor[float32, B]
Returns the gradient tensor (nil if not computed yet).
SetGrad(grad *tensor.Tensor[float32, B])
Sets the gradient tensor.
ZeroGrad()
Clears the gradient tensor.
Note: Parameter is implemented as a type alias because it is used as a return type in the Module interface. Go's type system requires exact type matches for interface implementations, so we cannot use an interface here.
func NewParameter ¶
NewParameter creates a new trainable parameter.
The parameter tensor should be initialized before creating the Parameter. Gradient will be allocated during the first backward pass.
Parameters:
- name: Descriptive name for this parameter (e.g., "linear1.weight")
- t: The initialized parameter tensor
Returns a new Parameter.
Example:
backend := cpu.New()
weights := tensor.Randn[float32](tensor.Shape{128, 784}, backend)
param := nn.NewParameter("layer1.weight", weights)
type RotaryEncoding ¶ added in v0.4.0
type RotaryEncoding[B tensor.Backend] = nn.RotaryEncoding[B]
RotaryEncoding implements Rotary Position Embedding (RoPE).
RoPE is used in modern LLMs like LLaMA, Mistral, DeepSeek, and Qwen. It applies a rotation to query and key embeddings based on their position.
Example:
backend := cpu.New()
config := nn.RotaryEncodingConfig{
    DModel:    64,
    MaxSeqLen: 2048,
    Theta:     10000.0,
}
rope := nn.NewRotaryEncoding(config, backend)
// Apply to attention queries/keys
q := tensor.Randn[float32](tensor.Shape{batch, heads, seq, 64}, backend)
q_rotated := rope.Forward(q)
func NewRotaryEncoding ¶ added in v0.4.0
func NewRotaryEncoding[B tensor.Backend](cfg RotaryEncodingConfig, backend B) *RotaryEncoding[B]
NewRotaryEncoding creates a new RoPE (Rotary Position Embedding) layer.
Pre-computes cosine and sine values for all positions and dimension pairs.
Parameters:
- cfg: Configuration (DModel, MaxSeqLen, Theta)
- backend: Computation backend
Example:
config := nn.RotaryEncodingConfig{
    DModel:    64,      // Head dimension
    MaxSeqLen: 2048,    // Max sequence length
    Theta:     10000.0,
}
rope := nn.NewRotaryEncoding(config, backend)
type RotaryEncodingConfig ¶ added in v0.4.0
type RotaryEncodingConfig = nn.RotaryEncodingConfig
RotaryEncodingConfig configures a RotaryEncoding layer.
type Sequential ¶
type Sequential[B tensor.Backend] = nn.Sequential[B]
Sequential represents a sequential container of modules.
func NewSequential ¶
func NewSequential[B tensor.Backend](modules ...Module[B]) *Sequential[B]
NewSequential creates a new sequential model.
Example:
backend := cpu.New()
model := nn.NewSequential(
    nn.NewLinear(784, 128, backend),
    nn.NewReLU(),
    nn.NewLinear(128, 10, backend),
)
type SiLU ¶ added in v0.3.0
SiLU represents the Sigmoid Linear Unit (SiLU/Swish) activation function. SiLU(x) = x * sigmoid(x).
type Sigmoid ¶
Sigmoid represents the Sigmoid activation function.
func NewSigmoid ¶
NewSigmoid creates a new Sigmoid activation layer.
Example:
sigmoid := nn.NewSigmoid()
type SinusoidalPositionalEncoding ¶ added in v0.4.0
type SinusoidalPositionalEncoding[B tensor.Backend] = nn.SinusoidalPositionalEncoding[B]
SinusoidalPositionalEncoding implements fixed sinusoidal positional encodings.
This is the original positional encoding from "Attention is All You Need" (Vaswani et al., 2017).
Example:
backend := cpu.New()
pe := nn.NewSinusoidalPositionalEncoding(512, 256, backend)
encodings := pe.Forward(100) // [1, 100, 256]

// Add to embeddings
embeddings = embeddings.Add(encodings)
func NewSinusoidalPositionalEncoding ¶ added in v0.4.0
func NewSinusoidalPositionalEncoding[B tensor.Backend](maxLen, dim int, backend B) *SinusoidalPositionalEncoding[B]
NewSinusoidalPositionalEncoding creates a new sinusoidal positional encoding layer.
Pre-computes all positional encodings up to maxLen using sine and cosine functions.
Parameters:
- maxLen: Maximum sequence length
- dim: Embedding dimension
- backend: Computation backend
Example:
pe := nn.NewSinusoidalPositionalEncoding(512, 256, backend)
type SwiGLUFFN ¶ added in v0.5.0
SwiGLUFFN implements a feed-forward network with SwiGLU activation.
Architecture (LLaMA-style):
hidden = SwiGLU(x @ W_up, x @ W_gate)
output = hidden @ W_down
Where SwiGLU(up, gate) = up * SiLU(gate).
This is more parameter-efficient than a standard FFN with GELU. LLaMA uses ffn_dim ≈ 8/3 * d_model (≈ 2.7x, vs 4x in a standard FFN), resulting in a similar total parameter count but better performance.
Example:
backend := autodiff.New(cpu.New())
cfg := nn.SwiGLUFFNConfig{
    EmbedDim: 4096,
    FFNDim:   11008, // LLaMA 7B
}
ffn := nn.NewSwiGLUFFN(cfg, backend)
output := ffn.Forward(x) // [batch, seq, 4096] -> [batch, seq, 4096]
func NewSwiGLUFFN ¶ added in v0.5.0
func NewSwiGLUFFN[B tensor.Backend](cfg SwiGLUFFNConfig, backend B) *SwiGLUFFN[B]
NewSwiGLUFFN creates a new SwiGLUFFN layer.
If GLUVariant is empty, defaults to "swiglu". If FFNDim is 0, it's computed as 8/3 * EmbedDim (LLaMA formula).
Example:
// LLaMA 7B FFN
ffn := nn.NewSwiGLUFFN(nn.SwiGLUFFNConfig{
    EmbedDim: 4096,
    FFNDim:   11008,
}, backend)
type SwiGLUFFNConfig ¶ added in v0.5.0
type SwiGLUFFNConfig = nn.SwiGLUFFNConfig
SwiGLUFFNConfig configures a SwiGLUFFN layer.
type TransformerBlock ¶ added in v0.4.0
type TransformerBlock[B tensor.Backend] = nn.TransformerBlock[B]
TransformerBlock is a complete Transformer Block with attention and FFN.
Architecture (Pre-Norm):
x → Norm → MHA → + → Norm → FFN → + → output
└───(residual)───┘ └──(residual)──┘
Used in all transformer models (GPT, BERT, LLaMA, etc.)
func NewTransformerBlock ¶ added in v0.4.0
func NewTransformerBlock[B tensor.Backend](config TransformerConfig, backend B) *TransformerBlock[B]
NewTransformerBlock creates a new Transformer Block.
Parameters:
- config: Configuration (embedDim, numHeads, ffnDim, etc.)
- backend: Computation backend
Example:
backend := autodiff.New(cpu.New())
config := nn.TransformerConfig{
    EmbedDim:   768,
    NumHeads:   12,
    FFNDim:     3072,
    NormFirst:  true,
    UseRMSNorm: true,
    NormEps:    1e-5,
}
block := nn.NewTransformerBlock(config, backend)
output := block.Forward(x, mask)
type TransformerConfig ¶ added in v0.4.0
type TransformerConfig = nn.TransformerConfig
TransformerConfig defines the configuration for a Transformer Block.
Fields:
- EmbedDim: Embedding dimension (d_model, e.g., 768 for GPT-2)
- NumHeads: Number of attention heads (e.g., 12 for GPT-2)
- FFNDim: FFN hidden dimension (typically 4 * EmbedDim)
- Dropout: Dropout rate (0 = no dropout, not yet implemented)
- NormFirst: true = Pre-Norm (LLaMA), false = Post-Norm (original)
- UseRMSNorm: true = RMSNorm (LLaMA), false = LayerNorm (BERT/GPT)
- NormEps: Normalization epsilon (1e-5 typical)
Example:
config := nn.TransformerConfig{
    EmbedDim:   768,
    NumHeads:   12,
    FFNDim:     3072,
    NormFirst:  true,
    UseRMSNorm: true,
    NormEps:    1e-5,
}
}