nn

package
v0.7.0
Published: Dec 9, 2025 License: Apache-2.0 Imports: 2 Imported by: 0

Documentation

Overview

Package nn provides neural network layers and building blocks.

This package contains:

  • Layers: Linear, Conv2D, MaxPool2D
  • Activations: ReLU, Sigmoid, Tanh
  • Loss functions: CrossEntropyLoss, MSELoss
  • Utilities: Sequential, Module interface, Parameter
  • Initialization: Xavier, Zeros, Ones, Randn

Basic Usage

import (
    "fmt"

    "github.com/born-ml/born/backend/cpu"
    "github.com/born-ml/born/nn"
    "github.com/born-ml/born/tensor" // assumed import path for the tensor package used below
)

func main() {
    backend := cpu.New()

    // Build a simple MLP
    model := nn.NewSequential(
        nn.NewLinear(784, 128, backend),
        nn.NewReLU(),
        nn.NewLinear(128, 10, backend),
    )

    // Forward pass on a random batch of one 784-dim input
    input := tensor.Randn[float32](tensor.Shape{1, 784}, backend)
    output := model.Forward(input)
    fmt.Println(output.Shape()) // [1, 10]
}

Layers

Linear: Fully connected layer with Xavier initialization

layer := nn.NewLinear(inFeatures, outFeatures, backend)

Conv2D: 2D convolutional layer using the im2col algorithm

conv := nn.NewConv2D(inChannels, outChannels, kernelH, kernelW, stride, padding, useBias, backend)

MaxPool2D: 2D max pooling layer

pool := nn.NewMaxPool2D(kernelSize, stride, backend)

Activations

Common activation functions:

relu := nn.NewReLU()
sigmoid := nn.NewSigmoid()
tanh := nn.NewTanh()

Loss Functions

CrossEntropyLoss: For classification tasks (numerically stable)

criterion := nn.NewCrossEntropyLoss(backend)
loss := criterion.Forward(logits, labels)

MSELoss: For regression tasks

criterion := nn.NewMSELoss(backend)
loss := criterion.Forward(predictions, targets)

Sequential Models

Build models by composing layers:

model := nn.NewSequential(
    nn.NewLinear(784, 256, backend),
    nn.NewReLU(),
    nn.NewLinear(256, 128, backend),
    nn.NewReLU(),
    nn.NewLinear(128, 10, backend),
)

Parameter Management

Access model parameters for optimization:

params := model.Parameters()
for _, param := range params {
    fmt.Println(param.Name(), param.Tensor().Shape())
}

The package also provides public wrappers for positional encodings (sinusoidal, learned, rotary, ALiBi).

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func Accuracy

func Accuracy[B tensor.Backend](
	logits *tensor.Tensor[float32, B],
	targets *tensor.Tensor[int32, B],
) float32

Accuracy computes the classification accuracy.

Example:

acc := nn.Accuracy(predictions, labels)
fmt.Printf("Accuracy: %.2f%%\n", acc*100)

func CausalMask added in v0.4.0

func CausalMask[B tensor.Backend](seqLen int, backend B) *tensor.Tensor[float32, B]

CausalMask creates a causal (autoregressive) attention mask.

In causal attention, each position can only attend to earlier positions. This is used in autoregressive models like GPT.

Returns a mask tensor where future positions are masked with -inf. Shape: [1, 1, seq_len, seq_len] (broadcastable to [batch, heads, seq, seq])

Example:

mask := nn.CausalMask(10, backend)  // [1, 1, 10, 10]
output, weights := nn.ScaledDotProductAttention(Q, K, V, mask, 0)

func CrossEntropyBackward

func CrossEntropyBackward[B tensor.Backend](
	logits *tensor.Tensor[float32, B],
	targets *tensor.Tensor[int32, B],
	backend B,
) *tensor.Tensor[float32, B]

CrossEntropyBackward computes the backward pass for cross-entropy loss.
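
Example (a sketch; the tensor shapes are assumed to match those accepted by CrossEntropyLoss.Forward):

grad := nn.CrossEntropyBackward(logits, labels, backend)  // gradient of the loss w.r.t. logits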

func GELUFunc added in v0.5.0

func GELUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

GELUFunc applies GELU (Gaussian Error Linear Unit) activation.

Uses the tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).

GELU is used in BERT, GPT-2, and other transformers.

Example:

output := nn.GELUFunc(input)

func GLU added in v0.5.0

func GLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

GLU applies Gated Linear Unit: GLU(x, gate) = x * sigmoid(gate).

GLU is the base gating mechanism used in various transformer FFN layers.

Parameters:

  • x: input tensor.
  • gate: gating tensor (same shape as x).

Returns: x * sigmoid(gate).

Example:

output := nn.GLU(x, gate)

func GeGLU added in v0.5.0

func GeGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

GeGLU applies GELU-Gated Linear Unit: GeGLU(x, gate) = x * GELU(gate).

GeGLU uses GELU activation for gating instead of SiLU. Used in some transformer variants for different activation characteristics.

Parameters:

  • x: input tensor.
  • gate: gating tensor.

Returns: x * GELU(gate).

Example:

output := nn.GeGLU(up, gate)

func Ones

func Ones[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]

Ones initializes a tensor with ones.

Example:

backend := cpu.New()
weights := nn.Ones(tensor.Shape{128, 784}, backend)

func Randn

func Randn[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]

Randn initializes a tensor with random values from N(0, 1).

Example:

backend := cpu.New()
weights := nn.Randn(tensor.Shape{128, 784}, backend)

func ReGLU added in v0.5.0

func ReGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

ReGLU applies ReLU-Gated Linear Unit: ReGLU(x, gate) = x * ReLU(gate).

ReGLU uses ReLU activation for gating. It's simpler but may have "dead neuron" issues compared to SwiGLU or GeGLU.

Parameters:

  • x: input tensor.
  • gate: gating tensor.

Returns: x * ReLU(gate).

Example:

output := nn.ReGLU(up, gate)

func ReLUFunc added in v0.5.0

func ReLUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

ReLUFunc applies the ReLU activation function element-wise. ReLU(x) = max(0, x).
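
Example:

output := nn.ReLUFunc(input)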

func RepeatKV added in v0.5.0

func RepeatKV[B tensor.Backend](kv *tensor.Tensor[float32, B], nRep int) *tensor.Tensor[float32, B]

RepeatKV broadcasts KV heads to match query heads count.

This is the key operation in GQA that allows fewer KV heads than Q heads. Each KV head is repeated nRep times to match the Q head count.

Input:  [batch, n_kv_heads, seq_len, head_dim]
Output: [batch, n_q_heads, seq_len, head_dim], where n_q_heads = n_kv_heads * nRep

Example:

// 8 KV heads -> 32 Q heads (nRep=4)
kv := tensor.Randn[float32](tensor.Shape{2, 8, 100, 128}, backend)
expanded := nn.RepeatKV(kv, 4)  // [2, 32, 100, 128]

If nRep=1 (standard MHA), returns the input unchanged.

func ScaledDotProductAttention added in v0.4.0

func ScaledDotProductAttention[B tensor.Backend](
	query, key, value *tensor.Tensor[float32, B],
	mask *tensor.Tensor[float32, B],
	scale float32,
) (*tensor.Tensor[float32, B], *tensor.Tensor[float32, B])

ScaledDotProductAttention computes attention scores using the scaled dot-product mechanism.

This is the core attention mechanism used in transformers.

Parameters:

  • query: Query tensor [batch, heads, seq_q, head_dim]
  • key: Key tensor [batch, heads, seq_k, head_dim]
  • value: Value tensor [batch, heads, seq_k, head_dim]
  • mask: Optional attention mask [batch, 1, seq_q, seq_k] or nil (additive mask, -inf for masked)
  • scale: Scaling factor (0 for auto-compute as 1/sqrt(head_dim))

Returns:

  • output: Attended values [batch, heads, seq_q, head_dim]
  • weights: Attention weights [batch, heads, seq_q, seq_k]

Example:

Q := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
K := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
V := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
output, weights := nn.ScaledDotProductAttention(Q, K, V, nil, 0)

func SiLUFunc added in v0.5.0

func SiLUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

SiLUFunc applies SiLU (Swish) activation: f(x) = x * sigmoid(x).

This is the functional version of SiLU activation, useful in GLU variants.

Example:

output := nn.SiLUFunc(input)

func SigmoidFunc added in v0.5.0

func SigmoidFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

SigmoidFunc applies the sigmoid activation function element-wise. Sigmoid(x) = 1 / (1 + exp(-x)).
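
Example:

output := nn.SigmoidFunc(input)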

func SwiGLU added in v0.5.0

func SwiGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

SwiGLU applies Swish-Gated Linear Unit: SwiGLU(x, gate) = x * SiLU(gate).

SwiGLU is used in modern LLMs like LLaMA, Mistral, and DeepSeek. It combines the input with SiLU-activated gate for better gradient flow.

Parameters:

  • x: input tensor (typically "up" projection).
  • gate: gating tensor (typically "gate" projection).

Returns: x * SiLU(gate) where SiLU(z) = z * sigmoid(z).

Example:

// In LLaMA-style FFN:
up := upProj.Forward(input)
gate := gateProj.Forward(input)
hidden := nn.SwiGLU(up, gate)

func Xavier

func Xavier[B tensor.Backend](fanIn, fanOut int, shape tensor.Shape, backend B) *tensor.Tensor[float32, B]

Xavier initializes a tensor using Xavier/Glorot initialization.

Example:

backend := cpu.New()
weights := nn.Xavier(784, 128, tensor.Shape{128, 784}, backend)

func Zeros

func Zeros[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]

Zeros initializes a tensor with zeros (for biases).

Example:

backend := cpu.New()
bias := nn.Zeros(tensor.Shape{128}, backend)

Types

type ALiBi added in v0.4.0

type ALiBi[B tensor.Backend] = nn.ALiBi[B]

ALiBi implements Attention with Linear Biases.

ALiBi adds a linear bias to attention scores based on the distance between positions. Used in BLOOM, MPT, and other models. Allows extrapolation to longer sequences.

Example:

backend := cpu.New()
alibi := nn.NewALiBi(8, backend)  // 8 attention heads
bias := alibi.GetBias(128)        // [1, 8, 128, 128]

// In attention:
scores := Q.BatchMatMul(K.T())
scores = scores.Add(bias)
weights := scores.Softmax(-1)

func NewALiBi added in v0.4.0

func NewALiBi[B tensor.Backend](numHeads int, backend B) *ALiBi[B]

NewALiBi creates a new ALiBi bias generator.

Computes slopes for each attention head using a geometric sequence.

Parameters:

  • numHeads: Number of attention heads
  • backend: Computation backend

Example:

alibi := nn.NewALiBi(8, backend)
bias := alibi.GetBias(64)  // Get bias for sequence length 64

type Conv2D

type Conv2D[B tensor.Backend] = nn.Conv2D[B]

Conv2D represents a 2D convolutional layer.

func NewConv2D

func NewConv2D[B tensor.Backend](
	inChannels, outChannels int,
	kernelH, kernelW int,
	stride, padding int,
	useBias bool,
	backend B,
) *Conv2D[B]

NewConv2D creates a new 2D convolutional layer.

Example:

backend := cpu.New()
conv := nn.NewConv2D(1, 32, 3, 3, 1, 1, true, backend)  // in_channels=1, out_channels=32, kernel=3x3, stride=1, padding=1, useBias=true

type CrossEntropyLoss

type CrossEntropyLoss[B tensor.Backend] = nn.CrossEntropyLoss[B]

CrossEntropyLoss represents the cross-entropy loss for classification.

func NewCrossEntropyLoss

func NewCrossEntropyLoss[B tensor.Backend](backend B) *CrossEntropyLoss[B]

NewCrossEntropyLoss creates a new cross-entropy loss function.

Example:

backend := cpu.New()
criterion := nn.NewCrossEntropyLoss(backend)
loss := criterion.Forward(logits, labels)

type Embedding added in v0.3.0

type Embedding[B tensor.Backend] = nn.Embedding[B]

Embedding represents a lookup table for embeddings.

func NewEmbedding added in v0.3.0

func NewEmbedding[B tensor.Backend](numEmbeddings, embeddingDim int, backend B) *Embedding[B]

NewEmbedding creates a new embedding layer.

Example:

backend := cpu.New()
embed := nn.NewEmbedding(50000, 768, backend)  // vocab=50000, dim=768
tokenIds := tensor.FromSlice([]int32{1, 5, 10}, tensor.Shape{1, 3}, backend)
embeddings := embed.Forward(tokenIds)  // [1, 3, 768]

func NewEmbeddingWithWeight added in v0.5.0

func NewEmbeddingWithWeight[B tensor.Backend](weight *tensor.Tensor[float32, B]) *Embedding[B]

NewEmbeddingWithWeight creates an embedding layer from an existing weight tensor.

This is useful when loading pre-trained embeddings.

Example:

weights := tensor.Randn[float32](tensor.Shape{50000, 768}, backend)
embed := nn.NewEmbeddingWithWeight(weights)

type FFN added in v0.4.0

type FFN[B tensor.Backend] = nn.FFN[B]

FFN (Feed-Forward Network) is a 2-layer MLP with SiLU activation.

Architecture:

FFN(x) = Linear2(SiLU(Linear1(x)))

Used inside TransformerBlock.

func NewFFN added in v0.4.0

func NewFFN[B tensor.Backend](embedDim, ffnDim int, backend B) *FFN[B]

NewFFN creates a new Feed-Forward Network.

Parameters:

  • embedDim: Input/output dimension
  • ffnDim: Hidden dimension (typically 4 * embedDim)
  • backend: Computation backend

Example:

ffn := nn.NewFFN(768, 3072, backend)
output := ffn.Forward(x)

type GQAConfig added in v0.5.0

type GQAConfig = nn.GQAConfig

GQAConfig configures a GroupedQueryAttention layer.

func MQA added in v0.5.0

func MQA(embedDim, nQHeads, headDim int) GQAConfig

MQA creates a Multi-Query Attention config (GQA with n_kv_heads=1).

MQA is the extreme case of GQA where all query heads share a single KV head. This provides maximum memory savings but may reduce model capacity.

Example:

cfg := nn.MQA(4096, 32, 128)  // 32 Q heads, 1 KV head
mqa := nn.NewGQA(cfg, backend)

type GroupedQueryAttention added in v0.5.0

type GroupedQueryAttention[B tensor.Backend] = nn.GroupedQueryAttention[B]

GroupedQueryAttention implements Grouped Query Attention (GQA).

GQA is a variant of multi-head attention where the number of key-value heads is less than the number of query heads. This provides significant memory savings for KV-cache during inference while maintaining model quality.

Architecture comparison:

MHA: n_q_heads = n_kv_heads (e.g., 32 Q, 32 K, 32 V)
GQA: n_q_heads > n_kv_heads (e.g., 32 Q, 8 K, 8 V) -> 4x memory savings
MQA: n_kv_heads = 1 (e.g., 32 Q, 1 K, 1 V) -> 32x memory savings (extreme)

GQA is used in LLaMA 2/3, Mistral, DeepSeek, Qwen2, Phi-3, and other modern LLMs.

func NewGQA added in v0.5.0

func NewGQA[B tensor.Backend](cfg GQAConfig, backend B) *GroupedQueryAttention[B]

NewGQA creates a new GroupedQueryAttention module.

Validates that:

  • NQHeads is divisible by NKVHeads
  • EmbedDim equals NQHeads * HeadDim

If HeadDim is 0, it's computed as EmbedDim / NQHeads.

Example:

// LLaMA 2 7B style config
cfg := nn.GQAConfig{
    EmbedDim:  4096,
    NQHeads:   32,
    NKVHeads:  8,
    HeadDim:   128,
    UseRoPE:   true,
    MaxSeqLen: 4096,
}
gqa := nn.NewGQA(cfg, backend)
output := gqa.Forward(x, cache, startPos)

type KVCache added in v0.4.0

type KVCache[B tensor.Backend] = nn.KVCache[B]

KVCache is a public alias for internal KV cache implementation.

KVCache stores key-value pairs for efficient autoregressive generation. See internal/nn/kvcache.go for detailed documentation.

func NewKVCache added in v0.4.0

func NewKVCache[B tensor.Backend](
	batchSize, numHeads, maxSeqLen, headDim int,
	backend B,
) *KVCache[B]

NewKVCache creates a new KV cache.

This is a convenience wrapper for the internal implementation. See internal/nn.NewKVCache for detailed documentation.
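
Example:

backend := cpu.New()
cache := nn.NewKVCache(1, 32, 2048, 128, backend)  // batch=1, heads=32, maxSeqLen=2048, headDim=128

The cache is then passed to attention forward calls, as in gqa.Forward(x, cache, startPos) above.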

type LayerNorm added in v0.4.0

type LayerNorm[B tensor.Backend] = nn.LayerNorm[B]

LayerNorm represents Layer Normalization.

func NewLayerNorm added in v0.4.0

func NewLayerNorm[B tensor.Backend](normalizedShape int, epsilon float32, backend B) *LayerNorm[B]

NewLayerNorm creates a new LayerNorm layer.

Example:

backend := cpu.New()
norm := nn.NewLayerNorm(768, 1e-5, backend)
output := norm.Forward(input)  // [..., 768] -> [..., 768]

type LearnedPositionalEmbedding added in v0.4.0

type LearnedPositionalEmbedding[B tensor.Backend] = nn.LearnedPositionalEmbedding[B]

LearnedPositionalEmbedding implements learned positional embeddings.

These embeddings are trainable parameters that are updated during training. Used in GPT-2 and other models.

Example:

backend := cpu.New()
pe := nn.NewLearnedPositionalEmbedding(512, 256, backend)
encodings := pe.Forward(100)  // [1, 100, 256]

// Get parameters for optimizer
params := pe.Parameters()

func NewLearnedPositionalEmbedding added in v0.4.0

func NewLearnedPositionalEmbedding[B tensor.Backend](maxLen, dim int, backend B) *LearnedPositionalEmbedding[B]

NewLearnedPositionalEmbedding creates a new learned positional embedding layer.

The embeddings are initialized from a normal distribution N(0, 1).

Parameters:

  • maxLen: Maximum sequence length
  • dim: Embedding dimension
  • backend: Computation backend

Example:

pe := nn.NewLearnedPositionalEmbedding(512, 256, backend)

type Linear

type Linear[B tensor.Backend] = nn.Linear[B]

Linear represents a fully connected (dense) layer.

func NewLinear

func NewLinear[B tensor.Backend](inFeatures, outFeatures int, backend B) *Linear[B]

NewLinear creates a new linear layer with Xavier initialization.

Example:

backend := cpu.New()
layer := nn.NewLinear(784, 128, backend)

type MSELoss

type MSELoss[B tensor.Backend] = nn.MSELoss[B]

MSELoss represents the mean squared error loss for regression.

func NewMSELoss

func NewMSELoss[B tensor.Backend](backend B) *MSELoss[B]

NewMSELoss creates a new MSE loss function.

Example:

backend := cpu.New()
criterion := nn.NewMSELoss(backend)
loss := criterion.Forward(predictions, targets)

type MaxPool2D

type MaxPool2D[B tensor.Backend] = nn.MaxPool2D[B]

MaxPool2D represents a 2D max pooling layer.

func NewMaxPool2D

func NewMaxPool2D[B tensor.Backend](kernelSize, stride int, backend B) *MaxPool2D[B]

NewMaxPool2D creates a new 2D max pooling layer.

Example:

backend := cpu.New()
pool := nn.NewMaxPool2D(2, 2, backend)  // kernel=2, stride=2

type Module

type Module[B tensor.Backend] = nn.Module[B]

Module interface defines the common interface for all neural network modules.
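
A minimal sketch of a custom module under an assumed method set (Forward and Parameters, inferred from how modules are used elsewhere on this page; the authoritative definition is in the package source):

// Residual wraps an inner module with a skip connection: y = x + f(x).
type Residual[B tensor.Backend] struct {
    inner nn.Module[B]
}

func (r *Residual[B]) Forward(x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B] {
    return x.Add(r.inner.Forward(x)) // Add as used in the ALiBi example below
}

func (r *Residual[B]) Parameters() []*nn.Parameter[B] {
    return r.inner.Parameters() // assumed return type, matching model.Parameters() usage
}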

type MultiHeadAttention added in v0.4.0

type MultiHeadAttention[B tensor.Backend] = nn.MultiHeadAttention[B]

MultiHeadAttention represents the multi-head attention mechanism.

func NewMultiHeadAttention added in v0.4.0

func NewMultiHeadAttention[B tensor.Backend](embedDim, numHeads int, backend B) *MultiHeadAttention[B]

NewMultiHeadAttention creates a new multi-head attention module.

Parameters:

  • embedDim: Total embedding dimension (must be divisible by numHeads)
  • numHeads: Number of attention heads
  • backend: Computation backend

Example:

backend := cpu.New()
mha := nn.NewMultiHeadAttention(768, 12, backend)  // BERT-base config
output := mha.Forward(x, x, x, nil)  // Self-attention

type Parameter

type Parameter[B tensor.Backend] = nn.Parameter[B]

Parameter represents a trainable parameter in a neural network.

func NewParameter

func NewParameter[B tensor.Backend](name string, t *tensor.Tensor[float32, B]) *Parameter[B]

NewParameter creates a new parameter with the given name and tensor.
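
Example:

backend := cpu.New()
weight := tensor.Randn[float32](tensor.Shape{128, 784}, backend)
param := nn.NewParameter("fc1.weight", weight)
fmt.Println(param.Name(), param.Tensor().Shape())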

type RMSNorm added in v0.3.0

type RMSNorm[B tensor.Backend] = nn.RMSNorm[B]

RMSNorm represents Root Mean Square Layer Normalization.

func NewRMSNorm added in v0.3.0

func NewRMSNorm[B tensor.Backend](dModel int, epsilon float32, backend B) *RMSNorm[B]

NewRMSNorm creates a new RMSNorm layer.

Example:

backend := cpu.New()
norm := nn.NewRMSNorm(768, 1e-5, backend)
output := norm.Forward(input)  // [..., 768] -> [..., 768]

type ReLU

type ReLU[B tensor.Backend] = nn.ReLU[B]

ReLU represents the Rectified Linear Unit activation function.

func NewReLU

func NewReLU[B tensor.Backend]() *ReLU[B]

NewReLU creates a new ReLU activation layer.

Example:

relu := nn.NewReLU()

type RotaryEncoding added in v0.4.0

type RotaryEncoding[B tensor.Backend] = nn.RotaryEncoding[B]

RotaryEncoding implements Rotary Position Embedding (RoPE).

RoPE is used in modern LLMs like LLaMA, Mistral, DeepSeek, and Qwen. It applies a rotation to query and key embeddings based on their position.

Example:

backend := cpu.New()
config := nn.RotaryEncodingConfig{
    DModel:    64,
    MaxSeqLen: 2048,
    Theta:     10000.0,
}
rope := nn.NewRotaryEncoding(config, backend)

// Apply to attention queries/keys
q := tensor.Randn[float32](tensor.Shape{batch, heads, seq, 64}, backend)
rotated := rope.Forward(q)

func NewRotaryEncoding added in v0.4.0

func NewRotaryEncoding[B tensor.Backend](cfg RotaryEncodingConfig, backend B) *RotaryEncoding[B]

NewRotaryEncoding creates a new RoPE (Rotary Position Embedding) layer.

Pre-computes cosine and sine values for all positions and dimension pairs.

Parameters:

  • cfg: Configuration (DModel, MaxSeqLen, Theta)
  • backend: Computation backend

Example:

config := nn.RotaryEncodingConfig{
    DModel:    64,     // Head dimension
    MaxSeqLen: 2048,   // Max sequence length
    Theta:     10000.0,
}
rope := nn.NewRotaryEncoding(config, backend)

type RotaryEncodingConfig added in v0.4.0

type RotaryEncodingConfig = nn.RotaryEncodingConfig

RotaryEncodingConfig configures a RotaryEncoding layer.

type Sequential

type Sequential[B tensor.Backend] = nn.Sequential[B]

Sequential represents a sequential container of modules.

func NewSequential

func NewSequential[B tensor.Backend](modules ...Module[B]) *Sequential[B]

NewSequential creates a new sequential model.

Example:

backend := cpu.New()
model := nn.NewSequential(
    nn.NewLinear(784, 128, backend),
    nn.NewReLU(),
    nn.NewLinear(128, 10, backend),
)

type SiLU added in v0.3.0

type SiLU[B tensor.Backend] = nn.SiLU[B]

SiLU represents the Sigmoid Linear Unit (SiLU/Swish) activation function. SiLU(x) = x * sigmoid(x).

func NewSiLU added in v0.3.0

func NewSiLU[B tensor.Backend]() *SiLU[B]

NewSiLU creates a new SiLU activation layer.

Example:

silu := nn.NewSiLU()
output := silu.Forward(input)

type Sigmoid

type Sigmoid[B tensor.Backend] = nn.Sigmoid[B]

Sigmoid represents the Sigmoid activation function.

func NewSigmoid

func NewSigmoid[B tensor.Backend]() *Sigmoid[B]

NewSigmoid creates a new Sigmoid activation layer.

Example:

sigmoid := nn.NewSigmoid()

type SinusoidalPositionalEncoding added in v0.4.0

type SinusoidalPositionalEncoding[B tensor.Backend] = nn.SinusoidalPositionalEncoding[B]

SinusoidalPositionalEncoding implements fixed sinusoidal positional encodings.

This is the original positional encoding from "Attention is All You Need" (Vaswani et al., 2017).

Example:

backend := cpu.New()
pe := nn.NewSinusoidalPositionalEncoding(512, 256, backend)
encodings := pe.Forward(100)  // [1, 100, 256]

// Add to embeddings
embeddings = embeddings.Add(encodings)

func NewSinusoidalPositionalEncoding added in v0.4.0

func NewSinusoidalPositionalEncoding[B tensor.Backend](maxLen, dim int, backend B) *SinusoidalPositionalEncoding[B]

NewSinusoidalPositionalEncoding creates a new sinusoidal positional encoding layer.

Pre-computes all positional encodings up to maxLen using sine and cosine functions.

Parameters:

  • maxLen: Maximum sequence length
  • dim: Embedding dimension
  • backend: Computation backend

Example:

pe := nn.NewSinusoidalPositionalEncoding(512, 256, backend)

type SwiGLUFFN added in v0.5.0

type SwiGLUFFN[B tensor.Backend] = nn.SwiGLUFFN[B]

SwiGLUFFN implements a feed-forward network with SwiGLU activation.

Architecture (LLaMA-style):

hidden = SwiGLU(x @ W_up, x @ W_gate)
output = hidden @ W_down

Where SwiGLU(up, gate) = up * SiLU(gate).

This is more parameter-efficient than a standard FFN with GELU. LLaMA uses ffn_dim ≈ 8/3 * d_model (vs 4 * d_model in a standard FFN), giving a similar total parameter count with better performance.

Example:

backend := autodiff.New(cpu.New())
cfg := nn.SwiGLUFFNConfig{
    EmbedDim: 4096,
    FFNDim:   11008,  // LLaMA 7B
}
ffn := nn.NewSwiGLUFFN(cfg, backend)
output := ffn.Forward(x)  // [batch, seq, 4096] -> [batch, seq, 4096]

func NewSwiGLUFFN added in v0.5.0

func NewSwiGLUFFN[B tensor.Backend](cfg SwiGLUFFNConfig, backend B) *SwiGLUFFN[B]

NewSwiGLUFFN creates a new SwiGLUFFN layer.

If GLUVariant is empty, defaults to "swiglu". If FFNDim is 0, it's computed as 8/3 * EmbedDim (LLaMA formula).

Example:

// LLaMA 7B FFN
ffn := nn.NewSwiGLUFFN(nn.SwiGLUFFNConfig{
    EmbedDim: 4096,
    FFNDim:   11008,
}, backend)

type SwiGLUFFNConfig added in v0.5.0

type SwiGLUFFNConfig = nn.SwiGLUFFNConfig

SwiGLUFFNConfig configures a SwiGLUFFN layer.

type Tanh

type Tanh[B tensor.Backend] = nn.Tanh[B]

Tanh represents the Tanh activation function.

func NewTanh

func NewTanh[B tensor.Backend]() *Tanh[B]

NewTanh creates a new Tanh activation layer.

Example:

tanh := nn.NewTanh()

type TransformerBlock added in v0.4.0

type TransformerBlock[B tensor.Backend] = nn.TransformerBlock[B]

TransformerBlock is a complete Transformer Block with attention and FFN.

Architecture (Pre-Norm):

x → Norm → MHA → + → Norm → FFN → + → output
         ↑_______|         ↑_______|
       (residual)        (residual)

Used in all transformer models (GPT, BERT, LLaMA, etc.).

func NewTransformerBlock added in v0.4.0

func NewTransformerBlock[B tensor.Backend](config TransformerConfig, backend B) *TransformerBlock[B]

NewTransformerBlock creates a new Transformer Block.

Parameters:

  • config: Configuration (embedDim, numHeads, ffnDim, etc.)
  • backend: Computation backend

Example:

backend := autodiff.New(cpu.New())
config := nn.TransformerConfig{
    EmbedDim:   768,
    NumHeads:   12,
    FFNDim:     3072,
    NormFirst:  true,
    UseRMSNorm: true,
    NormEps:    1e-5,
}
block := nn.NewTransformerBlock(config, backend)
output := block.Forward(x, mask)

type TransformerConfig added in v0.4.0

type TransformerConfig = nn.TransformerConfig

TransformerConfig defines the configuration for a Transformer Block.

Fields:

  • EmbedDim: Embedding dimension (d_model, e.g., 768 for GPT-2)
  • NumHeads: Number of attention heads (e.g., 12 for GPT-2)
  • FFNDim: FFN hidden dimension (typically 4 * EmbedDim)
  • Dropout: Dropout rate (0 = no dropout, not yet implemented)
  • NormFirst: true = Pre-Norm (LLaMA), false = Post-Norm (original)
  • UseRMSNorm: true = RMSNorm (LLaMA), false = LayerNorm (BERT/GPT)
  • NormEps: Normalization epsilon (1e-5 typical)

Example:

config := nn.TransformerConfig{
    EmbedDim:   768,
    NumHeads:   12,
    FFNDim:     3072,
    NormFirst:  true,
    UseRMSNorm: true,
    NormEps:    1e-5,
}
