Package nn

v0.7.6 (not the latest version of this module) · Published: Jan 3, 2026 · License: Apache-2.0 · Imports: 2 · Imported by: 0

Documentation

Overview

Package nn provides neural network layers and building blocks.

This package contains:

  • Layers: Linear, Conv2D, MaxPool2D
  • Activations: ReLU, Sigmoid, Tanh
  • Loss functions: CrossEntropyLoss, MSELoss
  • Utilities: Sequential, Module interface, Parameter
  • Initialization: Xavier, Zeros, Ones, Randn

Basic Usage

import (
    "github.com/born-ml/born/nn"
    "github.com/born-ml/born/backend/cpu"
    "github.com/born-ml/born/tensor" // import path assumed from the module layout
)

func main() {
    backend := cpu.New()

    // Build a simple MLP
    model := nn.NewSequential(
        nn.NewLinear(784, 128, backend),
        nn.NewReLU(),
        nn.NewLinear(128, 10, backend),
    )

    // Forward pass on a batch of one flattened 28x28 input
    input := tensor.Randn[float32](tensor.Shape{1, 784}, backend)
    output := model.Forward(input)
    _ = output // [1, 10] logits
}

Layers

Linear: Fully connected layer with Xavier initialization

layer := nn.NewLinear(inFeatures, outFeatures, backend)

Conv2D: 2D convolutional layer (implemented via im2col)

conv := nn.NewConv2D(inChannels, outChannels, kernelH, kernelW, stride, padding, useBias, backend)

MaxPool2D: 2D max pooling layer

pool := nn.NewMaxPool2D(kernelSize, stride, backend)

Activations

Common activation functions:

relu := nn.NewReLU()
sigmoid := nn.NewSigmoid()
tanh := nn.NewTanh()

Loss Functions

CrossEntropyLoss: For classification tasks (numerically stable)

criterion := nn.NewCrossEntropyLoss(backend)
loss := criterion.Forward(logits, labels)

MSELoss: For regression tasks

criterion := nn.NewMSELoss(backend)
loss := criterion.Forward(predictions, targets)

Sequential Models

Build models by composing layers:

model := nn.NewSequential(
    nn.NewLinear(784, 256, backend),
    nn.NewReLU(),
    nn.NewLinear(256, 128, backend),
    nn.NewReLU(),
    nn.NewLinear(128, 10, backend),
)

Parameter Management

Access model parameters for optimization:

params := model.Parameters()
for _, param := range params {
    fmt.Println(param.Name(), param.Tensor().Shape())
}

The package also provides public wrappers for positional encodings: sinusoidal, learned, rotary (RoPE), and ALiBi.

Constants

This section is empty.

Variables

This section is empty.

Functions

func Accuracy

func Accuracy[B tensor.Backend](
	logits *tensor.Tensor[float32, B],
	targets *tensor.Tensor[int32, B],
) float32

Accuracy computes the classification accuracy.

Example:

acc := nn.Accuracy(predictions, labels)
fmt.Printf("Accuracy: %.2f%%\n", acc*100)

func CausalMask added in v0.4.0

func CausalMask[B tensor.Backend](seqLen int, backend B) *tensor.Tensor[float32, B]

CausalMask creates a causal (autoregressive) attention mask.

In causal attention, each position can only attend to earlier positions. This is used in autoregressive models like GPT.

Returns a mask tensor where future positions are masked with -inf. Shape: [1, 1, seq_len, seq_len] (broadcastable to [batch, heads, seq, seq])

Example:

mask := nn.CausalMask(10, backend)  // [1, 1, 10, 10]
output, weights := nn.ScaledDotProductAttention(Q, K, V, mask, 0)

func CrossEntropyBackward

func CrossEntropyBackward[B tensor.Backend](
	logits *tensor.Tensor[float32, B],
	targets *tensor.Tensor[int32, B],
	backend B,
) *tensor.Tensor[float32, B]

CrossEntropyBackward computes the backward pass for cross-entropy loss.
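
A minimal usage sketch, assuming the same shape conventions as CrossEntropyLoss (logits [batch, num_classes], int32 class-index targets [batch]):

grad := nn.CrossEntropyBackward(logits, targets, backend)  // gradient w.r.t. logits, same shape as logits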

func GELUFunc added in v0.5.0

func GELUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

GELUFunc applies GELU (Gaussian Error Linear Unit) activation.

Uses the tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).

GELU is used in BERT, GPT-2, and other transformers.

Example:

output := nn.GELUFunc(input)

func GLU added in v0.5.0

func GLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

GLU applies Gated Linear Unit: GLU(x, gate) = x * sigmoid(gate).

GLU is the base gating mechanism used in various transformer FFN layers.

Parameters:

  • x: input tensor.
  • gate: gating tensor (same shape as x).

Returns: x * sigmoid(gate).

Example:

output := nn.GLU(x, gate)

func GeGLU added in v0.5.0

func GeGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

GeGLU applies GELU-Gated Linear Unit: GeGLU(x, gate) = x * GELU(gate).

GeGLU uses GELU activation for gating instead of SiLU. Used in some transformer variants for different activation characteristics.

Parameters:

  • x: input tensor.
  • gate: gating tensor.

Returns: x * GELU(gate).

Example:

output := nn.GeGLU(up, gate)

func Ones

func Ones[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]

Ones initializes a tensor with ones.

Example:

backend := cpu.New()
weights := nn.Ones(tensor.Shape{128, 784}, backend)

func Randn

func Randn[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]

Randn initializes a tensor with random values from N(0, 1).

Example:

backend := cpu.New()
weights := nn.Randn(tensor.Shape{128, 784}, backend)

func ReGLU added in v0.5.0

func ReGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

ReGLU applies ReLU-Gated Linear Unit: ReGLU(x, gate) = x * ReLU(gate).

ReGLU uses ReLU activation for gating. It's simpler but may have "dead neuron" issues compared to SwiGLU or GeGLU.

Parameters:

  • x: input tensor.
  • gate: gating tensor.

Returns: x * ReLU(gate).

Example:

output := nn.ReGLU(up, gate)

func ReLUFunc added in v0.5.0

func ReLUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

ReLUFunc applies the ReLU activation function element-wise. ReLU(x) = max(0, x).
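
Example (following the pattern of the other functional activations, such as GELUFunc):

output := nn.ReLUFunc(input)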

func RepeatKV added in v0.5.0

func RepeatKV[B tensor.Backend](kv *tensor.Tensor[float32, B], nRep int) *tensor.Tensor[float32, B]

RepeatKV broadcasts KV heads to match query heads count.

This is the key operation in GQA that allows fewer KV heads than Q heads. Each KV head is repeated nRep times to match the Q head count.

Input:  [batch, n_kv_heads, seq_len, head_dim]
Output: [batch, n_q_heads, seq_len, head_dim], where n_q_heads = n_kv_heads * nRep

Example:

// 8 KV heads -> 32 Q heads (nRep=4)
kv := tensor.Randn[float32](tensor.Shape{2, 8, 100, 128}, backend)
expanded := nn.RepeatKV(kv, 4)  // [2, 32, 100, 128]

If nRep=1 (standard MHA), returns the input unchanged.

func ScaledDotProductAttention added in v0.4.0

func ScaledDotProductAttention[B tensor.Backend](
	query, key, value *tensor.Tensor[float32, B],
	mask *tensor.Tensor[float32, B],
	scale float32,
) (*tensor.Tensor[float32, B], *tensor.Tensor[float32, B])

ScaledDotProductAttention computes attention scores using the scaled dot-product mechanism.

This is the core attention mechanism used in transformers.

Parameters:

  • query: Query tensor [batch, heads, seq_q, head_dim]
  • key: Key tensor [batch, heads, seq_k, head_dim]
  • value: Value tensor [batch, heads, seq_k, head_dim]
  • mask: Optional attention mask [batch, 1, seq_q, seq_k] or nil (additive mask, -inf for masked)
  • scale: Scaling factor (0 for auto-compute as 1/sqrt(head_dim))

Returns:

  • output: Attended values [batch, heads, seq_q, head_dim]
  • weights: Attention weights [batch, heads, seq_q, seq_k]

Example:

Q := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
K := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
V := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
output, weights := nn.ScaledDotProductAttention(Q, K, V, nil, 0)

func SiLUFunc added in v0.5.0

func SiLUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

SiLUFunc applies SiLU (Swish) activation: f(x) = x * sigmoid(x).

This is the functional version of SiLU activation, useful in GLU variants.

Example:

output := nn.SiLUFunc(input)

func SigmoidFunc added in v0.5.0

func SigmoidFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

SigmoidFunc applies the sigmoid activation function element-wise. Sigmoid(x) = 1 / (1 + exp(-x)).
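
Example (following the pattern of the other functional activations):

output := nn.SigmoidFunc(input)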

func SwiGLU added in v0.5.0

func SwiGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

SwiGLU applies Swish-Gated Linear Unit: SwiGLU(x, gate) = x * SiLU(gate).

SwiGLU is used in modern LLMs like LLaMA, Mistral, and DeepSeek. It combines the input with a SiLU-activated gate for better gradient flow.

Parameters:

  • x: input tensor (typically "up" projection).
  • gate: gating tensor (typically "gate" projection).

Returns: x * SiLU(gate) where SiLU(z) = z * sigmoid(z).

Example:

// In LLaMA-style FFN:
up := upProj.Forward(input)
gate := gateProj.Forward(input)
hidden := nn.SwiGLU(up, gate)

func Xavier

func Xavier[B tensor.Backend](fanIn, fanOut int, shape tensor.Shape, backend B) *tensor.Tensor[float32, B]

Xavier initializes a tensor using Xavier/Glorot initialization.

Example:

backend := cpu.New()
weights := nn.Xavier(784, 128, tensor.Shape{128, 784}, backend)

func Zeros

func Zeros[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]

Zeros initializes a tensor with zeros (for biases).

Example:

backend := cpu.New()
bias := nn.Zeros(tensor.Shape{128}, backend)

Types

type ALiBi added in v0.4.0

type ALiBi[B tensor.Backend] = nn.ALiBi[B]

ALiBi implements Attention with Linear Biases.

ALiBi adds a linear bias to attention scores based on the distance between positions. Used in BLOOM, MPT, and other models. Allows extrapolation to longer sequences.

Example:

backend := cpu.New()
alibi := nn.NewALiBi(8, backend)  // 8 attention heads
bias := alibi.GetBias(128)        // [1, 8, 128, 128]

// In attention:
scores := Q.BatchMatMul(K.T())
scores = scores.Add(bias)
weights := scores.Softmax(-1)

func NewALiBi added in v0.4.0

func NewALiBi[B tensor.Backend](numHeads int, backend B) *ALiBi[B]

NewALiBi creates a new ALiBi bias generator.

Computes slopes for each attention head using a geometric sequence.
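In the ALiBi paper, the slope for head i of n heads is 2^(-8i/n); assuming this package follows the paper, NewALiBi(8, backend) would produce slopes 2^-1, 2^-2, ..., 2^-8.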

Parameters:

  • numHeads: Number of attention heads
  • backend: Computation backend

Example:

alibi := nn.NewALiBi(8, backend)
bias := alibi.GetBias(64)  // Get bias for sequence length 64

type Conv2D

type Conv2D[B tensor.Backend] = nn.Conv2D[B]

Conv2D represents a 2D convolutional layer.

func NewConv2D

func NewConv2D[B tensor.Backend](
	inChannels, outChannels int,
	kernelH, kernelW int,
	stride, padding int,
	useBias bool,
	backend B,
) *Conv2D[B]

NewConv2D creates a new 2D convolutional layer.

Example:

backend := cpu.New()
conv := nn.NewConv2D(1, 32, 3, 3, 1, 1, true, backend)  // in_channels=1, out_channels=32, kernel=3x3, stride=1, padding=1, useBias=true

type CrossEntropyLoss

type CrossEntropyLoss[B tensor.Backend] = nn.CrossEntropyLoss[B]

CrossEntropyLoss represents the cross-entropy loss for classification.

func NewCrossEntropyLoss

func NewCrossEntropyLoss[B tensor.Backend](backend B) *CrossEntropyLoss[B]

NewCrossEntropyLoss creates a new cross-entropy loss function.

Example:

backend := cpu.New()
criterion := nn.NewCrossEntropyLoss(backend)
loss := criterion.Forward(logits, labels)

type Embedding added in v0.3.0

type Embedding[B tensor.Backend] = nn.Embedding[B]

Embedding represents a lookup table for embeddings.

func NewEmbedding added in v0.3.0

func NewEmbedding[B tensor.Backend](numEmbeddings, embeddingDim int, backend B) *Embedding[B]

NewEmbedding creates a new embedding layer.

Example:

backend := cpu.New()
embed := nn.NewEmbedding(50000, 768, backend)  // vocab=50000, dim=768
tokenIds := tensor.FromSlice([]int32{1, 5, 10}, tensor.Shape{1, 3}, backend)
embeddings := embed.Forward(tokenIds)  // [1, 3, 768]

func NewEmbeddingWithWeight added in v0.5.0

func NewEmbeddingWithWeight[B tensor.Backend](weight *tensor.Tensor[float32, B]) *Embedding[B]

NewEmbeddingWithWeight creates an embedding layer from an existing weight tensor.

This is useful when loading pre-trained embeddings.

Example:

weights := tensor.Randn[float32](tensor.Shape{50000, 768}, backend)
embed := nn.NewEmbeddingWithWeight(weights)

type FFN added in v0.4.0

type FFN[B tensor.Backend] = nn.FFN[B]

FFN (Feed-Forward Network) is a 2-layer MLP with SiLU activation.

Architecture:

FFN(x) = Linear2(SiLU(Linear1(x)))

Used inside TransformerBlock.

func NewFFN added in v0.4.0

func NewFFN[B tensor.Backend](embedDim, ffnDim int, backend B) *FFN[B]

NewFFN creates a new Feed-Forward Network.

Parameters:

  • embedDim: Input/output dimension
  • ffnDim: Hidden dimension (typically 4 * embedDim)
  • backend: Computation backend

Example:

ffn := nn.NewFFN(768, 3072, backend)
output := ffn.Forward(x)

type GQAConfig added in v0.5.0

type GQAConfig = nn.GQAConfig

GQAConfig configures a GroupedQueryAttention layer.

func MQA added in v0.5.0

func MQA(embedDim, nQHeads, headDim int) GQAConfig

MQA creates a Multi-Query Attention config (GQA with n_kv_heads=1).

MQA is the extreme case of GQA where all query heads share a single KV head. This provides maximum memory savings but may reduce model capacity.

Example:

cfg := nn.MQA(4096, 32, 128)  // 32 Q heads, 1 KV head
mqa := nn.NewGQA(cfg, backend)

type GroupedQueryAttention added in v0.5.0

type GroupedQueryAttention[B tensor.Backend] = nn.GroupedQueryAttention[B]

GroupedQueryAttention implements Grouped Query Attention (GQA).

GQA is a variant of multi-head attention where the number of key-value heads is less than the number of query heads. This provides significant memory savings for KV-cache during inference while maintaining model quality.

Architecture comparison:

MHA: n_q_heads = n_kv_heads (e.g., 32 Q, 32 K, 32 V)
GQA: n_q_heads > n_kv_heads (e.g., 32 Q, 8 K, 8 V) -> 4x memory savings
MQA: n_kv_heads = 1 (e.g., 32 Q, 1 K, 1 V) -> 32x memory savings (extreme)

GQA is used in LLaMA 2/3, Mistral, DeepSeek, Qwen2, Phi-3, and other modern LLMs.

func NewGQA added in v0.5.0

func NewGQA[B tensor.Backend](cfg GQAConfig, backend B) *GroupedQueryAttention[B]

NewGQA creates a new GroupedQueryAttention module.

Validates that:

  • NQHeads is divisible by NKVHeads
  • EmbedDim equals NQHeads * HeadDim

If HeadDim is 0, it's computed as EmbedDim / NQHeads.

Example:

// LLaMA 2 7B style config
cfg := nn.GQAConfig{
    EmbedDim:  4096,
    NQHeads:   32,
    NKVHeads:  8,
    HeadDim:   128,
    UseRoPE:   true,
    MaxSeqLen: 4096,
}
gqa := nn.NewGQA(cfg, backend)
output := gqa.Forward(x, cache, startPos)

type KVCache added in v0.4.0

type KVCache[B tensor.Backend] = nn.KVCache[B]

KVCache is a public alias for the internal KV cache implementation.

KVCache stores key-value pairs for efficient autoregressive generation. See internal/nn/kvcache.go for detailed documentation.

func NewKVCache added in v0.4.0

func NewKVCache[B tensor.Backend](
	batchSize, numHeads, maxSeqLen, headDim int,
	backend B,
) *KVCache[B]

NewKVCache creates a new KV cache.

This is a convenience wrapper for the internal implementation. See internal/nn.NewKVCache for detailed documentation.
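
Example (a minimal sketch; the sizes are illustrative, and gqa, x, and startPos come from the NewGQA example above):

backend := cpu.New()
cache := nn.NewKVCache(1, 8, 2048, 128, backend)  // batch=1, kvHeads=8, maxSeqLen=2048, headDim=128
output := gqa.Forward(x, cache, startPos)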

type LayerNorm added in v0.4.0

type LayerNorm[B tensor.Backend] = nn.LayerNorm[B]

LayerNorm represents Layer Normalization.

func NewLayerNorm added in v0.4.0

func NewLayerNorm[B tensor.Backend](normalizedShape int, epsilon float32, backend B) *LayerNorm[B]

NewLayerNorm creates a new LayerNorm layer.

Example:

backend := cpu.New()
norm := nn.NewLayerNorm(768, 1e-5, backend)
output := norm.Forward(input)  // [..., 768] -> [..., 768]

type LearnedPositionalEmbedding added in v0.4.0

type LearnedPositionalEmbedding[B tensor.Backend] = nn.LearnedPositionalEmbedding[B]

LearnedPositionalEmbedding implements learned positional embeddings.

These embeddings are trainable parameters that are updated during training. Used in GPT-2 and other models.

Example:

backend := cpu.New()
pe := nn.NewLearnedPositionalEmbedding(512, 256, backend)
encodings := pe.Forward(100)  // [1, 100, 256]

// Get parameters for optimizer
params := pe.Parameters()

func NewLearnedPositionalEmbedding added in v0.4.0

func NewLearnedPositionalEmbedding[B tensor.Backend](maxLen, dim int, backend B) *LearnedPositionalEmbedding[B]

NewLearnedPositionalEmbedding creates a new learned positional embedding layer.

The embeddings are initialized from a normal distribution N(0, 1).

Parameters:

  • maxLen: Maximum sequence length
  • dim: Embedding dimension
  • backend: Computation backend

Example:

pe := nn.NewLearnedPositionalEmbedding(512, 256, backend)

type Linear

type Linear[B tensor.Backend] = nn.Linear[B]

Linear represents a fully connected (dense) layer.

func NewLinear

func NewLinear[B tensor.Backend](inFeatures, outFeatures int, backend B, opts ...LinearOption) *Linear[B]

NewLinear creates a new linear layer with Xavier initialization.

Example:

backend := cpu.New()
layer := nn.NewLinear(784, 128, backend)

// Without bias (for LLaMA, attention projections, etc.)
lmHead := nn.NewLinear(hiddenSize, vocabSize, backend, nn.WithBias(false))

type LinearOption added in v0.7.4

type LinearOption = nn.LinearOption

LinearOption is a functional option for configuring a Linear layer.

func WithBias added in v0.7.4

func WithBias(useBias bool) LinearOption

WithBias sets whether the Linear layer should use bias.

Default is true. Set to false for architectures like LLaMA that don't use bias.

Example:

// Linear layer without bias (LLaMA-style)
lmHead := nn.NewLinear(hiddenSize, vocabSize, backend, nn.WithBias(false))

// Linear layer with bias (default)
layer := nn.NewLinear(784, 128, backend)  // same as WithBias(true)

type MSELoss

type MSELoss[B tensor.Backend] = nn.MSELoss[B]

MSELoss represents the mean squared error loss for regression.

func NewMSELoss

func NewMSELoss[B tensor.Backend](backend B) *MSELoss[B]

NewMSELoss creates a new MSE loss function.

Example:

backend := cpu.New()
criterion := nn.NewMSELoss(backend)
loss := criterion.Forward(predictions, targets)

type MaxPool2D

type MaxPool2D[B tensor.Backend] = nn.MaxPool2D[B]

MaxPool2D represents a 2D max pooling layer.

func NewMaxPool2D

func NewMaxPool2D[B tensor.Backend](kernelSize, stride int, backend B) *MaxPool2D[B]

NewMaxPool2D creates a new 2D max pooling layer.

Example:

backend := cpu.New()
pool := nn.NewMaxPool2D(2, 2, backend)  // kernel=2, stride=2

type Module

type Module[B tensor.Backend] = nn.Module[B]

Module interface defines the common interface for all neural network modules.
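
The full method set is not shown on this page; the sketch below is inferred from how modules are used elsewhere in these docs (Forward on Sequential, Parameters for optimizers) and is an assumption, not the definitive definition:

type Module[B tensor.Backend] interface {
	Forward(x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
	Parameters() []*Parameter[B]
}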

type MultiHeadAttention added in v0.4.0

type MultiHeadAttention[B tensor.Backend] = nn.MultiHeadAttention[B]

MultiHeadAttention represents the multi-head attention mechanism.

func NewMultiHeadAttention added in v0.4.0

func NewMultiHeadAttention[B tensor.Backend](embedDim, numHeads int, backend B) *MultiHeadAttention[B]

NewMultiHeadAttention creates a new multi-head attention module.

Parameters:

  • embedDim: Total embedding dimension (must be divisible by numHeads)
  • numHeads: Number of attention heads
  • backend: Computation backend

Example:

backend := cpu.New()
mha := nn.NewMultiHeadAttention(768, 12, backend)  // BERT-base config
output := mha.Forward(x, x, x, nil)  // Self-attention

type Parameter

type Parameter[B tensor.Backend] = nn.Parameter[B]

Parameter represents a trainable parameter in a neural network.

func NewParameter

func NewParameter[B tensor.Backend](name string, t *tensor.Tensor[float32, B]) *Parameter[B]

NewParameter creates a new parameter with the given name and tensor.
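
Example (every call below is documented elsewhere on this page; the parameter name is arbitrary):

backend := cpu.New()
w := nn.Xavier(784, 128, tensor.Shape{128, 784}, backend)
param := nn.NewParameter("fc1.weight", w)
fmt.Println(param.Name(), param.Tensor().Shape())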

type RMSNorm added in v0.3.0

type RMSNorm[B tensor.Backend] = nn.RMSNorm[B]

RMSNorm represents Root Mean Square Layer Normalization.

func NewRMSNorm added in v0.3.0

func NewRMSNorm[B tensor.Backend](dModel int, epsilon float32, backend B) *RMSNorm[B]

NewRMSNorm creates a new RMSNorm layer.

Example:

backend := cpu.New()
norm := nn.NewRMSNorm(768, 1e-5, backend)
output := norm.Forward(input)  // [..., 768] -> [..., 768]

type ReLU

type ReLU[B tensor.Backend] = nn.ReLU[B]

ReLU represents the Rectified Linear Unit activation function.

func NewReLU

func NewReLU[B tensor.Backend]() *ReLU[B]

NewReLU creates a new ReLU activation layer.

Example:

relu := nn.NewReLU()

type RotaryEncoding added in v0.4.0

type RotaryEncoding[B tensor.Backend] = nn.RotaryEncoding[B]

RotaryEncoding implements Rotary Position Embedding (RoPE).

RoPE is used in modern LLMs like LLaMA, Mistral, DeepSeek, and Qwen. It applies a rotation to query and key embeddings based on their position.

Example:

backend := cpu.New()
config := nn.RotaryEncodingConfig{
    DModel:    64,
    MaxSeqLen: 2048,
    Theta:     10000.0,
}
rope := nn.NewRotaryEncoding(config, backend)

// Apply to attention queries/keys
q := tensor.Randn[float32](tensor.Shape{batch, heads, seq, 64}, backend)
qRotated := rope.Forward(q)

func NewRotaryEncoding added in v0.4.0

func NewRotaryEncoding[B tensor.Backend](cfg RotaryEncodingConfig, backend B) *RotaryEncoding[B]

NewRotaryEncoding creates a new RoPE (Rotary Position Embedding) layer.

Pre-computes cosine and sine values for all positions and dimension pairs.

Parameters:

  • cfg: Configuration (DModel, MaxSeqLen, Theta)
  • backend: Computation backend

Example:

config := nn.RotaryEncodingConfig{
    DModel:    64,     // Head dimension
    MaxSeqLen: 2048,   // Max sequence length
    Theta:     10000.0,
}
rope := nn.NewRotaryEncoding(config, backend)

type RotaryEncodingConfig added in v0.4.0

type RotaryEncodingConfig = nn.RotaryEncodingConfig

RotaryEncodingConfig configures a RotaryEncoding layer.

type Sequential

type Sequential[B tensor.Backend] = nn.Sequential[B]

Sequential represents a sequential container of modules.

func NewSequential

func NewSequential[B tensor.Backend](modules ...Module[B]) *Sequential[B]

NewSequential creates a new sequential model.

Example:

backend := cpu.New()
model := nn.NewSequential(
    nn.NewLinear(784, 128, backend),
    nn.NewReLU(),
    nn.NewLinear(128, 10, backend),
)

type SiLU added in v0.3.0

type SiLU[B tensor.Backend] = nn.SiLU[B]

SiLU represents the Sigmoid Linear Unit (SiLU/Swish) activation function. SiLU(x) = x * sigmoid(x).

func NewSiLU added in v0.3.0

func NewSiLU[B tensor.Backend]() *SiLU[B]

NewSiLU creates a new SiLU activation layer.

Example:

silu := nn.NewSiLU()
output := silu.Forward(input)

type Sigmoid

type Sigmoid[B tensor.Backend] = nn.Sigmoid[B]

Sigmoid represents the Sigmoid activation function.

func NewSigmoid

func NewSigmoid[B tensor.Backend]() *Sigmoid[B]

NewSigmoid creates a new Sigmoid activation layer.

Example:

sigmoid := nn.NewSigmoid()

type SinusoidalPositionalEncoding added in v0.4.0

type SinusoidalPositionalEncoding[B tensor.Backend] = nn.SinusoidalPositionalEncoding[B]

SinusoidalPositionalEncoding implements fixed sinusoidal positional encodings.

This is the original positional encoding from "Attention is All You Need" (Vaswani et al., 2017).

Example:

backend := cpu.New()
pe := nn.NewSinusoidalPositionalEncoding(512, 256, backend)
encodings := pe.Forward(100)  // [1, 100, 256]

// Add to embeddings
embeddings := embeddings.Add(encodings)

func NewSinusoidalPositionalEncoding added in v0.4.0

func NewSinusoidalPositionalEncoding[B tensor.Backend](maxLen, dim int, backend B) *SinusoidalPositionalEncoding[B]

NewSinusoidalPositionalEncoding creates a new sinusoidal positional encoding layer.

Pre-computes all positional encodings up to maxLen using sine and cosine functions.

Parameters:

  • maxLen: Maximum sequence length
  • dim: Embedding dimension
  • backend: Computation backend

Example:

pe := nn.NewSinusoidalPositionalEncoding(512, 256, backend)

type SwiGLUFFN added in v0.5.0

type SwiGLUFFN[B tensor.Backend] = nn.SwiGLUFFN[B]

SwiGLUFFN implements a feed-forward network with SwiGLU activation.

Architecture (LLaMA-style):

hidden = SwiGLU(x @ W_up, x @ W_gate)
output = hidden @ W_down

Where SwiGLU(up, gate) = up * SiLU(gate).

This is more parameter-efficient than a standard FFN with GELU. LLaMA uses ffn_dim ≈ 8/3 * d_model (vs 4 * d_model in a standard FFN), giving a similar total parameter count with better performance.

Example:

backend := autodiff.New(cpu.New())
cfg := nn.SwiGLUFFNConfig{
    EmbedDim: 4096,
    FFNDim:   11008,  // LLaMA 7B
}
ffn := nn.NewSwiGLUFFN(cfg, backend)
output := ffn.Forward(x)  // [batch, seq, 4096] -> [batch, seq, 4096]

func NewSwiGLUFFN added in v0.5.0

func NewSwiGLUFFN[B tensor.Backend](cfg SwiGLUFFNConfig, backend B) *SwiGLUFFN[B]

NewSwiGLUFFN creates a new SwiGLUFFN layer.

If GLUVariant is empty, defaults to "swiglu". If FFNDim is 0, it's computed as 8/3 * EmbedDim (LLaMA formula).

Example:

// LLaMA 7B FFN
ffn := nn.NewSwiGLUFFN(nn.SwiGLUFFNConfig{
    EmbedDim: 4096,
    FFNDim:   11008,
}, backend)
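
For EmbedDim = 4096 the default formula gives 8/3 * 4096 ≈ 10923; LLaMA 7B's FFNDim of 11008 is that value rounded up to a multiple of 256 (43 * 256).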

type SwiGLUFFNConfig added in v0.5.0

type SwiGLUFFNConfig = nn.SwiGLUFFNConfig

SwiGLUFFNConfig configures a SwiGLUFFN layer.

type Tanh

type Tanh[B tensor.Backend] = nn.Tanh[B]

Tanh represents the Tanh activation function.

func NewTanh

func NewTanh[B tensor.Backend]() *Tanh[B]

NewTanh creates a new Tanh activation layer.

Example:

tanh := nn.NewTanh()

type TransformerBlock added in v0.4.0

type TransformerBlock[B tensor.Backend] = nn.TransformerBlock[B]

TransformerBlock is a complete Transformer Block with attention and FFN.

Architecture (Pre-Norm):

x → Norm → MHA → + → Norm → FFN → + → output
         ↑_______|         ↑_______|
       (residual)        (residual)

Used in all transformer models (GPT, BERT, LLaMA, etc.).

func NewTransformerBlock added in v0.4.0

func NewTransformerBlock[B tensor.Backend](config TransformerConfig, backend B) *TransformerBlock[B]

NewTransformerBlock creates a new Transformer Block.

Parameters:

  • config: Configuration (embedDim, numHeads, ffnDim, etc.)
  • backend: Computation backend

Example:

backend := autodiff.New(cpu.New())
config := nn.TransformerConfig{
    EmbedDim:   768,
    NumHeads:   12,
    FFNDim:     3072,
    NormFirst:  true,
    UseRMSNorm: true,
    NormEps:    1e-5,
}
block := nn.NewTransformerBlock(config, backend)
output := block.Forward(x, mask)

type TransformerConfig added in v0.4.0

type TransformerConfig = nn.TransformerConfig

TransformerConfig defines the configuration for a Transformer Block.

Fields:

  • EmbedDim: Embedding dimension (d_model, e.g., 768 for GPT-2)
  • NumHeads: Number of attention heads (e.g., 12 for GPT-2)
  • FFNDim: FFN hidden dimension (typically 4 * EmbedDim)
  • Dropout: Dropout rate (0 = no dropout, not yet implemented)
  • NormFirst: true = Pre-Norm (LLaMA), false = Post-Norm (original)
  • UseRMSNorm: true = RMSNorm (LLaMA), false = LayerNorm (BERT/GPT)
  • NormEps: Normalization epsilon (1e-5 typical)

Example:

config := nn.TransformerConfig{
    EmbedDim:   768,
    NumHeads:   12,
    FFNDim:     3072,
    NormFirst:  true,
    UseRMSNorm: true,
    NormEps:    1e-5,
}
