Package nn

v0.7.6 (not the latest version of this module) · Published: Jan 3, 2026 · License: Apache-2.0 · Imports: 2 · Imported by: 0

Documentation

Overview

Package nn provides neural network layers and building blocks.

This package contains:

  • Layers: Linear, Conv2D, MaxPool2D
  • Activations: ReLU, Sigmoid, Tanh
  • Loss functions: CrossEntropyLoss, MSELoss
  • Utilities: Sequential, Module interface, Parameter
  • Initialization: Xavier, Zeros, Ones, Randn

Basic Usage

import (
    "github.com/born-ml/born/nn"
    "github.com/born-ml/born/backend/cpu"
    "github.com/born-ml/born/tensor" // import path assumed from the module layout
)

func main() {
    backend := cpu.New()

    // Build a simple MLP
    model := nn.NewSequential(
        nn.NewLinear(784, 128, backend),
        nn.NewReLU(),
        nn.NewLinear(128, 10, backend),
    )

    // Forward pass on a batch of one flattened 28x28 input
    input := tensor.Randn[float32](tensor.Shape{1, 784}, backend)
    output := model.Forward(input)
    _ = output // [1, 10] logits
}

Layers

Linear: Fully connected layer with Xavier initialization

layer := nn.NewLinear(inFeatures, outFeatures, backend)

Conv2D: 2D convolutional layer (implemented via im2col)

conv := nn.NewConv2D(inChannels, outChannels, kernelH, kernelW, stride, padding, useBias, backend)

MaxPool2D: 2D max pooling layer

pool := nn.NewMaxPool2D(kernelSize, stride, backend)

Activations

Common activation functions:

relu := nn.NewReLU()
sigmoid := nn.NewSigmoid()
tanh := nn.NewTanh()

Loss Functions

CrossEntropyLoss: For classification tasks (numerically stable)

criterion := nn.NewCrossEntropyLoss(backend)
loss := criterion.Forward(logits, labels)

MSELoss: For regression tasks

criterion := nn.NewMSELoss(backend)
loss := criterion.Forward(predictions, targets)

Sequential Models

Build models by composing layers:

model := nn.NewSequential(
    nn.NewLinear(784, 256, backend),
    nn.NewReLU(),
    nn.NewLinear(256, 128, backend),
    nn.NewReLU(),
    nn.NewLinear(128, 10, backend),
)

Parameter Management

Access model parameters for optimization:

params := model.Parameters()
for _, param := range params {
    fmt.Println(param.Name(), param.Tensor().Shape())
}

The package also provides public wrappers for positional encodings: sinusoidal, learned, rotary (RoPE), and ALiBi.

Constants

This section is empty.

Variables

This section is empty.

Functions

func Accuracy

func Accuracy[B tensor.Backend](
	logits *tensor.Tensor[float32, B],
	targets *tensor.Tensor[int32, B],
) float32

Accuracy computes the classification accuracy.

Example:

acc := nn.Accuracy(predictions, labels)
fmt.Printf("Accuracy: %.2f%%\n", acc*100)

func CausalMask added in v0.4.0

func CausalMask[B tensor.Backend](seqLen int, backend B) *tensor.Tensor[float32, B]

CausalMask creates a causal (autoregressive) attention mask.

In causal attention, each position can only attend to earlier positions. This is used in autoregressive models like GPT.

Returns a mask tensor where future positions are masked with -inf. Shape: [1, 1, seq_len, seq_len] (broadcastable to [batch, heads, seq, seq])

Example:

mask := nn.CausalMask(10, backend)  // [1, 1, 10, 10]
output, weights := nn.ScaledDotProductAttention(Q, K, V, mask, 0)

func CrossEntropyBackward

func CrossEntropyBackward[B tensor.Backend](
	logits *tensor.Tensor[float32, B],
	targets *tensor.Tensor[int32, B],
	backend B,
) *tensor.Tensor[float32, B]

CrossEntropyBackward computes the backward pass for cross-entropy loss.
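
A minimal usage sketch, assuming the same shape conventions as CrossEntropyLoss (logits [batch, num_classes], int32 class-index targets [batch]):

grad := nn.CrossEntropyBackward(logits, targets, backend)  // gradient w.r.t. logits, same shape as logits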

func GELUFunc added in v0.5.0

func GELUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

GELUFunc applies GELU (Gaussian Error Linear Unit) activation.

Uses the tanh approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3))).

GELU is used in BERT, GPT-2, and other transformers.

Example:

output := nn.GELUFunc(input)

func GLU added in v0.5.0

func GLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

GLU applies Gated Linear Unit: GLU(x, gate) = x * sigmoid(gate).

GLU is the base gating mechanism used in various transformer FFN layers.

Parameters:

  • x: input tensor.
  • gate: gating tensor (same shape as x).

Returns: x * sigmoid(gate).

Example:

output := nn.GLU(x, gate)

func GeGLU added in v0.5.0

func GeGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

GeGLU applies GELU-Gated Linear Unit: GeGLU(x, gate) = x * GELU(gate).

GeGLU uses GELU activation for gating instead of SiLU. Used in some transformer variants for different activation characteristics.

Parameters:

  • x: input tensor.
  • gate: gating tensor.

Returns: x * GELU(gate).

Example:

output := nn.GeGLU(up, gate)

func Ones

func Ones[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]

Ones initializes a tensor with ones.

Example:

backend := cpu.New()
weights := nn.Ones(tensor.Shape{128, 784}, backend)

func Randn

func Randn[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]

Randn initializes a tensor with random values from N(0, 1).

Example:

backend := cpu.New()
weights := nn.Randn(tensor.Shape{128, 784}, backend)

func ReGLU added in v0.5.0

func ReGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

ReGLU applies ReLU-Gated Linear Unit: ReGLU(x, gate) = x * ReLU(gate).

ReGLU uses ReLU activation for gating. It's simpler but may have "dead neuron" issues compared to SwiGLU or GeGLU.

Parameters:

  • x: input tensor.
  • gate: gating tensor.

Returns: x * ReLU(gate).

Example:

output := nn.ReGLU(up, gate)

func ReLUFunc added in v0.5.0

func ReLUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

ReLUFunc applies the ReLU activation function element-wise. ReLU(x) = max(0, x).
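
Example (following the pattern of the other functional activations, such as GELUFunc):

output := nn.ReLUFunc(input)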

func RepeatKV added in v0.5.0

func RepeatKV[B tensor.Backend](kv *tensor.Tensor[float32, B], nRep int) *tensor.Tensor[float32, B]

RepeatKV broadcasts KV heads to match query heads count.

This is the key operation in GQA that allows fewer KV heads than Q heads. Each KV head is repeated nRep times to match the Q head count.

Input:  [batch, n_kv_heads, seq_len, head_dim]
Output: [batch, n_q_heads, seq_len, head_dim], where n_q_heads = n_kv_heads * nRep

Example:

// 8 KV heads -> 32 Q heads (nRep=4)
kv := tensor.Randn[float32](tensor.Shape{2, 8, 100, 128}, backend)
expanded := nn.RepeatKV(kv, 4)  // [2, 32, 100, 128]

If nRep=1 (standard MHA), returns the input unchanged.

func ScaledDotProductAttention added in v0.4.0

func ScaledDotProductAttention[B tensor.Backend](
	query, key, value *tensor.Tensor[float32, B],
	mask *tensor.Tensor[float32, B],
	scale float32,
) (*tensor.Tensor[float32, B], *tensor.Tensor[float32, B])

ScaledDotProductAttention computes attention scores using the scaled dot-product mechanism.

This is the core attention mechanism used in transformers.

Parameters:

  • query: Query tensor [batch, heads, seq_q, head_dim]
  • key: Key tensor [batch, heads, seq_k, head_dim]
  • value: Value tensor [batch, heads, seq_k, head_dim]
  • mask: Optional attention mask [batch, 1, seq_q, seq_k] or nil (additive mask, -inf for masked)
  • scale: Scaling factor (0 for auto-compute as 1/sqrt(head_dim))

Returns:

  • output: Attended values [batch, heads, seq_q, head_dim]
  • weights: Attention weights [batch, heads, seq_q, seq_k]

Example:

Q := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
K := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
V := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
output, weights := nn.ScaledDotProductAttention(Q, K, V, nil, 0)

func SiLUFunc added in v0.5.0

func SiLUFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

SiLUFunc applies SiLU (Swish) activation: f(x) = x * sigmoid(x).

This is the functional version of SiLU activation, useful in GLU variants.

Example:

output := nn.SiLUFunc(input)

func SigmoidFunc added in v0.5.0

func SigmoidFunc[B tensor.Backend](x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

SigmoidFunc applies the sigmoid activation function element-wise. Sigmoid(x) = 1 / (1 + exp(-x)).
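
Example (following the pattern of the other functional activations):

output := nn.SigmoidFunc(input)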

func SwiGLU added in v0.5.0

func SwiGLU[B tensor.Backend](x, gate *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]

SwiGLU applies Swish-Gated Linear Unit: SwiGLU(x, gate) = x * SiLU(gate).

SwiGLU is used in modern LLMs like LLaMA, Mistral, and DeepSeek. It combines the input with a SiLU-activated gate for better gradient flow.

Parameters:

  • x: input tensor (typically "up" projection).
  • gate: gating tensor (typically "gate" projection).

Returns: x * SiLU(gate) where SiLU(z) = z * sigmoid(z).

Example:

// In LLaMA-style FFN:
up := upProj.Forward(input)
gate := gateProj.Forward(input)
hidden := nn.SwiGLU(up, gate)

func Xavier

func Xavier[B tensor.Backend](fanIn, fanOut int, shape tensor.Shape, backend B) *tensor.Tensor[float32, B]

Xavier initializes a tensor using Xavier/Glorot initialization.

Example:

backend := cpu.New()
weights := nn.Xavier(784, 128, tensor.Shape{128, 784}, backend)

func Zeros

func Zeros[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]

Zeros initializes a tensor with zeros (for biases).

Example:

backend := cpu.New()
bias := nn.Zeros(tensor.Shape{128}, backend)

Types

type ALiBi added in v0.4.0

type ALiBi[B tensor.Backend] = nn.ALiBi[B]

ALiBi implements Attention with Linear Biases.

ALiBi adds a linear bias to attention scores based on the distance between positions. Used in BLOOM, MPT, and other models. Allows extrapolation to longer sequences.

Example:

backend := cpu.New()
alibi := nn.NewALiBi(8, backend)  // 8 attention heads
bias := alibi.GetBias(128)        // [1, 8, 128, 128]

// In attention:
scores := Q.BatchMatMul(K.T())
scores = scores.Add(bias)
weights := scores.Softmax(-1)

func NewALiBi added in v0.4.0

func NewALiBi[B tensor.Backend](numHeads int, backend B) *ALiBi[B]

NewALiBi creates a new ALiBi bias generator.

Computes slopes for each attention head using a geometric sequence.
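In the ALiBi paper, the slope for head i of n heads is 2^(-8i/n); assuming this package follows the paper, NewALiBi(8, backend) would produce slopes 2^-1, 2^-2, ..., 2^-8.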

Parameters:

  • numHeads: Number of attention heads
  • backend: Computation backend

Example:

alibi := nn.NewALiBi(8, backend)
bias := alibi.GetBias(64)  // Get bias for sequence length 64

type Conv2D

type Conv2D[B tensor.Backend] = nn.Conv2D[B]

Conv2D represents a 2D convolutional layer.

func NewConv2D

func NewConv2D[B tensor.Backend](
	inChannels, outChannels int,
	kernelH, kernelW int,
	stride, padding int,
	useBias bool,
	backend B,
) *Conv2D[B]

NewConv2D creates a new 2D convolutional layer.

Example:

backend := cpu.New()
conv := nn.NewConv2D(1, 32, 3, 3, 1, 1, true, backend)  // in_channels=1, out_channels=32, kernel=3x3, stride=1, padding=1, useBias=true

type CrossEntropyLoss

type CrossEntropyLoss[B tensor.Backend] = nn.CrossEntropyLoss[B]

CrossEntropyLoss represents the cross-entropy loss for classification.

func NewCrossEntropyLoss

func NewCrossEntropyLoss[B tensor.Backend](backend B) *CrossEntropyLoss[B]

NewCrossEntropyLoss creates a new cross-entropy loss function.

Example:

backend := cpu.New()
criterion := nn.NewCrossEntropyLoss(backend)
loss := criterion.Forward(logits, labels)

type Embedding added in v0.3.0

type Embedding[B tensor.Backend] = nn.Embedding[B]

Embedding represents a lookup table for embeddings.

func NewEmbedding added in v0.3.0

func NewEmbedding[B tensor.Backend](numEmbeddings, embeddingDim int, backend B) *Embedding[B]

NewEmbedding creates a new embedding layer.

Example:

backend := cpu.New()
embed := nn.NewEmbedding(50000, 768, backend)  // vocab=50000, dim=768
tokenIds := tensor.FromSlice([]int32{1, 5, 10}, tensor.Shape{1, 3}, backend)
embeddings := embed.Forward(tokenIds)  // [1, 3, 768]

func NewEmbeddingWithWeight added in v0.5.0

func NewEmbeddingWithWeight[B tensor.Backend](weight *tensor.Tensor[float32, B]) *Embedding[B]

NewEmbeddingWithWeight creates an embedding layer from an existing weight tensor.

This is useful when loading pre-trained embeddings.

Example:

weights := tensor.Randn[float32](tensor.Shape{50000, 768}, backend)
embed := nn.NewEmbeddingWithWeight(weights)

type FFN added in v0.4.0

type FFN[B tensor.Backend] = nn.FFN[B]

FFN (Feed-Forward Network) is a 2-layer MLP with SiLU activation.

Architecture:

FFN(x) = Linear2(SiLU(Linear1(x)))

Used inside TransformerBlock.

func NewFFN added in v0.4.0

func NewFFN[B tensor.Backend](embedDim, ffnDim int, backend B) *FFN[B]

NewFFN creates a new Feed-Forward Network.

Parameters:

  • embedDim: Input/output dimension
  • ffnDim: Hidden dimension (typically 4 * embedDim)
  • backend: Computation backend

Example:

ffn := nn.NewFFN(768, 3072, backend)
output := ffn.Forward(x)

type GQAConfig added in v0.5.0

type GQAConfig = nn.GQAConfig

GQAConfig configures a GroupedQueryAttention layer.

func MQA added in v0.5.0

func MQA(embedDim, nQHeads, headDim int) GQAConfig

MQA creates a Multi-Query Attention config (GQA with n_kv_heads=1).

MQA is the extreme case of GQA where all query heads share a single KV head. This provides maximum memory savings but may reduce model capacity.

Example:

cfg := nn.MQA(4096, 32, 128)  // 32 Q heads, 1 KV head
mqa := nn.NewGQA(cfg, backend)

type GroupedQueryAttention added in v0.5.0

type GroupedQueryAttention[B tensor.Backend] = nn.GroupedQueryAttention[B]

GroupedQueryAttention implements Grouped Query Attention (GQA).

GQA is a variant of multi-head attention where the number of key-value heads is less than the number of query heads. This provides significant memory savings for KV-cache during inference while maintaining model quality.

Architecture comparison:

MHA: n_q_heads = n_kv_heads (e.g., 32 Q, 32 K, 32 V)
GQA: n_q_heads > n_kv_heads (e.g., 32 Q, 8 K, 8 V) -> 4x memory savings
MQA: n_kv_heads = 1 (e.g., 32 Q, 1 K, 1 V) -> 32x memory savings (extreme)

GQA is used in LLaMA 2/3, Mistral, DeepSeek, Qwen2, Phi-3, and other modern LLMs.

func NewGQA added in v0.5.0

func NewGQA[B tensor.Backend](cfg GQAConfig, backend B) *GroupedQueryAttention[B]

NewGQA creates a new GroupedQueryAttention module.

Validates that:

  • NQHeads is divisible by NKVHeads
  • EmbedDim equals NQHeads * HeadDim

If HeadDim is 0, it's computed as EmbedDim / NQHeads.

Example:

// LLaMA 2 7B style config
cfg := nn.GQAConfig{
    EmbedDim:  4096,
    NQHeads:   32,
    NKVHeads:  8,
    HeadDim:   128,
    UseRoPE:   true,
    MaxSeqLen: 4096,
}
gqa := nn.NewGQA(cfg, backend)
output := gqa.Forward(x, cache, startPos)

type KVCache added in v0.4.0

type KVCache[B tensor.Backend] = nn.KVCache[B]

KVCache is a public alias for the internal KV cache implementation.

KVCache stores key-value pairs for efficient autoregressive generation. See internal/nn/kvcache.go for detailed documentation.

func NewKVCache added in v0.4.0

func NewKVCache[B tensor.Backend](
	batchSize, numHeads, maxSeqLen, headDim int,
	backend B,
) *KVCache[B]

NewKVCache creates a new KV cache.

This is a convenience wrapper for the internal implementation. See internal/nn.NewKVCache for detailed documentation.
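
Example (a minimal sketch; the sizes are illustrative, and gqa, x, and startPos come from the NewGQA example above):

backend := cpu.New()
cache := nn.NewKVCache(1, 8, 2048, 128, backend)  // batch=1, kvHeads=8, maxSeqLen=2048, headDim=128
output := gqa.Forward(x, cache, startPos)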

type LayerNorm added in v0.4.0

type LayerNorm[B tensor.Backend] = nn.LayerNorm[B]

LayerNorm represents Layer Normalization.

func NewLayerNorm added in v0.4.0

func NewLayerNorm[B tensor.Backend](normalizedShape int, epsilon float32, backend B) *LayerNorm[B]

NewLayerNorm creates a new LayerNorm layer.

Example:

backend := cpu.New()
norm := nn.NewLayerNorm(768, 1e-5, backend)
output := norm.Forward(input)  // [..., 768] -> [..., 768]

type LearnedPositionalEmbedding added in v0.4.0

type LearnedPositionalEmbedding[B tensor.Backend] = nn.LearnedPositionalEmbedding[B]

LearnedPositionalEmbedding implements learned positional embeddings.

These embeddings are trainable parameters that are updated during training. Used in GPT-2 and other models.

Example:

backend := cpu.New()
pe := nn.NewLearnedPositionalEmbedding(512, 256, backend)
encodings := pe.Forward(100)  // [1, 100, 256]

// Get parameters for optimizer
params := pe.Parameters()

func NewLearnedPositionalEmbedding added in v0.4.0

func NewLearnedPositionalEmbedding[B tensor.Backend](maxLen, dim int, backend B) *LearnedPositionalEmbedding[B]

NewLearnedPositionalEmbedding creates a new learned positional embedding layer.

The embeddings are initialized from a normal distribution N(0, 1).

Parameters:

  • maxLen: Maximum sequence length
  • dim: Embedding dimension
  • backend: Computation backend

Example:

pe := nn.NewLearnedPositionalEmbedding(512, 256, backend)

type Linear

type Linear[B tensor.Backend] = nn.Linear[B]

Linear represents a fully connected (dense) layer.

func NewLinear

func NewLinear[B tensor.Backend](inFeatures, outFeatures int, backend B, opts ...LinearOption) *Linear[B]

NewLinear creates a new linear layer with Xavier initialization.

Example:

backend := cpu.New()
layer := nn.NewLinear(784, 128, backend)

// Without bias (for LLaMA, attention projections, etc.)
lmHead := nn.NewLinear(hiddenSize, vocabSize, backend, nn.WithBias(false))

type LinearOption added in v0.7.4

type LinearOption = nn.LinearOption

LinearOption is a functional option for configuring a Linear layer.

func WithBias added in v0.7.4

func WithBias(useBias bool) LinearOption

WithBias sets whether the Linear layer should use bias.

Default is true. Set to false for architectures like LLaMA that don't use bias.

Example:

// Linear layer without bias (LLaMA-style)
lmHead := nn.NewLinear(hiddenSize, vocabSize, backend, nn.WithBias(false))

// Linear layer with bias (default)
layer := nn.NewLinear(784, 128, backend)  // same as WithBias(true)

type MSELoss

type MSELoss[B tensor.Backend] = nn.MSELoss[B]

MSELoss represents the mean squared error loss for regression.

func NewMSELoss

func NewMSELoss[B tensor.Backend](backend B) *MSELoss[B]

NewMSELoss creates a new MSE loss function.

Example:

backend := cpu.New()
criterion := nn.NewMSELoss(backend)
loss := criterion.Forward(predictions, targets)

type MaxPool2D

type MaxPool2D[B tensor.Backend] = nn.MaxPool2D[B]

MaxPool2D represents a 2D max pooling layer.

func NewMaxPool2D

func NewMaxPool2D[B tensor.Backend](kernelSize, stride int, backend B) *MaxPool2D[B]

NewMaxPool2D creates a new 2D max pooling layer.

Example:

backend := cpu.New()
pool := nn.NewMaxPool2D(2, 2, backend)  // kernel=2, stride=2

type Module

type Module[B tensor.Backend] = nn.Module[B]

Module interface defines the common interface for all neural network modules.
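
The full method set is not shown on this page; the sketch below is inferred from how modules are used elsewhere in these docs (Forward on Sequential, Parameters for optimizers) and is an assumption, not the definitive definition:

type Module[B tensor.Backend] interface {
	Forward(x *tensor.Tensor[float32, B]) *tensor.Tensor[float32, B]
	Parameters() []*Parameter[B]
}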

type MultiHeadAttention added in v0.4.0

type MultiHeadAttention[B tensor.Backend] = nn.MultiHeadAttention[B]

MultiHeadAttention represents the multi-head attention mechanism.

func NewMultiHeadAttention added in v0.4.0

func NewMultiHeadAttention[B tensor.Backend](embedDim, numHeads int, backend B) *MultiHeadAttention[B]

NewMultiHeadAttention creates a new multi-head attention module.

Parameters:

  • embedDim: Total embedding dimension (must be divisible by numHeads)
  • numHeads: Number of attention heads
  • backend: Computation backend

Example:

backend := cpu.New()
mha := nn.NewMultiHeadAttention(768, 12, backend)  // BERT-base config
output := mha.Forward(x, x, x, nil)  // Self-attention

type Parameter

type Parameter[B tensor.Backend] = nn.Parameter[B]

Parameter represents a trainable parameter in a neural network.

func NewParameter

func NewParameter[B tensor.Backend](name string, t *tensor.Tensor[float32, B]) *Parameter[B]

NewParameter creates a new parameter with the given name and tensor.
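
Example (every call below is documented elsewhere on this page; the parameter name is arbitrary):

backend := cpu.New()
w := nn.Xavier(784, 128, tensor.Shape{128, 784}, backend)
param := nn.NewParameter("fc1.weight", w)
fmt.Println(param.Name(), param.Tensor().Shape())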

type RMSNorm added in v0.3.0

type RMSNorm[B tensor.Backend] = nn.RMSNorm[B]

RMSNorm represents Root Mean Square Layer Normalization.

func NewRMSNorm added in v0.3.0

func NewRMSNorm[B tensor.Backend](dModel int, epsilon float32, backend B) *RMSNorm[B]

NewRMSNorm creates a new RMSNorm layer.

Example:

backend := cpu.New()
norm := nn.NewRMSNorm(768, 1e-5, backend)
output := norm.Forward(input)  // [..., 768] -> [..., 768]

type ReLU

type ReLU[B tensor.Backend] = nn.ReLU[B]

ReLU represents the Rectified Linear Unit activation function.

func NewReLU

func NewReLU[B tensor.Backend]() *ReLU[B]

NewReLU creates a new ReLU activation layer.

Example:

relu := nn.NewReLU()

type RotaryEncoding added in v0.4.0

type RotaryEncoding[B tensor.Backend] = nn.RotaryEncoding[B]

RotaryEncoding implements Rotary Position Embedding (RoPE).

RoPE is used in modern LLMs like LLaMA, Mistral, DeepSeek, and Qwen. It applies a rotation to query and key embeddings based on their position.

Example:

backend := cpu.New()
config := nn.RotaryEncodingConfig{
    DModel:    64,
    MaxSeqLen: 2048,
    Theta:     10000.0,
}
rope := nn.NewRotaryEncoding(config, backend)

// Apply to attention queries/keys
q := tensor.Randn[float32](tensor.Shape{batch, heads, seq, 64}, backend)
qRotated := rope.Forward(q)

func NewRotaryEncoding added in v0.4.0

func NewRotaryEncoding[B tensor.Backend](cfg RotaryEncodingConfig, backend B) *RotaryEncoding[B]

NewRotaryEncoding creates a new RoPE (Rotary Position Embedding) layer.

Pre-computes cosine and sine values for all positions and dimension pairs.

Parameters:

  • cfg: Configuration (DModel, MaxSeqLen, Theta)
  • backend: Computation backend

Example:

config := nn.RotaryEncodingConfig{
    DModel:    64,     // Head dimension
    MaxSeqLen: 2048,   // Max sequence length
    Theta:     10000.0,
}
rope := nn.NewRotaryEncoding(config, backend)

type RotaryEncodingConfig added in v0.4.0

type RotaryEncodingConfig = nn.RotaryEncodingConfig

RotaryEncodingConfig configures a RotaryEncoding layer.

type Sequential

type Sequential[B tensor.Backend] = nn.Sequential[B]

Sequential represents a sequential container of modules.

func NewSequential

func NewSequential[B tensor.Backend](modules ...Module[B]) *Sequential[B]

NewSequential creates a new sequential model.

Example:

backend := cpu.New()
model := nn.NewSequential(
    nn.NewLinear(784, 128, backend),
    nn.NewReLU(),
    nn.NewLinear(128, 10, backend),
)

type SiLU added in v0.3.0

type SiLU[B tensor.Backend] = nn.SiLU[B]

SiLU represents the Sigmoid Linear Unit (SiLU/Swish) activation function. SiLU(x) = x * sigmoid(x).

func NewSiLU added in v0.3.0

func NewSiLU[B tensor.Backend]() *SiLU[B]

NewSiLU creates a new SiLU activation layer.

Example:

silu := nn.NewSiLU()
output := silu.Forward(input)

type Sigmoid

type Sigmoid[B tensor.Backend] = nn.Sigmoid[B]

Sigmoid represents the Sigmoid activation function.

func NewSigmoid

func NewSigmoid[B tensor.Backend]() *Sigmoid[B]

NewSigmoid creates a new Sigmoid activation layer.

Example:

sigmoid := nn.NewSigmoid()

type SinusoidalPositionalEncoding added in v0.4.0

type SinusoidalPositionalEncoding[B tensor.Backend] = nn.SinusoidalPositionalEncoding[B]

SinusoidalPositionalEncoding implements fixed sinusoidal positional encodings.

This is the original positional encoding from "Attention is All You Need" (Vaswani et al., 2017).

Example:

backend := cpu.New()
pe := nn.NewSinusoidalPositionalEncoding(512, 256, backend)
encodings := pe.Forward(100)  // [1, 100, 256]

// Add to embeddings
embeddings := embeddings.Add(encodings)

func NewSinusoidalPositionalEncoding added in v0.4.0

func NewSinusoidalPositionalEncoding[B tensor.Backend](maxLen, dim int, backend B) *SinusoidalPositionalEncoding[B]

NewSinusoidalPositionalEncoding creates a new sinusoidal positional encoding layer.

Pre-computes all positional encodings up to maxLen using sine and cosine functions.

Parameters:

  • maxLen: Maximum sequence length
  • dim: Embedding dimension
  • backend: Computation backend

Example:

pe := nn.NewSinusoidalPositionalEncoding(512, 256, backend)

type SwiGLUFFN added in v0.5.0

type SwiGLUFFN[B tensor.Backend] = nn.SwiGLUFFN[B]

SwiGLUFFN implements a feed-forward network with SwiGLU activation.

Architecture (LLaMA-style):

hidden = SwiGLU(x @ W_up, x @ W_gate)
output = hidden @ W_down

Where SwiGLU(up, gate) = up * SiLU(gate).

This is more parameter-efficient than a standard FFN with GELU. LLaMA uses ffn_dim ≈ 8/3 * d_model (vs 4 * d_model in a standard FFN), giving a similar total parameter count with better performance.

Example:

backend := autodiff.New(cpu.New())
cfg := nn.SwiGLUFFNConfig{
    EmbedDim: 4096,
    FFNDim:   11008,  // LLaMA 7B
}
ffn := nn.NewSwiGLUFFN(cfg, backend)
output := ffn.Forward(x)  // [batch, seq, 4096] -> [batch, seq, 4096]

func NewSwiGLUFFN added in v0.5.0

func NewSwiGLUFFN[B tensor.Backend](cfg SwiGLUFFNConfig, backend B) *SwiGLUFFN[B]

NewSwiGLUFFN creates a new SwiGLUFFN layer.

If GLUVariant is empty, defaults to "swiglu". If FFNDim is 0, it's computed as 8/3 * EmbedDim (LLaMA formula).

Example:

// LLaMA 7B FFN
ffn := nn.NewSwiGLUFFN(nn.SwiGLUFFNConfig{
    EmbedDim: 4096,
    FFNDim:   11008,
}, backend)
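
For EmbedDim = 4096 the default formula gives 8/3 * 4096 ≈ 10923; LLaMA 7B's FFNDim of 11008 is that value rounded up to a multiple of 256 (43 * 256).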

type SwiGLUFFNConfig added in v0.5.0

type SwiGLUFFNConfig = nn.SwiGLUFFNConfig

SwiGLUFFNConfig configures a SwiGLUFFN layer.

type Tanh

type Tanh[B tensor.Backend] = nn.Tanh[B]

Tanh represents the Tanh activation function.

func NewTanh

func NewTanh[B tensor.Backend]() *Tanh[B]

NewTanh creates a new Tanh activation layer.

Example:

tanh := nn.NewTanh()

type TransformerBlock added in v0.4.0

type TransformerBlock[B tensor.Backend] = nn.TransformerBlock[B]

TransformerBlock is a complete Transformer Block with attention and FFN.

Architecture (Pre-Norm):

x → Norm → MHA → + → Norm → FFN → + → output
         ↑_______|         ↑_______|
       (residual)        (residual)

Used in all transformer models (GPT, BERT, LLaMA, etc.).

func NewTransformerBlock added in v0.4.0

func NewTransformerBlock[B tensor.Backend](config TransformerConfig, backend B) *TransformerBlock[B]

NewTransformerBlock creates a new Transformer Block.

Parameters:

  • config: Configuration (embedDim, numHeads, ffnDim, etc.)
  • backend: Computation backend

Example:

backend := autodiff.New(cpu.New())
config := nn.TransformerConfig{
    EmbedDim:   768,
    NumHeads:   12,
    FFNDim:     3072,
    NormFirst:  true,
    UseRMSNorm: true,
    NormEps:    1e-5,
}
block := nn.NewTransformerBlock(config, backend)
output := block.Forward(x, mask)

type TransformerConfig added in v0.4.0

type TransformerConfig = nn.TransformerConfig

TransformerConfig defines the configuration for a Transformer Block.

Fields:

  • EmbedDim: Embedding dimension (d_model, e.g., 768 for GPT-2)
  • NumHeads: Number of attention heads (e.g., 12 for GPT-2)
  • FFNDim: FFN hidden dimension (typically 4 * EmbedDim)
  • Dropout: Dropout rate (0 = no dropout, not yet implemented)
  • NormFirst: true = Pre-Norm (LLaMA), false = Post-Norm (original)
  • UseRMSNorm: true = RMSNorm (LLaMA), false = LayerNorm (BERT/GPT)
  • NormEps: Normalization epsilon (1e-5 typical)

Example:

config := nn.TransformerConfig{
    EmbedDim:   768,
    NumHeads:   12,
    FFNDim:     3072,
    NormFirst:  true,
    UseRMSNorm: true,
    NormEps:    1e-5,
}
