Documentation ¶
Overview ¶
Package nn provides neural network layers and building blocks.
This package contains:
- Layers: Linear, Conv2D, MaxPool2D
- Activations: ReLU, Sigmoid, Tanh
- Loss functions: CrossEntropyLoss, MSELoss
- Utilities: Sequential, Module interface, Parameter
- Initialization: Xavier, Zeros, Ones, Randn
Basic Usage ¶
import (
"github.com/born-ml/born/nn"
"github.com/born-ml/born/backend/cpu"
)
func main() {
backend := cpu.New()
// Build a simple MLP
model := nn.NewSequential(
nn.NewLinear(784, 128, backend),
nn.NewReLU(),
nn.NewLinear(128, 10, backend),
)
// Forward pass
output := model.Forward(input)
}
Layers ¶
Linear: Fully connected layer with Xavier initialization
layer := nn.NewLinear(inFeatures, outFeatures, backend)
Conv2D: 2D convolutional layer with im2col algorithm
conv := nn.NewConv2D(inChannels, outChannels, kernelSize, stride, padding, backend)
MaxPool2D: 2D max pooling layer
pool := nn.NewMaxPool2D(kernelSize, stride, backend)
Activations ¶
Common activation functions:
relu := nn.NewReLU()
sigmoid := nn.NewSigmoid()
tanh := nn.NewTanh()
Loss Functions ¶
CrossEntropyLoss: For classification tasks (numerically stable)
criterion := nn.NewCrossEntropyLoss(backend)
loss := criterion.Forward(logits, labels)
MSELoss: For regression tasks
criterion := nn.NewMSELoss(backend)
loss := criterion.Forward(predictions, targets)
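The quantity MSELoss computes reduces to a few lines on plain slices. The `mse` helper below is a hypothetical stdlib-only sketch, not part of the package:

```go
package main

import "fmt"

// mse returns the mean squared error between predictions and targets:
// the average of the squared element-wise differences.
func mse(pred, target []float32) float32 {
	var sum float32
	for i := range pred {
		d := pred[i] - target[i]
		sum += d * d
	}
	return sum / float32(len(pred))
}

func main() {
	loss := mse([]float32{1, 2, 3}, []float32{1, 2, 5})
	fmt.Println(loss) // (0 + 0 + 4) / 3
}
```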
Sequential Models ¶
Build models by composing layers:
model := nn.NewSequential(
nn.NewLinear(784, 256, backend),
nn.NewReLU(),
nn.NewLinear(256, 128, backend),
nn.NewReLU(),
nn.NewLinear(128, 10, backend),
)
Parameter Management ¶
Access model parameters for optimization:
params := model.Parameters()
for _, param := range params {
fmt.Println(param.Name(), param.Tensor().Shape())
}
Package nn also provides public wrappers for positional encodings (sinusoidal, learned, rotary, and ALiBi).
Index ¶
- func Accuracy[B tensor.Backend](logits *tensor.Tensor[float32, B], targets *tensor.Tensor[int32, B]) float32
- func CausalMask[B tensor.Backend](seqLen int, backend B) *tensor.Tensor[float32, B]
- func CrossEntropyBackward[B tensor.Backend](logits *tensor.Tensor[float32, B], targets *tensor.Tensor[int32, B], backend B) *tensor.Tensor[float32, B]
- func Ones[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
- func Randn[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
- func ScaledDotProductAttention[B tensor.Backend](query, key, value *tensor.Tensor[float32, B], mask *tensor.Tensor[float32, B], ...) (*tensor.Tensor[float32, B], *tensor.Tensor[float32, B])
- func Xavier[B tensor.Backend](fanIn, fanOut int, shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
- func Zeros[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
- type ALiBi
- type Conv2D
- type CrossEntropyLoss
- type Embedding
- type FFN
- type KVCache
- type LayerNorm
- type LearnedPositionalEmbedding
- type Linear
- type MSELoss
- type MaxPool2D
- type Module
- type MultiHeadAttention
- type Parameter
- type RMSNorm
- type ReLU
- type RotaryEncoding
- type RotaryEncodingConfig
- type Sequential
- type SiLU
- type Sigmoid
- type SinusoidalPositionalEncoding
- type Tanh
- type TransformerBlock
- type TransformerConfig
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Accuracy ¶
func Accuracy[B tensor.Backend](
	logits *tensor.Tensor[float32, B],
	targets *tensor.Tensor[int32, B],
) float32
Accuracy computes the classification accuracy.
Example:
acc := nn.Accuracy(predictions, labels)
fmt.Printf("Accuracy: %.2f%%\n", acc*100)
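The metric is straightforward to sketch on plain slices: take the argmax of each logit row and count matches against the target labels. The `accuracy` helper below is hypothetical, stdlib-only, and mirrors what nn.Accuracy reports:

```go
package main

import "fmt"

// accuracy computes the fraction of rows whose argmax over the logits
// matches the corresponding target label.
func accuracy(logits [][]float32, targets []int32) float32 {
	correct := 0
	for i, row := range logits {
		best := 0
		for j, v := range row {
			if v > row[best] {
				best = j
			}
		}
		if int32(best) == targets[i] {
			correct++
		}
	}
	return float32(correct) / float32(len(targets))
}

func main() {
	logits := [][]float32{{0.1, 0.9}, {2.0, -1.0}, {0.3, 0.4}}
	targets := []int32{1, 0, 0} // third row predicts class 1, so 2/3 correct
	fmt.Printf("Accuracy: %.2f%%\n", accuracy(logits, targets)*100)
}
```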
func CausalMask ¶ added in v0.4.0
func CausalMask[B tensor.Backend](seqLen int, backend B) *tensor.Tensor[float32, B]
CausalMask creates a causal (autoregressive) attention mask.
In causal attention, each position can only attend to earlier positions. This is used in autoregressive models like GPT.
Returns a mask tensor where future positions are masked with -inf. Shape: [1, 1, seq_len, seq_len] (broadcastable to [batch, heads, seq, seq])
Example:
mask := nn.CausalMask(10, backend) // [1, 1, 10, 10]
output, weights := nn.ScaledDotProductAttention(Q, K, V, mask, 0)
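The mask pattern itself is simple: zero on and below the diagonal (position j <= i may be attended), -Inf above it. This hypothetical `causalMask` helper builds the same pattern on a 2D slice, without the [1, 1, seq, seq] tensor wrapping:

```go
package main

import (
	"fmt"
	"math"
)

// causalMask builds a [seqLen][seqLen] additive attention mask:
// 0 where position i may attend to position j (j <= i), -Inf otherwise.
func causalMask(seqLen int) [][]float32 {
	negInf := float32(math.Inf(-1))
	m := make([][]float32, seqLen)
	for i := range m {
		m[i] = make([]float32, seqLen)
		for j := i + 1; j < seqLen; j++ {
			m[i][j] = negInf
		}
	}
	return m
}

func main() {
	for _, row := range causalMask(4) {
		fmt.Println(row)
	}
}
```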
func CrossEntropyBackward ¶
func CrossEntropyBackward[B tensor.Backend](
	logits *tensor.Tensor[float32, B],
	targets *tensor.Tensor[int32, B],
	backend B,
) *tensor.Tensor[float32, B]
CrossEntropyBackward computes the backward pass for cross-entropy loss.
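The standard closed form for this gradient is softmax(logits) minus the one-hot target, divided by the batch size. The `ceGrad` helper below is a hypothetical stdlib-only sketch of that formula (the package presumably computes something equivalent on tensors):

```go
package main

import (
	"fmt"
	"math"
)

// ceGrad computes d(mean cross-entropy)/d(logits):
// softmax(logits) - one_hot(targets), scaled by 1/batchSize.
func ceGrad(logits [][]float64, targets []int32) [][]float64 {
	n := float64(len(logits))
	grad := make([][]float64, len(logits))
	for i, row := range logits {
		// numerically stable softmax: subtract the row max first
		max := row[0]
		for _, v := range row {
			if v > max {
				max = v
			}
		}
		var sum float64
		exps := make([]float64, len(row))
		for j, v := range row {
			exps[j] = math.Exp(v - max)
			sum += exps[j]
		}
		grad[i] = make([]float64, len(row))
		for j := range row {
			grad[i][j] = exps[j] / sum
		}
		grad[i][targets[i]] -= 1
		for j := range grad[i] {
			grad[i][j] /= n
		}
	}
	return grad
}

func main() {
	g := ceGrad([][]float64{{2, 0, 0}}, []int32{0})
	fmt.Printf("%.4f\n", g[0]) // each row sums to 0
}
```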
func Ones ¶
func Ones[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
Ones initializes a tensor with ones.
Example:
backend := cpu.New()
weights := nn.Ones(tensor.Shape{128, 784}, backend)
func Randn ¶
func Randn[B tensor.Backend](shape tensor.Shape, backend B) *tensor.Tensor[float32, B]
Randn initializes a tensor with random values from N(0, 1).
Example:
backend := cpu.New()
weights := nn.Randn(tensor.Shape{128, 784}, backend)
func ScaledDotProductAttention ¶ added in v0.4.0
func ScaledDotProductAttention[B tensor.Backend](
	query, key, value *tensor.Tensor[float32, B],
	mask *tensor.Tensor[float32, B],
	scale float32,
) (*tensor.Tensor[float32, B], *tensor.Tensor[float32, B])
ScaledDotProductAttention computes attention scores using the scaled dot-product mechanism.
This is the core attention mechanism used in transformers.
Parameters:
- query: Query tensor [batch, heads, seq_q, head_dim]
- key: Key tensor [batch, heads, seq_k, head_dim]
- value: Value tensor [batch, heads, seq_k, head_dim]
- mask: Optional attention mask [batch, 1, seq_q, seq_k] or nil (additive mask, -inf for masked)
- scale: Scaling factor (0 for auto-compute as 1/sqrt(head_dim))
Returns:
- output: Attended values [batch, heads, seq_q, head_dim]
- weights: Attention weights [batch, heads, seq_q, seq_k]
Example:
Q := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
K := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
V := tensor.Randn[float32](tensor.Shape{2, 8, 10, 64}, backend)
output, weights := nn.ScaledDotProductAttention(Q, K, V, nil, 0)
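The core computation, softmax(Q·Kᵀ/√d)·V, can be sketched for a single head on 2D slices. The `attention` helper below is hypothetical and omits the batch/head axes, mask, and scale override that the library version handles:

```go
package main

import (
	"fmt"
	"math"
)

// attention computes softmax(Q·Kᵀ/√d)·V for a single head.
// Returns the attended output [seqQ][dimV] and the weights [seqQ][seqK].
func attention(Q, K, V [][]float64) ([][]float64, [][]float64) {
	d := float64(len(Q[0]))
	scale := 1 / math.Sqrt(d)
	seqQ, seqK := len(Q), len(K)

	weights := make([][]float64, seqQ)
	out := make([][]float64, seqQ)
	for i := 0; i < seqQ; i++ {
		// scores[j] = (Q[i] · K[j]) * scale
		scores := make([]float64, seqK)
		max := math.Inf(-1)
		for j := 0; j < seqK; j++ {
			for k := range Q[i] {
				scores[j] += Q[i][k] * K[j][k]
			}
			scores[j] *= scale
			if scores[j] > max {
				max = scores[j]
			}
		}
		// numerically stable softmax over the key axis
		var sum float64
		for j := range scores {
			scores[j] = math.Exp(scores[j] - max)
			sum += scores[j]
		}
		weights[i] = make([]float64, seqK)
		out[i] = make([]float64, len(V[0]))
		for j := range scores {
			weights[i][j] = scores[j] / sum
			for k := range V[j] {
				out[i][k] += weights[i][j] * V[j][k]
			}
		}
	}
	return out, weights
}

func main() {
	Q := [][]float64{{1, 0}, {0, 1}}
	K := [][]float64{{1, 0}, {0, 1}}
	V := [][]float64{{1, 2}, {3, 4}}
	out, w := attention(Q, K, V)
	fmt.Println(out, w)
}
```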
Types ¶
type ALiBi ¶ added in v0.4.0
ALiBi implements Attention with Linear Biases.
ALiBi adds a linear bias to attention scores based on the distance between positions. Used in BLOOM, MPT, and other models. Allows extrapolation to longer sequences.
Example:
backend := cpu.New()
alibi := nn.NewALiBi(8, backend) // 8 attention heads
bias := alibi.GetBias(128)       // [1, 8, 128, 128]

// In attention:
scores := Q.BatchMatMul(K.T())
scores = scores.Add(bias)
weights := scores.Softmax(-1)
func NewALiBi ¶ added in v0.4.0
NewALiBi creates a new ALiBi bias generator.
Computes slopes for each attention head using a geometric sequence.
Parameters:
- numHeads: Number of attention heads
- backend: Computation backend
Example:
alibi := nn.NewALiBi(8, backend)
bias := alibi.GetBias(64) // Get bias for sequence length 64
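The geometric slope sequence mentioned above can be sketched in plain Go. The `alibiSlopes` helper is hypothetical and implements the scheme from the ALiBi paper for power-of-two head counts (first term and ratio both 2^(-8/numHeads)); NewALiBi presumably computes something equivalent, and non-power-of-two head counts need an extra interpolation step not shown here:

```go
package main

import (
	"fmt"
	"math"
)

// alibiSlopes returns the per-head ALiBi slopes for a power-of-two head
// count: a geometric sequence starting at 2^(-8/numHeads) with that same
// value as the ratio.
func alibiSlopes(numHeads int) []float64 {
	start := math.Pow(2, -8.0/float64(numHeads))
	slopes := make([]float64, numHeads)
	s := start
	for i := range slopes {
		slopes[i] = s
		s *= start
	}
	return slopes
}

func main() {
	fmt.Println(alibiSlopes(8)) // 1/2, 1/4, ..., 1/256
}
```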
type Conv2D ¶
Conv2D represents a 2D convolutional layer.
func NewConv2D ¶
func NewConv2D[B tensor.Backend](
	inChannels, outChannels int,
	kernelH, kernelW int,
	stride, padding int,
	useBias bool,
	backend B,
) *Conv2D[B]
NewConv2D creates a new 2D convolutional layer.
Example:
backend := cpu.New()
conv := nn.NewConv2D(1, 32, 3, 3, 1, 1, true, backend)
// in_channels=1, out_channels=32, kernel=3x3, stride=1, padding=1, useBias=true
type CrossEntropyLoss ¶
type CrossEntropyLoss[B tensor.Backend] = nn.CrossEntropyLoss[B]
CrossEntropyLoss represents the cross-entropy loss for classification.
func NewCrossEntropyLoss ¶
func NewCrossEntropyLoss[B tensor.Backend](backend B) *CrossEntropyLoss[B]
NewCrossEntropyLoss creates a new cross-entropy loss function.
Example:
backend := cpu.New()
criterion := nn.NewCrossEntropyLoss(backend)
loss := criterion.Forward(logits, labels)
type Embedding ¶ added in v0.3.0
Embedding represents a lookup table for embeddings.
func NewEmbedding ¶ added in v0.3.0
NewEmbedding creates a new embedding layer.
Example:
backend := cpu.New()
embed := nn.NewEmbedding[B](50000, 768, backend) // vocab=50000, dim=768
tokenIds := tensor.FromSlice([]int32{1, 5, 10}, tensor.Shape{1, 3}, backend)
embeddings := embed.Forward(tokenIds) // [1, 3, 768]
type FFN ¶ added in v0.4.0
FFN (Feed-Forward Network) is a 2-layer MLP with SiLU activation.
Architecture:
FFN(x) = Linear2(SiLU(Linear1(x)))
Used inside TransformerBlock.
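The FFN(x) = Linear2(SiLU(Linear1(x))) architecture above can be sketched on plain slices. The `silu` and `ffn` helpers below are hypothetical; biases are omitted for brevity, and the real layer's parameterization may differ:

```go
package main

import (
	"fmt"
	"math"
)

// silu is the SiLU/Swish activation: x * sigmoid(x) = x / (1 + e^-x).
func silu(x float64) float64 {
	return x / (1 + math.Exp(-x))
}

// ffn applies the 2-layer MLP FFN(x) = W2·SiLU(W1·x) on plain slices.
func ffn(x []float64, w1, w2 [][]float64) []float64 {
	// expand to the hidden dimension, then apply SiLU element-wise
	hidden := make([]float64, len(w1))
	for i, row := range w1 {
		for j, w := range row {
			hidden[i] += w * x[j]
		}
		hidden[i] = silu(hidden[i])
	}
	// project back down
	out := make([]float64, len(w2))
	for i, row := range w2 {
		for j, w := range row {
			out[i] += w * hidden[j]
		}
	}
	return out
}

func main() {
	x := []float64{1, -1}
	w1 := [][]float64{{1, 0}, {0, 1}, {1, 1}} // 2 -> 3 expansion
	w2 := [][]float64{{1, 1, 1}}              // 3 -> 1 projection
	fmt.Println(ffn(x, w1, w2))
}
```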
type KVCache ¶ added in v0.4.0
KVCache is a public alias for internal KV cache implementation.
KVCache stores key-value pairs for efficient autoregressive generation. See internal/nn/kvcache.go for detailed documentation.
type LayerNorm ¶ added in v0.4.0
LayerNorm represents Layer Normalization.
func NewLayerNorm ¶ added in v0.4.0
NewLayerNorm creates a new LayerNorm layer.
Example:
backend := cpu.New()
norm := nn.NewLayerNorm[B](768, 1e-5, backend)
output := norm.Forward(input) // [..., 768] -> [..., 768]
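The normalization itself follows the usual LayerNorm definition: per feature vector, subtract the mean, divide by the standard deviation (with epsilon for stability), then apply the learned scale and shift. The `layerNorm` helper below is a hypothetical stdlib-only sketch on a single vector:

```go
package main

import (
	"fmt"
	"math"
)

// layerNorm normalizes x to zero mean and unit variance over its
// features, then applies the learned scale (gamma) and shift (beta).
func layerNorm(x, gamma, beta []float64, eps float64) []float64 {
	var mean float64
	for _, v := range x {
		mean += v
	}
	mean /= float64(len(x))

	var variance float64
	for _, v := range x {
		variance += (v - mean) * (v - mean)
	}
	variance /= float64(len(x))

	inv := 1 / math.Sqrt(variance+eps)
	out := make([]float64, len(x))
	for i, v := range x {
		out[i] = (v-mean)*inv*gamma[i] + beta[i]
	}
	return out
}

func main() {
	x := []float64{1, 2, 3, 4}
	ones := []float64{1, 1, 1, 1}
	zeros := []float64{0, 0, 0, 0}
	fmt.Println(layerNorm(x, ones, zeros, 1e-5))
}
```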
type LearnedPositionalEmbedding ¶ added in v0.4.0
type LearnedPositionalEmbedding[B tensor.Backend] = nn.LearnedPositionalEmbedding[B]
LearnedPositionalEmbedding implements learned positional embeddings.
These embeddings are trainable parameters that are updated during training. Used in GPT-2 and other models.
Example:
backend := cpu.New()
pe := nn.NewLearnedPositionalEmbedding(512, 256, backend)
encodings := pe.Forward(100) // [1, 100, 256]

// Get parameters for optimizer
params := pe.Parameters()
func NewLearnedPositionalEmbedding ¶ added in v0.4.0
func NewLearnedPositionalEmbedding[B tensor.Backend](maxLen, dim int, backend B) *LearnedPositionalEmbedding[B]
NewLearnedPositionalEmbedding creates a new learned positional embedding layer.
The embeddings are initialized from a normal distribution N(0, 1).
Parameters:
- maxLen: Maximum sequence length
- dim: Embedding dimension
- backend: Computation backend
Example:
pe := nn.NewLearnedPositionalEmbedding(512, 256, backend)
type MSELoss ¶
MSELoss represents the mean squared error loss for regression.
func NewMSELoss ¶
NewMSELoss creates a new MSE loss function.
Example:
backend := cpu.New()
criterion := nn.NewMSELoss(backend)
loss := criterion.Forward(predictions, targets)
type MultiHeadAttention ¶ added in v0.4.0
type MultiHeadAttention[B tensor.Backend] = nn.MultiHeadAttention[B]
MultiHeadAttention represents the multi-head attention mechanism.
func NewMultiHeadAttention ¶ added in v0.4.0
func NewMultiHeadAttention[B tensor.Backend](embedDim, numHeads int, backend B) *MultiHeadAttention[B]
NewMultiHeadAttention creates a new multi-head attention module.
Parameters:
- embedDim: Total embedding dimension (must be divisible by numHeads)
- numHeads: Number of attention heads
- backend: Computation backend
Example:
backend := cpu.New()
mha := nn.NewMultiHeadAttention[B](768, 12, backend) // BERT-base config
output := mha.Forward(x, x, x, nil)                  // Self-attention
type RotaryEncoding ¶ added in v0.4.0
type RotaryEncoding[B tensor.Backend] = nn.RotaryEncoding[B]
RotaryEncoding implements Rotary Position Embedding (RoPE).
RoPE is used in modern LLMs like LLaMA, Mistral, DeepSeek, and Qwen. It applies a rotation to query and key embeddings based on their position.
Example:
backend := cpu.New()
config := nn.RotaryEncodingConfig{
DModel: 64,
MaxSeqLen: 2048,
Theta: 10000.0,
}
rope := nn.NewRotaryEncoding(config, backend)
// Apply to attention queries/keys
q := tensor.Randn[float32](tensor.Shape{batch, heads, seq, 64}, backend)
q_rotated := rope.Forward(q)
func NewRotaryEncoding ¶ added in v0.4.0
func NewRotaryEncoding[B tensor.Backend](cfg RotaryEncodingConfig, backend B) *RotaryEncoding[B]
NewRotaryEncoding creates a new RoPE (Rotary Position Embedding) layer.
Pre-computes cosine and sine values for all positions and dimension pairs.
Parameters:
- cfg: Configuration (DModel, MaxSeqLen, Theta)
- backend: Computation backend
Example:
config := nn.RotaryEncodingConfig{
DModel: 64, // Head dimension
MaxSeqLen: 2048, // Max sequence length
Theta: 10000.0,
}
rope := nn.NewRotaryEncoding(config, backend)
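The rotation itself can be sketched for one head vector at one position. The `ropeRotate` helper below is hypothetical: it rotates each pair (x[2i], x[2i+1]) by the angle pos / theta^(2i/d). Note that pairing conventions differ between implementations (adjacent pairs vs. split halves), so this illustrates the idea rather than this package's exact layout:

```go
package main

import (
	"fmt"
	"math"
)

// ropeRotate applies rotary position embedding to vector x at position
// pos: each adjacent pair (x[2i], x[2i+1]) is rotated by the angle
// pos / theta^(2i/d), where d = len(x).
func ropeRotate(x []float64, pos int, theta float64) []float64 {
	d := float64(len(x))
	out := make([]float64, len(x))
	for i := 0; i < len(x); i += 2 {
		freq := 1 / math.Pow(theta, float64(i)/d)
		angle := float64(pos) * freq
		cos, sin := math.Cos(angle), math.Sin(angle)
		out[i] = x[i]*cos - x[i+1]*sin
		out[i+1] = x[i]*sin + x[i+1]*cos
	}
	return out
}

func main() {
	x := []float64{1, 0, 1, 0}
	fmt.Println(ropeRotate(x, 3, 10000.0))
}
```

Because each step is a pure rotation, vector norms are preserved, and the dot product of two rotated vectors depends only on their relative position.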
type RotaryEncodingConfig ¶ added in v0.4.0
type RotaryEncodingConfig = nn.RotaryEncodingConfig
RotaryEncodingConfig configures a RotaryEncoding layer.
type Sequential ¶
type Sequential[B tensor.Backend] = nn.Sequential[B]
Sequential represents a sequential container of modules.
func NewSequential ¶
func NewSequential[B tensor.Backend](modules ...Module[B]) *Sequential[B]
NewSequential creates a new sequential model.
Example:
backend := cpu.New()
model := nn.NewSequential(
nn.NewLinear(784, 128, backend),
nn.NewReLU(),
nn.NewLinear(128, 10, backend),
)
type SiLU ¶ added in v0.3.0
SiLU represents the Sigmoid Linear Unit (SiLU/Swish) activation function. SiLU(x) = x * sigmoid(x).
type Sigmoid ¶
Sigmoid represents the Sigmoid activation function.
func NewSigmoid ¶
NewSigmoid creates a new Sigmoid activation layer.
Example:
sigmoid := nn.NewSigmoid()
type SinusoidalPositionalEncoding ¶ added in v0.4.0
type SinusoidalPositionalEncoding[B tensor.Backend] = nn.SinusoidalPositionalEncoding[B]
SinusoidalPositionalEncoding implements fixed sinusoidal positional encodings.
This is the original positional encoding from "Attention is All You Need" (Vaswani et al., 2017).
Example:
backend := cpu.New()
pe := nn.NewSinusoidalPositionalEncoding(512, 256, backend)
encodings := pe.Forward(100) // [1, 100, 256]

// Add to embeddings
embeddings := embeddings.Add(encodings)
func NewSinusoidalPositionalEncoding ¶ added in v0.4.0
func NewSinusoidalPositionalEncoding[B tensor.Backend](maxLen, dim int, backend B) *SinusoidalPositionalEncoding[B]
NewSinusoidalPositionalEncoding creates a new sinusoidal positional encoding layer.
Pre-computes all positional encodings up to maxLen using sine and cosine functions.
Parameters:
- maxLen: Maximum sequence length
- dim: Embedding dimension
- backend: Computation backend
Example:
pe := nn.NewSinusoidalPositionalEncoding(512, 256, backend)
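The precomputed table follows the formulas from the original paper: PE(pos, 2i) = sin(pos / 10000^(2i/dim)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/dim)). The `sinusoidalPE` helper below is a hypothetical stdlib-only sketch of that table:

```go
package main

import (
	"fmt"
	"math"
)

// sinusoidalPE builds the fixed positional encoding table:
// even columns get sin(pos / 10000^(i/dim)), odd columns get cos of
// the same angle.
func sinusoidalPE(maxLen, dim int) [][]float64 {
	pe := make([][]float64, maxLen)
	for pos := range pe {
		pe[pos] = make([]float64, dim)
		for i := 0; i < dim; i += 2 {
			angle := float64(pos) / math.Pow(10000, float64(i)/float64(dim))
			pe[pos][i] = math.Sin(angle)
			if i+1 < dim {
				pe[pos][i+1] = math.Cos(angle)
			}
		}
	}
	return pe
}

func main() {
	pe := sinusoidalPE(4, 6)
	fmt.Printf("%.3f\n", pe[1])
}
```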
type TransformerBlock ¶ added in v0.4.0
type TransformerBlock[B tensor.Backend] = nn.TransformerBlock[B]
TransformerBlock is a complete Transformer Block with attention and FFN.
Architecture (Pre-Norm):
h   = x + MHA(Norm(x))   // attention sub-layer with residual connection
out = h + FFN(Norm(h))   // feed-forward sub-layer with residual connection
Used in all transformer models (GPT, BERT, LLaMA, etc.).
func NewTransformerBlock ¶ added in v0.4.0
func NewTransformerBlock[B tensor.Backend](config TransformerConfig, backend B) *TransformerBlock[B]
NewTransformerBlock creates a new Transformer Block.
Parameters:
- config: Configuration (embedDim, numHeads, ffnDim, etc.)
- backend: Computation backend
Example:
backend := autodiff.New(cpu.New())
config := nn.TransformerConfig{
EmbedDim: 768,
NumHeads: 12,
FFNDim: 3072,
NormFirst: true,
UseRMSNorm: true,
NormEps: 1e-5,
}
block := nn.NewTransformerBlock(config, backend)
output := block.Forward(x, mask)
type TransformerConfig ¶ added in v0.4.0
type TransformerConfig = nn.TransformerConfig
TransformerConfig defines the configuration for a Transformer Block.
Fields:
- EmbedDim: Embedding dimension (d_model, e.g., 768 for GPT-2)
- NumHeads: Number of attention heads (e.g., 12 for GPT-2)
- FFNDim: FFN hidden dimension (typically 4 * EmbedDim)
- Dropout: Dropout rate (0 = no dropout, not yet implemented)
- NormFirst: true = Pre-Norm (LLaMA), false = Post-Norm (original)
- UseRMSNorm: true = RMSNorm (LLaMA), false = LayerNorm (BERT/GPT)
- NormEps: Normalization epsilon (1e-5 typical)
Example:
config := nn.TransformerConfig{
EmbedDim: 768,
NumHeads: 12,
FFNDim: 3072,
NormFirst: true,
UseRMSNorm: true,
NormEps: 1e-5,
}