born module v0.7.7
Published: Jan 6, 2026 · License: Apache-2.0

README

Born - Production-Ready ML for Go

Born ML Framework - Inspired by Burn


"Models are born production-ready"

Born is a modern deep learning framework for Go, inspired by Burn (Rust). Build ML models in pure Go and deploy as single binaries - no Python runtime, no complex dependencies.

Pure Go ML with GPU acceleration - no CGO required!


Why Born?

The Problem

Deploying ML models is hard:

  • Python runtime required
  • Complex dependency management
  • Large Docker images
  • Slow startup times
  • Integration friction with Go backends
The Born Solution
import "github.com/born-ml/born"

// Models "born" ready for production
model := born.Load("resnet50.born")
prediction := model.Predict(image)

// That's it. No Python. No containers. Just Go.

Benefits:

  • Single binary deployment
  • Fast startup (< 100ms)
  • Small memory footprint
  • Native Go integration
  • Cross-platform out of the box

Features

Core
  • Pure Go - No CGO dependencies, trivial cross-compilation
  • Type Safe - Generics-powered API for compile-time guarantees
  • Autodiff - Automatic differentiation via decorator pattern
  • Production Ready - Single binary deployment, fast startup
  • WebAssembly - Run inference in browsers natively
GPU Acceleration
  • WebGPU Backend - Zero-CGO GPU via go-webgpu, 123x MatMul speedup
  • 38+ GPU Operations - MatMul, BatchMatMul, Conv2D, MaxPool2D, Softmax, and more
  • Lazy Evaluation - GPU-resident tensors, command batching (~90s → <5s/step)
  • Multi-dim Transpose - GPU-accelerated 3D/4D/5D/6D tensors
  • Automatic Memory - runtime.SetFinalizer for GPU buffer cleanup
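
The automatic-memory bullet above relies on Go finalizers to reclaim device memory. A minimal sketch of the pattern with illustrative names only (the real buffer type lives inside the webgpu backend; this assumes import "runtime"):

// gpuBuffer stands in for a GPU-resident allocation (hypothetical type).
type gpuBuffer struct {
    handle uintptr // opaque device handle
}

// release frees the underlying device memory.
func (b *gpuBuffer) release() {
    // call into the GPU runtime to free the buffer here
}

func newGPUBuffer() *gpuBuffer {
    b := &gpuBuffer{}
    // Once the GC proves b unreachable, the finalizer runs and the
    // device allocation is released without an explicit Free call.
    runtime.SetFinalizer(b, (*gpuBuffer).release)
    return b
}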
LLM & Transformers
  • Flash Attention 2 - O(N) memory, WebGPU WGSL shader, 2x+ speedup on long sequences
  • Speculative Decoding - Draft model + verification, 2-4x inference speedup
  • Multi-Head Attention - MHA, SDPA, Grouped Query Attention (GQA)
  • KV-Cache - Efficient autoregressive generation (3.94x speedup)
  • Positional Encodings - RoPE, ALiBi, Sinusoidal, Learned
  • Modern FFN - SwiGLU, GeGLU, ReGLU with gated activations
  • Normalizations - LayerNorm, RMSNorm (LLaMA style)
  • Tokenizers - TikToken, BPE, HuggingFace format, chat templates
  • Sampling - Temperature, Top-K, Top-P, Min-P, repetition penalty
  • Text Generation - Streaming API, stop sequences
Model Import & Export
  • ONNX Import - Load PyTorch/TensorFlow models via .onnx (30+ operators)
  • GGUF Import - llama.cpp format with K-quant dequantization (Q4_K, Q5_K, Q6_K, Q8_0)
  • Native Format - .born format with nn.Save() / nn.Load()
  • Checkpoints - Resume training with optimizer state preservation
  • SafeTensors - HuggingFace compatible export
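
The native-format bullets above translate to very little code. A minimal sketch, assuming nn.Save and nn.Load take a file path (the exact signatures may differ; check the nn package docs):

// Save a trained model in the native .born format (signature assumed)
if err := nn.Save(model, "mnist.born"); err != nil {
    log.Fatal(err)
}

// Load it back for inference (signature assumed)
loaded, err := nn.Load("mnist.born", backend)
if err != nil {
    log.Fatal(err)
}
_ = loaded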

Quick Start

Installation
# Clone repository
git clone https://github.com/born-ml/born.git
cd born

# Build
make build

# Or install CLI
make install
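
Cloning is only needed if you want to hack on the framework itself. To pull Born into your own module, the standard Go tooling should be enough:

go get github.com/born-ml/born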
Development Setup

Requirements:

  • Go 1.25+
  • Make (optional, but recommended)
  • golangci-lint (for linting)

Build:

make build          # Build all binaries
make test           # Run tests
make lint           # Run linter
make bench          # Run benchmarks
Example: MNIST Classification

Working example included! See examples/mnist/ for complete implementation.

package main

import (
    "fmt"

    "github.com/born-ml/born/autodiff"
    "github.com/born-ml/born/backend/cpu"
    "github.com/born-ml/born/nn"
    "github.com/born-ml/born/optim"
)

func main() {
    // Create backend with autodiff
    backend := autodiff.New(cpu.New())

    // Define model (784 → 128 → 10)
    model := NewMNISTNet(backend)

    // Create loss and optimizer
    criterion := nn.NewCrossEntropyLoss(backend)
    optimizer := optim.NewAdam(model.Parameters(), optim.AdamConfig{
        LR:    0.001,
        Betas: [2]float32{0.9, 0.999},
    }, backend)

    // Training loop (NewMNISTNet and the batch loading are defined in
    // the full program under examples/mnist/)
    for epoch := range 10 {
        // Forward pass
        logits := model.Forward(batch.ImagesTensor)
        loss := criterion.Forward(logits, batch.LabelsTensor)

        // Backward pass
        optimizer.ZeroGrad()
        grads := backend.Backward(loss.Raw())
        optimizer.Step(grads)

        // Log progress
        acc := nn.Accuracy(logits, batch.LabelsTensor)
        fmt.Printf("Epoch %d: Loss=%.4f, Accuracy=%.2f%%\n",
            epoch, loss.Raw().AsFloat32()[0], acc*100)
    }
}

Run it: cd examples/mnist && go run .

Example: LLM Text Generation
package main

import (
    "fmt"

    "github.com/born-ml/born/generate"
    "github.com/born-ml/born/loader"
    "github.com/born-ml/born/tokenizer"
)

func main() {
    // Load tokenizer
    tok, _ := tokenizer.NewTikTokenForModel("gpt-4")

    // Load model (GGUF format)
    model, _ := loader.OpenModel("llama-7b.gguf")

    // Create generator with sampling config
    gen := generate.NewTextGenerator(model, tok, generate.SamplingConfig{
        Temperature: 0.7,
        TopP:        0.9,
        TopK:        40,
    })

    // Generate text
    result, _ := gen.Generate("Hello, world!", generate.GenerateConfig{
        MaxTokens: 100,
    })
    fmt.Println(result)

    // Or use streaming
    stream, _ := gen.GenerateStream("Once upon a time", generate.GenerateConfig{
        MaxTokens: 50,
        Stream:    true,
    })
    for chunk := range stream {
        fmt.Print(chunk.Token)
    }
}

Core Features:

  • ✅ Tensor operations (Add, MatMul, Reshape, Exp, Sqrt, Cat, etc.)
  • ✅ 35+ GPU operations (BatchMatMul, Conv2D, MaxPool2D, Comparisons, Reductions)
  • ✅ 31 type-safe public API operations (MulScalar, Greater, Softmax, Int32, etc.)
  • ✅ Automatic differentiation with gradient tape
  • ✅ Neural network modules (Linear, Conv2D, ReLU, SiLU, RMSNorm, Embedding)
  • ✅ Optimizers (SGD with momentum, Adam with bias correction)
  • ✅ Losses (CrossEntropyLoss with numerical stability)
  • ✅ Complete WebGPU backend (zero-CGO, 123x MatMul speedup)
  • ✅ Transformer primitives (for LLaMA, GPT, Mistral architectures)

Architecture

Backend Abstraction

Born uses a backend interface for device independence:

type Backend interface {
    Add(a, b *RawTensor) *RawTensor
    MatMul(a, b *RawTensor) *RawTensor
    // ... other operations
}

Available Backends:

| Backend | Status       | Description                               |
|---------|--------------|-------------------------------------------|
| CPU     | ✅ Available | Pure Go implementation, all operations    |
| WebGPU  | ✅ Available | Zero-CGO GPU via go-webgpu                |
| Vulkan  | 📋 Planned   | Cross-platform GPU compute (Linux focus)  |
| CUDA    | 📋 Planned   | NVIDIA GPU via zero-CGO                   |
| Metal   | 📋 Planned   | Apple GPU (macOS/iOS)                     |

WebGPU Operation Support 🎉

| Category   | Operations                                                             | Backend            |
|------------|------------------------------------------------------------------------|--------------------|
| Math       | Add, Sub, Mul, Div (float32 + int32), Exp, Sqrt, Rsqrt, Log, Cos, Sin  | ✅ GPU             |
| Matrix     | MatMul, BatchMatMul (3D/4D), Transpose, Reshape                        | ✅ GPU             |
| CNN        | Conv2D, MaxPool2D                                                      | ✅ GPU             |
| Activation | ReLU, Sigmoid, Tanh, Softmax                                           | ✅ GPU             |
| Scalar     | MulScalar, AddScalar, SubScalar, DivScalar                             | ✅ GPU             |
| Reduction  | Sum, SumDim, MeanDim, Argmax                                           | ✅ GPU/CPU hybrid  |
| Compare    | Greater, Lower, GreaterEqual, LowerEqual, Equal, NotEqual              | ✅ GPU             |
| Boolean    | And, Or, Not                                                           | ✅ GPU             |
| Shape      | Cat, Chunk, Unsqueeze, Squeeze, Expand                                 | ✅ CPU (efficient) |
| Selection  | Where, Gather, Embedding                                               | ✅ GPU             |
| Type       | Cast (float32, int32)                                                  | ✅ CPU             |

Total: 38+ GPU-accelerated operations!

All operations required for LLM inference (Attention, RoPE, LayerNorm, etc.) are fully supported on GPU.

GPU Backend Setup:

WebGPU requires the wgpu_native library. Download from wgpu-native releases:

Windows (x64):

# Download latest release
curl -LO https://github.com/gfx-rs/wgpu-native/releases/latest/download/wgpu-windows-x86_64-msvc-release.zip
unzip wgpu-windows-x86_64-msvc-release.zip

# Install DLL system-wide (requires admin)
copy lib\wgpu_native.dll C:\Windows\System32\

# Or place next to your executable
copy lib\wgpu_native.dll .\your-app\

Linux (x64):

curl -LO https://github.com/gfx-rs/wgpu-native/releases/latest/download/wgpu-linux-x86_64-release.zip
unzip wgpu-linux-x86_64-release.zip
sudo cp lib/libwgpu_native.so /usr/local/lib/
sudo ldconfig

macOS (ARM64):

curl -LO https://github.com/gfx-rs/wgpu-native/releases/latest/download/wgpu-macos-aarch64-release.zip
unzip wgpu-macos-aarch64-release.zip
sudo cp lib/libwgpu_native.dylib /usr/local/lib/

Usage:

import (
    "github.com/born-ml/born/autodiff"
    "github.com/born-ml/born/backend/cpu"
    "github.com/born-ml/born/backend/webgpu"
    "github.com/born-ml/born/tensor"
)

// Automatic GPU/CPU selection with graceful fallback
var backend tensor.Backend
if webgpu.IsAvailable() {
    gpu, err := webgpu.New()
    if err == nil {
        backend = autodiff.New(gpu)
        defer gpu.Release() // Don't forget to release GPU resources
    }
}
if backend == nil {
    backend = autodiff.New(cpu.New())
}
Decorator Pattern

Functionality composed via decorators (inspired by Burn):

// Basic backend
base := cpu.New()

// Add autodiff
withAutodiff := autodiff.New(base)

// Add kernel fusion
optimized := fusion.New(withAutodiff)

// Your code works with any backend!
model := createModel(optimized)
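
Because each layer implements the same Backend interface shown above, writing your own decorator is straightforward. A hypothetical logging decorator, sketched assuming the Backend and RawTensor types from the interface snippet (this is not part of the library):

// loggingBackend wraps any Backend and logs each op before delegating.
type loggingBackend struct {
    inner Backend
}

func (l *loggingBackend) Add(a, b *RawTensor) *RawTensor {
    log.Println("op: Add")
    return l.inner.Add(a, b)
}

func (l *loggingBackend) MatMul(a, b *RawTensor) *RawTensor {
    log.Println("op: MatMul")
    return l.inner.MatMul(a, b)
}

// ...implement the remaining Backend methods the same way, then compose:
// backend := autodiff.New(&loggingBackend{inner: cpu.New()})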
Type Safety with Generics
type Tensor[T DType, B Backend] struct {
    raw     *RawTensor
    backend B
}

// Compile-time type checking: MatMul only accepts a tensor with the
// same element type T and backend B as the receiver.
func (t *Tensor[T, B]) MatMul(other *Tensor[T, B]) *Tensor[T, B]
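
Continuing the snippet above (CPU here is a hypothetical backend type, purely for illustration), a dtype or backend mismatch is rejected before the program ever runs:

var a, b *Tensor[float32, CPU]
c := a.MatMul(b) // OK: element types and backends match

var d *Tensor[int32, CPU]
// a.MatMul(d) // does not compile: int32 tensor passed where float32 is required
_, _ = c, d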

Roadmap

✅ What's Working

Core Framework

  • Tensor API with generics, autodiff, NN modules (Linear, Conv2D, ReLU, etc.)
  • Optimizers (SGD, Adam), losses (CrossEntropyLoss)
  • MNIST: 97.44% MLP, 98.18% CNN accuracy

GPU Acceleration

  • WebGPU backend with 38+ operations (123x MatMul speedup)
  • Lazy evaluation, command batching (~90s → <5s/step)
  • CNN support (Conv2D, MaxPool2D, BatchMatMul)

LLM & Transformers

  • Multi-Head Attention, GQA, KV-Cache (3.94x speedup)
  • RoPE, ALiBi, RMSNorm, SwiGLU
  • Tokenizers (TikToken, BPE), text generation with streaming

Model Import & Export

  • ONNX import (30+ operators)
  • GGUF loading (LLaMA, Mistral, DeepSeek)
  • Native .born format, SafeTensors export
🚀 Upcoming

Quantization (v0.8.0) - GPTQ/AWQ (4x smaller), KV Cache compression, Model Zoo

Production Serving - PagedAttention, Continuous Batching, OpenAI-compatible API

Scale & Stability - Multi-GPU, CPU SIMD (AVX2/Neon), Gradient Checkpointing

v1.0 LTS - API freeze, 3+ years support, production hardening

Full roadmap & changelog: See ROADMAP.md and CHANGELOG.md


Documentation

For Users
For Contributors

Philosophy

"Born Ready"

Models trained anywhere (PyTorch, TensorFlow) are imported and born production-ready:

Training → Birth → Production
 (Burn)    (Born)    (Run)

PyTorch trains    → Born imports → Born deploys
TensorFlow trains → Born imports → Born deploys
Born trains       → Born ready   → Born serves
Production First
  • Single Binary: Entire model in one executable
  • No Runtime: No Python, no dependencies
  • Fast Startup: < 100ms cold start
  • Small Memory: Minimal footprint
  • Cloud Native: Natural fit for Go services
Developer Experience
  • Type Safe: Catch errors at compile time
  • Clean API: Intuitive and ergonomic
  • Great Docs: Comprehensive documentation
  • Easy Deploy: go build and you're done

Performance

Actual Benchmarks (AMD Ryzen 9 5950X, NVIDIA RTX 3080):

Matrix Operations (WebGPU vs CPU)

| Operation        | CPU    | GPU   | Speedup |
|------------------|--------|-------|---------|
| MatMul 1024x1024 | 7143ms | 58ms  | 123x    |
| MatMul 512x512   | 499ms  | 12ms  | 41x     |
| MatMul 256x256   | 56ms   | 3.7ms | 15x     |

Neural Network Inference

| Batch Size | CPU   | GPU  | Speedup | Throughput |
|------------|-------|------|---------|------------|
| 64         | 48ms  | 19ms | 2.5x    | 3,357/s    |
| 256        | 182ms | 21ms | 8.5x    | 11,883/s   |
| 512        | 348ms | 32ms | 10.9x   | 15,973/s   |

Note: CPU backend uses naive O(n³) MatMul. SIMD optimizations planned for future releases.

WebGPU WGSL Shaders

Born includes 30+ optimized WGSL compute shaders:

| Shader            | Workgroup | Description                        |
|-------------------|-----------|------------------------------------|
| addShader         | 256       | Element-wise addition              |
| subShader         | 256       | Element-wise subtraction           |
| mulShader         | 256       | Element-wise multiplication       |
| divShader         | 256       | Element-wise division              |
| matmulShader      | 16x16     | Matrix multiplication (2D)         |
| batchMatMulShader | 8x8x1     | Batched matmul (3D/4D)             |
| conv2dShader      | 8x8x1     | 2D convolution with padding        |
| maxPool2dShader   | 8x8x1     | 2D max pooling                     |
| transposeShader   | 16x16     | Matrix transpose                   |
| reluShader        | 256       | ReLU activation                    |
| sigmoidShader     | 256       | Sigmoid activation                 |
| tanhShader        | 256       | Tanh activation                    |
| softmaxShader     | 256       | Softmax (numerically stable)       |
| expShader         | 256       | Element-wise exp                   |
| sqrtShader        | 256       | Element-wise sqrt                  |
| rsqrtShader       | 256       | Reciprocal sqrt (1/√x)             |
| cosShader         | 256       | Element-wise cosine                |
| sinShader         | 256       | Element-wise sine                  |
| greaterShader     | 256       | Greater-than comparison            |
| lowerShader       | 256       | Less-than comparison               |
| equalShader       | 256       | Equality comparison                |
| andShader         | 256       | Logical AND                        |
| orShader          | 256       | Logical OR                         |
| notShader         | 256       | Logical NOT                        |
| argmaxShader      | 256       | Argmax along dimension             |
| globalSumShader   | 256       | Parallel sum reduction             |
| scalarMulShader   | 256       | Scalar multiplication              |
| scalarAddShader   | 256       | Scalar addition                    |
| addShaderInt32    | 256       | Int32 element-wise addition        |
| subShaderInt32    | 256       | Int32 element-wise subtraction     |
| mulShaderInt32    | 256       | Int32 element-wise multiplication  |
| divShaderInt32    | 256       | Int32 element-wise division        |

All shaders use workgroup shared memory for performance and include bounds checking for safety.


Inspiration

Born is inspired by and learns from:

  • Burn - Architecture patterns, decorator design
  • PyTorch - API ergonomics
  • TinyGrad - Simplicity principles
  • Gonum - Go numerical computing
  • HDF5 for Go - Model serialization, dataset storage (planned)

Acknowledgments

Special thanks to the projects that made Born possible:

πŸ™ go-webgpu & wgpu-native

Born's GPU acceleration is powered by go-webgpu - a remarkable pure Go binding for WebGPU via wgpu-native.

Why this stack is special:

  • Zero CGO - Pure Go bindings using goffi for FFI
  • Cross-platform - Works on Windows (D3D12), Linux (Vulkan), macOS (Metal)
  • Modern API - Clean, idiomatic Go interface to WebGPU
  • wgpu-native - Battle-tested Rust implementation of WebGPU by gfx-rs
  • Active development - Both projects are actively maintained

Without go-webgpu and wgpu-native, Born would need CGO for GPU support, making cross-compilation complex and defeating our "pure Go" goal. This stack enables us to offer production-ready GPU acceleration while maintaining the simplicity of go build.

Thank you to Alfred Dobra, gfx-rs team, and all contributors!


Community

Project is in early development. Star the repo to follow progress!


License

Licensed under the Apache License, Version 2.0.

Why Apache 2.0?

  • ✅ Patent protection - Critical for ML algorithms and production use
  • ✅ Enterprise-friendly - Clear legal framework for commercial adoption
  • ✅ Industry standard - Same as TensorFlow, battle-tested in ML ecosystem
  • ✅ Contributor protection - Explicit patent grant and termination clauses

See LICENSE file for full terms.


FAQ

Q: Why not use Gorgonia? A: Gorgonia is great but uses a different approach. Born focuses on modern Go (generics), pure Go (no CGO), and production-first design inspired by Burn.

Q: Can I run LLMs with Born? A: Yes! Full LLM support included - GGUF model loading, tokenizers, sampling strategies, and text generation with streaming. Load LLaMA, Mistral, or DeepSeek models directly.

Q: When will it be ready? A: Core features are released! CPU/GPU backends, transformers, LLM support, and ONNX import all work. See ROADMAP.md for upcoming features.

Q: Can I use PyTorch models? A: Yes! Via ONNX import. Train in PyTorch, export to ONNX, deploy with Born. GGUF models are also supported.
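
A hedged sketch of that path: the onnx package is part of this module, but the Import entry point and signature below are assumptions, so check the package docs for the real API.

import (
    "log"

    "github.com/born-ml/born/onnx"
)

// Hypothetical: load a model exported from PyTorch via torch.onnx.export
model, err := onnx.Import("resnet50.onnx", backend)
if err != nil {
    log.Fatal(err)
}
_ = model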

Q: WebAssembly support? A: Yes! Pure Go compiles to WASM natively. Inference in browsers out of the box.
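
Since everything is pure Go, a browser build uses the stock WASM target; for example:

# Build the current package for the browser with the standard Go toolchain
GOOS=js GOARCH=wasm go build -o inference.wasm .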

Q: What LLM architectures are supported? A: LLaMA 2/3, Mistral, DeepSeek, and compatible architectures. GQA, RoPE, SwiGLU are all supported.

Q: How do I enable GPU acceleration? A: Install the wgpu_native library from the wgpu-native releases, then use webgpu.IsAvailable() to check for GPU support. See Architecture for setup instructions. 38+ GPU operations are included - everything needed for LLM inference!

Q: What GPU operations are supported? A: All operations needed for production ML! Math (Add, Mul, Exp, etc.), Matrix (MatMul, BatchMatMul, Conv2D), Activations (ReLU, Softmax), Comparisons (Greater, Equal), Boolean (And, Or, Not), Reductions (Sum, Argmax), and more. See the WebGPU Operation Table.

Q: How can I help? A: Check our Contributing Guide and GitHub Issues!


Born for Production. Ready from Day One.

Made with ❤️ by the Born ML team

Documentation • Contributing • Community

Directories

Path Synopsis

autodiff
    Package autodiff provides automatic differentiation capabilities.
backend
    cpu
        Package cpu provides a pure Go CPU backend for tensor operations.
    webgpu
        Package webgpu provides the WebGPU backend for GPU-accelerated tensor operations.
cmd
    born (command)
        Package main provides the Born ML Framework CLI.
    born-bench (command)
        Package main provides benchmarking tools for Born.
    born-convert (command)
        Package main provides model conversion tools for Born.
examples
    mnist (command)
    mnist-cnn (command)
    mnist-gpu (command)
        MNIST GPU Inference Benchmark
generate
    Package generate provides text generation utilities for LLMs in Born ML.
internal
    autodiff
        Package autodiff implements automatic differentiation using the decorator pattern.
    autodiff/ops
        Package ops defines operation interfaces and implementations for automatic differentiation.
    backend
        Package backend provides backend implementations for tensor operations.
    backend/cpu
        Package cpu implements the CPU backend with SIMD optimizations and BLAS integration.
    backend/webgpu
        Package webgpu implements the WebGPU backend for GPU-accelerated tensor operations.
    generate
        Package generate provides text generation utilities for LLMs.
    gguf
        Package gguf provides GGUF file format parsing and model loading.
    loader
        Package loader provides model weight loading functionality for Born ML framework.
    nn
        Package nn provides neural network modules and layers for building deep learning models.
    onnx
        Package onnx provides ONNX model import/export functionality.
    onnx/operators
        Package operators implements ONNX operator mapping to Born tensor operations.
    optim
        Package optim implements optimization algorithms for training neural networks.
    parallel
        Package parallel provides parallel execution utilities for the Born ML framework.
    serialization
        Package serialization provides native .born format for saving and loading Born ML models.
    tensor
        Package tensor provides the core tensor types and operations for Born ML framework.
    tokenizer
        Package tokenizer provides text tokenization for LLM inference.
loader
    Package loader provides model weight loading functionality for Born ML framework.
nn
    Package nn provides neural network layers and building blocks.
onnx
    Package onnx provides ONNX model import functionality for Born ML framework.
optim
    Package optim provides optimization algorithms for training neural networks.
tensor
    Package tensor provides type-safe tensor operations for the Born ML framework.
tokenizer
    Package tokenizer provides text tokenization for LLM inference in Born ML.
