LOOM - Deterministic Neural Virtual Machine
"The SQLite of AI": A Polyglot Neural VM with Bit-Exact Reproducibility
Loom is a Deterministic Neural Virtual Machine (DNVM), a portable execution environment for neural networks that guarantees bitwise-identical results across all platforms, backends, and language bindings. It combines a JIT compiler (generating WebGPU shaders at runtime) with a pure Go CPU backend to deliver the same numerical results everywhere:
- Portable IR: JSON network configs are your "bytecode": define once, execute anywhere.
- JIT to GPU: Runtime WGSL shader generation → WebGPU compute pipelines.
- Polyglot FFI: Single Go core exports to Python, C#, TypeScript, WASM via C-ABI.
- Bit-Exact: 0.0000000000 difference between CPU and GPU, x86 and ARM, native and browser.
Unlike frameworks that disclaim cross-platform reproducibility, Loom enforces determinism by design. It compiles to a single binary with zero dependencies, transparently routing operations to CPU or WebGPU without changing user code.
Cross-Ecosystem Compatibility
Models trained on any platform work in all the others, with bit-for-bit identical results across Go, Python, C#, TypeScript, and browser WASM.
| Platform | Package | Install |
|---|---|---|
| Go | GitHub | go get github.com/openfluke/loom |
| Python | PyPI | pip install welvet |
| C#/.NET | NuGet | dotnet add package Welvet |
| TypeScript/Node | NPM | npm install @openfluke/welvet |
| Browser | WASM | import { init } from "@openfluke/welvet" |
Supported Platforms
Pre-compiled binaries for:
- Linux: x86_64, ARM64, ARMv7
- Windows: x86_64, x86, ARM64
- macOS: Apple Silicon (M1/M2/M3), Intel, Universal
- Android: ARM64, ARMv7
- iOS: ARM64 (XCFramework)
Technical Architecture
What is Loom?
Loom is a Deterministic Neural Virtual Machine (DNVM), a portable execution environment for neural networks that guarantees bitwise-identical results across all platforms, backends, and language bindings.
┌─────────────────────────────────────────────────────────────────────────┐
│                            LOOM ARCHITECTURE                            │
├─────────────────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐     │
│  │   Python    │  │ TypeScript  │  │     C#      │  │    WASM     │     │
│  │   Binding   │  │   Binding   │  │   Binding   │  │   Browser   │     │
│  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘  └──────┬──────┘     │
│         │                │                │                │            │
│         └────────────────┴───────┬────────┴────────────────┘            │
│                                  ▼                                      │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                         C-ABI (FFI Layer)                         │  │
│  │          Handle-based state management, JSON marshalling          │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│                                  │                                      │
│                                  ▼                                      │
│  ┌───────────────────────────────────────────────────────────────────┐  │
│  │                      EXECUTION ENGINE (nn/)                       │  │
│  │     Forward/Backward passes, Optimizers, Schedulers, Tweening     │  │
│  └───────────────────────────────────────────────────────────────────┘  │
│           │                                        │                    │
│           ▼                                        ▼                    │
│  ┌─────────────────┐                  ┌─────────────────────────┐       │
│  │   CPU Backend   │                  │    GPU JIT Compiler     │       │
│  │    (Pure Go)    │                  │    (WGSL Generation)    │       │
│  │                 │                  │            ▼            │       │
│  │  Deterministic  │                  │  ┌───────────────────┐  │       │
│  │  IEEE-754 Math  │ ◄──────────────► │  │  WebGPU Runtime   │  │       │
│  └─────────────────┘  Bit-identical   │  └───────────────────┘  │       │
│                          results      └─────────────────────────┘       │
└─────────────────────────────────────────────────────────────────────────┘
Classification
| Term | Description |
|---|---|
| Virtual Machine | Executes a portable IR (JSON network configs) on heterogeneous backends |
| JIT Compiler | Generates WGSL shaders at runtime, compiles to GPU compute pipelines |
| Deterministic | Guarantees bitwise-identical results across CPU, GPU, WASM, x86, ARM |
| Polyglot | Single Go core exports to Python, C#, TypeScript, WASM via C-ABI |
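To make the "portable IR" idea concrete, here is a minimal sketch of a JSON network config being written and read back in Go. Only the `batch_size` and `layers` keys and the layer type strings appear in this README; the `LayerConfig` and `NetworkConfig` structs, the `type` and `activation` field names, and the helper functions are hypothetical and may not match Loom's real schema.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// LayerConfig and NetworkConfig are hypothetical shapes for illustration.
// Only "batch_size", "layers", and the type strings come from this README.
type LayerConfig struct {
	Type       string `json:"type"`
	Activation string `json:"activation,omitempty"`
}

type NetworkConfig struct {
	BatchSize int           `json:"batch_size"`
	Layers    []LayerConfig `json:"layers"`
}

// encodeConfig serializes a config to the JSON text a backend would consume.
func encodeConfig(cfg NetworkConfig) (string, error) {
	b, err := json.Marshal(cfg)
	return string(b), err
}

// decodeConfig parses JSON text back into a config.
func decodeConfig(s string) (NetworkConfig, error) {
	var cfg NetworkConfig
	err := json.Unmarshal([]byte(s), &cfg)
	return cfg, err
}

func main() {
	cfg := NetworkConfig{
		BatchSize: 1,
		Layers: []LayerConfig{
			{Type: "dense", Activation: "relu"},
			{Type: "softmax"},
		},
	}
	text, err := encodeConfig(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println(text)
}
```

Because the config is plain JSON text, the same string can be handed unchanged to any binding that accepts a JSON network description.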
Architectural Layers
| Layer | Component | Role |
|---|---|---|
| IR (Bytecode) | JSON network configs, `nn/serialization.go` | Portable, declarative network specification |
| Type System | `nn/types.go` with `Tensor[T Numeric]` | Multi-precision tensors (F64 to I8), generic operations |
| Execution | `nn/forward.go`, `nn/backward.go` | Deterministic layer-by-layer forward/backward |
| JIT Backend | `gpu/*.go` | Runtime WGSL generation → WebGPU pipelines |
| FFI Runtime | `cabi/main.go` | Handle-based API, state management, memory safety |
| Bindings | `python/`, `csharp/`, `typescript/`, `wasm/` | Thin wrappers exposing the C-ABI |
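The type-system row names `Tensor[T Numeric]` without showing its definition. The sketch below illustrates how a Go generic constraint lets one operation serve every numeric type. The `Numeric` constraint, the `Tensor` fields, and the `NewTensor` and `Add` helpers are assumptions made for illustration, not Loom's actual declarations.

```go
package main

import "fmt"

// Numeric is an assumed constraint; the README names Tensor[T Numeric]
// but does not show how it is defined.
type Numeric interface {
	~int8 | ~int16 | ~int32 | ~int64 |
		~uint8 | ~uint16 | ~uint32 | ~uint64 |
		~float32 | ~float64
}

// Tensor is a hypothetical flat tensor with a shape.
type Tensor[T Numeric] struct {
	Data  []T
	Shape []int
}

// NewTensor allocates a zeroed tensor with the given shape.
func NewTensor[T Numeric](shape ...int) *Tensor[T] {
	n := 1
	for _, d := range shape {
		n *= d
	}
	return &Tensor[T]{Data: make([]T, n), Shape: shape}
}

// Add returns the element-wise sum; the same code serves every numeric type.
func Add[T Numeric](a, b *Tensor[T]) (*Tensor[T], error) {
	if len(a.Data) != len(b.Data) {
		return nil, fmt.Errorf("size mismatch: %d vs %d", len(a.Data), len(b.Data))
	}
	out := NewTensor[T](a.Shape...)
	for i := range a.Data {
		out.Data[i] = a.Data[i] + b.Data[i]
	}
	return out, nil
}

func main() {
	a := &Tensor[float32]{Data: []float32{1, 2}, Shape: []int{2}}
	b := &Tensor[float32]{Data: []float32{3, 4}, Shape: []int{2}}
	fsum, _ := Add(a, b)
	fmt.Println(fsum.Data)

	x := &Tensor[int8]{Data: []int8{5, -3}, Shape: []int{2}}
	y := &Tensor[int8]{Data: []int8{1, 1}, Shape: []int{2}}
	isum, _ := Add(x, y)
	fmt.Println(isum.Data)
}
```

The compiler instantiates `Add` once per element type, so switching a model from `float32` to `int8` would need no change to the operation itself.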
Determinism Guarantee
Unlike typical ML runtimes that disclaim cross-platform reproducibility, Loom enforces bit-exact determinism:
┌──────────────────────────────────────────────────────────────────────┐
│ Testing: Dense                                                       │
├──────────────────────────────────────────────────────────────────────┤
│ • Max Diff: 0.0000000000 (Idx: -1)                                   │
│ • Mean Diff: 0.0000000000                                            │
│                                                                      │
│ [GOLD STANDARD] Exact Bit-Determinism                                │
│ Perfect match. CPU and GPU logic are identical down to the bit.      │
│                                                                      │
│ Output Sample:                                                       │
│ [0] CPU: 0.5010004044 | GPU: 0.5010004044 | Diff: 0.0000000000       │
└──────────────────────────────────────────────────────────────────────┘
Verified across:
- CPU (Go) ↔ GPU (WebGPU/WGSL)
- x86_64 ↔ ARM64 ↔ ARMv7
- Linux ↔ Windows ↔ macOS ↔ Android ↔ iOS
- Native ↔ WASM (Browser)
Comparison to Similar Projects
| Project | What It Is | How Loom Differs |
|---|---|---|
| ONNX Runtime | Multi-backend inference engine | Loom adds training, bidirectional FFI, and determinism guarantees |
| GGML | Quantized inference library | Loom adds GPU JIT compilation and cross-platform bitwise reproducibility |
| TVM | Compiler infrastructure for ML | Loom is simpler (pure Go), directly embeddable, with determinism by design |
| WebAssembly | Portable bytecode standard | Loom's JSON network configs are conceptually "WASM for neural compute" |
Why This Matters
- Reproducible Research: Same model, same inputs → same outputs, regardless of where it runs
- Cross-Platform Deployment: Train on Linux GPU, deploy to iOS/Android/Browser with identical behavior
- Debugging: No "works on my machine" issues from floating-point non-determinism
- Verification: Prove correctness once, trust it everywhere
Key Strengths
- True Embeddability: Single binary. Zero external dependencies. No Python runtime needed.
- Hybrid Gradient/Geometric Engine: Neural Tweening combines geometric gap-closing with backpropagation-guided momentum for real-time adaptation.
- Geometric/Recursive Clustering: Differentiable `KMeansLayer` allows networks to learn interpretable symbolic prototypes within a neural hierarchy.
- Structural Parallelism: Native support for Inception, ResNeXt, Siamese, and MoE architectures via `LayerParallel` with 6 combine modes.
- Native Mixed-Precision: Generic tensor backend supports `int8`, `uint16`, `float32`, `float64` natively.
- Complete Training Infrastructure: 7 LR schedulers, 3 optimizers (SGD/AdamW/RMSprop), 10 softmax variants.
- Pure Go Tokenizer: HuggingFace-compatible BPE tokenizer for LLM inference.
- Step-Based Execution: Real-time inference with layer-by-layer control via the `StepForward` API.
- Network Telemetry: Runtime introspection via `GetMethodsJSON()` and `ExtractNetworkBlueprint()`.
Key Limitations
- Ecosystem Maturity: No central "Model Zoo" or pip-installable convenience; relies on loading external checkpoints.
- GPU Support: WebGPU acceleration is implemented (Dense, Conv2D, MHA) but is beta/experimental and less stable than cuDNN/CUDA.
- Operator Coverage: While "Deep" support is good (MHA, LSTM), "Broad" support (e.g., 3D Conv, Deformable Attn, FFTs) is missing compared to SciPy/JAX.
- Math Backend: Relies on custom explicit forward/backward passes rather than a general-purpose symbolic autograd graph.
Recommended Configurations
Based on exhaustive benchmarks (300+ combinations tested), here are the optimal configurations:
Training Mode Selection
| Scenario | Recommended Mode | Why |
|---|---|---|
| Real-time / Robotics | `StepBP` or `StepTweenChain` | 100% availability, 0ms blocking |
| Noisy / Adversarial Data | `StepTweenChain` | 94% robustness vs 86% for NormalBP |
| Offline Batch Training | `NormalBP` | Highest accuracy when blocking is acceptable |
| Multi-Agent Systems | `StepBP` | 12x better coordination vs blocked training |
| Continuous Adaptation | `StepTweenChain` | Maintains competence during distribution shift |
Layer × Training Mode (float32)
| Layer | Best Mode | Score | Accuracy | Availability |
|---|---|---|---|---|
| Conv2D | StepTweenChain | 1187 | 98.7% | 100% |
| Conv2D | StepTween | 1012 | 98.7% | 100% |
| Attention | StepTween | 830 | 90.1% | 100% |
| RNN | StepTween | 663 | 76.5% | 100% |
| Dense | StepTween | 379 | 42.5% | 100% |
| LSTM | NormalBP | 49 | 53.6% | 28.7% |
[!TIP] Conv2D + StepTweenChain + float32 is the optimal configuration for most real-time scenarios, achieving 98.7% accuracy with 100% availability.
Numeric Type Selection
| Type | Best For | Notes |
|---|---|---|
| float32 | Most use cases | 18/30 benchmark wins, best accuracy |
| float64 | Scientific computing | Higher precision, slower, wins with NormalBP |
| int16 | LSTM layers | Only type that works for step-based LSTM |
| uint16 | Edge/embedded | Good balance of range and speed |
[!NOTE] Integer types (`int8`, `uint8`, etc.) work but achieve only ~13-23% accuracy on adaptive tasks. Use floats for training, integers for quantized inference.
Benchmarks and Repro
Benchmark methodology and results live in docs/step_tween_assessment.md. Results are hardware- and build-dependent; use CPU runs as the reference baseline when comparing.
What's New
🧠 Recursive Neuro-Symbolic Architecture: The differentiable `KMeansLayer` enables models to learn hierarchical concept taxonomies. Perfect for OOD detection and robust classification. See `docs/research_paper_7_recursive_neuro_symbolic.md`.
Transformer Inference: SmolLM2-135M-Instruct runs entirely in browser WASM with a pure Go implementation.
🤯 Grid Softmax = Native MoE: Mathematically proven equivalent to PyTorch MoE with 97.1% loss reduction. See `examples/moe_proof_demo.go`.
⚡ Grid Scatter Mode: Place parallel branch outputs at specific 2D/3D grid positions for multi-agent systems, hierarchical RL, and ensemble methods with explicit topology.
🧠 Neural Tweening: Train and run simultaneously with 100% accuracy on shallow networks; accuracy never crashes to 0% during task changes. Benchmarks →
📦 Recursive Safetensors: Full support for deeply nested architectures (MoE, Sequential, Parallel) with 100% bitwise save/load consistency. Verified with `tva/testing/safetensors_recursive.go`.
🔢 Numerical Type Benchmarking: Compare network behavior across 13 numerical types (F64, F32, F16, BF16, F4, I64, I32, I16, I8, U64, U32, U16, U8) with in-memory quantization. WASM-compatible for browser deployment testing.
🧪 MNIST Verification: End-to-end demo `tva/demo/conv2d-mnist/main.go` proving exact CPU/GPU consistency, training convergence, and multi-precision save/load integrity.
Framework Comparison
Global AI Landscape
| Feature Category | Feature | Loom (Go) | PyTorch (Py) | TF / TFLite | GoMLX (Go) | Spago (Go) | Core ML | TF.js | Candle (Rust) |
|---|---|---|---|---|---|---|---|---|---|
| Core | Primary Language | Go | Python | Python / C++ | Go | Go | Swift / ObjC | JS / TS | Rust |
| Runtime Dependency | None (Binary) | Heavy (Pip) | Binary (Edge) | CGo / XLA | None | OS-Native | Browser | None | |
| Auto-Differentiation | β οΈ Hybrid/Manual | β Full | β Full | β Full (XLA) | β Manual | β (Inference) | β Full | β Full | |
| Safetensors | β Native | β | β | β | β | β | β | β | |
| ONNX Support | β | β (Export) | β | β οΈ | β | β (Import) | β | β οΈ | |
| Structure Inference | β Auto-Detect | β | β | β | β | β | β | β | |
| Training | Gradient Descent | β Manual Chain | β Standard | β Standard | β Standard | β Standard | β (On-device) | β Standard | β Standard |
| Neural Tweening | β Hybrid Engine | β | β | β | β | β | β | β | |
| LR Schedulers | β 7 Types | β | β | β | β οΈ Basic | β | β | β | |
| Optimizers | β 3 (SGD/AdamW/RMSprop) | β Many | β Many | β | β | β οΈ | β | β | |
| Layer Support | Dense (MLP) | β | β | β | β | β | β | β | β |
| Conv2D | β | β | β | β | β | β | β | β | |
| Conv1D | β Native | β | β | β | β | β | β | β | |
| RNN / LSTM | β Full Gate | β | β | β | β | β | β | β | |
| Transformer (MHA) | β (Explicit) | β | β | β | β (BERT) | β | β | β | |
| SwiGLU | β Native | β | β | β | β | β | β | β | |
| Parallel / MoE | β Structure | β (Manual) | β (Manual) | β | β | β | β | β | |
| Sequential Layers | β Native | β | β | β οΈ | β οΈ | β οΈ | β | β οΈ | |
| Embeddings | β | β | β | β | β | β | β | β | |
| Tokenizer | β Pure Go | β (Rust/C++) | β (C++) | β | β | β | β | β | |
| Normalization | LayerNorm | β Native | β | β | β | β | β | β | β |
| RMSNorm | β Native | β οΈ (Manual) | β οΈ (Manual) | β | β | β | β | β | |
| Residual/Skip | β Native | β | β | β | β | β | β | β | |
| Advanced | Stitch Layers | β Native | β (Manual) | β (Manual) | β | β | β | β | β |
| Dynamic Arch Gen | β Built-in | β | β | β | β | β | β | β | |
| Step-Based Forward | β Unique | β | β | β | β | β | β | β | |
| K-Means Clustering | β Differentiable | β | β | β | β | β | β | β | |
| Correlation Analysis | β Pearson/Spearman | β | β | β | β | β | β | β | |
| Model Evaluation | β Deviation/Metrics | β | β | β οΈ | β οΈ | β οΈ | β οΈ | β οΈ | |
| Network Telemetry | β Blueprint API | β | β οΈ | β | β | β | β οΈ | β | |
| Runtime Introspection | β Reflection | β οΈ (Python) | β οΈ | β | β | β | β οΈ | β | |
| Platform | WASM Training | β Full | β | β | β | β | β | β (Slow) | β |
| Cross-Lang ABI | β Universal | β | β | β | β | β | β | β οΈ | |
| Ecosystem | HuggingFace Hub | β οΈ (Read/Inspect) | β Native | β Native | β | β | β | β | β |
| Pre-trained Zoo | β | β Massive | β Massive | β | β (Small) | β (Apple) | β Large | β οΈ Growing | |
| Mobile/Web | β WASM / C-ABI | β (Mobile) | β King | β | β | β King (iOS) | β King (Web) | β (WASM) |
Go Ecosystem Comparison
| Category | Feature | Loom | GoMLX | Gorgonia | Spago | Go-Deep | Gonum |
|---|---|---|---|---|---|---|---|
| Foundation | Primary implementation | Pure Go | CGo (XLA) | Pure Go + CGo | Pure Go | Pure Go | Pure Go |
| Tensor Backend | Custom (Generic) | XLA (C++) | Custom | Custom (Dense) | Custom | Dense Matrix | |
| Autograd | β οΈ Hybrid | β Full | β Symbolic | β Dynamic | β Backprop | β | |
| Model | Load Safetensors | β Native | β | β | β | β | β |
| Model Export | binary/json | XLA format | Onnx (Import) | Gob | Json | β | |
| Architecture | Dense (MLP) | β | β | β | β | β | β (Matrix Mul) |
| Conv2D | β | β | β | β | β | β | |
| Conv1D | β Native | β | β οΈ (via 2D) | β οΈ (via 2D) | β | β | |
| RNN / LSTM | β Full Gate | β | β οΈ Basic | β BiLSTM | β | β | |
| Transformer (MHA) | β Explicit | β | β οΈ Hard | β (BERT) | β | β | |
| SwiGLU | β | β | β | β | β | β | |
| Embeddings | β | β | β | β | β | β | |
| Parallel / MoE | β MoE + Gating | β (Manual) | β | β | β | β | |
| Sequential Layers | β Native + Nested | β οΈ (Manual) | β οΈ (Manual) | β οΈ (Manual) | β | β | |
| Tokenizer | β Pure Go | β (Deps) | β | β (WordPiece) | β | β | |
| Training | Gradient Descent | β Manual | β Standard | β Standard | β Standard | β Standard | β |
| Hybrid Tweening | β Unique | β | β | β | β | β | |
| LR Schedulers | β 7 Types | β | β | β οΈ Basic | β | β | |
| Optimizers | β SGD/AdamW/RMSprop | β | β | β | β οΈ SGD | β | |
| Softmax Variants | β 10 Types | β οΈ Standard | β οΈ Standard | β οΈ Standard | β οΈ Standard | β | |
| Normalization | LayerNorm | β Native | β | β οΈ Manual | β | β | β |
| RMSNorm | β Native | β | β | β | β | β | |
| Residual/Skip | β Native | β | β | β | β | β | |
| Advanced | RoPE Embeddings | β GQA Support | β | β | β | β | β |
| Network Grafting | β Unique | β | β | β | β | β | |
| Step-Based Forward | β Unique | β | β | β | β | β | |
| Dynamic Arch Gen | β Unique | β | β | β | β | β | |
| K-Means Clustering | β Differentiable | β | β | β | β | β | |
| Correlation Analysis | β Pearson/Spearman | β | β | β | β | β | |
| Model Evaluation | β Full Suite | β οΈ | β οΈ | β οΈ | β | β | |
| Network Telemetry | β Blueprint | β | β οΈ | β | β | β | |
| Runtime Introspection | β Reflection | β | β οΈ | β | β | β | |
| Platform | C-ABI (Polyglot) | β Universal | β | β | β | β | β |
| WASM Training | β Full | β (XLA) | β | β | β | β | |
| Ecosystem | HuggingFace | β οΈ (Load) | β | β | β (Load) | β | β |
| Documentation | β οΈ Growing | β Good | β Good | β Good | β οΈ Minimal | β Excellent | |
| Maintenance | π₯ Active | π₯ Active | β οΈ Slow | βΈοΈ Paused | β οΈ Slow | π₯ Active |
Native Numerical Type & Precision Support
| Layer Type | Numerical Type | Loom | GoMLX | Gorgonia | Spago | PyTorch |
|---|---|---|---|---|---|---|
| All Layers | Float32 | β | β | β | β (Float64) | β |
| (Dense, Conv, | Float64 (High Prec) | β Native | β | β | β | β |
| RNN, Attn) | Float16 / BF16 | β οΈ (Storage) | β (XLA) | β | β | β |
| Int8 (Training) | β Native | β | β | β | β οΈ (QAT Wrapper) | |
| Int8 (Inference) | β | β | β | β | β (Quant) | |
| Int16, Int32, Int64 | β Native | β (XLA) | β οΈ (Tensor) | β | β (Tensor Only) | |
| Uint8, Uint16, Uint32 | β Native | β (XLA) | β οΈ (Tensor) | β | β (Uint8 Only) |
[!NOTE] Complete Type System: Unlike frameworks that treat integers primarily as storage formats for quantization, Loom's Generics allow native training and inference on exotic types like `uint16` (common in medical imaging), `int32`, or `float64` (scientific sim) across every layer type without changes to the model code.
Summary Verdict
- Choose PyTorch if you are doing Research, need the latest SOTA models, or rely on complex dynamic architectures.
- Choose TensorFlow / TFLite if you need robust Mobile/Edge Deployment.
- Choose GoMLX if you need High-Performance Training in Go and can tolerate CGo/C++ dependencies.
- Choose Core ML if you are targeting iOS/macOS exclusively.
- Choose Loom if you need Pure Go-Native Embedding (Cloud/CLI/Server), want a single binary with zero dependencies, need to experiment with the Neural Tweening training paradigm, or need unique features like Step-Based Forward Pass for real-time inference and Dynamic Architecture Generation for automated model exploration.
Layer Types & Features
Supported Layer Types
| Layer | Type String | Description |
|---|---|---|
| Dense | `dense` | Fully connected layer |
| LSTM | `lstm` | Long Short-Term Memory |
| RNN | `rnn` | Recurrent Neural Network |
| GRU | `gru` | Gated Recurrent Unit |
| Conv2D | `conv2d` | 2D Convolution |
| Conv1D | `conv1d` | 1D Convolution |
| Multi-Head Attention | `multi_head_attention` | Transformer attention |
| LayerNorm | `layer_norm` | Layer normalization |
| RMSNorm | `rms_norm` | RMS normalization |
| SwiGLU | `swiglu` | SwiGLU activation layer |
| KMeans | `kmeans` | Differentiable recursive clustering layer |
| Softmax | `softmax` | 10 variants (Standard, Grid, Hierarchical, Temperature, Gumbel, Masked, Sparsemax, Entmax, Adaptive, Mixture) |
| Embedding | `embedding` | Token embedding |
| Parallel | `parallel` | Branching with 6 combine modes (add, concat, multiply, average, grid_scatter, filter) |
| Sequential | `sequential` | Grouped sub-layers |
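The combine modes of the Parallel layer can be illustrated with plain slices. The sketch below covers four of the six modes named in the table (add, concat, multiply, average) using their usual element-wise meanings, which is an assumption about how Loom defines them. `grid_scatter` and `filter` are omitted because the table does not describe their semantics, and `combine` is a hypothetical helper rather than Loom's API.

```go
package main

import (
	"errors"
	"fmt"
)

// combine merges branch outputs using one of four modes. The element-wise
// definitions are the conventional ones and are assumed, not confirmed.
func combine(mode string, branches [][]float32) ([]float32, error) {
	if len(branches) == 0 {
		return nil, errors.New("no branches")
	}
	switch mode {
	case "concat":
		var out []float32
		for _, b := range branches {
			out = append(out, b...)
		}
		return out, nil
	case "add", "multiply", "average":
		// handled below
	default:
		return nil, fmt.Errorf("unsupported mode %q", mode)
	}
	n := len(branches[0])
	for _, b := range branches {
		if len(b) != n {
			return nil, errors.New("element-wise modes need equal-length branches")
		}
	}
	out := make([]float32, n)
	copy(out, branches[0])
	for _, b := range branches[1:] {
		for i, v := range b {
			if mode == "multiply" {
				out[i] *= v
			} else {
				out[i] += v // add and average both start from the sum
			}
		}
	}
	if mode == "average" {
		for i := range out {
			out[i] /= float32(len(branches))
		}
	}
	return out, nil
}

func main() {
	branches := [][]float32{{1, 2}, {3, 4}}
	for _, mode := range []string{"add", "concat", "multiply", "average"} {
		out, err := combine(mode, branches)
		fmt.Println(mode, out, err)
	}
}
```

Note that concat is the only mode here that accepts branches of different widths, which matters when experts have different output sizes.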
Activation Functions
relu, sigmoid, tanh, softmax, gelu, swish, mish, leaky_relu, elu, selu, linear
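As a reference for what some of these names compute, the sketch below implements a subset using the standard textbook formulas. The `activate` dispatcher is hypothetical, and the leaky ReLU slope of 0.01 and ELU alpha of 1 are common defaults assumed here, not values taken from Loom.

```go
package main

import (
	"fmt"
	"math"
)

const leakySlope = 0.01 // assumed default; Loom's value is not stated here

func sigmoid(x float64) float64 { return 1 / (1 + math.Exp(-x)) }

// activate applies a named activation to a scalar. Only a subset of the
// listed names is covered; gelu, mish, selu, and softmax are left out.
func activate(name string, x float64) (float64, error) {
	switch name {
	case "linear":
		return x, nil
	case "relu":
		return math.Max(0, x), nil
	case "leaky_relu":
		if x < 0 {
			return leakySlope * x, nil
		}
		return x, nil
	case "sigmoid":
		return sigmoid(x), nil
	case "tanh":
		return math.Tanh(x), nil
	case "swish":
		return x * sigmoid(x), nil
	case "elu":
		if x < 0 {
			return math.Expm1(x), nil // alpha = 1: exp(x) - 1
		}
		return x, nil
	default:
		return 0, fmt.Errorf("activation %q not covered by this sketch", name)
	}
}

func main() {
	for _, name := range []string{"relu", "leaky_relu", "sigmoid", "tanh", "swish", "elu"} {
		y, _ := activate(name, -1)
		fmt.Printf("%s(-1) = %.4f\n", name, y)
	}
}
```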
SafeTensors & Model Interoperability
Loom features a universal SafeTensors engine capable of standardizing models from any framework (PyTorch, TensorFlow, HuggingFace) into a highly optimized, single-file format. It proactively handles complex nested architectures (like Mixture-of-Experts within Parallel layers) via recursive serialization.
1. Universal "Any-to-Any" Quantization
Load a model in high precision (float32/float64) and instantly quantize it to any supported type for deployment. The file format handles the type conversion automatically.
- Input: Model weights in `F32` (e.g., from HuggingFace)
- Output: Quantized weights in `F4`, `I8`, `BF16`, `U16`, etc.
- Verification: 100% round-trip integrity verified for all 143 layer/type combinations.
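// Load a standard model (error ignored for brevity)
tensors, _ := nn.LoadSafetensors("llama.safetensors")
// Retag every tensor as 4-bit; saving then quantizes automatically.
// Writing back through the map key makes the change stick even if the
// map stores values rather than pointers.
for name := range tensors {
	t := tensors[name]
	t.DType = "F4"
	tensors[name] = t
}
nn.SaveSafetensors("llama-web-4bit.safetensors", tensors)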
2. WASM / In-Memory Operation
Loom's SafeTensors implementation can operate purely in memory (using []byte buffers) without any filesystem access, making it perfect for WebAssembly (WASM) and constrained environments.
// Serialize directly to memory (for sending to browser/client)
bytes, _ := nn.SerializeSafetensors(myModelWeights)
// Load directly from memory (no disk I/O required)
tensors, _ := nn.LoadSafetensorsWithShapes(bytes)
3. Full Layer Support
The interoperability layer supports every component in the Loom ecosystem:
| Category | Supported Layers |
|---|---|
| Core | Dense, Embedding, Parallel, Sequential |
| Convolution | Conv1D, Conv2D |
| Sequence | RNN, LSTM, GRU |
| Attention | MultiHeadAttention, SwiGLU |
| Norm/Act | LayerNorm, RMSNorm, Softmax (10 variants) |
GPU Acceleration (WebGPU)
Experimental GPU acceleration via WebGPU compute shaders. Treat all GPU paths (forward and backward) as experimental for now. Use with:
network.GPU = true
network.WeightsToGPU() // Mount weights to GPU
output, _ := network.Forward(input) // Auto-routes to GPU!
network.Backward(dOutput) // GPU backward pass
network.ReleaseGPUWeights() // Cleanup
GPU Support Matrix
| Layer Type | Forward | Backward (Training) | Notes |
|---|---|---|---|
| Dense | ✅ Stable | ⚠️ Experimental | Production speedup (20x) on large layers. |
| Conv2D | ✅ Stable | ⚠️ Experimental | Works well, optimized for 32+ filters. |
| Conv1D | ✅ Stable | ⚠️ Experimental | Gradients implemented, accuracy tuning needed. |
| RNN | ✅ Stable | ⚠️ Experimental | Weights update, but BPTT limited to batch=1. |
| LSTM | ✅ Stable | ⚠️ Experimental | Same limitations as RNN. |
| LayerNorm | ✅ Stable | ⚠️ Experimental | Forward is stable, backward can be numerically unstable. |
| RMSNorm | ✅ Stable | ⚠️ Experimental | Same as LayerNorm. |
| SwiGLU | ✅ Stable | ⚠️ Experimental | High performance. |
| MHA | ✅ Stable | ⚠️ Experimental | Functional parity verified. |
| Softmax | ✅ Stable | ⚠️ Experimental | Functional. |
| KMeans | ❌ WIP | ❌ WIP | Currently runs on CPU only. |
Quick Start
Installation
# Clone the repository
git clone https://github.com/openfluke/loom.git
cd loom
# Install dependencies
go mod download
Simple Example
package main

import (
	"fmt"

	"github.com/openfluke/loom/nn"
)

func main() {
	network := nn.NewNetwork(4096, 4, 4, 5) // 80 total layers
	if err := network.InitGPU(); err != nil {
		panic(err)
	}
	defer network.ReleaseGPU()

	input := make([]float32, 4096)
	output, gpuTime, _ := network.ForwardGPU(input)
	fmt.Printf("GPU Forward time: %v, Output size: %d\n", gpuTime, len(output))
}
Model Serialization
// Save a trained model
err := network.SaveModel("model.json", "my_model")
// Load it back in one line
loadedNet, err := nn.LoadModel("model.json", "my_model")
// Or use strings (useful for APIs, databases, and WASM)
jsonString, err := network.SaveModelToString("my_model")
loadedNet, err = nn.LoadModelFromString(jsonString, "my_model") // plain assignment: both variables already exist
Cross-Platform API
| Function | Go | Python | TypeScript | C# | C |
|---|---|---|---|---|---|
| Create | `BuildNetworkFromJSON()` | `create_network_from_json()` | `createNetworkFromJSON()` | `CreateLoomNetwork()` | `CreateLoomNetwork()` |
| Forward | `Forward()` | `forward_simple()` | `forward()` | `LoomForward()` | `LoomForward()` |
| Train | `Train()` | `train_simple()` | `train()` | `LoomTrain()` | `LoomTrain()` |
| Save | `SaveModelToString()` | `save_model_simple()` | `saveModel()` | `LoomSaveModel()` | `LoomSaveModel()` |
| Load | `LoadModelFromString()` | `load_model_simple()` | `loadLoomNetwork()` | `LoomLoadModel()` | `LoomLoadModel()` |
| Evaluate | `EvaluateNetwork()` | `evaluate_network_simple()` | `evaluate()` | `LoomEvaluateNetwork()` | `LoomEvaluateNetwork()` |
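The handle-based state management mentioned for the C-ABI can be pictured as a registry that hands foreign callers an integer instead of a Go pointer. The sketch below is a generic version of that pattern under stated assumptions: the `registry` type, its methods, and the stand-in `network` struct are hypothetical, and Loom's real `cabi` package may be organized differently.

```go
package main

import (
	"fmt"
	"sync"
)

// network is a stand-in for whatever object the FFI layer manages.
type network struct{ name string }

// registry maps opaque integer handles to live objects so that callers in
// other languages never hold a Go pointer directly.
type registry struct {
	mu   sync.Mutex
	next int64
	nets map[int64]*network
}

func newRegistry() *registry {
	return &registry{next: 1, nets: make(map[int64]*network)}
}

// put stores a network and returns a fresh handle (never 0, never reused).
func (r *registry) put(n *network) int64 {
	r.mu.Lock()
	defer r.mu.Unlock()
	h := r.next
	r.next++
	r.nets[h] = n
	return h
}

// get resolves a handle; ok is false for unknown or released handles.
func (r *registry) get(h int64) (*network, bool) {
	r.mu.Lock()
	defer r.mu.Unlock()
	n, ok := r.nets[h]
	return n, ok
}

// release drops a handle and reports whether it existed.
func (r *registry) release(h int64) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	_, ok := r.nets[h]
	delete(r.nets, h)
	return ok
}

func main() {
	reg := newRegistry()
	h := reg.put(&network{name: "my_model"})
	if n, ok := reg.get(h); ok {
		fmt.Println("handle", h, "->", n.name)
	}
	fmt.Println("released:", reg.release(h))
	_, ok := reg.get(h)
	fmt.Println("still valid:", ok)
}
```

Reserving 0 as "no handle" gives bindings a simple error value, and the mutex keeps lookups safe when several foreign threads call in at once.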
Language Bindings
Python
pip install welvet
import welvet
config = {"batch_size": 1, "layers": [...]}
welvet.create_network_from_json(config)
output = welvet.forward_simple([0.1, 0.2, 0.3, 0.4])
See python/README.md for complete documentation.
TypeScript / Node.js
npm install @openfluke/welvet
import { init, createNetworkFromJSON } from "@openfluke/welvet";
await init();
const network = createNetworkFromJSON(JSON.stringify(config));
const output = network.Forward(JSON.stringify([[0.1, 0.2, 0.3, 0.4]]));
See typescript/README.md for complete documentation.
C# / .NET
dotnet add package Welvet
using Welvet;
Network.CreateFromJson(config);
var output = NativeMethods.LoomForward(input, input.Length);
See csharp/README.md for complete documentation.
Project Structure
loom/
├── nn/                  # Neural network package (core)
├── tokenizer/           # Pure Go BPE tokenizer
├── wasm/                # WebAssembly module
├── cabi/                # C ABI for FFI
├── python/              # Python package (welvet)
├── typescript/          # TypeScript/WASM package
├── csharp/              # C#/.NET package (Welvet)
├── fabric/              # Demo application
├── pods/                # GPU compute pods
├── model_conversion/    # HuggingFace model import
├── docs/                # Documentation
└── detector/            # GPU device detection
Documentation
- Neural Network Package - Detailed API documentation
- Neural Tweening Benchmarks - 19-test comprehensive benchmark
- Evaluation & Metrics - Deviation metrics, numerical type benchmarking, WASM-compatible verification
- Python Bindings - PyPI package docs
- TypeScript Bindings - NPM package docs
- C# Bindings - NuGet package docs
- WASM Module - Browser deployment
- C ABI - FFI reference
- Model Conversion - HuggingFace import guide
More Examples: See github.com/openfluke/tva for additional examples and experiments.
Comprehensive Test Suite
Loom includes a rigorous verification suite in tva/muniversal_testing.go and cabi/universal_test.c that validates functional correctness across all layers, numeric types, and backend engines (CPU/GPU).
Coverage Summary (2297 tests)
| Test Section | Tests | Description |
|---|---|---|
| Part 1: Core | 6 | Forward/backward pass correctness for basic layers |
| Part 2: Serialization | 2100 | Save/Load for all layers × 15 dtypes + parallel permutations |
| Part 3: Advanced | 11 | Complex layers (MHA, Grid Softmax, K-Means) and math ops |
| Part 5: GPU Determinism | 15 | Validates GPU forward pass matches CPU results |
| Part 6: GPU Training | 21 | Verifies GPU learning convergence vs CPU baseline |
| Part 7: In-Memory/WASM | 144 | SafeTensors round-trip without filesystem (11 layers × 13 dtypes) |
[!NOTE] GPU Acceleration Limits: As of v0.0.8, WebGPU acceleration is enabled for standard `Forward/Backward` passes. The structural APIs `nn/step_forward.go` (Step-based execution), `nn/tween.go` (Neural Tweening), and `nn/kmeans_layer.go` (K-Means) currently run on CPU only.
Browser Testing (v0.3.0): The universal test suite can now be run directly in the browser with full parity. See `typescript/README.md` for details on running `serve.py`.
C ABI Parity
The C test suite (cabi/universal_test.c) mirrors the Go suite with 2298 tests, validating that all functionality is accessible through the FFI layer for Python, C#, TypeScript, and WASM bindings.
Verified Advanced Architectures
The test suite also verifies complex, production-ready architectural patterns:
- Recursive Symbol Learning (RN1-RN6): Differentiable K-Means layers nested to form taxonomies, achieving 100% accuracy on hierarchical tasks with full interpretability.
- Heterogeneous MoE: Using `LayerParallel` with `CombineMode: "filter"` to route inputs to experts of different depths/types (e.g., CNN expert vs Dense expert).
- Stitched Experts: Using `LayerStitch` to harmonize outputs from parallel branches with different dimensions (e.g., 5-dim output and 7-dim output stitched to common 10-dim).
- Neural Grafting: Training only the gating mechanism of an MoE while keeping experts frozen, using `TweenStep` for precise surgical updates.
- Bit-Exact Determinism: Verifying that GPU forward passes match CPU results to within machine epsilon (often exactly bit-matching for integer ops).
Runnable Demos
- MNIST Consistency (`tva/demo/conv2d-mnist/main.go`):
  - Trains a digit classifier on MNIST.
  - Saves model to JSON and Safetensors.
  - Reloads model and verifies 0.000000000 max difference in predictions.
  - Benchmarks all 13 numerical types (F64 to U8) for quantization quality.
  - Proves robustness of `SaveWeightsToSafetensors` / `LoadWeightsFromSafetensors`.
- Recursive Safetensors (`tva/testing/safetensors_recursive.go`):
  - Constructs a complex nested Network: `MoE (Gate) -> [Parallel -> [Dense, Sequential -> [Conv1D, RNN]]]`.
  - Saves and reloads to prove structural integrity of serialization for arbitrary depths.
Requirements
- Go: 1.24 or higher
- GPU: WebGPU-compatible GPU (Vulkan, Metal, or D3D12) - optional
- OS: Linux, macOS, or Windows
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
Apache License 2.0 - see LICENSE file for details.
Made with ❤️ by Openfluke