package mil

v0.3.3
This package is not in the latest version of its module.
Published: Mar 15, 2026 License: MIT Imports: 4 Imported by: 1

Documentation


Overview

Package mil generates MIL (Model Intermediate Language) programs and weight blobs for Apple Neural Engine compilation.

text := mil.GenConv(16, 16, 1)
blob, _ := mil.BuildWeightBlob(weights, 16, 16)

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func BuildCausalMaskBlob added in v0.3.0

func BuildCausalMaskBlob(seq int) ([]byte, error)

BuildCausalMaskBlob builds the upper-triangular fp16 causal mask used by SDPA.
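To illustrate the shape of the mask, here is a minimal float32 sketch of an additive causal mask: zero on and below the diagonal, a large negative value above it. The package bakes this as fp16 inside a blob; the `-65504` sentinel (the most negative finite fp16 value) and the float32 representation here are illustrative assumptions, not the package's exact encoding.

```go
package main

import "fmt"

// causalMask builds a seq x seq additive attention mask: 0 where query
// position i may attend to key position j (j <= i), and a large negative
// value where it may not (j > i). Illustrative float32; the package
// stores fp16 in a blob.
func causalMask(seq int) []float32 {
	const negInf = float32(-65504) // most negative finite fp16 value (assumed sentinel)
	mask := make([]float32, seq*seq)
	for i := 0; i < seq; i++ {
		for j := 0; j < seq; j++ {
			if j > i {
				mask[i*seq+j] = negInf
			}
		}
	}
	return mask
}

func main() {
	m := causalMask(3)
	fmt.Println(m[0*3+1]) // above the diagonal: masked
	fmt.Println(m[2*3+1]) // on/below the diagonal: visible
}
```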

func BuildFP16Blob added in v0.3.0

func BuildFP16Blob(data []float32) ([]byte, error)

BuildFP16Blob builds a generic fp16 MIL BLOBFILE payload from row-major data.
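For context on the fp16 payload, here is a sketch of a float32-to-binary16 bit conversion. It truncates the mantissa rather than rounding to nearest and flushes subnormals to zero, so it is a simplification; the package's actual converter may handle those cases differently.

```go
package main

import (
	"fmt"
	"math"
)

// f32ToF16 converts a float32 to IEEE-754 binary16 bits. This sketch
// truncates the mantissa (no round-to-nearest) and flushes values below
// the fp16 normal range to signed zero.
func f32ToF16(f float32) uint16 {
	bits := math.Float32bits(f)
	sign := uint16(bits>>16) & 0x8000
	exp := int32(bits>>23&0xFF) - 127 + 15 // rebias exponent from fp32 to fp16
	mant := uint16(bits >> 13 & 0x3FF)     // keep top 10 mantissa bits
	switch {
	case exp <= 0:
		return sign // underflow: flush to signed zero
	case exp >= 0x1F:
		return sign | 0x7C00 // overflow: infinity
	}
	return sign | uint16(exp)<<10 | mant
}

func main() {
	fmt.Printf("%#04x\n", f32ToF16(1.0))  // 0x3c00
	fmt.Printf("%#04x\n", f32ToF16(-2.0)) // 0xc000
}
```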

func BuildIdentityWeightBlob

func BuildIdentityWeightBlob(channels int) ([]byte, error)

BuildIdentityWeightBlob builds weights for an identity convolution (I matrix).

func BuildRoPECosSinBlobs added in v0.3.0

func BuildRoPECosSinBlobs(seq, headDim int) ([]byte, []byte, error)

BuildRoPECosSinBlobs builds fp16 cosine/sine tables for RoPE with the standard base frequency theta=10000. Each output has shape [1,1,seq,headDim], flattened row-major.

func BuildRoPECosSinBlobsWithTheta added in v0.3.2

func BuildRoPECosSinBlobsWithTheta(seq, headDim int, theta float64) ([]byte, []byte, error)

BuildRoPECosSinBlobsWithTheta builds fp16 cosine/sine tables for RoPE with a configurable base frequency theta. Common values: 10000 (original), 100000 (nanochat), 500000 (Llama3). Each output has shape [1,1,seq,headDim], flattened row-major.

func BuildTransposedWeightBlob added in v0.3.0

func BuildTransposedWeightBlob(weights []float32, rows, cols int) ([]byte, error)

BuildTransposedWeightBlob builds a baked-weight blob for the transpose of a row-major matrix. The input matrix is [rows, cols] row-major; the baked tensor is [cols, rows].

func BuildVectorWeightBlob added in v0.3.0

func BuildVectorWeightBlob(weights []float32) ([]byte, error)

BuildVectorWeightBlob builds a single-baked-weight blob for a 1D fp16 tensor.

func BuildWeightBlob

func BuildWeightBlob(weights []float32, outCh, inCh int) ([]byte, error)

BuildWeightBlob constructs the binary weight blob for MIL compilation.

The blob layout matches the ANE's expected format:

  • Bytes 0-63: File header (0x01 at offset 0, 0x02 at offset 4)
  • Bytes 64-127: Chunk header (0xDEADBEEF magic, data size, data offset)
  • Bytes 128+: FP16 weight data

weights must have exactly outCh*inCh elements (OIHW layout, H=W=1).

func BuildWeightBlobV1

func BuildWeightBlobV1(data []float32) ([]byte, error)

BuildWeightBlobV1 constructs a binary weight blob for a flat 1D weight vector. Unlike BuildWeightBlob which reshapes to OIHW, this stores raw 1D data suitable for RMSNorm weight vectors and other non-convolution weights.

func GenClassifierBackward added in v0.3.0

func GenClassifierBackward(dim, vocab, seq int) string

GenClassifierBackward generates the classifier backward kernel. It multiplies baked transpose(embed) by dlogits to produce dx.

func GenClassifierForward added in v0.3.0

func GenClassifierForward(dim, vocab, seq int) string

GenClassifierForward generates a classifier projection kernel. The embedding matrix is baked as a conv weight tensor with shape [vocab, dim, 1, 1].

func GenConv

func GenConv(inCh, outCh, spatial int) string

GenConv generates a MIL text for a 1×1 convolution kernel with fp16 internal computation. inCh and outCh are channel counts; spatial is the spatial dimension (1 for vectors).

func GenConvDynamicFP16 added in v0.3.0

func GenConvDynamicFP16(inCh, outCh, spatial int) string

GenConvDynamicFP16 generates an fp16 conv graph with runtime-provided weights.

func GenConvFP16 added in v0.3.0

func GenConvFP16(inCh, outCh, spatial int) string

GenConvFP16 generates a minimal fp16 conv graph matching the working training path.

func GenConvFP16IO

func GenConvFP16IO(inCh, outCh, spatial int) string

GenConvFP16IO generates a MIL text for a 1×1 convolution with fp16 I/O (no casts).

func GenConvFP32

func GenConvFP32(inCh, outCh, spatial int) string

GenConvFP32 generates a MIL text for a 1×1 convolution with fp32 weights (no casting).

func GenDynamicMatmul added in v0.3.0

func GenDynamicMatmul(inCh, outCh, batch int) string

GenDynamicMatmul generates a weightless MIL graph for y = x*w.

The single input tensor packs activations and weights together as [1, inCh, 1, batch+outCh] fp32:

  • spatial [0:batch] contains activations laid out as [inCh, batch]
  • spatial [batch:batch+outCh] contains weights laid out as [inCh, outCh]

The output tensor is [1, outCh, 1, batch] fp32.
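The packing described above can be sketched as a flat row-major layout where each channel row holds its activations followed by its weights:

```go
package main

import "fmt"

// packDynamicMatmul packs activations x (laid out [inCh, batch]) and
// weights w (laid out [inCh, outCh]) into the single [1, inCh, 1,
// batch+outCh] input tensor, flattened row-major: per channel, spatial
// [0:batch] holds activations and [batch:batch+outCh] holds weights.
func packDynamicMatmul(x, w []float32, inCh, outCh, batch int) []float32 {
	width := batch + outCh
	packed := make([]float32, inCh*width)
	for c := 0; c < inCh; c++ {
		copy(packed[c*width:], x[c*batch:(c+1)*batch])       // spatial [0:batch]
		copy(packed[c*width+batch:], w[c*outCh:(c+1)*outCh]) // spatial [batch:batch+outCh]
	}
	return packed
}

func main() {
	// inCh=2, outCh=1, batch=1: per-channel activations [3],[4], weights [5],[6].
	packed := packDynamicMatmul([]float32{3, 4}, []float32{5, 6}, 2, 1, 1)
	fmt.Println(packed) // each channel row reads [activation weight]
}
```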

func GenFFNBackward added in v0.3.0

func GenFFNBackward(dim, hidden, seq int) string

GenFFNBackward generates the backward half of the fused FFN block. Input layout is concat(dffn, h1, h3); output layout is concat(dx, dh1, dh3).

func GenFFNBackwardReLU2 added in v0.3.2

func GenFFNBackwardReLU2(dim, hidden, seq int) string

GenFFNBackwardReLU2 generates the backward half of the fused ReLU² FFN block. Input layout is concat(dffn, h1); output is concat(dx, dh1). The ReLU² derivative is 2*max(0, h1).

func GenFFNForward added in v0.3.0

func GenFFNForward(dim, hidden, seq int) string

GenFFNForward generates a fused FFN block with baked W1/W2/W3 weights. It computes W2(silu(W1(x)) * W3(x)).

func GenFFNForwardRMS added in v0.3.0

func GenFFNForwardRMS(dim, hidden, seq int) string

GenFFNForwardRMS generates the full FFN block with internal RMSNorm and the final residual-free output only.

func GenFFNForwardRMSReLU2 added in v0.3.2

func GenFFNForwardRMSReLU2(dim, hidden, seq int) string

GenFFNForwardRMSReLU2 generates the full FFN block with internal RMSNorm and ReLU² activation. It computes W2(relu(W1(rms_norm(x)))²).

func GenFFNForwardReLU2 added in v0.3.2

func GenFFNForwardReLU2(dim, hidden, seq int) string

GenFFNForwardReLU2 generates a fused FFN block with ReLU² activation and baked W1/W2 weights. It computes W2(relu(W1(x))²). Unlike the gated SiLU variant, there is no W3 (only 2 weight matrices).

func GenFFNForwardTaps added in v0.3.0

func GenFFNForwardTaps(dim, hidden, seq int) string

GenFFNForwardTaps generates a fused FFN block that also returns intermediates. The output layout is concat(out, h1, h3) along the channel dimension.

func GenFFNForwardTapsReLU2 added in v0.3.2

func GenFFNForwardTapsReLU2(dim, hidden, seq int) string

GenFFNForwardTapsReLU2 generates a fused ReLU² FFN block that also returns the h1 intermediate. Output layout is concat(out, h1) along the channel dimension.

func GenFinalRMSNorm added in v0.3.0

func GenFinalRMSNorm(dim, seq int) string

GenFinalRMSNorm generates a final-layer RMSNorm kernel with baked weights.

func GenFinalRMSNormDynamic added in v0.3.0

func GenFinalRMSNormDynamic(dim, seq int) string

GenFinalRMSNormDynamic generates a final-layer RMSNorm kernel with runtime-provided weights.

func GenGQAExpand

func GenGQAExpand(kvHeads, qHeads, headDim, seqLen int) string

GenGQAExpand generates a MIL text for expanding KV heads for grouped-query attention. It tiles the KV tensor along the head dimension by a factor of qHeads/kvHeads.

func GenIdentity

func GenIdentity(channels, spatial int) string

GenIdentity generates a MIL text for an identity operation (1×1 conv with identity weights).

func GenIdentityFP16IO

func GenIdentityFP16IO(channels, spatial int) string

GenIdentityFP16IO generates a MIL text for an identity operation with fp16 I/O (no fp32 casts, the ANE reads and writes fp16 directly).

func GenMatmul

func GenMatmul(inCh, outCh, spatial int) string

GenMatmul generates a MIL text for a matrix multiplication as a 1×1 convolution. This is equivalent to y = x @ W^T for [batch, inCh] -> [batch, outCh].
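A direct reference for the y = x @ W^T semantics, with row-major x [batch, inCh] and W [outCh, inCh]; this is the computation the 1×1 convolution expresses over a [1, inCh, 1, batch] tensor:

```go
package main

import "fmt"

// matmulWT computes y = x @ W^T for x [batch, inCh] and W [outCh, inCh],
// both row-major; y is [batch, outCh].
func matmulWT(x, w []float32, batch, inCh, outCh int) []float32 {
	y := make([]float32, batch*outCh)
	for b := 0; b < batch; b++ {
		for o := 0; o < outCh; o++ {
			var s float32
			for i := 0; i < inCh; i++ {
				s += x[b*inCh+i] * w[o*inCh+i]
			}
			y[b*outCh+o] = s
		}
	}
	return y
}

func main() {
	// batch=1, inCh=2, outCh=2 with W = identity: y equals x.
	fmt.Println(matmulWT([]float32{1, 2}, []float32{1, 0, 0, 1}, 1, 2, 2))
}
```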

func GenQKVBackward added in v0.3.0

func GenQKVBackward(dim, heads, seq int) string

GenQKVBackward generates the fused QKV backward kernel. Input layout is concat(dq, dk, dv); output layout is dx.

func GenQKVForwardRMS added in v0.3.0

func GenQKVForwardRMS(dim, seq int) string

GenQKVForwardRMS generates the RMSNorm plus QKV projection block. Output layout is concat(q, k, v) along the channel dimension.

func GenRMSNorm

func GenRMSNorm(channels, spatial int, eps float64) string

GenRMSNorm generates a MIL text for the overflow-safe 11-op RMSNorm decomposition.

The 11-op sequence prevents fp16 overflow (values >256 cause CPU fallback):

abs → reduce_max → maximum(1e-6) → real_div → square → reduce_mean →
add(eps) → sqrt → mul(safe_max) → real_div → mul(weight)

The program takes a single fp16 tensor input [1, channels, 1, spatial] and produces the same shape output. The weight vector is loaded from a BLOBFILE.

func GenRMSNormBackward added in v0.3.0

func GenRMSNormBackward(dim, seq int) string

GenRMSNormBackward generates the dx half of RMSNorm backward with baked weights. The input is concat(dy, x) along the channel dimension; dw remains a cheap CPU reduction.

func GenRMSNormBackwardDynamic added in v0.3.0

func GenRMSNormBackwardDynamic(dim, seq int) string

GenRMSNormBackwardDynamic generates the dx half of RMSNorm backward with runtime-provided weights. The input is concat(dy, x) along the channel dimension; dw remains a cheap CPU reduction.

func GenReadState

func GenReadState(name string, shape [4]int) string

GenReadState generates a MIL text for reading a named state buffer. This is used for iOS 18+ stateful inference (e.g., KV cache on ANE).

func GenSDPA

func GenSDPA(headDim, nHeads, seqLen int) string

GenSDPA generates a MIL text for scaled dot-product attention. Inputs: Q, K, V of shape [1, nHeads, seqLen, headDim]. Scale is 1/sqrt(headDim).
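A single-head float64 reference for the computation softmax(Q K^T / sqrt(headDim)) V, with Q, K, V as [seqLen, headDim] row-major. This sketch applies no causal mask; it only shows the math the generated graph expresses.

```go
package main

import (
	"fmt"
	"math"
)

// sdpa computes softmax(Q K^T / sqrt(headDim)) V for one head, using a
// numerically stable softmax (subtract the row max before exp).
func sdpa(q, k, v []float64, seqLen, headDim int) []float64 {
	scale := 1 / math.Sqrt(float64(headDim))
	out := make([]float64, seqLen*headDim)
	scores := make([]float64, seqLen)
	for i := 0; i < seqLen; i++ {
		maxS := math.Inf(-1)
		for j := 0; j < seqLen; j++ {
			var dot float64
			for d := 0; d < headDim; d++ {
				dot += q[i*headDim+d] * k[j*headDim+d]
			}
			scores[j] = dot * scale
			if scores[j] > maxS {
				maxS = scores[j]
			}
		}
		var sum float64
		for j := range scores {
			scores[j] = math.Exp(scores[j] - maxS)
			sum += scores[j]
		}
		for j := range scores {
			p := scores[j] / sum
			for d := 0; d < headDim; d++ {
				out[i*headDim+d] += p * v[j*headDim+d]
			}
		}
	}
	return out
}

func main() {
	// Identical Q/K rows make attention uniform, so the output is the mean of V.
	out := sdpa([]float64{1, 0, 1, 0}, []float64{1, 0, 1, 0}, []float64{2, 0, 4, 0}, 2, 2)
	fmt.Println(out)
}
```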

func GenSDPAApplyForward added in v0.3.0

func GenSDPAApplyForward(dim, heads, seq int) string

GenSDPAApplyForward generates the attention application block. Input0 is x and input1 is concat(q, k, v). Output layout is concat(x2, attn).

func GenSDPABackward1 added in v0.3.0

func GenSDPABackward1(dim, heads, seq int) string

GenSDPABackward1 generates the first SDPA backward kernel plus Wo^T. Input layout is concat(q, k, v, dx2); output layout is concat(dv, probs, dp).

func GenSDPABackward2 added in v0.3.0

func GenSDPABackward2(dim, heads, seq int) string

GenSDPABackward2 generates the second SDPA backward kernel. Input layout is concat(probs, dp, q, k); output layout is concat(dq, dk).

func GenSDPAForward added in v0.3.0

func GenSDPAForward(dim, heads, seq int) string

GenSDPAForward generates the fused attention forward block and returns x2 only.

func GenSDPAForwardTaps added in v0.3.0

func GenSDPAForwardTaps(dim, heads, seq int) string

GenSDPAForwardTaps generates the fused attention forward block with taps. Output layout is concat(x2, q, k, v, attn) along the channel dimension.

func GenScaleFP16IO

func GenScaleFP16IO(spatial int) string

GenScaleFP16IO generates a MIL text for a simple scalar multiplication (1 channel, spatial elements). Each spatial element is multiplied by the scalar weight.

func GenSoftmaxVocab added in v0.3.0

func GenSoftmaxVocab(vocab, seq int) string

GenSoftmaxVocab generates a softmax over the channel dimension.

func GenUpdateState

func GenUpdateState(name string, shape [4]int) string

GenUpdateState generates a MIL text for updating a named state buffer. This emits the coreml_update_state op for iOS 18+ stateful inference.

Types

type BlobDataType

type BlobDataType uint32

BlobDataType identifies the element type in a weight blob entry.

const (
	BlobFloat16 BlobDataType = 1
	BlobFloat32 BlobDataType = 2
	BlobUInt8   BlobDataType = 3
	BlobInt8    BlobDataType = 8
)

type BlobWriter

type BlobWriter struct {
	// contains filtered or unexported fields
}

BlobWriter accumulates weight blobs and builds a multi-weight MIL Blob Storage v2 binary.

The format consists of a 64-byte file header, followed by 64-byte per-blob metadata entries, then 64-byte-aligned data segments.

func NewBlobWriter

func NewBlobWriter() *BlobWriter

NewBlobWriter creates a new BlobWriter.

func (*BlobWriter) AddFloat16

func (w *BlobWriter) AddFloat16(data []float32) int

AddFloat16 converts float32 data to fp16 and appends it as a blob entry. Returns the blob index. Use Offset after all blobs are added to get the data offset.

func (*BlobWriter) AddFloat32

func (w *BlobWriter) AddFloat32(data []float32) int

AddFloat32 appends float32 data as a blob entry. Returns the blob index. Use Offset after all blobs are added to get the data offset.

func (*BlobWriter) AddRaw

func (w *BlobWriter) AddRaw(dtype BlobDataType, data []byte) int

AddRaw appends raw byte data as a blob entry. Returns the blob index. Use Offset after all blobs are added to get the data offset.

func (*BlobWriter) Build

func (w *BlobWriter) Build() ([]byte, error)

Build produces the complete binary blob.

func (*BlobWriter) Count

func (w *BlobWriter) Count() int

Count returns the number of blobs added.

func (*BlobWriter) Offset

func (w *BlobWriter) Offset(i int) uint64

Offset returns the byte offset where blob i's data starts in the built output. This must be called after all blobs have been added, as the offset depends on the total number of blobs (which determines the metadata section size).
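The offset dependency can be sketched from the documented format: a 64-byte file header, one 64-byte metadata entry per blob, then 64-byte-aligned data segments, so every offset shifts when another blob (and metadata entry) is added. The alignment of each data segment to the next 64-byte boundary is an assumption read off the format description above.

```go
package main

import "fmt"

// blobOffsets computes where each blob's data would land in the built
// file: 64-byte file header + 64 bytes of metadata per blob, then data
// segments each aligned up to a 64-byte boundary.
func blobOffsets(sizes []int) []uint64 {
	align := func(n uint64) uint64 { return (n + 63) &^ 63 } // round up to 64
	off := align(uint64(64 + 64*len(sizes)))                 // header + metadata section
	out := make([]uint64, len(sizes))
	for i, sz := range sizes {
		out[i] = off
		off = align(off + uint64(sz))
	}
	return out
}

func main() {
	// Two blobs of 100 and 10 bytes: data starts after 64B header + 2x64B metadata.
	fmt.Println(blobOffsets([]int{100, 10}))
}
```

Note how adding a third blob would move every offset by 64 bytes of extra metadata, which is why Offset is only valid after all blobs are added.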
