goffi — Zero-CGO FFI for Go

Pure Go Foreign Function Interface for calling C libraries without CGO.
Designed for WebGPU and GPU computing — zero C dependencies, zero per-call allocations, 88–114 ns overhead.
Deep dive: How We Call C Libraries Without a C Compiler — architecture, assembly, callbacks, and ecosystem.
// Load library, prepare once, call many times — no CGO required
handle, _ := ffi.LoadLibrary("wgpu_native.dll")
sym, _ := ffi.GetSymbol(handle, "wgpuCreateInstance")
cif := &types.CallInterface{}
ffi.PrepareCallInterface(cif, types.DefaultCall, returnType, argTypes)
ffi.CallFunction(cif, sym, unsafe.Pointer(&result), args)
Features
| Feature | Summary | Details |
|---|---|---|
| Zero CGO | Pure Go | No C compiler needed. go get and build. |
| Fast | 88–114 ns/op | Pre-computed CIF, zero per-call allocations |
| Cross-platform | 7 targets | Windows, Linux, macOS × AMD64 + ARM64; FreeBSD AMD64 |
| Callbacks | C→Go safe | crosscall2 integration, works from any C thread |
| Type-safe | Runtime validation | 5 typed error types with errors.As() support |
| Struct passing | Full ABI | ≤8B (RAX), 9–16B (RAX+RDX), >16B (sret) |
| Context | Timeouts | CallFunctionContext(ctx, ...) cancellation |
| Tested | 89% coverage | CI on Linux, Windows, macOS |
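The struct-passing row can be read as a size-based rule. The sketch below encodes only the three size classes quoted in the table, for System V AMD64 structs whose fields are all integers or pointers. `classifyReturn` is a hypothetical helper written for this illustration, not part of goffi's API, and it deliberately ignores floating-point fields (which involve XMM registers) and the different Win64 rules.

```go
package main

import "fmt"

// classifyReturn is a hypothetical helper: it maps a struct size in bytes to
// the return mechanism listed in the feature table. Assumes System V AMD64
// and integer-class fields only.
func classifyReturn(size uintptr) string {
	switch {
	case size == 0:
		return "none"
	case size <= 8:
		return "RAX"
	case size <= 16:
		return "RAX+RDX"
	default:
		// The caller passes a hidden pointer to a buffer for the result.
		return "sret"
	}
}

func main() {
	for _, size := range []uintptr{4, 8, 12, 16, 24} {
		fmt.Printf("%2d bytes -> %s\n", size, classifyReturn(size))
	}
}
```

The practical consequence is that a struct growing past 16 bytes changes how it is returned, so the type descriptor has to describe the real size.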
Quick Start
Installation
go get github.com/go-webgpu/goffi
Requirements
goffi requires CGO_ENABLED=0. This is automatic when no C compiler is installed or when cross-compiling. If gcc or clang is installed, set it explicitly:
CGO_ENABLED=0 go build ./...
Why? goffi uses Go's cgo_import_dynamic for dynamic library loading, which only activates when CGO is disabled.
Example: Calling strlen
package main

import (
	"fmt"
	"runtime"
	"unsafe"

	"github.com/go-webgpu/goffi/ffi"
	"github.com/go-webgpu/goffi/types"
)

func main() {
	// Load platform-specific C library
	libName := "libc.so.6"
	if runtime.GOOS == "windows" {
		libName = "msvcrt.dll"
	}

	handle, err := ffi.LoadLibrary(libName)
	if err != nil {
		panic(err)
	}
	defer ffi.FreeLibrary(handle)

	strlen, err := ffi.GetSymbol(handle, "strlen")
	if err != nil {
		panic(err)
	}

	// Prepare the call interface once; reuse it for all subsequent calls
	cif := &types.CallInterface{}
	err = ffi.PrepareCallInterface(
		cif,
		types.DefaultCall,          // auto-detects platform ABI
		types.UInt64TypeDescriptor, // return: size_t
		[]*types.TypeDescriptor{types.PointerTypeDescriptor}, // arg: const char*
	)
	if err != nil {
		panic(err)
	}

	// Call strlen. avalue elements are pointers TO argument values
	testStr := "Hello, goffi!\x00"
	strPtr := uintptr(unsafe.Pointer(unsafe.StringData(testStr)))
	var length uint64
	err = ffi.CallFunction(cif, strlen, unsafe.Pointer(&length), []unsafe.Pointer{unsafe.Pointer(&strPtr)})
	if err != nil {
		panic(err)
	}

	fmt.Printf("strlen(%q) = %d\n", testStr[:len(testStr)-1], length)
	// Output: strlen("Hello, goffi!") = 13
}
FFI overhead: 88–114 ns/op (Windows AMD64, Intel i7-1255U)
| Benchmark | Time | Allocations |
|---|---|---|
| Empty function (getpid) | 88 ns | 2 allocs |
| Integer argument (abs) | 114 ns | 3 allocs |
| String processing (strlen) | 98 ns | 3 allocs |
At 60 FPS with ~50 FFI calls per frame, overhead is 5 µs per frame — 0.03% of the 16.6 ms budget. Unmeasurable in profiling.
See docs/PERFORMANCE.md for detailed analysis, optimization strategies, and when NOT to use goffi.
Architecture
goffi transitions from Go's managed runtime to C code through three layers:
Go Code
│ ffi.CallFunction()
▼
runtime.cgocall ← Go runtime: system stack switch, GC coordination
│
▼
Assembly Wrapper ← Hand-written: load GP/SSE registers per ABI
│ CALL target_function
▼
C Function ← External library
Three ABIs, hand-written assembly for each:
| ABI | GP Registers | FP Registers | Notes |
|---|---|---|---|
| System V AMD64 | RDI, RSI, RDX, RCX, R8, R9 | XMM0–XMM7 | Linux, macOS, FreeBSD |
| Win64 | RCX, RDX, R8, R9 | XMM0–XMM3 | 32-byte shadow space mandatory |
| AAPCS64 | X0–X7 | D0–D7 | HFA support for ARM64 |
See docs/ARCHITECTURE.md for the full technical deep dive.
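To make the register table concrete, the sketch below assigns scalar arguments to registers for each ABI. It is an illustration only, not goffi's assembly: the `abi` type and `assign` function are hypothetical, and structs, variadic calls, and stack layout are left out. The one behavioral difference it models is that System V and AAPCS64 count integer and floating-point registers independently, while Win64 ties argument N to slot N in either register file.

```go
package main

import "fmt"

type argKind int

const (
	integer  argKind = iota // integers and pointers
	floating                // float32 and float64
)

// abi is a hypothetical description of one calling convention.
type abi struct {
	gp, fp     []string
	positional bool // Win64: argument N uses slot N regardless of kind
}

var (
	sysV = abi{
		gp: []string{"RDI", "RSI", "RDX", "RCX", "R8", "R9"},
		fp: []string{"XMM0", "XMM1", "XMM2", "XMM3", "XMM4", "XMM5", "XMM6", "XMM7"},
	}
	win64 = abi{
		gp:         []string{"RCX", "RDX", "R8", "R9"},
		fp:         []string{"XMM0", "XMM1", "XMM2", "XMM3"},
		positional: true,
	}
	aapcs64 = abi{
		gp: []string{"X0", "X1", "X2", "X3", "X4", "X5", "X6", "X7"},
		fp: []string{"D0", "D1", "D2", "D3", "D4", "D5", "D6", "D7"},
	}
)

// assign returns the register name, or "stack", for each scalar argument.
func assign(a abi, kinds []argKind) []string {
	out := make([]string, len(kinds))
	gpNext, fpNext := 0, 0
	for i, k := range kinds {
		regs, next := a.gp, &gpNext
		if k == floating {
			regs, next = a.fp, &fpNext
		}
		idx := *next
		if a.positional {
			idx = i
		}
		if idx < len(regs) {
			out[i] = regs[idx]
		} else {
			out[i] = "stack"
		}
		*next++
	}
	return out
}

func main() {
	sig := []argKind{integer, floating, integer} // e.g. f(void*, double, int)
	fmt.Println("System V:", assign(sysV, sig))
	fmt.Println("Win64:   ", assign(win64, sig))
	fmt.Println("AAPCS64: ", assign(aapcs64, sig))
}
```

For the same signature, the second argument lands in XMM0 on System V but XMM1 on Win64, which is one reason each ABI needs its own wrapper.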
Callbacks (C → Go)
WebGPU fires async callbacks from internal Metal/Vulkan threads. These threads have no goroutine — calling Go directly would crash.
goffi uses crosscall2 for safe C→Go transitions from any thread:
cb := ffi.NewCallback(func(status uint32, adapter uintptr, msg uintptr, ud uintptr) {
	// Safe even when called from a C thread
	result.handle = adapter
	close(done)
})
ffi.CallFunction(cif, wgpuRequestAdapter, nil, args)
<-done // Wait for GPU driver callback
2000 pre-compiled trampoline entries per process. AMD64: 5 bytes/entry. ARM64: 8 bytes/entry.
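A fixed trampoline table implies a fixed number of callback slots. The sketch below shows one way such a slot table could behave, using only the standard library. The `registry` type, its methods, and `errTableFull` are hypothetical names for this illustration and say nothing about goffi's internals; in particular, the sketch never releases slots, which is a simplification rather than a statement about goffi.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

const maxCallbacks = 2000 // table size quoted above

var errTableFull = errors.New("callback table full")

// registry is a hypothetical fixed-size slot table: slot i would correspond
// to trampoline entry i.
type registry struct {
	mu    sync.Mutex
	slots [maxCallbacks]func(uintptr)
	used  int
}

// register stores fn in the next free slot and returns its index.
func (r *registry) register(fn func(uintptr)) (int, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.used == maxCallbacks {
		return -1, errTableFull
	}
	r.slots[r.used] = fn
	r.used++
	return r.used - 1, nil
}

// dispatch stands in for what a trampoline does once it knows its own index.
func (r *registry) dispatch(index int, arg uintptr) {
	r.mu.Lock()
	fn := r.slots[index]
	r.mu.Unlock()
	if fn != nil {
		fn(arg)
	}
}

func main() {
	var r registry
	idx, err := r.register(func(v uintptr) { fmt.Println("callback got", v) })
	if err != nil {
		panic(err)
	}
	r.dispatch(idx, 7)

	// Table footprint from the per-entry sizes quoted above.
	fmt.Println("AMD64 table bytes:", maxCallbacks*5)
	fmt.Println("ARM64 table bytes:", maxCallbacks*8)
}
```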
Error Handling
Five typed error types for precise diagnostics:
handle, err := ffi.LoadLibrary("nonexistent.dll")
if err != nil {
	var libErr *ffi.LibraryError
	if errors.As(err, &libErr) {
		fmt.Printf("Failed to %s %q: %v\n", libErr.Operation, libErr.Name, libErr.Err)
	}
}
| Error Type | When |
|---|---|
| InvalidCallInterfaceError | CIF preparation failures |
| LibraryError | Library loading / symbol lookup |
| CallingConventionError | Unsupported calling convention |
| TypeValidationError | Invalid type descriptor |
| UnsupportedPlatformError | Platform not supported |
Comparison: goffi vs purego vs CGO
| Feature | goffi | purego | CGO |
|---|---|---|---|
| C compiler required | No | No | Yes |
| API style | libffi-like (prepare once, call many) | reflect-based (RegisterFunc) | Native |
| Per-call allocations | Zero (CIF reusable) | reflect + sync.Pool per call | Zero |
| Struct pass/return | Full (RAX+RDX, sret) | Partial (no Windows structs) | Full |
| Callback float returns | XMM0 in asm | Not supported (panic) | Full |
| ARM64 HFA detection | Recursive (nested structs) | Partial (bug in nested path) | Full |
| Typed errors | 5 types + errors.As() | Generic | N/A |
| Context support | Timeouts/cancellation | No | No |
| C-thread callbacks | crosscall2 | crosscall2 | Full |
| String/bool/slice args | Raw pointers only | Auto-marshaling | Full |
| Platform breadth | 7 targets | 8 GOARCH / 20+ OS×ARCH | All |
| AMD64 overhead | 88–114 ns | Not published | ~140 ns (Go 1.26 claims ~30% reduction) |
Choose goffi for GPU/real-time workloads: struct passing, zero per-call overhead, callback float returns, typed errors.
Choose purego for general-purpose bindings: string auto-marshaling, broad architecture support, less boilerplate.
See also: JupiterRider/ffi — pure Go binding for libffi via purego. Supports struct pass/return and variadic functions; requires libffi at runtime.
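The "ARM64 HFA detection" row in the comparison refers to homogeneous floating-point aggregates: under AAPCS64, a struct whose flattened members are one to four values of the same floating-point type is passed in floating-point registers. The sketch below shows a recursive check over a simplified descriptor tree. `typeDesc`, `flatten`, and `isHFA` are hypothetical names, not goffi's types.TypeDescriptor, and arrays and vector types are omitted.

```go
package main

import "fmt"

type kind int

const (
	kindInt kind = iota
	kindFloat32
	kindFloat64
	kindStruct
)

// typeDesc is a simplified, hypothetical type descriptor.
type typeDesc struct {
	kind    kind
	members []*typeDesc // used only when kind == kindStruct
}

// flatten appends the scalar kinds found in t, descending into nested structs.
func flatten(t *typeDesc, out []kind) []kind {
	if t.kind != kindStruct {
		return append(out, t.kind)
	}
	for _, m := range t.members {
		out = flatten(m, out)
	}
	return out
}

// isHFA reports whether t flattens to 1..4 members of one floating-point type.
func isHFA(t *typeDesc) bool {
	if t.kind != kindStruct {
		return false
	}
	leaves := flatten(t, nil)
	if len(leaves) < 1 || len(leaves) > 4 {
		return false
	}
	first := leaves[0]
	if first != kindFloat32 && first != kindFloat64 {
		return false
	}
	for _, k := range leaves[1:] {
		if k != first {
			return false
		}
	}
	return true
}

func main() {
	f32 := &typeDesc{kind: kindFloat32}
	f64 := &typeDesc{kind: kindFloat64}
	vec2 := &typeDesc{kind: kindStruct, members: []*typeDesc{f32, f32}}
	rect := &typeDesc{kind: kindStruct, members: []*typeDesc{vec2, vec2}} // nested
	mixed := &typeDesc{kind: kindStruct, members: []*typeDesc{f32, f64}}

	fmt.Println("vec2: ", isHFA(vec2))
	fmt.Println("rect: ", isHFA(rect))
	fmt.Println("mixed:", isHFA(mixed))
}
```

The nested `rect` case is the one that needs recursion: a check that looks only at direct members would see two structs and miss that they flatten to four floats.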
Known Limitations
Windows: C++ exceptions may crash the program (#12516)
- Go runtime limitation, not goffi-specific. Go 1.22+ added partial SEH support (#58542), but edge cases remain.
- Workaround: build native libraries with panic=abort.
Windows: float return values not captured from XMM0
- syscall.SyscallN returns RAX only. Go syscall package limitation.
Variadic functions not supported (printf, sprintf)
- Use non-variadic wrappers. Planned for v0.5.0.
Struct packing follows System V ABI only
- Windows #pragma pack not honored. Manually specify Size/Alignment in TypeDescriptor.
No bitfields in struct types.
Unix: duplicate symbol conflict with purego (#22)
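For the struct-packing limitation, the size and alignment to specify by hand can be worked out from the field list. The sketch below computes both for natural alignment and for a `#pragma pack(n)` style cap. The `field` type and `layout` function are hypothetical helpers for this illustration; they only produce numbers and make no assumption about how goffi's TypeDescriptor stores them.

```go
package main

import "fmt"

// field describes one struct member by its size and natural alignment.
type field struct {
	size, align uintptr
}

// layout computes a struct's size and alignment. pack == 0 means natural
// alignment; otherwise each field's alignment is capped at pack, which is
// the effect of #pragma pack(pack).
func layout(fields []field, pack uintptr) (size, align uintptr) {
	align = 1
	for _, f := range fields {
		a := f.align
		if pack != 0 && a > pack {
			a = pack
		}
		size = (size + a - 1) / a * a // pad up to this field's alignment
		size += f.size
		if a > align {
			align = a
		}
	}
	size = (size + align - 1) / align * align // tail padding
	return size, align
}

func main() {
	// struct { uint8_t tag; uint32_t value; uint64_t handle; }
	fields := []field{{1, 1}, {4, 4}, {8, 8}}

	size, align := layout(fields, 0)
	fmt.Printf("natural: size=%d align=%d\n", size, align)

	size, align = layout(fields, 1)
	fmt.Printf("pack(1): size=%d align=%d\n", size, align)
}
```

A library compiled with a pack pragma would use the second pair of numbers, so describing the struct with natural layout would misplace every field after the first.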
Platform Support
| Platform | Arch | ABI | Since | CI |
|---|---|---|---|---|
| Windows | amd64 | Win64 | v0.1.0 | Tested |
| Windows | arm64 | AAPCS64 | v0.5.0 | Tested (Snapdragon X) |
| Linux | amd64 | System V | v0.1.0 | Tested |
| Linux | arm64 | AAPCS64 | v0.3.0 | Cross-compile verified |
| macOS | amd64 | System V | v0.1.1 | Tested |
| macOS | arm64 | AAPCS64 | v0.3.7 | Tested (M3 Pro) |
| FreeBSD | amd64 | System V | v0.5.0 | Cross-compile verified |
Roadmap
| Version | Status | Highlights |
|---|---|---|
| v0.2.0 | Released | Callback API, 2000-entry trampoline table |
| v0.3.x | Released | ARM64 (AAPCS64), HFA, Apple Silicon |
| v0.4.0 | Released | crosscall2 for C-thread callbacks |
| v0.4.1 | Released | ABI compliance audit: 10/11 gaps fixed |
| v0.4.2 | Released | purego compatibility (-tags nofakecgo) |
| v0.5.0 | Next | Windows ARM64, FreeBSD, variadic functions, builder API |
| v1.0.0 | Planned | API stability (SemVer 2.0), security audit |
See CHANGELOG.md for version history and ROADMAP.md for the full plan.
Testing
go test ./... # all tests
go test -cover ./... # with coverage (89%)
go test -bench=. -benchmem ./ffi # benchmarks
go test -v ./ffi # verbose, auto-detects platform
Contributing
See CONTRIBUTING.md for guidelines.
- Fork → feature branch → tests (80%+ coverage) → lint → PR
- Conventional commits: feat:, fix:, docs:, test:
Acknowledgments
- purego: proved that pure Go FFI is possible. The crosscall2 callback mechanism, fakecgo approach, and assembly trampoline patterns were pioneered by purego. goffi exists because purego cleared the path.
- libffi: reference for FFI architecture patterns and CIF design.
- Go runtime: runtime.cgocall for GC-safe stack switching, crosscall2 for C→Go transitions.
Ecosystem
goffi powers an ecosystem of pure Go GPU libraries:
| Project | Description |
|---|---|
| go-webgpu/webgpu | Zero-CGO WebGPU bindings (wgpu-native) |
| born-ml/born | ML framework for Go, GPU-accelerated |
| gogpu | GPU computing platform with dual Rust + Pure Go backends |
| wgpu-native | Native WebGPU implementation (upstream) |
License
MIT — see LICENSE.
goffi v0.4.1 | GitHub | pkg.go.dev | Dev.to