SGS runtime-wrapper

OCI runtime wrapper that enables PVC rootfs replacement for Stateful Containers with support for Nvidia GPU workloads.

Overview

The sgs-runtime-wrapper intercepts OCI runtime calls to modify container root filesystem paths, enabling true rootfs replacement from PersistentVolumeClaims. It supports both standard runc and nvidia-container-runtime through automatic mode detection.

Requirements

  • Linux kernel 5.11+ required for nested overlayfs support
    • Ubuntu 20.04 with the HWE kernel, Ubuntu 22.04+, or most systems shipped since 2022
    • Check with: uname -r (a programmatic check is sketched below)
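
For reference, a rough Go sketch of this kernel gate (the wrapper's actual check may differ in detail), parsing the running kernel's major.minor version out of syscall.Uname:

package main

import (
    "fmt"
    "syscall"
)

// kernelSupportsNestedOverlay reports whether the running kernel is 5.11+.
func kernelSupportsNestedOverlay() bool {
    var uts syscall.Utsname
    if err := syscall.Uname(&uts); err != nil {
        return false
    }
    // Utsname.Release is a fixed-size, NUL-terminated char array,
    // e.g. "5.15.0-generic".
    release := make([]byte, 0, len(uts.Release))
    for _, c := range uts.Release {
        if c == 0 {
            break
        }
        release = append(release, byte(c))
    }
    var major, minor int
    if _, err := fmt.Sscanf(string(release), "%d.%d", &major, &minor); err != nil {
        return false
    }
    return major > 5 || (major == 5 && minor >= 11)
}

func main() {
    fmt.Println("nested overlayfs supported:", kernelSupportsNestedOverlay())
}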

Features

  • Dual-mode operation: Automatically detects whether to hijack runc or nvidia-container-runtime
  • Nvidia GPU support: Transparently works with GPU workloads requiring nvidia-container-runtime
  • OverlayFS-based rootfs: Merges container image (read-only) with PVC (writable) using overlayfs
  • Persistent changes: All modifications persist to PVC while base image stays intact
  • Zero overhead: Uses syscall.Exec() for process replacement
  • Security hardened: Validates PVC paths, kernel version, strict permissions

How It Works

OverlayFS Architecture

The wrapper uses overlayfs to merge the container image with the PVC:

CONTAINER VIEW (what the container sees):
┌────────────────────────────────────────────────┐
│  /bin/sh     ← from image (lowerdir)           │
│  /lib/*      ← from image (lowerdir)           │
│  /home/user  ← from PVC if modified            │
│  /proc       ← mounted by runc (separate)      │
└────────────────────────────────────────────────┘
                    ↑
              pivot_root
                    ↑
┌────────────────────────────────────────────────┐
│     /pvc-host-path/merged (overlayfs mount)    │
│  ┌──────────────────────────────────────────┐  │
│  │ upperdir: /pvc-host-path/upper (PVC-RW)  │  │
│  ├──────────────────────────────────────────┤  │
│  │ lowerdir: /path/to/rootfs (image-RO)     │  │
│  └──────────────────────────────────────────┘  │
│  workdir: /pvc-host-path/work (kernel scratch) │
└────────────────────────────────────────────────┘

PVC Directory Structure:

/var/lib/kubelet/pods/<pod-uid>/volumes/.../<pvc-name>/
├── upper/     # Stores all file modifications (persistent)
├── work/      # Kernel working directory (temporary)
└── merged/    # Overlayfs mount point (container's root)
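
The mount itself boils down to a single overlay mount over those three directories. Below is a minimal Go sketch of that step, assuming the layout shown above; it is an illustration, not the wrapper's verbatim code:

package main

import (
    "fmt"
    "log"
    "os"
    "path/filepath"
    "syscall"
)

// mountOverlay merges the read-only image rootfs (lowerdir) with the
// writable PVC (upperdir) and returns the merged mount point.
func mountOverlay(imageRootfs, pvcDir string) (string, error) {
    upper := filepath.Join(pvcDir, "upper")
    work := filepath.Join(pvcDir, "work")
    merged := filepath.Join(pvcDir, "merged")
    for _, dir := range []string{upper, work, merged} {
        if err := os.MkdirAll(dir, 0o755); err != nil {
            return "", err
        }
    }
    opts := fmt.Sprintf("lowerdir=%s,upperdir=%s,workdir=%s", imageRootfs, upper, work)
    if err := syscall.Mount("overlay", merged, "overlay", 0, opts); err != nil {
        return "", fmt.Errorf("overlay mount: %w", err)
    }
    return merged, nil
}

func main() {
    merged, err := mountOverlay("/path/to/rootfs", "/pvc-host-path")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("container root:", merged)
}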
Runtime Flow
Pod with nvidia.com/gpu + volume mounted at /sgs-os-volume
    ↓
containerd → /usr/bin/nvidia-container-runtime (symlink to wrapper)
    ↓
sgs-runtime-wrapper:
  1. Auto-detects nvidia mode from invocation path
  2. Reads OCI config.json from bundle
  3. Finds PVC mount source (destination = /sgs-os-volume)
  4. Checks kernel version (5.11+ required for nested overlay)
  5. Creates overlayfs: image (lowerdir) + PVC (upperdir) → merged
  6. Modifies Root.Path to point to merged directory
  7. Removes PVC from mounts list
    ↓
exec /usr/bin/nvidia-container-runtime.real
    ↓
nvidia-container-runtime.real → runc (with modified config)
    ↓
Container runs with:
  - GPU access (from nvidia-container-runtime)
  - Overlayfs root: image binaries + persistent PVC storage
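
Steps 2, 6, and 7 amount to a JSON round-trip on the bundle's config.json. The following is a hedged sketch: field names follow the OCI runtime spec, the beacon destination /sgs-os-volume comes from the flow above, and the real code performs additional validation:

package main

import (
    "encoding/json"
    "log"
    "os"
    "path/filepath"
)

// rewriteConfig points Root.Path at the overlay merged directory and
// drops the PVC beacon mount (destination /sgs-os-volume) from mounts.
func rewriteConfig(bundle, mergedDir string) error {
    cfgPath := filepath.Join(bundle, "config.json")
    data, err := os.ReadFile(cfgPath)
    if err != nil {
        return err
    }
    var spec map[string]any
    if err := json.Unmarshal(data, &spec); err != nil {
        return err
    }
    root, _ := spec["root"].(map[string]any)
    if root == nil {
        root = map[string]any{}
    }
    root["path"] = mergedDir
    spec["root"] = root
    if mounts, ok := spec["mounts"].([]any); ok {
        kept := mounts[:0]
        for _, m := range mounts {
            mm, _ := m.(map[string]any)
            if mm == nil || mm["destination"] != "/sgs-os-volume" {
                kept = append(kept, m)
            }
        }
        spec["mounts"] = kept
    }
    out, err := json.Marshal(spec)
    if err != nil {
        return err
    }
    // config.json is rewritten with 0600 permissions (see Security
    // Considerations below).
    return os.WriteFile(cfgPath, out, 0o600)
}

func main() {
    // Illustrative paths only.
    if err := rewriteConfig("/run/containerd/example-bundle", "/pvc-host-path/merged"); err != nil {
        log.Fatal(err)
    }
}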
Mode Detection

The wrapper automatically detects its operating mode:

  1. Manual override: Set SGS_WRAPPER_MODE=nvidia or SGS_WRAPPER_MODE=runc
  2. Auto-detection: Resolves symlinks of executable path:
    • Path contains "nvidia-container-runtime" → nvidia mode
    • Otherwise → runc mode
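
A sketch of that order of precedence (simplified: the real detectWrapperMode resolves symlinks, while this version only inspects the invocation name in argv[0]):

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "strings"
)

// detectMode honors an explicit SGS_WRAPPER_MODE override first, then
// falls back to the name the binary was invoked as; under hijacking,
// argv[0] is the nvidia-container-runtime symlink path.
func detectMode() string {
    switch os.Getenv("SGS_WRAPPER_MODE") {
    case "nvidia":
        return "nvidia"
    case "runc":
        return "runc"
    }
    if strings.Contains(filepath.Base(os.Args[0]), "nvidia-container-runtime") {
        return "nvidia"
    }
    return "runc"
}

func main() {
    fmt.Println("mode:", detectMode())
}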
Runtime Discovery

Nvidia mode:

  • Looks for /usr/bin/nvidia-container-runtime.real (renamed original)
  • Falls back to /usr/local/bin/nvidia-container-runtime.real
  • If not found, falls back to /usr/bin/runc with warning

Runc mode:

  • Checks SGS_RUNC_PATH environment variable
  • Uses exec.LookPath("runc") with infinite recursion prevention
  • Falls back to /usr/bin/runc
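
Put together, discovery looks roughly like this (a sketch, not the exact getRuntimePath implementation; the real recursion guard compares fully resolved paths):

package main

import (
    "fmt"
    "log"
    "os"
    "os/exec"
)

// runtimePath resolves the real runtime to exec, following the lookup
// order described above.
func runtimePath(mode string) string {
    if mode == "nvidia" {
        for _, p := range []string{
            "/usr/bin/nvidia-container-runtime.real",
            "/usr/local/bin/nvidia-container-runtime.real",
        } {
            if _, err := os.Stat(p); err == nil {
                return p
            }
        }
        log.Println("warning: nvidia runtime not found; falling back to runc")
        return "/usr/bin/runc"
    }
    if p := os.Getenv("SGS_RUNC_PATH"); p != "" {
        return p
    }
    if p, err := exec.LookPath("runc"); err == nil {
        // Never return ourselves, or the wrapper would recurse forever.
        if self, err := os.Executable(); err != nil || p != self {
            return p
        }
    }
    return "/usr/bin/runc"
}

func main() {
    fmt.Println(runtimePath("runc"))
}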

Installation

For Nvidia GPU Nodes (Automated via DaemonSet)

The installation is managed by ArgoCD in the cd-manifests repository. The installer DaemonSet:

  1. Copies wrapper binary to /usr/local/bin/sgs-runtime-wrapper
  2. Renames /usr/bin/nvidia-container-runtime → /usr/bin/nvidia-container-runtime.real
  3. Creates symlink: /usr/bin/nvidia-container-runtime → /usr/local/bin/sgs-runtime-wrapper

No containerd configuration changes required!

Manual Installation
# Build wrapper
make sgs-runtime-wrapper

# Install binary
sudo cp sgs-runtime-wrapper /usr/local/bin/
sudo chmod +x /usr/local/bin/sgs-runtime-wrapper

# Hijack nvidia-container-runtime
sudo mv /usr/bin/nvidia-container-runtime /usr/bin/nvidia-container-runtime.real
sudo ln -s /usr/local/bin/sgs-runtime-wrapper /usr/bin/nvidia-container-runtime
Binary Location in sgs Image

The sgs-runtime-wrapper binary is included in the main sgs container image (built via Nix). It is accessible at:

/nix/store/<hash>-sgs/bin/sgs-runtime-wrapper

To use it in a DaemonSet installer, extract it from the sgs image:

initContainers:
  - name: install-runtime-wrapper
    image: ghcr.io/bacchus-snu/sgs:latest
    command: ["/bin/sh", "-c"]
    args:
      - |
        # Find and copy sgs-runtime-wrapper binary from Nix store
        find /nix/store -name sgs-runtime-wrapper -executable -type f \
          -exec cp {} /host/usr/local/bin/sgs-runtime-wrapper \;
        chmod +x /host/usr/local/bin/sgs-runtime-wrapper

        # Hijack nvidia-container-runtime if present
        if [ -f /host/usr/bin/nvidia-container-runtime ]; then
          if [ ! -f /host/usr/bin/nvidia-container-runtime.real ]; then
            mv /host/usr/bin/nvidia-container-runtime \
               /host/usr/bin/nvidia-container-runtime.real
          fi
          ln -sf /usr/local/bin/sgs-runtime-wrapper \
                 /host/usr/bin/nvidia-container-runtime
        fi
    volumeMounts:
      - name: host-bin
        mountPath: /host/usr/bin
      - name: host-local-bin
        mountPath: /host/usr/local/bin
    securityContext:
      privileged: true

Usage

Pod Specification

Add the annotation to enable PVC rootfs replacement. No runtimeClassName is needed when nvidia-container-runtime is hijacked:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-stateful-container
  annotations:
    sgs.snucse.org/os-volume: "boot-pvc"  # PVC name for rootfs
spec:
  containers:
    - name: main
      image: nvidia/cuda:12.0-base-ubuntu22.04
      command: ["/bin/bash", "-c", "nvidia-smi && sleep infinity"]
      resources:
        limits:
          nvidia.com/gpu: 1  # GPU resource request
      volumeMounts:
        - name: boot-volume
          mountPath: /mnt/boot  # Beacon mount (path doesn't matter)
  volumes:
    - name: boot-volume
      persistentVolumeClaim:
        claimName: boot-pvc  # Must match annotation value

Key points:

  • Annotation sgs.snucse.org/os-volume triggers rootfs replacement
  • PVC must be mounted somewhere in the pod (beacon mount for discovery)
  • GPU resource requests work normally
  • No explicit runtimeClassName needed (uses default nvidia runtime)
Without GPU (Traditional runc mode)

For non-GPU nodes using explicit RuntimeClass:

spec:
  runtimeClassName: sgs  # Explicit runtime class
  containers:
    - name: main
      # ... rest of spec

Implementation Details

Code Changes

File: cmd/sgs-runtime-wrapper/main.go

New constants:

const (
    defaultNvidiaRuntimePath = "/usr/bin/nvidia-container-runtime.real"
    envWrapperMode           = "SGS_WRAPPER_MODE"
)

New functions:

  • detectWrapperMode(): Auto-detects nvidia vs runc mode from executable path
  • getRuntimePath(): Replaces getRuncPath(), supports both modes

Modified behavior:

  • Mode detection at startup
  • Nvidia runtime discovery with fallback
  • Correct argv[0] based on runtime type (nvidia-container-runtime.real vs runc)
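
For illustration, a minimal sketch of the argv[0] handling and process replacement via syscall.Exec (which only returns on error):

package main

import (
    "log"
    "os"
    "path/filepath"
    "syscall"
)

// execRuntime hands control to the real runtime. syscall.Exec replaces
// the wrapper process in place (hence "zero overhead"), with argv[0]
// set to the runtime's own basename, e.g. nvidia-container-runtime.real
// or runc.
func execRuntime(runtime string) {
    argv := append([]string{filepath.Base(runtime)}, os.Args[1:]...)
    if err := syscall.Exec(runtime, argv, os.Environ()); err != nil {
        log.Fatalf("exec %s: %v", runtime, err)
    }
}

func main() {
    execRuntime("/usr/bin/runc")
}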
Security Considerations

Preserved from original:

  • PVC path validation: Must be in /var/lib/kubelet/pods/
  • File permissions: config.json and logs use 0600
  • Strict PVC name matching: Prevents directory traversal
  • Annotation-based trigger: Opt-in, not automatic

Additional for nvidia hijacking:

  • Symlink validation during installation
  • Backup of original runtime (.real suffix)
  • Fallback to runc if nvidia runtime not found
  • Installer checks for existing symlinks to prevent conflicts

Verification

Check Installation
# Verify symlink
ls -la /usr/bin/nvidia-container-runtime*
# Should show:
# lrwxrwxrwx ... /usr/bin/nvidia-container-runtime -> /usr/local/bin/sgs-runtime-wrapper
# -rwxr-xr-x ... /usr/bin/nvidia-container-runtime.real

# Check wrapper binary
ls -la /usr/local/bin/sgs-runtime-wrapper
Check Logs
# View wrapper logs
sudo tail -30 /var/log/sgs-runtime-wrapper.log

# Look for:
# - "Auto-detected nvidia mode (invoked as: nvidia-container-runtime)"
# - "Found SGS OS volume mount: <source> -> /sgs-os-volume"
# - "Kernel version X.Y.Z supports nested overlayfs"
# - "Mounted overlayfs: lowerdir=..., upperdir=..., merged=..."
# - "New root path (overlay merged): ..."
# - "Successfully modified config.json for SGS boot volume with overlayfs"
Verify Container
# Check kernel version (must be 5.11+)
uname -r
# Example: 5.15.0-generic, 6.1.0, 6.6.87

# Inside running container, check rootfs is overlayfs
grep " / " /proc/mounts
# Should show: overlay / overlay rw,lowerdir=...,upperdir=...,workdir=...

# Verify GPU access
nvidia-smi
# Should show GPU devices

# Check that modifications persist
touch /test-file
# Restart pod, verify /test-file still exists

# On host, check PVC directory structure
ls -la /var/lib/kubelet/pods/<pod-uid>/volumes/.../<pvc>/
# Should show: upper/  work/  merged/

Troubleshooting

Kernel too old for nested overlayfs

If you see: kernel X.Y is too old for nested overlayfs; SGS requires kernel 5.11+

# Check kernel version
uname -r

# Upgrade kernel (Ubuntu example)
sudo apt update && sudo apt install linux-generic-hwe-22.04
sudo reboot
Overlayfs mount failed (EINVAL)

This usually indicates nested overlayfs is not supported:

  • Verify kernel is 5.11+ with uname -r
  • Check wrapper logs for detailed error: sudo tail -50 /var/log/sgs-runtime-wrapper.log
GPU not detected

Check that nvidia-container-runtime is properly symlinked:

ls -la /usr/bin/nvidia-container-runtime
readlink -f /usr/bin/nvidia-container-runtime
PVC not found

Check annotation matches PVC name exactly:

kubectl get pvc -n <namespace>
kubectl describe pod <pod-name> -n <namespace> | grep sgs.snucse.org/os-volume

View wrapper logs to see mount discovery:

sudo grep "Could not find mount" /var/log/sgs-runtime-wrapper.log
Infinite recursion detected

The wrapper detects itself via path comparison. If this fails, set an explicit path:

# In DaemonSet or node environment
export SGS_RUNC_PATH=/usr/bin/nvidia-container-runtime.real

Uninstallation

Uninstallation is managed by the ArgoCD uninstaller DaemonSet, which:

  1. Removes symlink: /usr/bin/nvidia-container-runtime
  2. Restores original: /usr/bin/nvidia-container-runtime.real → /usr/bin/nvidia-container-runtime
  3. Removes wrapper: /usr/local/bin/sgs-runtime-wrapper

Manual uninstallation:

sudo rm /usr/bin/nvidia-container-runtime
sudo mv /usr/bin/nvidia-container-runtime.real /usr/bin/nvidia-container-runtime
sudo rm /usr/local/bin/sgs-runtime-wrapper

Environment Variables

  • SGS_WRAPPER_MODE: Override mode detection (nvidia or runc)
  • SGS_RUNC_PATH: Override runtime path (e.g., /usr/bin/nvidia-container-runtime.real)

Logging

Logs to /var/log/sgs-runtime-wrapper.log with 0600 permissions.

Log levels:

  • Info: Normal operation (no annotation found, passthrough)
  • Warning: Recoverable issues (runtime not found, falling back)
  • Error: Fatal issues (PVC validation failed, config modification failed)

Documentation

Overview

Package main implements the SGS OCI runtime wrapper binary `sgs-runtime-wrapper`.

This program wraps runc or nvidia-container-runtime and intercepts the "create" command to modify the OCI spec's Root.Path. This provides true rootfs replacement for "Stateful Containers" with support for GPU workloads.

How it works:

  1. containerd calls this wrapper instead of the real runtime
  2. Wrapper auto-detects mode (runc or nvidia) based on invocation path
  3. Wrapper checks if the container has SGS OS volume mount at /sgs-os-volume
  4. If present, modifies config.json Root.Path to the PVC host path (mount source)
  5. Calls the real runtime (runc or nvidia-container-runtime.real) with modified config

Dual-Mode Support:

  • Nvidia mode: symlink /usr/bin/nvidia-container-runtime → sgs-runtime-wrapper; calls /usr/bin/nvidia-container-runtime.real after modification
  • Runc mode: use a RuntimeClass with BinaryName = /usr/local/bin/sgs-runtime-wrapper; calls /usr/bin/runc after modification

Installation (Nvidia GPU hijacking):

  1. Build: go build -o sgs-runtime-wrapper ./cmd/sgs-runtime-wrapper
  2. Install: cp sgs-runtime-wrapper /usr/local/bin/
  3. Rename: mv /usr/bin/nvidia-container-runtime /usr/bin/nvidia-container-runtime.real
  4. Symlink: ln -s /usr/local/bin/sgs-runtime-wrapper /usr/bin/nvidia-container-runtime

For traditional runc hijacking, configure containerd (/etc/containerd/config.toml):

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.sgs]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.sgs.options]
    BinaryName = "/usr/local/bin/sgs-runtime-wrapper"

Then in your Pod spec, use:

spec:
  runtimeClassName: sgs
