weightsource

package
v0.20.0
Published: May 19, 2026 | License: Apache-2.0 | Imports: 21 | Imported by: 0

Documentation

Overview

Package weightsource is the pluggable source layer for weight imports.

A Source is a stateful provider bound at construction time to a specific URI. It exposes two capabilities: Inventory lists the files the source offers (with sizes, per-file digests, and a source-identity Fingerprint), and Open streams one file's bytes. The packer drives the import one file at a time so that sources larger than local disk can be imported without full materialization.

Implementations exist for file:// (local directory) and hf:// (HuggingFace Hub). As of v0.20.0, HTTPSource adds http:// and https:// (a single remote file).

Constants

const FileScheme = "file"

FileScheme is the URI scheme for local filesystem sources.

const HFScheme = "hf"

HFScheme is the short URI scheme for HuggingFace Hub sources.

const HFSchemeLong = "huggingface"

HFSchemeLong is the long-form URI scheme alias for HuggingFace Hub.

Functions

func DirHash

func DirHash[F Dirhashable](files []F) string

DirHash computes a content-addressable digest of a file set per spec §2.4:

sha256(join(sort("<hex>  <path>"), "\n"))

where each line is the file's sha256 hex digest and relative path joined by two spaces (matching sha256sum output). DirHash sorts the lines itself, so the caller's input order does not affect the result.

The result is in "sha256:<hex>" form. This formula computes the weight set digest stored in weights.lock (WeightLockEntry.SetDigest). file:// sources also use it as their Fingerprint, because a local directory has no version identity other than its content. Other schemes (hf://, s3://, http://) use scheme-native identifiers (commit SHA, ETag, etc.) for their Fingerprint instead.

func NormalizeURI

func NormalizeURI(uri string) (string, error)

NormalizeURI returns the canonical form of a weight source URI.

Each scheme has its own normalization rules:

  • file:// and bare paths → canonical file:// form (see normalizeFileURI)
  • hf:// and huggingface:// → canonical hf:// form (see normalizeHFURI)

Empty strings and unsupported schemes return an error.

func SortInventoryFiles added in v0.20.0

func SortInventoryFiles(files []InventoryFile)

SortInventoryFiles sorts files by path. Every Source implementation must return a sorted inventory; this helper enforces the convention.

Types

type DirhashPart

type DirhashPart struct {
	Path   string
	Digest string
}

DirhashPart is the atomic input to DirHash: the pair of fields that uniquely identify a file's contribution to the dirhash. Path is the relative path (forward slashes) and Digest is the file's sha256 content digest in "sha256:<hex>" form.

func (DirhashPart) String

func (p DirhashPart) String() string

String returns the canonical identity of a single file: "path\x00digest". This is the primitive that any code comparing files across layers, plans, or lockfile entries should use. DirHash composes over this (sorted, then hashed); layer keys join these (preserving individual file identity so two files with identical content but different paths remain distinguishable).
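As a small illustration of why the identity string includes the path, the sketch below builds the documented "path\x00digest" form for two files with identical content. The part type, the key method, and the shortened digests are placeholders for this example only.

```go
package main

import "fmt"

// part is a local stand-in for DirhashPart.
type part struct{ Path, Digest string }

// key mirrors the documented "path\x00digest" identity string.
func (p part) key() string { return p.Path + "\x00" + p.Digest }

func main() {
	a := part{"a/model.bin", "sha256:1111"} // shortened placeholder digest
	b := part{"b/model.bin", "sha256:1111"} // same content, different path
	fmt.Println(a.key() == b.key())         // false: the two files stay distinguishable
}
```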

type Dirhashable

type Dirhashable interface {
	DirhashParts() DirhashPart
}

Dirhashable is implemented by types that can participate in DirHash. Both weightsource.InventoryFile and lockfile.WeightLockFile implement it, letting the two call sites share one digest implementation.

type FileSource

type FileSource struct {
	// contains filtered or unexported fields
}

FileSource is the Source implementation for file:// URIs and bare paths.

URIs take one of these forms:

file:///abs/path      — absolute path
file://./rel/path     — canonical relative path (explicit ./)
/abs/path             — bare absolute path (normalized to file://)
./rel/path            — bare relative path (normalized to file://)
rel/path              — bare relative path, no ./ prefix (normalized)

The lockfile stores only the normalized form (see NormalizeURI); the absolute on-disk path is resolved once at construction time so the Source methods do not re-resolve on every call.

func NewFileSource

func NewFileSource(uri, projectDir string) (*FileSource, error)

NewFileSource constructs a FileSource bound to uri, resolving relative URIs against projectDir. It validates that the resolved path exists and is a directory.

func (*FileSource) Inventory

func (s *FileSource) Inventory(ctx context.Context) (Inventory, error)

Inventory walks the source directory and returns per-file path / size / content digest plus the source fingerprint (sha256 of the sorted file set, spec §2.4).

The .cog state directory is skipped. Non-regular entries (symlinks, devices, FIFOs, sockets) are rejected per spec §1.3 — silently dropping them would let a user ship a model missing files they expected. Resolve to regular files before importing.

func (*FileSource) Open

func (s *FileSource) Open(ctx context.Context, path string) (io.ReadCloser, error)

Open returns a reader for a single file in the source, identified by its inventory path (relative to the source root, using forward slashes). The caller closes the returned reader.

type Fingerprint

type Fingerprint string

Fingerprint is a source's version identity, carrying its algorithm (or source-native identifier type) as a scheme prefix.

Examples:

sha256:<hex>            — content hash (file:// sources)
commit:<sha>            — git commit (hf:// repos pinned to a commit)
etag:<value>            — HTTP ETag (http:// sources)
md5:<hex>               — MD5 hash (s3:// objects)
timestamp:<rfc3339>     — last-modified timestamp (fallback for systems
                           that expose nothing stronger)

The prefix makes two fingerprints from different sources unambiguously unequal even when the opaque values happen to collide. The empty string is not a valid Fingerprint — callers that want to express "no fingerprint known" should use a separate sentinel.

func (Fingerprint) Scheme

func (f Fingerprint) Scheme() string

Scheme returns the fingerprint's algorithm or identifier prefix (the part before the first colon). Returns "" if the fingerprint is malformed (no colon).
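The documented behavior of Scheme can be illustrated in a few lines. The fingerprint type and scheme method below are local stand-ins written for this example, assuming only what the description states: split at the first colon, empty result when no colon is present.

```go
package main

import (
	"fmt"
	"strings"
)

// fingerprint is a local stand-in for Fingerprint.
type fingerprint string

// scheme returns the part before the first colon, or "" if there is none.
func (f fingerprint) scheme() string {
	s, _, ok := strings.Cut(string(f), ":")
	if !ok {
		return ""
	}
	return s
}

func main() {
	fmt.Println(fingerprint("commit:0123abcd").scheme())                // commit
	fmt.Println(fingerprint("timestamp:2026-05-19T00:00:00Z").scheme()) // timestamp
	fmt.Println(fingerprint("no-colon").scheme() == "")                 // true
}
```

Cutting at the first colon matters for values that contain colons themselves, such as RFC 3339 timestamps.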

func (Fingerprint) String

func (f Fingerprint) String() string

String returns the fingerprint in its canonical "<scheme>:<value>" form.

type HFSource

type HFSource struct {
	// contains filtered or unexported fields
}

HFSource is the Source implementation for hf:// URIs.

URI forms:

hf://org/repo         — follows main branch
hf://org/repo@ref     — ref is a branch, tag, or 40-char commit sha

The source resolves the ref to a full commit sha at Inventory time and uses that pinned sha for all subsequent Open calls. Callers should call Inventory before Open so that content is pinned to a specific commit; without it, Open falls back to the original, unresolved ref.

func NewHFSource

func NewHFSource(uri string) (*HFSource, error)

NewHFSource constructs an HFSource bound to the given hf:// URI. It parses the URI and looks up auth from env vars but does not make any network calls — validation happens at Inventory time.

func (*HFSource) Inventory

func (s *HFSource) Inventory(ctx context.Context) (Inventory, error)

Inventory calls the HuggingFace Hub API to list files and resolve the ref to a pinned commit sha. For LFS/xet-tracked files the sha256 digest comes from the API response (free, no download). Inline files (small, git-tracked) are fetched and hashed.

The fingerprint is "commit:<full-sha>".

func (*HFSource) Open

func (s *HFSource) Open(ctx context.Context, path string) (io.ReadCloser, error)

Open returns a reader that streams the file from the HuggingFace CDN. It follows the redirect from the resolve endpoint to the appropriate backend (LFS CDN, xet cas-bridge, or inline git blob).

Open uses the commit sha resolved during Inventory, so file content is pinned to the same revision that was inventoried. If Inventory has not been called, Open falls back to the original ref.

type HTTPSource added in v0.20.0

type HTTPSource struct {
	// contains filtered or unexported fields
}

HTTPSource is the Source implementation for https:// and http:// URIs.

Each HTTPSource represents a single remote file. The filename is derived from the URL path basename (e.g. "RealESRGAN_x4plus.pth"). This supports GitHub Releases, S3 presigned URLs, university file servers, ONNX Model Zoo, and any plain HTTP download.

Fingerprint strategy:

  • HEAD request: use a strong ETag if present → "etag:<value>"
  • No usable ETag: fall back to GET + sha256 hash → "sha256:<hex>"

ETag is treated as a *cache hint*, not a content identity. A change in ETag triggers re-verification; a stable ETag short-circuits it. Two HTTP sources with identical content but different ETags will re-import unnecessarily but produce the same final artifact — the worst case is wasted work, never wrong content. Weak ETags (W/-prefixed, RFC 7232 §2.3) explicitly do not promise content identity, so we ignore them and fall through to sha256.

func NewHTTPSource added in v0.20.0

func NewHTTPSource(uri string) (*HTTPSource, error)

NewHTTPSource constructs an HTTPSource bound to the given URL. It validates the URL parses correctly, has a non-empty path component, and does not embed credentials. No network calls are made at construction time.

URIs with userinfo (https://user:pass@host/...) are rejected: the URI is recorded verbatim in weights.lock, which is checked into git, so embedded credentials would leak. Use a separate auth mechanism (Authorization header support is on the roadmap).
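The construction-time checks can be sketched with net/url. validateHTTPURL is a hypothetical helper, not the package's code; treating a path of "/" as empty is an assumption made here because no filename could be derived from it.

```go
package main

import (
	"fmt"
	"net/url"
	"path"
)

// validateHTTPURL applies the documented checks without touching the
// network and returns the filename derived from the path basename.
func validateHTTPURL(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	if u.User != nil {
		// The URI would be written to weights.lock, so credentials must not be in it.
		return "", fmt.Errorf("URL must not embed credentials")
	}
	if u.Path == "" || u.Path == "/" {
		return "", fmt.Errorf("URL has no path component")
	}
	return path.Base(u.Path), nil
}

func main() {
	fmt.Println(validateHTTPURL("https://example.com/models/RealESRGAN_x4plus.pth"))
	fmt.Println(validateHTTPURL("https://user:pass@example.com/model.bin"))
}
```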

func (*HTTPSource) Inventory added in v0.20.0

func (s *HTTPSource) Inventory(ctx context.Context) (Inventory, error)

Inventory resolves the remote file's metadata.

The inventory contains exactly one file: the URL's path basename. A HEAD request is tried first for ETag (used as fingerprint for cheap change detection) and Content-Length. The file is then GET+sha256-hashed to produce a real content digest — every InventoryFile must have a non-empty Digest for the downstream store and packer to work correctly.

When no ETag is available, the sha256 digest doubles as the fingerprint.

func (*HTTPSource) Open added in v0.20.0

func (s *HTTPSource) Open(ctx context.Context, _ string) (io.ReadCloser, error)

Open returns a reader that streams the file from the URL. Go's http.Client follows redirects by default, handling GitHub's 302→CDN pattern transparently.

Open does NOT verify the response body against the inventory digest; the store performs that verification on PutFile (see store.PutFile). If a mutable URL serves different bytes between Inventory() and Open(), the digest mismatch will surface during ingress, not here.

For HTTP sources without ETag and without Content-Length the file is hashed once during Inventory() and then streamed again on Open() — a 2x bandwidth cost. We accept that today rather than caching to disk between phases; for multi-GB downloads from sources like that, expect a re-download on every import.

type Inventory

type Inventory struct {
	Files       []InventoryFile
	Fingerprint Fingerprint
}

Inventory is the result of Source.Inventory: everything needed to plan an import without transferring payload bytes.

Fingerprint is the source's version identity for the currently bound URI. Files is the list of content-addressed entries that make up the source; the packer consumes this list to produce tar layers.

func FilterInventory

func FilterInventory(inv Inventory, include, exclude []string) (Inventory, error)

FilterInventory applies include/exclude glob patterns to an inventory's file list and returns a new inventory with only the matching files. The returned inventory shares the original's Fingerprint (which is the upstream version identity, not affected by filtering).

Semantics:

  • If include is non-empty, a file must match at least one include pattern.
  • If a file matches any exclude pattern, it is excluded (even if it also matches an include pattern — exclude wins).
  • If both lists are empty/nil, all files pass through unchanged.

Pattern matching uses gitignore-style globs via go-gitignore: bare patterns float across directories ("*.bin" matches any depth), path-shaped patterns anchor ("onnx/*.bin" matches direct children of onnx/), and "**" matches any number of path segments.

Returns an error if the filter yields zero files — an empty weight set is almost always a mistake and should surface immediately.

type InventoryFile

type InventoryFile struct {
	// Path is the file path relative to the source root, using forward
	// slashes regardless of the host OS.
	Path string
	// Size is the uncompressed file size in bytes.
	Size int64
	// Digest is the SHA-256 content digest with the "sha256:" prefix.
	Digest string
}

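InventoryFile is one entry in an Inventory: a file's relative path, size, and content digest. For file:// the digest is computed by walking and hashing. For remote sources it is read from a source-side index where one exists (for example, the HuggingFace Hub API for LFS/xet-tracked files) and otherwise computed by fetching and hashing the file (inline HuggingFace files, HTTP sources).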

func (InventoryFile) DirhashParts

func (f InventoryFile) DirhashParts() DirhashPart

DirhashParts implements Dirhashable so InventoryFile slices can be passed directly to DirHash.

type Source

type Source interface {
	// Inventory returns the file list and version identity for the
	// bound source. For file:// this walks and hashes (unavoidable for
	// a local directory). For future remote sources it is expected to
	// be cheap — HuggingFace Hub exposes per-file sha256 via its API,
	// OCI sources read them from the source manifest's config blob.
	Inventory(ctx context.Context) (Inventory, error)

	// Open returns a reader for a single file in the source, identified
	// by its inventory path (relative to the source root). Called on
	// demand during packing. The caller closes the returned reader.
	Open(ctx context.Context, path string) (io.ReadCloser, error)
}

Source is the provider for a weight-source scheme, bound at construction time to a specific URI.

Implementations translate a scheme-specific URI (file://, hf://, s3://, http://, ...) into (a) an inventory of what the source contains, and (b) an on-demand byte stream for any one file in that inventory. The weights subsystem drives the import pipeline off these two capabilities — there is deliberately no "materialize the whole source to disk" step, so sources whose contents do not fit on local disk can still flow through the packer one file at a time.

A Source instance is bound to one URI for its entire lifetime. Callers construct a Source via For(uri, projectDir). Methods are expected to be context-cancellable and safe to call concurrently for different paths.

func For

func For(uri, projectDir string) (Source, error)

For returns the Source implementation for the given URI's scheme, bound to uri and projectDir.

The scheme is the substring before the first "://". Bare paths (no scheme) are treated as file:// — this accepts both absolute ("/data") and relative ("./weights") forms as a convenience at the interface boundary.

Unknown schemes return a clear error listing the currently supported schemes. This is the only place where scheme → implementation dispatch happens; adding s3:// or http:// is a single case here plus the matching Source implementation.

For validates that the source exists and is usable. A file:// URI that points at a missing path or at a non-directory returns an error here, not at Inventory time.
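The scheme detection and dispatch described for For can be sketched as below. schemeOf and dispatch are hypothetical, and the set of cases is an assumption limited to the schemes named by the package constants; whether For also routes http:// and https:// to HTTPSource is not stated here, so the sketch leaves them out.

```go
package main

import (
	"fmt"
	"strings"
)

// schemeOf returns the substring before the first "://".
// Bare paths have no scheme and are treated as file.
func schemeOf(uri string) string {
	scheme, _, ok := strings.Cut(uri, "://")
	if !ok {
		return "file"
	}
	return scheme
}

// dispatch names the implementation that would handle uri.
func dispatch(uri string) (string, error) {
	switch s := schemeOf(uri); s {
	case "file":
		return "FileSource", nil
	case "hf", "huggingface":
		return "HFSource", nil
	default:
		return "", fmt.Errorf("unsupported scheme %q (supported: file, hf, huggingface)", s)
	}
}

func main() {
	for _, uri := range []string{"/data", "./weights", "hf://org/repo", "s3://bucket/key"} {
		impl, err := dispatch(uri)
		fmt.Println(uri, "->", impl, err)
	}
}
```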

type ZeroSurvivorsError

type ZeroSurvivorsError struct {
	InventorySize int
	Include       []string
	Exclude       []string
}

ZeroSurvivorsError is returned when include/exclude filtering removes all files from an inventory.

func (*ZeroSurvivorsError) Error

func (e *ZeroSurvivorsError) Error() string
