Documentation ¶
Overview ¶
Package weightsource is the pluggable source layer for weight imports.
A Source is a stateful provider bound at construction time to a specific URI. It exposes two capabilities: Inventory lists the files the source offers (with sizes, per-file digests, and a source-identity Fingerprint), and Open streams one file's bytes. The packer drives the import one file at a time so that sources larger than local disk can be imported without full materialization.
Implementations exist for file:// (local directory), hf:// (HuggingFace Hub), and http:// or https:// (a single remote file; see HTTPSource).
Index ¶
- Constants
- func DirHash[F Dirhashable](files []F) string
- func NormalizeURI(uri string) (string, error)
- func SortInventoryFiles(files []InventoryFile)
- type DirhashPart
- type Dirhashable
- type FileSource
- type Fingerprint
- type HFSource
- type HTTPSource
- type Inventory
- type InventoryFile
- type Source
- type ZeroSurvivorsError
Constants ¶
const FileScheme = "file"
FileScheme is the URI scheme for local filesystem sources.
const HFScheme = "hf"
HFScheme is the short URI scheme for HuggingFace Hub sources.
const HFSchemeLong = "huggingface"
HFSchemeLong is the long-form URI scheme alias for HuggingFace Hub.
Variables ¶
This section is empty.
Functions ¶
func DirHash ¶
func DirHash[F Dirhashable](files []F) string
DirHash computes a content-addressable digest of a file set per spec §2.4:
sha256(join(sort("<hex>  <path>"), "\n"))
where each line is the file's sha256 hex digest and relative path joined by two spaces (matching sha256sum output). DirHash sorts the lines itself, so the caller's input order does not affect the result.
The result has the form "sha256:<hex>". The same formula produces the weight set digest stored in weights.lock (WeightLockEntry.SetDigest) and the Fingerprint of file:// sources: a local directory is identified by its content, so its fingerprint coincides with its dirhash. Other schemes (hf://, s3://, http://) use scheme-native identifiers (commit SHA, ETag, etc.) for their Fingerprint instead.
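The formula can be sketched in a few lines. This is an illustration only, not the package's implementation: the `part` type and `dirHash` function are hypothetical stand-ins for DirhashPart and DirHash, and stripping the "sha256:" prefix before building each line is an assumption based on the "<hex>" wording above.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// part is a hypothetical stand-in for DirhashPart: a relative path plus a
// content digest in "sha256:<hex>" form.
type part struct {
	Path   string
	Digest string
}

// dirHash builds one "<hex>  <path>" line per file (two spaces, as in
// sha256sum output), sorts the lines, joins them with "\n", and hashes
// the joined text.
func dirHash(parts []part) string {
	lines := make([]string, 0, len(parts))
	for _, p := range parts {
		// Assumption: the line carries the bare hex digest, without the prefix.
		hexDigest := strings.TrimPrefix(p.Digest, "sha256:")
		lines = append(lines, hexDigest+"  "+p.Path)
	}
	sort.Strings(lines)
	sum := sha256.Sum256([]byte(strings.Join(lines, "\n")))
	return "sha256:" + hex.EncodeToString(sum[:])
}

func main() {
	a := part{Path: "config.json", Digest: "sha256:" + strings.Repeat("a", 64)}
	b := part{Path: "model.safetensors", Digest: "sha256:" + strings.Repeat("b", 64)}
	// Input order does not matter because the lines are sorted first.
	fmt.Println(dirHash([]part{a, b}) == dirHash([]part{b, a}))
}
```

Because the lines are sorted before hashing, two callers that list the same files in a different order arrive at the same digest.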
func NormalizeURI ¶
func NormalizeURI(uri string) (string, error)
NormalizeURI returns the canonical form of a weight source URI.
Each scheme has its own normalization rules:
- file:// and bare paths → canonical file:// form (see normalizeFileURI)
- hf:// and huggingface:// → canonical hf:// form (see normalizeHFURI)
Empty strings and unsupported schemes return an error.
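A rough sketch of these rules follows. It is not the package's code: `normalizeURI` and `normalizeFile` are hypothetical names, and the file:// handling covers only the URI forms documented for FileSource (no path cleaning, no ".." handling, no Windows paths).

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// normalizeFile maps a path to the file:// form: absolute paths become
// file:///abs/path, relative paths become file://./rel/path.
func normalizeFile(p string) string {
	if strings.HasPrefix(p, "/") {
		return "file://" + p
	}
	return "file://./" + strings.TrimPrefix(p, "./")
}

// normalizeURI dispatches on the scheme; bare paths are treated as file://.
func normalizeURI(uri string) (string, error) {
	if uri == "" {
		return "", errors.New("empty weight source URI")
	}
	scheme, rest, hasScheme := strings.Cut(uri, "://")
	if !hasScheme {
		return normalizeFile(uri), nil
	}
	switch scheme {
	case "file":
		return normalizeFile(rest), nil
	case "hf", "huggingface":
		return "hf://" + rest, nil
	}
	return "", fmt.Errorf("unsupported scheme %q", scheme)
}

func main() {
	for _, uri := range []string{"/abs/path", "rel/path", "huggingface://org/repo"} {
		out, err := normalizeURI(uri)
		fmt.Println(out, err)
	}
}
```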
func SortInventoryFiles ¶ added in v0.20.0
func SortInventoryFiles(files []InventoryFile)
SortInventoryFiles sorts files by path. Every Source implementation must return a sorted inventory; this helper enforces the convention.
Types ¶
type DirhashPart ¶
DirhashPart is the atomic input to DirHash: the pair of fields that uniquely identify a file's contribution to the dirhash. Path is the relative path (forward slashes) and Digest is the file's sha256 content digest in "sha256:<hex>" form.
func (DirhashPart) String ¶
func (p DirhashPart) String() string
String returns the canonical identity of a single file: "path\x00digest". This is the primitive that any code comparing files across layers, plans, or lockfile entries should use. DirHash composes over this (sorted, then hashed); layer keys join these (preserving individual file identity so two files with identical content but different paths remain distinguishable).
type Dirhashable ¶
type Dirhashable interface {
DirhashParts() DirhashPart
}
Dirhashable is implemented by types that can participate in DirHash. Both weightsource.InventoryFile and lockfile.WeightLockFile implement it, letting the two call sites share one digest implementation.
type FileSource ¶
type FileSource struct {
// contains filtered or unexported fields
}
FileSource is the Source implementation for file:// URIs and bare paths.
URIs take one of these forms:
file:///abs/path — absolute path
file://./rel/path — canonical relative path (explicit ./)
/abs/path — bare absolute path (normalized to file://)
./rel/path — bare relative path (normalized to file://)
rel/path — bare relative path, no ./ prefix (normalized)
The lockfile stores only the normalized form (see NormalizeURI); the absolute on-disk path is resolved once at construction time so the Source methods do not re-resolve on every call.
func NewFileSource ¶
func NewFileSource(uri, projectDir string) (*FileSource, error)
NewFileSource constructs a FileSource bound to uri, resolving relative URIs against projectDir. It validates that the resolved path exists and is a directory.
func (*FileSource) Inventory ¶
func (s *FileSource) Inventory(ctx context.Context) (Inventory, error)
Inventory walks the source directory and returns per-file path / size / content digest plus the source fingerprint (sha256 of the sorted file set, spec §2.4).
The .cog state directory is skipped. Non-regular entries (symlinks, devices, FIFOs, sockets) are rejected per spec §1.3 — silently dropping them would let a user ship a model missing files they expected. Resolve to regular files before importing.
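The walk described above might look roughly like the following. This is a sketch under assumptions, not the FileSource code: `entry`, `walkInventory`, and `demo` are hypothetical names, and the error text is invented for illustration.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"io/fs"
	"os"
	"path/filepath"
	"sort"
)

// entry is a hypothetical stand-in for InventoryFile.
type entry struct {
	Path   string
	Size   int64
	Digest string
}

// walkInventory hashes every regular file under root, skips .cog
// directories, and fails on symlinks, devices, FIFOs, and sockets.
func walkInventory(root string) ([]entry, error) {
	var entries []entry
	err := filepath.WalkDir(root, func(p string, d fs.DirEntry, err error) error {
		if err != nil {
			return err
		}
		if d.IsDir() {
			if p != root && d.Name() == ".cog" {
				return filepath.SkipDir
			}
			return nil
		}
		if !d.Type().IsRegular() {
			return fmt.Errorf("%s: not a regular file", p)
		}
		rel, err := filepath.Rel(root, p)
		if err != nil {
			return err
		}
		f, err := os.Open(p)
		if err != nil {
			return err
		}
		defer f.Close()
		h := sha256.New()
		n, err := io.Copy(h, f)
		if err != nil {
			return err
		}
		entries = append(entries, entry{
			Path:   filepath.ToSlash(rel),
			Size:   n,
			Digest: "sha256:" + hex.EncodeToString(h.Sum(nil)),
		})
		return nil
	})
	// Sort by path so the result does not depend on walk order.
	sort.Slice(entries, func(i, j int) bool { return entries[i].Path < entries[j].Path })
	return entries, err
}

// demo builds a throwaway directory tree and inventories it.
func demo() ([]entry, error) {
	root, err := os.MkdirTemp("", "weights")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(root)
	files := map[string]string{"config.json": "{}", "shards/0.bin": "abc", ".cog/state": "ignored"}
	for rel, body := range files {
		full := filepath.Join(root, filepath.FromSlash(rel))
		if err := os.MkdirAll(filepath.Dir(full), 0o755); err != nil {
			return nil, err
		}
		if err := os.WriteFile(full, []byte(body), 0o644); err != nil {
			return nil, err
		}
	}
	return walkInventory(root)
}

func main() {
	entries, err := demo()
	fmt.Println(entries, err)
}
```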
func (*FileSource) Open ¶
func (s *FileSource) Open(ctx context.Context, path string) (io.ReadCloser, error)
Open returns a reader for a single file in the source, identified by its inventory path (relative to the source root, using forward slashes). The caller closes the returned reader.
type Fingerprint ¶
type Fingerprint string
Fingerprint is a source's version identity, carrying its algorithm (or source-native identifier type) as a scheme prefix.
Examples:
sha256:<hex> — content hash (file:// sources)
commit:<sha> — git commit (hf:// repos pinned to a commit)
etag:<value> — HTTP ETag (http:// sources)
md5:<hex> — MD5 hash (s3:// objects)
timestamp:<rfc3339> — last-modified timestamp (fallback for systems that expose nothing stronger)
The prefix makes two fingerprints from different sources unambiguously unequal even when the opaque values happen to collide. The empty string is not a valid Fingerprint — callers that want to express "no fingerprint known" should use a separate sentinel.
func (Fingerprint) Scheme ¶
func (f Fingerprint) Scheme() string
Scheme returns the fingerprint's algorithm or identifier prefix (the part before the first colon). Returns "" if the fingerprint is malformed (no colon).
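As a minimal sketch of that behavior, assuming nothing beyond the description above (the lowercase `fingerprint` type and `scheme` method are local stand-ins, not the package's code):

```go
package main

import (
	"fmt"
	"strings"
)

// fingerprint mirrors the documented Fingerprint type for illustration.
type fingerprint string

// scheme returns the text before the first colon, or "" when the value
// contains no colon at all.
func (f fingerprint) scheme() string {
	prefix, _, ok := strings.Cut(string(f), ":")
	if !ok {
		return ""
	}
	return prefix
}

func main() {
	fmt.Println(fingerprint("commit:0123abcd").scheme())
	// Only the first colon separates the scheme; later colons belong to the value.
	fmt.Println(fingerprint("timestamp:2024-01-02T03:04:05Z").scheme())
	fmt.Printf("%q\n", fingerprint("no-colon").scheme())
}
```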
func (Fingerprint) String ¶
func (f Fingerprint) String() string
String returns the fingerprint in its canonical "<scheme>:<value>" form.
type HFSource ¶
type HFSource struct {
// contains filtered or unexported fields
}
HFSource is the Source implementation for hf:// URIs.
URI forms:
hf://org/repo — follows main branch
hf://org/repo@ref — ref is a branch, tag, or 40-char commit sha
The source resolves the ref to a full commit sha at Inventory time and uses that pinned sha for all subsequent Open calls. Callers should call Inventory before Open so that content is pinned to a specific commit; an Open without a prior Inventory reads from the unresolved ref.
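Splitting the two URI forms into a repository and a ref could be sketched as below. `parseHFURI` is a hypothetical helper, not part of the package, and the validation rules (non-empty org, repo, and ref) are assumptions; the default of "main" follows the first URI form above.

```go
package main

import (
	"fmt"
	"strings"
)

// parseHFURI splits hf://org/repo[@ref] into "org/repo" and the ref,
// defaulting the ref to "main" when none is given.
func parseHFURI(uri string) (string, string, error) {
	const prefix = "hf://"
	if !strings.HasPrefix(uri, prefix) {
		return "", "", fmt.Errorf("not an hf:// URI: %q", uri)
	}
	repo, ref, hasRef := strings.Cut(strings.TrimPrefix(uri, prefix), "@")
	if !hasRef {
		ref = "main"
	}
	org, name, ok := strings.Cut(repo, "/")
	if !ok || org == "" || name == "" || ref == "" {
		return "", "", fmt.Errorf("expected hf://org/repo[@ref], got %q", uri)
	}
	return repo, ref, nil
}

func main() {
	fmt.Println(parseHFURI("hf://org/repo"))
	fmt.Println(parseHFURI("hf://org/repo@v1.0"))
}
```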
func NewHFSource ¶
NewHFSource constructs an HFSource bound to the given hf:// URI. It parses the URI and looks up auth from env vars but does not make any network calls — validation happens at Inventory time.
func (*HFSource) Inventory ¶
Inventory calls the HuggingFace Hub API to list files and resolve the ref to a pinned commit sha. For LFS/xet-tracked files the sha256 digest comes from the API response (free, no download). Inline files (small, git-tracked) are fetched and hashed.
The fingerprint is "commit:<full-sha>".
func (*HFSource) Open ¶
Open returns a reader that streams the file from the HuggingFace CDN. It follows the redirect from the resolve endpoint to the appropriate backend (LFS CDN, xet cas-bridge, or inline git blob).
Open uses the commit sha resolved during Inventory, so file content is pinned to the same revision that was inventoried. If Inventory has not been called, Open falls back to the original ref.
type HTTPSource ¶ added in v0.20.0
type HTTPSource struct {
// contains filtered or unexported fields
}
HTTPSource is the Source implementation for https:// and http:// URIs.
Each HTTPSource represents a single remote file. The filename is derived from the URL path basename (e.g. "RealESRGAN_x4plus.pth"). This supports GitHub Releases, S3 presigned URLs, university file servers, ONNX Model Zoo, and any plain HTTP download.
Fingerprint strategy:
- HEAD request: use a strong ETag if present → "etag:<value>"
- No usable ETag: fall back to GET + sha256 hash → "sha256:<hex>"
ETag is treated as a *cache hint*, not a content identity. A change in ETag triggers re-verification; a stable ETag short-circuits it. Two HTTP sources with identical content but different ETags will re-import unnecessarily but produce the same final artifact — the worst case is wasted work, never wrong content. Weak ETags (W/-prefixed, RFC 7232 §2.3) explicitly do not promise content identity, so we ignore them and fall through to sha256.
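The strong-versus-weak decision can be illustrated with a small helper. `etagFingerprint` is a hypothetical name, and dropping the surrounding quotes from the header value is an assumption; the documentation above only states the "etag:<value>" shape.

```go
package main

import (
	"fmt"
	"strings"
)

// etagFingerprint returns "etag:<value>" for a strong ETag header. It
// returns ok=false for a missing or weak (W/-prefixed) ETag, in which case
// the caller would fall back to hashing the body.
func etagFingerprint(header string) (string, bool) {
	etag := strings.TrimSpace(header)
	if etag == "" || strings.HasPrefix(etag, "W/") {
		return "", false
	}
	// Assumption: the quotes around the opaque tag are not part of the value.
	value := strings.Trim(etag, `"`)
	if value == "" {
		return "", false
	}
	return "etag:" + value, true
}

func main() {
	fmt.Println(etagFingerprint(`"abc123"`))
	fmt.Println(etagFingerprint(`W/"abc123"`))
}
```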
func NewHTTPSource ¶ added in v0.20.0
func NewHTTPSource(uri string) (*HTTPSource, error)
NewHTTPSource constructs an HTTPSource bound to the given URL. It validates the URL parses correctly, has a non-empty path component, and does not embed credentials. No network calls are made at construction time.
URIs with userinfo (https://user:pass@host/...) are rejected: the URI is recorded verbatim in weights.lock, which is checked into git, so embedded credentials would leak. Use a separate auth mechanism (Authorization header support is on the roadmap).
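The construction-time checks might be sketched as follows, using only the standard library. `validateHTTPURI` is a hypothetical helper; rejecting a path that has no usable basename is a stricter reading of "non-empty path component" and is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"net/url"
	"path"
)

// validateHTTPURI checks the URL shape without touching the network and
// returns the filename derived from the URL path basename.
func validateHTTPURI(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", fmt.Errorf("invalid URL: %w", err)
	}
	if u.Scheme != "http" && u.Scheme != "https" {
		return "", fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	if u.User != nil {
		// The raw URI is deliberately left out of the message so that
		// credentials do not end up in logs.
		return "", errors.New("URL must not embed credentials")
	}
	name := path.Base(u.Path)
	if u.Path == "" || name == "/" || name == "." {
		return "", errors.New("URL has no file path")
	}
	return name, nil
}

func main() {
	fmt.Println(validateHTTPURI("https://example.com/models/RealESRGAN_x4plus.pth"))
	fmt.Println(validateHTTPURI("https://user:pass@example.com/model.bin"))
}
```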
func (*HTTPSource) Inventory ¶ added in v0.20.0
func (s *HTTPSource) Inventory(ctx context.Context) (Inventory, error)
Inventory resolves the remote file's metadata.
The inventory contains exactly one file: the URL's path basename. A HEAD request is tried first for ETag (used as fingerprint for cheap change detection) and Content-Length. The file is then GET+sha256-hashed to produce a real content digest — every InventoryFile must have a non-empty Digest for the downstream store and packer to work correctly.
When no ETag is available, the sha256 digest doubles as the fingerprint.
func (*HTTPSource) Open ¶ added in v0.20.0
func (s *HTTPSource) Open(ctx context.Context, _ string) (io.ReadCloser, error)
Open returns a reader that streams the file from the URL. Go's http.Client follows redirects by default, handling GitHub's 302→CDN pattern transparently.
Open does NOT verify the response body against the inventory digest; the store performs that verification on PutFile (see store.PutFile). If a mutable URL serves different bytes between Inventory() and Open(), the digest mismatch will surface during ingress, not here.
For HTTP sources without ETag and without Content-Length the file is hashed once during Inventory() and then streamed again on Open() — a 2x bandwidth cost. We accept that today rather than caching to disk between phases; for multi-GB downloads from sources like that, expect a re-download on every import.
type Inventory ¶
type Inventory struct {
Files []InventoryFile
Fingerprint Fingerprint
}
Inventory is the result of Source.Inventory: everything needed to plan an import without transferring payload bytes.
Fingerprint is the source's version identity for the currently bound URI. Files is the list of content-addressed entries that make up the source; the packer consumes this list to produce tar layers.
func FilterInventory ¶
FilterInventory applies include/exclude glob patterns to an inventory's file list and returns a new inventory with only the matching files. The returned inventory shares the original's Fingerprint (which is the upstream version identity, not affected by filtering).
Semantics:
- If include is non-empty, a file must match at least one include pattern.
- If a file matches any exclude pattern, it is excluded (even if it also matches an include pattern — exclude wins).
- If both lists are empty/nil, all files pass through unchanged.
Pattern matching uses gitignore-style globs via go-gitignore: bare patterns float across directories ("*.bin" matches any depth), path-shaped patterns anchor ("onnx/*.bin" matches direct children of onnx/), and "**" matches any number of path segments.
Returns an error if the filter yields zero files — an empty weight set is almost always a mistake and should surface immediately.
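The include/exclude semantics can be sketched with the standard library alone. This is a simplified stand-in, not FilterInventory itself: `matches` uses `path.Match` instead of go-gitignore, so it does not support "**", and all names here are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"path"
	"strings"
)

// matches approximates gitignore-style globs: a pattern without a slash is
// tested against the basename (so it floats across directories), while a
// pattern with a slash is tested against the whole relative path.
func matches(pattern, file string) bool {
	target := file
	if !strings.Contains(pattern, "/") {
		target = path.Base(file)
	}
	ok, err := path.Match(pattern, target)
	return err == nil && ok
}

func matchesAny(patterns []string, file string) bool {
	for _, p := range patterns {
		if matches(p, file) {
			return true
		}
	}
	return false
}

// filterPaths keeps files that match some include pattern (when any are
// given) and no exclude pattern. Exclude wins over include.
func filterPaths(paths, include, exclude []string) ([]string, error) {
	var kept []string
	for _, p := range paths {
		if len(include) > 0 && !matchesAny(include, p) {
			continue
		}
		if matchesAny(exclude, p) {
			continue
		}
		kept = append(kept, p)
	}
	if len(kept) == 0 {
		return nil, errors.New("include/exclude filtering removed every file")
	}
	return kept, nil
}

func main() {
	paths := []string{"config.json", "model.bin", "onnx/model.bin", "onnx/sub/extra.bin"}
	fmt.Println(filterPaths(paths, []string{"*.bin"}, []string{"onnx/*.bin"}))
}
```

In this example "onnx/model.bin" matches both an include and an exclude pattern and is dropped, while "onnx/sub/extra.bin" survives because the anchored exclude pattern only covers direct children of onnx/.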
type InventoryFile ¶
type InventoryFile struct {
// Path is the file path relative to the source root, using forward
// slashes regardless of the host OS.
Path string
// Size is the uncompressed file size in bytes.
Size int64
// Digest is the SHA-256 content digest with the "sha256:" prefix.
Digest string
}
InventoryFile is one entry in an Inventory: a file's relative path, size, and content digest. For file:// the digest is computed by walking and hashing; for remote sources it is read from a source-side index where one exists (HFSource hashes inline files itself, and HTTPSource hashes the downloaded file).
func (InventoryFile) DirhashParts ¶
func (f InventoryFile) DirhashParts() DirhashPart
DirhashParts implements Dirhashable so InventoryFile slices can be passed directly to DirHash.
type Source ¶
type Source interface {
// Inventory returns the file list and version identity for the
// bound source. For file:// this walks and hashes (unavoidable for
// a local directory). For future remote sources it is expected to
// be cheap — HuggingFace Hub exposes per-file sha256 via its API,
// OCI sources read them from the source manifest's config blob.
Inventory(ctx context.Context) (Inventory, error)
// Open returns a reader for a single file in the source, identified
// by its inventory path (relative to the source root). Called on
// demand during packing. The caller closes the returned reader.
Open(ctx context.Context, path string) (io.ReadCloser, error)
}
Source is the provider for a weight-source scheme, bound at construction time to a specific URI.
Implementations translate a scheme-specific URI (file://, hf://, s3://, http://, ...) into (a) an inventory of what the source contains, and (b) an on-demand byte stream for any one file in that inventory. The weights subsystem drives the import pipeline off these two capabilities — there is deliberately no "materialize the whole source to disk" step, so sources whose contents do not fit on local disk can still flow through the packer one file at a time.
A Source instance is bound to one URI for its entire lifetime. Callers construct a Source via For(uri, projectDir). Methods are expected to be context-cancellable and safe to call concurrently for different paths.
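To make the two-capability contract concrete, here is a self-contained sketch with an in-memory source and a driver loop that pulls one file at a time. Every name in it (`source`, `memSource`, `importAll`, `runDemo`, and the "memory:" fingerprint prefix) is hypothetical; the interface is redeclared locally so the example does not depend on the package's import path.

```go
package main

import (
	"bytes"
	"context"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"sort"
)

type inventoryFile struct {
	Path   string
	Size   int64
	Digest string
}

type inventory struct {
	Files       []inventoryFile
	Fingerprint string
}

// source mirrors the shape of the Source interface documented above.
type source interface {
	Inventory(ctx context.Context) (inventory, error)
	Open(ctx context.Context, path string) (io.ReadCloser, error)
}

// memSource is a toy source backed by a map of path to content.
type memSource map[string][]byte

func (m memSource) Inventory(ctx context.Context) (inventory, error) {
	inv := inventory{Fingerprint: "memory:v1"} // made-up fingerprint scheme
	for p, data := range m {
		sum := sha256.Sum256(data)
		inv.Files = append(inv.Files, inventoryFile{
			Path:   p,
			Size:   int64(len(data)),
			Digest: "sha256:" + hex.EncodeToString(sum[:]),
		})
	}
	sort.Slice(inv.Files, func(i, j int) bool { return inv.Files[i].Path < inv.Files[j].Path })
	return inv, nil
}

func (m memSource) Open(ctx context.Context, path string) (io.ReadCloser, error) {
	data, ok := m[path]
	if !ok {
		return nil, fmt.Errorf("no such file %q", path)
	}
	return io.NopCloser(bytes.NewReader(data)), nil
}

// importAll plans from the inventory, then streams each file in turn
// without ever holding the whole source at once.
func importAll(ctx context.Context, src source) (int64, error) {
	inv, err := src.Inventory(ctx)
	if err != nil {
		return 0, err
	}
	var total int64
	for _, f := range inv.Files {
		rc, err := src.Open(ctx, f.Path)
		if err != nil {
			return total, err
		}
		n, err := io.Copy(io.Discard, rc) // a real packer would write a tar layer here
		rc.Close()
		if err != nil {
			return total, err
		}
		if n != f.Size {
			return total, fmt.Errorf("%s: streamed %d bytes, inventory said %d", f.Path, n, f.Size)
		}
		total += n
	}
	return total, nil
}

func runDemo() (int64, error) {
	src := memSource{"a.bin": []byte("hello"), "b.bin": []byte("world!")}
	return importAll(context.Background(), src)
}

func main() {
	fmt.Println(runDemo())
}
```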
func For ¶
For returns the Source implementation for the given URI's scheme, bound to uri and projectDir.
The scheme is the substring before the first "://". Bare paths (no scheme) are treated as file:// — this accepts both absolute ("/data") and relative ("./weights") forms as a convenience at the interface boundary.
Unknown schemes return a clear error listing the currently supported schemes. This is the only place where scheme → implementation dispatch happens; adding s3:// or http:// is a single case here plus the matching Source implementation.
For validates that the source exists and is usable. A file:// URI that points at a missing path or at a non-directory returns an error here, not at Inventory time.
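The dispatch step can be pictured as a single switch on the scheme. The sketch below is illustrative only: `dispatch` is a hypothetical function that returns a label instead of constructing a real source, and the set of supported schemes shown is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// dispatch picks an implementation by scheme. Bare paths have no "://"
// and are treated as file://.
func dispatch(uri string) (string, error) {
	scheme := "file"
	if s, _, ok := strings.Cut(uri, "://"); ok {
		scheme = s
	}
	switch scheme {
	case "file":
		return "FileSource(" + uri + ")", nil
	case "hf":
		return "HFSource(" + uri + ")", nil
	default:
		// Listing the supported schemes keeps the error actionable.
		return "", fmt.Errorf("unsupported scheme %q (supported: file, hf)", scheme)
	}
}

func main() {
	fmt.Println(dispatch("./weights"))
	fmt.Println(dispatch("hf://org/repo"))
	fmt.Println(dispatch("ftp://host/file"))
}
```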
type ZeroSurvivorsError ¶
ZeroSurvivorsError is returned when include/exclude filtering removes all files from an inventory.
func (*ZeroSurvivorsError) Error ¶
func (e *ZeroSurvivorsError) Error() string