objstore

package module

v0.0.0-...-73a0f5f Latest Latest Go to latest Published: Feb 11, 2026 License: MIT Imports: 33 Imported by: 0

Details

Valid go.mod file
Redistributable license
Tagged version
Stable version
Learn more about best practices

Repository

github.com/ahrav/go-gitpack

Links

Open Source Insights

README ¶

go-gitpack

A minimal, memory-mapped Git object store that resolves objects directly from *.pack files without shelling out to the Git executable.

Overview

The objstore package provides fast, read-only access to Git objects stored in packfiles. It's designed for scenarios where you need low-latency lookups, such as secret scanning, indexing, etc.

Note: This is very much experimental and a learning repo.

Usage

Scan every unique blob (recommended)

Blob mode visits every unique blob exactly once in pack-offset order — no diff computation, sequential I/O, and each blob is seen only once via deduplication.

type myScanner struct{}

func (s *myScanner) ScanBlob(r io.Reader, meta objstore.ScanMeta) error {
    // meta.Blob, meta.Commit, meta.Path available
    _, err := io.Copy(io.Discard, r)
    return err
}

scanner, err := objstore.NewHistoryScanner("/path/to/.git")
if err != nil {
    log.Fatal(err)
}
defer scanner.Close()

if err := scanner.Scan(nil, &myScanner{}); err != nil {
    log.Fatal(err)
}

Documentation ¶

Overview ¶

package objstore provides a minimal, memory-mapped Git object store that resolves objects directly from *.pack files without shelling out to the Git executable.

This file implements two cache layers used during object resolution:

refCountedDeltaWindow: a bounded, reference-counted LRU cache for intermediate delta bases. It keeps recently resolved objects in memory so that subsequent delta applications can reuse them without re-reading the packfile. The window cooperates with Handle-based reference counting so that actively-used entries are never evicted.
arcCache: an Adaptive Replacement Cache (ARC) that sits above the delta window and caches fully resolved objects. ARC balances recency and frequency to handle both scan-like (one-pass) and lookup-like (repeated access) workloads.

The two caches serve complementary roles: the delta window is scoped to packfile delta resolution and is consulted during the innermost inflate loop, while the ARC cache is the top-level "have I already resolved this OID?" check used by store.get().

commit_attribution.go

Efficient extraction and caching of Git commit author metadata.

Every secret finding needs to be attributed to a commit author (name, email, timestamp). Parsing the raw commit header each time is expensive, so this file provides metaCache -- a concurrency-safe, read-through cache that stores AuthorInfo keyed by commit OID. When a commit-graph file is available, timestamps are served from the precomputed graph slice instead of re-parsing the header.

commit_order.go

Topological commit ordering using Kahn's algorithm backed by a min-heap.

The primary goal is to produce a deterministic parent-before-child ordering of commits so that every parent is visited before any of its children. This is the natural requirement for incremental secret-scanning: we want to scan a parent's tree changes before its child's, so the "seen" set grows in a predictable order.

The algorithm works in three steps:

Build an in-degree map over commits whose parents are in the input set.
Seed a min-heap with all root commits (in-degree == 0).
Pop the minimum-timestamp commit, decrement children's in-degree, and push newly zero-in-degree children onto the heap.

If the input contains cycles (which can happen with grafted or corrupt history), a deterministic timestamp-then-OID fallback appends the remaining commits so we always return every commit exactly once.

delta.go

Two encodings are supported:

ref‑deltas identify their base object by its 20‑byte object ID.
ofs‑deltas locate their base by a backward offset within the same packfile.

Delta chains may be deep or even cyclic. The resolver tracks every hop and enforces a caller‑supplied depth limit to prevent infinite recursion and denial‑of‑service attacks.

Internally the implementation uses a reusable "ping‑pong" arena so that each delta step can decode from one half of the buffer while writing into the other, avoiding heap allocations.

ErrDeltaTargetTooLarge is the only exported symbol. All other symbols below this comment are package‑private in order to keep the public surface minimal and stable.

delta_window_sharded.go

Sharded delta-base window for reducing lock contention during concurrent pack-object resolution.

When multiple goroutines resolve delta-compressed objects simultaneously, a single shared window becomes a serialization bottleneck because every acquire and add operation must take the same lock. Sharding the window by the first byte of the object ID distributes that contention across N independent refCountedDeltaWindow instances, each with its own lock and LRU budget. Because Git object IDs are uniformly distributed, the load across shards is approximately even without any additional hashing.

diff_tree.go implements a streaming, memory-efficient tree-to-tree diff for Git object trees.

The algorithm performs a merge-join over the sorted entries of two trees, emitting per-file change callbacks without materialising the full trees in memory. This design supports arbitrarily large repositories because only one tree level is traversed at a time.

PRECONDITION: Git tree entries are stored in Git tree sort order, which is *not* plain lexicographic order. Directories are compared as if their name had a trailing '/' appended (e.g. "foo" < "foo-bar" < "foo.c" < "foo/" when "foo" is a tree). The TreeIter returned by store.treeIter MUST yield entries in this canonical order for the merge-join comparisons (oln < nln, oln == nln) to be correct. Violating this precondition will produce incorrect diffs silently.

hash.go

Core type for Git object identifiers (SHA-1 hashes).

This file defines the Hash type and its basic operations (formatting, parsing, zero-checking). The Hash type is a fixed-size [20]byte array that represents the raw binary form of a SHA-1 digest.

Thread safety: Hash is a value type (fixed-size array) and is safe to copy, compare, and read concurrently without synchronization. There is no shared mutable state. Functions in this file (String, IsZero, ParseHash) are all pure and safe for concurrent use.

Related files:

hash_aligned.go: provides Hash.Uint64() for architectures with relaxed alignment (all except 32-bit ARM).
hash_unaligned.go: provides Hash.Uint64() for 32-bit ARM using safe byte-order-aware decoding.

Package objstore offers a content‑addressable store optimized for Git packfiles and commit‑graph data.

This file defines HistoryScanner, a high‑throughput helper that streams commit‑level information and tree‑to‑tree diffs without inflating full commit objects.

Overview ¶

A HistoryScanner wraps an internal object store plus (optionally) the repository's commit‑graph. It exposes a composable API layer focused on **read‑only** analytics workloads such as:

Scanning every commit once to extract change hunks.
Iterating trees to build custom indexes.
Fetching lightweight commit metadata (author, timestamp) on demand.

Callers should construct exactly one HistoryScanner per repository and reuse it for the lifetime of the program. All methods are safe for concurrent use unless their doc comment states otherwise.

Quick start ¶

// Open an existing repository.
s, err := objstore.NewHistoryScanner(".git")
if err != nil {
    log.Fatal(err)
}
defer s.Close()

s.SetMaxDeltaDepth(100) // Tune delta resolution
s.SetVerifyCRC(true)    // Extra integrity checking

// Stream added hunks from every commit.
hunks, errs := s.DiffHistoryHunks()
go func() {
    for h := range hunks {
        fmt.Println(h)
    }
}()
if err := <-errs; err != nil {
    log.Fatal(err)
}

loose_object.go

Reading Git loose objects from the on-disk object store.

A loose object is a single zlib-compressed file stored at <objects-dir>/<xx>/<yy...> where <xx> are the first two hex characters of the SHA-1 OID and <yy...> the remaining 38 characters. The decompressed content has the format:

<type> <size>\0<body>

where <type> is one of "commit", "tree", "blob", or "tag", <size> is the decimal byte length of <body>, and \0 is a single NUL byte separating the header from the payload.

profiling.go

Optional profiling and execution-tracing support for HistoryScanner.

When enabled via the WithProfiling ScannerOption, this file starts an HTTP server that exposes the standard net/http/pprof endpoints and, optionally, an execution trace that is written to disk for the duration of the scan. Neither capability affects scanning correctness; if anything in this file fails the scanner can still operate normally.

scan_mode.go

Scanning strategy selection for HistoryScanner.

Two modes are supported:

ScanModeBlob (default) -- iterates every unique blob introduced across the commit history in pack-file offset order, yielding the full blob body exactly once per OID. This is the recommended and fastest path.
ScanModeHunks (legacy) -- computes per-commit diffs and yields only the added-line hunks. Retained for backward compatibility with callers that require line-level attribution, but significantly slower because it must diff every parent-child commit pair.

scan_plan.go implements the blob-scanning pipeline for HistoryScanner.

The pipeline operates in three phases:

Candidate collection -- walkBlobCandidates traverses every commit (via walkCommitsFromRefs) and emits one blobRecord per changed blob. Candidates are partitioned into 256 on-disk buckets keyed by the first byte of the blob OID. Bucketing limits the working-set size during the subsequent dedup pass.
Dedup / classification -- the prepare step reads each bucket sequentially, deduplicates by blob OID (keeping the first occurrence), consults the caller-supplied SeenSet for cross-run dedup, and classifies each surviving blob as either "packed" (locatable in an open packfile via the index) or "loose" (must be resolved through the loose-object directory). Packed records include the packfile offset so that the execution phase can sort by offset for sequential I/O.
Execution -- packed records are sorted by ascending packfile offset and decoded in batches to maximise sequential read throughput on spinning disks and to benefit from OS read-ahead on SSDs. Loose objects are read individually via store.getNoCache.

Two execution strategies are supported:

Fast path (in-memory): When the total candidate footprint fits within scanFastPathMaxBytes and the blob count is below scanFastPathMaxBlobs, all records are held in memory (inMemoryScanPlan) and no temporary files are created. This eliminates disk I/O overhead for small-to-medium repositories.
Spill path (disk-backed): When the fast-path thresholds are exceeded, candidates are spilled to temporary files via spillWriter. The external sort uses fixed-size chunks (scanPackSortChunkSize) that are individually sorted and then merged with a min-heap (packedMergeHeap) to produce a globally offset-ordered stream without requiring the entire dataset to fit in memory.

The streaming executor (streamingPackExecutor) provides a third mode that avoids materializing the full plan by flushing per-pack buffers as they reach scanPackFlushRecords or scanPackFlushBytes, trading some I/O ordering for lower peak memory.

Package objstore provides a minimal, memory-mapped Git object store that resolves objects directly from *.pack files without shelling out to the Git executable.

The store is intended for read-only scenarios—such as code search, indexing, and content serving—where low-latency look-ups are required but a full on-disk checkout is unnecessary.

Implementation ¶

The store memory-maps one or more *.pack / *.idx pairs, builds an in-memory map from SHA-1 object IDs to their pack offsets, and inflates objects on demand. It coordinates between pack-index (IDX) and reverse-index (RIDX) data, and builds a unified in-memory lookup across packfiles with transparent delta chain resolution and caching.

All packfiles are memory-mapped for zero-copy access, with an adaptive replacement cache (ARC) and delta window for hot objects. Delta chains are resolved transparently with bounded depth and cycle detection. A small, size-bounded cache avoids redundant decompression, and optional CRC-32 verification can be enabled for additional integrity checks.

The store is safe for concurrent readers and eliminates the need to shell out to Git while providing high-performance object retrieval for read-only workloads like indexing and content serving.

Index ¶

Constants
Variables
func GetBuf() *[]byte
func OpenForTesting(dir string) (*store, error)
func PutBuf(buf *[]byte)
type AddedHunk
- func (h *AddedHunk) EndLine() uint32
type AuthorInfo
type BlobScanner
type CommitMetadata
type Handle
- func (h *Handle) Data() []byte
- func (h *Handle) Release()
- func (h *Handle) Type() ObjectType
type Hash
- func ParseHash(s string) (Hash, error)
- func (h Hash) IsZero() bool
- func (h Hash) String() string
- func (h Hash) Uint64() uint64
type HistoryScanner
- func NewHistoryScanner(gitDir string, opts ...ScannerOption) (*HistoryScanner, error)
- func (hs *HistoryScanner) Close() error
- func (hs *HistoryScanner) DiffHistoryHunks() (<-chan HunkAddition, <-chan error)
- func (s *HistoryScanner) GetCommitMetadata(oid Hash) (CommitMetadata, error)
- func (hs *HistoryScanner) Scan(seen SeenSet, scanner BlobScanner) error
- func (hs *HistoryScanner) ScanMode() ScanMode
- func (hs *HistoryScanner) SetMaxDeltaDepth(depth int)
- func (hs *HistoryScanner) SetMaxDeltaObjectSize(maxBytes uint64)
- func (hs *HistoryScanner) SetScanMode(mode ScanMode)
- func (hs *HistoryScanner) SetVerifyCRC(verify bool)
type HunkAddition
- func (h *HunkAddition) Commit() Hash
- func (h *HunkAddition) EndLine() int
- func (h *HunkAddition) IsBinary() bool
- func (h *HunkAddition) Lines() []string
- func (h *HunkAddition) Path() string
- func (h *HunkAddition) StartLine() int
- func (h *HunkAddition) String() string
type ObjectCache
- func NewARCCache(size int) (ObjectCache, error)
type ObjectType
- func (t ObjectType) String() string
type Parents
type ProfilingConfig
type ScanError
- func (e *ScanError) Error() string
type ScanJob
type ScanMeta
type ScanMode
- func (m ScanMode) String() string
type ScannerOption
- func WithProfiling(config *ProfilingConfig) ScannerOption
- func WithScanMode(mode ScanMode) ScannerOption
type SeenSet
type TreeIter
- func (it *TreeIter) Next() (name string, oid Hash, mode uint32, ok bool, err error)

Examples ¶

HistoryScanner

Constants ¶

View Source

const (
	// SmallFileThreshold is 1 MB (1 << 20). Files at or below this size are
	// diffed with the position‑tracking algorithm, which preserves line
	// ordering information for the most accurate output.
	SmallFileThreshold = 1 << 20 // 1 MB

	// MediumFileThreshold is 50 MB (50 << 20). Files larger than
	// SmallFileThreshold and up to this limit are diffed with the memory‑
	// optimized line‑set algorithm.
	MediumFileThreshold = 50 << 20 // 50 MB

	// LargeFileThreshold is 500 MB (500 << 20). Files whose size exceeds
	// MediumFileThreshold and is at or below this limit trigger the hash‑
	// based algorithm, which stores only 64‑bit hashes of each line.
	LargeFileThreshold = 500 << 20 // 500 MB

	// MaxDiffSize is 1 GB (1 << 30). If either blob is larger than this limit
	// computeAddedHunks skips the diff and returns a single placeholder hunk.
	MaxDiffSize = 1 << 30 // 1 GB
)

Size‑selection thresholds used by computeAddedHunks.

View Source

const MaxHdr = 4096

MaxHdr is the maximum number of bytes we are willing to read for a single Git object header. 4096 bytes is generous -- real headers are typically under 32 bytes -- but a fixed upper bound protects against malformed objects consuming unbounded memory.

Variables ¶

View Source

var (
	ErrWindowFull     = errors.New("delta window full: all entries in use")
	ErrObjectTooLarge = errors.New("object too large for window")
)

ErrWindowFull is returned when the delta window cannot accommodate new entries because all existing entries are actively referenced (refCnt > 0) and the memory budget has been exceeded. This prevents unbounded memory growth while respecting active references.

View Source

var (
	ErrAuthorLineNotFound  = errors.New("author line not found")
	ErrMalformedAuthorLine = errors.New("malformed author line: missing '>'")
	ErrMissingEmail        = errors.New("malformed author line: missing email")
	ErrMissingTimestamp    = errors.New("malformed author line: missing timestamp")
)

View Source

var (
	ErrNonMonotonicOffsets     = errors.New("idx corrupt: non‑monotonic offsets")
	ErrObjectExceedsPackBounds = errors.New("object extends past pack trailer")
	ErrPackTrailerCorrupt      = errors.New("pack trailer checksum mismatch")
)

View Source

var (
	ErrNonMonotonicFanout = errors.New("idx corrupt: fan‑out table not monotonic")
	ErrBadIdxChecksum     = errors.New("idx corrupt: checksum mismatch")
)

View Source

var (
	ErrEmptyObjectHeader       = errors.New("empty object header")
	ErrCannotParseObjectHeader = errors.New("cannot parse object header")
)

View Source

var (
	ErrEmptyObject = errors.New("empty object")

	// ErrOfsDeltaBaseRefTooLong is returned when the variable-length backward
	// offset of an ofs-delta object cannot be decoded within 12 continuation
	// bytes. Git's encoding uses 7 payload bits per byte with an MSB
	// continuation flag, so 12 bytes encode at most 84 bits -- far more than
	// needed for any valid pack offset. Exceeding this limit indicates a
	// corrupted or maliciously crafted packfile.
	ErrOfsDeltaBaseRefTooLong = errors.New("ofs-delta base-ref too long")
)

View Source

var (
	ErrObjectNotCommit = errors.New("object is not a commit")
	ErrObjectNotFound  = errors.New("object not found")
)

View Source

var (
	// ErrCorruptTree is returned when a raw tree object's byte layout violates
	// the Git tree format (e.g. missing NUL terminator, truncated SHA, invalid
	// octal mode digit). Callers should treat the enclosing pack or loose file
	// as damaged.
	ErrCorruptTree = errors.New("corrupt tree object")

	// ErrTypeMismatch is returned when an object retrieved from the store has a
	// different type than the caller expected (e.g. a blob where a tree was
	// required).
	ErrTypeMismatch = errors.New("unexpected object type")

	// ErrTreeNotFound is returned when the requested tree OID cannot be
	// located in any pack index or the loose-object directory.
	ErrTreeNotFound = errors.New("tree object not found")
)

View Source

var ErrCommitGraphRequired = errors.New("commit‑graph required but not found")

ErrCommitGraphRequired is kept for backward compatibility. HistoryScanner now always builds commit metadata in memory from ref walks.

View Source

var ErrDeltaTargetTooLarge = errors.New("delta target exceeds configured maximum")

Functions ¶

func GetBuf ¶

func GetBuf() *[]byte

GetBuf retrieves a *[]byte from the shared bufPool. The returned slice has length 0 and capacity MaxHdr (4096).

GetBuf and PutBuf are exported so that downstream packages (e.g. the scanner layer) can borrow header scratch buffers without importing the internal pool directly or allocating per-call.

func OpenForTesting ¶

func OpenForTesting(dir string) (*store, error)

OpenForTesting is a test helper that provides access to the unexported open function. This function should only be used in tests that need direct access to the store functionality. Production code should use NewHistoryScanner instead.

func PutBuf ¶

func PutBuf(buf *[]byte)

PutBuf returns a buffer to the bufPool after resetting its length to 0.

INVARIANT: the slice is truncated to length 0 before being pooled so that the next GetBuf caller receives a clean, zero-length slice. Callers MUST NOT retain a reference to *buf after calling PutBuf.

Types ¶

type AddedHunk ¶

type AddedHunk struct {
	// Lines contains the actual text content of the added lines.
	// Each string represents one line without its trailing newline character.
	// For binary files, this will contain a single element with the raw binary data.
	Lines []string

	// StartLine indicates the 1-based line number where this hunk begins in the new file.
	// Line numbers start at 1 to match standard diff output conventions.
	// For binary files, this is always 1.
	StartLine uint32

	// IsBinary indicates whether this hunk contains binary data.
	// When true, Lines contains the raw binary content as a single string.
	IsBinary bool
}

AddedHunk represents a contiguous block of added lines in a diff. The struct groups consecutive lines that were added to a file, tracking both the content and position of the additions.

func (*AddedHunk) EndLine ¶

func (h *AddedHunk) EndLine() uint32

EndLine returns the 1-based line number of the last line in this hunk. If the hunk contains no lines, EndLine returns the StartLine value. For a hunk starting at line 10 with 3 lines, EndLine returns 12.

type AuthorInfo ¶

type AuthorInfo struct {
	// Name holds the personal name of the commit author exactly as it
	// appears in the Git commit header.
	Name string

	// Email contains the author's e-mail address from the commit
	// header. The value is not validated or normalized.
	Email string

	// When records the author timestamp in Coordinated Universal Time.
	// Consumers should treat it as the authoritative time a change was
	// made, not when it was committed.
	When time.Time
}

AuthorInfo describes the Git author metadata attached to a secret finding. It is a lightweight, immutable value that callers use to display ownership information; it never alters repository content and is safe for concurrent read-only access.

type BlobScanner ¶

type BlobScanner interface {
	ScanBlob(r io.Reader, meta ScanMeta) error
}

BlobScanner consumes blob content plus metadata for each planned blob.

Thread safety: ScanBlob may be called from multiple goroutines concurrently (one per decode worker). Implementations MUST synchronize any shared mutable state internally.

Lifetime: the io.Reader passed to ScanBlob is only valid for the duration of the call. Callers must not retain the Reader or its underlying buffer after ScanBlob returns, because the buffer is recycled via a sync.Pool. If the implementation needs to keep blob data beyond the call, it must copy the bytes.

type CommitMetadata ¶

type CommitMetadata struct {
	// Author records the commit author exactly as stored in the commit header.
	Author AuthorInfo

	// Timestamp holds the committer time in seconds since the Unix epoch.
	Timestamp int64
}

CommitMetadata bundles the author identity and commit timestamp for a single commit.

Instances are immutable and therefore safe for concurrent reads.

type Handle ¶

type Handle struct {
	// contains filtered or unexported fields
}

Handle represents an active reference to cached object data and ensures the underlying entry cannot be evicted while the handle exists.

func (*Handle) Data ¶

func (h *Handle) Data() []byte

Data returns the cached object data associated with this handle.

Lifetime: the returned slice is valid only as long as the Handle has not been released. Once Release() is called, the underlying entry may be evicted and its data buffer reused or garbage-collected. Callers that need the data beyond the Handle's lifetime MUST copy the slice before releasing.

func (*Handle) Release ¶

func (h *Handle) Release()

Release decrements the reference count for this handle's entry and marks the handle as invalid for further use.

Idempotency: Release is safe to call multiple times. After the first call, h.entry and h.w are set to nil, so subsequent calls are no-ops. This makes it safe to defer Release() and also call it explicitly in error paths without risk of double-decrementing the reference count.

After Release returns, the Handle is returned to the sync.Pool for reuse. The caller MUST NOT access h.Data() after calling Release.

func (*Handle) Type ¶

func (h *Handle) Type() ObjectType

Type returns the Git ObjectType associated with the cached data.

type Hash ¶

type Hash [20]byte

Hash represents a raw Git object identifier.

It is the 20-byte binary form of a SHA-1 digest as used by Git internally. The zero value is the all-zero hash, which never resolves to a real object.

Hash also provides a Uint64() method (defined in hash_aligned.go or hash_unaligned.go depending on the target architecture) that returns the first eight bytes as a uint64, useful for hash-map bucketing and sharding.

func ParseHash ¶

func ParseHash(s string) (Hash, error)

ParseHash converts the canonical, 40-character hexadecimal SHA-1 string produced by Git into its raw 20-byte representation.

An error is returned when the input • is not exactly 40 runes long or • cannot be decoded as hexadecimal. The zero Hash value (all zero bytes) never corresponds to a real Git object and is therefore safe to use as a sentinel in maps.

func (Hash) IsZero ¶

func (h Hash) IsZero() bool

IsZero returns true if the hash is the zero value.

func (Hash) String ¶

func (h Hash) String() string

String returns the hexadecimal string representation of the hash. This is the canonical 40-character format used by Git.

func (Hash) Uint64 ¶

func (h Hash) Uint64() uint64

Uint64 returns the first eight bytes of h as a host-native uint64.

This version uses an unsafe pointer cast, which compiles to a single load instruction on aligned architectures. See the file-level comment for safety analysis and the build-tag pairing with hash_unaligned.go.

type HistoryScanner ¶

type HistoryScanner struct {
	// contains filtered or unexported fields
}

HistoryScanner provides read‑only, high‑throughput access to a Git repository's commit history.

It abstracts over commit‑graph files and packfile iteration to expose streaming APIs such as DiffHistoryHunks that deliver results concurrently while holding only a small working set in memory.

Instantiate a HistoryScanner when you need to traverse many commits or compute incremental diffs without materializing full commit objects. The zero value is invalid; use NewHistoryScanner.

Example ¶

ExampleHistoryScanner demonstrates the basic workflow for opening a Git repository, creating a HistoryScanner, and retrieving a raw object by its SHA-1 hash. This is the lowest-level API exposed by the library for direct object access.

scanner, err := NewHistoryScanner("/path/to/repo/.git")
if err != nil {
	fmt.Printf("Error: %v\n", err)
	return
}
defer scanner.Close()

hash, _ := ParseHash("89e5a3e7d8f6c4b2a1e0d9c8b7a6f5e4d3c2b1a0")
data, objType, err := scanner.get(hash)
if err != nil {
	fmt.Printf("Object not found: %v\n", err)
	return
}

fmt.Printf("Object type: %s\n", objType)
fmt.Printf("Object size: %d bytes\n", len(data))

func NewHistoryScanner ¶

func NewHistoryScanner(gitDir string, opts ...ScannerOption) (*HistoryScanner, error)

NewHistoryScanner opens gitDir and returns a HistoryScanner that streams commit data concurrently.

The scanner always builds an in-memory commit graph from a ref walk and does not consume on-disk commit-graph files.

Options can be provided to configure scanner behavior, such as enabling profiling with WithProfiling.

The caller must invoke (*HistoryScanner).Close when finished to free mmap handles and file descriptors.

func (*HistoryScanner) Close ¶

func (hs *HistoryScanner) Close() error

Close releases any mmap handles or file descriptors held by the scanner. It is idempotent; subsequent calls are no‑ops.

func (*HistoryScanner) DiffHistoryHunks ¶

func (hs *HistoryScanner) DiffHistoryHunks() (<-chan HunkAddition, <-chan error)

DiffHistoryHunks streams every added hunk from all commits, diffing each commit against its first parent only (i.e. merge commits are treated as a single diff against the first parent, matching `git log --first-parent` semantics). This keeps output deterministic and avoids duplicate hunks from merge base reconstruction.

It returns two buffered channels: one for HunkAddition values and one for a single error. The function never blocks the caller; all writes to the channels are non-blocking.

Goroutine ownership: DiffHistoryHunks spawns a background goroutine that owns the returned channels and closes them when the walk completes. The caller MUST drain the HunkAddition channel to completion (or read until the errC channel delivers a value) to avoid leaking goroutines. Failing to drain will block the internal worker pool indefinitely.

The HunkAddition channel is buffered to runtime.NumCPU() to allow workers to make progress without waiting for the consumer on every hunk. The errC channel is buffered to 1 so the producer goroutine can always send its final error without blocking.

A nil error sent on errC signals a graceful end-of-stream.

func (*HistoryScanner) GetCommitMetadata ¶

func (s *HistoryScanner) GetCommitMetadata(oid Hash) (CommitMetadata, error)

GetCommitMetadata returns (and caches) the commit's author and timestamp.

func (*HistoryScanner) Scan ¶

func (hs *HistoryScanner) Scan(seen SeenSet, scanner BlobScanner) error

Scan runs the scanning strategy selected by the scanner's current ScanMode.

Blob mode (ScanModeBlob, the default) is the recommended path for secret scanning. It visits every unique blob exactly once, in pack-offset order, and passes its full content to scanner.ScanBlob.

Hunk mode (ScanModeHunks) diffs each commit against its parent and yields only the added lines. It is retained for backward compatibility.

func (*HistoryScanner) ScanMode ¶

func (hs *HistoryScanner) ScanMode() ScanMode

ScanMode returns the scanner's currently configured scan mode.

func (*HistoryScanner) SetMaxDeltaDepth ¶

func (hs *HistoryScanner) SetMaxDeltaDepth(depth int)

SetMaxDeltaDepth sets the maximum number of delta hops while materializing objects.

func (*HistoryScanner) SetMaxDeltaObjectSize ¶

func (hs *HistoryScanner) SetMaxDeltaObjectSize(maxBytes uint64)

SetMaxDeltaObjectSize bounds reconstructed delta targets in bytes. Passing zero disables the bound.

func (*HistoryScanner) SetScanMode ¶

func (hs *HistoryScanner) SetScanMode(mode ScanMode)

SetScanMode updates the scanner's scan mode for subsequent Scan calls.

Thread safety: SetScanMode is not safe for concurrent use with Scan. The caller must ensure no Scan is in progress when changing the mode.

func (*HistoryScanner) SetVerifyCRC ¶

func (hs *HistoryScanner) SetVerifyCRC(verify bool)

SetVerifyCRC enables or disables CRC‑32 verification on all object reads.

type HunkAddition ¶

type HunkAddition struct {
	// contains filtered or unexported fields
}

HunkAddition describes a contiguous block of added lines introduced by a commit.

Values are streamed by HistoryScanner.DiffHistoryHunks and can be consumed concurrently by the caller.

func (*HunkAddition) Commit ¶

func (h *HunkAddition) Commit() Hash

Commit returns the commit that introduced the hunk.

func (*HunkAddition) EndLine ¶

func (h *HunkAddition) EndLine() int

EndLine returns the last line number (1‑based) of the hunk.

func (*HunkAddition) IsBinary ¶

func (h *HunkAddition) IsBinary() bool

IsBinary returns whether this hunk contains binary data.

func (*HunkAddition) Lines ¶

func (h *HunkAddition) Lines() []string

Lines returns all added lines without leading '+' markers.

func (*HunkAddition) Path ¶

func (h *HunkAddition) Path() string

Path returns the file to which the hunk was added, using forward‑slash separators.

func (*HunkAddition) StartLine ¶

func (h *HunkAddition) StartLine() int

StartLine returns the first line number (1‑based) of the hunk.

func (*HunkAddition) String ¶

func (h *HunkAddition) String() string

String returns a human‑readable representation.

type ObjectCache ¶

type ObjectCache interface {
	// Get returns the cached object associated with key and a boolean
	// that reports whether the entry was found.
	// Get must be safe for concurrent use.
	Get(key Hash) (cachedObj, bool)

	// Add stores value under key, potentially evicting other entries
	// according to the cache’s replacement policy.
	// Add must be safe for concurrent use.
	Add(key Hash, value cachedObj)

	// Purge removes all entries from the cache and frees any
	// associated resources.
	// Purge is typically called when a Store is closed or when the
	// caller wants to reclaim memory immediately.
	Purge()
}

ObjectCache defines a pluggable, in-memory cache for Git objects. Callers supply an ObjectCache to tune memory usage or swap in custom eviction strategies while interacting with objstore.Store. The cache is consulted on every object read, so an efficient implementation can dramatically reduce decompression work and I/O.

func NewARCCache ¶

func NewARCCache(size int) (ObjectCache, error)

NewARCCache creates a new ARC cache with the specified size and returns it as an ObjectCache.

type ObjectType ¶

type ObjectType byte

ObjectType enumerates the kinds of Git objects that can appear in a pack or loose-object store.

The zero value, ObjBad, denotes an invalid or unknown object type. The String method returns the canonical, lower-case Git spelling.

const (
	// ObjBad represents an invalid or unspecified object kind.
	ObjBad ObjectType = iota // 0

	// ObjCommit is a regular commit object.
	ObjCommit // 1

	// ObjTree is a directory tree object describing the hierarchy of a commit.
	ObjTree // 2

	// ObjBlob is a file-content blob object.
	ObjBlob // 3

	// ObjTag is an annotated tag object.
	ObjTag // 4

	// ObjOfsDelta is a delta object whose base is addressed by packfile offset.
	ObjOfsDelta // 6

	// ObjRefDelta is a delta object whose base is addressed by object ID.
	ObjRefDelta // 7
)

INVARIANT: The iota values below MUST match Git's internal 3-bit type encoding stored in packfile object headers (bits 6-4 of the first byte). Changing the order or inserting new constants will break header parsing in parseObjectHeaderUnsafe and every caller that casts a raw header nibble to ObjectType.

func (ObjectType) String ¶

func (t ObjectType) String() string

type Parents ¶

type Parents = map[Hash][]Hash

Parents maps each commit OID to the OIDs of its immediate parents.

The alias exists so callers can use a descriptive name instead of the more verbose map literal when working with parent relationships returned from LoadCommitGraph.

type ProfilingConfig ¶

type ProfilingConfig struct {
	// EnableProfiling starts an HTTP server with pprof endpoints.
	// When true, users can capture profiles using curl or go tool pprof.
	EnableProfiling bool

	// ProfileAddr specifies the address for the profiling HTTP server.
	// Defaults to ":6060" if empty.
	// Use "localhost:6060" to restrict to local access.
	ProfileAddr string

	// Trace enables execution tracing for the duration of the scan.
	// The trace is written to TraceOutputPath.
	Trace bool

	// TraceOutputPath specifies where to write the execution trace.
	// Defaults to "./trace.out" if empty and Trace is true.
	TraceOutputPath string
}

ProfilingConfig specifies profiling options for the scanner.

When provided to NewHistoryScanner via WithProfiling option, it starts an HTTP server with pprof endpoints for on-demand profiling.

type ScanError ¶

type ScanError struct {
	// FailedCommits maps each problematic commit OID to the error encountered
	// while decoding it.
	FailedCommits map[Hash]error
}

ScanError reports commits that failed to parse during a packfile scan.

The error is non‑fatal; callers decide whether the missing commits are relevant for their workflow.

func (*ScanError) Error ¶

func (e *ScanError) Error() string

Error implements the error interface.

type ScanJob ¶

type ScanJob struct {
	// Blob is the object ID of the blob to scan.
	Blob Hash

	// Commit is the object ID of the commit that introduced or modified
	// this blob. Used for attribution in scan results.
	Commit Hash

	// Path is the repository-relative file path (forward-slash separated)
	// where this blob appears in the commit's tree.
	Path string

	// Pack points to the memory-mapped packfile that contains this blob.
	// Nil for loose objects (which are not represented as ScanJobs).
	Pack *mmap.ReaderAt

	// Offset is the byte offset within Pack where this blob's packfile
	// entry begins. Jobs are sorted by ascending Offset before execution
	// to achieve sequential I/O access patterns and benefit from OS
	// read-ahead, which is critical for performance on spinning disks.
	Offset uint64
}

ScanJob is one unit of blob-centric scan work, representing a single blob that needs to be decoded and passed to a BlobScanner.

type ScanMeta ¶

type ScanMeta struct {
	// Blob is the object ID of the blob being scanned.
	Blob Hash

	// Commit is the object ID of the commit that introduced this blob change.
	Commit Hash

	// Path is the repository-relative file path (forward-slash separated).
	Path string
}

ScanMeta carries attribution and identity metadata that is passed alongside blob content to a BlobScanner. It identifies *which* blob is being scanned, *which* commit introduced it, and *where* in the tree it appeared.

type ScanMode ¶

type ScanMode uint8

ScanMode selects the high-level scanning strategy used by HistoryScanner.Scan.

const (
	// ScanModeBlob scans full blob objects, deduplicating by OID and visiting
	// them in pack-file offset order. Pack-sorted iteration minimizes random
	// I/O because entries stored contiguously in the pack are read
	// sequentially, which is especially beneficial on spinning disks and
	// over NFS.
	ScanModeBlob ScanMode = iota

	// ScanModeHunks is the legacy scanning mode that computes parent-child
	// diffs for every commit and yields only the added-line hunks. It exists
	// for backward compatibility with callers that need line-level
	// granularity. Prefer ScanModeBlob for new integrations because it
	// avoids the overhead of diff computation and tree comparison.
	ScanModeHunks
)

func (ScanMode) String ¶

func (m ScanMode) String() string

type ScannerOption ¶

type ScannerOption func(*HistoryScanner)

ScannerOption is a function that configures a HistoryScanner during construction. Options are applied in the order they are passed to NewHistoryScanner; later options therefore override earlier ones when they touch the same field.

func WithProfiling ¶

func WithProfiling(config *ProfilingConfig) ScannerOption

WithProfiling returns a ScannerOption that enables profiling with the given configuration.

Example:

scanner, err := NewHistoryScanner(gitDir,
    WithProfiling(&ProfilingConfig{
        EnableProfiling: true,
        ProfileAddr:     ":6060",
        Trace:           true,
        TraceOutputPath: "./trace.out",
    }),
)

func WithScanMode ¶

func WithScanMode(mode ScanMode) ScannerOption

WithScanMode configures the default mode used by HistoryScanner.Scan.

type SeenSet ¶

type SeenSet interface {
	Has(oid Hash) (bool, error)
	Put(oid Hash) error
}

SeenSet tracks globally scanned blobs across multiple scan runs.

An implementation MUST be safe for concurrent use by multiple goroutines because the scan pipeline may call Has and Put from parallel decode workers.

Contract:

Has returns (true, nil) if the blob OID has been recorded by a prior Put call in this or any previous scan run.
Put records the blob OID so that future Has calls return true.
Errors from Has or Put abort the current scan immediately.

Typical implementations back SeenSet with a persistent store (e.g. a Bloom filter backed by disk) so that incremental scans can skip blobs processed in earlier invocations.

type TreeIter ¶

type TreeIter struct {
	// contains filtered or unexported fields
}

TreeIter provides a zero-allocation, forward-only iterator over the entries of a raw Git tree object.

Callers create a TreeIter through Store.TreeIter or, internally, through treeCache.iter. After creation, call Next repeatedly until it returns ok == false. The iterator keeps a reference to the caller-supplied raw tree bytes and advances through that slice in place, so it never allocates or copies entry data except for the 20-byte object IDs it reports. Each TreeIter instance must therefore remain confined to the goroutine that consumes it, and the underlying byte slice must stay immutable for the life of the iterator.

func (*TreeIter) Next ¶

func (it *TreeIter) Next() (name string, oid Hash, mode uint32, ok bool, err error)

Next parses and returns the next entry in the raw Git tree.

It yields the entry's file name, object ID, and file mode. When ok is false the iterator has been exhausted and, by convention, err is io.EOF. Any malformed input results in ok == false and a non-nil err, typically ErrCorruptTree.

The returned name is produced via btostr and therefore shares the backing array of the original raw slice. The string is valid only as long as the raw slice is alive and unmodified; callers that need to retain the name past the iterator's lifetime must copy it.

The iterator keeps a slice pointing at the original raw buffer; callers must therefore ensure that the underlying slice is not mutated while iteration is in progress. Next is not safe for concurrent use; each TreeIter instance must be confined to a single goroutine.

Directories ¶

Path	Synopsis
examples
comparison command Package main compares go-gitpack's blob scanning against the git log commands used by gitleaks and trufflehog for full-history scanning.	Package main compares go-gitpack's blob scanning against the git log commands used by gitleaks and trufflehog for full-history scanning.
debug_simple command Package main is a minimal debugging example that opens a Git repository, creates a HistoryScanner, and performs a streaming blob scan that counts every blob reachable from the commit history.	Package main is a minimal debugging example that opens a Git repository, creates a HistoryScanner, and performs a streaming blob scan that counts every blob reachable from the commit history.
history_scan command Package main demonstrates streaming blob scanning from a Git repository using the go-gitpack library.	Package main demonstrates streaming blob scanning from a Git repository using the go-gitpack library.
profiling command Package main demonstrates how to use profiling with go-gitpack to diagnose memory and CPU performance issues when scanning large repositories.	Package main demonstrates how to use profiling with go-gitpack to diagnose memory and CPU performance issues when scanning large repositories.
simple_streaming command Package main demonstrates simple streaming blob scanning using the go-gitpack library.	Package main demonstrates simple streaming blob scanning using the go-gitpack library.

?	: This menu
/	: Search site
f or F	: Jump to
y or Y	: Canonical URL

README ¶

go-gitpack

Overview

Usage

Scan every unique blob (recommended)

Documentation ¶

Overview ¶

Overview ¶

Quick start ¶

Implementation ¶

Index ¶

Examples ¶

Constants ¶

Variables ¶

Functions ¶

func GetBuf ¶

func OpenForTesting ¶

func PutBuf ¶

Types ¶

type AddedHunk ¶

func (*AddedHunk) EndLine ¶

type AuthorInfo ¶

type BlobScanner ¶

type CommitMetadata ¶

type Handle ¶

func (*Handle) Data ¶

func (*Handle) Release ¶

func (*Handle) Type ¶

type Hash ¶

func ParseHash ¶

func (Hash) IsZero ¶

func (Hash) String ¶

func (Hash) Uint64 ¶

type HistoryScanner ¶

func NewHistoryScanner ¶

func (*HistoryScanner) Close ¶

func (*HistoryScanner) DiffHistoryHunks ¶

func (*HistoryScanner) GetCommitMetadata ¶

func (*HistoryScanner) Scan ¶

func (*HistoryScanner) ScanMode ¶

func (*HistoryScanner) SetMaxDeltaDepth ¶

func (*HistoryScanner) SetMaxDeltaObjectSize ¶

func (*HistoryScanner) SetScanMode ¶

func (*HistoryScanner) SetVerifyCRC ¶

type HunkAddition ¶

func (*HunkAddition) Commit ¶

func (*HunkAddition) EndLine ¶

func (*HunkAddition) IsBinary ¶

func (*HunkAddition) Lines ¶

func (*HunkAddition) Path ¶

func (*HunkAddition) StartLine ¶

func (*HunkAddition) String ¶

type ObjectCache ¶

func NewARCCache ¶

type ObjectType ¶

func (ObjectType) String ¶

type Parents ¶

type ProfilingConfig ¶

type ScanError ¶

func (*ScanError) Error ¶

type ScanJob ¶

type ScanMeta ¶

type ScanMode ¶

func (ScanMode) String ¶

type ScannerOption ¶

func WithProfiling ¶

func WithScanMode ¶

type SeenSet ¶

type TreeIter ¶

func (*TreeIter) Next ¶

Source Files ¶

Directories ¶