corpus

package
v0.2.0
Published: Apr 30, 2026 License: MIT Imports: 16 Imported by: 0

Documentation

Overview

Package corpus extracts plain-text turns from Claude Code session transcripts (`~/.claude/projects/<project>/<sessionUUID>.jsonl`).

Each JSONL file holds one session, and each line of the file is one record. Only records with type=user or type=assistant are extracted; everything else (file-history-snapshot, system, queue-operation, sidechain) is skipped.
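
The record filter described above can be sketched with the standard library alone. This is an illustrative sketch, not the package's implementation: the `record` type and `keepRecord` function are hypothetical names, and treating malformed lines as skippable is an assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// record captures only the field needed to decide whether a line is kept.
// The "type" key comes from the overview; all other keys are ignored.
type record struct {
	Type string `json:"type"`
}

// keepRecord reports whether a JSONL line is a user or assistant record.
// Malformed lines are treated as skippable, which is an assumption.
func keepRecord(line []byte) bool {
	var r record
	if err := json.Unmarshal(line, &r); err != nil {
		return false
	}
	return r.Type == "user" || r.Type == "assistant"
}

func main() {
	lines := []string{
		`{"type":"user"}`,
		`{"type":"file-history-snapshot"}`,
		`{"type":"assistant"}`,
	}
	for _, l := range lines {
		fmt.Println(keepRecord([]byte(l)), l)
	}
}
```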

Functions

func Extract

func Extract(path string, fn func(Turn) error) error

Extract walks `path` (file or dir) and invokes fn for each extracted Turn, in source order. For a directory, all *.jsonl files are processed in lexicographic order. Each file is streamed line by line, so memory use is bounded by a single turn rather than by the whole corpus.

func HashFile

func HashFile(path string) (string, int64, time.Time, error)

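HashFile returns the SHA-256 hex digest of a file's full contents. It is a helper for building the SourceFile dedup key. The int64 and time.Time results are not described here; judging by the SourceFile fields, they are most likely the file's size in bytes and its modification time.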

func ListJSONLFiles

func ListJSONLFiles(path string) ([]string, error)

ListJSONLFiles returns the *.jsonl files Extract would process for path (file or dir). Useful for sizing progress bars in advance.

Types

type SourceFile

type SourceFile struct {
	Path        string
	ModTime     time.Time
	SizeBytes   int64
	ContentHash string
	IngestedAt  time.Time
}

SourceFile is the dedup unit for ingest: a file whose path, modification time, and content hash are all unchanged is skipped on a re-run.

type Store

type Store struct {
	// contains filtered or unexported fields
}

Store wraps corpus.sqlite, the per-dataset record of which source JSONL files have been seen and which turns we extracted from them.

func OpenStore

func OpenStore(path string) (*Store, error)

OpenStore opens or creates the corpus database at path.

func (*Store) Close

func (s *Store) Close() error

Close releases the database handle.

func (*Store) CountTurns

func (s *Store) CountTurns(ctx context.Context) (int, error)

CountTurns returns the total number of turn rows.

func (*Store) HasSourceFile

func (s *Store) HasSourceFile(ctx context.Context, sf SourceFile) (bool, error)

HasSourceFile reports whether the (path, mtime, sha) tuple has already been ingested. Used to short-circuit re-ingest of unchanged JSONL files.

func (*Store) IterateTurns

func (s *Store) IterateTurns(ctx context.Context, fn func(StoredTurn) error) error

IterateTurns calls fn for every turn in chronological (ts) order. Streamed; safe for large corpora.

func (*Store) PutSourceFile

func (s *Store) PutSourceFile(ctx context.Context, sf SourceFile) error

PutSourceFile records that a file was ingested.

func (*Store) PutTurn

func (s *Store) PutTurn(ctx context.Context, t Turn) error

PutTurn inserts a turn record. Duplicate (uuid, source_file) is a no-op.

func (*Store) SnapshotHash

func (s *Store) SnapshotHash(ctx context.Context) (string, error)

SnapshotHash returns a stable hash of the corpus contents (sorted turn UUIDs). Used in manifests to detect corpus drift.
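
One way to obtain an order-independent hash over turn UUIDs is to sort them before hashing. The sketch below shows the idea only; `snapshotHashSketch` is a hypothetical function, and the choice of SHA-256 and a newline separator are assumptions, so its output will not necessarily match SnapshotHash.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// snapshotHashSketch hashes a sorted copy of the UUIDs, one per line.
// The newline separator and the use of SHA-256 are assumptions.
func snapshotHashSketch(uuids []string) string {
	sorted := append([]string(nil), uuids...) // do not mutate the caller's slice
	sort.Strings(sorted)
	h := sha256.New()
	for _, u := range sorted {
		h.Write([]byte(u))
		h.Write([]byte{'\n'})
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	a := snapshotHashSketch([]string{"u2", "u1", "u3"})
	b := snapshotHashSketch([]string{"u3", "u2", "u1"})
	c := snapshotHashSketch([]string{"u1", "u2"})
	fmt.Println("same set, different order:", a == b)
	fmt.Println("different set:", a == c)
}
```

Sorting first means the hash depends only on which turns are present, not on the order in which they were inserted, which is what makes it usable for drift detection.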

type StoredTurn

type StoredTurn struct {
	ID        int64
	UUID      string
	SessionID string
	Role      string
	Text      string
	Timestamp time.Time
	Source    string
}

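StoredTurn mirrors the core fields of corpus.Turn (UUID, SessionID, Role, Text, Timestamp, and the originating file as Source) and adds a stable per-dataset ID. It does not carry ParentUUID, Sidechain, LineNumber, or GitBranch.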

type Turn

type Turn struct {
	Role       string    // "user" | "assistant"
	Text       string    // concatenated text content (no thinking, no tool I/O)
	Timestamp  time.Time // wall-clock from the JSONL record
	SessionID  string
	UUID       string
	ParentUUID string
	Sidechain  bool
	SourceFile string
	LineNumber int
	GitBranch  string
}

Turn is one extracted user or assistant turn ready for chunking + embedding. SourceFile + LineNumber identify the JSONL line for traceability.
