Documentation ¶
Overview ¶
Package corpus extracts plain-text turns from Claude Code session transcripts (`~/.claude/projects/<project>/<sessionUUID>.jsonl`).
Each JSONL file holds one session, and each line is one record. Only records with type=user or type=assistant are extracted; everything else (file-history-snapshot, system, queue-operation, sidechain) is skipped.
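The filtering rule above can be sketched as follows. This is a minimal illustration, not the package's implementation: the `record` struct, `countByType`, and `countString` are hypothetical names, and the only thing assumed about the transcript format is that each line is a JSON object with a top-level "type" key.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// record holds only the field needed for filtering; all other keys on the
// line are ignored by encoding/json.
type record struct {
	Type string `json:"type"`
}

// countByType streams r line by line and tallies records whose type is
// "user" or "assistant". Every other record type is counted as skipped.
func countByType(r io.Reader) (map[string]int, int, error) {
	kept := map[string]int{}
	skipped := 0
	sc := bufio.NewScanner(r)
	// Transcript lines can be long; raise the scanner's default 64 KiB cap.
	sc.Buffer(make([]byte, 0, 64*1024), 16*1024*1024)
	for sc.Scan() {
		line := bytes.TrimSpace(sc.Bytes())
		if len(line) == 0 {
			continue // tolerate blank lines
		}
		var rec record
		if err := json.Unmarshal(line, &rec); err != nil {
			return nil, 0, fmt.Errorf("decode line: %w", err)
		}
		switch rec.Type {
		case "user", "assistant":
			kept[rec.Type]++
		default:
			skipped++
		}
	}
	return kept, skipped, sc.Err()
}

// countString is a convenience wrapper for in-memory input.
func countString(s string) (map[string]int, int, error) {
	return countByType(strings.NewReader(s))
}

func main() {
	input := `{"type":"user"}
{"type":"system"}
{"type":"assistant"}
{"type":"file-history-snapshot"}`
	kept, skipped, err := countString(input)
	if err != nil {
		panic(err)
	}
	fmt.Println(kept["user"], kept["assistant"], skipped) // prints 1 1 2
}
```

Because the scanner holds one line at a time, memory stays proportional to the longest record rather than to the whole file.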
Index ¶
- func Extract(path string, fn func(Turn) error) error
- func HashFile(path string) (string, int64, time.Time, error)
- func ListJSONLFiles(path string) ([]string, error)
- type SourceFile
- type Store
- func (s *Store) Close() error
- func (s *Store) CountTurns(ctx context.Context) (int, error)
- func (s *Store) HasSourceFile(ctx context.Context, sf SourceFile) (bool, error)
- func (s *Store) IterateTurns(ctx context.Context, fn func(StoredTurn) error) error
- func (s *Store) PutSourceFile(ctx context.Context, sf SourceFile) error
- func (s *Store) PutTurn(ctx context.Context, t Turn) error
- func (s *Store) SnapshotHash(ctx context.Context) (string, error)
- type StoredTurn
- type Turn
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func Extract ¶
Extract walks `path` (a file or a directory) and invokes fn for every extracted Turn, in source order. For a directory, all *.jsonl files are processed in lexicographic order. Each file is streamed line by line, so memory use is bounded by a single turn rather than by the size of the corpus.
func HashFile ¶
HashFile returns the SHA-256 hex digest of a file's full contents. It is a helper for computing the SourceFile dedup key.
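A streaming hash of this kind might look like the sketch below. `hashFile` and `hashTempContent` are hypothetical stand-ins, and treating the unnamed int64 and time.Time results as the file's byte count and modification time is an assumption based on the SourceFile fields, not something the documentation states.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"io"
	"os"
	"time"
)

// hashFile streams the file through SHA-256 and returns the hex digest.
// Assumption: the int64 is the number of bytes read and the time.Time is
// the file's mtime.
func hashFile(path string) (string, int64, time.Time, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", 0, time.Time{}, err
	}
	defer f.Close()

	h := sha256.New()
	n, err := io.Copy(h, f) // constant memory regardless of file size
	if err != nil {
		return "", 0, time.Time{}, err
	}
	info, err := f.Stat()
	if err != nil {
		return "", 0, time.Time{}, err
	}
	return hex.EncodeToString(h.Sum(nil)), n, info.ModTime(), nil
}

// hashTempContent writes content to a temporary file and hashes it.
func hashTempContent(content string) (string, int64, error) {
	f, err := os.CreateTemp("", "hash-*.jsonl")
	if err != nil {
		return "", 0, err
	}
	defer os.Remove(f.Name())
	if _, err := f.WriteString(content); err != nil {
		f.Close()
		return "", 0, err
	}
	if err := f.Close(); err != nil {
		return "", 0, err
	}
	sum, size, _, err := hashFile(f.Name())
	return sum, size, err
}

func main() {
	sum, size, err := hashTempContent("abc")
	if err != nil {
		panic(err)
	}
	fmt.Println(sum, size)
}
```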
func ListJSONLFiles ¶
ListJSONLFiles returns the *.jsonl files Extract would process for path (file or dir). Useful for sizing progress bars in advance.
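The file-or-directory behaviour and the lexicographic ordering can be illustrated with a short sketch. `listJSONL` and `demoNames` are hypothetical names. The sketch assumes a single file is returned as-is and that directories are not searched recursively; the documentation does not say either way.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// listJSONL returns path itself when it is a file, or the *.jsonl entries
// directly inside it when it is a directory, sorted lexicographically.
func listJSONL(path string) ([]string, error) {
	info, err := os.Stat(path)
	if err != nil {
		return nil, err
	}
	if !info.IsDir() {
		return []string{path}, nil
	}
	entries, err := os.ReadDir(path)
	if err != nil {
		return nil, err
	}
	var files []string
	for _, e := range entries {
		if e.IsDir() || !strings.HasSuffix(e.Name(), ".jsonl") {
			continue
		}
		files = append(files, filepath.Join(path, e.Name()))
	}
	sort.Strings(files) // explicit, so the order does not depend on ReadDir
	return files, nil
}

// demoNames builds a throwaway directory and lists it.
func demoNames() ([]string, error) {
	dir, err := os.MkdirTemp("", "corpus-demo-")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(dir)
	for _, name := range []string{"b.jsonl", "a.jsonl", "notes.txt"} {
		if err := os.WriteFile(filepath.Join(dir, name), []byte("{}\n"), 0o644); err != nil {
			return nil, err
		}
	}
	files, err := listJSONL(dir)
	if err != nil {
		return nil, err
	}
	names := make([]string, len(files))
	for i, f := range files {
		names[i] = filepath.Base(f)
	}
	return names, nil
}

func main() {
	names, err := demoNames()
	if err != nil {
		panic(err)
	}
	fmt.Println(names) // prints [a.jsonl b.jsonl]
}
```

The length of the returned slice is what a caller would use as the total for a progress bar before starting extraction.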
Types ¶
type SourceFile ¶
type SourceFile struct {
	Path        string
	ModTime     time.Time
	SizeBytes   int64
	ContentHash string
	IngestedAt  time.Time
}
SourceFile is the dedup unit for ingest: a file whose path, mtime, and SHA-256 content hash are all unchanged is skipped on a re-run.
type Store ¶
type Store struct {
	// contains filtered or unexported fields
}
Store wraps corpus.sqlite, the per-dataset record of which source JSONL files have been seen and which turns we extracted from them.
func (*Store) CountTurns ¶
CountTurns returns the total number of turn rows.
func (*Store) HasSourceFile ¶
HasSourceFile reports whether the (path, mtime, sha) tuple has already been ingested. Used to short-circuit re-ingest of unchanged JSONL files.
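The short-circuit can be pictured with an in-memory stand-in for the store. Everything in this sketch (`sourceKey`, `memStore`, `ingestOnce`, `demo`) is hypothetical; the real Store is backed by corpus.sqlite and its schema is not shown in this documentation.

```go
package main

import (
	"fmt"
	"time"
)

// sourceKey is the (path, mtime, sha) tuple. The mtime is stored as Unix
// nanoseconds so the struct is safely comparable and usable as a map key.
type sourceKey struct {
	Path        string
	ModTimeUnix int64
	ContentHash string
}

// memStore is an in-memory stand-in for the SQLite-backed Store.
type memStore struct {
	seen map[sourceKey]bool
}

func newMemStore() *memStore {
	return &memStore{seen: map[sourceKey]bool{}}
}

func (m *memStore) has(k sourceKey) bool { return m.seen[k] }
func (m *memStore) put(k sourceKey)      { m.seen[k] = true }

// ingestOnce reports true when the file was ingested and false when it was
// skipped because the same tuple had already been recorded.
func (m *memStore) ingestOnce(path string, mtime time.Time, sha string) bool {
	k := sourceKey{Path: path, ModTimeUnix: mtime.UnixNano(), ContentHash: sha}
	if m.has(k) {
		return false
	}
	// Turn extraction and storage would happen here.
	m.put(k)
	return true
}

// demo ingests a new file, retries it unchanged, then retries it modified.
func demo() []bool {
	s := newMemStore()
	t0 := time.Unix(1700000000, 0)
	return []bool{
		s.ingestOnce("a.jsonl", t0, "aaa"),
		s.ingestOnce("a.jsonl", t0, "aaa"),
		s.ingestOnce("a.jsonl", t0.Add(time.Minute), "bbb"),
	}
}

func main() {
	fmt.Println(demo()) // prints [true false true]
}
```

Recording the source file only after its turns are stored means an interrupted run is retried rather than silently skipped; whether the package orders its writes this way is not stated.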
func (*Store) IterateTurns ¶
IterateTurns calls fn for every turn in chronological order (by timestamp). Rows are streamed, so it is safe for large corpora.
func (*Store) PutSourceFile ¶
func (s *Store) PutSourceFile(ctx context.Context, sf SourceFile) error
PutSourceFile records that a file was ingested.
type StoredTurn ¶
type StoredTurn struct {
	ID        int64
	UUID      string
	SessionID string
	Role      string
	Text      string
	Timestamp time.Time
	Source    string
}
StoredTurn mirrors corpus.Turn but adds a stable per-dataset ID.
type Turn ¶
type Turn struct {
	Role       string    // "user" | "assistant"
	Text       string    // concatenated text content (no thinking, no tool I/O)
	Timestamp  time.Time // wall-clock from the JSONL record
	SessionID  string
	UUID       string
	ParentUUID string
	Sidechain  bool
	SourceFile string
	LineNumber int
	GitBranch  string
}
Turn is one extracted user or assistant turn, ready for chunking and embedding. SourceFile and LineNumber together identify the originating JSONL line for traceability.
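How Text might be assembled is sketched below. The shape of the message content is an assumption about the transcript format: it is taken to be either a JSON string or an array of blocks with a "type" key, where only "text" blocks carry plain text. The names `block`, `textOf`, and `textOfString` are hypothetical, and the newline separator is a guess, since the documentation only says the text is concatenated.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// block is one element of a content array. Field names are assumptions.
type block struct {
	Type string `json:"type"`
	Text string `json:"text"`
}

// textOf returns the plain text of a raw content value. A JSON string is
// returned as-is. For an array, only blocks of type "text" are kept, which
// drops thinking and tool input/output blocks.
func textOf(raw json.RawMessage) (string, error) {
	var s string
	if err := json.Unmarshal(raw, &s); err == nil {
		return s, nil
	}
	var blocks []block
	if err := json.Unmarshal(raw, &blocks); err != nil {
		return "", fmt.Errorf("content is neither a string nor a block array: %w", err)
	}
	var parts []string
	for _, b := range blocks {
		if b.Type == "text" && b.Text != "" {
			parts = append(parts, b.Text)
		}
	}
	return strings.Join(parts, "\n"), nil
}

// textOfString is a convenience wrapper for string input.
func textOfString(s string) (string, error) {
	return textOf(json.RawMessage(s))
}

func main() {
	content := `[{"type":"thinking","thinking":"hmm"},
{"type":"text","text":"Hello"},
{"type":"tool_use","name":"Bash"},
{"type":"text","text":"Done."}]`
	text, err := textOfString(content)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%q\n", text) // prints "Hello\nDone."
}
```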