realworld

package
v0.1.3 Latest
Warning

This package is not in the latest version of its module.

Published: Jun 13, 2026 License: Apache-2.0 Imports: 13 Imported by: 0

Documentation

Overview

Package realworld is the real-world ingest subsystem. It exports the public metadata and git history of a handful of the largest GitHub-native repositories into a pinned, normalized corpus, then seeds that corpus into a Githome instance so the read, write, search, git, and event paths can be exercised at a scale the small development fixture never reaches.

The subsystem has two stages that meet at one on-disk format, the Corpus:

  • Stage A (export) turns a live source — a git mirror, the public GraphQL API, the GH Archive event stream — into a normalized Corpus written to a snapshot directory, pinned by a manifest so a later upstream change cannot silently move the numbers.
  • Stage B (seed) reads a Corpus snapshot and writes it into a target store and git store through the bulk-seed write path, preserving the real numbers and timestamps.

Stage A needs network and, for the bulk API, credentials, so it is not exercised in unit tests; Stage B runs entirely against a local snapshot and a SQLite store, so the seeding, pseudonymization, replay, and capture logic is fully testable on the small fixtures checked in under testdata.


Constants

const ManifestName = "realworld-manifest.json"

ManifestName is the manifest filename at the root of a snapshot directory.

const ManifestSchema = 1

ManifestSchema is the manifest format version, bumped when the manifest shape changes so an old reader refuses a corpus it cannot interpret.

Variables

var DefaultReactorPool = ReactorPool{Size: 200, Seed: 0x6e7e}

DefaultReactorPool is the reactor pool a corpus uses unless a manifest overrides it: 200 synthetic reactors and a fixed assignment seed (0x6e7e).

var ErrRequiresNetwork = errors.New("realworld: this export source requires network access and credentials")

ErrRequiresNetwork is returned by an exporter whose source needs network or credentials that are not configured, so a caller in an offline or unit-test environment gets a clear signal rather than a silent empty corpus.

Functions

func ExportToSnapshot

func ExportToSnapshot(ctx context.Context, ex Exporter, refs []RepoRef, m *Manifest, root string) error

ExportToSnapshot runs an exporter over every repo in refs, writes each corpus into the snapshot root, and fills the manifest with the measured row counts and the provenance. A repo whose source is unreachable (ErrRequiresNetwork) is recorded as a drop rather than failing the whole run, so an offline build produces an honest partial snapshot the manifest names as partial. The manifest the caller passes is updated in place and saved at the root.
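
The drop-instead-of-fail behavior can be sketched in isolation. The following is a minimal illustration, not the package's implementation: errRequiresNetwork, dropNote, exportFunc, and the repo names are local, hypothetical stand-ins, and treating every other error as fatal is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// errRequiresNetwork and dropNote are local stand-ins for the package's
// ErrRequiresNetwork and DropNote.
var errRequiresNetwork = errors.New("export source requires network access")

type dropNote struct {
	What, Reason string
}

// exportFunc is a hypothetical, reduced form of Exporter.Export that only
// reports a row count.
type exportFunc func(repo string) (rows int, err error)

// exportAll exports every repo. An unreachable source becomes a drop note;
// any other error aborts the run (an assumption of this sketch).
func exportAll(export exportFunc, repos []string) (map[string]int, []dropNote, error) {
	rows := make(map[string]int)
	var dropped []dropNote
	for _, repo := range repos {
		n, err := export(repo)
		switch {
		case errors.Is(err, errRequiresNetwork):
			dropped = append(dropped, dropNote{What: repo, Reason: err.Error()})
		case err != nil:
			return nil, nil, fmt.Errorf("export %s: %w", repo, err)
		default:
			rows[repo] = n
		}
	}
	return rows, dropped, nil
}

// offlineExport is a hypothetical exporter that can serve only one repo.
func offlineExport(repo string) (int, error) {
	if repo == "octo/fixture" {
		return 12, nil
	}
	return 0, errRequiresNetwork
}

func main() {
	rows, dropped, err := exportAll(offlineExport, []string{"octo/fixture", "octo/remote"})
	fmt.Println(rows, dropped, err)
}
```

The partial result still names the repo it could not reach, which is the property the manifest's Dropped list is described as preserving.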

func SelectSampleNumbers

func SelectSampleNumbers(numbers []int64) []int64

SelectSampleNumbers returns the earliest, middle, and latest of a set of numbers, deduplicated and sorted. A short set returns just its distinct members. This is the "span the range" rule: the differ checks the oldest row, a middle row, and the newest, so a regression that only touches old or only touches new rows is still caught.
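
As a rough illustration of the "span the range" rule, the sketch below reimplements the selection locally. sampleNumbers is a hypothetical stand-in, and taking the middle at index len/2 of the sorted distinct values is an assumption; the package may choose the midpoint differently.

```go
package main

import (
	"fmt"
	"slices"
)

// sampleNumbers returns the earliest, middle, and latest of the distinct
// values in numbers, sorted ascending.
func sampleNumbers(numbers []int64) []int64 {
	sorted := slices.Clone(numbers) // leave the caller's slice untouched
	slices.Sort(sorted)
	sorted = slices.Compact(sorted) // drop adjacent duplicates after sorting
	if len(sorted) <= 3 {
		return sorted // a short set returns just its distinct members
	}
	// Assumption: "middle" is the element at index len/2.
	return []int64{sorted[0], sorted[len(sorted)/2], sorted[len(sorted)-1]}
}

func main() {
	fmt.Println(sampleNumbers([]int64{42, 7, 7, 100, 3, 58})) // [3 42 100]
	fmt.Println(sampleNumbers([]int64{9, 9, 2}))              // [2 9]
}
```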

func WriteCorpus

func WriteCorpus(root string, c *Corpus) error

WriteCorpus writes one corpus into a snapshot directory, creating the per-repo layout. It does not write the manifest; the caller owns the manifest because it spans every repo in the snapshot.

Types

type CaptureItem

type CaptureItem struct {
	Kind       CaptureKind `json:"kind"`
	Number     int64       `json:"number"`
	CapturedAt time.Time   `json:"captured_at,omitzero"`
}

CaptureItem is one entry in the capture plan: a kind and the subject number to fetch. CapturedAt is zero in the plan and stamped when the response is actually captured.

func BuildCapturePlan

func BuildCapturePlan(c *Corpus) []CaptureItem

BuildCapturePlan selects the golden-response sample for a corpus: the earliest/middle/latest issue, pull request, comment, timeline, review, and status, so the differ has old and new rows of every kind. The plan is the differ-capture sample manifest the build records.

type CaptureKind

type CaptureKind string

CaptureKind names one kind of golden response.

const (
	CaptureIssue    CaptureKind = "issue"
	CapturePull     CaptureKind = "pull"
	CaptureComment  CaptureKind = "comment"
	CaptureTimeline CaptureKind = "timeline"
	CaptureReview   CaptureKind = "review"
	CaptureStatus   CaptureKind = "status"
)

The capture kinds: one golden response per kind, sampled across the number range so the differ checks old and new rows of each.

type Checkpoint

type Checkpoint struct {
	Done map[string]bool `json:"done"`
}

Checkpoint is the resumable-export journal: it records which repo/table pairs an export run has finished, so a run interrupted by rate-limit exhaustion or a crash resumes at the first unfinished pair instead of re-exporting from the top. It is a plain value the caller persists alongside the snapshot.

func NewCheckpoint

func NewCheckpoint() *Checkpoint

NewCheckpoint returns an empty journal.

func (*Checkpoint) IsDone

func (c *Checkpoint) IsDone(ref RepoRef, table string) bool

IsDone reports whether ref's table has already been exported.

func (*Checkpoint) Mark

func (c *Checkpoint) Mark(ref RepoRef, table string)

Mark records ref's table as exported.
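
A resumable loop over such a journal might look like the sketch below. The types are local stand-ins, and the key encoding (owner/name plus table, joined with a slash) is an assumption; the documentation does not state how a repo/table pair is keyed.

```go
package main

import "fmt"

// checkpoint mirrors the documented shape: a set of finished pairs.
type checkpoint struct {
	Done map[string]bool `json:"done"`
}

// key is an assumed encoding of a repo/table pair.
func key(nwo, table string) string { return nwo + "/" + table }

func (c *checkpoint) isDone(nwo, table string) bool { return c.Done[key(nwo, table)] }

func (c *checkpoint) mark(nwo, table string) {
	if c.Done == nil {
		c.Done = make(map[string]bool)
	}
	c.Done[key(nwo, table)] = true
}

// resume exports only the tables the journal has not finished yet and
// returns the ones it ran.
func resume(c *checkpoint, nwo string, tables []string) []string {
	var ran []string
	for _, t := range tables {
		if c.isDone(nwo, t) {
			continue // finished in an earlier run
		}
		// ... the table export would happen here ...
		c.mark(nwo, t)
		ran = append(ran, t)
	}
	return ran
}

func main() {
	cp := &checkpoint{}
	cp.mark("octo/demo", "issues") // finished before the interruption
	fmt.Println(resume(cp, "octo/demo", []string{"issues", "comments", "reviews"}))
}
```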

type Comment

type Comment struct {
	ID                int64          `json:"id"`
	IssueNumber       int64          `json:"issue_number"`
	Author            string         `json:"author"`
	Body              string         `json:"body"`
	CreatedAt         time.Time      `json:"created_at"`
	UpdatedAt         time.Time      `json:"updated_at"`
	Reactions         map[string]int `json:"reactions,omitempty"`
	AuthorAssociation string         `json:"author_association,omitempty"`
}

Comment is one conversation comment. ID is the dataset id, used to order the db_id allocation deterministically; IssueNumber joins it to its issue.

type CommitStatus

type CommitStatus struct {
	SHA         string    `json:"sha"`
	Context     string    `json:"context"`
	State       string    `json:"state"`
	Description string    `json:"description,omitempty"`
	TargetURL   string    `json:"target_url,omitempty"`
	CreatedAt   time.Time `json:"created_at"`
}

CommitStatus is one external pass/fail report against a head sha under a context. An automation-heavy repo carries many contexts per sha.

type Corpus

type Corpus struct {
	Repo           RepoRef         `json:"repo"`
	Issues         []Issue         `json:"issues"`
	PullRequests   []PullRequest   `json:"pull_requests"`
	Comments       []Comment       `json:"comments"`
	Reviews        []Review        `json:"reviews"`
	ReviewComments []ReviewComment `json:"review_comments"`
	TimelineEvents []TimelineEvent `json:"timeline_events"`
	PRFiles        []PRFile        `json:"pr_files"`
	CommitStatuses []CommitStatus  `json:"commit_statuses"`
}

Corpus is the normalized metadata of one repository: the eight tables the public dataset and the GraphQL export both reduce to, with people named by login string and cross-references named by number or id. Stage B resolves the logins to user pks and the numbers to row pks as it seeds. A Corpus is the unit Stage A writes and Stage B reads; one snapshot directory holds one Corpus per repository.

func ReadCorpus

func ReadCorpus(root string, ref RepoRef) (*Corpus, error)

ReadCorpus reads one repo's corpus back from a snapshot directory.

func (*Corpus) Logins

func (c *Corpus) Logins() []string

Logins returns the distinct set of every login named anywhere in the corpus, in first-seen order, so the seeder can build the user table once before it writes any row that references a user. First-seen order keeps the build deterministic.
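
First-seen deduplication is a small pattern worth seeing on its own. The helper below is a hypothetical sketch over plain strings rather than a Corpus, and skipping empty logins is an assumption made because several login fields are optional.

```go
package main

import "fmt"

// distinctInOrder returns each non-empty login once, in first-seen order.
func distinctInOrder(logins ...string) []string {
	seen := make(map[string]bool)
	var out []string
	for _, l := range logins {
		if l == "" || seen[l] {
			continue
		}
		seen[l] = true
		out = append(out, l)
	}
	return out
}

func main() {
	fmt.Println(distinctInOrder("alice", "bob", "", "alice", "carol", "bob"))
	// [alice bob carol]
}
```

Appending to a slice while tracking membership in a map avoids ranging over the map itself, whose iteration order Go deliberately randomizes.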

type DropNote

type DropNote struct {
	What   string `json:"what"`
	Count  int    `json:"count,omitempty"`
	Reason string `json:"reason"`
}

DropNote records one bounded or skipped piece of a corpus build, with the reason, so coverage is never silently capped.

type Exporter

type Exporter interface {
	// Export returns the corpus for ref, or ErrRequiresNetwork when the source
	// is not reachable in this environment.
	Export(ctx context.Context, ref RepoRef) (*Corpus, error)
	// Source names the exporter for logs and the manifest provenance.
	Source() string
}

Exporter produces the metadata corpus for one repository from one source.

type FixtureExporter

type FixtureExporter struct {
	Root string
}

FixtureExporter reads a corpus back from a snapshot directory. It is the offline source: a previously exported snapshot, or a small checked-in fixture, re-read as if freshly exported. It is the exporter the tests and the seed-only CLI path use.

func (FixtureExporter) Export

func (e FixtureExporter) Export(_ context.Context, ref RepoRef) (*Corpus, error)

Export reads ref's corpus from the snapshot root.

func (FixtureExporter) Source

func (e FixtureExporter) Source() string

Source identifies the fixture exporter.

type GitMirrorPlan

type GitMirrorPlan struct {
	Ref       RepoRef
	MirrorURL string
	PinnedSHA string
}

GitMirrorPlan is the recipe to mirror a repository's history into a git store, expressed as the commands to run rather than run inline, so the plan is testable and the network/disk-heavy execution is an explicit, separate step. The maintenance pass (repack, bitmap, commit-graph, multi-pack-index) is what makes a freshly cloned giant serve cold reads at the same speed a warmed long-lived repository does.

func (GitMirrorPlan) Commands

func (p GitMirrorPlan) Commands(dest string) [][]string

Commands returns the git invocations the plan runs, in order: a bare mirror clone, a reset of the advertised tip to the pin so a fetch benchmark has real new commits to deliver, and the maintenance pass. dest is the bare repo path in the git store.
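
One plausible rendering of those steps as argument vectors is sketched below. The function name, the example URL, path, and SHA are hypothetical, and the exact git subcommands and flags the package emits are not documented here; the ones shown are standard git invocations chosen to match the described clone, reset, and maintenance steps.

```go
package main

import (
	"fmt"
	"strings"
)

// mirrorCommands is a hypothetical rendering of the documented steps: a bare
// mirror clone, a reset of the branch tip to the pin, then the maintenance
// pass (repack with bitmap, commit-graph, multi-pack-index).
func mirrorCommands(mirrorURL, branch, pinnedSHA, dest string) [][]string {
	return [][]string{
		{"git", "clone", "--mirror", mirrorURL, dest},
		{"git", "-C", dest, "update-ref", "refs/heads/" + branch, pinnedSHA},
		{"git", "-C", dest, "repack", "-a", "-d", "--write-bitmap-index"},
		{"git", "-C", dest, "commit-graph", "write", "--reachable"},
		{"git", "-C", dest, "multi-pack-index", "write"},
	}
}

func main() {
	cmds := mirrorCommands(
		"https://example.com/octo/demo.git", // placeholder mirror URL
		"main",
		"0123456789abcdef0123456789abcdef01234567", // placeholder pin
		"/srv/git/octo/demo.git",
	)
	for _, c := range cmds {
		fmt.Println(strings.Join(c, " "))
	}
}
```

Returning the commands as data rather than running them is what lets a test assert on the plan without touching the network or the disk.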

type Issue

type Issue struct {
	Number          int64          `json:"number"`
	NodeID          string         `json:"node_id,omitempty"`
	IsPullRequest   bool           `json:"is_pull_request"`
	Title           string         `json:"title"`
	Body            string         `json:"body,omitempty"`
	State           string         `json:"state"`
	StateReason     string         `json:"state_reason,omitempty"`
	Author          string         `json:"author"`
	CreatedAt       time.Time      `json:"created_at"`
	UpdatedAt       time.Time      `json:"updated_at"`
	ClosedAt        *time.Time     `json:"closed_at,omitempty"`
	Labels          []Label        `json:"labels,omitempty"`
	Assignees       []string       `json:"assignees,omitempty"`
	MilestoneTitle  string         `json:"milestone_title,omitempty"`
	MilestoneNumber int64          `json:"milestone_number,omitempty"`
	Reactions       map[string]int `json:"reactions,omitempty"`
	CommentCount    int            `json:"comment_count"`
	Locked          bool           `json:"locked,omitempty"`
	LockReason      string         `json:"lock_reason,omitempty"`
}

Issue is one row of the shared issue/PR table: an issue when IsPullRequest is false, the issue half of a pull request when true. Number is the per-repo number preserved verbatim. Reactions are counts per content (`{"+1": 5, "heart": 2}`), materialized into rows against the reactor pool at seed time. NodeID is the dataset's node id, recorded for provenance but never written: Githome mints its own GraphQL ids.

type Label

type Label struct {
	Name        string `json:"name"`
	Color       string `json:"color,omitempty"`
	Description string `json:"description,omitempty"`
}

Label is one label carried on an issue, deduped per repository at seed time.

type Manifest

type Manifest struct {
	Schema int `json:"schema"`
	// Note is a human description of this corpus build; it is not load-bearing.
	Note string `json:"note,omitempty"`
	// DatasetRevision pins the metadata source (the dataset repo commit, or the
	// GraphQL export run id). It is the metadata analog of the per-repo SHA.
	DatasetRevision string `json:"dataset_revision"`
	// FixtureTier names the tier this corpus serves: rw-smoke, rw-meta,
	// rw-write, rw-git, or rw-full. Tiers bound how much a CI leg loads.
	FixtureTier string `json:"fixture_tier"`
	// Pseudonymized is true when logins and bodies were run through the
	// pseudonymizer, so the corpus carries no real identities.
	Pseudonymized bool `json:"pseudonymized"`
	// Reactor records the bounded synthetic reactor pool the seeder materializes
	// reaction counts against; reactions are the one MODELED count in a corpus.
	Reactor ReactorPool `json:"reactor"`
	// Repos is one entry per repository in this corpus.
	Repos []RepoManifest `json:"repos"`
	// SeederVersion and SchemaVersion pin the tooling and the store schema the
	// corpus was built against, the rest of the reproducibility checklist.
	SeederVersion string `json:"seeder_version,omitempty"`
	SchemaVersion int    `json:"schema_version,omitempty"`
	// Dropped records anything this build bounded or skipped — a truncated
	// table, an unreachable source, a sampled range — so a partial corpus never
	// reads as a complete one.
	Dropped []DropNote `json:"dropped,omitempty"`
}

Manifest pins a corpus and records what was measured and what was synthesized, so a corpus is reproducible and no reader mistakes a modeled value for a real one. It is the single file that freezes the corpus. The dataset revision and the per-repo git pins are its OFFICIAL anchors; the reactor pool and any pseudonymization are its MODELED notes; and the measured row counts (Rows on each RepoManifest) are what the seeder actually wrote, not counts asserted up front.

func LoadManifest

func LoadManifest(path string) (*Manifest, error)

LoadManifest reads and validates a manifest from disk.
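
The schema check that lets an old reader refuse a newer manifest could be sketched as follows. This is an assumption about what validation involves: the documentation says only that LoadManifest validates, and whether the comparison is exact or "not newer than" is not stated. manifestHeader and checkSchema are hypothetical names.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// manifestSchema stands in for the package's ManifestSchema constant.
const manifestSchema = 1

// manifestHeader decodes only the field the version check needs.
type manifestHeader struct {
	Schema int `json:"schema"`
}

// checkSchema refuses a manifest whose schema this reader does not know.
// Assumption: an exact match is required.
func checkSchema(raw []byte) error {
	var h manifestHeader
	if err := json.Unmarshal(raw, &h); err != nil {
		return fmt.Errorf("decode manifest: %w", err)
	}
	if h.Schema != manifestSchema {
		return fmt.Errorf("manifest schema %d, reader supports %d", h.Schema, manifestSchema)
	}
	return nil
}

func main() {
	fmt.Println(checkSchema([]byte(`{"schema": 1, "fixture_tier": "rw-smoke"}`)))
	fmt.Println(checkSchema([]byte(`{"schema": 2}`)))
}
```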

func NewManifest

func NewManifest(tier, datasetRevision string) *Manifest

NewManifest builds a manifest for a tier with the default reactor pool and the current schema version, ready for the seeder to fill in the measured row counts.

func (*Manifest) Drop

func (m *Manifest) Drop(what, reason string, count int)

Drop records a bounded or skipped piece of the build.

func (*Manifest) RepoNames

func (m *Manifest) RepoNames() []string

RepoNames returns the owner/name of every repo in the manifest, sorted, for stable logging.

func (*Manifest) Save

func (m *Manifest) Save(path string) error

Save writes the manifest as indented JSON.

type OpClass

type OpClass string

OpClass is one of the five operation classes the SLOs are stated against. The per-repo mix weights how often each class appears in a synthetic replay.

const (
	// OpXCond is a conditional read: a GET that the client expects to answer
	// 304 from an ETag or since-cursor (the poll flood).
	OpXCond OpClass = "X-cond"
	// OpRMeta is a metadata read: an issue view, a list page, a PR view.
	OpRMeta OpClass = "R-meta"
	// OpRGit is a git read served over HTTP: a tree or blob fetch.
	OpRGit OpClass = "R-git"
	// OpTGit is a git transport operation: a clone or fetch.
	OpTGit OpClass = "T-git"
	// OpWMeta is a metadata write: open an issue, comment, merge a PR, apply a
	// label.
	OpWMeta OpClass = "W-meta"
)

type PRFile

type PRFile struct {
	PRNumber         int64  `json:"pr_number"`
	Path             string `json:"path"`
	Additions        int    `json:"additions"`
	Deletions        int    `json:"deletions"`
	Status           string `json:"status"`
	PreviousFilename string `json:"previous_filename,omitempty"`
}

PRFile is one changed file of a pull request. These are not seeded as state: a PR's file list is derived from the git diff at request time. They are kept as a correctness oracle so the diff path can be checked against recorded add/delete counts.

type Provenance

type Provenance string

Provenance records where a corpus value came from, so a reader never mistakes a modeled number for a measured one. It is carried in the manifest per repo and per synthesized field.

const (
	// Official is copied verbatim from the public source: a real number, a real
	// timestamp, a real body.
	Official Provenance = "OFFICIAL"
	// Derived is computed from official data by a documented rule: a comment
	// count recounted from the comments, an event payload rendered from typed
	// columns.
	Derived Provenance = "DERIVED"
	// Modeled is synthesized to hit a realistic shape where the public source
	// has no per-row truth: the reaction reactor pool, the synthetic webhook
	// subscriptions, a pseudonymized login.
	Modeled Provenance = "MODELED"
)

type Pseudonymizer

type Pseudonymizer struct {
	// RedactBodies replaces issue, comment, and review bodies with a
	// length-preserving placeholder so the corpus carries no real prose while
	// the marshaled-payload size stays realistic.
	RedactBodies bool
	// contains filtered or unexported fields
}

Pseudonymizer rewrites a corpus so it carries no real identities: every person login becomes a stable synthetic handle, and, when RedactBodies is set, every free-text body becomes a length-preserving placeholder. It is a pure transform — same input, same output — so a pseudonymized corpus is as reproducible as the original, and the mapping is recorded so a captured response can be compared field for field.

The repository's own owner and name are not pseudonymized: they identify the repository, not a person, and the manifest already records them. A login that also happens to be the owner is still rewritten where it appears as an author, actor, or assignee, so no real handle survives in the issue and event bodies.

func NewPseudonymizer

func NewPseudonymizer(redactBodies bool) *Pseudonymizer

NewPseudonymizer returns a pseudonymizer with an empty mapping.

func (*Pseudonymizer) Apply

func (p *Pseudonymizer) Apply(c *Corpus) *Corpus

Apply returns a pseudonymized copy of the corpus. The original is left unchanged. Logins are assigned in the corpus's first-seen order so the mapping is deterministic.

func (*Pseudonymizer) Mapping

func (p *Pseudonymizer) Mapping() map[string]string

Mapping returns the login-to-pseudonym map built so far, copied so the caller cannot mutate the pseudonymizer's state.
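
The two transforms, a stable handle per login and a length-preserving body placeholder, can be sketched with local types. The handle format ("user-0001") and the byte-for-byte placeholder are assumptions; the documentation specifies neither.

```go
package main

import (
	"fmt"
	"strings"
)

// pseudonymizer is a local stand-in that keeps the login-to-handle mapping.
type pseudonymizer struct {
	mapping map[string]string
}

func newPseudonymizer() *pseudonymizer {
	return &pseudonymizer{mapping: make(map[string]string)}
}

// handle returns the stable synthetic handle for a login, minting the next
// one in first-seen order when the login is new.
func (p *pseudonymizer) handle(login string) string {
	if login == "" {
		return "" // an absent login stays absent
	}
	if h, ok := p.mapping[login]; ok {
		return h
	}
	h := fmt.Sprintf("user-%04d", len(p.mapping)+1) // assumed handle format
	p.mapping[login] = h
	return h
}

// redact returns a placeholder with the same byte length as body, so the
// marshaled payload size stays realistic.
func redact(body string) string { return strings.Repeat("x", len(body)) }

func main() {
	p := newPseudonymizer()
	fmt.Println(p.handle("alice"), p.handle("bob"), p.handle("alice"))
	fmt.Println(redact("Fixes the crash on startup."))
}
```

Because handles are minted from a counter rather than a random source, the same corpus walked in the same order always yields the same mapping.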

type PullRequest

type PullRequest struct {
	Number              int64      `json:"number"`
	Merged              bool       `json:"merged"`
	MergedAt            *time.Time `json:"merged_at,omitempty"`
	MergedBy            string     `json:"merged_by,omitempty"`
	MergeCommitSHA      string     `json:"merge_commit_sha,omitempty"`
	BaseRef             string     `json:"base_ref"`
	HeadRef             string     `json:"head_ref"`
	HeadSHA             string     `json:"head_sha"`
	Additions           int        `json:"additions"`
	Deletions           int        `json:"deletions"`
	ChangedFiles        int        `json:"changed_files"`
	Draft               bool       `json:"draft,omitempty"`
	MaintainerCanModify bool       `json:"maintainer_can_modify,omitempty"`
}

PullRequest is the PR-only extension joined to an Issue on Number.

type RateBudget

type RateBudget struct {
	// contains filtered or unexported fields
}

RateBudget bounds how many API points an export run may spend, the way the GraphQL API meters cost in points rather than requests. It is honest about exhaustion: once the budget is spent, Spend refuses further work so an export stops and checkpoints rather than hammering a throttled endpoint.

func NewRateBudget

func NewRateBudget(total int) *RateBudget

NewRateBudget returns a budget of total points. A total of zero or less means the budget is unbounded.

func (*RateBudget) Remaining

func (b *RateBudget) Remaining() int

Remaining reports how many points are left, or a large number when the budget is unbounded (total <= 0).

func (*RateBudget) Spend

func (b *RateBudget) Spend(n int) bool

Spend charges n points, returning false when the budget cannot cover them. The caller checkpoints and stops on false rather than proceeding.

func (*RateBudget) Spent

func (b *RateBudget) Spent() int

Spent reports how many points have been charged.
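
A minimal sketch of the spend-or-stop contract follows. The type is a local stand-in; that a refused Spend charges nothing, and that the budget is not safe for concurrent use, are assumptions of this sketch.

```go
package main

import (
	"fmt"
	"math"
)

// rateBudget is a local stand-in: total <= 0 means unbounded.
type rateBudget struct {
	total, spent int
}

// spend charges n points, or refuses and charges nothing when the budget
// cannot cover them.
func (b *rateBudget) spend(n int) bool {
	if b.total > 0 && b.spent+n > b.total {
		return false
	}
	b.spent += n
	return true
}

// remaining reports the points left, or a large number when unbounded.
func (b *rateBudget) remaining() int {
	if b.total <= 0 {
		return math.MaxInt
	}
	return b.total - b.spent
}

// pagesFetched simulates an export that stops at the first refused spend.
func pagesFetched(total, costPerPage, pages int) int {
	b := &rateBudget{total: total}
	fetched := 0
	for i := 0; i < pages; i++ {
		if !b.spend(costPerPage) {
			break // the caller would checkpoint here and stop
		}
		fetched++
	}
	return fetched
}

func main() {
	fmt.Println(pagesFetched(100, 40, 5)) // 2: the third page would exceed 100
	fmt.Println(pagesFetched(0, 40, 5))   // 5: an unbounded budget never refuses
}
```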

type ReactorPool

type ReactorPool struct {
	Size int   `json:"size"`
	Seed int64 `json:"seed"`
}

ReactorPool is the synthetic identity pool reaction counts are materialized against. Size bounds how many reactor users exist; Seed fixes the assignment so two builds produce the same rows.
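
How counts might become rows against a bounded, seeded pool is sketched below. Everything here is illustrative: the reaction type, mixing the seed with a subject number, and capping a count at the pool size are assumptions, and the package's actual assignment rule is not documented.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// reaction is one materialized row: a content and a reactor pool index.
type reaction struct {
	Content string
	Reactor int
}

// materialize turns per-content counts into rows, picking distinct reactors
// per content from a pool of poolSize with a generator fixed by seed and the
// subject number, so repeated builds produce the same rows.
func materialize(counts map[string]int, poolSize int, seed, subject int64) []reaction {
	contents := make([]string, 0, len(counts))
	for c := range counts {
		contents = append(contents, c)
	}
	sort.Strings(contents) // map order is random; sort for determinism

	rng := rand.New(rand.NewSource(seed ^ subject))
	var rows []reaction
	for _, c := range contents {
		n := counts[c]
		if n < 0 {
			n = 0
		}
		if n > poolSize {
			n = poolSize // assumed cap; a real build would record a drop note
		}
		for _, r := range rng.Perm(poolSize)[:n] {
			rows = append(rows, reaction{Content: c, Reactor: r})
		}
	}
	return rows
}

func main() {
	rows := materialize(map[string]int{"+1": 5, "heart": 2}, 200, 0x6e7e, 17)
	fmt.Println(len(rows), rows[0].Content, rows[len(rows)-1].Content)
}
```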

type ReplayMode

type ReplayMode string

ReplayMode names how a schedule's arrivals were generated.

const (
	// SyntheticMix drives the request mix at a fixed rate; it answers whether the
	// SLOs hold at a chosen size and rate.
	SyntheticMix ReplayMode = "synthetic-mix"
	// TraceDriven replays the real event timeline, time-compressed, so the load
	// carries the real burstiness rather than a smooth arrival.
	TraceDriven ReplayMode = "trace-driven"
)

type ReplayPlan

type ReplayPlan struct {
	Repo           string             `json:"repo"`
	Mode           ReplayMode         `json:"mode"`
	Mix            RequestMix         `json:"mix,omitempty"`
	Compression    float64            `json:"compression,omitempty"`
	ReadWriteRatio int                `json:"read_write_ratio,omitempty"`
	Requests       []ScheduledRequest `json:"requests"`
}

ReplayPlan is the full schedule for one repo, plus the parameters that shaped it, so a run is reproducible and a reviewer can see why the load looks the way it does. Compression and ReadWriteRatio are recorded for the trace-driven mode per the no-silent-caps rule.

func PlanSyntheticMix

func PlanSyntheticMix(repo string, mix RequestMix, count, rps int) ReplayPlan

PlanSyntheticMix builds a fixed-rate schedule that holds the repo's request mix exactly over count requests at rps requests per second. Arrivals are evenly spaced (the harness's open model adds no jitter of its own), and the class of each arrival is assigned by the largest-remainder method so the realized class counts match the mix proportions exactly and deterministically — no RNG, so two builds of the same plan are identical.
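
The largest-remainder apportionment and the even spacing can be sketched separately from the planner. The mix values below are hypothetical, classes are plain strings rather than OpClass, breaking ties by class name is an assumption, and the sketch computes only the per-class counts, not the order in which classes are interleaved across arrivals.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// apportion splits count requests across classes in proportion to their
// weights using the largest-remainder method.
func apportion(mix map[string]int, count int) map[string]int {
	classes := make([]string, 0, len(mix))
	total := 0
	for c, w := range mix {
		classes = append(classes, c)
		total += w
	}
	sort.Strings(classes) // fixed order so ties break deterministically
	out := make(map[string]int, len(mix))
	if total == 0 {
		return out
	}
	rem := make(map[string]int, len(mix))
	assigned := 0
	for _, c := range classes {
		out[c] = count * mix[c] / total // whole-number quota
		rem[c] = count * mix[c] % total // what the quota left over
		assigned += out[c]
	}
	// Hand the unassigned requests to the largest remainders.
	sort.SliceStable(classes, func(i, j int) bool { return rem[classes[i]] > rem[classes[j]] })
	for i := 0; i < count-assigned; i++ {
		out[classes[i]]++
	}
	return out
}

// offset is the evenly spaced arrival time of request i; rps must be > 0.
func offset(i, rps int) time.Duration {
	return time.Duration(i) * time.Second / time.Duration(rps)
}

func main() {
	mix := map[string]int{"X-cond": 50, "R-meta": 30, "R-git": 10, "T-git": 5, "W-meta": 5}
	fmt.Println(apportion(mix, 7))
	fmt.Println(offset(3, 4)) // 750ms
}
```

With 7 requests the whole-number quotas cover only 5 of them (3 for X-cond, 2 for R-meta); the remaining 2 go to R-git and X-cond, which have the largest remainders, giving 4, 2, 1, 0, 0.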

func PlanTraceDriven

func PlanTraceDriven(c *Corpus, compression float64, readWriteRatio int) ReplayPlan

PlanTraceDriven builds a schedule from the corpus's real event timeline. It extracts every state-changing event with its real timestamp (issue opened, comment added, PR merged, every timeline event), sorts by time, and time-compresses by compression so a year of history replays in a tractable wall-clock while the relative burstiness is preserved — a release-day spike stays a spike. Between writes it injects readWriteRatio reads of the repo's read classes, so the replay is not write-only. The compression factor and the ratio are carried on the plan so the run records them.
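
Time compression with injected reads might look like the sketch below. It works on bare timestamps rather than a Corpus; placing the reads at the same offset as the write they follow, and using R-meta as the read class, are assumptions of the sketch.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// scheduled is a reduced form of ScheduledRequest.
type scheduled struct {
	Offset time.Duration
	Class  string
}

// compress sorts the write timestamps, divides each gap from the first
// event by the compression factor, and follows every write with
// readsPerWrite reads.
func compress(writes []time.Time, compression float64, readsPerWrite int) []scheduled {
	if len(writes) == 0 || compression <= 0 {
		return nil
	}
	ts := append([]time.Time(nil), writes...) // sort a copy
	sort.Slice(ts, func(i, j int) bool { return ts[i].Before(ts[j]) })

	var plan []scheduled
	for _, t := range ts {
		off := time.Duration(float64(t.Sub(ts[0])) / compression)
		plan = append(plan, scheduled{Offset: off, Class: "W-meta"})
		for r := 0; r < readsPerWrite; r++ {
			plan = append(plan, scheduled{Offset: off, Class: "R-meta"})
		}
	}
	return plan
}

// demoPlan compresses three writes spread over a day by a factor of 3600.
func demoPlan() []scheduled {
	t0 := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	writes := []time.Time{t0.Add(24 * time.Hour), t0, t0.Add(time.Hour)}
	return compress(writes, 3600, 2)
}

func main() {
	for _, s := range demoPlan() {
		fmt.Println(s.Offset, s.Class)
	}
}
```

At a compression of 3600, an hour of history becomes a second of replay, so the writes land at 0s, 1s, and 24s: the gaps shrink but keep their proportions.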

type RepoManifest

type RepoManifest struct {
	Repo       RepoRef        `json:"repo"`
	Provenance Provenance     `json:"provenance"`
	Rows       map[string]int `json:"rows,omitempty"`
	// GitBytes and TreeEntries are the measured git artifact; zero until a
	// mirror is cloned and measured.
	GitBytes    int64 `json:"git_bytes,omitempty"`
	TreeEntries int   `json:"tree_entries,omitempty"`
}

RepoManifest pins one repository's git side and records the measured row counts of its metadata, with the provenance of the whole entry.

type RepoRef

type RepoRef struct {
	Owner         string `json:"owner"`
	Name          string `json:"name"`
	DefaultBranch string `json:"default_branch"`
	MirrorURL     string `json:"mirror_url,omitempty"`
	PinnedSHA     string `json:"pinned_sha,omitempty"`
}

RepoRef identifies one repository in a corpus and pins its git history. The owner and name preserve the real namespace so URLs match real GitHub paths; MirrorURL and PinnedSHA freeze the git side the way DatasetRevision freezes the metadata side.

func ReadRepoRef

func ReadRepoRef(root string, ref RepoRef) RepoRef

ReadRepoRef reads only the pinned RepoRef (repo.json) for one repo in a snapshot, without loading any table. It lets a streaming caller record the pinned git side in a manifest without materializing the corpus; a missing or unreadable repo.json falls back to the ref passed in.

func (RepoRef) NWO

func (r RepoRef) NWO() string

NWO returns the owner/name form used in paths and logs.

type RequestMix

type RequestMix map[OpClass]int

RequestMix is the per-class weighting of a repo's load, in whole-number percentages that sum to 100. It is a MODELED shape: it is derived from the repo's real read/write character and the subsystem that repo stresses, not measured per request, so it is recorded in the manifest as such.

func MixFor

func MixFor(nwo string) RequestMix

MixFor returns the request mix for a repo by owner/name, falling back to the aggregate platform mix.

type Review

type Review struct {
	ID          int64      `json:"id"`
	PRNumber    int64      `json:"pr_number"`
	Author      string     `json:"author"`
	State       string     `json:"state"`
	Body        string     `json:"body,omitempty"`
	SubmittedAt *time.Time `json:"submitted_at,omitempty"`
	CommitID    string     `json:"commit_id,omitempty"`
}

Review is one act of reviewing a pull request.

type ReviewComment

type ReviewComment struct {
	ID          int64     `json:"id"`
	PRNumber    int64     `json:"pr_number"`
	ReviewID    int64     `json:"review_id"`
	Author      string    `json:"author"`
	Body        string    `json:"body"`
	Path        string    `json:"path"`
	Line        *int64    `json:"line,omitempty"`
	Side        string    `json:"side,omitempty"`
	DiffHunk    string    `json:"diff_hunk,omitempty"`
	CreatedAt   time.Time `json:"created_at"`
	UpdatedAt   time.Time `json:"updated_at"`
	InReplyToID *int64    `json:"in_reply_to_id,omitempty"`
}

ReviewComment is one inline review comment anchored to a diff line. ReviewID joins it to its review; InReplyToID threads a reply under the comment that started a conversation, so threads reassemble.

type ScheduledRequest

type ScheduledRequest struct {
	Offset time.Duration `json:"offset"`
	Class  OpClass       `json:"class"`
	Repo   string        `json:"repo"`
	Number int64         `json:"number,omitempty"`
}

ScheduledRequest is one planned request: its offset from the start of the replay, its operation class, the repo it targets, and the subject it touches (an issue or PR number for a metadata op, zero for a transport op). It is the unit the load harness fires.

type SeedResult

type SeedResult struct {
	RepoPK  int64
	Rows    map[string]int
	Dropped []DropNote
}

SeedResult reports what a corpus seed wrote, so the caller can fold it into the manifest as the measured artifact.

func SeedCorpus

func SeedCorpus(ctx context.Context, st *store.Store, c *Corpus, reactor ReactorPool) (*SeedResult, error)

SeedCorpus writes one corpus into a store through the bulk-seed write path, preserving every per-repo number and timestamp. The whole repository loads in one transaction so it lands whole or rolls back. The caller migrates the store first (or passes a migrated one); SeedCorpus does not migrate, so a multi-repo seed shares one schema.

Determinism: the tables are seeded in a fixed order (issues by number, comments and reviews by id, and so on), so the db_id sequence advances the same way on every run, and reactions are materialized against the bounded reactor pool with a fixed assignment, so two seeds of the same corpus produce identical databases.

func SeedSnapshot

func SeedSnapshot(ctx context.Context, st *store.Store, root string, ref RepoRef, reactor ReactorPool) (*SeedResult, error)

SeedSnapshot seeds one repo's corpus straight from its on-disk snapshot, streaming each table from disk and releasing it before the next, so the seeder never holds the whole corpus in memory at once. Peak memory is one table plus the foreign-key resolution maps, not the sum of every table's rows, which is what lets a multi-hundred-thousand-row repo seed without loading gigabytes of bodies into RAM. It is the scale counterpart to SeedCorpus, which takes an already-materialized corpus and is the path the in-memory and pseudonymized flows use.

type TimelineEvent

type TimelineEvent struct {
	ID          int64          `json:"id"`
	IssueNumber int64          `json:"issue_number"`
	EventType   string         `json:"event_type"`
	Actor       string         `json:"actor,omitempty"`
	CreatedAt   time.Time      `json:"created_at"`
	LabelName   string         `json:"label_name,omitempty"`
	LabelColor  string         `json:"label_color,omitempty"`
	Assignee    string         `json:"assignee_login,omitempty"`
	Milestone   string         `json:"milestone_title,omitempty"`
	TitleFrom   string         `json:"title_from,omitempty"`
	TitleTo     string         `json:"title_to,omitempty"`
	RefType     string         `json:"ref_type,omitempty"`
	RefNumber   int64          `json:"ref_number,omitempty"`
	LockReason  string         `json:"lock_reason,omitempty"`
	Data        map[string]any `json:"data,omitempty"`
}

TimelineEvent is one lifecycle event. EventType maps to the issue_events event column; the typed columns and Data blob render into the event payload. This is the largest table in an automation-heavy corpus.
