realworld

package

Version: v0.1.3 (latest) | Published: Jun 13, 2026 | License: Apache-2.0 | Imports: 13 | Imported by: 0

Repository

github.com/tamnd/githome

Documentation ¶

Overview ¶

Package realworld is the real-world ingest subsystem: it exports the public metadata and git history of a handful of the largest GitHub-native repositories into a pinned, normalized corpus, then seeds that corpus into a Githome instance so the read, write, search, git, and event paths can be exercised at a scale the small development fixture never reaches.

The subsystem is two stages that meet at one on-disk format, the Corpus:

- Stage A (export) turns a live source (a git mirror, the public GraphQL API, or the GH Archive event stream) into a normalized Corpus written to a snapshot directory, pinned by a manifest so a later upstream change cannot silently move the numbers.
- Stage B (seed) reads a Corpus snapshot and writes it into a target store and git store through the bulk-seed write path, preserving the real numbers and timestamps.

Stage A needs network and, for the bulk API, credentials, so it is not exercised in unit tests; Stage B runs entirely against a local snapshot and a SQLite store, so the seeding, pseudonymization, replay, and capture logic is fully testable on the small fixtures checked in under testdata.

Index ¶

Constants
Variables
func ExportToSnapshot(ctx context.Context, ex Exporter, refs []RepoRef, m *Manifest, root string) error
func SelectSampleNumbers(numbers []int64) []int64
func WriteCorpus(root string, c *Corpus) error
type CaptureItem
- func BuildCapturePlan(c *Corpus) []CaptureItem
type CaptureKind
type Checkpoint
- func NewCheckpoint() *Checkpoint
- func (c *Checkpoint) IsDone(ref RepoRef, table string) bool
- func (c *Checkpoint) Mark(ref RepoRef, table string)
type Comment
type CommitStatus
type Corpus
- func ReadCorpus(root string, ref RepoRef) (*Corpus, error)
- func (c *Corpus) Logins() []string
type DropNote
type Exporter
type FixtureExporter
- func (e FixtureExporter) Export(_ context.Context, ref RepoRef) (*Corpus, error)
- func (e FixtureExporter) Source() string
type GitMirrorPlan
- func (p GitMirrorPlan) Commands(dest string) [][]string
type Issue
type Label
type Manifest
- func LoadManifest(path string) (*Manifest, error)
- func NewManifest(tier, datasetRevision string) *Manifest
- func (m *Manifest) Drop(what, reason string, count int)
- func (m *Manifest) RepoNames() []string
- func (m *Manifest) Save(path string) error
type OpClass
type PRFile
type Provenance
type Pseudonymizer
- func NewPseudonymizer(redactBodies bool) *Pseudonymizer
- func (p *Pseudonymizer) Apply(c *Corpus) *Corpus
- func (p *Pseudonymizer) Mapping() map[string]string
type PullRequest
type RateBudget
- func NewRateBudget(total int) *RateBudget
- func (b *RateBudget) Remaining() int
- func (b *RateBudget) Spend(n int) bool
- func (b *RateBudget) Spent() int
type ReactorPool
type ReplayMode
type ReplayPlan
- func PlanSyntheticMix(repo string, mix RequestMix, count, rps int) ReplayPlan
- func PlanTraceDriven(c *Corpus, compression float64, readWriteRatio int) ReplayPlan
type RepoManifest
type RepoRef
- func ReadRepoRef(root string, ref RepoRef) RepoRef
- func (r RepoRef) NWO() string
type RequestMix
- func MixFor(nwo string) RequestMix
type Review
type ReviewComment
type ScheduledRequest
type SeedResult
- func SeedCorpus(ctx context.Context, st *store.Store, c *Corpus, reactor ReactorPool) (*SeedResult, error)
- func SeedSnapshot(ctx context.Context, st *store.Store, root string, ref RepoRef, ...) (*SeedResult, error)
type TimelineEvent

Constants ¶

const ManifestName = "realworld-manifest.json"

ManifestName is the manifest filename at the root of a snapshot directory.

const ManifestSchema = 1

ManifestSchema is the manifest format version, bumped when the manifest shape changes so an old reader refuses a corpus it cannot interpret.

Variables ¶

var DefaultReactorPool = ReactorPool{Size: 200, Seed: 0x6e7e}

DefaultReactorPool is the reactor pool a corpus uses unless a manifest overrides it: 200 synthetic reactors, a fixed assignment seed.

var ErrRequiresNetwork = errors.New("realworld: this export source requires network access and credentials")

ErrRequiresNetwork is returned by an exporter whose source needs network or credentials that are not configured, so a caller in an offline or unit-test environment gets a clear signal rather than a silent empty corpus.

Functions ¶

func ExportToSnapshot ¶

func ExportToSnapshot(ctx context.Context, ex Exporter, refs []RepoRef, m *Manifest, root string) error

ExportToSnapshot runs an exporter over every repo in refs, writes each corpus into the snapshot root, and fills the manifest with the measured row counts and the provenance. A repo whose source is unreachable (ErrRequiresNetwork) is recorded as a drop rather than failing the whole run, so an offline build produces an honest partial snapshot the manifest names as partial. The manifest the caller passes is updated in place and saved at the root.
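
The drop-instead-of-fail behavior described above can be sketched as follows. This is a minimal illustration, not the package's implementation: `exportFunc`, `exportAll`, `offlineExporter`, and the repo names are hypothetical, and the exporter is simplified to return a row count rather than a full corpus.

```go
package main

import (
	"errors"
	"fmt"
)

// errRequiresNetwork stands in for the package's ErrRequiresNetwork sentinel.
var errRequiresNetwork = errors.New("export source requires network access and credentials")

// dropNote mirrors the two DropNote fields this sketch needs.
type dropNote struct {
	What   string
	Reason string
}

// exportFunc is a hypothetical, simplified exporter: it returns a row count
// for one repo instead of a full corpus.
type exportFunc func(nwo string) (int, error)

// exportAll records unreachable repos as drops and fails only on other errors.
func exportAll(ex exportFunc, repos []string) (map[string]int, []dropNote, error) {
	rows := make(map[string]int)
	var drops []dropNote
	for _, nwo := range repos {
		n, err := ex(nwo)
		switch {
		case errors.Is(err, errRequiresNetwork):
			// An unreachable source makes the snapshot partial, not failed.
			drops = append(drops, dropNote{What: nwo, Reason: err.Error()})
		case err != nil:
			return nil, nil, fmt.Errorf("export %s: %w", nwo, err)
		default:
			rows[nwo] = n
		}
	}
	return rows, drops, nil
}

// offlineExporter pretends only one local fixture repo is reachable.
func offlineExporter(nwo string) (int, error) {
	if nwo == "fixture/small" {
		return 12, nil
	}
	return 0, fmt.Errorf("%s: %w", nwo, errRequiresNetwork)
}

func main() {
	rows, drops, err := exportAll(offlineExporter, []string{"fixture/small", "big/repo"})
	if err != nil {
		panic(err)
	}
	fmt.Println("rows:", rows)
	fmt.Println("partial:", len(drops) > 0, "drops:", drops)
}
```

Matching with `errors.Is` rather than `==` keeps the check working when an exporter wraps the sentinel with extra context.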

func SelectSampleNumbers ¶

func SelectSampleNumbers(numbers []int64) []int64

SelectSampleNumbers returns the earliest, middle, and latest of a set of numbers, deduplicated and sorted. A short set returns just its distinct members. This is the "span the range" rule: the differ checks the oldest row, a middle row, and the newest, so a regression that only touches old or only touches new rows is still caught.
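
A minimal sketch of the "span the range" rule, assuming that "middle" means the element at index len/2 of the sorted distinct values (the documentation does not say how the middle is chosen). `selectSampleNumbers` is a hypothetical stand-in, not the package function.

```go
package main

import (
	"fmt"
	"sort"
)

// selectSampleNumbers returns the earliest, middle, and latest distinct
// values of numbers, sorted ascending. The input slice is not modified.
func selectSampleNumbers(numbers []int64) []int64 {
	if len(numbers) == 0 {
		return nil
	}
	sorted := append([]int64(nil), numbers...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })

	// Drop duplicates from the sorted copy.
	distinct := make([]int64, 0, len(sorted))
	for _, n := range sorted {
		if len(distinct) == 0 || n != distinct[len(distinct)-1] {
			distinct = append(distinct, n)
		}
	}
	if len(distinct) <= 3 {
		return distinct // a short set returns just its distinct members
	}
	// Assumption: the middle sample is the element at index len/2.
	return []int64{distinct[0], distinct[len(distinct)/2], distinct[len(distinct)-1]}
}

func main() {
	fmt.Println(selectSampleNumbers([]int64{42, 7, 7, 1900, 310, 15}))
	fmt.Println(selectSampleNumbers([]int64{3, 3, 8}))
}
```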

func WriteCorpus ¶

func WriteCorpus(root string, c *Corpus) error

WriteCorpus writes one corpus into a snapshot directory, creating the per-repo layout. It does not write the manifest; the caller owns the manifest because it spans every repo in the snapshot.

Types ¶

type CaptureItem ¶

type CaptureItem struct {
	Kind       CaptureKind `json:"kind"`
	Number     int64       `json:"number"`
	CapturedAt time.Time   `json:"captured_at,omitzero"`
}

CaptureItem is one entry in the capture plan: a kind and the subject number to fetch. CapturedAt is zero in the plan and stamped when the response is actually captured.

func BuildCapturePlan ¶

func BuildCapturePlan(c *Corpus) []CaptureItem

BuildCapturePlan selects the golden-response sample for a corpus: the earliest/middle/latest issue, pull request, comment, timeline, review, and status, so the differ has old and new rows of every kind. The plan is the differ-capture sample manifest the build records.

type CaptureKind ¶

type CaptureKind string

CaptureKind names one kind of golden response.

const (
	CaptureIssue    CaptureKind = "issue"
	CapturePull     CaptureKind = "pull"
	CaptureComment  CaptureKind = "comment"
	CaptureTimeline CaptureKind = "timeline"
	CaptureReview   CaptureKind = "review"
	CaptureStatus   CaptureKind = "status"
)

The capture kinds: one golden response per kind, sampled across the number range so the differ checks old and new rows of each.

type Checkpoint ¶

type Checkpoint struct {
	Done map[string]bool `json:"done"`
}

Checkpoint is the resumable-export journal: it records which repo/table pairs an export run has finished, so a run interrupted by rate-limit exhaustion or a crash resumes at the first unfinished pair instead of re-exporting from the top. It is a plain value the caller persists alongside the snapshot.

func NewCheckpoint ¶

func NewCheckpoint() *Checkpoint

NewCheckpoint returns an empty journal.

func (*Checkpoint) IsDone ¶

func (c *Checkpoint) IsDone(ref RepoRef, table string) bool

IsDone reports whether ref's table has already been exported.

func (*Checkpoint) Mark ¶

func (c *Checkpoint) Mark(ref RepoRef, table string)

Mark records ref's table as exported.
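
The resume behavior can be sketched with a local stand-in type. The key format (`owner/name/table`) and the helper `remaining` are assumptions made for illustration; the documentation only fixes the `done` map and its JSON tag.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// checkpoint mirrors the documented shape: a set of finished repo/table pairs.
type checkpoint struct {
	Done map[string]bool `json:"done"`
}

// pairKey is an assumed encoding of a repo/table pair.
func pairKey(nwo, table string) string { return nwo + "/" + table }

func (c *checkpoint) isDone(nwo, table string) bool { return c.Done[pairKey(nwo, table)] }

func (c *checkpoint) mark(nwo, table string) {
	if c.Done == nil {
		c.Done = make(map[string]bool)
	}
	c.Done[pairKey(nwo, table)] = true
}

// remaining lists the tables of one repo that a resumed run still has to export.
func remaining(c *checkpoint, nwo string, tables []string) []string {
	var todo []string
	for _, t := range tables {
		if !c.isDone(nwo, t) {
			todo = append(todo, t)
		}
	}
	return todo
}

func main() {
	tables := []string{"issues", "comments", "reviews"}

	first := &checkpoint{}
	first.mark("octo/demo", "issues")

	// Persist the journal, then reload it the way a later run would.
	raw, err := json.Marshal(first)
	if err != nil {
		panic(err)
	}
	resumed := &checkpoint{}
	if err := json.Unmarshal(raw, resumed); err != nil {
		panic(err)
	}
	fmt.Println(remaining(resumed, "octo/demo", tables))
}
```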

type Comment ¶

type Comment struct {
	ID                int64          `json:"id"`
	IssueNumber       int64          `json:"issue_number"`
	Author            string         `json:"author"`
	Body              string         `json:"body"`
	CreatedAt         time.Time      `json:"created_at"`
	UpdatedAt         time.Time      `json:"updated_at"`
	Reactions         map[string]int `json:"reactions,omitempty"`
	AuthorAssociation string         `json:"author_association,omitempty"`
}

Comment is one conversation comment. ID is the dataset id, used to order the db_id allocation deterministically; IssueNumber joins it to its issue.

type CommitStatus ¶

type CommitStatus struct {
	SHA         string    `json:"sha"`
	Context     string    `json:"context"`
	State       string    `json:"state"`
	Description string    `json:"description,omitempty"`
	TargetURL   string    `json:"target_url,omitempty"`
	CreatedAt   time.Time `json:"created_at"`
}

CommitStatus is one external pass/fail report against a head sha under a context. An automation-heavy repo carries many contexts per sha.

type Corpus ¶

type Corpus struct {
	Repo           RepoRef         `json:"repo"`
	Issues         []Issue         `json:"issues"`
	PullRequests   []PullRequest   `json:"pull_requests"`
	Comments       []Comment       `json:"comments"`
	Reviews        []Review        `json:"reviews"`
	ReviewComments []ReviewComment `json:"review_comments"`
	TimelineEvents []TimelineEvent `json:"timeline_events"`
	PRFiles        []PRFile        `json:"pr_files"`
	CommitStatuses []CommitStatus  `json:"commit_statuses"`
}

Corpus is the normalized metadata of one repository: the eight tables the public dataset and the GraphQL export both reduce to, with people named by login string and cross-references named by number or id. Stage B resolves the logins to user pks and the numbers to row pks as it seeds. A Corpus is the unit Stage A writes and Stage B reads; one snapshot directory holds one Corpus per repository.

func ReadCorpus ¶

func ReadCorpus(root string, ref RepoRef) (*Corpus, error)

ReadCorpus reads one repo's corpus back from a snapshot directory.

func (*Corpus) Logins ¶

func (c *Corpus) Logins() []string

Logins returns the distinct set of every login named anywhere in the corpus, in first-seen order, so the seeder can build the user table once before it writes any row that references a user. First-seen order keeps the build deterministic.
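
First-seen deduplication can be sketched as below. The local `issue` and `comment` types are trimmed stand-ins, and the order in which tables and fields are walked is an assumption; the documentation states only that the result is in first-seen order.

```go
package main

import "fmt"

type issue struct {
	Author    string
	Assignees []string
}

type comment struct {
	Author string
}

// logins returns every distinct non-empty login in first-seen order.
func logins(issues []issue, comments []comment) []string {
	seen := make(map[string]bool)
	var out []string
	add := func(login string) {
		if login != "" && !seen[login] {
			seen[login] = true
			out = append(out, login)
		}
	}
	// Assumed walk order: issues (author, then assignees), then comments.
	for _, is := range issues {
		add(is.Author)
		for _, a := range is.Assignees {
			add(a)
		}
	}
	for _, c := range comments {
		add(c.Author)
	}
	return out
}

func main() {
	issues := []issue{{Author: "alice", Assignees: []string{"bob"}}, {Author: "bob"}}
	comments := []comment{{Author: "carol"}, {Author: "alice"}}
	fmt.Println(logins(issues, comments))
}
```

Appending to a slice while tracking membership in a map avoids iterating the map itself, whose order Go deliberately randomizes.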

type DropNote ¶

type DropNote struct {
	What   string `json:"what"`
	Count  int    `json:"count,omitempty"`
	Reason string `json:"reason"`
}

DropNote records one bounded or skipped piece of a corpus build, with the reason, so coverage is never silently capped.

type Exporter ¶

type Exporter interface {
	// Export returns the corpus for ref, or ErrRequiresNetwork when the source
	// is not reachable in this environment.
	Export(ctx context.Context, ref RepoRef) (*Corpus, error)
	// Source names the exporter for logs and the manifest provenance.
	Source() string
}

Exporter produces the metadata corpus for one repository from one source.

type FixtureExporter ¶

type FixtureExporter struct {
	Root string
}

FixtureExporter reads a corpus back from a snapshot directory. It is the offline source: a previously exported snapshot, or a small checked-in fixture, re-read as if freshly exported. It is the exporter the tests and the seed-only CLI path use.

func (FixtureExporter) Export ¶

func (e FixtureExporter) Export(_ context.Context, ref RepoRef) (*Corpus, error)

Export reads ref's corpus from the snapshot root.

func (FixtureExporter) Source ¶

func (e FixtureExporter) Source() string

Source identifies the fixture exporter.

type GitMirrorPlan ¶

type GitMirrorPlan struct {
	Ref       RepoRef
	MirrorURL string
	PinnedSHA string
}

GitMirrorPlan is the recipe to mirror a repository's history into a git store, expressed as the commands to run rather than run inline, so the plan is testable and the network/disk-heavy execution is an explicit, separate step. The maintenance pass (repack, bitmap, commit-graph, multi-pack-index) is what makes a freshly cloned giant serve cold reads at the same speed a warmed long-lived repository does.

func (GitMirrorPlan) Commands ¶

func (p GitMirrorPlan) Commands(dest string) [][]string

Commands returns the git invocations the plan runs, in order: a bare mirror clone, a reset of the advertised tip to the pin so a fetch benchmark has real new commits to deliver, and the maintenance pass. dest is the bare repo path in the git store.
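
As a rough illustration of a plan expressed as data, the sketch below builds one plausible command list. The exact git subcommands and flags the package emits are not documented here, so every invocation is an assumption; `mirrorCommands` and its parameters are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// mirrorCommands returns git invocations as argument vectors, in order:
// a bare mirror clone, a reset of the branch tip to the pin, and maintenance.
// Assumption: these flags are one reasonable choice, not the package's own.
func mirrorCommands(mirrorURL, branch, pinnedSHA, dest string) [][]string {
	return [][]string{
		{"git", "clone", "--mirror", mirrorURL, dest},
		// Move the advertised tip back to the pinned commit.
		{"git", "-C", dest, "update-ref", "refs/heads/" + branch, pinnedSHA},
		// Maintenance pass: repack with bitmaps, commit-graph, multi-pack-index.
		{"git", "-C", dest, "repack", "-a", "-d", "--write-bitmap-index"},
		{"git", "-C", dest, "commit-graph", "write", "--reachable"},
		{"git", "-C", dest, "multi-pack-index", "write"},
	}
}

func main() {
	cmds := mirrorCommands("https://example.invalid/octo/demo.git", "main", "0123abc", "/srv/git/octo/demo.git")
	for _, argv := range cmds {
		fmt.Println(strings.Join(argv, " "))
	}
}
```

Returning argument vectors instead of running them lets a test assert on the plan without touching the network or the disk.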

type Issue ¶

type Issue struct {
	Number          int64          `json:"number"`
	NodeID          string         `json:"node_id,omitempty"`
	IsPullRequest   bool           `json:"is_pull_request"`
	Title           string         `json:"title"`
	Body            string         `json:"body,omitempty"`
	State           string         `json:"state"`
	StateReason     string         `json:"state_reason,omitempty"`
	Author          string         `json:"author"`
	CreatedAt       time.Time      `json:"created_at"`
	UpdatedAt       time.Time      `json:"updated_at"`
	ClosedAt        *time.Time     `json:"closed_at,omitempty"`
	Labels          []Label        `json:"labels,omitempty"`
	Assignees       []string       `json:"assignees,omitempty"`
	MilestoneTitle  string         `json:"milestone_title,omitempty"`
	MilestoneNumber int64          `json:"milestone_number,omitempty"`
	Reactions       map[string]int `json:"reactions,omitempty"`
	CommentCount    int            `json:"comment_count"`
	Locked          bool           `json:"locked,omitempty"`
	LockReason      string         `json:"lock_reason,omitempty"`
}

Issue is one row of the shared issue/PR table: an issue when IsPullRequest is false, the issue half of a pull request when true. Number is the per-repo number preserved verbatim. Reactions are counts per content (`{"+1": 5, "heart": 2}`), materialized into rows against the reactor pool at seed time. NodeID is the dataset's node id, recorded for provenance but never written: Githome mints its own GraphQL ids.

type Label ¶

type Label struct {
	Name        string `json:"name"`
	Color       string `json:"color,omitempty"`
	Description string `json:"description,omitempty"`
}

Label is one label carried on an issue, deduped per repository at seed time.

type Manifest ¶

type Manifest struct {
	Schema int `json:"schema"`
	// Note is a human description of this corpus build; it is not load-bearing.
	Note string `json:"note,omitempty"`
	// DatasetRevision pins the metadata source (the dataset repo commit, or the
	// GraphQL export run id). It is the metadata analog of the per-repo SHA.
	DatasetRevision string `json:"dataset_revision"`
	// FixtureTier names the tier this corpus serves: rw-smoke, rw-meta,
	// rw-write, rw-git, or rw-full. Tiers bound how much a CI leg loads.
	FixtureTier string `json:"fixture_tier"`
	// Pseudonymized is true when logins and bodies were run through the
	// pseudonymizer, so the corpus carries no real identities.
	Pseudonymized bool `json:"pseudonymized"`
	// Reactor records the bounded synthetic reactor pool the seeder materializes
	// reaction counts against; reactions are the one MODELED count in a corpus.
	Reactor ReactorPool `json:"reactor"`
	// Repos is one entry per repository in this corpus.
	Repos []RepoManifest `json:"repos"`
	// SeederVersion and SchemaVersion pin the tooling and the store schema the
	// corpus was built against, the rest of the reproducibility checklist.
	SeederVersion string `json:"seeder_version,omitempty"`
	SchemaVersion int    `json:"schema_version,omitempty"`
	// Dropped records anything this build bounded or skipped — a truncated
	// table, an unreachable source, a sampled range — so a partial corpus never
	// reads as a complete one.
	Dropped []DropNote `json:"dropped,omitempty"`
}

Manifest pins a corpus and records what was measured and what was synthesized, so a corpus is reproducible and no reader mistakes a modeled value for a real one. It is the single file that freezes the corpus: the dataset revision and the per-repo git pins are its OFFICIAL anchors, the reactor pool and any pseudonymization are its MODELED notes, and each entry in Repos carries in Rows the row counts the seeder actually wrote rather than any count asserted up front.

func LoadManifest ¶

func LoadManifest(path string) (*Manifest, error)

LoadManifest reads and validates a manifest from disk.

func NewManifest ¶

func NewManifest(tier, datasetRevision string) *Manifest

NewManifest builds a manifest for a tier with the default reactor pool and the current schema version, ready for the seeder to fill the measured row counts into.

func (*Manifest) Drop ¶

func (m *Manifest) Drop(what, reason string, count int)

Drop records a bounded or skipped piece of the build.

func (*Manifest) RepoNames ¶

func (m *Manifest) RepoNames() []string

RepoNames returns the owner/name of every repo in the manifest, sorted, for stable logging.

func (*Manifest) Save ¶

func (m *Manifest) Save(path string) error

Save writes the manifest as indented JSON.

type OpClass ¶

type OpClass string

OpClass is one of the five operation classes the SLOs are stated against. The per-repo mix weights how often each class appears in a synthetic replay.

const (
	// OpXCond is a conditional read: a GET that the client expects to answer
	// 304 from an ETag or since-cursor (the poll flood).
	OpXCond OpClass = "X-cond"
	// OpRMeta is a metadata read: an issue view, a list page, a PR view.
	OpRMeta OpClass = "R-meta"
	// OpRGit is a git read served over HTTP: a tree or blob fetch.
	OpRGit OpClass = "R-git"
	// OpTGit is a git transport operation: a clone or fetch.
	OpTGit OpClass = "T-git"
	// OpWMeta is a metadata write: open an issue, comment, merge a PR, apply a
	// label.
	OpWMeta OpClass = "W-meta"
)

type PRFile ¶

type PRFile struct {
	PRNumber         int64  `json:"pr_number"`
	Path             string `json:"path"`
	Additions        int    `json:"additions"`
	Deletions        int    `json:"deletions"`
	Status           string `json:"status"`
	PreviousFilename string `json:"previous_filename,omitempty"`
}

PRFile is one changed file of a pull request. These are not seeded as state: a PR's file list is derived from the git diff at request time. They are kept as a correctness oracle so the diff path can be checked against recorded add/delete counts.

type Provenance ¶

type Provenance string

Provenance records where a corpus value came from, so a reader never mistakes a modeled number for a measured one. It is carried in the manifest per repo and per synthesized field.

const (
	// Official is copied verbatim from the public source: a real number, a real
	// timestamp, a real body.
	Official Provenance = "OFFICIAL"
	// Derived is computed from official data by a documented rule: a comment
	// count recounted from the comments, an event payload rendered from typed
	// columns.
	Derived Provenance = "DERIVED"
	// Modeled is synthesized to hit a realistic shape where the public source
	// has no per-row truth: the reaction reactor pool, the synthetic webhook
	// subscriptions, a pseudonymized login.
	Modeled Provenance = "MODELED"
)

type Pseudonymizer ¶

type Pseudonymizer struct {
	// RedactBodies replaces issue, comment, and review bodies with a
	// length-preserving placeholder so the corpus carries no real prose while
	// the marshaled-payload size stays realistic.
	RedactBodies bool
	// contains filtered or unexported fields
}

Pseudonymizer rewrites a corpus so it carries no real identities: every person login becomes a stable synthetic handle, and, when RedactBodies is set, every free-text body becomes a length-preserving placeholder. It is a pure transform — same input, same output — so a pseudonymized corpus is as reproducible as the original, and the mapping is recorded so a captured response can be compared field for field.

The repository's own owner and name are not pseudonymized: they identify the repository, not a person, and the manifest already records them. A login that also happens to be the owner is still rewritten where it appears as an author, actor, or assignee, so no real handle survives in the issue and event bodies.

func NewPseudonymizer ¶

func NewPseudonymizer(redactBodies bool) *Pseudonymizer

NewPseudonymizer returns a pseudonymizer with an empty mapping.

func (*Pseudonymizer) Apply ¶

func (p *Pseudonymizer) Apply(c *Corpus) *Corpus

Apply returns a pseudonymized copy of the corpus. The original is left unchanged. Logins are assigned in the corpus's first-seen order so the mapping is deterministic.

func (*Pseudonymizer) Mapping ¶

func (p *Pseudonymizer) Mapping() map[string]string

Mapping returns the login-to-pseudonym map built so far, copied so the caller cannot mutate the pseudonymizer's state.
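
The two transforms can be sketched as follows. The handle format (`user-0001`) and the placeholder character are assumptions, as is preserving byte length rather than rune count; the documentation says only that handles are stable and placeholders are length-preserving.

```go
package main

import (
	"fmt"
	"strings"
)

// pseudonymizer is a local stand-in holding the login-to-handle mapping.
type pseudonymizer struct {
	mapping map[string]string
}

// handle returns the stable synthetic handle for login, assigning the next
// one in first-seen order when the login is new.
func (p *pseudonymizer) handle(login string) string {
	if login == "" {
		return ""
	}
	if h, ok := p.mapping[login]; ok {
		return h
	}
	h := fmt.Sprintf("user-%04d", len(p.mapping)+1) // assumed handle format
	p.mapping[login] = h
	return h
}

// redact replaces a body with a placeholder of the same byte length, so the
// marshaled payload size stays realistic.
func redact(body string) string {
	return strings.Repeat("x", len(body))
}

func main() {
	p := &pseudonymizer{mapping: make(map[string]string)}
	for _, author := range []string{"alice", "bob", "alice"} {
		fmt.Println(author, "->", p.handle(author))
	}
	fmt.Println(redact("This crashes on startup."))
}
```

Because the handle depends only on the order in which logins are first seen, feeding the same corpus through a fresh pseudonymizer yields the same mapping.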

type PullRequest ¶

type PullRequest struct {
	Number              int64      `json:"number"`
	Merged              bool       `json:"merged"`
	MergedAt            *time.Time `json:"merged_at,omitempty"`
	MergedBy            string     `json:"merged_by,omitempty"`
	MergeCommitSHA      string     `json:"merge_commit_sha,omitempty"`
	BaseRef             string     `json:"base_ref"`
	HeadRef             string     `json:"head_ref"`
	HeadSHA             string     `json:"head_sha"`
	Additions           int        `json:"additions"`
	Deletions           int        `json:"deletions"`
	ChangedFiles        int        `json:"changed_files"`
	Draft               bool       `json:"draft,omitempty"`
	MaintainerCanModify bool       `json:"maintainer_can_modify,omitempty"`
}

PullRequest is the PR-only extension joined to an Issue on Number.

type RateBudget ¶

type RateBudget struct {
	// contains filtered or unexported fields
}

RateBudget bounds how many API points an export run may spend, the way the GraphQL API meters cost in points rather than requests. It is honest about exhaustion: once the budget is spent, Spend refuses further work so an export stops and checkpoints rather than hammering a throttled endpoint.

func NewRateBudget ¶

func NewRateBudget(total int) *RateBudget

NewRateBudget returns a budget of total points. As Remaining notes, a total <= 0 makes the budget unbounded.

func (*RateBudget) Remaining ¶

func (b *RateBudget) Remaining() int

Remaining reports how many points are left, or a large number when the budget is unbounded (total <= 0).

func (*RateBudget) Spend ¶

func (b *RateBudget) Spend(n int) bool

Spend charges n points, returning false when the budget cannot cover them. The caller checkpoints and stops on false rather than proceeding.

func (*RateBudget) Spent ¶

func (b *RateBudget) Spent() int

Spent reports how many points have been charged.
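
A minimal sketch of the budget semantics, using a local stand-in type. It assumes a refused Spend charges nothing and that "a large number" can be modeled as math.MaxInt; the per-page costs are made up.

```go
package main

import (
	"fmt"
	"math"
)

// rateBudget is a local stand-in: total <= 0 means unbounded.
type rateBudget struct {
	total int
	spent int
}

// remaining reports the points left, or math.MaxInt when unbounded.
func (b *rateBudget) remaining() int {
	if b.total <= 0 {
		return math.MaxInt
	}
	return b.total - b.spent
}

// spend charges n points, or returns false and charges nothing when the
// budget cannot cover them.
func (b *rateBudget) spend(n int) bool {
	if b.total > 0 && n > b.total-b.spent {
		return false
	}
	b.spent += n
	return true
}

func main() {
	budget := &rateBudget{total: 100}
	costs := []int{40, 40, 40} // hypothetical per-page point costs
	for page, cost := range costs {
		if !budget.spend(cost) {
			// A real run would persist its checkpoint here, then stop.
			fmt.Printf("stopping before page %d: spent %d, remaining %d\n",
				page, budget.spent, budget.remaining())
			break
		}
	}
}
```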

type ReactorPool ¶

type ReactorPool struct {
	Size int   `json:"size"`
	Seed int64 `json:"seed"`
}

ReactorPool is the synthetic identity pool reaction counts are materialized against. Size bounds how many reactor users exist; Seed fixes the assignment so two builds produce the same rows.
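
One way a fixed seed can make reaction rows reproducible is sketched below. The assignment scheme (a seeded permutation per reaction content, capped at the pool size) is an assumption for illustration, not the package's algorithm.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// reactionRow is one materialized reaction: a reactor index and a content.
type reactionRow struct {
	Reactor int
	Content string
}

// materialize turns per-content counts into rows against a pool of size
// reactors. Each content gets distinct reactors; counts above size are capped.
func materialize(counts map[string]int, size int, seed int64) []reactionRow {
	if size <= 0 {
		return nil
	}
	// Sort the contents: map iteration order is not deterministic.
	contents := make([]string, 0, len(counts))
	for c := range counts {
		contents = append(contents, c)
	}
	sort.Strings(contents)

	rng := rand.New(rand.NewSource(seed))
	var rows []reactionRow
	for _, c := range contents {
		n := counts[c]
		if n < 0 {
			n = 0
		}
		if n > size {
			n = size // assumed cap; a real build would record the drop
		}
		for _, idx := range rng.Perm(size)[:n] {
			rows = append(rows, reactionRow{Reactor: idx, Content: c})
		}
	}
	return rows
}

// sameRows reports whether two row slices are identical.
func sameRows(a, b []reactionRow) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	counts := map[string]int{"+1": 5, "heart": 2}
	first := materialize(counts, 200, 0x6e7e)
	second := materialize(counts, 200, 0x6e7e)
	fmt.Println(len(first), sameRows(first, second))
}
```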

type ReplayMode ¶

type ReplayMode string

ReplayMode names how a schedule's arrivals were generated.

const (
	// SyntheticMix drives the request mix at a fixed rate; it answers whether the
	// SLOs hold at a chosen size and rate.
	SyntheticMix ReplayMode = "synthetic-mix"
	// TraceDriven replays the real event timeline, time-compressed, so the load
	// carries the real burstiness rather than a smooth arrival.
	TraceDriven ReplayMode = "trace-driven"
)

type ReplayPlan ¶

type ReplayPlan struct {
	Repo           string             `json:"repo"`
	Mode           ReplayMode         `json:"mode"`
	Mix            RequestMix         `json:"mix,omitempty"`
	Compression    float64            `json:"compression,omitempty"`
	ReadWriteRatio int                `json:"read_write_ratio,omitempty"`
	Requests       []ScheduledRequest `json:"requests"`
}

ReplayPlan is the full schedule for one repo, plus the parameters that shaped it, so a run is reproducible and a reviewer can see why the load looks the way it does. Compression and ReadWriteRatio are recorded for the trace-driven mode per the no-silent-caps rule.

func PlanSyntheticMix ¶

func PlanSyntheticMix(repo string, mix RequestMix, count, rps int) ReplayPlan

PlanSyntheticMix builds a fixed-rate schedule that holds the repo's request mix exactly over count requests at rps requests per second. Arrivals are evenly spaced (the harness's open model adds no jitter of its own), and the class of each arrival is assigned by the largest-remainder method so the realized class counts match the mix proportions exactly and deterministically — no RNG, so two builds of the same plan are identical.
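
The largest-remainder step and the even spacing can be sketched as below. `apportion` and `offsets` are hypothetical helpers, the mix values are invented, and how the package interleaves classes across arrivals is not shown; ties between equal remainders are broken here by class name, which is an assumption.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// apportion splits count requests across classes in proportion to weights
// using the largest-remainder method. Weights are assumed non-negative.
func apportion(weights map[string]int, count int) map[string]int {
	out := make(map[string]int, len(weights))
	total := 0
	classes := make([]string, 0, len(weights))
	for c, w := range weights {
		total += w
		classes = append(classes, c)
	}
	if total <= 0 || count <= 0 {
		return out
	}
	sort.Strings(classes) // fixed base order so ties break deterministically

	rem := make(map[string]int, len(weights))
	assigned := 0
	for _, c := range classes {
		out[c] = count * weights[c] / total // integer quota
		rem[c] = count * weights[c] % total // remainder, scaled by total
		assigned += out[c]
	}
	// Hand the leftover requests to the classes with the largest remainders.
	sort.SliceStable(classes, func(i, j int) bool { return rem[classes[i]] > rem[classes[j]] })
	for i := 0; i < count-assigned; i++ {
		out[classes[i]]++
	}
	return out
}

// offsets returns count evenly spaced arrival offsets at rps per second.
func offsets(count, rps int) []time.Duration {
	if count <= 0 || rps <= 0 {
		return nil
	}
	interval := time.Second / time.Duration(rps)
	out := make([]time.Duration, count)
	for i := range out {
		out[i] = time.Duration(i) * interval
	}
	return out
}

func main() {
	mix := map[string]int{"X-cond": 60, "R-meta": 25, "R-git": 5, "T-git": 3, "W-meta": 7}
	fmt.Println(apportion(mix, 10))
	fmt.Println(offsets(3, 5))
}
```

With 10 requests, the integer quotas cover 8 of them and the two largest remainders (W-meta, then R-git) take the rest, so the class counts always sum to the requested total.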

func PlanTraceDriven ¶

func PlanTraceDriven(c *Corpus, compression float64, readWriteRatio int) ReplayPlan

PlanTraceDriven builds a schedule from the corpus's real event timeline. It extracts every state-changing event with its real timestamp (issue opened, comment added, PR merged, every timeline event), sorts by time, and time-compresses by compression so a year of history replays in a tractable wall-clock while the relative burstiness is preserved — a release-day spike stays a spike. Between writes it injects readWriteRatio reads of the repo's read classes, so the replay is not write-only. The compression factor and the ratio are carried on the plan so the run records them.
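
Time compression with injected reads can be sketched as follows. Where the reads land is an assumption (spread evenly across the gap to the next write, with none after the final write), and the class labels are fixed to one write class and one read class for brevity. `planTrace` and `demoWrites` are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

type request struct {
	Offset time.Duration
	Class  string
}

// planTrace compresses real write timestamps by the given factor and injects
// readsPerWrite reads into each gap between consecutive writes.
func planTrace(writes []time.Time, compression float64, readsPerWrite int) []request {
	if len(writes) == 0 || compression <= 0 {
		return nil
	}
	sorted := append([]time.Time(nil), writes...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Before(sorted[j]) })

	start := sorted[0]
	offset := func(t time.Time) time.Duration {
		// Dividing every gap by the same factor preserves relative burstiness.
		return time.Duration(float64(t.Sub(start)) / compression)
	}

	var plan []request
	for i, t := range sorted {
		at := offset(t)
		plan = append(plan, request{Offset: at, Class: "W-meta"})
		if i+1 == len(sorted) {
			break
		}
		gap := offset(sorted[i+1]) - at
		for r := 1; r <= readsPerWrite; r++ {
			readAt := at + gap*time.Duration(r)/time.Duration(readsPerWrite+1)
			plan = append(plan, request{Offset: readAt, Class: "R-meta"})
		}
	}
	return plan
}

// demoWrites returns three made-up write timestamps: a burst, then a lull.
func demoWrites() []time.Time {
	t0 := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	return []time.Time{t0, t0.Add(time.Hour), t0.Add(25 * time.Hour)}
}

func main() {
	for _, r := range planTrace(demoWrites(), 3600, 1) {
		fmt.Println(r.Offset, r.Class)
	}
}
```

At a compression of 3600, an hour of history replays in one second, and the 24-hour lull after the burst still replays as a gap 24 times longer than the burst itself.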

type RepoManifest ¶

type RepoManifest struct {
	Repo       RepoRef        `json:"repo"`
	Provenance Provenance     `json:"provenance"`
	Rows       map[string]int `json:"rows,omitempty"`
	// GitBytes and TreeEntries are the measured git artifact; zero until a
	// mirror is cloned and measured.
	GitBytes    int64 `json:"git_bytes,omitempty"`
	TreeEntries int   `json:"tree_entries,omitempty"`
}

RepoManifest pins one repository's git side and records the measured row counts of its metadata, with the provenance of the whole entry.

type RepoRef ¶

type RepoRef struct {
	Owner         string `json:"owner"`
	Name          string `json:"name"`
	DefaultBranch string `json:"default_branch"`
	MirrorURL     string `json:"mirror_url,omitempty"`
	PinnedSHA     string `json:"pinned_sha,omitempty"`
}

RepoRef identifies one repository in a corpus and pins its git history. The owner and name preserve the real namespace so URLs match real GitHub paths; MirrorURL and PinnedSHA freeze the git side the way DatasetRevision freezes the metadata side.

func ReadRepoRef ¶

func ReadRepoRef(root string, ref RepoRef) RepoRef

ReadRepoRef reads only the pinned RepoRef (repo.json) for one repo in a snapshot, without loading any table. It lets a streaming caller record the pinned git side in a manifest without materializing the corpus; a missing or unreadable repo.json falls back to the ref passed in.

func (RepoRef) NWO ¶

func (r RepoRef) NWO() string

NWO returns the owner/name form used in paths and logs.

type RequestMix ¶

type RequestMix map[OpClass]int

RequestMix is the per-class weighting of a repo's load, in whole-number percentages that sum to 100. It is a MODELED shape: it is derived from the repo's real read/write character and the subsystem that repo stresses, not measured per request, so it is recorded in the manifest as such.

func MixFor ¶

func MixFor(nwo string) RequestMix

MixFor returns the request mix for a repo by owner/name, falling back to the aggregate platform mix.

type Review ¶

type Review struct {
	ID          int64      `json:"id"`
	PRNumber    int64      `json:"pr_number"`
	Author      string     `json:"author"`
	State       string     `json:"state"`
	Body        string     `json:"body,omitempty"`
	SubmittedAt *time.Time `json:"submitted_at,omitempty"`
	CommitID    string     `json:"commit_id,omitempty"`
}

Review is one act of reviewing a pull request.

type ReviewComment ¶

type ReviewComment struct {
	ID          int64     `json:"id"`
	PRNumber    int64     `json:"pr_number"`
	ReviewID    int64     `json:"review_id"`
	Author      string    `json:"author"`
	Body        string    `json:"body"`
	Path        string    `json:"path"`
	Line        *int64    `json:"line,omitempty"`
	Side        string    `json:"side,omitempty"`
	DiffHunk    string    `json:"diff_hunk,omitempty"`
	CreatedAt   time.Time `json:"created_at"`
	UpdatedAt   time.Time `json:"updated_at"`
	InReplyToID *int64    `json:"in_reply_to_id,omitempty"`
}

ReviewComment is one inline review comment anchored to a diff line. ReviewID joins it to its review; InReplyToID threads a reply under the comment that started a conversation, so threads reassemble.

type ScheduledRequest ¶

type ScheduledRequest struct {
	Offset time.Duration `json:"offset"`
	Class  OpClass       `json:"class"`
	Repo   string        `json:"repo"`
	Number int64         `json:"number,omitempty"`
}

ScheduledRequest is one planned request: its offset from the start of the replay, its operation class, the repo it targets, and the subject it touches (an issue or PR number for a metadata op, zero and therefore omitted from the JSON for a transport op). It is the unit the load harness fires.

type SeedResult ¶

type SeedResult struct {
	RepoPK  int64
	Rows    map[string]int
	Dropped []DropNote
}

SeedResult reports what a corpus seed wrote, so the caller can fold it into the manifest as the measured artifact.

func SeedCorpus ¶

func SeedCorpus(ctx context.Context, st *store.Store, c *Corpus, reactor ReactorPool) (*SeedResult, error)

SeedCorpus writes one corpus into a store through the bulk-seed write path, preserving every per-repo number and timestamp. The whole repository loads in one transaction so it lands whole or rolls back. The caller migrates the store first (or passes a migrated one); SeedCorpus does not migrate, so a multi-repo seed shares one schema.

Determinism: the tables are seeded in a fixed order (issues by number, comments and reviews by id, and so on), so the db_id sequence advances the same way on every run, and reactions are materialized against the bounded reactor pool with a fixed assignment, so two seeds of the same corpus produce identical databases.
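
The fixed-order argument can be illustrated with a toy allocator. It assumes a single id sequence shared across tables and shows only two tables; `allocate` and the local `comment` type are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

type comment struct {
	ID          int64
	IssueNumber int64
}

// allocate hands out sequential db ids: issues by number first, then
// comments by dataset id. Input order does not affect the result.
func allocate(issueNumbers []int64, comments []comment) (map[int64]int64, map[int64]int64) {
	nums := append([]int64(nil), issueNumbers...)
	sort.Slice(nums, func(i, j int) bool { return nums[i] < nums[j] })
	cs := append([]comment(nil), comments...)
	sort.Slice(cs, func(i, j int) bool { return cs[i].ID < cs[j].ID })

	next := int64(1) // assumed: one sequence shared by every table
	issueIDs := make(map[int64]int64, len(nums))
	for _, n := range nums {
		issueIDs[n] = next
		next++
	}
	commentIDs := make(map[int64]int64, len(cs))
	for _, c := range cs {
		commentIDs[c.ID] = next
		next++
	}
	return issueIDs, commentIDs
}

func main() {
	a, b := allocate([]int64{7, 2}, []comment{{ID: 900, IssueNumber: 7}, {ID: 100, IssueNumber: 2}})
	c, d := allocate([]int64{2, 7}, []comment{{ID: 100, IssueNumber: 2}, {ID: 900, IssueNumber: 7}})
	fmt.Println(a, b)
	fmt.Println(c, d)
}
```

Both calls print the same maps even though their inputs arrive in different orders, which is the property that lets two seeds of one corpus produce identical databases.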

func SeedSnapshot ¶

func SeedSnapshot(ctx context.Context, st *store.Store, root string, ref RepoRef, reactor ReactorPool) (*SeedResult, error)

SeedSnapshot seeds one repo's corpus straight from its on-disk snapshot, streaming each table from disk and releasing it before the next, so the seeder never holds the whole corpus in memory at once. Peak memory is one table plus the foreign-key resolution maps, not the sum of every table's rows, which is what lets a multi-hundred-thousand-row repo seed without loading gigabytes of bodies into RAM. It is the scale counterpart to SeedCorpus, which takes an already-materialized corpus and is the path the in-memory and pseudonymized flows use.

type TimelineEvent ¶

type TimelineEvent struct {
	ID          int64          `json:"id"`
	IssueNumber int64          `json:"issue_number"`
	EventType   string         `json:"event_type"`
	Actor       string         `json:"actor,omitempty"`
	CreatedAt   time.Time      `json:"created_at"`
	LabelName   string         `json:"label_name,omitempty"`
	LabelColor  string         `json:"label_color,omitempty"`
	Assignee    string         `json:"assignee_login,omitempty"`
	Milestone   string         `json:"milestone_title,omitempty"`
	TitleFrom   string         `json:"title_from,omitempty"`
	TitleTo     string         `json:"title_to,omitempty"`
	RefType     string         `json:"ref_type,omitempty"`
	RefNumber   int64          `json:"ref_number,omitempty"`
	LockReason  string         `json:"lock_reason,omitempty"`
	Data        map[string]any `json:"data,omitempty"`
}

TimelineEvent is one lifecycle event. EventType maps to the issue_events event column; the typed columns and Data blob render into the event payload. This is the largest table in an automation-heavy corpus.
