scrape

package
v1.1.0 Latest
Warning

This package is not in the latest version of its module.

Published: Feb 19, 2026 License: MIT Imports: 11 Imported by: 0

README

internal/scrape

Logic overview

The scrape package coordinates per-organization scrape execution and defines the productivity metric contract.

  • Manager runs one scrape per configured org in parallel goroutines.
  • Outcome captures result+error per org so failures remain isolated.
  • GitHubOrgScraper performs repo iteration per org with bounded worker concurrency.
  • 24h activity metrics are scraped from commits, pull requests, pull reviews, and issue comments endpoints.
  • LOC collection uses /stats/contributors as primary and commit-detail fallback when large-repo zeroing is detected.
  • Optional checkpoint persistence extends scrape windows after outages and advances checkpoints on successful repo windows.
  • Partial repo failures are returned as missed windows so runtime can enqueue targeted backfill.
  • NoopOrgScraper is a safe default implementation used during bootstrap/testing.

API reference

Types
  • OrgResult: scrape output for one organization (Metrics, MissedWindow entries for partial failures, and a Summary of debug counters).
  • MissedWindow: failed repo window metadata (org, repo, window_start, window_end, reason) used for backfill enqueue.
  • Outcome: per-org wrapper with org name, result, and error.
  • OrgScraper: org scrape interface.
  • CheckpointStore: checkpoint persistence contract (SetCheckpoint, GetCheckpoint).
  • CheckpointAwareScraper: optional interface for runtime checkpoint-store injection.
  • Manager: parallel org scrape coordinator.
  • GitHubDataClient: typed GitHub endpoint interface used by GitHubOrgScraper.
  • GitHubOrgScraperConfig: behavior config for LOC fallback, budgets, and time hooks.
  • GitHubOrgScraper: production org scraper backed by per-org GitHub App clients.
  • NoopOrgScraper: no-op implementation of OrgScraper.
  • LabelOrg, LabelRepo, LabelUser: required productivity label keys.
  • MetricActivity*: stable productivity metric name constants.
  • BackfillScraper: optional interface for targeted re-scraping of missed windows (ScrapeBackfill).
  • OrgSummary: per-organization scrape debug counters carried on OrgResult.
Functions
  • NewManager(scraper OrgScraper, orgs []config.GitHubOrgConfig) *Manager: constructs a manager.
  • NewGitHubOrgScraper(clients map[string]GitHubDataClient, cfg GitHubOrgScraperConfig) *GitHubOrgScraper: constructs the production org scraper.
  • NewOrgScraperFromConfig(cfg *config.Config) (OrgScraper, error): builds per-org GitHub App clients from config and returns an org scraper.
  • ProductivityMetricNames() []string: returns stable supported productivity metric names.
  • IsProductivityMetric(name string) bool: validates metric names against the contract.
  • RequiredLabels(org, repo, user string) map[string]string: builds the enforced org/repo/user label map.
  • NewProductivityMetric(name, org, repo, user string, value float64, updatedAt time.Time) (store.MetricPoint, error): creates validated productivity metric points.
  • ValidateProductivityMetric(point store.MetricPoint) error: validates point contract compliance.
Methods
  • (*Manager) ScrapeAll(ctx context.Context) []Outcome: executes one parallel scrape pass for all configured orgs.
  • (*Manager) ScrapeOrg(ctx context.Context, org config.GitHubOrgConfig) Outcome: performs one scrape pass for a single configured organization.
  • (*Manager) ScrapeBackfill(ctx context.Context, org, repo string, windowStart, windowEnd time.Time, reason string) (OrgResult, error): re-scrapes one missed org/repo window.
  • (*GitHubOrgScraper) ScrapeOrg(ctx context.Context, org config.GitHubOrgConfig) (OrgResult, error): scrapes one org, emits metrics, and reports missed windows for partial failures.
  • (*GitHubOrgScraper) SetCheckpointStore(checkpoints CheckpointStore): injects checkpoint persistence after scraper construction.
  • (*GitHubOrgScraper) ScrapeBackfill(ctx context.Context, org config.GitHubOrgConfig, repo string, windowStart, windowEnd time.Time, reason string) (OrgResult, error): re-scrapes one missed org/repo window.
  • (*NoopOrgScraper) ScrapeOrg(ctx context.Context, org config.GitHubOrgConfig) (OrgResult, error): returns empty output with no error.

Documentation

Constants

const (
	// LabelOrg is the required organization label key.
	LabelOrg = "org"
	// LabelRepo is the required repository label key.
	LabelRepo = "repo"
	// LabelUser is the required contributor label key.
	LabelUser = "user"

	// UnknownLabelValue is used when a required label is blank.
	UnknownLabelValue = "unknown"

	// MetricActivityCommits24h is the rolling 24h commit activity gauge.
	MetricActivityCommits24h = "gh_activity_commits_24h"
	// MetricActivityPROpened24h is the rolling 24h opened PR activity gauge.
	MetricActivityPROpened24h = "gh_activity_prs_opened_24h"
	// MetricActivityPRMerged24h is the rolling 24h merged PR activity gauge.
	MetricActivityPRMerged24h = "gh_activity_prs_merged_24h"
	// MetricActivityReviewsSubmitted24h is the rolling 24h submitted review activity gauge.
	MetricActivityReviewsSubmitted24h = "gh_activity_reviews_submitted_24h"
	// MetricActivityIssueComments24h is the rolling 24h issue comment activity gauge.
	MetricActivityIssueComments24h = "gh_activity_issue_comments_24h"
	// MetricActivityLOCAddedWeekly is the weekly LOC additions gauge.
	MetricActivityLOCAddedWeekly = "gh_activity_loc_added_weekly"
	// MetricActivityLOCRemovedWeekly is the weekly LOC removals gauge.
	MetricActivityLOCRemovedWeekly = "gh_activity_loc_removed_weekly"
	// MetricActivityLastEventUnixTime is the timestamp of the latest activity event.
	MetricActivityLastEventUnixTime = "gh_activity_last_event_unixtime"
)
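The constants above form a closed allowlist. Below is a sketch of the membership check that IsProductivityMetric is documented to perform; the real ProductivityMetricNames()/IsProductivityMetric pair is the source of truth, and this local copy is purely illustrative:

```go
package main

import "fmt"

// productivityMetricNames is a local copy of the stable name set from
// the documented MetricActivity* constants.
var productivityMetricNames = []string{
	"gh_activity_commits_24h",
	"gh_activity_prs_opened_24h",
	"gh_activity_prs_merged_24h",
	"gh_activity_reviews_submitted_24h",
	"gh_activity_issue_comments_24h",
	"gh_activity_loc_added_weekly",
	"gh_activity_loc_removed_weekly",
	"gh_activity_last_event_unixtime",
}

// isProductivityMetric sketches the documented contract check: exact
// membership in the stable allowlist, nothing pattern-based.
func isProductivityMetric(name string) bool {
	for _, n := range productivityMetricNames {
		if n == name {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isProductivityMetric("gh_activity_commits_24h")) // true
	fmt.Println(isProductivityMetric("gh_activity_stars_24h"))   // false
}
```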

Variables

This section is empty.

Functions

func IsProductivityMetric

func IsProductivityMetric(name string) bool

IsProductivityMetric reports whether a metric name is supported by the scrape contract.

func NewProductivityMetric

func NewProductivityMetric(name, org, repo, user string, value float64, updatedAt time.Time) (store.MetricPoint, error)

NewProductivityMetric builds a productivity metric point with required labels.

func ProductivityMetricNames

func ProductivityMetricNames() []string

ProductivityMetricNames returns the stable set of supported productivity metric names.

func RequiredLabels

func RequiredLabels(org, repo, user string) map[string]string

RequiredLabels builds the enforced org/repo/user label map.
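A sketch of the documented label behavior, assuming blank values are replaced with UnknownLabelValue as the constant's doc comment implies; the substitution detail is inferred from that comment, not confirmed against the implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// Label keys and the blank-value placeholder, copied from the package's
// documented constants.
const (
	LabelOrg          = "org"
	LabelRepo         = "repo"
	LabelUser         = "user"
	UnknownLabelValue = "unknown"
)

// RequiredLabels sketches the documented contract: every metric point
// carries org/repo/user labels, and blank values are replaced with
// "unknown" rather than dropped.
func RequiredLabels(org, repo, user string) map[string]string {
	orDefault := func(v string) string {
		if strings.TrimSpace(v) == "" {
			return UnknownLabelValue
		}
		return v
	}
	return map[string]string{
		LabelOrg:  orDefault(org),
		LabelRepo: orDefault(repo),
		LabelUser: orDefault(user),
	}
}

func main() {
	fmt.Println(RequiredLabels("acme", "api", ""))
}
```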

func ValidateProductivityMetric

func ValidateProductivityMetric(point store.MetricPoint) error

ValidateProductivityMetric validates that a metric point conforms to the scrape output contract.

Types

type BackfillScraper

type BackfillScraper interface {
	ScrapeBackfill(
		ctx context.Context,
		org config.GitHubOrgConfig,
		repo string,
		windowStart time.Time,
		windowEnd time.Time,
		reason string,
	) (OrgResult, error)
}

BackfillScraper provides backfill-window specific scraping behavior.

type CheckpointAwareScraper

type CheckpointAwareScraper interface {
	SetCheckpointStore(checkpoints CheckpointStore)
}

CheckpointAwareScraper allows runtime to inject checkpoint persistence.

type CheckpointStore

type CheckpointStore interface {
	SetCheckpoint(org, repo string, checkpoint time.Time) error
	GetCheckpoint(org, repo string) (time.Time, bool, error)
}

CheckpointStore persists per-org/repo scrape progress.

type GitHubDataClient

type GitHubDataClient interface {
	ListOrgRepos(ctx context.Context, org string) (githubapi.OrgReposResult, error)
	GetContributorStats(ctx context.Context, owner, repo string) (githubapi.ContributorStatsResult, error)
	ListRepoCommitsWindow(ctx context.Context, owner, repo string, since, until time.Time, maxCommits int) (githubapi.CommitListResult, error)
	ListRepoPullRequestsWindow(ctx context.Context, owner, repo string, since, until time.Time) (githubapi.PullRequestListResult, error)
	ListPullReviews(ctx context.Context, owner, repo string, pullNumber int, since, until time.Time) (githubapi.PullReviewsResult, error)
	ListIssueCommentsWindow(ctx context.Context, owner, repo string, since, until time.Time) (githubapi.IssueCommentsResult, error)
	GetCommit(ctx context.Context, owner, repo, sha string) (githubapi.CommitDetail, error)
}

GitHubDataClient is the typed GitHub API interface consumed by the org scraper.

type GitHubOrgScraper

type GitHubOrgScraper struct {
	// contains filtered or unexported fields
}

GitHubOrgScraper implements OrgScraper using typed GitHub API clients.

func NewGitHubOrgScraper

func NewGitHubOrgScraper(clients map[string]GitHubDataClient, cfg GitHubOrgScraperConfig) *GitHubOrgScraper

NewGitHubOrgScraper creates a production org scraper over per-org GitHub clients.

func (*GitHubOrgScraper) ScrapeBackfill

func (s *GitHubOrgScraper) ScrapeBackfill(
	ctx context.Context,
	org config.GitHubOrgConfig,
	repo string,
	windowStart time.Time,
	windowEnd time.Time,
	reason string,
) (OrgResult, error)

ScrapeBackfill re-scrapes one missed org/repo window.

func (*GitHubOrgScraper) ScrapeOrg

func (s *GitHubOrgScraper) ScrapeOrg(ctx context.Context, org config.GitHubOrgConfig) (OrgResult, error)

ScrapeOrg scrapes one organization and returns metrics plus missed windows for partial failures.

func (*GitHubOrgScraper) SetCheckpointStore

func (s *GitHubOrgScraper) SetCheckpointStore(checkpoints CheckpointStore)

SetCheckpointStore injects or replaces checkpoint persistence for the scraper.

type GitHubOrgScraperConfig

type GitHubOrgScraperConfig struct {
	LOCRefreshInterval                        time.Duration
	FallbackEnabled                           bool
	FallbackMaxCommitsPerRepoPerWeek          int
	FallbackMaxCommitDetailCallsPerOrgPerHour int
	LargeRepoZeroDetectionWindows             int
	LargeRepoCooldown                         time.Duration
	Checkpoints                               CheckpointStore
	Now                                       func() time.Time
	Sleep                                     func(time.Duration)
}

GitHubOrgScraperConfig configures GitHub-backed org scraping behavior.

type Manager

type Manager struct {
	// contains filtered or unexported fields
}

Manager executes organization scraping.

func NewManager

func NewManager(scraper OrgScraper, orgs []config.GitHubOrgConfig) *Manager

NewManager creates a scrape manager.

func (*Manager) ScrapeAll

func (m *Manager) ScrapeAll(ctx context.Context) []Outcome

ScrapeAll performs one parallel scrape pass across configured organizations.

func (*Manager) ScrapeBackfill

func (m *Manager) ScrapeBackfill(
	ctx context.Context,
	org string,
	repo string,
	windowStart time.Time,
	windowEnd time.Time,
	reason string,
) (OrgResult, error)

ScrapeBackfill re-scrapes one missed org/repo window.

func (*Manager) ScrapeOrg

func (m *Manager) ScrapeOrg(ctx context.Context, org config.GitHubOrgConfig) Outcome

ScrapeOrg performs one scrape pass for a single configured organization.

type MissedWindow

type MissedWindow struct {
	Org         string
	Repo        string
	WindowStart time.Time
	WindowEnd   time.Time
	Reason      string
}

MissedWindow describes a failed scrape window that can be backfilled later.

type NoopOrgScraper

type NoopOrgScraper struct{}

NoopOrgScraper is a placeholder scraper implementation for initial bootstrapping.

func (*NoopOrgScraper) ScrapeOrg

func (s *NoopOrgScraper) ScrapeOrg(ctx context.Context, org config.GitHubOrgConfig) (OrgResult, error)

ScrapeOrg returns an empty result without error.

type OrgResult

type OrgResult struct {
	Metrics      []store.MetricPoint
	MissedWindow []MissedWindow
	Summary      OrgSummary
}

OrgResult is the scrape output for one organization.

type OrgScraper

type OrgScraper interface {
	ScrapeOrg(ctx context.Context, org config.GitHubOrgConfig) (OrgResult, error)
}

OrgScraper scrapes one organization.

func NewOrgScraperFromConfig

func NewOrgScraperFromConfig(cfg *config.Config) (OrgScraper, error)

NewOrgScraperFromConfig builds a GitHubOrgScraper using per-org GitHub App credentials from config.

type OrgSummary

type OrgSummary struct {
	ReposDiscovered         int
	ReposTargeted           int
	ReposProcessed          int
	ReposStatsAccepted      int
	ReposStatsForbidden     int
	ReposStatsNotFound      int
	ReposStatsConflict      int
	ReposStatsUnprocessable int
	ReposStatsUnavailable   int
	ReposNoCompleteWeek     int
	ReposFallbackUsed       int
	ReposFallbackTruncated  int
	MissedWindows           int
	MetricsProduced         int
	RateLimitMinRemaining   int
	RateLimitResetUnix      int64
	SecondaryLimitHits      int
	GitHubRequestTotals     map[string]int
	LOCFallbackBudgetHits   int
}

OrgSummary provides per-organization scrape debug counters.

type Outcome

type Outcome struct {
	Org    string
	Result OrgResult
	Err    error
}

Outcome contains scrape results and errors for one organization.
