github

package
v0.2.3

This package is not in the latest version of its module.
Published: Dec 17, 2025 License: Apache-2.0 Imports: 21 Imported by: 0

Documentation

Overview

Package github implements a connector for GitHub repositories.

This connector indexes all repositories accessible to the authenticated user, including owned repositories, collaborator repositories, and organisation member repositories. Content types indexed include repository files, issues, pull requests, and wiki pages.

Architecture

The connector follows the driven port pattern defined in driven.Connector. It comprises the following components:

  • Connector: orchestrates sync operations and manages lifecycle
  • Client: handles GitHub API communication with rate limiting
  • Config: parses and validates source configuration
  • Cursor: tracks incremental sync state per repository

Authentication

Two authentication methods are supported:

  • Personal Access Tokens (PAT): classic or fine-grained tokens created at github.com/settings/tokens. Requires 'repo' scope for private repositories.

  • OAuth App: tokens obtained via the OAuth 2.0 authorisation code flow. The application must be registered at github.com/settings/developers.

Both methods provide 5,000 API requests per hour for authenticated users. Unauthenticated requests are limited to 60 per hour and are not supported.

Configuration

Source configuration accepts the following keys:

  • content_types: comma-separated list of content to index. Valid values: files, issues, prs, wikis. Default: all types.

  • file_patterns: comma-separated glob patterns for file filtering. Example: "*.go,*.md". Default: all files.

No repository specification is required. The connector automatically discovers and indexes all repositories accessible to the authenticated user.
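
As a rough illustration of how the two keys could be interpreted, the sketch below splits the comma-separated values, validates content types, and applies glob patterns. The helper names (splitList, parseContentTypes, matchesAny) are hypothetical and are not part of this package. Matching patterns against the file's base name is an assumption; the package may match full paths instead.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// validContentTypes lists the values documented for content_types.
var validContentTypes = map[string]bool{
	"files": true, "issues": true, "prs": true, "wikis": true,
}

// splitList splits a comma-separated config value, trimming whitespace
// and dropping empty entries.
func splitList(value string) []string {
	var out []string
	for _, part := range strings.Split(value, ",") {
		if p := strings.TrimSpace(part); p != "" {
			out = append(out, p)
		}
	}
	return out
}

// parseContentTypes validates content_types; an empty value means all types.
func parseContentTypes(value string) ([]string, error) {
	types := splitList(value)
	if len(types) == 0 {
		return []string{"files", "issues", "prs", "wikis"}, nil
	}
	for _, t := range types {
		if !validContentTypes[t] {
			return nil, fmt.Errorf("invalid content type %q", t)
		}
	}
	return types, nil
}

// matchesAny reports whether the file's base name matches any glob pattern.
// An empty pattern list matches every file (assumed default behaviour).
func matchesAny(patterns []string, filePath string) bool {
	if len(patterns) == 0 {
		return true
	}
	for _, p := range patterns {
		if ok, err := path.Match(p, path.Base(filePath)); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	types, err := parseContentTypes("files, issues")
	fmt.Println(types, err)

	patterns := splitList("*.go,*.md")
	fmt.Println(matchesAny(patterns, "cmd/app/main.go"))
	fmt.Println(matchesAny(patterns, "assets/logo.png"))
}
```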

Rate Limiting

The connector implements a dual-strategy rate limiting approach:

  1. Proactive throttling: a token bucket algorithm limits requests to approximately 1.2 requests per second (about 4,320 per hour), which stays below the 5,000/hour limit whilst keeping throughput high.

  2. Reactive handling: the connector monitors X-RateLimit-Remaining and X-RateLimit-Reset headers. When limits are exhausted, it waits until the reset time before continuing.

Secondary rate limits (abuse detection) are handled with exponential backoff.
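
The two strategies can be sketched with the standard library alone. This is not the package's RateLimiter: tokenBucket, reserve, and resetWait are hypothetical names, the caller passes elapsed time explicitly so the behaviour is deterministic, and treating MinBuffer as the threshold at which the connector pauses is one reading of the constant's description.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

const (
	proactiveRate = 1.2 // requests per second, as documented above
	minBuffer     = 100 // mirrors the package's MinBuffer constant
)

// tokenBucket refills at rate tokens per second up to burst; each request
// consumes one token.
type tokenBucket struct {
	rate, burst, tokens float64
}

func newTokenBucket(rate, burst float64) *tokenBucket {
	return &tokenBucket{rate: rate, burst: burst, tokens: burst}
}

// reserve credits the time elapsed since the previous call, consumes one
// token, and returns how long to sleep before sending the request.
func (b *tokenBucket) reserve(elapsed time.Duration) time.Duration {
	b.tokens += elapsed.Seconds() * b.rate
	if b.tokens > b.burst {
		b.tokens = b.burst
	}
	b.tokens--
	if b.tokens >= 0 {
		return 0
	}
	return time.Duration(-b.tokens / b.rate * float64(time.Second))
}

// resetWait inspects the rate limit headers and returns how long to pause.
// It returns zero while more than minBuffer requests remain.
func resetWait(h http.Header, nowUnix int64) (time.Duration, error) {
	remaining, err := strconv.Atoi(h.Get("X-RateLimit-Remaining"))
	if err != nil {
		return 0, fmt.Errorf("parse remaining: %w", err)
	}
	if remaining > minBuffer {
		return 0, nil
	}
	resetUnix, err := strconv.ParseInt(h.Get("X-RateLimit-Reset"), 10, 64)
	if err != nil {
		return 0, fmt.Errorf("parse reset: %w", err)
	}
	if resetUnix <= nowUnix {
		return 0, nil
	}
	return time.Duration(resetUnix-nowUnix) * time.Second, nil
}

func main() {
	fmt.Printf("proactive budget: %.0f requests/hour\n", proactiveRate*3600)

	b := newTokenBucket(proactiveRate, 1)
	fmt.Println(b.reserve(0), b.reserve(0))

	h := http.Header{}
	h.Set("X-RateLimit-Remaining", "50")
	h.Set("X-RateLimit-Reset", "1060")
	fmt.Println(resetWait(h, 1000))
}
```

A burst of 1 keeps requests evenly spaced; a larger burst would allow short spikes at the cost of less predictable pacing.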

Sync Operations

Full sync retrieves all content from all accessible repositories. For each repository, the connector:

  1. Fetches the repository tree using the recursive Trees API
  2. Retrieves blob content for each file matching configured patterns
  3. Fetches issues and pull requests with their comments
  4. Retrieves wiki pages if the repository has a wiki

Incremental sync uses cursors to track sync state. The cursor stores:

  • Tree SHA: detects file changes by comparing against the current HEAD
  • Timestamps: filters issues and PRs updated since the last sync
  • Wiki SHA: tracks wiki repository changes

Each repository maintains independent cursor state, enabling partial syncs to resume from where they left off.
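
To illustrate the tree SHA comparison, the following sketch decides per repository whether files need to be fetched again. The types repoState and plan and the function planSync are hypothetical simplifications; timestamps and wiki state are left out for brevity.

```go
package main

import (
	"fmt"
	"sort"
)

// repoState is a simplified stand-in for per-repository cursor state.
type repoState struct {
	TreeSHA string // tree SHA recorded by the previous sync
}

// plan describes what an incremental pass would do for one repository.
type plan struct {
	Repo      string
	SyncFiles bool
}

// planSync compares stored tree SHAs against current ones. Repositories with
// no stored state, or whose tree SHA changed, are marked for a file sync.
func planSync(stored map[string]repoState, current map[string]string) []plan {
	names := make([]string, 0, len(current))
	for name := range current {
		names = append(names, name)
	}
	sort.Strings(names) // stable output order

	plans := make([]plan, 0, len(names))
	for _, name := range names {
		prev, seen := stored[name]
		plans = append(plans, plan{
			Repo:      name,
			SyncFiles: !seen || prev.TreeSHA != current[name],
		})
	}
	return plans
}

func main() {
	stored := map[string]repoState{
		"octocat/a": {TreeSHA: "abc"},
		"octocat/b": {TreeSHA: "old"},
	}
	current := map[string]string{
		"octocat/a": "abc", // unchanged
		"octocat/b": "new", // changed
		"octocat/c": "xyz", // never synced
	}
	for _, p := range planSync(stored, current) {
		fmt.Printf("%s sync files: %v\n", p.Repo, p.SyncFiles)
	}
}
```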

Document Structure

Documents are emitted with the following URI patterns:

  • Files: github://{owner}/{repo}/blob/{path}
  • Issues: github://{owner}/{repo}/issues/{number}
  • Pull Requests: github://{owner}/{repo}/pull/{number}
  • Wiki Pages: github://{owner}/{repo}/wiki/{page}

Metadata includes repository information, file paths, issue/PR state, labels, and timestamps.
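
The URI patterns can be reproduced with a small formatter. The docKind type and buildURI function are hypothetical names used only for illustration; the owner and repository values are placeholders.

```go
package main

import "fmt"

// docKind selects the path segment used in the URI.
type docKind string

const (
	kindFile  docKind = "blob"
	kindIssue docKind = "issues"
	kindPull  docKind = "pull"
	kindWiki  docKind = "wiki"
)

// buildURI formats a document URI following the patterns listed above.
// id is the file path, the issue or PR number, or the wiki page name.
func buildURI(owner, repo string, kind docKind, id string) string {
	return fmt.Sprintf("github://%s/%s/%s/%s", owner, repo, kind, id)
}

func main() {
	fmt.Println(buildURI("octocat", "hello-world", kindFile, "docs/README.md"))
	fmt.Println(buildURI("octocat", "hello-world", kindIssue, "42"))
	fmt.Println(buildURI("octocat", "hello-world", kindPull, "7"))
	fmt.Println(buildURI("octocat", "hello-world", kindWiki, "Home"))
}
```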

Error Handling

The connector distinguishes between recoverable and fatal errors:

  • Rate limit errors: automatically retried after waiting
  • Network errors: retried with exponential backoff
  • Authentication errors: reported immediately as domain.ErrAuthInvalid
  • Permission errors: logged and skipped (repository continues)
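
A minimal sketch of this classification, assuming hypothetical sentinel errors (errNetwork, errAuth, errPermission) in place of the connector's real error values. The retry count and initial delay mirror MaxRetries and RetryDelay; doubling the delay on each attempt is an assumed backoff schedule. Rate limit waits are omitted here.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical sentinel errors standing in for the connector's error classes.
var (
	errNetwork    = errors.New("network error")
	errAuth       = errors.New("authentication error")
	errPermission = errors.New("permission error")
)

const (
	maxRetries = 3           // mirrors MaxRetries
	retryDelay = time.Second // mirrors RetryDelay
)

// backoff returns the delay before retry number attempt (0-based),
// doubling each time: 1s, 2s, 4s.
func backoff(attempt int) time.Duration {
	return retryDelay << attempt
}

// noSleep is a stand-in for time.Sleep that returns immediately.
func noSleep(time.Duration) {}

// run calls op, retrying network errors with exponential backoff.
// sleep is injected so the sketch can be exercised without real delays.
func run(op func() error, sleep func(time.Duration)) error {
	for attempt := 0; ; attempt++ {
		err := op()
		switch {
		case err == nil:
			return nil
		case errors.Is(err, errPermission):
			return nil // skipped; a real connector would log this
		case errors.Is(err, errNetwork) && attempt < maxRetries:
			sleep(backoff(attempt))
		default:
			return err // auth errors and exhausted retries surface to the caller
		}
	}
}

func main() {
	attempts := 0
	err := run(func() error {
		attempts++
		if attempts < 3 {
			return fmt.Errorf("dial: %w", errNetwork)
		}
		return nil
	}, noSleep)
	fmt.Println(attempts, err)
}
```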

Limitations

  • Binary files are not indexed (text content only)
  • File size limit: 1MB per file (GitHub API constraint)
  • Watch mode is not supported (no webhook integration in CLI)
  • Private repository access requires appropriate token scopes

Example Usage

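cfg, err := github.ParseConfig(source)
if err != nil {
    return err
}
connector := github.New(source.ID, cfg, tokenProvider)

if err := connector.Validate(ctx); err != nil {
    return err
}

docs, errs := connector.FullSync(ctx)
for doc := range docs {
    _ = doc // process the document here
}
// Assumption: errs is closed, or holds at most one buffered error,
// once docs has been closed.
if err := <-errs; err != nil {
    return err
}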

Constants

const (
	// DefaultTimeout is the default HTTP request timeout.
	DefaultTimeout = 30 * time.Second

	// MaxRetries is the maximum number of retries for transient errors.
	MaxRetries = 3

	// RetryDelay is the initial delay between retries.
	RetryDelay = time.Second
)
const (
	// GitHubRateLimit is the authenticated rate limit (5000/hour).
	GitHubRateLimit = 5000

	// ProactiveRate is the proactive throttle rate (~1.2 req/sec = 4320/hr).
	ProactiveRate = 1.2

	// MinBuffer is the minimum remaining requests before waiting for reset.
	MinBuffer = 100

	// HeaderRateLimit is the rate limit header.
	HeaderRateLimit = "X-RateLimit-Limit"

	// HeaderRateRemaining is the remaining requests header.
	HeaderRateRemaining = "X-RateLimit-Remaining"

	// HeaderRateReset is the reset timestamp header (Unix seconds).
	HeaderRateReset = "X-RateLimit-Reset"

	// HeaderRetryAfter is the retry-after header (seconds).
	HeaderRetryAfter = "Retry-After"
)
const CursorVersion = 1

CursorVersion is the current cursor schema version.

const MIMETypeGitHubIssue = "application/vnd.github.issue+json"

MIMETypeGitHubIssue is the custom MIME type for GitHub issues.

const MIMETypeGitHubPull = "application/vnd.github.pull+json"

MIMETypeGitHubPull is the custom MIME type for GitHub pull requests.

Variables

var (
	// ErrConfigInvalidContentType indicates an invalid content type was specified.
	ErrConfigInvalidContentType = errors.New("github: invalid content type")

	// ErrRepoNotFound indicates the repository was not found or is not accessible.
	ErrRepoNotFound = errors.New("github: repository not found")

	// ErrBranchNotFound indicates the specified branch was not found.
	ErrBranchNotFound = errors.New("github: branch not found")

	// ErrWikiDisabled indicates the repository's wiki is disabled.
	ErrWikiDisabled = errors.New("github: wiki is disabled for this repository")

	// ErrInvalidCursor indicates the cursor format is invalid.
	ErrInvalidCursor = errors.New("github: invalid cursor format")
)

GitHub-specific errors.

Functions

func FetchFiles

func FetchFiles(
	ctx context.Context, client *Client, repo *gh.Repository, cfg *Config,
) ([]domain.RawDocument, string, error)

FetchFiles retrieves all files from a repository and converts them to RawDocuments.

func FetchIssueComments

func FetchIssueComments(
	ctx context.Context, client *Client, owner, repo string, issueNumber int,
) ([]*gh.IssueComment, error)

FetchIssueComments retrieves all comments for an issue.

func FetchIssues

func FetchIssues(
	ctx context.Context, client *Client, repo *gh.Repository, since time.Time,
) ([]domain.RawDocument, time.Time, error)

FetchIssues retrieves all issues (excluding PRs) from a repository.

func FetchPRReviews

func FetchPRReviews(
	ctx context.Context, client *Client, owner, repo string, prNumber int,
) ([]*gh.PullRequestReview, error)

FetchPRReviews retrieves all reviews for a pull request.

func FetchPullRequests

func FetchPullRequests(
	ctx context.Context, client *Client, repo *gh.Repository, since time.Time,
) ([]domain.RawDocument, time.Time, error)

FetchPullRequests retrieves all pull requests from a repository.

func FetchWikiPages

func FetchWikiPages(ctx context.Context, client *Client, repo *gh.Repository) ([]domain.RawDocument, string, error)

FetchWikiPages retrieves wiki pages from a repository. Note: GitHub's REST API has limited wiki support. Wiki pages are accessed via the repo's wiki git repository at {repo}.wiki.git. For simplicity, we fetch the wiki page list and content via API where available.

func FilterRepos

func FilterRepos(repos []*gh.Repository, includeArchived, includeForks bool) []*gh.Repository

FilterRepos filters repositories based on criteria.
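
The criteria are not spelled out here. One plausible reading, sketched below with a hypothetical repo type, is that archived repositories and forks are dropped unless the corresponding flag is set.

```go
package main

import "fmt"

// repo is a simplified stand-in for gh.Repository.
type repo struct {
	FullName string
	Archived bool
	Fork     bool
}

// filterRepos keeps repositories unless they are archived or forked and the
// matching include flag is false (assumed semantics).
func filterRepos(repos []repo, includeArchived, includeForks bool) []repo {
	var kept []repo
	for _, r := range repos {
		if r.Archived && !includeArchived {
			continue
		}
		if r.Fork && !includeForks {
			continue
		}
		kept = append(kept, r)
	}
	return kept
}

func main() {
	repos := []repo{
		{FullName: "octocat/active"},
		{FullName: "octocat/old", Archived: true},
		{FullName: "octocat/forked", Fork: true},
	}
	for _, r := range filterRepos(repos, false, true) {
		fmt.Println(r.FullName)
	}
}
```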

func GetLastPage

func GetLastPage(linkHeader string) string

GetLastPage extracts the "last" URL from a Link header.

func GetTree

func GetTree(ctx context.Context, client *Client, owner, repo, ref string) (*gh.Tree, error)

GetTree retrieves the full tree for a repository at a given ref. Uses recursive=1 to get all files in one call.

func HasNextPage

func HasNextPage(linkHeader string) bool

HasNextPage checks if there is a next page available.

func IsForbidden

func IsForbidden(err error) bool

IsForbidden checks if the error indicates a forbidden resource.

func IsNotFound

func IsNotFound(err error) bool

IsNotFound checks if the error indicates a resource was not found.

func IsRateLimited

func IsRateLimited(err error) bool

IsRateLimited checks if the error indicates rate limiting.

func IsUnauthorized

func IsUnauthorized(err error) bool

IsUnauthorized checks if the error indicates an authentication failure.
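
One way such predicates could be built is with errors.As over a status-carrying error type. The sketch below uses a lowercase apiError that mirrors the documented APIError fields. Mapping each predicate to a single HTTP status is an assumption, and the package's helpers may instead inspect go-github's error types. A rate limit predicate is left out because GitHub can signal rate limiting with either 403 or 429.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// apiError mirrors the documented APIError fields.
type apiError struct {
	StatusCode int
	Message    string
	URL        string
}

func (e *apiError) Error() string {
	return fmt.Sprintf("github: %d %s (%s)", e.StatusCode, e.Message, e.URL)
}

// hasStatus reports whether err wraps an *apiError with the given status.
func hasStatus(err error, status int) bool {
	var apiErr *apiError
	return errors.As(err, &apiErr) && apiErr.StatusCode == status
}

func isNotFound(err error) bool     { return hasStatus(err, http.StatusNotFound) }
func isUnauthorized(err error) bool { return hasStatus(err, http.StatusUnauthorized) }
func isForbidden(err error) bool    { return hasStatus(err, http.StatusForbidden) }

func main() {
	base := &apiError{StatusCode: 404, Message: "Not Found", URL: "/repos/octocat/missing"}
	wrapped := fmt.Errorf("fetch repository: %w", base)

	fmt.Println(isNotFound(wrapped))
	fmt.Println(isForbidden(wrapped))
	fmt.Println(isUnauthorized(nil))
}
```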

func ListAllRepos

func ListAllRepos(ctx context.Context, client *Client) ([]*gh.Repository, error)

ListAllRepos returns all repositories accessible to the authenticated user. This is the primary method for indexing - it gets ALL repos the user can access: owned repositories, collaborator repositories, and organization member repositories.

func ParseAllLinks

func ParseAllLinks(linkHeader string) map[string]string

ParseAllLinks extracts all URLs from a Link header by relationship type. Returns a map of rel type to URL.

func ParseNextLink

func ParseNextLink(linkHeader string) string

ParseNextLink extracts the "next" URL from a Link header. Returns empty string if no next link is found.
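
For readers unfamiliar with the Link header format used for pagination, here is a simplified parser. The lowercase function names are hypothetical and this is not the package's implementation. Splitting on commas is a simplification that would break on URLs containing commas.

```go
package main

import (
	"fmt"
	"strings"
)

// parseLinks maps each rel value in a Link header to its URL.
func parseLinks(header string) map[string]string {
	links := make(map[string]string)
	for _, part := range strings.Split(header, ",") {
		segments := strings.Split(part, ";")
		target := strings.TrimSpace(segments[0])
		if !strings.HasPrefix(target, "<") || !strings.HasSuffix(target, ">") {
			continue // not a <url> target, skip
		}
		url := target[1 : len(target)-1]
		for _, param := range segments[1:] {
			key, value, ok := strings.Cut(strings.TrimSpace(param), "=")
			if ok && key == "rel" {
				links[strings.Trim(value, `"`)] = url
			}
		}
	}
	return links
}

// nextLink returns the "next" URL, or an empty string if there is none.
func nextLink(header string) string { return parseLinks(header)["next"] }

// hasNextPage reports whether the header advertises another page.
func hasNextPage(header string) bool { return nextLink(header) != "" }

func main() {
	header := `<https://api.github.com/user/repos?page=2>; rel="next", ` +
		`<https://api.github.com/user/repos?page=5>; rel="last"`
	fmt.Println(nextLink(header))
	fmt.Println(parseLinks(header)["last"])
	fmt.Println(hasNextPage(""))
}
```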

func RepoFullName

func RepoFullName(owner, repo string) string

RepoFullName returns the full repository name.

func ResolveWebURL

func ResolveWebURL(uri string, _ map[string]any) string

ResolveWebURL converts a GitHub URI to a web URL. github://owner/repo/blob/branch/path -> https://github.com/owner/repo/blob/branch/path
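
The mapping amounts to swapping the scheme prefix. A minimal sketch follows, with a hypothetical lowercase name and without the metadata argument; returning non-GitHub URIs unchanged is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

const (
	uriScheme  = "github://"
	webBaseURL = "https://github.com/"
)

// resolveWebURL swaps the github:// prefix for the GitHub web base URL.
// URIs with any other prefix are returned unchanged (assumed behaviour).
func resolveWebURL(uri string) string {
	if !strings.HasPrefix(uri, uriScheme) {
		return uri
	}
	return webBaseURL + strings.TrimPrefix(uri, uriScheme)
}

func main() {
	fmt.Println(resolveWebURL("github://octocat/hello-world/issues/42"))
	fmt.Println(resolveWebURL("file:///tmp/notes.txt"))
}
```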

Types

type APIError

type APIError struct {
	StatusCode int
	Message    string
	URL        string
}

APIError represents a GitHub API error response.

func (*APIError) Error

func (e *APIError) Error() string

type Client

type Client struct {
	// contains filtered or unexported fields
}

Client wraps the go-github client with helper methods.

func NewClient

func NewClient(tokenProvider driven.TokenProvider) *Client

NewClient creates a new GitHub API client with a token provider.

func NewClientWithHTTPClient

func NewClientWithHTTPClient(httpClient *http.Client) *Client

NewClientWithHTTPClient creates a GitHub client with a custom http.Client. Useful for OAuth flows where the http.Client handles token refresh.

func NewClientWithToken

func NewClientWithToken(ctx context.Context, token string) *Client

NewClientWithToken creates a GitHub client with a static access token. Works for both PAT and OAuth access tokens.

func (*Client) DownloadContents

func (c *Client) DownloadContents(ctx context.Context, owner, repo, path, ref string) (io.ReadCloser, error)

DownloadContents downloads a file larger than 1MB. Returns an io.ReadCloser that must be closed by the caller.

func (*Client) GetBlob

func (c *Client) GetBlob(ctx context.Context, owner, repo, sha string) (*gh.Blob, error)

GetBlob fetches a blob (file content) by its SHA.

func (*Client) GetFileContent

func (c *Client) GetFileContent(ctx context.Context, owner, repo, path, ref string) (string, error)

GetFileContent fetches the content of a file. For files < 1MB, content is base64 encoded in the response.
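
Decoding that base64 payload needs only the standard library. In the sketch below, decodeContent is a hypothetical helper rather than a package function. The encoded text may be wrapped across lines, which Go's decoder tolerates because it ignores \r and \n.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// decodeContent decodes base64 file content taken from an API response.
// Line breaks inside the encoded text are ignored by the decoder.
func decodeContent(encoded string) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return "", fmt.Errorf("decode file content: %w", err)
	}
	return string(raw), nil
}

func main() {
	text, err := decodeContent("aGVsbG8s\nIHdvcmxk\n")
	fmt.Printf("%q %v\n", text, err)
}
```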

func (*Client) GetRepository

func (c *Client) GetRepository(ctx context.Context, owner, repo string) (*gh.Repository, error)

GetRepository fetches a single repository.

func (*Client) GetTree

func (c *Client) GetTree(ctx context.Context, owner, repo, sha string) (*gh.Tree, error)

GetTree fetches the entire tree for a repository recursively. This is efficient for getting all file paths in one API call.

func (*Client) GitHub

func (c *Client) GitHub() *gh.Client

GitHub returns the underlying go-github client. Caller should call ensureClient first.

func (*Client) ListAllAccessibleRepos

func (c *Client) ListAllAccessibleRepos(ctx context.Context) ([]*gh.Repository, error)

ListAllAccessibleRepos returns ALL repositories the authenticated user can access. This includes: owned repos, collaborator repos, and organization member repos.

func (*Client) ListIssues

func (c *Client) ListIssues(
	ctx context.Context, owner, repo string, opts *gh.IssueListByRepoOptions,
) ([]*gh.Issue, error)

ListIssues lists issues for a repository.

func (*Client) ListPullRequests

func (c *Client) ListPullRequests(
	ctx context.Context, owner, repo string, opts *gh.PullRequestListOptions,
) ([]*gh.PullRequest, error)

ListPullRequests lists pull requests for a repository.

func (*Client) RateLimit

func (c *Client) RateLimit(ctx context.Context) (*gh.RateLimits, error)

RateLimit returns the current rate limit status.

func (*Client) RateLimiter

func (c *Client) RateLimiter() *RateLimiter

RateLimiter returns the rate limiter for external access.

func (*Client) TokenProvider

func (c *Client) TokenProvider() driven.TokenProvider

TokenProvider returns the token provider (used by other modules).

func (*Client) ValidateCredentials

func (c *Client) ValidateCredentials(ctx context.Context) error

ValidateCredentials checks if the provided token is valid by making an API call.

type CommentContent

type CommentContent struct {
	Author    string    `json:"author"`
	Body      string    `json:"body"`
	CreatedAt time.Time `json:"created_at"`
}

CommentContent represents a comment in the issue content.

type Config

type Config struct {
	// ContentTypes specifies what content to index.
	// Default: all types (files, issues, prs, wikis)
	ContentTypes []ContentType

	// FilePatterns are glob patterns for file filtering.
	// Default: all files
	FilePatterns []string
}

Config holds the parsed configuration for a GitHub source.

func ParseConfig

func ParseConfig(source domain.Source) (*Config, error)

ParseConfig parses a source's config map into a Config struct. All fields are optional - by default indexes all accessible repos with all content types.

func (*Config) HasContentType

func (c *Config) HasContentType(ct ContentType) bool

HasContentType checks if a content type is enabled.

type Connector

type Connector struct {
	// contains filtered or unexported fields
}

Connector fetches documents from GitHub repositories.

func New

func New(sourceID string, cfg *Config, tokenProvider driven.TokenProvider) *Connector

New creates a new GitHub connector.

func (*Connector) Capabilities

func (c *Connector) Capabilities() driven.ConnectorCapabilities

Capabilities returns the connector's capabilities.

func (*Connector) Close

func (c *Connector) Close() error

Close releases resources.

func (*Connector) FullSync

func (c *Connector) FullSync(ctx context.Context) (<-chan domain.RawDocument, <-chan error)

FullSync fetches all documents from GitHub.

func (*Connector) GetAccountIdentifier

func (c *Connector) GetAccountIdentifier(ctx context.Context, accessToken string) (string, error)

GetAccountIdentifier fetches the GitHub username for the authenticated user.

func (*Connector) IncrementalSync

func (c *Connector) IncrementalSync(
	ctx context.Context, state domain.SyncState,
) (<-chan domain.RawDocumentChange, <-chan error)

IncrementalSync fetches only changes since the last sync.

func (*Connector) SourceID

func (c *Connector) SourceID() string

SourceID returns the source identifier.

func (*Connector) Type

func (c *Connector) Type() string

Type returns the connector type identifier.

func (*Connector) Validate

func (c *Connector) Validate(ctx context.Context) error

Validate checks if the GitHub connector is properly configured.

func (*Connector) Watch

func (c *Connector) Watch(_ context.Context) (<-chan domain.RawDocumentChange, error)

Watch is not supported for GitHub (no webhooks in CLI).

type ContentType

type ContentType string

ContentType represents the type of content to index.

const (
	ContentFiles  ContentType = "files"
	ContentIssues ContentType = "issues"
	ContentPRs    ContentType = "prs"
	ContentWikis  ContentType = "wikis"
)

func AllContentTypes

func AllContentTypes() []ContentType

AllContentTypes returns all supported content types.

type Cursor

type Cursor struct {
	// Version is the schema version for future migrations.
	Version int `json:"v"`

	// Repos maps repository full name (owner/repo) to its cursor state.
	Repos map[string]RepoCursor `json:"repos"`
}

Cursor tracks sync state across multiple repositories and content types.

func DecodeCursor

func DecodeCursor(s string) (*Cursor, error)

DecodeCursor deserializes a cursor from a base64-encoded JSON string. Returns a new empty cursor if the input is empty or invalid.

func NewCursor

func NewCursor() *Cursor

NewCursor creates a new empty cursor.

func (*Cursor) Encode

func (c *Cursor) Encode() string

Encode serializes the cursor to a base64-encoded JSON string.
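
The round trip through base64-encoded JSON can be sketched as follows. The lowercase types copy the JSON tags from Cursor and RepoCursor but omit the timestamp fields for brevity. Standard (padded) base64 is an assumption, since the variant the package uses is not stated here.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// repoCursor carries a subset of the documented RepoCursor fields.
type repoCursor struct {
	FilesTreeSHA  string `json:"files_sha,omitempty"`
	WikiCommitSHA string `json:"wiki_sha,omitempty"`
}

// cursor mirrors the documented Cursor layout.
type cursor struct {
	Version int                   `json:"v"`
	Repos   map[string]repoCursor `json:"repos"`
}

// encode marshals the cursor to JSON and wraps it in base64.
func encode(c cursor) (string, error) {
	data, err := json.Marshal(c)
	if err != nil {
		return "", fmt.Errorf("marshal cursor: %w", err)
	}
	return base64.StdEncoding.EncodeToString(data), nil
}

// decode reverses encode and reports malformed input as an error.
func decode(s string) (cursor, error) {
	var c cursor
	data, err := base64.StdEncoding.DecodeString(s)
	if err != nil {
		return c, fmt.Errorf("decode base64: %w", err)
	}
	if err := json.Unmarshal(data, &c); err != nil {
		return c, fmt.Errorf("unmarshal cursor: %w", err)
	}
	return c, nil
}

func main() {
	in := cursor{Version: 1, Repos: map[string]repoCursor{
		"octocat/hello-world": {FilesTreeSHA: "abc123"},
	}}
	s, err := encode(in)
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	out, err := decode(s)
	fmt.Println(out.Repos["octocat/hello-world"].FilesTreeSHA, err)
}
```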

func (*Cursor) GetRepoCursor

func (c *Cursor) GetRepoCursor(owner, repo string) RepoCursor

GetRepoCursor returns the cursor for a specific repository.

func (*Cursor) SetRepoCursor

func (c *Cursor) SetRepoCursor(owner, repo string, cursor *RepoCursor)

SetRepoCursor sets the cursor for a specific repository.

func (*Cursor) UpdateFilesTreeSHA

func (c *Cursor) UpdateFilesTreeSHA(owner, repo, sha string)

UpdateFilesTreeSHA updates the files tree SHA for a repository.

func (*Cursor) UpdateIssuesSince

func (c *Cursor) UpdateIssuesSince(owner, repo string, t time.Time)

UpdateIssuesSince updates the issues timestamp for a repository.

func (*Cursor) UpdatePRsSince

func (c *Cursor) UpdatePRsSince(owner, repo string, t time.Time)

UpdatePRsSince updates the PRs timestamp for a repository.

func (*Cursor) UpdateWikiCommitSHA

func (c *Cursor) UpdateWikiCommitSHA(owner, repo, sha string)

UpdateWikiCommitSHA updates the wiki commit SHA for a repository.

type IssueContent

type IssueContent struct {
	Number    int              `json:"number"`
	Title     string           `json:"title"`
	Body      string           `json:"body"`
	State     string           `json:"state"`
	Author    string           `json:"author"`
	CreatedAt time.Time        `json:"created_at"`
	UpdatedAt time.Time        `json:"updated_at"`
	Labels    []string         `json:"labels"`
	Assignees []string         `json:"assignees"`
	Milestone string           `json:"milestone,omitempty"`
	Comments  []CommentContent `json:"comments"`
}

IssueContent is the JSON structure for the issue RawDocument content.

type OAuthHandler

type OAuthHandler struct{}

OAuthHandler implements OAuth operations for GitHub.

func NewOAuthHandler

func NewOAuthHandler() *OAuthHandler

NewOAuthHandler creates a new GitHub OAuth handler.

func (*OAuthHandler) BuildAuthURL

func (h *OAuthHandler) BuildAuthURL(
	authProvider *domain.AuthProvider,
	redirectURI, state, codeChallenge string,
) string

BuildAuthURL constructs the GitHub OAuth authorization URL. GitHub doesn't require access_type=offline like Google does.

func (*OAuthHandler) DefaultConfig

func (h *OAuthHandler) DefaultConfig() driven.OAuthDefaults

DefaultConfig returns default OAuth URLs and scopes for GitHub.

func (*OAuthHandler) ExchangeCode

func (h *OAuthHandler) ExchangeCode(
	ctx context.Context,
	authProvider *domain.AuthProvider,
	code, redirectURI, codeVerifier string,
) (*domain.OAuthToken, error)

ExchangeCode exchanges an authorization code for tokens.

func (*OAuthHandler) GetUserInfo

func (h *OAuthHandler) GetUserInfo(ctx context.Context, accessToken string) (string, error)

GetUserInfo fetches the user's login from GitHub.

func (*OAuthHandler) RefreshToken

func (h *OAuthHandler) RefreshToken(
	ctx context.Context,
	authProvider *domain.AuthProvider,
	refreshToken string,
) (*domain.OAuthToken, error)

RefreshToken refreshes an expired access token using a refresh token. Note: GitHub OAuth apps don't typically use refresh tokens. GitHub Apps use installation tokens which expire, but OAuth apps have long-lived tokens.

func (*OAuthHandler) SetupHint

func (h *OAuthHandler) SetupHint() string

SetupHint returns guidance for setting up a GitHub OAuth app.

type PRContent

type PRContent struct {
	Number       int              `json:"number"`
	Title        string           `json:"title"`
	Body         string           `json:"body"`
	State        string           `json:"state"`
	Draft        bool             `json:"draft"`
	Merged       bool             `json:"merged"`
	Author       string           `json:"author"`
	HeadBranch   string           `json:"head_branch"`
	BaseBranch   string           `json:"base_branch"`
	CreatedAt    time.Time        `json:"created_at"`
	UpdatedAt    time.Time        `json:"updated_at"`
	Labels       []string         `json:"labels"`
	Assignees    []string         `json:"assignees"`
	Reviewers    []string         `json:"reviewers"`
	Additions    int              `json:"additions"`
	Deletions    int              `json:"deletions"`
	ChangedFiles int              `json:"changed_files"`
	Comments     []CommentContent `json:"comments"`
	Reviews      []ReviewContent  `json:"reviews"`
}

PRContent is the JSON structure for the PR RawDocument content.

type RateLimitError

type RateLimitError struct {
	ResetAt   time.Time
	Remaining int
	Limit     int
}

RateLimitError represents a rate limit exceeded error with reset time.

func (*RateLimitError) Error

func (e *RateLimitError) Error() string

type RateLimiter

type RateLimiter struct {
	// contains filtered or unexported fields
}

RateLimiter implements dual-strategy rate limiting for GitHub API.

func NewRateLimiter

func NewRateLimiter() *RateLimiter

NewRateLimiter creates a new rate limiter with proactive throttling.

func (*RateLimiter) CheckRateLimit

func (r *RateLimiter) CheckRateLimit(resp *http.Response) error

CheckRateLimit checks if the response indicates rate limiting. Returns a RateLimitError if rate limited, nil otherwise.

func (*RateLimiter) Limit

func (r *RateLimiter) Limit() int

Limit returns the rate limit.

func (*RateLimiter) Remaining

func (r *RateLimiter) Remaining() int

Remaining returns the current remaining requests.

func (*RateLimiter) ResetTime

func (r *RateLimiter) ResetTime() time.Time

ResetTime returns the rate limit reset time.

func (*RateLimiter) UpdateFromResponse

func (r *RateLimiter) UpdateFromResponse(resp *http.Response)

UpdateFromResponse updates rate limit state from response headers.

func (*RateLimiter) Wait

func (r *RateLimiter) Wait(ctx context.Context) error

Wait blocks until it's safe to make a request. It uses both proactive throttling and reactive API limit checking.

func (*RateLimiter) WaitForReset

func (r *RateLimiter) WaitForReset(ctx context.Context) error

WaitForReset waits until the rate limit resets.

type RepoCursor

type RepoCursor struct {
	// FilesTreeSHA is the Git tree SHA for the last indexed commit.
	FilesTreeSHA string `json:"files_sha,omitempty"`

	// IssuesSince is the timestamp of the last updated issue.
	IssuesSince time.Time `json:"issues_since,omitempty"`

	// PRsSince is the timestamp of the last updated PR.
	PRsSince time.Time `json:"prs_since,omitempty"`

	// WikiCommitSHA is the last indexed wiki commit SHA.
	WikiCommitSHA string `json:"wiki_sha,omitempty"`
}

RepoCursor tracks sync state for a single repository.

type ReviewContent

type ReviewContent struct {
	Author      string    `json:"author"`
	State       string    `json:"state"`
	Body        string    `json:"body"`
	SubmittedAt time.Time `json:"submitted_at"`
}

ReviewContent represents a review in the PR content.
