Documentation ¶
Overview ¶
Package github implements a connector for GitHub repositories.
This connector indexes all repositories accessible to the authenticated user, including owned repositories, collaborator repositories, and organisation member repositories. Content types indexed include repository files, issues, pull requests, and wiki pages.
Architecture ¶
The connector follows the driven port pattern defined in driven.Connector. It comprises the following components:
- Connector: orchestrates sync operations and manages lifecycle
- Client: handles GitHub API communication with rate limiting
- Config: parses and validates source configuration
- Cursor: tracks incremental sync state per repository
Authentication ¶
Two authentication methods are supported:
Personal Access Tokens (PAT): classic or fine-grained tokens created at github.com/settings/tokens. Classic tokens require the 'repo' scope for private repositories; fine-grained tokens use per-repository permissions instead of scopes.
OAuth App: tokens obtained via the OAuth 2.0 authorisation code flow. The application must be registered at github.com/settings/developers.
Both methods provide 5,000 API requests per hour for authenticated users. Unauthenticated requests are limited to 60 per hour and are not supported.
Configuration ¶
Source configuration accepts the following keys:
content_types: comma-separated list of content to index. Valid values: files, issues, prs, wikis. Default: all types.
file_patterns: comma-separated glob patterns for file filtering. Example: "*.go,*.md". Default: all files.
No repository specification is required. The connector automatically discovers and indexes all repositories accessible to the authenticated user.
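The two keys above could be parsed and applied along these lines. This is a sketch under stated assumptions, not the package's ParseConfig: splitList, parseContentTypes and matchesAny are hypothetical helpers, and matching patterns against the file's base name is an assumption.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// splitList splits a comma-separated config value, trimming spaces and
// dropping empty entries.
func splitList(v string) []string {
	var out []string
	for _, part := range strings.Split(v, ",") {
		if p := strings.TrimSpace(part); p != "" {
			out = append(out, p)
		}
	}
	return out
}

// validContentTypes mirrors the values documented for content_types.
var validContentTypes = map[string]bool{"files": true, "issues": true, "prs": true, "wikis": true}

// parseContentTypes returns the requested types; empty input means all types.
func parseContentTypes(v string) ([]string, error) {
	types := splitList(v)
	if len(types) == 0 {
		return []string{"files", "issues", "prs", "wikis"}, nil
	}
	for _, t := range types {
		if !validContentTypes[t] {
			return nil, fmt.Errorf("invalid content type %q", t)
		}
	}
	return types, nil
}

// matchesAny reports whether the base name of filePath matches any glob.
// An empty pattern list matches every file.
func matchesAny(patterns []string, filePath string) bool {
	if len(patterns) == 0 {
		return true
	}
	base := path.Base(filePath)
	for _, p := range patterns {
		if ok, err := path.Match(p, base); err == nil && ok {
			return true
		}
	}
	return false
}

func main() {
	types, err := parseContentTypes("files, issues")
	fmt.Println(types, err)
	patterns := splitList("*.go,*.md")
	fmt.Println(matchesAny(patterns, "cmd/app/main.go"), matchesAny(patterns, "logo.png"))
}
```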
Rate Limiting ¶
The connector implements a dual-strategy rate limiting approach:
Proactive throttling: a token bucket algorithm limits requests to approximately 1.2 requests per second (about 4,320 per hour), staying under the 5,000/hour limit whilst maximising throughput.
Reactive handling: the connector monitors X-RateLimit-Remaining and X-RateLimit-Reset headers. When limits are exhausted, it waits until the reset time before continuing.
Secondary rate limits (abuse detection) are handled with exponential backoff.
Sync Operations ¶
Full sync retrieves all content from all accessible repositories. For each repository, the connector:
- Fetches the repository tree using the recursive Trees API
- Retrieves blob content for each file matching configured patterns
- Fetches issues and pull requests with their comments
- Retrieves wiki pages if the repository has a wiki
Incremental sync uses cursors to track sync state. The cursor stores:
- Tree SHA: detects file changes by comparing against the current HEAD
- Timestamps: filters issues and PRs updated since the last sync
- Wiki SHA: tracks wiki repository changes
Each repository maintains independent cursor state, enabling partial syncs to resume from where they left off.
Document Structure ¶
Documents are emitted with the following URI patterns:
- Files: github://{owner}/{repo}/blob/{path}
- Issues: github://{owner}/{repo}/issues/{number}
- Pull Requests: github://{owner}/{repo}/pull/{number}
- Wiki Pages: github://{owner}/{repo}/wiki/{page}
Metadata includes repository information, file paths, issue/PR state, labels, and timestamps.
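For illustration, the four URI patterns could be produced by helpers like these. The function names are hypothetical. The path argument of fileURI is whatever follows blob/; the ResolveWebURL example in this package shows it beginning with a branch name.

```go
package main

import "fmt"

// fileURI builds a file URI; path is everything after "blob/".
func fileURI(owner, repo, path string) string {
	return fmt.Sprintf("github://%s/%s/blob/%s", owner, repo, path)
}

// issueURI builds an issue URI from the issue number.
func issueURI(owner, repo string, number int) string {
	return fmt.Sprintf("github://%s/%s/issues/%d", owner, repo, number)
}

// pullURI builds a pull request URI from the PR number.
func pullURI(owner, repo string, number int) string {
	return fmt.Sprintf("github://%s/%s/pull/%d", owner, repo, number)
}

// wikiURI builds a wiki page URI from the page name.
func wikiURI(owner, repo, page string) string {
	return fmt.Sprintf("github://%s/%s/wiki/%s", owner, repo, page)
}

func main() {
	fmt.Println(fileURI("octocat", "hello", "main/README.md"))
	fmt.Println(issueURI("octocat", "hello", 42))
	fmt.Println(pullURI("octocat", "hello", 7))
	fmt.Println(wikiURI("octocat", "hello", "Home"))
}
```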
Error Handling ¶
The connector distinguishes between recoverable and fatal errors:
- Rate limit errors: automatically retried after waiting
- Network errors: retried with exponential backoff
- Authentication errors: reported immediately as domain.ErrAuthInvalid
- Permission errors: logged and skipped (repository continues)
Limitations ¶
- Binary files are not indexed (text content only)
- File size limit: 1MB per file (GitHub API constraint)
- Watch mode is not supported (no webhook integration in CLI)
- Private repository access requires appropriate token scopes
Example Usage ¶
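cfg, err := github.ParseConfig(source)
if err != nil {
	return err
}
connector := github.New(source.ID, cfg, tokenProvider)
if err := connector.Validate(ctx); err != nil {
	return err
}
docs, errs := connector.FullSync(ctx)
for doc := range docs {
	_ = doc // Process document
}
// Check for a sync error (assumes errs is closed or buffered once docs is drained).
if err := <-errs; err != nil {
	return err
}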
Index ¶
- Constants
- Variables
- func FetchFiles(ctx context.Context, client *Client, repo *gh.Repository, cfg *Config) ([]domain.RawDocument, string, error)
- func FetchIssueComments(ctx context.Context, client *Client, owner, repo string, issueNumber int) ([]*gh.IssueComment, error)
- func FetchIssues(ctx context.Context, client *Client, repo *gh.Repository, since time.Time) ([]domain.RawDocument, time.Time, error)
- func FetchPRReviews(ctx context.Context, client *Client, owner, repo string, prNumber int) ([]*gh.PullRequestReview, error)
- func FetchPullRequests(ctx context.Context, client *Client, repo *gh.Repository, since time.Time) ([]domain.RawDocument, time.Time, error)
- func FetchWikiPages(ctx context.Context, client *Client, repo *gh.Repository) ([]domain.RawDocument, string, error)
- func FilterRepos(repos []*gh.Repository, includeArchived, includeForks bool) []*gh.Repository
- func GetLastPage(linkHeader string) string
- func GetTree(ctx context.Context, client *Client, owner, repo, ref string) (*gh.Tree, error)
- func HasNextPage(linkHeader string) bool
- func IsForbidden(err error) bool
- func IsNotFound(err error) bool
- func IsRateLimited(err error) bool
- func IsUnauthorized(err error) bool
- func ListAllRepos(ctx context.Context, client *Client) ([]*gh.Repository, error)
- func ParseAllLinks(linkHeader string) map[string]string
- func ParseNextLink(linkHeader string) string
- func RepoFullName(owner, repo string) string
- func ResolveWebURL(uri string, _ map[string]any) string
- type APIError
- type Client
- func (c *Client) DownloadContents(ctx context.Context, owner, repo, path, ref string) (io.ReadCloser, error)
- func (c *Client) GetBlob(ctx context.Context, owner, repo, sha string) (*gh.Blob, error)
- func (c *Client) GetFileContent(ctx context.Context, owner, repo, path, ref string) (string, error)
- func (c *Client) GetRepository(ctx context.Context, owner, repo string) (*gh.Repository, error)
- func (c *Client) GetTree(ctx context.Context, owner, repo, sha string) (*gh.Tree, error)
- func (c *Client) GitHub() *gh.Client
- func (c *Client) ListAllAccessibleRepos(ctx context.Context) ([]*gh.Repository, error)
- func (c *Client) ListIssues(ctx context.Context, owner, repo string, opts *gh.IssueListByRepoOptions) ([]*gh.Issue, error)
- func (c *Client) ListPullRequests(ctx context.Context, owner, repo string, opts *gh.PullRequestListOptions) ([]*gh.PullRequest, error)
- func (c *Client) RateLimit(ctx context.Context) (*gh.RateLimits, error)
- func (c *Client) RateLimiter() *RateLimiter
- func (c *Client) TokenProvider() driven.TokenProvider
- func (c *Client) ValidateCredentials(ctx context.Context) error
- type CommentContent
- type Config
- type Connector
- func (c *Connector) Capabilities() driven.ConnectorCapabilities
- func (c *Connector) Close() error
- func (c *Connector) FullSync(ctx context.Context) (<-chan domain.RawDocument, <-chan error)
- func (c *Connector) GetAccountIdentifier(ctx context.Context, accessToken string) (string, error)
- func (c *Connector) IncrementalSync(ctx context.Context, state domain.SyncState) (<-chan domain.RawDocumentChange, <-chan error)
- func (c *Connector) SourceID() string
- func (c *Connector) Type() string
- func (c *Connector) Validate(ctx context.Context) error
- func (c *Connector) Watch(_ context.Context) (<-chan domain.RawDocumentChange, error)
- type ContentType
- type Cursor
- func (c *Cursor) Encode() string
- func (c *Cursor) GetRepoCursor(owner, repo string) RepoCursor
- func (c *Cursor) SetRepoCursor(owner, repo string, cursor *RepoCursor)
- func (c *Cursor) UpdateFilesTreeSHA(owner, repo, sha string)
- func (c *Cursor) UpdateIssuesSince(owner, repo string, t time.Time)
- func (c *Cursor) UpdatePRsSince(owner, repo string, t time.Time)
- func (c *Cursor) UpdateWikiCommitSHA(owner, repo, sha string)
- type IssueContent
- type OAuthHandler
- func (h *OAuthHandler) BuildAuthURL(authProvider *domain.AuthProvider, redirectURI, state, codeChallenge string) string
- func (h *OAuthHandler) DefaultConfig() driven.OAuthDefaults
- func (h *OAuthHandler) ExchangeCode(ctx context.Context, authProvider *domain.AuthProvider, ...) (*domain.OAuthToken, error)
- func (h *OAuthHandler) GetUserInfo(ctx context.Context, accessToken string) (string, error)
- func (h *OAuthHandler) RefreshToken(ctx context.Context, authProvider *domain.AuthProvider, refreshToken string) (*domain.OAuthToken, error)
- func (h *OAuthHandler) SetupHint() string
- type PRContent
- type RateLimitError
- type RateLimiter
- func (r *RateLimiter) CheckRateLimit(resp *http.Response) error
- func (r *RateLimiter) Limit() int
- func (r *RateLimiter) Remaining() int
- func (r *RateLimiter) ResetTime() time.Time
- func (r *RateLimiter) UpdateFromResponse(resp *http.Response)
- func (r *RateLimiter) Wait(ctx context.Context) error
- func (r *RateLimiter) WaitForReset(ctx context.Context) error
- type RepoCursor
- type ReviewContent
Constants ¶
const (
	// DefaultTimeout is the default HTTP request timeout.
	DefaultTimeout = 30 * time.Second
	// MaxRetries is the maximum number of retries for transient errors.
	MaxRetries = 3
	// RetryDelay is the initial delay between retries.
	RetryDelay = time.Second
)
const (
	// GitHubRateLimit is the authenticated rate limit (5000/hour).
	GitHubRateLimit = 5000
	// ProactiveRate is the proactive throttle rate (~1.2 req/sec = 4320/hr).
	ProactiveRate = 1.2
	// MinBuffer is the minimum remaining requests before waiting for reset.
	MinBuffer = 100
	// HeaderRateLimit is the rate limit header.
	HeaderRateLimit = "X-RateLimit-Limit"
	// HeaderRateRemaining is the remaining requests header.
	HeaderRateRemaining = "X-RateLimit-Remaining"
	// HeaderRateReset is the reset timestamp header (Unix seconds).
	HeaderRateReset = "X-RateLimit-Reset"
	// HeaderRetryAfter is the retry-after header (seconds).
	HeaderRetryAfter = "Retry-After"
)
const CursorVersion = 1
CursorVersion is the current cursor schema version.
const MIMETypeGitHubIssue = "application/vnd.github.issue+json"
MIMETypeGitHubIssue is the custom MIME type for GitHub issues.
const MIMETypeGitHubPull = "application/vnd.github.pull+json"
MIMETypeGitHubPull is the custom MIME type for GitHub pull requests.
Variables ¶
var (
	// ErrConfigInvalidContentType indicates an invalid content type was specified.
	ErrConfigInvalidContentType = errors.New("github: invalid content type")
	// ErrRepoNotFound indicates the repository was not found or is not accessible.
	ErrRepoNotFound = errors.New("github: repository not found")
	// ErrBranchNotFound indicates the specified branch was not found.
	ErrBranchNotFound = errors.New("github: branch not found")
	// ErrWikiDisabled indicates the repository's wiki is disabled.
	ErrWikiDisabled = errors.New("github: wiki is disabled for this repository")
	// ErrInvalidCursor indicates the cursor format is invalid.
	ErrInvalidCursor = errors.New("github: invalid cursor format")
)
GitHub-specific errors.
Functions ¶
func FetchFiles ¶
func FetchFiles(
	ctx context.Context,
	client *Client,
	repo *gh.Repository,
	cfg *Config,
) ([]domain.RawDocument, string, error)
FetchFiles retrieves all files from a repository and converts them to RawDocuments.
func FetchIssueComments ¶
func FetchIssueComments(
	ctx context.Context,
	client *Client,
	owner, repo string,
	issueNumber int,
) ([]*gh.IssueComment, error)
FetchIssueComments retrieves all comments for an issue.
func FetchIssues ¶
func FetchIssues(
	ctx context.Context,
	client *Client,
	repo *gh.Repository,
	since time.Time,
) ([]domain.RawDocument, time.Time, error)
FetchIssues retrieves all issues (excluding PRs) from a repository.
func FetchPRReviews ¶
func FetchPRReviews(
	ctx context.Context,
	client *Client,
	owner, repo string,
	prNumber int,
) ([]*gh.PullRequestReview, error)
FetchPRReviews retrieves all reviews for a pull request.
func FetchPullRequests ¶
func FetchPullRequests(
	ctx context.Context,
	client *Client,
	repo *gh.Repository,
	since time.Time,
) ([]domain.RawDocument, time.Time, error)
FetchPullRequests retrieves all pull requests from a repository.
func FetchWikiPages ¶
func FetchWikiPages(ctx context.Context, client *Client, repo *gh.Repository) ([]domain.RawDocument, string, error)
FetchWikiPages retrieves wiki pages from a repository. Note: GitHub's REST API has limited wiki support. Wiki pages are accessed via the repo's wiki git repository at {repo}.wiki.git. For simplicity, we fetch the wiki page list and content via API where available.
func FilterRepos ¶
func FilterRepos(repos []*gh.Repository, includeArchived, includeForks bool) []*gh.Repository
FilterRepos filters repositories based on criteria.
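Judging by its parameters, the filter keeps archived repositories and forks only when the corresponding flag is set. A minimal sketch of that logic, using a hypothetical repo struct with plain bool fields in place of gh.Repository:

```go
package main

import "fmt"

// repo is a simplified stand-in for gh.Repository.
type repo struct {
	FullName string
	Archived bool
	Fork     bool
}

// filterRepos drops archived repositories and forks unless the matching
// include flag is true. The input slice is left unmodified.
func filterRepos(repos []repo, includeArchived, includeForks bool) []repo {
	out := make([]repo, 0, len(repos))
	for _, r := range repos {
		if r.Archived && !includeArchived {
			continue
		}
		if r.Fork && !includeForks {
			continue
		}
		out = append(out, r)
	}
	return out
}

func main() {
	repos := []repo{
		{FullName: "octocat/active"},
		{FullName: "octocat/old", Archived: true},
		{FullName: "octocat/forked", Fork: true},
	}
	for _, r := range filterRepos(repos, false, true) {
		fmt.Println(r.FullName)
	}
}
```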
func GetLastPage ¶
GetLastPage extracts the "last" URL from a Link header.
func GetTree ¶
GetTree retrieves the full tree for a repository at a given ref. Uses recursive=1 to get all files in one call.
func HasNextPage ¶
HasNextPage checks if there is a next page available.
func IsForbidden ¶
IsForbidden checks if the error indicates a forbidden resource.
func IsNotFound ¶
IsNotFound checks if the error indicates a resource was not found.
func IsRateLimited ¶
IsRateLimited checks if the error indicates rate limiting.
func IsUnauthorized ¶
IsUnauthorized checks if the error indicates an authentication failure.
func ListAllRepos ¶
ListAllRepos returns all repositories accessible to the authenticated user. This is the primary method for indexing - it gets ALL repos the user can access: owned repositories, collaborator repositories, and organization member repositories.
func ParseAllLinks ¶
ParseAllLinks extracts all URLs from a Link header by relationship type. Returns a map of rel type to URL.
func ParseNextLink ¶
ParseNextLink extracts the "next" URL from a Link header. Returns empty string if no next link is found.
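GitHub paginates list endpoints with a Link response header of the form <url>; rel="next", <url>; rel="last". A simple parser for that shape might look like the sketch below. parseLinks is a hypothetical name, and the naive comma split assumes the URLs themselves contain no commas.

```go
package main

import (
	"fmt"
	"strings"
)

// parseLinks maps each rel value in a Link header to its URL.
// Malformed entries are skipped.
func parseLinks(header string) map[string]string {
	links := make(map[string]string)
	for _, part := range strings.Split(header, ",") {
		segs := strings.Split(part, ";")
		target := strings.TrimSpace(segs[0])
		if !strings.HasPrefix(target, "<") || !strings.HasSuffix(target, ">") {
			continue
		}
		url := target[1 : len(target)-1]
		for _, param := range segs[1:] {
			key, val, ok := strings.Cut(strings.TrimSpace(param), "=")
			if ok && key == "rel" {
				links[strings.Trim(val, `"`)] = url
			}
		}
	}
	return links
}

func main() {
	h := `<https://api.github.com/user/repos?page=2>; rel="next", ` +
		`<https://api.github.com/user/repos?page=5>; rel="last"`
	links := parseLinks(h)
	fmt.Println(links["next"])
	fmt.Println(links["last"])
	_, hasNext := links["next"]
	fmt.Println(hasNext)
}
```

With such a map in hand, the "next" and "last" lookups and the has-next check reduce to map accesses.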
func RepoFullName ¶
RepoFullName returns the full repository name.
func ResolveWebURL ¶
ResolveWebURL converts a GitHub URI to a web URL. github://owner/repo/blob/branch/path -> https://github.com/owner/repo/blob/branch/path
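The mapping shown above amounts to swapping the github:// scheme for the https://github.com/ prefix. A minimal sketch, with the hypothetical name resolveWebURL and the assumption that unrecognised URIs resolve to an empty string:

```go
package main

import (
	"fmt"
	"strings"
)

// resolveWebURL rewrites a github:// URI into a browsable web URL.
// URIs with any other scheme yield "" (an assumption of this sketch).
func resolveWebURL(uri string) string {
	const scheme = "github://"
	if !strings.HasPrefix(uri, scheme) {
		return ""
	}
	return "https://github.com/" + strings.TrimPrefix(uri, scheme)
}

func main() {
	fmt.Println(resolveWebURL("github://octocat/hello/blob/main/README.md"))
	fmt.Println(resolveWebURL("github://octocat/hello/issues/42"))
}
```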
Types ¶
type Client ¶
type Client struct {
	// contains filtered or unexported fields
}
Client wraps the go-github client with helper methods.
func NewClient ¶
func NewClient(tokenProvider driven.TokenProvider) *Client
NewClient creates a new GitHub API client with a token provider.
func NewClientWithHTTPClient ¶
NewClientWithHTTPClient creates a GitHub client with a custom http.Client. Useful for OAuth flows where the http.Client handles token refresh.
func NewClientWithToken ¶
NewClientWithToken creates a GitHub client with a static access token. Works for both PAT and OAuth access tokens.
func (*Client) DownloadContents ¶
func (c *Client) DownloadContents(ctx context.Context, owner, repo, path, ref string) (io.ReadCloser, error)
DownloadContents downloads a file larger than 1MB. Returns an io.ReadCloser that must be closed by the caller.
func (*Client) GetFileContent ¶
GetFileContent fetches the content of a file. For files < 1MB, content is base64 encoded in the response.
func (*Client) GetRepository ¶
GetRepository fetches a single repository.
func (*Client) GetTree ¶
GetTree fetches the entire tree for a repository recursively. This is efficient for getting all file paths in one API call.
func (*Client) GitHub ¶
GitHub returns the underlying go-github client. Caller should call ensureClient first.
func (*Client) ListAllAccessibleRepos ¶
ListAllAccessibleRepos returns ALL repositories the authenticated user can access. This includes: owned repos, collaborator repos, and organization member repos.
func (*Client) ListIssues ¶
func (c *Client) ListIssues(
	ctx context.Context,
	owner, repo string,
	opts *gh.IssueListByRepoOptions,
) ([]*gh.Issue, error)
ListIssues lists issues for a repository.
func (*Client) ListPullRequests ¶
func (c *Client) ListPullRequests(
	ctx context.Context,
	owner, repo string,
	opts *gh.PullRequestListOptions,
) ([]*gh.PullRequest, error)
ListPullRequests lists pull requests for a repository.
func (*Client) RateLimiter ¶
func (c *Client) RateLimiter() *RateLimiter
RateLimiter returns the rate limiter for external access.
func (*Client) TokenProvider ¶
func (c *Client) TokenProvider() driven.TokenProvider
TokenProvider returns the token provider (used by other modules).
type CommentContent ¶
type CommentContent struct {
	Author    string    `json:"author"`
	Body      string    `json:"body"`
	CreatedAt time.Time `json:"created_at"`
}
CommentContent represents a comment in the issue content.
type Config ¶
type Config struct {
	// ContentTypes specifies what content to index.
	// Default: all types (files, issues, prs, wikis)
	ContentTypes []ContentType
	// FilePatterns are glob patterns for file filtering.
	// Default: all files
	FilePatterns []string
}
Config holds the parsed configuration for a GitHub source.
func ParseConfig ¶
ParseConfig parses a source's config map into a Config struct. All fields are optional - by default indexes all accessible repos with all content types.
func (*Config) HasContentType ¶
func (c *Config) HasContentType(ct ContentType) bool
HasContentType checks if a content type is enabled.
type Connector ¶
type Connector struct {
	// contains filtered or unexported fields
}
Connector fetches documents from GitHub repositories.
func New ¶
func New(sourceID string, cfg *Config, tokenProvider driven.TokenProvider) *Connector
New creates a new GitHub connector.
func (*Connector) Capabilities ¶
func (c *Connector) Capabilities() driven.ConnectorCapabilities
Capabilities returns the connector's capabilities.
func (*Connector) GetAccountIdentifier ¶
GetAccountIdentifier fetches the GitHub username for the authenticated user.
func (*Connector) IncrementalSync ¶
func (c *Connector) IncrementalSync(
	ctx context.Context,
	state domain.SyncState,
) (<-chan domain.RawDocumentChange, <-chan error)
IncrementalSync fetches only changes since the last sync.
type ContentType ¶
type ContentType string
ContentType represents the type of content to index.
const (
	ContentFiles  ContentType = "files"
	ContentIssues ContentType = "issues"
	ContentPRs    ContentType = "prs"
	ContentWikis  ContentType = "wikis"
)
func AllContentTypes ¶
func AllContentTypes() []ContentType
AllContentTypes returns all supported content types.
type Cursor ¶
type Cursor struct {
	// Version is the schema version for future migrations.
	Version int `json:"v"`
	// Repos maps repository full name (owner/repo) to its cursor state.
	Repos map[string]RepoCursor `json:"repos"`
}
Cursor tracks sync state across multiple repositories and content types.
func DecodeCursor ¶
DecodeCursor deserializes a cursor from a base64-encoded JSON string. Returns a new empty cursor if the input is empty or invalid.
func (*Cursor) GetRepoCursor ¶
func (c *Cursor) GetRepoCursor(owner, repo string) RepoCursor
GetRepoCursor returns the cursor for a specific repository.
func (*Cursor) SetRepoCursor ¶
func (c *Cursor) SetRepoCursor(owner, repo string, cursor *RepoCursor)
SetRepoCursor sets the cursor for a specific repository.
func (*Cursor) UpdateFilesTreeSHA ¶
UpdateFilesTreeSHA updates the files tree SHA for a repository.
func (*Cursor) UpdateIssuesSince ¶
UpdateIssuesSince updates the issues timestamp for a repository.
func (*Cursor) UpdatePRsSince ¶
UpdatePRsSince updates the PRs timestamp for a repository.
func (*Cursor) UpdateWikiCommitSHA ¶
UpdateWikiCommitSHA updates the wiki commit SHA for a repository.
type IssueContent ¶
type IssueContent struct {
	Number    int              `json:"number"`
	Title     string           `json:"title"`
	Body      string           `json:"body"`
	State     string           `json:"state"`
	Author    string           `json:"author"`
	CreatedAt time.Time        `json:"created_at"`
	UpdatedAt time.Time        `json:"updated_at"`
	Labels    []string         `json:"labels"`
	Assignees []string         `json:"assignees"`
	Milestone string           `json:"milestone,omitempty"`
	Comments  []CommentContent `json:"comments"`
}
IssueContent is the JSON structure for the issue RawDocument content.
type OAuthHandler ¶
type OAuthHandler struct{}
OAuthHandler implements OAuth operations for GitHub.
func NewOAuthHandler ¶
func NewOAuthHandler() *OAuthHandler
NewOAuthHandler creates a new GitHub OAuth handler.
func (*OAuthHandler) BuildAuthURL ¶
func (h *OAuthHandler) BuildAuthURL(
	authProvider *domain.AuthProvider,
	redirectURI, state, codeChallenge string,
) string
BuildAuthURL constructs the GitHub OAuth authorization URL. GitHub doesn't require access_type=offline like Google does.
func (*OAuthHandler) DefaultConfig ¶
func (h *OAuthHandler) DefaultConfig() driven.OAuthDefaults
DefaultConfig returns default OAuth URLs and scopes for GitHub.
func (*OAuthHandler) ExchangeCode ¶
func (h *OAuthHandler) ExchangeCode(
	ctx context.Context,
	authProvider *domain.AuthProvider,
	code, redirectURI, codeVerifier string,
) (*domain.OAuthToken, error)
ExchangeCode exchanges an authorization code for tokens.
func (*OAuthHandler) GetUserInfo ¶
GetUserInfo fetches the user's login from GitHub.
func (*OAuthHandler) RefreshToken ¶
func (h *OAuthHandler) RefreshToken(
	ctx context.Context,
	authProvider *domain.AuthProvider,
	refreshToken string,
) (*domain.OAuthToken, error)
RefreshToken refreshes an expired access token using a refresh token. Note: GitHub OAuth apps don't typically use refresh tokens. GitHub Apps use installation tokens which expire, but OAuth apps have long-lived tokens.
func (*OAuthHandler) SetupHint ¶
func (h *OAuthHandler) SetupHint() string
SetupHint returns guidance for setting up a GitHub OAuth app.
type PRContent ¶
type PRContent struct {
	Number       int              `json:"number"`
	Title        string           `json:"title"`
	Body         string           `json:"body"`
	State        string           `json:"state"`
	Draft        bool             `json:"draft"`
	Merged       bool             `json:"merged"`
	Author       string           `json:"author"`
	HeadBranch   string           `json:"head_branch"`
	BaseBranch   string           `json:"base_branch"`
	CreatedAt    time.Time        `json:"created_at"`
	UpdatedAt    time.Time        `json:"updated_at"`
	Labels       []string         `json:"labels"`
	Assignees    []string         `json:"assignees"`
	Reviewers    []string         `json:"reviewers"`
	Additions    int              `json:"additions"`
	Deletions    int              `json:"deletions"`
	ChangedFiles int              `json:"changed_files"`
	Comments     []CommentContent `json:"comments"`
	Reviews      []ReviewContent  `json:"reviews"`
}
PRContent is the JSON structure for the PR RawDocument content.
type RateLimitError ¶
RateLimitError represents a rate limit exceeded error with reset time.
func (*RateLimitError) Error ¶
func (e *RateLimitError) Error() string
type RateLimiter ¶
type RateLimiter struct {
	// contains filtered or unexported fields
}
RateLimiter implements dual-strategy rate limiting for GitHub API.
func NewRateLimiter ¶
func NewRateLimiter() *RateLimiter
NewRateLimiter creates a new rate limiter with proactive throttling.
func (*RateLimiter) CheckRateLimit ¶
func (r *RateLimiter) CheckRateLimit(resp *http.Response) error
CheckRateLimit checks if the response indicates rate limiting. Returns a RateLimitError if rate limited, nil otherwise.
func (*RateLimiter) Remaining ¶
func (r *RateLimiter) Remaining() int
Remaining returns the current remaining requests.
func (*RateLimiter) ResetTime ¶
func (r *RateLimiter) ResetTime() time.Time
ResetTime returns the rate limit reset time.
func (*RateLimiter) UpdateFromResponse ¶
func (r *RateLimiter) UpdateFromResponse(resp *http.Response)
UpdateFromResponse updates rate limit state from response headers.
func (*RateLimiter) Wait ¶
func (r *RateLimiter) Wait(ctx context.Context) error
Wait blocks until it's safe to make a request. It uses both proactive throttling and reactive API limit checking.
func (*RateLimiter) WaitForReset ¶
func (r *RateLimiter) WaitForReset(ctx context.Context) error
WaitForReset waits until the rate limit resets.
type RepoCursor ¶
type RepoCursor struct {
	// FilesTreeSHA is the Git tree SHA for the last indexed commit.
	FilesTreeSHA string `json:"files_sha,omitempty"`
	// IssuesSince is the timestamp of the last updated issue.
	IssuesSince time.Time `json:"issues_since,omitempty"`
	// PRsSince is the timestamp of the last updated PR.
	PRsSince time.Time `json:"prs_since,omitempty"`
	// WikiCommitSHA is the last indexed wiki commit SHA.
	WikiCommitSHA string `json:"wiki_sha,omitempty"`
}
RepoCursor tracks sync state for a single repository.