go_scholar

v1.0.10
Published: Mar 16, 2026 License: Apache-2.0 Imports: 13 Imported by: 0

README

scholar

scholar is a work-in-progress Go module that implements a querier and parser for Google Scholar's output. Its types can be used independently, and it can also be invoked as a command-line tool.

This tool is inspired by scholar.py.

Usage

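import (
	"log"
	"time"

	"github.com/compscidr/scholar"
)

sch := scholar.New("profiles.json", "articles.json")

// Optional: configure the delay between requests for throttling (default is 2 seconds)
sch.SetRequestDelay(1 * time.Second)

// QueryProfile returns ([]*Article, error), so check the error before using the result
articles, err := sch.QueryProfile("SbUmSEAAAAAJ", 1)
if err != nil {
	log.Fatal(err)
}

for _, article := range articles {
	// do something with the article
	log.Println(article)
}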

Features

Working:

  • Queries and parses a user profile by user id to get basic publication data
  • Queries each of the articles listed (up to 80) and parses the results for extra information
  • Caches profiles and articles in memory (the exported constants set the maximum age to 1 week for profiles and 30 days for articles; expiry still needs to be confirmed as working)
    • The in-memory cache is lost when the program is restarted
  • Configurable limit to number of articles to query in one go
  • On-disk caching of the profile and articles to avoid hitting the rate limit
  • Rate limiting and throttling with configurable delays between requests
  • Automatic retry with exponential backoff for 429 (Too Many Requests) responses

Testing

The module includes both mocked tests (fast, no network) and optional integration tests (against real Google Scholar API).

Running Tests
# Run all tests (uses mock HTTP client, no network requests)
go test

# Run specific test
go test -run TestProfileQuerier

# Run integration tests against real Google Scholar API (optional)
go test -tags integration

# Note: Integration tests may fail due to rate limits or network restrictions
# This is expected and will not break CI/CD pipelines

The integration tests are designed to be optional - they test against the real Google Scholar API but gracefully handle network failures and rate limits. This allows developers to verify functionality against the live API when needed without breaking automated builds.

TODO:

  • Pagination of articles

Rate Limiting

The library automatically throttles requests to avoid hitting Google Scholar's rate limits:

  • Default delay: 2 seconds between requests
  • Configurable via SetRequestDelay(duration)
  • Automatic retry with exponential backoff for 429 responses (up to 3 retries)
  • Backoff delays: 5s, 10s, 20s for subsequent retries

Possible throttle info:

https://stackoverflow.com/questions/60271587/how-long-is-the-error-429-toomanyrequests-cooldown

Documentation

Constants

const AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/81.0"
const BaseURL = "https://scholar.google.com"
const MAX_TIME_ARTICLE = time.Second * 3600 * 24 * 30 // 30 days
const MAX_TIME_PROFILE = time.Second * 3600 * 24 * 7 // 1 week

Variables

This section is empty.

Functions

This section is empty.

Types

type Article

type Article struct {
	Title               string
	Authors             string
	ScholarURL          string
	Year                int
	Month               int
	Day                 int
	NumCitations        int
	Articles            int // if there is more than one article within this publication (this also gives the length of the slices below)
	Description         string
	PdfURL              string
	Journal             string
	Volume              string
	Pages               string
	Publisher           string
	ScholarCitedByURLs  []string
	ScholarVersionsURLs []string
	ScholarRelatedURLs  []string
	LastRetrieved       time.Time
}

func (Article) String

func (a Article) String() string

type HTTPClient added in v1.0.10

type HTTPClient interface {
	Do(req *http.Request) (*http.Response, error)
}

HTTPClient interface to allow mocking of HTTP requests

type Profile

type Profile struct {
	User          string
	LastRetrieved time.Time
	Articles      []string // list of article URLs - we'd still need to look them up in the article map
}

type Scholar

type Scholar struct {
	// contains filtered or unexported fields
}

func New

func New(profileCache string, articleCache string) *Scholar

func (*Scholar) QueryArticle

func (sch *Scholar) QueryArticle(url string, article *Article, dumpResponse bool) (*Article, error)

func (*Scholar) QueryProfile

func (sch *Scholar) QueryProfile(user string, limit int) ([]*Article, error)

func (*Scholar) QueryProfileDumpResponse

func (sch *Scholar) QueryProfileDumpResponse(user string, queryArticles bool, limit int, dumpResponse bool) ([]*Article, error)

QueryProfileDumpResponse queries the profile of a user and returns a list of Articles. If queryArticles is true, it also queries each article for extra information that isn't present on the profile page.

Setting queryArticles to false is useful when only the basic article info is needed, or when there is a cache hit and only updated information from the profile page is wanted, which saves requests.

If dumpResponse is true, it prints the response to stdout (useful for debugging).

func (*Scholar) QueryProfileWithMemoryCache added in v1.0.5

func (sch *Scholar) QueryProfileWithMemoryCache(user string, limit int) ([]*Article, error)

func (*Scholar) SaveCache added in v1.0.5

func (sch *Scholar) SaveCache(profileCache string, articleCache string)

func (*Scholar) SetHTTPClient added in v1.0.10

func (sch *Scholar) SetHTTPClient(client HTTPClient)

SetHTTPClient allows setting a custom HTTP client (useful for testing)

func (*Scholar) SetRequestDelay added in v1.0.10

func (sch *Scholar) SetRequestDelay(delay time.Duration)

SetRequestDelay allows setting a custom delay between requests for throttling

Directories

Path Synopsis
scholar module
