defsource

Version: v0.1.0
Published: Mar 22, 2026 | License: MIT | Imports: 7 | Imported by: 0

README

defsource

Go library for crawling, indexing, and searching API documentation with FTS5 full-text search.


Features

  • Full-text search with SQLite FTS5 and BM25 ranking
  • Pluggable source adapters (WordPress reference docs included)
  • Concurrent crawler with rate limiting, retries, and resume support
  • Token-budgeted output formatting (LLM-friendly)
  • Wrapper method resolution (traces delegation chains up to 3 levels)
  • Priority-based crawl ordering (critical classes first)
  • HTTP REST API server
  • CLI tools for crawling and serving

Installation

go get github.com/hatlesswizard/defsource

Important: This library uses CGO via go-sqlite3. You must have CGO enabled and include the sqlite_fts5 build tag:

CGO_ENABLED=1 go build -tags sqlite_fts5 ./...

Quick Start

package main

import (
	"context"
	"fmt"
	"log"

	"github.com/hatlesswizard/defsource"
)

func main() {
	client, err := defsource.New("./data/defsource.db")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := context.Background()

	// List all indexed libraries
	libs, err := client.ListLibraries(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, lib := range libs {
		fmt.Printf("%s (%s) - %d snippets\n", lib.Name, lib.ID, lib.SnippetCount)
	}

	// Query documentation
	if len(libs) > 0 {
		result, err := client.QueryDocs(ctx, libs[0].ID, "get posts")
		if err != nil {
			log.Fatal(err)
		}
		fmt.Println(result.Text)
	}
}

API Reference

Constructor and Lifecycle
New
func New(dbPath string, opts ...Option) (*Client, error)

Creates a new defSource client backed by a SQLite database at dbPath. The database file is created if it does not exist. Returns an error if the database cannot be opened or initialized.

| Parameter | Type | Description |
| --- | --- | --- |
| dbPath | string | Filesystem path to the SQLite database file |
| opts | ...Option | Zero or more functional options to configure the client |

Defaults:

  • Token budget: 8000
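
The way options override this default can be illustrated with a minimal functional-options sketch. It mirrors the documented shape `type Option func(*Client)`, but the `client` struct and its `tokenBudget` field are assumptions made for illustration, not the library's actual internals.

```go
package main

import "fmt"

// client is a stand-in for defsource.Client; the tokenBudget field name is assumed.
type client struct {
	tokenBudget int
}

// option mirrors the documented shape `type Option func(*Client)`.
type option func(*client)

// withTokenBudget is a local analogue of defsource.WithTokenBudget.
func withTokenBudget(budget int) option {
	return func(c *client) { c.tokenBudget = budget }
}

// newClient sets the documented default (8000) first, then applies each
// option in order, so later options win over earlier ones.
func newClient(opts ...option) *client {
	c := &client{tokenBudget: 8000}
	for _, opt := range opts {
		opt(c)
	}
	return c
}

func main() {
	fmt.Println(newClient().tokenBudget)                      // 8000
	fmt.Println(newClient(withTokenBudget(4000)).tokenBudget) // 4000
}
```

Because options are plain functions, new settings can be added later without changing the constructor signature.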
Close
func (c *Client) Close() error

Releases all database resources held by the client. Always call Close when you are done using the client (typically via defer).

Store
func (c *Client) Store() store.Store

Returns the underlying store.Store interface. This is intended for use by the crawler and other internal subsystems that need direct store access.


Query Methods
QueryDocs
func (c *Client) QueryDocs(ctx context.Context, libraryID, query string, opts ...QueryOption) (*DocResult, error)

Primary search method. Performs an FTS5 full-text search against the specified library and returns documentation snippets ranked by BM25 relevance. Results are capped at 20 snippets. The Text field of the returned DocResult contains pre-formatted markdown output, trimmed to fit within the configured token budget.

| Parameter | Type | Description |
| --- | --- | --- |
| ctx | context.Context | Context for cancellation and deadlines |
| libraryID | string | The ID of the library to search within |
| query | string | The search query string |
| opts | ...QueryOption | Optional query configuration (e.g., search mode) |

Returns: *DocResult containing matched snippets and formatted text, or an error if the library is not found or the search fails.

result, err := client.QueryDocs(ctx, "wordpress", "register post type")
fmt.Println(result.Text) // pre-formatted markdown
ResolveLibrary
func (c *Client) ResolveLibrary(ctx context.Context, query, libraryName string) ([]Library, error)

Searches for libraries whose names match libraryName and ranks them by name similarity and by relevance to query. Returns up to 5 results. Snippet counts are computed on the fly if not already stored.

| Parameter | Type | Description |
| --- | --- | --- |
| ctx | context.Context | Context for cancellation and deadlines |
| query | string | The user's search query, used for ranking |
| libraryName | string | The library name or partial name to search for |

Returns: A slice of up to 5 Library values ranked by relevance, or an error.

libs, err := client.ResolveLibrary(ctx, "custom post type", "wordpress")
ListLibraries
func (c *Client) ListLibraries(ctx context.Context) ([]Library, error)

Returns all indexed libraries. For each library, the snippet count is computed on the fly if not already stored (by counting entities plus their methods).

| Parameter | Type | Description |
| --- | --- | --- |
| ctx | context.Context | Context for cancellation and deadlines |

Returns: A slice of all Library values, or an error.

libs, err := client.ListLibraries(ctx)
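
The counting rule described above (entities plus their methods) can be written out as a short sketch. The `entityInfo` type keeps only the `Name` and `MethodCount` fields of the documented `EntityInfo`, and the sample counts are illustrative values rather than real crawl data. This shows the arithmetic only and is not the library's implementation.

```go
package main

import "fmt"

// entityInfo keeps only the EntityInfo fields this sketch needs.
type entityInfo struct {
	Name        string
	MethodCount int
}

// snippetCount counts one snippet per entity plus one per method.
func snippetCount(entities []entityInfo) int {
	total := len(entities)
	for _, e := range entities {
		total += e.MethodCount
	}
	return total
}

func main() {
	// Illustrative method counts, not real crawl data.
	entities := []entityInfo{
		{Name: "WP_Query", MethodCount: 42},
		{Name: "WP_Post", MethodCount: 10},
	}
	fmt.Println(snippetCount(entities)) // 2 entities + 52 methods = 54
}
```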
ListEntities
func (c *Client) ListEntities(ctx context.Context, libraryID string) ([]EntityInfo, error)

Returns all entities (classes, functions, etc.) for a given library, including the count of methods on each entity.

| Parameter | Type | Description |
| --- | --- | --- |
| ctx | context.Context | Context for cancellation and deadlines |
| libraryID | string | The ID of the library to list entities for |

Returns: A slice of EntityInfo values, or an error.

entities, err := client.ListEntities(ctx, "wordpress")

Options
WithTokenBudget
func WithTokenBudget(budget int) Option

Sets the maximum approximate token count for QueryDocs responses. The formatted markdown text in DocResult.Text will be truncated to stay within this budget. Default is 8000 tokens. Use this to control response size when integrating with LLMs that have context limits.

client, err := defsource.New("defsource.db", defsource.WithTokenBudget(4000))
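
To make "approximate token count" concrete, here is one possible way to fit ranked sections into a budget. Both the four-bytes-per-token estimate and the helper names (`estimateTokens`, `fitToBudget`) are assumptions for illustration; the README does not say how defsource estimates tokens or where it cuts the text.

```go
package main

import (
	"fmt"
	"strings"
)

// estimateTokens assumes roughly four bytes per token, rounded up.
// This is a common rough heuristic, not the library's documented method.
func estimateTokens(s string) int {
	return (len(s) + 3) / 4
}

// fitToBudget keeps leading sections (most relevant first) for as long as
// the running estimate stays within budget, then stops.
func fitToBudget(sections []string, budget int) string {
	var b strings.Builder
	used := 0
	for _, s := range sections {
		cost := estimateTokens(s)
		if used+cost > budget {
			break
		}
		b.WriteString(s)
		used += cost
	}
	return b.String()
}

func main() {
	sections := []string{
		"# register_post_type\n",
		"# get_post_type_object\n",
		"# unregister_post_type\n",
	}
	// With a budget of 12 estimated tokens only the first two sections fit.
	fmt.Print(fitToBudget(sections, 12))
}
```

Dropping whole sections rather than cutting mid-section keeps each remaining snippet intact, which tends to matter more to an LLM consumer than using every last token.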
WithSearchMode
func WithSearchMode(mode string) QueryOption

Sets the FTS5 search mode for a QueryDocs call.

| Mode | Behavior | Description |
| --- | --- | --- |
| "all" | AND (default) | Every query term must appear in the result |
| "any" | OR | At least one query term must appear in the result |

result, err := client.QueryDocs(ctx, "wordpress", "post meta", defsource.WithSearchMode("any"))
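
As a rough picture of what the two modes mean at the FTS5 level, the sketch below turns a free-text query into a MATCH expression joined with AND or OR. `buildMatch` is a hypothetical helper; how defsource actually tokenizes and escapes queries is not documented here.

```go
package main

import (
	"fmt"
	"strings"
)

// buildMatch is a hypothetical helper that joins the query terms with the
// FTS5 boolean operator corresponding to the search mode.
func buildMatch(query, mode string) string {
	op := " AND " // "all" and any unrecognized mode fall back to AND
	if mode == "any" {
		op = " OR "
	}
	terms := strings.Fields(query)
	quoted := make([]string, 0, len(terms))
	for _, t := range terms {
		// FTS5 string literals use double quotes; embedded quotes are doubled.
		quoted = append(quoted, `"`+strings.ReplaceAll(t, `"`, `""`)+`"`)
	}
	return strings.Join(quoted, op)
}

func main() {
	fmt.Println(buildMatch("post meta", "all")) // "post" AND "meta"
	fmt.Println(buildMatch("post meta", "any")) // "post" OR "meta"
}
```

Quoting each term keeps characters such as `-` or `:` in user input from being parsed as FTS5 query syntax.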

Types
Library

Represents an indexed documentation source.

type Library struct {
    ID           string    `json:"id"`            // Unique identifier for the library
    Name         string    `json:"name"`          // Human-readable library name
    Description  string    `json:"description"`   // Short description of the library
    SourceURL    string    `json:"source_url"`    // URL of the original documentation source
    Version      string    `json:"version"`       // Library version string
    TrustScore   float64   `json:"trust_score"`   // Confidence score (0.0-1.0) for the source
    SnippetCount int       `json:"snippet_count"` // Total number of indexed snippets (entities + methods)
    CrawledAt    time.Time `json:"crawled_at"`    // Timestamp of the last completed crawl
}
DocResult

The response returned by QueryDocs.

type DocResult struct {
    Library  string       `json:"library"`  // ID of the queried library
    Query    string       `json:"query"`    // The original search query
    Snippets []DocSnippet `json:"snippets"` // Matched documentation snippets ranked by relevance
    Text     string       `json:"text"`     // Pre-formatted markdown output (token-budgeted)
}
DocSnippet

A single documentation entry representing either a class/entity or a specific method.

type DocSnippet struct {
    EntityName    string      `json:"entity_name"`              // Name of the parent entity (class, function group)
    MethodName    string      `json:"method_name,omitempty"`    // Method name, empty if this is an entity-level snippet
    Signature     string      `json:"signature,omitempty"`      // Full method signature
    Description   string      `json:"description"`              // Human-readable description
    Parameters    []Parameter `json:"parameters,omitempty"`     // Method parameters (empty for entity-level snippets)
    ReturnType    string      `json:"return_type,omitempty"`    // Return type string
    ReturnDesc    string      `json:"return_desc,omitempty"`    // Description of the return value
    SourceCode    string      `json:"source_code"`              // Source code of the entity or method
    WrappedSource string      `json:"wrapped_source,omitempty"` // Source code of the delegated-to method (wrapper resolution)
    WrappedMethod string      `json:"wrapped_method,omitempty"` // Name of the method this wraps/delegates to
    URL           string      `json:"url"`                      // URL to the original documentation page
    Relevance     float64     `json:"relevance"`                // BM25 relevance score from FTS5
    Relations     []Relation  `json:"relations,omitempty"`      // Relationships to other methods (uses/used_by)
}
Parameter

Describes a function or method parameter.

type Parameter struct {
    Name        string `json:"name"`        // Parameter name
    Type        string `json:"type"`        // Parameter type (e.g., "string", "int", "array")
    Required    bool   `json:"required"`    // Whether the parameter is required
    Description string `json:"description"` // Human-readable description of the parameter
}
Relation

Describes a relationship between methods.

type Relation struct {
    Kind        string `json:"kind"`                  // Relationship type: "uses" or "used_by"
    TargetName  string `json:"target_name"`           // Name of the related method
    TargetURL   string `json:"target_url,omitempty"`  // URL to the related method's documentation
    Description string `json:"description,omitempty"` // Description of the relationship
}
EntityInfo

Summary information about an entity, returned by ListEntities.

type EntityInfo struct {
    Name        string `json:"name"`         // Entity name (e.g., "WP_Query")
    Slug        string `json:"slug"`         // URL-safe slug for the entity
    Kind        string `json:"kind"`         // Entity kind (e.g., "class", "function")
    Description string `json:"description"`  // Short description
    MethodCount int    `json:"method_count"` // Number of methods belonging to this entity
    URL         string `json:"url"`          // URL to the original documentation page
}

Usage Examples
Basic Query
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/hatlesswizard/defsource"
)

func main() {
	client, err := defsource.New("./data/defsource.db")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	result, err := client.QueryDocs(context.Background(), "wordpress", "register post type")
	if err != nil {
		log.Fatal(err)
	}

	// Print pre-formatted markdown (token-budgeted)
	fmt.Println(result.Text)

	// Or iterate over individual snippets
	for _, s := range result.Snippets {
		fmt.Printf("[%.2f] %s::%s\n", s.Relevance, s.EntityName, s.MethodName)
	}
}
Library Discovery
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/hatlesswizard/defsource"
)

func main() {
	client, err := defsource.New("./data/defsource.db")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := context.Background()

	// Find the best matching library
	libs, err := client.ResolveLibrary(ctx, "custom post type", "wordpress")
	if err != nil {
		log.Fatal(err)
	}
	if len(libs) == 0 {
		log.Fatal("no matching library found")
	}

	fmt.Printf("Using library: %s (trust: %.1f)\n", libs[0].Name, libs[0].TrustScore)

	// Query docs in the resolved library
	result, err := client.QueryDocs(ctx, libs[0].ID, "custom post type")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(result.Text)
}
Custom Configuration
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/hatlesswizard/defsource"
)

func main() {
	// Create a client with a smaller token budget
	client, err := defsource.New(
		"./data/defsource.db",
		defsource.WithTokenBudget(4000),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Use OR mode to find results matching any term
	result, err := client.QueryDocs(
		context.Background(),
		"wordpress",
		"meta query tax_query",
		defsource.WithSearchMode("any"),
	)
	if err != nil {
		log.Fatal(err)
	}

	fmt.Printf("Found %d snippets\n", len(result.Snippets))
	fmt.Println(result.Text)
}

CLI Tools

defsource-crawl

Crawls a documentation source and stores the results in a SQLite database.

Build:

CGO_ENABLED=1 go build -tags sqlite_fts5 -o bin/defsource-crawl ./cmd/defsource-crawl

Flags:

| Flag | Default | Description |
| --- | --- | --- |
| -source | wordpress | Documentation source to crawl |
| -db | ./data/defsource.db | Path to SQLite database |
| -workers | 10 | Number of concurrent workers |
| -rps | 10 | Requests per second (rate limit) |
| -resume | false | Resume the last interrupted crawl |
| -retry-failed | false | Retry transient failures from the last crawl |

Examples:

# Full crawl with default settings
./bin/defsource-crawl --source=wordpress --db=./data/defsource.db

# Fast crawl with more workers
./bin/defsource-crawl --source=wordpress --workers=20 --rps=20

# Resume an interrupted crawl
./bin/defsource-crawl --source=wordpress --resume

# Retry only the pages that failed last time
./bin/defsource-crawl --source=wordpress --retry-failed

The crawler supports graceful shutdown via SIGINT (Ctrl+C) or SIGTERM. An interrupted crawl can be continued later with --resume.

defsource-server

Serves the indexed documentation over an HTTP REST API.

Build:

CGO_ENABLED=1 go build -tags sqlite_fts5 -o bin/defsource-server ./cmd/defsource-server

Flags:

| Flag | Default | Description |
| --- | --- | --- |
| -db | ./data/defsource.db | Path to SQLite database |
| -addr | :8080 | Server listen address |

Environment Variables:

| Variable | Default | Description |
| --- | --- | --- |
| DEFSOURCE_CORS_ORIGIN | * | Allowed CORS origin header value |

Examples:

# Start with defaults
./bin/defsource-server

# Custom port and database
./bin/defsource-server --db=./mydata/docs.db --addr=:3000

# Restrict CORS to a specific origin
DEFSOURCE_CORS_ORIGIN=https://example.com ./bin/defsource-server --addr=:8080

The server supports graceful shutdown via SIGINT or SIGTERM with a 10-second drain timeout.

HTTP API

Endpoints
| Method | Path | Description |
| --- | --- | --- |
| GET | /api/v1/libraries | List all indexed libraries |
| GET | /api/v1/libraries/search | Search for libraries by name |
| GET | /api/v1/docs | Query documentation with full-text search |
| GET | /api/v1/entities | List entities for a library |
| GET | /health | Health check |

GET /api/v1/libraries

Returns all indexed libraries.

curl http://localhost:8080/api/v1/libraries

Response:

{
  "libraries": [
    {
      "id": "wordpress",
      "name": "WordPress",
      "description": "WordPress Class Reference",
      "source_url": "https://developer.wordpress.org",
      "version": "6.x",
      "trust_score": 0.95,
      "snippet_count": 4200,
      "crawled_at": "2025-01-15T10:30:00Z"
    }
  ]
}

GET /api/v1/libraries/search

Search for libraries by name, ranked by relevance to a query.

| Parameter | Required | Description |
| --- | --- | --- |
| libraryName | Yes | Library name or partial name (max 200 chars) |
| query | Yes | Search query for ranking (max 500 chars) |

curl "http://localhost:8080/api/v1/libraries/search?libraryName=wordpress&query=post+type"

Response:

{
  "results": [
    {
      "id": "wordpress",
      "name": "WordPress",
      "description": "WordPress Class Reference",
      "source_url": "https://developer.wordpress.org",
      "version": "6.x",
      "trust_score": 0.95,
      "snippet_count": 4200,
      "crawled_at": "2025-01-15T10:30:00Z"
    }
  ]
}
GET /api/v1/docs

Query documentation with full-text search. Returns markdown by default, or structured JSON when format=json is set.

| Parameter | Required | Description |
| --- | --- | --- |
| libraryId | Yes | Library ID to search within (max 200 chars) |
| query | Yes | Search query string (max 500 chars) |
| mode | No | Search mode: all (AND, default) or any (OR) |
| format | No | Response format: omit for markdown, json for structured JSON |

# Markdown response (default)
curl "http://localhost:8080/api/v1/docs?libraryId=wordpress&query=register+post+type"

# JSON response
curl "http://localhost:8080/api/v1/docs?libraryId=wordpress&query=register+post+type&format=json"

# OR mode
curl "http://localhost:8080/api/v1/docs?libraryId=wordpress&query=meta+query+tax_query&mode=any"

Markdown response: Returns Content-Type: text/markdown; charset=utf-8 with the pre-formatted documentation text.

JSON response:

{
  "library": "wordpress",
  "query": "register post type",
  "snippets": [
    {
      "entity_name": "WP_Post_Type",
      "method_name": "register_post_type",
      "signature": "register_post_type( string $post_type, array $args = array() )",
      "description": "Registers a post type.",
      "parameters": [
        {
          "name": "$post_type",
          "type": "string",
          "required": true,
          "description": "Post type key."
        }
      ],
      "return_type": "WP_Post_Type|WP_Error",
      "return_desc": "The registered post type object or an error.",
      "source_code": "...",
      "url": "https://developer.wordpress.org/reference/functions/register_post_type/",
      "relevance": 12.5,
      "relations": []
    }
  ],
  "text": "# register_post_type\n..."
}
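
For callers working in Go rather than curl, the request URL and JSON payload above can be handled with the standard library alone. This sketch only builds the URL and decodes a canned body shaped like the example response. The struct types cover a subset of the documented fields, the helper names (`docsURL`, `parseDocs`) are hypothetical, and the base address is assumed to be the default `localhost:8080`.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/url"
)

// docsURL builds a /api/v1/docs request URL that asks for structured JSON.
func docsURL(base, libraryID, query, mode string) string {
	q := url.Values{}
	q.Set("libraryId", libraryID)
	q.Set("query", query)
	q.Set("format", "json")
	if mode != "" {
		q.Set("mode", mode)
	}
	return base + "/api/v1/docs?" + q.Encode()
}

// snippet and docResult cover a subset of the documented response fields.
type snippet struct {
	EntityName string  `json:"entity_name"`
	MethodName string  `json:"method_name"`
	Relevance  float64 `json:"relevance"`
	URL        string  `json:"url"`
}

type docResult struct {
	Library  string    `json:"library"`
	Query    string    `json:"query"`
	Snippets []snippet `json:"snippets"`
	Text     string    `json:"text"`
}

// parseDocs decodes a JSON response body from /api/v1/docs.
func parseDocs(body []byte) (docResult, error) {
	var r docResult
	err := json.Unmarshal(body, &r)
	return r, err
}

func main() {
	fmt.Println(docsURL("http://localhost:8080", "wordpress", "register post type", ""))

	// Canned body shaped like the example response; a real caller would
	// read it from the HTTP response instead.
	body := []byte(`{
	  "library": "wordpress",
	  "query": "register post type",
	  "snippets": [{
	    "entity_name": "WP_Post_Type",
	    "method_name": "register_post_type",
	    "relevance": 12.5,
	    "url": "https://developer.wordpress.org/reference/functions/register_post_type/"
	  }],
	  "text": "# register_post_type\n..."
	}`)
	res, err := parseDocs(body)
	if err != nil {
		log.Fatal(err)
	}
	for _, s := range res.Snippets {
		fmt.Printf("[%.1f] %s::%s\n", s.Relevance, s.EntityName, s.MethodName)
	}
}
```

Building the query string with `url.Values` handles escaping of spaces and special characters, which is easy to get wrong when concatenating by hand.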
GET /api/v1/entities

List all entities (classes, functions) for a library, with method counts.

| Parameter | Required | Description |
| --- | --- | --- |
| libraryId | Yes | Library ID to list entities for |

curl "http://localhost:8080/api/v1/entities?libraryId=wordpress"

Response:

{
  "entities": [
    {
      "name": "WP_Query",
      "slug": "wp_query",
      "kind": "class",
      "description": "The WordPress Query class.",
      "method_count": 42,
      "url": "https://developer.wordpress.org/reference/classes/wp_query/"
    }
  ]
}
GET /health

Health check endpoint.

curl http://localhost:8080/health

Response:

{
  "status": "ok"
}

Architecture

defSource follows a pipeline architecture: documentation sources are discovered, crawled concurrently with rate limiting, parsed into structured entities and methods, and stored in a SQLite database with FTS5 full-text indexes. At query time, FTS5 performs BM25-ranked searches, and results are formatted into token-budgeted markdown suitable for LLM consumption.

Discover -> Crawl -> Parse -> Store (SQLite+FTS5) -> Search -> Format

The system uses a pluggable source adapter pattern. Each source implements a Source interface that defines how to discover pages, parse entities, and extract methods. This makes it straightforward to add support for new documentation sources without modifying the core crawling or search infrastructure.

Supported Sources

WordPress Class Reference

Crawls the WordPress Developer Reference and indexes:

  • Classes and their methods (e.g., WP_Query, WP_Post, WP_REST_Controller)
  • Function signatures, parameters, return types, and descriptions
  • Source code for both methods and their wrapper targets
  • Cross-references between methods (uses/used_by relationships)

Priority-based ordering ensures that critical classes (such as WP_Query) are crawled first.

Extensibility

New documentation sources can be added by implementing the Source interface in the internal/source package. The interface defines methods for discovering entity URLs, parsing entity pages, and extracting method details.
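
To show why an adapter interface keeps the core loop unchanged, here is a minimal sketch of the pattern. The `Source` interface below is a hypothetical shape: the real interface lives in internal/source, is not shown in this README, and may have different method names and signatures. `staticSource` and `crawl` are likewise illustrative.

```go
package main

import (
	"fmt"
	"sort"
)

// Method and Entity are simplified stand-ins for the parsed records.
type Method struct{ Name, Signature string }

type Entity struct {
	Name    string
	Kind    string
	Methods []Method
}

// Source is a hypothetical adapter shape; the real interface in
// internal/source is not shown in this README and may differ.
type Source interface {
	ID() string
	Discover() ([]string, error)                // entity page URLs
	ParseEntity(pageURL string) (Entity, error) // entity plus its methods
}

// staticSource serves canned pages so the sketch runs without network access.
type staticSource struct{ pages map[string]Entity }

func (s staticSource) ID() string { return "static" }

func (s staticSource) Discover() ([]string, error) {
	urls := make([]string, 0, len(s.pages))
	for u := range s.pages {
		urls = append(urls, u)
	}
	sort.Strings(urls) // deterministic order for the example
	return urls, nil
}

func (s staticSource) ParseEntity(pageURL string) (Entity, error) {
	e, ok := s.pages[pageURL]
	if !ok {
		return Entity{}, fmt.Errorf("unknown page %q", pageURL)
	}
	return e, nil
}

// crawl depends only on the interface, so adding a source leaves it unchanged.
func crawl(src Source) ([]Entity, error) {
	urls, err := src.Discover()
	if err != nil {
		return nil, err
	}
	entities := make([]Entity, 0, len(urls))
	for _, u := range urls {
		e, err := src.ParseEntity(u)
		if err != nil {
			return nil, err
		}
		entities = append(entities, e)
	}
	return entities, nil
}

func main() {
	src := staticSource{pages: map[string]Entity{
		"https://example.com/classes/widget": {
			Name: "Widget", Kind: "class",
			Methods: []Method{{Name: "render", Signature: "render()"}},
		},
	}}
	entities, err := crawl(src)
	if err != nil {
		fmt.Println("crawl failed:", err)
		return
	}
	for _, e := range entities {
		fmt.Printf("%s (%s): %d methods\n", e.Name, e.Kind, len(e.Methods))
	}
}
```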

Development

# Build both CLI tools
make build

# Run tests
make test

# Crawl WordPress docs (builds first)
make crawl

# Start the HTTP server (builds first)
make server

# Run linter
make lint

# Clean build artifacts and database
make clean

Requirements

  • Go 1.25.3 or later (as specified in go.mod)
  • CGO must be enabled (CGO_ENABLED=1) -- required by go-sqlite3
  • Build tag sqlite_fts5 must be included (-tags sqlite_fts5)
  • A C compiler (GCC or Clang) for CGO compilation

License

MIT -- see LICENSE for details.

Documentation

Types

type Client

type Client struct {
	// contains filtered or unexported fields
}

Client is the main entry point for querying documentation.

func New

func New(dbPath string, opts ...Option) (*Client, error)

New creates a new defSource client backed by a SQLite database at dbPath.

func (*Client) Close

func (c *Client) Close() error

Close releases database resources.

func (*Client) ListEntities

func (c *Client) ListEntities(ctx context.Context, libraryID string) ([]EntityInfo, error)

ListEntities returns all entities for a library.

func (*Client) ListLibraries

func (c *Client) ListLibraries(ctx context.Context) ([]Library, error)

ListLibraries returns all indexed libraries.

func (*Client) QueryDocs

func (c *Client) QueryDocs(ctx context.Context, libraryID, query string, opts ...QueryOption) (*DocResult, error)

QueryDocs retrieves documentation for a specific library.

func (*Client) ResolveLibrary

func (c *Client) ResolveLibrary(ctx context.Context, query, libraryName string) ([]Library, error)

ResolveLibrary searches for libraries matching the given name.

func (*Client) Store

func (c *Client) Store() store.Store

Store returns the underlying store for use by the crawler.

type DocResult

type DocResult struct {
	Library  string       `json:"library"`
	Query    string       `json:"query"`
	Snippets []DocSnippet `json:"snippets"`
	Text     string       `json:"text"`
}

DocResult is the response from QueryDocs.

type DocSnippet

type DocSnippet struct {
	EntityName    string      `json:"entity_name"`
	MethodName    string      `json:"method_name,omitempty"`
	Signature     string      `json:"signature,omitempty"`
	Description   string      `json:"description"`
	Parameters    []Parameter `json:"parameters,omitempty"`
	ReturnType    string      `json:"return_type,omitempty"`
	ReturnDesc    string      `json:"return_desc,omitempty"`
	SourceCode    string      `json:"source_code"`
	WrappedSource string      `json:"wrapped_source,omitempty"`
	WrappedMethod string      `json:"wrapped_method,omitempty"`
	URL           string      `json:"url"`
	Relevance     float64     `json:"relevance"`
	Relations     []Relation  `json:"relations,omitempty"`
}

DocSnippet is a single documentation entry (class or method).

type EntityInfo

type EntityInfo struct {
	Name        string `json:"name"`
	Slug        string `json:"slug"`
	Kind        string `json:"kind"`
	Description string `json:"description"`
	MethodCount int    `json:"method_count"`
	URL         string `json:"url"`
}

EntityInfo represents summary information about an entity.

type Library

type Library struct {
	ID           string    `json:"id"`
	Name         string    `json:"name"`
	Description  string    `json:"description"`
	SourceURL    string    `json:"source_url"`
	Version      string    `json:"version"`
	TrustScore   float64   `json:"trust_score"`
	SnippetCount int       `json:"snippet_count"`
	CrawledAt    time.Time `json:"crawled_at"`
}

Library represents an indexed documentation source.

type Option

type Option func(*Client)

Option configures the Client.

func WithTokenBudget

func WithTokenBudget(budget int) Option

WithTokenBudget sets the maximum approximate token count for query-docs responses.

type Parameter

type Parameter struct {
	Name        string `json:"name"`
	Type        string `json:"type"`
	Required    bool   `json:"required"`
	Description string `json:"description"`
}

Parameter describes a function/method parameter.

type QueryOption

type QueryOption func(*queryConfig)

QueryOption configures a QueryDocs call.

func WithSearchMode

func WithSearchMode(mode string) QueryOption

WithSearchMode sets the FTS5 search mode: "all" (AND, default) or "any" (OR).

type Relation

type Relation struct {
	Kind        string `json:"kind"`
	TargetName  string `json:"target_name"`
	TargetURL   string `json:"target_url,omitempty"`
	Description string `json:"description,omitempty"`
}

Relation describes a relationship between methods (uses/used_by).

Directories

Path Synopsis
internal
