code

package
v0.36.1 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Mar 19, 2026 License: MIT Imports: 7 Imported by: 0

Documentation

Overview

Package code provides code-aware sparse vector generation for source code.

The CodeSparseProvider splits identifiers (camelCase, snake_case, acronyms) before hashing into sparse vectors, improving recall for code search.

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

This section is empty.

Types

type CodeSparseProvider

type CodeSparseProvider struct {
	// contains filtered or unexported fields
}

func NewCodeSparseProvider

func NewCodeSparseProvider() *CodeSparseProvider

func (*CodeSparseProvider) GenerateSparseVector

func (p *CodeSparseProvider) GenerateSparseVector(ctx context.Context, text string) (*schema.SparseVector, error)

type Provider

type Provider interface {
	GenerateSparseVector(ctx context.Context, text string) (*schema.SparseVector, error)
}

func NewProvider

func NewProvider() Provider

type Tokenizer

type Tokenizer struct{}

Tokenizer is a code-aware sparse vector provider. It splits camelCase and snake_case identifiers into constituent terms, filters language keywords, and produces normalized sparse vectors via FNV hashing. Register it with sparse.RegisterProvider to replace the default BGE BoW provider for source code inputs.

func NewTokenizer

func NewTokenizer() *Tokenizer

func (*Tokenizer) Tokenize

func (t *Tokenizer) Tokenize(text string) []string

Tokenize splits text into lowercase code terms. Pipeline: normalize whitespace → split on operators/punctuation/underscores → split camelCase → lowercase → filter short tokens and stop words.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL