scanner

package module
v0.0.3
Published: Nov 11, 2025 License: MIT Imports: 11 Imported by: 0

README

Smart Text Processing

"Finally, semantic chunking that understands meaning!" — Text processing, done intelligently.

Text scanners for Go that go beyond simple line-by-line processing. Built with semantic understanding at its core, making text chunking intelligent and context-aware.

Traditional text processing treats all chunks equally, but meaning isn't uniform across text. The scanner library uses embedding-based semantic analysis to group related content together, making it perfect for RAG systems, document analysis, and intelligent text processing pipelines.

Quick Start: Semantic Chunking

Here's how to chunk text by semantic similarity instead of arbitrary boundaries:

package main

import (
  "fmt"
  "strings"

  "github.com/fogfish/scanner"
)

func main() {
  var api scanner.Embedder // create instance of embedding vector provider
  text := `
  The quick brown fox jumps over the lazy dog. 
  This is a classic pangram used in typography.
  
  Machine learning has revolutionized AI.
  Neural networks can now understand language context.
  
  Climate change affects global weather patterns.
  Rising temperatures impact ecosystems worldwide.
  `

  // Break text into sentences first
  sentences := scanner.NewSentencer(
    scanner.EndOfSentence, 
    strings.NewReader(text),
  )

  // Group sentences by semantic similarity
  semantic := scanner.NewSemantic(api, sentences)
  semantic.Window(10)                         // Look at 10 sentences at a time
  semantic.Similarity(scanner.HighSimilarity) // Group highly similar content

  // Get semantically coherent chunks
  for semantic.Scan() {
    chunk := semantic.Text()
    fmt.Printf("Semantic chunk: %v\n", chunk)
    // Output will group related sentences together:
    // - Typography sentences together
    // - AI/ML sentences together
    // - Climate sentences together
  }
}

This approach produces chunks where sentences actually relate to each other, rather than arbitrary splits that might separate related concepts.

Why Semantic Chunking Matters

Traditional chunking problems:

  • Splits related content across chunks
  • Breaks context mid-conversation
  • Fixed boundaries ignore meaning
  • Poor retrieval in RAG systems

Semantic chunking benefits:

  • Keeps related content together
  • Maintains semantic coherence
  • Context-aware boundaries
  • Better embedding similarity for retrieval

Perfect for:

  • RAG Systems: Better retrieval through coherent chunks
  • Document Analysis: Group related paragraphs and concepts
  • Content Summarization: Preserve topic boundaries
  • Text Classification: Maintain semantic integrity

The Scanner Toolkit

Beyond semantic chunking, the library provides a complete text processing toolkit:

Scanner    Purpose                       Use Case
Semantic   Groups by meaning similarity  RAG, document analysis
Sentencer  Splits by punctuation         Natural sentence boundaries
Slicer     Fixed delimiter splitting     CSV, structured data
Chunker    Fixed-size chunks             Token limits, simple splitting
Tagger     Tag-bounded chunks            Markup data
Sorter     Semantic sorting of data      Organizing similar items
Identity   Entire input as one chunk     Small documents

All scanners implement the familiar bufio.Scanner interface:

for scanner.Scan() {
  text := scanner.Text()
  // Process chunk
}

Similarity Control

Fine-tune semantic grouping with built-in similarity functions:

semantic.Similarity(scanner.HighSimilarity)   // Very similar content (0.0-0.2)
semantic.Similarity(scanner.MediumSimilarity) // Related content (0.2-0.5)
semantic.Similarity(scanner.WeakSimilarity)   // Loosely related (0.5-0.8)

// Custom similarity threshold
semantic.Similarity(scanner.RangeSimilarity(0.1, 0.3))

// Custom similarity logic
semantic.Similarity(scanner.CosineSimilarity(func(d float32) bool {
    return d < 0.25 // Custom threshold
}))
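Under the hood, all of these helpers are thresholds on cosine distance (1 minus cosine similarity). A stdlib-only sketch of that computation; the cosineDistance helper below is hypothetical and not part of the library:

```go
package main

import (
	"fmt"
	"math"
)

// cosineDistance returns 1 - cos(a, b): 0 means identical direction,
// values near 1 mean unrelated vectors, values above 1 mean opposed.
func cosineDistance(a, b []float32) float32 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return float32(1 - dot/(math.Sqrt(na)*math.Sqrt(nb)))
}

func main() {
	fmt.Println(cosineDistance([]float32{1, 0}, []float32{1, 0})) // 0: identical
	fmt.Println(cosineDistance([]float32{1, 0}, []float32{0, 1})) // 1: orthogonal
}
```

Against this scale, HighSimilarity accepts distances in [0, 0.2], MediumSimilarity (0.2, 0.5], and WeakSimilarity (0.5, 0.8].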

Algorithm Behavior

Control how chunks grow:

// Compare new sentences to the first sentence in chunk (stable reference)
semantic.SimilarityWith(scanner.SIMILARITY_WITH_HEAD)

// Compare new sentences to the last added sentence (evolving reference)  
semantic.SimilarityWith(scanner.SIMILARITY_WITH_TAIL)
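To illustrate the difference, here is a toy sketch (not the library's code) of chunk growth with a stable versus evolving reference, using a made-up length-based similarity:

```go
package main

import "fmt"

// growChunk is a toy sketch of chunk growth: each candidate is compared
// against a reference item and joins the chunk while similar() holds.
// With useTail, the reference follows the most recently added item, so
// the chunk may drift gradually away from its first element.
func growChunk(items []string, similar func(ref, next string) bool, useTail bool) []string {
	chunk := []string{items[0]}
	ref := items[0]
	for _, next := range items[1:] {
		if !similar(ref, next) {
			break
		}
		chunk = append(chunk, next)
		if useTail {
			ref = next
		}
	}
	return chunk
}

func main() {
	// Made-up similarity: lengths differ by at most one byte.
	similar := func(a, b string) bool {
		d := len(b) - len(a)
		return d >= -1 && d <= 1
	}
	items := []string{"a", "ab", "abc", "abcd"}
	fmt.Println(growChunk(items, similar, false)) // head: stops once drift exceeds 1
	fmt.Println(growChunk(items, similar, true))  // tail: drift accumulates
}
```

With the head as reference, the chunk stops at "ab"; with the tail, every item stays within one byte of its predecessor and the whole slice joins.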

Getting Started

The library requires Go 1.24 or later.

go get -u github.com/fogfish/scanner

Compatible with any embedding provider: OpenAI, Cohere, local models, or custom implementations. Just implement the simple Embedder interface:

type Embedder interface {
    Embedding(ctx context.Context, text string) ([]float32, int, error)
}

How To Contribute

The library is MIT licensed and accepts contributions via GitHub pull requests:

  1. Fork it
  2. Create your feature branch (git checkout -b my-new-feature)
  3. Commit your changes (git commit -am 'Added some feature')
  4. Push to the branch (git push origin my-new-feature)
  5. Create new Pull Request

To run the test suite locally:

git clone https://github.com/fogfish/scanner
cd scanner
go test ./...

License

See LICENSE

Documentation

Index

Constants

const EndOfSentence = ".!?"

Default end-of-sentence punctuation

Variables

This section is empty.

Functions

func CosineSimilarity added in v0.0.2

func CosineSimilarity(f func(float32) bool) func(a, b []float32) bool

Similarity with a custom assertion on cosine distance

func Dissimilar added in v0.0.2

func Dissimilar(a, b []float32) bool

Dissimilar is cosine distance (0.8, 1.0]. Typically, these items are unrelated, and you might filter them out unless dissimilarity is desirable (e.g., in anomaly detection).

func HighSimilarity added in v0.0.2

func HighSimilarity(a, b []float32) bool

High Similarity is cosine distance [0, 0.2]. Use this range when you need very close matches (e.g., finding duplicate documents).

func MediumSimilarity added in v0.0.2

func MediumSimilarity(a, b []float32) bool

Medium Similarity is cosine distance (0.2, 0.5]. Useful when you want to find items that are related but not identical.

func NewSentencer

func NewSentencer(eos string, r io.Reader) *bufio.Scanner

Create a scanner that slices the input stream at end-of-sentence punctuation

func NewSlicer

func NewSlicer(delim string, r io.Reader) *bufio.Scanner

Create a scanner that slices the input stream by a fixed delimiter

func NewTagger added in v0.0.3

func NewTagger(open, close string, r io.Reader) *bufio.Scanner

Create a scanner that slices the input stream into chunks bounded by open and close tags

func RangeSimilarity added in v0.0.2

func RangeSimilarity(lo, hi float32) func(a, b []float32) bool

Similarity on custom cosine distance [lo, hi]. Use this when you need a custom interval.

func WeakSimilarity added in v0.0.2

func WeakSimilarity(a, b []float32) bool

Weak Similarity is cosine distance (0.5, 0.8]. This range could be used for exploratory results where you want to include some diversity.

Types

type Chunker

type Chunker struct {
	Scanner
	// contains filtered or unexported fields
}

func NewChunker

func NewChunker(size int, s Scanner) *Chunker

func (*Chunker) Scan

func (s *Chunker) Scan() bool

func (*Chunker) Text

func (s *Chunker) Text() string

type Embedder added in v0.0.2

type Embedder interface {
	Embedding(ctx context.Context, text string) ([]float32, int, error)
}

Utility for embedding vector calculation.

type Identity

type Identity struct {
	io.Reader
	// contains filtered or unexported fields
}

func NewIdentity

func NewIdentity(r io.Reader) *Identity

func (*Identity) Err

func (r *Identity) Err() error

func (*Identity) Scan

func (r *Identity) Scan() bool

func (*Identity) Text

func (r *Identity) Text() string

type Scanner

type Scanner interface {
	Scan() bool
	Text() string
	Err() error
}

Scanner is an interface similar to bufio.Scanner. It defines the core functionality used throughout this library.
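Any type with these three methods satisfies the interface, which makes stubbing easy in tests. A minimal, hypothetical implementation that replays a fixed slice of strings:

```go
package main

import "fmt"

// sliceScanner is a minimal Scanner-shaped stub: Scan advances through
// a fixed slice, Text returns the current item, Err always reports nil.
type sliceScanner struct {
	items []string
	cur   string
}

func (s *sliceScanner) Scan() bool {
	if len(s.items) == 0 {
		return false
	}
	s.cur, s.items = s.items[0], s.items[1:]
	return true
}

func (s *sliceScanner) Text() string { return s.cur }
func (s *sliceScanner) Err() error   { return nil }

func main() {
	sc := &sliceScanner{items: []string{"one", "two"}}
	for sc.Scan() {
		fmt.Println(sc.Text())
	}
}
```

Such a stub can feed pre-split sentences into NewSemantic without touching the filesystem or an embedding provider.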

type Semantic added in v0.0.2

type Semantic struct {
	// contains filtered or unexported fields
}

Semantic provides a convenient solution for semantic chunking. Successive calls to the Semantic.Scan method step through the context windows of a file, grouping sentences semantically. The context window is defined by a number of sentences; use the Window method to change the default of 32 sentences.

The specification of a sentence is defined by the Scanner interface, which is compatible with bufio.NewScanner. Use a Split function of type SplitFunc within bufio.NewScanner to control sentence breakdown.

The package provides the NewSentencer utility, which breaks the input into sentences using punctuation runes. Redefine the Split function of bufio.NewScanner to supply your own algorithm.

The scanner uses embeddings to determine similarity. Use the Similarity method to replace the default high cosine similarity with your own implementation. The module provides high, medium, weak, and dissimilarity functions based on cosine distance.

Scanning stops unrecoverably at EOF or the first I/O error.

func NewSemantic added in v0.0.2

func NewSemantic(embed Embedder, r Scanner) *Semantic

Creates a new Semantic scanner that reads sentences from the given Scanner and groups them using embeddings.

func (*Semantic) Err added in v0.0.2

func (s *Semantic) Err() error

func (*Semantic) Scan added in v0.0.2

func (s *Semantic) Scan() bool

Scan advances the Semantic through the context window; grouped sentences become available through Semantic.Text. It returns false if there was an I/O error or EOF is reached.

func (*Semantic) Similarity added in v0.0.2

func (s *Semantic) Similarity(f func([]float32, []float32) bool)

Similarity sets the similarity function for the Semantic. The default is HighSimilarity.

func (*Semantic) SimilarityWith added in v0.0.2

func (s *Semantic) SimilarityWith(x SimilarityWith)

SimilarityWith sets the comparison behavior of the chunking algorithm.

Using SIMILARITY_WITH_HEAD, new items are compared against the first element of the chunk. The first element stays stable while the chunk is formed.

Using SIMILARITY_WITH_TAIL, new items are compared against the last element of the chunk. The last element changes each time a new one is added.

func (*Semantic) Text added in v0.0.2

func (s *Semantic) Text() []string

func (*Semantic) Window added in v0.0.2

func (s *Semantic) Window(n int)

Window defines the context window for similarity detection. The default value is 32 sentences.

type Sentencer

type Sentencer []byte

func (Sentencer) Split

func (s Sentencer) Split(data []byte, atEOF bool) (advance int, token []byte, err error)

bufio.SplitFunc for sentence.

type SimilarityWith added in v0.0.2

type SimilarityWith int

Configure similarity sorting algorithm

const (
	SIMILARITY_WITH_HEAD SimilarityWith = iota
	SIMILARITY_WITH_TAIL
)

Configure similarity sorting algorithm

type Slicer

type Slicer []byte

func (Slicer) Split

func (s Slicer) Split(data []byte, atEOF bool) (advance int, token []byte, err error)

bufio.SplitFunc for fixed delimiter.

type Sorter added in v0.0.2

type Sorter[T any] struct {
	// contains filtered or unexported fields
}

Sorter provides a convenient solution for semantic sorting.

Successive calls to the Sorter.Next method step through the context windows of a slice, grouping 'sentences' semantically. The context window is defined by a number of sentences; use the Window method to change the default of 32 sentences.

The input slice is assumed to be split into sentences already.

The sorter uses embeddings to determine similarity. Use the Similarity method to replace the default high cosine similarity with your own implementation. The module provides high, medium, weak, and dissimilarity functions based on cosine distance.

func NewSorter added in v0.0.2

func NewSorter[T any](embed Embedder, lens optics.Lens[T, string], seq seq.Seq[T]) *Sorter[T]

Creates a new instance of the semantic Sorter; seq.Seq[T] is the source of records.

func (*Sorter[T]) Err added in v0.0.2

func (s *Sorter[T]) Err() error

func (*Sorter[T]) Next added in v0.0.2

func (s *Sorter[T]) Next() bool

Next advances the Sorter through the context window; grouped values become available through Sorter.Value. It returns false if there was an I/O error or EOF is reached.

func (*Sorter[T]) Similarity added in v0.0.2

func (s *Sorter[T]) Similarity(f func([]float32, []float32) bool)

Similarity sets the similarity function for the Sorter. The default is HighSimilarity.

func (*Sorter[T]) SimilarityWith added in v0.0.2

func (s *Sorter[T]) SimilarityWith(x SimilarityWith)

SimilarityWith sets the comparison behavior of the sorting algorithm.

Using SIMILARITY_WITH_HEAD, new items are compared against the first element of the chunk. The first element stays stable while the chunk is formed.

Using SIMILARITY_WITH_TAIL, new items are compared against the last element of the chunk. The last element changes each time a new one is added.

func (*Sorter[T]) Value added in v0.0.2

func (s *Sorter[T]) Value() []T

func (*Sorter[T]) Window added in v0.0.2

func (s *Sorter[T]) Window(n int)

Window defines the context window for similarity detection. The default value is 32 sentences.

type Tagger added in v0.0.3

type Tagger struct {
	// contains filtered or unexported fields
}

func (Tagger) Split added in v0.0.3

func (t Tagger) Split(data []byte, atEOF bool) (advance int, token []byte, err error)

bufio.SplitFunc for tag-bounded chunks.
