Documentation ¶
Index ¶
- Constants
- func CosineSimilarity(f func(float32) bool) func(a, b []float32) bool
- func Dissimilar(a, b []float32) bool
- func HighSimilarity(a, b []float32) bool
- func MediumSimilarity(a, b []float32) bool
- func NewSentencer(eos string, r io.Reader) *bufio.Scanner
- func NewSlicer(delim string, r io.Reader) *bufio.Scanner
- func RangeSimilarity(lo, hi float32) func(a, b []float32) bool
- func WeakSimilarity(a, b []float32) bool
- type Chunker
- type Embedder
- type Identity
- type Scanner
- type Semantic
- type Sentencer
- type SimilarityWith
- type Slicer
- type Sorter
Constants ¶
const EndOfSentence = ".!?"
EndOfSentence is the default set of end-of-sentence runes.
Variables ¶
This section is empty.
Functions ¶
func CosineSimilarity ¶ added in v0.0.2
CosineSimilarity builds a similarity predicate from a custom assertion on the cosine distance.
func Dissimilar ¶ added in v0.0.2
Dissimilar is cosine distance (0.8, 1.0]. Typically, these items are unrelated, and you might filter them out unless dissimilarity is desirable (e.g., in anomaly detection).
func HighSimilarity ¶ added in v0.0.2
High Similarity is cosine distance [0, 0.2]. Use this range when you need very close matches (e.g., finding duplicate documents).
func MediumSimilarity ¶ added in v0.0.2
Medium Similarity is cosine distance (0.2, 0.5]. Useful when you want to find items that are related but not identical.
func NewSentencer ¶
NewSentencer creates a scanner that slices the input stream into sentences at end-of-sentence runes.
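The library does not publish its split implementation, but a sentence-slicing scanner of this kind can be sketched with the standard library alone. The sketch below is an assumption of how such a scanner might work: `splitSentences` is a hypothetical bufio.SplitFunc that cuts after any rune in the EndOfSentence set.

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"strings"
)

// EndOfSentence mirrors the package constant.
const EndOfSentence = ".!?"

// splitSentences is a bufio.SplitFunc that emits one token per sentence,
// cutting after any of the end-of-sentence runes.
func splitSentences(data []byte, atEOF bool) (advance int, token []byte, err error) {
	if i := bytes.IndexAny(data, EndOfSentence); i >= 0 {
		return i + 1, bytes.TrimSpace(data[:i+1]), nil
	}
	if atEOF && len(data) > 0 {
		return len(data), bytes.TrimSpace(data), nil
	}
	return 0, nil, nil // request more input
}

func sentences(text string) []string {
	s := bufio.NewScanner(strings.NewReader(text))
	s.Split(splitSentences)
	var out []string
	for s.Scan() {
		out = append(out, s.Text())
	}
	return out
}

func main() {
	fmt.Println(sentences("Hello world. How are you? Fine!"))
	// → [Hello world. How are you? Fine!]  (three tokens)
}
```

The same SplitFunc shape can be passed to bufio.NewScanner directly when a custom sentence breakdown is needed.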
func RangeSimilarity ¶ added in v0.0.2
RangeSimilarity is similarity on a custom cosine-distance interval [lo, hi]. Use it when none of the predefined ranges fit.
func WeakSimilarity ¶ added in v0.0.2
Weak Similarity is cosine distance (0.5, 0.8]. This range could be used for exploratory results where you want to include some diversity.
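The four predicates above are all bands over the same cosine distance. As an illustration only (the library's actual implementation is not shown here), `cosineDistance` and `rangeSimilarity` below are hypothetical stand-ins showing how such banded predicates can be derived from one distance function:

```go
package main

import (
	"fmt"
	"math"
)

// cosineDistance is 1 minus the cosine similarity of two vectors.
func cosineDistance(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return 1 - dot/(math.Sqrt(na)*math.Sqrt(nb))
}

// rangeSimilarity mimics RangeSimilarity: true when the cosine
// distance falls within [lo, hi].
func rangeSimilarity(lo, hi float32) func(a, b []float32) bool {
	return func(a, b []float32) bool {
		d := cosineDistance(a, b)
		return d >= float64(lo) && d <= float64(hi)
	}
}

func main() {
	high := rangeSimilarity(0, 0.2)   // near-duplicates
	weak := rangeSimilarity(0.5, 0.8) // exploratory matches

	a := []float32{1, 0}
	b := []float32{1, 0.1} // almost the same direction as a
	c := []float32{0.5, 1} // diverges noticeably from a

	fmt.Println(high(a, b), weak(a, c)) // → true true
}
```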
Types ¶
type Chunker ¶
type Chunker struct {
Scanner
// contains filtered or unexported fields
}
func NewChunker ¶
type Identity ¶
func NewIdentity ¶
type Scanner ¶
Scanner is an interface similar to bufio.Scanner. It defines the core functionality provided by this library.
type Semantic ¶ added in v0.0.2
type Semantic struct {
// contains filtered or unexported fields
}
Semantic provides a convenient solution for semantic chunking. Successive calls to the Semantic.Scan method step through the context windows of a file, grouping sentences semantically. The context window is defined by a number of sentences; use the Window method to change the default of 32 sentences.
The specification of a sentence is defined by the Scanner interface, which is compatible with bufio.NewScanner. Use a Split function of type SplitFunc within bufio.NewScanner to control sentence breakdown.
The package provides the NewSentencer utility, which breaks the input into sentences using punctuation runes. Redefine the Split function of bufio.NewScanner to use your own algorithm.
The scanner uses embeddings to determine similarity. Use the Similarity method to replace the default high cosine similarity with your own implementation. The module provides high, medium, weak, and dissimilarity functions based on cosine distance.
Scanning stops unrecoverably at EOF or the first I/O error.
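The core idea of the chunking loop can be sketched without the package itself. In this sketch, `chunk`, the toy embedder, and the equality predicate are all hypothetical stand-ins, not the library's API: a sentence joins the current chunk while it is similar to the chunk's anchor, otherwise a new chunk starts.

```go
package main

import "fmt"

// similar is a stand-in for a cosine-distance predicate such as HighSimilarity.
type similar func(a, b []float32) bool

// chunk groups consecutive sentences: a sentence joins the current chunk
// while it is similar to the chunk's anchor (its first sentence, i.e. the
// SIMILARITY_WITH_HEAD strategy); otherwise a new chunk is started.
func chunk(embed func(string) []float32, eq similar, sentences []string) [][]string {
	var chunks [][]string
	var cur []string
	var anchor []float32
	for _, s := range sentences {
		v := embed(s)
		if cur == nil || !eq(anchor, v) {
			if cur != nil {
				chunks = append(chunks, cur)
			}
			cur, anchor = []string{s}, v
			continue
		}
		cur = append(cur, s)
	}
	if cur != nil {
		chunks = append(chunks, cur)
	}
	return chunks
}

func main() {
	// Toy embedder: topic A → (1,0), anything else → (0,1).
	embed := func(s string) []float32 {
		if s[0] == 'A' {
			return []float32{1, 0}
		}
		return []float32{0, 1}
	}
	eq := func(a, b []float32) bool { return a[0] == b[0] && a[1] == b[1] }
	fmt.Println(chunk(embed, eq, []string{"A1", "A2", "B1", "B2", "A3"}))
	// → [[A1 A2] [B1 B2] [A3]]
}
```

The real Semantic additionally bounds the work to a context window and delegates sentence splitting to the Scanner, but the grouping decision follows this shape.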
func NewSemantic ¶ added in v0.0.2
NewSemantic creates a new instance of the scanner that reads from an io.Reader and uses the given embedding.
func (*Semantic) Scan ¶ added in v0.0.2
Scan advances the Semantic through the context window; the sequence is available through Semantic.Text. It returns false when an I/O error occurs or EOF is reached.
func (*Semantic) Similarity ¶ added in v0.0.2
Similarity sets the similarity function for the Semantic. The default is HighSimilarity.
func (*Semantic) SimilarityWith ¶ added in v0.0.2
func (s *Semantic) SimilarityWith(x SimilarityWith)
SimilarityWith sets the anchoring behavior of the sorting algorithm.
Using SIMILARITY_WITH_HEAD configures the algorithm to sort a chunk by similarity to the first element of the chunk. The first element stays stable while the chunk is being formed.
Using SIMILARITY_WITH_TAIL configures the algorithm to sort a chunk by similarity to the last element of the chunk. The last element changes each time a new element is added to the chunk.
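The practical difference between the two strategies is drift. In this sketch (a simplified model using scalar "embeddings" and a hypothetical `group` helper, not the library's implementation), a moving tail anchor lets a chunk follow a chain of pairwise-similar items that the fixed head anchor would break apart:

```go
package main

import "fmt"

// group clusters consecutive values. With tail=false the anchor stays fixed
// at the chunk's first element (SIMILARITY_WITH_HEAD); with tail=true the
// anchor is replaced by each newly added element (SIMILARITY_WITH_TAIL),
// so the chunk can drift along a chain of pairwise-similar items.
func group(xs []float32, near func(a, b float32) bool, tail bool) [][]float32 {
	var out [][]float32
	var cur []float32
	var anchor float32
	for _, x := range xs {
		if cur == nil || !near(anchor, x) {
			if cur != nil {
				out = append(out, cur)
			}
			cur, anchor = []float32{x}, x
			continue
		}
		cur = append(cur, x)
		if tail {
			anchor = x // tail strategy: compare against the latest element
		}
	}
	if cur != nil {
		out = append(out, cur)
	}
	return out
}

func main() {
	// "Similar" means at most 1 apart.
	near := func(a, b float32) bool { d := a - b; if d < 0 { d = -d }; return d <= 1 }
	xs := []float32{0, 1, 2, 3}
	fmt.Println(group(xs, near, false)) // head: [[0 1] [2 3]]
	fmt.Println(group(xs, near, true))  // tail: [[0 1 2 3]]
}
```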
type SimilarityWith ¶ added in v0.0.2
type SimilarityWith int
Configure similarity sorting algorithm
const (
	SIMILARITY_WITH_HEAD SimilarityWith = iota
	SIMILARITY_WITH_TAIL
)
Configure similarity sorting algorithm
type Sorter ¶ added in v0.0.2
type Sorter[T any] struct {
	// contains filtered or unexported fields
}
Sorter provides a convenient solution for semantic sorting.
Successive calls to the Sorter.Sort method step through the context windows of a slice, grouping 'sentences' semantically. The context window is defined by a number of sentences; use the Window method to change the default of 32 sentences.
The input slice is assumed to be split into sentences already.
The sorter uses embeddings to determine similarity. Use the Similarity method to replace the default high cosine similarity with your own implementation. The module provides high, medium, weak, and dissimilarity functions based on cosine distance.
func NewSorter ¶ added in v0.0.2
NewSorter creates a new instance of the semantic Sorter; seq.Seq[T] is the source of records.
func (*Sorter[T]) Next ¶ added in v0.0.2
Next advances the Sorter through the context window; the sequence is available through [Scanner.Text]. It returns false when an I/O error occurs or EOF is reached.
func (*Sorter[T]) Similarity ¶ added in v0.0.2
Similarity sets the similarity function for the Sorter. The default is HighSimilarity.
func (*Sorter[T]) SimilarityWith ¶ added in v0.0.2
func (s *Sorter[T]) SimilarityWith(x SimilarityWith)
SimilarityWith sets the anchoring behavior of the sorting algorithm.
Using SIMILARITY_WITH_HEAD configures the algorithm to sort a chunk by similarity to the first element of the chunk. The first element stays stable while the chunk is being formed.
Using SIMILARITY_WITH_TAIL configures the algorithm to sort a chunk by similarity to the last element of the chunk. The last element changes each time a new element is added to the chunk.