vectorfs

package v0.0.0-...-668ef5a

Published: Mar 11, 2026 License: Apache-2.0 Imports: 24 Imported by: 0
VectorFS Plugin

Document Vector Search Plugin for AGFS with S3 storage and TiDB Cloud vector indexing.

Overview

VectorFS provides semantic search capabilities for documents by combining:

  • S3 for scalable document storage
  • TiDB Cloud vector index for fast similarity search using HNSW algorithm
  • OpenAI embeddings (default) for generating vector representations

Features

  • Automatic Indexing: Documents are automatically indexed when written (async with worker pool)
  • Deduplication: Same content (same SHA256 digest) won't be indexed twice
  • Semantic Search: Use the standard grep command for vector similarity search
  • Document Retrieval: Read original documents with the cat command
  • Subdirectory Support: Organize documents in nested folders
  • Batch Copy: Copy entire folders with the cp -r command
  • Scalable Storage: S3-backed document storage
  • Fast Vector Search: TiDB Cloud's HNSW index with >90% recall rate
  • Document Chunking: Smart chunking by paragraphs and sentences
  • Multiple Namespaces: Isolate documents by project/namespace
  • Similarity Scores: Search results include distance and relevance scores

Directory Structure

/vectorfs/
  README                    - Documentation
  <namespace>/              - Project/namespace directory
    docs/                   - Document directory (auto-indexed)
      file1.txt             - Root-level document
      subfolder/            - Subdirectory (virtual)
        file2.txt           - Nested document
        deep/file3.txt      - Deeply nested document
    .indexing               - Indexing status (virtual file, read-only)

Note:

  • Subdirectories under docs/ are virtual - they don't need to be created explicitly. Just write files with paths like docs/guides/tutorial.txt and the directory structure is maintained in metadata.
  • The .indexing file is a virtual read-only status file. Currently returns "idle" as a placeholder. Future versions will show real-time worker pool status.

Configuration

YAML Configuration
plugins:
  vectorfs:
    enabled: true
    path: /vectorfs
    config:
      # S3 Storage Configuration
      s3_bucket: my-document-bucket
      s3_key_prefix: vectorfs # Optional, default: vectorfs
      s3_region: us-east-1 # Optional, default: us-east-1
      s3_access_key: AKIAXXXXXXXX # Optional, uses IAM role if not provided
      s3_secret_key: secret # Optional
      s3_endpoint: "" # Optional, for custom S3-compatible services

      # TiDB Cloud Configuration
      tidb_dsn: "user:password@tcp(gateway01.us-west-2.prod.aws.tidbcloud.com:4000)/dbname?tls=true"

      # Embedding Configuration
      embedding_provider: openai # Default: openai
      openai_api_key: sk-xxxxxxxxxxxxxxxx
      embedding_model: text-embedding-3-small # Default: text-embedding-3-small
      embedding_dim: 1536 # Default: 1536

      # Chunking Configuration (Optional)
      chunk_size: 512 # Default: 512 tokens
      chunk_overlap: 50 # Default: 50 tokens

      # Worker Pool Configuration (Optional)
      index_workers: 4 # Default: 4 concurrent workers
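
Taken together, the optional keys above imply a set of defaults. As a rough illustration of how they might be filled in (applyDefaults is a hypothetical helper, not part of the plugin; the real Initialize may differ):

```go
package main

import "fmt"

// applyDefaults fills in the optional defaults documented above.
// Purely illustrative; the plugin's actual config handling may differ.
func applyDefaults(cfg map[string]interface{}) map[string]interface{} {
	defaults := map[string]interface{}{
		"s3_key_prefix":      "vectorfs",
		"s3_region":          "us-east-1",
		"embedding_provider": "openai",
		"embedding_model":    "text-embedding-3-small",
		"embedding_dim":      1536,
		"chunk_size":         512,
		"chunk_overlap":      50,
		"index_workers":      4,
	}
	for k, v := range defaults {
		if _, ok := cfg[k]; !ok {
			cfg[k] = v
		}
	}
	return cfg
}

func main() {
	cfg := applyDefaults(map[string]interface{}{"s3_bucket": "my-document-bucket"})
	fmt.Println(cfg["s3_region"], cfg["index_workers"])
}
```

Required keys (s3_bucket, tidb_dsn, openai_api_key) have no defaults and must be supplied explicitly.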
TiDB Cloud Setup
  1. Create a TiDB Cloud cluster (Serverless or Dedicated)
  2. Enable TiFlash (required for vector search)
  3. Get the connection string (DSN) from cluster details
  4. Tables will be created automatically when you create a namespace
S3 Setup
  1. Create an S3 bucket (or use S3-compatible service like MinIO)
  2. Configure access credentials (IAM role recommended for production)
  3. Documents will be stored as: s3://bucket/vectorfs/<namespace>/<digest>

Usage

1. Create a Namespace (Project)
agfs:/> mkdir /vectorfs/my_project

This creates TiDB tables:

  • tbl_meta_my_project - File metadata
  • tbl_chunks_my_project - Document chunks with vector embeddings
2. Write Documents

Documents are automatically indexed when written to the docs/ directory:

# Write a single file
agfs:/> echo "How to deploy applications..." > /vectorfs/my_project/docs/deployment.txt

# Write to subdirectory (virtual subdirectories)
agfs:/> echo "Kubernetes guide" > /vectorfs/my_project/docs/guides/kubernetes.txt
agfs:/> echo "Docker tutorial" > /vectorfs/my_project/docs/tutorials/docker.txt

What happens:

  1. Write operation returns immediately (~8ms)
  2. Indexing happens asynchronously in background worker pool:
    • SHA256 digest calculated
    • Document uploaded to S3
    • Text split into chunks (~512 tokens)
    • Embeddings generated via OpenAI API
    • Chunks and embeddings stored in TiDB
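
The write path above can be sketched in Go. This is a simplified stand-in (digestOf and runWorkers are illustrative names; the real workers do the S3 upload, chunking, embedding, and TiDB insert):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sync"
)

type indexTask struct {
	name, digest, content string
}

// digestOf computes the SHA256 content digest used for deduplication.
func digestOf(content string) string {
	sum := sha256.Sum256([]byte(content))
	return hex.EncodeToString(sum[:])
}

// runWorkers drains the queue with the given number of workers,
// standing in for the S3 upload / chunk / embed / TiDB store phase.
func runWorkers(queue chan indexTask, workers int) {
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for t := range queue {
				// placeholder: upload to S3, chunk, embed, store in TiDB
				fmt.Printf("indexed %s (%s…)\n", t.name, t.digest[:8])
			}
		}()
	}
	wg.Wait()
}

func main() {
	queue := make(chan indexTask, 100) // queue capacity from the docs
	content := "How to deploy applications..."
	queue <- indexTask{"deployment.txt", digestOf(content), content}
	// the write call would return to the user here (~8ms)
	close(queue)
	runWorkers(queue, 4) // 4 workers by default
}
```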

Copy entire folders:

# Copy multiple files and folders
agfs:/> cp -r /s3fs/mybucket/docs /vectorfs/my_project/docs/imported
3. Search Documents

Use the standard grep command for semantic search:

agfs:/> grep "deployment strategies" /vectorfs/my_project/docs

# Or use agfs-shell's fsgrep command
$ fsgrep -r "deployment strategies" /vectorfs/my_project/docs

Returns:

{
  "matches": [
    {
      "file": "/vectorfs/my_project/docs/deployment.txt",
      "line": 1,
      "content": "How to deploy applications using blue-green strategy...",
      "metadata": {
        "distance": 0.234,
        "score": 0.766
      }
    },
    {
      "file": "/vectorfs/my_project/docs/kubernetes.txt",
      "line": 3,
      "content": "Kubernetes deployment strategies include rolling updates...",
      "metadata": {
        "distance": 0.412,
        "score": 0.588
      }
    }
  ],
  "count": 2
}

Similarity scores:

  • distance: Cosine distance (0.0 = identical, 1.0 = orthogonal/unrelated; up to 2.0 for opposed vectors)
  • score: Relevance score (1.0 - distance, higher is better)

The search uses cosine distance in TiDB's vector index to find semantically similar chunks.
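
For intuition, the distance/score relationship can be reproduced client-side. cosineDistance below mirrors the formula behind VEC_COSINE_DISTANCE for illustration only; it is not the plugin's code:

```go
package main

import (
	"fmt"
	"math"
)

// cosineDistance computes 1 - (a·b)/(|a||b|), the formula behind
// TiDB's VEC_COSINE_DISTANCE. Identical directions give 0,
// orthogonal vectors 1, opposite vectors 2 — so score = 1 - distance
// can in principle go negative.
func cosineDistance(a, b []float64) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += a[i] * b[i]
		na += a[i] * a[i]
		nb += b[i] * b[i]
	}
	return 1 - dot/(math.Sqrt(na)*math.Sqrt(nb))
}

func main() {
	q := []float64{1, 0}
	fmt.Println(cosineDistance(q, []float64{1, 0}))     // identical: distance 0
	fmt.Println(cosineDistance(q, []float64{0, 1}))     // orthogonal: distance 1
	fmt.Println(1 - cosineDistance(q, []float64{1, 1})) // score ≈ 0.707
}
```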

4. Read Documents

Read original document content from S3:

# Read a file
agfs:/> cat /vectorfs/my_project/docs/deployment.txt

# Read from subdirectory
agfs:/> cat /vectorfs/my_project/docs/guides/kubernetes.txt

Documents are retrieved from S3 using the file's digest and returned with their original content.

5. List Documents
agfs:/> ls /vectorfs/my_project/docs
deployment.txt
kubernetes.txt
architecture.md
guides/
tutorials/

# List subdirectory
agfs:/> ls /vectorfs/my_project/docs/guides
kubernetes.txt
getting-started.md
6. Check Indexing Status

Each namespace has a virtual .indexing file that shows background indexing status:

agfs:/> cat /vectorfs/my_project/.indexing
idle

Current Status: This file currently returns idle as a placeholder. Since indexing happens asynchronously in a worker pool, documents may still be processing in the background even when showing "idle".

Future Enhancement: Will show real-time worker pool statistics:

  • Queue depth (pending documents)
  • Active workers processing
  • Indexing rate and completion status

Note: With async indexing, there may be a short delay (typically 1-15 seconds depending on file size) between writing a file and it being searchable. Large files (>20KB) with many chunks take longer to index.

Architecture

Data Flow
User writes file
      ↓
  Calculate SHA256 digest
      ↓
  Submit to index queue → Return immediately (~8ms)
      ↓
Worker pool (4 workers by default) processes async:
      ↓
  Upload to S3 (s3://bucket/vectorfs/<namespace>/<digest>)
      ↓
  Chunk document (paragraphs → sentences)
      ↓
  Generate embeddings (OpenAI API, batch)
      ↓
  Store in TiDB:
    - tbl_meta_<namespace> (file metadata)
    - tbl_chunks_<namespace> (chunks + vector embeddings)
Vector Search Flow
User runs grep
      ↓
  Generate query embedding (OpenAI API)
      ↓
  TiDB vector search:
    SELECT ... ORDER BY VEC_COSINE_DISTANCE(embedding, <query>) LIMIT 10
      ↓
  Return matching chunks as GrepMatch format
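
The search SQL above needs the query embedding serialized for TiDB. A hedged sketch (vectorLiteral and searchQuery are illustrative helpers; the bracketed string form for VECTOR values is an assumption about the wire format, and the plugin's actual serialization may differ):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// vectorLiteral serializes an embedding as a bracketed string,
// the textual form TiDB accepts for VECTOR values (assumed here
// for illustration).
func vectorLiteral(v []float32) string {
	parts := make([]string, len(v))
	for i, f := range v {
		parts[i] = strconv.FormatFloat(float64(f), 'g', -1, 32)
	}
	return "[" + strings.Join(parts, ",") + "]"
}

// searchQuery builds the per-namespace similarity query from the
// flow above; the embedding would be bound as the ? parameter.
func searchQuery(namespace string, limit int) string {
	return fmt.Sprintf(
		"SELECT file_digest, chunk_index, chunk_text,"+
			" VEC_COSINE_DISTANCE(embedding, ?) AS distance"+
			" FROM tbl_chunks_%s ORDER BY distance LIMIT %d",
		namespace, limit)
}

func main() {
	fmt.Println(searchQuery("my_project", 10))
	fmt.Println(vectorLiteral([]float32{0.1, 0.25}))
}
```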

Database Schema

File Metadata Table
CREATE TABLE tbl_meta_<namespace> (
    file_digest VARCHAR(64) PRIMARY KEY,
    file_name VARCHAR(1024) NOT NULL,
    s3_key VARCHAR(1024) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    INDEX idx_file_name (file_name)
);
Chunks Table with Vector Index
CREATE TABLE tbl_chunks_<namespace> (
    chunk_id BIGINT AUTO_INCREMENT PRIMARY KEY,
    file_digest VARCHAR(64) NOT NULL,
    chunk_index INT NOT NULL,
    chunk_text TEXT NOT NULL,
    embedding VECTOR(1536) NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    INDEX idx_file_digest (file_digest),
    VECTOR INDEX idx_embedding ((VEC_COSINE_DISTANCE(embedding)))
);

Performance Considerations

Write Performance
  • Write Response: ~8ms (immediate return, async indexing)
  • Worker Pool: 4 concurrent workers (configurable)
  • Queue Capacity: 100 pending tasks
Indexing Performance (Background)
  • Embedding API: ~100-200ms per batch (OpenAI)
  • TiDB Insert: ~10-50ms per chunk
  • S3 Upload: ~50-200ms per document
  • Large Files: 26KB file (~169 chunks) completes in ~15 seconds

Benefits of async indexing:

  • No timeout issues with cp -r for large folders
  • User operations never blocked
  • Controlled concurrency prevents API rate limits
Search Performance
  • Query Embedding: ~100ms (OpenAI)
  • Vector Search: ~10-50ms (TiDB HNSW index)
  • Total: ~150ms for typical search

TiDB Cloud vector search maintains >90% recall rate with HNSW indexing.

Cost Estimation

OpenAI Embeddings
  • Model: text-embedding-3-small
  • Cost: ~$0.02 per 1M tokens
  • Example: 100 documents × 1000 words ≈ 130K tokens ≈ $0.003
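
The arithmetic behind that example, assuming roughly 1.3 tokens per English word:

```go
package main

import "fmt"

// embeddingCostUSD estimates embedding cost at ~$0.02 per 1M tokens
// (text-embedding-3-small), assuming ~1.3 tokens per word.
func embeddingCostUSD(words float64) float64 {
	tokens := words * 1.3
	return tokens / 1_000_000 * 0.02
}

func main() {
	// 100 documents × 1000 words ≈ 130K tokens
	fmt.Printf("$%.4f\n", embeddingCostUSD(100*1000))
}
```
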
TiDB Cloud
  • Serverless: Pay per use (RU consumption)
  • Dedicated: Fixed monthly cost based on cluster size
S3 Storage
  • Standard storage: ~$0.023 per GB/month
  • Example: 1000 documents × 10KB ≈ 10MB ≈ $0.0002/month

Limitations

  1. No Updates: Updating documents creates a new version (different digest). Old versions remain in S3 and TiDB.

  2. Deletion: Not yet implemented. Use direct TiDB/S3 operations to clean up.

  3. Single Embedding Provider: Only OpenAI is supported currently.

  4. TiFlash Required: TiDB Cloud cluster must have TiFlash enabled for vector search.

  5. Indexing Visibility: The .indexing status file is currently a placeholder (always shows "idle"). No API yet to check:

    • Whether a specific file has been indexed
    • Real-time queue depth or worker status
    • Indexing progress or completion percentage

Troubleshooting

"failed to connect to TiDB"
  • Verify DSN connection string
  • Ensure TLS is enabled for TiDB Cloud: ?tls=true
  • Check network connectivity and firewall rules
"failed to initialize S3 client"
  • Verify AWS credentials or IAM role
  • Check bucket name and region
  • For custom endpoints, ensure s3_endpoint is correct
"failed to generate embeddings"
  • Verify OpenAI API key is valid
  • Check API rate limits and quotas
  • Ensure network access to api.openai.com
"vector search returns no results"
  • Verify documents have been indexed (ls /vectorfs/<namespace>/docs)
  • Check TiFlash is enabled on TiDB Cloud cluster
  • Try broader search queries
"file not appearing in search results immediately"
  • Indexing happens asynchronously in background worker pool
  • Small files (< 5KB): typically indexed within 1-3 seconds
  • Large files (> 20KB): may take 10-15+ seconds to complete indexing
  • Check server logs for indexing completion: grep "Successfully indexed" /var/log/agfs.log
  • The .indexing status file currently doesn't show real-time status (placeholder)
  • Workaround: Wait a few seconds after writing, then search again

Example: Complete Workflow

# 1. Create namespace
mkdir /vectorfs/tech_docs

# 2. Add documents
echo "Kubernetes is a container orchestration platform..." > /vectorfs/tech_docs/docs/k8s.txt
echo "Docker provides containerization for applications..." > /vectorfs/tech_docs/docs/docker.txt
echo "Terraform enables infrastructure as code..." > /vectorfs/tech_docs/docs/terraform.txt

# 3. Search
grep "container management" /vectorfs/tech_docs/docs

# Returns semantically similar results:
# - k8s.txt (mentions container orchestration)
# - docker.txt (mentions containerization)

Future Enhancements

  • Real-time indexing status in .indexing file (queue depth, active workers, completion %)
  • Per-file indexing status API (check if specific file has been indexed)
  • Document update/delete operations
  • Multiple embedding providers (Cohere, Hugging Face, etc.)
  • Hybrid search (vector + keyword)
  • Metadata filtering in search
  • Configurable top-K results
  • Re-indexing support
  • Priority queue for indexing tasks

License

Apache 2.0

Documentation

Index

Constants

const (
	PluginName = "vectorfs"
)

Variables

This section is empty.

Functions

This section is empty.

Types

type Chunk

type Chunk struct {
	Text  string
	Index int
}

Chunk represents a text chunk

func ChunkDocument

func ChunkDocument(text string, cfg ChunkerConfig) []Chunk

ChunkDocument splits a document into chunks
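
The README describes chunking by paragraphs and then sentences. A simplified, character-based sketch of the paragraph stage (not the package's actual token-based algorithm with overlap):

```go
package main

import (
	"fmt"
	"strings"
)

// chunkByParagraph packs whole paragraphs into chunks of at most
// maxChars characters, flushing a chunk when the next paragraph
// would overflow it. Illustrative only; ChunkDocument works in
// tokens and also splits oversized paragraphs by sentence.
func chunkByParagraph(text string, maxChars int) []string {
	var chunks []string
	var cur strings.Builder
	for _, para := range strings.Split(text, "\n\n") {
		para = strings.TrimSpace(para)
		if para == "" {
			continue
		}
		if cur.Len() > 0 && cur.Len()+len(para) > maxChars {
			chunks = append(chunks, cur.String())
			cur.Reset()
		}
		if cur.Len() > 0 {
			cur.WriteString("\n\n")
		}
		cur.WriteString(para)
	}
	if cur.Len() > 0 {
		chunks = append(chunks, cur.String())
	}
	return chunks
}

func main() {
	doc := "First paragraph.\n\nSecond paragraph.\n\nThird."
	for i, c := range chunkByParagraph(doc, 20) {
		fmt.Printf("chunk %d: %q\n", i, c)
	}
}
```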

type ChunkData

type ChunkData struct {
	ChunkIndex int
	ChunkText  string
	Embedding  []float32
}

ChunkData represents a chunk to be inserted

type ChunkerConfig

type ChunkerConfig struct {
	ChunkSize    int // Approximate chunk size in tokens
	ChunkOverlap int // Overlap between chunks in tokens
}

ChunkerConfig holds chunking configuration

type EmbeddingClient

type EmbeddingClient struct {
	// contains filtered or unexported fields
}

EmbeddingClient handles embedding generation

func NewEmbeddingClient

func NewEmbeddingClient(cfg EmbeddingConfig) (*EmbeddingClient, error)

NewEmbeddingClient creates a new embedding client

func (*EmbeddingClient) GenerateBatchEmbeddings

func (e *EmbeddingClient) GenerateBatchEmbeddings(texts []string) ([][]float32, error)

GenerateBatchEmbeddings generates embeddings for multiple texts

func (*EmbeddingClient) GenerateEmbedding

func (e *EmbeddingClient) GenerateEmbedding(text string) ([]float32, error)

GenerateEmbedding generates an embedding for the given text

func (*EmbeddingClient) GetDimension

func (e *EmbeddingClient) GetDimension() int

GetDimension returns the embedding dimension

type EmbeddingConfig

type EmbeddingConfig struct {
	Provider  string // Provider name (openai)
	APIKey    string // API key
	Model     string // Model name
	Dimension int    // Embedding dimension
}

EmbeddingConfig holds embedding configuration

type FileMetadata

type FileMetadata struct {
	FileDigest string
	FileName   string
	S3Key      string
	FileSize   int64
	CreatedAt  time.Time
	UpdatedAt  time.Time
}

FileMetadata represents file metadata stored in TiDB

type Indexer

type Indexer struct {
	// contains filtered or unexported fields
}

Indexer handles document indexing

func NewIndexer

func NewIndexer(
	s3Client *S3Client,
	tidbClient *TiDBClient,
	embeddingClient *EmbeddingClient,
	chunkerConfig ChunkerConfig,
) *Indexer

NewIndexer creates a new indexer

func (*Indexer) DeleteDocument

func (idx *Indexer) DeleteDocument(namespace, digest string) error

DeleteDocument removes a document from the index

func (*Indexer) IndexChunks

func (idx *Indexer) IndexChunks(namespace, digest, fileName, content string) error

IndexChunks performs chunking and embedding generation, then stores the chunks in TiDB (async phase). This is called after PrepareDocument to enable vector search on the document.

func (*Indexer) IndexDocument

func (idx *Indexer) IndexDocument(namespace, digest, fileName, content string) error

IndexDocument indexes a document (upload to S3, chunk, generate embeddings, store in TiDB). Deprecated: Use PrepareDocument + IndexChunks for better performance. This method is kept for backward compatibility.

func (*Indexer) PrepareDocument

func (idx *Indexer) PrepareDocument(namespace, digest, fileName, content string) (bool, error)

PrepareDocument uploads document to S3 and registers metadata in TiDB (synchronous phase). After this completes, the file is visible via ls/cat. Returns (alreadyExists, error) - if alreadyExists is true, no further indexing is needed.
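
The intended two-phase call order can be illustrated with a stand-in type, since constructing a real Indexer requires live S3/TiDB clients (fakeIndexer is purely illustrative; only the method signatures match the documentation above):

```go
package main

import "fmt"

// fakeIndexer stands in for Indexer so the two-phase flow
// (PrepareDocument, then IndexChunks) can be shown without S3 or TiDB.
type fakeIndexer struct{ log []string }

func (f *fakeIndexer) PrepareDocument(namespace, digest, fileName, content string) (bool, error) {
	f.log = append(f.log, "prepare "+fileName) // upload + metadata; file now visible via ls/cat
	return false, nil                          // false: not previously indexed
}

func (f *fakeIndexer) IndexChunks(namespace, digest, fileName, content string) error {
	f.log = append(f.log, "chunks "+fileName) // chunk + embed + store; file now searchable
	return nil
}

func main() {
	idx := &fakeIndexer{}
	already, err := idx.PrepareDocument("my_project", "digest123", "deployment.txt", "...")
	if err == nil && !already {
		// the async worker pool would run this phase in the background
		_ = idx.IndexChunks("my_project", "digest123", "deployment.txt", "...")
	}
	fmt.Println(idx.log)
}
```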

type S3Client

type S3Client struct {
	// contains filtered or unexported fields
}

S3Client handles S3 operations for document storage

func NewS3Client

func NewS3Client(cfg S3Config) (*S3Client, error)

NewS3Client creates a new S3 client

func (*S3Client) DeleteDocument

func (c *S3Client) DeleteDocument(ctx context.Context, namespace, digest string) error

DeleteDocument deletes a document from S3

func (*S3Client) DocumentExists

func (c *S3Client) DocumentExists(ctx context.Context, namespace, digest string) (bool, error)

DocumentExists checks if a document exists in S3

func (*S3Client) DownloadDocument

func (c *S3Client) DownloadDocument(ctx context.Context, namespace, digest string) ([]byte, error)

DownloadDocument downloads a document from S3

func (*S3Client) UploadDocument

func (c *S3Client) UploadDocument(ctx context.Context, namespace, digest string, data []byte) error

UploadDocument uploads a document to S3

type S3Config

type S3Config struct {
	AccessKey string
	SecretKey string
	Bucket    string
	KeyPrefix string
	Region    string
	Endpoint  string
}

S3Config holds S3 configuration

type TiDBClient

type TiDBClient struct {
	// contains filtered or unexported fields
}

TiDBClient handles TiDB operations for vector search

func NewTiDBClient

func NewTiDBClient(cfg TiDBConfig) (*TiDBClient, error)

NewTiDBClient creates a new TiDB client

func (*TiDBClient) Close

func (c *TiDBClient) Close() error

Close closes the TiDB connection

func (*TiDBClient) CreateNamespace

func (c *TiDBClient) CreateNamespace(namespace string, embeddingDim int) error

CreateNamespace creates tables for a new namespace (fails if already exists)

func (*TiDBClient) DeleteFileByName

func (c *TiDBClient) DeleteFileByName(namespace, fileName string) error

DeleteFileByName deletes all versions of a file by name (used before writing new content)

func (*TiDBClient) DeleteFileChunks

func (c *TiDBClient) DeleteFileChunks(namespace, fileDigest string) error

DeleteFileChunks deletes all chunks for a file

func (*TiDBClient) DeleteFileMetadata

func (c *TiDBClient) DeleteFileMetadata(namespace, fileDigest string) error

DeleteFileMetadata deletes file metadata

func (*TiDBClient) DeleteNamespace

func (c *TiDBClient) DeleteNamespace(namespace string) error

DeleteNamespace drops all tables for a namespace

func (*TiDBClient) FileExists

func (c *TiDBClient) FileExists(namespace, digest string) (bool, error)

FileExists checks if a file (by digest) is already indexed

func (*TiDBClient) GetFileMetadataByName

func (c *TiDBClient) GetFileMetadataByName(namespace, fileName string) (*FileMetadata, error)

GetFileMetadataByName retrieves file metadata by file name (returns the latest version)

func (*TiDBClient) HasFilesWithPrefix

func (c *TiDBClient) HasFilesWithPrefix(namespace, prefix string) (bool, error)

HasFilesWithPrefix checks if any files exist with the given prefix (for directory detection). This is much faster than loading all files just to check whether a directory exists.

func (*TiDBClient) InsertChunk

func (c *TiDBClient) InsertChunk(namespace, fileDigest string, chunkIndex int, chunkText string, embedding []float32) error

InsertChunk inserts a document chunk with embedding

func (*TiDBClient) InsertChunksBatch

func (c *TiDBClient) InsertChunksBatch(namespace, fileDigest string, chunks []ChunkData) error

InsertChunksBatch inserts multiple chunks in a single batch operation. This significantly reduces database round-trips compared to individual inserts.

func (*TiDBClient) InsertFileMetadata

func (c *TiDBClient) InsertFileMetadata(namespace string, meta FileMetadata) error

InsertFileMetadata inserts file metadata

func (*TiDBClient) ListFiles

func (c *TiDBClient) ListFiles(namespace string) ([]FileMetadata, error)

ListFiles lists all files in a namespace

func (*TiDBClient) ListFilesWithPrefix

func (c *TiDBClient) ListFilesWithPrefix(namespace, prefix string) ([]FileMetadata, error)

ListFilesWithPrefix lists files in a namespace with a given prefix (database-level filtering). This is more efficient than ListFiles when only a subset of files is needed.

func (*TiDBClient) ListNamespaces

func (c *TiDBClient) ListNamespaces() ([]string, error)

ListNamespaces lists all namespaces (by finding all tbl_meta_* tables)

func (*TiDBClient) NamespaceExists

func (c *TiDBClient) NamespaceExists(namespace string) (bool, error)

NamespaceExists checks if a namespace exists

func (*TiDBClient) VectorSearch

func (c *TiDBClient) VectorSearch(namespace string, queryEmbedding []float32, limit int) ([]VectorMatch, error)

VectorSearch performs vector similarity search

type TiDBConfig

type TiDBConfig struct {
	DSN string // Connection string
}

TiDBConfig holds TiDB configuration

type VectorFSPlugin

type VectorFSPlugin struct {
	// contains filtered or unexported fields
}

func NewVectorFSPlugin

func NewVectorFSPlugin() *VectorFSPlugin

NewVectorFSPlugin creates a new VectorFS plugin

func (*VectorFSPlugin) GetConfigParams

func (v *VectorFSPlugin) GetConfigParams() []plugin.ConfigParameter

func (*VectorFSPlugin) GetFileSystem

func (v *VectorFSPlugin) GetFileSystem() filesystem.FileSystem

func (*VectorFSPlugin) GetReadme

func (v *VectorFSPlugin) GetReadme() string

func (*VectorFSPlugin) Initialize

func (v *VectorFSPlugin) Initialize(cfg map[string]interface{}) error

func (*VectorFSPlugin) Name

func (v *VectorFSPlugin) Name() string

func (*VectorFSPlugin) Shutdown

func (v *VectorFSPlugin) Shutdown() error

func (*VectorFSPlugin) Validate

func (v *VectorFSPlugin) Validate(cfg map[string]interface{}) error

type VectorMatch

type VectorMatch struct {
	FileDigest string
	FileName   string
	ChunkText  string
	ChunkIndex int
	Distance   float64
}

VectorMatch represents a vector search result
