tokenizer

package
v1.6.0
Published: Dec 11, 2025 License: AGPL-3.0 Imports: 12 Imported by: 0

README

SQL Tokenizer Package

Overview

The tokenizer package provides a high-performance, zero-copy SQL lexical analyzer that converts SQL text into tokens. It supports multiple SQL dialects with full Unicode support and comprehensive operator recognition.

Key Features

  • Zero-Copy Operation: Works directly on input bytes without string allocation
  • Unicode Support: Full UTF-8 support for international SQL (8+ languages tested)
  • Multi-Dialect: PostgreSQL, MySQL, SQL Server, Oracle, SQLite operators and syntax
  • Object Pooling: 60-80% memory reduction through instance reuse
  • Position Tracking: Precise line/column information for error reporting
  • DOS Protection: Token limits and input size validation
  • Thread-Safe: All pool operations are race-free

Performance

  • Throughput: 8M tokens/second sustained
  • Latency: Sub-microsecond tokenization for typical queries
  • Memory: Minimal allocations with zero-copy design
  • Concurrency: Validated race-free with 20,000+ concurrent operations

Usage

Basic Tokenization
package main

import (
    "fmt"

    "github.com/ajitpratap0/GoSQLX/pkg/sql/tokenizer"
)

func main() {
    // Get tokenizer from pool
    tkz := tokenizer.GetTokenizer()
    defer tokenizer.PutTokenizer(tkz)  // ALWAYS return to pool

    // Tokenize SQL
    sql := []byte("SELECT * FROM users WHERE active = true")
    tokens, err := tkz.Tokenize(sql)
    if err != nil {
        // Handle tokenization error
        return
    }

    // Process tokens
    for _, tok := range tokens {
        fmt.Printf("%s at line %d, col %d\n",
            tok.Token.Value,
            tok.Start.Line,
            tok.Start.Column)
    }
}
Batch Processing
func ProcessMultipleQueries(queries []string) {
    tkz := tokenizer.GetTokenizer()
    defer tokenizer.PutTokenizer(tkz)

    for _, query := range queries {
        tokens, err := tkz.Tokenize([]byte(query))
        if err != nil {
            tkz.Reset() // clear any partial state before the next query
            continue
        }

        // Process tokens
        // ...

        tkz.Reset()  // Reset between uses
    }
}
Concurrent Tokenization
func ConcurrentTokenization(queries []string) {
    var wg sync.WaitGroup

    for _, query := range queries {
        wg.Add(1)
        go func(sql string) {
            defer wg.Done()

            // Each goroutine gets its own tokenizer
            tkz := tokenizer.GetTokenizer()
            defer tokenizer.PutTokenizer(tkz)

            tokens, _ := tkz.Tokenize([]byte(sql))
            // Process tokens...
        }(query)
    }

    wg.Wait()
}

Token Types

Keywords
SELECT, FROM, WHERE, JOIN, GROUP BY, ORDER BY, HAVING, LIMIT, OFFSET,
INSERT, UPDATE, DELETE, CREATE, ALTER, DROP, WITH, UNION, EXCEPT, INTERSECT, etc.
Identifiers
  • Standard: user_id, TableName, column123
  • Quoted: "column name" (SQL standard)
  • Backtick: `column` (MySQL)
  • Bracket: [column] (SQL Server)
  • Unicode: "名前", "имя", "الاسم" (international)
Literals
  • Numbers: 42, 3.14, 1.5e10, 0xFF
  • Strings: 'hello', 'it''s' (escaped quotes)
  • Booleans: TRUE, FALSE
  • NULL: NULL
Operators
  • Comparison: =, <>, !=, <, >, <=, >=
  • Arithmetic: +, -, *, /, %
  • Logical: AND, OR, NOT
  • PostgreSQL: @>, <@, ->, ->>, #>, ?, ||
  • Pattern: LIKE, ILIKE, SIMILAR TO
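To confirm how a multi-character, dialect-specific operator is tokenized, print the token values for a query that uses it. The sketch below relies only on the pool API and the Token.Value field shown in the usage examples; the inspectOperators helper name is illustrative.

func inspectOperators() {
    tkz := tokenizer.GetTokenizer()
    defer tokenizer.PutTokenizer(tkz)

    sql := []byte(`SELECT data->>'email' FROM users WHERE tags @> ARRAY['admin']`)
    tokens, err := tkz.Tokenize(sql)
    if err != nil {
        return
    }

    for _, tok := range tokens {
        // ->> and @> are expected to surface as single operator tokens.
        fmt.Printf("%q ", tok.Token.Value)
    }
}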

Dialect-Specific Features

PostgreSQL
-- Array operators
SELECT * FROM users WHERE tags @> ARRAY['admin']

-- JSON operators
SELECT data->>'email' FROM users

-- String concatenation
SELECT first_name || ' ' || last_name FROM users
MySQL
-- Backtick identifiers
SELECT `user_id` FROM `users`

-- Double pipe as OR
SELECT * FROM users WHERE status = 1 || status = 2
SQL Server
-- Bracket identifiers
SELECT [User ID] FROM [User Table]

-- String concatenation with +
SELECT FirstName + ' ' + LastName FROM Users

Architecture

Core Files
  • tokenizer.go: Main tokenizer logic
  • string_literal.go: String parsing with escape sequence handling
  • unicode.go: Unicode identifier and quote normalization
  • position.go: Position tracking (line, column, byte offset)
  • pool.go: Object pool management
  • buffer.go: Internal buffer pool for performance
  • error.go: Structured error types
Tokenization Pipeline
Input bytes → Position tracking → Character scanning → Token recognition → Output tokens

Error Handling

Detailed Error Information
tokens, err := tkz.Tokenize(sqlBytes)
if err != nil {
    if tokErr, ok := err.(*tokenizer.Error); ok {
        fmt.Printf("Error at line %d, column %d: %s\n",
            tokErr.Location.Line,
            tokErr.Location.Column,
            tokErr.Message)
    }
}
Common Error Types
  • Unterminated String: Missing closing quote
  • Invalid Number: Malformed numeric literal
  • Invalid Character: Unexpected character in input
  • Invalid Escape: Unknown escape sequence in string
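For example, a missing closing quote is expected to surface as a *tokenizer.Error carrying the offending location; a minimal sketch using the same fields as the example above:

tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)

// The missing closing quote should produce an unterminated-string error.
_, err := tkz.Tokenize([]byte("SELECT 'unterminated"))
if tokErr, ok := err.(*tokenizer.Error); ok {
    fmt.Printf("line %d, col %d: %s\n",
        tokErr.Location.Line, tokErr.Location.Column, tokErr.Message)
}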

DOS Protection

Token Limit
// MaxTokens caps each query at 1,000,000 tokens (see Constants below)
// Prevents memory exhaustion from malicious input
Input Size Validation
// MaxInputSize caps each query at 10MB (see Constants below)
// Oversized input is rejected with an error
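Both limits are exported constants, so callers can also pre-validate input before handing it to the tokenizer. A minimal sketch (the tokenizeChecked helper is illustrative):

// tokenizeChecked rejects oversized input up front using the exported constant.
func tokenizeChecked(sql []byte) error {
    if len(sql) > tokenizer.MaxInputSize {
        return fmt.Errorf("input is %d bytes, exceeds MaxInputSize (%d)", len(sql), tokenizer.MaxInputSize)
    }

    tkz := tokenizer.GetTokenizer()
    defer tokenizer.PutTokenizer(tkz)

    tokens, err := tkz.Tokenize(sql)
    if err != nil {
        return err
    }
    _ = tokens // process tokens...
    return nil
}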

Unicode Support

Supported Scripts
  • Latin: English, Spanish, French, German, etc.
  • Cyrillic: Russian, Ukrainian, Bulgarian, etc.
  • CJK: Chinese, Japanese, Korean
  • Arabic: Arabic, Persian, Urdu
  • Devanagari: Hindi, Sanskrit
  • Greek, Hebrew, Thai, and more
Example
sql := `
    SELECT "名前" AS name,
           "возраст" AS age,
           "البريد_الإلكتروني" AS email
    FROM "المستخدمون"
    WHERE "نشط" = true
`
tokens, _ := tkz.Tokenize([]byte(sql))

Testing

Run tokenizer tests:

# All tests
go test -v ./pkg/sql/tokenizer/

# With race detection (MANDATORY during development)
go test -race ./pkg/sql/tokenizer/

# Specific features
go test -v -run TestTokenizer_Unicode ./pkg/sql/tokenizer/
go test -v -run TestTokenizer_PostgreSQL ./pkg/sql/tokenizer/

# Performance benchmarks
go test -bench=BenchmarkTokenizer -benchmem ./pkg/sql/tokenizer/

# Fuzz testing
go test -fuzz=FuzzTokenizer -fuzztime=30s ./pkg/sql/tokenizer/
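A custom benchmark follows the standard testing.B pattern. The sketch below is illustrative and may differ from the benchmarks shipped in the repository:

package tokenizer_test

import (
    "testing"

    "github.com/ajitpratap0/GoSQLX/pkg/sql/tokenizer"
)

// BenchmarkTokenizeSimple measures pooled tokenization of a small query.
func BenchmarkTokenizeSimple(b *testing.B) {
    sql := []byte("SELECT id, name FROM users WHERE active = true")
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        tkz := tokenizer.GetTokenizer()
        if _, err := tkz.Tokenize(sql); err != nil {
            b.Fatal(err)
        }
        tokenizer.PutTokenizer(tkz)
    }
}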

Best Practices

1. Always Use Object Pool
// GOOD: Use pool
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)

// BAD: Direct construction bypasses the pool
tkz, _ := tokenizer.New()  // works, but misses pool reuse benefits
2. Reset Between Uses
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)

for _, query := range queries {
    tokens, _ := tkz.Tokenize([]byte(query))
    // ... process tokens
    tkz.Reset()  // Reset state for next query
}
3. Use Byte Slices
// GOOD: Keep the SQL as []byte end-to-end (zero-copy)
tokens, _ := tkz.Tokenize(sqlBytes)

// LESS EFFICIENT: converting from string allocates a copy first
tokens, _ := tkz.Tokenize([]byte(sqlString))

Common Pitfalls

❌ Forgetting to Return to Pool
// BAD: Instance is never returned to the pool
tkz := tokenizer.GetTokenizer()
tokens, _ := tkz.Tokenize(sql)
// tkz is lost to the pool, so later callers must allocate fresh instances
✅ Correct Pattern
// GOOD: Automatic cleanup
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)
tokens, err := tkz.Tokenize(sql)
❌ Reusing Without Reset
// BAD: State contamination
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)

tkz.Tokenize(sql1)  // First use
tkz.Tokenize(sql2)  // State from sql1 still present!
✅ Correct Pattern
// GOOD: Reset between uses
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)

tkz.Tokenize(sql1)
tkz.Reset()  // Clear state
tkz.Tokenize(sql2)

Performance Tips

1. Minimize Allocations

The tokenizer is designed for zero-copy operation. To maximize performance:

  • Pass []byte directly (avoid string conversions)
  • Reuse tokenizer instances via the pool
  • Process tokens immediately (avoid copying token slices)
2. Batch Processing

For multiple queries, reuse a single tokenizer:

tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)

for _, query := range queries {
    tokens, _ := tkz.Tokenize([]byte(query))
    // Process immediately
    tkz.Reset()
}
3. Concurrent Processing

Each goroutine should get its own tokenizer:

// Each goroutine gets its own instance from pool
go func() {
    tkz := tokenizer.GetTokenizer()
    defer tokenizer.PutTokenizer(tkz)
    // ... tokenize and process
}()
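If a batch is very large, bound the number of goroutines so the pool is not flooded. A minimal sketch using a buffered channel as a semaphore (TokenizeBounded is illustrative):

func TokenizeBounded(queries []string, maxWorkers int) {
    sem := make(chan struct{}, maxWorkers) // limits tokenizers in flight
    var wg sync.WaitGroup

    for _, query := range queries {
        wg.Add(1)
        sem <- struct{}{}
        go func(sql string) {
            defer wg.Done()
            defer func() { <-sem }()

            tkz := tokenizer.GetTokenizer()
            defer tokenizer.PutTokenizer(tkz)

            if _, err := tkz.Tokenize([]byte(sql)); err != nil {
                return
            }
            // Process tokens...
        }(query)
    }
    wg.Wait()
}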

Related Packages

  • parser: Consumes tokens to build AST
  • keywords: Keyword recognition and categorization
  • models: Token type definitions
  • metrics: Performance monitoring integration

Version History

  • v1.5.0: Enhanced Unicode support, DOS protection hardening
  • v1.4.0: Production validation, 8M tokens/sec sustained
  • v1.3.0: PostgreSQL operator support expanded
  • v1.2.0: Multi-dialect operator recognition
  • v1.0.0: Initial release with zero-copy design

Documentation

Overview

Package tokenizer provides a high-performance SQL tokenizer with zero-copy operations

Constants

const (
	// MaxInputSize is the maximum allowed input size in bytes (10MB)
	// This prevents DoS attacks via extremely large SQL queries
	MaxInputSize = 10 * 1024 * 1024 // 10MB

	// MaxTokens is the maximum number of tokens allowed in a single SQL query
	// This prevents DoS attacks via token explosion
	MaxTokens = 1000000 // 1M tokens
)

Variables

This section is empty.

Functions

func PutTokenizer

func PutTokenizer(t *Tokenizer)

PutTokenizer returns a Tokenizer to the pool

Types

type BufferPool

type BufferPool struct {
	// contains filtered or unexported fields
}

BufferPool manages a pool of reusable byte buffers for token content

func NewBufferPool

func NewBufferPool() *BufferPool

NewBufferPool creates a new buffer pool with optimized initial capacity

func (*BufferPool) Get

func (p *BufferPool) Get() []byte

Get retrieves a buffer from the pool

func (*BufferPool) Grow

func (p *BufferPool) Grow(buf []byte, n int) []byte

Grow ensures the buffer has enough capacity

func (*BufferPool) Put

func (p *BufferPool) Put(buf []byte)

Put returns a buffer to the pool
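BufferPool is primarily used internally, but the exported methods can be driven directly. A minimal sketch based on the signatures above (the exact growth semantics are an assumption):

pool := tokenizer.NewBufferPool()

buf := pool.Get()         // borrow a buffer
buf = pool.Grow(buf, 256) // ensure at least 256 bytes of capacity
buf = append(buf[:0], "SELECT 1"...)
// ... use buf ...
pool.Put(buf)             // return it when done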

type DebugLogger

type DebugLogger interface {
	Debug(format string, args ...interface{})
}

DebugLogger is an interface for debug logging

type Error

type Error struct {
	Message  string
	Location models.Location
}

Error represents a tokenization error with location information

func ErrorInvalidIdentifier

func ErrorInvalidIdentifier(value string, location models.Location) *Error

ErrorInvalidIdentifier creates an error for an invalid identifier

func ErrorInvalidNumber

func ErrorInvalidNumber(value string, location models.Location) *Error

ErrorInvalidNumber creates an error for an invalid number format

func ErrorInvalidOperator

func ErrorInvalidOperator(value string, location models.Location) *Error

ErrorInvalidOperator creates an error for an invalid operator

func ErrorUnexpectedChar

func ErrorUnexpectedChar(ch byte, location models.Location) *Error

ErrorUnexpectedChar creates an error for an unexpected character

func ErrorUnterminatedString

func ErrorUnterminatedString(location models.Location) *Error

ErrorUnterminatedString creates an error for an unterminated string

func NewError

func NewError(message string, location models.Location) *Error

NewError creates a new tokenization error

func (*Error) Error

func (e *Error) Error() string

type Position

type Position struct {
	Line   int
	Index  int
	Column int
	LastNL int // byte offset of last newline
}

Position tracks the scanning cursor with optimized tracking: Line is 1-based, Index is 0-based, Column is 1-based, and LastNL records the byte offset of the last newline for efficient column calculation.

func NewPosition

func NewPosition(line, index int) Position

NewPosition builds a Position from raw info

func (*Position) AdvanceN

func (p *Position) AdvanceN(n int, lineStarts []int)

AdvanceN moves forward by n bytes

func (*Position) AdvanceRune

func (p *Position) AdvanceRune(r rune, size int)

AdvanceRune moves the position forward by the given rune, updating line/column efficiently

func (Position) Clone

func (p Position) Clone() Position

Clone makes a copy of Position

func (Position) Location

func (p Position) Location(t *Tokenizer) models.Location

Location gives the models.Location for this position
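Position is normally advanced by the tokenizer itself, but the exported helpers can be exercised directly. A minimal sketch based on the signatures above, assuming the standard unicode/utf8 package for rune sizes:

pos := tokenizer.NewPosition(1, 0)
for _, r := range "SELECT\n42" {
    pos.AdvanceRune(r, utf8.RuneLen(r)) // advance line/column per rune
}
snapshot := pos.Clone()
fmt.Println(snapshot.Line, snapshot.Column) // cursor after consuming the input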

type StringLiteralReader

type StringLiteralReader struct {
	// contains filtered or unexported fields
}

StringLiteralReader handles reading of string literals with proper escape sequence handling

func NewStringLiteralReader

func NewStringLiteralReader(input []byte, pos *Position, quote rune) *StringLiteralReader

NewStringLiteralReader creates a new StringLiteralReader

func (*StringLiteralReader) ReadStringLiteral

func (r *StringLiteralReader) ReadStringLiteral() (models.Token, error)

ReadStringLiteral reads a string literal with proper escape sequence handling

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer provides high-performance SQL tokenization with zero-copy operations

func GetTokenizer

func GetTokenizer() *Tokenizer

GetTokenizer gets a Tokenizer from the pool

func New

func New() (*Tokenizer, error)

New creates a new Tokenizer with default configuration
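For one-off use outside the pool (the pool remains the recommended path for hot loops), a minimal construction sketch:

tkz, err := tokenizer.New()
if err != nil {
    log.Fatal(err)
}
tokens, err := tkz.Tokenize([]byte("SELECT 1"))
if err != nil {
    log.Fatal(err)
}
fmt.Println(len(tokens), "tokens")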

func NewWithKeywords

func NewWithKeywords(kw *keywords.Keywords) (*Tokenizer, error)

NewWithKeywords initializes a Tokenizer with custom keywords

func (*Tokenizer) Reset

func (t *Tokenizer) Reset()

Reset resets a Tokenizer's state for reuse

func (*Tokenizer) SetDebugLogger

func (t *Tokenizer) SetDebugLogger(logger DebugLogger)

SetDebugLogger sets a debug logger for verbose tracing
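Any type with a matching Debug method satisfies DebugLogger. A minimal sketch backed by the standard log package (stdDebugLogger and tokenizeWithTracing are illustrative):

// stdDebugLogger routes tokenizer debug output to the standard logger.
type stdDebugLogger struct{}

func (stdDebugLogger) Debug(format string, args ...interface{}) {
    log.Printf("[tokenizer] "+format, args...)
}

func tokenizeWithTracing(sql []byte) error {
    tkz := tokenizer.GetTokenizer()
    defer tokenizer.PutTokenizer(tkz)

    tkz.SetDebugLogger(stdDebugLogger{}) // enable verbose tracing for this run
    _, err := tkz.Tokenize(sql)
    return err
}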

func (*Tokenizer) Tokenize

func (t *Tokenizer) Tokenize(input []byte) ([]models.TokenWithSpan, error)

Tokenize processes the input and returns tokens

func (*Tokenizer) TokenizeContext added in v1.5.0

func (t *Tokenizer) TokenizeContext(ctx context.Context, input []byte) ([]models.TokenWithSpan, error)

TokenizeContext processes the input and returns tokens with context support for cancellation. It checks the context at regular intervals (every 100 tokens) to enable fast cancellation. Returns context.Canceled or context.DeadlineExceeded when the context is cancelled.

This method is useful for:

  • Long-running tokenization operations that need to be cancellable
  • Implementing timeouts for tokenization
  • Graceful shutdown scenarios

Example:

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
tokens, err := tokenizer.TokenizeContext(ctx, []byte(sql))
if err == context.DeadlineExceeded {
    // Handle timeout
}
