tokenizer

package
v1.6.0
Published: Dec 11, 2025 License: AGPL-3.0 Imports: 12 Imported by: 0

README

SQL Tokenizer Package

Overview

The tokenizer package provides a high-performance, zero-copy SQL lexical analyzer that converts SQL text into tokens. It supports multiple SQL dialects with full Unicode support and comprehensive operator recognition.

Key Features

  • Zero-Copy Operation: Works directly on input bytes without string allocation
  • Unicode Support: Full UTF-8 support for international SQL (8+ languages tested)
  • Multi-Dialect: PostgreSQL, MySQL, SQL Server, Oracle, SQLite operators and syntax
  • Object Pooling: 60-80% memory reduction through instance reuse
  • Position Tracking: Precise line/column information for error reporting
  • DOS Protection: Token limits and input size validation
  • Thread-Safe: All pool operations are race-free

Performance

  • Throughput: 8M tokens/second sustained
  • Latency: Sub-microsecond tokenization for typical queries
  • Memory: Minimal allocations with zero-copy design
  • Concurrency: Validated race-free with 20,000+ concurrent operations

Usage

Basic Tokenization
package main

import (
    "fmt"

    "github.com/ajitpratap0/GoSQLX/pkg/sql/tokenizer"
)

func main() {
    // Get tokenizer from pool
    tkz := tokenizer.GetTokenizer()
    defer tokenizer.PutTokenizer(tkz)  // ALWAYS return to pool

    // Tokenize SQL
    sql := []byte("SELECT * FROM users WHERE active = true")
    tokens, err := tkz.Tokenize(sql)
    if err != nil {
        // Handle tokenization error
        return
    }

    // Process tokens
    for _, tok := range tokens {
        fmt.Printf("%s at line %d, col %d\n",
            tok.Token.Value,
            tok.Start.Line,
            tok.Start.Column)
    }
}
Batch Processing
func ProcessMultipleQueries(queries []string) {
    tkz := tokenizer.GetTokenizer()
    defer tokenizer.PutTokenizer(tkz)

    for _, query := range queries {
        tokens, err := tkz.Tokenize([]byte(query))
        if err != nil {
            tkz.Reset() // clear any partial state before the next query
            continue
        }

        // Process tokens
        // ...

        tkz.Reset()  // Reset between uses
    }
}
Concurrent Tokenization
func ConcurrentTokenization(queries []string) {
    var wg sync.WaitGroup

    for _, query := range queries {
        wg.Add(1)
        go func(sql string) {
            defer wg.Done()

            // Each goroutine gets its own tokenizer
            tkz := tokenizer.GetTokenizer()
            defer tokenizer.PutTokenizer(tkz)

            tokens, _ := tkz.Tokenize([]byte(sql))
            // Process tokens...
        }(query)
    }

    wg.Wait()
}

Token Types

Keywords
SELECT, FROM, WHERE, JOIN, GROUP BY, ORDER BY, HAVING, LIMIT, OFFSET,
INSERT, UPDATE, DELETE, CREATE, ALTER, DROP, WITH, UNION, EXCEPT, INTERSECT, etc.
Identifiers
  • Standard: user_id, TableName, column123
  • Quoted: "column name" (SQL standard)
  • Backtick: `column` (MySQL)
  • Bracket: [column] (SQL Server)
  • Unicode: "名前", "имя", "الاسم" (international)
Literals
  • Numbers: 42, 3.14, 1.5e10, 0xFF
  • Strings: 'hello', 'it''s' (escaped quotes)
  • Booleans: TRUE, FALSE
  • NULL: NULL
Operators
  • Comparison: =, <>, !=, <, >, <=, >=
  • Arithmetic: +, -, *, /, %
  • Logical: AND, OR, NOT
  • PostgreSQL: @>, <@, ->, ->>, #>, ?, ||
  • Pattern: LIKE, ILIKE, SIMILAR TO
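To confirm how a multi-character, dialect-specific operator is tokenized, print the token values for a query that uses it. The sketch below relies only on the pool API and the Token.Value field shown in the usage examples; the inspectOperators helper name is illustrative.

func inspectOperators() {
    tkz := tokenizer.GetTokenizer()
    defer tokenizer.PutTokenizer(tkz)

    sql := []byte(`SELECT data->>'email' FROM users WHERE tags @> ARRAY['admin']`)
    tokens, err := tkz.Tokenize(sql)
    if err != nil {
        return
    }

    for _, tok := range tokens {
        // ->> and @> are expected to surface as single operator tokens.
        fmt.Printf("%q ", tok.Token.Value)
    }
}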

Dialect-Specific Features

PostgreSQL
-- Array operators
SELECT * FROM users WHERE tags @> ARRAY['admin']

-- JSON operators
SELECT data->>'email' FROM users

-- String concatenation
SELECT first_name || ' ' || last_name FROM users
MySQL
-- Backtick identifiers
SELECT `user_id` FROM `users`

-- Double pipe as OR
SELECT * FROM users WHERE status = 1 || status = 2
SQL Server
-- Bracket identifiers
SELECT [User ID] FROM [User Table]

-- String concatenation with +
SELECT FirstName + ' ' + LastName FROM Users

Architecture

Core Files
  • tokenizer.go: Main tokenizer logic
  • string_literal.go: String parsing with escape sequence handling
  • unicode.go: Unicode identifier and quote normalization
  • position.go: Position tracking (line, column, byte offset)
  • pool.go: Object pool management
  • buffer.go: Internal buffer pool for performance
  • error.go: Structured error types
Tokenization Pipeline
Input bytes → Position tracking → Character scanning → Token recognition → Output tokens

Error Handling

Detailed Error Information
tokens, err := tkz.Tokenize(sqlBytes)
if err != nil {
    if tokErr, ok := err.(*tokenizer.Error); ok {
        fmt.Printf("Error at line %d, column %d: %s\n",
            tokErr.Location.Line,
            tokErr.Location.Column,
            tokErr.Message)
    }
}
Common Error Types
  • Unterminated String: Missing closing quote
  • Invalid Number: Malformed numeric literal
  • Invalid Character: Unexpected character in input
  • Invalid Escape: Unknown escape sequence in string
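For example, a missing closing quote is expected to surface as a *tokenizer.Error carrying the offending location; a minimal sketch using the same fields as the example above:

tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)

// The missing closing quote should produce an unterminated-string error.
_, err := tkz.Tokenize([]byte("SELECT 'unterminated"))
if tokErr, ok := err.(*tokenizer.Error); ok {
    fmt.Printf("line %d, col %d: %s\n",
        tokErr.Location.Line, tokErr.Location.Column, tokErr.Message)
}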

DOS Protection

Token Limit
// MaxTokens caps each query at 1,000,000 tokens (see Constants below)
// Prevents memory exhaustion from malicious input
Input Size Validation
// MaxInputSize caps each query at 10MB (see Constants below)
// Oversized input is rejected with an error
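Both limits are exported constants, so callers can also pre-validate input before handing it to the tokenizer. A minimal sketch (the tokenizeChecked helper is illustrative):

// tokenizeChecked rejects oversized input up front using the exported constant.
func tokenizeChecked(sql []byte) error {
    if len(sql) > tokenizer.MaxInputSize {
        return fmt.Errorf("input is %d bytes, exceeds MaxInputSize (%d)", len(sql), tokenizer.MaxInputSize)
    }

    tkz := tokenizer.GetTokenizer()
    defer tokenizer.PutTokenizer(tkz)

    tokens, err := tkz.Tokenize(sql)
    if err != nil {
        return err
    }
    _ = tokens // process tokens...
    return nil
}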

Unicode Support

Supported Scripts
  • Latin: English, Spanish, French, German, etc.
  • Cyrillic: Russian, Ukrainian, Bulgarian, etc.
  • CJK: Chinese, Japanese, Korean
  • Arabic: Arabic, Persian, Urdu
  • Devanagari: Hindi, Sanskrit
  • Greek, Hebrew, Thai, and more
Example
sql := `
    SELECT "名前" AS name,
           "возраст" AS age,
           "البريد_الإلكتروني" AS email
    FROM "المستخدمون"
    WHERE "نشط" = true
`
tokens, _ := tkz.Tokenize([]byte(sql))

Testing

Run tokenizer tests:

# All tests
go test -v ./pkg/sql/tokenizer/

# With race detection (MANDATORY during development)
go test -race ./pkg/sql/tokenizer/

# Specific features
go test -v -run TestTokenizer_Unicode ./pkg/sql/tokenizer/
go test -v -run TestTokenizer_PostgreSQL ./pkg/sql/tokenizer/

# Performance benchmarks
go test -bench=BenchmarkTokenizer -benchmem ./pkg/sql/tokenizer/

# Fuzz testing
go test -fuzz=FuzzTokenizer -fuzztime=30s ./pkg/sql/tokenizer/
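A custom benchmark follows the standard testing.B pattern. The sketch below is illustrative and may differ from the benchmarks shipped in the repository:

package tokenizer_test

import (
    "testing"

    "github.com/ajitpratap0/GoSQLX/pkg/sql/tokenizer"
)

// BenchmarkTokenizeSimple measures pooled tokenization of a small query.
func BenchmarkTokenizeSimple(b *testing.B) {
    sql := []byte("SELECT id, name FROM users WHERE active = true")
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        tkz := tokenizer.GetTokenizer()
        if _, err := tkz.Tokenize(sql); err != nil {
            b.Fatal(err)
        }
        tokenizer.PutTokenizer(tkz)
    }
}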

Best Practices

1. Always Use Object Pool
// GOOD: Use pool
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)

// BAD: Direct construction bypasses the pool
tkz, _ := tokenizer.New()  // works, but misses pool reuse benefits
2. Reset Between Uses
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)

for _, query := range queries {
    tokens, _ := tkz.Tokenize([]byte(query))
    // ... process tokens
    tkz.Reset()  // Reset state for next query
}
3. Use Byte Slices
// GOOD: Keep the SQL as []byte end-to-end (zero-copy)
tokens, _ := tkz.Tokenize(sqlBytes)

// LESS EFFICIENT: converting from string allocates a copy first
tokens, _ := tkz.Tokenize([]byte(sqlString))

Common Pitfalls

❌ Forgetting to Return to Pool
// BAD: Instance is never returned to the pool
tkz := tokenizer.GetTokenizer()
tokens, _ := tkz.Tokenize(sql)
// tkz is lost to the pool, so later callers must allocate fresh instances
✅ Correct Pattern
// GOOD: Automatic cleanup
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)
tokens, err := tkz.Tokenize(sql)
❌ Reusing Without Reset
// BAD: State contamination
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)

tkz.Tokenize(sql1)  // First use
tkz.Tokenize(sql2)  // State from sql1 still present!
✅ Correct Pattern
// GOOD: Reset between uses
tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)

tkz.Tokenize(sql1)
tkz.Reset()  // Clear state
tkz.Tokenize(sql2)

Performance Tips

1. Minimize Allocations

The tokenizer is designed for zero-copy operation. To maximize performance:

  • Pass []byte directly (avoid string conversions)
  • Reuse tokenizer instances via the pool
  • Process tokens immediately (avoid copying token slices)
2. Batch Processing

For multiple queries, reuse a single tokenizer:

tkz := tokenizer.GetTokenizer()
defer tokenizer.PutTokenizer(tkz)

for _, query := range queries {
    tokens, _ := tkz.Tokenize([]byte(query))
    // Process immediately
    tkz.Reset()
}
3. Concurrent Processing

Each goroutine should get its own tokenizer:

// Each goroutine gets its own instance from pool
go func() {
    tkz := tokenizer.GetTokenizer()
    defer tokenizer.PutTokenizer(tkz)
    // ... tokenize and process
}()
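If a batch is very large, bound the number of goroutines so the pool is not flooded. A minimal sketch using a buffered channel as a semaphore (TokenizeBounded is illustrative):

func TokenizeBounded(queries []string, maxWorkers int) {
    sem := make(chan struct{}, maxWorkers) // limits tokenizers in flight
    var wg sync.WaitGroup

    for _, query := range queries {
        wg.Add(1)
        sem <- struct{}{}
        go func(sql string) {
            defer wg.Done()
            defer func() { <-sem }()

            tkz := tokenizer.GetTokenizer()
            defer tokenizer.PutTokenizer(tkz)

            if _, err := tkz.Tokenize([]byte(sql)); err != nil {
                return
            }
            // Process tokens...
        }(query)
    }
    wg.Wait()
}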

Related Packages

  • parser: Consumes tokens to build AST
  • keywords: Keyword recognition and categorization
  • models: Token type definitions
  • metrics: Performance monitoring integration

Version History

  • v1.5.0: Enhanced Unicode support, DOS protection hardening
  • v1.4.0: Production validation, 8M tokens/sec sustained
  • v1.3.0: PostgreSQL operator support expanded
  • v1.2.0: Multi-dialect operator recognition
  • v1.0.0: Initial release with zero-copy design

Documentation

Overview

Package tokenizer provides a high-performance SQL tokenizer with zero-copy operations

Constants

const (
	// MaxInputSize is the maximum allowed input size in bytes (10MB)
	// This prevents DoS attacks via extremely large SQL queries
	MaxInputSize = 10 * 1024 * 1024 // 10MB

	// MaxTokens is the maximum number of tokens allowed in a single SQL query
	// This prevents DoS attacks via token explosion
	MaxTokens = 1000000 // 1M tokens
)

Variables

This section is empty.

Functions

func PutTokenizer

func PutTokenizer(t *Tokenizer)

PutTokenizer returns a Tokenizer to the pool

Types

type BufferPool

type BufferPool struct {
	// contains filtered or unexported fields
}

BufferPool manages a pool of reusable byte buffers for token content

func NewBufferPool

func NewBufferPool() *BufferPool

NewBufferPool creates a new buffer pool with optimized initial capacity

func (*BufferPool) Get

func (p *BufferPool) Get() []byte

Get retrieves a buffer from the pool

func (*BufferPool) Grow

func (p *BufferPool) Grow(buf []byte, n int) []byte

Grow ensures the buffer has enough capacity

func (*BufferPool) Put

func (p *BufferPool) Put(buf []byte)

Put returns a buffer to the pool
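BufferPool is primarily used internally, but the exported methods can be driven directly. A minimal sketch based on the signatures above (the exact growth semantics are an assumption):

pool := tokenizer.NewBufferPool()

buf := pool.Get()         // borrow a buffer
buf = pool.Grow(buf, 256) // ensure at least 256 bytes of capacity
buf = append(buf[:0], "SELECT 1"...)
// ... use buf ...
pool.Put(buf)             // return it when done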

type DebugLogger

type DebugLogger interface {
	Debug(format string, args ...interface{})
}

DebugLogger is an interface for debug logging

type Error

type Error struct {
	Message  string
	Location models.Location
}

Error represents a tokenization error with location information

func ErrorInvalidIdentifier

func ErrorInvalidIdentifier(value string, location models.Location) *Error

ErrorInvalidIdentifier creates an error for an invalid identifier

func ErrorInvalidNumber

func ErrorInvalidNumber(value string, location models.Location) *Error

ErrorInvalidNumber creates an error for an invalid number format

func ErrorInvalidOperator

func ErrorInvalidOperator(value string, location models.Location) *Error

ErrorInvalidOperator creates an error for an invalid operator

func ErrorUnexpectedChar

func ErrorUnexpectedChar(ch byte, location models.Location) *Error

ErrorUnexpectedChar creates an error for an unexpected character

func ErrorUnterminatedString

func ErrorUnterminatedString(location models.Location) *Error

ErrorUnterminatedString creates an error for an unterminated string

func NewError

func NewError(message string, location models.Location) *Error

NewError creates a new tokenization error

func (*Error) Error

func (e *Error) Error() string

type Position

type Position struct {
	Line   int
	Index  int
	Column int
	LastNL int // byte offset of last newline
}

Position tracks the scanning cursor with optimized tracking: Line is 1-based, Index is 0-based, Column is 1-based, and LastNL records the byte offset of the last newline for efficient column calculation.

func NewPosition

func NewPosition(line, index int) Position

NewPosition builds a Position from raw info

func (*Position) AdvanceN

func (p *Position) AdvanceN(n int, lineStarts []int)

AdvanceN moves forward by n bytes

func (*Position) AdvanceRune

func (p *Position) AdvanceRune(r rune, size int)

AdvanceRune moves the position forward by the given rune, updating line/column efficiently

func (Position) Clone

func (p Position) Clone() Position

Clone makes a copy of Position

func (Position) Location

func (p Position) Location(t *Tokenizer) models.Location

Location gives the models.Location for this position
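Position is normally advanced by the tokenizer itself, but the exported helpers can be exercised directly. A minimal sketch based on the signatures above, assuming the standard unicode/utf8 package for rune sizes:

pos := tokenizer.NewPosition(1, 0)
for _, r := range "SELECT\n42" {
    pos.AdvanceRune(r, utf8.RuneLen(r)) // advance line/column per rune
}
snapshot := pos.Clone()
fmt.Println(snapshot.Line, snapshot.Column) // cursor after consuming the input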

type StringLiteralReader

type StringLiteralReader struct {
	// contains filtered or unexported fields
}

StringLiteralReader handles reading of string literals with proper escape sequence handling

func NewStringLiteralReader

func NewStringLiteralReader(input []byte, pos *Position, quote rune) *StringLiteralReader

NewStringLiteralReader creates a new StringLiteralReader

func (*StringLiteralReader) ReadStringLiteral

func (r *StringLiteralReader) ReadStringLiteral() (models.Token, error)

ReadStringLiteral reads a string literal with proper escape sequence handling

type Tokenizer

type Tokenizer struct {
	// contains filtered or unexported fields
}

Tokenizer provides high-performance SQL tokenization with zero-copy operations

func GetTokenizer

func GetTokenizer() *Tokenizer

GetTokenizer gets a Tokenizer from the pool

func New

func New() (*Tokenizer, error)

New creates a new Tokenizer with default configuration
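For one-off use outside the pool (the pool remains the recommended path for hot loops), a minimal construction sketch:

tkz, err := tokenizer.New()
if err != nil {
    log.Fatal(err)
}
tokens, err := tkz.Tokenize([]byte("SELECT 1"))
if err != nil {
    log.Fatal(err)
}
fmt.Println(len(tokens), "tokens")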

func NewWithKeywords

func NewWithKeywords(kw *keywords.Keywords) (*Tokenizer, error)

NewWithKeywords initializes a Tokenizer with custom keywords

func (*Tokenizer) Reset

func (t *Tokenizer) Reset()

Reset resets a Tokenizer's state for reuse

func (*Tokenizer) SetDebugLogger

func (t *Tokenizer) SetDebugLogger(logger DebugLogger)

SetDebugLogger sets a debug logger for verbose tracing
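Any type with a matching Debug method satisfies DebugLogger. A minimal sketch backed by the standard log package (stdDebugLogger and tokenizeWithTracing are illustrative):

// stdDebugLogger routes tokenizer debug output to the standard logger.
type stdDebugLogger struct{}

func (stdDebugLogger) Debug(format string, args ...interface{}) {
    log.Printf("[tokenizer] "+format, args...)
}

func tokenizeWithTracing(sql []byte) error {
    tkz := tokenizer.GetTokenizer()
    defer tokenizer.PutTokenizer(tkz)

    tkz.SetDebugLogger(stdDebugLogger{}) // enable verbose tracing for this run
    _, err := tkz.Tokenize(sql)
    return err
}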

func (*Tokenizer) Tokenize

func (t *Tokenizer) Tokenize(input []byte) ([]models.TokenWithSpan, error)

Tokenize processes the input and returns tokens

func (*Tokenizer) TokenizeContext added in v1.5.0

func (t *Tokenizer) TokenizeContext(ctx context.Context, input []byte) ([]models.TokenWithSpan, error)

TokenizeContext processes the input and returns tokens with context support for cancellation. It checks the context at regular intervals (every 100 tokens) to enable fast cancellation. Returns context.Canceled or context.DeadlineExceeded when the context is cancelled.

This method is useful for:

  • Long-running tokenization operations that need to be cancellable
  • Implementing timeouts for tokenization
  • Graceful shutdown scenarios

Example:

ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
defer cancel()
tokens, err := tokenizer.TokenizeContext(ctx, []byte(sql))
if err == context.DeadlineExceeded {
    // Handle timeout
}
