transliterate

package
v1.0.0 Latest
Warning

This package is not in the latest version of its module.

Published: Apr 9, 2025 | License: MIT | Imports: 6 | Imported by: 1

Documentation

Overview

Package transliterate provides functionality to convert Unicode text into plain ASCII equivalents.

It takes Unicode characters and replaces them with their closest ASCII representation (e.g., 'é' becomes 'e', 'ü' becomes 'u'). Characters without a known ASCII approximation are generally omitted.

Emoji and other pictographic symbols are omitted from the output as they have no standardized ASCII representation. If you need emoji-to-text conversion, consider using a dedicated emoji processing library.

The transliteration tables are based on those from https://github.com/mozillazg/go-unidecode, modified to include additional characters and improve accuracy.

Currently supports the Unicode BMP (U+0000-U+FFFF) and the following blocks from the supplementary planes:

  • x1d4: Mathematical Alphanumeric Symbols
  • x1d5: Mathematical Alphanumeric Symbols
  • x1d6: Mathematical Alphanumeric Symbols
  • x1d7: Mathematical Alphanumeric Symbols
  • x1f1: Enclosed Alphanumeric Supplement
  • x1f6: Transport and Map Symbols + Emoji Symbols
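The names in the list above look like 256-code-point block indices, that is, the code point with its low byte dropped (so U+1D400 falls in "x1d4"). The following is a minimal sketch of that naming scheme; blockOf is a hypothetical helper and the scheme itself is an assumption, not something the package documents.

```go
package main

import "fmt"

// blockOf returns a name for the 256-code-point block that contains r.
// Hypothetical helper: it assumes block names are "x" followed by the
// code point shifted right by 8 bits, printed as at least three hex digits.
func blockOf(r rune) string {
	return fmt.Sprintf("x%03x", r>>8)
}

func main() {
	fmt.Println(blockOf(0x1D400)) // MATHEMATICAL BOLD CAPITAL A, prints "x1d4"
	fmt.Println(blockOf(0x00E9))  // LATIN SMALL LETTER E WITH ACUTE, prints "x000"
}
```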

Thread Safety

String() and WithLimit() are safe for concurrent use. The package maintains an internal cache that is thread-safe and uses a buffer pool for improved performance under high concurrency.
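To illustrate how a buffer pool can serve concurrent callers, here is a minimal sketch built on sync.Pool. The names bufPool and keepASCII are hypothetical, and the code is not taken from the package; it only shows the general technique of borrowing, resetting, and returning a buffer.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers so that concurrent callers do not
// allocate a fresh one per call. Hypothetical; not the package's code.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// keepASCII copies only the ASCII bytes of s, using a pooled buffer.
func keepASCII(s string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)
	for i := 0; i < len(s); i++ {
		if s[i] < 0x80 {
			buf.WriteByte(s[i])
		}
	}
	// String copies the bytes, so returning the buffer to the pool is safe.
	return buf.String()
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = keepASCII("na\u00efve")
		}()
	}
	wg.Wait()
	fmt.Println(keepASCII("na\u00efve")) // prints "nave"
}
```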

Cache Behavior

By default, the package caches up to 1000 character transliterations; the limit is configurable with WithMaxCacheSize. When the cache becomes full, it is cleared entirely. This approach favors simplicity over granular eviction, but it may hurt performance for workloads with highly varying character sets.
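The clear-when-full policy can be sketched as follows. This is an illustration under assumptions, not the package's implementation: the type clearingCache and its methods are hypothetical, and the real cache may key or lock differently.

```go
package main

import (
	"fmt"
	"sync"
)

// clearingCache is a hypothetical bounded cache. When it already holds
// maxSize entries, inserting a new key drops every entry first instead
// of evicting a single one.
type clearingCache struct {
	mu      sync.RWMutex
	entries map[rune]string
	maxSize int
}

func newClearingCache(maxSize int) *clearingCache {
	return &clearingCache{entries: make(map[rune]string), maxSize: maxSize}
}

func (c *clearingCache) get(r rune) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	s, ok := c.entries[r]
	return s, ok
}

func (c *clearingCache) put(r rune, s string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, exists := c.entries[r]; !exists && len(c.entries) >= c.maxSize {
		c.entries = make(map[rune]string) // full: start over
	}
	c.entries[r] = s
}

func (c *clearingCache) size() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.entries)
}

func main() {
	c := newClearingCache(2)
	c.put('é', "e")
	c.put('ü', "u")
	c.put('ñ', "n") // the cache was full, so it is cleared first
	_, ok := c.get('é')
	fmt.Println(c.size(), ok) // prints "1 false"
}
```

The trade-off is visible in the last lines: after the clear, previously cached characters miss again, which is why inputs with many distinct characters can see more lookups.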

Configuration

The package can be configured using the Configure function with options:

Cache size (default 1000 entries):

transliterate.Configure(transliterate.WithMaxCacheSize(5000))

Maximum input length (default 1MB):

transliterate.Configure(transliterate.WithMaxInputLength(1 << 24)) // 16MB

Options can be combined:

transliterate.Configure(
    transliterate.WithMaxCacheSize(5000),
    transliterate.WithMaxInputLength(1 << 24),
)

Configuration should be done early in your application lifecycle, preferably before any calls to String() or WithLimit().

Table Generation

The transliteration tables are generated from a text definition file. See tools/make_tables/README.md and tools/convert_tables/README.md for details on maintaining the tables.

Examples:

transliterate.String("これはひらがなです") // Output: "korehahiraganadesu"
transliterate.String("你好,世界") // Output: "Ni Hao, Shi Jie" (Depends on table)

Constants

const (
	DefaultCacheSize   = 1000
	DefaultMaxInputLen = 1 << 20 // 1MB
)

DefaultCacheSize and DefaultMaxInputLen are the default configuration values.

Functions

func ClearCache

func ClearCache()

ClearCache empties the transliteration cache. This can be useful when memory pressure is high or when preparing for a new batch of transliterations.

func Configure

func Configure(opts ...CacheOption)

Configure applies the given options to the package configuration. Call it early, preferably before any calls to String() or WithLimit().

func GetCacheSize

func GetCacheSize() int

GetCacheSize returns the current size of the transliteration cache. This can be useful for monitoring memory usage.

func GetCacheStats

func GetCacheStats() (hits uint64)

GetCacheStats returns the number of cache hits since last reset. This can be used to monitor cache effectiveness.

func ResetCacheStats

func ResetCacheStats()

ResetCacheStats zeros out the cache statistics counter. Useful for beginning a new monitoring period.
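A hit counter with get and reset operations like the pair above can be kept safe for concurrent use with sync/atomic. The sketch below uses hypothetical names (cacheHits, recordHit, hitCount, resetHits) and is not the package's code.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// cacheHits is a hypothetical counter, updated only through sync/atomic
// so that concurrent callers never race on it.
var cacheHits uint64

func recordHit()       { atomic.AddUint64(&cacheHits, 1) }
func hitCount() uint64 { return atomic.LoadUint64(&cacheHits) }
func resetHits()       { atomic.StoreUint64(&cacheHits, 0) }

func main() {
	recordHit()
	recordHit()
	fmt.Println(hitCount()) // prints "2"
	resetHits()
	fmt.Println(hitCount()) // prints "0"
}
```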

func String

func String(s string) string

String transliterates a Unicode string into its closest ASCII representation. For example, "é" becomes "e". Characters without a known approximation are omitted. Invalid UTF-8 sequences are also omitted.
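The behavior described above (map known characters, omit unknown ones, omit invalid UTF-8) can be sketched with unicode/utf8. The three-entry sampleTable and the toASCII function are hypothetical stand-ins for the package's generated tables and its String function.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// sampleTable is a tiny hypothetical mapping; the real package uses
// generated tables covering far more characters.
var sampleTable = map[rune]string{'é': "e", 'ü': "u", 'ß': "ss"}

// toASCII keeps ASCII as-is, maps known runes through sampleTable, and
// omits both unknown runes and invalid UTF-8 bytes.
func toASCII(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		i += size
		switch {
		case r == utf8.RuneError && size == 1:
			// Invalid UTF-8 byte: omit it.
		case r < utf8.RuneSelf:
			b.WriteByte(byte(r))
		default:
			b.WriteString(sampleTable[r]) // a missing key yields ""
		}
	}
	return b.String()
}

func main() {
	fmt.Println(toASCII("caf\u00e9\xff!")) // prints "cafe!"
}
```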

func WithLimit

func WithLimit(s string) (string, error)

WithLimit transliterates a Unicode string into its closest ASCII representation, but limits the input string length (1MB by default, configurable with WithMaxInputLength) to prevent excessive memory usage. For example, "é" becomes "e". Characters without a known approximation are omitted. Invalid UTF-8 sequences are also omitted.
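A length guard of this kind might look like the sketch below. It assumes the limit is measured in bytes and that oversized input is rejected with an error rather than truncated; neither detail is stated in this documentation. The names errInputTooLong and checkLimit are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errInputTooLong is a hypothetical sentinel error; the package's actual
// error value is not documented here.
var errInputTooLong = errors.New("input exceeds maximum length")

// checkLimit rejects input longer than limit bytes (an assumption: the
// real limit could be counted differently).
func checkLimit(s string, limit int) error {
	if len(s) > limit {
		return fmt.Errorf("%w: %d bytes > %d", errInputTooLong, len(s), limit)
	}
	return nil
}

func main() {
	fmt.Println(checkLimit("abc", 1<<20)) // prints "<nil>"
	err := checkLimit("abcd", 3)
	fmt.Println(errors.Is(err, errInputTooLong)) // prints "true"
}
```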

Types

type CacheOption

type CacheOption func(*config)

CacheOption is a functional option passed to Configure. Despite its name, it covers both the cache size and the maximum input length.

func WithMaxCacheSize

func WithMaxCacheSize(size int) CacheOption

WithMaxCacheSize sets the maximum number of entries in the cache (default 1000).

func WithMaxInputLength

func WithMaxInputLength(length int) CacheOption

WithMaxInputLength sets the maximum input string length (default 1MB).
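CacheOption and its two constructors follow the functional options pattern. The sketch below shows how such options are typically applied over defaults; the lowercase names (option, newConfig, and so on) are hypothetical, and only the defaults of 1000 entries and 1 << 20 come from this documentation.

```go
package main

import "fmt"

// config mirrors the two settings described above. Hypothetical layout.
type config struct {
	maxCacheSize int
	maxInputLen  int
}

// option mutates a config, in the style of CacheOption.
type option func(*config)

func withMaxCacheSize(n int) option {
	return func(c *config) { c.maxCacheSize = n }
}

func withMaxInputLength(n int) option {
	return func(c *config) { c.maxInputLen = n }
}

// newConfig starts from the documented defaults and applies each option
// in order, so later options override earlier ones.
func newConfig(opts ...option) config {
	c := config{maxCacheSize: 1000, maxInputLen: 1 << 20}
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

func main() {
	c := newConfig(withMaxCacheSize(5000))
	fmt.Println(c.maxCacheSize, c.maxInputLen) // prints "5000 1048576"
}
```

Options that are not passed keep their defaults, which is why the pattern suits configuration with several independent knobs.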

Directories

Path            Synopsis
internal/table  Package table contains the generated transliteration data for the transliterate package.