Documentation ¶
Overview ¶
Package transliterate provides functionality to convert Unicode text into plain ASCII equivalents.
It takes Unicode characters and replaces them with their closest ASCII representation (e.g., 'é' becomes 'e', 'ü' becomes 'u'). Characters without a known ASCII approximation are generally omitted.
Emoji and other pictographic symbols are omitted from the output as they have no standardized ASCII representation. If you need emoji-to-text conversion, consider using a dedicated emoji processing library.
The transliteration tables are based on those from https://github.com/mozillazg/go-unidecode, modified to include additional characters and improve accuracy.
The package currently supports the Unicode BMP (U+0000-U+FFFF) and selected ranges of the Supplementary Multilingual Plane:
- x1d4: Mathematical Alphanumeric Symbols
- x1d5: Mathematical Alphanumeric Symbols
- x1d6: Mathematical Alphanumeric Symbols
- x1d7: Mathematical Alphanumeric Symbols
- x1f1: Enclosed Alphanumeric Supplement
- x1f6: Transport and Map Symbols + Emoji Symbols
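The labels in the list above (x1d4, x1f6, and so on) read like the high bits of a code point. Tables in the go-unidecode style are commonly split into 256-entry sections keyed by the code point shifted right by 8 bits. The sketch below assumes that convention; the helper names section and offset are hypothetical and are not part of this package.

```go
package main

import "fmt"

// section returns the 256-entry block a rune falls into: the code point
// with its low 8 bits dropped. This layout is an assumption.
func section(r rune) rune { return r >> 8 }

// offset returns the position of the rune inside its block.
func offset(r rune) rune { return r & 0xff }

func main() {
	// U+00E9 (Latin small e with acute), U+1D400 (mathematical bold A),
	// U+1F680 (rocket).
	for _, r := range []rune{0x00E9, 0x1D400, 0x1F680} {
		fmt.Printf("U+%04X -> block x%03x, offset %d\n", r, section(r), offset(r))
	}
}
```

Under this assumption, every code point from U+1D400 through U+1D4FF lands in block x1d4, which is why the Mathematical Alphanumeric Symbols range spans four list entries.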
Thread Safety ¶
String() and WithLimit() are safe for concurrent use. The package maintains an internal cache that is thread-safe and uses a buffer pool for improved performance under high concurrency.
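For readers unfamiliar with buffer pooling, here is a minimal sketch of the general technique using sync.Pool from the standard library. It is not this package's implementation; upperASCII is a hypothetical stand-in for any per-call work that needs scratch space.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool hands out reusable buffers. sync.Pool is safe for concurrent use.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// upperASCII is a hypothetical function that borrows a pooled buffer,
// writes its result into it, and returns the buffer when done.
func upperASCII(s string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset() // a pooled buffer may still hold data from a previous call
	defer bufPool.Put(buf)

	for i := 0; i < len(s); i++ {
		c := s[i]
		if c >= 'a' && c <= 'z' {
			c -= 'a' - 'A'
		}
		buf.WriteByte(c)
	}
	// String copies the bytes, so the result stays valid after Put.
	return buf.String()
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			fmt.Println(n, upperASCII("hello"))
		}(i)
	}
	wg.Wait()
}
```

The pool avoids allocating a fresh buffer on every call, which matters most when many goroutines call the function at once.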
Cache Behavior ¶
By default, the package caches up to 1000 character translations. When the cache becomes full, it is cleared entirely. This approach favors simplicity over granular eviction but may hurt performance for workloads with highly varied character sets.
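The clear-when-full policy can be sketched as follows. This is an illustration of the idea only: the type clearOnFullCache and its methods are hypothetical, and the package's internal cache may differ in structure and locking.

```go
package main

import (
	"fmt"
	"sync"
)

// clearOnFullCache is a hypothetical cache that discards every entry
// once it reaches its maximum size, instead of evicting one at a time.
type clearOnFullCache struct {
	mu  sync.RWMutex
	max int
	m   map[rune]string
}

func newCache(max int) *clearOnFullCache {
	return &clearOnFullCache{max: max, m: make(map[rune]string, max)}
}

func (c *clearOnFullCache) get(r rune) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.m[r]
	return v, ok
}

func (c *clearOnFullCache) put(r rune, v string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, exists := c.m[r]; !exists && len(c.m) >= c.max {
		// Wholesale clear: simple, but every cached entry is lost.
		c.m = make(map[rune]string, c.max)
	}
	c.m[r] = v
}

func (c *clearOnFullCache) size() int {
	c.mu.RLock()
	defer c.mu.RUnlock()
	return len(c.m)
}

func main() {
	c := newCache(2)
	c.put('\u00e9', "e") // é
	c.put('\u00fc', "u") // ü
	c.put('\u00f1', "n") // ñ: the cache is full, so it is cleared first
	_, ok := c.get('\u00e9')
	fmt.Println(c.size(), ok)
}
```

A workload that cycles through slightly more distinct characters than the cache holds would trigger repeated clears, which is the performance caveat noted above.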
Configuration ¶
The package can be configured using the Configure function with options:
Cache size (default 1000 entries):
transliterate.Configure(transliterate.WithMaxCacheSize(5000))
Maximum input length (default 1MB):
transliterate.Configure(transliterate.WithMaxInputLength(1 << 24)) // 16MB
Options can be combined:
transliterate.Configure(
	transliterate.WithMaxCacheSize(5000),
	transliterate.WithMaxInputLength(1 << 24),
)
Configuration should be done early in your application lifecycle, preferably before any calls to String() or WithLimit().
Table Generation ¶
The transliteration tables are generated from a text definition file. See tools/make_tables/README.md and tools/convert_tables/README.md for details on maintaining the tables.
Examples:
transliterate.String("これはひらがなです") // Output: "korehahiraganadesu"
transliterate.String("你好,世界") // Output: "Ni Hao, Shi Jie" (Depends on table)
Index ¶
Constants ¶
const (
	DefaultCacheSize   = 1000
	DefaultMaxInputLen = 1 << 20 // 1MB
)
Default configuration values
Variables ¶
This section is empty.
Functions ¶
func ClearCache ¶
func ClearCache()
ClearCache empties the transliteration cache. This can be useful when memory pressure is high or when preparing for a new batch of translations.
func Configure ¶
func Configure(opts ...CacheOption)
Configure applies the given options to the package configuration.
func GetCacheSize ¶
func GetCacheSize() int
GetCacheSize returns the current size of the transliteration cache. This can be useful for monitoring memory usage.
func GetCacheStats ¶
func GetCacheStats() (hits uint64)
GetCacheStats returns the number of cache hits since the last reset. This can be used to monitor cache effectiveness.
func ResetCacheStats ¶
func ResetCacheStats()
ResetCacheStats zeros out the cache statistics counter. Useful for beginning a new monitoring period.
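A hit counter that can be read and reset safely from multiple goroutines is often built on sync/atomic. The sketch below shows that general pattern. The names hits, recordHit, hitCount, and resetHits are hypothetical, and nothing here is taken from the package's source.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// hits is a hypothetical counter, updated only through atomic operations.
var hits uint64

func recordHit()       { atomic.AddUint64(&hits, 1) }
func hitCount() uint64 { return atomic.LoadUint64(&hits) }
func resetHits()       { atomic.StoreUint64(&hits, 0) }

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			recordHit()
		}()
	}
	wg.Wait()
	fmt.Println("hits:", hitCount())

	resetHits() // begin a new monitoring period
	fmt.Println("hits after reset:", hitCount())
}
```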
func String ¶
String transliterates a Unicode string into its closest ASCII representation. For example, "é" becomes "e". Characters without a known approximation are omitted. Invalid UTF-8 sequences are also omitted.
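To make the omission rules concrete, here is a minimal table-lookup sketch. It assumes ASCII input passes through unchanged and uses a three-entry map in place of the real tables. The function toASCII is a hypothetical stand-in, not the implementation of String.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// table is a tiny illustrative mapping, not the package's real data.
var table = map[rune]string{
	'\u00e9': "e",  // é
	'\u00fc': "u",  // ü
	'\u00df': "ss", // ß
}

// toASCII is a hypothetical stand-in for String.
func toASCII(s string) string {
	var b strings.Builder
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		i += size
		switch {
		case r == utf8.RuneError && size == 1:
			// Invalid UTF-8 byte: omit it.
		case r < utf8.RuneSelf:
			// ASCII passes through unchanged (an assumption of this sketch).
			b.WriteByte(byte(r))
		default:
			// A missing entry yields "", so unknown characters are omitted.
			b.WriteString(table[r])
		}
	}
	return b.String()
}

func main() {
	fmt.Println(toASCII("caf\u00e9"))    // known character replaced
	fmt.Println(toASCII("a\xffb"))       // invalid byte dropped
	fmt.Println(toASCII("x\U0001F680y")) // character with no table entry dropped
}
```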
Types ¶
type CacheOption ¶
type CacheOption func(*config)
CacheOption is a function that modifies the package configuration. Pass one or more to Configure.
func WithMaxCacheSize ¶
func WithMaxCacheSize(size int) CacheOption
WithMaxCacheSize sets the maximum number of entries the cache may hold.
func WithMaxInputLength ¶
func WithMaxInputLength(length int) CacheOption
WithMaxInputLength sets the maximum input string length.
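The signature CacheOption func(*config) follows Go's functional options pattern. The following self-contained sketch shows how such options are typically applied. The config field names, the lowercase helper names, and the choice to start from the documented defaults on every call are all assumptions made for illustration.

```go
package main

import "fmt"

// config holds the settings the options adjust. Field names are assumptions.
type config struct {
	maxCacheSize   int
	maxInputLength int
}

// option plays the role of CacheOption in this sketch.
type option func(*config)

func withMaxCacheSize(n int) option {
	return func(c *config) { c.maxCacheSize = n }
}

func withMaxInputLength(n int) option {
	return func(c *config) { c.maxInputLength = n }
}

// apply starts from the documented defaults (1000 entries, 1 << 20 bytes)
// and runs each option in order, so later options win.
func apply(opts ...option) config {
	c := config{maxCacheSize: 1000, maxInputLength: 1 << 20}
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

func main() {
	fmt.Printf("%+v\n", apply())
	fmt.Printf("%+v\n", apply(withMaxCacheSize(5000), withMaxInputLength(1<<24)))
}
```

Because each option is an ordinary function value, new settings can be added later without changing the signature of the function that accepts them.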