Documentation
Index ¶
- func ConvertRomanNumerals(s string) string
- func ExpandAbbreviations(s string) string
- func ExpandNumberWords(s string) string
- func GenerateBigrams(s string) []string
- func GenerateTrigrams(s string) []string
- func IsBurmese(s string) bool
- func IsKhmer(s string) bool
- func IsLao(s string) bool
- func IsThai(s string) bool
- func JaccardSimilarity(set1, set2 []string) float64
- func NormalizeDotSeparators(s string) string
- func NormalizeOrdinals(s string) string
- func NormalizePunctuation(s string) string
- func NormalizeSymbolsAndSeparators(s string) string
- func NormalizeToWords(input string) []string
- func NormalizeUnicode(s string, ctx *pipelineContext) string
- func NormalizeWidth(s string) string
- func ParseGame(title string) string
- func ParseMovie(title string) string
- func ParseMusic(title string) string
- func ParseTVShow(title string) string
- func ParseWithMediaType(mediaType MediaType, title string) string
- func Slugify(mediaType MediaType, input string) string
- func SplitAndStripArticles(s string) string
- func SplitTitle(title string) (mainTitle, secondaryTitle string, hasSecondary bool)
- func StripEditionAndVersionSuffixes(s string) string
- func StripLeadingArticle(s string) string
- func StripMetadataBrackets(s string) string
- func StripMovieSceneTags(s string) string
- func StripMusicSceneTags(s string) string
- func StripSceneTags(s string) string
- func StripTrailingArticle(s string) string
- type MediaType
- type ScriptType
- type SlugifyResult
Constants ¶
This section is empty.
Variables ¶
This section is empty.
Functions ¶
func ConvertRomanNumerals ¶
ConvertRomanNumerals converts Roman numerals (II-XIX) to Arabic numbers. Note: X is intentionally NOT converted to avoid "Mega Man X" → "Mega Man 10".
Useful for:
- Games: "Final Fantasy VII" → "Final Fantasy 7", "Street Fighter II" → "Street Fighter 2"
- Movies: "Rocky III" → "Rocky 3"
- Music: "Symphony No. IX" → "Symphony No. 9"
Examples:
- "Final Fantasy VII" → "Final Fantasy 7"
- "Street Fighter II" → "Street Fighter 2"
- "Mega Man X" → "Mega Man X" (unchanged - X preserved)
Optimization: Performs case-insensitive matching without full-string case conversions, converting to lowercase directly during output.
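The whole-word replacement described above can be sketched as follows. This is a minimal illustration, not the package's implementation: the lookup table `romanValues` and the function `convertRomanSketch` are hypothetical names, and the sketch splits on whitespace (collapsing repeated spaces) rather than reproducing the optimized matching mentioned above.

```go
package main

import (
	"fmt"
	"strings"
)

// romanValues is a hypothetical lookup table covering II through XIX.
// "x" is deliberately absent so that titles like "Mega Man X" are left alone.
var romanValues = map[string]string{
	"ii": "2", "iii": "3", "iv": "4", "v": "5", "vi": "6", "vii": "7",
	"viii": "8", "ix": "9", "xi": "11", "xii": "12", "xiii": "13",
	"xiv": "14", "xv": "15", "xvi": "16", "xvii": "17", "xviii": "18",
	"xix": "19",
}

// convertRomanSketch replaces whitespace-delimited Roman numerals with
// Arabic digits. Each word is lowercased only for the table lookup, so
// words that are not numerals keep their original case.
func convertRomanSketch(s string) string {
	words := strings.Fields(s)
	for i, w := range words {
		if n, ok := romanValues[strings.ToLower(w)]; ok {
			words[i] = n
		}
	}
	return strings.Join(words, " ")
}

func main() {
	fmt.Println(convertRomanSketch("Final Fantasy VII")) // Final Fantasy 7
	fmt.Println(convertRomanSketch("Mega Man X"))        // Mega Man X
}
```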
func ExpandAbbreviations ¶
ExpandAbbreviations expands common abbreviations found in titles. Uses word boundaries to avoid false matches (e.g., "versus" won't become "versuersus"). Handles two types of abbreviations:
- Period-required: Only expand when period is present (e.g., "feat." but not "feat")
- Flexible: Expand with or without period (e.g., "vs" or "vs.")
Useful for:
- Games: "Super Mario Bros." → "Super Mario Brothers", "Mario vs DK" → "Mario versus DK"
- Music: "Song feat. Artist" → "Song featuring Artist"
- Movies: "Dr. Strangelove" → "Doctor Strangelove"
Examples:
- "Mario vs Donkey Kong" → "Mario versus Donkey Kong"
- "Super Mario Bros." → "Super Mario Brothers"
- "Dr. Mario" → "Doctor Mario"
- "St. Louis Blues" → "Saint Louis Blues"
- "Song feat. Artist" → "Song featuring Artist"
- "A great feat" → "A great feat" (not expanded - no period)
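The difference between period-required and flexible abbreviations can be sketched with two regular expressions. The rule table below is an assumption made for illustration; only "feat." and "vs" are taken from the description above, and `expandAbbrevSketch` is a hypothetical name.

```go
package main

import (
	"fmt"
	"regexp"
)

// abbrevRule pairs a pattern with its expansion (hypothetical type).
type abbrevRule struct {
	pattern     *regexp.Regexp
	replacement string
}

var abbrevRules = []abbrevRule{
	// Period-required: the literal "." must be present, so the noun
	// "feat" in "A great feat" is never touched.
	{regexp.MustCompile(`(?i)\bfeat\.`), "featuring"},
	// Flexible: the trailing "." is optional and consumed when present.
	{regexp.MustCompile(`(?i)\bvs\b\.?`), "versus"},
}

// expandAbbrevSketch applies each rule in turn. The \b word boundaries
// keep the patterns from firing inside longer words.
func expandAbbrevSketch(s string) string {
	for _, r := range abbrevRules {
		s = r.pattern.ReplaceAllString(s, r.replacement)
	}
	return s
}

func main() {
	fmt.Println(expandAbbrevSketch("Song feat. Artist")) // Song featuring Artist
	fmt.Println(expandAbbrevSketch("A great feat"))      // A great feat
	fmt.Println(expandAbbrevSketch("Mario vs. DK"))      // Mario versus DK
}
```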
func ExpandNumberWords ¶
ExpandNumberWords expands number words (one, two, three, etc.) to their numeric forms. Handles the words for 1-20, with or without a trailing period:
- "one" or "one." → "1"
- "twenty" or "twenty." → "20"
Useful for:
- Games: "Street Fighter Two" → "Street Fighter 2"
- Movies: "Ocean's Eleven" → "Ocean's 11"
- TV: "Chapter One" → "Chapter 1"
Examples:
- "Game One" → "Game 1"
- "Part Two" → "Part 2"
- "Street Fighter Two" → "Street Fighter 2"
func GenerateBigrams ¶
GenerateBigrams creates overlapping 2-character chunks from a string. This is used for matching scripts that don't have word boundaries (Thai, Burmese, Khmer, Lao).
Example:
GenerateBigrams("เพลงไทย") → ["เพ", "พล", "ลง", "งไ", "ไท", "ทย"]
For strings shorter than 2 characters, returns the original string as a single-element slice.
func GenerateTrigrams ¶
GenerateTrigrams creates overlapping 3-character chunks from a string. This is an alternative to bigrams that may provide better accuracy for longer queries.
Example:
GenerateTrigrams("เพลงไทย") → ["เพล", "พลง", "ลงไ", "งไท", "ไทย"]
For strings shorter than 3 characters, falls back to bigrams or returns the original string.
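Both bigram and trigram generation are instances of the same sliding window over characters. The sketch below is a generic version under stated assumptions: `ngramsSketch` is a hypothetical helper, it works on runes (Unicode code points) rather than grapheme clusters, and it returns the input as a single-element slice when the input is shorter than n, which may differ from the trigram fallback described above.

```go
package main

import "fmt"

// ngramsSketch returns the overlapping n-rune chunks of s. Converting to
// []rune first means multi-byte scripts such as Thai are split per code
// point instead of per byte.
func ngramsSketch(s string, n int) []string {
	runes := []rune(s)
	if n <= 0 || len(runes) < n {
		// Too short to form a full window: return the input as-is.
		return []string{s}
	}
	out := make([]string, 0, len(runes)-n+1)
	for i := 0; i+n <= len(runes); i++ {
		out = append(out, string(runes[i:i+n]))
	}
	return out
}

func main() {
	fmt.Println(ngramsSketch("เพลงไทย", 2)) // 6 bigrams
	fmt.Println(ngramsSketch("เพลงไทย", 3)) // 5 trigrams
}
```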
func IsThai ¶
IsThai returns true if the string contains Thai characters. This is a convenience function for the resolution workflow.
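A script check of this kind can be built on the standard library's Unicode script tables. The sketch below is an assumption about how such a check might look, not the package's code; `containsScriptSketch`, `isThaiSketch`, and `isLaoSketch` are hypothetical names.

```go
package main

import (
	"fmt"
	"unicode"
)

// containsScriptSketch reports whether any rune of s belongs to the
// given Unicode script table (for example unicode.Thai or unicode.Lao).
func containsScriptSketch(s string, table *unicode.RangeTable) bool {
	for _, r := range s {
		if unicode.Is(table, r) {
			return true
		}
	}
	return false
}

// isThaiSketch and isLaoSketch are thin convenience wrappers.
func isThaiSketch(s string) bool { return containsScriptSketch(s, unicode.Thai) }
func isLaoSketch(s string) bool  { return containsScriptSketch(s, unicode.Lao) }

func main() {
	fmt.Println(isThaiSketch("เพลงไทย")) // true
	fmt.Println(isThaiSketch("Zelda"))   // false
	fmt.Println(isLaoSketch("ລາວ"))      // true
}
```

The same helper would cover Khmer and Burmese via `unicode.Khmer` and `unicode.Myanmar`.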
func JaccardSimilarity ¶
JaccardSimilarity computes the Jaccard similarity coefficient between two sets of strings. This is defined as the size of the intersection divided by the size of the union.
Returns a value between 0.0 (no overlap) and 1.0 (identical sets).
Example:
set1 := []string{"a", "b", "c"}
set2 := []string{"b", "c", "d"}
similarity := JaccardSimilarity(set1, set2) // Returns 0.5 (2 common / 4 total)
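The intersection-over-union definition can be sketched directly with maps used as sets. `jaccardSketch` is a hypothetical name, and returning 0 when both inputs are empty is an assumption; the documented function may treat that case differently.

```go
package main

import "fmt"

// toSet deduplicates a slice into a map-backed set.
func toSet(items []string) map[string]struct{} {
	set := make(map[string]struct{}, len(items))
	for _, it := range items {
		set[it] = struct{}{}
	}
	return set
}

// jaccardSketch computes |A ∩ B| / |A ∪ B| for two string slices.
func jaccardSketch(a, b []string) float64 {
	setA, setB := toSet(a), toSet(b)
	if len(setA) == 0 && len(setB) == 0 {
		return 0 // assumption: two empty sets have no measurable overlap
	}
	inter := 0
	for x := range setA {
		if _, ok := setB[x]; ok {
			inter++
		}
	}
	// |A ∪ B| = |A| + |B| - |A ∩ B|
	union := len(setA) + len(setB) - inter
	return float64(inter) / float64(union)
}

func main() {
	fmt.Println(jaccardSketch([]string{"a", "b", "c"}, []string{"b", "c", "d"})) // 0.5
}
```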
func NormalizeDotSeparators ¶
NormalizeDotSeparators converts dot separators, commonly used in scene release filenames, to spaces. Scene releases typically use dots to separate words, as in "Show.Name.S01E02.mkv". This function converts those dots to spaces for better normalization.
Note: Preserves dots in:
- Dates (e.g., "2024.01.15" stays as-is for date parsing)
- Episode markers like "S01.E02" (preserved for episode format normalization)
Note: Does NOT preserve generic numeric decimals (e.g., "5.1" → "5 1"). However, known scene tags like "DD5.1", "AAC2.0", "H.264" are stripped by StripSceneTags() before this function runs, so they never reach here.
Useful for:
- TV shows: "Show.Name.S01E02" → "Show Name S01E02"
- Movies: "Movie.Name.2024" → "Movie Name 2024"
Examples:
- "Breaking.Bad.S01E02" → "Breaking Bad S01E02"
- "Attack.on.Titan.1x02" → "Attack on Titan 1x02"
- "Show.Episode.Title" → "Show Episode Title"
- "Show.2024.01.15" → "Show 2024.01.15" (date preserved)
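One way to convert dots while preserving dates and "S01.E02" markers is to mark the protected spans first and replace only the dots outside them. The sketch below assumes a simple YYYY.MM.DD date shape and uses hypothetical names (`protectedRe`, `dotsToSpacesSketch`); file extensions and other edge cases are not handled.

```go
package main

import (
	"fmt"
	"regexp"
)

// protectedRe matches spans whose dots must survive: YYYY.MM.DD dates
// and season/episode markers written as S01.E02 (assumed shapes).
var protectedRe = regexp.MustCompile(`(?i)\b\d{4}\.\d{2}\.\d{2}\b|\bS\d+\.E\d+\b`)

// dotsToSpacesSketch replaces every "." outside a protected span with a
// space. Working on bytes is safe because "." and " " are single-byte
// ASCII and never occur inside a multi-byte UTF-8 sequence.
func dotsToSpacesSketch(s string) string {
	b := []byte(s)
	keep := make([]bool, len(b))
	for _, span := range protectedRe.FindAllIndex(b, -1) {
		for i := span[0]; i < span[1]; i++ {
			keep[i] = true
		}
	}
	for i, c := range b {
		if c == '.' && !keep[i] {
			b[i] = ' '
		}
	}
	return string(b)
}

func main() {
	fmt.Println(dotsToSpacesSketch("Breaking.Bad.S01E02")) // Breaking Bad S01E02
	fmt.Println(dotsToSpacesSketch("Show.2024.01.15"))     // Show 2024.01.15
}
```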
func NormalizeOrdinals ¶
NormalizeOrdinals removes ordinal suffixes from numbers. This allows "2nd" and "II" to both normalize to "2" for consistent matching.
Useful for:
- Games: "Sonic the Hedgehog 2nd" → "Sonic the Hedgehog 2"
- Movies: "21st Century" → "21 Century"
Examples:
- "Street Fighter 2nd Impact" → "Street Fighter 2 Impact"
- "21st Century" → "21 Century"
- "3rd Strike" → "3 Strike"
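Ordinal stripping maps naturally onto a single regular expression with a capture group for the digits. This is a sketch with hypothetical names (`ordinalRe`, `stripOrdinalsSketch`); the documented function may use a different matching strategy.

```go
package main

import (
	"fmt"
	"regexp"
)

// ordinalRe captures a run of digits followed by an English ordinal
// suffix. The \b anchors keep it from matching inside longer tokens.
var ordinalRe = regexp.MustCompile(`(?i)\b(\d+)(?:st|nd|rd|th)\b`)

// stripOrdinalsSketch keeps the digits and drops the suffix.
func stripOrdinalsSketch(s string) string {
	return ordinalRe.ReplaceAllString(s, "${1}")
}

func main() {
	fmt.Println(stripOrdinalsSketch("Street Fighter 2nd Impact")) // Street Fighter 2 Impact
	fmt.Println(stripOrdinalsSketch("21st Century"))              // 21 Century
}
```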
func NormalizePunctuation ¶
NormalizePunctuation normalizes Unicode punctuation variants to their ASCII equivalents. This ensures consistent behavior across all pipeline stages, particularly for:
- Conjunction detection (" 'n' " patterns in Stage 7)
- Separator normalization (dash handling in Stage 7)
- Abbreviation expansion (word boundary detection in Stage 9)
Normalized characters:
- Curly quotes: ‘ ’ “ ” → ' "
- Prime marks: ′ ″ → ' "
- Grave/acute: ` ´ → '
- Dashes: – — ― − → -
- Ellipsis: … → ...
Examples:
- "Link’s Awakening" → "Link's Awakening" (curly apostrophe → straight)
- "Super–Bros." → "Super-Bros." (en dash → hyphen, enables "Bros" expansion)
- "Rock ‘n’ Roll" → "Rock 'n' Roll" (curly quotes → straight, enables conjunction)
This is Stage 2 of the normalization pipeline (character-level normalization). Must be called BEFORE Stage 3 (Unicode normalization) and Stage 7 (symbol/separator processing).
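The character table above translates directly into a `strings.NewReplacer`. The sketch below is illustrative only; `punctReplacer` and `normalizePunctSketch` are hypothetical names, and the code points are written as escapes so the mapping is unambiguous.

```go
package main

import (
	"fmt"
	"strings"
)

// punctReplacer maps each Unicode punctuation variant listed above to
// its ASCII equivalent.
var punctReplacer = strings.NewReplacer(
	"\u2018", "'", "\u2019", "'", // curly single quotes
	"\u201C", `"`, "\u201D", `"`, // curly double quotes
	"\u2032", "'", "\u2033", `"`, // prime and double prime
	"`", "'", "\u00B4", "'", // grave and acute accents
	"\u2013", "-", "\u2014", "-", // en dash, em dash
	"\u2015", "-", "\u2212", "-", // horizontal bar, minus sign
	"\u2026", "...", // ellipsis
)

// normalizePunctSketch applies the table in a single pass.
func normalizePunctSketch(s string) string {
	return punctReplacer.Replace(s)
}

func main() {
	fmt.Println(normalizePunctSketch("Link\u2019s Awakening")) // Link's Awakening
	fmt.Println(normalizePunctSketch("Super\u2013Bros."))      // Super-Bros.
}
```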
func NormalizeSymbolsAndSeparators ¶
NormalizeSymbolsAndSeparators converts conjunctions and separators to normalized forms.
- Conjunctions: "&", " + ", " 'n' " variants → "and"
- Plus symbol: "+" → "plus"
- Separators: ":", "_", "-", "/", "\", ",", ";" → space
Note: Period "." is NOT converted here; it is handled after abbreviation expansion.
Examples:
- "Sonic & Knuckles" → "Sonic and Knuckles"
- "Rock + Roll Racing" → "Rock and Roll Racing"
- "Game+" → "Game plus"
- "Zelda:Link" → "Zelda Link"
- "Super_Mario_Bros" → "Super Mario Bros"
- "Game/Part\One" → "Game Part One"
This is Stage 7 of the normalization pipeline.
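The ordering matters here: the spaced " + " conjunction has to be tried before the bare "+" symbol. `strings.NewReplacer` compares its patterns in argument order, which makes it a convenient way to sketch the rules. The names below are hypothetical, and collapsing the resulting whitespace with `strings.Fields` is an assumption about the final cleanup.

```go
package main

import (
	"fmt"
	"strings"
)

// symbolReplacer lists conjunction patterns before the bare "+" so that
// "Rock + Roll" becomes "and" while "Game+" becomes "plus".
var symbolReplacer = strings.NewReplacer(
	" + ", " and ",
	" 'n' ", " and ",
	" 'N' ", " and ",
	"&", " and ",
	"+", " plus ",
	":", " ", "_", " ", "-", " ", "/", " ", `\`, " ", ",", " ", ";", " ",
)

// normalizeSymbolsSketch applies the table, then collapses runs of
// whitespace and trims the ends.
func normalizeSymbolsSketch(s string) string {
	return strings.Join(strings.Fields(symbolReplacer.Replace(s)), " ")
}

func main() {
	fmt.Println(normalizeSymbolsSketch("Sonic & Knuckles"))   // Sonic and Knuckles
	fmt.Println(normalizeSymbolsSketch("Rock + Roll Racing")) // Rock and Roll Racing
	fmt.Println(normalizeSymbolsSketch("Game+"))              // Game plus
}
```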
func NormalizeToWords ¶
NormalizeToWords converts a game title to a normalized form with preserved word boundaries. This function applies game-specific parsing followed by universal normalization, then returns word tokens for scoring and ranking operations.
The result preserves spaces between words, enabling word-level operations like:
- Token-based similarity matching
- Word sequence validation
- Sequel suffix detection
- Weighted word scoring
Example:
NormalizeToWords("The Legend of Zelda: Ocarina of Time (USA)")
→ "legend of zelda ocarina of time"
→ []string{"legend", "of", "zelda", "ocarina", "of", "time"}
Note: For database queries and slug matching, use Slugify() instead. This function is for scoring and ranking operations only.
func NormalizeUnicode ¶
NormalizeUnicode performs Unicode normalization with symbol removal and script-aware processing. This combines several operations:
- Removes Unicode symbols (trademark ™, copyright ©, currency $€¥)
- Applies script-specific normalization (NFKC for Latin, NFC for CJK, etc.)
- Removes diacritics for Latin scripts (Pokémon → Pokemon)
- Preserves essential marks for CJK scripts
Examples:
- Symbols: "Sonic™" → "Sonic", "Game©" → "Game"
- Diacritics (Latin): "Pokémon" → "Pokemon", "Café" → "Cafe"
- Ligatures: "ﬁnal" → "final"
- CJK preserved: "ドラゴンクエスト" → "ドラゴンクエスト"
This is Stage 3 of the normalization pipeline. Returns the input unchanged if normalization fails or if input is pure ASCII.
The optional ctx parameter enables caching optimizations during pipeline processing. When ctx is nil, caching is skipped (useful for standalone calls or tests). When ctx is provided, ASCII check and script detection results are cached for reuse.
func NormalizeWidth ¶
NormalizeWidth performs width normalization on a string. Converts fullwidth ASCII characters to halfwidth (for Latin text processing). Converts halfwidth CJK characters to fullwidth (for consistent display and matching).
Examples:
- Fullwidth ASCII: "ＡＢＣＤＥＦ" → "ABCDEF"
- Fullwidth numbers: "１２３" → "123"
- Halfwidth katakana: "ｳｴｯｼﾞ" → "ウエッジ"
- Mixed: "Super Mario １２３" → "Super Mario 123"
This is Stage 1 of the normalization pipeline. Returns the input unchanged if normalization fails.
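The fullwidth-to-halfwidth half of this stage can be sketched with the standard library alone, because the fullwidth ASCII block (U+FF01 to U+FF5E) sits at a fixed offset of 0xFEE0 from its ASCII counterparts. `foldFullwidthASCIISketch` is a hypothetical name; the halfwidth katakana direction is not covered, since it needs composition of voiced marks that a package such as golang.org/x/text/width would normally provide.

```go
package main

import (
	"fmt"
	"strings"
)

// foldFullwidthASCIISketch maps fullwidth ASCII variants (U+FF01..U+FF5E)
// to their ordinary ASCII forms and leaves every other rune untouched.
func foldFullwidthASCIISketch(s string) string {
	return strings.Map(func(r rune) rune {
		if r >= 0xFF01 && r <= 0xFF5E {
			return r - 0xFEE0
		}
		return r
	}, s)
}

func main() {
	fmt.Println(foldFullwidthASCIISketch("\uFF21\uFF22\uFF23"))             // ABC
	fmt.Println(foldFullwidthASCIISketch("Super Mario \uFF11\uFF12\uFF13")) // Super Mario 123
}
```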
func ParseGame ¶
ParseGame normalizes game titles by applying game-specific transformations. This handles common game title patterns and variations to ensure consistent matching.
Transformations applied (in order):
- Split titles and strip articles: "The Zelda: Link's Awakening" → "Zelda Link's Awakening"
- Strip trailing articles: "Legend, The" → "Legend"
- Strip metadata brackets: (USA), [!], {Europe}, <Beta> → removed
- Strip edition/version suffixes: "Edition", "Version", v1.0 → removed
- Normalize separators: Convert periods to spaces (for abbreviation matching)
- Expand abbreviations: "Bros" → "brothers", "vs" → "versus", "Dr" → "doctor"
- Expand number words: "one" → "1", "two" → "2"
- Normalize ordinals: "1st" → "1", "2nd" → "2"
- Convert roman numerals: "VII" → "7", "II" → "2" (preserves "X" for games like Mega Man X)
Examples:
- "Super Mario Bros. III (USA) [!]" → "super mario brothers 3"
- "Street Fighter II Version" → "street fighter 2"
- "Mega Man X" → "mega man x" (X preserved)
- "Final Fantasy VII" → "final fantasy 7"
func ParseMovie ¶
ParseMovie normalizes movie titles to a canonical format. Handles scene release tags, edition suffix stripping, and article stripping. Years are stripped from the slug (like games) and extracted as tags by the tag parser.
Transformations applied (in order):
- Width normalization: Convert fullwidth characters to ASCII
- Scene tag stripping: Remove quality, codec, source, HDR, 3D tags
- Scene group stripping: Remove trailing release group tags (-GROUP)
- Dot normalization: Convert scene release dots to spaces
- Edition suffix stripping: Remove "Edition", "Version", "Cut", "Release" suffixes (preserves qualifiers like "Director's", "Extended", "Theatrical")
- Bracket stripping: Remove metadata brackets including years (extracted as tags)
- Split titles and strip articles: "The Movie: Subtitle" → "Movie Subtitle"
- Strip trailing articles: "Movie, The" → "Movie"
Supported formats:
- Standard: "Movie Name (2024)"
- Scene: "Movie.Name.2024.1080p.BluRay.x264-GROUP"
- With edition: "Movie Name (2024) Director's Cut Edition" → "Movie Name Director's"
- With ID: "Movie Name (2024) {imdb-tt1234567}"
Examples:
- "The.Matrix.1999.1080p.BluRay.x264.DTS-WAF" → "Matrix 1999"
- "Blade Runner (1982) Director's Cut" → "Blade Runner Director's"
- "Avatar.2009.Extended.Edition.1080p" → "Avatar 2009 Extended"
- "The Dark Knight (2008)" → "Dark Knight"
- "Lord of the Rings (2001) Extended Edition" → "Lord of the Rings Extended"
- "Movie, The (2024)" → "Movie"
Note: Years like (1999) are extracted as tags (year:1999) by the tag parser, allowing users to filter by year when needed: launch.title Movie/Matrix (+year:1999)
TODO: Scene releases use bare years without parentheses (Movie.Name.1999.1080p), but we can't safely strip them without breaking movies with years in their titles (e.g., "2001: A Space Odyssey", "1917", "1984"). For now, we only strip years in parentheses/brackets. This means scene releases will include the year in the slug (e.g., "Matrix 1999" vs "Matrix" from standard naming). Cross-format matching happens at the Slugify level where lowercasing provides some normalization.
func ParseMusic ¶
ParseMusic normalizes music album titles to a canonical format. This is a CONSERVATIVE implementation that focuses on cleaning scene release tags while preserving artist names for uniqueness.
Transformations applied (in order):
- Width normalization: Convert fullwidth characters to ASCII
- Scene tag stripping: Remove format, quality, source tags and release group
- Separator normalization: Convert dots, underscores, and dashes to spaces
- Bracket stripping: Remove metadata brackets including years (extracted as tags)
- Disc number stripping: Remove CD1, CD2, Disc 1, etc.
- Split titles and strip articles: "The Album: Subtitle" → "Album Subtitle"
- Strip trailing articles: "Album, The" → "Album"
Supported formats:
- Scene release: "Artist-Album-2024-CD-FLAC-GROUP" → "Artist Album 2024"
- User-friendly: "Artist - Album (2024)" → "Artist Album"
- With quality: "Artist - Album (2024) [FLAC 24bit]" → "Artist Album"
- With disc: "Artist - Album CD1" → "Artist Album"
Examples:
- "Pink.Floyd-The.Wall-1979-CD-FLAC-GROUP" → "Pink Floyd Wall 1979"
- "The Beatles - Abbey Road (1969)" → "Beatles Abbey Road"
- "VA - Best of 2024 [FLAC]" → "VA Best of 2024"
- "Miles Davis - Kind of Blue (1959)" → "Miles Davis Kind of Blue"
Note: Years in parentheses/brackets are extracted as tags (year:1997) by the tag parser. Bare years (from scene releases) are kept in the slug.
Design note: This implementation intentionally keeps artist names to preserve uniqueness. Many albums share the same title across different artists ("IV", "Nevermind", etc.). More sophisticated artist/album extraction can be added later if needed.
func ParseTVShow ¶
ParseTVShow normalizes TV show titles to a canonical format. Handles various episode number formats, scene release tags, and reorders components.
Transformations applied (in order):
- Width normalization: Convert fullwidth characters to ASCII
- Scene tag stripping: Remove quality, codec, source tags (1080p, x264, BluRay, etc.)
- Dot normalization: Convert scene release dots to spaces
- Split titles and strip articles: "The Show: Episode Title" → "Show Episode Title"
- Strip trailing articles: "Show, The" → "Show"
- Strip metadata brackets: [720p], (extended), etc. → removed
- Normalize episode formats: S01E02, 1x02, dates, absolute → canonical formats
- Component reordering: Place episode marker in consistent position
Supported episode formats:
- Season-based: S01E02, s01e02, 1x02, S01.E02, S01_E02, 102 (multi-episode supported)
- Date-based: YYYY-MM-DD, DD-MM-YYYY, various separators (-, ., /)
- Absolute: Episode 001, Ep 42, E001, #001 (anime)
- Various delimiter variations (-, ., _, space)
Examples:
- "Breaking.Bad.S01E02.1080p.BluRay.x264-GROUP" → "Breaking Bad s01e02"
- "Show - S01E02 [720p]" → "Show s01e02"
- "S01E02 - Show - Episode Title" → "Show s01e02 Episode Title"
- "Attack on Titan - 1x02 - Title" → "Attack on Titan s01e02 Title"
- "Daily Show - 2024-01-15" → "Daily Show 2024-01-15"
- "One Piece - Episode 001" → "One Piece e001"
func ParseWithMediaType ¶
ParseWithMediaType is the entry point for media-type-aware parsing. It delegates to the appropriate parser based on media type. Each parser applies media-specific normalization BEFORE the universal pipeline.
Media-specific parsers are implemented in separate files:
- ParseTVShow → media_parsing_tv.go
- ParseGame → media_parsing_game.go
- ParseMovie, ParseMusic, etc. → TODO (return unchanged for now)
func Slugify ¶
Slugify applies media-type-aware parsing before slugification. It normalizes media titles based on their type (TV shows, movies, music, etc.) to ensure consistent matching across different format variations.
Media type should be a string matching one of the MediaType constants from systemdefs: "TVShow", "Movie", "Music", "Audio", "Video", "Game", "Image", "Application"
For TV shows, this normalizes episode markers:
"Show - S01E02 - Title" and "Show - 1x02 - Title" both normalize to the same slug
For other media types, parsing is applied based on the type, or the title passes through to the standard slugification pipeline.
Example:
Slugify(MediaTypeTVShow, "Breaking Bad - S01E02 - Gray Matter") → same as Slugify(MediaTypeTVShow, "Breaking Bad - 1x02 - Gray Matter")
func SplitAndStripArticles ¶
SplitAndStripArticles splits a title into main and secondary parts, then strips leading articles from both. This combines title splitting and article removal into a single operation.
Delimiter priority (highest to lowest): ":", " - ", "'s "
Note: For the "'s " delimiter, the "'s" is retained in the main title.
Examples:
- "The Legend of Zelda: Link's Awakening" → "Legend of Zelda Link's Awakening"
- "The Game - A Subtitle" → "Game Subtitle"
- "Mario's Adventure" → "Mario's Adventure" (no leading article)
This function is shared by all media parsers to ensure consistent article handling.
func SplitTitle ¶
SplitTitle splits a title into main and secondary parts based on common delimiters. This is a public API function used by other packages for metadata processing.
Delimiter priority (highest to lowest): ":", " - ", "'s "
Note: For the "'s " delimiter, the "'s" is retained in the main title.
Returns:
- mainTitle: The primary part of the title
- secondaryTitle: The secondary part (subtitle)
- hasSecondary: Whether a secondary title was found
Examples:
- "The Legend of Zelda: Link's Awakening" → ("The Legend of Zelda", "Link's Awakening", true)
- "Super Mario Bros." → ("Super Mario Bros.", "", false)
- "Game - Subtitle" → ("Game", "Subtitle", true)
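The delimiter priority can be sketched as an ordered loop over candidate separators, where the first one found wins. `splitTitleSketch` is a hypothetical name, and rejecting a split that leaves either side empty is an assumption made for this illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// splitTitleSketch tries ":", then " - ", then "'s " and splits at the
// first delimiter that is present. For "'s " the possessive stays with
// the main title (keep = 2 bytes of the delimiter).
func splitTitleSketch(title string) (string, string, bool) {
	delims := []struct {
		sep  string
		keep int
	}{{":", 0}, {" - ", 0}, {"'s ", 2}}
	for _, d := range delims {
		i := strings.Index(title, d.sep)
		if i < 0 {
			continue
		}
		mainPart := strings.TrimSpace(title[:i+d.keep])
		secondary := strings.TrimSpace(title[i+len(d.sep):])
		if mainPart != "" && secondary != "" {
			return mainPart, secondary, true
		}
	}
	return title, "", false
}

func main() {
	fmt.Println(splitTitleSketch("The Legend of Zelda: Link's Awakening"))
	fmt.Println(splitTitleSketch("Mario's Adventure"))
	fmt.Println(splitTitleSketch("Super Mario Bros."))
}
```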
func StripEditionAndVersionSuffixes ¶
StripEditionAndVersionSuffixes removes edition/version words and version numbers from titles. Strips standalone words ("version", "edition") and their multi-language equivalents. Does NOT strip semantic edition markers like "Special", "Ultimate", "Remastered" - these represent different products and users may want to target them specifically.
Useful for:
- Games: "Pokemon Red Version" → "Pokemon Red"
- Applications: "Photoshop v2024" → "Photoshop"
- Movies: "Blade Runner Director's Cut Edition" → "Blade Runner Director's Cut"
Supported languages:
- English: version, edition
- German: ausgabe (edition)
- Italian: versione, edizione
- Portuguese: versao, edicao (after diacritic normalization)
- Japanese: バージョン (version), エディション (edition), ヴァージョン (version alt.)
Examples:
- "Pokemon Red Version" → "Pokemon Red"
- "Game Edition" → "Game"
- "Super Mario Edition" → "Super Mario"
- "ドラゴンクエストバージョン" → "ドラゴンクエスト" (CJK)
- "Game Special Edition" → "Game Special" (Edition stripped, Special kept)
func StripLeadingArticle ¶
StripLeadingArticle removes leading articles ("The", "A", "An") from a string. This is a utility function used by both slug normalization and word-level matching. It preserves the original case of non-article portions.
Examples:
- "The Legend of Zelda" → "Legend of Zelda"
- "A New Hope" → "New Hope"
- "An American Tail" → "American Tail"
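A case-insensitive prefix check is enough to sketch this behavior while leaving the rest of the string's case untouched. `stripLeadingArticleSketch` is a hypothetical name, and keeping the input unchanged when nothing would remain after the article is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// stripLeadingArticleSketch removes a leading "The ", "An ", or "A "
// (any case). The trailing space in each article prevents matches
// inside words such as "Anthem" or "Theory".
func stripLeadingArticleSketch(s string) string {
	trimmed := strings.TrimLeft(s, " ")
	for _, a := range []string{"the ", "an ", "a "} {
		if len(trimmed) >= len(a) && strings.EqualFold(trimmed[:len(a)], a) {
			if rest := strings.TrimLeft(trimmed[len(a):], " "); rest != "" {
				return rest
			}
		}
	}
	return s
}

func main() {
	fmt.Println(stripLeadingArticleSketch("The Legend of Zelda")) // Legend of Zelda
	fmt.Println(stripLeadingArticleSketch("An American Tail"))    // American Tail
	fmt.Println(stripLeadingArticleSketch("Anthem"))              // Anthem
}
```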
func StripMetadataBrackets ¶
StripMetadataBrackets removes all bracket types (parentheses, square brackets, braces, angle brackets) from a string. Commonly used to clean metadata like region codes, dump info, and tags.
Useful for:
- Games: "Sonic (USA) [!]" → "Sonic"
- Movies: "Movie (2024) [Remastered]" → "Movie" (all brackets are removed, including the year, which the tag parser extracts separately)
- TV shows: "Show - S01E02 [720p]" → "Show - S01E02"
Examples:
- "Game (USA) [!]" → "Game"
- "Title {Europe} <Beta>" → "Title"
- "Game ((nested)) [test]" → "Game"
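Nested brackets such as "((nested))" are easiest to handle with a depth counter rather than a regular expression. The sketch below is a simplification with a hypothetical name (`stripBracketsSketch`): it does not check that opening and closing bracket types match, it drops stray closing brackets, and an unclosed opening bracket swallows the rest of the string.

```go
package main

import (
	"fmt"
	"strings"
)

// stripBracketsSketch drops everything inside (), [], {}, and <>,
// tracking nesting depth so inner brackets are removed with their
// parents. Leftover whitespace is collapsed at the end.
func stripBracketsSketch(s string) string {
	var b strings.Builder
	depth := 0
	for _, r := range s {
		switch r {
		case '(', '[', '{', '<':
			depth++
		case ')', ']', '}', '>':
			if depth > 0 {
				depth--
			}
		default:
			if depth == 0 {
				b.WriteRune(r)
			}
		}
	}
	return strings.Join(strings.Fields(b.String()), " ")
}

func main() {
	fmt.Println(stripBracketsSketch("Game (USA) [!]"))         // Game
	fmt.Println(stripBracketsSketch("Game ((nested)) [test]")) // Game
}
```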
func StripMovieSceneTags ¶
StripMovieSceneTags removes scene release tags specific to movies. Unlike the shared StripSceneTags(), this function preserves edition qualifiers (Extended, Unrated, Director's Cut, Remastered), which identify different movie editions.
Removed tags include:
- Quality: 480p, 720p, 1080p, 2160p, 4K, 8K, UHD, HD, SD
- Source: BluRay, WEB-DL, HDTV, DVDRip, Remux, etc.
- Codec: x264, x265, H.264, H.265, HEVC, XviD, AVC, VC-1, 10bit, 8bit
- Audio: AC3, AAC, DTS, DD5.1, DD7.1, Atmos, TrueHD, etc.
- HDR: HDR, HDR10, HDR10+, Dolby Vision, HLG
- 3D: 3D, HSBS, HOU, Half-SBS, Half-OU
- Tags: PROPER, REPACK, INTERNAL, LIMITED, MULTI, KORSUB (but NOT Extended, Unrated, etc.)
- Group: -GROUP at end
Preserved edition qualifiers:
- Extended, Unrated, Director's Cut, Remastered (these identify different editions)
Examples:
- "Movie.2024.2160p.WEB-DL.DV.HDR10.HEVC-GROUP" → "Movie 2024"
- "Avatar.2009.Extended.3D.HSBS.1080p.BluRay" → "Avatar 2009 Extended"
- "Film.2020.Unrated.1080p.BluRay.x264.DTS" → "Film 2020 Unrated"
func StripMusicSceneTags ¶
StripMusicSceneTags removes scene release tags specific to music. Like StripMovieSceneTags, it preserves edition qualifiers, though the preserved set differs (Remastered, Deluxe, etc.), because these identify different album editions.
Removed tags include:
- Format: FLAC, MP3, AAC, ALAC, APE, WAV, OGG, WMA, M4A, OPUS
- Quality: V0, V2, 320, 192, 256, CBR, VBR, LAME, 24bit, 96kHz, etc.
- Source: CD, WEB, Vinyl, SACD, DVD, Blu-ray, DAT, Cassette
- Disc numbers: CD1, CD2, Disc1, Disc2
- Group: -GROUP at end
Preserved edition qualifiers:
- Remastered, Deluxe, Limited, Expanded, Anniversary, Bonus, Special
Examples:
- "Artist-Album-2024-CD-FLAC-V0-GROUP" → "Artist-Album 2024"
- "Album.Title.1979.Vinyl.FLAC.24bit.96kHz" → "Album Title 1979"
- "Album.2020.Remastered.WEB.FLAC" → "Album 2020 Remastered"
func StripSceneTags ¶
StripSceneTags removes scene release tags commonly found in TV show filenames. Scene releases use specific tags to indicate quality, source, codec, audio, and release group. This function strips all such tags to normalize titles for matching.
Removed tags include:
- Quality: 480p, 720p, 1080p, 2160p, 4K, HD, SD, UHD
- Source: BluRay, BDRip, BRRip, WEBRip, WEB-DL, HDTV, DVDRip, etc.
- Codec: x264, x265, H.264, H.265, HEVC, XviD, AVC, 10bit, 8bit
- Audio: AC3, AAC, DTS, DD5.1, DD7.1, Atmos, TrueHD, etc.
- Other: PROPER, REPACK, INTERNAL, LIMITED, EXTENDED, UNRATED, Director's Cut, etc.
- Group: Trailing release group tag (e.g., "-GROUP")
Useful for:
- TV shows: "Show.Name.S01E02.1080p.BluRay.x264-GROUP" → "Show Name S01E02"
- Movies: "Movie.Name.2024.720p.WEB-DL.AAC2.0.H.264-RELEASE" → "Movie Name 2024"
Examples:
- "Breaking.Bad.S01E02.1080p.BluRay.x264-GROUP" → "Breaking Bad S01E02"
- "Show.S01E02.720p.WEB-DL.AAC2.0.H.264" → "Show S01E02"
- "Episode.4K.HDR.Atmos.PROPER" → "Episode"
func StripTrailingArticle ¶
StripTrailingArticle removes trailing articles like ", The" from the end of a string.
Pattern: `, The` followed by end of string or separator characters (space, colon, dash, parenthesis, bracket)
Examples:
- "Legend, The" → "Legend"
- "Mega Man, The" → "Mega Man"
- "Story, the:" → "Story:" (case insensitive)
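Go's regexp package has no lookahead, so a sketch of the pattern above captures the character that follows ", The" and writes it back. The names `trailingTheRe` and `stripTrailingArticleSketch` are hypothetical, only opening parentheses and brackets are treated as separators, and the sketch applies the pattern wherever it occurs in the string.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// trailingTheRe matches ", the" (any case) followed by end of string or
// a separator. The follower is captured so it can be restored.
var trailingTheRe = regexp.MustCompile(`(?i),\s*the(\s*$|[\s:\-(\[])`)

// stripTrailingArticleSketch removes the article, keeps the separator,
// and trims any trailing spaces left behind.
func stripTrailingArticleSketch(s string) string {
	return strings.TrimRight(trailingTheRe.ReplaceAllString(s, "${1}"), " ")
}

func main() {
	fmt.Println(stripTrailingArticleSketch("Legend, The"))    // Legend
	fmt.Println(stripTrailingArticleSketch("Story, the:"))    // Story:
	fmt.Println(stripTrailingArticleSketch("Legend, Theory")) // Legend, Theory
}
```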
Types ¶
type MediaType ¶
type MediaType string
MediaType categorizes the type of media content being slugified. This determines which media-specific parsing rules are applied before slugification.
const (
	// MediaTypeGame represents gaming systems (consoles, computers, arcade).
	MediaTypeGame MediaType = "Game"
	// MediaTypeMovie represents film and movie content.
	MediaTypeMovie MediaType = "Movie"
	// MediaTypeTVShow represents TV episodes and shows.
	MediaTypeTVShow MediaType = "TVShow"
	// MediaTypeMusic represents music and song content.
	MediaTypeMusic MediaType = "Music"
	// MediaTypeImage represents image files.
	MediaTypeImage MediaType = "Image"
	// MediaTypeAudio represents general audio content (audiobooks, podcasts).
	MediaTypeAudio MediaType = "Audio"
	// MediaTypeVideo represents general video content (music videos).
	MediaTypeVideo MediaType = "Video"
	// MediaTypeApplication represents application/software content.
	MediaTypeApplication MediaType = "Application"
)
type ScriptType ¶
type ScriptType int
ScriptType represents different writing systems supported by the slug system. Each script type may require different normalization strategies.
const (
	ScriptLatin ScriptType = iota // Latin alphabet (English, French, Spanish, etc.)
	ScriptCJK // Chinese, Japanese, Korean
	ScriptCyrillic // Russian, Ukrainian, Bulgarian, Serbian, etc.
	ScriptGreek // Greek
	ScriptIndic // Devanagari, Bengali, Tamil, Telugu, etc.
	ScriptArabic // Arabic, Urdu, Persian/Farsi
	ScriptHebrew // Hebrew
	ScriptThai // Thai (requires n-gram matching)
	ScriptBurmese // Burmese/Myanmar (requires n-gram matching)
	ScriptKhmer // Khmer/Cambodian (requires n-gram matching)
	ScriptLao // Lao (requires n-gram matching)
	ScriptAmharic // Amharic/Ethiopic
)
func DetectScript ¶
func DetectScript(s string) ScriptType
DetectScript identifies the primary writing system used in a string. Returns the first matching script type, or ScriptLatin as the default.
type SlugifyResult ¶
SlugifyResult contains the slug and tokens generated during slugification. This ensures metadata is computed from the EXACT tokens used during slug generation, not from re-tokenization.
func SlugifyWithTokens ¶
func SlugifyWithTokens(mediaType MediaType, input string) SlugifyResult
SlugifyWithTokens performs 14-stage normalization and returns both slug and tokens. This is the core implementation - it returns tokens extracted DURING slug generation to ensure metadata is computed from the EXACT same tokenization that produces the slug.
Use this function when you need both the slug and token-based metadata (e.g., word count). For simple slug generation, use Slugify() instead.
Example:
result := SlugifyWithTokens(MediaTypeGame, "The Legend of Zelda: Ocarina of Time (USA)")
result.Slug → "legendofzeldaocarinaoftime"
result.Tokens → []string{"legend", "of", "zelda", "ocarina", "of", "time"}