Title Normalization and Matching System

Zaparoo Core's title normalization and matching system enables users to launch games using natural language titles (e.g., "The Legend of Zelda: Ocarina of Time") that are fuzzy-matched against indexed ROM filenames. This document provides a high-level overview of how the system works.

Overview

The system lets users look up games by natural language titles rather than exact filenames or unique identifiers. Titles can be written in various forms (with or without articles, with Roman numerals or digits, with typos, etc.) and the system will find matching games through progressive normalization and fallback strategies.

Key Concept: Slugs are not IDs. They are an intermediary normalization step that enables fuzzy matching between user queries and indexed filenames. The system normalizes both sides:

User input → normalize → slug → match against database
Filenames → normalize → slug → store in database

The original input title and filename are preserved for additional context during resolution.

Why This Approach?

The system works around several constraints:

No hashing: Too slow on low-resource devices (MiSTer FPGA, older Raspberry Pis)
Offline-first: No dependency on online services or internet access
Cross-device portability: Tokens work across different devices despite different file naming schemes

Advantages:

Natural language names on NFC/QR tokens work universally
Third-party apps don't need special integration - just write the game name
System can be improved over time without breaking compatibility
Makes local media search much more useful

Limitations:

No cross-language support (and currently prioritizes English heuristics)
Conflicts can occur (mitigated by system namespacing and tags)
Best-effort normalization rather than perfect accuracy

How It Works

1. Indexing

When scanning media, Zaparoo:

Cleans path and extracts filename (strips file extension and path)
Parses filename to extract clean display title (8-step pipeline - see Filename Parser section)
Extracts tags from filename using bracket disambiguation (4-step pipeline - see Filename Parser section)
Determines media type from system (Game, TVShow, Movie, Music, Image, Audio, Video, Application)
Runs media-type-aware slugification on title (two-phase normalization - see Slug Normalization section)
Stores path, title, slug, tags, and metadata in database for fast searching

2. Resolution (Query-Based Launching)

When launching by title query (e.g., launch.title("NES/Super Mario Bros")), Zaparoo:

Parses SystemID/GameName format
Validates format and looks up system in database
Extracts tags from three sources in user query (advanced args, canonical tags, filename-style metadata)
Merges tags with priority hierarchy
Slugifies game name using media-type-aware normalization (same as indexing)
Checks cache for previous resolutions
Tries multiple matching strategies in order until finding a match (see Matching Strategies section)
Applies confidence scoring and filtering to select best result
Caches successful resolution for future queries

Note: The user's query is processed separately from filenames. Users can include tags in a query in any of three formats: filename-style, e.g. (USA); canonical, e.g. (+region:us); or advanced args, e.g. ?tags=region:us.
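
As a rough illustration of the first two resolution steps, the sketch below splits a query into its system and game-name parts. The function name splitTitleQuery and its error message are hypothetical, and the real parser also handles tags and advanced args, which are left out here.

```go
package main

import (
	"errors"
	"strings"
)

// splitTitleQuery is a hypothetical helper: it splits "SystemID/GameName"
// on the first slash and rejects queries where either part is missing.
func splitTitleQuery(query string) (string, string, error) {
	system, name, found := strings.Cut(query, "/")
	system, name = strings.TrimSpace(system), strings.TrimSpace(name)
	if !found || system == "" || name == "" {
		return "", "", errors.New("expected SystemID/GameName")
	}
	return system, name, nil
}
```

Splitting on the first slash only means a game name that itself contains a slash stays intact in the second part.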

Slug Normalization

Slug normalization uses a two-phase architecture that converts titles into a canonical form. Both indexing and resolution use identical normalization.

Implementation: pkg/database/slugs/slugify.go → Slugify(mediaType MediaType, input string)

Two-Phase Architecture

Phase 1: Media-Specific Parsing

Applies format-specific normalization based on media type before universal normalization.

For Games (ParseGame):

Width normalization is applied first (fullwidth separators → ASCII for detection), then 9 steps:

Split titles and strip articles: "The Zelda: Link's Awakening" → "Zelda Link's Awakening"
Strip trailing articles: "Legend, The" → "Legend"
Strip metadata brackets: (USA), [!], {Europe} → removed
Strip edition/version suffixes: "Edition", "Version", "v1.0" → removed
Normalize symbols/separators (preserve commas for trailing articles)
Expand abbreviations: "Bros" → "brothers", "vs" → "versus", "Dr" → "doctor"
Expand number words: "one" → "1", "two" → "2" (1-20)
Normalize ordinals: "2nd" → "2", "3rd" → "3"
Convert roman numerals: "VII" → "7", "II" → "2" (preserves "X" in "Mega Man X")
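
The Roman numeral step can be sketched as a whole-word table lookup. This is a minimal illustration, not the code in media_parsing_game.go: the table contents, the uppercase-only matching, and the name convertRomanNumerals are all assumptions. "I" and "X" are left out of the table so that titles such as "Mega Man X" survive unchanged.

```go
package main

import "strings"

// romanToArabic lists the numerals this sketch converts. "I" and "X" are
// omitted on purpose (an assumption) because they are often real words
// or part of a title, as in "Mega Man X".
var romanToArabic = map[string]string{
	"II": "2", "III": "3", "IV": "4", "V": "5", "VI": "6",
	"VII": "7", "VIII": "8", "IX": "9", "XI": "11", "XII": "12",
}

// convertRomanNumerals replaces whole-word, uppercase Roman numerals
// with digits and leaves every other word untouched.
func convertRomanNumerals(title string) string {
	words := strings.Fields(title)
	for i, w := range words {
		if digits, ok := romanToArabic[w]; ok {
			words[i] = digits
		}
	}
	return strings.Join(words, " ")
}
```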

For TV Shows (ParseTVShow):

Width normalization is applied first, then 9 steps:

Scene tag stripping: quality, codec, source tags (1080p, x264, BluRay, etc.)
Dot normalization: scene release dots → spaces
Strip metadata brackets: [720p], (extended) → removed
Normalize date episodes: YYYY-MM-DD, DD-MM-YYYY (with ., /, - separators) → canonical YYYY-MM-DD
Normalize season-based formats: S01E02, s01e02, 1x02, S01.E02, S01_E02, multi-episode (S01E01-E02) → s01e02
Normalize absolute numbering: Episode 001, Ep 42, E001, #001 (anime), leading numbers → e001
Component reordering: episode marker placed after show name ("S01E02 - Show - Title" → "Show s01e02 Title")
Split titles and strip articles: "The Show: Episode Title" → "Show Episode Title"
Strip trailing articles: "Show, The" → "Show"
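
To make the season-based normalization concrete, here is a small sketch that rewrites several of the formats listed above to the canonical s01e02 form. The regular expression and the name normalizeEpisodeMarker are illustrative assumptions; multi-episode ranges, date episodes, and absolute numbering are not covered.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// episodePattern matches S01E02, s1e2, S01.E02 and S01_E02 (groups 1-2)
// as well as the 1x02 form (groups 3-4).
var episodePattern = regexp.MustCompile(
	`(?i)\bs(\d{1,2})[._ ]?e(\d{1,3})\b|\b(\d{1,2})x(\d{2,3})\b`)

// normalizeEpisodeMarker rewrites every matched marker as sNNeNN,
// zero-padding season and episode to two digits.
func normalizeEpisodeMarker(title string) string {
	return episodePattern.ReplaceAllStringFunc(title, func(m string) string {
		g := episodePattern.FindStringSubmatch(m)
		season, episode := g[1], g[2]
		if season == "" { // the 1x02 alternative matched
			season, episode = g[3], g[4]
		}
		s, _ := strconv.Atoi(season)
		e, _ := strconv.Atoi(episode)
		return fmt.Sprintf("s%02de%02d", s, e)
	})
}
```

The word boundaries keep strings such as "1920x1080" from being mistaken for a season/episode pair.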

For Movies (ParseMovie):

Width normalization (fullwidth separators → ASCII for detection)
Scene tag stripping: quality, codec, source, HDR, 3D tags (preserves edition qualifiers like "Extended", "Unrated", "Director's Cut")
Dot normalization: scene release dots → spaces
Edition suffix stripping: trailing "Edition", "Version", "Cut", "Release" removed (preserves qualifiers: "Director's Cut Edition" → "Director's")
Bracket stripping: (2024), {imdb-tt1234567} → removed (years extracted as tags)
Split titles and strip articles: "The Movie: Subtitle" → "Movie Subtitle"
Strip trailing articles: "Movie, The" → "Movie"

For Music (ParseMusic):

Width normalization (fullwidth separators → ASCII for detection)
Scene tag stripping: format (FLAC, MP3), quality (V0, 320, 24bit), source (CD, Vinyl, WEB), release group (preserves edition qualifiers like "Remastered", "Deluxe")
Separator normalization: dots, underscores, dashes → spaces
Bracket stripping: (1979), [FLAC] → removed (years extracted as tags)
Disc number stripping: CD1, CD2, Disc 1 → removed
Strip leading article: "The Beatles Abbey Road" → "Beatles Abbey Road"
Strip trailing articles: "Album, The" → "Album"
Whitespace collapse: multiple spaces → single space

For Image/Audio/Video/Application:

Pass through to Phase 2 universal normalization only (no media-specific parsing yet)

Phase 2: Universal Normalization (`normalizeInternal`)

Applied after media-specific parsing:

Width Normalization - Fullwidth ASCII variants → halfwidth (ASCII); halfwidth CJK forms → fullwidth
Punctuation Normalization - Curly quotes, fancy dashes → standard ASCII
Unicode Normalization - Remove symbols (™©®), remove diacritics (Pokémon → Pokemon), script-aware processing
Symbols & Separators - & → and, separators → spaces: "Sonic & Knuckles" → "Sonic and Knuckles"
Period Conversion - All periods → spaces (safe after abbreviations expanded)
Lowercasing - Convert to lowercase

Final Stage (in Slugify/SlugifyWithTokens):

Character Filtering - Remove non-alphanumeric, multi-script aware
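
A reduced sketch of part of Phase 2 follows, covering punctuation normalization, the "&" expansion, period conversion, and lowercasing. The replacement table and the name normalizeUniversal are assumptions made for illustration. Width normalization and diacritic removal are omitted because they need Unicode tables beyond the Go standard library, and the whitespace collapse at the end is only there to keep the output tidy.

```go
package main

import "strings"

// punctuation maps a few typographic characters to ASCII, expands "&",
// and turns periods into spaces. The real tables are likely larger.
var punctuation = strings.NewReplacer(
	"\u2018", "'", "\u2019", "'", // curly single quotes
	"\u201C", `"`, "\u201D", `"`, // curly double quotes
	"\u2013", "-", "\u2014", "-", // en dash and em dash
	"&", " and ",
	".", " ",
)

// normalizeUniversal applies the replacements, lowercases the result,
// and collapses runs of whitespace into single spaces.
func normalizeUniversal(s string) string {
	s = punctuation.Replace(s)
	s = strings.ToLower(s)
	return strings.Join(strings.Fields(s), " ")
}
```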

Complete Example (Games)

Input:  "The Legend of Zelda: Ocarina of Time (USA) [!]"

Phase 1 (ParseGame):
  Step 1: Split & strip articles → "Legend of Zelda Ocarina of Time (USA) [!]"
  Step 3: Strip brackets → "Legend of Zelda Ocarina of Time"
  Step 9: Roman numerals (none) → "Legend of Zelda Ocarina of Time"

Phase 2 (normalizeInternal):
  Step 6: Lowercase → "legend of zelda ocarina of time"

Final:
  Character filtering → "legendofzeldaocarinaoftime"

Output: "legendofzeldaocarinaoftime"

Multi-Script Support

The system preserves non-Latin scripts (CJK, Cyrillic, Arabic, etc.) while aggressively normalizing Latin text:

Latin titles: Full normalization, ASCII output
CJK titles: Preserved characters, essential marks kept
Mixed titles: Both portions concatenated, searchable by either part

Example:

Input:  "Street Fighter ストリートファイター"
Output: "streetfighterストリートファイター"

This makes the title searchable by either the Latin or CJK portion.
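
One way to get this behaviour is to filter by Unicode category rather than by ASCII range, so letters and digits from any script survive the final character filter. The sketch below assumes that approach; filterSlugRunes is a hypothetical name, and the script-aware handling of essential marks mentioned above is not modelled.

```go
package main

import (
	"strings"
	"unicode"
)

// filterSlugRunes lowercases the input and keeps only runes that Unicode
// classifies as letters or digits, regardless of script.
func filterSlugRunes(s string) string {
	var b strings.Builder
	for _, r := range strings.ToLower(s) {
		if unicode.IsLetter(r) || unicode.IsDigit(r) {
			b.WriteRune(r)
		}
	}
	return b.String()
}
```

Spaces and punctuation are dropped for every script, while the katakana, including the prolonged sound mark, passes through because it belongs to the letter categories.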

Filename Parser (Indexing)

During media indexing, Zaparoo parses filenames to extract clean titles and metadata tags. This happens when scanning ROM directories, media libraries, or individual files.

Implementation: pkg/database/mediascanner/indexing_pipeline.go → GetPathFragments(), pkg/database/tags/filename_parser.go

Indexing Pipeline

When a file is indexed, the system:

Cleans path - Normalizes to forward slashes, handles URIs
Extracts filename - Strips file extension and path
Parses title - Extracts clean display title from filename
Extracts tags - Parses metadata from brackets/parentheses
Slugifies title - Creates normalized slug for matching
Stores in database - Saves path, title, slug, and tags

Title Extraction from Filename

Function: tags.ParseTitleFromFilename(filename, stripLeadingNumbers)

Extracts a clean display title from a filename by removing metadata and normalizing artifacts.

The 8-Step Pipeline

Remove File Extension
- Strips .zip, .nes, .mkv, etc.
- Only removes if 2-4 characters after last dot
- Example: "game.nes" → "game"
Strip Release Group
- Removes scene release group suffix: -GROUP at end
- Must be uppercase, 3+ characters
- Example: "Movie-YIFY" → "Movie"
- Done early before hyphen → space conversion
Normalize Filename Separators (contextual)
- Trigger: Filename has no spaces AND 2+ separators (dots, underscores, or dashes)
- Action: Convert all ., _, - → spaces
- Examples:
  - "The.Dark.Knight.2008.mkv" → "The Dark Knight 2008"
  - "super_mario_bros.sfc" → "super mario bros"
  - "mega-man-x.nes" → "mega man x"
- Heuristic: Detects scene releases and ROM naming conventions
Strip Scene Release Artifacts (contextual)
- Trigger: If the filename contains a year, only text AFTER the year is stripped
- Protects titles: "Cam (2018)" keeps "Cam" (it's the title, not a scene tag)
- Removed patterns:
  - Resolution: 720p, 1080p, 2160p, 4K, UHD
  - Source: BluRay, WEB-DL, WEBRip, HDTV, DVDRip, CAM, TS
  - Video codec: x264, x265, h264, HEVC, AVC
  - Audio codec: AAC, AC3, DTS, DD5.1, TrueHD, Atmos
  - HDR: HDR, HDR10, Dolby Vision
  - Status: PROPER, REPACK, INTERNAL, LIMITED
- Example: "The Dark Knight 2008 1080p BluRay x264" → "The Dark Knight 2008"
Strip Episode Markers
- Removes TV show episode patterns: S01E02, s1e2
- Keeps them for tag extraction but removes from display title
- Example: "Breaking Bad S01E02 Title" → "Breaking Bad Title"
Strip Leading Numbers (optional)
- Only when stripLeadingNumbers=true (detected from directory context)
- Removes list prefixes: "1. ", "01 - ", "05-"
- Example: "01 - Game Name" → "Game Name"
- Contextual: Only enabled when directory shows list-style numbering
Remove All Bracket Content
- Function: slugs.StripMetadataBrackets()
- Removes: (), [], {}, <>
- Handles nested brackets
- Example: "Game (USA) [!] {Europe}" → "Game"
- Example: "Movie (2008) (Blu-ray)" → "Movie"
Normalize Whitespace
- Collapses multiple spaces to single space
- Trims leading/trailing spaces
- Final cleanup after all transformations
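
Steps 2 and 3 of the pipeline can be sketched as follows. This is an approximation built from the rules stated above; the names stripReleaseGroup and normalizeSeparators are hypothetical, and treating a release group as letters only is an assumption.

```go
package main

import (
	"regexp"
	"strings"
)

// releaseGroup matches a trailing "-GROUP" of three or more uppercase
// letters. Whether digits may appear in a group name is not specified,
// so this sketch accepts letters only.
var releaseGroup = regexp.MustCompile(`-[A-Z]{3,}$`)

// stripReleaseGroup removes a trailing scene release group, if present.
func stripReleaseGroup(name string) string {
	return releaseGroup.ReplaceAllString(name, "")
}

// normalizeSeparators converts dots, underscores and dashes to spaces,
// but only when the name has no spaces and at least two separators.
func normalizeSeparators(name string) string {
	if strings.Contains(name, " ") {
		return name
	}
	n := strings.Count(name, ".") + strings.Count(name, "_") + strings.Count(name, "-")
	if n < 2 {
		return name
	}
	return strings.NewReplacer(".", " ", "_", " ", "-", " ").Replace(name)
}
```

Running the release group step first matters: once dashes have become spaces, the "-GROUP" suffix can no longer be told apart from an ordinary word.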

Title Extraction Examples

ROM filename:

Input:  "Super Mario Bros. III (USA) (Rev A) [!].nes"
Step 1: Remove extension → "Super Mario Bros. III (USA) (Rev A) [!]"
Step 7: Remove brackets → "Super Mario Bros. III"
Step 8: Normalize whitespace → "Super Mario Bros. III"
Output: "Super Mario Bros. III"

Scene release:

Input:  "The.Dark.Knight.2008.1080p.BluRay.x264-YIFY.mkv"
Step 1: Remove extension → "The.Dark.Knight.2008.1080p.BluRay.x264-YIFY"
Step 2: Strip release group → "The.Dark.Knight.2008.1080p.BluRay.x264"
Step 3: Normalize separators → "The Dark Knight 2008 1080p BluRay x264"
Step 4: Strip scene artifacts (after year) → "The Dark Knight 2008"
Output: "The Dark Knight 2008"

TV show:

Input:  "Breaking.Bad.S01E02.Gray.Matter.720p.mkv"
Step 3: Normalize separators → "Breaking Bad S01E02 Gray Matter 720p"
Step 4: Strip scene artifacts → "Breaking Bad S01E02 Gray Matter"
Step 5: Strip episode marker → "Breaking Bad Gray Matter"
Output: "Breaking Bad Gray Matter"

Tag Extraction from Filename

Function: tags.ParseFilenameToCanonicalTags(filename)

Extracts metadata tags from filenames following No-Intro and TOSEC conventions.

The 4-Step Pipeline

Step 1: Extract Special Patterns

Function: extractSpecialPatterns()
Finds special patterns, whether bracketed or bare, before general bracket parsing:
- Disc numbers: (Disc 1 of 2) → disc:1, discof:2
- Revisions: (Rev A), (Rev 1) → rev:a, rev:1
- Volumes: (Vol. 2), (Volume 3) → volume:2
- Versions: (v1.2), v3.0 → rev:1-2, rev:3-0
- Years: (1997) → year:1997
- Episodes: S01E02, 1x05 → season:1, episode:2
- Issues: #12, Issue 5 → issue:12
- Tracks: 01 -, Track 03 → track:1
- Translations: T+En, T-Fr v1.0 → translation:en, translation-:fr
- Bracketless versions: v1.0, v1.2.3 outside brackets → rev:1-0 (only if no version already extracted)
- Edition/version words: "Edition", "Version" (+ multi-language equivalents) → edition:edition or edition:version (inferred, not removed from title)
Removes matched patterns from string for cleaner bracket extraction

Step 2: Extract Bracket Content

Function: extractTags()
State machine parser for brackets:
- (), {}, <> → parentheses tags (region, language, dev status)
- [] → square bracket tags (dump info, hacks)
Returns two separate lists for disambiguation
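
A single-pass scanner along these lines could look like the sketch below. It is not the actual extractTags() function: the name extractBracketTags is hypothetical, nested brackets are not handled, and an unclosed bracket is silently dropped.

```go
package main

// extractBracketTags scans name once and returns the contents of
// round/curly/angle brackets and of square brackets as separate lists.
func extractBracketTags(name string) (paren, square []string) {
	closers := map[rune]rune{'(': ')', '{': '}', '<': '>', '[': ']'}
	var want rune     // closing bracket being waited for; 0 when outside
	var isSquare bool // true while inside a [...] group
	var current []rune
	for _, r := range name {
		switch {
		case want == 0:
			// Outside brackets: only an opening bracket changes state.
			if c, ok := closers[r]; ok {
				want, isSquare, current = c, r == '[', current[:0]
			}
		case r == want:
			// Matching close: emit the collected tag to the right list.
			if isSquare {
				square = append(square, string(current))
			} else {
				paren = append(paren, string(current))
			}
			want = 0
		default:
			current = append(current, r)
		}
	}
	return paren, square
}
```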

Step 3: Process Parentheses Tags

Function: disambiguateTag() with BracketTypeParen
Context-aware parsing with positional rules:
- First paren tag → usually region (if matches known region)
- Subsequent tags → language, version, dev status, etc.
Handles multi-value tags: (En,Fr,De) → lang:en, lang:fr, lang:de
Tag types recognized:
- Regions: USA, Europe, Japan, World, etc.
- Languages: En, Fr, De, Ja, etc. (2-3 letter codes)
- Dev status: Beta, Proto, Alpha, Demo
- Versions: v1.0, Rev A, Alt
- Years: 1997, 2008
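
The positional rule and the multi-value split can be illustrated with a much smaller function than the real disambiguateTag(). In this sketch the lookup tables are tiny samples, the name disambiguateParenTag is hypothetical, and only regions and language lists are recognised; every other tag type returns nil.

```go
package main

import "strings"

// Sample lookup tables; the real lists are far longer.
var knownRegions = map[string]string{"usa": "us", "europe": "eu", "japan": "jp"}
var knownLanguages = map[string]bool{
	"en": true, "fr": true, "de": true, "es": true, "it": true, "ja": true,
}

// disambiguateParenTag treats the first tag as a region when it is a
// known region, and otherwise tries to read the tag as a comma-separated
// language list. Tags it cannot classify yield nil.
func disambiguateParenTag(tag string, position int) []string {
	lower := strings.ToLower(strings.TrimSpace(tag))
	if position == 0 {
		if code, ok := knownRegions[lower]; ok {
			return []string{"region:" + code}
		}
	}
	var out []string
	for _, part := range strings.Split(lower, ",") {
		part = strings.TrimSpace(part)
		if !knownLanguages[part] {
			return nil // not a pure language list; out of scope here
		}
		out = append(out, "lang:"+part)
	}
	return out
}
```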

Step 4: Process Square Bracket Tags

Function: disambiguateTag() with BracketTypeSquare
Always dump-related or modification info:
- Dump status: [!] → dump:verified, [b] → dump:bad
- Hacks: [h], [h1] → hack:yes, hack:1
- Translations: [T+En] → translation:en
- Trainer: [t], [t1] → trainer:yes, trainer:1
- Fixes: [f] → fix:yes
- Overdumps: [o] → overdump:yes

Tag Extraction Examples

ROM filename:

Input:  "Super Mario Bros. 3 (USA) (Rev A) [!].nes"

Step 1: Extract special patterns
  → Rev: rev:a (from "(Rev A)")
  → Remaining: "Super Mario Bros. 3 (USA) [!].nes"

Step 2: Extract brackets
  → Paren tags: ["USA"]
  → Square tags: ["!"]

Step 3: Process paren tags
  → "USA" (position 0, first tag) → region:us

Step 4: Process square tags
  → "!" → dump:verified

Output: [rev:a, region:us, dump:verified]

Multi-language ROM:

Input:  "Zelda (Europe) (En,Fr,De,Es,It).gba"

Step 2: Extract brackets
  → Paren tags: ["Europe", "En,Fr,De,Es,It"]

Step 3: Process paren tags (position 0)
  → "Europe" → region:eu

Step 3: Process paren tags (position 1)
  → "En,Fr,De,Es,It" (multi-value) → lang:en, lang:fr, lang:de, lang:es, lang:it

Output: [region:eu, lang:en, lang:fr, lang:de, lang:es, lang:it]

Unfinished ROM:

Input:  "Star Fox 2 (Beta) (1995).sfc"

Step 1: Extract special patterns
  → Year: year:1995
  → Remaining: "Star Fox 2 (Beta).sfc"

Step 3: Process paren tags
  → "Beta" → unfinished:beta

Output: [year:1995, unfinished:beta]

Scene release:

Input:  "The.Dark.Knight.2008.1080p.BluRay.x264-YIFY.mkv"

Step 1: Extract special patterns
  → Year: year:2008
  → Remaining: (no brackets to extract)

Output: [year:2008]

Disambiguation Rules

The system uses positional and bracket-type rules for disambiguation:

Positional Rules (parentheses):

First tag → region (if matches known region list)
Subsequent tags → language, version, dev status
Context-aware → checks previously processed tags

Bracket Type Rules:

Parentheses/Braces/Angles → metadata (region, language, version, dev status)
Square brackets → always dump info or modifications (hacks, trainers, fixes)

Special Handling:

Multi-value tags: (En,Fr,De) → creates multiple lang tags
Composite tags: (En,Fr) in Europe ROM → both languages extracted
Inferred tags: "Edition" in plain text → marked as TagSourceInferred, skipped for filtering

Matching Strategies

Resolution tries strategies in order until finding results. Each strategy is more lenient than the last.

Implementation: pkg/zapscript/titles.go → cmdTitle()

Strategy Flow

1. Check cache
   ↓ (miss)
2. Exact match (with tags)
   ↓ (no results OR low confidence)
3. Exact match (without tags)
   ↓ (no results)
4. Secondary title match
   ↓ (no results)
5. Advanced fuzzy matching
   ├─ Token signature (word-order independent)
   ├─ Jaro-Winkler (typo tolerance)
   └─ Damerau-Levenshtein tie-breaking
   ↓ (no results)
6. Main title only
   ↓ (no results)
7. Progressive trim (last resort)
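
The flow above is essentially an ordered fallback chain. The sketch below shows that shape only; the strategy type and resolve function are hypothetical, and the low-confidence fallthrough after step 2 is not modelled.

```go
package main

// strategy pairs a label with a lookup function. Both are placeholders
// for illustration, not types from the Zaparoo codebase.
type strategy struct {
	name string
	find func(slug string) []string
}

// resolve runs the strategies in order and returns the name and results
// of the first one that finds anything.
func resolve(slug string, strategies []strategy) (string, []string) {
	for _, s := range strategies {
		if results := s.find(slug); len(results) > 0 {
			return s.name, results
		}
	}
	return "", nil
}
```

Because the list is ordered from strictest to most lenient, the first hit is also the most trustworthy one available.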

Strategy Details

1. Cache Lookup

Fast path: checks previous resolutions
Keyed by: SystemID + Slug + Tags

2. Exact Match (with tags)

Direct slug lookup
Tags applied as filters
Early exit if confidence ≥ 0.95 (high confidence)

3. Exact Match (without tags)

Same slug lookup, tags ignored
Tags become soft preferences during result selection

4. Secondary Title Match

Handles mismatched secondary titles (bidirectional):

Query has secondary, DB doesn't: "Zelda: Ocarina" → matches "Ocarina of Time"
Query simple, DB has secondary: "Ocarina" → matches "Zelda: Ocarina of Time"

5. Advanced Fuzzy Matching

Uses a pre-filter (±3 chars, ±1 word) then tries three algorithms:

Token signature: Order-independent word matching
Jaro-Winkler: Typo tolerance, prefix-weighted (0.85+ similarity)
Damerau-Levenshtein: Tie-breaking for top 5 candidates
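
Two of these pieces are simple enough to sketch directly. The pre-filter below reads "±3 chars, ±1 word" as a limit on the difference in length and word count between query and candidate, which is an interpretation rather than something confirmed by the source. The token signature sorts lowercase words so that word order stops mattering. All function names are hypothetical.

```go
package main

import (
	"sort"
	"strings"
)

// abs returns the absolute value of n.
func abs(n int) int {
	if n < 0 {
		return -n
	}
	return n
}

// passesPreFilter keeps candidates whose length is within 3 characters
// and whose word count is within 1 word of the query (assumed meaning).
func passesPreFilter(query, candidate string) bool {
	charDiff := len([]rune(query)) - len([]rune(candidate))
	wordDiff := len(strings.Fields(query)) - len(strings.Fields(candidate))
	return abs(charDiff) <= 3 && abs(wordDiff) <= 1
}

// tokenSignature lowercases the title, sorts its words, and joins them,
// so titles with the same words in any order share a signature.
func tokenSignature(title string) string {
	tokens := strings.Fields(strings.ToLower(title))
	sort.Strings(tokens)
	return strings.Join(tokens, " ")
}
```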

6. Main Title Only

Searches using just the main title portion (bidirectional):

Query has secondary, DB doesn't: "Zelda: Ocarina" → matches "Zelda"
Query simple, DB has secondary: "Zelda" → matches "Zelda: Ocarina of Time"

7. Progressive Trim

Progressively removes words from the end of the original query (max 3 iterations):

"Legend of Zelda Link's Awakening DX" → tries "...Awakening", "...Link's", "...Zelda" (then slugifies each)

Result Selection

When multiple results match, the system applies filtering and scoring:

Confidence Scoring

Each strategy assigns a base confidence (0.85-1.0), which is then adjusted by tag matching. The final score determines what happens next:

High confidence (≥0.95): Launch immediately
Acceptable (≥0.70): Launch with info message
Minimum (≥0.60): Launch with warning
Below 0.60: Error out
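
The thresholds map onto a simple ordered check. The action type, its constants, and actionFor are hypothetical names used only for this sketch.

```go
package main

// action describes what the resolver does with a scored result.
type action string

const (
	launch        action = "launch"
	launchInfo    action = "launch with info message"
	launchWarning action = "launch with warning"
	reject        action = "error"
)

// actionFor applies the documented thresholds from highest to lowest.
func actionFor(confidence float64) action {
	switch {
	case confidence >= 0.95:
		return launch
	case confidence >= 0.70:
		return launchInfo
	case confidence >= 0.60:
		return launchWarning
	default:
		return reject
	}
}
```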

Filtering Priority

User-specified tags - Filter to exact matches (if provided)
Exclude variants - Remove unfinished (demo, beta, proto, alpha, sample, preview, prerelease), unlicensed (hack, translation, bootleg, clone), and bad dumps
Exclude re-releases - Remove reboxed editions, re-releases
Preferred regions - Match user's region config
Preferred languages - Match user's language config
File type priority - Prefer file types based on launcher extension order (earlier = better)
Quality-based tie-breaking - Select best file using:
- Numeric suffix penalty (avoids duplicates like "game (1).zip")
- Path depth (prefers files in organized folders over deep backups)
- Character density (cleaner filenames preferred)
- Filename length (shorter is simpler)

Tag System

Tags are used for filtering and disambiguation during indexing and resolution.

Tag Extraction (During Indexing)

Tags are automatically extracted from filenames during media scanning using the Filename Parser (see section above for complete details).

Common tag types:

Regions: (USA), (Europe), (Japan) → region:us, region:eu, region:jp
Languages: (En), (Fr,De) → lang:en, lang:fr, lang:de
Years: (1997) → year:1997
Dump info: [!] → dump:verified, [b] → dump:bad
Development: (Beta), (Proto) → unfinished:beta, unfinished:proto
Revisions: (Rev A) → rev:a
Episodes: S01E02, 1x05 → season:1, episode:2
Discs: (Disc 1 of 2) → disc:1, discof:2

See "Filename Parser" section for the complete 4-step tag extraction pipeline with disambiguation rules.

Tag Usage in Queries

Tags can be specified in three ways (with priority hierarchy):

Advanced args (highest priority): NES/Zelda?tags=region:us,-unfinished:beta
- Explicit user requirements via ?tags= parameter
- Format: tag:value or -tag:value (NOT operator)
- Overrides all other tag sources
Inline canonical tags (medium priority): NES/Zelda (+region:us) (-unfinished:beta)
- Explicit tag filters with operators in parentheses
- Supports: (+tag:value) AND, (-tag:value) NOT, (tag:value) AND (default)
- Overrides filename-style tags
Filename-style (lowest priority): NES/Zelda (USA) (1986) (auto-extracted)
- Automatically parsed from filename metadata in parentheses
- Always treated as AND filters
- Used only when no conflicting higher-priority tags exist
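
The priority hierarchy can be pictured as a merge in which later, higher-priority sources replace earlier ones. The sketch below assumes that two tags conflict when they share the same tag type, which the source does not spell out, and it keeps only one value per type. The tagFilter type and mergeTags function are hypothetical.

```go
package main

// tagFilter is a hypothetical representation of one query tag.
type tagFilter struct {
	Key, Value string
	Not        bool // true for the NOT operator (-tag:value)
}

// mergeTags takes sources ordered from lowest to highest priority.
// A later source replaces any earlier tag with the same key (assumed
// conflict rule); first-seen key order is preserved in the output.
func mergeTags(sources ...[]tagFilter) []tagFilter {
	merged := map[string]tagFilter{}
	var order []string
	for _, src := range sources {
		for _, t := range src {
			if _, seen := merged[t.Key]; !seen {
				order = append(order, t.Key)
			}
			merged[t.Key] = t
		}
	}
	out := make([]tagFilter, 0, len(order))
	for _, k := range order {
		out = append(out, merged[k])
	}
	return out
}
```

Calling mergeTags(filenameStyle, inlineCanonical, advancedArgs) would give advanced args the final say, matching the order described above.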

Implementation Notes

Key Files

Indexing Pipeline:

Main indexing orchestrator: pkg/database/mediascanner/indexing_pipeline.go
Filename title extraction: pkg/database/tags/filename_parser.go → ParseTitleFromFilename()
Filename tag extraction: pkg/database/tags/filename_parser.go → ParseFilenameToCanonicalTags()

Slug Normalization:

Core slugification: pkg/database/slugs/slugify.go
Media parsing (dispatcher): pkg/database/slugs/media_parsing.go
Game parsing: pkg/database/slugs/media_parsing_game.go
TV show parsing: pkg/database/slugs/media_parsing_tv.go
Movie parsing: pkg/database/slugs/media_parsing_movie.go
Music parsing: pkg/database/slugs/media_parsing_music.go
Normalization helpers: pkg/database/slugs/normalization.go
Script detection: pkg/database/slugs/scripts.go

Query Resolution (Title-based launching):

Query parser & resolution: pkg/zapscript/titles.go → cmdTitle()
Matching strategies: pkg/zapscript/titles/strategies.go
Result selection: pkg/zapscript/titles/selection.go
Fuzzy matching: pkg/database/matcher/fuzzy.go
Query tag extraction & merging: pkg/zapscript/titles/tags.go
