
Titles, Tags and Metadata - Part 1

· 12 min read
wizzo
Lead Developer

Hi everyone. I wanted to do a technical post to give an update on a few things I've been working on for Zaparoo. This post will be quite long but maybe some people out there will find it interesting! I'll be going over how I've been trying to solve the file to game matching process in Zaparoo with our unique requirements.

This is the feature that took up like 99% of the work for this next release. This thing finally solves (hopefully!) the problem of making cards reliably launch the same game between different devices. I'm going to go ahead and explain it in detail now because it took me SO LONG to do this and I need you to APPRECIATE how much WORK THIS WAS (but it's OK because it was a very interesting problem to solve).

Game Identification

So, in the previous versions, you are probably aware when you write a game to an NFC card, it will look something like this: PSX/Some/Folder/The Game.iso

This is just a literal path to a file, with some smart rules that turn the first "folder" into a check over multiple parent folders to see where the file actually is. This solves the issue of, say, a USB drive changing its mount point (/media/usb0 vs /media/usb1), but it still relies on an exact folder and filename layout within that mount. This was a simple method I added way back in the original releases and it works pretty well, it's carried us this far, but it has no fallback mechanism if the file moves or is just in a different location (obviously very common between devices).
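To make that concrete, here is a minimal Python sketch of the idea, not Zaparoo's actual code. The function name resolve_card_path and the system_dirs mapping are made up for illustration: the first segment of the card text is treated as a system key, and the rest of the path is tried under each candidate parent folder for that system.

```python
import os

def resolve_card_path(card_path, system_dirs):
    """Resolve card text like 'PSX/Some/Folder/The Game.iso' to a real file.

    system_dirs is a hypothetical mapping from a system key to the parent
    folders where that system's games might live on this device.
    """
    system, _, rest = card_path.partition("/")
    for parent in system_dirs.get(system, []):
        candidate = os.path.join(parent, *rest.split("/"))
        if os.path.isfile(candidate):
            return candidate
    # No fallback: if the layout under the parent folder differs, we give up.
    return None
```

The last line is the weakness described above: the mount point can change, but the folder and filename below it have to match exactly.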

We can add some hacks to this, like if the file is missing, search in a different folder or if the file is missing, look for a slightly different filename (very slow BTW), but to truly fix this issue we're ultimately re-framing the problem from "find and launch a file" to "find and launch a game" which is much more difficult. This is a problem I've been banging my head against for like 2 years now.

How Others Do it

Traditionally in other launchers, they solve this by hashing the game files and then querying an online service or some large local DB to match the file hash back to a game, also known as "scraping". This works really well, that's why everyone does it, but Zaparoo has some big constraints:

  1. Hashing files is REALLY slow on MiSTer. Between its slow CPU and SD card speed, it turns what was a 10 minute index into days of work. It's really crazy. You see people asking why there's no cover art and stuff on MiSTer? This is a really big part of the reason. Scrapers rely on hashing for accurate game identification.
  2. Assuming we could hash, I don't want to tie Zaparoo to any existing specific scraping service. On other launchers, they're free to hit any scraper and they can store and use the game IDs as they like once a game is identified. For us, the game ID is portable (on a card) and so I'd actually have to lock in an "official scraper" and use their internal game IDs. I'm not comfortable with this.
  3. Less important, but the last thing is internet access. Most people do have internet access, but personally I try to write everything in Zaparoo to not require it. I always try to be mindful of the question "will this work if I had friends and took my MiSTer to their house?".
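For reference, hash-based identification looks roughly like the Python sketch below. It is a generic illustration, not any particular scraper's API: the known_hashes dictionary stands in for an online service or a local DAT-style database, and SHA-1 is just one of the digests such databases commonly list.

```python
import hashlib

def sha1_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks. Every byte has to be read from storage."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def identify(path, known_hashes):
    """Look the file's hash up in a hypothetical {sha1: game title} table."""
    return known_hashes.get(sha1_of_file(path))
```

The lookup itself is cheap. The cost is in sha1_of_file, which has to read the whole file, and that is what hurts on slow storage with a slow CPU and a large library.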

And that was that for a long time. I took a few cracks at it but always landed in the same place: whatever the solution was, it was too complex for me to fix at that point.

Trying to Solve it Again

So, crap! This is never gonna work. Until this year when BossRighteous came along and said "hey, we should just convert the game's filename to an ID and use that, it's probably good enough". I went "OK, great!" and he went ahead and did all the initial work of converting Zaparoo over to a new type of internal database which would help support this idea.

The idea is simple enough, you take a filename, run some rules to chop off the garbage like ROM tags and file extensions, then normalise it to produce a "slug". For example:

  • Super Mario World becomes the slug supermarioworld.
  • Sonic Spinball becomes the slug sonicspinball.

And so on. Basically, we run some simple rules to isolate a game's title and make it a little more generic to account for some minor filename differences, which produces a simple piece of text we can match against in the database. The idea is really clever, it operates very heavily on the concept of "good enough" because the slug will not be unique, but it made sense and actually had a chance of working with what resources we had. Maybe this was the value that got written to the NFC card and made everything work.
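A basic version of this can be sketched in a few lines of Python. This is an illustration of the idea only, not the real rule set: the function name basic_slug and the exact regular expressions are assumptions.

```python
import re

def basic_slug(filename):
    """Reduce a filename to a simple lowercase alphanumeric slug."""
    # Drop a short trailing file extension such as ".sfc" or ".z64".
    name = re.sub(r"\.[A-Za-z0-9]{1,4}$", "", filename)
    # Drop ROM tags in round or square brackets, e.g. "(USA)" or "[!]".
    name = re.sub(r"\s*[\(\[][^\)\]]*[\)\]]", "", name)
    # Lowercase and keep only letters and digits.
    return re.sub(r"[^a-z0-9]", "", name.lower())
```

With these rules, "Super Mario World.sfc" and "Super Mario World (US).sfc" both reduce to supermarioworld, while anything else in the filename (a list number, a moved article) still ends up in the slug.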

So, that was implemented, it was released, and then nothing happened. Slugs are being generated, but there's still a lot of work to bridge the gap between the slug and the launching, and it just fell to the wayside when other priorities came up.

Trying to Solve it Again Again

So that finally brings us to about a month ago. Lately I've been asked by a few third party developers if the Zaparoo API could be used as a back-end for their own launcher UIs. Totally, I always wanted that to happen, but that also relies on game matching working! So I figured OK let's really knuckle down and finally solve this for real.

To start, I had to make a lot of optimisations to the database. It was functional as is, but it was really slow when working on a large number of items (like hundreds of thousands) and that was a non-starter considering that tapping a card and launching had to feel instant. I won't get into the database optimisations too much (I think it's kind of boring sorry!) but I can say now that looking up an ID is instant and performing complex searches is fast as hell. I feel like I have milked every last drop out of SQLite. Now I had the confidence to hook up the launch system to do look ups directly on the database!
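The general shape of an indexed slug lookup in SQLite can be sketched as below, using Python's built-in sqlite3 module. The table and column names are invented for the example and say nothing about the real schema or the actual optimisations. The point is only that an index covering the lookup columns lets SQLite search for matching rows instead of scanning the whole table.

```python
import sqlite3

def build_db(rows):
    """Create an in-memory database with a hypothetical media table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE media ("
        "id INTEGER PRIMARY KEY, system TEXT NOT NULL, "
        "slug TEXT NOT NULL, path TEXT NOT NULL)"
    )
    # The index is what keeps lookups fast as the table grows.
    conn.execute("CREATE INDEX idx_media_system_slug ON media (system, slug)")
    conn.executemany(
        "INSERT INTO media (system, slug, path) VALUES (?, ?, ?)", rows
    )
    conn.commit()
    return conn

def find_by_slug(conn, system, slug):
    """Return every path whose system and slug match, sorted by path."""
    cursor = conn.execute(
        "SELECT path FROM media WHERE system = ? AND slug = ? ORDER BY path",
        (system, slug),
    )
    return [row[0] for row in cursor.fetchall()]
```

Note that one slug can return several rows, which matters later on.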

And it sucked.

I write a slug to a card, it launches the wrong game. I try the card on another device, it can't find anything. It's not good, it turns out the problem is much more complex.

Fuzzy Matching

Fuzzy matching is a critical part of this whole idea. If you have a file called Super Mario World.sfc and Super Mario World (US).sfc you would know intuitively these are the same game, either one could be launched in pursuit of your prime directive: play Super Mario World until that first water stage and I die and then go and play something else cause this game is the worst and unfair.

The existing slug system actually handled this fine, it had some decent handling of ROM tags and could strip them out. So that was actually working! But now I use the card on my Pi where I've just copied over a bunch of Best Of lists, Super Mario World is there, but the filename is 01. Super Mario World (USA).sfc. Tap tap tap, not working. The slug on this device is 01supermarioworld but the card has supermarioworld written to it. They're "different games" according to Zaparoo.

Another example, I want to make an Ocarina of Time card, on system A the file is The Legend of Zelda - Ocarina of Time.z64 which makes the slug thelegendofzeldaocarinaoftime. Looking good. I go and tap the card on system B, nuts, some JOKER has called the file Legend of Zelda, The - Ocarina of Time and now it's not working. This is the worst.

Game filenames are all over the place, even for curated packs like No-Intro and TOSEC. They have conflicting standards, they have mistakes, and all bets are off when the game is from some random site on its own.

The slugs were supposed to accommodate for this, and they do to an extent, but it turned out that wasn't hitting my "good enough" meter yet. That ended up putting me down a whole other rabbit hole of...

Natural Language Parsing

But not like real natural language parsing, we're on a budget here, I need to shoot through 500,000 titles as quickly as possible on one core of a potato CPU. Instead, I need to try and work out what are the most common quirks of English and filename formats and see how we can accommodate for them.

This also brings up two major disadvantages of this whole approach:

  1. By design, this system can never match game titles between languages. But the slug system does allow a path forward for matching to an external scraper without hashing, and then it could match between languages that way.
  2. All the tricks I'm about to describe are for English only. There's nothing to say we couldn't add more advanced processing for other languages, but I only speak English and would need help. Thankfully English is pretty useful for this anyway.

So that's a bummer, but it's just how it goes. We can still get a tonne of value out of this.

At this point, you can imagine a montage of me furiously googling for research papers and staring at DAT files. Some of the many many things now handled during slugging:

  • A lot of Unicode normalisation, turning all sorts of weird characters into ASCII equivalents and classifying when non-English characters are used.
  • Identifying secondary titles like the "Ocarina of Time" in "LoZ: Ocarina of Time" and even things like "Disney's Aladdin" vs "Aladdin".
  • Stripping articles, converting ampersands, handling the difference between a + meaning "and", or a + meaning super good version.
  • Stripping suffixes like "edition" and "version" so "Pokemon Red" and "Pokemon: Red Version" are equivalent.
  • Expanding common abbreviations in game titles.
  • Converting (some! you can't even just do all of 'em) roman numerals and number words.
  • Detecting if leading numbers are part of a "best of" folder or part of the name.

And many many more weird little fiddly annoying things. There's so many things and they just get crazier the deeper you go. They can be very difficult to detect without context too. It was a struggle to know when to say when, but I think I have a pretty decent and fast set of rules now.
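A handful of these rules can be sketched in Python like this. It is a deliberately blunt illustration with made-up names and tiny word lists, nowhere near the full rule set: for example, it drops articles anywhere in the title and only converts a few roman numerals.

```python
import re
import unicodedata

# Only a few numerals: treating every "I", "V" or "X" as a number is unsafe.
ROMAN = {"ii": "2", "iii": "3", "iv": "4", "vi": "6", "vii": "7", "viii": "8", "ix": "9"}
ARTICLES = {"the", "a", "an"}
SUFFIXES = {"version", "edition"}

def fuzzy_slug(filename):
    """Build a slug that tolerates some common filename variations."""
    # Fold accented characters to ASCII, e.g. "Pokémon" becomes "Pokemon".
    text = unicodedata.normalize("NFKD", filename)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Drop a short file extension and any bracketed ROM tags.
    text = re.sub(r"\.[A-Za-z0-9]{1,4}$", "", text)
    text = re.sub(r"\s*[\(\[][^\)\]]*[\)\]]", "", text)
    # Drop a "best of" list prefix such as "01. " or "12 - ",
    # but leave titles like "1942" or "3 Ninjas" alone.
    text = re.sub(r"^\d{1,3}(\.\s+|\s+-\s+)", "", text)
    text = text.lower().replace("&", " and ")
    words = re.findall(r"[a-z0-9]+", text)
    # Dropping articles everywhere also covers "Legend of Zelda, The".
    words = [ROMAN.get(w, w) for w in words
             if w not in ARTICLES and w not in SUFFIXES]
    return "".join(words)
```

With these rules, "01. Super Mario World (USA).sfc" and "Super Mario World.sfc" produce the same slug, as do "The Legend of Zelda - Ocarina of Time.z64" and "Legend of Zelda, The - Ocarina of Time".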

I try it all again and... it still kinda sucks? I mean it's way better, and I have these beautiful slugs now that match so many variants, but it led me to a new problem.

Re-framing Slugs

Since the beginning I've been hyper-focused on using slugs as a stand-in for hashes or an external API ID. Which means the slugs are supposed to be as unique as possible. That is basically in direct conflict with the goal of fuzzy matching.

I've been adding rules to normalise slugs and make them fuzzier, which is aiding the goal of grouping filenames to games, but it's making the slugs so generic that they can match too many things for me to reasonably filter through when it's time to launch.

It also brings up an even worse problem for me, if slugs are used as public IDs, I need to make sure my normalisation process is absolutely PERFECT before this gets released. Once it's released, the process of generating slugs is now baked in, and I can't change it without breaking people's setups. That's a lot of pressure and was a big part of me fussing over the process to generate them so much.

After thinking about this for a while I thought: OK... what if we ditch using slugs as public IDs, and we stuck to using regular game titles as the ID instead? It seems very obvious now but it was a pretty big deal when I realised we could do that, and suddenly everything fell into place: we use slugs as a cached search helper, but use the game's original title as the ID in full natural language.

Natural Language Titles

When I say this, what I mean is we literally write to the NFC card something like:

  • SNES/Super Mario World
  • Sega Genesis/Sonic the Hedgehog (USA)
  • DOS/Monkey Island 2 - LeChucks Revenge

Once we start treating slugs as an internal value just to help with fuzzy matching, there are instantly some great benefits:

  • The entire slugging and matching process can now be tweaked with far less risk of breaking things. The title on the NFC card never has to change.
  • The title can be used to disambiguate the matches, we lose none of the original context. For example let's say the title on disk is Monkey Island II but we scan the third example above, it's trivial now to match from context, whereas before they had totally different slugs.
  • IDs can be written by hand. I mean look at it, you could write any old random game you want from memory and you've got a pretty decent shot at it working perfectly as a valid ID.
  • Following from that, IDs can be generated independently. Some random third party website can use results from a source like IGDB to create a "Zaparoo card ID" with only the most basic understanding of the format, and no need to directly connect to a device.
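Here is a rough Python sketch of how a title-based ID could be resolved, with the slug as an internal search key and the original title used to pick between candidates. Everything in it is an assumption for illustration: the function names, the simplified slug rules, the prefix fallback, and the use of difflib for scoring are not the actual matching process.

```python
import re
from difflib import SequenceMatcher

ROMAN = {"ii": "2", "iii": "3", "iv": "4"}

def slug(title):
    """Simplified internal slug: strip tags, lowercase, convert a few numerals."""
    text = re.sub(r"\s*[\(\[][^\)\]]*[\)\]]", "", title.lower())
    return "".join(ROMAN.get(w, w) for w in re.findall(r"[a-z0-9]+", text))

def build_index(library):
    """Map (system, slug) to every title on this device sharing that slug."""
    index = {}
    for system, title in library:
        index.setdefault((system, slug(title)), []).append(title)
    return index

def find_game(card_text, index):
    """Resolve card text like 'SNES/Super Mario World' to a local title."""
    system, _, wanted = card_text.partition("/")
    key = slug(wanted)
    if not key:
        return None
    candidates = index.get((system, key), [])
    if not candidates:
        # Fallback: accept slugs where one is a prefix of the other.
        candidates = [
            title
            for (sys_name, k), titles in index.items()
            if sys_name == system and k and (k.startswith(key) or key.startswith(k))
            for title in titles
        ]
    if not candidates:
        return None
    # The full title from the card breaks ties, so tags like (USA) still count.
    return max(
        candidates,
        key=lambda t: SequenceMatcher(None, wanted.lower(), t.lower()).ratio(),
    )
```

Because the card only stores the system and title, the slug and fallback rules in a sketch like this could be changed later without rewriting any cards.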

So there we go, it's such a simple little idea and seems pretty obvious in hindsight (use the game title to match the game by title) but I've been super happy with this approach so far and it's definitely hitting my "good enough" threshold.

Moving On

This is actually only half of the process, there's still:

  • The entire matching process afterwards.
  • There's a brand new tags system which is insanely cool.
  • A road map for handling metadata scraping at last.

I'll stop here though because this is already so long, and I intend to do a second post going through the rest.

Hope someone out there found this interesting! Thanks for reading.