!!Con West 2020 - Spencer Alves: Reverse engineering game formats for fun and profit! (or just fun)

Mar 20, 2020 18:52 · 1797 words · 9 minute read

Okay. Hi! So I have a lot of stuff to go through. So I hope that you’ve read the summary on the website. Quick disclaimer before we get started: Most of my experience is with games with file systems. So games on discs, or downloadable PC games. That’s not to say that none of this is applicable to, you know, games on cartridges or even things that aren’t games. I hope the techniques are general.

00:57 - First of all, why is this something you want to get into? Well, reasons from least useful to most useful. Least useful being showing off. These are some cool websites I like. Noclip.website is a place where you can fly around the levels from some 3D games. SilvaGunner is a YouTube channel that does spoofs of video game soundtracks. And a lot of their sources are from actually the game data itself. Second reason is modding. If you get a good enough understanding of these data formats, you can convince the program that the data you make is what it expects and you can put your own content in there! Another reason is engine recreation.

01:41 - So a lot of old games don’t run so well on modern systems anymore. So some enterprising people have taken it upon themselves to rewrite them from scratch. Given a good enough understanding of the formats they use. Finally, the most useful thing to get out of this is learning. You learn a lot about how these game engines work, how they’re put together, the engineering that goes behind them, and it’s valuable information for even your day-to-day programming. What have I done with this? Well… This is supposed to be Mario doing a victory pose, but let me know after the talk if you’re good with matrix math. Before you get started, even doing this, you want to do some research, because the older and more popular the game is, the more likely someone has already done the work for you, or at least most of the work. So here are some search terms. Another thing you want to think about while you’re doing this work is: Keep in mind the context. What engine does it use? That’s useful because engines generally use the same format.

02:52 - So you know if you look up another game that uses the same formats, someone may have done the work for you. What platform was it made for? That can determine things like endianness, which we’ll get to later. And who developed it? This actually goes back to the first bullet point. Even if they don’t say they’re using a specific engine, if a developer made two games in quick succession, it’s likely that they at least had some code in common. Another thing you can get out of who developed it is… The environment it was developed in.

03:32 - Like, a lot of Japanese developers at least used to like to use Shift JIS instead of Unicode. So if you’ve never seen a hex editor before, this is what it looks like. Left column is position, middle is the actual contents of the file in hexadecimal, the right column is the same content as the file, but interpreted in ASCII, with little dots for things that aren’t printable in ASCII, and bottom there is – I should be highlighting these. That’s position, contents, ASCII. That’s the… I call it decode view. It’s the contents next to the cursor, just interpreted in different data formats. Mine doesn’t have a string view, but some others do. Like floating point. Things like that. I’m gonna give you this example now.
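The three-column view described above (position, hex contents, ASCII with dots for unprintable bytes) can be sketched in a few lines. This is only an illustration of the layout, not the editor used in the talk; the function name `hex_dump` and the sample bytes are made up for the example.

```python
def hex_dump(data: bytes, width: int = 16) -> str:
    """Render bytes as 'position  hex bytes  ASCII' rows, like a hex editor."""
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02X}" for b in chunk)
        # Printable ASCII is 0x20..0x7E; everything else is shown as a dot.
        ascii_part = "".join(chr(b) if 0x20 <= b <= 0x7E else "." for b in chunk)
        # Pad the hex column so the ASCII column lines up on short rows.
        rows.append(f"{offset:08X}  {hex_part:<{width * 3 - 1}}  {ascii_part}")
    return "\n".join(rows)


# Made-up sample: a 4-byte tag, a small integer, a float, and some text.
sample = b"DEMO\x03\x00\x00\x00\x00\x00\x80\x3fHello, world!\x00"
print(hex_dump(sample))
```

Running it on a real game file instead of `sample` gives the same kind of view a hex editor shows, minus the decode panel.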

04:19 - By the end of the talk, hopefully I’ll have given you enough information that you can figure out what this actually is. Or at least what type of file it is. I’ll give you two little hints now. One is these two columns. And then the other one is the numbers on the right have a certain pattern to them. Give you a few seconds. Integers are pretty straightforward. But they can have… You have to keep two interesting things about them in mind. One is endianness. They can go either way. As for why you would want something little endian… Go ask Intel. They can also be signed or unsigned. That’s not what I mean. Signed or unsigned. That means whether or not it has a plus or a minus in front of it.
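To make endianness and signedness concrete, here is a small sketch that reads the same four bytes four different ways using Python's standard `struct` module. The helper name `decode_int32` and the sample bytes are assumptions for illustration; they do not come from the talk's slides.

```python
import struct


def decode_int32(raw: bytes) -> dict:
    """Interpret 4 bytes as a 32-bit integer in every endian/sign combination."""
    # "<" = little-endian, ">" = big-endian; "I" = unsigned, "i" = signed.
    return {
        "little_unsigned": struct.unpack("<I", raw)[0],
        "little_signed": struct.unpack("<i", raw)[0],
        "big_unsigned": struct.unpack(">I", raw)[0],
        "big_signed": struct.unpack(">i", raw)[0],
    }


# 01 00 00 00 is 1 in little-endian but 16777216 in big-endian.
print(decode_int32(bytes([0x01, 0x00, 0x00, 0x00])))

# FF FF FF FE read big-endian is either 4294967294 (unsigned) or -2 (signed).
print(decode_int32(bytes([0xFF, 0xFF, 0xFF, 0xFE])))
```

When only one of the four readings gives a plausible value, that is usually a good hint about how the file was written.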

05:12 - If you see a number starting with a bunch of Fs in hexadecimal, that’s less likely a really really big number and more likely a small negative number. Floating point is harder to read in hex view than integers. But two hints I can give you. One is be looking out for that 803F or 3F80 -- that second line in the table there – because 1.0 is really really common. And in fact, as a kind of extension to that, as humans, we like to use human-sized numbers. So the exponents – those last two – that last byte – is gonna be common for a lot of them.
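The 803F / 3F80 hint can be checked directly: packing 1.0 as a 32-bit IEEE 754 float gives the bytes 00 00 80 3F in little-endian order and 3F 80 00 00 in big-endian order. The sketch below prints a few other everyday values as well, so the similarity in the exponent byte is visible. The helper name `float32_hex` is made up for this example.

```python
import struct


def float32_hex(value: float, little_endian: bool = True) -> str:
    """Return the four bytes of a 32-bit float as they would appear in a hex view."""
    fmt = "<f" if little_endian else ">f"
    return struct.pack(fmt, value).hex(" ").upper()


for value in (1.0, -1.0, 0.5, 2.0, 100.0):
    le = float32_hex(value)
    be = float32_hex(value, little_endian=False)
    print(f"{value:>6}  LE: {le}  BE: {be}")

# The same four bytes read as a float versus as an unsigned integer:
raw = bytes.fromhex("0000803F")
print(struct.unpack("<f", raw)[0], struct.unpack("<I", raw)[0])
```

In the little-endian column the last byte stays within a narrow range for these values, which is the pattern to look for when scanning a hex view for floats.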

05:51 - If you pop it into the decode view, and you see something like 3x10^38 or something, or 3x10^200000, these game developers don’t normally need to count the number of cows under the sun. So they’ll usually be human-sized numbers. As for where things actually go in the file, the obvious answer is just put one thing after another. Like in a C struct in memory. Here’s an example of that. It gets a little bit more complicated when you have arbitrarily sized arrays or complex data types. One way to do that is just have – one way to do a dynamically sized array, for example, is to have just the number of things, and then that number of things immediately following.
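A minimal sketch of the two layouts just described: fixed fields stored one after another like a C struct, followed by a count-prefixed array. The record layout here (a uint32 id, two floats, then a uint32 count and that many uint16 values, all little-endian) is entirely hypothetical and chosen only to show the technique.

```python
import struct


def parse_record(data: bytes):
    """Parse a hypothetical record: fixed fields, then a count-prefixed array."""
    # Fixed-size part: one uint32 id and two float32 values, back to back.
    fixed_format = "<Iff"
    rec_id, x, y = struct.unpack_from(fixed_format, data, 0)
    offset = struct.calcsize(fixed_format)

    # Dynamically sized part: a uint32 count, then that many uint16 values.
    (count,) = struct.unpack_from("<I", data, offset)
    offset += 4
    values = struct.unpack_from(f"<{count}H", data, offset)
    return rec_id, (x, y), list(values)


# Build a sample record in memory and read it back.
sample = struct.pack("<IffI3H", 7, 1.0, 2.0, 3, 10, 20, 30)
print(parse_record(sample))
```

The "<" prefix turns off the alignment padding that a C compiler might insert, so a real format may need explicit pad bytes in the format string.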

06:42 - But there are some other things you can do. For example, pointers. In a real running program, you can just allocate some memory and have a pointer to it. In a file, you actually have to have it all in that linear space. You can still do the same thing. So here’s an example. Ignore the top part. That’s the header. But you see immediately after the header you have these two numbers, which are little-endian, but if you flip it around, it points to other positions in the file. So that way this example is showing you – at this point in the file are the – I think it’s like positions, and at this point in the file, the vertex normals, something like that.
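File "pointers" of this kind are just offsets from the start of the file. The sketch below assumes a made-up layout (a 4-byte header, then two little-endian uint32 offsets, then the data they point to) to show how following such offsets looks in code. The names `read_sections` and `DEMO` are illustrative only.

```python
import struct

HEADER_SIZE = 4  # hypothetical: a 4-byte magic tag


def read_sections(data: bytes):
    """Follow two offsets stored right after the header to the data they point at."""
    positions_off, normals_off = struct.unpack_from("<II", data, HEADER_SIZE)
    # For the sketch, read a single 3-float vector at each target offset.
    position = struct.unpack_from("<3f", data, positions_off)
    normal = struct.unpack_from("<3f", data, normals_off)
    return position, normal


# Sample file: header (0..3), offsets (4..11), one position (12..23), one normal (24..35).
sample = (
    b"DEMO"
    + struct.pack("<II", 12, 24)
    + struct.pack("<3f", 1.0, 2.0, 3.0)
    + struct.pack("<3f", 0.0, 1.0, 0.0)
)
print(read_sections(sample))
```

A quick sanity check when guessing that a value is an offset: it should be smaller than the file size and usually land on a 4-byte boundary.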

07:21 - Another way you can have things laid out in the file is with chunks. Not that kind of chunk. And this is more common for really complex formats, like documents or movie files that need to have a whole bunch of different types of things all in one file. Usually you’ll have a 4-byte or 8-byte or however many byte identifier for what type each chunk is and a few bytes for how long the chunk is. You can see each of these chunks is 0x40 bytes, and that exactly lines up with four rows in our hex dump here. So the example that I gave you at the beginning -- I’ll give you the hints again. That is this column.
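A chunked file can be walked with a short loop: read the identifier, read the length, skip that many bytes, repeat. This sketch assumes a 4-byte tag followed by a little-endian uint32 length that counts only the payload. Real formats vary (some lengths include the chunk header, some pad to even sizes), so treat those details as assumptions to verify against the actual file.

```python
import struct


def iter_chunks(data: bytes):
    """Yield (tag, payload) pairs from a simple tag + length + payload stream."""
    offset = 0
    while offset + 8 <= len(data):
        tag, length = struct.unpack_from("<4sI", data, offset)
        payload = data[offset + 8 : offset + 8 + length]
        yield tag, payload
        offset += 8 + length  # jump to the start of the next chunk


# Sample stream with two made-up chunk types.
sample = (
    b"NAME" + struct.pack("<I", 5) + b"mario"
    + b"VERT" + struct.pack("<I", 12) + struct.pack("<3f", 1.0, 2.0, 3.0)
)
for tag, payload in iter_chunks(sample):
    print(tag, len(payload))
```

Because each chunk carries its own length, a reader can skip chunk types it does not understand yet and still parse the rest of the file.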

08:01 - And actually, I’m gonna give you a little bit more information. All of these in the left column are floating point numbers. And everything on the right side are integers that are increasing. Now, anybody have any ideas as to what this is more specifically? Yes, this is a 3D mesh. Nice job. Each of those floating point numbers is a 3D position and each of those indices is – well, numbers is an index into these positions. You get three positions, and they make a triangle. And you get a mesh. Yeah. So one thing I want to point out is the numbers on the right are generally increasing. Because that’s just the way, like, indices that are automatically generated tend to be laid out. Other small things – indices I just talked about. Opcodes we heard about in earlier talks. So they’re useful for programs. They’re also useful for things like music. Images are really fun in a hex editor.
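The mesh layout described above, floats grouped into 3D positions and integers indexing into them three at a time, can be sketched as follows. The choice of little-endian float32 positions and uint16 indices is an assumption for the example; the talk does not state the exact widths used in the file on the slide.

```python
import struct


def read_triangles(position_bytes: bytes, index_bytes: bytes):
    """Turn raw position and index buffers into a list of triangles."""
    # Every 3 floats form one (x, y, z) position.
    floats = struct.unpack(f"<{len(position_bytes) // 4}f", position_bytes)
    positions = [floats[i:i + 3] for i in range(0, len(floats), 3)]

    # Every 3 indices pick the corners of one triangle.
    indices = struct.unpack(f"<{len(index_bytes) // 2}H", index_bytes)
    return [
        tuple(positions[i] for i in indices[t:t + 3])
        for t in range(0, len(indices) - 2, 3)
    ]


# Four corners of a unit square, split into two triangles.
position_bytes = struct.pack(
    "<12f",
    0.0, 0.0, 0.0,
    1.0, 0.0, 0.0,
    0.0, 1.0, 0.0,
    1.0, 1.0, 0.0,
)
index_bytes = struct.pack("<6H", 0, 1, 2, 1, 2, 3)
for triangle in read_triangles(position_bytes, index_bytes):
    print(triangle)
```

Note how the sample indices are generally increasing, the same pattern the hint on the slide points at.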

08:57 - Because you can resize it as big as the image and view it in hex view. Audio samples tend to go up and down. Encryption and compression are really easy to recognize but really hard to decode. Because they both kind of have this similar goal of making data look random. Right? Compression needs to get as much data into as little space as possible. And encryption – its actual goal is to make it look random. So I don’t have any good solutions for that right now. Other techniques you can use, other than just looking at the hex and figuring it out from there – the first bullet point: in the last example I showed you, we can recognize the different data types and then you might look at the beginning to see if there are any pointers to that, or you can look the other way. You can look at the beginning and there are things that look like pointers, and so later in the file you can look at where those point to. The next three bullet points, the ones that start with a C, are useful if you already have a program that can go one direction or the other. For, like, comparing with your own program.
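One rough way to put a number on "looks random" is the Shannon entropy of the byte values. This is not a technique from the talk, just a common heuristic offered as an illustration: values close to 8 bits per byte suggest compressed or encrypted data, while structured data tends to score lower. It cannot tell compression and encryption apart.

```python
import math
import os
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 for constant data, at most 8.0."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


structured = b"\x00\x00\x80\x3f" * 1024   # the float 1.0 repeated
random_like = os.urandom(65536)           # stand-in for encrypted data

print(f"structured:  {byte_entropy(structured):.3f} bits/byte")
print(f"random-like: {byte_entropy(random_like):.3f} bits/byte")
```

Computing this over sliding windows of a file, rather than the whole thing, can help spot where a compressed region starts and ends.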

10:01 - The second to last one, debugging, is useful if you have a debugger, or also an emulator. And then as a last resort, is actually reversing the executable. And that could be a whole separate topic. That could actually be a college course. Not a 10-minute talk. So I’m not gonna go into it here. And the other reason I would consider it my last resort is in the first slide, I talked about engine recreations. If you… How do I put this? If you look at the actual assembly code of some executable, and you try to recreate it, you can’t prove that you’re not just copying it. So that gets to… Legally gray. If there’s one thing I want you to take away from this talk, it’s that there’s no such thing as an unknown, unstructured blob of binary data.

10:51 - I mean, these files are written by people, right? People put those numbers there for a reason. And if you’re adventurous enough, you can find out why. Here are some resources. Thanks for coming to my first ever tech talk!