Nice website and account. This sort of basic ROM manipulation was a huge gateway into low-level programming for me and many others. For an aspiring teenage programmer, it is immensely satisfying to be able to open a ROM in a hex editor and get Marin to greet Link by saying your crass phrases of choice at the start of Link's Awakening.
Surprised the start of the article neglects to mention the biggest hidden feature of the game, though: the fact that it also contains a hidden port of "The Lost Levels" (unlocked after getting a high enough score in the main game, IIRC). Super Mario Bros Deluxe is a stellar example of a great port; it's clear that a huge amount of thought and effort went into making a high-quality version for the GBC. Glad to see it getting a bit of attention these days, when so many ports of old games are lazy cash grabs, inferior to the original releases.
Libraries that don't do point-release security updates for major versions should not be trusted or used in production.
Libraries that do release security updates but introduce new language features like generics in those point releases also shouldn't be trusted, and have no place in production. Why should I have to upgrade my language version just to get a security fix?
What about the file in the article makes it a "non-ELF binary"? The only thing I can think of is putting junk data in place of bytes the ELF spec designates as "padding" and expects to be 0. Other than that, it seems totally reasonable that putting garbage in place of a section-header offset (when there are no section headers), a physical address, and an alignment field doesn't stop the file from being an acceptable ELF.
It's entirely on the Linux kernel to not verify these fields. However, its failure to verify these doesn't make the file not an ELF. It just makes it an ELF with a stupid alignment requirement that Linux happens to ignore.
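For reference, e_shoff and the section-header size/count fields all sit at fixed offsets in the ELF64 file header, so it's easy to check what any given binary has in them. A quick Go sketch of mine (not from the article) that dumps them, assuming a little-endian ELF64 file:

    package main

    import (
        "encoding/binary"
        "fmt"
        "os"
    )

    // elf64Header mirrors the 64-byte ELF64 file header; field names follow
    // the spec's e_* fields.
    type elf64Header struct {
        Ident     [16]byte // magic, class, data, version, OS/ABI, ABI version, 7 padding bytes
        Type      uint16
        Machine   uint16
        Version   uint32
        Entry     uint64
        Phoff     uint64 // program header table offset
        Shoff     uint64 // section header table offset (junk in the tiny binary)
        Flags     uint32
        Ehsize    uint16
        Phentsize uint16
        Phnum     uint16
        Shentsize uint16 // size of a section header entry
        Shnum     uint16 // number of section header entries
        Shstrndx  uint16
    }

    func main() {
        f, err := os.Open(os.Args[1]) // usage: elfhdr <file>
        if err != nil {
            panic(err)
        }
        defer f.Close()

        var hdr elf64Header
        if err := binary.Read(f, binary.LittleEndian, &hdr); err != nil {
            panic(err)
        }
        fmt.Printf("e_shoff=%#x e_shentsize=%d e_shnum=%d\n",
            hdr.Shoff, hdr.Shentsize, hdr.Shnum)
    }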
Better yet, another commenter [1] found that you can clobber the number of section header entries, as long as the size of a section header entry is 0. So, now the smallest size is two bytes shorter: 112 bytes for a full "Hello, world!", with an 8-byte "alignment" field to spare!
I'll need to update this article. The only annoying part will be scribbling over the hexdump output again.
I wonder if there are some nice tools for "scribbling over hexdump" output somewhere, ideally ones that also render pretty output based on the annotations. That tends to be really helpful both when synthesizing/assembling binary formats and when debugging/decoding/disassembling existing ones (and then ideally also writing blog posts based on that). I saw an "annotation" tool like this in one disassembler I tried once, but it wasn't super great, and didn't allow easy tweaking and moving of annotation groups after making changes to the output.

I'm pretty sure this is something reverse-engineering people do very often, so I'd assume tools like this are already popular; I just don't know how to find them. I know Wireshark also has a Lua API with support for dissecting many protocols, but I don't suppose it's easy to prototype and quickly iterate on new formats with it?
Interesting! So, it looks like you can clobber the number of section header entries, because, with this alignment, the _size_ of a section header is 0. Cool!
Odd. For some reason, my version of nasm didn't do that, and instead opted for the lengthier 10-byte instruction shown in the article's objdump output. Maybe it's just an older version of nasm.
Eh, maybe they don't quite have the same feel they did when they had more regular, long episodes, but they still made the "Missing Hit" episode after being bought, which was excellent and widely popular. The recent-ish QAnon episode was also decent and interesting, IMO.
Yes, those are both good episodes, although they had at least one better QAnon episode before that. There are just so many "leave us a message" shows recently. TFA predates the merger, but it too is an example of a "history" episode that plays it too straight: this happened, then this other thing happened. It's not as good as RA's more "mysterious" episodes.
Support for compiling PyTorch on AMD has already been upstreamed. The PyTorch source tree ships with a script for building a HIP version (basically, the script converts PyTorch's CUDA code to HIP and adjusts some build settings). So, if you're running a system with ROCm installed and are willing to compile PyTorch from source, you can run PyTorch on AMD (admittedly that's several caveats, and you're also limited to Linux in order to use ROCm).
I'd say it's likely because NVIDIA's CUDA compiler produces code for NVIDIA GPUs, and AMD would have a lot of (questionably legal) reverse engineering ahead of them in order to support the same code on their own GPUs.
If you're asking why AMD doesn't make a compiler for CUDA source code that targets their own GPUs--that's basically what ROCm currently does. They're pushing their CUDA alternative, called "HIP", which is essentially just CUDA code with a find-and-replace of "cuda" with "hip". (And other similarly minor changes.) They have an open source "hipify" program that does this automatically (https://github.com/ROCm-Developer-Tools/HIP/tree/master/hipi...).
So, basically, AMD GPUs are already sort of CUDA compatible: just run your CUDA code through hipify, then compile it with the HIP compiler and run it on a ROCm-supported system (which, for now, is the spottiest of these steps, IMO).
The "image" interface is one of my favorite parts of Go's standard library, and I think it's one of the best showcases of Go's "interface" feature. Want to save something as an image? Just implement three functions, two of which are trivial boilerplate and the third of which is just "return the color at this coordinate". You don't even need to manage the buffer of image data yourself this way. For example, if you want an image full of noise, just make your "get color" function return a random value every time it's called. I've used this myself for things like simple fractals.
And, on top of all that, once you've written code satisfying the image interface, the standard library even includes functions for saving your image in one of several image formats. And because the interface itself is specified by the standard library, virtually all third-party image encoding and decoding libraries for Go use it, too. So every image reader or writer I've seen for Go, even third-party ones, can be a drop-in replacement for another.
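To make that concrete, here's roughly what the noise example looks like end to end (a quick sketch; the noiseImage type is mine, the rest is just the standard image, image/color, and image/png packages):

    package main

    import (
        "image"
        "image/color"
        "image/png"
        "math/rand"
        "os"
    )

    // noiseImage satisfies image.Image without holding any pixel buffer.
    type noiseImage struct {
        w, h int
    }

    func (n noiseImage) ColorModel() color.Model { return color.RGBAModel }
    func (n noiseImage) Bounds() image.Rectangle { return image.Rect(0, 0, n.w, n.h) }

    // At returns a fresh random gray value every time it's called.
    func (n noiseImage) At(x, y int) color.Color {
        v := uint8(rand.Intn(256))
        return color.RGBA{v, v, v, 255}
    }

    func main() {
        f, err := os.Create("noise.png")
        if err != nil {
            panic(err)
        }
        defer f.Close()
        // png.Encode only needs an image.Image, so noiseImage works as-is.
        if err := png.Encode(f, noiseImage{256, 256}); err != nil {
            panic(err)
        }
    }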
Anyway, it's not Go's standard use case, but as someone who loves fractals and fiddles with images all the time it's one of my favorite parts of the language.
I like the simplicity of this (very Go-like). However, what are the performance implications of only being able to get one pixel value at a time? Wouldn't it be much less efficient than say "get this line of pixels" or "get this rectangle of pixels"?
You'll get the overhead of a non-inlineable vtable-based method call for each pixel. How badly that hurts you depends on the ratio of expense of that call vs. how expensive the pixel is to generate. If you've already got all your pixel values manifested in memory and you're just indexing into an array, it's going to be fairly expensive. If you're generating noise with a random number generator, it's going to be noticeable but not necessarily catastrophic (since "generating a random number" and "making a method call" are somewhat comparable in size, varying based on the number generator in question). If you're generating a fractal the overhead will rapidly be lost in the noise.
But I'd also point out that the Go standard library does not necessarily claim to be the "last word" for any given task; it's generally more of an 80/20 sort of thing. If you've got a case where that's an unacceptable performance loss, go get or write a specialized library. There's nothing "un-Go-ic" about that.
I would expect the vtable-based dispatch to be handled quite well by branch prediction. And surely the cache misses from those nested loops would have a much worse impact: even if a few extra instructions have to run per pixel, it's still going to be quicker than a fetch from main memory.
In general, I tend to agree there's a lot of people who have kind of picked up "vtables always bad and slow" and overestimate the actual overhead.
But I have actually benchmarked this before, and it is possible to have a function body so small (like, for example, a single slice array index lookup and return of a register-sized value like a machine word) that the function call overhead can dominate even so.
(Languages like Erlang and Go that emphasize concurrency have a constant low-level stream of posts on their forums from people who do an even more extreme version, when they try to "parallelize" the task of adding a list of integers together, and replace a highly-pipelineable int add operation that can actually come out to less than one cycle per add with spawning a new execution context, sending over the integers to add, adding them in the new context, and then synchronizing on sending them back. Then they wonder why Erlang/Go/threading in general sucks so much because this new program is literally hundreds of times slower than the original.)
But it is true this amortizes away fairly quickly, because the overhead isn't that large. Even one of the larger random number generators, like the Mersenne Twister, will go a long way towards dominating the function call overhead. I don't even begin to worry about function call overhead unless I can see I'm trying to do several million calls per second, because generally you can't do several million per second on a single core anyway: the function bodies themselves are too large and do too much stuff, so even if function call overhead were zero it would still be impossible in the particular program.
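If you want to see the "function body smaller than the call" case for yourself, a toy benchmark along these lines shows it (the names are mine; drop it in a _test.go file, and note that whether the compiler devirtualizes the interface call depends on the Go version):

    package pixels

    import "testing"

    // pixelSource is a stand-in single-method interface; At is deliberately
    // tiny so the dynamic call dominates the work.
    type pixelSource interface {
        At(i int) byte
    }

    type sliceSource []byte

    func (s sliceSource) At(i int) byte { return s[i] }

    var sink byte // keep results alive so the compiler can't drop the loops

    func BenchmarkDirectIndex(b *testing.B) {
        s := make(sliceSource, 1<<16)
        for i := 0; i < b.N; i++ {
            sink = s[i&(1<<16-1)] // plain bounds-checked load
        }
    }

    func BenchmarkInterfaceCall(b *testing.B) {
        var src pixelSource = make(sliceSource, 1<<16)
        for i := 0; i < b.N; i++ {
            sink = src.At(i & (1<<16 - 1)) // same load behind a dynamic call
        }
    }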
In gaming, they're at just 1.61%, right below 1280x1024, and the increase is so low (+0.01) that it might as well be zero (compare with 1080p's +2.04%, which is the resolution increasing the most).
Tech-minded people are a bubble, gamers are a tiny bubble among those, and /r/pcmasterrace 4K-or-die boasters are a tiny bubble among gamers. 4K, or even 1440p, matters way less in practice than tech-minded people think.
There is basically no upper bound to the display resolution people want, even if their eyes can't physically resolve it. Graphic designers and gamers will still swear there's a difference. It's like audiophilia for the visual system.
At a certain point, people will prefer larger or additional monitors over increased resolution. If I could buy two 16k monitors instead of one 32k monitor, I'd do it, and that puts a soft upper limit on resolution.
Again, where's the proof that the performance is a problem? The standard library should solve for the 80% case. I suspect it is well within "fast enough."
This is another way that Go's approach works well. In practice, you will see library code check for common image formats, and then dispatch to optimized code. The advantage here being that optimization is a library concern. Callers still maintain flexibility.
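The pattern usually looks something like this (blur and its helpers are made-up names, just to show the shape):

    package imgfast

    import "image"

    // blur dispatches to an optimized path when it recognizes the concrete
    // type, and falls back to the generic interface otherwise.
    func blur(img image.Image) {
        switch m := img.(type) {
        case *image.RGBA:
            blurRGBA(m) // fast path: work on m.Pix directly
        default:
            blurGeneric(m) // slow path: per-pixel At() through the interface
        }
    }

    // Hypothetical helpers, sketched as stubs.
    func blurRGBA(m *image.RGBA)    { /* operate on m.Pix */ }
    func blurGeneric(m image.Image) { /* call m.At(x, y) per pixel */ }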
You can certainly operate directly on pixel arrays as well. When the original example here is changed to operate on .Pix instead of using .At(), it runs about 2x faster.
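The original snippet isn't reproduced here, but the shape of the change is roughly this (a sketch, not the article's code; the direct version assumes a plain *image.RGBA from image.NewRGBA, so Stride == 4*width):

    package imgsketch

    import (
        "image"
        "image/color"
        "image/draw"
    )

    // Generic path: one dynamic At and one dynamic Set per pixel.
    func invertGeneric(img draw.Image) {
        b := img.Bounds()
        for y := b.Min.Y; y < b.Max.Y; y++ {
            for x := b.Min.X; x < b.Max.X; x++ {
                r, g, bl, a := img.At(x, y).RGBA() // 16-bit-per-channel values
                img.Set(x, y, color.RGBA64{
                    R: uint16(0xffff - r),
                    G: uint16(0xffff - g),
                    B: uint16(0xffff - bl),
                    A: uint16(a),
                })
            }
        }
    }

    // Direct path: walk the backing Pix slice of a concrete *image.RGBA.
    func invertPix(img *image.RGBA) {
        for i := 0; i < len(img.Pix); i += 4 {
            img.Pix[i+0] = 255 - img.Pix[i+0] // R
            img.Pix[i+1] = 255 - img.Pix[i+1] // G
            img.Pix[i+2] = 255 - img.Pix[i+2] // B
            // Pix[i+3] is alpha; leave it alone.
        }
    }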
With a single core of a modern CPU you can do a lot of operations at each pixel of a high-resolution screen and still get a frame rate higher than your monitor can display. And then you typically have a few more cores around to do other stuff.
Unless your language introduces an unreasonable overhead, a for loop over the pixels is perfectly appropriate and fast.
The problem in this case is not the looping over each pixel, but the overhead of invoking a dynamic method on each pixel. For example, if you're iterating over a []byte and setting each value to zero, the compiler can optimize that to a single memclr. Using an interface masks the underlying representation and consequently prevents any sort of inlining.
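For example (byteStore is a made-up interface, just to illustrate the point):

    package zeroing

    // The compiler recognizes this loop and lowers it to a single
    // memclr-style runtime call.
    func zeroSlice(p []byte) {
        for i := range p {
            p[i] = 0
        }
    }

    // byteStore is a single-method interface that hides the underlying
    // representation.
    type byteStore interface {
        Set(i int, v byte)
    }

    // Same work, but each element now costs a dynamic method call and the
    // memclr pattern is no longer visible to the compiler.
    func zeroStore(s byteStore, n int) {
        for i := 0; i < n; i++ {
            s.Set(i, 0)
        }
    }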
> Using an interface masks the underlying representation and consequently prevents any sort of inlining.
This sounds like a limitation of a particular optimizing compiler/interpreter rather than a problem of the language itself. For example, the plain Lua interpreter incurs quite a lot of overhead for this, but the LuaJIT interpreter is largely unaffected. The standard Python interpreter definitely adds a lot of overhead.
Agreed. I was able to implement image.Image for a set of pixels from a Cgo library (written in Rust) and it works cleanly w/ the other Go image tools such as PNG saving[0].
It sounds like a great use of interfaces, but what is Go-specific about it? Couldn't the same thing be done in Java? (And I imagine it probably already has.)
The specific thing, IMO, is the fact that the interface itself is specified in the Go standard library along with plenty of utilities for working with it.
Java does have similar things, but, as far as I know, they're limited to the UI libraries and far more complicated (e.g. https://docs.oracle.com/javase/7/docs/api/java/awt/Image.htm...). Imagine if everyone who read or wrote image files in Java were guaranteed to conform to a dirt-simple "Image" interface with only three methods--with optimizations available but strictly optional.
For someone who's just tinkering or just wants to dump some cool bitmaps, the ability to write an "At(...)" function in Go and be looking at a PNG within a few seconds is just great fun to have in the standard library.