
How many people still write assembly, and for what purposes?


I write assembly most days (I write system libraries--largely floating-point and vector code--for a living). In general the assembly that I write is between 1.5x and 3x faster than what the best compilers currently produce from equivalent C code (obviously for completely trivial tasks, you usually can’t beat the compiler, but those generally don’t need to be abstracted into a library either; there are also outliers where the code I write is orders of magnitude faster than what the compiler produces). For most programmers most of the time, that’s not enough of a payoff to bother. For library functions that are used by thousands of applications and millions of users, even small performance wins can have huge impact.

Partially my advantage comes from the fact that I have more knowledge of microarchitecture than compilers do, but the real key is that I am hired to do a very different job than the compiler is: if your compiler took a day (or even an hour) to compile a small program, you simply wouldn’t use it, no matter how good the resulting code was. On the other hand, for a library function with significant impact on system performance, I can easily justify spending a week “compiling” it to be as efficient as possible.


if your compiler took a day (or even an hour) to compile a small program, you simply wouldn’t use it, no matter how good the resulting code was

Some research has been moving in that direction, since there does seem to be a demand for it. After all, if someone is willing to wait for you to spend a week hand-tuning a function, maybe they'd also be willing to run a compiler in a "take 10 hours to crunch on this" mode. Example: http://blog.regehr.org/archives/923


Oh, absolutely.

That said, it’s also worth keeping in mind that the enormous differences in how computers and expert humans currently approach the problem closely parallel the differences in how computers and expert humans play chess: quickly evaluate billions of possible “moves” vs. quickly identify the few most promising “moves” and then slowly evaluate them to pick the best. I fully expect to be regularly beaten by the compiler “someday”, but (a) I believe that day is still several years off and (b) even then, I expect that expert human + compiler will beat compiler alone, just as in chess.


As a former chess engine author, and current compiler writer, with ideas in search-based compiler algorithms/unbounded compilation times, I hope to accelerate the arrival of that day :)


For most programmers most of the time, that’s not enough of a payoff to bother.

I would argue that for most programmers, they can't beat the compiler; they probably don't have your skills and experience. I'm glad there are still people like you (I've probably used some of your code in embedded projects using VxWorks/PPC), but the truth is that many people like you are writing the compilers (or libraries, as you are). That, and the fact that when you mention profiling, you often get blank stares, plus mentioning algorithmic complexity gets you knotted brows, tends to lead me to believe that the great mass of programmers shouldn't try optimizing, prematurely or otherwise, especially in assembly. Play with it, learn about your whole stack, top to bottom, sure, but very few (such as yourself) can beat a good optimizing compiler.


I haven't dealt with assembly in many years, and if you don't mind my asking, what tools do you use to do this these days?


I don’t think the tools really change; architecture reference manuals, your favorite text editor, and an assembler. Processor simulators are sometimes useful for resolving the most perplexing quandaries.


With the new AVX-512 announcement from Intel, I was trying to use the SDE to see what I could get away with in terms of reducing cycles for certain operations by loading larger values.

Have you used the Intel SDE? Do you know how helpful the -mix histogram is (or isn't)? Or can you point me to general docs on usage?


SDE has never been particularly interesting to me, but I can certainly see that it could be useful for some cases. In general, the simulator I want is one that completely models the CPU and generates detailed traces (like the late great SimG4).


the simulator I want is one that completely models the CPU and generates detailed traces

I'd like one also. Lacking that, have you found any tools close enough to this to be useful? Intel's IACA is better than pen and paper, but rarely replaces it. Is there an AMD equivalent? Are PTLsim or MARSSx86 useful? (sorry for presuming x86 if not what you work on)


The most common modern cases I've run across are handwritten inner loops that make careful use of SIMD instructions, since auto-vectorization isn't quite there yet. For example, the x264 encoder has a lot of assembly in it. In x264's case they even wrote their own assembly IR, though targeted only at x86 variants: http://x264dev.multimedia.cx/archives/191


- Whoever makes JITs for the faster languages you use (Java, JavaScript, Lua) must generate machine code, which means working at the assembly level. I've made and maintain similar things myself, for x86 and x64.

- Whoever wants to produce the fastest CPU-bound libraries or routines which are to be used in some native language (like the guys who produce the drivers for graphics cards). I've made some such routines.

Assembly is still the only way to reach the limits. Whoever doesn't have to worry about that stuff, good for him; but there are people who do, and I'm one of them.


It is at least a goal of LLVM to support this type of JITing, and they cite Lua on the home page (http://llvm.org/):

"A major strength of LLVM is its versatility, flexibility, and reusability, which is why it is being used for such a wide variety of different tasks: everything from doing light-weight JIT compiles of embedded languages like Lua to compiling Fortran code for massive super computers."


Do ask Mike Pall of LuaJIT (he's here on HN) whether he uses LLVM for LuaJIT; I think he doesn't, and that he has good reasons not to. Which of course doesn't mean that LLVM is bad in general, just that it's not a silver bullet.

When I wrote assembly code for x86, I didn't even use an assembler, just the inline assembly functionality of the compiler. Using a standalone assembler adds one more dependency to your project, which is often undesirable. Regarding LuaJIT, there are other aspects that are also important.
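For instance, a single x86 instruction can be used through GCC/Clang's extended inline assembly without adding an assembler step to the build; a sketch, with a made-up helper name:

```c
#include <stdint.h>

/* Count trailing zero bits via the x86 BSF instruction, inline, so no
   separate .s file is needed. Undefined for x == 0 (BSF's result is
   unspecified when the source is zero). Assumes an x86 target and a
   GCC-compatible compiler. */
static inline uint32_t trailing_zeros(uint32_t x) {
    uint32_t r;
    __asm__("bsf %1, %0" : "=r"(r) : "rm"(x));
    return r;
}
```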


LLVM does have JIT capability, used by projects including Numba, Julia, Pure and a GLSL compiler (Gallium). Very useful in some contexts but definitely not a silver bullet.

Regarding LuaJIT, it compiles in about 10 seconds on a modern computer and does not use LLVM. A lot of it is written with an inline assembly preprocessor called DynASM:

http://luajit.org/dynasm_examples.html


> Regarding LuaJIT, it compiles in about 10 seconds on a modern computer and does not use LLVM. A lot of it is written with an inline assembly preprocessor called DynASM

I think he was trying to say "Mike Pall doesn't use LLVM-IR for LuaJIT's bytecode representation of Lua programs, nor does it just embed LLVM processor backends for JIT compilation, for good reason."


Very low-level hardware access, where you need to use architecture-specific opcodes not directly accessible from a higher-level language.

Also, performance critical code where you feel you can do a better job than the compiler (good luck with that nowadays).

In the first case I feel LLVM's IR wouldn't offer any significant advantage over inline ASM in some C; in the latter I'm quite perplexed. Writing very fast assembly usually means targeting a very specific architecture and using tips and tricks that make your code run faster than what the compiler might produce, and I fail to see how that would work "portably" with this intermediate language.

So yeah, I don't really see what writing in this language offers over writing some plain old C. I can't really see it replacing ASM for... well, anything really.


The graphics and media codec people here at Mozilla regularly write assembly code to provide highly-optimized (often SIMD) code paths for different chips. Example:

https://bugzilla.mozilla.org/show_bug.cgi?id=634557


Practical example: objc_msgSend (which is called for every message send in an Obj-C program) is written in assembly so it can jump to the target method implementation without disturbing any caller-save/callee-save register state: http://www.friday.com/bbum/2009/12/18/objc_msgsend-part-1-th... (and also presumably because C won't guarantee that a tail call is actually compiled as a jump).
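As a toy illustration of that last point: a C forwarder through a function pointer is in tail position, yet the compiler is still free to emit a full call with its own stack frame rather than a jump (the names below are made up):

```c
/* Stand-ins for Obj-C style dispatch: look up an implementation
   pointer, then forward to it. */
typedef long (*IMP)(long);

static long double_it(long x) { return x * 2; }

static IMP lookup_method(void) { return double_it; }  /* hypothetical cache hit */

/* `return imp(x);` is a tail call, but ISO C does not require the
   compiler to lower it to a plain jmp; hand-written assembly can
   guarantee the jump and leave all argument registers untouched. */
long forward(long x) {
    IMP imp = lookup_method();
    return imp(x);
}
```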


Assembly is more relevant today than it has been in a while, largely because Intel has gone pretty gung-ho with vector instructions that compilers mostly can't figure out how to emit on their own.

For example, say I want to convert an array of bytes into a bitmask such that every zero-valued byte is converted into a set bit. On a 64-bit system, you can do this 64 bytes at a time:

    /* Scalar version: test one byte per iteration, setting bit i of
       the mask when bytes[i] is zero. */
    unsigned long bits = 0;
    for (unsigned char i = 0; i < 64; ++i) {
        bits |= ((unsigned long)(bytes[i] == 0) << i);
    }
This isn't a particularly efficient use of the CPU. If you've got a CPU that supports AVX2, you've got two very powerful instructions available to you: VPCMPEQB and VPMOVMSKB. VPCMPEQB compares two YMM registers for equality at byte granularity and, for every byte in which the registers are equal, sets the corresponding destination byte to all ones. VPMOVMSKB takes the high bit of each byte of a YMM register and stores the result in a GPR. Since YMM registers are 256 bits (32 bytes) wide, you can reduce the entire loop above to just a handful of instructions: two vector loads, two pairs of VPCMPEQB (against zero) and VPMOVMSKB, a shift, and an OR. Instead of a loop processing data a byte at a time, you have straight-line code processing data 32 bytes at a time.

A 4th generation Core CPU (Haswell) has a tremendous amount of bandwidth. It can do two 32-byte loads per clock cycle, two 32-byte-wide compares per clock cycle, etc. If you're writing regular C code, dealing with 8-byte longs or 4-byte ints, you're leaving much of that bandwidth on the table.


FWIW, those particular instructions have been around since the Pentium 3. Most of the recent advances have been in width (64 -> 128 -> 256 -> ...) and in execution units that are actually that wide.


It's good to know about SIMD instructions, but you can usually get at them with intrinsics (if you're writing in C) without dropping down to pure assembly.


You can, but if you want to optimize things like register usage it's sometimes easier to just write the whole function in assembly.


Knuth, in The Art of Computer Programming, uses the (artificial) MIX assembly language to describe algorithms. To me it was awkward at first, but it makes a lot of sense after the first few chapters.

Assembler has real-world users, but it also has a niche in the academic community.

https://en.wikipedia.org/wiki/MIX



Embedded system folks who target very small and cheap processors. (The last time I did this, I had a 2K code budget, and the CPU was under 20 cents).

People hacking at a very low level, e.g., for boot code, or TLB miss handlers.

Places where you need to be highly efficient, and where cycles matter.

Places where compilers don't operate well (e.g., specialized instructions for doing shared memory operations, or stack manipulations for doing things like coroutines).

I might go a year or so without touching assembly now, but not much more than that.


The last time I wrote assembly was for clock cycle accurate timing manipulating some IO pins at rates very near the limits of the processor. No time or tolerance for interrupt handlers and timers.

Before that it was to program a pair of limited range timers to run with slightly different periods so I could lazily read them from C, then by examining their phase and values determine if I got an uninterrupted pair of data, and if so, how many times the timers had rolled over, thus implementing a single, high resolution, extended range timer.
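That recovery step can be simulated in plain C: when the two timer periods are coprime, the pair of readings uniquely determines the extended count. The periods and function below are made up for illustration:

```c
#include <stdint.h>

/* Hypothetical rollover periods; coprime, so the reading pair
   (t % M1, t % M2) is unique for any t in [0, M1*M2). */
enum { M1 = 255, M2 = 256 };

/* Recover the extended tick count from two short timer readings.
   Brute-force Chinese-remainder search; a real implementation on a
   small MCU would precompute the modular inverse instead. */
static uint32_t extended_ticks(uint32_t r1, uint32_t r2) {
    for (uint32_t t = r1; t < (uint32_t)M1 * M2; t += M1)
        if (t % M2 == r2) return t;
    return 0; /* inconsistent pair: the reads were interrupted */
}
```

This extends two small timers into a single counter with a range of M1*M2 ticks, without ever disabling interrupts.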

It is also used to exploit processor instructions which do not yet have compiler support.


People have told you about GPU optimization and such, and that's all valid.

Another very important one is security. In big companies and governments, people use tools like IDA Pro to inspect what really executes on the computer, in order to protect communications and the like from snooping by other companies or governments, for example.


Making games for the original Game Boy. The little 8-bit Z80-like processor doesn't map too well to C abstractions. ASM is just easier.


I've written ARM assembly for iOS apps to optimize carefully for memory ordering constraints and memory access latency (eg, pipeline stalls), and made use of NEON SIMD for certain critical paths.

This has yielded (ballpark) 2x-5x improvements to runtime performance; some operations essentially become 'free' from the application perspective whereas they took a significant hit previously and could cause UI stuttering and/or significant CPU burn (which also directly correlates to battery life consumption).

Assembly is far from dead in desktop/mobile development.



