Modern X86 Assembly Language Programming
Why learn assembly in 2025? Compiler Explorer's Matt Godbolt talks with Dan Kusswurm about SIMD programming, when to write assembly vs intrinsics, and achieving 2-3x performance gains.
Transcript
What is Assembly Language and Why Does It Matter?
Matt Godbolt: Hello and welcome to an edition of the GOTO Book Club. My name is Matt Godbolt, and today I will be interviewing Dan Kusswurm. I should say a little bit about myself. I know every episode of this has a different host. I am mostly known for a website called Compiler Explorer, which allows you to type in C, C++, and code in other compiled languages and see the assembly output. I've spent my entire career dealing with assembly language. When I got an email inviting me to do this talk, I was delighted because this is one of the few books that I actually own. Modern x86 Assembly by Dan Kusswurm is, in fact, one of the few books that lives on my desk. It's a fantastic opportunity for me to be able to talk to the person who wrote it. I realize I don't know very much about you, Dan, so why don't you give us a few lines about yourself and introduce yourself?
Dan Kusswurm: My name is Dan Kusswurm. I'm a software developer, computer scientist, and author. I spent the bulk of my career developing software, primarily for embedded systems, scientific instruments, and medical devices. Most of the code that I've written throughout my career has been in C++, with the occasional piece of justified assembly language. My primary background has been computational software in medical devices, like image analysis work—things where the use of assembly language is potentially viable.
Matt Godbolt: We should really start by talking about what is assembly language and why should we care about it.
Dan Kusswurm: Assembly language is the low-level language of the processor. It's the native instruction set of the processor. If you look at different processor families that are out there today—your x86 family, your ARM families—there are different assembly languages. In fact, within those processor families, specific processors also support various different instruction sets. What an assembler does is it takes those instructions and translates them into executable machine code that you actually run on the processor. If you're looking at it from a higher-level language like C or C++, your compiler is translating your C or C++ code or even your Rust code into assembly language. An assembler then takes that and converts it into executable machine code.
Matt Godbolt: I think this is something that I see people use interchangeably—machine code and assembly code or assembly language. What is the distinction between those two things?
Dan Kusswurm: The distinction is that assembly language is the actual source code. It's the mnemonic that actually defines the instructions. The machine code is the actual ones and zeros that get executed on the machine. That's the differentiation.
Matt Godbolt: And the assembler is what takes the assembly code and turns it into this machine code. What's different from that from, say, a compiler?
Dan Kusswurm: A compiler is actually doing both. The compiler is translating into assembly language code, and an assembler can come along and translate into machine code. Or the compiler may do a direct translation into machine code.
Matt Godbolt: At this level, an assembler is taking something for a specific CPU architecture, maybe even a specific model of that CPU.
Dan Kusswurm: Exactly.
Matt Godbolt: Whereas a high-level language is phrased in a way that's, in theory at least, more platform agnostic.
Dan Kusswurm: Yes, that's correct.
Matt Godbolt: So if the compiler generates assembly, why do I even need to know that it exists?
Dan Kusswurm: That's a good question. Let me back up a second here. I'll talk about where the motivation for the book began, because I think that helps answer your question. Back in the 2012-2013 time frame, I was working on a project that involved some image analysis. We were imaging microscope slides that had cells stained on them. You could potentially have 100,000 cells on a slide. This is pre-AI days, so it's a lot of old-school image analysis—a lot of heavy-duty number crunching. I was looking for a way to improve the performance of our software product. I'd had some experience with x86 assembly language going back to my early days as an undergrad, but I didn't have much experience with the SIMD extensions—the SSE and AVX and stuff like that.
I decided to take a close look at what the performance bottlenecks were and identified those functions. I decided to write some code and see how that could improve the performance of these image processing functions. It wasn't massive amounts of code. The core base of the code was in C++. Initially, I'd say maybe 10 or 12 functions—let's try to make assembly algorithms of those and see if we can improve performance. Back then, compilers weren't very good at generating SIMD code. I got out the Intel manuals and learned all the assembly instructions, SSE back then. The manuals are good for telling you how each individual instruction works, but they aren't very good at telling you how to put together complete algorithms. That was the motivation for the book. There are reference manuals, but nothing that showed you how to use them in practice.
Understanding SIMD Programming
Matt Godbolt: I'm going to pause there because I don't know if it's necessarily obvious what SIMD is. Do you want to talk a little bit about what SIMD is?
Dan Kusswurm: SIMD is an acronym. It stands for Single Instruction, Multiple Data. Normally on a computer, if you're doing a simple addition or iterating through an array, it'll process each element one at a time. What SIMD does is it takes chunks of data—maybe 8 or 16 elements at a time—and performs the same arithmetic operation on those elements with one instruction. Instead of doing 16 separate adds or multiplies, you're now just doing one operation, and you're getting 16 results.
Matt Godbolt: So for these 16 things, what's the relationship between them?
Dan Kusswurm: Generally, you're doing matrix and vector type operations. Think of arrays of data or matrix elements. For high-performance computing applications with large arrays, you're doing large matrix-matrix operations or matrix-vector operations.
Matt Godbolt: So SIMD allows me to write one instruction—a multiply or an add or subtract—and it applies to up to 8 or 16 things independently. They're packed together in some way?
Dan Kusswurm: They're packed together as a vector. You have a register inside the processor that can hold these 8 or 16 values at one time. If you look at the book, there are diagrams that actually illustrate how the operations work.
Matt Godbolt: There are some great diagrams in the book explaining this concept of how you can interpret this ultra-wide register as different-sized slices of things.
Dan Kusswurm: Think of a normal register on a processor—maybe 32 bits or 64 bits. SIMD registers are typically a minimum of 128 bits, and they can go to 256 or 512, at least on the Intel architectures.
Matt Godbolt: And then that 512 bits could be 32 shorts or 16 single-precision floating-point values or 8 double-precision floating-point values or 64 one-byte integers if you're doing some integer arithmetic.
Dan Kusswurm: Exactly.
Matt Godbolt: Just out of curiosity, why is that? Why not just make the processor faster? Why is it better to do this sort of bulk operation?
Dan Kusswurm: Without SIMD, instead of one bulk operation on a wide register, you'd have to do 16 or 32 or however many individual operations. You'd usually have to set up a small for loop to actually do that. That takes time. For the right types of applications, SIMD operations are generally much faster than scalar arithmetic.
Matt Godbolt: Presumably the chips are physically capable of doing these things simultaneously 16 times. You're talking about potentially multiplying eight double-precision numbers together at once in one instruction.
Dan Kusswurm: Yes, absolutely.
Matt Godbolt: That's what's going on. It's not like inside the chip it's looping over them.
Dan Kusswurm: No, it's actually one instruction. It does 8 or 16 arithmetic operations at once as opposed to doing individual scalar operations.
Writing SIMD Algorithms
Matt Godbolt: You were telling the story, and I think I cut you off.
Dan Kusswurm: The motivation for the book was to use the SIMD extensions on the x86 platform to improve the performance of these image analysis algorithms. Like I said, the reference manuals are good for teaching you how to use individual instructions, but they don't really elucidate complete algorithms. That was the focus of the book. I wanted to write something—first of all, I couldn't find existing books that covered the algorithms in any adequate detail. Some of them were superficially covered, but there was no real in-depth book that I found satisfactory. I figured I'm doing a lot of trial and error here myself. Maybe other people can benefit from my expertise, my experience, my learning curve. That was the motivation for the original book.
The challenge is that you can't just go and learn the instructions for a particular processor architecture. You've got to learn how to create for loops. You've got to learn how to do basic integer and floating-point arithmetic algorithms, and memory addressing modes. So if you look at the book, the first maybe third of the book covers basic x86 architecture. The second two-thirds of the book focuses mostly on the SIMD instruction extensions—AVX, AVX2, AVX-512.
Matt Godbolt: The intention behind this was to be a guide for people who want to write SIMD-based algorithms. But in order to get there, you have to get over the whole of the ISA. This is specifically for x86, although I know that you've also done similar books for other processors.
Dan Kusswurm: The book you're holding is strictly x86. There's also an ARM book that I did a couple of years ago back in 2020, basically similar types of topics, but again focused on the ARM platform as opposed to the x86.
Matt Godbolt: My understanding is some of the concepts of SIMD are broadly the same across architectures.
Dan Kusswurm: Absolutely. There are different instructions and some slightly different specifics of the architecture, but the basic principles are exactly the same.
Matt Godbolt: I've got the first edition here. I was looking through my e-book reader version of the latest version, which is the third edition.
Dan Kusswurm: Yes, the third edition is out now.
Matt Godbolt: I see that you've finally dropped the ancient, prehistoric x87 floating-point instructions that nobody has used in a very long time. That was a welcome thing when I was reading through the updates. Your intention was to write a book about SIMD, and you ended up writing a pretty comprehensive assembly language programming guide that tells you pretty much everything about how to program. Tell me about what makes SIMD algorithms challenging, because surely if it were easy, the compiler would do it for you.
The Challenges of SIMD Programming
Dan Kusswurm: There are a couple of reasons. First of all, compilers are getting better at automatically generating SIMD code. They've improved quite a bit, especially since I wrote the first version. There are a couple of challenges when you're writing SIMD code. The first challenge is if you're developing an algorithm, there's a lot of decision-making in it—if-then-else type stuff. Compilers still have a little bit of difficulty generating that by themselves.
Matt Godbolt: Can we pause there? Now that you've explained how SIMD works, we're talking about picking up 16 elements at a time. Obviously, you're doing a single instruction. We're doing the same thing to 16 things, which means—what if I only want to do it to some of these things? Or what if I have a branch? Or what if it's like, once I reach the end, how does that work?
Dan Kusswurm: That's one of the programming differences between normal programming and SIMD programming. In SIMD programming, you don't make your logic decisions by saying "if X is greater than Y, do this, otherwise do that." Instead, you evaluate "is X greater than Y" and you get back a mask that tells you, element by element, whether that comparison is true or false. Then through some Boolean operations, you can throw away the data that you don't need and keep the data that you want.
Matt Godbolt: It's sort of like a ternary operator thing, but presumably I have to evaluate both sides. Like if I were to say, "set it to value X if it's less than X, otherwise leave it alone," I need to have done the work for both forks of that and then select which one.
Dan Kusswurm: Yes. Let's say I'm comparing X greater than Y. I'll get a mask saying whether it's true or false. So I obviously know where it's true. If I just do a NOT on that, I have the opposite condition right away.
Matt Godbolt: If I wanted to do—if X is less than Y, element by element in a 16-element-wide vector—I'm doing comparisons.
Dan Kusswurm: Exactly.
Matt Godbolt: Like if X is less than Y, then I want the square root of X. Otherwise, I want the cube of X. I have to have done both sides of that. So there's presumably a point where you go, "Actually, your logic is too complicated. You should use an if statement here." Where do you make these calls about that kind of thing?
Dan Kusswurm: Going back to the example, I've got 16 X's and 16 Y's. I do whatever comparison. I've now got 16 results in one of these wide registers. Now I can take the true cases and do the square root, for example, and the false cases, I can do the cubes. What I also can do is I can take the 16 results and compress them into a single bitmask that I can move to the processor and test.
There are also instructions on the x86 architecture that allow conditional operations. This is where AVX-512 shines. I can do masking operations on the instructions. I could say, "Only update the registers where the condition was true," and then take the opposite case and update the registers doing a different operation where the condition was false.
Matt Godbolt: Presumably with this mask register as well, you could add in something which says if they're all false, I don't even have to do the true part.
Dan Kusswurm: Exactly. Don't want to do anything.
Matt Godbolt: So you can write some amount of conditional stuff, but it starts to blur when it's valuable to do this. Because if one of those paths is hugely expensive and rare, then maybe it's not worth vectorizing. But then maybe you think about your algorithm in a different way.
Dan Kusswurm: I think the latter is probably more of it—think about the algorithm in a different way.
Matt Godbolt: And that is probably the bit where compilers fall down because they don't know what your algorithm is. They're trying to infer something from the way that you wrote code, and they're limited in how much they can rewrite the algorithm.
Dan Kusswurm: If you take a very simple example where I'm just adding two arrays and storing into a third array, compilers can pick that up these days and generate some really efficient code for doing that. But where there are logic decisions involved, they have a little bit more difficulty doing that. The other case where compilers still lack efficiency is taking advantage of instructions that are for target-specific applications.
To give you an example, the x86 architecture's AVX extensions include instructions for manipulating sparse matrices. That's a very targeted, specific application. The compiler has difficulty generating those, and I'm not sure you'd want a compiler to generate them, because that's a really targeted application.
Matt Godbolt: It's a very specific use case.
Dan Kusswurm: Exactly.
Matt Godbolt: Given all the heuristics a poor compiler has to go through to say, "Maybe they were trying to do this, maybe they were trying to do that," at some point there's a law of diminishing returns. It's like they're probably not doing motion compensation for MPEG encoding.
Dan Kusswurm: Exactly.
Matt Godbolt: We shouldn't be looking to use that instruction here. And things like carryless multiplies and really obscure things that you might use.
Dan Kusswurm: Exactly.
Intrinsics vs Assembly Language
Matt Godbolt: From my own experience, I've found that there's often an intrinsic that I can use in my C code. Where's the line to draw between where I can write these sort of guided—not really doing C++ anymore, but saying what I want—versus when I crack out the assembly book and just go, "No, I'm writing assembly"?
Dan Kusswurm: I actually use intrinsics myself very often. They're good for doing a quick assessment. I'm not sure how much performance I can gain by doing full-blown assembly language coding. I might try intrinsics first, get a rough feel for what the performance gains are. If it looks like I can maybe squeeze some more out, then I'll go do some assembly language.
The one thing I do like about going to the pure assembly language approach, when it's warranted, is that you get to take advantage of stuff within the assembler, things like macro processors. You also get access to the full instruction set, whereas with intrinsics you've got quite a bit of access, but not the entire instruction set. Or, for example, if you're doing some hardware manipulation—maybe a device driver or code for an embedded system—intrinsics really don't help you with that. In those situations, assembly language is justified.
Matt Godbolt: That makes sense to me. There's a lot of diminishing returns. From my own experience, when you're writing intrinsics—and just for those who don't know, intrinsics are like function calls that you're making in C or C++, but the compiler knows, "nudge nudge, wink wink, don't actually call a function—you should generate a particular instruction here." But at some point, you realize that you're just writing assembly with a worse syntax in C.
Dan Kusswurm: That's a real good observation. To use the intrinsics, you have to understand what the underlying assembly language instruction does. There's just no way around that. So you're kind of 80% there already to doing assembly language code. You take the next step and do the actual assembly language itself. But you do have a choice there.
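For comparison, the same array addition written with intrinsics looks like the sketch below (again an illustrative example assuming AVX); each _mm256_* call corresponds closely to one instruction, which is why, as Dan says, you are most of the way to assembly already.

```cpp
#include <immintrin.h>
#include <cstddef>

void add_arrays_avx(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);               // vmovups
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(c + i, _mm256_add_ps(va, vb));   // vaddps, then vmovups
    }
    for (; i < n; ++i)                                    // scalar tail for leftovers
        c[i] = a[i] + b[i];
}
```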
Matt Godbolt: If you're writing assembly, you suddenly are beholden to some of the other architectural details, which is what part of this book is about—getting you to that point. What registers should I expect parameters to be passed in? What registers must I preserve and restore? There's a lot more glue that you need to write yourself if you're just going to drop into assembly.
Dan Kusswurm: Absolutely. There's what I call calling conventions. As you alluded to, certain arguments are passed through registers. Depending on the number of arguments a function has, you may have to pass arguments on the stack. You have different processor registers that are volatile or nonvolatile. A volatile register doesn't need to be saved by the called function, but a nonvolatile register does. Depending on the architecture and whether you're doing Windows or Linux, there are different register usages and different usages of how the stack is organized. The third edition especially dives into a lot of the differences between both Windows and Linux and how to do that type of stuff.
Matt Godbolt: Things like stack red zones and such.
Dan Kusswurm: Exactly.
Matt Godbolt: Which I don't have to think about most of the time because the compiler does that bit for me. So this is where you're exposed to it. When I was reading that, it was the bit that stuck out to me as being like, "Oh, this is cool. Actually, I didn't fully understand what the red zone was all about and why it's okay." It just means I don't have to manipulate the stack pointer—I get some space there.
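As a sketch of what this looks like in practice (the routine name is hypothetical): calling an assembly function from C++ is an ordinary call on the C++ side, but the assembly body has to honor the platform's calling convention.

```cpp
#include <cstddef>

// Hypothetical routine written in assembly, assembled separately and linked in.
// Under the System V AMD64 ABI (Linux/macOS), the two arguments arrive in RDI
// and RSI; under the Microsoft x64 ABI (Windows) they arrive in RCX and RDX,
// and the caller also reserves 32 bytes of shadow space on the stack. The
// return value comes back in RAX in both cases. The assembly body must
// preserve any nonvolatile registers it touches (RBX, RBP, R12-R15 on both
// platforms; Windows additionally treats RDI, RSI, and XMM6-XMM15 as
// nonvolatile). On System V, a leaf function may also use the 128-byte red
// zone below RSP without adjusting the stack pointer.
extern "C" long long sum_i32(const int* data, std::size_t count);

long long total(const int* data, std::size_t count) {
    return sum_i32(data, count);   // an ordinary function call from C++'s side
}
```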
Real-World Optimization Examples
Matt Godbolt: Let's talk a little bit about—do you have some favorite examples of optimizations and things that you've done that you can look back and be like, "Yeah, we were actually able to blow the compiler's generated stuff completely out of the water"?
Dan Kusswurm: One of the examples—there are a couple—is some real basic operations, just simple statistics: calculating the mean of an image or calculating a variance. You're doing a lot of summing, and you take a square root of the variance to get your standard deviation. For those types of operations, you can get several times the performance.
I'll caveat with this: If you're doing an application where you do this one-time operation, it's not worth jumping into assembly language to do that. If you're doing real-time programming—for example, you've got your camera capturing 30 or 60 frames a second, and you need to build a histogram for each one of those frames and then calculate some basic statistics—then it's worthwhile considering using assembly language.
The other use case I encountered is doing convolutions, either 1D or 2D convolutions. Basically, a convolution is just an arithmetic operation on an array or matrix. For example, if you want to blur an image, you might apply a low-pass filter. Now you've got your actual image and you've got a small little operator called a kernel. You blend the two together and you get your result. SIMD operations and assembly language programming in particular are really good for those types of operations. Again, particularly if you've got some unusual convolutions with some strange masks or kernels. Those are the types of use cases that, in my mind, stick out where it was justified to jump into assembly language programming to get the performance gains.
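As a sketch of the summing pattern Dan describes for image statistics (names are mine, assuming AVX): accumulate eight partial sums per iteration, reduce them at the end, and handle the leftover elements with scalar code. A 1D convolution's inner multiply-accumulate loop has the same shape.

```cpp
#include <immintrin.h>
#include <cstddef>

float mean_avx(const float* data, std::size_t n) {
    __m256 acc = _mm256_setzero_ps();
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8)                         // eight partial sums per pass
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));

    // Horizontal reduction of the eight partial sums down to one scalar.
    __m128 lo   = _mm256_castps256_ps128(acc);
    __m128 hi   = _mm256_extractf128_ps(acc, 1);
    __m128 sum4 = _mm_add_ps(lo, hi);
    __m128 sum2 = _mm_add_ps(sum4, _mm_movehl_ps(sum4, sum4));
    __m128 sum1 = _mm_add_ss(sum2, _mm_shuffle_ps(sum2, sum2, 1));
    float total = _mm_cvtss_f32(sum1);

    for (; i < n; ++i)                                 // scalar tail
        total += data[i];
    return n ? total / static_cast<float>(n) : 0.0f;
}
```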
Matt Godbolt: That makes sense to me. Those kinds of things sort of spring to mind as being the obvious thing—you're picking up data, you're doing the same thing to it over and over again, and maybe there's a tiny sum at the end or whatever.
Dan Kusswurm: Exactly.
Development Workflow and Benchmarking
Matt Godbolt: One thing we should really talk about is what makes assembly language more difficult to integrate into, say, a normal development flow? What are the challenges?
Dan Kusswurm: In some respects, it's the same development flow. You define what your function is going to do. You do requirements, design documents. You call your function. You test it. So that part's not different. I think the important part is that you always need to benchmark any assembly language code that you generate. The reason for that is, if you're not getting any substantial performance gains—and by substantial performance gains, I'm talking significant, like double or triple or something like that—
Matt Godbolt: That's an interesting observation there because 10% faster sounds good to me. But you're saying it can get two, three times faster, perhaps.
Dan Kusswurm: Yeah. It depends on your application. A 10% might be warranted in some cases, but in a lot of cases, it's not going to be warranted.
Matt Godbolt: What's the trade-off? If I can get a 10% speed boost in my thing I call a couple of times, why not do it in assembly?
Dan Kusswurm: You certainly could. It basically boils down to trading extra development time and future maintainability for performance gains. It's a balance between those three items.
Matt Godbolt: That makes sense. But otherwise, the flow is relatively similar.
Dan Kusswurm: Outside of the benchmarking component, the normal software development lifecycle or workflow is pretty much the same.
Matt Godbolt: And benchmarking is its own black art.
Dan Kusswurm: Benchmarking is its own topic. The crude approach is you set up a simple for loop and just constantly call your test function a million times and measure it. That gives you a rough idea.
Matt Godbolt: It's a rough idea.
Dan Kusswurm: A better approach is if you go to either the Intel or the AMD websites, they actually have profiling tools that allow you to dig deep and get a much more accurate, better representation of how your algorithm is actually performing. If I recall correctly, there's a new performance tool you can actually use to do profiling to get a much better measurement as opposed to just calling the function a million times and calculating a mean and standard deviation.
Matt Godbolt: Even those things, given how complicated processors are these days and all the tricks they pull off behind the scenes, you can easily convince yourself that something's good. And then when it's actually in the production environment, when you don't have the cache to yourself—
Dan Kusswurm: Exactly.
Matt Godbolt: Everything goes out the window. So you have to be a bit careful about that kind of thing.
Dan Kusswurm: Just be aware that those types of benchmarking approaches don't really reflect your real-world caching. You're absolutely right.
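For reference, the crude approach Dan describes, with exactly the warm-cache caveat just mentioned, might look like this minimal sketch (the routine in the usage comment is a placeholder):

```cpp
#include <chrono>
#include <cstdio>

// Call the function under test many times and report the average time per
// call. Caches stay warm and branch predictors stay trained across
// iterations, so treat the number as a rough first estimate only.
template <typename F>
double rough_benchmark_ns(F&& fn, int iterations = 1'000'000) {
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i)
        fn();
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / iterations;
}

// Usage (my_simd_kernel stands in for whatever routine is being measured):
//   double ns = rough_benchmark_ns([] { my_simd_kernel(); });
//   std::printf("%.1f ns per call\n", ns);
```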
Matt Godbolt: It's difficult. From my own experience of this kind of thing, the only thing I've found that I can do is to have a CI system that graphs over time my performance and then kind of almost post-hoc go, "Oh, hang on a second, we introduced something that made it go slower. What was that?" Sometimes it's like somebody added a line in an innocuous place. This is more of a concern with compiled languages. But with assembly, you're controlling that a lot more. But you pay for that in development time.
Dan Kusswurm: You pay for that in development time, and you potentially pay for future maintainability too.
Matt Godbolt: That makes sense. Maintainability is often the most limiting factor in the software development lifecycle, but sometimes it's worth trading away for big performance gains if your domain demands it. I come from a trading background, so for high-frequency trading, we don't mind spending the time and effort to do this. But you can overdo it.
Dan Kusswurm: You know, in real-time applications, your multimedia applications, game development—those are the things where it makes some sense to at least consider looking at your computational bottlenecks and then consider using assembly language. Obviously, if you get an I/O bottleneck like a file or network or something like that, assembly language is just not going to help you at all.
Matt Godbolt: Cache layout, memory layout, stuff like that—that's a thing you can often attempt at a high-level language before you roll down to it. And also that's a precursor. If we can get everything aligned, then that looks good for vectorization as well.
Dan Kusswurm: Exactly. That's a good point. One of the things that I found is that doing assembly language programming, when I go back to C or C++, I'm always cognizant of the fact—I want these elements or these members of a structure or class to make sure they're properly aligned on a cache line boundary or at least a 16-byte boundary. That's going to improve performance a little bit.
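A minimal sketch of what Dan means (the structure itself is a made-up example): alignas keeps the hot data on a cache-line boundary so it never straddles two lines, and aligned SIMD loads become possible.

```cpp
#include <cstddef>

// 64 bytes is a typical cache-line size on current x86 parts; 16 or 32 bytes
// would be the minimum needed for aligned SSE or AVX loads respectively.
struct alignas(64) ChannelAccumulators {
    float sums[8];     // 32 bytes of per-channel running sums
    float counts[8];   // 32 bytes; the struct is padded to a 64-byte multiple
};

static_assert(alignof(ChannelAccumulators) == 64,
              "expected cache-line alignment");
```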
Matt Godbolt: Even just being aware—if you're in a performance-sensitive environment at all, understanding what's going on under the hood is always helpful.
Dan Kusswurm: It's always helpful.
Matt Godbolt: Pick up a copy of your book and play around with some of the examples that you've got and get yourself through that process of saying, "Oh, this is how the computer actually works." And then later on, you can see how these pictures map into the higher-level languages you're perhaps used to writing, and then you might have an inroad into one day saying, "This function could be 100 times faster."
Dan Kusswurm: Absolutely. It's a good way to summarize it.
Closing Remarks
Matt Godbolt: I think we've probably got to the point of time here. Obviously, tons more we could talk about. I feel like you and I could talk on for hours and hours.
Dan Kusswurm: Absolutely.
Matt Godbolt: Are there any parting words that you'd like to say? Obviously, everyone should go and grab a copy of this.
Dan Kusswurm: The third edition is out there. If you're interested in buying a copy, go to your favorite online bookseller. You can also go to link.springer.com. That's the publisher's website. All the Springer publications are there. Just type in "modern x86" or type in my name, and you'll come up with a list of the books that I've published there. You can get a copy there.
Matt Godbolt: Fantastic. Well, thank you very much for talking with me today, Dan. I've really enjoyed this conversation, and I hope that we'll be introducing some more people to the idea that maybe they should learn assembly language, because this one is dear to my heart.
Dan Kusswurm: Thank you for having me. I appreciate it, Matt.
About the speakers
Dan Kusswurm (author)
Matt Godbolt (interviewer), Low-level Latency Geek