SIMD programming in pure Rust

pizlonator · 2026-01-22T01:55:19 1769046919

This article references the fact that security issues in crypto libs are memory safety issues, and I think this is meant to be a motivator for writing the crypto using SIMD intrinsics.

This misses two key issues.

1. If you want to really trust that your crypto code has no timing side channels, then you've gotta write it in assembly. Otherwise, you're at the compiler's whims to turn code that seems like it really should be constant-time into code that isn't. There's no thorough mechanism in compilers like LLVM to prevent this from happening.

2. If you look at the CVEs in OpenSSL, they are generally in the C code, not the assembly code. If you look at OpenSSL CVEs going back to the beginning of 2023, there is not a single vulnerability in Linux/X86_64 assembly. There are some in the Windows port of the X86_64 assembly (because Windows has a different calling conv and the perlasm mishandled it). There are some on other arches. But almost all of the CVEs are in C, not asm.

If you want to know a lot more about how I think about this, see https://fil-c.org/constant_time_crypto

I do think it's a good idea to have crypto libraries implemented in memory safe languages, and that may mean writing them in Rust. But the actual kernels that do the cryptographic computations that involve secrets should be written in asm for maximum security so that you can be sure that sidechannels are avoided and because empirically, the memory safety bugs are not in that asm code.

johnisgood · 2026-01-22T05:07:24 1769058444

So there are more bugs in a more readable and understandable programming language (C) as opposed to asm? What gives? I am asking because intuition would say the opposite since asm is much more lower-level than C.

wahern · 2026-01-22T06:28:49 1769063329

The core primitives written in assembly operate on fixed sized blocks of data; no allocations, no indexing arrays based on raw user controlled inputs, etc. Moreover, the nature of the algorithms--at least the parts written in assembly, e.g. block transforms--means any bugs tend to result in complete garbage and are caught early during development.

pizlonator · 2026-01-22T16:23:52 1769099032

In OpenSSL, the C code is doing stuff that is hard to validate exhaustively like dealing with wire protocols, different sized buffers that have to be handled in special ways, dynamic memory allocation, etc

The assembly code is for the kernels themselves. Usually it’s a loop with easy to understand bounds, usually zero pointer chasing. Mostly math. Definitely no calls to malloc/free

So it’s not that assembly has an advantage over C. It’s that the kinds of things that get written in assembly also happen to:

- Be the sort of code where accidentally overflowing a buffer is unlikely because the indexing is the easy part. And there’s no interesting downcasting or allocation happening

- Be the sort of code where timing side channels are disastrous and the only way to know for sure you don’t have one is to look at the assembly

formerly_proven · 2026-01-22T07:15:28 1769066128

Crypto primitives tend to have very simple control flow (those that don’t are usually insecure) and even simpler data structures. You won’t find many branches beyond “is there another block?” in a typical block cipher or hash, for example.

itemize123 · 2026-01-22T05:14:02 1769058842

compiler optimization is a blackbox. shortcuts to crypto routines will allow side channel attacks

9rx · 2026-01-22T19:12:34 1769109154

Is it actually more readable and understandable (not to be confused with more familiar), though? It is hard to beat the understandability of the lowest level. Abstractions add unknowns. C's claim to fame was being a "portable assembler"; freeing developers from having to write the same program over and over and over for different systems.

ironbound · 2026-01-22T06:11:07 1769062267

I can't review assembly, agreed on better language part but we'd need tooling to help prove it's correctness of you want the asm path.

tucnak · 2026-01-22T07:43:51 1769067831

Well, is it that you can't or that you won't? All you need to do is learn it, you know? The full x86 ISA for one is quite nasty, but there's a useful subset, and other architectures are much nicer in general. Asm is as basic as it gets.

zbentley · 2026-01-22T13:13:33 1769087613

Eh, I don't think you need to get that extreme.

A combination of careful use of a given high-level-language with expert awareness of compiler behavior, and the presence of tests that detect some of the nasty timing behaviors that get compiled in via static analysis of compiler IR or assembly on selected platforms will get you pretty far--not guaranteed perfect like handwritten asm would, but far enough that the advantages of not needing maintainers to be fluent in assembly past the point of maintaining those tests might outweigh the drawbacks.

pizlonator · 2026-01-22T16:17:34 1769098654

Validating that your compiler didn’t introduce a timing side channel into a crypto algo is harder than validating that the crypto algo has no memory safety bugs.

I think this is true for crypto algos because they have low cyclomatic complexity and you’re going to think deeply about its behavior anyway as part of cryptanalysis, benchmarking, and other testing

zbentley · 2026-01-23T13:43:06 1769175786

> Validating that your compiler didn’t introduce a timing side channel into a crypto algo is harder than validating that the crypto algo has no memory safety bugs.

Genuine question from a novice in this area: assuming that your critical crypto operations are, at the assembly level, in identifiable labeled blocks, why is this hard?

My understanding of timing attacks is that most of them derive from code that uses a looping construct such that performance is O(length(data)), and that most mitigations involve either replacing loops with constant-time instructions, or padding loops out to be fixed-iteration-count (in which case, they can/should hopefully be unrolled).

If that's true, wouldn't it be easy to identify whether a specific, user-selected critical section of compiled/assembly-output code was vulnerable to a side channel by checking it for nonterminal jumps (or generating a very simple/shallow-depth CFG and checking for cycles)?

Even if that's possible, I know it's definitely not sufficient for timing attack protection (awareness of specific instructions' performance and instructions whose performance are affected by other code are just a few examples of where this breaks down); I'm just wondering if it's a means of checking some timing-attack low hanging fruit in an easy way at compile time or (assuming simple disassembly is possible and critical sections can still be identified from machine code) startup time.

I very much do not want to sound like I'm proposing a solution from first principles here; I'm a total novice and this is a very deep area. I'm quite certain the answer is somewhere on the spectrum between "this doesn't work for $reasons" and "this is already well-known and practiced in sanitizer passes/serious cryptography libs". I'm just curious about the approach.

pizlonator · 2026-01-23T17:57:08 1769191028

> assuming that your critical crypto operations are, at the assembly level, in identifiable labeled blocks, why is this hard?

The compiler won't do you that kind of favor. The crypto code will be smeared in with other junk, reordered according to the compiler's inscrutable whims.

Worse, exactly what kind of nonsense the compiler will do to your algo will depend on compiler version, ABI, OS, and countless spooky-action-at-a-distance properties of your larger project. Like, you could breathe on a non-crypto part of your code, and for complex reasons the compiler now emits different instructions for your crypto. This means that you'll have to revalidate what the compiler output every time you make any change to your code, even if that change doesn't affect the crypto kernels themselves.

> My understanding of timing attacks is that most of them derive from code that uses a looping construct such that performance is O(length(data))

That's not the big risk. The risk is that the compiler does one of the following:

- Introduces a branch on secret data for the purpose of speculation. There's nothing stopping a compiler from doing this. There are many optimizations in LLVM today that will do this (or not) based on heuristics. The moment this happens, you get a timing leak. Note that the branch isn't for a loop - it's the compiler saying "if this value is like this then I can short-circuit the calculation somehow"

- Turns math into a lookup table. This is less common, but again, there's nothing stopping the compiler from doing it. Again, that's a timing leak.

> If that's true, wouldn't it be easy to identify whether a specific, user-selected critical section of compiled/assembly-output code was vulnerable to a side channel by checking it for nonterminal jumps (or generating a very simple/shallow-depth CFG and checking for cycles)?

Theoretically, but first you'd have to decompile the assembly to work out which parts of it are even part of the crypto.

Here's a good way to think about it: writing assembly by hand is way easier than analyzing the assembly generated by a compiler. And, if you write it by hand, you only have to validate it if you make changes to that assembly, not anytime you breathed on the compiler or its inputs.

shihab · 2026-01-21T23:30:24 1769038224

> For example, NEON ... can hold up to 32 128-bit vectors to perform your operations without having to touch the "slow" memory.

Something I recently learnt: the actual number of physical registers in modern x86 CPUs are significantly larger, even for 512-bit SIMD. Zen 5 CPUs actually have 384 vectors registers, 384*512b = 24KB!

cmovq · 2026-01-22T00:05:22 1769040322

This is true, but if you run out of the 32 register names you’ll still need to spill to memory. The large register file is to allow for multiple instructions to execute in parallel among other things.

zeusk · 2026-01-22T01:50:37 1769046637

They’re used by the internal register renamer/allocator so if it sees you’re storing the results to memory then reusing the named register for a new result - it will allocate a new physical register so your instruction doesn’t stall for the previous write to go through.

adrian_b · 2026-01-22T13:22:05 1769088125

I do not understand what you want to say.

The register renamer allocates a new physical register when you attempt to write the same register as a previous instruction, as otherwise you would have to wait for that instruction to complete, and you would also have to wait for any instructions that would want to read the value from that register.

When you store a value into memory, the register renamer does nothing, because you do not attempt to modify any register.

The only optimization is that if a following instruction attempts to read the value stored in the memory, that instruction does not wait for the previous store to complete, in order to be able to load the stored value from the memory, but it gets the value directly from the store queue. But this has nothing to do with register renaming.

Thus if your algorithm has already used all the visible register numbers, and you will still need in the future all the values that occupy the registers, then you have to store one register into the memory, typically in the stack, and the register renamer cannot do anything to prevent this.

This is why Intel will increase the number of architectural general-purpose registers of x86-64 from 16 to 32, matching Arm Aarch64 and IBM POWER, with the APX ISA extension, which will be available in the Nova Lake desktop/laptop CPUs and in the Diamond Rapids server CPUs, which are expected by the end of this year.

Register renaming is a typical example of the general strategy that is used when shared resources prevent concurrency: the shared resources must be multiplied, so that each concurrent task uses its private resource.

gpderetta · 2026-01-22T15:32:52 1769095972

> When you store a value into memory, the register renamer does nothing, because you do not attempt to modify any register.

you are of course correct about everything. But the extreme pendant in me can't avoid pointing out that there are in fact a few mainstream CPUs[1] that can rename memory to physical registers, at least in some cases. This is done explicitly to mitigate the cost of spilling. edit: this is different from the store-forwarding optimization you mentioned.

[1] Ryzen for example: https://www.agner.org/forum/viewtopic.php?t=41

adrian_b · 2026-01-22T23:34:07 1769124847

That feature does not exist in any AMD Zen, but only in certain Zen generations and randomly, i.e. not in successive generations. This optimization has been introduced then removed a couple of times. Therefore this is not an optimization on whose presence you can count in a processor.

I believe that it is not useful to group such an optimization with register renaming. The effect of register renaming is to replace a single register shared by multiple instructions with multiple registers, so that each instructions may use its own private register, without interfering with the other instructions.

On the other hand, the optimization mentioned by you is better viewed as an enhancement of the optimization mentioned by me, and which is implemented in all modern CPUs, i.e. that after a store instruction the stored value persists for some time in the store queue and the subsequent instructions can access it there instead of going to memory.

With this additional optimization, the stored values that are needed by subsequent instructions are retained in some temporary registers even after the store queue is drained to the memory as long as they are still needed.

Unlike with register renaming, here the purpose is not to multiply the memory locations that store a value so that they can be accessed independently. Here the purpose is to cache the value close to the execution units, to be available quickly, instead of taking it from the far away memory.

As mentioned at your link, the most frequent case when this optimization is efficient is when arguments are pushed in the stack before invoking a function and then the invoked function loads the arguments in registers. On the CPUs where this optimization is implemented the passing of arguments to the function bypasses the stack, becoming much faster.

However this calling convention is important mainly for legacy 32-bit applications, because the 64-bit programs pass most arguments inside registers, so they do not need this optimization. Therefore this optimization is more important for Windows, where it is more frequent to use ancient 32-bit executables, which have not been recompiled to 64-bit.

gpderetta · 2026-01-23T09:34:02 1769160842

Yes, it is not in all Zen cpus.

I don't think it makes sense to distinguish it from renaming. It is effectively aliasing a memory location (or better, an offset off the stack pointer) with a physical register, effectively treating named stack offsets as additional architectural registers. AFAIK this is done on the renaming stage.

adrian_b · 2026-01-23T16:29:11 1769185751

The named stack offsets are treated as additional hidden registers, not as additional architectural registers.

You do not access them using architectural register numbers, as you would do with the renamed physical registers, but you access them with an indexed memory addressing mode.

The aliasing between a stack location and a hidden register is of the same nature as the aliasing between a stack location from its true address in the main memory and the location in the L1 cache memory where the the stack locations are normally cached in any other modern CPU.

This optimization present in some Zen CPUs just caches some locations from the stack even closer to the execution units of the CPU core than the L1 cache used for the same purpose in other CPUs, allowing those stack locations to be accessed as fast as the registers.

gpderetta · 2026-01-23T16:56:01 1769187361

The stack offset (or in general memory location address[1]) has a name (its unique address), exactly like an architectural register, how can it be an hidden register?

In any case, as far as I know the feature is known as Memory Renaming, and it was discussed in Accademia decades before it showed in actual consumer CPUs. It uses the renaming hardware and it behaves more like renaming (0 latency movs resolved at rename time, in the front end) than an actual cache (that involves an AGI unit and a load unit and it is resolved in the execution stages, in the OoO backend) .

[1] more precisely, the feature seems to use address expressions to name the stack slots, instead of actual addresses, although it can handle offset changes after push/pop/call/ret, probably thanks to the Stack Engine that canonicalizes the offsets at the decode stage.

dapperdrake · 2026-01-21T23:59:39 1769039979

In the register file or named registers?

And the critical matrix tiling size is often SRAM, so L3 unified cache.

rwaksmunski · 2026-01-19T14:40:40 1768833640

Every Rust SIMD article should mention the .chunks_exact() auto vectorization trick by law.

ChadNauseam · 2026-01-21T22:42:03 1769035323

Didn't know about this. Thanks!

Not related, but I often want to see the next or previous element when I'm iterating. When that happens, I always have to switch to an index-based loop. Is there a function that returns Iter<Item=(T, Option<T>)> where the second element is a lookahead?

tyilo · 2026-01-21T23:01:56 1769036516

You probably just want to use `.peekable()`: https://doc.rust-lang.org/stable/std/iter/trait.Iterator.htm...

SkiFire13 · 2026-01-22T11:49:48 1769082588

Note that `peekable` will most likely break autovectorization

rwaksmunski · 2026-01-22T19:16:50 1769109410

I use the slice::windows for that.

jeffbee · 2026-01-21T23:00:17 1769036417

"Intel CPUs were downclocking their frequency when using AVX-512 instructions due to excessive energy usage (and thus heat generation) which led to performance worse than when not using AVX-512 acceleration."

This is an overstatement so gross that it can be considered false. On Skylake-X, for mixed workloads that only had a few AVX-512 instructions, a net performance loss could have happened. On Ice Lake and later this statement was not true in any way. For code like ChaCha20 it was not true even on Skylake-X.

rurban · 2026-01-22T05:02:25 1769058145

This was written in the past tense, and was true in the last decade. Only recently Intel came up with proper AVX-512

adrian_b · 2026-01-22T14:28:22 1769092102

"Recently" is 6 years ago, so not so recent.

The real Intel mistake was that they have segregated by ISA the desktop/laptop CPUs and the server CPUs, by removing AVX-512 from the former, soon after providing decent AVX-512 implementations. This doomed AVX-512 until AMD provided it again in Zen 4, which has forced Intel to eventually reintroduce it in Nova Lake, which is expected by the end of this year.

Even the problems of Skylake Server and of its derivatives were not really caused by their AVX-512 implementation, which still had a much better energy efficiency than their AVX2 implementation, but by their obsolete implementation for varying the supply voltage and clock frequency of the CPU, which was far too slow, so it had to use an inappropriate algorithm in order to guarantee that the CPUs are not damaged.

The bad algorithm for frequency/voltage control was what caused the performance problems of AVX-512 (i.e. just a few AVX-512 instructions could lower preventively the clock frequency for times comparable with a second, because the CPU feared that if more AVX-512 instructions would come in the future it would be impossible to lower the voltage and frequency fast enough to prevent overheating).

The contemporaneous Zen 1 had a much more agile mechanism for varying supply voltage and clock frequency, which was matched by Intel only recently, many years later.

jeffbee · 2026-01-22T05:15:14 1769058914

It wasn't. My comment covers the entire history of the ISA extension on Intel Xeon CPUs.

cl0ckt0wer · 2026-01-22T01:59:35 1769047175

Yeah I would have loved benchmarks across generations and vendors.

celrod · 2026-01-22T05:17:09 1769059029

I netted huge performance wins out of AVX512 on my Skylake-X chips all the time. I'm excited about less downclocking and smarter throttling algorithms, but AVX512 was great even without them -- mostly just hampered by poor hardware availability, poor adoption in software, and some FUD.

dfajgljsldkjag · 2026-01-21T22:52:14 1769035934

The benchmarks on Zen 5 are absolutely insane for just a bit of extra work. I really hope the portable SIMD module stabilizes soon, so we do not have to keep rewriting the same logic for NEON and AVX every time we want to optimize something. That example about implementing ChaCha20 twice really hit home for me.

formerly_proven · 2026-01-21T22:20:58 1769034058

Lazy man's "kinda good enough for some cases SIMD in pure Rust" is to simply target x86-64-v3 (RUSTFLAGS=-Ctarget-cpu=x86-64-v3), which is supported by all AMD Zen and Intel CPUs since Haswell; and for floating point code, which cannot be auto-vectorized due to the accuracy implications, "simply" write it with explicit four or eight-way lanes, and LLVM will do the rest. Usually. Loops may need explicit handling of head or tail to auto-vectorize (chunks_exact helps with this, it hands you the tail).

crote · 2026-01-19T11:55:18 1768823718

What is the "nasty surprise" of Zen 4 AVX512? Sure, it's not quite the twice as fast you might initially assume, but (unlike Intel's downclocking) it's still a strict upgrade over AVX2, is it not?

cogman10 · 2026-01-21T21:53:50 1769032430

It's splitting a 512 instruction into 2 256 instructions internally. That's the main nasty surpise.

I suppose it saves on the decoding portion a little but it's ultimately no more effective than just issuing the 2 256 instructions yourself.

nwallin · 2026-01-22T04:12:07 1769055127

Single pumped AVX512 can still be a lot more effective than double pumped AVX2.

AVX512 has 2048 bytes of named registers; AVX2 has 512 bytes. AVX512 uses out of band registers for masking, AVX2 uses in band mask registers. AVX512 has better options for swizzling values around. All (almost all?) AVX512 instructions have masked variants, allowing you to combine an operation and a subsequent mask operation into a single operation.

Often times I'll write the AVX512 version first, and go to write the AVX2 version, and a lot of the special sauce that made the AVX512 version good doesn't work in AVX2 and it's real awkward to get the same thing done.

MobiusHorizons · 2026-01-21T21:57:33 1769032653

The benefit seems to be that we are one step closer to not needing to have the fallback path. This was probably a lot more relevant before Intel shit the bed with consumer avx-512 with e-cores not having the feature

adrian_b · 2026-01-22T13:52:12 1769089932

Most 512-bit instructions are not split into two 256-bit instructions, neither on Zen 4, nor on laptop Zen 5. This is a myth caused by a very poor choice of words of the AMD CEO at the initial Zen 4 presentation.

For most 512-bit instructions that operate on the vector registers, both Zen 4 and all the Intel CPUs supporting AVX-512 have an identical throughput: two 512-bit instructions per clock cycle.

There are only a few instructions where Zen 4 is inferior to the most expensive of the Intel server/workstation CPUs, but those are important instructions for some applications.

The Intel CPUs have a double throughput for the transfers with the L1 cache memory. Zen 4 can do only one 512-bit load per cycle plus one 512-bit store every other cycle. The Intel CPUs with AVX-512 support and Zen 5 (server/desktop/Halo) can do two 512-bit loads plus one 512-bit store per cycle.

The other difference is that the most expensive Intel CPUs (typically Gold/Platinum Xeons) have a second floating-point multiplier, which is missing on Zen 4 and on the cheaper Intel SKUs. Thus Zen 4 can do one fused multiply-add (or FMUL) plus one FP addition per cycle, while the most expensive Intel CPUs can do two FMA or FMUL per cycle. This results in a double performance for the most expensive Intel CPUs vs. Zen 4 in many linear algebra benchmarks, e.g. Linpack or DGEMM. However there are many other applications of AVX-512 besides linear algebra, where a Zen 4 can be faster than most or even all Intel CPUs.

On the other hand, server/desktop/Halo Zen 5 has a double throughput for most 512-bit instructions in comparison with any Intel CPU. Presumably the future Intel Diamond Rapids server CPU will match the throughput of Zen 5 and Zen 6, i.e. of four 512-bit instructions per clock cycle.

On Zen 4, using AVX-512 provides very significant speedups in most cases over AVX2, despite the fact that the same execution resources are used. This proves that there still are cases when the ISA matters a lot.

convolvatron · 2026-01-21T22:38:12 1769035092

axv-512 for zen4 also includes a bunch of instructions that weren't in 256, including enhanced masking, 16 bit floats, bit instructions, double-sized double-width register file

adgjlsfhk1 · 2026-01-22T05:34:59 1769060099

Predicated instructions are incredibly useful (and avx-512 only). They let you get rid of the usual tail handling at the end of the loop.

fooker · 2026-01-21T23:53:21 1769039601

> it's still a strict upgrade over AVX2

If you benchmark it, it will be slower about half the time.

adrian_b · 2026-01-22T14:13:52 1769091232

I do not know what benchmarks do you have in mind.

All the benchmarks that I have ever seen published about AVX-512 vs. AVX2 on AMD Zen 4 have shown better performance on AVX-512. Very frequently, the AVX-512 performance was not a little better, but much better, despite the fact that on Zen 4 both program versions use exactly the same execution resources (but AVX-512 programs have less instructions, which avoids front-end bottlenecks).

adgjlsfhk1 · 2026-01-22T05:31:20 1769059880

for the simplest cases it will be about the same speed as avx2, but if you're trying to do anything fancy, the extra registers and instructions are a godsend.

fooker · 2026-01-22T12:10:56 1769083856

Well, try it out for a realistic program.

It makes for nice looking code, yes. But is often slower (for various reasons that are well understood by now).

adrian_b · 2026-01-22T14:15:30 1769091330

Please mention some of those reasons.

vintagedave · 2026-01-22T09:30:09 1769074209

I was surprised not to see Sleef mentioned as an option. It’s available for Rust and handles the architecture agnostic or portable needs they have.

https://docs.rs/sleef/latest/sleef/

nice_byte · 2026-01-22T03:04:29 1769051069

it's hard to believe that using simd extensions in rust is still as much of a chaotic clusterfudge as it was the first time I looked into it. no support in standard library? 3 different crates? might as well write inline assembly...

steveklabnik · 2026-01-22T03:27:00 1769052420

The standard library offers intrinsics but not a portable high level API just yet.

wyldfire · 2026-01-22T01:21:11 1769044871

Does Rust provide architecture-specific intrinsics like C/C++ toolchains usually do? That's a popular way to do SIMD.

steveklabnik · 2026-01-22T01:48:26 1769046506

Yes https://doc.rust-lang.org/stable/std/arch/

m-hilgendorf · 2026-01-22T04:56:45 1769057805

People should be aware though that without `-C target_feature=+<feature>` in your rustc flags the compiler may emit function calls to stubs for the intrinsic. So people should make sure they're passing the appropriate target features, especially when benchmarking.

[0] https://godbolt.org/z/85nx44zcE

edited: I tested gcc/clang and they just straight up fail to compile without -msse3. The generated code without optimizations is also pretty bonkers!

wyldfire · 2026-01-22T18:24:02 1769106242

> compiler may emit function calls to stubs for the intrinsic.

I guess the compiler itself might not know how those calls get resolved but the linker or cargo or some other mechanism might decide to include a library that provides stubs.

Why would anyone ever want that behavior? nop stubs could conceivably be used in some compiler isolated (no-execution) testing scenario but I'd expect it to be opt-in.

karavelov · 2026-01-22T01:53:25 1769046805

Yes, in the submodules of `std::arch`