How I increased performance 2.5x by proving the internet wrong (kinda)

26 Apr 2026

Introduction

This is a story about the ARX project mentioned in a previous blog post.


What I did not mention in that previous blog post is that initially I had implemented ARX with SHA2-512 as the core hashing algorithm. This decision was primarily informed by my extensive online research on performance comparisons between SHA2-256 and SHA2-512, which overwhelmingly suggested that SHA2-512 was faster than SHA2-256 on 64-bit architectures.


And because I don’t see ARX targeting anything but 64-bit architectures, I of course followed the public consensus on Stack Overflow and in the various benchmarks I found online.

Well, there’s your problem.

Yeah yeah, “don’t trust everything you read on the internet,” I know, I know. But as software engineers we are all dependent on the internet providing us with up-to-date and accurate information so we can do our jobs properly.


A new library is useless without its documentation, and trying to solve every problem yourself is a recipe for wasted time and, in the worst case, insanity. Such is a developer’s life: we have all had to learn to use this immensely helpful resource and tell signal from noise. And in this case, there was a clear signal that SHA2-512 was faster, so I went with it.

But is it actually faster?

Well, here comes the interesting part: I had been working so firmly under that assumption that I did not even consider an alternative. Which is why, when I was adding support for timestamps and other file metadata to the ARX archive (via tree entries), I spent a not insignificant amount of time trying to bit-pack Linux and Windows flags into a single byte, yet did not consider cutting the hash size in half.


It was only when I realized that I was penny-pinching over 1 byte (+8 for a timestamp) while the hash was using up a whole 64 of them that I started questioning whether the speed benefit was really worth the additional size. I had worked out that using a 512-bit hash over a 256-bit one adds ~988 KiB per archive in hard-to-compress hash data, which was incentive enough to at least write a benchmark.
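The size argument is simple arithmetic: every stored hash costs 32 extra bytes when it is SHA2-512 rather than SHA2-256. The sketch below assumes the overhead scales linearly with the number of stored hashes; the function names are made up for illustration, and the post does not say how many hashes the measured archive actually contained.

```rust
/// Digest sizes in bytes.
const SHA256_LEN: u64 = 32;
const SHA512_LEN: u64 = 64;

/// Extra bytes spent when every stored hash is SHA2-512 instead of SHA2-256.
/// Assumes one fixed-size digest per stored hash and nothing else changing.
fn extra_hash_bytes(hash_count: u64) -> u64 {
    hash_count * (SHA512_LEN - SHA256_LEN)
}

/// Convert a byte count to KiB.
fn to_kib(bytes: u64) -> f64 {
    bytes as f64 / 1024.0
}
```

As a sanity check, `to_kib(extra_hash_bytes(31_616))` evaluates to `988.0`, so a hypothetical count of 31,616 hashes would reproduce the ~988 KiB figure.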

The Benchmarking

The benchmark, available in our community channel here, is a super simple Rust program built on the sha2 crate: it loads 1 KiB of random data into memory and repeatedly writes it into a hasher until a total of 10 GiB has been hashed.
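The measurement loop described above can be sketched roughly as follows. This is not the actual benchmark: to stay buildable with the standard library alone, it uses a hypothetical `StreamHasher` trait and a stand-in FNV-1a hasher instead of SHA-2, plus a small xorshift generator instead of a real RNG. In the real thing, the sha2 crate's `Sha256` and `Sha512` types would take the stand-in's place.

```rust
use std::time::{Duration, Instant};

/// Minimal streaming-hasher interface (hypothetical, for this sketch only).
trait StreamHasher {
    fn update(&mut self, data: &[u8]);
}

/// Stand-in hasher (64-bit FNV-1a) so the sketch needs no external crates.
/// This is NOT SHA-2; it only gives the loop something to feed.
struct Fnv1a(u64);

impl Fnv1a {
    fn new() -> Self {
        Fnv1a(0xcbf29ce484222325)
    }
}

impl StreamHasher for Fnv1a {
    fn update(&mut self, data: &[u8]) {
        for &b in data {
            self.0 ^= b as u64;
            self.0 = self.0.wrapping_mul(0x100000001b3);
        }
    }
}

/// Fill a 1 KiB buffer with pseudo-random bytes (xorshift64, seed must be non-zero).
fn random_chunk(mut seed: u64) -> [u8; 1024] {
    let mut buf = [0u8; 1024];
    for b in buf.iter_mut() {
        seed ^= seed << 13;
        seed ^= seed >> 7;
        seed ^= seed << 17;
        *b = seed as u8;
    }
    buf
}

/// Feed `chunk` into `hasher` until at least `total_bytes` have been written.
/// Returns the bytes actually written and the elapsed wall-clock time.
fn hash_until<H: StreamHasher>(hasher: &mut H, chunk: &[u8], total_bytes: u64) -> (u64, Duration) {
    assert!(!chunk.is_empty(), "chunk must not be empty");
    let start = Instant::now();
    let mut written = 0u64;
    while written < total_bytes {
        hasher.update(chunk);
        written += chunk.len() as u64;
    }
    (written, start.elapsed())
}

/// Throughput in MiB/s for `bytes` processed in `elapsed`.
fn throughput_mib_s(bytes: u64, elapsed: Duration) -> f64 {
    (bytes as f64 / (1024.0 * 1024.0)) / elapsed.as_secs_f64()
}
```

Calling `hash_until(&mut hasher, &chunk, 10 * 1024 * 1024 * 1024)` mirrors the 10 GiB run. A real benchmark should also consume the final hash value so the optimizer cannot discard the loop.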


The initial results were shocking:

sha2-10GiB/SHA-256/10GiB
	time: 	[5.6370 s 5.7111 s 5.7774 s]
	thrpt:  [1.7309 GiB/s 1.7510 GiB/s 1.7740 GiB/s]

sha2-10GiB/SHA-512/10GiB
	time: 	[16.039 s 16.510 s 17.034 s]
	thrpt:  [601.13 MiB/s 620.22 MiB/s 638.43 MiB/s]

Somehow SHA2-256 was almost 3x faster than SHA2-512 (roughly 2.9x going by the middle estimates above), which ran completely contrary to what I had read online, and therefore to my own expectations.


I couldn’t believe my results, so I uploaded the benchmark to the community Matrix room and asked others to repeat the test and post their results. Out of the 8 individual runs reported, only a single one showed SHA2-512 being faster: a ~2020 Dell Latitude 5400 with an i7-8565U CPU.


But I couldn’t for the life of me work out why across all of these diverse machines (including two mobile phones) we kept getting results counter to those posted online.


Worst of all, one member of the community posted a link to an external benchmark for a third hashing algorithm, “blake3,” which in their repository also showed SHA2-512 being faster. They use a different crate, openssl, so I decided to clone the repository and benchmark it myself. And against every expectation of mine, it once again showed SHA2-256 being faster.

So what is going on?

Well, I honestly struggled to tell. I suspected it might have something to do with the implementations being different, perhaps even some instructions that were better optimized for SHA-256. But after digging through the sha2-asm crate’s assembly, I could not find any obvious difference. The only avenue I had left to look into was the crate itself.


And this is where I noticed that I had been using an older version of the sha2 crate. When inspecting the latest 0.11.0 version I noticed mention of differing “backends,” in particular for hardware acceleration. Surely, I thought, all devices that support SHA-256 hardware acceleration would also support SHA-512. But after more research on my specific CPU, and some of the other models in the office, I noticed that fewer than a quarter of the devices even supported SHA-512 acceleration. Worse than that, the sha2 crate seems to be broken when forcing the SHA2-512 hardware acceleration backend (understandable given how new it is), which means that even if a device were to support it, we cannot take advantage of it yet.
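To check a given machine, Rust's standard library can query CPU features at runtime. The sketch below is one way to do that, not how the sha2 crate picks its backend. The `ShaHwSupport` struct and `detect_sha_hw` function are names made up for illustration. On x86 it only probes the SHA extensions (which cover SHA-256) and deliberately leaves SHA-512 unprobed, on the assumption that the matching feature name may not be available on older compilers.

```rust
/// Runtime CPU support for SHA-2 instructions.
/// `None` means this sketch does not probe that feature on this architecture.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ShaHwSupport {
    sha256: Option<bool>,
    sha512: Option<bool>,
}

/// x86 / x86_64: the "sha" feature covers the SHA-1 and SHA-256 instructions.
/// SHA-512 instructions are a separate, newer extension and are not probed here.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
fn detect_sha_hw() -> ShaHwSupport {
    ShaHwSupport {
        sha256: Some(std::arch::is_x86_feature_detected!("sha")),
        sha512: None,
    }
}

/// AArch64: "sha2" covers SHA-1/SHA-256, "sha3" covers SHA-512/SHA-3.
#[cfg(target_arch = "aarch64")]
fn detect_sha_hw() -> ShaHwSupport {
    ShaHwSupport {
        sha256: Some(std::arch::is_aarch64_feature_detected!("sha2")),
        sha512: Some(std::arch::is_aarch64_feature_detected!("sha3")),
    }
}

/// Any other architecture: nothing is probed.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64", target_arch = "aarch64")))]
fn detect_sha_hw() -> ShaHwSupport {
    ShaHwSupport { sha256: None, sha512: None }
}
```

Printing `detect_sha_hw()` with `{:?}` on each test machine gives a quick picture of which devices could benefit from an accelerated backend at all.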

Conclusion

This had me thinking long and hard about whether or not hardware acceleration counts. After disabling it and running software-only implementations, I saw the same results as everybody else, with SHA2-512 being faster than SHA2-256 by a small margin. But while I proved that in software SHA2-512 is faster, the fact remains that in the real world, with hardware acceleration, SHA2-256 is faster.


So I claim a moral victory, as SHA2-256 has been hardware accelerated since 2014 (because it is heavily used in TLS certificate verification), and therefore almost all devices support it. This means real-world performance is better with SHA2-256 for a majority of devices, which is all that matters. And even on devices without hardware acceleration, or those with SHA2-512 hardware acceleration, the smaller size of SHA2-256 still makes it the better choice for ARX.


But how did this affect ARX’s performance? Well, I ran some benchmarks comparing the versions right before and after the change and found that the SHA2-256 version was about 2.5x faster than the SHA2-512 one, dropping on average 1.1s from the index generation time (for a 6.68 GiB directory). This is a huge improvement, and together with the space savings is another step towards a more efficient ARX.

==================================
SHA512 vs SHA256 Performance
==================================

SHA512:
  Runs:    17
  Average: 1.82705s

SHA256:
  Runs:    28
  Average: 0.728944s

==================================
Speedup: 2.50x faster
Improvement: 60.0% faster
==================================
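The two summary lines describe the same measurement from different angles: "2.5x faster" is the ratio of the average run times, while "60%" is the share of wall-clock time saved. A small sketch of both calculations, with function names made up for illustration:

```rust
/// How many times faster the new run is (old time divided by new time).
fn speedup(old_secs: f64, new_secs: f64) -> f64 {
    old_secs / new_secs
}

/// Percentage of wall-clock time saved relative to the old run.
fn time_saved_percent(old_secs: f64, new_secs: f64) -> f64 {
    (1.0 - new_secs / old_secs) * 100.0
}
```

With the averages above, `speedup(1.82705, 0.728944)` comes out at roughly 2.51 and `time_saved_percent(1.82705, 0.728944)` at roughly 60.1, which matches the ~1.1s drop in index generation time.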