Rebuilding the shoulders of giants
25 Apr 2026
The Situation
I have recently started yet another side project, this one with the aim of selling it internally to our business to push for both open source software and Rust. To that end I decided to tackle an actual business need, both because I had a good idea for how to solve one, and because without addressing a real requirement there was no way they were going to bite.
My target? Our internal archive server called “ArtifactRepository”, a glorified network share accessible over HTTP with a handful of metadata fields pulled from our central database. The purpose of this server is to store every build produced by our internal CI/CD system so it can be distributed from the single build agent to each of our 2,300 test agents. Unfortunately, the throughput required for 2,300 machines to pull a 2.3 GiB artifact (the size of a built artifact of our main product) from a single server long ago surpassed what one server / network card can handle, which forced us to use UFTP to multicast build artifacts directly onto the test agents.
This doesn’t make the situation much better though: we still store a full copy of every build, and UFTP doesn’t always work, so there are always plenty of test agents pulling down zip files of the builds. Yes, you read that right, zip files. Every build artifact is first zipped up into an archive, named with the GUID of the build it originated from, and then uploaded to our artifact repository / multicast onto the test agents.
And this is where the core of the problem lies - zip files, and in particular the space and computational inefficiencies they present.
The Problem
Zip files
That’s it?
Yeah…
The Reason?
Ok, ok, I’ll elaborate. Zip is an archaic format designed in 1989 for compressing files onto floppy disks - not exactly a common use case these days. Additionally (as my coworker likes to keep reminding me) zip stores its metadata table, the central directory, at the very end of the file, which means that even after downloading 90% of the data we cannot start processing any of it until we have read the whole file (the directory lives at the end so files could be appended to floppies without rewriting everything before them). This, however, pales in comparison to my biggest problem with zip (imo).
Each 👏🏻 File 👏🏻 Is 👏🏻 Compressed 👏🏻 Separately.
This has a few implications, but they all lead to the same conclusion: we use up a ton more space than we need. The main reason is that we cannot properly take advantage of the compression algorithm (deflate in zip’s case) to find and compress duplicate data across files, as for each file the compressor starts fresh, losing its compression dictionary.
And while no compression algorithm achieves perfect deduplication, the difference is stark, especially with duplicate files, which are very common in .NET builds.
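To make that concrete, here is a tiny sketch of what “starting fresh per file” costs when a file is duplicated. This is not ARX code, just an illustration using the zstd crate: two identical blobs compressed separately cost roughly twice as much as the same two blobs compressed in a single stream, where the second copy mostly collapses into back-references.

```rust
// Not ARX code; a sketch using the `zstd` crate to show what per-file
// compression costs when files are duplicated across an archive.
fn main() -> std::io::Result<()> {
    // Pseudo-random 1 MiB blob standing in for a file that recurs across builds.
    let mut state: u64 = 0x243F_6A88_85A3_08D3;
    let blob: Vec<u8> = (0..1_048_576)
        .map(|_| {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            state as u8
        })
        .collect();

    // ZIP-style: each copy of the file gets its own fresh compressor/dictionary.
    let separate = zstd::encode_all(&blob[..], 3)?.len() * 2;

    // Solid-style: one stream sees both copies, so the second is mostly matches.
    let both = [blob.as_slice(), blob.as_slice()].concat();
    let solid = zstd::encode_all(&both[..], 3)?.len();

    println!("two copies separately: {separate} B, one solid stream: {solid} B");
    Ok(())
}
```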
The Solution?
Use something other than zip? 7z or rar perhaps, maybe even ZPAQ or WIM? Sure, any of them would drop our file sizes, quite drastically in some places too. But the real problem isn’t storage space on the ArtifactRepository host (although that is a problem too), it’s the wasted space and inefficiency of every build being uploaded as its own self-contained archive.
Because while 7z will cut down on storage space per build output, it won’t help decrease wasted storage across builds. We still store the same System.Text.Json.dll in 100+ different archive files.
The real solution is to stop storing files as separate archives and instead store them in their exploded view. With content addressing we get deduplication for free, and can skip uploading or downloading files that already exist, cutting down on bandwidth as well as storage.
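For a feel of what “deduplication for free” means in practice, here is a rough sketch of content addressing. The names are illustrative rather than the actual arx API (and the sha2 crate is used purely for demonstration): the key of a file is the hash of its bytes, so identical files collapse into a single stored object, and anything whose key already exists can be skipped entirely.

```rust
// Illustrative only; the hex digest stands in for an arx object key.
use sha2::{Digest, Sha256};
use std::collections::HashMap;

/// The storage key of a blob is simply the hex-encoded hash of its bytes.
fn key_for(bytes: &[u8]) -> String {
    Sha256::digest(bytes)
        .iter()
        .map(|b| format!("{b:02x}"))
        .collect()
}

fn main() {
    let mut store: HashMap<String, Vec<u8>> = HashMap::new();

    // Two builds shipping the same System.Text.Json.dll hash to the same key,
    // so the second copy is never stored (or uploaded) again.
    let files = [b"dll v1".to_vec(), b"dll v1".to_vec(), b"other file".to_vec()];
    for file in files {
        store.entry(key_for(&file)).or_insert(file);
    }

    assert_eq!(store.len(), 2); // three files in, two unique objects stored
}
```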
The ARX Artifact System
The ARX artifact system is a content-addressed archiver designed specifically to minimize space on the server, as well as to cut down on bandwidth by skipping files that already exist on the local or remote store.
It is built around 2 core parts:
- The Store
- The Archive
The Store
The store is simply a storage medium for indexes (the arx name for the “artifact” unit) and any files/folders they contain. It is most commonly just a folder, but could be backed by a database, an S3 bucket, or any other storage medium (please don’t use an ICMP drive…). The only requirement is that it stores data in a vaguely file-like format that we can look up by its name/key.
This key is the content hash of the file/folder it stores, which makes the store not only idempotent but also safe to use concurrently, even if the underlying storage medium isn’t.
The most common store is probably an HTTP server backed by either a file store or an S3 bucket, conforming to the (in progress) arx store API specification. This allows easy uploading and downloading of indexes to and from the server, as well as partial downloads / uploads.
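The trait below is a hypothetical sketch of that contract, not the actual arx code: a backend only has to answer existence checks and move bytes by key, which is why a plain folder, an S3 bucket, or an HTTP server all qualify. And because the key is a content hash, a write to an existing key can simply be skipped.

```rust
// Hypothetical shape, not the actual arx trait; the point is how little a
// backend has to provide.
use std::io;
use std::path::PathBuf;

trait Store {
    fn has(&self, key: &str) -> io::Result<bool>;
    fn get(&self, key: &str) -> io::Result<Vec<u8>>;
    fn put(&self, key: &str, data: &[u8]) -> io::Result<()>;
}

/// Folder-backed store: each object lives at `<root>/<key>`.
struct DirStore {
    root: PathBuf,
}

impl Store for DirStore {
    fn has(&self, key: &str) -> io::Result<bool> {
        Ok(self.root.join(key).exists())
    }

    fn get(&self, key: &str) -> io::Result<Vec<u8>> {
        std::fs::read(self.root.join(key))
    }

    fn put(&self, key: &str, data: &[u8]) -> io::Result<()> {
        let path = self.root.join(key);
        if path.exists() {
            // The key is the content hash, so an existing object is already correct.
            return Ok(());
        }
        std::fs::write(path, data)
    }
}

fn main() -> io::Result<()> {
    let store = DirStore { root: std::env::temp_dir().join("arx-demo") };
    std::fs::create_dir_all(&store.root)?;
    if !store.has("example-key")? {
        store.put("example-key", b"object bytes")?;
    }
    Ok(())
}
```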
The Archive
However, since a lot of these indexes either need to be downloaded by users or should be transferred as a single unit, we also have a custom archive format that stores all deduplicated files within a single file.
This archive format is designed with streaming in mind, storing the full header table at the start of the compressible section, which allows files to be written out concurrently as they are extracted.
It comes in two flavours: the .arx format, a single file containing all data referenced by an index; and the .sar file, a “sparse” or “supplemental” archive containing only a subset of files, primarily designed to fill a store with the files missing for a particular index.
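To illustrate why the up-front header table matters, here is a sketch of a streaming extractor. The Entry/Header layout is an assumption made up for this example, not the real .arx format: because every entry’s offset and size are known from the start, files can be written out the moment their bytes arrive, rather than waiting for a zip-style central directory at the end.

```rust
// Not the real .arx layout; an assumed header shape to illustrate streaming.
struct Entry {
    path: String, // where the file goes on extraction
    offset: u64,  // byte position of the file within the data section
    size: u64,    // length of the file's bytes in the data section
}

struct Header {
    entries: Vec<Entry>,
}

/// Walk incoming chunks and report each entry the moment all of its bytes have
/// arrived; a zip reader could not do this, because the central directory it
/// needs sits at the very end of the file.
fn extract_streaming(header: &Header, chunks: impl Iterator<Item = Vec<u8>>) {
    let mut received: u64 = 0;
    let mut pending: Vec<&Entry> = header.entries.iter().collect();

    for chunk in chunks {
        received += chunk.len() as u64;
        pending.retain(|entry| {
            let complete = entry.offset + entry.size <= received;
            if complete {
                // In a real extractor the bytes would be decompressed and
                // written to `entry.path` here, while the download continues.
                println!("ready to write {}", entry.path);
            }
            !complete
        });
    }
}

fn main() {
    let header = Header {
        entries: vec![
            Entry { path: "bin/App.dll".into(), offset: 0, size: 4 },
            Entry { path: "bin/App.pdb".into(), offset: 4, size: 8 },
        ],
    };
    // Simulate the archive body arriving in 4-byte network chunks.
    extract_streaming(&header, (0..3).map(|_| vec![0u8; 4]));
}
```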
Rebuilding the shoulders of giants
The ARX artifact system is not a novel invention but rather a re-implementation of the core of git’s design, in particular the Merkle tree and the idea of treating directories as content-addressed objects.
Indexes mirror commits almost exactly (with the exclusion of “parents”), and while some modifications had to be made to trees for them to capture the metadata required, the file modes and content addressing are the same.
This fundamental design gives us not only rock-solid file-level deduplication but also lets us quickly and easily diff both the full object set and the tree itself (e.g. directories with changed files). So we don’t just get a deduplicating archiver (such as zpaq or wim), we get a seamless system that integrates local and remote.
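Loosely, the Merkle-tree part works like this (the encoding below is illustrative, not the exact arx object format, and again leans on the sha2 crate): a directory’s hash is derived from its children’s names, modes and hashes, so identical subtrees always produce the same object, and a diff only has to descend into subtrees whose hashes differ.

```rust
// Illustrative encoding only, not the exact arx tree format.
use sha2::{Digest, Sha256};

enum Node {
    File { mode: u32, hash: [u8; 32] },   // leaf: content hash of the file
    Dir { entries: Vec<(String, Node)> }, // tree: named children
}

/// A directory's identity is derived from its children's names, modes and
/// hashes, so equal subtrees always hash identically (and diffing can skip any
/// subtree whose hash matches on both sides).
fn hash_of(node: &Node) -> [u8; 32] {
    match node {
        Node::File { hash, .. } => *hash,
        Node::Dir { entries } => {
            let mut hasher = Sha256::new();
            for (name, child) in entries {
                hasher.update(name.as_bytes());
                if let Node::File { mode, .. } = child {
                    hasher.update(mode.to_le_bytes());
                }
                hasher.update(hash_of(child));
            }
            hasher.finalize().into()
        }
    }
}

fn main() {
    let dir = |bytes: &[u8]| Node::Dir {
        entries: vec![(
            "lib.dll".to_string(),
            Node::File { mode: 0o100644, hash: Sha256::digest(bytes).into() },
        )],
    };
    // Identical contents => identical tree hashes, regardless of which build they came from.
    assert_eq!(hash_of(&dir(b"same bytes")), hash_of(&dir(b"same bytes")));
    assert_ne!(hash_of(&dir(b"same bytes")), hash_of(&dir(b"changed")));
}
```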
So is it any good?
The numbers we are seeing are quite spectacular compared to zip, and even tar gets a run for its money on folders over 1 GB (tar being single-threaded). But the real question is how much smaller the output is.
To this end here are some numbers we have collected on two internal builds. One representing a “worst case” scenario with very few duplicated files, and one representing the ideal average we are seeing.
Worst Case:

Source Size: 6.67 GB (6,673,602,560 bytes)

| Method | Time | Size |
| --- | --- | --- |
| ZIP Best | 1m 55.603s | 2.39 GB (2,386,735,364 bytes) |
| ZIP None | 8.621s | 6.68 GB (6,676,012,856 bytes) |
| ARX Deflate | 1m 32.230s | 2.14 GB (2,141,875,314 bytes) |
| ARX ZStd | 5.926s | 1.97 GB (1,965,064,773 bytes) |
| ARX None | 7.717s | 6.08 GB (6,080,060,327 bytes) |
Here we see that Deflate is a very slow compression algorithm, yet we still managed to shave over 20 seconds off the time taken. zstd, however, is the real star of the show in terms of performance, beating even the uncompressed runs of both zip and arx.
Best/Average Case:

Source Size: 3.03 GB (3,026,909,139 bytes)

| Method | Time | Size |
| --- | --- | --- |
| ZIP Best | 1m 06.289s | 1.30 GB (1,300,489,786 bytes) |
| ZIP None | 4.665s | 3.03 GB (3,030,561,161 bytes) |
| ARX Deflate | 10.111s | 212.1 MB (212,119,207 bytes) |
| ARX ZStd | 1.520s | 198.6 MB (198,597,614 bytes) |
| ARX None | 1.703s | 512.2 MB (512,243,541 bytes) |
This particular example perfectly showcases the benefits of deduplication. While we spend a little extra time at the outset to generate a full index of the source, deduplication saves us over 50 seconds even with Deflate. We also see that even without compression the raw data alone shrinks by 83.1%, which leads to these truly mind-boggling numbers.
For reference, ZIP Best is the compression implementation currently used in production.
Conclusion
While this project is still very much ongoing, it is already publicly available and has shown promise within the open source community. With any luck this won’t just be another wheel I invented, but a tool that is useful to others.
If you want to keep up to date with the project, or just chat with me personally, feel free to come hang out in our public Matrix channel.