I often like to start my day by keeping myself abreast of interesting things in the world of computer programming. Often this means a visit to Hacker News or Lobsters. I somehow managed to come across Richard Startin's article about why he doesn't use Google Protocol Buffers for telemetry. Near the bottom, he has an aside about VarInts and shares a few tricks to produce them more efficiently.
Why was this relevant to me? Because the Minecraft: Java Edition protocol uses VarInts extensively. But what exactly are we looking at here? Let's dive a bit deeper.
What even is a VarInt? #
A Protocol Buffers VarInt is a 1 to 10 byte sequence that encodes a specific number (usually up to 32-bit for up to 5 byte VarInts, 64-bit for up to 10 byte VarInts). The process for encoding a VarInt is simple:
1. Check if the value is less than `0x80`. (This is usually done with a simple AND and comparison, i.e. `(value & ~0x7FL) == 0`.) If so, write the value out to the next position in the buffer and return.
2. Grab the lowest 7 bits of the value being written (`value & 0x7F`) and set the 8th bit (`(value & 0x7F) | 0x80`). Write this out into the next position in the buffer.
3. Logically shift the value to the right by 7 bits and repeat from step 1.
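To make these steps concrete, here is a small sketch (mine, not from the article) that applies them against a plain byte array. Encoding 300 should produce the two bytes `0xAC 0x02`: the low 7 bits (`0x2C`) with the continuation bit set, followed by the remaining bits (`0x02`).

```java
public class VarIntSteps {
    // Encodes value into buf starting at offset; returns the number of bytes written.
    public static int writeVarInt(byte[] buf, int offset, int value) {
        int position = offset;
        while (true) {
            if ((value & ~0x7F) == 0) {
                // Step 1: the remaining value fits in 7 bits; write it and stop.
                buf[position++] = (byte) value;
                return position - offset;
            }
            // Step 2: low 7 bits, plus the continuation bit.
            buf[position++] = (byte) ((value & 0x7F) | 0x80);
            // Step 3: logical shift right by 7 and repeat.
            value >>>= 7;
        }
    }

    public static void main(String[] args) {
        byte[] buf = new byte[5];
        int len = writeVarInt(buf, 0, 300);
        System.out.printf("%d bytes: 0x%02X 0x%02X%n", len, buf[0] & 0xFF, buf[1] & 0xFF);
        // 2 bytes: 0xAC 0x02
    }
}
```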
Here is the naive implementation (sourced from the Google Protocol Buffers library's `CodedOutputStream`):
```java
while (true) {
    if ((value & ~0x7FL) == 0) {
        buffer[position++] = (byte) value;
        return;
    } else {
        buffer[position++] = (byte) (((int) value & 0x7F) | 0x80);
        value >>>= 7;
    }
}
```
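For completeness, the read side (which the article doesn't cover) is the same process in reverse: accumulate 7 bits at a time until a byte arrives without the continuation bit. A minimal sketch of my own against a plain byte array, assuming a well-formed, at-most-5-byte encoding:

```java
public class VarIntReader {
    // Decodes a 32-bit VarInt from buf starting at offset. A production
    // reader would also bound-check and reject over-long encodings.
    public static int readVarInt(byte[] buf, int offset) {
        int result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            byte b = buf[offset++];
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {  // continuation bit clear: last byte
                return result;
            }
        }
        throw new IllegalArgumentException("VarInt too long");
    }

    public static void main(String[] args) {
        byte[] encoded = { (byte) 0xAC, 0x02 };
        System.out.println(readVarInt(encoded, 0));  // 300
    }
}
```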
For the purposes of demonstration we will focus on the 32-bit VarInts only - the techniques shown here can be extended to 64-bit VarLongs (at a not-insignificant cost to your sanity).
Now, for the main star of the show - the old VarInt writing implementation from Velocity:
```java
while (true) {
    if ((value & 0xFFFFFF80) == 0) {
        buf.writeByte(value);
        return;
    }
    buf.writeByte(value & 0x7F | 0x80);
    value >>>= 7;
}
```
This is simply a version of the above method, but adapted to write to a Netty `ByteBuf`. With the necessary background taken care of, let's have some fun.
Disclaimer #
I am by no means an expert on how the JIT operates, but I am very interested in fun and unique ways to solve problems!
The testing system is my home PC, featuring an AMD Ryzen 9 3900X processor with 32GB of DDR4-3600 memory, running Fedora 34 and Java 11 as packaged in Fedora. We use the flag `-Dio.netty.leakDetection.level=DISABLED`, as otherwise 1% of buffers would be selected for leak detection instrumentation, which would introduce unfavorable noise into the results.
The benchmarks in question allocate a direct buffer 5 bytes long, write a VarInt into it, and then clear it. This is repeated for a pseudo-random sequence of 2,048 numbers in order to defeat the JVM's tendency to assume that a certain branch would always be taken.
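The shape of one benchmark iteration can be sketched without JMH or Netty (a plain `byte[]` stands in for the direct `ByteBuf`, and the names here are mine, not from the benchmark repository). The fixed seed gives the same pseudo-random sequence every run, while still varying enough that the JIT cannot assume one branch is always taken:

```java
import java.util.Random;

public class VarIntBench {
    static final int[] NUMBERS = new int[2048];
    static {
        Random rnd = new Random(42);  // fixed seed: reproducible but varied input
        for (int i = 0; i < NUMBERS.length; i++) {
            NUMBERS[i] = rnd.nextInt();
        }
    }

    static int writeVarInt(byte[] buf, int value) {
        int position = 0;
        while (true) {
            if ((value & ~0x7F) == 0) {
                buf[position++] = (byte) value;
                return position;
            }
            buf[position++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
    }

    // One benchmark "op": write every number into a 5-byte buffer, resetting
    // the write position each time (the Netty version clears the ByteBuf).
    static long oneOp() {
        byte[] buf = new byte[5];
        long totalBytes = 0;
        for (int n : NUMBERS) {
            totalBytes += writeVarInt(buf, n);
        }
        return totalBytes;
    }

    public static void main(String[] args) {
        System.out.println("bytes written per op: " + oneOp());
    }
}
```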
Try 1: Richard Startin's Method #
Let's try using Richard Startin's tricks. Here's the method above, rewritten using his approach:
```java
int continuationBytes = (31 - Integer.numberOfLeadingZeros(value)) / 7;
for (int i = 0; i < continuationBytes; ++i) {
    buf.writeByte(((byte) ((value & 0x7F) | 0x80)));
    value >>>= 7;
}
buf.writeByte((byte) value);
```
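The key identity here is that `(31 - Integer.numberOfLeadingZeros(value)) / 7` counts the continuation bytes, i.e. one fewer than the total encoded length. A small sanity check of my own (note that `value == 0` works out only because Java integer division truncates `(31 - 32) / 7` toward zero):

```java
public class VarIntLength {
    // Continuation-byte count per Startin's trick.
    static int continuationBytes(int value) {
        return (31 - Integer.numberOfLeadingZeros(value)) / 7;
    }

    public static void main(String[] args) {
        // Values straddling the 7/14/21-bit boundaries, plus a negative number.
        int[] samples = { 0, 127, 128, 16383, 16384, 2097152, -1 };
        for (int v : samples) {
            System.out.println(v + " -> " + (continuationBytes(v) + 1) + " byte(s)");
        }
    }
}
```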
What do we get?
```
Benchmark                                     Mode  Cnt      Score      Error  Units
VarIntWriterBenchmark.oldVelocityVarintWrite  thrpt    5  97618.250 ± 1682.211  ops/s
VarIntWriterBenchmark.startinVarintWrite      thrpt    5  73264.856 ±  664.202  ops/s
```
This is actually worse than we had before! But why?
Getting to the bottom of it #
Is it because of the `Integer.numberOfLeadingZeros()` call? The answer is probably a resounding no - after all, `Integer.numberOfLeadingZeros()` is an intrinsic on at least x86_64 (the only architecture where we are running these benchmarks) and Arm. Indeed, my Zen 2 processor can process 3 `LZCNT` instructions every cycle, so in theory this should deliver performance improvements.
In isolation, we can devise a small benchmark to see if this educated guess was right:
```java
@Benchmark
public int lzcnt() {
    int sum = 0;
    for (int number : numbers) {
        sum += (31 - Integer.numberOfLeadingZeros(number)) / 7;
    }
    return sum;
}

@Benchmark
public int byHand() {
    int sum = 0;
    for (int value : numbers) {
        while (true) {
            if ((value & 0xFFFFFF80) == 0) {
                sum++;
                break;
            }
            sum++;
            value >>>= 7;
        }
    }
    return sum;
}
```
The results:
```
Benchmark                     Mode  Cnt       Score       Error  Units
VarIntWriterBenchmark.byHand  thrpt    5  323063.055 ±  1759.055  ops/s
VarIntWriterBenchmark.lzcnt   thrpt    5  570563.618 ± 17726.994  ops/s
```
At least in isolation, this doesn't seem to be the case. But as we'll see later, this benchmark doesn't really tell us anything meaningful. Perhaps this change makes the JIT compiler more trigger-happy in a way that tanks our performance, thanks to the more traditional loop style we now employ. I'm not sure.
Try 2: Unrolling the Loop #
Okay, let's go back a step. Remember that the loop we are running could run 1 to 5 times, depending on how large the VarInt will be. We also know that the encoded length grows by one byte for every 7 bits in the number. Given that, what if we simply unrolled the loop, accounting for all of the cases?
Here's the code:
```java
if (value < 0) {
    buf.writeByte(value & 0x7F | 0x80);
    buf.writeByte((value >>>= 7) & 0x7F | 0x80);
    buf.writeByte((value >>>= 7) & 0x7F | 0x80);
    buf.writeByte((value >>>= 7) & 0x7F | 0x80);
    buf.writeByte(value >>> 7);
} else if (value < 128) {
    buf.writeByte(value);
} else if (value < 16384) {
    buf.writeByte(value & 0x7F | 0x80);
    buf.writeByte(value >>> 7);
} else if (value < 2097152) {
    buf.writeByte(value & 0x7F | 0x80);
    buf.writeByte((value >>>= 7) & 0x7F | 0x80);
    buf.writeByte(value >>> 7);
} else if (value < 268435456) {
    buf.writeByte(value & 0x7F | 0x80);
    buf.writeByte((value >>>= 7) & 0x7F | 0x80);
    buf.writeByte((value >>>= 7) & 0x7F | 0x80);
    buf.writeByte(value >>> 7);
} else {
    buf.writeByte(value & 0x7F | 0x80);
    buf.writeByte((value >>>= 7) & 0x7F | 0x80);
    buf.writeByte((value >>>= 7) & 0x7F | 0x80);
    buf.writeByte((value >>>= 7) & 0x7F | 0x80);
    buf.writeByte(value >>> 7);
}
```
What do we get now?
```
Benchmark                                     Mode  Cnt      Score      Error  Units
VarIntWriterBenchmark.oldVelocityVarintWrite  thrpt    5  95870.960 ± 4328.322  ops/s
VarIntWriterBenchmark.startinVarintWrite      thrpt    5  73097.880 ± 1116.550  ops/s
VarIntWriterBenchmark.unrolledVarintWrite     thrpt    5  95840.321 ± 2882.220  ops/s
```
There's no change, but perhaps we could have expected that from naively unrolling the loop and just figuring out when certain byte lengths would be written. We can, of course, do better than that.
Try 3: Smartly Unrolling the Loop #
Let's take a look at that unrolled code again. It's clearly not optimal, since it repeats a lot of the same code, including multiple calls to `ByteBuf#writeByte()`. Can we do better than this? In fact, we can:
```java
if ((value & 0xFFFFFF80) == 0) {
    buf.writeByte(value);
} else {
    buf.writeByte(value & 0x7F | 0x80);
    value >>>= 7;
    if ((value & 0xFFFFFF80) == 0) {
        buf.writeByte(value);
    } else {
        buf.writeByte(value & 0x7F | 0x80);
        value >>>= 7;
        if ((value & 0xFFFFFF80) == 0) {
            buf.writeByte(value);
        } else {
            buf.writeByte(value & 0x7F | 0x80);
            value >>>= 7;
            if ((value & 0xFFFFFF80) == 0) {
                buf.writeByte(value);
            } else {
                buf.writeByte(value & 0x7F | 0x80);
                value >>>= 7;
                buf.writeByte(value);
            }
        }
    }
}
```
This looks pretty gnarly already, but let's give it a shot:
```
Benchmark                                       Mode  Cnt       Score      Error  Units
VarIntWriterBenchmark.oldVelocityVarintWrite    thrpt    5   95113.915 ±  182.119  ops/s
VarIntWriterBenchmark.smartUnrolledVarintWrite  thrpt    5  124865.620 ±  550.267  ops/s
VarIntWriterBenchmark.unrolledVarintWrite       thrpt    5   93953.405 ± 1530.158  ops/s
```
That's very impressive - a nearly 30% performance improvement from improving the structure of the code. Can we improve this some more?
Try 4: The `numberOfLeadingZeros` Mystery Solved? #
Let's try to use `Integer.numberOfLeadingZeros()` to optimize this function. Perhaps it will give us a bit more performance.
```java
int continuationBytes = (31 - Integer.numberOfLeadingZeros(value)) / 7;
if (continuationBytes == 0) {
    buf.writeByte(value);
} else {
    buf.writeByte(value & 0x7F | 0x80);
    value >>>= 7;
    if (continuationBytes == 1) {
        buf.writeByte(value);
    } else {
        buf.writeByte(value & 0x7F | 0x80);
        value >>>= 7;
        if (continuationBytes == 2) {
            buf.writeByte(value);
        } else {
            buf.writeByte(value & 0x7F | 0x80);
            value >>>= 7;
            if (continuationBytes == 3) {
                buf.writeByte(value);
            } else {
                buf.writeByte(value & 0x7F | 0x80);
                value >>>= 7;
                buf.writeByte(value);
            }
        }
    }
}
```
I can't wait for all the performance improvements!
```
Benchmark                                       Mode  Cnt       Score      Error  Units
VarIntWriterBenchmark.smartUnrolledVarintWrite  thrpt    5  110664.096 ± 1460.675  ops/s
```
That's not good at all. We just regressed by more than 10%.
Try 5: Remove the Data Dependency #
That was a bust. Is there something else we can do? Well, there is a data dependency: we bit-shift the value right by 7 bits each time, and only then check whether we need to continue writing bytes. Since we know in advance how far each byte's shift will be, we can break that dependency and extract a little more instruction-level parallelism from the CPU:
```java
if ((value & (0xFFFFFFFF << 7)) == 0) {
    buf.writeByte(value);
} else {
    buf.writeByte(value & 0x7F | 0x80);
    if ((value & (0xFFFFFFFF << 14)) == 0) {
        buf.writeByte(value >>> 7);
    } else {
        buf.writeByte((value >>> 7) & 0x7F | 0x80);
        if ((value & (0xFFFFFFFF << 21)) == 0) {
            buf.writeByte(value >>> 14);
        } else {
            buf.writeByte((value >>> 14) & 0x7F | 0x80);
            if ((value & (0xFFFFFFFF << 28)) == 0) {
                buf.writeByte(value >>> 21);
            } else {
                buf.writeByte((value >>> 21) & 0x7F | 0x80);
                buf.writeByte(value >>> 28);
            }
        }
    }
}
```
How does this attempt fare?
```
Benchmark                                                       Mode  Cnt       Score      Error  Units
VarIntWriterBenchmark.oldVelocityVarintWrite                    thrpt    5   94480.674 ± 1389.432  ops/s
VarIntWriterBenchmark.smartNoDataDependencyUnrolledVarintWrite  thrpt    5  129428.934 ± 7405.911  ops/s
VarIntWriterBenchmark.smartUnrolledVarintWrite                  thrpt    5  126697.664 ± 5033.494  ops/s
```
It looks like the two approaches are in fact similar in performance, with the newer version performing perhaps a tad better, but the results are too noisy to know for sure.
Try 6: Training Ourselves in the Dark Arts #
Already we found a ~30% speedup from unrolling the loop carefully, but can we do better than that still? Yes, we can. As most performance-oriented engineers know, doing things in bulk can be a very profitable optimization. Up to this point, we've been making 1 to 5 calls to write individual bytes. It turns out that if we get a little bit creative, we can either write the entirety of the bytes up front (1, 2, or 4 bytes' worth) or combine one larger write with writing one byte (2 + 1 for 3 bytes, 4 + 1 for 5 bytes – VarLong writing is left as an exercise to the reader). Luckily, we don't need to handle the 2 + 1 case specially, as the Netty `ByteBuf` provides a `writeMedium()` method. To do this, we will instead construct the VarInt as a number, and then write it to the buffer. Let's give it a try:
```java
if ((value & (0xFFFFFFFF << 7)) == 0) {
    buf.writeByte(value);
} else if ((value & (0xFFFFFFFF << 14)) == 0) {
    int w = (value & 0x7F | 0x80) << 8 | (value >>> 7);
    buf.writeShort(w);
} else if ((value & (0xFFFFFFFF << 21)) == 0) {
    int w = (value & 0x7F | 0x80) << 16 | ((value >>> 7) & 0x7F | 0x80) << 8 | (value >>> 14);
    buf.writeMedium(w);
} else if ((value & (0xFFFFFFFF << 28)) == 0) {
    int w = (value & 0x7F | 0x80) << 24 | (((value >>> 7) & 0x7F | 0x80) << 16)
            | ((value >>> 14) & 0x7F | 0x80) << 8 | (value >>> 21);
    buf.writeInt(w);
} else {
    int w = (value & 0x7F | 0x80) << 24 | ((value >>> 7) & 0x7F | 0x80) << 16
            | ((value >>> 14) & 0x7F | 0x80) << 8 | ((value >>> 21) & 0x7F | 0x80);
    buf.writeInt(w);
    buf.writeByte(value >>> 28);
}
```
The end result is this:
```
Benchmark                                       Mode  Cnt       Score      Error  Units
VarIntWriterBenchmark.blendedVarintWrite        thrpt    5  175968.604 ± 5195.657  ops/s
VarIntWriterBenchmark.oldVelocityVarintWrite    thrpt    5   94270.376 ± 2796.116  ops/s
VarIntWriterBenchmark.smartUnrolledVarintWrite  thrpt    5  124524.690 ± 2929.593  ops/s
```
This is an enormous improvement, to say the least - from our baseline we got a nearly 1.87x speedup! But it comes at a cost: the above code is difficult to read and understand, and it is extremely fragile – extending it to 10-byte VarLongs would be a nightmare!
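Fragile code like this deserves a cross-check against the naive loop. Here is a sketch of my own that does exactly that, using plain arrays instead of Netty: since `writeShort()`, `writeMedium()`, and `writeInt()` on a `ByteBuf` are big-endian, big-endian byte splitting of the combined word stands in for them.

```java
import java.util.Arrays;

public class BlendedCheck {
    static int naive(byte[] out, int value) {
        int pos = 0;
        while (true) {
            if ((value & ~0x7F) == 0) { out[pos++] = (byte) value; return pos; }
            out[pos++] = (byte) ((value & 0x7F) | 0x80);
            value >>>= 7;
        }
    }

    // The blended write, with big-endian splits emulating writeShort/Medium/Int.
    static int blended(byte[] out, int value) {
        if ((value & (0xFFFFFFFF << 7)) == 0) {
            out[0] = (byte) value;
            return 1;
        } else if ((value & (0xFFFFFFFF << 14)) == 0) {
            int w = (value & 0x7F | 0x80) << 8 | (value >>> 7);
            out[0] = (byte) (w >>> 8); out[1] = (byte) w;
            return 2;
        } else if ((value & (0xFFFFFFFF << 21)) == 0) {
            int w = (value & 0x7F | 0x80) << 16 | ((value >>> 7) & 0x7F | 0x80) << 8 | (value >>> 14);
            out[0] = (byte) (w >>> 16); out[1] = (byte) (w >>> 8); out[2] = (byte) w;
            return 3;
        } else if ((value & (0xFFFFFFFF << 28)) == 0) {
            int w = (value & 0x7F | 0x80) << 24 | ((value >>> 7) & 0x7F | 0x80) << 16
                    | ((value >>> 14) & 0x7F | 0x80) << 8 | (value >>> 21);
            out[0] = (byte) (w >>> 24); out[1] = (byte) (w >>> 16);
            out[2] = (byte) (w >>> 8);  out[3] = (byte) w;
            return 4;
        } else {
            int w = (value & 0x7F | 0x80) << 24 | ((value >>> 7) & 0x7F | 0x80) << 16
                    | ((value >>> 14) & 0x7F | 0x80) << 8 | ((value >>> 21) & 0x7F | 0x80);
            out[0] = (byte) (w >>> 24); out[1] = (byte) (w >>> 16);
            out[2] = (byte) (w >>> 8);  out[3] = (byte) w;
            out[4] = (byte) (value >>> 28);
            return 5;
        }
    }

    public static void main(String[] args) {
        // Values on both sides of every byte-length boundary, plus negatives.
        int[] samples = { 0, 1, 127, 128, 300, 16383, 16384, 2097151, 2097152,
                268435455, 268435456, Integer.MAX_VALUE, -1, Integer.MIN_VALUE };
        byte[] a = new byte[5], b = new byte[5];
        for (int v : samples) {
            if (naive(a, v) != blended(b, v) || !Arrays.equals(a, b)) {
                throw new AssertionError("mismatch at " + v);
            }
        }
        System.out.println("all " + samples.length + " samples match");
    }
}
```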
Bonus: Exact byte length encoding #
Quoting Startin:

> When I went back and read it again I was surprised to find that embedded messages are length-prefixed, but the length prefix is varint encoded, which means you don’t know how many bytes you need for the length until you’ve done the serialisation, and it’s recursive.
Startin then argues this is one reason why he considers Protocol Buffers to be bad for reporting telemetry from the edge. But for those who are stuck with VarInts like I am, you can in fact write VarInts in a sequential fashion, and even with performance slightly above the traditional approach to writing VarInts. Specifically, it is possible for us to encode any number that can be encoded as a VarInt in exactly 5 bytes:
```java
int w = (value & 0x7F | 0x80) << 24 | ((value >>> 7) & 0x7F | 0x80) << 16
        | ((value >>> 14) & 0x7F | 0x80) << 8 | ((value >>> 21) & 0x7F | 0x80);
buf.writeInt(w);
buf.writeByte(value >>> 28);
```
Therefore, if you are willing to inflate the size of your messages, you can write the message out sequentially. Even better, you can even improve upon this result by encoding only the maximum amount you need. In the Minecraft protocol, compressed messages are always 21-bit or less, so we can encode the length of a compressed packet in just three VarInt bytes and save ourselves a memory copy. Very neat!
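That three-byte variant can be sketched like so (the names are mine, not from the protocol code): set the continuation bit on every byte except the last, so any value up to 2^21 - 1 always occupies exactly three bytes and can be patched into a length slot reserved earlier.

```java
public class FixedLenVarInt {
    // Writes value (assumed to fit in 21 bits) as exactly three VarInt bytes.
    static void write21(byte[] buf, int offset, int value) {
        buf[offset]     = (byte) ((value & 0x7F) | 0x80);
        buf[offset + 1] = (byte) (((value >>> 7) & 0x7F) | 0x80);
        buf[offset + 2] = (byte) (value >>> 14);
    }

    // Standard VarInt read, for verification: any decoder sees the same value.
    static int read(byte[] buf, int offset) {
        int result = 0;
        for (int shift = 0; ; shift += 7) {
            byte b = buf[offset++];
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) return result;
        }
    }

    public static void main(String[] args) {
        byte[] buf = new byte[3];
        write21(buf, 0, 5);                // would normally be a single byte
        System.out.println(read(buf, 0));  // 5, padded out to three bytes
    }
}
```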
But performance-wise, is it worth doing that?
```
Benchmark                                     Mode  Cnt       Score      Error  Units
VarIntWriterBenchmark.blendedVarintWrite      thrpt    5  179321.732 ± 7610.742  ops/s
VarIntWriterBenchmark.lucky5VarintWrite       thrpt    5  113540.019 ± 6062.895  ops/s
VarIntWriterBenchmark.oldVelocityVarintWrite  thrpt    5   95748.224 ± 2531.663  ops/s
```
If you have a need for this, then yes, it even beats the traditional loop. But despite appearances, this is not only not the fastest way to write a VarInt, it also completely defeats the point of using a VarInt. But it's a neat trick that could come in handy under certain circumstances.
But this is just one processor - I need more than that! #
I ran the tests on a DigitalOcean CPU-optimized droplet with 4GB of memory and two dedicated CPU threads of an Intel Xeon E5-2650v4, running Ubuntu 20.04, again with Java 11.
Here's the results:
```
Benchmark                                                       Mode  Cnt       Score      Error  Units
VarIntWriterBenchmark.blendedVarintWrite                        thrpt    5  103837.167 ±  629.934  ops/s
VarIntWriterBenchmark.oldVelocityVarintWrite                    thrpt    5   54538.692 ± 1312.263  ops/s
VarIntWriterBenchmark.smartNoDataDependencyUnrolledVarintWrite  thrpt    5   76576.717 ± 1051.005  ops/s
VarIntWriterBenchmark.smartUnrolledVarintWrite                  thrpt    5   76325.907 ±  855.393  ops/s
VarIntWriterBenchmark.startinVarintWrite                        thrpt    5   46513.811 ±  706.059  ops/s
VarIntWriterBenchmark.unrolledVarintWrite                       thrpt    5   56751.155 ±  803.914  ops/s
```
These results comport with the rest of the article. As for Arm, here are the results from my M1 MacBook Pro:
```
Benchmark                                                       Mode  Cnt       Score      Error  Units
VarIntWriterBenchmark.blendedVarintWrite                        thrpt    5  237790.486 ±  571.134  ops/s
VarIntWriterBenchmark.oldVelocityVarintWrite                    thrpt    5  165370.092 ± 1975.949  ops/s
VarIntWriterBenchmark.smartNoDataDependencyUnrolledVarintWrite  thrpt    5  188349.773 ± 2062.149  ops/s
VarIntWriterBenchmark.smartUnrolledVarintWrite                  thrpt    5  187717.560 ±  755.758  ops/s
VarIntWriterBenchmark.startinVarintWrite                        thrpt    5  115166.912 ±  611.680  ops/s
VarIntWriterBenchmark.unrolledVarintWrite                       thrpt    5  183905.054 ±  766.400  ops/s
```
Okay, so the Intel processor is behind, probably because it's clocked lower and is an older chip from the Broadwell era. But the M1 mops the floor in this particular benchmark, beating out both the Ryzen 9 3900X and the Intel Xeon E5-2650v4. Why?
Atomic Power #
When we call a write function on a Netty `ByteBuf`, there are several checks that have to be performed. These typically happen in `AbstractByteBuf#ensureWritable0()`. Specifically:

1. It checks that the written bytes after writing `minWritableBytes` would not exceed the buffer's current capacity.
2. It ensures that the buffer's reference count is not zero.
3. If the check in (1) passed, we are good to go (this is considered the "fast path").
4. Otherwise, if the new size of the buffer would exceed the buffer's current capacity (but not its maximum capacity), it will expand the buffer.

Since our buffer size is always 5 bytes, the first check will always succeed, and since the JIT knows that the branch is always taken, that just means we're left with the reference count check as our last source of slowdown. In Netty, reference counts are always accessed and modified atomically. We know that uncontended atomic access is fast on the M1. Does this explain the performance difference? Let's run the tests on our Ryzen processor again, but with the flag `-Dio.netty.buffer.checkAccessible=false`:
```
Benchmark                                                       Mode  Cnt       Score      Error  Units
VarIntWriterBenchmark.blendedVarintWrite                        thrpt    5  158655.053 ± 3702.731  ops/s
VarIntWriterBenchmark.oldVelocityVarintWrite                    thrpt    5  101722.002 ±  331.125  ops/s
VarIntWriterBenchmark.smartNoDataDependencyUnrolledVarintWrite  thrpt    5  131313.391 ±  392.404  ops/s
VarIntWriterBenchmark.smartUnrolledVarintWrite                  thrpt    5  146280.007 ± 1271.907  ops/s
VarIntWriterBenchmark.startinVarintWrite                        thrpt    5   80280.574 ±  636.335  ops/s
VarIntWriterBenchmark.unrolledVarintWrite                       thrpt    5  101123.052 ±  915.786  ops/s
```
That's certainly very different. `blendedVarintWrite` actually lost a bit over 10% of its throughput, but the other byte-by-byte approaches gained by at least 8%. What happens if we run it on the M1?
```
Benchmark                                                       Mode  Cnt       Score      Error  Units
VarIntWriterBenchmark.blendedVarintWrite                        thrpt    5  237697.049 ±  304.124  ops/s
VarIntWriterBenchmark.oldVelocityVarintWrite                    thrpt    5  166456.475 ±  384.274  ops/s
VarIntWriterBenchmark.smartNoDataDependencyUnrolledVarintWrite  thrpt    5  190258.830 ± 1163.047  ops/s
VarIntWriterBenchmark.smartUnrolledVarintWrite                  thrpt    5  189399.626 ±  186.078  ops/s
VarIntWriterBenchmark.startinVarintWrite                        thrpt    5  116002.657 ±  364.835  ops/s
VarIntWriterBenchmark.unrolledVarintWrite                       thrpt    5  184978.894 ± 3182.543  ops/s
```
This essentially didn't change. So we can conclude that `blendedVarintWrite` gets most of its performance from amortizing the cost of calls into the Netty API.
Conclusions #
It is very possible to write VarInts very quickly - close to 500 million per second on commodity consumer hardware with only a single thread (each benchmark op writes 2,048 VarInts, so the M1's ~238,000 ops/s works out to roughly 487 million VarInts per second).
Here are all the various methods by their time per VarInt written:
| Method | Intel Broadwell | AMD Zen 2 | Apple M1 |
|---|---|---|---|
| Baseline | 8.95 ns/op | 5.1 ns/op | 2.93 ns/op |
| Startin | 10.5 ns/op | 6.68 ns/op | 4.24 ns/op |
| Smart Unrolling | 6.4 ns/op | 3.92 ns/op | 2.6 ns/op |
| Dark Arts (#6) | 4.7 ns/op | 2.78 ns/op | 2.05 ns/op |
We have sped up VarInt writing on the x86 platforms by more than 80%, whereas the M1 received a more modest but still substantial 43% performance boost.
I invite critiques on the GitHub repository where I have published the JMH tests.
Licensing #
You are free to use any of the code snippets under the terms of the MIT license.