How fast can you write a VarInt?


I often like to start my day by keeping myself abreast of interesting things in the world of computer programming. Often this means a visit to Hacker News or Lobsters. I somehow managed to come across Richard Startin's article about why he doesn't use Google Protocol Buffers for telemetry. Near the bottom, he has an aside about VarInts and shares a few tricks to produce them more efficiently.

Why was this relevant to me? Because the Minecraft: Java Edition protocol uses VarInts extensively. But what exactly are we looking at here? Let's dive a bit deeper.

What even is a VarInt?

A Protocol Buffers VarInt is a 1- to 10-byte sequence that encodes a single number (typically up to 32 bits for VarInts of up to 5 bytes, and up to 64 bits for VarInts of up to 10 bytes). The process for encoding a VarInt is simple:

  1. Check if the value is less than 0x80. (This is usually done with a single AND and comparison, i.e. (value & ~0x7FL) == 0.) If so, write the value out to the next position in the buffer and return.
  2. Grab the lowest 7 bits of the value (value & 0x7F) and set the 8th bit (value & 0x7F | 0x80). Write this out to the next position in the buffer.
  3. Logically shift the value to the right by 7 bits and repeat from step 1.
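
For example, take 300, which is 100101100 in binary. The lowest 7 bits (0101100, or 0x2C) are written with the continuation bit set, producing 0xAC; after shifting right by 7, the remaining value (0b10, or 2) is less than 0x80 and is written as-is. The full encoding is therefore the two-byte sequence 0xAC 0x02.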

Here is the naive implementation (sourced from the Google Protocol Buffers library's CodedOutputStream):

while (true) {
  if ((value & ~0x7FL) == 0) {
    buffer[position++] = (byte) value;
    return;
  } else {
    buffer[position++] = (byte) (((int) value & 0x7F) | 0x80);
    value >>>= 7;
  }
}
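
For completeness, here's a minimal sketch of the matching read side, assuming a Netty ByteBuf as the source (this is illustrative, not the protobuf library's actual reader): accumulate 7 bits per byte until a byte arrives without the continuation bit set.

import io.netty.buffer.ByteBuf;

static int readVarInt(ByteBuf buf) {
  int value = 0;
  for (int shift = 0; shift < 35; shift += 7) {
    byte b = buf.readByte();
    value |= (b & 0x7F) << shift; // splice in the next 7-bit group
    if ((b & 0x80) == 0) {        // continuation bit clear: we're done
      return value;
    }
  }
  throw new IllegalArgumentException("VarInt is too long");
}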

For the purposes of demonstration, we will focus on 32-bit VarInts only - the techniques shown here can be extended to 64-bit VarLongs (at a not-insignificant cost to your sanity).

Now, for the main star of the show - the old VarInt writing implementation from Velocity:

while (true) {
  if ((value & 0xFFFFFF80) == 0) {
    buf.writeByte(value);
    return;
  }

  buf.writeByte(value & 0x7F | 0x80);
  value >>>= 7;
}

This is simply a version of the above method, but adapted to write to a Netty ByteBuf. With the necessary background taken care of, let's have some fun.

Disclaimer

I am by no means an expert on how the JIT operates, but I am very interested in fun and unique ways to solve problems!

The testing system is my home PC, featuring an AMD Ryzen 9 3900X processor with 32GB of DDR4-3600 memory, running Fedora 34 and Java 11 as packaged in Fedora. We use the flag -Dio.netty.leakDetection.level=DISABLED as otherwise 1% of buffers will be selected for leak detection instrumentation, which would introduce unfavorable noise in the results.

The benchmarks in question allocate a direct buffer 5 bytes long, write a VarInt into it, and then clear it. This is repeated for a pseudo-random sequence of 2,048 numbers in order to defeat the JVM's tendency to assume that a certain branch would always be taken.
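
In sketch form, each benchmark looks roughly like this - the class shape and the number-generation scheme here are illustrative assumptions, not the actual harness (which lives in the repository linked at the end):

import io.netty.buffer.ByteBuf;
import io.netty.buffer.Unpooled;
import java.util.Random;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
public class VarIntWriterBenchmark {
  private ByteBuf buf;
  private int[] numbers;

  @Setup
  public void setup() {
    buf = Unpooled.directBuffer(5);
    Random random = new Random(2048);
    numbers = new int[2048];
    for (int i = 0; i < numbers.length; i++) {
      // Shift by a random amount so all VarInt lengths show up,
      // keeping the branch predictor honest.
      numbers[i] = random.nextInt() >>> random.nextInt(32);
    }
  }

  @Benchmark
  public void oldVelocityVarintWrite() {
    for (int number : numbers) {
      buf.clear();
      writeVarIntOld(buf, number); // hypothetical name for the loop above
    }
  }
}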

Try 1: Richard Startin's Method

Let's try using Richard Startin's tricks. Here's the same routine, rewritten with his approach:

int continuationBytes = (31 - Integer.numberOfLeadingZeros(value)) / 7;
for (int i = 0; i < continuationBytes; ++i) {
  buf.writeByte((byte) ((value & 0x7F) | 0x80));
  value >>>= 7;
}
buf.writeByte((byte) value);

What do we get?

Benchmark                                      Mode  Cnt      Score      Error  Units
VarIntWriterBenchmark.oldVelocityVarintWrite  thrpt    5  97618.250 ± 1682.211  ops/s
VarIntWriterBenchmark.startinVarintWrite      thrpt    5  73264.856 ±  664.202  ops/s

This is actually worse than we had before! But why?

Getting to the bottom of it

Is it because of the Integer.numberOfLeadingZeros() call? The answer is probably a resounding no - after all, Integer.numberOfLeadingZeros() is an intrinsic on at least x86_64 and Arm (the architectures benchmarked here). Indeed, my Zen 2 processor can process 3 LZCNT instructions every cycle, so in theory this should deliver performance improvements.

In isolation, we can devise a small benchmark to see if this educated guess was right:

@Benchmark
public int lzcnt() {
  int sum = 0;
  for (int number : numbers) {
    sum += (31 - Integer.numberOfLeadingZeros(number)) / 7;
  }
  return sum;
}

@Benchmark
public int byHand() {
  int sum = 0;
  for (int value : numbers) {
    while (true) {
      if ((value & 0xFFFFFF80) == 0) {
        sum++;
        break;
      }

      sum++;
      value >>>= 7;
    }
  }
  return sum;
}

The results:

Benchmark                      Mode  Cnt       Score       Error  Units
VarIntWriterBenchmark.byHand  thrpt    5  323063.055 ±  1759.055  ops/s
VarIntWriterBenchmark.lzcnt   thrpt    5  570563.618 ± 17726.994  ops/s

At least in isolation, this doesn't seem to be the case. But as we'll see later, this benchmark doesn't really tell us anything meaningful. Perhaps this change makes the JIT compiler more trigger-happy in a way that tanks our performance, thanks to the more traditional loop style we now employ. I'm not sure.

Try 2: Unrolling the Loop

Okay, let's go back a step. Remember that the loop we are running could run 1 to 5 times, depending on how large the VarInt will be. We also know that the encoded length grows by one byte for every 7 bits in the number, so the cutoffs fall at 2^7, 2^14, 2^21, and 2^28. Given that, what if we simply unrolled the loop, accounting for all of the cases?

Here's the code:

if (value < 0) {
  buf.writeByte(value & 0x7F | 0x80);
  buf.writeByte((value >>>= 7) & 0x7F | 0x80);
  buf.writeByte((value >>>= 7) & 0x7F | 0x80);
  buf.writeByte((value >>>= 7) & 0x7F | 0x80);
  buf.writeByte(value >>> 7);
} else if (value < 128) {
  buf.writeByte(value);
} else if (value < 16384) {
  buf.writeByte(value & 0x7F | 0x80);
  buf.writeByte(value >>> 7);
} else if (value < 2097152) {
  buf.writeByte(value & 0x7F | 0x80);
  buf.writeByte((value >>>= 7) & 0x7F | 0x80);
  buf.writeByte(value >>> 7);
} else if (value < 268435456) {
  buf.writeByte(value & 0x7F | 0x80);
  buf.writeByte((value >>>= 7) & 0x7F | 0x80);
  buf.writeByte((value >>>= 7) & 0x7F | 0x80);
  buf.writeByte(value >>> 7);
} else {
  buf.writeByte(value & 0x7F | 0x80);
  buf.writeByte((value >>>= 7) & 0x7F | 0x80);
  buf.writeByte((value >>>= 7) & 0x7F | 0x80);
  buf.writeByte((value >>>= 7) & 0x7F | 0x80);
  buf.writeByte(value >>> 7);
}

What do we get now?

Benchmark                                      Mode  Cnt      Score      Error  Units
VarIntWriterBenchmark.oldVelocityVarintWrite  thrpt    5  95870.960 ± 4328.322  ops/s
VarIntWriterBenchmark.startinVarintWrite      thrpt    5  73097.880 ± 1116.550  ops/s
VarIntWriterBenchmark.unrolledVarintWrite     thrpt    5  95840.321 ± 2882.220  ops/s

There's no change, but perhaps we could have expected that from naively unrolling the loop and just figuring out when certain byte lengths would be written. We can, of course, do better than that.

Try 3: Smartly Unrolling the Loop

Let's take a look at that unrolled code again. It's clearly not optimal, since it repeats a lot of the same code, including multiple calls to ByteBuf#writeByte(). Can we do better than this? In fact, we can:

if ((value & 0xFFFFFF80) == 0) {
  buf.writeByte(value);
} else {
  buf.writeByte(value & 0x7F | 0x80);
  value >>>= 7;
  if ((value & 0xFFFFFF80) == 0) {
    buf.writeByte(value);
  } else {
    buf.writeByte(value & 0x7F | 0x80);
    value >>>= 7;
    if ((value & 0xFFFFFF80) == 0) {
      buf.writeByte(value);
    } else {
      buf.writeByte(value & 0x7F | 0x80);
      value >>>= 7;
      if ((value & 0xFFFFFF80) == 0) {
        buf.writeByte(value);
      } else {
        buf.writeByte(value & 0x7F | 0x80);
        value >>>= 7;
        buf.writeByte(value);
      }
    }
  }
}

This looks pretty gnarly already, but let's give it a shot:

Benchmark                                        Mode  Cnt       Score      Error  Units
VarIntWriterBenchmark.oldVelocityVarintWrite    thrpt    5   95113.915 ±  182.119  ops/s
VarIntWriterBenchmark.smartUnrolledVarintWrite  thrpt    5  124865.620 ±  550.267  ops/s
VarIntWriterBenchmark.unrolledVarintWrite       thrpt    5   93953.405 ± 1530.158  ops/s

That's very impressive - a nearly 30% performance improvement from improving the structure of the code. Can we improve this some more?

Try 4: The numberOfLeadingZeros Mystery Solved?

Let's try to use Integer.numberOfLeadingZeros() to optimize this function. Perhaps it will give us a bit more performance.

int continuationBytes = (31 - Integer.numberOfLeadingZeros(value)) / 7;
if (continuationBytes == 0) {
  buf.writeByte(value);
} else {
  buf.writeByte(value & 0x7F | 0x80);
  value >>>= 7;
  if (continuationBytes == 1) {
    buf.writeByte(value);
  } else {
    buf.writeByte(value & 0x7F | 0x80);
    value >>>= 7;
    if (continuationBytes == 2) {
      buf.writeByte(value);
    } else {
      buf.writeByte(value & 0x7F | 0x80);
      value >>>= 7;
      if (continuationBytes == 3) {
        buf.writeByte(value);
      } else {
        buf.writeByte(value & 0x7F | 0x80);
        value >>>= 7;
        buf.writeByte(value);
      }
    }
  }
}

I can't wait for all the performance improvements!

Benchmark                                        Mode  Cnt       Score      Error  Units
VarIntWriterBenchmark.smartUnrolledVarintWrite  thrpt    5  110664.096 ± 1460.675  ops/s

That's not good at all. We just regressed by more than 10%.

Try 5: Remove the Data Dependency

That was a bust. Is there something else we can do? Well, there is a data dependency: each iteration shifts the value right by 7 bits and then checks whether we need to continue writing bytes. We can extract a little more parallelism from the CPU by computing each shift and mask directly from the original value, since we already know how far each byte needs to be shifted:

if ((value & (0xFFFFFFFF << 7)) == 0) {
  buf.writeByte(value);
} else {
  buf.writeByte(value & 0x7F | 0x80);
  if ((value & (0xFFFFFFFF << 14)) == 0) {
    buf.writeByte(value >>> 7);
  } else {
    buf.writeByte((value >>> 7) & 0x7F | 0x80);
    if ((value & (0xFFFFFFFF << 21)) == 0) {
      buf.writeByte(value >>> 14);
    } else {
      buf.writeByte((value >>> 14) & 0x7F | 0x80);
      if ((value & (0xFFFFFFFF << 28)) == 0) {
        buf.writeByte(value >>> 21);
      } else {
        buf.writeByte((value >>> 21) & 0x7F | 0x80);
        buf.writeByte(value >>> 28);
      }
    }
  }
}

How does this attempt fare?

Benchmark                                                        Mode  Cnt       Score      Error  Units
VarIntWriterBenchmark.oldVelocityVarintWrite                    thrpt    5   94480.674 ± 1389.432  ops/s
VarIntWriterBenchmark.smartNoDataDependencyUnrolledVarintWrite  thrpt    5  129428.934 ± 7405.911  ops/s
VarIntWriterBenchmark.smartUnrolledVarintWrite                  thrpt    5  126697.664 ± 5033.494  ops/s

It looks like the two approaches are in fact similar in performance, with the newer version performing perhaps a tad better, but the results are too noisy to know for sure.

Try 6: Training Ourselves in the Dark Arts

Already we found a ~30% speedup from unrolling the loop carefully, but can we do better still? Yes, we can. As most performance-oriented engineers know, doing things in bulk can be a very profitable optimization. Up to this point, we've been making 1 to 5 calls to write individual bytes. It turns out that if we get a little creative, we can either write all of the bytes up front (1, 2, or 4 bytes' worth) or combine one larger write with a single-byte write (2 + 1 for 3 bytes, 4 + 1 for 5 bytes – VarLong writing is left as an exercise to the reader). Luckily, we don't need to handle the 2 + 1 case specially, as Netty ByteBuf provides a writeMedium() method. To do this, we construct the VarInt as a single integer and then write it to the buffer in one call. Let's give it a try:

if ((value & (0xFFFFFFFF << 7)) == 0) {
  buf.writeByte(value);
} else if ((value & (0xFFFFFFFF << 14)) == 0) {
  int w = (value & 0x7F | 0x80) << 8 | (value >>> 7);
  buf.writeShort(w);
} else if ((value & (0xFFFFFFFF << 21)) == 0) {
  int w = (value & 0x7F | 0x80) << 16 | ((value >>> 7) & 0x7F | 0x80) << 8 | (value >>> 14);
  buf.writeMedium(w);
} else if ((value & (0xFFFFFFFF << 28)) == 0) {
  int w = (value & 0x7F | 0x80) << 24 | ((value >>> 7) & 0x7F | 0x80) << 16
      | ((value >>> 14) & 0x7F | 0x80) << 8 | (value >>> 21);
  buf.writeInt(w);
} else {
  int w = (value & 0x7F | 0x80) << 24 | ((value >>> 7) & 0x7F | 0x80) << 16
      | ((value >>> 14) & 0x7F | 0x80) << 8 | ((value >>> 21) & 0x7F | 0x80);
  buf.writeInt(w);
  buf.writeByte(value >>> 28);
}
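
As a sanity check, consider 300 again: it takes the writeShort() path, where w = (0xAC << 8) | 0x02 = 0xAC02, and writeShort() emits the big-endian bytes 0xAC 0x02 - exactly what the byte-by-byte loop produces.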

The end result is this:

Benchmark                                        Mode  Cnt       Score      Error  Units
VarIntWriterBenchmark.blendedVarintWrite        thrpt    5  175968.604 ± 5195.657  ops/s
VarIntWriterBenchmark.oldVelocityVarintWrite    thrpt    5   94270.376 ± 2796.116  ops/s
VarIntWriterBenchmark.smartUnrolledVarintWrite  thrpt    5  124524.690 ± 2929.593  ops/s

This is an enormous improvement, to say the least. From our baseline we got a nearly 1.87x speedup! But it comes at a cost: the above code is difficult to read and understand, and it is extremely fragile – extending it to 10-byte VarLongs would be a nightmare!

Bonus: Exact byte length encoding

Quoting Startin:

When I went back and read it again I was surprised to find that embedded messages are length-prefixed, but the length prefix is varint encoded, which means you don’t know how many bytes you need for the length until you’ve done the serialisation, and it’s recursive.

Startin then argues that this was one of the reasons he considers Protocol Buffers a poor fit for reporting telemetry from the edge. But for those who are stuck with VarInts like I am, you can in fact write VarInts in a sequential fashion, and even with performance slightly above the traditional approach to writing VarInts. Specifically, it is possible to encode any number that fits in a VarInt using exactly 5 bytes:

int w = (value & 0x7F | 0x80) << 24 | ((value >>> 7) & 0x7F | 0x80) << 16
    | ((value >>> 14) & 0x7F | 0x80) << 8 | ((value >>> 21) & 0x7F | 0x80);
buf.writeInt(w);
buf.writeByte(value >>> 28);

Therefore, if you are willing to inflate the size of your messages, you can write the message out sequentially. Better yet, you can improve upon this result by encoding only the maximum length you actually need. In the Minecraft protocol, compressed packet lengths always fit in 21 bits, so we can encode the length of a compressed packet in exactly three VarInt bytes and save ourselves a memory copy. Very neat!
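
As a sketch, a hypothetical helper for this case might look like the following - the method name is mine, and it assumes the value is already known to fit in 21 bits. Because the continuation bits are forced on the first two bytes, any standard VarInt reader decodes it correctly:

// Hypothetical: always emit exactly three VarInt bytes for a value
// known to be less than 2^21.
static void writeVarInt21In3Bytes(ByteBuf buf, int value) {
  int w = (value & 0x7F | 0x80) << 16
      | ((value >>> 7) & 0x7F | 0x80) << 8
      | (value >>> 14);
  buf.writeMedium(w);
}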

But performance-wise, is it worth doing that?

Benchmark                                                        Mode  Cnt       Score      Error  Units
VarIntWriterBenchmark.blendedVarintWrite                        thrpt    5  179321.732 ± 7610.742  ops/s
VarIntWriterBenchmark.lucky5VarintWrite                         thrpt    5  113540.019 ± 6062.895  ops/s
VarIntWriterBenchmark.oldVelocityVarintWrite                    thrpt    5   95748.224 ± 2531.663  ops/s

If you have a need for this, then yes, it even beats the traditional loop. But despite appearances, this is not the fastest way to write a VarInt, and it completely defeats the point of using a VarInt in the first place. Still, it's a neat trick that could come in handy under certain circumstances.

But this is just one processor - I need more than that!

I ran the tests on a DigitalOcean CPU-optimized droplet with 4GB of memory and two dedicated CPU threads of an Intel Xeon E5-2650v4. The OS is Ubuntu 20.04, and we keep using Java 11.

Here are the results:

Benchmark                                                        Mode  Cnt       Score      Error  Units
VarIntWriterBenchmark.blendedVarintWrite                        thrpt    5  103837.167 ±  629.934  ops/s
VarIntWriterBenchmark.oldVelocityVarintWrite                    thrpt    5   54538.692 ± 1312.263  ops/s
VarIntWriterBenchmark.smartNoDataDependencyUnrolledVarintWrite  thrpt    5   76576.717 ± 1051.005  ops/s
VarIntWriterBenchmark.smartUnrolledVarintWrite                  thrpt    5   76325.907 ±  855.393  ops/s
VarIntWriterBenchmark.startinVarintWrite                        thrpt    5   46513.811 ±  706.059  ops/s
VarIntWriterBenchmark.unrolledVarintWrite                       thrpt    5   56751.155 ±  803.914  ops/s

These results comport with the rest of the article. As for Arm, here are the results from my M1 MacBook Pro:

Benchmark                                                        Mode  Cnt       Score      Error  Units
VarIntWriterBenchmark.blendedVarintWrite                        thrpt    5  237790.486 ±  571.134  ops/s
VarIntWriterBenchmark.oldVelocityVarintWrite                    thrpt    5  165370.092 ± 1975.949  ops/s
VarIntWriterBenchmark.smartNoDataDependencyUnrolledVarintWrite  thrpt    5  188349.773 ± 2062.149  ops/s
VarIntWriterBenchmark.smartUnrolledVarintWrite                  thrpt    5  187717.560 ±  755.758  ops/s
VarIntWriterBenchmark.startinVarintWrite                        thrpt    5  115166.912 ±  611.680  ops/s
VarIntWriterBenchmark.unrolledVarintWrite                       thrpt    5  183905.054 ±  766.400  ops/s

Okay, so the Intel processor is behind, probably best explained by the fact that it's clocked lower and is an older chip from the Broadwell era. But the M1 mops the floor with the competition in this particular benchmark, beating out both the Ryzen 9 3900X and the Intel Xeon E5-2650v4. Why?

Atomic Power

When we call a write method on a Netty ByteBuf, several checks have to be performed, typically in AbstractByteBuf#ensureWritable0() (a simplified sketch follows the list). Specifically:

  1. It checks that writing minWritableBytes more bytes would not exceed the buffer's current capacity.
  2. It ensures that the buffer's reference count is not zero.
  3. If the check in (1) passed, we are good to go (this is considered the "fast path").
  4. Otherwise, if the new size of the buffer would exceed the buffer's current capacity (but not its maximum capacity), it will expand the buffer.
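
In rough form, the logic looks like this (paraphrased and simplified from Netty's AbstractByteBuf, not the verbatim source):

final void ensureWritable0(int minWritableBytes) {
  final int targetCapacity = writerIndex() + minWritableBytes;
  if (targetCapacity >= 0 && targetCapacity <= capacity()) {
    ensureAccessible(); // fails if the (atomic) reference count is zero
    return;             // the "fast path"
  }
  // Otherwise, expand the buffer, throwing if targetCapacity would
  // exceed maxCapacity().
}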

Since our buffer size is always 5, the first check will always succeed, and since the JIT knows that the branch is always taken, we're left with the reference count check as the main remaining source of slowdown. In Netty, reference counts are always accessed and modified atomically, and we know that uncontended atomic access is fast on the M1. Does this explain the performance difference? Let's run the tests on our Ryzen processor again, but with the flag -Dio.netty.buffer.checkAccessible=false:

Benchmark                                                        Mode  Cnt       Score      Error  Units
VarIntWriterBenchmark.blendedVarintWrite                        thrpt    5  158655.053 ± 3702.731  ops/s
VarIntWriterBenchmark.oldVelocityVarintWrite                    thrpt    5  101722.002 ±  331.125  ops/s
VarIntWriterBenchmark.smartNoDataDependencyUnrolledVarintWrite  thrpt    5  131313.391 ±  392.404  ops/s
VarIntWriterBenchmark.smartUnrolledVarintWrite                  thrpt    5  146280.007 ± 1271.907  ops/s
VarIntWriterBenchmark.startinVarintWrite                        thrpt    5   80280.574 ±  636.335  ops/s
VarIntWriterBenchmark.unrolledVarintWrite                       thrpt    5  101123.052 ±  915.786  ops/s

That's certainly very different. blendedVarintWrite actually lost a bit over 10% of its throughput, but the byte-by-byte approaches all gained at least 8%. What happens if we run it on the M1?

Benchmark                                                        Mode  Cnt       Score      Error  Units
VarIntWriterBenchmark.blendedVarintWrite                        thrpt    5  237697.049 ±  304.124  ops/s
VarIntWriterBenchmark.oldVelocityVarintWrite                    thrpt    5  166456.475 ±  384.274  ops/s
VarIntWriterBenchmark.smartNoDataDependencyUnrolledVarintWrite  thrpt    5  190258.830 ± 1163.047  ops/s
VarIntWriterBenchmark.smartUnrolledVarintWrite                  thrpt    5  189399.626 ±  186.078  ops/s
VarIntWriterBenchmark.startinVarintWrite                        thrpt    5  116002.657 ±  364.835  ops/s
VarIntWriterBenchmark.unrolledVarintWrite                       thrpt    5  184978.894 ± 3182.543  ops/s

This essentially didn't change. So we can conclude that blendedVarintWrite gets most of its performance by amortizing the cost of Netty API calls.

Conclusions

It is entirely possible to write VarInts very quickly: close to 500 million per second on commodity consumer hardware, using only a single thread.

Here are all the methods compared by time per VarInt written:

Method           Intel Broadwell  AMD Zen 2   Apple M1
Baseline         8.95 ns/op       5.1 ns/op   2.93 ns/op
Startin          10.5 ns/op       6.68 ns/op  4.24 ns/op
Smart Unrolling  6.4 ns/op        3.92 ns/op  2.6 ns/op
Dark Arts (#6)   4.7 ns/op        2.78 ns/op  2.05 ns/op

We have sped up VarInt writing on the x86 platforms by more than 80%, whereas the M1 received a more modest but still substantial 43% performance boost.

I invite critiques on the GitHub repository where I have published the JMH tests.

Licensing

You are free to use any of the code snippets under the terms of the MIT license.