Welcome to the world of microbenchmarks

Feb 4, 2017

Microbenchmarks have been popular for as long as I can remember. Recently, some extra ridiculous HTTP request handling ones have been making the rounds:
A million requests per second with Python
(More than) one million requests per second in Node.js
Over a million HTTP requests/second with Node.js & Python. Only 70k with nginx, and an embarrassing 55k with Go. What is going on here? Is Google asleep at the wheel?

On the internet nobody knows you're a dog. Similarly nobody knows that your microbenchmark results are misleading at best.

First off, these specific tests use HTTP/1.1 pipelining. There's not a single actively developed browser in existence that supports HTTP/1.1 pipelining out of the box, so you certainly can't use this approach for web development. Secondly, as is the norm for microbenchmarks like this, all those millions of requests come from a single client located on the same machine. How many real-world use cases are there where a single localhost client does a million requests per second and also supports HTTP/1.1 pipelining?
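For context, here is a minimal sketch of what HTTP/1.1 pipelining looks like on the wire. It assumes some HTTP server is listening on localhost:8080 and answers plain GET requests; the pipeline depth and address are illustrative, not taken from the benchmarks above.

```python
# Minimal sketch of an HTTP/1.1 pipelining client (illustrative only).
# Assumes an HTTP server listening on localhost:8080 that answers GET /.
import socket

PIPELINE_DEPTH = 16  # hypothetical batch size, similar in spirit to benchmark clients

request = b"GET / HTTP/1.1\r\nHost: localhost\r\n\r\n"

with socket.create_connection(("localhost", 8080)) as sock:
    # The pipelining trick: write many requests back-to-back without waiting
    # for any response in between. Browsers don't do this out of the box.
    sock.sendall(request * PIPELINE_DEPTH)

    # Drain whatever comes back until the server goes quiet. A real client
    # would parse each response; here we only count raw bytes.
    sock.settimeout(1.0)
    received = b""
    try:
        while True:
            chunk = sock.recv(65536)
            if not chunk:
                break
            received += chunk
    except socket.timeout:
        pass

print(f"{len(received)} bytes back for {PIPELINE_DEPTH} pipelined requests")
```

A normal client instead sends one request, waits for the response, and only then sends the next, which is why pipelined numbers don't translate to browser traffic.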

I've seen people claim that these benchmarks are "Not totally useless. This shows the performance and overhead of the library/framework, not the task. [...] I usually take it to mean divide stated performance by 10 immediately if a simple DB query is involved, etc."

Microbenchmark results don't scale linearly to everything else. Just because language X can print "Hello, World!" 2x faster than language Y doesn't mean that every other operation is also 2x faster. For example, a huge factor is algorithm quality. Language X may be fast at "Hello, World!", but then proceed to have QuickSort as its standard sorting function, while language Y has Timsort. [1] Language X may have a nicely optimized C library for hashing, while language Y has AVX2-optimized ASM. Some languages don't even have a wide & well-optimized standard library. Thus you can only really tell how good a language/library is for your use case if you test with an actual real-world scenario.
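As a toy illustration of why one measurement doesn't predict another, the sketch below times two unrelated operations in the same language; the sizes and repetition counts are arbitrary assumptions, not numbers from any of the posts above.

```python
# Toy illustration: timing one tiny operation tells you nothing about another.
# Sizes and repetition counts are arbitrary, chosen only for illustration.
import random
import timeit

data = [random.random() for _ in range(100_000)]

hello_time = timeit.timeit('s = "Hello, World!"', number=1_000_000)
sort_time = timeit.timeit("sorted(data)", globals={"data": data}, number=100)

print(f"1M 'Hello, World!' assignments: {hello_time:.3f}s")
print(f"100 sorts of 100k floats:       {sort_time:.3f}s")
# A language that wins the first measurement can still lose the second if,
# say, its standard sort is a plain QuickSort instead of Timsort.
```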

Additionally, the ultimate microbenchmark-winning code is code that does every trick in the book while not caring about anything else. This means hooking the kernel, unloading every kernel module / driver that isn't necessary for the microbenchmark, and doing the microbenchmark work at ring 0 with absolute minimum overhead. Written in ASM, embedded in C code, which is launched by Node.js. Then, if there's any data-dependent processing in the microbenchmark, the winning code will precompute everything and load the full 2 TB of precomputed data into RAM. The playing field is even: the JVM & Apache, or whatever else the competition is, will of course also be run on this 2 TB RAM machine. They just won't use it, because they aren't designed to deliver the best results in this single microbenchmark. The point is that not only do microbenchmark results not scale linearly to other work, but the techniques used to achieve them may even be detrimental to everything else!

These not-real-world microbenchmarks are definitely useless from an engineering perspective. However, they aren't completely useless: their use is marketing. They get the word out to developers that product X is really good! So what if it doesn't translate to real-world scenarios, or even if the numbers are completely fabricated and can't be reproduced under lab conditions? Very few people care enough to look at things that closely. Just seeing a bunch of posts claiming product X is really good is enough to leave a strong impression that product X really is that great. Perception is reality, and perception is usually better influenced by massive claims (even if untrue) than by realistic iterative progress. "Our product is 5% faster than the state of the art!" just doesn't have the viral headline nature that you need to win over the hearts of the masses.

The RethinkDB postmortem had a great paragraph about these microbenchmarks: "People wanted RethinkDB to be fast on workloads they actually tried, rather than 'real world' workloads we suggested. For example, they'd write quick scripts to measure how long it takes to insert ten thousand documents without ever reading them back. MongoDB mastered these workloads brilliantly, while we fought the losing battle of educating the market."

As for why someone would perform & propagate these microbenchmarks: maybe they don't know better, or maybe they've decided to invest in some technology tribe and thus profit from that tribe surviving, and even more from it growing. This is pretty automatic behavior for humans. Take any tribal war, e.g. Xbox One vs PS4. Those who happen to own an Xbox One (perhaps as a gift) can be seen at various places passionately arguing that the Xbox One is better than the PS4, even if objectively it has slower hardware and fewer highly acclaimed exclusive games. The person is in the Xbox One tribe, and working towards getting more users to own an Xbox One means that more developer investment is also made towards the Xbox One thanks to the bigger user base. Thus even if the original claims that got users into the tribe were false, if the growth is big enough it may work out well enough in the end.

To conclude on microbenchmarks: assign almost zero value to their results. The only exception should be when you already have a working application, and you have profiled it and identified the hotspots. Perhaps more than 50% of your request-handling CPU time is spent on computing some hash. Now you know that finding the fastest hash implementation will actually have a meaningful impact on your system.
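Here is a minimal sketch of that "profile first" step, assuming a hypothetical handle_request() function that happens to hash its payload; the names and workload are illustrative, not from any real codebase.

```python
# Sketch of "profile before optimizing". handle_request() and the workload
# are hypothetical stand-ins; the profiling calls are the point here.
import cProfile
import hashlib
import pstats


def handle_request(payload: bytes) -> str:
    # Stand-in for real request handling that happens to hash its payload.
    return hashlib.sha256(payload).hexdigest()


def workload() -> None:
    payload = b"x" * 4096
    for _ in range(50_000):
        handle_request(payload)


profiler = cProfile.Profile()
profiler.runcall(workload)

# Only if hashing actually dominates cumulative time is hunting for the
# fastest hash implementation worth anyone's while.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)
```

If the profile shows the hash nowhere near the top, the microbenchmark-driven hunt for a faster one is moot.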

However, even then it may not be worth taking the time to deal with it. Talking to users, finding out what problems they have, and solving those problems will lead to more success every single time. Unless your users are complaining that performance sucks, odds are you don't need to spend your resources on improving performance and instead need to spend them on the actual problems your users are having.
