Sunday, December 22, 2013

kryo vs smile vs json part 1: a misguided shootout

this may be my most frustrating post so far

First, a little background.

At some point, even when you can scale horizontally, you start examining aspects of your application that are easy to take for granted, looking for performance gains. For web services, one of those aspects is serialization. It's common knowledge that Java serialization is slow, and that XML is bloated compared to JSON. JSON is a pretty safe pick in general: it's readable, lightweight, and fast. That said, what happens when you want to do better than JSON in your RESTful web service?

A colleague and I came to this point recently, where the majority of his transaction overhead was spent unmarshalling requests and marshalling responses. This application comes under very high load, so the obvious conclusion was "well, there's a clear place to start to improve things." From there, we started looking at Apache Thrift, Google ProtoBuf (or Protocol Buffers), Kryo, Jackson Smile and, of course as a control, JSON. Naturally, we wanted to invest some time comparing these to each other.

I looked around online at a lot of performance benchmarks and found some data covering Kryo, ProtoBuf, and others at https://github.com/eishay/jvm-serializers/wiki. The data presented there is very low level, and my goal was quite literally to produce the least sophisticated comparison of these frameworks possible, ideally using the 4-6 line samples on their respective wikis. My reasoning was that there's likely a common case of people not investing a huge amount of time optimizing their serialization stack, but rather seeking a drop-in boost in the form of a library.
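To make "drop-in" concrete: with Jackson, switching from JSON to Smile is essentially just swapping the factory behind the ObjectMapper. Here's a sketch of that idea (the helper class and method names are mine, not from the benchmark code):

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.dataformat.smile.SmileFactory;

public class SmileSample {
    // Same Jackson ObjectMapper API, just a different factory behind it --
    // this is what makes Smile attractive as a drop-in replacement for JSON.
    private static final ObjectMapper JSON = new ObjectMapper();
    private static final ObjectMapper SMILE = new ObjectMapper(new SmileFactory());

    public static byte[] toSmile(Object value) throws Exception {
        return SMILE.writeValueAsBytes(value);
    }

    public static <T> T fromSmile(byte[] data, Class<T> type) throws Exception {
        return SMILE.readValue(data, type);
    }

    public static byte[] toJson(Object value) throws Exception {
        return JSON.writeValueAsBytes(value);
    }
}
```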

This is where the frustration comes into play. My results don't quite match what I've seen elsewhere, which caused me to question them several times and revisit the benchmarks I was performing. They still don't quite match, and to be honest I'm questioning the benchmark code I linked to after discovering calls to System.gc() all over the place, but I feel like I have enough data that it's worth posting something up here.

the experiment: use cases, setup, metrics, and the contenders

Let's talk about the use cases I was trying to cover first:

  • Don't go over the network. Do everything in memory to avoid external performance influences in the benchmark.
  • Serialize an object that is reasonably complex and representative of something a web service may use.
  • Serialize objects that have both small and large data footprints.
  • Use the most basic setup possible to perform the serialization and deserialization.

The setup was:

  • Run a "warm up" pass before gathering metrics to remove initial load factors on JVM startup that won't be a constant issue, and to fragment the heap slightly to both simulate real-world conditions and not give a potential advantage to a single framework.
  • Run a series of batches of entities to gather enough data to arrive at a reasonable conclusion of performance.
  • Randomize the data a bit to try and keep things in line with real-world conditions. The data is randomized from a small data set, with the assumption being that the differences in size are small enough and the batches are large enough to get a reasonably even distribution, meaning the metrics will converge on a figure that is a reasonable measurement of performance.
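The setup above boils down to a warm-up pass followed by timed batches. A sketch of that driver loop (the Harness interface and numbers here are my own illustration, not the repository code):

```java
public class BenchmarkDriver {
    /** A framework under test: must round-trip one entity. Illustrative interface. */
    public interface Harness<T> {
        byte[] serialize(T entity);
        T deserialize(byte[] data);
    }

    /** Returns the average nanoseconds spent per batch of batchSize round-trips. */
    public static <T> double run(Harness<T> harness, T entity, int batches, int batchSize) {
        // Warm-up pass: JIT compilation, class loading, and some heap churn
        // happen here rather than polluting the measured batches.
        runBatch(harness, entity, batchSize);
        long total = 0;
        for (int i = 0; i < batches; i++) {
            long start = System.nanoTime();
            runBatch(harness, entity, batchSize);
            total += System.nanoTime() - start;
        }
        return (double) total / batches;
    }

    private static <T> void runBatch(Harness<T> harness, T entity, int batchSize) {
        for (int i = 0; i < batchSize; i++) {
            harness.deserialize(harness.serialize(entity));
        }
    }
}
```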

The following metrics were recorded:

  • Measure the average time to serialize and deserialize a batch of 100,000 entities.
  • Measure the average size of a response.
  • Measure the average time of an individual serialization/deserialization.

Lastly, the contenders:

The use of the Jackson Smile JAXRS provider may seem odd, but I have a good reason. The basic Smile example is only a few lines, while the Smile JAXRS provider class is almost 1000 (!!!) lines. There's a lot of extra work going on in that class, and I felt it was worth comparing because 1) many people could end up using this adapter in the wild and 2) perhaps there are some optimizations in it that should be benchmarked.

code

All of the code used in this post can be found at https://github.com/theotherian/serialization-shootout/tree/master/serialization-shootout

Here's a tree representation of what the entity being serialized/deserialized, Car, looks like:
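In case the tree doesn't render, here's a purely hypothetical stand-in for an entity of that shape (the field names are mine; the actual Car class is in the repository linked above):

```java
import java.util.List;

// Hypothetical stand-in for the benchmarked entity -- the real Car
// lives in the repository linked above. The point is a mix of small
// fields, a nested object, and variable-length data.
public class Car {
    public String make;
    public String model;
    public int year;
    public Engine engine;          // nested object
    public List<String> features;  // variable-length data

    public static class Engine {
        public String name;
        public int horsepower;
    }
}
```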

Here are the harnesses being used:
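If the embedded gists don't render for you, a Kryo harness looks roughly like this (a sketch with illustrative names, not the repository code):

```java
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;

import java.io.ByteArrayOutputStream;

public class KryoHarness {
    private final Kryo kryo = new Kryo();

    public KryoHarness() {
        // Depending on your Kryo version, registration may be required by
        // default; disabling it keeps this the "least sophisticated" setup.
        kryo.setRegistrationRequired(false);
    }

    public byte[] serialize(Object entity) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        Output output = new Output(bytes);
        kryo.writeObject(output, entity);
        output.close(); // flushes Kryo's buffer into the stream
        return bytes.toByteArray();
    }

    public <T> T deserialize(byte[] data, Class<T> type) {
        return kryo.readObject(new Input(data), type);
    }
}
```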

the results: normal size objects

By normal, I mean on the smaller side; most fields are on the order of tens of bytes:

Key data points:

  • Kryo and Smile are clearly more performant than JSON in terms of time spent and size of payload.
  • Kryo and Smile are close: Kryo performs better but Smile is slightly smaller.
  • Kryo has the fastest raw serialization/deserialization performance by a significant amount over both Smile and JSON.
  • The Smile JAXRS provider is significantly slower than its raw counterpart.

the results: large size objects

For this comparison, I added portions of Wikipedia articles as part of the object, all equal in length:

Key data points:

  • Kryo is best in breed by a wide margin here, handling batches in 1.2s vs 1.9s for both Smile and JSON. Serialization and deserialization are both significantly faster.
  • Variance in size is practically nonexistent between all the frameworks.
  • Smile JAXRS really looks like a dog here, taking 2.6s to handle a batch and showing surprisingly poor deserialization performance.

the winner: kryo (with HUGE MASSIVE caveats)

Kryo clearly has some advantages here, but it also has one major disadvantage: Kryo instances are not thread safe. Did you hear that?

KRYO INSTANCES ARE NOT THREAD SAFE!

This caused me to show the same amount of rage DateFormat did years ago. BFD, you may say, thinking "Just create a Kryo instance each time!" Well, what if I told you that each batch of the normal size objects took a whopping NINE SECONDS once I moved the creation of the Kryo object inside the harness's method?

No sir; if you're going to use Kryo, you need thread-local storage for your Kryo instances, or you are going to be in for some serious pain. Depending on the load of your application, you may want to pre-create them as a pool within a servlet initializer, scaled to the number of threads in your container.
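The thread-local approach looks something like this (a sketch; the holder class is mine, and any per-instance Kryo configuration would go in initialValue):

```java
import com.esotericsoftware.kryo.Kryo;

public class KryoHolder {
    // One Kryo instance per thread: Kryo instances are not thread safe,
    // but constructing one per call is far too expensive.
    private static final ThreadLocal<Kryo> KRYO = new ThreadLocal<Kryo>() {
        @Override
        protected Kryo initialValue() {
            Kryo kryo = new Kryo();
            // configure / register classes here, once per thread
            return kryo;
        }
    };

    public static Kryo get() {
        return KRYO.get();
    }
}
```

Each thread pays the construction cost once, and repeated calls on the same thread hand back the same instance.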

Quite frankly, I'm astonished that there's so much construction overhead for an instance that isn't thread safe, but I also haven't delved into the API enough to know the reasons behind this. Still, it creates some very annoying design implications that you'll need to make sure are accounted for correctly in your application.

Part of me would sooner call Smile the winner since it doesn't have this particular issue, but after looking at the JAXRS provider for it I'm left scratching my head.

However, when it comes to larger entities, Smile offered marginal improvement over JSON, whereas Kryo clearly won that round.

Based on the results in the first pass, I think Kryo showed the most improvement, but also a fair number of warts.

next steps

I'm far from finished here, but felt compelled to get something published. I plan on doing the following things next:

  • Getting feedback from others about my approach and the data to see if I'm way off the mark.
  • Potentially benchmarking ProtoBuf here too. It's more painful to set up, but worth experimenting with to get more data.
  • Figuring out why Smile JAXRS is so miserably slow.
  • Messing around with Kryo's optimization (an example of this is here).
  • Looking at other binary JSON libraries.

I do genuinely feel like I'm missing some critical piece of data or type of test here, so if you see anything that could stand to be addressed, please let me know in the comments!

Monday, December 2, 2013

making guava cache better with jmx

caching with jmx is just so much better

If you've never used this before, you're missing out. Being able to remotely check statistics on your cache to measure its effectiveness, as well as being able to purge it at runtime is invaluable. Sadly Guava doesn't have this baked in the way ehcache does, but it's relatively easy to add.

Most of my work is a slightly different take on some work a fellow Github user named kofemann produced (located here) which contains the JMX beans and bean registration logic. I made a few alterations to the code, pulling the registration out into a separate class (I really didn't like the bean doing all that work in the constructor) and adding a refreshAll method.

taking advantage of refresh after write functionality

If you've read my previous blog post about the awesomeness that is Guava's refresh after write functionality, then you'll see how it can be advantageous when it comes to JMX management. If you didn't read my post (shame on you), then it's worth calling out that refresh after write allows for asynchronous loading of cache values, meaning you never block barring the initial load of the cache.
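A cache built that way looks roughly like this (a sketch — the key/value types, pool size, and expensiveLookup are all stand-ins):

```java
import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import com.google.common.util.concurrent.ListenableFuture;
import com.google.common.util.concurrent.ListenableFutureTask;

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.TimeUnit;

public class RefreshingCache {
    // A small, fixed pool doubles as a throttle on how hard refreshes
    // can hit the underlying data store.
    private static final ExecutorService REFRESH_POOL =
        Executors.newFixedThreadPool(4, new ThreadFactory() {
            public Thread newThread(Runnable r) {
                Thread t = new Thread(r, "cache-refresh");
                t.setDaemon(true); // don't keep the JVM alive just for refreshes
                return t;
            }
        });

    public static LoadingCache<String, String> build() {
        return CacheBuilder.newBuilder()
            .refreshAfterWrite(5, TimeUnit.MINUTES)
            .recordStats() // required for meaningful statistics over JMX
            .build(new CacheLoader<String, String>() {
                @Override
                public String load(String key) throws Exception {
                    return expensiveLookup(key); // blocks only on the initial load
                }

                @Override
                public ListenableFuture<String> reload(final String key, String oldValue) {
                    // Asynchronous refresh: readers keep getting oldValue
                    // until the new value has loaded in the background.
                    ListenableFutureTask<String> task =
                        ListenableFutureTask.create(new Callable<String>() {
                            public String call() throws Exception {
                                return expensiveLookup(key);
                            }
                        });
                    REFRESH_POOL.execute(task);
                    return task;
                }
            });
    }

    private static String expensiveLookup(String key) {
        return key.toUpperCase(); // stand-in for a real data-store call
    }
}
```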

This can be used via JMX management as well by iterating through the keys of the cache and calling refresh for each one, which loads new values without causing clients of the cache to block (as opposed to purging the cache). Purging a cache is a dangerous thing to do under certain circumstances, since missing values will trigger loading events that block clients at runtime and can potentially overwhelm either your application server or your underlying data storage.

I would argue that ehcache is particularly bad here because of potential read contention caused by write blocking. To clarify: several threads in your application can block waiting for cache values to be reloaded, and all of those blocking threads will then compete over a limited number of read locks after the write lock has been released, potentially causing a CPU spike and considerable latency in your application under the worst conditions. When I say worst conditions, I'm speaking from very recent and harrowing experience, so I have the lumps to say with the utmost certainty this can happen. :)

the implementation

For JMX you need an interface and an implementation. The interface can be found on my Gist and doesn't really need to be shown in the post. The implementation is below; it's really a wrapper around Guava's CacheStats object and the cleanup/invalidateAll methods, as well as my refreshAll method:
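If the embedded gist doesn't render for you, the wrapper has roughly this shape (a sketch with illustrative names — the real interface and implementation live in the Gist):

```java
import com.google.common.cache.LoadingCache;

// Illustrative MBean interface; the real one is on the Gist linked above.
interface CacheJmx {
    long getHitCount();
    long getMissCount();
    double getHitRate();
    long getSize();
    void cleanUp();
    void invalidateAll();
    void refreshAll();
}

public class GuavaCacheMBean<K, V> implements CacheJmx {
    private final LoadingCache<K, V> cache;

    public GuavaCacheMBean(LoadingCache<K, V> cache) {
        this.cache = cache;
    }

    // Statistics delegate straight to Guava's CacheStats snapshot.
    @Override public long getHitCount()  { return cache.stats().hitCount(); }
    @Override public long getMissCount() { return cache.stats().missCount(); }
    @Override public double getHitRate() { return cache.stats().hitRate(); }
    @Override public long getSize()      { return cache.size(); }

    @Override public void cleanUp()       { cache.cleanUp(); }
    @Override public void invalidateAll() { cache.invalidateAll(); }

    // The addition over the original: re-load every key in place. With an
    // asynchronous reload(), readers keep getting the old value and never block.
    @Override
    public void refreshAll() {
        for (K key : cache.asMap().keySet()) {
            cache.refresh(key);
        }
    }
}
```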

As I said before, refreshAll has the advantage of not causing your application to potentially lock up due to cache contention; everything will load up in the background. Depending on how you have your thread pool set up for performing refreshes, you can also throttle how hard you're hitting your data store by restricting the number of concurrent fetches of data by limiting the threads available.

registering your cache in jmx

This is pretty straightforward: just pass your cache (in this case a LoadingCache because of refreshAll) to the method shown below and you'll expose it via JMX for statistics and management:
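In case the embedded method doesn't render, the registration side is roughly this shape, assuming you've already wrapped the cache in the JMX bean described above (the com.example domain is illustrative):

```java
import javax.management.MBeanServer;
import javax.management.ObjectName;
import java.lang.management.ManagementFactory;

public class JmxRegistration {
    // `mbean` is the JMX wrapper around your LoadingCache. Registers it under
    // a name like com.example:type=Cache,name=<cacheName>.
    public static void register(Object mbean, String cacheName) {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName("com.example:type=Cache,name=" + cacheName);
            server.registerMBean(mbean, name);
        } catch (Exception e) {
            throw new RuntimeException("Failed to register cache " + cacheName + " in JMX", e);
        }
    }
}
```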

feedback

Let me know if this works for you; I plan on using this soon in a high load environment, so I'll follow up with any results I find to help out my readers. I feel kind of bad bagging on ehcache so much recently, but it's caused me enough gray hair over the last month that I plan on focusing several blog posts around caching.