Sunday, December 22, 2013

kryo vs smile vs json part 1: a misguided shootout

this may be my most frustrating post so far

First, a little background.

At some point, even when you can scale horizontally, you start to examine aspects of your application that you can easily take for granted in the grand scheme of things for performance gains. One of those points when dealing with web services is serialization. There's general knowledge that Java serialization is slow, and XML is bloated compared to JSON. JSON is a pretty safe pick in general: it's readable, lightweight, and fast. That said, what happens when you want to do better than JSON in your RESTful web service?

A colleague and I came to this point recently, where the majority of his transaction overhead was spent unmarshalling requests and marshalling responses. This application comes under very high load, so the obvious conclusion was "well, there's a clear place to start to improve things." From there, we started looking at Apache Thrift, Google ProtoBuf (or Protocol Buffers), Kryo, Jackson Smile and, of course as a control, JSON. Naturally, we wanted to invest some time comparing these to each other.

I looked around online a lot at performance benchmarks and found some data dealing with Kryo, ProtoBuf and others located at https://github.com/eishay/jvm-serializers/wiki. The data presented there was very low level, and my goal was quite literally to produce the least sophisticated comparison of these frameworks possible, ideally using the 4-6 line samples on their respective wikis. My reasoning for this was that there is likely a common case of people not investing a huge amount of time trying to optimize their serialization stack, but rather trying to seek out a drop-in boost in the form of a library.

This is where the frustration comes into play. My results don't quite match what I've seen elsewhere, which caused me to question them several times and revisit the benchmarks I was performing. They still don't quite match, and to be honest I'm questioning the benchmark code I linked to after discovering calls to System.gc() all over the place, but I feel like I have enough data that it's worth posting something up here.

the experiment: use cases, setup, metrics, and the contenders

Let's talk about the use cases I was trying to cover first:

  • Don't go over the network. Do everything in memory to avoid external performance influences in the benchmark.
  • Serialize an object that is reasonably complex and representative of something a web service may use.
  • Serialize objects that have both small and large data footprints.
  • Use the most basic setup possible to perform the serialization and deserialization.

The setup was:

  • Run a "warm up" pass before gathering metrics to remove initial load factors on JVM startup that won't be a constant issue, and to fragment the heap slightly to both simulate real-world conditions and not give a potential advantage to a single framework.
  • Run a series of batches of entities to gather enough data to arrive at a reasonable conclusion of performance.
  • Randomize the data a bit to try and keep things in line with real-world conditions. The data is randomized from a small data set, with the assumption being that the differences in size are small enough and the batches are large enough to get a reasonably even distribution, meaning the metrics will converge on a figure that is a reasonable measurement of performance.

The following metrics were recorded:

  • Measure the average time to serialize and deserialize a batch of 100,000 entities.
  • Measure the average size of a response.
  • Measure the average time of an individual serialization/deserialization

Lastly, the contenders:

The use of the Jackson Smile JAXRS provider may seem odd, but I have a good reason. The basic Smile example is only a few lines, while the Smile JAXRS provider class is almost 1000 (!!!) lines. There's a lot of extra work going on in that class, and felt it was worth comparing because 1) many people could end up using this adapter in the wild and 2) perhaps there are some optimizations that should be benchmarked.

code

All of the code used in this can be found at https://github.com/theotherian/serialization-shootout/tree/master/serialization-shootout

Here's a tree representation of what the entity being serialized/deserialized, Car, looks like:

Here are the harnesses being used:

the results: normal size objects

By normal, I mean on the smaller size; most data is in order of 10's of bytes:

Key data points:

  • Kryo and Smile are clearly more performant than JSON in terms of time spent and size of payload.
  • Kryo and Smile are close: Kryo performs better but Smile is slightly smaller.
  • Kryo has the fastest raw serialization/deserialization performance by a significant amount over both Smile and JSON.
  • The Smile JAXRS provider is significantly slower than its raw counterpart.

the results: large size objects

For this comparison, I added portions of Wikipedia articles as part of the object, all equal in length:

Key data points:

  • Kryo is best in breed by a wide margin here, handling batches in 1.2s vs 1.9s for both Smile and JSON. Serialization and deserialization are both significantly faster.
  • Variance in size is practically nonexistent between all the frameworks.
  • Smile JAXRS really looks like a dog here, taking 2.6s to handle a batch and showing surprisingly poor deserialization performance.

the winner: kryo (with HUGE MASSIVE caveats)

Kryo clearly has some advantages here, but it also has one major disadvantage: Kryo instances are not thread safe. Did you hear that?

KRYO INSTANCES ARE NOT THREAD SAFE!

This caused me to show the same amount of rage DateFormat did years ago. BFD you may say, thinking "Just create a Kryo instance each time!" Well, what if I told you that each batch of the normal size objects takes a whopping NINE SECONDS when I moved the creation of the Kryo object inside the harness' method.

No sir; if you're going to use Kryo you need to have thread local storage for your Kryo instances or you are going to be in for some serious pain. Depending on the load of your application, you may want to pre-create them as a pool within a servlet initializer that is scaled to the number of threads you have in your container.

Quite frankly I'm astonished that there's so much overhead encountered on an instance that isn't thread safe, but I also haven't delved into the API enough to know what the reasons are behind this. Still though, it creates some very annoying design implications that you'll need to make sure are accounted for correctly in your application.

Part of me would sooner call Smile the winner since it doesn't have this particular issue, but after looking at the JAXRS provider for it I'm left scratching my head.

However, when it comes to larger entities, Smile offered marginal improvement over JSON, whereas Kryo clearly won that round.

Based on the results in the first pass, I think Kryo showed the most improvement, but also a fair number of warts.

next steps

I'm far from finished here, but felt compelled to get something published. I plan on doing the following things next:

  • Getting feedback from others about my approach and the data to see if I'm way off the mark.
  • Potentially benchmarking ProtoBuf here too. It's more painful to set up, but worth experimenting with to get more data.
  • Figuring why Smile JAXRS is so miserably slow.
  • Messing around with Kryo's optimization (an example of this is here).
  • Looking at other BSON libraries.

I do genuinely feel like I'm missing some critical piece of data or type of test here, so if you see anything that could stand to be addressed, please let me know in the comments!

6 comments:

  1. It seems to me that 5k is still too small to really show the possible advantages of the serialization times. What if more fields contained the Wikipedia text and each trial consisted of a collection of 10 cars instead of a single car or using your car as it is and using an array of 100 per response. I think of gzip in tomcat. It's just added overhead if your response isn't big enough. You may have simply proven that for small responses Json is good enough. A response over 100k may show very different results.

    ReplyDelete
    Replies
    1. I can certainly add larger strings to the entities, or like you said serialize a collection of Car instances to increase the footprint. I'm curious how typical that is of web services in general to return object graphs > 5k when serialized.

      Delete
    2. The samples I have are from search results, which have typically been between 100k and 800k in a product search. Also think of searching for buying guides, reviews, or something like that. Granted, for a back-end RESTful service you may be getting much smaller results, and perhaps the (de)serialization overhead doesn't actually buy you anything there.

      Delete
    3. For normal sized objects, actually JSON is not that bad since the average data size for that was a little higher than both Kryo and Smile and it did pretty well. For larger data sets, Kryo does seem to be doing better.
      I feel like we might need to choose the format based on the type and size of data.
      Regarding the higher sized responses, we might need that in an aggregator service.

      Delete
  2. coffee distributor leveler tool 53mm
    Coffee distributor leveler tool in order to make the steam, the entire boiler must be heated to a boiling point. The hot water in the coffee distributor Leveler burns the coffee powder, loses the aromatic oil hidden in the coffee powder and extracts the doubly bitter coffee. Coffee brewed on Bezzera's machine doesn't form Crema for two main reasons:
    The hot water used to make coffee is too hot and loses its oil.
    Steam boilers cannot provide sufficient pressure.

    ReplyDelete
  3. Electric Heated Flowmeter Regulator exporter, factory and manufacturer Electromagnetic flowmeter is a kind of electromagnetic induction type flow meter, highly intelligent, high precision, and can measure the forward and reverse flow, instantaneous flow and cumulative flow, flow velocity, flow percentage, conductance ratio, span ratio of 1:15 0, the lattice, according to the Chinese range set at random, sensors and switches are interchangeable random direction, atc incentive fluid anomaly traffic on the lower limit alarm, Easy to use, less maintenance, long service life, low power consumption (10V) can output standard current and frequency pulse and RS-485 communication.

    ReplyDelete