Friday, February 28, 2014

using jackson mixins and modules to fix serialization issues on external classes

the problem

Let's say you have some classes coming from another library that you need to serialize into JSON using Jackson. You can't manipulate the source code of these classes for one reason or another, and you have a problem:

These classes don't serialize correctly

There are a number of reasons this can happen, but in this post we're going to focus on two examples: two classes that have a recursive relationship to one another, and a class that doesn't conform to the bean spec.

dealing with recursive relationships

Let's say we have two classes, User and Thing, as shown below. User has a one to many relationship with Thing, and Thing has a many to one relationship back to its parent, User:
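
Roughly speaking, with stand-in fields and accessors (the actual classes aren't shown here, so treat the names as assumptions):

    import java.util.ArrayList;
    import java.util.List;

    public class User {
        private String name;
        private List<Thing> things = new ArrayList<Thing>();

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public List<Thing> getThings() { return things; }
        public void setThings(List<Thing> things) { this.things = things; }
    }

    public class Thing {
        private String name;
        private User user; // the back-reference that causes all the trouble

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }
        public User getUser() { return user; }
        public void setUser(User user) { this.user = user; }
    }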

Given these classes, let's say in a unit test we create users using the following code, establishing the recursive relationship:
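
Something along these lines (names and values are placeholders):

    private User createUser() {
        User user = new User();
        user.setName("Ian");

        Thing thing = new Thing();
        thing.setName("fancy thing");
        thing.setUser(user);          // child points at parent...
        user.getThings().add(thing);  // ...and parent points back at child

        return user;
    }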

Now that we can create a user, let's try serializing:
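
A minimal test, assuming JUnit, the Jackson 2 ObjectMapper, and the createUser() helper above living in the same test class:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import org.junit.Test;

    public class UserSerializationTest {

        @Test
        public void serializeUser() throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            System.out.println(mapper.writeValueAsString(createUser()));
        }
    }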

If you run that method, your test will fail, and you'll see a nice CPU spike on your computer: Jackson bounces back and forth between User and Thing until the stack blows up, surfacing as a JsonMappingException complaining about infinite recursion (StackOverflowError).

fixing recursive relationship issues with mixins

As it turns out, Jackson has an awesome feature called mixins that can address this type of problem (remember, we're assuming User and Thing are not modifiable).

Mixins allow you to create another class that has additional Jackson annotations for special serialization handling, and Jackson will allow you to map that class to the class you want the annotations to apply to. Let's create a mixin for Thing that specifies a @JsonFilter:
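
The mixin itself is tiny; a sketch:

    import com.fasterxml.jackson.annotation.JsonFilter;

    @JsonFilter("thing filter")
    public abstract class ThingMixin {
    }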

Now you might be thinking, "What's that 'thing filter' string referencing?" We have to add this mixin to the object mapper, binding it to the Thing class, and then we have to create a filter called "thing filter" that excludes Thing's user field, as shown in the test below:
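
A sketch of the wiring, using Jackson 2.x's SimpleFilterProvider and the addMixInAnnotations method:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.databind.ser.impl.SimpleBeanPropertyFilter;
    import com.fasterxml.jackson.databind.ser.impl.SimpleFilterProvider;
    import org.junit.Test;

    public class UserMixinTest {

        @Test
        public void serializeUserWithMixin() throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            mapper.addMixInAnnotations(Thing.class, ThingMixin.class);

            SimpleFilterProvider filters = new SimpleFilterProvider()
                .addFilter("thing filter", SimpleBeanPropertyFilter.serializeAllExcept("user"));

            System.out.println(mapper.writer(filters).writeValueAsString(createUser()));
        }
    }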

If you run this test, you'll see that it passes.

dealing with a class that doesn't conform to the bean spec

Let's say we have two other classes, Widget and WidgetName. For some reason, the person who wrote WidgetName decided to not conform to the bean spec, meaning when we serialize an instance of Widget, we can't see data in WidgetName:
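
A sketch of the kind of shape I mean; the value() accessor is the assumption here, since there's no getValue() for Jackson to discover:

    public class Widget {
        private int id;
        private WidgetName name;

        public int getId() { return id; }
        public void setId(int id) { this.id = id; }
        public WidgetName getName() { return name; }
        public void setName(WidgetName name) { this.name = name; }
    }

    public class WidgetName {
        private String value;

        public WidgetName(String value) { this.value = value; }

        // not a getter as far as the bean spec is concerned
        public String value() { return value; }
    }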

Let's say we're creating widgets using the code below:
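
Something like:

    private Widget createWidget() {
        Widget widget = new Widget();
        widget.setId(1);
        widget.setName(new WidgetName("whirlygig"));
        return widget;
    }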

If we try to create a widget like this and serialize it, we won't see the name. Here's a test that will serialize:
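
A sketch of the test; FAIL_ON_EMPTY_BEANS is disabled here on the assumption that WidgetName exposes no properties at all, since otherwise Jackson would throw rather than serialize an empty object:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.databind.SerializationFeature;
    import org.junit.Test;

    public class WidgetSerializationTest {

        @Test
        public void serializeWidget() throws Exception {
            ObjectMapper mapper = new ObjectMapper();
            mapper.disable(SerializationFeature.FAIL_ON_EMPTY_BEANS);
            System.out.println(mapper.writeValueAsString(createWidget()));
        }
    }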

And here's the output:
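
Given the sketches above, it looks something like this; the id shows up, but the name is an empty object:

    {"id":1,"name":{}}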

fixing classes that don't conform to the bean spec with modules and custom serializers

We can address this problem pretty easily using a Jackson module that provides a custom serializer for WidgetName. The serializer can be seen below, and it uses the JsonGenerator instance to write the value from the WidgetName argument:
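
A sketch of the serializer, reading the value through the non-bean value() accessor assumed earlier:

    import java.io.IOException;

    import com.fasterxml.jackson.core.JsonGenerator;
    import com.fasterxml.jackson.databind.JsonSerializer;
    import com.fasterxml.jackson.databind.SerializerProvider;

    public class WidgetNameSerializer extends JsonSerializer<WidgetName> {

        @Override
        public void serialize(WidgetName name, JsonGenerator generator, SerializerProvider provider)
                throws IOException {
            generator.writeString(name.value());
        }
    }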

We now need to wire this up in order to use it. Below is a test that creates a SimpleModule instance, wires up our custom serializer to it, and registers the module within our ObjectMapper instance:
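
A sketch of that wiring:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import com.fasterxml.jackson.databind.module.SimpleModule;
    import org.junit.Test;

    public class WidgetModuleTest {

        @Test
        public void serializeWidgetWithModule() throws Exception {
            SimpleModule module = new SimpleModule();
            module.addSerializer(WidgetName.class, new WidgetNameSerializer());

            ObjectMapper mapper = new ObjectMapper();
            mapper.registerModule(module);

            System.out.println(mapper.writeValueAsString(createWidget()));
        }
    }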

If you run this test, you can see the correct output for the serialization:
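
Again given the sketches above, something like:

    {"id":1,"name":"whirlygig"}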

conclusion and resources

All the resources used in this example can be found on Github at https://github.com/theotherian/jackson-mixins-and-modules. If you end up having to serialize data that's outside of your control and is causing you problems, you should be able to get a lot of mileage out of these two solutions.

Thursday, January 23, 2014

writing your own tech blog part 1: the first 10k hits are the toughest

a milestone and a retrospective

Today, my blog exceeded 10,000 views. I'll concede that some of them are spam/bots, but most aren't. So, in a flagrant act of self-endorsement, I'm going to write a blog post about writing blog posts. It's like Inception, but your dead wife isn't trying to kill you in your subconscious.

I'm pretty excited that I've been able to not just hit that milestone, but that I've stuck with it and gotten some comments and feedback. I had a lot of motivation and ideas when I started writing, and I've had some opportunities to figure out what works, what doesn't, and which goals are worth setting. I wanted to share these in an effort to help others who are thinking about writing a blog, or have one and are stuck or not feeling motivated.

things that work

Let's start off on a positive note: ways to succeed personally and publicly with this. Here's a list of observations and advice that I found worked well for me and was reflected in the comments I received and the page views I've attracted.

  • Try picking topics or problems that are somewhat mainstream, but also something people tend to get stuck on. For example, the post I wrote about Jersey 2.0 resource filters gets a decent amount of traffic, and it's the kind of thing that people often set and forget or run into issues getting started with. I found the documentation around this feature of Jersey to be lacking and had to do a fair amount of trial and error to get things working correctly, so I figured others had the same problem. I also use Jersey a lot, which brings me to my next point...
  • Pick things you're interested in. This probably sounds obvious, but I want to emphasize it. If you write about something you're genuinely interested in, it'll help with your motivation. It can also help broaden your horizons, because as you're writing you may start to think of new directions and features you want to explore. Sometimes it'll help you solve a problem, which leads me to...
  • If you solved a problem that seemed tricky to figure out, like it was something you'll forget and need again, or can help someone else in the same boat, just write about it. Seriously; make yourself a quick set of rough notes in a text editor and write about it later. I was surprised at how often I came back to one of my posts by doing this, and it's helped me professionally in that I've referred coworkers to my blog for guides on how to do certain things. You may be seeing a pattern here, because my next point is:
  • When you write a post, try to be as complete as possible with your examples and resources. Usually I try to do things with Maven and try to include a pom file which makes it easier to reproduce my work on your machine should you choose to mess with it. Try to avoid leaving out steps, or better yet, after you post something, start from a blank slate in an IDE and try running through your example; you may be surprised by what you realize you left others to figure out on their own.
  • Spend some time on self promotion and SEO. I typically tweet my blog posts, oftentimes to coworkers to help get feedback since I'm lucky enough to work with some incredibly smart people. I also look at what people search on that leads them to my site, and what I search for when I'm researching for a post. Sometimes I will custom craft the URL for a post to try and get the most lift. As one example of having strong SEO, if you search for Jersey 2.0 filters, my post is the 4th result after a java.net article and two dzone articles. Against titans like that, I'll consider 4th place an achievement. (On an amusing note, my colleague Sanjay's post is 8th in the results. I'll be sure to tease him about this tomorrow, hehe)

setting goals

There are really two goals that I would advocate you take into consideration when writing a blog. I'm sure you'll have lots of goals, but there are two I think are particularly important.

  1. In the words of Interpol, pace is the trick. Set some modest, achievable goals for yourself, and don't try to go at your blog at some frenetic pace. You don't want to burn yourself out; you want to find that balance of productivity and desire that gives you a steady flow of work. For me, I set a goal of writing one post a month. Some months I don't write one, some I write two. By doing that though, I feel like I can achieve that goal and never end up dreading it. I don't feel like I'm setting an unreasonable goal, and I don't feel like I'm being lax on myself: it's enough to keep things moving.
  2. Once you get into a groove and get some solid material up, your blog can be a massively important extension of your resume. I've had multiple interviews since putting my URL in the header of my resume, and in several of them one or more technical interviewers told me "I took some time to look at your blog, and I really like the work you shared." It's a serious advantage to your cause, because you're demonstrating expertise in your field. I see 8+ page resumes all the time, and they're a nightmare to deal with. Oftentimes they're crafted to get past filters, listing every technology and buzzword possible, but really they're just a massive obfuscation of what the candidate is good at. If you can stick with a blog and show what you're made of, that 1-2 page resume can focus on a smaller, stronger set of accomplishments, and your many useful, well-written blog posts can do the rest of the talking. More than just a manifestation of skill, it demonstrates that you care about your craft, and are contributing something that can help others.

these ways lead to madness

There are ways to put yourself on the road to failure as well. Some of these are the counter case to my points above and may have already been inferred by you, but I think they're still worth discussing.

  • Don't overdo it. Like I said before, find a pace that works for you, but more importantly recognize what pace doesn't. For a while I tried to do two posts a month, and I started to feel burdened by it. Luckily I was able to realize this and have since backed down (as I said, I shoot for one a month), but had I not realized it, I think I may have put myself off blogging entirely.
  • Avoid overcomplicated material or examples. If you want to do something large, break it up into smaller manageable parts. I did this with my multipart guide on Maven archetypes, because writing all of that at once would have been repulsive. I broke up the subject into three logical portions that weren't so short that they added no value but also not so long that people gave up on them.
  • Don't skimp on research for your posts. If you're writing something and you think someone may have already solved it in a different way, or you feel like you're missing something, dig deeper. I've had multiple experiences where I was pretty far into writing a post, felt something was off, and it turned out to be true. Oftentimes I spend 10-20 hours researching something before I write about it, to try and understand as much as possible before I publish. Throwing some half-baked scribble up just to add a post could discredit you in the eyes of your audience.
  • Try not to ignore, disregard, or fight in comments. I'll concede that I'm guilty of this in one case: I replied but didn't follow through on uploading more of my code to help someone. You want people to contribute back to you; sometimes they may offer additional resources you didn't know about, other times they may be struggling with something you wrote and it could help you realize that you glossed over some important details. If people read your blog and just see a bunch of comments with no answer, they could think that you don't care or have abandoned the blog. You could get the occasional troll as well. If you do, try to keep things civil, as your blog is a reflection of you. People may argue or even be combative; try to do your best to remain objective. If there's no resolution to the argument, there's no shame in calling that out. Saying "I think we just have two different ways of approaching this" or "I understand your point, but I don't agree with it. I do appreciate you commenting though" is a perfectly professional thing to say and can wrap things up just fine.

next stop: 100k (hopefully)

I hope sharing these observations and opinions helps. I'm a big fan of knowledge sharing and seeing developers help and contribute to one another, and I think blogging is a fantastic way to accomplish that. I encourage anyone who takes their career as a developer seriously to give writing their own tech blog a shot; you may be surprised just how much you'll learn in the process.

Monday, January 20, 2014

random things I learned about solr today

sometimes I just go where the search results take me

I was doing some research on SolrCloud tonight, and wound up learning about enough disparate things that I figured I'd put together a quick page summarizing what I'd read along with some links. If nothing else this post is going to just end up being notes for my own memory, but if somehow this helps someone else along the way all the better.

so, what did you learn?

regarding data consistency and cap theorem

SolrCloud (or distributed Solr) claims to use a CP model for data, which surprised me. CP means consistent and partition tolerant, referring to the CAP theorem; if you aren't familiar with it, you should read about it. The more I read about this, though, the more I'd disagree that "CP" is correct, unless my understanding of CAP is flawed.

According to this, SolrCloud "is a CP system - in the face of partitions, we favor consistency over availability." This discussion makes things a little clearer, clarifying that SolrCloud "favors consistency over availability (mostly concerning writes)."

To expand on what this means, you need at least a high level understanding of Solr's sharding capabilities, which is about all I have at this point. When you shard, you have a leader for certain documents as well as replicas. When you go to update a document, Solr will route the request to the leader and then propagate the change to the replicas. If you happen to look up data from replicas as well as the leader, then you're actually using an eventual consistency model: a request that hits a replica can return a stale document compared to what the leader has, if the leader hasn't finished distributing an update to the replicas at the time of a real time get.

The "A" is missing in this equation because it's possible that update requests will be rejected under certain conditions. SolrCloud uses ZooKeeper to elect a leader, and ZooKeeper will not allow a split brain condition to happen if part of the cluster goes down. If ZooKeeper doesn't agree on a leader due to a partition of the cluster and a potential split brain condition, update requests will be rejected, i.e. availability is sacrificed in favor of remaining consistent and being partition tolerant. However, availability is still maintained for read operations; the cluster will not reject those requests unless you've partitioned in such a way that there's no shard or replica corresponding to a particular document.

To wrap things up, I found the assertion of a CP model surprising when it's using the same eventual consistency model that AP data stores such as CouchDB use. To Solr's credit, changes should be distributed to replicas extremely fast, and soft commits happen within seconds, meaning the eventual consistency window is quite small, so the odds that it will create a problem are slim.

soft commits, hard commits, real time gets and the transaction log

This is merely a terse summary of the documentation around real time gets and near real time searching, but since it falls under the "things I learned and may likely forget tomorrow morning" umbrella I'm writing about it.

First, it's important to call out that updating a document in Solr doesn't automatically make it available within searches. As of Solr 4, you can access a fresh version of a resource after it's been updated by using a real time get, as long as you have the transaction log enabled. The transaction log is not unlike what databases use to track changes, and to be honest, this feature lets Solr behave more like a database (a NoSQL one, in effect) than I thought it could.

If you've updated a document, then you have two options to make the changes searchable: a hard commit or soft commit. A hard commit is expensive: it pushes changes to the file system (making them persistent) and has a significant performance impact. A soft commit is less expensive but not persistent. All updates are persistent if you have the transaction log enabled. According to Solr's documentation, it's reasonable to have soft commits automatically happen within seconds while hard commits are restricted to a much longer interval (maybe 10-15 minutes).
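
To make the distinction concrete, here's a hedged SolrJ 4.x sketch; the URL and field names are assumptions, and in practice you'd more likely configure autoCommit and autoSoftCommit in solrconfig.xml than commit explicitly from a client:

    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;

    public class CommitExample {

        public static void main(String[] args) throws Exception {
            HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr");

            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", "doc-1");
            doc.addField("title", "a freshly updated document");
            solr.add(doc); // durable in the transaction log, but not yet searchable

            // soft commit: cheap, makes the change searchable, not persistent by itself
            solr.commit(true, true, true);

            // hard commit: expensive, makes the changes durable in the index on disk
            solr.commit(true, true, false);
        }
    }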

You need to be aware of a few things when using the transaction log, as documented here. First, all your updates are written to the transaction log before a successful response is returned to a client. Second, performing a hard commit will persist all changes in the transaction log. Third, not performing a hard commit periodically can result in a huge transaction log that can kick the crap out of your Solr instance on startup, should it try to replay changes on the order of gigs. So, keep an eye on how large you're allowing your transaction log to become, lest you send Solr into a tailspin on startup.

block joins make searching relational

If you've ever wanted a nice parent-child relationship on your documents, it's here. I'm not going to talk about this too much myself because I have a tenuous understanding of how to query this in Solr so far, and there are awesome resources here, here, here and here. One thing worth calling out is that apparently this won't work correctly in JSON until version 4.7 according to this jira ticket.

that's it for now

There's a lot more I'm planning on reading up on regarding Solr in the next few weeks, meaning there's a decent chance of more posts like this as well as in-depth follow ups to help people get started with certain features. In the meantime, feel free to share anything you think I or others should dedicate some time to learning about Solr next!

Wednesday, January 1, 2014

strong, soft, weak, and phantom references: the double rainbows of java

did you just make a double rainbow metaphor?

Yes, because like the double rainbow video, you may look at soft, weak or phantom references and say to yourself "what does it mean?!" It's easy to mix up soft vs weak if you're new to them, and it's also easy to be confused by soft references since they are often referred to as a "poor man's cache." Phantom references, on the other hand, offer such a different type of functionality that you'll probably never need to use them... unless you need to at which point they're literally the only thing that can do what they do in Java.

Hopefully, by the end of this post, you'll just look at them and say "full on!" (another video reference). This isn't even a triple rainbow; this is a quadruple rainbow we're dealing with. The goal is to help provide a clear but concise summary of what these do, when you would use them, and what their impact is on the JVM and garbage collector.

Before we continue though, a few words of warning about these types:

  • These types have a direct impact on Java's very sophisticated garbage collector. Use them wrong, and you can end up regretting it.
  • Since these expedite when something becomes eligible for garbage collection, you can end up getting null back in places where you previously wouldn't have.
  • These should only be used in specific cases where you're absolutely positive you need the behavior they offer. You should by no means look at these and see a general replacement for what you're doing, or a myriad of ways to change your code.
  • If you're going to use these at all, do a code review with someone else, preferably a senior developer, principal developer, or architect. They're powerful, impactful, and easy to use incorrectly, so even if you quadruple checked your code and think to yourself "Nailed it!", have someone else go over it with you; a pair of fresh eyes on code can make all the difference in the world.

strong references, and a brief jvm crash course

Any reference (or object) you create is a strong reference; this is the default behavior of Java. Strong references exist in the heap for as long as they're reachable, which means some thread in the application can reach that reference (object) without actually using a Reference instance, and potentially longer depending on the necessity for a full GC cycle. Any reference you create (barring things that are interned, which is another discussion) is added to the heap, first in an area called the young generation. The garbage collector keeps an eye on the young generation all the time, since most objects (references) that get created are short lived and eligible to be garbage collected shortly after their creation. Once an object survives the young generation's GC cycles, it's promoted to the old generation and sticks around.

Once something ends up in the old generation, the garbage collection characteristics are different. Full GC cycles will free up memory in the old generation, but to do so they have to pause the application to know what can be freed up. There's a lot more to talk about here, but it's beyond the scope of this post. Full GC pauses can be very slow depending on the size of your heap/old generation, and generally only happen when it's absolutely necessary to free up space (I say generally because the JVM's -client and -server options have an effect on this behavior). An object can exist in the old generation and no longer be strongly reachable in your application, but that doesn't mean it's necessarily going to be garbage collected if your application doesn't have to free up memory.

There are multiple reasons why the JVM may need to free up memory. You may need to move something from the young generation to the old, and you don't have enough space. You may have an old generation that's highly fragmented from many small objects being collected, and your application needs a larger block of contiguous space to store something. Whatever the reason, if you can't free up the space you need in the heap, you'll be in OutOfMemoryError town. Conversely, if you're low on memory, you can also end up triggering a barrage of young and old generation collection passes, often referred to as thrashing, which can tank the performance of your application.

weak references

I'm going to deviate from the canonical ordering in the title of this post and explain weak references before soft, because I think it's far easier to understand soft after you understand weak.

Think of a WeakReference as a way of giving a hint to the garbage collector that something is not particularly important and can be aggressively garbage collected. An object is considered "weakly reachable" if it's no longer strongly reachable and only reachable via the referent field of a WeakReference instance. You can wrap something inside of a WeakReference, which is then accessible via the get() method, as shown in the example below:
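
A minimal sketch:

    import java.lang.ref.WeakReference;

    public class WeakReferenceExample {

        public static void main(String[] args) {
            WeakReference<StringBuilder> value =
                new WeakReference<StringBuilder>(new StringBuilder("only weakly reachable"));

            // may be the referent, or null if the GC has already reclaimed it
            System.out.println(value.get());

            System.gc(); // only a hint, but weak referents are typically reclaimed here

            System.out.println(value.get()); // very likely null at this point
        }
    }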

If the instance inside of value is only reachable via value, it's eligible for garbage collection. If it's garbage collected, then value.get() will return null. The garbage collector has a level of awareness of weak references (and for all Reference types for that matter) and can be more strategic about reclaiming memory as such.

Now, you may be asking yourself: "when would I use weak references?" Most of the other resources on the web say one of two things: WeakHashMap is an example of how to use them, and using them for canonicalized mappings. I think both of these are poor answers for a few reasons: WeakHashMap is dangerous to use if used incorrectly (read the JavaDoc), and I highly doubt that the average person who is just learning about weak references will read "use them for canonicalized mappings", slap their hand on their forehead and exclaim "Oh! Of course!"

That said, there's a very practical example of using weak references via WeakHashMap written by Brian Goetz that I will attempt to paraphrase. When you store a key-value pair in a Map, the key and value are strongly reachable as long as the map is. Let's say we have a case where, once the key is garbage collected, the value should be too: a clear example of this is a parent-child relationship where we don't need the children if we don't have the parent. If we use the parent as the key to a WeakHashMap instance, it ends up wrapped in a WeakReference, meaning that once the parent is no longer strongly reachable anywhere else in the application it can be garbage collected. The WeakHashMap can then go back and clean up the value stored with the key by using a ReferenceQueue, which I explain further down in this post.
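
A sketch of that parent-child idea (the types here are placeholders):

    import java.util.Arrays;
    import java.util.List;
    import java.util.Map;
    import java.util.WeakHashMap;

    public class WeakHashMapExample {

        public static void main(String[] args) {
            // keys are wrapped in WeakReferences; values live until their key is collected
            Map<Object, List<String>> childrenByParent = new WeakHashMap<Object, List<String>>();

            Object parent = new Object();
            childrenByParent.put(parent, Arrays.asList("child-1", "child-2"));
            System.out.println(childrenByParent.size()); // 1

            parent = null; // the key is now only weakly reachable
            System.gc();

            System.out.println(childrenByParent.size()); // very likely 0
        }
    }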

Previous to that paragraph, I mentioned WeakHashMap can be dangerous, and I'd like to expand on that. It's not uncommon that someone may think a WeakHashMap is a good candidate for a cache, which is likely a recipe for problems. Usually a cache is used as a means to store data in memory that has a (potentially huge) cost to load, meaning the value is what you want to have long-lived and not necessarily the key, which is probably quite dynamic in nature. If you use a WeakHashMap without long-lived keys, you'll be purging stuff out of it quite often, and probably cause a large amount of overhead in your application. So, if you're going to use WeakHashMap, the first question you must ask yourself is: how long-lived is the key to this map?

soft references, sometimes referred to as the "poor man's cache"

The differences between a SoftReference and a WeakReference are straightforward on the surface but quite complex behind the scenes. Just like the definition of "weakly reachable", a reference is considered to be "softly reachable" if it's no longer strongly reachable and is only reachable via the referent field of a SoftReference instance. While a weak reference will be GC'd as aggressively as possible, a soft reference will be GC'd only if an OutOfMemoryError would be thrown if it wasn't reclaimed, or if it hasn't been used recently. The former case is pretty easy to understand: none of your strongly referenced objects are eligible for GC and you can't grow the heap any more, so you have to clear your soft references to keep your application running. The latter case is more complex: a SoftReference will actively record the time of the last garbage collection when you call get(), and the garbage collector itself records the last time a collection occurred inside of a global field in SoftReference. Recording these two points provides the garbage collector with a useful piece of information: how much time has passed from the GC before the value was last accessed versus when the most current GC occurred.

Here's an example of using a SoftReference:
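
A sketch of the usual pattern:

    import java.lang.ref.SoftReference;

    public class SoftReferenceExample {

        public static void main(String[] args) {
            SoftReference<byte[]> cached = new SoftReference<byte[]>(loadExpensiveData());

            byte[] data = cached.get();
            if (data == null) {
                // the referent was reclaimed under memory pressure; load it again
                data = loadExpensiveData();
                cached = new SoftReference<byte[]>(data);
            }
            System.out.println(data.length + " bytes available");
        }

        private static byte[] loadExpensiveData() {
            return new byte[1024 * 1024]; // stand-in for a costly fetch
        }
    }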

The JVM also provides a tuning parameter related to soft references called -XX:SoftRefLRUPolicyMSPerMB=(some time in millis). This parameter (set to 1000ms by default) indicates how long the value in a SoftReference (also called the referent) may survive when it's no longer strongly reachable in the application, based on the number of megabytes of free memory. So, if you have 100MB of free memory, your "softly reachable" object may last an additional 100 seconds by default within the heap. The reason I say "may" is that it's completely subject to when garbage collection takes place. If the softly reachable referent kicked around for 120 seconds and then became strongly reachable again, that time would reset and the referent wouldn't be available for garbage collection until the conditions I've mentioned were met again.

Now, regarding the "poor man's cache" label...

Sometimes you'll find questions online where someone asks about building a cache where data can be expired automatically, the topic of soft references comes up, and then they're scolded and told to use a cache library that has Least Recently Used (LRU) semantics, like ehcache or Guava cache. While both of those, as well as many other caching libraries, have far more sophisticated ways of managing data than just relying on soft references, that doesn't mean soft references don't have value in regard to caching.

In fact, ehcache has a bit of a problem in this regard: everything it caches is strongly referenced, and while it does have LRU eviction, that eviction is lazy rather than eager. This means that you could have data that isn't being used sitting around in memory, strongly referenced and not eligible for GC, and not forced out of the cache because you haven't exceeded the maximum number of entries. Guava cache, on the other hand, has a builder method, CacheBuilder.softValues(), that allows you to specify that values be wrapped in SoftReference instances. If you're using a loading cache, the value can be repopulated automatically if it's been garbage collected. In this case, soft references play nicely with a robust caching solution, since you have the advanced semantics of LRU and maximum capacity along with the lazy cleanup of values that aren't being used frequently by the garbage collector.

phantom references: the tool you'll never need until you need it

Think of phantom references as what the finalize() method should have been in the first place.

Similarly to WeakReference and SoftReference, you can wrap an object in a PhantomReference instance. However, unlike the other two types, the constructor for PhantomReference requires a ReferenceQueue instance as well as the instance you're wrapping. Also unlike the other two types, the get() method of a PhantomReference always returns null. So, why does get() always return null, and what does a ReferenceQueue do?

A phantom reference only serves one purpose: to provide a way to find out if its referent has been garbage collected. An object is said to be "phantom reachable" if it is no longer strongly reachable in the application and is only reachable via the referent field of a PhantomReference instance. When the referent is garbage collected, the phantom reference is put on the reference queue instance passed into its constructor. By polling the queue, you can find out if something has been garbage collected.

Extending PhantomReference lets you attach metadata about what was garbage collected. For example, let's say we have a CustomerPhantomReference class that has a referent of type Customer and also stores a numeric id for that customer. Let's also assume that we can do some resource clean up after a customer is no longer in memory in the application. By having a background thread poll the reference queue used in the CustomerPhantomReference instance, we can get the phantom reference back, providing us the numeric id of the customer that was garbage collected, and perform some cleanup based on that id. At face value this may sound very similar to the example I provided with weak references, so allow me to provide some clarification: in the case of weak references, we were making other data available to be GC'd; in this case, we're performing resource cleanup that's functional in nature rather than just making something no longer strongly reachable.
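
A sketch of that pattern; Customer and the cleanup behavior are hypothetical, per the example above:

    import java.lang.ref.PhantomReference;
    import java.lang.ref.ReferenceQueue;
    import java.util.HashSet;
    import java.util.Set;

    public class PhantomCleanupExample {

        static class Customer {
            private final long id;
            Customer(long id) { this.id = id; }
            long getId() { return id; }
        }

        static class CustomerPhantomReference extends PhantomReference<Customer> {
            private final long customerId;

            CustomerPhantomReference(Customer referent, ReferenceQueue<? super Customer> queue) {
                super(referent, queue);
                this.customerId = referent.getId();
            }

            long getCustomerId() { return customerId; }
        }

        public static void main(String[] args) throws InterruptedException {
            ReferenceQueue<Customer> queue = new ReferenceQueue<Customer>();
            // the phantom references themselves must stay strongly reachable,
            // or they can be collected before they're ever enqueued
            Set<CustomerPhantomReference> pending = new HashSet<CustomerPhantomReference>();

            Customer customer = new Customer(42L);
            pending.add(new CustomerPhantomReference(customer, queue));

            customer = null; // now only phantom reachable
            System.gc();

            // remove() blocks until a reference is enqueued; in real code a
            // background thread would do this in a loop
            CustomerPhantomReference ref = (CustomerPhantomReference) queue.remove();
            pending.remove(ref);
            System.out.println("cleaning up resources for customer " + ref.getCustomerId());
        }
    }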

Given that, it should be clear that the reason the constructor of a PhantomReference instance requires a ReferenceQueue is that a phantom reference is useless without the queue: the only thing it tells you is that something has been garbage collected. Still though, what about get() returning null?

One of the dangers of the finalize() method is that you can reintroduce strong reachability by leaking a reference to the instance the method is being executed from. Since PhantomReference will only return null from its get() method, it doesn't provide a way for you to make the referent strongly reachable again.

so what do reference queues do in regard to weak and soft references?

We already know that soft and weak references provide a way to have things garbage collected when they would normally be strongly reachable. We also know that phantom references use a reference queue as a way to provide feedback for when something is garbage collected, which is really the purpose of phantom references to begin with. So why would we want soft and weak references to be queued up too?

The reason is actually quite simple: your soft and weak references are still strongly referenced. That's right: you could potentially end up hitting an OutOfMemoryError because of an overabundance of now-useless SoftReference or WeakReference instances, which remain strongly referenced even though the values they effectively proxied have been garbage collected.

Using a ReferenceQueue allows you to poll for any type of Reference that has been garbage collected and remove it (or set it to null). There's an example of this visible in WeakHashMap.expungeStaleEntries() where the map polls its ReferenceQueue whenever you call size() or whenever getTable() or resize() is called internally.
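
A hedged sketch of that expunge pattern:

    import java.lang.ref.Reference;
    import java.lang.ref.ReferenceQueue;
    import java.lang.ref.WeakReference;
    import java.util.HashSet;
    import java.util.Set;

    public class ExpungeExample {

        private final ReferenceQueue<Object> queue = new ReferenceQueue<Object>();
        private final Set<WeakReference<Object>> references = new HashSet<WeakReference<Object>>();

        public void track(Object value) {
            references.add(new WeakReference<Object>(value, queue));
        }

        // call periodically, the way WeakHashMap does on size() and friends
        public void expungeStaleReferences() {
            Reference<?> ref;
            while ((ref = queue.poll()) != null) {
                references.remove(ref); // drop the now-useless Reference instance
            }
        }
    }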

additional resources

Garbage Collection by Bill Venners
Understanding Weak References by Ethan Nicholas
Memory Thrashing by Steven Haines
Understanding Java Garbage Collection by Sangmin Lee
WeakHashMap is Not a Cache! by Domingos Neto
Plugging Memory Leaks with Weak References by Brian Goetz
How Hotspot Decides to Clear Soft References by Jeremy Manson

Sunday, December 22, 2013

kryo vs smile vs json part 1: a misguided shootout

this may be my most frustrating post so far

First, a little background.

At some point, even when you can scale horizontally, you start to examine aspects of your application that you can easily take for granted in the grand scheme of things for performance gains. One of those points when dealing with web services is serialization. There's general knowledge that Java serialization is slow, and XML is bloated compared to JSON. JSON is a pretty safe pick in general: it's readable, lightweight, and fast. That said, what happens when you want to do better than JSON in your RESTful web service?

A colleague and I came to this point recently, where the majority of his transaction overhead was spent unmarshalling requests and marshalling responses. This application comes under very high load, so the obvious conclusion was "well, there's a clear place to start to improve things." From there, we started looking at Apache Thrift, Google ProtoBuf (or Protocol Buffers), Kryo, Jackson Smile and, of course as a control, JSON. Naturally, we wanted to invest some time comparing these to each other.

I looked around online a lot at performance benchmarks and found some data dealing with Kryo, ProtoBuf and others located at https://github.com/eishay/jvm-serializers/wiki. The data presented there was very low level, and my goal was quite literally to produce the least sophisticated comparison of these frameworks possible, ideally using the 4-6 line samples on their respective wikis. My reasoning for this was that there is likely a common case of people not investing a huge amount of time trying to optimize their serialization stack, but rather trying to seek out a drop-in boost in the form of a library.

This is where the frustration comes into play. My results don't quite match what I've seen elsewhere, which caused me to question them several times and revisit the benchmarks I was performing. They still don't quite match, and to be honest I'm questioning the benchmark code I linked to after discovering calls to System.gc() all over the place, but I feel like I have enough data that it's worth posting something up here.

the experiment: use cases, setup, metrics, and the contenders

Let's talk about the use cases I was trying to cover first:

  • Don't go over the network. Do everything in memory to avoid external performance influences in the benchmark.
  • Serialize an object that is reasonably complex and representative of something a web service may use.
  • Serialize objects that have both small and large data footprints.
  • Use the most basic setup possible to perform the serialization and deserialization.

The setup was:

  • Run a "warm up" pass before gathering metrics to remove initial load factors on JVM startup that won't be a constant issue, and to fragment the heap slightly to both simulate real-world conditions and not give a potential advantage to a single framework.
  • Run a series of batches of entities to gather enough data to arrive at a reasonable conclusion of performance.
  • Randomize the data a bit to try and keep things in line with real-world conditions. The data is randomized from a small data set, with the assumption being that the differences in size are small enough and the batches are large enough to get a reasonably even distribution, meaning the metrics will converge on a figure that is a reasonable measurement of performance.

The following metrics were recorded:

  • Measure the average time to serialize and deserialize a batch of 100,000 entities.
  • Measure the average size of a response.
  • Measure the average time of an individual serialization/deserialization.

Lastly, the contenders: Kryo, plain Jackson JSON as the control, raw Jackson Smile, and the Jackson Smile JAXRS provider.

The use of the Jackson Smile JAXRS provider may seem odd, but I have a good reason. The basic Smile example is only a few lines, while the Smile JAXRS provider class is almost 1000 (!!!) lines. There's a lot of extra work going on in that class, and I felt it was worth comparing because 1) many people could end up using this adapter in the wild, and 2) perhaps there are some optimizations that should be benchmarked.

code

All of the code used in this can be found at https://github.com/theotherian/serialization-shootout/tree/master/serialization-shootout

Here's a tree representation of what the entity being serialized/deserialized, Car, looks like:

Here are the harnesses being used:
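
The real harnesses live in the repo linked above; roughly, each one takes this shape (this is a hedged sketch of the JSON version, with the batch size matching the 100,000-entity batches described earlier):

    import com.fasterxml.jackson.databind.ObjectMapper;

    public class JsonHarness {

        private static final int BATCH_SIZE = 100000;

        private final ObjectMapper mapper = new ObjectMapper();

        public long timeBatch(Car car) throws Exception {
            long start = System.nanoTime();
            for (int i = 0; i < BATCH_SIZE; i++) {
                byte[] payload = mapper.writeValueAsBytes(car); // serialize
                mapper.readValue(payload, Car.class);           // deserialize
            }
            return System.nanoTime() - start;
        }
    }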

the results: normal size objects

By normal, I mean on the smaller side; most of the data is on the order of tens of bytes.

Key data points:

  • Kryo and Smile are clearly more performant than JSON in terms of time spent and size of payload.
  • Kryo and Smile are close: Kryo performs better but Smile is slightly smaller.
  • Kryo has the fastest raw serialization/deserialization performance by a significant amount over both Smile and JSON.
  • The Smile JAXRS provider is significantly slower than its raw counterpart.

the results: large size objects

For this comparison, I added portions of Wikipedia articles as part of the object, all equal in length.

Key data points:

  • Kryo is best in breed by a wide margin here, handling batches in 1.2s vs 1.9s for both Smile and JSON. Serialization and deserialization are both significantly faster.
  • Variance in size is practically nonexistent between all the frameworks.
  • Smile JAXRS really looks like a dog here, taking 2.6s to handle a batch and showing surprisingly poor deserialization performance.

the winner: kryo (with HUGE MASSIVE caveats)

Kryo clearly has some advantages here, but it also has one major disadvantage: Kryo instances are not thread safe. Did you hear that?

KRYO INSTANCES ARE NOT THREAD SAFE!

This caused me to show the same amount of rage DateFormat did years ago. BFD, you may say, thinking "Just create a Kryo instance each time!" Well, what if I told you that each batch of the normal size objects takes a whopping NINE SECONDS when I move the creation of the Kryo object inside the harness's method?

No sir; if you're going to use Kryo you need to have thread local storage for your Kryo instances or you are going to be in for some serious pain. Depending on the load of your application, you may want to pre-create them as a pool within a servlet initializer that is scaled to the number of threads you have in your container.
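
A sketch of the thread-local approach:

    import com.esotericsoftware.kryo.Kryo;

    public class KryoHolder {

        private static final ThreadLocal<Kryo> KRYO = new ThreadLocal<Kryo>() {
            @Override
            protected Kryo initialValue() {
                // construction cost is paid once per thread instead of once per request
                return new Kryo();
            }
        };

        public static Kryo get() {
            return KRYO.get();
        }
    }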

Quite frankly I'm astonished that there's so much overhead encountered on an instance that isn't thread safe, but I also haven't delved into the API enough to know what the reasons are behind this. Still though, it creates some very annoying design implications that you'll need to make sure are accounted for correctly in your application.

Part of me would sooner call Smile the winner since it doesn't have this particular issue, but after looking at the JAXRS provider for it I'm left scratching my head.

However, when it comes to larger entities, Smile offered marginal improvement over JSON, whereas Kryo clearly won that round.

Based on the results in the first pass, I think Kryo showed the most improvement, but also a fair number of warts.

next steps

I'm far from finished here, but felt compelled to get something published. I plan on doing the following things next:

  • Getting feedback from others about my approach and the data to see if I'm way off the mark.
  • Potentially benchmarking ProtoBuf here too. It's more painful to set up, but worth experimenting with to get more data.
  • Figuring out why Smile JAXRS is so miserably slow.
  • Messing around with Kryo's optimization (an example of this is here).
  • Looking at other BSON libraries.

I do genuinely feel like I'm missing some critical piece of data or type of test here, so if you see anything that could stand to be addressed, please let me know in the comments!

Monday, December 2, 2013

making guava cache better with jmx

caching with jmx is just so much better

If you've never used this before, you're missing out. Being able to remotely check statistics on your cache to measure its effectiveness, as well as being able to purge it at runtime is invaluable. Sadly Guava doesn't have this baked in the way ehcache does, but it's relatively easy to add.

Most of my work is a slightly different take on some work a fellow Github user named kofemann produced (located here) which contains the JMX beans and bean registration logic. I made a few alterations to the code, pulling the registration out into a separate class (I really didn't like the bean doing all that work in the constructor) and adding a refreshAll method.

taking advantage of refresh after write functionality

If you've read my previous blog post about the awesomeness that is Guava's refresh after write functionality, then you'll see how it can be advantageous when it comes to JMX management. If you didn't read my post (shame on you), then it's worth calling out that refresh after write allows for asynchronous loading of cache values, meaning you never block barring the initial loading of the cache.

This can be used via JMX management as well by iterating through the keys of the cache and calling refresh for each one, which will load new values without causing clients of the cache to block (as opposed to purging the cache). Purging a cache is a dangerous thing to do under certain circumstances, since missing values will trigger loading events that will block clients at runtime and potentially overwhelm either your application server or even your underlying data storage. I would argue that ehcache is particularly bad because of potential read contention caused by write blocking. To clarify, several threads in your application can block waiting for cache values to be reloaded, and all of those blocking threads will then compete over a limited number of read locks after the write lock has been released, potentially causing a CPU spike and considerable latency in your application under the worst conditions. When I say worst conditions, I'm speaking from very recent and harrowing experience, so I have the lumps to say with the utmost certainty this can happen. :)

the implementation

For JMX you need an interface and an implementation. The interface can be found on my Gist and doesn't really need to be shown in the post. The implementation is below; it's really a wrapper around Guava's CacheStats object and the cleanup/invalidateAll methods, as well as my refreshAll method:
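
A sketch of both halves; the real interface is on the Gist, so the method names here are assumptions:

    import com.google.common.cache.CacheStats;
    import com.google.common.cache.LoadingCache;

    public interface GuavaCacheMXBean {
        long getRequestCount();
        long getHitCount();
        double getHitRate();
        long getMissCount();
        double getMissRate();
        long getEvictionCount();
        void cleanUp();
        void invalidateAll();
        void refreshAll();
    }

    public class GuavaCacheMXBeanImpl<K, V> implements GuavaCacheMXBean {

        private final LoadingCache<K, V> cache;

        public GuavaCacheMXBeanImpl(LoadingCache<K, V> cache) {
            this.cache = cache;
        }

        private CacheStats stats() { return cache.stats(); }

        public long getRequestCount() { return stats().requestCount(); }
        public long getHitCount() { return stats().hitCount(); }
        public double getHitRate() { return stats().hitRate(); }
        public long getMissCount() { return stats().missCount(); }
        public double getMissRate() { return stats().missRate(); }
        public long getEvictionCount() { return stats().evictionCount(); }

        public void cleanUp() { cache.cleanUp(); }
        public void invalidateAll() { cache.invalidateAll(); }

        public void refreshAll() {
            // kicks off background reloads; readers keep the old values until new ones land
            for (K key : cache.asMap().keySet()) {
                cache.refresh(key);
            }
        }
    }

One thing to double-check if the statistics all come back as zero: if I remember right, newer Guava versions only accumulate statistics when you call recordStats() on the CacheBuilder.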

As I said before, refreshAll has the advantage of not causing your application to potentially lock up due to cache contention; everything will load up in the background. Depending on how you have your thread pool set up for performing refreshes, you can also throttle how hard you're hitting your data store by restricting the number of concurrent fetches of data by limiting the threads available.

registering your cache in jmx

This is pretty straightforward: just pass your cache (in this case a LoadingCache because of refreshAll) to the method shown below and you'll expose it via JMX for statistics and management:
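
A sketch, assuming the bean above; the JMX domain and naming scheme are arbitrary:

    import java.lang.management.ManagementFactory;

    import javax.management.MBeanServer;
    import javax.management.ObjectName;

    import com.google.common.cache.LoadingCache;

    public class GuavaCacheJmxRegistration {

        public static <K, V> void register(LoadingCache<K, V> cache, String cacheName) {
            try {
                MBeanServer server = ManagementFactory.getPlatformMBeanServer();
                ObjectName name =
                    new ObjectName("com.theotherian.cache:type=GuavaCache,name=" + cacheName);
                server.registerMBean(new GuavaCacheMXBeanImpl<K, V>(cache), name);
            } catch (Exception e) {
                throw new IllegalStateException("unable to register cache " + cacheName, e);
            }
        }
    }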

feedback

Let me know if this works for you; I plan on using this soon in a high load environment, so I'll follow up with any results I find to help out my readers. I feel kind of bad bagging on ehcache so much recently, but it's caused me enough gray hair over the last month that I plan on focusing several blog posts around caching.

Thursday, November 14, 2013

non-blocking cache with guava and listenable futures (or why I hate ehcache some days)

trying to scale? just throw more cache at the problem!

Yes, I'm going to use the term "cache" in lots of ironic and financially metaphoric ways in this post. My apologies.

Caching makes a lot of things possible in everything we do on the Internet and on computer systems in general. That said, caching can also get you into trouble for a variety of reasons such as how wisely you use memory, how performant your cache is under contention, and how effective your cache is (i.e. cache-hit ratio).

If you're using Java, chances are you've heard of ehcache at some point. While there's a lot that ehcache does well, there's a particular aspect of it that in my experience doesn't scale well, and under certain conditions can take down your application. In fact, part of the reason I'm writing this blog post is the aftermath of raising my arms in the air and screaming after examining a performance issue related to ehcache which caused a failed load test today.

mo' (eh)cache, mo' problems

When reading data from ehcache, you end up blocking until a result can be returned (more on that here). While this is necessary the first time you fetch data from the cache, since something has to be loaded in order to be returned, you probably don't need to block for subsequent requests. To clarify: if you're caching something for 1 hour and it takes 5 seconds to load, you probably don't mind serving data that's 1 hour and 5 seconds old, especially if the alternative is blocking a request to your application for 5 seconds to reload it, along with every other request trying to load that data.

Unfortunately, and if I'm wrong here I hope someone will call me out in the comments, ehcache blocks every time the data needs to be reloaded. Furthermore, it uses a ReadWriteLock for all reads as well along with a fixed number of mutexes (2048 by default), so you can end up with read contention as well given enough load. While I understand the decisions that were made and why, there are cases where it isn't ideal and you don't want to grab any locks to create blocking conditions.

making your cache work for you

To be fair, this problem really manifests itself when you have high contention on specific keys, particularly when reloading events occur. In most cases ehcache performs perfectly fine; this post isn't meant to be a general condemnation of a very popular and useful library. That said, in order to solve the problem we don't really want to block on reads or writes; we want to refresh our data in the background, and only update what consumers of the cache see when we have that refreshed data.

This can be accomplished by having a thread go and reload the data while the cache returns stale entries until new data is available. This accomplishes the goal of not requiring read or write locks outside of the initial population of the cache, which is unavoidable. Even better, Guava Cache has all of this functionality baked in.

refresh after write and listenable futures in guava cache

Guava Cache can handle this by telling the CacheBuilder via refreshAfterWrite to refresh entries by calling the reload method in the CacheLoader instance used to construct your LoadingCache instance. The reload method returns a ListenableFuture, which is the same as a regular Future but exposes a method to register a callback. In this case, the callback is used to update the value in the cache once we've finished retrieving it.

Here's an example of this in action:
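
A sketch of the moving parts; the refresh interval, the loop, and the fake five-second load are all arbitrary choices here:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.CacheLoader;
    import com.google.common.cache.LoadingCache;
    import com.google.common.util.concurrent.ListenableFuture;
    import com.google.common.util.concurrent.ListenableFutureTask;
    import com.google.common.util.concurrent.ThreadFactoryBuilder;

    public class RefreshAfterWriteExample {

        public static void main(String[] args) throws Exception {
            // daemon so the JVM can exit; named so thread dumps make sense
            final ExecutorService refreshPool = Executors.newSingleThreadExecutor(
                new ThreadFactoryBuilder().setDaemon(true).setNameFormat("cache-refresh-%d").build());

            LoadingCache<String, String> cache = CacheBuilder.newBuilder()
                .refreshAfterWrite(3, TimeUnit.SECONDS)
                .build(new CacheLoader<String, String>() {
                    @Override
                    public String load(String key) throws Exception {
                        return slowLookup(key); // blocks, but only on initial population
                    }

                    @Override
                    public ListenableFuture<String> reload(final String key, String oldValue) {
                        // refresh asynchronously; readers get oldValue until this completes
                        ListenableFutureTask<String> task =
                            ListenableFutureTask.create(new Callable<String>() {
                                public String call() throws Exception {
                                    return slowLookup(key);
                                }
                            });
                        refreshPool.execute(task);
                        return task;
                    }
                });

            for (int i = 0; i < 30; i++) {
                System.out.println(System.currentTimeMillis() + " -> " + cache.get("key"));
                Thread.sleep(500);
            }
        }

        private static String slowLookup(String key) throws InterruptedException {
            Thread.sleep(5000); // artificial latency
            return key + " loaded at " + System.currentTimeMillis();
        }
    }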

The sleeps are in there to create artificial latency to show what this looks like in action. If you run this you'll see the asynchronous load events kick off and can witness the five seconds of latency between that event firing and the data being updated. You should also notice that reads keep succeeding in the meantime. There is a small spike the first time an asynchronous load fires, which I assume is a one-time resource allocation cost within Guava Cache.

There is one point to consider when doing this, which is how to shut down your refresh thread. In my example I used a ThreadFactory (courtesy of ThreadFactoryBuilder) to set my refresh thread as a daemon thread, which allows the JVM to shut down at the end. I also used the ThreadFactory to name the thread, which I would recommend as a general practice to make debugging easier on yourself whenever you're creating thread pools. In my example there aren't any resource concerns, so it doesn't matter if the thread is terminated, but if you had resource cleanup to perform for some reason you'd want to wire up a shutdown hook to your ExecutorService in your application since the pool would exist eternally.

For a use case like this, you'd want to be judicious about how many threads you're willing to allocate to this process as well. The number should scale somewhat to the maximum number of entries and refresh interval you choose so that you can refresh in a timely manner without consuming too many resources in your application.

conclusion

If you've come across this problem, then I hope this post helps you get past it. To reiterate what I said before, ehcache is a solid API overall, it just doesn't handle this case well. I haven't tested the Guava Cache implementation under high load conditions yet, so it's certainly possible that it has issues I've left out of the post, but from a face value standpoint it addresses the issues I've seen with ehcache in a way that doesn't involve rolling your own solution from scratch.

Feel free to share any feedback or things I may have missed in the comments!