Saturday, May 31, 2014

caching in java: getting better performance and precision from your golden hammer

inspiration

This post is actually a written form of a presentation I did at a corporate tech convention. The inspiration for it came from a common tendency to treat caching as a golden hammer for performance: if we were executing a database query frequently, we cached its results.

To be blunt, it's just not that simple. Sure, memory is relatively cheap, and caching may make things faster out the door, but that doesn't mean you're using your resources well. It also doesn't mean your caching approach makes sense in the long run: how effective is your hit ratio once that data grows? People often don't think about acceptable levels of staleness in their data either, and blindly cache for hours when that staleness can negatively impact the experience of your application.

All of this means there is no simple answer to caching. It isn't a one-size-fits-all solution: you have to be precise and think about how your data is going to be used, how it will grow over time, and how stale it can acceptably get. To reference Idiocracy, I view caching as a toilet water solution. If you haven't seen the movie, here's the gist (spoilers ahead): 500 years in the future, a man from the past wakes up from cryostasis and is the smartest man in the world. He finds out the crops are dead (I won't say why) and suggests that people start watering them instead of what they were doing. In this dystopian future, people only associate water with toilets, so the response from everyone is "Like from the toilet?" After water solves the problem, discussion of other problems leads people to suggest "Hey, let's try putting toilet water on it!" The moral of the story is this: you can't blindly apply the same solution to every problem just because it solved one that sounds similar.

This post is designed to suggest questions to ask rather than provide answers for everything; think of the criteria presented below as a checklist you can try out to see if it helps you.

what are the costs of caching in java?

Java memory management is a massive topic by itself, so we're only going to go skin deep here. If you're not familiar with how the heap is organized, the short version is that it's split into a young generation (itself divided into eden and survivor spaces) and a tenured (or old) generation (note: the permanent generation is going away in Java 8, replaced by the native-memory Metaspace).

We're not going to worry about eden versus survivor space in this post. What we're most interested in is the difference between young and tenured, and how things end up getting stored over time. So...

  • The young generation stores short-lived objects, and is garbage collected (or GC'd) very frequently. For an example of short-lived objects, think of all the objects you create during a request in a web application that are unnecessary after the request is complete. Young generally uses far less space than the tenured generation, because most applications don't create so many short-lived objects that you need a massive young generation.
  • As the garbage collector scans the objects in the young generation, if it sees objects that are still in use after checking several times, it will eventually move those objects over to the tenured generation.
  • As more data is added to your tenured generation, your heap usage will grow toward the maximum size you specified with the -Xmx setting on your JVM instance (see the sketch after this list). Once you've reached that point, the garbage collector will collect from the tenured generation when necessary.
  • Different garbage collectors take different approaches here, but at a high level, once something has to move out of the young generation and into tenured, if there isn't enough space in tenured, the garbage collector has to clear some. This means your application will suffer some pausing, and that pausing will increase (sometimes dramatically) in frequency, duration, or both with the size of your heap. The garbage collector has to rummage through the objects in the tenured generation, check whether they're still being used by your application (strongly reachable), and remove them if they're not. Sometimes it also has to shift the objects that remain (compacting vs non-compacting garbage collection) when free memory is fragmented into several small blocks but a single large object needs a contiguous one.
  • Generally, data you cached will live indefinitely in the tenured generation. I won't get into other cases here (though if you're interested, check out my posts on other reference types and off-heap storage), because mainstream options like Guava Cache and ehcache keep their entries strongly reachable on the heap.
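
To make the -Xmx ceiling concrete, here's a minimal sketch (the class name and the flag values in the comment are just examples) that asks the JVM for the limits the garbage collector is working within:

    // Launch with explicit sizing, e.g.: java -Xms512m -Xmx512m HeapInfo
    // (-Xmx caps the whole heap; the tenured generation lives within that cap.)
    public class HeapInfo {
        public static void main(String[] args) {
            Runtime rt = Runtime.getRuntime();
            long mb = 1024 * 1024;
            System.out.println("max heap (-Xmx):   " + (rt.maxMemory() / mb) + " MB");
            System.out.println("committed heap:    " + (rt.totalMemory() / mb) + " MB");
            System.out.println("free of committed: " + (rt.freeMemory() / mb) + " MB");
        }
    }

Everything you cache has to fit under that max heap figure, alongside everything else your application allocates.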

So there you have it: ultimately the garbage collector is going to have to manage what you cache, which isn't necessarily a bad thing, but can become one if you're not careful. If you're going to force the garbage collector to do more work, you need to get value out of keeping that data in memory.

so what should we cache?

To the best of our ability: things that are frequently used and/or expensive to load. The former is the difference between a high and a low cache hit ratio, and we want a high cache hit ratio if possible. To understand what to cache, though, we need to know which data is accessed frequently and which is most expensive to load. Here are some criteria I encourage people to apply to their data:

  • Is there some data that is explicitly more popular? For example: if we're a retail website and there's a set of products we know we're going to promote in a highly visible place, that data is going to be explicitly popular because we're directing people to it.
  • Is there some data that is implicitly more popular? For example: if we're a social media site, we probably have a bias towards new content rather than old due to the way we present it. In that case, we can prioritize newer content since we know that, at least for some period of time, it will be popular until newer content replaces it.
  • If you were to graph traffic against some unique identifier in your data, where does the tail start? Often the tail of the graph starts early in the data set, which should give you an indication of which data is very valuable to cache, and which data would often be a cache miss if you cached it.
  • Is your cache distributed or local? If it's local, any given entry will miss once on every host running that cache, multiplying your misses by the size of your fleet. If it's distributed, you may be able to cache more of the tail effectively.
  • Is the data expensive to load? If it is, you may want to consider reloading that data asynchronously once it expires rather than blocking on a reload. I've written more about blocking and non-blocking caches in another post.
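
One practical way to apply these criteria is to measure rather than guess. Here's a minimal sketch using Guava's CacheBuilder; the sizes, the expiry, and the loadProduct method are all hypothetical stand-ins for your own data:

    import java.util.concurrent.TimeUnit;
    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.CacheLoader;
    import com.google.common.cache.LoadingCache;

    public class ProductCacheStats {
        // Hypothetical expensive load, e.g. a database query.
        static String loadProduct(String id) {
            return "product:" + id;
        }

        public static void main(String[] args) throws Exception {
            LoadingCache<String, String> cache = CacheBuilder.newBuilder()
                    .maximumSize(100)                       // size for the head of your traffic, not the whole data set
                    .expireAfterWrite(10, TimeUnit.MINUTES) // an acceptable staleness for this data
                    .recordStats()                          // track hits and misses
                    .build(new CacheLoader<String, String>() {
                        @Override
                        public String load(String id) {
                            return loadProduct(id);
                        }
                    });

            cache.get("42"); // miss: triggers loadProduct
            cache.get("42"); // hit
            System.out.printf("hit rate: %.2f%n", cache.stats().hitRate());
        }
    }

If the hit rate stays low after the cache has warmed up, the data probably isn't popular enough to justify the memory it's occupying.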

Bear in mind the cost of loading data, and take preventative measures so you don't expose yourself to a DDoS, or Distributed Denial of Service, attack. Chances are you can't cache all of your data, and you should be careful not to put yourself in a position where cache misses and/or expensive load times can make your application unavailable if they happen frequently enough. This subject is beyond the scope of this post, though I may write about it at some point in the future.

why will data live indefinitely if cached?

Interestingly, I had a discussion with a colleague about this just a few weeks ago, and it revealed a massive misconception that he and probably other people have about caching.

Usually, caches don't eagerly expire data by default; they expire it lazily, depending on their caching algorithm. Here's an example: say we have a cache that can hold 10 items, with a max age of 30 minutes. If we load 10 items into the cache around the same time, and then check how many items are in the cache 40 minutes later without having accessed it at all in between, we'll still have 10 items in the cache. By default, with technologies like ehcache and Guava Cache, there's no thread running in the background checking for expired entries. Since the entries are stored in such a way that they're strongly reachable, they're not eligible for garbage collection and will stay in your tenured generation.
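
You can watch this lazy behavior happen. Here's a sketch of the 10-item scenario using Guava; the FakeTicker is a stand-in clock (injected via CacheBuilder's ticker method) so the demo can skip ahead 40 minutes instantly:

    import java.util.concurrent.TimeUnit;
    import com.google.common.base.Ticker;
    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;

    public class LazyExpiryDemo {
        // A controllable clock so we don't actually wait 40 minutes.
        static class FakeTicker extends Ticker {
            private long nanos = 0;
            @Override public long read() { return nanos; }
            void advance(long duration, TimeUnit unit) { nanos += unit.toNanos(duration); }
        }

        public static void main(String[] args) {
            FakeTicker ticker = new FakeTicker();
            Cache<Integer, String> cache = CacheBuilder.newBuilder()
                    .maximumSize(10)
                    .expireAfterWrite(30, TimeUnit.MINUTES)
                    .ticker(ticker)
                    .build();

            for (int i = 0; i < 10; i++) {
                cache.put(i, "value-" + i);
            }
            ticker.advance(40, TimeUnit.MINUTES);

            // No reads or writes since the entries expired, so no cleanup has happened:
            System.out.println("size before cleanUp: " + cache.size()); // 10
            cache.cleanUp(); // an explicit maintenance pass finally drops them
            System.out.println("size after cleanUp:  " + cache.size()); // 0
        }
    }

Until something triggers that maintenance pass, the expired entries stay strongly reachable, exactly as described above.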

You may think this isn't a big deal, but it can be in practice. Let's say someone didn't pay attention when they set up a cache and decided it could hold 10,000 entries. Maybe the data set in the backing storage was small at first and no one saw a problem. Let's say that, over time, the data set grows to 10,000 entries, and that those entries are gradually hit over the runtime of the application, but only 100 or so are hit with any reasonable frequency. In that case, you're going to shove the entire data set into memory over time, and you're never going to get that memory back, even if no one accesses that data again for as long as the application runs.

Guava lets you make old data eligible for garbage collection by running its cleanup routine from a thread you schedule yourself (I haven't found the same in ehcache; comment if you're aware of this feature). Still, you're probably better off sizing your cache more effectively for your traffic and data than you are eagerly evicting, depending on what problem you're trying to solve.
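
If you do want eager eviction, the arrangement looks something like the following sketch; the five-minute period is an arbitrary choice:

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;

    public class ScheduledCleanup {
        public static void main(String[] args) {
            final Cache<String, String> cache = CacheBuilder.newBuilder()
                    .maximumSize(10000)
                    .expireAfterWrite(30, TimeUnit.MINUTES)
                    .build();

            // Periodically force the maintenance pass that would otherwise only
            // happen lazily on reads and writes, so expired entries become
            // unreachable and thus eligible for garbage collection.
            // (In a real application you'd shut this executor down on exit.)
            ScheduledExecutorService cleaner = Executors.newSingleThreadScheduledExecutor();
            cleaner.scheduleAtFixedRate(new Runnable() {
                @Override
                public void run() {
                    cache.cleanUp();
                }
            }, 5, 5, TimeUnit.MINUTES);
        }
    }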

Bottom line: think about how much stuff you're tossing in memory, how much space it will take up, and how valuable it is to have it in memory.
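
If you want "how much space it will take up" to be more than a guess, Guava also supports weight-based sizing. Here's a sketch that caps a cache by approximate character data rather than entry count; the two-bytes-per-char weigher is a rough assumption that ignores object headers and references:

    import com.google.common.cache.Cache;
    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.Weigher;

    public class WeightedCache {
        public static void main(String[] args) {
            Cache<String, String> cache = CacheBuilder.newBuilder()
                    .maximumWeight(1024 * 1024) // ~1 MB of character data, not exact heap usage
                    .weigher(new Weigher<String, String>() {
                        @Override
                        public int weigh(String key, String value) {
                            // Rough estimate: 2 bytes per char.
                            return 2 * (key.length() + value.length());
                        }
                    })
                    .build();

            cache.put("greeting", "hello world");
            System.out.println(cache.size());
        }
    }

The weights are only estimates, but an estimate beats an unbounded cache.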

wrapping up for now

This post is less of a deep dive into caching and more of a contextual overview of its impact. I hope it helps clear up some confusion that I know I've run into multiple times. The goal is to identify why caching can go wrong over time, and why all of us need to put real thought into how we're using memory. The golden hammer principle often applies to caching, and if you're not careful, you can really hurt yourself swinging that hammer around a little too freely. :)