Sunday, July 28, 2013

hibernate, joins, and max results: a match made in hell

what are you ranting about now?

Hibernate's ability to handle joins when specifying a max number of results via Criteria can cause some serious pain. The real problem here is that nothing in the mappings will look wrong. In fact, in several cases the mappings will return the correct data, but with some serious consequences. This is a case where you should probably learn how the hot dogs are being made.

give me something to go off of

For this example, we're going to be using the following two classes, which represent a one-to-many relationship:
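The classes were originally embedded as Gists; a minimal sketch consistent with the rest of the post (field and column names are assumptions) looks like this:

```java
import java.util.List;
import javax.persistence.*;

// Sketch only: in a real project each public class lives in its own file,
// and the @OneToMany mapping on Person is what changes throughout this post.
@Entity
@Table(name = "PERSON")
public class Person {
    @Id
    @Column(name = "PERSON_ID")
    private long id;

    private String name;

    @OneToMany(mappedBy = "person", fetch = FetchType.EAGER)
    private List<Thing> things;

    // getters and setters omitted
}

@Entity
@Table(name = "THING")
public class Thing {
    @Id
    @Column(name = "THING_ID")
    private long id;

    private String name;

    @ManyToOne
    @JoinColumn(name = "PERSON_ID")
    private Person person;

    // getters and setters omitted
}
```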

The tables those entities represent contain the following data, and will exist in an HSQLDB instance:

In most cases, we're going to try and get data out of these tables with the following Criteria:
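The Criteria was originally a Gist; a sketch of it (property names assumed, session setup omitted) would be:

```java
// Limit the result set to two records, ordered by id for determinism.
Criteria criteria = session.createCriteria(Person.class)
        .addOrder(Order.asc("id"))
        .setMaxResults(2);

List<Person> persons = criteria.list();
```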

Note that an Order is being added here so that any nondeterminism in result sets doesn't obscure what's happening.

alright, so what happens?

Nothing good. Let's start off mapping the @OneToMany as such:
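The mapping was embedded as a Gist; it would have looked something like this sketch, using Hibernate's @Fetch annotation on the Person side:

```java
@OneToMany(mappedBy = "person", fetch = FetchType.EAGER)
@Fetch(FetchMode.JOIN)
private List<Thing> things;
```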

We're using eager loading with a JOIN fetch type, and we're defining that the relationship is mapped by the person field in Thing.

Now, let's run our Criteria and take a look at the SQL generated:

Looks like what would be expected. What does persons contain?

Hold on just a second. Why are there two instances of 'Ian' being returned?

Well, the limit applies to the entire result set, not just the parent entities. Since the join is matching the two children, those are the first two rows returned. As a result, not only are we just getting one person back, but we're getting two of the same instance in the list.

Typically, if you're using a JOIN fetch with a Criteria, there's a way to eliminate duplicate root entities:
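That mechanism is Hibernate's distinct-root-entity result transformer, applied to the Criteria like so:

```java
// Collapse duplicate Person roots caused by the join fetch.
criteria.setResultTransformer(Criteria.DISTINCT_ROOT_ENTITY);
```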

Unfortunately, it doesn't work here. Since maxResults applies to the underlying result set and not the entities Hibernate is pulling in, adding this restriction actually just limited the list to be a single instance:

that was using join. what about subselect?

I've seen many instances where people switch their OneToMany associations to use SUBSELECT rather than JOIN because they see duplicate parent entities and don't know about the DISTINCT_ROOT_ENTITY transformer. That said, let's take a look at what SUBSELECT does. Here's the mapping:
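The Gist showed the mapping; a sketch of it, with only the fetch mode changed from the JOIN example:

```java
@OneToMany(mappedBy = "person", fetch = FetchType.EAGER)
@Fetch(FetchMode.SUBSELECT)
private List<Thing> things;
```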

Let's run it through the original Criteria setup again:

Alright, how's that output looking?

Great! What queries did it run to get the data this time?

WHAT?! Now we have four queries? DO NOT WANT!

Breaking down the SQL a bit, we can see that the first query only selects records from PERSON, while the second selects only records from THING. But...

It didn't just select the THING records for the PERSON records we wanted, it selected every record from THING. The series of ids that the IN clause is selecting against is totally unbounded; the limit statement isn't applied.

Still, that doesn't explain the third and fourth queries, which are indeed the same. The reason these execute is that the Thing instances loaded reference Person instances which don't exist in the session yet. Since they're mapped by the person field in Thing, Hibernate will select each missing Person with a separate query.

Imagine if we had 1000 records in PERSON, each having two corresponding records in THING. We would end up executing 1000 queries to get the data! Generally the enemy with Hibernate is n+1 selects; in this case you have m-n+2 selects, where n is the number of results and m is the total number of records in the parent table.

what if you use a join column instead of mapped by?

Very well, let's change the mapping again:
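A sketch of the changed mapping (still SUBSELECT, but with the join column declared on the parent side instead of mappedBy):

```java
// No mappedBy here; the join column is declared on the parent side instead.
@OneToMany(fetch = FetchType.EAGER)
@Fetch(FetchMode.SUBSELECT)
@JoinColumn(name = "PERSON_ID")
private List<Thing> things;
```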

Side note: the name used in @JoinColumn appears to be case-sensitive, at least in HSQLDB and Oracle. Using "person_id" instead of "PERSON_ID" will yield the same additional selects as in the last example.

Now what queries ran?

If you think we're out of the woods here, I assure you we're not. Using @JoinColumn changed the relationship such that we're not issuing individual statements for parents; however, we're still fetching everything. Hibernate still runs the same subquery for IN as before, except that we join to the parent table up front, which avoids the additional individual selects:

It's important to realize that the entire contents of both tables are still being pulled into memory here. Again, if you're dealing with a huge number of records, you could easily run out of heap space.

what about using hql instead?

You can try and run this query with HQL. However, the result is not much different than before. Here's the code to do so:
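The HQL version was a Gist; a sketch of it (the query string is an assumption matching the mappings above):

```java
@SuppressWarnings("unchecked")
List<Person> persons = session
        .createQuery("select p from Person p join fetch p.things order by p.id")
        .setMaxResults(2)
        .list();
```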

Running it will generate the following SQL query:

The results returned are actually correct; the first two people and their associated things are in the list. Notice again that we have an unbounded query; the limit statement was never applied. Just before the SQL is logged, Hibernate logs a warning:
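The warning in question (wording varies slightly by Hibernate version) reads along the lines of:

```
WARN: firstResult/maxResults specified with collection fetch; applying in memory!
```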

This means that Hibernate is fetching everything and then trying to apply the first/max result restrictions in memory. As you may imagine, this is also undesirable. This happens because of the join fetch used in the HQL query. Even more comforting is what the JPA spec has to say about this kind of interaction:

The effect of applying setMaxResults or setFirstResult to a query involving fetch joins over collections is undefined. (JPA "Enterprise JavaBeans 3.0, Final Release", Chapter 3.6.1 Query Interface)

well, if subselect, join, and hql don't work, that just leaves select

Correct, however, FetchMode.SELECT by itself will cause an n+1 select problem.

Luckily there's a way to mitigate that problem.

Hibernate has another annotation called @BatchSize. Unfortunately, the JavaDoc for this annotation is simply "The batch size for SQL loading" and doesn't really explain what it's doing. What the annotation really does is generate a select for the association against multiple ids (up to the number specified), automatically using the ids of entities in the session that haven't had their association loaded yet. In other words, if you were to get 5 instances of Person and had your batch size set to 5, Hibernate would issue a single select for instances of Thing matching the 5 ids of the Person instances in the session when any association needed to be loaded. If you had 7 Person instances in the session with a batch size of 5, Hibernate would issue two selects: one for the first 5 and another for the other 2.
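To make the batching arithmetic concrete, here's a plain-Java sketch (not Hibernate code, just an illustration of the grouping) of how pending ids get split into batches:

```java
import java.util.ArrayList;
import java.util.List;

class BatchMath {
    // Split a list of entity ids into groups no larger than batchSize,
    // mirroring how @BatchSize groups unloaded ids into IN clauses:
    // 7 ids with a batch size of 5 yields one batch of 5 and one of 2.
    static <T> List<List<T>> partition(List<T> ids, int batchSize) {
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += batchSize) {
            batches.add(ids.subList(i, Math.min(i + batchSize, ids.size())));
        }
        return batches;
    }
}
```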

Here's the code for this in action (you can use FetchType.LAZY or FetchType.EAGER, they both work):
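The original code is in the linked repo; a sketch of the mapping (the batch size of 2 here is an assumption chosen to match the max results of 2):

```java
@OneToMany(mappedBy = "person", fetch = FetchType.LAZY)
@Fetch(FetchMode.SELECT)
@BatchSize(size = 2)
private List<Thing> things;
```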

Here's the SQL generated:

And here are the results, and importantly, the ones we wanted without anything extra being loaded:

The example above uses lazy loading just to call out that it works in this case, since the other examples were eager. If you were to try lazy loading in any other capacity, you'd run into other issues:

  • Using FetchMode.SELECT will cause n+1 selects without @BatchSize (and would still do so if you had a batch size of 1)
  • Using FetchMode.SUBSELECT will cause the same unbounded query we saw when using SUBSELECT eagerly, and will generate m-n+2 selects if it uses @OneToMany's mappedBy attribute, where m is the total number of records in the parent table

ok, so the problem can be solved. why the rant?

The main reason I tried to outline so much in this post is that the other examples don't look wrong. Common sense would lead most to believe that the mappings were correct and that the limit would be applied as expected to the parent and children. You don't even see the issue unless you have show_sql turned on, your DBA yells at you for DDoSing the database with tiny individual selects, or your application runs out of memory from loading huge result sets. In all of those cases, you won't really notice the issue until your data gets a little larger, and by then you could be staring down a production outage depending on the scope of the problem.

I'd never fault someone for getting this wrong, because on paper the annotations seem so logical. There have been several bugs logged for this issue (HB-520, HHH-304 and HHH-2666), and even Gavin King himself says in one of the tickets:

Any limit that gets applied to a query should also get applied to the subselect of any subsequent collection fetches.

Using @BatchSize is syntactically trivial, but it does require some thought. You don't want to set the batch size too high or too low. Too low and you'll generate many selects. Too high and you could end up loading more data than you needed, though this is probably more applicable to lazy loading than eager, since eager is going to pull in the association for every parent in the session anyway. Having too high of a batch size could also adversely affect database performance, since the range of argument counts passed to the query is going to be greater unless you have some unusual degree of uniformity in your data; i.e. a batch size of 100 will generate as many as 100 unique statements. As long as you have some notion of the number of entities you're fetching in areas where the number of results is bounded by a max or a where clause, you should be able to pick a sensible number that keeps the total number of queries low.

Alternatively, you could create a separate entity to handle different association loading strategies. I've used this before to enforce a type-safe contract that controls what mechanism you're using to load associations, as well as in a polymorphic capacity. If you had an area of your application where you knew you'd be using a max results limitation, you could have your persister logic look up entities using FetchMode.SELECT with @BatchSize, and use another FetchMode and/or FetchType on a different entity when you didn't need batching. Using the @MappedSuperclass annotation lets you leverage polymorphism nicely for this use case: declare the getter for the association in the superclass and map the association at the field level in each subclass.

As far as my knowledge of Hibernate is concerned, using batching (or batching with separate entities) is the optimal case for dealing with joins combined with max result limitations. If anyone reading this knows of a better way, or can see a flaw in my logic or code above, please let me know in the comments!

Also, all the files used in this project can be found at https://github.com/theotherian/hibernate-gotchas. Feel free to check out the project and mess around with this yourself!

Saturday, July 27, 2013

creating an in-memory hsqldb instance for using hibernate in a maven build

a simple setup for testing

Lately I've been working on some Hibernate examples I plan on sharing soon, and I wanted to create a simple in memory instance of HSQLDB to test against. Trying to drill down to the bare necessities of what I needed proved to be a little scattered across several resources, so I thought I'd aggregate everything in one place.

This article is assuming a few things:

  • You're using a Maven build
  • You're using Hibernate
  • You don't care about persisting to disk or database state

creating the server

Setting up the server is mostly straightforward, but I do have a subtle change to make the Maven aspect easier:
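A sketch of that setup (class and method names per HSQLDB 2.x; in older versions the Server class lives at org.hsqldb.Server):

```java
import org.hsqldb.server.Server;

public class DatabaseServer {
    public static Server startServer() {
        Server server = new Server();
        server.setDatabaseName(0, "test");
        // "mem:" keeps the database entirely in memory; the "target/" prefix
        // pushes HSQLDB's log/script/properties files into Maven's build dir.
        server.setDatabasePath(0, "target/mem:test");
        server.setSilent(true);
        server.start();
        return server;
    }
}
```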

Note that the location is set to target/mem:test. This tells HSQLDB to create an in-memory only database, which will never persist changes to disk. That said, HSQLDB still writes out a few files (a log, script, and properties file) which use the name of the database. Prefixing the name with target/ will write to Maven's default build directory so that your workspace doesn't have a bunch of log data in it that you'll have to ignore from version control.

configuring hibernate

Now that we can fire up the database, we can connect to it via Hibernate. The config file below will connect, but also has a subtlety similar to what was used above:
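A minimal hibernate.cfg.xml along these lines would do it (property values here are assumptions, apart from the connection string, which mirrors the one described in this post):

```xml
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE hibernate-configuration PUBLIC
        "-//Hibernate/Hibernate Configuration DTD 3.0//EN"
        "http://www.hibernate.org/dtd/hibernate-configuration-3.0.dtd">
<hibernate-configuration>
    <session-factory>
        <property name="connection.driver_class">org.hsqldb.jdbc.JDBCDriver</property>
        <property name="connection.url">jdbc:hsqldb:target/hsql:mem:test</property>
        <property name="connection.username">sa</property>
        <property name="connection.password"></property>
        <property name="dialect">org.hibernate.dialect.HSQLDialect</property>
        <property name="hbm2ddl.auto">create</property>
        <property name="show_sql">true</property>
    </session-factory>
</hibernate-configuration>
```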

Here's where things get a little screwy, and there's a decent chance it's because I don't understand certain aspects of HSQLDB. In this case, the connection string is target/hsql:mem:test. As far as I can tell, both the HSQLDB server and driver write to the same set of files, prefixed by the connection string or database name. When you start up the database and connection pool, you end up with files in target with the names of mem:test.* and hsql:mem:test.*. Since both of these sets of files end up being written to, I figure it's not a bad thing that they're named differently. I do find it a little odd that the client logs data like this, but at least at the moment I don't care enough to see if this can be adjusted. If it can be I'll update this post to reflect how to do that.

wrapping up...

As stated above, I really wanted a sandbox to demonstrate a few Hibernate interactions with, so I really wanted to keep the setup as barebones as possible. There are probably certain things I could or should have configured differently, but the above will work to get you going. I never really use HSQLDB, so if you're reading this article and have some input on a better way to do this or any corrections that should be made, please leave your feedback in the comments :)

Friday, July 12, 2013

how to name your github gists

yes, this is a bit of a hack

I started using Gists to encapsulate all of my code, command line, and log related content rather than using <pre> tags or some styling tool to format code inline. I found that using Gists cleaned up the source of my blog considerably, and made it easier to edit content in general; converting to an HTML-escaped version and then having to make changes afterwards was a pain at times.

That said, I ran into a different pain, which was Gist's bizarre way of handling organization of data sets.

description? check. name? well...

Gists allow you to set a description for a set of files, which can help you identify what's in there from a list of all your Gists. However, it also sorts them by the alphabetically first file name in the set. If you had, for example, a bunch of Maven examples, each with a file called 'pom.xml', you'd have a hard time navigating your list of Gists. It also shows the contents of that file in the list, further confusing you.

However, that identification system can be used to your advantage if you're clever. If you create another file in your Gist which is actually a name for the set, but precede that name with a single space, it will be sorted to the top alphabetically. You can't have empty files though, so my advice would be to actually add your description there instead. As an example of how this looks, take a look at my Gists and how they're labeled.

how can that file be excluded when using gists remotely?

Gists offer a script tag based solution for embedding a Gist elsewhere. For example:

<script src="https://gist.github.com/theotherian/5954592.js"></script>

would embed all the files contained in the Gist at https://gist.github.com/theotherian/5954592.js

But what if you just want to embed an individual file, in this case the togglewifi.applescript file? In this case you can append ?file=filename to the URL:

<script src="https://gist.github.com/theotherian/5954592.js?file=togglewifi.applescript"></script>

The end result is this:

ditching blogger's dynamic views

not all they're cracked up to be

After starting this blog with the new Dynamic Views blogger offers, I've decided to go back to the classic templates.

Oddly enough, the sobering reality that they weren't for me happened when I tried to direct a colleague towards my blog for an answer to a question. Trying to explain where to click and how to navigate was surprisingly non-intuitive, whereas the classic template is sort of the model most blogs follow. They're straightforward, easy to navigate, and lightweight.

Another reason for reverting was Github Gists. While you can shoehorn them into blogs using Dynamic Views using JavaScript others have written to consume specially crafted code tags, I'd rather just use the direct links Github provides.

The other views that Dynamic Views offer didn't really do much for me or the blog either. Most of the views are largely inapplicable to a technical blog, and just add a layer of complication that's unnecessary. While I could see value in using them for other non-technical blogs, I think they're a poor fit for blogs heavy on text rather than images.

Tuesday, July 2, 2013

creating resource filters with jersey and jaxrs 2.0

in what situations would I use this feature?

Let's say you have a series of preconditions for certain resources in a web application. For the sake of this example, let's assume we're using Jersey as the controller for a web application rather than just a purely RESTful web service. How can we solve this requirement?

For some simple cases it's not unheard of to see an inheritance model leveraged for this, where resources extend other resources that provide methods for certain preconditions. If single inheritance started to make this unwieldy, a composition or functional approach could be used instead, but you may end up inundating your resource with additional logic. In certain cases, it may be preferable to define this precondition with metadata, perhaps with an annotation.

In cases like this, you can leverage Jersey's dynamic binding of resource filters along with your own annotations to both define conditions as metadata and decouple resources from other business logic in your code base. Below is source code to demonstrate this.

the annotation
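A sketch of the annotation (the name comes from later in the post):

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// No fields; retained at runtime so the dynamic binding can see it,
// and targeted at methods so individual resource methods can opt in.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface IE6NotSupported {
}
```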

Pretty straightforward: no fields, just retained at runtime and able to annotate methods.

the filter
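A sketch of such a filter (the exact user-agent check and error entity are assumptions):

```java
import javax.ws.rs.container.ContainerRequestContext;
import javax.ws.rs.container.ContainerRequestFilter;
import javax.ws.rs.core.HttpHeaders;
import javax.ws.rs.core.Response;

public class IE6NotSupportedFilter implements ContainerRequestFilter {
    @Override
    public void filter(ContainerRequestContext requestContext) {
        String userAgent = requestContext.getHeaderString(HttpHeaders.USER_AGENT);
        if (userAgent != null && userAgent.contains("MSIE 6.0")) {
            // Aborting the request is new in jaxrs 2.0: short-circuit here
            // with a 412 and an error page as the entity.
            requestContext.abortWith(Response
                    .status(Response.Status.PRECONDITION_FAILED)
                    .entity("<html><body>IE6 is not supported.</body></html>")
                    .build());
        }
    }
}
```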

This is a little more interesting. Here we check the user agent to see if it's IE6, and if it is, we abort the request directly in the filter. Aborting the request is a new feature in jaxrs 2.0. In this case, we're sending a 412 response code (PRECONDITION_FAILED), and passing back an error page as the entity via the response.

As you might have guessed, since we're showing a ContainerRequestFilter here, there's also a ContainerResponseFilter that can also be used with resources. The response counterpart is passed both the ContainerRequestContext and a ContainerResponseContext. A typical example of a use for response filters is attaching supplemental headers to a response.

Now that we have our filter, we need a way of binding it to any resources annotated with @IE6NotSupported.

the binding
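A sketch of the DynamicFeature implementation (the class name ResourceFilterBindingFeature is the one mapped in the bootstrap section; the filter class name is an assumption):

```java
import javax.ws.rs.container.DynamicFeature;
import javax.ws.rs.container.ResourceInfo;
import javax.ws.rs.core.FeatureContext;

public class ResourceFilterBindingFeature implements DynamicFeature {
    @Override
    public void configure(ResourceInfo resourceInfo, FeatureContext context) {
        // Invoked once per resource method at startup; bind the filter
        // only to methods carrying the annotation.
        if (resourceInfo.getResourceMethod().isAnnotationPresent(IE6NotSupported.class)) {
            context.register(IE6NotSupportedFilter.class);
        }
    }
}
```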

Now we're getting somewhere. Here we have an implementation of a DynamicFeature, another class that's new to jaxrs 2.0. For those unfamiliar with this class, here's a quick snippet of the javadoc:

/**
 * A JAX-RS meta-provider for dynamic registration of post-matching providers
 * during a JAX-RS application setup at deployment time.
 *
 * Dynamic feature is used by JAX-RS runtime to register providers that shall be applied
 * to a particular resource class and method and overrides any annotation-based binding
 * definitions defined on any registered resource filter or interceptor instance.
 */

This is invoked for every method of every resource in your application when the application starts up. In the example above, we're specifically checking whether the method is annotated with @IE6NotSupported. We could also change our annotation to support targets of TYPE as well as METHOD, and call resourceInfo.getResourceClass() and perform the same check.

the bootstrap

Now that we have an annotation, a filter, and a way to link the filter to a resource, we need to tell our application to invoke this upon startup.
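A sketch of the bootstrap as a JAX-RS Application subclass (the application class name and path are assumptions):

```java
import java.util.HashSet;
import java.util.Set;
import javax.ws.rs.ApplicationPath;
import javax.ws.rs.core.Application;

@ApplicationPath("/")
public class ExampleApplication extends Application {
    @Override
    public Set<Class<?>> getClasses() {
        Set<Class<?>> classes = new HashSet<Class<?>>();
        // Registering the feature is what wires up the dynamic binding.
        classes.add(ResourceFilterBindingFeature.class);
        return classes;
    }
}
```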

Typically you'd have many classes mapped here for all the resources and providers in your application, but for the sake of the example we're just mapping the ResourceFilterBindingFeature class.

summary

I hope this helps introduce you to the world of Jersey and jaxrs 2.0 filters. Personally, I find them hugely beneficial for mapping out both pre and post conditions that should be honored for resources without coupling them to the resource itself. I'm a big fan of annotations, and for things like resources I think there are several cases where a precondition can be expressed via metadata rather than in the resource itself.