what are you ranting about now?
Hibernate's ability to handle joins when specifying a max number of results via
Criteria can cause some serious pain. The real problem here is that nothing in the mappings will look wrong. In fact, in several cases the mappings will return the correct data, but with some serious consequences. This is a case where you should probably learn how the hot dogs are being made.
give me something to go off of
For this example, we're going to be using the following two classes, which represent a one to many relationship:
The tables those entities represent contain the following data, and will exist in an HSQLDB instance:
In most cases, we're going to try and get data out of these tables with the following
Note that an
Order is being added here to prevent any lack of determinism in result sets from obfuscating what's happening.
alright, so what happens?
Nothing good. Let's start off mapping the
@OneToMany as such:
We're using eager loading with a JOIN fetch type, and we're defining that the relationship is mapped by the
person field in
Now, let's run our
Criteria and take a look at the SQL generated:
Looks like what would be expected. What does
Hold on just a second. Why are there two instances of 'Ian' being returned?
limit applies to the entire result set, not just the parent entities. Since the join is matching the two children, those are the first two rows returned. As a result, not only are we just getting one person back, but we're getting two of the same instance in the list.
Typically, if you're using a
JOIN fetch with a
Criteria, there's a way to eliminate duplicate root entities:
Unfortunately, it doesn't work here. Since
maxResults applies to the underlying result set and not the entities Hibernate is pulling in, adding this restriction actually just limited the list to be a single instance:
that was using join. what about subselect?
I've seen many instances where people switch their
OneToMany associations to use
SUBSELECT rather than
JOIN because they see duplicate parent entities and don't know about the
DISTINCT_ROOT_ENTITY transformer. That said, let's take a look at what
SUBSELECT does. Here's the mapping:
Let's run it through the original
Criteria setup again:
Alright, how's that output looking?
Great! What queries did it run to get the data this time?
WHAT?! Now we have four queries? DO NOT WANT!
Breaking down the SQL a bit, we can see that the first query only selects records from
PERSON, while the second selects only records from
It didn't just select the
THING records for the
PERSON records we wanted, it selected every record from
THING. The series of ids that the
IN clause is selecting against is totally unbounded; the limit statement isn't applied.
Still, that doesn't explain the third and fourth queries, which are indeed the same. The reason these execute is that the
Thing instances loaded reference
Person instances which don't exist in the session yet. Since they're mapped by the
person field in
Thing, Hibernate will select each missing
Person with a separate query.
Imagine if we had 1000 records in
PERSON, each having two corresponding records in
THING. We would end up executing 1000 queries to get the data! Generally the enemy with Hibernate is
n+1 selects; in this case you have
m-n+2 selects, where
n is the number of results and
m is the total number of records in the parent table.
what if you use a join column instead of mapped by?
Very well, let's change the mapping again:
Side note: the
name used in
@JoinColumn appears to be case sensitive, at least in HSQLDB and Oracle. Using
"person_id" instead of
"PERSON_ID" will yield the same additional selects in the last example.
Now what queries ran?
If you think we're out of the woods here, I assure you we're not. Using
@JoinColumn changed the relationship such that we're not issuing individual statements for parents, however, we're still fetching in everything. Hibernate still runs the same subquery for
IN as before, except that we join to the parent table up front which avoids the additional individual selects:
It's important to realize that the entire contents of both tables are still being pulled into memory here. Again, if you're dealing with a huge number of records, you could easily run out of heap space.
what about using hql instead?
You can try and run this query with HQL. However, the result is not much different than before. Here's the code to do so:
Running it will generate the following SQL query:
The results returned are actually correct; the first two people and their associated things are in the list. Notice again that we have an unbounded query; the limit statement was never applied. Just before the SQL is logged, Hibernate logs a warning:
This means that Hibernate is fetching everything and then trying to apply the first/max result restrictions in memory. As you may imagine, this is also undesirable. This happens because of the
join fetch used in the HQL query. Even more comforting is what the JPA spec has to say about this kind of interaction:
The effect of applying setMaxResults or setFirstResult to a query involving fetch joins over collections is undefined. (JPA "Enterprise JavaBeans 3.0, Final Release", Kapitel 3.6.1 Query Interface)
well, if subselect, join, and hql don't work, that just leaves select
FetchMode.SELECT by itself will cause an
n+1 select problem.
Luckily there's a way to mitigate that problem.
Hibernate has another annotation called
@BatchSize. Unfortunately, the JavaDoc for this annotation is simply "
The batch size for SQL loading" and doesn't really explain what it's doing. What the annotation really does is generate a select for the association against multiple ids (up to the number specified) and automatically uses the ids of entities that are in the session that haven't had their association loaded yet. In other words, if you were to get 5 instances of
Person, and you had your batch size set to 5, Hibernate will issue a single select for instances of
Thing that matched the 5 ids of the
Person instances in the session when the any association needed to be loaded. If you had 7
Person instances in the session with a batch size of 5, Hibernate would issue two selects: one for the first 5 and another for the other 2.
Here's the code for this in action (you can use
FetchType.EAGER, they both work):
Here's the SQL generated:
And here are the results, and importantly, the ones we wanted without anything extra being loaded:
The example above does use lazy loading just to call out that it will work in this case since the other examples were eager. If you were to try lazy loading in any other capacity, you'd end up running into other issues:
@BatchSize(and would still do so if you had a batch size of 1)
FetchMode.SUBSELECTwill cause the same unbounded query we saw when using
SUBSELECTeagerly, and will generate
m-n+2selects if it uses
mis the total number of records in the parent table
ok, so the problem can be solved. why the rant?
The main reason I tried to outline so much in this post is that the other examples don't look wrong. Common sense would lead most to believe that the mappings were correct and that the limit would be applied as expected to the parent and children. You don't even see the issue unless you have
show_sql turned on, your DBA yells at you for DDoSing the database with tiny individual selects, or your application runs out of memory due to loading huge result sets in memory. In all of those cases, you won't really notice the issue until your data gets a little larger, and by then you could be staring down a production outage depending on the scope of the problem.
I'd never fault someone for getting this wrong, because on paper the annotations seem so logical. There have been several bugs logged for this issue (HB-520, HHH-304 and HHH-2666), and even Gavin King himself says in one of the tickets:
Any limit that gets applied to a query should also get applied to the subselect of any subsequent collection fetches.
@BatchSize is syntactically trivial, but it does require some thought. You don't want to stick too high or too low of a number to the batch size. Too low and you'll generate many selects. Too high and you could end up loading more data than you needed, though this is probably more applicable to lazy loading than eager since eager is going to pull in the association for every parent in the session anyway. Having too high of a batch size could adversely affect database performance since the range of the total number of arguments provided to the query is going to be greater unless you have some unusual degree of uniformity in your data; i.e. a batch size of 100 will generate as many as 100 unique statements. As long as you have some notion of the number of entities you're fetching in certain areas where the number of results is bound by a max or a where clause, you should be able to pick a sensible number that will keep the total number of queries you run low.
Alternatively, you could create a separate entity to handle different association loading strategies. I've used this before to enforce a type safe contract that can control what mechanism you're using to load associations, as well as in a polymorphic capacity. If you had an area of your application where you knew you'd be using a max results limitation, you could have your persister logic look up entities using
@BatchSize, and use another
FetchType on a different entity when you didn't need batching. Leveraging Hibernate's
@MappedSuperclass annotation means you can leverage polymorphism for this use case nicely; declaring the getter for the association in the superclass and mapping the association at the field level in the subclass.
As far as my knowledge of Hibernate is concerned, using batching (or batching with separate entities) is the optimal case for dealing with joins combined with max result limitations. If anyone reading this knows of a better way, or can see a flaw in my logic or code above, please let me know in the comments!
Also, all the files used in this project can be found at https://github.com/theotherian/hibernate-gotchas. Feel free to check out the project and mess around with this yourself!