Sunday, July 28, 2013

hibernate, joins, and max results: a match made in hell

what are you ranting about now?

Hibernate's ability to handle joins when specifying a max number of results via Criteria can cause some serious pain. The real problem here is that nothing in the mappings will look wrong. In fact, in several cases the mappings will return the correct data, but with some serious consequences. This is a case where you should probably learn how the hot dogs are being made.

give me something to go off of

For this example, we're going to be using the following two classes, which represent a one to many relationship:

The tables those entities represent contain the following data, and will exist in an HSQLDB instance:

In most cases, we're going to try and get data out of these tables with the following Criteria:

Note that an Order is being added here to prevent any lack of determinism in result sets from obfuscating what's happening.

alright, so what happens?

Nothing good. Let's start off mapping the @OneToMany as such:

We're using eager loading with a JOIN fetch type, and we're defining that the relationship is mapped by the person field in Thing.

Now, let's run our Criteria and take a look at the SQL generated:

Looks like what would be expected. What does persons contain?

Hold on just a second. Why are there two instances of 'Ian' being returned?

Well, the limit applies to the entire result set, not just the parent entities. Since the join is matching the two children, those are the first two rows returned. As a result, not only are we just getting one person back, but we're getting two of the same instance in the list.

Typically, if you're using a JOIN fetch with a Criteria, there's a way to eliminate duplicate root entities:

Unfortunately, it doesn't work here. Since maxResults applies to the underlying result set and not the entities Hibernate is pulling in, adding this restriction actually just limited the list to be a single instance:

that was using join. what about subselect?

I've seen many instances where people switch their OneToMany associations to use SUBSELECT rather than JOIN because they see duplicate parent entities and don't know about the DISTINCT_ROOT_ENTITY transformer. That said, let's take a look at what SUBSELECT does. Here's the mapping:

Let's run it through the original Criteria setup again:

Alright, how's that output looking?

Great! What queries did it run to get the data this time?

WHAT?! Now we have four queries? DO NOT WANT!

Breaking down the SQL a bit, we can see that the first query only selects records from PERSON, while the second selects only records from THING. But...

It didn't just select the THING records for the PERSON records we wanted, it selected every record from THING. The series of ids that the IN clause is selecting against is totally unbounded; the limit statement isn't applied.

Still, that doesn't explain the third and fourth queries, which are indeed the same. The reason these execute is that the Thing instances loaded reference Person instances which don't exist in the session yet. Since they're mapped by the person field in Thing, Hibernate will select each missing Person with a separate query.

Imagine if we had 1000 records in PERSON, each having two corresponding records in THING. We would end up executing 1000 queries to get the data! Generally the enemy with Hibernate is n+1 selects; in this case you have m-n+2 selects, where n is the number of results and m is the total number of records in the parent table.

what if you use a join column instead of mapped by?

Very well, let's change the mapping again:

Side note: the name used in @JoinColumn appears to be case sensitive, at least in HSQLDB and Oracle. Using "person_id" instead of "PERSON_ID" will yield the same additional selects in the last example.

Now what queries ran?

If you think we're out of the woods here, I assure you we're not. Using @JoinColumn changed the relationship such that we're not issuing individual statements for parents, however, we're still fetching in everything. Hibernate still runs the same subquery for IN as before, except that we join to the parent table up front which avoids the additional individual selects:

It's important to realize that the entire contents of both tables are still being pulled into memory here. Again, if you're dealing with a huge number of records, you could easily run out of heap space.

what about using hql instead?

You can try and run this query with HQL. However, the result is not much different than before. Here's the code to do so:

Running it will generate the following SQL query:

The results returned are actually correct; the first two people and their associated things are in the list. Notice again that we have an unbounded query; the limit statement was never applied. Just before the SQL is logged, Hibernate logs a warning:

This means that Hibernate is fetching everything and then trying to apply the first/max result restrictions in memory. As you may imagine, this is also undesirable. This happens because of the join fetch used in the HQL query. Even more comforting is what the JPA spec has to say about this kind of interaction:

The effect of applying setMaxResults or setFirstResult to a query involving fetch joins over collections is undefined. (JPA "Enterprise JavaBeans 3.0, Final Release", Kapitel 3.6.1 Query Interface)

well, if subselect, join, and hql don't work, that just leaves select

Correct, however, FetchMode.SELECT by itself will cause an n+1 select problem.

Luckily there's a way to mitigate that problem.

Hibernate has another annotation called @BatchSize. Unfortunately, the JavaDoc for this annotation is simply "The batch size for SQL loading" and doesn't really explain what it's doing. What the annotation really does is generate a select for the association against multiple ids (up to the number specified) and automatically uses the ids of entities that are in the session that haven't had their association loaded yet. In other words, if you were to get 5 instances of Person, and you had your batch size set to 5, Hibernate will issue a single select for instances of Thing that matched the 5 ids of the Person instances in the session when the any association needed to be loaded. If you had 7 Person instances in the session with a batch size of 5, Hibernate would issue two selects: one for the first 5 and another for the other 2.

Here's the code for this in action (you can use FetchType.LAZY or FetchType.EAGER, they both work):

Here's the SQL generated:

And here are the results, and importantly, the ones we wanted without anything extra being loaded:

The example above does use lazy loading just to call out that it will work in this case since the other examples were eager. If you were to try lazy loading in any other capacity, you'd end up running into other issues:

  • Using FetchMode.SELECT will cause n+1 selects without @BatchSize (and would still do so if you had a batch size of 1)
  • Using FetchMode.SUBSELECT will cause the same unbounded query we saw when using SUBSELECT eagerly, and will generate m-n+2 selects if it uses @OneToMany's mappedBy attribute, where m is the total number of records in the parent table

ok, so the problem can be solved. why the rant?

The main reason I tried to outline so much in this post is that the other examples don't look wrong. Common sense would lead most to believe that the mappings were correct and that the limit would be applied as expected to the parent and children. You don't even see the issue unless you have show_sql turned on, your DBA yells at you for DDoSing the database with tiny individual selects, or your application runs out of memory due to loading huge result sets in memory. In all of those cases, you won't really notice the issue until your data gets a little larger, and by then you could be staring down a production outage depending on the scope of the problem.

I'd never fault someone for getting this wrong, because on paper the annotations seem so logical. There have been several bugs logged for this issue (HB-520, HHH-304 and HHH-2666), and even Gavin King himself says in one of the tickets:

Any limit that gets applied to a query should also get applied to the subselect of any subsequent collection fetches.

Using @BatchSize is syntactically trivial, but it does require some thought. You don't want to stick too high or too low of a number to the batch size. Too low and you'll generate many selects. Too high and you could end up loading more data than you needed, though this is probably more applicable to lazy loading than eager since eager is going to pull in the association for every parent in the session anyway. Having too high of a batch size could adversely affect database performance since the range of the total number of arguments provided to the query is going to be greater unless you have some unusual degree of uniformity in your data; i.e. a batch size of 100 will generate as many as 100 unique statements. As long as you have some notion of the number of entities you're fetching in certain areas where the number of results is bound by a max or a where clause, you should be able to pick a sensible number that will keep the total number of queries you run low.

Alternatively, you could create a separate entity to handle different association loading strategies. I've used this before to enforce a type safe contract that can control what mechanism you're using to load associations, as well as in a polymorphic capacity. If you had an area of your application where you knew you'd be using a max results limitation, you could have your persister logic look up entities using FetchMode.SELECT with @BatchSize, and use another FetchMode and/or FetchType on a different entity when you didn't need batching. Leveraging Hibernate's @MappedSuperclass annotation means you can leverage polymorphism for this use case nicely; declaring the getter for the association in the superclass and mapping the association at the field level in the subclass.

As far as my knowledge of Hibernate is concerned, using batching (or batching with separate entities) is the optimal case for dealing with joins combined with max result limitations. If anyone reading this knows of a better way, or can see a flaw in my logic or code above, please let me know in the comments!

Also, all the files used in this project can be found at https://github.com/theotherian/hibernate-gotchas. Feel free to check out the project and mess around with this yourself!

17 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. Excellent article and an enjoyable read! thnx

    ReplyDelete
    Replies
    1. Thanks for the feedback Bas! I hope the article helped you out :)

      Delete
  3. This comment has been removed by the author.

    ReplyDelete
  4. So I would simply like to summarize for HQL ::
    For MaxResults set on the query :
    1) Join will give the correct results , if there is a fetch in the query , but at the expense of memory consumption .[No effect of Lazy or Eager]
    FROM Person person inner join fetch person.thingList order by person.id asc

    2) A simple select will also give the correct results , but with the N+1 select problem .

    3) @BatchSize(size = ?) with a simple select will also give the correct results and also does not run into the N+1 problem .
    The only consideration is that the batch size should be selected carefully .

    ReplyDelete
  5. Hi Roshan,

    Thank you for replying to my post!

    I'm disappointed you removed your first comment, because you pointed out something I wanted to follow up on (though you still mention it in this comment). I hope you don't mind me reposting some of it from the email alert I received from the comment:

    Query query = session.createQuery("FROM Person person inner join person.things order by person.id asc");

    , it then runs a bounded query ::
    Hibernate:
    select
    person0_.id as id1_0_0_,
    thinglist1_.id as id1_1_1_,
    person0_.name as name2_0_0_,
    thinglist1_.name as name2_1_1_,
    thinglist1_.person_id as person_i3_1_1_
    from
    Person person0_
    inner join
    Thing thinglist1_
    on person0_.id=thinglist1_.person_id
    order by
    person0_.id asc limit ?

    If you do this, you would get the same issue that comes up in the first case: two instances of the parent. When you do that, you just get rows back that include parent data and Hibernate doesn't collapse them into a singular parent entity. In this case Hibernate can't reconcile the fact that you have two different entities in the result set rows, and will instead return you a List of Object[], where in this case the record at index 0 is a Person and the record at index 1 is a Thing.

    So, given that, I would argue that point 1 isn't actually true since you get back duplicate parents again, and also have to massage the data yourself from the results. Please correct me if I missed something or misinterpreted your HQL somehow. (I had to run "FROM Person person inner join person.things order by person.id asc" because thingList isn't a field in my pojos).

    I agree that #3 requires careful consideration. You can certainly hamper things by picking a number that doesn't make sense for your application.

    ReplyDelete
    Replies
    1. No , you did not misinterpret the query...
      But i saw one more thing.. I saw that the List returned was not of Person Type . It was a list of object array . And each array had at it's index[0] the Person object and index[1] the things object .
      and as you mentioned rightly , yes the parent was repeated .

      Initially I hadn't debugged the application and the only thing i saw was the limit thing applied .

      But here is the reason why i stumbled on to your blog . I was looking for something that can help me limit the number of associated records returned .
      So say I want to show articles and the comments on them , and there would be 100's of article and each article would have 100's of comments .And I just want to show 20 articles at once with 20 comments each .
      So when i set maxResults , i can limit the number of parent records [ i.e. 20 articles ] , but then by default for each of these articles all comments are returned , which i want to limit to 20 .

      And i was searching for something like this and stumbled here and It is a great article .

      Delete
    2. Ah, gotcha, I see how the double limit throws things off.

      I wonder if you could implement what you're looking for more effectively with a view in the database that would automatically restrict the association to bring in 20 child entities per parent.

      Just poking around online, it looks like other people have tried applying a limit within the beginning of a select in the form of a nested query: http://stackoverflow.com/questions/8583117/sql-query-join-limit-results-to-one-item-per-id

      That seems like it would be pretty ugly in Hibernate though, and you'd probably just have to use direct SQL since HQL wouldn't work (you couldn't apply both limits since HQL doesn't support the term; it only applies a maxResults to the entire result set). The stackoverflow post above shows the following SQL:

      SELECT p.*,
      (SELECT i.imageName
      FROM images
      WHERE i.productId = p.id
      ORDER BY i.date_created ASC
      LIMIT 1)
      AS imageName
      FROM products p

      Delete
    3. Also, thanks for the compliment :) Hopefully we can get this sorted out; it's an interesting problem!

      Delete
  6. Thank you for this post!! You have saved me hours of time debugging hibernate code this morning.

    ReplyDelete
  7. greate. It is exactly what I am looking for.... last month^^. Now I am already solve it !! I wish I could see this before..

    ReplyDelete
  8. Toyed with a couple approaches to tackle this issue -- found the one to make most sense to be: select a list of IDs of your parent entity using all of your criteria (WHERE etc.), then use that list of IDs as the only criteria in your new query that includes the JOIN on the collection

    ReplyDelete
  9. Thank you so much! I couldn't understand why setMaxResults was filtering out more records than I expected. Once I set the FetchMode and BatchSize in my model object, the query returned the results I was expecting.

    ReplyDelete
  10. So this blog captures my last week of development almost exactly. Wish I had found this at the start of last week.

    ReplyDelete
  11. I have few tables with huge data (~4TB/table), I have a query with joins on those tables, I would like to track and limit the "resultset" size in my report( without running count completely for table a left join table b, as it returns me rows close to 30000000000 )
    is there a way to achive this in java/hibernate/spring..?

    ReplyDelete
  12. Damn good blog, I stumbled into the same error on Grails and didn't know what hit me ^^ Thank you sir !

    ReplyDelete