sometimes I just go where the search results take me
I was doing some research on SolrCloud tonight, and wound up learning about enough disparate things that I figured I'd put together a quick page summarizing what I'd read along with some links. If nothing else, this post is going to end up being notes for my own memory, but if it somehow helps someone else along the way, all the better.
so, what did you learn?
regarding data consistency and the CAP theorem
SolrCloud (or distributed Solr) claims to use a CP model for data, which surprised me. CP means consistent and partition tolerant, and refers to the CAP theorem; if you aren't familiar with it, you should read about it. The more I read about this, though, the more I disagree that "CP" is correct, unless my understanding of CAP is flawed.
According to this, SolrCloud "is a CP system - in the face of partitions, we favor consistency over availability." This discussion makes things a little clearer, noting that SolrCloud "favors consistency over availability (mostly concerning writes)."
To expand on what this means, you need at least a high-level understanding of Solr's sharding capabilities, which is about all I have at this point. When you shard, each shard has a leader for a certain set of documents as well as replicas. When you update a document, Solr routes the request to the leader and then propagates the change to the replicas. If you read data from replicas as well as the leader, then you're actually using an eventual consistency model: a request that hits a replica (a real time get, for example) can return a stale document compared to what the leader has if the leader hasn't finished distributing an update to the replicas.
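To make that stale read concrete for myself, here's a toy model in Python. To be clear, this is not how Solr is implemented; the Node and Shard classes and the pending queue are names I made up purely to illustrate the "leader first, replicas later" behavior described above.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One copy of a shard: maps document id -> document body."""
    docs: dict = field(default_factory=dict)


@dataclass
class Shard:
    """Toy shard with one leader and some replicas (hypothetical, not Solr's classes)."""
    leader: Node = field(default_factory=Node)
    replicas: list = field(default_factory=list)
    pending: list = field(default_factory=list)  # updates the replicas haven't seen yet

    def update(self, doc_id, body):
        # Writes always land on the leader first.
        self.leader.docs[doc_id] = body
        self.pending.append((doc_id, body))

    def propagate(self):
        # Later, the leader forwards the queued updates to every replica.
        for doc_id, body in self.pending:
            for replica in self.replicas:
                replica.docs[doc_id] = body
        self.pending.clear()


shard = Shard(replicas=[Node()])
shard.update("doc1", "v1")
shard.propagate()
shard.update("doc1", "v2")                    # leader has v2, replica still has v1

stale_read = shard.replicas[0].docs["doc1"]   # read from the replica before propagation
leader_read = shard.leader.docs["doc1"]       # read from the leader
shard.propagate()
fresh_read = shard.replicas[0].docs["doc1"]   # replica has caught up
```

The window between update and propagate is the eventual consistency window; the shorter it is, the less likely a reader is to notice.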
The "A" is missing in this equation because it's possible that update requests will be rejected under certain conditions. SolrCloud uses ZooKeeper to elect a leader, and ZooKeeper will not allow a split brain condition to happen if part of the cluster goes down. If ZooKeeper doesn't agree on a leader due to a partition of the cluster and a potential split brain condition, update requests will be rejected, i.e. availability is sacrificed in favor of remaining consistent and being partition tolerant. However, availability is still maintained for read operations; the cluster will not reject those requests unless you've partitioned in such a way that there's no shard or replica corresponding to a particular document.
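The rule underneath this is a majority quorum: ZooKeeper only makes progress while a strict majority of its ensemble can talk to each other. Here's a minimal sketch of that rule, plus my own simplified reading of how it maps to accepting or rejecting requests. The function names are hypothetical, and the real decision inside SolrCloud involves more than this.

```python
def has_quorum(reachable: int, ensemble_size: int) -> bool:
    """True if a strict majority of the ensemble is reachable."""
    return reachable > ensemble_size // 2


def accept_update(reachable: int, ensemble_size: int) -> bool:
    # Simplified assumption: without a quorum no leader can be agreed on,
    # so writes are rejected (consistency is favored over availability).
    return has_quorum(reachable, ensemble_size)


def accept_read(some_copy_reachable: bool) -> bool:
    # Simplified assumption: reads only need some shard or replica
    # holding the document to be reachable.
    return some_copy_reachable


# A 5-node ensemble split 3/2: only the majority side keeps taking writes.
majority_side = accept_update(3, 5)
minority_side = accept_update(2, 5)
```

Note that an even split (2 of 4, say) leaves neither side with a quorum, which is one reason odd ensemble sizes are the usual recommendation.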
To wrap things up, I found the assertion of a CP model surprising when SolrCloud uses the same eventual consistency model as AP data stores such as CouchDB. To Solr's credit, changes should be distributed to replicas extremely fast and soft commits happen within seconds, so the eventual consistency window is quite small and the odds that it will create a problem are low.
soft commits, hard commits, real time gets and the transaction log
This is merely a terse summary of the documentation around real time gets and near real time searching, but since it falls under the "things I learned and may likely forget tomorrow morning" umbrella I'm writing about it.
First, it's important to call out that updating a document in Solr doesn't automatically make the change visible to searches. As of Solr 4, you can fetch the fresh version of a document right after it's been updated by using a real time get, as long as you have the transaction log enabled. The transaction log is not unlike what databases use to track changes, and as a result of this feature Solr can behave more like a database (a NoSQL one, specifically) than I thought.
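For reference, a real time get is just a request to the /get handler with an id parameter. Here's a small sketch that builds such a URL; the host, port, core name and document id are placeholders I picked, and it assumes the /get handler and the transaction log are both enabled in your solrconfig.xml.

```python
from urllib.parse import urlencode


def realtime_get_url(core_url: str, doc_id: str) -> str:
    """Build a real time get URL for one document (returned as JSON)."""
    query = urlencode({"id": doc_id, "wt": "json"})
    return f"{core_url}/get?{query}"


# Placeholder core URL and document id; substitute your own.
url = realtime_get_url("http://localhost:8983/solr/collection1", "doc1")
```

Fetching that URL should return the latest version of the document even if no commit has happened since the update, whereas a normal search against /select wouldn't see the change yet.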
If you've updated a document, you have two options to make the changes searchable: a hard commit or a soft commit. A hard commit is expensive: it pushes changes to the file system (making them persistent) and has a significant performance impact. A soft commit is less expensive, but on its own it doesn't persist anything. That said, all updates are persistent if you have the transaction log enabled, since they're recorded there regardless of commits. According to Solr's documentation, it's reasonable to have soft commits automatically happen within seconds while hard commits are restricted to a much longer interval (maybe 10-15 minutes).
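Automatic commit intervals like those are normally set in solrconfig.xml, but you can also ask for a commit on an individual update request through the commit, softCommit and commitWithin parameters. Here's a sketch that builds those update URLs; the helper name and the core URL are my own placeholders.

```python
from urllib.parse import urlencode


def update_url(core_url, *, soft_commit=False, hard_commit=False, commit_within_ms=None):
    """Build an /update URL with optional commit parameters."""
    params = {}
    if hard_commit:
        params["commit"] = "true"          # expensive, flushes to disk
    if soft_commit:
        params["softCommit"] = "true"      # cheap, makes changes searchable
    if commit_within_ms is not None:
        params["commitWithin"] = str(commit_within_ms)  # let Solr commit within N ms
    query = urlencode(params)
    return f"{core_url}/update" + (f"?{query}" if query else "")


core = "http://localhost:8983/solr/collection1"  # placeholder
soft = update_url(core, soft_commit=True)
hard = update_url(core, hard_commit=True)
within = update_url(core, commit_within_ms=5000)
```

In practice I'd lean on the automatic intervals and commitWithin rather than committing explicitly on every request, since frequent hard commits are exactly the performance hit described above.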
You need to be aware of a few things when using the transaction log, as documented here. First, all your updates are written to the transaction log before a successful response is returned to a client. Second, performing a hard commit will persist all changes in the transaction log. Third, not performing a hard commit periodically can result in a huge transaction log that can kick the crap out of your Solr instance on startup, when it tries to replay and persist changes potentially on the order of gigs. So, keep an eye on how large you're allowing your transaction log to become, lest you send Solr into a tailspin on startup.
block joins make searching relational
If you've ever wanted a nice parent-child relationship on your documents, it's here. I'm not going to talk about this too much myself because so far I only have a tenuous understanding of how to query this in Solr, and there are awesome resources here, here, here and here. One thing worth calling out is that apparently this won't work correctly with JSON until version 4.7, according to this JIRA ticket.
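For my own notes, here's the basic shape of a block join query as I understand it: the {!parent which=...} parser returns parents whose children match, and {!child of=...} returns children of matching parents. The helper names and the doc_type, color and brand fields are made up for the example, so treat this as a sketch and check the resources above for the details.

```python
def to_parent_query(parent_filter: str, child_query: str) -> str:
    """Return parents that have at least one child matching child_query."""
    return f'{{!parent which="{parent_filter}"}}{child_query}'


def to_child_query(parent_filter: str, parent_query: str) -> str:
    """Return children whose parent matches parent_query."""
    return f'{{!child of="{parent_filter}"}}{parent_query}'


# Hypothetical schema: parents are marked with doc_type:parent.
parents_with_red_children = to_parent_query("doc_type:parent", "color:red")
children_of_acme = to_child_query("doc_type:parent", "brand:acme")
```

The filter in which/of is meant to match all parent documents, not only the ones you're searching for; the parent and its children also have to be indexed together as one block for any of this to work.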
that's it for now
There's a lot more I'm planning on reading up on regarding Solr in the next few weeks, meaning there's a decent chance of more posts like this as well as in-depth follow-ups to help people get started with certain features. In the meantime, feel free to share anything about Solr you think I or others should dedicate some time to learning next!