2013/02/27 1 Comment
SOLR is a popular full text search engine based on Lucene. Not only is it very efficient to search documents but it is also very fast to run simple queries on relational data and unstructured data and can be used as a NoSQL database. SOLR4 was released recently and now includes SOLR Cloud that allows to scale SOLR by using sharding and replication.
Although SOLR is very fast for operations such as filtering, sorting and pagination, it does not support aggregations (similar to SELECT item_id, SUM(purchases) FROM t GROUP BY item_id in SQL) so they have to be aggregated in advance. We will show in this post how to install SOLR Cloud and then illustrate it with a simple e-commerce example where we describe how one can use SOLR Cloud to provide very responsive analytics. Finally we will provide some hints on how to design a schema to deal with pre-aggregations.
Let’s consider an e-commerce site that stores the volume of monthly purchases for each item.
For that we will consider the following table items:
CREATE TABLE items ( id INTEGER, name VARCHAR(256), year INTEGER, month INTEGER, shipping_method INTEGER, us_sold INTEGER, ca_sold INTEGER, fr_sold INTEGER, uk_sold INTEGER );
The items table will contain for each item and for each month the number of purchases for each country (we only used 4 countries but you can use many more).
They are also broken up by different shipping methods:
- 1: Fedex Overnight
- 2: UPS ground 2-3 days
- 3: UPS ground 3-5 days saver (free shipping)
- 0: any shipping method (sum of all of the above)
We suppose that we have a responsive UI (that will use sorting and pagination) where analysts can have a very fast response time for queries like:
- What are my best selling products in America (US and CA)?
- What are my worst sales for Christmas 2012 for the product X in Europe that used Fedex overnight?
- What are my sales for item X over the last 10 years broken up by shipping method
SOLR Cloud Architecture
SOLR runs inside a servlet container (e.g., Jetty, Tomcat, …). Because SOLR cloud shards and replicates the data across the different nodes in the cluster, every node has to know who is in charge of which shard. This information is stored in ZooKeeper that can be seen as a shared filesystem used to store shared data but also to do more complex things such as synchronization and leader election. ZooKeeper can also be distributed on multiple nodes to form a ZooKeeper ensemble to ensure fault resiliency.
We will consider in our architecture 2 nodes that will each has SOLR running and ZooKeeper as depicted in the figure below.
Replication and Partitioning
In order to scale data there are two different things one can do:
Read more of this post