Using SOLR Cloud as a NoSQL database for low latency analytics

SOLR is a popular full text search engine based on Lucene. Not only is it very efficient to search documents but it is also very fast to run simple queries on relational data and unstructured data and can be used as a NoSQL database. SOLR4 was released recently and now includes SOLR Cloud that allows to scale SOLR by using sharding and replication.

Although SOLR is very fast for operations such as filtering, sorting and pagination, it does not support aggregations (similar to SELECT item_id, SUM(purchases) FROM t GROUP BY item_id in SQL) so they have to be aggregated in advance. We will show in this post how to install SOLR Cloud and then illustrate it with a simple e-commerce example where we describe how one can use SOLR Cloud to provide very responsive analytics. Finally we will provide some hints on how to design a schema to deal with pre-aggregations.

Use Case

Let’s consider an e-commerce site that stores the volume of monthly purchases for each item.

For that we will consider the following table items:

CREATE TABLE items (
  id INTEGER,
  name VARCHAR(256),
  year INTEGER,
  month INTEGER,
  shipping_method INTEGER,
  us_sold INTEGER,
  ca_sold INTEGER,
  fr_sold INTEGER,
  uk_sold INTEGER
);

The items table will contain for each item and for each month the number of purchases for each country (we only used 4 countries but you can use many more).

They are also broken up by different shipping methods:

  • 1: Fedex Overnight
  • 2: UPS ground 2-3 days
  • 3: UPS ground 3-5 days saver (free shipping)
  • 0: any shipping method (sum of all of the above)

We suppose that we have a responsive UI (that will use sorting and pagination) where analysts can have a very fast response time for queries like:

  • What are my best selling products in America (US and CA)?
  • What are my worst sales for Christmas 2012 for the product X in Europe that used Fedex overnight?
  • What are my sales for item X over the last 10 years broken up by shipping method

SOLR Cloud Architecture

SOLR runs inside a servlet container (e.g., Jetty, Tomcat, …). Because SOLR cloud shards and replicates the data across the different nodes in the cluster, every node has to know who is in charge of which shard. This information is stored in ZooKeeper that can be seen as a shared filesystem used to store shared data but also to do more complex things such as synchronization and leader election. ZooKeeper can also be distributed on multiple nodes to form a ZooKeeper ensemble to ensure fault resiliency.

We will consider in our architecture 2 nodes that will each has SOLR running and ZooKeeper as depicted in the figure below.

solr_cloud_arch

Replication and Partitioning

In order to scale data there are two different things one can do:
Read more of this post

Faceted Search with Lucene 4

Faceted search is a technique used on several ecommerce websites and search engines to allow users to refine their search results by narrowing down the scope of their queries to a category or a sub category.

amazon_facetsebay_facets

The facet implementation in Lucene allows to categorize documents by categories and subcategories, then get the list of categories of the documents matching a query and also to drill down to a specific category or a sub category.

In this post, we are going to write three programs:

  • an indexer
  • a searcher
  • an advanced searcher that narrows down the scope to a category or subcategory

Read more of this post