Faceted Search with Lucene 4

2013/01/30 4 Comments

Faceted search is a technique used by many e-commerce websites and search engines that lets users refine their search results by narrowing the scope of a query to a category or subcategory.

The facet implementation in Lucene lets you assign documents to categories and subcategories, retrieve the list of categories of the documents matching a query, and drill down to a specific category or subcategory.

In this post, we are going to write three programs:

an indexer
a searcher
an advanced searcher that narrows down the scope to a category or subcategory
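
The three operations behind these programs (indexing documents with a category path, counting categories for the matching documents, and drilling down) can be sketched in plain Python. This is only an in-memory illustration and not Lucene code: the sample documents, the tuple-based category paths, and the `search`, `facet_counts` and `drill_down` helpers are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical sample "index": each document has a title and one category
# path, written as a tuple going from category to subcategory.
DOCS = [
    {"title": "Lucene in Action", "path": ("book", "computing")},
    {"title": "Hadoop: The Definitive Guide", "path": ("book", "computing")},
    {"title": "The Hobbit", "path": ("book", "fiction")},
    {"title": "Action camera", "path": ("electronics", "camera")},
]


def search(docs, term):
    """Return the documents whose title contains the term (case-insensitive)."""
    term = term.lower()
    return [d for d in docs if term in d["title"].lower()]


def facet_counts(docs, depth=1):
    """Count documents per category path prefix of the given depth."""
    return Counter(d["path"][:depth] for d in docs if len(d["path"]) >= depth)


def drill_down(docs, prefix):
    """Keep only the documents filed under the given category path prefix."""
    prefix = tuple(prefix)
    return [d for d in docs if d["path"][:len(prefix)] == prefix]


hits = search(DOCS, "action")
top_level = facet_counts(hits)            # counts per top-level category
books_only = drill_down(hits, ("book",))  # narrow the results to "book"
```

Here `top_level` plays the role of the category list shown next to the search results, and `books_only` is what the user sees after clicking on the "book" category.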

Installing Storm on Ubuntu

2013/01/25 1 Comment

Storm is an open source real-time ETL system created by Nathan Marz in late 2011. Unlike Hadoop, where data is processed offline in big batches, Storm aggregates streaming data on the fly, so that the aggregated data is immediately available. It is scalable and fault tolerant (it guarantees that no data is lost), and benchmarks have shown that each node can process over a million tuples per second.

Below we describe the steps to install Storm on Ubuntu Linux, along with the issues we ran into during the process.
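
To make the contrast with batch processing concrete, here is a minimal sketch of on-the-fly aggregation in plain Python. It does not use Storm: the `RunningCount` class and the sample words are hypothetical, and the only point is that the aggregate is up to date after every tuple rather than after a batch job finishes.

```python
from collections import defaultdict


class RunningCount:
    """Hypothetical in-memory aggregator, updated one tuple at a time."""

    def __init__(self):
        self.counts = defaultdict(int)

    def process(self, word):
        # Update the aggregate as soon as the tuple arrives and return
        # the current count, which is immediately available to readers.
        self.counts[word] += 1
        return self.counts[word]


agg = RunningCount()
seen = [agg.process(w) for w in ["storm", "hadoop", "storm"]]
```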
Read more of this post

Filed under Installation Tagged with bigdata, etl, installation, linux, realtime, storm, ubuntu

Deploying Hadoop on EC2 with Whirr

2013/01/20 10 Comments

Apache Whirr is a set of tools for deploying cloud services. It can be used with Amazon Elastic Compute Cloud (EC2), Rackspace Cloud and many other cloud providers.

Requirement

You need to have an account on Amazon EC2. If you don’t have one yet, that’s good news, because you are eligible for the AWS Free Tier (750 hours of cloud computing per month, free for 12 months). In the example below, we use micro instances, so you will not pay anything under the free tier plan as long as you stay within the 750 hours.
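
As a rough check on that budget, the sketch below estimates how many hours a cluster consumes. It assumes the 750 free hours are counted as instance-hours summed over all running micro instances, so a multi-node cluster uses them up faster than a single machine. The function names and cluster sizes are illustrative only.

```python
FREE_TIER_HOURS = 750  # monthly allowance quoted above


def instance_hours(instances, hours_running):
    """Instance-hours used by a cluster (assumes every node runs the whole time)."""
    return instances * hours_running


def remaining_free_hours(instances, hours_running, allowance=FREE_TIER_HOURS):
    """Free hours left after the run, never below zero."""
    return max(0, allowance - instance_hours(instances, hours_running))


one_node_all_month = instance_hours(1, 24 * 31)  # one instance, 31 days
three_nodes_one_day = instance_hours(3, 24)      # three instances, one day
```

Under this assumption a single instance can run for a whole month within the allowance, while a small cluster left running will exceed it, so shut the cluster down when you are done.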

Make sure that you have Java JDK 6 or 7 installed on your machine.

Installation

You can download Whirr at http://www.apache.org/dyn/closer.cgi/whirr/

Uncompress the archive:

tar xvfz whirr-0.8.1.tar.gz

Now we are going to write a config file to tell Whirr how to deploy Hadoop on Amazon EC2. Create the file ~/hadoop-ec2.properties with the following content:
Read more of this post

Filed under EC2, hadoop, whirr Tagged with ec2, hadoop, whirr
