Using Hadoop Pig with MongoDB

In this post, we’ll see how to install MongoDB support for Pig, then illustrate it with an example where we join two MongoDB collections with Pig and store the result in a new collection.

Requirements

Building Mongo Hadoop

We’re going to use the Git project developed by 10gen, with a slight modification of our own. Because the Pig language doesn’t support identifiers that start with an underscore (e.g., _id, which MongoDB uses), we added the ability to use them by replacing the _ prefix with u__, so _id becomes u__id.
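
For example, once a collection has been loaded into a relation (here called users, a hypothetical name), MongoDB’s _id field is referenced as u__id in the Pig script:

-- u__id is how the fork above exposes MongoDB's _id field
ids = FOREACH users GENERATE u__id;
dump ids;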

First get the source:

$ git clone https://github.com/darthbear/mongo-hadoop

Compile the core and Pig parts of it and copy the resulting jars into a local directory:

$ ./sbt package
$ ./sbt mongo-hadoop-core/package
$ ./sbt mongo-hadoop-pig/package
$ mkdir ~/pig_libraries
$ cp ./pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar \
./target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar ~/pig_libraries

Running a join query with Pig on MongoDB collections

One of the things you can’t do in MongoDB is join two collections. So let’s see how we can do it simply with a Pig script.
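
To give an idea of where we’re going, here is a rough sketch of such a script. The connection strings, database, collection and field names, and jar paths below are illustrative assumptions, and the exact MongoLoader/MongoStorage arguments may vary with the version you built:

-- Register the jars built above (paths are placeholders) plus the MongoDB Java driver
REGISTER /path/to/pig_libraries/mongo-hadoop-core-1.1.0-SNAPSHOT.jar;
REGISTER /path/to/pig_libraries/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar;
REGISTER /path/to/mongo-java-driver.jar;

-- Load the two collections; note u__id instead of _id, as explained above
users  = LOAD 'mongodb://localhost:27017/test.users'
         USING com.mongodb.hadoop.pig.MongoLoader('u__id:chararray, name:chararray');
orders = LOAD 'mongodb://localhost:27017/test.orders'
         USING com.mongodb.hadoop.pig.MongoLoader('u__id:chararray, user_id:chararray, amount:double');

-- Join on the user id and store the result in a new collection
joined = JOIN users BY u__id, orders BY user_id;
STORE joined INTO 'mongodb://localhost:27017/test.user_orders'
      USING com.mongodb.hadoop.pig.MongoStorage();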

Playing with Hadoop Pig

Hadoop Pig is a tool for manipulating data from various sources (CSV files, MySQL, MongoDB, …) using a procedural language called Pig Latin. It can run standalone or distributed with Hadoop. Unlike Hive, it can manipulate non-relational data and perform aggregations, joins (including left joins), regular expression matching, sorting, and more.

In this post, to keep things simple, we will only consider standalone executions of Pig.

To install Pig on Ubuntu:

$ sudo apt-get install hadoop-pig

Let’s take a simple example using the geolocation database from MaxMind. Download the latest GeoLite City file in CSV format (e.g., GeoLiteCity_20130101.zip) from: http://geolite.maxmind.com/download/geoip/database/GeoLiteCity_CSV/.

The file in the zip we’re interested in is GeoLiteCity-Location.csv.

Let’s remove the copyright notice and the header line from the CSV file:

$ tail -n +3  GeoLiteCity-Location.csv > location.csv

Let’s start by loading the CSV file and displaying it:

data = LOAD 'location.csv' USING PigStorage(',')
       AS (locId, country, region, city, postalCode,
           latitude, longitude, metroCode, areaCode);
dump data;

To run it:

$ pig -x local script.pig

The first line maps the different columns in the CSV to fields. Since we don’t declare any types here, they default to bytearray; we could also type them explicitly (e.g., city:chararray), as in the variant below.
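
A typed version of the load (the types chosen here are only for illustration) would look like:

data = LOAD 'location.csv' USING PigStorage(',')
       AS (locId:int, country:chararray, region:chararray, city:chararray,
           postalCode:chararray, latitude:double, longitude:double,
           metroCode:chararray, areaCode:chararray);
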
Let’s get rid of information we don’t need and only keep city, region and country:

cities = FOREACH data GENERATE city, region, country;

Now let’s only keep US cities where the city name is set (note that PigStorage keeps the double quotes surrounding the values in this CSV, hence the quotes in the comparisons):

usCities = FILTER cities BY country == '"US"' AND city != '""';

Now let’s try to see which city names are popular, i.e., count how many different states use each city name.
In SQL, this would be a GROUP BY on the city name with a count of distinct states per group.
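
Continuing from the usCities relation above, a rough Pig Latin version of that aggregation (a sketch using the relations defined earlier, not necessarily the final query) could be:

-- Group by city name and count how many distinct states use each name
byCity = GROUP usCities BY city;
cityCounts = FOREACH byCity {
    states = DISTINCT usCities.region;
    GENERATE group AS city, COUNT(states) AS numStates;
}
popularCities = ORDER cityCounts BY numStates DESC;
dump popularCities;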

Faceted Search with Lucene 4

Faceted search is a technique used by many e-commerce websites and search engines to let users refine their search results by narrowing down the scope of their queries to a category or a subcategory.

[Screenshots: faceted navigation on Amazon and eBay]

The facet implementation in Lucene allows you to categorize documents into categories and subcategories, to get the list of categories of the documents matching a query, and to drill down into a specific category or subcategory.

In this post, we are going to write three programs:

  • an indexer
  • a searcher
  • an advanced searcher that narrows down the scope to a category or subcategory


Installing Storm on Ubuntu

Storm is an open source, distributed real-time computation system created by Nathan Marz in late 2011. Unlike Hadoop, where data is processed offline in big batches, Storm takes another approach and aggregates streaming data on the fly, so that aggregated results are immediately available. It is scalable, fault tolerant (it guarantees that no data is lost), and benchmarks have shown that each node can process over a million tuples per second.

Below, we describe the steps to install Storm on Ubuntu Linux, along with the issues we ran into during the process.

Deploying Hadoop on EC2 with Whirr

Apache Whirr is a set of tools for deploying cloud services. It can be used with Amazon Elastic Compute Cloud (EC2), Rackspace Cloud and many other cloud providers.

Requirements

You need an account on Amazon EC2. If you don’t have one yet, that’s good news: you are eligible for the AWS Free Tier (750 hours of cloud computing per month, free for 12 months). In the example below we use micro instances, so with the free tier plan you won’t pay anything (up to 750 hours).

Make sure that you have Java JDK 6 or 7 installed on your machine.

Installation

You can download Whirr at http://www.apache.org/dyn/closer.cgi/whirr/

Uncompress the archive:

$ tar xvfz whirr-0.8.1.tar.gz

Now we are going to write a config file to tell Whirr how to deploy Hadoop on Amazon EC2: create the file ~/hadoop-ec2.properties.
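
As an illustration, a minimal recipe for this setup could look like the sketch below; the property values are placeholders based on Whirr’s standard Hadoop recipe, not the exact file used in this post:

# One master (namenode + jobtracker) and one worker (datanode + tasktracker)
whirr.cluster-name=hadoop-ec2
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,1 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
# Micro instances to stay within the free tier (t1.micro requires an EBS-backed AMI,
# so you may also need to set whirr.image-id)
whirr.hardware-id=t1.micro
whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub

Once the file is in place, the cluster is typically launched with bin/whirr launch-cluster --config ~/hadoop-ec2.properties and torn down with bin/whirr destroy-cluster --config ~/hadoop-ec2.properties.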