Playing with Apache Hive, MongoDB and the MTA

Apache Hive is a popular data warehouse system for Hadoop that lets you run SQL queries on top of Hadoop by translating them into Map/Reduce jobs. Because of the high latency Hadoop incurs when executing Map/Reduce jobs, Hive cannot be used in applications that require fast access to data. A common technique is to use Hive to pre-aggregate raw logs stored in HDFS and then sync the aggregated data to a data warehouse.

In this post we’re going to describe how to install Hive and then, as New York City straphangers, load subway train movement data from the MTA into HDFS, run Hive queries to compute the average number of daily train movements per line, and store the result in MongoDB.
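To give a feel for the kind of query involved, here is a minimal HiveQL sketch of that aggregation; the table name mta_movements and its columns line_id and movement_date are hypothetical stand-ins for however the raw MTA feed ends up being mapped in Hive:

-- Average number of train movements per line per day (table and columns are illustrative)
SELECT line_id,
       AVG(daily_count) AS avg_daily_movements
FROM (
  SELECT line_id, movement_date, COUNT(*) AS daily_count
  FROM mta_movements
  GROUP BY line_id, movement_date
) daily
GROUP BY line_id;

The output of a query like this can then be exported to MongoDB as a separate step.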

Using Hadoop Pig with MongoDB

In this post, we’ll see how to install MongoDB support for Pig and illustrate it with an example where we join two MongoDB collections with Pig and store the result in a new collection.

Requirements

Building Mongo Hadoop

We’re going to use the Git project developed by 10gen, with a slight modification of our own. Because the Pig language doesn’t support variables that start with an underscore (e.g., _id), which MongoDB uses, we added the ability to access such fields by replacing the _ prefix with u__, so _id becomes u__id.
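For example, here is roughly how a script using this modified adapter would refer to the _id field; the collection name and the extra field are made up for illustration, and MongoLoader’s exact constructor arguments may vary between versions:

-- With the modified adapter, _id is exposed as u__id (collection and fields are illustrative)
people = LOAD 'mongodb://localhost:27017/test.people'
         USING com.mongodb.hadoop.pig.MongoLoader('u__id:chararray, name:chararray');
ids = FOREACH people GENERATE u__id;
DUMP ids;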

First get the source:

$ git clone https://github.com/darthbear/mongo-hadoop

Compile the core and Pig parts of it and copy the resulting jars to a local directory:

$ ./sbt package
$ ./sbt mongo-hadoop-core/package
$ ./sbt mongo-hadoop-pig/package
$ mkdir ~/pig_libraries
$ cp ./pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar \
./target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar ~/pig_libraries

Running a join query with Pig on MongoDB collections

One thing you can’t do in MongoDB is join two collections. So let’s see how we can do it simply with a Pig script.
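As a preview, here is a minimal sketch of what such a join script might look like. The database and collection names (test.users, test.orders, test.user_orders), the field lists, the jar paths and the MongoDB Java driver version are all illustrative, so adjust them to your setup:

-- Register the jars built earlier plus the MongoDB Java driver (paths/versions are illustrative)
REGISTER /home/hadoop/pig_libraries/mongo-hadoop-core-1.1.0-SNAPSHOT.jar;
REGISTER /home/hadoop/pig_libraries/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar;
REGISTER /home/hadoop/pig_libraries/mongo-2.7.3.jar;

-- Load the two collections (field lists are made up for this example)
users  = LOAD 'mongodb://localhost:27017/test.users'
         USING com.mongodb.hadoop.pig.MongoLoader('user_id:chararray, name:chararray');
orders = LOAD 'mongodb://localhost:27017/test.orders'
         USING com.mongodb.hadoop.pig.MongoLoader('user_id:chararray, amount:double');

-- Join on the shared user_id field and keep a few columns
joined = JOIN users BY user_id, orders BY user_id;
result = FOREACH joined GENERATE users::user_id AS user_id, users::name AS name, orders::amount AS amount;

-- Write the joined records into a new collection
STORE result INTO 'mongodb://localhost:27017/test.user_orders'
      USING com.mongodb.hadoop.pig.MongoStorage();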