Using Hadoop Pig with MongoDB

In this post, we’ll see how to install MongoDB support for Pig and we’ll illustrate it with an example where we join 2 MongoDB collections with Pig and store the result in a new collection.

Requirements

Building Mongo Hadoop

We’re going to use the GIT project  developed by 10gen but with a slightly modification that we made. Because the Pig language doesn’t support variable that starts with underscore (e.g., _id) which is used in MongoDB, we added the ability to use it by replacing the _ prefix with u__ so _id becomes u__id.

First get the source:

$ git clone https://github.com/darthbear/mongo-hadoop

Compile the Hadoop pig part of it:

$ ./sbt package
$ ./sbt mongo-hadoop-core/package
$ ./sbt mongo-hadoop-pig/package
$ mkdir ~/pig_libraries
$ cp ./pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar \
./target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar ~/pig_libraries

Running a join query with Pig on MongoDB collections

One of the thing you can’t do in MongoDB is to do a join between 2 collections. So let’s see how we can do it simply with a pig script.
Read more of this post

Advertisements
%d bloggers like this: