Using Hadoop Pig with MongoDB
2013/02/07
In this post, we'll see how to install MongoDB support for Pig, and we'll illustrate it with an example that joins two MongoDB collections with Pig and stores the result in a new collection.
Requirements
- Download Hadoop 1.1.1 from http://www.apache.org/dyn/closer.cgi/hadoop/common/ and add Hadoop's bin directory to your PATH.
- Download the latest Mongo Java driver from https://github.com/mongodb/mongo-java-driver/downloads and put the jar in ~/pig_libraries (create the directory if it does not exist yet).
Building Mongo Hadoop
We're going to use the Git project developed by 10gen, with a slight modification of our own. The Pig language doesn't support field names that start with an underscore (e.g., _id), which MongoDB uses, so we added support for them by replacing the _ prefix with u__: _id becomes u__id.
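To make the renaming rule concrete, here is a minimal Python sketch of the mapping described above. It is only an illustration, not the code in the fork: the function names `to_pig_name` and `to_mongo_name` are hypothetical, and the reverse mapping is an assumption about how the original field name would be recovered when writing back to MongoDB.

```python
def to_pig_name(field: str) -> str:
    """Map a MongoDB field name to an alias Pig can accept.

    A leading underscore is replaced by the prefix ``u__``,
    so ``_id`` becomes ``u__id``. Other names pass through unchanged.
    """
    if field.startswith("_"):
        return "u__" + field[1:]
    return field


def to_mongo_name(alias: str) -> str:
    """Assumed inverse of to_pig_name (hypothetical, not taken from the fork)."""
    if alias.startswith("u__"):
        return "_" + alias[len("u__"):]
    return alias


if __name__ == "__main__":
    for name in ["_id", "name"]:
        print(name, "->", to_pig_name(name))
```

Fields that do not start with an underscore are left untouched, so the rename only affects names such as _id.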
First get the source:
$ git clone https://github.com/darthbear/mongo-hadoop
Build the Pig support from inside the cloned directory, then copy the resulting jars to ~/pig_libraries:
$ cd mongo-hadoop
$ ./sbt package
$ ./sbt mongo-hadoop-core/package
$ ./sbt mongo-hadoop-pig/package
$ mkdir -p ~/pig_libraries
$ cp ./pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar \
    ./target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar ~/pig_libraries
Running a join query with Pig on MongoDB collections
One of the things you can't do in MongoDB is a join between two collections. So let's see how we can do it simply with a Pig script.
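Before turning to Pig, it may help to pin down what the join is expected to produce. The sketch below is plain Python rather than Pig, and it does not use mongo-hadoop. It performs an in-memory inner join on two lists of documents. The sample collections (`users`, `orders`), their fields, and the `right_` prefix on joined fields are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def inner_join(left, right, left_key, right_key):
    """Inner-join two lists of dicts with a simple hash join.

    Documents from `right` are indexed by `right_key`; each document from
    `left` is then matched against that index using `left_key`.
    """
    index = defaultdict(list)
    for doc in right:
        index[doc[right_key]].append(doc)

    joined = []
    for doc in left:
        # .get avoids creating empty entries for keys with no match
        for match in index.get(doc[left_key], []):
            merged = dict(doc)
            # Prefix right-hand fields so they cannot overwrite left-hand ones
            merged.update({"right_" + k: v for k, v in match.items()})
            joined.append(merged)
    return joined


# Hypothetical sample documents standing in for two MongoDB collections
users = [
    {"u__id": 1, "name": "alice"},
    {"u__id": 2, "name": "bob"},
]
orders = [
    {"user_id": 1, "item": "book"},
    {"user_id": 1, "item": "pen"},
    {"user_id": 3, "item": "lamp"},
]

result = inner_join(users, orders, "u__id", "user_id")

if __name__ == "__main__":
    for row in result:
        print(row)
```

Only documents with a match on both sides appear in the result: here the user with no orders and the order with no matching user are dropped, which is the behaviour of an inner join.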