Using Hadoop Pig with MongoDB

In this post, we’ll see how to install MongoDB support for Pig and illustrate it with an example that joins two MongoDB collections with Pig and stores the result in a new collection.

Requirements

To follow along, you will need a working Hadoop and Pig installation, a running MongoDB instance, and git to fetch the connector sources.

Building Mongo Hadoop

We’re going to use the Git project developed by 10gen, with a slight modification that we made. Because the Pig language doesn’t support identifiers that start with an underscore (e.g., _id, which MongoDB uses for document ids), we added the ability to reference such fields by replacing the _ prefix with u__, so _id becomes u__id.
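
For instance, with this patched loader a schema that needs MongoDB’s _id field simply declares it as u__id. A minimal sketch (the full script below does the same thing):

-- the MongoDB _id field is exposed as u__id by our patched MongoLoader
departments = LOAD 'mongodb://localhost/test.departments'
              USING com.mongodb.hadoop.pig.MongoLoader('u__id:int, name:chararray');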

First get the source:

$ git clone https://github.com/darthbear/mongo-hadoop

Compile the core and Pig parts of it, then copy the resulting jars into a directory where Pig can find them:

$ ./sbt package
$ ./sbt mongo-hadoop-core/package
$ ./sbt mongo-hadoop-pig/package
$ mkdir ~/pig_libraries
$ cp ./pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar \
./target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar ~/pig_libraries
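
At this point ~/pig_libraries should contain the two jars we just built. The script below also registers the MongoDB Java driver (mongo-2.7.3.jar), so download it and copy it into the same directory:

$ ls ~/pig_libraries
mongo-2.7.3.jar
mongo-hadoop-core-1.1.0-SNAPSHOT.jar
mongo-hadoop-pig-1.1.0-SNAPSHOT.jar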

Running a join query with Pig on MongoDB collections

One of the things you can’t do in MongoDB is join two collections, so let’s see how to do it simply with a Pig script.

Let’s consider 2 MongoDB collections:

  • persons
  • departments

and let’s create a new collection person_department that gives, for each person, the name of the department (s)he belongs to. We start from two CSV files.

persons.csv

name,department_id
John Doe,1
Mary Jane,1
Fred Astair,1
Jimmy Lee,2
Jane Stanislas,2
Harry Kane,3

departments.csv

1,Support
2,Technology
3,Finance

Let’s import these two files into MongoDB:

$ mongoimport --type csv --headerline -d test -c persons < persons.csv
$ mongoimport --type csv -f _id,name -d test -c departments < departments.csv
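
To double-check the import, we can peek at the two collections from the mongo shell (your ObjectIds will differ):

$ mongo test
> db.persons.findOne()
{ "_id" : ObjectId("..."), "name" : "John Doe", "department_id" : 1 }
> db.departments.findOne()
{ "_id" : 1, "name" : "Support" }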

Let’s now write the Pig script (adjust the jar paths to match your environment):

REGISTER /home/chimpler/pig_libraries/mongo-2.7.3.jar
REGISTER /home/chimpler/pig_libraries/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
REGISTER /home/chimpler/pig_libraries/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar

persons = LOAD 'mongodb://localhost/test.persons'
      USING com.mongodb.hadoop.pig.MongoLoader('name:chararray, department_id:int')
      AS (person_name, department_id);

departments = LOAD 'mongodb://localhost/test.departments'
              USING com.mongodb.hadoop.pig.MongoLoader('u__id:int, name:chararray')
              AS (department_id, department_name);

joinDepartmentPersons = JOIN departments BY department_id LEFT OUTER,
 persons BY department_id;

/* note: the join prefixes every field with its original relation:
 e.g., departments::department_id, departments::department_name, persons::person_name, ... */

result = FOREACH joinDepartmentPersons
         GENERATE persons::person_name as person_name,
                  departments::department_name as department_name;
STORE result INTO 'mongodb://localhost/test.person_department'
    USING com.mongodb.hadoop.pig.MongoStorage();
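
Before storing anything, it can be useful to eyeball a few joined tuples; replacing the STORE statement with the following is a quick sanity check:

-- optional: print a handful of joined tuples instead of storing them
preview = LIMIT joinDepartmentPersons 5;
DUMP preview;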

Save the script as script.pig and run it:

$ pig -x local script.pig
$ mongo test
> db.person_department.find();
{ "_id" : ObjectId("510d394824acb06ea5deee58"), "person_name" : "John Doe", "department_name" : "Support" }
{ "_id" : ObjectId("510d394824acb06ea5deee59"), "person_name" : "Mary Jane", "department_name" : "Support" }
{ "_id" : ObjectId("510d394824acb06ea5deee5a"), "person_name" : "Fred Astair", "department_name" : "Support" }
{ "_id" : ObjectId("510d394824acb06ea5deee5b"), "person_name" : "Jimmy Lee", "department_name" : "Technology" }
{ "_id" : ObjectId("510d394824acb06ea5deee5c"), "person_name" : "Jane Stanislas", "department_name" : "Technology" }
{ "_id" : ObjectId("510d394824acb06ea5deee5d"), "person_name" : "Harry Kane", "department_name" : "Finance" }

Voilà! You just joined two MongoDB collections with Pig!

