Using Hadoop Pig with MongoDB
2013/02/07
In this post, we’ll see how to install MongoDB support for Pig, and we’ll illustrate it with an example that joins two MongoDB collections with Pig and stores the result in a new collection.
Requirements
- Download Hadoop 1.1.1 from http://www.apache.org/dyn/closer.cgi/hadoop/common/. Set your PATH to the bin directory in Hadoop.
- Download the latest Mongo Java driver from https://github.com/mongodb/mongo-java-driver/downloads and put it in the directory ~/pig_libraries (create the directory if it does not exist yet).
Building Mongo Hadoop
We’re going to use the Git project developed by 10gen, with a slight modification of our own. The Pig language doesn’t support variable names that start with an underscore (e.g., _id), which MongoDB uses, so we added the ability to refer to such fields by replacing the _ prefix with u__: _id becomes u__id.
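To make the renaming rule concrete, here is a minimal Python sketch of it. This is only an illustration of the rule described above, not the code used in the fork; the function names and the reverse mapping are our own assumptions.

```python
# Illustrative sketch of the naming rule only; NOT the fork's actual code.
# PIG_PREFIX, to_pig_name and to_mongo_name are hypothetical names.
PIG_PREFIX = "u__"


def to_pig_name(mongo_field: str) -> str:
    """Map a MongoDB field name to a Pig-safe alias (_id -> u__id)."""
    if mongo_field.startswith("_"):
        # Replace the single leading underscore with the u__ prefix.
        return PIG_PREFIX + mongo_field[1:]
    return mongo_field


def to_mongo_name(pig_field: str) -> str:
    """Assumed reverse mapping (u__id -> _id)."""
    if pig_field.startswith(PIG_PREFIX):
        return "_" + pig_field[len(PIG_PREFIX):]
    return pig_field


if __name__ == "__main__":
    for field in ("_id", "name", "department_id"):
        print(field, "->", to_pig_name(field))
```

Fields that don’t start with an underscore, such as name or department_id, are left untouched, so only _id needs the u__ spelling in the Pig schema.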
First get the source:
$ git clone https://github.com/darthbear/mongo-hadoop
Compile the Hadoop Pig part of it from inside the cloned directory:
$ cd mongo-hadoop
$ ./sbt package
$ ./sbt mongo-hadoop-core/package
$ ./sbt mongo-hadoop-pig/package
$ mkdir -p ~/pig_libraries
$ cp ./pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar \
     ./target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar ~/pig_libraries
Running a join query with Pig on MongoDB collections
One of the things you can’t do in MongoDB is join two collections. So let’s see how we can do it simply with a Pig script.
Let’s consider 2 MongoDB collections:
- persons
- departments
and let’s create a new collection, person_department, that gives for each person the name of the department (s)he is in.
persons.csv
name, department_id
John Doe, 1
Mary Jane, 1
Fred Astair, 1
Jimmy Lee, 2
Jane Stanislas, 2
Harry Kane, 3
departments.csv
1, Support
2, Technology
3, Finance
Let’s import these two files in MongoDB:
$ mongoimport --type csv --headerline -d test -c persons < persons.csv
$ mongoimport --type csv -f _id,name -d test -c departments < departments.csv
Let’s now write the Pig script and save it as script.pig (adjust the paths to the jar files to match your environment):
REGISTER /home/chimpler/pig_libraries/mongo-2.7.3.jar
REGISTER /home/chimpler/pig_libraries/mongo-hadoop-core-1.1.0-SNAPSHOT.jar
REGISTER /home/chimpler/pig_libraries/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar
persons = LOAD 'mongodb://localhost/test.persons'
    USING com.mongodb.hadoop.pig.MongoLoader('name:chararray, department_id:int')
    AS (person_name, department_id);

departments = LOAD 'mongodb://localhost/test.departments'
    USING com.mongodb.hadoop.pig.MongoLoader('u__id:int, name:chararray')
    AS (department_id, department_name);

joinDepartmentPersons = JOIN departments BY department_id LEFT OUTER,
                             persons BY department_id;

/* note: the join prefixes every field with the alias of the relation it comes from:
   e.g., departments::department_id, departments::department_name, persons::person_name, ... */
result = FOREACH joinDepartmentPersons
    GENERATE persons::person_name AS person_name,
             departments::department_name AS department_name;

STORE result INTO 'mongodb://localhost/test.person_department'
    USING com.mongodb.hadoop.pig.MongoStorage();
Let’s run it:
$ pig -x local script.pig
$ mongo test
> db.person_department.find();
{ "_id" : ObjectId("510d394824acb06ea5deee58"), "person_name" : "John Doe", "department_name" : "Support" }
{ "_id" : ObjectId("510d394824acb06ea5deee59"), "person_name" : "Mary Jane", "department_name" : "Support" }
{ "_id" : ObjectId("510d394824acb06ea5deee5a"), "person_name" : "Fred Astair", "department_name" : "Support" }
{ "_id" : ObjectId("510d394824acb06ea5deee5b"), "person_name" : "Jimmy Lee", "department_name" : "Technology" }
{ "_id" : ObjectId("510d394824acb06ea5deee5c"), "person_name" : "Jane Stanislas", "department_name" : "Technology" }
{ "_id" : ObjectId("510d394824acb06ea5deee5d"), "person_name" : "Harry Kane", "department_name" : "Finance" }
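As a sanity check on what the LEFT OUTER join computes, here is a small plain-Python sketch that reproduces it in memory on the sample data. It does not use Pig or MongoDB; the function name left_outer_join is our own, and the row order is simply the order of the input lists, which Pig does not guarantee.

```python
# In-memory sketch of the join performed by the Pig script (no Pig, no MongoDB).
# Data mirrors persons.csv and departments.csv; left_outer_join is a hypothetical helper.
persons = [
    {"name": "John Doe", "department_id": 1},
    {"name": "Mary Jane", "department_id": 1},
    {"name": "Fred Astair", "department_id": 1},
    {"name": "Jimmy Lee", "department_id": 2},
    {"name": "Jane Stanislas", "department_id": 2},
    {"name": "Harry Kane", "department_id": 3},
]
departments = [
    {"_id": 1, "name": "Support"},
    {"_id": 2, "name": "Technology"},
    {"_id": 3, "name": "Finance"},
]


def left_outer_join(departments, persons):
    """Keep every department; pair it with each matching person, or None if there is none."""
    # Index persons by department_id so each department lookup is a dict access.
    by_department = {}
    for person in persons:
        by_department.setdefault(person["department_id"], []).append(person)

    rows = []
    for department in departments:
        # A department without persons still yields one row (the "outer" part).
        matches = by_department.get(department["_id"]) or [None]
        for person in matches:
            rows.append({
                "person_name": person["name"] if person else None,
                "department_name": department["name"],
            })
    return rows


result = left_outer_join(departments, persons)

if __name__ == "__main__":
    for row in result:
        print(row)
```

Because departments is on the left side of the join, a department with no persons would still produce one row, with an empty person_name (null in Pig). With the sample data every department has at least one person, so we get exactly the six rows shown above.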
Voilà! You just did a join with Pig and MongoDB!
Frederic Dang Ngoc
@fredang