Deploying Hadoop on EC2 with Whirr

Apache Whirr is a set of tools for deploying cloud services. It can be used on Amazon Elastic Compute Cloud (EC2), Rackspace Cloud and many other cloud providers.

Requirements

You need an account on Amazon EC2. If you don’t have one yet, that’s good news: you are eligible for the AWS Free Tier (750 hours of cloud computing per month for free for 12 months). In the example below we are using micro instances, so with the free tier plan you won’t pay anything (up to 750 hours).

Make sure that you have Java JDK 6 or 7 installed on your machine.
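
You can quickly check which Java version is on your PATH:

java -version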

Installation

You can download Whirr at http://www.apache.org/dyn/closer.cgi/whirr/

Uncompress the archive:

tar xvfz whirr-0.8.1.tar.gz

Now we are going to write a config file to tell Whirr how to deploy Hadoop on Amazon EC2. Create the file ~/hadoop-ec2.properties with the following content:

whirr.cluster-name=hadoop-ec2
whirr.cluster-user=${sys:user.name}
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,1 hadoop-datanode+hadoop-tasktracker
whirr.hadoop.version=1.1.1
whirr.provider=aws-ec2
whirr.identity=<AMAZON ACCESS KEY ID>
whirr.credential=<AMAZON SECRET ACCESS KEY>
whirr.hardware-id=t1.micro
whirr.image-id=us-east-1/ami-1624987f
whirr.location-id=us-east-1
whirr.java.install-function=install_oab_java

To get your Amazon access key ID and secret access key, go to https://portal.aws.amazon.com/gp/aws/securityCredentials

Then scroll down to the section ‘Access Credentials’. If you don’t have an access key yet, click on “Create a new Access Key” and confirm. In the properties file, set whirr.identity to the access key ID (20 alphanumeric characters) and whirr.credential to the secret access key (40 characters).
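
If you prefer not to hard-code the credentials in the file, Whirr’s property interpolation (the same mechanism used for ${sys:user.name} above) should also let you read them from environment variables with an env: prefix. This is only a sketch; the variable names are an example:

# in your shell, before running whirr:
export AWS_ACCESS_KEY_ID=<AMAZON ACCESS KEY ID>
export AWS_SECRET_ACCESS_KEY=<AMAZON SECRET ACCESS KEY>

# then, in ~/hadoop-ec2.properties:
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}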

We also need an RSA SSH key pair with no passphrase. To generate it, type:

 ssh-keygen -t rsa -P ''

Accept the default values; the key pair will be written to ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub.
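
Whirr looks for the key pair at ~/.ssh/id_rsa by default. If you generated it somewhere else, the whirr.private-key-file and whirr.public-key-file properties should let you point Whirr at the right files (shown here with the default paths) in ~/hadoop-ec2.properties:

whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub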

Now we can deploy Hadoop on Amazon EC2 using Whirr:

cd whirr-0.8.1/
bin/whirr launch-cluster --config ~/hadoop-ec2.properties

It will start 2 micro instances on Amazon EC2 and then install Java and Hadoop on them. On one instance we’ll have the Hadoop namenode and jobtracker, and on the other instance the datanode and tasktracker.

When Whirr is done deploying Hadoop, you should get a message like:

You can log into instances using the following ssh commands:
[hadoop-datanode+hadoop-tasktracker]: ssh -i /home/chimpler/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no chimpler@54.243.99.201
[hadoop-namenode+hadoop-jobtracker]: ssh -i /home/chimpler/.ssh/id_rsa -o "UserKnownHostsFile /dev/null" -o StrictHostKeyChecking=no chimpler@107.22.164.228

From this message you know that the namenode and jobtracker are running on the instance with IP 107.22.164.228.

If everything went well, you should be able to see the two instances running in the Amazon EC2 console: https://console.aws.amazon.com/ec2/home?region=us-east-1#s=Instances

You should also be able to access the Hadoop web interfaces for the namenode and the jobtracker from your browser.
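
Assuming Whirr keeps the Hadoop 1.x default web UI ports (50070 for the namenode and 50030 for the jobtracker), the URLs for the example above would look like:

http://107.22.164.228:50070/
http://107.22.164.228:50030/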

Running a job on Hadoop from your local box

You need to have Hadoop on your local box. You can download it at http://www.apache.org/dyn/closer.cgi/hadoop/common/

Uncompress the archive:

tar xvfz hadoop-1.1.1-bin.tar.gz

Then set the following environment variables:

export HADOOP_PREFIX=~/hadoop-1.1.1
export PATH=$HADOOP_PREFIX/bin:$PATH
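
To confirm that the Hadoop client is now on your PATH:

hadoop version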

When you deployed Hadoop on EC2, Whirr created some files in ~/.whirr/hadoop-ec2:

  • a script to run the Hadoop proxy
  • a config file hadoop-site.xml to use Hadoop through the proxy

Start the Hadoop proxy:

cd ~/.whirr/hadoop-ec2
./hadoop-proxy.sh
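
The proxy script keeps an SSH tunnel to the cluster open and runs in the foreground, so leave it running in its own terminal while you work with the cluster, or send it to the background, for example:

# keep the proxy alive while running Hadoop commands from your local box
nohup ./hadoop-proxy.sh > hadoop-proxy.log 2>&1 &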

Tell Hadoop to use the config files in ~/.whirr/hadoop-ec2:

export HADOOP_CONF_DIR=~/.whirr/hadoop-ec2

Now you can access the Hadoop filesystem running on EC2 by typing:

hadoop fs -ls /

Let’s try to run a MapReduce job on the EC2 cluster.

Copy a text file to HDFS to use as input for the wordcount MapReduce job:

cd ~/hadoop-1.1.1 
bin/hadoop fs -put LICENSE.txt LICENSE.txt
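
You can check that the file made it to your HDFS home directory on the cluster:

bin/hadoop fs -ls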

Run the wordcount mapreduce job:

bin/hadoop jar hadoop*examples*.jar wordcount LICENSE.txt output

Display the result:

bin/hadoop fs -cat output/part-r-00000
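
The output is a list of word/count pairs. As a quick sanity check you can, for example, sort it locally to see the ten most frequent words:

bin/hadoop fs -cat output/part-r-00000 | sort -k2 -nr | head -10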

When you are done with your tests, you can destroy the cluster to stop the instances on EC2. From the Whirr directory (whirr-0.8.1), run:

bin/whirr destroy-cluster --config ~/hadoop-ec2.properties

Then check on the Amazon EC2 console that no instances are running.

Config file description

Instance template

In the config file, the property whirr.instance-templates describes what to deploy on each instance:

whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,1 hadoop-datanode+hadoop-tasktracker

The value above means to deploy:

  • 1 instance running both hadoop-namenode and hadoop-jobtracker
  • 1 instance running both hadoop-datanode and hadoop-tasktracker

If we had wanted to deploy 4 instances each running a Hadoop datanode and tasktracker, we could have written:

whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,4 hadoop-datanode+hadoop-tasktracker

Provider

Whirr is cloud-provider agnostic and can work with several cloud providers: Rackspace, CloudSigma, GoGrid, …
With this value we tell it to use Amazon EC2:

whirr.provider=aws-ec2

But we could easily have used any other provider. To get the list of cloud providers supported by Whirr, run:

bin/whirr list-providers compute

More information on configuration properties is available at http://whirr.apache.org/docs/0.8.1/configuration-guide.html
