Playing with the Mahout recommendation engine on a Hadoop cluster

Apache Mahout is an open source library that implements several scalable machine learning algorithms. Among other things, they can be used to categorize data, group items into clusters, and implement a recommendation engine.

In this tutorial we will run the Mahout recommendation engine on a data set of movie ratings and show the movie recommendations for each user.

For more details on the recommendation algorithm, you can look at the tutorial from Jee Vang.

Requirements

  • Java (to run Hadoop)
  • Hadoop (used by Mahout)
  • Mahout
  • Python (used to display the results)

Running Hadoop

In this section, we describe how to quickly install and configure Hadoop on a single machine.

Alternatively, you can follow the instructions in this post to deploy Hadoop for free on an Amazon EC2 cluster.

To install Hadoop on your local box, go to http://www.apache.org/dyn/closer.cgi/hadoop/common/ and download hadoop-1.1.1-bin.tar.gz.
Uncompress the archive:

tar xvfz hadoop-1.1.1-bin.tar.gz

Edit the file conf/hadoop-env.sh and add the following line:

export JAVA_HOME=<JDK DIRECTORY>

Generate an RSA key so that you can ssh to your local box without a password:

ssh-keygen -t rsa -P ''

Save it in <HOME>/.ssh/id_rsa.
Now authorize this key so that your local box can connect to itself:

cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

Check that it works by running:

ssh localhost ls

It should not ask for your password.

If you don’t have an SSH server installed, you can install it by typing:

sudo apt-get install openssh-server

Now set the environment variables:

export HADOOP_PREFIX=<HADOOP DIRECTORY>
export HADOOP_CONF_DIR=$HADOOP_PREFIX/conf
export PATH=$HADOOP_PREFIX/bin:$PATH

To configure HDFS, edit the file conf/core-site.xml and add the following property inside the configuration element:

<configuration>
        <property>
                <name>fs.default.name</name>
                <value>hdfs://localhost:9000</value>
        </property>
</configuration>

Then format the HDFS filesystem:

hadoop namenode -format

We are now ready to start Hadoop:

start-all.sh

Mahout

To install Mahout, go to http://www.apache.org/dyn/closer.cgi/mahout/ and download mahout-distribution-0.7.tar.gz

Uncompress the archive:

tar xvfz mahout-distribution-0.7.tar.gz

Getting the movie dataset

The recommender engine accepts any file made of lines containing a userId, an itemId and an optional preference value, separated by tabs. The userId and itemId must be integers, and the preference value can be an integer or a double.
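To make the format concrete, here is a minimal Python sketch that parses and validates one input line. The function name parse_preference is ours (it is not part of Mahout), and ignoring any extra column, such as the timestamp found in u.data, is an assumption of this sketch.

```python
def parse_preference(line):
    """Parse one 'userId<TAB>itemId[<TAB>preference]' line.

    Returns (user_id, item_id, preference); preference is None when absent.
    Columns after the third one are ignored (an assumption of this sketch).
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 2:
        raise ValueError("expected at least a userId and an itemId: %r" % line)
    user_id = int(fields[0])  # raises ValueError if the id is not an integer
    item_id = int(fields[1])
    has_preference = len(fields) > 2 and fields[2] != ""
    preference = float(fields[2]) if has_preference else None
    return user_id, item_id, preference


# Hypothetical sample lines in the expected format.
print(parse_preference("196\t242\t3\t881250949"))  # (196, 242, 3.0)
print(parse_preference("7\t42"))                   # (7, 42, None)
```

Running every line of a file through such a check before uploading it to HDFS is a cheap way to catch non-integer ids early.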

The GroupLens Movie DataSet provides movie ratings in this format. You can download it: MovieLens 100k.

Uncompress the archive:

unzip ml-100k.zip

This archive contains:

  • u.data: contains tuples (user_id, movie_id, rating, timestamp)
  • u.user: contains tuples (user_id, age, gender, occupation, zip_code)
  • u.item: contains tuples (movie_id, title, release_date, video_release_date, imdb_url, cat_unknown, cat_action, cat_adventure, cat_animation, cat_children, cat_comedy, cat_crime, cat_documentary, cat_drama, cat_fantasy, cat_film_noir, cat_horror, cat_musical, cat_mystery, cat_romance, cat_sci_fi, cat_thriller, cat_war, cat_western)
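
As an illustration of the u.item layout, the sketch below turns one pipe-separated line into a small dictionary. It assumes the 19 category columns are 0/1 flags in the order listed above; the helper name parse_movie and the sample line (including its URL) are made up for the example.

```python
# Category names in the column order listed above (without the cat_ prefix).
GENRES = [
    "unknown", "action", "adventure", "animation", "children", "comedy",
    "crime", "documentary", "drama", "fantasy", "film_noir", "horror",
    "musical", "mystery", "romance", "sci_fi", "thriller", "war", "western",
]


def parse_movie(line):
    """Parse one pipe-separated u.item line into a dictionary."""
    fields = line.rstrip("\n").split("|")
    flags = fields[5:5 + len(GENRES)]  # the category columns follow imdb_url
    return {
        "movie_id": int(fields[0]),
        "title": fields[1],
        "release_date": fields[2],
        "genres": [name for name, flag in zip(GENRES, flags) if flag == "1"],
    }


# Made-up line in the u.item layout (the URL is a placeholder).
sample = ("1|Toy Story (1995)|01-Jan-1995||http://example.invalid/toy-story"
          "|0|0|0|1|1|1|0|0|0|0|0|0|0|0|0|0|0|0|0")
movie = parse_movie(sample)
print(movie["title"], movie["genres"])
```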

This data set contains 943 users, 1,682 movies and 100,000 ratings.
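
If you want to check these figures yourself, a short Python sketch like the following counts the distinct users, distinct movies and ratings in a file laid out like u.data. The function name summarize_ratings and the sample lines are ours, for illustration only.

```python
def summarize_ratings(lines):
    """Return (number of users, number of movies, number of ratings)."""
    users, movies, ratings = set(), set(), 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        users.add(fields[0])
        movies.add(fields[1])
        ratings += 1
    return len(users), len(movies), ratings


# Made-up lines in the u.data layout: user_id, movie_id, rating, timestamp.
sample = ["1\t10\t5\t100\n", "1\t11\t3\t101\n", "2\t10\t4\t102\n", "\n"]
print(summarize_ratings(sample))  # (2, 2, 3)
```

To run it on the real file, pass an open file object instead of the sample list, for example summarize_ratings(open("u.data")).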

Running the Mahout recommender

First we need to copy the file u.data to HDFS:

cd ml-100k
hadoop fs -put u.data u.data

To run the Mahout recommender, type:

hadoop jar <MAHOUT DIRECTORY>/mahout-core-0.7-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE --input u.data --output output

With the argument “-s SIMILARITY_COOCCURRENCE”, we tell the recommender which item similarity formula to use. With SIMILARITY_COOCCURRENCE, two items (movies) are considered very similar if they often appear together in users’ ratings. To find the movies to recommend to a user, the recommender looks for the 10 movies most similar to the movies the user has already rated. Put differently, if user A gives a good rating to movie X, and other users give good ratings to both movie X and movie Y, then we can recommend movie Y to user A.
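
The idea can be sketched in a few lines of Python. This is a deliberately simplified, in-memory illustration of co-occurrence counting, not Mahout's implementation: it ignores rating values, does not run as MapReduce jobs, and the function names and the tiny data set are made up.

```python
from collections import defaultdict
from itertools import permutations


def cooccurrence_counts(ratings):
    """Map each ordered pair (a, b) to the number of users who rated both."""
    counts = defaultdict(int)
    for items in ratings.values():
        for a, b in permutations(sorted(items), 2):
            counts[(a, b)] += 1
    return counts


def recommend(user, ratings, top_n=10):
    """Rank unseen items by how often they co-occur with the user's items."""
    seen = ratings[user]
    scores = defaultdict(int)
    for (a, b), count in cooccurrence_counts(ratings).items():
        if a in seen and b not in seen:
            scores[b] += count
    # Highest score first; ties are broken by item name for a stable order.
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# User A rated X; users B and C rated X together with other movies.
ratings = {
    "A": {"X"},
    "B": {"X", "Y"},
    "C": {"X", "Y", "Z"},
}
print(recommend("A", ratings))  # ['Y', 'Z']
```

Movie Y comes first because it appears together with X for two users, while Z appears with X only once.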

Mahout computes the recommendations by running several Hadoop MapReduce jobs.
After 30-50 minutes, the jobs are finished and each user has a list of the 10 movies that she is most likely to enjoy, based on the co-occurrence of movies in users’ ratings.

To copy and merge the files from HDFS to your local filesystem, type:

hadoop fs -getmerge output output.txt

The file output.txt should contain lines like this:

1       [845:5.0,550:5.0,546:5.0,25:5.0,531:5.0,529:5.0,527:5.0,31:5.0,515:5.0,514:5.0]
2       [546:5.0,288:5.0,11:5.0,25:5.0,531:5.0,527:5.0,515:5.0,508:5.0,496:5.0,483:5.0]
3       [137:5.0,284:5.0,508:4.8327274,248:4.826923,285:4.80597,845:4.754717,124:4.7058825,319:4.703242,293:4.6792455,591:4.6629214]
4       [748:5.0,1296:5.0,546:5.0,568:5.0,538:5.0,508:5.0,483:5.0,475:5.0,471:5.0,876:5.0]
5       [732:5.0,550:5.0,9:5.0,546:5.0,11:5.0,527:5.0,523:5.0,514:5.0,511:5.0,508:5.0]
6       [739:5.0,9:5.0,546:5.0,11:5.0,25:5.0,531:5.0,528:5.0,527:5.0,526:5.0,521:5.0]
7       [879:5.0,845:5.0,751:5.0,750:5.0,748:5.0,746:5.0,742:5.0,739:5.0,735:5.0,732:5.0]
8       [742:5.0,550:5.0,546:5.0,566:5.0,568:5.0,527:5.0,31:5.0,523:5.0,515:5.0,514:5.0]
9       [739:5.0,550:5.0,546:5.0,11:5.0,527:5.0,523:5.0,514:5.0,511:5.0,508:5.0,498:5.0]
10      [732:5.0,9:5.0,546:5.0,11:5.0,25:5.0,529:5.0,528:5.0,527:5.0,526:5.0,523:5.0]

Each line represents the recommendations for one user. The first number is the user id, and each of the 10 pairs that follow consists of a movie id and a score.
Looking at the first line, for example, the 10 best recommendations for user 1 are the movies 845, 550, 546, 25, 531, 529, 527, 31, 515 and 514.
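
For reference, here is a minimal Python sketch that parses one such line into a user id and a list of (movie id, score) pairs. It assumes the user id and the bracketed list are separated by a tab; the function name parse_recommendations is ours.

```python
def parse_recommendations(line):
    """Parse 'userId<TAB>[movieId:score,...]' into (user_id, [(movie_id, score), ...])."""
    user_id, _, payload = line.rstrip("\n").partition("\t")
    pairs = payload.strip("[]")
    recommendations = []
    if pairs:  # a user may have an empty list
        for pair in pairs.split(","):
            movie_id, score = pair.split(":")
            recommendations.append((int(movie_id), float(score)))
    return int(user_id), recommendations


# Shortened version of the line shown above for user 3.
line = "3\t[137:5.0,284:5.0,508:4.8327274]"
print(parse_recommendations(line))
# (3, [(137, 5.0), (284, 5.0), (508, 4.8327274)])
```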

It’s not easy to see what those recommendations mean, so we wrote a small Python program that shows, for a given user, the movies he has rated and the movies we recommend to him.
The Python program uses the file u.data for the list of rated movies, the file u.item to get the movie titles and output.txt to get the list of recommended movies for the user.

# Written for Python 3.
import sys

if len(sys.argv) != 5:
    print("Arguments: userId userDataFilename movieFilename recommendationFilename")
    sys.exit(1)

userId, userDataFilename, movieFilename, recommendationFilename = sys.argv[1:]

# u.item is pipe-separated: movie_id|title|release_date|...
# The file is not UTF-8 (some titles contain Latin-1 characters),
# so the encoding is given explicitly.
print("Reading Movies Descriptions")
movieById = {}
with open(movieFilename, encoding="latin-1") as movieFile:
    for line in movieFile:
        tokens = line.rstrip("\n").split("|")
        movieById[tokens[0]] = tokens[1:]

# u.data is tab-separated: user_id, movie_id, rating, timestamp
print("Reading Rated Movies")
ratedMovieIds = []
with open(userDataFilename) as userDataFile:
    for line in userDataFile:
        tokens = line.split("\t")
        if tokens[0] == userId:
            ratedMovieIds.append((tokens[1], tokens[2]))

# output.txt lines look like: userId<TAB>[movieId:score,movieId:score,...]
print("Reading Recommendations")
recommendations = []
with open(recommendationFilename) as recommendationFile:
    for line in recommendationFile:
        tokens = line.split("\t")
        if tokens[0] == userId:
            movieIdAndScores = tokens[1].strip("[]\n").split(",")
            # Skip empty entries so that an empty list "[]" does not break the unpacking below
            recommendations = [movieIdAndScore.split(":") for movieIdAndScore in movieIdAndScores if movieIdAndScore]
            break

print("Rated Movies")
print("------------------------")
for movieId, rating in ratedMovieIds:
    print("%s, rating=%s" % (movieById[movieId][0], rating))
print("------------------------")

print("Recommended Movies")
print("------------------------")
for movieId, score in recommendations:
    print("%s, score=%s" % (movieById[movieId][0], score))
print("------------------------")

To run the Python program and get the recommended movies for user 4:

$ python show_recommendations.py 4 u.data u.item output.txt
Reading Movies Descriptions
Reading Rated Movies
Reading Recommendations
Rated Movies
------------------------
Mimic (1997), rating=3
Ulee's Gold (1997), rating=5
Incognito (1997), rating=5
One Flew Over the Cuckoo's Nest (1975), rating=4
Event Horizon (1997), rating=4
Client, The (1994), rating=3
Liar Liar (1997), rating=5
Scream (1996), rating=4
Star Wars (1977), rating=5
Wedding Singer, The (1998), rating=5
Starship Troopers (1997), rating=4
Air Force One (1997), rating=5
Conspiracy Theory (1997), rating=3
Contact (1997), rating=5
Indiana Jones and the Last Crusade (1989), rating=3
Desperate Measures (1998), rating=5
Seven (Se7en) (1995), rating=4
Cop Land (1997), rating=5
Lost Highway (1997), rating=5
Assignment, The (1997), rating=5
Blues Brothers 2000 (1998), rating=5
Spawn (1997), rating=2
Wonderland (1997), rating=5
In & Out (1997), rating=5
------------------------
Recommended Movies
------------------------
Saint, The (1997), score=5.0
Indian Summer (1996), score=5.0
Broken Arrow (1996), score=5.0
Speed (1994), score=5.0
Anastasia (1997), score=5.0
People vs. Larry Flynt, The (1996), score=5.0
Casablanca (1942), score=5.0
Trainspotting (1996), score=5.0
Courage Under Fire (1996), score=5.0
Money Talks (1997), score=5.0
------------------------

In this tutorial, we showed how to use the Mahout recommendation engine. However, we only scratched the surface of Mahout's capabilities. There is a lot more to it, and you can see the list of algorithms implemented by Mahout on the Mahout wiki page.

About chimpler
http://www.chimpler.com

17 Responses to Playing with the Mahout recommendation engine on a Hadoop cluster

  1. Jason Gowans says:

    Thanks for sharing – really useful.

  2. Pingback: Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages | Chimpler

  3. Pingback: Generating EigenFaces with Mahout SVD to recognize person faces | Chimpler

  4. that was one of the few mahout/hadoop tutorials that work perfectly. real help for first time experimentation! thanks!

    • chimpler says:

      Thank you Anilabh!

      • is there a way to process incremental data on mahout? so i got 100,000 movies rated and recommended. now i got a new set of 50 movies. some of those have been rated by users. how do i add these new movie recommendations for existing users? do i need to reprocess the entire set (including the new movies) again?

  5. Pingback: Playing with the Mahout recommendation engine o...

  6. Pingback: Finding association rules with Mahout Frequent Pattern Mining | Chimpler

  7. Tarun Gulyani says:

    Really Nice Article…Thanks Chimpler…How can we predict rating from user to movie ..instead of recommend the movie to user?

  8. Renuka SEO says:

    The Information which you provided is very much useful for Hadoop Online Training Learners Thank You for Sharing Valuable Information

  9. Reblogged this on techtogive and commented:
    A really working tutorial on “How to use Mahout recommendation”

  10. Lays says:

    I configured the JAVA_HOME but I got the following error when I try to format the namenode

    JAVA_HOME is not set.

    what could it be?

  11. Lays says:

    sorry, already found what was wrong.

    Thank you!

  12. Pingback: Using Amazon’s Elastic Map Reduce to compute recommendations with Apache Mahout 0.8 | Blog of Adam Warski

  13. Pingback: Using Amazon’s Elastic MapReduce to Compute Recommendations with Apache Mahout 0.8 | Big Data NewsBig Data News

  14. sudheer1313 says:

    Thanks for this valuble information and itis useful for us .Biginfosys also provides the best online Hadoop training classes.
