June | 2013 | Chimpler

Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages (part 2: distribute classification with hadoop)

2013/06/24

In this post, we categorize the tweets by distributing the classification across a Hadoop cluster. This can speed up the classification when there is a very large number of tweets to classify.

To follow this tutorial, you need to have run the commands from the post Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages.

To distribute the classification across the Hadoop nodes, we define a MapReduce job that works as follows:

the CSV file containing the tweets to classify is split into several chunks
each chunk is sent to the Hadoop node that will process it by running the map class
the map class loads the Naive Bayes model and the document/word frequencies into memory
for each tweet in the chunk, the map class computes the best matching category and writes the result to the output file. We do not use a reducer class because no aggregation is needed.
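
To make the map-only flow above concrete, here is a minimal sketch in plain Java. It is not the code from the repository and it uses no Hadoop or Mahout classes: the class name TweetChunkClassifier, the toy categories, the word counts and the two-column "id,text" line format are all assumptions made for illustration. The scoring is a simple multinomial naive Bayes with add-one smoothing, standing in for the model that the real map class loads into memory.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java stand-in for the map step (hypothetical, no Hadoop/Mahout APIs).
class TweetChunkClassifier {
    private final Map<String, Map<String, Integer>> wordCounts; // category -> word -> count
    private final Map<String, Integer> docCounts;               // category -> training documents
    private final int vocabularySize;
    private final int totalDocs;

    // "Loading the model into memory": done once, before any tweet is processed.
    TweetChunkClassifier(Map<String, Map<String, Integer>> wordCounts,
                         Map<String, Integer> docCounts) {
        this.wordCounts = wordCounts;
        this.docCounts = docCounts;
        Set<String> vocabulary = new HashSet<>();
        for (Map<String, Integer> counts : wordCounts.values()) {
            vocabulary.addAll(counts.keySet());
        }
        this.vocabularySize = vocabulary.size();
        int total = 0;
        for (int n : docCounts.values()) {
            total += n;
        }
        this.totalDocs = total;
    }

    // Returns the category with the highest log-probability score for the text.
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            Map<String, Integer> counts =
                wordCounts.getOrDefault(category, Collections.<String, Integer>emptyMap());
            int totalWords = 0;
            for (int c : counts.values()) {
                totalWords += c;
            }
            // Log prior of the category.
            double score = Math.log((double) docCounts.get(category) / totalDocs);
            for (String token : text.toLowerCase().split("\\W+")) {
                if (token.isEmpty()) {
                    continue;
                }
                // Add-one (Laplace) smoothing so unseen words do not zero the score.
                int c = counts.getOrDefault(token, 0);
                score += Math.log((c + 1.0) / (totalWords + vocabularySize));
            }
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }

    // Map-only step: one output line per well-formed input line, no reducer.
    List<String> mapChunk(List<String> csvLines) {
        List<String> output = new ArrayList<>();
        for (String line : csvLines) {
            String[] parts = line.split(",", 2); // assumed format: id,text
            if (parts.length < 2) {
                continue; // skip malformed lines
            }
            output.add(parts[0] + "\t" + classify(parts[1]));
        }
        return output;
    }

    // Hypothetical toy model with two made-up categories.
    static TweetChunkClassifier toyModel() {
        Map<String, Integer> tech = new HashMap<>();
        tech.put("hadoop", 3);
        tech.put("cluster", 2);
        tech.put("java", 2);
        Map<String, Integer> food = new HashMap<>();
        food.put("pizza", 3);
        food.put("cheese", 2);
        food.put("recipe", 2);
        Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
        wordCounts.put("tech", tech);
        wordCounts.put("food", food);
        Map<String, Integer> docCounts = new HashMap<>();
        docCounts.put("tech", 3);
        docCounts.put("food", 3);
        return new TweetChunkClassifier(wordCounts, docCounts);
    }
}
```

In this sketch, each call to mapChunk plays the role of one map task working on its own chunk. Because every tweet is classified independently of the others, the per-chunk outputs can simply be written out side by side, which is why no reducer is required.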

The code used in this post is available on GitHub:

$ git clone https://github.com/fredang/mahout-naive-bayes-example2.git

To compile the project:

$ mvn clean package assembly:single
