# Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages

Classification algorithms can be used to automatically classify documents and images, implement spam filters, and solve problems in many other domains. In this tutorial we are going to use Mahout to classify tweets using the Naive Bayes classifier. The algorithm works from a training set: a set of documents already associated with a category. Using this set, the classifier determines, for each word, the probability that it makes a document belong to each of the considered categories. To compute the probability that a document belongs to a category, it multiplies together the individual probabilities of each of its words in this category. The category with the highest probability is the one the document is most likely to belong to.
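To make the idea concrete, here is a toy sketch in Python 3 (the word probabilities are made up for illustration and this is not Mahout's implementation; note that real implementations sum log probabilities instead of multiplying raw probabilities, to avoid numerical underflow — which is why the scores printed later in this tutorial are negative):

```python
import math

# toy per-category word probabilities (made-up numbers for illustration)
word_probs = {
    "tech":    {"laptop": 0.40, "shoes": 0.05, "deal": 0.30},
    "apparel": {"laptop": 0.05, "shoes": 0.45, "deal": 0.30},
}

def classify(words):
    scores = {}
    for category, probs in word_probs.items():
        # sum of log probabilities == log of the product of probabilities
        # unknown words get a small default probability
        scores[category] = sum(math.log(probs.get(w, 0.01)) for w in words)
    # the category with the highest score wins
    return max(scores, key=scores.get)

print(classify(["laptop", "deal"]))  # -> tech
print(classify(["shoes", "deal"]))   # -> apparel
```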

To get more details on how the Naive Bayes classifier is implemented, you can look at the Mahout wiki page.

This tutorial will give you a step-by-step description on how to create a training set, train the Naive Bayes classifier and then use it to classify new tweets.

# Requirements

For this tutorial, you will need:

• jdk >= 1.6
• maven
• hadoop (preferably 1.1.1)
• mahout >= 0.7

To install hadoop and mahout, you can follow the steps described in a previous post that shows how to use the mahout recommender.

When you are done installing hadoop and mahout, make sure they are in your PATH so you can easily call them:

export PATH=$PATH:[HADOOP_DIR]/bin:[MAHOUT_DIR]/bin

In our tutorial, we will limit the tweets to deals by getting the tweets containing the hashtags #deal, #deals and #discount. We will classify them in the following categories:

• apparel (clothes, shoes, watches, …)
• art (Book, DVD, Music, …)
• camera
• event (travel, concert, …)
• health (beauty, spa, …)
• home (kitchen, furniture, garden, …)
• tech (computer, laptop, tablet, …)

You can get the scripts and java programs used in this tutorial from our git repository on github:

$ git clone https://github.com/fredang/mahout-naive-bayes-example.git

You can compile the java programs by typing:

$ mvn clean package assembly:single

# Preparing the training set

UPDATE (2013/06/23): this section was updated to support the twitter 1.1 api (1.0 was just shut down).

As preparing the training set is very time-consuming, we have provided a training set in the source repository so that you don't need to build it. The file is data/tweets-train.tsv. If you choose to use it, you can jump directly to the next section.

To prepare a training set, we fetched tweets with the hashtags #deals, #deal or #discount using the script twitter_fetcher.py. It uses the tweepy 2.1 library (make sure to install the latest version, as we have to use the twitter 1.1 api now). You can install it by typing:

git clone https://github.com/tweepy/tweepy.git
cd tweepy
sudo python setup.py install

You need a consumer key/secret and an access token key/secret to use the api. If you don't have them, simply log in on the twitter website, go to https://dev.twitter.com/apps and create a new application.
When you are done, you should see the Consumer key and Consumer secret in the section 'OAuth settings', and the Access token and Access token secret in the section 'Your access token'.

Edit the file script/twitter_fetcher.py and change the following lines to use your twitter keys and secrets:

CONSUMER_KEY='REPLACE_CONSUMER_KEY'
CONSUMER_SECRET='REPLACE_CONSUMER_SECRET'
ACCESS_TOKEN_KEY='REPLACE_ACCESS_TOKEN_KEY'
ACCESS_TOKEN_SECRET='REPLACE_ACCESS_TOKEN_SECRET'


You can now run the script:

$ python scripts/twitter_fetcher.py 5 > tweets-train.tsv

Code to fetch tweets:

import tweepy
import sys

CONSUMER_KEY = 'REPLACE_CONSUMER_KEY'
CONSUMER_SECRET = 'REPLACE_CONSUMER_SECRET'
ACCESS_TOKEN_KEY = 'REPLACE_ACCESS_TOKEN_KEY'
ACCESS_TOKEN_SECRET = 'REPLACE_ACCESS_TOKEN_SECRET'

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN_KEY, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

pageCount = 5
if len(sys.argv) >= 2:
    pageCount = int(sys.argv[1])

hashtags = ['deal', 'deals', 'discount']
for tag in hashtags:
    maxId = 999999999999999999999
    for i in range(1, pageCount + 1):
        results = api.search(q='#%s' % tag, max_id=maxId, count=100)
        for result in results:
            maxId = min(maxId, result.id)
            # only keep tweets pointing to a web page
            if result.text.find("http:") != -1:
                print "%s\t%s" % (result.id, result.text.encode('utf-8').replace('\n', ' '))

The file tweets-train.tsv contains a list of tweets in a tab-separated-value format. The first number is the tweet id, followed by the tweet message:

308215054011194110	Limited 3-Box $20 BOGO, Supreme $9 BOGO, PTC Basketball $10 BOGO, Sterling Baseball $20 BOGO, Bowman Chrome $7 http://t.co/WMdbNFLvVZ #deals
308215054011194118	Purchase The Jeopardy! Book by Alex Trebek, Peter Barsocchini for only $4 #book #deals - http://t.co/Aw5EzlQYbs @ThriftBooksUSA
308215054011194146	#Shopping #Bargain #Deals Designer KATHY Van Zeeland Luggage & Bags @ http://t.co/GJC83p8eKh

To transform this into a training set, you can use your favorite editor and add the category of the tweet at the beginning of each line, followed by a tab character:

tech	308215054011194110	Limited 3-Box $20 BOGO, Supreme $9 BOGO, PTC Basketball $10 BOGO, Sterling Baseball $20 BOGO, Bowman Chrome $7 http://t.co/WMdbNFLvVZ #deals
art	308215054011194118	Purchase The Jeopardy! Book by Alex Trebek, Peter Barsocchini for only $4 #book #deals - http://t.co/Aw5EzlQYbs @ThriftBooksUSA
apparel	308215054011194146	#Shopping #Bargain #Deals Designer KATHY Van Zeeland Luggage & Bags @ http://t.co/GJC83p8eKh

Make sure to use a tab between the category and the tweet id, and between the tweet id and the tweet message. For the classifier to work properly, this set must have at least 50 tweet messages in each category.

# Training the model with Mahout

First we need to convert the training set to the hadoop sequence file format:

$ java -cp target/twitter-naive-bayes-example-1.0-jar-with-dependencies.jar com.chimpler.example.bayes.TweetTSVToSeq data/tweets-train.tsv tweets-seq

The sequence file has as key /[category]/[id] and as value the tweet message.

Code to convert the tweet TSV file to a sequence file:

import java.io.BufferedReader;
import java.io.FileReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.SequenceFile.Writer;
import org.apache.hadoop.io.Text;

public class TweetTSVToSeq {
    public static void main(String args[]) throws Exception {
        if (args.length != 2) {
            System.err.println("Arguments: [input tsv file] [output sequence file]");
            return;
        }
        String inputFileName = args[0];
        String outputDirName = args[1];

        Configuration configuration = new Configuration();
        FileSystem fs = FileSystem.get(configuration);
        Writer writer = new SequenceFile.Writer(fs, configuration, new Path(outputDirName + "/chunk-0"),
                Text.class, Text.class);

        int count = 0;
        BufferedReader reader = new BufferedReader(new FileReader(inputFileName));
        Text key = new Text();
        Text value = new Text();
        while (true) {
            String line = reader.readLine();
            if (line == null) {
                break;
            }
            String[] tokens = line.split("\t", 3);
            if (tokens.length != 3) {
                System.out.println("Skip line: " + line);
                continue;
            }
            String category = tokens[0];
            String id = tokens[1];
            String message = tokens[2];
            key.set("/" + category + "/" + id);
            value.set(message);
            writer.append(key, value);
            count++;
        }
        reader.close();
        writer.close();
        System.out.println("Wrote " + count + " entries.");
    }
}


Then we upload this file to HDFS:

$ hadoop fs -put tweets-seq tweets-seq

We can run mahout to transform the training set into vectors using tf-idf weights (term frequency x inverse document frequency):

$ mahout seq2sparse -i tweets-seq -o tweets-vectors

It will generate the following files in HDFS in the directory tweets-vectors:

• df-count: sequence file with association word id => number of documents containing this word
• dictionary.file-0: sequence file with association word => word id
• frequency.file-0: sequence file with association word id => word count
• tf-vectors: sequence file with the term frequency for each document
• tfidf-vectors: sequence file with association document id => tfidf weight for each word in the document
• tokenized-documents: sequence file with association document id => list of words
• wordcount: sequence file with association word => word count
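The tf-idf weighting behind these vectors can be sketched with the textbook formula (a simplified illustration with made-up numbers; Mahout's TFIDF class uses a Lucene-style variant, so its exact values differ):

```python
import math

def tfidf(term_count, doc_freq, num_docs):
    # term frequency x inverse document frequency:
    # words frequent in this document but rare across the corpus get a high weight
    return term_count * math.log(num_docs / doc_freq)

# a word like "deal" that appears in almost every tweet gets a weight close to 0
print(tfidf(1, 450, 500))
# a word appearing in only a few tweets gets a much larger weight
print(tfidf(1, 3, 500))
```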

In order to do the training and check that the classification works fine, Mahout splits the set into two sets: a training set and a testing set:

$ mahout split -i tweets-vectors/tfidf-vectors --trainingOutput train-vectors --testOutput test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

We use the training set to train the classifier:

$ mahout trainnb -i train-vectors -el -li labelindex -o model -ow -c

It creates the model (a matrix of word id x label id) and a label index (mapping between each label and its id).

To test that the classifier is working properly on the training set:

$ mahout testnb -i train-vectors -m model -l labelindex -ow -o tweets-testing -c
[...]
Summary
-------------------------------------------------------
Correctly Classified Instances          :        314	   97.2136%
Incorrectly Classified Instances        :          9	    2.7864%
Total Classified Instances              :        323

=======================================================
Confusion Matrix
-------------------------------------------------------
a    	b    	c    	d    	e    	f    	g    	<--Classified as
45   	0    	0    	0    	0    	0    	1    	 |  46    	a     = apparel
0    	35   	0    	0    	0    	0    	0    	 |  35    	b     = art
0    	0    	34   	0    	0    	0    	0    	 |  34    	c     = camera
0    	0    	0    	39   	0    	0    	0    	 |  39    	d     = event
0    	0    	0    	0    	23   	0    	0    	 |  23    	e     = health
1    	1    	0    	0    	1    	48   	2    	 |  53    	f     = home
0    	0    	1    	0    	1    	1    	90   	 |  93    	g     = tech

And on the testing set:

$ mahout testnb -i test-vectors -m model -l labelindex -ow -o tweets-testing -c
[...]
Summary
-------------------------------------------------------
Correctly Classified Instances          :        121	   78.0645%
Incorrectly Classified Instances        :         34	   21.9355%
Total Classified Instances              :        155

=======================================================
Confusion Matrix
-------------------------------------------------------
a    	b    	c    	d    	e    	f    	g    	<--Classified as
27   	1    	1    	1    	2    	2    	2    	 |  36    	a     = apparel
1    	22   	0    	2    	1    	0    	0    	 |  26    	b     = art
0    	1    	27   	1    	0    	0    	1    	 |  30    	c     = camera
0    	1    	0    	23   	4    	0    	0    	 |  28    	d     = event
0    	1    	0    	2    	9    	2    	0    	 |  14    	e     = health
0    	1    	1    	1    	2    	13   	1    	 |  19    	f     = home
0    	0    	2    	0    	0    	0    	0    	 |  2     	g     = tech
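The summary numbers come straight from the confusion matrix: the diagonal holds the correctly classified instances. A quick Python sketch of the computation, using the testing-set matrix:

```python
# rows: true category, columns: predicted category (apparel .. tech)
confusion = [
    [27,  1,  1,  1,  2,  2,  2],
    [ 1, 22,  0,  2,  1,  0,  0],
    [ 0,  1, 27,  1,  0,  0,  1],
    [ 0,  1,  0, 23,  4,  0,  0],
    [ 0,  1,  0,  2,  9,  2,  0],
    [ 0,  1,  1,  1,  2, 13,  1],
    [ 0,  0,  2,  0,  0,  0,  0],
]

# diagonal entries are the correctly classified instances
correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
print("%d / %d = %.4f%%" % (correct, total, 100.0 * correct / total))
# prints: 121 / 155 = 78.0645%
```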

If the percentage of correctly classified instances is too low, you might need to improve your training set: add more tweets, rework categories that are too similar to each other, or remove categories that are used very rarely. After you are done with your changes, you will need to restart the training process.

To use the classifier to classify new documents, we need to copy several files from HDFS:

• model (matrix word id x label id)
• labelindex (mapping between a label and its id)
• dictionary.file-0 (mapping between a word and its id)
• df-count (document frequency: number of documents each word is appearing in)
$ hadoop fs -get labelindex labelindex
$ hadoop fs -get model model
$ hadoop fs -get tweets-vectors/dictionary.file-0 dictionary.file-0
$ hadoop fs -getmerge tweets-vectors/df-count df-count

To get some new tweets to classify, you can run the twitter fetcher again (or use the file provided in data/tweets-to-classify.tsv):

$ python scripts/twitter_fetcher.py 1 > tweets-to-classify.tsv

Now we can run the classifier on this file:

$ java -cp target/twitter-naive-bayes-example-1.0-jar-with-dependencies.jar com.chimpler.example.bayes.Classifier model labelindex dictionary.file-0 df-count data/tweets-to-classify.tsv
Number of labels: 7
Number of documents: 486
Tweet: 309836558624768000	eBay - Porter Cable 18V Ni CAD 2-Tool Combo Kit (Refurbished) $56.99 http://t.co/pCSSlSq2c1 #Deal - http://t.co/QImHB6xJ5b
apparel: -252.96630831136127  art: -246.9351025603821  camera: -262.28340417385357  event: -262.5573608070056  health: -238.17884382282813  home: -253.05135616792995  tech: -232.911841377148 => tech
Tweet: 309836557379043329	Newegg - BenQ GW2750HM 27" Widescreen LED Backlight LCD Monitor $209.99 http://t.co/6ezbjGZIta #Deal - http://t.co/QImHB6xJ5b
apparel: -287.5588179141781  art: -284.27401807389435  camera: -278.4968305457808  event: -292.56786244190556  health: -292.22158238362204  home: -281.9809996515652  tech: -253.34354804349476 => tech
Tweet: 309836556355657728	J and R - Roku 3 Streaming Player 4200R $89.99 http://t.co/BAaMEmEdCm #Deal - http://t.co/QImHB6xJ5b
apparel: -192.44260718853357  art: -187.6881145121525  camera: -175.8783440835461  event: -191.74948688734446  health: -190.45406023882765  home: -192.9107077937349  tech: -185.52068485514894 => camera
Tweet: 309836555248361472	eBay - Adidas Adicross 2011 Men's Spikeless Golf Shoes $42.99 http://t.co/oRt8JIQB6v #Deal - http://t.co/QImHB6xJ5b
apparel: -133.86214565455646  art: -174.44106424825426  camera: -188.66719939648308  event: -188.83296276708387  health: -188.188838820323  home: -178.13519042380085  tech: -190.77172248114303 => apparel
Tweet: 309836554187202560	Buydig - Tamron 18-270mm Di Lens for Canon + Canon 50mm F/1.8 Lens $464 http://t.co/Dqj9DdqmTf #Deal - http://t.co/QImHB6xJ5b
apparel: -218.82418584296866  art: -228.25052760371423  camera: -183.46066199290763  event: -245.186963518965  health: -244.70464331200444  home: -236.16560862254997  tech: -244.41186823539707 => camera

Code to classify the tweets using the model and the dictionary file:

public class Classifier {

    public static Map<String, Integer> readDictionnary(Configuration conf, Path dictionnaryPath) {
        Map<String, Integer> dictionnary = new HashMap<String, Integer>();
        for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(dictionnaryPath, true, conf)) {
            dictionnary.put(pair.getFirst().toString(), pair.getSecond().get());
        }
        return dictionnary;
    }

    public static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path documentFrequencyPath) {
        Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
        for (Pair<IntWritable, LongWritable> pair : new SequenceFileIterable<IntWritable, LongWritable>(documentFrequencyPath, true, conf)) {
            documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
        }
        return documentFrequency;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 5) {
            System.out.println("Arguments: [model] [label index] [dictionnary] [document frequency] [tweet file]");
            return;
        }
        String modelPath = args[0];
        String labelIndexPath = args[1];
        String dictionaryPath = args[2];
        String documentFrequencyPath = args[3];
        String tweetsPath = args[4];

        Configuration configuration = new Configuration();

        // model is a matrix (wordId, labelId) => probability score
        NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), configuration);
        StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);

        // labels is a map label => classId
        Map<Integer, String> labels = BayesUtils.readLabelIndex(configuration, new Path(labelIndexPath));
        Map<String, Integer> dictionary = readDictionnary(configuration, new Path(dictionaryPath));
        Map<Integer, Long> documentFrequency = readDocumentFrequency(configuration, new Path(documentFrequencyPath));

        // analyzer used to extract words from tweets
        Analyzer analyzer = new DefaultAnalyzer();

        int labelCount = labels.size();
        int documentCount = documentFrequency.get(-1).intValue();

        System.out.println("Number of labels: " + labelCount);
        System.out.println("Number of documents in training set: " + documentCount);

        BufferedReader reader = new BufferedReader(new FileReader(tweetsPath));
        while (true) {
            String line = reader.readLine();
            if (line == null) {
                break;
            }

            String[] tokens = line.split("\t", 2);
            String tweetId = tokens[0];
            String tweet = tokens[1];

            System.out.println("Tweet: " + tweetId + "\t" + tweet);

            Multiset<String> words = ConcurrentHashMultiset.create();

            // extract words from tweet
            TokenStream ts = analyzer.reusableTokenStream("text", new StringReader(tweet));
            CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            int wordCount = 0;
            while (ts.incrementToken()) {
                if (termAtt.length() > 0) {
                    String word = ts.getAttribute(CharTermAttribute.class).toString();
                    Integer wordId = dictionary.get(word);
                    // if the word is not in the dictionary, skip it
                    if (wordId != null) {
                        words.add(word);
                        wordCount++;
                    }
                }
            }

            // create vector wordId => weight using tfidf
            Vector vector = new RandomAccessSparseVector(10000);
            TFIDF tfidf = new TFIDF();
            for (Multiset.Entry<String> entry : words.entrySet()) {
                String word = entry.getElement();
                int count = entry.getCount();
                Integer wordId = dictionary.get(word);
                Long freq = documentFrequency.get(wordId);
                double tfIdfValue = tfidf.calculate(count, freq.intValue(), wordCount, documentCount);
                vector.setQuick(wordId, tfIdfValue);
            }

            // With the classifier, we get one score for each label.
            // The label with the highest score is the one the tweet is most likely
            // to be associated with.
            Vector resultVector = classifier.classifyFull(vector);
            double bestScore = -Double.MAX_VALUE;
            int bestCategoryId = -1;
            for (Element element : resultVector) {
                int categoryId = element.index();
                double score = element.get();
                if (score > bestScore) {
                    bestScore = score;
                    bestCategoryId = categoryId;
                }
                System.out.print("  " + labels.get(categoryId) + ": " + score);
            }
            System.out.println(" => " + labels.get(bestCategoryId));
        }
        reader.close();
    }
}

Most of the tweets are classified properly but some are not. For example, the tweet "J and R - Roku 3 Streaming Player 4200R $89.99" is incorrectly classified as camera. To fix that, we can add this tweet to the training set and classify it as tech. You can do the same for the other tweets that are incorrectly classified. When you are done, you can repeat the training process and check the results again.
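Note that the scores printed by the classifier are in log space, so they are all negative; the predicted category is simply the one with the highest (least negative) score. A small Python sketch of this selection, using the scores printed for the Roku tweet above:

```python
# scores for the Roku 3 tweet above (log space, so all negative)
scores = {
    "apparel": -192.44260718853357,
    "art": -187.6881145121525,
    "camera": -175.8783440835461,
    "event": -191.74948688734446,
    "health": -190.45406023882765,
    "home": -192.9107077937349,
    "tech": -185.52068485514894,
}

# the highest (least negative) log score wins
best = max(scores, key=scores.get)
print(best)  # -> camera (the misclassification discussed above)
```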

# Conclusion

In this tutorial we have seen how to build a training set and how to use it with Mahout to train a Naive Bayes model. We showed how to test the classifier and how to improve the training set to get a better classification. Finally, we used it to build an application that automatically assigns a category to a tweet. In this post, we only studied one Mahout classifier among many others: SGD, SVM, neural networks, random forests, etc. We will see in future posts how to use them.

# Misc

## View content of sequence files

To show the content of a file in HDFS, you can use the command:

$ hadoop fs -text [FILE_NAME]

However, some sequence files are encoded using mahout classes. You can tell hadoop where to find those classes by editing the file [HADOOP_DIR]/conf/hadoop-env.sh and adding the following line:

export HADOOP_CLASSPATH=[MAHOUT_DIR]/mahout-math-0.7.jar:[MAHOUT_DIR]/mahout-examples-0.7-job.jar

After restarting hadoop, you can use the command mahout seqdumper:

$ mahout seqdumper -i [FILE_NAME]

## View the words which are the most representative of each category

You can use the class TopCategoryWords, which shows the top 10 words of each category.

public class TopCategoryWords {

    public static Map<Integer, String> readInverseDictionnary(Configuration conf, Path dictionnaryPath) {
        Map<Integer, String> inverseDictionnary = new HashMap<Integer, String>();
        for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(dictionnaryPath, true, conf)) {
            inverseDictionnary.put(pair.getSecond().get(), pair.getFirst().toString());
        }
        return inverseDictionnary;
    }

    public static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path documentFrequencyPath) {
        Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
        for (Pair<IntWritable, LongWritable> pair : new SequenceFileIterable<IntWritable, LongWritable>(documentFrequencyPath, true, conf)) {
            documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
        }
        return documentFrequency;
    }

    public static class WordWeight implements Comparable<WordWeight> {
        private int wordId;
        private double weight;

        public WordWeight(int wordId, double weight) {
            this.wordId = wordId;
            this.weight = weight;
        }

        public int getWordId() {
            return wordId;
        }

        public Double getWeight() {
            return weight;
        }

        @Override
        public int compareTo(WordWeight w) {
            return -getWeight().compareTo(w.getWeight());
        }
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 4) {
            System.out.println("Arguments: [model] [label index] [dictionnary] [document frequency]");
            return;
        }
        String modelPath = args[0];
        String labelIndexPath = args[1];
        String dictionaryPath = args[2];
        String documentFrequencyPath = args[3];

        Configuration configuration = new Configuration();

        // model is a matrix (wordId, labelId) => probability score
        NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), configuration);

        StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);

        // labels is a map label => classId
        Map<Integer, String> labels = BayesUtils.readLabelIndex(configuration, new Path(labelIndexPath));
        Map<Integer, String> inverseDictionary = readInverseDictionnary(configuration, new Path(dictionaryPath));
        Map<Integer, Long> documentFrequency = readDocumentFrequency(configuration, new Path(documentFrequencyPath));

        int labelCount = labels.size();
        int documentCount = documentFrequency.get(-1).intValue();

        System.out.println("Number of labels: " + labelCount);
        System.out.println("Number of documents in training set: " + documentCount);

        for (int labelId = 0; labelId < model.numLabels(); labelId++) {
            SortedSet<WordWeight> wordWeights = new TreeSet<WordWeight>();
            for (int wordId = 0; wordId < model.numFeatures(); wordId++) {
                WordWeight w = new WordWeight(wordId, model.weight(labelId, wordId));
                wordWeights.add(w);
            }
            System.out.println("Top 10 words for label " + labels.get(labelId));
            int i = 0;
            for (WordWeight w : wordWeights) {
                System.out.println(" - " + inverseDictionary.get(w.getWordId())
                        + ": " + w.getWeight());
                i++;
                if (i >= 10) {
                    break;
                }
            }
        }
    }
}


## Using your own testing set with mahout

Previously, we showed how to generate a testing set from the training set using the mahout split command.

In this section, we describe how to use our own testing set and run mahout on it to check the accuracy of the classifier.

We have a small testing set in data/tweets-test-set.tsv that we transform into a tf-idf vector sequence file:
the tweet words are converted into word ids using the dictionary file and associated with their tf x idf value:

public class TweetTSVToTrainingSetSeq {

    public static Map<String, Integer> readDictionnary(Configuration conf, Path dictionnaryPath) {
        Map<String, Integer> dictionnary = new HashMap<String, Integer>();
        for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(dictionnaryPath, true, conf)) {
            dictionnary.put(pair.getFirst().toString(), pair.getSecond().get());
        }
        return dictionnary;
    }

    public static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path documentFrequencyPath) {
        Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
        for (Pair<IntWritable, LongWritable> pair : new SequenceFileIterable<IntWritable, LongWritable>(documentFrequencyPath, true, conf)) {
            documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
        }
        return documentFrequency;
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 4) {
            System.out.println("Arguments: [dictionnary] [document frequency] [tweet file] [output file]");
            return;
        }
        String dictionaryPath = args[0];
        String documentFrequencyPath = args[1];
        String tweetsPath = args[2];
        String outputFileName = args[3];

        Configuration configuration = new Configuration();
        FileSystem fs = FileSystem.get(configuration);

        Map<String, Integer> dictionary = readDictionnary(configuration, new Path(dictionaryPath));
        Map<Integer, Long> documentFrequency = readDocumentFrequency(configuration, new Path(documentFrequencyPath));
        int documentCount = documentFrequency.get(-1).intValue();

        Writer writer = new SequenceFile.Writer(fs, configuration, new Path(outputFileName),
                Text.class, VectorWritable.class);
        Text key = new Text();
        VectorWritable value = new VectorWritable();

        Analyzer analyzer = new DefaultAnalyzer();
        BufferedReader reader = new BufferedReader(new FileReader(tweetsPath));
        while (true) {
            String line = reader.readLine();
            if (line == null) {
                break;
            }

            String[] tokens = line.split("\t", 3);
            String label = tokens[0];
            String tweetId = tokens[1];
            String tweet = tokens[2];

            key.set("/" + label + "/" + tweetId);

            Multiset<String> words = ConcurrentHashMultiset.create();

            // extract words from tweet
            TokenStream ts = analyzer.reusableTokenStream("text", new StringReader(tweet));
            CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            int wordCount = 0;
            while (ts.incrementToken()) {
                if (termAtt.length() > 0) {
                    String word = ts.getAttribute(CharTermAttribute.class).toString();
                    Integer wordId = dictionary.get(word);
                    // if the word is not in the dictionary, skip it
                    if (wordId != null) {
                        words.add(word);
                        wordCount++;
                    }
                }
            }

            // create vector wordId => weight using tfidf
            Vector vector = new RandomAccessSparseVector(10000);
            TFIDF tfidf = new TFIDF();
            for (Multiset.Entry<String> entry : words.entrySet()) {
                String word = entry.getElement();
                int count = entry.getCount();
                Integer wordId = dictionary.get(word);
                Long freq = documentFrequency.get(wordId);
                double tfIdfValue = tfidf.calculate(count, freq.intValue(), wordCount, documentCount);
                vector.setQuick(wordId, tfIdfValue);
            }
            value.set(vector);

            writer.append(key, value);
        }
        reader.close();
        writer.close();
    }
}


To run the program:

$ java -cp target/twitter-naive-bayes-example-1.0-jar-with-dependencies.jar com.chimpler.example.bayes.TweetTSVToTrainingSetSeq dictionary.file-0 df-count data/tweets-test-set.tsv tweets-test-set.seq

To copy the generated seq file to hdfs:

$ hadoop fs -put tweets-test-set.seq tweets-test-set.seq

To run mahout testnb on this sequence file, following the same pattern as before:

$ mahout testnb -i tweets-test-set.seq -m model -l labelindex -ow -o tweets-testing -c

## Errors

When running the script to convert the tweet TSV messages, you may get the following errors:

Skip line: tech	309167277155168257      Easy web hosting. $4.95 - http://t.co/0oUGS6Oj0e - Review/Coupon- http://t.co/zdgH4kv5sv #wordpress #deal #bluehost #blue host Skip line: art 309167270989541376 Beautiful Jan Royce Conant Drawing of Jamaica - 1982 - Rare CT Artist - Animals #CPTV #EBAY #FineArt #Deals http://t.co/MUZf5aixMz Make sure that the category and the tweet id are followed by a tab character and not spaces. To run the classifier on the hadoop cluster, you can read the post part 2: distribute classification with hadoop. About chimpler http://www.chimpler.com ### 69 Responses to Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages 1. YP says: Can you please provide the import class statements? 2. vanitha says: hi you are blog is giving nice explanation about mahout navie bayees classification implementation. to use ur java code shall i need to include any licence please suggest me. thank u • chimpler says: Hi Vanitha, Thank you. According to the Apache licence, you should keep the licence disclaimer in the header of the file. If you use only part of the code, it’s really up to you whether or not you want to include the licence in the header. 3. vanitha says: hi Thank you for your reply. the classifier logic which u have implemented in java. i need to write as map reduce job. how to read model and and dictonary path using map reduce job.please suggest me . • chimpler says: Hi Vanitha, There are several ways to do it. For the mapper class, you can override the method setup: public class MyMapper extends Mapper { private static StandardNaiveBayesClassifier classifier; [...] 
// it gets initialized when the task is initialized public void setup(Mapper.Context context) { if (classifier == null) { // we are reading the data from the model directory in hdfs // you should see this directory when typing from the command line: // hadoop fs ls model String modelPath = "model"; NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), context.getConfiguration()); classifier = new StandardNaiveBayesClassifier(model); } } public void map(K key, V value, Mapper.Context context) { // you can use the classifier object here } [...] }  By defining the classifier as static, we only initialize it one time for each JVM. To make sure that a new JVM is not started for every task, you can use the option jvm reuse and set it to -1. It can be done when you setup your mapreduce jobs: JobConf.setNumTasksToExecutePerJvm(-1); You can read the definition file in the Reducer class and only read it once for each JVM using the same method. Let me know if that helps. • YP says: I tried the above method of passing a variable between setup method to mapper but no luck. It is still null. public static class MapClass extends Mapper{ private static StandardNaiveBayesClassifier classifier; private static Analyzer analyzer; protected void setup(Context context) throws IOException { NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), context.getConfiguration()); StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model); protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException { TokenStream ts = analyzer.tokenStream(“text”, new StringReader(tweet)); Getting error on above line. 
Analyzer is null

• chimpler says:

Hi YP,

The Analyzer needs to be initialized:

analyzer = new DefaultAnalyzer();

You can look at the new post http://chimpler.wordpress.com/2013/06/24/using-the-mahout-naive-bayes-classifier-to-automatically-classify-twitter-messages-part-2-distribute-classification-with-hadoop/ on how to distribute the classification on the hadoop nodes.

4. Michael says:

Hi, really helpful article! Thank you. I have one (maybe stupid) question: when creating the sequence file, you use the category AND the id as the key. But what if your data only has the known category and you are not able to build a unique key (e.g. you have one big file containing a category-text pair per line)? Is it OK to have repeating key values, or is it mandatory to generate unique keys in this case? Thanks.

5. vanitha says:

Hi, thank you for your reply. I have tried the implementation you suggested using void setup(). The problem is that the job runs fine as a plain Java application, but when run as a map reduce job it fails with an "org.apache.mahout.vectorizer.DefaultAnalyzer; org.apache.lucene.analysis.Analyzer classes not found" error. I have added the mahout jars inside the hadoop cluster. Please suggest a possible solution.

• chimpler says:

Hi Vanitha,

You would need to modify the HADOOP_CLASSPATH in hadoop-1.1.1/conf/hadoop-env.sh and set it to:

export HADOOP_CLASSPATH=~/mahout-distribution-0.7/mahout-core-0.7.jar:~/mahout-distribution-0.7/mahout-core-0.7-job.jar:~/mahout-distribution-0.7/mahout-math-0.7.jar:~/mahout-distribution-0.7/lib/*

If you are using maven, another option is to use the assembly plugin (http://maven.apache.org/plugins/maven-assembly-plugin/) so that your jar contains all the dependent classes (by compiling with mvn assembly:single). With this solution, you don't need to have the mahout jars installed on all the hadoop nodes. Let me know if it works for you.

6. vanitha says:

Hi, thank you for your help. It is now working fine for me.

7. vanitha says:

Hi, I am working on mahout naive bayes classification. The algorithm computes the probability for each category by multiplying the probability of each word in that category. How can I display each word and its probability from the model? Please give me an idea of how to read a model to display the top 5 features for each category.

• chimpler says:

Hi Vanitha,

We have just implemented a class that does that. However it's not showing the probability but the weight of the words. The weight formula (TFxIDF) is described on this page: https://cwiki.apache.org/MAHOUT/bayesian.html (look at CBayes). Let me know if that helps.

• smtc says:

Hi Chimplers, this is a great tutorial (probably the best on the net so far on this subject). I went through all the steps and all worked very well. I even applied it to my own data and got very desirable results. The only thing that didn't work is the class that counts the TopCategory words. I get the following error:

Exception in thread "main" java.io.FileNotFoundException: File model/naiveBayesModel.bin does not exist.
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:371)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:245)
at org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputChecker.<init>(ChecksumFileSystem.java:125)
at org.apache.hadoop.fs.ChecksumFileSystem.open(ChecksumFileSystem.java:283)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:400)
at org.apache.mahout.classifier.naivebayes.NaiveBayesModel.materialize(NaiveBayesModel.java:108)
at com.chimpler.example.bayes.TopCategoryWords.main(TopCategoryWords.java:92)

Apparently the model file is missing, but when I check the hadoop file location with

$ hadoop fs -ls model

I get:

Found 1 items
-rw-r--r-- 1 user1 supergroup 26907 2013-04-18 14:36 /user/user1/model/naiveBayesModel.bin

The file does exist. What am I doing wrong? Thanks in advance.

• smtc says:

I solved it. I had not copied the model file from hdfs. It works, I can now see the TopCategoryWords :)))

8. vanitha says:

Thank you for your reply, it really helped me to work on the probability scores. I have an issue with removing stop words ("have", "did", "from", etc.) that don't carry any useful information in the training data. I am trying --maxDFPercent 60 but this option is being ignored and the stop words are not removed from the feature set. Please suggest possible options. Thanks for your help.

• chimpler says:

Hi Vanitha,

If maxDFPercent does not remove the words used in 60% of the documents:

mahout seq2sparse -i tweets-seq -o tweets-vectors --maxDFPercent 60

you have two options:

1) You can get the list of english stopwords from http://www.textfixer.com/resources/common-english-words.txt:

String stopWordsString = "a,able,about,across,after,all,almost,also,am,among,an,and,any,are,as,at,be,because,been,but,by,can,cannot,could,dear,did,do,does,either,else,ever,every,for,from,get,got,had,has,have,he,her,hers,him,his,how,however,i,if,in,into,is,it,its,just,least,let,like,likely,may,me,might,most,must,my,neither,no,nor,not,of,off,often,on,only,or,other,our,own,rather,said,say,says,she,should,since,so,some,than,that,the,their,them,then,there,these,they,this,tis,to,too,twas,us,wants,was,we,were,what,when,where,which,while,who,whom,why,will,with,would,yet,you,your";
Set<String> stopWords = new HashSet<String>(Sets.newHashSet(stopWordsString.split(",")));

2) Or get the top X words using the documentFrequency map:

public static Map<Integer, Long> getTopWords(Map<Integer, Long> documentFrequency, int topWordsCount) {
    List<Map.Entry<Integer, Long>> entries = new ArrayList<Map.Entry<Integer, Long>>(documentFrequency.entrySet());
    Collections.sort(entries, new Comparator<Map.Entry<Integer, Long>>() {
        @Override
        public int compare(Entry<Integer, Long> e1, Entry<Integer, Long> e2) {
            return -e1.getValue().compareTo(e2.getValue());
        }
    });
    Map<Integer, Long> topWords = new HashMap<Integer, Long>();
    int i = 0;
    for (Map.Entry<Integer, Long> entry: entries) {
        topWords.put(entry.getKey(), entry.getValue());
        i++;
        if (i >= topWordsCount) {
            break;
        }
    }
    return topWords;
}

[...]

Map<Integer, String> inverseDictionary = readInverseDictionnary(configuration, new Path(dictionaryPath));
Map<Integer, Long> documentFrequency = readDocumentFrequency(configuration, new Path(documentFrequencyPath));
Map<Integer, Long> topWords = getTopWords(documentFrequency, 10);
Set<String> stopWords = new HashSet<String>();
for (Map.Entry<Integer, Long> entry: topWords.entrySet()) {
    stopWords.add(inverseDictionary.get(entry.getKey()));
}

Then you can use this stopWords set to skip those words when computing the scores:

if (stopWords.contains(word)) {
    // skip this word
    continue;
}

Hope that it helps.

• Tarun Gulyani says:

In Mahout, how can we make train vectors and test vectors for the Naive Bayes classifier manually instead of using the "--randomSelectionPct" option for the split? Following my understanding, I built the train and test vectors manually as:

bin/mahout seq2sparse -i TestSet0-seq -o TestSet0-vectors
bin/mahout seq2sparse -i TrainSet0-seq -o TrainSet0-vectors
/home/marvin1/hadoop-1.0.4/bin/hadoop fs -cp /user/marvin1/TestSet0-vectors/tfidf-vectors /user/marvin1/test-vectors
/home/marvin1/hadoop-1.0.4/bin/hadoop fs -cp /user/marvin1/TrainSet0-vectors/tfidf-vectors /user/marvin1/train-vectors

But with this the accuracy is just 1%. Here the data was split 90-10 manually. When I instead passed the complete data (train+test) to mahout and used "--randomSelectionPct 10", it gives an accuracy around 50%. Please let me know what I did wrong.

9. chimpler says:

Hi Tarun,

When you run seq2sparse, it converts the words into word ids and puts the mapping in dictionary.file-0. When you run:

bin/mahout seq2sparse -i TestSet0-seq -o TestSet0-vectors
bin/mahout seq2sparse -i TrainSet0-seq -o TrainSet0-vectors

the word => word id mapping will be different in the two sets, so the same word id will map to two different words in the testing and training sets. That's why the accuracy you get is very low.

We have just added a new section to this post, "Using your own testing set with mahout", to convert the testing set in CSV into a tf-idf vectors sequence file. The conversion uses the same word => word id mapping that is used in the training set. Let us know if it helps.

10. Tarun Gulyani says:

Oh, I didn't see that above. Thanks a lot, it solved my problem. I am now getting better results.

11. Shagun Jhaver says:

This is a great tutorial, thanks! I have a query: after we create a classifier, is there any way we can run the classification on the distributed file system itself, instead of copying the model files to the local system and then running the java code for it?

• chimpler says:

Thank you Shagun. To read directly from HDFS instead of the local filesystem, you can change the configuration object to make it use the settings defined in core-site.xml and hdfs-site.xml in the hadoop conf directory:

Configuration configuration = new Configuration();
configuration.addResource(new Path("/opt/hadoop-1.1.1/conf/core-site.xml"));
configuration.addResource(new Path("/opt/hadoop-1.1.1/conf/hdfs-site.xml"));

With this, all the objects relying on this configuration object will read from HDFS: SequenceFileIterable, NaiveBayesModel, ...

In the Classifier class, if you want to read the tweets file from HDFS instead of the local filesystem, you can use the method FileSystem.open. Instead of:

BufferedReader reader = new BufferedReader(new FileReader(tweetsPath));

do:

FileSystem fileSystem = FileSystem.get(configuration);
BufferedReader reader = new BufferedReader(new InputStreamReader(fileSystem.open(new Path(tweetsPath))));

Let us know if that helps.

• Shagun Jhaver says:

Thanks for the quick reply. Correct me if I'm wrong, but the code you mentioned would make the master node read files from hdfs and do the classification at the master node. Is there a way to shift the burden of this processing onto the slave nodes?

• chimpler says:

Yes you are right Shagun, with the code above the processing will be done by your local node. To make the processing distributed on the hadoop cluster, we can implement a map reduce job. The map function will accept as input key=tweet id, value=tweet message. It will run the classification on this tweet and output key=tweet id, value=category id. We don't need a reduce function. The resulting file will contain, for each tweet id, the category id that the classifier found. There might be some optimization to prevent each hadoop thread from reloading the model in memory at each execution. We will write a post on how to do this.

12. kaushik54 says:

How can I see the actual output, i.e. the attributes, text and the predicted category, instead of the summary at the end of the execution of that mahout command?

• chimpler says:

Hi,

If you use the mahout command line you cannot get that information. You would need to use the Classifier class provided in this post to get the attributes, text and predicted category. With "mahout testnb":

$ mahout testnb -i train-vectors -m model -l labelindex -ow -o tweets-testing -c

by looking at the output in tweets-testing, you can only get some information on the score:

$ mahout seqdumper -i tweets-testing/part-m-00000

Key: apparel: Value:
Key: apparel: Value: {0:-131.05278244010563,1:-190.28938458545053,2:-179.34751789477707,3:-174.60141541961985,4:-174.59629190657571,5:-181.1691413925544,6:-190.3951542401648}
Key: apparel: Value: {0:-66.78164807930018,1:-96.3273647506556,2:-98.71864411526653,3:-91.37887040377464,4:-89.81178537003822,5:-88.19329347217497,6:-103.1540466728606}
Key: apparel: Value: {0:-283.4610288938896,1:-387.7672283868884,2:-402.0972460642157,3:-403.8214845180489,4:-403.7424530656129,5:-377.1466290217178,6:-415.06647397949877}
[...]

If we look at the first line, it means that a document in the testing set that was classified as apparel has the following scores from the naive bayes classifier:

-131 for category 0 (apparel)
-190 for category 1 (art)
-179 for category 2 (camera)
-174 for category 3 (event)
-174 for category 4 (health)
-181 for category 5 (home)
-190 for category 6 (tech)

Mahout uses this to generate the confusion matrix that is displayed in the output of "mahout testnb".

• kaushik54 says:

Thanks for the reply, I got the output by redirecting it to a file, so now I can use the prediction. I used the following command:

mahout seqdumper -i tweets-testing/part-m-00000 > train-output.txt

• kaushik54 says:

In the above case I am just getting the key-value pairs. What if I want the output as the twitter user id and the category of his tweets? I can't get this information from {key, value}. At least a hint in that direction would help me.

• chimpler says:

Yes, the file tweets-testing/part-m-00000 contains neither the tweet id nor the tweet message. If you want to get the twitter user id, you would need to modify the python script that fetches the tweets so that it also writes a file with the association twitter id => user id. Then you would need to change the Classifier java code in this post to output, for each twitter id, the user id (that you get from the association file) and the category id with the best score. Let me know if that helps.

13. Puneet says:

Hey, hi, thanks for the tutorial, it's just awesome. But I had a problem while fetching the tweets from twitter, it's showing the following errors:

Traceback (most recent call last):
  File "scripts/twitter_fetcher.py", line 30, in <module>
    res = urllib2.urlopen(req)
  File "/usr/lib/python2.7/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.7/urllib2.py", line 406, in open
    response = meth(req, response)
  File "/usr/lib/python2.7/urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "/usr/lib/python2.7/urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "/usr/lib/python2.7/urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.7/urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
urllib2.HTTPError: HTTP Error 410: Gone

• chimpler says:

Hi Puneet, Twitter has just turned off v1 of their REST API. We will update the script to use the new api (v1.1) with authentication. In the meantime, you can look at their new api at https://dev.twitter.com/docs. Thanks.

• Puneet arora says:

Thanks a lot Sir. Waiting for your update.

• chimpler says:

Hi Puneet, we have updated the section 'Preparing the training set' to describe how to use the updated script (it now uses the twitter 1.1 api). Let us know if you have any questions on that. Thanks.

• Puneet Arora says:

Once again thanks a lot for your support. Meanwhile I ran into a new problem classifying the different documents using the 20 news example of mahout in eclipse; there were some errors after the partial map and reduce phase.

• Puneet arora says:

I get this error while writing the output to a tsv or txt file (I tried other encodings, but it works fine on the command line):

'ascii' codec can't encode character u'\xfc' in position 45: ordinal not in range(128)

14.
YP says:

Has anyone done performance benchmarking on this program? I have a 2 node cluster with 6GB RAM and 4 cores. It took 14min to predict the sentiment for 10000 rows; the input file size was 20MB.

16. Dimitar says:

Hi, I have tried the example and it is excellent, however I have difficulty running the classifier at the end. It gives the following error:

Exception in thread "main" java.lang.IllegalStateException: java.io.EOFException
at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.computeNext(SequenceFileIterator.java:104)

which happens when the program initially tries to read the df-count. Strangely, if I try to run it in distributed mode, it errors with a NullPointerException at the following line of the Classifier:

int documentCount = documentFrequency.get(-1).intValue();

which is again related to the document count. Obviously something is wrong, but I cannot figure it out. Train and test work perfectly.

17. Hari says:

Hi Chimpler, I am trying to run the mahout trainnb program as follows:

mahout trainnb -i tweet-vectors -el -li labelindex -o model -ow -c

I get the error pasted below. However if I do "hadoop fs -ls /user/hhhh/tweet-vectors/df-count", I can see the following files in the df-count folder: part-r-00000 to part-r-00009, a _SUCCESS file and a _logs folder.

Exception in thread "main" java.lang.IllegalStateException: hdfs://machineinfo:8020/user/hhhh/tweet-vectors/df-count
at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterator$1.apply(SequenceFileDirIterator.java:115)
at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterator$1.apply(SequenceFileDirIterator.java:106)
at com.google.common.collect.Iterators$8.transform(Iterators.java:860)
at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
at com.google.common.collect.Iterators$5.hasNext(Iterators.java:597)
at com.google.common.collect.ForwardingIterator.hasNext(ForwardingIterator.java:43)
at org.apache.mahout.classifier.naivebayes.BayesUtils.writeLabelIndex(BayesUtils.java:122)
at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.createLabelIndex(TrainNaiveBayesJob.java:180)
at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:94)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.main(TrainNaiveBayesJob.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:194)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:160)
Caused by: java.io.FileNotFoundException: File does not exist: /user/hhhh/tweet-vectors/df-count
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.fetchLocatedBlocks(DFSClient.java:2006)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1975)
at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1967)
at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:735)
at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:165)
at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1499)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1486)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1479)
at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1474)
at org.apache.mahout.common.iterator.sequencefile.SequenceFileIterator.<init>(SequenceFileIterator.java:63)
at org.apache.mahout.common.iterator.sequencefile.SequenceFileDirIterator$1.apply(SequenceFileDirIterator.java:110)

18. Rohit says:

Hi,

A great tutorial!!

Quick question: mahout split does not seem to work as a map reduce job in mahout-0.7; it only runs as a sequential job. If I am planning to use my own testing set, what are my options, since the approach you suggested will not run on a hadoop cluster?

Thanks,
Rohit

19. anonymous521 says:

How do I convert a csv file to a seq file?

20. Tarun says:

Hi,
Do you have any idea, in mahout clustering, how we can identify which cluster a document belongs to? clusterdump returns the result by terms, as:
:VL-0{n=2 c=[0:2.491, 000:2.642, 1:4.209, 1,765:5.290, 1.0:4.165, 1.2:1.256, 10:2.105, 1000:2.034, 10
Top Terms:
categorization => 20.145038604736328
21578 => 14.676787376403809
22173 => 13.752102851867676
tags => 12.930264472961426
categories => 12.056022644042969
sgml => 11.829574584960938
booktitle => 11.829574584960938
inproceedings => 11.222519874572754
collection => 11.18535041809082
docs => 10.580693244934082
sameline => 10.580693244934082
stories => 10.250133514404297
hayes90b => 10.006124496459961
modlewis => 9.897332191467285
documents => 9.776784896850586
reuters => 9.625079154968262
topics => 9.590224266052246
hayes89 => 9.43386459350586
lewis91d => 9.1631498336792
formatting => 9.1631498336792
Weight : [props - optional]: Point:

:VL-1{n=2 c=[amex:1.298, ase:1.869, asx:2.488, biffex:2.237, bse:2.642, cboe:1.795, cbt:1.298, cme:1.
Top Terms:
wce => 3.3353748321533203
klce => 3.3353748321533203
jse => 3.3353748321533203
klse => 3.3353748321533203
mase => 3.0476927757263184
cse => 3.0476927757263184
mnse => 3.0476927757263184
ose => 3.0476927757263184
mise => 2.8245491981506348
stse => 2.8245491981506348
ipe => 2.8245491981506348
bse => 2.6422276496887207

21. Matt says:

To those running into the EOF exception error, I've narrowed the problem down to the getmerge df-count file. When "hadoop fs -getmerge tweets-vectors/df-count df-count" is executed with a cluster greater than 1 node, the merge corrupts the seq file. The workaround I've found for now is to run "MAHOUT_LOCAL=true mahout seq2sparse -i tweets-seq -o tweets-vectors" followed by "hadoop fs -put tweets-vectors".

22. Matt says:

This is a better fix: https://github.com/mshean2011/mahout-naive-bayes-example/commit/94a766ee988f5bba43f579ca2e95e271cbe90baf
It requires the df-count path variable to point to a directory instead of a merged file

23. Puneet arora says:

Hey Hi Sir,

I have a doubt: if we are working with ngrams while creating the model, what changes do we need to make in the classifier, since the dictionary file now contains combinations of words?

Thanks & With Regards
Puneet Arora

24. tk yass says:

Hi,

I'm trying to implement the tutorial but I keep getting this error when running the classifier:

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Exception in thread "main" java.lang.IllegalStateException: java.io.EOFException

I checked the SLF4J site and they suggested adding one of their jar files to the classpath; I tried but it's not working. Any suggestions?
Thanks.

25. baris says:

Hi all

When I try to execute the "mahout trainnb -i tweet-vectors -el -li labelindex -o model -ow -c" command, I get the error below. My hadoop version is 2.2.0. Any ideas how to solve this problem? I think the problem may be related to my hadoop version.

Error

Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at org.apache.mahout.common.HadoopUtil.getCustomJobName(HadoopUtil.java:174)
at org.apache.mahout.common.AbstractJob.prepareJob(AbstractJob.java:614)
at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.run(TrainNaiveBayesJob.java:103)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.mahout.classifier.naivebayes.training.TrainNaiveBayesJob.main(TrainNaiveBayesJob.java:64)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.ProgramDriver\$ProgramDescription.invoke(ProgramDriver.java:72)
at org.apache.hadoop.util.ProgramDriver.run(ProgramDriver.java:144)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:152)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:194)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.hadoop.util.RunJar.main(RunJar.java:212)

26. elkoo says:

If somebody is facing this exception:

Exception in thread "main" java.lang.IllegalArgumentException: Label not found: heath

it means there is a mistake in the training set.

Solution: fix the 'tweets-train.tsv' label at the end of the training set: heath -> health
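A quick way to catch this kind of label typo before training is to validate every label in the training TSV against the tutorial's category list. A minimal standalone sketch (the LabelChecker class and the tab-separated label/id/text layout are illustrative assumptions, not part of the original code):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LabelChecker {
    // Categories used in this tutorial
    private static final Set<String> CATEGORIES = new HashSet<String>(Arrays.asList(
            "apparel", "art", "camera", "event", "health", "home", "tech"));

    // Returns the 1-based line numbers whose label is not a known category
    public static List<Integer> findBadLabels(List<String> lines) {
        List<Integer> bad = new ArrayList<Integer>();
        for (int i = 0; i < lines.size(); i++) {
            // the label is the first tab-separated field of each line
            String label = lines.get(i).split("\t", 2)[0];
            if (!CATEGORIES.contains(label)) {
                bad.add(i + 1);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "health\t123\tgreat spa #deal",
                "heath\t124\tdiscount on massage #deal"); // typo: heath -> health
        System.out.println("Lines with unknown labels: " + findBadLabels(lines)); // [2]
    }
}
```

Running this over tweets-train.tsv before "mahout trainnb" would have flagged the "heath" line immediately.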

27. Carlos says:

Hi, very good tutorial for starters like me. I have a question:
Does the training data need to be in the format: class (category), unique_id, some training text?
I cannot understand why the unique_id is needed. I want to create a simple analysis tool for negative, neutral and positive classification. Please explain, thank you!

28. I am wondering: what if the training data contains more than one category? How would Mahout handle it? I am trying to implement a tag prediction api using Mahout; the training data has more than one target variable (comma separated tags).

29. tuku says:

Very good tutorial, I could not have understood NB in mahout without it.

I have a question: the scores are all negative, why is that?
Is it because I did something wrong, or is that normal for mahout?
Shouldn't the scores vary between 0 and 1?

• chimpler says:

That's a good question Tuku. The reason is that the score is the logarithm of the probability of the document belonging to that category.

To compute the probability of a document D belonging to a category C, we compute the product of the probabilities of each of its words w belonging to the category C:
$p(D\vert C)=\prod_w p(w \vert C)$
As the probability of each word belonging to the category is tiny, multiplying them together would lose a lot of precision.

So the naive bayes implementation uses a log scale instead:
$log(p(D\vert C))=\sum_w log(p(w \vert C))$

As the logarithm of a number between 0 and 1 is negative, their sum is also negative.
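To see why the log scale is needed, here is a standalone sketch (toy probabilities, not taken from the Mahout implementation):

```java
import java.util.Arrays;

public class LogScoreDemo {
    public static void main(String[] args) {
        // Toy per-word probabilities p(w|C) for a 1000-word document
        double[] wordProbs = new double[1000];
        Arrays.fill(wordProbs, 1e-5);

        // Multiplying the raw probabilities underflows double precision to 0.0
        double product = 1.0;
        for (double p : wordProbs) {
            product *= p;
        }
        System.out.println("raw product: " + product); // 0.0

        // Summing the logarithms instead keeps a usable score,
        // which is negative since log(p) < 0 for 0 < p < 1
        double logScore = 0.0;
        for (double p : wordProbs) {
            logScore += Math.log(p);
        }
        System.out.println("log score: " + logScore); // about -11512.9
    }
}
```

The category with the highest (least negative) log score is the one the classifier picks, which is why all the scores in the seqdumper output are negative.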

30. immagulate says:

Sir, please tell us about using the Mahout Naive Bayes Classifier to automatically classify images.

Is this tested with mahout 0.9?
I'm running this on a single node cluster with mahout 0.9 and hadoop 1.2.1 and am getting all sorts of errors when executing
 mahout seq2sparse -i tweets-seq -o tweets-vectors

32. John says:

Hello,

I have a question about using my own testing set with mahout. How do you calculate the tf-idf of the new document? I see that you read the new document line by line and calculate the tf of the words in each line, but when you calculate the tf-idf I don't understand why you use the training documents. Why do you use the df and document count of the training documents and not of the current documents?

I really appreciate any help you can provide.
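For context, the df and document count come from the training corpus because the model's word weights were learned against those statistics; a new document has to be vectorized on the same scale, or its weights will not be comparable to the model's. A minimal sketch using the classic tf-idf formula (illustrative only; the TfIdfSketch class is hypothetical and Mahout's exact weighting formula may differ):

```java
import java.util.HashMap;
import java.util.Map;

public class TfIdfSketch {
    // Classic tf-idf: tf(word in doc) * log(numDocs / df(word)),
    // where df and numDocs come from the TRAINING corpus so that the
    // new document's weights are on the same scale as the model's.
    public static Map<String, Double> tfidf(String[] docWords,
                                            Map<String, Integer> trainingDf,
                                            int trainingDocCount) {
        Map<String, Integer> tf = new HashMap<String, Integer>();
        for (String w : docWords) {
            tf.merge(w, 1, Integer::sum);
        }
        Map<String, Double> weights = new HashMap<String, Double>();
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            Integer df = trainingDf.get(e.getKey());
            if (df == null) {
                continue; // word unseen at training time: it has no entry in the model
            }
            weights.put(e.getKey(), e.getValue() * Math.log((double) trainingDocCount / df));
        }
        return weights;
    }

    public static void main(String[] args) {
        Map<String, Integer> trainingDf = new HashMap<String, Integer>();
        trainingDf.put("deal", 90);   // appears in 90 of 100 training docs: low idf
        trainingDf.put("camera", 5);  // rare word: high idf
        Map<String, Double> w = tfidf(new String[]{"camera", "deal", "camera"}, trainingDf, 100);
        System.out.println(w); // "camera" outweighs the near-ubiquitous "deal"
    }
}
```

Words absent from the training dictionary are simply skipped, which matches the behavior of converting the test set with the training set's dictionary.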