Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages

mahout2Classification algorithms can be used to automatically classify documents, images, implement spam filters and in many other domains. In this tutorial we are going to use Mahout to classify tweets using the Naive Bayes Classifier. The algorithm works by using a training set which is a set of documents already associated to a category. Using this set, the classifier determines for each word, the probability that it makes a document belong to each of the considered categories. To compute the probability that a document belongs to a category, it multiplies together the individual probability of each of its word in this category.  The category with the highest probability is the one the document is most likely to belong to.

To get more details on how the Naive Bayes Classifier is implemented, you can look at the mahout wiki page.

This tutorial will give you a step-by-step description on how to create a training set, train the Naive Bayes classifier and then use it to classify new tweets.

Requirement

For this tutorial, you would need:

  • jdk >= 1.6
  • maven
  • hadoop (preferably 1.1.1)
  • mahout >= 0.7

To install hadoop and mahout, you can follow the steps described on a previous post that shows how to use the mahout recommender.

When you are done installing hadoop and mahout, make sure you set them in your PATH so you can easily call them:

export PATH=$PATH:[HADOOP_DIR]/bin:$PATH:[MAHOUT_DIR]/bin

In our tutorial, we will limit the tweets to deals by getting the tweets containing the hashtags #deal, #deals and #discount. We will classify them in the following categories:

  • apparel (clothes, shoes, watches, …)
  • art (Book, DVD, Music, …)
  • camera
  • event (travel, concert, …)
  • health (beauty, spa, …)
  • home (kitchen, furniture, garden, …)
  • tech (computer, laptop, tablet, …)

You can get the scripts and java programs used in this tutorial from our git repository on github:

$ git clone https://github.com/fredang/mahout-naive-bayes-example.git

You can compile the java programs by typing:

$ mvn clean package assembly:single

Preparing the training set

UPDATE(2013/06/23): this section was updated to support twitter 1.1 api (1.0 was just shutdown).

As preparing the training set is very time consuming, we have provided in the source repository a training set so that you don’t need to build it. The file is data/tweets-train.tsv. If you choose to use it, you can directly jump to the next section.

To prepare a training set, we fetched the tweets with the following hashtags: #deals, #deal or #discount by using the script twitter_fetcher.py. It is using the python-tweepy 2.1 library (make sure to install the latest version as we have to use the twitter 1.1 api now). You can install it by typing:

git clone https://github.com/tweepy/tweepy.git
cd tweepy
sudo python setup.py install

You need to have consumer keys/secrets and access token key/secrets to use the api. If you don’t have them, simply login on the twitter website then go to: https://dev.twitter.com/apps. Then create a new application.
When you are done, you should see in the section ‘OAuth settings’, the Consumer Key and secret, and in the section ‘Your access token’, the Access Token and the Access Token secret.

Edit the file script/twitter_fetcher.py and change the following lines to use your twitter keys and secrets:

CONSUMER_KEY='REPLACE_CONSUMER_KEY'
CONSUMER_SECRET='REPLACE_CONSUMER_SECRET'
ACCESS_TOKEN_KEY='REPLACE_ACCESS_TOKEN_KEY'
ACCESS_TOKEN_SECRET='REPLACE_ACCESS_TOKEN_SECRET'

You can now run the script:

$ python scripts/twitter_fetcher.py 5 > tweets-train.tsv

Code to fetch tweets:

import tweepy
import sys

CONSUMER_KEY='REPLACE_CONSUMER_KEY'
CONSUMER_SECRET='REPLACE_CONSUMER_SECRET'
ACCESS_TOKEN_KEY='REPLACE_ACCESS_TOKEN_KEY'
ACCESS_TOKEN_SECRET='REPLACE_ACCESS_TOKEN_SECRET'

auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
auth.set_access_token(ACCESS_TOKEN_KEY, ACCESS_TOKEN_SECRET)
api = tweepy.API(auth)

pageCount = 5
if len(sys.argv) >= 2:
        pageCount = int(sys.argv[1])
hashtags = ['deal', 'deals', 'discount']

for tag in hashtags:
        maxId = 999999999999999999999
        for i in range(1, pageCount + 1):
                results = api.search(q='#%s' % tag, max_id=maxId, count=100)
                print len(results)
                for result in results:
                        print result.text
                        maxId = min(maxId, result.id)
                        # only keep tweets pointing to a web page
                        if result.text.find("http:") != -1:
                                print "%s       %s" % (result.id, result.text.encode('utf-8').replace('\n', ' '))

The file tweets-train.tsv contains a list of tweets in a tab separated value format. The first number is the tweet id followed by the tweet message:

308215054011194110      Limited 3-Box $20 BOGO, Supreme $9 BOGO, PTC Basketball $10 BOGO, Sterling Baseball $20 BOGO, Bowman Chrome $7 http://t.co/WMdbNFLvVZ #deals
308215054011194118      Purchase The Jeopardy! Book by Alex Trebek, Peter Barsocchini for only $4 #book #deals - http://t.co/Aw5EzlQYbs @ThriftBooksUSA
308215054011194146      #Shopping #Bargain #Deals Designer KATHY Van Zeeland Luggage & Bags @ http://t.co/GJC83p8eKh

To transform this into a training set, you can use your favorite editor and add the category of the tweet at the beginning of the line followed by a tab character:

tech    308215054011194110      Limited 3-Box $20 BOGO, Supreme $9 BOGO, PTC Basketball $10 BOGO, Sterling Baseball $20 BOGO, Bowman Chrome $7 http://t.co/WMdbNFLvVZ #deals
art     308215054011194118      Purchase The Jeopardy! Book by Alex Trebek, Peter Barsocchini for only $4 #book #deals - http://t.co/Aw5EzlQYbs @ThriftBooksUSA
apparel 308215054011194146      #Shopping #Bargain #Deals Designer KATHY Van Zeeland Luggage & Bags @ http://t.co/GJC83p8eKh

Make sure to use tab between the category and the tweet id and between the tweet id and the tweet message.

For the classifier to work properly, this set must have at least 50 tweets messages in each category.

Training the model with Mahout

First we need to convert the training set to the hadoop sequence file format:

$ java -cp target/twitter-naive-bayes-example-1.0-jar-with-dependencies.jar com.chimpler.example.bayes.TweetTSVToSeq data/tweets-train.tsv tweets-seq

The sequence file has as key: /[category]/ and as value: .

Code to convert tweet tsv to sequence file

public class TweetTSVToSeq {
	public static void main(String args[]) throws Exception {
		if (args.length != 2) {
			System.err.println("Arguments: [input tsv file] [output sequence file]");
			return;
		}
		String inputFileName = args[0];
		String outputDirName = args[1];

		Configuration configuration = new Configuration();
		FileSystem fs = FileSystem.get(configuration);
		Writer writer = new SequenceFile.Writer(fs, configuration, new Path(outputDirName + "/chunk-0"),
				Text.class, Text.class);

		int count = 0;
		BufferedReader reader = new BufferedReader(new FileReader(inputFileName));
		Text key = new Text();
		Text value = new Text();
		while(true) {
			String line = reader.readLine();
			if (line == null) {
				break;
			}
			String[] tokens = line.split("\t", 3);
			if (tokens.length != 3) {
				System.out.println("Skip line: " + line);
				continue;
			}
			String category = tokens[0];
			String id = tokens[1];
			String message = tokens[2];
			key.set("/" + category + "/" + id);
			value.set(message);
			writer.append(key, value);
			count++;
		}
		writer.close();
		System.out.println("Wrote " + count + " entries.");
	}
}

Then we upload this file to HDFS:

$ hadoop fs -put tweets-seq tweets-seq

We can run mahout to transform the training sets into vectors using tfidf weights(term frequency x document frequency):

$ mahout seq2sparse -i tweets-seq -o tweets-vectors

It will generate the following files in HDFS in the directory tweets-vectors:

  • df-count: sequence file with association word id => number of document containing this word
  • dictionary.file-0: sequence file with association word => word id
  • frequency.file-0: sequence file with association word id => word count
  • tf-vectors: sequence file with the term frequency for each document
  • tfidf-vectors: sequence file with association document id => tfidf weight for each word in the document
  • tokenized-documents: sequence file with association document id => list of words
  • wordcount: sequence file with association word => word count

In order to do the training and check that the classification works fine, Mahout splits the set into two sets: a training set and a testing set:

$ mahout split -i tweets-vectors/tfidf-vectors --trainingOutput train-vectors --testOutput test-vectors --randomSelectionPct 40 --overwrite --sequenceFiles -xm sequential

We use the training set to train the classifier:

$ mahout trainnb -i train-vectors -el -li labelindex -o model -ow -c

It creates the model(matrix word id x label id) and a label index(association label and label id).

To test that the classifier is working properly on the training set:

$ mahout testnb -i train-vectors -m model -l labelindex -ow -o tweets-testing -c
[...]
Summary
-------------------------------------------------------
Correctly Classified Instances          :        314	   97.2136%
Incorrectly Classified Instances        :          9	    2.7864%
Total Classified Instances              :        323

=======================================================
Confusion Matrix
-------------------------------------------------------
a    	b    	c    	d    	e    	f    	g    	<--Classified as
45   	0    	0    	0    	0    	0    	1    	 |  46    	a     = apparel
0    	35   	0    	0    	0    	0    	0    	 |  35    	b     = art
0    	0    	34   	0    	0    	0    	0    	 |  34    	c     = camera
0    	0    	0    	39   	0    	0    	0    	 |  39    	d     = event
0    	0    	0    	0    	23   	0    	0    	 |  23    	e     = health
1    	1    	0    	0    	1    	48   	2    	 |  53    	f     = home
0    	0    	1    	0    	1    	1    	90   	 |  93    	g     = tech

And on the testing set:

$ mahout testnb -i test-vectors -m model -l labelindex -ow -o tweets-testing -c
[...]
Summary
-------------------------------------------------------
Correctly Classified Instances          :        121	   78.0645%
Incorrectly Classified Instances        :         34	   21.9355%
Total Classified Instances              :        155

=======================================================
Confusion Matrix
-------------------------------------------------------
a    	b    	c    	d    	e    	f    	g    	<--Classified as
27   	1    	1    	1    	2    	2    	2    	 |  36    	a     = apparel
1    	22   	0    	2    	1    	0    	0    	 |  26    	b     = art
0    	1    	27   	1    	0    	0    	1    	 |  30    	c     = camera
0    	1    	0    	23   	4    	0    	0    	 |  28    	d     = event
0    	1    	0    	2    	9    	2    	0    	 |  14    	e     = health
0    	1    	1    	1    	2    	13   	1    	 |  19    	f     = home
0    	0    	2    	0    	0    	0    	0    	 |  2     	g     = tech

If the percentage of correctly classified instance is too low, you might need to improve your training set by adding more tweets or by changing your categories to not have too many similar categories or by removing categories that are used very rarely. After you are done with your changes, you would need to restart the training process.

To use the classifier to classify new documents, we would need to copy several files from HDFS:

  • model (matrix word id x label id)
  • labelindex (mapping between a label and its id)
  • dictionary.file-0 (mapping between a word and its id)
  • df-count (document frequency: number of documents each word is appearing in)
$ hadoop fs -get labelindex labelindex
$ hadoop fs -get model model
$ hadoop fs -get tweets-vectors/dictionary.file-0 dictionary.file-0
$ hadoop fs -getmerge tweets-vectors/df-count df-count

To get some new tweets to classify, you can run the twitter fetcher again(or use the one provided in data/tweets-to-classify-tsv):

$ python scripts/twitter_fetcher.py 1 > tweets-to-classify.tsv

Now we can run the classifier on this file:

$ java -cp target/twitter-naive-bayes-example-1.0-jar-with-dependencies.jar com.chimpler.example.bayes.Classifier model labelindex dictionary.file-0 df-count data/tweets-to-classify.tsv
Number of labels: 7
Number of documents: 486
Tweet: 309836558624768000       eBay - Porter Cable 18V Ni CAD 2-Tool Combo Kit (Refurbished) $56.99 http://t.co/pCSSlSq2c1 #Deal - http://t.co/QImHB6xJ5b
  apparel: -252.96630831136127  art: -246.9351025603821  camera: -262.28340417385357  event: -262.5573608070056  health: -238.17884382282813  home: -253.05135616792995  tech: -232.9118
41377148 => tech
Tweet: 309836557379043329       Newegg - BenQ GW2750HM 27" Widescreen LED Backlight LCD Monitor $209.99 http://t.co/6ezbjGZIta #Deal - http://t.co/QImHB6xJ5b
  apparel: -287.5588179141781  art: -284.27401807389435  camera: -278.4968305457808  event: -292.56786244190556  health: -292.22158238362204  home: -281.9809996515652  tech: -253.34354
804349476 => tech
Tweet: 309836556355657728       J and R - Roku 3 Streaming Player 4200R $89.99 http://t.co/BAaMEmEdCm #Deal - http://t.co/QImHB6xJ5b
  apparel: -192.44260718853357  art: -187.6881145121525  camera: -175.8783440835461  event: -191.74948688734446  health: -190.45406023882765  home: -192.9107077937349  tech: -185.52068
485514894 => camera
Tweet: 309836555248361472       eBay - Adidas Adicross 2011 Men's Spikeless Golf Shoes $42.99 http://t.co/oRt8JIQB6v #Deal - http://t.co/QImHB6xJ5b
  apparel: -133.86214565455646  art: -174.44106424825426  camera: -188.66719939648308  event: -188.83296276708387  health: -188.188838820323  home: -178.13519042380085  tech: -190.7717
2248114303 => apparel
Tweet: 309836554187202560       Buydig - Tamron 18-270mm Di Lens for Canon + Canon 50mm F/1.8 Lens $464 http://t.co/Dqj9DdqmTf #Deal - http://t.co/QImHB6xJ5b
  apparel: -218.82418584296866  art: -228.25052760371423  camera: -183.46066199290763  event: -245.186963518965  health: -244.70464331200444  home: -236.16560862254997  tech: -244.4118
6823539707 => camera

Code to classify the tweets using the model and the dictionary file:

public class Classifier {

	public static Map<String, Integer> readDictionnary(Configuration conf, Path dictionnaryPath) {
		Map<String, Integer> dictionnary = new HashMap<String, Integer>();
		for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(dictionnaryPath, true, conf)) {
			dictionnary.put(pair.getFirst().toString(), pair.getSecond().get());
		}
		return dictionnary;
	}

	public static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path documentFrequencyPath) {
		Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
		for (Pair<IntWritable, LongWritable> pair : new SequenceFileIterable<IntWritable, LongWritable>(documentFrequencyPath, true, conf)) {
			documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
		}
		return documentFrequency;
	}

	public static void main(String[] args) throws Exception {
		if (args.length < 5) { 			System.out.println("Arguments: [model] [label index] [dictionnary] [document frequency] "); 			return; 		} 		String modelPath = args[0]; 		String labelIndexPath = args[1]; 		String dictionaryPath = args[2]; 		String documentFrequencyPath = args[3]; 		String tweetsPath = args[4]; 		 		Configuration configuration = new Configuration(); 		// model is a matrix (wordId, labelId) => probability score
		NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), configuration);

		StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);

		// labels is a map label => classId
		Map<Integer, String> labels = BayesUtils.readLabelIndex(configuration, new Path(labelIndexPath));
		Map<String, Integer> dictionary = readDictionnary(configuration, new Path(dictionaryPath));
		Map<Integer, Long> documentFrequency = readDocumentFrequency(configuration, new Path(documentFrequencyPath));

		// analyzer used to extract word from tweet
		Analyzer analyzer = new DefaultAnalyzer();

		int labelCount = labels.size();
		int documentCount = documentFrequency.get(-1).intValue();

		System.out.println("Number of labels: " + labelCount);
		System.out.println("Number of documents in training set: " + documentCount);
		BufferedReader reader = new BufferedReader(new FileReader(tweetsPath));
		while(true) {
			String line = reader.readLine();
			if (line == null) {
				break;
			}

			String[] tokens = line.split("\t", 2);
			String tweetId = tokens[0];
			String tweet = tokens[1];

			System.out.println("Tweet: " + tweetId + "\t" + tweet);

			Multiset words = ConcurrentHashMultiset.create();

			// extract words from tweet
			TokenStream ts = analyzer.reusableTokenStream("text", new StringReader(tweet));
			CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
			ts.reset();
			int wordCount = 0;
			while (ts.incrementToken()) {
				if (termAtt.length() > 0) {
					String word = ts.getAttribute(CharTermAttribute.class).toString();
					Integer wordId = dictionary.get(word);
					// if the word is not in the dictionary, skip it
					if (wordId != null) {
						words.add(word);
						wordCount++;
					}
				}
			}

			// create vector wordId => weight using tfidf
			Vector vector = new RandomAccessSparseVector(10000);
			TFIDF tfidf = new TFIDF();
			for (Multiset.Entry entry:words.entrySet()) {
				String word = entry.getElement();
				int count = entry.getCount();
				Integer wordId = dictionary.get(word);
				Long freq = documentFrequency.get(wordId);
				double tfIdfValue = tfidf.calculate(count, freq.intValue(), wordCount, documentCount);
				vector.setQuick(wordId, tfIdfValue);
			}
			// With the classifier, we get one score for each label
			// The label with the highest score is the one the tweet is more likely to
			// be associated to
			Vector resultVector = classifier.classifyFull(vector);
			double bestScore = -Double.MAX_VALUE;
			int bestCategoryId = -1;
			for(Element element: resultVector) {
				int categoryId = element.index();
				double score = element.get();
				if (score > bestScore) {
					bestScore = score;
					bestCategoryId = categoryId;
				}
				System.out.print("  " + labels.get(categoryId) + ": " + score);
			}
			System.out.println(" => " + labels.get(bestCategoryId));
		}
	}
}

Most of the tweets are classified properly but some are not. For example, the tweet “J and R – Roku 3 Streaming Player 4200R $89.99″ is incorrectly classified as camera. To fix that, we can add this tweet to the training set and classify it as tech. You can do the same for the  other tweets which are incorrectly classified. When you are done, you can repeat the training process and check the results again.

Conclusion

In this tutorial we have seen how to build a training set, then how to use it with Mahout to train the Naive Bayes model. We showed how to test the classifier and how to improve the training set to get a better classification. Finally we use it to build an application to automatically assign a category to a tweet. In this post, we only study one Mahout classifier among many others: SGD, SVM, Neural Network, Random Forests, …. We will see in future posts how to use them.

Misc

View content of sequence files

To show the content of a file in HDFS, you can use the command

$ hadoop fs -text [FILE_NAME]

However, there might be some sequence file which are encoded using mahout classes. You can tell hadoop where to find those classes by editing the file [HADOOP_DIR]conf/hadoop-env.sh and add the following line:

export HADOOP_CLASSPATH=[MAHOUT_DIR]/mahout-math-0.7.jar:[MAHOUT_DIR]/mahout-examples-0.7-job.jar

and restart hadoop.

You can use the command mahout seqdumper:

$ mahout seqdumper -i [FILE_NAME]

View words which are the most representative of each categories

You can use the class TopCategoryWords that shows the top 10 words of each category.

public class TopCategoryWords {

	public static Map<Integer, String> readInverseDictionnary(Configuration conf, Path dictionnaryPath) {
		Map<Integer, String> inverseDictionnary = new HashMap<Integer, String>();
		for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(dictionnaryPath, true, conf)) {
			inverseDictionnary.put(pair.getSecond().get(), pair.getFirst().toString());
		}
		return inverseDictionnary;
	}

	public static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path documentFrequencyPath) {
		Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
		for (Pair<IntWritable, LongWritable> pair : new SequenceFileIterable<IntWritable, LongWritable>(documentFrequencyPath, true, conf)) {
			documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
		}
		return documentFrequency;
	}

	public static class WordWeight implements Comparable {
		private int wordId;
		private double weight;

		public WordWeight(int wordId, double weight) {
			this.wordId = wordId;
			this.weight = weight;
		}

		public int getWordId() {
			return wordId;
		}

		public Double getWeight() {
			return weight;
		}

		@Override
		public int compareTo(WordWeight w) {
			return -getWeight().compareTo(w.getWeight());
		}
	}

	public static void main(String[] args) throws Exception {
		if (args.length < 4) { 			System.out.println("Arguments: [model] [label index] [dictionnary] [document frequency]"); 			return; 		} 		String modelPath = args[0]; 		String labelIndexPath = args[1]; 		String dictionaryPath = args[2]; 		String documentFrequencyPath = args[3]; 		 		Configuration configuration = new Configuration(); 		// model is a matrix (wordId, labelId) => probability score
		NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), configuration);

		StandardNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);

		// labels is a map label => classId
		Map<Integer, String> labels = BayesUtils.readLabelIndex(configuration, new Path(labelIndexPath));
		Map<Integer, String> inverseDictionary = readInverseDictionnary(configuration, new Path(dictionaryPath));
		Map<Integer, Long> documentFrequency = readDocumentFrequency(configuration, new Path(documentFrequencyPath));

		int labelCount = labels.size();
		int documentCount = documentFrequency.get(-1).intValue();

		System.out.println("Number of labels: " + labelCount);
		System.out.println("Number of documents in training set: " + documentCount);

		for(int labelId = 0 ; labelId < model.numLabels() ; labelId++) {
			SortedSet wordWeights = new TreeSet();
			for(int wordId = 0 ; wordId < model.numFeatures() ; wordId++) { 				WordWeight w = new WordWeight(wordId, model.weight(labelId, wordId)); 				wordWeights.add(w); 			} 			System.out.println("Top 10 words for label " + labels.get(labelId)); 			int i = 0; 			for(WordWeight w: wordWeights) { 				System.out.println(" - " + inverseDictionary.get(w.getWordId()) 						+ ": " + w.getWeight()); 				i++; 				if (i >= 10) {
					break;
				}
			}
		}
	}
}

$ java -cp target/twitter-naive-bayes-example-1.0-jar-with-dependencies.jar com.chimpler.example.bayes.TopCategoryWords model labelindex dictionary.file-0 df-count
Top 10 words for label camera
– digital: 70.05728101730347
– camera: 63.875202655792236
– canon: 53.79892921447754
– mp: 49.64586567878723
– nikon: 47.830992698669434
– slr: 45.931694984436035
– sony: 44.55785942077637
– lt: 37.998433113098145
– http: 29.718397855758667
– t.co: 29.65730857849121
Top 10 words for label event
– http: 33.16791915893555
– t.co: 33.09973907470703
– deals: 26.246684789657593
– days: 25.533835887908936
– hotel: 22.658542156219482
– discount: 19.89004611968994
– amp: 19.645113945007324
– spend: 18.805208206176758
– suite: 17.21832275390625
– deal: 16.84959626197815
[...]

Running the training without splitting the data into testing and training set

You can run the training just after having executed the mahout seq2sparse command:

$ mahout trainnb -i tweets-vectors/tfidf-vectors -el -li labelindex -o model -ow -c

Using your own testing set with mahout

Previously, we showed how to generate a testing set from the training set using the mahout split command.

In this section, we are going to describe how to use our own testing set and run mahout to check the accuracy of the testing set.

We have a small testing set in data/tweets-test-set.tsv that we are transforming into a tfidf vector sequence file:
the tweet words are converted into word id using the dictionary file and are associated to their tf x idf value:

public class TweetTSVToTrainingSetSeq {
	public static Map<String, Integer> readDictionnary(Configuration conf, Path dictionnaryPath) {
		Map<String, Integer> dictionnary = new HashMap<String, Integer>();
		for (Pair<Text, IntWritable> pair : new SequenceFileIterable<Text, IntWritable>(dictionnaryPath, true, conf)) {
			dictionnary.put(pair.getFirst().toString(), pair.getSecond().get());
		}
		return dictionnary;
	}

	public static Map<Integer, Long> readDocumentFrequency(Configuration conf, Path documentFrequencyPath) {
		Map<Integer, Long> documentFrequency = new HashMap<Integer, Long>();
		for (Pair<IntWritable, LongWritable> pair : new SequenceFileIterable<IntWritable, LongWritable>(documentFrequencyPath, true, conf)) {
			documentFrequency.put(pair.getFirst().get(), pair.getSecond().get());
		}
		return documentFrequency;
	}

	public static void main(String[] args) throws Exception {
		if (args.length < 4) {
			System.out.println("Arguments: [dictionnary] [document frequency]  [output file]");
			return;
		}
		String dictionaryPath = args[0];
		String documentFrequencyPath = args[1];
		String tweetsPath = args[2];
		String outputFileName = args[3];

		Configuration configuration = new Configuration();
		FileSystem fs = FileSystem.get(configuration);

		Map<String, Integer> dictionary = readDictionnary(configuration, new Path(dictionaryPath));
		Map<Integer, Long> documentFrequency = readDocumentFrequency(configuration, new Path(documentFrequencyPath));
		int documentCount = documentFrequency.get(-1).intValue();

		Writer writer = new SequenceFile.Writer(fs, configuration, new Path(outputFileName),
				Text.class, VectorWritable.class);
		Text key = new Text();
		VectorWritable value = new VectorWritable();

		Analyzer analyzer = new DefaultAnalyzer();
		BufferedReader reader = new BufferedReader(new FileReader(tweetsPath));
		while(true) {
			String line = reader.readLine();
			if (line == null) {
				break;
			}

			String[] tokens = line.split("\t", 3);
			String label = tokens[0];
			String tweetId = tokens[1];
			String tweet = tokens[2];

			key.set("/" + label + "/" + tweetId);

			Multiset words = ConcurrentHashMultiset.create();

			// extract words from tweet
			TokenStream ts = analyzer.reusableTokenStream("text", new StringReader(tweet));
			CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
			ts.reset();
			int wordCount = 0;
			while (ts.incrementToken()) {
				if (termAtt.length() > 0) {
					String word = ts.getAttribute(CharTermAttribute.class).toString();
					Integer wordId = dictionary.get(word);
					// if the word is not in the dictionary, skip it
					if (wordId != null) {
						words.add(word);
						wordCount++;
					}
				}
			}

			// create vector wordId => weight using tfidf
			Vector vector = new RandomAccessSparseVector(10000);
			TFIDF tfidf = new TFIDF();
			for (Multiset.Entry entry:words.entrySet()) {
				String word = entry.getElement();
				int count = entry.getCount();
				Integer wordId = dictionary.get(word);
				// if the word is not in the dictionary, skip it
				Long freq = documentFrequency.get(wordId);
				double tfIdfValue = tfidf.calculate(count, freq.intValue(), wordCount, documentCount);
				vector.setQuick(wordId, tfIdfValue);
			}
			value.set(vector);

			writer.append(key, value);
		}
		writer.close();
	}
}

To run the program:

$ java -cp target/twitter-naive-bayes-example-1.0-jar-with-dependencies.jar com.chimpler.example.bayes.TweetTSVToTrainingSetSeq dictionary.file-0 df-count data/tweets-test-set.tsv tweets-test-set.seq

To copy the generated seq file to hdfs:

 $ hadoop fs -put tweets-test-set.seq tweets-test-set.seq

To run the mahout testnb on this sequence file:

  $ mahout testnb -i tweets-test-set.seq -m model -l labelindex -ow -o tweets-test-output -c
Summary
-------------------------------------------------------
Correctly Classified Instances          :          5	   18.5185%
Incorrectly Classified Instances        :         22	   81.4815%
Total Classified Instances              :         27

=======================================================
Confusion Matrix
-------------------------------------------------------
a    	b    	c    	d    	e    	f    	g    	<--Classified as
2    	1    	0    	1    	3    	0    	1    	 |  8     	a     = apparel
0    	0    	0    	2    	0    	0    	1    	 |  3     	b     = art
0    	0    	0    	0    	1    	0    	0    	 |  1     	c     = camera
0    	0    	0    	2    	1    	0    	0    	 |  3     	d     = event
0    	0    	0    	1    	1    	0    	0    	 |  2     	e     = health
0    	0    	0    	2    	0    	0    	1    	 |  3     	f     = home
0    	3    	0    	1    	0    	3    	0    	 |  7     	g     = tech

Cleanup files

In case you want to rerun the classification, an easy way to delete all the files in your home in HDFS is by typing:

$ hadoop fs -rmr \*

Errors

When running the script to convert the tweet TSV message, I got the following errors:

Skip line: tech	309167277155168257      Easy web hosting. $4.95 -  http://t.co/0oUGS6Oj0e  - Review/Coupon- http://t.co/zdgH4kv5sv  #wordpress #deal #bluehost #blue host
Skip line: art	309167270989541376      Beautiful Jan Royce Conant Drawing of Jamaica - 1982 - Rare CT Artist - Animals #CPTV #EBAY #FineArt #Deals http://t.co/MUZf5aixMz

Make sure that the category and the tweet id are followed by a tab character and not spaces.

 

To run the classifier on the hadoop cluster, you can read the post part 2: distribute classification with hadoop.

About chimpler
http://www.chimpler.com

88 Responses to Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages

  1. Daniel says:

    Hello, My Friend

    i am trying to use you example inside a Jetty Servlet to classify some spam like the guy at http://emmaespina.wordpress.com/2011/04/26/ham-spam-and-elephants-or-how-to-build-a-spam-filter-server-with-mahout/
    However, i am having some problems with materializing the bayes model:
    NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), configuration);

    Heres is the post with curl:
    curl http://localhost:8080/antispam -H “Content-T-Type: text/xml” –data-binary @ham.txt

    Here is the error i receive when i do the post:

    014-06-09 21:16:44.065:INFO::Started SelectChannelConnector@0.0.0.0:8080
    [INFO] Started Jetty Server
    java.lang.IllegalArgumentException: Unknown flags set: %d [-1110101]
    at com.google.common.base.Preconditions.checkArgument(Preconditions.java:148)
    at org.apache.mahout.math.VectorWritable.readFields(VectorWritable.java:88)
    at org.apache.mahout.math.VectorWritable.readVector(VectorWritable.java:199)
    at org.apache.mahout.classifier.naivebayes.NaiveBayesModel.materialize(NaiveBayesModel.java:112)
    at org.example.SpamClassifier.classify(SpamClassifier.java:49)
    at org.example.SpamClassifierServlet.doPost(SpamClassifierServlet.java:24)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:727)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:820)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:533)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:475)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:514)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:226)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:920)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:403)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:184)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:856)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:247)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:151)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:114)
    at org.eclipse.jetty.server.Server.handle(Server.java:352)
    at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:596)
    at org.eclipse.jetty.server.HttpConnection$RequestHandler.content(HttpConnection.java:1066)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:805)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:218)
    at org.eclipse.jetty.server.HttpConnection.handle(HttpConnection.java:426)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:510)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint.access$000(SelectChannelEndPoint.java:34)
    at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:40)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$2.run(QueuedThreadPool.java:450)
    at java.lang.Thread.run(Thread.java:745)

    Am i missing something? The project is at github ( https://github.com/danielneis/mahout-spamclassifier-servlet ) , i’ve tried several versions of mahout/hadoop-common/hadoop-core/hadoop-hdfs in the pom file.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Follow

Get every new post delivered to your Inbox.

Join 136 other followers

%d bloggers like this: