Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages (part 2: distribute classification with hadoop)


In this post, we are going to categorize the tweets by distributing the classification on the hadoop cluster. It can make the classification faster if there is a huge number of tweets to classify.

To go through this tutorial you would need to have run the commands in the post Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages.

To distribute the classification on the hadoop nodes, we are going to define a mapreduce job:

  • the csv containing the tweets to classify is split into several chunks
  • each chunk is sent to the hadoop node that will process it by running the map class
  • the map class loads the naive bayes model and some document/word frequency into memory
  • for each tweet of the chunk, it computes the best matching category. The result is written in the output file. We are not using a reducer class as we don’t need to do aggregations.

To download the code used in this post, you can fetch it from github:

$ git clone

To compile the project:

$ mvn clean package assembly:single

Read more of this post

Installing and comparing MySQL/MariaDB, MongoDB, Vertica, Hive and Impala (Part 1)

impalaA common thing a data analyst does in his day to day job is to run aggregations of data by generally summing and averaging columns using different filters. When tables start to grow to hundreds of millions or billions of rows, these operations become extremely expensive and the choice of a database engine is crucial. Indeed, the more queries an analyst can run during the day, the better he can be at understanding the data.

In this post, we’re going to install 5 popular databases on Linux Ubuntu (12.04):

  • MySQL / MariaDB 10.0: Row based database
  • MongoDB 2.4: NoSQL database
  • Vertica Community Edition 6: Columnar database (similar to Infobright, InfiniDB, …)
  • Hive 0.10: Datawarehouse built on top of HDFS using Map/Reduce
  • Impala 1.0:  Database implemented on top of HDFS (compatible with Hive) based on Dremel that can use different data formats (raw CSV format, Parquet columnar format, …)

Then we’ll provide some scripts to populate them with some test data, run some simple aggregation queries and measure the response time. The tests will be run on only one box without any tuning using a relatively small dataset (160 million rows) but we’re planning on running more thorough tests in the cloud later with much bigger datasets (billions of rows). This is just to give a general idea on the performance of each of the database.
Read more of this post

Playing with Apache Hive and SOLR

As described in a previous post, Apache SOLR can perform very well to provide low latency analytics. Data logs can be pre-aggregated using Hive and then synced to SOLR. To this end, we developed a simple Storage Handler for SOLR so that data can be read and written to SOLR transparently using an external table.

We will show in this post how to install our SOLR storage handler and then run a simple example where we sync some data from Hive to SOLR.
Read more of this post

Playing with Apache Hive, MongoDB and the MTA

Apache Hive is a popular datawarehouse system for Hadoop that allows to run SQL queries on top of Hadoop by translating queries into Map/Reduce jobs. Due to the high latency incurred by Hadoop to execute Map/Reduce jobs, Hive cannot be used in applications that require fast access to data. One common technique is to use Hive to pre-aggregate data logs stored in HDFS and then sync the data to a Datawarehouse.

In this post we’re going to describe how to install Hive and then, as New York City straphangers, we’re going to load subway train movement data from the MTA in HDFS, execute Hive queries to aggregate the number of daily average train movements per line and store the result in MongoDB.
Read more of this post

Playing with the Mahout recommendation engine on a Hadoop cluster

Elephant and riderApache Mahout is an open source library which implements several scalable machine learning algorithms. They can be used among other things to categorize data, group items by cluster, and to implement a recommendation engine.

In this tutorial we will run the Mahout recommendation engine on a data set of movie ratings and show the movie recommendations for each user.

For more details on the recommendation algorithm, you can look at the tutorial from Jee Vang.


  • Java (to run hadoop)
  • Hadoop (used by Mahout)
  • Mahout
  • Python (use to show the result)

Running Hadoop

In this section, we are going to describe how to quickly install and configure hadoop on a single machine.
Read more of this post

A Hadoop Alternative: Building a real-time data pipeline with Storm

With the tremendous growth of the online advertising industry, ad networks have to deal with a humongous amount of data to process. For years, Hadoop has been the de-facto technology used to aggregate data logs but although it is efficient in processing big batches, it has not been designed to deal with real-time data. With the popularity of streaming aggregation engine (Storm, S4, …) ad networks started to integrate them in their data pipeline to offer real-time aggregation.

When you go to a website and see an ad, it’s usually the result of a process run by an ad network that runs a real-time bidding between different advertisers. Whoever puts the highest bid will have their ads displayed on your browser. Advertisers use different information to determine their bids based on the user’s segment, the page the user is on and a lot of other factors. This process is usually run by an ad server.

Each of these (ad) impressions is usually logged by an ad server and sent to a data pipeline that is in charge of aggregating these data (e.g., number of impressions for a particular advertiser in the last hour). The data pipeline usually processes billions and billions of logs daily and it’s a challenge to have the resources to be able to cope with the huge input of data. The aggregation must also be made available to advertisers and publishers in a reasonable amount of time so that they can adapt their strategy in their campaigns (an advertiser might see that they manage to get sales from a particular publisher and thus wants to display more ads on their website). So, the faster they get these data, the better, so companies are trying to aggregate data and made them available up to the last hour and for some up to the minute.


Data pipeline based on Hadoop

Most of the ad networks use Hadoop to aggregate data. Hadoop is really efficient in processing a large amount of data but it is not suited for real-time aggregation where data need to be available to the minute. Usually the way it works is each ad server sends its logs to the data pipeline continuously through a queue mechanism. Then Hadoop is scheduled to run an aggregation every hour and then store it in a data warehouse.

More recently, some ad networks started to use Storm which is a real-time aggregation ETL developed by Nathan Marz in late 2011. In this new architecture, Storm can read the stream of logs from the adservers, aggregate them on the fly and store the aggregated data directly in the data warehouse as soon as they’re available. We explore here through a simple example how to develop and deploy a simple real-time aggregator that aggregates data by minute, by hour and by day.
Read more of this post

Using Hadoop Pig with MongoDB

In this post, we’ll see how to install MongoDB support for Pig and we’ll illustrate it with an example where we join 2 MongoDB collections with Pig and store the result in a new collection.


Building Mongo Hadoop

We’re going to use the GIT project  developed by 10gen but with a slightly modification that we made. Because the Pig language doesn’t support variable that starts with underscore (e.g., _id) which is used in MongoDB, we added the ability to use it by replacing the _ prefix with u__ so _id becomes u__id.

First get the source:

$ git clone

Compile the Hadoop pig part of it:

$ ./sbt package
$ ./sbt mongo-hadoop-core/package
$ ./sbt mongo-hadoop-pig/package
$ mkdir ~/pig_libraries
$ cp ./pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar \
./target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar ~/pig_libraries

Running a join query with Pig on MongoDB collections

One of the thing you can’t do in MongoDB is to do a join between 2 collections. So let’s see how we can do it simply with a pig script.
Read more of this post

%d bloggers like this: