Implementing a java agent to instrument code

With a system running 24/7, you have to make sure that it performs well at any time of the day. Several commercial solutions exist to monitor the performance of systems: NewRelic, GraphDat and many others. They let you see, for instance, whether API response times change from week to week or after each release of the project, so developers can easily spot where the bottlenecks are and fix them.

You can also use profilers such as JProfiler or YourKit to detect bottlenecks, memory leaks and thread leaks.

Most of these tools work by using a Java agent, a pluggable library that runs embedded in the JVM and intercepts the class-loading process. By hooking into class loading, they can dynamically rewrite a class’s bytecode to add method logging, performance measurements and so on.

In this post, we are going to describe how to implement a simple Java agent that measures how frequently some methods of your application are called and how long they take, and publishes the results to JMX and to Graphite.
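To make the idea concrete, here is a minimal sketch of what the entry point of such an agent could look like, using only the standard java.lang.instrument API. The class name TimingAgentSketch, the package prefix com/example/app/ and the record helper are assumptions made for illustration, not the code of the post. The transformer only selects classes and leaves their bytecode unchanged; a real agent would rewrite the bytes with a bytecode library and report the counters to JMX or Graphite.

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Sketch of an agent entry point plus the counters that instrumented methods would update. */
class TimingAgentSketch {
    // Hypothetical package prefix (JVM internal form, '/' separators) of the classes to instrument.
    static final String TARGET_PREFIX = "com/example/app/";

    // Per-method call counts and cumulative time; a JMX or Graphite reporter would read these.
    static final ConcurrentHashMap<String, LongAdder> CALLS = new ConcurrentHashMap<>();
    static final ConcurrentHashMap<String, LongAdder> NANOS = new ConcurrentHashMap<>();

    /** Called by the JVM before the application's main() when started with -javaagent. */
    public static void premain(String agentArgs, Instrumentation inst) {
        inst.addTransformer(new SelectingTransformer());
    }

    static boolean shouldInstrument(String internalName) {
        // The JVM may pass a null class name, so guard against it.
        return internalName != null && internalName.startsWith(TARGET_PREFIX);
    }

    /** What the injected timing code would call when a method returns. */
    static void record(String method, long elapsedNanos) {
        CALLS.computeIfAbsent(method, k -> new LongAdder()).increment();
        NANOS.computeIfAbsent(method, k -> new LongAdder()).add(elapsedNanos);
    }

    static long averageNanos(String method) {
        LongAdder calls = CALLS.get(method);
        if (calls == null || calls.sum() == 0) {
            return 0;
        }
        return NANOS.get(method).sum() / calls.sum();
    }

    static class SelectingTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className, Class<?> classBeingRedefined,
                                ProtectionDomain protectionDomain, byte[] classfileBuffer) {
            if (shouldInstrument(className)) {
                // A real agent would rewrite classfileBuffer here so that each
                // method wraps its body in timing code that calls record(...).
                System.err.println("would instrument " + className);
            }
            return null; // null tells the JVM to keep the class bytes unchanged
        }
    }
}
```

To load an agent like this, the class is packaged in a jar whose manifest has a Premain-Class entry naming it, and the JVM is started with the -javaagent option pointing at that jar.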

Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages (part 2: distribute classification with hadoop)


In this post, we are going to categorize the tweets by distributing the classification across the Hadoop cluster. This can speed up the classification when there is a huge number of tweets to classify.

To go through this tutorial, you need to have run the commands in the post Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages.

To distribute the classification across the Hadoop nodes, we are going to define a MapReduce job:

  • the CSV file containing the tweets to classify is split into several chunks
  • each chunk is sent to the Hadoop node that will process it by running the map class
  • the map class loads the Naive Bayes model and the document/word frequencies into memory
  • for each tweet of the chunk, it computes the best matching category and writes the result to the output file. We are not using a reducer class as we don’t need to do any aggregation.
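The steps above can be sketched in plain Java, without the Hadoop API, to show the shape of a map-only job. Everything named here is an assumption for illustration: the TweetMapper class, the "id<TAB>text" input format, the toy model and the additive word scoring all stand in for the real Hadoop Mapper and the Mahout model that the post relies on.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapOnlyClassificationSketch {
    /** Stand-in for the map class: holds the model and classifies one tweet per call. */
    static class TweetMapper {
        private final Map<String, Map<String, Double>> wordScores; // category -> word -> score

        TweetMapper(Map<String, Map<String, Double>> wordScores) {
            this.wordScores = wordScores;
        }

        /** Input: "id<TAB>text" (assumed format). Output: "id<TAB>bestCategory". */
        String map(String line) {
            String[] fields = line.split("\t", 2);
            String text = fields.length > 1 ? fields[1] : "";
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Map.Entry<String, Map<String, Double>> category : wordScores.entrySet()) {
                double score = 0;
                for (String word : text.toLowerCase().split("\\s+")) {
                    score += category.getValue().getOrDefault(word, 0.0);
                }
                if (score > bestScore) {
                    bestScore = score;
                    best = category.getKey();
                }
            }
            return fields[0] + "\t" + best;
        }
    }

    /** Hypothetical toy model; the real job would load the trained model instead. */
    static Map<String, Map<String, Double>> toyModel() {
        Map<String, Double> tech = new HashMap<>();
        tech.put("java", 1.0);
        tech.put("hadoop", 1.0);
        Map<String, Double> food = new HashMap<>();
        food.put("pizza", 1.0);
        food.put("pasta", 1.0);
        Map<String, Map<String, Double>> model = new HashMap<>();
        model.put("tech", tech);
        model.put("food", food);
        return model;
    }

    /** Splits the input into chunks, as the framework does before dispatching them. */
    static List<List<String>> split(List<String> lines, int chunkSize) {
        List<List<String>> chunks = new ArrayList<>();
        for (int i = 0; i < lines.size(); i += chunkSize) {
            chunks.add(lines.subList(i, Math.min(i + chunkSize, lines.size())));
        }
        return chunks;
    }

    /** Runs one mapper per chunk and concatenates the outputs; there is no reduce step. */
    static List<String> run(List<String> lines, int chunkSize, Map<String, Map<String, Double>> model) {
        List<String> output = new ArrayList<>();
        for (List<String> chunk : split(lines, chunkSize)) {
            TweetMapper mapper = new TweetMapper(model); // the model is loaded once per chunk
            for (String line : chunk) {
                output.add(mapper.map(line));
            }
        }
        return output;
    }
}
```

Because each tweet is classified independently of the others, the chunks can be processed on different nodes in any order, which is why no reducer is needed.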

To download the code used in this post, you can fetch it from github:

$ git clone

To compile the project:

$ mvn clean package assembly:single


Installing and comparing MySQL/MariaDB, MongoDB, Vertica, Hive and Impala (Part 1)

A common task in a data analyst’s day-to-day job is running aggregations over data, typically summing and averaging columns under different filters. When tables start to grow to hundreds of millions or billions of rows, these operations become extremely expensive and the choice of a database engine is crucial. Indeed, the more queries analysts can run during the day, the better they can understand the data.

In this post, we’re going to install 5 popular databases on Ubuntu Linux 12.04:

  • MySQL / MariaDB 10.0: Row-based database
  • MongoDB 2.4: NoSQL database
  • Vertica Community Edition 6: Columnar database (similar to Infobright, InfiniDB, …)
  • Hive 0.10: Data warehouse built on top of HDFS using Map/Reduce
  • Impala 1.0: Database implemented on top of HDFS (compatible with Hive), based on Dremel, that can use different data formats (raw CSV format, Parquet columnar format, …)

Then we’ll provide some scripts to populate them with some test data, run some simple aggregation queries and measure the response time. The tests will be run on a single box, without any tuning, using a relatively small dataset (160 million rows); we’re planning on running more thorough tests in the cloud later with much bigger datasets (billions of rows). This is just to give a general idea of the performance of each database.

Finding association rules with Mahout Frequent Pattern Mining

Association Rule Learning is a method to find relations between variables in a database. For instance, using shopping receipts, we can find associations between items: bread is often purchased with peanut butter, or chips and beer are often bought together. In this post, we are going to use the Mahout Frequent Pattern Mining implementation to find the associations between items using a list of shopping transactions. For details on the algorithms (Apriori and FP-growth) used to find frequent patterns, you can look at “The comparative study of apriori and FP-growth algorithm” by Deepti Pawar.
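As a small illustration of how the strength of a rule such as “bread ⇒ peanut butter” is usually measured, the sketch below computes support, confidence and lift by brute force over a handful of transactions. The class name and the sample transactions are made up for the example, and this is not how Mahout’s FP-growth implementation works internally; it only shows the standard definitions of the three metrics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class AssociationRuleSketch {
    /** Convenience helper to build an itemset. */
    static Set<String> items(String... names) {
        return new HashSet<>(Arrays.asList(names));
    }

    /** Hypothetical shopping transactions, one set of items per receipt. */
    static List<Set<String>> sampleTransactions() {
        List<Set<String>> transactions = new ArrayList<>();
        transactions.add(items("bread", "peanut butter"));
        transactions.add(items("bread", "peanut butter", "milk"));
        transactions.add(items("bread", "milk"));
        transactions.add(items("chips", "beer"));
        return transactions;
    }

    /** Fraction of transactions that contain every item of the itemset. */
    static double support(List<Set<String>> transactions, Set<String> itemset) {
        long hits = transactions.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) hits / transactions.size();
    }

    /** Confidence of the rule a => b: support(a and b) / support(a). */
    static double confidence(List<Set<String>> transactions, Set<String> a, Set<String> b) {
        Set<String> both = new HashSet<>(a);
        both.addAll(b);
        return support(transactions, both) / support(transactions, a);
    }

    /** Lift of the rule a => b: confidence(a => b) / support(b). */
    static double lift(List<Set<String>> transactions, Set<String> a, Set<String> b) {
        return confidence(transactions, a, b) / support(transactions, b);
    }
}
```

A lift above 1 means the two itemsets appear together more often than they would if they were independent; a lift close to 1 means the rule carries little information.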

EDIT 2014-01-08: updated link to data sample marketbasket.csv (old link was dead). Corrected lift computation. Thanks Felipe F. for pointing out the error in the formula.

Generating EigenFaces with Mahout SVD to recognize person faces


In this tutorial, we are going to describe how to generate and use eigenfaces to recognize people’s faces.
Eigenfaces are a set of eigenvectors derived from the covariance matrix of the probability distribution of the high-dimensional vector space of possible human faces. They can be used to identify a face in a picture against a database of people’s faces very quickly. In this post, we won’t give many details on the mathematical aspects, but if you are interested in them, you can look at the excellent post Face Recognition using Eigenfaces and Distance Classifiers: A Tutorial from the Onionesque Reality Blog.
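To give a feel for the computation without the full mathematics, the sketch below finds the first eigenface of a tiny set of flattened images: it subtracts the mean face and then runs power iteration to get the dominant eigenvector of the covariance matrix. Power iteration is swapped in here for brevity and is not the Mahout SVD used in the post; the class and method names are illustrative assumptions.

```java
import java.util.Arrays;

class EigenfaceSketch {
    static double dot(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    /** Average of all faces, pixel by pixel. Each row is one flattened image. */
    static double[] meanFace(double[][] faces) {
        double[] mean = new double[faces[0].length];
        for (double[] face : faces) {
            for (int j = 0; j < mean.length; j++) {
                mean[j] += face[j] / faces.length;
            }
        }
        return mean;
    }

    /** Subtracts the mean face from every face. */
    static double[][] center(double[][] faces, double[] mean) {
        double[][] centered = new double[faces.length][mean.length];
        for (int i = 0; i < faces.length; i++) {
            for (int j = 0; j < mean.length; j++) {
                centered[i][j] = faces[i][j] - mean[j];
            }
        }
        return centered;
    }

    /**
     * Dominant eigenvector of the covariance matrix of the centered faces,
     * computed by power iteration without building the matrix explicitly.
     * Caveat: a start vector orthogonal to the answer would not converge.
     */
    static double[] topEigenface(double[][] centered, int iterations) {
        int pixels = centered[0].length;
        double[] v = new double[pixels];
        Arrays.fill(v, 1.0 / Math.sqrt(pixels)); // deterministic unit-length start
        for (int it = 0; it < iterations; it++) {
            double[] next = new double[pixels];
            for (double[] row : centered) {
                double projection = dot(row, v);
                for (int j = 0; j < pixels; j++) {
                    next[j] += projection * row[j]; // accumulates X^T (X v)
                }
            }
            double norm = Math.sqrt(dot(next, next));
            if (norm == 0) {
                return v; // all faces identical: no direction of variance
            }
            for (int j = 0; j < pixels; j++) {
                v[j] = next[j] / norm;
            }
        }
        return v;
    }

    /** Weight of a face along an eigenface; faces are compared through such weights. */
    static double project(double[] face, double[] mean, double[] eigenface) {
        double[] diff = new double[face.length];
        for (int j = 0; j < face.length; j++) {
            diff[j] = face[j] - mean[j];
        }
        return dot(diff, eigenface);
    }
}
```

Recognition then amounts to projecting a new face onto the leading eigenfaces and looking for the known face whose weights are closest.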


Playing with Apache Hive and SOLR

As described in a previous post, Apache SOLR can perform very well to provide low latency analytics. Data logs can be pre-aggregated using Hive and then synced to SOLR. To this end, we developed a simple Storage Handler for SOLR so that data can be read from and written to SOLR transparently using an external table.

We will show in this post how to install our SOLR storage handler and then run a simple example where we sync some data from Hive to SOLR.

Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages

Classification algorithms can be used to automatically classify documents and images, to implement spam filters, and in many other domains. In this tutorial we are going to use Mahout to classify tweets using the Naive Bayes Classifier. The algorithm works by using a training set, which is a set of documents already associated with a category. Using this set, the classifier determines, for each word, the probability that it makes a document belong to each of the considered categories. To compute the probability that a document belongs to a category, it multiplies together the individual probabilities of each of its words in this category. The category with the highest probability is the one the document is most likely to belong to.

To get more details on how the Naive Bayes Classifier is implemented, you can look at the Mahout wiki page.

This tutorial will give you a step-by-step description of how to create a training set, train the Naive Bayes classifier and then use it to classify new tweets.
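The idea can be sketched in a few lines of plain Java. This is a textbook multinomial Naive Bayes, not Mahout’s implementation: the class name, the whitespace tokenizer, the category prior and the add-one smoothing are assumptions made for the example. It sums log-probabilities instead of multiplying probabilities, which gives the same ranking while avoiding numeric underflow on long documents.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class NaiveBayesSketch {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>(); // category -> word -> count
    private final Map<String, Integer> totalWords = new HashMap<>();             // category -> words seen
    private final Map<String, Integer> docCounts = new HashMap<>();              // category -> documents seen
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    static String[] tokenize(String text) {
        return text.toLowerCase().trim().split("\\s+");
    }

    /** Adds one labelled document to the training set. */
    void train(String category, String text) {
        docCounts.merge(category, 1, Integer::sum);
        totalDocs++;
        Map<String, Integer> counts = wordCounts.computeIfAbsent(category, k -> new HashMap<>());
        for (String word : tokenize(text)) {
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(category, 1, Integer::sum);
            vocabulary.add(word);
        }
    }

    /** log P(category) + sum of log P(word | category), with add-one smoothing. */
    double logScore(String category, String text) {
        double score = Math.log((double) docCounts.get(category) / totalDocs);
        Map<String, Integer> counts = wordCounts.get(category);
        int total = totalWords.getOrDefault(category, 0);
        for (String word : tokenize(text)) {
            int count = counts.getOrDefault(word, 0);
            score += Math.log((count + 1.0) / (total + vocabulary.size()));
        }
        return score;
    }

    /** Returns the trained category with the highest score, or null if untrained. */
    String classify(String text) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String category : docCounts.keySet()) {
            double score = logScore(category, text);
            if (score > bestScore) {
                bestScore = score;
                best = category;
            }
        }
        return best;
    }
}
```

Smoothing matters because a single word never seen in a category would otherwise give that category a probability of zero, whatever the other words say.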

