Playing with OpenCV in Scala to do face detection with Haarcascade classifier using a webcam

Detecting objects in images has been used in many applications: auto tagging pictures (e.g. Facebook, Phototime), counting the number of people in a street(e.g. Placemeter), classifying pictures, … And even in Bistro, a device to feed cats.

This post is a small introduction to OpenCV an open source computer vision library using scala.
The OpenCV library provides several features to manipulate images(apply filters, transformation), detect faces and recognize faces in images.
In this post, we are going to implement two small scala programs:

  • read an image and run the Haar cascade classifier to detect the faces in the image
  • use the webcam and detect faces in real time

To detect faces, we are using the Haar Feature-based Cascade Classifier which is a classifier to detect objects in an image. You can find a good introduction on the openCV website and on the Facial Recognition youtube video by Tom Neumark

Note that detecting faces and recognizing faces(who?) are two different problems and use two different approaches. In this post we are going to only look at face detection.


You can fetch the code used in this post on github:

git clone

Detecting faces in an image

In this section we are going to detect the faces in the skyfall movie cast picture below by using a Haar Cascade classifier:

James Bond Skyfall photocall - London


We start by reading the image using opencv:
Read more of this post

Building a food recommendation engine with Spark / MLlib and Play

Recommendation engines have become very popular in the last decade with the explosion of e-commerce, on demand music and movie services, dating sites, local reviews, news aggregation and advertising (behavioral targeting, intent targeting, …). Depending on your past actions (e.g., purchases, reviews left, pages visited, …) or your interests (e.g., Facebook likes, Twitter follows), the recommendation engine will present other products that might interest you using other users actions and user behaviors (page clicks, page views, time spent on page, clicks on images/reviews, …).

In this post, we’re going to implement a recommender for food using Apache Spark and MLlib. So for instance, if one is interested by some coffee products then we might recommend her some other coffee brands, coffee filters or some related products that some other users like too.

Read more of this post

Segmenting Audience with KMeans and Voronoi Diagram using Spark and MLlib

Analyzing huge data set to extract meaningful properties can be a difficult task. Several methods have been developed for the last 50 years to find hidden information.

Clustering algorithms can be used to group similar news like in Google News, find areas with high crime concentration, find trends, .. and segment the data into groups. This segmentation can be used for instance by publisher to reach a specific target audience.

In this post, we will be using the k-means clustering algorithm implemented in Spark Machine Learning Library(MLLib) to segment the dataset by geolocation .


The k-mean clustering algorithm is an unsupervised algorithm meaning that you don’t need to provide a training example for it to work(unlike neural network, SVM, Naives Bayes classifiers, …). It partitions observations into clusters in which each observation belongs to the cluster with the nearest mean. The algorithm takes as input the observations, the number of clusters(denoted k) that we want to partition the observation into and the number of iterations. It gives as a result the centers of the clusters.

The algorithm works as follow:

  1. Take k random observations out of the dataset. Set the k centers of the clusters to those points
  2. For each observation, find the cluster center which is the closest and assign this observation to this cluster
  3. For each cluster, compute the new center by taking the average of the features of the observations assigned to this dataset
  4. Go back to 2 and repeat this for a given number of iterations

The centers of the clusters will converge and will minimize the cost function which is the sum of the square distance of each observation to their assigned cluster centers.

This minimum might be a local optimum and will depend on the observation that were randomly taken at the beginning of the algorithm.
Read more of this post

Implementing a real-time data pipeline with Spark Streaming

Real-time analytics has become a very popular topic in recent years. Whether it is in finance (high frequency trading), adtech (real-time bidding), social networks (real-time activity), Internet of things (sensors sending real-time data), server/traffic monitoring, providing real-time reporting can bring tremendous value (e.g., detect potential attacks on network immediately, quickly adjust ad campaigns, …). Apache Storm is one of the most popular frameworks to aggregate data in real-time but there are also many others such as Apache S4, Apache Samza, Akka Streams, SQLStream and more recently Spark Streaming.

According to Kyle Moses, on his page on Spark Streaming, it can process about 400,000 records / node / second for simple aggregations on small records and significantly outperforms other popular streaming systems such as Apache Storm (40x) and Yahoo S4 (57x). This can be mainly explained because Apache Storm processes messages individually while Apache Spark groups messages in small batches. Moreover in case of failure, where in Storm messages can be replayed multiple times, Spark Streaming batches are only processed once which greatly simplifies the logic (e.g., to make sure some values are not counted multiple times).

At a higher level, we can see Spark Streaming as a layer on top of Spark where data streams (coming from various sources such as Kafka, ZeroMQ, Twitter, …) can be transformed and batched in a sequence of RDDs (Resilient Distributed DataSets) using a sliding window. These RDDs can then be manipulated using normal Spark operations.

In this post we consider an adnetwork where adservers log impressions in Apache Kafka (distributed publish-subscribe messaging system). These impressions are then aggregated by Spark Streaming into a datawarehouse (here MongoDB to simplify). Usually the data is further aggregated into a fast data access layer (direct lookup) and accessible through an API as depicted in the figure below.

adnetwork architecture
Figure 1. Ad network architecture
Read more of this post

Analyzing your audience location with Twitter Streams and Heat Maps

With the democratization of GPS and IP geolocation in portable devices (laptop, tablet, phone, Internet of things, …), more and more data containing geolocation information become available. Geolocation is now used by most of the main web applications to improve their services. For instance social network, transport network company or dating sites can use your instant location to show potential matches around you. Search engines can provide more personalized search result based on your location and ads network to better target their audience. With this geolocated data available in realtime, some applications such as Swarm, FourSquare are now allowing to be notified of friends coming nearby or events happening in their neighborhood.

In this post we will describe how to listen to tweet streams and represent their positions on a world map.


A geo location is described by three values:

  • the latitude: the angular distance of the place to the earth’s equator (range from -90 to 90 degrees)
  • the longitude: the angular distance of the place to the greenwish meridian (range from -180 to 180 degrees)
  • the elevation: height above sea level

To represent the Earth’s surface on a two dimensional plane, we can use different map projections (Mercator, Tissot, …), each having their own advantages and drawbacks in term of distance, area and angle distortions.

We are going to use the equirectangular projection (also known as Platte Carrée) which is quite popular because of its simplicity. On a 2D map, the x-coordinate position is proportional to the longitude and the y-coordinate position to the latitude.


Read more of this post

Classifiying documents using Naive Bayes on Apache Spark / MLlib

Apache SparkIn recent years, Apache Spark has gained in popularity as a faster alternative to Hadoop and it reached a major milestone last month by releasing the production ready version 1.0.0. It claims to be up to a 100 times faster by leveraging the distributed memory of the cluster and by not being tied to the multi stage execution of Map/Reduce. Like Hadoop, it offers a similar ecosystem with a database (Shark SQL), a machine learning library (MLlib), a graph library (GraphX) and many other tools built on top of Spark. Finally Spark integrates well with Scala and one can manipulate distributed collections just like regular Scala collections and Spark will take care of distributing the processing to the different workers.

In this post, we describe how we used Spark / MLlib to classify HTML documents using the popular Reuters 21578 collection of documents that appeared on Reuters newswire in 1987 as a training set.
Read more of this post

Implementing a java agent to instrument code

4186-12920With a system running 24/7, you have to make sure that it performs well at any time of the day. Several commercial solutions exist to monitor the performance of systems: NewRelic, GraphDat and many others. They allow to see for instance if the api call response time change week after week or after each release of the project. So the developers can easily spot where the bottlenecks are and fix them.

You can also use profilers such as JProfiler, YourKit, … to detect bottlenecks, memory leaks, thread leaks, …

Most of those tools works by using a java agent, a pluggable library that runs embedded in a JVM that intercepts the classloading process. By modifying the classloading process, they can dynamically change the classes instructions to perform method logging, performance measure, …

In this post we are going to describe how to implement a simple java agent to measure how frequently and how long some methods of your application take and publish the results to JMX and to Graphite.
Read more of this post


Get every new post delivered to your Inbox.

Join 177 other followers

%d bloggers like this: