Segmenting Audience with KMeans and Voronoi Diagram using Spark and MLlib

Analyzing a huge data set to extract meaningful properties can be a difficult task. Several methods have been developed over the last 50 years to find hidden information.

Clustering algorithms can be used to group similar news stories as Google News does, find areas with a high concentration of crime, spot trends, and more generally segment data into groups. This segmentation can be used, for instance, by publishers to reach a specific target audience.

In this post, we will use the k-means clustering algorithm implemented in Spark's machine learning library (MLlib) to segment a dataset by geolocation.

[Image: tweets-voronoi5]

The k-means clustering algorithm is unsupervised, meaning that you don’t need to provide labeled training examples for it to work (unlike neural networks, SVMs, or naive Bayes classifiers). It partitions observations into clusters in which each observation belongs to the cluster with the nearest mean. The algorithm takes as input the observations, the number of clusters (denoted k) that we want to partition the observations into, and the number of iterations. It returns the centers of the clusters.

The algorithm works as follows:

  1. Take k random observations out of the dataset and set the k cluster centers to those points.
  2. For each observation, find the closest cluster center and assign the observation to that cluster.
  3. For each cluster, compute the new center by averaging the features of the observations assigned to that cluster.
  4. Go back to step 2 and repeat for the given number of iterations.
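
The four steps above can be sketched in plain Python. This is a minimal illustration and not the Spark MLlib implementation: the function names (kmeans, closest_center, squared_distance, cost), the seed parameter, and the choice to keep the old center when a cluster ends up empty are all assumptions made for the example. Observations are assumed to be tuples of numbers, such as (longitude, latitude).

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def closest_center(point, centers):
    """Index of the center nearest to the point (ties go to the lowest index)."""
    return min(range(len(centers)),
               key=lambda i: squared_distance(point, centers[i]))


def kmeans(points, k, iterations, seed=0):
    """Plain k-means; `seed` is an assumption added for reproducibility."""
    rng = random.Random(seed)
    # Step 1: pick k random observations as the initial centers.
    centers = rng.sample(points, k)
    # Step 4: repeat steps 2 and 3 for a fixed number of iterations.
    for _ in range(iterations):
        # Step 2: assign each observation to its closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[closest_center(p, centers)].append(p)
        # Step 3: move each center to the mean of its assigned observations.
        for i, members in enumerate(clusters):
            if members:  # assumption: an empty cluster keeps its old center
                centers[i] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return centers


def cost(points, centers):
    """Sum of squared distances from each observation to its nearest center."""
    return sum(squared_distance(p, centers[closest_center(p, centers)])
               for p in points)


if __name__ == "__main__":
    # Two well-separated groups of made-up points.
    points = [(0, 0), (0, 1), (1, 0), (1, 1),
              (10, 10), (10, 11), (11, 10), (11, 11)]
    centers = kmeans(points, k=2, iterations=10)
    print(sorted(centers), cost(points, centers))
```

Running the sketch with different seed values on less cleanly separated data is a quick way to see how the random starting points influence the result.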

The cluster centers converge toward positions that minimize the cost function, which is the sum of the squared distances from each observation to its assigned cluster center.
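
Written out, the cost function takes the following form. The symbols are notation chosen here for illustration and do not come from the original post: x_i is the i-th of n observations, \mu_j is the center of cluster j, and c(i) is the index of the cluster that observation i is assigned to.

```latex
J = \sum_{i=1}^{n} \left\lVert x_i - \mu_{c(i)} \right\rVert^{2},
\qquad
c(i) = \arg\min_{j \in \{1, \dots, k\}} \left\lVert x_i - \mu_j \right\rVert^{2}
```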

This minimum might be only a local optimum, and it depends on the observations that were randomly picked at the beginning of the algorithm.

Analyzing your audience location with Twitter Streams and Heat Maps

With the democratization of GPS and IP geolocation in portable devices (laptops, tablets, phones, Internet of Things devices, …), more and more data containing geolocation information is becoming available. Geolocation is now used by most major web applications to improve their services. For instance, social networks, transport network companies, and dating sites can use your current location, for example to show potential matches around you. Search engines can provide more personalized search results based on your location, and ad networks can better target their audience. With this geolocated data available in real time, applications such as Swarm and Foursquare now let users be notified of friends coming nearby or events happening in their neighborhood.

In this post we will describe how to listen to tweet streams and represent their positions on a world map.

Introduction

A geolocation is described by three values:

  • the latitude: the angular distance of the place from the Earth’s equator (range from -90 to 90 degrees)
  • the longitude: the angular distance of the place from the Greenwich meridian (range from -180 to 180 degrees)
  • the elevation: height above sea level

To represent the Earth’s surface on a two-dimensional plane, we can use different map projections (Mercator, Robinson, …), each having its own advantages and drawbacks in terms of distance, area, and angle distortions.

We are going to use the equirectangular projection (also known as plate carrée), which is quite popular because of its simplicity. On a 2D map, the x-coordinate is proportional to the longitude and the y-coordinate to the latitude.
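
As a minimal sketch of that mapping, the function below converts a longitude and latitude in degrees to pixel coordinates on a map image. The function name, the width and height parameters, and the screen convention (origin at the top-left corner, y growing downward) are assumptions made for this example, not details from the original post.

```python
def equirectangular(longitude, latitude, width, height):
    """Map (longitude, latitude) in degrees to (x, y) on a width x height image.

    Assumes the image covers the whole globe, with longitude -180 on the left
    edge and latitude +90 on the top edge (y grows downward, as on a screen).
    """
    x = (longitude + 180.0) / 360.0 * width
    y = (90.0 - latitude) / 180.0 * height
    return x, y


if __name__ == "__main__":
    # The point where the equator crosses the Greenwich meridian
    # lands in the middle of the image.
    print(equirectangular(0.0, 0.0, 360, 180))
```

Because both coordinates are simple linear functions of the angles, the projection is cheap to compute, at the price of growing horizontal stretching toward the poles.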

[Image: cartesian2d]
