With the tremendous growth of the online advertising industry, ad networks have to deal with a humongous amount of data to process. For years, Hadoop has been the de-facto technology used to aggregate data logs but although it is efficient in processing big batches, it has not been designed to deal with real-time data. With the popularity of streaming aggregation engine (Storm, S4, …) ad networks started to integrate them in their data pipeline to offer real-time aggregation.
When you go to a website and see an ad, it’s usually the result of a process run by an ad network that runs a real-time bidding between different advertisers. Whoever puts the highest bid will have their ads displayed on your browser. Advertisers use different information to determine their bids based on the user’s segment, the page the user is on and a lot of other factors. This process is usually run by an ad server.
Each of these (ad) impressions is usually logged by an ad server and sent to a data pipeline that is in charge of aggregating these data (e.g., number of impressions for a particular advertiser in the last hour). The data pipeline usually processes billions and billions of logs daily and it’s a challenge to have the resources to be able to cope with the huge input of data. The aggregation must also be made available to advertisers and publishers in a reasonable amount of time so that they can adapt their strategy in their campaigns (an advertiser might see that they manage to get sales from a particular publisher and thus wants to display more ads on their website). So, the faster they get these data, the better, so companies are trying to aggregate data and made them available up to the last hour and for some up to the minute.
Data pipeline based on Hadoop
Most of the ad networks use Hadoop to aggregate data. Hadoop is really efficient in processing a large amount of data but it is not suited for real-time aggregation where data need to be available to the minute. Usually the way it works is each ad server sends its logs to the data pipeline continuously through a queue mechanism. Then Hadoop is scheduled to run an aggregation every hour and then store it in a data warehouse.
More recently, some ad networks started to use Storm which is a real-time aggregation ETL developed by Nathan Marz in late 2011. In this new architecture, Storm can read the stream of logs from the adservers, aggregate them on the fly and store the aggregated data directly in the data warehouse as soon as they’re available. We explore here through a simple example how to develop and deploy a simple real-time aggregator that aggregates data by minute, by hour and by day.
Read more of this post