A Hadoop Alternative: Building a real-time data pipeline with Storm

With the tremendous growth of the online advertising industry, ad networks have to process huge amounts of data. For years, Hadoop has been the de facto technology for aggregating data logs, but although it is efficient at processing big batches, it was not designed to deal with real-time data. With the growing popularity of streaming aggregation engines (Storm, S4, …), ad networks have started to integrate them into their data pipelines to offer real-time aggregation.

When you go to a website and see an ad, it’s usually the result of a real-time auction that an ad network runs between different advertisers. Whoever places the highest bid has their ad displayed in your browser. Advertisers determine their bids from information such as the user’s segment, the page the user is on and many other factors. This process is usually run by an ad server.

Each of these ad impressions is usually logged by an ad server and sent to a data pipeline that is in charge of aggregating the data (e.g., number of impressions for a particular advertiser in the last hour). The data pipeline usually processes billions of logs daily, and it’s a challenge to have enough resources to cope with this volume of input. The aggregates must also be made available to advertisers and publishers in a reasonable amount of time so that they can adapt their campaign strategy (an advertiser might see that they get sales from a particular publisher and thus want to display more ads on that publisher’s website). The faster they get these data, the better, so companies try to aggregate data and make them available up to the last hour and, for some, up to the last minute.

[Figure: Data pipeline based on Hadoop]

Most ad networks use Hadoop to aggregate data. Hadoop is very efficient at processing large amounts of data, but it is not suited to real-time aggregation, where data need to be available to the minute. Typically, each ad server continuously sends its logs to the data pipeline through a queue mechanism. Hadoop is then scheduled to run an aggregation every hour and store the result in a data warehouse.

More recently, some ad networks have started to use Storm, a distributed real-time computation system open-sourced by Nathan Marz in late 2011. In this new architecture, Storm reads the stream of logs from the ad servers, aggregates them on the fly and stores the aggregated data in the data warehouse as soon as they’re available. Here we explore, through a simple example, how to develop and run a real-time aggregator that aggregates data by minute, by hour and by day.

[Figure: Streaming data pipeline aggregation by minute with Storm]

Use Case

A very simplified log of impressions will look something like:

timestamp               publisher advertiser  bid     website  cookie  geo
----------------------------------------------------------------------------
2013-01-28 13:21:12     pub1      adv10       0.0001  abc.com  1214    NY
2013-01-28 13:21:13     pub1      adv10       0.0002  abc.com  1214    NY
2013-01-28 13:21:14     pub2      adv20       0.0003  xyz.com  4321    CA
2013-01-28 13:21:15     pub2      adv20       0.0001  xyz.com  5675    CA

So at 2013-01-28 13:21:12, an ad from advertiser adv10 was displayed to the user identified by cookie 1214, located in NY, on the website abc.com, with a winning bid of $0.0001.

In reality, however, ad servers log these data in a much more compact way, using ids instead of strings so that logs are smaller and faster to parse. The log will look more like this:

timestamp               publisher_id advertiser_id  bid     website_id  cookie  state_id
----------------------------------------------------------------------------------------
2013-01-28 13:21:12     1            10             0.0001  1           1214    1
2013-01-28 13:21:13     1            10             0.0002  1           1214    1
2013-01-28 13:21:14     2            20             0.0003  2           4321    2
2013-01-28 13:21:15     2            20             0.0001  2           5675    2

For the sake of the example, let’s simplify it even further by only considering the following columns:

timestamp               publisher_id   cookie
---------------------------------------------
2013-01-28 13:21:12     1              1214
2013-01-28 13:21:13     1              1214
2013-01-28 13:21:14     2              4321
2013-01-28 13:21:15     2              5675

In our example, we will aggregate these data so we know for each publisher:

  • number of impressions per minute: in our example, publisher 1 has 2 impressions
  • number of unique visitors per minute (determined by the cookie): in our example, publisher 1 has 1 unique visitor because both impressions carry the same cookie.

We will also do the same per hour and per day using the exact same stream.
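
The two metrics can be sketched in plain Java, independently of Storm. This is only an illustration under stated assumptions: `ImpressionStats`, `Stats` and `aggregateByMinute` are hypothetical names that do not come from the example repository, and each row is assumed to hold an epoch timestamp in seconds, a publisher id and a cookie id.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical helper, not part of the example repository. */
final class ImpressionStats {

    /** Impression count and distinct cookies for one publisher and one minute. */
    static final class Stats {
        long impressions;
        final Set<Integer> cookies = new HashSet<>();

        int uniques() {
            return cookies.size();
        }
    }

    /**
     * Groups rows of {epochSeconds, publisherId, cookieId} under the key
     * "publisherId@minuteStart" and counts impressions and distinct cookies.
     */
    static Map<String, Stats> aggregateByMinute(long[][] rows) {
        Map<String, Stats> result = new TreeMap<>();
        for (long[] row : rows) {
            long minuteStart = (row[0] / 60) * 60; // truncate to the minute
            String key = row[1] + "@" + minuteStart;
            Stats stats = result.computeIfAbsent(key, k -> new Stats());
            stats.impressions++;
            stats.cookies.add((int) row[2]);
        }
        return result;
    }
}
```

Fed with the four sample rows above, this gives publisher 1 two impressions and one unique visitor, and publisher 2 two impressions and two unique visitors.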

Using Storm

In a Storm topology, there are 2 main types of nodes:

  • spout: brings an input stream into the Storm cluster. It can read data from a JMS queue, a Twitter stream, a database, … and emits them as a stream of tuples that will be processed by bolts.
  • bolt: processes data, taking as input a stream from a spout or from another bolt. After processing the data, it can either store them in a database or emit them as a stream to other bolts.

In our example, the topology will look like this:

[Figure: aggregation topology]

ImpressionLogSpout emits a stream that is aggregated by 3 bolts at the same time.

The AggByMinuteBolt is in charge of aggregating the impression logs as they come in, in timestamp order, so that we know for each publisher how many impressions and how many unique visitors we got for a particular minute.

Internally, we use an accumulator structure for each publisher:

  • When an impression comes in, we check whether it belongs to the current minute. If so, we increment the impression count and add the cookie id to the cookie set.
  • When an impression comes in that belongs to the next minute, we know that no further impression for the previous minute will be received, so we can persist the number of impressions and the number of unique visitors (the number of cookie ids in the set) to MongoDB.
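
A per-publisher accumulator along these lines could look like the following sketch. The name `SliceAccumulator` and its methods are assumptions made for illustration; the accumulator class in the repository may be organised differently.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative accumulator; the class in the repository may differ. */
final class SliceAccumulator {
    private long sliceStart = -1;   // start of the slice being accumulated
    private long impressions = 0;   // impressions seen in that slice
    private final Set<Integer> cookies = new HashSet<>(); // distinct cookie ids

    /** Records one impression belonging to the given time slice. */
    void add(int cookieId, long timestampSlice) {
        sliceStart = timestampSlice;
        impressions++;
        cookies.add(cookieId);
    }

    /** Clears the state once the finished slice has been persisted. */
    void reset() {
        sliceStart = -1;
        impressions = 0;
        cookies.clear();
    }

    long getSliceStart() {
        return sliceStart;
    }

    long getImpressions() {
        return impressions;
    }

    int getUniques() {
        return cookies.size();
    }
}
```

Keeping the full cookie set is what makes the unique count exact; the price is that memory grows with the number of distinct visitors per publisher and per slice.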

The example code is available on GitHub:

$ git clone http://github.com/chimpler/blog-storm-adnetwork-example

There are only 3 Java classes:

  • Main.java
  • RandomImpressionTupleSpout.java
  • AggregateByTimeAndPersistBolt.java

The Main class builds the topology and runs it in local mode (useful for testing). We will show in a later post how to run it in the cloud. For the sake of simplicity, we pretend that the spout receives impression logs from ad servers and streams them to the bolts. In fact, the spout in our case simply generates fake impression logs. AggregateByTimeAndPersistBolt.java aggregates the impressions by publisher and can be configured to aggregate over any time interval (in seconds).

As you can see in Main, the logic is very simple:

// The spout is the single source of impression tuples.
builder.setSpout("ImpressionLogSpout", new RandomImpressionTupleSpout());
// Each bolt takes its aggregation interval in seconds and receives
// all tuples of a given publisher_id thanks to fieldsGrouping.
builder.setBolt("AggByMinuteBolt", new AggregateByTimeAndPersistBolt(10))
       .fieldsGrouping("ImpressionLogSpout", new Fields("publisher_id"));
builder.setBolt("AggByHourBolt", new AggregateByTimeAndPersistBolt(30))
       .fieldsGrouping("ImpressionLogSpout", new Fields("publisher_id"));
builder.setBolt("AggByDayBolt", new AggregateByTimeAndPersistBolt(60))
       .fieldsGrouping("ImpressionLogSpout", new Fields("publisher_id"));

We register RandomImpressionTupleSpout as the spout and name it “ImpressionLogSpout”.
Then we create 3 AggregateByTimeAndPersistBolt bolts, each with a different time interval in seconds, and subscribe each of them to the stream of “ImpressionLogSpout” through fieldsGrouping. Note that although the bolts are named after minute, hour and day, this demo passes intervals of 10, 30 and 60 seconds.

In Storm, a topology can run multiple instances of each spout or bolt in order to increase parallelism. Suppose we have 1 instance of the ImpressionLogSpout and 2 instances of AggregateByTimeAndPersistBolt(10) running. When the spout emits a new tuple, it can be delivered to either of the 2 instances. You can tell the topology to choose one at random using shuffleGrouping("ImpressionLogSpout"). By using fieldsGrouping instead, we explicitly route all the tuples of a particular publisher_id to the same bolt instance.
Each instance therefore knows that it alone receives the tuples for a given publisher, so there is no need to merge results from the other instances for that publisher later on.
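
The difference between the two groupings can be pictured with a small conceptual model. This is not Storm's internal code: `GroupingSketch` and both methods are hypothetical, and the sketch only assumes that a fields grouping derives the target instance from a hash of the grouping field.

```java
import java.util.Random;

/** Conceptual model only; this is not Storm's internal code. */
final class GroupingSketch {

    /** Fields grouping: the same key always maps to the same task index. */
    static int fieldsGroupingTask(int publisherId, int numTasks) {
        // floorMod keeps the index non-negative even for negative hashes.
        return Math.floorMod(Integer.hashCode(publisherId), numTasks);
    }

    /** Shuffle grouping: any task may receive the tuple. */
    static int shuffleGroupingTask(Random random, int numTasks) {
        return random.nextInt(numTasks);
    }
}
```

With 2 bolt instances, every call to `fieldsGroupingTask` for a given publisher returns the same index, which is why each instance can keep that publisher's accumulator locally.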

The RandomImpressionTupleSpout is pretty straightforward as you can see in the code excerpt:

@Override
public void nextTuple() {
    long timestamp = System.currentTimeMillis() / 1000;
    int publisherId = random.nextInt(10);
    int countryId = random.nextInt(10);
    int cookieId = random.nextInt(10000);

    collector.emit(new Values(timestamp, publisherId, countryId, cookieId));
}

@Override
public void declareOutputFields(OutputFieldsDeclarer declarer) {
    declarer.declare(new Fields("timestamp", "publisher_id", "country_id", "cookie_id"));
}

The method nextTuple will emit new tuples to the stream. We also define the structure of the tuples by giving the name of the different fields in the method declareOutputFields.

Now let’s take a look at how we aggregate the data in the class AggregateByTimeAndPersistBolt:

@Override
public void execute(Tuple input) {
    long timestamp = input.getLongByField("timestamp");
    int publisherId = input.getIntegerByField("publisher_id");
    int cookieId = input.getIntegerByField("cookie_id");

    // round time by aggregationTime
    long timestampSlice = (timestamp / aggregationTime) * aggregationTime;

    Accumulator accumulator = accumulators.get(publisherId);
    if (accumulator == null) {
        accumulator = new Accumulator();
        accumulators.put(publisherId, accumulator);
    } else {
        // if we receive a new tuple that has a timestamp in another time slice
        // persist previous data
        long lastTimestampSlice = accumulator.getLastTimestamp();

        if (lastTimestampSlice != timestampSlice) {
            persistLastSlice(publisherId);
        }
    }

    accumulator.add(cookieId, timestampSlice);
}

We simply read each tuple and check whether the impression belongs to the time slice currently being aggregated. If so, we tell the accumulator to add the cookie id to its set and increment the impression counter. If not, we know that we won’t get any other impression for the previous slice, so we first push its data to MongoDB.

For example, suppose a bolt aggregating by minute receives the following stream for publisher 1:
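
The rounding step in execute relies on integer division. The helper below isolates it; `TimeSlices` and `sliceStart` are illustrative names, and the sketch assumes non-negative timestamps expressed in seconds.

```java
/** Illustrative helper mirroring the rounding used in execute(). */
final class TimeSlices {

    /** Returns the start, in epoch seconds, of the slice containing the timestamp. */
    static long sliceStart(long timestampSeconds, long aggregationTimeSeconds) {
        // Integer division drops the remainder, so multiplying back yields
        // the largest multiple of the interval that is <= the timestamp.
        return (timestampSeconds / aggregationTimeSeconds) * aggregationTimeSeconds;
    }
}
```

For instance, the timestamp 1359945191 falls in the slice starting at 1359945190 for a 10-second interval, and in the slice starting at 1359945180 for a 30-second or 60-second interval.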

timestamp  publisher_id  cookie_id
----------------------------------
15:01:00   1             4214
15:01:23   1             1234
15:01:40   1             1234
15:01:42   1             1234
15:01:43   1             4214
15:01:52   1             1234
15:01:59   1             4214
15:02:02   1             4214

For the minute 15:01 we increment a counter (incremented here 7 times) and maintain a set of unique cookie ids (e.g., {4214, 1234}).
When we encounter the minute 15:02 we know that we are done with all the data for 15:01 and so at that point we can issue the aggregation for 15:01: 7 impressions and 2 uniques.
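
The walk-through above can be replayed with a few lines of plain Java. This is a sketch for a single publisher, not the repository code: `SliceReplay` and `replay` are hypothetical names, and timestamps are assumed to be seconds arriving in order.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Illustrative replay of one publisher's stream; not the repository code. */
final class SliceReplay {

    /**
     * Feeds (timestamp, cookie) pairs in arrival order and returns one line
     * "sliceStart count uniques" for every slice closed by a later tuple.
     */
    static List<String> replay(long[] timestamps, int[] cookies, long interval) {
        List<String> flushed = new ArrayList<>();
        Set<Integer> seen = new HashSet<>();
        long currentSlice = -1;
        long count = 0;
        for (int i = 0; i < timestamps.length; i++) {
            long slice = (timestamps[i] / interval) * interval;
            if (currentSlice != -1 && slice != currentSlice) {
                // A tuple from another slice arrived: emit the finished slice.
                flushed.add(currentSlice + " " + count + " " + seen.size());
                count = 0;
                seen.clear();
            }
            currentSlice = slice;
            count++;
            seen.add(cookies[i]);
        }
        return flushed; // the last, still open slice is not emitted
    }
}
```

Expressing the sample timestamps as seconds since midnight (15:01:00 is 54060) and using a 60-second interval, the replay emits a single line for the 15:01 slice with 7 impressions and 2 uniques. The 15:02 slice stays in memory until a tuple from a later slice arrives, so a publisher that goes quiet delays the persistence of its last slice.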

Running the example

You will need to have Java and Maven installed and MongoDB running.
Now let’s run the example:

$ ./run.sh

In another terminal, open MongoDB:

$ mongo adnetwork_test
> show collections;
report_data_10
report_data_30
report_data_60
system.indexes

> db.report_data_60.find().sort({$natural:-1})
{ "_id" : "5-1359945180", "publisher_id" : 5, "timestamp" : 1359945180, "count" : 10267, "uniques" : 6391 }
{ "_id" : "1-1359945180", "publisher_id" : 1, "timestamp" : 1359945180, "count" : 10290, "uniques" : 6414 }
{ "_id" : "9-1359945180", "publisher_id" : 9, "timestamp" : 1359945180, "count" : 10260, "uniques" : 6457 }
{ "_id" : "3-1359945180", "publisher_id" : 3, "timestamp" : 1359945180, "count" : 10248, "uniques" : 6360 }
{ "_id" : "0-1359945180", "publisher_id" : 0, "timestamp" : 1359945180, "count" : 10350, "uniques" : 6427 }

As you can see, you have for each time interval and for each publisher the number of impressions and uniques.

Conclusion

As you can see, creating a topology is very simple. We didn’t go into details about how to configure the parallelism of the bolts or how to deploy the topology in the cloud. One thing to notice is how easy it is to compute multiple aggregations at the same time, and how quickly aggregates can be published once they are computed.

However, note that Storm works particularly well on time-based data. For any other arbitrary aggregation, Hadoop is probably a better choice.

A further advantage of streaming aggregation is the ability to run multi-level aggregations. For example, each data center can partially aggregate its own data before sending them to the data center in charge of the final aggregation. Partial impression counts can simply be summed, and partial unique counts can be summed as well in some cases (if the data centers are on different continents, a user will in most cases hit only one of them). This reduces the amount of data sent between data centers and saves bandwidth.

We will show in an upcoming post how to deploy this example in the cloud.
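
A minimal sketch of such a merge step follows. `PartialAggregate` is a hypothetical type, and the sketch assumes that each data center ships its cookie set along with its impression count so that the final count of uniques stays exact.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical partial aggregate produced by one data center. */
final class PartialAggregate {
    final long impressions;
    final Set<Integer> cookies;

    PartialAggregate(long impressions, Set<Integer> cookies) {
        this.impressions = impressions;
        this.cookies = new HashSet<>(cookies); // defensive copy
    }

    /** Impression counts add up; cookie sets are united to avoid double counting. */
    PartialAggregate merge(PartialAggregate other) {
        Set<Integer> union = new HashSet<>(cookies);
        union.addAll(other.cookies);
        return new PartialAggregate(impressions + other.impressions, union);
    }

    int uniques() {
        return cookies.size();
    }
}
```

When the visitor populations of the data centers are known not to overlap, the sets do not need to be shipped at all: each site can send only its unique count and the final stage adds the counts, which is where the bandwidth saving comes from.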

About chimpler
http://www.chimpler.com

12 Responses to A Hadoop Alternative: Building a real-time data pipeline with Storm

  5. Pere says:

    How do you manage fault tolerance in this system? Having state in bolts doesn’t seem to be fault tolerant: if a node dies, all the state in the bolt is lost, isn’t it?

    • chimpler says:

      Hi Pere. In our simple example, we don’t handle fault tolerance. However, you can read this page that explains how to do it in Storm: https://github.com/nathanmarz/storm/wiki/Guaranteeing-message-processing. Basically in our example, we would have to have a mechanism where we acknowledge all the tuples right after pushing an aggregation to MongoDB. If the bolt dies in the middle of an aggregation, then Storm would restart the bolt and because the tuples forming the aggregation wouldn’t have been acknowledged, the spout would know that these tuples haven’t been aggregated and can replay them again. So in our example, we would have to modify the spout and the bolt to take into account all this.

  6. gvtech says:

    Why don’t you use an in-line topology where the ByMinuteBolt emits the total by minute and this total is then consumed by the ByHourBolt, …? Something like what is done in an RRD database?

    • chimpler says:

      Hi gvtech, yes you’re right we can use an in-line topology to do it but I wanted to keep the example simple. As you said chaining aggregation per minute and per hour can be done easily when you aggregate the number of impressions since you can sum up the total number of impressions from the different minutes within the hour. However summing up the number of unique visitors is a little bit trickier as you have to pass the sets of cookie ids for each minute and then aggregate them to count the number of distinct elements per hour.
