Implementing a real-time data pipeline with Spark Streaming

Real-time analytics has become a very popular topic in recent years. Whether in finance (high-frequency trading), adtech (real-time bidding), social networks (real-time activity), the Internet of Things (sensors sending real-time data) or server/traffic monitoring, real-time reporting can bring tremendous value (e.g., detecting potential attacks on a network immediately, quickly adjusting ad campaigns, …). Apache Storm is one of the most popular frameworks for aggregating data in real time, but there are also many others such as Apache S4, Apache Samza, Akka Streams, SQLStream and, more recently, Spark Streaming.

According to Kyle Moses, on his page on Spark Streaming, it can process about 400,000 records / node / second for simple aggregations on small records, and it significantly outperforms other popular streaming systems such as Apache Storm (40x) and Yahoo S4 (57x). This is mainly because Apache Storm processes messages individually while Spark Streaming groups messages into small batches. Moreover, in case of failure, messages in Storm can be replayed multiple times, whereas Spark Streaming batches are processed only once, which greatly simplifies the logic (e.g., making sure some values are not counted multiple times).

At a higher level, we can see Spark Streaming as a layer on top of Spark where data streams (coming from various sources such as Kafka, ZeroMQ, Twitter, …) are transformed and batched into a sequence of RDDs (Resilient Distributed Datasets) using a sliding window. These RDDs can then be manipulated using normal Spark operations.
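As a minimal, self-contained sketch of that model (not code from the pipeline described below; the host, port and window durations are placeholders), a text stream can be cut into micro-batches and processed over a sliding window with the usual Spark-style operations:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

object WindowedWordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dstream-sketch")
    // Each micro-batch (one RDD) covers 5 seconds of the stream.
    val ssc = new StreamingContext(conf, Seconds(5))

    // Placeholder source: lines of text arriving on a socket.
    val lines = ssc.socketTextStream("localhost", 9999)

    // A 30-second sliding window, re-evaluated every 5 seconds, processed
    // with the same map / reduceByKey style used on regular RDDs.
    val wordCounts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(5))

    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Each window evaluation produces a new RDD of counts, which could just as well be joined, filtered or saved like any other RDD.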

In this post we consider an ad network where ad servers log impressions in Apache Kafka (a distributed publish-subscribe messaging system). These impressions are then aggregated by Spark Streaming into a data warehouse (here MongoDB, to keep things simple). Usually the data is further aggregated into a fast data-access layer (direct lookup) and made accessible through an API, as depicted in the figure below.

Figure 1. Ad network architecture
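As a rough, hypothetical sketch of the aggregation step in Figure 1 (the Kafka topic, group name, ZooKeeper address, log format and key fields are all assumptions, and the MongoDB write itself is left as a stub), the Spark Streaming job could look roughly like this:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.streaming.kafka.KafkaUtils

object ImpressionAggregator {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("adnetwork-aggregation")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Read the impression log from Kafka; hypothetical topic and group names.
    val impressions = KafkaUtils.createStream(
      ssc, "zookeeper-host:2181", "adnetwork-group", Map("impressions" -> 1)).map(_._2)

    // Hypothetical log line format: "timestamp,publisher,geo,bid".
    // Count impressions per (publisher, geo) in each 10-second batch.
    val counts = impressions
      .map { line =>
        val fields = line.split(",")
        ((fields(1), fields(2)), 1L)
      }
      .reduceByKey(_ + _)

    // Push each batch of aggregates to the data warehouse (MongoDB in this post).
    counts.foreachRDD { rdd =>
      rdd.foreachPartition { partition =>
        // In a real job: open one MongoDB connection per partition here
        // and insert/upsert one document per aggregated key.
        partition.foreach { case ((publisher, geo), count) =>
          println(s"$publisher / $geo -> $count") // stand-in for the MongoDB write
        }
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
```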

Classifying documents using Naive Bayes on Apache Spark / MLlib

In recent years, Apache Spark has gained popularity as a faster alternative to Hadoop, and it reached a major milestone last month by releasing the production-ready version 1.0.0. It claims to be up to 100 times faster by leveraging the distributed memory of the cluster and by not being tied to the multi-stage execution of Map/Reduce. Like Hadoop, it offers a similar ecosystem with a SQL engine (Shark), a machine learning library (MLlib), a graph library (GraphX) and many other tools built on top of Spark. Finally, Spark integrates well with Scala: one can manipulate distributed collections just like regular Scala collections, and Spark takes care of distributing the processing to the different workers.
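As a small illustration of that last point (a toy example, not code from the post), the same filter/map/reduce chain can be written against a local collection and against an RDD; only the second one is executed across the cluster:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CollectionsVsRdd {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[2]"))

    val numbers = 1 to 1000000

    // Plain Scala collection, processed on a single JVM.
    val localSum = numbers.filter(_ % 2 == 0).map(_.toLong * 2).sum

    // The same-looking chain on an RDD; Spark partitions the data and
    // distributes the filter/map/reduce across the workers.
    val distributedSum = sc.parallelize(numbers)
      .filter(_ % 2 == 0)
      .map(_.toLong * 2)
      .reduce(_ + _)

    println(s"local = $localSum, distributed = $distributedSum")
    sc.stop()
  }
}
```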

In this post, we describe how we used Spark / MLlib to classify HTML documents, using as a training set the popular Reuters 21578 collection of documents that appeared on the Reuters newswire in 1987.
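The full walkthrough is in the post itself; as a hedged sketch of the MLlib side only (the tiny in-line dataset, label mapping and vocabulary size below are made up, standing in for the parsed Reuters documents), training and applying a Naive Bayes model looks roughly like this:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

object ReutersNaiveBayesSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("reuters-naive-bayes").setMaster("local[2]"))

    // Stand-in for the parsed Reuters corpus: (categoryLabel, term frequencies),
    // where term frequencies map a term's index in a fixed vocabulary to its count.
    val vocabularySize = 4
    val documents = sc.parallelize(Seq(
      (0.0, Map(0 -> 2.0, 3 -> 1.0)),   // hypothetical category "earn"
      (1.0, Map(1 -> 1.0, 2 -> 3.0))    // hypothetical category "acq"
    ))

    // Turn each document into a LabeledPoint with a sparse feature vector.
    val trainingData = documents.map { case (label, termCounts) =>
      val (indices, values) = termCounts.toSeq.sortBy(_._1).unzip
      LabeledPoint(label, Vectors.sparse(vocabularySize, indices.toArray, values.toArray))
    }

    // Train a Naive Bayes model; lambda is the additive (Laplace) smoothing parameter.
    val model = NaiveBayes.train(trainingData, lambda = 1.0)

    // Classify a new document represented with the same vocabulary.
    val prediction = model.predict(Vectors.sparse(vocabularySize, Array(2), Array(2.0)))
    println(s"predicted category label: $prediction")

    sc.stop()
  }
}
```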
