Playing with Apache Hive, MongoDB and the MTA

Apache Hive is a popular datawarehouse system for Hadoop that allows to run SQL queries on top of Hadoop by translating queries into Map/Reduce jobs. Due to the high latency incurred by Hadoop to execute Map/Reduce jobs, Hive cannot be used in applications that require fast access to data. One common technique is to use Hive to pre-aggregate data logs stored in HDFS and then sync the data to a Datawarehouse.

In this post we’re going to describe how to install Hive and then, as New York City straphangers, we’re going to load subway train movement data from the MTA in HDFS, execute Hive queries to aggregate the number of daily average train movements per line and store the result in MongoDB.
Read more of this post

Advertisements
%d bloggers like this: