2013/05/10
A common task in a data analyst's day-to-day job is running aggregations over data, typically summing and averaging columns under different filters. When tables grow to hundreds of millions or billions of rows, these operations become extremely expensive and the choice of database engine becomes crucial. Indeed, the more queries an analyst can run in a day, the better they can understand the data.
In this post, we’re going to install 5 popular databases on Ubuntu Linux 12.04:
- MySQL / MariaDB 10.0: Row-based database
- MongoDB 2.4: NoSQL database
- Vertica Community Edition 6: Columnar database (similar to Infobright, InfiniDB, …)
- Hive 0.10: Data warehouse built on top of HDFS using Map/Reduce
- Impala 1.0: Database implemented on top of HDFS (compatible with Hive) based on Dremel that can use different data formats (raw CSV format, Parquet columnar format, …)
Then we’ll provide some scripts to populate them with test data, run some simple aggregation queries and measure the response times. The tests will run on a single box without any tuning, using a relatively small dataset (160 million rows); we plan to run more thorough tests in the cloud later with much bigger datasets (billions of rows). This is just to give a general idea of the performance of each database.
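To make the populate / aggregate / measure loop concrete, here is a minimal sketch of what such a test could look like. It is not one of the scripts from this post: it uses Python's built-in `sqlite3` module as a stand-in for the five engines, and the `events` table, its columns, and the helper names `build_table` and `timed_query` are all hypothetical, chosen only for illustration.

```python
import random
import sqlite3
import time

# Hypothetical aggregation: sum and average two columns under a filter.
AGG_SQL = """
SELECT country, SUM(clicks), AVG(revenue)
FROM events
WHERE clicks > ?
GROUP BY country
ORDER BY country
"""


def build_table(conn, n_rows, seed=0):
    """Create a hypothetical `events` table and fill it with random rows."""
    rng = random.Random(seed)
    countries = ["US", "FR", "DE", "JP"]
    conn.execute(
        "CREATE TABLE events (country TEXT, clicks INTEGER, revenue REAL)"
    )
    rows = (
        (rng.choice(countries), rng.randint(0, 100), rng.random() * 10)
        for _ in range(n_rows)
    )
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    conn.commit()


def timed_query(conn, sql, params=()):
    """Run a query, fetch every row, and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql, params).fetchall()
    elapsed = time.perf_counter() - start
    return rows, elapsed


if __name__ == "__main__":
    # SQLite in memory stands in for a real engine; this is an assumption
    # made so the sketch runs anywhere without installing a database.
    conn = sqlite3.connect(":memory:")
    build_table(conn, 100_000)
    rows, elapsed = timed_query(conn, AGG_SQL, (50,))
    for row in rows:
        print(row)
    print(f"aggregation took {elapsed:.3f}s")
```

Fetching all rows inside the timed section matters: some drivers return a lazy cursor, so timing only the `execute` call could understate the real response time. For a real comparison, the same query would be sent through each engine's own client, ideally several times so that cold-cache and warm-cache runs can be told apart.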