May | 2013 | Chimpler

Installing and comparing MySQL/MariaDB, MongoDB, Vertica, Hive and Impala (Part 1)

2013/05/10

A common task in a data analyst’s day-to-day job is running aggregations over data, typically summing and averaging columns under different filters. When tables grow to hundreds of millions or billions of rows, these operations become extremely expensive and the choice of database engine is crucial. Indeed, the more queries an analyst can run during the day, the better they can understand the data.

In this post, we’re going to install 5 popular databases on Ubuntu Linux 12.04:

MySQL / MariaDB 10.0: Row-based database
MongoDB 2.4: NoSQL database
Vertica Community Edition 6: Columnar database (similar to Infobright, InfiniDB, …)
Hive 0.10: Data warehouse built on top of HDFS using MapReduce
Impala 1.0: Database implemented on top of HDFS (compatible with Hive), based on Dremel, that can use different data formats (raw CSV format, Parquet columnar format, …)

Then we’ll provide some scripts to populate them with test data, run some simple aggregation queries and measure the response time. The tests will be run on a single box without any tuning, using a relatively small dataset (160 million rows), but we’re planning to run more thorough tests in the cloud later with much bigger datasets (billions of rows). This is just to give a general idea of the performance of each database.
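To give a rough idea of what "populate, aggregate, measure" looks like in practice, here is a minimal sketch in Python. It uses SQLite from the standard library purely as a stand-in (it is not one of the five databases compared in the post), and the table name `ad_stats`, its columns and the row count are hypothetical choices made for illustration only.

```python
import random
import sqlite3
import time


def build_test_table(conn, num_rows, seed=0):
    """Create a hypothetical ad_stats table and fill it with random rows."""
    rng = random.Random(seed)
    conn.execute(
        "CREATE TABLE ad_stats ("
        "advertiser_id INTEGER, impressions INTEGER, clicks INTEGER)"
    )
    rows = (
        (rng.randint(1, 10), rng.randint(0, 1000), rng.randint(0, 50))
        for _ in range(num_rows)
    )
    conn.executemany("INSERT INTO ad_stats VALUES (?, ?, ?)", rows)
    conn.commit()


def timed_query(conn, sql):
    """Run a query, fetch all rows, and return (rows, elapsed seconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    elapsed = time.perf_counter() - start
    return rows, elapsed


# A simple aggregation: sum and average columns, grouped by one key.
AGGREGATION_SQL = """
SELECT advertiser_id, SUM(impressions), AVG(clicks)
FROM ad_stats
GROUP BY advertiser_id
ORDER BY advertiser_id
"""

conn = sqlite3.connect(":memory:")
build_test_table(conn, num_rows=100_000)
results, seconds = timed_query(conn, AGGREGATION_SQL)
print(f"{len(results)} groups in {seconds:.3f} s")
```

The same `timed_query` pattern should carry over to any engine with a Python DB-API driver; only the connection and the data loading step would change.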

Filed under hadoop Tagged with columnar, comparison, dremel, hadoop, hive, impala, installation, linux, mariadb, mysql, parquet, ubuntu, vertica

Finding association rules with Mahout Frequent Pattern Mining

2013/05/02

Association rule learning is a method for finding relations between variables in a database. For instance, using shopping receipts, we can find associations between items: bread is often purchased with peanut butter, or chips and beer are often bought together. In this post, we are going to use the Mahout Frequent Pattern Mining implementation to find the associations between items using a list of shopping transactions. For details on the algorithms (Apriori and FP-Growth) used to find frequent patterns, you can look at “The comparative study of apriori and FP-growth algorithm” by Deepti Pawar.
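To make the idea of an association between items concrete, here is a small sketch in Python that scores a rule with the usual textbook measures: support, confidence and lift. It counts by brute force over a made-up list of transactions; it is not Mahout's FP-Growth implementation and not necessarily the exact formulas used in the full post, and all function names and sample data are illustrative assumptions.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    matching = sum(1 for t in transactions if itemset <= set(t))
    return matching / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


def lift(transactions, antecedent, consequent):
    """Confidence divided by the support of the consequent.

    A value above 1 suggests the items appear together more often
    than they would if they were bought independently.
    """
    return (
        confidence(transactions, antecedent, consequent)
        / support(transactions, consequent)
    )


# Made-up shopping receipts, for illustration only.
transactions = [
    {"bread", "peanut butter"},
    {"bread", "peanut butter", "milk"},
    {"chips", "beer"},
    {"bread", "milk"},
    {"chips", "beer", "bread"},
]

for rule in [({"bread"}, {"peanut butter"}), ({"chips"}, {"beer"})]:
    antecedent, consequent = rule
    print(
        sorted(antecedent), "->", sorted(consequent),
        "support:", support(transactions, antecedent | consequent),
        "confidence:", confidence(transactions, antecedent, consequent),
        "lift:", lift(transactions, antecedent, consequent),
    )
```

Brute-force counting like this only works for tiny datasets; frequent pattern mining algorithms such as Apriori and FP-Growth exist precisely to avoid enumerating every candidate itemset.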

EDIT 2014-01-08: updated the link to the data sample marketbasket.csv (the old link was dead). Corrected the lift computation. Thanks to Felipe F. for pointing out the error in the formula.

Filed under machine learning, mahout Tagged with association rules, frequent pattern mining, machine learning, mahout
