Installing and comparing MySQL/MariaDB, MongoDB, Vertica, Hive and Impala (Part 1)

A common thing a data analyst does in their day-to-day job is to run aggregations of data, generally by summing and averaging columns using different filters. When tables grow to hundreds of millions or billions of rows, these operations become extremely expensive and the choice of a database engine is crucial. Indeed, the more queries an analyst can run during the day, the better they can understand the data.

In this post, we’re going to install 5 popular databases on Ubuntu Linux 12.04:

  • MySQL / MariaDB 10.0: Row based database
  • MongoDB 2.4: NoSQL database
  • Vertica Community Edition 6: Columnar database (similar to Infobright, InfiniDB, …)
  • Hive 0.10: Data warehouse built on top of HDFS using Map/Reduce
  • Impala 1.0: Database implemented on top of HDFS (compatible with Hive), based on Dremel, that can use different data formats (raw CSV format, Parquet columnar format, …)

Then we’ll provide some scripts to populate them with test data, run some simple aggregation queries and measure the response time. The tests will be run on a single box without any tuning, using a relatively small dataset (160 million rows), but we’re planning on running more thorough tests in the cloud later with much bigger datasets (billions of rows). This is just to give a general idea of the performance of each database.

Installation

Installing MySQL/MariaDB

Go to https://downloads.mariadb.org/mariadb/repositories/ and follow the instructions:

$ sudo apt-get install python-software-properties
$ sudo apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 0xcbcb082a1bb943db
$ sudo add-apt-repository 'deb http://ftp.osuosl.org/pub/mariadb/repo/10.0/ubuntu precise main'
$ sudo apt-get update
$ sudo apt-get install mariadb-server

To start it:

$ sudo service mysql start

Installing MongoDB

Download the latest version from http://www.mongodb.org/downloads. Uncompress the archive:

$ cd /usr/local
$ sudo tar xvfz /tmp/mongodb-linux-x86_64-2.4.3.tgz
$ sudo ln -s mongodb-linux-x86_64-2.4.3 mongo

Edit the file /etc/profile and add the MongoDB binaries to the PATH:

export PATH=/usr/local/mongo/bin:$PATH

To start it:

$ sudo /usr/local/mongo/bin/mongod --fork --logpath=/var/log/mongodb.log

Installing Vertica

Go to: http://www.vertica.com

Create a login and password, follow the instructions, and then go to: https://my.vertica.com/download-community-edition/

Under Download HP Vertica Analytic Database Server Community Edition, click on Debian Linux 5/6.

Under Download HP Vertica Client Packages for the Community Edition, click on Linux 32-bit.

Once this is downloaded, install the package as follows:

$ sudo apt-get install ssh sysstat pstack
$ sudo dpkg -i vertica-ce_6.0.1-0_amd64.deb
$ sudo tar xvfz vertica-client-6.0.1-0.x86_64.tar.gz -C /

This will uncompress the Vertica client in /opt/vertica.

There are also some issues with the Debian version check that can be fixed by following the tips given in a YouTube video provided by דניאל ליבוביץ.

Basically, the file /opt/vertica/oss/python/lib/python2.7/site-packages/vertica/network/SystemProfileFactory.py expects to read a Debian version number (4, 5 or 6) in /etc/debian_version, but on Ubuntu that file does not contain a version number. So either put a number in /etc/debian_version (e.g., 6 instead of wheezy/sid), or edit /opt/vertica/oss/python/lib/python2.7/site-packages/vertica/network/SystemProfileFactory.py and locate the following lines:

        # Debian 4 and 5 Support
        (status, res) = ssh.execute("[ -f /etc/debian_version ]", hide=True)
        if (res[0] == '0'):
            (status, res) = ssh.execute("grep \"^4.\" /etc/debian_version",

Replace the line:

            (status, res) = ssh.execute("grep \"^4.\" /etc/debian_version",

by:

            (status, res) = ssh.execute("grep \"^.\" /etc/debian_version",

so that the check passes whatever your Debian version string is.

There is another problem, described in this thread, that you need to fix. Edit the file /opt/vertica/bin/validators.sh and replace the line:

dpkg -p vertica 2>&1 > /dev/null

by:

dpkg -p vertica > /dev/null 2>&1

Finally run the install script:

$ sudo /opt/vertica/sbin/install_vertica
$ su - dbadmin
$ /opt/vertica/bin/adminTools

In adminTools, go to Configuration Menu and choose Create Database (database name: test).
For our example, we’ll assume you use the password “password”.

Next time, when you need to restart the database, you can use the same interface to do so.

Edit the file /etc/profile and add the following line:

export PATH=/opt/vertica/bin:$PATH

Installing Hadoop

Install Java 1.6 and 1.7 from the Oracle Java Site.

Unpack them as follows:

$ mkdir /opt/java
$ cd /opt/java
$ sudo sh ~/Downloads/jdk-6u45-linux-x64.bin
$ sudo tar xvfz ~/Downloads/jdk-7u21-linux-x64.tar.gz
$ sudo ln -s jdk-6u45-linux-x64 default

Edit the file /etc/profile and append the following lines at the end:

export JAVA_HOME=/opt/java/default
export PATH=$JAVA_HOME/bin:$PATH

Then add the Cloudera CDH4 repository and install the Hadoop packages:

$ cd /etc/apt/sources.list.d/
$ sudo wget http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/cloudera.list -O archive-cloudera.list
$ sudo sh -c "wget http://archive.cloudera.com/cdh4/ubuntu/precise/amd64/cdh/archive.key -O - | apt-key add -"
$ sudo apt-get update
$ sudo apt-get install hadoop hadoop-conf-pseudo hadoop-hdfs hadoop-hdfs-datanode hadoop-hdfs-namenode hadoop-mapreduce hadoop-yarn hadoop-yarn-nodemanager hadoop-yarn-resourcemanager

Edit the file /etc/hadoop/conf/hdfs-site.xml and add the following lines:

  <property>
     <name>dfs.permissions</name>
     <value>false</value>
     <!-- Whether we consider permissions or not -->
  </property>
  <property>
     <name>dfs.client.file-block-storage-locations.timeout</name>
     <value>3000</value>
  </property>
  <property>
     <name>dfs.client.read.shortcircuit</name>
     <value>true</value>
  </property>

We disable permissions for the sake of simplicity. The last two properties are needed by Impala.

You’ll also need to authorize SSH access to the box without entering your password:

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa 
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

To start Hadoop:

$ sudo service hadoop-hdfs-namenode start
$ sudo service hadoop-hdfs-datanode start
$ sudo service hadoop-yarn-resourcemanager start
$ sudo service hadoop-yarn-nodemanager start

Installing Hive

Download Hive from http://www.apache.org/dyn/closer.cgi/hive/.

Untar it:

$ tar xvfz hive-0.10.0-bin.tar.gz -C /usr/local
$ sudo ln -s /usr/local/hive-0.10.0-bin /usr/local/hive

Let’s configure Hive so that its metadata is stored in MySQL/MariaDB.
Create the file $HIVE_HOME/conf/hive-site.xml with the following content:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
     <name>hive.metastore.local</name>
     <value>true</value>
  </property>
  <property>
     <name>javax.jdo.option.ConnectionURL</name>
     <value>jdbc:mysql://localhost:3306/hive?createDatabaseIfNotExist=true</value>
  </property>
  <property>
     <name>javax.jdo.option.ConnectionDriverName</name>
     <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
     <name>javax.jdo.option.ConnectionUserName</name>
     <value>hive</value>
  </property>
  <property>
     <name>datanucleus.transactionIsolation</name>
     <value>repeatable-read</value>
  </property>
  <property>
     <name>datanucleus.valuegeneration.transactionIsolation</name>
     <value>repeatable-read</value>
  </property>
</configuration>

Download the MySQL JDBC driver from http://dev.mysql.com/downloads/connector/j (using the MariaDB JDBC driver won’t work well with Hive).
Untar it and copy the jar file to $HIVE_HOME/lib:

$ tar xvfz mysql-connector-java-5.1.25.tar.gz
$ cp mysql-connector-java-5.1.25/mysql-connector-java-5.1.25-bin.jar $HIVE_HOME/lib

In MariaDB, create the hive database and user:

$ mysql -u root -p
MariaDB [(none)]> CREATE DATABASE hive;
MariaDB [(none)]> GRANT ALL PRIVILEGES ON hive.* TO hive@localhost IDENTIFIED BY 'hive';
MariaDB [(none)]> GRANT ALL PRIVILEGES ON hive.* TO hive@'%' IDENTIFIED BY 'hive';

Edit the file /etc/profile and add the following lines:

export HIVE_HOME=/usr/local/hive
export PATH=$PATH:$HIVE_HOME/bin

Installing Impala

We tried using the repository provided by Cloudera but ran into some issues when running the server.
Here is what we did:

$ cd /etc/apt
$ sudo wget http://archive.cloudera.com/impala/ubuntu/precise/amd64/impala/cloudera.list -O cloudera-impala.list
$ sudo apt-get update
$ sudo apt-get install impala impala-server impala-shell

But when we started the service, the process crashed with a segmentation fault under Java 1.7, and with Java 1.6 we got another error (/usr/lib/impala/sbin/impalad: error while loading shared libraries: libjvm.so: wrong ELF class: ELFCLASS32).

So we decided to build Impala from source on GitHub, following instructions taken from https://users.soe.ucsc.edu/~ivo//blog/2013/04/03/building-impala-on-ubuntu-12.04/ and from the Impala GitHub page.

First install some packages:

$ sudo apt-get install \
     build-essential automake libtool flex bison \
     git subversion \
     unzip \
     libboost-test-dev libboost-program-options-dev libboost-filesystem-dev libboost-system-dev \
     libboost-regex-dev libboost-thread-dev \
     protobuf-compiler \
     libsasl2-dev \
     libbz2-dev \
     libevent1-dev \
     pkg-config \
     doxygen

Install LLVM:

$ wget http://llvm.org/releases/3.2/llvm-3.2.src.tar.gz
$ tar xvzf llvm-3.2.src.tar.gz
$ cd llvm-3.2.src/tools
$ svn co http://llvm.org/svn/llvm-project/cfe/tags/RELEASE_32/final/ clang
$ cd ../projects
$ svn co http://llvm.org/svn/llvm-project/compiler-rt/tags/RELEASE_32/final/ compiler-rt
$ cd ..
$ ./configure --with-pic --prefix=$HOME/.opt/
$ make -j4 REQUIRES_RTTI=1
$ make install

For some reason, we only managed to make Impala work by compiling it with Java 1.7 and then running it with Java 1.6.

Install Maven:

$ wget http://www.fightrice.com/mirrors/apache/maven/maven-3/3.0.4/binaries/apache-maven-3.0.4-bin.tar.gz
$ sudo tar xvf apache-maven-3.0.4-bin.tar.gz && sudo mv apache-maven-3.0.4 /usr/local
$ sudo ln -s /usr/local/apache-maven-3.0.4 /usr/local/maven

Edit the file /etc/profile and add the following lines:

export JAVA_HOME=/opt/java/default
export MVN_HOME=/usr/local/maven
export PATH=$JAVA_HOME/bin:$MVN_HOME/bin:$PATH

Build Impala:

$ . /etc/profile
$ export JAVA_HOME=/opt/java/jdk1.7.0_21
$ git clone http://github.com/cloudera/impala
$ cd impala
$ . bin/impala-config.sh
$ cd thirdparty
$ ./download_thirdparty.sh
$ cd ..
$ ./build_public.sh -build_thirdparty

Edit the file bin/set-classpath.sh and configure the CLASSPATH as follows:

CLASSPATH=\
$IMPALA_HOME/fe/src/test/resources:\
$IMPALA_HOME/fe/target/classes:\
$IMPALA_HOME/fe/target/dependency:\
$IMPALA_HOME/fe/target/test-classes:\
${HIVE_HOME}/lib/datanucleus-core-2.0.3.jar:\
${HIVE_HOME}/lib/datanucleus-enhancer-2.0.3.jar:\
${HIVE_HOME}/lib/datanucleus-rdbms-2.0.3.jar:\
${HIVE_HOME}/lib/datanucleus-connectionpool-2.0.3.jar:\
${HIVE_HOME}/conf:\
${HIVE_HOME}/lib/mysql-connector-java-5.1.25-bin.jar:\
/etc/hadoop/conf

Running the tests

Data Generator

We wrote a simple data generator that creates data inspired by advertising networks. It represents the number of impressions, the number of clicks and the revenue, broken out by advertiser, publisher and hour.
In terms of data distribution, we have a year of hourly data with 20 advertisers and 100 publishers, resulting in 160 million rows.

First get the source code:

$ git clone http://github.com/chimpler/blog-mysql-vertica-mongodb-impala
$ cd blog-mysql-vertica-mongodb-impala/datagen
$ make
$ ./gendata > /tmp/datatest.csv

Note that it will take about 15 minutes to generate the 160 million rows.

Importing data in MySQL/MariaDB

$ sudo mysql -u root -p
MariaDB [(none)]> CREATE DATABASE test;
MariaDB [(none)]> GRANT ALL PRIVILEGES ON test.* TO test@localhost IDENTIFIED BY 'test';
MariaDB [(none)]> GRANT ALL PRIVILEGES ON test.* TO test@'%' IDENTIFIED BY 'test';

Then use our import script:

$ ./import_mysql.sh test test test

It basically creates a new table analytics and imports the CSV data.
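
If you want a rough idea of what the script does, it boils down to statements along these lines. This is a minimal sketch only: the column names (ymdh, advertiser_id, publisher_id, imps, clicks, revenue) are inferred from the test queries shown later, and the exact types, delimiter and indexes may differ from what import_mysql.sh actually uses.

-- Sketch of the MariaDB table and load (column layout assumed from the test queries)
CREATE TABLE analytics (
    ymdh          DATETIME,   -- hour of the event
    advertiser_id INT,
    publisher_id  INT,
    imps          INT,        -- number of impressions
    clicks        INT,
    revenue       DOUBLE
);

LOAD DATA LOCAL INFILE '/tmp/datatest.csv'
INTO TABLE analytics
FIELDS TERMINATED BY ',';

-- Index the columns used for filtering/grouping (this is the "indexing time" reported below)
CREATE INDEX idx_analytics_ymdh ON analytics (ymdh);
CREATE INDEX idx_analytics_advertiser ON analytics (advertiser_id);
CREATE INDEX idx_analytics_publisher ON analytics (publisher_id);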

Importing data in Vertica

To use the Vertica shell:

$ /opt/vertica/bin/vsql -d test -U dbadmin -w password

Useful commands:

  • \d: see all the tables
  • \d analytics: describe table analytics
  • \timing: toggle timings on/off (you can put it in your $HOME/.vsqlrc to enable it by default)
  • \q: quit

You can run the script import_vertica.sh that will take care of creating the table analytics and importing the CSV data:

$ ./import_vertica.sh dbadmin password test
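
Under the hood, a Vertica load is typically a CREATE TABLE followed by a COPY from the CSV file. Here is a sketch of what the script roughly does, assuming the same column layout as before and a comma delimiter; the actual statements in import_vertica.sh may differ.

-- Sketch of the Vertica table and bulk load
CREATE TABLE analytics (
    ymdh          TIMESTAMP,
    advertiser_id INT,
    publisher_id  INT,
    imps          INT,
    clicks        INT,
    revenue       FLOAT
);

-- DIRECT writes straight to disk storage instead of staging the rows in memory first
COPY analytics FROM '/tmp/datatest.csv' DELIMITER ',' DIRECT;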

Importing data in Hive

Some useful commands:

  • show tables;: show tables
  • desc <table_name>;: description of the table
  • quit;: quit

You can run Hive as follows to get more logs:

$ hive -hiveconf hive.root.logger=INFO,console

You can run the script import_hive.sh to import the data into Hive:

$ ./import_hive.sh
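
For reference, the Hive import amounts to creating a delimited text table and loading the CSV into it, roughly as follows. This is a sketch under the same column-layout assumption (ymdh kept as a string so the date comparisons in the test queries work lexicographically); check import_hive.sh for the exact statements.

-- Sketch of the Hive table backed by a comma-delimited text file on HDFS
CREATE TABLE analytics (
    ymdh          STRING,
    advertiser_id INT,
    publisher_id  INT,
    imps          INT,
    clicks        INT,
    revenue       DOUBLE
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Copies the local CSV into the table's directory on HDFS
LOAD DATA LOCAL INPATH '/tmp/datatest.csv' INTO TABLE analytics;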

Importing data in Impala

Some useful commands:

  • show tables;: show tables
  • desc table_name;: description of the table
  • refresh: refresh schema from metastore
  • quit;: quit

The table analytics should be visible since we’re getting the metadata from the Hive metastore:

$ impala-shell
[localhost:21000] > refresh;
[localhost:21000] > show tables;
+-------------------+
| name              |
+-------------------+
| analytics         |
+-------------------+

We can query the table using Impala. However, we can also store the data in the columnar Parquet format to get better performance. Let’s create a Parquet table and import the data there:

$ ./import_impala_parquet.sh

That will create a new table analytics_parquet with the same data but using the Parquet format.
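
Roughly speaking, this is a Parquet-backed copy of the table filled with an INSERT … SELECT, along these lines (a sketch only; Impala 1.0 uses the PARQUETFILE keyword, and the actual script may differ):

-- Create an empty table with the same columns, stored in the Parquet columnar format
CREATE TABLE analytics_parquet LIKE analytics STORED AS PARQUETFILE;

-- Rewrite the CSV data into Parquet files
INSERT OVERWRITE TABLE analytics_parquet SELECT * FROM analytics;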

Some measures

For each database, we report the time to run each query the first time and the time to run it a second time (cached).

Data Set:

  • 160 million rows
  • 12 months, 28 days per month, 24 hours
  • 20 advertisers
  • 100 publishers
  • imps, clicks and revenue metrics

We ran the following tests:

  • Test 1: SELECT sum(imps) FROM analytics
  • Test 2: SELECT advertiser_id, AVG(clicks) FROM analytics WHERE ymdh BETWEEN '2012-06-01 00:00:00' AND '2012-07-01 00:00:00' GROUP BY advertiser_id
  • Test 3: SELECT AVG(1000 * revenue / imps) AS cpm FROM analytics WHERE ymdh > '2012-12-01 00:00:00' GROUP BY publisher_id ORDER BY cpm DESC LIMIT 10

We used the script test_all.sh, which runs the tests on all the different databases, on an AMD Phenom(tm) II X6 1045T processor (2.7 GHz) with 8GB of RAM.

We obtained the following response times:


Database         | Import Time                      | Indexing Time | Disk Space | Test 1      | Test 2     | Test 3
MariaDB          | 83m                              | 224m          | 21GB       | 115s/0.1s   | 51s/0.1s   | 47s/0.1s
Vertica          | 15m                              | –             | 93GB       | 15.3s/1.3s  | 2.7s/0.5s  | 5.16s/3.25s
MongoDB          | 80m                              | 175m          | 59GB       | 305s/305s   | 59s/47s    | 59s/52s
Hive             | 250s                             | –             | 59GB       | 252s/225s   | 381s/382s  | 404s/407s
Impala (CSV)     | –                                | –             | 59GB       | 131s/119s   | 119s/110s  | 106s/109s
Impala (Parquet) | 219s (from data already in HDFS) | –             | 2.8GB      | 12s/6s      | 62s/62s    | 51s/53s

For each test, we measured the response time of the first run and of a second run so we can see whether any cache is used. In our tests we can see that MySQL/MariaDB and Vertica use query caching.
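
If you want to verify that the MariaDB query cache is actually involved, you can inspect it from the MariaDB client with standard MySQL/MariaDB commands (the values you see obviously depend on your configuration):

-- Is the query cache enabled and how big is it?
SHOW VARIABLES LIKE 'query_cache%';

-- Hit counters: Qcache_hits should increase when a query is answered from the cache
SHOW STATUS LIKE 'Qcache%';

-- Clear the cache between runs if you want cold-cache timings
RESET QUERY CACHE;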

Note that Hive performance can be improved by compressing data in order to reduce I/O time. Check http://www.adaltas.com/blog/2012/03/13/hdfs-hive-storage-format-compression/ and http://www.saurabhnanda.com/2009/07/how-to-make-your-hive-cluster-blazingly.html for more information.

Conclusion

When looking strictly at the numbers, Vertica is the clear winner. MySQL, because of its row-based storage design, is not well suited for aggregations on millions of rows but still performs reasonably well. MongoDB, despite its new Aggregation Framework, is not suited for this kind of aggregation either. As expected, Hive performs poorly, but this can be explained by the Map/Reduce overhead. Finally, we can see that Impala is promising in the Hadoop world. Using the raw CSV file format, it is much faster than Hive, and it performs quite well when using the columnar Parquet format. However, when filtering is involved, it does not perform very well.

In this post, we didn’t consider any partitioning of the data and we ran our queries on only one box, so we didn’t take advantage of parallel execution. We also didn’t compare the databases under different loads to see how they degrade as the number of concurrent connections increases. Moreover, Impala went GA only a week ago, so we can expect more performance improvements in the months to come. Other Dremel-inspired implementations such as HortonWorks Stinger and EMC HAWQ boast very fast response times, so it will be interesting to compare them with Impala when they are publicly released.

About chimpler
http://www.chimpler.com

25 Responses to Installing and comparing MySQL/MariaDB, MongoDB, Vertica, Hive and Impala (Part 1)

  1. Pingback: Installing and comparing MySQL/MariaDB, MongoDB...

  2. Greg Rahn says:

    Why are you not using partitioning for Hive/Impala?
    If you used monthly partitioning you could cut the query time for query 2 and 3 by 12x since they query just 1 month out of 12 (if I understand the date range correctly).

    Also, use a file format other than delimited text – that is the most inefficient format of them all.

    https://raw.github.com/chimpler/blog-mysql-vertica-mongodb-impala/master/datagen/import_hive.sh

    • chimpler says:

      Hi Greg,

      Thank you for your comment. That’s a good point, having monthly partitioning will really help here. We were focusing more on the installation of the different databases in this post but we’ll use partitioning and compression in Part 2 🙂

      Thanks!

  3. Roman says:

    This is a very interesting setup.
    But I think you jump to conclusions too early.
    I don’t see a word regarding optimizations. Without indexes, NoSQL DBs are just a file system storage with query capabilities. Am I missing it?

    • chimpler says:

      Hi Roman. Thank you for your comment. In this post we were mostly focusing on the installation of the different databases. For Hive and Impala, we could have used data partitioning that would have helped with filtering. With Impala/Parquet, we wanted to see how it performs without any explicit optimization just as we did with Vertica. Also with Hive we could have used data compression to reduce I/O time… We’ll use data partitioning and data compression in Part 2 🙂 Thanks.

  4. Alok says:

    Is there a similar guide for Windows (where I do most of my development)?

  5. clasense4 says:

    Reblogged this on clasense4 blog.

  6. Olga says:

    8 GB of RAM, really? Also, you can’t compare software that is meant to be distributed on one machine.

    • chimpler says:

      Hi Olga, yes only 8GB of RAM, we don’t have a better computer at home 😦 which is also why we considered a rather small dataset. We’ll try to run more tests on multiple machines using data partitioning in Part 2. Thanks!

      • jorge g. says:

        You can also rent AWS instances by the hour with better specs (up to 88 cores/244G ram)

      • Rajesh says:

        Hi Chimpler, Can we install it on a Home PC which is of 4 GB (Excluding RAM allocated to WIN 8 OS). I want to practice Vertica with just 10K to 0.1 million rows…

      • chimpler says:

        Hi Rajesh, yes you should be able to test it with a few million rows without any problem.

  7. Did you tweak my.cnf in any way? Are you running MySQL/MariaDB completely off of defaults?

  8. Dan Johnson says:

    I find it interesting that only int, datetime and floats were used in this analysis. Varchar/text columns would have provided an interesting dimension to your test.

  9. Maninder says:

    Very interesting and good test.. it’s good to compare out-of-the-box configured setups.. there are tons of tweaks that can be done, but running at defaults gives you a much better understanding of the tool..

  10. virtul says:

    Thanks for you work. It would be really interesting to add PostgreSQL in this list though. Some people think that it might be pretty decent candidate for DW/OLAP:

    http://wiki.postgresql.org/images/3/38/PGDay2009-EN-Datawarehousing_with_PostgreSQL.pdf
    http://pgexperts.com/document.html?id=49

  11. VA says:

    I’m surprised that you are comparing a columnar database with non-columnar (or more of type A columnar) databases. There are at least 2 other columnar databases that have a community edition available – Infobright/InfiniDB.

    The installation of infobright was much easier than Vertica. The community edition is fast and compresses very well (for me it was 5x) but has the restriction that you can only load the data (no insert/update/delete).

  12. svar says:

    Thanks for the effort. Did you set up MariaDB with MRR, which can boost disk-bound workloads (https://kb.askmonty.org/en/dbt3-benchmark-results-myisam/)? TokuDB is open source; it would be good to add it to your test as the default storage engine for such a task. InfiniDB would come close to Vertica’s results, +/- 3%.

  13. Jim Tommaney says:

    InfiniDB should do well on this class of queries. Also, as of October 15, InfiniDB 4 is now available as open source (GPL v2) with no restrictions on syntax or scale. In addition we have InfiniDB for Hadoop where we run as a non-map/reduce engine on your Hadoop cluster. We typically offer a better experience with performance, syntax support, and general robustness than any other SQL on Hadoop offering including Impala.
    Cheers,
    Jim Tommaney (CTO at Calpont)

  14. Marcelo Manzano says:

    Hey man, it’s an amazing benchmark! When will you have the Part 2 available?!
    Thanks a lot

  15. vonkhades says:

    I think you should include Cassandra and HBase on this performance tests…

  16. Pingback: Confluence: Analytics and Data Collection
