Playing with Hazelcast, a distributed datagrid, on Amazon EC2 with jclouds-cli

Hazelcast is an open-source in-memory datagrid that lets you store data in memory, distributed across a cluster of servers, and execute distributed tasks. It can be used as an in-memory database that can be queried with SQL-like queries or with any filter you can implement in Java. To prevent data loss, the in-memory data can be backed by a persistent store (file, relational database, NoSQL database, …). Data can be persisted synchronously as it is written to the in-memory database (write-through) or asynchronously so that writes are batched (write-behind).
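
To give an idea of what this looks like in code, here is a minimal sketch of a Hazelcast node storing objects in a distributed map and running a SQL-like query (it assumes the Hazelcast 3.x Java API; the users map and the User class are only examples):

import java.io.Serializable;
import java.util.Collection;

import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.query.SqlPredicate;

public class HazelcastQuickstart {

    // Simple value object stored in the grid; its fields can be used in queries.
    public static class User implements Serializable {
        public String name;
        public int age;

        public User(String name, int age) {
            this.name = name;
            this.age = age;
        }
    }

    public static void main(String[] args) {
        // Starts a cluster member in this JVM; other members started on other
        // machines with the same configuration join the cluster and share the map.
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();

        IMap<String, User> users = hz.getMap("users");
        users.put("alice", new User("alice", 34));
        users.put("bob", new User("bob", 27));

        // SQL-like query evaluated across the whole cluster.
        Collection<User> over30 = users.values(new SqlPredicate("age > 30"));
        System.out.println("Users over 30: " + over30.size());

        Hazelcast.shutdownAll();
    }
}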

In applications that are read and write intensive, relying on a relational database server (or a master/slaves configuration) can be very inefficient, as it often becomes a bottleneck and a single point of failure. With data in memory, reads and writes are very fast, and since data is distributed and replicated there is no single point of failure. For instance, with a replication factor of 3 there are one primary and 2 backup nodes, so if one node of the cluster goes down, the other nodes take over and get reassigned its data. In the catastrophic event where the database goes down, writes to the cache are queued in a log file so they can be persisted to the database once it is back up.
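
To make the replication and persistence settings concrete, here is a small configuration sketch (again assuming the Hazelcast 3.x Java API; the map name and the com.example.MyJdbcMapStore class are hypothetical, the latter standing for an implementation of Hazelcast's MapStore interface):

import com.hazelcast.config.Config;
import com.hazelcast.config.MapConfig;
import com.hazelcast.config.MapStoreConfig;
import com.hazelcast.core.Hazelcast;

public class ReplicatedMapConfig {

    public static void main(String[] args) {
        MapConfig mapConfig = new MapConfig();
        mapConfig.setName("users");
        // 1 primary copy plus 2 backups, i.e. a replication factor of 3.
        mapConfig.setBackupCount(2);

        // Persist entries through a MapStore implementation (hypothetical class).
        // writeDelaySeconds = 0 would mean write-through; a value > 0 batches
        // the writes (write-behind).
        MapStoreConfig storeConfig = new MapStoreConfig();
        storeConfig.setEnabled(true);
        storeConfig.setClassName("com.example.MyJdbcMapStore");
        storeConfig.setWriteDelaySeconds(5);
        mapConfig.setMapStoreConfig(storeConfig);

        Config config = new Config();
        config.addMapConfig(mapConfig);
        Hazelcast.newHazelcastInstance(config);
    }
}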

There are other products offering features similar to Hazelcast's:

  • Oracle Coherence: a very robust and popular data grid solution used by many financial companies and systems that have to handle a lot of transactions. It also has an active community.
  • VMware GemFire: it is used by some financial companies and provides most of the features Coherence has, but the community is much smaller, so it is harder to find documentation.
  • GigaSpaces XAP: the platform provides a lot of features. Among other things, it can dynamically instantiate services on multiple servers and handle service failover.

In this tutorial we are going to deploy Hazelcast on an EC2 cluster. Then we will run some operations on the datagrid, and finally we will stop the cluster.

Using Hadoop Pig with MongoDB

In this post, we’ll see how to install MongoDB support for Pig and we’ll illustrate it with an example where we join 2 MongoDB collections with Pig and store the result in a new collection.

Requirements

Building Mongo Hadoop

We’re going to use the Git project developed by 10gen, but with a slight modification that we made. Because the Pig language doesn’t support variables that start with an underscore (e.g., _id), which MongoDB uses, we added the ability to use them by replacing the _ prefix with u__, so _id becomes u__id.
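
Just to illustrate the convention (this is not the fork's actual code, only a hypothetical sketch of the renaming rule):

public final class FieldNameMapper {

    private FieldNameMapper() {
    }

    // MongoDB -> Pig: "_id" becomes "u__id"; other names are left unchanged.
    public static String toPigName(String mongoField) {
        return mongoField.startsWith("_") ? "u__" + mongoField.substring(1) : mongoField;
    }

    // Pig -> MongoDB: "u__id" becomes "_id" again.
    public static String toMongoName(String pigField) {
        return pigField.startsWith("u__") ? "_" + pigField.substring(3) : pigField;
    }
}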

First get the source:

$ git clone https://github.com/darthbear/mongo-hadoop

Compile the core and Pig parts of it:

$ ./sbt package
$ ./sbt mongo-hadoop-core/package
$ ./sbt mongo-hadoop-pig/package
$ mkdir ~/pig_libraries
$ cp ./pig/target/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar \
./target/mongo-hadoop-core-1.1.0-SNAPSHOT.jar ~/pig_libraries

Running a join query with Pig on MongoDB collections

One of the things you can't do in MongoDB is a join between 2 collections. So let's see how we can do it simply with a Pig script.
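
As a rough idea of what this looks like when driven from Java, here is a hypothetical sketch using Pig's PigServer API. The database, collection and field names, the connection strings, and the MongoLoader/MongoStorage constructor arguments are all assumptions to be checked against the fork's documentation:

import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class MongoJoinWithPig {

    public static void main(String[] args) throws Exception {
        PigServer pig = new PigServer(ExecType.LOCAL);

        // Jars built in the previous step (paths are examples).
        pig.registerJar("/home/me/pig_libraries/mongo-hadoop-core-1.1.0-SNAPSHOT.jar");
        pig.registerJar("/home/me/pig_libraries/mongo-hadoop-pig-1.1.0-SNAPSHOT.jar");

        // Load the two collections; u__id is the renamed _id field of users.
        pig.registerQuery("users = LOAD 'mongodb://localhost:27017/mydb.users' "
                + "USING com.mongodb.hadoop.pig.MongoLoader('u__id, name');");
        pig.registerQuery("orders = LOAD 'mongodb://localhost:27017/mydb.orders' "
                + "USING com.mongodb.hadoop.pig.MongoLoader('user_id, amount');");

        // The join itself.
        pig.registerQuery("joined = JOIN users BY u__id, orders BY user_id;");

        // Write the joined records into a new collection.
        pig.store("joined", "mongodb://localhost:27017/mydb.user_orders",
                "com.mongodb.hadoop.pig.MongoStorage()");
    }
}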

Installing Storm on Ubuntu

Storm is an open-source distributed real-time computation system created by Nathan Marz in late 2011. Unlike Hadoop, where data is processed offline in big batches, Storm takes another approach: it aggregates streaming data on the fly so that the aggregated results are immediately available. It is scalable, fault tolerant, guarantees that no data is lost, and benchmarks showed that each node can process over a million tuples per second.
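
To show what aggregating on the fly means in practice, here is a minimal word-count topology sketch (it assumes the classic pre-Apache backtype.storm API; the spout, bolt, and component names are made up for the example):

import java.util.HashMap;
import java.util.Map;
import java.util.Random;

import backtype.storm.Config;
import backtype.storm.LocalCluster;
import backtype.storm.spout.SpoutOutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.TopologyBuilder;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.topology.base.BaseRichSpout;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;
import backtype.storm.utils.Utils;

public class WordCountTopology {

    // Spout emitting a random word every 100 ms to simulate a stream.
    public static class WordSpout extends BaseRichSpout {
        private SpoutOutputCollector collector;
        private final Random random = new Random();
        private static final String[] WORDS = {"storm", "hadoop", "pig", "mongodb"};

        @Override
        public void open(Map conf, TopologyContext context, SpoutOutputCollector collector) {
            this.collector = collector;
        }

        @Override
        public void nextTuple() {
            Utils.sleep(100);
            collector.emit(new Values(WORDS[random.nextInt(WORDS.length)]));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word"));
        }
    }

    // Bolt keeping a running count per word: the aggregation happens on the fly.
    public static class CountBolt extends BaseBasicBolt {
        private final Map<String, Integer> counts = new HashMap<String, Integer>();

        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
            String word = tuple.getStringByField("word");
            Integer count = counts.get(word);
            counts.put(word, count == null ? 1 : count + 1);
            collector.emit(new Values(word, counts.get(word)));
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            declarer.declare(new Fields("word", "count"));
        }
    }

    public static void main(String[] args) {
        TopologyBuilder builder = new TopologyBuilder();
        builder.setSpout("words", new WordSpout(), 1);
        // fieldsGrouping sends the same word to the same bolt instance every time.
        builder.setBolt("counts", new CountBolt(), 2).fieldsGrouping("words", new Fields("word"));

        // Runs the topology in-process; on a real cluster it would be submitted
        // with StormSubmitter instead.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", new Config(), builder.createTopology());
        Utils.sleep(10000);
        cluster.shutdown();
    }
}

Running this in a LocalCluster is enough to see the counts being updated as tuples arrive.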

We describe below the steps to install Storm on Ubuntu Linux, along with the issues we ran into during the process.