Digging into Apache Hadoop

I'm exploring Apache Hadoop at the moment. This looks like a really interesting technology. What's Hadoop? Hmm … to explain it in a really simplified manner: it's a distributed, highly available datastore. Okay … yawn … no big deal so far. But there is an interesting twist in Hadoop.

Let's assume you have vast amounts of log files, a pile of data several terabytes in size, and you want to know the URLs of the top 10 pages of your website. The standard old-school approach to this problem is to start an analyser on a big server which gathers all the data via block- or file-based protocols onto this analysing server. Of course this approach has several bottlenecks: the size of the network pipes, the amount of computing power in a single box, the number of IOPS a single server can deliver, the number of IOPS of the storage attached to that server, the amount of memory in a single server.

But now think about this problem in HPC terms: you could divide the task into several smaller ones. Let's assume 64 MB shards. You could compute the result for each shard on a separate node. To stay in our example: this step prints out the pageviews of every URL in its shard. These fragments of the final result are then collected and reduced to the final result, for example by adding up the pageviews of each URL from every shard. By using these concepts you spread the task of analysing the log files over thousands of nodes working in parallel, and you get your answer in minutes, not hours or days. The advantage of doing so is that you bring the data-intensive parts of the computation to the data instead of bringing the data to the computation.

Such algorithms are called MapReduce. This concept was introduced by Google, and as the core competency of Google is analysing big piles of data, such a mechanism is quite handy there. I have several use cases in mind for such a solution: commercial data warehousing, billing of large heaps of Call Data Records, mass conversion jobs … and so on. What has all this stuff to do with Hadoop? Hadoop is an open-source implementation of these concepts. The Hadoop Wiki writes:

Hadoop is a framework for running applications on large clusters built of commodity hardware. The Hadoop framework transparently provides applications both reliability and data motion. Hadoop implements a computational paradigm named Map/Reduce, where the application is divided into many small fragments of work, each of which may be executed or reexecuted on any node in the cluster. In addition, it provides a distributed file system (HDFS) that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster. Both Map/Reduce and the distributed file system are designed so that node failures are automatically handled by the framework.
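To make the Map/Reduce idea a bit more concrete, here is a minimal sketch of the pageview-counting job from my example, written against the Hadoop Java MapReduce API. The class names and the assumption that the URL is the first whitespace-separated field of a log line are mine for illustration, not anything prescribed by Hadoop:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PageviewCount {

    // Map step: each mapper works on one shard of the log files and emits
    // (url, 1) for every line. The parsing is a placeholder -- adjust it
    // to your actual log format.
    public static class UrlMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text url = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\\s+");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                url.set(fields[0]);
                context.write(url, ONE);
            }
        }
    }

    // Reduce step: all partial counts for one URL arrive at the same
    // reducer, which simply adds them up.
    public static class SumReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text url, Iterable<LongWritable> counts,
                              Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable c : counts) {
                sum += c.get();
            }
            context.write(url, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pageview count");
        job.setJarByClass(PageviewCount.class);
        job.setMapperClass(UrlMapper.class);
        job.setCombinerClass(SumReducer.class);   // pre-aggregate per shard
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

This only gives you the pageviews per URL; picking the top 10 out of that much smaller result is a trivial second step, for example a simple sort over the job output.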

Hadoop does the housekeeping: the separation of the data into shards and the distribution of the analysing tasks to the servers. You can view it as an API- and command-line-controlled grid engine for data distribution and data processing. It consists of the Hadoop Core and the Hadoop Distributed File System (HDFS is not a POSIX filesystem integrated into the VFS; you can think of it like FTP, you need a client or you use an API to access it). There is even a scripting language that helps you write the analysing jobs; this language is called Pig. Additionally, there is an effort to implement a database for structured data on top of Hadoop with HBase.

At the moment there are some gotchas in this technology. For example, you can't work with gzip-compressed files, as the shards of a gzip file aren't decompressible separately (okay, it's a problem of gzip, but it prevents you from working with it). (Update: I'm not entirely sure about this; it seems that you can work with block-based compression like bz2, and gzip support is in development, at least according to the Pig documents.) But here ZFS on-the-fly compression can be very helpful. I think I will create a Hadoop testbed with multiple zones on one of my systems this weekend.
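Back to the "you need a client or an API" remark about HDFS: this is roughly how reading a file out of HDFS from Java looks. Host, port and path here are just placeholders I made up for the sketch:

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsCat {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the namenode; host and port are placeholders.
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/logs/access.log");   // hypothetical HDFS path

        // Read the file line by line, just like a local file -- except the
        // blocks come from the datanodes over the network.
        try (FSDataInputStream in = fs.open(path);
             BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}

On the command line the equivalent would be something like hadoop fs -cat /logs/access.log.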