Hadoop in the enterprise

We had a Hadoop presentation in the Emerging Technology track at CEC 2008. While it gave a good insight into the technology for the uninitiated, it left out one important thing: how to sell it. I talked to some colleagues, who were a little bit perplexed about the use cases in the real world. Of course it's a technology that found its birth in the web, but when you really think about it, more use cases will come to mind. It's about processing data with simple means.

The customers already have the resources to build a Hadoop cluster. Often they have dozens of servers, and not all are equally loaded. So: start up the Fair Share Scheduler of Solaris, give Hadoop 10% of your system and keep 90% for its real application. You already have the storage for the Hadoop file system, too. A Solaris or Linux installation takes 5-10 GB at most, but you have 73 to 146 GB disks in your systems. That's perfect ...

What are matching loads for such a construct? Everybody has such loads:

- Scanning logfiles for certain patterns (think about getting situational awareness from the data of all the intrusion detection sensors around your network).
- Optical character recognition of scanned quotes, bills and receipts.
- A mass conversion job (old format of your scanned paper to a new one). Let's use already existing resources for it first.
- Processing all the process data from your automated test tools. Well ... implement this in Java or Pig and let it run on your Hadoop cluster.
- Analysing database information with a sequence of Hadoop jobs to find patterns.

And those are just the ideas I've got in a few minutes. And it's really easy to write such jobs with Pig. There are real reasons to do such stuff in a distributed manner: instead of hours, you could process a multi-terabyte logfile within minutes by distributing the job across many nodes.
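To give you an idea how little code such a job needs, here is a rough Pig Latin sketch for the first use case, scanning IDS logs for a pattern and counting hits per sensor. The paths, field layout and the pattern are made up for illustration; adjust them to whatever your sensors actually write:

```
-- Load raw IDS events from HDFS, assuming tab-separated lines (hypothetical layout)
raw = LOAD '/logs/ids/*.log' USING PigStorage('\t')
      AS (ts:chararray, sensor:chararray, msg:chararray);

-- Keep only the events matching a suspicious pattern
hits = FILTER raw BY msg MATCHES '.*portscan.*';

-- Count hits per sensor to see where the noise comes from
by_sensor = GROUP hits BY sensor;
counts = FOREACH by_sensor GENERATE group AS sensor, COUNT(hits) AS n;

STORE counts INTO '/reports/portscan_counts';
```

Five statements, and Hadoop spreads the scan over every node in the cluster. Try doing that with grep on a single box over a few terabytes of logs.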
And, dear colleagues, when the admins and developers of a company become more and more fluent with an environment like Hadoop, they don't want just the residual computing time of existing servers anymore, they want new servers ;)