Parasitic Hadoop

This isn’t one of those ideas that normally reach me while waiting for my tea to boil. Interestingly, I found a name for it while waiting for the tea bag to put colour and traces of taste into hot water on the train to Berlin. It looks like tea is somewhat inspiring for me ;) I don’t think this idea is unique or new, but it’s an interesting thought game.

In my experience many systems aren’t really loaded, at least not the whole day. Thus you have a lot of idle cycles in your systems. Even with power management this isn’t desirable: a system at 0% load doesn’t consume zero power, it still has a baseline active-idle power consumption. The most efficient system is a system at 100% load. So of course you should have mechanisms to load your systems as far as possible.

While completing my presentation for the Hadoop Get Together in Berlin I remembered an idea I had a while ago in conjunction with the Sun Grid Engine. But you can do pretty much the same with Hadoop, for example. It’s the parasitic grid, or parasitic Hadoop, because when I thought about this idea, I remembered my biology classes at school. What is a parasite? It lives on the resources of its host. A perfect parasite doesn’t harm its host at all, because when the host dies, the parasite dies as well. A perfect parasite lives off the stuff that the host doesn’t need. Beyond that, it tries to replicate. Okay, it’s a definition that ensures every biologist will run away screaming in pain, but it’s sufficient for our purposes.

Let’s transpose this concept into computing: the resources are CPU cycles and memory, the host is our operating system with the production load, and the parasite is a zone within it, an independent entity living inside the host.

But how do we keep the parasite zone from eating too many resources and thus starving the host system? Well, that’s easy, and I assume frequent readers of my blog already know the way I want to describe. To ensure that the parasite zone doesn’t eat away all the CPU cycles and memory, you would use the Solaris Resource Manager. To prevent the host system from starving for network bandwidth, you would use Crossbow to limit the bandwidth and lower the priority of the traffic of the parasite zone. By using ZFS you would create a separate filesystem in the root pool and limit its hunger by imposing a quota on it. Additionally it looks like a wise choice to put some space into a reservation for the filesystems of the host system. I’ll sketch each of these mechanisms below.

In my tutorial about Solaris Resource Management I demonstrated that a process under the control of the SRM can use all the resources of the system even when it has already used up its share, as long as there isn’t another process that hasn’t used up its own share yet. Thus it’s perfectly possible to configure the production zone or the processes in your global zone with 99% of the shares to ensure they have priority, and the parasitic Hadoop zone with 1%, and the latter can still get all the idle CPU cycles to do its work.

Replication of this parasitic zone is easy as well. You can create a parent parasitic Hadoop zone and clone as many child parasitic Hadoop zones from it as you want.

I think the parasitic Hadoop is especially useful. For a grid engine many people don’t have that many use cases. But we all have tasks that boil down to doing analytics on large heaps of text: webserver logfiles, firewall logs, mail server logs and so on. Yet for most of us it isn’t feasible to build a dedicated cluster for it.
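So let’s make the containment mechanisms concrete. First the CPU and memory side: a minimal sketch, assuming Solaris 10 8/07 or newer, of a parasite zone kept in check by the Fair Share Scheduler and a memory cap. Zone name, paths and limits are just examples, not a recommendation:

    # make the Fair Share Scheduler the default scheduling class
    dispadmin -d FSS

    # give the global zone (and thus the production load) the lion's
    # share of the CPU - not persistent across reboots, it's a sketch
    prctl -n zone.cpu-shares -v 99 -r -i zone global

    # configure the parasite zone with a single share and a memory cap
    zonecfg -z parasite
    zonecfg:parasite> create
    zonecfg:parasite> set zonepath=/zones/parasite
    zonecfg:parasite> set cpu-shares=1
    zonecfg:parasite> add capped-memory
    zonecfg:parasite:capped-memory> set physical=2g
    zonecfg:parasite:capped-memory> end
    zonecfg:parasite> commit
    zonecfg:parasite> exit

With 99 shares against 1 the production load wins every fight over the CPU, but as described above, the parasite zone still inherits every cycle nobody else wants.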
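For the network side a sketch with Crossbow, assuming an OpenSolaris build that already contains it: create a VNIC with a bandwidth cap and low priority, and hand it to the parasite zone as an exclusive-IP interface. The interface name and the limit are assumptions again:

    # a bandwidth-limited, low-priority virtual NIC on the physical link
    dladm create-vnic -l e1000g0 -p maxbw=100M,priority=low parasite0

    # hand the VNIC to the parasite zone
    zonecfg -z parasite
    zonecfg:parasite> set ip-type=exclusive
    zonecfg:parasite> add net
    zonecfg:parasite:net> set physical=parasite0
    zonecfg:parasite:net> end
    zonecfg:parasite> commit
    zonecfg:parasite> exit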
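The storage side is a simple quota-and-reservation exercise with ZFS. Pool layout and sizes are made up:

    # a dedicated filesystem for the parasite zone, capped by a quota
    zfs create -o quota=50g rpool/parasite

    # and a reservation, so the filesystems of the host system always
    # keep enough room to breathe
    zfs set reservation=20g rpool/ROOT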
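Replication of the parasite is ordinary zone cloning. On ZFS the clone is snapshot-based, so it’s fast and almost free. Zone names are placeholders:

    # export the parent configuration and adapt it for the child
    zonecfg -z parasite export > /tmp/parasite2.cfg
    # (edit /tmp/parasite2.cfg: at least the zonepath has to be unique)
    zonecfg -z parasite2 -f /tmp/parasite2.cfg

    # clone the halted parent into the new zone and boot it
    zoneadm -z parasite halt
    zoneadm -z parasite2 clone parasite
    zoneadm -z parasite2 boot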
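And to give the analytics claim a face: once the Hadoop daemons run in such parasite zones, a typical log-crunching job is just a few lines, for example with Hadoop Streaming. The paths and the location of the streaming jar are assumptions that depend on your installation:

    # count the requests in a heap of webserver logs that were already
    # copied into HDFS under /logs/www
    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-*-streaming.jar \
        -input /logs/www \
        -output /reports/requestcount \
        -mapper /bin/cat \
        -reducer "/usr/bin/wc -l" \
        -numReduceTasks 1

One reducer, so you get a single total at the end. Anything from status-code histograms to full-text indexing works the same way.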
With the parasitic Hadoop you could use the idle resources of the systems you have anyway, without harming the production load of those systems.