It's facepalm time ...

Surely you’ve recognized that my blog was down for a few days and with it all services on the system. The problem that led to this situation was a really dumb one. Perhaps this article is more a story about not thinking about a failure mode just because it’s not a problem under your preferred operating system (Or to be exact: It was a problem before Solaris 10, but afterwards it was solved). And and most it’s a story about being totally problem-blind in the first moment. Perhaps I should explain first that c0t0d0s0.org isn’t run with Solaris, it uses this-other-unixoid-operating-system in a well-known non-commercial variant. That’s the dirty secret of c0t0d0s0.org. No technical reason for it, but webserving and mail could be done by any operating system and thus I used that operating system with ubiquitous availability at almost all providers of dedicated servers. I’m able to migrate the server from one dedicated server provider to another within 2 hours including moving the data and did this three times in the past (from 1&1 to Hetzner, and two times within Hetzner). This saved quite significant money until now and that’s the basic reason why I don’t want donations and when you do donations I would donate this money to kiva.org) Hetzner has reasonably priced dedicated servers and I had no problems in the past, however they have one important shortfall: No serial console in the standard product. When you need a console, you have to make a support call and they connect one. As you need it really seldom, it’s okay. As I found out later: I With this serial console I would have recognized the problem within a minute, and fixed in a second. However: The console was exactly thing that I didn’t had to my disposal at this moment. So it was a lot harder to find out what’s happened. However i wanted my server back as soon as possible (out of personal reasons I was just able to start the recovery in the evening and as I have job to do I could only do the further stuff in the evenings as well) and thus I just reimaged the server after keeping a copy of the logfiles. I have a quite extensive backup regimen with very regular rsyncs and database replication on my server at home thus I knew I would perhaps just lose an hour of minutes of data and that was okay for me, additionally I was able to mount the disks of the non-working installation and to copy the delta of mails between the last backup and the last mail in the queue to my backup. What had happened: At 10:something my server provider had a large power outage. The UPS didn’t take over as planed and thus a lot of servers rebooted. One of them was mine. Damned … but that’s the basic reason why I’m a fan of proper enterprise architecture and not of some singular availability features, no matter what marketing tells you. Real availability is hard work and often expensive. But: When you really bet your business on IT, you need an architecture that is even capable to cover an UPS that proofs to be not so uninterruptible. The availability feature UPS may fail (and did fail my case) but a proper enterprise architecture keeps your service up and running. Even more important: With a proper enterprise architecture you don’t need the feature UPS for availability reasons at all because your service can survive the outage of some parts. Perhaps you want the UPS out of other reasons like “don’t want the hassle of bringing up all the systems again.”. But you don’t need it with such an architecture to keep your business running. By having a proper planed enterprise architecture with servers on two seperate sides with different power grids you may forget about the UPS because a UPS won’t help you with a prolonged power outage for example because of region-wide blackout. An outage that maybe will take out the connectivity as well as it’s not that unprobable that your local carrier has the same power problem ;) However my business is being an architect, not blogging and thus I didn’t have such an architecture in place for c0t0d0s0.org But that’s a different story. Okay: After a while my system worked again and thus I had time to find out what had happened. I knew that the system was still reacting on pings, thus I knew the kernel of this-other-unixoid-operating-system in a well-known non-commercial variant was working. Looking into the logfiles I saw complete bootup of the kernel and some of the daemons were starting up …. like acpid for example. However I couldn’t log into SSH. No signs in the logfiles of a ssh daemon startup. The apache was in a half-reacting state. Port 80 was open but it didn’t reacted to HTTP commands. Out of this I concluded: The kernel and the boot configuration is okay. The bring up of the services is at least working partially, because otherwise it would start services at all. And as it reacted on the networking there must have been at least a working boot of services mandated by rcS.d, as otherwise there would be no networking. The problem must be in apache that is frozen halfway. And out of other reasons ssh isn’t started at all. There must have been a major fsckup in the startup of the services As I had no console as explained before I needed to conclude from the leftovers what had happened. And now was a little bit puzzled. 5-6 years ago I would have recognized this problem in an instance (because sometimes i’ve produced … well … suboptimal startup scripts) … but now today it took a while, because I didn’t felt prey for 5-6 years to such a problem. It took me a lot of more thoughts what might had happened. When you do one operating system for a living and one for hobby, you tend to project your mindset of one to the others and you don’t do justice to this other OS. As you all may know, Solaris ditched init.d with Solaris 10 in order to introduce SMF (not to forget the equally important features like the contract filesystem and the Fault Management Architecture). One of the nice advantages of SMF is that services that aren’t interdependent will be started in parallel without waiting for another. This has two advantages: At first the system can start up much faster, at second a service not able to start up can’t block the startup of the rest (short of services needed by all others). The init.d concept is a different. All services are started in sequence. The sequence is numerical and then alphabetical. That is of course slower but more important … depending on the way you write your script a script or binary hanging or waiting for user interaction can block the startup. The variant of this-other-unixoid-operating-system is using init.d And it’s quite easy to block a service. For example by integrating a new SSL key and certificate. My key had a password and apache was asking for this key in order to startup. This exacly happened. Acpi started up because it was started before Apache (guess what: ACpi is before APache in an alphabetical order, and way before Ssh). This is the basic reason why you strip of the password from your key. Guess what I did last week: I put a new key and certificate on my server and I forgot to strip the password from it. And that exactly happened: In my version of this-other-unixoid-operating-system the ssh daemon is started after the apache daemon. When Apache waits for something you won’t get SSH. Damned … it’s facepalm time. Basically I felt prey to a beginners error because I’m working with an operating system that reacts totally differently on such situations. On Solaris such a situation just don’t matter at all … you get at least your ssh login and the system non-availability is just a service-non-availability you can fix within a second. However given the init.d system of this-other-unixoid-operating-system the outcome was somewhat more problematic. However: What a dumb error on my side … However: The reinstallation wasn’t that bad … the system could used a reinstallation because of some tests and experiments. So it was worth the work in the evenings. And on the other side: Who had the glorious idea to start apache before ssh? That said, this-other-unixoid-operating-system in newer variants have a different startup mechanism up upstart all the services. However: My heating control is running on a beagleboard-XM at the moment using a really current variant of this-other-unixoid-operating-system just released a few days ago. It uses this new startup mechanism. And it’s justs my unimportant personal preference, that doesn’t matter: But I don’t like it. And I have a lot of reasons for it. It looks by far too much designed for desktop needs. However my dislike would require an article I won’t write in this blog nowadays. But as I wrote: That’s my personal opinion that doesn’t matter. However it’s really important that this-other-unixoid-operating-system gets away from the old init.d mechanism to something more current. I think in 2011 every operating system deserves something more functional, something better than init.d … init.d is simple and well understood, however it creates classes of problems unnecessary today. Especially: In order to keep die-hard Solaris admins to fall prey to such a beginners error because such problems were parts of their distant past. And now i will start to cut holes for my eyes in the brown paper bag for my head.