Post your IT redundancy tales here
(awful.systems)
Once upon a time I was, for employment reasons, part of a team providing customer support for police forces' booking equipment including RHL, Solaris, HP-UX, and Tru64 servers. If you were arrested in some specific parts of the USA in the early 2000s it's likely your PII travelled across a server I had logged into on its way to the FBI.
One specific police force called us about a red light. It turned out that half of their two-disk RAID1 array had failed. Then it transpired that they had not been rotating the backup tapes. Or even putting in the tape for backup. After some discussion it turned out that their server was in a grimy, dusty janitor's closet instead of, say, under a desk or in a spare office. Which is why it had been out of sight and mind and getting clogged with filth.
I was asked to do a checkup on this server and see how it was. Of course this was after 3 PM on a Friday. On the face of it the server seemed fine: the RAID array was working on one disk, there were no errors on the box, and so on. Apart from the dead disk everything was fine.
(While I was being finicky with this host it got late and somebody turned off the lights and I yelled "I'm still here turn the goddamn lights back on!" or words to that effect and it turned out I had unintentionally cursed at the CEO with whom I had less than ideal relations. So it goes. He seemed more copacetic than usual. He left and I got on with things.)
Eventually I was finishing my little audit and my very junior self (job title: Technician) was wondering how little work I could get away with in my correctly lazy sysadmin style. For the first time I thought the thought that has guided my actions with systems ever since: "If I stopped here, would it be okay if a problem happened later?"
I called the police force and said that, given the circumstances, I needed to take a cold backup of their Oracle database and their booking equipment would be down for a bit. The response was that this was fine, given that arrest volume only picked up later on a Friday anyway. I merrily took down a whole police force's ability to book suspects to cold-backup their Oracle database onto the third disk in the host (secondary backup mechanism or something, purchasing is a weird art). Then I turned them back on and had them do a smoke test and grabbed the bus home.
I had myself a happy little weekend in the era before cell phones, and when I arrived a bit late on Monday morning my workplace was in a rather unprecedented uproar. Readers, the second disk in that police force's RAID array had failed and taken with it their ability to book prisoners and their years of built-up criminal intelligence data.
(In this situation the civil rights clock ticks and judges do not accept "well our computer systems were down" for slow-rolling delivery to bail hearings as much as the public thinks they do. So if this isn't fixed a whole bunch of innocent and/or gormless and/or unpleasant people run wild and free.)
I was called over by the Executive Vice President of Operations and asked about the database. I said in my then-typical very guileless way "oh, I did a cold backup onto the third disk". It was like everybody had just exhaled around me.
If I recall correctly the job of restoring the Oracle database was delegated to J. who was a great Oracle DBA among other talents. Well, as soon as a disk arrived. In the meantime the police force dug out their inkpads and paper from somewhere.
This was a superior lesson in the fragility of computer systems and why the extra mile is actually no more or less than all the required miles.