A Myth from the Early Days.
"Don't test for an error you can't handle"
This was a popular adage at one time. My view is that if you can't handle an error condition, your system analysis is incomplete. In the past, when pointing out the possibility of an error in software, I have heard programmers reply, "What's the chances of that happening?" (They mean: what is the probability?)
I call this Statistical Programming. When dealing with machines that perform millions of operations a second and run 24 hours a day, seven days a week, 365 days a year, huge numbers become meaningless, and even the slightest of probabilities will eventually occur. My reply is, "Given long enough, the probability approaches 1".
If an error can occur, you must presume that it will, and program for it.
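A minimal sketch in C of what that means in practice (my illustration, with a hypothetical file name, not code from any particular system): even an "unlikely" failure gets a branch, and when no recovery is possible the program still detects the error, reports it, and stops cleanly.

    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        /* Hypothetical input file; the name is for illustration only. */
        FILE *fp = fopen("toll_transactions.dat", "rb");

        if (fp == NULL) {
            /* "What's the chances of that happening?" Given long
               enough, it will happen: so fail loudly, not silently. */
            perror("toll_transactions.dat");
            return EXIT_FAILURE;
        }

        /* ... process the file ... */

        fclose(fp);
        return EXIT_SUCCESS;
    }

Even when the only "handling" available is an orderly stop with a logged reason, that beats carrying on with corrupt state.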
Do you like reliability in your computer systems?
One of our Toll Road clients was the M4 motorway in Sydney, which is no longer a toll road, incidentally.
Anyhow, one day they phoned to say that they were doing some electrical work and had to move the server onto a new power circuit: could we please shut it down?
Before we did, we ran 'uptime': the UNIX command that reports how long it has been since the machine was last booted. It had been running continuously for 15 months.
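For anyone who hasn't met it, uptime looks something like this (representative output only, not the actual session from that day):

    $ uptime
     10:42:01 up 455 days, 3:17, 1 user, load average: 0.02, 0.05, 0.01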
That's reliability. The machine was a Digital Alpha running Tru64 UNIX, a totally fabulous machine with a deadly accurate real-time clock. Look it up: the Alpha's RTC was legendary, and we used that machine as the NTP server for the site.