"Heresy is another word for freedom of thought."
- Graham Greene
Software Reliability Redux
Perhaps not surprisingly, the posts received mixed reactions. Some comments were positive, but many reflected the "traditional" view that all software failures are systematic and therefore must be excluded from the universe of probability and statistics. One commenter even charitably suggested that "you obviously don't understand how software works..."
Which brings me to today's post. I confess to not being a software or computing expert (my formal education is in chemical engineering and reliability engineering), so I have been continuing my informal education on the subject.
A Heretical Book
I recently purchased an excellent book, Embedded Software Development for Safety-Critical Systems [affiliate link] by Chris Hobbs. Mr. Hobbs is a kernel developer and software safety specialist at BlackBerry QNX. As you may know, QNX is the popular RTOS that holds safety certifications per IEC 61508, ISO 26262, and IEC 62304, to name a few. Mr. Hobbs understands how software works!
I won't do a full review of the book here. However, it is a surprisingly comprehensive and practical treatment of the subject of building safe and reliable software. A wide array of practicalities such as standards, code coverage, static analysis, and release synchronization are discussed. More obscure topics such as formal analysis, Bayesian Belief Networks, and Goal Structuring Notation are also covered.
What inspired this post is the pure gold that is Chapter 13, Software Failure Rates. Rather than summarize, I will directly quote the first few paragraphs (emphasis added):
"It must be said at the outset that the contents of this chapter are considered heretical by many people in the world of safety-critical systems. IEC 61508 makes the assumption that, whereas hardware failures can occur at random, software failures are all systematic [...] This is an extremely naive point of view that originated in a world of mechanical hardware and single-threaded programs running on single-core processors with a simple run-to-completion executive program rather than a re-entrant operating system. It also makes the assumption that the hardware of the processor will always correctly execute a correctly compiled program. These assumptions are not true of the integrated hardware of today's software-based systems."
He goes on to give a practical example of a software failure that is random in all respects that matter. He then argues that the same reasoning used to claim software failures are not random could equally be used to argue that no hardware failures are random.
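I won't reproduce Mr. Hobbs' own example here, but a classic mechanism behind this kind of failure is a data race in re-entrant, multithreaded code. The Python sketch below (my own illustration, not from the book; all names and counts are mine) shows two or more threads incrementing a shared counter without a lock. The read-modify-write in `count += 1` is not atomic, so whether updates are lost depends on how the scheduler happens to interleave the threads: for all practical purposes, a random outcome.

```python
import threading

def racy_counter(n_threads=4, n_increments=100_000):
    """Increment a shared counter from several threads with no lock.

    `count += 1` compiles to a load, an add, and a store. A thread can
    be preempted between the load and the store, so increments can be
    lost -- and the final total can vary from run to run.
    """
    count = 0

    def worker():
        nonlocal count
        for _ in range(n_increments):
            count += 1  # not atomic: load, add, store

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return count

if __name__ == "__main__":
    expected = 4 * 100_000
    observed = racy_counter()
    # `observed` is often less than `expected`, and varies between runs
    print(f"expected {expected}, observed {observed}")
```

The code is deterministic on paper; the failure (a lost update) appears or not depending on thread scheduling, cache timing, and system load. Whether you call that "systematic" or "random" is exactly the engineering choice at issue.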
I used a similar argument in the post IEC 61511 is Wrong about Systematic Failures to illustrate that the classification of systematic vs. random failures is really an engineering choice (i.e. level of effort to investigate causation) rather than something inherent to the failure mechanism.
Not an Outlier
Mr. Hobbs is far from alone in his views on software reliability, as my previous posts illustrate. Even the chip builders at Intel acknowledge that modern multicore microprocessors are inherently non-deterministic.
In a separate paper, Mr. Hobbs notes that modern processors often come with 20+ pages of errata (usually under NDA, of course!). Out-of-order execution is not even errata on most modern processors; it is a design feature! So even perfect software will run non-deterministically and may encounter quasi-random hardware errors.
I highly recommend Mr. Hobbs' book. Even if you disagree with the software reliability discussion, there is a huge amount of other practical knowledge in the book.
😉 Most importantly, if you buy the book, please use my affiliate link!
It is shaping up to be a busy year for me both at home and at work. As time permits, I will keep posting. Please keep visiting!