It must have been 25 years ago when it dawned on me that, in the future, we would have fabulous products that didn’t work all the time. You see, I was working at a consumer electronics company churning out circuits and code for gadgets at a feverish pace. The rule was that you had to get the product to market fast and at a very low price. I reflect on that now as I attempt to use my tablet computer and the application crashes. You see, reliability tends to fall as complexity rises.
Think of the Space Shuttle. Per https://www.nasa.gov/pdf/566250main_2011.07.05%20SHUTTLE%20ERA%20FACTS.pdf there were about two and a half million parts in the vehicle. The parts in a space vehicle are classified as class ‘S’ parts. ( http://engineer.jpl.nasa.gov/practices/1203.pdf ) These are the highest reliability parts available. Now, suppose each part had a reliability of 99.999%. That means each part has a 0.001% chance of failing. Multiply that by two and a half million parts and you should expect about 25 of them to be failing.
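The arithmetic is easy to check. Here is a minimal sketch, assuming every part fails independently and that the vehicle needs all of them to work (a simplification, since a real vehicle has redundancy). The function names are my own and the 99.999% figure is the supposition from the paragraph above, not a published NASA number.

```python
import math

PART_COUNT = 2_500_000       # approximate part count cited above
PART_RELIABILITY = 0.99999   # the 99.999% supposed in the text (an assumption)


def expected_failures(part_count, reliability):
    """Average number of failed parts: count times per-part failure probability."""
    return part_count * (1.0 - reliability)


def prob_all_parts_work(part_count, reliability):
    """Chance that not a single part has failed, assuming independent parts.

    This is reliability ** part_count, computed with logs for clarity.
    """
    return math.exp(part_count * math.log(reliability))


if __name__ == "__main__":
    print(f"expected failed parts: {expected_failures(PART_COUNT, PART_RELIABILITY):.1f}")
    print(f"chance nothing failed: {prob_all_parts_work(PART_COUNT, PART_RELIABILITY):.2e}")
```

With these numbers the expected count comes out at about 25 failed parts, and the chance that nothing at all has failed is vanishingly small. That is complexity at work: even superb parts add up to near-certain trouble somewhere.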
There are ways to increase reliability. The usual first attempt is to install a redundant system. However, this is often not done correctly. Suppose you have two power supplies in your home computer. If one fails, the other takes over for it. But they are both plugged into the same outlet. A common mode failure of the power at that outlet will still cause the system to fail.
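To put rough numbers on the power supply example, here is a small sketch. It assumes the two supplies fail independently of each other and that the shared outlet sits in series with both of them. The reliability figures are invented for illustration and the function names are hypothetical.

```python
def parallel_reliability(unit_reliability, copies=2):
    """Chance that at least one of several independent redundant units works."""
    return 1.0 - (1.0 - unit_reliability) ** copies


def with_common_mode(unit_reliability, shared_reliability, copies=2):
    """Redundant units that all depend on one shared element (the outlet).

    The shared element is in series with the redundant pair, so the
    system can never be more reliable than that shared element.
    """
    return shared_reliability * parallel_reliability(unit_reliability, copies)


if __name__ == "__main__":
    supply = 0.99    # made-up reliability of one power supply
    outlet = 0.995   # made-up reliability of the wall outlet
    print(f"two supplies, separate power: {parallel_reliability(supply):.6f}")
    print(f"two supplies, same outlet:    {with_common_mode(supply, outlet):.6f}")
```

However many supplies you add, the second figure never climbs above the reliability of the outlet itself. The redundancy only pays off once the shared dependency is removed.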
Then there is cross-coupling. Suppose you have two computers monitoring a device. They also send a message to each other to make sure the other is alive. They are on different power systems and have different mechanisms to observe the device they are monitoring. If anything should fail, they are to send a message to headquarters telling of the need to attend to the device. Now suppose that the first computer has an electrical fault and burns out its little electronic brain. Along with its conflagration, the electrical fault travels down the communications connection between the two monitors and deals a similar fatal blow to the second computer. Nothing is sent back to headquarters and the critical part that is being monitored dies a silent death.
Perhaps the be-all and end-all of reliability analysis is the Failure Mode, Effects, and Criticality Analysis (FMECA) http://rsdo.gsfc.nasa.gov/documents/Rapid-III-Documents/MAR-Reference/GSFC-FAP-322-208-FMEA-Draft.pdf . This analysis is performed on spacecraft as well as on medical life-support devices. The issue with the FMECA is that it costs a great deal if the system is not already documented. Therefore, if you are designing a system from scratch, it is best to include an FMECA just after the first high-level design review and feed the findings back into the requirements.
Here is another fun thing to do. Take a drawing of your system and then X out one of the components. Was any type of alert sent to notify of the failure? If not, then move up the chain and fail another component until something is noticed. Does your system fail catastrophically? What happens if you do notice the failure but the Mean Time To Repair (MTTR) is longer than the Mean Time Between Failures (MTBF) of the components? Even with a backup system, if the failed unit is not repaired quickly, there is a high probability that the backup will fail while the repair is still in progress, and then the whole system goes down.
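The MTTR question can be put into numbers with a short sketch. It assumes the backup fails at a constant rate (an exponential lifetime model), which is my simplification and not something the drawing exercise requires. The function name and the hour figures are made up for illustration.

```python
import math


def prob_backup_fails_during_repair(mttr_hours, mtbf_hours):
    """Chance the backup dies while the primary is still being repaired.

    Assumes a constant failure rate of 1 / MTBF for the backup, so the
    probability of failing within the repair window is 1 - exp(-MTTR / MTBF).
    """
    return 1.0 - math.exp(-mttr_hours / mtbf_hours)


if __name__ == "__main__":
    mtbf = 1000.0  # hypothetical MTBF of the backup unit, in hours
    for mttr in (10.0, 1000.0, 2000.0):
        risk = prob_backup_fails_during_repair(mttr, mtbf)
        print(f"MTTR {mttr:6.0f} h, MTBF {mtbf:.0f} h -> risk {risk:.1%}")
```

Under this model a repair that takes 1% of the MTBF leaves about a 1% chance of losing the backup too. Once the repair time matches the MTBF, the odds are worse than a coin flip.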
OK. So what is all of this stuff worth? If you are not in the technology industry, think of an expensive item you buy that is very complex. Let’s say, your laptop. Yes. You have just purchased a new laptop. You should just start using it, right? Wrong. These things have limited warranties and are cranked out by the thousands every year. The warranty is usually for 1 year.
Most failures occur within the first 100 hours of the life of a part. Thus, if you have been shipped a lemon laptop, it would behoove you to turn the thing on and set it up so that it comes out of power-saving mode and stays on (or buy a Wiebetech Mouse Jiggler http://www.amazon.com/CRU-Inc-30200-0100-0011-WiebeTech-Jiggler/dp/B000O3S0PK ). It is even better if you can have it do something while it is powered up for those 100 hours, but just letting it run and ensuring its disk drive is spinning is good enough. If it makes it through the first 100 hours, then it will most likely survive until it wears out.