As we continue to make software a component of more technologies, software failures are evolving from losing the last 20 minutes of your work to losing the rest of your life. I was recently reminded of something that happened to me years ago, in which I encountered a potentially life-threatening software failure.
In the winter, about 10 years ago, I had a major software failure in my 1997 Saturn SC2. I was living near the top of a steep hill, with a road to match. My drive to work required me to descend this steep road, which hit a low point before rising up again to touch the main road. If you could look at the road from its side, it would resemble a check mark, with my house near the top of the longer stroke. Now picture the road covered with a fresh, wet, slippery snow.
I descended the hill in low gear, to take advantage of engine braking, but I also had my foot on the brake. The action of the ABS brakes caused the usual pulsation in the brake pedal, along with the typical rattling sound, as it kept my speed down. So far, so good. Then, after a few seconds of constant ABS activity, I lost the brakes, as my dashboard lit up with red lights.
“Hmm,” I thought. The low gear was helping to slow me down, but it could only do so much without brakes. My steering still worked, so my plan was to drift down the hill and rely on the braking power of the incline between me and the main road. This was the plan for several seconds, until someone pulled onto the road from an adjacent apartment complex. They were heading to the main road, too, and they were in front of me.
“Hmm,” I thought again. I couldn’t rely on the other car moving fast enough to not be in my way, so I had to come up with a plan B. I quickly thought about the failure, running through various scenarios. While I wasn’t certain, I suspected that I had encountered a software failure, and that the hardware (the brakes and the ABS controller) was fine. Ultimately, I decided to reboot the car.
This was a little scary. Power steering would go away while I did this, even if only for a few seconds. I’d also never started my car while the wheels were in motion. Out of an obscure memory, I pulled information my dad told me once: “You can start a car with automatic transmission in either Park or Neutral.” Park was out of the question, but Neutral would work just fine. So, I turned the key into the Off position, shifted to Neutral, and then turned the car back on. After the usual brief test period for indicator lights, all the red lights were gone. The brakes had resumed their clicking noise, and this time, they kept working. After shifting into low gear, everything was back to where I wanted it to be.
I considered contacting Saturn about this, but I didn’t think it would lead to any improvements. My mindset at the time was that unless I had a way to consistently cause this failure, the report wouldn’t be acted upon. In fact, I only encountered that failure once during the time I owned the car. If it happened to me now, I think I would contact Saturn and anyone else who had an interest in making sure cars are safe to drive.
If there’s any life lesson here, it’s that the Hitchhiker’s Guide to the Galaxy was right: Don’t panic!
Let’s say you’re in the business of buying food raw materials. How do you know you’re buying a protein powder and not, say, talcum powder? Tasting it may not be your first idea, just in case a shipment got mixed up, and it doesn’t scale well, so you’d probably want some way to test the powder to make sure it’s what you ordered. You could probably perform a really specific test that would be costly and/or time consuming but would definitively tell you that you had protein powder with a given protein content and zero contaminants.
But why would you do that? People are basically honest, right? All you need to do is look for something associated with protein to check up on your supplier now and then and compare what they’re sending you to what their rivals could provide at the same or better cost. Well, there’s plenty of nitrogen in protein, so a test for nitrogen would be a pretty good test for protein… Or fertilizer. Or melamine.
You’ve heard of melamine in the news, and QA is the reason why. Two tests are generally used to measure protein content in milk, one called the Kjeldahl test, the other called the Dumas test, both of which provide similar results. It is tempting to call it the Dumbass test instead, though that would be unfair to the test itself, which does a good job of detecting nitrogen levels. No, the dumbasses are the people who use these test results to determine protein levels, when what the tests actually measure is nitrogen content. That’s how melamine ended up in baby formula and other food products: the test results showed a high nitrogen content, which the testers wrongly took to mean a high protein content, and not a high protein + industrial chemical content.
Put in other terms, this is like taking the knowledge that English documents are about 6.5% N’s and trying to determine if an author sent you a 40,000-word novel by counting the N’s and deriving how many total letters there must be, and by extension how many words. Much like with the protein test, it works fine as long as the test subject is honest. When the author who knows your QA process sends you a document containing nothing but N’s, you are in for a surprise of the worst kind.
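For the programmers in the audience, here is a rough sketch of that N-counting "QA process" in Python. Only the 6.5% figure comes from the analogy above; the average word length of 4.7 letters and the function names are my own assumptions for illustration, so treat the numbers as a toy, not a measurement.

```python
# A toy version of the "count the N's" test. Only N_FRACTION comes from
# the analogy in the text; LETTERS_PER_WORD is an assumed average.
N_FRACTION = 0.065        # share of letters that are N in English text
LETTERS_PER_WORD = 4.7    # assumption: rough average English word length


def estimate_words_from_n_count(text: str) -> float:
    """Estimate the word count using nothing but the number of N's."""
    n_count = sum(1 for ch in text.lower() if ch == "n")
    estimated_letters = n_count / N_FRACTION
    return estimated_letters / LETTERS_PER_WORD


def actual_words(text: str) -> int:
    """Count the words the slow, honest way."""
    return len(text.split())


# A dishonest "novel": nothing but N's, sized to fool the proxy test.
fake_novel = "n" * 12220

print(round(estimate_words_from_n_count(fake_novel)))  # proxy test result
print(actual_words(fake_novel))                        # what was delivered
```

The proxy reports a manuscript of roughly 40,000 words, while the direct count finds a single very long "word." The cheap test measures exactly what it was built to measure, and that is the whole problem.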
What is most frustrating is that there do appear to be accurate alternatives to the foolable tests, though they are surely not cheaper. As with so many things in life, cheaper wins until cheaper fails so badly that cheaper ends up in the headlines and in shiny new laws.
Now and then, a manager will make a decision to meet a software delivery deadline by bypassing a QA process, because in their mind, meeting the deadline is more important than assuring quality. They may send something broken, but at least they send something broken on time. To be fair, this is not always their fault; their own managers may be the ones enforcing this world view, and adhering to it, however nonsensical, may make a major difference in their annual raise.
Those of us who actually produce work product for a living tend to think of things in terms of whether or not they work, not when they are delivered. The when is largely a measure of how inaccurate the manager’s arbitrary schedule was, nothing more. It took as long as it took to make it work, and that’s that.
I propose that any manager who elects to circumvent a QA process in order to rush software toward a deadline must read and sign a document stating the following:
I’m a manager, and as such, I know way more than the QA department’s trained professionals about the chaotic nature of software changes. I acknowledge that one simple change can cause side-effects that are not obvious, but I’m certain that this isn’t the case here because my mind is more powerful than any mere QA process, and I know better than the entire QA group. If for some reason I’m wrong (which I am not), I will buy everyone a pony of their choosing.
This may seem a little harsh, but when I was starting my career, I became the delivery manager and the QA department, making me the gatekeeper between a deliverable and our customer. I had a total quality failure involving someone who was new to a software module and building it for the first time. Twice in the same week, he assured me it was good to go, and twice in the same week, I trusted that and delivered something that was fundamentally broken. We worked together to resolve the problem, but the lesson I learned was that trust and judgment are not as powerful as a robust QA process. When I elected to bypass the QA process, I was essentially making an implicit declaration like the one above. Of course, I was, in effect, saying that I knew more than me, since I was both the delivery manager and the QA department 🙂