© Dan Starr, 1998

We all know that software’s special. It’s a magical discipline that creates wonderful special effects, makes machines talk, controls both the motors in our cars and the robots that build those cars, and brings unlimited information and entertainment into our homes over the Internet. Software developers are special people, some of them more famous than rock stars or presidents.

And when software fails, those failures are special. When the baggage-handling system at the new Denver airport started shredding suitcases, it was a software problem [1]. When millions of people lose their phone service, it’s because of a software problem [2]. And when civilization collapses on the morning of “1/1/00,” it will be blamed on a software problem [3].

Software’s problems are so special that we have a special name for them: the “software crisis.” Software takes too long and costs too much to develop, and when it’s finished, it’s often unsatisfactory, to the point where some large software systems are never even used. According to Scientific American, an average project finishes some 50 percent behind schedule, about one large project in four simply dies, and three quarters of large systems don’t work as intended or are never used. [4] Our special discipline seems to have some special problems.

But are the problems really that special? Consider the Seikan rail tunnel, which connects the Japanese islands of Honshu and Hokkaido. It took seventeen years to drill (twice the estimate) and cost over eight billion dollars (ten times the estimate). Adding insult to expense, by the time the tunnel opened, airlines had long since displaced the unreliable and dangerous ferries that were the main reason for building a rail tunnel. The world’s longest tunnel is hardly used. [5]

On the other side of the world, the Humber Bridge in England was briefly the longest suspension bridge in the world. Construction began in 1972 and was supposed to take four years and cost 23 million pounds. In fact it took nine years to build and had accumulated some 145 million pounds in debt by its completion. Like the Seikan Tunnel, the Humber Bridge gets little use; toll revenues do not come close to covering the interest on its construction loans, and the debt continues to climb. It presently exceeds half a billion pounds. [6]

And we thought we had problems.

The Seikan Tunnel and the Humber Bridge are works of civil engineering, the world’s oldest engineering discipline. They are not alone. Engineering disciplines that have been “mature” for decades or centuries still spit up failures, many of which look suspiciously like the things held up as evidence of a “software crisis.” In this article, I will describe some of these failures and suggest that software is much more like the “mature” engineering disciplines than we might think. There is bad news and good news in this observation: bad news, because things aren’t going to get miraculously better, and good news, because we somehow find ways to live with the limitations and failings inherent in the products of human creativity.

PROJECTS

The Seikan Tunnel and the Humber Bridge got to be so late and over budget in much the same way that software projects do. The problems to be solved were more complicated than anyone expected. Both projects underestimated the geologic troubles they would encounter (the Seikan Tunnel was driven through young and unstable rock formations; the Humber Bridge’s foundations were sunk in a touchy “overconsolidated” clay). Both ran into unexpected ground-water problems that forced major changes in their designs and plans, adding months to the schedules. The Humber Bridge suffered resource shortages as well: because of a strike, there was a steel shortage at the worst possible time for the bridge builders. [7] Software projects run into similar problems: the new compiler doesn’t work, the critical person leaves for a better job, the competition’s new product announcement means that the features you planned for Release 3 are now in Release 1.5.

Projects, be they tunnels, bridges or software, are about solving a new and unique problem, and this involves risk. The people who built the Humber Bridge could now probably build an identical bridge on the same site on time and within budget. At another site, there will be new unique problems and new risks. The same applies to software. I can re-develop my last system sooner and cheaper than I developed it the first time. Who would buy it? Customers want a better version of the system, with more features, more capabilities, more performance…all of which spell more risk and uncertainty, and more chance of being late and over budget.

So, our first lesson: software projects are projects first, software second. Projects, by their nature, are doing new things, running new risks, and likely to cost more and take longer than we thought. [8]

SYSTEMS

Software sometimes blows up unexpectedly. A tested, trusted program that we think we understand crashes (“GENERAL PROTECTION FAULT!!”) without warning. Worse, we restart the machine, and we can’t re-create the problem. Even though we designed the code, we really don’t understand how it works.

In this respect, software behaves just like a chemical-processing plant, where every now and then a reaction vessel blows up without warning. Chemical engineers grudgingly accept that even though they designed the reactions and the vessels, they don’t really understand fully what’s happening. Their designs are at best models of the system’s actual behavior.

Chemical plants and software systems blow up from time to time because they are complex (a lot of different things going on) and tightly coupled (a change in one thing affects a lot of others, very quickly). [9] Chemical plants explode because a bit of something got too concentrated in this one spot for a split second; software crashes because for a couple milliseconds there was no memory available in a system table, or there was nobody there to handle an unexpected interrupt.
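
To make the software half of that comparison concrete, here is a toy sketch in C, my own invention rather than code from any real system, showing how a momentary shortage in a fixed-size system table becomes a crash when nothing checks for the shortage:

    #include <stdio.h>

    /* Hypothetical sketch: a fixed-size "system table" of call records.
       If every slot is busy for a moment, alloc_record() returns NULL;
       the caller never checks, so the program runs fine under light load
       and crashes only during a brief burst of activity. */

    #define TABLE_SIZE 4

    struct call_record { int line; int state; };

    static struct call_record table[TABLE_SIZE];
    static int in_use[TABLE_SIZE];

    static struct call_record *alloc_record(void)
    {
        int i;
        for (i = 0; i < TABLE_SIZE; i++)
            if (!in_use[i]) { in_use[i] = 1; return &table[i]; }
        return NULL;                     /* table momentarily full */
    }

    static void start_call(int line)
    {
        struct call_record *r = alloc_record();
        r->line = line;                  /* no NULL check: the tiny mistake */
        r->state = 1;
        printf("call started on line %d\n", line);
    }

    int main(void)
    {
        int i;
        for (i = 0; i < 6; i++)          /* the fifth call finds no free slot */
            start_call(i);
        return 0;
    }

The program is “correct” almost all of the time; it fails only when a burst of activity and the missing check line up, which is exactly why the crash never shows up on the test bench and can’t be re-created afterward.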

Some of the complexity and coupling in software is accidental, a product of our development processes. We can reduce this. But some complexity and coupling are essential to the jobs our systems are doing. Turning oil into plastic is by its nature a complex and tightly-coupled task. So is turning electrical signals into interactive words and pictures. [10]

Software is vulnerable to tiny mistakes. Three inverted bits in a patch cause millions of people to lose their phone service. [2] A misplaced decimal point lands a space probe on the bottom of the ocean. A data-entry “oops” leads to a million-dollar electricity bill [11].

Software is hardly alone in this. In 1980, the Alexander L. Kielland oil drilling platform broke up and capsized in the North Sea, killing 123 workers. The reason: a faulty weld in a hydrophone-mounting bracket caused a small crack in a bracing tube. In time, the crack grew and the brace broke, which led to other braces breaking, which led to a flotation pontoon coming off, which led to the rig tipping in the water, which allowed water into the other pontoons, which led to the rig overturning. All from one bad weld, in what wasn’t even a “structural” component. [12]

This is the nature of complex systems. Failures always start from something small–an incorrect pointer, a bad weld, a weak bolt. Then, because of the interactions within the system, the small failure causes a bigger failure, which causes a still bigger failure, until the whole thing fails.

To keep tiny mistakes from crashing our software or trashing our data, we write more software to do error checking and correction. Much of the time it works. Every now and then, though, it causes the very problem it’s supposed to prevent. The “backup” software that was supposed to keep my personal digital assistant from losing data wiped out two months’ worth of memos; the computer won’t restart after you run the virus checker; the disk-cleanup utility corrupts your files while “fixing” allocation tables. A unique problem of software? Hardly. For 120 years, the Britannia Bridge carried trains across the Menai Straits in Wales. Somebody noticed that the great wrought-iron tubes of the bridge were vulnerable to rust, and added a wooden roof to protect them from rain. In 1970, the roof caught fire, the heat distorted the tubes beyond repair, and the bridge had to be torn down and replaced. [13]

This is called a “revenge effect,” and it’s another common behavior of complex systems. New roads encourage development and create more congestion. Better antibiotics help breed tougher germs by killing the weaker ones. Safety devices encourage people to take more risks. And software that’s supposed to prevent one problem causes another. [14]

A second lesson: software systems are systems first, software second. Complex systems, be they software, buildings or bridges, are prone to revenge effects and unexpected failures triggered by tiny defects.

ECONOMICS

Software doesn’t always age gracefully. Look at the “year 2000 problem.” A lot of workhorse programs, the ones that print our paychecks and tax bills, were written back when two digits seemed like plenty of space for the year. They’re going to do strange things when the year goes from “99” to “00.” Industry experts estimate that it will cost between 300 and 600 billion dollars to fix this little bug. [3]
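
The bug itself is almost embarrassingly small. Here is a made-up fragment of C, not code from any actual payroll system, showing the kind of two-digit date arithmetic involved:

    #include <stdio.h>

    /* A made-up example of two-digit-year arithmetic.  Storing only "YY"
       saved two characters per record; the subtraction works for decades
       and then goes wrong the moment 99 rolls over to 00. */

    static int years_of_service(int hired_yy, int current_yy)
    {
        return current_yy - hired_yy;
    }

    int main(void)
    {
        printf("%d\n", years_of_service(65, 99));  /*  34 -- correct      */
        printf("%d\n", years_of_service(65, 0));   /* -65 -- the Y2K bug  */
        return 0;
    }

Multiply that one subtraction by every date comparison buried in decades-old code, and the size of the repair job starts to make sense.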

About the same time those programs were being written, thousands of suburban tract houses were being built with one-car garages. One car was all the Nuclear Family of the fifties and early sixties needed, and land and lumber cost money. Today most families have two or three cars, and the streets of our suburbs are parking lots for them.

Who could have predicted that we’d have two-income, three-car families long before these houses were worn out? Who could have predicted that payroll programs written in the sixties would still be in common use at the turn of the century? Who’d have paid the extra money up front even if they were offered a two-car garage or a four-digit year? Resources, be they spaces in the garage or digits in the year, have a cost, and it’s always easier to cut costs now and let the future take care of itself.

All technologies face economic limits. The first large cantilever bridge was built across the Firth of Forth in Scotland in 1890, with spans of 1710 feet. The largest cantilever bridge ever built was completed in Quebec in 1919, with a span only 90 feet longer. [15] Longer cantilevers are possible, but they aren’t cost-competitive with cheaper suspension or cable-stayed designs.

Software runs into the same problem. We often write software to reduce the cost of doing something, but software isn’t always cheaper than doing it manually. People are very good at picking things up and moving them around. We have excellent vision and all kinds of sensors in our skin and muscles that let us judge weight, center of mass, frictional coefficient, and so forth when we pick something up. Giving a computer good vision or a sense of touch is expensive, requiring lots of bandwidth and costly transducers. Automating a function that involves visual and tactile information (say, handling airline baggage) is hard enough. Making a machine that does it more cheaply than semi-skilled labor is just about impossible.

A third lesson: software economics is economics first, software second. We build what seems to give us the most value for the least cost, and we keep it around, warts and all, until it’s cheaper to replace it. This leads to a cottage industry in “year 2000 compliance,” parked cars all over the street…and on occasion, a wholesale shift to new technologies.

TECHNOLOGY

All technologies have limits. For instance, propeller-driven airplanes can’t fly at the speed of sound. Shock waves form around the blades, and the prop no longer generates any thrust. [16] The laws of physics divide the universe of problems into those we can solve with propellers and those we can’t.

Software technology undoubtedly has similar limits, which is why some software projects work and some never seem to. I had a job offer from a government-affiliated think tank back in 1975. They were working on computerized air traffic control, they had demonstrated a few basic functions like keeping planes from hitting each other, and they expected to soon be able to control all the traffic at the world’s busiest airport with a couple minicomputers. Their estimate was slightly off. After twenty years and millions of dollars of R&D, computerized ATC seems further off than it did when I turned down that job. [17]

I think automated air traffic control is running into a fundamental limitation of software: you can’t code what you can’t explain to someone who has no domain knowledge. Nobody’s exactly sure how air traffic controllers do their jobs; they just know which blips on the screen are important and what to do with them. It’s an intuitive thing, and new controllers learn on the job, working with experienced controllers.

In contrast, software-controlled telephone switching has been a great success. Millions of lines of code switch billions of calls every day. Telephone switching has succeeded where air traffic control hasn’t in part because the functions of a telephone operator were made into algorithms and implemented by machines years before the first line of switching code was written. [18] The job of completing a telephone connection can be explained to someone with no domain knowledge, so we can write software to do it. The job of controlling air traffic can’t (at least so far), so we haven’t been able to write software to do it.
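
To see the difference, consider how little of the operator’s job resists being written down. The toy state machine below, my own sketch in C rather than any real switching system’s code, captures the skeleton of call setup, and someone with no telephony background could follow it:

    #include <stdio.h>

    /* A toy sketch of call setup as an explainable algorithm:
       idle -> dialing -> ringing -> connected -> idle.
       Real switching software is millions of lines, but it is built
       from rules this mechanical. */

    enum call_state { IDLE, DIALING, RINGING, CONNECTED };
    enum call_event { OFF_HOOK, DIGITS_DONE, ANSWERED, ON_HOOK };

    static enum call_state next_state(enum call_state s, enum call_event e)
    {
        switch (s) {
        case IDLE:      return e == OFF_HOOK    ? DIALING   : IDLE;
        case DIALING:   return e == DIGITS_DONE ? RINGING
                             : e == ON_HOOK     ? IDLE      : DIALING;
        case RINGING:   return e == ANSWERED    ? CONNECTED
                             : e == ON_HOOK     ? IDLE      : RINGING;
        case CONNECTED: return e == ON_HOOK     ? IDLE      : CONNECTED;
        }
        return IDLE;
    }

    int main(void)
    {
        enum call_event script[] = { OFF_HOOK, DIGITS_DONE, ANSWERED, ON_HOOK };
        enum call_state s = IDLE;
        int i;
        for (i = 0; i < 4; i++)
            s = next_state(s, script[i]);
        printf("final state: %d\n", s);  /* 0 == IDLE: call completed and released */
        return 0;
    }

There is no comparable table that says which blip on a radar screen matters most right now, and that, so far, is the difference.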

A fourth lesson: Software technology is technology first, software second. This means that like any other technology, it has limits; there are things it just can’t do.

PEOPLE

People write software, and some are much better at it than others, by as much as a factor of ten. Maybe we could solve our “software crisis” by making sure that we put the best people to work on the hardest problems–a “brilliant designer/obedient coders” approach. This idea is at the heart of Harlan Mills’s “surgical team” model of software development described by Fred Brooks. [19]

Alas, history suggests that brilliant designers make their share of mistakes. Thomas Bouch was knighted for designing and building the Tay Bridge in Scotland, but he’d neglected to consider wind loading, and the bridge was blown off its supports in 1879. Leon Moisseiff, with several landmark suspension bridges already to his name, designed the Tacoma Narrows Bridge using the state-of-the-art “deflection theory.” This slender and beautiful bridge was unstable in wind, and twisted itself to pieces. Bouch and Moisseiff were brilliant designers, as were Theodore Cooper, Robert Stephenson and Charles Ellet, all of whom designed “state-of-the-art” bridges that fell down. [20] Even the most brilliant designers occasionally fail to think of something important, or follow an incorrect theory.

So, to catch these mistakes and incorrect theories, we put in place reviews, inspections and independent certification. These are certainly useful tools and I support their use; I’m in the review business myself. Still, mistakes occasionally slip through. Look again at the Alexander Kielland. This vessel received a “classification for suitability” (that is, it was declared seaworthy) from Det norske Veritas, an independent inspection agency. Similar platforms were inspected by DNV and Lloyd’s of London and found fit for use. Yet the Kielland went to sea with a crack some three inches long in a structural member. And when the Kielland’s sister rigs were inspected after the disaster, they were found to have similar cracks. The inspections simply hadn’t caught them.

Why not? Bignell and Fortune suggest that the inspectors may have missed the cracks because “they were not looked for; this part of the brace, near its mid-length, was not regarded as critical.” [21] This is an inherent limit of inspections: to carry out an effective inspection, we have to know where to look and what to look for. That knowledge comes from past experience with failures of things similar to the thing being inspected. DNV and Lloyd’s had lots of experience inspecting ships, but ships and oil platforms are different, and ships naturally resist the kind of fatigue that caused the Kielland to break up.

Inspection and certification have similar limits when applied to software. The inspectors have to know where to look for defects, and have to know what a defect looks like. It takes time and experience to know what to look for, and the rapid evolution of software technology makes that knowledge obsolete.

A fifth lesson: software people are first and foremost people. They make mistakes, even the smartest of them fall victim to the attractive-but-wrong theory, and they don’t always spot every mistake.

ENGINEERING

Bridges sometimes fall down, and chemical plants sometimes blow up, and major engineering projects are almost always late and over budget. We build them anyway because they’re useful. We do proof tests before we open new bridges to traffic, we locate chemical plants to minimize the damage when things explode, we hedge our bets with insurance, and we get on with the business of creating value within the limits of our technology. And we accept that some of our projects are going to fail, no matter what we do.

This is a hard thing for software people to accept. We’ve tended to view software more as magic than engineering. Look at all the “wizards,” “gurus,” and “demons” in our systems and organizations. Our customers encourage this way of thinking, because for many of them software is the technology of last resort. My colleague Phil Fuhrer observes that “software is what you put in that makes it work.” When we don’t understand its limits, it’s easy for us to believe that software can do anything, and to start pretending the software is in fact intelligent and able to deal with situations its designers didn’t anticipate [22]. The limits are still there, and part of engineering is knowing what they are. You can’t have a supersonic propeller-driven airplane, no matter how badly you want one.

Consultant and author Jerry Weinberg says that a crisis is really just the end of an illusion. [23] It could be that our “software crisis” is really just the end of the illusion that software is magic. This brings us to our final lesson: software engineering is engineering first, software second. Engineering isn’t magic; it accepts and lives within its limits.

BAD NEWS, GOOD NEWS

In the end, maybe software isn’t all that special after all. Software systems and projects succeed or fail for pretty much the same reasons that buildings, airplanes, ships and other products of human ingenuity do. They deal well or poorly with the issues inherent in being projects, with the complexity and coupling inherent in the task they’re performing, with the constraints of economics, and with the inherent limitations of the technology. This is both bad news and good news. The bad news is that silver bullets, fairy godmothers, or object-oriented miracles aren’t coming to suddenly make every project a success and every product perfect. The good news is that we can learn from the mistakes of other disciplines, and that as our discipline evolves, we’ll learn the limits of our technology and how to live with them.

References

1. New York Times, May 10, 1994

2. George Watson, “Faults & Failures,” IEEE Spectrum, May 1992

3. Chicago Tribune, December 12, 1997, section 3, page 1: “SEC revises year 2000 guidelines to keep investors informed.” It’s mostly about the effect of this bug on the value of companies.

4. W. Wayt Gibbs, “Software’s Chronic Crisis,” Scientific American, September 1994

5. Nigel Hawkes, Structures, Macmillan 1990, page 206

6. Structures, page 170

7. Victor Bignell and Joyce Fortune, Understanding Systems Failures, Manchester University Press, 1984, Chapter 3

8. Robert Gilbreath, Winning at Project Management, Wiley, 1986, Chapter 2.

9. Charles Perrow, Normal Accidents: Living with High-Risk Technologies (New York: Basic Books, 1984).

10. Fred Brooks, “No Silver Bullet: Essence and Accidents of Software Engineering,” IEEE Computer 20.4 (1987): 10-19.

11. Peter G. Neumann, “Inside Risks,” Communications of the ACM, July, 1992

12. Understanding Systems Failures, Chapter 5

13. Henry Petroski, Design Paradigms, Cambridge University Press, 1994, page 114

14. Edward Tenner, Why Things Bite Back, Knopf, 1996

15. David J. Brown, Bridges, page 90

16. This came from the PBS “Nova” show about the first supersonic flight; alas, I don’t know the date of the show.

17. “Software’s Chronic Crisis,” page 89

18. Chapuis and Joel, 100 Years of Telephone Switching, North-Holland

19. Fred Brooks, The Mythical Man-Month, Addison-Wesley, 1975. Actually, he’s quoting the “chief programmer team” idea put forth by Harlan Mills. But I bet a lot more people have read this proposal from Brooks.

20. Henry Petroski, Engineers of Dreams

21. Understanding Systems Failures, Chapter 5

22. Peter G. Neumann, “Inside Risks,” Communications of the ACM, October, 1992

23. Gerald M. Weinberg, The Secrets of Consulting (New York: Dorset House Publishing, 1986).