How Software Doesn't Work
Nine ways to make your code reliable
by: Alan Joch
The next time you board a plane, try not to think about this: Flight Simulator
running on your notebook may be more reliable than the software that keeps
planes from colliding in midair. That's because the FAA's air-traffic-control
system still uses software from the 1970s. It runs on a vacuum-tube IBM 9020E
mainframe that dates back a decade earlier. This system contributed to almost a
dozen failures at air-traffic-control centers in the past year, including
unnerving back-to-back breakdowns on July 23 and 24 in Chicago, the Santa
Monica Freeway of the skies.
For more than a decade, the FAA has been working to replace this antiquated
system. Sadly, the alternative, the Advanced Automation System with its
million-plus lines of code written since the early 1980s, is riddled with bugs.
And six years late. Computer scientists from two leading universities have had
to comb through it to see if any code is salvageable. Faced with software
that's too unreliable to trust in life-and-death situations, the FAA must
rely instead on its old and collapsing, but well-understood, air-traffic-control
system.
Unfortunately, this isn't the only example of unreliable software:
Item: In the summer of 1991, telephone outages occurred in local telephone
systems in California and along the Eastern seaboard. These breakdowns were all
the fault of an error in signaling software. Right before the outages, DSC
Communications (Plano, TX) introduced a bug when it changed three lines of code
in the several-million-line signaling program. After this tiny change, nobody
thought it necessary to retest the program.
Item: In 1986, two cancer patients at the East Texas Cancer Center in Tyler
received fatal radiation overdoses from the Therac-25, a computer-controlled
radiation-therapy machine. There were several errors, among them the failure of
the programmer to detect a race condition (i.e., miscoordination between
concurrent tasks).
Item: A New Jersey inmate escaped from computer-monitored house arrest in the
spring of 1992. He simply removed the rivets holding his electronic anklet
together and went off to commit a murder. A computer detected the tampering.
However, when it called a second computer to report the incident, the first
computer received a busy signal and never called back.
We've known for decades that software is too complex to develop without
adequate quality control. Books, conferences, and formal methods prescribe ways
of coping with the complexities of software development: Plan. Sweat over the
design specification. Isolate critical functions. Document the development
process. Comment your code. Test extensively, both the individual components
and the interworkings of the entire system. Independently validate the product.
Include backup systems. Eat your vegetables.
Why don't we do all this? Because it's expensive. Each line of the space
shuttle's flight-control software costs NASA contractor Loral about $1000, or
10 times more than for typical commercial software. Would you buy a word
processor or a spreadsheet for $5000, no matter how bug-free it was? Or would
you rather pay 90 percent less and live with the bugs?
Clearly, the commercial market has spoken. But users of business-critical
software demand that we carefully weigh the trade-offs between delivering a
program and assuring reliability. Software developers will always make
mistakes, but a slow, careful-and costly-development process can minimize them
(see the text boxes "How to Build Reliable Code" and "Five Easy Steps Toward
Disaster").
There are three important battles developers must fight. First, managers and
customers often find it extraordinarily difficult to specify how a proposed
program is supposed to perform. Second, commercial pressures and tight deadlines
practically guarantee chaos during the development process. Third, no program
is immune from featuritis. Even if the drive for high quality motivates
managers and programmers, an ever-expanding features list keeps program
specifications in flux and compounds the chances for introducing bugs.
Often, the first breakdown in quality control occurs before developers
write a single line of code. The New Jersey murder case illustrates how hard it
is to write a comprehensive specification. The computer that detected the
inmate's tampering correctly reported the action to the second computer. But no
one responsible for the software required that the call had to be redialed if
there was a busy signal.
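The missing requirement fits in a handful of lines. Here is a minimal sketch in C of the redial rule no one wrote into the specification; the function names are hypothetical, and a simulated phone line (busy twice, then answering) stands in for a real dialer:

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_RETRIES 5

/* Simulated line for this sketch: busy on the first two attempts,
 * then the call goes through.  A stand-in for the real dialer. */
static int calls_made = 0;
static bool dial_monitoring_center(const char *incident)
{
    (void)incident;
    return ++calls_made > 2;   /* busy, busy, then connected */
}

/* The rule the specification omitted: on a busy signal, redial
 * instead of silently dropping the report. */
bool report_incident(const char *incident)
{
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        if (dial_monitoring_center(incident))
            return true;       /* report delivered */
        /* a real system would back off here before redialing */
    }
    fprintf(stderr, "ALERT: could not report '%s'\n", incident);
    return false;
}
```

A few lines of retry logic, plus an escalation path when retries run out, is all the specification needed to say.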
When commercial pressures produce crises that masquerade as projects, people
often cut corners by skimping on testing. This was the case with the telephone
example. DSC Communications chose not to retest because it wanted to give
customers a new feature right now.
One of the most celebrated cases of featuritis happened in 1993, when Silicon
Graphics, Inc. (SGI) released version 5.1 of Irix with over 500 serious bugs.
Management had pushed for a new OS, a new user interface (UI), better compilers
and tools, and new multimedia features-everything in version 5.1 was supposed
to be better. No sacrifices were to be made. But nine months before the
release, when morale was low and the bug count high, two senior engineers
pointed out the impossibility of the task. Management responded by hiring two
contractors who were strangers to SGI's software and organization.
"The desperate attempt to do everything caused programmers to cut corners, with
disastrous effects on the bug count," said Tom Davis, principal scientist, in
an internal company memo. It described the struggle and lamented the OS's
bloated code, sluggish performance, and unrealistic memory requirements.
SGI's schedule imposed a code freeze before code was stable, which resulted in
a familiar problem. "We're trying to wrap up the box before the stuff inside is
finished, and then trying to fix things inside the box without undoing the
wrapping," the memo said. Energies are diverted at a key moment, as everyone
looks for ways around the rule that says they can't change things anymore. But
sometimes they must make changes.
Either way leads to a familiar crisis: the meeting in which features are cast
out of the release. In the SGI case, the company exiled entire applications
wholesale, but it was too late to do much good. "We bit off more than we could
chew," Davis concluded. "As a company, we still don't understand how difficult
software development is."
Adding to SGI's problems, the memo leaked to the Internet. The response was
revealing: fan mail. Davis received scores of messages from similarly
beleaguered developers, and the software community as a whole tacitly owned up
to the problem. This reaction helped remind salespeople at SGI's competitors
that they were not immune to similar charges.
Despite the embarrassment, the memo may ultimately prove a boon to SGI because
the author spoke so passionately about quality. And support came from a key
corner. "The instant the [software] release hit the street, customers started
screaming.... It helped management read the memo with an open mind," says Davis.
SGI responded with a six-week software summit for all departments: engineering,
management, testing, marketing, manufacturing, documentation, and field
service. It invested in integrated measurement tools and ensured they'd be
documented and always available. The company identified process bottlenecks and
worked to eliminate them. Librarians came on board to ensure that project
documentation was up to date.
Managers received software-development books. SGI sought ways to integrate
quality assurance throughout the development process. The direct costs: tens of
thousands of dollars for new tools, hundreds of thousands of dollars yearly for
new staff, and inestimable millions of dollars in additional engineering time.
What SGI did not do in the aftermath is also interesting. It didn't mandate any
specific new type of development process. Instead, the groups chose the tools
and methods they thought best. The result: Version 5.2 fixed bugs, improved
performance, and added no new features. Version 5.3 added a few strategically
chosen features.
While SGI learned its lessons the hard way, other companies look to a variety
of techniques and tools to save them from bug-infested nightmares. Most
business software isn't as life-critical as the digital flight-control system
in Boeing's 777. Nevertheless, Boeing's development process illustrates how a
firm managerial hand can help every company fight featuritis.
About 400 people spent five years working on the Boeing 777's flight-control
software. Jim McWha, the Boeing Commercial Airplane Group's chief engineer for
flight-control systems, worked hard to make sure the 777 team got the
requirements right. To ensure that errors were caught early, when they're
cheaper to fix, the 777 team solicited input from all the key people in the
life of a jet-everyone from pilots to manufacturing personnel. They evaluated
the results of simulations for a year in the laboratory and another year in the
"iron bird," a full-scale mock-up of the airplane. Boeing's goal was to have a
complete specification before developers wrote any code.
McWha also resisted cancerous growth of the wish list once coding
started. To hold the line, Boeing set up review boards to evaluate every change
request; it refused about half of them, but credit undoubtedly goes to
McWha's air of authority. He's a no-nonsense guy; a large sign on his desk
reads: NO! (What part of this don't you understand?).
Most educational was the approach Boeing abandoned. The company contracted with GEC
Marconi Avionics to write three versions of the flight-control software, each
to execute in its own lane. Working from the same requirements, three groups
(who were not supposed to communicate with each other) coded in Ada, C, and PL/M.
The premise of this strategy, called n-version programming, is that if each lane
executes code written by different minds, errors in one lane will be outvoted by
the other two lanes. In practice, n-version programming is no magic bullet;
independently
written programs tend to have trouble in the same spots. The hard parts are
hard for everybody. Boeing eventually decided to refocus its resources on these
hard parts.
The three groups proceeded independently for about 18 months before the
approach became more pain than it was worth. The systems people had to
communicate continually with three software teams without influencing their
directions. Developers found it almost impossible to keep code in the three
lanes synchronized, leading to nuisance disconnects.
Finally, expertise became too valuable to squander-skilled people needed to be
working together, not separately. So members of the C and PL/M teams joined the
Ada team or took on testing or verification chores. The three lanes now use
dissimilar processors and different compilers, but one group produced the code.
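The voting idea behind those redundant lanes can be sketched in a few lines of C. This is a toy illustration of majority voting among independently computed outputs, not Boeing's actual mechanism; the fallback-to-primary rule is an assumption for the sketch.

```c
/* Majority voter over three redundant "lanes": use whichever value at
 * least two lanes agree on.  If all three disagree, fall back to the
 * designated primary lane (lane A here) -- an assumed policy, chosen
 * only to make the sketch complete. */
int vote3(int a, int b, int c)
{
    if (a == b || a == c)
        return a;   /* lane A agrees with at least one other lane */
    if (b == c)
        return b;   /* lanes B and C outvote lane A */
    return a;       /* no majority: trust the primary lane */
}
```

The catch the article describes is visible even here: if all three lanes make the same mistake on a hard input, the voter happily passes the wrong answer through.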
Some developers address reliability with the Capability Maturity Model (CMM)
from Carnegie Mellon University's Software Engineering Institute. The CMM rates
software-development processes on a five-level scale (for details on this and a
NASA quality model, see the text box "Make Quality Job 1"). Items that are
considered in the CMM range from how unambiguous specifications are to whether
a program's reliability receives independent verification. A level 1 rating
means the organization practices ad hoc chaos; level 5 identifies superlative
discipline from management and engineering.
"It's hard to argue with the CMM," says Roger Blais, manager of software
process improvement for Tasc, a government and private-sector systems
integrator in Reading, Massachusetts. The company has used the model for five
years and is in the process of being CMM-certified by a Software Engineering
Institute-accredited evaluator. Blais believes the CMM is valuable because it
puts importance on the process of software development. "Tools are so volatile,
and platforms are moving all the time," he says. "But if you have a process in
place, you have a glue that you can always count on."
But nothing is perfect. Space-shuttle software developers claim to be doing
everything recommended by the CMM. Even so, the program has experienced
numerous software problems, including errors on Discovery that made it
improperly position itself for a laser-beam experiment over a ground-based
observatory.
Additional help against development chaos comes from formal methods designed to
bring scientific principles to a largely creative process. Blais says formal
methodologies play key roles in helping Tasc make products ranging from
document management to avionics systems. The company builds most of its
applications for Windows and Unix using C, C++, or Visual Basic.
Customers sometimes specify that a formal methodology be used. Other times,
Tasc uses a variant of the spiral-development life-cycle methodology, a model
for iteratively combining pieces of a project as they evolve. Blais says spiral
development is valuable for its ability to provide a framework for each
project. The framework helps when requests surface for changes to the design.
Tasc also relies on Atria's ClearCase, a software-configuration management
tool, which Blais calls the core of Tasc's development efforts. It tracks
changes to code, records which programmers made the changes, and analyzes how
the changes impact other areas of the program. This information helps the
company manage releases and "keeps everyone honest," according to Blais.
Testing Is Everything
Other companies use quality assurance as the key tool for producing reliable
software. It's never too early to think about testing, according to Tom
Milkowski, a manager of software development at Dow Jones Telerate, a
financial-services company in Jersey City, New Jersey. "As you're writing code,
you should be creating tests. If you put an IF statement in the code, you
should make a note to test this call while it's fresh in your mind," he says.
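Milkowski's rule of thumb is concrete: every IF doubles the paths through the code, and each path needs its own test. A hypothetical sketch in C, in the spirit of a financial-fee calculation (the rule and numbers are invented for illustration):

```c
/* Hypothetical fee rule: trades of 10,000 shares or more get a
 * discounted per-share rate.  The IF creates two paths, so the
 * function needs at least two tests -- one per branch.  Fees are
 * in integer cents to keep the arithmetic exact. */
long commission_cents(long shares)
{
    if (shares >= 10000)
        return shares * 1;   /* bulk-discount path: 1 cent/share */
    return shares * 2;       /* standard path: 2 cents/share */
}
```

Writing the two cases while the branch is fresh, including the 9,999/10,000 boundary, is exactly the habit the quote describes.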
Milkowski helps manage 35 developers who are building a real-time system based
on HP-UX to deliver financial information to the company's clients over a
private WAN. In the past 18 months, the development staff has written about
800,000 lines of C and C++ code. At the system's rollout, slated for next
April, the program will consist of about a million lines of code.
When Telerate developers finish each component in the program, they're expected
to review their work for errors. Next, each module undergoes a code review,
where other developers evaluate the code. Milkowski wants subsequent code
reviews if any changes, even minor ones, are made. "It's the minor changes that
can come back to bite you," he says.
But the complexity of Telerate's financial-information system makes testing a
challenge. For example, Telerate designed one of the four servers to handle 120
or more concurrent clients at a transaction rate of 1000 per second. Some of
the servers in the system have 1 GB of RAM. Developers can write code that
accesses memory anywhere in that gigabyte of space. "Everything is potentially
so interconnected through RAM, there's almost a limitless opportunity for
problems," says Milkowski.
He relies on his "tool bag" to reduce these opportunities. Hewlett-Packard's
SoftBench and the Discover Development Information System, from Software
Emancipation Technology, analyze legacy code to build structure diagrams and
help the staff decide what code is reusable. If a module's call functions are
longer than a page or two, Telerate developers start to worry. The more complex
the code is, the greater the likelihood of a defect. "The tools help us focus
our attention on the appropriate modules during our code reviews," Milkowski says.
To find memory and resource leaks, the staff uses Purify, from Pure Software,
and Sentinel, from AIB Software. Also important are test-coverage analyzers,
which help make sure that the tests Telerate creates exercise all the code.
Iterative tests of each program component give useful feedback on the quality
of the code, but developers still won't know how well the entire system will
work under real-world pressures. When it's time to simulate heavy-load
conditions, Telerate uses client-loading tools such as Empower, from Performix,
and LoadRunner, from Mercury Interactive, to run multiple clients and processes
according to preset schedules.
Tools such as these make the testing of complex programs possible, but the
tools are not problem-free. It takes a lot of work to get them running right,
Milkowski concedes. Also, managers must budget for additional support costs in
the form of systems-administration staff and training for the people who use
the tools. "But in most organizations, it's easier to get money for tools than
for more programmers," he says.
Adobe Systems (Mountain View, CA) also uses testing as an early-warning system.
Marc Aronson, the director of Adobe's Software Productivity Group, established
a testing strategy that builds the interpreter every night and runs it on
several print engines that Adobe designed for testing. The system uses a subset
of the standard QA test suite, and it logs errors. Because the programming
environment tracks code changes made since the previous day, programmers know
where to look when a new problem crops up.
Although in-house testing is the first line of defense, a thorough beta-testing
program can be invaluable. Some programs, like Windows 95, attract enough
interest that there is no shortage of testers.
America Online also finds it easy to get volunteers. Mike Fairbarns coordinated
the beta-test process for the Macintosh version of the code. The company
categorizes each user response depending on whether it is a suggested
improvement or a bug catch. It further divides the bugs into types and sets
priorities to identify the ones to address most urgently.
The Cost of Complexity
No tool or methodology provides the perfect answer to creating great code in an
imperfect world. But legal and ethical considerations aside, making software
as reliable as possible from the beginning has become a mantra among
developers. It's good business. As Dow Jones Telerate's Milkowski points out,
if a bug costs a dollar to fix when it's discovered by a programmer during code
generation, it will cost $1000 if no one finds it until the program ships to
the end user.
Still unclear, however, is whether quality-from-the-start programming means
software will become more reliable or whether developers will merely keep from
falling backward in the face of ever-ensnarling complexity.
Oliver Sharp also contributed to this article.
Alan Joch is a BYTE senior editor.
You can reach him on the Internet or BIX.
How to Build Reliable Code
9 Ways to Write More-Reliable Software...
- Fight for a Stable Design
- Cleanly Divide Up Tasks
- Avoid Shortcuts
- Use Assertions Liberally
- Use Tools Judiciously
- Rely on Fewer Programmers
- Diligently Avoid Featuritus
- Use Formal Methods Where Appropriate
- Begin Testing Once You Write the First Line of Code
The first thing to understand: It is hard to build complex software that works
well. In the search for salvation, or what software engineer and author Fred
Brooks calls the silver bullet, many people look to models, techniques, and
tools. Once upon a time, the solutions were structured programming and
high-level languages; now, they're application builders, componentware, and
object-oriented programming (OOP) techniques. However, evangelists of all
these solutions ignore an uncomfortable truth: Reliable software can be written
using gotos and assembly language, and truly dismal code has been produced
using impeccably modern tools and techniques.
The reality is that one factor completely dominates every other in determining
software quality: how well the project is managed. The development team must
know what code it is supposed to build, must test the software constantly as it
evolves, and must be willing to sacrifice some development speed on the altar
of reliability. The leaders of the team need to establish a policy for how code
is built and tested. Tools are valuable because they make it easier to
implement a policy, but they can't define it. That is the job of the team
leaders, and if they fail to do it, no tool or technique will save them.
One reason that quality often takes a backseat is that it is not free. Reliable
software often has fewer features and takes longer to produce. No trick or
technique will eliminate the complexity of a modern application, but there are
a few ideas that can help.
Fight for a Stable Design
One of the worst obstacles to building a good system is a design that keeps
changing. Each change means redoing code that has already been written,
shifting plans in midstream, and corrupting the internal consistency of the
system.
The problem is that often nobody knows what the program should do until there
is a preliminary version to run. An excellent strategy is to build mock-ups and
prototypes that potential users can start working with early, so that the
design settles down as soon as possible. Once designers hammer out the basic
structure of the system, any changes that aren't critical should wait until the
next version. This is a hard line to hold, but the closer developers can come
to it, the better off the code will be.
Cleanly Divide Up Tasks
When designing a complex system, divide the work into smaller pieces that have
good interfaces and share the appropriate data structures. If you get that
right, you can make many bad implementation decisions without ruining the
overall design and performance of the system.
Object-oriented languages can be a useful way to express and enforce the
decomposition strategy, but they don't tell the designer how to do the job. It
is infinitely better to have a good design implemented in C than a poor one in
C++.
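In C, one common way to enforce such a boundary is an opaque type: clients see only function declarations, while the structure layout stays private to the implementation file. A minimal sketch, using a hypothetical queue module (both halves are shown in one listing here; in a real project they would live in queue.h and queue.c):

```c
#include <stdlib.h>

/* The public interface (queue.h): an opaque type and five functions.
 * Callers cannot touch the fields, because they cannot see them. */
typedef struct Queue Queue;

Queue *queue_create(void);
void   queue_push(Queue *q, int value);
int    queue_pop(Queue *q);            /* caller must not pop when empty */
int    queue_empty(const Queue *q);
void   queue_destroy(Queue *q);

/* The private side (queue.c): this layout can change -- say, to a
 * linked list -- without breaking a single line of client code. */
struct Queue {
    int items[256];   /* fixed capacity keeps the sketch short */
    int head, tail;
};

Queue *queue_create(void)          { return calloc(1, sizeof(Queue)); }
void   queue_push(Queue *q, int v) { q->items[q->tail++ % 256] = v; }
int    queue_pop(Queue *q)         { return q->items[q->head++ % 256]; }
int    queue_empty(const Queue *q) { return q->head == q->tail; }
void   queue_destroy(Queue *q)     { free(q); }
```

Many bad implementation decisions can hide behind that header without hurting callers, which is exactly the point of the advice above.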
Avoid Shortcuts
Programmers often don't take time to fix a design error as the code evolves.
Those decisions can come back to haunt everyone. Avoid shortcuts by insisting
that each one is carefully documented. The pain of writing something up can act
as a useful deterrent.
Use Assertions Liberally
An assertion is simply a line of code that says, "I think this is true. If it
isn't, something is wrong, so stop execution and let me know right away." If a
value is supposed to be within a certain range, check first. Make sure that
pointers point somewhere and that internal data structures are consistent.
Just like other debugging code, you can compile assertions out of production
code before it enters final testing stages. There is every reason to litter
your code with assertions. You will find problems quickly, making them much
easier to track down.
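In C, the standard assert macro does exactly this. A small sketch of the advice above, using a hypothetical routine; compiling with -DNDEBUG strips the checks from production builds:

```c
#include <assert.h>
#include <stddef.h>

/* A hypothetical routine that states its assumptions up front.
 * Each assert halts execution immediately if its condition is
 * false, pointing at the problem the moment it appears. */
double average(const double *values, size_t count)
{
    assert(values != NULL);   /* the pointer must point somewhere */
    assert(count > 0);        /* range check: no empty input */

    double sum = 0.0;
    for (size_t i = 0; i < count; i++)
        sum += values[i];
    return sum / (double)count;
}
```

Had the caller passed a null pointer or a zero count, the failure would surface at the assertion, not as a mysterious crash three modules away.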
Use Tools Judiciously
Tools are not a panacea - they can't help you fix a project that is being
administered badly. But tools can make it easier for development teams to put
good policies into effect. Source code management tools, such as the public
domain RCS or PVCS from Intersolv, help you coordinate modules being used by
several programmers at once.
There are also some tools that can find certain errors in your code instead of
forcing you to do it. The Unix utility lint (or the turbo-charged version
offered in CenterLine's CodeCenter) will find syntax errors and mismatches
between different source code files. Purify, from Pure Software, and
BoundsChecker, from NuMega Technologies, catch a wide variety of memory errors
when they occur, rather than when they manifest themselves later on. Other
tools perform regression tests or do code-coverage analysis to see if there are
dusty corners of your program that are not being exercised.
Rely on Fewer Programmers
An easy way to reduce the number of bugs in a project is to cut down on the
number of people who are involved in it. The advantages are less management
overhead, less need for coordination, and more contact among the team members
who are building the system.
You can reduce the number of people by having individual programmers produce
code more quickly or by reducing the amount of code that needs to be written.
CASE tools, applications builders, and code reuse are all attempts to meet one
or both of these goals. While these products don't always live up to their
promise, they can simplify a project so that a smaller team can handle it.
Oliver Sharp is the director of consulting services at Colusa Software
(Berkeley, CA). You can contact him on the Internet.
Five Easy Steps Toward Disaster
Although there are an unlimited number of ways you can foul up a programming
project, here are a few particularly popular ones:
Pile on the Features
The easiest way to ruin a program is to add a whole series of features
to it without enough time to integrate them properly. Under heavy time
pressure, the natural tendency is to glue the new functionality anywhere
you can, without thinking about how you're affecting the core design of the
program.
After you have done this several times, the resulting program becomes a
diffuse and unwieldy collection of modules, and nobody understands how
they interact. Making any further changes requires an act of faith.
Target Heterogeneous Environments
It is hard to support the kind of hardware and software variations that
are common in the PC industry. Because no organization can try every
possible system configuration, programs refuse to install, run poorly
or not at all, and interact unpredictably with other applications.
Here are two ways to make the problems worse.
First, take undocumented shortcuts that probably will not be supported
in future releases.
Second, don't bother to follow the standard interface guidelines of the
system. This ensures that users and other programmers are confused.
Delay Testing
Because formal proofs won't eliminate bugs anytime soon, careful testing
is the only way to be sure that a program works correctly.
Consequently, disaster aficionados should delay systematic product
testing until coding is almost finished. At that point, programmers
can't easily undo faulty design decisions, and it's hard to isolate the bugs
that do turn up.
Skip the Documentation
Most programmers don't like to write documentation. This is a real aid
to disaster because good notes on the basic internal systems design are
valuable when it's time to update. Reliability will result if
development team leaders make sure programmers write the documentation
and keep it up to date.
If the documents do go out of date, whatever you do, don't schedule
extra time to clean them up at the end of the project. If you're a
disaster seeker, you can take comfort in the fact that memories will
fade quickly when programmers move on to a new task.
When In Doubt, Vacillate
The team leaders should avoid clearly defined project specifications and
change specifications whenever pressure to do so strikes.