How Software Doesn't Work
Nine ways to make your code reliable
by: Alan Joch
The next time you board a plane, try not to think about this: Flight Simulator
running on your notebook may be more reliable than the software that keeps
planes from colliding in midair. That's because the FAA's air-traffic-control
system still uses software from the 1970s. It runs on a vacuum-tube IBM 9020E
mainframe that dates back a decade earlier. This system contributed to almost a
dozen failures at air-traffic-control centers in the past year, including
unnerving back-to-back breakdowns on July 23 and 24 in Chicago, the Santa
Monica Freeway of the skies.
For more than a decade, the FAA has been working to replace this antiquated
system. Sadly, the alternative, the Advanced Automation System with its
million-plus lines of code written since the early 1980s, is riddled with bugs.
And six years late. Computer scientists from two leading universities have had
to comb through it to see if any code is salvageable. Faced with software
that's too unreliable to trust in life-and-death situations, the FAA must
rely instead on its old and collapsing, but well-understood, air-traffic-control
system.
Unfortunately, this isn't the only example of unreliable software:
Item: In the summer of 1991, telephone outages occurred in local telephone
systems in California and along the Eastern seaboard. These breakdowns were all
the fault of an error in signaling software. Right before the outages, DSC
Communications (Plano, TX) introduced a bug when it changed three lines of code
in the several-million-line signaling program. After this tiny change, nobody
thought it necessary to retest the program.
Item: In 1986, two cancer patients at the East Texas Cancer Center in Tyler
received fatal radiation overdoses from the Therac-25, a computer-controlled
radiation-therapy machine. There were several errors, among them the failure of
the programmer to detect a race condition (i.e., miscoordination between
concurrent tasks).
Item: A New Jersey inmate escaped from computer-monitored house arrest in the
spring of 1992. He simply removed the rivets holding his electronic anklet
together and went off to commit a murder. A computer detected the tampering.
However, when it called a second computer to report the incident, the first
computer received a busy signal and never called back.
We've known for decades that software is too complex to develop without
adequate quality control. Books, conferences, and formal methods prescribe ways
of coping with the complexities of software development: Plan. Sweat over the
design specification. Isolate critical functions. Document the development
process. Comment your code. Test extensively, both the individual components
and the interworkings of the entire system. Independently validate the product.
Include backup systems. Eat your vegetables.
Why don't we do all this? Because it's expensive. Each line of the space
shuttle's flight-control software costs NASA contractor Loral about $1000, or
10 times more than for typical commercial software. Would you buy a word
processor or a spreadsheet for $5000, no matter how bug-free it was? Or would
you rather pay 90 percent less and live with the bugs?
Clearly, the commercial market has spoken. But users of business-critical
software demand that we carefully weigh the trade-offs between delivering a
program and assuring reliability. Software developers will always make
mistakes, but a slow, careful-and costly-development process can minimize them
(see the text boxes "How to Build Reliable Code" and "Five Easy Steps Toward
Disaster").
There are three important battles developers must fight. First, managers and
customers often find it extraordinarily difficult to specify how a proposed
program is supposed to perform. Second, commercial pressures and tight deadlines
practically guarantee chaos during the development process. Third, no program
is immune from featuritis. Even if the drive for high quality motivates
managers and programmers, an ever-expanding features list keeps program
specifications in flux and compounds the chances for introducing bugs.
Often, the first breakdown in quality control occurs before developers
write a single line of code. The New Jersey murder case illustrates how hard it
is to write a comprehensive specification. The computer that detected the
inmate's tampering correctly reported the action to the second computer. But no
one responsible for the software required that the call had to be redialed if
there was a busy signal.
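The missing requirement fits in a handful of lines. Here is a minimal sketch in C of the redial rule no one wrote into the specification; the function names are hypothetical, and a simulated phone line (busy twice, then answering) stands in for a real dialer:

```c
#include <stdbool.h>
#include <stdio.h>

#define MAX_RETRIES 5

/* Simulated line for this sketch: busy on the first two attempts,
 * then the call goes through.  A stand-in for the real dialer. */
static int calls_made = 0;
static bool dial_monitoring_center(const char *incident)
{
    (void)incident;
    return ++calls_made > 2;   /* busy, busy, then connected */
}

/* The rule the specification omitted: on a busy signal, redial
 * instead of silently dropping the report. */
bool report_incident(const char *incident)
{
    for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
        if (dial_monitoring_center(incident))
            return true;       /* report delivered */
        /* a real system would back off here before redialing */
    }
    fprintf(stderr, "ALERT: could not report '%s'\n", incident);
    return false;
}
```

A few lines of retry logic, plus an escalation path when retries run out, is all the specification needed to say.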
When commercial pressures produce crises that masquerade as projects, people
often cut corners by skimping on testing. This was the case with the telephone
example. DSC Communications chose not to retest because it wanted to give
customers a new feature right now.
One of the most celebrated cases of featuritis happened in 1993, when Silicon
Graphics, Inc. (SGI) released version 5.1 of Irix with over 500 serious bugs.
Management had pushed for a new OS, a new user interface (UI), better compilers
and tools, and new multimedia features-everything in version 5.1 was supposed
to be better. No sacrifices were to be made. But nine months before the
release, when morale was low and the bug count high, two senior engineers
pointed out the impossibility of the task. Management responded by hiring two
contractors who were strangers to SGI's software and organization.
"The desperate attempt to do everything caused programmers to cut corners, with
disastrous effects on the bug count," said Tom Davis, principal scientist, in
an internal company memo. It described the struggle and lamented the OS's
bloated code, sluggish performance, and unrealistic memory requirements.
SGI's schedule imposed a code freeze before code was stable, which resulted in
a familiar problem. "We're trying to wrap up the box before the stuff inside is
finished, and then trying to fix things inside the box without undoing the
wrapping," the memo said. Energies are diverted at a key moment, as everyone
looks for ways around the rule that says they can't change things anymore. But
sometimes they must make changes.
Either way leads to a familiar crisis: the meeting in which features are cast
out of the release. In the SGI case, the company exiled entire applications
wholesale, but it was too late to do much good. "We bit off more than we could
chew," Davis concluded. "As a company, we still don't understand how difficult
software development is."
Adding to SGI's problems, the memo leaked to the Internet. The response was
revealing: fan mail. Davis received scores of messages from similarly
beleaguered developers, and the software community as a whole tacitly owned up
to the problem. This reaction helped remind salespeople at SGI's competitors
that they were not immune to similar charges.
Despite the embarrassment, the memo may ultimately prove a boon to SGI because
the author spoke so passionately about quality. And support came from a key
corner. "The instant the [software] release hit the street, customers started
screaming.... It helped management read the memo with an open mind," says Davis.
SGI responded with a six-week software summit for all departments: engineering,
management, testing, marketing, manufacturing, documentation, and field
service. It invested in integrated measurement tools and ensured they'd be
documented and always available. The company identified process bottlenecks and
worked to eliminate them. Librarians came on board to ensure that project
documentation was up to date.
Managers received software-development books. SGI sought ways to integrate
quality assurance throughout the development process. The direct costs: tens of
thousands of dollars for new tools, hundreds of thousands of dollars yearly for
new staff, and inestimable millions of dollars in additional engineering time.
What SGI did not do in the aftermath is also interesting. It didn't mandate any
specific new type of development process. Instead, the groups chose the tools
and methods they thought best. The result: Version 5.2 fixed bugs, improved
performance, and added no new features. Version 5.3 added a few strategically
chosen features.
While SGI learned its lessons the hard way, other companies look to a variety
of techniques and tools to save them from bug-infested nightmares. Most
business software isn't as life-critical as the digital flight-control system
in Boeing's 777. Nevertheless, Boeing's development process illustrates how a
firm managerial hand can help every company fight featuritis.
About 400 people spent five years working on the Boeing 777's flight-control
software. Jim McWha, the Boeing Commercial Airplane Group's chief engineer for
flight-control systems, worked hard to make sure the 777 team got the
requirements right. To ensure that errors were caught early, when they're
cheaper to fix, the 777 team solicited input from all the key people in the
life of a jet-everyone from pilots to manufacturing personnel. They evaluated
the results of simulations for a year in the laboratory and another year in the
"iron bird," a full-scale mock-up of the airplane. Boeing's goal was to have a
complete specification before developers wrote any code.
McWha also resisted cancerous growth of the wish list once coding
started. To hold the line, Boeing set up review boards to evaluate every change
request; it refused about half of them, but credit undoubtedly goes to
McWha's air of authority. He's a no-nonsense guy; a large sign on his desk
reads: NO! (What part of this don't you understand?).
Most educational was the approach Boeing abandoned. The company contracted with GEC
Marconi Avionics to write three versions of the flight-control software, each
to execute in its own lane. Working from the same requirements, three groups
(who were not supposed to communicate with each other) coded in Ada, C, and PL/M.
The premise of this strategy, called n-version programming, is that if each lane
executes code written by different minds, errors in one lane will be outvoted by
the other two lanes. In practice, n-version programming is no magic bullet;
independently
written programs tend to have trouble in the same spots. The hard parts are
hard for everybody. Boeing eventually decided to refocus its resources on these
hard parts.
The three groups proceeded independently for about 18 months before the
approach became more pain than it was worth. The systems people had to
communicate continually with three software teams without influencing their
directions. Developers found it almost impossible to keep code in the three
lanes synchronized, leading to nuisance disconnects.
Finally, expertise became too valuable to squander-skilled people needed to be
working together, not separately. So members of the C and PL/M teams joined the
Ada team or took on testing or verification chores. The three lanes now use
dissimilar processors and different compilers, but one group produced the code.
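The voting idea behind those redundant lanes can be sketched in a few lines of C. This is a toy illustration of majority voting among independently computed outputs, not Boeing's actual mechanism; the fallback-to-primary rule is an assumption for the sketch.

```c
/* Majority voter over three redundant "lanes": use whichever value at
 * least two lanes agree on.  If all three disagree, fall back to the
 * designated primary lane (lane A here) -- an assumed policy, chosen
 * only to make the sketch complete. */
int vote3(int a, int b, int c)
{
    if (a == b || a == c)
        return a;   /* lane A agrees with at least one other lane */
    if (b == c)
        return b;   /* lanes B and C outvote lane A */
    return a;       /* no majority: trust the primary lane */
}
```

The catch the article describes is visible even here: if all three lanes make the same mistake on a hard input, the voter happily passes the wrong answer through.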
Some developers address reliability with the Capability Maturity Model (CMM)
from Carnegie Mellon University's Software Engineering Institute. The CMM rates
software-development processes on a five-level scale (for details on this and a
NASA quality model, see the text box "Make Quality Job 1"). Items that are
considered in the CMM range from how unambiguous specifications are to whether
a program's reliability receives independent verification. A level 1 rating
means the organization practices ad hoc chaos; level 5 identifies superlative
discipline from management and engineering.
"It's hard to argue with the CMM," says Roger Blais, manager of software
process improvement for Tasc, a government and private-sector systems
integrator in Reading, Massachusetts. The company has used the model for five
years and is in the process of being CMM-certified by a Software Engineering
Institute-accredited evaluator. Blais believes the CMM is valuable because it
puts importance on the process of software development. "Tools are so volatile,
and platforms are moving all the time," he says. "But if you have a process in
place, you have a glue that you can always count on."
But nothing is perfect. Space-shuttle software developers claim to be doing
everything recommended by the CMM. Even so, the program has experienced
numerous software problems, including errors on Discovery that made it
improperly position itself for a laser-beam experiment over a ground-based
observatory.
Additional help against development chaos comes from formal methods designed to
bring scientific principles to a largely creative process. Blais says formal
methodologies play key roles in helping Tasc make products ranging from
document management to avionics systems. The company builds most of its
applications for Windows and Unix using C, C++, or Visual Basic.
Customers sometimes specify that a formal methodology be used. Other times,
Tasc uses a variant of the spiral-development life-cycle methodology, a model
for iteratively combining pieces of a project as they evolve. Blais says spiral
development is valuable for its ability to provide a framework for each
project. The framework helps when requests surface for changes to the design.
Tasc also relies on Atria's ClearCase, a software-configuration management
tool, which Blais calls the core of Tasc's development efforts. It tracks
changes to code, records which programmers made the changes, and analyzes how
the changes impact other areas of the program. This information helps the
company manage releases and "keeps everyone honest," according to Blais.
Testing Is Everything
Other companies use quality assurance as the key tool for producing reliable
software. It's never too early to think about testing, according to Tom
Milkowski, a manager of software development at Dow Jones Telerate, a
financial-services company in Jersey City, New Jersey. "As you're writing code,
you should be creating tests. If you put an IF statement in the code, you
should make a note to test this call while it's fresh in your mind," he says.
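Milkowski's rule of thumb is concrete: every IF doubles the paths through the code, and each path needs its own test. A hypothetical sketch in C, in the spirit of a financial-fee calculation (the rule and numbers are invented for illustration):

```c
/* Hypothetical fee rule: trades of 10,000 shares or more get a
 * discounted per-share rate.  The IF creates two paths, so the
 * function needs at least two tests -- one per branch.  Fees are
 * in integer cents to keep the arithmetic exact. */
long commission_cents(long shares)
{
    if (shares >= 10000)
        return shares * 1;   /* bulk-discount path: 1 cent/share */
    return shares * 2;       /* standard path: 2 cents/share */
}
```

Writing the two cases while the branch is fresh, including the 9,999/10,000 boundary, is exactly the habit the quote describes.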
Milkowski helps manage 35 developers who are building a real-time system based
on HP-UX to deliver financial information to the company's clients over a
private WAN. In the past 18 months, the development staff has written about
800,000 lines of C and C++ code. At the system's rollout, slated for next
April, the program will consist of about a million lines of code.
When Telerate developers finish each component in the program, they're expected
to review their work for errors. Next, each module undergoes a code review,
where other developers evaluate the code. Milkowski wants subsequent code
reviews if any changes, even minor ones, are made. "It's the minor changes that
can come back to bite you," he says.
But the complexity of Telerate's financial-information system makes testing a
challenge. For example, Telerate designed one of the four servers to handle 120
or more concurrent clients at a transaction rate of 1000 per second. Some of
the servers in the system have 1 GB of RAM. Developers can write code that
accesses memory anywhere in that gigabyte of space. "Everything is potentially
so interconnected through RAM, there's almost a limitless opportunity for
problems," says Milkowski.
He relies on his "tool bag" to reduce these opportunities. Hewlett-Packard's
SoftBench and the Discover Development Information System, from Software
Emancipation Technology, analyze legacy code to build structure diagrams and
help the staff decide what code is reusable. If a module's call functions are
longer than a page or two, Telerate developers start to worry. The more complex
the code is, the greater the likelihood of a defect. "The tools help us focus
our attention on the appropriate modules during our code reviews," Milkowski says.
To find memory and resource leaks, the staff uses Purify, from Pure Software,
and Sentinel, from AIB Software. Also important are test-coverage analyzers,
which help make sure that the tests Telerate creates exercise all the code.
Iterative tests of each program component give useful feedback on the quality
of the code, but developers still won't know how well the entire system will
work under real-world pressures. When it's time to simulate heavy-load
conditions, Telerate uses client-loading tools such as Empower, from Performix,
and LoadRunner, from Mercury Interactive, to run multiple clients and processes
according to preset schedules.
Tools such as these make the testing of complex programs possible, but the
tools are not problem-free. It takes a lot of work to get them running right,
Milkowski concedes. Also, managers must budget for additional support costs in
the form of systems-administration staff and training for the people who use
the tools. "But in most organizations, it's easier to get money for tools than
for more programmers," he says.
Adobe Systems (Mountain View, CA) also uses testing as an early-warning system.
Marc Aronson, the director of Adobe's Software Productivity Group, established
a testing strategy that builds the interpreter every night and runs it on
several print engines that Adobe designed for testing. The system uses a subset
of the standard QA test suite, and it logs errors. Because the programming
environment tracks code changes made since the previous day, programmers know
where to look when a new problem crops up.
Although in-house testing is the first line of defense, a thorough beta-testing
program can be invaluable. Some programs, like Windows 95, attract enough
interest that there is no shortage of testers.
America Online also finds it easy to get volunteers. Mike Fairbarns coordinated
the beta-test process for the Macintosh version of the code. The company
categorizes each user response depending on whether it is a suggested
improvement or a bug catch. It further divides the bugs into types and sets
priorities to identify the ones to address most urgently.
The Cost of Complexity
No tool or methodology provides the perfect answer to creating great code in an
imperfect world. But legal and ethical considerations aside, making software
as reliable as possible from the beginning has become a mantra among
developers. It's good business. As Dow Jones Telerate's Milkowski points out,
if a bug costs a dollar to fix when it's discovered by a programmer during code
generation, it will cost $1000 if no one finds it until the program ships to
the end user.
Still unclear, however, is whether quality-from-the-start programming means
software will become more reliable or whether developers will merely keep from
falling backward in the face of ever-ensnarling complexity.
Oliver Sharp also contributed to this article.
Alan Joch is a BYTE senior editor.
You can reach him on the Internet or BIX.
How to Build Reliable Code
9 Ways to Write More-Reliable Software...
- Fight for a Stable Design
- Cleanly Divide Up Tasks
- Avoid Shortcuts
- Use Assertions Liberally
- Use Tools Judiciously
- Rely on Fewer Programmers
- Diligently Avoid Featuritus
- Use Formal Methods Where Appropriate
- Begin Testing Once You Write the First Line of Code
The first thing to understand: It is hard to build complex software that works
well. In the search for salvation, or what software engineer and author Fred
Brooks calls the silver bullet, many people look to models, techniques, and
tools. Once upon a time, the solutions were structured programming and
high-level languages; now, they're application builders, componentware, and
object-oriented programming (OOP) techniques. However, evangelists of all
these solutions ignore an uncomfortable truth: Reliable software can be written
using gotos and assembly language, and truly dismal code has been produced
using impeccably modern tools and techniques.
The reality is that one factor completely dominates every other in determining
software quality: how well the project is managed. The development team must
know what code it is supposed to build, must test the software constantly as it
evolves, and must be willing to sacrifice some development speed on the altar
of reliability. The leaders of the team need to establish a policy for how code
is built and tested. Tools are valuable because they make it easier to
implement a policy, but they can't define it. That is the job of the team
leaders, and if they fail to do it, no tool or technique will save them.
One reason that quality often takes a backseat is that it is not free. Reliable
software often has fewer features and takes longer to produce. No trick or
technique will eliminate the complexity of a modern application, but there are
a few ideas that can help.
Fight for a Stable Design
One of the worst obstacles to building a good system is a design that keeps
changing. Each change means redoing code that has already been written,
shifting plans in midstream, and corrupting the internal consistency of the
system.
The problem is that often nobody knows what the program should do until there
is a preliminary version to run. An excellent strategy is to build mock-ups and
prototypes that potential users can start working with early, so that the
design settles down as soon as possible. Once designers hammer out the basic
structure of the system, any changes that aren't critical should wait until the
next version. This is a hard line to hold, but the closer developers can come
to it, the better off the code will be.
Cleanly Divide Up Tasks
When designing a complex system, divide the work into smaller pieces that have
good interfaces and share the appropriate data structures. If you get that
right, you can make many bad implementation decisions without ruining the
overall design and performance of the system.
Object-oriented languages can be a useful way to express and enforce the
decomposition strategy, but they don't tell the designer how to do the job. It
is infinitely better to have a good design implemented in C than a poor one in
C++.
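In C, one common way to enforce such a boundary is an opaque type: clients see only function declarations, while the structure layout stays private to the implementation file. A minimal sketch, using a hypothetical queue module (both halves are shown in one listing here; in a real project they would live in queue.h and queue.c):

```c
#include <stdlib.h>

/* The public interface (queue.h): an opaque type and five functions.
 * Callers cannot touch the fields, because they cannot see them. */
typedef struct Queue Queue;

Queue *queue_create(void);
void   queue_push(Queue *q, int value);
int    queue_pop(Queue *q);            /* caller must not pop when empty */
int    queue_empty(const Queue *q);
void   queue_destroy(Queue *q);

/* The private side (queue.c): this layout can change -- say, to a
 * linked list -- without breaking a single line of client code. */
struct Queue {
    int items[256];   /* fixed capacity keeps the sketch short */
    int head, tail;
};

Queue *queue_create(void)          { return calloc(1, sizeof(Queue)); }
void   queue_push(Queue *q, int v) { q->items[q->tail++ % 256] = v; }
int    queue_pop(Queue *q)         { return q->items[q->head++ % 256]; }
int    queue_empty(const Queue *q) { return q->head == q->tail; }
void   queue_destroy(Queue *q)     { free(q); }
```

Many bad implementation decisions can hide behind that header without hurting callers, which is exactly the point of the advice above.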
Avoid Shortcuts
Programmers often don't take time to fix a design error as the code evolves.
Those decisions can come back to haunt everyone. Avoid shortcuts by insisting
that each one is carefully documented. The pain of writing something up can act
as a useful deterrent.
Use Assertions Liberally
An assertion is simply a line of code that says, "I think this is true. If it
isn't, something is wrong, so stop execution and let me know right away." If a
value is supposed to be within a certain range, check first. Make sure that
pointers point somewhere and that internal data structures are consistent.
Just like other debugging code, you can compile assertions out of production
code before it enters final testing stages. There is every reason to litter
your code with assertions. You will find problems quickly, making them much
easier to track down.
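In C, the standard assert macro does exactly this. A small sketch of the advice above, using a hypothetical routine; compiling with -DNDEBUG strips the checks from production builds:

```c
#include <assert.h>
#include <stddef.h>

/* A hypothetical routine that states its assumptions up front.
 * Each assert halts execution immediately if its condition is
 * false, pointing at the problem the moment it appears. */
double average(const double *values, size_t count)
{
    assert(values != NULL);   /* the pointer must point somewhere */
    assert(count > 0);        /* range check: no empty input */

    double sum = 0.0;
    for (size_t i = 0; i < count; i++)
        sum += values[i];
    return sum / (double)count;
}
```

Had the caller passed a null pointer or a zero count, the failure would surface at the assertion, not as a mysterious crash three modules away.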
Use Tools Judiciously
Tools are not a panacea - they can't help you fix a project that is being
administered badly. But tools can make it easier for development teams to put
good policies into effect. Source code management tools, such as the public
domain RCS or PVCS from Intersolv, help you coordinate modules being used by
several programmers at once.
There are also some tools that can find certain errors in your code instead of
forcing you to do it. The Unix utility lint (or the turbo-charged version
offered in CenterLine's CodeCenter) will find syntax errors and mismatches
between different source code files. Purify, from Pure Software, and
BoundsChecker, from NuMega Technologies, catch a wide variety of memory errors
when they occur, rather than when they manifest themselves later on. Other
tools perform regression tests or do code-coverage analysis to see if there are
dusty corners of your program that are not being exercised.
Rely on Fewer Programmers
An easy way to reduce the number of bugs in a project is to cut down on the
number of people who are involved in it. The advantages are less management
overhead, less need for coordination, and more contact among the team members
who are building the system.
You can reduce the number of people by having individual programmers produce
code more quickly or by reducing the amount of code that needs to be written.
CASE tools, applications builders, and code reuse are all attempts to meet one
or both of these goals. While these products don't always live up to their
promise, they can simplify a project so that a smaller team can handle it.
Oliver Sharp is the director of consulting services at Colusa Software
(Berkeley, CA). You can contact him on the Internet.
Five Easy Steps Toward Disaster
Although there are an unlimited number of ways you can foul up a programming
project, here are a few particularly popular ones:
Pile on the Features
The easiest way to ruin a program is to add a whole series of features
to it without enough time to integrate them properly. Under heavy time
pressure, the natural tendency is to glue the new functionality anywhere
you can, without thinking about how you're affecting the core design of the
program.
After you have done this several times, the resulting program becomes a
diffuse and unwieldy collection of modules, and nobody understands how
they interact. Making any further changes requires an act of faith.
Target Heterogeneous Environments
It is hard to support the kind of hardware and software variations that
are common in the PC industry. Because no organization can try every
possible system configuration, programs refuse to install, run poorly
or not at all, and interact unpredictably with other applications.
Here are two ways to make the problems worse.
First, take undocumented shortcuts that probably will not be supported
in future releases.
Second, don't bother to follow the standard interface guidelines of the
system. This ensures that users and other programmers are confused.
Delay Testing
Because formal proofs won't eliminate bugs anytime soon, careful testing
is the only way to be sure that a program works correctly.
Consequently, disaster aficionados should delay systematic product
testing until coding is almost finished. At that point, programmers
can't easily undo faulty design decisions, and it's hard to isolate the bugs
that do turn up.
Skip the Documentation
Most programmers don't like to write documentation. This is a real aid
to disaster because good notes on the basic internal systems design are
valuable when it's time to update. Reliability will result if
development team leaders make sure programmers write the documentation
and keep it up to date.
If the documents do go out of date, whatever you do, don't schedule
extra time to clean them up at the end of the project. If you're a
disaster seeker, you can take comfort in the fact that memories will
fade quickly when programmers move on to a new task.
When In Doubt, Vacillate
The team leaders should avoid clearly defined project specifications and
change specifications whenever pressure to do so strikes.