Preferred Citation: Ames, Karyn R., and Alan Brenner, editors. Frontiers of Supercomputing II: A National Reassessment. Berkeley: University of California Press, c1994. http://ark.cdlib.org/ark:/13030/ft0f59n73z/


 
Computing for Correctness


Peter Weinberger

Peter J. Weinberger is Director of the Software and Systems Research Center at AT&T Bell Laboratories. He has a Ph.D. in mathematics from the University of California at Berkeley. After teaching mathematics at the University of Michigan, he moved to Bell Labs.

He has done research in various aspects of system software, including operating systems, network file systems, compilers, performance, and databases. In addition to publishing a number of technical articles, he has coauthored the book The AWK Programming Language.

At Computer Science Research at Bell Laboratories, a part of AT&T, I have a legitimate connection with supercomputers that I'm not going to talk about, and Bell Labs has a long-standing connection with supercomputers. Instead of going over a long list of things we've done for supercomputer users, let me advertise a new service. We have this swell Fortran-to-C converter that will take your old dusty decks—or indeed, your modern dusty decks, your fancy Fortran 77—and convert from an obsolete language to perhaps an obsolescent language named ANSI C. You can even do it by electronic mail.

I'm speaking here about what I feel is a badly underrepresented group at this meeting, namely, the supercomputer nonuser. Let me explain how I picked the topic I'm going to talk about, which is, of course, not quite the topic of this session.



When you walk through the Pentagon and ask personnel what their major problem is, they say it's that their projects are late and too expensive. And if you visit industry and talk to vice presidents and ask them what their major problem is, it's the same: their stuff is late and too expensive. If you ask them why, nine out of ten times in this informal survey they will say, "software." So there's a problem with software. I don't know whether it's writing it fast or getting it correct or both. My guess is that it's both.

So I'm going to talk about a kind of computation that's becoming—to be parochial about it—more important at AT&T, but it's intended to be an example of using some semiformal methods to get your programs out faster and make them better.

The kind of calculation that I'm going to talk about is what's described as protocol analysis. Because that's perhaps not the most obvious supercomputer-ready problem, I'll describe it in a little bit of detail.

Protocol analysis of this sort has three properties. First, it's a combinatorially unsatisfactory problem, suffering from exponential explosion. Second, it doesn't use any floating-point operations at all, and so I would be delighted at the advent of very powerful zero-FLOPS machines (i.e., machines running at zero floating-point operations per second), just supposing they are sufficiently cheap, with perhaps one slow floating-point operation thrown in so I can tell how fast my program is running.

Because these problems are so big, speed is important. But what's most important is the third property, memory. Because these problems are so big, we can't do them exactly; but because they are so important, we want to do them approximately—and the more memory, the better the approximation. So what I really want is big, cheap memory. Fast is important, but cheap is better.

So what are protocols? They are important in several areas. First, there's intercomputer communications, where computers talk through networks.

They are also important, or can be made important, in understanding the insides of multiprocessors. This is a problem that was solved 20 years ago, or maybe it was only 10, but it was before the Bell System was broken up. Thus, it was during the golden age instead of the modern benighted times. We each have our own interests, right? And my interest is in having large amounts of research money at AT&T.

The problem is that there are bunches of processors working simultaneously sharing a common memory, and you want to keep them from stomping all over each other. You can model the way to do that using protocol analysis. As I said, this is a dead subject and well understood. Now let me try to tell this story without identifying the culprit.

Within the last five years, someone at Bell Labs bought a commercial multiprocessor, and this commercial multiprocessor had the unattractive feature of, every several days or so, dying. Careful study of the source, obtained with difficulty, revealed that the multiprocessor coordination was bogus. In fact, by running the protocol simulation stuff, you could see that not only was the code bogus, but the first proposed fix was bogus, too—although quite a lot less bogus in that you couldn't break it with only two processors. So that's actual protocols in multiprocessing.

There's another place where protocols are important, and that's in the phone system, which is increasingly important. Here's an intuitive idea why some sort of help with programs might be important in the phone system. You have just gone out and bought a fabulous new PBX or some other sort of electronic switchboard. It's got features: conferencing, call-waiting, all sorts of stuff. And here I am setting up a conference call, and just as I'm about to add you to the conference call, you are calling me, and we hit the last buttons at the same time. Now you're waiting for me and I'm waiting for you, and the question is, what state is the wretched switch in? Well, it doesn't much matter unless you have to reboot it to get it back.

What's interesting is that the international standards body, the International Telegraph and Telephone Consultative Committee (CCITT), produces interoperability specifications for the telephone network in terms of protocol specifications. It's not exactly intuitive what the specification means, or whether it's correct. In fact, there are two languages: one is a programming language, and the other is a graphical language, presumably for managers. You are supposed to use this language to do successive refinements to get a more and more detailed description of how your switch behaves. This example is in the CCITT standard language SDL (Specification and Description Language).

The trouble with international standards is that they are essentially a political process, and so you get the CCITT International Signaling System Seven Standard, and then you go around, if you happen to build switches for a living, and tell the customers that you're going to implement it. They say, "No, no, no, that's not what we use; we use CCITT Seven Standard except for all these different things."

So, what's a protocol, how are protocols modeled, and what's the point of modeling them? Take, for example, a typical computer-science pair of people. We have A, who is named Alice, and B, who is named Bob. Alice has a sequence of messages to send to Bob. The problem isn't with Alice, Bob, or even the operating system. The problem is with the communications channel, which could have the usual bad properties—it loses messages. The whole system is correct if eventually Bob gets all the messages that Alice sends, in exactly the sequence Alice sends them.

Let me point out why this is not a trivial problem for those of you who don't believe it. The easiest thing is that Alice sends a message, and when Bob gets it, he sends an acknowledgment, and then Alice sends the next message, and so on. The problem is that the channel can lose the first message, and after a while Alice will get tired of waiting and send it again. Or the channel could lose the return message instead, in which case Alice will send the first message twice, and you will have paid your mortgage twice this month.

Now I'm going to describe what's called a sliding window, selective retransmission protocol. With each message there's a sequence number, which runs from 0 to M-1, and there can only be M/2 outstanding unacknowledged messages at once. Now you model these by what are called finite-state machines. For those of you who aren't into finite-state machines, a finite-state machine is just like a finite Markov chain without any probabilities, where you can take all the paths you're allowed to take, and on each path you might change some state variables or emit a message to one of your buddies.
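
To make that concrete, here is a rough sketch in C of how the sender's half might be written down as a finite-state machine. It's an illustration only, not the actual analyzer; the constants and names (M, WINDOW, struct sender) are invented for the example. Each transition is a guarded step: it either fires and updates the state, or reports that it is disabled.

    /*
     * Sketch only: the sender ("Alice") of a sliding-window,
     * selective-retransmission protocol as a finite-state machine.
     * M, WINDOW, and the field names are invented for the example.
     */
    #define M       4          /* sequence numbers run 0 .. M-1           */
    #define WINDOW  (M / 2)    /* at most M/2 unacknowledged messages     */

    struct sender {
        int base;              /* oldest unacknowledged sequence number   */
        int next_seq;          /* sequence number of the next new message */
        int unacked[M];        /* unacked[s] != 0 if message s awaits ack */
    };

    /* Transition 1: send a new message if the window is not full.
     * Returns 1 if the transition was enabled and taken, 0 if disabled. */
    static int send_new(struct sender *a, int *seq_out)
    {
        int outstanding = (a->next_seq - a->base + M) % M;

        if (outstanding >= WINDOW)
            return 0;                          /* window full */
        *seq_out = a->next_seq;
        a->unacked[a->next_seq] = 1;
        a->next_seq = (a->next_seq + 1) % M;
        return 1;
    }

    /* Transition 2: consume an acknowledgment for sequence number s,
     * then slide the window past everything that is now acknowledged.   */
    static int recv_ack(struct sender *a, int s)
    {
        if (s < 0 || s >= M || !a->unacked[s])
            return 0;                          /* spurious ack */
        a->unacked[s] = 0;
        while (a->base != a->next_seq && !a->unacked[a->base])
            a->base = (a->base + 1) % M;
        return 1;
    }

    /* A third transition, resending any message s with unacked[s] still
     * set, models the selective retransmission and looks just the same. */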

The way we're going to prove this protocol correct is by modeling Alice by a finite-state machine and modeling Bob by a finite-state machine, and you can model each half of the protocol engine by a little finite-state machine. Then you just do all the transitions in the finite-state machines, and look at what happens. Now there's a sequence of three algorithms that have been used historically to study these problems.

I should point out that there's been a lot of talk here about natural parallelism, and the way you model natural parallelism in this example is by multiplying the state spaces together. You just treat the global state as the product of all the local states, and you get exponential complexity in no time at all. But that's the way it is in nature, too.
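
In code, "multiplying the state spaces together" amounts to nothing more than bundling the component states into one struct, as in the sketch below, which continues the previous one; the receiver and channel types are again invented for illustration.

    /*
     * Sketch of the product construction, continuing the example above.
     * The receiver and channel types are invented for illustration.
     */
    #define CHAN_MAX 4         /* bound on messages in flight (assumption) */

    struct receiver {
        int expected;          /* next sequence number Bob will deliver    */
        int held[M];           /* out-of-order messages buffered so far    */
    };

    struct channel {
        int n;                 /* how many messages are currently in flight */
        int msg[CHAN_MAX];     /* their sequence numbers                    */
    };

    /* The global state is nothing more than the tuple of component
     * states.  With k components of roughly n states each there can be
     * on the order of n^k global states: the exponential explosion.      */
    struct global_state {
        struct sender   alice;
        struct receiver bob;
        struct channel  data;  /* Alice -> Bob                             */
        struct channel  acks;  /* Bob -> Alice                             */
    };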

So you just build the transition graph for all these states, and you get this giant graph, and you can search around for all kinds of properties. In realistic examples you get numbers like 2^26 reachable states. These are not the biggest cases we want to do, they're just the biggest ones we can do. There are 2^30 edges in the graph. That's too big.

So you give up on that approach, and you look for properties that you can check on the basis of nothing other than the state you're in. And there are actually a lot of these properties.
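
Here is the flavor of such per-state checks, again using the invented types from the sketches above; enabled_transitions() stands in for the successor-enumeration routine a real analyzer would have.

    /*
     * Sketch of properties checkable from a single global state.
     * enabled_transitions() is assumed to exist elsewhere; it is the
     * routine that enumerates every enabled transition of every part.
     */
    int enabled_transitions(const struct global_state *g);    /* assumed */

    /* A safety assertion: the sender never has more than WINDOW
     * messages outstanding.  It is checked the moment a state is built. */
    static int check_invariant(const struct global_state *g)
    {
        int outstanding =
            (g->alice.next_seq - g->alice.base + M) % M;
        return outstanding <= WINDOW;
    }

    /* A deadlock: a state in which no component has an enabled
     * transition (a real analyzer would exempt designated end states).  */
    static int check_deadlock(const struct global_state *g)
    {
        return enabled_transitions(g) == 0;
    }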



Even in cases where we don't keep the whole graph, if each state is 64 or 128 bytes—which is not unreasonable in a big system—and if there are 2^26 reachable states, we're still losing because there is too much memory needed. So as you're computing along, when you get to a state that you have reached before, you don't have to continue. We also have to keep track of which states we've seen, and a good way to do that is a hash table—except that even in modest examples, we can't keep track of all the stuff about all the states we've seen.
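
Roughly, that visited-state bookkeeping looks like the sketch below; the hash function, the table size, and the fixed-size state copy are all assumptions made for the illustration, and the point to notice is that every distinct state costs a full copy in memory.

    /*
     * Sketch of reachability with an explicit table of visited states.
     * hash(), TABLE_SIZE, and the fixed-size state copy are assumptions.
     */
    #include <stdlib.h>
    #include <string.h>

    #define TABLE_SIZE (1u << 20)        /* number of hash buckets */

    struct seen {
        struct global_state s;
        struct seen        *next;        /* chain for hash collisions */
    };

    static struct seen *table[TABLE_SIZE];

    unsigned long hash(const struct global_state *g);    /* assumed */

    /* Returns 1 if g was seen before; otherwise records it and returns 0.
     * Every distinct state costs a full copy, which is the memory problem. */
    static int seen_before(const struct global_state *g)
    {
        unsigned long h = hash(g) % TABLE_SIZE;
        struct seen *p;

        for (p = table[h]; p != NULL; p = p->next)
            if (memcmp(&p->s, g, sizeof *g) == 0)
                return 1;                /* been here: prune the search */

        p = malloc(sizeof *p);
        if (p == NULL)
            abort();                     /* out of memory: the real limit */
        p->s = *g;
        p->next = table[h];
        table[h] = p;
        return 0;
    }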

The third step is that we use only one bit per state in the hash table. If the bit is already on, you say to yourself that you've already been here. You don't actually know that unless each state has its own hash code. All you know is that you've found a state that hashes to the same bit as some other state. That means there may be part of the state space you can't explore and problems you can't find. If you think two-dimensional calculations are good enough, why should this approximation bother you?
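
The one-bit version replaces that table with a plain bit array, as in this sketch; the 2^29-bit table matches the size mentioned below and fits in 64 megabytes, and a collision on a bit is exactly the part of the state space you can't explore.

    /*
     * Sketch of the one-bit-per-state ("bitstate") variant: the visited
     * set becomes a bit array, so a 2^29-bit table fits in 64 megabytes.
     * hash() and struct global_state are as in the earlier sketches.
     */
    #define BIT_TABLE_BITS 29
    #define BIT_TABLE_SIZE (1UL << BIT_TABLE_BITS)

    static unsigned char bitmap[BIT_TABLE_SIZE / 8];

    /* Returns 1 if the bit for g was already set, 0 after setting it.
     * Two different states can hash to the same bit; the second one is
     * then wrongly treated as already explored, which is the part of
     * the state space this approximation can miss.                      */
    static int seen_before_bitstate(const struct global_state *g)
    {
        unsigned long h = hash(g) & (BIT_TABLE_SIZE - 1);

        if (bitmap[h >> 3] & (1u << (h & 7)))
            return 1;
        bitmap[h >> 3] |= (unsigned char)(1u << (h & 7));
        return 0;
    }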

Now what do you want to get out of this analysis? What you want are little diagrams and messages saying that you've found a deadlock, that it's caused by this sequence of messages, and that you've blown it (again) in your protocol.

Now suppose I do the example I was doing before. With a two-bit sequence number and a channel that never loses anything, there are about 90,000 reachable states, and the longest sequence of actions before it repeats in this finite thing is about 3700 long. To see the exponential explosion, suppose you allow six sequence numbers instead of four. You get 700,000 reachable states, and the longest sequence is 15,000. With four sequence numbers and message loss, the state space is huge, but the number of reachable states is (only) 400,000. Now if the channel could duplicate messages too, then we're up to 6 million reachable states, and we're beginning to have trouble with our computer.

Now if you do the one-bit thing and hash into a 2^29-bit table, you get very good coverage. Not perfect, but very good, and it runs a lot faster than with the full state storage, and it doesn't take a gigabyte to store it all.

If you break the protocol by allowing three outstanding messages, then the analyzer finds that immediately. If you change the channel so it doesn't make mistakes, then the protocol is correct, and there are 287,000 reachable states. That's three times bigger than the two-message version we started with.

There is some potential that formal methods will someday make programs a little bit better. Almost all of these methods are badly exponential in behavior. So you'll never do the problems you really want to do, but we can do better and better with bigger and bigger memories and machines. Some of this can probably be parallelized, although it's not cycles but memory that is the bottleneck in this calculation.

So that's it from the world of commercial data processing.

