Preferred Citation: Ames, Karyn R., and Alan Brenner, editors. Frontiers of Supercomputing II: A National Reassessment. Berkeley: University of California Press, c1994. http://ark.cdlib.org/ark:/13030/ft0f59n73z/


 
8—
THE FUTURE COMPUTING ENVIRONMENT

Panelists in this session speculated on the high-performance computing environment of the future. Discussions were about speed, memory, architectures, workstations, connectivity, distributed computing, the seamless hardware environment of networked heterogeneous computing systems, new models of computation, personalized portable interface software, and adaptive interface software, as well as audio-visual interfaces.

Session Chair

Bob Kahn,
Corporation for National Research Initiatives



Interactive Steering of Supercomputer Calculations

Henry Fuchs

Henry Fuchs is the Federico Gil Professor of Computer Science and Adjunct Professor of Radiation Oncology at the University of North Carolina at Chapel Hill. His current research interests are high-performance graphics hardware, three-dimensional medical imaging, and head-mounted display and virtual environments. Dr. Fuchs is one of the principal investigators on the VISTAnet program, which is one of five gigabit network testbed projects supported by NSF and the Defense Advanced Research Projects Agency. He has been an associate editor of the Association for Computing Machinery (ACM) Transactions on Graphics (1983–88) and has chaired many conferences, including ACM's SIGGRAPH '81 (technical program chair), the 1985 Chapel Hill Conference on Advanced Research in VLSI, the 1986 Chapel Hill Workshop on Interactive 3D Graphics, and the 1990 NATO Advanced Research Workshop on 3D Imaging in Medicine (with cochairs Karl Heinz Höhne and Stephen M. Pizer). He serves on various advisory committees, including NSF's Division of Microelectronic Information Processing Systems and the ShoGraphics Technical Advisory Board.

I will discuss the aspect of the future computing environment that has to do with interactive visualization. What I mean by interactive visualization is that you can control what is happening on the supercomputer and see the results, all in an interactive loop. For instance, your Macintosh could be connected to a CRAY Y-MP, and you could have interactive visualization.

I am going to tell you about one particular application that we are pursuing in the VISTAnet project and give you some idea of where we hope to make some progress. Perhaps we could generalize so that some of the lessons we learned might be applicable to other projects.

A lot of interactive visualization has to do with getting more graphics power and seeing more than just what is on the 19-inch CRT, so I am going to emphasize that aspect of it. The VISTAnet project is pursuing, as its first application, radiation therapy treatment planning. The only way to do that right is to do some applications that you cannot do now but that you might be able to do if you had a fast enough connection.

Let us say that the treatment involves a cancer patient with a tumor. Medical practitioners decide that the way to treat the tumor is by hitting it with sufficient radiation to kill it, but they hope that there will be sufficiently low radiation to the rest of the patient's body so that it will not kill the patient. This, then, becomes an interesting computer-aided design problem that does not always have a solution. Because of the complicated anatomical structures in the human body and the erratic manner in which many tumors grow, it is almost impossible to know if there is a particular set of places where you could put radiation beams so that you can kill the tumor and not kill the patient.

This is not just an optimization problem in which you get the best answer; even the best answer may not be good enough. We are talking about the kind of problem where the window of opportunity may be 10 per cent or so. That is, if you go 10 per cent over, you may kill the patient or have very serious consequences; if you go 10 per cent under, you may not cure the patient.

Now, of course, the standard thing to do is to hit the tumor with multiple beams and then hope that at the tumor region you get lethal doses and at other places you do not get lethal doses. This is how radiation treatment is done in two dimensions (Figure 1) everywhere in the world. But, of course, the body is three dimensional, and you could aim the beams from a three-dimensional standpoint. That would give you a whole lot of places where you could aim the beam and get better treatment plans.

The problem is that if you have all these degrees of freedom, you do not know exactly where to start. Thus, the standard thing that people do is to go to protocols in which they know that for a certain kind of tumor in a certain place, they will treat it with a certain kind of beam placement.


Figure 1. Two-dimensional treatment planning (courtesy of Rosenman & Chaney).

Then they look at these plots on different levels and make minor corrections when they see that there is some healthy tissue that should not get that much radiation. Because it takes a half hour to two hours on the highest-performance workstation to get a dose distribution, the typical way that this is done is that the physicist and the therapist talk about things, and then they do one particular plan and iterate a few times through over a couple of days until they are satisfied with the outcome. What we hope is, if you could do this iteration on a second-by-second basis for an hour or two hours, you could get dramatically better plans than you can with current systems.

Now I would like to discuss what kinds of visualizations people are dealing with in medical graphics. Through these graphics you could see the place where the tumor is. In digital surgery you can cut into the body, and you do have to cut into it to see what is going on inside. We hope this kind of cutting is also going to be done interactively. There are a number of different things that you have to see, all at the same time, and that you have to work with, all at the same time. When you move the beam, you have to see the new dose, and you have to compare that against the anatomy and against the tumor volume because certain kinds of tissue are more sensitive to radiation than others. A lot of patients are repeat patients, so you know that if you have treated the patient a year and a half before, certain regions are significantly more sensitive to repeat doses than they were before.

Figure 2 shows the role that VISTAnet plays in medical visualization. It links the CRAY Y-MP at the North Carolina Supercomputing Center, located at the Microelectronics Center for North Carolina (MCNC); Pixel-Planes 5 at the University of North Carolina (UNC), Chapel Hill; and the medical workstation, which will be at the UNC Medical School initially but which we hope to extend to Duke University and elsewhere. We work with the fastest workstations that we can get. When the patient is diagnosed, he or she gets scanned in the CT scanner and may also get other tests like magnetic resonance imaging. Then the patient can go home and return to the facility in a week. Treatment may go on for a month, perhaps twice a week. We hope at the end of six weeks that, when we do another scan, the tumor volume is reduced.

Figure 2. VISTAnet and medical networking.



The bottleneck right now in this type of treatment is the graphics because even the most powerful graphics machines cannot do those kinds of calculations and imaging at interactive rates. The problem is at the frame buffer. The way that the fastest machines currently operate is that they take the frame buffer, divide it into a large number of small frame buffers that are interleaved (the large number might be anywhere from 16 to 80), and then assign a different processing element to each one of those in an interleaved fashion so that, as you can see in Figure 3, processor A gets every fourth pixel on every fourth scan line. When a primitive comes down the pipeline, then most or all of the processors get to work at it. Figure 3 shows the kind of layout that you get when some of the video memory is assigned to each one of the processing elements and then combined together to form the video display.
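A minimal sketch of the interleaving idea just described, using a hypothetical 4-by-4 interleave (the actual machines assign anywhere from 16 to 80 processing elements; the mapping below is illustrative, not the layout of any particular product):

```python
# Illustrative sketch of an interleaved frame buffer: with a 4 x 4 interleave,
# processor A owns every fourth pixel on every fourth scan line, processor B
# the next pixel over, and so on.  The 4 x 4 factor is an assumption for the
# example; real machines of the era used anywhere from 16 to 80 processors.

def owner(x, y, interleave=4):
    """Return the id of the processing element that owns pixel (x, y)."""
    return (y % interleave) * interleave + (x % interleave)

# When a primitive covers a block of pixels, most or all processors get work:
covered = [(x, y) for y in range(8) for x in range(8)]   # an 8 x 8 primitive
busy = {owner(x, y) for x, y in covered}
print(len(busy))   # -> 16: every one of the 16 processors touches this primitive
```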

There is a limit to this. The limit comes, not surprisingly, when you start getting more and more processors and smaller and smaller amounts of video RAM, and when the memory bandwidth, like in all systems, finally gets to you (Figure 4).

Figure 3. Layout of processing elements that eventually combines to form a video display.

Figure 4. Interleaved image memory system.

Figure 5. Layout of pixel systems.

Figure 5 shows one of our systems that in many ways is simpler than a general-purpose one because lots of the graphics operations are totally local. That is, you do the same thing at every pixel, and you do not care what is done at the neighboring pixel.

At UNC we have been working on varieties of Pixel-Planes systems, and we are on the fifth generation. We build memory chips in which every pixel gets its own little processor. It turns out that if all you do is put a processor at every pixel, you cannot make the processor big enough to get anything meaningful done. So we factor out as much arithmetic as possible into a hardware linear- and quadratic-expression tree; in this manner we get linear and quadratic expressions essentially for free. It very fortuitously happens that almost all the rendering algorithms can be expressed as polynomials in screen space (Figure 6). Our systems basically consist of memory chips for frame buffers, and they do almost all the rendering with a pixel processor for every pixel and a global linear- and quadratic-expression evaluator. If you make these chips so that the addressing on the memory chip can change, then you can take a cluster of these memory chips and treat them essentially like virtual memory, so that you can assign them to different parts of the frame buffer at different times.
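A rough illustration, in ordinary software, of the screen-space polynomial idea: a triangle can be scan-converted by evaluating one linear expression per edge at every pixel. The brute-force double loop below is only meant to show the arithmetic; it is not the Pixel-Planes hardware algorithm, which evaluates such expressions for all pixels in parallel.

```python
# Sketch: a triangle can be scan-converted by evaluating three linear
# expressions E(x, y) = A*x + B*y + C, one per edge, at every pixel.  A pixel
# is inside the triangle when all three expressions have the same sign.
# Pixel-Planes does this evaluation for every pixel at once in hardware;
# here it is an ordinary double loop, purely to show the arithmetic.

def edge(p, q):
    """Coefficients (A, B, C) of the line through points p and q."""
    (px, py), (qx, qy) = p, q
    return (qy - py, px - qx, qx * py - px * qy)

def rasterize(tri, width=16, height=16):
    edges = [edge(tri[i], tri[(i + 1) % 3]) for i in range(3)]
    image = [['.'] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            vals = [a * x + b * y + c for a, b, c in edges]
            if all(v >= 0 for v in vals) or all(v <= 0 for v in vals):
                image[y][x] = '#'
    return '\n'.join(''.join(row) for row in image)

print(rasterize([(2, 2), (13, 4), (6, 12)]))
```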

Figure 6. Pixel-Planes 5 overview.

The ring network runs at 160 megahertz, with many boards that are general-purpose, Intel i860-based systems. Virtual pixel processors can be assigned to any place on the screen. In fact, if you want to do parametric-space calculations, they work just as well in parametric space as in x,y space. Whenever you are finished, you can do block moves from the renderers to the frame buffer. In fact, some of the algorithms that people are developing use these renderers for front-end machines and use the graphics processors for the back.

It turns out that the visualization problem is a lot more than just having a fast graphics engine. If you want to see all these things at the same time, well enough to be able to interact with them on a detailed basis, then you need a lot better perception capability than you can get with current workstation displays. A 19-inch vacuum tube is not adequate, given the complexity of cues that humans use to perceive three dimensions in the world. The solution would be to bring all human perception capabilities to bear on the problem, such as obscuration, stereopsis, kinetic depth effect, head-motion parallax, spatial memory, and so on. Our current graphics machines give us very few of these cues. The machines basically only give us obscuration. That is, we can see when something is in front and where there are other things that are in back, although we do not see the things that are in back.

It will take essentially all of our human perception capabilities to produce a sufficiently powerful visualizer to be able to work with complex, three-dimensional systems. I believe the only candidate in sight is Ivan Sutherland's pioneering work on head-mounted displays, which are currently in vogue. They are called virtual-reality systems. Basically, these are systems in which the display is on your head, your head is tracked, and you perceive some object in front of you. As you walk around, you see what is in front of you, and you can literally walk around it.

In the head-mounted display, you wear little TVs with a small tracking system to track head and hand movements. If you are looking around a room, and in the middle of the room you see molecular models, you can reach out with your three-dimensional cursor, grab the models, and move them around. The molecular-modeling work is a long-time project of Fred Brooks and is supported by the National Institutes of Health.

If you eventually want to have this three-dimensional constellation in front of you, with not simply obscuration but also stereopsis, head-motion parallax, and so on, there is a lot more work that needs to be done, not just in image generation but in good tracking of the head and hand. You need to have something in which you can have a wide field of view and a high-resolution display.



Several kinds of tracking might become possible with the three-dimensional technology, such as mechanical, ultrasonic, inertial, magnetic, and optical. For instance, in ultrasound examinations, images could be superimposed inside the patient, and as the transducer is moved about the patient, the data are remembered so that you sweep out a three-dimensional volume of data and actually see that data image (Figure 7). Then you could do a procedure in which you could see what you were doing rather than going in blindly.

Figure 7. Three-dimensional data imaging for medical applications.

Another application we can imagine is one in which you work on an engine and the head-mounted display overlay would have three-dimensional pointers rather than two-dimensional pointers, which would give you information about the path along which an item is to be removed (Figure 8). One could imagine further applications in reconnaissance, in which information is merged from a number of different sources, or in architectural previewing, that is, viewing a building in three dimensions, and perhaps making changes, before it is actually constructed (Figure 9).

Figure 8. Three-dimensional data imaging for engineering and mechanical applications.

In summary, the message I have for you is that you want to think about computing environments, not just from the standpoint of program development but also of program needs. Computing environments will be heterogeneous and must include, for many applications, a very strong visualization component. Interactive visualization needs a whole lot more power than it has right now to benefit from enhanced three-dimensional perception technology.



Figure 9. Three-dimensional imaging for construction applications.



A Vision of the Future at Sun Microsystems

Bill Joy

Bill Joy is well known as a founder of Sun Microsystems, Inc., a designer of the Network File System (NFS), a codesigner of Scalable Processor ARChitecture (SPARC), and a key contributor in Sun's creation of the open-systems movement. Before coming to Sun, Bill created the Berkeley version of the UNIX operating system, which became the standard for academic and scientific research in the late 1970s and early 1980s. At Berkeley, he founded the Berkeley Software Distribution (BSD), which first distributed applications software for the PDP-11 and, later, complete systems for the VAX. He is still known as the creator of the "vi" text editor, which he wrote more than 10 years ago.

In recent years, Bill has traveled widely to speak on the future of computer technology—hardware, software, and social impacts. In the early 1980s, Bill framed what has become known as "Joy's law," which states that the performance of personal microprocessor-based systems can be calculated as MIPS = 2^(year - 1984). This prediction, made in 1984, is still widely held to be the goal for future system designs.

About 15 years ago I was at the University of Michigan working on large sparse matrix codes. Our idea was to try to decompose and back-solve a 20,000-by-20,000 sparse matrix on an IBM 370, where the computer center's charging policy billed us for virtual memory. So we, in fact, did real I/O to avoid using virtual memory. We used these same codes on early supercomputers. I think that set for me, 15 years ago, an expectation of what a powerful computer was.

In 1975 I went to the University of California-Berkeley, where everyone was getting excited about Apple computers and the notion of one person using one computer. That was an incredibly great vision. I was fortunate to participate in putting UNIX on the Digital Equipment Corporation VAX, which went on to be a very popular machine, a very powerful machine, and one that also came to define the unit of performance for a lot of computing simply because it didn't get any faster. Although I was exposed to the kinds of things you could do with more powerful computers, I never believed that all I needed was a VAX to do all of my computing.

Around 1982, the hottest things in Silicon Valley were the games companies. Atari had a huge R&D budget to do things that all came to nothing. In any case, if they had been successful, then kids at home would have had far better computers than scientists would, and clearly, that would have been completely unacceptable.

As a result, several other designers and I wanted to try to get on the microprocessor curve, so we started talking about the performance of a desktop machine, expressed in millions of instructions per second (MIPS), that ought to equal the quantity 2 raised to the power of the current year minus 1984. Now in fact, the whole industry has signed up for that goal. It is not exactly important whether we are on the curve. Everyone believes we should be on the curve, and it is very hard to stay on the curve. So this causes a massive investment in science, not in computer games, which is the whole goal here.
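A small worked example of that rule of thumb; the table is just the formula evaluated at a few years, not measured data.

```python
# Joy's law as stated above: a desktop machine's performance in MIPS ought to
# equal 2 raised to the power of (current year - 1984).

def joys_law_mips(year):
    return 2 ** (year - 1984)

for year in (1984, 1987, 1991, 1995):
    print(year, joys_law_mips(year), "MIPS")
# expected values: 1984: 1, 1987: 8, 1991: 128, 1995: 2048
```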

The frustrating thing in 1983 was to talk to people who thought that was enough, although it clearly was not anywhere near enough. In fact, hundreds of thousands of bytes did not seem to me to be very much, because I could not load a data set for a large scientific problem in less than 100 or 1000 megabytes. Without that much memory, I had to do I/O. I had already experienced striping sparse matrices and paging them in and out by hand, and that was not very much fun.

I think we are on target now. Enough investments have been made in the world to really get us to what I would call a 300-MIPS machine in 1991 and, in 1995, a 3000-megaflops machine, i.e., a machine capable of 3000 million floating-point operations per second (FLOPS). Economics will affect the price, and different things may skew the schedule plus or minus one year, but it will not really make that much difference.

You will notice that I switched from saying MIPS to megaflops, and that is because with RISC architectures and superscalar implementations, you have the same number of MFLOPS as MIPS, if not more, in the next generation of all the RISC microprocessors. The big change in the next decade will be that we will not be tied to the desktop machine.

In the computer market now, I see an enormous installed base of software on single-CPU, single-threaded code on Macintoshes, UNIX, and DOS converging so that we can port the applications back and forth. This new class of machines will be shipped in volume with eight to 16 CPUs because that is how many I can get on a small card. In a few years, on a sheet-of-paper-size computer, I can get an eight- to 16-CPU machine with several hundred megabytes or a gigabyte of memory, which is a substantial computer, quite a bit faster than the early supercomputers I benchmarked.

That creates a real problem in that I don't think we have much interesting software to run on those computers. In fact, we have a very, very small number of people on the planet who have ever had access to those kinds of computers and who really know how to write software, and they've been in a very limited set of application domains. So the question is, how do we get new software? This is the big challenge.

In 1983, I should have bought as much Microsoft Corporation stock as I could when it went public because Microsoft understood the power of what you might call the software flywheel, which is basically, once you get to 100,000 units of a compatible machine a year, the thing starts going into positive feedback and goes crazy. The reason is, as soon as you have 100,000 units a year, software companies become possible because most interesting software companies are going to be small software companies clustered around some great idea. In addition, we have a continuing flow of new ideas, but you have got to have at least 10 people to cater to the market—five people in technical fields and five in business. They cost about $100,000 apiece per year, which means you need $1 million just to pay them, which means you need about $2 million of revenue.

People want to pay about a couple hundred dollars for software, net, which means you need to ship 10,000 copies, which means since you really can only expect about 10 per cent penetration, you have got to ship 100,000 units a year. You can vary the numbers, but it comes out to about that order of magnitude. So the only thing you can do, if you've got a kind of computer that's shipping less than 100,000 units a year, is to run university-, research-, or government-subsidized software. That implies, in the end, sort of centralized planning as opposed to distributed innovation, and it loses.
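The arithmetic of the last two paragraphs, written out with the same round numbers used in the talk:

```python
# The "software flywheel" arithmetic from the talk: a 10-person company at
# roughly $100,000 per person per year needs about $2 million in revenue; at
# a couple hundred dollars net per copy that is ~10,000 copies sold; at ~10%
# penetration of the installed platform, the platform must ship ~100,000
# units a year before such a company is viable.

people          = 10         # five technical, five business
cost_per_person = 100_000    # dollars per year
revenue_needed  = 2 * people * cost_per_person      # ~2x payroll -> $2,000,000
price_per_copy  = 200        # net dollars per copy
copies_needed   = revenue_needed // price_per_copy  # 10,000 copies
penetration     = 0.10       # fraction of platform owners who buy
platform_units  = copies_needed / penetration       # 100,000 units per year

print(f"copies needed:  {copies_needed:,}")
print(f"platform units: {platform_units:,.0f} per year")
```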

This is why the PC has been so successful. And this is, in some sense, the big constraint. It is the real thing that prevents a new architecture, a new kind of computing platform, from taking off, if you believe that innovation will occur. I think, especially in high technology, you would be a fool not to believe that new ideas, especially for software, will come around. No matter how many bright people you have, most of them don't work for you. In addition, they're on different continents, for instance, in eastern Europe. They're well educated. They haven't had any computers there. They have lots of time to develop algorithms like the elliptical sorts of algorithms. Because there are lots of bright people out there, they are going to develop new software. They can make small companies. If they can hit a platform—that is, 100,000 units a year—they can write a business model, and they can get someone to give them money.

There are only four computing platforms today in the industry that ship 100,000 units a year: DOS with Windows, Macs, UNIX on the 386, and UNIX on Scalable Processor ARChitecture (SPARC). That's it. What this tells you is that anybody who isn't on that list has got to find some way to keep new software being written for their platform. There is no mechanism to really go out and capture innovation as it occurs around the world. This includes all the supercomputers, because their volumes are way too low; they're off by orders of magnitude.

Some platforms can survive for a small amount of time by saying they're software-compatible with another one. For instance, I can have UNIX on a low-volume microprocessor, and I can port the apps from, say, SPARC or the 386 to it. But there's really no point in that because you do the economics, and you're better off putting an incremental dollar in the platform that's shipping in volume than taking on all the support costs of something that didn't make it into orbit. So this is why there's a race against time. For everyone to get their 100,000 units per year is like escaping the gravity field and not burning up on reentry.

Now, here's the goal for Sun Microsystems, Inc. We want to be the first company to ship 100,000 multiprocessors per year. This will clearly make an enormous difference because it will make it possible for people to write software that depends on having a multiprocessor to be effective. I can imagine hundreds or thousands of small software companies becoming possible.

Today we ship $5000 diskless, monochrome workstations and $10,000 standalone, color workstations; both of these are shipping at 100,000 a year. So I've got a really simple algorithm for shipping 100,000 multiprocessors a year: I simply make those workstations multiprocessors. And by sticking in one extra chip to have two instead of one and putting the software in, people can start taking advantage of it. As you stick in more and more chips, it just gets better and better. But without this sort of a technique, and without shipping 100,000 multis a year, I don't see how you're going to get the kind of interesting new software that you need; we may just have to keep using the same 15-year-old software because we don't have time to write any new software. Well, I don't share that belief in the past. I believe that bright new people with new languages will write new software.

The difficulty is, of course, you've got all these small companies. How are they going to get the software to the users? A 10-person company is not a Lotus or a Microsoft; they can't evangelize it as much. We have a problem in the computer industry in that the retail industry is dying. Basically, we don't have any inventory. The way you buy software these days, you call an 800 number, and you get it by the next morning. In fact, you can call until almost midnight, New York time, use your American Express card, and it will be at your door before you get up in the morning. The reason is that the people typically put the inventory at the crosspoint for, say, Federal Express, which is in Memphis, so that it only flies on one plane. They have one centralized inventory, and they cut their costs way down.

But I think there's even a cheaper way. In other words, when you want to have software, what if you already have it? This is the technique we're taking. We're giving all of our users compact-disk (CD) ROMs. If you're a small company and you write an application for a Sun, we'll put it on one of our monthly CD-ROMs for free for the first application that you do if you sign up for our software program, and we'll mail it to every installation of Sun.

So if you get a Sun magazine that has an ad for your software, you can pull a CD-ROM you already have off the shelf, boot up the demo copy of the software you like, dial an 800 number, and turn the software on with a password. Suppose there are 10 machines per site and a million potential users. That means I need 100,000 CDs, which cost about $3 apiece to manufacture. That's about $300,000. So if I put 100 applications on a CD, each company can ship its application to a million users for $3000. I could almost charge for the space in Creative Computer Application Magazine. The thing can fund itself because a lot of people will pay $10 for a disk that contains 100 applications that they can try, especially if it's segmented, like the magazine industry is segmented.
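The distribution arithmetic in this paragraph, written out with the figures quoted above:

```python
# CD-ROM distribution arithmetic from the talk: with ~10 machines per site and
# a million potential users, ~100,000 discs reach everyone; at ~$3 per disc
# that is ~$300,000 per pressing, and with 100 applications sharing a disc
# each company reaches the whole installed base for about $3,000.

users_total       = 1_000_000
machines_per_site = 10
discs_needed      = users_total // machines_per_site   # 100,000 discs
cost_per_disc     = 3                                   # dollars
pressing_cost     = discs_needed * cost_per_disc        # $300,000
apps_per_disc     = 100
cost_per_app      = pressing_cost // apps_per_disc      # $3,000

print(discs_needed, pressing_cost, cost_per_app)
```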

This is a whole new way of getting people software applications that really empowers small companies in a way that they haven't been empowered before. In fact, you can imagine if these applications were cheap enough, you could order them by dialing a 900 number where there wouldn't even have to be a human; the phone company would do the billing, and you'd just type in on your touch-tone phone the serial number of your machine, and it would read you the code back. In that case, I think you could probably support a one-person company—maybe a student in a dorm who simply pays $3000 to put a zap on the thing and arranges with some BBS-like company to do the accounting and the billing. These new ways of distributing software become possible once you spin up the flywheel, and I think they will all happen.

The workstation space, I said, I think will bifurcate. The machines that run the existing uniprocessor software should be shipping about a million units a year, at about 100 MIPS per machine, because that's not going to cost any more than zero MIPS; in fact, that's roughly what you get for free in that time frame. That's about $6 billion for the base hardware—maybe a $12 billion industry. I may be off by a factor of two here, but it's just a rough idea.

Then you're going to have a new space made possible by this new way of letting small software companies write software: machines with eight to 16 CPUs. That's what I can do with sort of a crossbar, or some sort of simple bus, that I can put in a sheet-of-paper-sized, single-board computer, shipping at least 100,000 a year, probably at an average price of $30,000, and doing most of the graphics in software. There would not be much special-purpose hardware, because that's going to depend on whether all those creative people figure out how to do all that stuff in software. And that's another, perhaps, $3 billion market.
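A quick check of the market-size figures in the last two paragraphs; the roughly $6,000 average base-hardware price is inferred from the $6 billion total rather than stated directly, so treat it as an assumption.

```python
# Rough market sizing as sketched in the talk.  Uniprocessor desktops:
# about a million units a year at roughly $6,000 of base hardware is a
# ~$6 billion hardware base (perhaps a $12 billion industry overall).
# The new multiprocessor class: at least 100,000 units a year at an
# average price of about $30,000 is another ~$3 billion market.

uniprocessor_units = 1_000_000
uniprocessor_price = 6_000          # assumed average base-hardware price
multi_units        = 100_000
multi_price        = 30_000

print(f"uniprocessor hardware: ${uniprocessor_units * uniprocessor_price / 1e9:.0f}B")
print(f"multiprocessor class:  ${multi_units * multi_price / 1e9:.0f}B")
```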

I think what you see, though, is that these machines have to run the same software that the small multis do because that's what makes the business model possible. If you try to do this machine without having this machine to draft, you simply won't get the applications, which is why some of the early superworkstation companies have had so much trouble. It's the same reason why NeXT will ultimately fail—they don't have enough volume.

So across this section of the industry, if I had my way, it looks like we're going to ship roughly 200 TFLOPS in 1995, with lots and lots of interesting new, small software applications. The difference is that we're going to ship those TFLOPS mostly as 100,000 machines of 1000 MIPS each instead of as a few TFLOPS machines. I just have a belief that that's going to make our future change, and that's going to be where most of the difference is made—in giving 100,000 machines of that scale to 100,000 different people, which will have more impact than having 100 TFLOPS on 100 computers.

The economics are all with us. This is free-market economics and doesn't require the government to help. It will happen as soon as we can spin up the software industry.



On the Future of the Centralized Computing Environment

Karl-Heinz A. Winkler

Karl-Heinz A. Winkler worked at the Max Planck Institute before first coming to Los Alamos National Laboratory. He next went to the University of Illinois and then returned to Los Alamos, where he is a program manager in the Computing and Communications Division. Dr. Winkler's main interest is in computational science, high-speed communication, and interactive graphics, as well as in coupling numerical experiments to laboratory experiments.

Bill Joy, of Sun Microsystems, Inc., states very forcefully in the preceding paper exactly why we have to change our way of doing things. Building upon what Bill said, I would first like to discuss structural changes. By way of background, I should mention that I have been a supercomputer user all of my adult life—for at least the last 20 years. During the past year, I worked closely with Larry Smarr (see Session 10) at the University of Illinois National Center for Supercomputing Applications (NCSA), so I have learned what it is like on the other side of the fence—and what an education that was!

I think we are going through extremely exciting times in the sense that there is really a revolution going on in the way we have to do our business. In the past, if you bought yet another Cray and a hundred PCs, you could depend on not getting fired, even if you were a computer center director. That game is over. For the first time in a long time, everybody is again allowed to make disastrous investment decisions.



Having been part of a supercomputing center and working now for the Computing and Communications Division at Los Alamos National Laboratory, I find myself compelled to ask what for me is a very interesting question: if Bill is right, is there any place left for a centralized computing facility? Of course, I ask this question because my job depends on the answer.

Certainly the days are gone where you have a centralized environment and terminals around it. In a way, through X Windows, more powerful machines, and the client-host model, some of that structure will survive. If you believe in the federal High Performance Computing Initiative (HPCI), then you expect that in a few years we will have gigabit-per-second research and education networks. And if you believe in the availability of these high-end workstations, then, of course, you envision the ultimate computer as consisting of all Bill Joy's workstations hooked together in a distributed computing environment. Obviously, that will never work globally because there are too many things going on simultaneously; you also have a lot of security concerns and a lot of scheduling problems.

Yet, in principle it is realistic to expect that the majority of the computer power in a large organization will not be in the centralized facility but in the distributed computing environment. This dramatic change that we have to react to is caused by technological advances, specifically in the microtechnology based on complementary metal-oxide semiconductors (CMOS), which exemplifies the smart thing to do these days: look at the forces that drive society, and bank on the technologies that address those needs rather than the needs of specialty niches. This latter point was hammered into me when we had the Supercomputer conference in Boston in 1988. At NCSA I certainly learned the value of Macintoshes and other workstations and how to use, for the first time, software I hadn't written myself.

If we look at the driving forces of society, we discover two areas we have to exploit. Take, for instance, this conference, where a relatively small number of people are convened. Consider, too, the investment base: even all the CONVEX and Cray machines combined amount to less than half the number of IBM 3090s that have been sold, and that is still fewer than 5000 machines worldwide.

There is a limit to what one can do. Referring specifically to the presentations at this conference on vector computing, faster cycle time, etc., if you want to achieve in a very few years the extreme in performance—like machines capable of 10^12 floating-point operations per second (TFLOPS)—it's the wrong way to go, I believe. We must keep in mind the driving forces of technology. For instance, I have a Ford truck, and it has a computer chip in it. (I know that because it failed.) That kind of product is where a majority of the processing technology really goes. So one smart way of going about building a super-supercomputer is to invent an architecture based on the technological advances in, e.g., chip design that are being made anyway. These advances are primarily in the microtechnology and not in the conventional technology based on emitter-coupled logic.

Going to a massively parallel distributed machine, whether based on SIMD or MIMD architecture, allows you to exploit the driving forces of technology. What you really do is buy yourself into a technology where, because of the architectural advantage over scalar and pipelined architectures, you get away with relatively slow processors, although we see there is a tremendous speedup coming because of miniaturization. Also, you can usually get away with the cheapest, crudest memory chips. This allows you to put a machine together that, from a price/performance point of view, is extremely competitive. If you have ever opened up a Connection Machine (a Thinking Machines Corporation product), you know what I mean. There's not much in there, but it's a very, very fast machine.

Another area where one can make a similar argument is in the mass-storage arena. Unfortunately, at this conference we have no session on mass storage. I think it's one of the most neglected areas. And I think, because there is insufficient emphasis on mass storage and high-speed communication, we have an unbalanced scientific program in this country, resulting from the availability of certain components in the computing environment and the lack of others. Thus, certain problems get attention while other problems are ignored.

If you want to do, say, quantum chromodynamics, you need large memories and lots of computer time. If you want to do time-dependent, multidimensional-continuum physics, then you need not only lots of compute power and large memory but also data storage, communication, visualization, and maybe even a database so that you can make sense of it all.

One of the developments I'm most interested in is the goal of HPCI to establish a gigabit-per-second education and research network. When I had the opportunity in 1989 to testify on Senate Bill S-1067, I made it a point to talk for 15 minutes about the Library of Congress and why we don't digitize it. If you check into the problem a little closer, you find that a typical time scale on which society doubles its knowledge is about a decade—every 10 or 12 years. That would also seem to be a reasonable time scale during which we could actually accomplish the conversion from analog to digital storage. Unfortunately, even to this day, lots of knowledge is first recorded electronically and printed on paper, and then the electronic record is destroyed because the business is in the paper that is sold, not in the information.

It would be fantastic if we could use HPCI as a way to digitize Los Alamos National Laboratory's library. Then we could make it available over a huge network at very high speeds.

The supercomputing culture was established primarily at the national laboratories, and it was a very successful spinoff. One of the reasons why the NSF's Office of Advanced Scientific Computing got off the ground so fast was that they could rely on a tremendous experience base and lots of good working software. Although I have no direct knowledge of work at the National Security Agency in digitization (because I am a foreign national and lack the necessary clearance), I nevertheless cannot imagine that there are not excellent people at the Agency and elsewhere who have already solved the problem of converting analog data into digital form. I hear you can even take pictures in a hurry and decipher them. There's great potential here for a spinoff. Whether that's politically possible, I have no idea, but I have an excuse: I'm just a scientist.

Returning to the matter of data storage, in terms of software for a computational supercomputer environment, I must say the situation is disastrous. The Common File System was developed at Los Alamos over a decade ago, and it has served us extremely well in the role for which it was designed: processing relatively small amounts of data with very, very high quality so that you can rely on the data you get back.

Now, 10 years have passed. The Cray Timesharing System will shortly be replaced, I guess everywhere, with UNIX. It would be good to have a mass-storage system based entirely on UNIX, complete with an archival system. The most exciting recent work I'm aware of in this area was carried out at NASA Ames, with the development of MSS-2 and the UTX. But we still have a long way to go if we really want to hook the high-speed networks into a system like that.

Advances in communication also include the Fiber Distributed Data Interface, which is a marginal improvement over the Ethernet. The high-performance parallel interface (better known as HIPPI) was developed at Los Alamos in the mid-1980s. But there is a tremendous lag time before technology like this shows up in commercial products.

Another question, of course, is standard communication protocols. One aspect of standard communications protocols that has always interested me is very-high-speed, real-time, interactive visualization. I realized some time ago that one could visualize for two-dimensional, time-dependent things but not for three-dimensional things, and that's why it's such a challenge. You probably need a TFLOPS machine to do the real-time calculations that Henry Fuchs mentioned earlier in this session.

Some additional problems on which I have been working are time-dependent, multidimensional-continuum physics simulations; radiation hydrodynamics; and Maxwell equations for a free-electron laser. On a Connection Machine, you can have eight gigabytes of memory right now. If you divide that by four bytes per word, you have about two gigawords. If you have 10 variables in your problem, then you have 200 million grid points. That is in principle what you could do on your machine.

If you're really interested in the dynamics of a phenomenon and do a simulation, you typically take 1000 snapshots of it, and then you have your terabyte. Even at a data rate of 200 megabits per second, it still takes 10 to 12 hours to ship the data around. A Josephson junction is only 128², or 128-by-128, connections. This indicates what you could do with the available machinery, assuming you could handle the data. Also, in a few years, when the earth-observing satellites will be flying, there will be a few terabytes per day being beamed down. That translates into only about 100 megabits per second, but it's coming every second.
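The data-volume arithmetic of the last two paragraphs, written out (taking the link rate as 200 megabits per second, which is the reading that reproduces the quoted 10-to-12-hour figure):

```python
# Data-volume arithmetic from the text: 8 gigabytes of memory at 4 bytes per
# word gives ~2 gigawords; with 10 variables per grid point that is ~200
# million grid points.  A run saving ~1000 snapshots is on the order of a
# terabyte, and shipping a terabyte over a 200-megabit-per-second link takes
# on the order of 11 hours.

memory_bytes    = 8 * 10**9      # 8 GB of Connection Machine memory
bytes_per_word  = 4
variables       = 10
link_bits_per_s = 200 * 10**6    # 200 megabits per second

words       = memory_bytes // bytes_per_word   # ~2 gigawords
grid_points = words // variables               # ~200 million grid points

terabyte = 10**12                              # ~1000 snapshots of a big run
hours    = terabyte * 8 / link_bits_per_s / 3600

print(f"{grid_points:,} grid points; ~{hours:.1f} hours to ship a terabyte")
```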

One of the things I really can appreciate concerns software. I spent a year at NCSA working with their industrial partners—about 40 per cent of their funding comes from private companies—and I found the situation very depressing. (Here at Los Alamos National Laboratory, it's even worse in a way; we hardly take advantage of any commercial software, so we miss out on many innovations coming from that quarter.) To give you an example, at NCSA there are about 120 commercial software packages on the Crays; even for their industrial customers, the fee they must pay per year is a little less than for the software you would like to get if you were operating on your Sun workstation. It's a couple of hundred thousand dollars per year.

Typically, that software is generated by a couple of people working out of one of the garage operations. Hardly anything is market tested. There's absolutely no interest in developing stuff for parallel systems. Typically you play a game of tag trying to get in touch with the software vendor, the supercomputer manufacturer, and the industrial partner you're trying to serve.

To reinforce that point, I reiterate that the standards, the software, and everything else along those lines will be determined by what comes out of personal computers, Macintoshes, and workstations because that's the most innovative sector of the environment. The implications for the rest of the environment are considerable. For example, if you don't have the same floating-point standard as you have on a workstation, I don't care what you do, you're doomed.

I would like to close with an analogy to the digital high-definition video standard. The society at large is, in fact, helping a few of us scientists solve our data-storage problem because if you have to store digital images for high-definition video, then all our terabytes will not be a big problem. Coupling real experiments to numerical experiments would provide tremendously valuable two-way feedback in the experimental world and would provide a sanity check, so to speak, both ways while you do the experiment.



Molecular Nanotechnology

Ralph Merkle

Ralph C. Merkle received his Ph.D. from Stanford University in 1979 and is best known as a coinventor of public-key cryptography. Currently, he pursues research in computational nanotechnology at the Xerox Palo Alto Research Center (PARC) in California.

We are going to discuss configurations of matter and, in particular, arrangements of atoms. Figure 1 is a Venn diagram, and the big circle with a P in it represents all possible arrangements of atoms. The smaller circle with an M in it represents the arrangements of atoms that we know how to manufacture. The circle with a U in it represents the arrangements of atoms that we can understand.

Venn diagrams let you easily look at various unions and intersections of sets, which is exactly what we're going to do. One subset is the arrangements of atoms that are physically possible, but which we can neither manufacture nor understand. There's not a lot to say about this subset, so we won't.

The next subset of interest includes those arrangements of atoms that we can manufacture but can't understand. This is actually a very popular subset and includes more than many people think, but it's not what we're going to talk about.

The subset that we can both manufacture and understand is a good, solid, worthwhile subset. This is where a good part of current research is devoted. By thinking about things that we can both understand and manufacture, we can make them better. Despite its great popularity, though, we won't be talking about this subset either.

Figure 1. A Venn diagram, where P is the set of all possible arrangements of atoms, M is the set of all arrangements of atoms we can manufacture, and U is the set of all arrangements of atoms we can understand.

Today, we'll talk about the subset that we can understand but can't yet manufacture. The implication is that the range of things we can manufacture will extend and gradually encroach upon the range of things that we can understand. So at some point in the future, we should be able to make most of these structures, even if we can't make them today.

There is a problem in talking about things that we can't yet manufacture: our statements are not subject to experimental verification, which is bad. This doesn't mean we can't think about them, and if we ever expect to build any of them we must think about them. But we do have to be careful. It would be a great shame if we never built any of them, because some of them are very interesting indeed. And it will be very hard to make them, especially the more complex ones, if we don't think about them first.

One thing we can do to make it easier to think about things that we can't build (and make it less likely that we'll reach the wrong conclusions) is to think about the subset of mechanical devices: machinery. This subset includes things made out of gears and knobs and levers and things. We can make a lot of mechanical machines today, and we can see how they work and how their parts interact. And we can shrink them down to smaller and smaller sizes, and they still work. At some point, they become so small that we can't make them, so they move from the subset of things that we can make to the subset of things that we can't make. But because the principles of operation are simple, we believe they would work if only we could make them that small. Of course, eventually they'll be so small that the number of atoms in each part starts to get small, and we have to worry about our simple principles of operation breaking down. But because the principles are simple, it's a lot easier to tell whether they still apply or not. And because we know the device works at a larger scale, we only need to worry about exactly how small the device can get and still work. If we make a mistake, it's a mistake in scale rather than a fundamental mistake. We just make the device a little bit bigger, and it should work. (This isn't true of some proposals for molecular devices that depend fundamentally on the fact that small things behave very differently from big things. If we propose a device that depends fundamentally on quantum effects and our analysis is wrong, then we might have a hard time making it slightly bigger to fix the problem!)

The fact remains, though, that we can't make things as small as we'd like to make them. In even the most precise modern manufacturing, we treat matter in bulk. From the viewpoint of an atom, casting involves vast liquid oceans of billions of metal atoms, grinding scrapes off great mountains of atoms, and even the finest lithography involves large numbers of atoms. The basic theme is that atoms are being dealt with in great lumbering statistical herds, not as individuals.

Richard Feynman (1961) said: "The principles of physics, as far as I can see, do not speak against the possibility of maneuvering things atom by atom." Eigler and Schweizer (1990) recently gave us experimental proof of Feynman's words when they spelled "IBM" by dragging individual xenon atoms around on a nickel surface. We have entered a new age, an age in which we can make things with atomic precision. We no longer have to deal with atoms in great statistical herds—we can deal with them as individuals.

This brings us to the basic idea of this talk, which is nanotechnology. (Different people use the term "nanotechnology" to mean very different things. It's often used to describe anything on a submicron scale, which is clearly not what we're talking about. Here, we use the term "nanotechnology" to refer to "molecular nanotechnology" or "molecular manufacturing," which is a much narrower and more precise meaning than "submicron.") Nanotechnology, basically, is the thorough, inexpensive control of the structure of matter. That means if you want to build something (and it makes chemical and physical sense), you can very likely build it. Furthermore, the individual atoms in the structure are where you want them to be, so the structure is atomically precise. And you can do this at low cost. This possibility is attracting increasing interest at this point because it looks like we'll actually be able to do it.

For example, IBM's Chief Scientist and Vice President for Science and Technology, J. A. Armstrong, said: "I believe that nanoscience and nanotechnology will be central to the next epoch of the information age, and will be as revolutionary as science and technology at the micron scale have been since the early '70's. . . . Indeed, we will have the ability to make electronic and mechanical devices atom-by-atom when that is appropriate to the job at hand."

To give you a feeling for the scale of what we're talking about, a single cubic nanometer holds about 176 carbon atoms (in a diamond lattice). This makes a cubic nanometer fairly big from the point of view of nanotechnology because it can hold over a hundred atoms, and if we're designing a nano device, we have to specify where each of those 176 atoms goes.
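A quick check of the 176-atom figure, using the conventional diamond unit cell (lattice constant of about 0.357 nanometers, eight atoms per cubic cell):

```python
# Carbon atoms per cubic nanometer of diamond: the conventional cubic unit
# cell has a lattice constant of about 0.357 nm and contains 8 atoms, which
# gives roughly 176 atoms in a single cubic nanometer.

lattice_constant_nm = 0.357
atoms_per_cell      = 8

cell_volume_nm3 = lattice_constant_nm ** 3          # ~0.0455 nm^3
atoms_per_nm3   = atoms_per_cell / cell_volume_nm3  # ~176

print(f"{atoms_per_nm3:.0f} carbon atoms per cubic nanometer")
```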

If you look in biological systems, you find some dramatic examples of what can be done. For instance, the storage capacity of DNA is roughly 1 bit per 16 atoms or so. If we can selectively remove individual atoms from a surface (as was demonstrated at IBM), we should be able to beat even that!

An even more dramatic device taken from biology is the ribosome. The ribosome is a programmable machine tool that can make almost any protein. It reads the messenger RNA (the "punched paper tape" of the biological world) and builds the protein, one amino acid at a time. All life on the planet uses this method to make proteins, and proteins are used to build almost everything else, from bacteria to whales to giant redwood trees.

There's been a growing interest in nanotechnology (Dewdney 1988, The Economist 1989, Pollack 1991). Fortune Magazine had an article about where the next major fortunes would come from (Fromson 1988), which included nanotechnology. The Fortune Magazine article said that very large fortunes would be made in the 21st century from nanotechnology and described K. Eric Drexler as the "theoretician of nanotechnology." Drexler (1981, 1986, 1988, 1992) has had a great influence on the development of this field and provided some of the figures used here.

Japan is funding research in this area (Swinbanks 1990). Their interest is understandable. Nanotechnology is a manufacturing technology, and Japan has always had a strong interest in manufacturing technologies. It will let you make incredibly small things, and Japan has always had a strong interest in miniaturization. It will let you make things where every atom is in the right place: this is the highest possible quality, and Japan has always had a strong interest in high quality. It will let you make things at low cost, and Japan has always been interested in low-cost manufacturing. And finally, the payoff from this kind of technology will come in many years to a few decades, and Japan has a planning horizon that extends to many decades. So it's not surprising that Japan is pursuing nanotechnology.

This technology won't be developed overnight. One kind of development that we might see in the next few years would be an improved scanning tunneling microscope (STM) that would be able to deposit or remove a few atoms on a surface in an atomically precise fashion, making and breaking bonds in the process. The tip would approach a surface and then withdraw from the surface, leaving a cluster of atoms in a specified location (Figure 2). We could model this kind of process today using a computational experiment. Molecular modeling of this kind of interaction is entirely feasible and would allow a fairly rapid analysis of a broad variety of tip structures and tip-surface interactions. This would let us rapidly sort through a wide range of possibilities and pick out the most useful approaches. Now, if in fact you could do something like that, you could build structures using an STM at the molecular and atomic scale.

Figure 3 shows what might be described as a scaled-down version of an STM. It is a device that gives you positional control, and it is roughly 90 nanometers tall, so it is very tiny. It has six degrees of freedom and can position its tip accurately to within something like an angstrom. We can't build it today, but it's a fairly simple design and depends on fairly simple mechanical principles, so we think it should work.

This brings us to the concept of an "assembler." If you can miniaturize an STM and if you can build structures by controlled deposition of small clusters of atoms on surfaces, then you should be able to build small structures with a small version of the STM. Of course, you'd need a small computer to control the small robotic arm. The result is something that looks like an industrial robot that is scaled down by a factor of a million. It has millionfold smaller components and millionfold faster operations.

Figure 2. A scanning tunneling microscope depositing a cluster of atoms on a surface.

Figure 3. A scanning tunneling microscope built on a molecular scale.

The assembler would be programmable, like a computer-controlled robot. It would be able to use familiar chemistry: the kind of chemistry that is used in living systems to make proteins and the kind of chemistry that chemists normally use in test tubes. Just as the ribosome can bond together amino acids into a linear polypeptide, so the assembler could bond together a set of chemical building blocks into complex three-dimensional structures by directly putting the compounds in the right places. The major differences between the ribosome and the assembler are (1) the assembler has a more complex (computerized) control system (the ribosome can only follow the very simple instructions on the messenger RNA), (2) the assembler can directly move the chemical building blocks to the right place in three dimensions, and so could directly form complex three-dimensional structures (the ribosome can only form simple linear sequences and can make three-dimensional structures only by roundabout and indirect means), and (3) the assembler can form several different types of bonds (the ribosome can form just one type of bond, the bond that links adjacent amino acids).

You could also use rather exotic chemistry. Highly reactive compounds are usually of rather limited use in chemistry because they react with almost anything they touch and it's hard to keep them from touching something you don't want them to touch. If you work in a vacuum, though, and can control the positions of everything, then you can work with highly reactive compounds. They won't react with things they're not supposed to react with because they won't touch anything they're not supposed to touch. Specificity is provided by controlling the positions of reacting compounds.

There are a variety of things that assemblers could make. One of the most interesting is other assemblers. That is where you get low manufacturing cost. (At Xerox, we have a special fondness for machines that make copies of things.) The idea of assemblers making other assemblers leads to self-replicating assemblers. The concept of self-replicating machines has actually been around for some time. It was discussed by von Neumann (1966) back in the 1940s in his work on the theory of self-reproducing automata. Von Neumann's style of a self-replicating device had a Universal Computer coupled to what he called a Universal Constructor. The Universal Computer tells the Universal Constructor what to do. The Universal Constructor, following the instructions of the Universal Computer, builds a copy of both the Universal Computer and the Universal Constructor. It then copies the blueprints into the new machine, and away you go. That style of self-replicating device looks pretty interesting.

NASA (1982) did a study called "Advanced Automation for Space Missions." A large part of their study was devoted to SRSs, or Self-Replicating Systems. They concluded, among other things, that "the theoretical concept of machine duplication is well developed. There are several alternative strategies by which machine self-replication can be carried out in a practical engineering setting. An engineering demonstration project can be initiated immediately. . . ." They commented on and discussed many of the strategies. Of course, their proposals weren't molecular in scale but were quite macroscopic. NASA's basic objective was to put a 100,000-ton, self-replicating seed module on the lunar surface. Designing it would be hard, but after it was designed, built, and installed on the lunar surface, it would manufacture more of itself. This would be much cheaper than launching the same equipment from the earth.

There are several different self-replicating systems that we can examine. Von Neumann's proposal was about 500,000 bits. The Internet Worm was also about 500,000 bits. The bacterium, E. coli , a self-replicating device that operates in nature, has a complexity of about 8,000,000 bits. Drexler's assembler has an estimated complexity of 100 million bits. People have a complexity of roughly 6.4 gigabits. Of course, people do things other than replicate, so it's not really fair to chalk all of this complexity up to self-replication. The proposed NASA lunar manufacturing facility was very complex: 100 to 1000 gigabits.

To summarize the basic idea: today, manufacturing limits technology. In the future we'll be able to manufacture most structures that make sense. The chief remaining limits will be physical law and design capabilities. We can't make it if it violates physical law, and we can't make it if we can't specify it.


It will take a lot of work to get there, and more than just a lot of work, it will take a lot of planning. It's likely that general-purpose molecular manufacturing systems will be complex, so complex that we won't stumble over them by accident or find that we've made one without realizing it. This is more like going to the moon: a big project with lots of complicated systems and subsystems. Before we can start such a project, though, there will have to be proposals, and analyses of proposals, and a winnowing of the proposals down to the ones that make the most sense, and a debate about which of these few best proposals is actually worth the effort to build. Computers can help a great deal here. For virtually the first time in history, we can use computational models to study structures that we can't build and use computational experiments, which are often cheap and quick, compared with physical experiments, to help us decide which path is worth following and which path isn't.

Boeing builds airplanes in a computer before they build them in the real world. They can make better airplanes, and they can make them more quickly. They can shave years off the development time. In the same way, we can model all the components of an assembler using everything from computational-chemistry software to mechanical-engineering software to system-level simulators. This will take an immense amount of computer power, but it will shave many years off the development schedule.

Of course, everyone wants to know how soon molecular manufacturing will be here. That's hard to say. However, there are some very interesting trends. The progress in computer technology during the past 50 years has been remarkably regular. Almost every parameter of hardware technology can be plotted as a straight line on log paper. If we extrapolate those straight lines, we find they reach interesting values somewhere around 2010 to 2020. The energy dissipation per logic operation reaches thermal noise at room temperature. The number of atoms required to store one bit of information reaches approximately one. The raw computational power of a computer starts to exceed the raw computational power of the human brain. This suggests that somewhere between 2010 and 2020, we'll be able to build computers with atomic precision. It's hard to see how we could achieve such remarkable performance otherwise, and there are no fundamental principles that prevent us from doing it. And if we can build computers with atomic precision, we'll have to have developed some sort of molecular manufacturing capability.
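
To make the extrapolation argument concrete, here is a minimal sketch, in Python, of the "straight line on log paper" reasoning for one parameter, the energy dissipated per logic operation. The thermal-noise floor kT ln 2 follows from physics; the 1990 starting value and the halving time in the sketch are illustrative assumptions of mine, not figures from the talk, so the printed crossover year only demonstrates the method.

    import math

    k_B = 1.380649e-23                  # Boltzmann constant, J/K
    T = 300.0                           # room temperature, K
    kT_ln2 = k_B * T * math.log(2)      # thermal-noise floor, ~2.9e-21 J per operation

    # Assumed trend: energy per logic operation in 1990 and its halving time.
    # Both numbers are placeholders chosen only to illustrate the extrapolation.
    e_1990 = 1.0e-13                    # joules per logic operation (assumed)
    halving_time = 1.5                  # years per factor-of-two improvement (assumed)

    doublings_needed = math.log2(e_1990 / kT_ln2)
    crossover_year = 1990 + doublings_needed * halving_time

    print(f"thermal-noise floor at 300 K: {kT_ln2:.2e} J per operation")
    print(f"trend reaches the floor around {crossover_year:.0f}")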


Feynman said: "The problems of chemistry and biology can be greatly helped if our ability to see what we are doing and to do things on an atomic level is ultimately developed, a development which, I think, cannot be avoided."

While it's hard to say exactly how long it will take to develop molecular manufacturing, it's clear that we'll get there faster if we decide that it's a worthwhile goal and deliberately set out to achieve it.

As Alan Kay said: "The best way to predict the future is to create it."

References

A. K. Dewdney, "Nanotechnology: Wherein Molecular Computers Control Tiny Circulatory Submarines," Scientific American257 , 100-103 (January 1988).

K. E. Drexler, Engines of Creation , Anchor Press, New York (1986).

K. E. Drexler, "Molecular Engineering: An Approach to the Development of General Capabilities for Molecular Manipulation," in Proceedings of the National Academy of Sciences of the United States of America78 , 5275-78 (1981).

K. E. Drexler, Nanosystems: Molecular Machinery, Manufacturing and Computation , John Wiley and Sons, Inc., New York (1992).

K. E. Drexler, "Rod Logic and Thermal Noise in the Mechanical Nanocomputer," in Proceedings of the Third International Symposium on Molecular Electronic Devices , F. Carter, Ed., Elsevier Science Publishing Co., Inc., New York (1988).

The Economist Newspaper Ltd., "The Invisible Factory," The Economist313 (7632), 91 (December 9, 1989).

D. M. Eigler and E. K. Schweizer, "Positioning Single Atoms with a Scanning Tunnelling Microscope," Nature344 , 524-526 (April 15, 1990).

R. Feynman, There's Plenty of Room at the Bottom , annual meeting of the American Physical Society, December 29, 1959. Reprinted in "Miniaturization," H. D. Gilbert, Ed., Reinhold Co., New York, pp. 282-296 (1961).

B. D. Fromson, "Where the Next Fortunes Will be Made," Fortune Magazine , Vol. 118, No. 13, pp. 185-196 (December 5, 1988).


309

NASA, "Advanced Automation for Space Missions," in Proceedings of the 1980 NASA/ASEE Summer Study , Robert A. Freitas, Jr. and William P. Gilbreath, Eds., National Technical Information Service (NTIS) order no. N83-15348, U.S. Department of Commerce, Springfield, Virginia (November 1982).

A. Pollack, "Atom by Atom, Scientists Build 'Invisible' Machines of the Future," The New York Times (science section), p. B7 (November 26, 1991).

D. Swinbanks, "MITI Heads for Inner Space," Nature346 , 688-689 (August 23, 1990).

J. von Neumann, Theory of Self Reproducing Automata , Arthur W. Burks, Ed., University of Illinois Press, Urbana, Illinois (1966).


Supercomputing Alternatives

Gordon Bell

C. Gordon Bell, now an independent consultant, was until 1991 Chief Scientist at Stardent Computer. He was the leader of the VAX team and Vice President of R&D at Digital Equipment Corporation until 1983. In 1983, he founded Encore Computer, serving as Chief Technical Officer until 1986, when he became the founding Assistant Director of the Computing and Information Science and Engineering Directorate at NSF. Gordon is also a founder of The Computer Museum in Boston, a fellow of both the Institute of Electrical and Electronics Engineers and the American Association for the Advancement of Science, and a member of the National Academy of Engineering. He earned his B.S. and M.S. degrees at MIT.

Gordon was awarded the National Medal of Technology by the Department of Commerce in 1991 and the von Neumann Medal by the Institute of Electrical and Electronics Engineers in 1992.

Less Is More

Our fixation on the supercomputer as the dominant form of technical computing is finally giving way to reality. Supers are being supplanted by a host of alternative forms of computing, including the interactive, distributed, and personal approaches that use PCs and workstations. The technical computing industry and the community it serves are poised for an exciting period of growth and change in the 1990s.


Traditional supercomputers are becoming less relevant to scientific computing, and as a result, the growth in the traditional vector supercomputer market, as defined by Cray Research, Inc., is reduced from what it was in the early 1980s. Large-system users and the government, who are concerned about the loss of U.S. technical supremacy in this last niche of computing, are the last to see the shift. The loss of supremacy in supercomputers should be of grave concern to the U.S. government, which relies on supercomputers and, thus, should worry about the loss of one more manufacturing-based technology. In the case of supercomputers, having the second-best semiconductors and high-density packaging means that U.S. supercomputers will be second.

The shift away from expensive, highly centralized, time-shared supercomputers for high-performance computing began in the 1980s. The shift is similar to the shift away from traditional mainframes and minicomputers to workstations and PCs. In response to technological advances, specialized architectures, the dictates of economics, and the growing importance of interactivity and visualization, newly formed companies challenged the conventional high-end machines by introducing a host of supersubstitutes: minisupercomputers, graphics supercomputers, superworkstations, and specialized parallel computers. Cost-effective FLOPS, that is, the floating-point operations per second essential to high-performance technical computing, now come in many new forms. The compute power for demanding scientific and engineering challenges can be found across a whole spectrum of machines at a range of price/performance points. Machines as varied as a Sun Microsystems, Inc., workstation, a graphics supercomputer, a minisupercomputer, or a special-purpose computer like a Thinking Machines Corporation Connection Machine or an nCUBE Corporation Hypercube all do the same computation for five to 50 per cent of the cost of doing it on a conventional supercomputer. Evidence of this trend abounds.

An example of cost effectiveness would be the results of the PERFECT (Performance Evaluation for Cost-Effective Transformation) contest. This benchmark suite, developed at the University of Illinois Supercomputing Research and Development Center in conjunction with manufacturers and users, attempts to measure supercomputer performance and cost effectiveness.

In the 1989 contest, an eight-processor CRAY Y-MP/832 took the laurels for peak performance by achieving 22 and 120 MFLOPS (million floating-point operations per second) for the unoptimized baseline and hand-tuned, highly optimized programs, respectively. A uniprocessor Stardent Computer 3000 graphics supercomputer won the cost/performance award by a factor of 1.8 and performed at 4.2 MFLOPS with no tuning and 4.4 MFLOPS with tuning. The untuned programs on the Stardent 3000 were 27 times more cost effective than the untuned CRAY Y-MP programs. In comparison, a Sun SPARCstation 1 ran the benchmarks roughly one-third as fast as the Stardent.

The PERFECT results typify "dis-economy" of scale. When it comes to getting high-performance computation for scientific and engineering problems, the biggest machine is rarely the most cost effective. This concept runs counter to the myth created in the 1960s known as Grosch's Law, which stated that the power of a computer increased as its price squared. Many studies have shown that the power of a computer increased at most as the price raised to the 0.8 power—a dis-economy of scale.
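
A two-line calculation makes the contrast concrete. The sketch below, in Python, compares what Grosch's Law would predict with what the cited studies find for a machine that costs ten times more; the 10x price ratio is just an illustrative choice.

    def relative_power(price_ratio, exponent):
        """Power of the costlier machine relative to the cheaper one."""
        return price_ratio ** exponent

    ratio = 10.0                              # machine B costs 10 times machine A
    grosch = relative_power(ratio, 2.0)       # Grosch's Law: 100x the power
    observed = relative_power(ratio, 0.8)     # studies cited above: ~6.3x the power

    print(grosch, observed)
    print(observed / ratio)                   # ~0.63 of the FLOPS per dollar: a dis-economy of scale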

Table 1 provides a picture of the computing power and capacity measures for the various types of computers that can substitute for supercomputers. The peak rating and the LINPACK 1K × 1K result estimate the peak power that a computer might deliver on a highly parallel application. LINPACK 100 × 100 shows the power that might be expected for a typical supercomputer application and the average speed at which a supercomputer might operate. The Livermore Fortran Kernels (LFKs) were designed to typify workload, that is, the capacity of a computer operating at Lawrence Livermore National Laboratory.

Table 1. Power of 1989 Technical Computers, in MFLOPS

Type               No. Proc. Max.   LFK per Proc.   LFK per Machine   LINPACK 100 × 100   LINPACK 1K × 1K   Peak
PC                 1                0.1–0.5         0.1–0.5           0.1–0.5             0.1–1.0           1
Workstation        1                0.2–1.5         0.2–1.5           0.5–3.0             6                 8
Micro/Mini         1                0.1–0.5         0.1–0.5           0.1–0.5             0.1–0.5           2
Supermini          6                1               4                 1                   6                 24
Superworkstation   4                1.5–5           10                6–12                80                128
Minisuper          8                2–4.3           10                6–16                166               200
Main/Vectors       6                7.2             43                13                  518               798
Supercomputer      8                19              150               84                  2144              2667

The researchers who use NSF's five supercomputing centers at no cost are insulated from cost considerations. Users get relatively little processing power per year despite the availability of the equivalent of 30 CRAY X-MP processors, or 240,000 processor hours per year. When that processing power is spread out among 10,000 researchers, it averages out to just 24 hours per year, or about what a high-power PC can deliver in a month. Fortunately, a few dozen projects get 1000 hours per year. Moreover, users have to contend with a total solution time disproportionate to actual computation time, centralized management and allocation of resources, the need to understand vectorization and parallelization to utilize the processors effectively (including memory hierarchies), and other issues.
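
The arithmetic behind those numbers is easy to check directly. The Python sketch below simply restates the figures quoted in the paragraph above, with round-the-clock operation assumed for the 30-processor equivalent.

    processors = 30                        # equivalent CRAY X-MP processors at the NSF centers
    wall_clock = 24 * 365                  # 8,760 hours per processor per year
    total_hours = processors * wall_clock  # 262,800; the text rounds to ~240,000 usable hours

    researchers = 10_000
    print(total_hours)                     # 262800
    print(240_000 / researchers)           # 24.0 hours per researcher per year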

These large, central facilities are not necessarily flawed as a source of computer power unless they attempt to be a one-stop solution. They may be the best resource for the very largest users with large, highly tuned parallel programs that may require large memories, file capacity of tens or hundreds of gigabytes, the availability of archive files, and the sharing of large databases and large programs. They also suffice for the occasional user who needs only a few hours of computing a year and doesn't want to own or operate a computer.

But they're not particularly well suited to the needs of the majority of users working on a particular engineering or scientific problem that is embodied in a program model. They lack the interactive and visualization capabilities that computer-aided design requires, for example. As a result, even with free computer time, only a small fraction of the research community, between five and 10 per cent, uses the NSF centers. Instead, users are buying smaller computing resources to make more power available than the large, traditional, centralized supercomputer supplies. Ironic though it may seem, less is more .

Supersubstitutes Provide More Overall Capacity

Users can opt for a supersubstitute if it performs within a factor of 10 of a conventional supercomputer. That is, a viable substitute must supply up to 10 per cent of the power of a super so as to deliver the same amount of computation in one day that the typical user could expect from a large, time-shared supercomputer—between a half-hour and an hour of Cray service per day, with a peak of two hours. Additionally, it should be the best price performer in its class, sustain high throughput on a wide variety of jobs, and have appropriate memory and other resources.
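
The 10 per cent threshold follows from a simple duty-cycle argument: a dedicated substitute runs all day for one user, while the shared super gives that user only a slice of each day. A small Python sketch of the arithmetic, using the daily allocations quoted above:

    def required_fraction(shared_hours_per_day):
        """Fraction of one super-processor a dedicated machine must match."""
        return shared_hours_per_day / 24.0

    for hours in (0.5, 1.0, 2.0):          # typical and peak daily Cray service cited above
        print(f"{hours} h/day -> {required_fraction(hours) * 100:.1f}% of a super")
    # 0.5 h -> 2.1%, 1.0 h -> 4.2%, 2.0 h -> 8.3%: all within the 10 per cent bound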

Data compiled by the market research firm Dataquest Inc. has been arranged in Table 2 so as to show technical computer installations in 1989, along with several gauges of computational capacity: per-processor performance on the Livermore Loops workload benchmark,[*] per-processor performance on LINPACK 100 × 100 and peak performance on the LINPACK 1000 × 1000 benchmark,[**] and total delivered capacity using the Livermore Loops workload measure, expressed as an equivalent to the CRAY Y-MP eight-processor computer's 150 MFLOPS.

Table 2. Installed Capacity for General-Purpose Technical Computing Environment (Source: Dataquest)

Type               Dataquest Installed   1989 Ships   1989 LFK Capacity   Companies Selling   Building   Dead
PC                 3.4M                  1M           1341                100s                ?          ?
Workstation        0.4M                  145K         960                 7                   ?          ~50
Micro/Mini         0.9M                  75K          30                  ~20                 ?          ~100
Supermini          0.3M                  7.5K         200                 7                   ?          ~10
Superworkstation   10K                   10K          100                 3                   2          2
Minisuper          1.6K                  600          32                  5                   >2         8
Parallel Proc.     365                   250          4                   24                  >9         8
Main/Vectors       8.3K                  100          29                  3                   ?          3
Supercomputer      450                   130          100                 4                   >3         3

[*] LFKs consist of 24 inner loops that are representative of the programs run at Lawrence Livermore National Laboratory. The spectrum of code varies from entirely scalar to almost perfectly vectorizable, whereby the supercomputer can run at its maximum speed. The harmonic mean is used to measure relative performance, which corresponds to the time it takes to run all 24 programs. The SPEC and PERFECT benchmarks also correlate with the Livermore benchmark.

[**] The LINPACK benchmark measures the computer's ability to solve a set of linear algebraic equations. These equations are the basis of a number of programs, such as the finite-element models used for physical systems. The small matrix size (100 × 100) benchmark corresponds to the rate at which a typical application program runs on a supercomputer. The large LINPACK corresponds to the best case that a program is likely to achieve.
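
As the first footnote notes, the LFK workload figure is a harmonic mean of the 24 loop rates, which weights by time rather than by rate and is therefore dragged down by the scalar loops. A minimal Python illustration with made-up loop rates:

    def harmonic_mean(rates):
        """Harmonic mean of MFLOPS rates: total work divided by total time."""
        return len(rates) / sum(1.0 / r for r in rates)

    loop_mflops = [0.5, 2.0, 8.0, 150.0]   # invented rates: three scalar-ish loops, one vector loop
    print(harmonic_mean(loop_mflops))      # ~1.5 MFLOPS, far below the 150 MFLOPS vector rate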

How Supers Are Being Niched

Supercomputers are being niched across the board by supersubstitutes that provide a user essentially the same service but at much lower entry and use costs. In addition, all the other forms of computers, including mainframes with vector facilities, minis, superminis, minisupers, ordinary workstations, and PCs, offer substitutes. Thus, the supercomputer problem (i.e., the U.S.'s inability to support supercomputers in a meaningful market fashion) is based on economics as much as on competition.

Numerous machine types are contenders as supersubstitutes. Here are some observations on each category.

Workstations

Workstations from companies like Digital Equipment Corporation (DEC), the Hewlett-Packard Company, Silicon Graphics Inc., and Sun Microsystems, among others, provide up to 10 per cent of the capacity of a CRAY Y-MP processor. But they do it at speeds of less than 0.3 per cent of an eight-processor Y-MP LINPACK peak and at about two per cent the speed of a single-processor Y-MP on the LINPACK 100-×-100 benchmark. Thus, while they may achieve impressive scalar performance, they have no way to hit performance peaks for the compute-intensive programs for which the vector and parallel capabilities of supercomputers were developed. As a result, they are not ideal as supersubstitutes. Nevertheless, ordinary computers like workstations, PCs, minicomputers, and superminis together provide most of the technical computing power available today.

Minicomputers and Superminis

These machines provide up to 20 per cent of the capacity of a CRAY Y-MP processor. But again, with only 0.25 per cent of the LINPACK peak of the Cray, they are also less-than-ideal supercomputer substitutes.

Mainframes

IBM may be the largest supplier of supercomputing power. It has installed significant computational power in its 3090 mainframes with vector-processing facilities. Dataquest has estimated that 250 of the 750 3090 processors shipped last year had vector-processing capability. Although a 3090/600 has 25 per cent of the CRAY Y-MP's LINPACK peak power, its ability to carry out a workload, as measured by Livermore Loops, is roughly one-third that of a CRAY Y-MP/8.

But we see only modest economic advantages and little or no practical benefit to be derived from substituting one centralized, time-shared resource for another. For numeric computing, mainframes are not the best performers in their price class. Although they supply plenty of computational power, they rarely hit the performance peaks that supercomputer-class applications demand. The mainframes from IBM—and even the new DEC 9000 series—suffer from the awkwardness of traditional architecture evolution. Their emitter-coupled-logic (ECL) circuit technology is costly. And the pace of improvement in ECL density lags far behind the rate of progress demonstrated by the complementary metal-oxide-semiconductor (CMOS) circuitry employed in more cost-effective and easier-to-use supersubstitutes.

Massively Data-Parallel Computers

There is a small but growing base of special-purpose machines in two forms: multicomputers (e.g., hundreds or thousands of interconnected computers) and SIMD machines (e.g., the Connection Machine, MasPar), some of which supply a peak of 10 times a CRAY Y-MP/8, with about the same peak-delivered power (1.5 GFLOPS) on selective, parallelized applications that can operate on very large data sets. This year a Connection Machine won the Bell Perfect Club Prize[*] for having the highest peak performance for an application. These machines are not suitable for a general scientific workload. For programs rich in data parallelism, these machines can deliver the performance. But given the need for complete reprogramming to enable applications to exploit their massively parallel architectures, they are not directly substitutable for current supercomputers. They are useful for the highly parallel programs for which the super is designed. With time, compilers should be able to better exploit these architectures, which require explicitly locating data in particular memory modules and then passing messages among the modules when information needs to be shared.

The most exciting computer on the horizon is the one from Kendall Square Research (KSR), which is scalable to over 1000 processors as a large, shared-memory multiprocessor. The KSR machine functions equally well for both massive transaction processing and massively parallel computation.

[*] A prize of $1000 is given in each of three categories of speed and parallelism to recognize applications programs. The 1988 prizes went to a 1024-node nCUBE at Sandia and a CRAY X-MP/416 at the National Center for Atmospheric Research; in 1989 a CRAY Y-MP/832 ran the fastest.

Minisupercomputers

The first viable supersubstitutes, minisupercomputers, were introduced in 1983. They support a modestly interactive, distributed mode of use and exploit the gap left when DEC began in earnest to ignore its technical user base. In terms of power and usage, their relationship to supercomputers is much like that of minicomputers to mainframes. Machines from Alliant Computer Systems and CONVEX Computer Corporation have a computational capacity approaching one CRAY Y-MP processor.

Until the introduction of graphics supercomputers in 1988, minisupers were the most cost-effective source of supercomputing capacity. But they are under both economic and technological pressure from newer classes of technical computers. The leading minisuper vendors are responding to this pressure in different ways. Alliant plans to improve performance and reduce computing costs by using a cost-effective commodity chip, Intel's i860 RISC microprocessor. CONVEX has yet to announce its next line of minisupercomputers; however, it is likely to follow the Cray path of a higher clock speed using ECL.

Superworkstations

This machine class, judging by the figures in Table 1, is the most vigorous of all technical computer categories, as it is attracting the majority of buyers and supplying the bulk of the capacity for high-performance technical computing. In 1989, superworkstation installations reached more users than the NSF centers did, delivering four times the computational capacity and power supplied by the CRAY Y-MP/8.

Dataquest's nomenclature for this machine class—superworkstations—actually comprises two kinds of machines: graphics supercomputers and superworkstations. Graphics supercomputers were introduced in 1988 and combine varying degrees of supercomputer capacity with integral three-dimensional graphics capabilities for project and departmental use (i.e., multiple users per system) at costs ranging between $50,000 and $200,000. Priced even more aggressively, at $25,000 to $50,000, superworkstations make similar features affordable for personal use.

Machines of this class from Apollo (Hewlett-Packard), Silicon Graphics, Stardent, and most recently from IBM all provide between 10 and 20 per cent of the computational capacity of a CRAY Y-MP processor, as characterized by the Livermore Loops workload. They also run the LINPACK 100 × 100 benchmark at about 12 per cent of the speed of a one-processor Y-MP. While the LINPACK peak of such machines is only two per cent of an eight-processor CRAY Y-MP, the distributed approach of the superworkstations is almost three times more cost effective. In other words, users spending the same amount can get three to five times as much computing from superworkstations and graphics supercomputers as from a conventional supercomputer.

In March 1990, IBM announced its RS/6000 superscalar workstation, which stands out with exceptional performance and price performance. Several researchers have reported running programs at the same speed as on the CRAY Y-MP. The RS/6000's workload capability, as measured by the Livermore Loops, is about one-third that of a CRAY Y-MP processor.

Superworkstations promise the most benefits for the decade ahead because they conjoin more leading-edge developments than any other class of technical computer, including technologies that improve performance and reduce costs, interactivity, personal visualization, smarter compiler technologies, and the downward migration of super applications. More importantly, superworkstations provide for interactive visualization in the same style that PCs and workstations used to stabilize mainframe and minicomputer growth. Radically new applications will spring up around this new tool that are not versions of tired, 20-year-old code from the supercomputer, mainframe, and minicomputer code museums. These will come predominantly from science and engineering problems, but most financial institutions are also applying supercomputers to econometric modeling, work optimization, portfolio analysis, and the like.

Because these machines are all based on fast-evolving technologies, including single-chip RISC microprocessors and CMOS, we can expect performance gains to continue at a rate of over 50 per cent a year over the next five years. We'll also see clock rates continue to grow, to more than 100 megahertz by 1992. By riding the CMOS technology curve, future superworkstation architectures will likely be able to provide more power for most scientific applications than will be available from the more costly multiple-chip systems based on arrays of ECL and GaAs (gallium arsenide) gates. Of course, the bigger gains will come through the use of multiples of these low-cost processors for parallel processing.

Why Supercomputers Are Becoming Less General Purpose

Like its large mainframe and minicomputer cousins, the super is based on expensive packaging of ECL circuitry. As such, its performance evolves relatively slowly (doubling every five years) compared with that of the single-chip microprocessor, which doubles every 18 months. One of the problems in building a cost-effective, conventional supercomputer is that every part—from the packaging to the processors, primary and secondary memory, and the high-performance network—typically costs more than it contributes in incremental performance gains. Supercomputers built from expensive, high-speed components have elaborate processor-memory connections, very fast transfer disks, and processing circuits that do relatively few operations per chip and per watt, and they require extensive installation procedures with high operating costs.

To get the large increases in peak MFLOPS performance, the supercomputer architecture laid down at Cray Research requires increasing memory bandwidth to support the worst-case peak. This is partially caused by Cray's reluctance to use modern cache memory techniques to reduce cost and latency. The increase in bandwidth results in a proportional increase in memory latency, which, unfortunately, decreases the computer's scalar speed. Because workloads are dominated by scalar code, the result is a disproportionately small increase in throughput, even though the peak speed of the computer increases dramatically. Nippon Electric Corporation's (NEC's) four-processor SX-3, with a peak of 22 GFLOPS, is an example of providing maximum vector speed. In contrast, one-chip microprocessors with on-board cache memory, as typified by IBM's superscalar RS/6000 processor, are increasing in speed more rapidly than supers for scientific codes.

Thus, the supercomputer is becoming a special-purpose computer that is only really cost effective for highly parallel problems. It has about the same performance as highly specialized, parallel computers like the Connection Machine, the microprocessor-based nCUBE, and Intel's multicomputers, yet the super costs a factor of 10 more because of its expensive circuit and memory technology. In both the super and the nontraditional computers, a program has to undergo significant transformations to reach peak performance.

Now consider the products available on the market and what they are doing to shrink the supercomputer market. Figure 1, which plots performance versus the degree of problem parallelism (Amdahl's Law), shows the relative competitiveness, in terms of performance, of the supersubstitutes. Fundamentally, the figure shows that supers are being completely "bracketed" at both the bottom (low performance for scalar problems) and the top (highly parallel problems). A sketch of the Amdahl's Law arithmetic behind these curves follows the list. The figure shows the following items:

Figure 1. Performance in floating-point operations per second versus degree of parallel code for four classes of computer: the CRAY Y-MP/8; the Thinking Machines Corporation CM-2; the IBM RS/6000 and Intel i860-based workstation; and the Intel 128- and 1024-node iPSC/860 multicomputers.

1. The super is in the middle, and its performance ranges from a few tens of MFLOPS per processor to over two GFLOPS, depending on the degree of parallelization of the code and the number of processors used. Real applications have achieved sustained performance of about 50 per cent of the peak.

2. Technical workstations supply the bulk of computing for the large, untrained user population and for code that has a low degree of vectorization (code that must be tuned to run well). In 1990, the best CMOS-based technical workstations from IBM and others performed at one-third the capacity of a single-processor Y-MP on a broad range of programs and cost between $10,000 and $100,000. Thus, they are anywhere from five to 100 times more cost effective than a super costing around $2,500,000 per processor. This situation differs from a decade ago, when supers provided over a factor of 20 greater performance for scalar problems against all computers. The growth in clock performance for CMOS is about 50 per cent per year, whereas the growth in performance for ECL is only 10 to 15 per cent per year.

3. For programs that have a high degree of parallelization, two alternatives threaten the super in its natural habitat; in either case, the parallelization has to be done by a small user base.

a. The Connection Machine costs about one-half the Y-MP but provides a peak of almost 10 times the Cray. One CM-2 runs a real-time application code at two times the peak of a Cray.

b. Multicomputers can be formed from a large collection of high-volume, cost-effective CMOS microprocessors. Intel's iPSC/860 multicomputer comes in a range of sizes from typical (128 computers) to large (1K computers). IBM is offering the ability to interconnect RS/6000s. A few RS/6000s will offer any small team the processing power of a CRAY Y-MP processor for a cost of a few hundred thousand dollars.
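
The bracketing in Figure 1 falls out of Amdahl's Law. The Python sketch below (the one referenced before the list) computes delivered MFLOPS as a function of the fraction of the work that runs at vector or parallel peak speed; the scalar and peak rates are rough placeholders for the machine classes discussed, not measured data.

    def delivered_mflops(scalar_mflops, peak_mflops, parallel_fraction):
        """Amdahl's Law: total time is dominated by whatever fraction runs at the slow rate."""
        f = parallel_fraction
        time_per_flop = (1.0 - f) / scalar_mflops + f / peak_mflops
        return 1.0 / time_per_flop

    machines = {                               # (scalar MFLOPS, peak MFLOPS) -- placeholder values
        "CMOS workstation": (10, 40),
        "CRAY Y-MP/8": (30, 2667),
        "1024-node multicomputer": (5, 60000),
    }
    for name, (scalar, peak) in machines.items():
        curve = [round(delivered_mflops(scalar, peak, f), 1) for f in (0.0, 0.9, 0.99, 1.0)]
        print(name, curve)
    # With these placeholders, the cheap workstation stays within a small factor of the super
    # at the scalar end, while the multicomputer overtakes the super only as the parallel
    # fraction approaches 1.0 -- the bracketing shown in Figure 1.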

The Supercomputer Industry

The business climate of the 1990s offers further evidence that big machines may be out of step with more cost-effective, modern computing styles. Table 2 shows the number of companies involved in the highly competitive technical computing industry. The business casualties in the high end of the computer industry last year constitute another indicator that the big-machine approach to technical computing might be flawed. No doubt, a variety of factors contributed to the demise of Control Data Corporation's Engineering Technology Associates Systems subsidiary (St. Paul, Minnesota), Chopp (San Diego, California), Cydrome Inc. (San Jose, California), Evans & Sutherland's startup supercomputer division (Sunnyvale, California), Multiflow Computer (New Haven, Connecticut), and Scientific Computer Systems (San Diego, California). But in an enormously crowded market, being out of step with the spirit of the times might have had something to do with it.

With Cray Research, Inc., having spun off Cray Computer Corporation and with several other startups designing supercomputers, it's hard to get very worried about the U.S. position vis-à-vis whether enough is being done about competitiveness. Unfortunately, less and less is being spent on the underlying circuit technologies for the highest possible speeds. It's fairly easy to predict that of the half-dozen companies attempting to build new supers, there won't be more than three viable U.S. suppliers, whereas today we only have one.


From an economic standpoint, the U.S. is fortunate that the Japanese firms are expending large development resources that supercomputers require because these same engineers could, for example, be building consumer products, robots, and workstations that would have greater impact on the U.S. computer and telecommunications markets. It would be very smart for these Japanese manufacturers to fold up their expensive efforts and leave the small but symbolically visible supercomputer market to the U.S. Japan could then continue to improve its position in the larger consumer and industrial electronics, communication, and computing sectors.

Is the Supercomputer Industry Hastening Its Own Demise?

The supercomputer industry and its patrons appear to be doing many things that hasten an early demise. Fundamentally, the market for supercomputers is only a billion dollars, and the R&D going into supers is also on the order of a billion. This simply means too many companies are attempting to build too many noncompatible machines for too small a market. Much of the R&D is redundant, and other parts are misdirected.

The basic architecture of the "true" supercomputer was clearly defined as a nonscalable, vector multiprocessor. Unfortunately, the larger such a machine is made in order to reach the highest peak, or advertising, speed, the less cost effective it becomes for real workloads. The tradeoffs inherent in a high-performance computer that is judged on the number of GFLOPS it can calculate, based on such a design, seem counterproductive. The supercomputer has several inconsistencies (paradoxes) in its design and use:

1. In providing the highest number of MFLOPS by using multiprocessors with multiple pipe vector units to support one to 1.5 times the number of memory accesses as the peak arithmetic speed, memory latency is increased. However, to have a well-balanced, general-purpose supercomputer that executes scalar code well, the memory latency needs to be low.

2. In building machines with the greatest peak MFLOPS (i.e., the advertising speed), many processors are required, raising the computer's cost and lowering per-processor performance. However, supercomputers are rarely used in a parallel mode with all processors; thus, supers are being built at an inherent dis-economy of scale to increase the advertising speed.


3. Having many processors entails mastering parallelism beyond that obtainable through automatic parallelization/vectorization. However, supercomputer suppliers aren't changing their designs to enable scalability or to use massive parallelism.

4. In providing more than worst-case design of three pipelines to memory, or 1.5 times as many mega-accesses per second as the machine has MFLOPS, the cost effectiveness of the design is reduced by at least 50 per cent. However, to get high computation rates, block algorithms are used that keep data in registers and minimize memory accesses. The average amount of computation a super delivers over a month is only five to 10 per cent of the peak, indicating that the memory switch is idle most of the time. (A rough sketch of this blocking arithmetic follows the list.)
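
As a rough illustration of why blocking leaves the memory pipes idle (the sketch promised in paradox 4; the matrix and block sizes below are arbitrary): a blocked matrix multiply reuses each operand it fetches about B times, so main-memory traffic per floating-point operation drops by roughly a factor of B.

    N = 1000                       # matrix dimension (arbitrary)
    B = 64                         # block size held in registers/cache (arbitrary)

    flops = 2 * N**3               # multiply-adds in a dense N x N matrix multiply
    naive_words = 2 * N**3         # unblocked: roughly one operand fetched per FLOP
    blocked_words = 2 * N**3 // B  # blocked: each fetched operand reused about B times

    print(naive_words / flops)     # ~1.0 memory access per FLOP -> needs the big memory pipes
    print(blocked_words / flops)   # ~0.016 accesses per FLOP -> the memory switch sits mostly idle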

In addition to these paradoxes, true supers are limited in the following ways:

1. Not enough is being done to train users or to make the super substantially easier to use. Network access needs to be much faster and more transparent. An X-terminal server interface could give the super a Macintosh-like interface, but no company provides this at present.

2. The true supercomputer design formula seems flawed. The lack of caches, paging, and scalability makes it doomed to chase the clock. For example, paradox 4 above indicates that a super could probably deliver two to four times more power by doubling the number of processors without increasing the memory bandwidth or the cost.

3. Cray Research describes a massively parallel attached computer. Cray is already quite busy as it attempts to enter into the minisupercomputer market. Teaming with a startup such as Thinking Machines Corporation (which has received substantial government support) or MasPar for a massively parallel facility would provide a significantly higher return on limited brain power.

4. The U.S. has enough massively parallel companies and efforts. These have to be supported in the market and through use before they perish. Because these computers are inherently specialized (note the figure), support via continued free gifts to labs and universities is not realistic in terms of establishing a real marketplace.

A Smaller, Healthier Supercomputer Industry

Let's look at what fewer companies and better R&D focus might bring:

1. The CRAY Y-MP architecture is just fine. It provides the larger address space of the CRAY-2. The CRAY-3 line, based on a new architecture, will further sap the community of skilled systems-software and applications-builder resources. Similarly, Supercomputer Systems, Inc. (Steve Chen's startup), is most likely inventing a new architecture that requires new systems software and applications. Why have three architectures for which people have to be trained so they can support operating systems and write applications?

2. Resources could be deployed on circuits and packaging to build GaAs or more aggressive ECL-based or even chilled CMOS designs instead of more supercomputer architectures and companies.

3. Companies that have much of the world's compiler expertise, such as Burton Smith's Tera Computer Company or SSI in San Diego, could help any of the current super companies. It's unlikely that any funding for these endeavors will come from within the U.S. once the government is forced into some fiscal responsibility and can no longer fund them. Similarly, even if these efforts get Far-East funding, it is unlikely they will succeed.

4. Government support could be more focused. Supporting the half-dozen companies by R&D and purchase orders just has to mean higher taxes that won't be repaid. On the other hand, continuing subsidies of the parallel machines is unrealistic in the 1990s if better architectures become available. A more realistic approach is to return to the policy of making the funds available to buy parallel machines, including ordinary supers, but to not force the purchase of particular machines.

Policy Issues

Supporting Circuit and Packaging Technology

There is an impression that the Japanese manufacturers provide access to their latest and fastest high-speed circuitry for building supercomputers. For example, CONVEX gets parts from Fujitsu for making cost-effective minisupercomputers, but these parts are not the components of fast-clock, highest-speed supercomputers. The CONVEX clock is two to 10 times slower than that of a Cray, Fujitsu, Hitachi, or NEC mainframe or super.

High-speed circuitry and interconnect packaging that involves researchers, semiconductor companies, and computer manufacturers must be supported. This effort is needed to rebuild the high-speed circuitry infrastructure. We should develop mechanisms whereby high-speed-logic R&D is supported by those who need it. Without such circuitry, traditional vector supercomputers cannot be built. Here are some things that might be done:

1. Know where the country stands vis-à-vis circuitry and packaging. Neil Lincoln described two developments at NEC in 1990—the SX-3 is running benchmark programs at a 1.9-nanosecond clock; one processor of an immersion-cooled GaAs supercomputer is operating at a 0.9-nanosecond clock.

2. Provide strong and appropriate support for the commercial suppliers who can and will deliver in terms of quality, performance, and cost. This infrastructure must be rebuilt to be competitive with Japanese suppliers. The Department of Defense's (DoD's) de facto industrial policy appears to support a small cadre of incompetent suppliers (e.g., Honeywell, McDonnell Douglas, Rockwell, Unisys, and Westinghouse) who have repeatedly demonstrated their inability to supply industrial-quality, cost-effective, high-performance semiconductors. The VHSIC program institutionalized the policy of using bucks to support the weak suppliers.

3. Build MOSIS facilities for the research and industrial community to use to explore all the high-speed technologies, including ECL, GaAs, and Josephson junctions. This would encourage a foundry structure to form that would support both the research community and manufacturers.

4. Make all DoD-funded semiconductor facilities available and measured via MOSIS. Eliminate and stop supporting the poor ones.

5. Support foundries aimed at custom high-speed parts that would improve density and clock speeds. DEC's Sam Fuller (a Session 13 presenter) described a custom, 150-watt ECL microprocessor that would operate at one nanosecond. Unfortunately, this research effort's only effect is likely to be a demonstration proof for competitors.

6. Build a strong packaging infrastructure for the research and startup communities to use, including gaining access to any industrial packages from Cray, DEC, IBM, and Microelectronics and Computer Technology Corporation.

7. Convene the supercomputer makers and companies who could provide high-speed circuitry and packaging. Ask them what's needed to provide high-performance circuits.

Supers and Security

For now, the supercomputer continues to be a protected species because of its use in defense. Also, like the Harley Davidson, it has become a token symbol of trade and competitiveness, as the Japanese manufacturers have begun to make computers with peak speeds equal to or greater than those from Cray Research or Cray Computer. No doubt, nearly all the functions supers perform for defense could be carried out more cheaply by using the alternative forms of computing described above.


Supers for Competitiveness

Large U.S. corporations are painstakingly slow, reluctant shoppers when it comes to big, traditional computers like supercomputers, mainframes, and even minisupercomputers. It took three years, for example, for a leading U.S. chemical company to decide to spring for a multimillion-dollar CRAY X-MP. And the entire U.S. automotive industry, which abounds in problems like crashworthiness studies that are ideal candidates for high-performance computers, has less supercomputer power than just one of its Japanese competitors. The super is right for the Japanese organization because a facility can be installed rapidly and in a top-down fashion.

U.S. corporations have been quicker to adopt distributed computing, almost by default. A small, creative, and productive part of the organization can and does purchase small machines to enhance its productivity. Thus, the one to 10 per cent of a U.S.-based organization that is responsible for 90 to 95 per cent of a corporation's output can and does benefit. For example, today almost all electronic CAD is done using workstations, and product gestation time is reduced for the companies that use these modern tools. A similar revolution in design awaits other engineering disciplines such as mechanical engineering and chemistry—but they must start.

The great productivity gain comes from the visualization made possible by interactive supercomputing substitutes, including the personal supercomputers that will appear in the next few years. A supercomputer is likely to increase the corporate bureaucracy and at the same time inhibit users from buying the right computer—the very users who must produce the results!

By far, the greatest limitation in the use of supercomputing is training. The computer-science community, which, by default, takes on much of the training for computer programming, is not involved in supercomputing. Only now are departments becoming interested in the highly parallel computers that will form the basis of this next (fifth) generation of computing.

Conclusions

Alternative forms for supercomputing promise the brightest decade ever, with machines that have the ability to simulate and interact with many important physical phenomena.


Large, slowly evolving central systems will continue to be supplanted by low-cost, personal, interactive, and highly distributed computing because of cost, adequate performance, significantly better performance/cost, availability, user friendliness, and all the other factors that caused users of mainframes and minis to abandon the more centralized structures for personal computing. By the year 2000, we expect nearly all personal computers to have the capability of today's supercomputer. This will enable all users to simulate the immense and varied systems that are the basis of technical computing.

The evolution of the traditional supercomputer must change to a more massively parallel and scalable structure if it is to keep up with the peak performance of evolving new machines. By 1995, specialized, massively parallel computers capable of a TFLOPS (10^12 floating-point operations per second) will be available to simulate a much wider range of physical phenomena.

Epilogue, June 1992

Clusters of 10 to 100 workstations are emerging as a high-performance parallel processing computer—the result of economic realities. For example, Lawrence Livermore National Laboratory estimates that it spends three times more on workstations, which are only 15 per cent utilized, than it does on supercomputers. Supers cost a dollar per 500 FLOPS and workstations about a dollar per 5000 FLOPS. Thus, about 25 times the power of the supers is available in the unused workstation cycles. A distributed network of workstations won the Gordon Bell Prize for parallelism in 1992.
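
The arithmetic behind the "25 times" figure is worth spelling out; the Python sketch below simply combines the three numbers quoted in the paragraph above.

    super_flops_per_dollar = 500
    workstation_flops_per_dollar = 5000
    spend_ratio = 3.0                      # 3x more spent on workstations than on supers
    utilization = 0.15                     # workstations are only 15 per cent utilized

    ws_power = spend_ratio * workstation_flops_per_dollar   # per dollar of super spending
    unused_ws_power = ws_power * (1.0 - utilization)
    print(unused_ws_power / super_flops_per_dollar)          # ~25.5, i.e., about 25 times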

The ability to use workstation clusters is enabled by a number of environments such as Linda, the Parallel Virtual Machine (PVM), and Parasoft Corporation's Express. High Performance Fortran (HPF) is emerging as a powerful standard to allow higher-level use of multicomputers (e.g., Intel's Paragon, Thinking Machines' CM-5), and it could also be used for workstation clusters as standardization of interfaces and clusters takes place.

The only inhibitor to natural evolution is that government, in the form of the High Performance Computing and Communications (HPCC) Initiative, and especially the Defense Advanced Research Projects Agency, is attempting to "manage" the introduction of massive parallelism by attempting to select winning multicomputers from its development-funded companies. The HPCC Initiative is focusing on peak TFLOPS at any price, and this may require an ultracomputer (i.e., a machine costing $50 to $250 million). Purchasing such a machine would be a mistake—waiting a single three-year generation will reduce prices by at least a factor of four.

In the past, the government, specifically the Department of Energy, played the role of a demanding but patient customer; it never funded product development and then managed the procurement of the result by the research community, as is being done now. This misbehavior means that competitors are denied the significant market of leading-edge users. Furthermore, by eliminating competition, weak companies and poor computers emerge. There is simply no need to fund computer development. This money would best be applied to attempting to use the plethora of extant machines—and, with a little luck, weeding out the poor machines that absorb and waste resources.

Whether traditional supercomputers or massively parallel computers provide more computing, measured in FLOPS per month by 1995, is the object of a bet between the author and Danny Hillis of Thinking Machines. Unless government continues to tinker with the evolution of computers by massive funding for massive parallelism, I believe supers will continue as the main source of FLOPS in 1995.

