
12—
EXPERIENCE AND LESSONS LEARNED

Panelists in this session discussed recent case histories of supercomputing developments and companies: technology problems, financial problems, and what went right or wrong and why. Presentations concerned recent work at Engineering Technology Associates Systems, Bolt Beranek & Newman Inc., FPS Computing, the National Security Agency's Project THOTH, and Princeton University's NSF center.

Session Chair

Lincoln Faurer,
Corporation for Open Systems



Supercomputing since 1983

Lincoln Faurer

Lincoln D. Faurer signed on with the Corporation for Open Systems (COS) International on April 7, 1986, as its first employee. Mr. Faurer is a voting ex officio member of the COS Board of Directors and Executive Committee. As Chair of the COS Strategy Forum, Mr. Faurer coordinates and recommends the overall technical direction of COS, and in his role as President and Chief Executive Officer, he oversees the day-to-day business affairs of the company.

Mr. Faurer came to COS after retiring from a distinguished 35-year Air Force career, where he achieved the rank of Lieutenant General. In 1986, President Reagan awarded the National Security Award to Lt. Gen. Faurer for "exemplary performance of duty and distinguished service" as the Director of the National Security Agency from April 1, 1981, through March 31, 1985. Mr. Faurer's prior positions in the Air Force included Deputy Chairman, NATO Military Committee, and Director of Intelligence, Headquarters, U.S. European Command.

Mr. Faurer is a graduate of the United States Military Academy in West Point, New York; the Rensselaer Polytechnic Institute in Troy, New York; and George Washington University in Washington, DC.



I suspect that we all take pleasure occasionally in looking back on events with which we proudly identify, and one such event for me was the establishment of the National Security Agency's (NSA's) Supercomputer Research Center. It was an idea spawned by the predecessor to this meeting, the first Frontiers of Supercomputing conference in 1983. A number of us from NSA left that meeting a little taken aback by what we perceived as a lack of urgency on the part of the government people in attendance. Sitting where we did at NSA, we thought there was a great deal more urgency to the supercomputer field, and we therefore set about creating a proposal that culminated in the establishment of a Supercomputer Research Center, which we ended up having to fund alone.

Therefore, it was really a pleasure when I was asked if I would play the role of presiding over one of the sessions at this conference. This session is designed to treat successes and failures, and the latter, I do believe, outnumber the former by a wide margin. But even the so-called failures can spawn successes. In any event, it is the things that are learned that matter, and they matter if we are willing to use what we learn and change.

The objectives of the 1983 conference were to bring together the best in government and industry to understand the directions that technology and requirements were taking. Yet it was a different industry at that time: small but totally dominant, with a still reasonably healthy U.S. microelectronics industry to support it. That, certainly, has now changed. More foreign competition, tougher technological demands to satisfy, and a weaker U.S. support industry (microelectronics, storage, etc.) affect us now.

So in 1983 at NSA, in addition to starting the Supercomputer Research Center, we announced our intent to buy a Heterogeneous Element Processor from Denelcor. It was their first advance sale. NSA took delivery and struggled for almost a year to get the machine up—our first UNIX four-processor system. However, it did not become operational, and Denelcor subsequently went under. The point is that we took a chance on a new architecture and lost out in the process but learned an important lesson: do not let a firm, even a new one, try to complete the development process in an operational setting.

We could not foresee at the 1983 conference all the changes of the past seven years—the changes in Cray Research, Inc., the loss of Engineering Technology Associates Systems, the loss of Evans & Sutherland's supercomputer division, the loss of Denelcor, etc. The major difference is that in 1983 the government market underpinned the industry to a far greater extent than it does today. As of 1990, this is no longer true. The government continues to set the highest demands and is still probably the technological leader, but its market does not and cannot sustain the industry. The supercomputer industry is turning increasingly to an industrial market, an academic market, and foreign markets. Strong pressures from Japan are certainly being encountered in the international marketplace.

One wonders, where will we be seven years from now? What will have occurred that we did not expect? I certainly hope that seven years from now we are not as badly off as some of the current discussion suggests we could be if we fail to do things right.

Looking back over the past seven years, a number of important developments should not be overlooked.

First, there is a strong trend toward open systems; it is within a niche of that open-systems world that I reside at the Corporation for Open Systems. Second, the evolution of UNIX as a cornerstone of high-performance computer operating systems has brought many pluses and a few minuses. Third, the growth of low-end systems, coupled with high-performance workstations (often now with special accelerator boards), has led to truly distributed high-performance computing in many environments. Fourth is the appearance, or imminent appearance, of massively parallel systems with some promise of higher performance at lower cost than traditional serial and vector-processing architectures. What remains to be seen is whether they can turn the corner from interesting machines to general-purpose systems.

Almost all of this is to the good. Some would argue that high-performance computing does remain, by and large, the research province of the U.S. Whether you accept that or not, it is critically important that we dominate the world market with U.S. machines based on U.S. research and composed, at least mostly, of U.S. parts. Aside from the obvious economic reasons, that dominance is very important to the nation's security and to the survival of U.S. leadership, and world scientific prominence, in areas like aerospace, energy exploration, and genetic research.



Lessons Learned

Ben Barker

William B. "Ben" Barker is President of BBN Advanced Computers Inc. and is Senior Vice President of its parent corporation, Bolt Beranek & Newman Inc. (BBN). He joined BBN in 1969 as a design engineer on the Advanced Research Projects Agency's ARPANET program and installed the world's first packet switch at the University of California, Los Angeles, in October 1969. In 1972 Dr. Barker started work on the architectural design of the Pluribus, the world's first commercial parallel processor, delivered in 1975. In 1979 he started the first of BBN's three product subsidiaries, BBN Communications Corporation, and was its president through 1982. Until April 1990, Dr. Barker served as BBN's Senior Vice President, Business Development, in which role he was responsible for establishing and managing the company's first R&D limited partnerships and for formulating initial plans for the company's entry into the Japanese market and for new business ventures, including BBN Advanced Computers.

Dr. Barker holds a B.A. in chemistry and physics and an M.S. and Ph.D. in applied mathematics, all from Harvard University. The subject of his doctoral dissertation was the architectural design of the Pluribus parallel processor.



Abstract

Bolt Beranek & Newman Inc. (BBN) has been involved in parallel computing for nearly 20 years and has developed several parallel-processing systems and used them in a variety of applications. During that time, massively parallel systems built from microprocessors have caught up with conventional supercomputers in performance and are expected to far exceed conventional supercomputers in the coming decade. BBN's experience in building, using, and marketing parallel systems has shown that programmer productivity and delivered, scalable performance are important requirements that must be met before massively parallel systems can achieve broader market acceptance.

Introduction

Parallel processing as a computing technology has been around almost as long as computers. However, it has only been in the last decade that systems based on parallel-processing technology have made it into the mainstream of computing. This paper explores the lessons learned about parallel computing at BBN and, on the basis of these lessons, our view of where parallel processing is headed in the next decade and what will be required to bring massively parallel computing into the mainstream.

BBN has a unique perspective on the trends and history of parallel computation because of its long history in parallel processing, dating back to 1972. BBN has developed several generations of computer systems based on parallel-processing technology and has engaged in advanced research in parallel algorithms and very-large-scale systems (Figure 1). In addition to parallel-processing research, BBN has been shipping commercial parallel processors since 1975 and has installed more than 300 systems. This represents approximately $100 million in investment of private and government funds in BBN parallel-processing technology and products.

BBN has developed extensive experience with parallel-processing systems during this 18-year period. Large numbers of these systems have been used by BBN in communications and simulation applications, most of which are still in operation today. BBN has also used BBN parallel systems in a wide range of research projects, such as speech recognition and artificial intelligence. This extensive experience using our own parallel computers is unique within the industry and has enabled BBN to better understand the needs of parallel computer users.



Figure 1.
BBN parallel processing systems and projects.

Parallel Processing:
1980 to 2000

In the 1980s, the high-performance computing industry gained experience with parallel processing on a small scale. Vendors such as Sequent Computer Systems and Alliant Computer Systems developed systems with up to tens of processors, and the major vendors, including IBM, Cray Research, and Digital Equipment Corporation, all began marketing systems with four to eight processors. Parallel processing on this scale has now become commonplace in the industry, with even high-end workstations and PC servers employing multiple CPUs.

A key development that helped bring about this acceptance was the symmetric multiprocessing (SMP) operating system. Typically based on UNIX®, but in some cases on proprietary operating systems, SMP operating systems made it much easier to use multiprocessor computers. All of these systems support shared memory, which is needed to develop the parallel operating system kernels used in SMP systems.

However, all of these systems have bus-based or crossbar architectures, limiting their scalability. The bus in a bus-based architecture has a fixed bandwidth, limited by the technology used and by the physical dimensions of the bus. This fixed bandwidth becomes a bottleneck as more processors are added because of the increase in contention for the bus. Crossbar architectures provide scalable bandwidth, but the cost of crossbars increases as the square of the number of ports, rendering them economically infeasible for more than a few dozen processors.
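The scaling contrast above can be made concrete with a little arithmetic. The following sketch uses illustrative numbers only (the 200-MB/s bus figure is an assumption for the example, not a measurement of any particular machine):

```python
def bus_bandwidth_per_processor(total_bus_bw_mb_s, n_processors):
    """On a shared bus, a fixed total bandwidth is divided among all
    processors, so each processor's share shrinks as 1/N."""
    return total_bus_bw_mb_s / n_processors

def crossbar_crosspoints(n_ports):
    """A full crossbar needs one switch point for every (input, output)
    pair, so hardware cost grows as the square of the port count."""
    return n_ports * n_ports

# Illustrative: a 200-MB/s bus serving 4 versus 64 processors.
assert bus_bandwidth_per_processor(200, 4) == 50.0
assert bus_bandwidth_per_processor(200, 64) == 3.125

# Crossbar cost explodes: 16 ports versus 256 ports.
assert crossbar_crosspoints(16) == 256
assert crossbar_crosspoints(256) == 65536
```

This 1/N bus share on one side and N-squared crossbar cost on the other is precisely the squeeze that motivates multistage interconnects such as the Butterfly switch, whose cost grows only as roughly N log N.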

In the 1990s, massively parallel computers based on scalable interconnects will become a mainstream technology, just as small-scale parallel processing did in the 1980s. The driving force is the economics involved in increasing the performance of the most powerful computers. It is becoming increasingly expensive in both dollars and time to develop succeeding generations of traditional ultra-high clock rate supercomputers and mainframes. Massively parallel systems will be the only affordable way to achieve the performance goals of the 1990s. This shift is made possible by three technologies discussed in the following sections:

• high-performance RISC microprocessors,

• advanced software, and

• versatile, scalable system interconnects.

The Attack of the Killer Micros

One of the key drivers in the high-performance computing industry is the disparity between the price/performance and overall performance gains of microprocessors versus conventional mainframes and vector supercomputers. As Figure 2 illustrates, the gains in microprocessor performance are far more rapid than those for supercomputers, with no end in sight for this trend. When looking at curves such as these, it seems obvious that high-performance microprocessors and parallel systems built from these microprocessors will come to dominate the high-end computing market; this is the "attack of the killer micros."

This changeover is only just now occurring. As Figure 3 illustrates, parallel systems are now capable of higher performance and better price/performance than traditional supercomputers. This transition occurred with the advent of RISC microprocessors, which provided sufficient floating point performance to enable parallel systems to rival supercomputers. This performance and price/performance gap will continue to widen in favor of parallel micro-based systems as microprocessor gains continue to outstrip those of supercomputers.
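The crossover implied by Figures 2 and 3 follows from compound growth alone. A minimal sketch, with growth rates assumed purely for illustration (roughly 60 percent per year for microprocessors versus 15 percent for supercomputers; neither figure is taken from the charts):

```python
def years_to_crossover(micro_perf, super_perf, micro_growth, super_growth):
    """Count whole years until the faster-improving microprocessor
    line overtakes the supercomputer line."""
    years = 0
    while micro_perf < super_perf:
        micro_perf *= micro_growth
        super_perf *= super_growth
        years += 1
    return years

# Illustrative: micros start 100x slower but improve much faster.
assert years_to_crossover(1.0, 100.0, 1.6, 1.15) == 14
```

Even a 100-fold head start disappears in well under two decades once the annual growth rates diverge, which is the arithmetic behind the "killer micro" argument.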



Figure 2.
Absolute performance gains of microprocessors versus supercomputers.

Figure 3.
Improvements in parallel processors versus supercomputers.



Programmer Productivity on Massively Parallel Systems

High performance and attractive price/performance are not enough to bring massively parallel systems into the computing mainstream. It is well known that only 10 to 20 percent of a computer center's budget goes to paying for computer hardware; the largest portion goes to paying for the people who write software and support the computers. Large gains in price/performance can be quickly erased if the system is difficult to use. To be accepted by a larger number of customers, massively parallel systems must provide ease of use and programmer productivity closer to those of current mainstream high-performance systems.

The conventional wisdom in the 1980s was that parallel systems are difficult to use because it is hard to parallelize code. However, many problems are naturally parallel and map readily onto parallel architectures. For these problems, the past was spent developing serial algorithms to solve them on single-CPU systems. Taking that serial code and parallelizing it is clearly not the most productive approach; it is more productive to map the parallel problem directly onto a parallel system.

Also, most computer systems today are parallel systems. Minicomputers, workstations, minisupers, supercomputers, even mainframes all have more than a single CPU. Clearly, parallelism itself isn't the only problem, since such systems from major computer vendors are now considered mainstream computers.

Yet there is a programmer productivity gap on most massively parallel systems, as illustrated in Figure 4. While the productivity on small-scale parallel systems now mirrors the traditionally higher productivity of uniprocessor systems, the productivity on massively parallel systems is still very low. Given that there are plenty of parallel problems and that parallel processing has reached the mainstream, what is still holding massively parallel systems back? The answer lies in their software development environment.

Figure 4.
The programmer-productivity gap.

Front End/Back End Versus Native UNIX

One key differentiator between most massively parallel systems and the successful mainstream parallel systems is the relationship of the development environment to the computer. In most massively parallel systems, the computer is an attached processor, or back end, to a front-end workstation, minicomputer, or personal computer (Figure 5). All software development and user interaction is done on the front end, whereas program execution runs on the back-end parallel system. BBN's early parallel processors, such as the Butterfly® I and Butterfly Plus systems, required such front ends. As we learned, there are several problems with this architecture, including

• Bottlenecks: the link between the front end and the back end is a potential bottleneck. It is frequently a local area network, such as Ethernet, with a very limited bandwidth compared with the requirements of high-end supercomputer applications.

Figure 5.
A front-end/back-end system.



• Debugging and tuning difficulties: because the software-development tools are separate from the parallel back end, it can be difficult to debug and tune programs. The tools on the front end cannot directly examine the memory of the back end and must rely on the back-end processors for information. If a program crashes some or all of the parallel nodes' kernels, the debugger may not be able to provide sufficient information.

• Slow development cycle: because development is done on a separate computer, the power of the parallel supercomputer is not available to run the development tools, such as the compiler. Also, executable program images must be downloaded into the back end, adding a step to the development cycle and further slowing down productivity.

• Multiple environments: while the front end typically runs UNIX, the back-end processors run a proprietary kernel. This requires the developer to learn two different environments.

• Limited kernel: the proprietary kernel that runs on the back-end processors does not provide all of the facilities that users expect on modern computers. These kernels provide little protection between tasks, no virtual memory, and few operating-system services.

Contrast this with modern supercomputers, mainframes, minicomputers, and workstations. All have native development environments, typically based on UNIX. This greatly simplifies development because the tools run on the same machine and under the same operating system as the executable programs. The full services of UNIX that are available to the programmer are also available to the executable program, including virtual memory, memory protection, and other system services. Since these systems are all shared-memory machines, powerful tools can be built for debugging and analyzing program performance, with limited intrusion into the program's operation.

Recent BBN parallel computers, such as the GP1000™ and TC2000™ systems, are complete, stand-alone UNIX systems and do not require a front end. The Mach 1000™ and nX™ operating systems that run on these computers contain a highly symmetric multiprocessing kernel that provides all of the facilities that users expect, including load balancing, parallel pipes and shell commands, etc. Since these operating systems present a standard UNIX interface and are compliant with the POSIX 1003.1 standard, users familiar with UNIX can begin using the system immediately. In fact, there are typically many users logged into and using a TC2000 system only hours after it is installed. This is in contrast to our earlier front-end/back-end systems, where users spent days or weeks studying manuals before running their first programs.



Single User versus Multiple Users

A second difference between mainstream computers and most massively parallel systems is the number of simultaneous users or applications that can be supported. Front-end/back-end massively parallel systems typically allow only a few users on the back end at one time. This style of resource scheduling is characterized by batch operation or "sign-up sheets." It is an adequate model for systems dedicated to a single application but is a step backward in productivity for multiuser environments when compared with mainstream computers that support timesharing operating systems. As has been known for many years, timesharing provides a means to utilize a computer system more fully. Raw peak MFLOPS (millions of floating-point operations per second) are not as important as the number of actual FLOPS used in real programs; unused FLOPS are wasted FLOPS. The real measure of system effectiveness is the number of solutions per year that the user base can achieve.

Early in BBN's use of the Butterfly I computer, we realized that flexible multiuser access was required in order to get the most productivity out of the system. The ability to cluster together an arbitrary number of processors was added to the Chrysalis™ operating system (and later carried forward into Mach 1000 and nX), providing a simple but powerful "space-sharing" mechanism to allow multiple users to share a system. However, in order to eliminate the front end and move general computing and software development activities onto the system, real time-sharing capabilities were needed to enable processors to be used by multiple users. The Mach 1000 and nX operating systems provide true time sharing.
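The space-sharing mechanism described above can be sketched as a simple allocator: each user claims a private cluster of processors out of a fixed pool. This is a toy model of the concept only; the class and method names are invented and bear no relation to the actual Chrysalis or nX interfaces:

```python
class SpaceSharedMachine:
    """Toy model of space-sharing: each user owns a private cluster
    of processors; unclaimed processors stay in a free pool."""

    def __init__(self, n_processors):
        self.free = set(range(n_processors))
        self.clusters = {}  # user -> set of processor ids

    def allocate(self, user, n):
        """Carve n processors out of the free pool for this user."""
        if n > len(self.free):
            raise RuntimeError("not enough free processors")
        cluster = {self.free.pop() for _ in range(n)}
        self.clusters[user] = cluster
        return cluster

    def release(self, user):
        """Return the user's cluster to the free pool."""
        self.free |= self.clusters.pop(user)

machine = SpaceSharedMachine(128)
machine.allocate("alice", 64)
machine.allocate("bob", 32)
assert len(machine.free) == 32   # 32 processors still unclaimed
machine.release("alice")
assert len(machine.free) == 96   # alice's cluster returns to the pool
```

Space-sharing gives each user dedicated hardware, but, as the text notes, true time sharing within a cluster is still needed before software development can move onto the machine itself.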

Interconnect Performance, System Versatility, and Delivered Performance

Related to the number of users that can use a system at the same time is the number of different kinds of problems that a system can solve. The more flexible a system is in terms of the programming paradigms it supports, the more solutions per year can be delivered. As we learned while adapting a wide range of applications to the early Butterfly systems, it is much more productive to program using a paradigm that is natural to the problem at hand than to attempt to force-fit the code to a machine-dependent paradigm. Specialized architectures do have a place running certain applications in which the specialized system's architecture provides very high performance and the same code will be run a large number of times. However, many problems are not well suited to these systems.

Current mainstream systems provide a very flexible programming environment in which to develop algorithms. Based on shared-memory architectures, these systems have demonstrated their applicability in a wide range of applications, from scientific problems to commercial applications. BBN's experience with our parallel systems indicates that shared-memory architectures are the best way to provide a multiparadigm environment comparable to mainstream systems. For example, the TC2000 uses the Butterfly® switch (BBN 1990) to provide a large, globally addressable memory space that is shared by the processors yet is physically distributed: "distributed shared memory." This provides the convenience of the shared-memory model for those applications to which it is suited while providing the scalable bandwidth of distributed memory. The TC2000's Butterfly switch also makes it an ideal system for programming with the message-passing paradigm, providing low message-transfer latencies.
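The difference between the two paradigms can be illustrated by writing the same computation both ways. A sketch using Python threads as stand-ins for processors (nothing here is TC2000-specific; on a real shared-memory machine the common total would live in the globally addressable memory):

```python
import threading
import queue

data = list(range(1000))

# Shared-memory paradigm: workers add into a common total under a lock.
total = 0
lock = threading.Lock()

def shared_worker(chunk):
    global total
    s = sum(chunk)
    with lock:          # serialize updates to the shared location
        total += s

# Message-passing paradigm: workers send partial sums to a collector.
results = queue.Queue()

def msg_worker(chunk):
    results.put(sum(chunk))   # no shared state; communicate by message

chunks = [data[i::4] for i in range(4)]   # split work across 4 "processors"
for target in (shared_worker, msg_worker):
    threads = [threading.Thread(target=target, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

assert total == sum(data)
assert sum(results.get() for _ in range(4)) == sum(data)
```

A machine with low-latency access to a global address space supports the first style directly, while its switch-based interconnect keeps the second style cheap as well, which is the multiparadigm point made above.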

Another key difference between mainstream systems and primitive massively parallel computers is the system's delivered performance on applications that tend to randomly access large amounts of memory. According to John Gustafson of NASA/Ames Laboratory, "Supercomputers will be rated by dollars per megabyte per second more than dollars per megaflop . . . by savvy buyers." A system's ability to randomly access memory is called its random-access bandwidth, or RAB. High RAB is needed for such applications as data classification and retrieval, real-time programming, sparse matrix algorithms, adaptive grid problems, and combinatorial optimization (Celmaster 1990).
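The notion of RAB can be approximated with a toy measurement: touch a large array at unpredictable indices and count accesses per second. This is only an illustrative sketch of the idea, not the portable random-access benchmark cited below:

```python
import random
import time

def random_access_rate(n_words=1_000_000, n_accesses=200_000):
    """Read memory at random indices and report accesses per second,
    a crude stand-in for random-access bandwidth (RAB)."""
    memory = [0] * n_words
    indices = [random.randrange(n_words) for _ in range(n_accesses)]
    start = time.perf_counter()
    acc = 0
    for i in indices:
        acc += memory[i]     # each iteration is one random word read
    elapsed = time.perf_counter() - start
    return n_accesses / elapsed

rate = random_access_rate()
assert rate > 0
```

Because the access pattern defeats caches and vector pipelines, the measured rate is governed by memory and interconnect latency rather than peak FLOPS, which is exactly why RAB separates machines that look similar on dense kernels.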

High-RAB systems can deliver high performance on a wider range of problems than can systems with low RAB and can provide the programmer with more options for developing algorithms. This is one of the strengths of the traditional supercomputers and mainframes and is a key reason why most massively parallel systems do not run certain parallel applications as well as one would expect. The TC2000 is capable of RAB that is comparable to, and indeed higher than, that of mainstream systems. Table 1 compares the TC2000 with several other systems on the portable random-access benchmark (Celmaster undated). For even higher RAB, the Monarch project (Rettberg et al. 1990) at BBN explored advanced switching techniques and designed a very-large-scale MIMD computer with the potential for several orders of magnitude more RAB than modern supercomputers. Lastly, BBN's experience using early Butterfly systems in real-time simulation and communications applications indicated that special capabilities were required for these areas. The real-time model places very demanding constraints on system latencies and performance and requires software and hardware beyond what is provided by typical non-real-time systems. These capabilities include a low-overhead real-time executive, low-latency access to shared memory, hardware support such as timers and globally synchronized real-time clocks, and support for the Ada programming language.

Table 1. Comparison of Random-Access Bandwidth

System            Number of     RAB (kilo random-access
                  Processors    words per second)
TC2000                  1              258
                       40            9,447
                      128           23,587
i860 Touchstone         1             ~2.5
                      128             ~300
IRIS 4D/240             1              349
                        4              779
CRAY Y-MP/832           1           28,700

Challenges and Directions for the Future

The challenge facing the vendors of massively parallel processors in the 1990s is to develop systems that provide high levels of performance without sacrificing programmer productivity. When comparing the next generation of parallel systems, it is the interconnect and memory architecture and the software that will distinguish one system from another. All of these vendors will have access to the same microprocessors, the same semiconductor technology, the same memory chips, and comparable packaging technologies. The ability to build scalable, massively parallel systems that are readily programmable will determine the successful systems in the future.

Most vendors have realized this and are working to enhance their products accordingly, as shown in Figure 6. The traditional bus-based



Figure 6.
Directions in architecture.

and crossbar-architecture systems have always held the lead in application versatility and programmer productivity but do not scale to massively parallel levels. Many of these vendors, such as Cray, have announced plans to develop systems that scale beyond their current tens of processors. At the same time, vendors of data-parallel and private-memory-MIMD systems are working to make their systems more versatile by improving interconnect latency, adding global routing or simulated shared memory, and adding more UNIX facilities to their node kernels. The direction in which all of this development is moving is toward a massively parallel UNIX system with low-latency distributed shared memory.

As in previous transitions in the computer industry, the older technology will not disappear but will continue to coexist with the new technology. In particular, massively parallel systems will coexist with conventional supercomputers, as illustrated in Figure 7. In this "model supercomputer center," a variety of resources are interconnected via a high-speed network or switch and are available to users. The traditional vector supercomputer will provide compute services for those problems that are vectorizable and primarily serial and will continue to run some older codes. The special-purpose application accelerators provide very high performance on select problems that are executed with sufficient frequency to justify the development cost of the application and the cost of the hardware. The general-purpose parallel system will off-load nonvector codes from the vector supercomputer and will provide a production environment for most new parallel applications. It will also serve as a testbed for parallel algorithm research and development.



Figure 7.
Model supercomputer center.

Summary

In the 1980s, parallel processing moved into the mainstream of computing technologies. The rapid increases in "killer micro" performance will enable massively parallel systems to meet the needs of high-performance users in the 1990s. However, in order to become a mainstream technology, massively parallel systems must close the programmer productivity gap that exists between them and small-scale parallel systems. The keys to closing this gap are standard languages with parallel extensions, native operating systems (such as UNIX), a powerful software development tool set, and an architecture that supports multiple programming paradigms.

Appendix:
BBN Parallel-Processing Systems

A summary of the BBN parallel-processing systems appears in Table 2. The Pluribus was BBN's first parallel-processing system. Developed in the early 1970s, with initial customer shipments in 1975, it consisted of up to 14 Lockheed Sue minicomputers interconnected via a bus-based distributed crossbar switch and supported shared global memory. It was used primarily in communications applications, many of which are still operational today.



Table 2. BBN Parallel Processing Systems

Pluribus:        Parallel Hardware; Shared Memory; Bus and Crossbar
Butterfly I:     Massively Parallel Hardware; Shared Memory; Butterfly Switch; Chrysalis; Uniform System
Butterfly Plus:  Performance Improvement; More Memory
GP1000:          Mach 1000; pSOS; TotalView; Parallel Fortran
TC2000:          10X Performance; nX, pSOS+m; Ada, C++; VME; Xtra; More Memory
Coral:           5-10X CPU Performance; 4X Package Density; More Memory; Compilers; HIPPI, FDDI; VME64; Tools and Libraries

The initial member of the Butterfly family of systems, the Butterfly I, was developed beginning in 1977. An outgrowth of the Voice Funnel program, a packetized voice satellite communications system funded by the Defense Advanced Research Projects Agency (DARPA), the Butterfly I computer was designed to scale to 256 Motorola, Inc., M68000 processors (a system of this size was built in 1985) but without giving up the advantages of shared memory. The key to achieving this scalability was a multistage interconnection network called the Butterfly switch. BBN developed the proprietary Chrysalis operating system, the Gist™ performance analyzer, and the Uniform System™ parallel programming library for this system. Butterfly I machines were used in a wide variety of research projects and are also used as Internet gateways when running communications code developed at BBN.

In the early 1980s, DARPA also funded BBN to explore very-large-scale parallel-processing systems. The Monarch project explored the design of a 65,536-processor shared-memory MIMD system using a multistage interconnection network similar to the Butterfly switch. The high-speed switch was implemented and tested using very-large-scale integration based on complementary metal oxide semiconductor technology, and a system simulator was constructed to explore the performance of the system on real problems. Some of the concepts and technologies have already been incorporated into Butterfly products, and more will be used in future generations.

The Butterfly Plus system was developed to provide improved processor performance over the Butterfly I system by incorporating Motorola's MC68020 processor and the MC68881 (later, the MC68882) floating-point coprocessor. Since this system used the same Butterfly switch, Butterfly I systems could be easily upgraded to Butterfly Plus performance.

The Butterfly Plus processor boards also included more memory and a memory-management unit, which were key to the development of the Butterfly GP1000 system. The GP1000 used the same processors as the Butterfly Plus but ran the Mach 1000 operating system, the world's first massively parallel implementation of UNIX. Mach 1000 was based on the Mach operating system developed at Carnegie Mellon University but was extended and enhanced by BBN. The TotalView™ debugger was another significant development that was first released on the GP1000.

The TC2000 system, BBN's newest and most powerful computer, was designed to provide an order of magnitude greater performance than previous Butterfly systems. The world's first massively parallel RISC system, the TC2000 employs the Motorola M88000 microprocessor and a new-generation Butterfly switch that has ten times the capacity of the previous generation. The TC2000 runs the nX operating system, which was derived from the GP1000's Mach 1000 operating system. The TC2000 also runs pSOS+m™, a real-time executive.

The goal of the Coral project is to develop BBN's next-generation parallel system for initial delivery in 1992. The Coral system is targeted at providing up to 200 GFLOPS peak performance using 2000 processors while retaining the shared-memory architecture and advanced software environment of the TC2000 system, with which Coral will be software compatible.



The John von Neumann Computer Center:
An Analysis

Al Brenner

Alfred E. Brenner is Director of Applications Research at the Supercomputing Research Center, a division of the Institute for Defense Analyses, in Bowie, Maryland. He was the first president of the Consortium for Scientific Computing, the corporate parent of the John von Neumann Computer Center. Previously, he was head of the Computing Department at Fermi National Accelerator Laboratory. He has a bachelor's degree in physics and a Ph.D. in experimental high-energy physics from MIT.

Introduction

I have been asked to discuss and analyze the factors involved in the demise of the supercomputer center in Princeton, New Jersey, funded through the NSF Office of Advanced Scientific Computing (OASC)—the John von Neumann Center (JVNC). My goal is to extract the factors that contributed to the failure so that the experience can be used to avoid such failures in the future. Analysis is much easier in hindsight than before the fact, so I will try to be as objective as I can in my analysis.

The "Pre-Lax Report" Period

During the 1970s, almost all of the supercomputers installed were found in government installations and were not generally accessible to the university research community. For those researchers who could not gain access to these supercomputers, this was a frustrating period. A few found it was relatively easy to obtain time on supercomputers in Europe, especially in England and West Germany.

By the end of the decade, a number of studies, proposals, and other attempts had been made to generate funds to make large-scale computational facilities available to some of the university research community. All of this was happening during a period when U.S. policy was tightening rather than relaxing the mechanisms for acquiring large-scale computing facilities.

The Lax Report

The weight of argument in the reports from these studies and proposals moved NSF to appoint Peter Lax, of New York University, an NSF Board Member, as chairman of a committee to organize a Panel on Large-Scale Computing in Science and Engineering. The panel was sponsored jointly by NSF and the Department of Defense in cooperation with the Department of Energy and NASA. The end product of this activity was the "Report of the Panel on Large-Scale Computing in Science and Engineering," usually referred to as the Lax Report, dated December 26, 1982.

The recommendations of the panel were straightforward and succinct. The overall recommendation was for the establishment of a national program to support the expanded use of high-performance computers. The four components of the program were

• increased access to supercomputing facilities for scientific and engineering research;

• increased research in computational mathematics, software, and algorithms;

• training of personnel in high-performance computing; and

• research and development for the implementation of new supercomputer systems.

The panel indicated that insufficient funds were being expended at the time and suggested an interagency and interdisciplinary national program.

Establishment of the Centers

In 1984, once the NSF acquired additional funding from Congress for the program, NSF called for proposals to establish national supercomputer centers. Over 20 proposals were received, and these were evaluated in an extension of the usual NSF peer-review process. In February 1985, NSF selected four of the proposals and announced awards to establish four national supercomputer centers. A fifth center was added in early 1986.


The five centers are organizationally quite different. The National Center for Supercomputing Applications at the University of Illinois at Urbana-Champaign and the Cornell Theory Center are formally operated by the universities in which those centers are located. The JVNC is managed by a nonprofit organization, the Consortium for Scientific Computing, Inc. (CSC), established solely to operate this center. The San Diego Supercomputer Center is operated by the for-profit General Atomics Corporation and is located on the campus of the University of California at San Diego. Finally, the Pittsburgh Supercomputing Center is run jointly by the University of Pittsburgh, Carnegie Mellon University, and Westinghouse Electric Corporation. NSF established the OASC, reporting directly to the Director of NSF, as the program office through which to fund these centers.

While the selected centers were being established (these centers were called Phase 2 centers), NSF supported an extant group of supercomputing facilities (Phase 1 centers) to start supplying cycles to the research community at the earliest possible time. Phase 1 centers included Purdue University and Colorado State University, both with installed CYBER 205 computers; and the University of Minnesota, Boeing Computer Services, and Digital Productions, Inc., all with CRAY X-MP equipment. It is interesting to note that all these centers, which had been established independent of the OASC initiative, were phased out once the Phase 2 centers were in operation. All Phase 1 centers are now defunct as service centers for the community, or they are at least transformed rather dramatically into quite different entities. Indeed, NSF "used" these facilities, supported them for a couple of years, and then set them loose to "dry up."

From the very beginning, it was evident there were insufficient funds to run all Phase 2 centers at adequate levels. In almost all cases, the centers have been working within very tight budgets from the beginning, which has forced difficult management decisions and a less aggressive program than the user community demands. However, with a resource as scarce and expensive as supercomputers, such limitations are not unreasonable. During the second round of funding for an additional five-year period, NSF concluded that the JVNC should be closed. The closing of that center will alleviate some of the fiscal pressure on the remaining four centers. Let us now focus on the JVNC story.


The John von Neumann Center

The Proposal

When the call for proposals went out in 1984 for the establishment of the national supercomputer centers, a small number of active and involved computational scientists and engineers, some very closely involved with the NSF process in establishing these centers, analyzed the situation very carefully and generated a strategy that had a very high probability of placing their proposal in the winning set. One decision was to involve a modest number of prestigious universities in a consortium such that the combined prominence of the universities represented would easily outweigh almost any competition. Thus, the consortium included Brown University, Harvard University, the Institute for Advanced Study, MIT, New York University, the University of Pennsylvania, Pennsylvania State University, Princeton University, the University of Rochester, Rutgers University, and the Universities of Arizona and Colorado. (After the establishment of the JVNC, Columbia University joined the consortium.) This was a powerful roster of universities indeed.

A second important strategy was to propose a machine likely to be different from most of the other proposals. At the time, leaving aside IBM and Japan, Inc., the only two true participants were Cray Research and Engineering Technology Associates Systems (ETA). The CRAY X-MP was a mature and functioning system guaranteed to be able to supply the necessary resources for any center. The ETA-10, a machine under development at the time, had much potential and was being designed and manufactured by an experienced team that had spun off from Control Data Corporation (CDC). The ETA-10, if delivered with the capabilities promised, would exceed the performance of the Cray Research offerings at the time. A proposal based on the ETA-10 was likely to be a unique proposal.

These two strategic decisions were the crucial ones. Also, there were other factors that made the proposal yet more attractive. The most important of these was the aggressive networking stance of the proposal in using high-performance communications links to connect the consortium-member universities to the center.

Also, the plan envisioned a two-stage physical plant, starting with temporary quarters to house the center at the earliest possible date, followed by a permanent building to be occupied later. Another feature was to contract out the actual operations functions to one of the firms experienced in the operation of supercomputing centers at other laboratories.


Finally, the proposal was nicely complemented with a long list of proposed computational problems submitted by faculty members of the 12 founding institutions. Although these additional attributes of the proposal were not unique, they certainly enhanced the strong position of a consortium of prestigious universities operating a powerful supercomputer supplied by a new corporation supported by one of the most prominent of the old-time supercomputer firms. It should surprise no one that on the basis of peer reviews, NSF found the JVNC proposal to be an attractive one.

I would like now to explore the primary entities involved in the establishment, operation, funding, and oversight of the JVNC.

Consortium for Scientific Computing

The CSC is a nonprofit corporation formed by the 12 universities of the consortium for the sole purpose of running the JVNC. Initially, each university was to be represented within the consortium by the technical representative who had been the primary developer of the proposal submitted to NSF. Early in the incorporation process, representation on the consortium was expanded to include two individuals from each university—one technical faculty member and one university administrator. The consortium Board of Directors elected an Executive Committee from its own membership. This committee of seven members, as in normal corporate situations, wielded the actual power of the consortium. The two most important functions of the CSC were (1) the appointment of a Chief Operating Officer (the President) and (2) the establishment of policies guiding the activities of the center. As we analyze what went wrong with the JVNC, we will see that the consortium, in particular the Executive Committee, did not restrict itself to these functions but ranged broadly over many activities, to the detriment of the JVNC.

The Universities

The universities were the stable corporate entities upon which the consortium's credibility was based. Once the universities agreed to go forth with the proposal and the establishment of the consortium, they played a very small role.

The proposal called for the universities to share in the support of the centers. Typically, the sharing was done "in kind" and not in actual dollars, and the universities were involved in establishing the bartering chips that were required.


The State of New Jersey

The State of New Jersey supported the consortium enthusiastically. It provided the only truly substantial, expendable dollar funding to the JVNC above the base NSF funding. These funds were funneled through the New Jersey State Commission for Science and Technology. The state was represented on the consortium board by one nonvoting member.

The NSF

NSF had moved forward on the basis of the proposals of the Lax Report and, with only modest previous experience with such large and complex organizations, established the five centers. The OASC reported directly to the Director of NSF to manage the cooperative agreements with the centers. Most of the senior people in this small office were tapped from other directorates within NSF to take on difficult responsibilities, and these people often had little or no prior experience with supercomputers.

ETA

In August 1983, ETA had been spun off from CDC to develop and market the ETA-10, a natural follow-on of the modestly successful CYBER 205 line of computers. The reason for the establishment of ETA was to insulate from the rest of CDC the ETA development team and its very large demands for finances. This was both to allow ETA to do its job and to protect CDC from an arbitrary drain of resources.

The ETA machine was a natural extension of the CYBER 205 architecture. The primary architect was the same individual, and much of the development team was the same, as had been involved in the development of the CYBER 205.

Zero One

The JVNC contracted the actual daily operations of its center to an experienced facilitator. The important advantage to this approach was the ability to move forward as quickly as possible by using the resources of an already extant entity with an existing infrastructure and experience.

Zero One, originally Technology Development Corporation, was awarded the contract because it had experience in operating supercomputing facilities at NASA Ames, and it appeared to have an adequate, if not large, personnel base. As it turned out, apart from a small number of people, all of the personnel assigned to the JVNC were newly hired.


JVNC

During the first half of 1985, the consortium moved quickly and initiated the efforts to establish the JVNC. One of the first efforts was to find a building. Once all the factors were understood, rather than the proposed two-phase building approach, it was decided to move forward with a permanent building as quickly as possible and to use temporary quarters to house personnel, but not equipment, while the building was being readied.

The site chosen for the JVNC was in the Forrestal Research Center off Route 1, a short distance from Princeton University. The building shell was in place at the time of the consortium's commitment, and only the interior "customer modification" was required. Starting on July 1, 1986, the building functioned quite well for the purposes of the JVNC.

A small number of key personnel were hired. Contracts were written with the primary vendors. The Cooperative Agreement to define the funding profile and the division of responsibility between the consortium and the NSF was also drawn up.

What Went Wrong?

The Analysis

The startup process at JVNC was not very different from the processes at the other NSF-funded supercomputing centers. Why are they still functioning today while the JVNC is closed? Many factors contributed to the lack of strength of the JVNC. As with any other human endeavor, if one does not push in all dimensions to make it right, the sum of a large number of relatively minor problems might mean failure, whereas a bit more strength or possibly better luck might make for a winner.

I will first address the minor issues that, without more detailed knowledge, may sometimes be thought to have been more important than they actually were. I will then address what I believe were the real problems.

Location

Certainly, the location of the JVNC building was not conducive to a successful intellectual enterprise. Today, with most computer accesses occurring over communications links, it is difficult to promote an intellectually vibrant community at the hardware site. If the hardware is close by, on the same campus or in the same building where many of the user participants reside, there is a much better chance of generating the collegial spirit and intellectual atmosphere for the center and its support personnel. The JVNC, in a commercial industrial park sufficiently far from even its closest university customers, found itself essentially in isolation.

Furthermore, because of the meager funding that allowed no in-house research-oriented staff, an almost totally vacuous intellectual atmosphere existed, with the exception of visitors from the user community and the occasional invited speaker. For those centers on campuses or for those centers able to generate some internal research, the intellectual atmosphere was certainly much healthier and more supportive than that at the JVNC.

Corporate Problems

Some of the problems the JVNC experienced were really problems that emanated from the two primary companies that the JVNC was working with: ETA and Zero One. The Zero One problem was basically one of relying too heavily on a corporate entity that actually had very little flex in its capabilities. At the beginning, it would have been helpful if Zero One had been able to better use its talent elsewhere to get the JVNC started, but it was not capable of doing that, with one or two exceptions. The expertise it had, although adequate, was not strong, so the relationship JVNC had with Zero One was not particularly effective in establishing the JVNC. Toward the end of June 1989, JVNC terminated its relationship with Zero One and took on the responsibility of operating the center by itself. Consequently, the Zero One involvement was not an important factor in the long-term JVNC complications.

The problems experienced in regard to ETA were much more fundamental to the demise of JVNC. I believe there were two issues that had a direct bearing on the status of the JVNC. The first was compounded by the inexperience of many of the board members. When the ETA-10 was first announced, the clock cycle time was advertised as five nanoseconds. By the time contractual arrangements had been completed, it was clear the five-nanosecond time was not attainable and that something more like seven or eight nanoseconds was the best goal to be achieved. As we know, the earliest machines were delivered with cycle times twice those numbers. The rancor and associated interactions concerning each of the entities' understanding of the clock period early in the relationship took what could have been a cooperative interaction and essentially poisoned it. Both organizations were at fault. ETA advertised more than they could deliver, and the consortium did not accommodate the facts.


Another area where ETA failed was in its inability to understand the importance of software to the success of the machine. Although the ETA hardware was first-rate in its implementation, the decision to make the ETA-10 compatible with the CYBER 205 had serious consequences. The primary operating-system efforts were to replicate the functionality of the CYBER 205 VSOS; any extensions would be shells around that native system. That decision and a less-than-modern approach to the implementation of the approach bogged down the whole software effort. One example was the high-performance linkages; these were old, modified programs that gave rise to totally unacceptable communications performance. As the pressures mounted for a modern operating system, in particular UNIX, the efforts fibrillated, no doubt consuming major resources, and never attained maturity. The delays imposed by these decisions certainly were not helpful to ETA or to the survival of the JVNC.

NSF, Funding, and Funding Leverage

We now come to an important complication, not unique to the JVNC but common to all of the NSF centers. To be as aggressive as possible, NSF extended itself as far as the funding level for the OASC would allow and encouraged cost-sharing arrangements to leverage the funding. This collateral funding, which came from universities, states, and corporate associates interested in participating in the centers' activities, was encouraged, expected, and counted upon for adequate funding for the centers.

As the cooperative agreements were constructed in early 1985, the funding profiles for the five-year agreements were laid out for each individual center's needs. The attempt to meet that profile was a painful experience for the JVNC management, and I believe the same could be said for the other centers as well. For the JVNC, much of the support in kind from universities was paper; indeed, in some cases, it was closer to being a reverse contribution.

As the delivery of the primary computer equipment to JVNC was delayed while some of the other centers were moving forward more effectively, the cooperative agreements were modified by NSF to accommodate these changes and stay within the actual funding profile at NSF. Without a modern functioning machine, the JVNC found it particularly difficult to attract corporate support. The other NSF centers, where state-of-the-art supercomputer systems were operational, were in much better positions to woo industrial partners, and they were more successful. Over the five-year life of the JVNC, only about $300,000 in corporate support was obtained; that was less than 10 per cent of the proposed amount and less than three-quarters of one per cent of the actual NSF contribution.

One corporate entity, ETA, contributed a large amount to the JVNC. Because the delivery of the ETA-10 was so late, the payment for the system was repeatedly delayed. The revenue that ETA expected from delivery of the ETA-10 never came. Thus, in a sense, the hardware that was delivered to the JVNC—two CYBER 205 systems and the ETA-10—represented a very large ETA corporate contribution to the JVNC. The originally proposed ETA contribution, in discounts on the ETA-10, personnel support, and other unbilled services, was $9.6 million, which was more than 10 per cent of the proposed level of the NSF contribution.

A year after the original four centers were started, the fiscal stress in the program was quite apparent. Nevertheless, NSF chose to start the fifth center, thereby spreading its resources yet thinner. It is true that the NSF budgets were then growing, and it may have seemed to the NSF that it was a good idea to establish one more center. In retrospect, the funding level was inadequate for a new center. Even today, the funding levels of all the centers remain inadequate to support dynamic, powerful centers able to maintain strong, state-of-the-art technology.

Governance

I now come to what I believe to be the most serious single aspect that contributed to the demise of the JVNC: governance. The governance, as I perceive it, was defective in three separate domains, each defective in its own right but all contributing to the primary failure, which was the governance of the CSC. The three domains I refer to are the universities, NSF, and the consortium itself.

Part of the problem was that the expectations of almost all of the players far exceeded the possible realities. With the exception of the Director of NSF, there was hardly a person directly or indirectly involved in the governance of the JVNC who had any experience as an operator of such complex facilities as the supercomputing centers represented. Almost all of the technical expertise was as end users. This was true for the NSF OASC and for the technical representatives on the Board of Directors of the consortium. The expertise, hard work, maturation, and planning needed for multi-million-dollar computer acquisitions were unknown to this group. Their expectations both in time and in performance levels attainable at the start-up time of the center were totally unrealistic.

At one point during the course of the first year, when difficulties with ETA meeting its commitments became apparent, the consortium negotiated the acquisition of state-of-the-art equipment from an alternate vendor. To move along expeditiously, the plan included acquiring a succession of two similar but incompatible supercomputing systems from that vendor, bringing them up, networking them, educating the users, and bringing them down in sequence—all over a nine-month period! This was to be done in parallel with the running of the CYBER 205, which was then to be the ETA interim system—all of this with the minuscule staff at JVNC. At a meeting where these plans were enunciated to NSF, the Director of NSF very vocally expressed his consternation about, and disbelief in, the viability of the proposal. The OASC staff, the actual line managers of the centers, had no sense of the difficulty of the process being proposed.

At a meeting of the board of the consortium, the board was frustrated by the denial of this alternate approach that had by then been promulgated by NSF. A senior member of the OASC, who had participated in the board meeting but had not understood the nuances of the problem, failed to make the issues clear when given the opportunity to do so, thereby allowing misconceptions to stand that were to continue to plague the JVNC. I believe that incident, which was one of many, typified a failure in governance on the part of NSF's management of the JVNC Cooperative Agreement.

With respect to the consortium itself, the Executive Committee, which consisted of the small group of people who had initiated the JVNC proposal, insisted on managing the activities as they did their own individual research grants. On a number of occasions, the board was admonished by the nontechnical board members to allow the president to manage the center. At no point did that happen during the formation of the JVNC.

These are my perceptions of the first year of operation of the JVNC. I do not have first-hand information about the situation during the remaining years of the JVNC. However, leaving aside the temporary management provided by a senior Princeton University administrator on a number of occasions, the succession of three additional presidents of the consortium over the next three years surely supports the premise that the problems were not fixed.

Since NSF was not able to do its job adequately in its oversight of the consortium, where were the university presidents during this time? The universities were out of the picture because they had delegated their authority to their representatives on the board. In one instance, the president of Princeton University did force a change in the leadership of the Board of Directors to try to fix the problem. Unfortunately, that action was not coupled to a simultaneous change of governance that was really needed to fix the problem. One simple fix would have been to rotate the cast of characters through the system at a fairly rapid clip, thereby disengaging the inside group that had initiated the JVNC.

Although the other centers had to deal with the same NSF management during the early days, their governance typically was in better hands. Therefore, they were in a better position to accommodate the less-than-expert management within the NSF. Fortunately, by the middle of the second year, the NSF had improved its position. A "rotator" with much experience in operating such centers was assigned to the OASC. Once there was a person with the appropriate technical knowledge in place at the OASC, the relationship between the centers and the NSF improved enormously.

Conclusions

I have tried to expose the problems that contributed to the demise of the JVNC. In such a complex and expensive enterprise, not everything will go right. Certainly many difficult factors were common to all five centers. It was the concatenation of the common factors with the ones unique to the JVNC that caused its demise and allowed the other centers to survive. Of course, once the JVNC was removed, the funding profile for the other centers must have improved.

In summary, I believe the most serious problem was the governance arrangements that controlled the management of the JVNC. Here seeds of failure were sown at the inception of the JVNC and were not weeded out. A second difficulty was the lack of adequate funding. I believe the second factor continues to affect the other centers and is a potential problem for all of them in terms of staying healthy, acquiring new machines, and maintaining challenging environments.

I have tried as gently as possible to expose the organizations that were deficient and that must bear some responsibilities for the failure of the JVNC. I hope, when new activities in this vein are initiated in the future, these lessons will be remembered and the same paths will not be traveled once again.


Project THOTH:
An NSA Adventure in Supercomputing, 1984–88

Larry Tarbell

Lawrence C. Tarbell, Jr., is Chief of the Office of Computer and Processing Technology in the Research Group at the National Security Agency (NSA). His office is charged with developing new hardware and software technology in areas such as parallel processing, optical processing, recording, natural language processing, neural nets, networks, workstations, and database-management systems and then transferring that technology into operational use. For the last five years, he has had a special interest in bringing parallel processing technology into NSA for research and eventually operational use. He served as project manager for Project THOTH at NSA for over two years. He is the chief NSA technical advisor for the Supercomputing Research Center.

Mr. Tarbell holds a B.S. in electrical engineering from Louisiana State University and an M.S. in the same field from the University of Maryland. He has been in the Research Group at the NSA for 25 years.

I will discuss an adventure that the National Security Agency (NSA) undertook from 1984 to 1988, called Project THOTH. The project was named THOTH because the person who had the authority to choose the name liked Egyptian gods. There is no other meaning to the name.

NSA's goal for Project THOTH was to design and build a high-speed, high-capacity, easy-to-use, almost-general-purpose supercomputer. Our performance goal, which we were not really sure we could meet but really wanted to shoot for, was a system with 1000 times the performance of a Cray Research CRAY-1S on the kinds of problems we had in mind. The implication of that, which I do not think we really realized at the time, was that we were trying to skip a generation or so in the development of supercomputers. We really hoped that, at the end, THOTH would be a general-purpose, commercial system.

The high-level requirements we set out to achieve were performance in the range of 50 to 100 × 10⁹ floating-point operations per second (GFLOPS), with lots of memory, a functionally complete instruction set (because we needed to work with integers, with bits, and with floating-point numbers), and state-of-the-art software. We were asking an awful lot! In addition, we wanted to put this machine into the operational environment we already had at NSA so that workstations and networks could have access to this system.
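The GFLOPS requirement and the "1000 times a CRAY-1S" goal are consistent with each other, as a quick sanity check shows (a sketch only; the CRAY-1S sustained rate used here is a rough figure from general literature, not a number given in the talk):

```python
# Relate the stated THOTH requirement (50-100 GFLOPS) to the goal of
# "1000 times the performance of a CRAY-1S". The sustained range below
# is an assumed ballpark figure, not taken from the original text.
CRAY_1S_SUSTAINED_MFLOPS = (50, 100)  # assumed sustained performance range
SPEEDUP_GOAL = 1000                   # "1000 times the CRAY-1S"

# Convert MFLOPS * speedup into GFLOPS (divide by 1000).
goal_gflops = tuple(m * SPEEDUP_GOAL / 1000 for m in CRAY_1S_SUSTAINED_MFLOPS)
print(goal_gflops)  # (50.0, 100.0) -- matches the 50-to-100-GFLOPS requirement
```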

This was a joint project between the research group that I was in and the operations group that had the problems that needed to be solved. The operations organization wanted a very powerful system that would let them do research in the kinds of operational problems they had. They understood those problems, and they understood the algorithms they needed to use for them. We supposedly understood something about architecture and software and, more importantly, about how to deal with contracting, which turned out to be a big problem.

After a lot of thought, we came up with the project structure. To characterize the kinds of computations we wanted to do, the operations group developed 24 sample problems. We realized our samples were not complete, but they were representative. In addition, we asked ourselves to imagine that there were 24 or 48 or 100 other kinds of problems that needed the same kind of computation.

These problems were not programs; they were just mathematical and word statements because we did not want to bias anybody by using old code. We were going to use these problems to try to evaluate the performance of the architectures we hoped would come out of THOTH. So they were as much a predictor of success as they were a way to characterize what we needed to do.

We ended up deciding to approach the project in three phases. The first phase was architectural studies, where we tried to look at a broad range of options at a really high level and then tried to choose several possibilities to go into a detailed design. For the second phase, we hoped at the end to be able to choose one of the detailed designs to actually build something. The third phase was the implementation of THOTH. We
probably deluded ourselves a bit by saying that what we were trying to do was not to develop a new supercomputer; we were just trying to hasten the advent of something that somebody was already working on. In the end, that was not true.

We chose contractors using three criteria. First, they had to be in the business of delivering complete systems with hardware and software. Second, they had to have an active and ongoing R&D program in parallel processing and supercomputing. Third, we wanted them to have at least one existing architecture already under development because we thought that, if they were already involved in architecture development, we would have a much better chance of success.

We had envisioned six such companies participating in Phase 1. We hoped to pick the two best from Phase 1 to go into Phase 2. Then we hoped at the end of Phase 2 to have two good alternatives to choose from, to pick the better of those, and to go with it. We also had some independent consulting contractors working with us who were trying to make sure that we had our feet on the ground.

Figure 1 shows a time line of what actually happened. The concept for this project started in 1984. It was not until late 1985 that we actually had a statement of work that would be presented to contractors. Following the usual government procedure, it was not until mid-1986 that we actually had some contracts under way. Thus, we spent a lot of time getting started, and while we were getting started, technology was moving. We were trying to keep up with the technology, but that was a hard thing to do.

Figure 1.
THOTH time line.

At the beginning of 1985, we had as our first guess that we would get a machine in late 1991. By the time we got around to beginning the second phase, it was clear that the absolute best case for a machine delivery would be late in 1992, if then.

We ended up with nine companies invited to participate in Phase 1, seven of which chose to participate with us in a nine-month, low-cost effort. Phase 1 had three parts. The first part, the definition study, was for the companies to show us that they understood what we were after. They were supposed to analyze the problems, to show us the potential for parallelism in the problems, and to talk about how their potential architecture might work on those problems. In the architectural study, we actually asked them to formally assess their favorite architecture as to how it performed against the THOTH problems that we had given them. In the third task, we actually paid them to produce a formal proposal for a high-level hardware/software system that would be designed in the second phase if we had enough belief in what they were trying to do. This took a lot more work on our part than we had initially envisioned. It essentially burned up the research organization involved in this; we did nothing else but this. A similar thing happened with the operations group and with some of the people that were supporting us.

We had some problems in Phase 1. I had not expected much personnel change in a nine-month phase, but in one company we had a corporate shakeup. They ended up with all new people and a totally new technical approach, and they waited until they were 40 per cent into the project to do this. That essentially sunk them. In several companies we had major personnel changes in midstream, and in nine months there was not time to catch up from changes in project managers.

Several companies had a lack of coordination because they had software people on one side of the country and hardware people on another side of the country, and once again that presented some problems. Even though most of them told us that they really understood our requirements, it turned out that they really had not fully understood them. I suppose we should not have been surprised. In one case, the division that was doing THOTH was sold. The project was then transferred from the division that was sold to another part of the company that actually built computers. The late submission of reports did not surprise us at all, but in nine months this caused us some problems. Also, there
was one company that just could not pick out which way they wanted to go. They had two possible ways, and they never chose.

The first result in Phase 1 was that two companies dropped out before Phase 1 was over. We got five proposals at the end of Phase 1, but then one company subsequently removed their proposal, leaving us with only four. We ended up choosing three of those four, which surprised us. But when we sat down and looked at what was presented to us, it looked like there were three approaches that had some viability to them. Strangely enough, we had enough money to fund all three, so we did.

Then we moved into Phase 2, detailed design. Our goal was to produce competing detailed system designs from which to build THOTH, and we had three companies in the competition to produce a detailed design. That was what we thought would happen. They were to take the high-level design they had presented us at the end of Phase 1, refine it, and freeze it. We went through some detailed specifications, preliminary design reviews, and critical design reviews. Near the end of this one-year phase, they had to tell us that the performance they predicted in Phase 1 was still pretty much on target, and we had to believe them. This was a one-year, medium-cost effort, and once again, it cost us lots more in people than we had expected that it would.

Phase 2 had some problems. Within one company, about three months into Phase 2, the THOTH team was reorganized and moved from one part of the company to another. This might not have been so bad except that for three months they did not have any manager, which led to total chaos. One company spent too much time thinking about hardware at the expense of software, which was exactly the reverse of what they had done in Phase 1, and that really shocked us. That, nevertheless, is what happened. Another company teamed with a software firm, which was an interesting arrangement. Midway through, at a design review, the software firm stood up and quit right there. Another company had problems with their upper management. The lower-level people actually running the project had a pretty good sense of what to do, but upper management kept saying no, that is not the right way to do it—do it our way. They had no concept of what was going on, so that did not work out very well either.

There was a lack of attention to packaging and cooling, there were high compiler risks and so forth, and the real killer was that the cost of this system began to look a lot more expensive than we had thought. We had sort of thought, since we were going to be just accelerating somebody else's development, that $30 or $40 million would be enough. That ended up not being the case at all.

Several results came from Phase 2. One company withdrew halfway through. After the costs began to look prohibitive, we finally announced to the two remaining contractors that the funding cap was $40 million, which they agreed to compete for. We received two proposals. One was acceptable, although just barely. In the end, NSA's upper management canceled Project THOTH. We did not go to Phase 3.

Why did we stop? After Phase 2, we really only had one competitor that we could even think about going with. We really had hoped to have two. The one competitor we had looked risky. Perhaps we should have taken that risk, but at that time budgets were starting to get tighter, and upper management just was not willing to accept a lot of risk. Had we gone to Phase 3, it would have doubled the number of people that NSA would have had to put into this—at a time when our personnel staff was being reduced.

Also, the cost of building the system was too high. Even though the contractor said they would build it for our figure, we knew they would not.

There was more technical risk than we wanted to have. The real killer was that when we were done, we would have had a one-of-a-kind machine that would be very difficult to program and that would not have met the goals of being a system for researchers to use. It might have done really well for a production problem where you could code up one problem and then run it forever, but it would not be suited as a mathematical research machine.

So, in the end, those are the reasons we canceled. If only one or two of those problems had come up, we might have gone ahead, but the confluence of all those problems sunk the whole project.

As you can imagine from an experience like this, you learn some things. While we were working on THOTH, the Supercomputing Research Center (SRC) had a paper project under way, called Horizon, that began to wind down just about the time that THOTH was falling apart. So in April 1989, after the THOTH project was canceled in December, we held a workshop to talk about what we had learned from both of these projects together. Some of our discoveries were as follows:

• We did get some algorithmic insights from the Phase 1 work, and even though we never built a THOTH, some of the things we learned in THOTH have gone back into some of the production problems that were the source of THOTH and have improved things somewhat.

• We believe that the contractors became more aware of NSA computing needs. Even those who dropped out began to add things we wanted to their architectures or software that they had not done before, and we would like to believe that it was because of being in THOTH.


• We pushed the architectures and technology too hard. It turns out that parallel software is generally weak, and the cost was much higher than either we or the contractors had estimated at the time.

• We learned that the companies we were dealing with could not do it on their own. They really needed to team and team early on to have a good chance to pull something like this off.

• We learned that the competition got in our way. We did this as a competitive thing all along because the government is supposed to promote competition. We could see problems with the three contractors in Phase 2 that, had we perhaps been able to say a word or two, might have gone away. But we were constrained from doing that. So we were in a situation where we could ask questions of all three contractors, and if they chose to find in those questions the thing we were trying to suggest, that was fine. And if not, we couldn't tell them.

Hindsight is always 20/20. If we look backward, we realize the project took too long to get started. Never mind the fact that maybe we should not have started, although I think we should have, and I think we did some good. However, it took too long to get moving.

While we were involved in the project, especially in software, the technology explosion overtook us. When we started this system, we were not talking about a UNIX-like operating system, C, and Fortran compilers. By the end, we were. So the target we were shooting for kept moving, and that did not help us.

NSA also was biased as we began the project. We had built our own machines in the past. We even built our own operating system and invented our own language. We thought that since we had done all that before, we could do it again. That really hid from us how much trouble THOTH would be to do.

Our standards for credibility ended up being too low. I don't know how we could have changed that, but at the end we could see that we had believed people more than we should have. Our requirements were too general at first and became more specific as we went along. Thus, the contractors, unfortunately, had a moving target.

We also learned from these particular systems that your favorite language may not run well, if at all, and that the single most important thing that could have been built from this project would have been a compiler—not hardware, not an interconnection network, but a good compiler. We also learned that parallel architectures can get really weird if you let some people just go at it.

In the end, success might have been achieved if we had had Company A design the system, had Company B build it, and had Company C
provide the software because in teaming all three companies we would have had strength.

There is a legacy to THOTH, however. It has caused us to work more with SRC and elsewhere. We got more involved with the Defense Advanced Research Projects Agency during this undertaking, and we have continued that relationship to our advantage. The project also resulted in involvement with another company that came along as a limited partner who had a system already well under way. There was another related follow-on project that came along after THOTH was canceled, but it relied too much on what we had done with THOTH, and it also fizzled out a little later on. Last, we now believe that we have better cooperation among NSA, SRC, and industry, and we hope that we will keep increasing that cooperation.


The Demise of ETA Systems

Lloyd Thorndyke

Lloyd M. Thorndyke is currently the CEO and chairman of DataMax, Inc., a startup company offering disk arrays. Before joining DataMax, he helped to found and was President and CEO of Engineering Technology Associates Systems, the supercomputer subsidiary of Control Data Corporation. At Control Data he held executive and technical management positions in computer and peripheral operations. He received the 1988 Chairman's Award from the American Association of Engineering Societies for his contributions to the engineering professions.

In the Beginning

Engineering Technology Associates Systems, or just plain ETA, was organized in the summer of 1983, and as some of you remember, its founding was announced here at the first Frontiers of Supercomputing conference in August 1983. At the start, 127 of the 275 people in the Control Data Corporation (CDC) CYBER 205 research and applications group were transferred to form the nucleus of ETA. This was the first mistake—we started with too many people.

The original business plan called for moving the entire supercomputer business, including the CYBER 205 and its installed base of 40 systems, to ETA. That never happened, and the result was fragmentation of the product line strategies and a split in the management of the CYBER 200 and ETA product lines. As a consequence it left the CDC CYBER 205 product line without dedicated management and direction and
undermined any upgrade strategy to the ETA-10. Another serious consequence was the lack of a migration path for CYBER 200 users to move to the ETA-10.

ETA was founded with the intention of eventually becoming an independent enterprise. Initially, we had our own sales and marketing groups because the CYBER 205 was planned as part of ETA. Because the CYBER 205 was retained by CDC, the sales and marketing organizations of the two companies were confused. It seemed that CDC repeatedly reorganized to find the formula for success. The marketing management at CDC took responsibility when ETA was successful and returned responsibility to us when things did not go well. Without question, the failure to consolidate the CYBER 200 product line and marketing and sales management at ETA was a major contributing factor to the ETA failure.

There was an initial major assumption that the U.S. government would support the entry of a new supercomputer company through a combination of R&D funding and orders for early systems. The blueprint of the 1960s was going to be used over again. We read the tea leaves wrong. No such support ever came forth, and we did not secure orders from the traditional leading-edge labs. Furthermore, we did not receive R&D support from the funding agencies for our chip, board, and manufacturing technologies. We had four meetings with U.S. government agencies, and they shot the horse four times: the only good result was that they missed me each time.

The lack of U.S. government support was critical to our financial image. The lack of early software and systems technical help also contributed to delays in maturing our system. Other vendors did and still do receive such support, but such was not the case with ETA. Our planning anticipated that the U.S. government would help a small startup. That proved to be a serious error.

Control Data played the role of the venture capitalist at the start and owned 90 per cent of the stock, with the balance held by the principals. The CDC long-range intent was to dilute to 40 per cent ownership through a public offering or corporate partnering as soon as possible. The failure to consummate the corporate partnership, although we had a willing candidate, was a major setback in the financial area.

The first systems shipment was made to Florida State University (FSU) in December of 1986—three years and four months from the start. From that standpoint, we feel we reduced the development schedule of a complex supercomputer by almost 50 per cent.

At the time of the dynamiting of ETA on April 17, 1989, there were six liquid nitrogen-cooled systems installed. Contrary to the bad PR you
might have heard, the system at FSU was a four-processor G system operating at seven nanoseconds. In fact, the system ran for a year after the closing at high levels of performance and quality, as FSU faculty will attest.

We had about 25 air-cooled systems installed (you may know them as Pipers) at customer sites. Internally, there were a total of 25 processors, both liquid- and air-cooled, dedicated to software development. Those are impressive numbers if one considers the costs of carrying the inventory and operating costs.

Hardware

From a technology viewpoint, I believe the ETA-10 was an outstanding hardware breakthrough and a first-rate manufacturing effort. We used very dense complementary metal oxide semiconductor (CMOS) circuits, reducing the size of the supercomputer processor to a single 16- by 22-inch board. I'm sure many of you have seen that processor. The CMOS chips reduced the power consumption of the processor to about 400 watts—that's watts, not 400 kilowatts. The use of CMOS chips operating in liquid nitrogen instead of ambient air resulted in doubling the speed of the CMOS. As a result of the two cooling methods and the configuration span from a single air-cooled processor to an eight-processor, liquid-cooled machine, we achieved a 27-to-one performance range. That range was able to use the same software and training for the diagnostics, operating-system software, and manufacturing checkout. We had broad commonality on a product line and inventory from top to bottom. We paid for the design only once, not many times. Other companies have proposed such a strategy—we executed it.

The liquid-nitrogen cryogenic cooling was a critical part of our design, and I would suggest liquid-nitrogen cooling as a technology other people should seriously consider. For example, a 20-watt computer will boil off one gallon of liquid nitrogen in an eight-hour period. Liquid nitrogen can be bought in bulk at a price cheaper than milk—as low as 25 cents a gallon in large quantities. Even at $0.40 per gallon, that equals eight hours of operation for $0.40. Boiling off that gallon also yields about 90 cubic feet of -200°C nitrogen gas, which can help cool the rest of your computer room, greatly reducing the cooling requirements.
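Those boil-off and cost figures can be checked against the physical properties of liquid nitrogen (a sketch; the latent-heat and density constants below are standard handbook values, not numbers from the talk):

```python
# Sanity-check the liquid-nitrogen economics quoted above.
# Latent heat and density are standard handbook values (assumptions,
# not figures from the original text).
LATENT_HEAT_J_PER_KG = 199_000   # heat of vaporization of liquid nitrogen
DENSITY_KG_PER_L = 0.807         # density of liquid nitrogen
LITERS_PER_GALLON = 3.785

def watts_to_boil_one_gallon(hours):
    """Steady heat load (watts) that boils exactly one gallon in `hours`."""
    energy_joules = LATENT_HEAT_J_PER_KG * DENSITY_KG_PER_L * LITERS_PER_GALLON
    return energy_joules / (hours * 3600.0)

def cooling_cost(load_watts, hours, dollars_per_gallon):
    """Cost of absorbing `load_watts` for `hours` by boiling liquid nitrogen."""
    gallons = load_watts * hours * 3600.0 / (
        LATENT_HEAT_J_PER_KG * DENSITY_KG_PER_L * LITERS_PER_GALLON)
    return gallons * dollars_per_gallon

print(round(watts_to_boil_one_gallon(8)))    # ~21 W, close to the 20 W quoted
print(round(cooling_cost(20, 8, 0.40), 2))   # ~$0.38 for an eight-hour shift
```

At the 25-cents-a-gallon bulk price mentioned above, the same eight hours would cost roughly 24 cents.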

The criticism that liquid nitrogen resulted in a long mean time to repair was erroneous because at the time of the ETA closure, we could replace a processor in a matter of hours. The combination of CMOS and liquid-nitrogen cooling coupled with the configuration range provided a broad
product family. These were good decisions—not everything we did was wrong.

The ETA-10 manufacturing process was internally developed and represented a significant advance in the state of the art. The perfect processor board yield at the end was 65 per cent for a board that was 16 by 22 inches with 44 layers and a 50-ohm controlled impedance. Another 30 per cent were usable with surface ECO wires. The remaining five per cent were scrap. This automated line produced enough boards to build two computers a day with just a few people involved.

For board assembly, we designed and built a pick-and-place robot to set the CMOS chips onto the processor board, an operation it could perform in less than four hours. The checkout of the computer took a few more hours. We really did have a system designed for volume manufacturing.

Initially, the semiconductor vendor was critical to us because it was the only such vendor in the United States that would even consider our advanced CMOS technology. In retrospect, our technology requirements and schedule were beyond the capabilities of the vendor to develop and deliver. Also, this vendor was not a merchant semiconductor supplier and did not have the infrastructure or outside market to support the effort. We were expected to place enough orders and supply enough funding to keep them interested in our effort. Our mistake was teaming with a nonmerchant vendor needing our resources to stay in the commercial semiconductor business.

We believed that we should work with U.S. semiconductor vendors because of the critical health of the U.S. semiconductor business. I would hasten to point out that the Japanese were very willing to supply us with the technology, both logic and memory, that met or exceeded what we needed. Still, we stayed with the U.S. suppliers longer than good judgment warranted because we thought there was value in having a U.S.-made supercomputer with domestic semiconductors. We believed that our government encouraged such thinking, but ETA paid the price. In essence, we were trying to sell a computer with 100 per cent U.S. logic and memory components against computers with 90 per cent Japanese logic and memory components, but we could not get any orders. I found the government's encouragement to use only U.S. semiconductor components, and its subsequent action of buying competitive computers with the majority of their semiconductor content produced in Japan, inconsistent and confusing.

Very clearly, the use of Japanese components does not affect the salability of the system in the U.S.—that message should be made clear to everyone. This error is not necessarily ETA's alone, but if the U.S.
government wants healthy U.S. semiconductor companies, then it must create mechanisms to encourage products with high U.S. semiconductor content only and to support R&D to keep domestic suppliers up to the state of the art.

Software

It is difficult to say much good about the early ETA software and its underlying strategy, although it was settling down at the end. A major mistake was the early decision to develop a new operating system, as opposed to porting the CYBER 205 VSOS operating system. Since the CYBER 205 remained at CDC, we did not have product responsibility or direction, and the new operating system seemed the best way at the time.

In hindsight there has been severe criticism for not porting UNIX to the ETA-10 at the beginning—that is, start with UNIX, only. But in 1983 it was not that clear. I now hear comments from people saying, "If ETA would have started with UNIX, I would have bought." It was only two years later that they said, "Well, you should have done UNIX." However, we did not get UNIX design help, advice, or early orders for a UNIX system.

After we completed a native UNIX system and debugged the early problems, the UNIX system stabilized and ran well on the air-cooled systems, and as a result, several additional units were ordered. While the ETA UNIX lacked many features needed for supercomputer operation, users knew that these options were coming, but we were late to market. In hindsight, we should have ported VSOS and then worked only on UNIX.

Industry Observations

To be successful in the commercial supercomputer world, one must have an array of application packages. While we recognized this early on, as a new entrant to the business we faced a classical problem that was talked about by other presenters: you can't catch up if you can't catch up.

We were not able to stimulate the applications vendors' interest because we didn't have a user base. Simultaneously, it's hard to build a user base without application packages. This vicious circle has to be broken because all the companies proposing new architectures are in the same boat. Somehow, we must figure out how to get out of it, or precious few new applications will be offered, except by the wealthiest of companies.

We need to differentiate the true supercomputer from the current situation, where everyone has a supercomputer of some type. The PR people have confiscated the supercomputer name, and we must find a new name. Therefore, I propose that the three or four companies in this
business should identify their products as superprocessor systems. It may not sound sexy, but it does the job. We can then define a supercomputer system as being composed of one or more superprocessors.

The supercomputer pursuit is equivalent to a religious crusade. One must have the religion to pursue the superprocessors because of the required dedication and great, but unknown, risks. In the past, CDC and some of you here pioneered the supercomputer. Mr. Price had the religion, but CDC hired computer executives who did not, and in fact, they seemed to be supercomputer atheists. It was a major error by CDC to hire two and three levels of executives with little or no experience in high-performance or supercomputer development, marketing, or sales and place them in the computer division. Tony Vacca, ETA's long-time technologist, now at Cray Research, Inc. (see Session 2), observed that supercomputer design and sales are the Super Bowl of effort and are not won by rookies. It seems CDC has proved that point.

Today we all use emitter-coupled-logic (ECL), bipolar, memory chips, and cooling technologies in product design because of performance and cost advantages. Please remember that these technologies were advanced by supercomputer developers and indirectly paid for by supercomputer users.

If Seymour Cray had received a few cents for every ECL chip and bipolar chip and licensing money for cooling technology, he wouldn't need any venture capital today to continue his thrust. However, that is not the case. I believe that Seymour is a national treasure, but he may become an endangered species if the claims I have heard at this conference about massively parallel systems are true. However, remember that claims alone do not create an endangered species.

I have learned a few things in my 25 years in the supercomputer business. One is that the high-performance computers pioneer costly technology and bear the brunt of the startup costs. The customers must pay a high price partly because of these heavy front-end costs. Followers use this developed technology without the heavy front-end costs and then argue supercomputers are too costly without considering that the technology is low-cost because supercomputers footed the early bills.

Somehow, some way, we in the U.S. must find a way to help pay the cost of starting up a very expensive, low-volume gallium arsenide facility so that all of us can reap the performance and cost benefits of the technology. Like silicon, the use will develop when most companies can afford to use it. That occurs only after someone has paid to put the technology in production, absorbed the high learning-curve costs, proved the performance, and demonstrated the packaging. Today we are asking
one company to support those efforts. Unfortunately, we hear complaints that supercomputers with new technology cost too much. We should all be encouraging Seymour's effort, not predicting doom, and we should be prepared to share in the expenses.

The Japanese supercomputer companies are vertically integrated—an organizational structure that has worked well for them. Except for IBM and AT&T, the U.S. companies practice vertical cooperation. However, vertical cooperation must change so that semiconductor vendors will underwrite a larger part of the development costs. The user cannot continue to absorb huge losses while the vendor is making a profit and still expect the relationship to flourish. This is not vertical cooperation; it is simply a buyer-seller relationship. To me, vertical cooperation means that the semiconductor vendors and the application vendors underwrite their costs in exchange for part of the interest in the products. That is true cooperation, and the U.S. must evolve to this or ignore costly technology developments and get out of the market.

I have been told frequently by the Japanese that they push the supercomputer because it drives their semiconductor technology to new components leading to new products that they know will be salable in the marketplace. In their case, vertical integration is a market-planning asset. I maintain that vertical cooperation can have similar results.

I believe that we have seen the gradual emergence of parallelism in the supercomputers offered by Cray and the Japanese—I define those architectures as Practical Parallelism. During the past two days, we have heard about the great expectations for massively parallel processors and the forecasted demise of the Cray dynasty. I refer to these efforts as Research Parallelism, and I want to add that Research Parallelism will become a practicality not when industry starts to buy them but when the Japanese start to sell them. The Japanese are attracted to profitable markets. Massively parallel systems will achieve the status of Practical Parallelism when the Japanese enter that market—that will be the sign that users have adopted the architecture, and the market is profitable.

I would like to close with a view of the industry. I lived through the late 1960s and early 1970s, when the U.S. university community was mesmerized by the Digital Equipment Corporation VAX and held to the belief that VAXs could do everything and there was no need for supercomputers. A few prophets like Larry Smarr, at the University of Illinois (Session 10), kept saying that supercomputers were needed in universities. That they were right is clearly demonstrated by the large number of supercomputers installed in universities today.



Now I hear that same tune again. We are becoming mesmerized with superperformance workstations: they can do everything, and there is again no need for supercomputers. When will we learn that supercomputers are essential for leading-edge work? The question is not whether we need supercomputers or superperformance workstations; we need both working in unison. The supercomputer will explore new ideas, new applications, and new approaches. Therefore, I believe very strongly that it is both and not one or the other. The supercomputer has a place in our industry, so let's start to hear harmonious words of support in place of the theme that supercomputers are too costly and obsolete while massively parallel systems are perfect, cheap, and the only approach.



FPS Computing:
A History of Firsts

Howard Thrailkill

Howard A. Thrailkill is President and CEO of FPS Computing. He received his bachelor of science degree in electrical engineering from the Georgia Institute of Technology and his master of science degree in the same field from the Florida Institute of Technology. He has received several patents for work with electronic text editing and computerized newspaper composition systems. For the past 20 years, he has held successively more responsible management positions with a variety of high-technology and computer companies, including General Manager of two divisions of Harris Corporation, President of Four-Phase Systems, Inc., Corporate Vice President of Motorola, Inc., and President and CEO of Saxpy Computer Corporation.

I suspect that a significant number of you in this audience were introduced to high-performance computing on an FPS-attached processor of some type. That technology established our company's reputation as a pioneer and an innovator, and it had a profound effect on the evolution of supercomputing technology. For the purposes of this session, I have been asked to discuss a pioneering product whose failure interrupted a long and successful period of growth for our company.

Pioneers take risks. Such was the case with our T-Series massively parallel computer, which was announced in 1985 with considerable fanfare and customer interest. It was the industry's first massively parallel machine that promised hypercube scalability and peak power in



the range of multiple GFLOPS (i.e., 10⁹ floating-point operations per second). Regrettably, it was not a successful product.

Before presenting a T-Series "postmortem," I would like to retrace some history. FPS Computing, formerly known as Floating Point Systems, is a 20-year-old firm with a strong tradition of innovation. In 1973, we introduced the first floating-point processor for minicomputers and, in 1976, the first array processor—the AP-120B. That was followed in 1981 by the first 64-bit minisupercomputer, the FPS 164, and by the FPS 264 in 1985; both enjoyed widespread acceptance. Buoyed by those successes, FPS then failed to perceive a fundamental shift in the direction of minisupercomputer technology: the shift toward the UNIX software environment and architectures with tightly coupled vector and scalar processors.

Other companies correctly recognized that shift and introduced competitive machines, which stalled our rapid growth. We responded with heavy investment in our T-Series machine, a radically new product promising extraordinary peak performance. It never lived up to its promise, and it is interesting to understand why.

First, a small company like FPS lacked the resources to absorb comfortably a mistake costing a few tens of millions of dollars. We simply overreached our resources.

Second, we failed to recognize that our software system, based on Occam, could never compete with the widespread acceptance of UNIX. I had not yet joined FPS at the time, and I recall reading the T-Series announcement and being concerned that I knew so little about this new software environment. Clearly, I was not alone. In retrospect, this miscalculation crippled the product from the outset.

Third, the machine exhibited unbalanced performance even when its software shortcomings were overlooked. Highly parallel portions of an application code could be hand tuned and impressive speeds achieved. However, portions of the code that did not yield to such treatment ran very slowly. Too often we were limited to the speed of a single processor, and performance also suffered from the inefficiencies of message passing in our distributed-memory subsystem. The complexity of I/O with our hypercube architecture exacerbated the problem.
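The single-processor bottleneck described here is the familiar limit captured by Amdahl's law. A small sketch (my illustration, not part of the original talk) shows how quickly even a modest serial fraction caps the speedup of a massively parallel machine:

```python
# Amdahl's law: if a fraction s of a program must run serially, the best
# overall speedup on p processors is 1 / (s + (1 - s) / p), no matter how
# large p grows. The fractions below are illustrative, not T-Series data.
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / processors)

# Even with 90% of the code perfectly parallelized across 1024 nodes,
# speedup tops out just under 10x -- the serial portion dominates.
for s in (0.5, 0.1, 0.01):
    print(f"serial fraction {s:.2f}: "
          f"speedup on 1024 CPUs = {amdahl_speedup(s, 1024):.1f}x")
```

Message-passing overhead only worsens the picture, since it effectively enlarges the non-parallel fraction.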

Fourth, we went to market with very little applications software. This deficiency was never overcome because our tedious programming environment did not encourage porting of any significant existing codes.

Finally, we entered a market that was much too small to justify our product development expenditures. While some talented researchers achieved impressive performance in a small number of special-purpose



applications, users wanted broader capabilities even in networked environments.

I have been asked to relate the lessons we learned from this experience. Clearly, no $100 million company like FPS wanted to face the serious consequences of a market miscalculation like we experienced. We had counted on the T-Series as a primary source of revenue upon which we would maintain and build our position in the industry.

Among my first duties upon joining FPS was to assess the prospects for the T-Series to fulfill the needs of our customers and our company. My first conversations with customers were not at all encouraging. They wanted standards-compliant systems that would integrate smoothly into their heterogeneous computing environment. They wanted balanced performance. While they valued the promise of parallel processing, they were reluctant to undertake the complexity of parallel programming, although a few users did justify the effort simply to avail themselves of the machine's raw speed. They also wanted a comprehensive library of third-party application codes.

Unfortunately, we saw no way to meet the needs of a large enough number of customers with the T-Series. Thus, we suspended further investment in the T-Series perhaps 45 days after I joined FPS.

Once that decision was made, we dispatched a senior representative from FPS to meet individually with every T-Series customer and reconcile our commitments to them. It was a step that cost us a lot of money. With our commitment to continuing as a long-term participant in the high-performance computing business, we believed that nothing less than the highest standards of business ethics and fairness to our customers would be acceptable. That painful process was completed before we turned our attention to redirecting our product strategy.

We then took steps to move back into the mainstream of high-performance computing with our announcement in late 1988 of a midrange UNIX supercomputer from FPS—the Model 500. An improved version, the Model 500EA, followed in 1989.

In the summer of 1990, FPS emerged again as an industry innovator with our announcement of Scalable Processor ARChitecture (SPARC) technology, licensed from Sun Microsystems, Inc., as the foundation for future generation supercomputers from FPS. With our concurrent announcement of a highly parallel Matrix Coprocessor, we also introduced the notion of integrated heterogeneous supercomputing. Integrated heterogeneous supercomputing features modular integration of multiple scalar, vector, and matrix processors within a single, standards-compliant



software environment. In our implementation, that software environment is to be an extended version of SunOS. We also announced an alliance with Canon Sales in Japan, a major market for our products.

FPS has made a major commitment to industry standards for high-performance computing—SPARC, SunOS (UNIX), network file system, high-performance parallel interface, and others. We believe we have a strong partner, Sun, in our corner now as we move forward.

We are reassured as we visit customers and observe SPARC technology we have licensed from Sun enjoying such widespread acceptance. Our joint sales and marketing agreement with Sun has also served us well since its announcement in June 1990. We support their sales activities and they support ours.

We will be announcing shortly further steps toward standardization in the software environment we make available to customers. Our compiler will soon have a full set of Cray Research and Digital Fortran extensions, along with a set of ease-of-use tools.

Our customers seem to be receptive to our product strategy, as dozens of machines are now installed in major universities, research laboratories, and industrial sites around the world. Interestingly, about one-third of them are in Japan.

We believe we have defined a new direction for high-speed computing—integrated heterogeneous supercomputing. Our current implementation is shown in Figure 1.

Figure 1.
FPS 500 series integrated heterogeneous supercomputing.



As you may observe from the figure, up to eight high-speed SPARC RISC processors may be plugged in independently to run as a shared-memory multiprocessor. Conveniently, all of our installed customers can be upgraded to SPARC and run SunOS software in this modular fashion.

This high-speed SPARC RISC multiprocessor architecture can be augmented by modular addition of multiple vector and matrix coprocessors. The system configuration can be tailored to the application at hand. As technology advances, new processors can modularly replace current versions, essentially obsoleting the industry's tradition of "forklift" replacement of the entire supercomputer when upgrades are undertaken.

While vector processing from FPS and others is a familiar technology, matrix coprocessing may be a new concept to you. We consider this technology to be applied parallel processing. FPS matrix coprocessors, currently implemented with up to 84 parallel processing elements, exploit the locality of computation present in the compute-intensive portions of many application codes that address big problems. Linear solvers, eigenvalue problems, fast Fourier transforms, and convolutions all involve many computations per data element, i.e., high locality. These portions of code run very fast on our machine, achieving sustained performance in the one- to three-GFLOPS range, and this performance can be achieved in the familiar SunOS software environment.
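The "many computations per data element" idea can be made concrete with standard textbook operation counts (my illustration; the helper function and figures are not FPS measurements). Kernels with high arithmetic intensity keep parallel arithmetic units busy, while low-intensity kernels are limited by memory bandwidth:

```python
# Arithmetic intensity: floating-point operations performed per data
# element touched. High-locality kernels reuse each element many times.
def ops_per_element(flops: float, elements: float) -> float:
    return flops / elements

n = 1000
# Dense matrix multiply: ~2n^3 flops over 3n^2 matrix elements.
matmul = ops_per_element(2 * n**3, 3 * n**2)
# Vector addition: n flops over 3n elements (two inputs, one output).
vadd = ops_per_element(n, 3 * n)
print(f"matrix multiply: {matmul:.0f} ops/element")
print(f"vector add:      {vadd:.2f} ops/element")
```

By this measure a 1000×1000 matrix multiply performs hundreds of operations per element, which is why such kernels can approach a coprocessor's peak rate while simple vector addition cannot.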

We believe we have now overcome the setbacks that the ill-fated T-Series dealt to our company. We have returned to our roots in high-performance computing and are now moving forward and building upon our notion of integrated heterogeneous supercomputing as implemented in our current supercomputer product line. Innovation is once again driving FPS, but we are now much more sensitive to customer demands.


