VECTOR PIPELINE ARCHITECTURE
This session focused on the promise and limitations of architectures featuring a moderate number of tightly coupled, powerful vector processors—the limitations, dependencies, sustained-performance potential, processor performance, interconnection topologies, and application domains for such architectures. How fast do things have to be to eliminate vectors and use only scalar processors?
Cray Research, Inc.
Vector Architecture in the 1990s
Les Davis has been with Cray Research, Inc., since its founding in 1972. Initially, he was the chief engineer for the CRAY-1 project. Today, Mr. Davis is the Executive Vice President and a member of the Technology Council, which formulates the company's strategic direction. Before joining Cray Research, Mr. Davis was Director of Electrical Engineering and General Manager of the Chippewa Laboratory for Control Data Corporation.
As the title of this session suggests, we will be interested in pursuing the discussion on vector processing. It is interesting to note, when you go to a meeting like this, how many people—the same people—have been in this business for 20 years or longer. I don't know if that's a good thing or a bad thing, but at least it attests to your persistence in sticking with one kind of work despite the passage of time and several companies. It is also interesting to note, now, how openly we discuss high-performance computing. I can remember in the early days when my kids would ask me what I did, I'd kind of mumble something about scientific computing, hoping they wouldn't ask what that was. Yet, it's surprising today that we talk about it quite openly.
Nevertheless, it is the politicians that I have trouble communicating with about high-performance computing. I am hoping that after a series of meetings like this, we will be able to convince the politicians of the importance of high-performance computing.
I believe that the vector architecture in the 1980s played almost the same role that the first transistorized computers played in the 1960s. Vector architecture really did offer researchers an opportunity to do things that, in the late 1960s and the 1970s, we were unable to achieve with the machines available at the time.
I think the vector machines were characterized by several positive things: very large memories, high-bandwidth memory systems to support those large memories, and very efficient vectorizing compilers. As a result of that combination, we saw several orders of magnitude improvement in performance over what previous architectures offered.
On the negative side, scalar processing did not move along quite as rapidly because it was restrained by slow clock rates. If you looked at the performance improvements, you only saw a factor of perhaps 2.5 from 1975 through 1990. On the other side of the coin, if you looked at the ability to incorporate not only vectorization but also large or reasonably large numbers of vector processors tightly coupled to the memory, you saw, in many cases, several orders of magnitude improvement in performance.
I also think the multiprocessor vector machines were another significant step that we took in the 1980s, and now we are able to couple up to 16 processors in a very tight fashion. Interprocessor communication and memory communication actually allow us to make very efficient use of those machines.
The other important thing is that we have allowed our compiler developers to move along and take advantage of these machines. I think a lot of that work will carry over and be transportable when we look at some of the newer architectures that we are examining in the research and development areas.
I think the importance of the U.S. retaining its leadership in the supercomputer industry has been stated many times. For us to retain that leadership in the high-performance computing area, we must be able to maintain our lead in the manufacturing, as well as in the design, of the systems. That is something that was touched on in the previous session, but I think it has much more importance than a lot of people attach to it.
We need now to be able to compete in the world markets because in many cases, that is one of the few ways in which we not only can get research and development dollars but also can perfect our manufacturing capabilities. If we are not able to do that, I don't think we're going to be able to capitalize on some of the new technologies and new developments that are taking place today.
I think the vector architectures are going to be the backbone of our high-performance computing initiative throughout the 1990s. This is not to say that there will not be newer software and hardware architectures coming along. However, if we are not able to maintain our leadership with our current types of architectures, I know very well of a group of people located overseas that would just love to be able to do that.
My commitment here is to make sure that we not only are looking ahead and trying to make sure that we move very aggressively with new architectures, both in hardware and software, but also that we are not giving up and losing sight of the fact that we have quite a commitment to a large number of people today that have invested in these vector-type architectures.
In Defense of the Vector Computer
Harvey G. Cragon has held the Ernest Cockrell, Jr., Centennial Chair in Engineering at the University of Texas, Austin, since 1984. Previously he was employed at Texas Instruments for 25 years, where he designed and constructed the first integrated-circuit computer, the first transistor-transistor logic computer, and a number of other computers and microprocessors. His current interests center upon computer performance and architecture design. He is a fellow of the Institute of Electrical and Electronics Engineers (IEEE) and a member of the IEEE Computer Society, the National Academy of Engineering, and the Association for Computing Machinery (ACM). Professor Cragon received the IEEE Emanuel R. Piore Award in 1984 and the ACM-IEEE Eckert-Mauchly Award in 1986. He is also a trustee of The Computer Museum in Boston.
As several have said this morning, parallel computers, as an architectural concept, were talked about and the research was done on them before the advent of vector machines. The vector machine is sort of the "new kid on the block," not the other way around.
Today I am going to defend the vector computer. I think that there are some reasons why it is the workhorse of the industry and why it has been successful and will continue to be successful.
The first reason is that in the mid-1960s, about 1966, it suddenly dawned on me, as it had on others, that the Fortran DO loop was a direct
invocation of a vector instruction. Not everything would be vectorizable, but just picking out the Fortran DO loops made it possible to compile programs for the vector machine. That is, I think, still an overwhelming advantage that the vector machines have—that the array constructs of languages such as Ada are vectorizable.
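That observation can be sketched in a few lines of Python (an illustrative model of the idea, not any historical compiler; the function names are mine):

```python
# An inner Fortran DO loop such as
#     DO 10 I = 1, N
#  10 C(I) = A(I) + B(I)
# is a direct invocation of a vector instruction: one operation applied
# across whole operand vectors instead of one add issued per iteration.

def scalar_loop_add(a, b):
    # Scalar view: one add per trip through the loop.
    c = []
    for i in range(len(a)):
        c.append(a[i] + b[i])
    return c

def vector_add(a, b):
    # Vector view: a single "instruction" over the packed operands.
    return [x + y for x, y in zip(a, b)]

a = [1.0, 2.0, 3.0, 4.0]
b = [10.0, 20.0, 30.0, 40.0]
assert scalar_loop_add(a, b) == vector_add(a, b) == [11.0, 22.0, 33.0, 44.0]
```

The compiler's job is exactly this recognition: spotting that the loop body is the element-wise form of a vector operation.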
The second reason is that there is a natural marriage between pipelining and the vector instruction. Long vectors equal long pipelines, short clock periods, and high performance. Those items merge together very well.
Now, back to the programming point of view and how vectors more or less got started. Erich Bloch, in Session 1, was talking about the Stretch computer. I remember reading in a collection of papers on Stretch that there was a flow chart of a vector subroutine. I looked at that and realized that's what ought to be in the hardware. Therefore, we were taking known programming constructs and mapping them into hardware.
Today it strikes me that we are trying to work the parallel computer problem the other way. We are trying to find the programming constructs that will work on the hardware. We come at it from the programming point of view.
I believe that vector pipeline machines give a proper combination of the space-time parallelism that arises in many problems. The mapping of a problem to perform pipeline and vector instructions is more efficient and productive than mapping the same type of problem to a fixed array—to an array that has fixed dimensionality.
We worked at Illinois on the ILLIAC-IV. It was a traumatic decision to abandon that idea because a semiconductor company would love to have something replicated in large numbers. However, we did not know how to program ILLIAC-IV, but we did know how to program the vector machine.
Looking to the future, I think that vector architecture technology is fairly mature, and there are not a whole lot of improvements to make. We are going to be dependent in large measure on the advances in circuit technology that we will see over the next decade. A factor of 10 is probably still in the works for silicon.
Will the socioeconomic problems of gallium arsenide and Josephson junctions overcome the technical problems? Certainly, as Tony Vacca (Session 2) said, we need high-performance technology just as much as the parallel computer architects need it.
I have made a survey recently of papers in the International Solid State Circuits Conference, and it would appear that over the last 10 years, clock rates in silicon have improved about 25 per cent per year. This would translate into a 10^9-Hz clock rate in another eight or 10 years. At those clock rates, the power problems would become quite severe. If we try to put one of these things on a chip, we have real power problems that have to be solved.
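The arithmetic behind Cragon's projection is easy to check (the 10-nanosecond 1990 baseline below is my assumption for illustration, not a figure from the talk):

```python
# 25 percent per year compounds to nearly an order of magnitude per decade.
growth = 1.25 ** 10
assert 9.3 < growth < 9.4          # 1.25**10 is about 9.31

# Assumed baseline (not from the talk): a 1990 silicon clock of 1e8 Hz,
# i.e., a 10-nanosecond cycle time.
clock_1990_hz = 1e8
clock_2000_hz = clock_1990_hz * growth
assert 9e8 < clock_2000_hz < 1e9 + 1   # on the order of 10**9 Hz
```

Ten years of 25 percent growth is just under a factor of 10, which is where the "10^9 in another eight or 10 years" estimate comes from.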
I also perceive another problem facing us—that we have not paid as much attention to scalar processing as we should. Given that either pipeline vector machines stay dominant or that the multiprocessors become dominant, we still have to have higher-performance scalar machines to support them. I think that we need research in scalar machines probably as much as, if not more than, we need research in vector machines or parallel machines.
I tend to believe that the RISC theology is going the wrong way and that what we really need to do is raise rather than lower the level of abstraction so that we can get the proper computational rates out of scalar processing that we really need to support vector or parallel machines.
In conclusion, there is a saying from Texas: if it's not broke, don't fix it. There is also another saying: you dance with the one that "brung" you. Well, the one that "brung" us was the vector machine, so let's keep dancing.
Market Trends in Supercomputing
Neil Davenport is the former President and CEO of Cray Computer Corporation. Before the spinoff of the company from Cray Research, Inc. (CRI), in November 1989, Neil served from 1981 to 1988 as the Cray Research Ltd. (UK) Managing Director for Sales, Support, and Service for Northern Europe, the Middle East, India, and Australia; from 1988 to November 1989, he was Vice President of Colorado Operations, with responsibility for the manufacture of the CRAY-3. Before joining CRI, he worked 11 years for ICL in England, the last three managing the Education and Research Region, which had marketing responsibility for the Distributed Array Processor program.
Since 1976 and the introduction of the CRAY-1, which for the purpose of this paper is regarded as the start of the supercomputer era, the market for large-scale scientific computers has been dominated by machines of one architectural type. Today, despite the introduction of a number of new architectures and despite the improvement in performance of machines at all levels in the marketplace, most large-scale scientific processing is carried out on vector pipeline computers with from one to eight processors and a common memory. The dominance of this architecture is equally strong when measured by the number of machines installed or by the amount of money spent on purchase and maintenance.
As with every other level of the computer market, the supply of software follows the dominant hardware. Accordingly, the library of
application software for vector pipeline machines has grown significantly. The investment by users of the machines and by third-party software houses in this architecture is considerable.
The development of vector pipeline hardware since 1976 has been significant, with the prospect of machines with 100 times the performance of the CRAY-1 being delivered in the next year or two. The improvement in performance of single processors has not been sufficient to sustain this growth. Multiple processors have become the norm for the highest-performance offerings from most vendors over the past few years. The market leader, Cray Research, Inc., introduced its first multiprocessor system in 1982.
Software development for single processors, whether part of a larger system or not, has been impressive. The proportion of Fortran code that is vectorized automatically by compilers has increased continuously since 1976. Several vendors offer good vectorization capabilities in Fortran and C. For the scientist, vectorization has become transparent. Good code runs very well on vector pipeline machines. The return for vectorization remains high for little or no effort on the part of the programmer. This improvement has taken the industry 15 years to accomplish.
Software for multiprocessing a single task has proved to be much more difficult to write. Preprocessors for compilers that find and highlight opportunities for parallel processing in codes are available, along with some more refined tools for the same function. As yet, the level of multitasking single programs over multiple processors remains low. There are exceptional classes of problems that lend themselves to multitasking, such as weather models. Codes for these problems have been restructured to take advantage of multiple processors, with excellent results. Overall, however, the progress in automatic parallelization and new parallel-application programs has been disappointing but not surprising. The potential benefits of parallel processing and massively parallel systems have been apparent for some time. Before 1980, a number of applications that are well suited to the massively parallel architecture were running successfully on the ICL Distributed Array Processor. These included estuary modeling, pattern recognition, and image processing. Other applications that did not map directly onto the machine architecture did not fare so well, including oil reservoir engineering, despite considerable effort.
The recent improvements in performance and the associated lowering in price of microprocessors have greatly increased the already high level of attraction to massively parallel systems. A number of vendors have
introduced machines to the market, with some success. The hardware issues seem to be manageable, with the possible exception of common memory. The issues for system and application software are still formidable. The level of potential reward and the increase in the numbers of players will accelerate progress, but how quickly? New languages and new algorithms do not come easily, nor are they easily accepted.
In the meantime, vector pipeline machines are being enhanced. Faster scalar processing with cycle times down to one nanosecond is not far away. Faster, larger common memories with higher bandwidth are being added. The number of processors will continue to increase as slowly as the market can absorb them. With most of the market momentum—also called user friendliness, or more accurately, user familiarity—still behind such machines, it would seem likely that the tide will be slow to turn.
In summary, it would appear that the increasing investment in massive parallelism will yield returns in some circumstances that could be spectacular; but progress will be slow in the general case. Intermediate advances in parallel processing will benefit machines of 16 and 64 processors, as well as those with thousands. If these assumptions are correct, then the market share position in 1995 by type of machine will be similar to that of today.
Massively Parallel SIMD Computing on Vector Machines Using PASSWORK
Kenneth Iobst received a B.S. degree in electrical engineering from Drexel University, Philadelphia, in 1971 and M.S. and Ph.D. degrees in electrical engineering/computer science from the University of Maryland in 1974 and 1981, respectively. Between 1967 and 1985, he worked as an aerospace technologist at the NASA Langley Research Center and the NASA Goddard Space Flight Center and was actively involved in the Massively Parallel Processor Project. In 1986 he joined the newly formed Supercomputing Research Center, where he is currently employed as a research staff member in the algorithms group. His current research interests include massively parallel SIMD computation, SIMD computing on vector machines, and massively parallel SIMD architecture.
When I first came to the Supercomputing Research Center (SRC) in 1986, we did not yet have a SIMD research machine—i.e., a machine with single-instruction-stream, multiple-data-streams capability. We did, however, have a CRAY-2. Since I wanted to continue my SIMD research started at NASA, I proceeded to develop a simulator of my favorite SIMD machine, the Goodyear MPP (Massively Parallel Processor), on the CRAY-2.
This SIMD simulator, called PASSWORK (PArallel SIMD Simulation WORKbench), now runs on seven different machines and represents a truly machine-independent SIMD parallel programming environment. Initially developed in C, PASSWORK is now callable from both C and
Fortran. It has been used at SRC to develop bit-serial parallel algorithms, solve old problems in new ways, and generally achieve the kind of performance one expects on "embarrassingly" parallel problems.
As a result of this experience, I discovered something about the equivalence between a vector machine and a real SIMD machine that I would now like to share with you. In general, the following remarks apply to both the Goodyear MPP and Thinking Machines Corporation's CM-2.
There are two basic views of a vector machine like the CRAY-2. In the traditional vector/scalar view, the CRAY-2 has four processors, each with 16K words of local memory, 256 megawords of globally shared memory, and a vector processing speed of four words per 4.1 nanoseconds. From a massively parallel point of view, the CRAY-2 has a variable number of bit-serial processors (4K per vector register) and a corresponding amount of local memory per processor equal to 2^34 processor bits.
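The arithmetic behind the bit-serial view is easy to verify (my own check of the figures, assuming the CRAY-2's 64-bit words and 64-element vector registers):

```python
# Global memory: 256 megawords of 64-bit words.
words = 256 * 2**20
bits_per_word = 64
total_bits = words * bits_per_word
assert total_bits == 2**34        # 2^34 bits of shared memory in all

# One vector register: 64 elements of 64 bits each.  Viewed bit-serially,
# every bit column is a processor, giving 4K single-bit processors.
bit_processors = 64 * 64
assert bit_processors == 4 * 1024
```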
Given an understanding of SIMD computing, one can see how the broadcast of a single instruction to multiple processors on a SIMD machine is analogous to the pipelined issue of vector instructions on a vector machine. There is a natural sort of equivalence here between these two seemingly different machine architectures.
As can be seen in Figure 1, there are two basic computing domains—a vector/scalar domain and a bit-serial domain. In the vector/scalar domain, we do things conventionally. In the bit-serial domain, we are more able to trade space for time and to solve the massively parallel parts of problems more efficiently. This higher performance results from operating on small fields or kernels with linear/logarithmic bit-serial computational complexity. In this bit-serial domain, we are operating on fully packed words, where the bits of a word are associated with single-bit processors, not with a physical numeric representation.
If you take a single problem and break it up into a conventional and a bit-serial part, you may find that a performance synergy exists. This is true whenever the whole problem can be solved in less time across two domains instead of one. This capability may depend heavily, however, on an efficient mechanism to translate between the computing domains. This is where the concept of corner-turning becomes very important.
The concept of corner-turning allows one to view a computer word of information sometimes as containing spatial information (one bit per processor) and at other times as containing numeric information, as is depicted in Figure 2. Corner-turning is the key to high-performance SIMD computing on vector machines and is best implemented in hardware with a separate vector functional unit in each CPU. With this support,
vector machines would be used much more extensively for SIMD computing than they are today.
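In software, corner-turning is simply a bit-matrix transpose. A minimal Python sketch (the function name and word width are illustrative, not PASSWORK's actual interface):

```python
def corner_turn(words, width=64):
    """Transpose a list of `width`-bit words into `width` bit-planes.

    Bit-plane k collects bit k of every word: bit j of planes[k] is
    bit k of words[j].  For a square array (len(words) == width),
    applying the transform twice recovers the original words.
    """
    planes = []
    for k in range(width):
        plane = 0
        for j, w in enumerate(words):
            plane |= ((w >> k) & 1) << j
        planes.append(plane)
    return planes

words = [0b1010, 0b0110, 0b0001, 0b1111]
planes = corner_turn(words, width=4)
assert corner_turn(planes, width=4) == words  # transpose twice = identity
```

In hardware this transpose would be a single vector functional unit operating on a vector register, which is why it can make or break SIMD performance on a vector machine.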
To give you an idea of how things might be done on such a machine, let's look at the general routing problem on a SIMD machine. Suppose we have a single bit of information in each of 4K processors and wish to arbitrarily route this information to some other processor. To perform this operation on a real SIMD machine requires some sort of sophisticated routing network to handle the simultaneous transmissions of data, given collisions, hot spots, etc. Typically, the latencies associated with parallel routing of multiple messages are considerably longer than in cases where a single processor is communicating with one other processor.
On a vector machine, this routing is pipelined and may suffer from bank conflicts but in general involves very little increased latency for multiple transmissions. To perform this kind of routing on a vector machine, we simply corner-turn the single bit of information across 4K processors into a 4K vector, permute the words of this vector with hardware scatter/gather, and then corner-turn the permuted bits back into the original processors.
Using this mechanism for interprocessor SIMD communication on a vector machine depends heavily on fast corner-turning hardware but in general is an order of magnitude faster than the corresponding operation on a real SIMD machine. For some problems, this routing time dominates, and it becomes very important to make corner-turning as fast as possible to minimize this "scalar part" of this parallel SIMD problem. This situation is analogous to minimizing the scalar part of a problem according to Amdahl's Law.
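The three-step routing scheme can be sketched in Python (helper names are mine; the corner-turn is repeated so the sketch is self-contained, and the middle step stands in for hardware scatter/gather):

```python
def corner_turn(words, width):
    # Bit-matrix transpose: bit j of planes[k] is bit k of words[j].
    planes = []
    for k in range(width):
        plane = 0
        for j, w in enumerate(words):
            plane |= ((w >> k) & 1) << j
        planes.append(plane)
    return planes

def route_bits(bit_plane, perm, n):
    """Send the bit of processor p to processor perm[p], for n processors."""
    words = corner_turn([bit_plane], width=n)  # one 0/1 word per processor
    routed = [0] * n
    for p in range(n):                         # the scatter/gather step
        routed[perm[p]] = words[p]
    [out] = corner_turn(routed, width=1)       # pack back into one bit-plane
    return out

# Rotate 4 "processors" one position: processor p sends to p+1 (mod 4).
assert route_bits(0b0011, perm=[1, 2, 3, 0], n=4) == 0b0110
```

No routing network is needed: an arbitrary permutation becomes a pipelined gather through memory, bracketed by two corner-turns.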
Figure 2 shows some other equivalences between SIMD computing and vector/scalar computing. Some of these vector/scalar operations do not require corner turning but suffer from a different kind of overhead—the large number of logical operations required to perform basic bit-serial arithmetic. For example, bit-serial full addition requires five logical operations to perform the same computation that a real SIMD machine performs in a single tick. Fortunately, a vector machine can sometimes hide this latency with multiple logical functional units. Conditional store, which is frequently used on a SIMD machine to enable or
disable computation across a subset of processors, also suffers from this same overhead.
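The five operations are easy to count in a sketch of one bit-plane of addition (my own formulation of the standard full-adder logic; because the operations are bitwise on packed words, each one acts across all processors at once):

```python
def full_add(a, b, carry_in):
    # Five logical operations per bit-plane: two XORs, two ANDs, one OR.
    t = a ^ b                              # op 1
    sum_out = t ^ carry_in                 # op 2
    carry_out = (a & b) | (t & carry_in)   # ops 3, 4, 5
    return sum_out, carry_out

# Truth-table check: a correct single-bit full adder in every packed lane.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_add(a, b, cin)
            assert s + 2 * cout == a + b + cin
```

A real SIMD machine does this in one tick per bit because its ALU computes sum and carry directly; the vector machine must issue the five logical operations, though multiple logical functional units can overlap them.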
There are some other "SIMD operations," however, that are actually performed more effectively on a vector/scalar machine than on a real SIMD machine. This seems like a contradiction, but the reference to "SIMD operation" here is used in the generic sense, not the physical sense. Operations in this class include single-bit tallies across the processors and the global "or" of all processors that is frequently used to control SIMD instruction issue.
Single-bit tallies across the processors are done much more efficiently on a vector machine using vector popcount hardware than on the much slower routing network of real SIMD machines. The global "or" of all processors on a real SIMD machine generally requires an "or" tree depth equal to the log of the number of processors. On a typical SIMD machine, the time needed to generate this signal is in the range of 300–500 nanoseconds.
On a vector machine, this global "or" signal may still have to be computed across all processors but in general can be short-stopped once one processor is found to be nonzero. Therefore, the typical time to generate the global "or" on a vector machine is only one scalar memory access, or typically 30–50 nanoseconds. This is a significant performance advantage for vector machines and clearly demonstrates that it may be much better to pipeline instructions than to broadcast them.
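Both global operations can be sketched in Python (bit positions of packed words stand in for processors; the short-stop is the early return):

```python
def tally(planes):
    """Population count across processors: number of set bits per plane.

    On a vector machine this maps onto vector popcount hardware.
    """
    return [bin(p).count("1") for p in planes]

def global_or(planes):
    """OR across all processors.  A vector machine can 'short-stop' the
    scan as soon as one nonzero word is found."""
    for p in planes:
        if p:              # short-stop: no need to examine the rest
            return 1
    return 0

assert tally([0b1011, 0b0000]) == [3, 0]
assert global_or([0, 0, 0b100]) == 1
assert global_or([0, 0, 0]) == 0
```

The short-stop is why the vector machine wins here: in the common case the answer is known after one memory access, whereas the SIMD machine always pays for the full OR tree.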
As stated earlier, PASSWORK was originally developed as a research tool to explore the semantics of parallel SIMD computation. It now represents a new approach to SIMD computing on conventional machines and even has some specific advantages over real SIMD machines. One of these distinct advantages is the physical mapping of real problems onto real machines. Many times the natural parallelism of a problem does not directly map onto the physical number of SIMD processors. In PASSWORK the natural parallelism of any problem is easily matched to a physical number of simulated processors (to within the next higher power of four processors).
This tradeoff between the number of processors and the speed of the processors is most important when the natural parallelism of the problem is significantly less than the physical number of SIMD processors. In this case, a vector machine, although possibly operating in a short vector mode, can always trade space for time and provide a fairly efficient logical-to-physical mapping for SIMD problems. On a real SIMD machine, there is a significant performance degradation in this case because of the underutilization of physical processors.
In a direct comparison between the CRAY-2 and the CM-2, most SIMD problems run about 10 times slower on the CRAY-2 than on the CM-2. If the CRAY-2 had hardware support for corner-turning and a memory bandwidth equivalent to the CM-2, this performance advantage would completely disappear. Most of this performance loss is due to memory subsystem design, not to basic architectural differences between the two machines; i.e., the CRAY-2 was designed with a bank depth of eight, and the CM-2 was designed with a bank depth of one. As a result, the CRAY-2 can cycle only one-eighth of its memory chips every memory cycle, whereas the CM-2 can cycle all of its memory chips every memory cycle.
As shown in Figure 3, PASSWORK basically models the MPP arithmetic logic unit (ALU), with extensions for indirect addressing and floating point. This MPP model supports both a one- and two-dimensional toroidal mesh of processors. Corner-turning is used extensively for interprocessor routing, floating point, and indirect addressing/table lookup. The PASSWORK library supports a full complement of bit-serial operations that treat bits as first-class objects. Both the massively parallel dimension and the bit-serial dimension are fully exposed to the programmer for algorithmic space/time tradeoff. Other features include
• software support for interactive bit-plane graphics on SUN 3/4 workstations with single-step/animation display at 20 frames per second (512 × 512 images);
• input/output of variable-length integers expressed as decimal or hexadecimal values;
• variable-precision, unsigned-integer arithmetic, including addition, subtraction, multiplication, division, and GCD computations; and
• callable procedures from both C and Fortran.
In summary, the PASSWORK system demonstrates that a vector machine can provide the best of both SIMD and MIMD worlds in one shared-memory machine architecture. The only significant performance limits to SIMD computing on a vector machine are memory bandwidth, the ability to efficiently corner-turn data in a vector register, and the ability to perform multiple logical operations in a single tick.
In contrast to real SIMD machines, a vector machine can more easily trade space for time and provide the exact amount of parallelism needed to solve an actual problem. In addition, global operations like processor tally and global "or" are performed much faster on vector machines than on real SIMD machines.
In my opinion, the SIMD model of computation is much more applicable to general problem solving than is realized today. Causes for this
may be more psychological than technical and are possibly due to a Catch-22 between the availability of SIMD research tools and real SIMD machines. Simulators like PASSWORK are keys to breaking this Catch-22 by providing a portable SIMD programming environment for developing new parallel algorithms on conventional machines. Ideally, results of this research will drive the design of even higher-performance SIMD engines.
Related to this last remark, SRC has initiated a research project called PETASYS to investigate the possibility of doing SIMD computing in the memory address space of a general-purpose machine. The basic idea here is to design a new kind of memory chip (a processor-in-memory chip) that associates a single-bit processor with each column of a standard RAM. This will break the von Neumann bottleneck between a CPU and its memory and allow a more natural evolution from MIMD to a mixed MIMD/SIMD computing environment.
Applications in this mixed computing environment are just now beginning to be explored at SRC. One of the objectives of the PETASYS Project is to design a small-scale PETASYS system on a SUN 4 platform with 4K SIMD processors and a sustained bit-serial performance of 10 gigabit operations per second. Scaling this performance into the supercomputing arena should eventually provide a sustained SIMD
performance of 10^15 bit operations per second across 64 million SIMD processors. The Greek prefix peta, representing 10^15, suggested a good name for this SRC research project and potential supercomputer—PETASYS.
Vectors Are Different
Steven J. Wallach
Steven J. Wallach, a founder of CONVEX Computer Corporation, is Senior Vice President of Technology and a member of the CONVEX Board of Directors. Before founding CONVEX, Mr. Wallach served as product manager at ROLM for the 32-bit mil-spec computer system. From 1975 to 1981, he worked at Data General, where he was the principal architect of the 32-bit Eclipse MV superminicomputer series. As an inventor, he holds 33 patents in various areas of computer design. He is featured prominently in Tracy Kidder's Pulitzer Prize-winning book, The Soul of a New Machine.
Mr. Wallach received a B.S. in electrical engineering from Polytechnic University, New York, an M.S. in electrical engineering from the University of Pennsylvania, and an M.B.A. from Boston University. He serves on the advisory council of the School of Engineering at Rice University, Houston, and on the external advisory council of the Center for Research on Parallel Computation, a joint effort of Rice/Caltech/Los Alamos National Laboratory. Mr. Wallach also serves on the Computer Systems Technical Advisory Committee of the U.S. Department of Commerce and is a member of the Board of Directors of Polytechnic University.
In the late 1970s, in the heyday of Digital Equipment Corporation (DEC), Data General, and Prime, people were producing what we called minicomputers, and analysts were asking how minicomputers were different
from an IBM mainframe, etc. We used to cavalierly say that if it was made east of the Hudson River, it was a minicomputer, and if it was made west of the Hudson River, it was a mainframe.
At the Department of Commerce, the Computer Systems Technical Advisory Committee is trying to define what a supercomputer is for export purposes. For those who know what that entails, you know it can be a "can of worms." In Figure 1, I give my view of what used to be the high-end supercomputer, which had clock cycle X. What was perceived as a microcomputer in the late 1970s was 20X, and perhaps if it came out of Massachusetts, it was 10- to 15X. Over time, we have had three different slopes, as shown in Figure 1. The top slope is a RISC chip, and that is perhaps where CONVEX is going, and maybe this is where Cray Research, Inc., and the Japanese are going. We are all converging on something called the speed of light. In the middle to late 1990s, the clock-cycle difference between all levels of computing will be, at best, four to one, not 20 to one.
The next question to consider is, how fast do things have to be to eliminate vectors and use scalar-only processors? That is why I titled my talk "Vectors are Different."
I think the approach to take is to look at both the hardware and the software. If you only look at the hardware, you will totally miss the point.
I used to design transistor amplifiers. However, today I tend to be more comfortable with compiler algorithms. So what I am going to do is talk about vectors, first with respect to software and then with respect to hardware, and see how that relates to scalar processing.
First, the reason we are even asking the question is that we have very highly pipelined machines. But, as Harvey Cragon pointed out earlier in this session, as have others, we have a history of vectorizing compilers and vectorized coding styles. I would not say that it is "cookie-cutter" technology, but it is certainly based on the pioneering work of such people as Professors David Kuck and Ken Kennedy (presenters in Session 5).
I ran the program depicted in Figure 2 on a diskless SUN with one megabyte of storage. I kept wondering why, after half an hour, I wasn't getting the problem done. So I put a PRINT statement in and realized that about every 15 to 20 minutes I was getting another iteration. The problem was that it was page-faulting across the network. I then ran that benchmark on five different RISC machines or workstations, which are highly pipelined machines that are supposed to put vector processing out of business.
Examine Table 1 row by row, not column by column. What is important is the ratio of row entries for the same processor. In Table 1, the right way means i is in the inner loop; the wrong way means j is in the inner loop. On the MIPS processor, it was a difference of three to one. That is, if I programmed DO j, DO i versus DO i, DO j, it was three to one for single precision.
Double precision for the R6000 was approximately the same. The R6000 was the best of them because it had the biggest cache, a dual-level cache (Table 1). Single precision for the R3000 was about 10 to one, and double precision was five to one. Now, that may sound counterintuitive. Why should the ratio be smaller with a bigger data type? The reason is that you go nonlinear sooner, because the data set outgrows the cache sooner. The Solbourne, which is a SPARC, is three to one.
We couldn't get the REAL*8 case to run because the process size was too big relative to physical memory, and we didn't know how to fix it. Then we ran it on RIOS, the IBM RS/6000 chip set, the hottest thing. The results were interesting: 20 to one and 20 to one. On the version with the smaller cache, you get almost 30 to one.
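The program in Figure 2 is not reproduced here, so the following is only my illustrative sketch of the loop-order effect behind Table 1, not the original benchmark. Python lists of lists are laid out row by row, the mirror image of Fortran's column-major arrays, so here the "right way" puts j (the second index) in the inner loop. Note that in interpreted Python the interpreter overhead masks most of the cache effect; the large ratios in Table 1 come from compiled code.

```python
import time

N = 500
# a is stored row by row: a[i][j] and a[i][j+1] are adjacent in memory.
a = [[1.0] * N for _ in range(N)]

def sum_right_way(m):
    """Inner loop (j) walks adjacent elements: unit-stride access."""
    s = 0.0
    for i in range(len(m)):
        for j in range(len(m[0])):
            s += m[i][j]
    return s

def sum_wrong_way(m):
    """Inner loop (i) jumps from row to row: strided access."""
    s = 0.0
    for j in range(len(m[0])):
        for i in range(len(m)):
            s += m[i][j]
    return s

# Both orders compute the same answer; only the access pattern differs.
t0 = time.perf_counter(); good = sum_right_way(a); t1 = time.perf_counter()
bad = sum_wrong_way(a); t2 = time.perf_counter()
print(good == bad)  # True
print(f"unit-stride {t1 - t0:.3f}s, strided {t2 - t1:.3f}s")
```

The same interchange, written as DO j, DO i versus DO i, DO j in Fortran, is exactly what a vectorizing compiler performs automatically when the dependencies allow it.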
What does this all mean? What it really means to me is that to use these highly pipelined machines, especially if they all have caches, we're basically going to vectorize our code anyway. Whether we call it vectors or not, we're going to code DO j, DO i, just to get three or 10 times the performance. It also means that the compilers for these machines are
going to have to do dependency analysis and all these vector transformations to overcome this.
In our experience at CONVEX, when we have taken code and "vectorized" it, 99 times out of 100 you can put it back on the scalar machine and it runs faster. The difference is, if it only runs 15 per cent faster, people say, "Who cares? It's only 15 per cent." Yet, if my mainline machine is going to be this class of scalar pipeline machines, 15 per cent actually is a big deal. As the problem size gets bigger, that increase in performance will get bigger also. By reordering the loops on any of those workstations, I got anywhere from three to 20 times the performance.
No matter what we call them, I believe the compilers for these machines are going to be vectorizing compilers. We just won't say so, even though they will be doing vector transformations anyway.
Now let's look at hardware. The goals of hardware design are to maximize operations on operands and use 100 per cent of available memory bandwidth. When I consider clock cycles and other things, I design computers by starting with the memory system, and I build the fastest possible memory system I can and make sure the memory is busy 100 per cent of the time, whether it's scalar or vector. It's very simple—you tell me a memory bandwidth, and I'll certainly tell you your peak performance. As we all know, you very rarely get the peak, but at least I know the peak.
The real issue as we go forward is, if we want to get performance, we have to design high-speed memory systems that effectively approach the point where we get an operand back every cycle. If we get an operand back at every cycle, we're effectively designing a vector memory system, whether we call it that or not. In short, memory bandwidth determines the dominant cost of a system.
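The "tell me a memory bandwidth and I'll tell you your peak performance" claim is simple arithmetic. The numbers below are hypothetical, chosen only to show the calculation, not a description of any real machine:

```python
def peak_ops_per_sec(bandwidth_bytes, bytes_per_word, operands_per_op):
    """Upper bound on operation rate that a memory system can feed:
    words delivered per second, divided by operands consumed per operation."""
    words_per_sec = bandwidth_bytes / bytes_per_word
    return words_per_sec / operands_per_op

# Hypothetical single memory pipe: one 8-byte word per 10-ns clock.
bw = 8 / 10e-9                      # 800 MB/s
print(peak_ops_per_sec(bw, 8, 1))   # 1e8: at most 100 Mflop/s if each
                                    # operation consumes one fresh operand
```

As the talk notes, sustained performance rarely reaches this bound, but no amount of CPU improvement can exceed it.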
Also, as we build machines in the future, we're beginning to find out that the cost of the CPU is becoming lower and lower. All the cost is in the memory system, the crossbar, the amount of interleaving, and the mechanics. Consequently, again, we're paying for bandwidth. Figure 3 shows you some hard numbers on these costs, and these numbers are not just off the top of my head.
The other thing that we have to look at is expandability, which adds cost. If you have a machine that can go from baseline X to 4X, whatever that means, you have to design that expandability in from the start. Even if you buy the machine with the baseline configuration, there is overhead, and you're paying for the ability to go to 4X. For example, look at the IBM RS/6000, a workstation that IBM lists for $15,000. But it's not expandable in the scheme of things. It's a uniprocessor.
Now let's examine the server version of the RS/6000. We are using price numbers from the same company and using the same chips. So the comparison in cost is apples to apples. When I run a single-user benchmark with little or no I/O (like the LINPACK), I probably get the same performance on the workstation as I do on the server, even though I pay six times more for the server. The difference in price is due to the server having expandability in I/O, CPU, and physical memory.
Workstations generally don't have this expandability. If they do, they start to be $50,000 single-seat workstations. There were two companies that used to make these kinds of workstations, which seems to prove that $50,000 single-user workstations don't sell well.
What is the future, then? For superscalar hardware, I think every one of these machines eventually will have to have vector-type memory systems to sustain CPU performance. If I can't deliver an operand every cycle, I can't operate on one every cycle. You don't need a Ph.D. in computer science to figure that out. If you have a five-inch pipe feeding water to a 10-inch pipe, you can expand the 10-inch pipe to 20 inches, but you're still going to have the same flow through the pipe. You have to have it balanced. Because all these machines so far have caches, you still need loop interchange and other vector transformations.
You're going to see needs for "blocking algorithms," that is, how to take matrices or vectors and make them submatrices to get the performance. I call it "strip mining." In reality, it's the same thing. That is, how do I take a data set and divide it into a smaller data set to fit into a register or memory to get high-speed performance? Fundamentally, it's the same type of algorithm.
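A minimal sketch of the blocking idea, in Python for readability: a tiled matrix multiply in which each block of the operands is reused while it is still register- or cache-resident. The block size of 32 is an arbitrary illustrative choice.

```python
def blocked_matmul(A, B, block=32):
    """Tiled matrix multiply: process the n x n problem as sub-blocks
    so each tile of A, B, and C stays in fast storage while it is reused."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):          # tile the i dimension
        for kk in range(0, n, block):      # tile the k dimension
            for jj in range(0, n, block):  # tile the j dimension
                for i in range(ii, min(ii + block, n)):
                    row_c = C[i]
                    for k in range(kk, min(kk + block, n)):
                        aik = A[i][k]
                        row_b = B[k]
                        for j in range(jj, min(jj + block, n)):
                            row_c[j] += aik * row_b[j]
    return C
```

Whether one calls this "blocking" or "strip mining," the transformation is the same: carve the data set into pieces sized to the fast level of the memory hierarchy, exactly as the talk says.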
So I believe that, again, we're all going to be building the memory systems, and we're going to be building the compilers. Superscalar hardware will become vector processors in practice, but some people won't acknowledge it.
Now, what will happen to true vector machines? As Les Davis pointed out in the first presentation in this session, they're still going to be around for a long time. I think more and more we're going to have specialized functional units. A functional unit is something like a multiplier or an adder or a logical unit. Very-large-scale integration will permit specialized functional units to be designed, for example, units performing O(N²) operations on O(N) data. The type of thing that Ken Iobst was talking about with corner-turning (see the preceding paper, this session) is an example of that. A matrix multiply is another example, but in that case it uses O(N²) data for O(N³) operations. More and more you'll see more emphasis on parallelism (MIMD), which will evolve into massive parallelism.
The debate will continue on the utility of multiple memory paths. I view that debate as whether you should have one memory pipe, two memory pipes, or N memory pipes. I've taken a very straightforward position in the company: there's going to be one memory pipe. Not all agree with me.
I look at benchmarks, and I look at actual code, and I find that the clock cycle (the memory bandwidth of a single pipe) seems to be the more determining factor. If I'm going to build a memory system to support all the bandwidth of multiple pipes, then, because of parallelism, rather than having eight processors with three pipes each, I'd rather have a 24-processor machine and utilize the bandwidth that way, because I get more scalar processing. Because of parallel compilers, I'll probably get more of that effective bandwidth than someone else will with eight processors using three memory pipes each.
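The trade-off above can be put in rough numbers. These configurations are hypothetical, chosen only to make the arithmetic concrete:

```python
# Two ways to spend the same aggregate memory bandwidth (hypothetical).
pipes_per_cpu_a, cpus_a = 3, 8    # eight vector processors, three pipes each
pipes_per_cpu_b, cpus_b = 1, 24   # twenty-four processors, one pipe each

# Both configurations demand 24 memory pipes' worth of bandwidth.
assert pipes_per_cpu_a * cpus_a == pipes_per_cpu_b * cpus_b == 24

# But configuration B offers three times as many independently
# schedulable scalar processors for a parallel compiler to fill.
print(cpus_b / cpus_a)  # 3.0
```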
One other issue: there shall be no Cobol compiler. I have my lapel button that says, "Cobol can be eliminated in our lifetime if we have to." In reality, ever since we've been programming computers, we've had the following paradigm, although we just didn't realize it: we had that part of the code that was scalar and that part of the code that was vector. We go back and forth. You know, nothing is ever 100 per cent vectorizable, nothing is 100 per cent parallelizable, etc. So I think, going forward, we'll see machines with all these capabilities, and the key is getting the compilers to do the automatic decomposition for this.
If we're approaching the speed of light, then, yes, we're all going to go parallel—there's no argument, no disagreement there. But the real issue is, can we start beginning to do it automatically?
With gallium arsenide technology, maybe I can build an air-cooled, four- or five-nanosecond machine. It may not be as fast as a one-nanosecond machine, but if I can get it on one chip, maybe I can put four of them on one board. That may be a better way to go than a single, one-nanosecond chip because, ultimately, we're going to go to parallelism anyway.
The issue is that, since we have to have software to break this barrier, we have to factor that into how we look at future machines, and we can't just let clock cycle be the determining factor. I think if we do, we're "a goner." At least personally I think, as a designer, that's the wrong way to do it. It's too much of a closed-ended way of doing it.