Parallel Processing:
Moving into the Mainstream

H. T. Kung

H. T. Kung joined the faculty of Carnegie Mellon University in 1974, after receiving his Ph.D. there. Since 1992 he has been serving as the Gordon McKay Professor of Electrical Engineering and Computer Science at Harvard University. During a transition period, he continues his involvement with projects under way at Carnegie Mellon. Dr. Kung's research interests are in high-speed networks, parallel-computer architectures, and the interplay between computer and telecommunications systems. Together with his students, he pioneered the concept of systolic-array processing. This effort recently culminated in the commercial release by Intel Supercomputer Systems Division of the iWarp parallel computer.

In the area of networks, Dr. Kung's team has developed the Nectar System, which uses fiber-optic links and large crossbar switches. A prototype system employing 100 megabits/second links and more than 20 hosts has been operational since early 1989. The team is currently working with industry on the next-generation Nectar, which will employ fibers operating at gigabits/second rates. The gigabit Nectar is one of the five testbeds in a current national effort to develop gigabits/second wide-area networks. His current network research is directed toward gigabit, cell-based local-area networks capable of guaranteeing performance.


I will focus on three key issues in parallel processing: computation models, interprocessor communication, and system integration. These issues are important in moving parallel processing into the mainstream of computing. To illustrate my points, I will draw on examples from the systems we are building at Carnegie Mellon University—iWarp and Nectar. iWarp is a fine-grain parallel machine developed under a joint project with Intel. Nectar is a network backplane that connects different kinds of machines together.

In discussing computation models, we are really trying to address some of the most difficult problems that people are having with parallel computers. Namely, it can be very difficult to write code for these machines, and application codes are not portable between parallel machines or between sequential and parallel machines. That has been very troublesome.

There have been attempts to solve these problems. For example, some theoretically oriented researchers have come up with the PRAM model, which presents to the user a parallel computation model that hides almost all the properties of a parallel machine. However, because of its high degree of generality and transparency, the PRAM model does not exploit useful properties such as the locality and regularity that we have worked on so diligently for the last 30 years in order to achieve high-performance computing. What we want, at least in the beginning, is more specialized models.

Therefore, I propose to work on models that users really understand. For example, for each specific area whose computation characteristics we understand well, we develop a parallelizing compiler, although we do not call it a compiler because it is not so general purpose. Instead, we call it a parallel program generator (Figure 1). We start with a specification that carries no detailed knowledge about the underlying parallel machine. A compiler then generates code, containing SEND and RECEIVE instructions, for the specific parallel machine. So users can work at a high level in a machine-independent manner.
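To make this concrete, here is a rough, hypothetical sketch of the shape of the code such a generator might emit for one cell of a linear array: a receive-compute-send stage. The send and receive routines are stand-ins for the machine's primitives, simulated with a plain buffer so the sketch is self-contained; this is not the actual output of any of our generators.

    /* Hypothetical sketch of code a parallel program generator might emit
     * for one cell of a linear array: receive from the left neighbor,
     * apply the user's function, send to the right neighbor.  The send
     * and receive primitives are stand-ins, simulated here with a single
     * buffer so the sketch compiles and runs on its own.                 */
    #include <stdio.h>

    #define NCELLS 4

    static double channel[NCELLS + 1];       /* channel[i]: link into cell i */

    static void   send(int link, double v)   { channel[link] = v; }
    static double receive(int link)          { return channel[link]; }

    /* The user-supplied local computation; here just a placeholder. */
    static double user_op(double x)          { return 2.0 * x + 1.0; }

    /* Generated per-cell program: one pipeline stage. */
    static void cell_program(int cell_id)
    {
        double x = receive(cell_id);         /* RECEIVE from left neighbor */
        double y = user_op(x);               /* local computation          */
        send(cell_id + 1, y);                /* SEND to right neighbor     */
    }

    int main(void)
    {
        send(0, 1.0);                        /* inject input at the left end */
        for (int c = 0; c < NCELLS; c++)     /* simulate the array in order  */
            cell_program(c);
        printf("array output: %g\n", receive(NCELLS));
        return 0;
    }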

A concrete example is the APPLY compiler, or parallel program generator (Figure 2), used for many image-processing computations defined in terms of localized operations. That is, each output pixel depends on a small neighborhood of the corresponding input pixel. In APPLY, the software can generate the code for each processor and also handle the boundary operations, all automatically.
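For illustration only, the following plain C routine shows the kind of local operator APPLY targets: a 3-by-3 neighborhood operation with clamped image borders. This is not the APPLY language itself; APPLY would generate the per-processor loops and this boundary bookkeeping automatically.

    /* Sketch of the kind of local operator APPLY targets: each output
     * pixel is a function of a small neighborhood (here 3x3) of the
     * corresponding input pixel.  Plain sequential C, for one image.  */
    #include <stdio.h>

    #define H 8
    #define W 8

    static int clampi(int v, int lo, int hi)
    {
        return v < lo ? lo : (v > hi ? hi : v);
    }

    /* 3x3 box filter with clamped borders. */
    static void box3x3(const float in[H][W], float out[H][W])
    {
        for (int r = 0; r < H; r++)
            for (int c = 0; c < W; c++) {
                float sum = 0.0f;
                for (int dr = -1; dr <= 1; dr++)
                    for (int dc = -1; dc <= 1; dc++)
                        sum += in[clampi(r + dr, 0, H - 1)]
                                 [clampi(c + dc, 0, W - 1)];
                out[r][c] = sum / 9.0f;
            }
    }

    int main(void)
    {
        float in[H][W] = {{0}}, out[H][W];
        in[H / 2][W / 2] = 9.0f;             /* single bright pixel */
        box3x3(in, out);
        printf("center output value: %g\n", out[H / 2][W / 2]);
        return 0;
    }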


Figure 1.
Parallel program generators for special computation models.

Figure 2.
APPLY: a parallel program generator for data parallelism using local operators.


Currently APPLY is operational for Warp, iWarp, Transputers, and a couple of other machines. Actually, it generates code for sequential machines as well, so you can develop your APPLY programs in the comfortable environment of a sequential machine. Intel is going to support APPLY for the iWarp.

Another program generator developed at Carnegie Mellon is called AL, which generates matrix operations automatically. In fact, we generated much of the LINPACK code on Warp using this language. The language is basically like Fortran but allows hints to the compiler about roughly how arrays should be partitioned onto a parallel machine, e.g., along certain directions. The code for the specific parallel machine is then generated automatically.
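As a hypothetical sketch of the distribution idea behind such a hint (not AL syntax), the following C program partitions a matrix by column blocks across a few simulated processors for a matrix-vector product and then combines the partial results.

    /* Sketch of the distribution an AL-style hint implies: partition a
     * matrix by column blocks across P "processors" (simulated here as
     * loop iterations), compute partial matrix-vector products, then
     * sum the partial results.  Illustrative only; not AL itself.      */
    #include <stdio.h>

    #define N 8
    #define P 4                              /* number of processors */

    int main(void)
    {
        double a[N][N], x[N], y[N] = {0};
        for (int i = 0; i < N; i++) {
            x[i] = 1.0;
            for (int j = 0; j < N; j++)
                a[i][j] = (i == j) ? 2.0 : 0.0;   /* simple test matrix */
        }

        int cols_per_proc = N / P;
        for (int p = 0; p < P; p++) {        /* each processor owns a column block */
            double partial[N] = {0};
            for (int j = p * cols_per_proc; j < (p + 1) * cols_per_proc; j++)
                for (int i = 0; i < N; i++)
                    partial[i] += a[i][j] * x[j];
            for (int i = 0; i < N; i++)      /* reduction of partial results */
                y[i] += partial[i];
        }

        printf("y[0] = %g (expect 2)\n", y[0]);
        return 0;
    }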

We also have a very high-level parallel program generator, called ASSIGN, for large signal flow graphs whose nodes are signal-processing operations such as fast Fourier transforms, filters, and so on (Figure 3). You exploit the fact that there are so many graph nodes to deal with; as a result, you can usually load balance them by mapping an appropriate number of graph nodes onto each processor. Again, this mapping is done automatically.
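The load-balancing idea can be illustrated with a simple greedy heuristic, sketched below in C: each graph node, with an assumed cost, is placed on the currently least-loaded processor. This is only an illustration, not the mapping algorithm ASSIGN actually uses.

    /* Illustrative load balancing of signal flow graph nodes: place each
     * node (with an estimated cost) on the least-loaded processor.      */
    #include <stdio.h>

    #define NODES 10
    #define PROCS 3

    int main(void)
    {
        /* Assumed cost of each graph node (e.g., an FFT or a filter). */
        double cost[NODES] = {5, 3, 8, 2, 7, 4, 6, 1, 9, 2};
        double load[PROCS] = {0};

        for (int n = 0; n < NODES; n++) {
            int best = 0;                    /* find the least-loaded processor */
            for (int p = 1; p < PROCS; p++)
                if (load[p] < load[best])
                    best = p;
            load[best] += cost[n];
            printf("node %d (cost %g) -> processor %d\n", n, cost[n], best);
        }

        for (int p = 0; p < PROCS; p++)
            printf("processor %d total load: %g\n", p, load[p]);
        return 0;
    }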

One of the most difficult things, so far, for a parallel machine to handle is branch-and-bound types of operations (Figure 4), which are similar to searching a tree. In this type of operation you do a lot of backtracking, which can depend on the computation you are doing at run time. For example, you might sometimes like to go deep so that you can do a useful operation, but you also want to go broad so that you can increase concurrency. Usually only the user has a rough idea of what the priority is between the depth-first and the breadth-first search. Yet today we do not even have languages that can express that idea. Even if the user knows that you should go deep a little before you go broad, we still do not know how to say it to the parallel machine. So we are pretty far away from having a general computation model.
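One way to imagine expressing that preference is sketched below: a best-first search skeleton (pruning omitted) in which a hypothetical weight mixes "go deep" against "go broad" when choosing the next node to expand. The tree, the bound values, and the weight are all made up for illustration; no existing language or machine supports this directly.

    /* Sketch of a user-tunable depth-versus-breadth preference: expand a
     * frontier best-first, with ALPHA weighting depth against a node's
     * bound.  The tree, bounds, and ALPHA are invented for illustration. */
    #include <stdio.h>

    #define MAX_FRONTIER 1024
    #define MAX_DEPTH    4
    #define ALPHA        0.7     /* closer to 1.0 = prefer going deep */

    struct node { int depth; double bound; };

    int main(void)
    {
        struct node frontier[MAX_FRONTIER] = {{0, 1.0}};
        int n = 1, expanded = 0;

        while (n > 0) {
            int best = 0;                    /* pick the highest-priority node */
            for (int i = 1; i < n; i++) {
                double pi = ALPHA * frontier[i].depth + (1 - ALPHA) * frontier[i].bound;
                double pb = ALPHA * frontier[best].depth + (1 - ALPHA) * frontier[best].bound;
                if (pi > pb) best = i;
            }
            struct node cur = frontier[best];
            frontier[best] = frontier[--n];  /* remove it from the frontier */
            expanded++;

            if (cur.depth < MAX_DEPTH && n + 2 <= MAX_FRONTIER) {
                /* two children with slightly weaker bounds */
                frontier[n++] = (struct node){cur.depth + 1, cur.bound * 0.9};
                frontier[n++] = (struct node){cur.depth + 1, cur.bound * 0.8};
            }
        }
        printf("expanded %d nodes\n", expanded);
        return 0;
    }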

A strategy that I propose is to gain experience in a few computation models for special application areas. At the same time, we should develop insights for the more general models.

The second key issue in parallel processing is interprocessor communication. After all, parallel machines are about communications between different processors. A lot of the parallel machines that people have today are really built out of processing elements that are not good for communication. As a result, some vendors are telling people that their parallel machines are great—but only if you don't communicate!


Figure 3.
ASSIGN: a parallel program generator for signal flow graphs.

Figure 4.
General models are not quite there yet.


We are trying to make a processor that is good for computation and good for communication. In the case of iWarp, you can build a single component with communication and computation on the same chip, and then you can use this component to form different parallel processing arrays (Figure 5). Once you have such a processor array, you can program each single cell using C and Fortran in the conventional manner. Then you can use parallelizing compilers, or parallel program generators as described above, to generate parallel code for the array.

The iWarp component itself has both a computation agent and a communication agent (Figure 6). Most significant is that the communication part of the chip can do a total of 320 megabytes/second of I/O, whereas other current components can do no more than 10 megabytes/second. In addition, we have 160 megabytes/second of I/O to the local memory. If you add them up, that is 480 megabytes/second in the current version.

iWarp has three unique innovations in communication: high-bandwidth I/O, systolic communication, and logical channels.

Obviously, in any parallel machine, you have to be good at I/O because you must at least be able to get the input data in quickly. For example, if you have a high-performance parallel interface (HIPPI) channel with 100 megabytes/second of I/O bandwidth, you can probably get an interface of that bandwidth into the array (see Figure 7). The challenge, however, is how to distribute the input data onto multiple processing elements.

Figure 5.
iWarp: a VLSI building block for parallel systems.


Figure 6.
A single-chip component capable of 480 megabytes/second I/O.

Figure 7.
High-bandwidth distribution and collection of data in processor array.


For this distribution we need a large I/O bandwidth per processing element.

For iWarp we have eight links, each of which runs at 40 megabytes/second. In particular, we can simultaneously use four links, each at 25 megabytes/second, to distribute a 100 megabytes/second input in parallel, and another four links to handle the 100 megabytes/second output.
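A minimal sketch of this kind of striping, assuming a simple round-robin discipline and 4-byte transfers (illustrative assumptions, not iWarp's actual hardware protocol), is the following C program, which shows each of the four links carrying a quarter of the stream.

    /* Sketch of striping one high-rate input stream across four slower
     * links: words are dealt out round-robin, so each link carries one
     * quarter of the traffic.  The "links" are just counters here.     */
    #include <stdio.h>

    #define NLINKS 4
    #define NWORDS 1000

    int main(void)
    {
        long bytes_on_link[NLINKS] = {0};

        for (int w = 0; w < NWORDS; w++) {
            int link = w % NLINKS;           /* round-robin striping      */
            bytes_on_link[link] += 4;        /* 4-byte word per transfer  */
        }

        for (int l = 0; l < NLINKS; l++)
            printf("link %d carried %ld bytes (%.0f%% of total)\n",
                   l, bytes_on_link[l], 100.0 * bytes_on_link[l] / (4.0 * NWORDS));
        return 0;
    }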

Besides the I/O bandwidth, the other important thing about each processor is easy and direct access to I/O. This is done in iWarp through the systolic communication mechanism. Figure 8 compares systolic communication with the traditional memory-based, message-passing communication.

One of the things that any parallel machine should do well is distribute an array, say a row of an image, onto a number of processing elements. In iWarp a processor can send out a row in a single message destined for many other processing elements and have the first processing element take one end of the row (Figure 9). The first processor can then change the message header, redirect the rest of the message into the second processing element, and so on. Note that the sender only needs to send out one message to distribute the data to many other processors.
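A rough sketch of that redirection, with the message modeled simply as a header (a word count) plus a pointer into a buffer, is shown below; the hardware-level details are omitted and the chunk sizes are assumed to be equal.

    /* Sketch of message redirection: a sender emits one message holding
     * an entire row; each cell keeps the leading chunk, rewrites the
     * header to cover the remainder, and forwards it to the next cell. */
    #include <stdio.h>

    #define ROW_LEN 12
    #define NCELLS  4

    struct message { int count; const float *data; };

    int main(void)
    {
        float row[ROW_LEN];
        for (int i = 0; i < ROW_LEN; i++) row[i] = (float)i;

        struct message msg = { ROW_LEN, row };   /* single message from the sender */
        int chunk = ROW_LEN / NCELLS;

        for (int cell = 0; cell < NCELLS; cell++) {
            float local_sum = 0.0f;
            for (int i = 0; i < chunk; i++)      /* this cell keeps its chunk */
                local_sum += msg.data[i];

            printf("cell %d took %d words, sum %g\n", cell, chunk, local_sum);

            msg.data  += chunk;                  /* rewrite header: the rest of   */
            msg.count -= chunk;                  /* the message goes to next cell */
        }
        return 0;
    }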

Figure 8.
Systolic communications in iWarp.


Figure 9.
Message redirection in iWarp.

Logical channels address another important concern. No matter how many pins or connectors you have, you never have enough to support some applications. People always want more connectivity so that they can map computations onto the parallel machine easily and reconfigure the parallel-processor array more easily. Therefore, it is very useful to time-multiplex each wire with hardware support, so that logically you can imagine having many, many wires instead of one physical wire.

For iWarp we can have 20 logical connections in and out of each chip. For example, we can have both blue and red connections operating at the same time. A programmer can use the blue connection for some computation; in the meantime, system messages can always proceed on the red connection without being blocked by the programmer's computation. Therefore, you can logically reserve some bandwidth for system use (Figure 10). Once you have these kinds of logical connections, you will also find it very easy to reconfigure a processor array. For example, to avoid a faulty node, you can simply route around it using logical connections. Because so many logical connections are available, the routing is easy.
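The following sketch illustrates time-multiplexing one physical link among a few logical channels, with one channel reserved for system messages so that user traffic can never block it. The channel count and the scheduling policy are assumptions made for illustration, not the actual iWarp mechanism.

    /* Sketch of time-multiplexing one physical link among logical
     * channels: every NCHANNELS-th cycle is reserved for the system
     * channel (channel 0); remaining cycles round-robin user channels. */
    #include <stdio.h>

    #define NCHANNELS 4          /* channel 0 reserved for system traffic */
    #define CYCLES    20

    int main(void)
    {
        int pending[NCHANNELS] = {3, 10, 10, 10};   /* words queued per channel */
        int sent[NCHANNELS]    = {0};

        for (int cycle = 0; cycle < CYCLES; cycle++) {
            int ch = -1;
            if (cycle % NCHANNELS == 0 && pending[0] > 0) {
                ch = 0;                              /* reserved slot: system channel */
            } else {
                for (int c = 1; c < NCHANNELS && ch < 0; c++) {
                    int cand = 1 + (cycle + c) % (NCHANNELS - 1);  /* user channels */
                    if (pending[cand] > 0) ch = cand;
                }
            }
            if (ch >= 0) { pending[ch]--; sent[ch]++; }
        }

        for (int c = 0; c < NCHANNELS; c++)
            printf("channel %d sent %d words\n", c, sent[c]);
        return 0;
    }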

The message here is that although we have already seen some successful parallel machines, we have not yet seen a really good parallel machine that supports the very basic interprocessor communication operations that applications typically need.


Figure 10.
Reconfiguration for fault tolerance: an application of logical channels.

In the future, parallel machines built out of processing building blocks that inherently support efficient and flexible interprocessor communication will be much easier to use.

The last key issue in parallel processing I will address is system integration. Parallel machines are not suited for sequential operations, by definition. Thus, parallel machines typically need to be integrated into a general computing environment. The more powerful a machine is, the more accessible it ought to be. Ideally, all of these computers should be connected together. Actually, this kind of configuration is happening at almost all high-performance computing sites (Figure 11).

At Carnegie Mellon, we are building such a network-based high-performance computing system, called Nectar (Figure 12). Nectar has a general network that supports very flexible, yet high-bandwidth and low-latency, communication. The Nectar demo system (Figure 13) connects about 26 hosts. Currently, we can do interhost communication at about 100 megabits/second, with a latency of about 170 microseconds. Nectar supports the transmission control protocol/Internet protocol (TCP/IP).


Figure 11.
Accessibility in a general-computing environment.

Figure 12.
The Nectar System at Carnegie Mellon University.


Figure 13.
Host Nectar demo system (May 1990).

If the TCP checksum is turned off, the CAB-to-CAB TCP/IP bandwidth (i.e., without crossing the host VME bus) can be close to the peak link bandwidth of 100 megabits/second for large packets of 32 kilobytes. It is reasonable to consider the case where the TCP checksum is turned off, because this checksum can easily be implemented in hardware in the future.
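A back-of-the-envelope check on these numbers, treating the 170-microsecond latency as an approximate fixed per-packet overhead (a simplification), shows why large packets get close to the link rate:

    /* Rough check: at 100 megabits/second, a 32-kilobyte packet takes
     * about 2.6 milliseconds on the wire, so a fixed per-packet overhead
     * of roughly 170 microseconds still permits throughput near the peak
     * link bandwidth.  The overhead model is a simplification.          */
    #include <stdio.h>

    int main(void)
    {
        double link_bps    = 100e6;          /* 100 megabits/second link    */
        double packet_bits = 32.0 * 1024 * 8;/* 32-kilobyte packet          */
        double overhead_s  = 170e-6;         /* assumed per-packet overhead */

        double wire_s      = packet_bits / link_bps;
        double goodput_bps = packet_bits / (wire_s + overhead_s);

        printf("time on wire: %.2f ms\n", wire_s * 1e3);
        printf("effective throughput: %.1f megabits/second\n", goodput_bps / 1e6);
        return 0;
    }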

We believe that over the next 18 months we can build a Nectar-like system with much improved performance. In particular, we are building a HIPPI Nectar, which will support 800 megabits/second HIPPI channels. For TCP/IP, the goal is to achieve a bandwidth of at least 300 megabits/second. With Bellcore we are also working on an interface between HIPPI networks and telecommunications ATM/SONET networks.

Figure 14 shows a network-based multicomputer configuration that we proposed about six months ago. With this kind of a network, you can literally have a system capable of 10^11 or even 10^12 floating-point operations per second, without even building any new machines.

In summary, at Carnegie Mellon we are making progress on the key issues in parallel processing that I have discussed.


Figure 14.
High-speed networks make high-performance computing over network feasible.

