Compiler Issues for TFLOPS Computing
Ken Kennedy
Ken Kennedy received a B.A. in mathematics from Rice University, Houston, in 1967, an M.S. in mathematics from New York University in 1969, and a Ph.D. in computer science from New York University in 1971. He has been a faculty member at Rice since 1971, achieving the rank of Professor of Mathematical Sciences in 1980. He served as Chairman of the Department of Computer Science from its founding in 1984 until 1989 and was appointed Noah Harding Professor in 1985. He is currently Director of the Computer and Information Technology Institute at Rice and heads the Center for Research on Parallel Computing, an NSF Science and Technology Center at Rice, Caltech, and Los Alamos National Laboratory.
From 1985 to 1987, Professor Kennedy chaired the Advisory Committee to the NSF Division of Computer Research and has been a member of the Board of the Computing Research Association since 1986. In 1990, he was elected to the National Academy of Engineering.
Professor Kennedy has published over 60 technical articles on programming support software for high-performance computer systems and has supervised the construction of two significant software systems for programming parallel machines: a vectorizer for Fortran and an integrated scientific software development environment.
Professor Kennedy's current research focuses on extending techniques developed for automatic vectorization to programming tools for parallel computer systems and high-performance microprocessors. Through the Center for Research on Parallel Computation, he is seeking to develop new strategies for supporting architecture-independent parallel programming, especially in science and engineering.
I would like to ponder the notion of TFLOPS computing, which is a sort of subgoal of our high-performance computing initiative. I opposed this subgoal when I was first on the subcommittee of the Federal Coordinating Committee on Science, Engineering, and Technology that wrote the report on high-performance computing, but now I've warmed up to it a lot, primarily because I've seen the kind of scientific advances that we can achieve if we're able to get to the level of a TFLOPS and the corresponding memory sizes.
I want to talk about what I see as the compiler issues if we build such a machine. There are many ways one might consider building it. I'll pick the way Session 13 presenter Justin Rattner would build it, which is to take Intel's latest microprocessors. Intel's studies indicate that by the middle-to-late 1990s their single-chip processors will deliver all sorts of performance; although I don't know exactly what the peaks are going to be, I'm sure that we won't be able to sustain more than about half of peak. I'm counting on about 250 MFLOPS of peak per processor in that time frame. At half of that sustained, reaching a TFLOPS means we'll have to have at least 8000 processors. If we are able to get Intel to sell them to us for $10,000 a processor, which is less than they currently charge for the nodes of their iPSC/860s, that will be an $80 million machine, which is reasonable in cost. This means that we have to think about how to use 4000 to 8000 processors to do science. So I'd like to reflect on how we're using parallelism today.
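To spell out the back-of-the-envelope arithmetic behind those numbers, using only the figures just quoted:

```latex
% Machine sizing from the figures quoted above:
% 1 TFLOPS target, 250 MFLOPS peak per processor, half of peak sustained.
\[
\frac{10^{12}\ \mathrm{FLOPS}}{0.5 \times 250 \times 10^{6}\ \mathrm{FLOPS/processor}}
  = 8000\ \text{processors}
\]
\[
8000\ \text{processors} \times \$10{,}000/\text{processor} = \$80\ \text{million}
\]
```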
The good news is that we have machines, and we're doing science with them. Some people in the audience can talk about real achievements in which science and engineering calculations have been done at very low cost relative to conventional supercomputers, with high degrees of parallelism. Unfortunately, the other side of the coin is that there's some bad news. The bad news is that we have a diverse set of architectures, and the programming systems for those machines are primitive in the sense that they reflect a great deal of the architecture of the machine. Thus, the user is quite frequently programming a specific architecture rather than writing general parallel programs.
In addition, there are ugly new kinds of bugs that we have to deal with, and all sorts of performance anomalies: when you have a thousand processors and you get a speedup of two, people wonder what's gone wrong. You have to have some mechanism to help people deal with these things.
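To make one of those new bugs concrete, here is a minimal sketch of a data race; the parallel-loop directive is purely illustrative, not any particular vendor's syntax:

```fortran
C     Illustrative only: suppose the directive below runs the
C     iterations concurrently on several processors.  Two of them
C     can read the same old value of SUM before either writes it
C     back, so one update is silently lost -- a data race, a kind
C     of bug that sequential Fortran could never produce.
C$PAR DO
      DO 10 I = 1, N
         SUM = SUM + A(I)
 10   CONTINUE
```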
So all sorts of problems exist in the programming. I think the really critical issue is that we have not reached the point where commercial firms, with their zillions of dollars of software investment, will leap into the massive-parallelism world; they're afraid of losing that investment, and they're afraid of a major reprogramming effort for a particular machine that will be lost the next time a new machine has a different architecture.
I think people in the commercial world are standing back and looking at parallelism with a bit of a skeptical eye. Some scientists aren't, but I think a lot of them are. That means if we want parallelism to be successful in the way that previous generations of supercomputers have been successful, we have to provide some form of machine-independent parallel programming, at least enough that people feel comfortable that their investments are protected.
It is useful for us to look at the legacy of automatic vectorization, which I think really made it possible for vector supercomputers to be well programmed. I want to dispel what may be a misconception on the part of some people. Vectorization technology did not make it possible for people to take dusty decks and run them on their local supercomputer and expect high performance. What it did was provide us with a subdialect of Fortran 77, a sort of vectorizable Fortran 77, which the whole community learned. Once they had this model of how to write vectorizable programs in Fortran 77, people could write them and run them with high performance on a variety of different vector machines. In that sense, a great deal of the architectural specificity was factored out. I think that's one of the main reasons for the success of this technology.
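As a minimal sketch of what that vectorizable subdialect looks like (the loops here are illustrative, not drawn from any particular code):

```fortran
C     Written in the "vectorizable" style: every iteration is
C     independent, so a vectorizing compiler can map this loop
C     onto the vector hardware of many different machines.
      DO 10 I = 1, N
         A(I) = B(I) + S * C(I)
 10   CONTINUE

C     Not vectorizable as written: A(I) depends on A(I-1) from
C     the previous iteration, a loop-carried recurrence the
C     community learned to avoid or rewrite.
      DO 20 I = 2, N
         A(I) = A(I-1) + B(I)
 20   CONTINUE
```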
Now, the question is, can we achieve similar success with automatic parallelization? As Fran Allen pointed out earlier in this session, there are a lot of people on this panel and elsewhere doing very hard work on the problem of automatic parallelization. I've got to say that our system, like those of Fran and of Dave Black (also a Session 5 presenter), can do many impressive things. I think each of us can give you examples in which our system does things that are not just impressive but amazing.
Unfortunately, in general we can't bet on that, for all sorts of technical reasons. Mainly, if we're dealing with parallelism, at least of the MIMD asynchronous variety, we have to amortize the overhead of starting and synchronizing parallel work, and that means we have to find larger and larger regions of parallelism. The imprecision of dependence analysis and all the interprocedural effects that Fran talked about are really causing problems there.
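One standard illustration of that imprecision, sketched here with a hypothetical index array IX:

```fortran
C     Unless it can prove that IX(I) never takes the same value
C     twice, the compiler must assume that two iterations may
C     touch the same element of A.  Dependence analysis cannot
C     decide this at compile time, so the loop stays sequential
C     even when, at run time, every IX(I) is distinct.
      DO 30 I = 1, N
         A(IX(I)) = A(IX(I)) + B(I)
 30   CONTINUE
```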
Thus, I think this technology is going to make a contribution, but it's not going to be the answer in the way vectorization technology was. That means we have to support explicit parallel programming. But can we support it in a machine-independent way? The goal of our efforts, I think, should be to let the programmer specify parallelism at a fairly high level of abstraction, which may sound contradictory if we stay within Fortran. But even within Fortran, I think we should be able to do it at a high level of abstraction, in the sense that it shouldn't depend on the machine. The programmer should specify the strategy, and the compiler should take care of the details.
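Here is a hypothetical sketch of that division of labor; the DISTRIBUTE directive is invented for illustration, but it is the flavor of annotation I have in mind:

```fortran
C     The programmer states only the strategy: partition the
C     arrays by blocks of rows across the processors.  The
C     directive syntax below is hypothetical.
      REAL A(1024,1024), B(1024,1024)
C$    DISTRIBUTE A(BLOCK,*), B(BLOCK,*)
C     The compiler derives the machine-dependent details: the
C     per-processor loop bounds and the boundary communication
C     for the B(I-1,J) and B(I+1,J) references.
      DO 10 J = 2, 1023
         DO 10 I = 2, 1023
            A(I,J) = 0.25 * (B(I-1,J) + B(I+1,J)
     &                     + B(I,J-1) + B(I,J+1))
 10   CONTINUE
```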
If we're going to address this problem, what are the issues we have to deal with? The first issue is that we don't even know what the right programming paradigm is for these machines. If I pick anyone in the audience, I can probably get that person to tell me which paradigm he or she "knows" to be the right one. But the fact of the matter is that we can't bet on any of these because they're not proven.
There are a lot of candidate parallel languages that we have to consider. The shared-memory community is centering on the Parallel Computing Forum (PCF) effort, which I think is going to provide at least some machine independence across shared-memory machines. Unfortunately, few of the massively parallel machines, aside from the Cedar Project's machine and IBM's RP3-Mach project, are still trying to implement shared-memory hardware. So we have to concern ourselves with the generality of PCF.
Then there's Fortran 90, which is the language of choice for the SIMD community led by Thinking Machines Corporation; there are also Linda and other languages. So one path is Fortran extensions.
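For instance, Fortran 90's array syntax lets a single statement stand for the element-wise parallelism that SIMD hardware exploits; a minimal illustration:

```fortran
!     Fortran 90 array syntax: whole-array operations with no
!     explicit loop, leaving the data parallelism fully exposed.
      INTEGER, PARAMETER :: N = 1024
      REAL :: A(N), B(N), C(N)
      A = B + 2.0 * C
!     Masked assignment, another data-parallel construct.
      WHERE (A > 0.0) A = SQRT(A)
```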
There are also all sorts of efforts in compositional specification of programs, where one takes Fortran and combines it with some specification of how to combine various Fortran modules and run them in parallel. Then there are people who are using STRAND88 and its successors to specify parallel compositions, with Fortran as the components. Also there is the functional community, which makes the very good point that in functional languages, dependence analysis is precise. Unfortunately, we have a problem mapping that down to a finite amount of memory, so efficiency in this area is still a major question, although there have been some strides toward better efficiency.
I think the right language and abstract programming model are not yet clear. If we're going to have machine-independent programming, we have to have some model, and there's no resolution on what that should look like.
Another question is, how much of a tradeoff between machine independence and efficiency can we tolerate? One of the reasons we can't do machine-independent programming right now is that the ways we could do it are unacceptably inefficient on the machines available. A fraction of the user community agrees that they could write their programs in a particular language available today, but the programs that come out aren't efficient enough and have to be reprogrammed. Therefore, I think we have to understand how far we can go in abstracting the parallel decomposition process. A major part of that is how far we can go in combining the compiler and operating system to manage the memory hierarchy, which is a major machine dependency that we have to deal with.
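As one familiar example of the memory-hierarchy management a compiler would have to take over, consider loop blocking; this is a sketch, and the block size BS stands in for a machine-dependent choice:

```fortran
C     Blocked (tiled) matrix multiply: BS is chosen so that a
C     BS x BS tile of B stays in fast local memory or cache.
C     Picking BS -- and deciding whether to block at all -- is
C     exactly the machine-dependent detail the compiler, not
C     the programmer, should absorb.
      DO 40 JJ = 1, N, BS
      DO 40 KK = 1, N, BS
         DO 40 J = JJ, MIN(JJ+BS-1, N)
         DO 40 K = KK, MIN(KK+BS-1, N)
            DO 40 I = 1, N
               C(I,J) = C(I,J) + A(I,K) * B(K,J)
 40   CONTINUE
```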
Yet another issue is, of course, that compilers exist within an environment. I think we've seen the end of the day when a compiler sits by itself as an independent component of the programming system. Every compiler will exist in an environment, because all the tools in the environment will be intimately intertwined in their design. The debugger has to understand what the compiler does in order to present the execution of the program to the user in an understandable way. Performance tuners have to be able to explain, in language the user can understand, why the compiler chose to do the strange things it did, not just spit out lines and lines of assembly code that are hard to trace to anything. The performance tuner has to explain how the program ran, and why, in high-level terms that the average user can grasp.
Fran talked about interprocedural compilation. Environmental support will need to be provided in order to manage that in a way that programmers will find acceptable. We have to have tools that help people prepare well-formed parallel programs in whatever language we have.
Finally, there's the issue of what we do if we're going to manage a memory hierarchy in a way that's transparent to the user, so that we can have true machine-independent programming. To do so, I think we have to have some help from the architecture. What form that help should take, I'm not certain. I think that if I knew, we would be farther along in this line of research than we are right now.
I think the ideas of virtual shared memory, when combined with compiler ideas, may be very promising. The Cedar Project, led by the next presenter in this session, Dave Kuck, incorporates some very interesting ideas about what you can do with memory to support a more programmable interface. The architecture and operating system have to do more and more to support debugging and performance modeling, since those are going to be so critical in the environment that is part of the compiler system. We need some support to make sure we're not paying an enormous price to achieve that.
To summarize, I think we have to have machine-independent parallel programming, or we are going to waste a generation of physicists and chemists as low-level programmers. I think we need to provide a level of programming support consistent with what we did in vectorization. We have a long research agenda if we're going to achieve that by the end of this decade, when TFLOPS computing is supposed to arrive.