Parallel Software
Fran Allen
Frances E. Allen is an IBM Fellow and Senior Manager of Parallel Software at the IBM Thomas J. Watson Research Center in Yorktown Heights, New York. She joined IBM Research in 1957 and has since specialized in compilers, compiler optimization, programming languages, and parallelism. She has been an Institute of Electrical and Electronics Engineers (IEEE) Distinguished Visitor, an Association for Computing Machinery (ACM) National Lecturer, a member of the NSF Computer Science Advisory Board, a Visiting Professor at New York University, and a Consulting Professor at Stanford University. Dr. Allen has also served as General, Program, and Tutorial Chair of the ACM SIGPLAN Compiler Construction Conference.
In 1987, Dr. Allen was elected to the National Academy of Engineering, and she was appointed the Chancellor's Distinguished Lecturer and the Mackay Lecturer for 1988–89 at the University of California, Berkeley. In 1990, she became an IEEE Fellow and in 1991 was awarded an Honorary Doctor of Science degree from the University of Alberta.
Parallel hardware is built and bought for performance. While cost/performance is a significant, sometimes overriding factor, performance is the primary motivation for developing and acquiring supercomputers and massively parallel machines. Advances in hardware such as the astounding increase in CPU speeds, increases in communication bandwidth, and
reduction in communication latencies are changing both the raw performance of available hardware systems and their structure. An even more profound change is in the diversity of computational models provided by the hardware. These range from the multiple-CPU, shared-memory systems typified by Cray Research, Inc., to the single instruction, multiple data (SIMD) Connection Machine from Thinking Machines Corporation, to a variety of multiple instruction, multiple data (MIMD), distributed-memory machines. However, these technological advances have not resulted in widespread use of parallelism, nor has the promised performance been easily and frequently realized. Parallel software is the key to realizing the potentials of parallelism: performance and pervasive use. The purpose of this discussion is to assess briefly the state of parallel software for numeric-intensive computing on parallel systems and to indicate promising directions.
Fortran, the lingua franca for scientific and engineering programming, was designed for shared-memory uniprocessors. Traditional optimizing Fortran compilers generate high-performance code, sometimes perfect code, for such a machine. Traditionally, however, the user is responsible for managing storage so that memory and caches are used effectively. With the advent of vector hardware on the uniprocessors, Fortran compilers needed to compose vector instructions out of the multiple statements used to iterate through the elements of arrays. Dependence analysis to determine the relative order of references to the same storage location and loop transformations to reorder the references have been developed. When applied to Fortran codes, these transformations are now powerful enough so that vectorization is usually left to the compiler, with little or no intervention by the user.
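To make the distinction concrete, here is a minimal Fortran sketch (the names and loop bodies are illustrative only) of the two cases a vectorizing compiler must separate:

    program vecdemo
       integer, parameter :: n = 100
       real :: a(n), b(n), c(n)
       integer :: i
       a = 1.0
       b = 2.0
       ! No loop-carried dependence: every iteration reads and writes
       ! distinct elements, so the compiler may emit vector
       ! instructions for the whole loop.
       do i = 1, n
          c(i) = a(i) + b(i)
       end do
       ! Loop-carried dependence: iteration i reads a(i-1), written by
       ! the previous iteration, so the references cannot be reordered
       ! and straightforward vectorization is blocked.
       do i = 2, n
          a(i) = a(i-1) + b(i)
       end do
       print *, c(n), a(n)
    end program vecdemo

Dependence analysis is what allows the compiler to certify the first loop and to reject, or restructure, the second.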
Parallelism in any form involving more than one processor presents a very different challenge to the user and to the parallel software. There are two fundamental problems: forming parallel work and reducing overheads. Solutions to these problems involve the user and all aspects of the parallel software: the compiler, the run-time environment, the operating system, and the user environment. However, the user is often unwilling to rework the code unless there is a significant performance gain; currently, at least a factor of 10 speedup is expected for rework to be worth the effort. Even when the user does rework the algorithm and the code, which machine model should be targeted? Given the plethora of supercomputers and massively parallel machines and, more importantly, the variety of computational models supported, users are reluctant to choose one. Thus, the challenge of parallel software is both performance and portability.
Two recent developments are very promising for numeric-intensive computing: Fortran 90 and a user-level single computational model. The Fortran 90 array language allows the user to succinctly express computations involving arrays. An operand in a statement can be an entire array or a subsection of an array, and because all operands in a statement must be conformable, all the implicit, element-level operations such as add or multiply can be executed in parallel. While the designers of Fortran 90 primarily had vectors in mind, the array language is very appropriate as a computational model potentially applicable to several hardware computational models.
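A brief illustrative fragment of the array language (the array names and sizes are arbitrary):

    program arraydemo
       integer, parameter :: n = 100
       real :: a(n,n), b(n,n), c(n,n), s(n/2)
       a = 1.0
       b = 2.0
       ! Whole arrays as operands: the n*n element-level adds are
       ! mutually independent and may all execute in parallel.
       c = a + b
       ! Array sections are also legal operands; both sides here are
       ! conformable vectors of n/2 elements.
       s = a(1:n/2, 1) + b(n/2+1:n, 2)
       print *, c(1,1), s(1)
    end program arraydemo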
Fortran 90 array statements can also be compiled easily to shared-memory machines, though to do this, compilers must eliminate the transformational functions used to provide conformable operands, replacing them with appropriate index reference patterns to the operand elements. Fortran 90 array statements can be compiled very effectively to SIMD architectures. The distributed-memory MIMD architectures pose a bigger problem.
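As a sketch of that rewriting, the fragment below pairs an array statement using the SPREAD transformational function with the loop nest a compiler might substitute for it; the example is illustrative, not a description of any particular compiler:

    program xfuncdemo
       integer, parameter :: n = 4
       real :: a(n,n), b(n,n), c(n)
       integer :: i, j
       b = 1.0
       c = (/ (real(i), i = 1, n) /)
       ! Source form: SPREAD replicates c across n columns so that the
       ! two operands of + are conformable n-by-n arrays.
       a = b + spread(c, dim=2, ncopies=n)
       ! What a compiler might generate instead: the replicated
       ! temporary is eliminated, replaced by an index reference
       ! pattern that reuses c(i) directly.
       do j = 1, n
          do i = 1, n
             a(i,j) = b(i,j) + c(i)
          end do
       end do
    end program xfuncdemo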
Computation can only be performed on data accessible to the computation. Data must be in registers, in cache, or in addressable memory. The delay, or latency, in accessing the data is critical to the performance of the machine because delays due to data accesses serve only to reduce the effective speed of the computational units. As stated earlier, one of the two challenges for achieving parallel performance is the reduction of overheads. Delays in data movement through the system are often the most significant contributors to overhead. For shared-memory machines, this is alleviated by effective use of registers and caches, with appropriately placed synchronization and refreshing of shared memory so that shared values are correct. For distributed-memory MIMD machines, the cost of ensuring that correct data is available to the computation can be very significant, hundreds or thousands of cycles in many cases, with explicit sends and receives being required at the hardware level to move information from one machine to another. It becomes important, therefore, to preplan the placement and movement of information through the system. An effort is currently under way to augment the Fortran 90 language with user directives indicating how data can be partitioned.
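The sketch below suggests what such directives might look like; the syntax follows the emerging High Performance Fortran proposals and is illustrative only, not a settled standard:

    program distdemo
       real :: a(1000), b(1000)
    !HPF$ PROCESSORS p(4)
    !HPF$ DISTRIBUTE a(BLOCK) ONTO p
    !HPF$ ALIGN b(i) WITH a(i)
       a = 1.0
       ! Because b is aligned with a, the element-level operations
       ! below touch only data resident on the owning processor;
       ! no interprocessor communication is implied.
       b = 2.0 * a
    end program distdemo

Because placement is declared once, the compiler can preplan communication for every statement that touches the distributed arrays.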
Another factor contributing to overhead is synchronization. While the SIMD architectures do not incur explicit synchronization overhead because the machine is synchronized at every instruction, all other forms of parallel hardware assume software-controlled synchronization to coordinate work and data availability. Programs running on massively
parallel MIMD machines often have a data-driven, intermachine model. As soon as data is computed, it is sent to the appropriate machine(s), which can then start computing with it. This model of computation, a direct extension of the basic hardware-level, send-and-receive model, generally requires a total recasting of the solution and rewrite of the code.
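A minimal sketch of this send-and-receive style, written against the Message Passing Interface (MPI) Fortran binding purely for concreteness; any message-passing layer with blocking sends and receives would serve equally well:

    program datadriven
       include 'mpif.h'
       integer :: rank, ierr, status(MPI_STATUS_SIZE)
       real :: x(100)
       call MPI_INIT(ierr)
       call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
       if (rank == 0) then
          ! Producer: as soon as the data is computed, it is sent to
          ! the machine that needs it.
          x = 1.0
          call MPI_SEND(x, 100, MPI_REAL, 1, 0, MPI_COMM_WORLD, ierr)
       else if (rank == 1) then
          ! Consumer: blocks until the data arrives, then computes.
          call MPI_RECV(x, 100, MPI_REAL, 0, 0, MPI_COMM_WORLD, &
                        status, ierr)
          x = 2.0 * x
       end if
       call MPI_FINALIZE(ierr)
    end program datadriven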
An emerging computational model that has great promise is SPMD: single program, multiple data. A program written in Fortran 90, augmented with data-partitioning directives, can be executed concurrently on multiple processors but on different data partitions. It has some of the characteristics of SIMD but is much more flexible in that different control paths can be taken on different parallel executions of the same code. But the most important aspect of SPMD is that it may not require extensive problem rework, yet the programs may be portable across very diverse parallel systems. It should be emphasized that parallel software systems do not currently support the approach, so the hoped-for performance and portability goals may not be achievable.
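A sketch of the SPMD style, again using MPI calls only for concreteness; the partitioning arithmetic assumes the processor count divides the array size evenly:

    program spmddemo
       include 'mpif.h'
       integer, parameter :: n = 1024
       integer :: rank, nprocs, ierr, i, lo, hi
       real :: a(n)
       call MPI_INIT(ierr)
       call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
       call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
       ! Single program, multiple data: every processor executes this
       ! same code, but on its own partition of the array.
       lo = rank * (n / nprocs) + 1
       hi = lo + (n / nprocs) - 1
       do i = lo, hi
          ! Unlike SIMD, each copy may follow a different control path.
          if (mod(i, 2) == 0) then
             a(i) = real(i)
          else
             a(i) = -real(i)
          end if
       end do
       call MPI_FINALIZE(ierr)
    end program spmddemo

Note that the same source text runs unchanged whether there are 2 processors or 200, which is the portability the model promises.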
Advances in parallel software toward the goals of performance, portability, and ease of use require enlarging the scope of software in two directions. Whole programs or applications should be analyzed, transformed, and managed by the system, and the whole parallel software system should participate as an integrated unit.
The traditional functional boundaries among operating systems, run-time systems, compilers, and user environments must be revised. For example, scheduling of work to processors can be done by the compiler and the run-time system and even by the user, not just the operating system. Fast protocols avoiding operating system calls are needed to reduce overheads. The user and compiler need information about the performance of the running program so that adjustments can be made in the source code or the compilation strategy. Debugging of parallel code is a major problem, and integrated tools to facilitate its solution are needed. For the foreseeable future, the user must be an active participant in enabling the parallelism of the application and should be able to participate interactively with the entire system.
Whole-program analysis and transformation is essential when compiling programs to target machines such as distributed-memory MIMD systems because of the relatively large granularity of parallel work and the need to partition data effectively. This analysis is both broad and deep, involving interprocedural analysis and analyses for data and control dependence and for aliases. The transformations involve transforming loops, regions, and multiple procedures to create appropriate
parallelism for the target machine. They also involve modifying the size, shape, location, and lifetimes of variables to enable parallelism.
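One of the simplest such loop transformations, loop interchange, is sketched below with an illustrative example:

    program loopxform
       integer, parameter :: n = 100
       real :: a(n,n), b(n,n)
       integer :: i, j
       b = 1.0
       ! As written, the inner loop walks along a row, striding
       ! through memory, since Fortran stores arrays column by column.
       do i = 1, n
          do j = 1, n
             a(i,j) = 2.0 * b(i,j)
          end do
       end do
       ! After interchange, legal here because no iteration depends on
       ! any other, the inner loop runs down contiguous columns and
       ! the independent outer iterations can be scheduled across
       ! processors.
       do j = 1, n
          do i = 1, n
             a(i,j) = 2.0 * b(i,j)
          end do
       end do
    end program loopxform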
It is widely recognized that the major challenge in parallelism is in developing the parallel software. While there are products and active research on whole-program analysis and transformation, there is less effort on whole-system integration, including an integrated user environment. A totally unanswered question is how effective these parallel software directions will be on massively parallel MIMD machines. What does seem certain is that without advances in parallel software and in problem-solving languages, the true potential of parallel systems will not be realized.