
Linear Algebra Library for High-Performance Computers[*]

Jack Dongarra

Jack Dongarra is a distinguished scientist specializing in numerical algorithms in linear algebra, parallel computing, advanced computer architectures, programming methodology, and tools for parallel computers at the University of Tennessee's Department of Computer Science and at Oak Ridge National Laboratory's Mathematical Sciences Section. Other current research involves the development, testing, and documentation of high-quality mathematical software. He was involved in the design and implementation of the EISPACK, LINPACK, and LAPACK packages and of the BLAS routines and is currently involved in the design of algorithms and techniques for high-performance computer architectures.

Dr. Dongarra's other experience includes work as a visiting scientist at IBM's T. J. Watson Research Center in 1981, a consultant to Los Alamos Scientific Laboratory in 1978, a research assistant with the University of New Mexico in 1978, a visiting scientist at Los Alamos Scientific Laboratory in 1977, and a Senior Scientist at Argonne National Laboratory until 1989.



Dr. Dongarra received a Ph.D. in applied mathematics from the University of New Mexico in 1980, an M.S. in computer science from the Illinois Institute of Technology in 1973, and a B.S. in mathematics from Chicago State University in 1972.

Introduction

For the past 15 years, my colleagues and I have been developing linear algebra software for high-performance computers. In this presentation, I focus on five basic issues: (1) the motivation for the work, (2) the development of standards for use in linear algebra, (3) a basic library for linear algebra, (4) aspects of algorithm design, and (5) future directions for research.

LINPACK

A good starting point is LINPACK (Dongarra et al. 1979). The LINPACK project began in the mid-1970s, with the goal of producing a package of mathematical software for solving systems of linear equations. The project developers took a careful look at how one puts together a package of mathematical software and attempted to design a package that would be effective on the state-of-the-art computers of that time: Control Data Corporation's (scalar) CDC 7600 and the IBM System/370. Because vector machines were just beginning to emerge, we also provided some vector routines. Specifically, we structured the package around the basic linear algebra subprograms (BLAS) (Lawson et al. 1979) for doing vector operations.

The package incorporated other features, as well. Rather than simply collecting or translating existing algorithms, we reworked algorithms. Instead of the traditional row orientation, we used a column orientation that provided greater efficiency. Further, we published a user's guide with directions and examples for addressing different problems. The result was a carefully designed package of mathematical software, which we released to the public in 1979.
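To see what the column orientation means in practice, consider the inner loops of the elimination update. The fragment below is a sketch of my own (not LINPACK source): Fortran stores a matrix by columns, so making the row index the inner loop moves through memory with stride one, whereas a row-oriented ordering (column index innermost) strides by the leading dimension and runs more slowly on most machines.

      SUBROUTINE UPDATE(A, LDA, N, K)
*     Illustrative sketch: apply the elimination update
*     A(i,j) = A(i,j) - A(i,k)*A(k,j) for i,j > k, with the
*     column orientation used in LINPACK.  The inner loop runs
*     down a column of A, so successive references are contiguous;
*     each column update is exactly a SAXPY.
      INTEGER LDA, N, K, I, J
      REAL A(LDA,*)
      DO 20 J = K + 1, N
         DO 10 I = K + 1, N
            A(I,J) = A(I,J) - A(I,K)*A(K,J)
   10    CONTINUE
   20 CONTINUE
      RETURN
      END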

LINPACK Benchmark

Perhaps the best-known part of that package—indeed, some people think it is LINPACK—is the benchmark that grew out of the documentation. The so-called LINPACK Benchmark (Dongarra 1991) appears in the appendix to the user's guide. It was intended to give users an idea of how long it would take to solve certain problems. Originally, we measured the time required to solve a system of equations of order 100. We listed those times and gave some guidelines for extrapolating execution times for about 20 machines.

The times were gathered for two routines from LINPACK, one (SGEFA) to factor a matrix, the other (SGESL) to solve a system of equations. These routines call the BLAS, where most of the floating-point computation takes place. The routine that sits at the center of that computation is SAXPY, which takes a multiple of one vector and adds it to another vector:

y ← y + αx .
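In Fortran terms the operation is a simple loop or, with unit strides, a single call to the Level 1 BLAS routine SAXPY of Lawson et al. (1979); the wrapper below is a sketch of my own, not LINPACK source.

      SUBROUTINE AXPY(N, A, X, Y)
*     Sketch of the inner operation y <- y + a*x.  With unit strides
*     the loop is equivalent to the Level 1 BLAS call
*        CALL SAXPY(N, A, X, 1, Y, 1)
      INTEGER N, I
      REAL A, X(*), Y(*)
      DO 10 I = 1, N
         Y(I) = Y(I) + A*X(I)
   10 CONTINUE
      RETURN
      END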

Table 1 lists the timings of the LINPACK Benchmark on various high-performance computers. The peak performance for these machines is given in millions of floating-point operations per second (MFLOPS), in ascending order from 16 to 3000. The question is: when we run the LINPACK Benchmark, what do we actually get on these machines?

 

Table 1. LINPACK Benchmark on High-Performance Computers

Machine              Peak MFLOPS    Actual MFLOPS       System Efficiency
Ardent Titan-1             16          7                      0.44
CONVEX C-130               62         17                      0.27
SCS-40                     44          8.0                    0.18
IBM RS/6000                50         13                      0.26
CONVEX C-210               50         17                      0.34
FPS 264                    54          5.6                    0.10
Multiflow 14/300           62         17                      0.27
IBM 3090/VF-180J          138         16                      0.12
CRAY-1                    160         12  (27)                0.075
Alliant FX/80             188         10  (8 proc.)           0.05
CRAY X-MP/1               235         70                      0.28
NEC SX-1E                 325         32                      0.10
ETA-10P                   334         24                      0.14  (0.07)
CYBER 205                 400         17                      0.04
ETA-10G                   644         93  (1 proc.)           0.14
NEC SX-1                  650         36                      0.06
CRAY X-MP/4               941        149  (4 proc.)           0.16
Fujitsu VP-400           1142         20                      0.018
NEC SX-2                 1300         43                      0.033
CRAY-2                   1951        101  (4 proc.)           0.051
CRAY Y-MP/8              2664        275  (8 proc.)           0.10
Hitachi S-820/90         3000        107                      0.036



The column labeled "Actual MFLOPS" gives the answer, and that answer is quite disappointing, given that we are using a highly vectorized algorithm on machines with vector architectures. The next question one might ask is, why are the results so bad? The answer has to do with the rate at which information can be transferred from memory to the place where the computations are done. The operation in question, a SAXPY, must reference three vectors and do essentially two floating-point operations on each element of the vectors. The transfer rate, that is, the maximum rate at which information can be moved to or from the memory device, is the limiting factor here.

Thus, as we increase the computational power without a corresponding increase in memory bandwidth, memory access can cause serious bottlenecks. The bottom line is that MFLOPS are easy, but bandwidth is difficult.

Transfer Rate

Table 2 lists the peak MFLOPS rate for various machines, as well as the peak transfer rate (in megawords per second).

Recall that the operation we were doing requires three memory references for every two floating-point operations. Hence, to run at good rates, we need a ratio of peak transfer rate to peak MFLOPS of at least three to two; the last column of Table 2 shows how each machine measures up. The CRAY Y-MP does not do badly in this respect.

 

Table 2. MFLOPS and Memory Bandwidth

Machine              Peak MFLOPS    Peak Transfer (megawords/second)    Ratio
Alliant FX/80             188                  22                       0.12
Ardent Titan-4             64                  32                       0.5
CONVEX C-210               50                  25                       0.5
CRAY-1                    160                  80                       0.5
CRAY X-MP/4               940                1411                       1.5
CRAY Y-MP/8              2667                4000                       1.5
CRAY-2S                  1951                 970                       0.5
CYBER 205                 400                 600                       1.5
ETA-10G                   644                 966                       1.5
Fujitsu VP-200            533                 533                       1.0
Fujitsu VP-400           1066                1066                       1.0
Hitachi 820/80           3000                2000                       0.67
IBM 3090/600-VF           798                 400                       0.5
NEC SX-2                 1300                2000                       1.5



Each processor of the CRAY Y-MP can transfer 500 million (64-bit) words per second, so the complete eight-processor system can move data from memory into the registers at four gigawords per second. But for many of the machines in the table, there is an imbalance between computation rate and transfer rate. One of the particularly bad cases is the Alliant FX/80, which has a peak rate of 188 MFLOPS but can transfer only 22 megawords per second from memory. It is going to be very hard to get peak performance there.
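A rough bound makes the point; the arithmetic here is mine, not part of the original tables. Because a SAXPY moves three words for every two floating-point operations, its rate is limited to about two-thirds of the memory transfer rate:

\[
\text{SAXPY rate (MFLOPS)} \;\lesssim\; \tfrac{2}{3} \times \text{transfer rate (megawords/second)} .
\]

For the Alliant FX/80 this gives roughly (2/3)(22) ≈ 15 MFLOPS, far below the 188-MFLOPS peak and consistent with the 10 MFLOPS observed in Table 1.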

Memory Latency

Another issue affecting performance is, of course, the latency: how long (in terms of cycles) does it actually take to transfer the information after we make a request? In Table 3, we list the memory latency for seven machines. We can see that the time ranges from 14 to 50 cycles. Obviously, a memory latency of 50 cycles is going to impact the algorithm's performance.

Development of Standards

The linear algebra community has long recognized that we needed something to help us in developing our algorithms. Several years ago, as a community effort, we put together a de facto standard for identifying basic operations required in our algorithms and software. Our hope was that the standard would be implemented on the machines by many manufacturers and that we would then be able to draw on the power of having that implementation in a rather portable way.

We began with those BLAS designed for performing vector-vector operations; we now call them the Level 1 BLAS (Lawson et al. 1979). We later defined a standard for some rather simple matrix-vector calculations, the so-called Level 2 BLAS (Dongarra et al. 1988).

 

Table 3. Memory Latency

Machine            Latency (cycles)
CRAY-1                   15
CRAY X-MP                14
CRAY Y-MP                17
CRAY-2                   50
CRAY-2S                  35
CYBER 205                50
Fujitsu VP-400           31

 


Still later, the basic matrix-matrix operations were identified, and the Level 3 BLAS were defined (Dongarra, Du Croz, et al. 1990). In Figure 1, we show these three sets of BLAS.

Why were we so concerned about getting a handle on those three different levels? The reason lies in the fact that machines have a memory hierarchy and that the faster memory is at the top of that hierarchy (see Figure 2).

Naturally, then, we would like to keep data in the upper part of the hierarchy and get as much reuse of those data as possible. The higher-level BLAS let us do just that. As we can see from Table 4, the Level 2 BLAS offer the potential for two floating-point operations for every memory reference; and with the Level 3 BLAS, we get essentially n operations for every two references, or the maximum possible.

Figure 1.
Levels 1, 2, and 3 BLAS.



Figure 2.
Memory hierarchy.

 

Table 4. Capabilities of Higher-Level BLAS

BLAS                     Memory References    FLOPS    FLOPS/Memory Reference
Level 1: y ← y + αx             3n             2n               2/3
Level 2: y ← y + Ax             n²             2n²              2
Level 3: A ← A + BC             4n²            2n³              n/2

These higher-level BLAS have another advantage. On some parallel machines, they give us increased granularity and the possibility for parallel operations, and they end up with lower synchronization costs.

Of course, nothing comes free. The BLAS require us to rewrite our algorithms so that we use these operations effectively. In particular, we need to develop blocked algorithms that can exploit the matrix-matrix operation.
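To give a feel for what such a blocked algorithm looks like, the fragment below is an illustrative sketch of my own (not LAPACK source) of the trailing-submatrix update in a right-looking blocked LU factorization: the whole update collapses into a single call to the Level 3 BLAS routine SGEMM instead of many SAXPYs.

      SUBROUTINE BLKUPD(A, LDA, N, K, NB)
*     Illustrative sketch: after the block column of width NB starting
*     at column K has been factored and the corresponding block row
*     updated, the trailing submatrix is updated in one shot,
*        A22 <- A22 - A21*A12,
*     by a single Level 3 BLAS matrix-matrix multiply.
      INTEGER LDA, N, K, NB, M2
      REAL A(LDA,*)
      M2 = N - K - NB + 1
      IF (M2 .GT. 0)
     $   CALL SGEMM('No transpose', 'No transpose', M2, M2, NB,
     $              -1.0, A(K+NB,K), LDA, A(K,K+NB), LDA,
     $              1.0, A(K+NB,K+NB), LDA)
      RETURN
      END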

The development of blocked algorithms is a fascinating example of history repeating itself. In the sixties, these algorithms were developed for machines with very small main memories, and so tapes (secondary storage) were used as primary storage (Barron and Swinnerton-Dyer 1960, Chartres 1960, and McKellar and Coffman 1969). The programmer would reel in information from tape, put it into memory, and get as much use of it as possible before sending it back out. Today people are reorganizing their algorithms with that same idea, but now, instead of dealing with tapes and main memory, we are dealing with main memory, cache, vector registers, and so forth (Calahan 1979, Jordan and Fong 1977, Gallivan et al. 1990, Berry et al. 1986, Dongarra, Duff, et al. 1990, Schreiber 1988, and Bischof and Van Loan 1986). That is essentially what LAPACK is about: taking those ideas—locality of reference and data reuse—and embodying them in a new library for linear algebra.



LAPACK

Our objective in developing LAPACK is to provide a package for the solution of systems of equations and the solution of eigenvalue problems. The software is intended to be efficient across a wide range of high-performance computers. It is based on algorithms that minimize memory access to achieve locality of reference and reuse of data, and it is built on top of the Levels 1, 2, and 3 BLAS—the de facto standard that the numerical linear algebra community has given us. LAPACK is a multi-institutional project, including people from the University of Tennessee, the University of California at Berkeley, New York University's Courant Institute, Rice University, Argonne National Laboratory, and Oak Ridge National Laboratory.

We are in a testing phase at the moment and just beginning to establish world speed records, if you will, for this kind of work. To give a hint of those records, we show in Table 5 some timing results for LAPACK routines on a CRAY Y-MP.

Let us look at the LU decomposition results; SGETRF is the routine that does that work. On one processor, for a matrix of order 32, it runs at 40 MFLOPS; for a matrix of order 1024, it runs at about 300 MFLOPS. Now if we take our LAPACK routine (which is written in Fortran and calls the Level 3 BLAS that the people from Cray have provided) and go to eight processors, the rate for the order-32 matrix drops to 32 MFLOPS, a speeddown. Obviously, if we wish to solve a small matrix, we should not use eight processors!

When we go to large-order matrices, however, the execution rate on eight processors is close to two GFLOPS, for code that is very portable. And for the LL^T (Cholesky) and QR factorizations, we see the same effect.

Note that we are doing the same number of operations that we did when we worked with the unblocked version of the algorithms. We are not cheating in terms of the MFLOPS rate here.
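For readers who want to see the calling sequences, here is a minimal driver; the program itself is hypothetical, but SGETRF and SGETRS are the LAPACK interfaces for factoring a general matrix and solving with the resulting factors.

      PROGRAM SOLVE
*     Hypothetical driver: factor a small general matrix with SGETRF
*     (blocked LU with partial pivoting), then solve A*x = b with
*     SGETRS.  The solution overwrites B.
      INTEGER N, LDA
      PARAMETER (N = 3, LDA = 3)
      REAL A(LDA,N), B(N)
      INTEGER IPIV(N), INFO, I
      DATA A / 4.0, 2.0, 1.0,
     $         2.0, 5.0, 3.0,
     $         1.0, 3.0, 6.0 /
      DATA B / 1.0, 2.0, 3.0 /
      CALL SGETRF(N, N, A, LDA, IPIV, INFO)
      IF (INFO .EQ. 0)
     $   CALL SGETRS('No transpose', N, 1, A, LDA, IPIV, B, N, INFO)
      WRITE (*,*) (B(I), I = 1, N)
      END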

One other performance set, which might be of interest for comparison, is that of the IBM RISC machine RS/6000-550 (Dongarra, Mayer, et al. 1990). In Figure 3, we plot the speed of LU decomposition for the LAPACK routine, using a Fortran implementation of the Level 3 BLAS. For the one-processor workstation, we are getting around 45 MFLOPS on larger-order matrices.

Clearly, the BLAS help, not only on the high-performance machines at the upper end but also on these RISC machines at the lower end, and for exactly the same reason: data are reused while they reside in the cache.


 

Table 5. LAPACK Timing Results for a CRAY Y-MP (in MFLOPS)

                        Order of Matrix
Name             32     64    128    256    512   1024

SGETRF (LU)
  1 proc.        40    108    195    260    290    304
  2 proc.        32     91    229    408    532    588
  4 proc.        32     90    260    588    914   1097
  8 proc.        32     90    205    375   1039   1974

SPOTRF (LL^T)
  1 proc.        34     95    188    259    289    301
  2 proc.        29     84    221    410    539    594
  4 proc.        29     84    252    598    952   1129
  8 proc.        29     84    273    779   1592   2115

SGEQRF (QR)
  1 proc.        54    139    225    275    294    301
  2 proc.        50    134    256    391    505    562
  4 proc.        50    136    292    612    891   1060
  8 proc.        50    133    328    807   1476   1937

Figure 3.
Variants of LU factorization on the IBM RISC System RS/6000-550.



Algorithm Design

Up to now, we have talked about restructuring algorithms; that is essentially what we did when we changed them to block form. The basic algorithm itself remains the same; we are simply changing the locality with which data are referenced and exposing the independent operations we want to focus on, namely the matrix-matrix operations.

Divide-and-Conquer Approach

Let us now talk about designing algorithms. In this case, the basic algorithm itself changes. In particular, let us consider a divide-and-conquer technique for finding the eigenvalues and eigenvectors of a symmetric tridiagonal matrix (Dongarra and Sorensen 1987). The technique is also used in other fields, where it is sometimes referred to as domain decomposition. It involves tearing, or partitioning, the problem into small, nearly independent pieces; solving the eigenproblem for each piece independently; and then gluing the solutions of the pieces back together, level by level, until the solution of the original problem is recovered. In redesigning this algorithm, we ended up with one that runs in parallel very efficiently. Table 6 gives ratios of performance on a CRAY-2 for up to four processors. As we can see from the table, we are getting four times the speedup, and sometimes even better. What's more, we have an example of serendipity: the new algorithm is actually more efficient than the "best" sequential algorithm, even in the sequential setting.
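To make the tearing concrete, here is one standard way to write it; the notation is mine and follows the rank-one tearing on which the algorithm builds. A symmetric tridiagonal matrix T, split after row k, is a block-diagonal matrix plus a rank-one correction:

\[
T \;=\; \begin{pmatrix} \hat{T}_1 & 0 \\ 0 & \hat{T}_2 \end{pmatrix} \;+\; \beta\, z z^{T},
\qquad
z = \begin{pmatrix} e_k \\ e_1 \end{pmatrix},
\]

where \(\beta\) is the off-diagonal element at the tear point, \(\hat{T}_1\) is the leading k-by-k block with its last diagonal entry reduced by \(\beta\), \(\hat{T}_2\) is the trailing block with its first diagonal entry reduced by \(\beta\), and \(e_k\) and \(e_1\) are unit vectors. The eigenproblems for \(\hat{T}_1\) and \(\hat{T}_2\) are solved independently (and possibly recursively), and the spectrum of T is then recovered from the rank-one modification by solving a secular equation.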

 

Table 6. Ratio of Execution Time for TQL2 over Divide-and-Conquer Algorithm (a)

                  100                 200                 300
No. Proc.     E/(p)   (1)/(p)     E/(p)   (1)/(p)     E/(p)   (1)/(p)
    1          1.35     1          1.45     1          1.53     1
    2          2.55     1.88       2.68     1.84       2.81     1.84
    3          3.39     2.51       3.71     2.55       3.79     2.48
    4          4.22     3.12       4.60     3.17       5.03     3.28

(a) E   = EISPACK TQL2,
    (1) = parallel divide-and-conquer algorithm on one processor, and
    (p) = parallel divide-and-conquer algorithm on p processors.



Accuracy

Working with LAPACK has given us an opportunity to go back and rethink some of the algorithms. How accurately can we solve numerical analysis problems (Demmel and Kahan 1988)? The answer depends on the accuracy of the input data and on how much we are willing to spend:

• If the input data is exact, we can ask for a (nearly) correctly rounded answer—generally done only for +, *, /, √, and cos.

• If the input H is uncertain in a normwise sense (true input = H + δH, where ||δH|| / ||H|| is small), the conventional backward stable algorithm is suitable; this is the usual approach in linear algebra, but it respects neither sparsity structure nor scaling.

• If the input H is uncertain in a component-wise relative sense (true input = H + δH, where max_i,j |δH_ij| / |H_ij| is small), the appropriate approach does respect sparsity and does respect scaling, but it requires new algorithms, new perturbation theory, and new error analysis.

In the end, we have new and better convergence criteria and better error bounds. We also enhance performance, because the iterations can now be terminated much sooner.

Tools

Our work in algorithm design has been supported by tool development projects throughout the country. Of particular note are the projects at Rice University and the University of Illinois. Other projects help in terms of what we might call logic or performance debugging of the algorithms—trying to understand what the algorithms are doing when they run on parallel machines. The objective here is to give the implementor a better feeling for where to focus attention and to show precisely what the algorithm is doing while it is executing on parallel machines (Dongarra, Brewer, et al. 1990).

Testing

Testing and timing have been an integral part of the LAPACK project. Software testing is required to verify new machine-specific versions. Software timing is needed to measure the efficiency of the LAPACK routines and to compare new algorithms and software. In both of these tasks, many vendors have helped us along the way by implementing basic routines on various machines and providing essential feedback (see Table 7).

The strategy we use may not be optimal for all machines. Our objective is to achieve a "best average" performance on the machines listed in Table 8. We are hoping, of course, that our strategy will also perform well on a wider range of machines.


 

Table 7. Vendor Participation

Alliant Computer Sys.       Fujitsu                  NEC
BBN Advanced Comp.          Hitachi                  Sequent Computer Sys.
CONVEX Computer             IBM ECSEC Italy          Silicon Graphics
Cray Computer               Intel                    Stardent Computer
Cray Research               Kendall Square Res.      Sun Microsystems, Inc.
Digital Equipment Corp.     MasPar                   Supercomputer Sys., Inc.
Encore Computer Corp.       Myrias Research Corp.    Thinking Machines Corp.
FPS Computing

 

Table 8. Target Machines (1-100 Processors)

Alliant FX/80          IBM 3090/VF
BBN TC2000             Multiflow
CONVEX C-2             Myrias
CRAY-2                 NEC SX
CRAY Y-MP              RISC machines
Encore Multimax        Sequent Symmetry
Fujitsu VP             Stardent Computer
Hitachi S-820

 

Among the additional machines of interest are the Intel iPSC, iWarp, MasPar, nCUBE, Thinking Machines, and Transputer-based computers.

Future Directions for Research

We have already started looking at how we can make "cosmetic changes" to the LAPACK software—adapt it in a semiautomatic fashion for distributed-memory architectures. In this effort, our current work on blocked operations will be appropriate because the operations minimize communication and provide a good surface-to-volume ratio. We also expect that this task will require defining yet another set of routines, this one based on the BLACS (basic linear algebra communication routines). Once again, we will draw on what has been done in the community for those operations.

As a preliminary piece of data, we show in Figure 4 an implementation of LU decomposition from LAPACK, run on a 64-processor Intel iPSC. Clearly, we are not yet achieving optimum performance, but the situation is improving daily.



Figure 4.
Pipelined LU factorization results for 64 and 128 nodes.

Some interest has also been expressed in developing a C implementation of the LAPACK library. And we continue to track what is happening with Fortran 90 and with the activities of the Parallel Computing Forum.

In the meantime, we are in our last round of testing of the shared-memory version of LAPACK. The package will be released to the public in 1992.

References

D. W. Barron and H. P. F. Swinnerton-Dyer, "Solution of Simultaneous Linear Equations Using a Magnetic-Tape Store," Computer Journal 3, 28-33 (1960).

M. Berry, K. Gallivan, W. Harrod, W. Jalby, S. Lo, U. Meier, B. Phillippe, and A. Sameh, "Parallel Algorithms on the Cedar System," Center for Supercomputing Research and Development technical report 581, University of Illinois-Urbana/Champaign (October 1986).

C. Bischof and C. Van Loan, "Computing the Singular Value Decomposition on a Ring of Array Processors," in Large Scale Eigenvalue Problems, J. Cullum and R. Willoughby, Eds., North-Holland, Amsterdam (1986).

D. Calahan, "A Block-Oriented Sparse Equation Solver for the CRAY-1," in Proceedings of the 1979 International Conference on Parallel Processing, pp. 116-123 (1979).

B. Chartres, "Adaptation of the Jacobi and Givens Methods for a Computer with Magnetic Tape Backup Store," technical report 8, University of Sydney, Australia (1960).

J. Demmel and W. Kahan, "Computing the Small Singular Values of Bidiagonal Matrices with Guaranteed High Relative Accuracy," Argonne National Laboratory Mathematics and Computer Science Division report ANL/MCS-TM-110 (LAPACK Working Note #3) (1988).

J. Dongarra, "Performance of Various Computers Using Standard Linear Equations Software in a Fortran Environment," technical report CS-89-85, University of Tennessee, Knoxville (1991).

J. Dongarra, O. Brewer, S. Fineberg, and J. Kohl, "A Tool to Aid in the Design, Implementation, and Understanding of Matrix Algorithms for Parallel Processors," Parallel and Distributed Computing 9, 185-202 (1990).

J. Dongarra, J. Bunch, C. Moler, and G. W. Stewart, LINPACK Users' Guide, Society for Industrial and Applied Mathematics Publications, Philadelphia, Pennsylvania (1979).

J. Dongarra, J. Du Croz, I. Duff, and S. Hammarling, "A Set of Level 3 Basic Linear Algebra Subprograms," ACM Transactions on Mathematical Software 16, 1-17 (1990).

J. Dongarra, J. Du Croz, S. Hammarling, and R. Hanson, "An Extended Set of Fortran Basic Linear Algebra Subroutines," ACM Transactions on Mathematical Software 14, 1-17 (1988).

J. Dongarra, I. S. Duff, D. C. Sorensen, and H. A. Van der Vorst, Solving Linear Systems on Vector and Shared Memory Computers, Society for Industrial and Applied Mathematics Publications, Philadelphia, Pennsylvania (1990).

J. Dongarra, P. Mayer, and G. Radicati, "The IBM RISC System 6000 and Linear Algebra Operations," Computer Science Department technical report CS-90-122, University of Tennessee, Knoxville (LAPACK Working Note #28) (December 1990).

J. Dongarra and D. Sorensen, "A Fully Parallel Algorithm for the Symmetric Eigenproblem," SIAM Journal on Scientific and Statistical Computing 8(2), 139-154 (1987).

K. Gallivan, R. Plemmons, and A. Sameh, "Parallel Algorithms for Dense Linear Algebra Computations," SIAM Review 32(1), 54-135 (1990).

T. Jordan and K. Fong, "Some Linear Algebraic Algorithms and Their Performance on the CRAY-1," in High Speed Computer and Algorithm Organization, D. Kuck, D. Lawrie, and A. Sameh, Eds., Academic Press, New York, pp. 313-316 (1977).

C. Lawson, R. Hanson, D. Kincaid, and F. Krogh, "Basic Linear Algebra Subprograms for Fortran Usage," ACM Transactions on Mathematical Software 5, 308-323 (1979).

A. C. McKellar and E. G. Coffman Jr., "Organizing Matrices and Matrix Operations for Paged Memory Systems," Communications of the ACM 12(3), 153-165 (1969).

R. Schreiber, "Block Algorithms for Parallel Machines," in Volumes in Mathematics and Its Applications, Vol. 13, Numerical Algorithms for Modern Parallel Computer Architectures, M. Schultz, Ed., Berlin, Germany, pp. 197-207 (1988).

