Frontiers of Supercomputing II

LAPACK

Our objective in developing LAPACK is to provide a package for the solution of systems of linear equations and the solution of eigenvalue problems. The software is intended to be efficient across a wide range of high-performance computers. It is based on algorithms that minimize memory access to achieve locality of reference and reuse of data, and it is built on top of the Levels 1, 2, and 3 BLAS, the de facto standard that the numerical linear algebra community has given us. LAPACK is a multi-institutional project, including people from the University of Tennessee, the University of California at Berkeley, New York University's Courant Institute, Rice University, Argonne National Laboratory, and Oak Ridge National Laboratory.

We are in a testing phase at the moment and just beginning to establish world speed records, if you will, for this kind of work. To give a hint of those records, we show in Table 5 some timing results for LAPACK routines on a CRAY Y-MP.

Let us look at the LU decomposition results; SGETRF is the routine that does that work. On one processor, for a matrix of order 32, it runs at 40 MFLOPS; for a matrix of order 1024, it runs at about 300 MFLOPS. Now if we take our LAPACK routines (which are written in Fortran), call the Level 3 BLAS (which the people from Cray have provided), and go to eight processors, the order-32 case drops to 32 MFLOPS: a speeddown. Obviously, if we wish to factor a matrix that small, we should not use this approach!

When we go to large-order matrices, however, the execution rate on eight processors is close to two GFLOPS, and that is for code that is very portable. For LL^T (Cholesky) and QR factorization, we see the same effect.
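To make the speeddown and the large-matrix gain concrete, the SGETRF rows of Table 5 can be turned into speedups over the one-processor rate. The following is a minimal sketch in Python; the names `ORDERS`, `SGETRF`, and `speedup` are ours, chosen for illustration, and the numbers are simply copied from Table 5.

```python
# MFLOPS rates for SGETRF copied from Table 5 (CRAY Y-MP).
# Keys are processor counts; list entries follow the matrix orders in ORDERS.
ORDERS = [32, 64, 128, 256, 512, 1024]
SGETRF = {
    1: [40, 108, 195, 260, 290, 304],
    2: [32, 91, 229, 408, 532, 588],
    4: [32, 90, 260, 588, 914, 1097],
    8: [32, 90, 205, 375, 1039, 1974],
}


def speedup(table, procs):
    """Return the ratio of the procs-processor rate to the one-processor
    rate for each matrix order. A value below 1.0 is a speeddown."""
    return [rate / base for rate, base in zip(table[procs], table[1])]


if __name__ == "__main__":
    for order, s in zip(ORDERS, speedup(SGETRF, 8)):
        label = "speeddown" if s < 1.0 else "speedup"
        print(f"order {order:4d}: 8 processors / 1 processor = {s:.2f} ({label})")
```

At order 32 the ratio is 32/40 = 0.8, and at order 1024 it is 1974/304, or roughly 6.5 on eight processors.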

Note that we are doing the same number of operations that we did when we worked with the unblocked version of the algorithms. We are not cheating in terms of the MFLOPS rate here.
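The reason this matters is that a MFLOPS rate is just an operation count divided by elapsed time, so an algorithm that did extra arithmetic could report an inflated rate. The sketch below uses the standard leading-order operation counts for an n-by-n matrix (2n^3/3 for LU, n^3/3 for Cholesky, 4n^3/3 for Householder QR). Treating these as the counts behind Table 5 is our assumption, since the exact formulas used for the timings are not stated here, and the function names are hypothetical.

```python
def lu_flops(n):
    """Leading-order flop count for LU factorization of an n-by-n matrix."""
    return 2.0 * n**3 / 3.0


def cholesky_flops(n):
    """Leading-order flop count for LL^T factorization of an n-by-n matrix."""
    return n**3 / 3.0


def qr_flops(n):
    """Leading-order flop count for Householder QR of an n-by-n matrix."""
    return 4.0 * n**3 / 3.0


def mflops(flops, seconds):
    """Rate in MFLOPS for a fixed operation count and a measured time."""
    return flops / seconds / 1.0e6


def seconds_at(flops, rate_mflops):
    """Time implied by a fixed operation count and a reported MFLOPS rate."""
    return flops / (rate_mflops * 1.0e6)


if __name__ == "__main__":
    n = 1024
    # Assumption: the 304 MFLOPS one-processor SGETRF entry in Table 5
    # was computed from a count close to 2n^3/3.
    print(f"LU, order {n}: {seconds_at(lu_flops(n), 304.0):.2f} s at 304 MFLOPS")
```

Under that assumption, the order-1024 LU factorization at 304 MFLOPS corresponds to about 2.35 seconds. Because the count is fixed by the matrix order, a blocked and an unblocked version are rated against the same numerator.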

One other performance set, which might be of interest for comparison, is that of the IBM RISC machine RS/6000-550 (Dongarra, Mayer, et al. 1990). In Figure 3, we plot the speed of LU decomposition for the LAPACK routine, using a Fortran implementation of the Level 3 BLAS. For the one-processor workstation, we are getting around 45 MFLOPS on larger-order matrices.

Clearly, the BLAS help, not only on the high-performance machines at the upper end but also on these RISC machines at the lower end, and for exactly the same reason: data are used and reused while they are still held in the machine's cache.
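The data-reuse idea behind the Level 3 BLAS can be illustrated with a tiled matrix multiply. This is only a sketch in plain Python, not the Fortran BLAS or LAPACK code; the function names and the tile size `nb` are hypothetical. The blocked version works on one small tile of each operand at a time, so on a real machine each tile would be reused many times while it is in fast memory, yet it performs exactly the same multiply-adds as the straightforward version.

```python
def matmul_naive(A, B):
    """Straightforward triple loop: C = A * B for lists of lists."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C


def matmul_blocked(A, B, nb=2):
    """Tiled C = A * B. The three outer loops walk over nb-by-nb tiles;
    the three inner loops do the arithmetic within one pair of tiles."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, nb):
        for jj in range(0, p, nb):
            for kk in range(0, m, nb):
                # Update the (ii, jj) tile of C using the (ii, kk) tile of A
                # and the (kk, jj) tile of B. min() handles ragged edges.
                for i in range(ii, min(ii + nb, n)):
                    for j in range(jj, min(jj + nb, p)):
                        s = C[i][j]
                        for k in range(kk, min(kk + nb, m)):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
    return C


if __name__ == "__main__":
    A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
    B = [[9.0, 8.0, 7.0], [6.0, 5.0, 4.0], [3.0, 2.0, 1.0]]
    print(matmul_blocked(A, B, nb=2) == matmul_naive(A, B))
```

In pure Python the tiling gives no speed benefit; the point is only the loop structure. The gain on a real machine comes from choosing `nb` so that the working tiles fit in cache or vector registers.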


Table 5. LAPACK Timing Results for a CRAY Y-MP (in MFLOPS, by Matrix Order)
Routine / Processors	32	64	128	256	512	1024
SGETRF (LU)
1 proc.	40	108	195	260	290	304
2 proc.	32	91	229	408	532	588
4 proc.	32	90	260	588	914	1097
8 proc.	32	90	205	375	1039	1974
SPOTRF (LL^T)
1 proc.	34	95	188	259	289	301
2 proc.	29	84	221	410	539	594
4 proc.	29	84	252	598	952	1129
8 proc.	29	84	273	779	1592	2115
SGEQRF (QR)
1 proc.	54	139	225	275	294	301
2 proc.	50	134	256	391	505	562
4 proc.	50	136	292	612	891	1060
8 proc.	50	133	328	807	1476	1937


Figure 3. Variants of LU factorization on the IBM RISC System RS/6000-550.
