Linear Algebra Library for High-Performance Computers*

### Algorithm Design

Up to now, we have talked about restructuring algorithms; that is essentially what we did when we changed them to block form. The basic algorithm itself remains the same; we are simply affecting the locality of how we reference data and the independence of operations that we are trying to focus on—the matrix-matrix operations.

### Divide-and-Conquer Approach

Let us now talk about designing algorithms. In this case, the basic algorithm will change. In particular, let us consider a divide-and-conquer technique for finding the eigenvalues and eigenvectors of a symmetric tridiagonal matrix (Dongarra and Sorensen 1987). The technique is also used in other fields, where it is sometimes referred to as domain decomposition. It involves tearing, or partitioning, the problem to produce small, independent pieces. Then, the eigenvalue of each piece is found independently. Finally, we "put back" the eigenvalues of these pieces, and put back the eigenvalues of these others, and so on. We were successful. In redesigning this algorithm, we ended up with one that runs in parallel very efficiently. Table 6 gives ratios of performance on a CRAY-2 for up to four processors. As we can see from the table, we are getting four times the speed up, and sometimes even better. What's more, we have an example of serendipity: the same algorithm is actually more efficient than the "best" sequential algorithm, even in the sequential setting.

 Table 6. Ratio of Execution Time for TQL2 over Divide-and-Conquer Algorithma No. Proc. 100 E/(p) 100 (1)/(p) 200 E/(p) 200 (1)/(p) 300 E/(p) 300 (1)/(p) 1 1.35 1 1.45 1 1.53 1 2 2.55 1.88 2.68 1.84 2.81 1.84 3 3.39 2.51 3.71 2.55 3.79 2.48 4 4.22 3.12 4.60 3.17 50.3 3.28 a where E = EISPACK TQL, (1) = parallel divide-and-conquer algorithm on one processor, and (p) = parallel divide-and-conquer algorithm on p processors.

253

### Accuracy

Working with LAPACK has given us an opportunity to go back and rethink some of the algorithms. How accurately can we solve NA problems (Demmel and Kahan 1988)? The answer depends on the accuracy of the input data and how much we are willing to spend:

• If the input data is exact, we can ask for a (nearly) correctly rounded answer—generally done only for +, *, /, , and cos.

• If the input H is uncertain in a normwise sense (true input = H + dH , where ||dH || / ||H || is small), the conventional backward stable algorithm is suitable; it is the usual approach to linear algebra, it does not respect sparsity structure, and it does not respect scaling.

• If the input H is uncertain in a component-wise relative sense (true input = H + dH , where maxi,j |dHij | / |Hij | is small), it does respect sparsity, it does respect scaling, but it does need new algorithms, perturbation theory, and error analysis.

In the end, we have new convergence criteria, better convergence criteria, and better error bounds. We also enhance performance because we now terminate the iteration in a much quicker way.

### Tools

Our work in algorithm design has been supported by tool development projects throughout the country. Of particular note are the projects at Rice University and the University of Illinois. Other projects help in terms of what we might call logic or performance debugging of the algorithms—trying to understand what the algorithms are doing when they run on parallel machines. The objective here is to give the implementor a better feeling for where to focus attention and to show precisely what the algorithm is doing while it is executing on parallel machines (Dongarra, Brewer, et al. 1990).

### Testing

Testing and timing have been an integral part of the LAPACK project. Software testing is required to verify new machine-specific versions. Software timing is needed to measure the efficiency of the LAPACK routines and to compare new algorithms and software. In both of these tasks, many vendors have helped us along the way by implementing basic routines on various machines and providing essential feedback (see Table 7).

The strategy we use may not be optimal for all machines. Our objective is to achieve a "best average" performance on the machines listed in Table 8. We are hoping, of course, that our strategy will also perform well

254

 Table 7. Vendor Participation Alliant Computer Sys. BBN Advanced Comp. CONVEX Computer Cray Computer Cray Research Digital Equipment Corp. Encore Computer Corp. FPS Computing Fujitsu Hitachi IBM ECSEC Italy Intel Kendall Square Res. MasPar Myrias Research Corp. NEC Sequent Computer Sys. Silicon Graphics Stardent Computer Sun Microsystems, Inc. Supercomputer Sys., Inc. Thinking Machines Corp.

 Table 8. Target Machines (1-100 Processors) Alliant FX/80 IBM 3090/VF BBN TC2000 Multiflow CONVEX C-2 Myrias CRAY-2 NEC SX CRAY Y-MP RISC machines Encore Multimax Sequent Symmetry Fujitsu VP Stardent Computer Hitachi S-820

on a wider range of machines, including the Intel iPSC, iWarp, MasPar, nCUBE, Thinking Machines, and Transputer-based computers.

Linear Algebra Library for High-Performance Computers*