Frontiers of Supercomputing II

Algorithm Design

Up to now, we have talked about restructuring algorithms; that is essentially what we did when we changed them to block form. The basic algorithm itself remains the same; we simply change the locality of our data references and the independence of the operations we focus on, namely the matrix-matrix operations.

Divide-and-Conquer Approach

Let us now talk about designing algorithms. In this case, the basic algorithm will change. In particular, let us consider a divide-and-conquer technique for finding the eigenvalues and eigenvectors of a symmetric tridiagonal matrix (Dongarra and Sorensen 1987). The technique is also used in other fields, where it is sometimes referred to as domain decomposition. It involves tearing, or partitioning, the problem to produce small, independent pieces. Then the eigenvalues of each piece are found independently. Finally, we "put back" the eigenvalues of one set of pieces, then those of the next, and so on, until the full problem is reassembled. The redesign was successful: we ended up with an algorithm that runs in parallel very efficiently. Table 6 gives ratios of performance on a CRAY-2 for up to four processors. As we can see from the table, on four processors the divide-and-conquer algorithm runs more than four times faster than TQL2. What's more, we have an example of serendipity: the same algorithm is actually more efficient than the "best" sequential algorithm, even in the sequential setting, where it is already 1.35 to 1.53 times faster than TQL2 on one processor.

Table 6. Ratio of Execution Time for TQL2 over Divide-and-Conquer Algorithm^a
No. Proc.	100 E/(p)	100 (1)/(p)	200 E/(p)	200 (1)/(p)	300 E/(p)	300 (1)/(p)
1	1.35	1	1.45	1	1.53	1
2	2.55	1.88	2.68	1.84	2.81	1.84
3	3.39	2.51	3.71	2.55	3.79	2.48
4	4.22	3.12	4.60	3.17	5.03	3.28
^a where E = EISPACK TQL2,
(1) = parallel divide-and-conquer algorithm on one processor, and
(p) = parallel divide-and-conquer algorithm on p processors.
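
The tearing and merging steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation of Dongarra and Sorensen: it assumes the textbook rank-one tearing T = diag(T1, T2) + rho v v^T, it finds the merged eigenvalues by plain bisection on the secular equation, and it omits the deflation step a production code needs, so it assumes nonzero off-diagonal entries and no coinciding eigenvalues among the pieces. The names `dc_eigh` and `secular_root` are hypothetical.

```python
import math

def secular_root(d, z, rho, lo, hi):
    """Bisect f(x) = 1 + rho * sum(z_i**2 / (d_i - x)) on the open interval (lo, hi)."""
    def f(x):
        return 1.0 + rho * sum(zi * zi / (di - x) for di, zi in zip(d, z))

    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if mid <= lo or mid >= hi:  # interval has shrunk to adjacent floats
            break
        # Between two poles, f increases when rho > 0 and decreases when rho < 0.
        if (f(mid) < 0.0) == (rho > 0.0):
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def dc_eigh(diag, off):
    """Eigenvalues (ascending) and eigenvectors of a symmetric tridiagonal matrix.

    diag holds the n diagonal entries, off the n - 1 off-diagonal entries.
    Assumption: no deflation, so every off-diagonal entry must be nonzero.
    """
    n = len(diag)
    if n == 1:
        return [diag[0]], [[1.0]]
    m = n // 2
    rho = off[m - 1]
    # Tear: remove the coupling entry and correct the two neighbouring diagonals.
    d1 = diag[:m]
    d1[-1] -= rho
    d2 = diag[m:]
    d2[0] -= rho
    # The two pieces are independent; this is where parallelism would enter.
    w1, q1 = dc_eigh(d1, off[:m - 1])
    w2, q2 = dc_eigh(d2, off[m:])
    # Put back: eigenproblem of D + rho * z z^T in the basis of the pieces.
    d = w1 + w2
    cols = [v + [0.0] * (n - m) for v in q1] + [[0.0] * m + v for v in q2]
    z = [v[-1] for v in q1] + [v[0] for v in q2]
    poles = sorted(d)
    bound = rho * sum(zi * zi for zi in z)
    if rho > 0.0:
        brackets = list(zip(poles, poles[1:] + [poles[-1] + bound]))
    else:
        brackets = list(zip([poles[0] + bound] + poles[:-1], poles))
    eigvals, eigvecs = [], []
    for lo, hi in brackets:
        lam = secular_root(d, z, rho, lo, hi)
        u = [zi / (di - lam) for di, zi in zip(d, z)]
        norm = math.sqrt(sum(ui * ui for ui in u))
        vec = [sum(ui * c[k] for ui, c in zip(u, cols)) / norm for k in range(n)]
        eigvals.append(lam)
        eigvecs.append(vec)
    return eigvals, eigvecs
```

Calling `dc_eigh([2.0, 3.0], [1.0])` returns the two eigenvalues of the matrix with rows (2, 1) and (1, 3) together with unit eigenvectors. The two recursive calls share no data, which is the property that makes the approach attractive on a multiprocessor.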


Accuracy

Working with LAPACK has given us an opportunity to go back and rethink some of the algorithms. How accurately can we solve numerical analysis (NA) problems (Demmel and Kahan 1988)? The answer depends on the accuracy of the input data and how much we are willing to spend:

• If the input data is exact, we can ask for a (nearly) correctly rounded answer; this is generally done only for +, *, /, √, and cos.

• If the input H is uncertain in a normwise sense (true input = H + dH, where ||dH|| / ||H|| is small), the conventional backward stable algorithm is suitable. This is the usual approach to linear algebra; it respects neither sparsity structure nor scaling.

• If the input H is uncertain in a component-wise relative sense (true input = H + dH, where max_{i,j} |dH_ij| / |H_ij| is small), the analysis respects both sparsity and scaling, but it requires new algorithms, new perturbation theory, and new error analysis.
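
The difference between the two uncertainty measures in the list above can be made concrete with a small sketch. The choice of the Frobenius norm, the helper names `normwise_error` and `componentwise_error`, and the sample matrices are all assumptions made for illustration; the text does not fix a particular norm.

```python
import math

def normwise_error(H, dH):
    """Return ||dH||_F / ||H||_F (Frobenius norm chosen for illustration)."""
    num = math.sqrt(sum(x * x for row in dH for x in row))
    den = math.sqrt(sum(x * x for row in H for x in row))
    return num / den

def componentwise_error(H, dH):
    """Return max over i, j of |dH_ij| / |H_ij|.

    A zero entry of H that is perturbed gives an infinite relative error,
    which is how this measure "respects sparsity".
    """
    worst = 0.0
    for h_row, d_row in zip(H, dH):
        for h, d in zip(h_row, d_row):
            if d == 0.0:
                continue
            if h == 0.0:
                return math.inf
            worst = max(worst, abs(d) / abs(h))
    return worst

# A sparse, badly scaled matrix (hypothetical data).
H = [[1.0, 0.0], [0.0, 1e-8]]

# Small in norm, but it fills in the zeros and swamps the tiny entry.
dH_dense = [[0.0, 1e-10], [1e-10, 1e-10]]

# Small relative to every entry: sparsity and scaling are preserved.
dH_scaled = [[1e-10 * h for h in row] for row in H]
```

Here `dH_dense` is tiny in the normwise sense yet infinite in the component-wise sense, because it perturbs entries of H that are exactly zero. `dH_scaled` is small under both measures.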

In the end, we have new and better convergence criteria and better error bounds. Performance also improves, because the iteration now terminates much sooner.

Tools

Our work in algorithm design has been supported by tool development projects throughout the country. Of particular note are the projects at Rice University and the University of Illinois. Other projects help in terms of what we might call logic or performance debugging of the algorithms—trying to understand what the algorithms are doing when they run on parallel machines. The objective here is to give the implementor a better feeling for where to focus attention and to show precisely what the algorithm is doing while it is executing on parallel machines (Dongarra, Brewer, et al. 1990).

Testing

Testing and timing have been an integral part of the LAPACK project. Software testing is required to verify new machine-specific versions. Software timing is needed to measure the efficiency of the LAPACK routines and to compare new algorithms and software. In both of these tasks, many vendors have helped us along the way by implementing basic routines on various machines and providing essential feedback (see Table 7).

The strategy we use may not be optimal for all machines. Our objective is to achieve a "best average" performance on the machines listed in Table 8. We are hoping, of course, that our strategy will also perform well on a wider range of machines, including the Intel iPSC, iWarp, MasPar, nCUBE, Thinking Machines, and Transputer-based computers.

Table 7. Vendor Participation
Alliant Computer Sys.	BBN Advanced Comp.
CONVEX Computer	Cray Computer
Cray Research	Digital Equipment Corp.
Encore Computer Corp.	FPS Computing
Fujitsu	Hitachi
IBM ECSEC Italy	Intel
Kendall Square Res.	MasPar
Myrias Research Corp.	NEC
Sequent Computer Sys.	Silicon Graphics
Stardent Computer	Sun Microsystems, Inc.
Supercomputer Sys., Inc.	Thinking Machines Corp.

Table 8. Target Machines (1-100 Processors)
Alliant FX/80	IBM 3090/VF
BBN TC2000	Multiflow
CONVEX C-2	Myrias
CRAY-2	NEC SX
CRAY Y-MP	RISC machines
Encore Multimax	Sequent Symmetry
Fujitsu VP	Stardent Computer
Hitachi S-820