Linear Algebra Library for High-Performance Computers*

Development of Standards

The linear algebra community has long recognized the need for a common set of basic routines to support the development of our algorithms. Several years ago, as a community effort, we put together a de facto standard identifying the basic operations required in our algorithms and software. Our hope was that this standard would be implemented by many manufacturers on their machines and that we would then be able to draw on the power of those implementations in a rather portable way.

We began with the BLAS designed for performing vector-vector operations; we now call them the Level 1 BLAS (Lawson et al. 1979). We later defined a standard for doing some rather simple matrix-vector calculations, the so-called Level 2 BLAS (Dongarra et al. 1988).

 

Table 3. Memory Latency

Machine             Latency (cycles)
CRAY-1                    15
CRAY X-MP                 14
CRAY Y-MP                 17
CRAY-2                    50
CRAY-2S                   35
CYBER 205                 50
Fujitsu VP-400            31

Still later, the basic matrix-matrix operations were identified, and the Level 3 BLAS were defined (Dongarra, Du Croz, et al. 1990). In Figure 1, we show these three sets of BLAS.
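To make the three levels concrete, the sketch below writes one representative operation from each level as a plain loop nest in C. The routine names (axpy_like, gemv_like, gemm_like) are illustrative only, not the actual BLAS interfaces, and column-major storage with leading dimension n is assumed.

#include <stddef.h>

/* Level 1 (vector-vector):  y <- y + alpha*x */
void axpy_like(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Level 2 (matrix-vector):  y <- y + A*x, with A n-by-n, column major */
void gemv_like(size_t n, const double *A, const double *x, double *y)
{
    for (size_t j = 0; j < n; j++)
        for (size_t i = 0; i < n; i++)
            y[i] += A[i + j*n] * x[j];
}

/* Level 3 (matrix-matrix):  C <- C + A*B, all n-by-n, column major */
void gemm_like(size_t n, const double *A, const double *B, double *C)
{
    for (size_t j = 0; j < n; j++)
        for (size_t k = 0; k < n; k++)
            for (size_t i = 0; i < n; i++)
                C[i + j*n] += A[i + k*n] * B[k + j*n];
}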

Why were we so concerned about getting a handle on those three different levels? The reason lies in the fact that machines have a memory hierarchy and that the faster memory is at the top of that hierarchy (see Figure 2).

Naturally, then, we would like to keep data near the top of the hierarchy and reuse it as many times as possible before it is displaced. The higher-level BLAS let us do just that. As we can see from Table 4, the Level 2 BLAS offer the potential for two floating-point operations for every memory reference; and with the Level 3 BLAS, we get essentially n operations for every two references, which is essentially the maximum possible.
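As a rough illustration of where the ratios in Table 4 come from, the short C program below counts the dominant memory references and floating-point operations for one representative operation at each level; the dimension n = 1000 is an arbitrary choice for the example.

#include <stdio.h>

int main(void)
{
    double n = 1000.0;  /* arbitrary problem size for the illustration */

    /* Level 1, y <- y + alpha*x: read x and y, write y (about 3n words)
       against n multiplies and n adds (2n flops). */
    printf("Level 1: %.2f flops per memory reference\n", (2.0*n) / (3.0*n));

    /* Level 2, y <- y + A*x: reading the n-by-n matrix A dominates
       (about n^2 words) against 2n^2 flops. */
    printf("Level 2: %.2f flops per memory reference\n", (2.0*n*n) / (n*n));

    /* Level 3, C <- C + A*B: read A, B, and C and write C (about 4n^2
       words) against 2n^3 flops, i.e., n/2 flops per reference. */
    printf("Level 3: %.2f flops per memory reference\n", (2.0*n*n*n) / (4.0*n*n));

    return 0;
}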

Figure 1.
Levels 1, 2, and 3 BLAS.



Figure 2.
Memory hierarchy.

 

Table 4. Capabilities of Higher-Level BLAS

BLAS                     Memory References    FLOPS    FLOPS/Memory Reference
Level 1: y ← y + αx            3n              2n              2/3
Level 2: y ← y + Ax            n²              2n²              2
Level 3: A ← A + BC            4n²             2n³             n/2

These higher-level BLAS have another advantage: on parallel machines they give us increased granularity, more opportunity for parallel operation, and lower synchronization costs.
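A minimal sketch of that granularity argument, assuming OpenMP is available: a single Level 3 update can be split into a few large, independent block-column tasks, so the only synchronization is the implicit barrier at the end of the loop, whereas expressing the same update as many small Level 1 calls would introduce a synchronization point after each one. The routine names here are illustrative, not part of any standard.

#include <stddef.h>

/* Update one block of columns of C:  C(:, jb:jb+w-1) += A * B(:, jb:jb+w-1). */
static void block_column_update(size_t n, size_t jb, size_t w,
                                const double *A, const double *B, double *C)
{
    for (size_t j = jb; j < jb + w; j++)
        for (size_t k = 0; k < n; k++)
            for (size_t i = 0; i < n; i++)
                C[i + j*n] += A[i + k*n] * B[k + j*n];
}

/* C <- C + A*B, parallelized over block columns of width nb.  Each
   iteration is a large, independent task; the only synchronization is
   the implicit barrier when the parallel loop completes. */
void parallel_gemm_update(size_t n, size_t nb,
                          const double *A, const double *B, double *C)
{
    #pragma omp parallel for
    for (size_t jb = 0; jb < n; jb += nb) {
        size_t w = (jb + nb <= n) ? nb : n - jb;
        block_column_update(n, jb, w, A, B, C);
    }
}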

Of course, nothing comes free. The BLAS require us to rewrite our algorithms so that we use these operations effectively. In particular, we need to develop blocked algorithms that can exploit the matrix-matrix operations.

The development of blocked algorithms is a fascinating example of history repeating itself. In the sixties, these algorithms were developed for machines with very small main memories, so tapes and other secondary storage served as the main store of data (Barron and Swinnerton-Dyer 1960, Chartres 1960, and McKellar and Coffman 1969). The programmer would reel in a block of data from tape, put it into memory, and use it as heavily as possible before sending it back out. Today people are reorganizing their algorithms with that same idea. But now, instead of dealing with tapes and main memory, we are dealing with vector registers, cache, and so forth to get that reuse (Calahan 1979, Jordan and Fong 1977, Gallivan et al. 1990, Berry et al. 1986, Dongarra, Duff, et al. 1990, Schreiber 1988, and Bischof and Van Loan 1986). That is essentially what LAPACK is about: taking those ideas of locality of reference and data reuse and embodying them in a new library for linear algebra.
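A minimal sketch of the blocking idea in C, assuming a simple column-major layout and an arbitrary block size NB chosen so that a few NB-by-NB blocks fit in fast memory: the inner kernel reuses each block roughly NB times before it is evicted, which is exactly the locality-of-reference argument above. In a real blocked algorithm the inner block multiply would be a call to a tuned Level 3 BLAS routine rather than the plain loops shown here.

#include <stddef.h>

#define NB 64  /* illustrative block size; tuned to the fast memory in practice */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Blocked update C <- C + A*B for n-by-n column-major matrices.
   Each (ib, jb, kb) step works on small blocks that stay in fast memory,
   so every element loaded is reused about NB times. */
void blocked_gemm(size_t n, const double *A, const double *B, double *C)
{
    for (size_t jb = 0; jb < n; jb += NB)
        for (size_t kb = 0; kb < n; kb += NB)
            for (size_t ib = 0; ib < n; ib += NB) {
                size_t jmax = min_sz(jb + NB, n);
                size_t kmax = min_sz(kb + NB, n);
                size_t imax = min_sz(ib + NB, n);
                /* Multiply one pair of blocks; a vendor-optimized
                   matrix-matrix routine would normally be called here. */
                for (size_t j = jb; j < jmax; j++)
                    for (size_t k = kb; k < kmax; k++)
                        for (size_t i = ib; i < imax; i++)
                            C[i + j*n] += A[i + k*n] * B[k + j*n];
            }
}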

