Linear Algebra Library for High-Performance Computers*

LINPACK Benchmark

Perhaps the best-known part of that package (indeed, some people think it is LINPACK) is the benchmark that grew out of its documentation. The so-called LINPACK Benchmark (Dongarra 1991) appears in the appendix to the user's guide. It was intended to give users an idea of how long it would take to solve certain problems. Originally, we measured the time required to solve a system of equations of order 100. We listed those times for about 20 machines and gave some guidelines for extrapolating execution times.

The times were gathered for two routines from LINPACK: one (SGEFA) to factor a matrix, the other (SGESL) to solve a system of equations. These routines call the BLAS (Basic Linear Algebra Subprograms), where most of the floating-point computation takes place. The routine that sits at the center of that computation is SAXPY, which takes a multiple of one vector and adds it to another vector: y ← αx + y.

Table 1 is a list of the timings of the LINPACK Benchmark on various high-performance computers.

The peak performance for these machines is listed in millions of floating-point operations per second (MFLOPS), in ascending order from 16 to 3000. The question is, when we run the LINPACK Benchmark, what do we actually get on these machines?

Table 1. LINPACK Benchmark on High-Performance Computers

Machine             Peak MFLOPS   Actual MFLOPS    System Efficiency
Ardent Titan-1           16         7                0.44
CONVEX C-130             62        17                0.27
SCS-40                   44         8.0              0.18
IBM RS/6000              50        13                0.26
CONVEX C-210             50        17                0.34
FPS 264                  54         5.6              0.10
Multiflow 14/300         62        17                0.27
IBM 3090/VF-180J        138        16                0.12
CRAY-1                  160        12 (27)           0.075
Alliant FX/80           188        10 (8 proc.)      0.05
CRAY X-MP/1             235        70                0.28
NEC SX-1E               325        32                0.10
ETA-10P                 334        24                0.14 (0.07)
CYBER 205               400        17                0.04
ETA-10G                 644        93 (1 proc.)      0.14
NEC SX-1                650        36                0.06
CRAY X-MP/4             941       149 (4 proc.)      0.16
Fujitsu VP-400         1142        20                0.018
NEC SX-2               1300        43                0.033
CRAY-2                 1951       101 (4 proc.)      0.051
CRAY Y-MP/8            2664       275 (8 proc.)      0.10
Hitachi S-820/90       3000       107                0.036

The column labeled "Actual MFLOPS" gives the answer, and that answer is quite disappointing, in spite of the fact that we are using an algorithm that is highly vectorized, on machines with vector architectures. The next question one might ask is, why are the results so bad? The answer has to do with the rate at which information is transferred from memory to the place where the computations are done. The SAXPY operation needs to reference three vectors and do essentially two floating-point operations on each of their elements. The transfer rate, that is, the maximum rate at which we can move information to or from the memory device, is the limiting factor here.

Thus, as we increase computational power without a corresponding increase in memory bandwidth, memory access becomes a serious bottleneck. The bottom line is that MFLOPS are easy, but bandwidth is difficult.

Transfer Rate

Table 2 lists the peak MFLOPS rate for various machines, the peak transfer rate (in megawords per second), and the ratio of transfer rate to peak MFLOPS.

Table 2. MFLOPS and Memory Bandwidth

Machine             Peak MFLOPS   Peak Transfer (megawords/s)   Ratio
Alliant FX/80            188            22                      0.12
Ardent Titan-4            64            32                      0.5
CONVEX C-210              50            25                      0.5
CRAY-1                   160            80                      0.5
CRAY X-MP/4              940          1411                      1.5
CRAY Y-MP/8             2667          4000                      1.5
CRAY-2S                 1951           970                      0.5
CYBER 205                400           600                      1.5
ETA-10G                  644           966                      1.5
Fujitsu VP-200           533           533                      1.0
Fujitsu VP-400          1066          1066                      1.0
Hitachi 820/80          3000          2000                      0.67
IBM 3090/600-VF          798           400                      0.5
NEC SX-2                1300          2000                      1.5

Recall that the SAXPY operation requires three memory references for every two floating-point operations. Hence, to run at full speed, we need a ratio of transfer rate to computation rate of three to two. The CRAY Y-MP does not do badly in this respect: each processor can transfer 500 million (64-bit) words per second, and the complete system, from memory into the registers, runs at four gigawords per second. But for many of the machines in the table, there is an imbalance between those two rates. One of the particularly bad cases is the Alliant FX/80, which has a peak rate of 188 MFLOPS but can transfer only 22 megawords per second from memory. It is going to be very hard to get peak performance there.

Memory Latency

Another issue affecting performance is, of course, the latency: how long (in terms of cycles) does it actually take to transfer the information after we make a request? In Table 3, we list the memory latency for seven machines. We can see that the time ranges from 14 to 50 cycles. Obviously, a memory latency of 50 cycles is going to impact the algorithm's performance.
