Linear Algebra Library for High-Performance Computers*

LINPACK Benchmark

Perhaps the best-known part of that package (indeed, some people think it is LINPACK) is the benchmark that grew out of the documentation. The so-called LINPACK Benchmark (Dongarra 1991) appears in the appendix to the user's guide. It was intended to give users an idea of how long it would take to solve certain problems. Originally, we measured the time required to solve a system of equations of order 100. We listed those times for about 20 machines and gave some guidelines for extrapolating execution times.

The times were gathered for two routines from LINPACK, one (SGEFA) to factor a matrix, the other (SGESL) to solve a system of equations. These routines call the BLAS, where most of the floating-point computation takes place. The routine that sits at the center of that computation is SAXPY, which takes a multiple of one vector and adds it to another vector:

$$ y \leftarrow y + \alpha x $$
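As a rough illustration of the operation just described, the following Python sketch updates one vector with a multiple of another. It is not the Fortran BLAS interface (which also takes a length and stride arguments); the function name and argument order here are assumptions chosen for readability.

```python
def saxpy(alpha, x, y):
    """Update y in place so that y[i] = y[i] + alpha * x[i]."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):
        # Per element: read x[i], read y[i], write y[i] (three memory
        # references) and do one multiply and one add (two operations).
        y[i] += alpha * x[i]
    return y


y = saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
print(y)  # [12.0, 24.0, 36.0]
```

The comment inside the loop counts the memory references and arithmetic operations per element, which is the balance that matters when the benchmark results are interpreted.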

Table 1 is a list of the timings of the LINPACK Benchmark on various high-performance computers.

The peak performance for these machines is listed here in millions of floating-point operations per second (MFLOPS), in ascending order from 16 to 3000. The question is, when we run this LINPACK Benchmark, what do we actually get on these machines?

Table 1. LINPACK Benchmark on High-Performance Computers

Machine              Peak MFLOPS   Actual MFLOPS    System Efficiency
Ardent Titan-1       16            7                0.44
CONVEX C-130         62            17               0.27
SCS-40               44            8.0              0.18
IBM RS/6000          50            13               0.26
CONVEX C-210         50            17               0.34
FPS 264              54            5.6              0.10
Multiflow 14/300     62            17               0.27
IBM 3090/VF-180J     138           16               0.12
CRAY-1               160           12 (27)          0.075
Alliant FX/80        188           10 (8 proc.)     0.05
CRAY X-MP/1          235           70               0.28
NEC SX-1E            325           32               0.10
ETA-10P              334           24               0.14 (0.07)
CYBER 205            400           17               0.04
ETA-10G              644           93 (1 proc.)     0.14
NEC SX-1             650           36               0.06
CRAY X-MP/4          941           149 (4 proc.)    0.16
Fujitsu VP-400       1142          20               0.018
NEC SX-2             1300          43               0.033
CRAY-2               1951          101 (4 proc.)    0.051
CRAY Y-MP/8          2664          275 (8 proc.)    0.10
Hitachi S-820/90     3000          107              0.036

The column labeled "Actual MFLOPS" gives the answer, and that answer is quite disappointing, even though we are using a highly vectorized algorithm on vector machines. The next question one might ask is, why are the results so bad? The answer has to do with the rate at which information is transferred from memory into the place where the computations are done. The operation, a SAXPY, needs three memory references for each vector element (two loads and one store) and does essentially two floating-point operations on it. The transfer rate, that is, the maximum rate at which we can move information to or from the memory device, is the limiting factor here.

Thus, as we increase the computational power without a corresponding increase in memory bandwidth, memory access can become a serious bottleneck. The bottom line is that MFLOPS are easy, but bandwidth is difficult.
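The "System Efficiency" column of Table 1 appears to be the actual rate divided by the peak rate. The short Python sketch below recomputes it for a few rows copied from the table; the function name is an assumption introduced only for this illustration.

```python
# (machine, peak MFLOPS, actual MFLOPS), copied from rows of Table 1
ROWS = [
    ("Ardent Titan-1", 16, 7),
    ("CRAY-1", 160, 12),
    ("CRAY Y-MP/8", 2664, 275),
    ("Hitachi S-820/90", 3000, 107),
]


def system_efficiency(peak_mflops, actual_mflops):
    """Fraction of the peak rate that the benchmark actually achieves."""
    return actual_mflops / peak_mflops


for machine, peak, actual in ROWS:
    print(f"{machine:18s} {system_efficiency(peak, actual):.3f}")
```

Read this way, the smallest machine in the table delivers almost half of its peak, while the largest delivers under 4 percent.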

Transfer Rate

Table 2 lists the peak MFLOPS rate for various machines, as well as the peak transfer rate (in megawords per second).

Table 2. MFLOPS and Memory Bandwidth

Machine              Peak MFLOPS   Peak Transfer (megawords/second)   Ratio
Alliant FX/80        188           22                                 0.12
Ardent Titan-4       64            32                                 0.5
CONVEX C-210         50            25                                 0.5
CRAY-1               160           80                                 0.5
CRAY X-MP/4          940           1411                               1.5
CRAY Y-MP/8          2667          4000                               1.5
CRAY-2S              1951          970                                0.5
CYBER 205            400           600                                1.5
ETA-10G              644           966                                1.5
Fujitsu VP-200       533           533                                1.0
Fujitsu VP-400       1066          1066                               1.0
Hitachi 820/80       3000          2000                               0.67
IBM 3090/600-VF      798           400                                0.5
NEC SX-2             1300          2000                               1.5

Recall that the operation we were doing requires three memory references for every two floating-point operations. Hence, to run at good rates, we need a ratio of three to two, that is, 1.5 words transferred per floating-point operation. The CRAY Y-MP does not do badly in this respect. Each processor can transfer 500 million (64-bit) words per second, and the complete system, from memory into the registers, runs at four gigawords per second. But for many of the machines in the table, there is an imbalance between the peak computational rate and the transfer rate. One of the particularly bad cases is the Alliant FX/80, which has a peak rate of 188 MFLOPS but can transfer only 22 megawords per second from memory. It is going to be very hard to get peak performance there.
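To make the imbalance concrete, the Python sketch below estimates an upper bound on the SAXPY rate from the peak MFLOPS and peak transfer figures in Table 2. It assumes that every operand comes from memory (no reuse in registers or cache) and that three words are moved for every two floating-point operations; the function and constant names are assumptions for this illustration, not part of LINPACK.

```python
WORDS_PER_ELEMENT = 3  # load x[i], load y[i], store y[i]
FLOPS_PER_ELEMENT = 2  # one multiply, one add


def saxpy_bound_mflops(peak_mflops, transfer_mwords_per_s):
    """Upper bound on the SAXPY rate: the smaller of the arithmetic peak
    and the rate that the memory transfer rate can sustain."""
    bandwidth_limit = transfer_mwords_per_s * FLOPS_PER_ELEMENT / WORDS_PER_ELEMENT
    return min(peak_mflops, bandwidth_limit)


# (machine, peak MFLOPS, peak transfer in megawords/second), from Table 2
MACHINES = [
    ("Alliant FX/80", 188, 22),
    ("CRAY-1", 160, 80),
    ("NEC SX-2", 1300, 2000),
]

for name, peak, transfer in MACHINES:
    bound = saxpy_bound_mflops(peak, transfer)
    print(f"{name:15s} peak {peak:5d} MFLOPS, SAXPY bound {bound:7.1f} MFLOPS")
```

Under these assumptions, a machine with a ratio of 1.5 or more is limited by its arithmetic peak, while a machine such as the Alliant FX/80 is limited by memory to a small fraction of its peak.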

Memory Latency

Another issue affecting performance is, of course, the latency: how long (in terms of cycles) does it actually take to transfer the information after we make a request? In Table 3, we list the memory latency for seven machines. We can see that the time ranges from 14 to 50 cycles. Obviously, a memory latency of 50 cycles is going to impact the algorithm's performance.
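One simple way to see the effect of latency is a startup-cost model, sketched below in Python. It assumes, purely for illustration, that a vector transfer of n elements costs the latency once and then delivers one element per cycle; real machines differ, and the function name is hypothetical.

```python
def vector_transfer_efficiency(n, latency_cycles):
    """Fraction of cycles spent delivering data, assuming a one-time
    startup cost of latency_cycles followed by one element per cycle."""
    return n / (n + latency_cycles)


# Vector length 100 matches the order of the benchmark problem;
# 14 and 50 cycles are the extremes of the latency range quoted above.
for latency in (14, 50):
    eff = vector_transfer_efficiency(100, latency)
    print(f"latency {latency:2d} cycles: efficiency {eff:.2f}")
```

In this model the penalty grows as vectors get shorter, which matters for a problem of order 100, where the vectors are never longer than 100 elements.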

