
The Interplay between Algorithms and Architectures:
Two Examples

Duncan Buell

Duncan A. Buell received his Ph.D. degree in mathematics from the University of Illinois at Chicago in 1976. He has held academic positions in mathematics at Carleton University in Canada and in computer science at Bowling Green State University in Ohio and at Louisiana State University (LSU). In 1985 he joined the Supercomputing Research Center in Bowie, Maryland, where he is presently a Senior Research Scientist. He has done research in number theory, information retrieval, and parallel algorithms and was part of a team at LSU that built a 256-bit-wide, reconfigurable parallel integer arithmetic processor, on which he holds a patent. He has written numerous journal articles and one book.

Using two examples, this presentation briefly describes how the architecture of a computer bears on the performance achievable by different algorithms.

This session is supposed to focus on architectural features of high-performance computing systems and how those features relate to the design and use of algorithms for solving hard problems.

Table 1 presents timings from a computation-bound program. This is a benchmark I created that captures the computational essence of the sort of computations I normally do. As with any benchmark, there are many caveats and qualifications, but you will have to trust me that the table was generated honestly and that, in my experience, performance on this benchmark is a reasonable predictor of the performance I might personally expect from a computer. The program is written in C; not being trained in archaeology, I tend to avoid antique languages like Fortran.



Table 1. An Integer Arithmetic Benchmark

                              CPU Times (in Seconds)
                              by Optimization Level
                              None        Full

SUN 4 cc compiler             92.3        80.6
SUN 4 gcc compiler            89.9        78.1

SUN 3 cc compiler             83.5        67.6

SUN 3 cc compiler             64.8        51.1
SUN 3 gcc compiler            61.6        51.4

CONVEX cc compiler            39.3        39.3 (32-bit arith.)
CONVEX cc compiler            69.0        68.2 (64-bit arith.)

CRAY-2 cc compiler            49.0        48.7 (64-bit arith.)
CRAY-2 scc compiler            --         48.7 (64-bit arith.)

DEC-5000/200                  19.7        18.5

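The benchmark itself is not reproduced in this chapter. To give a feel for the kind of kernel involved, the following C fragment is a minimal sketch, assuming (as an illustration only, not the actual benchmark) a schoolbook multiprecision multiply on 16-bit digits; its speed is set almost entirely by integer multiply, add, and carry operations.

    /*
     * A minimal sketch, not the actual benchmark: a schoolbook
     * multiprecision multiply on 16-bit digits, the kind of kernel
     * whose speed is set almost entirely by integer multiply, add,
     * and carry.  Operand contents and repetition count are arbitrary.
     */
    #include <stdio.h>
    #include <time.h>

    #define WORDS 256                   /* 16-bit digits per operand */
    #define REPS  1000                  /* repetitions of the kernel */

    int main(void)
    {
        static unsigned long a[WORDS], b[WORDS], c[2 * WORDS];
        unsigned long carry;
        int i, j, r;
        clock_t start, stop;

        for (i = 0; i < WORDS; i++) {   /* arbitrary test operands */
            a[i] = (i * 40503UL + 12345UL) & 0xffff;
            b[i] = (i * 30103UL + 54321UL) & 0xffff;
        }

        start = clock();
        for (r = 0; r < REPS; r++) {
            for (i = 0; i < 2 * WORDS; i++)
                c[i] = 0;
            for (i = 0; i < WORDS; i++) {    /* schoolbook multiply */
                carry = 0;
                for (j = 0; j < WORDS; j++) {
                    carry += c[i + j] + a[i] * b[j];
                    c[i + j] = carry & 0xffff;
                    carry >>= 16;
                }
                c[i + WORDS] = carry;
            }
        }
        stop = clock();

        printf("CPU time: %.1f seconds\n",
               (double)(stop - start) / CLOCKS_PER_SEC);
        return 0;
    }

On a machine whose instruction set omits integer multiply, each a[i] * b[j] in the inner loop expands into a multiple-instruction software sequence, and no amount of raw floating-point speed compensates; that is the effect Table 1 exposes.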

Table 2 displays counts of "cell updates per second"; I will get to the details in a moment. The table was prepared largely by Craig Reese of the Supercomputing Research Center (SRC). I apologize that the two tables are on inverse scales: one measures a time, the other a rate.

So what's the point? As Harvey Cragon mentioned in Session 3, some things have been left behind as we have developed vector computing, and one of them is integer arithmetic. This benchmark measures integer arithmetic performance, and on the basis of Table 1, one could justifiably ask why one of the machines in the table is called a supercomputer and the others are not.

As an additional comment, I point out that some of the newer RISC chips intentionally leave out integer instructions; this is the reason for the poor performance of the Sun Microsystems, Inc., SUN 4 relative to the Digital Equipment Corporation DEC-5000. Those in the floating-point world will not notice the absence of those instructions, because they will be able to obtain floating-point performance through coprocessor chips.



Table 2. The DNA Homology Problem

                     Cell Updates per Second (in Thousands)

CM-2 (64K)           1,085,202     vpratio=128, algorithm 1 (8192K strings)
CM-2 (64K)           1,081,006     vpratio=32, algorithm 1 (2048K strings)
CM-2 (64K)             712,348     vpratio=1, algorithm 1 (64K strings)

CM-2 (64K)             873,813     vpratio=32, algorithm 2 (2048K strings)
CM-2 (64K)             655,360     vpratio=1, algorithm 2 (64K strings)

Splash*                 50,000

CRAY-2                   6,400
SUN 4/370                3,030
SUN 4/60                 2,273
SUN 4/280                2,127
PNAC*                    1,099
SUN 3/280                  758
SUN 3/60                   617
SUN 3/50                   450
CM-2*                      212
CRAY-2*                    154
CONVEX C1*                 112
SUN 2/140*                  20

* Computations from published sources.


The second table is a DNA string-matching benchmark. This is a computation using a dynamic programming algorithm to compare a string of characters against a great many other strings of characters. The items marked with an asterisk come from published sources; the unmarked items come from computations at SRC. Some machines are included multiple times to indicate different implementations of the same algorithm.
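The scoring scheme and the code are not reproduced here, but the heart of any such computation is the classical edit-distance recurrence. The following C fragment is a minimal sketch, assuming unit costs for insertion, deletion, and substitution; filling one entry of the dynamic programming table is one "cell update" in the sense of Table 2, so comparing strings of lengths m and n costs m * n cell updates.

    /*
     * A minimal sketch of the dynamic programming edit-distance
     * recurrence, assuming unit costs for insertion, deletion, and
     * substitution (the benchmark's actual scoring scheme is not
     * given in the text).  Filling one entry of the table is one
     * "cell update."
     */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int edit_distance(const char *s, const char *t)
    {
        int m = strlen(s), n = strlen(t);
        int *prev = malloc((n + 1) * sizeof(int));
        int *curr = malloc((n + 1) * sizeof(int));
        int i, j, d;

        for (j = 0; j <= n; j++)            /* distance from empty prefix */
            prev[j] = j;

        for (i = 1; i <= m; i++) {
            curr[0] = i;
            for (j = 1; j <= n; j++) {      /* one cell update */
                int sub = prev[j - 1] + (s[i - 1] != t[j - 1]);
                int del = prev[j] + 1;
                int ins = curr[j - 1] + 1;
                d = sub < del ? sub : del;
                curr[j] = d < ins ? d : ins;
            }
            memcpy(prev, curr, (n + 1) * sizeof(int));
        }
        d = prev[n];
        free(prev);
        free(curr);
        return d;
    }

    int main(void)
    {
        /* hypothetical DNA-like strings, for illustration only */
        printf("%d\n", edit_distance("GATTACA", "GACTATA"));  /* prints 2 */
        return 0;
    }

Every cell on an antidiagonal of the table depends only on cells of the two preceding antidiagonals, so all the cells on an antidiagonal can be updated simultaneously; that independence is what the parallel machines in Table 2 exploit.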

As for the less familiar machines, PNAC is Dan Lopresti's special-purpose DNA string-matching machine, and Splash is a board for insertion into a SUN computer that carries Xilinx chips arranged in a linear array. The timings of the Connection Machine CM-2 (from Thinking Machines Corporation) are largely C/Paris timings from the SRC Connection Machine.



And what is the message? The DNA problem and the dynamic programming edit distance algorithm are inherently highly parallel and dominated as a computation by the act of "pushing a lot of bits around." It should, therefore, not come as a surprise that the Connection Machine, with inherent parallelism and a bit orientation, outperforms all other machines. In Table 1, we see that the absence of integer hardware drags the Cray Research CRAY-2 down to the performance level of a workstation—not even the raw speed of the machine can overcome its inherent limitations. In Table 2, we see that the fit between the computation and the architecture allows for speedups substantially beyond what one might expect to get on the basis of mundane issues of price and basic machine speed.

Now, do we care—or should we—about the problems typified in these two tables? After all, neither of these fits the mode of "traditional" supercomputing. From K. Speierman's summary of the 1983 Frontiers of Supercomputing meeting, we have the assertion that potential supercomputer applications may be far greater than current usage indicates. Speierman had begun to make my case before I even took the floor. Yes, there are hard problems out there that require enormous computation resources and that are simply not supported by the architecture of traditional high-end computers.

Two other lessons emerge. One is that enormous effort has gone into vector computing, and that effort has produced great improvements in performance. One can then argue that in the world of nonvector computing, many of those initial great improvements are yet to be made, provided the machines exist to support that kind of computation.

The second lesson is a bitter pill for those who would argue for keeping the 10^50 lines of code already written. If we are going to come up with new architectures and use them efficiently, then code will necessarily have to be rewritten, because it will take new algorithms to use the machines well. Indeed, it will be a failure of ingenuity on the part of the algorithm designers if they are unable to come up with algorithms so much better for the new and different machines that, even with the cost of rewriting code, such rewrites are considered necessary.


