Shell Oil Supercomputing
Patric Savage
Patric Savage is a Senior Research Fellow in the Computer Science Department of Shell Development Company. He obtained a B.A. degree in mathematics from Rice University, Houston, in 1952 and began his career in computing in 1955, when he left graduate school to become Manager of Computer Programming at Hughes Tool Company. There, he led Hughes's pioneering efforts in the use of computers for inventory management, production control, and shop scheduling, using IBM 650, 305, and 1410 computers. Following a brief stint in the aerospace industry, he joined IBM in Los Angeles, where he designed a large grocery billing system and took part in a comprehensive study of hospital information systems.
Mr. Savage began his career with Shell in 1965 in seismic processing. This is an area that has grown into one of the world's largest nonmilitary computing endeavors, and Mr. Savage has remained active in this field since then as a computer scientist, consultant, and advisor. Since 1980 he has been very active in parallel and distributed computing systems R&D. For the past year he has been regularly attending the HIPPI and Fibre Channel Standards Working Group meetings. Recently he helped establish an Institute of Electrical and Electronics Engineers (IEEE) Standards Project that will eventually lead to a Storage System Standards protocol.
Mr. Savage is a member of the Computer Society, the IEEE, and the IEEE Mass Storage Technical Committee, which he chaired from 1986 through 1988 and for which he now chairs the Standards Subcommittee. He also chairs the Storage and Archiving Standards Subcommittee for the Society of Exploration Geophysicists and holds a life membership in Sigma Xi, the Scientific Research Society.
I will give you a quick overview of the history of supercomputing at Shell Oil Company and then discuss our recent past in parallel computing. I will also discuss our I/O and mass-storage facility and go into what we are now doing and planning to do in parallel computing, including the problem-solving environment we have under development.
Shell's involvement in high-performance computing dates from about 1963. When I arrived at Shell in 1965, seismic processing represented 95 per cent of all the scientific computing that was done in the entire company. Since then there has been a steady increase in general scientific computing at Shell. We now do a great many more reservoir simulations, and we are using codes like NASTRAN for offshore platform designs. We are also heavily into chemical engineering modeling and such.
Seismic processing has always required array processors to speed it up, so from the very beginning we have had powerful array processors. Before 1986 we used UNIVAC systems exclusively, with an array processing system whose design I orchestrated. That machine was capable of 120 million floating-point operations per second (MFLOPS), and it was not a specialized device; it was a very flexible, completely programmable processor on the UNIVAC system. We "maxed out" at 11 of those in operation. At one time we had a swing count of 13, and, for the three weeks that it lasted, we had more MFLOPS on our floor than Los Alamos National Laboratory.
In about 1986, our reservoir-simulation people were spending so much money renting time on Cray Research, Inc., machines that it was decided we could half-fund a Cray of our own, and other groups at Shell were willing to fund the other half. That is how we got into using Cray machines. We were fortunate to be able to acquire complete seismic programming codes externally and thus were able to jump immediately onto the Crays. Otherwise, we would have faced an almost impossible conversion problem.
We began an exploratory research program in parallel computing about 1982, forming an interdisciplinary team of seven people: three geophysicists, who were skilled at geophysical programming, and four Ph.D. computer scientists. Our goal was to make a truly giant leap ahead: to be able to develop applications that were hitherto totally unthinkable. We have not completely abandoned that goal, although we have pulled in our horns a good bit. We acquired an nCUBE 1, a 512-node research vehicle built by nCUBE Corporation and one of the very first nCUBEs sold to industry, and worked with it. In the process, we learned a great deal about how to make things work on a distributed-memory parallel computer.
In early 1989, we put a single application into production on a 256-node configuration of our nCUBE 1 at our computer center. It actually "blows away" a CRAY X-MP CPU on that same application. But what has really spurred further growth in our parallel computing effort is that the application was convincingly cost effective to management.
To digress somewhat, I will now discuss our I/O and mass-storage system. (The mass-storage system that many of you may be familiar with was designed and developed at Shell in conjunction with MASSTOR Corporation.) We have what we call a virtual-tape system. The tapes are in automated libraries; we do about 8000 mounts a day, and we import 2000 reels and export another 2000 every day. The concept is that if a program touches a tape, it has to swallow it all, so we stage and destage entire tapes at a time. No program actually owns a tape drive; it can own only a virtual tape drive. We had something like 27 physical tape drives in the system, yet we could be dynamically running something like 350 virtual tape units.
Records were delivered on demand to the computers over a Network Systems HYPERchannel. That system has been phased out, now that we have released all of the UNIVACs, and today our Crays access shared tape drives on six automated cartridge libraries. We will have 64 tape drives on those, and our Cray systems will own 32 of them. They will stage tapes on local disks, and the policy will be the same: if you touch a tape, you have to swallow the whole tape. You either stage it on your own local disk immediately, as fast as you can read it off the tape, or else you consume it at that rate.
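As a minimal sketch of that whole-tape staging rule (all names here are invented for illustration; this is not the actual Shell/MASSTOR code):

#include <stdio.h>
#include <stdbool.h>

#define PHYSICAL_DRIVES 27   /* real drives behind the virtual pool      */
#define VIRTUAL_UNITS   350  /* virtual units jobs may hold concurrently */

typedef struct {
    int  reel_id;
    bool staged;   /* has the whole reel been copied to staging disk? */
} virtual_tape;

/* Opening a virtual tape triggers a stage of the ENTIRE reel; the
   physical drive is held only for that one sequential copy and then
   released, which is how a few real drives back hundreds of virtual
   units. The job never touches a physical drive directly. */
static void open_virtual_tape(virtual_tape *vt, int reel_id)
{
    vt->reel_id = reel_id;
    printf("staging entire reel %d to disk\n", reel_id); /* copy end-to-end */
    vt->staged = true;  /* physical drive released; job reads from disk */
}

int main(void)
{
    virtual_tape vt;
    printf("%d real drives backing %d virtual units\n",
           PHYSICAL_DRIVES, VIRTUAL_UNITS);
    open_virtual_tape(&vt, 4711);
    return 0;
}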
That system was obviously limited by the number of ports we could have; three Crays severely strain the number of ports available, which is something like eight. Our near-term goal is to develop a tape data server that will be accessed via a switched high-performance parallel interface (HIPPI) and to do our staging onto a striped-disk server that would also be accessed over a switched HIPPI. One problem we see with striped-disk servers is the tremendous disparity between the bandwidth of a striped-disk system and that of the current 3480 tape. We now suddenly have striped disks that will run at rates like 80 to 160 megabytes per second, and you cannot feed a striped disk or do any kind of serious staging or destaging using slow tape. I am working with George Michael, of Lawrence Livermore National Laboratory, on this problem. We have discussed striped tape that would operate at rates like 100 megabytes per second, and we believe a prototype can be demonstrated in less than two years at low cost.
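To put rough numbers on that disparity (the 3480 rate below is an assumption of roughly 3 megabytes per second per drive; the striped-disk figure is the mid-range of the rates just quoted, and the dataset size is hypothetical):

#include <stdio.h>

int main(void)
{
    const double tape_3480_mb_s    = 3.0;     /* assumed single-3480 rate   */
    const double striped_disk_mb_s = 120.0;   /* mid-range of quoted 80-160 */
    const double dataset_mb        = 10000.0; /* hypothetical 10-GB stage   */

    printf("one 3480 needs %6.0f s\n", dataset_mb / tape_3480_mb_s);
    printf("striped disk can absorb it in %4.0f s\n",
           dataset_mb / striped_disk_mb_s);
    printf("disparity: about %.0fx\n", striped_disk_mb_s / tape_3480_mb_s);
    return 0;
}

At those assumed rates the disks sit idle roughly 40 seconds out of every 41 while a single tape trickles data in, which is why striped tape at comparable bandwidth is attractive.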
Going back to parallel computing, I will share some observations from our nCUBE 1 experience. First, we absolutely could not go much further without a whole lot more node memory, and the nCUBE 2 solved that problem for us. Second, we absolutely have to have high-bandwidth external I/O. The reason we were able to run only that one application is that it was a number-crunching application satisfied by about 100 kilobytes per second of input and output; we were spoon-feeding it with data.
We have discovered that our programmers are very good at designing parallel programs. They do not need a piece of software that searches over the whole program and automatically parallelizes it; we think the programmer should develop the strategy. However, we have found that programmer errors in parallel programs are devastating because they create some of the most obscure bugs that have ever been seen in the world of computing.
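To illustrate why such bugs are so obscure, here is the classic unsynchronized-update race in miniature. This is a shared-memory example chosen for brevity (on a distributed-memory machine like the nCUBE, the analogue is a race on message arrival order) and is illustrative only, not one of our codes:

#include <pthread.h>
#include <stdio.h>

static long counter = 0;               /* shared and unprotected on purpose */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 1000000; i++)
        counter++;                      /* load, add, store: not atomic */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Usually prints less than 2000000, and a different value on each
       run; that nondeterminism is what makes such bugs so hard to find. */
    printf("counter = %ld (expected 2000000)\n", counter);
    return 0;
}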
Because we felt that a parallel programming environment is essential, we enlisted the aid of Pacific-Sierra Research (PSR). They had a "nifty" product, which many of you are familiar with, called FORGE; it was still in late development when we contacted them. We interested them in developing a product that they chose to call MIMDizer, a programmer's workbench for both kinds of parallel computers: those with distributed memory and those with shared memory. We have adopted it. The first two target machines are the Intel system and the nCUBE 2.
A requirement in MIMDizer's development was that each target machine be described by a set of parameters, so that new target machines can be added easily. The analysis of your program then gives a view of how the existing version will run on a given target machine and urges you to make certain changes so that it will run more effectively on a different target machine. I have suggested to PSR that they should develop a SIMDizer that would be applicable to other architectures, such as the Thinking Machines Corporation CM-2.
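The actual parameter set belongs to PSR and is not reproduced here; purely as a hypothetical sketch, a per-target record might look something like this (every field name and number is invented for illustration):

#include <stdio.h>

typedef struct {
    const char *name;           /* e.g., "nCUBE 2", "Intel system"    */
    int    nodes;               /* processor count                    */
    int    node_memory_mb;      /* memory per node                    */
    double mflops_per_node;     /* peak rate per node                 */
    double msg_latency_us;      /* fixed cost to start one message    */
    double msg_bandwidth_mb_s;  /* per-link transfer rate             */
    int    shared_memory;       /* 1 = shared, 0 = distributed        */
} target_machine;

/* One use an analyzer could make of such a record: estimating the
   cost of moving a message between nodes on a given target. */
static double msg_time_s(const target_machine *m, double bytes)
{
    return m->msg_latency_us * 1e-6 + bytes / (m->msg_bandwidth_mb_s * 1e6);
}

int main(void)
{
    /* illustrative numbers only, not vendor specifications */
    target_machine ncube2 = { "nCUBE 2", 128, 16, 2.4, 100.0, 2.2, 0 };
    printf("%s: an 8-KB message costs about %.0f us\n",
           ncube2.name, msg_time_s(&ncube2, 8192.0) * 1e6);
    return 0;
}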
I have been seriously urging PSR to develop what I would call a PARTITIONizer: something that would help a programmer tear a program apart and break it up so that it can be run in a distributed, heterogeneous computing environment. It would be a powerful tool and a natural adjunct to the whole package.
Our immediate plan for the nCUBE 2 is to install a 128-node machine in production in the first quarter of 1991. For it we will have five or six applications, which will free up a lot of Cray time to run other applications that today are highly limited by lack of Cray resources.
I now want to talk about the problem-solving environment, because I think there is a message here that you all should really listen to. This system was designed around 1980; three of us in our computer science research department worked on the concepts. It actually was funded in 1986, and we will finish the system in 1992. Basically, it consists of a library of high-level primitive operations, many of which are problem-domain primitives.
The user graphically builds what we call a "flownet," which is an acyclic graph. It can branch out anywhere it wants. The user interface will not allow an illegal flownet: every input and every output is typed and is required to attach correctly.
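A minimal sketch of that attachment rule, with all operation, port, and type names invented for illustration:

#include <stdio.h>
#include <string.h>

/* A port on a flownet operation: operation name, port name, type. */
typedef struct {
    const char *op;
    const char *port;
    const char *type;
} port;

/* The interface refuses an edge unless the output and input types
   match (the acyclicity check is a separate test, omitted here). */
static int legal_edge(const port *out, const port *in)
{
    return strcmp(out->type, in->type) == 0;
}

int main(void)
{
    port decon_out = { "deconvolve", "traces_out", "trace_gather" };
    port stack_in  = { "stack",      "traces_in",  "trace_gather" };
    port plot_in   = { "plot",       "image_in",   "raster_image" };

    printf("deconvolve -> stack legal? %d\n",
           legal_edge(&decon_out, &stack_in));  /* prints 1 */
    printf("deconvolve -> plot  legal? %d\n",
           legal_edge(&decon_out, &plot_in));   /* prints 0 */
    return 0;
}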
Every operation in the flownet is inherently parallel. Typical jobs have hundreds of operations. We know of jobs that will have thousands of operations. Some of the jobs will be bigger than you can actually run in a single machine, so we will have a facility for cutting up a superjob into real jobs that can actually be run. There will be lots of parallelism available.
We have an Ada implementation, in which every operation is an Ada task, with Fortran compute kernels. At present, until we get good vectorizing compilers for Ada, we will remain with Fortran and C compute kernels; that gives us effectiveness on the 20 per cent of the operations that really are the squeaky wheels. We have run this on a CONVEX Computer Corporation Ada system. CONVEX, right now, is the only company we have found to have a true multiprocessing Ada system; that is, you can actually run multiple processors on the CONVEX Ada system and get true multiprocessing. We got linear speedup when we ran on the CONVEX system, so we know that this thing is going to work: on a four-processor CONVEX system it ran almost four times as fast (something like 3.96 times) as it did with a single processor.
This system is designed to run on workstations, on Crays, and on everything in between. There has been a very recent announcement of an Ada compiler for the nCUBE 2, which is cheering to us because we did not know how we were going to port this thing to the nCUBE 2. Of course, I still do not know how we will port to any other parallel environment unless some kind of Ada capability is developed for it.