Project THOTH:
An NSA Adventure in Supercomputing, 1984–88
Larry Tarbell
Lawrence C. Tarbell, Jr., is Chief of the Office of Computer and Processing Technology in the Research Group at the National Security Agency (NSA). His office is charged with developing new hardware and software technology in areas such as parallel processing, optical processing, recording, natural language processing, neural nets, networks, workstations, and database-management systems and then transferring that technology into operational use. For the last five years, he has had a special interest in bringing parallel processing technology into NSA for research and eventually operational use. He served as project manager for Project THOTH at NSA for over two years. He is the chief NSA technical advisor for the Supercomputing Research Center.
Mr. Tarbell holds a B.S. in electrical engineering from Louisiana State University and an M.S. in the same field from the University of Maryland. He has been in the Research Group at the NSA for 25 years.
I will discuss an adventure that the National Security Agency (NSA) undertook from 1984 to 1988, called Project THOTH. The project was named THOTH because the person who got to choose the name liked Egyptian gods. There is no other meaning to the name.
NSA's goal for Project THOTH was to design and build a high-speed, high-capacity, easy-to-use, almost-general-purpose supercomputer. Our performance goal, which we were not really sure we could meet but really wanted to shoot for, was a system with 1000 times the performance of a Cray Research CRAY-1S on the kinds of problems we had in mind. The implication of that, which I do not think we really realized at the time, was that we were trying to skip a generation or so in the development of supercomputers. We really hoped that, at the end, THOTH would be a general-purpose, commercial system.
The high-level requirements we set out to achieve were performance in the range of 50 to 100 × 10^9 floating-point operations per second (GFLOPS), with lots of memory, a functionally complete instruction set (because we needed to work with integers, with bits, and with floating-point numbers), and state-of-the-art software. We were asking an awful lot! In addition, we wanted to put this machine into the operational environment we already had at NSA so that workstations and networks could have access to this system.
This was a joint project between the research group that I was in and the operations group that had the problems that needed to be solved. The operations organization wanted a very powerful system that would let them do research on the kinds of operational problems they had. They understood those problems, and they understood the algorithms they needed to use for them. We supposedly understood something about architecture and software and, more importantly, about how to deal with contracting, which turned out to be a big problem.
After a lot of thought, we came up with the project structure. To characterize the kinds of computations we wanted to do, the operations group developed 24 sample problems. We realized our samples were not complete, but they were representative. In addition, we asked ourselves to imagine that there were 24 or 48 or 100 other kinds of problems that needed the same kind of computation.
These problems were not programs; they were just mathematical and word statements because we did not want to bias anybody by using old code. We were going to use these problems to try to evaluate the performance of the architectures we hoped would come out of THOTH. So they were as much a predictor of success as they were a way to characterize what we needed to do.
We ended up deciding to approach the project in three phases. The first phase was architectural studies, where we tried to look at a broad range of options at a really high level and then tried to choose several possibilities to go into a detailed design. In the second phase, detailed design, we hoped at the end to be able to choose one of the detailed designs and actually build something. The third phase was the implementation of THOTH. We probably deluded ourselves a bit by saying that what we were trying to do was not to develop a new supercomputer; we were just trying to hasten the advent of something that somebody was already working on. In the end, that was not true.
We chose contractors using three criteria. First, they had to be in the business of delivering complete systems with hardware and software. Second, they had to have an active and ongoing R&D program in parallel processing and supercomputing. Third, we wanted them to have at least one existing architecture already under development because we thought that, if they were already involved in architecture development, we would have a much better chance of success.
We had envisioned six such companies participating in Phase 1. We hoped to pick the two best from Phase 1 to go into Phase 2. Then we hoped at the end of Phase 2 to have two good alternatives to choose from, to pick the better of those, and to go with it. We also had some independent consulting contractors working with us who were trying to make sure that we had our feet on the ground.
Figure 1 shows a time line of what actually happened. The concept for this project started in 1984. It was not until late 1985 that we actually had a statement of work that would be presented to contractors. Following the usual government procedure, it was not until mid-1986 that we actually had some contracts under way. Thus, we spent a lot of time getting started, and while we were getting started, technology was moving. We were trying to keep up with the technology, but that was a hard thing to do.
At the beginning of 1985, we had as our first guess that we would get a machine in late 1991. By the time we got around to beginning the second phase, it was clear that the absolute best case for a machine delivery would be late in 1992, if then.
We ended up with nine companies invited to participate in Phase 1, seven of which chose to participate with us in a nine-month, low-cost effort. Phase 1 had three parts. The first part, the definition study, was for the companies to show us that they understood what we were after. They were supposed to analyze the problems, to show us the possibilities for parallelism in the problems, and to talk about how their potential architecture might work on those problems. In the second part, the architectural study, we asked them to formally assess how their favorite architecture performed against the THOTH problems that we had given them. In the third part, we actually paid them to produce a formal proposal for a high-level hardware/software system that would be designed in the second phase if we had enough belief in what they were trying to do. This took a lot more work on our part than we had initially envisioned. It essentially burned up the research organization involved in this; we did nothing else but this. A similar thing happened with the operations group and with some of the people that were supporting us.
We had some problems in Phase 1. I had not expected much personnel change in a nine-month phase, but in one company we had a corporate shakeup. They ended up with all new people and a totally new technical approach, and they waited until they were 40 percent into the project to do this. That essentially sank them. In several companies we had major personnel changes in midstream, and in nine months there was not time to catch up from changes in project managers.
Several companies had a lack of coordination because they had software people on one side of the country and hardware people on another side of the country, and once again that presented some problems. Even though most of them told us that they really understood our requirements, it turned out that they really had not fully understood them. I suppose we should not have been surprised. In one case, the division that was doing THOTH was sold. The project was then transferred from the division that was sold to another part of the company that actually built computers. The late submission of reports did not surprise us at all, but in nine months this caused us some problems. Also, there was one company that just could not pick out which way they wanted to go. They had two possible ways, and they never chose.
The first result in Phase 1 was that two companies dropped out before Phase 1 was over. We got five proposals at the end of Phase 1, but then one company subsequently withdrew its proposal, leaving us with only four. We ended up choosing three of those four, which surprised us. But when we sat down and looked at what was presented to us, it looked like there were three approaches that had some viability to them. Strangely enough, we had enough money to fund all three, so we did.
Then we moved into Phase 2, detailed design. Our goal was to produce competing detailed system designs from which to build THOTH, and we had three companies in the competition to produce a detailed design. That was what we thought would happen. They were to take the high-level design they had presented us at the end of Phase 1, refine it, and freeze it. We went through some detailed specifications, preliminary design reviews, and critical design reviews. Near the end of this one-year phase, they had to tell us that the performance they predicted in Phase 1 was still pretty much on target, and we had to believe them. This was a one-year, medium-cost effort, and once again, it cost us lots more in people than we had expected that it would.
Phase 2 had some problems. Within one company, about three months into Phase 2, the THOTH team was reorganized and moved from one part of the company to another. This might not have been so bad except that for three months they did not have any manager, which led to total chaos. One company spent too much time thinking about hardware at the expense of software, which was exactly the reverse of what they had done in Phase 1, and that really shocked us. That, nevertheless, is what happened. Another company teamed with a software firm, which was an interesting arrangement. Midway through, at a design review, the software firm stood up and quit right there. Another company had problems with their upper management. The lower-level people actually running the project had a pretty good sense of what to do, but upper management kept saying no, that is not the right way to do it; do it our way. Upper management had no concept of what was going on, so that did not work out very well either.
There was a lack of attention to packaging and cooling, there were high compiler risks and so forth, and the real killer was that this system began to look a lot more expensive than we had thought. We had sort of thought, since we were going to be just accelerating somebody else's development, that $30 or $40 million would be enough. That ended up not being the case at all.
Several results came from Phase 2. One company withdrew halfway through. After the costs began to look prohibitive, we finally announced to the two remaining contractors that the funding cap was $40 million, which they agreed to compete for. We received two proposals. One was acceptable, although just barely. In the end, NSA's upper management canceled Project THOTH. We did not go to Phase 3.
Why did we stop? After Phase 2, we really only had one competitor that we could even think about going with. We really had hoped to have two. The one competitor we had looked risky. Perhaps we should have taken that risk, but at that time budgets were starting to get tighter, and upper management just was not willing to accept a lot of risk. Had we gone to Phase 3, it would have doubled the number of people that NSA would have had to put into this—at a time when our personnel staff was being reduced.
Also, the cost of building the system was too high. Even though the contractor said they would build it for our figure, we knew they would not.
There was more technical risk than we wanted to have. The real killer was that when we were done, we would have had a one-of-a-kind machine that would be very difficult to program and that would not have met the goals of being a system for researchers to use. It might have done really well for a production problem where you could code up one problem and then run it forever, but it would not be suited as a mathematical research machine.
So, in the end, those are the reasons we canceled. If only one or two of those problems had come up, we might have gone ahead, but the confluence of all those problems sank the whole project.
As you can imagine from an experience like this, you learn some things. While we were working on THOTH, the Supercomputing Research Center (SRC) had a paper project under way, called Horizon, that began to wind down just about the time that THOTH was falling apart. So in April 1989, after the THOTH project had been canceled the previous December, we had a workshop to talk about what we had learned from both of these projects together. Some of our discoveries were as follows:
• We did get some algorithmic insights from the Phase 1 work, and even though we never built a THOTH, some of the things we learned in THOTH have gone back into some of the production problems that were the source of THOTH and have improved things somewhat.
• We believe that the contractors became more aware of NSA computing needs. Even those who dropped out began to add things we wanted to their architectures or software that they had not done before, and we would like to believe that it was because of being in THOTH.
• We pushed the architectures and technology too hard. It turns out that parallel software is generally weak, and the cost was much higher than either we or the contractors had estimated at the time.
• We learned that the companies we were dealing with could not do it on their own. They really needed to team and team early on to have a good chance to pull something like this off.
• We learned that the competition got in our way. We did this as a competitive thing all along because the government is supposed to promote competition. We could see problems with the three contractors in Phase 2 that, had we perhaps been able to say a word or two, might have gone away. But we were constrained from doing that. So we were in a situation where we could ask questions of all three contractors, and if they chose to find in those questions the thing we were trying to suggest, that was fine. And if not, we couldn't tell them.
Hindsight is always 20/20. If we look backward, we realize the project took too long to get started. Never mind the fact that maybe we should not have started, although I think we should have, and I think we did some good. However, it took too long to get moving.
While we were involved in the project, especially in software, the technology explosion overtook us. When we started this system, we were not talking about a UNIX-like operating system, C, and Fortran compilers. By the end, we were. So the target we were shooting for kept moving, and that did not help us.
NSA also was biased as we began the project. We had built our own machines in the past. We even built our own operating system and invented our own language. We thought that since we had done all that before, we could do it again. That really hid from us how much trouble THOTH would be to do.
Our standards for credibility ended up being too low. I don't know how we could have changed that, but at the end we could see that we had believed people more than we should have. Our requirements were too general at first and became more specific as we went along. Thus, the contractors, unfortunately, had a moving target.
We also learned from these particular systems that your favorite language may not run well, if at all, and that the single most important thing that could have been built from this project would have been a compiler—not hardware, not an interconnection network, but a good compiler. We also learned that parallel architectures can get really weird if you let some people just go at it.
In the end, success might have been achieved if we had had Company A design the system, had Company B build it, and had Company C provide the software because in teaming all three companies we would have had strength.
There is a legacy to THOTH, however. It has caused us to work more with SRC and elsewhere. We got more involved with the Defense Advanced Research Projects Agency during this undertaking, and we have continued that relationship to our advantage. The project also resulted in involvement with another company that came along as a limited partner and that had a system already well under way. There was another related follow-on project that came along after THOTH was canceled, but it relied too much on what we had done with THOTH, and it also fizzled out a little later on. Last, we now believe that we have better cooperation among NSA, SRC, and industry, and we hope that we will keep increasing that cooperation.