A Brief History of Supercomputers
Computing originally comprised centralized systems to which you submitted your deck of punched cards constituting your code and data; then, you waited for your output, which was likely printed by a huge dot matrix printer on wide green and beige paper. Everyone took turns submitting their applications and data, which were run one after the other – first in, first out (FIFO). When your code and data were read in, you did not need to be there; however, you often had to stick around to submit your card deck when the system was available.
In the late 1980s and into the 1990s, you likely sat at a terminal where you entered your programs, which were saved on mass storage devices. Often, dedicated front-end systems accommodated the users. Anyone who wanted to use the system logged in to this front-end system, created their code and data, and then submitted a “job” to run the application with the data. The job was really a script with information about the resources needed: how many CPUs, how much memory, how to run your application (the command), and so on. If the hardware resources could be found, your program was executed, and the results were returned to your account. Notice that you did not have to be logged in to the system for the job to be executed. The job scheduler did everything for you.
The job scheduler could launch multiple jobs to use as many of the resources as possible. It also kept a list of the next jobs to run. How this list was created was defined by policies that could be very sophisticated, so that the best use was made of the system resources.
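To make the idea concrete, here is a minimal sketch of what such a job script looks like on a modern cluster running the Slurm scheduler (the batch systems of that era were different, and the application and file names here are just placeholders):

```bash
#!/bin/bash
#SBATCH --job-name=my_sim        # a label for the job (placeholder name)
#SBATCH --nodes=2                # how many nodes are needed
#SBATCH --ntasks-per-node=16     # how many CPU cores (tasks) per node
#SBATCH --mem=32G                # how much memory per node
#SBATCH --time=04:00:00          # maximum wall-clock time for the job
#SBATCH --output=my_sim.%j.out   # where the job's output lands

# The command that actually runs the application with its data;
# "solver" and "input.dat" are hypothetical.
srun ./solver input.dat
```

The scheduler reads the directives at the top, waits until the requested resources are free, runs the command, and writes the output file to your account – no need to be logged in while it runs.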
Early high-performance computing (HPC) systems were large centralized resources shared by everyone who wanted to use them. You had to wait in line for the resources you needed. Moreover, the resources were not interactive, making it more difficult to write code. Your applications were run, and the results were returned to you. In other words, your work was at the mercy of what everyone else was running and the resources available. Preprocessing or postprocessing your data on these systems was also impossible. For researchers, engineers, scientists, and other HPC users, this workflow was a mess. You had spurts of activity and lots of downtime. During college, we used to make basketball goals from punched cards and small basketballs from old printouts and tape and shoot baskets while waiting for our code to execute.
Before and during grad school, I used the centralized university supercomputers, starting in the 1980s with the university’s CDC 6500. Then I moved to the Cyber 205 and ETA 10. Initially, I got almost unlimited processing time because people had not really “discovered” them yet. I learned vector processing on the Cyber 205, which was my first exposure to parallelism (vector parallelism). While I was working on my research, though, the HPC systems were finally discovered by everyone else at the university, so they became heavily used, resulting in the dreaded centralized and tightly controlled resources. Very quickly I lost almost all my time on the systems. What I did to finish my graduate research is a story for another day. Let me just say: thank you, Sun workstations.
The point of this long-winded introduction is that researchers increasingly needed HPC resources, especially as applications demanded more compute cycles, memory, storage, and interactivity. Although the HPC systems of the time had a great deal of horsepower, they could not meet all of the demands. Moreover, they were very expensive, so they became tightly controlled, centralized, shared resources. This inhibited the growth of computational research – the proverbial 10 pounds of flour in a five-pound bag. However, in this case, it was more like 50 pounds of flour. The large centralized HPC resources were still needed for larger scale applications, but what was really needed were smaller HPC systems controlled by the researchers themselves. Desktop supercomputing, if you will.
My interest in desktop supercomputers (also called desktop clusters) accelerated with Beowulf clusters, where anything seemed possible. My first desktop cluster was at Lockheed Martin, where I gathered up desktop PCs that had been tagged for recycling and made a cluster in my cubicle. It was not exactly on my desktop, but it was close enough. Having a system physically on your desktop, or right beside it, where you can write code; create and test machine learning (ML) and deep learning (DL) models; do visual pre- and postprocessing; and have all the computational power at your fingertips without having to share it with hundreds or thousands of others, is very appealing. Of course, it is not a centralized supercomputer where you can scale to huge numbers of processors or extreme amounts of memory, but you can do a massive amount of computing right there. Plus, you can get your applications ready for the centralized supercomputer on your desktop supercomputer.
From my perspective, four things are driving, and have driven, the case for desktop supercomputers:
- Commodity processors and networking
- Open source software
- Linux
- Beowulf clusters
This article focuses on a bit of the history of supercomputer processors, PC processors, and commodity networking. I think the rise of commodity processors and networking was a huge contributor to desktop supercomputing, so it is worth revisiting the history of processors and networking through the 1990s.
Early Supercomputers
Supercomputers in the mid-1980s to early 1990s were dominated by Cray. In 1988, Cray Research introduced the Cray Y-MP, which had up to eight 32-bit vector processors running at 167MHz. It had options for 128, 256, or 512MB of SRAM main memory and was the first supercomputer to sustain greater than 1GFLOPS (10^9 floating point operations per second).
That supercomputer was expensive. A predecessor to the Y-MP, the X-MP, sold to a nuclear research center in West Germany for $11.4 million in 1981, or $32.6 million in 2020 dollars (see The Supermen by Charles J. Murray, Wiley, 1997, p. 174).
Although Cray may have dominated the supercomputing industry coming into the 1990s, they were not alone. NEC had a line of vector supercomputers named SX. The first two NEC models, the SX-1 and SX-2, were launched in 1985. Both systems had up to 256MB of main memory. The SX-2 was reportedly the first supercomputer to exceed 1GFLOPS. It had four sets of high-performance vector operation pipelines with a maximum of 16 arithmetic units capable of multiple/parallel operation. The NEC SX-1 had about half the performance of the SX-2 and was presumably less expensive.
Around this time, a number of massively parallel computers came out from companies such as Thinking Machines (think Jurassic Park), nCUBE, Meiko Scientific, Kendall Square Research (KSR), and MasPar. Some of these companies continued selling systems into the early 1990s. The systems used a range of ideas and technologies to achieve high performance. Thinking Machines used 65,536 one-bit processors in a hypercube, later adding Weitek 3132 floating-point units (FPUs) and even RAID storage. The final Thinking Machines system, the CM-5E, used Sun SuperSPARC processors. Meiko Scientific Ltd. used transputers and focused on parallel computing. Its systems started with the 32-bit INMOS T414 transputers in 1986; later, the company switched to Sun SuperSPARC and hyperSPARC processors.
Both Thinking Machines and Meiko survived into the 1990s, and other companies such as KSR and MasPar sold systems into the early 1990s. These companies were especially important to the future of HPC because they showed that using large numbers of processors in a distributed architecture could achieve great performance. They also illustrated that getting to this performance level required a great deal of coding, so software took on a much more important role than before.
No longer did you have to rely on faster processors to get better performance by recompiling your code or making a few minor tweaks. Now, you could use lots of simple computing elements combined with lots of software development to achieve great performance. There was more than one path to efficient HPC performance.
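To give a flavor of that software effort, the following is a minimal sketch of distributed-memory parallelism using MPI, which was standardized in 1994 and is roughly contemporary with these machines (the example is generic and not tied to any particular system of the era): each process computes a partial sum over its own slice of the work, and the partial sums are then combined.

```c
/* Minimal MPI sketch: sum the numbers 0..N-1 across many processes.
 * Build with:  mpicc -o psum psum.c
 * Run with:    mpirun -np 8 ./psum
 */
#include <mpi.h>
#include <stdio.h>

#define N 1000000L   /* total amount of work (illustrative only) */

int main(int argc, char **argv)
{
    int rank, nprocs;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?  */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs); /* how many processes?  */

    /* Each process takes its own slice of the index range. */
    long chunk = N / nprocs;
    long start = rank * chunk;
    long end   = (rank == nprocs - 1) ? N : start + chunk;

    double local_sum = 0.0;
    for (long i = start; i < end; i++)
        local_sum += (double)i;             /* stand-in for real work */

    /* Combine the partial sums onto rank 0. */
    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
               0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %.0f\n", global_sum);

    MPI_Finalize();
    return 0;
}
```

The point is not the arithmetic but the pattern: the programmer, not the compiler, decides how the work and the data are split across processing elements, which is exactly the kind of coding effort these massively parallel systems demanded.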
The Intel-based PC processors of those early years were not as advanced as supercomputer processors. However, supercomputers were being challenged by massively parallel systems, and many of those manufacturers initially created their own processors, although almost all eventually switched to workstation processors. This switch was driven by cost: developing new, very parallel systems, as well as new processors, is very expensive. As you will see, this had some effect on HPC system trends in the 1990s. For now, PC CPUs were not in the HPC class, and any systems built from them would not have been scalable because PC networking really did not exist.
Supercomputer Processors in the Early 1990s
Cray entered the 1990s with the Cray Y-MP, which had up to eight vector processors running at 167MHz, a much higher clock speed than PC CPUs. However, like the i486, it used 32-bit processors, limiting the addressable memory to 4GB. In the early 1990s, Cray launched some new systems. In 1991, the Cray C90, a development of the Cray Y-MP, was launched. It had a dual vector pipeline, whereas the Y-MP had only a single pipeline. The clock speed was also increased to 244MHz, resulting in three times the performance of the Y-MP. The maximum number of processors increased from eight to 16. Note that these processors were still designed and used only by Cray.
Cray’s last major new vector processing system, the T90, first shipped in 1995. The processors were an evolution of those in the C90 but with a much higher clock speed of 450MHz. The number of processors also doubled to 32. The system was not inexpensive: a 32-processor T932 cost $35 million ($59.76 million in 2020 dollars).
Cray launched the T3D in 1994. This was an important system for Cray because it was their first massively parallel supercomputer. It used a 3D torus network to connect all the processing elements, hence the name T3D. It integrated from 32 to 2,048 processing elements (PEs), where each PE was a 64-bit DEC Alpha 21064 RISC chip. Each PE had its own memory area, memory controller, and prefetch queue. The PEs were grouped in pairs, or nodes, of six chips.
The T3D had distributed memory – each PE had its own memory – but it was all globally addressable, with a maximum of 8GB of memory in total. The T3D used a “front-end” system to provide functionality such as I/O. Examples of front-end systems were the Cray C90 and Y-MP.
The T3D was something of a sea change for Cray. First, it moved away from vector CPUs and focused more on massively parallel systems (i.e., lots of processing elements). Second, it shifted from Cray-designed processors to those from another company – in this case, DEC. Whether this was a move to reduce costs or to improve performance is known only to Cray, but it appears to have been an effort to reduce costs. Third, it moved to a 3D torus network.
The Cray T3E was a follow-on to the T3D, keeping the massively parallel architecture. Launched in late 1996, it continued to use the 3D torus network of the T3D but switched to the DEC Alpha 21164 processor. The initial processor speed was 300MHz; later versions ran at 450, 600, and even 675MHz. Similar to the T3D, the T3E could scale from 8 to 2,176 PEs, and each PE had between 64MB and 2GB of memory. The T3D and the T3E were arguably highly successful systems. A 1,480-processor system was the first system on the TOP500 to top 1TFLOPS (10^12 FLOPS) running a scientific application.
Cray did not just develop multiprocessor systems with DEC Alpha processors; it also continued the development of vector systems based on the Cray Y-MP. Recall that the Y-MP was launched in 1988 with up to eight 32-bit vector processors running at 167MHz. The Cray J90 was developed from the Cray Y-MP EL (entry level) model, in the hope of developing a less expensive version of the Y-MP by using air cooling. The Y-MP EL supported up to four processors and 32MB of DRAM memory. The J90 supported up to 32 vector processors at 100MHz and up to 4GB of main memory. Each processor comprised two chips: one for the scalar portion of the architecture and one for the vector portion.
NEC was also launching new SX systems in the early 1990s. In 1990, it launched the SX-3, which allowed parallel computing, permitting both SIMD (single instruction, multiple data) and MIMD (multiple instruction, multiple data) operations. It had up to four arithmetic processors sharing the same main memory. The NEC SX-4 system was announced in 1994 and first shipped in 1995. It arranged several CPUs into a parallel vector processing node; these nodes were then installed in a regular symmetric multiprocessing (SMP) arrangement.