Living with Multiple and Many Cores
The commodity x86 HPC market is now a multi/many-core world. From the multicore CPU to the many-core GP-GPU, the number of cores at the user’s disposal has increased dramatically. The recent introduction of the Intel MIC (Xeon Phi) has added another layer of cores to the HPC mix. This change has forced a re-evaluation of HPC software and hardware and has even introduced some uncertainty in the minds of many HPC practitioners.
To understand the change and realize why it was inevitable, you need to look at some of the challenges facing modern server and desktop designs. In the past, the market enjoyed constant frequency bumps, wherein each new generation of single-core processor allowed software to run faster. It was a painless and effective upgrade enjoyed at all levels of the market. This design was particularly effective in HPC because clusters were composed of single- or dual-processor (one- or two-core) nodes and a high-speed interconnect. The Message-Passing Interface (MPI) mapped efficiently onto these designs.
The increase in frequency or “megahertz march,” as it was called, started to create problems when processors approached 4GHz (1GHz = 10^9Hz, or cycles per second). In simple terms, processors were getting too hot at these frequencies. The number of transistors that could be packed in a processor die continued to increase, however. The solution offered to the market was the multicore processor. Instead of one central processor on a chip, the extra transistors could be used to add processing units running at lower temperatures/frequencies. Today a four-core x86 processor is common and almost the minimum found on any new desktop or server hardware.
These additional cores created several challenges for the market. In terms of mainstream software, virtually all applications needed to be “reprogrammed” to make use of the extra cores. Creating the new “parallel” code could be done in several ways. Most developers used POSIX threads, OpenMP, parallel libraries, and even MPI. Application speed-up now comes from extra cores and not increased processor frequency. Many applications still use a single core and have no need for increased speed. Other applications could use an additional performance boost, but the intricacies of parallel programming have made conversion of these codes difficult or too costly at this time.
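The basic shape of such a conversion can be sketched in a few lines of Python: split the data into one slice per worker, let each worker compute a partial result, and combine the partials at the end. This is only an illustration of the decomposition pattern; the function names and the sum-of-squares workload are made up for the example, and because standard CPython threads share one interpreter lock, a CPU-bound loop like this one should not be expected to run faster here.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(n_items, n_workers):
    """Split range(n_items) into n_workers contiguous [start, stop) slices."""
    base, extra = divmod(n_items, n_workers)
    bounds = []
    start = 0
    for worker in range(n_workers):
        # The first `extra` workers take one additional item each.
        stop = start + base + (1 if worker < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def parallel_sum_of_squares(values, n_workers=4):
    """Sum x*x over values by giving each worker its own slice."""
    def partial(bound):
        start, stop = bound
        return sum(x * x for x in values[start:stop])

    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(partial, chunk_bounds(len(values), n_workers)))
    # Reduction step: combine the per-worker results into one answer.
    return sum(partials)
```

Calling parallel_sum_of_squares(list(range(1000)), n_workers=4) gives the same answer as the serial loop. The same split/compute/reduce structure is what an OpenMP parallel loop or an MPI reduction expresses, with the threading or message passing handled by the runtime.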
An additional problem with multicore has been the pressure on local memory. Central to all multicore designs is the use of a shared main memory pool. As the number of cores increases, so does the contention for main memory. To help solve this problem, memory is divided into independent channels or banks. Data is interleaved over these banks so that parallel access can occur faster than with a single bank. To further reduce memory contention, each multicore processor in a multiprocessor server is given its own exclusive memory domain. The domain is shared with the other local (onboard) processors through a high-speed point-to-point bus (e.g., AMD HyperTransport and Intel QuickPath). Even with all of these advances, there are still situations in which memory contention limits how effectively all the cores can be used.
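To see why interleaving helps, and when it does not, consider a toy model in which each cache-line-sized block of memory is assigned to a channel in rotation. The four channels, the 64-byte line, and the simple modulo mapping are assumptions for illustration; real memory controllers typically use more elaborate address mappings.

```python
def channel_for_address(address, n_channels=4, line_bytes=64):
    """Return the channel a byte address maps to under simple line interleaving."""
    return (address // line_bytes) % n_channels


def channel_load(addresses, n_channels=4, line_bytes=64):
    """Count how many of the given accesses land on each channel."""
    load = [0] * n_channels
    for address in addresses:
        load[channel_for_address(address, n_channels, line_bytes)] += 1
    return load


# Sequential sweep: consecutive lines rotate through all four channels.
sequential = [line * 64 for line in range(16)]
# Strided sweep: a stride of four lines maps every access to the same channel.
strided = [line * 64 * 4 for line in range(16)]
```

In this model, channel_load(sequential) spreads the 16 accesses evenly over the channels, whereas channel_load(strided) puts all 16 on a single channel. An unlucky access pattern can therefore lose most of the benefit of the extra channels.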
In HPC, multicore CPUs have been a mixed blessing. First, more cores in less space and at lower power is always welcome in HPC data centers. The software issue has not been much of a hurdle for many HPC applications because the programs have been written with MPI and are designed to work in parallel; however, plenty of applications still need to be parallelized.
In terms of memory contention, HPC has not fared as well. Many HPC applications require good memory bandwidth. The use of multicore nodes in HPC clusters has shown reduced scalability when parallel jobs are close packed (i.e., put on as few nodes as possible). Often the best performance comes from sparsely packed jobs (i.e., the parallel job is spread across many nodes). Actual results can vary because sparsely packed jobs usually share the multicore nodes with other user jobs that might or might not have compatible memory access patterns.
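The difference between the two packing strategies can be made concrete with a small model. The sketch below places the ranks of a parallel job on nodes either close packed or sparsely packed, then estimates each rank's share of node memory bandwidth. It assumes an even split among the job's own ranks and ignores other users' jobs; the function names and the bandwidth figure in the usage note are hypothetical.

```python
def place_ranks(n_ranks, n_nodes, cores_per_node, policy="close"):
    """Return a list giving the node index assigned to each rank.

    "close" fills one node completely before moving to the next;
    "sparse" deals ranks out round-robin across all nodes.
    """
    if n_ranks > n_nodes * cores_per_node:
        raise ValueError("not enough cores for the requested ranks")
    if policy == "close":
        return [rank // cores_per_node for rank in range(n_ranks)]
    if policy == "sparse":
        return [rank % n_nodes for rank in range(n_ranks)]
    raise ValueError("policy must be 'close' or 'sparse'")


def bandwidth_per_rank(placement, node_bandwidth_gbs):
    """Estimate each rank's share of its node's memory bandwidth.

    Assumes an even split among this job's ranks on the node and no
    competing jobs, which is optimistic on a shared cluster.
    """
    ranks_on_node = {}
    for node in placement:
        ranks_on_node[node] = ranks_on_node.get(node, 0) + 1
    return [node_bandwidth_gbs / ranks_on_node[node] for node in placement]
```

For eight ranks on four eight-core nodes with an assumed 40GB/s per node, the close-packed placement leaves each rank 5GB/s, whereas the sparse placement leaves each rank 20GB/s. The sparse figure is the best case, because the cores left idle on each node are usually handed to other jobs.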
Cores Across the Bus
As CPU core counts have continued to grow at a steady pace, the cores across the PCI bus have exploded. Around the time multicore designs started to make their way into the mainstream processors, the video processing companies NVidia and AMD/ATI began making their highly parallel (hundreds of small cores) GPU processors work more like general computing devices. That is, they could be used for much more than displaying fast graphics – with one catch: These new General Purpose GPUs (GP-GPUs) are good at certain types of parallel problems in which all the individual GPU cores perform the same operation on different pieces of data. This type of parallel processing is called “data parallel” or Single Instruction, Multiple Data (SIMD). GP-GPUs, although faster at data-parallel tasks, do not perform well when if-then-else computation is required. Thus, GP-GPUs are used as co-processors for standard processors, and the highly parallel portions are transferred to the GP-GPU.
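The penalty for if-then-else code follows from the lockstep execution model. A deliberately simplified way to picture it: the lanes in one SIMD group all step through the same instructions, so when some lanes take the "then" path and others the "else" path, the group works through both paths in turn while the inactive lanes wait. The cost function below is an illustrative model, not a description of any particular GPU.

```python
def simd_group_cost(takes_then, cost_then, cost_else):
    """Steps one SIMD group spends on an if/else, given each lane's branch choice.

    takes_then holds one boolean per lane. Lanes run in lockstep, so the
    group pays for every path that at least one lane takes.
    """
    cost = 0
    if any(takes_then):
        cost += cost_then
    if not all(takes_then):
        cost += cost_else
    return cost
```

With eight lanes and ten steps per path, a group whose lanes all agree costs ten steps, whereas a group with even a single dissenting lane costs twenty. Data-parallel code with uniform control flow stays in the first case; branch-heavy code drifts toward the second.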
GP-GPU computing has several issues that have limited its usage. First, all data must traverse the PCI bus. This is often a slow step for some applications and, if not managed well, can cause poor performance. The programming tools are largely CUDA (NVidia hardware) and OpenCL (NVidia, AMD GPUs, and x86 multicore). These tools often require extensive rewriting of existing code to take advantage of GP-GPUs. The resultant code is often considered “non-portable.” One promising solution to this problem is the use of OpenACC comment directives in existing Fortran and C/C++ code. The code remains portable and can be compiled to run on a standard CPU or a CPU/GP-GPU system.
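A back-of-the-envelope estimate shows how the bus can erase an accelerator's advantage. The sketch below adds the time to move the data across the bus to the accelerated compute time and compares the total with the CPU-only time. It assumes transfers and computation do not overlap, and every number passed in (data volume, speed-up, bus bandwidth) is a hypothetical input, not a measurement.

```python
def offload_time(n_bytes, cpu_seconds, gpu_speedup, bus_gbytes_per_s):
    """Estimated wall time when a kernel is offloaded across the bus.

    n_bytes counts all data moved in both directions; transfer and
    compute are assumed not to overlap.
    """
    transfer = n_bytes / (bus_gbytes_per_s * 1e9)
    compute = cpu_seconds / gpu_speedup
    return transfer + compute


def offload_pays_off(n_bytes, cpu_seconds, gpu_speedup, bus_gbytes_per_s):
    """True when the offloaded estimate beats running on the CPU alone."""
    estimate = offload_time(n_bytes, cpu_seconds, gpu_speedup, bus_gbytes_per_s)
    return estimate < cpu_seconds
```

Moving 8GB over an assumed 8GB/s bus costs one second. For a kernel that takes 10 seconds on the CPU and runs 10 times faster on the GP-GPU, the offloaded total is about two seconds and the transfer is worth it. For a kernel that takes only one second on the CPU, the same transfer makes the offloaded version slower than leaving the work where it was.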
Recently, Intel introduced their Many Integrated Core (MIC) or Xeon Phi co-processor. Whereas the Phi lives on the PCI bus and brings more cores to the table, the design is somewhat different from a GP-GPU. The current Phi has 60 general-purpose x86 cores, each coupled with a vector processor. The Phi is not a co-processor like the GP-GPUs but rather a fully functional processing unit. In terms of software, the Phi can be programmed using standard OpenMP, OpenCL, and updated versions of Intel’s Fortran, C++, and math libraries – that is, the same tools used to program the x86 multicore processors. Data must still travel across the PCI bus, but the volume depends on how the Phi is used.
Dumping the Bus
Whereas multicore has put more cycles (albeit parallel) in the processor, and GP-GPU or MIC have put processing powerhouses across the PCI bus, the search for a better solution continues. Some hints of things to come can be found in the AMD Accelerated Processing Unit (APU) or Fusion designs. Currently limited to desktop/laptop systems, the APU design integrates GP-GPUs directly on the CPU die. Although this might seem like an obvious idea, the integration requires some changes in both CPU and GPU designs. AMD has also stated that they expect all their future processors (both desktop and server) to be Fusion-based APUs. The benefits are:
- No need to transfer data across the PCI bus to and from the GP-GPU.
- No need for two memory regions (CPU vs. GPU).
- Tighter integration between CPU and GPU (i.e., sharing caches and power control).
Intel has not been sitting idle in the onboard graphics area. Their new Sandy/Ivy Bridge processors include integrated graphics as well. Benchmarks indicate that the onboard graphics are equivalent to a medium-sized graphics card. Like the AMD APUs, the Intel Sandy Bridge graphics share main memory and top-level cache with the x86 cores. Whether the onboard GPU hardware can be programmed by the user (presumably in OpenCL) is still unknown.
The fusion of CPU and GP-GPU makes sense from many perspectives. First, the amount of graphics processing, at least on the desktop/laptop, has continually increased in recent years. From video display and games to sophisticated rendering, users are asking for more and more graphics capabilities. Placing the GP-GPU directly on the CPU allows for better performance and lower cost because a dedicated GP-GPU graphics card will only be needed for high-end systems.
Additionally, as stated, the integrated GP-GPU can be used as a data-parallel computing device. In many types of non-graphic processing, a dedicated data-parallel processor can be useful. These include database, vision, signal processing (speech), cryptography, and a whole series of scientific applications. Interestingly, the new AMD processors, and possibly the Intel processors, will support OpenCL for general programming of the CPU/GP-GPU systems. Additionally, OpenACC support for these hybrid architectures is possibly the fastest path to adoption of this new hardware as an HPC platform.
The latest and greatest CPU/GP-GPU/MIC accelerator design will certainly bring more power to users but will further complicate software development. In addition to managing the many new cores, users must consider the trade-off between optimization and portability. Multi- or many-core HPC processing can take many forms. The classic cluster design (one node with one core operating on one memory domain) has given way to a number of options, including clusters with the following types of nodes:
- Multicore nodes (single memory domain).
- Multicore nodes with single/multiple GP-GPU (multiple memory domains).
- Multicore nodes with single/multiple Intel MIC (multiple memory domains).
Although these three hardware approaches might seem like a simple choice, there is much more at stake. In addition to the investment in hardware, a much larger investment could be in the software model(s) used for programming. The following list is just some of the possible HPC software approaches, many of which are hardware dependent.
- OpenMP – works on a single node or node with Intel MIC.
- MPI – works on a single node, multiple nodes, possibly on Intel MIC.
- MPI with OpenMP (hybrid) – works on a single node, multiple nodes, possibly on Intel MIC.
- CUDA – works on a single node with NVidia GP-GPU.
- OpenCL – works on a single CPU node with AMD/NVidia GP-GPU.
- MPI with CUDA (hybrid) – works on multiple nodes with NVidia GP-GPUs.
- MPI with OpenCL (hybrid) – works on multiple x86 nodes with AMD/NVidia GP-GPUs.
- OpenACC (Fortran/C) – works with a single GP-GPU.
- OpenACC (Fortran/C) with MPI – works with multiple nodes and with GP-GPUs.
Any software and hardware solution depends on your applications and your desire to wrestle with new programming models. Clearly, using the classic MPI model with basic multicore nodes is the most generic approach, although the issue of memory contention could become more of a problem as CPU core counts increase. As GP-GPUs or MICs are introduced to increase performance, software complexity can grow, and identifying the best software choice might take some due diligence. Living with multiple and many cores is not going to be easy, but you don’t have many other high-performance choices.