Five HPC Pitfalls (Part 1)

Success in high-performance computing (HPC) is often difficult to measure. Ultimately it depends on your goals and budget. Many casual practitioners assume that a good HPL (High Performance Linpack) benchmark result and a few aisles of servers demonstrate a successful HPC installation. However, this notion could not be further from the truth, unless your goal is to build a multimillion-dollar HPL machine. In reality, a successful and productive HPC effort requires serious planning and design before any hardware is purchased. Indeed, the entire process requires integration skills that often extend far beyond those of a typical data center administrator.

The multivendor nature of today’s HPC market requires the practitioner to make many key decisions. In this article, I outline several common pitfalls and how they might be avoided. One way to navigate these pitfalls is to consider a partnership with an experienced third-party HPC integrator or consultant. An experienced HPC integrator/consultant has a practical understanding of current technologies (i.e., what works and what does not), the ability to both listen and execute, and, most importantly, experience delivering production HPC (i.e., HPC that produces actionable results for something other than HPC research or education).

Introduction

The current state of HPC is such that customers are now responsible for decisions and tasks previously handled by large supercomputer companies. Although you can still buy a turnkey system from a few large vendors, the majority of the market is now based on multisourced commodity hardware and openly available software. Transferring the responsibility for vendor selection and integration to the customer has significantly reduced the cost of HPC systems, but it has also introduced potential pitfalls that could result in extra or hidden costs, reduced productivity, and aggravation.

In today’s market, purchasing an HPC cluster is similar to buying low-cost self-assembled furniture from several different companies. The pile of flat-pack boxes that arrives at your home is often a far cry from the professionally assembled models in the various showrooms. You have saved money, but you will be spending time deciphering instructions, hunting down tools, and undoing missteps before you can enjoy your new furniture. Indeed, should you not have the mechanical inclination to put bolt A16 into offset hole 22, your new furniture might never live up to your expectations. Integrating the furniture into an existing room can also be a challenge: if you forgot that the new surround-sound system requires a large number of cables to be routed through your new media center, it is time for some customization.

A typical HPC cluster procurement is a similar experience. Multiple components, each from a different vendor, can introduce hidden integration and support costs (and time) that are not part of the raw hardware purchase price. These additional costs are often due to the pitfalls described below and frequently come as a surprise to the customer.

Five HPC Pitfalls To Avoid

This list is by no means an exhaustive collection of potential HPC pitfalls, nor is it intended to imply that all customers have similar experiences. Experience and ability vary greatly among the many commodity hardware and solution providers. In a post–single-vendor supercomputer market, the largest issue facing customers is how to manage multiple vendor relationships successfully. In addition to warning about potential issues, the following scenarios should also help set reasonable customer expectations when working within the commodity HPC market.

Pitfall Number One: Popular Benchmarks Tell It All

Benchmarking is at the heart of the HPC purchasing process because price-to-performance drives many sales decisions. But the devil, as they say, is in the details when it comes to benchmarking. There are many popular and well-understood benchmarks for HPC, but unless a popular benchmark is similar to or the same as the applications you intend to run on your cluster, its actual usefulness can vary.

Perhaps the best-known benchmark is HPL (High Performance Linpack) because it is used to compile the twice-yearly list of the 500 fastest computers in the world (known popularly as the “Top500”). These rankings are useful in many contexts but offer little practical guidance for most users. Unless a user plans on running applications that perform the same type of numerical algorithm used in the HPL benchmark (dense linear algebra), using HPL as the final arbiter of price-to-performance could lead to disappointment. Additionally, International Data Corporation (IDC) has reported that 57% of all HPC applications/users surveyed use 32 processors (cores) or fewer. Most Top500 HPC runs involve thousands of cores, so using Top500 results as a yardstick for applications requiring 16 or 32 cores makes little sense. Other popular benchmarks present the same problem. In reality, benchmarking your own applications is the best way to evaluate hardware. Indeed, if you run a popular code, many vendors might already have benchmark results that can provide you with a baseline.
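To make the point concrete, the following sketch shows one simple way to time your own application at the core counts you actually expect to use. It is a rough illustration only: the mpirun launcher, the my_app binary, and its input file are placeholders for whatever your site actually runs, and a real evaluation would repeat runs and compare candidate systems side by side.

```python
import subprocess
import time

# Placeholder for your own MPI application and its input deck.
APP = ["./my_app", "input.dat"]

# Core counts chosen to match expected production use
# (most surveyed workloads run at 32 cores or fewer).
for cores in (4, 8, 16, 32):
    cmd = ["mpirun", "-np", str(cores)] + APP
    start = time.time()
    subprocess.run(cmd, check=True)      # run the application to completion
    elapsed = time.time() - start
    print(f"{cores:4d} cores: {elapsed:8.1f} s wall clock")
```

A table of wall-clock times like this, produced on each candidate system with your own workload, says far more about price-to-performance than any published ranking.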

There is definitely an advantage to running the HPL benchmark on a newly delivered machine, or to knowing that a vendor can deliver machines that run this benchmark well. The HPL benchmark can be used to stress the entire system in such a way that it uncovers hidden issues. Additionally, the approximate HPL results for standard HPC hardware are well known and publicly available (Top500.org is a good place to look). So if a newly installed cluster is not producing an HPL result in the ballpark of similar machines, it is a good indication that something is amiss.
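As a back-of-the-envelope illustration of this sanity check, the sketch below compares a measured HPL result (Rmax) against the theoretical peak (Rpeak) computed from the node specifications. All of the figures are invented placeholders; substitute the numbers for your own hardware and compare the resulting efficiency with published results for similar systems.

```python
# Hypothetical cluster specification (replace with your own).
nodes            = 16
sockets_per_node = 2
cores_per_socket = 8
clock_ghz        = 2.6
flops_per_cycle  = 8        # double-precision FLOPs per core per cycle (architecture dependent)

# Theoretical peak in GFLOPS.
rpeak = nodes * sockets_per_node * cores_per_socket * clock_ghz * flops_per_cycle

# Hypothetical measured HPL result in GFLOPS.
rmax = 4000.0

print(f"Rpeak = {rpeak:.0f} GFLOPS")
print(f"Rmax  = {rmax:.0f} GFLOPS")
print(f"HPL efficiency = {rmax / rpeak:.0%}")
# If the efficiency is far below that of comparable machines on Top500.org,
# something is likely amiss with the hardware or the software stack.
```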

Another benchmark measure to consider is that of GP-GPUs (General-Purpose Graphics Processing Units). The bulk of these devices are sold in the consumer market as high-end video cards, and the “general purpose” part of the GPU makes them a very powerful parallel processing platform. GP-GPU benchmark numbers are usually quoted in terms of speed-up (e.g., a 25x speed-up). Unfortunately, no standard baseline exists against which to measure the speed-up, and these results are often for single-precision performance (lower quality). A re-programming cost is also associated with these devices. As with HPL, if a reported application speed-up is similar to your own requirements, a GP-GPU can be a big win, but be careful with assumptions. This market is changing rapidly, and attention to detail will allow you to assess the hardware and software properly.
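The following sketch illustrates why the baseline matters; the timings are hypothetical placeholders, not measurements from any real device.

```python
# Hypothetical timings for the same problem (seconds).
cpu_one_core  = 500.0    # baseline A: a single CPU core
cpu_full_node = 40.0     # baseline B: a fully used multicore node
gpu           = 20.0     # GP-GPU run (note whether single or double precision)

print(f"Speed-up vs. one CPU core: {cpu_one_core / gpu:.0f}x")
print(f"Speed-up vs. a full node:  {cpu_full_node / gpu:.1f}x")
# The same GPU run can be advertised as "25x" or "2x" depending on the
# baseline, and a single-precision result may not carry over to your
# double-precision application.
```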

Other design issues include the type of interconnect (InfiniBand or Ethernet) and the processor family. The choice of interconnect should be determined by your performance needs (your application benchmarks), your budget, and local integration constraints, not by low-level benchmarks. In terms of processors, both Intel and AMD support x86 software compatibility, but the processor and memory architecture of each is very different and can affect performance. Benchmarks are really the only way to determine which of these design choices is best for your needs and budget.

Another important issue to consider when evaluating benchmarks is the overemphasis on price-to-performance (commonly reported as dollars per GFLOPS, or giga floating-point operations per second). When calculating this number, many practitioners use the raw hardware cost and the HPL benchmark result. In the context of current technology trends and costs, this is a valid number; however, in terms of real costs, it can be misleading. A better metric to consider is the Total Cost of Ownership (TCO), which covers a multiyear period rather than a one-time acquisition cost. In contrast to price-to-performance, TCO is more difficult to calculate because it requires more data. Surprisingly, over a three-year term, the TCO can exceed the initial cost of the hardware once integration, infrastructure (power and cooling), maintenance, and personnel costs are included. These costs are described in more detail in the “Understanding Costs” section in Part 2.
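A rough sketch of the difference between the two metrics follows. Every figure is a made-up placeholder for illustration; plug in your own quotes, facility rates, and staffing estimates.

```python
# One-time costs (hypothetical).
hardware    = 500_000.0     # raw hardware purchase price
integration = 50_000.0      # delivery, assembly, and software integration

# Recurring annual costs (hypothetical).
power_cooling = 60_000.0
maintenance   = 25_000.0
personnel     = 90_000.0

years      = 3
hpl_gflops = 5_000.0        # hypothetical measured HPL result

tco = hardware + integration + years * (power_cooling + maintenance + personnel)

print(f"Raw price-to-performance: ${hardware / hpl_gflops:,.0f} per GFLOPS")
print(f"{years}-year TCO:             ${tco:,.0f}")
print(f"TCO per GFLOPS:           ${tco / hpl_gflops:,.0f}")
# In this example the non-hardware costs over three years ($575,000)
# exceed the hardware purchase price itself.
```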

Pitfall Number Two: All Commodity Hardware Is the Same

At some level, all commodity hardware is the same; otherwise, it would not be labeled “commodity.” This assumption, however, entices customers to build clusters from specification sheets, seek low-ball bids, and skip testing the actual hardware, all of which invites serious and costly problems. HPC pushes hardware harder than any other industry segment. Subtle differences that do not matter in other industries can create issues in terms of Reliability, Availability, and Serviceability (RAS) in the HPC sector.

Because clusters are built from multisourced hardware components, the long-term RAS requirement can be difficult to estimate. Choosing the wrong component can result in poor performance, increased downtime, and, in the worst case, an unfixable problem. This situation can also result from buying technology that is “too new” and immature. In some cases, new motherboards have demonstrated stability issues that the vendor never addressed in a particular revision, presumably because they did not affect a large population of customers. In other situations, case designs did not supply adequate cooling for 24/7 HPC use.

Buying all the hardware from the same vendor might avoid some of these issues because vendors usually perform interoperability tests, but a single hardware vendor might limit your choices in terms of available options. Additionally, no single vendor manufactures every component, so none can provide the vertical depth of support that the component Original Equipment Manufacturers (OEMs) can. A good example of this is InfiniBand (IB). Some cluster nodes might use an IB HCA (a PCIe card), whereas others might have IB integrated on the motherboard. Ultimately, deep support issues must be directed back to the InfiniBand OEM.

Support is yet another highly variable issue in the commodity realm. Most companies will support their own hardware, but when it comes to integration with other systems, they find it difficult to offer any kind of support because they have no control over the foreign hardware. Standards help this process and, in theory, allow interoperability, but they do not guarantee performance. For instance, an NFS server is available from many vendors in the form of a Network-Attached Storage (NAS) appliance. Each appliance will adhere to the NFS protocol, but poor performance can greatly limit the true potential of a cluster. A similar situation exists for network/interconnect components as well.
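As a crude illustration of “compliant but slow,” the sketch below measures sequential write throughput to an NFS mount. The mount path is a placeholder, and a dedicated tool such as IOzone or fio will give far better data; the point is simply that protocol compliance and delivered bandwidth are separate questions.

```python
import os
import time

MOUNT   = "/mnt/nfs_test"          # hypothetical NFS mount point
SIZE_MB = 1024                     # amount of data to write
CHUNK   = b"\0" * (1 << 20)        # 1 MiB buffer

path = os.path.join(MOUNT, "throughput_probe.dat")

start = time.time()
with open(path, "wb") as f:
    for _ in range(SIZE_MB):
        f.write(CHUNK)
    f.flush()
    os.fsync(f.fileno())           # make sure the data reaches the server
elapsed = time.time() - start

print(f"Sequential write: {SIZE_MB / elapsed:.0f} MB/s")
os.remove(path)
# Repeat from several nodes at once; an appliance that looks fine from one
# client can collapse under the concurrent load a cluster generates.
```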

As is typical in many software/hardware situations, vendors often play the “blame game” when asked to support other vendors’ hardware and software. Open software exacerbates the problem because the actual software stack typically varies from cluster to cluster. Thus, a vendor will often point to another vendor’s software or hardware as the root of a problem, and that vendor will do the same in return, leaving the user stuck in the middle. Getting a clear delineation of vendor responsibilities is important. If you suspect a problem, try to reproduce it outside of the cluster or, at a minimum, collect clear data that points to the source of the problem.

In the second part of this article, I will address open software, system integration, and storage.
