Top Three HPC Roadblocks
If you are a practitioner of HPC, you might ask, “Only three things?” Of course, there are more, but the problems I want to talk about here are what I judge to be some of the top issues facing the HPC market and community. Over the last 15 years, the HPC market has seen great growth because of low-cost commodity hardware and open software. These dramatic changes have expanded HPC to the point that a true HPC system can be had for several thousand dollars. This growth has not been without challenges, most of which are easily solved. Having been involved in the HPC market for close to 25 years, I find my frustration stems from the lack of progress on the following issues.
Solve The Last Mile Problem
The cable/Internet industry coined the phrase "the last mile" to describe the difficulty of making the final cable/phone/Internet connection to customers. The problem is real because, although each customer wants the same thing, in almost all cases the job becomes "custom" at some point. From the classic console television to the Windows 98 computer to the latest flat screen or tablet, a complete solution takes some know-how that the customer often does not have.
Companies have addressed this problem by creating clear demarcation points where their responsibility ends and the customer’s begins. The point can be a cable modem or a phone box, but one side is the user’s responsibility and the other is the service provider’s. In general, installers try to assist customers with their specific situations, which can vary from simple to complex; setting up wireless is one good example. The situation has also created a need for companies like the Best Buy Geek Squad, who offer integration services. Results vary, but at least people have someone to call other than the cousin who figured out how to get his Xbox working on the Internet.
HPC has a huge "last mile" problem because of several factors. Underestimating costs is a clear issue in HPC. Specifying how many cores, DIMMs, HDDs, HCAs, switches, and so on you can fit in your budget is a valid and necessary exercise, but the idea that buying a cluster is that simple is a misconception. When it comes to software, most of which is open source, the assumption is that it is "free" and should not affect the cost of the system. The administrative expense is usually handled by "existing" resources, which can be anything from a graduate student to a well-trained and certified Windows admin. In almost all cases, the biggest perceived expense is the hardware.
Top-tier vendors will offer various levels of software and integration support, but when customers find they might have to cut a third of their hardware to pay for top-tier support, they tend to look elsewhere for a solution. Lower tier vendors, who are working on thin margins, prefer to deliver and support only the hardware, and most do not have a software support staff to help customers beyond the initial install. Depending on the size of the institution and its charter, an organization might have staff that addresses these needs. In general, the national labs and large university computing sites have excellent support structures, and many of these organizations also contribute to the impressive collection of open HPC software. The last mile problem is largely present in smaller organizations, which have smaller budgets and fewer personnel resources.
As an example, consider software upgrades. In a typical case, a cluster is purchased fully functioning with some type of Red Hat-derived distribution augmented by cluster tools and libraries. Over time, the administrator makes changes and tweaks various items. Eventually, the cluster becomes heavily used, the hardware warranty expires, there is no software support, and users begin to request updated software. The upgrade requires a cascade of updates, and the administrator is often left to rebuild a custom cluster software infrastructure. The effort can take several weeks of installation and testing, which does not sit well with end users. Depending on the skill level of the administrator, the upgrade might not even work or could cause further headaches and delays.
Similar to the cable/Internet last mile problem, there is an economic opportunity in the HPC market. One provider, Bright Computing, has developed a turnkey cluster software solution that many administrators are finding useful, and some open cluster management (or provisioning) solutions allow for much easier cluster management (e.g., the Warewulf Project). A plethora of other issues face administrators and users as well, and each issue has an associated cost; these can include storage, expansion, training, workflow policies, hardware failures, and local integration, just to name a few.
The situation described above has not really changed much in the last 10 years. The focus on the latest and greatest hardware often dwarfs the attention paid to many of these issues. The net result is slow market growth, failed installations, bad experiences, administrative turmoil, and unexpected costs.
In particular, smaller organizations are more vulnerable to last mile issues, and users need to understand that a successful HPC program has costs beyond the hardware. Another need is for more education and training of both users and system administrators. This effort needs to come from the entire industry because the last mile involves many vendors. Like the cable/Internet industry, once the last mile problem is addressed, the HPC market can expand in many ways.
Refocus Performance Goals
The Top500 is a great historical resource, as well as a way for the "top" computer vendors and users to measure their progress. The problem with the Top500 is that a majority of users neither have access to nor require that level of computing; yet, they use it as a measure of overall HPC progress. Numerous surveys have gauged how many cores (in the past, CPUs) are needed to run HPC-specific applications. In a recent informal poll, 55% of the HPC respondents required 32 or fewer cores (presumably because of scaling issues). The term often used for users in this area is the missing middle. One needs to ask, “With a middle market of low-hanging fruit, for which fewer than 100 cores can have a huge economic effect, why does the HPC market focus on benchmark results that require tens of thousands of cores?”
Perhaps the belief that "press release" clusters will garner more business pushes the market toward the Top500. The other, more pragmatic, reason is that delivering a solution to the missing middle is more difficult than deploying racks of servers to achieve a spot on the Top500 list. In essence, the missing middle represents all that has been forgotten or passed over in the HPC industry. These issues include low-cost turnkey systems, real last-mile support, application porting, programming tools, and training – none of which are strong points in the current HPC ecosystem, but all of which contribute to successful production systems.
The absence of a missing middle infrastructure has stymied the growth of this sector. Addressing this audience (see the Council on Competitiveness) can have a huge effect on both the HPC market and the entire economy. Again, the solution lives across the industry and has less to do with peak FLOPS and more to do with effective FLOPS. Reducing the "barriers to effective FLOPS" benefits all HPC vendors, but no single vendor has taken on this role in the industry, nor should one. Just like the last mile problem, the problem of the missing middle spans the entire market. It should be mentioned that the HPC cloud might help with some of the issues mentioned above; however, the need to address the fundamentals remains unchanged.
Focus On HPC Software
HPC software is a hard problem. Although parallel computing is not new and has been part of computing since the beginning, the nature of HPC in particular makes programming even the smallest resource difficult. Even a quad-core desktop or laptop can present a formidable parallel programming challenge. Over their long history, parallel programming tools and languages seem to have been troubled by a lack of progress. Just as the first Fortran compilers hid the complexities of various processor details, the need now is for high-level parallel languages and tools that hide most of the low-level issues from the programmer. Many good researchers are working on this problem with some progress.
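As a rough illustration of the kind of abstraction that already exists, the sketch below uses a single OpenMP directive to parallelize a simple summation across the cores of a desktop machine. The array contents and sizes are made up for the example; real applications are rarely this forgiving.

```c
#include <stdio.h>
#include <omp.h>

#define N 1000000   /* illustrative problem size, not from the article */

/* Sum an array in parallel; one OpenMP directive hides thread creation,
 * work distribution, and the reduction across threads. */
int main(void)
{
    static double data[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        data[i] = 1.0;                     /* dummy values for the example */

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += data[i];

    printf("sum = %f using up to %d threads\n", sum, omp_get_max_threads());
    return 0;
}
```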
Of more concern is the approach hardware companies seem to take with software. Historically, software always follows hardware, but in the case of HPC, software seems to have been almost an afterthought (with a few exceptions). In typical HPC fashion, new hardware is launched with some kind of FLOPS rating (either theoretical or based on the Top500 benchmark), and the market cheers. This fanfare has been occurring since the first parallel computers were sold and continues through single-core clusters, multicore, GP-GPU, and fusion/hybrid approaches. The mantra seems to be: Look at these numbers, our hardware is great, now you figure out the software. This approach might have worked in part in other sectors of the computer market, but it has been a particularly hard nut to crack in the HPC market because of software complexities.
Vendors do offer "local solutions" that are not easily used outside of a particular hardware family. NVIDIA CUDA is a good example of this situation. CUDA is a great solution for NVIDIA hardware, but the application becomes locked to a particular type of hardware. The HPC market has been there and done that and is a bit shy of these approaches; it is a tough sell. Some vendors also make optimized versions of popular applications available. This approach is laudable and helps users; however, siloed programming approaches do not move the overall market forward.
For example, a user has many choices for expressing parallelism: MPI on a cluster, OpenMP on an SMP, CUDA/OpenCL on a GPU-assisted CPU, or any combination thereof. These choices have far-reaching economic and performance consequences, and commercial software providers wanting to address the missing middle face some tough decisions about software. Indeed, porting existing legacy code of any type to one of several "HPC programming models" can be an extremely expensive undertaking, as the sketch below suggests.
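To get a feel for the cost of moving between models, compare the OpenMP loop above with a rough MPI sketch of the same summation: the message-passing version restructures the whole program around ranks and explicit communication. Again, the names and sizes are illustrative assumptions, not a recommended implementation.

```c
#include <stdio.h>
#include <mpi.h>

#define N 1000000   /* same illustrative problem size as above */

/* The same summation expressed with MPI: each rank sums its own slice,
 * then MPI_Reduce combines the partial results on rank 0. */
int main(int argc, char *argv[])
{
    int rank, size;
    double local = 0.0, total = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int chunk = N / size;                  /* remainder ignored for brevity */
    int start = rank * chunk;
    for (int i = start; i < start + chunk; i++)
        local += 1.0;                      /* stand-in for real per-element work */

    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f across %d ranks\n", total, size);

    MPI_Finalize();
    return 0;
}
```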
In the cluster market, the parallel software problem is not easy, because the processor vendor has little knowledge of the interconnect between nodes or of the other performance aspects of a modern HPC cluster. Thus, parallel software development is in the hands of someone else. The integrator does not have the resources to solve this problem, nor do any of the myriad other vendors who contribute to the overall HPC system. In the end, the users or software vendors are responsible for porting applications, creating tools, or both. More and better tools sell more software, but that formula does not seem to work in the HPC market.
A Shared Solution
Because most of these issues fall beyond the scope of any one vendor, a solution needs to arise from across the market and community. Perhaps it is time for the community to create an independent organization that can start to address these and other issues holding back the HPC market. Such an organization would be funded by the vendors who stand to benefit from a more robust market. Interestingly, this situation calls for a collaborative open source model wherein each vendor "gives a little and gets a lot." I suppose my final pet peeve is that such an organization still does not exist.
The Author
Douglas Eadline, PhD, is both a practitioner and a chronicler of the Linux Cluster HPC revolution. He has worked with parallel computers since 1988, and he is a co-author of the original Beowulf How To document, a past Editor of ClusterWorld Magazine, and a previous Senior HPC Editor at Linux Magazine. You can contact him at: deadline at clustermonkey dot net.