High-performance computing (HPC) requires powerful servers running the fastest processors (typically GPUs, which are specialized graphics processing chips). But HPC also calls for a full stack of complementary technologies including software, high-performance storage, memory and file systems, high-speed networking (typically InfiniBand), and specialized management tools for scheduling and other tasks.
Although it can be complex, the right HPC implementation provides your enterprise with the computing capabilities necessary for high-intensity applications in many industries, especially those taking advantage of AI.
In this buyer’s guide
- HPC for AI explained
- Why enterprises need HPC for AI
- Major trends in HPC for AI
- Leading HPC for AI vendors
- What to ask before buying an HPC system for AI
- Essential reading
Why enterprises need HPC for AI
A growing number of specific use cases require HPC’s parallel processing capabilities (the ability to divide programs into separate chunks that can be run at the same time, for faster overall processing), particularly as organizations explore the game-changing benefits of AI in general and generative AI in particular.
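To make the idea of parallel processing concrete, here is a minimal Python sketch of the divide, process, and combine pattern. It is an illustration only, not how any vendor's HPC stack works: the function names are hypothetical, and it uses a thread pool from Python's standard library for simplicity. In standard CPython, threads do not actually speed up CPU-bound work like this; a real HPC system spreads the chunks across many processor cores, GPUs, or servers.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Divide a list into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks each take one additional item.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def sum_of_squares(chunk):
    """The work done independently on each chunk (hypothetical workload)."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, n_workers=4):
    """Split the input, process the chunks concurrently, then combine."""
    chunks = split_into_chunks(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_results = list(pool.map(sum_of_squares, chunks))
    # Combining the partial results gives the same answer as a serial run.
    return sum(partial_results)


if __name__ == "__main__":
    data = list(range(1_000))
    print(parallel_sum_of_squares(data))
```

The point of the sketch is the structure: because each chunk can be processed without knowing anything about the others, the work can be handed to as many workers as are available, which is what makes this kind of workload a fit for HPC.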
Applications that need the firepower provided by HPC include drug discovery and design in the pharmaceutical industry; genome analysis and clinical treatments in health care; automated trading in finance; real-time fraud detection and risk management in banking; and crash simulations and real-time data processing for autonomous driving features in the automotive industry.
In industries such as manufacturing, aerospace, oil and gas, energy, and education, HPC can enable engineers, scientists, and researchers to conduct simulations and modeling at a speed not possible with traditional servers.
The list goes on, and it will continue to grow as organizations discover new ways to gain business benefit from large language models (LLMs) and generative AI. In the generative AI scenario, HPC is required both to train the models and to generate real-time responses to queries.
Organizations considering HPC need to identify a clear use case: one that will either fix a vexing problem that is holding the business back or solve a complex computational problem in a way that provides a competitive advantage.
Major trends in HPC for AI
Once limited to government research labs and the largest enterprises, HPC is becoming more widely deployed. “The worldwide HPC server market is seeing strong growth driven by enterprises investing in HPC,” says Josephine Palencia, research director for high-performance computing at IDC. “The AI revolution of the past few years has paved the way for organizations gaining an understanding of the importance of performance-intensive computing for business success.”
These are the three major trends in HPC:
Leading HPC for AI vendors
HPC vendors fall into three categories: traditional server vendors (Dell, HPE, IBM, and Lenovo), the hyperscale cloud service providers (AWS, Azure, and Google Cloud), and a new generation of cloud-based, purpose-built HPC service providers. Here are snapshots of the leading vendors in each category.
Server vendors that sell HPC systems
Dell: Dell, which makes its own servers, storage equipment and networking gear, offers an integrated HPC platform that enterprises can purchase in a do-it-yourself, on-premises scenario.
Under the Apex banner, Dell also offers fully managed HPC delivered as a service, which can run in an enterprise data center or at a Dell-managed colocation facility. The Dell solution includes an HPC cluster manager, container orchestration, and job scheduler. Dell also offers preconfigured platforms tailored to specific industries. For example, there are Dell-validated designs for risk assessment in the financial services industry, for life sciences, and for manufacturing.
HPE: HPE, which acquired supercomputing vendors Cray and Silicon Graphics (SGI), offers HPC for enterprises in either a purchase option or a managed private cloud scenario under the GreenLake banner.
The HPC on-premises service (a component of GreenLake) provides flexibility, scalability, and control of HPC solutions with a pay-as-you-go consumption model. In this scenario, HPE staffers implement and operate the environment.
The full-stack GreenLake HPC-as-a-service solution includes high-bandwidth memory, integrated storage, HPE Slingshot high-speed interconnect, built-in acceleration for the fastest-growing workloads, density-optimized power and cooling, and the HPE Cray OS and Programming Environment.
IBM: IBM has an integrated full-stack offering for on-premises deployments that includes Power Systems servers, Spectrum Storage, and the Spectrum Computing Suite for High-Performance Analytics (HPA) that provides workload management and scheduling, data management, data analysis, dashboards, etc. IBM offers HPC-as-a-service in the IBM Cloud, and on-premises HPC customers can burst to the IBM cloud in a hybrid scenario.
Lenovo: Lenovo, which purchased IBM’s x86 server business in 2014, offers integrated HPC systems built around ThinkSystem servers and TruScale Infinite Storage. Lenovo also has an as-a-service offering for on-premises HPC. The pay-as-you-go model provides flexibility, scalability, and security for high-intensity workloads. Lenovo professional services can help enterprises assess their HPC needs, design the appropriate system, install it, and monitor it.
Hyperscalers that offer HPC infrastructure
The major cloud service providers offer HPC scalability, pay-as-you-go pricing, and fast deployment. And these hyperscalers are on the cutting edge of how to build high-performance systems.
The trade-off is that customers are ultimately responsible for decisions on the type and scale of infrastructure, as well as managing, monitoring, and securing the data. And cloud options can be more expensive when factoring in the costs of moving large data sets back and forth.
AWS: Amazon Web Services (AWS) offers an integrated HPC solution built on the latest chip and server technology, Amazon EC2 storage, Elastic Fabric Adapter for networking, the FSx file system, the AWS Nitro hypervisor, and AWS ParallelCluster for deployment and management of HPC clusters. AWS also works with partners, such as Cognizant, Rescale, and TotalCAE, that provide a managed service on top of the HPC implementation.
Azure: Microsoft’s Azure offers a cloud-based HPC platform with compute, networking, and storage resources integrated with workload orchestration services for HPC applications. Azure also offers machine-learning tools and software for building applications with predictive analysis. In addition to the IaaS infrastructure tuned for HPC, Azure also offers a dedicated, fully managed, single-tenant Cray XC or CS series supercomputer for HPC workloads.
Google Cloud: Similarly, Google offers HPC on its Google Cloud, providing customers with a variety of options for selecting CPUs from Intel, AMD, or Arm; GPUs from Nvidia; storage options including object, block, and file storage; toolkits; best practices blueprints; and preconfigured modules.
Purpose-built HPC clouds that target AI
Several startups have come along in the past few years looking to take advantage of the AI frenzy with cloud-based platforms designed from the ground up to support HPC. Two have separated themselves from the pack, generating huge buzz in the HPC community and raising large amounts of venture-capital investment.
Cerebras: Cerebras has built a supercomputing platform based on its own chips, which it claims are the largest processors ever built. Cerebras is selling its Wafer-Scale Engine for AI to the largest enterprises. But it is also offering a cloud-based service that enables any enterprise to rent space on its infrastructure, with technical assistance from the Cerebras team of AI scientists and engineers. The company is privately held, but Karl Freund, principal analyst at Cambrian-AI Research, says that with revenue and commitments approaching $1 billion, Cerebras has likely generated more business than all the other startups in its segment combined.
CoreWeave: CoreWeave is a specialized cloud service provider built specifically for large-scale GPU-accelerated workloads. CoreWeave, which uses Dell servers, claims its platform is significantly faster and cheaper than that of the hyperscalers. Nvidia is an investor in CoreWeave, an arrangement that gives the company a steady flow of Nvidia chips at a time when they are in short supply. In August 2023, CoreWeave raised $2.3 billion, and its valuation recently hit $7 billion after a $642 million investment from Fidelity Management and Research.
What to ask before buying an HPC system for AI
Because every enterprise is different and because HPC systems are both complex and varied in their capabilities, you need to get a clear grasp on your specific needs, capabilities, and resources before engaging prospective vendors and then choosing specific solutions.
10 key questions to ask yourself before buying HPC systems for AI
10 critical questions to ask HPC vendors
Essential reading
- High-performance computing: Do you need it?
- AI, machine learning, and deep learning: Everything you need to know
- What is generative AI? Artificial intelligence that creates
- Large language models: The foundations of generative AI
- What is edge computing and why does it matter?
- What is IoT? The internet of things explained
Source: Network World