Buyer’s guide: High-performance computing (HPC) for AI

High-performance computing (HPC) requires powerful servers running the fastest processors (typically GPUs, which are specialized graphics processing chips). But HPC also calls for a full stack of complementary technologies including software, high-performance storage, memory and file systems, high-speed networking (typically InfiniBand), and specialized management tools for scheduling and other tasks.

Although it can be complex, the right HPC implementation provides your enterprise with the computing capabilities necessary for high-intensity applications in many industries, especially those taking advantage of AI.

In this buyer’s guide

  • HPC for AI explained
  • Why enterprises need HPC for AI
  • Major trends in HPC for AI
  • Leading HPC for AI vendors
  • What to ask before buying an HPC system for AI
  • Essential reading

Why enterprises need HPC for AI

A growing number of specific use cases require HPC’s parallel processing capabilities (the ability to divide programs into separate chunks that can be run at the same time, for faster overall processing), particularly as organizations explore the game-changing benefits of AI in general and generative AI in particular.
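
To make the idea of parallel processing concrete, here is a minimal Python sketch of the split, run, and combine pattern. The function names (`split_into_chunks`, `sum_of_squares`, `parallel_sum_of_squares`) and the toy workload are illustrative assumptions, not part of any HPC product. It uses a standard-library thread pool only to show the pattern; a real HPC job would spread its chunks across many cores, GPUs, or servers.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(values, num_chunks):
    """Divide a list into at most num_chunks contiguous pieces."""
    if not values:
        return []
    size = -(-len(values) // num_chunks)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def sum_of_squares(chunk):
    """The work done independently on each chunk (a stand-in for real computation)."""
    return sum(n * n for n in chunk)


def parallel_sum_of_squares(values, num_workers=4):
    """Split the input, process the chunks concurrently, then combine the results."""
    chunks = split_into_chunks(values, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1000))))
```

The result matches what a single sequential loop would produce. What changes in an HPC setting is that each chunk runs on its own hardware, so total run time shrinks as workers are added, provided the chunks really are independent.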

Applications that need the firepower provided by HPC include drug discovery and design in the pharmaceutical industry; genome analysis and clinical treatments in health care; automated trading in finance; real-time fraud detection and risk management in banking; and crash simulations and real-time data processing for autonomous driving features in the automotive industry.

In industries such as manufacturing, aerospace, oil and gas, energy, and education, HPC can enable engineers, scientists, and researchers to conduct simulations and modeling at a speed not possible with traditional servers.

The list goes on, and it will continue to grow as organizations discover new ways to gain business benefit from large language models (LLMs) and generative AI. In the generative AI scenario, HPC is required to train the models and to generate real-time responses to queries.

Organizations considering HPC need to identify a clear use case that will either fix a vexing problem that is holding the business back or solve a complex computational problem in a way that provides a competitive advantage.

Major trends in HPC for AI

Once limited to government research labs and the largest enterprises, HPC is becoming more widely deployed. “The worldwide HPC server market is seeing strong growth driven by enterprises investing in HPC,” says Josephine Palencia, research director for high-performance computing at IDC. “The AI revolution of the past few years has paved the way for organizations gaining an understanding of the importance of performance-intensive computing for business success.”

These are the three major trends in HPC:

  • Cloud delivery: Enterprises have traditionally run HPC on premises and have used cloud resources for occasional traffic bursts. However, Hyperion Research reports that the cloud market for HPC is expected to grow faster than the on-premises market. Because the cloud market is starting from a much smaller base, Hyperion predicts that by 2026 it will still be only half the size of the on-premises HPC server market.
  • HPC as a service: Another key trend is the emergence of HPC as a fully managed service inside enterprise data centers. This option lets organizations gain the benefits of HPC in a pay-as-you-go model, rather than via capital expenditure. Organizations can get HPC up and running faster and can lean on the technical expertise and staffing of the service provider, while keeping data safely protected within the walls of the data center.
  • Edge computing: Many organizations are adopting a decentralized approach to data center infrastructure, moving resources closer to where data is being created, in what is called edge computing. Edge data centers with HPC are becoming more widely deployed in industries such as manufacturing, which generate internet of things (IoT) traffic from computing resources outside the central data center, and in industries that require real-time data processing, such as retail, banking, and health care.

Leading HPC for AI vendors

    HPC vendors fall into three categories: traditional server vendors (Dell, HPE, IBM, and Lenovo), the hyperscale cloud service providers (AWS, Azure, and Google Cloud), and a new generation of cloud-based, purpose-built HPC service providers. Here are snapshots of the leading vendors in each category.

    Server vendors that sell HPC systems

    Dell: Dell, which makes its own servers, storage equipment and networking gear, offers an integrated HPC platform that enterprises can purchase in a do-it-yourself, on-premises scenario.

    Under the Apex banner, Dell also offers fully managed HPC delivered as a service, which can run in an enterprise data center or at a Dell-managed colocation facility. The Dell solution includes an HPC cluster manager, container orchestration, and job scheduler. Dell also offers preconfigured platforms tailored to specific industries. For example, there are Dell-validated designs for risk assessment in the financial services industry, for life sciences, and for manufacturing.

    HPE: HPE, which acquired supercomputing vendors Cray and Silicon Graphics (SGI), offers HPC for enterprises in either a purchase option or a managed private cloud scenario under the GreenLake banner.

    The HPC on-premises service (a component of GreenLake) provides flexibility, scalability, and control of HPC solutions with a pay-as-you-go consumption model. In this scenario, HPE staffers implement and operate the environment.

    The full-stack GreenLake HPC-as-a-service solution includes high-bandwidth memory, integrated storage, HPE Slingshot high-speed interconnect, built-in acceleration for the fastest-growing workloads, density-optimized power and cooling, and the HPE Cray OS and Programming Environment.

    IBM: IBM has an integrated full-stack offering for on-premises deployments that includes Power Systems servers, Spectrum Storage, and the Spectrum Computing Suite for High-Performance Analytics (HPA) that provides workload management and scheduling, data management, data analysis, dashboards, etc. IBM offers HPC-as-a-service in the IBM Cloud, and on-premises HPC customers can burst to the IBM cloud in a hybrid scenario.

    Lenovo: Lenovo, which purchased IBM’s x86 server business in 2014, offers integrated HPC systems built around ThinkSystem servers and TruScale Infinite Storage. Lenovo also has an as-a-service offering for on-premises HPC. The pay-as-you-go model provides flexibility, scalability, and security for high-intensity workloads. Lenovo professional services can help enterprises assess their HPC needs, design the appropriate system, install it, and monitor it.

    Hyperscalers that offer HPC infrastructure

    The major cloud service providers offer HPC scalability, pay-as-you-go pricing, and fast deployment. And these hyperscalers are on the cutting edge of how to build high-performance systems.

    The trade-off is that customers are ultimately responsible for decisions on the type and scale of infrastructure, as well as managing, monitoring, and securing the data. And cloud options can be more expensive when factoring in the costs of moving large data sets back and forth.

    AWS: Amazon Web Services (AWS) offers an integrated HPC solution built on the latest chip and server technology, Amazon EC2 storage, Elastic Fabric Adapter for networking, the FSx file system, the AWS Nitro hypervisor, and AWS ParallelCluster for deployment and management of HPC clusters. AWS offers partners, such as Cognizant, Rescale, and TotalCAE, that provide a managed service on top of the HPC implementation.

    Azure: Microsoft’s Azure offers a cloud-based HPC platform with compute, networking, and storage resources integrated with workload orchestration services for HPC applications. Azure also offers machine-learning tools and software for building applications with predictive analysis. In addition to the IaaS infrastructure tuned for HPC, Azure also offers a dedicated, fully managed, single-tenant Cray XC or CS series supercomputer for HPC workloads.

    Google Cloud: Similarly, Google offers HPC on its Google Cloud, providing customers with a variety of options for selecting CPUs from Intel, AMD, or Arm; GPUs from Nvidia; storage options including object, block, and file storage; toolkits; best practices blueprints; and preconfigured modules.

    Purpose-built HPC clouds that target AI

    Several startups have come along in the past few years looking to take advantage of the AI frenzy with cloud-based platforms designed from the ground up to support HPC. Two have separated themselves from the pack, generating huge buzz in the HPC community and raising large amounts of venture-capital investment.

    Cerebras: Cerebras has built a supercomputing platform based on its own chips, which it claims are the largest processors ever built. Cerebras is selling its Wafer-Scale Engine for AI to the largest enterprises. But it is also offering a cloud-based service that enables any enterprise to rent space on its infrastructure, with technical assistance from the Cerebras team of AI scientists and engineers. The company is privately held, but Karl Freund, principal analyst at Cambrian-AI Research, says that with revenue and commitments approaching $1 billion, Cerebras has likely generated more business than all the other startups in its segment combined.

    CoreWeave: CoreWeave is a specialized cloud service provider built specifically for large-scale GPU-accelerated workloads. CoreWeave, which uses Dell servers, claims its platform is significantly faster and cheaper than that of the hyperscalers. Nvidia is an investor in CoreWeave, which gives the company a steady flow of Nvidia chips, which are in short supply. In August 2023, CoreWeave raised $2.3 billion and its valuation recently hit $7 billion after a $642 million investment from Fidelity Management and Research.

What to ask before buying an HPC system for AI

    Because every enterprise is different and because HPC systems are both complex and varied in their capabilities, you need to get a clear grasp on your specific needs, capabilities, and resources before engaging prospective vendors and then choosing specific solutions.

    10 key questions to ask yourself before buying HPC systems for AI

  • Do I have the budget? HPC is expensive — one Nvidia high-end GPU can cost $30,000, and an HPC cluster aggregates multiple servers, storage systems, etc.
  • Do I have a better chance of winning funding if I pitch the HPC project as something that can be accomplished via a subscription model (the managed service option), rather than as a capital expenditure?
  • Do I have the staff expertise and availability to install, configure, operate, monitor, manage, and troubleshoot a parallel processing system that runs on unfamiliar technology and is, by definition, far more complex than anything I’m used to?
  • Do I have the space, power, cooling, and other physical data center infrastructure required to house this collection of high-powered servers and storage devices?
  • How does an on-premises HPC implementation affect my organization’s sustainability goals?
  • Are my HPC workloads predictable and steady, which is preferable with an on-premises implementation? Or are they highly variable, which would make a cloud-based or hybrid scenario more likely?
  • Am I using HPC for real-time processing or for long-term research projects? Hyperion Research says that from a purely cost perspective, long-running projects with demanding data movement requirements might cost more in the cloud than on-premises. Conversely, workloads with shorter run times and less demanding data requirements could cost less in the cloud.
  • Do I have regulatory, compliance, risk management, security, or other factors that prohibit moving large sets of sensitive data to the cloud?
  • Do I have a way to translate the output of HPC systems into actionable information for stakeholders?
  • Do I have systems in place to measure success? What are the goals, benchmarks, KPIs, or other metrics that I can use to track progress and to show results to stakeholders?
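
The budget and cloud-versus-on-premises questions above largely come down to arithmetic, which the following Python sketch illustrates. Only the $30,000 GPU price comes from this guide; the GPU count, the hourly cloud rate, and the function names (`gpu_capex`, `cloud_cost`, `break_even_hours`) are hypothetical placeholders to be replaced with real vendor quotes. The model is deliberately incomplete: it ignores servers, storage, networking, power, cooling, and staff.

```python
GPU_UNIT_COST_USD = 30_000  # from the guide: one high-end Nvidia GPU can cost $30,000


def gpu_capex(num_gpus, unit_cost=GPU_UNIT_COST_USD):
    """Purchase cost of the GPUs alone; all other cluster costs are excluded."""
    return num_gpus * unit_cost


def cloud_cost(num_gpus, hours, hourly_rate_per_gpu,
               data_moved_tb=0.0, transfer_cost_per_tb=0.0):
    """Rental cost for a run, plus an optional charge for moving data in and out."""
    compute = num_gpus * hours * hourly_rate_per_gpu
    transfer = data_moved_tb * transfer_cost_per_tb
    return compute + transfer


def break_even_hours(num_gpus, hourly_rate_per_gpu, unit_cost=GPU_UNIT_COST_USD):
    """Hours of rented GPU time that cost as much as buying the GPUs outright."""
    return gpu_capex(num_gpus, unit_cost) / (num_gpus * hourly_rate_per_gpu)


if __name__ == "__main__":
    gpus = 8            # hypothetical cluster size
    hourly_rate = 3.00  # hypothetical cloud price per GPU-hour, in USD
    print("GPU purchase cost:", gpu_capex(gpus))
    print("Break-even rental hours:", break_even_hours(gpus, hourly_rate))
```

Even a rough model like this makes the trade-off visible: steady, long-running workloads pass the break-even point quickly, while short or bursty workloads may never reach it.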

    10 critical questions to ask HPC vendors

  • Do you have experienced advisors who can help me design and right-size the on-premises deployment so that the processing power I purchase matches the requirements of the application?
  • Is your HPC stack fully integrated, or is it cobbled together with technologies from multiple vendors?
  • Does your solution provide the flexibility I need to quickly adjust HPC components independently? For example, can I add storage without going through an upgrade of the entire system?
  • What types of discounts and financing options do you offer?
  • How deep is your bench? How many trained HPC specialists do you have at every stage of the process — installation and configuration, service and support?
  • What types of service and support contracts do you offer? What are the SLAs? How expensive are they? What types of discounts are available, with what trade-offs?
  • How modular and scalable are your systems, given the assumption that the HPC footprint will expand over time?
  • What is your technology roadmap across all aspects of the HPC system — chip technology, storage systems, networking, etc.? Will there be predictable upgrades?
  • What additional templates or preconfigured designs do you offer for specific vertical industries or use cases?
  • What types of customization services do you offer so that the HPC system integrates with the unique workflows and processes of my organization?

Essential reading

    • High-performance computing: Do you need it?
    • AI, machine learning, and deep learning: Everything you need to know
    • What is generative AI? Artificial intelligence that creates
    • Large language models: The foundations of generative AI
    • What is edge computing and why does it matter?
    • What is IoT? The internet of things explained

    Source: Network World