Oak Ridge seeks next-level supercomputer to blow away Frontier

The US Frontier supercomputer may be the most powerful computer on the planet, but Oak Ridge National Laboratory (ORNL), which operates it, isn’t resting on its laurels. This month ORNL issued a massive request for proposal for OLCF-6, a new machine, now dubbed Discovery, to be delivered by late 2027 or early 2028. The Technical Requirements document alone is 69 pages.

“Here in the Oak Ridge Leadership Computing Facility, we build leadership-class supercomputers for scientific applications,” said Matt Sieger, project director for Discovery, Oak Ridge Leadership Computing Facility (OLCF) at ORNL, which is managed and operated by UT-Battelle on behalf of the US Department of Energy. “But we work these systems really hard, running 24 hours a day for years at a time. At that pace, Frontier can’t last forever. Like a cell phone or laptop, the hardware wears out or becomes obsolete.”

With Discovery, Sieger said, ORNL is looking for the next system after Frontier, which currently ranks No. 1 on the Top 500 list of supercomputers.

“The two will operate simultaneously for a time for our research users to transition their codes to the new computer,” he explained.

With an anticipated budget of US$500 million, Discovery’s requirements are extensive. At the top of the list: “It must provide a significant increase in leadership computational and data science capabilities over the Frontier baseline.”

Frontier, which was installed in 2021, is a whopper. The HPE Cray EX exascale supercomputer lives in 74 Olympus rack HPE cabinets with high-speed Slingshot-11 interconnect, housing a total of 9,408 AMD compute nodes with 8,699,904 cores. It gobbles 22,786 kW of power. Each node has access to 512 GB of DDR4 memory.

Discovery must do better than that. In addition, according to the RFP, it must support everything from small workloads using 20% of the nodes to those that require the entire system. It must be expandable with “new, novel architectures,” and interoperate with and support connected DOE experimental user facilities and other ORNL Leadership Computing Facility (LCF) infrastructure.

It also must be operational before the end of Frontier’s service life, operate within OLCF’s operations and utilities budget, provide a productive programming environment for users, and “continue to make progress and lead in dramatically improving energy efficiency across the ecosystem.”

Oh, yes — and it must be AI-friendly, of course.

Discovery should “be at the forefront in supporting domain scientists and application developers as they explore and integrate transformational AI technologies to accelerate discoveries in science, energy, and security problems of national importance,” the RFP outlined.

As well, OLCF added in its RFP, “We envision a wide spectrum of use cases ranging from inverse design and control of complex systems such as power grids and nuclear reactors, to generative AI and foundational models that integrate text and images that are often unstructured, high-resolution, and from multi-modal data sources.

“Executing AI-empowered computing campaigns and workflows will place new demands on the system architecture, possibly requiring more interconnect bandwidth and an optimized storage layer that can handle very high rates of I/O operations (IOPS) focused on random reads.”

OLCF provides a series of benchmarks whose results must be reported in proposals. The benchmark website contains output results from existing supercomputers, including Frontier, to “assist the Offeror in estimating the benchmark results on the proposed OLCF-6 system.”

The RFP also includes an extensive description of required software, including the operating system and the system and workload management software, as well as a development platform that supports C, C++, Python, and Fortran, a high-priority technical requirement.

Vendors are also expected to provide estimates of maintenance costs for both hardware and software, including the establishment, at a minimum, of a Center of Excellence. And, of course, the system must be highly secure.

Source: Network World