NVIDIA H100 System for HPC and Generative AI Sets Records for Financial Risk Calculations

Generative AI is taking the world by storm, from large language models (LLMs) to generative pretrained transformer (GPT) models to diffusion models. NVIDIA is uniquely positioned to accelerate not only generative AI workloads, but also those for data processing, analytics, high-performance computing (HPC), quantitative financial applications, and more. NVIDIA offers a one-stop solution for diverse workload needs.

In quantitative applications for financial risk management, for example, NVIDIA GPUs offer incredible speed with great efficiency. NVIDIA H100 Tensor Core GPUs were featured in a stack that set several records in a recent STAC-A2 audit. 

STAC-A2 benchmark

STAC-A2 is a risk management benchmark created by the Strategic Technology Analysis Center (STAC) Benchmark Council for the assessment of technology stacks used for compute-intensive analytic workloads in finance. Designed by quantitative analysts and technologists from some of the world’s largest banks, STAC-A2 reports the performance, scaling, quality, and resource efficiency of any technology stack that can handle the workload: Monte Carlo estimation of Heston-based Greeks for path-dependent, multi-asset options with early exercise.

STAC-A2 is the technology benchmark standard for financial market risk analysis. The workload serves as a proxy for price discovery and for market risk calculations such as sensitivity Greeks, profit and loss, and value at risk (VaR). As a proxy, the benchmark also extends to counterparty credit risk (CCR) workloads such as credit valuation adjustment (CVA) and margin, which financial institutions calculate both for trading and for risk management.
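To make the core workload concrete, below is a minimal, illustrative CUDA C++ sketch of Heston path simulation, with one path per thread in double precision. It is not the STAC-A2 Pack implementation; the Euler-style full-truncation scheme, the Philox generator, and all parameter values are assumptions chosen for illustration. Greeks would then be estimated from many such simulations, for example by bumping inputs and revaluing.

```cpp
// Illustrative sketch only (not the STAC-A2 Pack): simulate Heston paths,
// one path per thread, in FP64. Model:
//   dS = r*S*dt + sqrt(v)*S*dW1
//   dv = kappa*(theta - v)*dt + xi*sqrt(v)*dW2,  corr(dW1, dW2) = rho
#include <cuda_runtime.h>
#include <curand_kernel.h>

__global__ void hestonPaths(double *ST, int nPaths, int nSteps, double dt,
                            double S0, double v0, double r, double kappa,
                            double theta, double xi, double rho,
                            unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= nPaths) return;

    curandStatePhilox4_32_10_t rng;
    curand_init(seed, tid, 0, &rng);             // one RNG stream per path

    double S = S0, v = v0;
    for (int i = 0; i < nSteps; ++i) {
        double2 z = curand_normal2_double(&rng); // two independent N(0,1) draws
        double z1 = z.x;
        double z2 = rho * z.x + sqrt(1.0 - rho * rho) * z.y; // correlate them
        double vp = fmax(v, 0.0);                // full-truncation Euler scheme
        S += r * S * dt + sqrt(vp * dt) * S * z1;
        v += kappa * (theta - vp) * dt + xi * sqrt(vp * dt) * z2;
    }
    ST[tid] = S;  // terminal price; a real pricer would also handle payoffs,
                  // early exercise, and multiple correlated assets
}

int main()
{
    const int nPaths = 1 << 20, nSteps = 252;    // illustrative sizes
    double *dST;
    cudaMalloc(&dST, nPaths * sizeof(double));
    hestonPaths<<<(nPaths + 255) / 256, 256>>>(dST, nPaths, nSteps,
                                               1.0 / nSteps, 100.0, 0.04, 0.03,
                                               1.5, 0.04, 0.3, -0.7, 1234ULL);
    cudaDeviceSynchronize();
    // Reduce dST into a discounted payoff estimate; bump inputs and rerun
    // to estimate Greeks by finite differences.
    cudaFree(dST);
    return 0;
}
```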

Compared to all publicly reported solutions to date, this NVIDIA H100-based solution set numerous performance and efficiency records, including (but not limited to):

  • The first sub-10 ms warm time (8.9 ms) in the baseline Greeks benchmark.
  • A cold time of 38 ms in the baseline Greeks benchmark, more than 3x faster than any previously reported result.
  • The fastest warm (0.51 s) and cold (1.85 s) times in the large Greeks benchmark.
  • The most correlated assets (400) [5] and the most paths (310,000,000) [6] simulated in 10 minutes.
  • The best energy efficiency (311,045 options/kWh).

Compared to a solution using 8x NVIDIA A100 SXM4 80GB GPUs and previous generations of the NVIDIA STAC Pack and NVIDIA CUDA, this solution was:

  • 10x faster in the cold run of the baseline Greeks benchmark [2].
  • 1.38x faster in the warm runs of the baseline Greeks benchmark [1].
  • 1.38x faster in the cold run of the large Greeks benchmark [4].
  • 1.4x faster in the warm runs of the large Greeks benchmark [3].
  • 10% more energy efficient [7].

Integrated hardware plus software solutions

Risk calculation is one of the industry's key concerns, and it relies heavily on the latest technologies for real-time computation and rapid decision-making in trading and risk management. The HPE ProLiant XL675d Gen10 Plus server supports a dense-node strategy for financial HPC, enabling up to 10 NVIDIA GPUs to efficiently populate a single HPC server. With this type of dense node, a compute farm for portfolio risk calculations can be built with far fewer nodes, delivering higher performance at lower operational cost for real-world calculations such as price discovery, market risk (such as VaR), and counterparty risk (such as CVA).

In areas such as CVA, these setups have been shown to reduce the number of nodes from 100 to 4 for simulation- and compute-intensive calculations (separately from STAC benchmarking).

This dense-node solution enables a scale-up strategy with NVIDIA GPUs across a reduced number of nodes, delivering the highest performance at the lowest operating cost and with the smallest data center footprint. The solution can also be extended to other workloads, such as AI language inference and backtesting, as part of a scaling strategy built on such dense servers.

In addition to the hardware, NVIDIA provides all the key software component layers, giving developers multiple options, including their language of choice, such as CUDA C++.

As is typical for calculations that must deliver fast run times, this implementation was built on CUDA 12.0. It uses the highly optimized libraries delivered with CUDA: cuBLAS, the GPU-accelerated implementation of the BLAS linear algebra routines, and cuRAND, a parallel and efficient GPU implementation of random number generators.
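As a small illustration of the cuRAND host API (a hedged sketch, not code from the benchmark implementation), the following fills a device buffer with double-precision standard normal samples, the raw input of Monte Carlo path generation; the generator type, seed, and sample count are arbitrary choices.

```cpp
// Illustrative only: generate FP64 standard normals on the GPU with cuRAND.
#include <cuda_runtime.h>
#include <curand.h>

int main()
{
    const size_t n = 1 << 24;                    // number of N(0,1) samples
    double *d_normals;
    cudaMalloc(&d_normals, n * sizeof(double));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_PHILOX4_32_10);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateNormalDouble(gen, d_normals, n, 0.0, 1.0);  // mean 0, stddev 1
    cudaDeviceSynchronize();

    // d_normals can now feed path-generation kernels directly on the device.
    curandDestroyGenerator(gen);
    cudaFree(d_normals);
    return 0;
}
```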

The different components of the implementation are designed in a modular and maintainable framework using object-oriented programming. All floating-point operations were conducted in IEEE-754 double precision (64 bits). The implementation was developed using the various tools provided by NVIDIA to help debug and profile CUDA code. These tools include NVIDIA Nsight Systems for timeline profiling; NVIDIA Nsight Compute for kernel profiling; and NVIDIA Compute Sanitizer and CUDA-GDB for debugging.

Solution for risk HPC and AI convergence

NVIDIA H100 GPUs are an integral part of the NVIDIA data center platform. Built for AI, HPC, and data analytics, the platform accelerates over 4,000 applications. It is available everywhere, from data center to edge, delivering both dramatic performance gains and cost-saving opportunities with the aim of accelerating “every workload, everywhere.” 

The NVIDIA H100 PCIe GPU incorporates groundbreaking technology such as the NVIDIA Hopper architecture, with a theoretical peak performance of 51 TFLOPS for single-precision and 26 TFLOPS for double-precision calculations. It has 14,592 CUDA cores plus 456 fourth-generation Tensor Cores, which can deliver a theoretical peak of 1,513 TFLOPS for BF16 and 51 TFLOPS for FP64 matrix calculations.

For HPC applications, the NVIDIA H100 almost triples the theoretical floating-point operations per second (FLOPS) of FP64 compared to the NVIDIA A100. It also adds dynamic programming instructions (DPX) to help achieve better performance. NVIDIA H100 GPUs feature fourth-generation Tensor Cores and the Transformer Engine with FP8 precision. 
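As a hedged illustration of how double-precision matrix work reaches the FP64 Tensor Cores, the sketch below runs a plain DGEMM through cuBLAS; on H100, cuBLAS can dispatch FP64 GEMMs to the Tensor Cores without any special opt-in. The matrix size and contents are arbitrary, and this is not part of the STAC-A2 implementation.

```cpp
// Illustrative only: FP64 GEMM via cuBLAS (C = alpha*A*B + beta*C).
#include <cuda_runtime.h>
#include <cublas_v2.h>
#include <vector>

int main()
{
    const int n = 2048;                          // square matrices for simplicity
    const double alpha = 1.0, beta = 0.0;
    std::vector<double> hA(static_cast<size_t>(n) * n, 1.0);
    std::vector<double> hB(static_cast<size_t>(n) * n, 2.0);

    double *dA, *dB, *dC;
    cudaMalloc(&dA, hA.size() * sizeof(double));
    cudaMalloc(&dB, hB.size() * sizeof(double));
    cudaMalloc(&dC, hA.size() * sizeof(double));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(double), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    // Column-major, no transpose: C (n x n) = A (n x n) * B (n x n)
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);
    cudaDeviceSynchronize();

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```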

With second-generation Multi-Instance GPU (MIG), built-in NVIDIA confidential computing, and NVIDIA NVLink, the NVIDIA H100 aims to securely accelerate workloads for every data center, from enterprise to exascale. The NVIDIA GPUs in SXM form share a switched NVLink 4.0 interconnect, providing high-speed GPU-to-GPU communication bandwidth. 

The H100 PCIe Gen 4 configuration used in this SUT provides most of the specified capabilities of H100 SXM5 GPUs within just 350 watts of thermal design power (TDP). This configuration can optionally use an NVLink bridge to connect pairs of GPUs for applications, such as deep learning, that are coded to take advantage of inter-GPU communication. (The STAC-A2 Pack does not use these fabrics.)

Summary

Whether at single-server scale or in larger systems optimized for today's most demanding HPC plus AI workloads, NVIDIA is uniquely positioned to accelerate workloads ranging from HPC quantitative financial applications and data processing to analytics and generative AI. In addition to risk calculations, organizations are converging natural language processing (NLP) with generative AI and feeding the results into quantitative calculations.

This is an active area of work: the convergence of HPC and AI is happening as financial firms build big-picture solutions that combine modeling techniques, including HPC quantitative finance, machine learning (ML), neural network-based AI, and NLP with generative AI. Converged HPC plus AI solutions, increasingly a result of accelerated AI adoption, let firms combine workloads to address multiple complex business needs unique to the financial industry.

To learn more, check out HPE and NVIDIA Financial Services Solution, Powered by NVIDIA, Sets New Records in Performance. 

Reach out to NVIDIA Financial Services with questions as you evaluate or apply accelerated compute to your critical business problems. Read the full report, CUDA 12.0 with 8x NVIDIA H100 PCIe 80GiB GPUs in an HPE ProLiant XL675d Gen10 Plus Server. 

References

1. STAC-A2.β2.GREEKS.TIME.WARM
2. STAC-A2.β2.GREEKS.TIME.COLD
3. STAC-A2.β2.GREEKS.10-100k-1260.TIME.WARM
4. STAC-A2.β2.GREEKS.10-100k-1260.TIME.COLD
5. STAC-A2.β2.GREEKS.MAX_ASSETS
6. STAC-A2.β2.GREEKS.MAX_PATHS

Source: NVIDIA