Unlock the Power of NVIDIA Grace and NVIDIA Hopper Architectures with Foundational HPC Software

An illustration representing HPC applications.

High-performance computing (HPC) powers applications in simulation and modeling, healthcare and life sciences, industry and engineering, and more. In the modern…An illustration representing HPC applications.

High-performance computing (HPC) powers applications in simulation and modeling, healthcare and life sciences, industry and engineering, and more. In the modern data center, HPC synergizes with AI, harnessing data in transformative new ways.

The performance and throughput demands of next-generation HPC applications call for an accelerated computing platform that can handle diverse workloads and has a tight coupling between the CPU and GPU. The NVIDIA Grace CPU and NVIDIA Hopper GPU are industry-leading hardware ecosystems for HPC development.

NVIDIA provides tools, libraries, and compilers to help developers take advantage of the NVIDIA Grace and NVIDIA Grace Hopper architectures. These tools support innovation and help applications make full use of accelerated computing. This foundational software stack provides the means for GPU acceleration, and porting and optimizing your applications on NVIDIA Grace-based systems. Learn more about NVIDIA Grace compilers, tools, libraries, and more on the Grace developer product page.


The new hardware developments in the NVIDIA Grace Hopper systems enable dramatic changes to the way developers approach GPU programming. Most notably, the bidirectional, high-bandwidth, and cache-coherent connection between CPU and GPU memory means that the user can develop their application for both processors while using a single, unified address space. 

Each processor retains its own physical memory that is designed with the bandwidth, latency, and capacity characteristics matched to the workloads most suited for each processor. Code written for existing discrete-memory GPU systems continues to run performantly without modification for the new Grace Hopper architecture.

All application threads (GPU or CPU) can directly access the application’s system-allocated memory, removing the need to copy data between processors. This new ability to read or write directly to the full application memory address space significantly improves programmer productivity for all programming models built on top of NVIDIA CUDA: CUDA C++, CUDA Fortran, standard parallelism in ISO C++, ISO Fortran, OpenACC, OpenMP, and many others.

NVIDIA HPC SDK 23.11 introduces new unified memory programming support, enabling workloads bottlenecked by host-to-device or device-to-host transfers to achieve up to a 7x speedup due to the chip-to-chip (C2C) interconnect in Grace Hopper systems. Additionally, application development can be dramatically simplified because considerations for data location and movement are handled automatically by the system.

Read Simplifying GPU Programming for HPC with the NVIDIA Grace Hopper Superchip for a deeper dive detailing how NVIDIA HPC compilers use these new hardware capabilities to simplify GPU programming with ISO C++, ISO Fortran, OpenACC, and CUDA Fortran.

Get started with the NVIDIA HPC SDK for free and download version 23.11 now. 

NVIDIA Performance Libraries 

NVIDIA has grown to become a full-stack, enterprise platform provider, now offering CPUs as well as GPUs and DPUs. NVIDIA math software offerings now support CPU-only workloads in addition to existing GPU-centric solutions. 

NVIDIA Performance Libraries (NVPL) are a collection of essential math libraries optimized for Arm 64-bit architectures. Many HPC applications rely on mathematical APIs like BLAS and LAPACK, which are crucial to their performance. NVPL math libraries are drop-in replacements for these standardized math APIs. 

They are optimized for the NVIDIA Grace CPU. Applications being ported to or built on Grace-based platforms can fully use the high-performance and high-efficiency architecture. A primary goal of NVPL is to provide developers and system administrators the smoothest experience porting and deploying existing HPC applications to the Grace platform with no source code changes required to achieve maximal performance when using CPU-based, standardized math libraries.

The beta release of NVPL, available now, includes BLAS, LAPACK, FFT, RAND, and SPARSE to accelerate your applications on the NVIDIA Grace CPU. 

Learn more and download the NVPL beta.

NVIDIA CUDA Direct Sparse Solvers

A new standard math library is being introduced to the suite of NVIDIA GPU-accelerated libraries. The NVIDIA CUDA Direct Sparse Solvers library, NVIDIA cuDSS, is optimized for solving linear systems with very sparse matrices. While the first version of cuDSS supports execution on a single-GPU, multi-GPU, and multi-node support will be added in an upcoming release. 

Honeywell is one of the early adopters of cuDSS and is in the final phase of performance benchmarking in its UniSim Design process simulation product.

The cuDSS preview will be available soon. Learn more about CUDA Math Libraries here: CUDA-X GPU-Accelerated Libraries.


NVIDIA cuTENSOR 2.0 is a performant and flexible library for accelerating your applications at the intersection of HPC and AI. In this major release, cuTENSOR 2.0 adds new features and performance improvements, including for arbitrarily high dimensional tensors. To make the new optimizations easily extensible across all tensor operations uniformly, while delivering high performance, the cuTENSOR 2.0 APIs have been completely revised with a focus on flexibility and extensibility.

cuTENSOR 2.0 API flowchart.Figure 1. cuTENSOR APIs are now shared across different tensor operations

The plan-based multi-stage API is extended to all operations through a set of shared APIs. The new APIs can take opaque heap-allocated data structures as input for passing any operation-specific problem descriptors defined for that execution. 

cuTENSOR 2.0 also adds support for just-in-time (JIT) kernels. 

cuTENSOR 2.0 benchmark performance.Figure 2. Average incremental performance improvements from using JIT for various input tensor types compared for two benchmarks: QC-like and Rand1000. Performance improvements from JIT are significant for QC-like test cases with high dimensional tensors

Using JIT kernels helps realize unparalleled performance by tuning the right configuration and optimization knobs for the target configuration at runtime, supporting a myriad of high-dimensional tensors not achievable through generic pre-compiled kernels that the library can ship.

cuTENSOR 2.0 speedup graph.Figure 3. cuTENSOR 2.0.0 performance gains over the previous 1.7.0 version when tuned with JIT and other capabilities

cuTENSOR 2.0 will be available soon.

Grace CPU Performance Tuning with NVIDIA Nsight Systems 2023.4 

Applications on Grace-based platforms benefit from tuning instruction execution on the CPU cores, and from optimizing the CPU’s interaction with other hardware units in the system. When porting applications to Grace, insight into functions at the hardware level will help you configure your software for the new platform. 

NVIDIA Nsight Systems is a system-wide performance analysis tool that collects hardware and API metrics and correlates them on a unified timeline. For Grace CPU performance tuning, Nsight Systems samples instruction pointers and backtraces to visualize where CPU code is busiest, and how the CPU is using resources across the system. Nsight Systems also captures context switching to build a utilization graph for all the Grace CPU cores.

Grace CPU core event rates, like CPU cycles and instructions retired, show how the Grace cores are handling work. Additionally, the summary view for backtrace samples helps you quickly identify which instruction pointers are causing hotspots. 

Now available in Nsight Systems 2023.4, Grace CPU uncore event rates monitor activity outside of the cores—like NVLink-C2C and PCIe activity. Uncore metrics show how activity between sockets supports the work of the cores, helping you find ways to improve the Grace CPU’s integration with the rest of the system.

Grace CPU uncore and core event sampling in Nsight Systems 2023.4 help you find the best optimizations for code running on Grace. For more information on Grace CPU performance tuning, as well as tips on optimizing your CUDA code in conjunction, check out the following video.

Video 1. NVIDIA Grace CPU Performance Tuning with NVIDIA Nsight Tools

Learn more and get started with Nsight Systems 2023.4. Nsight Systems is also available in the HPC SDK and CUDA Toolkit.

Accelerated computing for HPC

NVIDIA provides an ecosystem of tools, libraries, and compilers for accelerated computing on the NVIDIA Grace and Hopper architectures. The HPC software stack is foundational for research and science on NVIDIA data center silicon. 

Dive deeper into accelerated computing topics in the Developer Forums. 

Source:: NVIDIA