Advanced API Performance: Intrinsics

Intrinsics can be thought of as higher-level abstractions of specific hardware instructions. They offer direct access to low-level operations or hardware-specific features, enabling increased performance. Wave intrinsics, for example, perform operations across the threads within a warp, also known as a wavefront.

Recommended

  • Wave intrinsics can noticeably speed up your shaders.
    • Many sorting or reduction algorithms can use much less or no shared memory with fewer memory barriers, providing a noticeable performance boost.
    • Different types of shuffles and ballots can be useful.
    • Use wave instructions even when GroupSize or WorkGroup values are larger than the warp or subgroup size (32 threads). Fewer memory barriers and shared memory accesses are needed.
    • For more information, see Reading Between The Threads: Shader Intrinsics and Unlocking GPU Intrinsics in HLSL.
  • Use GroupSize and WorkGroup values that are a multiple of the warp size (32 * N). 64 is usually a sweet spot.
    • When using intrinsics, a GroupSize or WorkGroup size equal to 32 could be a better choice, because it avoids shared memory usage.
  • Use native HLSL code when vendor-specific extensions are not applicable or are hard to implement.
    • Some vendor-specific instructions can be implemented with the intrinsics available in recent shader model versions.
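
As a rough illustration of the recommendations above, the following sketch sums 64 values per thread group using a wave intrinsic, one shared-memory slot per wave, and a single barrier. It is not taken from the original article. The resource names (gInput, gOutput), register bindings, and entry point name are hypothetical, and the code assumes a wave size of 32 and a linear mapping of SV_GroupIndex to wave lanes.

```hlsl
// Hypothetical resources: names and bindings are assumptions for illustration.
StructuredBuffer<float>   gInput  : register(t0);
RWStructuredBuffer<float> gOutput : register(u0);

#define GROUP_SIZE        64   // 32 * N with N = 2
#define ASSUMED_WAVE_SIZE 32   // Assumption: 32 lanes per wave
#define WAVES_PER_GROUP   (GROUP_SIZE / ASSUMED_WAVE_SIZE)

// One partial sum per wave instead of one slot per thread.
groupshared float gsWaveSums[WAVES_PER_GROUP];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 dtid : SV_DispatchThreadID,
            uint3 gid  : SV_GroupID,
            uint  gi   : SV_GroupIndex)
{
    float value = gInput[dtid.x];

    // Sum across all active lanes of this wave: no shared memory, no barrier.
    float waveSum = WaveActiveSum(value);

    // Assumption: lanes are assigned to waves in SV_GroupIndex order.
    uint waveIndex = gi / WaveGetLaneCount();

    // One lane per wave publishes the wave's partial sum.
    if (WaveIsFirstLane())
    {
        gsWaveSums[waveIndex] = waveSum;
    }

    // The only barrier needed: wait for every wave's partial sum.
    GroupMemoryBarrierWithGroupSync();

    // A single thread combines the per-wave partial sums.
    if (gi == 0)
    {
        float total = 0.0f;
        for (uint i = 0; i < WAVES_PER_GROUP; ++i)
        {
            total += gsWaveSums[i];
        }
        gOutput[gid.x] = total;
    }
}
```

A classic shared-memory tree reduction over 64 threads would use 64 shared-memory slots and a barrier at most steps. With a group size of 32 under the same assumptions, the shared array and the barrier could be dropped entirely, because WaveActiveSum would already cover the whole group.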

The following code example implements a shuffle-XOR operation in native HLSL with SM6 wave intrinsics:

float4 NvShflXor(float4 input, uint LaneMask)
{
    // Read the value held by the lane whose index is this lane's index XOR LaneMask.
    float4 output = WaveReadLaneAt(input, WaveGetLaneIndex() ^ LaneMask);
    return output;
}

Source: NVIDIA
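
One possible use of such a shuffle-XOR helper is a butterfly reduction, sketched below. The WaveButterflySum function is a hypothetical name introduced here for illustration and does not come from the original article. The sketch assumes that every lane in the wave is active and that the lane count is a power of two.

```hlsl
// Shuffle-XOR helper written with SM6 wave intrinsics, as in the example above.
float4 NvShflXor(float4 input, uint LaneMask)
{
    return WaveReadLaneAt(input, WaveGetLaneIndex() ^ LaneMask);
}

// Hypothetical helper: butterfly (XOR) sum across a wave.
// Assumptions: all lanes are active and the lane count is a power of two.
float4 WaveButterflySum(float4 value)
{
    // For a 32-lane wave the masks are 16, 8, 4, 2, 1.
    for (uint mask = WaveGetLaneCount() / 2; mask > 0; mask >>= 1)
    {
        // Each lane adds the running sum held by its XOR partner.
        value += NvShflXor(value, mask);
    }
    // After log2(lane count) steps, every lane holds the wave-wide sum.
    return value;
}
```

In practice, WaveActiveSum is expected to produce the same wave-wide sum in a single call. The explicit loop only shows how the shuffle pattern exchanges data between lanes without shared memory.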