Navigating Generative AI for Network Admins

We all know that AI is changing the world. For network admins, AI can improve day-to-day operations in some amazing ways, such as automating repetitive tasks.

However, AI is no replacement for the know-how of an experienced network admin. AI is meant to augment your capabilities, like a virtual assistant. So, AI may become your best friend, but generative AI is also a new data center workload, and it arrives with a paradigm shift of its own: the NVIDIA Collective Communications Library (NCCL).

The evolution of the data center

Network admins have had to deal with many other recent changes:

Not that long ago, we might have measured the value of a new network admin by their level of expertise with a particular networking command-line interface (CLI). With the advent of hybrid cloud computing and DevOps, there is a growing move from CLIs to APIs. Skills in Ansible, Salt, and Python now carry more weight than a Cisco certification.
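To make that shift concrete, here is a minimal Python sketch of the API-first approach: pulling interface counters from a switch's REST API instead of parsing CLI output. The host, endpoint path, field names, and token are hypothetical placeholders; your vendor's documented API will differ.

```python
# Minimal sketch of the CLI-to-API shift: instead of screen-scraping
# "show interface" output, request structured interface state from a
# switch's REST API. The host, path, and token below are hypothetical
# placeholders -- substitute your vendor's documented API.
import requests

SWITCH = "https://leaf01.example.com"   # hypothetical switch hostname
TOKEN = "REPLACE_ME"                    # hypothetical auth token

def get_interface_counters(interface: str) -> dict:
    """Return structured counters for one interface as a Python dict."""
    resp = requests.get(
        f"{SWITCH}/api/v1/interfaces/{interface}/counters",  # hypothetical path
        headers={"Authorization": f"Bearer {TOKEN}"},
        timeout=5,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    counters = get_interface_counters("Ethernet1/1")
    print(counters.get("in_octets"), counters.get("out_octets"))
```

The point is not the specific endpoint but the shape of the work: structured data in and out of version-controlled scripts, rather than hand-typed commands.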

Even the way that you monitor and manage networks has changed. You’ve moved from tools that polled devices across the data center using SNMP and NetFlow to new switch-based telemetry models where the switches proactively stream flow-based diagnostic details.
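As a rough illustration of that streaming model, the standard-library-only sketch below runs a small collector and lets devices push records to it instead of being polled. The wire format (newline-delimited JSON over TCP on port 50051) is an assumption made purely for the example; real deployments typically use gNMI or vendor dial-out telemetry.

```python
# Minimal sketch of the polling-to-streaming shift: rather than walking
# SNMP OIDs on a timer, run a small collector and let switches push
# telemetry records to it. Wire format here (newline-delimited JSON over
# TCP) is an assumption for illustration only.
import asyncio
import json

async def handle_switch(reader: asyncio.StreamReader,
                        writer: asyncio.StreamWriter) -> None:
    peer = writer.get_extra_info("peername")
    while True:
        line = await reader.readline()     # one telemetry record per line
        if not line:                       # switch closed the connection
            break
        record = json.loads(line)
        # In practice you would forward this to a time-series database;
        # here we just print the streamed stats as they arrive.
        print(f"{peer} -> {record}")

async def main() -> None:
    server = await asyncio.start_server(handle_switch, "0.0.0.0", 50051)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```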

You’re all practiced hands at introducing new workloads into data centers, many with unique networking requirements. You’ve seen legacy databases replaced with data analytics and big data clusters.

Now when tasked with building an AI cluster, it is tempting to think that AI is just a bigger and faster big data application. But AI is different, and AI can be hard without the right tools.

The impact of generative AI and NCCL

You are a network admin for a large enterprise. Your CTO attended GTC 2023 and heard about generative AI. They want to change the way you do business by building a large language model, like ChatGPT, that responds to and interacts with end users. The model must be trained, which requires a large AI training cluster: many GPU-accelerated servers connected through a high-speed, low-latency network.

This AI training cluster brings many new challenges, and at the center of its traffic patterns sits NCCL.

So, what is NCCL? Here’s the textbook answer:

The NVIDIA Collective Communication Library (NCCL) implements multi-GPU and multi-node communication primitives optimized for NVIDIA GPUs and Networking. NCCL provides routines such as all-gather, all-reduce, broadcast, reduce, and reduce-scatter, as well as point-to-point send and receive, that are optimized to achieve high bandwidth and low latency over PCIe and NVLink high-speed interconnects within a node and over the NVIDIA Mellanox Network across nodes.

Source: NVIDIA Collective Communication Library (NCCL)
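If you have never seen one of those primitives in action, the sketch below shows roughly how a data scientist exercises NCCL through PyTorch's distributed package: every GPU contributes a tensor, and an all-reduce leaves each GPU holding the sum. It assumes a host with NVIDIA GPUs, a working PyTorch install with the NCCL backend, and a launcher such as torchrun; the script name is just an example.

```python
# Minimal sketch of the kind of traffic NCCL generates: an all-reduce
# across GPUs, driven through PyTorch's NCCL backend. Launch with
# something like: torchrun --nproc_per_node=<num_gpus> allreduce_demo.py
import os
import torch
import torch.distributed as dist

def main() -> None:
    # torchrun sets RANK, WORLD_SIZE, and LOCAL_RANK for each process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor; the NCCL all-reduce sums them so
    # every rank ends up with the same result -- the pattern that
    # dominates gradient exchange during large-scale training.
    x = torch.ones(1024, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)

    if dist.get_rank() == 0:
        print("after all-reduce, element 0 =", x[0].item())

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Multiply that exchange by billions of parameters and thousands of iterations, and you can see why the network underneath it matters.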

For the network admin, NCCL controls the traffic patterns of your shiny new AI cluster. This means that you need a network design that is optimized for NCCL, network monitoring tools optimized for NCCL, and Ethernet switches optimized for NCCL.
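On the visibility side, NCCL exposes documented environment variables that determine which interfaces it uses and how much it tells you about its decisions. The values below are illustrative only; the right settings depend entirely on your fabric, so treat them as examples, not recommendations.

```python
# Examples of documented NCCL environment variables that shape how the
# library uses the network. Interface and HCA names below are examples,
# not recommendations.
import os

# Log NCCL's initialization and network choices so you can see which
# NICs, rings, or trees it actually builds.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["NCCL_DEBUG_SUBSYS"] = "INIT,NET"

# Pin NCCL to specific interfaces: the TCP/bootstrap interface and, on an
# InfiniBand/RoCE fabric, the HCAs it should use.
os.environ["NCCL_SOCKET_IFNAME"] = "eth0"
os.environ["NCCL_IB_HCA"] = "mlx5_0,mlx5_1"

# ...then initialize the job (for example, torch.distributed with
# backend="nccl") so the settings take effect before NCCL creates its
# first communicator.
```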

NCCL is the key to the performance, consistency, and predictability of the workloads running on the AI cluster. It is also the intersection point: both the network admin and the data scientist must speak and understand it. When both speak it fluently, NCCL becomes the Rosetta Stone between two professionals with different, equally necessary skill sets.

Given the importance of NCCL, the right network can make or break an AI cluster’s performance, and AI clusters have some unique requirements of their own.

So, what’s next?

It’s your job to keep the network from slowing the AI cluster, but what’s required for AI networking? High bandwidth, low latency, and high resiliency are necessary but not sufficient. How would you pick the right infrastructure?

Networking for AI can be hard. The adage “no one ever got fired for buying X” is about as dated as Moore’s law, because the X factor for AI is different from that of general-purpose computing. Even large IT shops with dedicated AI engineering teams that pre-test cluster performance are frequently surprised when performance drops precipitously as more users are added and multiple jobs run simultaneously.

The best way to guarantee the performance of an AI cluster is to follow one of the NVIDIA-published AI reference architectures and to use infrastructure with the AI-visibility features needed to verify the care and feeding of your AI cluster.

Whether your AI cluster uses Ethernet or InfiniBand, NVIDIA provides the tools, support, and training you need to succeed and become an expert at networking for AI.

Source: NVIDIA
