Maximizing HPC Cluster Ethernet Fabric Performance with MLAG

For HPC clusters purposely built for AI training, such as the NVIDIA DGX BasePOD and NVIDIA DGX SuperPOD, fine-tuning the cluster is critical to increasing and optimizing the overall performance of the cluster. This includes fine-tuning the overall performance of the Ethernet fabric, storage fabric, and the compute fabric. 

This post discusses how to maximize the overall throughput of the Ethernet fabric with Multi-Chassis Link Aggregation (MLAG), available on NVIDIA Cumulus Linux. MLAG enables two separate switches to advertise the same LACP system ID to downstream hosts. As a result, the downstream hosts see the uplinks as if they are connected to a single LACP partner. 

One benefit of using MLAG is physical switch-level redundancy: if either of the two uplink switches fails, downstream host traffic is not impacted. A second benefit is that all uplinks in the aggregated bond are used simultaneously. Finally, MLAG provides gateway-level redundancy, using technologies such as VRR/VRRP.

Cumulus MLAG with LACP 

To maximize the overall Ethernet performance of each DGX/compute node in a cluster, it is recommended to configure the bonded uplinks in LACP (802.3ad) mode. LACP bonding mode enables both uplinks to be used at the same time. Other bond modes (such as active/standby, where only one of the two uplinks is in use at any given time) leave 50% of the available uplink bandwidth idle.
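
On the host side, the matching LACP bond can be created with standard Linux tooling. The following is a minimal sketch using iproute2; the interface names and IP address are illustrative and will differ on your nodes:

admin@dgx-01:~$ sudo ip link add bond0 type bond mode 802.3ad lacp_rate fast xmit_hash_policy layer3+4
admin@dgx-01:~$ sudo ip link set enp225s0f0 down
admin@dgx-01:~$ sudo ip link set enp225s0f0 master bond0
admin@dgx-01:~$ sudo ip link set enp225s0f1 down
admin@dgx-01:~$ sudo ip link set enp225s0f1 master bond0
admin@dgx-01:~$ sudo ip link set bond0 up
admin@dgx-01:~$ sudo ip addr add 10.2.22.11/24 dev bond0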

Bonding a host to two different TOR switches with LACP requires MLAG to be configured between those switches. When configuring MLAG, gateway-level redundancy is also required, using technologies such as VRR/VRRP.
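
For reference, VRR can be configured with NVUE on the SVI that serves as the host gateway. The following sketch assumes a hypothetical gateway of 10.2.22.1 on VLAN 222; the addresses and virtual MAC are examples only, and the same virtual address would be configured on both TOR switches:

cumulus@BCM-TOR-01:~$ nv set interface vlan222 ip address 10.2.22.2/24
cumulus@BCM-TOR-01:~$ nv set interface vlan222 ip vrr address 10.2.22.1/24
cumulus@BCM-TOR-01:~$ nv set interface vlan222 ip vrr mac-address 00:00:5e:00:01:01
cumulus@BCM-TOR-01:~$ nv set interface vlan222 ip vrr state up
cumulus@BCM-TOR-01:~$ nv config apply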

Figure 1. Ethernet fabric with MLAG configured between the TOR switches and LACP bonding configured on the DGX/compute links (CUST-EXIT switches at the top, BCM-TOR switches in the middle, DGX nodes at the bottom)

PXE booting with LACP bonded interfaces 

For HPC cluster deployments, PXE booting is often used to provision the nodes in the cluster. For this reason, it is important to enable LACP bypass mode on the uplinks. In bypass mode, a bond member can forward traffic as a regular port before LACP has negotiated, so a node whose PXE environment does not support LACP can still boot and be provisioned.

Figure 2. PXE boot connectivity during the provisioning process (DGX-01 connects through the fabric to the PXE/TFTP/DHCP server)

During the provisioning process, the host is configured to boot using one of its network interfaces. It obtains the IP address assignment and TFTP server information from the DHCP server. Once the TFTP server information is received from the DHCP server, the host contacts the TFTP server to retrieve the PXE booting/kickstart instructions for provisioning (Figure 2).
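
As an illustration, a server running dnsmasq can supply both the DHCP lease and the TFTP boot file. This is a minimal, hypothetical configuration snippet; the address range, boot file name, and TFTP root are examples only:

# /etc/dnsmasq.conf (illustrative values)
dhcp-range=10.2.22.100,10.2.22.200,12h
dhcp-boot=pxelinux.0
enable-tftp
tftp-root=/srv/tftp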

Cumulus Linux MLAG configuration 

You can use the NVIDIA User Experience (NVUE) CLI to configure MLAG between the BCM-TOR-01 and BCM-TOR-02 switches. This requires setting the MLAG peer link member interfaces, the MLAG MAC address, the MLAG peer IP, and the MLAG priority on each member switch.

The switch with the lower MLAG priority value becomes the primary switch for managing MLAG connectivity, and the switch with the higher value becomes the secondary switch. If no MLAG priority is set, the default value of 32768 is used.

To add MLAG configurations to BCM-TOR-01, use the following configurations:

cumulus@BCM-TOR-01:~$ nv set interface peerlink bond member swp61-62
cumulus@BCM-TOR-01:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@BCM-TOR-01:~$ nv set mlag backup 192.168.200.3 vrf mgmt
cumulus@BCM-TOR-01:~$ nv set mlag peer-ip linklocal
cumulus@BCM-TOR-01:~$ nv set mlag priority 2084
cumulus@BCM-TOR-01:~$ nv config apply
cumulus@BCM-TOR-01:~$ nv config save

To add MLAG configurations to BCM-TOR-02, use the following configurations:

cumulus@BCM-TOR-02:~$ nv set interface peerlink bond member swp61-62
cumulus@BCM-TOR-02:~$ nv set mlag mac-address 44:38:39:BE:EF:AA
cumulus@BCM-TOR-02:~$ nv set mlag backup 192.168.200.2 vrf mgmt
cumulus@BCM-TOR-02:~$ nv set mlag peer-ip linklocal
cumulus@BCM-TOR-02:~$ nv config apply
cumulus@BCM-TOR-02:~$ nv config save

To verify MLAG state on BCM-TOR-01, use the following command:

cumulus@BCM-TOR-01:mgmt:~$ net show clag
The peer is alive
     Our Priority, ID, and Role: 2084 48:b0:2d:ad:49:8c primary
    Peer Priority, ID, and Role: 32768 48:b0:2d:5f:4d:d0 secondary
          Peer Interface and IP: peerlink.4094 fe80::4ab0:2dff:fe5f:4dd0 (linklocal)
                      Backup IP: 192.168.200.3 vrf mgmt (active)
                     System MAC: 44:38:39:be:ef:aa
cumulus@BCM-TOR-01:mgmt:~$

To verify MLAG state on BCM-TOR-02, use the following command:

cumulus@BCM-TOR-02:mgmt:~$ net show clag
The peer is alive
     Our Priority, ID, and Role: 32768 48:b0:2d:5f:4d:d0 secondary
    Peer Priority, ID, and Role: 2084 48:b0:2d:ad:49:8c primary
          Peer Interface and IP: peerlink.4094 fe80::4ab0:2dff:fead:498c (linklocal)
                      Backup IP: 192.168.200.2 vrf mgmt (active)
                     System MAC: 44:38:39:be:ef:aa
cumulus@BCM-TOR-02:mgmt:~$
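
On releases where NVUE is the primary interface, the same MLAG state can also be summarized with the following command (exact output fields vary by Cumulus Linux version):

cumulus@BCM-TOR-01:mgmt:~$ nv show mlag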

Interface bond configurations  

You can use the NVUE CLI to configure the bonded uplinks on the interfaces going to the DGX-01 and DGX-02 nodes. For each MLAG bond interface, you must define the bond name, the bond member interface, a unique MLAG ID per bond, and a bond description. You must also enable LACP bypass mode for PXE booting, make each bond a Layer 2 bond by adding it as a member of the bridge, and configure the native/untagged VLAN to be used for PXE booting.

To add interface bonding configurations to BCM-TOR-01, use the following configurations:

cumulus@BCM-TOR-01:~$ nv set interface bond1 bond member swp1
cumulus@BCM-TOR-01:~$ nv set interface bond1 bond mlag id 1
cumulus@BCM-TOR-01:~$ nv set interface bond1 bond lacp-bypass on
cumulus@BCM-TOR-01:~$ nv set interface bond1 description dgx01
cumulus@BCM-TOR-01:~$ nv set interface bond2 bond member swp2
cumulus@BCM-TOR-01:~$ nv set interface bond2 bond mlag id 2
cumulus@BCM-TOR-01:~$ nv set interface bond2 description dgx02
cumulus@BCM-TOR-01:~$ nv set interface bond2 bond lacp-bypass on
cumulus@BCM-TOR-01:~$ nv set interface bond1 bridge domain br_default
cumulus@BCM-TOR-01:~$ nv set interface bond2 bridge domain br_default
cumulus@BCM-TOR-01:~$ nv set interface bond1 bridge domain br_default untagged 222
cumulus@BCM-TOR-01:~$ nv set interface bond2 bridge domain br_default untagged 222
cumulus@BCM-TOR-01:~$ nv set bridge domain br_default vlan 221-223
cumulus@BCM-TOR-01:~$ nv config apply
cumulus@BCM-TOR-01:~$ nv config save

To add interface bonding configurations to BCM-TOR-02, use the following configurations:

cumulus@BCM-TOR-02:~$ nv set interface bond1 bond member swp1
cumulus@BCM-TOR-02:~$ nv set interface bond1 bond mlag id 1
cumulus@BCM-TOR-02:~$ nv set interface bond1 bond lacp-bypass on
cumulus@BCM-TOR-02:~$ nv set interface bond1 description dgx01
cumulus@BCM-TOR-02:~$ nv set interface bond2 bond member swp2
cumulus@BCM-TOR-02:~$ nv set interface bond2 bond mlag id 2
cumulus@BCM-TOR-02:~$ nv set interface bond2 bond lacp-bypass on
cumulus@BCM-TOR-02:~$ nv set interface bond2 description dgx02
cumulus@BCM-TOR-02:~$ nv set interface bond1 bridge domain br_default
cumulus@BCM-TOR-02:~$ nv set interface bond2 bridge domain br_default
cumulus@BCM-TOR-02:~$ nv set interface bond1 bridge domain br_default untagged 222
cumulus@BCM-TOR-02:~$ nv set interface bond2 bridge domain br_default untagged 222
cumulus@BCM-TOR-02:~$ nv set bridge domain br_default vlan 221-223
cumulus@BCM-TOR-02:~$ nv config apply
cumulus@BCM-TOR-02:~$ nv config save

To verify the bond state on BCM-TOR-01, use the following command:

network-admin@BCM-TOR-01:mgmt:~$ net show int bond1
    Name   MAC                Speed  MTU   Mode
--  -----  -----------------  -----  ----  -------
UP  bond1  1c:34:da:29:17:04  100G   9216  802.3ad

Bond Details
------------------  --------
Bond Mode:          802.3ad
Load Balancing:     layer3+4
Minimum Links:      1
LACP Sys Priority:
LACP Rate:          1
LACP Bypass:        Active

All VLANs on L2 Port
--------------------
221-223

Untagged
--------
222

cl-netstat counters
-------------------
    RX_OK  RX_ERR  RX_DRP  RX_OVR     TX_OK  TX_ERR  TX_DRP  TX_OVR
---------  ------  ------  ------  --------  ------  ------  ------
249728882       0      18       0  32865480       0       1       0
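
You can also confirm from the host that LACP negotiated successfully by inspecting the Linux bonding driver state. The bond name below assumes the illustrative bond0 from the earlier host-side sketch, and the output shown is representative:

admin@dgx-01:~$ grep -A2 "Bonding Mode" /proc/net/bonding/bond0
Bonding Mode: IEEE 802.3ad Dynamic link aggregation
Transmit Hash Policy: layer3+4 (1)
MII Status: up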

Conclusion

MLAG is a well-tested feature used by many NVIDIA customers. It provides physical switch-level redundancy, avoids a single point of failure, and maximizes utilization of the total available bandwidth in your Ethernet fabric. On the Ethernet networking side, NVIDIA Cumulus Linux is an industry-leading open network OS used by many Fortune 100 organizations. For more information about how NVIDIA deploys large-scale clusters, see NVIDIA DGX SuperPOD and NVIDIA DGX BasePOD.
