Sharing state between host and upstream network: LACP part 3
So far in the previous articles, we’ve covered the initial objections to LACP and taken a deep dive into the effect on traffic patterns in an MLAG environment without LACP/Static-LAG. In this article we’ll explore how LACP differs from all other available teaming techniques and then show how it could have solved a problem in this particular deployment.
I originally set out to write this as a single article, but to explain the nuances it quickly spiraled beyond that. So I decided to split it up into a few parts.
• Part1: Design choices – Which NIC teaming mode to select
• Part2: How MLAG interacts with the host
• Part3: “Ships in the night” – Sharing state between host and upstream network
Ships in the night
An important element to consider is that LACP is the only uplink protocol supported by VMware that directly exchanges any network state information between the host and its upstream switches. An ESXi host is sort of a host, but also sort of a network switch (insofar as it forwards packets locally and makes path decisions for north/south traffic). Herein lies the problem: we effectively have network devices forwarding packets between each other, but not exchanging much in the way of state. That can lead to some bad things…
Some of those ‘bad things’ include:
1. Exploding puppies
2. Traffic black-holing
3. Non-optimal traffic flow
4. Link/fabric oversubscription upstream in the network (in particular the ISL between the switches)
5. Probably other implications… but I think that’s enough for now.
This is also the reason I chose to write this post: I’ve seen many others describe LBT vs etherchannel/LACP in detail (nice articles @vcdxnz01, btw), but none that go into much detail on the implications of this particular point.
The main piece of information of interest is topology change. For example, if you remove a physical NIC from a (d)VirtualSwitch, how is the network notified of this change? If a switch loses all its uplinks or is otherwise degraded, how are the hosts notified?
The intent here is to give the host and switches sufficient information on the current topology so they can dynamically make the best path decisions, as close to the traffic source as possible.
Without LACP, the network will need to make link forwarding decisions independently based on:
1. link state (physical port up / down)
2. mac learning
It also means that if the switch or host wants to influence each other to use an alternative path, the only mechanism available is to bring the link administratively down.
How lack of topology change notification could cause problems
Consider the following scenario:
When one of the 10G VMNICs is removed from the vDS of the ESX host (using vCenter), in some cases it takes a long time (on the order of minutes) for the traffic to switch over. This seems strange given that MLAG should switch over the traffic on the order of seconds (usually a lot less).
What could explain this behavior? Assume the switches are configured per the network vendor’s best practice (i.e. host-facing bonds) and the VMware consultant has made similar configurations/recommendations for the host config; in this case, LBT was configured.
In this scenario, host-facing LACP bonds had been set up with LACP bypass enabled. LACP bypass effectively behaves as a “static bond” until the first LACP frame is received, and then reverts to LACP from then on. This mode is normally used to allow PXE booting and initial configuration of the host, since LACP config can only be applied once the ESXi host is licensed, added to vCenter, and then added to a DVS and configured.
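The bypass behavior described above can be sketched as a tiny state machine. This is a toy model in Python, not any vendor’s implementation; the class and method names (BypassPort, receive_lacpdu, forwards_without_partner) are made up for illustration, and it models only the rule stated above: static-bond behavior until the first LACP frame arrives.

```python
from enum import Enum


class PortMode(Enum):
    BYPASS = "bypass"   # behaves like a static bond member
    LACP = "lacp"       # normal LACP negotiation applies


class BypassPort:
    """Toy model of a switch port with LACP bypass enabled (hypothetical names)."""

    def __init__(self):
        self.mode = PortMode.BYPASS

    def receive_lacpdu(self):
        # The first LACP frame from the host ends bypass;
        # the port stays in LACP mode from then on.
        self.mode = PortMode.LACP

    def forwards_without_partner(self):
        # In bypass, the port forwards even though no LACP partner has been seen.
        return self.mode is PortMode.BYPASS


port = BypassPort()
print(port.forwards_without_partner())  # True: a PXE-booting host can pass traffic
port.receive_lacpdu()
print(port.forwards_without_partner())  # False: LACP rules apply from here on
```

The point for this scenario is that a host which never sends an LACP frame leaves the switch port in the static-bond state indefinitely.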
Figure 1a: MLAG with Static bonds, ESX with LBT.
Figure 1b: Traffic path between VM1 and VM8
The ESXi host had not been configured with LACP or IP HASH. Figure 1a shows this base topology, assuming initial MAC learning has already occurred.
With both ESXi physical NICs / uplinks in the same vSwitch, VMs (and vmkernel interfaces) could be pinned to either link, but return traffic could still be received via either physical adapter; this is the default behavior of ESXi vSwitches. Figures 1b and 1c show this traffic path between VM1 and VM8.
Figure 1c: Traffic flow from VM8 to VM1
The problem comes when the topology changes, say by removing an uplink from the vSwitch. The switches are completely oblivious to this change, as the host hasn’t messaged them in any way.
Figure 1d: The failure scenario
The packet is sent out the configured uplink1 successfully to the destination VM, but the reply path could come via NIC2, which is not part of the vSwitch, so the packet will be dropped by ESX.
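To make the failure concrete, here is a minimal Python sketch of the decision each ToR switch can make when it only knows physical link state. The cabling map, function name and return strings are all hypothetical; the model assumes a static (or bypassed) bond where a switch uses its local member whenever the link is physically up, and falls back to the peerlink when it is down.

```python
# Which host NIC each ToR switch is cabled to (hypothetical names).
CABLING = {"switch1": "vmnic1", "switch2": "vmnic2"}


def deliver(reply_arrives_at, link_up, vswitch_uplinks):
    """Return what happens to a reply frame that reaches the given ToR switch.

    Without LACP the switch only knows physical link state, so it forwards
    out its local bond member whenever that link is up.
    """
    nic = CABLING[reply_arrives_at]
    if not link_up[nic]:
        return "sent via peerlink"   # the switch can see the local port is down
    if nic in vswitch_uplinks:
        return "delivered"
    return "dropped by host"         # link is up, but the NIC left the vSwitch


link_up = {"vmnic1": True, "vmnic2": True}
uplinks = {"vmnic1"}                 # vmnic2 was removed from the vSwitch

for switch in CABLING:
    print(switch, "->", deliver(switch, link_up, uplinks))
# switch1 -> delivered
# switch2 -> dropped by host
```

Setting link_up["vmnic2"] to False in this sketch models the link being physically down, which is the only signal the switch can act on without LACP.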
In my mind there are two ways of looking at the problem:
- “That’s a configuration mistake”: the host config and switches don’t match, so of course there will be a problem, change the config of either the host or the switches!
- Shouldn’t the host message the switches somehow that I’m no longer using this port as an uplink?
Change the config
Easy! There are a couple of options:
- Fix the host-config: Add the uplink back to the vSwitch or shut down the uplink.
- Fix the switch-config: Remove the dual-connected config from the ToRs (and accept the consequences of orphan ports described in Part 2).
Message the topology change
This can be achieved in a couple of ways:
- Manually shut down the physical uplink, so switch2 no longer uses that path.
- Enable LACP and let the LACP driver take care of it (let’s explore that a little further)
How LACP enables topology exchange
The LACP drivers on the switches and on the ESX host exchange information / status using an LACP “Data Unit” (LACP-DU) frame.
The important part is that the other end of an LACP link is able to make forwarding decisions based on the information it receives in the LACP-DU. This provides a mechanism for an endpoint to message a change in link state and have the other side do what’s appropriate.
If a DU is not received, or an incorrect/unexpected DU is received, the link will normally be removed from the bond and the receiver will immediately stop forwarding via that link. Let’s explore that in this particular scenario.
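The timeout half of that rule can be sketched in a few lines of Python. This is an illustration only, not a real LACP driver: the class name, the port name swp1 and the 3-second timeout are assumptions chosen for the example, time is passed in explicitly rather than read from a clock, and LACP bypass is ignored.

```python
class BondMember:
    """Toy model of one link in an LACP bond, seen from the receiving side."""

    def __init__(self, name, timeout_s=3.0):
        self.name = name
        self.timeout_s = timeout_s   # illustrative value, not taken from the article
        self.last_du_at = None       # time the last valid DU arrived

    def receive_du(self, now):
        self.last_du_at = now

    def is_forwarding(self, now):
        # No DU yet, or the last DU is older than the timeout: leave the bond.
        if self.last_du_at is None:
            return False
        return now - self.last_du_at <= self.timeout_s


link = BondMember("swp1")
link.receive_du(now=0.0)
print(link.is_forwarding(now=1.0))   # True: a DU was seen recently
print(link.is_forwarding(now=10.0))  # False: DUs stopped, the link leaves the bond
```

Compare this with the non-LACP case, where the link would keep forwarding for as long as it stayed physically up.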
In figure 2a (above), both ports are members of the same uplink group and LACP-DUs flow to/from both ToR switches.
Then a topology change happens at the host, which is described in figure 2b.
- Vmnic2 is removed from the LACP uplink group at the host “ESX1”
- Switch2 fails to receive a DU within the timeout window, so the port is forced “proto-down”
- Switch2 MLAG daemon messages the topology change to Switch1.
- Switch2 MLAG daemon programs the MACs associated with ESX1 onto the peerlink.
ESX1 is now treated as a singly connected host, vmnic2 is not used. Figure 2c shows the traffic flow in the forward direction.
Figure 2d shows the reverse traffic flow. Note that it will correctly use the peerlink.
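The MAC reprogramming step on Switch2 can be pictured as a simple table rewrite. The sketch below is a hypothetical Python model, not how any MLAG daemon is actually implemented; the function name, port names and MAC addresses are invented for the example.

```python
def move_macs_to_peerlink(mac_table, failed_port, peerlink="peerlink"):
    """Return a new MAC table with every MAC learned on failed_port pointed at the peerlink."""
    return {
        mac: (peerlink if port == failed_port else port)
        for mac, port in mac_table.items()
    }


switch2_macs = {
    "00:00:00:00:00:01": "swp1",   # VM1 on ESX1, learned on the host-facing port
    "00:00:00:00:00:08": "swp2",   # VM8, reached via a different port
}

# swp1 went proto-down after the DU timeout, so ESX1's MACs now point at the peerlink.
print(move_macs_to_peerlink(switch2_macs, failed_port="swp1"))
```

After the rewrite, traffic for VM1 that lands on Switch2 crosses the peerlink to Switch1 instead of being sent to a NIC the host no longer uses.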
It should hopefully go without saying, but having a message protocol to advertise changes goes both ways: the network can also inform the host of any changes upstream.
For example, in the case of an MLAG daemon failure, or a split-brain scenario, you would not want the hosts forwarding assuming both links are active and working as normal. LACP allows the switches to advertise such a scenario, without necessarily having to tear down one of the local links itself (which each individual switch could easily get wrong).
In a true split-brain scenario, one valid approach is to treat the two switches as independent again. Remember, an LACP bond can only form across links advertising the same System ID; different System IDs let the host know it is wired to two separate switches. The host LACP driver can then decide which of the links to disable. This is the ideal scenario because, in a true split brain, the switches may not be fully aware of whether their peer is up and forwarding. Having the host make the decision effectively adds a witness to the scenario to make the tiebreaker decision.
According to the LACP spec (and our testing of several host bonding drivers confirms this), when a host receives an LACP-DU with a new system ID (222222), while the other link(s) are still up with the shared system ID (111111), the link with the changed system ID will be removed from the bundle and brought down. This is what would happen during a peerlink failure as shown above.
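As a rough illustration of that host-side decision, the Python sketch below splits links by the system ID seen in their most recent DU. The function name and data layout are assumptions made for the example, and a real bonding driver tracks far more state than this; only the system IDs 111111 and 222222 come from the scenario above.

```python
def split_bundle(bundle_system_id, partner_ids):
    """Split links into (keep, drop) by the system ID seen in their latest DU.

    partner_ids maps a link name to the partner system ID it last advertised.
    """
    keep = sorted(link for link, sid in partner_ids.items() if sid == bundle_system_id)
    drop = sorted(link for link, sid in partner_ids.items() if sid != bundle_system_id)
    return keep, drop


# vmnic1 still sees the shared system ID; vmnic2 now sees a new one.
keep, drop = split_bundle("111111", {"vmnic1": "111111", "vmnic2": "222222"})
print(keep, drop)  # ['vmnic1'] ['vmnic2']
```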
Ok, well that was more of a novel than I originally planned to write. Hopefully I’ve done a little to bridge the gap between host networking and the implications with the upstream network.
The summary I’d like to present is this: In a fundamentally Active-Active network fabric, an Active-Active host connectivity option with a standard state exchange mechanism is the way to go.
Of course, another option would be to do away with L2 altogether and run a routing protocol on the host itself… But that is an entirely different story for another day!
The post Sharing state between host and upstream network: LACP part 3 appeared first on Cumulus Networks Blog.