Exploring Batfish with Cumulus – Part 2
In Part 1 of our look into navigating Batfish with Cumulus, we explored how to get started communicating with the pybatfish SDK, as well as how to get some basic, actionable topology information back. With the introduction out of the way, we’re going to look at some of the more advanced use cases for parsing the information we get back in response to our queries. Finally, we’re going to reference an existing CI/CD pipeline, where templates are used to dynamically generate switch configuration files, and see exactly where and how Batfish can fit in and aid in our efforts to dynamically test changes.
For a look under the covers, the examples mentioned in this series of posts are tracked in “https://gitlab.com/permitanyany/cldemo2”
As you may remember, in Part 1 we gathered the expected BGP status of all our sessions via the bgpSessionStatus query and added some simple logic to tell us when any of those sessions would report back as anything but “Established”. Building on that type of policy expectation, we’re going to add a few more rules that we want to enforce in our topology.
- “A leaf switch should only peer with a spine switch”
- “All spine switches should use the same BGP AS”
- “Leaf switches that are part of an MLAG pair should use the same BGP AS”
This list can easily grow or change based on the topology design and how granular you want to get, but it drives home the point that we can make sure that any change to the environment will not violate these expectations.
As a refresher, here is what our BGP information looks like and the data we’ll be parsing from it.
Looking at the first requirement, we want to start with the Node column and view what the corresponding Remote_Node is, when querying for bgpSessionStatus.
After importing the pybatfish libraries and initializing the Batfish snapshot, we iterate through the Node column. We’re looking for values that start with “leaf” and checking to see if the corresponding Remote_Node value contains “spine” in the name. If there’s no matching spine neighbor, we raise an exception. The reason we choose to go the exception route (instead of just printing the message) is because it later helps us properly identify the script’s exit code in our pipeline, and whether the result is a success or failure based on the data we were looking for.
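A minimal sketch of that first check is below. It operates on plain rows shaped like the `bgpSessionStatus` answer frame (in the real script, those rows come from `bfq.bgpSessionStatus().answer().frame()`, which requires a running Batfish service; the function name and sample values here are illustrative):

```python
def check_leaf_peering(rows):
    """Raise an exception if any leaf's BGP session terminates on a
    non-spine device. Rows mimic the bgpSessionStatus answer frame."""
    for row in rows:
        node, remote = row["Node"], row["Remote_Node"]
        if node.startswith("leaf") and "spine" not in str(remote):
            raise ValueError(
                f"{node} peers with {remote}, expected a spine switch"
            )

# Illustrative rows using node names from the lab topology
sessions = [
    {"Node": "leaf01", "Remote_Node": "spine01"},
    {"Node": "leaf01", "Remote_Node": "spine02"},
]
check_leaf_peering(sessions)  # passes silently; a violation raises
```

Raising (rather than printing) is what lets the script exit non-zero later, which the pipeline interprets as a failed stage.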
Moving towards our next requirement, we want to make sure all of our spine switches are configured with the same AS number (65020 in this case).
Taking a similar approach, we focus on the nodes that have “spine” in the name. Once those are identified, we iterate through them and make sure their corresponding Local_AS values match 65020.
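Sketched in the same style as before (pure Python over rows shaped like the answer frame; function and variable names are illustrative):

```python
EXPECTED_SPINE_AS = 65020  # the AS our design assigns to all spines

def check_spine_as(rows, expected=EXPECTED_SPINE_AS):
    """Raise if any spine node's Local_AS differs from the expected AS."""
    for row in rows:
        if row["Node"].startswith("spine") and row["Local_AS"] != expected:
            raise ValueError(
                f"{row['Node']} uses AS {row['Local_AS']}, "
                f"expected {expected}"
            )

check_spine_as([
    {"Node": "spine01", "Local_AS": 65020},
    {"Node": "spine02", "Local_AS": 65020},
])  # passes silently; a mismatched spine raises
```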
Looking at the third requirement, we want to make sure that leaf nodes in the same rack (i.e., MLAG peers) have the same AS number. To do that, we need to jump through a couple of hoops to identify that two leaf switches are a pair. Our logic while parsing the nodes is: if a switch name ends with an even number, we assume its peer is the same switch number minus one; likewise, for switches ending in an odd number, the peer is assumed to be the switch number plus one.
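That pairing heuristic can be sketched as follows (function names are illustrative; the AS values use leaf01’s 65011 from this lab):

```python
import re

def mlag_peer(node):
    """Derive the assumed MLAG peer name from the trailing number:
    even-numbered switches pair with number - 1, odd with number + 1."""
    prefix, num = re.match(r"([a-z]+)(\d+)$", node).groups()
    n = int(num)
    peer_num = n - 1 if n % 2 == 0 else n + 1
    return f"{prefix}{peer_num:0{len(num)}d}"  # keep zero-padding

def check_mlag_as(local_as_by_node):
    """Raise if a leaf and its derived MLAG peer use different AS numbers."""
    for node, asn in local_as_by_node.items():
        if not node.startswith("leaf"):
            continue
        peer = mlag_peer(node)
        peer_as = local_as_by_node.get(peer)
        if peer_as is not None and peer_as != asn:
            raise ValueError(
                f"{node} (AS {asn}) and {peer} (AS {peer_as}) should match"
            )

check_mlag_as({"leaf01": 65011, "leaf02": 65011})  # passes silently
```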
As seen in the below output, we’re able to confirm that the leaf AS numbers match our expectations.
Bringing all 3 of these tests together, we can now lay the foundation of what we’d like to run as a test with every change to our pipeline.
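One way to wire the three checks into a single script whose exit code the pipeline can act on is sketched below (a simplified harness; the repository’s actual script structure may differ):

```python
def run_checks(checks):
    """Run each zero-argument check; return 0 if all pass, 1 on the
    first policy violation. The CI stage passes or fails on this code."""
    for check in checks:
        try:
            check()
        except ValueError as err:
            print(f"FAILED: {err}")
            return 1
    print("All policy checks passed")
    return 0

# In the pipeline script, the result would feed the process exit code:
#   sys.exit(run_checks([leaf_peering_check, spine_as_check, mlag_as_check]))
```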
Continuous Integration Pipeline
Now that we’re ready to handle the testing aspect of our network changes, before we jam our script into a pipeline, it might be worthwhile to review what a typical continuous integration workflow looks like. Keep in mind however, that this is a broad topic which would take numerous separate articles to fully setup and walk through, so I’ll stick to the high-level explanation here and defer to existing online resources for the rest. If the Gitlab repository link at the top of the post doesn’t make much sense to you and you need help getting started, leave a comment at the bottom of this post.
Using Gitlab as the CI tool of choice and Ansible to push out configurations, the overall goal of this pipeline is to define variable files per device or group and have a template that will read those variables in and render out switch configurations, which are finally pushed to the devices themselves. Below is a snippet of the leaf01 specific variables.
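The exact variable file lives in the linked repository; as an illustrative sketch only (the key names and the router ID here are hypothetical, though leaf01’s AS 65011 matches this lab), a leaf01 variables file might look like:

```yaml
# host_vars/leaf01.yml -- illustrative sketch, not the repo's exact schema
bgp:
  local_as: 65011        # leaf01's AS in this lab
  router_id: 10.0.0.11   # hypothetical value
interfaces:
  swp1:
    description: "to server01"   # hypothetical description
  swp51:
    description: "to spine01"    # hypothetical description
```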
These variables are then fed into a switch configuration template that will generate a config specific to a device it is reading variables in from.
An Ansible playbook ultimately sends these rendered configurations to the switches and configuration changes are pushed.
All of this code is version controlled, and Gitlab reacts any time someone pushes a change to the repository, so the Ansible playbook runs and pushes out the necessary changes automatically. This behavior is controlled by a Gitlab-specific configuration file, “.gitlab-ci.yml”, which defines one stage so far, “deploy”.
When one of these deploys occurs, they can be tracked in Gitlab to find out whether the latest commit and code push succeeded or failed.
Now that you have an idea of what a pipeline workflow looks like, let’s add the Batfish testing component. Overall, the “gitlab-ci” file will dictate what scripts we’ll be running to test our changes and in what order they will be executed. The important aspect of this order is that if a certain stage fails, the workflow will be interrupted and subsequent stages will not be executed.
Our proposed workflow should look like the following:
As you may remember from Part 1, Batfish currently only supports looking at Cumulus configurations in an NCLU format, which poses a problem for us, since the templatized configurations we’re generating are using Linux flat files. While Batfish will likely support parsing flat files soon, in the interim we have to think of a way to convert these configurations from flat files to NCLU on the fly. To do this, we’re going to create a Cumulus VX test switch in this topology that we’ll use for the sole purpose of converting configs from flat files to NCLU and sending them back to Batfish for analysis.
Our updated workflow will now be the following:
As you can see in the updated “gitlab-ci” file, we introduced two new stages in our pipeline. We have also split the configuration generation portion of the playbook from the configuration deployment portion and inserted the testing phase in the middle. Our testing phase contains two Python scripts: the one we built at the beginning of this post and the one from the end of Part 1.
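A sketch of what that staged “.gitlab-ci.yml” can look like is below (the job names, playbook names, and script paths are illustrative, not the repository’s exact contents):

```yaml
stages:
  - generate
  - test
  - deploy

generate_configs:
  stage: generate
  script:
    - ansible-playbook generate.yml    # render templates only

batfish_tests:
  stage: test
  script:
    - python3 test_bgp_policy.py       # the checks from this post
    - python3 test_bgp_sessions.py     # the Established check from Part 1

deploy_configs:
  stage: deploy
  script:
    - ansible-playbook deploy.yml      # push rendered configs to switches
```

Because stages run in order and a failing stage halts the pipeline, a policy violation in the test stage prevents the deploy stage from ever running.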
Let’s now go ahead and make a sample config change that we do not expect to violate any of our testing policies. We’ll change the description of swp1 on leaf01 and commit it to the Gitlab repository.
We can see below that our stages are in the process of running.
Looking at the specific job in the Gitlab pipeline, we see that Ansible detected a change on leaf01 when generating the interfaces config file.
Let’s now introduce a failure that should violate our testing policies. We’re going to change the BGP AS number of leaf02 to 65050 (leaf01 is 65011).
Looking at the pipeline run status, we see a failure.
Digging further into the reason, the failure occurred at the testing phase (as we’d expect). The logs of the run point us exactly to where we expect.
Our script raised an exception, and the config push stage of the pipeline was never reached, due to the failure in the test stage.
Hopefully, seeing Batfish’s place in a real-life workflow helped connect some dots about how a testing methodology fits into a modern datacenter network. You can now envision how this type of pre-change testing can add a level of repeatability and confidence to the practice of treating your network infrastructure as code.
Source:: Cumulus Networks