Cumulus Linux Switch Monitoring with Datadog
One of the interesting things about being a Linux platform is that we often don’t have visibility into how customers use their switches running Cumulus Linux. They buy HCL-compatible hardware from our partners and, with some training and enablement, are off to the races.
The idea for running Datadog in Cumulus Linux came about for the simple reason that we were in adjacent booths at PuppetConf last year, and we all figured it would be cool to try it out. Further, since Datadog already provides visibility across systems, apps and services, they were interested in seeing how networking can be added into the mix. As you will see, it turns out to be pretty simple.
Installing Datadog in Cumulus Linux
The Datadog agent, as with most things Debian, installs easily on Cumulus Linux. For x86 switches, this is as simple as installing a Debian package and making a few changes to the Datadog agent configuration files, such as setting the application/API key and the tags associated with the switch. You can easily automate this installation using common automation tools like Puppet and Ansible.
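As a rough illustration of the kind of change involved, the sketch below renders a minimal agent configuration with an API key and host tags. It assumes the INI-style layout (a `[Main]` section with `api_key` and `tags`) used by older agent releases; newer agent versions use a YAML file instead. The helper name `render_agent_config` and the example tags are made up for this sketch.

```python
import configparser
import io


def render_agent_config(api_key, tags):
    """Return INI text holding the API key and comma-separated host tags."""
    config = configparser.ConfigParser()
    # Assumed layout: a [Main] section with api_key and tags entries.
    config["Main"] = {
        "api_key": api_key,
        "tags": ", ".join(tags),
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


if __name__ == "__main__":
    # Placeholder key and hypothetical tags for a spine switch.
    print(render_agent_config("YOUR_API_KEY", ["role:spine", "os:cumulus"]))
```

A Puppet or Ansible template could produce the same text; the point is only that the per-switch differences come down to a key and a few tags.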
Since the Datadog agent is designed for servers, switch-specific metrics are collected using Datadog’s SNMP plugin, a custom sFlow plugin or other custom scripts. Out of the box, the agent comes with a plugin to monitor disk, CPU and memory. Further, useful events can be triggered using Rsyslog, Monit and custom scripts, then sent directly to Datadog over SSL and authenticated using OAuth.
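To make the "custom scripts" option concrete, here is a minimal sketch of a script that reads per-port byte counters from the Linux kernel and hands them to the local agent. It assumes the agent's DogStatsD listener is running on its usual UDP port 8125; the `cumulus.port.*` metric names and the function names are invented for this example, and the counters are sent as plain gauges for simplicity.

```python
import os
import socket


def read_counter(iface, counter, sysfs_root="/sys/class/net"):
    """Read one kernel interface counter, e.g. rx_bytes, as an integer."""
    path = os.path.join(sysfs_root, iface, "statistics", counter)
    with open(path) as handle:
        return int(handle.read().strip())


def format_gauge(metric, value, tags):
    """Build a DogStatsD gauge datagram: name:value|g|#tag1,tag2."""
    line = "{}:{}|g".format(metric, value)
    if tags:
        line += "|#" + ",".join(tags)
    return line


def send_port_counters(iface, host="127.0.0.1", port=8125):
    """Send rx/tx byte counters for one port to the local agent over UDP."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        for counter in ("rx_bytes", "tx_bytes"):
            value = read_counter(iface, counter)
            # "cumulus.port.*" is a made-up metric namespace for this sketch.
            line = format_gauge("cumulus.port." + counter, value, ["port:" + iface])
            sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()
```

Calling `send_port_counters("swp1")` from cron or Monit would emit two datagrams such as `cumulus.port.rx_bytes:1234|g|#port:swp1`.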
Extending these capabilities to a networking switch required some customizations. Datadog has a simple-to-use API that can be used to generate custom graphs and triggers. A Python script running on the switch makes calls to this API, allowing custom graphs and views to be defined.
For example, you could have the switch graph the traffic of each bond member and put it all in a single time series graph. If the bond member configuration changes, simply trigger the graphing script to run again and an updated graph appears in Datadog. This capability enables you to dynamically change graphs as you make changes in your network, without having to make changes in the Datadog browser-based user interface. This is an important benefit of this architecture: it eliminates stale data and keeps the graphs reliable as the network changes.
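The bond example above might translate into a graph definition like the one sketched here. This is not the script used for this post; it only builds the payload and leaves out the API call that submits it. It assumes the agent's standard `system.net.bytes_rcvd` and `system.net.bytes_sent` metrics tagged by `device`, and the host, bond and port names are hypothetical.

```python
import json


def bond_graph(host, bond, members):
    """Return one timeseries graph definition covering every bond member."""
    requests = []
    for port in members:
        # Scope each query to a single member port on this switch.
        scope = "{{host:{},device:{}}}".format(host, port)
        requests.append({"q": "avg:system.net.bytes_rcvd" + scope})
        requests.append({"q": "avg:system.net.bytes_sent" + scope})
    return {
        "title": "{} members on {}".format(bond, host),
        "definition": {"viz": "timeseries", "requests": requests},
    }


if __name__ == "__main__":
    print(json.dumps(bond_graph("leaf1", "bond0", ["swp1", "swp2"]), indent=2))
```

Because the member list is an argument, rerunning the script after a bond change yields a definition that matches the new configuration.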
The diagram below summarizes how Datadog can interact with Cumulus Linux applications and components.
Datadog provides extensive customization for visualizing and alerting against the data being monitored.
You can quickly customize Datadog graphs to capture key data items, such as:
- Viewing CPU information from all switches in a single graph
- Viewing link up/down status of all switches in a single graph
- Showing a set of key metrics like CPU, disk and port utilization for each switch
Here’s an example of a custom dashboard built programmatically using the Datadog API:
The dashboard is dynamically created by running a script on the switch that uses the Datadog API. The script instructs Datadog to only graph ports that are not administratively down. This helps keep the custom dashboard free of unnecessary information about downed ports. To graph information about bond interfaces, the script figures out which physical ports belong to the bond and graphs the transmit and receive bytes for each bond member under the bond graph.
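The discovery step described above can be sketched with nothing more than the kernel's sysfs tree. The code below is an illustration under stated assumptions, not the actual script: it treats the `IFF_UP` bit in `/sys/class/net/<port>/flags` as "not administratively down" and reads bond membership from `/sys/class/net/<bond>/bonding/slaves`. The function names are invented for this example.

```python
import os

IFF_UP = 0x1  # administrative "up" bit in the kernel interface flags


def is_admin_up(iface, sysfs_root="/sys/class/net"):
    """True when the interface is administratively enabled."""
    with open(os.path.join(sysfs_root, iface, "flags")) as handle:
        return bool(int(handle.read().strip(), 16) & IFF_UP)


def bond_members(iface, sysfs_root="/sys/class/net"):
    """Return the member ports of a bond, or an empty list for non-bonds."""
    path = os.path.join(sysfs_root, iface, "bonding", "slaves")
    if not os.path.exists(path):
        return []
    with open(path) as handle:
        return handle.read().split()


def ports_to_graph(sysfs_root="/sys/class/net"):
    """Map each admin-up bond to its members and each other admin-up port to []."""
    names = sorted(
        name for name in os.listdir(sysfs_root)
        if os.path.isdir(os.path.join(sysfs_root, name))
    )
    up = [name for name in names if is_admin_up(name, sysfs_root)]
    bonds = {name: bond_members(name, sysfs_root) for name in up}
    # Member ports are graphed under their bond, not as separate entries.
    enslaved = {port for members in bonds.values() for port in members}
    return {name: members for name, members in bonds.items() if name not in enslaved}
```

A real script would probably also filter out interfaces such as `lo` before turning the result into graph definitions.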
The graphing script can then be incorporated into the change control process, so that as interfaces are activated or deactivated or bond members are changed, the changes are immediately reflected in Datadog without the need to make any changes in the Datadog browser user interface.
Triggers can be generated programmatically from the switch using the Datadog API. DevOps administrators get a lot of flexibility from the Datadog API, which provides Bash, Python and Ruby interfaces, all of which are supported on Cumulus Linux.
Here is a simple example of triggers programmatically defined using the Datadog API:
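As a rough sketch of what such a definition can look like in Python, the function below builds a metric alert on switch CPU with both a warning and a critical threshold. It only constructs the payload and does not submit it. The thresholds, host name and function name are assumptions made for illustration; `system.cpu.user` is one of the agent's standard CPU metrics.

```python
import json


def cpu_monitor(host, warning=75, critical=90):
    """Return a metric-alert definition with warning and critical thresholds."""
    # The query fires when the 5-minute average exceeds the critical value.
    query = "avg(last_5m):avg:system.cpu.user{{host:{}}} > {}".format(host, critical)
    return {
        "type": "metric alert",
        "query": query,
        "name": "High CPU on {}".format(host),
        "message": "CPU on {} has stayed above threshold for 5 minutes.".format(host),
        "options": {"thresholds": {"warning": warning, "critical": critical}},
    }


if __name__ == "__main__":
    print(json.dumps(cpu_monitor("spine1"), indent=2))
```

Keeping the thresholds as arguments means a script on the switch can tighten or relax them without anyone touching the browser interface.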
The Value of Full Stack Monitoring
Seamless full stack monitoring across applications, servers and the network has a number of benefits for customers, including:
- Event overlays: A powerful Datadog capability that can help improve visibility across app, server and network events with a single interface.
- Improved correlation across events, both for real-time and historic data.
- Simplified toolset enables DevOps teams to use fewer tools to get the data.
Since the alerting settings can be modified through scripting on the switch, this flexibility can be used to catch the degradation of resources and to alert proactively ahead of a complete failure. Please comment or tweet back and let us know if this is something we should demonstrate and blog about in the future.
Example Case Study
Let’s look at a simple example we have used to illustrate this solution. We started with an Ubuntu server running Nginx and a UDP application, connected to a topology of Cumulus Linux switches.
We simulated a scenario where Nginx connections periodically drop. Below is what the event stream looks like:
Notice that around the time of the Nginx failure, spine1, a switch inside the network, reports link transitions. Are these link flaps causing the server failure? Let’s use Datadog’s graph customization feature to quickly test this theory.
Using the quick graph customization feature in Datadog, we created three graphs with event overlay to help narrow down the scope of the problem:
- Nginx connection rate.
- Basic server network metrics: total octets in and out of all server NICs, and total multicast receive and transmit traffic.
- Link transition graph across all switches.
Using Datadog’s event stream overlay over the customized view, it is clear that there is a correlation between interface transitions on spine1 and a drop in the Nginx connection rate.
The server network rates, especially the multicast rates, do not decrease dramatically, indicating that the server NIC is working as designed.
With a converged full stack monitoring solution, app, server and network admins are able to easily collaborate to troubleshoot complex problems that could span their apps, servers and networks.
Cumulus Linux is the only network OS that enables a full stack monitoring solution, and does this with minimal impact on the monitoring policies already implemented by Datadog customers.
If this looks interesting, come try Cumulus Linux and Datadog together at our remote lab: http://cumulusnetworks.com/get-started/test-drive-open-networking-in-our-remote-lab/