Open Switch Hardware’s Journey into the Linux and Kernel Community



You have read, heard and seen us talk about the benefits of Linux, open source and community. Here, here, and here… and I am pretty sure everywhere. This blog walks you through our journey of pushing support for open switch ASICs into the Linux kernel and ecosystem. Before we begin, let me tell you that it has been, and continues to be, a fun ride!

A quick historical recap on Linux networking

The Linux kernel has been doing network hardware offloads and acceleration for decades (NICs, SmartNICs, wireless APs and many other devices). Because of this, the kernel has long had the infrastructure and the right abstractions to recognize and register a networking hardware device, and this infrastructure has matured over time.

For hardware vendors, enabling their networking hardware for Linux simply made it easier to take their hardware to new customers, use cases and industries. Today, Linux enablement is the best way to get faster adoption and faster go-to-market for your hardware. It has become the norm for hardware vendors to get their hardware ready for Linux first: getting their drivers into the upstream kernel and getting hardware tools ready for the Linux ecosystem.

A question may arise: why were switch ASIC network devices so late to the game? Well, it had to be Cumulus that came along to save the world!

Working collaboratively with the Linux community

Recent blog posts have talked about the collaborative spirit of open source Linux communities. How does that happen? In the process of getting hardware integrated into Linux, hardware vendors, along with the community, have pushed for abstractions and infrastructure pieces in the Linux kernel and ecosystem. This process has repeatedly accelerated things for other hardware vendors and paved the way for cross-subsystem infrastructure development, thereby enabling newer developments on Linux faster (more on this in a bit).

We all know that maintenance is a huge cost in any software project. That maintenance gets easier if you work with the Linux community and get the code integrated into the upstream code base. There is a good chance you will also find software developers who understand that code well and can help with future enhancements!

When all of networking runs Linux, things are easier

Systems infrastructure should always be open in my opinion. This also applies to networking infrastructure. Where is the fun in building the same infrastructure multiple times? Build once, deploy everywhere. Unify infrastructure. Having a single open software infrastructure to work on allows you to have more opportunity and time to develop newer things on top of it. And time is of the essence. In the networking world, the same applies to protocols. The key thing about protocols is the need for them to interop with other instances of the same protocol. If that is true, then why does the world have to run different implementations of the same protocol?

Why have multiple implementations of BGP in the network? In an ideal world, it is better to have a single implementation, or at least fewer implementations, of the same protocol stack, so that there are fewer interop problems. Debugging a network is also time consuming, especially when there are very few ways to debug protocol stacks! This unification, running the same protocol implementation everywhere with fewer interoperability problems, is what Linux brings you, and it works best when all devices are capable of running Linux!

Throughout the remainder of this blog, we will see how the above factors have helped us leverage, and contribute back to, the Linux community to build a data center class Linux distribution for all.

History of Cumulus Networks

With Cumulus Networks, the dream of running native Linux on switch ASIC hardware started a few years ago, and the Linux networking community embraced it happily. Linux networking on new hardware? Bring it on! Cumulus managed to drive many of these networking discussions in the early days with BoFs and talks at various conferences [1, 2].

Since those early days we have seen a rapid acceptance of patches in the Linux kernel networking stack for this kind of device. For example, the switchdev project, created just a few years ago, gives a name to the kernel infrastructure that was added to enable or promote native Linux support on switch ASICs.

What does kernel infrastructure that supports switch ASICs mean? It means layering software with abstractions, with the goal of making things more pluggable, extensible and maintainable while opening up the possibility for even more capabilities. The Linux kernel has multiple examples of this: the Linux networking community was quick to add an abstraction for switchdev ASICs called the switchdev API, along with new user-space APIs and infrastructure to query and debug switch ASIC hardware [12].
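To make this concrete, here is a minimal sketch of how a switchdev port shows up to user-space on a running system (the port name swp1 is illustrative and varies by platform):

```
# Each front-panel port of a switchdev ASIC appears as an ordinary netdev.
# The kernel exposes which physical switch the port belongs to:
ip -d link show swp1                      # look for the "switchid" attribute
cat /sys/class/net/swp1/phys_switch_id    # same ID via sysfs
cat /sys/class/net/swp1/phys_port_name    # the ASIC's name for this port
```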

Back to my earlier point about cross subsystem use and development, the community saw new possibilities for grouping other existing hardware devices under the switchdev development umbrella. This unified many of these hardware subsystems with similar characteristics into one! How great is that?

The more users you have of an abstraction/infrastructure, the more assurance you have of innovation and maintainability. This is the benefit of having your software in the community.

This has also fueled the growth of switch ASIC support in the kernel and its development community. Hardware abstractions in the kernel often go through iterations to accommodate new hardware or new hardware features. It is an evolutionary process, and that has also been the case for switch ASICs.

Linux networking features and development

Now that we’ve talked about networking hardware support on Linux, let’s talk about networking features and development.

Linux development on switch ASICs opened the door to using more networking subsystems, such as bridging, routing and ACLs, with hardware acceleration. This had never been seen before, and developers and maintainers have embraced it. Topics such as switch ASIC drivers, user-space tooling and network debugging have become more popular at networking events [1, 2], and the netlink Linux networking API [3] has seen extensions, developments and fixes.

Cumulus Networks has been at the center of all of it from the beginning, having made the Linux kernel routing and bridging stacks work natively on data center switches and routers.

We started with our focus on routing. Years of work have resulted in a very strong routing suite in FRR [18], and we have been one of the core contributors to the Linux routing stack. On the operational side, we have made many efforts to unify the IPv4 and IPv6 stacks [6, 7] and to add ECMP L4 hash support [8]. VRF, to this day, is still our most popular feature [9].
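As a flavor of what this looks like operationally, here is a minimal sketch using stock iproute2 commands (interface names, table numbers and addresses are illustrative):

```
# Create a VRF, bind a port to it and inspect/use it, as described in [9].
ip link add blue type vrf table 1001
ip link set dev blue up
ip link set dev swp1 master blue
ip route show vrf blue                # routes scoped to this VRF
ip vrf exec blue ping 192.0.2.1       # run a command inside the VRF

# Turn on L4 (5-tuple) hashing for IPv4 ECMP path selection [8].
sysctl -w net.ipv4.fib_multipath_hash_policy=1
```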

Policy routing needs of our customers inspired us to make improvements to the Linux kernel policy routing API [10]. With FRR support for policy based routing, you can now deploy routing policies with FRR on any Linux. Throughout this evolution we have also seen many features and improvements to the kernel routing stack and protocols contributed by others in the community, including IPv6 segment routing and TCP BBR, among others.
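For illustration, a simple policy routing setup using the kernel's rule and table machinery might look like this (prefixes, table numbers and next hops are made up for the example):

```
# Send traffic sourced from 192.0.2.0/24 through a dedicated routing table.
ip rule add from 192.0.2.0/24 lookup 100 priority 1000
ip route add default via 198.51.100.1 dev swp2 table 100
ip rule show              # list policy routing rules
ip route show table 100   # routes in the dedicated table
```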

One of the main functions of a Linux bridge is MAC learning. We had to teach the Linux bridge about entries learnt by the hardware, and new flags were soon added so that entries learnt by hardware SDKs could be added and refreshed from user-space or from kernel drivers.
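Roughly, this is what it looks like from user-space on a recent kernel and iproute2; the MAC address and port name are illustrative, and the extern_learn flag is the knob that marks an entry as learnt outside the kernel bridge:

```
# Inspect the bridge forwarding database on a port.
bridge fdb show dev swp1

# Install an entry on behalf of an external learner (e.g. a hardware SDK or
# a user-space agent); extern_learn tells the bridge the entry is managed
# externally rather than by its own learning/ageing logic.
bridge fdb add 00:11:22:33:44:55 dev swp1 master extern_learn
```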

In the process of getting an open PIM implementation ready for Linux, we have enabled and contributed fixes to the kernel multicast routing database [13, 14], among other fixes. This is in addition to enabling PIM routing needs in FRR.
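If you want to see what the PIM control plane has programmed into the kernel, the multicast routing table can be inspected directly; a quick sketch:

```
# Show (S,G) entries in the kernel multicast routing table...
ip mroute show
# ...and with per-entry packet/byte statistics.
ip -s mroute show
```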

Linux network virtualization

Now let’s look at network virtualization. Before Cumulus came along, the Linux ecosystem was already there, with kernel support and tools to deploy VXLAN and other network virtualization dataplanes. We just had to integrate it with switch ASIC drivers. Who wants to write a network virtualization stack from the ground up? Avoiding that allowed us to focus on driving and building interesting solutions around network virtualization.
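For example, the following sketch brings up a VXLAN tunnel endpoint and stitches it into a bridge using nothing but stock Linux tools (VNI, addresses and device names are illustrative):

```
# Create a VXLAN device with VNI 100, no data-plane MAC learning (the
# control plane will populate remote entries), and attach it to a bridge
# alongside a front-panel port.
ip link add vxlan100 type vxlan id 100 local 10.0.0.1 dstport 4789 nolearning
ip link add br100 type bridge
ip link set dev vxlan100 master br100
ip link set dev swp1 master br100
ip link set dev br100 up
ip link set dev vxlan100 up
ip link set dev swp1 up
```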

Ethernet VPN (EVPN) for network virtualization is a hot topic today, and we have been, and continue to be, at the center of building an open EVPN implementation for Linux with both the FRR routing stack and the Linux kernel.

First off, we have enabled forwarding elements in the Linux kernel bridge and neighbor subsystems to carry information from BGP-distributed remote MAC/IP entries [15]. ARP and ND suppression, which reduces flooding on the overlay, was also added [16], and we continue to work with the Linux and FRR communities to implement an open EVPN solution [17, 18].
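To give a rough idea of the moving parts (device names, addresses and the BGP ASN are illustrative, and exact FRR syntax varies by version): BGP EVPN in FRR advertises local MACs and IPs, remote entries land in the bridge forwarding and neighbor tables, and ARP/ND suppression is switched on per bridge port:

```
# Advertise all local VNIs over BGP EVPN in FRR.
vtysh -c 'configure terminal' \
      -c 'router bgp 65000' \
      -c 'address-family l2vpn evpn' \
      -c 'advertise-all-vni'

# A BGP-learnt remote MAC ends up as a forwarding entry pointing at the
# remote VTEP on the VXLAN device.
bridge fdb add 00:aa:bb:cc:dd:ee dev vxlan100 dst 10.0.0.2 self

# Suppress ARP/ND flooding on the overlay by answering locally from the
# neighbor cache [16].
bridge link set dev vxlan100 neigh_suppress on
```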

We and the wider community continue to make Linux kernel networking scale for data centers and new networking markets! See, for example, the next generation routing API [11], VXLAN scale enhancements [19] and bridge enhancements for faster convergence [21].

The Linux kernel is already ready for MPLS, and the Linux ecosystem has many open implementations of the MPLS control plane. While we were getting MPLS ready for the data center with our contributions [25], the kernel received MPLS-over-GRE support [22]. It seemed like the world worked together to get all the pieces of MPLS working in the kernel, which shows you, again, the power of community and collaboration.
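As a quick sketch of what stock Linux can already do here (labels, prefixes and devices are illustrative):

```
# Allow MPLS forwarding: size the label table and accept MPLS on a port.
sysctl -w net.mpls.platform_labels=1000
sysctl -w net.mpls.conf.swp1.input=1

# Label-switch: swap incoming label 100 for 200 towards the next hop.
ip -f mpls route add 100 as 200 via inet 10.1.1.2 dev swp1

# Impose a label on an IP route (MPLS encap via lightweight tunnels).
ip route add 203.0.113.0/24 encap mpls 300 via 10.1.1.2 dev swp1

ip -f mpls route show
```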

Debuggability and monitoring of networking events on Linux

Networking applications love asynchronous events, and we all know epoll on Linux can get the job done. In addition, the Linux networking stack provides subsystem-specific channels on the netlink bus [3] to notify interested parties of networking events.

For example, a bridge forwarding database entry was just learnt, or our FRR routing suite has just added a new route to the forwarding table. Tools like ip monitor [23] let users start using this monitoring infrastructure right away. Controllers in user-space can use these events to build networking solutions, whether for monitoring, orchestration or network analytics. The best part is that all of this infrastructure comes for free!
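From the command line, subscribing to these event channels is a one-liner; a small sketch:

```
# Stream kernel networking events as they happen.
ip monitor route       # route additions/deletions
ip monitor neigh       # ARP/ND neighbor table changes
bridge monitor fdb     # bridge MAC learning/ageing events
```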

As with any other infrastructure in the Linux kernel, a large number of people in the community are dedicated to tracing. To make debugging easier on a data center network, we have contributed tracepoints for following packets through the routing stack and the bridge forwarding database [24]. These have been instrumental in helping us efficiently debug our data center deployments, and new developments in tracing with eBPF have opened the door to even more possibilities in this area.
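For example, assuming a kernel built with the FIB tracepoints referenced in [24], route lookups can be watched live through ftrace (paths and tracepoint names may differ by kernel version):

```
# Enable the FIB lookup tracepoint and stream events.
cd /sys/kernel/debug/tracing
echo 1 > events/fib/fib_table_lookup/enable
cat trace_pipe      # one line per IPv4 route lookup (device, source, result)
echo 0 > events/fib/fib_table_lookup/enable
```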

Sampling has become native in the kernel with a new sampling API for user-space. The gist of this API is a netlink channel that logs sampled packets to user-space. We were involved in the design of the API to make it applicable to switch ASICs, and this netlink channel can now be used and extended to log sampled packets on any hardware running Linux. This is where you see the power of building once and reusing everywhere come into play, enabling sFlow collectors, like hsflowd, to use the same native kernel API on every Linux platform [20].
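The sampling channel is typically fed by a tc "sample" action, which switch ASIC drivers can offload to hardware; a minimal sketch (device name and rates are illustrative):

```
# Sample 1 in 100 packets arriving on swp1 into psample group 1; collectors
# such as hsflowd read the samples from the psample netlink channel [20].
tc qdisc add dev swp1 clsact
tc filter add dev swp1 ingress matchall action sample rate 100 group 1
```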

References:

[1] Hardware offloading BOF at netdev0.1:

https://netdevconf.org/0.1/sessions/10.html

[2] switchdev BOF:

https://netdevconf.org/1.1/bof-switchdev-roopa-prabhu-shrijeet-mukherjee.html

[3] netlink API:

http://man7.org/linux/man-pages/man7/netlink.7.html

[4] Support for 25G/50G/100G link speeds:

https://patchwork.ozlabs.org/patch/625047/

[5] ethtool: Support for FEC encoding control

https://patchwork.ozlabs.org/patch/850494/

[6] Fix ipv6 address flush on ifdown of an interface:

https://patchwork.ozlabs.org/patch/587543/

[7] Make ipv6 route notifications similar to ipv4:

https://patchwork.ozlabs.org/patch/723301/

[8] Ecmp hash L4 support:

https://patchwork.ozlabs.org/patch/739829/

[9] VRF: https://cumulusnetworks.com/blog/vrf-for-linux/

[10] PBR: Support for policy routing match extensions, Bug fixes and cleanups

https://patchwork.ozlabs.org/patch/878911/

[11] scaling routing with new API:

https://lwn.net/Articles/763950/

[12] devlink api for switch ASICS:

https://lwn.net/Articles/674867/

[13] IP multicast: Improve hash scalability:

https://patchwork.ozlabs.org/patch/714498/

[14] IP multicast: Support for all PIM messages:

https://patchwork.ozlabs.org/patch/689275/

[15] E-VPN: Neighbour subsystem updates for E-VPN BGP distributed remote entries:

https://patchwork.ozlabs.org/patch/903868/

[16] E-VPN: Arp-ND suppression support:

https://patchwork.ozlabs.org/cover/822906/

[17] https://www.netdevconf.org/2.2/slides/prabhu-linuxbridge-tutorial.pdf

[18] FRR (Free Range Routing): https://frrouting.org/

[19] VXLAN: Enhancements to support a single VXLAN device for scale:

https://patchwork.ozlabs.org/patch/722362/

Netlink API enhancements:

https://patchwork.ozlabs.org/patch/730096/

[20] sampling:

https://blog.sflow.com/2017/07/linux-411-kernel-extends-packet.html

[21] Linux bridge backup ports for faster convergence:

https://patchwork.ozlabs.org/cover/947461/

[25] MPLS in the kernel:

https://netdevconf.org/1.1/tutorial-deploying-mpls-linux-roopa-prabhu.html

[22] MPLS GRE: http://patchwork.ozlabs.org/patch/636253/

[23] ip monitor to monitor kernel networking events:

http://man7.org/linux/man-pages/man8/ip-monitor.8.html

[24] fib tracepoints + bridge fdb tracepoints:

http://patchwork.ozlabs.org/patch/807267/
