Build an OpenStack/Ceph cluster with Cumulus Networks in GNS3: part 1
I must have built OpenStack demos a dozen times or more over the past few years, for the purposes of learning, training others, or providing proof of concept environments to clients. However these environments always had one thing in common – they were purely demo environments, bearing little relation to how you would build OpenStack in a real production environment. Indeed, most of them were “all-in-one” environments, where every single service runs on a single node, and the loss of that node would mean the loss of the entire environment – never mind the lack of scalability!
Having been tasked with building a prototype OpenStack environment for an internal proof of concept, I decided that it was time to start looking at how to build OpenStack “properly”. However I had a problem – I didn’t have at my disposal the half-dozen or so physical nodes one might typically build a production cluster on, never mind a highly resilient switch core for the network. The on-going lockdown in which I write this didn’t help – in fact it made obtaining hardware more difficult.
I’ve always been inspired by the “cldemo” environments on Cumulus Networks’ GitHub and my first thought was to build on these for this task – however I wanted to use this as a learning tool, and the disconnect between the Vagrant code and a visual representation of the network layout meant I didn’t find this an ideal learning environment. I love Vagrant, but it’s unlikely you would use it in a production environment. However this sparked off an idea – could you use GNS3 to build a working model of an OpenStack cluster?
To answer this question I started to experiment with GNS3. I’ve played with it before to prototype some firewall configurations and other small network-related tasks, but I had never considered using it for something as heavyweight as an entire OpenStack environment. However, I had already established that it is a great tool for building visual network layouts, and for interacting with the network (for example, you can right-click on a network link and see a Wireshark trace of everything running over that link). In addition, testing out a few different configurations demonstrated a few things I hadn’t fully appreciated before:
If you are running QEMU images in GNS3, they are run with the same fundamental KVM virtualization on which tools such as OpenStack and oVirt themselves are based – thus this ought to be possible to scale up.
When you define a new QEMU machine image in GNS3, and drag that machine onto the canvas, GNS3 doesn’t create a full copy of the virtual disk image. Rather, it creates a linked clone of the original, meaning that only the differences between the original image and the copy on the canvas are stored on disk, saving space.
This reliance on open source tools means you can interact with the disk images directly using tools such as qemu-img and guestfish if you wish.
GNS3 creates neatly packaged portable project files, meaning you can easily backup and archive your work (or versions of it), provide others with a copy, and so on. Snapshots are also supported.
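To make the linked-clone idea concrete, here is a minimal Python sketch that builds the kind of qemu-img commands you could use to create and inspect a copy-on-write overlay by hand. The file names are hypothetical, and this is not necessarily the exact command GNS3 runs internally; it only illustrates the backing-file mechanism that qcow2 images provide.

```python
from __future__ import annotations

import shlex
from pathlib import Path


def linked_clone_cmd(base: Path, overlay: Path) -> list[str]:
    """Build a qemu-img command that creates a qcow2 overlay backed by `base`.

    Only blocks that differ from the base image are written to the overlay.
    """
    return [
        "qemu-img", "create",
        "-f", "qcow2",      # format of the new overlay image
        "-b", str(base),    # backing file: the untouched original image
        "-F", "qcow2",      # format of the backing file (assumed qcow2 here)
        str(overlay),
    ]


def backing_chain_cmd(image: Path) -> list[str]:
    """Build a qemu-img command that lists an image and its backing files."""
    return ["qemu-img", "info", "--backing-chain", str(image)]


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    base = Path("ubuntu-server-cloudimg.qcow2")
    overlay = Path("node1.qcow2")
    # Print the commands rather than running them, so the sketch has no side effects.
    print(shlex.join(linked_clone_cmd(base, overlay)))
    print(shlex.join(backing_chain_cmd(overlay)))
```

Running the printed commands requires qemu-img to be installed; the base image should be treated as read-only once overlays depend on it.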
Overall, it seemed that GNS3 was a far more powerful virtualization tool than I had given it credit for, and actually well suited to the task I had in mind. However…
GNS3 does have its limitations, and obviously you’re not going to run things at bare metal or wire speeds inside it. Although GNS3 runs on macOS, Windows and Linux, I started my testing on Windows 10 and quickly discovered that all of the clever work performed by GNS3 happens in a special virtual machine – the “GNS3 VM”. This is essentially a customized Ubuntu Linux image designed to handle the GNS3 tasks, and if you are relying on it, then your QEMU virtual machines are running inside another virtual machine – the nested virtualization adds overhead and slows things down.
Further, some of the functionality I had wanted to take advantage of in my work, such as snapshots and exporting portable project archives, seemed to be fundamentally broken in Windows 10 – a tour of the forums seemed to indicate I wasn’t alone, though it offered no solutions. However, as soon as I installed Ubuntu Desktop on my test rig, all of these problems went away, so if there’s one thing I learned from this experience, it’s this: run your setup on Linux from the start.
Another thing worth factoring in (though not a limitation of GNS3) is that if you build out the full architecture I created, then you’re going to be running a total of 15 virtual machines on a single host. This means you’re going to need lots of RAM (my setup needed 48GB when fully operational) and fast disk (SSD highly recommended). If you don’t quite have enough RAM (I had 32GB on my test rig), you can give the host enough swap space to cover the shortfall and let part of the QEMU virtual machines’ memory be paged out to it. Needless to say this isn’t recommended and is certainly going to decrease the life of your SSD – however if you don’t have a rig with 64GB of RAM to hand, this little “hack” will get you going where you might otherwise come to a grinding halt.
Having decided upon my platform it was time to set some goals for this exercise:
1. I would use a minimum of VM images to build this infrastructure – in the end, just two were needed:
a. The Cumulus VX 4.0 image
b. The Ubuntu Server 18.04.4 Cloud Image
2. There would be no “by-hand” configuration of any nodes in the virtual infrastructure – not even for authentication.
3. All code used for the environment should scale easily to a real-world production environment. In practice this meant two tools:
a. Cloud-init – natively supported by most cloud images including Ubuntu Server
b. Ansible – for all post-boot configuration work
4. The final infrastructure would be based on the openstack-ansible project, and their documented “Ceph Production example”: https://docs.openstack.org/openstack-ansible/stein/user/ceph/full-deploy.html
5. This infrastructure would feature 4 Cumulus VX switches in a spine-leaf topology, with a simple layer-2 CLAG configuration to ensure a resilient switched architecture.
6. An additional Cumulus VX switch would provide an out-of-band management network, with all nodes having eth0 reserved for out-of-band management.
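As an illustration of the “no by-hand configuration, not even for authentication” goal, the following Python sketch renders a small cloud-init user-data document that sets a hostname and installs an SSH public key for an automation account. The account name, hostname and key shown are hypothetical placeholders, and the real configurations for this project may well look different; the keys used (hostname, users, sudo, shell, ssh_authorized_keys) are standard cloud-init options.

```python
import json


def render_user_data(hostname: str, ssh_public_key: str, user: str = "ansible") -> str:
    """Render a minimal #cloud-config document as text.

    String values are emitted with json.dumps, because a JSON double-quoted
    string is also a valid YAML scalar, which avoids quoting surprises.
    """
    lines = [
        "#cloud-config",
        f"hostname: {json.dumps(hostname)}",
        "users:",
        f"  - name: {json.dumps(user)}",
        "    sudo: ALL=(ALL) NOPASSWD:ALL",   # passwordless sudo for automation
        "    shell: /bin/bash",
        "    ssh_authorized_keys:",
        f"      - {json.dumps(ssh_public_key)}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(render_user_data("infra1", "ssh-ed25519 AAAA-placeholder-key mgmt@example"))
```

With key-based access in place from first boot, a tool such as Ansible can reach each node without anyone logging in to set a password first.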
Building the infrastructure
Having worked out the platform, tool set, and goals, the remaining task was to build the architecture itself. This was broken down into a series of distinct stages:
1. Adding the virtual machine images to GNS3.
2. Building the virtual infrastructure on the canvas.
3. Creating the cloud-init configurations for each node, and building these into an ISO for boot-time configuration.
4. Configuring the management node – this is responsible for DHCP and DNS on the management network, routing of traffic from the OpenStack infrastructure to the Internet, and running the Ansible playbooks over the management network.
5. Configuring the out-of-band management network with Ansible.
6. Configuring the spine-leaf switch topology with Ansible.
7. Deploying OpenStack using the openstack-ansible playbooks.
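The stage that packages cloud-init configuration into an ISO can be sketched as follows, assuming cloud-init’s NoCloud datasource is used: it expects a volume labelled cidata containing files named user-data and meta-data. The directory and host names below are hypothetical, and genisoimage is only one of several tools that can build such an image; this is a sketch of the mechanism rather than the exact procedure used for this project.

```python
from __future__ import annotations

import shlex
from pathlib import Path


def render_meta_data(hostname: str) -> str:
    """Render a minimal NoCloud meta-data file for one node."""
    return f"instance-id: {hostname}\nlocal-hostname: {hostname}\n"


def seed_iso_cmd(workdir: Path, iso: Path) -> list[str]:
    """Build a genisoimage command that packs user-data and meta-data.

    The volume label must be 'cidata' for the NoCloud datasource to find it.
    """
    return [
        "genisoimage",
        "-output", str(iso),
        "-volid", "cidata",
        "-joliet", "-rock",
        str(workdir / "user-data"),
        str(workdir / "meta-data"),
    ]


if __name__ == "__main__":
    # Hypothetical paths and host name, for illustration only.
    workdir = Path("seed/infra1")
    print(render_meta_data("infra1"), end="")
    # Print the command rather than running it, so the sketch has no side effects.
    print(shlex.join(seed_iso_cmd(workdir, Path("infra1-seed.iso"))))
```

The resulting ISO would then be attached to the node as a CD-ROM drive so that cloud-init picks it up at first boot.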
In part 2 of this blog, I’ll take a more detailed look at the process for building out the proposed infrastructure in GNS3, and how to get it up and running.
Source: Cumulus Networks