Accelerating Hadoop With Cumulus Linux
One of the questions I’ve encountered in talking to our customers is, “What environments are a good fit for the Layer 3 Clos design?” Most engineers are familiar with the classic Layer 2 Core/Distribution/Access/Edge model for building a data center. While that model served us well for the older client-server, north-south traffic patterns and for smaller deployments, modern distributed applications stress it to its breaking point. Because Layer 2 designs are normally built around pairs of devices, relying on an individual platform to carry 50% of your data center traffic presents a real risk at scale. On top of that, the long list of protocols these designs require can result in a brittle and operationally complex environment once you deploy tens of devices.
Hence the rise of the Layer 3 Clos approach: combine many small boxes, each carrying only a subset of your traffic, and run industry-standard protocols with a long history of operational stability and ease of troubleshooting. And while the approach can be applied to many different problems, building a practical implementation is the best way to prove it out. With that in mind, we recently set up a Hadoop cluster, leading to a solution validation guide we are publishing with our new release.
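As a concrete sketch of what this looks like on a leaf switch, each leaf can run one eBGP session per spine uplink and advertise its locally attached server subnet. The Quagga-style configuration below is illustrative only; the ASNs, addresses, and subnets are hypothetical, not taken from the validated design:

```
router bgp 65011
 bgp router-id 10.0.0.11
 ! one eBGP session per spine uplink (peer addresses are hypothetical)
 neighbor 172.16.1.0 remote-as 65001
 neighbor 172.16.2.0 remote-as 65002
 ! advertise the locally attached server subnet into the fabric
 network 10.1.11.0/24
```

Giving each leaf its own ASN keeps paths loop-free by design, and troubleshooting reduces to standard BGP tooling rather than chasing Layer 2 protocol interactions.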
Big Data analytics is becoming increasingly common across businesses of all sizes. With the growth of genomic, geographic, social-graph, search-indexing and other large data sources, a single computer can no longer process these sets in a reasonable time. Distributed processing models like Hadoop have increasingly become the way to approach the data, breaking the processing into steps that can be distributed, along with the data, across the compute nodes.
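The split into distributable steps is the MapReduce model at the heart of Hadoop. The toy word count below (plain Python, no Hadoop involved) is just a sketch of the idea: the map phase runs independently on each node's local slice of the data, and the reduce phase combines the partial results that the shuffle brings together:

```python
from collections import defaultdict
from itertools import chain

def map_phase(lines):
    """Map step: emit (word, 1) pairs for one node's slice of the input."""
    return [(word, 1) for line in lines for word in line.split()]

def reduce_phase(pairs):
    """Reduce step: sum the counts emitted for each word."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Each "node" maps over its local data slice independently...
slices = [["big data on hadoop"], ["hadoop on cumulus"]]
partials = [map_phase(s) for s in slices]

# ...then the shuffle gathers the pairs for the reduce step.
totals = reduce_phase(chain.from_iterable(partials))
```

In a real cluster the shuffle step is where the network matters: intermediate pairs move between nodes, which is exactly the east-west traffic a Clos fabric is built for.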
Many of the Hadoop solutions being published have been built around the assumption that the network is expensive, so they have focused on 1 Gigabit Ethernet attached servers, stressing data locality to keep traffic on the same ToR switch and optimizing to keep traffic off the network. And while even 10 Gigabit Ethernet cannot keep up with locally attached storage, you can now build a low-to-no-oversubscription network fabric at 10 Gigabit and higher. With most Big Data class servers shipping with integrated 10 Gigabit Ethernet on the motherboard (LOM), the price of such a fabric rivals solutions built around 1 Gigabit Ethernet, and it frees your Big Data results from being so tightly tied to the locality of the data in your environment.
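Oversubscription here is just the ratio of server-facing bandwidth to uplink bandwidth on a leaf. A back-of-the-envelope calculation, using hypothetical port counts rather than figures from the validated design:

```python
# Hypothetical leaf switch: 48 x 10G server-facing ports, 4 x 40G uplinks.
downlink_gbps = 48 * 10   # 480 Gb/s toward the servers
uplink_gbps = 4 * 40      # 160 Gb/s toward the spines

ratio = downlink_gbps / uplink_gbps
print(f"{ratio}:1 oversubscription")  # prints "3.0:1 oversubscription"
```

Driving that ratio toward 1:1 (fewer attached servers per leaf, or more uplinks) is what lets the shuffle traffic cross racks without the fabric becoming the bottleneck.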
When it came to building a network for Hadoop, we chose the enterprise-grade Hortonworks Data Platform (HDP) as the platform to stand up and test for a new validated solution. Hortonworks, a major contributor to open source initiatives (Apache Hadoop, HDFS, Pig, Hive, HBase, ZooKeeper), has extensive experience managing production-level Hadoop clusters. Given the open nature of both Cumulus Linux and HDP, we were able to stand up our environment quickly and validate operations on the topology. As a follow-on to this project, by combining in the automation power of tools like Ansible, we will have a demo in the Cumulus Workbench showing how you can automate the environment, on both the network and the servers, and deploy it with a single tool. Keep your eyes out for it.
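Because Cumulus Linux switches are just Linux hosts, a tool like Ansible can manage them alongside the servers. The fragment below is only a sketch of that idea; the inventory group, package, and template names are assumptions, not taken from the upcoming demo:

```yaml
# Illustrative Ansible playbook sketch; names are hypothetical.
- hosts: leaf_switches
  tasks:
    - name: Install the Quagga routing suite
      apt: name=quagga state=present
    - name: Push the per-switch BGP configuration
      template: src=bgpd.conf.j2 dest=/etc/quagga/bgpd.conf
      notify: restart quagga
  handlers:
    - name: restart quagga
      service: name=quagga state=restarted
```

The same playbook style works against the Hadoop servers, which is what makes a single-tool deployment across network and compute plausible.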
Read more here: Cumulus Networks