I Wanna Go Fast – Load Balancing Dynamic Steering

Earlier this month we released Dynamic Steering for Load Balancing, which lets your Cloudflare load balancer direct traffic to the fastest pool for a given Cloudflare region or colo (Enterprise only).

To build this feature, we had to solve two key problems: 1) How to decide which pool of origins was the fastest and 2) How to distribute this decision to a growing group of 151 locations around the world.

Distance, Approximate Latency, and a Better Way

As my math teacher taught me, the shortest distance between two points is a straight line. This is also typically true on the internet: the shorter the approximate distance between a user going through a Cloudflare location and a customer origin, the better the experience is for the user. Geography is one way to approximate speed, and we included the Geo Steering function when we initially introduced the Cloudflare Load Balancer. It is powerful, but manual; it’s not the best way. A customer on Twitter said it best:

@Cloudflare #FeatureRequest why can’t your load balancers determine which server is closest to the user then direct them to that one?

I don’t want to have configure 10+ regions manually. This feels like something that should be built in? Am I missing it?

cc: @eastdakota

— Adam Evers 📍 OAK / SFO (@adamevers) March 30, 2018

A Brief Refresher on Cloudflare Load Balancing

Cloudflare’s Load Balancers are composed of origins, pools, and health checks. Origins are IPs or hostnames from which our customers serve content. Pools are collections of origins, usually grouped along some dimension such as geography, cloud service provider, or a combination thereof (e.g. a pool named GCP-West-1 may contain a customer’s origins in Google Cloud’s Oregon west1 region). Finally, there are health checks: probes that our customers configure against their pools and origins to identify whether a given pool or origin is up or down. These health checks allow Cloudflare load balancers to quickly identify downed origins and fail over from them, probing from a network of systems that can map to the customer’s user base.

Measuring and Determining “Fast”

The first decision we faced was when and how to measure speed. We already probe at regular intervals for uptime from the Cloudflare locations that our customers tell us are relevant for their setup. It was an obvious choice to use our existing health checks and gather the round trip time (RTT) from there.

As pool origins are probed periodically, we get RTT information from the edge. The next question was how to use this data to decide which pool is the fastest: we decided to calculate the pool RTT using an Exponentially Weighted Moving Average (EWMA).

Why did we choose EWMA?

We considered other ways to calculate the RTT, such as a Simple Moving Average (SMA). Although the calculation is much simpler with SMA, we chose EWMA because it responds to RTT changes faster than SMA, since it applies more weight to the most recent RTT. It can also reduce noise and make the trend clearer in a dataset with large variance. Another benefit of EWMA is that it stays truer to the trend than other types of moving averages, some of which over- or under-correct while others smooth things out too much.
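
To illustrate the responsiveness argument, here is a minimal Python sketch that feeds the same RTT series to both averages. It is not our production code: the function names, the sample values, and the choice of alpha = 2 / (window + 1) (a common convention for comparing an EWMA with an SMA of a given window) are all assumptions made for the example.

```python
from collections import deque


def sma_series(samples, window):
    """Simple moving average over the last `window` samples."""
    buf = deque(maxlen=window)
    out = []
    for s in samples:
        buf.append(s)
        out.append(sum(buf) / len(buf))
    return out


def ewma_series(samples, alpha):
    """EWMA where each new sample gets weight `alpha` (0 < alpha <= 1)."""
    out = []
    avg = None
    for s in samples:
        # Seed with the first sample, then move a fraction alpha toward each new one.
        avg = s if avg is None else avg + alpha * (s - avg)
        out.append(avg)
    return out


# Hypothetical probe results: RTT jumps from 20 ms to 80 ms for the last 3 probes.
rtts_ms = [20.0] * 10 + [80.0] * 3
window = 10
sma = sma_series(rtts_ms, window)
ewma = ewma_series(rtts_ms, alpha=2 / (window + 1))

print(f"SMA after the jump:  {sma[-1]:.1f} ms")
print(f"EWMA after the jump: {ewma[-1]:.1f} ms")
```

Three probes after the jump, the SMA has only reached 38 ms while the EWMA is already at roughly 47 ms, which is the faster reaction described above.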

How does EWMA work?

EWMA works by applying weights to the data in such a way that older data weighs less (and therefore has less impact on the result) than more recent data. The weight of a datapoint decreases exponentially for each time period further in the past. The rate of exponential decay is determined by the time bias parameter. When the time bias is set to 1 minute, about 63.2% of the value comes from the last minute’s measurements, 23.3% from the minute before that (0.233 ≈ (1 – 0.632) * 0.632), and so on. Because the weight decreases exponentially with each passing minute, historical data older than t minutes has a combined weight of 1 / exp(t). The most recent minute has weight 63.2%, since 63.2% ≈ 1 – 1 / exp(1).
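
The percentages above can be reproduced with a few lines of Python. This is a sketch under stated assumptions: `bucket_weights` and `ewma_update` are hypothetical helper names, and the update rule shown is one standard way to express a time-biased EWMA, not necessarily the exact form we run.

```python
import math


def bucket_weights(time_bias_minutes, n_buckets):
    """Share of the average contributed by each one-minute bucket, newest first."""
    # Data older than t minutes carries a combined weight of exp(-t / time_bias).
    older_than = [math.exp(-t / time_bias_minutes) for t in range(n_buckets + 1)]
    # The weight of a bucket is what remains between two consecutive cut-offs.
    return [older_than[i] - older_than[i + 1] for i in range(n_buckets)]


def ewma_update(previous, sample, elapsed_minutes, time_bias_minutes=1.0):
    """Fold one new sample into the average, weighting it by the elapsed time."""
    alpha = 1.0 - math.exp(-elapsed_minutes / time_bias_minutes)
    return previous + alpha * (sample - previous)


print([round(w, 3) for w in bucket_weights(1.0, 3)])  # [0.632, 0.233, 0.086]
print(round(ewma_update(100.0, 200.0, elapsed_minutes=1.0), 1))  # 163.2
```

With a 1-minute time bias, a sample arriving one minute after the previous one pulls the average 63.2% of the way toward the new value, matching the weight of the most recent minute.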

Actual Implementation

For every load balancer that has Dynamic Steering enabled, the RTT is calculated independently for each of its pools using an EWMA. We wait for a period of time (default is 10m, but this is configurable) before writing the calculated pool RTT values to our internal key-value store, QuickSilver (QS). This is done to build the RTT profile, which helps reduce the noise in cases of high-variance data. From then on, we keep writing the values periodically (default 10m, again tunable), but only if the RTT value has changed, to avoid unnecessary writes to QuickSilver.
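
The warm-up and write-on-change behaviour described above can be sketched as follows. Everything here is illustrative: `PoolRttPublisher` is a hypothetical class, a plain dict stands in for QuickSilver, the fixed `alpha` is a simplification, and rounding to whole milliseconds is an assumed definition of "a change in RTT value".

```python
class PoolRttPublisher:
    """Tracks per-pool EWMAs for one load balancer and publishes them sparingly."""

    def __init__(self, store, warmup_s=600.0, interval_s=600.0, alpha=0.2):
        self.store = store            # dict standing in for the key-value store
        self.warmup_s = warmup_s      # build an RTT profile before the first write
        self.interval_s = interval_s  # minimum gap between later writes
        self.alpha = alpha            # assumed smoothing factor
        self.ewma = {}                # pool id -> current EWMA of RTT in ms
        self.first_sample_at = None
        self.last_flush_at = None

    def observe(self, pool_id, rtt_ms, now):
        """Record one health-check RTT; `now` is a timestamp in seconds."""
        if self.first_sample_at is None:
            self.first_sample_at = now
        prev = self.ewma.get(pool_id)
        self.ewma[pool_id] = rtt_ms if prev is None else prev + self.alpha * (rtt_ms - prev)
        self._maybe_flush(now)

    def _maybe_flush(self, now):
        if now - self.first_sample_at < self.warmup_s:
            return  # still warming up
        if self.last_flush_at is not None and now - self.last_flush_at < self.interval_s:
            return  # too soon since the last flush
        self.last_flush_at = now
        for pool_id, value in self.ewma.items():
            rounded = round(value)
            if self.store.get(pool_id) != rounded:  # skip writes that change nothing
                self.store[pool_id] = rounded


store = {}
publisher = PoolRttPublisher(store)
publisher.observe("pool-a", 40.0, now=0.0)     # warm-up: nothing written
publisher.observe("pool-a", 40.0, now=600.0)   # warm-up over: first write
publisher.observe("pool-a", 100.0, now=900.0)  # inside the interval: no write
print(store)  # {'pool-a': 40}
```

Keeping the comparison on a rounded value is one simple way to stop tiny fluctuations from triggering a write on every interval.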

Data Propagation

To make sure that Dynamic Steering is as performant as possible, all data we use for steering decisions needs to be as close as possible to every machine serving requests. When it comes down to delivering responses as fast as possible, requesting data from another machine – even in the same datacenter – can add non-trivial overhead.

We run a custom in-house key-value store on every machine servicing requests. The main advantage of this datastore lies in how its replication logic takes advantage of the hierarchical nature of our network layout to facilitate faster replication while transferring less data.

Since we keep a copy of the data on every machine in every data center, we need to make sure our dataset is as small as possible. We evaluated what additional data we actually needed to select a pool inside a Load Balancer configured with Dynamic Steering. Currently the only information we propagate is a map of the pool identifier to the EWMA.

Eyeball Experience

Internally at Cloudflare we often talk about eyeballs: the actual visitors of a site, clicking away in their browsers, and their experience of the process. Let’s say you’ve set up three pools around the world: North America, Europe, and Australia. With Dynamic Steering, we will route your traffic to the pool with the lowest EWMA. Assuming all your pools are in good health and reporting expected RTT values, an eyeball’s request should be routed to whichever pool is fastest from the Cloudflare location handling it.
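
As a rough sketch of that selection step, the Python below picks the healthy pool with the lowest EWMA from a map of pool identifier to RTT. The function name `pick_pool`, the pool names, and the RTT figures are invented for the example, and how unhealthy pools are excluded is an assumption about behaviour the text only implies.

```python
from typing import Dict, Optional, Set


def pick_pool(rtt_by_pool: Dict[str, float], healthy: Set[str]) -> Optional[str]:
    """Return the healthy pool with the lowest EWMA RTT, or None if there is none."""
    candidates = {pool: rtt for pool, rtt in rtt_by_pool.items() if pool in healthy}
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


# Hypothetical EWMA values (ms) as seen from a location near Australia.
rtt_ms = {"north-america": 160.0, "europe": 280.0, "australia": 12.0}

print(pick_pool(rtt_ms, {"north-america", "europe", "australia"}))  # australia
print(pick_pool(rtt_ms, {"north-america", "europe"}))               # north-america
```

Because the only input is a small map per load balancer, the lookup stays cheap enough to run on the machine serving the request.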

Trying It Out

All Enterprise customers and customers with the Geo Routing add-on for Load Balancing have access to Dynamic Steering. To enable Dynamic Steering, select the option in your Load Balancing traffic steering configuration. Please see the KB article or your Cloudflare account team for more information.

Interested in helping us go faster?

The Cloudflare Load Balancing and DNS Engineering teams are hiring in San Francisco and London.

Backend Systems Engineer San Francisco
Backend Systems Engineer London
Software Engineer London

Source:: CloudFlare

Fake products? Only AI can save us now.

Half a trillion dollars.

That’s the rough amount of money that counterfeiters displaced last year by selling phony products. Some 2.5% of all trade is for fake goods.

The United States is hit hardest by the scourge of counterfeit products — U.S. brands accounted in 2013 for 20% of the world’s infringed intellectual property.

When most people think about counterfeiting, they think of knock-off Louis Vuitton handbags sold on the sidewalk. But fake products also include business and enterprise products, as well as everyday consumer goods.

Source:: IT news – Security

Kernel of Truth Episode 04 — Cisco, disaggregation and the industry impact

Subscribe to Kernel of Truth on iTunes, Google Play, Spotify, Castbox and Stitcher!

Click here for our previous episode.

On March 27, 2018, Cisco announced it was embracing disaggregation of the data center by allowing customers to run NX-OS on third-party switches and to use any network operating system on its Nexus switches. It’s certainly an interesting move, considering that they’re the company that claimed to have killed white-box networking.

…But does this model REALLY fit the definition of network disaggregation? What does true data center disaggregation look like? Why did Alanis Morissette name the song “Ironic” when none of the lyrics are examples of irony?? To answer these questions, I invited Ben Ritter (Consulting Engineer, Cumulus Networks) and Rama Darbha (Senior Consulting Engineer, who you’ll remember from our second episode; get ready for more #RamaRants!) into the recording booth so we can get to the bottom of this. In addition to breaking down the definition of data center disaggregation, Rama, Ben and I go full John Lennon and imagine a perfect world, where Cisco actually embraces the true spirit of disaggregation. How would this impact the industry? Imagine there’s no black box…it’s easy if you try (those are the lyrics to “Imagine,” right?). We get into an interesting discussion about what will happen when the world of data center networking fully makes that transition to open systems, so you’ll definitely want to tune in!

On another note, I’ve got some good news! If you’re like me and Spotify is your preferred podcasting platform, you’re in luck — Kernel of Truth is now available on Spotify! Make sure to subscribe so you stay updated on current episodes.

If you’ve got any questions, feedback or topics you want us to discuss, make sure to Tweet us at @cumulusnetworks and use the hashtag #KernelOfTruth.

Happy listening!

The post Kernel of Truth Episode 04 — Cisco, disaggregation and the industry impact appeared first on Cumulus Networks Blog.

Source:: Cumulus Networks