Lessons Learned from Scaling Up Cloudflare’s Anomaly Detection Platform

Lessons Learned from Scaling Up Cloudflare’s Anomaly Detection Platform

Introduction to Anomaly Detection for Bot Management

Cloudflare’s Bot Management platform follows a “defense in depth” model. Although each layer of Bot Management has its own strengths and weaknesses, the combination of many different detection systems — including Machine Learning, rule-based heuristics, JavaScript challenges, and more — makes for a robust platform in which different detection systems compensate for each other’s weaknesses.

One of these systems is Anomaly Detection, a platform motivated by a simple idea: because bots are made to accomplish specific goals, such as credential stuffing or content scraping, they interact with websites in distinct and difficult-to-disguise ways. Over time, the actions of a bot are likely to differ from those of a real user. Anomaly detection aims to model the characteristics of legitimate user traffic as a healthy baseline. Then, when automated bot traffic is set against this baseline, the bots appear as outlying anomalies that can be targeted for mitigation.

An anomaly detection approach is:

  • Resilient against bots that try to circumvent protections by spoofing request metadata (e.g., user agents)
  • Able to catch previously unseen bots without being explicitly trained against them.

So, how well does this work?

Today, Anomaly Detection processes more than 500K requests per second. This translates to over 200K CAPTCHAs issued per minute, not including traffic that’s already caught by other bot management systems or traffic that’s outright blocked. These suspected bots originate from over 140 different countries and 2,200 different ASNs. And all of this happens using automatically generated baselines and visitor models which are unique to every enrolled site — no cross-site analysis or manual intervention required.

How Anomaly Detection Identifies Bots

Anomaly Detection uses an algorithm called Histogram-Based Outlier Scoring (HBOS) to detect anomalous traffic in a scalable way. While HBOS is less precise than algorithms like kNN when it comes to local outliers, it is able to score global outliers quickly (in linear time).

There are two parts to every behavioral bot detection: the site-specific baseline and the visitor-specific behavior model. We make heavy use of ClickHouse, an open-source columnar database, for storing and analyzing enormous amounts of data. When a customer opts-in to Anomaly Detection, we begin aggregating traffic data in ClickHouse to form a unique baseline for their site.

To understand visitor behavior, we maintain a sliding window of ephemeral feature aggregates in-memory using Redis. HyperLogLog data structures allow us to efficiently store and estimate unique counts of these high-cardinality features. Because these data are privacy-sensitive, they are retained only within a recent time window and are specific to each opted-in site. This makes efficient data representations due to the resulting high cardinality of the problem space.

The output of each detection run is an outlier score, representing how anomalous a visitor’s behavior is when viewed against the baseline for that particular site. This outlier score feeds into the final bot score calculation for use on the edge.

The Anomaly Detection Platform

Lessons Learned from Scaling Up Cloudflare’s Anomaly Detection Platform

The Anomaly Detection platform consists of a series of microservices running on Kubernetes. Request data come in through a dedicated Kafka topic and are inserted to both ClickHouse and Redis.

The Detector service lazily retrieves (and caches) baseline data from the Baseline service and calculates outlier scores for visitors compared to the baselines. These scores are then written to another Kafka topic to be persisted for later analysis.

Finally, the Publisher service collects batches of detections (visitors whose behavior is anomalous or bot-like) and sends them out to the edge to be applied as part of our bot score calculations.

Each microservice runs independently and tolerates downtime from its dependencies. They are also sized very differently: some services require dozens of replicas and gigabytes of memory, while others are much cheaper.

Today, the Anomaly Detection platform handles nearly 500k requests per second across ~310M unique visitors, representing 2x growth over the last six months. But once upon a time, we struggled to handle even 10K rps.

The story of our growth is also the story of how we adapted, redesigned, and improved our infrastructure in order to respond to the corresponding increases in resource demand, customer support requests, and maintenance challenges.

Launch, then Iterate

In an earlier incarnation, most of the core Anomaly Detection logic was contained in a single (replicated) service running under Kubernetes.

Lessons Learned from Scaling Up Cloudflare’s Anomaly Detection Platform

Each service pod fulfilled multiple responsibilities: generating behavioral baselines from ClickHouse data, aggregating visitor profiles from Redis, and calculating outlier scores. These outlier detections were then forwarded to the edge by piggybacking on another bot management system’s existing integration with Quicksilver, Cloudflare’s replicated key-value store.

This simple design was easy to understand and implement. It also reused existing infrastructure, making it perfect for a v1 deployment in keeping with Cloudflare’s culture of fast iteration. Of course, it also had some shortcomings:

  • A monolithic service meant a single (logical) point of failure.
  • From a resource (CPU, memory) perspective, it was difficult to scale pieces of functionality independently.
  • The “reused” integration with Quicksilver was never meant to support something like Anomaly Detection, causing instability for both systems.

It’s easy in hindsight to focus on the flaws of an existing system, but it’s important to keep in mind that priorities can and should evolve over time. A design that doesn’t meet today’s needs was likely suited to the needs of yesterday.

One key benefit of a launch-and-iterate approach is that you get a concrete, real-world understanding of where your system needs improvement. Having a real system to observe means that improvements are both targeted and measurable.

Tuning Redis

As mentioned above, Redis is a key part of the Anomaly Detection platform, used to store and aggregate features about site visitors. Although we only keep these data in a sliding window, the cardinality of the set is very large (per visitor per site). For this reason, many of the early improvements to Anomaly Detection performance centered on Redis. In fact, the first launch of Anomaly Detection struggled to keep up with only 10k requests per second.

Profiling revealed that load was centered on our heavy use of the Redis PFMERGE command for merging HyperLogLog structures. Unlike most Redis commands, which are O(1), PFMERGE runs in linear time (proportional to the number of features * window size). As the demand for scoring increased, this proved to be a serious bottleneck.

To resolve this problem, we looked for ways to further optimize our use of Redis. One idea was lowering the threshold for promoting a sparse HyperLogLog representation to a dense one – trading memory for compute, as dense HyperLogLogs are generally faster to merge.

However, as is so often the case, a big win came from a simple idea: we introduced a “recency register,” essentially a cache that placed a time bound on how often we would run expensive detection logic on a given site-visitor pair. Since behavior patterns need to be established over a time window, the potential extra detection latency from the recency register was not a significant concern. This straightforward solution was enough to raise our throughput by an order of magnitude.

Working with Redis involves a lot of balancing between memory and compute resources. For example, our Redis shards’ memory sizes were empirically determined based on the observed CPU utilization when reaching memory bounds. A higher memory bound meant more visitors tracked per shard and thus more commands per second. The fact that Redis shards are single-threaded made reasoning about these situations easier as well.

As the number of features and visitors grew, we discovered that “standard” Redis recommendations didn’t always work for us in practice. Redis typically recommends using human-readable keys, even if they are longer.

Lessons Learned from Scaling Up Cloudflare’s Anomaly Detection Platform

However, by encoding our keys in a compact, binary-encoded format, we observed roughly 30% memory savings, as well as CPU savings — which again demonstrates the value of iterating on a real-world system.

Moving to Microservices

As Anomaly Detection continued to grow in adoption, it became clear that optimizing individual pieces of the pipeline was no longer sufficient: our platform needed to scale up as well. But, as it turns out, scaling isn’t as simple as requesting more resources and running more replicas of whatever service isn’t keeping up with demand. As we expanded, the amount of load we placed on external (shared!) dependencies like ClickHouse grew. The way we piggybacked on Quicksilver to send updates to the edge coupled two systems together in a bloated and unreliable way.

So we set out to do more with less – to build a more resilient system that would also be a better steward of Cloudflare’s shared resources.

The idea of a microservice-based architecture was not a new one; in fact, even early Anomaly Detection designs suggested the eventual need for such a migration. But real-world observations indicated that the redesign was now fully worth the time investment.

Why did we think moving to microservices would help us solve our scalability issues?

First, we observed that a large contributor to our load on ClickHouse was repeated baseline aggregation. Because each replica of the monolithic Anomaly Detection service calculated its own copies of the baseline profiles, our pressure on ClickHouse would increase each time we horizontally scaled our service deployment. What’s more, this work was essentially duplicated. There was no reason each replica should need to recalculate copies of the same baseline. Moving this work to a dedicated baseline service cut out the duplication to the tune of a 10x reduction in load from this particular operation.

Secondly, we noticed that part of our use case (accept a stream of data from Kafka, apply simple transformations, and persist batches of this data to ClickHouse) was a pretty common one at Cloudflare. There already existed robust, battle-tested inserter code for launching microservices with exactly this pattern of operation. Adapting this code to suit our needs not only saved us development time, but brought us more in line with wider convention.

We also learned the importance of concrete details during design. When we initially began working on the redesign of the Anomaly Detection platform, we felt that Kafka might have a role to play in connecting some of our services. Still, we couldn’t fully justify the initial investment required to move away from the RESTful interfaces we already had.

The benefits of using Kafka only became clear and concrete once we committed to using ClickHouse as the storage solution for our outlier score data. ClickHouse performs best when data is inserted in larger, less frequent batches, rather than rapid, small operations, which create a lot of internal overhead. Transporting outlier scores via Kafka allowed us to batch updates while being resilient to data loss during transient downtime.

The Future

It’s been a journey getting to this point, but we’re far from done. Cloudflare’s mission is to help make the Internet better for everyone – from small websites to the largest enterprises. For Anomaly Detection, this means expanding into a problem space with huge cardinality: roughly the cross-product of the number of sites and the number of unique visitors. To do this, we’re continuing to improve the efficiency of our infrastructure through smarter traffic sampling, compressed baseline windows, and more memory-efficient data representations.

Additionally, we want to deliver even better detection accuracy on sites with multiple distinct traffic types. Traffic coming to a site from web browsers behaves quite differently than traffic coming from mobile apps — but both sources of traffic are legitimate. While the HBOS outlier detection algorithm is lightweight and efficient, there are alternatives which are more performant in the presence of multiple traffic profiles.

One of these alternatives is local outlier factor (LOF) detection. LOF automatically builds baselines that capture “local clusters” of behavior corresponding to multiple traffic streams, rather than a single site-wide baseline. These new baselines allow Anomaly Detection to better distinguish between human use of a web browser and automated abuse of an API on the same site. Of course, there are trade-offs here as well: generating, storing, and working with these larger and more sophisticated behavioral baselines requires careful and creative engineering. But the reward for doing so is enhanced protection for an even wider range of sites using Anomaly Detection.

Finally, let’s not forget the very human side of building, supporting, and expanding Anomaly Detection and Bot Management. We’ve recently launched features that speed up model experimentation for Anomaly Detection, allow us to run “shadow” models to record and evaluate performance behind the scenes, give us instant “escape hatches” in case of unexpected customer impact, and more. But our team — as well as the many Solutions Engineers, Product Managers, Subject Matter Experts, and others who support Bot Management — are continuing to invest in improved tooling and education. It’s no small challenge, but it’s an exciting one.

Source:: CloudFlare