Crawler Hints: How Cloudflare Is Reducing The Environmental Impact Of Web Searches

Crawler Hints: How Cloudflare Is Reducing The Environmental Impact Of Web Searches

Cloudflare is known for innovation, for needle-moving projects that help make the Internet better. For Impact Week, we wanted to take this approach to innovation and apply it to the environmental impact of the Internet. When it comes to tech and the environment, it’s often assumed that the only avenue tech has open to it is harm mitigation: for example, climate credits, carbon offsets,  and the like. These are undoubtedly important steps, but we wanted to take it further — to get into harm reduction. So we asked — how can the Internet at large use less energy and be more thoughtful about how we expend computing resources in the first place?

Cloudflare has a global view into the traffic of the Internet. More than 1 in 6 websites use our network, and we observe the traffic flowing to and from them continuously. While most people think of surfing the Internet as a very human activity, nearly half of all traffic on the global network is generated by automated systems.

We’ve analyzed this automated traffic, from so-called “bots,” in order to understand the environmental impact. Most of the bot traffic is malicious. Cloudflare protects our clients from this malicious traffic and, in doing so, mitigates their environmental impact. If these bots were not stopped by Cloudflare, they would generate database requests and force dynamic page generation on services far less efficient than Cloudflare’s network.

We even went a step further, committing to plant trees to offset the carbon cost of our bot mitigation services. While we’d love to be able to tell the bad actors to think of the environment and stop running their bots, we don’t think they’d listen. So, instead, we aim to mitigate them as efficiently as possible.

But there’s another type of bot that we don’t want to go away: good bots that index the web for useful reasons. These good bots represent more than 5% of global Internet traffic. The majority of this good bot traffic comes from what are known as search engine crawlers, and they are critical to making the web navigable.

Large-Scale Problems, Large-Scale Opportunities

Online search remains magical. Enter a query into a box on a search engine like Google, Bing, Yandex, or Baidu and, in a fraction of a second, get a list of web resources with information on whatever you’re looking for. To make this magic happen, search engines need to scour the web and, simplistically, make a copy of its contents that are stored and sorted on their own systems to be quickly retrieved whenever needed.

Companies that run search engines have worked hard to make the process as efficient as possible, pushing the state-of-the-art in terms of server and data center efficiency. But there remains one clear area of waste: excessive crawl.

At Cloudflare, we see traffic from all the major search crawlers. We’ve spent the last year studying how often these good bots revisit a page that hasn’t changed since they last saw it. Every one of these visits is a waste. And, unfortunately, our observation suggests that 53% of this good bot traffic is wasted.

The Boston Consulting Group estimates that running the Internet generated 2% of all carbon output, or about 1 billion metric tonnes per year. If 5% of all Internet traffic is good bots, and 53% of that traffic is wasted by excessive crawl, then finding a solution to reduce excessive crawl could help save as much as 26 million tonnes of carbon cost per year. According to the U.S. Environmental Protection Agency, that’s the equivalent of planting 31 million acres of forest, shutting down 6 coal-fired power plants forever, or taking 5.5 million passenger vehicles off the road.

Obviously, it’s not quite that simple. But suffice it to say there’s a big opportunity to make a meaningful impact on the environmental cost of the Internet if we are able to ensure that any search engine only crawls once or whenever it changes.

Recognizing this problem, we’ve been talking with the largest operators of good bots for the last several years to see if, together, we could address the issue.

Crawler Hints

Today, we’re excited to announce Crawler Hints. Crawler Hints provide high quality data to search engine crawlers on when content has been changed on sites using Cloudflare, allowing them to precisely time their crawling, avoid wasteful crawls, and generally reduce resource consumption of customer origins, crawler infrastructure, and Cloudflare infrastructure in the process. The cherry on top: because search engine crawlers now receive signals on when content is fresh, the search experiences powered by these “good bots” will improve, delighting Internet users at large with more relevant and useful content. Crawler Hints is a win for the Internet and a win for the Internet’s energy footprint.

With Crawler Hints, we expect to make crawling a bit more tractable by providing an additional heuristic to bot developers that will allow them to know when content has been changed or added to a site instead of relying on preferences or previous changes that might not reflect the true change cadence for a site.

How will this work?

At its simplest we want a way to proactively tell a search engine when a page has changed, rather than having to wait for the search engine to discover a change has happened. Search engines actually typically have a few ways to tell them about when an individual page or group of pages changes.

For example, you can ask Google to recrawl a website, and they’ll do so in “a few days to a few weeks”.

If you wanted to efficiently tell Google about changes you’d have to keep track of when Google last crawled the page and tell them to recrawl when a change happens. You wouldn’t want to tell Google every time a page changes as there’s a time delay between requesting a recrawl and the spider coming to visit. You could be telling Google to come back during the gap between the request and the spider coming to call.

And there isn’t just one search engine and new search crawlers get created. Trying to keep search engines up to date as your site changes, efficiently, would be messy and very difficult. This is, in part, because this model does not contain explicit information about when something changed.

This model just doesn’t work well. And that’s partly why search engine crawlers inevitably waste energy recrawling sites over and over again regardless of whether there is something new to find.

However, there is an existing mechanism used by search engines to discover the structure of websites that’s perfect: the sitemap. The sitemap is a well-defined, open protocol for telling a crawler about the pages on a site, when they last changed and how often they are likely to change.

Sitemaps have some limitations (on number of URLs and bytes) but do have a mechanism for large sites with millions of URLs. But building sitemaps can be complex and require special software. Getting a consistent, up to date sitemap for a website (especially one that uses different technologies) can be very hard.

That’s where Cloudflare comes in. We see what pages our customers are serving, we know which ones have changed (either by hash value or timestamp) and so can automatically build a complete record of when and which pages have changed.

And we can keep track of when a search crawler visited a particular page and only serve up exactly what changed since last time. Since we can keep track of this on a per-search engine basis it can be very efficient. Each search engine gets its own automagically updated list of URLs or sitemap of just what’s changed since their last visit.

And it adds absolutely no load to the origin website. Cloudflare can tell a search engine in almost real-time about a page’s modifications and provide a view of what changed since their last visit.

The sitemaps protocol also contains a priority for a page. Since we know how often a page is visited we can also hint to a search engine that a page is seen frequently by visitors and thus may be more important to add to the index than another page.

There are a few details to work out, such as how a search engine should identify itself to get its personalized list of URLs, but the protocol is open and in no way depends on Cloudflare. In fact, we hope that every host and Cloudflare-like service will consider implementing the protocol. We plan to continue to work with the search and hosting communities to refine the protocol in order to make it more efficient. Our goal is to ensure that search engines can have the freshest index, content creators will have their new content optimally indexed, and a big chunk of unnecessary Internet traffic, and the corresponding carbon cost, will disappear.

Conclusion

Crawler Hints doesn’t just benefit search engines. For our customers and origin owners, Crawler Hints will ensure that search engines and other bot-powered experiences will always have the freshest version of your content, translating into happier users and ultimately influencing search rankings. Crawler Hints will also mean less traffic hitting your origin, improving resource consumption and limiting carbon impact. Moreover, your site performance will be improved as well: your human customers will not be competing with bots!

And for Internet users? When you interact with bot-fed experiences — which we all do every day, whether we realize it or not, like search engines or pricing tools — these will now deliver more useful results from crawled data, because Cloudflare has signaled to the owners of the bots the moment they need to update their results.

Finally, and perhaps the one we’re most excited about, for the Internet more generally: it’s going to be greener. Energy usage across the web will be greatly reduced.

Win win win. The types of outcomes that bring us to work every day, and what we think of in helping to build a better Internet.

This is an exciting problem to solve, and we look forward to working with others that want to help the Internet be more efficient and performant while reducing needless energy consumption. We plan on having more news to share on this front soon. If you operate a bot that relies on content freshness and are interested in working with us on this project, please email crawlerhints@cloudflare.com.

Yandex prioritizes long-term sustainability over short-lived success, and joins the global community in its pursuit of climate change mitigation. As a part of its commitment to quality service and user experience, Yandex focuses on ensuring relevance and usability of search results. We believe that a Cloudflare’s solution will strengthen search performance by improving the accuracy of returned results, and look forward to partnering with Cloudflare on boosting the efficiency of valuable bots across the Internet.“DuckDuckGo is supportive of anything that makes search more environmentally friendly and better for end users without harming privacy. We’re looking forward to working with Cloudflare on this proposal.”
Gabriel Weinberg, CEO and Founder, DuckDuckGo.Nearly a year ago (the Internet Archive’s Wayback Machine partnered with Cloudflare) to help power their “Always Online” service and, in turn, to have the Internet Archive learn about high-quality Web URLs to archive. That win-win partnership has been a huge success for the Wayback Machine and, in turn, our partners, as it has helped ensure we better fulfill our mission to help make the Web more useful and reliable by backing up, and making available for future generations, much of the public Web. Building on that ongoing relationship with Cloudflare, the Internet Archive is thrilled to start using this new “Crawler Hints” service. With it, we expect to be able to do more with less. To be able to focus our server and bandwidth resources on more of the Web pages that have changed, and less on those that have not. We expect this will have a material impact on our work. The fact the service also promises to reduce the carbon impact of the Web overall makes it especially worthwhile and, as such, we are proud to be part of the effort.
Mark Graham, Director, the Wayback Machine at the Internet Archive

Source:: CloudFlare