Googlebot or a DDoS Attack?

This article was originally published on the Sucuri blog.

A bot is a software application that runs automated scripts on the internet. Bots that browse the web this way are also called crawlers or spiders, and they take on the simple yet repetitive tasks that people would otherwise do by hand. There are legitimate bots and malicious ones. A Web Application Firewall (WAF) filters web traffic, blocking malicious bots and letting the good ones pass.

Googlebot is Google’s web crawling bot. Google uses it to discover new and updated pages to be added to the search engine index. Google says:

We use a huge set of computers to fetch (or “crawl”) billions of pages on the web.

Googlebot and DDoS Attacks

Traffic from major search engine bots, such as legitimate Googlebot requests, is not blocked by a Web Application Firewall (WAF) or Intrusion Prevention System (IPS), even though it can resemble a Distributed Denial of Service (DDoS) attack: Googlebot makes repeated requests to a site, which can look like suspicious behavior.

Why Not Block Googlebot?

Googlebot is not blocked by default in our WAF. Websites have lots of content that needs to be indexed, and the WAF's standard rate limit, which blocks repeated requests, would interfere with the indexing process if it were applied to Googlebot.

However, there is a side effect of allowing major search engines to pass through the IPS. If the website lets the same page change its URL and content based on the query string, the search engine crawler could overload the hosting server with requests.

Googlebot Crawling an E-commerce Website

To explain how Googlebot works, I collected data from an e-commerce website that sells many different products and updates them constantly. The website owner might not notice, but Googlebot visits the website a lot.

Here are the top 10 IPs that visited the e-commerce website, each preceded by its request count:

  • 7607: 66.249.73.72
  • 7297: 66.249.73.74
  • 7093: 66.249.73.76
  • 7075: 66.249.73.78
  • 6986: 66.249.73.80
  • 6384: 66.249.73.82
  • 4894: 66.249.73.84
  • 3808: 66.249.73.86
  • 2888: 66.249.73.88
  • 2356: 66.249.73.90

Most of the traffic generated by Googlebot is caused by requests like these:

66.249.73.94 16/Jan/2019:07:57:59 NOT BLOCKED GET 200 /vestidos?cores=20%7C26%7C27%7C75%7C76%7C78%7C79%7C83%7C84%7C89%7C96%7C100%7C102%7C104%7C105%7C113%7C119 CACHES:MISS - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.73.94 16/Jan/2019:07:57:59 NOT BLOCKED GET 200 /vestidos?cores=20%7C26%7C27%7C72%7C75%7C78%7C82%7C83%7C84%7C93%7C98%7C100%7C102%7C104%7C112%7C118%7C121 CACHES:MISS - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.73.92 16/Jan/2019:07:57:59 NOT BLOCKED GET 200 /vestidos?cores=24%7C26%7C27%7C37%7C72%7C75%7C76%7C78%7C83%7C89%7C92%7C93%7C96%7C100%7C101%7C105%7C112%7C113%7C117%7C119 CACHES:MISS - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.73.90 16/Jan/2019:07:57:59 NOT BLOCKED GET 200 /vestidos?cores=24%7C26%7C27%7C37%7C72%7C75%7C79%7C84%7C89%7C93%7C96%7C98%7C99%7C105%7C111%7C112%7C113%7C118%7C119 CACHES:MISS - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.73.90 16/Jan/2019:07:57:59 NOT BLOCKED GET 200 /vestidos?cores=20%7C26%7C27%7C72%7C75%7C78%7C83%7C84%7C99%7C102%7C104%7C105%7C112%7C117%7C118%7C120 CACHES:MISS - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.73.84 16/Jan/2019:07:57:59 NOT BLOCKED GET 200 /vestidos?cores=24%7C26%7C27%7C37%7C72%7C75%7C83%7C86%7C89%7C93%7C96%7C98%7C100%7C101%7C102%7C111%7C117%7C118 CACHES:MISS - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.73.82 16/Jan/2019:07:57:59 NOT BLOCKED GET 200 /vestidos?cores=24%7C26%7C37%7C72%7C75%7C76%7C78%7C83%7C86%7C89%7C92%7C93%7C99%7C100%7C101%7C102%7C104%7C105%7C113%7C117%7C118%7C119%7C120%7C121 CACHES:MISS - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.73.80 16/Jan/2019:07:57:59 NOT BLOCKED GET 200 /vestidos?cores=24%7C27%7C75%7C78%7C83%7C86%7C89%7C92%7C93%7C99%7C100%7C101%7C102%7C105%7C117%7C118%7C119%7C121 CACHES:MISS - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.73.80 16/Jan/2019:07:57:59 NOT BLOCKED GET 200 /vestidos?cores=24%7C27%7C72%7C75%7C76%7C83%7C84%7C86%7C89%7C93%7C98%7C100%7C102%7C105%7C111%7C112%7C121 CACHES:MISS - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.73.78 16/Jan/2019:07:57:59 NOT BLOCKED GET 200 /vestidos?cores=27%7C72%7C75%7C78%7C82%7C83%7C84%7C86%7C92%7C93%7C96%7C98%7C101%7C102%7C104%7C105%7C111%7C112%7C117%7C118%7C119 CACHES:MISS - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.73.78 16/Jan/2019:07:57:59 NOT BLOCKED GET 200 /vestidos?cores=24%7C27%7C72%7C76%7C84%7C93%7C96%7C98%7C100%7C102%7C105%7C111%7C117%7C118%7C120&tamanho=4%7C61 CACHES:MISS - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.73.76 16/Jan/2019:07:57:59 NOT BLOCKED GET 200 /vestidos?cores=20%7C26%7C27%7C72%7C78%7C79%7C83%7C84%7C86%7C89%7C92%7C102%7C105%7C111%7C113%7C118%7C120 CACHES:MISS - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.73.72 16/Jan/2019:07:57:59 NOT BLOCKED GET 200 /vestidos?cores=24%7C26%7C37%7C75%7C78%7C79%7C83%7C84%7C89%7C93%7C99%7C100%7C101%7C102%7C105%7C111%7C117%7C118%7C119&tamanho=4 CACHES:MISS - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.73.72 16/Jan/2019:07:57:59 NOT BLOCKED GET 200 /vestidos?cores=24%7C26%7C27%7C72%7C76%7C78%7C83%7C86%7C89%7C92%7C93%7C98%7C100%7C102%7C105%7C111%7C113%7C117%7C119 CACHES:MISS - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.73.70 16/Jan/2019:07:57:59 NOT BLOCKED GET 200 /vestidos?cores=24%7C27%7C72%7C75%7C83%7C86%7C89%7C92%7C93%7C98%7C100%7C102%7C105%7C111%7C112%7C113%7C119%7C121 CACHES:MISS - Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

This log shows that Googlebot accessed the same page, with different query strings, 15 times in a single second.
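
A per-second count like this can be reproduced from an access log with a short script. The following is a minimal Python sketch that assumes the whitespace-separated layout of the excerpt above (IP first, timestamp second). The sample lines are shortened and partly hypothetical (truncated query strings and user agent, plus one made-up request a second later), and `requests_per_second` is an illustrative name, not part of any WAF tooling.

```python
from collections import Counter

# Shortened, partly hypothetical sample lines in the same layout as the
# log excerpt: IP, timestamp, status fields, path, cache status, user agent.
SAMPLE_LOG = """\
66.249.73.94 16/Jan/2019:07:57:59 NOT BLOCKED GET 200 /vestidos?cores=20%7C26 CACHES:MISS - Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.73.92 16/Jan/2019:07:57:59 NOT BLOCKED GET 200 /vestidos?cores=24%7C26 CACHES:MISS - Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
66.249.73.90 16/Jan/2019:07:58:00 NOT BLOCKED GET 200 /vestidos?cores=24%7C27 CACHES:MISS - Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
"""


def requests_per_second(log_text, agent_marker="Googlebot"):
    """Count requests per timestamp for lines that mention agent_marker."""
    counts = Counter()
    for line in log_text.splitlines():
        fields = line.split()
        if len(fields) < 2 or agent_marker not in line:
            continue
        # The second field is the timestamp, with one-second resolution.
        counts[fields[1]] += 1
    return counts


per_second = requests_per_second(SAMPLE_LOG)
for timestamp, hits in per_second.most_common():
    print(timestamp, hits)
```

Note that matching on the user agent string alone does not prove a request really came from Google, since the string can be spoofed.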

Possible Complications

Since those were very specific requests for an e-commerce website, the WAF does not have a cached version of the pages. Therefore, the web server had to handle 15 concurrent requests from Googlebot.

Since these requests are product variations, each time Googlebot requested the same page with a different query string, database queries and code processing were required. Imagine that 15 times in a second.
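
To see why a cache rarely helps here, note that every distinct combination of filter values is a distinct URL. The Python sketch below is a rough illustration, not a measurement from this site. It decodes the `cores` values of two requests taken from the log above, then counts how many non-empty combinations a set of `n` color IDs could produce, assuming the site accepts any combination. The figure of 30 color IDs is an assumption for illustration only.

```python
from urllib.parse import urlsplit, parse_qs

# Two request paths taken from the log excerpt above.
paths = [
    "/vestidos?cores=24%7C27%7C75%7C78%7C83%7C86%7C89%7C92%7C93%7C99%7C100%7C101%7C102%7C105%7C117%7C118%7C119%7C121",
    "/vestidos?cores=24%7C27%7C72%7C75%7C76%7C83%7C84%7C86%7C89%7C93%7C98%7C100%7C102%7C105%7C111%7C112%7C121",
]


def color_ids(path):
    """Return the set of color IDs in the 'cores' parameter (%7C decodes to '|')."""
    query = urlsplit(path).query
    values = parse_qs(query).get("cores", [""])
    return frozenset(values[0].split("|")) - {""}


def possible_combinations(n):
    """Number of non-empty subsets of n filter values."""
    return 2 ** n - 1


# Same page, but each combination is a separate URL and a separate cache key.
combos = {color_ids(p) for p in paths}
print(len(combos), "distinct combinations in", len(paths), "requests")

# Assumed number of color IDs, for illustration only.
print(possible_combinations(30), "possible combinations for 30 color IDs")
```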

Look at the hosting server usage. The website being accessed by Googlebot has a powerful hosting server and yet, the resource usage has been skyrocketing:

[Figure: CPU Usage]
[Figure: Network I/O]

How to Solve Website Overload

Here are some approaches to solving the website overload caused by the Googlebot crawler:

Change the Product Selection

To solve this specific issue, the ideal approach would be to change how the product selection is built on the website. Instead of generating query strings, each product variation should get its own URL path, such as:

"/dress?color=blue" becomes "/dress-blue/"

Fixing the way the products are presented in the e-commerce website would be the best approach. The search engine would still be able to index the product variation correctly and it would also improve the website SEO without overloading it.
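
A minimal Python sketch of that rewrite follows. It assumes simple, URL-safe filter values, and `variation_path` is a hypothetical helper name. A real store would implement this in its own routing layer and would also need to turn arbitrary values into safe slugs.

```python
from urllib.parse import urlsplit, parse_qsl


def variation_path(url):
    """Turn '/dress?color=blue' into '/dress-blue/' (illustrative only)."""
    parts = urlsplit(url)
    values = [value for _, value in parse_qsl(parts.query)]
    if not values:
        # No variation selected: keep the plain product path.
        return parts.path
    base = parts.path.rstrip("/")
    return base + "-" + "-".join(values) + "/"


print(variation_path("/dress?color=blue"))  # /dress-blue/
```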

Redefine Canonical URLs

Another possible code-level change to fix this issue would be defining the canonical URL correctly on the page:

https://webmasters.googleblog.com/2009/02/specify-your-canonical.html

If you want to read more about how to present colors and size variation, Merkle Inc. has written a great article about it.
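
As a rough sketch of the canonical approach, the Python below strips the filter parameters from a requested URL and renders the `<link rel="canonical">` tag that the page would include in its head. The parameter names `cores` and `tamanho` come from the log above. The host `www.example.com` and the function names are placeholders. Whether filtered pages should point to the unfiltered category page is a decision for each site.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Filter parameters seen in the log; adjust for the site in question.
FILTER_PARAMS = {"cores", "tamanho"}


def canonical_url(url):
    """Drop filter parameters and any fragment, keeping everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in FILTER_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def canonical_tag(url):
    """Render the canonical link element for the given request URL."""
    return '<link rel="canonical" href="%s">' % escape(canonical_url(url), quote=True)


print(canonical_tag("https://www.example.com/vestidos?cores=20%7C26&tamanho=4"))
```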

However, we know changing the code is not possible in most cases. There are two other workarounds for this issue, although they are not as good as the ideal approach.

If you apply either of the following two workarounds, Googlebot will no longer crawl URLs containing the ?cores= query string.

Specify the Categorization Parameter in Google Search Console

You can specify the categorization parameter inside Google Search Console:

https://support.google.com/webmasters/answer/6080550?hl=en

[Figure: Categorization Parameters in Google Search Console]

Check out Google’s tutorial on how to categorize parameters.

Disallow the Categorization Parameter in Robots.txt File

You can also add the categorization parameter to the “blacklist” of the robots.txt file by placing this rule under the relevant User-agent group:

Disallow: /*cores=
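
To check which URLs a rule like this would cover, the pattern can be approximated in Python. This sketch follows the wildcard behavior Google documents for robots.txt: `*` matches any sequence of characters, a trailing `$` anchors the end of the URL, and rules match from the start of the path. It is a simplified illustration rather than a full robots.txt parser, and `rule_matches` is a hypothetical helper.

```python
import re


def rule_matches(pattern, path):
    """Approximate robots.txt wildcard matching for a single rule."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces and let each '*' match any run of characters.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, like a robots.txt path prefix.
    return re.match(regex, path) is not None


for path in ["/vestidos?cores=20%7C26", "/vestidos?tamanho=4", "/vestidos"]:
    print(path, rule_matches("/*cores=", path))
```

Only the first path matches, so the plain category page and other parameters stay crawlable. The rule matches any URL containing `cores=`, including parameter names that merely end in `cores`.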

Conclusion

Googlebot and other search engine crawlers are vital to having a website rank correctly in search engines. You want to make sure Google is crawling your pages the right way so your website SEO earns a good ranking and Googlebot does not overload your website resources.

If you are curious about how our Web Application Firewall blocks malicious bots and lets good ones go by, check out our bad bots protection page.

Northon Torga

Security Analyst III @sucurisecurity. CTO @goinfinitenet. Qapla'!