Google Explains How CDNs Impact Crawling & SEO


Google has published documentation explaining how content delivery networks (CDNs) affect crawling and indexing, how they can improve SEO, and how they can sometimes cause problems.

What is a CDN?

A content delivery network (CDN) is a service that caches a web page and serves it from the data center closest to the browser requesting that page. Caching a web page means that the CDN creates a copy of the page and stores it. This speeds up delivery because the page is served from a server closer to the website visitor, requiring fewer “hops” across the Internet between the origin server and the destination (the visitor’s browser).
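To make the cache-hit/cache-miss mechanics concrete, here is a minimal Python sketch (the EdgeCache class and the origin-fetch callable are illustrative, not any real CDN’s API):

```python
# A minimal sketch of CDN-style edge caching. The first request for a URL
# is a cache "miss" and must hit the origin server; later requests are
# served from the stored edge copy.

class EdgeCache:
    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin  # callable: url -> content
        self._store = {}                 # url -> cached content

    def get(self, url):
        if url not in self._store:       # cache miss: go to the origin
            self._store[url] = self._fetch(url)
        return self._store[url]          # cache hit: serve the local copy

# Usage: only the first call reaches the (slower, more distant) origin.
cache = EdgeCache(lambda url: f"<html>content of {url}</html>")
cache.get("https://example.com/product/1")  # miss -> origin serves it
cache.get("https://example.com/product/1")  # hit  -> served from the edge
```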

CDNs unlock more crawling

One of the benefits of using a CDN is that Google automatically increases its crawl rate when it detects that a website is being served from one. This makes CDNs attractive to SEOs and publishers who want to increase the number of pages crawled by Googlebot.

Typically, Googlebot reduces how much it crawls a server when it detects that crawling has reached a threshold that slows the server down, a behavior called throttling. This threshold is higher when a CDN is detected, resulting in more pages being crawled.
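Conceptually, the throttling feedback loop looks something like the sketch below. This is an illustration of adaptive crawl-rate control, not Googlebot’s actual algorithm; a CDN’s faster responses keep the effective crawl rate higher:

```python
# Illustrative crawl throttling: back off when responses slow down,
# speed up when the server responds quickly.

import time
import urllib.request

def crawl(urls, slow_threshold=1.0, base_delay=0.1):
    delay = base_delay
    for url in urls:
        start = time.monotonic()
        urllib.request.urlopen(url, timeout=10).read()
        elapsed = time.monotonic() - start
        if elapsed > slow_threshold:
            delay = min(delay * 2, 30.0)        # server straining: throttle
        else:
            delay = max(delay / 2, base_delay)  # server healthy: crawl faster
        time.sleep(delay)

# Example: crawl(["https://example.com/"] * 5)
```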

Something to understand about serving pages through a CDN is that the first time a page is requested, it must still be served directly from your origin server. Google uses an example site with just over a million web pages:

“However, the first time a URL is accessed, the CDN’s cache is “cold”, meaning that since no one has yet requested that URL, the CDN has not yet cached its content, so your origin server will still have to serve that URL at least once to “warm up” the CDN cache. This is also very similar to how HTTP caching works.

In short, even if your webshop is backed by a CDN, your server will have to serve those 1,000,007 URLs at least once. Only after that initial serve can your CDN help you with its caches. This is a significant burden on your “crawl budget,” and the crawl rate is likely to be high for several days; keep this in mind if you plan to launch many URLs at once.”
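Because of this cold-cache cost, one approach is to warm the CDN ahead of a launch by requesting the key URLs once yourself, so the origin serves them before crawlers arrive. A minimal sketch (the domain, URL list, and concurrency level are made up for illustration):

```python
# Warm a CDN cache before a launch by fetching key URLs once each.

import urllib.request
from concurrent.futures import ThreadPoolExecutor

def warm(url):
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return url, resp.status
    except Exception as exc:
        return url, exc

urls = [f"https://shop.example.com/product/{i}" for i in range(1, 101)]
with ThreadPoolExecutor(max_workers=8) as pool:
    for url, result in pool.map(warm, urls):
        print(url, result)
```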

When CDNs have negative consequences for crawling

Google advises that there are cases where a CDN can blocklist Googlebot and subsequently block crawling. Google describes two types of blocks:

1. Hard blocks

2. Soft blocks

Hard blocks occur when the CDN responds with a server error. A 500 (Internal Server Error) response signals a major problem with the server; 502 (Bad Gateway) is another bad server error response. Both of these responses will trigger Googlebot to slow down its crawl rate. Google stores indexed URLs internally, but persistent 500/502 responses can cause Google to eventually drop those URLs from the search index.

The preferred response is 503 (Service Unavailable), which indicates a temporary error.

Another hard block to watch out for is what Google calls “accidental errors”: the server sends a 200 (OK) response code even though the page it serves is actually an error page. Google will interpret these error pages as duplicates and remove them from the search index. This is a big problem because it can take some time to recover from this type of error.
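A quick way to test for this is to request a URL that should not exist and confirm the server returns a real error status rather than a 200. A small Python sketch (the probe path is made up):

```python
# Probe for "accidental errors": a nonexistent page should return a real
# error status (e.g. 404/410), not a 200 with an error page.

import urllib.request
import urllib.error

probe = "https://www.example.com/this-page-should-not-exist-xyz123"
try:
    with urllib.request.urlopen(probe, timeout=10) as resp:
        # A 200 here means error pages look like real content, which
        # Google may treat as duplicates and drop from the index.
        print(f"WARNING: got {resp.status} for a nonexistent page")
except urllib.error.HTTPError as err:
    print(f"OK: server correctly returned {err.code}")
```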

A soft block can occur if the CDN displays one of those “Are you human?” pop-ups (bot interstitials) to Googlebot. Bot interstitials should send a 503 server response to let Google know the block is a temporary problem.

Google’s new documentation explains:

“…when the interstitial appears, that’s all they see, not your great site. In the case of these crawler interstitials, we strongly recommend sending a clear signal in the form of a 503 HTTP status code to automated clients such as crawlers that the content is temporarily unavailable. This will ensure that content is not automatically removed from Google’s index.”
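As an illustration, a server behind an interstitial could answer crawlers with a 503 and a Retry-After header while humans get the challenge page. This standard-library sketch uses a naive user-agent check purely for demonstration; real bot verification is far more involved:

```python
# Sketch: send crawlers a 503 (temporarily unavailable) instead of
# serving them the interstitial as if it were real content.

from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        ua = self.headers.get("User-Agent", "")
        if "Googlebot" in ua:
            self.send_response(503)                 # temporary: no deindexing
            self.send_header("Retry-After", "3600") # try again in an hour
            self.end_headers()
        else:
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(b"<html>Are you human? ...</html>")

HTTPServer(("", 8080), Handler).serve_forever()
```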

Debugging issues with the URL Inspection tool and WAF controls

Google recommends using the URL Inspection tool in Search Console to see how the CDN is serving your web pages. If the CDN’s web application firewall (WAF) is blocking Googlebot by IP address, you can compare the blocked IP addresses against Google’s official list of Googlebot IP addresses to see whether any of them match.
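For example, you could script the comparison against Google’s published Googlebot ranges. The googlebot.json URL and its ipv4Prefix/ipv6Prefix fields follow Google’s documented format; the sample IP below is just for illustration:

```python
# Check whether a blocked IP belongs to Googlebot using Google's
# published IP ranges.

import ipaddress
import json
import urllib.request

RANGES_URL = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

def googlebot_networks():
    with urllib.request.urlopen(RANGES_URL, timeout=10) as resp:
        data = json.load(resp)
    for prefix in data["prefixes"]:
        cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
        yield ipaddress.ip_network(cidr)

def is_googlebot(ip, networks):
    addr = ipaddress.ip_address(ip)
    # Only compare networks of the same IP version as the address.
    return any(addr in net for net in networks if net.version == addr.version)

networks = list(googlebot_networks())
print(is_googlebot("66.249.66.1", networks))  # an address in a known Googlebot range
```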

Google offers the following tips for debugging at the CDN level:

“If you need your site to appear in search engines, we strongly recommend that you check that the crawlers you care about can access your site. Keep in mind that IPs can end up on the block list automatically, without your knowledge, so checking the block lists from time to time is a good idea for your site’s search success and beyond. If the block list is very long (which is not unlike this blog post), try searching only the first few segments of the IP range, for example, instead of searching for 192.168.0.101, you can just search for 192.168.”

For more information, read Google’s documentation:

Crawling December: CDNs and crawling

Featured Image: Shutterstock/JHVEPhoto


