Tag Archive for: Googlebot

To understand and rank websites in search results, Google constantly uses tools called crawlers to find and analyze new or recently updated web pages. What may surprise you is that the search engine actually uses three different types of crawlers depending on the situation, and some of them may ignore the rules you set to control how crawlers interact with your site.

In the past week, those in the SEO world were surprised by the reveal that the search engine had begun using a new crawler, called GoogleOther, to relieve the strain on its main crawlers. Amidst this, I noticed some asking, “Google has three different crawlers? I thought it was just Googlebot” (the most well-known crawler, which the search engine has used for well over a decade).

In reality, the company uses quite a few more than just one crawler, and it would take a while to go into exactly what each one does, as the lengthy list of them compiled by Search Engine Roundtable makes clear.

However, Google recently updated a help document called “Verifying Googlebot and other Google crawlers” that breaks all these crawlers into three specific groups. 

The Three Types of Google Web Crawlers

Googlebot: The first type of crawler is easily the most well-known and recognized. Googlebot is the tool used to index pages for the company’s main search results, and it always observes the rules set out in robots.txt files.

Special-case Crawlers: In some cases, Google creates crawlers for very specific functions, such as AdsBot, which assesses web page quality for those running ads on the platform. Depending on the situation, these crawlers may ignore the rules dictated in a robots.txt file.

User-triggered Fetchers: When a user does something that requires the search engine to verify information (when a site owner triggers the Google Site Verifier, for example), Google uses special bots dedicated to these tasks. Because the user initiates the request to complete a specific process, these crawlers ignore robots.txt rules entirely.
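To make the robots.txt distinction concrete, here is a minimal sketch using Python’s built-in urllib.robotparser to test which user agents a given rule set would block. The robots.txt contents, paths, and rules shown are hypothetical examples for illustration, not rules Google publishes.

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: block Googlebot from /private/ but allow AdsBot.
robots_txt = """
User-agent: Googlebot
Disallow: /private/

User-agent: AdsBot-Google
Allow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Googlebot honors the Disallow rule...
print(parser.can_fetch("Googlebot", "https://example.com/private/page"))      # False
# ...while AdsBot follows its own section (and, per Google, some special-case
# crawlers may ignore robots.txt entirely depending on the situation).
print(parser.can_fetch("AdsBot-Google", "https://example.com/private/page"))  # True
```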

Why This Matters

Understanding how Google analyzes and processes the web can help you optimize your site for better performance. It is also important to identify the crawlers Google uses and filter them out of your analytics tools; otherwise, they can show up as false visits or impressions.
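One documented way to confirm that a visitor claiming to be Googlebot really is a Google crawler is a reverse DNS lookup followed by a forward DNS confirmation. Below is a minimal sketch of that check in Python; the sample IP address is just a placeholder you would replace with an address from your own server logs, and some Google crawlers may resolve to additional Google-owned domains.

```python
import socket

def is_google_crawler(ip_address: str) -> bool:
    """Reverse-DNS the IP, check the hostname looks Google-owned,
    then forward-resolve the hostname to confirm it maps back to the IP."""
    try:
        hostname, _, _ = socket.gethostbyaddr(ip_address)
    except OSError:
        return False  # no reverse DNS record at all

    # Googlebot hostnames end in googlebot.com or google.com; other Google
    # crawlers may use additional domains, so treat this as a rough filter.
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False

    try:
        resolved_ip = socket.gethostbyname(hostname)
    except OSError:
        return False

    # A stricter check would compare against every address the hostname resolves to.
    return resolved_ip == ip_address

# Placeholder IP pulled from a server log -- replace with a real visitor IP.
print(is_google_crawler("66.249.66.1"))
```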

For more, read the full help article here.

In an update to the help documentation for Googlebot, the search engine’s crawling tool, Google explained that it will only crawl the first 15 MB of any webpage. Anything after this initial 15 MB will not influence your webpage’s rankings.

As the Googlebot help document states:

“After the first 15 MB of the file, Googlebot stops crawling and only considers the first 15 MB of the file for indexing.

The file size limit is applied on the uncompressed data.”

Though this may initially raise concerns since images and videos can easily exceed this size, the help document makes clear that media and other resources are typically exempt from this Googlebot limit:

“Any resources referenced in the HTML such as images, videos, CSS, and JavaScript are fetched separately.”
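Because the limit applies to the uncompressed HTML file itself (not the images, videos, CSS, or JavaScript it references), a quick way to see how far below 15 MB a page sits is to fetch the HTML and measure its raw byte count. Here is a minimal sketch using Python’s standard library; the URL is a placeholder for a page on your own site.

```python
from urllib.request import urlopen

GOOGLEBOT_LIMIT = 15 * 1024 * 1024  # 15 MB, applied to the uncompressed HTML

# Placeholder URL -- substitute a page from your own site.
with urlopen("https://example.com/") as response:
    html_bytes = response.read()  # urllib does not request gzip, so this is the uncompressed payload

size = len(html_bytes)
print(f"HTML payload: {size:,} bytes "
      f"({size / GOOGLEBOT_LIMIT:.2%} of Googlebot's 15 MB crawl limit)")
```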

What This Means For Your Website

If you’ve been following the most commonly used best practices for web design and content management, this should leave your website largely unaffected. Specifically, the best practices you should be following include:

  • Keeping the most relevant SEO-related information relatively close to the start of any HTML file. 
  • Compressing images.
  • Avoiding encoding images or videos directly into the HTML (for example, as inline base64 data) when possible.
  • Keeping HTML files small – typically less than 100 KB (a quick self-check sketch follows this list).
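As a rough way to check a page against the last two points, the sketch below scans a saved HTML file for inline base64 media and compares the file size to the roughly 100 KB guideline. The file path is a placeholder for an exported page from your own site.

```python
import re
from pathlib import Path

SIZE_GUIDELINE = 100 * 1024  # ~100 KB best-practice target for HTML files

# Placeholder path -- point this at a saved page from your own site.
html_path = Path("page.html")
html = html_path.read_text(encoding="utf-8", errors="ignore")

# Find media that has been base64-encoded directly into the markup.
inline_media = re.findall(r'src="data:(?:image|video)/[^;]+;base64,', html)

size = html_path.stat().st_size
print(f"File size: {size:,} bytes "
      f"({'within' if size <= SIZE_GUIDELINE else 'over'} the ~100 KB guideline)")
print(f"Inline base64 images/videos found: {len(inline_media)}")
```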

If you operate a website that is frequently creating or changing pages – such as an e-retail or publishing site – you’ve probably noticed it can take Google a while to reflect your new content in search results.

This has led to widespread speculation about just how frequently Google indexes pages and why it seems like some types of websites get indexed more frequently than others.

In a recent Q&A video, Google’s John Mueller took the time to answer this directly. He explains how Google’s indexing bots prioritize specific types of pages that are more “important” and limit excessive stress on servers. But, in typical Google fashion, he isn’t giving away everything.

The question posed was:

“How often does Google re-index a website? It seems like it’s much less often than it used to be. We add or remove pages from our site, and it’s weeks before those changes are reflected in Google Search.”

Mueller starts by explaining that Google takes its time to crawl the entirety of a website, noting that continuously crawling entire sites in short periods of time would put unnecessary strain on servers. Because of this, Googlebot actually has a limit on the number of pages it can crawl every day.

Instead, Googlebot focuses on pages that should be crawled more frequently, like home pages or high-level category pages. These pages will get crawled at least every few days, but it sounds like less-important pages (like maybe blog posts) might take considerably longer to get crawled.

You can watch Mueller’s response below or read the quoted statement underneath.

“Looking at the whole website all at once, or even within a short period of time, can cause a significant load on a website. Googlebot tries to be polite and is limited to a certain number of pages every day. This number is automatically adjusted as we better recognize the limits of a website. Looking at portions of a website means that we have to prioritize how we crawl.

So how does this work? In general, Googlebot tries to crawl important pages more frequently to make sure that most critical pages are covered. Often this will be a website’s home page or maybe higher-level category pages. New content is often mentioned and linked from there, so it’s a great place for us to start. We’ll re-crawl these pages frequently, maybe every few days, maybe even much more frequently depending on the website.”
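Mueller is describing a prioritization and rate-limiting scheme rather than a published algorithm, but the general idea can be illustrated with a toy scheduler: give more important pages shorter recrawl intervals and cap the number of fetches per day. This is purely an illustrative sketch of the concept, not Google’s actual system; the page names, intervals, and daily cap are invented.

```python
from dataclasses import dataclass

@dataclass
class Page:
    url: str
    recrawl_interval_days: float  # how often this page should ideally be revisited
    days_since_crawl: float       # time elapsed since the last crawl

# Invented example pages: the home page and category pages get short intervals,
# blog posts get much longer ones.
pages = [
    Page("/", 2, 3),
    Page("/category/widgets", 3, 1),
    Page("/blog/old-post", 30, 12),
    Page("/blog/new-post", 30, 45),
]

DAILY_CRAWL_BUDGET = 2  # cap on fetches per day; adjusted per site in reality

# Rank pages by how "overdue" they are, then crawl only up to the budget.
overdue = sorted(pages,
                 key=lambda p: p.days_since_crawl / p.recrawl_interval_days,
                 reverse=True)
for page in overdue[:DAILY_CRAWL_BUDGET]:
    factor = page.days_since_crawl / page.recrawl_interval_days
    print(f"Crawling {page.url} (overdue factor {factor:.1f})")
```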


Google is continuing its efforts to promote privacy in search by prioritizing the indexing of HTTPS pages over their HTTP equivalents.

In the announcement, Google explains that its long-term aim is to eventually direct users to secure webpages over a private connection. The decision to index the HTTPS version of a page when an HTTP equivalent exists is the most recent move in this process, following the small rankings boost given to HTTPS pages last year.

Unlike the change to Google’s algorithm in August 2014, this move will not have any effect on rankings. Instead, it simply means that Googlebot will only index the HTTPS version of a URL when both an HTTPS and HTTP version exist.

While Google’s commitment to secure search may lead to more rankings boosts for HTTPS pages in the future, this change is mostly to improve the efficiency of Google’s current indexing process. As they explain in their announcement:

“Browsing the web should be a private experience between the user and the website, and must not be subject to eavesdropping, man-in-the-middle attacks, or data modification. This is why we’ve been strongly promoting HTTPS everywhere.”
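If you still serve pages over HTTP, a simple way to check whether an HTTPS equivalent exists (and therefore which version Google would prefer to index) is to request the same path over HTTPS. Here is a minimal sketch using Python’s standard library; the example URL is a placeholder, and servers that reject HEAD requests would need a GET-based check instead.

```python
from urllib.parse import urlparse, urlunparse
from urllib.request import Request, urlopen
from urllib.error import URLError

def https_equivalent_exists(http_url: str) -> bool:
    """Return True if the same path responds successfully over HTTPS."""
    parts = urlparse(http_url)
    https_url = urlunparse(parts._replace(scheme="https"))
    try:
        # HEAD keeps the check lightweight; any non-error response counts.
        with urlopen(Request(https_url, method="HEAD"), timeout=10) as response:
            return response.status < 400
    except URLError:
        return False

# Placeholder URL -- substitute a page from your own site.
print(https_equivalent_exists("http://example.com/some-page"))
```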

Googlebot is Google’s automated program for searching and indexing content on the Internet. In the realm of SEO, the first part of good optimization is all about crafting textual content that’s visible and makes sense to Googlebot. After Googlebot indexes a page, the Google algorithm takes the content text and automatically ranks it on the search results page according to the search terms the user enters into Google search. If your optimized website performs well for the term “electronic widgets,” for example, the Google algorithm will place your site near or at the top of the search results whenever someone uses Google to search for “electronic widgets.” Did you know that, in addition to automated components like Googlebot and the algorithm, Google also uses human site raters in the ranking of websites?

Google employs hundreds of site raters who rate a huge number of websites on relevancy. The input collected from this team doesn’t directly influence the search results, but it does influence the Google engineers in changing the algorithm to better serve more relevant results to the search engine user.

In this great video, Google senior software engineer Matt Cutts demystifies this process by explaining how human website raters are used in testing changes to the Google algorithm. Essentially, after a change to the automatic search ranking is made, Google performs many test queries and evaluates what has changed in the results. The new search results are checked against the results before the change and then presented to the human raters – in what Matt Cutts calls a “blind taste test” – to determine which set of search results is more relevant and useful. Only after analyzing and evaluating the feedback of the human raters are the new search results tested with a small, carefully selected group of Internet users. Only if this last round of surveys proves the results more accurate and useful will the updated algorithm be integrated into Google Search for public use. It’s an exhaustive process, but that’s how much Google wants its search engine to be the most relevant on the web.

Watch the video here: