Tag Archive for: Google indexing

To understand and rank websites in search results, Google constantly uses tools called crawlers to find and analyze new or recently updated web pages. What may surprise you is that the search engine actually uses three different types of crawlers depending on the situation. In fact, some of these crawlers may ignore the robots.txt rules used to control how they interact with your site.

In the past week, those in the SEO world were surprised by the reveal that the search engine had begun using a new crawler, GoogleOther, to relieve the strain on its main crawlers. Amid this, I noticed some asking, “Google has three different crawlers? I thought it was just Googlebot” (the most well-known crawler, which has been used by the search engine for over a decade).

In reality, the company uses quite a few more than just one crawler, and it would take a while to go into exactly what each one does, as you can see from the list (from Search Engine Roundtable) below:

However, Google recently updated a help document called “Verifying Googlebot and other Google crawlers” that breaks all these crawlers into three specific groups. 

The Three Types of Google Web Crawlers

Googlebot: The first type of crawler is easily the most well-known and recognized. Googlebot is the crawler used to index pages for the company’s main search results. It always observes the rules set out in robots.txt files.

Special-case Crawlers: In some cases, Google creates crawlers for very specific functions, such as AdsBot, which assesses web page quality for those running ads on the platform. Depending on the situation, these crawlers may ignore some of the rules dictated in a robots.txt file.

User-triggered Fetchers: When a user does something that requires the search engine to verify information (when the Google Site Verifier is triggered by the site owner, for example), Google uses fetchers dedicated to these tasks. Because the fetch is initiated by a user to complete a specific process, these fetchers ignore robots.txt rules entirely.
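To make the distinction concrete, here is a minimal robots.txt sketch (the paths are hypothetical). Googlebot follows the wildcard group, Google notes that special-case crawlers such as AdsBot ignore the global wildcard and only follow rules that name them directly, and user-triggered fetchers skip the file altogether.

```
# Googlebot and most other crawlers honor this wildcard group
User-agent: *
Disallow: /private/

# AdsBot ignores the * group and only obeys rules addressed to it by name
User-agent: AdsBot-Google
Disallow: /ad-test-pages/

# User-triggered fetchers (e.g. the Google Site Verifier) ignore robots.txt entirely
```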

Why This Matters

Understanding how Google analyzes and processes the web allows you to better optimize your site. Additionally, it is important to be able to identify the crawlers Google uses and filter them out of your analytics tools, as their visits can otherwise appear as false traffic or impressions.
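The help document also describes how to confirm that a visitor claiming to be a Google crawler really is one: run a reverse DNS lookup on the logged IP, check that the hostname belongs to a Google-owned domain, then confirm a forward lookup resolves back to the same IP. A minimal Python sketch of that check (the IP address is just a placeholder from a hypothetical server log) might look like this:

```
import socket

def is_google_crawler(ip_address: str) -> bool:
    """Reverse DNS -> forward DNS check, as described in Google's help doc."""
    try:
        # Reverse lookup: the hostname should belong to a Google-owned domain
        hostname, _, _ = socket.gethostbyaddr(ip_address)
        if not hostname.endswith((".googlebot.com", ".google.com", ".googleusercontent.com")):
            return False
        # Forward lookup: the hostname should resolve back to the original IP
        return socket.gethostbyname(hostname) == ip_address
    except (socket.herror, socket.gaierror):
        return False

# Placeholder IP pulled from a hypothetical server log
print(is_google_crawler("66.249.66.1"))
```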

For more, read the full help article here.

One of the most frustrating aspects of search engine optimization is the time it takes to see results. In some cases, you can see changes start to hit Google’s search results in just a few hours. In others, you can spend weeks waiting for new content to be indexed with no indication of when Google will get around to your pages.

In a recent AskGooglebot session, Google’s John Mueller said this huge variation in the time it takes for pages to be indexed is to be expected for a number of reasons. He also provided some tips for speeding up the process so you can start seeing the fruits of your labor as soon as possible.

Why Indexing Can Take So Long

In most cases, Mueller says, sites that produce consistently high-quality content should expect to see their new pages get indexed within a few hours to a week. In some situations, though, even high-quality pages can take longer to be indexed due to a variety of factors.

Technical issues can pop up that delay Google’s ability to spot your new pages or prevent indexing entirely. Additionally, there is always the chance that Google’s systems are simply tied up elsewhere and need time to get to your new content.

Why Google May Not Index Your Page

It is important to note that Google does not index everything. In fact, there are plenty of reasons the search engine might not index your new content.

For starters, you can just tell Google not to index a page or your entire site. It might be that you want to prioritize another version of your site or that your site isn’t ready yet. 
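As a quick illustration, the usual way to tell Google not to index a page is a robots meta tag in the page’s head (an equivalent X-Robots-Tag HTTP header works for non-HTML files):

```
<!-- Keeps this page out of Google's index while still allowing it to be crawled -->
<meta name="robots" content="noindex">
```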

The search engine also excludes content that doesn’t bring sufficient value. This includes duplicate content, malicious or spammy pages, and websites which mirror other existing sites.

How To Speed Up Indexing

Thankfully, Mueller says there are ways to help speed up indexing your content.

  • Prevent server overload by ensuring your server can handle the traffic coming to it. This ensures Google can get to your site in a timely manner.
  • Use prominent internal links to help Google’s systems navigate your site and understand which pages are most important (see the sketch after this list).
  • Avoid unnecessary URLs to keep your site well organized and make it easier for Google to spot new content.
  • Publish consistently high-quality content that provides real value for users. The more important Google thinks your site is for people online, the higher priority your new pages will be for indexing and ranking.
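For the internal-linking tip above, a minimal sketch is simply a prominent block of plain, crawlable links pointing at the pages you want discovered quickly (the URLs are hypothetical):

```
<!-- Prominent, crawlable links help Google discover and prioritize new pages -->
<nav>
  <a href="/guides/getting-started/">Getting started guide</a>
  <a href="/blog/latest-announcement/">Our latest announcement</a>
</nav>
```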

For more about how Google indexes web pages and how to speed up the process, check out the full AskGooglebot video below:

In a Google Search Central SEO session recently, Google’s John Mueller shed light on a way the search engine’s systems can go astray – keeping pages on your site from being indexed and appearing in search. 

Essentially the issue comes from Google’s predictive approach to identifying duplicate content based on URL patterns, which has the potential to incorrectly identify duplicate content based on the URL alone. 

Google uses this predictive system to increase the efficiency of its crawling and indexing by skipping over content that is just a copy of another page. By leaving these pages out of the index, Google is less likely to show repetitious content in its search results, and its indexing systems can reach other, more unique content more quickly.

Obviously the problem is that content creators could unintentionally trigger these predictive systems when publishing unique content on similar topics, leaving quality content out of the search engine. 

John Mueller Explains How Google Could Misidentify Duplicate Content

In a response to a question from a user whose pages were not being indexed correctly, Mueller explained that Google uses multiple layers of filters to weed out duplicate content:

“What tends to happen on our side is we have multiple levels of trying to understand when there is duplicate content on a site. And one is when we look at the page’s content directly and we kind of see, well, this page has this content, this page has different content, we should treat them as separate pages.

The other thing is kind of a broader predictive approach that we have where we look at the URL structure of a website where we see, well, in the past, when we’ve looked at URLs that look like this, we’ve seen they have the same content as URLs like this. And then we’ll essentially learn that pattern and say, URLs that look like this are the same as URLs that look like this.”

He also explained how these systems can sometimes go too far and Google could incorrectly filter out unique content based on URL patterns on a site:

“Even without looking at the individual URLs we can sometimes say, well, we’ll save ourselves some crawling and indexing and just focus on these assumed or very likely duplication cases. And I have seen that happen with things like cities.

I have seen that happen with things like, I don’t know, automobiles is another one where we saw that happen, where essentially our systems recognize that what you specify as a city name is something that is not so relevant for the actual URLs. And usually we learn that kind of pattern when a site provides a lot of the same content with alternate names.”

How Can You Protect Your Site From This?

While Google’s John Mueller wasn’t able to provide a foolproof solution or prevention for this issue, he did offer some advice for sites that have been affected:

“So what I would try to do in a case like this is to see if you have this kind of situations where you have strong overlaps of content and to try to find ways to limit that as much as possible.

And that could be by using something like a rel canonical on the page and saying, well, this small city that is right outside the big city, I’ll set the canonical to the big city because it shows exactly the same content.

So that really every URL that we crawl on your website and index, we can see, well, this URL and its content are unique and it’s important for us to keep all of these URLs indexed.

Or we see clear information that this URL you know is supposed to be the same as this other one, you have maybe set up a redirect or you have a rel canonical set up there, and we can just focus on those main URLs and still understand that the city aspect there is critical for your individual pages.”
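As a concrete, hypothetical example of Mueller’s advice: if a page for a small suburb shows exactly the same content as the nearby big city’s page, the suburb page could declare the city page as its canonical (or simply redirect to it):

```
<!-- On the small-suburb page, pointing Google at the preferred URL -->
<link rel="canonical" href="https://www.example.com/locations/big-city/">
```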

It should be clarified that duplicate content or pages impacted by this problem will not hurt the overall SEO of your site. So, for example, having several pages tagged as being duplicate content won’t prevent your home page from appearing for relevant searches. 

Still, the issue has the potential to gradually decrease the efficiency of your SEO efforts, not to mention making it harder for people to find the valuable information you are providing. 

To see Mueller’s full explanation, watch the video below:

As announced last month, Google is officially taking its first step toward the launch of mobile-first indexing with the test of its mobile-first search index.

The company confirmed the testing has officially started via its company blog:

“Although our search index will continue to be a single index of websites and apps, our algorithms will eventually primarily use the mobile version of a site’s content to rank pages from that site, to understand structured data, and to show snippets from those pages in our results. Of course, while our index will be built from mobile documents, we’re going to continue to build a great search experience for all users, whether they come from mobile or desktop devices.”

This means in the future Google will increasingly prioritize crawling the mobile versions of a site’s content, rather than treating desktop as the “main” version of your site.

The company also gave some quick tips to help you make the most of this change as it is happening:

  • If you have a responsive site with identical content across mobile and desktop, you shouldn’t have to change anything.
  • If you have a site where the primary content and markup are not identical across mobile and desktop, you should consider making some changes to your site.
  • Make sure to serve structured markup for both the desktop and mobile versions. Google recommends verifying that the markup is equivalent by running the URLs of both versions through the Structured Data Testing Tool and comparing the output (see the sketch after this list).
  • When adding structured data to a mobile site, avoid adding large amounts of markup that isn’t relevant to the specific information content of each document.
  • Use the robots.txt testing tool to verify that your mobile version is accessible to Googlebot.
  • Sites do not have to make changes to their canonical links.
  • If you are a site owner who has only verified your desktop site in Search Console, please add and verify your mobile version.
  • If you only have a desktop site, Google will continue to index your desktop site just fine.
  • If you are building a mobile version of your site, do not launch it until it’s ready. Google says: “a functional desktop-oriented site can be better than a broken or incomplete mobile version of the site.”
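As a sketch of the structured-markup point above, the same JSON-LD block would be served on both the desktop and mobile versions of a page, so that running both URLs through the Structured Data Testing Tool returns equivalent output (the organization details are hypothetical):

```
<script type="application/ld+json">
{
  "@context": "https://schema.org",
  "@type": "Organization",
  "name": "Example Company",
  "url": "https://www.example.com/"
}
</script>
```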


Google is continuing its efforts to promote privacy in search by prioritizing indexing HTTPS pages over their HTTP equivalents.

In the announcement, Google explains that its long-term aim is to eventually direct users to secure web pages over a private connection. The step of indexing the HTTPS version of a page when an HTTP equivalent exists is its most recent move in this process, following the small rankings boost given to HTTPS pages last year.

Unlike the change to Google’s algorithm in August 2014, this move will not have any effect on rankings. Instead, it simply means that Googlebot will index only the HTTPS version of a URL when both an HTTPS and an HTTP version exist.
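If you serve both versions and want Google and users to settle on HTTPS, one common approach is a site-wide 301 redirect from HTTP to HTTPS. A sketch for an nginx server (assuming nginx; an Apache rewrite rule or a rel canonical pointing at the HTTPS URL accomplishes the same thing) might look like this:

```
server {
    listen 80;
    server_name www.example.com;
    # Send every HTTP request to the HTTPS version of the same URL
    return 301 https://$host$request_uri;
}
```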

While Google’s commitment to secure search may lead to more rankings boosts for HTTPS pages in the future, this change is mostly to improve the efficiency of Google’s current indexing process. As they explain in their announcement:

“Browsing the web should be a private experience between the user and the website, and must not be subject to eavesdropping, man-in-the-middle attacks, or data modification. This is why we’ve been strongly promoting HTTPS everywhere.”