In a Google Search Central SEO session recently, Google’s John Mueller shed light on a way the search engine’s systems can go astray – keeping pages on your site from being indexed and appearing in search.
Essentially the issue comes from Google’s predictive approach to identifying duplicate content based on URL patterns, which has the potential to incorrectly identify duplicate content based on the URL alone.
Google uses the predictive system to increase the efficiency of its crawling and indexing of sites by skipping over content which is just a copy of another page. By leaving these pages out of the index, Google’s engine has less chances of showing repetitious content in its search results and allows its indexing systems to reach other, more unique content more quickly.
Obviously the problem is that content creators could unintentionally trigger these predictive systems when publishing unique content on similar topics, leaving quality content out of the search engine.
John Mueller Explains How Google Could Misidentify Duplicate Content
In a response to a question from a user whose pages were not being indexed correctly, Mueller explained that Google uses multiple layers of filters to weed out duplicate content:
“What tends to happen on our side is we have multiple levels of trying to understand when there is duplicate content on a site. And one is when we look at the page’s content directly and we kind of see, well, this page has this content, this page has different content, we should treat them as separate pages.
The other thing is kind of a broader predictive approach that we have where we look at the URL structure of a website where we see, well, in the past, when we’ve looked at URLs that look like this, we’ve seen they have the same content as URLs like this. And then we’ll essentially learn that pattern and say, URLs that look like this are the same as URLs that look like this.”
He also explained how these systems can sometimes go too far and Google could incorrectly filter out unique content based on URL patterns on a site:
“Even without looking at the individual URLs we can sometimes say, well, we’ll save ourselves some crawling and indexing and just focus on these assumed or very likely duplication cases. And I have seen that happen with things like cities.
I have seen that happen with things like, I don’t know, automobiles is another one where we saw that happen, where essentially our systems recognize that what you specify as a city name is something that is not so relevant for the actual URLs. And usually we learn that kind of pattern when a site provides a lot of the same content with alternate names.”
How Can You Protect Your Site From This?
While Google’s John Mueller wasn’t able to provide a full-proof solution or prevention for this issue, he did offer some advice for sites that have been affected:
“So what I would try to do in a case like this is to see if you have this kind of situations where you have strong overlaps of content and to try to find ways to limit that as much as possible.
And that could be by using something like a rel canonical on the page and saying, well, this small city that is right outside the big city, I’ll set the canonical to the big city because it shows exactly the same content.
So that really every URL that we crawl on your website and index, we can see, well, this URL and its content are unique and it’s important for us to keep all of these URLs indexed.
Or we see clear information that this URL you know is supposed to be the same as this other one, you have maybe set up a redirect or you have a rel canonical set up there, and we can just focus on those main URLs and still understand that the city aspect there is critical for your individual pages.”
It should be clarified that duplicate content or pages impacted by this problem will not hurt the overall SEO of your site. So, for example, having several pages tagged as being duplicate content won’t prevent your home page from appearing for relevant searches.
Still, the issue has the potential to gradually decrease the efficiency of your SEO efforts, not to mention making it harder for people to find the valuable information you are providing.
To see Mueller’s full explanation, watch the video below: