How To Hunt Down All Duplicate Content On A Site

Chris Koszo

13 years ago

Duplicate content. You want none of it... Nada. Zip. Zilch. We all know that the series of Panda updates that Google rolled out starting in February 2011 (timeline) obliterated many sites, including legitimate ones whose webmasters and SEO teams were baffled as to why they were penalized. While it's true that some were hit by Panda simply because they had poorly designed pages, too high an ad-to-content ratio or an otherwise bad user experience, most of the sites that Panda devalued simply had too much duplicate content on their domains, which in some cases could have been prevented.

Mandatory reading on duplicate content, how it works and how to deal with it courtesy of Cyrus Shepard and Dr. Pete can be found here, and here, respectively. Whether or not any of your sites were hit by Panda, you absolutely need to read these simple and straightforward posts as they describe very well what duplicate content is and how to approach it. Below is more information on how to actually spot the dreaded duplicate content.

Duplicate Content and Panda

Remember, Panda is a site-wide penalty, not a page penalty. So if a certain percentage of your pages (even ones you're not aware of) fall below Panda's quality threshold, then the whole site takes a hit in rankings, regardless of how great the rest of your pages, product descriptions and blog posts are. Fix enough of these pages and your site will recover.

How To Find Duplicate Content...

Many times you may not even be aware of the hundreds or thousands of duplicate pages that exist on your site. They might come from a CMS issue that generates duplicate pages or URLs, from query strings being generated without your knowledge, or from a simple error where you forgot to add a noindex meta robots tag somewhere. Since you never actually use or visit these pages, and since they never send much referral traffic, they go unnoticed.

...The Simple Way

The simplest, most basic (but often overlooked) way to find these duplicate pages is to navigate through your site like a user. Visit each page template, click on each link you see, and try all search features and all dynamic content. See if you can break anything, or find some type of page or URL that you aren't familiar with or feel shouldn't exist because the content already lives elsewhere. If you find anything questionable, the easiest way to see if the content exists anywhere else is to plug a snippet of it into a Google search (remember to wrap the snippet in quotes so you are searching for the exact match) and see if any of your other pages or, gasp, another site comes up in the results.
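
If you end up checking a lot of snippets, building the quoted query by hand gets tedious. Here is a minimal Python sketch that wraps a snippet in quotes and produces a search URL you can paste into a browser. The function name and the optional `site` argument (which appends Google's `site:` operator to restrict results to one domain) are my own additions for illustration, not something this post's method requires.

```python
from urllib.parse import urlencode


def exact_match_search_url(snippet, site=None):
    """Build a Google search URL that looks for `snippet` as an exact phrase."""
    # Collapse whitespace and drop inner double quotes so the whole
    # snippet stays a single quoted phrase.
    phrase = " ".join(snippet.replace('"', " ").split())
    query = f'"{phrase}"'
    if site:
        # Optional assumption: limit results to one domain.
        query += f" site:{site}"
    return "https://www.google.com/search?" + urlencode({"q": query})


# Hypothetical snippet and domain, purely for illustration.
print(exact_match_search_url("Hand-stitched leather wallets made in Ohio",
                             site="example.com"))
```

Leave `site` off when you want to see whether another domain is using your content.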

Test The Content

Found a duplicate page but want to know just how much of a duplicate it really is? Try this awesome and free tool to compare pages (and page templates) on your site that appear questionable and see just how close a match they are. Total text similarity above 75% is not too good. Anything above 90% is playing with fire. Don't worry much about the HTML similarity percentages, since that value is calculated from your page structure and will always be high (80-95%) if your site uses any type of CMS or a template-based design.
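
If you'd rather run a rough comparison yourself, the idea can be sketched in Python with the standard library. To be clear, this is not how the tool above computes its numbers: it is an assumption-laden approximation that strips tags, compares the remaining words with `difflib.SequenceMatcher`, and applies the 75% and 90% cutoffs mentioned above. All function names here are made up for the example.

```python
from difflib import SequenceMatcher
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def visible_words(html):
    """Return the lowercased words of a page's visible text."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts).lower().split()


def text_similarity(html_a, html_b):
    """Word-level similarity of two pages as a percentage (0-100)."""
    a, b = visible_words(html_a), visible_words(html_b)
    return 100.0 * SequenceMatcher(None, a, b, autojunk=False).ratio()


def verdict(percent):
    """Map a similarity percentage onto the rules of thumb from this post."""
    if percent > 90:
        return "playing with fire"
    if percent > 75:
        return "not too good"
    return "probably fine"
```

Feed it the HTML of two suspicious pages, for example `verdict(text_similarity(page_a, page_b))`. Because it compares words after stripping markup, shared template HTML does not inflate the score, although shared navigation and footer text still will.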

After you find some for-sure duplicate content, remember to check whether the pages in question already have a noindex tag or canonical tag in the <head> section of the HTML, and whether the particular page, directory or query string is already blocked via your robots.txt file (visit the link if you aren't familiar with how to block duplicate content with robots.txt). Of course, you wouldn't have found these pages using the Google search method if they were already blocked from the index, but if you find pages using the simple navigation method or some other way, checking for these three things can save you some "Oh crap. Never mind" moments.
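
Those three checks are easy to script once you have a page's HTML and your robots.txt contents in hand. The following is a hedged Python sketch using only the standard library; the function name, the returned dictionary keys and the default `Googlebot` user agent are assumptions made for illustration. Note that Python's `urllib.robotparser` does simple prefix matching and does not understand the `*` and `$` wildcards Google supports inside paths, so treat the robots.txt result as a first pass rather than the final word.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class _HeadSignals(HTMLParser):
    """Records a noindex robots meta tag and a rel=canonical link if present."""

    def __init__(self):
        super().__init__()
        self.noindex = False
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        attrs = {name: (value or "") for name, value in attrs}
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            if "noindex" in attrs.get("content", "").lower():
                self.noindex = True
        elif tag == "link" and "canonical" in attrs.get("rel", "").lower().split():
            self.canonical = attrs.get("href") or None


def indexing_signals(html, url, robots_txt, user_agent="Googlebot"):
    """Report the three things worth checking before fixing a duplicate page."""
    head = _HeadSignals()
    head.feed(html)
    head.close()

    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())  # parse text directly, no network call

    return {
        "noindex": head.noindex,
        "canonical": head.canonical,
        "blocked_by_robots": not robots.can_fetch(user_agent, url),
    }
```

If any of the three values already covers the page, you can probably cross it off your list and move on to the next suspect.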

Find More Duplicate Content

The manual navigation approach works even on relatively large sites, since most pages will be based on templates and the task stays manageable. Another way to find duplicate content (you should do both) is to go into Google Webmaster Tools (you have it, right?) and check whether any of your pages have duplicate title tags or meta description tags under Diagnostics-->HTML suggestions. This is an excellent way to find pages that are clearly duplicates and should be dealt with.
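
If you have your own crawl data (from any crawler that exports titles and descriptions), you can run the same kind of check offline. This is a small Python sketch that assumes you already have a dictionary mapping each URL to its title and description; the data shape, the field names and the sample URLs are all hypothetical.

```python
from collections import defaultdict


def find_duplicates(pages, field):
    """Group URLs that share the same normalized value for `field`."""
    groups = defaultdict(list)
    for url, meta in pages.items():
        # Normalize case and whitespace so trivial differences don't hide matches.
        value = " ".join(meta.get(field, "").lower().split())
        if value:
            groups[value].append(url)
    # Keep only values used by more than one URL.
    return {value: urls for value, urls in groups.items() if len(urls) > 1}


# Hypothetical crawl export, purely for illustration.
pages = {
    "https://example.com/shoes": {"title": "Red Shoes", "description": "Buy red shoes."},
    "https://example.com/shoes?sort=price": {"title": "Red  shoes", "description": "Buy red shoes."},
    "https://example.com/hats": {"title": "Hats", "description": "Buy hats."},
}

for field in ("title", "description"):
    print(field, find_duplicates(pages, field))
```

Any group with more than one URL is a candidate for the similarity and indexing checks described earlier.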

Find Even More Duplicate Content

Concerned about URL parameters such as query strings and other filtering parameters causing your site to have multiple URLs for the same pages? It's time to visit another part of Webmaster Tools, one that lets you find out if anything is amiss in the first place. Go to Configuration-->URL Parameters, and check to see if any pre-set URL parameters show up here.

If nothing shows up, then you probably don't have much to worry about. If you do see some listed, it doesn't necessarily mean you have big problems, but you need to learn how the process works before you adjust these settings, and learn to identify duplicate content issues even if Google doesn't alert you. Google is actually fairly good at making sense of weird URL structures and query strings and dealing with them accordingly (there are only so many content management systems out there, and Google has enough experience dealing with each of them to know what your site's intentions are). Nevertheless, it doesn't always get things right, especially if you're running an enterprise-level ecommerce site or using a custom-built environment.
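
One way to spot parameter-driven duplicates on your own, without waiting for Google to flag them, is to strip the parameters you believe don't change the content and see which URLs collapse into the same address. The Python sketch below does that with `urllib.parse`. The `IGNORED_PARAMS` set is entirely hypothetical; which parameters are safe to ignore depends on your site, so swap in your own list.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical examples of parameters that don't change page content.
IGNORED_PARAMS = {"sessionid", "sort", "utm_source", "utm_medium", "utm_campaign"}


def normalize(url, ignored=IGNORED_PARAMS):
    """Drop ignored parameters, sort the rest, and lowercase the host."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in ignored
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


def group_by_normalized(urls):
    """Return normalized URLs that more than one crawled URL maps onto."""
    groups = defaultdict(list)
    for url in urls:
        groups[normalize(url)].append(url)
    return {key: found for key, found in groups.items() if len(found) > 1}
```

Run `group_by_normalized` over a list of URLs from your server logs or a crawl. Each group it returns is a set of addresses that likely serve the same page and may need a canonical tag or a parameter setting.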

To view the URL parameters that Google has already preconfigured for you but doesn't necessarily alert you to, click "Configure URL parameters". Configuring URL parameters is handily explained by Vanessa Fox here, and for a more in-depth tutorial the official Google Webmaster Center URL parameters page is a great resource. Once you understand how Google approaches parameters, you'll have a better grasp of what the various parameters in your URLs mean to Google, and how you should allow Google to crawl and index them.