Duplicate Content Check: Who Is The Original Here?

The World Wide Web is gigantic and grows by countless pages every day. This includes many duplicates. Newspaper articles and press releases published under multiple domains, products available for purchase in multiple online stores, or even backups, test servers, and parameter URLs that accidentally end up in Google’s index can all constitute duplicate content. But how does Google decide which document is the original, and does it even matter?

What is duplicate content?

Duplicate content (DC) refers to the same or very similar content published under different URLs. It doesn’t matter whether the duplicate content is published under the same domain or a different domain. If Google finds two identical pieces of content, the search engine will only rank both if the search query is very specific, for example when users are looking for a particular article number or article name.

If the search query is more general, Google’s goal is to offer users diversity. This means that if duplicate content is found, the search engine will limit itself to one of the URLs that offers this content. Google will define one URL as the original and the other URLs as duplicates. In this case, the duplicates will not rank at all or will only rank lower.

In which situations does duplicate content arise?

Duplicate content can arise due to a variety of circumstances. We’ve listed the most common ones here:

Duplicate content from test servers and pages that were accidentally indexed

Many websites are constantly being changed. To test new features, many webmasters set up test servers, for example, under a subdomain like test.exampledomain.de. If pages from the test server are indexed by Google, they produce duplicate content.

Duplicate content due to missing or incorrect hreflang tags

Domains that target different countries and contain no or incorrect hreflang tags for the different language versions are especially prone to duplicate content.

This typically happens when there are multiple pages for one language, for example when German-speaking users should receive a different URL depending on the country they are located in (Germany, Austria, or Switzerland). Without correct hreflang tags, Google may treat these near-identical pages as duplicates.
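To make the hreflang setup more concrete, here is a minimal Python sketch that generates the link tags for three German-language country versions. The three URLs and the helper name hreflang_tags are assumptions made for illustration; they do not come from a real project.

```python
from html import escape

# Hypothetical country versions of one German-language page (URLs are assumptions).
GERMAN_VERSIONS = {
    "de-DE": "https://exampledomain.de/example-page",
    "de-AT": "https://exampledomain.at/example-page",
    "de-CH": "https://exampledomain.ch/example-page",
}


def hreflang_tags(versions):
    """Return one <link rel="alternate"> tag per language/country version."""
    return [
        f'<link rel="alternate" hreflang="{escape(code)}" href="{escape(url)}" />'
        for code, url in versions.items()
    ]


if __name__ == "__main__":
    # The same full set of tags would go into the <head> of every version.
    print("\n".join(hreflang_tags(GERMAN_VERSIONS)))
```

In this sketch, each country version would carry the complete set of tags, including the one that points to itself.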

Duplicate content through parameters

Many websites (especially online shops) use parameter URLs to allow their users to filter products by color or size, for example.

The resulting pages, such as https://exampledomain.de/example-page?filter, are important, but they often contain the same content as the pages without parameters, except for the newly sorted products. If this is the case and the parameter pages contain the same title, H1, and text content as the page without parameters, this can lead to duplicate content.

External duplicate content

Duplicate content with other domains occurs either when content is knowingly or unknowingly copied and thus stolen, or when content is published on different websites with consent.

Content is often knowingly copied by spam sites that use it to generate traffic for themselves. However, some people also copy your content without realizing that this is a problem. This often happens with partner websites, where one website assumes it may copy the other’s content because of the partnership.

The most common examples of content published on different websites by mutual consent are product descriptions that manufacturers provide to multiple retailers, or news agency articles, such as those from dpa, that may be published on multiple newspaper domains.

Duplicate content check: How can you check websites for duplicate content?

Duplicate content can be detected in several ways. However, there are somewhat fewer options for external duplicate content:

Find external duplicate content

If you want to check whether other external sites are copying your content, you can find out using Google Search. To do this, take a passage of text from your website and enter it into Google Search with quotation marks.
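The quoted search described above can also be prepared with a few lines of Python. This is only a sketch: the function name exact_match_search_url is an assumption, and the script just builds the search URL for you to open in a browser; it does not query Google itself.

```python
from urllib.parse import urlencode


def exact_match_search_url(passage):
    """Build a Google search URL that looks for the passage as an exact phrase."""
    # Remove quotation marks inside the passage so they do not break the phrase search.
    phrase = '"' + passage.strip().replace('"', "") + '"'
    return "https://www.google.com/search?" + urlencode({"q": phrase})


if __name__ == "__main__":
    print(exact_match_search_url("Duplicate content check"))
```

A short but distinctive sentence from the middle of your text usually works better than a headline, which other sites may share by coincidence.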

Find internal duplicate content

Internal duplicate content can be found using the same method you use to detect external duplicate content.

You can also identify internal duplicate content using other indicators that you can find by crawling your website (e.g., with a tool such as Screaming Frog or Sistrix Optimizer):

Same title tag: If two or more pages have the same title tag, this could be an indication of DC.
Same meta description: Even the same meta description can indicate duplicate pages.
Same H1 heading: The H1 heading should always reflect the page topic and be as unique as possible. If multiple pages have the same H1 heading, this is an indication of DC.
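
The three indicators above can be checked automatically once you have crawl data. The following Python sketch assumes a hypothetical crawl export in the form of a dictionary (URL mapped to title, meta description, and H1); the sample data and the function name shared_values are invented for illustration and are not the export format of any particular tool.

```python
from collections import defaultdict

# Hypothetical crawl export: URL -> fields extracted by a crawler (sample data).
PAGES = {
    "https://exampledomain.de/shoes": {
        "title": "Shoes", "description": "Buy shoes", "h1": "Shoes"},
    "https://exampledomain.de/shoes?filter": {
        "title": "Shoes", "description": "Buy shoes", "h1": "Shoes"},
    "https://exampledomain.de/hats": {
        "title": "Hats", "description": "Buy hats", "h1": "Hats"},
}


def shared_values(pages, field):
    """Map each value of `field` that occurs on more than one URL to those URLs."""
    groups = defaultdict(list)
    for url, data in pages.items():
        # Normalize case and whitespace so trivial differences do not hide duplicates.
        value = data.get(field, "").strip().lower()
        if value:
            groups[value].append(url)
    return {value: urls for value, urls in groups.items() if len(urls) > 1}


if __name__ == "__main__":
    for field in ("title", "description", "h1"):
        print(field, shared_values(PAGES, field))
```

A match on one field is only an indication; pages that share all three values are the strongest candidates for a closer look.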

Solutions: How to deal with duplicate content

If you’ve found duplicate content, there are several solutions you can use to address it. We’ll explain some of them below:

Delete duplicate pages or set them to noindex

If two identical pages exist, for example, because one page was inadvertently duplicated in the CMS, you can simply delete the duplicate. The page will then typically return a 404 error. Therefore, you should redirect it to the original page via a 301 redirect so that users who access the URL again can find the original content.
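
How a 301 redirect is configured depends on your server or CMS. As an illustration of the principle only, here is a minimal sketch using Python’s built-in http.server module. The paths in REDIRECTS and the names response_for and RedirectHandler are assumptions, not part of any real setup.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical mapping: path of the deleted duplicate -> path of the original page.
REDIRECTS = {"/example-page-copy": "/example-page"}


def response_for(path):
    """Return (status, headers) for a requested path."""
    target = REDIRECTS.get(path)
    if target is not None:
        # 301 tells browsers and crawlers that the page has moved permanently.
        return 301, {"Location": target}
    return 200, {"Content-Type": "text/plain; charset=utf-8"}


class RedirectHandler(BaseHTTPRequestHandler):
    """Answers GET requests using the redirect table above."""

    def do_GET(self):
        status, headers = response_for(self.path)
        body = b"" if status == 301 else b"Original content\n"
        self.send_response(status)
        for name, value in headers.items():
            self.send_header(name, value)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, you could pass RedirectHandler to http.server.HTTPServer and request /example-page-copy; the response should carry the status 301 and a Location header pointing to /example-page.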

Test servers and similar environments should never be included in Google’s index. It’s best to secure your test server with password protection via .htaccess. This prevents Google from accessing and crawling the test server’s pages.

If it’s not possible to set up .htaccess protection on the test server or to delete duplicate internal pages, you can also use the noindex tag. It signals to Google that the duplicate pages should not be indexed.
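
To verify that a page really carries the noindex directive, you can inspect its HTML. The following sketch uses Python’s standard html.parser module; the names RobotsMetaParser and is_noindex are assumptions, and the check only covers the robots meta tag in the HTML, not HTTP response headers.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives of every <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            # Directives are comma-separated, e.g. "noindex, follow".
            for part in content.split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def is_noindex(html):
    """Return True if the HTML contains a robots meta tag with 'noindex'."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    sample = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(is_noindex(sample))
```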

Setting up canonical tags

If duplicate content needs to be retained, for example, because it’s a parameter page, canonical tags can be used. Canonical tags signal to Google which page is the original and which is the copy. Google treats this as a strong hint and will then generally not include the copy in its index.

Example:

Original URL: https://exampledomain.com/page

Parameter URL (duplicate): https://exampledomain.com/page?filter

Both URLs receive a canonical tag. The original URL, https://exampledomain.com/page, receives a canonical tag pointing to itself to indicate that it is the original:

<link rel="canonical" href="https://exampledomain.com/page" />

The parameter URL https://exampledomain.com/page?filter receives the same canonical tag as the original URL, thereby signaling that it is a duplicate:

<link rel="canonical" href="https://exampledomain.com/page" />

Tip: If you run ads with Google Ads, parameter URLs are typically created for your landing pages. These should also carry a canonical tag that points to the original URL.
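
If many parameter URLs need a canonical tag, the tag can be derived from the URL itself. The sketch below assumes that the original is always the URL without query string and fragment, which only holds when no parameter changes the main content of the page. The function names canonical_url and canonical_tag are assumptions.

```python
from html import escape
from urllib.parse import urlsplit, urlunsplit


def canonical_url(url):
    """Drop query string and fragment so parameter URLs map to the original URL."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def canonical_tag(url):
    """Return the canonical link tag for the <head> of the given page."""
    return f'<link rel="canonical" href="{escape(canonical_url(url))}" />'


if __name__ == "__main__":
    # Original and parameter URL produce the same tag.
    print(canonical_tag("https://exampledomain.com/page"))
    print(canonical_tag("https://exampledomain.com/page?filter"))
```

Because both calls return the same tag, the original points to itself and the parameter page points to the original, as in the example above.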

What can you do about external duplicate content?

Please note: Copyright law also applies on the internet. If someone copies your intellectual property from your website and publishes it on their own, they are violating copyright law. You can even take legal action against them.

However, with spam sites, it’s often impossible to determine who’s copying your content. Therefore, we recommend reporting spam directly to Google. To do so, you can submit a “Report Content for Legal Reasons” request to Google.

If partners (customers, business partners, etc.) copy your content, politely point out the violation and find a solution together. You don’t want to jeopardize your cooperation.

For content published by mutual consent on various websites, such as newspaper articles or product descriptions, legal action is not an option; the only way to stand out is to change your own page. If you are allowed to change the content, write the product description yourself in your own words and add further content where it makes sense. This gives your website individuality and uniqueness.

What is the difference from keyword cannibalization?

Keyword cannibalization occurs when several of your URLs rank in the search results for the same topic or keyword. This is generally not duplicate content, but rather similar content: multiple pages that cover the same topic without being identical or sharing the same text blocks.

You can learn more about keyword cannibalization in our guide “Detecting and Resolving Keyword Cannibalism”.
