More or less every SEO project I have worked on during my 10+ years has involved the challenge of “duplicate content”, that is, content on a website that can be accessed via more than one URL. Duplicate content sometimes stirs up trouble, though it is rarely a huge problem. We have therefore put together a guide to help you select the correct action when dealing with duplicate content. For great content and better conversion rate optimisation you may consult SEO Guru.
What is duplicate content?
In short, it means that a certain piece of content can be accessed via more than one URL. Some of these cases are very simple and obvious to anyone who has worked with SEO for a while. For example, a site can often be reached both with “www” and without it: http://example.com and http://www.example.com. This is addressed by answering all requests for URLs without “www” with a 301 redirect to the same URL with “www”.
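As a sketch, the non-www-to-www redirect can be set up in an Apache .htaccess file, assuming mod_rewrite is enabled and with example.com standing in for your own domain:

```apache
# Redirect all requests for example.com to www.example.com
# with a 301 (permanent) redirect, preserving the requested path.
RewriteEngine On
RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]
```

The same effect can be achieved in the main server configuration or with other web servers; the point is that one variant of the hostname always answers with a permanent redirect to the other.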
Other common cases of duplicate content:
- Product lists that can be sorted in various ways, where the selected sorting is visible in the URL
- Development and test environments where the entire website can be accessed on, for example, a separate subdomain (dev.example.com)
- Products that belong to several different product categories, where the category you arrived from is visible in the product's URL
How do you solve the problem?
There are many different ways to solve problems with duplicate content, and it is not always easy to see which one is best. Some common solutions are:
- Stopping search engines from spidering the content via robots.txt
- Preventing search engines from indexing content via meta tags (<meta name="robots" content="noindex, nofollow">)
- Using a rel="canonical" tag to instruct search engines that one URL has the same content as another URL
- Blocking users as well as search engines from certain pages with password protection
- Server-based redirection via a so-called 301 Permanent Redirect
In this post, I do not intend to describe exactly how these different solutions are implemented, but instead to concentrate on which solution should be chosen depending on the prevailing conditions.
A very important aspect to keep in mind is that if you block search engines from spidering a URL via robots.txt, they will not be able to see any meta tags used on the blocked pages. And just because search engines are blocked from spidering a URL does not mean that they cannot still list the URL in their search results. If a page that is blocked in robots.txt gets links from sites other than your own (so-called external links), there is an imminent risk that a search engine like Google will choose to list the page in its results anyway. But instead of showing your meta description or a snippet of text from your page alongside the listing, the user is informed that the content of the page could not be read because of the settings in robots.txt.
This means that if you really do not want a particular page to be listed in the search results, you must either let the search engines spider the page and use meta tags to instruct them not to index it, or password-protect access to the page, which is rarely practical if your regular visitors need to be able to reach the page.
Personally, I’m not particularly fond of using robots.txt. Mainly because it openly lists, for anyone to read, which directories or URLs on the site may contain sensitive content that you do not want search engines to spider. Moreover, experience has shown that search engines do not always follow the instructions in robots.txt fully, particularly if you use so-called wildcards or other more advanced instructions to match the URLs that should not be spidered.
However, there is one very clear case where robots.txt is fully justified: when you have a very large site with a great many URLs (perhaps millions) and the “crawl budget” that search engines allocate to your website is not enough. Briefly, the search engines spend too much of their resources spidering unnecessary content instead of focusing on what is important. In this case robots.txt is a very good way to tell the search engines which URLs are not worth spending resources on.
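As an illustration, with hypothetical paths, a robots.txt that keeps spiders away from sorted product-list URLs and an internal test area might look like this:

```text
# robots.txt — placed in the web root, e.g. http://www.example.com/robots.txt
User-agent: *
# Keep spiders away from sort-parameter variants of product lists
# (the * and $ wildcard syntax is supported by Google but not by every crawler)
Disallow: /*?sort=
# Keep spiders out of an internal test area
Disallow: /dev/
```

Remember that this only discourages spidering; as described above, a blocked URL can still appear in the results if other sites link to it.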
Learn more about robots.txt here: http://www.robotstxt.org/robotstxt.html
The meta robots tag is the solution to use if you really want to ensure that pages your visitors must be able to access do not show up in search engine results. However, be very careful not to unintentionally exclude pages that are important because they rank well for the search phrases your potential customers use when looking for what you have to offer.
Another thing to take into consideration is whether the pages you exclude with meta robots could get external links pointing to them. If so, you can use “noindex, follow” to tell search engines to follow the links on the page, both to find more content on your site and to pass on any link value from those pages. In these cases, though, I personally would usually rather use rel canonical.
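A page excluded this way carries the tag in its <head>. As a sketch (the page title is made up), “noindex, follow” asks engines not to list the page itself but still to follow its links:

```html
<!-- In the <head> of the page that should not appear in the results -->
<head>
  <title>Products sorted by price</title>
  <!-- Do not index this page, but do follow the links on it -->
  <meta name="robots" content="noindex, follow">
</head>
```

Since the tag sits in the page itself, the search engine must be allowed to spider the URL for the instruction to be seen at all.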
Read more about meta robots here: http://www.robotstxt.org/meta.html
Rel canonical (<link rel="canonical" href="http://www.example.com/the-canonical-url" />)
In cases where pages on your site must be available to your visitors via different URLs even though the content is the same, and you know that external links to the content may be created, rel canonical is often the simplest solution. However, it is important to remember that search engines treat rel canonical as a recommendation, and there are many cases where they choose not to follow it.
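For example, if the same product page is reachable under two category paths (the URLs here are hypothetical), both copies can point to a single preferred URL:

```html
<!-- Served at both /shoes/product-123 and /sale/product-123 -->
<head>
  <title>Product 123</title>
  <!-- Tell search engines which URL is the preferred (canonical) one -->
  <link rel="canonical" href="http://www.example.com/shoes/product-123" />
</head>
```

Both variants remain accessible to visitors; the tag only suggests to the search engines which URL should collect the ranking signals.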
Read more about rel canonical here: https://support.google.com/webmasters/answer/139066?hl=sv
Block with password protection
If the pages you do not want search engines to list in their results are pages your visitors do not need access to either, blocking through password protection is often very simple and effective. Frequent cases are mirrors of the website on other subdomains, or domains used for test or development purposes.
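A sketch of HTTP basic authentication for an Apache test subdomain; the file path and username are placeholders:

```apache
# Create the password file once (run in a shell):
#   htpasswd -c /etc/apache2/.htpasswd devuser
# Then, in the .htaccess of dev.example.com's document root:
AuthType Basic
AuthName "Development environment"
AuthUserFile /etc/apache2/.htpasswd
Require valid-user
```

Because the server refuses every unauthenticated request, neither visitors nor search-engine spiders can read the content, so nothing from the test environment can end up in the index.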
Learn more about http authentication for Apache via htpasswd here: http://httpd.apache.org/docs/2.2/programs/htpasswd.html
301 Permanent Redirect
For content whose web address has changed, a 301 redirect is the best solution. Briefly, it instructs search engines and visitors' browsers that the content previously found at one web address has now moved to another URL, and any link value accrued to the old URL is transferred to the new one. When rebuilding your site, and provided the URLs change, this is one of the most critical things to remember if you want to avoid a hefty drop in rankings and organic search traffic in conjunction with the launch.
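A sketch of such redirects in Apache, mapping old URLs to their new locations after a relaunch (the paths are made up for illustration):

```apache
# Permanently redirect a single moved page to its new URL
Redirect 301 /old-category/product-123 http://www.example.com/new-category/product-123
# Or redirect a whole renamed directory in one rule
RedirectMatch 301 ^/old-category/(.*)$ http://www.example.com/new-category/$1
```

Mapping every old URL to its closest new equivalent, rather than redirecting everything to the start page, is what preserves the accumulated link value.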