How can we stop Googlebot indexing dynamically generated URLs?


Hi everyone - I hope someone can help with this problem.
 
Magento seems to be generating a LOT of dynamic URLs which are then being crawled and indexed in error by Googlebot.
 
We think most of them are search results pages, but as time goes on we're starting to think Magento must be creating other dynamic URLs we're not aware of. Is this possible?
 

Our Magento ver. 1.8.1.0 website has around 3,500 pages in its sitemap, but Google Webmaster Tools is now showing nearly 100K pages in its 'index status>total indexed' count.

 
We noticed this number first starting to climb over three months ago and it reached about 45k in about a month, then flatlined for three months, before climbing again dramatically last week to near 100k.
 
We initially thought that Google was indexing our site's internal search return pages, so we added these folders to our robots.txt with 'disallow' directives.
 
While this didn't seem to remove any pages from Google's index, it did appear, for a three-month period, to stop any more being added to the Total Indexed count. This has now changed, with a massive increase up to the 100K mark last week.
 
The number of pages which the robots.txt file is blocking, according to Index Status>Advanced, is over 3 million.
 
Three months ago we also added 'noindex,nofollow' to the site's internal search results pages we thought Googlebot must be crawling, and also used the Webmaster Tools URL removal tool to remove the directories we think are relevant.
 
Over the last couple of days we have also used the Crawl>URL Parameters to add the q parameter and set it to NO URLs, again with a view to telling Googlebot not to crawl and index our site's internal search return pages.
 
Our sitemap of the 3,500 URLs which we actually want Google to crawl and index is uploaded every week via WMT.
 
What we really need to know is, what exactly are these pages Googlebot has indexed for our site, and what are the millions that it has detected and our robots.txt is blocking?
 
Why is Google continuing to try to crawl and index these pages when we've told it using various methods that it shouldn't?
 
Or could it be that it's not the search return pages and Google is indexing pages of a different nature? If it is, how would we go about identifying them?
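One way to identify the pages is to look at what Googlebot actually requests in the web server's access log. The sketch below is only an illustration, assuming the server writes the common "combined" log format. The function name `summarize_googlebot_hits`, the sample log lines, and the dates in them are invented for the example and are not taken from the site.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qsl

# ASSUMPTION: "combined" access log format (request, status, size, referrer, agent).
LOG_LINE = re.compile(
    r'"(?:GET|HEAD) (?P<target>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def summarize_googlebot_hits(lines, depth=3):
    """Count Googlebot requests per path prefix plus query parameter names."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.search(line)
        if not match or "Googlebot" not in match.group("agent"):
            continue
        parts = urlsplit(match.group("target"))
        # Keep only the first `depth` path segments so similar URLs group together.
        segments = [s for s in parts.path.split("/") if s][:depth]
        params = sorted({k for k, _ in parse_qsl(parts.query, keep_blank_values=True)})
        key = "/" + "/".join(segments)
        if params:
            key += "?" + "&".join(params)
        counts[key] += 1
    return counts


# ASSUMPTION: invented sample lines, for illustration only.
GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
sample_log = [
    f'66.249.66.1 - - [01/Jan/2000:10:00:00 +0000] "GET /index.php/catalogsearch/result/?q=abc HTTP/1.1" 200 5120 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [01/Jan/2000:10:00:05 +0000] "GET /index.php/catalogsearch/result/?q=xyz HTTP/1.1" 200 5090 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [01/Jan/2000:10:00:09 +0000] "GET /catalogsearch/advanced/?q=test HTTP/1.1" 200 4800 "-" "{GOOGLEBOT}"',
    '203.0.113.7 - - [01/Jan/2000:10:00:12 +0000] "GET /index.php/catalogsearch/result/?q=chair HTTP/1.1" 200 5300 "-" "Mozilla/5.0"',
]

summary = summarize_googlebot_hits(sample_log)
for pattern, hits in summary.most_common():
    print(hits, pattern)
```

Grouping by path prefix and parameter names rather than by full URL keeps the report short even when the query strings are random. Any unexpected prefix near the top of the list is a candidate for the unknown pages.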
 
Does anyone know how I go about fixing the problem? As I said, we're using Magento ver. 1.8.1.0.
 
Hope someone can help.
 
All the best - Alex JW
9 REPLIES

Re: How can we stop Googlebot indexing dynamically generated URLs?

Alex,

Can you provide us with an example of the dynamic URLs you are referring to?

 

Also - are you using any SEO or Extended Layered navigation plugins in your store?

Problem solved? Click Accept as Solution!
Magento Certified Developer Plus | www.iwebsolutions.co.uk | Magento Small Business Partner

Re: How can we stop Googlebot indexing dynamically generated URLs?

Hi, thanks for responding.

 

No, we're not using any SEO or Extended Layered Navigation plugins that I'm aware of. We're now using Solr for the site search, but I'm not sure if that's relevant.

 

This is an example of a URL which Google is crawling and indexing:

http://www.essentialaids.com/index.php/catalogsearch/result/?q=0002xduvbzfw7rka8nfdtyhatoxww6jb5oazd...

 

As you can see, it's a search query with a lot of apparently nonsense search terms.

 

To try and stop Google indexing these pages, since around three months ago, we've set robots.txt to block Googlebot from the pages in these folders:

 

 
As pages generated in these folders should always have 'noindex,nofollow' tags in the html, I'm at a loss as to how we've got even more unwanted pages indexed.
 
It makes me wonder a) whether the tags are correctly in place, b) whether it's possible for Googlebot to ignore these tags in some cases, and c) whether these new pages appearing in Google's index are unrelated to our site's search results pages, which would explain why they've not been blocked by robots.txt or received the noindex message if the bot reaches them.
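Whether a given URL is actually covered by a robots.txt rule can be tested offline. The sketch below uses Python's standard `urllib.robotparser`. The rules in `ROBOTS_TXT` are hypothetical, since the real file is not shown in this thread, and `is_blocked` is a helper name made up for the example. It illustrates one thing worth ruling out: robots.txt rules match from the start of the path, so a rule for `/catalogsearch/` does not cover the same pages when they are reached under `/index.php/`.

```python
from urllib.robotparser import RobotFileParser

# ASSUMPTION: hypothetical rules; the real robots.txt is not shown in the thread.
ROBOTS_TXT = """\
User-agent: *
Disallow: /catalogsearch/
"""


def is_blocked(robots_txt, url, agent="Googlebot"):
    """Return True if the given rules forbid `agent` from fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


# URL patterns from the posts ("q=test" stands in for the long random query).
plain_url = "http://www.essentialaids.com/catalogsearch/advanced/?q=test"
index_php_url = "http://www.essentialaids.com/index.php/catalogsearch/result/?q=test"

for url in (plain_url, index_php_url):
    print("blocked    " if is_blocked(ROBOTS_TXT, url) else "NOT blocked", url)
```

Note that this parser only does simple prefix matching and does not understand the `*` and `$` wildcards that Googlebot supports, so it is a rough check rather than an exact model of Google's behaviour.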
 
Now according to WMT in the week leading up to the 16 Aug, suddenly the number of URLs indexed has raced up to nearly 100k. 
 
I just can't work out why this would be happening.
 
I'd be very grateful for any help you can offer.
 
All the best, Alex

Re: How can we stop Googlebot indexing dynamically generated URLs?

Hi again,

 

Just to add to my last post, is it possible to completely switch off Magento's underlying search functions, which Googlebot seems to be accessing?

 

We are using Solr for our customer site search, so the standard Magento search no longer serves any practical purpose.

 

Is it possible to simply turn off the default Magento search so that it no longer serves pages when Googlebot attempts to crawl URLs like

 

http://www.essentialaids.com/catalogsearch/advanced/?q=test

http://www.essentialaids.com/index.php/catalogsearch/result/?q=0002xduvbzfw7rka8nfdtyhatoxww6jb5oazd...

 

In fact, it would be much better for Googlebot to find nothing at all in these folders:

 

http://www.essentialaids.com/catalogsearch/advanced/

 

I'm really desperate to find a solution here. Any help very gratefully received.

 

All the best, Alex

 

Re: How can we stop Googlebot indexing dynamically generated URLs?

How many rows do you have in your core_url_rewrite table?

Do you have any products that have the same name as one another?

www.iwebsolutions.co.uk | Magento Small Business Partner

Re: How can we stop Googlebot indexing dynamically generated URLs?

Could you tell me how I view the core_url_rewrite table please? It's not something I'm familiar with.

 

Thanks for your help.

 

Alex

Re: How can we stop Googlebot indexing dynamically generated URLs?

It is a table in the database.

 

I was just wondering if you had multiple products with the same name. This would cause this table to fill up faster, and the URLs it creates may have reached Google. It is a bit of a guess though, as I cannot see any URLs I wouldn't expect to be in Google.
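For anyone unsure how to look at the table, the row count can be read with any database client or a short script. The sketch below is an illustration only: `rewrite_stats` is a made-up helper, and it runs against an in-memory SQLite stand-in with invented rows so that it is self-contained. On a real store the connection would come from a MySQL driver instead, and the table name may carry a prefix. It assumes the table has a `product_id` column.

```python
import sqlite3


def rewrite_stats(conn, table="core_url_rewrite", top=5):
    """Return (total rows, [(product_id, rewrite count), ...]) for the table.

    `table` must be a trusted name, since it is placed directly in the SQL.
    """
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    total = cur.fetchone()[0]
    cur.execute(
        f"SELECT product_id, COUNT(*) AS n FROM {table} "
        "WHERE product_id IS NOT NULL "
        f"GROUP BY product_id ORDER BY n DESC, product_id LIMIT {int(top)}"
    )
    return total, cur.fetchall()


# ASSUMPTION: stand-in table and invented rows, not data from the real store.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE core_url_rewrite (request_path TEXT, product_id INTEGER)")
demo.executemany(
    "INSERT INTO core_url_rewrite VALUES (?, ?)",
    [
        ("blue-widget.html", 1),
        ("blue-widget-1.html", 1),
        ("blue-widget-2.html", 1),
        ("red-widget.html", 2),
        ("widgets.html", None),  # a category rewrite with no product
    ],
)

total, busiest = rewrite_stats(demo)
print("total rows:", total)
print("most rewrites per product:", busiest)
```

A total far above the number of products and categories, or a handful of products with very many rewrites each, would support the duplicate-name theory.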

www.iwebsolutions.co.uk | Magento Small Business Partner

Re: How can we stop Googlebot indexing dynamically generated URLs?

The number of pages Google has indexed for us is on a grand scale. In Webmaster Tools the total index count has suddenly gone up to 100K.

 

Are there any other folders which Magento generates which could hold search returns, apart from the ones I'm already aware of:

 

 
Pages generated in these folders all have 'nofollow' tags, but I'm concerned search results might also be generated in another folder which I'm not aware of.
 
Does anyone know anything about that?
 
Thanks in advance if you can help. Alex

Re: How can we stop Googlebot indexing dynamically generated URLs?

Hi

 

I did post about this problem on Friday but the situation has got even more desperate since.

 

Does anyone have any ideas about what might be happening here? Google's number of pages indexed for our site has now jumped almost another 100k up to 184,000. 

 

We don't know what these extra pages are. We should only have about 5,000 indexed at the most. We thought it was the site's internal search results pages, but they all have 'noindex' tags in them and we are also blocking them via robots.txt, so Googlebot shouldn't even reach them in the first place.
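Whether a page really carries the 'noindex' tag can be checked from its HTML source. The sketch below uses Python's standard `html.parser`. The class and function names and the sample markup are invented for the example. One caveat, as far as I understand Google's documentation: a crawler that obeys robots.txt never downloads a blocked page, so a 'noindex' tag on such a page cannot be read. That means the two measures described above can work against each other.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() in ("robots", "googlebot"):
            # The content attribute is a comma-separated list of directives.
            for part in attr.get("content", "").split(","):
                if part.strip():
                    self.directives.add(part.strip().lower())


def robots_directives(html):
    """Return the set of robots directives found in the page's meta tags."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


# ASSUMPTION: invented sample markup, not the site's real HTML.
sample_page = """<html><head>
<title>Search results</title>
<meta name="robots" content="NOINDEX,NOFOLLOW" />
</head><body></body></html>"""

directives = robots_directives(sample_page)
print("noindex present:", "noindex" in directives)
```

Running the same function over the saved source of a few of the unwanted URLs would show whether the tag is present on every variant, including the ones under `/index.php/`.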

 

Google is reporting that our robots.txt file is blocking many millions of pages.

 

Has anyone got any ideas? We're tearing our hair out.

 

All the best,

 

Alex JW

 

 

Re: How can we stop Googlebot indexing dynamically generated URLs?

Regarding the robots.txt blocking pages - could you post your robots.txt contents in here so we can see why?

 

Regards

Magento Certified Developer Plus | www.iwebsolutions.co.uk | Magento Small Business Partner