Solved: Re: in robots.txt. what the meaning of

kinan11 · ‎01-04-2016

In robots.txt, what is the meaning of each of the following lines, in detail?

User-agent: *
Disallow: /index.php/
Disallow: /*?
Disallow: /checkout/
Disallow: /app/
Disallow: /lib/
Disallow: /*.php$
Disallow: /pkginfo/
Disallow: /report/
Disallow: /var/
Disallow: /catalog/
Disallow: /customer/
Disallow: /sendfriend/
Disallow: /review/
Disallow: /*SID=

akent99 · ‎01-04-2016

Sorry, I am not sure what you know about robots.txt, so not sure what level you are asking at. Put simply, it is trying to guide search engines about what to index and what not to index on a site. For example, don't index checkout pages. Are all entries there required? I suspect not - there may be a bit of noise in the file. For example, /app is a directory - you should never see it in a URL.

Where did you find this file? I cannot find it shipped with M2, so I assume it is from a project you are working on (and this is not an M2-specific question)?

akent99 · ‎01-08-2016

Hi @kinan11, did I answer your question, or do you have any additional information to provide? If it's not a file shipped with Magento 2, I cannot really say why it contains what it does (other than educated guesses).

kinan11 · ‎01-08-2016

It comes from the Magento 2 admin:

Stores - Configuration

General - Design - Search Engine Robots

In the "Edit custom instruction of robots.txt File" box, I click "Reset to Default" ("This action will delete your custom instructions and reset robots.txt file to system's default settings.").

Then in the box I see this:

User-agent: *
Disallow: /index.php/
Disallow: /*?
Disallow: /checkout/
Disallow: /app/
Disallow: /lib/
Disallow: /*.php$
Disallow: /pkginfo/
Disallow: /report/
Disallow: /var/
Disallow: /catalog/
Disallow: /customer/
Disallow: /sendfriend/
Disallow: /review/
Disallow: /*SID=

akent99 · ‎01-08-2016

Well, that is definitely us then! I will poke around a little internally. I don't understand why all the entries are there myself. I think there might be some historical baggage in there. In particular, the /app and /lib paths don't look right to me. There is no harm, but it does not seem right.

Here is a first summary of the items, however:

/index.php/ - we don't want paths with index.php in them indexed; they should be using the nicer URLs. (We internally redirect URLs to /index.php?.... but that should not be visible externally as far as I know.) I would guess this is a safeguard rule only.
/*? - don't index URLs that contain a query string.
/checkout/ - the checkout pages are not worth indexing, as they don't hold any semantic content for your site that will help your SEO ranking.
/app/ - I don't think this is needed. There should never be a URL pointing at this path.
/lib/ - same as /app/.
/*.php$ - sites should be using nice URLs, not PHP scripts. Not mandatory, but it stops strange cases getting into search engines.
/pkginfo/ - maybe there for historical reasons? Not used to my knowledge (should never occur).
/report/ - won't contain product information, so there is no point indexing it. (Probably would never happen, as you need to log in as an admin anyway.)
/var/ - I suspect the same as /app/ and /lib/ above.
/catalog/ - this one I am not sure about. I wonder if it is used for AJAX or in the Admin.
/customer/ - I suspect this will never happen because it requires Admin authentication, but there is no reason for SEO on customer profiles.
/sendfriend/ - again, no reason to index any pages related to sending a friend information about a product you like.
/review/ - similar to /catalog/.
/*SID= - the URL carries a session ID, so someone must be logged in; we don't want that sort of URL indexed.

I am guessing some of the rules are there in case someone posts a URL to a forum or similar. We don't want a search engine picking up that URL and recommending it. Better to tell the search engine "forget this URL completely". So it might be overly prescriptive for normal usage, but it errs on the side of safety.
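
To make the summary above a little more concrete, here is a minimal Python sketch of how a crawler might evaluate these Disallow lines. It assumes the wildcard conventions that major crawlers document ("*" matches any run of characters, a trailing "$" anchors the end of the URL, and every pattern is matched from the start of the path). The host name shop.example, the sample URLs, and the helper names rule_to_regex and blocking_rule are made up for illustration; this is not how Magento or any particular search engine implements it.

```python
import re
from urllib.parse import urlsplit

# The Disallow patterns from the default file quoted in this thread.
DISALLOW_RULES = [
    "/index.php/", "/*?", "/checkout/", "/app/", "/lib/", "/*.php$",
    "/pkginfo/", "/report/", "/var/", "/catalog/", "/customer/",
    "/sendfriend/", "/review/", "/*SID=",
]


def rule_to_regex(rule):
    """Translate one Disallow pattern into a regular expression.

    Assumed semantics: '*' matches any characters, a trailing '$' means
    "the URL ends here", and the pattern is anchored at the start of the path.
    """
    anchored_end = rule.endswith("$")
    body = rule[:-1] if anchored_end else rule
    # Escape each literal piece, then join the pieces with "any characters".
    pattern = ".*".join(re.escape(part) for part in body.split("*"))
    return re.compile("^" + pattern + ("$" if anchored_end else ""))


def blocking_rule(url, rules=DISALLOW_RULES):
    """Return the first rule that matches the URL's path and query, or None."""
    parts = urlsplit(url)
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    for rule in rules:
        if rule_to_regex(rule).match(target):
            return rule
    return None


# Hypothetical URLs, only to show which rule (if any) applies.
for sample in [
    "https://shop.example/checkout/cart/",
    "https://shop.example/shoes.html?color=red",
    "https://shop.example/shoes.html",
    "https://shop.example/cron.php",
]:
    print(sample, "->", blocking_rule(sample))
```

Because the file contains only Disallow lines, it does not matter here whether a crawler picks the first or the most specific matching rule; a file that mixes Allow and Disallow would need that extra step.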

josh_a · ‎06-12-2017

I have the same question.

Disallow: /catalog/ -- doesn't that mean bots are blocked from crawling every URL that has /catalog/ in it, from that folder on? So, for "www.examplesite.com/pub/media/catalog/product/cache/small_image/140x140/ex23le4443am/awesome-product.html", access to "/catalog/product/cache/small_image/140x140/ex23le4443am/awesome-product.html" would be blocked - correct?

Few of our images are indexed by Google, and I'm assuming this "default" robots.txt is the reason why. Most of our images use that basic path and exist in the catalog folder.
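
One way to check this kind of question is to test the rule directly. A Disallow path is compared against the beginning of the URL path, so "/catalog/" should only apply to URLs whose path starts with /catalog/, not to ones that merely contain it further along. Below is a small sketch using Python's standard urllib.robotparser, which does plain prefix matching (it does not interpret the "*" or "$" wildcards, so only the /catalog/ rule is loaded). The host www.example.com and both URLs are placeholders, not taken from a real site.

```python
from urllib.robotparser import RobotFileParser

# Load just the rule in question; wildcard rules are left out on purpose
# because urllib.robotparser only does simple prefix matching.
parser = RobotFileParser()
parser.parse([
    "User-agent: *",
    "Disallow: /catalog/",
])


def is_allowed(url):
    """True if a generic crawler may fetch the URL under the rule above."""
    return parser.can_fetch("*", url)


# Placeholder URLs: an image under /pub/media/catalog/ and a /catalog/ page.
image_url = ("https://www.example.com/pub/media/catalog/product/cache/"
             "small_image/140x140/awesome-product.jpg")
page_url = "https://www.example.com/catalog/product/view/id/42"

print("image allowed:", is_allowed(image_url))
print("page allowed:", is_allowed(page_url))
```

If the image URL comes back as allowed, as prefix matching suggests it should, then this rule on its own probably does not explain why images are missing from the index, and it may be worth looking at the other rules or at how the images are linked.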