Studying Amazon's robots.txt


Author: nicokani | Published 2018-05-13 00:00

•What is robots.txt
•Where robots.txt is placed
•What robots.txt is for
•Syntax of the robots.txt file
•Crawler names of common search engines
•Studying Amazon's robots.txt

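I. What is robots.txt

robots.txt is a plain-text file. It expresses a convention (the Robots Exclusion Protocol) rather than a command: well-behaved crawlers follow it voluntarily, and nothing enforces it. It is the first file a search engine crawler looks at when it visits a site. In it, the site administrator can declare which parts of the site search engines should not access, or restrict them to specified content only.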

II. Where robots.txt is placed

robots.txt must be placed in the root directory of a site, and the file name must be all lowercase.
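
Because the file always sits at the site root, its URL can be derived from any page URL on the same site. The following is a minimal sketch of that derivation; the helper name `robots_url` and the sample page URL are illustrative assumptions, not part of the original article.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep only the scheme and host (with port, if any); the path is
    # always the lowercase /robots.txt at the root, with no query string.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


# Hypothetical page URL, used only for illustration.
print(robots_url("https://www.amazon.com/gp/cart/view.html?ref=nav"))
# prints https://www.amazon.com/robots.txt
```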

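III. What robots.txt is for

1. Protecting privacy by keeping private or sensitive areas out of search results (compliance is voluntary, so this is not access control), and steering crawlers toward specified sections or content: when a crawler visits a site, it first checks whether robots.txt exists in the site's root directory. If it does, the crawler uses its contents to determine what it may access; if it does not, the crawler simply follows links and fetches whatever it finds.
2. Blocking links that are unfriendly to search engines, 404 error pages, and pages with no content or no value.
3. Stopping unwanted search engines from consuming the server's bandwidth.
4. Controlling crawler access so that it does not interfere with normal traffic to the site.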
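
The check-before-fetch behavior described in item 1 can be sketched with Python's standard `urllib.robotparser` module. To stay runnable offline, this sketch parses a three-line excerpt held in a string instead of downloading the live file; the crawler name `MyCrawler` and the sample URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# A small excerpt in the style of Amazon's file, kept inline so the
# example does not need network access.
RULES = """\
User-agent: *
Disallow: /gp/cart
Disallow: /gp/sign-in
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())


def allowed(url: str, agent: str = "MyCrawler") -> bool:
    """Ask the parsed rules whether `agent` may fetch `url`."""
    return parser.can_fetch(agent, url)


# The cart path starts with a disallowed prefix, so it is refused.
print(allowed("https://www.amazon.com/gp/cart/view.html"))  # False
# No rule covers this hypothetical product path, so it is permitted.
print(allowed("https://www.amazon.com/dp/B000EXAMPLE"))     # True
```

A real crawler would call `set_url()` and `read()` on the parser to fetch the live file before making these checks.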

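IV. Syntax of the robots.txt file

1. User-agent: names the crawler that the rules below it apply to
e.g.
  User-agent: *  # the rules that follow apply to all crawlers
  User-agent: Baiduspider  # the rules that follow apply only to Baidu's crawler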
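2. Disallow: specifies pages or directories that crawlers must not fetch
e.g.
  Disallow: /  # blocks the entire site
  Disallow: /exec  # blocks every URL whose path starts with /exec, including the exec directory
  Disallow: /gp/legacy-handle-buy-box.html  # blocks the legacy-handle-buy-box.html page in the gp directory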
3. Allow: specifies pages or subdirectories that crawlers may fetch
e.g.
  Allow: /exec  # allows URLs whose path starts with /exec
  Allow: /gp/legacy-handle-buy-box.html  # allows the legacy-handle-buy-box.html page in the gp directory
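4. The wildcards "$" and "*"
① $ matches the end of the URL
② * matches zero or more of any character

e.g.
  Disallow: /gp/*.jpg$  # blocks every URL under gp that ends in .jpg
  Disallow: /bin/*.htm  # blocks every URL under /bin/ that contains ".htm" (without a trailing $, ".html" matches too)
  Disallow: /*?*  # blocks every URL that contains "?", i.e. all dynamic pages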
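
The wildcard rules above can be made concrete by translating a pattern into a regular expression. This is a hand-rolled sketch under the semantics described in this section (prefix match, `*` for any run of characters, trailing `$` for end of URL). The function names are illustrative assumptions, and real crawlers may differ in details.

```python
import re


def rule_to_regex(pattern: str):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' so that
    # every '*' in the pattern matches zero or more characters.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    # A trailing '$' pins the match to the end of the path.
    return re.compile(body + (r"\Z" if anchored else ""))


def rule_matches(pattern: str, path: str) -> bool:
    # re.match only anchors at the start, which gives prefix semantics.
    return rule_to_regex(pattern).match(path) is not None


print(rule_matches("/gp/*.jpg$", "/gp/images/cover.jpg"))         # True
print(rule_matches("/gp/*.jpg$", "/gp/images/cover.jpg?size=l"))  # False
print(rule_matches("/bin/*.htm", "/bin/docs/index.html"))         # True
print(rule_matches("/*?*", "/b?node=7454917011"))                 # True
```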

V. Crawler names of common search engines

Baidu: Baiduspider
Google: Googlebot
Yahoo: Yahoo! Slurp
Bing: Bingbot (formerly MSNbot)
Youdao: YoudaoBot
Soso: Sosospider
Lycos: lycos_spider_(T-Rex)
Sogou: Sogouspider
360: 360Spider
Yisou: Yisouspider
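
These names matter because a crawler uses its own name to decide which `User-agent` group of a robots.txt file applies to it. The sketch below assumes a common convention: a case-insensitive match on the name, with the most specific matching group winning and `*` used only as a fallback. The function `select_group` and the sample data are hypothetical, and individual crawlers may resolve groups differently.

```python
def select_group(groups, crawler_name):
    """Pick the rule group that applies to crawler_name.

    `groups` maps a User-agent token (such as "Googlebot" or "*")
    to the list of rule lines declared under it.
    """
    name = crawler_name.lower()
    # Candidate tokens are those contained in the crawler's name,
    # compared without regard to case.
    candidates = [t for t in groups if t != "*" and t.lower() in name]
    if candidates:
        # Prefer the longest token, i.e. the most specific group.
        return groups[max(candidates, key=len)]
    # No named group matched, so fall back to the catch-all group.
    return groups.get("*", [])


groups = {
    "*": ["Disallow: /gp/cart"],
    "Googlebot": ["Disallow: /gp/aw/cr/"],
    "EtaoSpider": ["Disallow: /"],
}
print(select_group(groups, "Googlebot"))    # ['Disallow: /gp/aw/cr/']
print(select_group(groups, "Baiduspider"))  # ['Disallow: /gp/cart']
```

Under this convention a crawler that finds its own group does not also apply the `*` group, which would explain why a named group often repeats the catch-all rules.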

VI. Studying Amazon's robots.txt

Start by looking at Amazon's robots.txt: https://www.amazon.com/robots.txt
Below is the file, with explanatory comments added on selected lines:

User-agent: *  # rules for all crawlers
Disallow: /exec/obidos/account-access-login
Disallow: /exec/obidos/change-style
Disallow: /exec/obidos/flex-sign-in
Disallow: /exec/obidos/handle-buy-box
Disallow: /exec/obidos/tg/cm/member/
Disallow: /gp/aw/help/id=sss  # blocks URLs starting with /gp/aw/help/id=sss
Disallow: /gp/cart  # blocks the shopping cart under gp
Disallow: /gp/flex
Disallow: /gp/product/e-mail-friend
Disallow: /gp/product/product-availability
Disallow: /gp/product/rate-this-item
Disallow: /gp/sign-in  # blocks the sign-in pages under gp
Disallow: /gp/reader
Disallow: /gp/sitbv3/reader
Disallow: /gp/richpub/syltguides/create
Disallow: /gp/gfix
Disallow: /gp/associations/wizard.html
Disallow: /gp/dmusic/order
Disallow: /gp/legacy-handle-buy-box.html  # blocks the legacy-handle-buy-box.html page under gp
Disallow: /gp/aws/ssop
Disallow: /gp/yourstore
Disallow: /gp/gift-central/organizer/add-wishlist
Disallow: /gp/vote
Disallow: /gp/voting/
Disallow: /gp/music/wma-pop-up
Disallow: /gp/customer-images
Disallow: /gp/richpub/listmania/createpipeline
Disallow: /gp/content-form
Disallow: /gp/pdp/invitation/invite
Disallow: /gp/customer-reviews/common/du
Disallow: /gp/customer-reviews/write-a-review.html  # blocks the write-a-review.html page under gp/customer-reviews
Disallow: /gp/associations/wizard.html
Disallow: /gp/music/clipserve
Disallow: /gp/customer-media/upload
Disallow: /gp/history  # blocks browsing history under gp
Disallow: /gp/item-dispatch
Disallow: /gp/dmusic/order/handle-buy-box.html
Disallow: /gp/recsradio
Disallow: /gp/slredirect
Disallow: /dp/shipping/
Disallow: /dp/twister-update/
Disallow: /dp/manual-submit/
Disallow: /dp/e-mail-friend/
Disallow: /dp/product-availability/
Disallow: /dp/rate-this-item/
Disallow: /gp/registry/wishlist/*/reserve  # blocks the reserve path under every entry in gp/registry/wishlist
Disallow: /gp/structured-ratings/actions/get-experience.html
Disallow: /gp/twitter/
Disallow: /ap/signin
Disallow: /gp/registry/wishlist/
Disallow: /wishlist/
Allow: /wishlist/universal*  # allows URLs starting with /wishlist/universal
Allow: /wishlist/vendor-button*
Allow: /wishlist/get-button*
Disallow: /gp/wishlist/
Allow: /gp/wishlist/universal*
Allow: /gp/wishlist/vendor-button*
Allow: /gp/wishlist/ipad-install*
Disallow: /registry/wishlist/
Disallow: /review/common/du
Disallow: /gp/registry/search.html
Disallow: /product-reviews/B0069IY63Y
Disallow: /gp/orc/rml/
Disallow: */gcrnsts  # blocks any URL containing /gcrnsts
Disallow: /gp/gc/widget
Disallow: /gp/dmusic/mp3/player
Disallow: /gp/entity-alert/external
Disallow: /gp/customer-reviews/dynamic/sims-box
Disallow: /review/dynamic/sims-box
Disallow: /gp/redirect.html  # blocks the redirect.html page under gp
Disallow: /gp/twister/ajaxv2
Disallow: /ss/twister/ajax
Disallow: /b?*node=7454917011  # blocks /b URLs whose query string contains node=7454917011
Disallow: /b?*node=7454927011
Disallow: /b?*node=7454939011
Disallow: /b?*node=7454898011
Disallow: /gp/customer-media/actions/delete/
Disallow: /gp/customer-media/actions/edit-caption/
Disallow: /gp/dmusic/
Allow: /gp/dmusic/promotions/PrimeMusic  # allows the PrimeMusic page under gp/dmusic/promotions
Allow: /gp/dmusic/promotions/AmazonMusicUnlimited
Disallow: /gp/offer-listing/
Allow: /gp/offer-listing/B000
Allow: /gp/offer-listing/9000
Disallow: /b?*node=9052533011
Disallow: /lm/R1XIHQVKXSKBNJ
Disallow: /lm/R3HQ5WJSZK6QSO
Disallow: /surprise/
Disallow: /local/ajax/
Disallow: */B00M3E1NYI  # blocks any URL containing /B00M3E1NYI
Disallow: */B00M3E1Q5Y
Disallow: */B00M3E1TOM
Disallow: */B00M3E1WYO
Disallow: */B00M3E204K
Disallow: */B00M3E236A
Disallow: */B00M3E260I
Disallow: */B00M3E28WO
Disallow: */B00M3E2BC6
Disallow: */B00M3E2DPQ
Disallow: */B00M3E2GU8
Disallow: */B00M3E2J14
Disallow: */B00M3E2LOE
Disallow: */B00M3E1HJY
Disallow: /gp/socialmedia/giveaways
Disallow: /gp/b2b-rd
Disallow: /gp/aw/so.html
Disallow: /gp/rentallist
Disallow: /gp/video/dvd-rental/settings
Disallow: /gp/rl/settings
Disallow: /gp/video/settings
Disallow: /gp/video/library
Disallow: /gp/video/watchlist
Disallow: /reviews/iframe
Disallow: /gp/switch-language
Disallow: /ga/p/
Disallow: /gp/profile/
Disallow: /giveaway/host/setup/
Disallow: /ss/customer-reviews/lighthouse/
Disallow: /ospublishing/story/*  # blocks everything under /ospublishing/story/
Disallow: /gp/aw/ol/
Disallow: /gp/promotion/
Disallow: /hz/leaderboard/top-reviewers/
Disallow: /hz/leaderboard/hall-of-fame/
Disallow: /review/top-reviewers/
Disallow: /review/hall-of-fame
Disallow: /reviews/top-reviewers/
Disallow: /reviews/hall-of-fame

User-agent: Googlebot  # rules for Google's crawler
Disallow: /rss/people/*/reviews  # blocks Googlebot from the reviews path under every entry in /rss/people/
Disallow: /gp/pdp/rss/*/reviews
Disallow: /gp/cdp/member-reviews/
Disallow: /gp/aw/cr/
Disallow: /exec/obidos/account-access-login
Disallow: /exec/obidos/change-style
Disallow: /exec/obidos/flex-sign-in
Disallow: /exec/obidos/handle-buy-box
Disallow: /exec/obidos/tg/cm/member/
Disallow: /gp/aw/help/id=sss
Disallow: /gp/cart
Disallow: /gp/flex
Disallow: /gp/product/e-mail-friend
Disallow: /gp/product/product-availability
Disallow: /gp/product/rate-this-item
Disallow: /gp/sign-in
Disallow: /gp/reader
Disallow: /gp/sitbv3/reader
Disallow: /gp/richpub/syltguides/create
Disallow: /gp/gfix
Disallow: /gp/associations/wizard.html
Disallow: /gp/dmusic/order
Disallow: /gp/legacy-handle-buy-box.html
Disallow: /gp/aws/ssop
Disallow: /gp/yourstore
Disallow: /gp/gift-central/organizer/add-wishlist
Disallow: /gp/vote
Disallow: /gp/voting/
Disallow: /gp/music/wma-pop-up
Disallow: /gp/customer-images
Disallow: /gp/richpub/listmania/createpipeline
Disallow: /gp/content-form
Disallow: /gp/pdp/invitation/invite
Disallow: /gp/customer-reviews/common/du
Disallow: /gp/customer-reviews/write-a-review.html
Disallow: /gp/associations/wizard.html
Disallow: /gp/music/clipserve
Disallow: /gp/customer-media/upload
Disallow: /gp/history
Disallow: /gp/item-dispatch
Disallow: /gp/dmusic/order/handle-buy-box.html
Disallow: /gp/recsradio
Disallow: /gp/slredirect
Disallow: /dp/shipping/
Disallow: /dp/twister-update/
Disallow: /dp/manual-submit/
Disallow: /dp/e-mail-friend/
Disallow: /dp/product-availability/
Disallow: /dp/rate-this-item/
Disallow: /gp/registry/wishlist/*/reserve
Disallow: /gp/structured-ratings/actions/get-experience.html
Disallow: /gp/twitter/
Disallow: /ap/signin
Disallow: /gp/registry/wishlist/
Disallow: /wishlist/
Allow: /wishlist/universal*  # allows Googlebot to fetch URLs starting with /wishlist/universal
Allow: /wishlist/vendor-button*
Allow: /wishlist/get-button*
Disallow: /gp/wishlist/
Allow: /gp/wishlist/universal*
Allow: /gp/wishlist/vendor-button*
Allow: /gp/wishlist/ipad-install*
Disallow: /registry/wishlist/
Disallow: /review/common/du
Disallow: /gp/registry/search.html
Disallow: /product-reviews/B0069IY63Y
Disallow: /gp/orc/rml/
Disallow: */gcrnsts  # blocks Googlebot from any URL containing /gcrnsts
Disallow: /gp/gc/widget
Disallow: /gp/dmusic/mp3/player
Disallow: /gp/entity-alert/external
Disallow: */sim/B001132UEE
Disallow: /gp/customer-reviews/dynamic/sims-box
Disallow: /review/dynamic/sims-box
Disallow: /gp/redirect.html
Disallow: /gp/twister/ajaxv2
Disallow: /ss/twister/ajax
Disallow: /b?*node=7454917011
Disallow: /b?*node=7454927011
Disallow: /b?*node=7454939011
Disallow: /b?*node=7454898011
Disallow: /gp/customer-media/actions/delete/
Disallow: /gp/customer-media/actions/edit-caption/
Disallow: /gp/dmusic/
Allow: /gp/dmusic/promotions/PrimeMusic
Allow: /gp/dmusic/promotions/AmazonMusicUnlimited
Disallow: /gp/offer-listing/
Allow: /gp/offer-listing/B000
Allow: /gp/offer-listing/9000
Disallow: /b?*node=9052533011
Disallow: /lm/R3HQ5WJSZK6QSO
Disallow: /lm/R1XIHQVKXSKBNJ
Disallow: /surprise/
Disallow: /local/ajax/
Disallow: */B00M3E1NYI
Disallow: */B00M3E1Q5Y
Disallow: */B00M3E1TOM
Disallow: */B00M3E1WYO
Disallow: */B00M3E204K
Disallow: */B00M3E236A
Disallow: */B00M3E260I
Disallow: */B00M3E28WO
Disallow: */B00M3E2BC6
Disallow: */B00M3E2DPQ
Disallow: */B00M3E2GU8
Disallow: */B00M3E2J14
Disallow: */B00M3E2LOE
Disallow: */B00M3E1HJY
Disallow: /gp/aag
Allow: /gp/aag/main?*seller=ABVFEJU8LS620
Disallow: /gp/socialmedia/giveaways
Disallow: /gp/b2b-rd
Disallow: /gp/aw/so.html
Disallow: /gp/pdp/profile/
Disallow: /gp/video/library
Disallow: /gp/video/watchlist
Disallow: /gp/video/settings
Disallow: /gp/rl/settings
Disallow: /gp/video/dvd-rental/settings
Disallow: /gp/rentallist
Disallow: /reviews/iframe
Disallow: /gp/switch-language
Disallow: /ga/p/
Disallow: /giveaway/host/setup/
Disallow: /gp/help/customer/express/c2c/

User-agent: EtaoSpider  # rules for Etao's crawler
Disallow: /  # blocks the entire Amazon site
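
The file above often pairs a broad Disallow with a narrower Allow, for example `Disallow: /gp/dmusic/` next to `Allow: /gp/dmusic/promotions/PrimeMusic`. The file itself does not say how a crawler resolves such overlaps. The sketch below assumes the longest-match rule used by Google and specified in RFC 9309: the most specific matching pattern wins, and a tie goes to Allow. Other crawlers may evaluate rules differently, and the function names here are illustrative.

```python
import re

# A handful of rules copied from the listing above.
RULES = [
    ("disallow", "/gp/dmusic/"),
    ("allow", "/gp/dmusic/promotions/PrimeMusic"),
    ("disallow", "/wishlist/"),
    ("allow", "/wishlist/universal*"),
    ("disallow", "/gp/cart"),
]


def matches(pattern, path):
    """Prefix match; '*' is a wildcard, a trailing '$' anchors the end."""
    end = r"\Z" if pattern.endswith("$") else ""
    body = ".*".join(re.escape(p) for p in pattern.rstrip("$").split("*"))
    return re.match(body + end, path) is not None


def is_allowed(path, rules=RULES):
    """Longest matching pattern wins; ties go to Allow; no match allows."""
    best = ("allow", "")  # default verdict when no rule matches
    for kind, pattern in rules:
        if not matches(pattern, path):
            continue
        longer = len(pattern) > len(best[1])
        tie_for_allow = len(pattern) == len(best[1]) and kind == "allow"
        if longer or tie_for_allow:
            best = (kind, pattern)
    return best[0] == "allow"


print(is_allowed("/gp/dmusic/promotions/PrimeMusic"))  # True
print(is_allowed("/gp/dmusic/order"))                  # False
print(is_allowed("/wishlist/universal-button"))        # True
```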

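I had previously looked at the robots.txt files of Taobao (https://www.taobao.com/robots.txt) and JD.com (https://www.jd.com/robots.txt). Compared with those two, Amazon's robots.txt is far more detailed, which makes it a good example to learn from.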

Appendix: a more detailed reconstruction of Amazon's robots.txt, with explanations of some of its directives: http://canicrawl.com/amazon.com/
