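The Robots protocol (also called the crawler protocol or robot protocol) is formally the "Robots Exclusion Standard" (Robots Exclusion Protocol). A website uses it to tell search engines which pages may be crawled and which may not. (Baidu Baike)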
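- File syntax
- User-agent: * Here * is a wildcard that stands for every search engine crawler.
- Disallow: /ABC/ Forbids crawling anything under the ABC directory.
- Disallow: /ab/adc.html Forbids crawling the file adc.html in the ab folder.
- Allow: /cgi-bin/ Allows crawling anything under the cgi-bin directory.
- Allow: /tmp Allows crawling every path that begins with /tmp, including the whole tmp directory.
- Sitemap: (sitemap URL) Tells crawlers that this URL is the site's sitemap.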
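As a minimal sketch of how these directives behave, the rules listed above can be fed to Python's standard `urllib.robotparser`. The host `example.com`, the sitemap URL, and the crawler name `MyCrawler` are placeholders chosen for illustration. Note that this parser applies the first rule whose path is a prefix of the URL, which is enough for the simple rules here.

```python
from urllib.robotparser import RobotFileParser

# The sample rules from the list above; the Sitemap URL is a placeholder.
rules = """\
User-agent: *
Disallow: /ABC/
Disallow: /ab/adc.html
Allow: /cgi-bin/
Allow: /tmp
Sitemap: https://example.com/sitemap.xml
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())  # parse text directly, no network access

# "MyCrawler" is a made-up crawler name; it falls under "User-agent: *".
blocked_dir = rp.can_fetch("MyCrawler", "https://example.com/ABC/page.html")
blocked_file = rp.can_fetch("MyCrawler", "https://example.com/ab/adc.html")
allowed_tmp = rp.can_fetch("MyCrawler", "https://example.com/tmp/file.txt")
unlisted = rp.can_fetch("MyCrawler", "https://example.com/other.html")

print(blocked_dir, blocked_file, allowed_tmp, unlisted)
# site_maps() requires Python 3.8 or later.
print(rp.site_maps())
```

A path that matches no rule, such as `/other.html` here, is treated as allowed.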
Shopping sites
Amazon China
https://www.amazon.cn/robots.txt
- User-agent: *
- Disallow: /buycar
- Disallow: /cart
- Disallow: /checkout
- Disallow: /class
- Disallow: /com
- Disallow: /common
- Disallow: /css
- Disallow: /dll
- Disallow: /doc
- Disallow: /dp/e-mail-friend/
- Disallow: /dp/manual-submit/
- Disallow: /dp/product-availability/
- Disallow: /dp/rate-this-item/
- Disallow: /dp/shipping/
- Disallow: /dp/twister-update/
- Disallow: /gp/aws/ssop
- Disallow: /gp/cart
- Disallow: /gp/css/homepage.html
- Disallow: /gp/customer-reviews/common/du
- Disallow: /gp/flex
- Disallow: /gp/gfix
- Disallow: /gp/history
- Disallow: /gp/item-dispatch
- Disallow: /gp/music/clipserve
- Disallow: /gp/music/wma-pop-up
- Disallow: /gp/offer-listing
- Disallow: /gp/product/e-mail-friend
- Disallow: /gp/product/product-availability
- Disallow: /gp/product/rate-this-item
- Disallow: /gp/recsradio
- Disallow: /gp/slredirect
- Disallow: /gp/twitter/
- Disallow: /gp/vote
- Disallow: /gp/voting/
- Disallow: /gp/yourstore
- Disallow: /inc
- Disallow: /js
- Disallow: /lib
- Disallow: /mn/bookLookInsideApp
- Disallow: /mn/checkInitApp
- Disallow: /mn/checkoutAlertMsgApp
- Disallow: /mn/checkoutredirectApp
- Disallow: /mn/giftCardApp
- Disallow: /mn/loginApplication
- Disallow: /mn/loyaltyApp
- Disallow: /mn/orderAddrApp
- Disallow: /mn/orderCfmApp
- Disallow: /mn/orderDetailApp
- Disallow: /mn/orderFailApp
- Disallow: /mn/orderHistoryApp
- Disallow: /mn/orderModifyApp
- Disallow: /mn/orderSummaryApp
- Disallow: /mn/paymentRedriveApp
- Disallow: /mn/recommendReviewApp
- Disallow: /mn/releaseReviewApp
- Disallow: /mn/reviewVoteApplication
- Disallow: /mn/selectPaymentMethodApp
- Disallow: /mn/selectShippingOpptionApplication
- Disallow: /mn/shipmentTraceApp
- Disallow: /mn/shoppingCartApplication
- Disallow: /mn/tellFriend
- Disallow: /mn/thankYouApplication
- Disallow: /mn/virtualAccountApp
- Disallow: /mn/yourAccountApp
- Disallow: /paper
- Disallow: /xml
- Disallow: /youraccount
- Disallow: /ap/signin
- Disallow: /gp/registry/wishlist/
- Disallow: /wishlist/
- Allow: /wishlist/universal*
- Allow: /wishlist/vendor-button*
- Allow: /wishlist/get-button*
- Disallow: /gp/wishlist/
- Allow: /gp/wishlist/universal*
- Allow: /gp/wishlist/vendor-button*
- Allow: /gp/wishlist/ipad-install*
- Disallow: /registry/wishlist/
- Disallow: /gp/help/contact-us/general-questions.html*?type&email&skip=true
- Disallow: /gp/help/customer/accessibility?ie=UTF8&initialIssue=forgotpw&skip=true
- Disallow: /gp/registry/search.html
- Disallow: /gp/orc/rml/
- Disallow: /gp/digital/fiona/manage
- Disallow: /gp/entity-alert/external
- Disallow: /gp/customer-reviews/dynamic/sims-box
- Disallow: /review/dynamic/sims-box
- Disallow: /gp/redirect.html
- Disallow: /gp/customer-media/upload/
- Disallow: /gp/customer-media/actions/delete/
- Disallow: /gp/customer-media/actions/edit-caption/
- Disallow: /gp/dmusic/
- Disallow: /registry
- Disallow: /*/wishlist
- Disallow: /gp/registry
- Disallow: /gp/aag
- Disallow: /gp/socialmedia/giveaways
- Disallow: /gp/aw/so.html
- Disallow: /gp/pdp/profile/
- Disallow: /gp/help/customer/display.html*nodeId=200843370
- Disallow: /gp/help/customer/display.html*nodeId=200877580
- Disallow: /gp/help/customer/display.html*nodeId=200877590
- Disallow: /gp/help/customer/display.html*nodeId=200879080
- Disallow: /gp/help/customer/display.html*nodeId=200879100
- Disallow: /gp/help/customer/display.html*nodeId=200879120
- Disallow: /gp/help/customer/display.html*nodeId=200879160
- Disallow: /gp/help/customer/display.html*nodeId=200879140
- Disallow: /gp/help/customer/display.html*nodeId=200877610
- Disallow: /gp/help/customer/display.html*nodeId=200878960
- Disallow: /gp/help/customer/display.html*nodeId=200878980
- Disallow: /gp/help/customer/display.html*nodeId=200879000
- Disallow: /gp/help/customer/display.html*nodeId=200879040
- Disallow: /gp/help/customer/display.html*nodeId=200879020
- Disallow: /gp/help/customer/display.html*nodeId=200877630
- Disallow: /gp/help/customer/display.html*nodeId=200879200
- Disallow: /gp/help/customer/display.html*nodeId=200879220
- Disallow: /gp/help/customer/display.html*nodeId=200879240
- Disallow: /gp/help/customer/display.html*nodeId=200879280
- Disallow: /gp/help/customer/display.html*nodeId=200879260
- Disallow: /gp/help/customer/display.html*nodeId=200877650
- Disallow: /gp/help/customer/display.html*nodeId=200879320
- Disallow: /gp/help/customer/display.html*nodeId=200879340
- Disallow: /gp/help/customer/display.html*nodeId=200879360
- Disallow: /gp/help/customer/display.html*nodeId=200879400
- Disallow: /gp/help/customer/display.html*nodeId=200879380
- Disallow: /gp/help/customer/display.html*nodeId=200877560
- Disallow: /gp/help/customer/display.html*nodeId=200843460
- Disallow: /gp/help/customer/display.html*nodeId=200843440
- Disallow: /gp/help/customer/display.html*nodeId=200899270
- Disallow: /gp/help/customer/display.html*nodeId=200879440
- Disallow: /gp/help/customer/display.html*nodeId=200899330
- Disallow: /gp/help/customer/display.html*nodeId=200899350
- Disallow: /gp/help/customer/display.html*nodeId=200899390
- Disallow: /gp/help/customer/display.html*nodeId=200899410
- Disallow: /gp/help/customer/display.html*nodeId=200899430
- Disallow: /gp/help/customer/display.html*nodeId=200899220
- Disallow: /gp/help/customer/display.html*nodeId=200899450
- Disallow: /gp/help/customer/display.html*nodeId=200899670
- Disallow: /gp/help/customer/display.html*nodeId=200899530
- Disallow: /gp/help/customer/display.html*nodeId=200899470
- Disallow: /gp/help/customer/display.html*nodeId=200899550
- Disallow: /gp/help/customer/display.html*nodeId=200899570
- Disallow: /gp/help/customer/display.html*nodeId=200899510
- Disallow: /gp/help/customer/display.html*nodeId=200899610
- Disallow: /gp/help/customer/display.html*nodeId=200899630
- Disallow: /gp/help/customer/display.html*nodeId=200899650
- Disallow: /gp/help/customer/display.html*nodeId=200879180
- Disallow: /gp/help/customer/display.html*nodeId=200879060
- Disallow: /gp/help/customer/display.html*nodeId=200879300
- Disallow: /gp/help/customer/display.html*nodeId=200879420
- Disallow: /gp/help/customer/display.html*nodeId=200899290
- Disallow: /gp/help/customer/display.html*nodeId=200899310
- Disallow: /gp/help/customer/display.html*nodeId=200843380
- Disallow: /gp/help/customer/display.html*nodeId=200843420
- Disallow: /gp/help/customer/display.html*nodeId=200899230
- Disallow: /gp/help/customer/display.html*nodeId=200899250
- Disallow: /gp/help/customer/display.html*nodeId=200899370
- Disallow: /reviews/iframe
- Disallow: /gp/help/reports/infringement/jquery/handle-notice-submit.html
- Disallow: /gp/help/customer/handler/handle-email-submit.html
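Disallowed paths that still open as normal pages in a browser include: the shopping cart, sign-in, category listings, personal account pages, purchase history, official information, the home page, wish lists, contact customer service, contact us, my e-books, and help.
What Amazon mainly blocks from crawling is commercial information and users' personal information. With data leaks increasingly common, protecting user privacy matters a great deal for an online shopping platform: it protects both users' property and their personal safety. Amazon does, however, also list some paths that crawlers are allowed to fetch, such as the Allow lines under /wishlist/.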
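The Amazon file mixes a broad `Disallow: /wishlist/` with narrower `Allow: /wishlist/universal*` lines. RFC 9309 resolves such overlaps by picking the longest matching pattern, with Allow winning a tie. The sketch below illustrates that rule. It is not how any particular search engine is implemented, the helper names `pattern_to_regex` and `is_allowed` are invented for this example, and the test paths are made up.

```python
import re


def pattern_to_regex(pattern):
    """Turn a robots.txt path pattern into a regex.

    '*' matches any run of characters; a trailing '$' anchors the end.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def is_allowed(rules, path):
    """Longest matching pattern wins; Allow wins ties; no match means allowed."""
    best_len = -1
    allowed = True
    for directive, pattern in rules:
        if not pattern:
            continue  # an empty pattern matches nothing
        if pattern_to_regex(pattern).match(path):
            is_allow = directive.lower() == "allow"
            if len(pattern) > best_len or (len(pattern) == best_len and is_allow):
                best_len = len(pattern)
                allowed = is_allow
    return allowed


# A few of the wishlist rules from the Amazon China file above.
rules = [
    ("Disallow", "/wishlist/"),
    ("Allow", "/wishlist/universal*"),
    ("Allow", "/wishlist/vendor-button*"),
    ("Allow", "/wishlist/get-button*"),
]

print(is_allowed(rules, "/wishlist/universal/abc"))  # narrower Allow wins
print(is_allowed(rules, "/wishlist/my-list"))        # only the Disallow matches
print(is_allowed(rules, "/dp/B000"))                 # no rule matches
```

Python's own `urllib.robotparser` uses first-match order instead, so the two approaches can disagree on files written like this one.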
Taobao
https://www.taobao.com/robots.txt
- User-agent: Baiduspider
- Allow: /article
- Allow: /oshtml
- Allow: /wenzhang
- Disallow: /product/
- Disallow: /
- User-Agent: Googlebot
- Allow: /article
- Allow: /oshtml
- Allow: /product
- Allow: /spu
- Allow: /dianpu
- Allow: /wenzhang
- Allow: /oversea
- Disallow: /
- User-agent: Bingbot
- Allow: /article
- Allow: /oshtml
- Allow: /product
- Allow: /spu
- Allow: /dianpu
- Allow: /wenzhang
- Allow: /oversea
- Disallow: /
- User-Agent: 360Spider
- Allow: /article
- Allow: /oshtml
- Allow: /wenzhang
- Disallow: /
- User-Agent: Yisouspider
- Allow: /article
- Allow: /oshtml
- Allow: /wenzhang
- Disallow: /
- User-Agent: Sogouspider
- Allow: /article
- Allow: /oshtml
- Allow: /product
- Allow: /wenzhang
- Disallow: /
- User-Agent: Yahoo! Slurp
- Allow: /product
- Allow: /spu
- Allow: /dianpu
- Allow: /wenzhang
- Allow: /oversea
- Disallow: /
- User-Agent: *
- Disallow: /
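Baiduspider: Baidu's spider, an automated program of the Baidu search engine. It visits, collects, and organizes web pages, images, videos, and other content on the internet, then builds categorized index databases so that users can find your site's pages, images, and videos through Baidu search. (Baidu Baike)
Googlebot: Google's web crawling robot. (Baidu Baike)
Bingbot: the name of the Bing search engine's crawler; it leaves a trace on each site it crawls. (Baidu Tieba)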
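Because the Taobao file gives each crawler its own group, the same URL can be allowed for one crawler and blocked for another. The sketch below runs a shortened copy of the Baiduspider, Googlebot, and `*` groups through Python's `urllib.robotparser`. The product and article URLs and the name `SomeOtherBot` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A shortened copy of three groups from the Taobao file above.
taobao_rules = """\
User-agent: Baiduspider
Allow: /article
Allow: /oshtml
Allow: /wenzhang
Disallow: /product/
Disallow: /

User-Agent: Googlebot
Allow: /article
Allow: /oshtml
Allow: /product
Disallow: /

User-Agent: *
Disallow: /
"""

rp = RobotFileParser()
rp.parse(taobao_rules.splitlines())

product_url = "https://www.taobao.com/product/123"  # hypothetical URL
article_url = "https://www.taobao.com/article/1"    # hypothetical URL

results = {
    agent: (rp.can_fetch(agent, product_url), rp.can_fetch(agent, article_url))
    for agent in ("Baiduspider", "Googlebot", "SomeOtherBot")
}
for agent, (product_ok, article_ok) in results.items():
    print(f"{agent}: product={product_ok} article={article_ok}")
```

Under these rules Googlebot may fetch the product URL while Baiduspider may not, and a crawler without its own group falls back to the `*` group and is blocked everywhere.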
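Today, searching for Taobao on Baidu returns the note "A description of this page is unavailable because of restrictions in the site's robots.txt file." In practice, Baidu and Taobao are each trying to cultivate the habit among Chinese internet users that best serves their own interests: having users make their purchasing choices through their own search engine. Whoever controls the user's entry point can sell many kinds of paid promotion around rankings. If Taobao opened its site to Baiduspider in robots.txt, Baidu, as China's largest search engine, could well build an open platform aimed at Taobao and eat into Taobao's paid-promotion market. For strong brands, building an independent store that diverts traffic from their Taobao shops has two benefits: first, they avoid staking everything on Taobao, where they must buy expensive home-page advertising through a bidding system (the same applies to Baidu); second, they strengthen the brand and train consumers to search for the brand directly.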
Browser mini-games
4399
http://www.4399.com/robots.txt
- User-agent: *
- Disallow: /upload_pic/
- Disallow: /upload_swf/
- Disallow: /360/
- Disallow: /public/
- Disallow: /yxbox/
- Disallow: /360game/
- Disallow: /loadimg/
- Disallow: /index_pc.htm
- Disallow: /flash/32979_pc.htm
- Disallow: /flash/35538_pc.htm
- Disallow: /flash/48399_pc.htm
- Disallow: /flash/seer_pc.htm
- Disallow: /flash/58739_pc.htm
- Disallow: /flash/78072_pc.htm
- Disallow: /flash/130396_pc.htm
- Disallow: /flash/80727_pc.htm
- Disallow: /flash/151038_pc.htm
- Disallow: /flash/10379_pc.htm
- Disallow: /index_old.htm
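Disallowed paths that still open as normal pages in a browser include: the game list, the list of newest fun mini-games, the home page, 洛克王国, 奥拉星, 赛尔号, 龙战士, 造梦西游3之大闹天庭篇, 爆枪英雄, 勇士的信仰(正式版), 造梦西游4洪荒大劫篇, 奥比岛, and the old home page.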
7k7k
http://www.7k7k.com/robots.txt
- User-agent: *
- Disallow: /doyo/
- Disallow: /doyoweb/
- Disallow: /yy/
- Disallow: /data/
- Disallow: /widget/
- Disallow: /api/
- Disallow: /classic
- Disallow: /classic/
- Disallow: /classic/tag/
- Disallow: /classic/swf/
- Disallow: /classic/flash_fl/
- Disallow: /classic/top/
- Disallow: /classic/flash/
- Disallow: /classic/index.htm
- Disallow: /new/
- Disallow: /m-iphone/art/
- Disallow: /m-ipad/art/
- Disallow: /m-android/art/
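Disallowed paths that still open as normal pages in a browser include: the daily list of newest Flash games, game category listings, game lists, game category tag listings, game rankings, and the home page.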
2144
- User-agent:Mediapartners-Google
- Disallow:
- User-agent: *
- Allow: /girls/?
- Disallow: /tuan
- Disallow: /v3
- Disallow: /hz/cntv
- Disallow: /testdsadsa21321
- Disallow: /xxx
- Disallow: /api
- Disallow: /game.htm
- Disallow: /index_test.htm
- Disallow: /webgame.htm
- Disallow: /index1.htm
- Disallow: /index_old.htm
- Disallow: /index_2010.htm
- Disallow: /index_2011.htm
- Disallow: /index_2012.htm
- Disallow: /game_test.php
- Disallow: /listgame.php
- Disallow: /cj.php
- Disallow: /sdogame.php
- Disallow: /archiver
- Disallow: /YouXi
- Disallow: /sdo
- Disallow: /Archives
- Disallow: /public
- Disallow: /html/26/51653/
- Disallow: /html/14/51654/
- Disallow: /html/14/51655/
- Disallow: /html/26/51857/
- Disallow: /html/14/51863/
- Disallow: /html/14/51862/
- Disallow: /html/14/51861/
- Disallow: /html/26/51858/
- Disallow: /html/26/51859/
- Disallow: /2345/
- Disallow: /2144com/
- Disallow: /xyx/
- Disallow: /xiaoyouxi/
- Disallow: /2015/
- Disallow: /2016/
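Disallowed paths that still open as normal pages in a browser include: the girls' games list, the home page, the old home page, 三国战纪, 战神盟, 三国志, 三国战, and the game list.
Most browser mini-game sites disallow crawling of the home page, game lists, game category listings, and some individual game pages.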
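To compare files like the three above, it helps to group the rules by the crawler they apply to. The following is a rough sketch of such a grouping. The function name `parse_groups` is invented here, and the sample is a shortened excerpt of the 2144 file. It also shows that an empty `Disallow:` line, as in the Mediapartners-Google group, restricts nothing.

```python
def parse_groups(text):
    """Group Allow/Disallow rules by the user agents they apply to."""
    groups = {}
    current_agents = []
    seen_rule = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if seen_rule:
                # A User-agent line after rules starts a new group.
                current_agents = []
                seen_rule = False
            current_agents.append(value)
            groups.setdefault(value, [])
        elif field in ("allow", "disallow"):
            seen_rule = True
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


# Shortened excerpt of the 2144 file above.
sample = """\
User-agent:Mediapartners-Google
Disallow:
User-agent: *
Allow: /girls/?
Disallow: /tuan
Disallow: /index_old.htm
"""

groups = parse_groups(sample)
# Paths the "*" group actually blocks; empty Disallow values are skipped.
disallowed = [path for kind, path in groups["*"] if kind == "disallow" and path]

print(groups["Mediapartners-Google"])
print(disallowed)
```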
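Summary
Shopping sites focus mostly on protecting user information and on site traffic; browser mini-game sites, while also watching traffic, put emphasis on protecting their teams' creative work.
Websites set up the Robots protocol out of security and privacy concerns, to keep search engines from crawling sensitive information. The protocol embodies a spirit of contract: only when internet companies abide by it can the private data of websites and users stay protected. It is an important rule for privacy and security on the internet and so far the most effective approach available. It relies on self-discipline to keep the balance between websites and search engines, so that the interests of the two do not tilt too far to one side.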