Reading Amazon's robots.txt

Author: 鸡肉卷福 | Published: 2018-05-14 00:59


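robots.txt is a protocol, not a command: crawlers follow it by convention, and nothing forces them to. It is the first file a search engine looks at when it visits a site, and it tells the spider which files on the server may be viewed. It can block large files such as images, music, and video to save server bandwidth; it can block a site's dead links so that search engines crawl the content more easily; and it can declare sitemap links that guide the spider to the pages worth crawling.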
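As a rough illustration of how a crawler consults these rules, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs, and the ExampleBot user agent are all made up for the example and are not taken from Amazon's file; site_maps() requires Python 3.8 or later.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for an imaginary site (NOT Amazon's file).
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /media/
Sitemap: https://example.com/sitemap.xml
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(EXAMPLE_ROBOTS_TXT)

# "ExampleBot" is a hypothetical crawler name; it falls under "User-agent: *".
media_ok = parser.can_fetch("ExampleBot", "https://example.com/media/song.mp3")
index_ok = parser.can_fetch("ExampleBot", "https://example.com/index.html")
sitemaps = parser.site_maps()  # list of declared sitemap URLs (Python 3.8+)

print(media_ok, index_ok, sitemaps)
```

A real crawler would call set_url() and read() to download the file instead of parsing an inline string.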

Take Amazon's robots.txt as an example:

[Figure: screenshot of https://www.amazon.cn/robots.txt]

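The figure above is a screenshot of Amazon's robots.txt.
First line: User-agent: * Here the * is a wildcard that stands for every kind of search engine crawler.
We can see that the vast majority of paths may not be crawled. The only things that may be crawled are parts of the wish list: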

Allow: /wishlist/universal*
Allow: /wishlist/vendor-button*
Allow: /wishlist/get-button*
Allow: /gp/wishlist/universal*
Allow: /gp/wishlist/vendor-button*
Allow: /gp/wishlist/ipad-install*

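1. Everything under paths that start with /wishlist/ followed by universal, vendor-button, or get-button is allowed.
2. Everything under paths that start with /gp/wishlist/ followed by universal, vendor-button, or ipad-install is allowed.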
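The matching behind these Allow lines can be sketched as follows. This assumes the widely used interpretation in which a rule matches from the start of the path, * matches any run of characters, and a trailing $ anchors the end of the path. The function names are hypothetical helpers for illustration, and the tested paths after the rule prefix are invented.

```python
import re


def pattern_to_regex(pattern: str) -> "re.Pattern[str]":
    """Translate a robots.txt path pattern into a compiled regex.

    Assumed semantics: '*' matches any sequence of characters and a
    trailing '$' anchors the match at the end of the path.
    """
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces, then join them with '.*' where '*' appeared.
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    """Return True if the rule pattern matches the path from its start."""
    return pattern_to_regex(pattern).match(path) is not None


# Rule patterns copied from the Allow lines above; the paths are made up.
print(path_matches("/wishlist/universal*", "/wishlist/universal/items"))
print(path_matches("/gp/wishlist/ipad-install*", "/gp/wishlist/ipad-install"))
print(path_matches("/wishlist/get-button*", "/wishlist/other"))
```

Under this interpretation the trailing * is redundant, because a rule already matches any path that begins with it.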

Many sites' robots.txt files are blunt and simple. Some sites go as far as writing just

User-agent: *
Disallow: /

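to bar every search engine from every part of the site. Amazon's file, by contrast, is very detailed and covers almost all of its subdirectories (shopping cart, user accounts, bank cards, wish lists, product categories, product information, purchase information, formats, frames, and so on).
1. Entire directories that may not be crawled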

Disallow: /buycar
Disallow: /cart
Disallow: /checkout
Disallow: /class
Disallow: /com
Disallow: /common
Disallow: /css
......

That is, any path that begins with one of the words above may not be crawled.
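The "begins with" behaviour can be sketched with a plain prefix check. This is a simplification that ignores Allow rules and wildcards, the list contains only the rules shown in the excerpt (the elided ones are left out), and the tested paths are invented for illustration.

```python
# Disallow prefixes copied from the excerpt above; the elided rules are omitted.
DISALLOWED_PREFIXES = [
    "/buycar", "/cart", "/checkout", "/class", "/com", "/common", "/css",
]


def is_disallowed(path: str, prefixes=DISALLOWED_PREFIXES) -> bool:
    """Return True if the path starts with any disallowed prefix."""
    return any(path.startswith(prefix) for prefix in prefixes)


# Hypothetical paths, used only to show the prefix behaviour.
print(is_disallowed("/cart/view"))   # under /cart
print(is_disallowed("/cartoon"))     # also starts with /cart
print(is_disallowed("/books"))       # no listed prefix matches
```

Because matching is by prefix rather than by directory name, /com already covers /common in this sketch, and a rule such as /cart also catches an unrelated path like /cartoon.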

2. Subdirectories that may not be crawled

Disallow: /mn/bookLookInsideApp
Disallow: /mn/checkInitApp
Disallow: /mn/checkoutAlertMsgApp
Disallow: /mn/checkoutredirectApp
Disallow: /mn/giftCardApp
Disallow: /mn/loginApplication
Disallow: /mn/loyaltyApp
......
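To see how subdirectory-level rules differ from blocking a whole directory, here is a small sketch that feeds two of the /mn/ rules above to Python's urllib.robotparser. It uses only the excerpt, so the result for the second path says nothing about Amazon's real file, whose remaining rules are elided here. ExampleBot and /mn/somethingElse are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Two rules copied from the excerpt above; the rest of the file is not included.
EXCERPT = [
    "User-agent: *",
    "Disallow: /mn/loginApplication",
    "Disallow: /mn/giftCardApp",
]

excerpt_parser = RobotFileParser()
excerpt_parser.parse(EXCERPT)

# The listed subdirectory is blocked for every crawler.
login_ok = excerpt_parser.can_fetch(
    "ExampleBot", "https://www.amazon.cn/mn/loginApplication"
)
# A sibling path (hypothetical) is not covered by these two rules alone.
sibling_ok = excerpt_parser.can_fetch(
    "ExampleBot", "https://www.amazon.cn/mn/somethingElse"
)

print(login_ok, sibling_ok)
```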

Besides directories, Amazon also blocks some individual files from being crawled, for example: