1. Basic syntax of robots.txt ##

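User-agent:   # name of the crawler the rules apply to; * means all crawlers
Allow:        # followed by a directory or page that may be crawled
Disallow:     # followed by a directory or page that must not be crawled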

For example:
Allow every crawler to access every page of the site

User-agent: *
Disallow:

Forbid every crawler from accessing any page of the site

User-agent: *
Disallow: /

Rules for one specific crawler

 User-agent:  Baiduspider
 Allow:  /article
 Allow:  /oshtml
 Allow:  /wenzhang
 Disallow:  /product/
 Disallow:  /  # apart from the paths allowed above, nothing else is open to the crawler

 User-agent:  Baiduspider
 Allow:  /

 User-agent: *
 Disallow:  /  # only Baiduspider may access the site
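
The rules above can be checked without touching the network. Below is a minimal sketch using Python's standard `urllib.robotparser`; the rule text combines the Baiduspider group with the catch-all group from the examples above, and `build_parser` and `SomeOtherBot` are names made up for illustration. Note that this parser applies the first matching rule in file order, which may differ from crawlers that pick the longest matching path, so treat the result as an approximation.

```python
from urllib.robotparser import RobotFileParser

# Rules taken from the examples above: one Baiduspider group plus a catch-all group.
RULES = """\
User-agent: Baiduspider
Allow: /article
Allow: /oshtml
Allow: /wenzhang
Disallow: /product/
Disallow: /

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no HTTP request)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


if __name__ == "__main__":
    parser = build_parser(RULES)
    # SomeOtherBot is a hypothetical crawler name that falls under "User-agent: *".
    for agent, path in [
        ("Baiduspider", "/article/1.html"),
        ("Baiduspider", "/product/1.html"),
        ("SomeOtherBot", "/article/1.html"),
    ]:
        print(agent, path, parser.can_fetch(agent, path))
```

Because the `Allow` lines come before `Disallow: /`, the first-match behaviour of this parser gives the intended result for this particular file.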

2. The robots <meta> tag ##

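Robots directives can also be written in the HTML head to tell crawlers how to handle that particular page. The tag has two parts:

name="robots" addresses all search engines; to target one specific engine, use its crawler name instead, e.g. name="Baiduspider".
content accepts four directives: index, noindex, follow, nofollow, separated by ",".
index tells the search robot that it may index the page; noindex forbids it;
follow means the search robot may keep crawling along the links on the page; nofollow forbids it;
The default for the robots meta tag is index and follow. According to the source below, the only exception is Inktomi, whose default is index,nofollow.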

Source: "如何使用robots.txt及其详解" (How to use robots.txt, explained in detail)

For example:

<html>
<head>
  <title>Page</title>
  <meta name="robots" content="index,follow"/><!--can also be written as <meta name="robots" content="ALL"/>-->
  <!--<meta name="robots" content="noindex,follow"/>-->
  <!--<meta name="robots" content="index,nofollow"/>-->
  <!--<meta name="robots" content="noindex,nofollow"/> can also be written as <meta name="robots" content="NONE"/>-->
</head>
<body>
  hello
</body>
</html>
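
To show how a crawler might read these directives, here is a minimal sketch built on Python's standard `html.parser`. The class `RobotsMetaParser` and the function `robots_policy` are hypothetical names, and the handling of ALL and NONE follows the shorthand described above; real crawlers support more directives than this sketch does.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of <meta name="robots"> (or of one named crawler)."""

    def __init__(self, bot_name: str = "robots"):
        super().__init__()
        self.bot_name = bot_name.lower()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        if attr.get("name", "").lower() in ("robots", self.bot_name):
            for token in attr.get("content", "").split(","):
                token = token.strip().lower()
                if token:
                    self.directives.add(token)


def robots_policy(html: str, bot_name: str = "robots"):
    """Return (may_index, may_follow); the default is (True, True)."""
    parser = RobotsMetaParser(bot_name)
    parser.feed(html)
    found = parser.directives
    may_index = not ({"noindex", "none"} & found)
    may_follow = not ({"nofollow", "none"} & found)
    return may_index, may_follow


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex,follow"/></head></html>'
    print(robots_policy(page))
```

Meta tags that sit inside HTML comments, like the alternatives in the sample page above, are not seen as tags by the parser and therefore do not affect the result.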

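3. Using robots.txt ##

robots.txt is a plain-text file. It must be placed in the root directory of the domain and be named "robots.txt"; the file name must be all lowercase, and a robots.txt file placed in a subdirectory has no effect.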
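
Since the file always lives at the root, its address can be derived from any page URL on the same host. A small sketch, assuming a hypothetical helper name `robots_txt_url` and using example.com purely as a placeholder host:

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the root-level robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep scheme and host, replace the path, and drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


if __name__ == "__main__":
    print(robots_txt_url("https://www.example.com/a/b/page.html?x=1#top"))
```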

4. Reading Taobao's robots.txt file ##

Taobao robots.txt

User-agent:  Baiduspider   # Baidu spider
Allow:  /article           # directories the Baidu spider may crawl
Allow:  /oshtml            # /article and /oshtml are list pages under the sitemap; /wenzhang is Taobao Headlines news sharing
Allow:  /wenzhang
Disallow:  /product/       # directory the Baidu spider may not crawl
Disallow:  /

User-Agent:  Googlebot     # Google spider
Allow:  /article
Allow:  /oshtml
Allow:  /product
Allow:  /spu
Allow:  /dianpu
Allow:  /wenzhang
Allow:  /oversea
Disallow:  /               # apart from the directories above, nothing else may be crawled by the Google spider

User-agent:  Bingbot       # Bing spider
Allow:  /article
Allow:  /oshtml
Allow:  /product
Allow:  /spu
Allow:  /dianpu
Allow:  /wenzhang
Allow:  /oversea
Disallow:  /

User-Agent:  360Spider     # 360 spider
Allow:  /article
Allow:  /oshtml
Allow:  /wenzhang
Disallow:  /

User-Agent:  Yisouspider   # Yisou spider
Allow:  /article
Allow:  /oshtml
Allow:  /wenzhang
Disallow:  /

User-Agent:  Sogouspider   # Sogou spider
Allow:  /article
Allow:  /oshtml
Allow:  /product
Allow:  /wenzhang
Disallow:  /

User-Agent:  Yahoo!  Slurp # Yahoo spider
Allow:  /product
Allow:  /spu
Allow:  /dianpu
Allow:  /wenzhang
Allow:  /oversea
Disallow:  /

User-Agent:  *             # every user agent not listed above may not crawl anything on the site
Disallow:  /

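4.1. A few things I noticed in the file: ###

Taobao blocked search-engine crawling entirely at first; it now opens part of its directories to the major search engines.
For domestic search engines it opens only the list pages under the sitemap and the news-sharing pages, except that Sogou is also given the /product directory.
For foreign search engines it additionally opens the shop and product directories (/dianpu, /product, and others).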
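
The differences between crawlers are easier to see as a table. The sketch below feeds an abbreviated excerpt of the file above (three of its groups) to Python's standard `urllib.robotparser` and reports, per crawler, which directories are open. `access_table` and the probe suffix `/x` are assumptions made for illustration, and the parser's first-match rule order may differ from what a real crawler does.

```python
from urllib.robotparser import RobotFileParser

# Abbreviated excerpt of the robots.txt listed above (three groups only).
EXCERPT = """\
User-agent: Baiduspider
Allow: /article
Allow: /oshtml
Allow: /wenzhang
Disallow: /product/
Disallow: /

User-agent: Googlebot
Allow: /article
Allow: /oshtml
Allow: /product
Allow: /spu
Allow: /dianpu
Allow: /wenzhang
Allow: /oversea
Disallow: /

User-agent: Sogouspider
Allow: /article
Allow: /oshtml
Allow: /product
Allow: /wenzhang
Disallow: /
"""

AGENTS = ["Baiduspider", "Sogouspider", "Googlebot"]
PATHS = ["/article", "/oshtml", "/wenzhang", "/product", "/spu", "/dianpu", "/oversea"]


def access_table(robots_txt, agents, paths):
    """Map each agent to {directory: allowed?}, probing one page per directory."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        agent: {path: parser.can_fetch(agent, path + "/x") for path in paths}
        for agent in agents
    }


if __name__ == "__main__":
    for agent, row in access_table(EXCERPT, AGENTS, PATHS).items():
        opened = [path for path, allowed in row.items() if allowed]
        print(f"{agent:12s} {' '.join(opened)}")
```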

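4.2. Analysis of the reasons ###

Taobao presumably blocked search engines at first in order to:
Keep control of user data: users' search behavior and preferences have high commercial value for Taobao, and it did not want search engines to take away part of that data source.
Prevent paid ranking: if product information could be crawled by search engines, paid ranking (bidding for placement) would follow naturally, which could lead to customer loss.
Protect site traffic: with search engines blocked, users can only enter through Taobao's own site, so no traffic leaks away.
Taobao has now opened some directories so that it can promote itself on search engines, but it still blocks product-related pages for some crawlers (for example, /product/ is disallowed for Baiduspider), which suggests that Taobao still does not want paid ranking.
Among domestic search engines, only Sogou is given the /product directory, presumably because Alibaba is an investor in Sogou and the two sides have reached a cooperation agreement. Taobao can promote its products through the Sogou search engine, and Sogou in turn can use Taobao's content to increase its user traffic.
Compared with domestic search engines, Taobao is more lenient toward foreign ones, presumably in order to expand its overseas business and win overseas users through those search engines.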