Blocking Web Crawlers with Nginx

Author: DB哥 | Published: 2019-08-28 17:11

Linux system environment

[root@nginx01 ~]# cat /etc/redhat-release                       #==》OS release
CentOS release 6.7 (Final)
[root@nginx01 ~]# uname -r                                      #==》kernel version
2.6.32-573.el6.x86_64
[root@nginx01 ~]# uname -m                                      #==》system architecture
x86_64
[root@nginx01 ~]# echo $LANG                                    #==》system locale
en_US.UTF-8
[root@nginx01 conf]# /application/nginx/sbin/nginx -v           #==》Nginx version
nginx version: nginx/1.16.0

Web crawlers (also called web spiders, web robots, etc.)

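1. A web crawler is a program or script that automatically fetches website content according to a set of rules. If you do not want crawlers to collect sensitive information or data from your site, you can deny them access or restrict what they are allowed to reach. There are many crawlers in the wild; left unrestricted, they add load to the site and can pose a security risk.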

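2. There are two ways to restrict crawlers on an Nginx site: the first is a robots.txt file, and the second is a rule based on the $http_user_agent variable. Using both is recommended. In my view, robots.txt only works on well-behaved crawlers, which choose to honor it; for rogue crawlers that ignore it, use the second method.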

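Method 1: place a robots.txt file in the site root directory
Note: in this tutorial the Nginx site root is /application/nginx/html/. robots.txt is the first file a search engine crawler requests when it visits a site; it tells crawlers which files on the server may be fetched and which may not. (This method has no effect on rogue crawlers.)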

#==》robots.txt does not exist by default and must be created by hand
[root@nginx01 ~]# vim /application/nginx/html/robots.txt
User-agent: *                                               #==》crawler name the rules apply to; * means all crawlers
Allow: /wp-content                                          #==》directory crawlers are allowed to visit
Disallow: /wp-admin                                         #==》directory crawlers are asked not to visit
[root@nginx01 ~]# /application/nginx/sbin/nginx -t
nginx: the configuration file /application/nginx1.6.2/conf/nginx.conf syntax is ok
nginx: configuration file /application/nginx1.6.2/conf/nginx.conf test is successful
[root@nginx01 ~]# /application/nginx/sbin/nginx -s reload
[root@nginx01 ~]# curl 10.0.0.8/robots.txt                  #==》check the result
User-agent: *
Allow: /wp-content
Disallow: /wp-admin
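
The listing above mixes the file content with explanatory annotations. As a minimal sketch, the same file can be written non-interactively and then inspected to confirm which paths compliant crawlers are asked to skip. The scratch directory and the `disallowed` variable are assumptions made for illustration; on the real host the file would go into the site root used in this tutorial.

```shell
# Write a sample robots.txt into a scratch directory
# (hypothetical location, not the live site root).
workdir=$(mktemp -d)
cat > "$workdir/robots.txt" <<'EOF'
User-agent: *
Allow: /wp-content
Disallow: /wp-admin
EOF

# Print every path listed under a Disallow directive.
# The field separator is a colon followed by optional spaces.
disallowed=$(awk -F': *' 'tolower($1) == "disallow" { print $2 }' "$workdir/robots.txt")
echo "$disallowed"
```

This only checks the file itself. Whether a crawler honors the rules is up to the crawler, which is why the second method below is still needed.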

Method 2: edit nginx.conf to deny crawlers and return a 403 error page

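#==》Crawler names can be found in the access_log access log
[root@nginx01 ~]# vim /application/nginx/conf/nginx.conf
server {

    #==》Block the Baiduspider, Googlebot, and Googlebot-Mobile crawlers
    #==》(~* is a case-insensitive regex match, so "Googlebot" alone already covers Googlebot-Mobile)
    if ($http_user_agent ~* "Baiduspider|Googlebot|Googlebot-Mobile") {
        return 403;
    }

}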
[root@nginx01 ~]# /application/nginx/sbin/nginx -t     
nginx: the configuration file /application/nginx1.6.2/conf/nginx.conf syntax is ok
nginx: configuration file /application/nginx1.6.2/conf/nginx.conf test is successful
[root@nginx01 ~]# /application/nginx/sbin/nginx -s reload
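
The first comment in this method says that crawler names can be read from the access log. One possible way to do that is sketched below, assuming the log uses nginx's predefined `combined` format, in which the User-Agent is the last quoted field. The log lines are made-up sample data and the `top_agents` variable is hypothetical; on a real server the awk command would be pointed at the actual access log file.

```shell
# Made-up access log lines in the "combined" format, for illustration only.
log=$(mktemp)
cat > "$log" <<'EOF'
10.0.0.1 - - [28/Aug/2019:17:00:01 +0800] "GET / HTTP/1.1" 200 612 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
10.0.0.2 - - [28/Aug/2019:17:00:02 +0800] "GET /a HTTP/1.1" 200 612 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
10.0.0.1 - - [28/Aug/2019:17:00:03 +0800] "GET /b HTTP/1.1" 200 612 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
EOF

# Split each line on double quotes: field 6 is the User-Agent.
# Count requests per User-Agent and list the busiest first.
top_agents=$(awk -F'"' '{ print $6 }' "$log" | sort | uniq -c | sort -rn)
echo "$top_agents"
```

Names that show up often in this list are candidates for the `if ($http_user_agent ...)` rule. To check the rule on the test host, a request such as `curl -I -A "Baiduspider" http://10.0.0.8/` should come back with a 403 status once the configuration has been reloaded.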

#==》Method 2 can be extended to restrict access by browser type, for example blocking the Google Chrome browser
[root@nginx01 ~]# vim /application/nginx/conf/nginx.conf
server {

    #==》Block the Chrome browser from accessing the site
    if ($http_user_agent ~* "Chrome") {
        return 403;
    }

}
[root@nginx01 ~]# /application/nginx/sbin/nginx -t     
nginx: the configuration file /application/nginx1.6.2/conf/nginx.conf syntax is ok
nginx: configuration file /application/nginx1.6.2/conf/nginx.conf test is successful
[root@nginx01 ~]# /application/nginx/sbin/nginx -s reload
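
Before deploying a pattern like "Chrome", it can help to see which clients it would actually catch. The sketch below approximates the case-insensitive `~*` match with `grep -i` against a few sample User-Agent strings. The strings and their version numbers are illustrative assumptions, and the `ua_file` and `blocked` names are hypothetical.

```shell
# Sample User-Agent strings, one per line (illustrative values only).
ua_file=$(mktemp)
cat > "$ua_file" <<'EOF'
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.74 Safari/537.36 Edg/79.0.309.43
Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:68.0) Gecko/20100101 Firefox/68.0
curl/7.19.7
EOF

# Count the lines that contain "chrome" in any letter case,
# which is roughly what the nginx rule would reject.
blocked=$(grep -ci 'chrome' "$ua_file")
echo "would be rejected: $blocked"
```

Two of the four samples match, because Chromium-based browsers such as Edge also carry a Chrome token in their User-Agent. The header is supplied by the client and can be changed freely, so this kind of rule is a coarse filter, not a reliable access control.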