1. Structure
1.1 Understanding the crawler architecture
1.2 URL manager
Ways to implement the URL manager
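Internet companies typically use a cache database.
For personal projects, keeping the URLs in memory is usually enough; if memory runs short or the URLs need to be stored permanently, use a relational database.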
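The in-memory option can be sketched as follows. This is only a minimal illustration, assuming two Python sets (one for URLs waiting to be crawled, one for URLs already crawled); the class name `UrlManager` and its method names are hypothetical and do not come from the original notes.

```python
class UrlManager:
    """Hypothetical in-memory URL manager backed by two sets."""

    def __init__(self):
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs that have already been handed out

    def add_new_url(self, url):
        # Ignore empty values and URLs that were seen before
        if not url:
            return
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new_url(self):
        return bool(self.new_urls)

    def get_new_url(self):
        # Move one URL from the "to crawl" set to the "crawled" set
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url
```

Using sets keeps duplicate checks cheap, which is the main job of a URL manager: no page should be crawled twice. Swapping the sets for a database table or a cache store would keep the same interface.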
1.3 Web page downloader
About Python 3:
In Python 3.x, the urllib and urllib2 libraries were merged into a single urllib package.
urllib2.urlopen() became urllib.request.urlopen()
urllib2.Request() became urllib.request.Request()
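If the same script has to run on both versions, one common workaround is a guarded import. The snippet below is a small sketch of that idea and is not part of the original notes.

```python
try:
    # Python 3: both names live in urllib.request
    from urllib.request import Request, urlopen
except ImportError:
    # Python 2 fallback: the same names were provided by urllib2
    from urllib2 import Request, urlopen
```

After this import, the rest of the code can call `urlopen()` and `Request()` without caring which interpreter is in use.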
1.3.1 Using the web page downloader
Method 1
Method 2
Method 3
For Python 3.x, the code needs to change accordingly:
import urllib.request
resp = urllib.request.urlopen('http://www.baidu.com')
print(resp.getcode())
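The same download can also go through `urllib.request.Request()`, which makes it possible to attach HTTP headers before the request is sent. The following is a minimal sketch under stated assumptions: the helper names `build_request` and `download`, the default User-Agent string, and the 10-second timeout are all illustrative choices, not taken from the original notes.

```python
import urllib.request


def build_request(url, user_agent="Mozilla/5.0"):
    """Wrap a URL in a Request object that carries a custom User-Agent header."""
    # The header value is an assumed placeholder; use whatever the target site expects
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def download(url, timeout=10):
    """Fetch a page and return (status code, body as bytes)."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.getcode(), resp.read()
```

Calling `download('http://www.baidu.com')` performs the actual network request; `build_request()` on its own only prepares the request object, so it can be inspected or tested offline.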