Django 1.2, Scrapy 1.5.1, Elasticsearch 6.3.2
Website side + crawler side
Crawler (getting data into the database)
Crawler fundamentals
1. Regular expressions
2. Depth-first and breadth-first traversal algorithms
3. Common URL deduplication strategies (see the sketch after this list)
- Crawler-side choice: Scrapy vs. requests + BeautifulSoup
Scrapy is a framework, while requests and BeautifulSoup are libraries.
Scrapy is built on Twisted (an asynchronous IO framework), so performance is good.
It is easy to extend and ships with many built-ins,
including CSS and XPath selectors (backed by lxml, implemented in C); see the selector example below.
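For instance, the same attribute extracted both ways with Scrapy's built-in Selector (the HTML here is made up):

    from scrapy.selector import Selector

    html = '<div><a href="http://example.com">title</a></div>'
    sel = Selector(text=html)
    # the same href, once via CSS and once via XPath
    print(sel.css("a::attr(href)").extract_first())  # http://example.com
    print(sel.xpath("//a/@href").extract_first())    # http://example.com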
Python string encoding/decoding issues
import sys
sys.getdefaultencoding()
s.encode('utf-8')
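A quick round trip showing which direction encode and decode go (Python 3, where str is unicode and bytes carry an explicit encoding):

    s = "爬虫"                # a Python 3 str is unicode
    b = s.encode("utf-8")     # str -> bytes
    print(b)                  # b'\xe7\x88\xac\xe8\x99\xab'
    print(b.decode("utf-8"))  # bytes -> str: 爬虫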
Simulate human-like web surfing; at this stage, learn by excerpting from what we browse (not yet outright copying).
Set up a virtualenv, create the Scrapy project from the command line, then generate the spider from the command line.
Write a .py that covers: the simulated-login URL, requesting URLs, the login account, filtering URLs and the wanted content, and saving the structured results locally.
import re
import json
import scrapy
from config import ZHIHU_PHONE

class ZhihuSpider(scrapy.Spider):
    # spider with simulated login
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    start_urls = ['http://emmmmmmmmmm']
    # headers = {
    #     'HOST': 'www.zhihu.com',
    #     'Referer': 'http://www.zhihu.com',
    #     'User-Agent': 'chrome/68.0.3440.106 emmmmmm',
    # }
    # custom_settings = {
    #     "COOKIES_ENABLED": True,
    #     "DOWNLOAD_DELAY": 1.5,
    # }

    def parse(self, response):
        # filter the URLs and the content we need
        all_urls = response.css("a::attr(href)").extract()
        # match question URLs
        for url in all_urls:
            match_obj = re.match(r"(.*zhihu.com/question/(\d+))(/|$).*", url)
            if match_obj:
                request_url = match_obj.group(1)
                yield scrapy.Request(request_url, headers=self.headers, callback=self.parse_question)
            else:
                # not a question page: keep crawling
                yield scrapy.Request(url, headers=self.headers, callback=self.parse)

    def parse_question(self, response):
        # build the question item (new page layout)
        if "QuestionHeader-title" in response.text:
            match_obj = re.match(r"(.*zhihu.com/question/(\d+))(/|$).*", response.url)
            if match_obj:
                question_id = int(match_obj.group(2))
            item_loader.add_css("emmmmmm")
            item_loader.add_xpath("emmmmm")
            item_loader.add_value("emmmmm")
            # pass
            question_item = item_loader.load_item()
        # start_answer_url: the answers API URL template, defined elsewhere in these notes
        yield scrapy.Request(self.start_answer_url.format(question_id, 20, 0), headers=self.headers, callback=self.parse_answer)
        yield question_item

    def parse_answer(self, response):
        # process the answers of a question as JSON
        ans_json = json.loads(response.text)
        is_end = ans_json["paging"]["is_end"]
        next_url = ans_json["paging"]["next"]
        for answer in ans_json["data"]:
            really_url = "http://www.zhihu.com/question/{0}/answer/{1}".format(answer["question"]["id"], answer["id"])
            yield scrapy.Request(really_url, headers=self.headers, callback=self.parse_answer_end, meta={'emmmmmmmmm'})
        if not is_end:
            yield scrapy.Request(next_url, headers=self.headers, callback=self.parse_answer)

    def parse_answer_end(self, response):
        answer_item = ZhihuAnswerItem()
        answer_item["emmmmm"] = response.meta.get("emmmmmm")
        yield answer_item

    def start_requests(self):
        # log in once with Selenium, harvest the cookies, then hand off to Scrapy
        from selenium import webdriver
        import time
        browser = webdriver.PhantomJS()
        browser.get("http://www.zhihu.com/signin")
        browser.find_element_by_css_selector("emmmmmm").send_keys(ZHIHU_PHONE)
        time.sleep(1)
        # type the password, click the login button, etc.
        zhihu_cookies = browser.get_cookies()
        import pickle
        for cookie in zhihu_cookies:
            # emmmmmmm
            pass
        browser.close()
        return [scrapy.Request(url=self.start_urls[0], headers=self.headers)]
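The elided cookie step would look roughly like this: pickle each cookie to disk so later runs can log in without Selenium (the directory and file naming are assumptions):

    import pickle

    for cookie in zhihu_cookies:
        file_name = "cookies/zhihu_{0}.cookie".format(cookie["name"])  # path is assumed
        with open(file_name, "wb") as f:
            pickle.dump(cookie, f)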
Unit test for the simulated login
from config import ZHIHU_PHONE, ZHIHU_PASSWORD
import time
from selenium import webdriver

browser = webdriver.Chrome()  # the original note lost the driver line; Chrome assumed
browser.get("http://www.zhihu.com")
browser.find_element_by_xpath("//emmmmm").click()
time.sleep(10)
browser.find_element_by_css_selector("emmmmmm").send_keys(ZHIHU_PHONE)
time.sleep(1)
# type the password, click the login button, etc.
How do you read regular expressions, and how do you write them elegantly? (to be filled in next time; demo below)
. : matches any single character except line terminators (\r, \n)
* : matches the preceding subexpression zero or more times
\d : matches one digit character
+ : matches the preceding subexpression one or more times
/ : an ordinary character, matched literally
$ : matches the end of the input string
(.)(.) : parentheses form capture groups, retrieved with group(1), group(2), ...
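Putting the pieces together, the question-URL pattern from the spider above reads like this (the sample URL is made up):

    import re

    url = "https://www.zhihu.com/question/12345678/answer/87654321"
    match_obj = re.match(r"(.*zhihu.com/question/(\d+))(/|$).*", url)
    if match_obj:
        print(match_obj.group(1))  # https://www.zhihu.com/question/12345678
        print(match_obj.group(2))  # 12345678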
Protecting the password in the unit test: credentials are imported from config.py instead of being hard-coded.
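A sketch of that config module (the values are placeholders; keep the file out of version control):

    # config.py: imported by the spider and the unit test, never committed
    ZHIHU_PHONE = "13xxxxxxxxx"
    ZHIHU_PASSWORD = "********"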
Storage(crawler(website(articles)))
- Configure the required virtual environment
mkvirtualenv articlespider3
create the virtual environment
workon articlespider3
activate the virtual environment
deactivate
leave the activated environment
workon
list the existing virtual environments
scrapy startproject ArticleSpider
File structure created under ArticleSpider in the virtualenv:
|—ArticleSpider
  |—spiders
    |—__init__.py
  |—items.py
  |—middlewares.py
  |—pipelines.py
  |—settings.py
|—scrapy.cfg
Create the concrete spider:
cd ArticleSpider
scrapy genspider jobbole blog.jobbole.com
Launch the spider through Scrapy, and debug it via a main file (sketch below).
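The usual debug entry point is a main.py at the project root that invokes the Scrapy CLI in-process, so breakpoints work in the IDE:

    # main.py: run "scrapy crawl jobbole" from inside Python for debugging
    import os
    import sys

    from scrapy.cmdline import execute

    sys.path.append(os.path.dirname(os.path.abspath(__file__)))
    execute(["scrapy", "crawl", "jobbole"])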
Jobbole (伯乐在线) page content
- Scraping page content = XPath syntax and CSS selectors (over HTML/XML)
items.py = Item (models the scraped content)
pipeline = routing for items
saving to MySQL = a Scrapy pipeline
MySQL synchronous/asynchronous inserts with a connection pool = letting Scrapy's asynchronous Twisted framework handle items (sketch below)
a Django Model is the Django-side counterpart of a Scrapy Item
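A sketch of the asynchronous insert: a pipeline writing through Twisted's adbapi connection pool so database writes don't block the crawl (the MYSQL_* setting names, table, and columns are assumptions):

    import MySQLdb
    import MySQLdb.cursors
    from twisted.enterprise import adbapi

    class MysqlTwistedPipeline(object):
        """Insert items via Twisted's adbapi pool instead of blocking the reactor."""

        def __init__(self, dbpool):
            self.dbpool = dbpool

        @classmethod
        def from_settings(cls, settings):
            # connection parameters come from settings.py (names assumed)
            dbpool = adbapi.ConnectionPool(
                "MySQLdb",
                host=settings["MYSQL_HOST"],
                db=settings["MYSQL_DBNAME"],
                user=settings["MYSQL_USER"],
                passwd=settings["MYSQL_PASSWORD"],
                charset="utf8",
                cursorclass=MySQLdb.cursors.DictCursor,
                use_unicode=True,
            )
            return cls(dbpool)

        def process_item(self, item, spider):
            # runInteraction runs do_insert with a cursor on a pooled thread
            query = self.dbpool.runInteraction(self.do_insert, item)
            query.addErrback(self.handle_error)
            return item

        def do_insert(self, cursor, item):
            insert_sql = "insert into article(title, url) values (%s, %s)"  # assumed schema
            cursor.execute(insert_sql, (item["title"], item["url"]))

        def handle_error(self, failure):
            print(failure)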