Building scrapy + ES + MySQL to crawl Zhihu content


Author: 星泼拿衣服 | Published 2018-12-06 15:29

Django 1.2
Scrapy 1.5.1, Elasticsearch 6.3.2
Website side + crawler side
Crawler (writing data into the database)

Crawler fundamentals

1. Regular expressions
2. Depth-first and breadth-first traversal algorithms
3. Common strategies for URL deduplication
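One widely used dedup strategy is to store a fixed-size fingerprint of every URL already visited instead of the raw string; a minimal in-memory sketch (the names `seen` and `is_new_url` are illustrative, not from the article):

# One common URL-dedup strategy: an in-memory set of md5 digests
import hashlib

seen = set()

def is_new_url(url):
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()  # fixed-width fingerprint saves memory vs raw URLs
    if digest in seen:
        return False
    seen.add(digest)
    return True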

  • Crawler-side choice: Scrapy vs. requests + BeautifulSoup
    Scrapy is a framework, while requests and BeautifulSoup are libraries.
    Scrapy is built on Twisted (an asynchronous I/O framework), so performance is good.
    It is highly extensible and ships with many built-in components.
    It has built-in CSS and XPath selectors (backed by lxml, which is implemented in C).
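For comparison, here is what a single Scrapy selector call, response.css("a::attr(href)").extract(), looks like when done with requests + BeautifulSoup (a sketch; assumes both packages plus lxml are installed):

# requests + BeautifulSoup equivalent of Scrapy's response.css("a::attr(href)").extract()
import requests
from bs4 import BeautifulSoup

resp = requests.get("http://www.zhihu.com")
soup = BeautifulSoup(resp.text, "lxml")  # lxml parser, the same C-backed engine Scrapy uses
urls = [a.get("href") for a in soup.find_all("a")]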

Python string encoding/decoding issues

import sys
sys.getdefaultencoding()   # inspect the default codec (ascii on Python 2, utf-8 on Python 3)

s = "知乎"
b = s.encode("utf-8")      # str -> bytes
s2 = b.decode("utf-8")     # bytes -> str

Simulating human-style web surfing: learning by excerpting while browsing (not yet copying wholesale)

Set up a virtual environment with virtualenv, create the Scrapy project from the command line, and generate the spider from the command line.
Write a single .py file covering: the simulated-login URL, requesting URLs, the login account, filtering URLs and extracting the desired content, and saving it locally in structured form.

import re
import json

import scrapy
from scrapy.loader import ItemLoader

from config import ZHIHU_PHONE, ZHIHU_PASSWORD
# ZhihuQuestionItem / ZhihuAnswerItem live in the project's items.py (location assumed)
from items import ZhihuQuestionItem, ZhihuAnswerItem


class ZhihuSpider(scrapy.Spider):
    name = 'zhihu'
    allowed_domains = ['www.zhihu.com']
    # simulated-login entry point
    start_urls = ['http://emmmmmmmmmm']
    headers = {
        'HOST': 'www.zhihu.com',
        'Referer': 'http://www.zhihu.com',
        'User-Agent': 'chrome/68.0.3440.106 emmmmmm'
    }
    custom_settings = {
        "COOKIES_ENABLED": True,
        "DOWNLOAD_DELAY": 1.5
    }

    def parse(self, response):
        # filter the URLs and the content we need
        all_urls = response.css("a::attr(href)").extract()
        for url in all_urls:
            # match question URLs
            match_obj = re.match(r"(.*zhihu.com/question/(\d+))(/|$).*", url)
            if match_obj:
                request_url = match_obj.group(1)
                yield scrapy.Request(request_url, headers=self.headers, callback=self.parse_question)
            else:
                # not a question page: keep crawling for more links
                yield scrapy.Request(url, headers=self.headers, callback=self.parse)

    def parse_question(self, response):
        # build the structured question item (new page layout)
        if "QuestionHeader-title" in response.text:
            match_obj = re.match(r"(.*zhihu.com/question/(\d+))(/|$).*", response.url)
            if match_obj:
                question_id = int(match_obj.group(2))
            # the loader setup is elided in the original; ItemLoader usage assumed
            item_loader = ItemLoader(item=ZhihuQuestionItem(), response=response)
            item_loader.add_css("emmmmmm")
            item_loader.add_xpath("emmmmm")
            item_loader.add_value("emmmmm")
            question_item = item_loader.load_item()
        # start_answer_url (the answers-API URL template) is defined elsewhere in the project
        yield scrapy.Request(self.start_answer_url.format(question_id, 20, 0), headers=self.headers, callback=self.parse_answer)
        yield question_item

    def parse_answer(self, response):
        # process the question's answers (a paginated JSON API)
        ans_json = json.loads(response.text)
        is_end = ans_json["paging"]["is_end"]
        next_url = ans_json["paging"]["next"]

        for answer in ans_json["data"]:
            really_url = "http://www.zhihu.com/question/{0}/answer/{1}".format(answer["question"]["id"], answer["id"])
            # the meta payload is a placeholder in the original
            yield scrapy.Request(really_url, headers=self.headers, callback=self.parse_answer_end, meta={"emmmmmmmmm": None})
        if not is_end:
            yield scrapy.Request(next_url, headers=self.headers, callback=self.parse_answer)

    def parse_answer_end(self, response):
        answer_item = ZhihuAnswerItem()
        answer_item["emmmmm"] = response.meta.get("emmmmmm")
        yield answer_item

    def start_requests(self):
        from selenium import webdriver
        import time
        browser = webdriver.PhantomJS()
        browser.get("http://www.zhihu.com/signin")
        browser.find_element_by_css_selector("emmmmmm").send_keys(ZHIHU_PHONE)
        time.sleep(1)
        # type the password, click sign-in, and so on
        zhihu_cookies = browser.get_cookies()
        import pickle
        for cookie in zhihu_cookies:
            pass  # emmmmmmm -- persist each cookie (see the pickle sketch below)
        browser.close()
        return [scrapy.Request(url=self.start_urls[0], headers=self.headers)]
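The loop body above is elided in the original. One plausible way to persist the cookies for later requests (a sketch, not the article's code; the one-file-per-cookie layout and paths are assumptions):

import pickle

for cookie in zhihu_cookies:
    # write each cookie dict to its own pickle file, keyed by cookie name
    with open("cookies/zhihu_" + cookie["name"] + ".pkl", "wb") as f:
        pickle.dump(cookie, f)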

Unit test for the simulated login

from config import ZHIHU_PHONE, ZHIHU_PASSWORD
import time
from selenium import webdriver

browser = webdriver.PhantomJS()
browser.get("http://www.zhihu.com")

browser.find_element_by_xpath("//emmmmm").click()
time.sleep(10)
browser.find_element_by_css_selector("emmmmmm").send_keys(ZHIHU_PHONE)
time.sleep(1)
# type the password, click sign-in, and so on

How do you read regular expressions, and how do you write them elegantly? (to be filled in next time)

. : matches any single character except the line-break characters (\r \n)
* : matches the preceding subexpression zero or more times
\d : matches one digit character
+ : matches the preceding subexpression one or more times
/ : an ordinary literal character
$ : matches the end of the input string
(.)(.) : parentheses form capturing groups, read back with group(1), group(2)
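These are exactly the pieces of the question-URL pattern used in the spider; a quick check in the REPL (the sample URL is made up):

import re

url = "https://www.zhihu.com/question/12345678/answer/987"
match_obj = re.match(r"(.*zhihu.com/question/(\d+))(/|$).*", url)
if match_obj:
    print(match_obj.group(1))  # https://www.zhihu.com/question/12345678
    print(match_obj.group(2))  # 12345678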

Keeping the password out of the unit test
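The unit test imports the credentials from a config module rather than hard-coding them; a minimal sketch of what that config.py might look like (values are placeholders, and keeping the file out of version control is the point):

# config.py -- keep out of version control (e.g. list it in .gitignore)
ZHIHU_PHONE = "emmmmmmmmm"      # login phone number
ZHIHU_PASSWORD = "emmmmmmmmm"   # login password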

Storage (crawler (website (articles)))

  • Set up the required virtual environment
mkvirtualenv articlespider3
    create the virtual environment
workon articlespider3
    activate the virtual environment
deactivate
    leave the active environment
workon
    list the existing virtual environments
scrapy startproject ArticleSpider

The resulting file structure of the ArticleSpider project:
|—ArticleSpider
  |—spiders
  |—__init__.py
  |—items.py
  |—middlewares.py
  |—pipelines.py
  |—settings.py
|—scrapy.cfg

Create the actual spider

cd ArticleSpider
scrapy genspider jobbole blog.jobbole.com
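genspider drops a skeleton spider into spiders/; roughly (the basic template of Scrapy 1.5):

# spiders/jobbole.py, as generated by the basic template
import scrapy

class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/']

    def parse(self, response):
        pass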

Start the spider with Scrapy, using a main file for debugging (a sketch follows).
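A common pattern is a small main.py at the project root that invokes Scrapy's command line programmatically, so the spider can be run and break-pointed from the IDE (the file name main.py is a convention, not a requirement):

# main.py -- run "scrapy crawl jobbole" from inside the IDE
import os
import sys

from scrapy.cmdline import execute

sys.path.append(os.path.dirname(os.path.abspath(__file__)))  # make the project root importable
execute(["scrapy", "crawl", "jobbole"])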

Jobbole (伯乐在线) page content

  • Crawling page content = XPath syntax and CSS selectors (over HTML/XML)
    items.py = the Item definitions (what gets crawled)
    pipeline = routing for Items
    saving to MySQL = a Scrapy pipeline
    synchronous vs. asynchronous MySQL inserts with a connection pool = Scrapy's underlying Twisted async framework (see the sketch below)
    django.model = the Django analogue of a Scrapy Item
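For the asynchronous insert with a connection pool, the usual approach is Twisted's adbapi over a MySQL driver; a minimal sketch assuming the MySQLdb driver and a hypothetical articles table (the settings keys, SQL, and field names are illustrative, not from the article):

# pipelines.py -- asynchronous MySQL inserts through Twisted's connection pool
import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi


class MysqlTwistedPipeline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool

    @classmethod
    def from_settings(cls, settings):
        # connection parameters come from settings.py; these key names are illustrative
        dbpool = adbapi.ConnectionPool(
            "MySQLdb",
            host=settings["MYSQL_HOST"],
            db=settings["MYSQL_DBNAME"],
            user=settings["MYSQL_USER"],
            passwd=settings["MYSQL_PASSWORD"],
            charset="utf8",
            cursorclass=MySQLdb.cursors.DictCursor,
            use_unicode=True,
        )
        return cls(dbpool)

    def process_item(self, item, spider):
        # the insert runs on a pool thread, so the Twisted reactor is never blocked
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error)
        return item

    def do_insert(self, cursor, item):
        # table and column names are hypothetical
        cursor.execute(
            "INSERT INTO articles(title, url) VALUES (%s, %s)",
            (item["title"], item["url"]),
        )

    def handle_error(self, failure):
        print(failure)  # surface asynchronous insert failures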
