Python Web Crawler Basics

Author: 两分与桥 | Published: 2018-07-08 16:35


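A crawler works by analyzing the requests a website makes and then using a script to imitate the browser: logging in and fetching data. The recurring tools are GET, POST, cookies, and headers.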

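Course outline:
I. Crawlers
1. Basic operations
- Log in to any website (imitate anything a browser does)
2. Performance
- Concurrency options:
- Asynchronous IO: gevent/Twisted/asyncio/aiohttp
- Writing a custom asynchronous IO module
- IO multiplexing: select
3. The Scrapy framework
Overview: asynchronous IO, built on Twisted
- Building a custom crawler framework based on the Scrapy source
- Using Scrapy

II. The Tornado framework (asynchronous, non-blocking)
1. Basic Tornado usage
- Small example
- Custom components
2. A walk through the Tornado source
3. A custom asynchronous non-blocking framework, implemented with select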

1. Basic crawler operations
    a. Crawlers
          - Targeted
          - Untargeted
            Weights

    b. 
        Requirement 1:
            Download the page:
            Filter the content:
                regular expressions
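
The "download, then filter with a regular expression" step can be sketched as follows. This is a minimal illustration, not the course's own code: the HTML fragment is a made-up stand-in for a downloaded page, and the pattern only handles simple, regular `<a>` tags.

```python
import re

# Hypothetical page fragment standing in for a downloaded response body.
html = """
<ul id="news">
  <li><a href="/news/1.html">First headline</a></li>
  <li><a href="/news/2.html">Second headline</a></li>
</ul>
"""

# Capture the href and the link text of every <a> tag.
# A regex is fine for small, regular fragments; for messy real-world HTML
# a parser is usually more robust.
LINK_RE = re.compile(r'<a\s+href="([^"]+)"[^>]*>(.*?)</a>', re.S)

links = LINK_RE.findall(html)
for href, title in links:
    print(href, title)
```

`findall` returns one `(href, title)` tuple per link because the pattern has two capture groups.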
        
            ========== Open-source modules ==========
            
            1. requests
                pip3 install requests
                
                response = requests.get('http://www.autohome.com.cn/news/')
                response.text
                
                
                Summary:
                
                response = requests.get('URL')
                response.text               # body decoded to str
                response.content            # raw body as bytes
                response.encoding           # encoding used to decode .text
                response.apparent_encoding  # encoding guessed from the body
                response.status_code
                response.cookies.get_dict()
                
                
                requests.get('http://www.autohome.com.cn/news/',cookies={'xx':'xxx'})
                
            2. The BeautifulSoup module
                pip3 install beautifulsoup4
                
                from bs4 import BeautifulSoup
                soup = BeautifulSoup(response.text,features='html.parser')
                target = soup.find(id='auto-channel-lazyload-article')
                print(target)
            
                Summary:
                    soup = BeautifulSoup('<html>...</html>',features='html.parser')
                    find returns the first match (None if nothing matches)
                    v1 = soup.find('div') 
                    v1 = soup.find(id='i1')
                    v1 = soup.find('div',id='i1')
                    
                    find_all returns every match, as a list
                    v2 = soup.find_all('div')
                    v2 = soup.find_all(id='i1')
                    v2 = soup.find_all('div',id='i1')
            
                    obj = v1
                    obj = v2[0]
                    
                    obj.text   the text content
                    obj.attrs  the attributes, as a dict
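
To make the difference between `find` and `find_all` concrete, here is a small sketch that runs on an inline HTML string rather than a live page. The markup, ids, and class names are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the ids and classes are made up for this example.
html = """
<div id="i1" class="box">first <a href="/a">link A</a></div>
<div id="i2">second <a href="/b">link B</a></div>
"""

soup = BeautifulSoup(html, features="html.parser")

first_div = soup.find("div")        # first match only
all_divs = soup.find_all("div")     # list of every match
by_id = soup.find("div", id="i2")   # tag name and attribute together
missing = soup.find(id="nope")      # None when nothing matches

print(first_div.attrs)              # attributes of the first <div>, as a dict
print(by_id.text)                   # text of the tag and its children
hrefs = [a.attrs["href"] for a in soup.find_all("a")]
print(hrefs)
```

Because `find` returns `None` on a miss, it is worth checking the result before reading `.text` or `.attrs` from it.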
    
    
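        Requirement 2:
            Log in to a website (such as GitHub) automatically from a program; the example below posts to the dig.chouti.com login endpoint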
            
            post_dict = {
                "phone": '111111111',
                'password': 'xxx',
                'oneMonth': 1
            }
            response = requests.post(
                url="http://dig.chouti.com/login",
                data = post_dict
            )
            
            print(response.text)
            cookie_dict = response.cookies.get_dict()
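
The cookies returned by the login request are what later requests need to send back. One way to carry them along is a session object. The helper below is only a sketch: `login_and_fetch` and its parameters are hypothetical names, and `session` is assumed to behave like `requests.Session`.

```python
def login_and_fetch(session, login_url, page_url, credentials):
    """Log in with a POST, then fetch a page that needs the login cookies.

    `session` is expected to behave like requests.Session, which stores
    the cookies from each response and sends them on later requests.
    """
    login_response = session.post(login_url, data=credentials)
    login_response.raise_for_status()  # stop early on an HTTP error status
    page_response = session.get(page_url)
    return login_response.cookies.get_dict(), page_response.text
```

With the real library this would be called as `login_and_fetch(requests.Session(), "http://dig.chouti.com/login", page_url, post_dict)`, where `page_url` is whatever page requires the login. Whether that endpoint still accepts these fields is not something these notes confirm.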
    
    
    c. The modules in detail
        requests
        
        - How the methods relate (each helper is a thin wrapper around requests.request)
            requests.get(.....)
            requests.post(.....)
            requests.put(.....)
            requests.delete(.....)
            ...
            
            requests.request('POST'...)
        - Parameters
            requests.request
            - method:  HTTP method
            - url:     target URL
            - params:  parameters passed in the URL query string (typical for GET)
                requests.request(
                    method='GET',
                    url= 'http://www.oldboyedu.com',
                    params = {'k1':'v1','k2':'v2'}
                )
                # http://www.oldboyedu.com/?k1=v1&k2=v2
            - data:    data sent in the request body, form-encoded
            
                requests.request(
                    method='POST',
                    url= 'http://www.oldboyedu.com',
                    params = {'k1':'v1','k2':'v2'},
                    data = {'use':'alex','pwd': '123','x':[11,2,3]}
                )
                
                Request header:
                    content-type: application/x-www-form-urlencoded
                    
                Request body (a list value is sent as a repeated key):
                    use=alex&pwd=123&x=11&x=2&x=3
                
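            - json:    data sent in the request body, serialized as JSON
                requests.request(
                    method='POST',
                    url= 'http://www.oldboyedu.com',
                    params = {'k1':'v1','k2':'v2'},
                    json = {'use':'alex','pwd': '123'}
                )
                Request header:
                    content-type: application/json
                    
                Request body:
                    {"use": "alex", "pwd": "123"}
                
                The difference between json and data: data is form-encoded, which cannot represent a dict nested inside a dict, so nested structures have to be sent with json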
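
The effect of `params`, `data`, and `json` can be checked without sending anything by preparing a request and inspecting it. This is a sketch for illustration; the `prepare` helper and the `profile` field are made up, and the URL from the notes is never contacted.

```python
import json
import requests

URL = "http://www.oldboyedu.com"  # URL from the notes; never contacted here


def prepare(**kwargs):
    """Build the request that requests would send, without sending it."""
    return requests.Request(method="POST", url=URL, **kwargs).prepare()


# data: form-encoded body, list values become repeated keys.
form = prepare(params={"k1": "v1"}, data={"use": "alex", "x": [11, 2, 3]})
print(form.url)                      # query string comes from params
print(form.headers["Content-Type"])
print(form.body)

# json: body serialized as JSON, nested dicts survive intact.
as_json = prepare(json={"use": "alex", "profile": {"age": 18}})
print(as_json.headers["Content-Type"])
print(json.loads(as_json.body))

# Form encoding flattens a nested dict to its keys and loses the values.
nested_form = prepare(data={"profile": {"age": 18}})
print(nested_form.body)
```

The last call is the case the notes warn about: the nested value `18` never reaches the body when `data` is used.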
                
            - headers:  request headers
            
                requests.request(
                    method='POST',
                    url= 'http://www.oldboyedu.com',
                    params = {'k1':'v1','k2':'v2'},
                    json = {'use':'alex','pwd': '123'},
                    headers={
                        'Referer': 'http://dig.chouti.com/',
                        'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36"
                    }
                )
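             - cookies:  cookies to send with the request
        
            
             
             - files:    files to upload
             
             - auth:     HTTP basic authentication (the username and password are Base64-encoded into the Authorization header, not encrypted); rarely used
             
             - timeout:  how long to wait for the connection and for the response, in seconds
             
             - allow_redirects:  whether to follow redirects
             
             - proxies:  proxy servers to route the request through
             
             - verify:   whether to verify the server's TLS certificate (False skips verification)
             
             - cert:     client-side certificate file
             
             - stream:   stream the response, downloading it piece by piece instead of all at once
             
        - session: keeps state (such as cookies) from earlier requests so that later requests can reuse it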
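
A short sketch of what a session carries from one request to the next. To keep it runnable offline, the cookie is set by hand (normally the server sets it through a `Set-Cookie` response header) and the request is only prepared, not sent. The User-Agent string and cookie name are placeholders.

```python
import requests

session = requests.Session()
# Headers set on the session are merged into every request it sends.
session.headers.update({"User-Agent": "Mozilla/5.0 (example)"})
# Normally the server sets cookies via Set-Cookie; one is set by hand here
# so the example runs without touching the network.
session.cookies.set("sessionid", "abc123")

# prepare_request() applies the session state without sending anything.
request = requests.Request("GET", "http://dig.chouti.com/")
prepared = session.prepare_request(request)

print(prepared.headers["User-Agent"])
print(prepared.headers["Cookie"])
```

Calling `session.get(...)` or `session.post(...)` applies the same merged headers and cookies, and also stores any cookies the server sends back.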
             

       HTTPS is HTTP encrypted with SSL/TLS
     
    




        Review of the previous section:
        - requests
            - requests.post()
            - url
            - method
            - json
            - stream
            - files
            - auth
            - cert
            - allow_redirects
            - headers
            - cookies
            - proxies
            - data
            - params
        - response = requests.post()
            - .....
        - session 
        - Web fundamentals
        - Regular expressions
        - beautifulsoup
        
    WeChat
Long polling
Polling


The crawler routine
    - get
    - post
    - cookies
    - headers
    
    If the bare URL does not work, add cookies; if it still does not work, add headers as well
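
The escalation described above can be written as a small helper. This is a sketch under two assumptions that are not in the notes: `fetch_with_fallback` is a hypothetical name, and "works" is taken to mean an HTTP 200 status, which is a simplification since some sites answer 200 with a login page.

```python
def fetch_with_fallback(get, url, cookies=None, headers=None):
    """Try the bare URL first, then add cookies, then cookies and headers.

    `get` is any callable that accepts the same keyword arguments as
    requests.get. Returns the first response whose status code is 200,
    otherwise the response from the last attempt.
    """
    attempts = [
        {},
        {"cookies": cookies},
        {"cookies": cookies, "headers": headers},
    ]
    response = None
    for extra in attempts:
        response = get(url, **extra)
        if response.status_code == 200:
            break
    return response
```

In real use, `get` would be `requests.get` or a session's `get` method; taking it as a parameter keeps the helper easy to try out with a stand-in function.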
    
Note: when sending data, requests adds a default Content-Type request header
        - data  Content-Type: application/x-www-form-urlencoded
        - json  Content-Type: application/json 

        ==> these apply to a single request

Reference blog: https://www.cnblogs.com/wupeiqi/articles/6283017.html
