Python Crawler Study 15: Simulating a Zhihu Login with Requests

Author: MingSha | Published: 2017-04-19 22:22

1. Common Status Codes

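Status code	Meaning
200	The request succeeded
301/302	Permanent redirect / temporary redirect
403	Forbidden: no permission to access the resource
404	The requested resource was not found
500	Internal server error
503	Service unavailable: the server is down or under maintenance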
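
As a quick cross-check of the table, the standard library already knows these codes. The following is a minimal sketch using http.HTTPStatus; the helper name is_redirect is made up for this illustration and is not part of the original article.

```python
from http import HTTPStatus

# The status codes listed in the table above
codes = [200, 301, 302, 403, 404, 500, 503]

# Map each code to the standard reason phrase known to the standard library
phrases = {code: HTTPStatus(code).phrase for code in codes}
for code, phrase in phrases.items():
    print(code, phrase)


def is_redirect(code):
    """Hypothetical helper: True for the two redirect codes in the table."""
    return code in (HTTPStatus.MOVED_PERMANENTLY, HTTPStatus.FOUND)
```

The redirect codes matter later on, because the login check in section 6 relies on telling a 200 apart from a 301 or 302.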

2. Login Analysis

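On the login page, enter a phone number as the account and submit the form.
The browser's developer tools show the request being sent to:
Request URL:https://www.zhihu.com/login/phone_num
When an email address is entered instead, the request goes to:
Request URL:https://www.zhihu.com/login/email
The form data also contains a string such as:
_xsrf:a71f46d549979fa192c09e11e4a463b5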

3. Extracting the _xsrf Value

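The _xsrf value is extracted from the page source with a regular expression. Request headers (typically a browser-like User-Agent) have to be sent along with the request in order to fetch the page source.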

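def get_xsrf():
    # headers is the dict of request headers sent with every request
    response = requests.get("https://www.zhihu.com", headers=headers)
    # Non-greedy group, so the capture stops at the first closing quote
    re_match = re.match(r'.*name="_xsrf" value="(.*?)"/>', response.text, re.S)
    # Returns the match object (or None); the token itself is re_match.group(1)
    return re_match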

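The re.S flag lets the dot match newline characters as well, so the pattern can run across several lines of the page source. Without it, the leading .* stops at the first line break and re.match finds nothing.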
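
The effect of re.S, and of a greedy versus a non-greedy capture group, can be checked offline. This is a minimal sketch: sample_html is an invented fragment standing in for the real page source, and only the token value is taken from the form data shown in section 2.

```python
import re

# Invented stand-in for the page source; the real page will look different
sample_html = (
    '<form method="post">\n'
    '<input type="hidden" name="_xsrf" value="a71f46d549979fa192c09e11e4a463b5"/>\n'
    '<input type="text" name="account" value=""/>\n'
    '</form>\n'
)

pattern_greedy = r'.*name="_xsrf" value="(.*)"/>'
pattern_lazy = r'.*name="_xsrf" value="(.*?)"/>'

# Without re.S the leading .* cannot cross the first newline, so nothing matches
no_flag = re.match(pattern_lazy, sample_html)

# With re.S a greedy group runs on to the last "/> in the whole text
greedy = re.match(pattern_greedy, sample_html, re.S)

# With re.S and a non-greedy group the capture stops at the first closing quote
lazy = re.match(pattern_lazy, sample_html, re.S)

print(no_flag)
print(len(greedy.group(1)), len(lazy.group(1)))
print(lazy.group(1))
```

On a page with more than one input element after the token, the greedy form captures far more than the token, which is why a non-greedy group is the safer choice here.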

4. Login Logic

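import re
import http.cookiejar as cookielib

import requests

# Assumed placeholder: the real headers should carry a full browser User-Agent
headers = {"User-Agent": "Mozilla/5.0"}

# One session for every request, so cookies are shared between calls
session = requests.session()
session.cookies = cookielib.LWPCookieJar(filename="cookies.txt")
try:
    session.cookies.load(ignore_discard=True)
except OSError:
    # cookies.txt is missing or unreadable, e.g. on the first run
    print("Could not load cookies")

def get_xsrf():
    # Fetch the home page through the session so the token matches its cookies
    response = session.get("https://www.zhihu.com", headers=headers)
    re_match = re.match(r'.*name="_xsrf" value="(.*?)"/>', response.text, re.S)
    if re_match:
        return re_match.group(1)
    else:
        return ""

def zhihu_login(account, password):
    # An 11-digit number starting with 1 is treated as a mobile phone number
    if re.match(r"^1\d{10}$", account):
        print("Logging in with phone number")
        post_url = "https://www.zhihu.com/login/phone_num"
        post_data = {
            "_xsrf": get_xsrf(),
            "phone_num": account,
            "password": password,
            "captcha": get_captcha(),
        }
    else:
        print("Logging in with email")
        post_url = "https://www.zhihu.com/login/email"
        post_data = {
            "_xsrf": get_xsrf(),
            "email": account,
            "password": password,
            "captcha": get_captcha(),
        }
    response = session.post(post_url, data=post_data, headers=headers)
    # Persist the cookies so later runs can skip the login
    session.cookies.save()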

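The code above imports the requests library and uses its session to keep the connection state, builds post_data, and posts the account name, password and other fields to the site. A regular expression decides whether the account is a phone number or an email address, which selects the login URL.
With import http.cookiejar as cookielib in place, session.cookies.save() writes the cookies to disk.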
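
How the cookie file round-trips can be tried without touching the network. This is a minimal sketch using only the standard library; the cookie name demo_token, its value, and the helper make_cookie are invented for illustration, and the files are written to a temporary directory rather than cookies.txt.

```python
import http.cookiejar as cookielib
import os
import tempfile


def make_cookie(name, value, domain="www.zhihu.com"):
    """Hypothetical helper: build a session cookie (no expiry, discard=True)."""
    return cookielib.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


tmp_dir = tempfile.mkdtemp()

# Save with ignore_discard=True: session cookies are written to the file
kept_path = os.path.join(tmp_dir, "kept.txt")
jar = cookielib.LWPCookieJar(filename=kept_path)
jar.set_cookie(make_cookie("demo_token", "abc123"))
jar.save(ignore_discard=True)

restored = cookielib.LWPCookieJar(filename=kept_path)
restored.load(ignore_discard=True)
restored_names = [cookie.name for cookie in restored]

# Save with the defaults: cookies marked discard are left out of the file
dropped_path = os.path.join(tmp_dir, "dropped.txt")
jar.save(filename=dropped_path)

reloaded = cookielib.LWPCookieJar(filename=dropped_path)
reloaded.load(ignore_discard=True)
dropped_names = [cookie.name for cookie in reloaded]

print(restored_names, dropped_names)
```

If the login cookies turn out to be session cookies, a plain save() would leave them out of the file; in that case passing ignore_discard=True to save(), mirroring the load() call, is worth trying.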

5. Fetching the Captcha

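import time
from PIL import Image

def get_captcha():
    # Millisecond timestamp, passed as the r parameter of the captcha URL
    t = str(int(time.time() * 1000))
    captcha_url = "https://www.zhihu.com/captcha.gif?r={}&type=login".format(t)
    # Use the shared session so the captcha belongs to the same login attempt
    response = session.get(captcha_url, headers=headers)
    with open("captcha.gif", "wb") as f:
        f.write(response.content)
    try:
        # Show the image so the code can be read and typed in by hand
        with Image.open("captcha.gif") as im:
            im.show()
    except OSError:
        # The image could not be opened; the file can still be viewed manually
        pass
    captcha = input("Enter the captcha: ")
    return captcha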

The code above downloads the captcha, displays the image, and waits for the code to be typed in by hand.

6. Avoiding Repeated Logins

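Logging in is only needed once. On later runs, load the saved cookies and check whether the session is still logged in instead of logging in again.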

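session.cookies = cookielib.LWPCookieJar(filename="cookies.txt")
try:
    session.cookies.load(ignore_discard=True)
except OSError:
    # cookies.txt is missing or unreadable
    print("Could not load cookies")

def is_login():
    # Page used to probe the login state
    inbox_url = "https://www.zhihu.com/index"
    # Do not follow redirects, so a redirect to the login page stays visible
    response = session.get(inbox_url, headers=headers, allow_redirects=False)
    if response.status_code != 200:
        return "failure"
    else:
        return "success"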

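First the saved cookies are loaded with session.cookies.load, then is_login() is called. If inbox_url can be opened, the status code is 200 and the session is logged in. Note that allow_redirects=False is essential: when not logged in, the site redirects to the login page, and if the redirect were followed the final status code would normally be 200 again. With redirects disabled, the response keeps its 301 or 302 status code.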
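
The decision made by is_login() can also be written as a small pure function, which makes the three possible outcomes explicit. This is only a sketch: the name login_state, its return values, and the /signin path in the example are invented for illustration.

```python
def login_state(status_code, location=""):
    """Classify the response to a request sent with redirects disabled."""
    if status_code == 200:
        # The page was served directly, so the saved cookies are still valid
        return "logged_in"
    if status_code in (301, 302):
        # The site answered with a redirect, most likely to its login page
        return "redirected:" + location
    # Anything else (403, 500, 503, ...) says nothing about the login state
    return "unknown"


states = [
    login_state(200),
    login_state(302, "/signin"),  # "/signin" is a made-up redirect target
    login_state(503),
]
print(states)
```

With requests, the two arguments would come from response.status_code and response.headers.get("Location", "") of a response fetched with allow_redirects=False.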

Article link: https://www.haomeiwen.com/subject/vhtkzttx.html
