Simulating a Login and Scraping the dashboard-feed Stream from the GitHub Home Page

Author: cailuo | Published: 2020-03-01 20:56

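    I. Chrome browser notes:

    1. Data that is visible under "View page source" can be fetched directly by sending a GET request to the current URL from your program.
    2. The HTML shown in the Elements panel is not the same as the raw response (it reflects the DOM after JavaScript has run), so treat it only as a reference.
    3. Inspect the details of each request in the Network panel, including the method, headers, and parameters, and copy them into your program.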
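As a small illustration of point 3, here is a minimal sketch of turning a header block copied from the browser's developer tools into a dictionary that can be passed to `requests`. The helper name `parse_raw_headers` is hypothetical, and the sample block simply reuses header names that appear in the script in this post (with a shortened User-Agent).

```python
def parse_raw_headers(raw: str) -> dict:
    """Turn a header block copied from DevTools into a dict."""
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        # Skip blank lines and HTTP/2 pseudo-headers such as ":authority"
        if not line or line.startswith(":"):
            continue
        # Split on the first colon only, so URLs in the value stay intact
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return headers


raw = """
Host: github.com
Referer: https://github.com
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64)
"""
headers = parse_raw_headers(raw)
print(headers["Referer"])
```

The resulting dictionary can then be passed as the `headers=` argument of `requests.get` or `Session.get`, which is how the script in this post supplies its own headers.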

    II. Implementation

    import requests
    from lxml import etree
    
    class Login:
        def __init__(self):
            self.headers = {
                'Referer': 'https://github.com',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.122 Safari/537.36',
                'Host': 'github.com'
            }
    
            self.login_url = 'https://github.com/login'
            self.post_url = 'https://github.com/session'
            self.session = requests.Session()
            self.response = self.session.get(self.login_url, headers=self.headers)
            self.selector = etree.HTML(self.response.text)
    
        def token(self):
            token = self.selector.xpath('//div[@class="auth-form px-3"]//input[@name="authenticity_token"]/@value')[0]
            return token
    
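        def get_timestamp(self):
            # xpath() returns a list; take the first match, or '' if the field is absent
            values = self.selector.xpath('//input[@name="timestamp"]/@value')
            return values[0] if values else ''
    
        def get_timestamp_secret(self):
            values = self.selector.xpath('//input[@name="timestamp_secret"]/@value')
            return values[0] if values else ''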
    
        def get_ga_id(self):
            # This is the Google Analytics id (hard-coded here)
            ga_id = '422801072.1583054032'
            return ga_id
    
        # Perform the simulated login
        def login(self, email, password):
            params = {
                'commit': 'Sign in',
                'utf8': '✓',
                'authenticity_token': self.token(),
                'ga_id': self.get_ga_id(),
                'login': email,
                'password': password,
                'webauthn-support': 'supported',
                'webauthn-iuvpaa-support': 'unsupported',
                'timestamp': self.get_timestamp(),
                'timestamp_secret': self.get_timestamp_secret()
            }
    
            response = self.session.post(self.post_url, data=params, headers=self.headers)
            if response.status_code == 200:
                feed_url = 'https://github.com/dashboard-feed'
                feed_response = self.session.get(feed_url, headers=self.headers)
                if feed_response.status_code == 200:
                    return feed_response.text
    
        def parse(self, data):
            selector = etree.HTML(data).xpath('//div[@class="watch_started"]')
    
            for element in selector:
                # xpath() returns a list, which is empty for some elements
                string = element.xpath('.//div[@class="d-flex flex-items-baseline"]')
                # Only extract text when the list is non-empty
                if string:
                    # Strip whitespace from every text node; blank nodes become ''
                    news = [value.strip() for value in string[0].xpath('.//text()')]
    
                    # Drop the empty strings and keep only the first 3 fields;
                    # the 4th field (the date) is discarded
                    news = [new for new in news if new][:3]
    
                    print(' '.join(news))
    
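        def run(self):
            # Replace the placeholders with a real GitHub login and password
            data = self.login('email', 'password')
            # login() returns None when either request does not return 200
            if data:
                self.parse(data)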
    
    
    if __name__ == '__main__':
        Login().run()
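One caveat about the `status_code == 200` check in `login()`: `requests` follows redirects by default, so the status code describes the final page, and a rejected login may well come back as a 200 page that simply shows the sign-in form again. The sketch below is one possible extra check, not part of the original script: it assumes a failed login re-renders a form containing a password input, which I have not verified against GitHub's current pages. The names `LoginFormDetector` and `looks_logged_in` are hypothetical, and it uses the standard-library `html.parser` instead of `lxml` only so that it runs without extra dependencies.

```python
from html.parser import HTMLParser


class LoginFormDetector(HTMLParser):
    """Sets found=True if the page contains a password input."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "input" and dict(attrs).get("type") == "password":
            self.found = True


def looks_logged_in(html: str) -> bool:
    """Heuristic: a page without a password field is not a login form."""
    detector = LoginFormDetector()
    detector.feed(html)
    return not detector.found


# Made-up fragments standing in for real responses
login_page = '<form><input type="text" name="login"><input type="password" name="password"></form>'
dashboard = '<div class="dashboard">feed</div>'
print(looks_logged_in(login_page))
print(looks_logged_in(dashboard))
```

In the script, a check like this could be applied to `response.text` after the POST to `post_url`, before the dashboard-feed request is made.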
    

    Running the Login class with real credentials prints lines like the following:

    Germey starred docker-library/python 
    Germey starred google/cadvisor 
    Germey starred bitnami/bitnami-docker-postgresql 
    Germey starred cnbattle/douyin 
    ChenglongChen starred ChineseGLUE/ChineseGLUE 
    ChenglongChen starred FLHonker/Awesome-Knowledge-Distillation 
    ChenglongChen starred iCGY96/awesome_OpenSetRecognition_list 
    ChenglongChen starred google-research/fixmatch 
    ChenglongChen starred AtmaHou/Task-Oriented-Dialogue-Dataset-Survey 
    iamseancheney starred scalingexcellence/scrapybook-2nd-edition 
    ChenglongChen starred zhmiao/OpenLongTailRecognition-OLTR 
    ChenglongChen starred dkozlov/awesome-knowledge-distillation 
    ChenglongChen starred thunlp/TLNN 
    iamseancheney starred kingname/GeneralNewsExtractor 
    Shawn1993 starred sebastianruder/NLP-progress 
    iamseancheney starred seathiefwang/FaceRecognition-tensorflow
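Each line of that output comes from the clean-up step in `parse()`: strip every text fragment, drop the empty ones, keep the first three, and join them. Here is a minimal standalone sketch of that step. The function name `summarize` is hypothetical, and the whitespace layout and date string in the sample fragments are invented for illustration, since the real markup of the feed is not shown in this post.

```python
def summarize(fragments, keep=3):
    """Strip each fragment, drop empty ones, keep the first `keep`, join with spaces."""
    cleaned = [text.strip() for text in fragments]
    return " ".join([text for text in cleaned if text][:keep])


# Invented sample of what './/text()' might return for one feed item
fragments = ["\n  ", "Germey", "\n starred\n ", "docker-library/python", "\n", "2 days ago", "\n"]
print(summarize(fragments))
```

Using `" ".join(...)` avoids the trailing space that manual string concatenation in a loop leaves at the end of each line.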
    


    Article link: https://www.haomeiwen.com/subject/vighkhtx.html