<Attack on Crawler Master> Public Opinion Monitoring: Acquiring Data

Author: zhaoolee | Source: published 2018-04-19 18:58, read 82 times
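    Public opinion monitoring means tracking, and trying to anticipate, the opinions and views that the public expresses on the internet. Most monitoring techniques are built on web crawlers. If we take the keywords of a trending event, search for them with a search engine, and save the results locally, we have implemented the first stage of public opinion monitoring: acquiring internet data in real time.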
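
The first step described above, turning a trending keyword into a search query, can be sketched as follows. This is a minimal illustration, not the author's code: `build_search_url` and `BAIDU_SEARCH_URL` are hypothetical names introduced here, and the endpoint is assumed to be the same Baidu search URL used in the article's implementation. The point it shows is that a keyword should be percent-encoded before it is placed in the query string, so that Chinese text, spaces, and characters such as `&` reach the search engine intact.

```python
from urllib.parse import urlencode

# Assumption: the same search endpoint that the article's code queries.
BAIDU_SEARCH_URL = "https://www.baidu.com/s"


def build_search_url(keyword):
    """Return the search URL for a keyword (hypothetical helper)."""
    # urlencode encodes the value as UTF-8 and escapes reserved characters,
    # so spaces, '&' and non-ASCII text do not corrupt the query string.
    return BAIDU_SEARCH_URL + "?" + urlencode({"wd": keyword})


if __name__ == "__main__":
    for kw in ["火影", "public opinion & monitoring"]:
        print(build_search_url(kw))
```

Passing the keyword through `params=` in `requests.get` would have the same effect; the standalone helper is used here only to make the encoding step visible.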
    [Figure: 舆情监测.png (public opinion monitoring)]

    Preliminary result

    [Animation: 获取数据.gif (acquiring data)]

    Implementation code

    import requests
    from lxml import etree
    import os
    import sys
    
    def getData(wd):
        # Request headers
        headers = {
            # Send a browser User-Agent so the crawler looks like a normal browser (a wolf in sheep's clothing)
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
        }
        # Target URL; the keyword is passed via params so that requests URL-encodes it
        target_url = "https://www.baidu.com/s"
        # Fetch the response (the timeout keeps a stalled connection from hanging forever)
        data = requests.get(target_url, params={"wd": str(wd)}, headers=headers, timeout=10)
        # Parse the HTML so it can be queried with XPath
        data_etree = etree.HTML(data.content)
        # Extract the list of search result blocks
        content_list = data_etree.xpath('//div[@id="content_left"]/div[contains(@class, "result c-container")]')
        # String to return
        result = ""
        # Collect the title, content, and link of each result
        for content in content_list:
            result_title = "<Title>  "
            bd_title = content.xpath('.//h3/a')
            for bd_t in bd_title:
                result_title += bd_t.xpath('string(.)')
    
            result_content = "<Content>  "
            bd_content = content.xpath('.//div[@class="c-abstract"]')
            for bd_c in bd_content:
                result_content += bd_c.xpath('string(.)')
    
            # Some result blocks have no display-URL link; avoid an IndexError in that case
            links = content.xpath('.//div[@class="f13"]/a[@class="c-showurl"]/@href')
            result_link = "<Link>  " + (str(links[0]) if links else "")
    
            result_list = [result_title, "\n", result_content, "\n", result_link, "\n", "\n"]
            for result_l in result_list:
                result += str(result_l)
        return result
    
    
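    # Save the data to a file
    
    def saveDataToFile(file_name, data):
        # Create the output folder if it does not exist yet
        os.makedirs("./data/", exist_ok=True)
    
        # Write as UTF-8 explicitly so Chinese text is saved correctly on any platform
        with open("./data/"+file_name+".txt", "w", encoding="utf-8") as f:
            f.write(data)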
    
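    def main():
        # Use the first command-line argument as the keyword, if one was given
        wd = sys.argv[1] if len(sys.argv) > 1 else ""
        # Fall back to the default keyword "火影" (Naruto)
        if len(wd) == 0:
            wd = "火影"
        str_data = getData(wd)
        print(str_data)
        saveDataToFile(wd, str_data)
    
    if __name__ == '__main__':
        main()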
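
The script above writes each keyword's results to a single file named after the keyword, so a second run for the same keyword replaces the earlier results. For monitoring over time, one possible variation is to give every run its own timestamped file. The sketch below is only an illustration under that assumption: `snapshot_path` and `save_snapshot` are hypothetical helpers, not part of the original code, and the file-name pattern is one arbitrary choice. It also replaces characters that are not safe in file names, since the keyword comes straight from the command line.

```python
import re
from datetime import datetime
from pathlib import Path


def snapshot_path(keyword, when, data_dir="./data"):
    """Build a per-run file path from the keyword and a timestamp (hypothetical helper)."""
    # Replace path separators, whitespace and other characters that are
    # unsafe in file names on common systems with underscores.
    safe_keyword = re.sub(r'[\\/:*?"<>|\s]+', "_", keyword).strip("_") or "keyword"
    stamp = when.strftime("%Y%m%d-%H%M%S")
    return Path(data_dir) / f"{safe_keyword}_{stamp}.txt"


def save_snapshot(keyword, text, when=None, data_dir="./data"):
    """Write one batch of search results to its own UTF-8 file and return the path."""
    path = snapshot_path(keyword, when or datetime.now(), data_dir)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
    return path
```

In `main`, a call such as `save_snapshot(wd, str_data)` could stand in for `saveDataToFile(wd, str_data)`; comparing successive snapshots is then left to a later stage of the monitoring pipeline.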
    
