Scraping China Judgements Online (裁判文书网) with Selenium

Author: NJUNLP | Published 2018-08-12 21:24

    1. Abstract

    In the age of artificial intelligence, legal documents are only one of many sources of massive data, but the data they provide stands out for its volume, breadth of coverage, influence, and timeliness. This article therefore scrapes a number of legal documents from China Judgements Online, in the hope of offering a little inspiration to readers who enjoy web crawling.


    2. Environment

    1. PyCharm
    2. Python 3.6
    3. selenium
    4. lxml

    3. Approach

    (1) The home page is http://wenshu.court.gov.cn/. It lists five categories of legal documents, so the URLs of those five categories serve as the crawler's seed URLs.
    (2) China Judgements Online does not allow crawlers, so some anti-crawler countermeasures are needed (building a proxy IP pool, rotating the User-Agent); a proxy-pool sketch follows this list.
    (3) We must use Selenium to simulate a browser visiting the site, otherwise no data can be retrieved. This point is very important!
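
    The implementation below only rotates the User-Agent; the proxy IP pool mentioned in (2) is not shown there. The following is a minimal sketch, assuming requests and a hand-maintained list of proxies; the addresses are placeholders, not working proxies.

    import random
    import requests

    # Placeholder proxy addresses -- replace them with proxies you actually control.
    PROXY_POOL = [
       "http://127.0.0.1:8080",
       "http://127.0.0.1:8081",
    ]

    def fetch_with_proxy(url, headers):
       # Route a single request through a randomly chosen proxy from the pool.
       proxy = random.choice(PROXY_POOL)
       return requests.get(url, headers=headers,
                           proxies={"http": proxy, "https": proxy},
                           timeout=10)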

    4. Implementation

    import requests
    from lxml import etree
    from selenium import webdriver
    import time
    import lxml.html
    import random
    
    #head = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
    UA_LIST = [
       "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
       "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
       "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
       "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
       "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
       "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
       "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
       "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
       "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
       "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24"
    ]
    
    headers = {
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
       'Accept-Encoding': 'gzip, deflate',
       'Accept-Language': 'zh-CN,zh;q=0.9',
       'Connection': 'keep-alive',
       'Host': 'wenshu.court.gov.cn',
       'User-Agent': random.choice(UA_LIST)
    }
    def downloadHtml(url):
       # Fetch a page with requests and return its text, or "" on failure.
       try:
           r = requests.get(url, headers=headers)
           r.raise_for_status()
           r.encoding = r.apparent_encoding
           return r.text
       except Exception:
           return ""
    
    def parse():
       # Collect the detail-page links of the five document categories.
       url_list = []
       url = "http://wenshu.court.gov.cn"
       response = downloadHtml(url)
       html = etree.HTML(response)
       # Links of the five document categories in the navigation bar.
       urls = html.xpath("//*[@id='nav']/ul/li/a[@target='_blank']/@href")
       for ul0 in range(len(urls)):
           fullurl = url + urls[ul0]
           # Drive Chrome so the JavaScript-rendered result list gets loaded.
           driver = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')
           driver.get(fullurl)
           time.sleep(20)
           html = driver.page_source
           doc = lxml.html.fromstring(html)
           # Relative link of the first document in the result list.
           url1 = doc.xpath("//*[@id='resultList']/div/table/tbody/tr[1]/td/div/a[2]/@href")
           url_list = url_list + url1
           driver.close()
       return url_list
    
    def URL():
       # Turn the relative links returned by parse() into absolute URLs.
       urlList = []
       url = "http://wenshu.court.gov.cn"
       base = parse()
       for i in range(len(base)):
           new_url = url + base[i]
           urlList.append(new_url)
       return urlList
    
    def download(url):
       # Open one judgement page and print its title and body text.
       driver = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')
       driver.get(url)
       time.sleep(10)
       html = driver.page_source
       doc = lxml.html.fromstring(html)
       driver.close()
       try:
           title = doc.xpath("//*[@id='contentTitle']/text()")
           content = doc.xpath("//*[@id='DivContent']/div/text()")
           # Pair each title fragment with the corresponding paragraph of body text.
           for title_i, content_i in zip(title, content):
               item = {
                   'title': title_i,
                   'content': content_i
               }
               print(item)
       except Exception:
           print("")
    
    if __name__ == '__main__':
       urlss = URL()
       for i in range(len(urlss)):
           download(urlss[i])
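
    The fixed time.sleep() calls above are simple but brittle: if a page takes longer to render, the XPath queries return empty lists. A possible alternative is an explicit wait on the result list; the sketch below assumes the same resultList element ID and the Selenium 3 style of passing the chromedriver path directly.

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC

    def get_rendered_html(url):
       # Open the page and wait up to 30 seconds for the result list to be rendered.
       driver = webdriver.Chrome(r'C:\chromedriver_win32\chromedriver.exe')
       try:
           driver.get(url)
           WebDriverWait(driver, 30).until(
               EC.presence_of_element_located((By.ID, "resultList"))
           )
           return driver.page_source
       finally:
           driver.close()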
    
    

    5. Results

    6. Summary

    There was a lot to learn this time: Selenium involves quite a few modules, and trying different approaches while scraping the data was very rewarding.
