Left-Hand R, Right-Hand Python Series: Multi-process/Multi-thread Data Scraping and Web Requests

Author: 天善智能 | Published 2017-12-22 10:50


Welcome to 天善智能, a vertical community dedicated to business intelligence (BI), artificial intelligence (AI), and big-data analysis and mining: learning, Q&A, and job hunting, all in one place!

Author of this post: 杜雨, a community expert at 天善智能.

天善智能 community: https://www.hellobi.com/


This installment looks at how to apply multi-process/multi-thread task handling at the web-request stage. Compared with plain file downloads, web requests raise two important issues. First, concurrent requests hit the target server harder, so the anti-scraping risk is correspondingly higher. Second, scraping has to capture each request's return value, and those return values must be assembled into a single relational table (a data frame). This differs from the binary file downloads of the previous post, where each task only had to execute its statement block and no return values needed to be collected.
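To make the second point concrete, here is a minimal Python sketch (mine, not from the original post) of the pattern every scheme below follows: each worker returns one page's records, and the caller gathers the return values into a single data frame. The fetch_page function and its placeholder rows are invented purely for illustration.

    import pandas as pd
    from multiprocessing import Pool

    def fetch_page(i):
        # In the real schemes below this would issue the HTTP request and run
        # the XPath extraction; here we fabricate one row per page.
        return pd.DataFrame({"page": [i], "job_item": ["placeholder"]})

    if __name__ == "__main__":
        with Pool(4) as pool:                             # four worker processes
            frames = pool.map(fetch_page, range(1, 11))   # collect return values
        mydata = pd.concat(frames, ignore_index=True)     # one relational table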

The R side uses RCurl + XML; the Python side uses urllib + lxml.

    library("RCurl")

    library("XML")

    library("magrittr")

Scheme 1: a hand-rolled explicit loop.

    Getjobs <- function(){
        fullinfo <- data.frame()
        headers  <- c("Referer"    = "https://www.hellobi.com/jobs/search",
                      "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36")
        d      <- debugGatherer()
        handle <- getCurlHandle(debugfunction = d$update, followlocation = TRUE, cookiefile = "", verbose = TRUE)
        i <- 0
        while (i < 11) {
            i   <- i + 1
            url <- sprintf("https://www.hellobi.com/jobs/search?page=%d", i)
            tryCatch({
                content    <- getURL(url, .opts = list(httpheader = headers), .encoding = "utf-8", curl = handle) %>% htmlParse()
                job_item   <- content %>% xpathSApply("//div[@class='job_item_middle pull-left']/h4/a", xmlValue)
                job_links  <- content %>% xpathSApply("//div[@class='job_item_middle pull-left']/h4/a", xmlGetAttr, "href")
                job_info   <- content %>% xpathSApply("//div[@class='job_item_middle pull-left']/h5", xmlValue, trim = TRUE)
                job_salary <- content %>% xpathSApply("//div[@class='job_item-right pull-right']/h4", xmlValue, trim = TRUE)
                job_origin <- content %>% xpathSApply("//div[@class='job_item-right pull-right']/h5", xmlValue, trim = TRUE)
                myresult   <- data.frame(job_item, job_links, job_info, job_salary, job_origin)
                fullinfo   <- rbind(fullinfo, myresult)
                cat(sprintf("Page %d scraped!", i), sep = "\n")
            }, error = function(e){
                cat(sprintf("Page %d failed!", i), sep = "\n")
            })
        }
        cat("all pages are OK!!!")
        return(fullinfo)
    }

    system.time(mydata1 <- Getjobs())

The whole run took 11.03 seconds.

Scheme 2: a vectorized function.

    Getjobs <- function(i){
        headers <- c("Referer"    = "https://www.hellobi.com/jobs/search",
                     "User-Agent" = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36")
        d      <- debugGatherer()
        handle <- getCurlHandle(debugfunction = d$update, followlocation = TRUE, cookiefile = "", verbose = TRUE)
        url    <- sprintf("https://www.hellobi.com/jobs/search?page=%d", i)
        content    <- getURL(url, .opts = list(httpheader = headers), .encoding = "utf-8", curl = handle) %>% htmlParse()
        job_item   <- content %>% xpathSApply("//div[@class='job_item_middle pull-left']/h4/a", xmlValue)
        job_links  <- content %>% xpathSApply("//div[@class='job_item_middle pull-left']/h4/a", xmlGetAttr, "href")
        job_info   <- content %>% xpathSApply("//div[@class='job_item_middle pull-left']/h5", xmlValue, trim = TRUE)
        job_salary <- content %>% xpathSApply("//div[@class='job_item-right pull-right']/h4", xmlValue, trim = TRUE)
        job_origin <- content %>% xpathSApply("//div[@class='job_item-right pull-right']/h5", xmlValue, trim = TRUE)
        data.frame(job_item, job_links, job_info, job_salary, job_origin)
    }

    system.time(mydata <- plyr::ldply(1:10, Getjobs, .progress = "text"))

The whole run took 9.07 seconds.

Scheme 3: a multi-process package.

    system.time({
        library("doParallel")
        library("foreach")
        cl <- makeCluster(4)
        registerDoParallel(cl)
        mydata2 <- foreach(i         = 1:10,
                           .combine  = rbind,
                           .packages = c("RCurl", "XML", "magrittr")
                           ) %dopar% Getjobs(i)
        stopCluster(cl)
    })

Total elapsed time: 5.14 seconds.
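(Worth noting: four workers yield roughly a 2x speed-up over Scheme 1, not 4x. Plausibly the fixed cost of spinning up the cluster and loading RCurl/XML/magrittr on every worker eats into the gain, and ten short requests are not enough work to keep four workers saturated.)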

This also explains why yesterday's multi-process PDF download showed no benefit at all. For a network-I/O-bound task, the download itself is constrained by bandwidth and takes so long (the PDF files average about 5 MB each) that it almost entirely swamps whatever time multiple processes could save.
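A rough back-of-envelope check (the bandwidth figure is my assumption, not from the post): at 5 MB/s of total bandwidth, ten 5 MB files amount to 50 MB and need about 10 seconds on the wire no matter how many processes share the link. Splitting the work across four processes cannot shrink those 10 seconds; it only overlaps the comparatively tiny per-request overheads.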


The Python version:

The Python examples use the urllib and lxml packages.

    from urllib.request import urlopen, Request
    import pandas as pd
    import numpy as np
    import time
    from lxml import etree

Scheme 1: scraping with an explicit loop.

    def getjobs():
        myresult = {
            "job_item":   [],
            "job_links":  [],
            "job_info":   [],
            "job_salary": [],
            "job_origin": []
        }
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36',
            'Referer': 'https://www.hellobi.com/jobs/search'
        }
        i = 0
        while i < 11:
            i += 1
            url = "https://www.hellobi.com/jobs/search?page={}".format(i)
            pagecontent = urlopen(Request(url, headers=header)).read().decode('utf-8')
            result = etree.HTML(pagecontent)
            myresult["job_item"].extend(result.xpath('//div[@class="job_item_middle pull-left"]/h4/a/text()'))
            myresult["job_links"].extend(result.xpath('//div[@class="job_item_middle pull-left"]/h4/a/@href'))
            myresult["job_info"].extend([text.xpath('string(.)').strip() for text in result.xpath('//div[@class="job_item_middle pull-left"]/h5')])
            myresult["job_salary"].extend(result.xpath('//div[@class="job_item-right pull-right"]/h4/span/text()'))
            myresult["job_origin"].extend(result.xpath('//div[@class="job_item-right pull-right"]/h5/span/text()'))
            time.sleep(1)
            print("Scraping page {}".format(i))
        print("everything is OK")
        return pd.DataFrame(myresult)

    if __name__ == "__main__":
        t0 = time.time()
        mydata1 = getjobs()
        t1 = time.time()
        total = t1 - t0
        print("Elapsed time: {}".format(total))

Total time was close to 19 seconds. (The code sleeps one second per page, so the estimated net scraping time is around 9 seconds.)

Scheme 2: multi-threaded scraping.

    import threading

    def executeThread(i):
        myresult = {
            "job_item":   [],
            "job_links":  [],
            "job_info":   [],
            "job_salary": [],
            "job_origin": []
        }
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36',
            'Referer': 'https://www.hellobi.com/jobs/search'
        }
        url = "https://www.hellobi.com/jobs/search?page={}".format(i)
        try:
            pagecontent = urlopen(Request(url, headers=header)).read().decode('utf-8')
            result = etree.HTML(pagecontent)
            myresult["job_item"].extend(result.xpath('//div[@class="job_item_middle pull-left"]/h4/a/text()'))
            myresult["job_links"].extend(result.xpath('//div[@class="job_item_middle pull-left"]/h4/a/@href'))
            myresult["job_info"].extend([text.xpath('string(.)').strip() for text in result.xpath('//div[@class="job_item_middle pull-left"]/h5')])
            myresult["job_salary"].extend(result.xpath('//div[@class="job_item-right pull-right"]/h4/span/text()'))
            myresult["job_origin"].extend(result.xpath('//div[@class="job_item-right pull-right"]/h5/span/text()'))
        except:
            pass
        # Each thread appends its page's rows to a shared CSV; only the thread
        # handling page 1 writes the header row.
        with open('D:/Python/File/hellolive.csv', 'a+') as f:
            pd.DataFrame(myresult).to_csv(f, index=False, header=(i == 1))

    def main():
        threads = []
        for i in range(1, 11):
            thread = threading.Thread(target=executeThread, args=(i,))
            threads.append(thread)
            thread.start()
        for i in threads:
            i.join()

    if __name__ == '__main__':
        t0 = time.time()
        main()
        t1 = time.time()
        total = t1 - t0
        print("Elapsed time: {}".format(total))

The multi-threaded version above took only about 1.64 seconds; the efficiency advantage over the single-threaded loop is unmistakable.
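A side note that is not in the original post: threads speed this up despite Python's GIL because the interpreter releases the GIL while a thread blocks on the network, so the ten urlopen calls overlap. One genuine hazard in the code above is that ten threads append to the same CSV concurrently, so chunks (and the page-1 header) can land in the file in arbitrary, possibly interleaved order. A minimal sketch of one common fix, serializing writes through a module-level threading.Lock (the csv_lock and append_rows names are mine):

    import os
    import threading

    csv_lock = threading.Lock()  # shared by all worker threads

    def append_rows(frame, path='D:/Python/File/hellolive.csv'):
        # Only one thread may write at a time, so rows never interleave;
        # the header is written only when the file does not exist yet.
        with csv_lock:
            frame.to_csv(path, mode='a', index=False,
                         header=not os.path.exists(path))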

Scheme 3: multi-process scraping.

    import multiprocessing
    from multiprocessing import Pool
    from urllib.request import urlopen, Request
    import pandas as pd
    import time
    from lxml import etree

    def executeThread(i):
        myresult = {
            "job_item":   [],
            "job_links":  [],
            "job_info":   [],
            "job_salary": [],
            "job_origin": []
        }
        header = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36',
            'Referer': 'https://www.hellobi.com/jobs/search'
        }
        url = "https://www.hellobi.com/jobs/search?page={}".format(i)
        try:
            pagecontent = urlopen(Request(url, headers=header)).read().decode('utf-8')
            result = etree.HTML(pagecontent)
            myresult["job_item"].extend(result.xpath('//div[@class="job_item_middle pull-left"]/h4/a/text()'))
            myresult["job_links"].extend(result.xpath('//div[@class="job_item_middle pull-left"]/h4/a/@href'))
            myresult["job_info"].extend([text.xpath('string(.)').strip() for text in result.xpath('//div[@class="job_item_middle pull-left"]/h5')])
            myresult["job_salary"].extend(result.xpath('//div[@class="job_item-right pull-right"]/h4/span/text()'))
            myresult["job_origin"].extend(result.xpath('//div[@class="job_item-right pull-right"]/h5/span/text()'))
        except:
            pass
        with open('D:/Python/File/hellolive.csv', 'a+') as f:
            pd.DataFrame(myresult).to_csv(f, index=False, header=(i == 1))

    def shell():
        # Multi-process pool sized to the machine's CPU count
        pool = Pool(multiprocessing.cpu_count())
        pool.map(executeThread, list(range(1, 11)))
        pool.close()
        pool.join()

    if __name__ == "__main__":
        # Start timing:
        t0 = time.time()
        shell()
        t1 = time.time()
        total = t1 - t0
        print("Elapsed time: {}".format(total))

The multi-process run likewise finishes in roughly 1.5 seconds. Because of how Windows spawns child processes (it cannot fork), this code cannot be run directly inside an interactive editor: put the multiprocessing code in a .py file and execute that file from cmd or PowerShell.

Today's examples show that for network-I/O-bound tasks, multi-threading and multi-processing really can raise throughput. But the faster you scrape, the more anti-scraping pressure you invite; under concurrency in particular, you need more thorough disguises, such as rotating random User-Agents and proxy IPs, to avoid being blocked early.
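As a concrete illustration of the random-UA idea (my sketch, not code from the post; the extra UA strings are arbitrary examples), each request can draw its headers from a small pool:

    import random
    from urllib.request import urlopen, Request

    # A small pool of User-Agent strings; in practice you would maintain a
    # larger list (and rotate proxy IPs in the same way).
    USER_AGENTS = [
        'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.79 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36',
        'Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0',
    ]

    def fetch(url):
        # Pick a fresh User-Agent for every request.
        header = {
            'User-Agent': random.choice(USER_AGENTS),
            'Referer': 'https://www.hellobi.com/jobs/search',
        }
        return urlopen(Request(url, headers=header)).read().decode('utf-8')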

Data for earlier installments in this series is available in the author's GitHub repository:

https://github.com/ljtyduyu/DataWarehouse/tree/master/File

