
R Crawler Essentials — httr + POST-Request Crawling (NetEase Cloud Classroom)

Author: Clariom | Published 2020-07-28 16:35

    In practice, different pages call for different scraping approaches. For a static page, the rvest package is enough; for data that a page loads dynamically, rvest may no longer be suitable, and you need a package like RCurl or httr that exposes a rich set of request parameters. This post focuses on httr. Although it is already much leaner than RCurl, it still contains many functions; for everyday crawling, GET and POST are the two you will use most.

    The case below is a typical POST-request crawl, so let's first look at the general shape of the POST() function: POST(url = NULL, config = list(), ..., body = NULL, encode = c("multipart", "form", "json", "raw"), handle = NULL). The most important arguments are config (request headers and cookies) and body (the query parameters). The arguments are documented as follows (a minimal worked example follows the list):

    • url: the URL of the page to retrieve
    • config: Additional configuration settings such as http authentication (authenticate), additional headers (add_headers), cookies (set_cookies) etc. See config for full details and list of helpers. Further named parameters, such as query, path, etc, passed on to modify_url. Unnamed parameters will be combined with config.
    • body: One of the following:
    • FALSE: No body. This is typically not used with POST, PUT, or PATCH, but can be useful if you need to send a bodyless request (like GET) with VERB().
    • NULL: An empty body
    • "": A length 0 body
    • upload_file("path/"): The contents of a file. The mime type will be guessed from the extension, or can be supplied explicitly as the second argument to upload_file()
    • A character or raw vector: sent as is in body. Use content_type to tell the server what sort of data you are sending.
    • A named list: See details for encode. (the most commonly used option)
    • encode: If the body is a named list, how should it be encoded? Can be one of form (application/x-www-form-urlencoded), multipart (multipart/form-data), or json (application/json). For "multipart", list elements can be strings or objects created by upload_file. For "form", elements are coerced to strings and escaped; use I() to prevent double-escaping. For "json", parameters are automatically "unboxed" (i.e. length 1 vectors are converted to scalars). To preserve a length 1 vector as a vector, wrap in I(). For "raw", either a character or raw vector. You'll need to make sure to set the content_type() yourself.
    • handle: The handle to use with this request. If not supplied, will be retrieved and reused from the handle_pool based on the scheme, hostname and port of the url. By default httr reuses a handle for requests to the same scheme/host/port combo. This substantially reduces connection time and ensures that cookies are maintained over multiple requests to the same host. See handle_pool for more details.
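
    Before moving on, here is a minimal worked sketch of how these arguments fit together. It posts against httpbin.org, a public echo service that reflects the request back, so none of it touches the actual target site:

    library(httr)
    # config: extra headers and a cookie; body: a named list; encode: JSON
    resp <- POST(url = "https://httpbin.org/post",
                 add_headers("user-agent" = "my-crawler/0.1"),
                 set_cookies(session = "abc123"),
                 body = list(pageIndex = 1, keyword = ""),
                 encode = "json")
    content(resp)$json   # the body exactly as the server received it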

    To better understand how httr::POST() can fetch such dynamically, asynchronously loaded pages, let's walk through NetEase Cloud Classroom (网易云课堂) as a worked example.

    The NetEase Cloud Classroom example

    Open NetEase Cloud Classroom and click into the programming-course page at https://study.163.com/category/480000003131009. DevTools shows a request to and response from that URL, so the first page's information can be fetched from it directly.


    On to page two. Ordinarily a page number is appended to the URL and requesting that URL returns the second page. On NetEase Cloud Classroom, clicking page 2/3/... does append a page number in the address bar, but the network panel (remember to refresh) shows no request carrying a page number; the URL actually requested never changes. In other words, URLs of the form https://study.163.com/category/480000003131009#/?p=N will not yield the course data.


    This points to asynchronous loading. Go back to the first page, open DevTools again and switch to the XHR panel. Clicking through pages 2, 3, 4, ... adds new entries to the panel on every click, among them studycourse.json, whose response body turns out to be exactly the course data. Comparing the different studycourse.json requests shows that their Request Payload parameters differ; each response carries 50 courses, and there are 13 such responses in all. So to collect the courses, POST the right parameter sets to https://study.163.com/p/search/studycourse.json and read the course data out of each response.


    Before the actual crawl, a quick tour of the parameters it involves: under General, the Request URL, Request Method and Status Code; under Response Headers, the Content-Type; under Request Headers, Accept, Content-Type, Cookie, Referer, User-Agent and so on; and finally every parameter under Form Data / Request Payload.

    • Request URL and Request Method under General determine, respectively, which resource you access and by which HTTP method.
    • Content-Type under Response Headers determines the encoding in which the returned data comes back.
    • Accept, Content-Type, Cookie, Referer, User-Agent and so on under Request Headers describe your client. Cookie in particular holds the login state your browser cached locally after signing in; sending it helps keep the crawler from being rejected repeatedly. Not every one of these parameters has to be submitted.
    • The Form Data / Request Payload block is the most critical of all: it is the locating information a POST request must submit. The browser shows many pages of courses, yet every page is fetched from the same address (the Request URL under General); what actually switches pages is this form/payload data. The httr mapping is sketched right after this list.
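
    In httr terms, those DevTools fields map onto the POST() call roughly like this (a schematic sketch, not meant to run as-is):

    # Request URL                     -> the url argument
    # Request Method: POST            -> use POST() rather than GET()
    # Request Headers (incl. Cookie)  -> add_headers(...) / set_cookies(...)
    # Form Data / Request Payload     -> the body argument (a named list)
    # content-type: application/json  -> encode = "json"
    POST(url, add_headers(.headers = myheaders), body = mypayload, encode = "json")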

    How to do it in practice?

    1. Load the required R packages; install any that are missing first.

    rm(list=ls())
    library("httr") 
    library("dplyr") 
    library("jsonlite")
    library("curl")
    library("magrittr")
    library("rlist")
    library("pipeR")
    library("plyr")
    

    2. Build the URL, read off the XHR panel.

    url <- c('https://study.163.com/p/search/studycourse.json')
    

    3. Build the request headers, filled in from the Request Headers section of the XHR panel.

    
    mycookie <- 'EDUWEBDEVICE=2558e3af159d412cbd77e1bd4d2ac2b2; hb_MA-BFF5-63705950A31C_source=www.google.com; UM_distinctid=172b723649a9f8-0656102adeb43b-d373666-144000-172b723649b942; EDU-YKT-MODULE_GLOBAL_PRIVACY_DIALOG=true; STUDY_NOT_SHOW_PROMOTION_WIN=true; eds_utm=eyJjIjoiIiwiY3QiOiIiLCJpIjoiIiwibSI6IiIsInMiOiIiLCJ0IjoiIn0=|aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8=; __utmc=129633230; __utmz=129633230.1592444560.4.4.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); sideBarPost=891; NTESSTUDYSI=858bed85fc9541a18c43cd5a6934f38e; NTES_YD_SESS=1tT526KyasdDFqzud7RGk5jej6MFRxOISQfMF1iNn3v_7LHE7Muc9bTJyJAttj.joOdOzfye0L5uGhlKLcFiJ1eQEHvs_xxxUv2aQbafpcbS6hsTMsDdEfzDYOYnYNE2JhmQJ_b6EjujeMBPiY2p_uAiwiZBBAZ9oXs3aZVXgDz4Ml6MBD1fU.BmguhUJap3SRPbGtT.47UnGi3dnfLx7kKzWXRNJ42La4HlM7147KO_x; NTES_YD_PASSPORT=JFDBLA68q0IasVpovWl2uu.D2kvzuFwobwXjKpbdMsw_Wh3VWELSHa6OxOp55IPIQtgtekxnBhmK0HdF9PDQRC7OCZxa2Gn.qoFKRMmPKdxg0hVqvtGpPZo857IgjcbxNCJrffK9DY58Nw8OhoxGUKm7GIS1cSfY3FKlF_TkZSVADf5877_.SzHE5ImYW3nhuQ1V61_vI1NX8v5QQxL2GSO4e; S_INFO=1592451940|0|3&80##|13120412092; P_INFO=13120412092|1592451940|1|study|00&99|null&null&null#bej&null#10#0#0|&0|null|13120412092; STUDY_INFO="yd.6b522ed6bfe94ee3a@163.com|8|1415444383|1592451940634"; STUDY_SESS="fxEsQC5LvlYHX4GhzCnnlTl2iio0a1fDsRiqpj5hqzQh5bTFReRt55rb+vnOnEZ45V/nvYjstqkjGmTNU344+VHBcA9uP0xzObA9G8ot0BOrnyy1VnW5ZKxb44cf+3ZGTda5t/QjAY20KOE0EF9+TP8RDwCBhi8apnA+E128sPALhur2Nm2wEb9HcEikV+3FTI8+lZKyHhiycNQo+g+/oA=="; STUDY_PERSIST="+GsuXMNai/WHdRbupv7eEeLokfCkeuarSAaYVuDIvpH9hK3yadTjOawfrSa+uwNez0VFU4i3ndRj9W9omWbyCuPHHsek+Esh2RBGBgwayFaMnccfeJrtvHow3DYivsfP8diGmo2uGQEZemAqCA5us1F3KV8BqtsUDrCO1ITmM6Zt5913n1CKxdhKmqhyOdXoJ5pxN8si2bAj3KSQtVtAP/OnSEP5aavbVEpleM+fnYXZgpjCC7Iso4RP9U87vJE8LtaQzUT1ovP2MqtW5+L3Hw+PvH8+tZRDonbf7gEH7JU="; NETEASE_WDA_UID=1415444383#|#1581733410145; NTES_STUDY_YUNXIN_ACCID=s-1415444383; NTES_STUDY_YUNXIN_TOKEN=73db624b15dc91b405d838a2433bafc9; __utma=129633230.575204015.1592210449.1592445051.1592458754.6; utm=eyJjIjoiIiwiY3QiOiIiLCJpIjoiIiwibSI6IiIsInMiOiIiLCJ0IjoiIn0=|aHR0cHM6Ly9zdHVkeS4xNjMuY29tL2NhdGVnb3J5LzQ4MDAwMDAwMzEzMTAwOQ==; CNZZDATA1272960468=716509904-1592205743-https%253A%252F%252Fwww.google.com%252F%7C1592461992; STUDY_UUID=d9689752-da3c-43d3-9c17-471aefab6636; __utmb=129633230.22.9.1592463486945'
    # Compare across requests and keep only the parameters that stay constant
    # (the cookie above is session-specific: replace it with your own)
    myheaders <- c('accept' ='application/json',
                   'accept-encoding' = 'gzip, deflate, br',
                   'accept-language' = 'zh-CN,zh;q=0.9',
                   'content-type' = 'application/json',
                   'edu-script-token' = '858bed85fc9541a18c43cd5a6934f38e',
                   'origin' = 'https://study.163.com',
                   'referer' = 'https://study.163.com/category/480000003131009',
                   'sec-fetch-dest' = 'empty',
                   'sec-fetch-mode' = 'cors',
                   'sec-fetch-site' = 'same-origin',
                   'user-agent' = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
                   'cookie' = mycookie)
    

    4. Build the request payload from the Request Payload section. Careful comparison shows that pageIndex and relativeOffset vary in a regular way across pages: for page i, "pageIndex" = i and "relativeOffset" = 50*(i-1). The code below is the payload for page one; a reusable helper is sketched right after it.

    mypayload <- list("pageIndex"= 1,
                      "pageSize"= 50,
                      "relativeOffset"= 0,
                      "frontCategoryId"= "480000003131009",
                      "searchTimeType"= -1,
                      "orderType"= 50,
                      "priceType"= -1,
                      "activityId"= 0,
                      "keyword"= "")
    
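    Since only pageIndex and relativeOffset change from page to page, the payload can be wrapped in a small helper (build_payload() is a hypothetical convenience, not in the original post) and reused in the loop later:

    # Hypothetical helper: payload for page i (50 courses per page)
    build_payload <- function(i) {
      list("pageIndex"       = i,
           "pageSize"        = 50,
           "relativeOffset"  = 50 * (i - 1),
           "frontCategoryId" = "480000003131009",
           "searchTimeType"  = -1,
           "orderType"       = 50,
           "priceType"       = -1,
           "activityId"      = 0,
           "keyword"         = "")
    }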

    5. Run page one with httr's POST(). Which verb function to use is dictated by the Request Method under General (here, POST). The general shape of the call:

    POST(url,
         add_headers(.headers = the target page's request headers),
         set_cookies(.cookies = your own cookie),
         body = the Form Data / Request Payload parameters,
         encode = "multipart" / "form" / "json" / "raw",
         timeout(maximum request time in seconds),
         use_proxy(proxy IP), ...)
    
    # POST requests are comparatively involved: the query parameters must be carried
    # in the request body, and the body must be encoded in a declared format
    # (the content-type in the request headers).
    # Four common encodings (the first two are the most frequent):
    # application/x-www-form-urlencoded —— form
    # application/json                  —— json
    # multipart/form-data               —— multipart
    # text/xml                          —— raw (you must set content_type() yourself)
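
    To see concretely what encode changes, you can echo a request off httpbin.org (a public test service; this sketch is not part of the original workflow) and inspect the content-type the server received:

    r_form <- POST("https://httpbin.org/post", body = list(a = 1), encode = "form")
    content(r_form)$headers$`Content-Type`   # "application/x-www-form-urlencoded"
    r_json <- POST("https://httpbin.org/post", body = list(a = 1), encode = "json")
    content(r_json)$headers$`Content-Type`   # "application/json"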
    

    With the POST() usage settled, make the real request; the server responds with the course data in JSON form.

    response <- POST(url = url, add_headers(.headers = myheaders), body = mypayload, encode = "json", verbose())
    # The parsed content holds 4 lists; take the 3rd, then its 2nd element: the 50
    # courses for this page (still a list). toJSON()/fromJSON() flattens it into a data frame.
    result <- response %>% content() %>% `[[`(3) %>% `[[`(2) %>% toJSON() %>% fromJSON(simplifyDataFrame = TRUE)
    colnames(result)  # see which fields are available
    usefulname <- c("productId","courseId","productName","lectorName","provider","score","scoreLevel","learnerCount","originalPrice","discountPrice","discountRate","description")
    result <- result %>% select(all_of(usefulname))  # all_of() avoids dplyr's ambiguity note for external vectors
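
    Before parsing a response like this, it is worth confirming the request actually succeeded; httr's status helpers make that a one-liner (a small defensive sketch):

    stop_for_status(response)          # errors out on any non-2xx status
    status_code(response)              # should be 200
    headers(response)$`content-type`   # should contain "application/json"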
    

    The first page comes back as 50 course records:


    6. Fetch all the pages, 13 in total, with a loop.

    myfullresult <- list()
    for (i in 1:13){
      print(paste0("Scraping page ", i))
      mypayload <- list("pageIndex"= i,
                        "pageSize"= 50,
                        "relativeOffset"= 50*(i-1),
                        "frontCategoryId"= "480000003131009",
                        "searchTimeType"= -1,
                        "orderType"= 50,
                        "priceType"= -1,
                        "activityId"= 0,
                        "keyword"= "")
      web <- POST(url = url,add_headers(.headers =myheaders),body = mypayload,encode="json",verbose())
      myresult<-web %>% content() %>% `[[`(3) %>% `[[`(2) 
      myfullresult<-c(myfullresult,myresult)
    }
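
    The loop above fires its 13 requests back to back. A slightly gentler variant (a sketch reusing the hypothetical build_payload() from step 4) pauses between requests and aborts on a bad response:

    myfullresult <- list()
    for (i in 1:13) {
      web <- POST(url = url, add_headers(.headers = myheaders),
                  body = build_payload(i), encode = "json")
      stop_for_status(web)   # abort if the server returns a non-2xx status
      myfullresult <- c(myfullresult, content(web)[[3]][[2]])
      Sys.sleep(1)           # be polite: wait a second between requests
    }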
    

    7. Finally, tidy the data and write it out locally.

    # The scraped data is a list; convert it to a data frame and keep the chosen columns.
    mydata <- do.call(rbind, myfullresult) %>% as.data.frame() %>% select(all_of(usefulname))
    # mydata is a data frame, but each column is still a list (the raw data contains
    # many NULLs); every NULL must become NA before the columns can be flattened.
    
    # Replace NULL values cell by cell
    for (j in seq_along(mydata)){
      for (i in 1:nrow(mydata)){
        if(is.null(mydata[i,j][[1]])){
          mydata[i,j][[1]]=NA
        }
      }
    }
    
    # Flatten every list column into an atomic vector
    for (i in usefulname){
      mydata[[i]]<-mydata[[i]] %>% unlist()
    }
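
    The two loops above work cell by cell; an equivalent single pass (a base-R sketch) maps NULL to NA and flattens each column in one go:

    # One-pass alternative: NULL -> NA, then flatten each list column
    mydata[] <- lapply(mydata, function(col)
      unlist(lapply(col, function(x) if (is.null(x)) NA else x)))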
    
    # De-duplicate and save
    mydata<-unique(mydata)
    write.csv(mydata, file ="course.csv")
    

    All told, 627 course records are collected.


    To wrap up, the complete code in one place:

    rm(list=ls())
    library("httr") 
    library("dplyr") 
    library("jsonlite")
    library("curl")
    library("magrittr")
    library("rlist")
    library("pipeR")
    library("plyr")
    
    # Build the URL, read off the XHR panel
    url <- c('https://study.163.com/p/search/studycourse.json')
    # Build the request headers from the Request Headers section
    mycookie <- 'EDUWEBDEVICE=2558e3af159d412cbd77e1bd4d2ac2b2; hb_MA-BFF5-63705950A31C_source=www.google.com; UM_distinctid=172b723649a9f8-0656102adeb43b-d373666-144000-172b723649b942; EDU-YKT-MODULE_GLOBAL_PRIVACY_DIALOG=true; STUDY_NOT_SHOW_PROMOTION_WIN=true; eds_utm=eyJjIjoiIiwiY3QiOiIiLCJpIjoiIiwibSI6IiIsInMiOiIiLCJ0IjoiIn0=|aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8=; __utmc=129633230; __utmz=129633230.1592444560.4.4.utmcsr=google|utmccn=(organic)|utmcmd=organic|utmctr=(not%20provided); sideBarPost=891; NTESSTUDYSI=858bed85fc9541a18c43cd5a6934f38e; NTES_YD_SESS=1tT526KyasdDFqzud7RGk5jej6MFRxOISQfMF1iNn3v_7LHE7Muc9bTJyJAttj.joOdOzfye0L5uGhlKLcFiJ1eQEHvs_xxxUv2aQbafpcbS6hsTMsDdEfzDYOYnYNE2JhmQJ_b6EjujeMBPiY2p_uAiwiZBBAZ9oXs3aZVXgDz4Ml6MBD1fU.BmguhUJap3SRPbGtT.47UnGi3dnfLx7kKzWXRNJ42La4HlM7147KO_x; NTES_YD_PASSPORT=JFDBLA68q0IasVpovWl2uu.D2kvzuFwobwXjKpbdMsw_Wh3VWELSHa6OxOp55IPIQtgtekxnBhmK0HdF9PDQRC7OCZxa2Gn.qoFKRMmPKdxg0hVqvtGpPZo857IgjcbxNCJrffK9DY58Nw8OhoxGUKm7GIS1cSfY3FKlF_TkZSVADf5877_.SzHE5ImYW3nhuQ1V61_vI1NX8v5QQxL2GSO4e; S_INFO=1592451940|0|3&80##|13120412092; P_INFO=13120412092|1592451940|1|study|00&99|null&null&null#bej&null#10#0#0|&0|null|13120412092; STUDY_INFO="yd.6b522ed6bfe94ee3a@163.com|8|1415444383|1592451940634"; STUDY_SESS="fxEsQC5LvlYHX4GhzCnnlTl2iio0a1fDsRiqpj5hqzQh5bTFReRt55rb+vnOnEZ45V/nvYjstqkjGmTNU344+VHBcA9uP0xzObA9G8ot0BOrnyy1VnW5ZKxb44cf+3ZGTda5t/QjAY20KOE0EF9+TP8RDwCBhi8apnA+E128sPALhur2Nm2wEb9HcEikV+3FTI8+lZKyHhiycNQo+g+/oA=="; STUDY_PERSIST="+GsuXMNai/WHdRbupv7eEeLokfCkeuarSAaYVuDIvpH9hK3yadTjOawfrSa+uwNez0VFU4i3ndRj9W9omWbyCuPHHsek+Esh2RBGBgwayFaMnccfeJrtvHow3DYivsfP8diGmo2uGQEZemAqCA5us1F3KV8BqtsUDrCO1ITmM6Zt5913n1CKxdhKmqhyOdXoJ5pxN8si2bAj3KSQtVtAP/OnSEP5aavbVEpleM+fnYXZgpjCC7Iso4RP9U87vJE8LtaQzUT1ovP2MqtW5+L3Hw+PvH8+tZRDonbf7gEH7JU="; NETEASE_WDA_UID=1415444383#|#1581733410145; NTES_STUDY_YUNXIN_ACCID=s-1415444383; NTES_STUDY_YUNXIN_TOKEN=73db624b15dc91b405d838a2433bafc9; __utma=129633230.575204015.1592210449.1592445051.1592458754.6; utm=eyJjIjoiIiwiY3QiOiIiLCJpIjoiIiwibSI6IiIsInMiOiIiLCJ0IjoiIn0=|aHR0cHM6Ly9zdHVkeS4xNjMuY29tL2NhdGVnb3J5LzQ4MDAwMDAwMzEzMTAwOQ==; CNZZDATA1272960468=716509904-1592205743-https%253A%252F%252Fwww.google.com%252F%7C1592461992; STUDY_UUID=d9689752-da3c-43d3-9c17-471aefab6636; __utmb=129633230.22.9.1592463486945'
    # Compare across requests and keep only the parameters that stay constant
    # (the cookie above is session-specific: replace it with your own)
    myheaders <- c('accept' ='application/json',
                   'accept-encoding' = 'gzip, deflate, br',
                   'accept-language' = 'zh-CN,zh;q=0.9',
                   'content-type' = 'application/json',
                   'edu-script-token' = '858bed85fc9541a18c43cd5a6934f38e',
                   'origin' = 'https://study.163.com',
                   'referer' = 'https://study.163.com/category/480000003131009',
                   'sec-fetch-dest' = 'empty',
                   'sec-fetch-mode' = 'cors',
                   'sec-fetch-site' = 'same-origin',
                   'user-agent' = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
                   'cookie' = mycookie)
    # Build the request payload as a named list (note the regularly varying pageIndex/relativeOffset)
    mypayload <- list("pageIndex"= 1,
                      "pageSize"= 50,
                      "relativeOffset"= 0,
                      "frontCategoryId"= "480000003131009",
                      "searchTimeType"= -1,
                      "orderType"= 50,
                      "priceType"= -1,
                      "activityId"= 0,
                      "keyword"= "")
    
    # Run page one with httr's POST()
    response <- POST(url = url, add_headers(.headers = myheaders),body = mypayload, encode="json",verbose())
    # The parsed content holds 4 lists; take the 3rd, then its 2nd element: the 50
    # courses for this page (still a list). toJSON()/fromJSON() flattens it into a data frame.
    result <- response %>% content()  %>%`[[`(3) %>% `[[`(2) %>% toJSON() %>% fromJSON(simplifyDataFrame=TRUE)
    colnames(result)  # see which fields are available
    usefulname <- c("productId","courseId","productName","lectorName","provider","score","scoreLevel","learnerCount","originalPrice","discountPrice","discountRate","description")
    result <- result %>% select(all_of(usefulname))
    
    # Fetch all 13 pages in a loop
    myfullresult<-list()
    for (i in 1:13){
      print(paste0("Scraping page ", i))
      mypayload <- list("pageIndex"= i,
                        "pageSize"= 50,
                        "relativeOffset"= 50*(i-1),
                        "frontCategoryId"= "480000003131009",
                        "searchTimeType"= -1,
                        "orderType"= 50,
                        "priceType"= -1,
                        "activityId"= 0,
                        "keyword"= "")
      web <- POST(url = url,add_headers(.headers =myheaders),body = mypayload,encode="json",verbose())
      myresult<-web %>% content() %>% `[[`(3) %>% `[[`(2) 
      myfullresult<-c(myfullresult,myresult)
    }
    
    # The scraped data is a list; convert it to a data frame and keep the chosen columns.
    mydata <- do.call(rbind, myfullresult) %>% as.data.frame() %>% select(all_of(usefulname))
    # mydata is a data frame, but each column is still a list (the raw data contains
    # many NULLs); every NULL must become NA before the columns can be flattened.
    
    # Replace NULL values cell by cell
    for (j in seq_along(mydata)){
      for (i in 1:nrow(mydata)){
        if(is.null(mydata[i,j][[1]])){
          mydata[i,j][[1]]=NA
        }
      }
    }
    
    # Flatten every list column into an atomic vector
    for (i in usefulname){
      mydata[[i]]<-mydata[[i]] %>% unlist()
    }
    
    # De-duplicate and save
    mydata<-unique(mydata)
    write.csv(mydata, file ="course.csv")
    

    This installment showed how to do POST-request crawling with httr; the next one will cover GET-request crawling!

    Reference: https://cloud.tencent.com/developer/article/1092893

    For more content, follow the WeChat official account "YJY技能修炼".

    Previous posts
    A handy use of R crawlers at work
    R Crawler Essentials — a first look at HTML and CSS
    R Crawler Essentials — static pages vs. dynamic pages
    R Crawler Essentials — using the rvest package
    R Crawler Essentials — CSS + SelectorGadget
    R Crawler Essentials — Chrome DevTools (F12)
    R Crawler Essentials — the HTTP protocol
