Back to Web Scraping, Embracing scrapy & splash: Scraping facebook p


Author: 吴祺育的笔记 | Published: 2018-12-21 18:45

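    The previous project wrapped up on July 14 and today is August 30, so a month and a half has gone by. In that time I did my second project and spent the whole month and a half learning web scraping.
    I wrote a bit over 400 lines of code, the most I have written for a single project so far.

    Even before I started the job I had a feeling I would end up doing something with crawlers, and sure enough (facepalm)...
    To be honest, back in school I resisted learning crawlers, scrapy in particular. scrapy looked like a bloated framework, and at that stage I had no project that needed it; requests and BS4 (BeautifulSoup 4) were more than enough. In my previous project I scraped subcategories from Wikipedia with exactly those two, fetched more than 1,000 pages, and the crawler kept running fine. That was all I needed.

    (Some of what follows can be skipped.)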


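    Update, 2018-12-11

    To everyone who opens with "How do I get this project to run?" or "The endpoint is wrong, how do I fix it?": if you ask that kind of question, send me a decent-sized red envelope and I will walk you through it. Call it paying for knowledge.
    But if you know nothing and expect something for nothing, go find someone else. I will not reply to those emails.

    Emails about concrete technical issues are welcome. A question like "How do I configure a proxy for splash-scrapy from inside China?" is one I will answer.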


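    About facebook

    The task my company gave me is to search facebook for a keyword and scrape every public post about that keyword, including the content, the time, and the people who commented on, liked, and shared it. The structure is as follows: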

    {
         "task": task name,
         "type": "facebook_post",
         "url": default None,
         "keyword": search keyword,
         "timestamp": timestamp of the crawl,
         "post_data":[{
                 "post_information":{
                     "post_from_user": post from user name,
                     "post_from_user_id": post from user id,
                     "post_id": post's id,
                     "post_type": "posts" | "photos" | "videos",
                     "post_content": post's content,
                     "post_link": post's article link url,
                     "post_time": post created time,
                     "post_likes_number": number of users who liked the post,
                     "post_shares_number": number of users who shared the post,
                     "post_comments_number": number of comments on the post
                 },
    
                 "post_likes":
                 [
                     {
                         "id": user's account id,
                         "name": user's name
                     },
                 ...],
                 
                 "post_shares":
                 [
                     {
                         "id": user's account id,
                         "name": user's name
                     },
                 ...],
                 
                 "post_comments":
                 [
                     {
                         "id": user's account id,
                         "name": user's name,
                         "comment_content": comment's content
                     },
                 ...]
                 },
         ...]    
      }
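
    As a rough illustration of the structure above, the sketch below builds an empty item and fills in one post. The helper names (new_task_item, new_post_entry) and the use of Unix seconds for the timestamp are assumptions made for the example; they are not taken from the original spider.

```python
import time


def new_task_item(task, keyword, url=None):
    """Return an empty top-level item; posts are appended to item["post_data"]."""
    return {
        "task": task,
        "type": "facebook_post",
        "url": url,
        "keyword": keyword,
        "timestamp": int(time.time()),  # assumption: Unix seconds
        "post_data": [],
    }


def new_post_entry(post_information):
    """Wrap one post's metadata together with empty like/share/comment lists."""
    return {
        "post_information": dict(post_information),
        "post_likes": [],
        "post_shares": [],
        "post_comments": [],
    }


item = new_task_item("demo_task", "3d printing")
post = new_post_entry({"post_id": "123", "post_type": "posts"})
post["post_likes"].append({"id": "42", "name": "Alice"})
item["post_data"].append(post)
```

    Keeping the three user lists separate from post_information mirrors the layout above, so the counts and the actual user lists can be filled in by different requests.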
    

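    A word about this structure. facebook offers an official API, the Graph API (you can try it out in the Graph API Explorer). It is very powerful: any public information you can see on the web page can also be requested through it. There are two limitations, though:

    1. Rate limit. I think it allows at most about 4,500 requests per day, but I no longer remember the exact number.
    2. Access token. The API requires a token, and the default one is short-lived: it expires after 2 hours and has to be requested again. Following the official procedure you can exchange it for a long-lived access token that is valid for 2 months. That was not enough for me, so I obtained a permanent access token. For the details, see:
      How to apply for a permanent facebook access token.
      For how to use the Graph API, refer to the official facebook documentation.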
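
    One simple way to stay under a daily rate limit is to count requests per calendar day and stop when the budget is used up. The sketch below is only an illustration: the DailyBudget class is hypothetical, and the default of 4,500 is the half-remembered figure from above, not a documented limit.

```python
import datetime


class DailyBudget:
    """Counts requests per calendar day and refuses once the limit is reached."""

    def __init__(self, limit=4500):  # assumption: figure quoted from memory above
        self.limit = limit
        self.day = None
        self.used = 0

    def try_spend(self, today=None):
        """Return True and count one request, or False if today's budget is gone."""
        today = today or datetime.date.today()
        if today != self.day:  # a new day resets the counter
            self.day = today
            self.used = 0
        if self.used >= self.limit:
            return False
        self.used += 1
        return True
```

    A crawler would call try_spend() before each API request and pause until the next day when it returns False.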

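    Techniques used in this crawler

    About the scrapy framework

    It took me two or three days just to get familiar with scrapy, since I really had never touched it before, and those were the most painful days of the project. Still, the best way to learn a framework is to read its official documentation.
    For large crawling projects you really should go all in on scrapy. As for lessons learned, two points are worth spelling out:

    1. Know whether a callback should return or yield. Bugs caused by mixing up return and yield are a real headache.
    2. Understand your response.status. Knowing what each status code means helps a great deal when debugging.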
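
    Both points can be shown in plain Python without scrapy. In the sketch below, parse_buggy mixes return and yield: because the function contains a yield it is a generator, so the returned list is silently dropped. describe_status is a hypothetical helper that groups status codes in the way I find useful when debugging a crawler; the labels are my own, not part of scrapy.

```python
def parse_ok(post_ids):
    """A callback that yields one item per post."""
    for post_id in post_ids:
        yield {"post_id": post_id}


def parse_buggy(post_ids):
    """Looks harmless, but the returned list never reaches the caller."""
    if not post_ids:
        # This function is a generator (it contains a yield below),
        # so this value is discarded when the generator is iterated.
        return [{"post_id": None}]
    for post_id in post_ids:
        yield {"post_id": post_id}


def describe_status(status):
    """Rough grouping of HTTP status codes for debugging (labels are illustrative)."""
    if 200 <= status < 300:
        return "ok"
    if 300 <= status < 400:
        return "redirect"
    if status in (401, 403):
        return "auth or blocked"
    if status == 429:
        return "rate limited"
    if 400 <= status < 500:
        return "client error"
    if 500 <= status < 600:
        return "server error"
    return "unknown"
```

    Iterating parse_buggy([]) produces nothing at all, which is exactly the kind of silent failure that makes return/yield mix-ups hard to track down.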

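    Code flow

    1. Login. Request the login page and log in with scrapy.FormRequest. If the login fails, the crawler exits and the cause has to be checked by hand (facebook asks for phone verification when a login looks abnormal).
    2. Build the search URL for the keyword and go straight to the public posts page. Use splash to render the page's JS, which returns the first few posts shown on the first page. Then open the developer tools with F12, study the data requests fired while scrolling down, read the page source, and construct those data requests yourself. This is equivalent to scrolling down over and over to load new posts. (More on this later.)
    3. Extract each post's ID, use the Graph API to request the names and IDs of the users who liked, commented on, and shared each post, and return the item.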
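
    The URL building in steps 2 and 3 can be sketched as follows. Treat it as an illustration under assumptions: the search path is inferred from the ref_path field of the AJAX URL captured later in this post, the edge names (likes, comments, sharedposts) are how I recall the Graph API at the time, and the function names are hypothetical. facebook may have changed all of this since.

```python
from urllib.parse import quote, urlencode

# Assumption: path pattern inferred from the captured ref_path value.
SEARCH_TEMPLATE = "https://www.facebook.com/search/str/%s/stories-keyword/stories-public"
GRAPH_ROOT = "https://graph.facebook.com"
# Assumption: edge names as exposed by the Graph API at the time of writing.
POST_EDGES = ("likes", "comments", "sharedposts")


def build_search_url(keyword):
    """Join words with '+' and percent-encode the result, e.g. '3d printing' -> '3d%2Bprinting'."""
    joined = "+".join(keyword.split())
    return SEARCH_TEMPLATE % quote(joined, safe="")


def build_edge_url(post_id, edge, access_token, limit=100):
    """Build the Graph API URL for one edge of a post."""
    if edge not in POST_EDGES:
        raise ValueError("unknown edge: %r" % edge)
    query = urlencode({"access_token": access_token, "limit": limit})
    return "%s/%s/%s?%s" % (GRAPH_ROOT, post_id, edge, query)
```

    For example, build_search_url("3d printing") gives the public posts search page for that keyword, and build_edge_url(post_id, "likes", token) gives the URL to request for the users who liked a post.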

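    Handling IP proxies

    This time the code ran directly on an Amazon server, so I did not have to worry much about this. When testing locally, though, do not assume that rotating IPs will keep you from being blocked. That is naive. Such a simple strategy may fool sites that do not require a login, but if you keep logging in to one account from ever-changing IPs, the account will be banned quickly, especially on foreign social networks like this one where personal privacy is involved.
    The right approach is to bind one IP to one account, crawl concurrently (I did not, because of how tasks are scheduled in this project), and use generous delays between requests. The API, of course, places no such demand on request speed, so go all out there. For plain web page scraping I suggest roughly one page every 30 seconds, and if you run into a tougher site you will have to simulate JS actions (more on that later).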
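
    A minimal sketch of the one-IP-per-account idea, in plain Python. The account names and proxy addresses are placeholders, and the helper functions are hypothetical. DOWNLOAD_DELAY and CONCURRENT_REQUESTS_PER_DOMAIN are real scrapy setting names, and "proxy" and "cookiejar" are request meta keys that scrapy's built-in middlewares understand; how they are wired into the spider is left out here.

```python
def pair_accounts_with_proxies(accounts, proxies):
    """Bind each account to exactly one proxy, refusing to share proxies."""
    if len(proxies) < len(accounts):
        raise ValueError("need at least one proxy per account")
    return dict(zip(accounts, proxies))


# Values for the project's settings.py; 30 s matches the pace suggested above.
CRAWL_SETTINGS = {
    "DOWNLOAD_DELAY": 30,
    "CONCURRENT_REQUESTS_PER_DOMAIN": 1,
}


def request_meta_for(account, pairing):
    """Meta dict to attach to every request made on behalf of one account."""
    return {
        "proxy": pairing[account],  # always the same exit IP for this account
        "cookiejar": account,       # separate cookie session per account
    }


pairing = pair_accounts_with_proxies(
    ["account_a", "account_b"],                            # placeholder accounts
    ["http://10.0.0.1:8080", "http://10.0.0.2:8080"],      # placeholder proxies
)
```

    Because the pairing is fixed, an account never shows up from a second IP, which is the behaviour that tends to trigger a ban.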

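    Logging in to facebook

    There is not much to say about the facebook login page itself. According to the official scrapy documentation, a single request fills in the form automatically. The code is as follows: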

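    # Inside the spider's login callback; needs `from scrapy.http import FormRequest`.
    # from_response pre-fills the form with id "login_form" found in the response.
    return FormRequest.from_response(
        response,
        url=self.login_url,
        formid="login_form",
        formdata={
            "email": self.login_user,
            "pass": self.login_pass,
        },
        callback=self.after_login,
    )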
    

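    The fields in formdata are the login username and password.
    For the crawler as a whole, the most important thing about login is keeping the session logged in on the server side, which means cookies must be passed along correctly with every request. I hit a pitfall here: a splash request silently drops some of the cookies, and then the logged-in state cannot be maintained. Many sites now show different content to logged-in and anonymous visitors, so this matters a lot, and the missing cookies have to be added back to the splash request.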
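
    On the Python side, adding the missing cookies back can be as simple as merging the cookies saved at login into whatever splash currently has, then turning the result into a Cookie header. This is a sketch under assumptions: the function names and cookie names are made up for the example, and letting splash's values win on conflict is my own choice, not something the original spider is known to do.

```python
def merge_cookies(login_cookies, splash_cookies):
    """Fill gaps in splash's cookies with the ones saved at login.

    Values already present on the splash side are kept, on the assumption
    that they are at least as fresh as the login-time values.
    """
    merged = dict(splash_cookies)
    for name, value in login_cookies.items():
        merged.setdefault(name, value)
    return merged


def to_cookie_header(cookies):
    """Serialize a name -> value mapping into a Cookie header value."""
    return "; ".join("%s=%s" % (name, value) for name, value in cookies.items())


# Hypothetical cookie names, for illustration only.
saved_at_login = {"session": "abc", "uid": "1000"}
seen_by_splash = {"session": "abc-refreshed"}
header = to_cookie_header(merge_cookies(saved_at_login, seen_by_splash))
```

    The resulting header string is what gets passed to splash as the Cookie header for the next request.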

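    Using splash with scrapy

    I will not introduce splash here; see its official documentation, which is also very thorough.
    First, let me explain why I did not use selenium + phantomjs, and why this crawler took me so long to write.
    When I first got the project I had no idea where to start. My first thought was to look for an official API, and I did find the Graph API. But I had not read the documentation; a quick try showed that some users' data could not be retrieved (they had privacy settings enabled), and what my boss wanted at the time was users' posts. That seemed impossible to get this way, so I ruled out the API right at the start and turned to scraping the pages.
    Page scraping involves JS rendering, and there are currently two common approaches: selenium + phantomjs, or splash with a Lua script that drives the whole crawl.
    I tried selenium + phantomjs too, but once the amount of data grows, phantomjs becomes very slow. Between selenium and splash I also prefer splash because it is more complete (mainly because its official documentation is well written), so I dropped selenium + phantomjs.
    While using splash I ran into several pitfalls: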

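    1. Cookies get lost during login, so a page that was logged in later turns out not to be. Sometimes running the same code twice gives a logged-in session once and an anonymous one the other time (to this day I do not know why; sometimes the cookies are passed through intact and sometimes not). When this happens, inspect the cookies. The Lua code is:

      local cookies = splash.args.headers['Cookie']
      This reads the current cookies. It is best to print them and check whether anything is wrong. If some are missing, hook into the outgoing requests and reset headers['Cookie']. The Lua code is: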

       splash:on_request(
           function(request)
               request:set_header('Cookie', cookies)
           end
       )
      

    Think of on_request as a pre-request hook: the callback runs before each request is sent.

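    2. While rendering the JS page, many requests were lost. Packet capture in the browser showed that loading a new post triggers a lot of requests (about 30), but splash ended up making only about half of them. To see the request count, add the HAR data to what the Lua script returns. This is a handy debugging trick. You can also return a screenshot of the rendered page and compare it with what the browser shows. The Lua code is: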

       -- last_response and updates are set earlier in the script (not shown here)
       local png2 = splash:png{render_all=true}
       return {
           url = splash.args.url,
           html = splash:html(),
           http_status = last_response.status,
           headers = last_response.headers,
           cookies = splash:get_cookies(),
           png2 = png2,
           data = json.encode(updates),
           har = json.encode(splash:har()),
       }
      

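    Here I return the requested URL, the HTTP status, a screenshot of the page, and the HAR (along with the HTML, headers, and cookies).
    The HAR output can be inspected at http://www.softwareishard.com/har/viewer/:
    [Image missing: HAR viewer screenshot]
    In theory no requests should be lost, but they were, and I was close to despair!
    So to finish the project I had to fall back on packet capture and call the page's internal data endpoints directly. This turned out to be a huge pitfall.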
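
    Besides the online viewer, the HAR and the screenshot can be checked from Python. The sketch below assumes the Lua script returned the HAR as a JSON string (as json.encode(splash:har()) does) and the PNG as base64 text, which is how I understand splash to serialize binary data inside a JSON result. The function names are hypothetical; the log.entries layout is the standard HAR structure.

```python
import base64
import json
from collections import Counter


def summarize_har(har_json):
    """Count requests and group them by response status from a HAR JSON string."""
    har = json.loads(har_json)
    entries = har["log"]["entries"]
    statuses = Counter(entry["response"]["status"] for entry in entries)
    urls = [entry["request"]["url"] for entry in entries]
    return {"count": len(entries), "statuses": dict(statuses), "urls": urls}


def decode_screenshot(png_base64):
    """Turn the base64 screenshot back into raw PNG bytes, ready to write to disk."""
    return base64.b64decode(png_base64)
```

    Comparing summarize_har(...)["count"] with the number of requests the browser makes for the same scroll is a quick way to confirm that requests really are going missing.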

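    3. There are tutorials online about packet capture, so I will skip that and go straight to the AJAX requests.
      Let's press the wicked F12. These are the requests:
      [Image missing: list of requests in the developer tools]
      Then look at the structure of these requests:
      [Image missing: structure of one response]
      The part marked by the red arrow contains the post data; each request carries the data for 3 posts. Now look at the URL of these requests: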
      https://www.facebook.com/ajax/pagelet/generic.php/BrowseScrollingSetPagelet?dpr=1&data=%7B%22view%22%3A%22list%22%2C%22encoded_query%22%3A%22%7B%5C%22bqf%5C%22%3A%5C%22stories-public(stories-keyword(3d%2Bprinting))%5C%22%2C%5C%22browse_sid%5C%22%3Anull%2C%5C%22vertical%5C%22%3A%5C%22content%5C%22%2C%5C%22post_search_vertical%5C%22%3Anull%2C%5C%22intent_data%5C%22%3A%5C%22%7B%5C%5C%5C%22intent%5C%5C%5C%22%3A%5C%5C%5C%22posts%5C%5C%5C%22%2C%5C%5C%5C%22entity_id%5C%5C%5C%22%3Anull%2C%5C%5C%5C%22sub_intents%5C%5C%5C%22%3A%7B%5C%5C%5C%22newsy_by_nms%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22media%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22media_video%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22live%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22whitelisted_explicit_live_video_intent%5C%5C%5C%22%3Afalse%2C%5C%5C%5C%22commerce%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22hard_news%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22c_physical_place%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22c_entertainment_intent%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22sports%5C%5C%5C%22%3Atrue%7D%2C%5C%5C%5C%22user_confidence%5C%5C%5C%22%3A0.00094727595255506%2C%5C%5C%5C%22quel_topics%5C%5C%5C%22%3A[%7B%5C%5C%5C%22fbid%5C%5C%5C%22%3A112456495470085%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A0.97018970189702%2C%5C%5C%5C%22position%5C%5C%5C%22%3A0%2C%5C%5C%5C%22length%5C%5C%5C%22%3A11%7D]%2C%5C%5C%5C%22multi_label_intents%5C%5C%5C%22%3A[%7B%5C%5C%5C%22value%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A2.3013026293484e-5%2C%5C%5C%5C%22source%5C%5C%5C%22%3Anull%7D%2C%7B%5C%5C%5C%22value%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A3.8938988922155e-8%2C%5C%5C%5C%22source%5C%5C%5C%22%3Anull%7D%2C%7B%5C%5C%5C%22value%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A0.00058737583458424%2C%5C%5C%5C%22source%5C%5C%5C%22%3Anull%7D%2C%7B%5C%5C%5C%22value%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A0.20616441965103%2C%5C%5C%5C%22source%5C%5C%5C%22%3Anull%7D%2C%7B%5C%5C%5C%22value%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A0.01
1218670755625%2C%5C%5C%5C%22source%5C%5C%5C%22%3Anull%7D%2C%7B%5C%5C%5C%22value%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A8.2435115473345e-5%2C%5C%5C%5C%22source%5C%5C%5C%22%3Anull%7D%2C%7B%5C%5C%5C%22value%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A0.024955861270428%2C%5C%5C%5C%22source%5C%5C%5C%22%3Anull%7D%2C%7B%5C%5C%5C%22value%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A0.00013062660582364%2C%5C%5C%5C%22source%5C%5C%5C%22%3Anull%7D%2C%7B%5C%5C%5C%22value%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A8.7018961494323e-5%2C%5C%5C%5C%22source%5C%5C%5C%22%3Anull%7D%2C%7B%5C%5C%5C%22value%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A0.55965006351471%2C%5C%5C%5C%22source%5C%5C%5C%22%3Anull%7D%2C%7B%5C%5C%5C%22value%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A0.012116322293878%2C%5C%5C%5C%22source%5C%5C%5C%22%3Anull%7D%2C%7B%5C%5C%5C%22value%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A0.0013190114405006%2C%5C%5C%5C%22source%5C%5C%5C%22%3Anull%7D%2C%7B%5C%5C%5C%22value%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A0.59801119565964%2C%5C%5C%5C%22source%5C%5C%5C%22%3Anull%7D%2C%7B%5C%5C%5C%22value%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A0.25427937507629%2C%5C%5C%5C%22source%5C%5C%5C%22%3Anull%7D%2C%7B%5C%5C%5C%22value%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A0.0019076929893345%2C%5C%5C%5C%22source%5C%5C%5C%22%3Anull%7D%2C%7B%5C%5C%5C%22value%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A0.00043614517198876%2C%5C%5C%5C%22source%5C%5C%5C%22%3Anull%7D%2C%7B%5C%5C%5C%22value%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A0.023136112838984%2C%5C%5C%5C%22source%5C%5C%5C%22%3Anull%7D%2C%7B%5C%5C%5C%22value%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A0.014905215241015%2C%5C%5C%5C%22source%5C%5C%5C%22%3Anull%7D%2C%7B%5C%5C%5C%22value%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A0.011931948
363781%2C%5C%5C%5C%22source%5C%5C%5C%22%3Anull%7D%2C%7B%5C%5C%5C%22value%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A0.11729157716036%2C%5C%5C%5C%22source%5C%5C%5C%22%3Anull%7D%2C%7B%5C%5C%5C%22value%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%22confidence%5C%5C%5C%22%3A0.0028816317208111%2C%5C%5C%5C%22source%5C%5C%5C%22%3Anull%7D]%2C%5C%5C%5C%22annotated_string%5C%5C%5C%22%3A%5C%5C%5C%22%7B%5C%5C%5C%5C%5C%5C%5C%22tokens%5C%5C%5C%5C%5C%5C%5C%22%3A[%5C%5C%5C%5C%5C%5C%5C%223d%5C%5C%5C%5C%5C%5C%5C%22%2C%5C%5C%5C%5C%5C%5C%5C%22printing%5C%5C%5C%5C%5C%5C%5C%22]%2C%5C%5C%5C%5C%5C%5C%5C%22entities%5C%5C%5C%5C%5C%5C%5C%22%3A%7B%5C%5C%5C%5C%5C%5C%5C%220_1%5C%5C%5C%5C%5C%5C%5C%22%3A[%7B%5C%5C%5C%5C%5C%5C%5C%22mention%5C%5C%5C%5C%5C%5C%5C%22%3A%5C%5C%5C%5C%5C%5C%5C%223d%20printing%5C%5C%5C%5C%5C%5C%5C%22%2C%5C%5C%5C%5C%5C%5C%5C%22fbid%5C%5C%5C%5C%5C%5C%5C%22%3A150438908406955%2C%5C%5C%5C%5C%5C%5C%5C%22name%5C%5C%5C%5C%5C%5C%5C%22%3A%5C%5C%5C%5C%5C%5C%5C%223d%20printing%5C%5C%5C%5C%5C%5C%5C%22%2C%5C%5C%5C%5C%5C%5C%5C%22probability%5C%5C%5C%5C%5C%5C%5C%22%3A1%2C%5C%5C%5C%5C%5C%5C%5C%22wikiFbid%5C%5C%5C%5C%5C%5C%5C%22%3Anull%2C%5C%5C%5C%5C%5C%5C%5C%22isWikiPage%5C%5C%5C%5C%5C%5C%5C%22%3Afalse%2C%5C%5C%5C%5C%5C%5C%5C%22source%5C%5C%5C%5C%5C%5C%5C%22%3A1%2C%5C%5C%5C%5C%5C%5C%5C%22type%5C%5C%5C%5C%5C%5C%5C%22%3Anull%2C%5C%5C%5C%5C%5C%5C%5C%22fbType%5C%5C%5C%5C%5C%5C%5C%22%3A2%2C%5C%5C%5C%5C%5C%5C%5C%22isConnected%5C%5C%5C%5C%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%5C%5C%5C%5C%22category%5C%5C%5C%5C%5C%5C%5C%22%3Anull%7D%2C%7B%5C%5C%5C%5C%5C%5C%5C%22mention%5C%5C%5C%5C%5C%5C%5C%22%3A%5C%5C%5C%5C%5C%5C%5C%223d%20printing%5C%5C%5C%5C%5C%5C%5C%22%2C%5C%5C%5C%5C%5C%5C%5C%22fbid%5C%5C%5C%5C%5C%5C%5C%22%3A112456495470085%2C%5C%5C%5C%5C%5C%5C%5C%22name%5C%5C%5C%5C%5C%5C%5C%22%3A%5C%5C%5C%5C%5C%5C%5C%223D%20printing%5C%5C%5C%5C%5C%5C%5C%22%2C%5C%5C%5C%5C%5C%5C%5C%22probability%5C%5C%5C%5C%5C%5C%5C%22%3A0.97018970189702%2C%5C%5C%5C%5C%5C%5C%5C%22wikiFbid%5C%5C%5C%5C%5C%5C%5C%22%3A112456495470085%2C%5
C%5C%5C%5C%5C%5C%5C%22isWikiPage%5C%5C%5C%5C%5C%5C%5C%22%3Atrue%2C%5C%5C%5C%5C%5C%5C%5C%22source%5C%5C%5C%5C%5C%5C%5C%22%3A0%2C%5C%5C%5C%5C%5C%5C%5C%22type%5C%5C%5C%5C%5C%5C%5C%22%3Anull%2C%5C%5C%5C%5C%5C%5C%5C%22fbType%5C%5C%5C%5C%5C%5C%5C%22%3A102%2C%5C%5C%5C%5C%5C%5C%5C%22isConnected%5C%5C%5C%5C%5C%5C%5C%22%3Anull%2C%5C%5C%5C%5C%5C%5C%5C%22category%5C%5C%5C%5C%5C%5C%5C%22%3A%5C%5C%5C%5C%5C%5C%5C%22activity_general%5C%5C%5C%5C%5C%5C%5C%22%7D]%7D%2C%5C%5C%5C%5C%5C%5C%5C%22concepts%5C%5C%5C%5C%5C%5C%5C%22%3A%7B%5C%5C%5C%5C%5C%5C%5C%220_1%5C%5C%5C%5C%5C%5C%5C%22%3A%7B%5C%5C%5C%5C%5C%5C%5C%22start%5C%5C%5C%5C%5C%5C%5C%22%3A0%2C%5C%5C%5C%5C%5C%5C%5C%22end%5C%5C%5C%5C%5C%5C%5C%22%3A1%2C%5C%5C%5C%5C%5C%5C%5C%22mention%5C%5C%5C%5C%5C%5C%5C%22%3A%5C%5C%5C%5C%5C%5C%5C%223d%20printing%5C%5C%5C%5C%5C%5C%5C%22%2C%5C%5C%5C%5C%5C%5C%5C%22concept%5C%5C%5C%5C%5C%5C%5C%22%3A%5C%5C%5C%5C%5C%5C%5C%22technology%5C%5C%5C%5C%5C%5C%5C%22%2C%5C%5C%5C%5C%5C%5C%5C%22probability%5C%5C%5C%5C%5C%5C%5C%22%3A0.73969631731129%7D%7D%2C%5C%5C%5C%5C%5C%5C%5C%22segments%5C%5C%5C%5C%5C%5C%5C%22%3A[%7B%5C%5C%5C%5C%5C%5C%5C%22type%5C%5C%5C%5C%5C%5C%5C%22%3Anull%2C%5C%5C%5C%5C%5C%5C%5C%22tokens%5C%5C%5C%5C%5C%5C%5C%22%3A%5C%5C%5C%5C%5C%5C%5C%223d%20printing%5C%5C%5C%5C%5C%5C%5C%22%7D]%7D%5C%5C%5C%22%2C%5C%5C%5C%22personalized_user_name_confidence%5C%5C%5C%22%3A0.00094727595255506%2C%5C%5C%5C%22discovery_intent%5C%5C%5C%22%3A%5C%5C%5C%22%7B%5C%5C%5C%5C%5C%5C%5C%22query_entities%5C%5C%5C%5C%5C%5C%5C%22%3A[%5C%5C%5C%5C%5C%5C%5C%2216d6b2919bd9b444180b3364207072696e74696e67173ff0000000000000122b0147043ff00000000000003132211222121504140015026c1b02860b69735f627573696e657373001269735f766964656f5f73686f775f70616765001b001b001b001b00001c001c36d6b2919bd9b4440000%5C%5C%5C%5C%5C%5C%5C%22%2C%5C%5C%5C%5C%5C%5C%5C%22168ad8b897ea9133180b3344207072696e74696e67173fef0bcb461209b611168ad8b897ea9133281061637469766974795f67656e6572616c0b0c0247003ff0000000000000183ff0000000000000321916b817180b3344207072696e74696e67123115cc92021
21215cc011400150029ec1c1500150200150619180b3364207072696e74696e67173fdd08b075a3c282001c1500150200150619180a3364207072696e746572173fc1764edf3f7faa001c150015020015061918166164646974697665206d616e75666163747572696e67173fbb3cd3377a518c001c1500150200150619180a3364207072696e746564173fb38d08b075a3c3001c1500150200150619180b3364207072696e74657273173faf14b87afca870001c150015020015061918137261706964206d616e75666163747572696e67173f9f6a46d9e699bc001c1500150200150619180c332064207072696e74696e67173f920c080558e5ef001c1500150200150619180c332064207072696e74657273173f920c080558e5ef001c1500150200150619181c646972656374206469676974616c206d616e75666163747572696e67173f920c080558e5ef001c1500150200150619180b332064207072696e746572173f8ebf2a1c12b725001c150015020015061918083364207072696e74173f8ebf2a1c12b725001c1500150200150619181a736f6c69642066726565666f726d206661627269636174696f6e173f8ebf2a1c12b725001c15001502001506191815696e7374616e74206d616e75666163747572696e67173f8ebf2a1c12b725001c150015020015061918156465736b746f70206d616e75666163747572696e67173f8c12b724c32cc9004c1b02860b69735f627573696e657373001269735f766964656f5f73686f775f70616765001b001b001b001b01890e72656c617465645f656e74697479f61e98efaa99d9ab31d4a48683fad831d4c0c8a0a1fa40baee8b81f1bcbc01ce9e82b7bdca38ec8fe0efcca330e2e2f0ecd3a331a2accef2fd8c3586a2b3dedfe29005dcbac6e99fa933f68bbf838bb530a8bbae9ec3d86080ecf5ec91ddb80194c5d3f3d582dc05f8eed584958230a49afa8793aa31f8d0ebeab1f830daa9d8b7b7bc3494cfb08ae49933dcfca3abc4c93384d8ef85d294339cea88e6b783c803aedadca4b1f2a705b8bddad9f5ddf301fe92aa93cf8230b2c6a6a7dfae34e4c2ad9daab18705c6e6879d92d538fc96f389c79d33908897a8f391ab06001c001c180751323239333637168ad8b897ea9133168ad8b897ea91330000%5C%5C%5C%5C%5C%5C%5C%22]%2C%5C%5C%5C%5C%5C%5C%5C%22module_intent%5C%5C%5C%5C%5C%5C%5C%22%3A%5C%5C%5C%5C%5C%5C%5C%22general_topics%5C%5C%5C%5C%5C%5C%5C%22%2C%5C%5C%5C%5C%5C%5C%5C%22discovery_analysis%5C%5C%5C%5C%5C%5C%5C%22%3A%5C%5C%5C%5C%5C%5C%5C%22191c1d3f800000191c15a01f1c1500150400191c168ad8b897ea91331d3f80000000000
000%5C%5C%5C%5C%5C%5C%5C%22%2C%5C%5C%5C%5C%5C%5C%5C%22fbid%5C%5C%5C%5C%5C%5C%5C%22%3A112456495470085%2C%5C%5C%5C%5C%5C%5C%5C%22metapage_fbid%5C%5C%5C%5C%5C%5C%5C%22%3Anull%2C%5C%5C%5C%5C%5C%5C%5C%22score%5C%5C%5C%5C%5C%5C%5C%22%3A1%2C%5C%5C%5C%5C%5C%5C%5C%22strValue%5C%5C%5C%5C%5C%5C%5C%22%3A%5C%5C%5C%5C%5C%5C%5C%223d%20printing%5C%5C%5C%5C%5C%5C%5C%22%2C%5C%5C%5C%5C%5C%5C%5C%22module_semantics%5C%5C%5C%5C%5C%5C%5C%22%3A[]%2C%5C%5C%5C%5C%5C%5C%5C%22will_display%5C%5C%5C%5C%5C%5C%5C%22%3Anull%2C%5C%5C%5C%5C%5C%5C%5C%22explicit_intent%5C%5C%5C%5C%5C%5C%5C%22%3Anull%7D%5C%5C%5C%22%2C%5C%5C%5C%22all_intent_scores%5C%5C%5C%22%3A%7B%5C%5C%5C%22NEWS%5C%5C%5C%22%3A0%2C%5C%5C%5C%22FEED_SEARCH%5C%5C%5C%22%3A0.68000657329864%2C%5C%5C%5C%22PERSON_NAME%5C%5C%5C%22%3A0.00094727595255506%2C%5C%5C%5C%22GRAMMAR%5C%5C%5C%22%3A0.0038283182306238%2C%5C%5C%5C%22ENTITY%5C%5C%5C%22%3A0.26963230895797%2C%5C%5C%5C%22NEEDLE%5C%5C%5C%22%3A0.41857439279556%2C%5C%5C%5C%22PUBLIC%5C%5C%5C%22%3A0.58047193288803%2C%5C%5C%5C%22COMMERCE%5C%5C%5C%22%3A0.7117947936058%2C%5C%5C%5C%22MEDIA%5C%5C%5C%22%3A0.79373741149902%2C%5C%5C%5C%22LOCATION%5C%5C%5C%22%3A0.44744238257408%2C%5C%5C%5C%22RECIPE%5C%5C%5C%22%3A0.046389143913984%2C%5C%5C%5C%22OFFENSIVE%5C%5C%5C%22%3A3.9597637169209e-7%2C%5C%5C%5C%22FOREIGN_LANG%5C%5C%5C%22%3A0.036479275673628%2C%5C%5C%5C%22NOT_UNDERSTAND%5C%5C%5C%22%3A0.033945471048355%2C%5C%5C%5C%22REVIEW%5C%5C%5C%22%3A0.85681182146072%2C%5C%5C%5C%22HOWTO%5C%5C%5C%22%3A0.97832483053207%2C%5C%5C%5C%22POST_SEARCH%5C%5C%5C%22%3A0.99905272404744%2C%5C%5C%5C%22MEDIA_VIDEO%5C%5C%5C%22%3A0.31929058432579%2C%5C%5C%5C%22CELEBRITY%5C%5C%5C%22%3A3.8938988922155e-8%2C%5C%5C%5C%22PERSONAL%5C%5C%5C%22%3A0.00058737583458424%2C%5C%5C%5C%22MEDIA_MUSIC%5C%5C%5C%22%3A8.2435115473345e-5%2C%5C%5C%5C%22ENTERTAINMENT%5C%5C%5C%22%3A0.024955861270428%2C%5C%5C%5C%22POLITICS%5C%5C%5C%22%3A8.7018961494323e-5%2C%5C%5C%5C%22SCIENCE_TECH_EDUCATION%5C%5C%5C%22%3A0.55965006351471%2C%5C%5C%5C%22HEALTH_MEDICINE_FITNESS%5C%5C
%5C%22%3A0.012116322293878%2C%5C%5C%5C%22SPORTS%5C%5C%5C%22%3A0.0013190114405006%2C%5C%5C%5C%22TRAVEL%5C%5C%5C%22%3A0.0019076929893345%2C%5C%5C%5C%22JOB%5C%5C%5C%22%3A0.011931948363781%2C%5C%5C%5C%22UTILITIES%5C%5C%5C%22%3A0.11729157716036%2C%5C%5C%5C%22FAMILTY_RELATIONSHIP%5C%5C%5C%22%3A0.0028816317208111%2C%5C%5C%5C%22EVENT%5C%5C%5C%22%3A0.85192716121674%2C%5C%5C%5C%22PAGE%5C%5C%5C%22%3A0.81485790014267%2C%5C%5C%5C%22GROUP%5C%5C%5C%22%3A0.00019464232900646%2C%5C%5C%5C%22PEOPLE%5C%5C%5C%22%3A0%2C%5C%5C%5C%22PERSON_NAME_PERSONALIZE%5C%5C%5C%22%3A0.00094727595255506%2C%5C%5C%5C%22POST_SEARCH_PERSONALIZE%5C%5C%5C%22%3A0.99905272404744%2C%5C%5C%5C%22TIMELINESS%5C%5C%5C%22%3A0.077904395759106%2C%5C%5C%5C%22PEOPLE_L2%5C%5C%5C%22%3A5.8110336204412e-14%2C%5C%5C%5C%22C_MUSICIAN_BAND%5C%5C%5C%22%3A0%2C%5C%5C%5C%22C_NOTABLE_PERSON%5C%5C%5C%22%3A0%2C%5C%5C%5C%22C_COMPANY_ORG%5C%5C%5C%22%3A0%2C%5C%5C%5C%22C_ENTERTAINMENT%5C%5C%5C%22%3A0.0030571992974728%2C%5C%5C%5C%22C_SPORTS%5C%5C%5C%22%3A2.0238827346475e-5%2C%5C%5C%5C%22C_COMMERCE%5C%5C%5C%22%3A0.96098506450653%2C%5C%5C%5C%22C_PHYSICAL_PLACE%5C%5C%5C%22%3A0.028572214767337%2C%5C%5C%5C%22C_RECIPE%5C%5C%5C%22%3A0.0021609598770738%2C%5C%5C%5C%22C_COMMUNITY%5C%5C%5C%22%3A0.0085074165835977%2C%5C%5C%5C%22C_OTHER%5C%5C%5C%22%3A0.0052042468450963%2C%5C%5C%5C%22NEWS_ALT%5C%5C%5C%22%3A0.69900870026791%2C%5C%5C%5C%22HARD_NEWS%5C%5C%5C%22%3A0.4862507960795%2C%5C%5C%5C%22PEOPLE_CLICK%5C%5C%5C%22%3A2.640307293617e-5%2C%5C%5C%5C%22COMMERCE_GROUP%5C%5C%5C%22%3A0.96538978815079%2C%5C%5C%5C%22PEOPLE_CLICK_ALT%5C%5C%5C%22%3A2.640307293617e-5%2C%5C%5C%5C%22LIVENESS%5C%5C%5C%22%3A-9.995%7D%7D%5C%22%2C%5C%22filters%5C%22%3A[]%2C%5C%22has_chrono_sort%5C%22%3Afalse%2C%5C%22query_analysis%5C%22%3Anull%2C%5C%22subrequest_disabled%5C%22%3Afalse%2C%5C%22token_role%5C%22%3A%5C%22NONE%5C%22%2C%5C%22preloaded_story_ids%5C%22%3A[]%2C%5C%22extra_data%5C%22%3Anull%2C%5C%22disable_main_browse_unicorn%5C%22%3Afalse%2C%5C%22entry_point_scope%5C%22%3Anull%2C%5C%
22entry_point_surface%5C%22%3Anull%2C%5C%22squashed_ent_ids%5C%22%3A[]%7D%22%2C%22encoded_title%22%3A%22W10%22%2C%22ref%22%3A%22unknown%22%2C%22logger_source%22%3A%22www_main%22%2C%22typeahead_sid%22%3A%22%22%2C%22tl_log%22%3Afalse%2C%22impression_id%22%3A%225e10e947%22%2C%22filter_ids%22%3A%7B%22335422937918%3A10154898540022919%22%3A%22335422937918%3A10154898540022919%22%2C%22107124643386%3A10155324963213387%22%3A%22107124643386%3A10155324963213387%22%7D%2C%22experience_type%22%3A%22grammar%22%2C%22exclude_ids%22%3Anull%2C%22browse_location%22%3A%22browse_location%3Abrowse%22%2C%22trending_source%22%3Anull%2C%22reaction_surface%22%3Anull%2C%22reaction_session_id%22%3Anull%2C%22ref_path%22%3A%22%2Fsearch%2Fstr%2F3d%252Bprinting%2Fstories-keyword%2Fstories-public%22%2C%22is_trending%22%3Afalse%2C%22topic_id%22%3Anull%2C%22place_id%22%3Anull%2C%22story_id%22%3Anull%2C%22callsite%22%3A%22browse_ui%3Ainit_result_set%22%2C%22has_top_pagelet%22%3Atrue%2C%22display_params%22%3A%7B%22crct%22%3A%22none%22%2C%22mrss%22%3Atrue%7D%2C%22cursor%22%3A%22AbrYobJWobSFbr1qWbo_dQrDu_6QbuFES7vkqWhq7zdycIfBfLCp8E7v8qHNDl8q1cVxoIZb4F9_IyNO-SlNyxvQMk-ZFyJ1Knt90sI7VBwng6uMyAQmYZ5bjXZAF7HGcowJtKVAHrDOeqlv0TiExvPsIBDnk0NSALsRDkICZ7rXKjYQ7o77uWZbeFbSzaolcCFnN-BdR2qQuHz_fMey_XfoFbwRe8UPep0M6wekAlQkVm5FEbg5bxKT9pPTjZ4rnu5L_CycOiaxQtxBywlDhOR5_lmfF9x5bfxn936hdOqVeSUHnLQJG81MxIY-rmKYOmKpMP_V-H50Do_TkJVx1YG_uQZVi2tgLTW0cgDPFb4YZVu-cF42nee6TqReLEoxsTOV8GCdJo_yQhFfJ26Nagy2%22%2C%22page_number%22%3A4%2C%22em%22%3Afalse%2C%22tr%22%3Anull%7D&__user=100018749370505&__a=1&__dyn=5V4cjEzUGByK5A9UoHaEWC5ER6yUmyVbGAEG8zCC-C267UDAyoS2N6wAxubwTwFQ2KfgjyR88xK5WAzHzemVWxeUPxKui4GDgdUHDBxe6rCCyVeFFUkxvz8Gicx2jxm1iyECQdwBx66EK2m5K5FLKEgyk6EvGi64i9CUKazpK5EG2eVQm5EgwECwTAyrK4rGUohESUK8Gm8CBz8swgE-6UCbx-8K4uayGVGwFxCQExyUy-f_gOdBx67bw&__af=h0&__req=2f&__be=0&__pc=PHASED%3ADEFAULT&__rev=3267345&__spin_r=3267345&__spin_b=trunk&__spin_t=1504189881
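      That is 15,000 characters...
      How on earth am I supposed to construct this?! Seeing this URL, I broke down inside. I asked my team lead what to do, and he said the pieces are usually somewhere in the page, and if not, to look at the parameters in the page's JS code, since the URL is built from them. WTF, I have to read the JS too! So for those two days I was under a lot of pressure.
      But I am still young and did not dare leave a task from the boss unfinished, so I ground through those 15,000 characters. Looking back, I was simply too restless and did not analyze the problem calmly when I ran into it.
      Look at the URL again. With a URL this long, it is very unlikely that you are expected to dig the parameters out of the JS, since building it that way would be costly for the site as well. So the page itself usually holds clues to the URL. A closer look shows that this URL consists of 3 parts: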

      request_url = request_header + request_data + request_tail
      request_header = "https://www.facebook.com/ajax/pagelet/generic.php/BrowseScrollingSetPagelet?dpr=1&data="
      request_tail = "&__user=%s&__a=1&__dyn=5V4cjEzUGByK5A9UoHaEWC5EWq2WiWF3oyeqrWo8ovyUWdwIhE98nyUdUat0Hx24UJi28rxuF8WUOuVWxeUPxKcxaFQ3uaVVojxCVFEKjGqu58nUnAz8lUlwkEG9J7BwBx66EK2m5K5FLKEgDQ6EvG7ki4e2i8xqawDDhomx22yq3ui9Dx6WK6pESUK8Gm8CBz8swgE-6UCbx-8xnyESbwFxCQEx38y-fXgO&__af=jw&__req=c&__be=-1&__pc=EXP3%%3Aholdout_pkg&__rev=3207279" % str(facebook_account_id)
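      request_data is something you have to find in the page yourself. It looks like JSON, but its keys are bare identifiers rather than quoted strings, so first use a regular expression to turn those keys into strings, then parse the result as JSON.
      Parsing it is not enough, though: the data inside request_url also has to be URL-encoded. You can follow a URL-encoding reference table for the conversion.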
      4. Later I found that with the Graph API, once I have a post's ID I can fetch everything I want. Sweet, and it saved a lot of work.
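
    The request_data handling from pitfall 3 (quote the bare keys with a regular expression, parse as JSON, then URL-encode the data and glue the three parts together) can be sketched like this. It is a minimal illustration, not the original spider's code: the function names and the example.invalid URL are made up, and the regex assumes that string values never contain text that looks like ",key:" or "{key:", which real page data may violate.

```python
import json
import re
from urllib.parse import quote

# A bare identifier used as an object key: preceded by "{" or ",", followed by ":".
BARE_KEY = re.compile(r'([{,]\s*)([A-Za-z_][A-Za-z0-9_]*)(\s*:)')


def parse_loose_object(text):
    """Quote bare keys so the text becomes valid JSON, then parse it."""
    return json.loads(BARE_KEY.sub(r'\1"\2"\3', text))


def build_request_url(request_header, data, request_tail):
    """Percent-encode the compact JSON form of data and join the three URL parts."""
    compact = json.dumps(data, separators=(",", ":"))
    return request_header + quote(compact, safe="") + request_tail


raw = '{view:"list",page_number:4,em:false,tr:null}'
data = parse_loose_object(raw)
url = build_request_url("https://example.invalid/pagelet?data=", data, "&__a=1")
```

    Using quote with safe="" encodes braces, quotes, colons, and commas as %7B, %22, %3A, and %2C, which is the same pattern visible at the start of the captured URL above.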


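    With all that said, here is the code:
    chinwu's facebook spider

    One last small plug: if you need data scraped, contact me; if you have ML/NLP/Data Engineer openings, contact me too. I am based in Wuhan.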
