都知道微博上有很多僵尸粉,为明星数据流量造假,作为一名求生欲十分强烈的数据媛,思虑再三不敢去扒流量明星,所以今天我挑中的下手 对象是:
”下*****八“
<tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1556262289598" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image <input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image> <tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1556262289603" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image<input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>
可以添加QQ源码群1004391443,有飞机大战、颜值打分器、打砖块小游戏、红包提醒神器、小姐姐表白神器等具体的实训项目,有清晰源码,有相应的文件
一、运行环境
Windows,Python3
需要的包在cmd下载好:pip install ***
然后,在代码前面导入:import ***
<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">import requestsfrom html.parser import HTMLParserimport jsonimport timefrom bs4 import BeautifulSoupimport jsonimport pandas as pd
</pre>
二、获取微博粉丝页面的Cookie源码
首先打开他的微博主页,按F12进入开发模式,选择Network,F5刷新页面,像这样:
<tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1556262289608 ql-align-center" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; text-align: left; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image<input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>
在选项卡中选择"Doc",勾选页面网址,找到" Request Headers ",像这样:
<tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1556262289612 ql-align-center" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; text-align: left; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image<input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>
通过"Request Headers"就能获取源码啦:
<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">#函数def get_html(Header, the_url): r = requests.get(url=the_url, headers=header) parser = HTMLParser() parser.feed(r.text) html_str = r.text return html_str#函数的第一个参数:将"Request Headers"对应的字段贴过来,换成你想爬取得地址哦Header={ "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,/;q=0.8" , "Accept-Encoding": "gzip, deflate, br" , "Accept-Language": "zh-CN,zh;q=0.9" , "Connection": "keep-alive" , "Cookie":"SINAGLOBAL=8942290497361.736.1542873594539; UM_distinctid=1684f2253455f-07fefa1f3a5ef8-47e1039-144000-1684f2253463f8; UOR=,,www.duba.com; un=yuzidesky1128@sina.cn; wvr=6; Ugrow-G0=9642b0b34b4c0d569ed7a372f8823a8e; ALF=1587692149; SSOLoginState=1556156149; SCF=AmrcjCjvFI3VTtRcnw5XSEpu50N99C78GjWRXyTgpAiZyfrYPOUKsG3XcpfQFXmoHSYhPc9zkby1VsW0nEGa35o.; SUB=_2A25xxX6lDeRhGedJ7FAX8CnPyz2IHXVSs9dtrDV8PUNbmtAKLU_YkW9NUdIDy4VrqO7uz81FV-aLXHbJc1pOGAyW; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9W5.xNDWU.nUY18EZ3DRMKKg5JpX5KzhUgL.Fo2NS0zcehM0eh22dJLoIp9jdN9Li--NiK.piKLhi--fi-82iK.7; SUHB=0tKrG81Etv1yct; TC-V5-G0=841d8e04c4761f733a87c822f72195f3; _s_tentry=login.sina.com.cn; Apache=6519093007221.037.1556156193431; wb_view_log_1772607301=15368641.125; ULV=1556156193448:33:4:2:6519093007221.037.1556156193431:1556094384932; TC-Page-G0=cdcf495cbaea129529aa606e7629fea7|1556165109|1556164910; webim_unReadCount=%7B%22time%22%3A1556167207612%2C%22dm_pub_total%22%3A2%2C%22chat_group_pc%22%3A0%2C%22allcountNum%22%3A2%2C%22msgbox%22%3A0%7D", "Host": "weibo.com" , "Referer":"https://weibo.com/p//follow?relate=fans&page=" , "Upgrade-Insecure-Requests": "1" ,"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"}# 函数的第二个参数,换成你想爬取的微博主页地址the_url="https://weibo.com/p/*/follow?relate=fans&page="# 获取源码yuanma = get_html(Header,the_url)
</pre>
三、解析源码,获取微博粉丝信息,处理成csv表格
主函数如下,其中包含的子函数从源码处下载哦:https://github.com/Mannix1994/FindYou
<pre spellcheck="false" style="box-sizing: border-box; margin: 5px 0px; padding: 5px 10px; border: 0px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-variant-numeric: inherit; font-variant-east-asian: inherit; font-weight: 400; font-stretch: inherit; font-size: 16px; line-height: inherit; font-family: inherit; vertical-align: baseline; cursor: text; counter-reset: list-1 0 list-2 0 list-3 0 list-4 0 list-5 0 list-6 0 list-7 0 list-8 0 list-9 0; background-color: rgb(240, 240, 240); border-radius: 3px; white-space: pre-wrap; color: rgb(34, 34, 34); letter-spacing: normal; orphans: 2; text-align: left; text-indent: 0px; text-transform: none; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; text-decoration-style: initial; text-decoration-color: initial;">def get_fans_list(html_str): # 分割html网页,得到一些包含json数据的字符串 fans_json_list = html_str.split("</script>") # 找到含粉丝信息的json字符串 json_str = {} for item in fans_json_list: item = get_pure_json(item) if not item: continue if item.find('"domid":"Pl_Official_HisRelation__59"') > -1: json_str = item json_data = json.loads(json_str) soup = BeautifulSoup(json_data['html'], 'lxml') soup.prettify() fans = [] for div_tag in soup.find_all('div'): if div_tag['class'] == ["follow_inner"]: # 提取粉丝信息 for fan_dl in div_tag.find_all('dl'): p = Fan(fan_dl) fans.append(p.dict) break return pd.DataFrame(fans)
</pre>
结果像这样子:
一个csv表中包含"下*****八"微博粉丝们的地址、粉丝数、关注数、性别、简介、名字和微博主页地址等。
你想要的都能拿到啦!(点大图瞧得清楚)
<tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1556262289616 ql-align-center" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; text-align: left; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image<input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>
四、分析抓取到的数据
数据既然已经到手了,接下来分析一下,这位同学的粉丝中间有没有僵尸粉,是更吸引男网友还是女网友呢?
来看一下粉丝的男女比例:
Na Ni
!竟然女粉丝居多!一次警告!
<tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1556262289621 ql-align-center" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; text-align: left; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image<input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>
再来看一下粉丝们都是来自哪里的:
粉丝们五湖四海呢!
当然还是本地西安的多一些,让人不得不怀疑,外地的这些有一些是不是僵尸粉...
<tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1556262289625 ql-align-center" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; text-align: left; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image<input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>
最后来找一下粉丝中谁的粉丝量最高,而谁的粉丝量最低:
额....最高也才1万多...
恭喜"裸奔的日常"荣获156个粉丝中粉丝量最多的博主!
<tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1556262289629 ql-align-center" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; text-align: left; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image<input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>
粉丝量最少的这位"用户6531699021",我有道理怀疑你是个僵尸号哦!
<tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1556262289633 ql-align-center" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; text-align: left; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image<input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>
最后做一个粉丝云图:
我是不是暴露了什么?...
<tt-image data-tteditor-tag="tteditorTag" contenteditable="false" class="syl1556262289635 ql-align-center" data-render-status="finished" data-syl-blot="image" style="box-sizing: border-box; cursor: text; text-align: left; color: rgb(34, 34, 34); font-family: "PingFang SC", "Hiragino Sans GB", "Microsoft YaHei", "WenQuanYi Micro Hei", "Helvetica Neue", Arial, sans-serif; font-size: 16px; font-style: normal; font-variant-ligatures: normal; font-variant-caps: normal; font-weight: 400; letter-spacing: normal; orphans: 2; text-indent: 0px; text-transform: none; white-space: pre-wrap; widows: 2; word-spacing: 0px; -webkit-text-stroke-width: 0px; background-color: rgb(255, 255, 255); text-decoration-style: initial; text-decoration-color: initial; display: block;"> image<input class="pgc-img-caption-ipt" placeholder="图片描述(最多50字)" value="" style="box-sizing: border-box; outline: 0px; color: rgb(102, 102, 102); position: absolute; left: 187.5px; transform: translateX(-50%); padding: 6px 7px; max-width: 100%; width: 375px; text-align: center; cursor: text; font-size: 12px; line-height: 1.5; background-color: rgb(255, 255, 255); background-image: none; border: 0px solid rgb(217, 217, 217); border-radius: 4px; transition: all 0.2s cubic-bezier(0.645, 0.045, 0.355, 1) 0s;"></tt-image>
黑粉们学会了么?技术告诉你们了,赶紧去扒一下流量明星的微博粉丝,什么上亿的转发和点赞,不知道有多少粉丝是0关注0粉丝的新号呢!
网友评论