Scraping Search Engines with Python 3

Author: CSeroad | 2020-12-24 10:58

    Preface

    During red team vs. blue team engagements, early reconnaissance is essential, and the information collected is mostly domain names and IP addresses. Visiting those domains directly often returns 404, 403, and similar status codes, so using a search engine to locate a system's entry point is a good alternative. With a large number of domains, though, querying the search engine one by one gets tedious, which is why a script that scrapes search engines automatically is needed.
    Eventually I found a URL collector written by another researcher on GitHub: https://github.com/MikoSecSoS/GetURLs
    The code is quite simple, but in practice it did not work entirely as I wanted, so I made some small modifications to it.

    Scraping Microsoft Bing

    Starting from the original author's code, I made the following changes:

    1. Use the multiprocessing module and switch to the apply_async method, which submits tasks asynchronously without blocking (a minimal sketch of apply_async appears right after this list);
    2. Read the domains to query from a file instead of a hard-coded keyword, and use only the site: operator to collect results for each domain in bulk;
    3. Send GET requests against Bing's search endpoint;
    4. Automatically save the scraped results to a .txt file named after the current time.
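
    For item 1, here is a minimal, self-contained sketch of how apply_async differs from a blocking call. The fetch_page function is only a stand-in for getUrls and is not part of the spider:

    from multiprocessing import Pool
    import time

    def fetch_page(page):
        # stand-in for BingSpider.getUrls: pretend to fetch one results page
        time.sleep(1)
        return "page %d done" % page

    if __name__ == "__main__":
        pool = Pool(5)
        # apply_async returns immediately with an AsyncResult; the work runs
        # in the background, so all four "pages" are fetched concurrently
        results = [pool.apply_async(fetch_page, args=(i,)) for i in range(1, 5)]
        pool.close()
        pool.join()
        for r in results:
            print(r.get())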

    The full script:

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    # code by CSeroad
    
    import os
    import re
    import sys
    import time
    import requests
    
    from optparse import OptionParser
    from multiprocessing import Pool
    
    # append each result (a dict with "title" and "url") to the output file,
    # creating it first if it does not exist
    def download(filename, datas):
        filename = filename.replace("/", "_")
        if not os.path.exists(filename):
            f = open(filename, "w")
            f.close()
        with open(filename, "a") as f:
            for data in datas:
                f.write(str(data) + "\n")
    
    class BingSpider:
    
        @staticmethod
        def getUrls(page):
            # fetch one page of Bing results for the module-level `word` query
            # (set in __main__ and inherited by forked worker processes)
            now_time = time.strftime('%Y-%m-%d-%H', time.localtime(time.time()))
            hd = {
                "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
                "accept-language": "zh-CN,zh;q=0.9",
                "alexatoolbar-alx_ns_ph": "AlexaToolbar/alx-4.0.3",
                "cache-control": "max-age=0",
                "upgrade-insecure-requests": "1",
                "cookie": "DUP=Q=axt7L5GANVktBKOinLxGuw2&T=361645079&A=2&IG=8C06CAB921F44B4E8AFF611F53B03799; _EDGE_V=1; MUID=0E843E808BEA618D13AC33FD8A716092; SRCHD=AF=NOFORM; SRCHUID=V=2&GUID=CADDA53D4AD041148FEB9D0BF646063A&dmnchg=1; MUIDB=0E843E808BEA618D13AC33FD8A716092; ISSW=1; ENSEARCH=BENVER=1; SerpPWA=reg=1; _EDGE_S=mkt=zh-cn&ui=zh-cn&SID=252EBA59AC756D480F67B727AD5B6C22; SL_GWPT_Show_Hide_tmp=1; SL_wptGlobTipTmp=1; SRCHUSR=DOB=20190616&T=1560789192000; _FP=hta=on; BPF=X=1; SRCHHPGUSR=CW=1341&CH=293&DPR=1&UTC=480&WTS=63696385992; ipv6=hit=1560792905533&t=4; _SS=SID=252EBA59AC756D480F67B727AD5B6C22&HV=1560790599",
                "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
            }
            filename = now_time + ".txt"
            # build the Bing search URL; q= is the query and first= the paging offset
            url = "https://cn.bing.com/search?q={}&first={}&FORM=PERE".format(word, page)
            print(url)
            req = requests.get(url, headers=hd)
            if "There are no results for" in req.text:
                return
            urls_titles = re.findall("<h2><a.*?href=\"(.*?)\".*?>(.*?)</a></h2>",req.text)
            data = []
            for url, title in urls_titles:
                title = title.replace("<strong>", "").replace("</strong>", "")
                data.append({
                    "title": title,
                    "url": url
                })
                print(title, url)
            download(filename, data)
    
    
    
        def main(self):
            # five worker processes, pages 1-4 submitted without blocking
            pool = Pool(5)
            for i in range(1,5):
                pool.apply_async(func=self.getUrls,args=(i,))
                #BingSpider.getUrls(1)
            pool.close()
            pool.join()
    
    
    if __name__ == "__main__":
        parser = OptionParser("bingSpider.py -f words.txt")
        parser.add_option("-f", "--file",action="store",type="string",dest="file",help="words.txt")
        (options, args) = parser.parse_args()
        if options.file:
            file = options.file
            with open(file,'r') as f:
                for line in f.readlines():
                    # one "site:<domain>" query per line in the input file;
                    # `word` stays module-level so getUrls can read it
                    word = line.strip()
                    word = "site:"+word
                    print("\033[1;37;40m"+word+"\033[0m")
                    #word="site:api.baidu.com"
                    bingSpider = BingSpider()
                    bingSpider.word = word
                    bingSpider.main()
        else:
            parser.error('incorrect number of arguments')
    

    Example run:
    Put the domains you collected into word.txt, one per line, and run python3 bingSpider.py -f word.txt. The script prints each matched title and URL to the console and writes them to a timestamped .txt file.

    Scraping Google

    Bing has its strengths, but Google is more powerful still.
    Following the pattern of bingSpider.py, we can write googleSpider.py.
    Scraping Google comes with a few interesting pitfalls worth pointing out:

    1. Send a fairly complete set of request headers, to avoid being bounced to Google's verification page;
    2. Reaching Google locally requires a proxy, so the script also sets proxies on a requests session;
    3. Adjust the proxy address and port to match your own setup:
      session.proxies = {'http': 'socks5://127.0.0.1:1086','https': 'socks5://127.0.0.1:1086'}
    4. A sleep was added to the multi-process crawl, again to reduce the chance of hitting Google's captcha;
    5. The regex that parses the response also changed: result links now live in a div with class="yuRUbf" (a small extraction sketch follows this list).
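
    To illustrate item 5, here is a small sketch of the regex applied to a made-up, heavily trimmed fragment shaped like a Google result entry (the HTML below is hypothetical; real pages carry far more markup and may change over time):

    import re

    # hypothetical, simplified snippet in the shape of one Google result
    html = ('<div class="yuRUbf"><a href="https://api.example.com/login" '
            'data-ved="..."><h3 class="LC20lb">Example Login</h3></a></div>')

    pattern = "<div class=\"yuRUbf\"><a href=\"(.*?)\".*?><h3.*?>(.*?)</h3>"
    for url, title in re.findall(pattern, html):
        print(title, url)   # -> Example Login https://api.example.com/login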

    The full script:

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    # code by CSeroad
    
    import os
    import re
    import sys
    import time
    import requests
    
    from optparse import OptionParser
    from multiprocessing import Pool
    
    
    
    def download(filename, datas):
        filename = filename.replace("/", "_")
        if not os.path.exists(filename):
            f = open(filename, "w")
            f.close()
        with open(filename, "a") as f:
            for data in datas:
                f.write(str(data) + "\n")
    
    
    class GoogleSpider:
    
        @staticmethod
        def getUrls(page):
            # fetch one page of Google results for the module-level `word` query
            now_time = time.strftime('%Y-%m-%d-%H', time.localtime(time.time()))
            hd = {
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
                "Accept-language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
                "Referer": "https://www.google.com/",
                "Cache-control": "max-age=0",
                "Accept-Encoding": "gzip, deflate",
                "Upgrade-insecure-requests": "1",
                "Cookie": "GOOGLE_ABUSE_EXEMPTION=ID=15c1d08c9232025f:TM=1608695949:C=r:IP=52.231.34.93-:S=APGng0veF37IjfSixu2nMBKj7JRlk2A4dg",
                "User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
            }
            # all requests go through a local SOCKS5 proxy; adjust the address
            # and port to match your own setup
            session = requests.session()
            session.proxies = {'http': 'socks5://127.0.0.1:1086','https': 'socks5://127.0.0.1:1086'}
            filename = "google-" + now_time + ".txt"
            url = "https://www.google.com/search?q={}&start={}".format(word, page)
            print("\033[1;37;40m"+url+"\033[0m")
            req = session.get(url,headers=hd)
            #print(req.text)
            if "找不到和您查询的" in req.text:
                return
            urls_titles = re.findall("<div class=\"yuRUbf\"><a href=\"(.*?)\".*?><h3.*?>(.*?)</h3>", req.text)
            #print(urls_titles)
            data = []
            for url, title in urls_titles:
                data.append({
                    "title": title,
                    "url": url
                })
                print(title, url)
            download(filename, data)
    
        def main(self):
            # five workers, pages 1-4; wait 20 seconds before closing the pool
            # to slow the crawl down and reduce the chance of a captcha
            pool = Pool(5)
            for i in range(1,5):
                pool.apply_async(func=self.getUrls,args=(i,))
            time.sleep(20)
            #GoogleSpider.getUrls(1)
            pool.close()
            pool.join()
    
    
    if __name__ == "__main__":
        parser = OptionParser("googleSpider.py -f words.txt")
        parser.add_option("-f", "--file",action="store",type="string",dest="file",help="words.txt")
        (options, args) = parser.parse_args()
        if options.file:
            file = options.file
            with open(file,'r') as f:
                for line in f.readlines():
                    word = line.strip()
                    word = "site:"+word
                    print("\033[1;37;40m"+word+"\033[0m")
                    googleSpider = GoogleSpider()
                    googleSpider.word = word
                    googleSpider.main()
        else:
            parser.error('incorrect number of arguments')
    

    Testing it the same way produces the same kind of output: matched titles and URLs printed to the console and saved to a google-<timestamp>.txt file.

    Combining Bing and Google

    With the two scripts above modified, they are merged here into a single file, UrlSpider.py, which makes it more convenient to scrape both search engines at once.
    A few things to keep in mind when combining them:

    1. Without a local proxy, Google cannot be scraped;
    2. With a system-wide (global) proxy, only Google can be scraped and Bing cannot;
    3. A local PAC (automatic) proxy is recommended; just adjust session.proxies for the Google requests (see the sketch after this list for one way to avoid editing the source each time).
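
    As one way to handle item 3 without editing the source, the proxy could be read from an environment variable. This is only a sketch under assumed names (make_session and URL_SPIDER_PROXY are not part of the original script), and socks5:// proxies additionally need the PySocks extra (pip install requests[socks]):

    import os
    import requests

    # hypothetical helper: use a SOCKS proxy only when URL_SPIDER_PROXY is set,
    # e.g.  export URL_SPIDER_PROXY=socks5://127.0.0.1:1086
    def make_session():
        session = requests.session()
        proxy = os.environ.get("URL_SPIDER_PROXY")
        if proxy:
            session.proxies = {"http": proxy, "https": proxy}
        return session

    GoogleSpider.getUrls could then call make_session() instead of hard-coding session.proxies, while BingSpider keeps using plain requests.get, so a global proxy is never forced onto the Bing requests.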

    The full script:

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    # code by CSeroad
    
    import os
    import re
    import sys
    import time
    import requests
    
    from optparse import OptionParser
    from multiprocessing import Pool
    
    banner = '''
      ____ ____                           _
     / ___/ ___|  ___ _ __ ___   __ _  __| |
    | |   \___ \ / _ \ '__/ _ \ / _` |/ _` |
    | |___ ___) |  __/ | | (_) | (_| | (_| |
     \____|____/ \___|_|  \___/ \__,_|\__,_|
    
    '''
    
    def download(filename, datas):
        filename = filename.replace("/", "_")
        if not os.path.exists(filename):
            f = open(filename, "w")
            f.close()
        with open(filename, "a") as f:
            for data in datas:
                f.write(str(data) + "\n")
    
    class BingSpider:
    
        @staticmethod
        def getUrls(page):
            now_time = time.strftime('%Y-%m-%d-%H', time.localtime(time.time()))
            hd = {
                "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
                "accept-language": "zh-CN,zh;q=0.9",
                "alexatoolbar-alx_ns_ph": "AlexaToolbar/alx-4.0.3",
                "cache-control": "max-age=0",
                "upgrade-insecure-requests": "1",
                "cookie": "DUP=Q=axt7L5GANVktBKOinLxGuw2&T=361645079&A=2&IG=8C06CAB921F44B4E8AFF611F53B03799; _EDGE_V=1; MUID=0E843E808BEA618D13AC33FD8A716092; SRCHD=AF=NOFORM; SRCHUID=V=2&GUID=CADDA53D4AD041148FEB9D0BF646063A&dmnchg=1; MUIDB=0E843E808BEA618D13AC33FD8A716092; ISSW=1; ENSEARCH=BENVER=1; SerpPWA=reg=1; _EDGE_S=mkt=zh-cn&ui=zh-cn&SID=252EBA59AC756D480F67B727AD5B6C22; SL_GWPT_Show_Hide_tmp=1; SL_wptGlobTipTmp=1; SRCHUSR=DOB=20190616&T=1560789192000; _FP=hta=on; BPF=X=1; SRCHHPGUSR=CW=1341&CH=293&DPR=1&UTC=480&WTS=63696385992; ipv6=hit=1560792905533&t=4; _SS=SID=252EBA59AC756D480F67B727AD5B6C22&HV=1560790599",
                "user-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
            }
            filename = "biying-" + now_time + ".txt"
            url = "https://cn.bing.com/search?q={}&first={}&FORM=PERE".format(word, page)
            print("\033[1;37;40m"+url+"\033[0m")
            req = requests.get(url, headers=hd)
            if "There are no results for" in req.text:
                return
            urls_titles = re.findall("<h2><a.*?href=\"(.*?)\".*?>(.*?)</a></h2>",req.text)
            print(urls_titles)
            data = []
            for url, title in urls_titles:
                title = title.replace("<strong>", "").replace("</strong>", "")
                data.append({
                    "title": title,
                    "url": url
                })
                print(title, url)
            download(filename, data)
    
    
        def main(self):
            pool = Pool(5)
            for i in range(1,5):
                pool.apply_async(func=self.getUrls,args=(i,))
                #BingSpider.getUrls(1)
            pool.close()
            pool.join()
    
    
    class GoogleSpider:
    
        @staticmethod
        def getUrls(page):
            now_time = time.strftime('%Y-%m-%d-%H', time.localtime(time.time()))
            hd = {
                "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
                "Accept-language": "zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2",
                "Referer": "https://www.google.com/",
                "Cache-control": "max-age=0",
                "Accept-Encoding": "gzip, deflate",
                "Upgrade-insecure-requests": "1",
                "Cookie": "GOOGLE_ABUSE_EXEMPTION=ID=15c1d08c9232025f:TM=1608695949:C=r:IP=52.231.34.93-:S=APGng0veF37IjfSixu2nMBKj7JRlk2A4dg",
                "User-agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36"
            }
            session = requests.session()
            session.proxies = {'http': 'socks5://127.0.0.1:1086','https': 'socks5://127.0.0.1:1086'}
            filename = "google-" + now_time + ".txt"
            url = "https://www.google.com/search?q={}&start={}".format(word, page)
            print("\033[1;37;40m"+url+"\033[0m")
            req = session.get(url,headers=hd)
            #print(req.text)
            if "找不到和您查询的" in req.text:
                return
            urls_titles = re.findall("<div class=\"yuRUbf\"><a href=\"(.*?)\".*?><h3.*?>(.*?)</h3>", req.text)
            #print(urls_titles)
            data = []
            for url, title in urls_titles:
                data.append({
                    "title": title,
                    "url": url
                })
                print(title, url)
            download(filename, data)
    
        def main(self):
            pool = Pool(5)
            for i in range(1,6):
                pool.apply_async(func=self.getUrls,args=(i,))
            time.sleep(20)
            #GoogleSpider.getUrls(1)
            pool.close()
            pool.join()
    
    
    if __name__ == "__main__":
        print(banner)
        parser = OptionParser("UrlSpider.py -f words.txt")
        parser.add_option("-f", "--file",action="store",type="string",dest="file",help="words.txt")
        (options, args) = parser.parse_args()
        if options.file:
            file = options.file
            with open(file,'r') as f:
                for line in f.readlines():
                    word = line.strip()
                    word = "site:"+word
                    print("\033[1;37;40m"+word+"\033[0m")
                    bingSpider = BingSpider()
                    bingSpider.word = word
                    bingSpider.main()
                    googleSpider = GoogleSpider()
                    googleSpider.word = word
                    googleSpider.main()
        else:
            parser.error('incorrect number of arguments')
    

    The combined script prints and saves the same kind of results as the two standalone versions, with one output file per search engine.

    Processing the results

    When the scripts finish, they leave behind .txt result files.

    Each line in a result file is the text form of a Python dict with a title and a url field.
    To extract just the URLs and deduplicate them, use the formatUrls.py post-processing script below.

    The code:

    #!/usr/bin/env python3
    # -*- coding: utf-8 -*-
    
    import os
    import sys
    
    def formatUrls(oldfilename, newfilename):
        # read the spider output, pull the url value out of each line
        # (every line looks like {'title': '...', 'url': '...'}) and
        # write the deduplicated URLs to the new file
        if os.path.exists(oldfilename):
            urls = set()
            with open(oldfilename, "r", encoding="utf-8") as f:
                for line in f.readlines():
                    url = line[line.index("'url': '")+8:-3]
                    print(url)
                    urls.add(url)
            with open(newfilename, 'a+') as f:
                for url in urls:
                    f.write(url + '\n')
    
    
    if __name__ == "__main__":
        if len(sys.argv) == 3:
            oldfilename = sys.argv[1]
            newfilename = sys.argv[2]
            formatUrls(oldfilename, newfilename)
        else:
            print('Usage: python3 formatUrls.py google-2020-12-23-13.txt result.txt')
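
    The string slicing above works as long as every line keeps exactly the {'title': ..., 'url': ...} shape. Because each line is the repr of a Python dict, a slightly more robust alternative (just a sketch; extract_url is not part of the original script) is to parse the line with ast.literal_eval:

    import ast

    def extract_url(line):
        # each line was written as str({'title': ..., 'url': ...}),
        # so literal_eval can turn it back into a dict
        try:
            return ast.literal_eval(line.strip())["url"]
        except (ValueError, SyntaxError, KeyError):
            return None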
    

    Running it prints each extracted URL and writes the deduplicated list to the result file.

    Summary

    Adjust session.proxies to point at your own proxy. In my tests the scripts work on macOS and Linux; on Windows they may raise errors. One likely cause is that Windows starts worker processes with spawn rather than fork, so the module-level word set inside __main__ is not inherited by the workers; a sketch of one possible fix follows.
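
    A stripped-down sketch (using a plain module-level function rather than the original class, and site:example.com as a placeholder query) that passes the query through apply_async instead of relying on a global:

    from multiprocessing import Pool
    import requests

    def getUrls(word, page):
        # `word` arrives as an argument, so it also exists inside workers
        # started with spawn (the default start method on Windows)
        url = "https://cn.bing.com/search?q={}&first={}&FORM=PERE".format(word, page)
        return requests.get(url).status_code

    if __name__ == "__main__":
        pool = Pool(5)
        for i in range(1, 5):
            pool.apply_async(func=getUrls, args=("site:example.com", i))
        pool.close()
        pool.join()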
    If you spot any problems, corrections are welcome; feel free to leave a comment.
