Crawler Practice (Draft)

Author: 知识学者 | Published: 2018-04-27 10:04 | Views: 117

    Jianshu's robots.txt:

    # See http://www.robotstxt.org/wc/norobots.html for documentation on how to use the robots.txt file
    #
    # To ban all spiders from the entire site uncomment the next two lines:
    User-agent: *
    Disallow: /search
    Disallow: /convos/
    Disallow: /notes/
    Disallow: /admin/
    Disallow: /adm/
    Disallow: /p/0826cf4692f9
    Disallow: /p/d8b31d20a867
    Disallow: /collections/*/recommended_authors
    Disallow: /trial/*
    Disallow: /keyword_notes
    Disallow: /stats-2017/*
    
    User-agent: trendkite-akashic-crawler
    Request-rate: 1/2 # load 1 page per 2 seconds
    Crawl-delay: 60
    
    User-agent: YisouSpider
    Request-rate: 1/10 # load 1 page per 2 seconds
    Crawl-delay: 60
    
    User-agent: Cliqzbot
    Disallow: /
    
    User-agent: Googlebot
    Request-rate: 1/1 # load 1 page per 2 seconds
    Crawl-delay: 10
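
Before fetching a page, rules like the ones listed above can be checked in code. The following is a minimal sketch using the standard library's `urllib.robotparser`. It parses an inline excerpt of the listing so it runs offline; the agent name `my-practice-crawler` and the path `/notes/123` are made up for illustration. Note that this parser matches paths by prefix only, so the wildcard rules in the listing are left out of the excerpt.

```python
import urllib.robotparser

# Excerpt of the rules listed above, kept inline so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Disallow: /convos/
Disallow: /notes/
Disallow: /admin/
Disallow: /adm/

User-agent: Cliqzbot
Disallow: /

User-agent: Googlebot
Request-rate: 1/1
Crawl-delay: 10
"""


def build_parser(robots_text):
    """Return a RobotFileParser loaded from robots.txt text."""
    parser = urllib.robotparser.RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
base = "https://www.jianshu.com"
agent = "my-practice-crawler"  # hypothetical agent name, falls under "User-agent: *"

# The collection page used in the script is not covered by any Disallow rule.
allowed_collection = parser.can_fetch(agent, base + "/c/bd38bd199ec6")
# Anything under /notes/ is disallowed for generic agents (hypothetical path).
allowed_notes = parser.can_fetch(agent, base + "/notes/123")
# Crawl-delay declared for Googlebot, in seconds.
googlebot_delay = parser.crawl_delay("Googlebot")

print(allowed_collection, allowed_notes, googlebot_delay)
```

Against the live site, `parser.set_url(base + "/robots.txt")` followed by `parser.read()` would load the current rules instead of the inline excerpt.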
    
    import urllib.request
    import urllib.parse
    import re
    
    # Collection page to scrape
    url = "https://www.jianshu.com/c/bd38bd199ec6"
    req = urllib.request.Request(url)
    # Send a browser-like User-Agent with the request
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
                                 '(KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36')
    with urllib.request.urlopen(req) as response:
        html = response.read().decode("utf-8")
    # print(html)
    
    # Capture the text inside each <p class="abstract"> element
    pattern = re.compile(r'<p class="abstract">\s+(.*)\s+</p>')
    result = pattern.findall(html)
    
    # for each in result:
    #     print(each)
    # print(result)
    
    print("the length=============", len(result))
    
    # Guard against an IndexError when fewer than two abstracts are found
    if len(result) > 1:
        print("----------------", result[1])
        print("*******", len(result[1]))
    
    [Image: 爬虫.png]

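    Modeled on the article 《Python爬虫初学(一)》 (Python Crawler Basics, Part 1), which scrapes jokes (爬取段子).

    There is still work to do and a lot to revise, for example downloading the 简书交友 (Jianshu friend-making) articles, or scraping images, and so on.
    I am still not very familiar with re (regular expressions).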
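
One way to get more comfortable with `re` is to try a pattern offline on a small string before pointing it at a live page. The sketch below does that for the abstract pattern. The HTML fragment is made up to imitate the page markup, and the pattern is a variant of the script's, not the original: it uses a lazy group `(.*?)` together with `re.S` so the captured text excludes the surrounding whitespace and may span several lines.

```python
import re

# Hypothetical fragment imitating the page markup; real pages may differ.
SAMPLE_HTML = """
<p class="abstract">
  first summary line
</p>
<p class="abstract">
  second summary line
</p>
"""

# \s* swallows the whitespace after the opening tag, (.*?) captures as little
# as possible, and the second \s* keeps trailing whitespace out of the group.
# re.S lets "." match newlines, so multi-line abstracts are captured too.
ABSTRACT_RE = re.compile(r'<p class="abstract">\s*(.*?)\s*</p>', re.S)

abstracts = ABSTRACT_RE.findall(SAMPLE_HTML)
for index, text in enumerate(abstracts):
    print(index, repr(text))
```

With a greedy `(.*)` and `re.S`, the group would run from the first opening tag to the last `</p>` in the string, which is why the lazy form is used here.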

    
    <a class="nickname" target="_blank" href="/u/1195c9b43c46">
    大大懒鱼</a>
    <span class="time" data-shared-at="2018-04-26T21:15:25+08:00">
    </span>
    <a class="title" target="_blank" href="/p/a1d691ab1111">
    【简书交友】大大懒鱼:爱好服装搭配的特别能吃麻辣中年少女</a>
    

    I still need to write the regular expressions for these fields, and handle pagination (翻页), among other things.
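
As a possible starting point for both items, here is a rough sketch that pulls the author, timestamp, and title out of the fragment shown above and builds URLs for later pages. It assumes plain `href` values such as `/u/1195c9b43c46`, and it assumes the collection page accepts a `page` query parameter, which the original post does not confirm. `parse_items` and `page_url` are hypothetical helper names.

```python
import re
from urllib.parse import urlencode, urljoin

BASE_URL = "https://www.jianshu.com"

# Fragment from the listing above, with plain href values.
SAMPLE_HTML = """
<a class="nickname" target="_blank" href="/u/1195c9b43c46">
大大懒鱼</a>
<span class="time" data-shared-at="2018-04-26T21:15:25+08:00">
</span>
<a class="title" target="_blank" href="/p/a1d691ab1111">
【简书交友】大大懒鱼:爱好服装搭配的特别能吃麻辣中年少女</a>
"""

# One match per article: author link, then timestamp, then title link.
# Caveat: if an entry lacks one of the three parts, the lazy ".*?" gaps can
# run into the next entry, so treat the result with care.
ITEM_RE = re.compile(
    r'<a class="nickname"[^>]*href="(?P<author_href>[^"]+)"[^>]*>\s*(?P<author>.*?)\s*</a>'
    r'.*?data-shared-at="(?P<shared_at>[^"]+)"'
    r'.*?<a class="title"[^>]*href="(?P<href>[^"]+)"[^>]*>\s*(?P<title>.*?)\s*</a>',
    re.S,
)


def parse_items(html):
    """Return one dict per article found in the HTML text."""
    items = []
    for match in ITEM_RE.finditer(html):
        item = match.groupdict()
        # Turn the relative links into absolute URLs.
        item["author_url"] = urljoin(BASE_URL, item.pop("author_href"))
        item["url"] = urljoin(BASE_URL, item.pop("href"))
        items.append(item)
    return items


def page_url(collection_url, page):
    """Build the URL of a later page.

    ASSUMPTION: the site paginates with a "page" query parameter.
    """
    return collection_url + "?" + urlencode({"page": page})


items = parse_items(SAMPLE_HTML)
for item in items:
    print(item["author"], item["shared_at"], item["title"], item["url"])
print(page_url(BASE_URL + "/c/bd38bd199ec6", 2))
```

For anything beyond practice, an HTML parser is usually less fragile than regular expressions, since small markup changes such as attribute order can break patterns like these.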
