Crawling All Articles and Links of a Given Author

Author: DoctorLDQ | Published 2017-08-09 09:22 · 32 reads
    import re

    import requests
    from bs4 import BeautifulSoup

    jianshu_url = 'http://www.jianshu.com'
    # To crawl a different author, change the slug between /u/ and ? below
    base_url = 'http://www.jianshu.com/u/54b5900965ea?order_by=shared_at&page='

    user_agent = ('Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36')
    headers = {'User-Agent': user_agent}
    # When the page number runs past the last page, Jianshu redirects to the author's
    # timeline URL; the slug between /users/ and /timeline can be changed accordingly
    pattern = 'http://www.jianshu.com/users/6c7437065202/timeline'
    articlePage = []
    f = open('run_write.txt', 'w', encoding='utf-8')

    def download_page():
        i = 1
        while True:
            r = requests.get(base_url + str(i), headers=headers)
            articlePage.append(r.url)
            # If the request was redirected to the timeline URL, the last page has been passed
            if r.url == pattern:
                break
            print('Requesting page {}'.format(i))
            i += 1
            get_article_from_page(r.url)

    def get_article_from_page(url):
        r = requests.get(url, headers=headers)
        soup = BeautifulSoup(r.text, 'lxml')
        # Each article sits in an <li> whose id looks like "note-12345678"
        articleList = soup.find_all('li', id=re.compile(r'note-\d+'))
        for item in articleList:
            articleTitle = item.find('a', class_='title').text
            articleUrl = jianshu_url + item.find('a', class_='title')['href']
            print(articleTitle + 5 * '  ' + articleUrl)
            f.write(articleTitle + '\t' + articleUrl + '\n')  # save each title and link to run_write.txt

    download_page()
    f.close()
    
    
    
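A note on the stopping condition: the script above detects the last page by comparing the redirected URL against a hard-coded timeline address, which has to be edited for every author. Below is a minimal sketch of an alternative, assuming the listing page simply stops returning <li id="note-..."> entries once the last page is passed; page_has_articles is a hypothetical helper name, not part of the original script.

    import re

    import requests
    from bs4 import BeautifulSoup

    def page_has_articles(url, headers):
        # True if the page still contains at least one article entry
        r = requests.get(url, headers=headers)
        soup = BeautifulSoup(r.text, 'lxml')
        return bool(soup.find_all('li', id=re.compile(r'note-\d+')))

    # Possible usage with the names defined in the main script:
    # i = 1
    # while page_has_articles(base_url + str(i), headers):
    #     get_article_from_page(base_url + str(i))
    #     i += 1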
