Python source code for scraping Jianshu titles and content


Author: Cocoa_Coder | Published 2016-10-07 09:18 | Read 146 times

    A very simple scraping program, suitable for beginners.

    The source code is as follows:

    from urllib.request import urlopen
    from bs4 import BeautifulSoup
    import pymysql

    html = urlopen("http://www.jianshu.com")
    bsobj = BeautifulSoup(html, "html.parser")
    # print(bsobj.findAll("h4", {"class": "title"}))  # inspect the fetched tags

    SqlConnect = pymysql.connect(host='localhost', user='root', password='123456',
                                 db='liusenTestURL', charset='utf8mb4')
    cur = SqlConnect.cursor()  # get a cursor


    # Write one article into the database
    def writeDataBase(title, content, textURL):
        cur.execute("INSERT INTO jianshuTEXT (title, content, URL) VALUES (%s, %s, %s)",
                    (title, content, textURL))
        cur.connection.commit()


    # Fetch the title and body of a single article
    def gainContent(contentHtml):
        contenthtml = urlopen(contentHtml)
        contentBsObj = BeautifulSoup(contenthtml, "html.parser")

        textTitle = contentBsObj.find('title').get_text()
        print('title : ' + textTitle)
        print('----------------------')

        textContent = contentBsObj.find("div", {"class": "show-content"}).get_text()
        # print(textContent)

        writeDataBase(textTitle, textContent, contentHtml)


    try:
        for title in bsobj.find("ul", {"class": "article-list thumbnails"}).findAll("h4", {"class": "title"}):
            # print(title.find("a"))
            if 'href' in title.find("a").attrs:
                contenthtml = 'http://www.jianshu.com' + title.find("a").attrs['href']
                print(contenthtml)
                gainContent(contenthtml)
    finally:
        cur.close()
        SqlConnect.close()
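The list-page selector logic can be exercised offline against a small HTML fragment; the snippet below is a minimal sketch, with the fragment invented to mimic the front page's structure:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the Jianshu front page, invented for illustration.
SNIPPET = """
<ul class="article-list thumbnails">
  <li><h4 class="title"><a href="/p/abc123">First post</a></h4></li>
  <li><h4 class="title"><a href="/p/def456">Second post</a></h4></li>
  <li><h4 class="title"><a>No link here</a></h4></li>
</ul>
"""

bsobj = BeautifulSoup(SNIPPET, "html.parser")
urls = []
for title in bsobj.find("ul", {"class": "article-list thumbnails"}).findAll("h4", {"class": "title"}):
    a = title.find("a")
    if 'href' in a.attrs:  # skip anchors that carry no href, as the original loop does
        urls.append('http://www.jianshu.com' + a.attrs['href'])

print(urls)  # ['http://www.jianshu.com/p/abc123', 'http://www.jianshu.com/p/def456']
```

Note that the third `<h4>` is skipped because its anchor has no `href`, which is exactly what the `'href' in ... .attrs` guard is for.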
    
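The parameterized `INSERT` is also worth noting: passing the values as a tuple lets the driver do the quoting, instead of string-formatting them into the SQL. A minimal sketch of the same pattern, substituting the stdlib sqlite3 module for pymysql so it runs without a MySQL server (the row values are invented):

```python
import sqlite3

# In-memory database standing in for the MySQL instance.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE jianshuTEXT (title TEXT, content TEXT, URL TEXT)")

def writeDataBase(title, content, textURL):
    # Placeholders (? for sqlite3, %s for pymysql) leave quoting to the driver,
    # so titles containing quotes cannot break the statement.
    cur.execute("INSERT INTO jianshuTEXT (title, content, URL) VALUES (?, ?, ?)",
                (title, content, textURL))
    conn.commit()

writeDataBase("A title with 'quotes'", "body text", "http://www.jianshu.com/p/abc")
cur.execute("SELECT title, URL FROM jianshuTEXT")
print(cur.fetchall())  # [("A title with 'quotes'", 'http://www.jianshu.com/p/abc')]
```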

    You are welcome to discuss and learn together.

    Sometimes the page encoding is not UTF-8, which complicates things. Suppose the third-party requests library is used for fetching: the downloaded data then needs a conversion step. For a gb2312-encoded page, handle it as follows, otherwise the Chinese text will come out garbled:

    import requests

    detailURL = "http://xxx.xxx.xxxxxx.com/"
    # headers is assumed to be defined elsewhere (e.g. a User-Agent dict)
    html = requests.session().get(detailURL, headers=headers)
    jieshouText = html.text.encode('ISO-8859-1', "ignore").decode(
        requests.utils.get_encodings_from_content(html.text)[0], "ignore")
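To see why this encode/decode round-trip works: when a response declares no charset, requests falls back to ISO-8859-1, so `html.text` is the page's gb2312 bytes mis-decoded as Latin-1. Since Latin-1 maps every byte value, re-encoding recovers the raw bytes losslessly, and they can then be decoded with the page's real charset. A self-contained sketch (the sample string is invented):

```python
# Simulate what happens when requests mis-decodes a gb2312 page.
raw = "中文内容".encode("gb2312")       # the bytes actually on the wire
mojibake = raw.decode("ISO-8859-1")     # requests' fallback decoding -> garbled text

# The round-trip from the article: re-encode to recover the raw bytes,
# then decode with the page's real charset.
fixed = mojibake.encode("ISO-8859-1", "ignore").decode("gb2312", "ignore")
print(fixed)  # 中文内容
```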
    

    Reference: garbled Chinese pages when scraping with Python's requests library:
    http://www.zhetenga.com/view/python的requests类抓取中文页面出现乱码-0abbaa140.html

    The explanation there is very detailed.


        Original title: Python source code for scraping Jianshu titles and content

        Original link: https://www.haomeiwen.com/subject/lwryyttx.html