Python source code for scraping Jianshu article titles and content

Author: Cocoa_Coder | Source: published 2016-10-07 09:18, read 146 times

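This is a very simple scraper, suitable for beginners. It reads the article links on the Jianshu home page, fetches each article, and stores its title, content, and URL in a MySQL table.

The source code is as follows: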

from urllib.request import urlopen

from bs4 import BeautifulSoup
import pymysql

html = urlopen("http://www.jianshu.com")
bsobj = BeautifulSoup(html, "html.parser")

# print(bsobj.findAll("h4", {"class": "title"}))  # print the matched elements

SqlConnect = pymysql.connect(host='localhost', user='root', password='123456', db='liusenTestURL', charset='utf8mb4')
cur = SqlConnect.cursor()  # get a cursor


# Write one article (title, content, URL) to the database
def writeDataBase(title, content, textURL):
    cur.execute("INSERT INTO jianshuTEXT (title,content,URL) VALUES (%s,%s,%s)", (title, content, textURL))
    cur.connection.commit()


# Fetch one article page and store its title and content
def gainContent(contentHtml):
    contentPage = urlopen(contentHtml)
    contentBsObj = BeautifulSoup(contentPage, "html.parser")
    textTitle = contentBsObj.find('title').get_text()
    print('title : ' + textTitle)
    print('----------------------')
    contentDiv = contentBsObj.find("div", {"class": "show-content"})
    if contentDiv is None:  # skip pages without the expected content container
        return
    textContent = contentDiv.get_text()
    # print(textContent)
    writeDataBase(textTitle, textContent, contentHtml)


try:
    for title in bsobj.find("ul", {"class": "article-list thumbnails"}).findAll("h4", {"class": "title"}):
        # print(title.find("a"))
        link = title.find("a")
        if link is not None and 'href' in link.attrs:
            contenthtml = 'http://www.jianshu.com' + link.attrs['href']
            print(contenthtml)
            gainContent(contenthtml)
finally:
    # always release the cursor and the connection, even if scraping fails
    cur.close()
    SqlConnect.close()
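
The loop above builds each article URL by concatenating the site root with the href value, which assumes every link is a site-relative path starting with "/". As a minimal sketch (the href values below are made up for illustration, and build_article_url is a hypothetical helper, not part of the original script), urljoin from the standard library resolves both relative and already-absolute links:

```python
from urllib.parse import urljoin

BASE_URL = "http://www.jianshu.com"  # site root used by the original script


def build_article_url(href, base=BASE_URL):
    """Resolve an href taken from an <a> tag against the site root."""
    return urljoin(base, href)


# Hypothetical href values, for illustration only
print(build_article_url("/p/abc123"))
print(build_article_url("http://www.jianshu.com/p/abc123"))
```

In the script this could stand in for the string concatenation inside the for loop. Plain concatenation would duplicate the host whenever an href is already absolute.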

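Feel free to discuss and learn together.

Sometimes a page is not encoded in UTF-8, which makes things trickier. Suppose the third-party library requests is used for the request. When the server's Content-Type header for a text page does not declare a charset, requests falls back to ISO-8859-1 when building the text, so the downloaded data needs a conversion step. For pages encoded in gb2312, handle it as follows, otherwise the Chinese text comes out garbled: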

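import requests

detailURL = "http://xxx.xxx.xxxxxx.com/"
headers = {"User-Agent": "Mozilla/5.0"}  # placeholder; the original does not show its headers

html = requests.session().get(detailURL, headers=headers)
# undo the ISO-8859-1 decoding, then decode with the charset declared in the page itself
jieshouText = html.text.encode('ISO-8859-1', "ignore").decode(requests.utils.get_encodings_from_content(html.text)[0], "ignore")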
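
To see why the encode/decode round trip works, here is a small offline sketch that needs no network access. It simulates the wrongly decoded text by hand, and detect_declared_charset is a hypothetical, simplified stand-in for requests.utils.get_encodings_from_content; the sample page is made up for illustration.

```python
import re


def detect_declared_charset(markup, default="utf-8"):
    """Return the charset named in a <meta> tag, or `default` if none is found."""
    match = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', markup, flags=re.I)
    return match.group(1) if match else default


def repair_mojibake(wrong_text):
    """Recover text that was decoded as ISO-8859-1 by mistake."""
    # ISO-8859-1 maps each byte to exactly one character, so encoding restores the raw bytes
    raw = wrong_text.encode("ISO-8859-1", "ignore")
    # the ASCII <meta> tag survives the wrong decoding, so the charset can still be read
    charset = detect_declared_charset(wrong_text)
    return raw.decode(charset, "ignore")


# Made-up gb2312 page, for illustration only
page = ('<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=gb2312"></head><body>简书</body></html>')
raw_bytes = page.encode("gb2312")           # what the server would send
wrong_text = raw_bytes.decode("ISO-8859-1")  # what a wrong default decoding produces
fixed = repair_mojibake(wrong_text)
print(fixed == page)
```

With requests itself, another common approach is to set the response's encoding attribute (for example to "gb2312") before reading its text attribute, which avoids the round trip.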

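Reference: "python的requests类抓取中文页面出现乱码" (garbled Chinese text when scraping pages with Python's requests)
http://www.zhetenga.com/view/python的requests类抓取中文页面出现乱码-0abbaa140.html

The explanation there is very detailed.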

Article link: https://www.haomeiwen.com/subject/lwryyttx.html
