Python 3 Web Scraping Study Notes (4)

Author: Veniendeavor | Published 2017-01-30 20:21 | Read 2988 times

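    These are my own short notes from learning web scraping with Python 3, kept as a memo. They will inevitably contain some errors and omissions, so corrections are welcome.
    Python 3 Web Scraping Study Notes (1)
    Python 3 Web Scraping Study Notes (2)
    Python 3 Web Scraping Study Notes (3)
    Python 3 Web Scraping Study Notes (5)
    Python 3 Web Scraping Study Notes (6)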


    5. Storing Scraped Data in a Database (MySQL)

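    Scraped data is collected so that it can be analyzed and used later. Storing it in txt files, as in the earlier notes, makes later processing awkward, and lookups become even more cumbersome as the amount of data grows. For that reason, scraped data is usually stored in a database, where it is easier to analyze afterwards.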

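    Here the database is MySQL, and the third-party library pymysql handles reading and writing between Python and MySQL. The configuration Python uses to connect to the MySQL database is: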

    db_config ={
        'host': '127.0.0.1',
        'port': 3306,
        'user': 'root',
        'password': '',
        'db': 'pytest',
        'charset': 'utf8'
    }
    

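    As an example, we scrape the article titles and URLs from the Jianshu home page. First, examine the target information to be scraped: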

    Paste_Image.png

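    As the screenshot above shows, each article title sits inside an a tag, and its url (href) contains only the trailing path portion, so it is best to complete it into a full URL before storing it.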
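One way to complete a relative href is the standard library's `urllib.parse.urljoin`. This is a minimal sketch, not the approach used in the listing later in this note (which concatenates strings); the helper name `complete_url` and the sample href are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.jianshu.com/'


def complete_url(href, base=BASE_URL):
    """Turn a site-relative href into an absolute URL.

    urljoin leaves an href that is already absolute unchanged.
    """
    return urljoin(base, href)


# '/p/abc123' is a hypothetical href, not one taken from the real page
print(complete_url('/p/abc123'))  # http://www.jianshu.com/p/abc123
```

Compared with plain concatenation, `urljoin` does not produce a doubled or missing slash when the base ends with `/` or the href lacks a leading one.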

    In MySQL, create a database named pytest and, inside it, a table named titles with the fields id (int, auto-increment), title (varchar), and url (varchar), as shown below:

    Paste_Image.png
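For readers who prefer to create the schema from Python rather than a GUI tool, the database and table described above could be set up with statements along these lines. This is a sketch under assumptions: the note only says title and url are varchar, so the `VARCHAR(255)` lengths, the `NOT NULL` constraints, the `utf8mb4` character set, and the helper name `create_schema` are all my own choices. The function expects an already open pymysql connection; if the pytest database does not exist yet, that connection would presumably have to be opened without the `'db'` entry in the config.

```python
# Assumed sizes and charset; the note does not specify them.
CREATE_DATABASE_SQL = (
    "CREATE DATABASE IF NOT EXISTS pytest DEFAULT CHARACTER SET utf8mb4"
)

CREATE_TITLES_SQL = """
CREATE TABLE IF NOT EXISTS titles (
    id    INT AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255) NOT NULL,
    url   VARCHAR(255) NOT NULL
) DEFAULT CHARSET=utf8mb4
"""


def create_schema(connection):
    """Create the pytest database and the titles table if they are missing.

    `connection` is any open connection whose cursor works as a context
    manager, for example one returned by pymysql.connect(...).
    """
    with connection.cursor() as cursor:
        cursor.execute(CREATE_DATABASE_SQL)
        cursor.execute("USE pytest")
        cursor.execute(CREATE_TITLES_SQL)
    connection.commit()
```

Declaring a Unicode character set on the table is one way to avoid "Incorrect string value" errors when inserting Chinese titles, and creating the database explicitly avoids "Unknown database 'pytest'".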

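    The database workflow is: get a database connection (connection) -> get a cursor (cursor) -> execute SQL statements (execute) -> commit the transaction (commit) -> close the database connection (close). The concrete implementation is as follows: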

    # -*- coding:utf-8 -*-
    
    from urllib import request
    from bs4 import BeautifulSoup
    import pymysql
    
    # MySQL connection settings (as a dictionary)
    db_config ={
        'host': '127.0.0.1',
        'port': 3306,
        'user': 'root',
        'password': '',
        'db': 'pytest',
        'charset': 'utf8'
    }
    # Get the database connection
    connection = pymysql.connect(**db_config)
    
    # Alternative: configure and get the connection with keyword arguments
    # connection = pymysql.connect(host='127.0.0.1',
    #                        port=3306,
    #                        user='root',
    #                        password='',
    #                        db='pytest',
    #                        charset='utf8')
    
    
    url = r'http://www.jianshu.com/'
    # Imitate a browser's request headers
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
    }
    page = request.Request(url, headers=headers)
    page_info = request.urlopen(page).read().decode('utf-8')
    soup = BeautifulSoup(page_info, 'html.parser')
    urls = soup.find_all('a', 'title')
    
    try:
        # Get a database cursor
        with connection.cursor() as cursor:
            sql = 'insert into titles(title, url) values(%s, %s)'
            for u in urls:
                # Execute the SQL statement
                cursor.execute(sql, (u.string, r'http://www.jianshu.com'+u.attrs['href']))
        # Commit the transaction
        connection.commit()
    finally:
        # Close the database connection
        connection.close()
    

    Result of running the code:

    Paste_Image.png
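To try the connection -> cursor -> execute -> commit -> close flow without a running MySQL server, the same steps can be rehearsed against an in-memory SQLite database from the standard library. This is a stand-in, not the setup used in this note: `sqlite3` uses `?` placeholders where pymysql uses `%s`, the table definition is SQLite's, the rows are made-up sample data, and the `rollback` on failure and the `executemany` batch insert are additions that the listing above does not have.

```python
import sqlite3


def save_titles(connection, rows):
    """Insert (title, url) pairs, committing on success and rolling back on error."""
    cursor = connection.cursor()
    try:
        # executemany runs the same parameterized statement once per row
        cursor.executemany('INSERT INTO titles(title, url) VALUES (?, ?)', rows)
        connection.commit()
    except Exception:
        connection.rollback()
        raise
    finally:
        cursor.close()


# In-memory database, so nothing is written to disk
connection = sqlite3.connect(':memory:')
connection.execute(
    'CREATE TABLE titles ('
    'id INTEGER PRIMARY KEY AUTOINCREMENT, title TEXT, url TEXT)'
)

# Hypothetical sample rows, not real scraped data
rows = [
    ('Sample title A', 'http://www.jianshu.com/p/aaa'),
    ('Sample title B', 'http://www.jianshu.com/p/bbb'),
]
save_titles(connection, rows)

stored = connection.execute('SELECT title, url FROM titles ORDER BY id').fetchall()
connection.close()
```

Passing the values separately from the SQL text, as both this sketch and the original listing do, lets the driver handle quoting instead of building the statement by string formatting.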

      Reader comments

      • 乂_262d: Hello, I created the pytest table, but I still get this error
        =======================================
        File "e:\softwareDevelopment\pythonProject\03_crawling\crawling_004.py", line 17, in <module>
        connection = pymysql.connect(**db_config)
        File "E:\softwareDevelopment\python\lib\site-packages\pymysql-0.9.2-py3.6.egg\pymysql\__init__.py", line 94, in Connect
        File "E:\softwareDevelopment\python\lib\site-packages\pymysql-0.9.2-py3.6.egg\pymysql\connections.py", line 326, in __init__
        File "E:\softwareDevelopment\python\lib\site-packages\pymysql-0.9.2-py3.6.egg\pymysql\connections.py", line 597, in connect
        File "E:\softwareDevelopment\python\lib\site-packages\pymysql-0.9.2-py3.6.egg\pymysql\connections.py", line 855, in _request_authentication
        File "E:\softwareDevelopment\python\lib\site-packages\pymysql-0.9.2-py3.6.egg\pymysql\connections.py", line 682, in _read_packet
        File "E:\softwareDevelopment\python\lib\site-packages\pymysql-0.9.2-py3.6.egg\pymysql\protocol.py", line 220, in check_error
        File "E:\softwareDevelopment\python\lib\site-packages\pymysql-0.9.2-py3.6.egg\pymysql\err.py", line 109, in raise_mysql_exception
        pymysql.err.InternalError: (1049, "Unknown database 'pytest'")
      • 谁说世界早已没有选择_7a18: Hello, why do I get an error when I run this in my environment?
        **************************************************************
        Traceback (most recent call last):
        File "F:/工具/Python/py/crawler/jianshu3.py", line 30, in <module>
        cursor.execute(sql, (u.string, r'http://www.jianshu.com'+u.attrs['href']))
        File "C:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-packages\pymysql\cursors.py", line 165, in execute
        result = self._query(query)
        File "C:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-packages\pymysql\cursors.py", line 321, in _query
        conn.query(q)
        File "C:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-packages\pymysql\connections.py", line 860, in query
        self._affected_rows = self._read_query_result(unbuffered=unbuffered)
        File "C:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-packages\pymysql\connections.py", line 1061, in _read_query_result
        result.read()
        File "C:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-packages\pymysql\connections.py", line 1349, in read
        first_packet = self.connection._read_packet()
        File "C:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-packages\pymysql\connections.py", line 1018, in _read_packet
        packet.check_error()
        File "C:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-packages\pymysql\connections.py", line 384, in check_error
        err.raise_mysql_exception(self._data)
        File "C:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-packages\pymysql\err.py", line 107, in raise_mysql_exception
        raise errorclass(errno, errval)
        pymysql.err.InternalError: (1366, "Incorrect string value: '\\xE5\\x88\\x98\\xE8\\x8B\\xA5...' for column 'title' at row 1")
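        何烨坪: It is an encoding problem with the title column of the table; you need to set it to utf-8
        Alanxx: The database is not set up correctly. Check your column settings, and make id the primary key with auto-increment
      • d778b09685bb: This is the first tutorial since I started learning Python where all the code ran once I had typed it in. I got stuck for a little while at the MySQL installation, but it was not a big problem. The reasoning behind the code is really clear, and I like this kind of tutorial. I have sent a tip; it is not much money, just a token of my appreciation.
        Veniendeavor: @DH_4bc4 Thanks! Let's all keep at it together
      • d778b09685bb: I watched a lot of video tutorials without getting anywhere, and your tutorial got me started. Really good
      • 百曾: Hello
        Veniendeavor: @百曾 Hello
      • may_5ok: connection = pymysql.connect(**db_config): may I ask why the ** is needed here?
        Veniendeavor: @may_5ok Keep it up 💪
        may_5ok: @Veniendeavor :blush: Thanks for the answer. Without ** it raises TypeError: getaddrinfo() argument 1 must be string or None, and I don't know what ** does :dizzy_face:
        Veniendeavor: @may_5ok It turns the dictionary db_config into arguments that are passed in. For details, look up the Python syntax
      • 6ae804024906: I can't import from urllib import request. Which pytest version are you using?
      • boom__: Thumbs up. A problem that had bothered me for a week was solved after reading your code, and the reasoning behind the code is very clear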
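One of the comments above asks what `**` does in `pymysql.connect(**db_config)`. The sketch below illustrates dictionary unpacking with a made-up function, `fake_connect`, which only echoes its arguments and is not pymysql's real signature. It also suggests why dropping the `**` produced the reported `getaddrinfo()` TypeError: assuming `host` is the first positional parameter in the pymysql version the commenter used, the whole dictionary is taken as the host.

```python
def fake_connect(host=None, user=None, password='', db=None, port=0, charset=''):
    """Hypothetical stand-in for a connect function; it only reports its arguments."""
    return {'host': host, 'user': user, 'password': password,
            'db': db, 'port': port, 'charset': charset}


db_config = {
    'host': '127.0.0.1',
    'port': 3306,
    'user': 'root',
    'password': '',
    'db': 'pytest',
    'charset': 'utf8'
}

# ** unpacks the dictionary: each key becomes a keyword argument
unpacked = fake_connect(**db_config)

# ...which is equivalent to spelling the keyword arguments out by hand
explicit = fake_connect(host='127.0.0.1', port=3306, user='root',
                        password='', db='pytest', charset='utf8')

# Without **, the whole dictionary is passed as the first positional argument
positional = fake_connect(db_config)

print(unpacked == explicit)             # True
print(positional['host'] is db_config)  # True
```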

      Article link: https://www.haomeiwen.com/subject/wtlcittx.html