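These are my own notes from learning web crawling with Python 3, kept as a memo. They inevitably contain mistakes and omissions, so corrections are welcome.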
5 Storing crawled data in a database (MySQL)
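Crawled data is collected so that it can be analyzed and used later. Storing it in txt files, as in the earlier notes, makes later processing awkward, and lookups become harder still as the data volume grows. For that reason, crawled data is usually stored in a database.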
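Here the database is MySQL, and the third-party library pymysql handles reading and writing between Python and MySQL. The configuration for connecting Python to the MySQL database is: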
db_config = {
    'host': '127.0.0.1',
    'port': 3306,
    'user': 'root',
    'password': '',
    'db': 'pytest',
    'charset': 'utf8'
}
Take scraping the article titles and URLs on the Jianshu home page as an example. First, analyze the target information:
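[Image: Paste_Image.png]
As the screenshot above shows, each article title sits in an a tag, and its url (href) contains only the trailing path portion, so it is best to complete the URL before storing it.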
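One way to complete the relative href is the standard library's `urllib.parse.urljoin`, which also leaves an already absolute link untouched. This is a minimal sketch: the helper name `complete_url` and the sample path `/p/0123abcd` are made up for illustration and do not come from the Jianshu page.

```python
from urllib.parse import urljoin

BASE_URL = 'http://www.jianshu.com/'  # home page URL used in this note


def complete_url(href, base=BASE_URL):
    """Turn a relative href such as '/p/xxxx' into an absolute URL."""
    return urljoin(base, href)


# '/p/0123abcd' is a made-up path used only for illustration
print(complete_url('/p/0123abcd'))
# An href that is already absolute is returned unchanged
print(complete_url('http://www.jianshu.com/p/0123abcd'))
```

Compared with plain string concatenation, `urljoin` avoids doubled or missing slashes when the base URL or the href changes shape.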
In MySQL, create a database named pytest, and in it a table named titles with the columns id (int, auto-increment), title (varchar), and url (varchar), as follows:
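The table described here could be created from Python with a statement along the following lines. This is only a sketch: the note gives the column types but not their lengths, so `VARCHAR(255)`, the primary key on id, and the `utf8mb4` charset are assumptions, and `create_titles_table` is a hypothetical helper name.

```python
# Column lengths, the primary key, and the utf8mb4 charset are assumptions;
# the note only specifies id (int, auto-increment), title and url (varchar).
CREATE_TITLES_SQL = """
CREATE TABLE IF NOT EXISTS titles (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
    title VARCHAR(255),
    url VARCHAR(255)
) DEFAULT CHARSET = utf8mb4
"""


def create_titles_table(connection):
    """Create the titles table on an already open pymysql connection."""
    with connection.cursor() as cursor:
        cursor.execute(CREATE_TITLES_SQL)
```

It would be called as `create_titles_table(pymysql.connect(**db_config))` once the pytest database exists. No explicit commit is needed here, because MySQL commits DDL statements implicitly.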
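[Image: Paste_Image.png]
The database workflow is: get a database connection (connection) -> get a cursor (cursor) -> execute the SQL statement (execute) -> commit the transaction (commit) -> close the database connection (close). The implementation is as follows: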
# -*- coding:utf-8 -*-
from urllib import request
from bs4 import BeautifulSoup
import pymysql
# MySQL connection settings (as a dict)
db_config = {
    'host': '127.0.0.1',
    'port': 3306,
    'user': 'root',
    'password': '',
    'db': 'pytest',
    'charset': 'utf8'
}
# Get a database connection
connection = pymysql.connect(**db_config)
# Alternative: pass the settings as keyword arguments
# connection = pymysql.connect(host='127.0.0.1',
#                              port=3306,
#                              user='root',
#                              password='',
#                              db='pytest',
#                              charset='utf8')
url = r'http://www.jianshu.com/'
# Browser-like request headers
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}
page = request.Request(url, headers=headers)
page_info = request.urlopen(page).read().decode('utf-8')
soup = BeautifulSoup(page_info, 'html.parser')
# All <a> tags whose class is "title"
urls = soup.find_all('a', 'title')
try:
    # Get a database cursor
    with connection.cursor() as cursor:
        sql = 'insert into titles(title, url) values(%s, %s)'
        for u in urls:
            # Execute the SQL statement, completing the relative href
            cursor.execute(sql, (u.string, r'http://www.jianshu.com'+u.attrs['href']))
    # Commit the transaction
    connection.commit()
finally:
    # Close the database connection
    connection.close()
Result of running the code:
[Image: Paste_Image.png]
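As a variation on the connection -> cursor -> execute -> commit flow, the inserts can be wrapped in a small function that sends all rows in one `executemany` call and rolls back if anything fails. This is a sketch, not the method used in this note: `save_titles` is a hypothetical helper, and it assumes `connection` is an open pymysql connection and `rows` is an iterable of (title, url) pairs.

```python
INSERT_SQL = 'insert into titles(title, url) values(%s, %s)'


def save_titles(connection, rows):
    """Insert (title, url) pairs in one transaction.

    Returns the number of rows handed to the database. The caller
    stays responsible for closing the connection.
    """
    rows = list(rows)
    if not rows:
        return 0
    try:
        with connection.cursor() as cursor:
            # One call for all rows instead of one execute() per row
            cursor.executemany(INSERT_SQL, rows)
        connection.commit()
    except Exception:
        # Undo any partial inserts, then let the caller see the error
        connection.rollback()
        raise
    return len(rows)
```

With the variables from the script above, the call would look like `save_titles(connection, [(u.string, 'http://www.jianshu.com' + u.attrs['href']) for u in urls])`.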
Reader comments
=======================================
File "e:\softwareDevelopment\pythonProject\03_crawling\crawling_004.py", line 17, in <module>
connection = pymysql.connect(**db_config)
File "E:\softwareDevelopment\python\lib\site-packages\pymysql-0.9.2-py3.6.egg\pymysql\__init__.py", line 94, in Connect
File "E:\softwareDevelopment\python\lib\site-packages\pymysql-0.9.2-py3.6.egg\pymysql\connections.py", line 326, in __init__
File "E:\softwareDevelopment\python\lib\site-packages\pymysql-0.9.2-py3.6.egg\pymysql\connections.py", line 597, in connect
File "E:\softwareDevelopment\python\lib\site-packages\pymysql-0.9.2-py3.6.egg\pymysql\connections.py", line 855, in _request_authentication
File "E:\softwareDevelopment\python\lib\site-packages\pymysql-0.9.2-py3.6.egg\pymysql\connections.py", line 682, in _read_packet
File "E:\softwareDevelopment\python\lib\site-packages\pymysql-0.9.2-py3.6.egg\pymysql\protocol.py", line 220, in check_error
File "E:\softwareDevelopment\python\lib\site-packages\pymysql-0.9.2-py3.6.egg\pymysql\err.py", line 109, in raise_mysql_exception
pymysql.err.InternalError: (1049, "Unknown database 'pytest'")
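Error 1049 means the server has no database called pytest, so it has to be created before the script can connect to it. A minimal sketch of building that statement follows. `build_create_database_sql` is a hypothetical helper, and the `utf8mb4` default charset is an assumption. Database names cannot be passed as `%s` parameters, which is why the name is validated and then formatted into the SQL.

```python
import re


def build_create_database_sql(name):
    """Return a CREATE DATABASE statement for a simple identifier."""
    # Identifiers cannot be bound as %s parameters, so validate before formatting
    if not re.fullmatch(r'[A-Za-z0-9_]+', name):
        raise ValueError('unexpected database name: %r' % name)
    return 'CREATE DATABASE IF NOT EXISTS `%s` DEFAULT CHARACTER SET utf8mb4' % name


print(build_create_database_sql('pytest'))
```

The statement would be run through `cursor.execute(...)` on a connection opened without the `'db'` entry, for example `pymysql.connect(host='127.0.0.1', port=3306, user='root', password='')`. After that, connecting with the full db_config should work.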
**************************************************************
Traceback (most recent call last):
File "F:/工具/Python/py/crawler/jianshu3.py", line 30, in <module>
cursor.execute(sql, (u.string, r'http://www.jianshu.com'+u.attrs['href']))
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-packages\pymysql\cursors.py", line 165, in execute
result = self._query(query)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-packages\pymysql\cursors.py", line 321, in _query
conn.query(q)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-packages\pymysql\connections.py", line 860, in query
self._affected_rows = self._read_query_result(unbuffered=unbuffered)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-packages\pymysql\connections.py", line 1061, in _read_query_result
result.read()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-packages\pymysql\connections.py", line 1349, in read
first_packet = self.connection._read_packet()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-packages\pymysql\connections.py", line 1018, in _read_packet
packet.check_error()
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-packages\pymysql\connections.py", line 384, in check_error
err.raise_mysql_exception(self._data)
File "C:\Users\Administrator\AppData\Local\Programs\Python\Python35\lib\site-packages\pymysql\err.py", line 107, in raise_mysql_exception
raise errorclass(errno, errval)
pymysql.err.InternalError: (1366, "Incorrect string value: '\\xE5\\x88\\x98\\xE8\\x8B\\xA5...' for column 'title' at row 1")
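Error 1366 generally means the column's character set cannot represent the value being inserted. Decoding the bytes from the message shows ordinary three-byte Chinese characters. A plausible cause is therefore that the title column was created with a non-UTF-8 charset such as latin1, though this is an assumption, since the commenter's table definition is not shown. The sketch below decodes the bytes and gives one possible remedy as a SQL string.

```python
# Bytes copied from the error message above (the message truncates the rest)
raw = b'\xE5\x88\x98\xE8\x8B\xA5'
text = raw.decode('utf-8')
print(text)  # prints the two Chinese characters 刘若

# Each character takes three bytes in UTF-8, so this is not a 4-byte emoji issue
print([len(ch.encode('utf-8')) for ch in text])

# One possible remedy, assuming the table was created with a non-UTF-8 charset:
# convert the existing table so its text columns can hold Chinese characters.
CONVERT_SQL = 'ALTER TABLE titles CONVERT TO CHARACTER SET utf8mb4'
```

`CONVERT_SQL` would be run once with `cursor.execute(CONVERT_SQL)`. Checking the current definition first with `SHOW CREATE TABLE titles` confirms whether the charset is really the problem.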