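I'm new to Python web scraping and have run into plenty of tricky problems; today's is a big one. The goal is simply to use XPath to scrape the Douban Music Top 250 and store the results in MySQL.
1. Creating the database table: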
CREATE TABLE dbmusic(
name TEXT,
singer TEXT,
rate TEXT,
url TEXT
) ENGINE INNODB DEFAULT CHARSET=utf8
2. Scraper code (using XPath)
from lxml import etree
import requests
import time
import pymysql
conn = pymysql.connect(host='localhost', user='root', passwd='******', db='testdb', port=3306, charset='utf8')
cursor = conn.cursor()
urls = ['https://music.douban.com/top250?start={}'.format(str(i)) for i in range(0, 250, 25)]
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}
for url in urls:
    html = requests.get(url, headers=headers)
    selector = etree.HTML(html.text)
    infos = selector.xpath('//tr[@class="item"]')
    for info in infos:
        name = info.xpath('td/a[@title]')[0]
        singer = name  # I couldn't work out how to parse the singer out, so this is a temporary stand-in. Pointers welcome!
        rate = info.xpath('td/div/div/span[2]/text()')[0]
        url = info.xpath('td/div/a/@href')[0]
        cursor.execute('use testdb')
        cursor.execute("insert into dbmusic(name,singer,rate,url)values(%s,%s,%s,%s)",
                       (str(name), str(singer), str(rate), str(url))
                       )
        print('succeed')
    time.sleep(2)
conn.commit()
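Result: the data is scraped successfully!
However, after running SELECT name in the database, the values turn out to be things like
"Element a at 0x5408188".
The url and rate columns are correct. I suspect the problem is in the XPath used to extract name, but I have changed it several times against the page source and always get the same result.
Any pointers would be much appreciated!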
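One likely explanation, offered as a sketch rather than a confirmed diagnosis: 'td/a[@title]' selects the <a> element itself (the [@title] part only filters for elements that have that attribute), so str(name) stores the element's repr. The url and rate expressions end in /@href and /text(), which return strings, and that would explain why those two columns look fine. The snippet below reproduces the effect on a small made-up HTML fragment. The markup, the example.com links and the "singer - album" title format are assumptions for illustration, not the real Douban page, and split_title is a hypothetical helper.

```python
from lxml import etree

# Hypothetical fragment shaped like the rows the XPath in the post targets.
# The real page markup may differ.
SAMPLE = '''
<table>
  <tr class="item">
    <td>
      <a href="https://example.com/subject/1/" title="Some Singer - Some Album">
        <img src="cover.jpg"/>
      </a>
    </td>
    <td>
      <div class="pl2">
        <a href="https://example.com/subject/1/">Some Album</a>
        <div class="star">
          <span class="allstar45"></span><span class="rating_nums">9.1</span>
        </div>
      </div>
    </td>
  </tr>
</table>
'''

selector = etree.HTML(SAMPLE)
info = selector.xpath('//tr[@class="item"]')[0]

# Selecting the <a> node returns an lxml element object;
# str() of it is the repr, which is what ended up in the name column.
element = info.xpath('td/a[@title]')[0]
as_element = str(element)

# Selecting the attribute with /@title returns its value as a string.
title = str(info.xpath('td/a/@title')[0])


def split_title(text):
    """Hypothetical helper: split 'singer - album' on the first ' - '.

    Assumes the title attribute uses that format; if no separator is
    found, the singer is returned as an empty string.
    """
    singer, sep, album = text.partition(' - ')
    if not sep:
        return '', text.strip()
    return singer.strip(), album.strip()


singer, name = split_title(title)

print(as_element)
print(name, '|', singer)
```

If this is indeed the cause, changing the name line in the scraper to end in /@title (or /text() on the link that holds the album name) should put readable text in the column. Whether the singer can be split out this way depends on how the real title attribute is formatted, so check it against the page source first.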