之前浏览《Python数据挖掘入门与实践》这本书的时候发现了非常有意思的内容——用决策树预测NBA获胜球队,但是书中获得原始数据的方式已经行不通了,所以一直没有能够重复这一章的内容。恰巧最近发现了一个利用Python BeautifulSoup模块抓取NBA选秀数据的教程 Learning Python: Part 1:Scraping and Cleaning the NBA draft。突然意识到是否可以利用这份教程来抓取NBA球队的对阵数据,从而重复利用决策树越策NBA获胜球队的内容。
第一部分
这部分内容来自参考书《Python网络数据采集》第一章的内容 基本流程:通过urlopen()函数获得网页的的全部HTML代码;然后通过BeautifulSoup模块解析HTML代码获得我们想要的内容
<pre class="" style="box-sizing: border-box;padding-top: 8px;padding-bottom: 6px;background: rgb(241, 239, 238);border-radius: 0px;overflow-y: auto;color: rgb(80, 97, 109);text-align: start;font-size: 10px;line-height: 12px;font-family: consolas, menlo, courier, monospace, "Microsoft Yahei"!important;border-width: 1px !important;border-style: solid !important;border-color: rgb(226, 226, 226) !important;">
-
from urllib.request import urlopen
-
url = "http://pythonscraping.com/pages/page1.html"
-
html = urlopen(url)
-
print(html.read())
</pre>
输出结果
<pre class="" style="box-sizing: border-box;padding-top: 8px;padding-bottom: 6px;background: rgb(241, 239, 238);border-radius: 0px;overflow-y: auto;color: rgb(80, 97, 109);text-align: start;font-size: 10px;line-height: 12px;font-family: consolas, menlo, courier, monospace, "Microsoft Yahei"!important;border-width: 1px !important;border-style: solid !important;border-color: rgb(226, 226, 226) !important;">
b'<html>\n<head>\n<title>A Useful Page</title>\n</head>\n<body>\n<h1>An Interesting Title</h1>\n<div>\nLorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.\n</div>\n</body>\n</html>\n'
</pre>
简易理解html源代码:尖括号<>内是标签,两个尖括号中间是内容 BeautifulSoup解析
<pre class="" style="box-sizing: border-box;padding-top: 8px;padding-bottom: 6px;background: rgb(241, 239, 238);border-radius: 0px;overflow-y: auto;color: rgb(80, 97, 109);text-align: start;font-size: 10px;line-height: 12px;font-family: consolas, menlo, courier, monospace, "Microsoft Yahei"!important;border-width: 1px !important;border-style: solid !important;border-color: rgb(226, 226, 226) !important;">
-
from bs4 import BeautifulSoup
-
soup = BeautifulSoup(html)
</pre>
如果我们想要获得以上html源代码中title中的内容
<pre class="" style="box-sizing: border-box;padding-top: 8px;padding-bottom: 6px;background: rgb(241, 239, 238);border-radius: 0px;overflow-y: auto;color: rgb(80, 97, 109);text-align: start;font-size: 10px;line-height: 12px;font-family: consolas, menlo, courier, monospace, "Microsoft Yahei"!important;border-width: 1px !important;border-style: solid !important;border-color: rgb(226, 226, 226) !important;">
-
soup.title
-
soup.findAll("title")
-
soup.title.getText()
</pre>
第二部分
爬取2013-2014赛季NBA球队的对阵数据
<pre class="" style="box-sizing: border-box;padding-top: 8px;padding-bottom: 6px;background: rgb(241, 239, 238);border-radius: 0px;overflow-y: auto;color: rgb(80, 97, 109);text-align: start;font-size: 10px;line-height: 12px;font-family: consolas, menlo, courier, monospace, "Microsoft Yahei"!important;border-width: 1px !important;border-style: solid !important;border-color: rgb(226, 226, 226) !important;">
-
from urllib.request import urlopen
-
from bs4 import BeautifulSoup
-
import pandas as pd
-
import time
-
start = time.time()
-
months = ["october","november","december","january","february","march","april","may","june"]
-
col_header = ["Date","Start(ET)","Visitor/Neutral","PTS","Home/Neutral","PTS","","","Attend","Notes"]
-
url = "https://www.basketball-reference.com/leagues/NBA_2014_games-"
-
NBA_1314_Schedule_and_results = []
-
for month in months:
-
urls = url + month + ".html"
-
html = urlopen(urls)
-
soup = BeautifulSoup(html,"lxml")
-
start_1 = time.time()
-
print(month)
-
for i in range(len(soup.tbody.findAll("tr"))):
-
Schedule = []
-
date = soup.tbody.findAll("tr")[i].findAll("th")[0].getText()
-
Schedule.append(date)
-
for j in range(len(soup.findAll("tr")[i].findAll("td"))):
-
data = soup.findAll("tr")[i].findAll("td")[j].getText()
-
Schedule.append(data)
-
NBA_1314_Schedule_and_results.append(Schedule)
-
end_1 = time.time()
-
print(month,round(end_1 - start_1,2),"s")
-
df = pd.DataFrame(NBA_1314_Schedule_and_results,columns = col_header)
-
end = time.time()
-
print("The total time used:",round(end - start,2),"s")
-
df.to_csv("NBA_2013_2014_Schedule_and_results.csv")
</pre>
成功
部分结果[图片上传中...(image-47b705-1554103053614-0)]
结果中存在的问题
每个月份开始的第一行没有数据,暂时还没有发现是什么原因!
网友评论