美文网首页
麻瓜编程·python实战·1-2自学:爬取new blah

麻瓜编程·python实战·1-2自学:爬取new blah

作者: bbjoe | 来源:发表于2016-08-09 09:34 被阅读0次

代码

from bs4 import BeautifulSoup

info = [] 
with open('C:/Users/Administrator/Desktop/Pycharmprojects/OReillyWebScraping/小白/html/1-2 web/new_index.html', 'r') as web_data:
    soup = BeautifulSoup(web_data, 'lxml')
    titles = soup.select('body > div.main-content > ul > li > div.article-info > h3 > a')
    images = soup.select('body > div.main-content > ul > li > img')
    # cates这边停在了父级标签,因为原网页中“项目”和“cates”存在一对多的关系
    cates   = soup.select('body > div.main-content > ul > li > div.article-info > p.meta-info')
    descs  = soup.select('body > div.main-content > ul > li > div.article-info > p.description')
    rates  = soup.select('body > div.main-content > ul > li > div.rate > span')

    # print(titles, images, tags, descs, rates, sep="\n-------------------------\n")

for title, image, cate, desc, rate in zip(titles, images, cates, descs, rates):
    data = {
        'title': title.get_text(),
        'cate' : list(cate.stripped_strings),    # stripped_strings相当于高级的get_text(),可以同时取出多个文本。list()是列表化
        'desc' : desc.get_text(),
        'rate' : rate.get_text(),
        'image': image.get('src')
    }
    info.append(data)    # 之所以建立info[]列表,是为了把多个data字典放进去之后进行迭代

# 选取评分大于3的部分
for i in info:
    if float(i['rate']) > 3:
        print(i['title'], i['rate'])

# 标题 body > div.main-content > ul > li:nth-child(1) > div.article-info > h3 > a
# 图片 body > div.main-content > ul > li:nth-child(1) > img
# 标签 body > div.main-content > ul > li:nth-child(1) > div.article-info > p.meta-info > span:nth-child(2)
# 评分 body > div.main-content > ul > li:nth-child(1) > div.rate > span
# 内容 body > div.main-content > ul > li:nth-child(1) > div.article-info > p.description

相关文章

网友评论

      本文标题:麻瓜编程·python实战·1-2自学:爬取new blah

      本文链接:https://www.haomeiwen.com/subject/ogzrsttx.html