1. Page analysis

[Figure: URL analysis 1]

[Figure: URL analysis 2]

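Looking at the URLs shows that Qiushibaike's pages are fairly simple: everything that identifies a page appears in the URL itself, which means the pages are fetched with plain GET requests. The scraper therefore requests them with GET as well.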
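To make the "everything is in the URL" observation concrete, here is a minimal sketch using only the standard library. It builds a request object for the text listing URL without sending it, then inspects the method and the URL parts. The shortened `User-Agent` value is an assumption for illustration, not the header set used in the scraper itself.

```python
from urllib.parse import urlsplit
from urllib.request import Request

url = 'http://www.qiushibaike.com/text/'

# Building a Request object does not send anything over the network.
# The User-Agent value here is a shortened placeholder (assumption).
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})

# With no request body attached, urllib treats the request as a GET.
method = req.get_method()

# Split the URL to see where the page-identifying information lives.
parts = urlsplit(req.full_url)

print('method:', method)
print('host:  ', parts.netloc)
print('path:  ', parts.path)
print('query: ', repr(parts.query))
```

Because nothing is carried in a request body, changing the path (or query string) is all it takes to ask for a different page.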

2. Main content analysis

[Figure: page content]

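In the page content, the posts sit side by side as sibling blocks with the same structure, so the scraper loops over them and extracts the same set of fields from each one.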
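The loop-over-siblings idea can be sketched on a tiny hand-written snippet. To keep it runnable without extra packages, this sketch swaps in the standard library's `xml.etree.ElementTree` instead of BeautifulSoup. The markup and the user names are hypothetical and heavily simplified, and real pages are not guaranteed to be well-formed XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup (assumption): two posts plus one unrelated block.
SAMPLE = """
<div id="content">
  <div class="article block untagged mb15"><h2>user_a</h2><span>first post</span></div>
  <div class="article block untagged mb15"><h2>user_b</h2><span>second post</span></div>
  <div class="sidebar"><h2>not a post</h2></div>
</div>
""".strip()

root = ET.fromstring(SAMPLE)

entries = []
# Visit every <div>, keep only the post containers, and pull the same fields from each.
for div in root.iter('div'):
    if div.get('class') == 'article block untagged mb15':
        entries.append({
            'name': div.findtext('h2'),
            'content': div.findtext('span'),
        })

for entry in entries:
    print(entry['name'], '->', entry['content'])
```

Every post uses the same container class, so one filter plus one loop covers the whole page.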

3. Code

import requests
from bs4 import BeautifulSoup
import re

url = 'http://www.qiushibaike.com/text/'
# Request headers copied from the browser.
headers = {
    'Cookie': '_qqq_uuid_="2|1',
    'Upgrade-Insecure-Requests': '1',
    # Only encodings that requests decodes out of the box; 'sdch' and 'br' are left out
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.8',
    'Cache-Control': 'max-age=0',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Host': 'www.qiushibaike.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.98 Safari/537.36',
    'Referer': 'https',
    'If-None-Match': '"091c1ffec42275e428d6a951055a2c5266c52a17"',
    'Connection': 'keep-alive',
}
# headers must be passed by keyword: the second positional argument of requests.get is params
html = requests.get(url, headers=headers).content
soup = BeautifulSoup(html,'lxml')
# Each post lives in a <div> whose class attribute is exactly this string
div_list = soup.find_all(name='div', class_='article block untagged mb15')
# 'with' closes the file even if parsing fails partway through
with open('C:\\Users\\Administrator\\Desktop\\练习杂物\\糗事百科爬虫练习.csv', 'w', encoding='utf-8') as f:
    for i in div_list:
        name = i.find('h2').text
        # Some posts have no gender/age block, in which case find() returns None
        genders = i.find(name='div', class_=re.compile('articleGender .*'))
        if genders is None:
            gender = 'None'
            age = 'None'
        else:
            # attrs['class'] is a list of class-name strings;
            # [:-4] drops the last four characters of the second one
            gender = genders.attrs['class'][1][:-4]
            age = genders.text
        content = i.find('span').text
        laugh = i.find(name='span', class_='stats-vote').find('i').text
        comment = i.find(name='span', class_='stats-comments').find('i').text
        # One summary line per post, followed by the post text
        f.write('Name: ' + name + '  Gender: ' + gender + '  Age: ' + age
                + '  Laughs: ' + laugh + '  Comments: ' + comment + '\n')
        f.write(content + '\n')
print('finished!')
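The script above writes free-form text lines into a file named `.csv`. If real comma-separated output is wanted, one option is the standard library's `csv` module, which quotes commas and line breaks inside post text. The sketch below is only an illustration under assumptions: the records and column names are hypothetical stand-ins for the scraped values, and it writes to an in-memory buffer rather than the Desktop path.

```python
import csv
import io

# Hypothetical records (assumption) standing in for the values scraped in the loop.
records = [
    {'name': 'user_a', 'gender': 'man', 'age': '25', 'laugh': '120',
     'comment': '3', 'content': 'first post, with a comma'},
    {'name': 'user_b', 'gender': 'None', 'age': 'None', 'laugh': '8',
     'comment': '0', 'content': 'second\npost'},
]
fields = ['name', 'gender', 'age', 'laugh', 'comment', 'content']

# In-memory buffer; for a real file use open(path, 'w', encoding='utf-8', newline='').
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fields)
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()

# Read the text back to confirm commas and line breaks survive the round trip.
rows = list(csv.DictReader(io.StringIO(csv_text)))
print(len(rows), 'rows written')
```

With one column per field, the output opens cleanly in a spreadsheet, and the post text can contain commas or newlines without breaking the row layout.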