A simple Python crawler: scraping a novel for my wife

Author: quan575 | Source: published 2017-10-30 17:46, read 33 times

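My wife said she wanted to read 《公子九》, so I first searched Baidu for a site that could be crawled.
The site I picked has two convenient properties: first, it needs no account login; second, it does not ban an IP for frequent access. That makes crawling simple: fetch the page, parse it to get the content, and write it to a txt file.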

http://www.mpzw.com/html/129/129853/
The page above is the table of contents. It can serve as the entry point, from which each chapter is crawled in turn.

Looking at the page source, every chapter has its own link. The address of chapter 5, for example, is www.mpzw.com/html/129/129853/ plus 27468861.html
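
To illustrate how a chapter address is formed from the index address plus the relative link, here is a minimal Python 3 sketch that uses only the standard library. The HTML fragment and the names `LinkCollector` and `chapter_urls` are made up for illustration, and the real page's markup may well differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE = 'http://www.mpzw.com/html/129/129853/'

# Hypothetical fragment of an index page; the real markup may differ.
SAMPLE = ('<dd><a href="27468857.html">第一章</a></dd>'
          '<dd><a href="27468861.html">第五章</a></dd>')


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag that is fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.hrefs.append(href)


def chapter_urls(html, base=BASE):
    """Return absolute URLs for all links found in html, in page order."""
    parser = LinkCollector()
    parser.feed(html)
    # urljoin resolves a relative href against the index address
    return [urljoin(base, href) for href in parser.hrefs]


if __name__ == '__main__':
    for url in chapter_urls(SAMPLE):
        print(url)
```

Using `urljoin` instead of plain string concatenation also copes with links that are already absolute, which a real index page may contain.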

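The script is as follows.

Define a function get_text() that fetches a chapter's content (parsed with the BeautifulSoup package), then loop over every chapter. The downloaded novel ends up in out.txt. One could also save each crawled link so that the next run can skip it.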

# -*- coding: utf-8 -*-
"""
Created on Sun Oct 22 15:11:15 2017

@author: Administrator
"""

# Python 2 script: urllib2 and the BeautifulSoup 3 package
from BeautifulSoup import BeautifulSoup
import urllib2

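def get_text(url, title):
    '''
    Parse one chapter of the novel and append it to out.txt.
    url = 'http://www.mpzw.com/html/129/129853/27468857.html'
    title = u'第一章'
    '''
    f = urllib2.urlopen(url)
    soup = BeautifulSoup(f.read())
    f.close()
    print soup.title.prettify('gbk')    # page title
    # chapter body as a GBK-encoded byte string
    str1 = soup.findAll('div', attrs={'class': 'Content'})[0].prettify('gbk')
    str1 = str1.replace('&nbsp;', '')   # drop non-breaking-space entities
    str1 = str1.replace('<br />', '\n')
    str1 = str1.replace(' ', '')
    str1 = str1.replace('\n\n\n', '\n')
    str1 = str1.replace('\n\n', '\n')
    # the watermark must be GBK bytes too, or it never matches
    str1 = str1.replace(u'猫扑中文www.mpzw.com'.encode('gbk'), '')
    if isinstance(title, unicode):
        title = title.encode('gbk', 'ignore')   # same encoding as the body
    with open('out.txt', 'a') as out:   # append mode
        out.write('\n\n' + title + '\n')
        for i in str1.split('\n'):
            if '<' not in i:            # skip lines that still hold tags
                out.write(i + '\n')
# next chapter
#get_text('http://www.mpzw.com/html/129/129853/27468857.html','a')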

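# entry URL (the table-of-contents page)
muluurl = 'http://www.mpzw.com/html/129/129853/'
f = urllib2.urlopen(muluurl + 'index.html')
soup = BeautifulSoup(f.read())      # parse the page
links = soup.findAll('a')           # every <a> on the page, chapter links included

for link in links:  # walk through the links
    try:
        title = link.text
        print title
        url = muluurl + link['href']
        print url
        get_text(url, title)
    except Exception:   # non-chapter links fail here and are skipped
        print 'bad link, skipping'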
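
The idea mentioned earlier, saving each crawled link so that a later run can skip it, could look roughly like this. It is a Python 3 sketch and not part of the script above; the file name `done.txt` and the helpers `load_done`, `mark_done` and `pending` are hypothetical.

```python
import os


def load_done(path='done.txt'):
    """Return the set of URLs recorded by earlier runs (empty if no file yet)."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding='utf-8') as f:
        return {line.strip() for line in f if line.strip()}


def mark_done(url, path='done.txt'):
    """Append one finished URL to the record file."""
    with open(path, 'a', encoding='utf-8') as f:
        f.write(url + '\n')


def pending(urls, done):
    """Keep only the URLs that have not been crawled yet, preserving order."""
    return [u for u in urls if u not in done]
```

In the main loop one would call `load_done()` once at the start, iterate over `pending(...)` instead of all links, and call `mark_done(url)` only after a chapter has been written successfully, so that a failed chapter is retried on the next run.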

I hope to write more complex crawlers in the future.