Scraping Articles from a Web Page into a Local txt File

Author: 新新格子君 | Source: published 2016-12-01 20:03, read 831 times

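I recently read some of Zhou Haohui's (周浩晖) novels, including the 邪恶催眠师 (Evil Hypnotist) series. The series has reached its third season, but I could not find a txt file of it online. All I found was the text on the web page below. Reading a novel in a browser is not very convenient, so I decided to scrape it into a txt file to read on my phone.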
http://www.txt99.com/read/12/20831/1.html

Techniques used:

BeautifulSoup, urllib2

Straight to the code:

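#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 script (uses urllib2 and the print statement)

from bs4 import BeautifulSoup
import urllib2
import sys
import codecs

strall = ''  # accumulates the text of every page

# Python 2 workaround: make utf-8 the default encoding so that
# str(unicode_value) below does not raise UnicodeEncodeError
reload(sys)
sys.setdefaultencoding('utf-8')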

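for i in range(1, 34):  # pages 1 through 33
    urls = 'http://www.txt99.com/read/12/20831/' + str(i) + '.html'
    html = urllib2.urlopen(urls)
    htmldata = html.read()
    # The page is GB2312-encoded; gb18030 is a superset of GB2312, so decode with that
    soup = BeautifulSoup(htmldata, 'html.parser', from_encoding="gb18030")

    # The chapter text lives in <div id="view_content_txt">
    titleData = soup.find('div', id='view_content_txt')

    # Drop everything up to and including the opening tag of the content div
    ss = str(unicode(titleData))
    lists = ss.split('<div id="view_content_txt">')
    lings = str(lists[1])

    # Keep only what comes before the pagination block
    lists2 = lings.split('<div class="view_page">')
    print str(lists2[0])
    strall += str(lists2[0])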

def write_txt(content):
    # Write the concatenated text to a txt file; the with block closes the file
    with codecs.open('f:/python/1.txt', 'w', 'utf-8') as f:
        f.write(content)

write_txt(strall)
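
For anyone on Python 3, here is a rough sketch of the same idea using only the standard library: urllib.request and html.parser stand in for urllib2 and BeautifulSoup. The names page_urls, ContentExtractor, extract_text, fetch and save_novel are made up for illustration. The page structure (a content div with id view_content_txt that contains a pagination div with class view_page) is assumed from the split strings in the script above, and turning br tags into newlines is my own addition. It has not been run against the live site.

```python
# Python 3 sketch, standard library only. All names here are illustrative.
import urllib.request
from html.parser import HTMLParser

BASE_URL = 'http://www.txt99.com/read/12/20831/'  # URL prefix from the article


def page_urls(first=1, last=33):
    """Build the page URLs; 1..33 matches range(1, 34) in the script above."""
    return ['{}{}.html'.format(BASE_URL, i) for i in range(first, last + 1)]


class ContentExtractor(HTMLParser):
    """Collect text inside <div id="view_content_txt">, stopping at the
    nested <div class="view_page"> (assumed page structure)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0     # div nesting depth; 0 means outside the content div
        self.done = False  # set once the pagination div or closing div is seen
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        attrs = dict(attrs)
        if self.depth == 0:
            if tag == 'div' and attrs.get('id') == 'view_content_txt':
                self.depth = 1
            return
        if tag == 'div':
            if 'view_page' in (attrs.get('class') or '').split():
                self.done = True
                return
            self.depth += 1
        elif tag == 'br':
            self.parts.append('\n')  # keep line breaks as newlines

    def handle_endtag(self, tag):
        if self.done or self.depth == 0:
            return
        if tag == 'div':
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0 and not self.done:
            self.parts.append(data)


def extract_text(raw_bytes, encoding='gb18030'):
    """Decode the raw page bytes and return the plain text of the content div."""
    parser = ContentExtractor()
    parser.feed(raw_bytes.decode(encoding, errors='replace'))
    parser.close()
    return ''.join(parser.parts).strip()


def fetch(url):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()


def save_novel(path, urls):
    """Fetch every page and write the extracted text to a UTF-8 file."""
    with open(path, 'w', encoding='utf-8') as f:
        for url in urls:
            f.write(extract_text(fetch(url)) + '\n')
```

Calling save_novel('f:/python/1.txt', page_urls()) would fetch all 33 pages and write them to the same file as the original script. Unlike the string-splitting version, this keeps only the text and leaves the HTML tags out of the txt file.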