2019-06-21 - Web Scraper - FIRST TRY

Author: ElfACCC | Published 2019-06-21 14:53

    Problem encountered:

    With response.encoding = 'utf-8', the Chinese text in the page came out garbled.


    Cause: the site serves its pages in GB2312, so setting response.encoding = 'gb2312' decodes the Chinese text correctly.
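The effect can be reproduced without touching the network. The following is a minimal sketch, assuming the server sends GB2312 bytes; the sample string is made up for illustration and is not taken from the site.

```python
# Bytes as a GB2312 site would send them for the (made-up) sample text.
raw = "小说".encode("gb2312")

# Decoding with the wrong codec gives the wrong characters (garbled text).
garbled = raw.decode("utf-8", errors="replace")

# Decoding with the codec the page actually uses restores the text.
fixed = raw.decode("gb2312")

print(repr(garbled))
print(fixed)
```

Setting response.encoding before reading response.text makes requests decode the body with the chosen codec in the same way.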



    Doing it this way also works without garbling.
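The screenshot for this alternative is lost. The commented-out line in the full code suggests it was the re-encode round trip, so the sketch below assumes that. It also assumes the text was first decoded as ISO-8859-1, which maps every byte to a character and so loses nothing. The sample string is illustrative only.

```python
# Bytes as a GBK/GB2312 site would send them (sample text is illustrative).
raw = "晋江文学城".encode("gbk")

# What the text looks like when the body is decoded as ISO-8859-1 by mistake.
mis_decoded = raw.decode("ISO-8859-1")

# Turn the text back into the original bytes, then decode with the right codec.
recovered = mis_decoded.encode("ISO-8859-1").decode("gbk")

print(repr(mis_decoded))
print(recovered)
```

This only works when the wrong decoding was lossless. If the text was decoded as UTF-8 with replacement characters, the original bytes are gone and setting response.encoding up front is the safer route.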

    Full code:

    import re

    import requests

    url = 'http://www.jjwxc.net/onebook.php?novelid=379995'
    response = requests.get(url)
    # Alternative fix: html = response.text.encode('ISO-8859-1').decode('gbk')
    response.encoding = 'gb2312'  # the site serves GB2312, not UTF-8
    html = response.text

    # The novel title becomes the output file name.
    title = re.findall(r'<span itemprop="articleSection">(.*?)</span>', html)[0]

    # Each match is a (chapter_url, chapter_title) pair.
    chapters = re.findall(r'<a itemprop="url" href="(.*?)">(.*?)</a>', html, re.S)

    # The with block closes the file even if a request fails midway.
    with open('%s.txt' % title, 'w', encoding='utf-8') as fb:
        for chapter_url, chapter_title in chapters:
            chapter_response = requests.get(chapter_url)
            chapter_response.encoding = 'gb2312'
            chapter_html = chapter_response.text

            # The chapter body sits between these two markers.
            chapter_content = re.findall(r'<div style="clear:both;"></div>(.*?)<div id="favoriteshow_3"', chapter_html, re.S)[0]
            chapter_content = chapter_content.replace('<br>', '\n')
            chapter_content = chapter_content.replace('\u3000', ' ')
            chapter_content = chapter_content.replace('&#8226;', '•')

            fb.write(chapter_title)
            fb.write(chapter_content)
            fb.write('\n')

            print(chapter_title)
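The extraction steps can be tried offline on a small string. This is a sketch under stated assumptions: sample_html is a hypothetical snippet shaped like the markup the patterns expect, the example.com URLs are placeholders, and clean is a helper name introduced here that bundles the three replacements from the script.

```python
import re

# Hypothetical markup shaped like the table of contents the script parses.
sample_html = '''
<span itemprop="articleSection">Sample Novel</span>
<a itemprop="url" href="http://example.com/ch1">Chapter 1</a>
<a itemprop="url" href="http://example.com/ch2">Chapter 2</a>
'''

# One capture group: findall returns a list of strings.
title = re.findall(r'<span itemprop="articleSection">(.*?)</span>', sample_html)[0]

# Two capture groups: findall returns a list of (url, title) tuples.
chapters = re.findall(r'<a itemprop="url" href="(.*?)">(.*?)</a>', sample_html, re.S)


def clean(text):
    """Apply the same replacements the script uses on chapter bodies."""
    return text.replace('<br>', '\n').replace('\u3000', ' ').replace('&#8226;', '•')


print(title)
print(chapters)
print(repr(clean('\u3000\u3000Line one<br>Line two&#8226;')))
```

Note that indexing with [0] raises IndexError when a pattern finds nothing, so checking the list first may be worthwhile if the page layout changes.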
    


    Article link: https://www.haomeiwen.com/subject/zbbyqctx.html