Crawling

Author: _月光临海 | Source: published 2018-05-31 15:37, read 0 times

Basic Web Crawler

I am new to web crawling. This script borrows from https://www.jianshu.com/p/0e7d1c80b8c3, which taught me a lot.

# Author:Freeman
import urllib.request
import re  # regular expressions

pattern = re.compile(r'<div class="j-r-list-c-desc">\s+(.*)\s+</div>')


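# . matches any character except \n, * means zero or more repetitions, \s+ means one or more whitespace characters.
# r'' makes a raw string, so backslashes reach re.compile unchanged. \s happens to survive in a normal string too
# (recent Python versions emit a warning for it), but escapes such as \b or \n would be converted before the regex engine sees them.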

# Send a request to the page at the given url and return that page's HTML as a string
def open_url(url):
    req = urllib.request.Request(url)
    req.add_header("user-agent",
                   "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36")
    res = urllib.request.urlopen(req)
    html = res.read().decode("utf-8")  # res.read() returns bytes, which are decoded as UTF-8 here
    return html  # str


# Take a start page number and an end page number (both inclusive)
def page_counts(start_page_num, end_page_num):
    all_page_list = []
    for i in range(start_page_num, end_page_num + 1):  # range() includes the start but excludes the stop, hence the +1
        cur_page_html = open_url("http://www.budejie.com/text/%s" % i)  # this is how budejie numbers its pages
        one_page_list = re.findall(pattern, cur_page_html)  # every string matching pattern, collected in a list
        # Add this page's matches to the list holding all results, ready to be written to a local file later
        all_page_list.extend(one_page_list)
    return all_page_list  # list


def save(start_page_num, end_page_num):
    with open("a.txt", "w", encoding="utf-8") as f:  # the with block closes the file automatically when it exits
        for item in page_counts(start_page_num, end_page_num):
            # Turn HTML line breaks into real newlines and end every item with a newline,
            # so that consecutive items do not run together in the file
            f.write(item.replace("<br />", "\n") + "\n")

save(1, 1)
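
To see what the pattern captures without hitting the network, it can be run against a small inline string. This is only a sketch: `sample_html` is a made-up fragment that imitates the div structure the pattern expects, and the real page markup may differ.

```python
import re

# Same pattern as in the script above.
pattern = re.compile(r'<div class="j-r-list-c-desc">\s+(.*)\s+</div>')

# Hypothetical HTML fragment, invented for illustration only.
sample_html = (
    '<div class="j-r-list-c-desc">\n'
    '    first joke<br />second line\n'
    '</div>\n'
    '<div class="j-r-list-c-desc">\n'
    '    another joke\n'
    '</div>\n'
)

# findall returns the text captured by the single group, one entry per match.
items = re.findall(pattern, sample_html)

# Replace HTML line breaks with real newlines, as the save step does.
cleaned = [item.replace("<br />", "\n") for item in items]

# A raw string keeps the backslash as a literal character.
raw_is_two_chars = len(r'\n') == 2

print(items)
print(cleaned)
```

Because `.` does not match a newline, `(.*)` only captures text that sits on a single line between the opening and closing tags.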

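There is no need to create a.txt by hand: opening it in "w" mode creates the file if it is missing and overwrites it if it already exists.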
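
Whether a.txt has to exist beforehand is easy to check. The sketch below writes to a path that does not exist yet, inside a temporary directory. The helper name `write_items` and the sample strings are assumptions made for this illustration and are not part of the original script.

```python
import os
import tempfile


def write_items(path, items):
    """Write each item on its own line; mode "w" creates the file if needed."""
    with open(path, "w", encoding="utf-8") as f:
        for item in items:
            f.write(item + "\n")


with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "a.txt")
    existed_before = os.path.exists(target)  # nothing has been created yet
    write_items(target, ["first", "second"])
    existed_after = os.path.exists(target)   # open(..., "w") created the file
    with open(target, encoding="utf-8") as f:
        content = f.read()

print(existed_before, existed_after)
```

Note that "w" also truncates an existing file, so running the crawler a second time replaces the earlier results rather than appending to them.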