Python Programming from 0 to 1 (Hands-On: Scraping Foreign Websites, Translating Automatically, and Sending to a Personal

Author: 安和然 | Published 2018-04-04 11:15

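1. Requirements:

This program grew out of a real need of mine: every day I have to follow dozens of foreign websites, find useful information on them, and compile it into articles.

So I wondered whether I could write a program that visits those sites every morning, checks for new articles, scrapes them, translates them automatically, and sends the result to my mailbox.

2. Design approach:

[Figure: design approach (设计思路.jpg)]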

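3. Module-by-module code walkthrough

3.1 Initializing target sites and scraping rules

Because the sites and rules will need to be revised and refined later, the target sites and their rules are kept in a separate siteconfig.json file, which the program loads on every run.

Source code that generates siteconfig.json: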

import json

# This script generates the siteconfig.json file.

config = [{
    'sitename':'The White House',
    'starturl':'https://www.whitehouse.gov/issues/economy-jobs/',
    'listXpath':r'//article//h2/a/@href',
    'OriginalUrl':"",
    'titleXpath':r'//*[@id="main-content"]/div/div/h1/text()',
    'timeXpath':r'//*[@id="main-content"]/div[1]/div/div/p/time/text()',
    'contentXpath':r'//*[@id="main-content"]/div[2]/div/div/p/text()',
    'authorXpath':r'//*[@id="main-content"]/div[1]/div/p/text()'
},
    {
        'sitename': 'U.S. Department of the Treasury',
        'starturl': 'https://home.treasury.gov/news/press-releases/',
        'listXpath': r'//*[@id="block-hamilton-content"]//h2/a/@href',
        'OriginalUrl':"https://home.treasury.gov",
        'titleXpath': r'//*[@id="block-hamilton-page-title"]/h1/span/text()',
        'timeXpath': r'//*[@id="block-hamilton-content"]/article/div//time/text()',
        'contentXpath': r'//*[@id="block-hamilton-content"]/article/div//p/text()',
        'authorXpath': r''
    },
    {
        'sitename': 'Congressional Budget Office',
        'starturl': 'https://www.cbo.gov/most-recent',
        'listXpath': r'//*[@id="content"]/div//span/a/@href',
        'OriginalUrl': "https://www.cbo.gov",
        'titleXpath': r'//*[@id="page-title"]/text()',
        'timeXpath': r'//*[@class="date-display-single"]/text()',
        'contentXpath': r'//*[@id="content-panel"]//p/text()',
        'authorXpath': r''
    }
]
with open("siteconfig.json", "w") as text:
    json.dump(config, text)

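Each dictionary in the config list has the following fields:

 {
        'sitename':      # site name
        'starturl':      # URL of the listing page to start from
        'listXpath':     # XPath expression for the article list
        'OriginalUrl':   # base URL, prepended to relative article links
        'titleXpath':    # XPath expression for the article title
        'timeXpath':     # XPath expression for the article time
        'contentXpath':  # XPath expression for the article body
        'authorXpath':   # XPath expression for the article author
    },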
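Since the file is edited by hand over time, it may help to check it when it is loaded. The following is a minimal sketch, not part of the original program: the names REQUIRED_KEYS, validate_sites, and load_sites are hypothetical, and the key list is taken from the field list above.

```python
import json

# Keys every site entry is expected to carry, per the field list above.
REQUIRED_KEYS = {
    "sitename", "starturl", "listXpath", "OriginalUrl",
    "titleXpath", "timeXpath", "contentXpath", "authorXpath",
}


def validate_sites(sites):
    """Return (index, missing_keys) pairs for entries that lack required keys."""
    problems = []
    for i, site in enumerate(sites):
        missing = sorted(REQUIRED_KEYS - set(site))
        if missing:
            problems.append((i, missing))
    return problems


def load_sites(path="siteconfig.json"):
    """Load the site list and fail early if any entry is incomplete."""
    with open(path, encoding="utf-8") as f:
        sites = json.load(f)
    problems = validate_sites(sites)
    if problems:
        raise ValueError(f"incomplete site entries: {problems}")
    return sites
```

Failing at load time points at the broken entry directly, instead of surfacing later as a KeyError in the middle of a scraping run.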

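3.2 Scraping page content

A seen.json file stores the URLs of pages that have already been scraped, so they are not fetched again.
The functions take their parameters from the list of dictionaries loaded from siteconfig.json.
The IOarticle function generates the Markdown document.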

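import json
import datetime

import requests
from lxml import etree

import GoogleTransla  # translation module from section 3.3
import SendEmail      # email module from section 3.4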

def getHTMLText(url):
    """Fetch a page and return its text, or "" if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        # r.encoding = 'utf-8'
        return r.text
    except requests.RequestException:
        return ""

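def getLinkList(dic):
    """Return the article links found on the site's listing page."""
    url = dic['starturl']
    html = getHTMLText(url)
    if not html:  # the request failed, so there is nothing to parse
        return []
    selector = etree.HTML(html)
    links = selector.xpath(dic['listXpath'])
    return links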

def getContent(dic, articles):
    """Scrape every unseen article of one site and append it to articles."""
    links = getLinkList(dic)
    # seen.json holds the URLs scraped on earlier runs; start empty on the first run
    try:
        with open("seen.json") as seen:
            seenLink = json.load(seen)
    except FileNotFoundError:
        seenLink = []
    for link in links:
        link = dic['OriginalUrl'] + link
        if link in seenLink:
            print("This page has already been scraped")
            continue
        print(link)
        html = getHTMLText(link)
        if not html:  # fetch failed: leave the link unseen so it is retried next run
            continue
        seenLink.append(link)
        selector = etree.HTML(html)
        title = selector.xpath(dic['titleXpath'])
        print(title)
        time = selector.xpath(dic['timeXpath'])
        print(time)
        author = dic['sitename']
        print(author)
        paras = selector.xpath(dic['contentXpath'])
        print(paras)
        # Store the scraped article as a dictionary. str() of a list shows
        # newlines as the two characters "\n", which are replaced by spaces.
        article = {
            'Title': str(title).replace("'", " ").replace('"', " "),
            'Link': str(link),
            'Time': str(time).replace("'", " ").replace('"', " "),
            'Paragraph': str(paras).replace("'", " ").replace('"', " ").replace("\\n", " "),
            'Author': str(author).replace("'", " ").replace('"', " ")
        }
        articles.append(article)

    with open("seen.json", 'w') as seen:
        json.dump(seenLink, seen)
    return articles

def IOarticle(articles):
    """Write the articles and their translations to a dated Markdown file."""
    nowTime = datetime.datetime.now().strftime('%Y%m%d')
    filename = nowTime + ".md"
    with open(filename, "w", encoding="utf-8") as fo:
        fo.write("[TOC]\n")
        for article in articles:
            fo.write("# " + article['Title'] + "\n")
            fo.write("**Reference translation:** " + GoogleTransla.translateGoogle(article['Title'].strip('[]')) + "\n")
            fo.write("**Source:** " + article['Link'] + "\n")
            fo.write(article['Time'].strip() + "\n")
            fo.write(GoogleTransla.translateGoogle(article['Time'].strip()) + "\n")
            fo.write("**Body:** " + article['Paragraph'] + "\n")
            # translateGoogle strips the list brackets and quotes itself
            try:
                fo.write("**Reference translation:** " + GoogleTransla.translateGoogle(article['Paragraph']) + "\n")
            except Exception:
                # Long requests fail, so fall back to the first 10000 characters
                fo.write("Text too long; for now only the first 10000 characters are translated\n")
                fo.write("**Reference translation:** " + GoogleTransla.translateGoogle(article['Paragraph'][:10000]) + "\n")
            fo.write("\n**Source site:** " + article['Author'] + "\n")
            fo.write("\n\n")
    with open("text.json", "w") as text:
        json.dump(articles, text)
    return filename

def main():
    articles = []
    with open("siteconfig.json") as site:
        webs = json.load(site)
    for web in webs:
        articles = getContent(web, articles)
    filename = IOarticle(articles)
    SendEmail.send_mail(filename, filename)
    # getWHList(keyWord="china")


if __name__ == "__main__":
    main()
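The code above builds each article URL by concatenating OriginalUrl and the scraped link, which is why the White House entry, whose links are already absolute, sets OriginalUrl to an empty string. As a possible alternative, and only a sketch rather than what the original program does, the standard library's urljoin resolves both relative and absolute links against the listing page URL. The function name absolute_link and the example paths are hypothetical.

```python
from urllib.parse import urljoin


def absolute_link(start_url, href):
    """Resolve an href found on a listing page against that page's URL.

    Absolute hrefs come back unchanged; relative ones are joined to the
    scheme and host of start_url.
    """
    return urljoin(start_url, href)


# A root-relative link, as on the Treasury listing page (example path is made up)
print(absolute_link("https://home.treasury.gov/news/press-releases/",
                    "/news/press-releases/example"))
# An already absolute link, as on the White House listing page (example path is made up)
print(absolute_link("https://www.whitehouse.gov/issues/economy-jobs/",
                    "https://www.whitehouse.gov/briefings/example/"))
```

With this approach the listing URL alone would be enough, and a per-site OriginalUrl field would no longer need to be maintained.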

3.3 Calling Google Translate

import re
import urllib.parse
import urllib.request

url_google = 'http://translate.google.cn'
# Captures whatever follows TRANSLATED_TEXT= in the response, up to the next semicolon
reg_text = re.compile(r'(?<=TRANSLATED_TEXT=).*?;')
user_agent = r'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) ' \
                  r'Chrome/44.0.2403.157 Safari/537.36'

def translateGoogle(text, f='', t='zh-cn'):
    """Translate text; f and t are the source and target language codes."""
    # Remove the list brackets and quotes left over from str(list)
    text = text.strip('[]').replace("'", " ").replace('"', " ")
    values = {'hl': 'zh-cn', 'ie': 'utf-8', 'text': text, 'langpair': '%s|%s' % (f, t)}
    value = urllib.parse.urlencode(values)
    req = urllib.request.Request(url_google + '?' + value)
    req.add_header('User-Agent', user_agent)
    response = urllib.request.urlopen(req)
    content = response.read().decode('utf-8')
    data = reg_text.search(content)
    if data is None:
        raise ValueError("no TRANSLATED_TEXT found in the response")
    result = data.group(0).strip(';').strip('\'')
    print(result)
    return result

3.4 Email module

from email.mime.text import MIMEText
from email.mime.image import MIMEImage
from email.mime.base import MIMEBase
from email.mime.multipart import MIMEMultipart
from email import encoders
import os
import smtplib


def send_mail(subject, filename):
    email_host = 'smtp.163.com'  # SMTP server address
    sender = '     '  # sender address
    password = '   '  # password; if the provider issues an authorization code, use that
    receiver = '   '  # recipient address

    msg = MIMEMultipart()
    msg['Subject'] = subject  # subject line
    msg['From'] = ''  # sender display name
    msg['To'] = ''  # recipient display name

    # Body. An image can only be placed in the body through HTML.
    mail_msg = '''
<p>\n\t This email was sent automatically by a computer!</p>
<p>\n\t No need to reply.</p>
<p><a href="https://www.jianshu.com/u/ee5f3fe1b932">Jianshu</a></p>
<p>To add recipients or sites to scrape, contact QQ: 39489421</p>
<p><img src="cid:image1"></p>
'''
    msg.attach(MIMEText(mail_msg, 'html', 'utf-8'))
    # Load the image from the current directory
    # fp = open(r'111.png', 'rb')
    # msgImage = MIMEImage(fp.read())
    # fp.close()
    # # Define the image ID that the HTML body refers to
    # msgImage.add_header('Content-ID', '<image1>')
    # msg.attach(msgImage)

    ctype = 'application/octet-stream'
    maintype, subtype = ctype.split('/', 1)
    # Attachment: image
    # image = MIMEImage(open(r'111.jpg', 'rb').read(), _subtype=subtype)
    # image.add_header('Content-Disposition', 'attachment', filename='img.jpg')
    # msg.attach(image)
    # Attachment: file
    file = MIMEBase(maintype, subtype)
    with open(filename, 'rb') as f:
        file.set_payload(f.read())
    # Name the attachment after the file itself
    file.add_header('Content-Disposition', 'attachment', filename=os.path.basename(filename))
    encoders.encode_base64(file)
    msg.attach(file)

    # Send
    smtp = smtplib.SMTP()
    smtp.connect(email_host, 25)
    smtp.login(sender, password)
    smtp.sendmail(sender, receiver, msg.as_string())
    smtp.quit()
    print('success')

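4. Reviewing the problems met while writing the program:

4.1 Choosing a data exchange format between programs

JSON files are a convenient format for exchanging data between programs, and they map closely onto Python's lists of dictionaries.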
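As a small illustration of that correspondence, here is a sketch of a round trip through JSON text. The helper names to_json and from_json and the sample record are made up for the example; the keys mirror those used for scraped articles in section 3.2.

```python
import json

# A list of dictionaries, shaped like the scraped articles in section 3.2
articles = [
    {"Title": "Example headline", "Link": "https://example.com/a", "Author": "示例网站"},
]


def to_json(records):
    """Serialize a list of dicts; ensure_ascii=False keeps non-ASCII text readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def from_json(text):
    """Parse JSON text back into Python lists and dicts."""
    return json.loads(text)


text = to_json(articles)
restored = from_json(text)
print(restored == articles)
```

The default ensure_ascii=True would also round-trip correctly; it only changes how non-ASCII characters look in the file, writing them as \uXXXX escapes.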

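4.2 Choosing a scraping approach

There are many ready-made frameworks and methods for scraping, such as BeautifulSoup, Scrapy, and regular expressions. XPath is a good tool that balances efficiency against difficulty. The XPath Helper extension for Chrome makes expressions easy to debug, and Chrome's developer tools (F12) include a Copy XPath option, which is very convenient. Related documentation is easy to find online for study.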
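To show how a path expression maps to extracted values, here is a minimal sketch that uses the standard library's xml.etree.ElementTree. It is a stand-in, not the article's method: ElementTree supports only a limited XPath subset and needs well-formed markup, whereas the program above uses lxml, whose full XPath support allows forms such as /@href and /text(). The sample page and the function name extract_links are made up for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a listing page
PAGE = """<html><body>
<article><h2><a href="/news/1">First</a></h2></article>
<article><h2><a href="/news/2">Second</a></h2></article>
</body></html>"""


def extract_links(page):
    """Return (href, title) pairs for every article > h2 > a element."""
    root = ET.fromstring(page)
    anchors = root.findall(".//article/h2/a")
    # ElementTree paths select elements only, so attributes and text
    # are read from the matched elements afterwards.
    return [(a.get("href"), a.text) for a in anchors]


for href, title in extract_links(PAGE):
    print(href, title)
```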

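4.3 Using a translation engine

The better engines at present are Google Translate, Baidu Translate, and Youdao Translate, and each has a reasonably good way to be called. The three give different results for the same article and each has its strengths; I chose Google here, though it is not necessarily the best.
Google Translate raises an error when the parameter passed to it is too long, that is, when the article is too long, so I set a cap and translate only the first 10000 characters.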
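An alternative to truncating is to split a long article into pieces and translate them one by one. The sketch below is an assumption-laden illustration, not the original program: the names split_text and translate_long are hypothetical, the default limit of 2000 characters is a guess rather than a documented limit of the service, and translate stands for any function that maps a string to its translation.

```python
def split_text(text, limit=2000):
    """Split text into chunks of at most `limit` characters.

    Cuts at the last sentence end inside the window when possible, then at
    the last space, and only as a last resort in the middle of a word.
    """
    chunks = []
    rest = text.strip()
    while len(rest) > limit:
        cut = rest.rfind(". ", 0, limit)
        if cut == -1:
            cut = rest.rfind(" ", 0, limit)
        if cut <= 0:
            cut = limit - 1  # no usable break point: hard cut
        chunks.append(rest[:cut + 1].strip())
        rest = rest[cut + 1:].strip()
    if rest:
        chunks.append(rest)
    return chunks


def translate_long(text, translate, limit=2000):
    """Translate each chunk with the given function and join the results."""
    return " ".join(translate(chunk) for chunk in split_text(text, limit))


# str.upper stands in for a real translation function here
print(translate_long("One two. Three four. Five six.", str.upper, limit=12))
```

Cutting at sentence ends keeps each request self-contained, which should help translation quality compared with cutting mid-sentence.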

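4.4 Pitfalls when sending email

The automatic email part took quite a while, mainly because logging in over SMTP now requires an authorization code rather than the account password itself. This is important to remember.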
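Because the authorization code is a credential, one option is to keep it out of the source file. The following is a sketch under stated assumptions, not the original module: it uses the newer email.message.EmailMessage API, reads the code from a hypothetical environment variable MAIL_AUTH_CODE, and assumes the provider accepts SSL connections on port 465, which should be checked against the provider's own settings page. The function names build_message and send are also made up.

```python
import os
import smtplib
from email.message import EmailMessage


def build_message(subject, sender, receiver, body,
                  attachment=None, attachment_name=None):
    """Build a plain-text message, optionally with one binary attachment."""
    msg = EmailMessage()
    msg["Subject"] = subject
    msg["From"] = sender
    msg["To"] = receiver
    msg.set_content(body)
    if attachment is not None:
        msg.add_attachment(attachment, maintype="application",
                           subtype="octet-stream", filename=attachment_name)
    return msg


def send(msg, host="smtp.163.com", port=465):
    """Log in with the authorization code and send the message over SSL."""
    # MAIL_AUTH_CODE is a hypothetical variable name; set it in the environment
    auth_code = os.environ["MAIL_AUTH_CODE"]
    with smtplib.SMTP_SSL(host, port) as smtp:
        smtp.login(msg["From"], auth_code)
        smtp.send_message(msg)
```

A call such as send(build_message("Daily report", sender, receiver, "See attachment.", data, "20180404.md")) would then replace the hard-coded credentials, with sender, receiver, and data supplied by the caller.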

4.5 Generating the final report in Markdown

Markdown is a format I like, and a report generated in it can then be converted to various other formats.

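5. Improvements and outlook

The quality of Google Translate is really quite mediocre, but it is useful as a reference.
Keywords could be added so that only content containing them is sent.
Studying the target sites matters a lot: the sites look broadly alike, but each information source is structured differently. In short, the real work lies outside the code: making a program useful takes a lot of groundwork.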
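The keyword idea above could be sketched as a filter applied to the article list before the report is written. This is an illustration only: the function name filter_by_keywords is hypothetical, and it assumes the article dictionaries carry the 'Title' and 'Paragraph' keys used in section 3.2.

```python
def filter_by_keywords(articles, keywords):
    """Keep articles whose title or body contains any keyword (case-insensitive)."""
    wanted = [k.lower() for k in keywords]
    selected = []
    for article in articles:
        text = (article.get("Title", "") + " " + article.get("Paragraph", "")).lower()
        if any(k in text for k in wanted):
            selected.append(article)
    return selected


sample = [
    {"Title": "Trade talks with China resume", "Paragraph": "Officials met today."},
    {"Title": "Budget outlook", "Paragraph": "Deficit projections were revised."},
]
print([a["Title"] for a in filter_by_keywords(sample, ["china"])])
```

Filtering on the original text, before translation, avoids spending translation requests on articles that would be discarded anyway.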
