SuperMemo in Practice, Closing the Loop (4): Processing Web Materials Interactively

Author: 来自知乎的一只小胖子 | Published: 2021-12-28 09:44

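Continuing from the previous installment, this post covers how to process web materials in SuperMemo. First, a recap of the workflow from the earlier posts: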

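As the workflow figure shows, the earlier posts explained that Obsidian can be used to link knowledge points: creating new nodes in Obsidian and linking them to existing ones is how we find and expand our range of study materials. This post is a hands-on engineering exploration of that part of the workflow. If the concepts or the workflow are still unclear, the first post in the series gives an overview: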

一只小胖子: SuperMemo in Practice, Closing the Loop (1): Study Workflow and Time Management (SuperMemo实践闭环(1)-学习流程及时间管理)

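Our study materials mostly come in four forms: PDFs, videos, notes, and web pages. PDFs, videos, and notes are generally easy to handle: once you have the resource, tidy up its path information and add it to SuperMemo (see the previous post for the steps). Web content is more troublesome, so this post focuses on how to process web materials.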

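In everyday study, web pages are a rich source of information. Common sources include RSS feeds, WeChat official-account subscriptions, Zhihu follows, search engines, and browser bookmarks. All of them reach us as URLs. Filing these URLs into bookmark folders in bulk is awkward: folder-based management makes you agonize over how to name each folder, and once the URLs pile up the collection fills with redundant content.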

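Here is a screenshot of the final result first. (What this approach does and why it matters: we can search by keyword to process web materials in batches; for a page we want to add to our studies, we save a record with the button at the lower left; the resulting file can be copied from directly, or renamed with a web-page extension and processed in SuperMemo. This makes handling web materials much more efficient.)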

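In this post we process a batch of URLs and build an interactive web page that lets us quickly search, categorize, and organize them. It relies on a few Python packages: streamlit for the interactive script, pyecharts for the word cloud, and whoosh for full-text search. The official links are listed below.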

Streamlit • The fastest way to build and share data apps: streamlit.io/

Whoosh: https://pypi.org/project/Whoosh/
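
The word cloud is driven by word frequencies: the script in Step 3 segments the collected page titles with jieba, drops stop words, and counts what remains with collections.Counter. The sketch below illustrates only the counting step using the standard library. The regex tokenizer is a stand-in for jieba and only suits space-delimited text, and the names STOP_WORDS, tokenize, and top_words are hypothetical, chosen for this illustration.

```python
import re
from collections import Counter

# Hypothetical stop-word set for this sketch; the script loads its own list from a file.
STOP_WORDS = {"the", "a", "of", "and", "to", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (stand-in for jieba)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_words(titles, limit=10):
    """Return (word, count) pairs, most frequent first, skipping stop words."""
    counts = Counter(
        token
        for title in titles
        for token in tokenize(title)
        if token not in STOP_WORDS
    )
    return counts.most_common(limit)


titles = [
    "Streamlit tutorial: build a data app",
    "Whoosh full-text search tutorial",
    "Build a search page with Streamlit",
]
# Each pair has the (word, count) shape that a word-cloud chart takes as input.
print(top_words(titles, limit=4))
```

The list of pairs is the same shape the script passes to its word-cloud function, so swapping the tokenizer does not change the rest of the pipeline.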

Step 1: Collect multiple URLs. The demo uses Edge with the Copy All URLs extension; installation and usage are shown in the figure:

Figure: using the extension to collect multiple URLs
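
The script in Step 3 expects the copied links as HTML, one `<a href="...">title</a>` entry per link separated by `<br/>`, and extracts the address and title with regular expressions. As an alternative sketch, the same pairs can be pulled out with the standard library's html.parser, which also decodes entities such as `&amp;`. The exact clipboard format is an assumption inferred from the script, and LinkCollector and parse_links are hypothetical names.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (url, title) pairs from <a href="...">title</a> markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def parse_links(markup):
    """Return every (url, title) pair found in the pasted markup."""
    collector = LinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.links


# Assumed sample of what the extension copies in HTML mode.
sample = (
    '<a href="https://streamlit.io/">Streamlit &amp; data apps</a><br/>'
    '<a href="https://pypi.org/project/Whoosh/">Whoosh</a><br/>'
)
for url, title in parse_links(sample):
    print(title, "->", url)
```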

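Step 2: Paste the collected links into the script file and run it from the command line with streamlit run Gist2.py. The program automatically opens a web page, which is the page shown in the screenshot above.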

Figures: first paste in the collected URLs; the page opens automatically once the script runs; then search by keyword.

In the settings menu at the upper right, you can enable wide mode and run-on-save mode.

Figure: enabling wide mode and run-on-save in the settings

Step 3: Here is the code itself. Install the Python packages it needs, paste in your URLs, and run the script from the command line.

The latest version of the Gist script is available on GitHub: https://gist.github.com/ef56f43040244978fd2714608dc3d115

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Batch analysis and processing of web pages
# Author: 一只小胖子
# Version: V0.1
# Zhihu: https://www.zhihu.com/people/lxf-8868
# Usage:
# 1. Use the Copy All URLs extension to collect multiple page URLs
# 2. Run `streamlit run Gist2.py` from the command line

url_texts = """ Hint: paste your URLs here """

# ===== 1. Generate a word cloud with pyecharts =====
# Reference: 朱卫军
# Link: https://zhuanlan.zhihu.com/p/113312256
# https://blog.csdn.net/zx1245773445/article/details/98043120
import jieba
from collections import Counter
import pyecharts.options as opts
from pyecharts.charts import WordCloud


# # Read the content source and return the text
# def get_text(goods, evaluation):
#     if evaluation == '好评':
#         evaluation = 1
#     else:
#         evaluation = 0
#     path = 'excel/comments.csv'
#     with open(path, encoding='utf-8') as f:
#         data = pd.read_csv(f)
#     # Product categories
#     types = data['类型'].unique()
#     # Get the text
#     # text = data[(data['类型']==goods)&(data['标签']==evaluation)]['内容'].values.tolist()
#     text = data['内容'].values.tolist()
#     text = str(text)[1:-1]  # strip the surrounding []
#     print(types)
#     return text
#
#
# stext = get_text('1', '好评')
# print(stext)
#

# Segment the text with jieba: https://zhuanlan.zhihu.com/p/41032295
def split_word(text):
    word_list = list(jieba.cut(text))
    print(len(word_list))
    # Drop meaningless words and symbols; I compiled my own stop-word list
    with open('停用词库.txt') as f:
        meaningless_word = set(f.read().splitlines())
        # print(meaningless_word)
    result = []
    # Keep only non-empty words that are not stop words
    for word in word_list:
        word = word.replace(' ', '')
        if word and word not in meaningless_word:
            result.append(word)
    return result


# Using collections: https://zhuanlan.zhihu.com/p/108713135
# Count word frequencies
def word_counter(words):
    # Count occurrences with Counter
    words_counter = Counter(words)
    # Convert the Counter to a list of (word, count) pairs
    words_list = words_counter.most_common(2000)
    return words_list


# Build the word cloud
def word_cloud(data):
    (
        WordCloud().add(
            series_name="Hot topics",
            # Data to plot
            data_pair=data,
            # Gap between words
            word_gap=5,
            # Font size range
            word_size_range=[15, 80],
            shape="cursive",
            # Optional background image; omit the parameter to use the default background
            # mask_image='购物车.jpg')
        ).set_global_opts(
            # title_opts=opts.TitleOpts(
            #     title="热点分析", title_textstyle_opts=opts.TextStyleOpts(font_size=12)
            # ),
            tooltip_opts=opts.TooltipOpts(is_show=True),
        ).render("basic.html")  # Output as HTML
    )


# [Test demo] (Chinese sample text for jieba):
# stext = ''' '书籍1做父母一定要有刘墉这样的心态，不断地学习，不断地进步，不断地给自己补充新鲜血液，让自己保持.',
# '书籍1作者真有英国人严谨的风格，提出观点、进行论述论证，尽管本人对物理学了解不深，但是仍然能感受到.书籍', '1作者长篇大论借用详细报告数据处理工作和计算结果支持其新观点。为什么荷兰曾经县有欧洲最高的生产.. 1',
# '书籍1作者在战几时之前用了“拥抱"令人叫绝.日本如果没有战败，就有会有美军的占领，没胡官僚主义的延.书籍1作者在少年时即喜阅读，能看出他精读了无数经典，因而他有一个庞大的内心世界。他的作品最难能可贵..',
# '书籍1作者有一种专业的谨慎，若能有幸学习原版也许会更好，简体版的书中的印刷错误比较多，影响学者理解.',
# '书籍1作者用诗一样的语言把如水般清澈透明的思想娓娓道来，像一个经验丰富的智慧老人为我们解开一个又一.书籍1作者提出了一种工作和生活的方式，作为咨询界的元老，不仅能提出理念，而且能够身体力行地实践，并.'
# sword = split_word(stext)
# print(sword)
# word_stat = word_counter(sword)
# print(word_stat)
# word_cloud(word_stat)
# show_WordCounter()

# ===== 2. Full-text search with Whoosh =====
# Reference: 酷python
# Link: https://zhuanlan.zhihu.com/p/172348363
# https://www.cnblogs.com/mydriverc/articles/4136754.html
import os, errno
from whoosh.qparser import QueryParser, MultifieldParser
from whoosh.fields import TEXT, SchemaClass
from whoosh.query import compound, Term, Query
from whoosh.index import create_in
from whoosh.index import open_dir
from jieba.analyse import ChineseAnalyzer
import html
import re
import json
import streamlit as st


# On Python 3.x (x >= 2), os.makedirs also takes a third parameter, exist_ok; when it is true the call behaves like mkdir -p.
# However, if mode is given and the target directory already exists with different permissions, OSError is raised.
def mkdir_p(path):
    try:
        os.makedirs(path)
    except OSError as exc:  # Python >2.5 (except OSError, exc: for Python <2.5)
        if exc.errno == errno.EEXIST and os.path.isdir(path):
            pass
        else:
            raise


# Store the schema information in the index directory
index_dir = 'es/index_dir_1/'
if not os.path.exists(index_dir):
    mkdir_p(index_dir)


# Like defining a MySQL table: declare which fields to store and what type each one has
class ArticleSchema(SchemaClass):
    title = TEXT(stored=True, analyzer=ChineseAnalyzer())
    content = TEXT(stored=True, analyzer=ChineseAnalyzer())
    author = TEXT(stored=False, analyzer=ChineseAnalyzer())


# create_in creates the index inside index_dir. When adding documents, always follow the schema you defined.
# Once the index exists, adding documents is like writing rows into a MySQL table.
schema = ArticleSchema()
ix = create_in(index_dir, schema, indexname='article_index')
if not ix:
    ix = open_dir(index_dir, indexname='article_index')

# Process the documents
writer = ix.writer()
s_url_arr = url_texts.split("<br/>")
print("URLs to process: {}".format(len(s_url_arr)))
for item in s_url_arr:
    # Reset for every item so a non-link entry cannot reuse the previous link
    reg_href = reg_text = ""
    # HTML format
    # reg_arr = re.findall("">(\w.*)</a><br/>", item)
    if "href" in item:
        href_match = re.search('href="(.*?)"', item)
        text_match = re.search(">(.*)<", item)
        reg_href = href_match.group(1) if href_match else ""
        reg_text = text_match.group(1) if text_match else ""
    # Other formats
    #
    if reg_href or reg_text:
        # print(reg_href, html.unescape(reg_text))
        # Note: updating also adds duplicate entries!
        # writer.update_document(title= reg_href, author="admin") # , content=html.unescape(reg_text)) # add_document
        # Add the document
        reg_title = html.unescape(reg_text)  # .encode('unicode-escape')
        writer.add_document(title=reg_href, author="admin", content=reg_title)  # add_document
    # print(json.dumps(json_str, sort_keys=True, indent=4, separators=(',', ': '),ensure_ascii=False))

# Delete documents
# Because "path" is marked as unique, calling update_document with path = u"/a" will
# delete any existing document where the path field contains /a
# writer.delete_by_term("author", "admin")
writer.commit()

# iframe width, height, and scrolling settings
r_width = 1200
r_height = 400
r_scrolling = True


# Show the word cloud
def show_WordCounter():
    # Read the HTML file rendered by pyecharts and join its lines into one string
    with open("./basic.html", encoding="utf-8") as st_file:
        st_file_arr_str = " ".join(line.strip("\n") for line in st_file)
    # Display the word cloud
    st.components.v1.html(st_file_arr_str, width=r_width, height=r_height, scrolling=r_scrolling)


# Text input and display
search_key = "简书"
search_key = st.text_input('[1]. Enter a search keyword:', search_key)
# st.write('Your keyword:', search_key)
# st.text('Keyword: ' + search_key)
if not ix:
    ix = open_dir(index_dir, indexname='article_index')
title_lists = []
content_list = []
href_title_dict = {}
with ix.searcher() as searcher:
    # author_query = [Term('author', 'admin'), Term('author', 'admin')]
    # content_query = [Term('content', 'python'), Term('content', 'jupyter')]
    # query = compound.Or([compound.Or(author_query), compound.Or(content_query)])
    # content_query = [Term("content", "playwright"), Term("content", "jupyter")]
    # query = compound.Or(content_query)

    # Multi-condition queries
    # query = QueryParser("content", ix.schema).parse("简书")
    # query = MultifieldParser(["content"], ix.schema).parse("知乎")  # default_set()
    # query = _NullQuery()

    # Fetch every document in the index
    results = searcher.documents()
    # print(results)
    content_all = []
    for data in results:
        content_all.append(data["content"])
    sword = split_word("".join(content_all))
    print(sword)
    word_stat = word_counter(sword)
    print(word_stat)

    # Generate the word cloud
    word_cloud(word_stat)
    # Show the word cloud
    show_WordCounter()

    if not search_key:
        st.error("Please enter a search term!")
    else:
        # Search by keyword
        query = MultifieldParser(["content"], ix.schema).parse(search_key)
        print("Query:", query)
        # limit=None returns every match; Whoosh returns only the top 10 by default
        results = searcher.search(query, limit=None)
        # print(results[0].fields())
        print(query, 'Found %d documents in total.' % len(results))

        # Highlighting
        # if len(results) > 0:
        #     data = results[0]
        #     text = data.highlights("content","title")
        #     print(text)

        for data in results:
            # json_text = json.dumps(data.fields()["title"], ensure_ascii=False)
            # print(data.fields()["title"])
            reg_href = data.fields()["title"]
            reg_title = data.fields()["content"]
            # Highlighted display in the page
            # reg_title = data.highlights("content")
            if reg_href not in title_lists and reg_title not in content_list:
                title_lists.append(reg_href)
                content_list.append(reg_title)
                href_title_dict[reg_title] = reg_href
            # print(data.fields())
ix.close()
st.text("Found {} items in total".format(len(href_title_dict)))

# Saving records
reg_href_s = ""  # Record for the selected URL
save_file_path = "备注数据.txt"

# Drop-down list
select_box_list = list(href_title_dict.keys())
if len(select_box_list) > 0:
    reg_title = st.selectbox('[2]. Choose the URL to open:', select_box_list)
    reg_href = href_title_dict[reg_title]
    reg_href_s = "{} : {}   {}".format(search_key, reg_title, reg_href)
    st.text('Current selection: {}'.format(reg_href))
    # The page can be loaded in either of the following two ways
    # url_display = f'<embed type="text/html" src="' + reg_href + '" width="1200" height="600">'  # iframe
    # st.markdown(url_display, unsafe_allow_html=True)
    st.components.v1.iframe(reg_href, width=r_width, height=r_height, scrolling=r_scrolling)

    # Save button
    if st.button('Save this record'):
        if not os.path.exists(save_file_path):
            # Create the file with a header line
            with open(save_file_path, 'w', encoding='utf-8') as file_:
                file_.write("Search term ----- Title ------ Link -----")
        reg_href_s2_arr = []  # Content to write
        with open(save_file_path, 'r', encoding='utf-8') as file_:
            # Do not add duplicate entries
            search_arr = re.findall(re.escape(reg_href_s), "".join(file_.readlines()), re.I | re.M)
            print(search_arr)
            if len(search_arr) == 0:
                reg_href_s2_arr.append("\n" + reg_href_s + "<p>")
                st.write("Saved!")
            elif len(search_arr) == 1 and str(search_arr[0]).strip(" ") == "":
                st.write("Nothing to save!")
            else:
                st.write("Already saved!")
        with open(save_file_path, 'a', encoding='utf-8') as file_:
            if len(reg_href_s2_arr) > 0:
                file_.write("".join(reg_href_s2_arr))
        with open(save_file_path, 'r', encoding='utf-8') as file_:
            st_content = ("\n".join(file_.readlines()))  # <br>
            # st.components.v1.html(st_content)  # Highlighted display in the page
            st.write(st_content)

# Show the saved records
if st.button('Load saved file'):
    if os.path.exists(save_file_path):
        with open(save_file_path, 'r', encoding='utf-8') as file_:
            st.write("\n".join(file_.readlines()))
    else:
        st.write("No records saved yet. Save one first!")

if st.button('Clear file contents'):
    # https://blog.csdn.net/weixin_36118143/article/details/111988403
    if os.path.exists(save_file_path):
        os.remove(save_file_path)
    else:
        # os.mknod(save_file_path)
        pass
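
The save button writes one record per line in the shape `keyword : title   url<p>`, and the resulting file is meant to be renamed to a web page and opened in SuperMemo. As a possible follow-up, the sketch below turns such records into a small HTML page with clickable links and skips the header line and duplicate URLs. This is an illustration under the assumption that every record ends with its URL; records_to_html is a hypothetical helper, not part of the original script.

```python
import html


def records_to_html(lines):
    """Turn 'keyword : title   url<p>' records into a minimal HTML page."""
    items = []
    seen = set()
    for line in lines:
        record = line.strip()
        if record.endswith("<p>"):
            record = record[: -len("<p>")].rstrip()
        # A record ends with its URL; everything before it is "keyword : title".
        label, _, url = record.rpartition(" ")
        if not url.startswith(("http://", "https://")) or url in seen:
            continue  # skips the header line, blank lines, and duplicate URLs
        seen.add(url)
        items.append(
            '<p><a href="{}">{}</a></p>'.format(
                html.escape(url, quote=True), html.escape(label.strip())
            )
        )
    return "<html><body>\n" + "\n".join(items) + "\n</body></html>"


# Sample records in the shape the save button writes.
sample_lines = [
    "Search term ----- Title ------ Link -----",
    "python : Streamlit docs   https://streamlit.io/<p>",
    "python : Streamlit docs   https://streamlit.io/<p>",
    "search : Whoosh   https://pypi.org/project/Whoosh/<p>",
]
print(records_to_html(sample_lines))
```

To use it on the real file, read the saved records with `splitlines()` and write the returned string to a file with an .html extension.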

Step 4: Interactive alternatives include ipywidgets, Streamlit, and Plotly Dash. Other useful reference links:

https://www.biaodianfu.com/streamlit.html

Streamlit, a Python framework for building machine-learning tools (Python学习网): www.py.cn/toutiao/15437.html

I'm 一只小胖子, a chubby little guy who loves learning. If you love learning too and are interested in SuperMemo, feel free to share and comment!
