Web Mining and Content Analysis: Notes on Data, Entity, Event, and Relation Extraction

Author: 张小邪先森 | Published: 2020-09-06 00:37

Web Data Mining

Web mining is the application of data mining techniques to automatically discover and extract information from Web documents and services. Its main purpose is to discover useful information from the World Wide Web and from its usage patterns.

Web data mining (from the book "Data Mining: Concepts and Techniques")

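For news, advertising, consumer information, financial management, education, government administration, and e-commerce, the World Wide Web is a huge, widely distributed global information center. It holds rich, dynamic information, including web page content with hypertext structure and multimedia, hyperlink information, and access and usage information, which makes it a rich resource for data mining. Web mining applies data mining techniques to discover patterns, structures, and knowledge from the Web. Depending on the analysis goal, web mining can be divided into three main areas: web content mining, web structure mining, and web usage mining.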

Types of Web Data Collection and Usage Analysis

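Web data collection means obtaining data from websites through web crawlers or a site's public APIs. This approach extracts unstructured data from web pages and stores it in structured form as uniform local data files. It also supports collecting images, audio, video, and other files or attachments, and attachments can be automatically associated with the main text.
Basic Principles of Web Crawlers (Part 1)
Basic Principles of Web Crawlers (Part 2)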
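
The step of turning a fetched page into a structured record can be sketched roughly as follows. This is a minimal illustration that uses only the Python standard library; the class name `PageExtractor`, the function `extract_record`, and the sample HTML are assumptions made up for this example, and a real crawler would download pages over HTTP rather than use a hard-coded string.

```python
import json
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collect the page title and all hyperlinks from an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:  # skip anchors that carry no href attribute
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_record(html):
    """Turn raw HTML into a structured record (a plain dict)."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {"title": parser.title.strip(), "links": parser.links}


# Hypothetical page content standing in for a downloaded page.
SAMPLE_HTML = """
<html><head><title> Example News </title></head>
<body><a href="/a.html">A</a> <a href="/b.html">B</a> <a>no link</a></body></html>
"""

record = extract_record(SAMPLE_HTML)
# The record can now be stored as one line of a local JSON file.
print(json.dumps(record, ensure_ascii=False))
```

Storing each page as one JSON record is only one possible "uniform local data file" layout; CSV or a database table would serve the same purpose.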

Data Extraction

Big data processing workflow: data extraction, storage, and retrieval

What is Data Extraction?

Data extraction is a process that involves retrieval of data from various sources. Frequently, companies extract data in order to process it further, migrate the data to a data repository (such as a data warehouse or a data lake) or to further analyze it. It’s common to transform the data as a part of this process. For example, you might want to perform calculations on the data — such as aggregating sales data — and store those results in the data warehouse. If you are extracting the data to store it in a data warehouse, you might want to add additional metadata or enrich the data with timestamps or geolocation data. Finally, you likely want to combine the data with other data in the target data store. These processes, collectively, are called ETL, or Extraction, Transformation, and Loading. Extraction is the first key step in this process.

Structured data

If the data is structured, the data extraction process is generally performed within the source system. It’s common to perform data extraction using one of the following methods:

Full extraction. Data is completely extracted from the source, and there is no need to track changes. The logic is simpler, but the system load is greater.

Incremental extraction. Changes in the source data are tracked since the last successful extraction so that you do not go through the process of extracting all the data each time there is a change. To do this, you might create a change table to track changes, or check timestamps. Some data warehouses have change data capture (CDC) functionality built in. The logic for incremental extraction is more complex, but the system load is reduced.
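
The difference between the two methods can be sketched with a timestamp check. This is a minimal illustration, assuming each source row carries an `updated_at` column; the sample rows and the function names `full_extract` and `incremental_extract` are hypothetical, and a real system might rely on a change table or built-in CDC instead.

```python
from datetime import datetime

# Hypothetical source table: each row carries an `updated_at` timestamp.
SOURCE_ROWS = [
    {"id": 1, "amount": 10, "updated_at": datetime(2020, 9, 1)},
    {"id": 2, "amount": 25, "updated_at": datetime(2020, 9, 3)},
    {"id": 3, "amount": 40, "updated_at": datetime(2020, 9, 5)},
]


def full_extract(rows):
    """Full extraction: copy every row, no change tracking needed."""
    return list(rows)


def incremental_extract(rows, last_success):
    """Incremental extraction: return only rows changed after `last_success`,
    plus the new watermark to store for the next run."""
    changed = [r for r in rows if r["updated_at"] > last_success]
    new_watermark = max((r["updated_at"] for r in changed), default=last_success)
    return changed, new_watermark


everything = full_extract(SOURCE_ROWS)
changed, watermark = incremental_extract(SOURCE_ROWS, datetime(2020, 9, 2))
print(len(everything), [r["id"] for r in changed], watermark.date())
```

The incremental version has to persist the watermark between runs, which is the extra logic the text refers to; in exchange it reads only the rows that changed.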

Unstructured data

When you work with unstructured data, a large part of your task is to prepare the data in such a way that it can be extracted. Most likely, you will store it in a data lake until you plan to extract it for analysis or migration. You'll probably want to clean up "noise" from your data by doing things like removing whitespace and symbols, removing duplicate results, and determining how to handle missing values.
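
The clean-up steps named above (removing whitespace and symbols, removing duplicates, handling missing values) might look like the sketch below. The function names and the `"UNKNOWN"` placeholder are assumptions for illustration; replacing missing values with a placeholder is only one policy, and dropping them is equally valid.

```python
import re


def clean_text(value):
    """Strip symbols and collapse whitespace; return None for missing values."""
    if value is None:
        return None
    text = re.sub(r"[^\w\s]", " ", value)     # replace symbols with spaces
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace
    return text or None                       # empty result counts as missing


def clean_records(values, missing="UNKNOWN"):
    """Clean each value, replace missing values with a placeholder,
    and drop duplicates while keeping the first occurrence."""
    seen = set()
    result = []
    for raw in values:
        cleaned = clean_text(raw) or missing
        if cleaned not in seen:
            seen.add(cleaned)
            result.append(cleaned)
    return result


raw = ["  Hello,   world!! ", "Hello world", None, "###", "data   lake"]
print(clean_records(raw))
```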

Web data extraction (from the book "Web Data Mining")

Manual approach
Wrapper induction
Automatic extraction
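
To give a feel for the second approach, here is a deliberately simplified sketch of wrapper induction: from a few labeled example pages it learns the text that always precedes and follows the target value, then applies that delimiter pair to a new page. This is an illustration under strong assumptions (one target per page, exact string delimiters), not the algorithm described in the book; the function names and sample pages are hypothetical.

```python
import os


def common_suffix(strings):
    """Longest string that every input ends with."""
    reversed_prefix = os.path.commonprefix([s[::-1] for s in strings])
    return reversed_prefix[::-1]


def induce_wrapper(examples):
    """Learn a (left, right) delimiter pair from labeled examples.

    Each example is (page_text, target_value); the target must occur in the page.
    """
    lefts, rights = [], []
    for page, target in examples:
        start = page.index(target)
        lefts.append(page[:start])                  # text before the target
        rights.append(page[start + len(target):])   # text after the target
    left = common_suffix(lefts)
    right = os.path.commonprefix(rights)
    if not left or not right:
        raise ValueError("no common delimiters found")
    return left, right


def apply_wrapper(wrapper, page):
    """Extract the text between the learned delimiters, or None if absent."""
    left, right = wrapper
    start = page.find(left)
    if start == -1:
        return None
    start += len(left)
    end = page.find(right, start)
    return page[start:end] if end != -1 else None


# Hypothetical labeled pages: the price is the value to extract.
examples = [
    ("<html><p>Book A</p><b>Price:</b> <i>12.50</i> USD</html>", "12.50"),
    ("<html><p>Another Book</p><b>Price:</b> <i>8.99</i> USD</html>", "8.99"),
]
wrapper = induce_wrapper(examples)
new_page = "<html><p>Third</p><b>Price:</b> <i>103.00</i> USD</html>"
print(wrapper, apply_wrapper(wrapper, new_page))
```

The manual approach would hard-code the delimiters by hand, while automatic extraction would try to find repeated page structure without any labeled examples.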

Data mining (from the book "Data Mining: Concepts and Techniques")

Data cleaning, data integration, data selection, data transformation, data mining, pattern evaluation, knowledge presentation

Data Mining vs Data Extraction

Data mining is based on mathematical methods to reveal patterns or trends. Data extraction is based on programming languages or data extraction tools to crawl the data sources.
The purpose of data mining is to find facts that are previously unknown or ignored, while data extraction deals with existing information.

Information Extraction

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. In most cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, such as automatic annotation and content extraction from images, audio, video, and documents, can also be seen as information extraction.
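
As a small illustration of turning free text into structured records, the sketch below uses a hand-written pattern to pull "organization was founded in year" facts out of a sentence. The pattern, the function name `extract_founding_facts`, and the company names are made up for this example; practical IE systems usually rely on trained NLP models rather than a single regular expression.

```python
import re

# Hand-written pattern: capitalized name, the phrase "was founded in", a 4-digit year.
FOUNDED = re.compile(
    r"(?P<org>[A-Z][A-Za-z]+(?: [A-Z][A-Za-z]+)*) was founded in (?P<year>\d{4})"
)


def extract_founding_facts(text):
    """Return one structured record per matched founding statement."""
    return [
        {"organization": m.group("org"), "year": int(m.group("year"))}
        for m in FOUNDED.finditer(text)
    ]


# Hypothetical input text with invented organization names.
text = "Acme Widgets was founded in 1999. Later, Globex was founded in 2004 by two engineers."
facts = extract_founding_facts(text)
print(facts)
```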

Content Analysis

Content Analysis

Content analysis is a research tool used to determine the presence of certain words, themes, or concepts within some given qualitative data (i.e. text). Using content analysis, researchers can quantify and analyze the presence, meanings and relationships of such certain words, themes, or concepts.

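Content Analysis

A research method that gives an objective, systematic, and quantitative description of the content of mass communication, such as books, magazines, films, radio, and television. Its goal is to convert material expressed in language rather than in numbers into quantitative data, and to describe the results of the analysis with statistics.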
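
The quantification step can be sketched as counting how often the keywords of each concept appear in a set of texts. This is a minimal illustration; the coding scheme `CONCEPTS`, the function `count_concepts`, and the sample documents are assumptions invented here, and real studies define their categories and coding rules much more carefully.

```python
import re
from collections import Counter

# Hypothetical coding scheme: each concept is defined by a set of keywords.
CONCEPTS = {
    "economy": {"market", "price", "trade"},
    "health": {"hospital", "vaccine", "doctor"},
}


def count_concepts(documents, concepts=CONCEPTS):
    """Count keyword occurrences per concept across all documents."""
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())  # crude lowercase tokenizer
        for name, keywords in concepts.items():
            counts[name] += sum(1 for t in tokens if t in keywords)
    return counts


docs = [
    "The market reacted as the price of oil rose.",
    "A new vaccine arrived at the hospital; the market stayed calm.",
]
totals = count_concepts(docs)
print(dict(totals))
```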

Article link: https://www.haomeiwen.com/subject/mfkqektx.html