Webpage text extraction method based on deep learning

Applied in the Internet field, this deep learning-based web page text extraction technology addresses the weaknesses of rule-based approaches, namely poor fit to irregular web page designs, complex implementation and maintenance, and cumbersome rule definition, and it improves the robustness and generalization ability of the model.

Active Publication Date: 2021-04-16
GUANGDONG ELECTRONICS IND INST

AI Technical Summary

Problems solved by technology

However, encyclopedia and resume pages come in many styles, and the content within each style is very scattered, so this kind of method is difficult to apply. In addition, the rule-based strategies common in previous work, whether based on visual elements, HTML tag information, or content information, do not adapt well to increasingly complex web page structures and the irregular design of some web pages; the rules are cumbersome to define, and implementation and maintenance are very complicated.


Examples


Embodiment 1

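[0052] The input tag paths are encoded with a pre-trained fastText model to obtain the input of the LSTM model, and the LSTM is trained together with the class of each tag path. The LSTM model in the present invention is implemented directly with the PyTorch framework. The parameters are set as follows: the maximum tag path sequence length is 15 (for tag paths longer than 15 tags, only the first 15 tags are input), dropout is 0.3, the number of hidden units is 128, the number of LSTM layers is 2, the output has 2 classes (main text or not main text), the optimizer is Adam, the learning rate is 0.001, the loss function is cross-entropy, and the batch size is 32. Training runs for at least 100 epochs; after that, it stops once 20 consecutive epochs have produced no better loss and F1 score. The main text extraction model is obtained in this way. Main text extraction was tested on the 304 web pages in the test set, and the experimental results are shown as the solid line in Figure 4.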
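The truncation rule and the stopping rule in [0052] can be sketched in plain Python, independent of the PyTorch model itself. This is a minimal sketch under stated assumptions, not the patent's implementation: the names `truncate_tag_path` and `EarlyStopper` are hypothetical, an epoch is assumed to count as an improvement when either the loss or the F1 score gets better, and the count of epochs without improvement is assumed to run from the start of training while stopping is only allowed once the minimum number of epochs has passed.

```python
from dataclasses import dataclass

# Values stated in paragraph [0052].
MAX_PATH_LEN = 15   # maximum tag path sequence length
MIN_EPOCHS = 100    # train for at least this many epochs
PATIENCE = 20       # consecutive epochs without improvement before stopping


def truncate_tag_path(tags, max_len=MAX_PATH_LEN):
    """Keep only the first max_len tags of a tag path."""
    return list(tags[:max_len])


@dataclass
class EarlyStopper:
    """Hypothetical helper that applies the stopping rule from [0052]."""
    min_epochs: int = MIN_EPOCHS
    patience: int = PATIENCE
    best_loss: float = float("inf")
    best_f1: float = float("-inf")
    epochs_seen: int = 0
    stale_epochs: int = 0

    def update(self, val_loss, val_f1):
        """Record one epoch and return True when training should stop."""
        self.epochs_seen += 1
        improved = False
        # Assumption: a better loss OR a better F1 score counts as improvement.
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            improved = True
        if val_f1 > self.best_f1:
            self.best_f1 = val_f1
            improved = True
        self.stale_epochs = 0 if improved else self.stale_epochs + 1
        # Stop only after the minimum number of epochs has been reached.
        return (self.epochs_seen >= self.min_epochs
                and self.stale_epochs >= self.patience)
```

In a training loop, `stopper.update(val_loss, val_f1)` would be called once per epoch and the loop would break as soon as it returns `True`.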

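[0053] In Comparative Example 1, Comparative Example 2, and Embodiment 1 above, fuzzy string matching is used to evaluate the results. First, the three tools are used to extract the main text from the 300-plus websites in the validation set, and a reference main text is extracted from the annotations as the standard answer. To eliminate errors caused by differences in segmentation, all spaces and line breaks are removed from the results, and FuzzyWuzzy is used for fuzzy string matching. FuzzyWuzzy is a string similarity tool based on the Levenshtein distance, which expresses the minimum number of character changes needed to turn one string into another. The similarity measure used by FuzzyWuzzy is the ratio between the Levenshtein distance and the average length of the two strings; the higher the score, the more similar the two strings are.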
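The evaluation in [0053] can be illustrated with a small standard-library sketch. It does not call FuzzyWuzzy and does not claim to reproduce its exact scoring formula; the 0 to 100 score below, one minus the Levenshtein distance divided by the average string length and clipped at zero, is an assumption chosen to follow the description above. The function names `normalize`, `levenshtein`, and `similarity` are hypothetical.

```python
import re


def normalize(text):
    """Remove all whitespace (spaces, tabs, line breaks) before comparison."""
    return re.sub(r"\s+", "", text)


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(extracted, reference):
    """Score in [0, 100]; higher means the two strings are more similar.
    Assumed normalization: 1 - distance / average length, clipped at 0."""
    a, b = normalize(extracted), normalize(reference)
    if not a and not b:
        return 100.0
    average_length = (len(a) + len(b)) / 2
    score = 1.0 - levenshtein(a, b) / average_length
    return 100.0 * max(0.0, score)
```

Calling `similarity(tool_output, standard_answer)` once per web page would give one point per page for a curve like the one described for Figure 4.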

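[0054] In Figure 4, the horizontal axis is the web page index, and the vertical axis is the similarity between the main text that a tool extracted from that page and the standard answer; the higher the similarity, the better the tool performs. The dashed line is Readability, the dotted line is Newspaper3k, and the solid line is the main text extraction result of the LSTM model of the present invention. The figure clearly shows that the present invention extracts the main text better.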



Abstract

The invention discloses a web page text extraction method based on deep learning. The method comprises the following steps: 1) preparing a data set from a root DOM node to a leaf DOM node; 2) constructing a data set from the root DOM node to the leaf DOM node; 3) labeling the data in that data set; 4) using fastText to pre-train on and encode the tags of each path; 5) training an LSTM classification model on the tag paths; 6) using the LSTM model to predict whether each tag path is main text; and 7) restoring the extracted web page text. The invention belongs to the technical field of the Internet, and particularly relates to a web page text extraction method based on deep learning, which improves the accuracy of resume web page text extraction.
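The root-to-leaf data set named in the abstract can be sketched with Python's standard `html.parser` module. This is only an illustration under assumptions: the abstract does not say how the paths are collected, so the class `TagPathCollector`, the function `extract_tag_paths`, and the choice to record one (tag path, text) pair per non-empty text node and to skip script and style content are all hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}
# Assumption: script and style content is never main text, so it is skipped.
SKIPPED_TAGS = {"script", "style"}


class TagPathCollector(HTMLParser):
    """Collects (tag path, text) pairs, one per non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, root first
        self.samples = []  # list of (tag_path, text) tuples

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag,
            # which tolerates unclosed child elements.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self.stack and self.stack[-1] not in SKIPPED_TAGS:
            self.samples.append((list(self.stack), text))


def extract_tag_paths(html):
    """Return the (tag path, text) pairs found in an HTML string."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.samples
```

Each returned pair could then be labeled as main text or not (step 3), and the tag sequence could be encoded and classified as the later steps describe.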

Description

Technical field

[0001] The invention belongs to the technical field of the Internet, and specifically relates to a web page text extraction method based on deep learning.

Background art

[0002] There is a large amount of public information on the Internet. Obtaining it requires a series of crawling and natural language processing techniques to fetch and analyze web pages, and web page text extraction is an important research topic among them. With the development of the World Wide Web, the functions and style structures of web pages have become more and more complex, and web pages often contain a lot of useless information: advertisements, external links, navigation bars, etc. Generally speaking, we only care about the main text of a web page. The so-called main text refers to the content we care about on the page, including the target text, pictures, and videos.

[0003] There are many methods of text extraction in the research, which prov...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F16/957, G06N3/04, G06N3/08
Inventor: 陈前华
Owner: GUANGDONG ELECTRONICS IND INST