An abstract extraction method combining a page analysis rule and NLP text vectorization

A summary extraction method that combines page parsing rules with NLP text vectorization, addressing problems such as high computational complexity.

Active Publication Date: 2019-04-26

重庆电信系统集成有限公司 +1



Problems solved by technology

At present, abstract extraction uses a TextRank + Word2Vec model to extract the core sentences of the entire text. For long articles, however, splitting the text into sentences, splitting the sentences into words, vectorizing the words, and calculating distances has high computational complexity. TextRank also assigns weights based on sentence similarity, so in practice, especially for texts such as news whose content and paragraph layout vary widely, many interfering sentences degrade the extraction results.
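The baseline described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the toy two-dimensional vectors stand in for a trained Word2Vec model, cosine similarity is an assumed choice of sentence distance, and the names TOY_VECTORS, sentence_vector, cosine, and textrank are hypothetical. The nested pairwise comparison is where the quadratic cost for long articles comes from.

```python
import math

# Hypothetical toy embeddings standing in for a trained Word2Vec model.
TOY_VECTORS = {
    "stocks": [0.9, 0.1],
    "market": [0.8, 0.2],
    "rose": [0.7, 0.3],
    "fell": [0.6, 0.4],
    "rain": [0.1, 0.9],
    "weather": [0.2, 0.8],
}


def sentence_vector(sentence):
    """Average the vectors of the known words in a sentence."""
    vecs = [TOY_VECTORS[w] for w in sentence.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return [0.0, 0.0]
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def cosine(a, b):
    """Cosine similarity (assumed here; the text only says 'distance')."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def textrank(sentences, damping=0.85, iterations=50):
    """Score sentences by power iteration over a similarity-weighted graph."""
    n = len(sentences)
    if n == 0:
        return []
    vectors = [sentence_vector(s) for s in sentences]
    # Pairwise similarities: n * (n - 1) comparisons, quadratic in n.
    sim = [
        [0.0 if i == j else cosine(vectors[i], vectors[j]) for j in range(n)]
        for i in range(n)
    ]
    out_weight = [sum(row) for row in sim]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(
                sim[j][i] * scores[j] / out_weight[j]
                for j in range(n)
                if out_weight[j] > 0
            )
            for i in range(n)
        ]
    return scores


if __name__ == "__main__":
    demo = ["stocks rose", "market fell", "stocks market", "rain weather"]
    ranked = sorted(zip(textrank(demo), demo), reverse=True)
    print(ranked[0][1])  # the sentence with the highest weight
```

Because every sentence is compared with every other sentence, doubling the number of sentences roughly quadruples the similarity work, which is the cost the method tries to reduce by filtering the text first.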


Embodiment Construction

[0022] The present invention will be further described in detail below in conjunction with the accompanying drawings.

[0023] A method for extracting summaries combining page parsing rules and NLP text vectorization, comprising the following steps:

[0024] S1: Use the Readability package to extract the HTML-format text data inside the "body" tag of the web page, obtaining the text corpus of the page body.

[0025] For example, to extract the body:

[0026] from readability.readability import Document

[0027] from scrapy.selector import HtmlXPathSelector

[0028] from scrapy.http import HtmlResponse

[0029] import urllib

[0030] html = urllib.urlopen(url).read()

[0031] content_t = html.split('')[-1].strip().split('

[0032] content_t = ' '+content_t

[0033] readable_article = Document(content_t).summary()

[0034] response = HtmlResponse(url='', body=readable_article, encoding='utf8')
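The snippet above is written for Python 2 (urllib.urlopen) and relies on the Readability and Scrapy packages. As a rough, standard-library-only illustration of what step S1 aims for, namely keeping only the text inside the "body" tag, the following Python 3 sketch uses html.parser as a stand-in. It is not the Readability algorithm, and the names BodyTextParser and extract_body_text are hypothetical.

```python
from html.parser import HTMLParser


class BodyTextParser(HTMLParser):
    """Collects visible text found inside the <body> element."""

    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text inside <body> and outside script/style.
        if self.in_body and self.skip_depth == 0 and text:
            self.chunks.append(text)


def extract_body_text(html):
    """Return the body text of an HTML string, one text chunk per line."""
    parser = BodyTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


if __name__ == "__main__":
    page = (
        "<html><head><title>T</title></head>"
        "<body><h1>Head</h1><script>var x=1;</script>"
        "<p>First para.</p></body></html>"
    )
    print(extract_body_text(page))
```

Unlike Readability, this stand-in does not try to separate the main article from navigation or advertising blocks; it only shows the "body text in, plain corpus out" shape of the step.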

...


Abstract

The invention discloses an abstract extraction method combining page parsing rules and NLP text vectorization. The method comprises the following steps: S1, extracting the HTML-format text data in the "body" tag of a web page's text data using the Readability package; S2, obtaining the text length of the text corpus and eliminating unqualified text corpora; S3, judging whether the number of sentences in the text corpus is greater than a threshold; S4, judging whether paragraph subtitle phrases can be obtained; S5, defining regular-expression matching keywords and removing the text that matches them to obtain a filtered text corpus; S6, judging whether the text segments are compliant; and S7, training a Word2Vec model, splitting the text corpus into sentences, splitting the sentences into words, vectorizing the words, computing sentence similarity using EMD, assigning weights based on sentence similarity using the TextRank algorithm, and taking the sentence with the highest weight as the abstract sentence of the text. With this method, relatively core sentences can be obtained for long blog posts and news articles, so that their subject can be grasped quickly.
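The filtering steps S2, S3, and S5 might look something like the sketch below. Every concrete value here is an assumption for illustration: the abstract gives no thresholds or keywords, so MIN_TEXT_LENGTH, SENTENCE_THRESHOLD, NOISE_PATTERN, and the function names are hypothetical, and the sentence splitter is deliberately naive. The abstract also does not say what happens on each branch of S3, so the sketch only reports the result of that check.

```python
import re

# Hypothetical placeholders; the source does not give concrete values.
MIN_TEXT_LENGTH = 50
SENTENCE_THRESHOLD = 3
NOISE_PATTERN = re.compile(
    r"(copyright|all rights reserved|click here|editor:)", re.IGNORECASE
)
# Naive splitter: a run of non-terminators plus an optional terminator.
SENTENCE_PATTERN = re.compile(r"[^.!?。！？]+[.!?。！？]?")


def split_sentences(text):
    """Split text into sentences on common Latin and CJK terminators."""
    return [s.strip() for s in SENTENCE_PATTERN.findall(text) if s.strip()]


def prefilter(text):
    """Apply S2, S3, and S5; return None when the corpus is rejected."""
    text = text.strip()
    # S2: discard corpora whose length does not qualify.
    if len(text) < MIN_TEXT_LENGTH:
        return None
    sentences = split_sentences(text)
    # S3: record whether the sentence count exceeds the threshold.
    above_threshold = len(sentences) > SENTENCE_THRESHOLD
    # S5: drop sentences that match the noise keywords.
    kept = [s for s in sentences if not NOISE_PATTERN.search(s)]
    return {"sentences": kept, "above_threshold": above_threshold}


if __name__ == "__main__":
    article = (
        "The council approved the new budget on Monday. "
        "Spending on roads will rise. Schools get more funding. "
        "Click here to subscribe. Copyright 2019 Example News."
    )
    print(prefilter(article))
```

Removing boilerplate sentences before vectorization means fewer sentences reach the pairwise similarity stage in S7, which is where most of the computation is spent.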

Description

Technical field

[0001] The invention relates to the technical field of natural language processing, in particular to an abstract extraction method combining page parsing rules and NLP text vectorization.

Background

[0002] Text summarization matters in everyday life. In this era of information and data explosion, the growing volume of information is hard for people to absorb in a short time, so filtering out cumbersome text and expressing the core information in a few simple sentences is particularly important. The most common examples are the news, Weibo posts, and similar content that we encounter every day. In terms of technical application, the resulting summary can be used for NLP tasks such as classification and topic analysis. At present, the abstract extraction uses the textrank+word2vec model to extract the core sentences of the entire text. However, for long articles, the word2vec model is used to divide the tex...


Application Information


IPC(8): G06F17/27, G06F16/34

CPC: G06F40/211

Inventor: 陈玮刘德彬孙世通严开吴涛

Owner: 重庆电信系统集成有限公司
