An abstract extraction method combining a page analysis rule and NLP text vectorization

A text vectorization and abstract extraction technique, applied in the field of abstract extraction combining page parsing rules and NLP text vectorization, that addresses problems such as high computational complexity.

Active Publication Date: 2019-04-26
重庆电信系统集成有限公司 +1

AI Technical Summary

Problems solved by technology

At present, abstract extraction uses the textrank+word2vec model to extract the core sentences of the entire text. However, for long articles, the word2vec model is used to divide the text into sentences, split the sentences into words, and then vectorize the words, and the process of calculating the ...



Examples


Embodiment Construction

[0022] The present invention will be further described in detail below in conjunction with the accompanying drawings.

[0023] A method for extracting summaries combining page parsing rules and NLP text vectorization, comprising the following steps:

[0024] S1: Use the Readability package to extract the HTML-format text data inside the "body" tag of the web page text data, obtaining the text corpus of the page body.

[0025] For example, to extract the body:

[0026] from readability.readability import Document

[0027] from scrapy.selector import HtmlXPathSelector

[0028] from scrapy.http import HtmlResponse

[0029] import urllib

[0030] html = urllib.urlopen(url).read()

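[0031] content_t = html.split('<body>')[-1].strip().split('</body>')[0]  # tag literals were lost in extraction; '<body>' and '</body>' are assumed from step S1

[0032] content_t = '<body>' + content_t  # assumed: re-attach the opening tag removed by the split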

[0033] readable_article = Document(content_t).summary()

[0034] response = HtmlResponse(url='', body=readable_article, encoding='utf8')

...
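As a rough illustration of what step S1 aims at, the following is a minimal sketch that pulls the visible text out of the "body" element using only the Python standard library. It is a stand-in, not the Readability package used in the patent: Readability also scores and selects the main content block, which this sketch does not attempt. The names `BodyTextExtractor`, `extract_body_text`, and `sample` are hypothetical.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect visible text found inside <body>, skipping script/style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is inside <body> and outside script/style.
        if self.in_body and self.skip_depth == 0 and text:
            self.chunks.append(text)


def extract_body_text(html):
    """Return the body text of an HTML string, one text chunk per line."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


# Hypothetical sample page, used only for illustration.
sample = (
    "<html><head><title>t</title></head><body>"
    "<h1>Heading</h1><script>var x = 1;</script>"
    "<p>First paragraph.</p><p>Second paragraph.</p></body></html>"
)
print(extract_body_text(sample))
```

The resulting plain text is the kind of "text corpus" that the later steps (length check, sentence count, keyword filtering) would operate on.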


Abstract

The invention discloses an abstract extraction method combining a page analysis rule and NLP text vectorization. The method comprises the following steps: S1, extracting the HTML-format text data in the 'body' tag of a web page's text data by using the Readability package; S2, obtaining the text length of the text corpus and eliminating unqualified text corpus; S3, judging whether the number of sentences in the text corpus is greater than a threshold; S4, judging whether paragraph subtitle phrases can be obtained; S5, defining regular-expression matching keywords and removing the text they match to obtain the filtered text corpus; S6, judging the compliance of the text segments; and S7, training a Word2Vec model, splitting the text corpus into sentences, splitting the sentences into words, vectorizing them, computing sentence similarity with EMD, assigning weights based on sentence similarity with the TextRank algorithm, and taking the sentence with the highest weight as the abstract sentence. With this method, relatively core sentences can be obtained for long blogs and news articles, so that their subject can be grasped quickly.
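The pre-filtering steps S2, S3, and S5 can be sketched as follows. This is only an illustration under stated assumptions: the thresholds `MIN_CHARS` and `SENTENCE_THRESHOLD`, the keyword pattern `NOISE_KEYWORDS`, and the function names are all hypothetical, since the text available here gives no concrete values and does not say what happens on each branch of S3. Steps S4 and S6 are omitted because their rules are not described.

```python
import re

# Hypothetical values; the patent text shown here does not specify them.
MIN_CHARS = 50            # S2: minimum text length to keep a corpus
SENTENCE_THRESHOLD = 3    # S3: sentence-count threshold
NOISE_KEYWORDS = re.compile(
    r"click here|all rights reserved|subscribe", re.IGNORECASE
)                         # S5: regular-expression matching keywords

# A sentence is a run of non-terminator characters plus trailing terminators.
SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]*")


def split_sentences(text):
    """Split text into sentences on Western and Chinese end punctuation."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]


def prepare_corpus(text):
    """Apply S2, S3, S5; return (filtered_sentences, exceeds_threshold) or None."""
    if len(text) < MIN_CHARS:                      # S2: reject unqualified text
        return None
    sentences = split_sentences(text)
    exceeds_threshold = len(sentences) > SENTENCE_THRESHOLD   # S3
    filtered = [s for s in sentences if not NOISE_KEYWORDS.search(s)]  # S5
    return filtered, exceeds_threshold


# Hypothetical sample text, used only for illustration.
text = (
    "Deep learning improves summarization. Click here to subscribe. "
    "TextRank ranks sentences by similarity. Word vectors capture meaning. "
    "All rights reserved."
)
print(prepare_corpus(text))
```

The sketch only reports whether the sentence count exceeds the threshold and leaves the decision to the caller, because the branch behavior is not stated in the text above.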

Description

Technical field

[0001] The invention relates to the technical field of natural language processing, in particular to an abstract extraction method combining page parsing rules and NLP text vectorization.

Background

[0002] Text summarization matters in everyday life. In this era of information and data explosion, the growing volume of information is hard to absorb in a short time, so filtering out cumbersome text and expressing the core information in a few simple sentences is particularly important. The most common examples are the news and Weibo posts we encounter every day. In terms of technical application, the resulting summary can be used for NLP tasks such as classification and topic analysis. At present, abstract extraction uses the textrank+word2vec model to extract the core sentences of the entire text. However, for long articles, the word2vec model is used to divide the tex...
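The TextRank weighting mentioned in the background can be sketched in a few lines. This is a minimal illustration, not the patented method: sentence similarity here is a plain token-overlap (Jaccard) score, swapped in for the Word2Vec vectors and EMD-based similarity that the patent describes, so that the example needs no trained model. The function names, the damping factor of 0.85, and the sample sentences are assumptions.

```python
def token_overlap(a, b):
    """Jaccard overlap of lowercase tokens (stand-in for Word2Vec + EMD)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def textrank(sentences, similarity=token_overlap, damping=0.85, iterations=50):
    """Score sentences by weighted PageRank over the similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    # Edge weight between two distinct sentences is their similarity.
    w = [
        [similarity(sentences[i], sentences[j]) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    out_sum = [sum(row) for row in w]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            # Each neighbor j passes on a share of its score proportional
            # to the weight of edge (j, i) among all of j's edges.
            rank = sum(
                w[j][i] / out_sum[j] * scores[j]
                for j in range(n) if out_sum[j] > 0
            )
            new_scores.append((1 - damping) / n + damping * rank)
        scores = new_scores
    return scores


def top_sentence(sentences):
    """Return the sentence with the highest TextRank weight."""
    scores = textrank(sentences)
    best = max(range(len(sentences)), key=scores.__getitem__)
    return sentences[best]


# Hypothetical sample sentences, used only for illustration.
sentences = [
    "the cat sat on the mat",
    "the cat lay on the mat",
    "stock prices fell sharply today",
    "the cat sat quietly",
]
print(top_sentence(sentences))
```

Because every pair of sentences is compared, the similarity step grows quadratically with the number of sentences, which is one reason long articles are costly for this kind of ranking.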


Application Information

IPC(8): G06F17/27, G06F16/34
CPC: G06F40/211
Inventors: 陈玮, 刘德彬, 孙世通, 严开, 吴涛
Owner: 重庆电信系统集成有限公司