An abstract extraction method combining a page analysis rule and NLP text vectorization
A text vectorization and extraction method, applied to abstract extraction that combines page parsing rules with NLP text vectorization, which can address problems such as high computational complexity.
Active Publication Date: 2019-04-26
重庆电信系统集成有限公司 +1
Problems solved by technology
At present, abstract extraction uses a TextRank + Word2Vec model to extract the core sentences of the entire text. However, for long articles, the text is divided into sentences, the sentences are split into words, the words are vectorized with the Word2Vec model, and the process of calculating the ...
Embodiment Construction
[0022] The present invention will be further described in detail below in conjunction with the accompanying drawings.
[0023] An abstract extraction method combining page parsing rules and NLP text vectorization, comprising the following steps:
[0024] S1: Use the Readability package to extract the HTML-format text data inside the "body" tag of the web page, obtaining the text corpus of the page body.
[0025] For example, to extract the body:
[0026] from readability.readability import Document
[0027] from scrapy.selector import HtmlXPathSelector
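The imports above are all that the source shows for step S1. As a rough illustration of what "extracting the text inside the body tag" involves, here is a minimal sketch that uses only Python's standard library `html.parser` as a stand-in for the Readability package. The class name `BodyTextExtractor` and the function `extract_body_text` are hypothetical names introduced for this example, and the decision to skip `<script>` and `<style>` content is an assumption, not something the source specifies.

```python
from html.parser import HTMLParser


class BodyTextExtractor(HTMLParser):
    """Collect visible text inside <body>, skipping <script> and <style>.

    Hypothetical stand-in for the Readability package named in step S1.
    """

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep only non-empty text that is inside <body> and not in a skipped tag.
        if self.in_body and self.skip_depth == 0 and text:
            self.chunks.append(text)


def extract_body_text(html):
    """Return the body text of an HTML page, one text node per line."""
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)


page = (
    "<html><head><title>T</title><style>p{}</style></head>"
    "<body><h1>Title</h1><p>First sentence.</p>"
    "<script>var x = 1;</script><p>Second.</p></body></html>"
)
print(extract_body_text(page))
```

Unlike Readability, this sketch does not try to separate the main article from navigation or advertising blocks; it only shows the shape of the text corpus that the later steps would consume.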
Abstract
The invention discloses an abstract extraction method combining a page analysis rule and NLP text vectorization. The method comprises the following steps: S1, extracting the HTML-format text data in the "body" tag of a web page's text data by using the Readability package; S2, obtaining the text length of the text corpus and eliminating unqualified text corpora; S3, judging whether the number of sentences in the text corpus is greater than a threshold value; S4, judging whether paragraph subtitle phrases can be obtained; S5, defining regular-expression matching keywords and removing the text that matches them to obtain the filtered text corpus; S6, judging the compliance of the language segments; and S7, training a Word2Vec model, splitting the text corpus into sentences, splitting the sentences into words, performing the vectorization operation, computing sentence similarity by using EMD, assigning weights based on the sentence similarity by using the TextRank algorithm, and taking the sentence with the highest weight as the text abstract sentence. With this method, relatively core sentences can be obtained for long blogs and news articles, so that their subjects can be grasped quickly.
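To make step S7 more concrete, the following is a minimal sketch of the ranking stage only, written under several stated assumptions. The table `TOY_VECTORS` is a hypothetical set of hand-written 2-D vectors standing in for a trained Word2Vec model. The function `relaxed_emd` is not the exact EMD named in the abstract but a relaxed word mover's distance (each word moves to its nearest word in the other sentence), which is a lower bound on the exact value. The conversion `1 / (1 + distance)` from distance to similarity, the damping factor, and the iteration count are likewise assumptions.

```python
import math
import re

# Hypothetical 2-D word vectors standing in for a trained Word2Vec model.
TOY_VECTORS = {
    "cats": (1.0, 0.0),
    "dogs": (0.9, 0.1),
    "pets": (0.95, 0.05),
    "stocks": (0.0, 1.0),
    "markets": (0.1, 0.9),
}


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence, vectors):
    """Lowercase the sentence and keep only words that have a vector."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w in vectors]


def one_way_cost(src, dst, vectors):
    """Mean distance from each source word to its nearest destination word."""
    return sum(
        min(math.dist(vectors[a], vectors[b]) for b in dst) for a in src
    ) / len(src)


def relaxed_emd(words_a, words_b, vectors):
    """Relaxed word mover's distance (assumed stand-in for the exact EMD)."""
    if not words_a or not words_b:
        return math.inf
    return max(one_way_cost(words_a, words_b, vectors),
               one_way_cost(words_b, words_a, vectors))


def textrank(weights, damping=0.85, iterations=50):
    """Power iteration over a symmetric, weighted sentence graph."""
    n = len(weights)
    out_sums = [sum(row) for row in weights]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / out_sums[j] * scores[j]
                for j in range(n) if out_sums[j] > 0
            )
            for i in range(n)
        ]
    return scores


def summarize(text, vectors=TOY_VECTORS):
    """Return the highest-weighted sentence and all sentence scores."""
    sentences = split_sentences(text)
    if not sentences:
        return "", []
    tokens = [tokenize(s, vectors) for s in sentences]
    n = len(sentences)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Smaller distance means higher similarity; infinite distance gives 0.
            sim = 1.0 / (1.0 + relaxed_emd(tokens[i], tokens[j], vectors))
            weights[i][j] = weights[j][i] = sim
    scores = textrank(weights)
    best = max(range(n), key=scores.__getitem__)
    return sentences[best], scores


summary, scores = summarize(
    "Cats are pets. Dogs are pets too. Stocks fell as markets closed."
)
print(summary, scores)
```

In this toy input the two sentences about pets are close to each other in vector space, so they reinforce one another in the graph and outrank the unrelated sentence about markets. A production version would presumably replace the toy vectors with a trained model and the relaxed distance with an exact EMD solver.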
Description
Technical field
[0001] The invention relates to the technical field of natural language processing, and in particular to an abstract extraction method combining page parsing rules and NLP text vectorization.
Background technique
[0002] Text summarization matters in everyday life. In an era of information and data explosion, the growing volume of information is hard for people to absorb in a short time, so filtering out cumbersome text and expressing the core information in a few simple sentences is particularly important. The most common examples are the news and Weibo posts that people encounter every day. In terms of technical application, the resulting summary can be used for NLP tasks such as classification and topic analysis. At present, abstract extraction uses a TextRank + Word2Vec model to extract the core sentences of the entire text. However, for long articles, the word2vec model is used to divide the tex...