Automatic document summarization extraction method based on term vectors

An automatic document summarization technology, applied in special data processing applications, instruments, electrical digital data processing, etc., that addresses problems such as reduced accuracy of node weights, degraded summarization performance, and neglect of the semantic similarity between sentences.

Active Publication Date: 2015-08-12
DALIAN UNIV OF TECH

AI Technical Summary

Problems solved by technology

Each value in the sentence similarity matrix represents the jump probability from one sentence to another, so the calculation of node weights is critical. However, traditional graph-based methods mostly compute the similarity between sentences from the co-occurrence of the feature words the sentences contain. This ignores the semantic similarity between sentences, reduces the accuracy of the node weight calculation, and degrades summarization performance.
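To illustrate how a sentence similarity matrix can act as a set of jump probabilities, the following is a minimal sketch, not the patented procedure. It assumes a PageRank-style random walk with a damping factor; the function names `transition_matrix` and `node_weights`, the damping value 0.85, and the toy matrix are all hypothetical choices made for illustration.

```python
def transition_matrix(sim):
    """Row-normalize a sentence similarity matrix into jump probabilities."""
    n = len(sim)
    rows = []
    for i in range(n):
        # Ignore self-similarity so a sentence does not jump to itself.
        weights = [sim[i][j] if i != j else 0.0 for j in range(n)]
        total = sum(weights)
        if total == 0.0:
            # Isolated sentence: jump uniformly to any sentence.
            rows.append([1.0 / n] * n)
        else:
            rows.append([w / total for w in weights])
    return rows


def node_weights(sim, damping=0.85, iterations=100, tol=1e-9):
    """Power iteration for PageRank-style sentence weights (assumed scheme)."""
    n = len(sim)
    p = transition_matrix(sim)
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new = [
            (1.0 - damping) / n
            + damping * sum(scores[j] * p[j][i] for j in range(n))
            for i in range(n)
        ]
        converged = max(abs(a - b) for a, b in zip(new, scores)) < tol
        scores = new
        if converged:
            break
    return scores


# Toy example: sentences 0 and 1 are similar to each other, sentence 2 is not.
toy_sim = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
toy_weights = node_weights(toy_sim)
```

Because every entry of the similarity matrix feeds directly into the jump probabilities, any error in the similarity values propagates into the node weights, which is why the choice of similarity measure matters.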


Examples

Embodiment

[0085] In order to make the purpose, technical solutions and beneficial effects of the present invention clearer and easier to implement, the present invention will be further described in detail in combination with the following specific embodiments and with reference to the accompanying drawings. In this embodiment, the length of the generated summary is preset to be 150 words.

[0086] S1. Train a deep neural network model on the corpus to obtain word vector representations of the feature words:

[0087] To obtain vector representations of the feature words, the embodiment collects the experimental corpus from MEDLINE, the biomedical literature database maintained by the United States National Library of Medicine. The sentences in each citation are preprocessed: stop words (matched against a stop word list), special characters, and punctuation marks are removed, finally yielding a training corpus of 1.2 GB.
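The preprocessing described in this paragraph might look like the following minimal sketch. The stop word list shown is a tiny hypothetical stand-in, since the source does not reproduce its list; lowercasing and whitespace tokenization are likewise assumptions, and `preprocess` is an illustrative name.

```python
import re

# Hypothetical, tiny stop word list; the source does not give its actual list.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}


def preprocess(sentence, stop_words=STOP_WORDS):
    """Lowercase, drop punctuation and special characters, remove stop words."""
    # Replace anything that is not a letter, digit, or whitespace with a space.
    cleaned = re.sub(r"[^a-z0-9\s]", " ", sentence.lower())
    # Split on whitespace and filter tokens against the stop word list.
    return [tok for tok in cleaned.split() if tok not in stop_words]


tokens = preprocess("The role of p53 in tumor suppression, and apoptosis!")
```

The remaining tokens are the feature words that would be passed on to word vector training.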

[0088] In the training process of this e...


Abstract

Provided is an automatic document summarization extraction method based on term vectors. The method comprises the following steps: S1, a deep neural network model is trained on a corpus to obtain term vector representations of feature terms; S2, a sentence graph model is constructed; S3, the weights of the sentences are calculated; S4, a maximum marginal relevance algorithm is used to generate the summary. In this method, a corpus is collected and preprocessed to obtain a training feature corpus; the deep neural network model is trained on this training feature corpus to obtain the term vectors of the feature terms; a candidate document set and a candidate sentence set are retrieved from the corpus using preset search terms; and the semantic similarity between sentences is computed from the term vectors of the feature terms, giving the semantic relation between every pair of sentences. This avoids the calculation errors that arise in traditional term co-occurrence methods when two sentences have the same meaning but use different terms, and therefore improves both the accuracy of the similarity calculation and the performance of the summarization.
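The following minimal sketch shows how term vectors can yield a nonzero similarity between sentences that share no terms, and how a maximum marginal relevance (MMR) step can then pick sentences under a word budget. It rests on several assumptions not confirmed by the source: a sentence vector is taken as the average of its term vectors, similarity is cosine similarity, and the MMR trade-off parameter `lam` is 0.7. The toy vectors, weights, and function names are hypothetical.

```python
import math


def sentence_vector(tokens, vectors):
    """Average the term vectors of the tokens that have one (assumed pooling)."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[k] for v in known) / len(known) for k in range(dim)]


def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is missing or zero."""
    if u is None or v is None:
        return 0.0
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mmr_select(sentences, weights, sim, max_words, lam=0.7):
    """Greedy MMR: trade sentence weight against redundancy with picks so far."""
    selected, used = [], 0
    candidates = set(range(len(sentences)))
    while candidates:
        def score(i):
            redundancy = max((sim[i][j] for j in selected), default=0.0)
            return lam * weights[i] - (1.0 - lam) * redundancy
        best = max(sorted(candidates), key=score)
        candidates.remove(best)
        # Keep the sentence only if it still fits in the word budget.
        if used + len(sentences[best]) <= max_words:
            selected.append(best)
            used += len(sentences[best])
    return selected


# Hypothetical two-dimensional term vectors, for illustration only.
toy_vectors = {
    "tumor": [1.0, 0.0], "cancer": [0.9, 0.1], "growth": [0.8, 0.2],
    "protein": [0.0, 1.0], "folding": [0.1, 0.9],
}
toy_sentences = [["tumor", "growth"], ["cancer", "growth"], ["protein", "folding"]]
vecs = [sentence_vector(s, toy_vectors) for s in toy_sentences]
toy_sim = [[cosine(a, b) for b in vecs] for a in vecs]
picked = mmr_select(toy_sentences, [0.5, 0.4, 0.3], toy_sim, max_words=4)
```

In this toy setup, "tumor" and "cancer" never co-occur yet their vectors are close, so a co-occurrence measure would score them as unrelated while the vector-based measure does not. The MMR step then penalizes the second sentence for being nearly redundant with the first.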

Description

Technical field

[0001] The invention relates to the fields of computer information retrieval and text mining, in particular to a method for automatically extracting document summaries based on word vectors.

Background technique

[0002] Text summarization technology is an important part of the text mining research field. This technology can identify the most important information in a document or document set and express it in a concise and coherent short text. With the advancement of science and technology and the development of network technology, there is a large amount of available information on the Internet. Faced with a large amount of data, this research can help users quickly understand the required information, save users' reading time, and improve work efficiency.

[0003] The current text summarization technology is mainly extractive summarization, that is, extracting the most important sentences from the original text to form a summary. The generation...

Application Information

IPC(8): G06F17/30
CPC: G06F16/345
Inventor: 林鸿飞, 郝辉辉
Owner: DALIAN UNIV OF TECH