Method and device for generating document abstract

A document summarization technology for document sets, applied in word processing, special data processing applications, instruments, etc., that addresses the problem of high redundancy in document summaries.

Active Publication Date: 2018-06-29
SHENZHEN RAISOUND TECH
4 Cites, 11 Cited by

AI Technical Summary

Problems solved by technology

[0004] However, in traditional multi-document summarization, the document set is generally divided into several subsets of sentences with similar meanings, and sentences are then extracted from the different subsets to form the summary. This approach considers only whether a sentence is representative from the perspective of the overall document set, which ultimately leads to excessive redundancy in the generated document summary.

Detailed Description of the Embodiments

[0056] As shown in Figure 1, in one embodiment, a method for generating a document summary includes the following steps:

[0057] S110, preprocessing the document set to obtain a sentence set and a vocabulary set corresponding to the document set.

[0058] Specifically, the entire document set belonging to the same topic is traversed and split into sentences to obtain the sentence set. Word segmentation is then performed on the English or Chinese document set. English documents are segmented using spaces, symbols, and paragraph breaks. Chinese documents may be segmented using a method based on string matching, a method based on understanding, or a method based on word frequency statistics, but segmentation is not limited to these methods. For each word in each sentence, it is determined whether the word appears in the preset stop word list; if it does, it is deleted; if not, s...
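The preprocessing in step S110 can be sketched roughly as follows for English text. This is a minimal illustration under stated assumptions, not the patented implementation: the stop word list, the regular expressions, and the function names `split_sentences`, `tokenize`, and `preprocess` are all hypothetical, and the sketch assumes that words not found in the stop word list are kept (the source text is cut off at that point). Chinese word segmentation is not covered.

```python
import re

# Hypothetical stop word list; the source only says a preset list is used.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def split_sentences(text):
    """Split text after ., !, ? or the Chinese terminators 。！？ (simplified rule)."""
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p.strip() for p in parts if p.strip()]


def tokenize(sentence):
    """English-only tokenization: lowercase, then split on spaces and symbols."""
    return re.findall(r"[a-z0-9']+", sentence.lower())


def preprocess(documents, stop_words=STOP_WORDS):
    """Return (sentence set, per-sentence word lists, vocabulary set)."""
    sentences, tokenized, vocabulary = [], [], set()
    for doc in documents:
        for sent in split_sentences(doc):
            # Assumption: words outside the stop word list are retained.
            words = [w for w in tokenize(sent) if w not in stop_words]
            if not words:
                continue  # skip sentences made up only of stop words
            sentences.append(sent)
            tokenized.append(words)
            vocabulary.update(words)
    return sentences, tokenized, vocabulary


if __name__ == "__main__":
    docs = ["The cat sat. The dog ran!", "A cat is in the house."]
    sents, toks, vocab = preprocess(docs)
    for s, t in zip(sents, toks):
        print(s, "->", t)
    print(sorted(vocab))
```

The sentence splitter is deliberately naive (it would also split inside decimals such as "3.14"); a production system would likely use a dedicated sentence segmenter.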

Abstract

The invention relates to a method for generating a document abstract. The method includes the following steps. A document set is preprocessed, and the resulting vocabulary set is processed with a latent Dirichlet model or a vector space model to obtain a weight for each word. The weights of all words in each sentence of the sentence set are added to obtain that sentence's internal information amount score. According to a preset similarity threshold, the similar sentences of each sentence and their number are determined, and a corresponding importance score is calculated. The number of similar sentences of each sentence is compared with the numbers of similar sentences of each of its similar sentences to calculate a diversity score for each sentence, and a comprehensive score is then calculated for each sentence. Finally, sentences are screened according to their comprehensive scores and a preset abstract length, and the document abstract is generated. In addition, the invention provides a device for generating the document abstract. The method and device for generating the document abstract reduce the overall redundancy of the abstract.
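The scoring pipeline described in the abstract can be illustrated with a small sketch. The abstract does not give the actual formulas, so every formula here is an assumption chosen for illustration: word weights are relative term frequencies (a stand-in for the latent Dirichlet or vector space model weights), similarity is Jaccard overlap of word sets, importance is `(1 + number of similar sentences) / n`, diversity is `1 / (1 + d)` where `d` counts similar sentences that have more similar sentences (ties broken by earlier position), and the comprehensive score is the product of the three. The names `score_sentences` and `select_summary` are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard similarity of two word lists (assumed similarity measure)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def score_sentences(tokenized, threshold=0.5):
    """Compute illustrative info, importance, diversity and composite scores."""
    n = len(tokenized)
    # Assumed word weight: relative frequency over the whole sentence set.
    counts = Counter(w for sent in tokenized for w in sent)
    total = sum(counts.values())
    weight = {w: c / total for w, c in counts.items()}
    # Internal information score: sum of the weights of the sentence's words.
    info = [sum(weight[w] for w in sent) for sent in tokenized]
    # Similar sentences: those whose similarity reaches the preset threshold.
    similar = [
        {j for j in range(n)
         if j != i and jaccard(tokenized[i], tokenized[j]) >= threshold}
        for i in range(n)
    ]
    num_similar = [len(s) for s in similar]
    importance = [(1 + k) / n for k in num_similar]
    diversity = []
    for i in range(n):
        # Count similar sentences that "outrank" sentence i: more similar
        # sentences, or the same number and an earlier position (assumed rule).
        dominated = sum(
            1 for j in similar[i]
            if (num_similar[j], -j) > (num_similar[i], -i)
        )
        diversity.append(1 / (1 + dominated))
    composite = [a * b * c for a, b, c in zip(info, importance, diversity)]
    return {"info": info, "similar": similar, "importance": importance,
            "diversity": diversity, "composite": composite}


def select_summary(tokenized, composite, max_words):
    """Greedily pick top-scoring sentences that fit the preset length."""
    order = sorted(range(len(tokenized)), key=lambda i: -composite[i])
    chosen, used = [], 0
    for i in order:
        if used + len(tokenized[i]) <= max_words:
            chosen.append(i)
            used += len(tokenized[i])
    return sorted(chosen)  # indices in original document order


if __name__ == "__main__":
    toks = [["cat", "sat", "mat"], ["cat", "sat", "rug"], ["dog", "ran", "park"]]
    scores = score_sentences(toks, threshold=0.5)
    print(scores["composite"])
    print(select_summary(toks, scores["composite"], max_words=3))
```

In this sketch the diversity term lowers the score of a sentence that is similar to a sentence ranked above it, which is one simple way to discourage near-duplicate sentences from both entering the summary.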

Description

Technical field

[0001] The invention relates to the field of language and word processing, and in particular to a method and device for generating document abstracts.

Background

[0002] With the rapid development of Internet technology, the data in computer networks is growing explosively, and the serious problem of information overload cannot be ignored. When browsing web pages belonging to the same topic, some pages share a large amount of identical information and contain relatively little distinct information. In this case, a tool for summarizing information is needed so that it can be browsed quickly. It is therefore necessary to condense the content of these pages into a document summary to improve the efficiency of information acquisition.

[0003] In network data, text data occupies a very important part. Multi-document summarization is a natural language processing technology that ultimately extracts a text from the main information described by multiple docume...

Application Information

IPC(8): G06F17/21, G06F17/27
CPC: G06F40/10, G06F40/211
Inventors: 张剑, 刘轶, 王宝岩, 黄石磊
Owner: SHENZHEN RAISOUND TECH