
Multiple-document automatic abstracting method based on frequent itemset

A frequent-itemset-based automatic summarization technique, applied in the field of data processing. It addresses problems that arise in sentence similarity calculation, such as the unequal contribution of different words, phrase shifting, and low similarity scores for semantically similar sentences, and it offers high simplicity, high legibility, and high practicability.

Status: Inactive. Publication Date: 2011-05-04
SICHUAN UNIV
Cites: 2; Cited by: 56

AI Technical Summary

Problems solved by technology

However, current semantic annotation theory is immature, so relying solely on the semantic information of words cannot truly reflect the meaning of a sentence, and the accuracy of the resulting sentence similarity is questionable.
[0006] Combined word-form and word-order method: this method relies on word-form matching between words, ignores semantic information, and does not distinguish the different influence that words of different parts of speech have on a sentence. As a result, semantically similar sentences often receive unreasonably low similarity scores.
[0007] Dependency tree method: Sui Zhifang et al. proposed a sentence similarity calculation model based on the skeleton dependency tree, built on an analysis of the syntactic structure of the sentence. Although the model is sound in theory, it is of limited practical use.
[0008] Edit distance method: Che Wanxiang et al. applied an improved edit distance method to Chinese sentence similarity, computing similarity from the edit distance between the words of the sentences and adding the semantic information of the vocabulary. Although its accuracy is higher than that of the purely semantic-dictionary-based method, different words do not contribute equally to the sentence as a whole, and phrase shifting is common in Chinese sentences, so the results can be inaccurate.
[0009] Sentence similarity calculation is the most basic and critical step in multi-document subtopic division. Research on Chinese similarity calculation is still in its infancy, and applying it to multi-document automatic summarization remains very difficult.



Examples


Specific Embodiment

[0051] A multi-document collection covering 20 topics is selected from the Sogou dataset; together, the 20 topics contain about 250 Chinese documents. The documents are segmented into words with the word segmentation algorithm of the Chinese Academy of Sciences, stop words are removed according to a stop word list, and the Apriori association algorithm is used to mine frequent itemsets. Multi-document summaries are generated at compression ratios of 10%, 20%, and 30%.
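The frequent itemset mining step can be sketched as follows. This is a minimal illustration of the classic Apriori procedure, not the patent's own implementation. It assumes each sentence has already been reduced to a list of words (one transaction per sentence); the function name `apriori`, the `min_support` parameter (an absolute count), and the sample words are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for every frequent itemset.

    transactions: iterable of word lists (assumed: one list per sentence).
    min_support:  minimum number of transactions that must contain an itemset.
    """
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = set()
        for a, b in combinations(list(current), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in current for sub in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Count the support of each surviving candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= min_support}
        frequent.update(current)
        k += 1

    return frequent


# Hypothetical sample: five "sentences" as bags of words.
sample = [
    ["quake", "rescue", "team"],
    ["quake", "rescue"],
    ["quake", "team"],
    ["rescue", "team"],
    ["quake", "rescue", "team"],
]
itemsets = apriori(sample, min_support=3)
```

With this sample and a threshold of 3, every single word and every pair is frequent, while the three-word set appears in only two sentences and is dropped.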

[0052] 1. Preprocessing: sentence segmentation, word segmentation, and stop-word removal

[0053] In order to cluster all the sentences in the multi-document collection, the documents must first be split into sentences. Let $D = \{d_1, d_2, \ldots, d_n\}$ denote the multi-document collection, where $d_i$ is a single document. Use regular expressions to match the end of the sentence, and divide the mul...
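The preprocessing described above might look like the sketch below. It is an illustration under stated assumptions, not the patent's code: the set of sentence-ending punctuation marks is a guess, the word segmenter is passed in as a caller-supplied `tokenize` function (the example uses `str.split` on text that is already space-separated, standing in for a real Chinese segmenter), and the names `split_sentences`, `remove_stop_words`, and `preprocess` are hypothetical.

```python
import re

# Assumed sentence terminators: Chinese and ASCII full stop equivalents.
SENTENCE_END = re.compile(r"[。！？!?]+")


def split_sentences(text):
    """Split text at sentence-ending punctuation and drop empty pieces."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def remove_stop_words(tokens, stop_words):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t not in stop_words]


def preprocess(documents, tokenize, stop_words):
    """Turn documents d_1..d_n into one filtered token list per sentence."""
    result = []
    for doc in documents:
        for sentence in split_sentences(doc):
            tokens = remove_stop_words(tokenize(sentence), stop_words)
            if tokens:  # skip sentences made up entirely of stop words
                result.append(tokens)
    return result


# Hypothetical usage with pre-segmented (space-separated) text.
docs = ["我 喜欢 的 书。他 的 猫！"]
sentences = preprocess(docs, tokenize=str.split, stop_words={"的"})
```

Passing the segmenter in as a function keeps the sketch independent of any particular segmentation library.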



Abstract

The invention discloses a multi-document automatic summarization method based on frequent itemsets. The method introduces the idea of frequent itemset mining from association rules: an association method is used to mine the frequent itemsets of the effective itemsets, and these serve as subtopics. Sentences are clustered directly into the different subtopics without any sentence similarity computation, and multi-document automatic summarization is then carried out on the basis of the SFI (sub-topics based on frequent item sets) method. Because no sentence similarity computation is required, the method has characteristics such as high simplicity, high legibility, and high practicability.
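One way to picture clustering without similarity computation is the sketch below. The assignment rule is an assumption made for illustration, since this excerpt does not spell out the SFI rule: a sentence is placed in a subtopic when its words contain every item of that subtopic's frequent itemset. The function name `cluster_by_itemsets` and the sample data are hypothetical.

```python
def cluster_by_itemsets(sentence_tokens, itemsets):
    """Group sentence indices under each frequent itemset (subtopic).

    Assumed rule: sentence i joins a subtopic when the itemset is a subset
    of the sentence's words. No pairwise sentence similarity is computed.
    """
    clusters = {frozenset(s): [] for s in itemsets}
    for idx, tokens in enumerate(sentence_tokens):
        token_set = set(tokens)
        for itemset, members in clusters.items():
            if itemset <= token_set:
                members.append(idx)
    return clusters


# Hypothetical sample: three tokenized sentences and two subtopics.
sentences = [
    ["quake", "rescue", "team"],
    ["quake", "damage"],
    ["rescue", "team", "arrive"],
]
subtopics = [{"quake"}, {"rescue", "team"}]
clusters = cluster_by_itemsets(sentences, subtopics)
```

Under this rule a sentence may fall into more than one subtopic, and the cost is one subset test per sentence and itemset rather than a comparison of every sentence pair.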

Description

Technical field

[0001] The invention relates to the technical field of data processing, and in particular to a method for the automatic summarization of multiple documents based on frequent itemsets.

Background

[0002] With the continuous development of the global information highway, and especially the continuing spread of Internet applications, a large number of electronic documents emerge every day and are transmitted and exchanged on the Internet. Multi-document summarization is a text collection compression technique: information repeated across multiple documents on the same topic appears in the summary only once, and other topic-related information is extracted in order of importance, subject to the compression ratio. At present, research on multi-document summarization outside China focuses mainly on the processing of English information. Commonly used multi-document summarization methods include methods based on single-document summa...


Application Information

IPC(8): G06F17/30, G06F17/27
Inventors: 章毅, 彭德中, 张蕾, 吕建成, 张海仙, 桑永胜, 杜芳
Owner: SICHUAN UNIV