Novel semantic analysis method and system for large-scale document themes

A document-topic semantic analysis technology, applied in the field of semantic analysis, that addresses problems such as the inability to cover all semantic situations, low knowledge support, and long topic training time.

Active Publication Date: 2017-06-13
SOUTH CHINA NORMAL UNIVERSITY

AI Technical Summary

Problems solved by technology

Such dictionaries cannot cover all possible semantic situations, especially vocabulary and domain-specific knowledge that do not appear in them.
The LDA topic model obtains the semantic relationships among documents, topics, and words by computing statistics over the corpus. However, because it still relies on the bag-of-words model, it cannot avoid the curse of dimensionality caused by a large vocabulary. Furthermore, the iterative matrix operations in LDA model training make topic training take too long.
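
The dimensionality problem can be illustrated with a minimal sketch. It is not taken from the patent: the toy corpus and the helper name `bag_of_words` are assumptions chosen for illustration. It builds bag-of-words count vectors and shows that each document vector has one entry per vocabulary word, so vector length grows with the vocabulary.

```python
from collections import Counter


def bag_of_words(docs):
    """Return (vocabulary, count vectors) for whitespace-tokenized documents.

    Hypothetical helper for illustration only; not part of the patented method.
    """
    # The vocabulary is every distinct lower-cased token in the corpus.
    vocab = sorted({w for d in docs for w in d.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        vec = [0] * len(vocab)  # one slot per vocabulary word
        for w, c in counts.items():
            vec[index[w]] = c
        vectors.append(vec)
    return vocab, vectors


# Toy corpus (assumed data).
docs = [
    "topic models learn topics",
    "documents contain words",
    "words form topics",
]
vocab, vectors = bag_of_words(docs)
print(len(vocab), [len(v) for v in vectors])
```

Every vector is as long as the vocabulary regardless of how short the document is, which is why a large vocabulary makes bag-of-words representations expensive.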



Examples


Embodiment Construction

[0053] Referring to Figure 1, the novel large-scale document topic semantic analysis method of the present invention comprises the following steps:

[0054] A. Detect whether the known document collection contains classification information. If so, perform a supervised topic generation step to generate multiple topic sets; otherwise, perform an unsupervised topic generation step to generate multiple topic sets;

[0055] B. Using the topic sets obtained, calculate the correlation between the document to be analyzed and each topic set, thereby obtaining the document's topic distribution over the topic sets.
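
The control flow of steps A and B can be sketched as follows. This is only an illustration under stated assumptions: the excerpt does not say how topics are generated or how correlation is computed, so the frequent-word topic sets, the greedy word-overlap clustering, the overlap-based correlation score, and all function names here are hypothetical stand-ins rather than the patented algorithms.

```python
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def top_words(texts, n):
    """Most frequent n words; ties broken alphabetically for determinism."""
    counts = Counter(w for t in texts for w in tokenize(t))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {w for w, _ in ranked[:n]}


def supervised_topics(docs, labels, n=5):
    # Assumed stand-in: one topic set per class label.
    groups = defaultdict(list)
    for doc, label in zip(docs, labels):
        groups[label].append(doc)
    return {label: top_words(texts, n) for label, texts in groups.items()}


def unsupervised_topics(docs, n=5, min_overlap=1):
    # Assumed stand-in: greedy clustering by shared words.
    clusters = []  # each entry: (set of words seen, list of texts)
    for doc in docs:
        words = set(tokenize(doc))
        for seen, texts in clusters:
            if len(seen & words) >= min_overlap:
                seen |= words  # in-place update of the cluster's word set
                texts.append(doc)
                break
        else:
            clusters.append((words, [doc]))
    return {f"topic_{i}": top_words(t, n) for i, (_, t) in enumerate(clusters)}


def generate_topics(docs, labels=None, n=5):
    # Step A: choose the branch based on whether classification info exists.
    if labels is not None and len(labels) == len(docs):
        return supervised_topics(docs, labels, n)
    return unsupervised_topics(docs, n)


def topic_distribution(doc, topics):
    # Step B: assumed correlation = share of a topic's words found in the
    # document, normalized so the scores sum to 1.
    words = set(tokenize(doc))
    scores = {
        name: (len(words & tw) / len(tw) if tw else 0.0)
        for name, tw in topics.items()
    }
    total = sum(scores.values())
    return {k: (v / total if total else 0.0) for k, v in scores.items()}


# Toy data (assumed).
docs = [
    "goal match football team",
    "football league goal",
    "stock market price",
    "market price index",
]
labels = ["sports", "sports", "finance", "finance"]

labeled = generate_topics(docs, labels, n=3)   # supervised branch
unlabeled = generate_topics(docs, n=3)         # unsupervised branch
dist = topic_distribution("football goal price", labeled)
print(dist)
```

The single entry point `generate_topics` mirrors step A's branch, so the downstream step B code is the same whichever branch produced the topic sets.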

[0056] Based on the topic distribution, the present invention uses an adaptive topic selection method to determine the several topics closest to the current document, achieving automatic, semantics-based topic analysis of documents. Specifically, the present invention can provide four topic selection methods for di...
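
The excerpt is cut off before the four selection methods are described, so the sketch below does not reproduce any of them. It shows one hypothetical adaptive rule, purely as an assumption: keep every topic whose score is at least a fixed fraction of the best score, up to a cap. The function name `select_topics` and the parameters `ratio` and `max_topics` are illustrative.

```python
def select_topics(distribution, ratio=0.5, max_topics=3):
    """Pick the topics closest to a document from its topic distribution.

    Hypothetical rule (not from the patent): a topic is kept when its score
    is at least `ratio` times the highest score, so the number of selected
    topics adapts to how peaked the distribution is.
    """
    if not distribution:
        return []
    # Rank by descending score; break ties by name for determinism.
    ranked = sorted(distribution.items(), key=lambda kv: (-kv[1], kv[0]))
    best = ranked[0][1]
    if best <= 0:
        return []
    kept = [name for name, score in ranked if score >= ratio * best]
    return kept[:max_topics]


# Assumed example distribution.
example = {"sports": 0.6, "finance": 0.35, "health": 0.05}
print(select_topics(example))
print(select_topics(example, ratio=0.9))
```

With a relative threshold, a sharply peaked distribution yields a single topic while a flatter one yields several, which is one simple way to make the selection adaptive.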


Abstract

The invention discloses a novel semantic analysis method and system for large-scale document topics. The method comprises the steps of: detecting whether classification information exists in a known document set and, if so, executing a supervised topic generation step to generate a plurality of topic sets, or otherwise executing an unsupervised topic generation step to generate a plurality of topic sets; and, according to the obtained topic sets, calculating the degree of correlation between a document to be analyzed and each topic set, thereby obtaining the document's topic distribution over the topic sets. The system comprises a topic set generation unit and a topic analysis unit. With the method and the system, topic generation can be completed automatically, quickly, flexibly, and effectively on large-scale document data, and the distribution of any given document over the generated topics can be analyzed and assessed. The method and the system are suitable for situations that require rapid topic generation.

Description

Technical field

[0001] The present invention relates to the technical field of semantic analysis, and in particular to a novel large-scale document topic semantic analysis method and system.

Background art

[0002] In the era of big data, the number of documents has grown at an unprecedented rate, beyond what manual processing can handle in time and effort. Large amounts of data accumulate in daily life, from text files and office files to document-form data such as pictures, images, video, and audio. This data is often not fully utilized, even though it contains a great deal of information that could be mined and learned from. People find it hard to tap the huge information value hidden in the data for several reasons: first, the diversity of document types and document sources; second, the high dimensionality and unstructured nature of document content; and, most importantly, the sheer volume of document data. At present, big data analysis, especially the subject...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/27, G06F17/30
CPC: G06F16/35, G06F40/30
Inventor: 赵淦森, 杜嘉华, 黄晓烽, 王欣明, 唐华, 聂瑞华, 汤庸, 朱佳, 史爱红
Owner: SOUTH CHINA NORMAL UNIVERSITY