Unbalanced big data set-oriented unsupervised text topic related gene extraction method

A large-data-set feature extraction technology, applied in special data processing applications, unstructured text data retrieval, text database clustering/classification, etc. It addresses problems such as selected features that cannot accurately reflect the real information, unevenly distributed sample categories, and the difficulty of multivariate density estimation.

Pending Publication Date: 2020-07-28
XIAN UNIV OF POSTS & TELECOMM

AI Technical Summary

Problems solved by technology

[0004] ① The distribution of sample categories (clusters) in the data set is unbalanced. The measurement functions used to evaluate the quality of feature subsets, whether independence-based correlation and similarity analysis, distance-based measures such as Euclidean and Mahalanobis distance, or even the currently most widely used information-entropy-based methods such as mutual information and information gain, all assume that the categories (clusters) in the data set have the same or similar distributions. As a result, most of the selected features come from the "large categories" that dominate in number (density), and few or none come from the non-dominant "small categories". The feature subset selected as most discriminative therefore cannot accurately reflect the real information in the entire sample space, which reduces the performance of subsequent learning methods on practical problems;
[0005] ② "Big data" makes the objects to be processed increasingly complex, and data dimensionality grows explosively. Ultra-high-dimensional data sets mean not only huge memory requirements but also high computing costs.
In these high-dimensional feature spaces, a large number of features are strongly correlated, which introduces a large amount of redundancy and even noise and sharply degrades the generalization ability of the features selected by traditional methods. The "empty space" phenomenon also makes multivariate density estimation very difficult.
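
The imbalance effect described in item ① can be illustrated with information gain, one of the entropy-based measures named there. The following is a minimal sketch, not part of the patent text: the function names, the three-class split (450 / 450 / 100 documents) and the two terms are hypothetical values chosen only for illustration. Each term is a perfect indicator of exactly one class, yet the indicator of the small class receives the lower score.

```python
import math


def entropy(counts):
    """Shannon entropy (in bits) of a list of non-negative counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def information_gain(with_term, without_term):
    """Information gain of a term over the class variable.

    with_term[k]    -- number of class-k documents that contain the term
    without_term[k] -- number of class-k documents that do not contain it
    """
    n_with, n_without = sum(with_term), sum(without_term)
    n = n_with + n_without
    class_totals = [a + b for a, b in zip(with_term, without_term)]
    conditional = 0.0
    if n_with:
        conditional += n_with / n * entropy(with_term)
    if n_without:
        conditional += n_without / n * entropy(without_term)
    return entropy(class_totals) - conditional


# Hypothetical corpus: two large classes (450 documents each), one small class (100).
# "big_term" occurs in every document of the first large class and nowhere else.
ig_big = information_gain(with_term=[450, 0, 0], without_term=[0, 450, 100])
# "small_term" occurs in every document of the small class and nowhere else.
ig_small = information_gain(with_term=[0, 0, 100], without_term=[450, 450, 0])

print(round(ig_big, 3), round(ig_small, 3))
```

In this toy setting the large-class term scores about 0.99 bits and the small-class term about 0.47 bits, so a top-k ranking by information gain would prefer the large-class term even though both terms are equally reliable indicators of their own class.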

Examples

Embodiment Construction

[0069] The present invention provides an unsupervised text topic-related gene extraction method oriented to unbalanced large data sets. Factor analysis and the density peak algorithm are used to obtain the clusters of a high-dimensional sample set and to label the unlabeled samples accordingly. Local density and information entropy are used to improve the feature selection method based on the CHI statistical matrix, strengthening the feature expression of low-density and small-sample clusters. The negentropy-based fast fixed-point algorithm (FastICA) is used to analyze the high-order statistical correlation among multidimensional data, extract independent latent topic feature genes, and remove high-order redundancy between components. The method does not require large-scale labeled samples for training and effectively avoids pre-defining sample category relationships and feature structures; it also overcomes the use of over-sampli...
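
The clustering step can be sketched as follows. This is an illustrative, simplified version of density peak clustering (local density, distance to the nearest denser point, assignment by following denser neighbours); it is not taken from the patent. The function name `density_peaks`, the Gaussian kernel, the cutoff `d_c`, the fixed number of clusters and the toy points are all assumptions, and the factor-analysis step that the method applies beforehand is omitted.

```python
import math


def density_peaks(points, d_c, n_clusters):
    """Simplified density peak clustering; returns (labels, centers)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density with a Gaussian kernel of width d_c (assumed kernel choice).
    rho = [sum(math.exp(-(dist[i][j] / d_c) ** 2) for j in range(n) if j != i)
           for i in range(n)]

    # Visit points from densest to sparsest; for each point record the
    # distance to, and index of, the nearest point visited before it.
    order = sorted(range(n), key=lambda i: -rho[i])
    delta = [0.0] * n
    nearest_denser = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])  # convention for the densest point
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        nearest_denser[i] = j

    # Cluster centers: the points with the largest rho * delta.
    centers = sorted(range(n), key=lambda i: -rho[i] * delta[i])[:n_clusters]
    labels = [-1] * n
    for cluster_id, i in enumerate(centers):
        labels[i] = cluster_id

    # Every other point inherits the label of its nearest denser neighbour.
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[nearest_denser[i]]
    return labels, centers


# Toy unbalanced sample: a 12-point cluster and a 3-point cluster (hypothetical data).
big = [(float(x), float(y)) for x in range(4) for y in range(3)]
small = [(10.0, 10.0), (10.5, 10.0), (10.0, 10.5)]
labels, centers = density_peaks(big + small, d_c=1.0, n_clusters=2)
print(labels)
```

Because a center needs both high density and a large distance to any denser point, the three-point cluster still gets its own center here even though it is much sparser than the large one, which is the property that makes the approach attractive for unbalanced data. The resulting labels can then serve as the pseudo-labels for the feature selection step.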

Abstract

The invention discloses an unsupervised text topic-related gene extraction method oriented to unbalanced big data sets, comprising the following steps: obtaining the clusters of a high-dimensional sample set by factor analysis and a density peak algorithm, and labeling the unlabeled samples; improving the feature selection method based on the CHI statistical matrix by using the average local density and the information entropy, thereby strengthening the feature expression of low-density and small-sample clusters; and adopting a negentropy-based fast fixed-point algorithm to analyze the high-order statistical correlation among multidimensional data, extract independent latent topic feature genes, and remove high-order redundancy between components. Large-scale labeled samples are not needed for training, so pre-defining sample category relationships and feature structures is effectively avoided, and the influence of over-sampling or under-sampling methods on the category distribution of the original unbalanced data set is overcome. The performance of the CHI statistical selection method is improved by correcting the feature category structure, and effective feature dimension reduction is achieved while the discriminative capability of the sample set is retained.
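
For orientation, the unmodified CHI statistical matrix that the feature selection step starts from can be sketched as below. This uses the standard text-categorization form of the CHI statistic computed from a term/cluster contingency table. The correction by average local density and information entropy is not specified in this excerpt and is therefore not reproduced. The function names and the six toy documents are hypothetical.

```python
def chi_square(a, b, c, d):
    """CHI statistic of one term for one cluster.

    a -- documents in the cluster that contain the term
    b -- documents outside the cluster that contain the term
    c -- documents in the cluster that do not contain the term
    d -- documents outside the cluster that do not contain the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0


def chi_matrix(docs, labels):
    """Return {term: {cluster: CHI value}} for documents given as sets of terms."""
    vocab = sorted(set().union(*docs))
    clusters = sorted(set(labels))
    matrix = {}
    for term in vocab:
        row = {}
        for k in clusters:
            a = sum(1 for doc, y in zip(docs, labels) if y == k and term in doc)
            b = sum(1 for doc, y in zip(docs, labels) if y != k and term in doc)
            c = sum(1 for doc, y in zip(docs, labels) if y == k and term not in doc)
            d = len(docs) - a - b - c
            row[k] = chi_square(a, b, c, d)
        matrix[term] = row
    return matrix


# Hypothetical corpus: cluster 0 has four documents, cluster 1 has two.
docs = [
    {"match", "goal"}, {"match", "team"}, {"goal", "team"},
    {"match", "goal", "team"}, {"vote", "match"}, {"vote", "law"},
]
labels = [0, 0, 0, 0, 1, 1]
matrix = chi_matrix(docs, labels)
print(matrix["vote"][1], matrix["match"][0])
```

A per-cluster matrix of this kind is the natural place to apply cluster-dependent corrections, since each column can be reweighted separately before terms are ranked.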

Description

Technical field

[0001] The invention belongs to the technical field of data interpretation and topic discovery in natural language processing, and specifically relates to an unsupervised text topic-related gene extraction method oriented to unbalanced large data sets.

Background technique

[0002] As society gradually enters the "big data" era, people are getting more and more information through web pages, microblogs, forums, etc., while the time spent on reading and sorting information is getting less and less. Accurately analyzing the subject of information has become an effective means of realizing big data understanding and value discovery, and its application fields include Internet public opinion monitoring and early warning, network harmful information filtering, and sentiment analysis. When dealing with data in these fields, it is often necessary to face a large number of high-dimensional data with redundant or irrelevant features, which greatly reduces the efficiency an...

Application Information

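IPC(8): G06F16/35; G06K9/62
CPC: G06F16/35; G06F18/2415
Inventor: 孙晶涛 (Sun Jingtao), 李敬明 (Li Jingming), 陈彦萍 (Chen Yanping), 张秋余 (Zhang Qiuyu), 王忠民 (Wang Zhongmin), 孙韩林 (Sun Hanlin), 温福喜 (Wen Fuxi), 何继光 (He Jiguang)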
Owner XIAN UNIV OF POSTS & TELECOMM