Large-scale data similar feature detection method based on inverted indexes

A large-scale data technique based on inverted indexes, applicable to unstructured text data retrieval, text database indexing, and special-purpose data processing. It addresses problems of existing similar-feature detection methods: they cannot be applied to numerical features, they perform poorly, and undetected similar features interfere with feature selection. Its effects are to ensure uniqueness and accuracy, improve computing efficiency, and reduce scale.

Active Publication Date: 2021-01-26
ZHEJIANG UNIV

AI Technical Summary

Problems solved by technology

Original data sets often contain a large number of similar features. These dilute feature importance during model training, distort feature selection, and degrade model performance; they also add unnecessary computing overhead and waste resources.
[0003] At present, most mainstream feature similarity detection methods traverse and combine all features for analysis. When the original feature set is large, the set of feature pairs obtained after combination is also very large, so these methods perform poorly on large-scale data sets. The alternative is to reduce dimensionality first with locality-sensitive hashing (LSH) and then analyze similarity. This reduces the dimensionality of the data set, but it can currently be applied only to categorical features (including one-hot encoded numerical features), not to numerical features.
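To make the scaling problem concrete, the sketch below counts the pairs a brute-force comparison has to evaluate and compares that with the count when pairs are only formed inside small groups of features. The function names and the bucket sizes are illustrative assumptions, not figures from the patent.

```python
from math import comb


def brute_force_pairs(n_features):
    """Number of unordered feature pairs when every feature is combined with every other."""
    return comb(n_features, 2)


def bucketed_pairs(bucket_sizes):
    """Number of pairs when features are only combined within their own bucket."""
    return sum(comb(size, 2) for size in bucket_sizes)


# Hypothetical example: 10,000 features, either compared all-against-all
# or first split into 2,000 buckets of 5 features each.
print(brute_force_pairs(10_000))    # 49995000
print(bucketed_pairs([5] * 2_000))  # 20000
```

The all-pairs count grows quadratically with the number of features, which is why restricting comparison to candidate subsets matters on large feature sets.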

Examples

Embodiment Construction

[0023] As shown in Figure 1, the inverted-index-based method for detecting similar features in large-scale data comprises three stages: candidate set generation, similarity measurement, and result set integration.

[0024] First, the data columns corresponding to all features in the original feature set are sampled, and an inverted index is constructed for each feature. Features with the same inverted index are placed in the same feature subset; the features within each subset are then combined pairwise into feature pairs, which are added to the candidate feature set. By analyzing the relationship between feature distribution and the similarity measurement function, the present invention proposes a new inverted index design method:
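The candidate generation stage can be sketched as follows. The patent's actual inverted index design is not reproduced here; the key used below (rounded z-scores of a few sampled rows) is an assumption chosen only because positively linearly related columns then share a key. The names `index_key` and `candidate_pairs` and the parameters `n_samples`, `decimals`, and `seed` are hypothetical.

```python
import math
import random
from collections import defaultdict
from itertools import combinations


def index_key(column, sample_rows, decimals=2):
    """Hypothetical index key: rounded z-scores of the sampled values.

    This is an assumed stand-in, not the key design described in the patent.
    """
    sample = [column[i] for i in sample_rows]
    mean = sum(sample) / len(sample)
    std = math.sqrt(sum((v - mean) ** 2 for v in sample) / len(sample))
    if std == 0:
        return ("constant",)
    return tuple(round((v - mean) / std, decimals) for v in sample)


def candidate_pairs(features, n_samples=4, seed=0):
    """Group features by key in a hash table, then pair features within each group."""
    n_rows = len(next(iter(features.values())))
    rng = random.Random(seed)
    sample_rows = sorted(rng.sample(range(n_rows), min(n_samples, n_rows)))

    buckets = defaultdict(list)  # key -> names of features sharing that key
    for name, column in features.items():
        buckets[index_key(column, sample_rows)].append(name)

    pairs = []
    for names in buckets.values():
        pairs.extend(combinations(sorted(names), 2))
    return sorted(pairs)


features = {
    "a": [1, 2, 3, 4, 5, 6],
    "b": [2, 4, 6, 8, 10, 12],  # b = 2 * a, so it shares a's key
    "c": [5, 1, 4, 2, 6, 3],
}
print(candidate_pairs(features))
```

Only features that land in the same bucket are ever paired, so the similarity function runs on far fewer pairs than an all-against-all comparison. A rounded key of this kind can miss near-duplicates that fall on opposite sides of a rounding boundary, which is one reason the real key design matters.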

[0025] For a pair of numerical features X and Y, the Pearson correlation coefficient is used to measure the similarity between them. The Pearson correlation coefficient formula is as follows:

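[0026] $$r_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$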

[0027] T...
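For reference, the Pearson correlation coefficient of two numeric columns can be computed directly from its standard definition, as in this minimal sketch. The function name `pearson` and the choice to return 0.0 for a constant column (where the coefficient is undefined) are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        # Assumption: treat the undefined case (a constant column) as "not similar".
        return 0.0
    return covariance / (spread_x * spread_y)


print(pearson([1, 2, 3, 4], [2, 4, 6, 8]))  # close to 1.0 (perfect positive linear relation)
print(pearson([1, 2, 3], [3, 2, 1]))        # close to -1.0 (perfect negative linear relation)
```

Because the coefficient is unchanged when a column is shifted or scaled by a positive constant, two features that differ only in units or offset score as fully correlated.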

Abstract

The invention discloses a large-scale data similar feature detection method based on inverted indexes. The method samples the feature data columns of each type and extracts the corresponding inverted indexes. It then builds a hash table that maps inverted indexes to features as key-value pairs and generates candidate feature subsets, thereby reducing the dimension of the feature set. The features within each reduced subset are combined pairwise; a Pearson correlation coefficient algorithm is applied to numerical features and a non-repetitive counting method to categorical features to obtain the correlation coefficient of each feature pair; a threshold is set, and the result is output. The method removes the need to combine the entire original feature set pairwise, can reduce calculation time by an order of magnitude, and saves a large amount of resources, while keeping accuracy and recall at an extremely high level.
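The final thresholding step described in the abstract can be sketched as a filter over candidate pairs. The function `filter_similar` and its `threshold` default are hypothetical. The `match_rate` function is only a simple stand-in score for categorical columns; it is not the non-repetitive counting method named in the abstract, whose details are not given here.

```python
def filter_similar(pairs, columns, similarity, threshold=0.95):
    """Score each candidate pair and keep those at or above the threshold."""
    results = []
    for first, second in pairs:
        score = similarity(columns[first], columns[second])
        if score >= threshold:
            results.append((first, second, score))
    return results


def match_rate(x, y):
    """Stand-in categorical score: fraction of rows where both columns agree."""
    return sum(a == b for a, b in zip(x, y)) / len(x)


columns = {
    "u": ["x", "y", "x", "z"],
    "v": ["x", "y", "x", "x"],
    "w": ["a", "b", "c", "d"],
}
candidates = [("u", "v"), ("u", "w")]
print(filter_similar(candidates, columns, match_rate, threshold=0.7))
# [('u', 'v', 0.75)]
```

Passing the similarity function as a parameter lets the same filter serve numerical pairs (with a correlation function) and categorical pairs (with a categorical score), each with its own threshold.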

Description

Technical field

[0001] The invention belongs to the field of machine learning and data mining and relates to feature similarity detection in big data feature engineering, in particular to a large-scale data similar feature detection method based on an inverted index.

Background

[0002] Feature similarity detection is a crucial step in the data mining process and a necessary step in training machine learning models. Original data sets often contain a large number of similar features, which dilute feature importance during model training, distort feature selection, and degrade model performance; they also add unnecessary computing overhead and waste resources.

[0003] At present, most mainstream feature similarity detection methods traverse and combine all features for analysis. When the size of the original feature set is large, the s...

Application Information

IPC(8): G06F16/31, G06K9/62, G06N20/00
CPC: G06F16/319, G06N20/00, G06F18/22, G06F18/213
Inventor: 钱晨张顾洪
Owner: ZHEJIANG UNIV