Hierarchical clustering method and system based on multistage layered sampling

A hierarchical clustering technology based on multi-stage sampling, applied in relational databases, special data processing applications, instruments, and similar fields. It addresses the problems of poor sample representativeness and low clustering accuracy by collecting samples with high representativeness and high uncertainty, thereby improving clustering accuracy.

Active Publication Date: 2014-04-02
神行太保智能科技(苏州)有限公司

AI Technical Summary

Problems solved by technology

[0004] In view of this, the object of the present invention is to provide a hierarchical clustering method and system based on multi-stage hierarchical sampling that overcomes the poor sample representativeness and low clustering accuracy caused by random sampling.


Examples

Embodiment 1

[0053] Embodiment 1 of the present invention discloses a hierarchical clustering method based on multi-stage hierarchical sampling. As shown in Figure 1, the method includes:

[0054] S1: Based on the preset input attribute set, randomly draw a preset number of samples from the data source, and mark the resulting set of samples as the initial sample set.

[0055] Here, the data source may be back-end data that cannot be accessed directly and must instead be retrieved by submitting queries through a query interface. In this embodiment, the data source is specifically a Deep Web data source.

[0056] This step S1 randomly collects a preset number of samples from the Deep Web. Generally, the number of samples randomly sampled at this stage is half of the total number of samples required for clustering. In this embodiment, assuming that a total of 2X samples need to be collected for clustering the target Deep Web (data source), X sample...
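Step S1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: `query` is a hypothetical stand-in for the Deep Web query interface, `input_domains` is a hypothetical mapping from each input attribute to its selectable values, and the rule "one random record per query, skip duplicates" is an assumption that the text does not specify. Only the half-budget rule (X of 2X samples) comes from paragraph [0056].

```python
import random
from typing import Callable, Dict, List, Sequence

Record = Dict[str, object]


def draw_initial_sample(
    query: Callable[[Dict[str, object]], List[Record]],  # hypothetical query interface
    input_domains: Dict[str, Sequence[object]],          # hypothetical attribute -> values
    total_budget: int,                                   # 2X in the text
    seed: int = 0,
    max_queries: int = 10_000,
) -> List[Record]:
    """Collect total_budget // 2 distinct records by submitting random queries."""
    rng = random.Random(seed)
    target = total_budget // 2  # X: half of the total sample budget
    sample: List[Record] = []
    seen = set()
    for _ in range(max_queries):
        if len(sample) >= target:
            break
        # Pick one random value per input attribute and submit it as a query.
        q = {attr: rng.choice(list(vals)) for attr, vals in input_domains.items()}
        results = query(q)
        if not results:
            continue
        # Assumption: keep one random record per query and skip duplicates.
        rec = rng.choice(results)
        key = tuple(sorted(rec.items()))
        if key not in seen:
            seen.add(key)
            sample.append(rec)
    return sample
```

The `max_queries` cap is a safety limit added for the sketch so that the loop ends even when the source holds fewer than X distinct records.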

Embodiment 2

[0110] Embodiment 2 of the present invention discloses another flow of the hierarchical clustering method based on multi-stage hierarchical sampling. Referring to Figure 4, in addition to the steps of Embodiment 1, the method further includes:

[0111] S7: Set the iteration parameter x and initialize it to 1.

[0112] Step S7 is performed between steps S1 and S2.

[0113] S8: Judge whether the value of x is less than the preset number of iterations β. If so, execute step S9; otherwise, execute step S6.

[0114] S9: Increment x by 1, merge the initial sample set, the representative sample set, and the uncertain sample set, and use the merged set as the new initial sample set. Then return to step S2.
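The control flow of steps S7 to S9 can be sketched as a loop. This is only an illustration of the iteration logic: `sample_stage` is a hypothetical callable standing in for steps S2 to S5 (it returns the representative and uncertain sample sets), `final_cluster` is a hypothetical stand-in for step S6, and placing the S8 check after S5 is an assumption, since this excerpt does not state where S8 sits.

```python
from typing import Callable, List, Tuple, TypeVar

S = TypeVar("S")


def multistage_sampling(
    initial: List[S],
    sample_stage: Callable[[List[S]], Tuple[List[S], List[S]]],  # hypothetical S2-S5
    final_cluster: Callable[[List[S]], object],                  # hypothetical S6
    beta: int,                                                   # preset iteration count
):
    x = 1  # S7: initialize the iteration parameter
    while True:
        # S2-S5: derive the representative and uncertain sample sets
        representative, uncertain = sample_stage(initial)
        merged = initial + representative + uncertain
        if x < beta:          # S8: more rounds left?
            x += 1            # S9: increment x ...
            initial = merged  # ... and reuse the merged set as the new seed
            continue          # go back to S2
        return final_cluster(merged)  # S6: cluster the combined samples
```

Under this reading, steps S2 to S5 run exactly β times for β ≥ 1, and each round seeds the next with every sample collected so far.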

[0115] To collect more informative samples from the Deep Web, this embodiment further optimizes the method of Embodiment 1. In the multi-stage sampling stage of stratifi...

Embodiment 3

[0117] Embodiment 3 further optimizes the methods of Embodiment 1 and Embodiment 2. On the basis of the above embodiments, it additionally suppresses over-stratification of the strategy tree while the tree is being constructed.

[0118] Stratification can be carried too far, and over-stratification degrades the results of stratified sampling. Therefore, when building the strategy tree, this embodiment uses a statistical hypothesis test to check whether the reduction in the variance of the output attribute is meaningful, and on that basis decides whether to continue stratifying the current node LN of the tree. If the reduction is not meaningful, stratification of the current node LN is terminated. Specifically, the test rests on the following idea: when the potential splitting attribute P_i has no significant connection with the output attribute set, the distr...
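The specific test used by the embodiment is cut off in this excerpt, so the sketch below swaps in a generic permutation test to illustrate the idea of stopping a split when the variance reduction is not significant. All names (`within_variance`, `split_is_significant`, `alpha`, `n_perm`) are hypothetical, and the choice of a permutation test is an assumption, not the patent's method. The null hypothesis is that the splitting attribute is unrelated to the output attribute, in which case shuffling the attribute values should reduce the variance about as well as the real split.

```python
import random
from collections import defaultdict
from statistics import pvariance
from typing import Hashable, Sequence


def within_variance(groups: Sequence[Hashable], outputs: Sequence[float]) -> float:
    """Size-weighted mean of the per-stratum population variances."""
    strata = defaultdict(list)
    for g, y in zip(groups, outputs):
        strata[g].append(y)
    return sum(len(ys) * pvariance(ys) for ys in strata.values()) / len(outputs)


def split_is_significant(
    groups: Sequence[Hashable],   # value of the candidate splitting attribute per sample
    outputs: Sequence[float],     # value of the output attribute per sample
    alpha: float = 0.05,          # assumed significance level
    n_perm: int = 1000,
    seed: int = 0,
) -> bool:
    """Return True if splitting on `groups` reduces variance more than chance would."""
    rng = random.Random(seed)
    observed = within_variance(groups, outputs)
    shuffled = list(groups)
    at_least_as_good = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between attribute and output
        if within_variance(shuffled, outputs) <= observed:
            at_least_as_good += 1
    p_value = (at_least_as_good + 1) / (n_perm + 1)
    return p_value < alpha
```

A tree builder would call `split_is_significant` on the samples that fall into node LN and stop stratifying that node when it returns `False`.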


Abstract

The invention discloses a hierarchical clustering method and system based on multi-stage hierarchical sampling. The method includes: using a randomly sampled initial sample set as a seed to build stratified query strategies, and allocating a sample count to each stratum of the query strategies according to the principle of minimizing each stratum's estimated variance; using the stratified query strategies to perform stratified sampling on the data source, yielding a representative sample set with high sample representativeness; clustering the samples in the representative sample set, and performing a second round of sampling on the data source based on the boundary points of the clusters, yielding an uncertain sample set with high sample uncertainty; and clustering the union of the initial sample set, the representative sample set, and the uncertain sample set to estimate the cluster centers of the data source. Multi-stage stratified sampling ensures that the samples are both highly representative and highly uncertain, which overcomes the poor representativeness of randomly sampled data and improves the accuracy of data source clustering.
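The allocation formula is not given in this excerpt. One classical rule that minimizes the variance of a stratified estimate for a fixed budget is Neyman allocation, where stratum h receives a share proportional to N_h · S_h (stratum size times stratum standard deviation). The sketch below illustrates that rule as an assumption about what "minimum estimated variance" could mean here; `sizes` and `stds` are hypothetical inputs that would have to be estimated from the initial sample set.

```python
from typing import List, Sequence


def neyman_allocation(
    sizes: Sequence[float],  # estimated stratum sizes N_h (assumed available)
    stds: Sequence[float],   # estimated stratum standard deviations S_h
    budget: int,             # total number of samples to distribute
) -> List[int]:
    """Split `budget` across strata in proportion to N_h * S_h."""
    weights = [n * s for n, s in zip(sizes, stds)]
    total = sum(weights)
    if total == 0:
        # All strata look homogeneous: fall back to proportional allocation.
        weights = list(sizes)
        total = sum(weights)
    raw = [budget * w / total for w in weights]
    alloc = [int(r) for r in raw]
    # Hand out the samples lost to rounding, largest fractional part first.
    leftover = budget - sum(alloc)
    order = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


print(neyman_allocation([100, 100, 200], [1.0, 3.0, 2.0], 80))  # [10, 30, 40]
```

Large or highly variable strata receive more samples, while a stratum whose output values barely vary needs only a few samples to be estimated well.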

Description

Technical field

[0001] The invention belongs to the technical field of Deep Web data processing, and in particular relates to a hierarchical clustering method and system based on multi-stage hierarchical sampling.

Background

[0002] In recent years, the Deep Web has become increasingly popular as a means of data dissemination. Compared with the Surface Web, the Deep Web contains higher-quality data, which makes mining it more valuable. Clustering, a very active research topic in data mining, helps in understanding the data distribution and provides a reference for subsequent applications of Deep Web data. Clustering Deep Web data sources has therefore become a research hotspot in this field.

[0003] Deep Web data is stored in the background database, and the corresponding data can only be obtained by submitting a query through the query interface, and cannot directly obtain all...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/285
Inventor: 赵朋朋, 刘袁柳, 吴健, 鲜学丰, 崔志明
Owner: 神行太保智能科技(苏州)有限公司