Probabilistic clustering method for cross-type data based on subject entries

A clustering method based on subject entries, applied to the probabilistic clustering of cross-type data, that addresses the failure of existing methods to account for uncertainty in the clustering process.

Inactive Publication Date: 2009-04-15
NORTHEASTERN UNIV

AI Technical Summary

Problems solved by technology

[0005] Existing data clustering methods do not take into account the uncertainty in the clustering process.



Examples


Embodiment Construction

[0080] An embodiment of the invention:

[0081] (1) Define the type of subject entry and rank the entries by weight

[0082] Assume d1 and d2 are two data items in the data space, and T(d1) and T(d2) denote the entries contained in each, where T(d1) = {data, index, search, precision, meeting, clustering, lookup, similarity, summary, contains, version} and T(d2) = {data, search, accuracy, session, image, measure, indeterminate}. Each entry in T(d1) and T(d2) is given a weight value, and the entries are sorted from high to low by weight, as shown in Figure 7 (a) and (b).
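The ranking step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `rank_entries` is hypothetical, and all weights except 4 for "meeting" and 3 for "clustering" (the two values this embodiment states for d1) are invented placeholders, since Figure 7 is not reproduced here.

```python
def rank_entries(weights):
    """Return (entry, weight) pairs sorted by weight, highest first.

    Ties are broken alphabetically so the order is deterministic.
    """
    return sorted(weights.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical weights for a subset of T(d1). Only "meeting" = 4 and
# "clustering" = 3 come from the embodiment; the rest are placeholders.
d1_weights = {
    "clustering": 3,
    "data": 8,
    "meeting": 4,
    "index": 7,
    "search": 6,
    "precision": 5,
    "lookup": 2,
}

ranked_d1 = rank_entries(d1_weights)
for entry, weight in ranked_d1:
    print(f"{entry}: {weight}")
```

Sorting on the negated weight keeps the call to a single `sorted` pass while still allowing an alphabetical tie-break.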

[0083] (2) Representing data subjects with probabilities

[0084] In d1, "data", "index", "search" and "precision" are taken as topic-related entries, "meeting" and "clustering" are topic semi-related entries, and the rest are topic-irrelevant entries. The weights of "meeting" and "clustering" are 4 and 3 respectively, and d1's maximum w...


Abstract

A probabilistic clustering method for cross-type data based on keyword entries belongs to the database field and comprises the following steps: (1) defining the types of keyword entry, dividing the entries of the cross-type data into keyword-related entries, keyword semi-related entries and keyword-unrelated entries; (2) assigning a probability to each entry; (3) representing data keywords by these probabilities; (4) constructing a probabilistic similarity matrix M of data keyword entries: for any two cross-type data items dx and dy from step (3), computing the similarity between every pair of descriptive forms of dx and dy, summing the probabilities of the pairs whose similarity exceeds a given threshold, and storing the resulting direct correlation probability of the two data items in M; (5) constructing a clustering model M<c> based on the matrix M; and (6) obtaining the clustering method based on the clustering model M<c>. The method clusters cross-type data using the similarity of keyword-related entries, which improves clustering precision and reduces clustering time.
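Step (4) of the abstract can be illustrated with a minimal Python sketch. It is not the patent's formulation; it rests on several assumptions that the text available here does not confirm: a "descriptive form" is taken to be one possible entry set of a data item (all related entries plus any subset of its semi-related entries), semi-related entries are treated as independent, similarity is measured with the Jaccard coefficient, and the entries, probabilities and threshold in the example are hypothetical. All function names are illustrative.

```python
from itertools import product


def descriptive_forms(related, semi):
    """Enumerate (entry_set, probability) pairs for one data item.

    related: entries that are always present.
    semi: dict mapping each semi-related entry to its probability
          (assumed independent of the others).
    """
    semi_items = sorted(semi.items())
    forms = []
    for mask in product([False, True], repeat=len(semi_items)):
        entries = set(related)
        prob = 1.0
        for keep, (entry, p) in zip(mask, semi_items):
            if keep:
                entries.add(entry)
                prob *= p
            else:
                prob *= 1.0 - p
        forms.append((frozenset(entries), prob))
    return forms


def jaccard(a, b):
    """Jaccard similarity of two entry sets (an assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def direct_correlation(forms_x, forms_y, threshold):
    """Sum the joint probabilities of form pairs whose similarity exceeds threshold."""
    return sum(
        px * py
        for fx, px in forms_x
        for fy, py in forms_y
        if jaccard(fx, fy) > threshold
    )


def similarity_matrix(items, threshold):
    """Build the symmetric matrix M of direct correlation probabilities."""
    forms = [descriptive_forms(related, semi) for related, semi in items]
    n = len(items)
    # Diagonal set to 1.0 by convention (an assumption of this sketch).
    M = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            M[i][j] = M[j][i] = direct_correlation(forms[i], forms[j], threshold)
    return M


# Hypothetical data items dx and dy; probabilities and threshold are invented.
dx = ({"data", "index", "search", "precision"}, {"meeting": 0.8, "clustering": 0.6})
dy = ({"data", "search", "precision"}, {"session": 0.5})
M = similarity_matrix([dx, dy], threshold=0.55)
print(M)
```

Enumerating every descriptive form is exponential in the number of semi-related entries, so a sketch like this is only practical when each data item has a handful of them.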

Description

Technical field

[0001] The invention belongs to the field of databases and in particular relates to a method for probabilistic clustering of cross-type data based on subject entries.

Background art

[0002] Over the past few decades, traditional relational database management systems have played a very important role. However, with the continuous development of computer applications, especially Web information technology, today's data is both "massive" and "everywhere", and its features are complex. A single traditional database management system can therefore no longer meet the resulting data management needs, and much of today's data or information is not stored in a database management system at all. As Serge Abiteboul et al. pointed out in Communications of the ACM (Volume 48, No. 5), and as Homman pointed out in a DASFAA 2007 conference report, currently only about 20% of data or information is stored in databases. This mea...


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventors: 王国仁, 于亚新, 王波涛, 丁国辉, 王斌, 赵相国, 赵宇海, 信俊昌, 乔百友, 韩东红, 张恩德, 李淼
Owner: NORTHEASTERN UNIV