Large scale text data external clustering method and system

A clustering method for text data, applied in the information technology field, that addresses the problem of clustering becoming infeasible because of its space complexity. It occupies little space during clustering and relies on large-capacity external storage.

Inactive Publication Date: 2008-11-19
沈阳格微软件有限责任公司

AI Technical Summary

Problems solved by technology

However, to preserve system performance, the number of selected representative points cannot be too small, so in essence the above method still does not solve the infeasibility caused by space complexity as the processing scale grows.

Examples

Embodiment Construction

[0015] Referring to the accompanying drawings, the external clustering method for large-scale text data comprises the following main steps: preprocess the input text set to generate an inverted index and the feature vectors of the text set; use retrieval technology to retrieve a candidate relationship set for each document; apply the relationship calculation method to the documents that have a candidate relationship; sort and output the calculation results that exceed a certain threshold; then, following the sorted results, the clustering algorithm iteratively merges the text pairs with the first direct relationship, finally producing the clustering output of the text set. The clustering system designed around this method includes a candidate analyzer, a relationship generator, a relationship selection component, and a clustering component, the basic...
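The steps listed in [0015] can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the patented implementation: whitespace tokenization, term-frequency vectors, cosine similarity as the "relationship calculation", the default threshold of 0.5, and union-find merging of each above-threshold pair are all assumptions chosen for the example, and every function name is hypothetical.

```python
import math
from collections import Counter, defaultdict


def preprocess(texts):
    """Build term-frequency vectors and an inverted index (term -> doc ids)."""
    vectors = [Counter(text.lower().split()) for text in texts]
    index = defaultdict(set)
    for doc_id, vector in enumerate(vectors):
        for term in vector:
            index[term].add(doc_id)
    return vectors, index


def candidates(doc_id, vectors, index):
    """Retrieve documents that share at least one term with doc_id."""
    found = set()
    for term in vectors[doc_id]:
        found |= index[term]
    # Keep only higher ids so that each pair is scored once.
    return {other for other in found if other > doc_id}


def cosine(a, b):
    """Cosine similarity of two sparse term-frequency vectors (assumed measure)."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find(parent, x):
    """Return the cluster root of x, halving the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def cluster(texts, threshold=0.5):
    vectors, index = preprocess(texts)
    relations = []
    for i in range(len(texts)):
        for j in candidates(i, vectors, index):
            score = cosine(vectors[i], vectors[j])
            if score > threshold:
                relations.append((score, i, j))
    relations.sort(reverse=True)        # strongest relationships first
    parent = list(range(len(texts)))    # every document starts as its own cluster
    for _, i, j in relations:
        root_i, root_j = find(parent, i), find(parent, j)
        if root_i != root_j:
            parent[root_j] = root_i     # merge the two clusters
    groups = defaultdict(list)
    for doc_id in range(len(texts)):
        groups[find(parent, doc_id)].append(doc_id)
    return sorted(groups.values())
```

Calling `cluster(["apple banana fruit", "banana apple fruit salad", "quantum physics"])` groups the first two documents and leaves the third alone. Because candidates come from the inverted index, documents with no shared term are never compared, which is the point of the retrieval step.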

Abstract

Disclosed are an external clustering method and system for large-scale text data, applied in the information technology field. The clustering system comprises a candidate analyzer, a relationship generator, a relationship selection component, and a clustering component. Each sample point initially serves as its own cluster; a set of candidate related points is selected for each sample through search technology; the relationship between the sample and each candidate is calculated by the relationship generator and written to external storage in ascending or descending order. The method mainly comprises the following steps: preprocess the input text set to generate the inverted index and the feature vectors of the text set; use search technology to retrieve the candidate relationship set of each document; calculate the relationships of the documents that have a candidate relationship using a relationship calculation method; output, in sorted order, the calculated results that exceed a certain threshold; then, following the sorted output, the clustering algorithm iteratively merges the text pairs with the first direct relationship to produce the clustering output of the text set. The design is novel, occupies little space during clustering, and uses large-capacity external storage to handle the stages of processing separately.
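One way to write relationship scores to external storage in sorted order is a classic external merge sort: spill sorted batches to run files, then stream them back through a k-way merge. The sketch below is an assumption about how such a step could look, not the mechanism the patent specifies; the tab-separated file layout, the `batch_size` parameter, and all function names are hypothetical.

```python
import heapq
import os
import tempfile


def write_run(relations, directory):
    """Sort one in-memory batch by descending score and spill it to disk."""
    fd, path = tempfile.mkstemp(dir=directory, suffix=".run", text=True)
    with os.fdopen(fd, "w") as handle:
        for score, doc_a, doc_b in sorted(relations, reverse=True):
            handle.write(f"{score!r}\t{doc_a}\t{doc_b}\n")
    return path


def read_run(path):
    """Stream (score, doc_a, doc_b) tuples back from one run file."""
    with open(path) as handle:
        for line in handle:
            score, doc_a, doc_b = line.rstrip("\n").split("\t")
            yield float(score), int(doc_a), int(doc_b)


def external_sorted(relation_stream, directory, batch_size=100_000):
    """Yield relations in descending order while buffering one batch at a time."""
    paths, batch = [], []
    for relation in relation_stream:
        batch.append(relation)
        if len(batch) >= batch_size:
            paths.append(write_run(batch, directory))
            batch = []
    if batch:
        paths.append(write_run(batch, directory))
    # heapq.merge reads the runs lazily; reverse=True expects descending inputs.
    yield from heapq.merge(*(read_run(p) for p in paths), reverse=True)
```

The clustering step can then consume `external_sorted(...)` as an ordinary iterator. During the merge only one record per run file is held in memory, so the footprint depends on `batch_size` and the number of runs rather than on the total number of relationships.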

Description

Technical field: [0001] The invention relates to a method for, and an implementation of, large-scale text data clustering using external memory in the field of information technology: an external document clustering method based on retrieval technology, and an external clustering method and system for large-scale text data that overcomes the shortcomings of existing methods in problem scale and processing time.

Background art: [0002] Over the past 10 years, information-based organizations and knowledge-based enterprises have thrived. Creating and disseminating knowledge has become a key test of an enterprise's core competence, and the ability to create and apply knowledge has become indispensable support for its core competitiveness. Knowledge is stored not only in the minds of employees but also in the documents an enterprise accumulates over time, as well as in the data of other application syste...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/30
Inventor: 季铎, 蔡东风, 张桂平, 尹宝生, 苗雪雷, 周俏丽, 白羽
Owner: 沈阳格微软件有限责任公司