Distributed index establishment method and system based on text clustering

Applied in the field of text clustering-based distributed index construction, the method addresses growing index file size, centralized index storage, and the resulting loss of retrieval efficiency, with the effect of improving the user experience.

Status: Inactive
Publication Date: 2016-07-20
SUN YAT SEN UNIV
5 Cites, 13 Cited by

AI Technical Summary

Problems solved by technology

[0002] In traditional structured information management, indexing is the usual means of retrieving information. In a distributed network environment, however, the volume of knowledge grows very quickly, and index files grow sharply with it. When the index is stored centrally, the huge index library seriously degrades retrieval efficiency. Index methods based on document partitioning have been proposed in response, but they divide the collection randomly. Because every resulting subset is equivalently distributed, all sub-indexes must still be searched for each query, which results in a large retrieval overhead.



Examples


Embodiment Construction

[0056] The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained from them by persons of ordinary skill in the art without creative effort fall within the protection scope of the present invention.

[0057] Figure 1 is a schematic flowchart of a method for constructing a distributed index based on text clustering in an embodiment of the present invention. As shown in Figure 1, the method includes:

[0058] S11: Format and preprocess the unstructured text, and store the preprocessing results on the distributed nodes;

[0059] S12: Perform filtering and feature extraction on the preprocessing result, and obtain a processed text vocabulary fea...
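The preprocessing and feature-extraction steps above can be illustrated with a minimal sketch. It is not the patented implementation: the source does not name a tokenizer, a stop-word list, or a weighting scheme, so the regular-expression tokenizer, the `STOP_WORDS` set, and the raw term-frequency weights below are all assumptions chosen for illustration, as are the function names `preprocess` and `extract_features`. Storage of the results on distributed nodes is not modelled.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the source does not specify one.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def preprocess(raw_text):
    """Formatting and word segmentation (sketch): strip markup remnants,
    lower-case the text, and split it into alphanumeric tokens."""
    text = re.sub(r"<[^>]+>", " ", raw_text)
    text = text.lower()
    return re.findall(r"[a-z0-9]+", text)


def extract_features(tokens):
    """Filtering and feature extraction (sketch): drop stop words, then
    count term frequencies to form a sparse lexical feature vector."""
    kept = [t for t in tokens if t not in STOP_WORDS]
    return Counter(kept)


tokens = preprocess("<p>The index of the Distributed INDEX</p>")
features = extract_features(tokens)
print(tokens)
print(dict(features))
```

A real deployment for Chinese text would need a proper word segmenter in place of the regular expression, and could swap the raw counts for another weighting without changing the shape of the output.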



Abstract

The invention discloses a distributed index establishment method and system based on text clustering. The method includes the following steps: unstructured texts are formatted and preprocessed by word segmentation, and the preprocessing results are stored on the original distributed nodes; the preprocessing results are filtered and subjected to feature extraction to obtain processed text lexical feature vectors; the text lexical feature vectors are clustered with a Canopy-Kmeans clustering algorithm to obtain K clusters; each of the K clusters is distributed to one or more distributed nodes; and an index engine builds a full-text index for each of the K clusters on those nodes, yielding K full-text indexes. In this embodiment, the method and system establish a distributed index for retrieval, give the user a fast way to search the index, and improve the user experience.
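The clustering and per-cluster indexing described in the abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the patent's algorithm as claimed: the canopy pass here uses only a single tight threshold (`t2`, a hypothetical parameter) because its sole job in this sketch is to choose K and the initial centers; Euclidean distance is assumed; an in-memory inverted index stands in for the index engine; and the placement of clusters on distributed nodes is not modelled. All function names and the sample vectors are made up for the example.

```python
import math
from collections import defaultdict


def distance(a, b):
    """Euclidean distance between two sparse vectors (term -> weight)."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))


def canopy_centers(vectors, t2):
    """Canopy pass: take a remaining point as a center, then discard every
    remaining point within distance t2 of it. The number of centers is K."""
    remaining = list(vectors)
    centers = []
    while remaining:
        center = remaining.pop(0)
        centers.append(dict(center))
        remaining = [v for v in remaining if distance(center, v) > t2]
    return centers


def mean(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = defaultdict(float)
    for v in vectors:
        for term, weight in v.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def kmeans(vectors, centers, iterations=10):
    """K-means iterations seeded with the canopy centers.
    Returns the cluster id assigned to each vector."""
    assignment = []
    for _ in range(iterations):
        assignment = [
            min(range(len(centers)), key=lambda c: distance(v, centers[c]))
            for v in vectors
        ]
        for c in range(len(centers)):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:  # keep the previous center if a cluster is empty
                centers[c] = mean(members)
    return assignment


def build_indexes(vectors, assignment):
    """One inverted index (term -> set of document ids) per cluster."""
    indexes = defaultdict(lambda: defaultdict(set))
    for doc_id, (vec, cluster) in enumerate(zip(vectors, assignment)):
        for term in vec:
            indexes[cluster][term].add(doc_id)
    return indexes


# Made-up feature vectors for four documents on two topics.
docs = [
    {"index": 2.0, "search": 1.0},
    {"index": 1.0, "search": 2.0},
    {"gene": 2.0, "protein": 1.0},
    {"gene": 1.0, "protein": 2.0},
]
centers = canopy_centers(docs, t2=2.0)
assignment = kmeans(docs, centers)
indexes = build_indexes(docs, assignment)
print(len(centers), assignment)
```

Seeding K-means from canopies removes the need to guess K in advance, and because each index covers one topical cluster, a term such as "gene" appears only in the index of the cluster that contains it.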

Description

Technical field

[0001] The invention relates to the technical field of retrieval index construction, in particular to a text clustering-based distributed index construction method and system.

Background technique

[0002] In traditional structured information management, indexing is the usual means of retrieving information. In a distributed network environment, however, the volume of knowledge grows very quickly, and index files grow sharply with it. When the index is stored centrally, the huge index library seriously degrades retrieval efficiency. Index methods based on document partitioning have been proposed in response, but they divide the collection randomly. Because every resulting subset is equivalently distributed, all sub-indexes must still be searched during retrieval, which results in a large retrieval overhead.

[0003] Text clustering is based on the clustering hy...


Application Information

IPC(8): G06F17/30
CPC: G06F16/2272, G06F40/14, G06F40/103, G06F40/284
Inventors: 林格, 邓现
Owner: SUN YAT SEN UNIV