Rapid hierarchical document querying method

A Query Method, Hierarchical Technology

Active Publication Date: 2017-10-24
ZHEJIANG UNIV
6 Cites, 9 Cited by

AI Technical Summary

Problems solved by technology

Because the Earth Mover's Distance (EMD) is usually formalized as a linear optimization problem and can be modeled as a minimum-cost flow on a bipartite network, computing it has a high time complexity.
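
For reference, one common way to write this linear program is the transportation formulation sketched below. The symbols are assumptions introduced here, not notation taken from the patent: p_i and q_j are the normalized word weights of the two documents (each summing to 1), c_ij is the distance between the i-th word vector of one document and the j-th word vector of the other, and T_ij is the amount of weight moved from word i to word j.

```latex
\begin{aligned}
\mathrm{EMD}(p, q) = \min_{T \ge 0} \quad & \sum_{i=1}^{m} \sum_{j=1}^{n} T_{ij}\, c_{ij} \\
\text{subject to} \quad & \sum_{j=1}^{n} T_{ij} = p_i, \quad i = 1, \dots, m, \\
& \sum_{i=1}^{m} T_{ij} = q_j, \quad j = 1, \dots, n.
\end{aligned}
```

The program has mn variables and m + n equality constraints, which is why solving it exactly for every document in a collection is expensive.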

Examples

Embodiment Construction

[0070] An embodiment of the present invention and its implementation process are as follows:

[0071] The specific implementation further illustrates the technical scheme of the present invention using the document data set Reuters-21578 (Reuters) and schematic diagrams. Reuters is a newswire news document collection that is openly available on the Internet. It comprises 65 subjects and 8,293 documents, of which 2,347 documents are used for testing and 5,946 training documents are used to build the hash table.

[0072] Document format processing stage

[0073] Step 1: Establish the data model of each document. The data model of a document consists mainly of three parts: words, word vectors, and word weights. The words are the valid words left after the document is preprocessed; the word vectors are taken from the publicly available Google News Word2Vec model; and the word weight is the TF-IDF value corresponding to the word.
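
The three-part data model of Step 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `DocumentModel` and `build_models` are hypothetical, preprocessing is reduced to lowercasing, whitespace splitting, and stopword removal, the `embeddings` dictionary stands in for the Google News Word2Vec vectors, and the TF-IDF variant (relative term frequency times log(N / df)) is one common choice that the text does not specify.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class DocumentModel:
    """Hypothetical container for the three parts named in Step 1."""
    words: list      # valid words left after preprocessing
    vectors: list    # one word vector per word
    weights: list    # one TF-IDF weight per word


def build_models(docs, embeddings, stopwords=frozenset()):
    """Build a DocumentModel per document.

    `embeddings` maps word -> vector; words without a vector are dropped.
    """
    # Assumed preprocessing: lowercase, split on whitespace, drop stopwords.
    tokenized = [
        [w for w in doc.lower().split() if w in embeddings and w not in stopwords]
        for doc in docs
    ]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each word.
    df = Counter(w for toks in tokenized for w in set(toks))

    models = []
    for toks in tokenized:
        tf = Counter(toks)
        words = sorted(tf)
        # Assumed TF-IDF variant: relative term frequency * log(N / df).
        weights = [tf[w] / len(toks) * math.log(n_docs / df[w]) for w in words]
        models.append(DocumentModel(words, [embeddings[w] for w in words], weights))
    return models


# Toy usage with made-up 2-dimensional vectors (not real Word2Vec values).
toy_embeddings = {
    "oil": [1.0, 0.0], "prices": [0.0, 1.0], "rise": [1.0, 1.0],
    "exports": [0.5, 0.5], "fall": [-1.0, 0.0],
}
models = build_models(["Oil prices rise", "Oil exports fall"], toy_embeddings)
```

With this TF-IDF variant, a word that occurs in every document (here "oil") receives weight 0, so it contributes nothing to later distance computations.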

[0074] Step 2: Format...

Abstract

The invention discloses a rapid hierarchical document querying method. The method comprises the following steps: establishing a data model for each document in the document set and formatting the documents to obtain a document centroid vector and a document label for each; treating each generated document centroid vector as a point in a high-dimensional vector space and constructing an in-memory hash index structure for each document set using locality-sensitive hashing; retrieving a candidate document set from the hash index structure, using a query method based on the idea of locality-sensitive hashing, according to the document centroid vector of the query text; and finding the nearest-neighbor document under the Word Mover's Distance (WMD) metric within the candidate document set, using a filtering-refining hierarchical framework, according to the document label of the query text. When the hierarchical query method is applied to document classification and retrieval, it balances efficiency and effectiveness well, so that a target document can be obtained rapidly while accuracy is preserved when the user queries documents under the Word Mover's Distance metric.
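
The pipeline in the abstract (centroid, hash index, candidate lookup, filtering-refining) can be sketched as below. This is an illustration under assumptions, not the patented method: the excerpt does not state which locality-sensitive hash family is used, how many hash tables are built, or how the document label enters the filtering step. The sketch therefore uses sign random projections with a single table, and a generic filter-and-refine loop in which the caller supplies a cheap lower bound and an expensive exact distance. All names (`centroid`, `HyperplaneLSH`, `filter_and_refine`) are hypothetical.

```python
import math
import random
from collections import defaultdict


def centroid(vectors, weights):
    """Weighted mean of word vectors (weights non-negative, not all zero)."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[k] for v, w in zip(vectors, weights)) / total
            for k in range(dim)]


class HyperplaneLSH:
    """Assumed LSH family: sign random projections, one in-memory table."""

    def __init__(self, dim, n_bits=8, seed=0):
        rng = random.Random(seed)
        self.planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                       for _ in range(n_bits)]
        self.table = defaultdict(list)

    def _key(self, point):
        # One bit per hyperplane: which side of the plane the point lies on.
        return tuple(sum(p * x for p, x in zip(plane, point)) >= 0.0
                     for plane in self.planes)

    def add(self, doc_id, point):
        self.table[self._key(point)].append(doc_id)

    def candidates(self, point):
        """Documents whose centroid fell into the same bucket as `point`."""
        return list(self.table.get(self._key(point), []))


def filter_and_refine(query, candidate_ids, lower_bound, exact_distance):
    """Nearest neighbor among candidates as (doc_id, distance).

    `lower_bound(query, doc_id)` must never exceed
    `exact_distance(query, doc_id)`; the exact distance is computed only
    while the cheap bound could still beat the best result so far.
    """
    best_id, best_dist = None, math.inf
    bounds = sorted(((lower_bound(query, d), d) for d in candidate_ids),
                    key=lambda pair: pair[0])
    for bound, doc_id in bounds:
        if bound >= best_dist:
            break  # no remaining candidate can be closer
        dist = exact_distance(query, doc_id)
        if dist < best_dist:
            best_id, best_dist = doc_id, dist
    return best_id, best_dist
```

In a document setting, `exact_distance` would be the Word Mover's Distance and `lower_bound` a cheaper quantity that never exceeds it; which bound the patent uses is not visible in this excerpt.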

Description

Technical field

[0001] The invention relates to a fast hierarchical document query method, and in particular to the Word2Vec model from the field of machine learning, the locality-sensitive hashing method from the field of databases, and a filtering-refining framework under the Earth Mover's Distance (EMD) metric.

Background

[0002] With the development of information technology, people's ability to produce, collect, and store information has been continuously enhanced. Documents are one of the main information carriers. Accurately expressing the similarity between two documents has a wide range of applications in document retrieval, document classification, and document clustering. Latent semantic analysis extracts low-dimensional semantic information through matrix decomposition, and the topic model is a method for modeling the hidden topics in a text. Recently, with the development of deep learning, the Word2Vec model and the Doc2Vec model have been proposed one after anoth...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06F17/27
CPC: G06F16/212, G06F16/2462, G06F16/325, G06F16/335, G06F40/284
Inventor: 陈珂, 王伟迪, 胡天磊, 陈刚, 伍赛, 寿黎但
Owner: ZHEJIANG UNIV