A fingerprint-based corpus full-text retrieval method and system

A corpus and fingerprint-database technology, applied to fingerprint-based full-text corpus retrieval methods and systems. It addresses the inability to generate retrieval results quickly and accurately, and achieves improved accuracy, strong applicability, and easy promotion.

Active Publication Date: 2020-12-22
中国人民解放军军事科学院评估论证研究中心 +2

AI Technical Summary

Problems solved by technology

[0007] To solve the prior-art problem that retrieval results cannot be generated quickly and accurately, the present invention provides a fingerprint-based method and system for full-text retrieval of a corpus.

Examples

Embodiment 1

[0078] A fingerprint-based corpus full-text retrieval method, as shown in Figure 1, comprising:

[0079] Step 1: In parallel, construct fingerprints for the documents to be checked using the distance map method;

[0080] Step 2: Based on the fingerprint of a document to be checked, search in parallel in the pre-built fingerprint library for the one or more fingerprints most similar to it;

[0081] Step 3: Take the document corresponding to each retrieved fingerprint as the retrieval result for the document to be checked.
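Steps 2 and 3 can be sketched as follows. The abstract of this patent states that fingerprint similarity is calculated with a bitwise AND; everything else here is an assumption made for illustration: fingerprints are modelled as Python integers, the score is the number of bits set in the AND result, a thread pool stands in for the unspecified parallel mechanism, and the names `and_similarity` and `search` are hypothetical.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def and_similarity(fp_a: int, fp_b: int) -> int:
    """Count the bit positions set in both fingerprints (bitwise AND, then popcount)."""
    return bin(fp_a & fp_b).count("1")


def search(query_fp: int, library: dict, top_m: int = 1, workers: int = 4) -> list:
    """Return the ids of the top_m library documents most similar to query_fp.

    `library` maps a document id to its integer fingerprint (assumed layout).
    """
    doc_ids = list(library)
    # Step 2: score every library fingerprint against the query in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda d: and_similarity(query_fp, library[d]), doc_ids))
    # Step 3: the documents behind the best-scoring fingerprints are the result.
    ranked = heapq.nlargest(top_m, zip(scores, doc_ids))
    return [doc_id for _, doc_id in ranked]


library = {"a": 0b1100, "b": 0b0111, "c": 0b0001}
print(search(0b0110, library, top_m=2))
```

A raw AND count favours fingerprints with many bits set, so a real system might normalise the score; the visible text does not say whether the patent does.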

[0082] Step 1: Construct fingerprints for the documents to be checked based on the distance map method in parallel.

[0083] Specifically, the construction of the fingerprint database includes:

[0084] Based on the full text of all documents in the corpus, the distance map method is used to construct fingerprints for each document, and a fingerprint index is generated.
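A minimal sketch of building such a fingerprint index in parallel is shown below. The visible text does not say how a distance map is turned into a fingerprint, so the encoding here is purely an assumption: each ordered word pair lying within `k` positions is hashed to one bit of a fixed-width bit vector. The width `FINGERPRINT_BITS`, the order `ORDER_K`, whitespace tokenisation, the thread pool, and the function names are all hypothetical.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

FINGERPRINT_BITS = 256  # assumed fingerprint width
ORDER_K = 2             # assumed distance-map order


def fingerprint(text: str, k: int = ORDER_K, bits: int = FINGERPRINT_BITS) -> int:
    """Hash every word pair within k positions into one bit of an integer."""
    words = text.lower().split()  # whitespace tokenisation is an assumption
    fp = 0
    for j, target in enumerate(words):
        for i in range(max(0, j - k), j + 1):
            pair = f"{words[i]}\x00{target}".encode("utf-8")
            digest = hashlib.sha256(pair).digest()
            fp |= 1 << (int.from_bytes(digest[:8], "big") % bits)
    return fp


def build_index(corpus: dict, workers: int = 4) -> dict:
    """Map each document id to its fingerprint, computing fingerprints in parallel."""
    doc_ids = list(corpus)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        fps = list(pool.map(fingerprint, (corpus[d] for d in doc_ids)))
    return dict(zip(doc_ids, fps))


index = build_index({"d1": "full text retrieval", "d2": "fingerprint index"})
print({doc_id: hex(fp) for doc_id, fp in index.items()})
```

Because the hash is deterministic, identical documents always receive identical fingerprints, which is the property the index relies on.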

[0085] Specifically, said using the dis...

Embodiment 2

[0125] Based on the same inventive concept, the present invention also provides a fingerprint-based corpus full-text retrieval system which, as shown in Figure 2, comprises a fingerprint module, a similarity module and a retrieval module:

[0126] Fingerprint module: constructs, in parallel, fingerprints for the documents to be checked using the distance map method;

[0127] Similarity module: based on the fingerprint of a document to be checked, searches in parallel in the pre-built fingerprint library for the one or more fingerprints most similar to it;

[0128] Retrieval module: takes the document corresponding to each retrieved fingerprint as the retrieval result for the document to be checked.

[0129] In the fingerprint module, the construction of the fingerprint library includes:

[0130] Based on the full text of all documents in the corpus, the distance map method is used to construct fingerprints for each document,...

Embodiment 3

[0170] Fingerprint-based corpus full-text retrieval can be divided into two phases: index generation and index-based search. Index generation is generally a one-time process: as long as the main content and structure of a document do not change, its index generally does not need to be updated.

[0171] Related concepts and symbolic representations thereof involved in the present invention are defined as follows:

[0172] K-order distance: For a given document D, its word sequence is denoted seq(D) and its word set is denoted N(D). Words, also called nodes, are represented by n. If, in seq(D), word n_i appears at least once within at most k positions before word n_j, where n_i, n_j ∈ N(D), then the distance from n_i to n_j is said to be a k-order distance, k ≥ 0.

[0173] Edge of order k: if the distance from node n_i to node n_j in document D is a k-order distance, then the directed edge e_{i,j} from n_i to n_j is called an edge of order k, ...
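The two definitions above can be illustrated with a short sketch that lists the edges of order k for a word sequence. It rests on one reading of the definition, flagged here as an assumption: "at most k positions before" is taken to include offset 0, so every word has an edge to itself and k = 0 yields only self-loops. The function name `order_k_edges` is hypothetical.

```python
def order_k_edges(words: list, k: int) -> set:
    """Return the directed edges (n_i, n_j) of order k for a word sequence.

    An edge (n_i, n_j) is included when n_i occurs at least once within
    at most k positions before an occurrence of n_j (offset 0 included,
    which is an assumed reading of the definition).
    """
    if k < 0:
        raise ValueError("k must be >= 0")
    edges = set()
    for j, target in enumerate(words):
        # Look back over the window of at most k positions ending at j.
        for i in range(max(0, j - k), j + 1):
            edges.add((words[i], target))
    return edges


seq = ["full", "text", "retrieval"]
print(sorted(order_k_edges(seq, 1)))
```

With this reading, raising k only ever adds edges, so the edge set of order k contains the edge set of every lower order.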


Abstract

A fingerprint-based method for full-text retrieval of a corpus, comprising: constructing, in parallel, fingerprints for the documents to be checked based on the distance map method, and generating a fingerprint index; based on the fingerprint of a document to be checked, searching in parallel in the pre-built fingerprint library for the one or more fingerprints most similar to it; and taking the document corresponding to each such fingerprint as the retrieval result for the document to be checked. The technical solution provided by the present invention establishes a fingerprint index based on a distance map, uses a bitwise AND operation to calculate fingerprint similarity, and retrieves in parallel. It can describe the structure and content of documents accurately and comprehensively, improves the efficiency and accuracy of full-text retrieval, places low demands on computer hardware, has strong applicability, and is easy to promote.

Description

Technical field

[0001] The invention relates to the field of document retrieval, in particular to a fingerprint-based method and system for full-text retrieval of a corpus.

Background

[0002] With the rapid development of Internet technology, the scale of text databases, whether online or offline, has expanded rapidly. How to build efficient indexes for these text collections and retrieve from them quickly has become an urgent problem.

[0003] Full-text retrieval refers to an information retrieval technology that takes all text information as the retrieval object. The key to full-text retrieval is document indexing, that is, how to record the information of all basic elements of the source document in the index library in an appropriate form. According to the elements indexed in the index library, existing full-text retrieval systems can be divided into two types: full-text retrieval based on a word table and full-text retrieval based...

Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/383, G06F16/31, G06K9/62
CPC: G06F16/383, G06F16/316, G06F18/22
Inventor: 林旺群, 金松昌, 林彬, 李妍, 王伟, 高博
Owner: 中国人民解放军军事科学院评估论证研究中心