Comparison matrix similarity retrieval method based on multi-order fingerprints

A comparison matrix similarity retrieval technology, applied in fields such as special-purpose data processing applications, instruments, and electrical digital data processing. It addresses the problem that existing similarity retrieval mechanisms cannot be effectively migrated.

Active Publication Date: 2018-09-25
同方知网数字出版技术股份有限公司

AI Technical Summary

Problems solved by technology

The method improves the efficiency of duplicate checking and comparison of project declarations, reduces wasted manpower and material resources, and solves the problem that existing similarity retrieval mechanisms cannot be effectively migrated.

Detailed description of the embodiments

[0022] To make the objectives, technical solution, and advantages of the invention clearer, the invention is described in further detail below with reference to the embodiments and the accompanying drawings.

[0023] As shown in Figure 1, the comparison matrix similarity retrieval method based on multi-order fingerprints includes:

[0024] Step 10: fragment the text, save it in the database, and clean the text data to form unified-format text;

[0025] Text in Word, PDF and other formats is parsed by the program, converted to a unified format, and saved in the database. Figure 2 shows the unified database structure, where the attribute f_article_title is the title of each article and f_after_content is the full text with the HTML tags removed. The method mainly uses the full-text information stored in the attribute f_after_content.

[0026] Figure 3 shows the formatted text content.
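Step 10 can be illustrated with a minimal Python sketch. Only the attribute names f_article_title and f_after_content come from the text above. The helper names clean_text, fragment, and make_record, the regular expressions, and the choice of sentence-level fragments are assumptions made for illustration; the patent text visible here does not specify the fragment granularity or the cleaning rules.

```python
from __future__ import annotations

import html
import re

# Assumption: anything between angle brackets is treated as an HTML tag.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: a fragment is a run of text ending at Latin or CJK sentence punctuation.
FRAGMENT_RE = re.compile(r"[^.!?。！？]+[.!?。！？]?")


def clean_text(raw: str) -> str:
    """Strip HTML tags, decode entities, and collapse whitespace."""
    without_tags = TAG_RE.sub(" ", raw)
    decoded = html.unescape(without_tags)
    return re.sub(r"\s+", " ", decoded).strip()


def fragment(text: str) -> list[str]:
    """Split cleaned text into sentence-level fragments (hypothetical granularity)."""
    return [part.strip() for part in FRAGMENT_RE.findall(text) if part.strip()]


def make_record(title: str, raw_content: str) -> dict[str, str]:
    """Build one row in the unified structure described for Figure 2."""
    return {
        "f_article_title": title,
        "f_after_content": clean_text(raw_content),
    }


if __name__ == "__main__":
    record = make_record("Demo", "<p>Tom &amp; Jerry.</p><p>Second   one!</p>Third")
    print(record["f_after_content"])
    print(fragment(record["f_after_content"]))
```

Tags are stripped before entities are decoded so that escaped text such as `&lt;b&gt;` is kept as content rather than being removed as markup.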

[0027] Step 20 enc...

Abstract

The invention discloses a comparison matrix similarity retrieval method based on multi-order fingerprints. The method comprises the following steps: fragmenting the text, saving it in a database, and cleaning the text data to form unified-format text; encoding the unified-format text with the simhash algorithm to form a 64-bit binary multi-order fingerprint feature value, and saving it in the database; calculating the Hamming distance between the feature value of the text under comparison and the feature values of the other texts, and selecting the texts whose Hamming distance is smaller than the threshold of 3 for secondary calculation; constructing a comparison matrix by pairing the original text with each comparison text, calculating the text similarity and the similar content, and marking them in the output; and optimizing the calculation of text similarity and similar content, where the optimization uses parallel computing to run multiple threads simultaneously.
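The encoding, filtering, and matrix steps in the abstract can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation. The 64-bit width and the threshold of 3 come from the abstract; everything else is assumed: the word tokenizer, the use of SHA-256 as the per-token hash, difflib's SequenceMatcher as the fragment similarity measure, and all function names. The sketch computes a single standard simhash value, since the visible text does not explain how the multi-order fingerprint is built.

```python
from __future__ import annotations

import hashlib
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher

FINGERPRINT_BITS = 64   # from the abstract: 64-bit binary feature value
HAMMING_THRESHOLD = 3   # from the abstract: keep texts with distance < 3


def tokenize(text: str) -> list[str]:
    # Assumption: plain word tokens; Chinese text would need a word segmenter.
    return re.findall(r"\w+", text.lower())


def simhash(text: str) -> int:
    """Standard simhash: weighted bit voting over per-token 64-bit hashes."""
    votes = [0] * FINGERPRINT_BITS
    for token, count in Counter(tokenize(text)).items():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # first 64 bits
        for bit in range(FINGERPRINT_BITS):
            votes[bit] += count if (token_hash >> bit) & 1 else -count
    return sum(1 << bit for bit, vote in enumerate(votes) if vote > 0)


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def select_candidates(query_fp: int, stored: dict[str, int]) -> list[str]:
    """Return ids of stored texts close enough for the secondary calculation."""
    return [doc_id for doc_id, fp in stored.items()
            if hamming_distance(query_fp, fp) < HAMMING_THRESHOLD]


def comparison_matrix(original: list[str], candidate: list[str],
                      workers: int = 4) -> list[list[float]]:
    """Pairwise fragment similarity; each row is computed in a worker thread."""
    def row(fragment: str) -> list[float]:
        # Assumption: SequenceMatcher.ratio() stands in for the patent's measure.
        return [SequenceMatcher(None, fragment, other).ratio()
                for other in candidate]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(row, original))


if __name__ == "__main__":
    fp = simhash("the quick brown fox jumps over the lazy dog")
    stored = {"same": fp, "near": fp ^ 0b11, "far": fp ^ 0xFF}
    print(select_candidates(fp, stored))
    print(comparison_matrix(["abc", "xyz"], ["abc"]))
```

Cells of the matrix above a chosen cutoff could then be marked as similar content in the output. In CPython, threads give limited speedup for CPU-bound string comparison because of the global interpreter lock, so a process pool may be a better fit for large batches.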

Description

Technical field

[0001] The invention relates to the technical fields of text mining and computer information processing, and in particular to a comparison matrix similarity retrieval method based on multi-order fingerprints.

Background

[0002] As computers are widely used for natural language processing of text information, the increasingly complex needs of today's society place higher demands on computer text processing. In the field of similarity retrieval, the existing methods cannot be reproduced and require substantial hardware support and special database support, so they cannot meet the diverse needs of enterprises. State-owned enterprises, public institutions, and agencies handling state secrets in particular cannot use public similarity retrieval systems, because their data must be kept confidential. Faced with the increasing demand for project declarations, it is only possible to conduct similar ch...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30, G06K9/62
CPC: G06F18/22
Inventor: 段飞虎, 吕强, 冯自强, 张宏伟
Owner: 同方知网数字出版技术股份有限公司