
Improved SimHash code similarity detection method

A code similarity detection method, applicable to special data processing applications, instruments, software maintenance and management, and similar fields, that addresses the poor accuracy of existing methods and improves detection accuracy.

Status: Inactive
Publication date: 2017-06-20
ZHEJIANG UNIV OF TECH
Cites: 4; cited by: 21

AI Technical Summary

Problems solved by technology

[0028] To overcome the poor accuracy of existing code similarity detection methods, the present invention provides an improved SimHash code similarity detection method with high accuracy.



Examples


Embodiment Construction

[0049] The present invention will be further described below in conjunction with the accompanying drawings.

[0050] Referring to Figure 2, an improved SimHash code similarity detection method comprises the following steps:

[0051] 1) Word segmentation

[0052] Given a sentence (an article or a piece of code), perform word segmentation and feature extraction to obtain the effective feature vectors, then assign a weight to each feature vector.
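The segmentation and weighting step might be sketched as follows. This is only an illustration: the source does not say how tokens are extracted or how weights are chosen, so the regular expression, the use of occurrence counts as weights, and the function name `extract_features` are all assumptions.

```python
import re
from collections import Counter


def extract_features(text: str) -> dict[str, int]:
    """Split text into tokens and weight each token by how often it occurs.

    Assumption: identifiers and integer literals are the features, and the
    weight is the raw occurrence count. The patent text does not specify either.
    """
    tokens = re.findall(r"[A-Za-z_]\w*|\d+", text)
    return dict(Counter(tokens))


if __name__ == "__main__":
    # Each key is a feature, each value is its weight.
    print(extract_features("int a = b + b;"))
```

Other weighting schemes (for example TF-IDF, or higher weights for keywords) would slot into the same place.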

[0053] 2) Hash

[0054] A hash function computes the hash value of each feature vector; the hash value is an n-bit signature composed of the binary digits 0 and 1.

[0055] 3) Weighting

[0056] On the basis of the hash values, weight all the feature vectors, that is, W = hash * weight: where a bit of the hash value is 1, that position contributes the positive weight, and where the bit is 0, that position contributes the negative weight. This yields the weighted result of each feature vector.
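The hash and weighting steps can be illustrated with a short sketch. The source does not name a hash function, so MD5 (truncated to n bits) is an assumption here, as are the helper names `hash_bits` and `weight_bits`.

```python
import hashlib


def hash_bits(feature: str, n: int = 64) -> list[int]:
    """Return an n-bit signature of a feature as a list of 0/1 values.

    Assumption: MD5 truncated to its low n bits (n <= 128); any
    well-distributed hash function could be used instead.
    """
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    value = int.from_bytes(digest, "big") & ((1 << n) - 1)
    return [(value >> i) & 1 for i in range(n)]


def weight_bits(bits: list[int], weight: int) -> list[int]:
    """Map each 1 bit to +weight and each 0 bit to -weight."""
    return [weight if bit == 1 else -weight for bit in bits]


if __name__ == "__main__":
    # A hand-written 6-bit signature with weight 4.
    print(weight_bits([1, 0, 0, 1, 0, 1], 4))  # [4, -4, -4, 4, -4, 4]
```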

[0057] 4) Merge

[0058] Accumulate the hash-wei...


Abstract

The invention discloses an improved SimHash code similarity detection method comprising the following steps:
1) Word segmentation.
2) Hash.
3) Weighting.
4) Merging: accumulate the hash-weighted results of all feature vectors into a single sequence string.
5) Value decreasing: select a threshold value T through sorting and analysis, and subtract T from each item of the merged sequence string to obtain the final result sequence string.
6) Dimensionality reduction: reduce the accumulated n-bit result by setting each position of the final sequence string to 1 if its value is greater than 0 and to 0 otherwise, which gives the simhash value of the sentence.
Finally, the similarity of different sentences is judged according to the Hamming distance between their simhash values. The improved SimHash code similarity detection method provided by the invention has high accuracy.
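The merging, value-decreasing, and dimensionality-reduction steps could look roughly like the sketch below. It takes already-weighted vectors as input. The function name `simhash_from_weighted` and the example values, including T = 2, are hypothetical; the source only says that T is selected "through sorting and analysis" and gives no concrete rule.

```python
def simhash_from_weighted(weighted_vectors: list[list[int]], threshold: int = 0) -> list[int]:
    """Merge weighted vectors, subtract the threshold T, and reduce to bits."""
    # Merge: sum the weighted vectors position by position.
    merged = [sum(column) for column in zip(*weighted_vectors)]
    # Value decreasing: subtract T from every position.
    shifted = [value - threshold for value in merged]
    # Dimensionality reduction: 1 if the position is greater than 0, else 0.
    return [1 if value > 0 else 0 for value in shifted]


if __name__ == "__main__":
    vectors = [
        [4, -4, -4, 4, -4, 4],
        [5, 5, -5, -5, 5, 5],
    ]
    print(simhash_from_weighted(vectors))               # [1, 1, 0, 0, 1, 1]
    print(simhash_from_weighted(vectors, threshold=2))  # [1, 0, 0, 0, 0, 1]
```

With T = 0 this reduces to the classic SimHash sign test; a positive T zeroes out positions whose accumulated weight is only marginally positive.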

Description

Technical field

[0001] The invention relates to the technical field of code similarity detection, in particular to a code similarity detection method that improves on the simhash detection method.

Background technique

[0002] The paper "Detecting Near-Duplicates for Web Crawling", published by Google, applies the simhash algorithm proposed by Moses Charikar to the task of deduplicating hundreds of millions of web pages.

[0003] Simhash is a kind of locality-sensitive hash.

[0004] The main idea is dimensionality reduction: a high-dimensional feature vector is mapped to a low-dimensional one, and the Hamming distance between two such vectors determines whether two articles are duplicates or near-duplicates.

[0005] In information theory, the Hamming distance between two strings of equal length is the number of different characters in the corresponding positions of the two st...
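As a small illustration of the Hamming distance used to compare simhash values, a sketch under the assumption that signatures are given as equal-length strings of 0s and 1s (the function name is illustrative):

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    print(hamming_distance("1011101", "1001001"))  # 2
```

A smaller distance between two simhash values indicates more similar inputs; the cutoff used to call two inputs similar is application-specific.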


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F9/44, G06F17/22
CPC: G06F8/70, G06F40/194
Inventor: 陈铁明, 潘永涛, 王婷, 吕明琪, 陈波, 江颉
Owner: ZHEJIANG UNIV OF TECH