Set similarity calculation method and system based on minhash

A minhash-based set similarity calculation technology, applied in the field of set similarity calculation, that addresses long calculation times, a complicated calculation process, and a time-consuming minhash signature process, and improves the speed of both minhash signature generation and set similarity calculation.

Inactive Publication Date: 2017-05-17
KUYUN INTERACTIVE TECH

AI Technical Summary

Problems solved by technology

[0016] However, in actual operation, it is found that the generation process of the minhash function is a process of randomly rearranging rows. If the number of rows (the overall number of elements) is large

Examples

Embodiment 2

[0104] Figure 2 is a schematic structural diagram of a minhash-based set similarity calculation system provided in Embodiment 2 of the present invention. As shown in Figure 2, the system is used to implement the set similarity calculation method of Embodiment 1 above, and it includes: a hash mapping module 1, a class group establishment module 2, an allocation module 3, a minimum hash value determination module 4, a minimum hash signature generation module 5, and a similarity calculation module 6.

[0105] Wherein, the hash mapping module 1 is configured to use a hash function to map each element in the set to a first hash value with a length of m bits, where m is an integer.

[0106] The class group establishment module 2 is configured to establish 2^k class groups. Each class group corresponds to a tag, the tag is a second hash value with a length of k bits, and the tags corresponding to different class groups are differ...

Abstract

The invention discloses a set similarity calculation method and system based on minhash. The method includes the following steps: each element in a set is mapped by a hash function to a first hash value with a length of m bits; 2^k class groups are established, where each class group corresponds to one tag, the tag is a second hash value with a length of k bits, and different class groups correspond to different tags; for any set, the first hash value of each element in the set is assigned to the class group whose tag equals the first k bits of that first hash value; the minhash value of the set for each class group is determined according to the assignment result; the minhash values of the set for all class groups form an array that serves as the minhash signature of the set; and the similarity of any two sets is calculated from their minhash signatures. With this technical scheme, minhash signature generation can be greatly accelerated, and the set similarity calculation is therefore greatly accelerated as well.
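The steps in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the abstract does not say how the minhash value of a class group is derived from the assignment result or how two signatures are compared, so the sketch assumes the minhash value is the smallest first hash value in the group and the similarity is the fraction of class groups whose minhash values agree, ignoring groups that are empty in both sets. The function names, the use of SHA-256, and the values m = 32 and k = 4 are all hypothetical choices.

```python
import hashlib

M_BITS = 32  # m: length of the first hash value (assumed value)
K_BITS = 4   # k: tag length, giving 2**k class groups (assumed value)


def first_hash(element, m=M_BITS):
    """Map an element to an m-bit first hash value (SHA-256 is an assumption)."""
    digest = hashlib.sha256(str(element).encode("utf-8")).digest()
    # Take 64 bits of the digest and keep the top m of them (requires m <= 64).
    return int.from_bytes(digest[:8], "big") >> (64 - m)


def minhash_signature(elements, m=M_BITS, k=K_BITS):
    """Build a signature with one slot per class group (2**k slots)."""
    signature = [None] * (1 << k)
    for element in elements:
        h = first_hash(element, m)
        tag = h >> (m - k)  # first k bits of the hash select the class group
        # Assumption: the group's minhash value is its smallest first hash value.
        if signature[tag] is None or h < signature[tag]:
            signature[tag] = h
    return signature


def estimate_similarity(sig_a, sig_b):
    """Assumed estimator: share of class groups with equal minhash values."""
    used = 0
    agree = 0
    for a, b in zip(sig_a, sig_b):
        if a is None and b is None:
            continue  # group empty in both sets carries no information
        used += 1
        if a == b:
            agree += 1
    return agree / used if used else 0.0


if __name__ == "__main__":
    viewers_a = ["user%d" % i for i in range(0, 1000)]
    viewers_b = ["user%d" % i for i in range(500, 1500)]
    sig_a = minhash_signature(viewers_a)
    sig_b = minhash_signature(viewers_b)
    print(estimate_similarity(sig_a, sig_b))
```

In this sketch each element is hashed exactly once, and the signature is filled in a single pass over the set. A larger k gives more class groups and therefore a longer signature.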

Description

Technical field

[0001] The invention relates to the technical field of computer processing, and in particular to a minhash-based set similarity calculation method and system.

Background

[0002] Given two sets A and B, Jaccard similarity is a measure widely used to describe the similarity between sets. It is defined as follows:

[0003] J(A, B) = |A ∩ B| / |A ∪ B|

[0004] Computing the pairwise similarities among N sets requires N(N-1)/2 Jaccard calculations, so the complexity is O(N^2). The speed of a single Jaccard calculation is therefore critical: when the sets are relatively large, calculating the Jaccard similarity is time-consuming and puts considerable pressure on computing resources. For example, when the similarity between two programs is calculated from their viewers, each program may have millions of viewers; at this time the calculation of Jaccard similarity will be time-consumi...
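For reference, the exact Jaccard calculation and the N(N-1)/2 pairwise comparisons described in the background can be sketched as below. The function names are hypothetical, and returning 0.0 when both sets are empty is a convention chosen for this sketch, not something the source specifies.

```python
from itertools import combinations


def jaccard(a, b):
    """Exact Jaccard similarity: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = a | b
    # Convention assumed here: two empty sets get similarity 0.0.
    return len(a & b) / len(union) if union else 0.0


def pairwise_jaccard(sets):
    """Compare every unordered pair of sets: N(N-1)/2 calculations."""
    return {
        (i, j): jaccard(sets[i], sets[j])
        for i, j in combinations(range(len(sets)), 2)
    }


if __name__ == "__main__":
    programs = [{1, 2, 3}, {2, 3, 4}, {7, 8}, {1, 8}]
    scores = pairwise_jaccard(programs)
    print(len(scores))     # 4 sets give 4 * 3 / 2 = 6 pairs
    print(scores[(0, 1)])  # 2 shared elements out of 4 distinct ones
```

Each call builds the full intersection and union, which is the per-pair cost that becomes expensive when every set holds millions of elements.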

Application Information

IPC(8): G06F7/02, G06K9/62
CPC: G06F7/02, G06F18/22
Inventor: 李鹏, 陆承恩
Owner: KUYUN INTERACTIVE TECH