
Method for measuring document similarity through single random permutation hash of position coding

A document similarity technique based on random permutation, applied to finding similar texts in information retrieval, that saves storage space and computing time.

Pending Publication Date: 2020-07-24
HUNAN UNIV OF TECH
Cites: 6; Cited by: 0

AI Technical Summary

Problems solved by technology

[0009] To solve the above technical problems, the present invention proposes a method for measuring document similarity by Position One Permutation Hashing (POPH). It addresses the performance cost of comparing hash values when One Permutation Hashing (OPH) generates an excessive number of empty regions. Improving computing performance in this way has both scientific significance and practical application value.



Examples


Embodiment 1

[0050] A method for measuring document similarity by position-coded single random permutation hashing, comprising the following steps:

[0051] S1: preliminarily extract text features and generate a single random permutation hash set O_x;

[0052] S2: further extract text features and generate a position-coded single random permutation hash set P_x: traverse the non-empty regions of the set O_x from S1, take the serial number of each non-empty region as the key and its hash value as the value, and encode the two together into key-value pairs with the structure <k, v>, which form the set P_x;

[0053] S3: similarity measurement: traverse all key-value pairs in P_a and P_b, and compare the two documents a and b according to the similarity R;

[0054] Here the subscript x denotes any document, P_a and P_b are the sets of key-value pairs generated by the method of S2 for documents a and b respectively, and N_emp is, for the sets O_a and O_b, the number of empty are...
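Steps S1 and S2 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: it assumes the usual OPH construction (permute the feature universe once, split it into k equal-width regions, keep the smallest offset in each region), which this extract does not spell out. The names `oph_sketch`, `poph_encode`, `universe_size` and `seed` are hypothetical, and features are assumed to be integer ids in the range 0 to `universe_size - 1`.

```python
import random

EMPTY = None  # an empty region (assumption: any marker outside the hash range works)


def oph_sketch(features, universe_size, k, seed=0):
    """S1 (assumed form): one permutation, k equal regions, min offset per region."""
    if universe_size % k != 0:
        raise ValueError("this sketch assumes k divides universe_size")
    perm = list(range(universe_size))
    random.Random(seed).shuffle(perm)      # the single random permutation
    width = universe_size // k             # width of each region
    bins = [EMPTY] * k
    for f in features:
        region, offset = divmod(perm[f], width)
        if bins[region] is EMPTY or offset < bins[region]:
            bins[region] = offset          # keep the smallest offset seen
    return bins


def poph_encode(bins):
    """S2: keep only non-empty regions as <k, v> = (region index, hash value)."""
    return [(idx, h) for idx, h in enumerate(bins) if h is not EMPTY]


o_x = oph_sketch({1, 5, 9}, universe_size=20, k=10)
p_x = poph_encode(o_x)
print(o_x)  # k slots, most of them empty for such a small feature set
print(p_x)  # at most three <k, v> pairs
```

With three features spread over ten regions, most slots of O_x are empty, so P_x stores far fewer entries. That is the storage saving the method targets when empty regions are common.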

Embodiment 2

[0073] Select 4 pairs of documents from the experimental data set to form the data set, divide the document pairs into 4 groups by similarity from high to low, and randomly select a pair of words in each document pair to represent the document pair. The experimental data are shown in Table 1 below (f_1 and f_2 are the word-set sizes of document 1 and document 2, and a is the intersection size). If the randomly permuted set π(S_xD) contains no empty regions, then when calculating N_emp and N_mat, POPH has to compare both Bid and Binhash, whereas OPH only needs to compare Binhash, so in this case OPH is faster than POPH, as shown in Figure 4.

[0074] Measure the time OPH and POPH take to complete the hash-value comparison when empty regions appear in different proportions: this data set is constructed on the basis of the first data set above, according to the basic formula used to measure the similarity between the sets S1 and S2 ...

Embodiment 3

[0082] As shown in Figures 9 to 13, this embodiment takes two key-value pair sets P_a and P_b with 10 regions, and the figures illustrate the process of calculating the similarity R of P_a and P_b.

[0083] Here i is a counter indicating the 1st to 10th region, and minindex is the smaller of the current keys k in the sets P_a and P_b. The counter i advances with minindex (i = minindex + 1) instead of looping over all k regions as the OPH calculation does, so POPH saves time.

[0084] Specifically, P_a = {*,2,*,*,3,*,1,*,0,end} and P_b = {*,*,3,*,*,*,1,*,1,end}. N_emp is the number of regions that are empty in both P_a and P_b, N_mat is the number of regions that are non-empty in both sets and hold the same hash value, k is the maximum number of regions of the sets P_a and P_b, obtained from the end bit, and P_a and P_b are the sets of key-value pairs generated by documents a and b through po...
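The traversal in this embodiment can be sketched in Python as below. Several points are assumptions rather than statements from the extract: the similarity is taken to be the standard OPH estimator R = N_mat / (k - N_emp), since the formula itself is missing here; the nine entries before `end` are treated as regions 0 to 8, with `end` as a sentinel stored at key 9, so k = 9 in this sketch (the text's count of 10 regions may include the end slot); and the names `encode` and `poph_similarity` are hypothetical.

```python
END = "end"  # sentinel value closing each set, as in the example


def encode(regions):
    """Turn a region list ('*' = empty) into <k, v> pairs; the last pair is (k, END)."""
    return [(idx, v) for idx, v in enumerate(regions) if v != "*"]


def poph_similarity(pa, pb):
    """Merge-style walk over two key-sorted pair lists that end at the same key."""
    ia = ib = 0          # cursors into pa and pb
    i = 0                # next region not yet accounted for
    n_emp = n_mat = 0
    while True:
        (ka, va), (kb, vb) = pa[ia], pb[ib]
        minindex = min(ka, kb)
        n_emp += minindex - i            # regions skipped by both sets are empty in both
        if va == END and vb == END:
            k = ka                       # assumed: the sentinel's key gives k
            break
        if ka == kb:
            if va == vb:
                n_mat += 1               # same region, same hash value
            ia += 1
            ib += 1
        elif ka < kb:
            ia += 1
        else:
            ib += 1
        i = minindex + 1                 # jump ahead instead of visiting every region
    r = n_mat / (k - n_emp) if k > n_emp else 0.0   # assumed OPH estimator
    return n_emp, n_mat, r


p_a = encode(["*", 2, "*", "*", 3, "*", 1, "*", 0, END])
p_b = encode(["*", "*", 3, "*", "*", "*", 1, "*", 1, END])
print(poph_similarity(p_a, p_b))  # (4, 1, 0.2) under the assumptions above
```

The loop runs once per stored pair rather than once per region, which is where the time saving over a full scan of k regions comes from when many regions are empty.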


Abstract

A method for measuring document similarity by position-coded single random permutation hashing belongs to the field of searching for similar texts in information retrieval, and comprises the following steps: S1, preliminarily extracting text features and generating a single random permutation hash set O_x; S2, further extracting text features and generating a position-coded single random permutation hash set P_x: traversing the non-empty regions of the set O_x from S1, taking the serial number of each non-empty region as the key and its hash value as the value, and encoding the two together into key-value pairs with the structure <k, v> to form the set P_x; and S3, similarity measurement: traversing all key-value pairs in P_a and P_b and comparing the two documents a and b according to the similarity. The method is as accurate as OPH, and as the number of empty regions increases, the POPH document similarity measurement method saves both calculation time and storage space.

Description

Technical field

[0001] The invention belongs to the field of searching for similar texts in information retrieval, and more specifically relates to a method for measuring document similarity by position-coded single random permutation hashing.

Background

[0002] The Web is experiencing explosive growth, and more and more documents are published on the Internet. As a result, document resources on the Internet are growing exponentially, which offers unprecedented convenience for sharing knowledge and creating wealth and plays a positive role in promoting modernization. However, while these digital resources help people, their easy availability also makes illegal copying and plagiarism of documents increasingly rampant, so that documents such as papers and project applications may contain serious plagiarism. At the same time, with the country's large investment in education and sci...


Application Information

IPC(8): G06F16/332, G06F16/31
CPC: G06F16/332, G06F16/325
Inventors: 袁鑫攀, 王松林, 毛鑫鑫
Owner: HUNAN UNIV OF TECH