Method for measuring document similarity using position-coded single random permutation hashing
A document similarity technique based on single random permutation hashing, applied to finding similar texts in information retrieval, that saves storage space and computing time
Examples
Embodiment 1
[0050] A method for measuring document similarity by position-coded single random permutation hashing, comprising the following steps:
[0051] S1, initially extract text features and generate a single random permutation hash set O_x;
[0052] S2, further extract text features and generate a single random permutation position-encoding hash set P_x: traverse the non-empty areas of the set O_x from S1; taking the serial number of each non-empty area as the key and its hash value as the value, mixed encoding generates the key-value pairs that form the set P_x;
[0053] S3, similarity measure: traverse all key-value pairs in P_a and P_b, and compare the two documents a and b according to the resulting similarity;
[0054] Here, the subscript x denotes any document; P_a and P_b are the sets of key-value pairs generated by the method of S2 for documents a and b respectively; N_emp is, for the sets O_a and O_b, the number of empty are...
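Steps S1 and S2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names permuted_position, oph_set and position_coded, the SHA-256 stand-in for the random permutation, and the width parameter are all hypothetical, and a hash function is only an approximation of a true permutation because distinct tokens may collide.

```python
import hashlib
from typing import List, Optional, Set, Tuple


def permuted_position(token: str, space: int, seed: int = 0) -> int:
    """Map a token to a position in [0, space). Assumption: a seeded hash
    stands in for the single random permutation of the feature space."""
    digest = hashlib.sha256(f"{seed}:{token}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little") % space


def oph_set(tokens: Set[str], k: int, width: int = 1 << 16,
            seed: int = 0) -> List[Optional[int]]:
    """S1 (sketch): split the permuted space into k equal areas and keep the
    smallest offset seen in each area; None marks an empty area."""
    space = k * width
    areas: List[Optional[int]] = [None] * k
    for token in tokens:
        pos = permuted_position(token, space, seed)
        area, offset = divmod(pos, width)
        if areas[area] is None or offset < areas[area]:
            areas[area] = offset
    return areas


def position_coded(areas: List[Optional[int]]) -> List[Tuple[int, int]]:
    """S2 (sketch): keep only non-empty areas as (serial number, hash value)
    pairs, with serial numbers starting at 1."""
    return [(idx, value) for idx, value in enumerate(areas, start=1)
            if value is not None]


if __name__ == "__main__":
    o_x = oph_set({"single", "random", "permutation", "hash"}, k=10)
    print(position_coded(o_x))
```

Because empty areas are simply absent from the key-value list, a sparse document stores one pair per occupied area instead of k slots, which is the storage saving the method aims at.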
Embodiment 2
[0073] Select 4 pairs of documents from the experimental data set to form the data set, divide the document pairs into 4 groups by similarity from high to low, and randomly select a pair of words in each document pair to represent that pair. The experimental data is shown in Table 1 below (f_1 and f_2 are the word set sizes of document 1 and document 2, and a is the intersection size). If the randomly permuted set π(S_xD) contains no empty areas, then when calculating N_emp and N_mat, POPH must compare both Bid and Binhash, while OPH only needs to compare Binhash, so in this case OPH is faster than POPH, as shown in Figure 4.
[0074] Measure the time OPH and POPH take to complete the comparison of hash values when empty areas appear in different proportions: this data set is constructed on the basis of the first data set above, according to the basic formula used to measure the similarity between sets S1 and S2 ...
Embodiment 3
[0082] As shown in Figures 9 to 13, this embodiment takes two key-value pair sets P_a and P_b with 10 areas, and the figures illustrate the calculation of the similarity R between P_a and P_b.
[0083] Here, i is a counter indicating areas 1 to 10, and minindex is the smaller of the current k values of P_a and P_b. The counter i advances with minindex (i = minindex + 1) instead of looping over all k areas as the OPH calculation does, so POPH saves time.
[0084] Specifically, P_a = {*,2,*,*,3,*,1,*,0,end} and P_b = {*,*,3,*,*,*,1,*,1,end}. N_emp is the number of areas that are empty in both P_a and P_b, N_mat is the number of areas that are non-empty in both P_a and P_b and hold the same hash value, k is the largest number of areas of P_a and P_b, obtained by dividing at the end bit, and P_a, P_b are the sets of key-value pairs generated by document a and b through po...
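The minindex traversal of this embodiment can be sketched in Python as follows. Several points are assumptions rather than statements of the patent: the helper names to_pairs, count_emp_mat and similarity are hypothetical, the end marker is treated as a terminator rather than a data area (so k = 9 here, although the text speaks of 10 areas), and the ratio N_mat / (k - N_emp) is the estimator commonly used with one permutation hashing, since the patent's own formula is not reproduced in this text.

```python
from typing import List, Tuple

Pairs = List[Tuple[int, int]]  # (area serial number, hash value), sorted by key


def to_pairs(areas: list) -> Pairs:
    """Turn a dense listing such as ['*', 2, ..., 'end'] into key-value pairs."""
    pairs: Pairs = []
    for idx, value in enumerate(areas, start=1):
        if value == "end":
            break
        if value != "*":
            pairs.append((idx, value))
    return pairs


def count_emp_mat(pa: Pairs, pb: Pairs, k: int) -> Tuple[int, int]:
    """Return (N_emp, N_mat), visiting only non-empty areas.
    Assumption: k counts data areas only, excluding the end marker."""
    n_emp = n_mat = 0
    ia = ib = 0
    i = 1  # next area not yet accounted for
    while ia < len(pa) or ib < len(pb):
        key_a = pa[ia][0] if ia < len(pa) else k + 1
        key_b = pb[ib][0] if ib < len(pb) else k + 1
        minindex = min(key_a, key_b)
        n_emp += minindex - i  # skipped areas are empty in both sets
        if key_a == key_b:
            if pa[ia][1] == pb[ib][1]:
                n_mat += 1
            ia += 1
            ib += 1
        elif key_a < key_b:
            ia += 1
        else:
            ib += 1
        i = minindex + 1  # jump past the area just handled
    n_emp += k + 1 - i  # trailing areas empty in both sets
    return n_emp, n_mat


def similarity(n_emp: int, n_mat: int, k: int) -> float:
    """Assumed estimator: matches divided by areas not empty in both sets."""
    return n_mat / (k - n_emp) if k > n_emp else 0.0


if __name__ == "__main__":
    p_a = to_pairs(["*", 2, "*", "*", 3, "*", 1, "*", 0, "end"])
    p_b = to_pairs(["*", "*", 3, "*", "*", "*", 1, "*", 1, "end"])
    n_emp, n_mat = count_emp_mat(p_a, p_b, k=9)
    print(n_emp, n_mat, similarity(n_emp, n_mat, k=9))
```

Under these assumptions the example sets give N_emp = 4 (areas 1, 4, 6 and 8) and N_mat = 1 (area 7), and the loop runs five times rather than once per area, which illustrates why skipping empty areas saves time when the sets are sparse.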


