Document similarity determining method based on maximum likelihood estimation

A document similarity technique based on maximum likelihood estimation, applied in the field of information retrieval, that addresses the low accuracy of existing document similarity measures.

Active Publication Date: 2015-05-20
CENT SOUTH UNIV

AI Technical Summary

Problems solved by technology

[0006] The present invention proposes a method for determining document similarity based on maximum likelihood estimation. Its purpose is to solve the problem of low document similarity accuracy in the prior art.


Examples

Example 1

[0119] In Example 1, the minwise fingerprint values of $S_1$ and $S_2$ at the corresponding positions are:

[0120] $\min\{\pi_2(S_1)\} = \min\{\pi_2(S_2)\} = 2$,

[0121] $\min\{\pi_3(S_1)\} = \min\{\pi_3(S_2)\} = 1$,

[0122] $\min\{\pi_4(S_1)\} = \min\{\pi_4(S_2)\} = 1$

[0123] so $k_= = 3$

[0124] 2) Solving for $k_>$

[0125] In Example 1, the minwise fingerprint values of $S_1$ and $S_2$ at the corresponding positions are:

[0126] $\min\{\pi_5(S_1)\} = 4 > \min\{\pi_5(S_2)\} = 0$,

[0127] $\min\{\pi_6(S_1)\} = 1 > \min\{\pi_6(S_2)\} = 0$

[0128] so $k_> = 2$

[0129] 3) Solving for $k_<$

[0130] In Example 1, the minwise fingerprint values of $S_1$ and $S_2$ at the corresponding positions are:

[0131] $\min\{\pi_1(S_1)\} = 0 < \min\{\pi_1(S_2)\} = 1$

[0132] so $k_< = 1$
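
The counting in steps 1) to 3) can be sketched in a few lines of Python. This is only an illustration: the function name `compare_counts` and the list layout (one minimum per permutation, in the order π1 to π6) are assumptions, while the fingerprint values are the ones quoted in paragraphs [0120] to [0131].

```python
def compare_counts(fp1, fp2):
    """Count positions where two minwise fingerprints are equal, greater, or less.

    fp1[j] and fp2[j] are assumed to hold min{pi_j(S1)} and min{pi_j(S2)}.
    """
    k_eq = sum(1 for x, y in zip(fp1, fp2) if x == y)
    k_gt = sum(1 for x, y in zip(fp1, fp2) if x > y)
    k_lt = sum(1 for x, y in zip(fp1, fp2) if x < y)
    return k_eq, k_gt, k_lt


# Fingerprint values from Example 1, ordered pi_1 .. pi_6.
fp_s1 = [0, 2, 1, 1, 4, 1]
fp_s2 = [1, 2, 1, 1, 0, 0]

k_eq, k_gt, k_lt = compare_counts(fp_s1, fp_s2)
print(k_eq, k_gt, k_lt)  # 3 2 1
```

The three counts always sum to the fingerprint length k, here 6.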

[0133] 4) Solving for the maximum likelihood estimator $a_{MLE}$ of the two documents (their intersection, as estimated by the maximum likelihood method).

[0134] Substituting $f_1 = 6$, $f_2 = 6$ and $k_= = 3$, $k_> = 2$, $k_< = 1$ from Example 1 into the formula gives:

[0135] k ...
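
The closed-form expression used in step 4) is not reproduced in this excerpt, so the following is only a numerical sketch under a stated assumption. Under a uniformly random permutation, the smallest element of $S_1 \cup S_2$ falls in the intersection with probability $a/(f_1+f_2-a)$, in $S_1$ only with probability $(f_1-a)/(f_1+f_2-a)$ (outcome "<"), and in $S_2$ only with probability $(f_2-a)/(f_1+f_2-a)$ (outcome ">"). The sketch maximizes the resulting multinomial log-likelihood by a simple grid search; the function names, the grid step, and the search method itself are illustrative choices, not the patent's procedure.

```python
import math


def log_likelihood(a, f1, f2, k_eq, k_gt, k_lt):
    """Multinomial log-likelihood of the counts for a candidate intersection size a."""
    union = f1 + f2 - a
    ll = -(k_eq + k_gt + k_lt) * math.log(union)
    # "<" means the overall minimum lies in S1 only; ">" means it lies in S2 only.
    for count, numerator in ((k_eq, a), (k_lt, f1 - a), (k_gt, f2 - a)):
        if count:  # skip empty categories to avoid log(0)
            ll += count * math.log(numerator)
    return ll


def a_mle(f1, f2, k_eq, k_gt, k_lt, step=0.01):
    """Grid-search the a in (0, min(f1, f2)) that maximizes the log-likelihood."""
    n = int(round(min(f1, f2) / step))
    grid = [i * step for i in range(1, n)]
    return max(grid, key=lambda a: log_likelihood(a, f1, f2, k_eq, k_gt, k_lt))


# Values from Example 1: f1 = f2 = 6, k_= = 3, k_> = 2, k_< = 1.
a_hat = a_mle(6, 6, k_eq=3, k_gt=2, k_lt=1)
print(round(a_hat, 2))                    # 4.0
print(round(a_hat / (6 + 6 - a_hat), 2))  # resemblance a / (f1 + f2 - a) = 0.5
```

For these inputs the stationarity condition reduces to 216 - 54a = 0, so the grid maximum at a = 4 agrees with the analytic solution of the assumed likelihood.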

Example 2

[0138] Maximum likelihood similarity determination for 3 documents:

[0139] Building on Example 1, add document $S_3 = \{1, 3, 4, 5\}$. The three-document similarity obtained with the prior-art method is:

[0140] $R(1,2,3) = \frac{|S_1 \cap S_2 \cap S_3|}{|S_1 \cup S_2 \cup S_3|} = \frac{a}{f_1 + f_2 + f...}$
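
The first equality in [0140], the ratio of the common elements to the union, can be checked directly on small sets. In the sketch below only $S_3$ comes from the text; `s1` and `s2` are hypothetical placeholders chosen for illustration (this excerpt does not list the actual sets of Example 1), and the function name `resemblance` is likewise an assumption.

```python
def resemblance(*sets):
    """Exact resemblance: size of the common intersection over size of the union."""
    inter = set.intersection(*sets)
    union = set.union(*sets)
    return len(inter) / len(union)


# Hypothetical S1 and S2 (NOT taken from the patent); S3 is the set given in [0139].
s1 = {0, 1, 2, 3, 4, 5}
s2 = {1, 2, 3, 4, 6, 7}
s3 = {1, 3, 4, 5}

print(resemblance(s1, s2, s3))  # 0.375, i.e. |{1, 3, 4}| / |{0, ..., 7}|
```

The same function gives the two-document resemblance when called with two sets.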

Abstract

The invention discloses a method for determining document similarity based on maximum likelihood estimation. The method comprises the following steps: first, text features are extracted; second, the text feature sets are mapped to numerical values, giving the numerical value set $S_d$ corresponding to each document; third, each set $S_d$ is represented by its minwise fingerprint; fourth, the similarity a of two documents is calculated from their minwise fingerprints and a maximum likelihood function. The method uses the probabilities of the three possible outcomes (<, > and =) of comparing hash values, combines these probabilities into a likelihood function, and builds a maximum likelihood minwise hash estimator from it. The method is further extended to determining the similarity of three documents, yielding high-precision text similarity. Because the estimate obtained by the maximum likelihood method has minimum variance, the resulting similarity is inherently more precise than that of the plain minwise method.
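
The first three steps of the abstract can be sketched as follows. Everything concrete here is an assumption made for illustration: word shingles as the text feature, SHA-1 for the numerical mapping, and random linear hashes modulo a prime standing in for the permutations. The abstract does not specify any of these choices, and the fourth step (likelihood maximization) is not covered by this sketch.

```python
import hashlib
import random

PRIME = (1 << 61) - 1  # a Mersenne prime used as the modulus (assumed choice)


def shingles(text, w=3):
    """Step 1 (assumed feature type): the set of w-word shingles of the text."""
    words = text.lower().split()
    return {" ".join(words[i:i + w]) for i in range(max(1, len(words) - w + 1))}


def to_values(features):
    """Step 2: map each text feature to an integer, giving the set S_d."""
    return {
        int.from_bytes(hashlib.sha1(f.encode("utf-8")).digest()[:8], "big") % PRIME
        for f in features
    }


def make_hashes(k, seed=0):
    """k random linear hashes that stand in for the permutations pi_1 .. pi_k."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def fingerprint(values, hashes):
    """Step 3: minwise fingerprint, one minimum per hash function."""
    return [min((a * x + b) % PRIME for x in values) for a, b in hashes]


hashes = make_hashes(k=64)
fp1 = fingerprint(to_values(shingles("the quick brown fox jumps over the lazy dog")), hashes)
fp2 = fingerprint(to_values(shingles("the quick brown fox jumps over the lazy cat")), hashes)
print(len(fp1), len(fp2))
```

Both documents must be fingerprinted with the same list of hash functions, otherwise position-by-position comparison of the two fingerprints is meaningless.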

Description

Technical field

[0001] The invention belongs to the field of information retrieval, and in particular relates to a method for determining document similarity based on maximum likelihood estimation.

Background

[0002] The Web is experiencing explosive growth, and more and more documents are published on the Internet. As a result, the document resources on the Internet grow exponentially, which provides unprecedented convenience for sharing knowledge and creating wealth and helps promote modernization. However, while these digital resources help people, their easy availability also makes illegal copying and plagiarism of documents increasingly rampant, so that relatively serious plagiarism may occur. At the same time, with its large investment in education and scientific research, the country has provided funding for various educational and technological projects, such a...

Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/27
Inventor: 龙军, 袁鑫攀, 盛鑫海, 李祖德
Owner: CENT SOUTH UNIV