Document similarity determining method based on maximum likelihood estimation
A technique that combines maximum likelihood estimation with document similarity determination. It is applied in the field of information retrieval and addresses the low accuracy of document similarity estimates.
Active Publication Date: 2015-05-20
CENT SOUTH UNIV
Cites: 2; Cited by: 15
AI Technical Summary
Problems solved by technology
[0006] The present invention proposes a method for determining document similarity based on maximum likelihood estimation. Its purpose is to solve the low accuracy of document similarity determination in the prior art.
Examples
Example 1
[0119] The corresponding minwise fingerprint values of $S_1$ and $S_2$ in Example 1 are:
[0120] $\min\{\pi_2(S_1)\} = 2 = \min\{\pi_2(S_2)\} = 2$,
[0121] $\min\{\pi_3(S_1)\} = 1 = \min\{\pi_3(S_2)\} = 1$,
[0122] $\min\{\pi_4(S_1)\} = 1 = \min\{\pi_4(S_2)\} = 1$
[0123] so $k_{=} = 3$.
[0124] 2) Solution of $k_{>}$
[0125] The corresponding minwise fingerprint values of $S_1$ and $S_2$ in Example 1 are:
[0126] $\min\{\pi_5(S_1)\} = 4 > \min\{\pi_5(S_2)\} = 0$,
[0127] $\min\{\pi_6(S_1)\} = 1 > \min\{\pi_6(S_2)\} = 0$
[0128] so $k_{>} = 2$.
[0129] 3) Solution of $k_{<}$
[0130] The corresponding minwise fingerprint values of $S_1$ and $S_2$ in Example 1 are:
[0131] $\min\{\pi_1(S_1)\} = 0 < \min\{\pi_1(S_2)\} = 1$
[0132] so $k_{<} = 1$.
[0133] 4) Solution of the maximum likelihood estimator $a_{MLE}$ (the size of the intersection of the two sets, obtained by the maximum likelihood method).
[0134] Substituting $f_1 = 6$, $f_2 = 6$, $k_{=} = 3$, $k_{>} = 2$, and $k_{<} = 1$ from Example 1 into the formula gives:
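The three counts in steps 1) to 3) can be reproduced by comparing the two fingerprints position by position. The sketch below uses the fingerprint values listed in paragraphs [0120] to [0131]; the function and variable names (`count_comparisons`, `fp_s1`, `fp_s2`) are illustrative and do not come from the patent.

```python
# Minwise fingerprint values from Example 1, one entry per permutation pi_1..pi_6.
fp_s1 = [0, 2, 1, 1, 4, 1]  # min{pi_i(S_1)} for i = 1..6
fp_s2 = [1, 2, 1, 1, 0, 0]  # min{pi_i(S_2)} for i = 1..6


def count_comparisons(fp_a, fp_b):
    """Return (k_eq, k_gt, k_lt): how often fp_a is equal to, greater than,
    or less than fp_b at the same fingerprint position."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    k_eq = sum(a == b for a, b in zip(fp_a, fp_b))
    k_gt = sum(a > b for a, b in zip(fp_a, fp_b))
    k_lt = sum(a < b for a, b in zip(fp_a, fp_b))
    return k_eq, k_gt, k_lt


k_eq, k_gt, k_lt = count_comparisons(fp_s1, fp_s2)
print(k_eq, k_gt, k_lt)  # 3 2 1
```

The counts always sum to the fingerprint length k (here 6), so only two of them need to be stored.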
[0135] k ...
Example 2
[0138] Maximum likelihood similarity determination for three documents:
[0139] On the basis of Example 1, add the document $S_3 = \{1, 3, 4, 5\}$. The three-way similarity obtained with the prior-art method is:
[0140] $R(1,2,3) = \frac{|S_1 \cap S_2 \cap S_3|}{|S_1 \cup S_2 \cup S_3|}$ = a / (f_1 + f_2 + f...
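For reference, the exact three-way resemblance in [0140] is just the size of the common intersection divided by the size of the union. The sketch below computes it directly from the sets. Only $S_3 = \{1, 3, 4, 5\}$ is taken from the text; `s1` and `s2` are hypothetical six-element sets chosen for illustration, because this excerpt does not list the elements of $S_1$ and $S_2$.

```python
def resemblance(*sets):
    """Exact resemblance |intersection| / |union| of two or more sets."""
    if not sets:
        raise ValueError("at least one set is required")
    union = set().union(*sets)
    if not union:
        return 0.0
    common = set(sets[0]).intersection(*sets[1:])
    return len(common) / len(union)


s1 = {1, 2, 3, 4, 5, 6}  # hypothetical, not from the patent text
s2 = {1, 3, 4, 5, 7, 8}  # hypothetical, not from the patent text
s3 = {1, 3, 4, 5}        # from paragraph [0139]

print(resemblance(s1, s2, s3))  # 0.5
```

This exact computation needs the full sets; the minwise fingerprint approach exists to estimate the same quantity from short fingerprints instead.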
Abstract
The invention discloses a method for determining document similarity based on maximum likelihood estimation. The method comprises the following steps: first, text features are extracted; second, the text feature sets are mapped to numerical values, giving the numerical value set $S_d$ corresponding to each document; third, each set $S_d$ is represented by a minwise fingerprint; fourth, the similarity a of two documents is calculated from the documents' minwise fingerprints and a maximum likelihood function. The method uses the probabilities of the three possible outcomes (<, > and =) of comparing hash values, builds a likelihood function that combines these probabilities, and from it establishes a maximum likelihood minwise hash estimator. The method is also extended to determining the similarity of three documents. Because the estimate obtained by the maximum likelihood method has the smallest variance, the resulting similarity is more precise than that of the plain minwise method.
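The four steps of the abstract can be sketched end to end as follows. This is a minimal illustration, not the patent's implementation. The likelihood function itself is not shown in this excerpt, so the sketch assumes the standard minwise hashing probabilities for a random permutation: with $f_1 = |S_1|$, $f_2 = |S_2|$ and $a = |S_1 \cap S_2|$, the outcomes =, < and > occur with probabilities $a/(f_1+f_2-a)$, $(f_1-a)/(f_1+f_2-a)$ and $(f_2-a)/(f_1+f_2-a)$. Word bigrams as features, MD5 for the numerical mapping, and linear hash functions in place of true permutations are further assumptions, and all function names are hypothetical.

```python
import hashlib
import math
import random

PRIME = 4294967311  # smallest prime above 2**32, modulus of the linear hashes


def features(text, n=2):
    """Step 1: extract word n-grams as text features (assumed feature type)."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def to_numbers(feats):
    """Step 2: map each feature to a 32-bit integer, giving the set S_d."""
    return {int(hashlib.md5(f.encode("utf-8")).hexdigest()[:8], 16) for f in feats}


def make_hashes(k, seed=0):
    """Draw k pairs (a, b); x -> (a*x + b) % PRIME approximates a permutation."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME)) for _ in range(k)]


def fingerprint(numbers, hashes):
    """Step 3: minwise fingerprint, the minimum hashed value per hash function."""
    if not numbers:
        raise ValueError("document has no features")
    return [min((a * x + b) % PRIME for x in numbers) for a, b in hashes]


def mle_intersection(f1, f2, k_eq, k_lt, k_gt):
    """Step 4: integer a in [0, min(f1, f2)] maximising the ASSUMED log-likelihood
    k_eq*log(a) + k_lt*log(f1-a) + k_gt*log(f2-a) - k*log(f1+f2-a)."""
    if f1 <= 0 or f2 <= 0:
        raise ValueError("set sizes must be positive")
    k = k_eq + k_lt + k_gt

    def term(count, value):
        if count == 0:
            return 0.0  # an outcome that never occurred contributes nothing
        return count * math.log(value) if value > 0 else -math.inf

    def loglik(a):
        return (term(k_eq, a) + term(k_lt, f1 - a) + term(k_gt, f2 - a)
                - k * math.log(f1 + f2 - a))

    return max(range(0, min(f1, f2) + 1), key=loglik)


def similarity(text1, text2, k=200):
    """Return (estimated intersection size a, resemblance a / (f1 + f2 - a))."""
    s1, s2 = to_numbers(features(text1)), to_numbers(features(text2))
    hashes = make_hashes(k)
    fp1, fp2 = fingerprint(s1, hashes), fingerprint(s2, hashes)
    k_eq = sum(x == y for x, y in zip(fp1, fp2))
    k_lt = sum(x < y for x, y in zip(fp1, fp2))
    k_gt = k - k_eq - k_lt
    a = mle_intersection(len(s1), len(s2), k_eq, k_lt, k_gt)
    return a, a / (len(s1) + len(s2) - a)


print(similarity("the quick brown fox jumps", "the quick brown fox sleeps"))
```

The plain minwise estimator uses only the fraction of equal positions, k_eq / k. The sketch also uses the < and > counts together with the known set sizes, which is the extra information the abstract credits for the improved precision. A grid search over integer values of a keeps the example short; a root finder on the derivative of the log-likelihood would be an alternative.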
Description
Technical field
[0001] The invention belongs to the field of information retrieval, and in particular relates to a method for determining document similarity based on maximum likelihood estimation.
Background
[0002] The Web is experiencing explosive growth, and more and more documents are published on the Internet. As a result, document resources on the Internet grow exponentially, which provides unprecedented convenience for sharing knowledge and creating wealth and plays a positive role in promoting modernization. However, while these digital resources help people, their easy availability also makes illegal copying, plagiarism, and similar misuse of documents increasingly rampant, so that relatively serious plagiarism may occur. At the same time, with the country's large investment in education and scientific research, funding has been provided for various educational and technological projects, such a...