Method for determining document similarity based on improved Jaccard coefficients

A document similarity technology, applied in the field of information retrieval, that addresses the problem of inaccurate calculation of similarity between documents.

Active Publication Date: 2017-02-01
FUJIAN NORMAL UNIV

AI Technical Summary

Problems solved by technology

[0005] The purpose of the present invention is to overcome the deficiencies of the prior art by providing a method for determining document similarity based on an improved Jaccard coefficient. By considering the weight of each element and sample in a document and its contribution to the similarity of multiple documents, the method effectively solves the prior-art problem of inaccurate calculation of similarity between documents.

Embodiment Construction

[0021] The method provided by the present invention is described in detail below with reference to the drawings, and its advantages are illustrated with examples and experimental data.

[0022] As shown in Figure 1, the present invention discloses a method for determining document similarity based on an improved Jaccard coefficient, which includes the following steps:

[0023] Step 1: Given document X and document Y, where the document length of X is m and the document length of Y is n, determine the count of each element w_i of length K in document X and the count of each element w_j of length K in document Y, where i = 1, 2, ..., m-K+1 and j = 1, 2, ..., n-K+1 (the symbols for the two counts are not reproduced in this extract). Specifically, document X contains m elements of length 1, (m-1) elements of length 2, and so on, down to a single element of length m. Each such element is an n-Gram, which is formed by slidi...
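The counting in Step 1 can be illustrated with a minimal sketch. It assumes that a document is a plain character string and that an element of length K is a substring of K consecutive characters; the function name kgram_counts is hypothetical and does not come from the patent.

```python
from collections import Counter


def kgram_counts(doc: str, k: int) -> Counter:
    """Count each length-k element found by sliding a window over doc.

    Assumption: the document is a character string; the patent text
    available here does not say whether elements are characters or words.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # A document of length m yields m - k + 1 windows (none when k > m).
    return Counter(doc[i:i + k] for i in range(len(doc) - k + 1))


if __name__ == "__main__":
    # "abab" has length 4, so K = 2 gives 4 - 2 + 1 = 3 windows.
    print(kgram_counts("abab", 2))
```

The total of all counts equals m-K+1, which matches the index range given for i and j in Step 1.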

Abstract

The invention discloses a method for determining document similarity based on an improved Jaccard coefficient. The method comprises: step 1, determining the count of each element w_i of length K in a document X and the count of each element w_j of length K in a document Y; step 2, calculating the proportion of each element w_i in document X; step 3, calculating the proportion of each element w_j in document Y; step 4, calculating the Jaccard similarity of each common element w_h of document X and document Y; step 5, calculating ε(w_h) for the element w_h among the n-Gram elements of length K in document X and document Y; step 6, calculating a parameter F(w_h) indicating whether the element w_h is present in both document X and document Y; and step 7, taking the resulting value as the similarity of document X and document Y. The symbols for the counts, the proportions, the per-element Jaccard similarity and the final similarity are given in the description and are not reproduced in this abstract. By considering the proportion of each element and sample in the documents and its contribution to the similarity of multiple documents, the method effectively addresses the prior-art problem of inexact inter-document similarity calculation.
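The abstract names the steps but the formulas themselves are not reproduced on this page, so the patent's improved coefficient cannot be coded from it. As a rough illustration of the general idea (weighting elements by their proportion instead of treating them as a plain set), the sketch below computes the proportions of steps 2 and 3 and then substitutes the standard generalised (weighted) Jaccard index, sum of minima over sum of maxima. That substitution is an assumption, not the patent's method, and the function names are hypothetical.

```python
from collections import Counter


def kgram_proportions(doc: str, k: int) -> dict:
    """Proportion of each length-k element among all length-k elements of doc."""
    counts = Counter(doc[i:i + k] for i in range(len(doc) - k + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()} if total else {}


def weighted_jaccard(doc_x: str, doc_y: str, k: int) -> float:
    """Generalised Jaccard index over element proportions.

    Stand-in only: this is NOT the improved coefficient of the patent,
    whose formulas are missing from the available text.
    """
    px = kgram_proportions(doc_x, k)
    py = kgram_proportions(doc_y, k)
    elements = set(px) | set(py)
    if not elements:
        return 0.0
    # An element missing from one document contributes 0 to the numerator.
    num = sum(min(px.get(g, 0.0), py.get(g, 0.0)) for g in elements)
    den = sum(max(px.get(g, 0.0), py.get(g, 0.0)) for g in elements)
    return num / den


if __name__ == "__main__":
    print(weighted_jaccard("abc", "abd", 1))
```

Unlike the set-based Jaccard coefficient, this variant lets an element that is frequent in both documents contribute more than one that appears only once, which is the kind of weighting the abstract describes.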

Description

Technical field

[0001] The invention relates to the field of information retrieval, and in particular to a method for determining document similarity based on an improved Jaccard coefficient.

Background technology

[0002] With the rapid development of modern computer technology and the rapid popularization of the Internet, online data resources are growing quickly. These rich data resources make people's lives more convenient and improve their work efficiency, but they have also brought many problems, such as plagiarism of academic papers and unauthorized reprinting of news. Duplicate detection has emerged in this context. Similarity calculation has broad application prospects; at present it is mainly used in duplicate checking of academic papers, electronic file copyright, text clustering, sorting of questionnaire surveys, search engine deduplication, and so on.

[0003] The volume of data to be checked for similarity is huge. On Baid...

Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/22, G06F17/27, G06F17/30
CPC: G06F16/31, G06F16/3331, G06F40/194, G06F40/30
Inventors: 林劼, 俞婷婷, 江育娥
Owner: FUJIAN NORMAL UNIV