Similarity-based text duplicate checking method and system

A technique based on text similarity, including cosine similarity, applied to similarity-based text duplicate checking. It addresses the unsatisfactory results of existing approaches and improves the accuracy of duplicate checking.

Pending Publication Date: 2018-05-04
COMP NETWORK INFORMATION CENT CHINESE ACADEMY OF SCI

AI Technical Summary

Problems solved by technology

[0003] The purpose of the present invention is to provide a text plagiarism check method and system based on similarity, whi

Examples

Embodiment Construction

[0027] In order to make the above-mentioned features and advantages of the present invention more comprehensible, the following specific embodiments are described in detail in conjunction with the accompanying drawings.

[0028] This embodiment provides a method that uses text similarity to check for duplicates. As shown in Figure 1, the steps include:

[0029] 1. Preprocess the text to be compared, D1:

[0030] Preprocessing consists of word segmentation and stop-word removal. An open-source word segmentation tool such as jieba can be used for segmentation. Stop words can be removed using a manually built stop-word list or stop-word lists collected from the Internet.
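The preprocessing step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `preprocess` and `simple_tokenize` names, the regex tokenizer, and the tiny English stop-word list are all assumptions made for the example. For Chinese text, a real segmenter would be passed in as the `tokenize` argument.

```python
import re

# Hypothetical stop-word list for illustration only; the text suggests a
# hand-built list or lists collected from the Internet.
STOP_WORDS = {"the", "a", "an", "of", "is", "and", "to", "in"}


def simple_tokenize(text):
    """Stand-in tokenizer: lowercase, then split on non-word characters.

    A real word segmentation tool would replace this for Chinese text.
    """
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]


def preprocess(text, tokenize=simple_tokenize, stop_words=STOP_WORDS):
    """Segment the text into words, then drop stop words."""
    return [tok for tok in tokenize(text) if tok not in stop_words]


d1_tokens = preprocess("The purpose of the invention is to check text for duplicates.")
print(d1_tokens)
```

Keeping the tokenizer as a parameter lets the same stop-word filtering run unchanged whichever segmentation tool is chosen.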

[0031] 2. Obtain similar candidate sets:

[0032] A similar candidate set is screened out from the text database, and the coarse-grained similarity SIMjaccard(D1, D2) of each candidate text D2 in the set to the text D1 to be compared is greate...
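A coarse-grained screening pass of this kind might look like the sketch below. It assumes that SIMjaccard is the standard Jaccard coefficient computed over sets of preprocessed words, which the visible text does not spell out. The function names, the in-memory `database` dictionary, and the threshold value 0.5 are hypothetical.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard coefficient |A & B| / |A | B| over the two token sets."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # two empty texts: treat as not similar
    return len(set_a & set_b) / len(union)


def similar_candidate_set(d1_tokens, database, threshold):
    """Return ids of texts whose Jaccard similarity to D1 exceeds the threshold.

    database: dict mapping a document id to its preprocessed token list.
    """
    return [
        doc_id
        for doc_id, tokens in database.items()
        if jaccard_similarity(d1_tokens, tokens) > threshold
    ]


# Toy database; contents and threshold are illustrative assumptions.
database = {
    "doc_a": ["text", "duplicate", "check", "method"],
    "doc_b": ["weather", "forecast", "rain"],
}
d1 = ["text", "duplicate", "check", "system"]
candidates = similar_candidate_set(d1, database, threshold=0.5)
print(candidates)
```

Because the Jaccard coefficient only needs set operations, it is cheap enough to run against every text in the database before the more expensive fine-grained comparison.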

Abstract

The invention provides a similarity-based text duplicate checking method. The method includes the steps of: preprocessing the text to be compared; screening out from a text database all candidate texts whose coarse-grained similarity with the text to be compared is greater than a similar-candidate-set threshold, thereby forming a similar candidate set; segmenting the text to be compared and the candidate texts using sentences as segmentation units; determining fine-grained similarity by calculating the TFIDF similarity, LDA similarity, doc2vec similarity and word2vec similarity between each candidate text and the text to be compared; and screening out the candidate texts whose fine-grained similarity exceeds a similarity determination threshold and determining them to be similar texts of the text to be compared, thereby realizing duplicate checking. The invention also provides a similarity-based text duplicate checking system for realizing the above method.
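Of the four fine-grained measures named in the abstract, the TFIDF similarity is the simplest to illustrate. The sketch below computes a cosine similarity between TF-IDF vectors using only the standard library. It is an assumption-laden illustration: the smoothed IDF formula is one common weighting, not necessarily the one used in the patent, and the way the TFIDF, LDA, doc2vec and word2vec scores are combined is not stated in the visible text, so it is not shown.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one {term: weight} dict per document.

    docs: list of token lists. Uses relative term frequency and a smoothed
    IDF, log((1 + n) / (1 + df)) + 1, chosen here as an assumption.
    """
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * (math.log((1 + n) / (1 + df[term])) + 1)
            for term, count in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy token lists standing in for preprocessed sentences.
docs = [
    ["text", "duplicate", "check"],
    ["text", "duplicate", "check"],
    ["text", "similarity", "threshold"],
    ["weather", "forecast"],
]
v = tfidf_vectors(docs)
print(cosine(v[0], v[1]), cosine(v[0], v[2]), cosine(v[0], v[3]))
```

Identical token lists score 1 (up to floating-point rounding), lists with no shared terms score 0, and partial overlap falls in between.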

Description

Technical field

[0001] The invention relates to the field of digital copyright, in particular to a similarity-based text plagiarism checking method and system.

Background technique

[0002] Digital copyright technology is currently widely used to prevent illegal copying of digital media. Existing digital copyright technology for text mainly identifies and analyzes text similarity. Hash techniques are usually used to check texts for plagiarism, so that illegal copying can be detected promptly and the copyright protected. However, a simple modification to a passage of text can change its hash value, so checking for plagiarism with this prior art gives unsatisfactory results.

Contents of the invention

[0003] The purpose of the present invention is to provide a text plagiarism check method and system based on similarity, which is used for text ...
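The weakness of hash-based checking described in the background can be illustrated with a short sketch. SHA-256 and the word-level Jaccard measure are assumptions chosen for the demonstration; the text says only "Hash technology" and does not name an algorithm. The two sample sentences are made up for the example.

```python
import hashlib


def sha256_hex(text):
    """Hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def word_jaccard(a, b):
    """Jaccard coefficient over lowercase whitespace-separated words."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


original = "digital copyright technology is widely used to prevent illegal copying"
modified = "digital copyright technology is widely used to stop illegal copying"

# One changed word gives a completely different digest ...
print(sha256_hex(original) == sha256_hex(modified))
# ... while a similarity measure still reports a high overlap.
print(round(word_jaccard(original, modified), 3))
```

An exact-match digest comparison treats the edited sentence as unrelated text, whereas a similarity score degrades gradually with the amount of editing.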

Application Information

IPC(8): G06F17/27, G06F17/30
CPC: G06F16/3344, G06F40/279
Inventor 肖云, 王闰强, 胡松波, 何洪波
Owner COMP NETWORK INFORMATION CENT CHINESE ACADEMY OF SCI