Near-duplicate text detection method based on semantic analysis and multiple SimHash

A near-duplicate text detection technique based on semantic analysis and multiple SimHash, intended to address the shortcomings of general-purpose duplicate detection and to improve detection results.

Active Publication Date: 2017-05-10
浙江网新恒天软件有限公司

AI Technical Summary

Problems solved by technology

[0004] However, most current algorithms target nearly identical articles and perform only moderately on near-duplicates produced by word replacement, sentence restructuring, or rewriting that preserves the meaning of the original text. Their practical applicability is therefore limited.


Examples


Embodiment

[0047] In this embodiment, 2162 IT news articles are used as the original texts. The method is implemented as follows:

[0048] (1) Preprocess the Chinese articles: as shown in Figure 1, the original text is segmented into words and stop words are removed; the remaining words are content words.
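The stop-word filtering in step (1) can be sketched as follows. This is a minimal illustration that assumes the article has already been segmented into tokens by an external Chinese word segmenter; the function name `content_words`, the sample tokens, and the stop-word list are all hypothetical and do not come from the patent.

```python
def content_words(tokens, stop_words):
    """Return the tokens that remain after dropping stop words and blanks."""
    return [t for t in tokens if t.strip() and t not in stop_words]


# Hypothetical pre-segmented sentence and stop-word list (illustration only).
tokens = ["这", "是", "一", "篇", "关于", "软件", "的", "新闻"]
stop_words = {"这", "是", "一", "篇", "关于", "的"}

words = content_words(tokens, stop_words)
print(words)
```

The segmentation itself is deliberately left out, since the source does not name the segmenter it uses.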

[0049] (2) Select keywords: as shown in Figure 2, the weight of each content word is calculated from its TF-IDF value, part of speech, word length, and the position of its first occurrence, and the 10 words with the highest weights are selected as keywords. The weight of a word is calculated by the following formula:

[0050]

[0051] The parameters are selected as follows:

[0052] If w_i is a noun, pro_i is 0.6; if w_i is an adjective, pro_i is 0.4; if w_i is a verb, pro_i is 0.3; if w_i belongs to any other part of speech, pro_i is 0.1;

[0053] The TF-IDF weight λ_1 is set to 0.8;

[0054] The part-of-speech weight λ_2 is set to ...
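As a rough illustration of step (2), the sketch below scores each content word and keeps the top-ranked ones. The page does not reproduce the weight formula, so the linear combination used here is purely an assumption, as are the smoothed TF-IDF variant, the normalization of word length and first position, and every name in the code. Only the part-of-speech values (0.6, 0.4, 0.3, 0.1), the value λ_1 = 0.8, and the choice of 10 keywords come from the text; the other three weights in the usage example are placeholders.

```python
import math
from collections import Counter

# Part-of-speech weights pro_i as listed in the embodiment.
POS_WEIGHT = {"noun": 0.6, "adjective": 0.4, "verb": 0.3}
OTHER_POS_WEIGHT = 0.1


def select_keywords(doc, doc_freq, n_docs, lambdas, top_k=10):
    """Rank content words by an ASSUMED linear combination of four factors.

    doc      : list of (word, part_of_speech) pairs in order of appearance
    doc_freq : dict mapping word -> number of corpus documents containing it
    lambdas  : (tfidf, part of speech, length, position) weights
    """
    words = [w for w, _ in doc]
    tf = Counter(words)
    first_index, pos_of = {}, {}
    for idx, (w, pos) in enumerate(doc):
        first_index.setdefault(w, idx)  # position of first occurrence
        pos_of.setdefault(w, pos)
    max_len = max(len(w) for w in words)
    l_tfidf, l_pos, l_len, l_first = lambdas

    scores = {}
    for w, count in tf.items():
        # Smoothed idf so that the value is always positive.
        idf = math.log((1 + n_docs) / (1 + doc_freq.get(w, 0))) + 1
        tfidf = (count / len(words)) * idf
        pro = POS_WEIGHT.get(pos_of[w], OTHER_POS_WEIGHT)
        length_term = len(w) / max_len                # longer words score higher
        first_term = 1 - first_index[w] / len(words)  # earlier words score higher
        scores[w] = (l_tfidf * tfidf + l_pos * pro
                     + l_len * length_term + l_first * first_term)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Hypothetical toy document; only 0.8 is taken from the source.
doc = [("simhash", "noun"), ("detects", "verb"), ("duplicate", "adjective"),
       ("simhash", "noun"), ("quickly", "adverb")]
keywords = select_keywords(doc, {}, n_docs=10, lambdas=(0.8, 0.1, 0.05, 0.05))
print(keywords)
```

With these placeholder weights the repeated noun ranks first, because its term frequency dominates the score.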


Abstract

The invention discloses a near-duplicate text detection method based on semantic analysis and multiple SimHash. The method introduces grammatical, syntactic, and semantic information. First, the importance of each word is assessed from multiple factors to obtain the keywords of an article. Second, a certain number of words before and after each keyword are extracted as article features, and a SimHash is computed from these features. To handle semantics, synonyms are eliminated by synonym replacement. Because synonym replacement is aggressive and reduces accuracy to some degree, a double SimHash method is used, in which the aggressive computation is applied only within a certain range. Results show that the improved detection method performs better than the traditional SimHash method, the traditional Shingling method, and similar approaches.
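The pipeline described in the abstract can be sketched as below: keyword context windows serve as features, a standard 64-bit SimHash is computed over them, and a second, synonym-normalized SimHash is consulted only when the first comparison is inconclusive. This is one possible reading of "used only in a certain range", not the patent's stated procedure. The window size, the thresholds `strict` and `loose`, the MD5-based feature hash, and all function names are assumptions made for illustration.

```python
import hashlib


def context_features(words, keywords, window=2):
    """Each keyword plus `window` words on either side becomes one feature."""
    kw = set(keywords)
    return [" ".join(words[max(0, i - window): i + window + 1])
            for i, w in enumerate(words) if w in kw]


def simhash(features, bits=64):
    """Standard SimHash: per-bit vote over the hashes of all features."""
    counts = [0] * bits
    for f in features:
        digest = hashlib.md5(f.encode("utf-8")).digest()
        h = int.from_bytes(digest[:bits // 8], "big")
        for b in range(bits):
            counts[b] += 1 if (h >> b) & 1 else -1
    fingerprint = 0
    for b in range(bits):
        if counts[b] > 0:
            fingerprint |= 1 << b
    return fingerprint


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def fingerprint(words, keywords, window):
    return simhash(context_features(words, keywords, window))


def is_near_duplicate(words_a, words_b, keywords, synonyms,
                      strict=3, loose=10, window=2):
    """Two-stage check; `strict` and `loose` are hypothetical thresholds."""
    d = hamming(fingerprint(words_a, keywords, window),
                fingerprint(words_b, keywords, window))
    if d <= strict:
        return True          # plainly similar, no synonym step needed
    if d > loose:
        return False         # plainly different, skip the aggressive step
    # Inconclusive range: map synonyms to one canonical word and retry.
    norm_a = [synonyms.get(w, w) for w in words_a]
    norm_b = [synonyms.get(w, w) for w in words_b]
    norm_kw = {synonyms.get(k, k) for k in keywords}
    d2 = hamming(fingerprint(norm_a, norm_kw, window),
                 fingerprint(norm_b, norm_kw, window))
    return d2 <= strict


# Hypothetical example: the two texts differ only by a synonym.
a = ["the", "company", "released", "new", "software", "today"]
b = ["the", "firm", "released", "new", "software", "today"]
kws = {"company", "firm", "software"}
syn = {"firm": "company"}
# loose=64 forces the synonym stage whenever the first stage is not decisive.
print(is_near_duplicate(a, b, kws, syn, strict=3, loose=64))
```

Restricting the synonym stage to the middle range keeps the aggressive normalization from merging texts that the plain fingerprints already separate clearly.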

Description

Technical field

[0001] The invention relates to a text duplicate detection method, in particular to a near-duplicate text detection method based on semantic analysis and multiple SimHash.

Background

[0002] In the field of data mining and knowledge discovery, one of the major challenges that accompanies the explosion of large-scale data is excessive duplicated information. Studies abroad have shown that duplicated web pages account for 1.7% to 7% of pages on the Internet. The problem is more prominent in China's Internet environment, which is still immature: copyright protection mechanisms and Internet users' awareness are not yet well developed, and a great deal of information is plagiarized and reprinted, producing large numbers of duplicate pages. Excessive duplicated information is also one of the main problems encountered in Internet information retrieval.

[0003] Compared with webpage repetition detection, text repeti...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/22, G06F17/27
CPC: G06F40/194, G06F40/247, G06F40/284, G06F40/289, G06F40/30
Inventor: 付晗彭志宇庄郭冕章劲
Owner: 浙江网新恒天软件有限公司