Text approximately duplicated detection method based on semantic analysis and multiple Simhash

A near-duplicate text detection technique that combines semantic analysis with multiple SimHash fingerprints, aimed at approximate-repetition cases that general duplicate detection handles poorly.

Active Publication Date: 2017-05-10

浙江网新恒天软件有限公司

Cites: 4; Cited by: 13


Problems solved by technology

[0004] However, most current algorithms target almost identical articles and perform only moderately on near-duplicate cases such as word replacement, sentence restructuring, or rewriting that preserves the meaning of the original text.

Therefore, they have certain limitations in practical application.


Examples


Embodiment

[0047] In this embodiment, 2162 IT news articles are used as the original texts, and the specific implementation is as follows:

[0048] (1) Preprocess the Chinese articles: as shown in Figure 1, the original text is segmented into words and stop words are removed; the remaining words are content words.

[0049] (2) Select keywords: as shown in Figure 2, the weight of each content word is calculated from its TF-IDF value, part of speech, word length, and the position where the word first appears, and the 10 words with the highest weights are selected as keywords. The weight of a word is calculated by the following formula:

[0050]

[0051] The parameters are selected as follows:

[0052] If w_i is a noun, pro_i is 0.6; if w_i is an adjective, pro_i is 0.4; if w_i is a verb, pro_i is 0.3; if w_i is a word of any other part of speech, pro_i is 0.1;

[0053] The TF-IDF weight λ_1 is 0.8;

[0054] The part-of-speech weight λ_2 is ...
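Steps (1) and (2) above can be sketched roughly as follows. This is only an illustration under stated assumptions: the weight formula and the remaining parameters are not reproduced in this text, so the word weight is passed in as a caller-supplied function rather than computed here. The function names, the part-of-speech labels, and the assumption that the text has already been split into tokens by some word segmenter are all hypothetical. Only the part-of-speech scores (0.6, 0.4, 0.3, 0.1) and the choice of 10 keywords come from the embodiment.

```python
from typing import Callable, Iterable, List, Set

# Part-of-speech scores pro_i as listed in paragraph [0052].
POS_SCORE = {"noun": 0.6, "adjective": 0.4, "verb": 0.3}
OTHER_POS_SCORE = 0.1  # any other part of speech


def pos_score(pos: str) -> float:
    """Return pro_i for a part-of-speech label (labels are hypothetical)."""
    return POS_SCORE.get(pos, OTHER_POS_SCORE)


def content_words(tokens: Iterable[str], stop_words: Set[str]) -> List[str]:
    """Step (1): drop stop words from already-segmented tokens."""
    return [t for t in tokens if t not in stop_words]


def top_keywords(words: List[str],
                 weight: Callable[[str], float],
                 k: int = 10) -> List[str]:
    """Step (2): keep the k distinct words with the highest weight.

    `weight` stands in for the formula that is not shown in the text.
    Ties keep the order of first appearance because sorted() is stable.
    """
    distinct = list(dict.fromkeys(words))
    return sorted(distinct, key=weight, reverse=True)[:k]
```

For example, with a toy weight that uses only the part-of-speech score, `top_keywords(words, lambda w: pos_score(pos_of[w]), k=2)` ranks nouns ahead of adjectives; a real weight would also fold in TF-IDF, word length, and first-occurrence position as the embodiment describes.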


Abstract

The invention discloses a near-duplicate text detection method based on semantic analysis and multiple SimHash. Grammatical, syntactic, semantic, and other information is introduced. First, the importance of each word is assessed from multiple factors to obtain the keywords of an article. Second, a certain number of words before and after each keyword are extracted as article features, and a SimHash is computed from these features. To handle semantics, synonyms are eliminated by synonym replacement. Because this replacement is aggressive and reduces accuracy to some degree, a double SimHash method is used, in which the aggressive computation is applied only within a certain range. Results show that the improved detection method performs better than the traditional SimHash method, the traditional Shingling method, and similar approaches.

Description

Technical field

[0001] The invention relates to a text duplication detection method, in particular to a near-duplicate text detection method based on semantic analysis and multiple SimHash.

Background

[0002] In the field of data mining and knowledge discovery, one of the major challenges that comes with the explosion of large-scale data is excessive duplicate information. Studies outside China have shown that duplicate web pages make up 1.7% to 7% of the Internet. The problem is more prominent in China's Internet environment, which is still immature: copyright protection mechanisms and Internet users' awareness are not yet well developed, and a great deal of information is plagiarized and reprinted in bulk, producing duplicate pages. Excessive duplicate information is also one of the main problems encountered in Internet information retrieval.

[0003] Compared with webpage repetition detection, text repeti...


Application Information


Patent Type & Authority: Applications (China)

IPC(8): G06F17/22, G06F17/27

CPC: G06F40/194, G06F40/247, G06F40/284, G06F40/289, G06F40/30

Inventor: 付晗, 彭志宇, 庄郭冕, 章劲

Owner: 浙江网新恒天软件有限公司
