Cluster-based text duplicate checking method

A text clustering technology, applied to unstructured text data retrieval, text database clustering/classification, text database browsing/visualization, etc.
CN106446148A · Active · Publication Date: 2017-02-22 · CHINA ACAD OF LAUNCH VEHICLE TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Application (China)
Current Assignee / Owner: CHINA ACAD OF LAUNCH VEHICLE TECH
Publication Date: 2017-02-22


Abstract

The invention discloses a cluster-based text duplicate checking method. The method includes the following steps: 1, data acquisition and processing, in which text data is stored in a database and a file server; 2, preprocessing, in which the text data undergoes word segmentation and feature vector extraction; 3, clustering the preprocessed text data in the database and calculating the center feature vector of each class cluster; 4, primary duplicate checking, in which the feature vector of the text data to be checked is extracted and compared with the center feature vectors of the class clusters in the database, and the class clusters whose center feature vectors lie at a distance smaller than a set threshold are recorded; 5, secondary duplicate checking, in which the feature vector of the text data is compared with the feature vectors of the text data in the recorded class clusters, and the text data whose feature vectors lie at a distance smaller than a certain threshold is recorded as duplicated text data, thereby completing the duplicate check. The method reduces unnecessary repeated comparison work and improves the efficiency of text duplicate checking.
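
The two-stage comparison in steps 3 to 5 can be illustrated with a minimal sketch. The abstract does not specify the segmentation tool, the feature weighting, the clustering algorithm, or the distance measure, so everything below is an assumption made for illustration: whitespace splitting stands in for word segmentation, raw term counts over a fixed vocabulary stand in for the feature vector, cluster membership is taken as already computed, and distance is Euclidean. The function names (`feature_vector`, `centroid`, `check_duplicates`) and both threshold values are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: whitespace splitting stands in for real word segmentation.
    return text.lower().split()


def feature_vector(text, vocabulary):
    # Assumption: raw term counts over a fixed vocabulary as the feature vector.
    counts = Counter(tokenize(text))
    return [float(counts[term]) for term in vocabulary]


def distance(a, b):
    # Assumption: Euclidean distance; the abstract only says "distance".
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def centroid(vectors):
    # Center feature vector of a class cluster: the component-wise mean.
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def check_duplicates(query_vec, clusters, center_threshold, doc_threshold):
    """clusters maps cluster id -> {document id -> feature vector}."""
    centers = {cid: centroid(list(members.values()))
               for cid, members in clusters.items()}

    # Primary check: keep only clusters whose center is close to the query.
    candidates = [cid for cid, center in centers.items()
                  if distance(query_vec, center) < center_threshold]

    # Secondary check: compare against documents in the candidate clusters only.
    duplicates = [doc_id
                  for cid in candidates
                  for doc_id, vec in clusters[cid].items()
                  if distance(query_vec, vec) < doc_threshold]
    return candidates, duplicates


# Hypothetical toy data; cluster membership is assumed to be precomputed.
vocabulary = ["rocket", "engine", "test", "fabric", "yarn"]
clusters = {
    "A": {"d1": feature_vector("rocket engine test", vocabulary),
          "d2": feature_vector("rocket engine engine test", vocabulary)},
    "B": {"d3": feature_vector("fabric yarn yarn", vocabulary),
          "d4": feature_vector("fabric yarn", vocabulary)},
}
query = feature_vector("rocket engine test", vocabulary)
candidates, duplicates = check_duplicates(query, clusters, 1.0, 0.5)
print(candidates, duplicates)
```

In this toy run the query is compared with two cluster centers and then with only the two documents of the nearby cluster, rather than with all four stored documents, which is the saving the abstract attributes to the method.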

Description

Technical field

[0001] The invention relates to the technical field of text data analysis and mining, and in particular to a clustering-based text plagiarism checking method.

Background art

[0002] In recent years, with the frequent occurrence of academic fraud and growing calls for intellectual property protection, text plagiarism checking technology has become a research hotspot among experts and scholars. Scholars at home and abroad have proposed a number of text plagiarism checking methods, which fall mainly into the following categories:

[0003] 1. Text plagiarism checking method based on the sememe space of HowNet.

[0004] In this method, the text is first segmented into words, and the resulting words are then further divided into smaller semantic units called "sememes". "HowNet" is based on sememes, and uses a formalized language (similar to ontology description language) to organize sememes t...
