Cluster-based text duplicate checking method

A clustering-based text duplicate checking technology, applicable to unstructured text data retrieval, text database clustering/classification, text database browsing/visualization, and related fields.

Active Publication Date: 2017-02-22
CHINA ACAD OF LAUNCH VEHICLE TECH
5 Cites, 42 Cited by

AI Technical Summary

Problems solved by technology

[0013] Based on the above analysis, current text plagiarism checking technology has many deficiencies. In particular, there is considerable room to improve the efficiency of plagiarism checking.


Embodiment

[0123] This embodiment applies the clustering-based text duplicate checking method in a user-oriented information search engine system made up of servers and clients. The database server uses a Xeon 2.8 dual-core processor, 16 GB of memory, and a 2 TB hard disk. It stores all data and is also configured with a tape library and backup software for backing up and restoring historical data. The application server runs the Linux operating system with Oracle 11g or later as its data management software. It performs data acquisition and processing, preprocessing, clustering, primary duplicate checking, secondary duplicate checking, and visual display, and is responsible for back-end analysis and processing of the data transmitted by the client. The client host uses a 3.7 GHz CPU, 8 GB of memory, and a 2 TB hard disk, and runs Windows 8 / 7 / XP. It interacts with the server in B/S (browser/server) mode, and its main function is front-end display.

...


Abstract

The invention discloses a cluster-based text duplicate checking method. The method includes the following steps: (1) data acquisition and processing: text data is stored in a database and a file server; (2) preprocessing: the text data undergoes word segmentation and feature vector extraction; (3) clustering: the preprocessed text data in the database is clustered, and the center feature vector of each cluster is calculated; (4) primary duplicate checking: the feature vector of the text to be checked is extracted and compared with the center vectors of the clusters in the database, and the clusters whose center feature vectors lie at a distance smaller than a set threshold are recorded; (5) secondary duplicate checking: the feature vector of the text to be checked is compared with the feature vectors of the text data in the recorded clusters, and the text data whose feature vectors lie at a distance smaller than a set threshold are recorded as duplicated text data. Because the text is compared only against the members of nearby clusters rather than against every stored text, the method reduces unnecessary comparisons and improves duplicate checking efficiency.
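The five steps in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the abstract does not name a segmentation method, feature weighting, clustering algorithm, or distance measure, so everything concrete here is an assumption made for illustration. Specifically, whitespace splitting stands in for word segmentation, L2-normalized term frequencies stand in for the feature vectors, a plain k-means seeded with the first k vectors stands in for the clustering step, Euclidean distance is the comparison, and the names `features`, `kmeans`, `check_duplicates` and both threshold values are hypothetical.

```python
import math
from collections import Counter

def features(text, vocab):
    """Step 2 (assumed): whitespace segmentation, L2-normalized term frequencies."""
    counts = Counter(text.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def distance(a, b):
    """Euclidean distance (assumed; the abstract only says 'distance')."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(vectors):
    """Center feature vector of one cluster: the component-wise mean."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def kmeans(vectors, k, iterations=10):
    """Step 3 (assumed): plain k-means seeded with the first k vectors."""
    centers = [list(v) for v in vectors[:k]]
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for idx, v in enumerate(vectors):
            nearest = min(range(k), key=lambda c: distance(v, centers[c]))
            clusters[nearest].append(idx)
        # Keep the old center if a cluster ends up empty.
        centers = [centroid([vectors[i] for i in members]) if members else centers[c]
                   for c, members in enumerate(clusters)]
    return centers, clusters

def check_duplicates(query, vectors, centers, clusters, center_threshold, doc_threshold):
    """Steps 4 and 5: compare with cluster centers first, then only with
    the members of the clusters whose centers are close enough."""
    candidates = [c for c, center in enumerate(centers)
                  if distance(query, center) < center_threshold]
    return sorted(i for c in candidates for i in clusters[c]
                  if distance(query, vectors[i]) < doc_threshold)

# Step 1 stand-in: an in-memory list instead of a database and file server.
corpus = [
    "rocket engine thrust test report",
    "text clustering for document retrieval",
    "rocket engine thrust test summary",
    "document retrieval with text clustering",
]
vocab = sorted({w for doc in corpus for w in doc.lower().split()})
vectors = [features(doc, vocab) for doc in corpus]
centers, clusters = kmeans(vectors, k=2)

query = features("rocket engine thrust test report draft", vocab)
# Hypothetical thresholds, chosen only for this toy corpus.
duplicates = check_duplicates(query, vectors, centers, clusters,
                              center_threshold=0.8, doc_threshold=0.5)
print(clusters, duplicates)
```

In this toy run the query is compared with two cluster centers and then with the two members of one cluster, rather than with all four stored texts. The saving grows with the size of the database, which is the efficiency argument the abstract makes.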

Description

Technical field

[0001] The invention relates to the technical field of text data analysis and mining, and in particular to a clustering-based text plagiarism checking method.

Background art

[0002] In recent years, with frequent incidents of academic fraud and growing calls for intellectual property protection, text plagiarism checking technology has gradually become a research hotspot among experts and scholars. Scholars at home and abroad have proposed a number of text plagiarism checking methods, which fall mainly into the following categories:

[0003] 1. Text plagiarism checking method based on the sememe space of HowNet.

[0004] In this method, the text is first segmented into words, and the resulting words are then further divided into smaller semantic units called "sememes". "HowNet" is based on sememes, and uses a formalized language (similar to ontology description language) to organize sememes t...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/30
CPC: G06F16/3349, G06F16/34, G06F16/35
Inventor: 贾倩, 王立伟, 王彦静, 杜俊鹏, 姜悦, 杨玉堃, 张冶, 郭大庆, 池元成, 张丽晔, 许怡婷, 康磊晶
Owner: CHINA ACAD OF LAUNCH VEHICLE TECH