Text content duplicate removal method

A text content deduplication technology, applied in the field of text content similarity comparison

Active Publication Date: 2014-08-06
JIANGSU WISEDU INFORMATION TECH

AI Technical Summary

Problems solved by technology

At present, there is no general, efficient text deduplication method that works effectively across different application scenarios.



Embodiment Construction

[0056] The present invention will be described in further detail below in conjunction with the accompanying drawings.

[0057] The inputs of the text content deduplication method of the present invention are the text to be checked for duplication and a text library. By comparing the text to be checked with the texts in the text library, the method judges whether the library contains a text similar to it.

[0058] As shown in Figure 1, the text content deduplication method of the present invention mainly includes three steps: file-level fingerprint detection, text-content-level fingerprint detection, and text-paragraph-level fingerprint detection.
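The three steps can be sketched as a cascade that runs from the cheapest check to the most expensive one. This is only a minimal illustration under stated assumptions, not the patented implementation: the source names neither a hash function, a normalization rule, nor a paragraph match threshold, so SHA-256, whitespace stripping, blank-line paragraph splitting, the 0.8 threshold, and all function names below are hypothetical.

```python
import hashlib
import re


def fingerprint(data: bytes) -> str:
    # Assumption: SHA-256 stands in for the unspecified fingerprint function.
    return hashlib.sha256(data).hexdigest()


def normalize(text: str) -> str:
    # Assumption: removing all whitespace makes the fingerprint ignore layout.
    return re.sub(r"\s+", "", text)


def paragraphs(text: str) -> list[str]:
    # Assumption: paragraphs are separated by blank lines.
    return [normalize(p) for p in re.split(r"\n\s*\n", text) if p.strip()]


def is_duplicate(candidate: str, library: list[str], threshold: float = 0.8) -> bool:
    cand_file = fingerprint(candidate.encode("utf-8"))
    cand_body = fingerprint(normalize(candidate).encode("utf-8"))
    cand_paras = {fingerprint(p.encode("utf-8")) for p in paragraphs(candidate)}

    for doc in library:
        # Step 1: file-level fingerprint (exact same bytes).
        if fingerprint(doc.encode("utf-8")) == cand_file:
            return True
        # Step 2: content-level fingerprint (same content, different layout).
        if fingerprint(normalize(doc).encode("utf-8")) == cand_body:
            return True
        # Step 3: paragraph-level fingerprints (mostly the same paragraphs).
        doc_paras = {fingerprint(p.encode("utf-8")) for p in paragraphs(doc)}
        if cand_paras and len(cand_paras & doc_paras) / len(cand_paras) >= threshold:
            return True
    return False
```

In a real system the library fingerprints would be computed once and stored, so that each check is a lookup rather than a rescan of every library text.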

[0059] File-level fingerprint detection, that is, the aforementioned step S1, judges whether the text to be checked is identical to a text in the text library by comparing file fingerprints. If the file fingerprint of a text in the text library is the same as that of the text to be checked...
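A file-level comparison of this kind might look like the following sketch, which fingerprints files on disk and looks the candidate up in a precomputed index. The hash function (SHA-256), the chunked read, and the names `file_fingerprint`, `build_index`, and `find_identical` are assumptions for illustration; the source does not specify them.

```python
import hashlib
from pathlib import Path
from typing import Optional


def file_fingerprint(path: Path, chunk_size: int = 65536) -> str:
    # Hash the raw bytes in chunks so large files need not fit in memory.
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_index(library_dir: Path) -> dict[str, Path]:
    # Map each library file's fingerprint to its path (hypothetical flat layout).
    return {
        file_fingerprint(p): p
        for p in sorted(library_dir.iterdir())
        if p.is_file()
    }


def find_identical(path: Path, index: dict[str, Path]) -> Optional[Path]:
    # Return the library file with the same fingerprint, or None if absent.
    return index.get(file_fingerprint(path))
```

Because this check compares raw bytes, a single changed character or a different file encoding yields a different fingerprint, which is why the later content-level and paragraph-level steps are needed.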


Abstract

The invention discloses a text content deduplication method. Whether a text to be checked for duplication is the same as a text in a text library is judged by comparing the file fingerprints, main body content fingerprints, and paragraph fingerprints of the texts. The method has low computational overhead, a high duplicate detection rate, and a fast response speed. It can accurately identify duplicates among texts with the same content but different typesetting, as well as among texts whose content differs only slightly. The method has a wide scope of application, including duplicate checking of uploads to a document library, web page processing by web spiders, and plagiarism detection for papers and test papers.

Description

Technical field

[0001] The present invention relates to text content similarity comparison.

Background technique

[0002] With the continuous growth of information of all kinds, network information sharing has brought great convenience to people, but it has also introduced a large amount of reprinted information. Text deduplication is now applied in many scenarios. For search engines, removing duplicate web pages improves search efficiency, reduces the storage space needed for massive data, and improves user experience. In the protection of personal intellectual property rights, text deduplication methods can identify the similarity of file content, and tracking the similarity of scientific and technological documents can reveal plagiarism of papers and patents. In libraries, document deduplication reduces both data storage space and network transmission traffic.

[0003] The task of text...


Application Information

Patent Type & Authority: Applications (China)
IPC(8): G06F17/21, G06F17/30
Inventor: 吴家奇严敏林文荟李海
Owner: JIANGSU WISEDU INFORMATION TECH