Text characteristic extracting method and document copy detection system based on sentence level

A sentence-level feature extraction method and a subsystem-based detection system, applied in the field of copy detection, address the inability of document-level copy detection methods to detect partially copied text.

Inactive Publication Date: 2012-07-18
FUDAN UNIV
3 Cites, 10 Cited by

AI Technical Summary

Problems solved by technology

When citations appear in articles or news, they are usually only a few sentences or a short paragraph of text, so the overall similarity between the two documents is low. As a result, the document-level copy detection method cannot be used to effectively det

Examples

Embodiment Construction

[0046] Suppose the document set contains two papers, P1 and P2, and the third paragraph of P2 is copied from the second paragraph of P1. This paragraph spans sentences S3-S5 in P1 and S6-S8 in P2. The document reading subsystem splits the input document collection D into the two separate documents P1 and P2; the sentence segmentation subsystem splits each document into a collection of sentences; the feature extraction subsystem converts the sentences from their text representation into a set of feature vectors and adds them to the inverted index; the copy detection subsystem uses the inverted index to perform copy detection and finds the following sentence pairs that are copies of each other: (P1S3, P2S6), (P1S4, P2S7), (P1S5, P2S8); the sequence matching subsystem then arranges these copy pairs and outputs (P1[S3-S5], P2[S6-S8]), that is, the third sentence to the fifth sentence in P1 in the doc...
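The last two stages of this example can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names `find_copy_pairs` and `merge_ranges`, the `min_shared` threshold, and the rule that only strictly consecutive sentence pairs are merged are all assumptions made for illustration. Sentences are keyed as `(document, sentence number)`.

```python
from collections import defaultdict


def find_copy_pairs(features, min_shared=1):
    """Find cross-document sentence pairs that share features.

    features: dict mapping (doc_id, sentence_no) -> set of hashable features.
    min_shared: hypothetical threshold on the number of shared features.
    """
    # Inverted index: feature -> set of sentences that contain it.
    index = defaultdict(set)
    for sent_id, feats in features.items():
        for feat in feats:
            index[feat].add(sent_id)

    # Count shared features for every pair of sentences from different documents.
    shared = defaultdict(int)
    for sent_ids in index.values():
        ordered = sorted(sent_ids)
        for i, a in enumerate(ordered):
            for b in ordered[i + 1:]:
                if a[0] != b[0]:
                    shared[(a, b)] += 1
    return sorted(pair for pair, n in shared.items() if n >= min_shared)


def merge_ranges(pairs):
    """Merge consecutive sentence pairs into (doc, (start, end), doc, (start, end))."""
    ranges = []
    for (d1, s1), (d2, s2) in sorted(pairs):
        last = ranges[-1] if ranges else None
        if (last and last[0] == d1 and last[2] == d2
                and last[1][1] + 1 == s1 and last[3][1] + 1 == s2):
            # Both sides advance by exactly one sentence: extend the current range.
            last[1][1] = s1
            last[3][1] = s2
        else:
            ranges.append([d1, [s1, s1], d2, [s2, s2]])
    return [(d1, tuple(r1), d2, tuple(r2)) for d1, r1, d2, r2 in ranges]


# Toy features standing in for the output of the feature extraction subsystem.
toy_features = {
    ("P1", 3): {"f1"}, ("P1", 4): {"f2"}, ("P1", 5): {"f3"},
    ("P2", 6): {"f1"}, ("P2", 7): {"f2"}, ("P2", 8): {"f3"},
}
toy_pairs = find_copy_pairs(toy_features)
toy_ranges = merge_ranges(toy_pairs)
```

With the toy features above, `toy_pairs` holds the three sentence pairs from the example and `toy_ranges` collapses them into the single range pair P1[S3-S5], P2[S6-S8].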

Abstract

The invention belongs to the technical field of copy detection and particularly relates to a sentence-level text feature extraction method and a sentence-level document copy detection system. The feature extraction method comprises the following steps: selecting a certain number of common words with the lowest inverse document frequency as antecedents, and extracting improved Shingle features to represent the whole sentence. The document copy detection system comprises a document reading subsystem, a sentence segmentation subsystem, a feature extraction subsystem, a copy detection subsystem and a sequence matching subsystem. It can quickly and accurately find document pairs in a document set that contain partial copies, and locate the range of the copied text in each document.
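The abstract does not spell out how the improved Shingle features are built, so the following Python sketch shows only one plausible reading, under stated assumptions: inverse document frequency is computed with each sentence treated as a document, the `k` lowest-IDF words serve as antecedents, and each feature is an antecedent followed by the next `width` tokens. The names `lowest_idf_words` and `anchored_shingles` and the parameters `k` and `width` are hypothetical.

```python
import math
from collections import Counter


def lowest_idf_words(sentences, k):
    """Return the k words with the lowest IDF, i.e. the most widespread words.

    sentences: list of token lists; each sentence counts as one "document"
    for the document-frequency statistics (an assumption of this sketch).
    """
    n = len(sentences)
    df = Counter()
    for words in sentences:
        df.update(set(words))
    idf = {w: math.log(n / count) for w, count in df.items()}
    # Sort by IDF, breaking ties alphabetically so the result is deterministic.
    return set(sorted(idf, key=lambda w: (idf[w], w))[:k])


def anchored_shingles(words, antecedents, width=2):
    """Collect shingles that start at an antecedent and span the next `width` tokens."""
    feats = set()
    for i, w in enumerate(words):
        if w in antecedents:
            shingle = tuple(words[i:i + width + 1])
            if len(shingle) == width + 1:  # skip antecedents too close to the end
                feats.add(shingle)
    return feats


corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "a", "log"],
    ["a", "bird", "flew"],
]
antecedents = lowest_idf_words(corpus, k=2)
sentence_features = anchored_shingles(corpus[0], {"on"}, width=2)
```

Anchoring shingles on very common words keeps the number of features per sentence small while still tying each feature to a position that is likely to survive light editing; whether the patent uses exactly this construction is not stated in the text above.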

Description

Technical field

[0001] The invention belongs to the technical field of copy detection, and in particular relates to a sentence-level text feature extraction method and a document copy detection system.

Background

[0002] With the development of the Internet, the volume of information has grown explosively. Because digital documents are easy to copy, a large number of duplicate web pages and documents appear on the network. This duplicated information places a serious burden on applications built on Web information. Research on copy detection has therefore become a hotspot in the field of information retrieval in recent years.

[0003] Existing research focuses mainly on document-level copy detection, which has achieved good results on ordinary web pages. However, there are still some problems that cannot be solved by ex...

Application Information

IPC(8): G06F17/30, G06F17/22
Inventors: 俞昊旻 (Yu Haomin), 张奇 (Zhang Qi), 黄萱菁 (Huang Xuanjing)
Owner: FUDAN UNIV