Rearrangement method and system based on document similarity

A document-similarity technique, applied to text similarity calculation and detection, that addresses problems of existing methods such as being unable to process East Asian languages (and so being unusable for them) and having a narrow scope of application.

Status: Inactive
Publication date: 2014-12-17
Owner: HUAZHONG UNIV OF SCI & TECH
Cites: 5; cited by: 12

AI Technical Summary

Problems solved by technology

[0007] Existing methods for detecting file similarity mainly target Western languages (such as English). Because the segmentation of phrases in Chinese is completely different from English when segmenti...



Embodiment Construction

[0046] The present invention will be described in further detail below in conjunction with the accompanying drawings and embodiments.

[0047] The deduplication method based on file similarity in the embodiment of the present invention rests on the following three basic assumptions:

[0048] (1) Judgment of document similarity by text content: When analyzing and determining document similarity, only the text content in the document is considered and non-text content is ignored.

[0049] (2) Judging the similarity of documents through basic units: Within the text content of a document, sentences are used as the basic units for calculating document similarity; that is, the more "similar" basic units two documents share, the higher their relative similarity. Further, the more basic units in one document that are similar to basic units in a collection of other documents, the higher the similarity between that document and the collection.

[005...
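As a rough illustration of assumption (2), the sketch below scores a document by the share of its sentences that have a similar counterpart in a collection. The function names, the 0.8 threshold, the sentence-splitting rule, and the character-level `SequenceMatcher` comparison are all assumptions made for this example; the patent's own sentence similarity, which is built from keyword sequences, is not reproduced here.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text):
    """Split plain text into sentence units on common Chinese and Western terminators."""
    parts = re.split(r"[。！？!?.;；\n]+", text)
    return [p.strip() for p in parts if p.strip()]


def sentence_similarity(a, b):
    """Stand-in sentence similarity in [0, 1] (character-level ratio, an assumption)."""
    return SequenceMatcher(None, a, b).ratio()


def document_similarity(doc, collection, threshold=0.8):
    """Fraction of sentences in `doc` that have a similar sentence in `collection`."""
    units = split_sentences(doc)
    if not units:
        return 0.0
    # Pool every sentence unit from the other documents.
    pool = [s for other in collection for s in split_sentences(other)]
    matched = sum(
        1
        for u in units
        if any(sentence_similarity(u, s) >= threshold for s in pool)
    )
    return matched / len(units)
```

With this sketch, a document in which one of two sentences also appears in the collection scores 0.5. Any other sentence-level measure could be swapped in for `sentence_similarity` without changing the aggregation step.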


Abstract

The invention discloses a rearrangement method and system based on document similarity, and relates to the field of text similarity calculation and detection. The method comprises the following steps: extracting the documents to be compared and generating plain text; normalizing the plain text to generate normalized text units; encoding the normalized text units with an encoding algorithm to generate irreversible, fixed-length representative codes; extracting keywords from the representative codes of the files to be compared to generate keyword sequences; calculating the word form similarity and word sequence similarity of the sentences to be compared from their keyword sequences; calculating the similarity of the sentences from their word form similarity and word sequence similarity; and calculating the similarity of the documents from the similarity of their sentences. The method and system are suitable for Chinese text, are convenient for users in China, and achieve higher precision when comparing similar documents.
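The abstract does not give formulas, so the following is only a minimal sketch of how some of the steps might look. Everything in it is an assumption for illustration: the function names, NFKC folding as the normalization, SHA-256 as the irreversible fixed-length encoding, a Dice-style overlap for word form similarity, an inversion-based measure for word sequence similarity (treated here as word order), and the 0.7 weight. Keyword sequences are taken as ready-made lists, since the extraction step is not specified.

```python
import hashlib
import unicodedata
from collections import Counter
from itertools import combinations


def normalize(text):
    """Fold full-width variants, lowercase, and drop punctuation and whitespace."""
    folded = unicodedata.normalize("NFKC", text).lower()
    return "".join(ch for ch in folded if ch.isalnum())


def representative_code(unit):
    """Fixed-length one-way code for a normalized text unit (SHA-256 is an assumed choice)."""
    return hashlib.sha256(unit.encode("utf-8")).hexdigest()


def word_form_similarity(a, b):
    """Dice-style overlap of two keyword sequences, counting repeated keywords."""
    if not a or not b:
        return 0.0
    common = sum((Counter(a) & Counter(b)).values())
    return 2 * common / (len(a) + len(b))


def word_sequence_similarity(a, b):
    """1 minus the fraction of shared-keyword pairs whose relative order differs."""
    # Only keywords that occur exactly once in both sequences have a clear position.
    shared = [w for w in a if a.count(w) == 1 and b.count(w) == 1]
    if not shared:
        return 0.0
    if len(shared) == 1:
        return 1.0
    positions = [b.index(w) for w in shared]
    pairs = list(combinations(positions, 2))
    inversions = sum(1 for i, j in pairs if i > j)
    return 1 - inversions / len(pairs)


def keyword_sentence_similarity(a, b, form_weight=0.7):
    """Weighted mix of word form and word sequence similarity (weight is an assumption)."""
    return (form_weight * word_form_similarity(a, b)
            + (1 - form_weight) * word_sequence_similarity(a, b))
```

Under these assumptions, two sentences with the same keywords in reversed order keep full word form similarity but lose all word sequence similarity, so the weight controls how strongly reordering is penalized.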

Description

Technical field

[0001] The invention relates to the field of calculation and detection of text similarity, in particular to a method and system for deduplication based on document similarity.

Background technique

[0002] The file similarity calculation method is a method of analyzing and calculating the file similarity by using the file's own information (file content and connection information). With the progress of the times, the calculation method of file similarity has been widely used in various fields (such as information retrieval, collaborative recommendation system, library classification system and other related fields).

[0003] Existing methods for detecting file similarity generally include the following steps:

[0004] (1) After basic simplification of each file in the submitted file collection, each file is divided into continuous marking blocks; a certain number of representative marking blocks are reserved in the marking blocks; representative marking bloc...


Application Information

IPC(8): G06F17/30
CPC: G06F16/122, G06F16/113, G06F16/1748
Inventor: 易乔治管晏
Owner: HUAZHONG UNIV OF SCI & TECH