
Method for determining similar character strings, method and system for file duplication checking

A technique for determining similar character strings and checking files for duplication, applied in the field of paper duplication checking. It addresses problems such as inconsistent subject-verb-object order, high algorithm complexity, and long computation time, and improves the accuracy of duplicate checking.

Active Publication Date: 2020-08-04
NORTH CHINA UNIVERSITY OF TECHNOLOGY

AI Technical Summary

Problems solved by technology

[0003] The string matching algorithm treats a passage as repeated only when its text is completely identical. However, because of the complexity of the Chinese language and the diversity of its expressions, two passages with the same substantive content often differ by meaningless insertions such as "stop words" or function words, or by a different subject-verb-object order, and are then wrongly judged not to be repeated content. Adopting the prior-art string matching algorithm may therefore result in recall and precision rates that are not high.
Moreover, the string matching algorithm places strict requirements on the selection of strings, the complexity of the algorithm itself is relatively high, and it requires relatively large resource overhead and long computing time. The efficiency of duplicate checking is therefore low.
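
The limitation described above can be illustrated with a minimal sketch. The function names `exact_repeat` and `token_overlap` and the two English sentences are hypothetical, chosen only for illustration; the token-overlap ratio is a simple stand-in to show that the content is shared, not the method of the invention.

```python
def exact_repeat(sample: str, target: str) -> bool:
    """Strict string matching: the target passage must appear verbatim."""
    return target in sample


def token_overlap(sample: str, target: str) -> float:
    """Share of the target's distinct tokens that also occur in the sample.

    Illustrative measure only (an assumption, not taken from the patent).
    """
    s, t = set(sample.split()), set(target.split())
    return len(s & t) / len(t) if t else 0.0


# Same substance, but with reordered words and one inserted filler word.
sample = "the system checks each paper for repeated text"
target = "each paper the system then checks for repeated text"

print(exact_repeat(sample, target))   # strict matching misses the repetition
print(token_overlap(sample, target))  # most target tokens occur in the sample
```

Strict matching reports no repetition here even though eight of the nine distinct target tokens appear in the sample, which is the kind of miss that motivates a fuzzy, word-segmentation-based comparison.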



Examples


Embodiment Construction

[0033] In order to have a clearer understanding of the technical features, purposes and effects of the present invention, the process of determining similar character strings based on the fuzzy matching method proposed by the present invention will be further described in detail with reference to the accompanying drawings.

[0034] Figure 1 shows a schematic flowchart of a method for determining similar character strings according to an embodiment of the present invention. The method specifically includes the following steps:

[0035] 1) Step S110: acquire the character array of the sample file and the character array of the target file to be detected.

[0036] In this specification, a file to be detected is called a target file, and a file to be compared with the target file is called a sample file. The files may be of various types, for example PDF files, Word files, or plain-text files.

[0037] The character array of the file is obtained by word-segmenting the text content of the file. The s...
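
As a rough sketch of how a file's text might be turned into a character array, the following uses a simple regular-expression tokenizer. The function name `to_token_array` is hypothetical, and treating each Chinese character or each run of Latin letters and digits as one element is an assumption standing in for a real word-segmentation tool, which the source does not name.

```python
import re


def to_token_array(text: str) -> list:
    """Split text into a token array.

    Assumption for illustration: each CJK character, or each run of Latin
    letters/digits, becomes one element; punctuation and spaces are dropped.
    A real Chinese word segmenter would emit multi-character words instead.
    """
    return re.findall(r"[\u4e00-\u9fff]|[A-Za-z0-9]+", text.lower())


print(to_token_array("Paper checking, v2!"))
print(to_token_array("论文查重"))
```

The resulting arrays for the sample file and the target file are the inputs that the later steps compare.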


Abstract

The invention provides a method for determining similar character strings. The method comprises: obtaining the character array of a sample file and the character array of a target file to be detected; constructing a matrix M whose rows and columns correspond to the character array of the sample file and the character array of the target file, respectively; and searching the matrix M for a sub-square matrix that satisfies the similar-character-string condition. The condition is as follows: if, for the elements of the sub-square matrix at positions (1, j1), (2, j2), ..., (k, jk), the character strings of the corresponding rows and columns are the same, then the character strings mapped by those elements are determined to be similar character strings, where k is the order of the sub-square matrix and j1, j2, ..., jk is a permutation of 1, 2, ..., k. The method can improve the recall rate and the precision rate of file duplication checking.
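
The matrix search in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: M is taken to be a 0/1 matrix with a 1 where the row token equals the column token, the sub-square matrix is assumed to cover k consecutive rows and k consecutive columns, and the permutation j1..jk is found by brute force, which is only practical for small k. The names `build_matrix` and `find_similar_windows` are hypothetical.

```python
from itertools import permutations


def build_matrix(sample, target):
    """M[i][j] is 1 when sample token i equals target token j, else 0."""
    return [[1 if s == t else 0 for t in target] for s in sample]


def find_similar_windows(sample, target, k):
    """Return (row_start, col_start) for every contiguous k x k sub-square
    matrix of M that has a 1 in each row at pairwise distinct columns."""
    m = build_matrix(sample, target)
    hits = []
    for r in range(len(sample) - k + 1):
        for c in range(len(target) - k + 1):
            # Try every arrangement j1..jk of the k columns (small k only).
            if any(all(m[r + i][c + p[i]] for i in range(k))
                   for p in permutations(range(k))):
                hits.append((r, c))
    return hits


sample = ["the", "cat", "chased", "the", "mouse"]
target = ["yesterday", "chased", "the", "cat", "the", "mouse"]
print(find_similar_windows(sample, target, 3))
```

In this example the sample windows starting at rows 0 and 1 both pair with the target window starting at column 1, because each contains the same three tokens in a different order. A reordered passage is therefore still reported as similar, which strict string matching would miss.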

Description

Technical field

[0001] The invention relates to the technical field of plagiarism checking of papers, in particular to a method and system for document plagiarism checking based on word-segmentation fuzzy matching.

Background technique

[0002] At present, paper / file duplication rate detection mainly uses paper detection systems such as PaperPass, Wanfang, and HowNet, which use string matching algorithms to calculate the similarity ratio of the file to be detected relative to the files in the file library.

[0003] The string matching algorithm treats a passage as repeated only when its text is completely identical. However, because of the complexity of the Chinese language and the diversity of its expressions, two passages with the same substantive content often differ by meaningless insertions such as "stop words" or function words, or by a different subject-verb-object order, and it is wrongly judged as not belo...


Application Information

Patent Type & Authority: Patents (China)
IPC(8): G06F16/33, G06F40/279
CPC: G06F16/3344, G06F40/279
Inventors: 杨冬菊, 赵卓峰, 李成龙, 冯凯, 邓崇彬
Owner: NORTH CHINA UNIVERSITY OF TECHNOLOGY