Statistical method and statistical system of text similarity

A statistical method for text similarity, applied in computing, special data processing applications, instruments, etc., which addresses the difficulty of accurately reflecting the degree of similarity between texts.

Active Publication Date: 2013-06-26
南方电网互联网服务有限公司

AI Technical Summary

Problems solved by technology

[0004] Traditional statistical methods for text similarity have difficulty accurately reflecting the similarity between texts whose word and sentence order has been deliberately disrupted. To solve this problem, a statistical method for text similarity is provided that accurately reflects the similarity between such texts.



Examples


Embodiment 1

[0020] Figure 1 is a flowchart of a statistical method for text similarity in an embodiment, which includes the following steps:

[0021] S110. Acquire text T1 and text T2 for which similarity needs to be determined.

[0022] S120. Divide text T1 and text T2 into natural paragraphs, compare every natural paragraph in text T1 with every natural paragraph in text T2, and record the number of identical natural paragraphs as k3.

[0023] In this embodiment, the number of natural paragraphs in text T1 is recorded as k1 and the number in text T2 as k2. For i from 1 to k1 and j from 1 to k2, paragraph i of text T1 is compared with paragraph j of text T2, and the number of identical natural paragraphs is recorded as k3.

[0024] S130. Delete the identical natural paragraphs from text T1 and text T2; deleting them from text T1 yields text T3, and deleting them from text T2 yields text T4.

[0025]...
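
Steps S110 to S130 can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: it assumes that natural paragraphs are separated by line breaks and that repeated paragraphs are matched one-to-one (multiset matching), neither of which the text above specifies. The helper names `split_paragraphs` and `match_and_remove` are hypothetical.

```python
from collections import Counter


def split_paragraphs(text):
    """Split text into non-empty natural paragraphs (assumed: one per line)."""
    return [p.strip() for p in text.split("\n") if p.strip()]


def match_and_remove(segs1, segs2):
    """Count identical segments regardless of order, then delete them from both lists.

    Returns (k3, remaining segments of text 1, remaining segments of text 2).
    """
    # Multiset intersection: each repeated segment is matched at most once per copy.
    common = Counter(segs1) & Counter(segs2)
    k3 = sum(common.values())

    def remove(segs):
        budget = Counter(common)  # how many copies of each segment to delete
        rest = []
        for s in segs:
            if budget[s] > 0:
                budget[s] -= 1
            else:
                rest.append(s)
        return rest

    return k3, remove(segs1), remove(segs2)


t1 = "alpha\nbeta\ngamma"
t2 = "gamma\ndelta\nalpha"
k3, t3, t4 = match_and_remove(split_paragraphs(t1), split_paragraphs(t2))
```

Because the comparison ignores position, the two paragraphs shared by `t1` and `t2` are found even though their order differs.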

Embodiment 2

[0053] S210. Acquire text T1 and text T2 for which similarity needs to be determined.

[0054] S220. Divide text T1 and text T2 into natural paragraphs, compare every natural paragraph in text T1 with every natural paragraph in text T2, and record the number of identical natural paragraphs as k3.

[0055] In this embodiment, the number of natural paragraphs in text T1 is recorded as k1 and the number in text T2 as k2. For i from 1 to k1 and j from 1 to k2, paragraph i of text T1 is compared with paragraph j of text T2, and the number of identical natural paragraphs is recorded as k3.

[0056] S230. Delete the identical natural paragraphs from text T1 and text T2; deleting them from text T1 yields text T3, and deleting them from text T2 yields text T4.

[0057] S240. Separate the text T3 and the text T4 into several words, compare all the words in the text T3 with all the words in the text...
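
The word-level comparison that step S240 begins to describe might look like the sketch below. The step is truncated in this listing, so only the comparison itself is shown. The sketch assumes whitespace-separated words (Chinese text would need a word segmenter, which is not shown) and one-to-one matching of repeated words. The function name `count_identical_words` is hypothetical.

```python
from collections import Counter


def count_identical_words(t3, t4):
    """Return (number of identical words, word count of T3, word count of T4)."""
    words3 = t3.split()  # assumption: words are separated by whitespace
    words4 = t4.split()
    # Order-independent match; a repeated word counts once per matched copy.
    common = Counter(words3) & Counter(words4)
    return sum(common.values()), len(words3), len(words4)


identical, n3, n4 = count_identical_words("the quick brown fox", "fox jumps the fence")
```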

Embodiment 3

[0069] S310. Acquire text T1 and text T2 for which similarity needs to be determined.

[0070] S320. Divide text T1 and text T2 into sentences, compare every sentence in text T1 with every sentence in text T2, and record the number of identical sentences as k3.

[0071] In this embodiment, the number of sentences in text T1 is recorded as k1 and the number in text T2 as k2. For i from 1 to k1 and j from 1 to k2, the i-th sentence of text T1 is compared with the j-th sentence of text T2, and the number of identical sentences is recorded as k3.

[0072] S330. Delete the identical sentences from text T1 and text T2; deleting them from text T1 yields text T3, and deleting them from text T2 yields text T4.

[0073] S340. Separate the text T3 and the text T4 into several words, compare all the words in the text T3 with all the words in the text T4, and record the number of identical words as k6....
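
A possible Python sketch of the sentence-level steps S320 and S330 follows. The embodiment does not say how sentences are delimited, so this sketch assumes a split after Western or Chinese end-of-sentence punctuation, which will mis-split abbreviations and decimal numbers. The names `split_sentences` and `compare_sentences` are hypothetical.

```python
import re
from collections import Counter

# Assumption: a sentence ends at one of these punctuation marks.
SENTENCE_END = re.compile(r"(?<=[.!?。！？])\s*")


def split_sentences(text):
    """Split text into sentences, keeping the closing punctuation."""
    return [s.strip() for s in SENTENCE_END.split(text) if s.strip()]


def compare_sentences(t1, t2):
    """Return (k1, k2, k3, T3, T4) for the sentence-level comparison."""
    s1, s2 = split_sentences(t1), split_sentences(t2)
    common = Counter(s1) & Counter(s2)  # identical sentences, order ignored
    k1, k2, k3 = len(s1), len(s2), sum(common.values())

    def remaining(sentences):
        budget = Counter(common)
        kept = []
        for s in sentences:
            if budget[s] > 0:
                budget[s] -= 1  # delete one matched copy
            else:
                kept.append(s)
        return " ".join(kept)

    return k1, k2, k3, remaining(s1), remaining(s2)


result = compare_sentences(
    "Cats sleep. Dogs bark. Birds sing.",
    "Birds sing. Fish swim. Cats sleep.",
)
```

The remaining texts T3 and T4 returned here are the inputs to the word-level step S340.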


Abstract

The invention discloses a statistical method for text similarity. The method comprises the following steps: obtaining a first text and a second text whose similarity is to be determined; dividing the first text and the second text into a plurality of text segments at a first dividing scale, and calculating the proportion of the number of identical text segments in the first text and the second text to the total number of text segments of the first text at the first dividing scale; deleting the identical text segments from the first text and the second text to obtain a first remaining text and a second remaining text, respectively; dividing the first remaining text and the second remaining text into a plurality of text segments at a second dividing scale, and calculating the proportion of the number of identical text segments in the first remaining text and the second remaining text to the total number of text segments of the first remaining text at the second dividing scale; and calculating the comprehensive text similarity of the first text and the second text. The method accurately reflects the degree of similarity between texts whose word and sentence order has been deliberately disrupted, and can detect similar texts whose word order, sentence order, or paragraph order has been scrambled on purpose.
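
The two-scale procedure in the abstract can be sketched end to end as below. Two points are assumptions rather than statements from the source: the first dividing scale is taken to be sentences split on "." and the second to be whitespace-separated words, and the two proportions are combined as `p1 + (1 - p1) * p2`. The abstract does not state the combination formula, so that line is only a placeholder. The names `shared_ratio` and `comprehensive_similarity` are hypothetical.

```python
from collections import Counter


def shared_ratio(segs1, segs2):
    """Return (share of segs1 also found in segs2, leftovers of segs1, leftovers of segs2)."""
    common = Counter(segs1) & Counter(segs2)
    ratio = sum(common.values()) / len(segs1) if segs1 else 0.0

    def leftovers(segs):
        budget = Counter(common)
        rest = []
        for s in segs:
            if budget[s] > 0:
                budget[s] -= 1
            else:
                rest.append(s)
        return rest

    return ratio, leftovers(segs1), leftovers(segs2)


def comprehensive_similarity(text1, text2):
    # First dividing scale (assumed): sentences, split on ".".
    sents1 = [s.strip() for s in text1.split(".") if s.strip()]
    sents2 = [s.strip() for s in text2.split(".") if s.strip()]
    p1, rest1, rest2 = shared_ratio(sents1, sents2)

    # Second dividing scale (assumed): whitespace-separated words of the remaining texts.
    words1 = " ".join(rest1).split()
    words2 = " ".join(rest2).split()
    p2, _, _ = shared_ratio(words1, words2)

    # ASSUMPTION: the source does not give this formula. Here the word-level
    # proportion is weighted by the share of the first text left after deletion.
    return p1 + (1.0 - p1) * p2


score = comprehensive_similarity(
    "Cats sleep all day. Dogs bark loudly.",
    "Dogs bark loudly. Cats nap all day.",
)
```

In this example one of the two sentences matches exactly, and three of the four remaining words match, so both scales contribute to the score.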

Description

Technical field

[0001] The invention relates to text processing, in particular to a statistical method for text similarity and a statistical system for text similarity.

Background

[0002] In the prior art, the similarity between two texts is generally judged by segmenting both texts and then finding, in order, the strings of words and phrases repeated in the two texts.

[0003] However, if the order of words and sentences in a text is deliberately disrupted, then even when the texts are essentially similar (for example, plagiarized), the similarity obtained by the existing statistical method is low and does not reflect the actual degree of similarity between the texts.

Contents of the invention

[0004] Based on this, in order to solve the problem that the traditional text similarity statistical method is difficult to accurately reflect the similarity between texts whose order of words and sentences has been artificially disrupted, it is necessary to prov...
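
The limitation described in [0002] and [0003] can be illustrated with a small experiment. In this sketch, Python's standard `difflib.SequenceMatcher` stands in for an order-based comparison; it is not the prior-art method the patent refers to, only a convenient example of an order-sensitive measure. The order-independent count uses a multiset intersection, which is likewise an assumption.

```python
from collections import Counter
from difflib import SequenceMatcher

original = ["first sentence", "second sentence", "third sentence", "fourth sentence"]
shuffled = list(reversed(original))  # same sentences, order deliberately disrupted

# Order-sensitive score: only sentences that line up in sequence count as matches.
ordered_score = SequenceMatcher(None, original, shuffled).ratio()

# Order-independent score: share of sentences that appear in both texts.
common = Counter(original) & Counter(shuffled)
unordered_score = sum(common.values()) / len(original)

print(f"order-sensitive: {ordered_score:.2f}, order-independent: {unordered_score:.2f}")
```

The reordered text scores low on the order-sensitive measure although every sentence is shared, which is the gap the described method aims to close.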


Application Information

Patent Type & Authority: Applications (China)
IPC: IPC(8): G06F17/27
Inventor: 朱定局
Owner: 南方电网互联网服务有限公司