An article plagiarism checking method for preventing paraphrased generation
By simulating the human reading comprehension process, using text segmentation and text vector calculation, and combining attention mechanisms and supervised learning ANN, the problem of identifying rewriting and restatement in existing technologies is solved, thus improving the reliability and security of article plagiarism detection.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents(China)
- Current Assignee / Owner: HUAZHONG UNIV OF SCI & TECH
- Filing Date: 2025-04-17
- Publication Date: 2026-06-16
AI Technical Summary
Existing article plagiarism detection technologies are unable to effectively identify rewriting and restatement, and are vulnerable to AI attacks, resulting in poor security.
By simulating the human reading comprehension process, the method uses deep learning to represent text, employing semantic paragraph segmentation and semantic vector calculation. Combined with attention mechanisms and a supervised-learning ANN, it computes disambiguation parameters and co-occurrence matrices of keywords, thereby enhancing plagiarism detection capabilities.
It effectively identifies rewriting and restatement, improves the reliability of article plagiarism detection technology, reduces the effectiveness of AI-generated restatements, and enhances the ability to resist cracking.
Figure CN120429422B_ABST
Description
Technical Field
[0001] This invention belongs to the field of artificial intelligence technology and, more specifically, relates to an article plagiarism checking method for preventing paraphrased generation.
Background Technology
[0002] Article plagiarism detection technology, which compares the similarity between articles to identify duplicate and plagiarized content, is widely used in the review process of academic journals. However, current plagiarism detection technologies are facing serious challenges from artificial intelligence, and their reliability is declining. The latest generative AI, such as the ChatGPT4 model, can perform deep plagiarism reduction; articles modified by it can reduce the similarity rate to 5-10% while maintaining the basic meaning, and its cloned papers can meet the plagiarism rate requirements of most journals. This adds new challenges to article plagiarism detection technology.
[0003] Regarding artificial intelligence, current article plagiarism detection technologies have developed limited preventative measures, such as: 1. Patent CN119292658A, which fuses the structure and order of source code using an attention mechanism based on structural embedding vectors and sequential embedding representations to obtain a global vector; based on the global vector, it calculates the similarity between the source code to be detected and the target source code. 2. Patent CN119271528A, which references the Reflexion framework and creates a proxy based on the ReAct mechanism, evaluates the current results through a reflection mechanism, and performs in-depth analysis to obtain more comprehensive plagiarism detection results.
[0004] However, current article plagiarism detection technology still faces two major challenges: 1. Poor ability to identify rewritten and restated texts. For example, even if two sentences convey exactly the same meaning, replacing each word with a synonym will lower the plagiarism rate. Similarly, translating an English document into Chinese and then back into English will also lower the plagiarism rate, even though the actual meaning remains unchanged. 2. Article plagiarism detection algorithms are easily cracked and vulnerable to targeted attacks by artificial intelligence. For instance, the code that generates hash values from article fingerprints can be inferred from the plagiarism rate of a specific sample text. By performing only a few dozen checks on the sample text, the plagiarism detection code can be largely cracked, resulting in poor security.
Summary of the Invention
[0005] In view of the above-mentioned defects or improvement needs of existing technologies, this invention provides an article plagiarism checking method for preventing paraphrased generation. Its purpose is to simulate the human reading comprehension process and use deep learning to represent the text, so as to extract and compare the meaning of the text and efficiently identify rewriting or paraphrasing.
[0006] To achieve the above objectives, according to one aspect of the present invention, an article plagiarism checking method for preventing paraphrased generation is provided, comprising:
[0007] Performing word segmentation on the article to be checked for plagiarism; identifying, from all words, the valid words that are not conjunctions, and generating a co-occurrence matrix for each valid word; determining keywords based on the weight of each valid word in the article; dividing the article into multiple semantic paragraphs based on the keywords; and calculating the disambiguation parameters between each valid word in the article and the various valid words in the corpus.
[0008] For each keyword $r_i$ in each semantic paragraph, calculate the placeholder parameter $L(r_i)$ of the text between $r_i$ and the keyword $r_{i+1}$ immediately following it; the value of $L(r_i)$ is determined in advance by experiment according to the linguistic features of the stop words in that text. Construct the symbolized normalized vector $V_{r_i}$ of the keyword $r_i$, where the sign of $V_{r_i}$ is the same as the sign of $L(r_i)$ and the dimension of $V_{r_i}$ is determined in advance through experiments. If the value of $L(r_i)$ is greater than the dimension, each dimension of $V_{r_i}$ takes the value one over the square root of the dimension; otherwise, the first several dimensions of $V_{r_i}$ each take the value one over the square root of the number of those dimensions, all other dimensions are 0, and the number of those leading dimensions equals the value of $L(r_i)$. If the sign of $V_{r_i}$ is negative, swap the positions of the keyword $r_i$ and the keyword $r_{i+1}$; otherwise, do not swap.
[0009] The co-occurrence matrix of each keyword in each semantic paragraph is concatenated after its symbolized normalized vector to form the word vector of that keyword. The word vector and the disambiguation parameter vector of each keyword in each semantic paragraph form a vector group. All vector groups corresponding to all keywords in a semantic paragraph are input into the attention mechanism (AM) algorithm to obtain the semantic vector of that paragraph. The disambiguation parameter vector of each keyword is the vector formed by the disambiguation parameters between the keyword and the various valid words in the corpus, and its dimension is the same as that of the keyword's word vector.
[0010] The plagiarism rate of each semantic paragraph is calculated from its semantic vector, and the overall plagiarism rate of the article to be checked is calculated from the plagiarism rates of its semantic paragraphs.
[0011] Furthermore, an adhesion-based decision method is adopted, and word segmentation is performed by an ANN. The segmentation method is as follows:
[0012] For three consecutive characters A, B, and C, let $P_{AB}$ be the probability that the word formed by A and B appears in the entire corpus, and $P_{BC}$ the probability that the word formed by B and C appears in the entire corpus. If $P_{AB} \gg P_{BC}$, B forms a word with A; if $P_{AB} \ll P_{BC}$, B forms a word with C; if $P_{AB} \approx P_{BC}$, B forms a word with neither A nor C.
[0013] Furthermore, the method for determining keywords is as follows:
[0014] Each effective word is assigned a weight value according to its importance in conveying the meaning of the text.
[0015] Determine the maximum weight $T_{max}$ among the weights of all valid words in the article to be checked for plagiarism, and take the valid words whose weights fall within the specified range as keywords. Determine whether the number of keywords reaches the preset number; if not, additionally select valid words from the adjacent lower weight range until the number of keywords reaches the preset number.
[0016] Furthermore, the method for assigning weights to each word is as follows:
[0017] TF(t,d) = n_t / n_0
[0018]
[0019] TF-IDF(t,d) = TF(t,d) × IDF(t)
[0020] T(t) = TF-IDF(t,d)
[0021] In the formula, TF(t,d) represents the term frequency parameter of word t in the article d to be checked for plagiarism; $n_t$ represents the number of times word t appears in article d; $n_0$ represents the total number of words in article d; IDF(t) represents the inverse document frequency of word t; $N_t$ represents the number of documents in the corpus containing word t, and N represents the total number of documents in the corpus; TF-IDF(t,d) represents the term frequency-inverse document frequency of word t in article d; and T(t) represents the weight of word t.
[0022] Furthermore, before determining the maximum weight $T_{max}$ among the weights of all valid words in the article to be checked for plagiarism, the method also includes updating the weight of each valid word, as follows:
[0023] Find the disambiguation parameter $S_{a,c_i}$ between the valid word a in the article to be checked for plagiarism and the i-th valid word $c_i$ in the corpus. The larger the value of the disambiguation parameter, the more likely a and $c_i$ are synonyms. Here $W_a$ represents the co-occurrence matrix of the valid word a in the article to be checked, $W_{c_i}$ represents the co-occurrence matrix of the known valid word $c_i$ in the corpus, i = 1, 2, ..., m, and m represents the total number of valid words in the corpus. If the disambiguation parameters between valid word a and the m valid words in the corpus are all less than the preset value Δ, the weight of valid word a does not need to be modified. Otherwise, the weight of valid word a is updated according to the following formula:
[0024]
[0025] In the formula, m′ represents the number of valid words in the corpus whose disambiguation parameter is greater than the preset value Δ, and the corresponding valid words are denoted $c'_1, c'_2, \dots, c'_{m'}$; T′(a) is the updated weight of the valid word a; $TF(c'_i, d)$ represents the term frequency parameter of the valid word $c'_i$ in the article d to be checked for plagiarism; and $IDF(c'_i)$ represents the inverse document frequency of the valid word $c'_i$.
[0026] Increase the preset value Δ and repeat the above update operation to the preset number of times to obtain the final weight of the effective word 'a' in the article to be checked for plagiarism.
[0027] Furthermore, the co-occurrence matrix $W_a$ of each valid word a in the article to be checked for plagiarism is calculated as follows:
[0028] In the article to be checked for plagiarism, identify the n valid words before and the n valid words after the valid word a, and denote them $b_{-n}, b_{-n+1}, \dots, b_n$;
[0029] Determine the probability $P_{a,b_i}$ of $b_i$ occurring with respect to a: in all documents, search for the word $b_i$ within the interval from -n to n around each occurrence of the valid word a, and take the statistical probability of finding $b_i$ as $P_{a,b_i}$, i = -n, ..., n;
[0030] Construct a vector from all $P_{a,b_i}$ and use it as the $Y_{1,2n}$ matrix in the HMM algorithm; use the transpose of $Y_{1,2n}$ as the $[P(Y|X)]_{2n,1}$ matrix in the HMM algorithm; input these into the HMM algorithm, and the output $X_{1,m}$ is the co-occurrence matrix $W_a$ of the valid word a.
[0031] Furthermore, before forming the word vector for the keyword, the method also includes updating the co-occurrence matrix of each valid word 'a', specifically:
[0032] Find the disambiguation parameter $S_{a,c_i}$ between the valid word a in the article to be checked for plagiarism and the i-th valid word $c_i$ in the corpus. The larger the value of the disambiguation parameter, the more likely a and $c_i$ are synonyms. Here $W_a$ represents the co-occurrence matrix of the valid word a in the article to be checked, $W_{c_i}$ represents the co-occurrence matrix of the known valid word $c_i$ in the corpus, i = 1, 2, ..., m, and m represents the total number of valid words in the corpus. If the disambiguation parameters between valid word a and the m valid words in the corpus are all less than the preset value Δ, the co-occurrence matrix of valid word a does not need to be modified. Otherwise, the co-occurrence matrix of valid word a is updated according to the following formula:
[0033]
[0034] In the formula, m′ represents the number of valid words in the corpus whose disambiguation parameter is greater than the preset value Δ, and the corresponding valid words are denoted $c'_1, c'_2, \dots, c'_{m'}$; $W'_a$ represents the updated co-occurrence matrix of the valid word a; and $W_{c'_i}$ represents the co-occurrence matrix of the known valid word $c'_i$ in the corpus;
[0035] Increase the value of the preset value Δ and repeat the above update operation to the preset number of times to obtain the final co-occurrence matrix of the effective word 'a' in the article to be checked for plagiarism.
[0036] Furthermore, the method for dividing the text into paragraphs based on keywords is as follows: input the positions of all keywords in the text, the position of the period in each sentence, and the position of the paragraph division into an ANN to obtain the division results.
[0037] Furthermore, the supervision conditions used in supervised learning of the ANN include:
[0038] (1) Each paragraph division is located at a period or at a natural paragraph division;
[0039] (2) Let $P_\lambda$ be the ratio of the number of keywords in the λ-th semantic paragraph to the total number of keywords in that paragraph; among all possible partitions, the chosen partition should give the highest average value of $P_\lambda$;
[0040] (3) The number of paragraphs in the main body of the text shall not be less than half the number of natural paragraphs, and not more than the number of natural paragraphs.
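Conditions (1) and (3) above are hard constraints that can be checked directly on a candidate partition. The following is a minimal sketch, not the patented ANN: the function name, the representation of boundaries as character positions, and the convention that k breaks produce k + 1 paragraphs are all assumptions made for illustration. Condition (2) is an optimization target and is deliberately left out.

```python
def is_valid_partition(boundaries, period_positions, natural_breaks):
    """Check conditions (1) and (3) for a candidate semantic partition.

    boundaries:       positions where the candidate partition cuts the text
    period_positions: positions of sentence-ending periods
    natural_breaks:   positions of natural paragraph breaks
    (All three representations are hypothetical.)
    """
    allowed = set(period_positions) | set(natural_breaks)
    # Condition (1): every cut lies at a period or a natural paragraph break.
    if not all(b in allowed for b in boundaries):
        return False
    # Assumption: k breaks split the text into k + 1 paragraphs.
    n_natural = len(natural_breaks) + 1
    n_semantic = len(boundaries) + 1
    # Condition (3): between half the natural paragraph count and that count.
    return n_natural / 2 <= n_semantic <= n_natural


periods = [10, 25, 40, 60]
natural = [25, 60]          # three natural paragraphs
print(is_valid_partition([25], periods, natural))
print(is_valid_partition([12], periods, natural))
```

A filter like this could be used to discard inadmissible labels before training, leaving condition (2) to rank the remaining partitions.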
[0041] According to another aspect of the invention, a computer-readable storage medium is provided, the computer-readable storage medium including a stored computer program, wherein, when the computer program is run by a processor, it controls the device where the storage medium is located to perform the steps of the method described above.
[0042] In summary, compared with the prior art, the technical solutions conceived by this invention have the following main advantages:
[0043] 1. This invention proposes an article plagiarism checking method for preventing paraphrased generation. It introduces the semantic paragraph as a unit and the semantic vector as a parameter, so that plagiarism detection goes beyond literal comparison and captures the meaning of keywords, further reducing the effectiveness of AI-generated paraphrasing based on keyword substitution. Specifically, the semantic vector of each semantic paragraph is obtained by inputting all vector groups corresponding to all keywords in that paragraph into an attention mechanism (AM) algorithm, which allows AI paraphrasing of an entire paragraph to be identified. Each vector group consists of the word vector and the disambiguation parameter vector of a keyword. The disambiguation parameter vector of a keyword is formed from the disambiguation parameters between the keyword and the various valid words in the corpus and has the same dimension as the word vector, which allows AI paraphrasing by keyword synonym substitution to be identified. The word vector of a keyword is formed by concatenating the keyword's co-occurrence matrix with its symbolized normalized vector, so that both the language structure of the article and the meaning of its words are taken into account, strengthening resistance to paraphrasing. The invention can therefore effectively identify rewriting and restatement and counter AI restatement generation algorithms, so that AI can no longer effectively lower an article's plagiarism rate. This increases the reliability of article plagiarism detection and addresses two problems of existing technology: plagiarism detection is ineffective against rewritten and restated articles, and plagiarism detection algorithms are easily cracked.
[0044] 2. This invention also proposes a method for updating the weights of keywords and / or the co-occurrence matrix to achieve keyword disambiguation. On the one hand, this makes plagiarism detection no longer limited to hard comparisons between words, significantly reducing the effectiveness of AI restatement generation algorithms that use keyword substitution and grammatical transformation. On the other hand, disambiguation can increase information entropy, making the AI reverse-engineering formula more complex, which significantly increases the difficulty for AI to crack the code.
[0045] 3. This invention further proposes that the division into semantic paragraphs be achieved through an ANN. Because the relevant parameters are generated by supervised learning of the ANN, AI is used to counter AI, enabling this invention to iterate rapidly under the threat of being cracked by restatement generation AI and further increasing its advantage in resisting cracking.
Attached Figure Description
[0046] Figure 1 is a flowchart of an article plagiarism checking method for preventing paraphrased generation, provided in an embodiment of the present invention;
[0047] Figure 2 is a flowchart of another article plagiarism checking method for preventing paraphrased generation, provided in an embodiment of the present invention.
Detailed Implementation
[0048] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention. Furthermore, the technical features involved in the various embodiments of this invention described below can be combined with each other as long as they do not conflict with each other.
[0049] Example 1
[0050] As shown in Figure 1, an article plagiarism checking method for preventing paraphrased generation includes:
[0051] Performing word segmentation on the article to be checked for plagiarism; identifying, from all words, the valid words that are not conjunctions, and generating a co-occurrence matrix for each valid word; determining keywords based on the weight of each valid word in the article; dividing the article into multiple semantic paragraphs based on the keywords; and calculating the disambiguation parameters between each valid word in the article and the various valid words in the corpus.
[0052] For each keyword $r_i$ in each semantic paragraph, calculate the placeholder parameter $L(r_i)$ of the text between $r_i$ and the keyword $r_{i+1}$ immediately following it; the value of $L(r_i)$ is determined in advance by experiment according to the linguistic features of the stop words in that text. Construct the symbolized normalized vector $V_{r_i}$ of the keyword $r_i$, where the sign of $V_{r_i}$ is the same as the sign of $L(r_i)$ and the dimension of $V_{r_i}$ is determined in advance through experiments. If the value of $L(r_i)$ is greater than the dimension, each dimension of $V_{r_i}$ takes the value one over the square root of the dimension; otherwise, the first several dimensions of $V_{r_i}$ each take the value one over the square root of the number of those dimensions, all other dimensions are 0, and the number of those leading dimensions equals the value of $L(r_i)$. If the sign of $V_{r_i}$ is negative, swap the positions of the keyword $r_i$ and the keyword $r_{i+1}$; otherwise, do not swap.
[0053] Concatenate the co-occurrence matrix of each keyword in each semantic paragraph to the rear of its symbolized normalized vector to form the word vector of the keyword; form a vector group with the word vectors of each keyword and the disambiguation parameter vectors in each semantic paragraph, and input all the said vector groups corresponding to all keywords in the semantic paragraph into the attention mechanism AM algorithm to obtain the semantic vector of the semantic paragraph; wherein, the disambiguation parameter vector of each keyword is a vector composed of the disambiguation parameters between the keyword and various valid words in the corpus, and its dimension is the same as the dimension of the word vector of the keyword;
[0054] Calculate the plagiarism rate of each semantic paragraph based on its semantic vector, and calculate the plagiarism rate of the article to be checked based on the plagiarism rates of its semantic paragraphs.
[0055] The method of this embodiment needs to first identify valid words that are not conjunctions from all words. The definition of valid words is as follows: Words other than stop words (conjunctions, auxiliary words, pronouns, etc.) in the sense of computer science are valid words. All valid words can be divided by directly excluding stop words.
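As a minimal sketch of the stop-word exclusion described above, valid words can be obtained by filtering a token list against a stop-word set. The stop-word set below is a hypothetical English stand-in chosen only for illustration; the text does not specify one.

```python
# Hypothetical stop-word list; the patent text does not provide one.
STOP_WORDS = {"and", "or", "the", "of", "it", "is", "a"}


def valid_words(tokens):
    """Return the tokens that are not stop words, preserving their order."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]


tokens = ["The", "method", "of", "detection", "and", "comparison"]
print(valid_words(tokens))  # ['method', 'detection', 'comparison']
```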
[0056] The symbolized normalized vector is constructed as follows. Suppose a semantic paragraph contains p keywords, i = 1, 2, ..., p. The text between a keyword $r_i$ and the keyword $r_{i+1}$ immediately following it has a placeholder parameter $L(r_i)$, which is determined by all the stop words between the two keywords. These stop words are extracted, and the sum of the numeric values assigned to their linguistic features is $L(r_i)$. The values used to convert the linguistic features of stop words into numbers must be obtained through experiments; for example, "de" is -1 and "le" is 0.
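The placeholder parameter can be sketched as a lookup-and-sum over the stop words between two adjacent keywords. In the table below only the values for "de" and "le" come from the text; the entry for "he" and the choice to let non-stop-word tokens contribute 0 are assumptions for illustration.

```python
# Only "de" = -1 and "le" = 0 come from the text; "he" = 2 is hypothetical.
STOP_WORD_VALUES = {"de": -1, "le": 0, "he": 2}


def placeholder_parameter(tokens_between):
    """Sum the numeric values of the stop words between two adjacent keywords.

    Assumption: tokens that are not in the table contribute 0.
    """
    return sum(STOP_WORD_VALUES.get(t, 0) for t in tokens_between)


print(placeholder_parameter(["de", "le", "de"]))  # -2
```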
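[0057] Set the parameter $\chi$; the symbolized normalized vector of the keyword $r_i$ can then be expressed as:
[0058] $$V_{r_i} = \operatorname{sgn}[L(r_i)] \cdot \frac{1}{\sqrt{k}}\,(\underbrace{1, \dots, 1}_{k}, \underbrace{0, \dots, 0}_{\chi - k})^{T}, \quad k = \min\left([\,|L(r_i)|\,], \chi\right)$$
[0059] where $V_{r_i}$ is the symbolized normalized vector of the keyword $r_i$, a $\chi$-dimensional column vector. When $|L(r_i)| < \chi$, the first $[\,|L(r_i)|\,]$ dimensions of $V_{r_i}$ (square brackets indicate rounding down) have magnitude $1/\sqrt{[\,|L(r_i)|\,]}$ and the subsequent dimensions are 0. When $|L(r_i)| \ge \chi$, all dimensions have magnitude $1/\sqrt{\chi}$. The magnitude of $L(r_i)$ is used for the count because its sign is carried by the $\operatorname{sgn}$ factor. As a preferred solution, $\chi$ can take the value 12, 13 or 14, as determined by experiments.
[0060] If $\operatorname{sgn}[L(r_i)]$ in $V_{r_i}$ is greater than 0, the keyword $r_i$ is left unchanged; if it is less than 0, the keyword $r_i$ is swapped with the keyword immediately following it.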
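The construction of the symbolized normalized vector can be sketched as follows. This is an illustrative reading, not the authoritative definition: it assumes the number of non-zero leading entries is the rounded-down magnitude of $L(r_i)$ capped at $\chi$, and it returns a zero vector when that count is 0, a case the text does not address. The function names are hypothetical.

```python
import math


def symbolized_normalized_vector(L, chi=12):
    """Build a chi-dimensional vector whose sign follows L.

    The first k entries equal sign / sqrt(k) and the rest are 0, where
    k = min(floor(|L|), chi). Using |L| for the count is an assumption.
    """
    sign = (L > 0) - (L < 0)
    k = min(math.floor(abs(L)), chi)
    if k == 0:
        # Assumption: the text does not say what happens when |L| < 1.
        return [0.0] * chi
    value = sign / math.sqrt(k)
    return [value] * k + [0.0] * (chi - k)


def should_swap(L):
    """Keywords r_i and r_{i+1} are swapped when the sign is negative."""
    return L < 0


v = symbolized_normalized_vector(4, chi=12)
print(v[:5])             # [0.5, 0.5, 0.5, 0.5, 0.0]
print(should_swap(-3))   # True
```

With this reading every non-zero vector has unit Euclidean length, which matches the "normalized" part of the name.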
[0061] The concatenation is performed as follows: generate a column vector $U_{r_i}$ of dimension $\chi + m$ as the word vector of the keyword $r_i$ (m represents the number of valid words in the corpus); its first $\chi$ dimensions equal the symbolized normalized vector, and its last m dimensions equal the co-occurrence matrix.
[0062] The simplified workflow of the AM algorithm is as follows:
[0063] The keyword with the most occurrences in a semantic paragraph is called the semantic keyword of that paragraph. Let the semantic keyword of a paragraph be $r'_0$, let all keywords after rearrangement be $r'_1, r'_2, \dots, r'_p$, and let their word vectors be $U_{r'_1}, U_{r'_2}, \dots, U_{r'_p}$. Then the semantic vector $U_P$ of this paragraph is:
[0064]
[0065] where $S_{r'_0, r'_i}$ denotes the disambiguation parameter between the semantic keyword $r'_0$ and any keyword $r'_i$.
[0066] As an example, the formula for calculating the plagiarism rate of a paragraph can be:
[0067]
[0068] where λ is the index of the semantic paragraph, and $U_i$ represents the semantic vector of each sentence in the corpus.
[0069] As an example, the formula for the full-text plagiarism rate can be:
[0070]
[0071] where $\lambda_t$ represents the number of semantic paragraphs in the article to be checked for plagiarism, and $\Psi_0$ represents the full-text plagiarism rate.
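The two rate formulas themselves are not reproduced in this text, so the following is only a rough sketch of how a per-paragraph rate and a full-text rate could be combined. It swaps in two assumptions that are not confirmed by the source: the paragraph rate is taken as the highest cosine similarity between the paragraph's semantic vector and any corpus vector, and the full-text rate is the plain average over the semantic paragraphs. All function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu and nv else 0.0


def paragraph_rate(u_p, corpus_vectors):
    """Assumed rate: best cosine match of U_P against the corpus vectors U_i."""
    return max((cosine(u_p, u_i) for u_i in corpus_vectors), default=0.0)


def full_text_rate(paragraph_vectors, corpus_vectors):
    """Assumed full-text rate: mean of the paragraph rates."""
    rates = [paragraph_rate(u, corpus_vectors) for u in paragraph_vectors]
    return sum(rates) / len(rates) if rates else 0.0


corpus = [[1.0, 0.0], [0.0, 1.0]]
paragraphs = [[1.0, 0.0], [1.0, 1.0]]
print(full_text_rate(paragraphs, corpus))
```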
[0072] As a preferred implementation, an adhesion-based decision method is used, and word segmentation is performed by an ANN. The segmentation method is as follows:
[0073] For three consecutive characters A, B, and C, let $P_{AB}$ be the probability that the word formed by A and B appears in the entire corpus, and $P_{BC}$ the probability that the word formed by B and C appears in the entire corpus. If $P_{AB} \gg P_{BC}$, B forms a word with A; if $P_{AB} \ll P_{BC}$, B forms a word with C; if $P_{AB} \approx P_{BC}$, B forms a word with neither A nor C.
[0074] An ANN with an adhesion criterion can be used to improve word segmentation accuracy. Its database consists of a corpus and idioms, and it performs machine learning on the probability of a character forming a word with the characters before and after it.
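The adhesion decision for the middle character can be sketched as a simple ratio test. This is not the trained ANN: the `ratio` parameter is a hypothetical stand-in for "much greater than", and returning `None` for every non-dominant case (including the "approximately equal" case) is an assumption.

```python
def attach_middle_char(p_ab, p_bc, ratio=10.0):
    """Decide which neighbour the middle character B joins.

    p_ab: corpus probability of the word formed by A and B
    p_bc: corpus probability of the word formed by B and C
    ratio: hypothetical threshold standing in for ">>"
    Returns "AB", "BC", or None when neither side clearly dominates.
    """
    if p_ab > ratio * p_bc:
        return "AB"
    if p_bc > ratio * p_ab:
        return "BC"
    return None


print(attach_middle_char(0.02, 0.001))   # AB
print(attach_middle_char(0.001, 0.02))   # BC
print(attach_middle_char(0.01, 0.012))   # None
```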
[0075] As a preferred implementation method, the above-mentioned method for determining keywords is as follows:
[0076] Each effective word is assigned a weight value according to its importance in conveying the meaning of the text.
[0077] Determine the maximum weight $T_{max}$ among the weights of all valid words in the article to be checked for plagiarism, and take the valid words whose weights fall within the specified range as keywords. Determine whether the number of keywords reaches the preset number; if not, additionally select valid words from the adjacent lower weight range until the number of keywords reaches the preset number.
[0078] For example, among the weights of all valid words in the article, find the maximum value $T_{max}$; all words whose weights fall within the specified range are taken as keywords. Next, let the word count of the article be $N_d$ (the total number of words obtained from S1). If the number of keywords at this point is below the required number (square brackets indicate rounding down), continue taking words with lower weights until the number of keywords reaches the required number.
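The selection procedure above can be sketched as a threshold on the weight followed by a top-up. The exact weight range and the required keyword count are not recoverable from this text, so `ratio` (a fraction of the maximum weight) and `min_count` below are hypothetical parameters; the sketch also assumes a non-empty weight table.

```python
def select_keywords(weights, ratio=0.5, min_count=3):
    """Pick keywords from a {word: weight} mapping.

    Words with weight >= ratio * T_max are keywords; if fewer than
    min_count are found, the next-lower weights are added in order.
    Both ratio and min_count are hypothetical placeholders.
    """
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    t_max = ranked[0][1]
    keywords = [w for w, t in ranked if t >= ratio * t_max]
    # Top up from the words just below the threshold until min_count is met.
    for word, _ in ranked[len(keywords):]:
        if len(keywords) >= min_count:
            break
        keywords.append(word)
    return keywords


weights = {"plagiarism": 0.9, "detection": 0.6, "vector": 0.3,
           "method": 0.2, "text": 0.1}
print(select_keywords(weights))  # ['plagiarism', 'detection', 'vector']
```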
[0079] As a further preferred implementation, the method for assigning weights to each word is as follows:
[0080] TF(t,d) = n_t / n_0
[0081]
[0082] TF-IDF(t,d) = TF(t,d) × IDF(t)
[0083] T(t) = TF-IDF(t,d)
[0084] In the formula, TF(t,d) represents the term frequency parameter of word t in the article d to be checked for plagiarism, usually a number greater than 0 and much less than 1; $n_t$ represents the number of times word t appears in article d; $n_0$ represents the total number of words in article d; IDF(t) represents the inverse document frequency of word t, whose values usually fall into three types: Type I values are between 0 and 0.001, Type II values are between 2.3 and 3.5, and Type III values are greater than 5; $N_t$ represents the number of documents in the corpus containing word t, and N represents the total number of documents in the corpus; TF-IDF(t,d) represents the term frequency-inverse document frequency of word t in article d; and T(t) represents the weight of word t, which equals TF-IDF(t,d).
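The weight assignment can be sketched with a small TF-IDF routine. The IDF formula is not reproduced in this text, so the sketch assumes the common form IDF(t) = ln(N / N_t) and assigns IDF 0 to words that appear in no corpus document; both choices are assumptions, as is the function name.

```python
import math
from collections import Counter


def tfidf_weights(doc_tokens, corpus_docs):
    """Return {word: T(t)} for the article, with T(t) = TF(t,d) * IDF(t).

    doc_tokens:  list of words in article d
    corpus_docs: list of token lists, one per corpus document
    Assumption: IDF(t) = ln(N / N_t), and 0 when N_t is 0.
    """
    counts = Counter(doc_tokens)
    n0 = len(doc_tokens)            # total words in article d
    N = len(corpus_docs)            # total documents in the corpus
    weights = {}
    for t, n_t in counts.items():
        N_t = sum(1 for doc in corpus_docs if t in doc)
        idf = math.log(N / N_t) if N_t else 0.0
        weights[t] = (n_t / n0) * idf
    return weights


corpus = [["a", "b"], ["a", "c"], ["a", "d"], ["b", "d"]]
article = ["a", "b", "b", "c"]
print(tfidf_weights(article, corpus))
```

In this toy corpus the word "a" is common to three of four documents, so its weight ends up lowest even though it appears as often as "c" in the article.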
[0085] As a further preferred implementation, before determining the maximum weight $T_{max}$ among the weights of all valid words in the article to be checked for plagiarism, the method also includes updating the weight of each valid word, as follows:
[0086] Find the disambiguation parameter $S_{a,c_i}$ between the valid word a in the article to be checked for plagiarism and the i-th valid word $c_i$ in the corpus. The larger the value of the disambiguation parameter, the more likely a and $c_i$ are synonyms. Here $W_a$ represents the co-occurrence matrix of the valid word a in the article to be checked, $W_{c_i}$ represents the co-occurrence matrix of the known valid word $c_i$ in the corpus, i = 1, 2, ..., m, and m represents the number of all valid words in the corpus (which contains the article to be checked for plagiarism). If the disambiguation parameters between valid word a and the m valid words in the corpus are all no greater than the preset value Δ, the weight of valid word a does not need to be modified; otherwise, the weight of valid word a is updated according to the following formula:
[0087]
[0088] In the formula, m′ represents the number of valid words in the corpus whose disambiguation parameter is greater than the preset value Δ, and the corresponding valid words are denoted $c'_1, c'_2, \dots, c'_{m'}$; T′(a) is the updated weight of the valid word a; $TF(c'_i, d)$ represents the term frequency parameter of the valid word $c'_i$ in the article d to be checked for plagiarism; and $IDF(c'_i)$ represents the inverse document frequency of the valid word $c'_i$.
[0089] Increase the preset value Δ and repeat the above update operation to the preset number of times to obtain the final weight of the effective word 'a' in the article to be checked for plagiarism.
[0090] As a preferred implementation, the co-occurrence matrix $W_a$ of each valid word a in the article to be checked for plagiarism can be obtained as follows:
[0091] In the article to be checked for plagiarism, identify the n valid words before and the n valid words after the valid word a, and denote them $b_{-n}, b_{-n+1}, \dots, b_n$;
[0092] Determine the probability $P_{a,b_i}$ of $b_i$ occurring with respect to a: in all documents, search for the word $b_i$ within the interval from -n to n around each occurrence of the valid word a, and take the statistical probability of finding $b_i$ as $P_{a,b_i}$, i = -n, ..., n;
[0093] Construct a vector from all $P_{a,b_i}$ and use it as the $Y_{1,2n}$ matrix in the HMM algorithm; use the transpose of $Y_{1,2n}$ as the $[P(Y|X)]_{2n,1}$ matrix in the HMM algorithm; input these into the HMM algorithm, and the output $X_{1,m}$ is the co-occurrence matrix $W_a$ of the valid word a.
[0094] In other words, a parameter n is first defined to represent the position of a valid word relative to a given valid word. The valid word immediately preceding it is at position -1, and so on: the n-th valid word preceding it is at position -n, and the n-th valid word following it is at position n. Denote the valid words in the interval from -n to n in the article to be checked for plagiarism as $b_{-n}, b_{-n+1}, \dots, b_n$.
[0095] In the article to be checked for plagiarism, for a specific valid word a, solve for its co-occurrence matrix $W_a$:
[0096] For a and each $b_i$ (i = -n, ..., n) in the article to be checked for plagiarism, solve for the probability of $b_i$ occurring with respect to a; all $P_{a,b_i}$ corresponding to a form a vector, which is the $Y_{1,2n}$ matrix of the HMM algorithm; the transpose of $Y_{1,2n}$ is used as the $[P(Y|X)]_{2n,1}$ matrix in the HMM algorithm; input these into the HMM algorithm, and the output $X_{1,m}$ is the co-occurrence matrix $W_a$ of the word.
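The window statistic that feeds the HMM step can be sketched as follows. The sketch covers only the estimation of the probabilities, not the HMM itself, and it assumes one specific reading: the probability for a word b is the fraction of occurrences of a whose window of n valid words on each side contains b. The function name and the input layout (documents already reduced to valid words) are assumptions.

```python
def window_probabilities(docs, a, n=2):
    """Estimate P_{a,b} for every word b seen near the valid word `a`.

    docs: list of token lists, each already reduced to valid words
    n:    window half-width (n valid words before and after `a`)
    Assumed reading: P_{a,b} = (occurrences of `a` with b in the window)
                               / (all occurrences of `a`).
    """
    hits = {}
    total = 0
    for tokens in docs:
        for i, tok in enumerate(tokens):
            if tok != a:
                continue
            total += 1
            window = set(tokens[max(0, i - n):i] + tokens[i + 1:i + 1 + n])
            for b in window:
                hits[b] = hits.get(b, 0) + 1
    return {b: c / total for b, c in hits.items()} if total else {}


docs = [["x", "a", "y"], ["y", "a", "z"], ["a", "x"]]
print(window_probabilities(docs, "a", n=1))
```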
[0097] As a further preferred implementation, before forming the word vector of the keyword, the method further includes updating the co-occurrence matrix of each valid word 'a', specifically:
[0098] Find the disambiguation parameter $S_{a,c_i}$ between the valid word a in the article to be checked for plagiarism and the i-th valid word $c_i$ in the corpus. The larger the value of the disambiguation parameter, the more likely a and $c_i$ are synonyms. Here $W_a$ represents the co-occurrence matrix of the valid word a in the article to be checked, $W_{c_i}$ represents the co-occurrence matrix of the known valid word $c_i$ in the corpus, i = 1, 2, ..., m, and m represents the number of all valid words in the corpus (which contains the article to be checked for plagiarism). If the disambiguation parameters between valid word a and the m valid words in the corpus are all less than the preset value Δ, the co-occurrence matrix of valid word a does not need to be modified; otherwise, the co-occurrence matrix of valid word a is updated according to the following formula:
[0099]
[0100] In the formula, $m'$ represents the number of valid word types in the corpus whose disambiguation parameter is greater than the preset value Δ, and the corresponding valid words are denoted $c'_1, c'_2, \ldots, c'_{m'}$; $W'_a$ represents the updated co-occurrence matrix of the valid word 'a'; $W_{c'_i}$ represents the co-occurrence matrix of the known valid word $c'_i$ in the corpus;
[0101] Increase the preset value Δ and repeat the above update operation until the preset number of repetitions is reached, obtaining the final co-occurrence matrix of the valid word 'a' in the article to be checked for plagiarism.
[0102] It should be noted that updating the weights and updating the co-occurrence matrix are both disambiguation operations. Both updates can be performed simultaneously, or one can be chosen for disambiguation.
[0103] After the weights and co-occurrence matrices are updated, the words a, $c'_1, c'_2, \ldots, c'_{m'}$ in the article to be checked for plagiarism are all replaced with 'a'. The purpose of this operation is to reverse the restatement process, merging synonyms to minimize the impact of the AI's restatement process on the article's plagiarism rate.
[0104] As an example, the preset value Δ is set to around 0.937; this value was determined experimentally. After disambiguation, the above steps are repeated with the parameter Δ set to around 0.894. Experiments show that articles subjected to AI-based deep plagiarism reduction tend to have been restated multiple times, so a single disambiguation pass cannot completely eliminate its influence. Multiple disambiguation passes can be performed, and the parameter Δ can be slightly increased, to counteract the effects of AI-based deep plagiarism reduction.
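The repeated disambiguation passes can be sketched as follows. Because the formulas for the disambiguation parameter and for the matrix update are not reproduced in this text, the sketch substitutes cosine similarity between per-word vectors for the disambiguation parameter, compares words within the article rather than against a corpus, and does not update the vectors between passes. These are stand-in assumptions, as are the names `cosine` and `merge_synonyms`; the threshold values are the example values quoted above.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; an ASSUMED stand-in for the disambiguation parameter."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def merge_synonyms(words: Sequence[str],
                   vectors: Dict[str, Sequence[float]],
                   thresholds: Sequence[float]) -> List[str]:
    """Run one merging pass per threshold.

    In each pass, a word is replaced by the earliest-seen word whose
    similarity to it exceeds the threshold; otherwise it is kept.
    """
    current = list(words)
    for delta in thresholds:
        canonical: List[str] = []      # representative words, in order of first appearance
        mapping: Dict[str, str] = {}
        for w in current:
            if w in mapping:
                continue
            match = next(
                (c for c in canonical if cosine(vectors[w], vectors[c]) > delta),
                None,
            )
            mapping[w] = match if match is not None else w
            if match is None:
                canonical.append(w)
        current = [mapping[w] for w in current]
    return current
```

For example, with hypothetical vectors in which "use" and "utilize" are nearly parallel, `merge_synonyms(["use", "paper", "utilize"], vectors, [0.937, 0.894])` rewrites "utilize" to "use" and leaves "paper" untouched.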
[0105] Furthermore, the method for dividing the text into paragraphs based on keywords is as follows: input the positions of all keywords in the text, the position of the period in each sentence, and the position of the paragraph division into an ANN to obtain the division results.
[0106] Furthermore, the conditions for supervised learning of the ANN include:
[0107] (1) The paragraph division of the text is located at the period or the natural paragraph division;
[0108] (2) Let $P_\lambda$ be the ratio of the number of keywords in the λth paragraph to the total number of keywords in that paragraph. Then, among all possible partitioning scenarios, the chosen partition should have the highest average value of $P_\lambda$;
[0109] (3) The number of paragraphs in the main body of the text shall not be less than half the number of natural paragraphs, and not more than the number of natural paragraphs.
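The three conditions can be read as a filter and a score over candidate partitions. The sketch below is an assumed brute-force illustration for small inputs, not the trained ANN: cut points are only allowed at sentence ends (condition 1), the number of segments is kept between half and all of the natural-paragraph count (condition 3), and each candidate is scored by the mean of $P_\lambda$ (condition 2). The names `mean_density` and `best_partition`, the per-sentence `(word count, keyword count)` representation, and the reading of $P_\lambda$ as keywords per word within a segment are assumptions.

```python
from itertools import combinations
from math import ceil
from typing import List, Sequence, Tuple

Sentence = Tuple[int, int]  # (word count, keyword count) for one sentence


def mean_density(sentences: Sequence[Sentence], cuts: Sequence[int]) -> float:
    """Average keyword density over the segments produced by the given cut points."""
    bounds = [0, *cuts, len(sentences)]
    densities = []
    for start, end in zip(bounds, bounds[1:]):
        words = sum(w for w, _ in sentences[start:end])
        keys = sum(k for _, k in sentences[start:end])
        densities.append(keys / words if words else 0.0)
    return sum(densities) / len(densities)


def best_partition(sentences: Sequence[Sentence],
                   natural_paragraphs: int) -> Tuple[List[int], float]:
    """Enumerate admissible partitions and return the cuts with the highest mean density."""
    lo = max(1, ceil(natural_paragraphs / 2))       # condition (3), lower bound
    hi = min(natural_paragraphs, len(sentences))    # condition (3), upper bound
    best_cuts: Tuple[int, ...] = ()
    best_score = float("-inf")
    for segments in range(lo, hi + 1):
        # A cut at index k means a segment boundary after sentence k-1 (condition 1).
        for cuts in combinations(range(1, len(sentences)), segments - 1):
            score = mean_density(sentences, cuts)
            if score > best_score:
                best_cuts, best_score = cuts, score
    return list(best_cuts), best_score
```

The enumeration grows combinatorially with the number of sentences, which is one reason a learned model is a more practical choice for full-length articles.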
[0110] This embodiment incorporates artificial intelligence and deep learning modules, whose parameters can be automatically adjusted through training, eliminating the need for manual parameter tuning. Therefore, the introduction of neural networks and Hidden Markov Models only covers the input and output; the working process and specific parameter values are not detailed above. This embodiment is particularly suitable for plagiarism detection in Chinese academic papers because Chinese stop words are highly targeted, their linguistic features are easy to grasp, and the resulting word vectors are more effective.
[0111] Based on the content of Embodiment 1 above, Figure 2 shows a flowchart of one of the plagiarism detection methods.
[0112] Example 2
[0113] This application also relates to a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method described above.
[0114] Specifically, the memory may include high-speed random access memory, as well as non-volatile memory, such as hard disks, RAM, plug-in hard disks, smart media cards (SMC), secure digital cards (SD), flash cards, at least one disk storage device, flash memory device, or other volatile solid-state storage devices.
[0115] The relevant technical solutions are the same as above, and will not be repeated here.
[0116] Those skilled in the art will readily understand that the above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A method for preventing plagiarism in articles generated by restatements, characterized in that it comprises:
Segmenting the text of the article to be checked for plagiarism into words;
Identifying, from all words, the valid words that are non-connecting words, and generating a co-occurrence matrix for each valid word;
Determining keywords based on the weight of each valid word in the article; dividing the article into paragraphs based on the keywords to obtain multiple paragraphs; solving for the disambiguation parameters between each valid word in the article and the various valid words in the corpus;
For each keyword $r_i$ in each paragraph, calculating the text placeholder parameter $L(r_i)$ between it and the immediately following keyword $r_{i+1}$, the value being determined in advance through experiments according to the linguistic features of the corresponding stop words in the text; constructing the symbolic normalized vector of the keyword $r_i$, wherein the sign of the vector is the same as the sign of $L(r_i)$, and the dimension of the vector is determined in advance through experiments; if the value of $L(r_i)$ is greater than the dimension, each dimension of the vector takes the value of one over the square root of the dimension; otherwise, the first $L(r_i)$ dimensions of the vector take the value of one over the square root of $L(r_i)$, and the other dimensions are zero; if the sign of the vector is negative, swapping the positions of the keyword $r_i$ and the keyword $r_{i+1}$; otherwise, not swapping;
Concatenating the co-occurrence matrix of each keyword in each text segment with its symbolic normalized vector to form the word vector of that keyword; combining the word vector and the disambiguation parameter vector of each keyword in each text segment into a vector group; inputting all the vector groups corresponding to all keywords in the text segment into the attention mechanism (AM) algorithm to obtain the text vector of the text segment; wherein the disambiguation parameter vector of each keyword is a vector formed by the disambiguation parameters between the keyword and the various valid words in the corpus, and its dimension is the same as the word vector dimension of the keyword;
Calculating the plagiarism rate of each text segment based on its text vector, and calculating the overall plagiarism rate of the article to be checked based on the plagiarism rates of the text segments.
2. The article plagiarism detection method as described in claim 1, characterized in that adhesive decision-making is adopted and word segmentation is performed using an ANN. The segmentation method is as follows: for the three characters A, B, and C, let $P_{AB}$ be the probability that a word formed by A and B appears in the entire corpus, and $P_{BC}$ the probability that a word formed by B and C appears in the entire corpus. If $P_{AB} \gg P_{BC}$, B forms a word with A; if $P_{AB} \ll P_{BC}$, B forms a word with C; if $P_{AB} \approx P_{BC}$, B does not form a word with either A or C.
3. The article plagiarism detection method as described in claim 1, characterized in that the method for determining keywords is as follows: each valid word is assigned a weight value according to its importance in conveying the meaning of the text; the maximum weight $T_{max}$ among all valid word weights in the article to be checked for plagiarism is determined, and the valid words whose weights fall within the specified range are used as keywords; it is determined whether the number of keywords meets the preset limit; if not, valid words whose weights fall within a relatively lower range adjacent to that range are selected until the number of keywords reaches the preset number.
4. The article plagiarism detection method as described in claim 3, characterized in that the method for assigning weights to each word is as follows:
$\text{TF-IDF}(t,d) = \text{TF}(t,d) \times \text{IDF}(t)$
$T(t) = \text{TF-IDF}(t,d)$
In the formula, TF(t,d) represents the word frequency parameter of word t in the article d to be checked for plagiarism; $n_t$ represents the number of times word t appears in the article d to be checked for plagiarism; $n_0$ represents the total number of words in the article d to be checked for plagiarism; IDF(t) represents the inverse document frequency of word t; $N_t$ represents the number of documents in the corpus containing the word t, and N represents the total number of documents in the corpus; TF-IDF(t,d) represents the term frequency-inverse document frequency of the word t in the article d to be checked for plagiarism, and T(t) represents the weight of the word t.
5. The article plagiarism detection method as described in claim 3, characterized in that, before the maximum weight $T_{max}$ among all valid word weights in the article to be checked for plagiarism is determined, the method further includes updating the weight of each valid word, as follows: Find the disambiguation parameter $S_{a,c_i}$ between the valid word 'a' in the article to be checked for plagiarism and the i-th valid word $c_i$ in the corpus. The larger the value of the disambiguation parameter, the more likely 'a' and $c_i$ are to be synonyms. Here, $W_a$ represents the co-occurrence matrix of the valid word 'a' in the article to be checked for plagiarism, and $W_{c_i}$ represents the co-occurrence matrix of the known valid word $c_i$ in the corpus, with $i = 1, 2, \ldots, m$, where m represents the total number of valid words in the corpus. If the disambiguation parameters between valid word 'a' and all m valid words in the corpus are less than the preset value Δ, the weight of valid word 'a' does not need to be modified; otherwise, the weight of valid word 'a' is updated according to the following formula: In the formula, $m'$ represents the number of valid word types in the corpus whose disambiguation parameter is greater than the preset value Δ, and the corresponding valid words are denoted $c'_1, c'_2, \ldots, c'_{m'}$; $T'(a)$ is the updated weight of the valid word 'a'; $\text{TF}(c'_i, d)$ represents the word frequency parameter of the valid word $c'_i$ in the article d to be checked for plagiarism; $\text{IDF}(c'_i)$ represents the inverse document frequency of the valid word $c'_i$. Increase the preset value Δ and repeat the above update operation until the preset number of repetitions is reached, obtaining the final weight of the valid word 'a' in the article to be checked for plagiarism.
6. The article plagiarism detection method as described in claim 1, characterized in that the co-occurrence matrix $W_a$ of each valid word 'a' in the article to be checked for plagiarism is solved as follows: In the article to be checked for plagiarism, identify the n valid words before and the n valid words after the valid word 'a', and denote them as $b_{-n}, b_{-n+1}, \ldots, b_n$; determine the probability $P_{a,b_i}$ of $b_i$ occurring with respect to 'a': in all documents, search the interval from -n to n before and after each valid word 'a' for $b_i$, and take the statistical probability that the word $b_i$ can be found as $P_{a,b_i}$, $i = -n, \ldots, n$; construct a vector from all the $P_{a,b_i}$ and use this vector as the $Y_{1,2n}$ matrix in the HMM algorithm; the transpose of $Y_{1,2n}$ is used as the $[P(Y|X)]_{2n,1}$ matrix, which is input to the HMM algorithm; the output $X_{1,m}$ is the co-occurrence matrix $W_a$ of the valid word 'a'.
7. The article plagiarism detection method as described in claim 1, characterized in that, before the word vector of the keyword is generated, the method further includes updating the co-occurrence matrix of each valid word 'a', specifically: Find the disambiguation parameter $S_{a,c_i}$ between the valid word 'a' in the article to be checked for plagiarism and the i-th valid word $c_i$ in the corpus. The larger the value of the disambiguation parameter, the more likely 'a' and $c_i$ are to be synonyms. Here, $W_a$ represents the co-occurrence matrix of the valid word 'a' in the article to be checked for plagiarism, and $W_{c_i}$ represents the co-occurrence matrix of the known valid word $c_i$ in the corpus, with $i = 1, 2, \ldots, m$, where m represents the total number of valid words in the corpus. If the disambiguation parameters between valid word 'a' and all m valid words in the corpus are less than the preset value Δ, the co-occurrence matrix of valid word 'a' does not need to be modified; otherwise, the co-occurrence matrix of valid word 'a' is updated according to the following formula: In the formula, $m'$ represents the number of valid word types in the corpus whose disambiguation parameter is greater than the preset value Δ, and the corresponding valid words are denoted $c'_1, c'_2, \ldots, c'_{m'}$; $W'_a$ represents the updated co-occurrence matrix of the valid word 'a'; $W_{c'_i}$ represents the co-occurrence matrix of the known valid word $c'_i$ in the corpus. Increase the preset value Δ and repeat the above update operation until the preset number of repetitions is reached, obtaining the final co-occurrence matrix of the valid word 'a' in the article to be checked for plagiarism.
8. The article plagiarism detection method as described in claim 1, characterized in that the method for dividing the text into paragraphs based on keywords is as follows: input the positions of all keywords in the text, the position of the period in each sentence, and the position of the paragraph division into an ANN to obtain the division results.
9. The article plagiarism detection method as described in claim 8, characterized in that the conditions for supervised learning of the ANN include: (1) The paragraph division is located at the period or the natural paragraph division; (2) Let $P_\lambda$ be the ratio of the number of keywords in the λth paragraph to the total number of keywords in that paragraph; then, among all possible partitioning scenarios, the chosen partition should have the highest average value of $P_\lambda$; (3) The number of paragraphs in the main body of the text shall not be less than half the number of natural paragraphs, and not more than the number of natural paragraphs.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium includes a stored computer program, wherein the computer program, when executed by a processor, controls the device on which the storage medium is located to perform the steps of the method as described in any one of claims 1 to 9.