Document duplicate checking method, device, equipment, medium and program product
By extracting key text content from documents and calculating similarity, this technology solves the problem that existing technologies require obtaining the complete document content for plagiarism detection, thus achieving fast and accurate document plagiarism detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHINA UNITED NETWORK COMM GRP CO LTD
- Filing Date
- 2026-03-09
- Publication Date
- 2026-06-30
AI Technical Summary
In existing technologies, document plagiarism detection methods require obtaining the complete document content before they can perform plagiarism checks, resulting in poor efficiency.
Key textual content such as research objectives, research plans, and research results is extracted from the document to be checked, its similarity to stored reference content is calculated, and the plagiarism check result is output, so the complete document content does not need to be obtained.
This improves the efficiency and accuracy of document plagiarism detection and quickly determines whether a document is a duplicate.
Figure CN122309698A_ABST
Description
Technical Field
[0001] This application relates to the field of artificial intelligence technology, and in particular to a document plagiarism detection method, apparatus, device, medium, and program product.
Background Technology
[0002] In today's information-saturated era, document processing and knowledge management have become indispensable parts of all industries. With the acceleration of digitalization, the quantity and variety of documents have increased dramatically. Whether it's academic papers, business reports, patent applications, news reports, or policy documents, they all carry a wealth of information and innovative ideas. Behind this information abundance, however, lie some problems, the most prominent of which is the issue of document originality.
[0003] In related technologies, document plagiarism detection is necessary to verify the originality of a document and determine whether duplicate or similar documents already exist. The typical method for document plagiarism detection is to summarize all the content of the document and then perform plagiarism checks on the summarized text.
[0004] However, in the above method, since the plagiarism check is performed based on the summary text of the entire document, the electronic device needs to obtain the complete document content in order to get accurate plagiarism check results, resulting in poor plagiarism check efficiency.
Summary of the Invention
[0005] This application provides a document plagiarism detection method, apparatus, device, medium, and program product to improve the efficiency of document plagiarism detection.
[0006] In a first aspect, embodiments of this application provide a document plagiarism detection method, which includes: inputting the document to be checked into a plagiarism detection model; extracting content from the document to be checked through the plagiarism detection model to obtain M first key content texts corresponding to M plagiarism detection dimensions, wherein one first key content text corresponds to one plagiarism detection dimension, and the M plagiarism detection dimensions include at least: research objectives, research plan, research results, and research content, where M is an integer greater than or equal to 4; calculating the similarity between the M content text pairs corresponding to the M plagiarism detection dimensions through the plagiarism detection model to obtain M first similarity values, wherein one first similarity value corresponds to one plagiarism detection dimension, and the content text pair corresponding to one plagiarism detection dimension includes the first key content text corresponding to that plagiarism detection dimension and a corresponding second key content text stored in a benchmark database; and outputting the plagiarism detection result of the document to be checked based on the M first similarity values through the plagiarism detection model.
[0007] The technical solution provided in this application brings at least the following beneficial effects: Since the research objectives, research plans, research results, and research content in the document to be checked for plagiarism are usually centered around one or more specific user ideas, they have a certain degree of non-repetition and can summarize all the content of the document to be checked for plagiarism, plagiarism can be checked by using the document content corresponding to the research objectives, research plans, research results, and research content. It is not necessary to obtain the complete document content of the document to be checked for plagiarism, so as to quickly and accurately determine whether the document to be checked for plagiarism is a duplicate document. Thus, the efficiency of document plagiarism checking is improved.
[0008] One possible implementation involves using a plagiarism detection model to output the plagiarism detection result of the document to be checked based on M first similarity values. This includes: using the plagiarism detection model, identifying N text pairs corresponding to the first similarity values that are greater than or equal to a preset similarity threshold from among the M first similarity values as N text pairs to be checked for plagiarism; using the plagiarism detection model, performing text plagiarism detection on each text pair to be checked to determine the correlation between the first key content text and the corresponding second key content text in each text pair to be checked; and using the plagiarism detection model, outputting the plagiarism detection result of the document to be checked based on the correlation between the first key content text and the corresponding second key content text in the N text pairs to be checked for plagiarism.
[0009] Another possible implementation involves using a plagiarism detection model to perform text plagiarism checks on each pair of texts to be checked, determining the correlation between the first key content text and the corresponding second key content text in each pair. This includes: extracting, using the plagiarism detection model, K first key pieces of information from the first key content text in each pair of texts to be checked, and K second key pieces of information from the second key content text in each pair of texts to be checked. The K key pieces of information include at least: the core problem to be solved, the core research subject, the technical field, the technical path, the type of research result, and the function of the research result, where K is an integer greater than or equal to 6. The plagiarism detection model is then used to cross-compare the K first key pieces of information and the K second key pieces of information corresponding to each pair of texts to be checked to determine the correlation between the first key content text and the corresponding second key content text in each pair of texts to be checked.
[0010] Another possible implementation is as follows: using the plagiarism detection model, based on the correlation results between the first key content text and the corresponding second key content text in the N text pairs to be checked for plagiarism, the plagiarism detection result of the document to be checked is output. This includes: using the plagiarism detection model, if all M first similarity values are greater than or equal to a preset similarity threshold, and the correlation results between the first key content text and the corresponding second key content text in any two text pairs to be checked for plagiarism are related, the document to be checked is output as a duplicate document.
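The decision rule in the preceding paragraph can be illustrated with a minimal Python sketch. It is not the applicant's implementation. The function name `is_duplicate`, the dictionary layout, and the default threshold of 0.8 are assumptions made for illustration, and reading "any two text pairs" as "at least two text pairs" is an interpretation that the text does not confirm, so it is exposed as the parameter `min_related_pairs`.

```python
def is_duplicate(
    similarities: dict[str, float],
    related: dict[str, bool],
    threshold: float = 0.8,       # assumed example value
    min_related_pairs: int = 2,   # assumed reading of "any two text pairs"
) -> bool:
    """Return True when the document would be reported as a duplicate.

    similarities: one first similarity value per plagiarism detection dimension.
    related: correlation result per dimension (True means the pair is related).
    """
    # Condition 1: every first similarity value reaches the preset threshold.
    all_above = bool(similarities) and all(
        value >= threshold for value in similarities.values()
    )
    # Condition 2: enough text pairs were judged to be related.
    related_count = sum(1 for is_related in related.values() if is_related)
    return all_above and related_count >= min_related_pairs
```

With four dimensions all scoring at or above the threshold and two pairs judged related, the function returns True. Lowering any single similarity below the threshold makes it return False.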
[0011] Another possible implementation involves using a plagiarism detection model to extract content from the document to be checked, obtaining M first key content texts corresponding to M plagiarism detection dimensions. This includes: using the plagiarism detection model to extract content from the document to be checked based on the M plagiarism detection dimensions, obtaining M third key content texts; and performing data cleaning on the M third key content texts to obtain the M first key content texts.
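The application does not say what the data cleaning step consists of. As one hedged illustration, the Python sketch below normalises Unicode, drops control characters, and collapses whitespace. The function name `clean_key_content` and the choice of cleaning operations are assumptions, not details taken from the application.

```python
import re
import unicodedata


def clean_key_content(text: str) -> str:
    """Clean one extracted key content text (assumed cleaning rules)."""
    # Fold compatibility characters, e.g. non-breaking spaces become spaces.
    text = unicodedata.normalize("NFKC", text)
    # Drop control characters (category Cc) but keep ordinary whitespace.
    text = "".join(
        ch for ch in text if ch in "\n\t " or unicodedata.category(ch) != "Cc"
    )
    # Collapse every run of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


def clean_all(extracted_texts: dict[str, str]) -> dict[str, str]:
    """Apply the cleaning to the extracted text of every dimension."""
    return {dim: clean_key_content(t) for dim, t in extracted_texts.items()}
```

Here `clean_all` stands for the step that turns the M extracted texts into the M first key content texts, with the dictionary keys naming the plagiarism detection dimensions.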
[0012] In a second aspect, embodiments of this application provide a document plagiarism detection device, including: an input module, an extraction module, and a processing module. The input module is used to input the document to be checked into a plagiarism detection model; the extraction module is used to extract content from the document to be checked using the plagiarism detection model, obtaining M first key content texts corresponding to M plagiarism detection dimensions, where each first key content text corresponds to one plagiarism detection dimension, and the M plagiarism detection dimensions include at least: research objectives, research plans, research results, and research content, where M is an integer greater than or equal to 4; the processing module is used to calculate the similarity between the M content text pairs corresponding to the M plagiarism detection dimensions using the plagiarism detection model, obtaining M first similarity values, where each first similarity value corresponds to one plagiarism detection dimension, and each content text pair corresponding to a plagiarism detection dimension includes the first key content text corresponding to that plagiarism detection dimension and a corresponding second key content text stored in a benchmark database; and to output the plagiarism detection result of the document to be checked based on the M first similarity values using the plagiarism detection model.
[0013] One possible implementation is that the aforementioned processing module is specifically used to, through a plagiarism detection model, identify N text pairs corresponding to M first similarity values that are greater than or equal to a preset similarity threshold as N text pairs to be checked for plagiarism; and through the plagiarism detection model, perform text plagiarism detection on each text pair to be checked to determine the correlation results between the first key content text and the corresponding second key content text in each text pair to be checked; and through the plagiarism detection model, based on the correlation results between the M first similarity values and the first key content text and the corresponding second key content text in the N text pairs to be checked, output the plagiarism detection results of the document to be checked.
[0014] Another possible implementation is that the aforementioned processing module is specifically used to extract, using the plagiarism detection model, K first key pieces of information from the first key content text in each text pair to be checked for plagiarism, and K second key pieces of information from the second key content text in each text pair to be checked for plagiarism, where the K key pieces of information include at least: the core problem to be solved, the core research subject, the technical field, the technical path, the type of research result, and the function of the research result, and K is an integer greater than or equal to 6; and to cross-compare, through the plagiarism detection model, the K first key pieces of information and the K second key pieces of information corresponding to each text pair to be checked for plagiarism to determine the correlation results between the first key content text and the corresponding second key content text in each text pair to be checked for plagiarism.
[0015] Another possible implementation is that the above processing module is specifically used to output the document to be checked as a duplicate document when the M first similarity values are all greater than or equal to the preset similarity threshold and the correlation results between the first key content text and the corresponding second key content text in any two of the N text pairs to be checked are related.
[0016] Another possible implementation is that the above extraction module is specifically used to extract content from the document to be checked based on the M plagiarism detection dimensions using the plagiarism detection model, to obtain M third key content texts; and to perform data cleaning on the M third key content texts to obtain the M first key content texts.
[0017] In a third aspect, this application provides an electronic device comprising: a processor and a memory; the memory stores a program or instructions executable on the processor, wherein the program or instructions, when executed by the processor, implement the method of the first aspect described above.
[0018] In a fourth aspect, this application provides a readable storage medium on which a program or instructions are stored, which, when executed by a computer, implement the method of the first aspect described above.
[0019] In a fifth aspect, this application provides a computer program product stored in a storage medium, which, when executed by a computer, implements the method described in the first aspect.
[0020] In a sixth aspect, embodiments of this application provide a chip including a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is used to run programs or instructions to implement the method described in the first aspect.
[0021] The beneficial effects of the second to sixth aspects mentioned above are described in the corresponding description of the first aspect and will not be repeated here.
Attached Figure Description
[0022] Figure 1 is a schematic diagram of the network architecture for a document plagiarism detection method provided in an embodiment of this application;
[0023] Figure 2 is a flowchart of a document plagiarism detection method provided in an embodiment of this application;
[0024] Figure 3 is a flowchart of another document plagiarism detection method provided in an embodiment of this application;
[0025] Figure 4 is a flowchart of yet another document plagiarism detection method provided in an embodiment of this application;
[0026] Figure 5 is a flowchart of yet another document plagiarism detection method provided in an embodiment of this application;
[0027] Figure 6 is a flowchart of yet another document plagiarism detection method provided in an embodiment of this application;
[0028] Figure 7 is a flowchart of the implementation process of a document plagiarism detection method provided in an embodiment of this application;
[0029] Figure 8 is a schematic diagram of the structure of a document plagiarism detection device provided in an embodiment of this application;
[0030] Figure 9 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.
Detailed Implementation
[0031] The document plagiarism detection method, apparatus, equipment, media, and program products provided in this application will be described in detail below with reference to the accompanying drawings.
[0032] The technical solutions of the embodiments of this application will be clearly described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by those skilled in the art based on the embodiments of this application are within the scope of protection of this application.
[0033] The terms "first," "second," etc., used in the specification and claims of this application are used to distinguish similar objects and not to describe a specific order or sequence. It should be understood that such use of data can be interchanged where appropriate so that embodiments of this application can be implemented in orders other than those illustrated or described herein, and the objects distinguished by "first," "second," etc., are generally of the same class and the number of objects is not limited; for example, a first object can be one or more. Furthermore, in the specification and claims, "and / or" indicates at least one of the connected objects, and the character " / " generally indicates that the preceding and following objects are in an "or" relationship.
[0034] The terms "at least one," "at least one of," etc., used in the specification and claims of this application refer to any one, any two, or a combination of two or more of the included items. For example, at least one of a, b, and c can mean: "a," "b," "c," "a and b," "a and c," "b and c," and "a, b, and c," where a, b, and c can be single or multiple. Similarly, "at least two" refers to two or more items, and its meaning is similar to that of "at least one."
[0035] In the description of this application, unless otherwise stated, "a plurality of" means two or more.
[0036] The document plagiarism detection method, apparatus, device, medium, and program product provided in this application embodiment can be applied to scenarios such as academic paper plagiarism detection, business report plagiarism detection, patent application plagiarism detection, news report plagiarism detection, and policy document plagiarism detection.
[0037] Currently, in order to implement the strategic deployment of scientific and technological innovation and deepen the concept of innovation-driven development, we are vigorously promoting the tackling of cutting-edge technologies and the cultivation of innovative projects, which has led to a significant increase in the number of research project applications. Against this backdrop, relying solely on manual review of project application materials makes it difficult to comprehensively and quickly identify duplicate applications or projects with highly similar content. This results in the duplication and fragmented use of research resources, hindering the concentration of efforts to overcome key core technologies.
[0038] To address the aforementioned issues, three technical solutions have been provided in the relevant technologies, which will be explained below.
[0039] Solution 1: Analyze the document to be checked for plagiarism using the style of document content elements to obtain the structured data corresponding to the document. This structured data is used to indicate the data structure corresponding to the document content of the document to be checked. This structured data includes the hierarchical headings in the document to be checked. Determine the functional points of the document to be checked based on the structured data. Use the functional points of historical documents to perform similarity matching on the functional points of the document to be checked to obtain the plagiarism check results of the document to be checked.
[0040] Solution 2: Use a word segmentation tool to segment the project name; then, use a natural language processing algorithm to check for plagiarism. If the plagiarism result is greater than the set threshold, use a parsing extraction algorithm to extract the key content that needs to be checked for plagiarism from the document; use a natural language processing algorithm to calculate plagiarism. If the plagiarism result is greater than the set threshold, it is judged as suspected duplication and warning data is generated.
[0041] Solution 3: Obtain a fine-tuning sample set based on the acquired set of sample scientific and technological documents, and use it to fine-tune a candidate large model to obtain the fine-tuned target large model; construct the corresponding target scientific and technological knowledge graph, and obtain the set of documents similar to the document to be checked based on that knowledge graph; call the model capabilities of the target large model, and use them to obtain the target plagiarism check result of the document based on the set of similar documents.
[0042] However, in Solution 1 above, the electronic device needs to obtain the complete content of the document to be checked for plagiarism. Then, based on this complete content, it generates structured data corresponding to the document. In other words, the electronic device performs plagiarism checks based on a summary of the entire document, resulting in poor plagiarism detection efficiency. In Solution 2, the electronic device uses traditional semantic similarity analysis methods, but these methods have limitations in terms of computational efficiency and accuracy. In Solution 3, the electronic device also needs to obtain the complete content of the document to be checked in order to construct a scientific and technological knowledge graph corresponding to the document and obtain the plagiarism detection results, leading to poor plagiarism detection efficiency.
[0043] To address the aforementioned technical problems, embodiments of this application provide a document plagiarism detection method, apparatus, device, medium, and program product. Since the research objectives, research plans, research results, and research content in the document to be checked are usually centered around one or more specific user ideas, they have a certain degree of non-repetition and can summarize all the content of the document to be checked. Therefore, by checking the document content corresponding to the research objectives, research plans, research results, and research content, it is not necessary to obtain the complete document content of the document to be checked. This allows for a quick and accurate determination of whether the document to be checked is a duplicate document, thereby improving the efficiency of document plagiarism detection by electronic devices.
[0044] The document plagiarism detection method, apparatus, device, medium, and program products provided in this application will be described in detail below with reference to the accompanying drawings.
[0045] Figure 1 is a schematic diagram of the network architecture of a document plagiarism detection method provided in an embodiment of this application. As shown in Figure 1, the network architecture includes a document plagiarism detection device 101 and a terminal device 102. The document plagiarism detection device 101 and the terminal device 102 are interconnected.
[0046] In some embodiments, the document plagiarism detection device 101 may be a server, a computer, or a processor or processing unit within a server or computer. The server may be a single server or a server cluster consisting of multiple servers. It should be noted that this application embodiment does not limit the specific device form of the document plagiarism detection device 101. Figure 1 shows the document plagiarism detection device 101 as a single server by way of example.
[0047] In some embodiments, the terminal device may be a mobile phone, tablet computer, laptop computer, handheld computer, in-vehicle electronic device, mobile internet device (MID), augmented reality (AR) / virtual reality (VR) device, robot, wearable device, personal computer (PC), ultra-mobile personal computer (UMPC), netbook, or personal digital assistant (PDA), etc., and the embodiments of this application do not specifically limit it. Figure 1 shows the terminal device 102 as a mobile phone by way of example.
[0048] In some embodiments, the terminal device 102 can send the document to be checked to the document plagiarism detection device 101. After receiving the document, the document plagiarism detection device 101 can input it into the plagiarism detection model, use the model to perform plagiarism detection on the document, and output the plagiarism detection result. Finally, the document plagiarism detection device 101 sends the result to the terminal device 102 so that the terminal device 102 can determine, based on the result, whether the document to be checked is a duplicate document.
[0049] It should be noted that the network architecture described in the embodiments of this application is for the purpose of more clearly illustrating the technical solutions of the embodiments of this application, and does not constitute a limitation on the technical solutions provided in the embodiments of this application. As network architectures evolve, the technical solutions provided in the embodiments of this application are also applicable to similar technical problems.
[0050] Figure 2 is a flowchart of a document plagiarism detection method provided in an embodiment of this application. As shown in Figure 2, the method can be implemented by the above-mentioned document plagiarism detection device and specifically includes the following steps 201 to 204.
[0051] Step 201: The document plagiarism detection device inputs the document to be checked into the plagiarism detection model.
[0052] In some embodiments, the document to be checked for plagiarism can be a document stored in the document plagiarism detection device, or a document received by the document plagiarism detection device. The specific method can be determined according to actual usage requirements, and this application embodiment does not impose any limitations.
[0053] In some embodiments, the plagiarism detection model described above can be an artificial intelligence (AI) model, a neural network model, or a large language model, etc. The specific model can be determined according to actual usage requirements, and this application embodiment does not impose any limitations.
[0054] For example, the document to be checked for plagiarism may include text, flowcharts, interface diagrams, and other content.
[0055] In some embodiments, the document plagiarism detection device can input the document to be checked into the plagiarism detection model through the document input interface of the plagiarism detection model.
[0056] Step 202: The document plagiarism detection device extracts content from the document to be checked using the plagiarism detection model, and obtains the M first key content texts corresponding to the M plagiarism detection dimensions.
[0057] In some embodiments, each of the M first key content texts corresponds to one plagiarism detection dimension. The M plagiarism detection dimensions include at least: research objectives, research plan, research results, and research content, where M is an integer greater than or equal to 4.
[0058] In some embodiments, the aforementioned M plagiarism detection dimensions can be user-defined or preset by the document plagiarism detection device. The specific dimensions can be determined according to actual usage requirements, and this application embodiment does not impose any limitations.
[0059] For example, the document plagiarism detection device can extract content from the document to be checked based on M plagiarism detection dimensions using a plagiarism detection model, thereby obtaining M first key content texts corresponding to each of the M plagiarism detection dimensions.
[0060] For example, the document plagiarism detection device can use the M keywords corresponding to the M plagiarism detection dimensions to locate, in the document to be checked, the M pieces of text that contain those keywords, and then perform content extraction on them to obtain the M first key content texts corresponding to the M plagiarism detection dimensions.
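The keyword-based extraction described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the applicant's implementation: it assumes each dimension appears in the document as a heading line made up of its keyword, and the keyword strings, the dictionary `DIMENSION_KEYWORDS`, and the function `extract_key_content` are all hypothetical. The application performs this step with the plagiarism detection model, for which a regular expression is only a stand-in.

```python
import re

# Hypothetical keyword per plagiarism detection dimension (illustrative only).
DIMENSION_KEYWORDS = {
    "research_objectives": "Research Objectives",
    "research_plan": "Research Plan",
    "research_results": "Research Results",
    "research_content": "Research Content",
}


def extract_key_content(document: str) -> dict[str, str]:
    """Return the text found under each dimension's keyword heading."""
    alternation = "|".join(re.escape(k) for k in DIMENSION_KEYWORDS.values())
    # A heading is a line holding only a keyword, optionally followed by ":".
    heading = re.compile(rf"^({alternation})[ \t]*:?[ \t]*$", re.MULTILINE)
    keyword_to_dimension = {kw: dim for dim, kw in DIMENSION_KEYWORDS.items()}

    matches = list(heading.finditer(document))
    extracted = {}
    for i, match in enumerate(matches):
        # The section runs until the next heading or the end of the document.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(document)
        dimension = keyword_to_dimension[match.group(1)]
        extracted[dimension] = document[match.end():end].strip()
    return extracted
```

Dimensions whose keyword does not appear are simply absent from the returned dictionary, so a caller can detect incomplete documents by comparing its keys with `DIMENSION_KEYWORDS`.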
[0061] In some embodiments, after obtaining the M first key content texts corresponding to the M plagiarism detection dimensions of the document to be checked, the document plagiarism detection device can store the M first key content texts in the benchmark database.
[0062] Step 203: The document plagiarism detection device calculates the similarity between M content text pairs corresponding to M plagiarism detection dimensions through the plagiarism detection model, and obtains M first similarity values.
[0063] In some embodiments, each of the M first similarity values corresponds to one plagiarism detection dimension, and the content text pair corresponding to a plagiarism detection dimension includes the first key content text corresponding to that dimension and the corresponding second key content text stored in the benchmark database.
[0064] It is understandable that the aforementioned benchmark database stores M second key content texts corresponding to the M plagiarism detection dimensions of different plagiarism detection documents.
[0065] It can be understood that the second key content text mentioned above is the second key content text that corresponds, for the same plagiarism detection dimension, to the first key content text.
[0066] In some embodiments, the document plagiarism detection device can obtain M first similarity values based on the character similarity between the second key content text corresponding to each plagiarism detection dimension and the corresponding first key content text through the plagiarism detection model.
[0067] In some implementations, document plagiarism detection devices can use a plagiarism detection model to obtain M first similarity values based on the semantic similarity between the second key content text and the corresponding first key content text for each plagiarism detection dimension.
[0068] It should be noted that the specific process by which the above-mentioned document plagiarism detection device obtains M first similarity values through character similarity or semantic similarity can be found in the description in the relevant technology. To avoid repetition, it will not be repeated here.
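As a hedged illustration of the character-similarity option in step 203, the Python sketch below computes one similarity value per dimension with the standard library's `difflib.SequenceMatcher`. The application does not name a specific metric, so the choice of `SequenceMatcher.ratio()` is an assumption, as are the function names and the dictionary layout keyed by dimension. The semantic-similarity option would need a text-embedding model and is not shown.

```python
from difflib import SequenceMatcher


def character_similarity(first_text: str, second_text: str) -> float:
    """Character-level similarity in [0, 1] (difflib ratio, an assumed metric)."""
    return SequenceMatcher(None, first_text, second_text).ratio()


def first_similarity_values(
    first_texts: dict[str, str],
    second_texts: dict[str, str],
) -> dict[str, float]:
    """Compute one first similarity value per plagiarism detection dimension.

    first_texts: key content texts of the document to be checked.
    second_texts: key content texts of one document in the benchmark database.
    """
    return {
        dimension: character_similarity(text, second_texts[dimension])
        for dimension, text in first_texts.items()
        if dimension in second_texts  # skip dimensions without a counterpart
    }
```

Identical texts score 1.0 and texts with no characters in common score 0.0, so the values can be compared directly with a preset threshold.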
[0069] Step 204: The document plagiarism detection device outputs the plagiarism detection results of the document to be checked based on the M first similarity values through the plagiarism detection model.
[0070] It should be noted that the specific implementation process of step 204 above can be found in the following embodiments, and will not be repeated here to avoid repetition.
[0071] This application provides a document plagiarism detection method. The document plagiarism detection device can input the document to be checked into a plagiarism detection model. Then, through the plagiarism detection model, the content of the document to be checked is extracted to obtain M first key content texts corresponding to M plagiarism detection dimensions. Each first key content text corresponds to one plagiarism detection dimension. The M plagiarism detection dimensions include at least: research objectives, research plan, research results, and research content, where M is an integer greater than or equal to 4. Next, through the plagiarism detection model, the similarity between the M content text pairs corresponding to the M plagiarism detection dimensions is calculated to obtain M first similarity values. Each first similarity value corresponds to one plagiarism detection dimension. The content text pair corresponding to one plagiarism detection dimension includes the first key content text and the corresponding second key content text stored in the benchmark database. Finally, through the plagiarism detection model, based on the M first similarity values, the plagiarism detection result of the document to be checked is output. In this solution, since the research objectives, research plans, research results, and research content in the document to be checked for plagiarism are usually centered around one or more specific user ideas, they have a certain degree of non-repetition and can summarize all the content of the document to be checked for plagiarism. Therefore, by checking the document content corresponding to the research objectives, research plans, research results, and research content, it is not necessary to obtain the complete document content of the document to be checked for plagiarism. This makes it possible to quickly and accurately determine whether the document to be checked for plagiarism is a duplicate document, thus improving the efficiency of document plagiarism checking on electronic devices.
[0072] In some embodiments, with reference to Figure 2 and as shown in Figure 3, step 204 can be implemented through steps 204a to 204c described below.
[0073] Step 204a: Using the plagiarism detection model, the document plagiarism detection device takes, from the M content text pairs, the N text pairs whose first similarity values are greater than or equal to a preset similarity threshold as the N text pairs to be checked for plagiarism.
[0074] In some embodiments, the aforementioned preset similarity threshold can be user-defined; or it can be preset by the document plagiarism detection device. The specific threshold can be determined according to actual usage requirements, and this application embodiment does not impose any limitations.
[0075] For example, the preset similarity threshold can be 0.8.
[0076] For example, the document plagiarism detection device can compare each of the M first similarity values with a preset similarity threshold, thereby determining N first similarity values that are greater than or equal to the preset similarity threshold from the M first similarity values, and then determine the corresponding text pair based on the N first similarity values.
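The threshold comparison described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `select_pairs_to_check` and the dimension keys are hypothetical, and the 0.8 default is taken from the example threshold given above.

```python
from typing import Dict, List


def select_pairs_to_check(similarities: Dict[str, float],
                          threshold: float = 0.8) -> List[str]:
    """Return the dimensions whose first similarity value is >= threshold."""
    return [dim for dim, score in similarities.items() if score >= threshold]


# Hypothetical first similarity values for M = 4 dimensions.
scores = {
    "research_objectives": 0.91,
    "research_plan": 0.80,
    "research_results": 0.42,
    "research_content": 0.86,
}

selected = select_pairs_to_check(scores)
print(selected)  # N = 3 dimensions pass the coarse screening here
```

Only the text pairs of the selected dimensions would go on to step 204b.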
[0077] Step 204b: The document plagiarism detection device uses a plagiarism detection model to perform text plagiarism detection on each of the N text pairs to be checked, and determines the correlation between the first key content text and the corresponding second key content text in each of the N text pairs to be checked.
[0078] It can be understood that the above correlation results indicate the semantic similarity between the first key content text and the corresponding second key content text in each of the N text pairs to be checked for plagiarism.
[0079] It should be noted that the specific implementation process of step 204b above can be found in the following embodiments, and will not be repeated here to avoid repetition.
[0080] Step 204c: Using the plagiarism detection model, the document plagiarism detection device outputs the plagiarism detection result of the document to be checked based on the M first similarity values and on the correlation results between the first key content text and the corresponding second key content text in each of the N text pairs to be checked.
[0081] It should be noted that the specific implementation process of step 204c above can be found in the following embodiments, and will not be repeated here to avoid repetition.
[0082] In some embodiments, the M content text pairs, each formed by a first key content text and the corresponding second key content text stored in the benchmark database, are first filtered by similarity value, so that only the pairs that pass the filter go on to text plagiarism detection. This reduces the computational load of the model and improves the document plagiarism detection efficiency of the document plagiarism detection device.
[0083] In some embodiments, with reference to Figure 3 and as shown in Figure 4, step 204b can be implemented through steps 204b1 and 204b2 described below.
[0084] Step 204b1: The document plagiarism detection device extracts K first key information from the first key content text in each of the N text pairs to be checked for plagiarism using the plagiarism detection model; and extracts K second key information from the second key content text in each of the N text pairs to be checked for plagiarism.
[0085] In some embodiments, the above K key information items include at least: the core problem to be solved, the core research subject, the technical field, the technical path, the type of research result, and the function of the research result, where K is an integer greater than or equal to 6.
[0086] For example, the document plagiarism detection device can traverse the N text pairs to be checked through the plagiarism detection model. For each dimension to be checked, it extracts the aforementioned first key information using a large language model and prompt engineering, and removes redundant and interfering information.
[0087] In some embodiments of this application, the document plagiarism detection device can store K first key information of the first key content text in each of the N text pairs to be checked for plagiarism, and K second key information of the second key content text in each of the N text pairs to be checked for plagiarism, in JSON format.
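One way the per-dimension extraction prompt could be assembled is sketched below. The prompt wording, the function name `build_extraction_prompt`, and the grouping table are illustrative assumptions; the application does not disclose its actual prompts, and the call to the large language model is deliberately left out.

```python
# Hypothetical grouping of the six key information items by dimension.
KEY_INFO_BY_DIMENSION = {
    "goal": ["core problem to be solved"],
    "research_content": ["core research subject"],
    "technical_solution": ["technical field", "technical path"],
    "achievement": ["type of research result", "function of the research result"],
}


def build_extraction_prompt(dimension: str, text: str) -> str:
    """Build an extraction prompt for one dimension (wording is an assumption)."""
    wanted = "; ".join(KEY_INFO_BY_DIMENSION[dimension])
    return (
        f"From the text below, extract: {wanted}.\n"
        "Ignore redundant or unrelated details and answer as a JSON array.\n"
        f"Text:\n{text}"
    )


prompt = build_extraction_prompt("technical_solution", "Example key content text.")
print(prompt)
```

Asking for a JSON array in the prompt keeps the model's answer close to the storage format, although the reply would still need validation before it is stored.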
[0088] For example, the document plagiarism detection device, through its plagiarism detection model, can extract the following specific content for each dimension based on the standards and requirements for determining duplication in research projects: The overall project objective extracts the core problem to be solved; the project research content extracts the core research object; the technical solution extracts the adopted technical field and key technical path; and the landmark achievement extracts the type and function of the achievement. Typically, a research project often includes multiple core problems to be solved, research objects, key technical paths, and achievements; therefore, the extraction of the above core content involves multiple elements. The final extraction result is corrected and stored in the following JSON format:
[0089]
{
  "goal": ["Core problem to be solved 1", "Core problem to be solved 2", "..."],
  "research_content": ["Research Subject 1", "Research Subject 2", "..."],
  "technical_solution": [{
    "field": "Technical Field 1",
    "pathway": "Technology Path 1"
  }, {
    "field": "Technical Field 2",
    "pathway": "Technology Path 2"
  }, { ...
  }],
  "achievement": [{
    "type": "Type 1",
    "function1": "Function 1"
  }, {
    "type": "Type 1",
    "function1": "Function 1"
  }, { ...
  }]
}
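A record in the format above could be checked before storage roughly as follows. This is a sketch under stated assumptions: the helper `load_key_info` is hypothetical, only the four top-level field names come from the format shown, and the sample values are placeholders (the key `function1` is kept as printed in the source).

```python
import json

# Top-level fields of the key information record shown above.
EXPECTED_FIELDS = ("goal", "research_content", "technical_solution", "achievement")


def load_key_info(raw: str) -> dict:
    """Parse a key information record and check that every field is a list."""
    record = json.loads(raw)
    if not isinstance(record, dict):
        raise ValueError("key information record must be a JSON object")
    for field in EXPECTED_FIELDS:
        if not isinstance(record.get(field), list):
            raise ValueError(f"field {field!r} is missing or is not a list")
    return record


sample = json.dumps({
    "goal": ["Core problem to be solved 1"],
    "research_content": ["Research Subject 1"],
    "technical_solution": [{"field": "Technical Field 1",
                            "pathway": "Technology Path 1"}],
    "achievement": [{"type": "Type 1", "function1": "Function 1"}],
})

record = load_key_info(sample)
print({field: len(record[field]) for field in EXPECTED_FIELDS})
```

Rejecting malformed records at this point is one possible reading of the "corrected" step mentioned above; the application does not say how correction is performed.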
[0111] Step 204b2: The document plagiarism detection device uses a plagiarism detection model to cross-compare the K first key information corresponding to each of the N text pairs to be checked for plagiarism and the K second key information corresponding to each of the N text pairs to be checked for plagiarism, and determines the correlation results between the first key content text and the corresponding second key content text in each text pair to be checked for plagiarism.
[0112] For example, the plagiarism detection model can specifically write similarity judgment prompts for each dimension, and submit the judgment prompts, N pairs of texts to be checked for plagiarism, and key information together to a large language model for relevance judgment. The specific relevance judgment criteria are as follows: Whether the overall project goal is relevant: It is necessary to determine whether the core problem to be solved by the project is relevant; Whether the project research content is relevant: It is necessary to determine whether the core research object in the research content is relevant; Whether the technical solution is relevant: It is necessary to determine whether the technical field and key technical path are relevant; Whether the landmark achievement is relevant: It is necessary to determine whether the type of achievement and its functional essence are relevant. Since the core content of each dimension usually contains multiple items, a pairwise cross-comparison is required to obtain a relevance matrix during relevance judgment. Taking the overall goal relevance analysis of project A and project B as an example, the relevance analysis results are corrected to the following JSON format and stored:
[0113]
{
  "goal_relevance": [
    [A0, B0, true],
    [A0, B1, false],
    [A1, B0, false],
    [A1, B1, false]
  ]
}
[0121] In Project A, two core problems to be solved are extracted from the overall objective and numbered A0 and A1. Similarly, in Project B, two core problems to be solved are extracted from the overall objective and numbered B0 and B1. Based on the relevance criteria for the overall objectives, a pairwise cross-comparison is performed (here 2 × 2, giving an array with four entries), yielding four comparison results. [A0, B0, true] indicates that the large language model determines A0 and B0 are related; [A0, B1, false] indicates that the large language model determines A0 and B1 are not related, and so on. The relevance analysis results for the research content, technical solutions, and landmark achievements of Projects A and B are stored in the same format as those for the overall objectives, in matrix variables named research_content_relevance, technical_solution_relevance, and achievement_relevance, respectively.
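The pairwise cross-comparison can be sketched as below. The large language model judgment is replaced here by a trivial word-overlap function, `keyword_overlap`, which is only a stand-in so that the example runs; the function names and sample goals are hypothetical.

```python
from itertools import product
from typing import Callable, List, Tuple


def cross_compare(items_a: List[str], items_b: List[str],
                  is_related: Callable[[str, str], bool]) -> List[Tuple[str, str, bool]]:
    """Compare every item of project A with every item of project B."""
    return [
        (f"A{i}", f"B{j}", is_related(a, b))
        for (i, a), (j, b) in product(enumerate(items_a), enumerate(items_b))
    ]


def keyword_overlap(a: str, b: str) -> bool:
    """Stand-in for the LLM relevance judgment: any shared word counts as related."""
    return bool(set(a.lower().split()) & set(b.lower().split()))


goals_a = ["reduce network latency", "improve billing accuracy"]
goals_b = ["lower network latency", "detect document duplication"]

goal_relevance = cross_compare(goals_a, goals_b, keyword_overlap)
for row in goal_relevance:
    print(row)
```

With two items on each side the result has four rows, in the same A0/B0, A0/B1, A1/B0, A1/B1 order as the stored matrix.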
[0122] In some embodiments, the document plagiarism detection device first performs a rapid initial screening of the first key content text corresponding to M dimensions. If the similarity is high, it further extracts key information for in-depth comparison. This balances efficiency and accuracy, effectively identifying suspected duplicate document content. This improves the efficiency and accuracy of the document plagiarism detection device.
[0123] In some embodiments, with reference to Figure 3 and as shown in Figure 5, step 204c above can be implemented through step 204c1 below.
[0124] Step 204c1: The document plagiarism detection device, through the plagiarism detection model, outputs that the document to be checked is a duplicate document if the M first similarity values are all greater than or equal to the preset similarity threshold, and the correlation results between the first key content text and the corresponding second key content text in any two of the N text pairs to be checked are related.
[0125] For example, the document plagiarism detection device determines whether the document is a duplicate by combining the results obtained in the above steps: the vector similarity scores between the current document and the content stored in the benchmark database in the four dimensions of research objectives, research plan, research results, and research content, and the relevance judgment results for the core content of those dimensions. The judgment rules are as follows: (1) the similarity scores of all four dimensions are greater than 0.8; and (2) the relevance judgment meets one of the following requirements: a. any two of the overall objectives, research content, and technical routes are related, that is, any two of goal_relevance, research_content_relevance, and technical_solution_relevance are true; or b. the landmark achievements are related, that is, achievement_relevance is true.
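The two judgment rules can be combined as in the sketch below. Several points are assumptions rather than statements from the application: a relevance matrix is treated as "true" when any of its rows is true, "any two" is read as "at least two", and the comparison uses "greater than or equal to" as in step 204c1 (rule (1) above says "greater than"). The function names are hypothetical.

```python
from typing import Dict, List, Tuple

Matrix = List[Tuple[str, str, bool]]

CORE_MATRICES = ("goal_relevance", "research_content_relevance",
                 "technical_solution_relevance")


def dimension_related(matrix: Matrix) -> bool:
    """Assumption: a dimension is related if any cross-compared pair is related."""
    return any(flag for _, _, flag in matrix)


def is_duplicate(similarities: Dict[str, float],
                 relevance: Dict[str, Matrix],
                 threshold: float = 0.8) -> bool:
    # Rule (1): every dimension must reach the similarity threshold.
    if not all(score >= threshold for score in similarities.values()):
        return False
    # Rule (2a): at least two of the three core dimensions are related.
    related_core = sum(dimension_related(relevance.get(name, []))
                       for name in CORE_MATRICES)
    # Rule (2b): the landmark achievements are related.
    achievement = dimension_related(relevance.get("achievement_relevance", []))
    return related_core >= 2 or achievement


similarities = {"goal": 0.93, "research_content": 0.88,
                "technical_solution": 0.85, "achievement": 0.81}
relevance = {
    "goal_relevance": [("A0", "B0", True)],
    "research_content_relevance": [("A0", "B0", False)],
    "technical_solution_relevance": [("A0", "B0", True)],
    "achievement_relevance": [("A0", "B0", False)],
}
print(is_duplicate(similarities, relevance))
```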
[0126] In some embodiments, the document plagiarism detection device comprehensively judges whether the document to be checked is a duplicate document by using the vector similarity values of the current document with the four dimensions of overall objectives, research content, technical solutions, and landmark achievements stored in the benchmark library, as well as the relevance of the core content. This improves the accuracy of the document plagiarism detection device in checking documents.
[0127] In some embodiments, with reference to Figure 2 and as shown in Figure 6, step 202 can be implemented through steps 202a and 202b below.
[0128] Step 202a: The document plagiarism detection device extracts content from the document to be checked based on the plagiarism detection model and the M plagiarism detection dimensions to obtain M third key content texts.
[0129] It should be noted that the specific process by which the document plagiarism detection device obtains M third key content texts can be found in the description of related technologies, and will not be repeated here to avoid duplication.
[0130] Step 202b: The document plagiarism detection device cleans the M third key content texts to obtain M first key content texts.
[0131] For example, the document plagiarism detection device can remove meaningless characters (such as special symbols and extra whitespace) from the M third key content texts and convert them to a uniform UTF-8 encoding. The cleaned text is then associated with the document name, number, and other identifiers to form structured project information, which is stored in the benchmark database.
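The cleaning step might look like the following sketch. The set of characters treated as "meaningless" and the function name `clean_text` are assumptions; the application only names special symbols and whitespace as examples and does not specify the rules.

```python
import re
import unicodedata


def clean_text(raw: bytes, source_encoding: str) -> bytes:
    """Decode, strip special symbols and extra whitespace, re-encode as UTF-8."""
    text = raw.decode(source_encoding, errors="ignore")
    # Normalize compatibility forms (for example full-width characters).
    text = unicodedata.normalize("NFKC", text)
    # Assumption: keep word characters, whitespace, and basic punctuation only.
    text = re.sub(r"[^\w\s.,;:!?()%-]", "", text)
    # Collapse runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text).strip()
    return text.encode("utf-8")


raw = "  Research goal:\t## improve **efficiency**\n\n".encode("utf-16")
cleaned = clean_text(raw, "utf-16")
print(cleaned.decode("utf-8"))
```

The cleaned bytes could then be stored together with the document name and number as the structured record described above.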
[0132] In some embodiments, the document deduplication device can use the Qwen-Embedding-8B vector model to convert the structured text of M third key content texts into fixed 4096-dimensional vectors, store them in a benchmark database, and establish an efficient retrieval index.
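Once each key content text is a fixed-length vector, a per-dimension similarity value can be computed. The sketch below uses cosine similarity, which is an assumption: the application does not name the similarity metric. The three-element vectors stand in for the 4096-dimensional embeddings, and the function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_u * norm_v)


def dimension_similarities(doc: Dict[str, Sequence[float]],
                           ref: Dict[str, Sequence[float]]) -> Dict[str, float]:
    """One similarity value per dimension shared by the two documents."""
    return {dim: cosine_similarity(vec, ref[dim]) for dim, vec in doc.items()}


doc_vectors = {"goal": [0.2, 0.9, 0.1], "achievement": [1.0, 0.0, 0.0]}
ref_vectors = {"goal": [0.2, 0.9, 0.1], "achievement": [0.0, 1.0, 0.0]}
print(dimension_similarities(doc_vectors, ref_vectors))
```

In a deployment that uses a vector database, this computation would normally be delegated to the database's index rather than done in application code.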
[0133] In some embodiments, the document plagiarism detection device can perform data cleaning on M third key content texts to obtain M first key content texts, thereby improving the accuracy of the document plagiarism detection device in obtaining the M first key content texts.
[0134] The document plagiarism detection method of this application will be described below through specific embodiments.
[0135] As shown in Figure 7, the implementation process of the document plagiarism detection method provided in this application embodiment includes the following S1 to S9:
[0136] S1. The document plagiarism detection device inputs the document to be checked into the plagiarism detection model.
[0137] S2. The document plagiarism detection device extracts data from the document to be checked and obtains M first key content texts corresponding to M plagiarism detection dimensions.
[0138] For example, the document plagiarism detection device uses digital technology to extract key content text corresponding to the plagiarism detection dimensions from the document to be checked. The plagiarism detection dimensions include: overall project goals, project research content, technical solutions, and landmark achievements.
[0139] S3. The document plagiarism detection device cleans and stores the M first key content texts corresponding to the M plagiarism detection dimensions.
[0140] For example, the document plagiarism detection device can clean the text content extracted in step S2, removing meaningless characters (such as special symbols, whitespace, etc.) and unifying it to UTF-8 encoding format. The extracted text content is then associated with identifiers such as the document name and number to be checked, forming structured project information, which is stored in a MySQL database.
[0141] S4. The document plagiarism detection device vectorizes the M cleaned first key content texts corresponding to the M plagiarism detection dimensions and stores the resulting vectors in the database.
[0142] For example, the document plagiarism detection device can use the Qwen-Embedding-8B vector model to convert the structured text stored in step S3 into fixed 4096-dimensional vectors, store them in the Milvus vector database, and establish an efficient retrieval index.
[0143] S5. For each of the M dimensions, the document plagiarism detection device calculates the semantic similarity between the vectorized first key content text and the second key content texts, in the same dimension, of all historical documents stored in the benchmark database.
[0144] S6. The document plagiarism detection device performs a coarse screening of highly similar text pairs.
[0145] For example, the document plagiarism detection device can take the N text pairs corresponding to the first similarity values that are greater than or equal to 0.8 out of the M first similarity values as N text pairs to be checked for plagiarism.
[0146] S7. The document plagiarism detection device extracts key information from each of the N text pairs to be checked for plagiarism.
[0147] S8. The document plagiarism detection device performs fine screening of the highly similar texts based on the key information in each of the N text pairs to be checked for plagiarism, and obtains the correlation results.
[0148] S9. The document plagiarism detection device determines whether the document to be checked is a duplicate based on the M first similarity values and the correlation results corresponding to the key information in each text pair to be checked.
[0149] Because the research objectives, research plan, research results, and research content of the document to be checked usually center on one or more specific ideas of the user, they are largely distinctive to that document and summarize its content. Plagiarism can therefore be checked using only the content corresponding to these dimensions, without obtaining the complete content of the document, so that whether the document is a duplicate is determined quickly and accurately. This improves the efficiency of document plagiarism detection.
[0150] It should be noted that the descriptions of each step S1 to S9 in this embodiment can be found in the descriptions in the above embodiments, and will not be repeated here.
[0151] It should be noted that the above-described method embodiments, or the various possible implementations of the method embodiments, can be executed individually, or, provided there is no conflict, they can be combined with each other. The specific implementation can be determined according to actual usage requirements, and this application embodiment does not impose any restrictions on this.
[0152] It should be noted that for a detailed explanation of the steps performed by each module and their beneficial effects, please refer to the description in the above embodiments, which will not be repeated here.
[0153] As can be seen, the above mainly describes the solutions provided by the embodiments of this application from a methodological perspective. To achieve the above functions, the embodiments of this application provide corresponding hardware structures and / or software modules for executing each function. Those skilled in the art should readily recognize that, in conjunction with the modules and algorithm steps of the various examples described in the embodiments disclosed herein, the embodiments of this application can be implemented in hardware or a combination of hardware and computer software. Whether a function is executed in hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0154] This application embodiment can divide the document plagiarism detection device into functional modules according to the above method example. For example, each function can be divided into a separate functional module, or two or more functions can be integrated into one processing module. The integrated module can be implemented in hardware or as a software functional module. Optionally, the module division in this application embodiment is illustrative and only represents one logical functional division; other division methods may be used in actual implementation.
[0155] In some embodiments, this application also provides a document plagiarism detection device. This device may include one or more functional modules for implementing the document plagiarism detection method described in the above embodiments.
[0156] For example, Figure 8 is a schematic diagram of a document plagiarism detection device provided in an embodiment of this application. As shown in Figure 8, the document plagiarism detection device 900 includes an input module 901, an extraction module 902, and a processing module 903.
[0157] The input module 901 is used to input the document to be checked into the plagiarism detection model. The extraction module 902 is used to extract content from the document to be checked through the plagiarism detection model to obtain M first key content texts corresponding to M plagiarism detection dimensions, where each first key content text corresponds to one plagiarism detection dimension, the M plagiarism detection dimensions include at least research objectives, research plan, research results, and research content, and M is an integer greater than or equal to 4. The processing module 903 is used to calculate the similarity of the M content text pairs corresponding to the M plagiarism detection dimensions to obtain M first similarity values, each corresponding to one plagiarism detection dimension. The content text pair corresponding to a plagiarism detection dimension includes the first key content text and the corresponding second key content text stored in the benchmark database. Based on the M first similarity values, the processing module 903 outputs the plagiarism detection result of the document to be checked.
[0158] In the document plagiarism detection device provided in this application, the research objectives, research plan, research results, and research content of the document to be checked usually center on one or more specific ideas of the user, are largely distinctive to that document, and summarize its content. The device can therefore determine quickly and accurately whether the document to be checked is a duplicate without obtaining its complete content, which improves the efficiency of document plagiarism detection.
[0159] In some embodiments, the processing module 903 is specifically used to: use the plagiarism detection model to take, from the M content text pairs, the N text pairs whose first similarity values are greater than or equal to a preset similarity threshold as N text pairs to be checked for plagiarism; use the plagiarism detection model to perform text plagiarism detection on each text pair to be checked and determine the correlation result between the first key content text and the corresponding second key content text in each text pair to be checked; and use the plagiarism detection model to output the plagiarism detection result of the document to be checked based on the M first similarity values and the correlation results between the first key content text and the corresponding second key content text in the N text pairs to be checked for plagiarism.
[0160] In other embodiments, the processing module 903 is specifically used to extract K first key information from the first key content text in each text pair to be checked for plagiarism using a plagiarism detection model; and to extract K second key information from the second key content text in each text pair to be checked for plagiarism; the K key information includes at least: the core problem to be solved, the core research subject, the technical field, the technical path, the type of research result, and the function of the research result, where K is an integer greater than or equal to 6; and to cross-compare the K first key information and the K second key information corresponding to each text pair to be checked for plagiarism using a plagiarism detection model to determine the correlation results between the first key content text and the corresponding second key content text in each text pair to be checked for plagiarism.
[0161] In some other embodiments, the processing module 903 is specifically used to output the document to be checked as a duplicate document when the M first similarity values are all greater than or equal to a preset similarity threshold and the correlation results between the first key content text and the corresponding second key content text in any two of the N text pairs to be checked are related.
[0162] In some other embodiments, the extraction module 902 is specifically used to extract content from the document to be checked based on the M plagiarism detection dimensions using the plagiarism detection model, to obtain M third key content texts; and to perform data cleaning on the M third key content texts to obtain M first key content texts.
[0163] It should be noted that the document plagiarism detection device can implement all the processes implemented in the above method embodiments and achieve the same beneficial effects. To avoid repetition, it will not be described again here.
[0164] Where the functions of the integrated modules described above are implemented in hardware, this application provides a possible structural schematic diagram of the electronic device involved in the above embodiments. As shown in Figure 9, the electronic device 90 includes a processor 92, a communication interface 93, and a bus 94. Optionally, the electronic device 90 may also include a memory 91.
[0165] Processor 92 may implement or execute various exemplary logic blocks, modules, and circuits described in conjunction with the disclosure of this application. Processor 92 may be a central processing unit, a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, transistor logic devices, hardware components, or any combination thereof. Processor 92 may also be a combination that implements computational functions, such as a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
[0166] Communication interface 93 is used to connect with other devices via a communication network. This communication network can be Ethernet, wireless access network, wireless local area network (WLAN), etc.
[0167] The memory 91 may be a read-only memory (ROM) or other type of static storage device capable of storing static information and instructions, random access memory (RAM) or other type of dynamic storage device capable of storing information and instructions, or electrically erasable programmable read-only memory (EEPROM), disk storage media or other magnetic storage devices, or any other medium capable of carrying or storing desired program code in the form of instructions or data structures and accessible by a computer, but is not limited thereto.
[0168] As one possible implementation, the memory 91 can exist independently of the processor 92. The memory 91 can be connected to the processor 92 via a bus 94 and is used to store instructions or program code. When the processor 92 calls and executes the instructions or program code stored in the memory 91, it can implement the document deduplication method provided in the embodiments of this application.
[0169] In another possible implementation, memory 91 can also be integrated with processor 92.
[0170] Bus 94 can be an Extended Industry Standard Architecture (EISA) bus, etc. Bus 94 can be divided into address bus, data bus, control bus, etc. For ease of representation, the bus is shown in Figure 9 as a single thick line, but this does not mean that there is only one bus or one type of bus.
[0171] Through the above description of the implementation methods, those skilled in the art can clearly understand that, for the sake of convenience and brevity, only the division of the above functional modules is used as an example. In actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.
[0172] This application embodiment also provides a chip, which includes a processor and a communication interface. The communication interface is coupled to the processor. The processor is used to run programs or instructions to implement the various processes of the above-described document plagiarism detection method embodiments and can achieve the same technical effect. To avoid repetition, it will not be described again here.
[0173] It should be understood that the chip mentioned in the embodiments of this application may also be referred to as a system-level chip, system chip, chip system, or system-on-a-chip, etc.
[0174] This application also provides a readable storage medium storing a program or instructions that, when executed by a computer, implement the document plagiarism detection method provided in the above embodiments. It is understood that all or part of the processes in the above method embodiments can be executed by computer instructions instructing related hardware. The readable storage medium can be an internal storage unit, such as the memory, of the electronic device of any of the foregoing embodiments. The readable storage medium can also be an external storage device of the electronic device, such as a pluggable hard drive, Smart Media Card (SMC), Secure Digital (SD) card, or flash card with which the electronic device is equipped. Further, the readable storage medium can include both internal storage units of the electronic device and external storage devices. The readable storage medium is used to store the computer program and other programs and data required by the electronic device. The readable storage medium can also be used to temporarily store data that has been output or will be output.
[0175] This application also provides a computer program product, which is stored in a storage medium and, when executed by a computer, implements the document deduplication method provided in the above embodiments.
[0176] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element. Furthermore, it should be noted that the scope of the methods and apparatuses in the embodiments of this application is not limited to performing functions in the order shown or discussed, but may also include performing functions substantially simultaneously or in the reverse order, depending on the functions involved. For example, the described methods may be performed in a different order than described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.
[0177] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a computer software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal (which may be a mobile phone, computer, server, or network device, etc.) to execute the methods described in the various embodiments of this application.
[0178] The embodiments of this application have been described above with reference to the accompanying drawings. However, this application is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make many other forms under the guidance of this application without departing from the spirit and scope of the claims, and all of these forms are within the protection scope of this application.
Claims
1. A document plagiarism detection method, characterized by comprising: inputting a document to be checked for plagiarism into a plagiarism detection model; extracting, by the plagiarism detection model, content from the document to be checked to obtain M first key content texts corresponding to M plagiarism detection dimensions, wherein each first key content text corresponds to one plagiarism detection dimension, the M plagiarism detection dimensions include at least research objectives, research plan, research results, and research content, and M is an integer greater than or equal to 4; calculating, by the plagiarism detection model, the similarity of each of M content text pairs corresponding to the M plagiarism detection dimensions to obtain M first similarity values, wherein each first similarity value corresponds to one plagiarism detection dimension, and the content text pair corresponding to a plagiarism detection dimension includes the second key content text corresponding to that plagiarism detection dimension and the corresponding first key content text stored in the benchmark database; and outputting, by the plagiarism detection model, a plagiarism detection result for the document to be checked based on the M first similarity values.
2. The document plagiarism detection method according to claim 1, characterized in that the step of outputting, by the plagiarism detection model, the plagiarism detection result for the document to be checked based on the M first similarity values comprises: taking, by the plagiarism detection model, the N content text pairs whose first similarity values, among the M first similarity values, are greater than or equal to a preset similarity threshold as N text pairs to be checked for plagiarism; performing, by the plagiarism detection model, text plagiarism detection on each text pair to be checked, and determining a correlation result between the first key content text and the corresponding second key content text in each text pair to be checked; and outputting, by the plagiarism detection model, the plagiarism detection result for the document to be checked based on the M first similarity values and on the correlation results between the first key content text and the corresponding second key content text in the N text pairs to be checked.
3. The document plagiarism detection method according to claim 2, characterized in that the step of performing, by the plagiarism detection model, text plagiarism detection on each text pair to be checked, and determining the correlation result between the first key content text and the corresponding second key content text in each text pair to be checked, comprises: extracting, by the plagiarism detection model, K first key information pieces from the first key content text in each text pair to be checked, and K second key information pieces from the second key content text in each text pair to be checked, wherein the K key information pieces include at least the core problem to be solved, the core research subject, the technical field, the technical path, the type of research result, and the function of the research result, and K is an integer greater than or equal to 6; and cross-comparing, by the plagiarism detection model, the K first key information pieces and the K second key information pieces corresponding to each text pair to be checked, and determining the correlation result between the first key content text and the corresponding second key content text in each text pair to be checked.
4. The document plagiarism detection method according to claim 2, characterized in that the step of outputting, by the plagiarism detection model, the plagiarism detection result for the document to be checked based on the M first similarity values and on the correlation results between the first key content text and the corresponding second key content text in the N text pairs to be checked comprises: outputting, by the plagiarism detection model, that the document to be checked is a duplicate document when all M first similarity values are greater than or equal to the preset similarity threshold and the correlation results between the first key content text and the corresponding second key content text in any two of the N text pairs to be checked are "related".
5. The document plagiarism detection method according to claim 1, characterized in that the step of extracting, by the plagiarism detection model, content from the document to be checked to obtain the M first key content texts corresponding to the M plagiarism detection dimensions comprises: extracting, by the plagiarism detection model and based on the M plagiarism detection dimensions, content from the document to be checked to obtain M third key content texts; and performing data cleaning on the M third key content texts to obtain the M first key content texts.
6. A document plagiarism detection device, characterized in that the document plagiarism detection device comprises an input module, an extraction module, and a processing module; the input module is used to input a document to be checked for plagiarism into a plagiarism detection model; the extraction module is used to extract, through the plagiarism detection model, content from the document to be checked to obtain M first key content texts corresponding to M plagiarism detection dimensions, wherein each first key content text corresponds to one plagiarism detection dimension, the M plagiarism detection dimensions include at least research objectives, research plan, research results, and research content, and M is an integer greater than or equal to 4; the processing module is used to calculate, through the plagiarism detection model, the similarity of each of M content text pairs corresponding to the M plagiarism detection dimensions to obtain M first similarity values, wherein each first similarity value corresponds to one plagiarism detection dimension, and the content text pair corresponding to a plagiarism detection dimension includes the second key content text corresponding to that plagiarism detection dimension and the corresponding first key content text stored in the benchmark database; and to output, through the plagiarism detection model, a plagiarism detection result for the document to be checked based on the M first similarity values.
7. The document plagiarism detection device according to claim 6, characterized in that the processing module is specifically used to: take, through the plagiarism detection model, the N content text pairs whose first similarity values, among the M first similarity values, are greater than or equal to a preset similarity threshold as N text pairs to be checked for plagiarism; perform, through the plagiarism detection model, text plagiarism detection on each text pair to be checked, and determine a correlation result between the first key content text and the corresponding second key content text in each text pair to be checked; and output, through the plagiarism detection model, the plagiarism detection result for the document to be checked based on the M first similarity values and on the correlation results between the first key content text and the corresponding second key content text in the N text pairs to be checked.
8. The document plagiarism detection device according to claim 7, characterized in that the processing module is specifically used to: extract, through the plagiarism detection model, K first key information pieces from the first key content text in each text pair to be checked, and K second key information pieces from the second key content text in each text pair to be checked, wherein the K key information pieces include at least the core problem to be solved, the core research subject, the technical field, the technical path, the type of research result, and the function of the research result, and K is an integer greater than or equal to 6; and cross-compare, through the plagiarism detection model, the K first key information pieces and the K second key information pieces corresponding to each text pair to be checked, and determine the correlation result between the first key content text and the corresponding second key content text in each text pair to be checked.
9. The document plagiarism detection device according to claim 7, characterized in that the processing module is specifically used to output that the document to be checked is a duplicate document when all M first similarity values are greater than or equal to the preset similarity threshold and the correlation results between the first key content text and the corresponding second key content text in any two of the N text pairs to be checked are "related".
10. The document plagiarism detection device according to claim 6, characterized in that the extraction module is specifically used to extract, through the plagiarism detection model and based on the M plagiarism detection dimensions, content from the document to be checked to obtain M third key content texts, and to perform data cleaning on the M third key content texts to obtain the M first key content texts.
11. An electronic device, characterized in that it comprises a processor and a memory, the memory storing a program or instructions executable on the processor, wherein the program or instructions, when executed by the processor, implement the document plagiarism detection method according to any one of claims 1-5.
12. A readable storage medium, characterized in that the readable storage medium stores a program or instructions which, when executed by a computer, implement the document plagiarism detection method according to any one of claims 1-5.
13. A computer program product, characterized in that the computer program product is stored in a storage medium and, when executed by a computer, implements the document plagiarism detection method according to any one of claims 1-5.
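
For illustration only, the flow recited in the method claims (per-dimension similarity, threshold filtering, correlation check, and the duplicate decision) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed plagiarism detection model: the key content texts are assumed to have been extracted already, `SequenceMatcher` stands in for the unspecified similarity calculation, `word_overlap_related` stands in for the cross-comparison of K key information pieces, the threshold value 0.8 is arbitrary, and "any two of the N text pairs" is read here as "at least two". All function and parameter names are hypothetical.

```python
import re
from difflib import SequenceMatcher
from typing import Callable, Dict, List, Tuple

# The four dimensions named in claim 1 (so M = 4 in this sketch).
DIMENSIONS = ("research objectives", "research plan",
              "research results", "research content")


def clean(text: str) -> str:
    """Hypothetical data cleaning: collapse whitespace and lowercase."""
    return re.sub(r"\s+", " ", text).strip().lower()


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the claims do not name a metric."""
    return SequenceMatcher(None, a, b).ratio()


def word_overlap_related(a: str, b: str) -> bool:
    """Stand-in correlation check: Jaccard word overlap of at least 0.5."""
    wa, wb = set(a.split()), set(b.split())
    if not wa or not wb:
        return False
    return len(wa & wb) / len(wa | wb) >= 0.5


def check_document(
    extracted: Dict[str, str],
    benchmark: Dict[str, str],
    threshold: float = 0.8,  # assumed value for the preset threshold
    is_related: Callable[[str, str], bool] = word_overlap_related,
) -> Tuple[Dict[str, float], List[str], bool]:
    """Return (per-dimension scores, dimensions to re-check, duplicate flag)."""
    doc = {d: clean(extracted[d]) for d in DIMENSIONS}
    ref = {d: clean(benchmark[d]) for d in DIMENSIONS}

    # One similarity value per dimension (the M first similarity values).
    scores = {d: similarity(doc[d], ref[d]) for d in DIMENSIONS}

    # Pairs at or above the threshold become the N pairs to be checked.
    flagged = [d for d in DIMENSIONS if scores[d] >= threshold]

    # Correlation result for each flagged pair.
    related = [d for d in flagged if is_related(doc[d], ref[d])]

    # Assumed reading of the decision rule: every score reaches the
    # threshold and at least two flagged pairs are judged related.
    duplicate = len(flagged) == len(DIMENSIONS) and len(related) >= 2
    return scores, flagged, duplicate
```

Calling `check_document(texts, texts)` with the same four texts on both sides yields a similarity of 1.0 in every dimension and flags the document as a duplicate, whereas comparing against unrelated benchmark texts leaves the list of pairs to be checked empty and the duplicate flag false.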