Data classification processing method and device, computer device and storage medium

By performing two-stage training on the pre-trained language model to acquire and process contract clause data, the problem of low accuracy in traditional contract clause classification is solved, and higher accuracy in contract clause type identification and content understanding is achieved.

CN122241329A · Pending · Publication Date: 2026-06-19 · TENCENT TECHNOLOGY (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
TENCENT TECHNOLOGY (SHENZHEN) CO LTD
Filing Date
2024-12-17
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Traditional methods of classifying contract terms have the problem of low accuracy, especially when enterprises or institutions need to manage multiple contracts in a unified manner, which can easily lead to inadequate interpretation of content and classification errors.

Method used

By acquiring multiple training contract clause data, extracting the clause content and determining the clause type, and using a pre-trained language model for the first stage of training to obtain an initial contract recognition model, a second stage of training is then conducted based on the prompts indicating the type of contract clause to obtain a trained contract classification model.

Benefits of technology

It improves the accuracy of the contract classification model, enhances the ability to understand the content of contract terms and the ability to distinguish the types of contract terms, and improves the precision of contract classification.


Abstract

This application relates to a data classification processing method, apparatus, computer equipment, and storage medium. The method includes: acquiring multiple training contract clause data; for each training contract clause data, extracting clause content from the training contract clause data; determining the contract clause type to which the training contract clause data belongs; performing a first-stage training on a pre-trained language model based on content prompts describing the clause content, the multiple training contract clause data, and the clause content of each of the multiple training contract clause data, to obtain an initial contract recognition model; and performing a second-stage training on the initial contract recognition model based on discrimination prompts for determining the contract clause type, the multiple training contract clause data, and the contract clause type to which each of the multiple training contract clause data belongs, to obtain a trained contract classification model. This method can improve the classification accuracy of the contract classification model.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to a data classification and processing method, apparatus, computer equipment, computer-readable storage medium, and computer program product.

Background Technology

[0002] With the development of artificial intelligence technology and the promotion and application of various businesses, in order to ensure the security and stability of resources for both parties involved in the business and to avoid malicious leakage or loss of resources, a contract system has emerged for both parties involved in specific businesses to be signed, so as to constrain the execution process of the business through contract terms and contents.

[0003] However, due to differences in business type or industry, the corresponding contract terms and contents also vary. For example, for the same company or organization, contracts signed with different companies or organizations usually require different contract terms and contents depending on the business type or industry. If a company or organization needs to manage multiple contracts in a unified manner and still uses the traditional content identification and classification method, a large amount of human and material resources is required to analyze and interpret the content of the contract terms in order to determine the category of the contract terms corresponding to different businesses. When there are a large number of contract terms, inadequate interpretation of the content and incorrect contract classification can easily occur. Therefore, the traditional contract term classification method still has the problem of low classification accuracy.

Summary of the Invention

[0004] Therefore, it is necessary to provide a data classification processing method, apparatus, computer equipment, computer-readable storage medium, and computer program product that can improve the accuracy of classifying contract terms in order to address the above-mentioned technical problems.

[0005] In a first aspect, this application provides a data classification processing method, comprising: acquiring multiple training contract clause data; for each training contract clause data, extracting clause content from the training contract clause data and determining the contract clause type to which the training contract clause data belongs; performing a first-stage training on a pre-trained language model based on content prompts describing the clause content, the multiple training contract clause data, and the clause content of each of the multiple training contract clause data, to obtain an initial contract recognition model; and performing a second-stage training on the initial contract recognition model based on discrimination prompts for determining the contract clause type, the multiple training contract clause data, and the contract clause type to which each of the multiple training contract clause data belongs, to obtain a trained contract classification model.

[0006] In a second aspect, this application also provides a data classification processing apparatus, comprising: a training contract clause data acquisition module for acquiring multiple training contract clause data; a contract clause type determination module for extracting clause content from each training contract clause data and determining the contract clause type to which the training contract clause data belongs; an initial contract recognition model acquisition module for performing a first-stage training on a pre-trained language model based on content prompts describing the clause content, the multiple training contract clause data, and the clause content of each of the multiple training contract clause data, to obtain an initial contract recognition model; and a contract classification model acquisition module for performing a second-stage training on the initial contract recognition model based on discrimination prompts for determining the contract clause type, the multiple training contract clause data, and the contract clause type to which each of the multiple training contract clause data belongs, to obtain a trained contract classification model.

[0007] Thirdly, this application also provides a computer device, including a memory and a processor. The memory stores a computer program, and the processor executes the computer program to perform the following steps: acquiring multiple training contract clause data; for each training contract clause data, extracting clause content from the training contract clause data and determining the contract clause type to which the training contract clause data belongs; performing a first-stage training on a pre-trained language model based on content prompts describing the clause content, the multiple training contract clause data, and the clause content of each of the multiple training contract clause data, to obtain an initial contract recognition model; and performing a second-stage training on the initial contract recognition model based on discrimination prompts for determining the contract clause type, the multiple training contract clause data, and the contract clause type to which each of the multiple training contract clause data belongs, to obtain a trained contract classification model.

[0008] Fourthly, this application also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, performs the following steps: acquiring multiple training contract clause data; for each training contract clause data, extracting clause content from the training contract clause data and determining the contract clause type to which the training contract clause data belongs; performing a first-stage training on a pre-trained language model based on content prompts describing the clause content, the multiple training contract clause data, and the clause content of each of the multiple training contract clause data, to obtain an initial contract recognition model; and performing a second-stage training on the initial contract recognition model based on discrimination prompts for determining the contract clause type, the multiple training contract clause data, and the contract clause type to which each of the multiple training contract clause data belongs, to obtain a trained contract classification model.

[0009] Fifthly, this application also provides a computer program product, including a computer program that, when executed by a processor, performs the following steps: acquiring multiple training contract clause data; for each training contract clause data, extracting clause content from the training contract clause data and determining the contract clause type to which the training contract clause data belongs; performing a first-stage training on a pre-trained language model based on content prompts describing the clause content, the multiple training contract clause data, and the clause content of each of the multiple training contract clause data, to obtain an initial contract recognition model; and performing a second-stage training on the initial contract recognition model based on discrimination prompts for determining the contract clause type, the multiple training contract clause data, and the contract clause type to which each of the multiple training contract clause data belongs, to obtain a trained contract classification model.

[0010] In the aforementioned data classification and processing methods, apparatus, computer equipment, computer-readable storage media, and computer program products, multiple training contract clause data are acquired. For each training contract clause data, clause content is extracted from the training contract clause data. Based on the content prompts used to describe the clause content, the multiple training contract clause data, and the clause content of each of the multiple training contract clause data, a first-stage training is performed on a pre-trained language model to obtain an initial contract recognition model. This allows the initial contract recognition model to learn the clause content and specific definitions of multiple different contract clause data. Furthermore, by determining the contract clause type to which each of the multiple training contract clause data belongs, and based on the discrimination prompts used to determine the contract clause type, the multiple training contract clause data, and the contract clause type to which each of the multiple training contract clause data belongs, a second-stage training is performed on the initial contract recognition model to obtain a trained contract classification model. In this way, starting from an initial contract recognition model that can already identify specific clause content, the model is further trained and its parameters are updated based on the contract clause types and discrimination prompts. This enables the trained contract classification model to have both the ability to understand the clause content and the ability to distinguish the contract clause type. In other words, through two-stage fine-tuning training, the model accuracy and contract classification accuracy of the obtained contract classification model are improved.

Attached Figure Description

[0011] To more clearly illustrate the technical solutions in the embodiments of this application or related technologies, the drawings used in the description of the embodiments of this application or related technologies will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0012] Figure 1 is an application environment diagram of the data classification and processing method in one embodiment;

[0013] Figure 2 is a flowchart illustrating a data classification and processing method in one embodiment;

[0014] Figure 3 is a schematic diagram of the process for obtaining an initial contract recognition model in one embodiment;

[0015] Figure 4 is a flowchart illustrating the process of obtaining a trained contract classification model in one embodiment;

[0016] Figure 5 is a schematic diagram illustrating the process of grouping multiple training contract clause data and the contract clause types to which each of the multiple training contract clause data belongs in one embodiment;

[0017] Figure 6 is a flowchart illustrating the process of obtaining a trained contract classification model in another embodiment;

[0018] Figure 7 is a flowchart illustrating the data classification and processing method in another embodiment;

[0019] Figure 8 is a schematic diagram of the overall processing flow of a data classification and processing method in one embodiment;

[0020] Figure 9 is a schematic diagram illustrating API deployment and inference for a contract classification model in one embodiment;

[0021] Figure 10 is a structural block diagram of a data classification and processing device in one embodiment;

[0022] Figure 11 is an internal structural diagram of a computer device in one embodiment.

Detailed Implementation

[0023] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0024] The data classification and processing method provided in this application involves artificial intelligence technology and can be applied to various scenarios such as online media, corporate auditing, and online financial transactions. Specifically, it can be applied in the application environment shown in Figure 1, in which terminal 102 communicates with server 104 via a network. A data storage system can store the data that server 104 needs to process. This data storage system can be integrated onto server 104 or located on the cloud or other network servers. Terminal 102 can be, but is not limited to, various personal computers, laptops, smartphones, tablets, IoT devices, portable wearable devices, and aircraft. IoT devices can include smart speakers, smart TVs, smart air conditioners, smart in-vehicle devices, and projection devices. Portable wearable devices can include smartwatches, smart bracelets, and head-mounted devices. Head-mounted devices can be virtual reality (VR) devices, augmented reality (AR) devices, smart glasses, etc. Server 104 can be a standalone physical server, a server cluster consisting of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminal 102 and the server 104 can be connected directly or indirectly through wired or wireless communication, and this embodiment does not impose any restrictions on this.

[0025] Terminal 102 and server 104 can each execute the data classification processing method provided in this embodiment independently, or they can execute it cooperatively. Taking cooperative execution by terminal 102 and server 104 as an example, server 104 obtains multiple training contract clause data, extracts clause content from each training contract clause data, and performs a first-stage training on the pre-trained language model based on the content prompt information describing the clause content, the multiple training contract clause data, and the clause content of each of the multiple training contract clause data, to obtain an initial contract recognition model. Furthermore, server 104 determines the contract clause type to which each of the multiple training contract clause data belongs, and performs a second-stage training on the initial contract recognition model based on the discrimination prompt information used to determine the contract clause type, the multiple training contract clause data, and the contract clause type to which each of the multiple training contract clause data belongs, to obtain a trained contract classification model. Thus, when server 104 receives a contract clause type discrimination request triggered by terminal 102, it can use the trained contract classification model to determine the category of the unprocessed contract clause data corresponding to the request, obtain the target contract clause type corresponding to the unprocessed contract clause data, and feed that target contract clause type back to terminal 102.
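
The order of the two training stages described above may be sketched as follows. This is an illustrative outline under stated assumptions, not the implementation of this application: the function `two_stage_training`, the `fine_tune` callback, and the stand-in `record_stage` are hypothetical names, and the actual fine-tuning procedure is left to the callback.

```python
from typing import Callable, Dict, List

Sample = Dict[str, str]  # one prompt sample: "instruction", "input", "output"
FineTune = Callable[[object, List[Sample]], object]  # hypothetical training hook


def two_stage_training(pretrained_model: object,
                       content_samples: List[Sample],
                       discrimination_samples: List[Sample],
                       fine_tune: FineTune) -> object:
    """Run first-stage then second-stage training and return the final model."""
    # Stage 1: learn clause content from content prompt samples.
    initial_recognition_model = fine_tune(pretrained_model, content_samples)
    # Stage 2: learn clause-type discrimination, starting from the stage-1 model.
    return fine_tune(initial_recognition_model, discrimination_samples)


def record_stage(model: object, samples: List[Sample]) -> object:
    """Stand-in for real fine-tuning: appends the sample count to a history list."""
    return list(model) + [len(samples)]


content = [{"instruction": "explain", "input": "clause name", "output": "content"}] * 2
discrimination = [{"instruction": "classify", "input": "content", "output": "type"}]
history = two_stage_training([], content, discrimination, record_stage)
```

With the stand-in callback, `history` simply records that the content samples were consumed before the discrimination samples; a real callback would return updated model parameters instead.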

[0026] In one exemplary embodiment, as shown in Figure 2, a data classification processing method is provided. Taking the application of the method to server 104 in Figure 1 as an example, the method includes the following steps S202 to S208:

[0027] Step S202: Obtain multiple training contract terms data.

[0028] Specifically, training contract terms data refers to sample contract terms data used for model training. Contract terms data can be understood as data items that include the name of the contract terms and the specific content of the contract terms. That is, the server can obtain multiple training contract terms data, such as multiple data items that include the name of the contract terms and the specific content of the contract terms.

[0029] For example, a training contract clause data specifically includes: 1) a contract clause titled "Breach of Contract Clause (Defective Ownership of Property or Illegal Rental of Property)"; 2) the clause content corresponding to the contract clause titled "Breach of Contract Clause (Defective Ownership of Property or Illegal Rental of Property)," namely, "If the leased property has defective ownership or is illegally rented, affecting the lessee's use, the lessee shall bear the liability for breach of contract and compensate for losses. The amount of liquidated damages should be clear, specific, and operable."

[0030] Step S204: For each training contract clause data, extract the clause content from the training contract clause data and determine the contract clause type to which the training contract clause data belongs.

[0031] Specifically, the server acquires multiple training contract clause data, such as multiple data items including contract clause names and specific clause content. For each training contract clause data, the server extracts the specific clause content. For example, for a training contract clause data whose name is "Breach of Contract Clause (Defective Ownership of Property or Illegal Rental of Property)," the extracted clause content corresponding to this clause is: "If the ownership of the rented property is defective or it is illegally rented, affecting the lessee's use of the property, the lessee shall bear the liability for breach of contract and compensate for losses. The amount of liquidated damages should be clear, specific, and operable."

[0032] Furthermore, for each training contract clause data, after extracting the specific clause content from the training contract clause data, the server also needs to determine the contract clause type to which the training contract clause data belongs. The contract clause type can be determined based on the contract clause name and the corresponding clause content; it can be understood as the category to which the contract clause belongs.

[0033] For example, contract terms may specifically include general terms and special terms. General terms include: 1) terms regarding the information of the contracting parties, 2) terms regarding the subject matter of the contract, 3) terms regarding quantity and quality, 4) terms regarding price or remuneration, 5) terms regarding the time, place, and method of performance, and 6) terms regarding dispute resolution, etc. Special terms may specifically include: 1) force majeure clauses, 2) confidentiality clauses, 3) warranty and guarantee clauses, 4) intellectual property clauses, 5) risk transfer clauses, 6) contract modification and assignment clauses, 7) contract termination clauses, 8) earnest money clauses, and 9) clauses regarding liability for breach of contract, etc.

[0034] Optionally, if a training contract clause includes the following: the contract clause title is "Breach of Contract Clause (Defective Ownership or Illegal Rental of Property)", and the clause content is "If the ownership of the rented property is defective or it is illegally rented, affecting the lessee's use, the lessee shall bear the liability for breach of contract and compensate for losses. The amount of liquidated damages should be clear, specific, and operable," then the contract clause type of the training contract clause can be determined as "Breach of Contract Clause" in the special clauses category.
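
For illustration, the general and special clause types listed above may be held in a lookup table so that a clause type label can be mapped to its category. This is a sketch under assumptions: the names `CLAUSE_TAXONOMY` and `category_of` are hypothetical, the lists are not exhaustive (the description above ends each list with "etc."), and the way the clause type itself is determined from the clause name and content is not shown here.

```python
from typing import Dict, List, Optional

# Clause types named in the description; the lists are illustrative, not exhaustive.
CLAUSE_TAXONOMY: Dict[str, List[str]] = {
    "general": [
        "information of the contracting parties",
        "subject matter of the contract",
        "quantity and quality",
        "price or remuneration",
        "time, place, and method of performance",
        "dispute resolution",
    ],
    "special": [
        "force majeure",
        "confidentiality",
        "warranty and guarantee",
        "intellectual property",
        "risk transfer",
        "contract modification and assignment",
        "contract termination",
        "earnest money",
        "liability for breach of contract",
    ],
}


def category_of(clause_type: str) -> Optional[str]:
    """Return "general" or "special" for a known clause type, else None."""
    wanted = clause_type.strip().lower()
    for category, types in CLAUSE_TAXONOMY.items():
        if wanted in types:
            return category
    return None
```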

[0035] Step S206: Based on the content prompt information used to describe the content of the terms, multiple training contract terms data, and the terms content of each of the multiple training contract terms data, perform a first-stage training on the pre-trained language model to obtain an initial contract recognition model.

[0036] Content prompts, specifically, can be understood as the prompts provided during the first stage of training of a pre-trained language model. These prompts enhance the model's ability to understand and recognize specific clauses in contract data, thus improving its capacity to comprehend and describe the content of contract terms. Content prompts can also be understood as fine-tuning data during model training. Fine-tuning data refers to explicit instructions or prompts provided to the model during training to guide it in learning how to better perform specific tasks, thereby improving its zero-shot or few-shot learning ability on unseen tasks. Specifically, fine-tuning data can be specific instructions and corresponding examples provided to the model during training, enabling it to understand and follow these instructions, resulting in more accurate and context-aware responses or generation.

[0037] Specifically, the server obtains the content interpretation task information configured for the clause content, the contract clause name, and the clause interpretation content, and also obtains a preset prompt information template. Based on the preset prompt information template, the server constructs content prompt information describing the clause content. Here, the content interpretation task information refers to the specific task prompt information configured for the clause content that needs to be executed to interpret the clause content. The contract clause name refers to the name of the specific contract clause corresponding to the clause content, such as a breach of contract clause (defective property ownership or illegal rental of property). The clause interpretation content can be understood as the specific clause content, as well as further supplementary explanations of the clause content.

[0038] The preset prompt information template specifically includes a task description information field, a context text content field, and a target output information field. For example, the preset prompt information template (or general format) is: {"instruction": "prompt instruction", "input": "input context", "output": "target output sequence"}. Here, "instruction" refers to the task description information field in the preset prompt information template, "input" refers to the context text content field, and "output" refers to the target output information field. "Prompt instruction" refers to the specific task prompts or instructions used to guide the model in executing the required actions, "input context" refers to the context content input into the model, and "target output sequence" refers to the target data output by the model.

[0039] Furthermore, when constructing content prompt information to describe the content of the clauses based on the preset prompt information template, as well as the content interpretation task information, contract clause name, and clause interpretation content configured for the clause content, the specific steps are to sequentially fill the task description information field, context text content field, and target output information field in the preset prompt information template to construct the content prompt information for the clause content.

[0040] For example, during the first-stage training of the pre-trained language model, it is necessary to enable the model to learn the specific definition and meaning of each contract clause name, allowing the model to initially grasp the basic background knowledge of the contract clauses and lay the foundation for subsequent classification tasks. Specifically, in this embodiment, for the first-stage training process of the pre-trained language model, the content prompt information set to describe the clause content is as follows: {"instruction": "Please explain the meaning of the following contract clause names", "input": "Breach of contract liability clause (defective ownership of the property or illegal rental of the property)", "output": "If the ownership of the property is defective or the property is illegally rented, affecting the lessee's use, the lessee shall bear the liability for breach of contract and compensate for the losses. The amount of liquidated damages should be clear, specific, and operable."}
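
The filling of the preset prompt information template for the first stage may be sketched as follows. The field names "instruction", "input", and "output" come from the template described above; the function name `build_content_prompt` and the serialization of one sample per JSON line are assumptions made for illustration only.

```python
import json
from typing import Dict


def build_content_prompt(task_info: str, clause_name: str,
                         clause_explanation: str) -> Dict[str, str]:
    """Fill the three template fields in order for a first-stage sample."""
    return {
        "instruction": task_info,      # task description information field
        "input": clause_name,          # context text content field
        "output": clause_explanation,  # target output information field
    }


sample = build_content_prompt(
    "Please explain the meaning of the following contract clause names",
    "Breach of contract liability clause (defective ownership of the property "
    "or illegal rental of the property)",
    "If the ownership of the property is defective or the property is illegally "
    "rented, affecting the lessee's use, the lessee shall bear the liability for "
    "breach of contract and compensate for the losses. The amount of liquidated "
    "damages should be clear, specific, and operable.",
)

# Assumption: one JSON object per line is a convenient on-disk form for such samples.
line = json.dumps(sample, ensure_ascii=False)
```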

[0041] In an exemplary embodiment, after obtaining content prompt information to describe the terms, the server specifically performs a first-stage training on the pre-trained language model based on the content prompt information, multiple training contract terms data, and the terms content of each of the multiple training contract terms data. When the model training termination condition is met, an initial contract recognition model is obtained.

[0042] Specifically, during the first phase of training, the server inputs content prompts and multiple training contract clause data into the pre-trained language model. The content prompts guide the pre-trained language model to perform the task of recognizing and interpreting the clause content corresponding to the specific contract clause name, thereby obtaining the predicted clause content corresponding to each training contract clause data.

[0043] Furthermore, the server determines the first model training loss value in the first stage of training based on the predicted clause content corresponding to each of the multiple training contract clause data and the clause content extracted from the multiple training contract clause data. Then, the parameters of the pre-trained language model are adjusted according to the first model training loss value. When the model training ends, the initial contract recognition model can be obtained.
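
This application does not specify the form of the first model training loss value. Purely as an assumption for illustration, the sketch below uses a token-level cross-entropy (mean negative log-likelihood of the reference clause tokens), which is a common choice when fine-tuning language models. The function name `first_stage_loss` and the per-position probability dictionaries are hypothetical.

```python
import math
from typing import Dict, List


def first_stage_loss(token_probs: List[Dict[str, float]],
                     reference_tokens: List[str]) -> float:
    """Mean negative log-likelihood of the reference clause content tokens.

    token_probs[i] maps candidate tokens to the probability the model assigns
    at position i; reference_tokens[i] is the token of the extracted clause
    content at that position.
    """
    if not reference_tokens or len(token_probs) != len(reference_tokens):
        raise ValueError("predictions and reference must be non-empty and aligned")
    eps = 1e-12  # avoids log(0) when the reference token received no probability
    total = 0.0
    for probs, token in zip(token_probs, reference_tokens):
        total += -math.log(max(probs.get(token, 0.0), eps))
    return total / len(reference_tokens)
```

A lower value means the predicted clause content is closer to the clause content extracted from the training contract clause data; the model parameters would then be adjusted to reduce this value.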

[0044] Step S208: Based on the discrimination prompt information used to determine the contract clause type, the multiple training contract clause data, and the contract clause type to which each of the multiple training contract clause data belongs, perform a second-stage training on the initial contract recognition model to obtain a trained contract classification model.

[0045] The discrimination task prompt information can be specifically understood as the prompt information during the second-stage training process of the initial contract recognition model. It is used to guide the initial contract recognition model, which has the ability to understand the basic definition of contract terms, to further learn the ability to discriminate the specific contract term type to which the contract term data belongs. In other words, it can be used to enhance the initial contract recognition model's ability to discriminate the specific contract term type.

[0046] Specifically, the server obtains the discrimination task prompt information configured for the contract clause type, the clause content, and the target contract clause type, and constructs discrimination prompt information for judging the contract clause type based on the preset prompt information template and the discrimination task prompt information configured for the contract clause type, the clause content, and the target contract clause type.

[0047] The discrimination task prompt information refers to the specific task prompt information configured for the contract clause type, which needs to be executed to determine the contract clause type. The clause explanation content can be understood as the specific clause content and further supplementary explanations of the clause content. The target contract clause type refers to the contract clause type that matches the clause content. The preset prompt information template specifically includes a task description information field, a context text content field, and a target output information field. For example, the preset prompt information template (or general format) is: {"instruction": "prompt instruction", "input": "input context", "output": "target output sequence"}.

[0048] For example, in this embodiment of the application, during the second-stage training of the initial contract recognition model, the discrimination prompt information set for determining the contract clause type is: {"instruction": "There are several important clauses in a housing rental contract. Below are their names and descriptions:\nBreach of contract clause (defects in property ownership or illegal rental): ... Breach of contract clause (failure to provide the agreed rental property): ... Based on the clause types and descriptions given above, please determine the category to which the contract clause given below belongs. When outputting, only the specific category is required. If the contract clause does not belong to any of the categories, please return \"Unable to determine\". Below is the clause content:", "input": "3. Rights and Obligations\n3.1 Party B guarantees that it has legal ownership or disposal rights over the leased premises and has the right to lease the premises, and Party A guarantees that all information provided to Party B regarding the leased premises is true and legal.", "output": "Breach of Contract Clause (Defects in Property Ownership or Illegal Leasing)"}, where "\n" indicates a newline.

[0049] The clause content referred to in "instruction" is the specific clause content given in "input", namely: "3. Rights and Obligations 3.1 Party B guarantees that it has legal ownership or disposal rights over the leased premises and has the right to rent out the premises, and Party A guarantees that all information provided to Party B regarding the leased premises is true and legal."

[0050] Optionally, for each independent initial contract recognition model, all contract clause types within its data group are listed in the discrimination prompt of the large language model, so that the model makes predictions within the scope of the contract clause types included in that data group. If a data sample from outside the group is input, the model's target output is "Unable to determine".

[0051] Furthermore, when the server constructs the discrimination prompt information for judging the contract clause type based on the preset prompt information template, the discrimination task prompt information configured for the contract clause type, the clause content, and the target contract clause type, it specifically fills the discrimination task prompt information configured for the contract clause type, the clause content, and the target contract clause type into the task description information field, the context text content field, and the target output information field in the preset prompt information template in sequence to construct the discrimination prompt information for judging the contract clause type.
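The field-filling step described above can be sketched in a few lines of Python. This is a minimal illustration, not the applicant's implementation: the helper name `build_discrimination_prompt` and the shortened instruction text are assumptions; only the three field names come from the template quoted in the text.

```python
import json

def build_discrimination_prompt(task_description: str,
                                clause_content: str,
                                target_clause_type: str) -> dict:
    """Fill the three template fields in order (hypothetical helper).

    instruction <- discrimination task prompt information
    input       <- clause content
    output      <- target contract clause type
    """
    return {
        "instruction": task_description,
        "input": clause_content,
        "output": target_clause_type,
    }

# Shortened, illustrative values; the real instruction lists every clause type.
prompt = build_discrimination_prompt(
    task_description=(
        "Based on the clause types and descriptions given above, please "
        "determine the category to which the contract clause given below belongs."
    ),
    clause_content="3. Rights and Obligations\n3.1 Party B guarantees that ...",
    target_clause_type=(
        "Breach of contract clause (defects in property ownership or illegal rental)"
    ),
)

# Serialise one training record; ensure_ascii=False keeps non-ASCII clause text readable.
record = json.dumps(prompt, ensure_ascii=False)
```

Keeping the template as a plain dictionary means the same helper shape can serve the content prompt of the first stage by swapping the instruction and the target output.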

[0052] In an exemplary embodiment, after obtaining the discrimination prompt information for determining the contract clause type, the server performs second-stage training on the initial contract recognition model based on the discrimination prompt information, the multiple training contract clause data, and the contract clause type to which each of the multiple training contract clause data belongs. When the model training termination condition is met, the trained contract classification model is obtained.

[0053] Specifically, during the second-stage training process, the server inputs discrimination prompts and multiple training contract clause data into the initial contract recognition model. The discrimination prompts guide the initial contract recognition model to perform the task of discriminating the type of contract clause, thereby obtaining the predicted contract clause type corresponding to each training contract clause data.

[0054] Furthermore, the server determines the second model training loss value in the second-stage training process based on the predicted contract clause type corresponding to each of the multiple training contract clause data and the actual contract clause type to which each of the multiple training contract clause data belongs. The parameters of the initial contract recognition model are then adjusted according to the second model training loss value. When the model training termination condition is met, the trained contract classification model is obtained.

[0055] Understandably, traditional classification models have small model structures and weak contextual understanding. In practical applications, contract clause categories are numerous and closely related, so a single classification algorithm often cannot distinguish among them. In this embodiment, by contrast, two kinds of prompt information are constructed for the pre-trained language model: content prompt information for the first stage of training and discrimination prompt information for the second stage. This allows the pre-trained language model to be trained progressively in two stages. First-stage training, based on the multiple training contract clause data and the content prompt information, yields an initial contract recognition model that can identify the specific content of contract clauses. This model is then trained further, and its parameters updated, according to the contract clause types and the discrimination prompt information. The trained contract classification model can therefore both understand clause content and discriminate contract clause types. In other words, two-stage fine-tuning together with the constructed prompts can improve the model's zero-shot or few-shot learning ability on unseen tasks, thereby improving the accuracy of the contract classification model when classifying a large number of contract clause categories.

[0056] In an exemplary embodiment, the data classification processing method further includes: receiving a contract clause type discrimination request, obtaining contract clause data to be processed corresponding to the contract clause type discrimination request; and classifying the contract clause data to be processed according to a trained contract classification model to obtain the target contract clause type corresponding to the contract clause data to be processed.

[0057] Specifically, a user can trigger a contract clause type discrimination request from a terminal. When the terminal detects the request, it sends the request to the server. On receiving the request, the server parses it to obtain the corresponding contract clause data to be processed.

[0058] Furthermore, the server inputs the contract terms data to be processed into a trained contract classification model, and then uses the trained contract classification model to determine the category of the contract terms data to be processed, thereby obtaining the target contract terms type corresponding to the contract terms data to be processed.
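The request flow above can be sketched as follows. This is a sketch under stated assumptions only: the JSON request shape, the field name `clause_text`, and the function names are hypothetical, and `stub_classifier` is a keyword placeholder standing in for the trained contract classification model.

```python
import json
from typing import Callable

def handle_discrimination_request(raw_request: str,
                                  classify: Callable[[str], str]) -> dict:
    """Parse a request, extract the clause data, and classify it.

    `classify` stands in for the trained contract classification model.
    """
    request = json.loads(raw_request)
    clause_text = request["clause_text"]  # hypothetical field name
    return {"clause_text": clause_text, "clause_type": classify(clause_text)}

def stub_classifier(clause_text: str) -> str:
    """Placeholder only: a keyword rule, NOT the two-stage fine-tuned model."""
    if "ownership" in clause_text:
        return "Breach of contract clause (defects in property ownership or illegal rental)"
    return "Unable to determine"

raw = json.dumps({"clause_text": "Party B guarantees that it has legal ownership ..."})
result = handle_discrimination_request(raw, stub_classifier)
```

Passing the classifier in as a callable keeps request parsing separate from the model, so the stub can be replaced without touching the handler.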

[0059] In an exemplary embodiment, depending on the actual application scenario (such as a multi-intent recognition scenario, a disease classification scenario, or an image classification scenario), the specific components of the constructed content prompt information and discrimination prompt information differ, as do the training sample data collected in that scenario. The contract classification model trained in this embodiment can accordingly be an intent recognition classification model in a multi-intent recognition scenario, a disease category recognition model in a disease classification scenario, an image classification model in an image classification scenario, and so on. For example, in a multi-intent recognition scenario, training the intent recognition classification model requires collecting intent training sample data, constructing content prompt information for the intent content, and constructing discrimination prompt information for the intent category to which the intent content belongs. Based on the intent training sample data and the content prompt information constructed for the intent content, the pre-trained language model is trained in the first stage to obtain an initial intent recognition model. Based on the intent training sample data and the discrimination prompt information constructed for the intent category, second-stage training is then performed on the initial intent recognition model until a trained intent recognition classification model is obtained.

[0060] Similarly, in image classification scenarios, when training an image classification model, it is necessary to collect image training sample data, construct content prompts for the image content, and construct discrimination prompts for the image category to which the image content belongs. Based on the image training sample data and the content prompts constructed for the image content, the pre-trained language model can be trained in the first stage to obtain an initial image recognition model. Based on the image training sample data and the discrimination prompts constructed for the image category to which the image content belongs, the initial image recognition model can be trained in the second stage until a well-trained image classification model is obtained.

[0061] In the above data classification processing method, multiple training contract clause data are acquired, and for each training contract clause data the clause content is extracted. Based on the content prompts used to describe the clause content, the multiple training contract clause data, and the clause content of each of the multiple training contract clause data, first-stage training is performed on a pre-trained language model to obtain an initial contract recognition model. This allows the initial contract recognition model to learn the clause content and specific definitions of multiple different contract clause data. Furthermore, the contract clause type to which each of the multiple training contract clause data belongs is determined, and second-stage training is performed on the initial contract recognition model based on the discrimination prompts used to determine the contract clause type, the multiple training contract clause data, and the contract clause type to which each belongs, to obtain a trained contract classification model. The initial contract recognition model, which can already identify specific clause content, is thereby trained further and its parameters updated according to the contract clause types and discrimination prompts. The trained contract classification model can therefore both understand clause content and distinguish contract clause types. In other words, two-stage fine-tuning improves the model accuracy and the contract classification accuracy of the resulting contract classification model.

[0062] In one exemplary embodiment, as shown in Figure 3, the step of performing first-stage training on the pre-trained language model based on the content prompts used to describe the clause content, the multiple training contract clause data, and the clause content of each of the multiple training contract clause data, to obtain the initial contract recognition model, includes the following steps S302 to S306:

[0063] Step S302: Input multiple training contract terms data and content prompts into the pre-trained language model to obtain the predicted terms content corresponding to each of the multiple training contract terms data.

[0064] Specifically, during the first phase of training, the server inputs multiple training contract clause data and content prompts into the pre-trained language model. Through the content prompts, the pre-trained language model is guided to perform the task of recognizing and interpreting the clause content corresponding to the specific contract clause name, thereby obtaining the predicted clause content corresponding to each training contract clause data.

[0065] Step S304: Determine the first model training loss value in the first-stage training process based on the clause content of each of the multiple training contract clause data and the predicted clause content corresponding to each of the multiple training contract clause data.

[0066] Specifically, in the first-stage training process, the joint probability distribution loss function is used as the model training loss function. The server can determine the first model training loss value in the first-stage training process based on the content of each of the multiple training contract terms, the predicted content of each of the multiple training contract terms, and the joint probability distribution loss function.

[0067] In an exemplary embodiment, the joint probability distribution corresponding to the joint probability distribution loss function during the first-stage training process is calculated by the following formula (1):

[0068] $$P(Y \mid X, I;\ \theta_0, \Delta\theta) = \prod_{i=1}^{m} P(y_i \mid X, I, y_{<i};\ \theta_0, \Delta\theta)$$ ; Formula (1)

[0069] Here, $P(Y \mid X, I;\ \theta_0, \Delta\theta)$ describes the joint probability distribution of the target output sequence Y (i.e., the predicted clause content) given an input context sequence X (i.e., the multiple training contract clause data input to the model) and an instruction I (i.e., the content prompt information). It is obtained as the product of the conditional probabilities of all $y_i$ (i.e., the i-th token in the output sequence Y). The purpose of calculating the joint probability distribution is to generate a reasonable output sequence by maximizing the joint probability. $P(y_i \mid X, I, y_{<i};\ \theta_0, \Delta\theta)$ represents the conditional probability distribution of the i-th token of the target output sequence Y given X, I, and the previously output tokens $y_{<i}$, where m represents the length of the target output sequence Y and i is the index, ranging from 1 to m.

[0070] Here, $\theta_0$ represents the parameters of the pre-trained language model that are frozen during the first stage of training and are not updated, and $\Delta\theta$ represents the parameters of the LoRA module corresponding to the pre-trained language model, which are updated during model training.

[0071] Specifically, in this embodiment, a model training method based on a low-rank adapter (LoRA) is adopted. LoRA is a parameter-efficient method for fine-tuning large language models: it reduces the number of trainable parameters through low-rank decomposition so that large language models can be trained efficiently. The method freezes the original parameters of the pre-trained language model and injects a trainable low-rank matrix (i.e., a LoRA module) into each Transformer layer of the pre-trained language model. During model training, only the injected low-rank matrix is updated. The model is thus adapted through a learnable low-rank matrix rather than by adjusting all of its parameters, which reduces the number of parameters tuned during fine-tuning, improves training efficiency, and reduces resource consumption.
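To make the parameter saving concrete, the short sketch below compares the number of entries in one full weight matrix with the number of entries in its low-rank update. The matrix sizes and the rank are illustrative assumptions, not values from this application, and the function names are hypothetical.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Entries in one dense weight matrix of shape d_out x d_in."""
    return d_out * d_in

def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Entries in the low-rank pair: B is d_out x rank, A is rank x d_in."""
    return rank * (d_out + d_in)

# Illustrative sizes only (assumed, not taken from the source).
d_out, d_in, rank = 4096, 4096, 8

full = full_param_count(d_out, d_in)        # 16_777_216 frozen entries
lora = lora_param_count(d_out, d_in, rank)  # 65_536 trainable entries
ratio = lora / full                         # 0.00390625, i.e. about 0.4%
```

Under these assumed sizes, the trainable update is roughly 0.4% of the frozen matrix, which is the kind of reduction the paragraph above describes.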

[0072] Furthermore, the joint probability distribution loss function during the first-stage training process is expressed by the following formula (2):

[0073] $$\mathcal{L} = -\sum_{i=1}^{n} \log P(y_i \mid X, I, y_{<i};\ \theta_0, \Delta\theta)$$ ; Formula (2)

[0074] Here, $\mathcal{L}$ represents the joint probability distribution loss function, i indexes the i-th token in the target output sequence Y (i.e., the predicted clause content), and n represents the length of the target output sequence Y (the same length denoted m in formula (1)). $P(y_i \mid X, I, y_{<i};\ \theta_0, \Delta\theta)$ is the conditional probability distribution of the i-th token in the target output sequence Y, given the input context sequence X (i.e., the multiple training contract clause data input to the model), the instruction I (i.e., the content prompt information), and the previously output tokens $y_{<i}$; $\theta_0$ and $\Delta\theta$ denote the frozen parameters of the pre-trained language model and the trainable LoRA module parameters, respectively. The loss is calculated by summing the negative logarithms of the conditional probabilities of all tokens in the target output sequence Y. Minimizing this loss is equivalent to maximizing the joint probability product, so that a reasonable output sequence is generated.
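The relationship between the joint probability of formula (1) and the loss of formula (2) can be checked numerically. The sketch below is illustrative only: the per-token probabilities are made-up numbers standing in for what a language model would assign to the correct tokens, and the function name is hypothetical.

```python
import math

def sequence_nll(token_probs):
    """Sum of negative log-probabilities of the target tokens (formula (2))."""
    return -sum(math.log(p) for p in token_probs)

# Made-up probabilities P(y_i | X, I, y_<i) for a three-token target sequence.
token_probs = [0.9, 0.8, 0.5]

joint = math.prod(token_probs)    # product of conditionals, as in formula (1)
loss = sequence_nll(token_probs)  # summed negative logs, as in formula (2)

# The loss is the negative log of the joint probability, so minimising
# the loss and maximising the joint probability are the same objective.
same_objective = math.isclose(loss, -math.log(joint))
```

Working with the sum of logarithms rather than the raw product is the usual choice in practice, since products of many small probabilities underflow in floating point.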

[0075] Step S306: Based on the training loss value of the first model, adjust the parameters of the first projection matrix corresponding to the pre-trained language model to obtain the initial contract recognition model.

[0076] Specifically, after determining the first model training loss value in the first stage of training, the server adjusts the parameters of the first projection matrix corresponding to the pre-trained language model based on the first model training loss value. When the model training termination condition of the first stage is met, the initial contract recognition model is obtained.

[0077] Furthermore, in this embodiment, a model training method based on a low-rank adapter (LoRA, a parameter-efficient method for fine-tuning large language models) is adopted. This method freezes the original parameters (W0) of the pre-trained language model and injects a trainable low-rank matrix (i.e., a LoRA module) into each Transformer layer of the pre-trained language model. This low-rank matrix can also be understood as the first projection matrix corresponding to the pre-trained language model; it includes the first lower projection matrix A1 and the first upper projection matrix B1. During model training, only the injected first projection matrix (A1 and B1) is updated, yielding the adjusted first parameter update matrix lora1. The original parameters (W0) of the pre-trained language model are then combined with lora1 to obtain the model parameters W0 + lora1, which serve as the model parameters of the initial contract recognition model.

[0078] For example, the model parameter W1 of the initial contract recognition model is represented by the following formula (3):

[0079] W1=W0+lora1= W0+B1*A1*scaling1; formula (3)

[0080] Where W1 represents the model parameters of the initial contract recognition model, W0 represents the original parameters of the pre-trained language model, B1*A1*scaling1 represents the first parameter update matrix lora1 in the first stage of training, A1 represents the first lower projection matrix, B1 represents the first upper projection matrix, and scaling1 represents the scaling factor of the first parameter update matrix, which can be set to a specific value according to actual needs.
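Formula (3) can be traced on a tiny example. The sketch below uses plain Python lists so that it runs without third-party libraries; the 3x3 weight, the rank-1 matrices, and the scaling value are toy numbers chosen for illustration, and the helper names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def merge_lora(w0, b, a, scaling):
    """Return W0 + B*A*scaling, i.e. the merged weight of formula (3)."""
    delta = matmul(b, a)
    return [[w + scaling * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w0, delta)]

# Toy values (assumed): a 3x3 frozen weight and a rank-1 update.
w0 = [[1.0, 0.0, 0.0],
      [0.0, 1.0, 0.0],
      [0.0, 0.0, 1.0]]
b1 = [[1.0], [0.0], [2.0]]   # first upper projection matrix, shape 3x1
a1 = [[0.5, 0.0, 1.0]]       # first lower projection matrix, shape 1x3
scaling1 = 2.0

w1 = merge_lora(w0, b1, a1, scaling1)
# w1 == [[2.0, 0.0, 2.0], [0.0, 1.0, 0.0], [2.0, 0.0, 5.0]]
```

Note that `w0` itself is never modified: only `b1` and `a1` would be trained, and the merged `w1` is computed from them afterwards.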

[0081] In this embodiment, by inputting content prompts and multiple training contract terms into a pre-trained language model, predicted terms corresponding to each of the multiple training contract terms are obtained. Based on the terms of each of the multiple training contract terms and their corresponding predicted terms, a first model training loss value is determined during the first stage of training. Then, based on the first model training loss value, the parameters of the first projection matrix corresponding to the pre-trained language model are adjusted to obtain an initial contract recognition model. This achieves a model training method that fine-tunes a large language model. During model training, only the parameters of the first projection matrix corresponding to the pre-trained language model are updated and adjusted, rather than all parameters of the model, reducing the number of parameters during model fine-tuning, improving model training efficiency, and reducing resource consumption during model training.

[0082] In one exemplary embodiment, as shown in Figure 4, the step of performing second-stage training on the initial contract recognition model based on the discrimination prompts used to determine the contract clause type, the multiple training contract clause data, and the contract clause type to which each of the multiple training contract clause data belongs, to obtain a trained contract classification model, specifically includes the following steps S402 to S406:

[0083] Step S402: Based on multiple training contract clause data and the contract clause types to which each training contract clause data belongs, group processing is performed to obtain a preset number of data groups. Each data group includes multiple training contract clause data belonging to different contract clause types.

[0084] Specifically, because practical applications involve a large number of different contract clause types, before performing second-stage training on the initial contract recognition model the server groups the multiple training contract clause data based on their respective contract clause types to obtain a preset number of data groups. Each data group includes multiple training contract clause data belonging to different contract clause types. Each data group is used to independently fine-tune its corresponding initial contract recognition model; that is, for each data group, the corresponding initial contract recognition model is trained independently using a low-rank adapter (LoRA, a parameter-efficient method for fine-tuning large language models).

[0085] Step S404: Determine the initial contract recognition model corresponding to each data group, and for each data group, obtain the out-of-group training samples corresponding to the targeted data group from the remaining data groups. The out-of-group training samples include multiple training contract clause data belonging to different contract clause types.

[0086] Specifically, because the models are trained independently, once the server has obtained the preset number of data groups it determines the initial contract recognition model corresponding to each data group, so that each model can be fine-tuned independently on its own data group.

[0087] Furthermore, for each data group, the server needs to obtain out-of-group training samples corresponding to the targeted data group from the remaining data groups. For example, based on multiple training contract clause data and the contract clause types to which each training contract clause data belongs, grouping is performed to obtain three data groups, including the first data group, the second data group, and the third data group. Among them, for the first data group, a preset number of training contract clause data (e.g., 10) needs to be randomly selected from all the data in the second and third data groups. Thus, the out-of-group training samples obtained will also include multiple training contract clause data belonging to different contract clause types.

[0088] In one exemplary embodiment, Figure 5 illustrates the process of grouping multiple training contract clause data based on their respective contract clause types. As can be seen from Figure 5, grouping based on the multiple training contract clause data and the contract clause type to which each belongs yields three data groups. Each data group is used for independent training with the low-rank adapter model training method (LoRA, a parameter-efficient method for fine-tuning large language models).

[0089] Referring to Figure 5, the three data groups obtained are data group 1, data group 2, and data group 3. For data group 1, a preset number of training contract clause data (e.g., 10) are randomly selected from all the data in data group 2 and data group 3 as the out-of-group training samples for data group 1. The out-of-group training samples for data group 2 and data group 3 are determined in the same way: a preset number of training contract clause data (e.g., 10) are randomly selected from all the data in the data groups other than the group itself.
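The grouping and out-of-group sampling described above can be sketched as follows. This is an illustration under assumptions: the text does not say how clause types are assigned to groups, so a fixed `type_to_group` mapping is assumed here, and the clause texts, type names, and function names are placeholders.

```python
import random
from collections import defaultdict

def group_by_type(samples, type_to_group):
    """Place each (clause_text, clause_type) sample in the group of its type."""
    groups = defaultdict(list)
    for clause_text, clause_type in samples:
        groups[type_to_group[clause_type]].append((clause_text, clause_type))
    return dict(groups)

def sample_out_of_group(groups, target_group, k, rng):
    """Randomly pick up to k samples from every group except target_group."""
    pool = [sample
            for name, members in groups.items() if name != target_group
            for sample in members]
    return rng.sample(pool, min(k, len(pool)))

# Placeholder data: six clause types, four clauses each, split into three groups.
type_to_group = {"type_a": 1, "type_b": 1, "type_c": 2,
                 "type_d": 2, "type_e": 3, "type_f": 3}
samples = [(f"clause {i} of {t}", t) for t in type_to_group for i in range(4)]

groups = group_by_type(samples, type_to_group)
rng = random.Random(0)  # seeded so the draw is repeatable
out_of_group_1 = sample_out_of_group(groups, target_group=1, k=10, rng=rng)
```

Each data group then trains its own model on its in-group samples plus its own out-of-group draw, so the draw is repeated once per group with a different `target_group`.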

[0090] Step S406: For each initial contract recognition model, perform second-stage training on the initial contract recognition model based on the discrimination prompt information, the data group corresponding to the initial contract recognition model, and the out-of-group training samples corresponding to the data group, to obtain a trained contract classification model.

[0091] Specifically, the server independently trains multiple initial contract recognition models, using the same training method for each. That is, for each initial contract recognition model, second-stage training is performed based on the discrimination prompt information, the data group corresponding to the initial contract recognition model, and the out-of-group training samples corresponding to the data group. When the model training termination condition of the second-stage training is met, the trained contract classification model is obtained.

[0092] In the second-stage training process, for each initial contract recognition model, the server inputs discrimination prompts, multiple training contract clause data from the data group corresponding to the initial contract recognition model, and multiple training contract clause data from the out-of-group training samples corresponding to the data group into the initial contract recognition model. The discrimination prompts guide the initial contract recognition model to perform the task of judging the contract clause type, thereby obtaining the predicted contract clause type corresponding to each training contract clause data.

[0093] Furthermore, the server determines the second model training loss value in the second-stage training process based on the predicted contract clause type corresponding to each of the multiple training contract clause data and the actual contract clause type to which each of the multiple training contract clause data belongs. The parameters of the initial contract recognition model are then adjusted according to the second model training loss value. When the model training termination condition is met, the trained contract classification model is obtained.

[0094] In this embodiment, the multiple training contract clause data are grouped according to their respective contract clause types to obtain a preset number of data groups, and an initial contract recognition model is determined for each data group. For each data group, out-of-group training samples are obtained from the remaining data groups. For each initial contract recognition model, second-stage training is performed based on the discrimination prompt information, the data group corresponding to that model, and the training contract clause data in the out-of-group training samples corresponding to that data group, resulting in a trained contract classification model. The initial contract recognition model, which can already identify specific clause content, is thereby trained further and its parameters updated. The trained contract classification model can therefore both understand clause content and discriminate contract clause types. In other words, two-stage fine-tuning improves the model accuracy and the contract classification accuracy of the resulting contract classification model.

[0095] In one exemplary embodiment, as shown in Figure 6, the step of performing second-stage training on the initial contract recognition model based on the discrimination prompt information, the data group corresponding to the initial contract recognition model, and the out-of-group training samples corresponding to the data group, to obtain a trained contract classification model, specifically includes the following steps S602 to S606:

[0096] Step S602: Input the discrimination prompt information, together with the training contract clause data in the data group corresponding to the initial contract recognition model and in the out-of-group training samples corresponding to that data group, into the initial contract recognition model to obtain the predicted contract clause type corresponding to each of the multiple training contract clause data.

[0097] Specifically, during the second-stage training process, for each initial contract recognition model, the server inputs multiple training contract clause data from the data group corresponding to the initial contract recognition model, multiple training contract clause data from the out-of-group training samples corresponding to the data group, and discrimination prompt information into the initial contract recognition model. The discrimination prompt information guides the initial contract recognition model to perform the task of judging the contract clause type, thereby obtaining the predicted contract clause type corresponding to each training contract clause data.
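One way to assemble the second-stage training records for a single data group is sketched below. It assumes, following the optional embodiment described earlier, that out-of-group samples take "Unable to determine" as their target output; the function name, the shortened instruction, and the sample texts are hypothetical.

```python
UNABLE = "Unable to determine"  # target output for out-of-group samples

def build_stage2_records(instruction, in_group, out_of_group):
    """Build instruction/input/output records for one data group.

    in_group and out_of_group are lists of (clause_text, clause_type).
    In-group samples keep their true clause type as the target output;
    out-of-group samples are relabelled with the fallback answer.
    """
    records = []
    for clause_text, clause_type in in_group:
        records.append({"instruction": instruction,
                        "input": clause_text,
                        "output": clause_type})
    for clause_text, _true_type in out_of_group:
        records.append({"instruction": instruction,
                        "input": clause_text,
                        "output": UNABLE})
    return records

# Placeholder data for illustration.
instruction = "Determine the category of the contract clause given below. ..."
in_group = [("Party B guarantees legal ownership ...", "Breach of contract clause (A)"),
            ("Party B shall hand over the premises ...", "Breach of contract clause (B)")]
out_of_group = [("Rent is payable on the first day ...", "Payment clause")]

records = build_stage2_records(instruction, in_group, out_of_group)
```

Relabelling the out-of-group samples is what lets each per-group model reject clauses whose type it was never asked to distinguish, rather than forcing them into one of its own categories.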

[0098] Step S604: Determine the second model training loss value in the second-stage training process based on the predicted contract term type corresponding to each of the multiple training contract term data and the contract term type corresponding to each of the multiple training contract term data.

[0099] Specifically, in the second-stage training process, the joint probability distribution loss function is used as the model training loss function. The server can determine the second model training loss value in the second-stage training process by considering the predicted contract term types corresponding to each of the multiple training contract term data, the contract term types corresponding to each of the multiple training contract term data, and the joint probability distribution loss function.

[0100] In the second-stage training process, the joint probability distribution corresponding to the joint probability distribution loss function is calculated using formula (1), and the joint probability distribution loss function in the second-stage training process is also calculated using the formula (2). However, the difference between the second-stage training process and the first-stage training process lies in the fact that the instruction I and the target output sequence Y involved in the joint probability distribution (and joint probability distribution loss function) are different. In the first-stage training process, the input context sequence X is multiple training contract clause data of the input model, the instruction I is the content prompt information, and the target output sequence Y is the predicted clause content. In the second-stage training process, the input context sequence X is multiple training contract clause data of the input model, the instruction I is the discrimination prompt information, and the target output sequence Y is the predicted contract clause type.

[0101] Step S606: Based on the training loss value of the second model, adjust the parameters of the second projection matrix corresponding to the initial contract recognition model to obtain the trained contract classification model.

[0102] Specifically, after determining the second model training loss value in the second-stage training process, the server adjusts the parameters of the second projection matrix corresponding to the initial contract recognition model based on the second model training loss value. When the termination condition of the second-stage model training is met, the trained contract classification model is obtained.

[0103] Furthermore, in this embodiment, a model training method based on a low-rank adapter (LoRA, a parameter-efficient method for fine-tuning large language models) is adopted. This method freezes the model parameters of the initial contract recognition model (i.e., the original parameters W0 of the pre-trained language model + the first parameter update matrix lora1) and injects a trainable low-rank matrix (i.e., a LoRA module) into each Transformer layer of the initial contract recognition model. This low-rank matrix can also be understood as the second projection matrix corresponding to the initial contract recognition model; it includes a second lower projection matrix A2 and a second upper projection matrix B2. During model training, only the injected second projection matrix (A2 and B2) is updated, yielding the adjusted second parameter update matrix lora2. The model parameters of the initial contract recognition model (W0 + lora1) are then combined with lora2 to obtain the model parameters W0 + lora1 + lora2, which serve as the model parameters of the trained contract classification model.

[0104] For example, the model parameters W of the trained contract classification model are represented by the following formula (4):

[0105] W = W1 + lora2 = (W0 + B1*A1*scaling1) + B2*A2*scaling2; formula (3)

[0106] Where W represents the model parameters of the trained contract classification model, and W1 represents the model parameters of the initial contract recognition model, which equal W0 + B1*A1*scaling1. W0 represents the original parameters of the pre-trained language model. B1*A1*scaling1 represents the first parameter update matrix lora1 obtained in the first stage of training, where A1 is the first down projection matrix, B1 is the first up projection matrix, and scaling1 is the scaling factor of the first parameter update matrix, which can be set according to actual needs. B2*A2*scaling2 represents the second parameter update matrix lora2 obtained in the second stage of training, where A2 is the second down projection matrix, B2 is the second up projection matrix, and scaling2 is the scaling factor of the second parameter update matrix.
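As an illustration of the formula in paragraph [0105], the following minimal sketch composes the merged parameters from plain Python lists. The 2x2 shape, the rank of 1, the numeric values, and the helper names (matmul, add, scale, lora_update) are assumptions chosen purely for illustration and are not taken from this application.

```python
# Illustrative sketch only: W = W0 + B1*A1*scaling1 + B2*A2*scaling2.
# All shapes and values below are hypothetical.

def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def add(x, y):
    """Element-wise sum of two equally shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]

def scale(x, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[a * s for a in row] for row in x]

def lora_update(up, down, scaling):
    """Parameter update matrix: up projection times down projection, scaled."""
    return scale(matmul(up, down), scaling)

W0 = [[1.0, 0.0], [0.0, 1.0]]   # frozen original parameters (2x2)
B1 = [[1.0], [0.0]]             # first up projection matrix (2x1)
A1 = [[0.0, 2.0]]               # first down projection matrix (1x2)
B2 = [[0.0], [1.0]]             # second up projection matrix (2x1)
A2 = [[4.0, 0.0]]               # second down projection matrix (1x2)
scaling1, scaling2 = 0.5, 0.25  # hypothetical scaling factors

W1 = add(W0, lora_update(B1, A1, scaling1))  # initial contract recognition model
W = add(W1, lora_update(B2, A2, scaling2))   # trained contract classification model
```

In this toy case W1 and W differ from W0 only by low-rank terms, and W0 itself is never modified, which mirrors the freezing described above.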

[0107] In this embodiment, for each initial contract recognition model, the discrimination prompt information, the multiple training contract clause data in the corresponding data group, and the multiple training contract clause data in the corresponding out-of-group training samples are input into the initial contract recognition model to obtain the predicted contract clause type for each training contract clause data. The second model training loss value in the second-stage training process is then determined from the predicted contract clause types and the contract clause types to which the training contract clause data belong. Based on this loss value, the parameters of the second projection matrix corresponding to the initial contract recognition model are adjusted to obtain the trained contract classification model. This realizes parameter-efficient fine-tuning of a large language model: during training, only the parameters of the second projection matrix are updated, rather than all the parameters of the model. This reduces the number of parameters adjusted during fine-tuning, improves model training efficiency, and reduces resource consumption during model training.
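To make the parameter saving concrete, the sketch below compares the number of entries in a full weight matrix with the number of entries in a pair of low-rank projection matrices. The dimensions 4096 x 4096 and the rank 8 are hypothetical values chosen for the arithmetic; this application does not state the actual sizes.

```python
# Illustrative arithmetic only; the dimensions and rank are assumptions.

def full_param_count(d_out, d_in):
    """Entries in a full d_out x d_in weight matrix."""
    return d_out * d_in

def lora_param_count(d_out, d_in, rank):
    """Entries in an up projection (d_out x rank) plus a down projection (rank x d_in)."""
    return rank * (d_out + d_in)

d_out, d_in, rank = 4096, 4096, 8
full = full_param_count(d_out, d_in)
lora = lora_param_count(d_out, d_in, rank)
ratio = lora / full  # fraction of the layer's entries that are trainable
```

Under these assumed sizes, the trainable projection matrices hold 65,536 entries against 16,777,216 in the frozen matrix, i.e. 1/256 of the total for that layer.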

[0108] In one exemplary embodiment, as shown in Figure 7, a data classification processing method is provided. Taking its application to the server 104 in Figure 1 as an example, the method includes the following steps S701 to S713:

[0109] Step S701: Obtain multiple training contract clause data, extract clause content from each training contract clause data, and determine the contract clause type to which the training contract clause data belongs.

[0110] Specifically, the server acquires multiple training contract clause data, such as multiple data items including contract clause names and specific clause content. For each training contract clause data, it extracts the specific clause content. For example, for a training contract clause data whose contract clause name is "Breach of Contract Clause (Defective Ownership of Property or Illegal Rental of Property)," the server extracts the clause content corresponding to this clause, which would be: "If the ownership of the rented property is defective or it is illegally rented, affecting the lessee's use, the lessee shall bear the liability for breach of contract and compensate for losses. The amount of liquidated damages should be clear, specific, and operable." Furthermore, based on the contract clause name and the corresponding clause content, the server determines the contract clause type to which the training contract clause data belongs. The contract clause type can be understood as the category to which the contract clause belongs.

[0111] Step S702: Obtain the content interpretation task information configured for the clause content, the contract clause name, and the clause interpretation content, and obtain the preset prompt information template. Based on the preset prompt information template, the content interpretation task information configured for the clause content, the contract clause name, and the clause interpretation content, construct the content prompt information used to describe the clause content.

[0112] Specifically, the server obtains the content interpretation task information configured for the clause content, the contract clause name, and the clause interpretation content, and obtains a preset prompt information template, including a task description information field, a context text content field, and a target output information field. The server then sequentially fills the task description information field, context text content field, and target output information field in the preset prompt information template with the content interpretation task information configured for the clause content, the contract clause name, and the clause interpretation content, thereby constructing the content prompt information for the clause content.
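The template-filling step can be sketched as follows. The key names "instruction", "input", and "output" follow the prompt example given later in this application; mapping them to the task description, context text content, and target output fields, as well as the function name build_prompt, are assumptions made for illustration.

```python
import json

# Hypothetical mapping of the three template fields to JSON keys:
# task description -> "instruction", context text -> "input", target output -> "output".
TEMPLATE_FIELDS = ("instruction", "input", "output")

def build_prompt(task_description, context_text, target_output):
    """Fill the three template fields in order and return the prompt record."""
    values = (task_description, context_text, target_output)
    return dict(zip(TEMPLATE_FIELDS, values))

content_prompt = build_prompt(
    "Please explain the meaning of the contract clause name given below",
    "Breach of Contract Clause (Defective Ownership of Property or "
    "Illegal Rental of Property)",
    "If the ownership of the rented property is defective or it is illegally "
    "rented, affecting the lessee's use, the lessee shall bear the liability "
    "for breach of contract and compensate for losses. The amount of "
    "liquidated damages should be clear, specific, and operable.",
)

# One record per line is a common way to store instruction-tuning data.
serialized = json.dumps(content_prompt, ensure_ascii=False)
```

The same helper could be reused for the discrimination prompt information, since both prompts are described as sharing one preset template.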

[0113] Step S703: Input the content prompt information and the multiple training contract clause data into the pre-trained language model to obtain the predicted clause content corresponding to each of the multiple training contract clause data.

[0114] Specifically, during the first phase of training, the server inputs content prompts and multiple training contract clause data into the pre-trained language model. The content prompts guide the pre-trained language model to perform the task of recognizing and interpreting the clause content corresponding to the specific contract clause name, thereby obtaining the predicted clause content corresponding to each training contract clause data.

[0115] Step S704: Determine the first model training loss value in the first-stage training process based on the clause content of each of the multiple training contract clause data and the predicted clause content corresponding to each of the multiple training contract clause data.

[0116] Specifically, in the first-stage training process, the joint probability distribution loss function is used as the model training loss function. The server determines the first model training loss value in the first-stage training process based on the clause content of each of the multiple training contract clause data, the predicted clause content corresponding to each of them, and the joint probability distribution loss function.
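This passage does not spell out the joint probability distribution loss function. One common reading, assumed in the sketch below, is the negative logarithm of the joint probability that the model assigns to the target token sequence, which equals the sum of the per-token negative log-likelihoods. The function names and the sample probabilities are hypothetical.

```python
import math

def joint_nll(token_probs):
    """Negative log of the joint probability of a target sequence.

    token_probs holds the probability the model assigned to each correct
    target token, conditioned on the prompt and the preceding tokens.
    """
    if any(p <= 0.0 or p > 1.0 for p in token_probs):
        raise ValueError("each probability must lie in (0, 1]")
    return -sum(math.log(p) for p in token_probs)

def batch_loss(batch):
    """Average the per-sequence losses over a batch of training samples."""
    return sum(joint_nll(probs) for probs in batch) / len(batch)

# Hypothetical per-token probabilities for two training samples.
loss = batch_loss([[0.5, 0.25], [1.0, 1.0]])
```

A sequence predicted with certainty contributes zero loss, and the loss grows as the probability of the reference clause content falls.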

[0117] Step S705: Based on the training loss value of the first model, adjust the parameters of the first projection matrix corresponding to the pre-trained language model to obtain the initial contract recognition model.

[0118] Specifically, when adjusting the parameters of the first projection matrix corresponding to the pre-trained language model based on the first model training loss value, the server adopts a model training method based on a low-rank adapter (LoRA, a parameter-efficient method for fine-tuning large language models). This method freezes the original parameters W0 of the pre-trained language model and injects a trainable low-rank matrix (i.e., the first projection matrix) into each layer of the Transformer architecture of the pre-trained language model. During model training, only the injected first projection matrix is updated, yielding the adjusted first parameter update matrix lora1. The original parameters W0 of the pre-trained language model and the adjusted first parameter update matrix lora1 are combined to obtain the model parameters W0 + lora1, which serve as the model parameters of the initial contract recognition model.

[0119] Step S706: Obtain the discrimination task prompt information, clause content, and target contract clause type configured for the contract clause type, so as to construct discrimination prompt information for discriminating contract clause type based on the preset prompt information template, the discrimination task prompt information, clause content, and target contract clause type configured for the contract clause type.

[0120] Specifically, the server obtains the discrimination task prompt information configured for the contract clause type, the clause content, and the target contract clause type, and obtains the preset prompt information template, which includes a task description information field, a context text content field, and a target output information field. The server then fills the discrimination task prompt information, the clause content, and the target contract clause type into the task description information field, the context text content field, and the target output information field, respectively, to construct the discrimination prompt information used to determine the contract clause type.

[0121] Step S707: Based on multiple training contract clause data and the contract clause types to which each training contract clause data belongs, group processing is performed to obtain a preset number of data groups. Each data group includes multiple training contract clause data belonging to different contract clause types.

[0122] Specifically, since a large number of different contract clauses exist in practical applications, before performing the second-stage training on the initial contract recognition model, the server groups the multiple training contract clause data according to the contract clause type to which each belongs, obtaining a preset number of data groups. Each data group includes multiple training contract clause data belonging to different contract clause types, and each data group is used to independently fine-tune its corresponding initial contract recognition model.

[0123] Step S708: Determine the initial contract recognition model corresponding to each data group, and for each data group, obtain the out-of-group training samples corresponding to the targeted data group from the remaining data groups. The out-of-group training samples include multiple training contract clause data belonging to different contract clause types.

[0124] Specifically, the server determines an initial contract recognition model corresponding to each data group, enabling independent fine-tuning of the initial contract recognition model for each data group. For example, based on multiple training contract clause data and the contract clause types to which each training contract clause data belongs, the server performs grouping processing to obtain three data groups: a first data group, a second data group, and a third data group. For each data group, such as the first data group, the server needs to randomly extract a preset number of training contract clause data from the remaining data groups (including the second and third data groups) as out-of-group training samples corresponding to that data group (e.g., the first data group). That is, the obtained out-of-group training samples also include multiple training contract clause data belonging to different contract clause types.
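The grouping and out-of-group sampling can be sketched as follows. This application does not state how clause types are assigned to groups; the round-robin assignment, the function names, and the synthetic samples below are assumptions for illustration only.

```python
import random

def group_by_clause_type(samples, num_groups):
    """Split (clause_text, clause_type) samples into groups by clause type.

    Assumption: clause types are dealt round-robin to groups, so every
    group holds several distinct types and no type spans two groups.
    """
    clause_types = sorted({clause_type for _, clause_type in samples})
    type_to_group = {t: i % num_groups for i, t in enumerate(clause_types)}
    groups = [[] for _ in range(num_groups)]
    for sample in samples:
        groups[type_to_group[sample[1]]].append(sample)
    return groups

def sample_out_of_group(groups, group_index, k, rng):
    """Randomly draw up to k samples from all groups except group_index."""
    pool = [s for i, g in enumerate(groups) if i != group_index for s in g]
    return rng.sample(pool, min(k, len(pool)))

# Synthetic data: 12 samples spread over 6 hypothetical clause types.
samples = [(f"clause text {i}", f"type-{i % 6}") for i in range(12)]
groups = group_by_clause_type(samples, num_groups=3)
rng = random.Random(0)  # seeded only to make the sketch repeatable
out_of_group = sample_out_of_group(groups, group_index=0, k=4, rng=rng)
```

Here the first data group receives four out-of-group samples drawn from the second and third data groups, matching the three-group example above.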

[0125] Step S709: For each initial contract recognition model, input the discrimination prompt information, the multiple training contract clause data in the data group corresponding to the initial contract recognition model, and the multiple training contract clause data in the out-of-group training samples corresponding to that data group into the initial contract recognition model to obtain the predicted contract clause type corresponding to each of the multiple training contract clause data.

[0126] Specifically, the server independently trains multiple initial contract recognition models, and the training method is the same for each. That is, in the second-stage training process, for each initial contract recognition model, the server inputs the discrimination prompt information, the multiple training contract clause data in the data group corresponding to the initial contract recognition model, and the multiple training contract clause data in the out-of-group training samples corresponding to that data group into the initial contract recognition model. The discrimination prompt information guides the initial contract recognition model to perform the task of determining the contract clause type, so that the server obtains the predicted contract clause type corresponding to each training contract clause data.
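A minimal sketch of how second-stage training records might be assembled for one data group is given below. It assumes that out-of-group samples are labeled with the fallback category "Unable to determine", which this application mentions in its prompt example but does not explicitly tie to out-of-group samples. The clause type names, descriptions, and function name are hypothetical.

```python
UNKNOWN = "Unable to determine"  # fallback category named in the prompt example

def build_discrimination_record(clause_text, clause_type, group_types):
    """Build one training record for a data group.

    group_types maps each in-group clause type name to a short description.
    Assumption: a sample whose type is not in the group is labeled UNKNOWN.
    """
    listing = " ".join(f"{name}: {desc}" for name, desc in group_types.items())
    instruction = (
        "The following are the clause names and descriptions: "
        + listing
        + " Based on the clause types and descriptions given above, please "
        "determine the category to which the contract clause given below "
        "belongs. If it does not belong to any of the categories, please "
        "return '" + UNKNOWN + "'."
    )
    target = clause_type if clause_type in group_types else UNKNOWN
    return {"instruction": instruction, "input": clause_text, "output": target}

# Hypothetical in-group clause types and descriptions.
group_types = {
    "Force Majeure Clause": "allocates losses caused by force majeure",
    "Breach of Contract Clause": "sets liability for breach",
}
in_group = build_discrimination_record(
    "If the leased premises are damaged due to force majeure ...",
    "Force Majeure Clause",
    group_types,
)
out_group = build_discrimination_record(
    "Both parties shall keep the terms confidential ...",
    "Confidentiality Clause",  # hypothetical type from another data group
    group_types,
)
```

Mixing both kinds of record in one group's training data gives the model examples of when to answer with a listed type and when to decline.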

[0127] Step S710: Determine the second model training loss value in the second-stage training process based on the predicted contract clause type corresponding to each of the multiple training contract clause data and the contract clause type to which each of the multiple training contract clause data belongs.

[0128] Specifically, in the second-stage training process, the joint probability distribution loss function is used as the model training loss function. The server determines the second model training loss value in the second-stage training process based on the predicted contract clause type corresponding to each of the multiple training contract clause data, the contract clause type to which each of them belongs, and the joint probability distribution loss function.

[0129] Step S711: Based on the training loss value of the second model, adjust the parameters of the second projection matrix corresponding to the initial contract recognition model to obtain the trained contract classification model.

[0130] Specifically, the server employs a model training method based on a low-rank adapter (LoRA, a parameter-efficient method for fine-tuning large language models) and adjusts the parameters of the second projection matrix corresponding to the initial contract recognition model based on the second model training loss value. To do so, the model parameters of the initial contract recognition model (i.e., the original parameters W0 of the pre-trained language model plus the first parameter update matrix lora1) are frozen, and a trainable low-rank matrix, i.e., the second projection matrix corresponding to the initial contract recognition model, is injected into each layer of the Transformer architecture of the initial contract recognition model. During model training, only the parameters of the second projection matrix are updated, yielding the adjusted second parameter update matrix lora2. The model parameters of the initial contract recognition model (W0 + lora1) and the adjusted second parameter update matrix lora2 are combined to obtain the model parameters W0 + lora1 + lora2, which serve as the model parameters of the trained contract classification model.

[0131] Step S712: Receive a contract clause type determination request and obtain the pending contract clause data corresponding to the contract clause type determination request.

[0132] Specifically, a user can trigger a contract clause type determination request from a terminal. When the terminal detects the request, it sends the request to the server. Upon receiving the contract clause type determination request, the server parses it to obtain the corresponding pending contract clause data.

[0133] Step S713: Based on the trained contract classification model, classify the contract clause data to be processed to obtain the target contract clause type corresponding to the contract clause data to be processed.

[0134] Specifically, the server inputs the pending contract clause data into the trained contract classification model, which determines the category of the pending contract clause data, thereby obtaining the target contract clause type corresponding to it.

[0135] In an exemplary embodiment, model training is performed on three data groups: the first data group, 104 (12); the second data group, 103 (12); and the third data group, 135 (15). The number in parentheses is the number of contract clause types in the data group, which already counts the out-of-group data and the "cannot be determined" category. The number outside the parentheses is the number of contract clause samples in the data group, which includes 10 out-of-group data entries. Furthermore, the constructed test set contains one data entry and two out-of-group data entries corresponding to each contract clause type. The model performance is then tested using the test set.

[0136] To quantify the facilitating effect of the first-stage training on the second-stage training, the model performance was tested for both the first-stage training followed by the second-stage training (s1-s2) and the second-stage training alone (s2). The results are illustrated in Table 1 below.

[0137] Table 1

[0138]

[0139] Specifically, as shown in Table 1, the classification accuracy of the model that undergoes first-stage training followed by second-stage training (i.e., s1-s2) is higher than that of the model that undergoes only second-stage training (i.e., s2). That is, the two-stage fine-tuning method improves the model's classification accuracy; in other words, the first-stage training further strengthens the model's classification ability. The trained model can provide accurate contract clause type identification for contract clause classification systems, automated contract generation systems, and the like.

[0140] In one exemplary embodiment, as shown in Figure 8, an overall processing flow of a data classification processing method is provided. Referring to Figure 8, the processing flow specifically includes: P1, dataset partitioning and grouping; P2, instruction fine-tuning data construction; and P3, API deployment and inference, wherein:

[0141] For P1, dataset partitioning and grouping: Based on the multiple training contract clause data and the contract clause type to which each belongs, the data are grouped to obtain a preset number of data groups. Each data group includes multiple training contract clause data belonging to different contract clause types. For each data group, the server randomly selects a preset number of training contract clause data from the remaining data groups as the out-of-group training samples corresponding to that data group.

[0142] For P2, instruction fine-tuning data construction: (1) Contract clause definition learning: In the first stage of training, the content prompt information used to describe the clause content is: {"instruction": "Please explain the meaning of the contract clause name given below", "input": "Breach of contract liability clause (defects in property ownership or illegal rental of the house)", "output": "If the property ownership of the rented house is defective or it is illegally rented and affects the lessee's use, the lessee shall bear the liability for breach of contract and compensate for the losses. The amount of liquidated damages should be clear, specific, and operable."}. (2) Contract clause type discrimination ability learning: In the second stage of training, the discrimination prompt information used to discriminate the contract clause type is: {"instruction": "There are several important clauses in the house rental contract. The following are their names and descriptions: Breach of contract clause (defects in property ownership or illegal rental): ... Breach of contract clause (failure to provide the agreed rental property): ... Based on the clause types and descriptions given above, please determine the category to which the contract clause given below belongs. When outputting, only the specific category is required. If the contract clause does not belong to any of the categories, please return "Unable to determine". Below is the clause content:", "input": "3. Rights and Obligations 3.1 Party B guarantees that it has legal ownership or disposal rights over the leased premises and has the right to lease the premises, and Party A guarantees that all information provided to Party B regarding the leased premises is true and legal."

[0143] Specifically, the trained contract classification model is obtained by performing the first-stage training and the second-stage training on the pre-trained language model.

[0144] For P3, API deployment and inference: After the trained contract classification model is obtained, a corresponding API service (i.e., interface service) is set up for each contract classification model based on the Flask framework (a lightweight web application framework). Users can access the API service set up for a contract classification model to call the trained contract classification model for contract clause type identification. Specifically, by requesting "ip:port/?question=specific clause content" from the API service set up for the contract classification model, the user receives the contract clause type corresponding to that clause content.
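The request handling can be sketched with a bare WSGI callable from the Python standard library, used here in place of Flask (itself a WSGI framework) so that the example has no third-party dependency. The function classify_clause is a hypothetical keyword stub standing in for the trained contract classification model, and the JSON response shape is an assumption.

```python
import json
from urllib.parse import parse_qs

def classify_clause(text):
    """Hypothetical stand-in for the trained contract classification model."""
    if "force majeure" in text.lower():
        return "Force Majeure Clause"
    return "Unable to determine"

def respond(start_response, status, payload):
    """Serialize payload as JSON and emit the WSGI response."""
    body = json.dumps(payload).encode("utf-8")
    headers = [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ]
    start_response(status, headers)
    return [body]

def application(environ, start_response):
    """Handle GET /?question=<specific clause content>."""
    params = parse_qs(environ.get("QUERY_STRING", ""))
    question = params.get("question", [""])[0]
    if not question:
        return respond(
            start_response, "400 Bad Request",
            {"error": "missing 'question' parameter"},
        )
    return respond(
        start_response, "200 OK",
        {"clause_type": classify_clause(question)},
    )
```

For local experiments, the callable could be served with wsgiref.simple_server.make_server on a port such as 12345. A Flask deployment would replace the manual query parsing with a route handler, but the request and response flow would be the same.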

[0145] For example, Figure 9 provides a schematic diagram of API deployment and inference for a contract classification model. Referring to Figure 9, by entering "ip:port/?question=specific clause content" into the API service set up for the contract classification model, the user receives the contract clause type corresponding to the clause content. For example, as shown in Figure 9, submitting the question "If the leased premises are damaged or destroyed due to force majeure, this contract shall be terminated, and neither Party A nor Party B shall be liable for any losses incurred by Party B" returns the contract clause type "Force Majeure Clause" for this clause.

[0146] Here, the part of the address before ":12345" is the IP address of the API service set up for the contract classification model, and ":12345" corresponds to ":port". The text "If the leased premises are damaged or destroyed due to force majeure, this contract shall be terminated, and neither Party A nor Party B shall be liable for any losses incurred by Party B" is the specific clause content whose contract clause type is to be determined.

[0147] In the above data classification processing method, multiple training contract clause data are acquired, and for each training contract clause data, the clause content is extracted. Based on the content prompt information used to describe the clause content, the multiple training contract clause data, and the clause content of each of the multiple training contract clause data, first-stage training is performed on a pre-trained language model to obtain an initial contract recognition model, so that the initial contract recognition model learns the clause content and specific definitions of multiple different contract clauses. Furthermore, the contract clause type to which each of the multiple training contract clause data belongs is determined, and based on the discrimination prompt information used to determine the contract clause type, the multiple training contract clause data, and the contract clause type to which each belongs, second-stage training is performed on the initial contract recognition model to obtain a trained contract classification model. In this way, the initial contract recognition model, which can already identify specific clause content, is further trained and its parameters updated according to the contract clause types and the discrimination prompt information. The trained contract classification model is thus able both to understand clause content and to distinguish contract clause types. In other words, the two-stage fine-tuning improves the model accuracy and the contract classification accuracy of the resulting contract classification model.

[0148] It should be understood that although the steps in the flowcharts of the above embodiments are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the above embodiments may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.

[0149] Based on the same inventive concept, this application also provides a data classification processing apparatus for implementing the data classification processing method described above. The solution provided by this apparatus is similar to the implementation scheme described in the above method; therefore, the specific limitations in one or more data classification processing apparatus embodiments provided below can be found in the limitations of the data classification processing method described above, and will not be repeated here.

[0150] In one exemplary embodiment, as shown in Figure 10, a data classification processing device is provided, including: a training contract clause data acquisition module 1002, a contract clause type determination module 1004, an initial contract recognition model acquisition module 1006, and a contract classification model acquisition module 1008, wherein:

[0151] The training contract clause data acquisition module 1002 is used to acquire multiple training contract clause data; the contract clause type determination module 1004 is used to extract clause content from each training contract clause data and determine the contract clause type to which the training contract clause data belongs; the initial contract recognition model acquisition module 1006 is used to perform a first-stage training on the pre-trained language model based on the content prompts used to describe the clause content, multiple training contract clause data, and the clause content of each of the multiple training contract clause data, to obtain an initial contract recognition model; the contract classification model acquisition module 1008 is used to perform a second-stage training on the initial contract recognition model based on the discrimination prompts used to determine the contract clause type, multiple training contract clause data, and the contract clause type to which each of the multiple training contract clause data belongs, to obtain a trained contract classification model.

[0152] In the aforementioned data classification processing device, multiple training contract clause data are acquired, and for each training contract clause data, the clause content is extracted. Based on the content prompt information used to describe the clause content, the multiple training contract clause data, and the clause content of each of the multiple training contract clause data, first-stage training is performed on a pre-trained language model to obtain an initial contract recognition model, so that the initial contract recognition model learns the clause content and specific definitions of multiple different contract clauses. Furthermore, the contract clause type to which each of the multiple training contract clause data belongs is determined, and based on the discrimination prompt information used to determine the contract clause type, the multiple training contract clause data, and the contract clause type to which each belongs, second-stage training is performed on the initial contract recognition model to obtain a trained contract classification model. In this way, the initial contract recognition model, which can already identify specific clause content, is further trained and its parameters updated according to the contract clause types and the discrimination prompt information. The trained contract classification model is thus able both to understand clause content and to distinguish contract clause types. In other words, the two-stage fine-tuning improves the model accuracy and the contract classification accuracy of the resulting contract classification model.

[0153] In an exemplary embodiment, the initial contract recognition model acquisition module is further configured to: input multiple training contract clause data and content prompt information into a pre-trained language model to obtain predicted clause content corresponding to each of the multiple training contract clause data; determine a first model training loss value in a first-stage training process based on the clause content of each of the multiple training contract clause data and the predicted clause content corresponding to each of the multiple training contract clause data; and adjust the parameters of a first projection matrix corresponding to the pre-trained language model based on the first model training loss value to obtain an initial contract recognition model.

[0154] In an exemplary embodiment, the contract classification model acquisition module is further configured to: group the multiple training contract clause data according to the contract clause type to which each training contract clause data belongs, to obtain a preset number of data groups, where each data group includes multiple training contract clause data belonging to different contract clause types; determine an initial contract recognition model corresponding to each data group, and for each data group, obtain out-of-group training samples corresponding to that data group from the remaining data groups, where the out-of-group training samples include multiple training contract clause data belonging to different contract clause types; and for each initial contract recognition model, perform second-stage training on that initial contract recognition model based on the discrimination prompt information, the data group corresponding to that initial contract recognition model, and the out-of-group training samples corresponding to the data group, to obtain a trained contract classification model.

[0155] In an exemplary embodiment, the contract classification model acquisition module is further configured to: input the discrimination prompt information, the multiple training contract clause data in the data group corresponding to the initial contract recognition model, and the multiple training contract clause data in the out-of-group training samples corresponding to the data group into the initial contract recognition model to obtain the predicted contract clause type corresponding to each of the multiple training contract clause data; determine the second model training loss value in the second-stage training process based on the predicted contract clause type corresponding to each of the multiple training contract clause data and the contract clause type to which each of the multiple training contract clause data belongs; and adjust the parameters of the second projection matrix corresponding to the initial contract recognition model based on the second model training loss value to obtain the trained contract classification model.

[0156] In one exemplary embodiment, a data classification processing apparatus is provided, further comprising: a content prompt information construction module, configured to: obtain content interpretation task information configured for the clause content, contract clause name, and clause interpretation content, and obtain a preset prompt information template, so as to construct content prompt information describing the clause content based on the preset prompt information template, the content interpretation task information configured for the clause content, contract clause name, and clause interpretation content; and a discrimination prompt information construction module, configured to: obtain discrimination task prompt information configured for the contract clause type, clause content, and target contract clause type, so as to construct discrimination prompt information for discriminating the contract clause type based on the preset prompt information template, the discrimination task prompt information configured for the contract clause type, clause content, and target contract clause type.

[0157] In one exemplary embodiment, a data classification and processing apparatus is provided, which further includes a contract clause type discrimination module, configured to: receive a contract clause type discrimination request, obtain contract clause data to be processed corresponding to the contract clause type discrimination request; and perform category discrimination on the contract clause data to be processed according to a trained contract classification model to obtain the target contract clause type corresponding to the contract clause data to be processed.
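A request handler along these lines might look like the following Python sketch. The request layout (a dictionary with a `clause_data` key), the scoring interface of the model, and the keyword-based `toy_model` stand-in are all assumptions made for illustration; a real deployment would call the trained contract classification model instead.

```python
def handle_discrimination_request(request, model, clause_types):
    """Return the target contract clause type for the clause in the request.

    model is any callable mapping a prompt string to one score per clause
    type (an assumed interface, not one defined in this application).
    """
    clause = request["clause_data"]  # contract clause data to be processed
    prompt = (
        "Decide which contract clause type this clause belongs to.\n"
        f"Clause content: {clause}"
    )
    scores = model(prompt)
    if len(scores) != len(clause_types):
        raise ValueError("model must return one score per clause type")
    best = max(range(len(scores)), key=scores.__getitem__)
    return clause_types[best]


def toy_model(prompt):
    """Keyword stand-in for the trained contract classification model."""
    text = prompt.lower()
    return [float("secret" in text), float("terminate" in text)]


clause_types = ["confidentiality", "termination"]
request = {
    "clause_data": "Either party may terminate this agreement on 30 days notice."
}
result = handle_discrimination_request(request, toy_model, clause_types)
```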

[0158] Each module in the aforementioned data classification and processing device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in the processor of a computer device in hardware form or independent of it, or stored in the memory of a computer device in software form, so that the processor can call and execute the operations corresponding to each module.

[0159] In one exemplary embodiment, a computer device is provided, which may be a server or a terminal. Taking the case where the computer device is a server as an example, its internal structure may be as shown in Figure 11. The computer device includes a processor, a memory, an input/output (I/O) interface, and a communication interface. The processor, the memory, and the I/O interface are connected via a system bus, and the communication interface is also connected to the system bus via the I/O interface. The processor provides computational and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database stores the training contract clause data, the clause content, the contract clause types to which the training contract clause data belong, the content prompt information, the pre-trained language model, the initial contract recognition model, the discrimination prompt information, and the trained contract classification model. The I/O interface is used for exchanging information between the processor and external devices. The communication interface is used for communicating with external terminals via a network connection. When executed by the processor, the computer program implements a data classification processing method.

[0160] Those skilled in the art will understand that the structure shown in Figure 11 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. A specific computer device may include more or fewer components than those shown in the figure, may combine certain components, or may have a different arrangement of components.

[0161] In one embodiment, a computer device is also provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps in the above method embodiments.

[0162] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon that, when executed by a processor, implements the steps in the above method embodiments.

[0163] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps in the above method embodiments.

[0164] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties, and the collection, use and processing of the relevant data must comply with relevant regulations.

[0165] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile memory and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases involved in the embodiments provided in this application may include at least one of a relational database and a non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be, but are not limited to, general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, artificial intelligence (AI) processors, etc.

[0166] The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as the combination of these technical features does not contradict each other, it should be considered within the scope of this application. The above embodiments only illustrate several implementation methods of this application, and their descriptions are relatively specific and detailed, but they should not be construed as limiting the scope of this application. It should be noted that for those skilled in the art, several modifications and improvements can be made without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.

Claims

1. A data classification and processing method, characterized in that the method comprises: acquiring multiple training contract clause data; for each training contract clause data, extracting clause content from the training contract clause data and determining the contract clause type to which the training contract clause data belongs; performing first-stage training on a pre-trained language model based on content prompt information describing the clause content, the multiple training contract clause data, and the clause content of each of the multiple training contract clause data, to obtain an initial contract recognition model; and performing second-stage training on the initial contract recognition model based on discrimination prompt information for discriminating the contract clause type, the multiple training contract clause data, and the contract clause type to which each of the multiple training contract clause data belongs, to obtain a trained contract classification model.

2. The method according to claim 1, characterized in that performing first-stage training on the pre-trained language model based on the content prompt information describing the clause content, the multiple training contract clause data, and the clause content of each of the multiple training contract clause data, to obtain the initial contract recognition model, comprises: inputting the multiple training contract clause data and the content prompt information into the pre-trained language model to obtain predicted clause content corresponding to each of the multiple training contract clause data; determining a first model training loss value of the first-stage training process based on the clause content of each of the multiple training contract clause data and the predicted clause content corresponding to each of the multiple training contract clause data; and adjusting parameters of a first projection matrix corresponding to the pre-trained language model based on the first model training loss value, to obtain the initial contract recognition model.

3. The method according to claim 1, characterized in that performing second-stage training on the initial contract recognition model based on the discrimination prompt information for discriminating the contract clause type, the multiple training contract clause data, and the contract clause type to which each of the multiple training contract clause data belongs, to obtain the trained contract classification model, comprises: grouping the multiple training contract clause data based on the contract clause type to which each of the multiple training contract clause data belongs, to obtain a preset number of data groups, each data group including multiple training contract clause data belonging to different contract clause types; determining an initial contract recognition model for each data group and, for each data group, obtaining out-of-group training samples corresponding to that data group from the remaining data groups, the out-of-group training samples including multiple training contract clause data belonging to different contract clause types; and, for each initial contract recognition model, performing second-stage training on the initial contract recognition model based on the discrimination prompt information, the data group corresponding to the initial contract recognition model, and the out-of-group training samples corresponding to the data group, to obtain the trained contract classification model.

4. The method according to claim 3, characterized in that performing second-stage training on the initial contract recognition model based on the discrimination prompt information, the data group corresponding to the initial contract recognition model, and the out-of-group training samples corresponding to the data group, to obtain the trained contract classification model, comprises: inputting the discrimination prompt information, the data group corresponding to the initial contract recognition model, and the multiple training contract clause data in the out-of-group training samples corresponding to the data group into the initial contract recognition model, to obtain a predicted contract clause type corresponding to each of the multiple training contract clause data; determining a second model training loss value of the second-stage training process based on the predicted contract clause type corresponding to each of the multiple training contract clause data and the contract clause type to which each of the multiple training contract clause data belongs; and adjusting parameters of a second projection matrix corresponding to the initial contract recognition model based on the second model training loss value, to obtain the trained contract classification model.

5. The method according to any one of claims 1 to 4, characterized in that the method further comprises: obtaining content interpretation task information configured for the clause content, a contract clause name, and clause interpretation content, and obtaining a preset prompt information template, so as to construct the content prompt information describing the clause content based on the preset prompt information template, the content interpretation task information, the contract clause name, and the clause interpretation content; and obtaining discrimination task prompt information configured for the contract clause type, the clause content, and a target contract clause type, so as to construct the discrimination prompt information for discriminating the contract clause type based on the preset prompt information template, the discrimination task prompt information, the clause content, and the target contract clause type.

6. The method according to any one of claims 1 to 4, characterized in that the method further comprises: receiving a contract clause type discrimination request and obtaining contract clause data to be processed corresponding to the contract clause type discrimination request; and performing category discrimination on the contract clause data to be processed according to the trained contract classification model, to obtain a target contract clause type corresponding to the contract clause data to be processed.

7. A data classification and processing device, characterized in that the device comprises: a training contract clause data acquisition module, configured to acquire multiple training contract clause data; a contract clause type determination module, configured to, for each training contract clause data, extract clause content from the training contract clause data and determine the contract clause type to which the training contract clause data belongs; an initial contract recognition model acquisition module, configured to perform first-stage training on a pre-trained language model based on content prompt information describing the clause content, the multiple training contract clause data, and the clause content of each of the multiple training contract clause data, to obtain an initial contract recognition model; and a contract classification model acquisition module, configured to perform second-stage training on the initial contract recognition model based on discrimination prompt information for discriminating the contract clause type, the multiple training contract clause data, and the contract clause type to which each of the multiple training contract clause data belongs, to obtain a trained contract classification model.

8. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 6.

9. A computer-readable storage medium having a computer program stored thereon, characterized in that when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 6.

10. A computer program product, comprising a computer program, characterized in that when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 6.