Methods and apparatus for training language models to obtain predictive models

By combining unsupervised pre-training and semi-supervised training of language models with unsupervised clustering and a small amount of manual annotation, the problem of complex and costly decision-making in real estate transactions is solved, achieving efficient and low-cost decision support.

CN117094414B · Active · Publication Date: 2026-06-30 · BEIKE TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: BEIKE TECH CO LTD
Filing Date: 2023-08-23
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Real estate transactions involve complex decision-making processes and costly manual annotation and analysis. Existing large language models require a large amount of manually labeled data for training, resulting in huge training costs.

Method used

By performing unsupervised pre-training and semi-supervised training on the language model, combined with unsupervised clustering and a small amount of manually labeled data, the unsupervised clustering results are used as a labeling reference for training and optimization, thereby reducing the cost of manual labeling and analysis.

Benefits of technology

This reduces the training cost of the prediction model, decreases the complexity of manual annotation and analysis, and improves the efficiency and accuracy of real estate buying and selling decisions.

✦ Generated by Eureka AI based on patent content.


Abstract

A method, apparatus, computer device, storage medium, and computer program product are provided for training a language model to obtain a predictive model. The method includes: acquiring multiple pieces of textual business data as a first dataset; performing unsupervised clustering on the first dataset to obtain a second dataset, the second dataset comprising K first subsets, each piece of data in the first dataset belonging to a corresponding first subset and having a unique label for that first subset; performing the following training tasks on the language model: in a first task, pre-training the language model based on the first dataset to obtain a pre-trained model; in a second task, performing semi-supervised training on the pre-trained model based on both the first and second datasets to obtain a trained model; in a third task, fine-tuning the trained model based on the second dataset to obtain a fine-tuned model; and obtaining a predictive model based on the fine-tuned model.

Description

Technical Field

[0001] This disclosure relates to the field of artificial intelligence, and in particular to a method for training a language model to obtain a predictive model, an apparatus for training a language model to obtain a predictive model, a computer device, a computer-readable storage medium, and a computer program product.

Background Technology

[0002] Real estate transactions are a significant economic activity, involving a relatively large transaction volume and holding considerable importance for every family. Whether buying or selling a personal residence or investing in the real estate market, careful decision-making is essential. During the transaction process, both buyers and sellers face a series of decisions, such as determining the appropriate time to buy or sell, setting a reasonable price, and assessing the property's potential value and risks. These decisions often require considering numerous factors, including real estate market conditions, the property's location, its condition, and the buyer's or seller's financial situation.

[0003] The methods described in this section are not necessarily methods that had been previously conceived or adopted. Unless otherwise specified, no method described in this section should be assumed to be prior art simply because it is included in this section. Similarly, unless otherwise specified, the issues mentioned in this section should not be considered to be accepted in any prior art.

Summary of the Invention

[0004] It would be beneficial to provide a mechanism to alleviate, reduce, or even eliminate one or more of the aforementioned problems.

[0005] According to one aspect of this disclosure, a method for training a language model to obtain a prediction model is provided, comprising: acquiring multiple textual business data as a first dataset, wherein the multiple textual business data describe business-related situations; performing an unsupervised clustering operation on the first dataset to obtain a second dataset, wherein the second dataset includes K first subsets, and each data in the first dataset belongs to a corresponding first subset of the K first subsets and has a unique label of the corresponding first subset, wherein K represents the number of clusters obtained after performing unsupervised clustering on the first dataset, and K is an integer greater than or equal to 2; performing the following training tasks on the language model: in a first training task, pre-training the language model based on the first dataset to obtain a pre-trained language model; in a second training task, semi-supervised training of the pre-trained language model based on the first dataset and the second dataset to obtain a trained language model; in a third training task, fine-tuning the trained language model based on the second dataset to obtain a fine-tuned language model; and obtaining a prediction model based on the fine-tuned language model.

[0006] According to one aspect of this disclosure, an apparatus for training a language model to obtain a prediction model is provided, comprising: a first module for acquiring multiple textual business data as a first dataset, wherein the multiple textual business data describe business-related situations; a second module for performing unsupervised clustering operations on the first dataset to obtain a second dataset, wherein the second dataset includes K first subsets, and each data in the first dataset belongs to a corresponding first subset of the K first subsets and has a unique label of the corresponding first subset, wherein K represents the number of clusters obtained after performing unsupervised clustering operations on the first dataset, and K is an integer greater than or equal to 2; a third module for performing the following training tasks on the language model: in a first training task, pre-training the language model based on the first dataset to obtain a pre-trained language model; in a second training task, semi-supervised training of the pre-trained language model based on the first dataset and the second dataset to obtain a trained language model; in a third training task, fine-tuning the trained language model based on the second dataset to obtain a fine-tuned language model; and a fourth module for obtaining a prediction model based on the fine-tuned language model.

[0007] According to one aspect of this disclosure, a computer device is provided, comprising: at least one processor; and at least one memory having a computer program stored thereon, wherein the computer program, when executed by the at least one processor, causes the at least one processor to perform any of the methods described above.

[0008] According to one aspect of this disclosure, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, causes the processor to perform any of the methods described above.

[0009] According to one aspect of this disclosure, a computer program product is provided, including a computer program that, when executed by a processor, causes the processor to perform any of the methods described above.

[0010] According to embodiments of this disclosure, the training cost of the prediction model is reduced by using unsupervised pre-training and training, as well as a small number of supervised training samples based on cue learning and manual annotation; unsupervised clustering is used to cluster the training data, and the clustering results are used as annotation references for model training and tuning, thereby avoiding the redundant cost of manual data annotation and reducing the complexity of manual analysis.

[0011] These and other aspects of this disclosure will be apparent from, and elucidated with reference to, the embodiments described below.

Description of the Drawings

[0012] Further details, features, and advantages of this disclosure are disclosed in the following description of exemplary embodiments in conjunction with the accompanying drawings, in which:

[0013] Figure 1 is a schematic diagram illustrating an example system in which various methods described herein may be implemented, according to exemplary embodiments;

[0014] Figure 2 is a flowchart illustrating a method for training a language model to obtain a prediction model, according to an exemplary embodiment;

[0015] Figure 3 is a flowchart illustrating an example process for the second training task in the method of Figure 2, according to an exemplary embodiment;

[0016] Figure 4 is a flowchart illustrating an example process for the third training task in the method of Figure 2, according to an exemplary embodiment;

[0017] Figure 5 is a flowchart illustrating a portion of an example process in the method of Figure 2, according to an exemplary embodiment;

[0018] Figure 6 is a schematic block diagram illustrating an apparatus for training a language model to obtain a prediction model, according to an exemplary embodiment;

[0019] Figure 7 is a block diagram illustrating an exemplary computer device that can be applied to exemplary embodiments.

Detailed Implementation

[0020] In this disclosure, unless otherwise stated, the use of terms such as "first," "second," etc., to describe various elements is not intended to limit the positional, temporal, or importance relationships of these elements; such terms are merely used to distinguish one element from another. In some examples, the first element and the second element may refer to the same instance of that element, while in other cases, based on the context, they may refer to different instances.

[0021] The terminology used in the description of the various examples described in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context explicitly indicates otherwise, an element may be one or more unless the number of elements is specifically limited. As used herein, the term "multiple" means two or more, and the term "based on" should be interpreted as "at least partially based on". Furthermore, the terms "and / or" and "at least one of..." cover any one of the listed items and all possible combinations thereof.

[0022] Real estate transactions are a significant economic activity, with a relatively large transaction volume among various economic activities, holding considerable importance for every family. Whether buying or selling a personal residence or investing in the real estate market, careful decision-making is essential. During the real estate transaction process, both buyers and sellers face a series of decisions, such as determining the appropriate time to buy or sell, setting a reasonable price, and assessing the potential value and risks of the property. These decisions often require comprehensive consideration of numerous factors, such as real estate market conditions, the location and condition of the property, and the financial situation of the buyer or seller. Large language models can be applied to the real estate transaction field, providing decision-making assistance to buyers and sellers through training and analysis of large amounts of data. However, in actual business, the real estate decision-making process is highly complex, action labels are relatively scarce, and the amount of labeled data required for supervised training of large models is enormous, resulting in significant labeling costs.

[0023] As real estate transactions account for an increasingly larger share of the economy, it is crucial to provide appropriate and timely decision-making assistance to all parties involved in the real estate buying and selling process.

[0024] In related technologies, such as in the real estate field, existing decision-making models mostly take a GPT2 or GPT3 model as the base model and then apply supervised training. Furthermore, the decision-making process for buying and selling properties is extremely complex, involving significant costs for manual annotation and analysis. The inventors recognized that supervised training based on large models requires substantial amounts of manually labeled data, and that the manual analysis process is also very complex and difficult, resulting in an enormous overall cost.

[0025] In view of the above, this disclosure proposes a method for training a language model to obtain a prediction model.

[0026] It should be understood that the methods proposed in this disclosure can be applied to problems or scenarios involving decision-making processes of one or more parties in any other field besides real estate.

[0027] Exemplary embodiments of this disclosure will now be described in detail with reference to the accompanying drawings.

[0028] Figure 1 is a schematic diagram illustrating an example system 100 in which various methods described herein may be implemented, according to exemplary embodiments.

[0029] Referring to Figure 1, the system 100 includes a client device 110, a server 120, and a network 130 that communicatively couples the client device 110 and the server 120.

[0030] Client device 110 includes a display 114 and a client application (APP) 112 that can be displayed on the display 114. Client application 112 can be an application that needs to be downloaded and installed before running, or a lightweight application (liteapp). If client application 112 is an application that needs to be downloaded and installed before running, client application 112 can be pre-installed on client device 110 and activated. If client application 112 is a lightweight application, user 102 can run client application 112 directly on client device 110 without installing it, by searching for client application 112 in the host application (e.g., by the name of client application 112) or by scanning the graphic code of client application 112 (e.g., barcode, QR code, etc.). In some embodiments, client device 110 can be any type of mobile computing device, including mobile computers, mobile phones, wearable computing devices (e.g., smartwatches, head-mounted devices including smart glasses, etc.), or other types of mobile devices. In some embodiments, the client device 110 may alternatively be a fixed computer device, such as a desktop computer, server computer, or other type of fixed computer device.

[0031] Server 120 is typically a server deployed by an Internet Service Provider (ISP) or Internet Content Provider (ICP). Server 120 can represent a single server, a cluster of multiple servers, a distributed system, or a cloud server providing basic cloud services (such as cloud databases, cloud computing, cloud storage, and cloud communications). It will be understood that, although Figure 1 shows server 120 communicating with only one client device 110, server 120 can provide background services to multiple client devices simultaneously.

[0032] Examples of network 130 include combinations of local area networks (LANs), wide area networks (WANs), personal area networks (PANs), and / or communication networks such as the Internet. Network 130 can be wired or wireless. In some embodiments, technologies and / or formats including Hypertext Markup Language (HTML), Extensible Markup Language (XML), etc., are used to process data exchanged through network 130. Furthermore, encryption technologies such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec) can be used to encrypt all or some of the links. In some embodiments, custom and / or dedicated data communication technologies can be used to replace or supplement the aforementioned data communication technologies.

[0033] For the purposes of this disclosure's embodiments, in the example of Figure 1, client application 112 can be an application that provides decision-making assistance to parties involved in a real estate transaction. This application can provide decision-making assistance or strategy recommendations to each party in a decision-making process involving one or more parties, such as buying, selling, leasing, mortgaging, or pledging in the real estate sector. Correspondingly, server 120 can be a server used with such an application. Server 120 can provide decision-making assistance or strategy recommendations to client application 112 running on client device 110 based on textual business data related to the real estate transaction.

[0034] Figure 2 is a flowchart illustrating a method 200 for training a language model to obtain a prediction model, according to an exemplary embodiment.

[0035] Method 200 can be executed at a client device (e.g., the client device 110 shown in Figure 1); that is, the executing entity of each step of method 200 can be the client device 110 shown in Figure 1. In some embodiments, method 200 can be executed at a server (e.g., the server 120 shown in Figure 1). In some embodiments, method 200 may be executed in combination by a client device (e.g., client device 110) and a server (e.g., server 120). Hereinafter, the steps of method 200 will be described in detail with the client device 110 as the executing entity.

[0036] Referring to Figure 2, method 200 includes steps 210 to 240.

[0037] Step 210: Obtain multiple pieces of textualized business data as the first dataset.

[0038] Step 220: Perform unsupervised clustering on the first dataset to obtain the second dataset.

[0039] Step 230: Perform a training task on the language model to obtain the optimized language model.

[0040] Step 240: Obtain the prediction model based on the optimized language model.
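The four steps above can be sketched as a thin orchestration function. This is only an illustrative outline under stated assumptions: the names `train_prediction_model`, `cluster`, `pretrain`, `semi_supervised`, and `fine_tune`, as well as the `Model` and `Labeled` types, are hypothetical and do not appear in the disclosure, which does not prescribe any particular interfaces.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical type aliases; the disclosure does not prescribe any data types.
Model = dict
Labeled = List[Tuple[str, int]]  # (text, cluster label) pairs


def train_prediction_model(
    texts: Sequence[str],
    cluster: Callable[[Sequence[str]], Labeled],
    pretrain: Callable[[Sequence[str]], Model],
    semi_supervised: Callable[[Model, Sequence[str], Labeled], Model],
    fine_tune: Callable[[Model, Labeled], Model],
) -> Model:
    first_dataset = list(texts)                 # step 210: textual business data
    second_dataset = cluster(first_dataset)     # step 220: unsupervised clustering
    model = pretrain(first_dataset)             # step 230, first training task
    model = semi_supervised(model, first_dataset, second_dataset)  # second task
    model = fine_tune(model, second_dataset)    # third training task
    return model                                # step 240: used as prediction model


# Usage with trivial stand-ins that only record the order of the stages.
stages = train_prediction_model(
    ["buyer asks about down payment", "seller asks about listing"],
    cluster=lambda xs: [(x, i % 2) for i, x in enumerate(xs)],
    pretrain=lambda xs: {"stages": ["pretrain"]},
    semi_supervised=lambda m, xs, ys: {"stages": m["stages"] + ["semi"]},
    fine_tune=lambda m, ys: {"stages": m["stages"] + ["fine_tune"]},
)["stages"]
print(stages)
```

Passing the stages in as callables keeps the sketch independent of any specific model library; a real implementation would substitute actual training routines.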

[0041] In step 210, multiple pieces of textual business data can describe business-related situations.

[0042] In the example, the term "business data" can refer to data related to a specific business (e.g., related to the execution process of the business, the transformation of each node of the business, the intermediate or final result of the business, etc.), data required or generated during the execution of a specific business, data originating from or associated with the participants in the specific business, etc., and this disclosure does not impose any limitations in this regard.

[0043] In the example, the term "textualized business data" can refer to business data that is presented, transmitted, and / or stored in text form or format, including various structured or unstructured data that can be processed by various machine models and / or language models, and this disclosure does not impose any limitations on it.

[0044] In step 220, as is known in the art, the term "unsupervised clustering operation" can refer to a machine learning paradigm that models unlabeled data. In examples, unsupervised clustering operations may include K-means clustering, hierarchical clustering, density-based spatial clustering of applications with noise (DBSCAN), Gaussian mixture clustering, etc., and this disclosure does not impose any limitations thereon.

[0045] In the example, the second dataset may include K first subsets. Through unsupervised clustering, each data point in the first dataset belongs to a corresponding first subset among the K first subsets and has the unique label of that first subset. That is, assuming that the first dataset yields K clusters after unsupervised clustering, there are K unique labels, one per cluster, and each data point carries the label of its cluster. For example, each data point in the first cluster may have a numerical label 1, each data point in the second cluster may have a numerical label 2, and so on. Of course, this disclosure does not impose any restrictions on the representation of the labels; any suitable label format can be used as the unique label for each first subset (i.e., each cluster obtained after the clustering operation).

[0046] As mentioned above, K represents the number of clusters obtained after performing unsupervised clustering on the first dataset. In the example, K can be an integer greater than or equal to 2.
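As a rough illustration of how clustering turns the first dataset into K labeled first subsets, the following sketch runs a small K-means over bag-of-words vectors using only the standard library. The bag-of-words representation, the deterministic choice of the first K texts as initial centers, and the names `cluster_texts` and `kmeans_labels` are assumptions made for brevity; the disclosure leaves the clustering algorithm and text representation open.

```python
from collections import Counter
from typing import Dict, List, Sequence


def bag_of_words(text: str, vocab: Sequence[str]) -> List[float]:
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_labels(vectors: List[List[float]], k: int, iterations: int = 10) -> List[int]:
    # Assumption: the first k vectors serve as initial centers (deterministic).
    centers = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [min(range(k), key=lambda c: sq_dist(v, centers[c])) for v in vectors]
        # Update step: each center becomes the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def cluster_texts(texts: Sequence[str], k: int) -> Dict[int, List[str]]:
    """Return K first subsets keyed by numeric labels 1..K."""
    if k < 2 or k > len(texts):
        raise ValueError("k must satisfy 2 <= k <= number of texts")
    vocab = sorted({w for t in texts for w in t.lower().split()})
    labels = kmeans_labels([bag_of_words(t, vocab) for t in texts], k)
    subsets: Dict[int, List[str]] = {c + 1: [] for c in range(k)}
    for text, lab in zip(texts, labels):
        subsets[lab + 1].append(text)
    return subsets


texts = [
    "buyer asks down payment",
    "seller lists property",
    "buyer asks loan amount",
    "seller lists apartment",
]
subsets = cluster_texts(texts, k=2)
```

Each text ends up in exactly one subset, and the dictionary key plays the role of the unique label of that first subset.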

[0047] In step 230, the training task may include a first training task, a second training task, and a third training task. As used in this disclosure, the term "training task" may refer to inputting input samples into a model, obtaining actual output samples from the model, comparing the actual output samples with expected output samples, and adjusting or optimizing the model based on the comparison results.

[0048] In the first training task, the language model can be pre-trained based on the first dataset to obtain a pre-trained language model.

[0049] For example, taking a generative pre-trained transformer model as an example, a pre-training method similar to GPT2 can be used to pre-train the language model so that the language model initially has rich language knowledge and business-related information.

[0050] In the second training task, the pre-trained language model can be semi-supervised based on the first and second datasets to obtain the trained language model.

[0051] As used in this disclosure, the term "semi-supervised training" can refer to pattern recognition using both unlabeled and labeled data. Here, the unlabeled data is the first dataset, and the labeled data is the second dataset, in which each cluster of data is given its own unique label after unsupervised clustering of the first dataset.

[0052] In the third training task, the trained language model can be fine-tuned based on the second dataset to obtain a fine-tuned language model.

[0053] As used in this disclosure, the term "fine-tuning" can refer to tuning and optimizing the performance of a model using significantly fewer samples than are used in the model training process (e.g., the second training task). These samples can be obtained by modifying some of the samples used in the model training phase, or by directly using other samples that differ from those used in the model training phase.

[0054] In step 240, a prediction model can be obtained based on the optimized language model.

[0055] For example, as long as the performance of the optimized language model meets expectations, it can be directly used as a prediction model.

[0056] For example, if the optimized language model performs differently than expected, the optimized model can be improved to obtain the final prediction model.

[0057] In sectors such as real estate, buying and selling property is a crucial task requiring careful decision-making. Both buyers and sellers face numerous challenges, including determining the appropriate timing, assessing property value and risks, and considering market trends. These processes are complex and require extensive, time-consuming manual analysis. Large language model algorithms have provided valuable experience in solving such problems; however, the labeled data required for training large language models is enormous, and the labeling and analysis processes for diagnostic decisions are complex and costly. To address this issue, this disclosure proposes a method for training a language model to obtain a predictive model, as described in Method 200. Specifically, the language model is pre-trained using unlabeled textual business data to initially equip it with rich linguistic knowledge and business-related information. Unsupervised clustering operations are then used to obtain unsupervised clustering labels required for diagnostic decisions, and the pre-trained language model is weakly trained using cue-based learning statements. Finally, fine-tuning is performed to optimize the trained language model on a small, manually labeled business dataset, thereby better learning business-related knowledge and corresponding diagnostic decision-making actions to provide relevant suggestions to all stakeholders in the business process.

[0058] Method 200 pre-trains and trains a language model using a large number of unsupervised samples, and then fine-tunes the trained model using a small amount of manually labeled data combined with cue learning, thus reducing the manual annotation and analysis costs required to obtain the prediction model. The unsupervised samples refer to the first and second datasets used in the first and second training tasks. Since the first and second training tasks can be executed automatically without manual intervention, the first and second datasets can be considered as data samples under unsupervised training, i.e., unsupervised samples.

[0059] According to some embodiments, performing an unsupervised clustering operation on a first dataset to obtain a second dataset may include one of the following: specifying a value for K; or not specifying a value for K, wherein the number of clusters K obtained after the unsupervised clustering operation depends on prior knowledge or an evaluation metric of the unsupervised clustering operation.

[0060] For example, the value of K can be specified in advance, for instance, by drawing on prior knowledge from experts.

[0061] For example, clustering can be stopped when the cluster confusion or entropy reaches a suitable value, or when the existing number of clusters satisfies expert priors. That is, the value of K is not specified in advance during unsupervised clustering operations.
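One possible reading of the entropy-based stopping rule is sketched below: compute the Shannon entropy of the cluster-size distribution and stop at the first K whose entropy reaches a target. The disclosure does not define which entropy is meant or what a "suitable value" is, so `cluster_entropy`, `choose_k`, the `cluster_fn` callable, and the threshold are all assumptions for illustration.

```python
import math
from collections import Counter
from typing import Callable, Hashable, Sequence


def cluster_entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the cluster-size distribution."""
    total = len(labels)
    counts = Counter(labels)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


def choose_k(
    cluster_fn: Callable[[int], Sequence[Hashable]],
    target_entropy: float,
    max_k: int,
) -> int:
    """Try K = 2, 3, ... and stop once the entropy reaches the target.

    cluster_fn(k) is a hypothetical callable that clusters the data into
    k clusters and returns one label per data point.
    """
    for k in range(2, max_k + 1):
        if cluster_entropy(cluster_fn(k)) >= target_entropy:
            return k
    return max_k  # fall back to the largest K tried


# Usage with a stand-in clustering that splits 12 points evenly.
chosen = choose_k(lambda k: [i % k for i in range(12)], target_entropy=1.5, max_k=6)
```

Other criteria, such as an expert prior on the number of clusters, could replace the entropy test inside the loop without changing its structure.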

[0062] In the example, unlabeled textual business data from all channels can be automatically or manually cleaned based on preset rules, and then all data can be mixed together for unsupervised clustering to obtain cluster labels for all data. Examples of data cleaning operations can be any suitable cleaning operation known in the art, and this disclosure makes no limitation thereto.

[0063] For example, in the decision-making process of a real estate transaction, there are already established transaction procedures and / or key actions. Therefore, clustering can be performed based on these predefined transaction procedures and / or key actions as initial cluster centers, and then all samples can be clustered to obtain the corresponding action labels as unique labels.
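The idea of seeding clusters with predefined key actions can be illustrated with a single assignment pass: each sample receives the label of the seed action whose description it overlaps most. The Jaccard token overlap, the seed descriptions, and the name `assign_to_seed_actions` are assumptions for this sketch; a full clustering run would typically iterate and update the centers afterwards.

```python
from typing import Dict, Sequence, Set


def tokens(text: str) -> Set[str]:
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assign_to_seed_actions(
    samples: Sequence[str], seed_actions: Dict[str, str]
) -> Dict[str, str]:
    """Map each sample to the name of its most similar seed action."""
    seed_tokens = {name: tokens(desc) for name, desc in seed_actions.items()}
    labels: Dict[str, str] = {}
    for sample in samples:
        sample_tokens = tokens(sample)
        labels[sample] = max(
            seed_tokens, key=lambda name: jaccard(sample_tokens, seed_tokens[name])
        )
    return labels


# Hypothetical seed actions standing in for established transaction steps.
seeds = {
    "viewing": "schedule a property viewing",
    "pricing": "negotiate the price",
}
samples = [
    "buyer wants a property viewing on friday",
    "seller wants to negotiate the listing price",
]
action_labels = assign_to_seed_actions(samples, seeds)
```

Here the seed name doubles as the action label, which matches the text's suggestion that the resulting cluster labels can be read as action labels.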

[0064] According to some embodiments, in the first training task, the language model is pre-trained based on the first dataset to obtain a pre-trained language model, which may include one of the following: randomly initializing the parameters of the language model; or setting the initial values of the parameters of the language model based on another trained language model that is structurally identical to the language model.

[0065] For example, the language model can be initialized randomly, or the parameters can be initialized using the weights of a pre-trained GPT model with the same structure from open source or other sources. It is important to note that the data required for the pre-training phase (e.g., the first training task) is textual business data without clustering labels (e.g., plain text data).
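The two initialization options can be sketched as follows, with model parameters reduced to named lists of floats. The function name `init_parameters`, the Gaussian scale of 0.02, and the representation of "structurally identical" as matching parameter names and sizes are assumptions for illustration only.

```python
import random
from typing import Dict, List, Optional

Params = Dict[str, List[float]]


def init_parameters(
    sizes: Dict[str, int], donor: Optional[Params] = None, seed: int = 0
) -> Params:
    """Initialize parameters randomly, or copy them from a donor model."""
    if donor is None:
        # Option 1: random initialization (the scale 0.02 is an arbitrary choice).
        rng = random.Random(seed)
        return {name: [rng.gauss(0.0, 0.02) for _ in range(n)] for name, n in sizes.items()}
    # Option 2: reuse weights from another trained model with the same structure.
    same_names = set(donor) == set(sizes)
    if not same_names or any(len(donor[name]) != n for name, n in sizes.items()):
        raise ValueError("donor model is not structurally identical")
    return {name: list(values) for name, values in donor.items()}


random_params = init_parameters({"w": 3, "b": 1})
donor_params = {"w": [0.1, 0.2, 0.3], "b": [0.0]}
copied_params = init_parameters({"w": 3, "b": 1}, donor=donor_params)
```

The structural check mirrors the requirement in the text that the donor model share the same structure as the language model being trained.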

[0066] Figure 3 is a flowchart illustrating an example process 232 for the second training task in the method of Figure 2, according to an exemplary embodiment. As shown in Figure 3, according to some embodiments, process 232 (in the second training task, performing semi-supervised training on the pre-trained language model based on the first dataset and the second dataset to obtain a trained language model) may include:

[0067] Step 2321: Obtain the third dataset.

[0068] Step 2322: Based on the third dataset, perform the steps to construct triples.

[0069] Step 2323: Input all the constructed triples into the pre-trained language model to perform semi-supervised training on the pre-trained language model, and obtain the trained language model.

[0070] In step 2321, the third dataset may include multiple different second subsets, each of which may include one textual business data item from the first dataset or multiple related textual business data items.

[0071] In step 2322, the step of constructing triples may include: constructing one or more cue learning statements, each of which can refer to any of the multiple second subsets; and, for each of the multiple second subsets, constructing a triple, wherein any one of the one or more cue learning statements is used as the first element of the triple, all textual business data in the second subset is used as the second element of the triple, and the unique label of the first subset containing the data in the second dataset that, from a business perspective, should immediately follow all the textual business data in the second subset is used as the third element of the triple.

[0072] Here, "the data in the second dataset that, from a business perspective, should immediately follow all the textual business data in the second subset" refers to the data in the second dataset that is most suitable to appear immediately after all textual business data in the second subset, whether from the perspective of business process advancement or from the perspective of the logical connection or contextual meaning of the plain text.

[0073] As a non-limiting example, taking a real estate transaction process, an exemplary second subset includes the following data: Data 1 "Buyer inquires about the down payment percentage"; Data 2 "Seller inquires about the buyer's loan amount and how long it will take to obtain bank disbursement"; Data 3 "Real estate agent estimates the total purchase price for the buyer, including various taxes and agent fees." The exemplary second dataset includes at least the following: Data M is textual business data obtained from screenshots of the buyer and agent's communication via WeChat regarding the time of a property viewing; Data M+1 "Seller inquires with the agent about listing the property"; Data M+2 is textual business data generated from audio and video recordings of the buyer negotiating agent fees with the salesperson at the store; and so on. Therefore, from a business perspective, the data in the second dataset that should immediately follow all textual business data in the exemplary second subset is Data M+2, because both from the perspective of business process progression and from the logical connection or contextual understanding of purely textual language, Data M+2 is most suitable to appear immediately after Data 1 to 3.

[0074] As used in this disclosure, the term "cue learning statement" can refer to a statement consisting of cue words that enable a language model to learn or reason better. For example, a cue learning statement could be, "Based on the following data, please determine which action the customer or agent should take?" Any statement or set of statements can appropriately follow this cue learning statement, because "the following data" in the cue learning statement refers to the statement or set of statements immediately following it. Examples of cue learning statements are diverse and can be automatically generated or manually designed, and this disclosure does not impose any limitations on them.

[0075] Following the example above, an exemplary triple could be {"Based on the following data, please determine which action the customer or agent should take?"; "Buyer inquires about the down payment percentage" + "Seller inquires about the buyer's loan amount and how long it will take to obtain bank disbursement" + "Real estate agent estimates the total purchase price for the buyer, including various taxes and agent fees"; M}. The first element of the triple is the cue learning statement "Based on the following data, please determine which action the customer or agent should take?", the second element is all the textual business data of the exemplary second subset (i.e., Data 1 to 3 as quoted above), and the third element is the cluster label of Data M+2 (i.e., the unique label of the first subset to which Data M+2 belongs). Here, the label of Data M+2 is assumed to be M.
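The triple construction described above can be sketched as a small helper. The names `build_triples` and `next_label` are hypothetical: in particular, `next_label` stands in for whatever mechanism identifies the cluster label of the data that should follow a second subset, which this passage does not specify. Joining the subset with " + " simply mirrors the notation of the example, and the label value 7 is a placeholder.

```python
from typing import Callable, List, Sequence, Tuple

Triple = Tuple[str, str, int]  # (cue learning statement, data, cluster label)


def build_triples(
    second_subsets: Sequence[Sequence[str]],
    cue_statements: Sequence[str],
    next_label: Callable[[Sequence[str]], int],
) -> List[Triple]:
    """Build one triple per second subset."""
    triples: List[Triple] = []
    for i, subset in enumerate(second_subsets):
        # Any cue statement may be used; rotating through them is one option.
        cue = cue_statements[i % len(cue_statements)]
        data = " + ".join(subset)
        triples.append((cue, data, next_label(subset)))
    return triples


CUE = "Based on the following data, please determine which action the customer or agent should take?"
subset = [
    "Buyer inquires about the down payment percentage",
    "Seller inquires about the buyer's loan amount and how long it will take to obtain bank disbursement",
    "Real estate agent estimates the total purchase price for the buyer, including various taxes and agent fees",
]
# The lambda is a stand-in that always returns a placeholder label.
triples = build_triples([subset], [CUE], next_label=lambda s: 7)
```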

[0076] It will be understood that the second training task differs from the first in that it additionally requires training on the cluster labels obtained by unsupervised clustering. The training method can be summarized as follows. First, design a specific cue learning statement, such as "Based on the following data, please determine which action the customer or agent should take?" Then, concatenate it with the corresponding data content and the answer to the cue (i.e., the label corresponding to "which action"). For example, the label here can be the unique cluster label, in numerical form, obtained after unsupervised clustering. Finally, feed all the triples assembled according to the above steps into the pre-trained model obtained from the first training task for further training.
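The triple construction summarized above can be sketched as follows. This is a minimal illustration, not the implementation of this disclosure: the names `Triple`, `build_triple`, and `to_training_text`, the " + " separator, the "Answer:" marker, and the numeric label 7 are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Triple:
    prompt: str      # first element: the cue learning statement
    context: str     # second element: all textual business data of one second subset
    next_label: int  # third element: cluster label of the data that should follow


# Cue learning statement quoted from the description.
PROMPT = ("Based on the following data, please determine which action "
          "the customer or agent should take?")


def build_triple(subset: Sequence[str], next_label: int,
                 prompt: str = PROMPT) -> Triple:
    """Join the data of one second subset and attach the follow-up cluster label."""
    return Triple(prompt, " + ".join(subset), next_label)


def to_training_text(triple: Triple) -> str:
    """Concatenate cue, data content, and answer into one training string.

    The layout (newlines, 'Answer:' marker) is a hypothetical choice.
    """
    return f"{triple.prompt}\n{triple.context}\nAnswer: {triple.next_label}"


example = build_triple(
    [
        "Buyer inquires about the down payment percentage",
        "Seller inquires about the buyer's loan amount and how long it "
        "will take to obtain bank disbursement",
        "Real estate agent estimates the total purchase price for the "
        "buyer, including various taxes and agent fees",
    ],
    next_label=7,  # hypothetical numeric cluster label standing in for "M"
)
```

Each string produced by `to_training_text` would then be one training sample for the pre-trained model; how the samples are tokenized and batched depends on the language model used.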

[0077] Figure 4 is a flowchart of an example process 233 for the third training task in the method of Figure 2, according to an exemplary embodiment. As shown in Figure 4, according to some embodiments, process 233, in which the trained language model is fine-tuned based on the second dataset in the third training task to obtain the fine-tuned language model, may include:

[0078] Step 2331: Select a certain proportion of data from each first subset of the second dataset to obtain a fourth dataset.

[0079] Step 2332: Based on the fourth dataset, perform the step of constructing quadruples.

[0080] Step 2333: Input all the constructed quadruples into the trained language model to fine-tune it and obtain the fine-tuned language model.

[0081] For example, in step 2331, a small sample of data can be randomly selected, manually and in a certain proportion, from each class of unsupervised clustering samples (i.e., each of the K clusters), so that it can be manually annotated afterwards.
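The description leaves the selection procedure open and mentions manual selection. As one possible illustration, the per-cluster sampling of step 2331 could be automated as below; the function name `sample_fourth_dataset`, the rounding rule (round up, at least one item per non-empty cluster), and the fixed seed are assumptions, not part of this disclosure.

```python
import math
import random
from typing import Dict, List


def sample_fourth_dataset(clusters: Dict[int, List[str]],
                          proportion: float,
                          seed: int = 0) -> Dict[int, List[str]]:
    """Randomly pick the same proportion of data from each of the K clusters.

    `clusters` maps a cluster label to the data of that first subset.
    """
    if not 0.0 < proportion <= 1.0:
        raise ValueError("proportion must be in (0, 1]")
    rng = random.Random(seed)  # seeded so the selection is reproducible
    sampled: Dict[int, List[str]] = {}
    for label, items in clusters.items():
        # Round up so that every non-empty cluster contributes at least one item.
        count = math.ceil(len(items) * proportion) if items else 0
        sampled[label] = rng.sample(items, count)
    return sampled
```

Sampling within each cluster, rather than over the whole dataset, keeps every one of the K clusters represented in the data handed over for manual annotation.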

[0082] In step 2332, the step of constructing quadruples may include: annotating each piece of data in the fourth dataset to obtain the corresponding annotation information for each piece of data; constructing one or more cue learning statements, each of which can point to each piece of data in the fourth dataset; and constructing a quadruple for each piece of data in the fourth dataset, wherein any one of the cue learning statements is used as the first element of the quadruple, the piece of data is used as the second element, the unique label of the first subset to which belongs the data in the second dataset that, from a business perspective, should immediately follow the piece of data is used as the third element, and the corresponding annotation information of the piece of data is used as the fourth element.

[0083] For example, the content requiring manual annotation may include 1) which step of the established business process the data actually belongs to, and 2) obvious value information contained in the sample that is related to the progress of the process (such as target attributes and participant intentions). This disclosure does not impose any restrictions on this.

[0084] Using the example above, an exemplary quadruple could be {"Based on the following data, please determine which action the customer or agent should take?"; "Buyer inquires about the down payment percentage" + "Seller inquires about the buyer's loan amount and how long it will take to obtain bank disbursement" + "Real estate agent estimates the total purchase price for the buyer, including various taxes and agent fees"; L; "Related to the reason for property handover"}. The first element of the quadruple is the cue learning statement "Based on the following data, please determine which action the customer or agent should take?"; the second element is all the textual business data of the exemplary second subset (i.e., Data 1 to 3 joined together); the third element is the next action corresponding to all the information contained in the second element. For example, if the next data is Data L+2 "Buyer inquires whether the seller has owned the property for five years or for two years," and the label of Data L+2 is assumed to be L, then the third element is L. The fourth element is the reason for selecting Data L+2, i.e., "Related to the reason for property handover." Finally, all the quadruples assembled according to the above steps are fed into the trained model obtained from the second training task for model tuning.
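The quadruple construction of step 2332 can be sketched as follows. This is an illustrative sketch under stated assumptions: the names `Quadruple` and `build_quadruples`, the input tuple layout, and the numeric label 12 (standing in for "L") are hypothetical, and the annotation string is assumed to come from a human annotator.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Quadruple:
    prompt: str      # first element: cue learning statement
    data: str        # second element: the annotated data
    next_label: int  # third element: cluster label of the data that should follow
    annotation: str  # fourth element: manual annotation information


def build_quadruples(samples: Iterable[Tuple[str, int, str]],
                     prompt: str) -> List[Quadruple]:
    """Build one quadruple per annotated sample.

    Each sample is (data, next_label, annotation); the annotation is
    assumed to have been written by a human annotator beforehand.
    """
    return [Quadruple(prompt, data, label, note) for data, label, note in samples]


quadruples = build_quadruples(
    [(
        "Buyer inquires about the down payment percentage",
        12,  # hypothetical numeric cluster label standing in for "L"
        "Related to the reason for property handover",
    )],
    prompt="Based on the following data, please determine which action "
           "the customer or agent should take?",
)
```

Compared with the triples of the second training task, the only structural addition is the fourth element, which carries the manually written reason.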

[0085] It will be understood that the quadruples constructed for model tuning fall roughly into two categories: correction samples and enhancement samples. Correction samples can be obtained by modifying the less desirable triples described in process 232 above, in order to correct the impact of those triples on the trained language model. Enhancement samples can be newly created quadruples, or can be obtained by appropriately modifying a more reasonable triple, in order to improve the reasoning ability of the language model.

[0086] Figure 5 is a flowchart of part of an example process 240 in the method of Figure 2, according to an exemplary embodiment. As shown in Figure 5, according to some embodiments, process 240, in which a prediction model is obtained based on the fine-tuned language model, may include:

[0087] Step 241: In response to determining that the fine-tuned language model needs further tuning, further tune and train the fine-tuned language model based on a fifth dataset to obtain the prediction model.

[0088] Step 242: In response to determining that the fine-tuned language model will not be further tuned, determine the fine-tuned language model as the prediction model.

[0089] Therefore, it is possible to determine whether to further optimize the optimized language model based on the requirements of the language model's inference performance in different application scenarios, so as to meet the corresponding performance requirements, improve the model's inference ability, and better provide decision assistance or strategy recommendations for various parties in the decision-making process involving one or more parties.
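The branch between steps 241 and 242 can be sketched as below. The description does not say how the need for further tuning is determined; comparing an evaluation score against a required score is only one hypothetical criterion, and `obtain_prediction_model`, `score`, `required_score`, and `further_tune` are names invented for this illustration.

```python
from typing import Callable, TypeVar

Model = TypeVar("Model")


def obtain_prediction_model(model: Model,
                            score: float,
                            required_score: float,
                            further_tune: Callable[[Model], Model]) -> Model:
    """Return the prediction model, tuning further only when required.

    `score` is an assumed evaluation result of the fine-tuned model and
    `required_score` the performance demanded by the application scenario.
    """
    if score < required_score:
        # Step 241: further tuning, e.g. on a fifth dataset.
        return further_tune(model)
    # Step 242: the fine-tuned model is used as the prediction model as-is.
    return model
```

Passing the tuning routine in as a callable keeps the decision logic independent of how the fifth dataset is built.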

[0090] According to some embodiments, further tuning and training the fine-tuned language model based on the fifth dataset to obtain the prediction model includes at least one of the following:

[0091] From each first subset of the second dataset, select one or more pieces of data other than the previously selected proportion of data (that is, data not already in the fourth dataset) to obtain the fifth dataset; then perform the step of constructing quadruples based on the fifth dataset, and input all the newly constructed quadruples into the fine-tuned language model; or

[0092] Select at least a portion of the previously selected proportion of data of the second dataset (that is, of the fourth dataset) to obtain the fifth dataset, adjust the annotation information of the corresponding quadruples in the fifth dataset, and input the adjusted quadruples into the fine-tuned language model.

[0093] Therefore, the language model that has been initially fine-tuned can be further tuned by constructing enhancement samples or correction samples, in order to better provide decision-making assistance or strategy recommendations for the various participants in the decision-making process.
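The two ways of obtaining the fifth dataset can be illustrated as follows. This is a sketch under assumptions: quadruples are represented as plain tuples `(prompt, data, next_label, annotation)`, and the function names, the `per_cluster` parameter, and the `corrections` mapping are hypothetical.

```python
import random
from typing import Dict, List, Tuple

Quad = Tuple[str, str, int, str]  # (prompt, data, next_label, annotation)


def select_unseen(clusters: Dict[int, List[str]],
                  fourth: Dict[int, List[str]],
                  per_cluster: int,
                  seed: int = 0) -> Dict[int, List[str]]:
    """First option: per cluster, pick data that is not in the fourth dataset."""
    rng = random.Random(seed)
    fifth: Dict[int, List[str]] = {}
    for label, items in clusters.items():
        already_used = set(fourth.get(label, []))
        remaining = [item for item in items if item not in already_used]
        fifth[label] = rng.sample(remaining, min(per_cluster, len(remaining)))
    return fifth


def adjust_annotations(quadruples: List[Quad],
                       corrections: Dict[str, str]) -> List[Quad]:
    """Second option: re-annotate existing quadruples and return the adjusted ones.

    `corrections` maps a piece of data to its revised annotation.
    """
    return [
        (prompt, data, label, corrections[data])
        for prompt, data, label, _old in quadruples
        if data in corrections
    ]
```

The first function yields new data from which new quadruples are built, while the second yields adjusted versions of quadruples already used in fine-tuning.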

[0094] According to some embodiments, multiple pieces of textual business data may include business target data, business participant data, and communication records between business participants.

[0095] For example, real estate-related data can be broadly categorized based on its source: information related to basic property attributes such as property directories; information related to user browsing behavior and preferences; background and identity information of users and agents; and chat logs between users and agents. Furthermore, different types of information may be mixed in from different channels. This disclosure does not impose any restrictions on this.

[0096] According to some embodiments, the language model may include a generative pre-trained transformer (GPT) model.

[0097] Figure 6 is a schematic block diagram illustrating an apparatus 600 for training a language model to obtain a prediction model according to an exemplary embodiment.

[0098] As shown in Figure 6, the apparatus 600 for training a language model to obtain a prediction model may include: a first module 610 for acquiring multiple pieces of textual business data as a first dataset; a second module 620 for performing an unsupervised clustering operation on the first dataset to obtain a second dataset; a third module 630 for performing the training tasks on the language model to obtain a fine-tuned language model; and a fourth module 640 for obtaining a prediction model based on the fine-tuned language model.
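The data flow between the four modules can be sketched as a simple composition. This is only a structural illustration: the class name `Apparatus600`, the field names, and the callable signatures are assumptions, and each module is left as a pluggable function rather than a concrete clustering or training routine.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass
class Apparatus600:
    # First module 610: acquires textual business data (first dataset).
    acquire: Callable[[], List[str]]
    # Second module 620: unsupervised clustering into K labeled first subsets.
    cluster: Callable[[List[str]], Dict[int, List[str]]]
    # Third module 630: runs the three training tasks, returns the fine-tuned model.
    train: Callable[[List[str], Dict[int, List[str]]], Any]
    # Fourth module 640: derives the prediction model from the fine-tuned model.
    finalize: Callable[[Any], Any]

    def run(self) -> Any:
        first_dataset = self.acquire()
        second_dataset = self.cluster(first_dataset)
        fine_tuned = self.train(first_dataset, second_dataset)
        return self.finalize(fine_tuned)
```

Because the modules are plain callables, two of them can be merged into one function or split further without changing `run`.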

[0099] Device 600 pre-trains a language model using unlabeled textual business data to initially equip the model with rich linguistic knowledge and business-related information. Subsequently, it obtains the unsupervised clustering labels needed for diagnostic decisions through an unsupervised clustering operation, and then performs weakly supervised training on the pre-trained language model using cue learning statements. Finally, through fine-tuning, the trained language model is tuned on a small dataset with manual annotations, so that it better learns business-related knowledge and the corresponding diagnostic decision-making actions and can provide relevant suggestions to all parties involved in the business process.

[0100] Device 600 pre-trains and trains a language model using a large number of unsupervised samples, and then fine-tunes the trained model using a small amount of manually labeled data combined with cue learning, reducing the manual annotation and analysis costs required to obtain the prediction model. The unsupervised samples refer to the first and second datasets used in the first and second training tasks. Since the first and second training tasks can be executed automatically without manual intervention, the first and second datasets can be considered as data samples under unsupervised training, i.e., unsupervised samples.

[0101] It should be understood that the various modules of the device 600 shown in Figure 6 may correspond to the steps of method 200 described with reference to Figure 2. Therefore, the operations, features, and advantages described above for method 200 also apply to device 600 and its included modules. For the sake of brevity, some operations, features, and advantages will not be repeated here.

[0102] While specific functions have been discussed above with reference to specific modules, it should be noted that the functions of the modules discussed herein can be divided into multiple modules, and / or at least some functions of multiple modules can be combined into a single module. The specific module performing an action discussed herein includes the specific module itself performing the action, or alternatively, the specific module calling or otherwise accessing another component or module that performs the action (or performs the action in conjunction with the specific module). Therefore, a specific module performing an action may include the specific module performing the action itself and / or another module that the specific module calls or otherwise accesses to perform the action. For example, the first module 610 and the second module 620 may be combined into a single module in some embodiments. As another example, the fourth module 640 may include the third module 630 in some embodiments. As used herein, the phrase "entity A initiates action B" may mean that entity A issues an instruction to perform action B, but entity A itself does not necessarily perform action B.

[0103] It should also be understood that various technologies may be described herein in the general context of software and hardware components or program modules. The various modules described above with respect to Figure 6 can be implemented in hardware or in hardware in combination with software and / or firmware. For example, these modules can be implemented as computer program code / instructions configured to execute in one or more processors and stored in a computer-readable storage medium. Alternatively, these modules can be implemented as hardware logic / circuitry. For example, in some embodiments, one or more of the first module 610, the second module 620, the third module 630, and the fourth module 640 can be implemented together in a System on Chip (SoC). The SoC may include an integrated circuit chip (which includes a processor (e.g., a Central Processing Unit (CPU), microcontroller, microprocessor, digital signal processor (DSP), etc.), memory, one or more communication interfaces, and / or one or more components of other circuitry) and may optionally execute received program code and / or include embedded firmware to perform functions.

[0104] According to embodiments of this disclosure, the training cost of the prediction model is reduced by using unsupervised pre-training and training, as well as a small number of supervised training samples based on cue learning and manual annotation; unsupervised clustering is used to cluster the training data, and the clustering results are used as annotation references for model training and tuning, thereby avoiding the redundant cost of manual data annotation and reducing the complexity of manual analysis.

[0105] According to another aspect of this disclosure, a computer device is provided, including a memory, a processor, and a computer program stored in the memory. When executed by the processor, the computer program causes the processor to implement the steps of any of the method embodiments described above.

[0106] According to another aspect of this disclosure, a non-transitory computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the steps of any of the method embodiments described above.

[0107] According to another aspect of this disclosure, a computer program product is provided, which includes a computer program that, when executed by a processor, implements the steps of any of the method embodiments described above.

[0108] Illustrative examples of such computer devices, non-transitory computer-readable storage media, and computer program products are described below with reference to Figure 7.

[0109] Figure 7 shows an example configuration of a computer device 700 that can be used to implement the methods described herein. For example, the server 120 and / or client device 110 shown in Figure 1 may include an architecture similar to computer device 700. The apparatus 600 described above for training a language model to obtain a prediction model may also be implemented wholly or at least partially by computer device 700 or similar devices or systems.

[0110] Computer device 700 can be a variety of different types of devices. Examples of computer device 700 include, but are not limited to: desktop computers, server computers, laptop or netbook computers, mobile devices (e.g., tablet computers, cellular or other wireless phones (e.g., smartphones), notebook computers, mobile stations), wearable devices (e.g., glasses, watches), entertainment devices (e.g., entertainment appliances, set-top boxes communicatively coupled to a display device, game consoles), televisions or other display devices, automotive computers, and so on.

[0111] Computer device 700 may include at least one processor 702, memory 704, one or more communication interfaces 706, display device 708, other input / output (I / O) devices 710, and one or more mass storage devices 712 capable of communicating with each other, such as via system bus 714 or other suitable connections.

[0112] Processor 702 may be a single processing unit or multiple processing units, and all processing units may include single or multiple computing units or multiple cores. Processor 702 may be implemented as one or more microprocessors, microcomputers, microcontrollers, digital signal processors, central processing units, state machines, logic circuits, and / or any device that manipulates signals based on operating instructions. Among other capabilities, processor 702 may be configured to acquire and execute computer-readable instructions stored in memory 704, mass storage device 712, or other computer-readable media, such as program code of operating system 716, program code of application program 718, program code 722 of other program 720, etc.

[0113] Memory 704 and mass storage device 712 are examples of computer-readable storage media for storing instructions executed by processor 702 to perform the various functions described above. For example, memory 704 can generally include both volatile and non-volatile memory (e.g., RAM, ROM, etc.). Furthermore, mass storage device 712 can generally include hard disk drives, solid-state drives, removable media, including external and removable drives, memory cards, flash memory, floppy disks, optical disks (e.g., CDs, DVDs), storage arrays, network-attached storage, storage area networks, etc. Both memory 704 and mass storage device 712 can be collectively referred to herein as memory or computer-readable storage media, and can be non-transitory media capable of storing computer-readable, processor-executable program instructions as computer program code, which can be executed by processor 702 as a specific machine configured to perform the operations and functions described in the examples herein.

[0114] Multiple programs may be stored on mass storage device 712. These programs include operating system 716, one or more application programs 718, other programs 720, and program data 722, and they may be loaded into memory 704 for execution. Examples of such application programs or program modules may include, for example, computer program logic (e.g., computer program code or instructions) for implementing components / functions such as client application 112, method 200, process 232, process 233, and / or process 240 and any of their modules or steps, and / or other embodiments described herein.

[0115] Although modules 716, 718, 720, and 722, or portions thereof, are illustrated in Figure 7 as being stored in memory 704 of computer device 700, they may be implemented using any form of computer-readable medium accessible by computer device 700. As used herein, "computer-readable medium" includes at least two types of computer-readable media: computer-readable storage media and communication media.

[0116] Computer-readable storage media include volatile and non-volatile, removable and non-removable media implemented by any method or technology for storing information such as computer-readable instructions, data structures, program modules, or other data. Computer-readable storage media include, but are not limited to, RAM, ROM, EEPROM, flash memory or other memory technologies, CD-ROM, DVD, or other optical storage devices, magnetic cassettes, magnetic tapes, disk storage devices or other magnetic storage devices, or any other non-transmission medium that can be used to store information for access by a computer device. In contrast, communication media can embody computer-readable instructions, data structures, program modules, or other data in modulated data signals such as carrier waves or other transmission mechanisms. Computer-readable storage media as defined herein do not include communication media.

[0117] One or more communication interfaces 706 are used for exchanging data with other devices, such as via a network, direct connection, etc. Such communication interfaces can be one or more of the following: any type of network interface (e.g., a network interface card (NIC)), a wired or wireless interface (such as an IEEE 802.11 Wireless LAN (WLAN) interface), a Wi-MAX interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth™ interface, a near-field communication (NFC) interface, etc. Communication interface 706 facilitates communication across various network and protocol types, including wired networks (e.g., LAN, cable, etc.) and wireless networks (e.g., WLAN, cellular, satellite, etc.), the Internet, etc. Communication interface 706 can also provide communication with external storage devices (not shown) such as storage arrays, network-attached storage, storage area networks, etc.

[0118] In some examples, a display device 708, such as a monitor, may be included for displaying information and images to the user. Other I / O devices 710 may be devices that receive various inputs from the user and provide various outputs to the user, and may include touch input devices, gesture input devices, cameras, keyboards, remote controls, mice, printers, audio input / output devices, and so on.

[0119] The technologies described herein can be supported by these various configurations of computer device 700, and are not limited to specific examples of the technologies described herein. For example, the functionality can also be implemented wholly or partially on a “cloud” using a distributed system. A cloud includes and / or represents a platform for resources. The platform abstracts the underlying functionality of the cloud’s hardware (e.g., servers) and software resources. Resources may include applications and / or data that can be used when performing computational processing on a server remote from computer device 700. Resources may also include services provided via the Internet and / or via subscriber networks such as cellular or Wi-Fi networks. The platform can abstract resources and functionality to connect computer device 700 to other computer devices. Therefore, the implementation of the functionality described herein can be distributed throughout the cloud. For example, the functionality may be implemented partly on computer device 700 and partly through a platform that abstracts the functionality of the cloud.

[0120] Although this disclosure has been described and illustrated in detail in the accompanying drawings and the foregoing description, such description and illustration should be considered illustrative and suggestive, not restrictive; this disclosure is not limited to the disclosed embodiments. By studying the drawings, the disclosure, and the appended claims, those skilled in the art will be able to understand and implement variations of the disclosed embodiments in practice with respect to the claimed subject matter. In the claims, the word "comprising" does not exclude other elements or steps not listed, the indefinite article "a" or "an" does not exclude a plurality, the term "a plurality" means two or more, and the term "based on" should be interpreted as "at least partially based on". The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be beneficial.

Claims

1. A method for training a language model to obtain a prediction model, wherein the prediction model is used to process communication record data of multiple participants to provide recommended actions corresponding to the multiple participants, the method comprising:
acquiring multiple pieces of textual business data as a first dataset, the multiple pieces of textual business data describing business-related situations;
performing an unsupervised clustering operation on the first dataset to obtain a second dataset, wherein the second dataset includes K first subsets, each piece of data in the first dataset belongs to a corresponding first subset of the K first subsets and has the unique label of the corresponding first subset, K represents the number of clusters obtained after performing the unsupervised clustering operation on the first dataset, and K is an integer greater than or equal to 2;
performing the following training tasks on the language model:
in a first training task, pre-training the language model based on the first dataset to obtain a pre-trained language model;
in a second training task, performing semi-supervised training on the pre-trained language model based on the first dataset and the second dataset to obtain a trained language model; and
in a third training task, fine-tuning the trained language model based on the second dataset to obtain a fine-tuned language model; and
obtaining the prediction model based on the fine-tuned language model,
wherein, in the second training task, performing semi-supervised training on the pre-trained language model based on the first dataset and the second dataset to obtain the trained language model includes:
obtaining a third dataset, which includes multiple different second subsets, each of which includes one piece of textual business data or multiple related pieces of textual business data from the first dataset;
based on the third dataset, performing the step of constructing triples, including: constructing one or more cue learning statements, each of which can point to any second subset of the multiple second subsets; and, for each of the multiple second subsets, constructing a triple, wherein any one of the one or more cue learning statements is used as the first element of the triple, all textual business data in the second subset is used as the second element of the triple, and the unique label of the first subset to which belongs the data in the second dataset that, from a business perspective, should immediately follow all textual business data in the second subset is used as the third element of the triple; and
inputting all the constructed triples into the pre-trained language model to perform semi-supervised training on the pre-trained language model, thereby obtaining the trained language model.

2. The method according to claim 1, wherein performing the unsupervised clustering operation on the first dataset to obtain the second dataset includes one of the following: specifying the value of K; or not specifying the value of K, wherein the number of clusters K obtained after the unsupervised clustering operation depends on prior knowledge or on an evaluation metric of the unsupervised clustering operation.

3. The method according to claim 1 or 2, wherein, in the first training task, pre-training the language model based on the first dataset to obtain the pre-trained language model includes one of the following: randomly initializing the parameters of the language model; or setting the initial values of the parameters of the language model based on another trained language model that is structurally identical to the language model.

4. The method according to claim 1 or 2, wherein, in the third training task, fine-tuning the trained language model based on the second dataset to obtain the fine-tuned language model includes:
selecting a certain proportion of data from each first subset of the second dataset to obtain a fourth dataset;
based on the fourth dataset, performing the step of constructing quadruples, including: annotating each piece of data in the fourth dataset to obtain the corresponding annotation information for each piece of data; constructing one or more cue learning statements, each of which can point to each piece of data in the fourth dataset; and, for each piece of data in the fourth dataset, constructing a quadruple, wherein any one of the one or more cue learning statements is the first element of the quadruple, the piece of data is the second element of the quadruple, the unique label of the first subset to which belongs the data in the second dataset that should immediately follow the piece of data is the third element of the quadruple, and the corresponding annotation information of the piece of data is the fourth element of the quadruple; and
inputting all the constructed quadruples into the trained language model to fine-tune the trained language model, thereby obtaining the fine-tuned language model.

5. The method according to claim 4, wherein obtaining the prediction model based on the fine-tuned language model includes: in response to determining that the fine-tuned language model needs further tuning, further tuning and training the fine-tuned language model based on a fifth dataset to obtain the prediction model, and wherein further tuning and training the fine-tuned language model based on the fifth dataset to obtain the prediction model includes at least one of the following:
from each first subset of the second dataset, selecting one or more pieces of data that differ from the certain proportion of data to obtain the fifth dataset, performing the step of constructing quadruples based on the fifth dataset, and inputting all newly constructed quadruples into the fine-tuned language model; or
selecting at least a portion of the certain proportion of data of the second dataset to obtain the fifth dataset, adjusting the annotation information of the corresponding quadruples in the fifth dataset, and inputting the adjusted quadruples into the fine-tuned language model.

6. The method according to claim 1 or 2, wherein obtaining the prediction model based on the fine-tuned language model includes: in response to determining that the fine-tuned language model will not be further tuned, determining the fine-tuned language model as the prediction model.

7. The method according to claim 1 or 2, wherein the multiple pieces of textual business data include business target data, business participant data, and communication records between business participants.

8. The method according to claim 1 or 2, wherein the language model includes a generative pre-trained transformer (GPT) model.

9. An apparatus for training a language model to obtain a prediction model, wherein the prediction model is used to process communication record data of multiple participants to provide recommended actions corresponding to the multiple participants, the apparatus comprising:
a first module for acquiring multiple pieces of textual business data as a first dataset, the multiple pieces of textual business data describing business-related situations;
a second module for performing an unsupervised clustering operation on the first dataset to obtain a second dataset, wherein the second dataset includes K first subsets, each piece of data in the first dataset belongs to a corresponding first subset of the K first subsets and has the unique label of the corresponding first subset, K represents the number of clusters obtained after performing the unsupervised clustering operation on the first dataset, and K is an integer greater than or equal to 2;
a third module for performing the following training tasks on the language model:
in a first training task, pre-training the language model based on the first dataset to obtain a pre-trained language model;
in a second training task, performing semi-supervised training on the pre-trained language model based on the first dataset and the second dataset to obtain a trained language model; and
in a third training task, fine-tuning the trained language model based on the second dataset to obtain a fine-tuned language model; and
a fourth module for obtaining the prediction model based on the fine-tuned language model,
wherein, in the second training task, performing semi-supervised training on the pre-trained language model based on the first dataset and the second dataset to obtain the trained language model includes:
obtaining a third dataset, which includes multiple different second subsets, each of which includes one piece of textual business data or multiple related pieces of textual business data from the first dataset;
based on the third dataset, performing the step of constructing triples, including: constructing one or more cue learning statements, each of which can point to any second subset of the multiple second subsets; and, for each of the multiple second subsets, constructing a triple, wherein any one of the one or more cue learning statements is used as the first element of the triple, all textual business data in the second subset is used as the second element of the triple, and the unique label of the first subset to which belongs the data in the second dataset that, from a business perspective, should immediately follow all textual business data in the second subset is used as the third element of the triple; and
inputting all the constructed triples into the pre-trained language model to perform semi-supervised training on the pre-trained language model, thereby obtaining the trained language model.

10. A computer device, comprising: at least one processor; and at least one memory on which a computer program is stored, wherein the computer program, when executed by the at least one processor, causes the at least one processor to perform the method according to any one of claims 1 to 8.

11. A computer-readable storage medium having instructions stored thereon, which, when executed by one or more processors, cause the one or more processors to perform the method of any one of claims 1 to 8.

12. A computer program product comprising instructions that, when executed by one or more processors, cause the one or more processors to perform the method of any one of claims 1 to 8.