A method and device for training a graph data model, a storage medium and an electronic device

By identifying user subgraph data from global graph data and classifying samples based on historical business records, and using unlabeled samples to fit the distribution and train the graph data model, the problem of low correlation in graph neural network model training is solved, achieving efficient training and accurate judgment with small samples.

CN116403034B · Active · Publication Date: 2026-06-19 · ALIPAY (HANGZHOU) INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ALIPAY (HANGZHOU) INFORMATION TECH CO LTD
Filing Date
2023-03-28
Publication Date
2026-06-19

Abstract

In the graph data model training method provided in this specification, training samples are determined based on historical business records, and the risk assessment value of each training sample is determined using a graph data model. A distribution is determined based on the risk assessment values of the unlabeled samples. Then, based on the labels of the labeled samples, the interval in the distribution that matches each label is determined. The graph data model is trained with the goal that the risk assessment value of each labeled sample falls within the interval matching its label. By using the risk assessment values of unlabeled samples to determine the distribution, determining the difference between the risk assessment values of labeled samples and the distribution, determining the loss based on that difference, and training the graph data model with the loss, a graph neural network model can be trained using a small number of samples.

Description

Technical Field

[0001] This application relates to the field of computers, and in particular to a method, apparatus, storage medium, and electronic device for training graph data models.

Background Technology

[0002] With the development of internet technology, people are paying increasing attention to protecting user privacy during data processing. Currently, using graph neural network models to extract features and classify graph data has become very common. Generally, graph neural network models require a large number of labeled samples for training. However, in risk control operations such as financial transactions, the samples obtained are usually unlabeled. Adding labels to a large number of unlabeled samples would result in excessively high training costs.

[0003] To address this issue, existing technologies typically utilize large amounts of labeled data acquired from other scenarios to train a graph neural network (GNN) model, enabling it to extract features. This model is then fine-tuned using a small number of labeled samples from risk control data related to financial transactions. It is evident that the training performance of the GNN model depends on the correlation between the data from other scenarios and the risk control data from financial transactions; if the correlation is low, the training performance of the GNN model will be poor.

[0004] To address this, this application proposes a method for training graph neural network models using few samples.

Summary of the Invention

[0005] This specification provides a method, apparatus, storage medium, and electronic device for training graph data models, to at least partially solve the above-mentioned problems.

[0006] The following technical solution is adopted in this specification:

[0007] This specification provides a method for training a graph data model, the method comprising:

[0008] For each user, the subgraph data corresponding to that user is determined from the global graph data and used as training samples;

[0009] Based on historical business records, if it is determined that risk control business has been performed on the user, the subgraph data corresponding to the user will be used as a labeled sample; if it is determined that risk control business has not been performed on the user, the subgraph data corresponding to the user will be used as an unlabeled sample.

[0010] The training samples are input into the graph data model to be trained to determine the risk assessment value of the training samples;

[0011] Determine the distribution of the risk assessment values based at least on the risk assessment value of each unlabeled sample;

[0012] Based on the labels of the labeled samples, determine the intervals in the distribution that match the labels;

[0013] The graph data model to be trained is trained with the goal of ensuring that the risk assessment value of the labeled sample falls within the interval that matches the label.

[0014] Optionally, determining the distribution of the risk assessment values specifically includes:

[0015] Store the risk assessment value of each training sample;

[0016] Based on the stored risk assessment values of the unlabeled samples and the number of unlabeled samples corresponding to each risk assessment value, the distribution of the risk assessment values is determined by fitting a Gaussian function.

[0017] Optionally, before determining the distribution of the risk assessment values, the method further includes:

[0018] Determine whether the number of stored risk assessment values of unlabeled samples exceeds a preset threshold;

[0019] If so, delete the risk assessment values of a specified number of unlabeled samples according to the storage time of each unlabeled sample's risk assessment value.

[0020] Optionally, the labeled samples include white samples and black samples, wherein the white samples are training samples in which the risk control results show no risk, and the black samples are training samples in which the risk control results show risk.

[0021] Determining the distribution of the risk assessment values specifically includes:

[0022] Store the risk assessment value of each training sample;

[0023] Based on the stored risk assessment values of the unlabeled samples and the white samples, as well as the number of unlabeled and white samples corresponding to each risk assessment value, the distribution of the risk assessment values is determined by fitting a Gaussian function.

[0024] Optionally, based on the labels of the labeled samples, determining the intervals in the distribution that match the labels specifically includes:

[0025] Determine the standard deviation and mean of the distribution;

[0026] The boundary values of the central region of the distribution are determined based on the standard deviation and the mean, wherein the number of risk assessment values falling in the central region is greater than the number falling outside the central region.

[0027] The central region is designated as the interval that matches the label of the white sample, and the region outside the central region is designated as the interval that matches the label of the black sample.

[0028] Optionally, the graph data model includes a feature extraction subnetwork and an evaluation subnetwork, and the method further includes:

[0029] The evaluation subnet in the trained graph data model is replaced with the classification subnet to be trained to obtain the classification model;

[0030] Filter the training samples to determine the labeled samples among them;

[0031] The labeled samples are input into the classification model to obtain the classification results;

[0032] Based on the classification results and the labels corresponding to the labeled samples, the classification model is trained. The trained classification model is used to determine whether there is any risk in transactions between users based on the users' transaction data.

[0033] Optionally, the graph data model to be trained is trained with the goal of ensuring that the risk assessment value of the labeled sample falls within the interval that matches the label. This specifically includes:

[0034] Determine the standard deviation and mean of the distribution;

[0035] The loss of the training samples is determined based on the risk assessment value, the standard deviation, and the mean value of the training samples.

[0036] When the training samples are white samples, the graph data model is trained with the goal of minimizing the loss.

[0037] When the training samples are black samples, the graph data model is trained with the loss not being less than the preset hyperparameter as the training objective.

[0038] This specification provides an apparatus for training graph data models, the apparatus comprising:

[0039] The sample determination module is used to determine the subgraph data corresponding to each user from the global graph data, and use it as training samples.

[0040] The sample classification module is used to classify data based on historical business records. If it is determined that risk control business has been performed on a user, the sub-graph data corresponding to that user is treated as a labeled sample. If it is determined that risk control business has not been performed on a user, the sub-graph data corresponding to that user is treated as an unlabeled sample.

[0041] An evaluation module is used to input the training samples into the graph data model to be trained and determine the risk assessment value of the training samples.

[0042] The distribution determination module is used to determine the distribution of the risk assessment values based at least on the risk assessment value of each unlabeled sample;

[0043] The interval determination module is used to determine the interval in the distribution that matches the label based on the label of the labeled sample;

[0044] The training module is used to train the graph data model to be trained with the goal of ensuring that the risk assessment value of the labeled sample falls within the interval that matches the label.

[0045] This specification provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described method for training a graph data model.

[0046] This specification provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the above-described method for training a graph data model.

[0047] The above-mentioned technical solutions adopted in this specification can achieve the following beneficial effects:

[0048] In the graph data model training method provided in this specification, training samples are determined based on historical business records, and the risk assessment values of the training samples are determined through the graph data model. The distribution is determined based on the risk assessment values of the unlabeled samples. Based on the labels of the labeled samples, the intervals in the distribution that match the labels are determined. The graph data model to be trained is trained with the goal that the risk assessment value of each labeled sample falls within the interval that matches its label.

[0049] As can be seen from the above method, by using the risk assessment values of unlabeled samples to determine the distribution, then determining the difference between the risk assessment values of labeled samples and the distribution, determining the loss based on the difference, and using the loss to train the graph data model, it is possible to train a graph neural network model using a small number of samples.

Attached Figure Description

[0050] The accompanying drawings, which are included to provide a further understanding of this specification and form part of this specification, illustrate exemplary embodiments and are used to explain this specification, but do not constitute an undue limitation thereof. In the drawings:

[0051] Figure 1 is a schematic diagram of the training process for a graph data model in this specification;

[0052] Figure 2 is a schematic diagram illustrating one method of determining training samples provided in this specification;

[0053] Figure 3 is a schematic diagram of a distribution of risk assessment values provided in this specification;

[0054] Figure 4 is a schematic diagram of a graph data model training device provided in this specification;

[0055] Figure 5 is a schematic diagram of an electronic device corresponding to Figure 1 provided in this specification.

Detailed Implementation

[0056] To make the objectives, technical solutions, and advantages of this specification clearer, the technical solutions of this specification will be clearly and completely described below in conjunction with specific embodiments and corresponding drawings. Obviously, the described embodiments are only a part of the embodiments of this specification, and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments in this specification without creative effort are within the scope of protection of this application.

[0057] The technical solutions provided in the various embodiments of this specification are described in detail below with reference to the accompanying drawings.

[0058] Figure 1 is a flowchart of a method for training a graph data model provided in this specification, which includes the following steps:

[0059] S100: For each user, determine the subgraph data corresponding to that user from the global graph data, and use it as a training sample.

[0060] The execution subject of the model training method provided in this specification can be a server or an electronic device such as a personal computer (PC). For ease of description, the following uses only a server as the execution subject to explain the model training method provided in this specification.

[0061] In the embodiments of this specification, since the graph data model to be trained belongs to a graph neural network model, the training samples are graph data. In the scenario of risk control business in fund transactions, the nodes in the graph data represent each user, and the edges represent the relationships between users. Based on the historical transactions between users, the server can generate global graph data containing each user. When training this graph data model, a portion of the graph data can be selected from the global graph data according to the actual situation. This portion of the graph data is the sub-graph data corresponding to each user, and this sub-graph data is used as the training sample.

[0062] Specifically, for each user, the server can take the node corresponding to that user in the global graph data as the center and determine the subgraph data within a preset number of hops as the training sample. For example, assuming the preset number of hops is 2, the server takes the user node as the center and determines all nodes and edges in the graph data that are within 2 hops of the user, thereby determining the subgraph data, which serves as the training sample for that user, as shown in Figure 2.
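
The hop-limited selection described above can be sketched as a breadth-first search. This is only an illustrative sketch: the adjacency-dictionary representation, the function name `k_hop_subgraph`, and the toy graph are assumptions made for the example and are not taken from this specification.

```python
from collections import deque


def k_hop_subgraph(adjacency, center, hops):
    """Return (nodes, edges) lying within `hops` hops of `center`.

    `adjacency` maps each user node to the set of its neighbours;
    an undirected graph is assumed for this illustration.
    """
    distance = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if distance[node] == hops:
            continue  # do not expand beyond the hop limit
        for neighbour in adjacency.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    nodes = set(distance)
    # Keep every edge whose two endpoints both lie inside the subgraph.
    edges = {
        frozenset((u, v))
        for u in nodes
        for v in adjacency.get(u, ())
        if v in nodes
    }
    return nodes, edges


# Hypothetical toy graph: a chain A - B - C - D plus an edge B - E.
toy_graph = {
    "A": {"B"},
    "B": {"A", "C", "E"},
    "C": {"B", "D"},
    "D": {"C"},
    "E": {"B"},
}
nodes, edges = k_hop_subgraph(toy_graph, "A", 2)
```

With a hop count of 2 centred on user A, node D (three hops away) and the edge C - D are left out of the subgraph.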

[0063] S101: Based on historical business records, if it is determined that risk control business has been performed on the user, the sub-graph data corresponding to the user will be used as a labeled sample; if it is determined that risk control business has not been performed on the user, the sub-graph data corresponding to the user will be used as an unlabeled sample.

[0064] In the embodiments of this specification, the trained graph data model is used to determine whether there is risk in transactions between users based on user transaction data. Therefore, when training this graph data model, it is necessary to first obtain historical business records. For each user, if the user is the user corresponding to a historical business recorded in the historical business record, it means that the user has undergone risk control business, and the subgraph data corresponding to the user is used as a labeled sample; if the user is not the user corresponding to a historical business recorded in the historical business record, it means that the user has not undergone risk control business, and the subgraph data corresponding to the user is used as an unlabeled sample.

[0065] For example, for users A, B, C, and D, if users A and C can be found among the users corresponding to the historical business transactions recorded in the acquired historical business records, it means that users A and C have undergone risk control transactions, and the subgraph data corresponding to users A and C are used as labeled samples; if users B and D are not recorded among the users in the historical business records, it means that users B and D have not undergone risk control transactions, and the subgraph data corresponding to users B and D are used as unlabeled samples.
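
The example with users A, B, C, and D can be written out as a short sketch. The dictionary layout, the function name `split_samples`, and the "white"/"black" strings used for the risk control results are hypothetical choices for illustration only.

```python
def split_samples(subgraphs, risk_control_records):
    """Partition per-user subgraphs into labeled and unlabeled samples.

    `subgraphs` maps user id -> subgraph data.
    `risk_control_records` maps user id -> risk control result
    ("white" for no risk, "black" for risk); both layouts are assumed.
    """
    labeled, unlabeled = {}, {}
    for user, subgraph in subgraphs.items():
        if user in risk_control_records:
            # The user appears in the historical records: labeled sample.
            labeled[user] = (subgraph, risk_control_records[user])
        else:
            # No risk control business was performed: unlabeled sample.
            unlabeled[user] = subgraph
    return labeled, unlabeled


subgraphs = {user: f"subgraph_{user}" for user in "ABCD"}
records = {"A": "white", "C": "black"}  # hypothetical historical records
labeled, unlabeled = split_samples(subgraphs, records)
```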

[0066] In the embodiments of this specification, historical business records refer to records of risk control business related to fund transactions between users in the past. For example, when user 1 and user 2 conduct a fund transaction, if the server detects that the transaction may have risks, it will temporarily restrict the transaction function between user 1 and user 2. Based on the transaction data between user 1 and user 2, the server will determine whether there is any risk in the transaction based on human experience. The determination result is the execution result of the risk control business. At the same time, the server can execute corresponding business based on the execution result of the risk control business. If it is determined that there is a risk in the transaction between user 1 and user 2, the server can continue to restrict the transaction function between user 1 and user 2; if it is determined that there is no risk in the transaction function between user 1 and user 2, the server can lift the restriction on the transaction function between user 1 and user 2.

[0067] It should be noted that in the embodiments of this specification, labeled samples are small samples, and a large number of training samples are unlabeled samples.

[0068] S102: Input the training samples into the graph data model to be trained and determine the risk assessment value of the training samples. After determining the training samples, the determined labeled and unlabeled samples are input into the graph data model to be trained to obtain the risk assessment value of each training sample. During the training of the graph data model, the server can input each training sample from one iteration into the graph data model to be trained, determine the risk assessment value of each training sample, and then execute subsequent steps S103-S105 to adjust the model parameters of the graph data model to be trained; that is, the model parameters are adjusted once per iteration.

[0069] Alternatively, the server can adjust the model parameters of the graph data model to be trained based on each training sample. In that case, after each training sample is input, the subsequent steps S103 to S105 can be executed, and the graph data model to be trained is trained through step S105.

[0070] The following uses the case in which the model parameters are adjusted once per iteration as an example to illustrate the training method of the graph data model provided in this specification.

[0071] S103: Determine the distribution of risk assessment values based at least on the risk assessment value of each unlabeled sample.

[0072] In the embodiments of this specification, the training samples include labeled samples and unlabeled samples. When the training sample is an unlabeled sample, the risk assessment value of the unlabeled sample output by the graph data model is stored, and the distribution of each stored risk assessment value is determined.

[0073] Since the unlabeled samples are determined using subgraph data from users who have not undergone risk control procedures, and these users may or may not pose a risk, in reality, the risk probability distribution of users on a normal business platform should typically follow a Gaussian distribution. This means that the number of users with extremely high or very low risk is relatively small, while the risk of most users is concentrated. Therefore, for the large number of unlabeled samples in this specification, the distribution of risk assessment values ​​for each unlabeled sample should also conform to a Gaussian distribution.

[0074] Therefore, in one or more embodiments of this specification, the server can determine the distribution of risk assessment values based on the stored risk assessment values. By adjusting the model parameters of the graph data model to be trained in the subsequent step S105, the distribution can be made to conform to a Gaussian distribution; that is, the risk assessment values of risk-free samples should be as close as possible to the central region of the Gaussian distribution, while the risk assessment values of risky samples should lie as far outside the central region as possible, as shown in Figure 3.

[0075] Figure 3 is a schematic diagram of the distribution of risk assessment values provided in the embodiments of this specification. The x-axis represents the risk assessment value, and the y-axis represents the number of samples corresponding to that risk assessment value. The central region of this distribution can be preset according to actual conditions. For example, the central region is (μ-3σ, μ+3σ), where μ is the mean of the risk assessment values in the distribution and σ is their standard deviation.
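
As a minimal sketch of the example central region (μ-3σ, μ+3σ), the boundaries can be computed from the stored risk assessment values. The function name `central_region` and the multiplier parameter `k` are assumptions; the sketch uses the sample mean and population standard deviation as the Gaussian parameters, which is one common choice and not necessarily the fitting procedure intended by this specification.

```python
import statistics


def central_region(scores, k=3.0):
    """Return the (lower, upper) boundaries mu - k*sigma and mu + k*sigma."""
    mu = statistics.fmean(scores)      # mean of the risk assessment values
    sigma = statistics.pstdev(scores)  # population standard deviation
    return mu - k * sigma, mu + k * sigma


# Hypothetical stored risk assessment values.
lower, upper = central_region([1.0, 2.0, 3.0, 4.0, 5.0])
```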

[0076] S104: Based on the labels of the labeled samples, determine the intervals in the distribution that match the labels.

[0077] Since the determined distribution approximately conforms to a Gaussian distribution, and in actual business operations the number of risky businesses is far less than the number of risk-free businesses, as shown in Figure 3, the central region of this distribution can be the region corresponding to the risk-free samples among the labeled samples; that is, the central region is the interval that matches the label of the white sample. The region outside the central region can be the region corresponding to the risky samples among the labeled samples; that is, the region outside the central region is the interval that matches the label of the black sample.
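
The matching between labels and intervals can be illustrated as follows, assuming the example central region (μ-3σ, μ+3σ) and treating it as an open interval. The function name `score_matches_label`, the label strings, and the numeric values are hypothetical.

```python
def score_matches_label(score, label, mu, sigma, k=3.0):
    """Check whether a risk assessment value lies in the interval for its label.

    White (risk-free) samples match the central region (mu - k*sigma, mu + k*sigma);
    black (risky) samples match everything outside that region.
    """
    inside = abs(score - mu) < k * sigma
    if label == "white":
        return inside
    if label == "black":
        return not inside
    raise ValueError(f"unknown label: {label!r}")


# Hypothetical distribution parameters.
mu, sigma = 0.5, 0.1
white_ok = score_matches_label(0.55, "white", mu, sigma)
black_ok = score_matches_label(0.95, "black", mu, sigma)
white_bad = score_matches_label(0.95, "white", mu, sigma)
```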

[0078] S105: Train the graph data model to be trained with the goal of ensuring that the risk assessment value of the labeled sample falls within the interval that matches the label.

[0079] When the training sample is a white sample, the risk assessment value of the white sample is stored, the distribution at this time is determined, and the standard deviation and mean of the distribution are determined. Based on the standard deviation and mean of the assessment distribution and the risk assessment value of the white sample, the loss is determined, and the graph data model is trained with the loss minimization as the training objective. When the training sample is a black sample, based on the determined distribution, the standard deviation and mean of the distribution are determined, and the loss is determined based on the standard deviation and mean of the distribution and the risk assessment value of the black sample. The graph data model is trained with the loss not being less than the preset hyperparameter as the training objective.

[0080] In the embodiments of this specification, when the training sample is a black sample, after determining the risk assessment value of the black sample, the difference between the risk assessment value and the distribution is determined, and the loss is determined based on the difference. Specifically, the mean and standard deviation of the distribution are determined first; the deviation loss dev(v_i) is then determined according to dev(v_i) = (S_i - μ_i) / σ_i; and the loss is determined according to L = max(0, m - |dev(v_i)|). Since the goal is to keep risky samples outside the central region of the Gaussian distribution as much as possible, when the training sample is a black sample, the graph data model can be trained with the objective that the deviation loss |dev(v_i)| is not less than a preset hyperparameter m, in which case L equals 0.

[0081] When the training sample is a white sample, after determining the risk assessment value of the white sample, the difference between the risk assessment value and the distribution is determined, and the loss is determined based on this difference. Specifically, the mean and standard deviation of the distribution are determined first; the deviation loss dev(v_i) is then determined according to dev(v_i) = (S_i - μ_i) / σ_i; and the loss is determined according to L = |dev(v_i)|. Since the goal is to keep risk-free samples within the central region of the Gaussian distribution as much as possible, when the training sample is a white sample, the graph data model can be trained with minimizing the loss as the training objective. Here, the hyperparameter m can be preset according to the actual situation; S_i is the risk assessment value of the current sample; μ_i is the mean of the most recently determined distribution; and σ_i is the standard deviation of the most recently determined distribution.
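
The two loss formulas, L = |dev(v_i)| for white samples and L = max(0, m - |dev(v_i)|) for black samples, can be evaluated numerically as in the sketch below. The function names and the sample values are assumptions for illustration; a real training setup would compute these with an automatic-differentiation framework so that gradients reach the model parameters.

```python
def deviation(score, mu, sigma):
    """dev(v_i) = (S_i - mu_i) / sigma_i."""
    return (score - mu) / sigma


def white_loss(score, mu, sigma):
    """Loss for a risk-free sample: L = |dev(v_i)|, to be minimized."""
    return abs(deviation(score, mu, sigma))


def black_loss(score, mu, sigma, m):
    """Loss for a risky sample: L = max(0, m - |dev(v_i)|).

    The loss reaches 0 once the deviation is at least the margin m.
    """
    return max(0.0, m - abs(deviation(score, mu, sigma)))


# Hypothetical values: distribution mean 0.5, standard deviation 0.1, margin 3.
centered_white = white_loss(0.5, 0.5, 0.1)
centered_black = black_loss(0.5, 0.5, 0.1, m=3.0)
distant_black = black_loss(1.0, 0.5, 0.1, m=3.0)
```

A white sample sitting at the mean incurs no loss, a black sample at the mean incurs the full margin, and a black sample five standard deviations away incurs no loss.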

[0082] The trained graph data model can be used to determine whether there is risk in transactions between users. For example, when the server determines that there is risk in the financial transactions between users and restricts the user's subsequent transaction activities, the user can submit an appeal to the server through the terminal to request the lifting of the restrictions on the user's account. After receiving the user's appeal, the server determines the subgraph data corresponding to the user in the global graph data containing data of each user, uses the subgraph data as input, and inputs it into the trained graph data model. Based on the output of the graph data model, it determines whether there is risk in the user's account. If there is risk, the restrictions on the user's subsequent transaction activities continue; if there is no risk, the restrictions on the user are lifted.

[0083] Based on the training method for the graph data model shown in Figure 1, training samples are determined based on historical business records, and the risk assessment value of each training sample is obtained through the graph data model. Among the training samples, at least the risk assessment values of the unlabeled samples are used to determine the distribution. The loss is determined based on the difference between the risk assessment values of labeled samples and the determined distribution, and the graph data model is trained based on this loss. The trained graph data model is used to determine whether there is risk in transactions between users based on user transaction data. This model training method uses both unlabeled and labeled samples, thereby achieving graph data model training with a small number of labeled samples.

[0084] After determining the risk assessment value for each training sample, this risk assessment value needs to be stored. Since the stored risk assessment values ​​are used to determine the distribution, and the difference between this distribution and the labels of the labeled samples is used to determine the loss, the graph data model parameters are adjusted based on this loss. In other words, the stored risk assessment values ​​have already been used to adjust the graph data model parameters. Continuing to retain these risk assessment values ​​may lead to poor model training performance and consume excessive resources. Therefore, the storage space for these risk assessment values ​​can be a First-In-First-Out (FIFO) queue.

[0085] Since a FIFO queue is used, before storing a risk assessment value, it is necessary to determine whether the number of risk assessment values stored in the FIFO queue has reached a preset threshold. If the preset threshold has been reached, a preset quantity of stored risk assessment values is removed from the FIFO queue in order of storage time. This allows the new risk assessment value to be stored in the FIFO queue and keeps the stored risk assessment values up to date, which in turn updates their distribution and enables better training of the graph data model. The length of the FIFO queue can be preset according to actual conditions.
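
The queue behaviour described above can be sketched with a double-ended queue. The class name `ScoreQueue` and the parameters `threshold` and `evict_count` are hypothetical names standing in for the preset threshold and the preset quantity to remove.

```python
from collections import deque


class ScoreQueue:
    """FIFO store for risk assessment values with threshold-based eviction."""

    def __init__(self, threshold, evict_count=1):
        self.threshold = threshold      # preset maximum number of stored values
        self.evict_count = evict_count  # preset quantity removed when full
        self._scores = deque()

    def push(self, score):
        # Before storing, check whether the preset threshold has been reached.
        if len(self._scores) >= self.threshold:
            for _ in range(min(self.evict_count, len(self._scores))):
                self._scores.popleft()  # remove the oldest stored value
        self._scores.append(score)

    def values(self):
        return list(self._scores)


queue = ScoreQueue(threshold=3, evict_count=2)
for value in (0.1, 0.2, 0.3, 0.4):
    queue.push(value)
```

When the fourth value arrives the queue already holds three values, so the two oldest are removed before the new one is stored.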

[0086] In the embodiments of this specification, the distribution can be determined using only the risk assessment values of unlabeled samples, and then the difference between the labeled samples and this distribution can be determined and used to train the graph data model. However, since the graph data model to be trained does not yet have a mature ability to judge whether the input data is risky, a distribution determined using only the risk assessment values of the unlabeled samples output by the graph data model is not very accurate. The difference between the labeled samples and the distribution will then be relatively large, resulting in low training efficiency of the graph data model. Therefore, it is also possible to store both the risk assessment values of the unlabeled samples and the risk assessment values of the white samples among the labeled samples. White samples are samples that have been determined to be risk-free. The risk assessment values of the white samples are used as anchor points, and the model parameters are continuously adjusted to make the risk assessment values of the unlabeled samples as close as possible to those of the white samples. The distribution determined in this way is more accurate, thereby improving the training efficiency and effect of the graph data model.

[0087] Based on the stored risk assessment values, the distribution is determined as follows: for each risk assessment value, the number of samples corresponding to that risk assessment value is determined from the stored unlabeled samples and white samples. For example, the risk assessment value can be used as the X-axis and the number of samples corresponding to that risk assessment value as the Y-axis; or the number of samples can be used as the X-axis and the risk assessment value as the Y-axis. Based on the number of samples corresponding to each risk assessment value, a Gaussian distribution function is fitted, and the fitted function is used as the distribution of the risk assessment values.
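
One possible way to realize the counting and fitting described above is sketched here. It assumes the risk assessment values are rounded into discrete bins before counting, and it estimates the Gaussian parameters from the count-weighted mean and variance; the function name `fit_gaussian_from_counts`, the rounding precision, and the estimation method are all assumptions, since this specification does not state how the Gaussian function is fitted.

```python
import math
from collections import Counter


def fit_gaussian_from_counts(scores, ndigits=2):
    """Count samples per (rounded) risk assessment value and estimate mu, sigma."""
    # Number of samples corresponding to each risk assessment value.
    counts = Counter(round(score, ndigits) for score in scores)
    total = sum(counts.values())
    # Count-weighted mean and variance of the risk assessment values.
    mu = sum(value * n for value, n in counts.items()) / total
    variance = sum(n * (value - mu) ** 2 for value, n in counts.items()) / total
    return mu, math.sqrt(variance), counts


# Hypothetical stored values from unlabeled and white samples.
mu, sigma, counts = fit_gaussian_from_counts([0.4, 0.5, 0.5, 0.6])
```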

[0088] In the embodiments of this specification, the graph data model consists of a feature extraction subnet and an evaluation subnet. During training, the determined training samples are input into the feature extraction subnet of the graph data model to determine the features of the training samples. Then, the features of the training samples are input into the evaluation subnet of the graph data model to determine the risk assessment value of the training samples. The trained graph data model can accurately extract the features of the input graph data and determine the risk assessment value of the graph data based on these features. At this point, a risk assessment value threshold can be preset. When the determined risk assessment value of the graph data exceeds the preset risk assessment value threshold, it is determined that the transaction of the user corresponding to the graph data is risky; when the determined risk assessment value of the graph data does not exceed the risk assessment value threshold, it is determined that the transaction of the user corresponding to the graph data is not risky.

[0089] However, the above method requires continuously updating the FIFO queue and the distribution based on the determined risk assessment values of the training samples, and storing the updated distribution at the same time, so it consumes considerable computing resources and memory. Therefore, after the graph data model is trained, the evaluation subnet in the trained graph data model can be replaced with a classification subnet to be trained, yielding a classification model. The training samples are then filtered to identify the labeled samples. Each labeled sample is input into the classification model to obtain a classification result, and the classification model is trained based on the classification result and the label of that sample. Specifically, since the feature extraction subnet in the trained graph data model can accurately extract features from the input data, the features it extracts are input into the classification subnet to obtain a classification result; the difference between the label of the input data and the classification result is then used to determine a loss, and the classification model is trained based on this loss. The trained classification model can then determine, based on user transaction data, whether transactions between users are risky.
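A rough sketch of this head replacement is given below. It assumes the classification subnet is a logistic-regression head trained with stochastic gradient descent on features from a frozen extractor; the specification does not fix the form of the classification subnet, and `frozen_extract`, the learning rate, and the toy data are all hypothetical.

```python
import math


def frozen_extract(sample):
    """Hypothetical stand-in for the trained feature extraction subnet (kept fixed)."""
    return [sum(sample) / len(sample)]


def train_classification_head(features, labels, lr=0.5, epochs=1000):
    """Train a logistic-regression head on fixed features (label 1 = risky, 0 = not risky)."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # gradient of the cross-entropy loss with respect to z
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def classify(sample, w, b):
    """Classification model: frozen feature extractor followed by the trained head."""
    x = frozen_extract(sample)
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0


# Only labeled samples are used at this stage (toy data).
labeled = [([0.1, 0.2], 0), ([0.2, 0.3], 0), ([0.8, 0.9], 1), ([0.9, 1.0], 1)]
feats = [frozen_extract(s) for s, _ in labeled]
w, b = train_classification_head(feats, [y for _, y in labeled])
predictions = [classify(s, w, b) for s, _ in labeled]
```

Because the classification model needs neither the FIFO queue nor the stored distribution at inference time, it avoids the memory cost described above.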

[0090] The above describes one or more embodiments of the graph data model training method provided in this specification. Based on the same idea, this specification also provides a corresponding graph data model training device, as shown in Figure 4.

[0091] Figure 4 is a schematic diagram of a graph data model training device provided in this specification, which specifically includes:

[0092] The sample determination module 401 is used to determine the subgraph data corresponding to each user from the global graph data, and use it as a training sample.

[0093] The sample classification module 402 is used to classify the training samples based on historical business records: if it is determined that risk control business has been performed on the user, the subgraph data corresponding to the user is used as a labeled sample; if it is determined that risk control business has not been performed on the user, the subgraph data corresponding to the user is used as an unlabeled sample.

[0094] Evaluation module 403 is used to input the training samples into the graph data model to be trained and determine the risk assessment value of the training samples;

[0095] Distribution determination module 404 is used to determine the distribution of risk assessment values ​​based at least on the risk assessment values ​​of each unlabeled sample;

[0096] The interval determination module 405 is used to determine the interval in the distribution that matches the label based on the label of the labeled sample;

[0097] Training module 406 is used to train the graph data model to be trained with the goal of the risk assessment value of the labeled sample falling into the interval that matches the label.
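As an illustration of what the sample classification module 402 does, the sketch below splits per-user subgraph data into labeled and unlabeled samples. It assumes the historical business records can be reduced to a mapping from user ID to a risk control result; that mapping, the function name `split_samples`, and the toy data are hypothetical.

```python
def split_samples(subgraphs, risk_control_results):
    """Split training samples by whether risk control business was performed on the user.

    subgraphs: dict mapping user ID to that user's subgraph data.
    risk_control_results: dict mapping user ID to "white" (no risk) or "black" (risk),
        containing only users on whom risk control business was performed.
    """
    labeled, unlabeled = {}, {}
    for user_id, subgraph in subgraphs.items():
        if user_id in risk_control_results:
            labeled[user_id] = (subgraph, risk_control_results[user_id])
        else:
            unlabeled[user_id] = subgraph
    return labeled, unlabeled


subgraphs = {"u1": ["e1"], "u2": ["e2"], "u3": ["e3"]}
results = {"u1": "white", "u3": "black"}
labeled, unlabeled = split_samples(subgraphs, results)
```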

[0098] Optionally, the distribution determination module 404 is specifically used to store the determined risk assessment values ​​of each training sample; and to determine the distribution of risk assessment values ​​by fitting a Gaussian function based on the stored risk assessment values ​​of each unlabeled sample and the number of unlabeled samples corresponding to each risk assessment value.

[0099] Optionally, the distribution determination module 404 is specifically used to determine whether the number of stored risk assessment values ​​of unlabeled samples is greater than a preset threshold before determining the distribution of risk assessment values; if so, delete a specified number of risk assessment values ​​of unlabeled samples according to the storage time of each unlabeled sample's risk assessment value.
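The size check and deletion can be sketched with a double-ended queue. The sketch assumes "according to the storage time" means the oldest values are deleted first, and the names `trim_oldest`, `max_size`, and `num_to_delete` are hypothetical.

```python
from collections import deque


def trim_oldest(stored_values, max_size, num_to_delete):
    """If more than max_size values are stored, delete the num_to_delete oldest ones."""
    if len(stored_values) > max_size:
        for _ in range(min(num_to_delete, len(stored_values))):
            stored_values.popleft()  # the left end holds the earliest stored value
    return stored_values


queue = deque()
for value in [0.1, 0.2, 0.3, 0.4, 0.5]:
    queue.append(value)  # newer risk assessment values are appended on the right

trim_oldest(queue, max_size=4, num_to_delete=2)
```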

[0100] Optionally, the labeled samples include white samples and black samples, wherein the white samples are training samples in which the risk control results show no risk, and the black samples are training samples in which the risk control results show risk.

[0101] The distribution determination module 404 is specifically used to store the determined risk assessment values ​​of each training sample; and to determine the distribution of risk assessment values ​​by fitting a Gaussian function based on the stored risk assessment values ​​of each unlabeled sample and each white sample, as well as the number of unlabeled and white samples corresponding to each risk assessment value.

[0102] Optionally, the interval determination module 405 is specifically used to: determine the standard deviation and mean of the distribution; determine the boundary value of the central region of the distribution based on the standard deviation and mean, wherein the number of risk assessment values ​​in the distribution in the central region is greater than the number of risk assessment values ​​in the non-central region; use the central region as the interval that matches the label of the white sample, and use the region outside the central region as the interval that matches the label of the black sample.
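A minimal sketch of the interval determination follows. It assumes the central region is the mean plus or minus `k` standard deviations; the multiplier `k` is a hypothetical parameter that the specification does not give. For a Gaussian, `k = 1` already covers roughly 68% of the values, so the central region holds more values than the region outside it.

```python
def central_region(mean, std, k=1.0):
    """Boundary values of the central region: mean - k*std and mean + k*std."""
    return mean - k * std, mean + k * std


def matches_label(risk_value, label, bounds):
    """White samples should fall inside the central region, black samples outside it."""
    low, high = bounds
    inside = low <= risk_value <= high
    return inside if label == "white" else not inside


bounds = central_region(mean=0.4, std=0.1, k=2.0)
```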

[0103] Optionally, the graph data model includes a feature extraction subnetwork and an evaluation subnetwork;

[0104] The training module 406 is further configured to: replace the evaluation subnet in the trained graph data model with the classification subnet to be trained to obtain a classification model; filter the training samples to determine the labeled samples in the training samples; input the labeled samples into the classification model to obtain the classification result; train the classification model according to the classification result and the labels corresponding to the labeled samples, and the trained classification model is used to determine whether there is a risk in transactions between users based on the user's transaction data.

[0105] Optionally, the training module 406 is specifically used to: determine the standard deviation and mean of the distribution; determine the loss of the training sample based on the risk assessment value of the training sample, the standard deviation, and the mean; when the training sample is a white sample, train the graph data model with the goal of minimizing the loss; when the training sample is a black sample, train the graph data model with the goal of the loss being no less than a preset hyperparameter.
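One way to read this objective is sketched below. The form of the loss is an assumption: the squared standardized distance from the mean is used here, whereas the specification only says the loss depends on the risk assessment value, the standard deviation, and the mean. The hinge-style term for black samples and the name `margin` (standing in for the preset hyperparameter) are likewise hypothetical.

```python
def sample_loss(risk_value, mean, std):
    """Assumed loss: squared distance from the mean, measured in standard deviations."""
    return ((risk_value - mean) / std) ** 2


def training_objective(risk_value, label, mean, std, margin=4.0):
    """Quantity to minimize for one labeled sample.

    White sample: minimize the loss itself, pulling the value toward the center.
    Black sample: penalize only while the loss is below the margin, so that the
    loss ends up no less than the preset hyperparameter.
    """
    loss = sample_loss(risk_value, mean, std)
    if label == "white":
        return loss
    return max(0.0, margin - loss)
```

With a mean of 0.4 and a standard deviation of 0.1, a black sample scored at 0.9 already lies five standard deviations out and contributes nothing, while a black sample scored exactly at the mean contributes the full margin.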

[0106] This specification also provides a computer-readable storage medium storing a computer program that can be used to execute the graph data model training method provided in Figure 1 above.

[0107] This specification also provides a schematic structural diagram of an electronic device, shown in Figure 5. As shown in Figure 5, at the hardware level, the electronic device includes a processor, an internal bus, a network interface, memory, and non-volatile memory, and may also include other hardware required by the services. The processor reads the corresponding computer program from the non-volatile memory into memory and then runs it to implement the graph data model training method described in Figure 1 above. Of course, in addition to software implementation, this specification does not exclude other implementations, such as logic devices or a combination of hardware and software. That is to say, the execution subject of the processing flow is not limited to logic units, but can also be hardware or logic devices.

[0108] In the 1990s, improvements to a technology could be clearly distinguished as either hardware improvements (e.g., improvements to the circuit structure of diodes, transistors, switches, etc.) or software improvements (improvements to the methodology). However, with technological advancements, many methodological improvements today can be considered direct improvements to the hardware circuit structure. Designers almost always obtain the corresponding hardware circuit structure by programming the improved methodology into the hardware circuit. Therefore, it cannot be said that a methodological improvement cannot be implemented using hardware physical modules. For example, a Programmable Logic Device (PLD) (such as a Field Programmable Gate Array (FPGA)) is such an integrated circuit whose logic function is determined by the user programming the device. Designers can program and "integrate" a digital system onto a PLD themselves, without needing chip manufacturers to design and manufacture dedicated integrated circuit chips. Furthermore, nowadays, instead of manually manufacturing integrated circuit chips, this programming is mostly implemented using "logic compiler" software. Similar to the software compiler used in program development, the original code before compilation must be written in a specific programming language, called a Hardware Description Language (HDL). There are many HDLs, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language). Currently, the most commonly used are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. Those skilled in the art should understand that by simply performing some logic programming on the method flow using one of these hardware description languages and programming it into an integrated circuit, the hardware circuit implementing the logical method flow can be easily obtained.

[0109] The controller can be implemented in any suitable manner. For example, it can take the form of a microprocessor or processor and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers, and embedded microcontrollers. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the control logic of the memory. Those skilled in the art will also recognize that, in addition to implementing the controller in purely computer-readable program code form, the same functionality can be achieved by logically programming the method steps to make the controller take the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers. Therefore, such a controller can be considered a hardware component, and the means included therein for implementing various functions can also be considered as structures within the hardware component. Alternatively, the means for implementing various functions can be considered as both software modules implementing the method and structures within the hardware component.

[0110] The systems, devices, modules, or units described in the above embodiments can be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer. Specifically, a computer can be, for example, a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or any combination of these devices.

[0111] For ease of description, the above devices are described in terms of function, divided into various units. Of course, in implementing this specification, the functions of each unit can be implemented in one or more software and / or hardware.

[0112] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0113] This invention is described with reference to flowcharts and / or block diagrams of methods, devices (systems), and computer program products according to embodiments of the invention. It will be understood that each process and / or block in the flowcharts and / or block diagrams, and combinations of processes and / or blocks in the flowcharts and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.

[0114] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.

[0115] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.

[0116] In a typical configuration, a computing device includes one or more processors (CPU), input / output interfaces, network interfaces, and memory.

[0117] Memory may include non-persistent memory in computer-readable media, random access memory (RAM), and / or non-volatile memory such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.

[0118] Computer-readable media include both permanent and non-permanent, removable and non-removable media that can store information using any method or technology. Information can be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

[0119] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0120] Those skilled in the art will understand that the embodiments of this specification can be provided as methods, systems, or computer program products. Therefore, this specification may take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this specification may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0121] This specification can be described in the general context of computer-executable instructions that are executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform a specific task or implement a specific abstract data type. This specification can also be practiced in distributed computing environments, where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.

[0122] The various embodiments in this specification are described in a progressive manner. Similar or identical parts between embodiments can be referred to interchangeably. Each embodiment focuses on describing the differences from other embodiments. In particular, the system embodiments are basically similar to the method embodiments, so the description is relatively simple; relevant parts can be referred to the descriptions in the method embodiments.

[0123] The above description is merely an embodiment of this specification and is not intended to limit this specification. Various modifications and variations can be made to this specification by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this specification should be included within the scope of the claims of this application.

Claims

1. A method for training a graph data model, the method comprising: for each user, determining the subgraph data corresponding to that user from the global graph data and using it as a training sample; based on historical business records, if it is determined that risk control business has been performed on the user, using the subgraph data corresponding to the user as a labeled sample, and if it is determined that risk control business has not been performed on the user, using the subgraph data corresponding to the user as an unlabeled sample; inputting the training samples into the graph data model to be trained to determine the risk assessment values of the training samples; determining the distribution of risk assessment values based at least on the risk assessment values of each unlabeled sample, wherein the distribution conforms to a Gaussian distribution; determining, based on the labels of the labeled samples, the intervals in the distribution that match the labels; and training the graph data model to be trained with the goal that the risk assessment value of the labeled sample falls within the interval that matches the label.

2. The method as described in claim 1, wherein determining the distribution of risk assessment values specifically includes: storing the risk assessment values of each training sample; and determining the distribution of risk assessment values by fitting a Gaussian function based on the stored risk assessment values of each unlabeled sample and the number of unlabeled samples corresponding to each risk assessment value.

3. The method of claim 2, wherein, prior to determining the distribution of risk assessment values, the method further includes: determining whether the number of stored risk assessment values of unlabeled samples exceeds a preset threshold; and if so, deleting a specified number of risk assessment values of unlabeled samples according to the storage time of each unlabeled sample's risk assessment value.

4. The method as described in claim 1, wherein the labeled samples include white samples and black samples, wherein the white samples are training samples for which the risk control results show no risk, and the black samples are training samples for which the risk control results show risk; and determining the distribution of risk assessment values specifically includes: storing the risk assessment values of each training sample; and determining the distribution of risk assessment values by fitting a Gaussian function based on the stored risk assessment values of each unlabeled sample and each white sample, as well as the number of unlabeled and white samples corresponding to each risk assessment value.

5. The method as described in claim 4, wherein determining the interval in the distribution that matches the label based on the label of the labeled sample specifically includes: determining the standard deviation and mean of the distribution; determining the boundary values of the central region of the distribution based on the standard deviation and the mean, wherein the number of risk assessment values of the distribution in the central region is greater than the number of risk assessment values in the non-central region; and using the central region as the interval that matches the label of the white sample, and the region outside the central region as the interval that matches the label of the black sample.

6. The method of claim 1, wherein the graph data model includes a feature extraction subnetwork and an evaluation subnetwork, and the method further includes: replacing the evaluation subnet in the trained graph data model with a classification subnet to be trained to obtain a classification model; filtering the training samples to determine the labeled samples among the training samples; inputting the labeled samples into the classification model to obtain classification results; and training the classification model based on the classification results and the labels corresponding to the labeled samples, wherein the trained classification model is used to determine whether there is any risk in transactions between users based on the users' transaction data.

7. The method as described in claim 4, wherein training the graph data model to be trained with the goal that the risk assessment value of the labeled sample falls within the interval matching the label specifically includes: determining the standard deviation and mean of the distribution; determining the loss of the training sample based on the risk assessment value of the training sample, the standard deviation, and the mean; when the training sample is a white sample, training the graph data model with the goal of minimizing the loss; and when the training sample is a black sample, training the graph data model with the goal that the loss is not less than a preset hyperparameter.

8. An apparatus for training a graph data model, the apparatus comprising: a sample determination module, used to determine the subgraph data corresponding to each user from the global graph data and use it as a training sample; a sample classification module, used to, based on historical business records, treat the subgraph data corresponding to a user as a labeled sample if it is determined that risk control business has been performed on that user, and treat the subgraph data corresponding to a user as an unlabeled sample if it is determined that risk control business has not been performed on that user; an evaluation module, used to input the training samples into the graph data model to be trained and determine the risk assessment values of the training samples; a distribution determination module, used to determine the distribution of risk assessment values based at least on the risk assessment values of each unlabeled sample, wherein the distribution conforms to a Gaussian distribution; an interval determination module, used to determine the interval in the distribution that matches the label based on the label of the labeled sample; and a training module, used to train the graph data model to be trained with the goal that the risk assessment value of the labeled sample falls within the interval that matches the label.

9. A computer-readable storage medium storing a computer program that, when executed by a processor, implements the method of any one of claims 1 to 7.

10. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the method of any one of claims 1 to 7.