Data processing method, apparatus and device
By using the graph coding network and sampling network in the graph sampling model, the representation vectors of nodes and edges are determined, and pruning is performed based on the sampling probability. This solves the problem of inaccurate graph data pruning in existing technologies and improves the accuracy and efficiency of data processing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ALIPAY (HANGZHOU) INFORMATION TECH CO LTD
- Filing Date
- 2022-12-12
- Publication Date
- 2026-06-16
AI Technical Summary
In existing technologies, when pruning graph data using k-order neighbor subgraphs, it is impossible to effectively select nodes that are useful for big data problems, resulting in poor pruning effects and inaccurate data processing.
By employing the graph encoding network and sampling network in a pre-trained graph sampling model, sampling probabilities are determined through node representation vectors and edge representation vectors, and graph structure data is pruned to retain important graph structure data.
It improves the accuracy of pruning graph-structured data, reduces computational complexity and storage costs, and enhances the accuracy of subsequent big data problem processing, especially the efficiency of text semantic similarity, similar product recommendation, and intelligent question answering systems.
Figure CN115795109B_ABST
Description
TECHNICAL FIELD
[0001] Embodiments of the present specification relate to the technical field of data processing, and in particular to a data processing method, apparatus and device.
BACKGROUND
[0002] With the rapid development of computer technology, various industries are facing the problem of big data processing. How to extract valuable information from big data to support increasingly complex business needs is a problem that needs to be solved in various industries. Graph structure data can be used to solve big data problems such as text semantic similarity, similar product recommendation, or intelligent question and answer system, because it can describe knowledge resources and their carriers using visualization technology.
[0003] Since complete graph data may contain a large amount of redundant data, the graph data needs to be pruned so that the above-mentioned big data problems can be processed using the pruned graph data. For example, the complete graph data can be pruned by extracting k-order neighbor subgraphs to obtain pruned graph data. However, pruning by k-order neighbor subgraphs may fail to select the nodes that are useful for the big data problem; that is, the pruned graph structure data may not be accurately usable for processing big data problems, which leads to poor pruning results and poor data processing accuracy. Therefore, a technical solution is needed to improve the pruning accuracy of graph structure data.
SUMMARY
[0004] The purpose of the embodiments of the present specification is to provide a data processing method, apparatus and device that improve the pruning accuracy of graph structure data.
[0005] To achieve the above purpose, the embodiments of the present specification are implemented as follows:
[0006] In a first aspect, embodiments of this specification provide a data processing method, comprising: acquiring first graph structure data to be pruned, the first graph structure data being constructed based on human-computer interaction data having a preset correspondence with a target user; encoding the first graph structure data based on a graph coding network in a pre-trained graph sampling model to determine a node representation vector for each node in the first graph structure data; determining an edge representation vector for an edge between every two connected nodes in the first graph structure data based on the node representation vector for each node in the first graph structure data, the construction time of the first graph structure data, and the time information between every two connected nodes in the first graph structure data; determining a sampling probability for an edge between every two connected nodes in the first graph structure data based on the sampling network in the pre-trained graph sampling model and the edge representation vector for an edge between every two connected nodes in the first graph structure data; and pruning the first graph structure data based on the sampling probability for an edge between every two connected nodes in the first graph structure data to obtain pruned first graph structure data.
[0007] In a second aspect, embodiments of this specification provide a data processing apparatus, the apparatus comprising: a data acquisition module, configured to acquire first graph structure data to be pruned, the first graph structure data being constructed based on human-computer interaction data having a preset correspondence with a target user; a first determination module, configured to encode the first graph structure data based on a graph encoding network in a pre-trained graph sampling model, and determine the node representation vector of each node in the first graph structure data; a second determination module, configured to determine the edge representation vector of the edge between every two connected nodes in the first graph structure data based on the node representation vector of each node in the first graph structure data, the construction time of the first graph structure data, and the time information between every two connected nodes in the first graph structure data; a probability determination module, configured to determine the sampling probability of the edge between every two connected nodes in the first graph structure data based on the sampling network in the pre-trained graph sampling model and the edge representation vector of the edge between every two connected nodes in the first graph structure data; and a first pruning module, configured to prune the first graph structure data based on the sampling probability of the edge between every two connected nodes in the first graph structure data, to obtain pruned first graph structure data.
[0008] In a third aspect, embodiments of this specification provide a data processing device, the data processing device comprising: a processor; and a memory arranged to store computer-executable instructions, wherein the executable instructions, when executed, cause the processor to: acquire first graph structure data to be pruned, the first graph structure data being constructed based on human-computer interaction data having a preset correspondence with a target user; encode the first graph structure data based on a graph encoding network in a pre-trained graph sampling model to determine a node representation vector for each node in the first graph structure data; determine the edge representation vector of the edge between every two connected nodes in the first graph structure data based on the node representation vector for each node in the first graph structure data, the construction time of the first graph structure data, and the time information between every two connected nodes in the first graph structure data; determine the sampling probability of the edge between every two connected nodes in the first graph structure data based on the sampling network in the pre-trained graph sampling model and the edge representation vector of the edge between every two connected nodes in the first graph structure data; and prune the first graph structure data based on the sampling probability of the edge between every two connected nodes in the first graph structure data, to obtain pruned first graph structure data.
[0009] In a fourth aspect, embodiments of this specification provide a storage medium for storing computer-executable instructions. When executed, these instructions implement the following process: acquiring first graph structure data to be pruned, the first graph structure data being constructed based on human-computer interaction data having a preset correspondence with a target user; encoding the first graph structure data based on a graph encoding network in a pre-trained graph sampling model to determine the node representation vector of each node in the first graph structure data; determining the edge representation vector of the edge between every two connected nodes in the first graph structure data based on the node representation vector of each node in the first graph structure data, the construction time of the first graph structure data, and the time information between every two connected nodes in the first graph structure data; determining the sampling probability of the edge between every two connected nodes in the first graph structure data based on the sampling network in the pre-trained graph sampling model and the edge representation vector of the edge between every two connected nodes in the first graph structure data; and pruning the first graph structure data based on the sampling probability of the edge between every two connected nodes in the first graph structure data to obtain pruned first graph structure data.
BRIEF DESCRIPTION OF THE DRAWINGS
[0010] To more clearly illustrate the technical solutions in the embodiments or prior art of this specification, the drawings used in the description of the embodiments or prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this specification. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0011] Figure 1A is a flowchart of an embodiment of a data processing method described in this specification;
[0012] Figure 1B is a schematic diagram of the processing procedure of an embodiment of a data processing method described in this specification;
[0013] Figure 2 is a schematic diagram of a pruning process for first graph structure data in this specification;
[0014] Figure 3 is a schematic diagram of the processing procedure of another embodiment of a data processing method in this specification;
[0015] Figure 4 is a schematic diagram of the training process of a graph sampling model described in this specification;
[0016] Figure 5 is a schematic diagram of the processing procedure of yet another embodiment of a data processing method in this specification;
[0017] Figure 6 is a schematic diagram of a sampling process in this specification;
[0018] Figure 7 is a schematic structural diagram of an embodiment of a data processing apparatus according to this specification;
[0019] Figure 8 is a schematic structural diagram of a data processing device described in this specification.
DETAILED DESCRIPTION
[0020] This specification provides a data processing method, apparatus, and device through its embodiments.
[0021] To enable those skilled in the art to better understand the technical solutions in this specification, the technical solutions in the embodiments of this specification will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this specification, and not all embodiments. Based on the embodiments in this specification, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of this specification.
[0022] Embodiment 1
[0023] As shown in Figure 1A and Figure 1B, an embodiment of this specification provides a data processing method. The execution subject of this method can be a server, which can be a standalone server or a server cluster composed of multiple servers. Specifically, the method may include the following steps:
[0024] In S102, the first graph structure data to be pruned is acquired.
[0025] The first graph structure data can be constructed based on human-computer interaction data with a preset correspondence with the target user. The server can receive the first graph structure data sent by any sending device (such as mobile terminal devices such as mobile phones and tablets, terminal devices such as personal computers, or servers). The first graph structure data can be any graph structure data constructed by the sending device based on human-computer interaction data with a preset correspondence with the target user. For example, the first graph structure data can be any graph data that can represent nodes and node relationships, such as a knowledge graph. The nodes in the first graph structure data can be used to represent entities or concepts, and the edges between nodes can be used to represent the semantic relationships between entities / concepts. For example, when the target user clicks on a page, it can be considered that the target user has interacted with the page content block. That is, the target user and the page content block can be used as nodes in the first graph structure data, and the edges between these two nodes can be used to represent the interaction between the target user and the page content block. Therefore, the first graph structure data constructed by human-computer interaction data with a preset correspondence with the target user can be used to characterize the target user's fine-grained habit preferences and interpersonal relationships.
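The construction of graph structure data from interaction records described above can be sketched as follows. This is a minimal illustration only: the record layout `(user, target, timestamp)`, the function name `build_graph`, and the sample identifiers are assumptions made for the example and do not come from the specification.

```python
def build_graph(interactions):
    """Build a node set and a timestamped edge list from interaction records.

    Each record is assumed to be (user, target, timestamp); this layout is
    hypothetical and chosen only for illustration.
    """
    nodes = set()
    edges = []
    for user, target, timestamp in interactions:
        nodes.update((user, target))             # both endpoints become nodes
        edges.append((user, target, timestamp))  # the click becomes an edge
    return nodes, edges


# Hypothetical click events: one user interacting with two page content blocks.
interactions = [
    ("user_1", "page_block_a", 1670832000),
    ("user_1", "page_block_b", 1670835600),
]
nodes, edges = build_graph(interactions)
```

Keeping the timestamp on each edge matters for the later steps, which use the time at which each connection was established.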
[0026] In practice, with the rapid development of computer technology, all industries face the challenge of big data processing. How to extract valuable information from big data to support increasingly complex business needs is a pressing issue for all industries. Graph structure data, because it can utilize visualization technology to describe knowledge resources and their carriers, can be used to solve big data problems such as text semantic similarity, similar product recommendations, or intelligent question-answering systems. Since complete graph data may contain a large amount of redundant data, it needs to be pruned to process the aforementioned big data problems. For example, the complete graph data can be pruned by extracting k-order neighbor subgraphs. However, pruning using k-order neighbor subgraphs may not effectively select nodes useful for big data problems; that is, the pruned graph structure data may not be accurately used to process big data problems, resulting in poor pruning effects and inaccurate data processing. Therefore, a technical solution is needed to improve the accuracy of pruning graph structure data. To this end, this specification provides a technical solution that can solve the above problems, as detailed below.
[0027] Taking a resource transfer scenario as an example, when a terminal device detects that a target user has triggered the execution of a resource transfer service, it can send a service execution request to the server. In response to the service execution request, the server can obtain first graph structure data constructed based on human-computer interaction data that has a preset correspondence with the target user, such as the input data of the target user during the human-computer interaction process.
[0028] In addition, the server can also acquire second graph structure data. The second graph structure data can be constructed based on human-computer interaction data that is acquired within a first time period and has a preset correspondence with the target user. Since the amount of second graph structure data may be large, to improve subsequent data processing efficiency, the server can perform preliminary pruning on the second graph structure data to obtain the first graph structure data. For example, the server can take the subgraph of the second graph structure data that corresponds to a second time period as the first graph structure data, where the second time period is shorter than the first time period; for example, the first time period could be the last half month, and the second time period could be the last week.
[0029] The method for obtaining the first graph structure data described above is one optional and feasible method. In actual application scenarios, a variety of other methods can be used depending on the scenario, and the embodiments of this specification do not specifically limit this.
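A minimal sketch of the preliminary time-window step, assuming each edge carries the timestamp of its interaction. The function name `window_subgraph`, the edge layout, and the sample data are illustrative assumptions, not part of the specification.

```python
from datetime import datetime, timedelta


def window_subgraph(edges, now, window):
    """Keep only edges whose interaction time falls within [now - window, now]."""
    start = now - window
    return [(u, v, t) for (u, v, t) in edges if start <= t <= now]


now = datetime(2022, 12, 12, 12, 0)

# Hypothetical second graph structure data covering the last half month.
second_graph = [
    ("user", "page_a", now - timedelta(days=10)),
    ("user", "page_b", now - timedelta(days=3)),
    ("user", "shop_c", now - timedelta(hours=5)),
]

# First graph structure data: the subgraph for the last week.
first_graph = window_subgraph(second_graph, now, timedelta(days=7))
```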
[0030] In S104, the first graph structure data is encoded based on the graph encoding network in the pre-trained graph sampling model to determine the node representation vector of each node in the first graph structure data.
[0031] The graph coding network can be any graph neural network capable of encoding the first graph structure data.
[0032] In practice, the first graph structure data can be input into the graph coding network in the pre-trained graph sampling model. The graph coding network can perform message aggregation processing on each node in the first graph structure data to obtain the node representation vector of each node.
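As a rough illustration of message aggregation, the sketch below performs one round of unweighted mean aggregation over each node's neighbors and the node itself. It is deliberately simplified: a real graph encoding network would add learnable weights, nonlinearities, and possibly attention, and the specification does not prescribe this particular aggregator. The function name and the sample features are assumptions.

```python
def mean_aggregate(features, edges):
    """One round of mean message aggregation over an undirected edge list."""
    neighbors = {node: [] for node in features}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    updated = {}
    for node, own in features.items():
        # Messages come from every neighbor plus the node's own features.
        msgs = [features[n] for n in neighbors[node]] + [own]
        updated[node] = [
            sum(m[i] for m in msgs) / len(msgs) for i in range(len(own))
        ]
    return updated


# Hypothetical initial node features.
features = {"user": [1.0, 0.0], "page": [0.0, 1.0], "shop": [1.0, 1.0]}
edges = [("user", "page"), ("user", "shop")]
node_vectors = mean_aggregate(features, edges)
```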
[0033] In S106, based on the node representation vector of each node in the first graph structure data, the construction time of the first graph structure data, and the time information between every two nodes with a connection relationship in the first graph structure data, the edge representation vector of the edge between every two nodes with a connection relationship in the first graph structure data is determined.
[0034] In the first graph structure data, the time information between any two connected nodes can be the time information of when the connection was established between the two connected nodes. For example, when a target user clicks on a page, it can be considered that an interaction has occurred between the target user and the page content block. That is, the target user and the page content block can be regarded as nodes in the first graph structure data, and the edge between these two nodes can be used to represent the interaction between the target user and the page content block. The time information between these two connected nodes can be the time information of the target user's click behavior.
[0035] In implementation, the node representation vectors of the two connected nodes can be obtained, together with the construction time of the first graph structure data and the time information between the two connected nodes. Based on the obtained information, the edge representation vector of the edge between the two connected nodes can be constructed.
[0036] In S108, the sampling probability of the edge between every two connected nodes in the first graph structure data is determined based on the sampling network in the pre-trained graph sampling model and the edge representation vector of the edge between every two connected nodes in the first graph structure data.
[0037] In the graph sampling model, the sampling network can be a network built based on a neural network algorithm. For example, the sampling network can be built based on a graph convolutional neural network algorithm. The sampling probability can be used to characterize the importance of the edge between any two connected nodes. The network structures of the graph coding network and the sampling network in the graph sampling model can be selected according to different application scenarios. For example, for scenarios with high timeliness requirements and low accuracy requirements, a network structure with fewer layers (such as a 2-3 layer network structure) can be used to meet the needs of efficient real-time scenario computing. For scenarios with low timeliness requirements and high accuracy requirements, a network structure with more layers (such as a 5-6 layer network structure) can be used to meet the needs of high accuracy scenario computing.
[0038] In implementation, the server can jointly train the graph encoding network and the sampling network in the graph sampling model based on historical graph structure data to obtain the trained graph sampling model. Then, the edge representation vector of the edge between every two connected nodes in the first graph structure data is input into the sampling network in the pre-trained graph sampling model to obtain the sampling probability of the edge between every two connected nodes.
[0039] In S110, based on the sampling probability of the edge between every two connected nodes in the first graph structure data, the first graph structure data is pruned to obtain the pruned first graph structure data.
[0040] In implementation, the sampling probabilities of the edges between any two connected nodes can be sorted from largest to smallest, and the edges corresponding to the last n sampling probabilities can be pruned to obtain the pruned first graph structure data. Here, n is a positive integer not less than 1, and n can vary depending on the pruning requirements of the actual application scenario. This specification does not specifically limit this in the embodiments.
[0041] For example, suppose the first graph structure data contains node 1, node 2, node 3, node 4, and node 5, and the connection relationships between the nodes are as shown in Figure 2. The sampling probabilities of the edges between any two connected nodes can be sorted from largest to smallest, and the edges corresponding to the last two sampling probabilities can be pruned to obtain the pruned first graph structure data.
[0042] The above-described method for pruning the first graph structure data is one optional and feasible method. In actual application scenarios, many other pruning methods are possible, and different pruning methods can be selected according to the scenario. The embodiments of this specification do not specifically limit this.
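The sort-and-drop pruning just described can be sketched as below. The edge set and probability values are hypothetical (the connections in Figure 2 are not reproduced here), and `prune_lowest` is an illustrative name.

```python
def prune_lowest(edge_probs, n):
    """Drop the n edges with the smallest sampling probability."""
    # Sort edges by sampling probability from largest to smallest.
    ranked = sorted(edge_probs.items(), key=lambda kv: kv[1], reverse=True)
    keep = max(len(ranked) - n, 0)  # guard against n exceeding the edge count
    return dict(ranked[:keep])


# Hypothetical sampling probabilities for edges among nodes 1 to 5.
edge_probs = {
    (1, 2): 0.9,
    (1, 3): 0.2,
    (2, 4): 0.7,
    (3, 5): 0.1,
    (4, 5): 0.6,
}
pruned = prune_lowest(edge_probs, n=2)
```

With these assumed values the two lowest-probability edges, (1, 3) and (3, 5), are removed.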
[0043] By pruning the first graph structure data, we can meet the computational complexity requirements of big data scenarios, achieve high pruning rates with low computational costs, reduce data storage costs, and decrease the computational resource consumption of subsequent downstream tasks.
[0044] This specification provides a data processing method that acquires first graph structure data to be pruned. The first graph structure data is constructed based on human-computer interaction data that has a preset correspondence with the target user. The first graph structure data is encoded using a graph encoding network in a pre-trained graph sampling model to determine the node representation vector of each node in the first graph structure data. Based on the node representation vector of each node in the first graph structure data, the construction time of the first graph structure data, and the time information between every two connected nodes in the first graph structure data, the edge representation vector of the edge between every two connected nodes in the first graph structure data is determined. Based on the sampling network in the pre-trained graph sampling model and the edge representation vector of the edge between every two connected nodes in the first graph structure data, the sampling probability of the edge between every two connected nodes in the first graph structure data is determined. Based on the sampling probability of the edge between every two connected nodes in the first graph structure data, the first graph structure data is pruned to obtain pruned first graph structure data. In this way, the node representation vector of each node can be accurately determined through the graph encoding network. By combining the node representation vectors with time information (i.e., the construction time of the first graph structure data and the time information between every two connected nodes in the first graph structure data), the edge representation vector of the edge between every two connected nodes in the first graph structure data can be accurately determined. Then, based on the sampling probability of the edge between every two connected nodes, the first graph structure data is pruned to obtain pruned first graph structure data. This process can remove redundant graph information while retaining important graph structure data, thereby improving the accuracy of subsequent processing of big data problems such as text semantic similarity, similar product recommendation, or intelligent question answering systems based on the pruned first graph structure data. In addition, since the pruning method does not rely on static graph representations such as adjacency matrices, it can also prune time-series dynamic graphs (i.e., the first graph structure data can be a time-series dynamic graph), thus improving data processing efficiency.
[0045] Embodiment 2
[0046] As shown in Figure 3, an embodiment of this specification provides a data processing method. The execution subject of this method can be a terminal device or a server. The terminal device can be a personal computer or a mobile terminal device such as a mobile phone or tablet computer. The server can be a standalone server or a server cluster composed of multiple servers. Specifically, the method may include the following steps:
[0047] In S302, historical graph structure data is acquired.
[0048] The historical graph structure data can be training data used to train the graph sampling model. The historical graph structure data can be constructed based on historical human-computer interaction data that has a preset correspondence with the first user. The first user can be multiple users including the target user, or the first user can be a user different from the target user, etc. The embodiments of this specification do not specifically limit the specific composition of the historical graph structure data.
[0049] In S304, the historical graph structure data is input into the graph coding network in the graph sampling model to encode the historical graph structure data and determine the node representation vector of each node in the historical graph structure data.
[0050] The graph encoding network can be a temporal graph encoding network, such as a temporal graph attention network (TGAT), a temporal graph network, or a Transformer network.
[0051] In implementation, as shown in Figure 4, historical graph structure data can be input into the graph encoding network of the graph sampling model. The graph encoding network can then perform message aggregation calculations to obtain the node representation vector of each node in the historical graph structure data. Specifically, the graph encoding network can adopt the graph network structure of TGAT, performing neighbor-node message aggregation calculations at each node to obtain the node representation vector of each node.
[0052] In S306, based on the node representation vector of each node in the historical graph structure data, the construction time of the historical graph structure data, and the time information between every two connected nodes in the historical graph structure data, the edge representation vector of the edge between every two connected nodes in the historical graph structure data is determined.
[0053] In practical application scenarios, time information is of great significance for processing big data problems. For example, edges closer to the current time are more important, and the time information corresponding to an edge (i.e., the time of the interaction event) may reveal regular behaviors such as user habits and preferences. For instance, a user might purchase a certain product at 8 AM every day (or at the beginning of each month). Therefore, to improve the accuracy of subsequent data processing, the server can determine the edge representation vector by combining the node representation vectors of the two endpoint nodes of each edge, the construction time of the historical graph structure data, and the time information corresponding to that edge.
[0054] In practical applications, the processing method of S306 described above can vary. The following is one optional implementation method, which can be found in steps one through three below:
[0055] Step 1: Based on the preset dimensions and the time information between every two connected nodes in the historical graph structure data, construct a discrete representation vector.
[0056] In implementation, assuming the preset dimensions include four dimensions: month, date, hour, and minute, then based on the time information between every two connected nodes in the historical graph structure data, a corresponding discrete representation vector can be constructed based on these four dimensions.
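A minimal sketch of the discrete representation for the four example dimensions. How the discrete values are encoded (raw integers, one-hot vectors, or learned embeddings) is not stated in the source; the sketch returns raw integer components as an assumption, and `discrete_time_vector` is an illustrative name.

```python
from datetime import datetime


def discrete_time_vector(t):
    """Split an interaction time into month, date, hour, and minute components."""
    return [t.month, t.day, t.hour, t.minute]


# Hypothetical interaction time for one edge: 12 December 2022, 08:30.
vec = discrete_time_vector(datetime(2022, 12, 12, 8, 30))
```

Components like these make periodic behavior, such as a purchase at 8 AM every day, directly visible to the model.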
[0057] Step 2: Based on the construction time of the historical graph structure data and the time information between every two connected nodes in the historical graph structure data, determine the time difference and convert the time difference into a continuous representation vector in the time domain.
[0058] In practice, the time difference can be substituted into the formula
[0059] φ(Δt) = cos(WΔt + b)
[0060] to obtain a continuous representation vector in the time domain, where Δt is the time difference, φ(Δt) is the continuous representation vector of the time difference in the time domain, and W and b are learnable parameter vectors of dimension d, where d is a preset dimension.
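The formula φ(Δt) = cos(WΔt + b) can be evaluated as in the sketch below. In the model, W and b would be learned during training; here they are filled with illustrative values (random W, zero b), and the unit of Δt (seconds) is an assumption.

```python
import math
import random


def time_encoding(delta_t, W, b):
    """Compute phi(delta_t) = cos(W * delta_t + b) element-wise over d dimensions."""
    return [math.cos(w * delta_t + bi) for w, bi in zip(W, b)]


d = 4                     # preset dimension (illustrative value)
rng = random.Random(0)    # fixed seed so the sketch is reproducible
W = [rng.uniform(-1.0, 1.0) for _ in range(d)]  # stand-in for learned weights
b = [0.0] * d                                    # stand-in for learned biases

# Time difference between graph construction time and the edge's interaction time.
phi = time_encoding(3600.0, W, b)
```

Every component lies in [-1, 1], so the encoding stays bounded no matter how large the time difference becomes.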
[0061] Step 3: Based on the node representation vector of every two connected nodes in the historical graph structure data, the discrete representation vector and the continuous representation vector of the edge between every two connected nodes, determine the edge representation vector of the edge between every two connected nodes in the historical graph structure data.
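One simple way to combine the four inputs of Step 3 is concatenation, sketched below. The source only says the edge representation vector is determined "based on" these inputs, so concatenation is an assumption, as are the function name and the sample values.

```python
def edge_vector(h_u, h_v, discrete_vec, continuous_vec):
    """Concatenate both endpoint vectors with the two time representations."""
    return list(h_u) + list(h_v) + list(discrete_vec) + list(continuous_vec)


# Hypothetical inputs for a single edge.
h_u = [0.1, 0.2]            # node representation vector of one endpoint
h_v = [0.3, 0.4]            # node representation vector of the other endpoint
discrete_vec = [12, 12, 8, 30]   # month, date, hour, minute
continuous_vec = [0.5, -0.5]     # time-domain encoding of the time difference

e_uv = edge_vector(h_u, h_v, discrete_vec, continuous_vec)
```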
[0062] In S308, the sampling probability of the edge between every two connected nodes in the historical graph structure data is determined based on the sampling network in the graph sampling model and the edge representation vector of the edge between every two connected nodes in the historical graph structure data.
[0063] In implementation and practical applications, the processing method of S308 can vary. The following is one optional implementation method, which can be found in steps one to two below:
[0064] Step 1: Based on the sampling network in the graph sampling model and the edge representation vector of the edge between every two connected nodes in the historical graph structure data, determine the first sampling probability of the edge between every two connected nodes in the historical graph structure data.
[0065] In implementation, taking the sampling network based on a multi-layer perceptron (MLP) as an example, assuming that the network structure of the sampling network is a 3-layer MLP structure, the edge representation vector of the edge between every two nodes with a connection relationship in the historical graph structure data can be input into the sampling network, and the first sampling probability can be obtained through an activation function (such as the sigmoid function).
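A 3-layer MLP with a sigmoid output can be sketched in plain Python as follows. The layer widths, the ReLU hidden activation, and the random initial weights are assumptions for illustration; in the model the weights would be obtained by training.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(rng, n_in, n_out):
    """Random weights and zero biases, standing in for trained parameters."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def sampling_network(edge_vec, layers):
    """Map an edge representation vector to a first sampling probability."""
    h = edge_vec
    for weights, bias in layers[:-1]:
        h = relu(linear(h, weights, bias))     # hidden layers
    weights, bias = layers[-1]
    return sigmoid(linear(h, weights, bias)[0])  # scalar output in (0, 1)


rng = random.Random(0)
layers = [init_layer(rng, 6, 8), init_layer(rng, 8, 8), init_layer(rng, 8, 1)]
p = sampling_network([0.2, -0.1, 0.5, 0.0, 1.0, -0.3], layers)
```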
[0066] Step 2: Convert the first sampling probability of the edge between every two connected nodes in the historical graph structure data to a preset distribution for sampling, and obtain the sampling probability of the edge between every two connected nodes in the historical graph structure data.
[0067] In practice, the first sampling probability obtained through the activation function lies between 0 and 1, whereas the sampling outcome for an edge is discrete (0 or 1) and sampling it directly is not trainable. To make the sampling process trainable, the Bernoulli distribution can be used for sampling, and the original untrainable sampling process can be approximated by reparameterization.
[0068] The first sampling probability can be substituted into the formula
[0069] $$e = \sigma\left(\frac{\log \varepsilon - \log(1 - \varepsilon) + m}{\tau}\right),$$
[0070] to obtain the sampling probability of the edge between every two connected nodes in the historical graph structure data, where e is the sampling probability of the edge, m is the first sampling probability of the edge, σ is the activation function, and ε and τ are preset hyperparameters.
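Steps one and two of S308 can be sketched as follows. This is a minimal illustration, not the implementation of the embodiments: it assumes a 3-layer MLP with ReLU hidden activations and a sigmoid output, and it evaluates e = σ((log ε − log(1 − ε) + m) / τ) with ε and τ passed in as preset hyperparameters. The layer sizes, the random weights, and the names `init_layer`, `first_sampling_probability` and `relaxed_sampling_probability` are hypothetical.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def init_layer(n_in, n_out, rng):
    """Create a (weights, bias) pair with small random weights (illustrative only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def linear(layer, x):
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def first_sampling_probability(layers, edge_vector):
    """Step 1: map one edge representation vector to a value in (0, 1)."""
    h = edge_vector
    for layer in layers[:-1]:
        h = [max(0.0, v) for v in linear(layer, h)]  # ReLU (assumed hidden activation)
    logit = linear(layers[-1], h)[0]                 # last layer has one output unit
    return sigmoid(logit)

def relaxed_sampling_probability(m, eps, tau):
    """Step 2: e = sigmoid((log(eps) - log(1 - eps) + m) / tau)."""
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    if tau <= 0.0:
        raise ValueError("tau must be positive")
    return sigmoid((math.log(eps) - math.log(1.0 - eps) + m) / tau)

rng = random.Random(0)
layers = [init_layer(4, 8, rng), init_layer(8, 8, rng), init_layer(8, 1, rng)]
m = first_sampling_probability(layers, [0.2, -0.1, 0.7, 0.05])
e = relaxed_sampling_probability(m, eps=0.5, tau=0.5)
```

With ε = 0.5 the two logarithm terms cancel, so e reduces to σ(m / τ); a smaller τ pushes e closer to 0 or 1.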
[0071] In S310, the historical graph structure data is pruned based on the sampling probability of the edge between every two connected nodes in the historical graph structure data, resulting in pruned historical graph structure data.
[0072] In implementation, the edges between every two connected nodes in the historical graph structure data can be filtered by comparing their sampling probabilities with a preset sampling probability threshold, and the pruned historical graph structure data can be constructed from the nodes retained after filtering.
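A minimal sketch of this threshold filtering is shown below. It assumes that an edge is kept when its sampling probability is greater than or equal to the threshold (the document does not state the comparison direction or whether it is inclusive); `prune_edges` and the example values are hypothetical.

```python
def prune_edges(edge_probabilities, threshold):
    """Keep edges whose sampling probability reaches the threshold (assumed rule)."""
    kept_edges = {edge: p for edge, p in edge_probabilities.items() if p >= threshold}
    # The pruned graph is built from the nodes that still have a retained edge.
    kept_nodes = sorted({node for edge in kept_edges for node in edge})
    return kept_edges, kept_nodes

edge_probabilities = {("u", "a"): 0.92, ("u", "b"): 0.31, ("a", "c"): 0.67}
kept_edges, kept_nodes = prune_edges(edge_probabilities, threshold=0.5)
```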
[0073] In S312, based on the attribute information of the edges between every two connected nodes in the pruned historical graph structure data, an edge feature vector is generated for the edge between every two connected nodes in the pruned historical graph structure data.
[0074] The attribute information of an edge can be determined based on the interaction data corresponding to that edge. For example, if an edge between two nodes with a connection relationship is used to represent a resource transfer event, the attribute information of that edge can be determined based on the interaction data of the resource transfer event. The attribute information of that edge may include the resource transfer time, the number of resources transferred, etc.
[0075] In implementation, based on a preset feature extraction algorithm, features can be extracted from the attribute information of the edges between every two connected nodes in the pruned historical graph structure data to obtain the edge feature vector of the edge between every two connected nodes in the pruned historical graph structure data.
[0076] In S314, based on the edge feature vector and edge representation vector of the edge between every two connected nodes in the pruned historical graph structure data, and the attention coefficient between every two connected nodes in the pruned historical graph structure data, the node representation vector of each node in the pruned historical graph structure data is updated to obtain the updated node representation vector of each node in the pruned historical graph structure data.
[0077] In implementation, as shown in Figure 4, the attention coefficients between every two connected nodes in the pruned historical graph structure data can be obtained through the graph encoding network. Based on the edge feature vectors and edge representation vectors of the edges between every two connected nodes in the pruned historical graph structure data, as well as the attention coefficients between every two connected nodes, the node representation vector of each node in the pruned historical graph structure data is updated to obtain the updated node representation vector of each node. Specifically, the product of the edge feature vector and the edge representation vector of the edge between every two connected nodes and the attention coefficient between those two nodes can be used as the updated node representation vector of each node in the pruned historical graph structure data.
[0078] The above method for updating node representation vectors is one optional, implementable method. In practical application scenarios, many other update methods are possible, and different update methods can be selected according to the scenario. The embodiments of this specification do not specifically limit this.
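The product-based update described above can be sketched as follows. The document does not say how the products from several incident edges are combined, so this sketch assumes an element-wise product of the edge feature vector and the edge representation vector, scaled by the attention coefficient and summed over the edges incident to each node. The function name and the example values are hypothetical.

```python
def update_node_vectors(edges, attention, edge_features, edge_reprs, dim):
    """Sum attention * (edge feature * edge representation) over incident edges."""
    updated = {}
    for edge in edges:
        a = attention[edge]
        # Element-wise product of the two edge vectors, scaled by the attention coefficient.
        message = [a * f * r for f, r in zip(edge_features[edge], edge_reprs[edge])]
        for node in edge:  # edges are treated as undirected (an assumption)
            acc = updated.setdefault(node, [0.0] * dim)
            for i, value in enumerate(message):
                acc[i] += value
    return updated

edges = [("u", "a"), ("u", "b")]
attention = {("u", "a"): 0.5, ("u", "b"): 0.25}
edge_features = {("u", "a"): [2.0, 4.0], ("u", "b"): [4.0, 8.0]}
edge_reprs = {("u", "a"): [1.0, 0.5], ("u", "b"): [1.0, 1.0]}
updated = update_node_vectors(edges, attention, edge_features, edge_reprs, dim=2)
```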
[0079] In S316, the loss value is determined based on the node representation vector of the node in the historical graph structure data and the updated node representation vector of the node in the pruned historical graph structure data.
[0080] In implementation and practical applications, the processing method of S316 can vary. The following is one optional implementation method, which can be found in steps one to two below:
[0081] Step 1: Determine the loss value based on the node representation vector of the center node of the historical graph structure data and the updated node representation vector of the center node of the pruned historical graph structure data.
[0082] The central node of the historical graph structure data can be determined based on the number of connections between each node and other nodes in the historical graph structure data. In addition, there are many other methods for determining the central node. This specification does not specifically limit the method for determining the central node of the historical graph structure data in the embodiments.
[0083] Step 2: Obtain the mutual information value between the node representation vector of the center node of the historical graph structure data and the updated node representation vector of the center node of the pruned historical graph structure data, and determine the mutual information value as the loss value.
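Steps one and two can be illustrated with the sketch below. It makes two assumptions that the document leaves open: the central node is taken to be the node with the most connections, and the mutual information is estimated with a simple histogram (plug-in) estimator over the paired components of the two vectors. Both are stand-ins chosen for brevity; `center_node`, `mutual_information` and the example vectors are hypothetical.

```python
import math
from collections import Counter

def center_node(edges):
    """Pick the node with the highest degree; ties are broken by node name."""
    degree = Counter()
    for src, dst in edges:
        degree[src] += 1
        degree[dst] += 1
    return min(degree, key=lambda node: (-degree[node], node))

def mutual_information(x, y, bins=4):
    """Histogram estimate of I(X; Y) in nats from paired components of x and y."""
    if len(x) != len(y) or not x:
        raise ValueError("x and y must be non-empty and of equal length")

    def discretize(values):
        lo, hi = min(values), max(values)
        if hi == lo:
            return [0] * len(values)
        width = (hi - lo) / bins
        return [min(int((v - lo) / width), bins - 1) for v in values]

    bx, by = discretize(x), discretize(y)
    n = len(x)
    joint, px, py = Counter(zip(bx, by)), Counter(bx), Counter(by)
    mi = 0.0
    for (i, j), count in joint.items():
        p_xy = count / n
        # p(x, y) / (p(x) * p(y)) written with raw counts
        mi += p_xy * math.log(count * n / (px[i] * py[j]))
    return mi

edges = [("u", "a"), ("u", "b"), ("u", "c"), ("a", "c")]
center = center_node(edges)
original_vector = [0.1, 0.4, 0.7, 0.9]   # center node before pruning
updated_vector = [0.2, 0.5, 0.6, 1.0]    # center node after pruning and update
loss_value = mutual_information(original_vector, updated_vector)
```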
[0084] In S318, based on the loss value, it is determined whether the graph sampling model has converged. If the graph sampling model has not converged, the graph encoding network and sampling network of the graph sampling model are trained again based on historical graph structure data until the graph sampling model converges, and the trained graph sampling model is obtained.
[0085] In implementation, to ensure that the pruned graph structure data retains as much useful information as possible from the original graph structure data, the graph sampling model can be trained by comparing the representational similarity of the central nodes of the graph structure data before and after pruning. Specifically, the graph sampling model can be trained by checking whether the loss value determined by the mutual information between the node representation vectors of the central nodes of the historical graph structure data and the updated node representation vectors of the central nodes of the pruned historical graph structure data is greater than a preset loss threshold. This ensures that the trained graph sampling model can retain as much useful information as possible from the original graph structure data when the pruned graph structure data is obtained.
[0086] In S102, the first graph structure data to be clipped is obtained.
[0087] The first graph structure data can be constructed based on human-computer interaction data that has a preset correspondence with the target user.
[0088] In S104, the first graph structure data is encoded based on the graph encoding network in the pre-trained graph sampling model to determine the node representation vector of each node in the first graph structure data.
[0089] In S106, based on the node representation vector of each node in the first graph structure data, the construction time of the first graph structure data, and the time information between every two nodes with a connection relationship in the first graph structure data, the edge representation vector of the edge between every two nodes with a connection relationship in the first graph structure data is determined.
[0090] In S108, the sampling probability of the edge between every two connected nodes in the first graph structure data is determined based on the sampling network in the pre-trained graph sampling model and the edge representation vector of the edge between every two connected nodes in the first graph structure data.
[0091] In S110, based on the sampling probability of the edge between every two connected nodes in the first graph structure data, the first graph structure data is pruned to obtain the pruned first graph structure data.
[0092] In S320, an information recommendation request for the target user is received.
[0093] In implementation, take as an example the case where the first graph structure data to be pruned is determined based on the target user's product purchase information within a preset first time period. If the target user purchased product 1 in the commodity trading application within the preset first time period, the first graph structure data to be pruned can be constructed based on the target user's user information, the product information of product 1, and the product information of product 2.
[0094] For example, a commodity trading application may include product 1, product 2, product 3, and product 4. Product 1 and product 2 have the same sales volume within the preset first time period, product 1 and product 3 are of the same type, and product 1 and product 4 have the same transaction price. Therefore, the nodes in the first graph structure data can be determined based on the target user, product 1, product 2, product 3, and product 4. Furthermore, the connection relationships between nodes in the first graph structure data can be determined based on the purchase relationship between the target user and each product, and on the relationships between products. The server can obtain the first graph structure data constructed based on the above information and prune it according to steps S104-S110 to obtain the pruned first graph structure data.
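The graph construction in this example can be sketched as follows. The node identifiers, the relation labels and the function `build_first_graph` are hypothetical; the document only states which nodes and relationships exist, not how they are stored.

```python
def build_first_graph(user, purchases, product_relations):
    """Build a node list and a labelled edge list from purchase and product relations."""
    nodes = {user}
    edges = []
    for product in purchases:
        nodes.add(product)
        edges.append((user, product, "purchased"))
    for first, second, relation in product_relations:
        nodes.update((first, second))
        edges.append((first, second, relation))
    return sorted(nodes), edges

nodes, edges = build_first_graph(
    user="target_user",
    purchases=["product_1"],
    product_relations=[
        ("product_1", "product_2", "same_sales_volume"),
        ("product_1", "product_3", "same_type"),
        ("product_1", "product_4", "same_transaction_price"),
    ],
)
```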
[0095] When a terminal device detects that a target user has triggered the launch of a commodity trading application, the terminal device can send an information recommendation request to the server for the target user.
[0096] In S322, node classification is performed on the pruned first graph structure data to obtain node classification results, and the information recommendation results are determined based on the node classification results.
[0097] In practice, the server can perform node classification processing on the pruned first graph structure data to obtain node classification results, and then determine the information recommendation results based on the node classification results.
[0098] For example, the nodes in the pruned first graph structure data other than the node corresponding to the target user can be classified. The resulting node classification results can be Category 1 and Category 2, where Category 1 corresponds to Product 1, Product 2 and Product 3, and Category 2 corresponds to Product 4. Since Product 1 is the product purchased by the target user within the preset first time period, the information recommendation results can be determined based on Category 1. That is, the information recommendation results can include Product 2 and Product 3.
[0099] The above method for determining the information recommendation results is one optional, feasible method. In actual application scenarios, a variety of other methods are possible, and different methods can be selected according to the scenario. The embodiments of this specification do not specifically limit this.
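The example above, in which products in the same category as a purchased product are recommended, can be sketched as follows. The classification itself is taken as given; `recommend` and the dictionary layout are hypothetical.

```python
def recommend(node_categories, purchased):
    """Recommend unpurchased products that share a category with a purchased product."""
    target_categories = {node_categories[product] for product in purchased}
    return sorted(
        node
        for node, category in node_categories.items()
        if category in target_categories and node not in purchased
    )

node_categories = {
    "product_1": "category_1",
    "product_2": "category_1",
    "product_3": "category_1",
    "product_4": "category_2",
}
result = recommend(node_categories, purchased={"product_1"})
```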
[0100] In S324, information recommendation results are provided in response to information recommendation requests.
[0101] In practice, the server can feed back the information recommendation results to the terminal device so that the terminal device can display the information recommendation results, which can improve the efficiency and accuracy of information recommendation.
[0102] This specification provides a data processing method that acquires first graph structure data to be clipped. The first graph structure data is constructed based on human-computer interaction data that has a preset correspondence with the target user. The first graph structure data is encoded using a graph encoding network in a pre-trained graph sampling model to determine the node representation vector of each node in the first graph structure data. Based on the node representation vectors of each node in the first graph structure data, the construction time of the first graph structure data, and the time information between every two connected nodes in the first graph structure data, the edge representation vector of the edge between every two connected nodes in the first graph structure data is determined. Based on the sampling network in the pre-trained graph sampling model and the edge representation vectors of the edges between every two connected nodes in the first graph structure data, the sampling probability of the edge between every two connected nodes in the first graph structure data is determined. Based on the sampling probability of the edge between every two connected nodes in the first graph structure data, the first graph structure data is clipped to obtain clipped first graph structure data. In this way, the node representation vector of each node can be accurately determined through the graph coding network. By combining the node representation vector with time information (i.e., the construction time of the first graph structure data and the time information between every two connected nodes in the first graph structure data), the edge representation vector of the edge between every two connected nodes in the first graph structure data can be accurately determined. Then, based on the sampling probability of the edge between every two connected nodes, the first graph structure data is pruned to obtain pruned first graph structure data. 
This process can remove redundant graph information while retaining important graph structure data, thereby improving the accuracy of subsequent processing, based on the pruned first graph structure data, of big data problems such as text semantic similarity, similar product recommendation, or intelligent question answering. In addition, since the pruning method does not rely on static graph representations such as adjacency matrices, it can also prune time-series dynamic graphs (i.e., the first graph structure data can be a time-series dynamic graph), thus improving data processing efficiency.
[0103] Example 3
[0104] As shown in Figure 5, an embodiment of this specification provides a data processing method. The execution subject of this method can be a terminal device or a server. The terminal device can be a personal computer or a mobile terminal device such as a mobile phone or tablet computer. The server can be a standalone server or a server cluster composed of multiple servers. Specifically, the method may include the following steps:
[0105] In S502, a risk detection request is received for a target user who triggers execution of a target business.
[0106] The target business can be any business that may pose risks such as privacy leaks. For example, the target business can be a resource transfer business or an information update business.
[0107] In practice, when a terminal device detects that the target user has triggered the target business, it can send to the server a risk detection request for the target user triggering execution of the target business.
[0108] In S504, in response to a risk detection request, human-computer interaction data with a preset correspondence to the target user and target data required to execute the target business are obtained.
[0109] In implementation, taking a resource transfer business as the target business as an example, the target data may include the resource transfer time, the resource transfer quantity, the resource transfer object, the resource transfer method, etc. The human-computer interaction data that has a preset correspondence with the target user may include, for example, the data input by the target user in response to the script corresponding to the resource transfer business.
[0110] In S506, the first graph structure data to be clipped is constructed based on human-computer interaction data and target data.
[0111] In implementation, the server can use the target user and the resource transfer objects as nodes, and construct the connection relationships between nodes based on the interactions between the target user and the resource transfer objects. The amount of human-computer interaction data that has a preset correspondence with the target user may be large, and it may contain redundant data with little relevance to the target business. Such redundant data not only wastes storage space but also reduces the efficiency and effectiveness of risk detection. Therefore, the first graph structure data needs to be pruned.
[0112] In S104, the first graph structure data is encoded based on the graph encoding network in the pre-trained graph sampling model to determine the node representation vector of each node in the first graph structure data.
[0113] Among them, the graph coding network can be a temporal graph coding network.
[0114] The training process of the graph sampling model can be found in S302 to S318 of the above embodiment 2, and will not be repeated here.
[0115] In S508, a discrete representation vector is constructed based on the preset dimension and the time information between every two nodes with a connection relationship in the first graph structure data.
[0116] In S510, based on the construction time of the first graph structure data and the time information between every two connected nodes in the first graph structure data, the time difference is determined and converted into a continuous representation vector in the time domain.
[0117] In S512, based on the node representation vector of every two nodes with a connection relationship in the first graph structure data, the discrete representation vector and the continuous representation vector of the edge between every two nodes with a connection relationship, the edge representation vector of the edge between every two nodes with a connection relationship is determined.
[0118] In S514, the first sampling probability of the edge between every two connected nodes in the first graph structure data is determined based on the sampling network in the pre-trained graph sampling model and the edge representation vector of the edge between every two connected nodes in the first graph structure data.
[0119] In S516, the first sampling probability of the edge between every two connected nodes in the first graph structure data is transformed to a preset distribution for sampling, so as to obtain the sampling probability of the edge between every two connected nodes in the first graph structure data.
[0120] The specific processing steps of S508 to S516 can be found in the relevant content of S306 to S308 in the above embodiment 2, and will not be repeated here.
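As a rough illustration of S508 to S512, the sketch below builds the two time vectors and combines them with the node representation vectors. The concrete encodings are assumptions, since this passage does not specify them: the discrete vector is a one-hot bucket of the edge timestamp over a time window, the continuous vector applies cosines at several scales to the time difference, and the edge representation is a plain concatenation. All function names and parameters are hypothetical.

```python
import math

def discrete_time_vector(edge_time, start_time, end_time, dim):
    """One-hot bucket of the edge timestamp over [start_time, end_time] (assumed scheme)."""
    span = max(end_time - start_time, 1e-9)
    index = int((edge_time - start_time) / span * dim)
    index = max(0, min(index, dim - 1))
    return [1.0 if i == index else 0.0 for i in range(dim)]

def continuous_time_vector(construction_time, edge_time, dim):
    """Encode the time difference with cosines at several scales (assumed scheme)."""
    delta = construction_time - edge_time
    return [math.cos(delta / (10.0 ** k)) for k in range(dim)]

def edge_representation(node_u, node_v, discrete, continuous):
    """Concatenate both node vectors with the two time vectors (assumed combination)."""
    return list(node_u) + list(node_v) + list(discrete) + list(continuous)

discrete = discrete_time_vector(edge_time=50.0, start_time=0.0, end_time=100.0, dim=4)
continuous = continuous_time_vector(construction_time=100.0, edge_time=50.0, dim=3)
edge_vector = edge_representation([0.1, 0.2], [0.3, 0.4], discrete, continuous)
```

The resulting edge vector is what the sampling network would receive as input in S514.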
[0121] In S518, based on a preset sampling probability threshold, the sampling probability of the edge between every two nodes with a connection relationship in the first graph structure data is filtered, and the clipped first graph structure data is constructed based on the filtered nodes.
[0122] In implementation and practical applications, the processing method of S518 can vary. The following is one optional implementation method, which can be found in steps one through three below:
[0123] Step 1: Aggregate the node representation vectors of the nodes in the first graph structure data to obtain the node representation vector of the target node.
[0124] In implementation, as shown in Figure 6, the node representation vectors of the nodes other than the central node in the first graph structure data can be aggregated to obtain the node representation vector of the target node. The node representation vector of the target node can be determined through methods such as mean aggregation.
[0125] Step 2: Establish a connection between the target node and the central node of the first graph structure data, and set the sampling probability of the edge between the target node and the central node of the first graph structure data to a preset sampling probability value.
[0126] The preset sampling probability value can be greater than the sampling probability threshold.
[0127] In implementation, the pruned graph structure data will lose some topological information (i.e., lose global information). Therefore, the node representation vectors of the nodes in the first graph structure data can be aggregated to obtain the target node. The node representation vector of the target node can be used to represent the global information of the first graph structure data. A connection relationship can be established between the target node and the central node of the first graph structure data, and the sampling probability of the edge between the target node and the central node of the first graph structure data can be set to a preset sampling probability value greater than a sampling probability threshold. For example, since the numerical range of the sampling probability can be between 0 and 1, that is, the sampling probability is not less than 0 and not greater than 1, the preset sampling probability value can be set to 1.
[0128] Step 3: Based on a preset sampling probability threshold, filter the sampling probability of the edges between every two connected nodes in the first graph structure data containing the target node, and construct the clipped first graph structure data based on the filtered nodes.
[0129] In implementation, since the sampling probability of the edge between the target node and the central node of the first graph structure data is a preset sampling probability value that is greater than the sampling probability threshold, the filtered nodes include the target node. Therefore, the global information of the first graph structure data before pruning can be retained through the target node, reducing the information loss caused by pruning and helping to improve the performance of subsequent data processing.
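Steps one through three can be sketched as follows, assuming mean aggregation over the non-central nodes, a preset sampling probability value of 1, and a rule that keeps edges whose sampling probability reaches the threshold. The node name `"global"`, the function names and the example values are hypothetical.

```python
def add_global_node(node_vectors, center, edge_probabilities,
                    preset_probability=1.0, name="global"):
    """Steps 1 and 2: aggregate non-center nodes and connect the result to the center."""
    others = [vec for node, vec in node_vectors.items() if node != center]
    if not others:
        raise ValueError("need at least one non-center node to aggregate")
    dim = len(node_vectors[center])
    mean_vector = [sum(vec[i] for vec in others) / len(others) for i in range(dim)]
    vectors = dict(node_vectors)
    vectors[name] = mean_vector
    probabilities = dict(edge_probabilities)
    probabilities[(name, center)] = preset_probability
    return vectors, probabilities

def prune_edges(edge_probabilities, threshold):
    """Step 3: keep edges whose probability reaches the threshold (assumed rule)."""
    kept_edges = {edge: p for edge, p in edge_probabilities.items() if p >= threshold}
    kept_nodes = sorted({node for edge in kept_edges for node in edge})
    return kept_edges, kept_nodes

node_vectors = {"u": [0.0, 0.0], "a": [1.0, 3.0], "b": [3.0, 5.0]}
edge_probabilities = {("u", "a"): 0.9, ("u", "b"): 0.2}
vectors, probabilities = add_global_node(node_vectors, "u", edge_probabilities)
kept_edges, kept_nodes = prune_edges(probabilities, threshold=0.5)
```

Because the preset value exceeds the threshold, the aggregated node always survives pruning, even when node `b` itself is removed.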
[0130] In S520, based on the clipped first graph structure data, it is determined whether there are risks in executing the target business.
[0131] In implementation, a pre-trained risk identification model can be used to perform risk identification on the pruned first graph structure data and obtain the corresponding risk identification results. Based on these results, it can be determined whether there is a risk in executing the target business. The risk identification model can be a model built based on a preset machine learning algorithm.
[0132] Furthermore, if the server determines, based on the risk identification results, that there is a risk in executing the target business, the server can send the risk identification results to the terminal device and stop executing the target business. If the server determines, based on the risk identification results, that there is no risk in executing the target business, the server can execute the target business and return the execution result to the terminal device.
[0133] The method described above for determining whether there is a risk in executing the target business is an optional and feasible method. In actual application scenarios, there can be a variety of different determination methods. Different determination methods can be selected according to different actual application scenarios. This specification does not specifically limit this method in the embodiments.
[0134] This specification provides a data processing method that acquires first graph structure data to be clipped. The first graph structure data is constructed based on human-computer interaction data that has a preset correspondence with the target user. The first graph structure data is encoded using a graph encoding network in a pre-trained graph sampling model to determine the node representation vector of each node in the first graph structure data. Based on the node representation vectors of each node in the first graph structure data, the construction time of the first graph structure data, and the time information between every two connected nodes in the first graph structure data, the edge representation vector of the edge between every two connected nodes in the first graph structure data is determined. Based on the sampling network in the pre-trained graph sampling model and the edge representation vectors of the edges between every two connected nodes in the first graph structure data, the sampling probability of the edge between every two connected nodes in the first graph structure data is determined. Based on the sampling probability of the edge between every two connected nodes in the first graph structure data, the first graph structure data is clipped to obtain clipped first graph structure data. In this way, the node representation vector of each node can be accurately determined through the graph coding network. By combining the node representation vector with time information (i.e., the construction time of the first graph structure data and the time information between every two connected nodes in the first graph structure data), the edge representation vector of the edge between every two connected nodes in the first graph structure data can be accurately determined. Then, based on the sampling probability of the edge between every two connected nodes, the first graph structure data is pruned to obtain pruned first graph structure data. 
This process can remove redundant graph information while retaining important graph structure data, thereby improving the accuracy of subsequent processing, based on the pruned first graph structure data, of big data problems such as text semantic similarity, similar product recommendation, or intelligent question answering. In addition, since the pruning method does not rely on static graph representations such as adjacency matrices, it can also prune time-series dynamic graphs (i.e., the first graph structure data can be a time-series dynamic graph), thus improving data processing efficiency.
[0135] Example 4
[0136] The above describes the data processing method provided in the embodiments of this specification. Based on the same idea, the embodiments of this specification also provide a data processing device, as shown in Figure 7.
[0137] The data processing device includes: a data acquisition module 701, a first determining module 702, a second determining module 703, a probability determination module 704, and a first pruning module 705, wherein:
[0138] The data acquisition module 701 is used to acquire the first graph structure data to be pruned, the first graph structure data being constructed based on human-computer interaction data that has a preset correspondence with the target user;
[0139] The first determining module 702 is used to encode the first graph structure data based on the graph coding network in the pre-trained graph sampling model, and determine the node representation vector of each node in the first graph structure data.
[0140] The second determining module 703 is used to determine the edge representation vector of the edge between every two connected nodes in the first graph structure data based on the node representation vector of each node in the first graph structure data, the construction time of the first graph structure data, and the time information between every two connected nodes in the first graph structure data.
[0141] The probability determination module 704 is used to determine the sampling probability of the edge between every two connected nodes in the first graph structure data based on the sampling network in the pre-trained graph sampling model and the edge representation vector of the edge between every two connected nodes in the first graph structure data.
[0142] The first pruning module 705 is used to prune the first graph structure data based on the sampling probability of the edge between every two connected nodes in the first graph structure data, so as to obtain the pruned first graph structure data.
[0143] In the embodiments described in this specification, the device further includes:
[0144] A request receiving module is used to receive information recommendation requests for the target user.
[0145] The classification module is used to perform node classification processing on the pruned first graph structure data, obtain node classification results, and determine information recommendation results based on the node classification results;
[0146] The feedback module is used to provide the information recommendation result in response to the information recommendation request.
[0147] In this embodiment of the specification, the data acquisition module 701 is used for:
[0148] Receive a risk detection request that triggers the execution of a target service for the target user;
[0149] In response to the risk detection request, the system acquires the human-computer interaction data that has a preset correspondence with the target user, as well as the target data required to execute the target business.
[0150] Based on the human-computer interaction data and the target data, construct the first graph structure data to be clipped;
[0151] The device further includes:
[0152] The risk assessment module is used to determine whether there is a risk in executing the target business based on the pruned first graph structure data.
[0153] In the embodiments described in this specification, the graph coding network is a temporal graph coding network.
[0154] In this embodiment of the specification, the second determining module 703 is used for:
[0155] Based on the preset dimensions and the time information between every two connected nodes in the first graph structure data, a discrete representation vector is constructed.
[0156] Based on the construction time of the first graph structure data and the time information between every two connected nodes in the first graph structure data, the time difference is determined, and the time difference is converted into a continuous representation vector in the time domain.
[0157] Based on the node representation vector of every two connected nodes in the first graph structure data, the discrete representation vector of the edge between every two connected nodes, and the continuous representation vector, the edge representation vector of the edge between every two connected nodes in the first graph structure data is determined.
[0158] In the embodiments of this specification, the probability determination module 704 is used for:
[0159] Based on the sampling network in the pre-trained graph sampling model and the edge representation vector of the edge between every two connected nodes in the first graph structure data, the first sampling probability of the edge between every two connected nodes in the first graph structure data is determined.
[0160] The first sampling probability of the edge between every two connected nodes in the first graph structure data is transformed to a preset distribution for sampling, thereby obtaining the sampling probability of the edge between every two connected nodes in the first graph structure data.
[0161] In the embodiments described in this specification, the first pruning module 705 is used for:
[0162] Based on a preset sampling probability threshold, the sampling probability of the edge between every two connected nodes in the first graph structure data is filtered, and the clipped first graph structure data is constructed based on the filtered nodes.
[0163] In the embodiments described in this specification, the first pruning module 705 is used for:
[0164] The node representation vectors of the nodes in the first graph structure data are aggregated to obtain the node representation vector of the target node.
[0165] A connection relationship is established between the target node and the center node of the first graph structure data, and the sampling probability of the edge between the target node and the center node of the first graph structure data is set to a preset sampling probability value, wherein the preset sampling probability value is greater than the sampling probability threshold.
[0166] Based on a preset sampling probability threshold, the sampling probability of the edge between every two connected nodes in the first graph structure data containing the target node is filtered, and the clipped first graph structure data is constructed based on the filtered nodes.
[0167] In the embodiments described in this specification, the device further includes:
[0168] The historical data acquisition module is used to acquire historical graph structure data;
[0169] The third determining module is used to input the historical graph structure data into the graph coding network in the graph sampling model, encode the historical graph structure data, and determine the node representation vector of each node in the historical graph structure data.
[0170] The fourth determining module is used to determine the edge representation vector of the edge between every two connected nodes in the historical graph structure data based on the node representation vector of each node in the historical graph structure data, the construction time of the historical graph structure data, and the time information between every two connected nodes in the historical graph structure data.
[0171] The fifth determining module is used to determine the sampling probability of the edge between every two connected nodes in the historical graph structure data based on the sampling network in the graph sampling model and the edge representation vector of the edge between every two connected nodes in the historical graph structure data.
[0172] The second pruning module is used to prune the historical graph structure data based on the sampling probability of the edge between every two connected nodes in the historical graph structure data, so as to obtain the pruned historical graph structure data.
[0173] The feature generation module is used to generate edge feature vectors between every two connected nodes in the clipped historical graph structure data based on the attribute information of the edges between every two connected nodes in the clipped historical graph structure data.
[0174] The update module is used to update the node representation vector of each node in the clipped historical graph structure data based on the edge feature vector and edge representation vector of the edge between every two connected nodes in the clipped historical graph structure data, and the attention coefficient between every two connected nodes in the clipped historical graph structure data, to obtain the updated node representation vector of each node in the clipped historical graph structure data.
[0175] The loss determination module is used to determine the loss value based on the node representation vector of the node in the historical graph structure data and the updated node representation vector of the node in the pruned historical graph structure data.
[0176] The training module is used to determine whether the graph sampling model has converged based on the loss value. If the graph sampling model has not converged, the graph encoding network and sampling network of the graph sampling model are trained again based on the historical graph structure data until the graph sampling model converges, thus obtaining the trained graph sampling model.
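The control flow of the training module can be sketched as a simple loop. This is a skeleton under stated assumptions, not the trained model itself: the callable `train_round` is a hypothetical stand-in for one pass of encoding, sampling, pruning, updating, and loss computation, and convergence is assumed to mean that the loss changes by less than `tol` between rounds, a criterion the specification leaves open.

```python
def train_until_converged(train_round, tol=1e-4, max_rounds=1000):
    """Repeat training rounds until the loss stops changing.

    Assumptions: `train_round()` runs one training pass on the historical
    graph structure data and returns the loss value; convergence means the
    absolute change in loss between consecutive rounds is below `tol`.
    """
    previous = None
    for round_index in range(1, max_rounds + 1):
        loss = train_round()
        if previous is not None and abs(previous - loss) < tol:
            return round_index, loss
        previous = loss
    return max_rounds, previous


# Toy stand-in for a real training pass: replays a fixed loss sequence.
losses = iter([1.0, 0.5, 0.25, 0.2499999])
rounds, final_loss = train_until_converged(lambda: next(losses))
```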
[0177] In the embodiments of this specification, the loss determination module is used for:
[0178] The loss value is determined based on the node representation vector of the center node of the historical graph structure data and the updated node representation vector of the center node of the pruned historical graph structure data.
[0179] In the embodiments of this specification, the loss determination module is used for:
[0180] Obtain the mutual information value between the node representation vector of the center node of the historical graph structure data and the updated node representation vector of the center node of the pruned historical graph structure data, and determine the mutual information value as the loss value.
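The specification does not say how the mutual information value is estimated or which sign convention the loss uses. As one hedged illustration, the sketch below uses the InfoNCE estimator, a common swapped-in technique that gives a lower bound on mutual information from a batch of paired vectors. The function names, the dot-product similarity, and the `temperature` parameter are assumptions.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce_loss(originals, updated, temperature=1.0):
    """InfoNCE loss over paired center-node vectors (lower is better).

    Assumption: originals[i] and updated[i] come from the same center node
    before and after pruning; other pairs in the batch act as negatives.
    """
    count = len(originals)
    total = 0.0
    for i in range(count):
        scores = [_dot(originals[i], updated[j]) / temperature for j in range(count)]
        top = max(scores)  # subtract the max for numerical stability
        log_denominator = top + math.log(sum(math.exp(s - top) for s in scores))
        total += log_denominator - scores[i]
    return total / count


def mutual_information_lower_bound(originals, updated, temperature=1.0):
    """InfoNCE bound: MI >= log(batch size) - InfoNCE loss."""
    return math.log(len(originals)) - info_nce_loss(originals, updated, temperature)


# Hypothetical batch of two center-node representation vectors.
batch = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(batch, batch)
shuffled = info_nce_loss(batch, batch[::-1])
```

With this estimator, minimizing the InfoNCE loss raises the estimated bound, so the updated center-node vector stays informative about the original one.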
[0181] This specification provides a data processing apparatus that acquires first graph structure data to be clipped. The first graph structure data is constructed based on human-computer interaction data that has a preset correspondence with a target user. The apparatus encodes the first graph structure data using a graph coding network in a pre-trained graph sampling model to determine the node representation vector of each node in the first graph structure data. Based on the node representation vectors of each node in the first graph structure data, the construction time of the first graph structure data, and the time information between every two connected nodes in the first graph structure data, the apparatus determines the edge representation vector of the edge between every two connected nodes in the first graph structure data. Based on the sampling network in the pre-trained graph sampling model and the edge representation vectors of the edges between every two connected nodes in the first graph structure data, the apparatus determines the sampling probability of the edge between every two connected nodes in the first graph structure data. Based on the sampling probability of the edge between every two connected nodes in the first graph structure data, the apparatus clips the first graph structure data to obtain clipped first graph structure data. In this way, the node representation vector of each node can be accurately determined through the graph coding network. By combining the node representation vector with time information (i.e., the construction time of the first graph structure data and the time information between every two connected nodes in the first graph structure data), the edge representation vector of the edge between every two connected nodes in the first graph structure data can be accurately determined. 
Then, based on the sampling probability of the edge between every two connected nodes, the first graph structure data is pruned to obtain pruned first graph structure data. This process can remove redundant graph information while retaining important graph structure data, thereby improving the accuracy of subsequent processing of big data problems such as text semantic similarity, similar product recommendation, or intelligent question answering systems based on the pruned first graph structure data. In addition, since the pruning method does not rely on static graph representations such as adjacency matrices, it can also prune time-series dynamic graphs (i.e., the first graph structure data can be a time-series dynamic graph), thus improving data processing efficiency.
[0182] Example 5
[0183] Following the same line of thought, embodiments of this specification also provide a data processing device, as shown in Figure 8.
[0184] Data processing devices can vary considerably due to differences in configuration or performance. They may include one or more processors 801 and memory 802, with memory 802 storing one or more application programs or data. Memory 802 can be temporary or persistent storage. The application programs stored in memory 802 may include one or more modules (not shown), each module including a series of computer-executable instructions for the data processing device. Furthermore, processor 801 may be configured to communicate with memory 802 and execute the series of computer-executable instructions stored in memory 802 on the data processing device. The data processing device may also include one or more power supplies 803, one or more wired or wireless network interfaces 804, one or more input / output interfaces 805, and one or more keyboards 806.
[0185] Specifically, in this embodiment, the data processing device includes a memory and one or more programs, wherein one or more programs are stored in the memory, and one or more programs may include one or more modules, and each module may include a series of computer-executable instructions for the data processing device, and is configured to be executed by one or more processors. The one or more programs include computer-executable instructions for performing the following:
[0186] Obtain the first graph structure data to be pruned, which is constructed based on human-computer interaction data that has a preset correspondence with the target user;
[0187] Based on the graph coding network in the pre-trained graph sampling model, the first graph structure data is encoded to determine the node representation vector of each node in the first graph structure data.
[0188] Based on the node representation vector of each node in the first graph structure data, the construction time of the first graph structure data, and the time information between every two connected nodes in the first graph structure data, the edge representation vector of the edge between every two connected nodes in the first graph structure data is determined.
[0189] Based on the sampling network in the pre-trained graph sampling model and the edge representation vector of the edge between every two connected nodes in the first graph structure data, the sampling probability of the edge between every two connected nodes in the first graph structure data is determined.
[0190] Based on the sampling probability of the edge between every two connected nodes in the first graph structure data, the first graph structure data is pruned to obtain the pruned first graph structure data.
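The five steps above can be sketched end to end in Python. The sketch treats the graph coding network, the edge-vector construction, and the sampling network as injected callables, since their internals are model-specific; the function name `prune_graph`, the dictionary layout of the graph, and the toy stand-ins are all hypothetical.

```python
def prune_graph(graph, encode_nodes, build_edge_vector, sampling_network, threshold):
    """Encode nodes, score each edge, and keep edges that reach the threshold.

    Assumptions: `graph` is a dict with "nodes", "edges" as (u, v, time)
    tuples, and "construction_time"; the three callables stand in for the
    graph coding network, edge-vector construction, and sampling network.
    """
    node_vecs = encode_nodes(graph)
    kept = []
    for u, v, edge_time in graph["edges"]:
        edge_vec = build_edge_vector(
            node_vecs[u], node_vecs[v], edge_time, graph["construction_time"]
        )
        if sampling_network(edge_vec) >= threshold:
            kept.append((u, v, edge_time))
    nodes = sorted({node for u, v, _ in kept for node in (u, v)})
    return {"nodes": nodes, "edges": kept,
            "construction_time": graph["construction_time"]}


# Toy stand-ins: constant node vectors and a score that favors recent edges.
graph = {"nodes": ["a", "b", "c"],
         "edges": [("a", "b", 9), ("a", "c", 1)],
         "construction_time": 10}
pruned = prune_graph(
    graph,
    encode_nodes=lambda g: {n: [1.0] for n in g["nodes"]},
    build_edge_vector=lambda hu, hv, t, built: hu + hv + [built - t],
    sampling_network=lambda vec: 1.0 / (1.0 + vec[-1]),
    threshold=0.3,
)
```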
[0191] Optionally, the method further includes:
[0192] Receive information recommendation requests for the target user;
[0193] The pruned first graph structure data is subjected to node classification processing to obtain node classification results, and information recommendation results are determined based on the node classification results;
[0194] Return the information recommendation result in response to the information recommendation request.
[0195] Optionally, obtaining the first graph structure data to be clipped includes:
[0196] Receive a risk detection request that triggers the execution of a target service for the target user;
[0197] In response to the risk detection request, acquire the human-computer interaction data that has a preset correspondence with the target user, as well as the target data required to execute the target service.
[0198] Based on the human-computer interaction data and the target data, construct the first graph structure data to be clipped;
[0199] The method further includes:
[0200] Based on the pruned first graph structure data, determine whether there is a risk in executing the target service.
[0201] Optionally, the graph coding network is a temporal graph coding network.
[0202] Optionally, determining the edge representation vector of the edge between every two connected nodes in the first graph structure data based on the node representation vector of each node in the first graph structure data, the construction time of the first graph structure data, and the time information between every two connected nodes in the first graph structure data includes:
[0203] Based on the preset dimensions and the time information between every two connected nodes in the first graph structure data, a discrete representation vector is constructed.
[0204] Based on the construction time of the first graph structure data and the time information between every two connected nodes in the first graph structure data, the time difference is determined, and the time difference is converted into a continuous representation vector in the time domain.
[0205] Based on the node representation vector of every two connected nodes in the first graph structure data, the discrete representation vector of the edge between every two connected nodes, and the continuous representation vector, the edge representation vector of the edge between every two connected nodes in the first graph structure data is determined.
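The three sub-steps above can be illustrated with a small Python sketch. Several choices here are assumptions not stated in the specification: the discrete vector is a one-hot vector over time buckets, the continuous vector is a cosine encoding of the time difference at several frequencies, and the final edge representation is a plain concatenation. All function names and the `bucket_seconds` parameter are hypothetical.

```python
import math


def discrete_time_vector(edge_time, dim, bucket_seconds=3600):
    """One-hot vector of preset dimension (assumed bucketing of the edge time)."""
    vec = [0.0] * dim
    vec[int(edge_time // bucket_seconds) % dim] = 1.0
    return vec


def continuous_time_vector(time_diff, dim):
    """Cosine encoding of the time difference at `dim` assumed frequencies."""
    return [math.cos(time_diff * 10.0 ** -k) for k in range(dim)]


def edge_representation(h_u, h_v, edge_time, construction_time, dim=4):
    """Concatenate node vectors with discrete and continuous time vectors."""
    time_diff = construction_time - edge_time
    return (h_u + h_v
            + discrete_time_vector(edge_time, dim)
            + continuous_time_vector(time_diff, dim))


# Hypothetical node vectors; the edge is created at the graph construction time.
edge_vec = edge_representation([0.1, 0.2], [0.3, 0.4],
                               edge_time=7200, construction_time=7200)
```

Because the time difference enters through a continuous function, two edges between the same nodes at different times receive different representations, which is what lets the sampling network treat them differently.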
[0206] Optionally, determining the sampling probability of an edge between any two connected nodes in the first graph structure data based on the sampling network in the pre-trained graph sampling model and the edge representation vector of the edge between any two connected nodes in the first graph structure data includes:
[0207] Based on the sampling network in the pre-trained graph sampling model and the edge representation vector of the edge between every two connected nodes in the first graph structure data, the first sampling probability of the edge between every two connected nodes in the first graph structure data is determined.
[0208] The first sampling probability of the edge between every two connected nodes in the first graph structure data is transformed to a preset distribution for sampling, thereby obtaining the sampling probability of the edge between every two connected nodes in the first graph structure data.
[0209] Optionally, the step of pruning the first graph structure data based on the sampling probability of the edge between every two connected nodes in the first graph structure data to obtain pruned first graph structure data includes:
[0210] Based on a preset sampling probability threshold, the sampling probability of the edge between every two connected nodes in the first graph structure data is filtered, and the clipped first graph structure data is constructed based on the filtered nodes.
[0211] Optionally, the step of filtering the sampling probability of the edges between every two connected nodes in the first graph structure data based on a preset sampling probability threshold, and constructing the clipped first graph structure data based on the filtered nodes, includes:
[0212] The node representation vectors of the nodes in the first graph structure data are aggregated to obtain the node representation vector of the target node.
[0213] A connection relationship is established between the target node and the center node of the first graph structure data, and the sampling probability of the edge between the target node and the center node of the first graph structure data is set to a preset sampling probability value, wherein the preset sampling probability value is greater than the sampling probability threshold.
[0214] Based on a preset sampling probability threshold, the sampling probability of the edge between every two connected nodes in the first graph structure data containing the target node is filtered, and the clipped first graph structure data is constructed based on the filtered nodes.
[0215] Optionally, before the graph encoding network in the pre-trained graph sampling model encodes the first graph structure data to obtain the node representation vector of each node in the first graph structure data, the method further includes:
[0216] Obtain historical graph structure data;
[0217] The historical graph structure data is input into the graph coding network in the graph sampling model to encode the historical graph structure data and determine the node representation vector of each node in the historical graph structure data.
[0218] Based on the node representation vector of each node in the historical graph structure data, the construction time of the historical graph structure data, and the time information between every two connected nodes in the historical graph structure data, the edge representation vector of the edge between every two connected nodes in the historical graph structure data is determined.
[0219] Based on the sampling network in the graph sampling model and the edge representation vector of the edge between every two connected nodes in the historical graph structure data, the sampling probability of the edge between every two connected nodes in the historical graph structure data is determined.
[0220] Based on the sampling probability of the edge between every two connected nodes in the historical graph structure data, the historical graph structure data is pruned to obtain pruned historical graph structure data.
[0221] Based on the attribute information of the edge between every two connected nodes in the clipped historical graph structure data, an edge feature vector is generated between every two connected nodes in the clipped historical graph structure data.
[0222] Based on the edge feature vector and edge representation vector of the edge between every two connected nodes in the clipped historical graph structure data, and the attention coefficient between every two connected nodes in the clipped historical graph structure data, the node representation vector of each node in the clipped historical graph structure data is updated to obtain the updated node representation vector of each node in the clipped historical graph structure data.
[0223] The loss value is determined based on the node representation vector of the node in the historical graph structure data and the updated node representation vector of the node in the pruned historical graph structure data.
[0224] Based on the loss value, it is determined whether the graph sampling model has converged. If the graph sampling model has not converged, the graph encoding network and sampling network of the graph sampling model are trained again based on the historical graph structure data until the graph sampling model converges, and the trained graph sampling model is obtained.
[0225] Optionally, determining the loss value based on the node representation vectors of the nodes in the historical graph structure data and the updated node representation vectors of the nodes in the pruned historical graph structure data includes:
[0226] The loss value is determined based on the node representation vector of the center node of the historical graph structure data and the updated node representation vector of the center node of the pruned historical graph structure data.
[0227] Optionally, determining the loss value based on the node representation vector of the center node of the historical graph structure data and the updated node representation vector of the center node of the pruned historical graph structure data includes:
[0228] Obtain the mutual information value between the node representation vector of the center node of the historical graph structure data and the updated node representation vector of the center node of the pruned historical graph structure data, and determine the mutual information value as the loss value.
[0229] This specification provides a data processing device that acquires first graph structure data to be clipped. The first graph structure data is constructed based on human-computer interaction data that has a preset correspondence with a target user. The device encodes the first graph structure data using a graph coding network in a pre-trained graph sampling model to determine the node representation vector of each node in the first graph structure data. Based on the node representation vectors of each node in the first graph structure data, the construction time of the first graph structure data, and the time information between every two connected nodes in the first graph structure data, the device determines the edge representation vector of the edge between every two connected nodes in the first graph structure data. Based on the sampling network in the pre-trained graph sampling model and the edge representation vectors of the edges between every two connected nodes in the first graph structure data, the device determines the sampling probability of the edge between every two connected nodes in the first graph structure data. Based on the sampling probability of the edge between every two connected nodes in the first graph structure data, the device clips the first graph structure data to obtain clipped first graph structure data. In this way, the node representation vector of each node can be accurately determined through the graph coding network. By combining the node representation vector with time information (i.e., the construction time of the first graph structure data and the time information between every two connected nodes in the first graph structure data), the edge representation vector of the edge between every two connected nodes in the first graph structure data can be accurately determined. 
Then, based on the sampling probability of the edge between every two connected nodes, the first graph structure data is pruned to obtain pruned first graph structure data. This process can remove redundant graph information while retaining important graph structure data, thereby improving the accuracy of subsequent processing of big data problems such as text semantic similarity, similar product recommendation, or intelligent question answering systems based on the pruned first graph structure data. In addition, since the pruning method does not rely on static graph representations such as adjacency matrices, it can also prune time-series dynamic graphs (i.e., the first graph structure data can be a time-series dynamic graph), thus improving data processing efficiency.
[0230] Example 6
[0231] This specification also provides a computer-readable storage medium storing a computer program. When executed by a processor, this computer program implements the various processes of the above-described data processing method embodiments and achieves the same technical effects. To avoid repetition, it will not be described again here. The computer-readable storage medium may include, for example, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
[0232] This specification provides a computer-readable storage medium for acquiring first graph structure data to be clipped. The first graph structure data is constructed based on human-computer interaction data with a preset correspondence with a target user. The first graph structure data is encoded using a graph coding network in a pre-trained graph sampling model to determine the node representation vector of each node in the first graph structure data. Based on the node representation vectors of each node in the first graph structure data, the construction time of the first graph structure data, and the time information between every two connected nodes in the first graph structure data, the edge representation vector of the edge between every two connected nodes in the first graph structure data is determined. Based on the sampling network in the pre-trained graph sampling model and the edge representation vectors of the edges between every two connected nodes in the first graph structure data, the sampling probability of the edge between every two connected nodes in the first graph structure data is determined. Based on the sampling probability of the edge between every two connected nodes in the first graph structure data, the first graph structure data is clipped to obtain clipped first graph structure data. In this way, the node representation vector of each node can be accurately determined through the graph coding network. By combining the node representation vector with time information (i.e., the construction time of the first graph structure data and the time information between every two connected nodes in the first graph structure data), the edge representation vector of the edge between every two connected nodes in the first graph structure data can be accurately determined. Then, based on the sampling probability of the edge between every two connected nodes, the first graph structure data is pruned to obtain pruned first graph structure data. 
This process can remove redundant graph information while retaining important graph structure data, thereby improving the accuracy of subsequent processing of big data problems such as text semantic similarity, similar product recommendation, or intelligent question answering systems based on the pruned first graph structure data. In addition, since the pruning method does not rely on static graph representations such as adjacency matrices, it can also prune time-series dynamic graphs (i.e., the first graph structure data can be a time-series dynamic graph), thus improving data processing efficiency.
[0233] The foregoing has described specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.
[0234] In the 1990s, improvements to a technology could be clearly distinguished as either hardware improvements (e.g., improvements to the circuit structure of diodes, transistors, switches, etc.) or software improvements (improvements to the methodology). However, with technological advancements, many methodological improvements today can be considered direct improvements to the hardware circuit structure. Designers almost always obtain the corresponding hardware circuit structure by programming the improved methodology into the hardware circuit. Therefore, it cannot be said that a methodological improvement cannot be implemented using a hardware physical module. For example, a Programmable Logic Device (PLD) (e.g., a Field Programmable Gate Array (FPGA)) is such an integrated circuit whose logic function is determined by the user programming the device. Designers can program a digital system themselves to "integrate" it onto a PLD, without needing chip manufacturers to design and manufacture dedicated integrated circuit chips. Furthermore, nowadays, instead of manually manufacturing integrated circuit chips, this programming is mostly implemented using "logic compiler" software. Similar to the software compiler used in program development, the original code before compilation must be written in a specific programming language, called a Hardware Description Language (HDL). There are many HDLs, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language). Currently, the most commonly used are VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog. 
Those skilled in the art should understand that by simply performing some logic programming on the method flow using one of these hardware description languages and programming it into an integrated circuit, the hardware circuit implementing the logical method flow can be easily obtained.
[0235] The controller can be implemented in any suitable manner. For example, it can take the form of a microprocessor or processor and a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, application-specific integrated circuits (ASICs), programmable logic controllers, and embedded microcontrollers. Examples of controllers include, but are not limited to, the following microcontrollers: ARC625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller can also be implemented as part of the control logic of the memory. Those skilled in the art will also recognize that, in addition to implementing the controller in purely computer-readable program code form, the same functionality can be achieved by logically programming the method steps to make the controller take the form of logic gates, switches, ASICs, programmable logic controllers, and embedded microcontrollers. Therefore, such a controller can be considered a hardware component, and the means included therein for implementing various functions can also be considered as structures within the hardware component. Alternatively, the means for implementing various functions can be considered as both software modules implementing the method and structures within the hardware component.
[0236] The systems, devices, modules, or units described in the above embodiments can be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer. Specifically, a computer can be, for example, a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or any combination of these devices.
[0237] For ease of description, the above apparatus is described by dividing it into various functional units. Of course, when implementing one or more embodiments of this specification, the functions of each unit can be implemented in one or more pieces of software and / or hardware.
[0238] Those skilled in the art will understand that the embodiments of this specification can be provided as methods, systems, or computer program products. Therefore, one or more embodiments of this specification may take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0239] The embodiments described herein are illustrated with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this specification. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flowchart illustrations and / or one or more block diagrams.
[0240] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flowcharts and / or one or more block diagrams.
[0241] These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions, which execute on the computer or other programmable apparatus, provide steps for implementing the functions specified in one or more flowcharts and / or one or more block diagrams.
[0242] In a typical configuration, a computing device includes one or more processors (CPU), input / output interfaces, network interfaces, and memory.
[0243] Memory may include volatile memory in computer-readable media, such as random access memory (RAM), and / or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.
[0244] Computer-readable media include both permanent and non-permanent, removable and non-removable media that can store information using any method or technology. Information can be computer-readable instructions, data structures, modules of programs, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile disc (DVD) or other optical storage, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
[0245] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0246] Those skilled in the art will understand that the embodiments of this specification can be provided as methods, systems, or computer program products. Therefore, one or more embodiments of this specification may take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of this specification may take the form of a computer program product implemented on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0247] One or more embodiments of this specification can be described in the general context of computer-executable instructions, such as program modules, that are executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform a particular task or implement a particular abstract data type. One or more embodiments of this specification can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.
[0248] The various embodiments in this specification are described in a progressive manner. For parts that are identical or similar across embodiments, the embodiments may be read with reference to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are substantially similar to the method embodiments, so they are described only briefly; for the relevant parts, refer to the description of the method embodiments.
[0249] The above description is merely an embodiment of this specification and is not intended to limit this specification. Various modifications and variations can be made to this specification by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this specification should be included within the scope of the claims of this specification.
Claims
1. A data processing method, comprising: obtaining first graph structure data to be pruned, the first graph structure data being constructed based on human-computer interaction data that has a preset correspondence with a target user; encoding the first graph structure data based on a graph encoding network in a pre-trained graph sampling model to determine a node representation vector of each node in the first graph structure data; determining an edge representation vector of the edge between every two connected nodes in the first graph structure data based on the node representation vector of each node in the first graph structure data, the construction time of the first graph structure data, and time information between every two connected nodes in the first graph structure data, wherein the time information is the time at which the interaction behavior between the two connected nodes occurred; determining a sampling probability of the edge between every two connected nodes in the first graph structure data based on a sampling network in the pre-trained graph sampling model and the edge representation vector of the edge between every two connected nodes in the first graph structure data; and pruning the first graph structure data based on the sampling probability of the edge between every two connected nodes in the first graph structure data to obtain pruned first graph structure data.
2. The method according to claim 1, further comprising: receiving an information recommendation request for the target user; performing node classification on the pruned first graph structure data to obtain a node classification result, and determining an information recommendation result based on the node classification result; and returning the information recommendation result in response to the information recommendation request.
3. The method according to claim 1, wherein obtaining the first graph structure data to be pruned comprises: receiving a risk detection request concerning the target user triggering execution of a target service; in response to the risk detection request, acquiring the human-computer interaction data that has a preset correspondence with the target user and the target data required to execute the target service; and constructing the first graph structure data to be pruned based on the human-computer interaction data and the target data; the method further comprising: determining, based on the pruned first graph structure data, whether executing the target service carries a risk.
4. The method according to claim 1, wherein the graph encoding network is a temporal graph encoding network.
5. The method according to claim 4, wherein determining the edge representation vector of the edge between every two connected nodes in the first graph structure data based on the node representation vector of each node in the first graph structure data, the construction time of the first graph structure data, and the time information between every two connected nodes in the first graph structure data comprises: constructing a discrete representation vector based on a preset dimension and the time information between every two connected nodes in the first graph structure data; determining a time difference based on the construction time of the first graph structure data and the time information between every two connected nodes in the first graph structure data, and converting the time difference into a continuous representation vector in the time domain; and determining the edge representation vector of the edge between every two connected nodes in the first graph structure data based on the node representation vectors of the two connected nodes, the discrete representation vector of the edge between them, and the continuous representation vector.
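As a non-limiting illustration of the steps recited in claim 5, the following Python sketch builds an edge representation vector from two node representation vectors and the two time-derived vectors. The claim does not fix the discretization, the continuous encoding, or the way the parts are combined; the one-hot time bucket, the cosine encoding, the concatenation, and all names (`discrete_time_vector`, `continuous_time_vector`, `edge_representation`, `bucket_seconds`) are assumptions made only for this example.

```python
import math


def discrete_time_vector(timestamp, dim, bucket_seconds=3600):
    """One-hot vector of preset dimension `dim` for the interaction time.

    Assumption: the time is bucketed by `bucket_seconds` and wrapped modulo `dim`.
    """
    vec = [0.0] * dim
    vec[int(timestamp // bucket_seconds) % dim] = 1.0
    return vec


def continuous_time_vector(delta_t, dim):
    """Cosine features of the time difference at geometrically spaced scales.

    Assumption: a fixed (non-learned) encoding stands in for the conversion
    of the time difference into a continuous time-domain vector.
    """
    return [math.cos(delta_t / (10.0 ** k)) for k in range(dim)]


def edge_representation(h_u, h_v, edge_time, graph_time, dim=4):
    """Concatenate both node vectors with the discrete and continuous time vectors."""
    delta_t = graph_time - edge_time  # graph construction time minus interaction time
    return (
        list(h_u)
        + list(h_v)
        + discrete_time_vector(edge_time, dim)
        + continuous_time_vector(delta_t, dim)
    )


edge = edge_representation([0.1, 0.2], [0.3, 0.4], edge_time=7200, graph_time=7200)
print(len(edge))  # two 2-d node vectors plus two 4-d time vectors
```

In a trained model the combination would more likely be a learned projection than a plain concatenation; the sketch only shows how the three inputs named in the claim can feed one vector.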
6. The method according to claim 5, wherein determining the sampling probability of the edge between every two connected nodes in the first graph structure data based on the sampling network in the pre-trained graph sampling model and the edge representation vector of the edge between every two connected nodes in the first graph structure data comprises: determining a first sampling probability of the edge between every two connected nodes in the first graph structure data based on the sampling network in the pre-trained graph sampling model and the edge representation vector of the edge between every two connected nodes in the first graph structure data; and transforming the first sampling probability of the edge between every two connected nodes in the first graph structure data to a preset distribution and sampling from it, thereby obtaining the sampling probability of the edge between every two connected nodes in the first graph structure data.
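As a non-limiting illustration of transforming a first sampling probability to a preset distribution and sampling from it, the sketch below uses a relaxed Bernoulli (binary Concrete) distribution. The claim does not name the preset distribution, so this choice, the `temperature` parameter, and the function names are assumptions for the example only.

```python
import math
import random


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def relaxed_bernoulli(p, temperature=0.5, rng=random):
    """Draw a value in [0, 1] from a relaxed Bernoulli with parameter `p`.

    Logistic noise is added to the logit of `p`, and the sum is squashed
    by a temperature-scaled sigmoid.
    """
    eps = 1e-9
    p = min(max(p, eps), 1.0 - eps)
    u = min(max(rng.random(), eps), 1.0 - eps)
    logit = math.log(p) - math.log(1.0 - p)
    noise = math.log(u) - math.log(1.0 - u)
    return _sigmoid((logit + noise) / temperature)


rng = random.Random(0)
first_probs = {("a", "b"): 0.9, ("b", "c"): 0.2}
sampled = {edge: relaxed_bernoulli(p, rng=rng) for edge, p in first_probs.items()}
```

A relaxation of this kind keeps the sampled value differentiable with respect to the first sampling probability, which is one reason it is often used when a sampling network is trained end to end.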
7. The method according to claim 6, wherein pruning the first graph structure data based on the sampling probability of the edge between every two connected nodes in the first graph structure data to obtain the pruned first graph structure data comprises: filtering the sampling probabilities of the edges between every two connected nodes in the first graph structure data based on a preset sampling probability threshold, and constructing the pruned first graph structure data based on the nodes obtained from the filtering.
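As a non-limiting illustration of the threshold filtering in claim 7, the sketch below keeps the edges whose sampling probability reaches the threshold and rebuilds the graph from the endpoints of those edges. Whether the comparison is inclusive, and the dictionary-based graph layout, are assumptions for the example; `prune_by_threshold` is a hypothetical name.

```python
def prune_by_threshold(edge_probs, threshold):
    """Return (kept_nodes, kept_edges) for edges with probability >= threshold.

    `edge_probs` maps an edge, given as a (node, node) tuple, to its
    sampling probability.
    """
    kept_edges = {edge: p for edge, p in edge_probs.items() if p >= threshold}
    kept_nodes = {node for edge in kept_edges for node in edge}
    return kept_nodes, kept_edges


edge_probs = {("a", "b"): 0.9, ("b", "c"): 0.2, ("a", "d"): 0.6}
nodes, edges = prune_by_threshold(edge_probs, threshold=0.5)
print(sorted(nodes))  # ['a', 'b', 'd']
```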
8. The method according to claim 7, wherein filtering the sampling probabilities of the edges between every two connected nodes in the first graph structure data based on the preset sampling probability threshold, and constructing the pruned first graph structure data based on the nodes obtained from the filtering, comprises: aggregating the node representation vectors of the nodes in the first graph structure data to obtain a node representation vector of a target node; establishing a connection between the target node and the center node of the first graph structure data, and setting the sampling probability of the edge between the target node and the center node to a preset sampling probability value, wherein the preset sampling probability value is greater than the sampling probability threshold; and filtering, based on the preset sampling probability threshold, the sampling probabilities of the edges between every two connected nodes in the first graph structure data that contains the target node, and constructing the pruned first graph structure data based on the nodes obtained from the filtering.
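As a non-limiting illustration of claim 8, the sketch below aggregates all node representation vectors into one target node, attaches it to the center node with a preset probability above the threshold, and then applies the threshold filter. The mean aggregation, the inclusive comparison, and the names `prune_with_summary_node` and `__summary__` are assumptions for the example; the claim does not specify the aggregation function.

```python
def prune_with_summary_node(node_vecs, edge_probs, center, threshold,
                            preset_prob=1.0, summary="__summary__"):
    """Add an aggregated target node, then prune edges below `threshold`.

    Returns (kept_node_vectors, kept_edges).
    """
    if preset_prob <= threshold:
        raise ValueError("preset_prob must be greater than threshold")

    # Aggregate every node vector into one target-node vector (mean is assumed).
    vectors = list(node_vecs.values())
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

    all_vecs = dict(node_vecs)
    all_vecs[summary] = mean

    # Connect the target node to the center node with the preset probability,
    # so this edge always survives the filter.
    all_probs = dict(edge_probs)
    all_probs[(summary, center)] = preset_prob

    kept_edges = {e: p for e, p in all_probs.items() if p >= threshold}
    kept_nodes = {n for e in kept_edges for n in e}
    return {n: all_vecs[n] for n in kept_nodes}, kept_edges


node_vecs = {"c": [1.0, 0.0], "x": [0.0, 1.0], "y": [2.0, 2.0]}
edge_probs = {("c", "x"): 0.8, ("c", "y"): 0.1}
vecs, edges = prune_with_summary_node(node_vecs, edge_probs, center="c", threshold=0.5)
print(vecs["__summary__"])  # [1.0, 1.0]
```

One reading of this design is that the aggregated node preserves a coarse summary of the neighbors that are pruned away, since its edge to the center node is guaranteed to be retained.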
9. The method according to claim 1, further comprising, before encoding the first graph structure data based on the graph encoding network in the pre-trained graph sampling model to determine the node representation vector of each node in the first graph structure data: obtaining historical graph structure data; inputting the historical graph structure data into the graph encoding network in the graph sampling model to encode the historical graph structure data and determine a node representation vector of each node in the historical graph structure data; determining an edge representation vector of the edge between every two connected nodes in the historical graph structure data based on the node representation vector of each node in the historical graph structure data, the construction time of the historical graph structure data, and the time information between every two connected nodes in the historical graph structure data; determining a sampling probability of the edge between every two connected nodes in the historical graph structure data based on the sampling network in the graph sampling model and the edge representation vector of the edge between every two connected nodes in the historical graph structure data; pruning the historical graph structure data based on the sampling probability of the edge between every two connected nodes in the historical graph structure data to obtain pruned historical graph structure data; generating an edge feature vector of the edge between every two connected nodes in the pruned historical graph structure data based on the attribute information of that edge; updating the node representation vector of each node in the pruned historical graph structure data based on the edge feature vector and the edge representation vector of the edge between every two connected nodes in the pruned historical graph structure data and the attention coefficient between every two connected nodes in the pruned historical graph structure data, to obtain an updated node representation vector of each node in the pruned historical graph structure data; determining a loss value based on the node representation vector of a node in the historical graph structure data and the updated node representation vector of the node in the pruned historical graph structure data; determining, based on the loss value, whether the graph sampling model has converged; and, if the graph sampling model has not converged, continuing to train the graph encoding network and the sampling network of the graph sampling model based on the historical graph structure data until the graph sampling model converges, to obtain the trained graph sampling model.
10. The method according to claim 9, wherein determining the loss value based on the node representation vector of the node in the historical graph structure data and the updated node representation vector of the node in the pruned historical graph structure data comprises: determining the loss value based on the node representation vector of the center node of the historical graph structure data and the updated node representation vector of the center node of the pruned historical graph structure data.
11. The method according to claim 10, wherein determining the loss value based on the node representation vector of the center node of the historical graph structure data and the updated node representation vector of the center node of the pruned historical graph structure data comprises: obtaining a mutual information value between the node representation vector of the center node of the historical graph structure data and the updated node representation vector of the center node of the pruned historical graph structure data, and determining the mutual information value as the loss value.
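As a non-limiting illustration of the mutual information value in claim 11, the sketch below computes the InfoNCE lower bound on mutual information over a batch of center-node vector pairs. The claim does not state how the mutual information is estimated or with what sign it enters the loss; the InfoNCE estimator, the dot-product score, the batch formulation, and the name `infonce_mi_lower_bound` are assumptions for the example.

```python
import math


def infonce_mi_lower_bound(original_vecs, pruned_vecs, temperature=1.0):
    """InfoNCE lower bound on the mutual information between paired vectors.

    `original_vecs[i]` is the center-node vector of historical graph i and
    `pruned_vecs[i]` is the updated center-node vector of its pruned version.
    The other entries in the batch act as negative examples.
    """
    n = len(original_vecs)
    total = 0.0
    for i in range(n):
        scores = [
            sum(a * b for a, b in zip(original_vecs[i], pruned_vecs[j])) / temperature
            for j in range(n)
        ]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(scores)
        log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
        total += scores[i] - log_sum
    return math.log(n) + total / n


originals = [[1.0, 0.0], [0.0, 1.0]]
aligned = infonce_mi_lower_bound(originals, [[1.0, 0.0], [0.0, 1.0]])
shuffled = infonce_mi_lower_bound(originals, [[0.0, 1.0], [1.0, 0.0]])
print(aligned > shuffled)  # True
```

If training is meant to preserve the information of the center node after pruning, an optimizer that minimizes a loss would typically use the negative of such an estimate; the sketch leaves that sign convention open, as the claim does.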
12. A data processing apparatus, comprising: a data acquisition module configured to acquire first graph structure data to be pruned, the first graph structure data being constructed based on human-computer interaction data that has a preset correspondence with a target user; a first determining module configured to encode the first graph structure data based on a graph encoding network in a pre-trained graph sampling model and determine a node representation vector of each node in the first graph structure data; a second determining module configured to determine an edge representation vector of the edge between every two connected nodes in the first graph structure data based on the node representation vector of each node in the first graph structure data, the construction time of the first graph structure data, and time information between every two connected nodes in the first graph structure data, wherein the time information is the time at which the interaction behavior between the two connected nodes occurred; a probability determination module configured to determine a sampling probability of the edge between every two connected nodes in the first graph structure data based on a sampling network in the pre-trained graph sampling model and the edge representation vector of the edge between every two connected nodes in the first graph structure data; and a first pruning module configured to prune the first graph structure data based on the sampling probability of the edge between every two connected nodes in the first graph structure data to obtain pruned first graph structure data.
13. A data processing device, comprising: a processor; and a memory configured to store computer-executable instructions which, when executed, cause the processor to: obtain first graph structure data to be pruned, the first graph structure data being constructed based on human-computer interaction data that has a preset correspondence with a target user; encode the first graph structure data based on a graph encoding network in a pre-trained graph sampling model to determine a node representation vector of each node in the first graph structure data; determine an edge representation vector of the edge between every two connected nodes in the first graph structure data based on the node representation vector of each node in the first graph structure data, the construction time of the first graph structure data, and time information between every two connected nodes in the first graph structure data, wherein the time information is the time at which the interaction behavior between the two connected nodes occurred; determine a sampling probability of the edge between every two connected nodes in the first graph structure data based on a sampling network in the pre-trained graph sampling model and the edge representation vector of the edge between every two connected nodes in the first graph structure data; and prune the first graph structure data based on the sampling probability of the edge between every two connected nodes in the first graph structure data to obtain pruned first graph structure data.
14. A storage medium storing computer-executable instructions which, when executed by a processor, implement the following process: obtaining first graph structure data to be pruned, the first graph structure data being constructed based on human-computer interaction data that has a preset correspondence with a target user; encoding the first graph structure data based on a graph encoding network in a pre-trained graph sampling model to determine a node representation vector of each node in the first graph structure data; determining an edge representation vector of the edge between every two connected nodes in the first graph structure data based on the node representation vector of each node in the first graph structure data, the construction time of the first graph structure data, and time information between every two connected nodes in the first graph structure data, wherein the time information is the time at which the interaction behavior between the two connected nodes occurred; determining a sampling probability of the edge between every two connected nodes in the first graph structure data based on a sampling network in the pre-trained graph sampling model and the edge representation vector of the edge between every two connected nodes in the first graph structure data; and pruning the first graph structure data based on the sampling probability of the edge between every two connected nodes in the first graph structure data to obtain pruned first graph structure data.