A method for obtaining embedding vectors of node attributes in a time-sensitive bipartite network
By identifying higher-order dependencies between homogeneous and heterogeneous nodes, a joint optimization model is established to generate embedding vectors for time-sensitive bipartite networks. This addresses the shortcomings of existing bipartite network graph embedding methods and improves the accuracy of link prediction and recommendation systems.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA UNIV OF GEOSCIENCES (BEIJING)
- Filing Date
- 2023-12-11
- Publication Date
- 2026-06-30
AI Technical Summary
Existing bipartite network graph embedding methods do not consider dependencies, so the embedding vectors they produce for time-sensitive bipartite networks cannot accurately represent the underlying phenomena in the network, which degrades the performance of link prediction and recommendation systems.
By identifying higher-order dependencies between homogeneous and heterogeneous nodes, a joint optimization model is established to generate embedding vectors for a time-sensitive bipartite network. Random walks are then used to generate a corpus, and the node embedding vectors are optimized iteratively.
The method captures high-order dependencies in time-sensitive bipartite networks, improves the accuracy and performance of link prediction and recommendation systems, and enhances the representation capability of node attributes.
Figure CN117609564B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer application technology, and in particular to a method for obtaining the embedding vector of node attributes in a time-sensitive bipartite network.
Background Technology
[0002] Bipartite networks are a ubiquitous data structure for modeling relationships between two types of entities. They are widely applied in fields such as recommender systems, search engines, and question-answering systems. In search engines, for example, queries and web pages form a bipartite network whose edges can indicate user click behavior and provide valuable association signals. In recommender systems, users and items form a bipartite network whose edges can encode user rating behavior and contain rich collaborative filtering patterns. Although the bipartite network is a typical data structure for modeling nodes in two different partitions, embedding techniques for it have received little attention. Among bipartite network embedding methods, the Bipartite Network Embedding (BINE) model is a hybrid embedding method that extends and improves first-order static network embedding methods. It models explicit and implicit relationships in a bipartite graph, establishes a joint optimization framework, performs random walks, and trains to obtain the final embedding vectors. Explicit relationships in the BINE model are the relationships between directly connected nodes. A crucial characteristic of bipartite networks is that edges exist only between nodes of different types, so explicit relationships exist only between nodes of different types. The BINE model represents explicit relationships between nodes by combining the vector inner product with the sigmoid function and fits them to local similarity. Implicit relationships, on the other hand, are the indirect relationships between nodes of the same type that have no direct connection. To model them, a random walk first transforms the bipartite network into two corpora of node sequences, one per node type; embeddings are then learned from the corpora, which encode higher-order relationships between vertices.
Building upon BINE, the BIANE method was proposed as an end-to-end model for learning representations in bipartite attributed networks. It jointly models attribute proximity and structural proximity through a novel latent correlation training method and proposes a dynamic positive sampling technique to overcome the efficiency limitations of existing dynamic negative sampling techniques.
[0003] Currently, research on network representation learning mainly focuses on first-order networks, where nodes only have simple pairwise connections. However, nodes may have higher-order interactions beyond the first order, meaning a node's historical paths influence its likelihood of connecting with other nodes. This higher-order approach performs better in common network mining tasks such as link prediction, network reconstruction, and community detection. Recent research suggests that higher-order network representation methods can more accurately capture important information in the network. In traditional first-order networks, each node represents a state, while in higher-order networks, a node represents not only the current state but also previous states (historical paths), thus capturing implicit higher-order dependencies in the original data. Most higher-order network embeddings are based on matrix factorization methods; for example, HONEM is a matrix factorization method that considers higher-order dependencies. It demonstrates that representation learning algorithms developed for first-order networks are insufficient for higher-order networks, even if they capture higher-order approximations. HONEM is a method for preserving network embeddings based on non-negative matrix factorization, maintaining local structure and higher-order features. It first extracts higher-order dependencies from the original data, then generates a higher-order neighborhood matrix based on the extracted dependencies, and finally obtains the node embeddings using the truncated SVD method. It performs exceptionally well in tasks such as node classification, network reconstruction, link prediction, and visualization. Another high-order embedding algorithm based on matrix factorization is GraRep, which integrates global graph structure information into the learning process. Besides HONEM, there are many other methods for embedding high-order networks. 
Simplicial complexes and hypergraphs are commonly used representations of high-order networks.
[0004] Existing graph embedding methods have several drawbacks, including the lack of a bipartite network graph embedding method that considers dependencies. However, high-order dependencies in bipartite networks are significant and cannot be ignored. On the one hand, real-world bipartite networks are often time-sensitive; for example, in shopping systems, user purchasing preferences and item popularity typically change over time. On the other hand, representation learning methods based on first-order networks may fail to incorporate non-Markovian high-order dependencies. Consequently, the generated embedding vectors lose important information and cannot accurately represent the underlying phenomena in the network, leading to poor performance in various inductive or transductive learning tasks. Therefore, a high-order bipartite network graph embedding method that considers dependencies is needed.
Summary of the Invention
[0005] The embodiments of the present invention provide a method for obtaining the embedding vector of node attributes in a time-sensitive bipartite network, so as to effectively obtain the node attributes in the time-sensitive bipartite network.
[0006] To achieve the above objectives, the present invention adopts the following technical solution.
[0007] A method for obtaining the embedding vectors of node attributes in a time-sensitive bipartite network includes:
[0008] Different symbols are used to represent different types of nodes in the time-sensitive bipartite network. The dependency relationships between homogeneous nodes are identified and a high-order dependency matrix of homogeneous nodes is established; the dependency relationships between heterogeneous nodes are identified and high-order dependency paths between heterogeneous nodes are generated;
[0009] Establish a time-sensitive bipartite network graph structure based on the high-order dependency matrix of the heterogeneous nodes;
[0010] A random walk operation is performed on the time-sensitive bipartite network graph structure to generate a corpus of homogeneous nodes and a corpus of heterogeneous nodes, respectively. The corpus of each node type is a set of random walk paths in the homogeneous unweighted network of that node type;
[0011] Model the implicit relationships of homogeneous nodes based on the corpus of homogeneous nodes, and model the explicit relationships of heterogeneous nodes based on the higher-order dependency paths of heterogeneous nodes. Establish a joint optimization model by combining the implicit relationships of homogeneous nodes and the explicit relationships of heterogeneous nodes;
[0012] The embedding vectors of nodes in the time-sensitive bipartite network are obtained by iteratively optimizing the embedding vectors and context vectors of nodes according to the joint optimization model.
[0013] Preferably, the process of representing different types of nodes in the time-sensitive bipartite network with different symbols, identifying dependencies between homogeneous nodes, and establishing a high-order dependency matrix for homogeneous nodes includes:
[0014] Read the raw data of the time-sensitive bipartite network, extract the higher-order dependencies of homogeneous nodes, and use the symbols U and V to represent the two types of nodes respectively. Nodes of the same type are considered homogeneous nodes, and nodes of different types are considered heterogeneous nodes. For all V-type and U-type nodes, find all their edges and sort them by timestamp to form historical paths. If a V-type node $v$ has $k$ edges, then the historical path of node $v$ is $s_v = [u_k, u_{k-1}, \ldots, u_2, u_1]$. The set of historical paths for V-type nodes is $S_V$, and the set of historical paths for U-type nodes is $S_U$;
[0015] For a path $s_v = [u_k, u_{k-1}, \ldots, u_2, u_1]$, if all nodes of a path $s$ exist in $s_v$ and the time order of all nodes in $s$ is consistent with that in $s_v$, then $s$ is a subpath of $s_v$. The probability distribution $d_{s_v}$ of the path $s_v = [u_k, u_{k-1}, \ldots, u_2, u_1]$ is defined as the probability that the next step of the path $[u_k, u_{k-1}, \ldots, u_2]$ is $u_1$. In $S_V$ and $S_U$, find all sub-paths containing only two nodes and calculate the probability distribution $d$ of each sub-path;
[0016] For all the found paired sub-paths, expand them. A paired sub-path is a sub-path with only two nodes. Both nodes must be nodes in the parent path and their order must be the same as in the parent path. For all expanded paths, repeat the operation of finding paired sub-paths and expanding all paired sub-paths until no new expanded paths can be found, and obtain the high-order dependency matrix of homogeneous nodes.
[0017] Preferably, the step of identifying dependencies between heterogeneous nodes and generating higher-order dependency paths between heterogeneous nodes includes:
[0018] In the original data of the time-sensitive bipartite network, find all paired nodes and record two paths for each pair. For example, if node $u$ and node $v$ have an edge at time $t$, record the two paths $[u, v]$ and $[v, u]$. The set of these paths is $S_H$;
[0019] Expand all found paired paths. For a path $s_H = [u_1, v_1, v_2, u_2, v_3]$, if there exists a path $s_{H'} = [u_{new}, u_1, v_1, v_2, u_2, v_3]$ or $s_{H'} = [v_{new}, u_1, v_1, v_2, u_2, v_3]$ that satisfies $d_{KL}(d_{s_{H'}} \| d_{s_H}) > \delta$, then $s_{H'}$ is an extended path of $s_H$;
[0020] For all extended paths, repeat the process of finding paired sub-paths and extending all paired sub-paths until no new extended paths can be found, thus obtaining the higher-order dependency paths of heterogeneous nodes.
[0021] Preferably, performing the random walk operation on the time-sensitive bipartite network graph structure to generate a corpus of homogeneous nodes and a corpus of heterogeneous nodes respectively, where the corpus of each node type is a set of random walk paths in the homogeneous unweighted network of that node type, includes:
[0022] Read the original data of the time-sensitive bipartite network and generate two homogeneous unweighted networks, one consisting of all U-type nodes and the other of all V-type nodes. In each homogeneous unweighted network, add an edge between any two nodes that share a common neighbor in the time-sensitive bipartite network.
[0023] Read the high-order dependency matrix of homogeneous nodes, store homogeneous node pairs with high-order dependencies in a dictionary, calculate the centrality of all nodes, and perform a certain number of truncated random walks in two homogeneous unweighted networks. The higher the node centrality, the higher the probability of being selected as the initial node. The next step of a node walks to a homogeneous node with an edge or high-order dependency. A maximum threshold is set for the node's walk. The walk stops when the path length reaches the maximum. At the same time, each step in the walk process has a certain probability of returning to the initial node or stopping the walk. Once the walk stops, the path is stored in the corpus of that type of node, and a new truncated random walk begins. Finally, corpora of type U nodes and corpora of type V nodes are generated respectively. The corpus of each type of node is a set of random walk paths in the homogeneous unweighted network of that type of node.
[0024] Preferably, the step of modeling implicit relationships of homogeneous nodes based on a corpus of homogeneous nodes, modeling explicit relationships of heterogeneous nodes based on higher-order dependency paths of heterogeneous nodes, and establishing a joint optimization model by combining the implicit relationships of homogeneous nodes and the explicit relationships of heterogeneous nodes includes: modeling implicit relationships of homogeneous nodes, where an implicit relationship measures the difference between the global similarity between nodes in the vector space and the global similarity between nodes in the actual bipartite network; the objective function established for the implicit relationships is the product of the conditional probabilities of each target node and all its context nodes; the conditional probability of a target node i and a context node c is a fraction whose numerator is the dot product of the embedding vector of i and the context vector of c, and whose denominator is the sum of the dot products of the embedding vector of i and the context vectors of all nodes of the same type;
[0025] Explicit relations are modeled between heterogeneous nodes. Explicit relations measure the difference between the local similarity between nodes in the vector space and the local similarity between nodes in the actual bipartite network. In the final vector space, the local similarity between nodes i and j is represented by the inner product of the embedding vectors of i and j. In the actual bipartite network, the local similarity between nodes i and j is represented by the weight of the edge between them. The objective function of explicit relations is the KL divergence between the local similarity of nodes in the vector space and the local similarity of nodes in the actual bipartite network.
[0026] Subtracting the explicit relationships of heterogeneous nodes from the implicit relationships of homogeneous nodes yields the objective function of the joint optimization model.
[0027] Preferably, the step of iteratively optimizing the node embedding vector and context vector according to the joint optimization model to obtain the node embedding vector in the time-sensitive bipartite network includes:
[0028] The embedding vectors and context vectors of nodes in the time-sensitive bipartite network are initialized and then iteratively optimized according to the joint optimization model. In each iteration, the embedding vectors and context vectors of nodes are updated using the stochastic gradient ascent method to maximize the objective function of the joint optimization model. After a certain number of iterations, the final embedding vectors of nodes in the time-sensitive bipartite network are obtained. The embedding vectors represent the node attributes in the network in a vector form with reduced dimensionality.
[0029] As can be seen from the technical solutions provided by the embodiments of the present invention above, the method of the present invention can fully capture the high-order dependencies of nodes in the time-sensitive bipartite network, establish a joint optimization framework, and effectively obtain the node attributes in the time-sensitive bipartite network.
[0030] Additional aspects and advantages of the invention will be set forth in part in the description which follows, and will become apparent from the description or may be learned by practice of the invention.
Attached Figure Description
[0031] To more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0032] Figure 1 is a flowchart illustrating a method for obtaining the embedding vector of node attributes in a time-sensitive bipartite network, as provided in an embodiment of the present invention.
Detailed Implementation
[0033] Embodiments of the present invention are described in detail below, examples of which are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are exemplary and are only used to explain the present invention, and should not be construed as limiting the present invention.
[0034] Those skilled in the art will understand that, unless specifically stated otherwise, the singular forms “a,” “an,” and “the” used herein may also include the plural forms. It should be further understood that the term “comprising” as used in this specification means the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It should be understood that when an element is said to be “connected” or “coupled” to another element, it can be directly connected or coupled to the other element, or there may be intermediate elements. Furthermore, “connected” or “coupled” as used herein can include wireless connections or couplings. The term “and/or” as used herein includes any and all combinations of one or more of the associated listed items.
[0035] It will be understood by those skilled in the art that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. It should also be understood that terms such as those defined in general dictionaries should be understood to have the same meaning as in the context of the prior art, and should not be interpreted in an idealized or overly formal sense unless defined as herein.
[0036] To facilitate understanding of the embodiments of the present invention, the following will provide further explanation and description with reference to the accompanying drawings and several specific embodiments. These embodiments do not constitute a limitation on the embodiments of the present invention.
[0037] This invention provides a graph embedding method that considers dependencies in a bipartite network. It can be applied to real-world data to obtain dependency-aware bipartite network embedding vectors, whose performance and accuracy are verified and which can be used in fields such as recommendation systems, link prediction, and network analysis.
[0038] This invention provides a processing flow for obtaining the embedding vector of node attributes in a time-sensitive bipartite network. As shown in Figure 1, the processing steps include the following:
[0039] Step S1: Extract the higher-order dependencies of homogeneous nodes in the time-sensitive bipartite network, and use the symbols U and V to represent the two types of nodes respectively. For example, in a search engine, users and web pages are regarded as nodes, query behavior is regarded as edges, and the time information of query behavior is retained, thus forming a time-sensitive bipartite network in which users can be represented as U and web pages as V.
[0040] This section identifies the dependencies between homogeneous nodes of type U and type V. Taking type U nodes as an example, the purpose of identifying the dependencies between homogeneous nodes of type U is to find the relationship between the tendency of type V nodes to connect to type U nodes and the historical path. Nodes of the same type are considered homogeneous nodes, and nodes of different types are considered heterogeneous nodes.
[0041] Step S2: Establish a high-order dependency matrix for heterogeneous nodes. This part treats the entire bipartite network as a whole to identify dependencies. Here, the node types in a path may appear in any order.
[0042] Step S3: Establish a high-order dependency matrix for homogeneous nodes.
[0043] Step S4: Establish a weighted bipartite network graph structure based on the higher-order dependency matrices of heterogeneous nodes and homogeneous nodes.
[0044] Step S5: Random walk to obtain the corpus.
[0045] Read the original data from the time-sensitive bipartite network and generate two complete homogeneous unweighted networks. One homogeneous unweighted network consists of all U-type nodes, and the other consists of all V-type nodes. For any two nodes that share a common neighbor in the time-sensitive bipartite network, add an edge between them in the homogeneous unweighted network. The homogeneous unweighted networks are used for the subsequent random walks.
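The projection described in the paragraph above can be sketched as follows. This is a minimal illustration that assumes the raw data is a list of `(u, timestamp, v)` records; the function name `project_homogeneous` and the toy records are hypothetical and do not come from the patent.

```python
from collections import defaultdict
from itertools import combinations

def project_homogeneous(records):
    """Build two unweighted same-type networks from (u, timestamp, v) records.

    Two U-type nodes are linked if they share at least one V-type neighbor,
    and vice versa. Returns (u_adj, v_adj) as dicts of neighbor sets.
    """
    neighbors_of_u = defaultdict(set)  # u -> V-type nodes it interacted with
    neighbors_of_v = defaultdict(set)  # v -> U-type nodes it interacted with
    for u, _timestamp, v in records:
        neighbors_of_u[u].add(v)
        neighbors_of_v[v].add(u)

    def link_co_neighbors(neighbor_map, nodes):
        adj = {n: set() for n in nodes}
        for shared in neighbor_map.values():
            # every pair of nodes attached to the same opposite-type node
            for a, b in combinations(sorted(shared), 2):
                adj[a].add(b)
                adj[b].add(a)
        return adj

    u_adj = link_co_neighbors(neighbors_of_v, neighbors_of_u.keys())
    v_adj = link_co_neighbors(neighbors_of_u, neighbors_of_v.keys())
    return u_adj, v_adj

# Toy data (assumed): u1 and u2 both interact with v1, so they become linked.
records = [("u1", 1, "v1"), ("u2", 2, "v1"), ("u2", 3, "v2"), ("u3", 4, "v3")]
u_adj, v_adj = project_homogeneous(records)
```

Timestamps are ignored here because the projection only depends on which nodes share a neighbor.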
[0046] Read the higher-order dependency matrix of homogeneous nodes and store the pairs of homogeneous nodes with higher-order dependencies in a dictionary.
[0047] The centrality of all nodes is calculated, and a certain number of truncated random walks are performed in two homogeneous unweighted networks. The higher the node centrality, the higher the probability of being selected as the initial node. A node's next step can be to a homogeneous node with an edge or higher-order dependency. A maximum threshold is set during the walk; the walk stops when the path length reaches the maximum. Each step in the walk has a certain probability of returning to the initial node or stopping. Once the walk stops, the path is stored in the corpus of that type of node, and a new truncated random walk begins. Finally, corpora of type U nodes and corpora of type V nodes are generated. Each type of node corpus is a set of random walk paths in the homogeneous unweighted network for that type of node. In steps S6 and S7, if two nodes appear together in the paths in the corpus, they are considered context nodes.
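A hedged sketch of the truncated random walk in step S5 follows. The patent does not name a centrality measure or give the return and stop probabilities, so degree centrality and the parameters `p_return` and `p_stop` are assumptions, as are the function name `truncated_walks` and the toy inputs.

```python
import random

def truncated_walks(adj, dependency_pairs, num_walks, max_len,
                    p_return=0.1, p_stop=0.1, seed=0):
    """Generate a corpus of truncated random walks over one node type.

    adj: dict node -> set of same-type neighbors (unweighted network).
    dependency_pairs: dict node -> set of nodes it has a higher-order
        dependency with (the dictionary read from the dependency matrix).
    Degree centrality is an assumed stand-in for the centrality measure.
    """
    rng = random.Random(seed)
    nodes = sorted(adj)
    # +1 keeps isolated nodes selectable as start nodes
    weights = [len(adj[n]) + 1 for n in nodes]
    corpus = []
    for _ in range(num_walks):
        start = rng.choices(nodes, weights=weights, k=1)[0]
        walk = [start]
        while len(walk) < max_len:
            r = rng.random()
            if r < p_stop:                 # stop the walk early
                break
            if r < p_stop + p_return:      # jump back to the initial node
                walk.append(start)
                continue
            current = walk[-1]
            candidates = sorted(adj[current] | dependency_pairs.get(current, set()))
            if not candidates:
                break
            walk.append(rng.choice(candidates))
        corpus.append(walk)
    return corpus

u_adj = {"u1": {"u2"}, "u2": {"u1"}, "u3": set()}
deps = {"u3": {"u1"}}  # assumed higher-order dependency from u3 to u1
corpus_u = truncated_walks(u_adj, deps, num_walks=20, max_len=5)
```

Running the same function on the V-type network would produce the second corpus.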
[0048] Step S6: Establish a joint optimization model.
[0049] Explicit relations are modeled between heterogeneous nodes. They measure the difference between the local similarity between nodes in the vector space and the local similarity between nodes in the actual bipartite network. In the final vector space, the local similarity between nodes i and j is represented by the inner product of their embedding vectors. In the actual bipartite network, the local similarity between nodes i and j is represented by the weight of the edge connecting them. The objective function of the explicit relations is the KL divergence between the local similarity of nodes in the vector space and the local similarity of nodes in the actual bipartite network. The purpose of modeling explicit relations is to minimize this objective function so that the final embedding vectors preserve the local similarity of nodes as much as possible.
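The explicit-relation objective can be illustrated with the sketch below. The paragraph above only states that the inner product and the edge weight are compared through a KL divergence; normalising the edge weights into an empirical distribution and passing the inner product through a sigmoid follow the BINE formulation mentioned in the background and are assumptions here. The name `explicit_kl` and the toy vectors are hypothetical.

```python
import numpy as np

def explicit_kl(edges, emb_u, emb_v):
    """KL-style divergence between edge weights and embedding similarity.

    edges: dict (i, j) -> weight w_ij between U-node i and V-node j.
    emb_u, emb_v: dict node -> 1-D embedding vector.
    """
    total = sum(edges.values())
    kl = 0.0
    for (i, j), w in edges.items():
        p_actual = w / total  # assumed normalisation of edge weights
        # sigmoid of the inner product (assumed, BINE-style); these scores
        # are not renormalised, so the value serves as a training signal
        p_model = 1.0 / (1.0 + np.exp(-np.dot(emb_u[i], emb_v[j])))
        kl += p_actual * np.log(p_actual / p_model)
    return float(kl)

edges = {("u1", "v1"): 2.0, ("u2", "v2"): 1.0}
emb_u = {"u1": np.array([2.0, 0.0]), "u2": np.array([0.0, 2.0])}
aligned_v = {"v1": np.array([2.0, 0.0]), "v2": np.array([0.0, 2.0])}
opposed_v = {"v1": np.array([-2.0, 0.0]), "v2": np.array([0.0, -2.0])}

loss_aligned = explicit_kl(edges, emb_u, aligned_v)
loss_opposed = explicit_kl(edges, emb_u, opposed_v)
```

Embeddings that agree with the observed edges give the smaller value, which is why this term is minimized.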
[0050] Implicit relations are modeled for homogeneous nodes. These implicit relations measure the difference between the global similarity between nodes in the vector space and the global similarity between nodes in the actual bipartite network. The objective function for implicit relations is the product of the conditional probabilities of each target node and all its context nodes, where context nodes are defined in step S5. The conditional probability of the target node i and its context node c is a fraction. The numerator is the dot product of the embedding vector of i and the context vector of c, and the denominator is the sum of the dot products of the embedding vector of i and the context vectors of all nodes of the same type. One implicit relation is established for the corpus of type U nodes and one for the corpus of type V nodes. The purpose of modeling implicit relations is to maximize the objective function so that the final embedding vectors preserve the global similarity of nodes as much as possible.
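The implicit-relation objective might be sketched as below. The paragraph above states the conditional probability as a ratio of dot products; this sketch exponentiates the dot products (a softmax, as in skip-gram-style models such as BINE), which is an assumption made so that the ratio is a valid probability. The context window size, the function name `context_log_likelihood`, and the toy data are also assumptions.

```python
import numpy as np

def context_log_likelihood(corpus, emb, ctx, window=2):
    """Sum of log P(c | i) over all (target, context) pairs in the corpus.

    The product of conditional probabilities becomes a sum of logs.
    emb, ctx: dict node -> embedding vector / context vector.
    """
    nodes = sorted(ctx)
    ctx_matrix = np.stack([ctx[n] for n in nodes])
    index = {n: k for k, n in enumerate(nodes)}
    total = 0.0
    for walk in corpus:
        for pos, target in enumerate(walk):
            scores = ctx_matrix @ emb[target]
            # log softmax over the context vectors of all same-type nodes
            log_probs = scores - np.log(np.sum(np.exp(scores)))
            lo, hi = max(0, pos - window), min(len(walk), pos + window + 1)
            for c in walk[lo:pos] + walk[pos + 1:hi]:
                total += log_probs[index[c]]
    return float(total)

rng = np.random.default_rng(0)
nodes = ["u1", "u2", "u3"]
emb = {n: rng.normal(scale=0.1, size=4) for n in nodes}
ctx = {n: rng.normal(scale=0.1, size=4) for n in nodes}
corpus_u = [["u1", "u2", "u3"], ["u2", "u1"]]
ll = context_log_likelihood(corpus_u, emb, ctx)
```

The same function would be applied separately to the U-type and V-type corpora.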
[0051] Establish a joint optimization model. The joint optimization model combines explicit and implicit relationships. The sum of the two implicit relationships minus the explicit relationship yields the objective function of the final joint optimization model. The purpose of establishing the joint optimization model is to maximize the objective function so that the final embedding vector can simultaneously preserve the local and global similarity of nodes.
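One way to write the joint objective described above is given below. The symbols are assumptions introduced for illustration: $P(i,j)$ is the edge-weight-based local similarity, $\hat{P}(i,j)$ is the embedding-based local similarity, $C(i)$ is the set of context nodes of $i$, and $\alpha$, $\beta$, $\gamma$ are hypothetical trade-off weights (setting all three to 1 gives the plain "sum of the two implicit relationships minus the explicit relationship"). Taking logarithms of the implicit products is also an assumption, made for numerical convenience.

```latex
\begin{aligned}
O_{\text{exp}} &= \mathrm{KL}\!\left(P \,\|\, \hat{P}\right)
  = \sum_{(i,j) \in E} P(i,j)\,\log\frac{P(i,j)}{\hat{P}(i,j)}, \\
O_{\text{imp}}^{U} &= \prod_{i \in U}\ \prod_{c \in C(i)} P(c \mid i), \qquad
O_{\text{imp}}^{V} = \prod_{j \in V}\ \prod_{c \in C(j)} P(c \mid j), \\
\max\ L &= \alpha \log O_{\text{imp}}^{U} + \beta \log O_{\text{imp}}^{V} - \gamma\, O_{\text{exp}}
\end{aligned}
```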
[0052] Step S7: Iteratively optimize the node embedding vector and context vector according to the joint optimization model to obtain the node embedding vector in the time-sensitive bipartite network.
[0053] Initialize the embedding vector and context vector;
[0054] Based on the joint optimization model, the embedding vector and context vector of the nodes are iteratively optimized. Using the stochastic gradient ascent method, the embedding vector and context vector of the nodes are updated in each iteration according to the established joint optimization model, maximizing the objective function of the joint optimization model. After a certain number of iterations, the final embedding vector is obtained.
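The update loop can be illustrated for the implicit term alone. The sketch below performs stochastic gradient ascent on $\log P(c \mid i)$ with an assumed softmax form; the learning rate, the initialisation scale, the integer node indices, and the function names `sga_step` and `log_prob` are all hypothetical, and the explicit term of the joint model is omitted for brevity.

```python
import numpy as np

def sga_step(emb, ctx, target, context, lr=0.1):
    """One stochastic gradient ascent step on log P(context | target).

    emb, ctx: 2-D arrays (num_nodes x dim) of embedding and context
    vectors, updated in place. P is a softmax over all context vectors.
    """
    scores = ctx @ emb[target]
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    indicator = np.zeros_like(probs)
    indicator[context] = 1.0
    coeff = indicator - probs                 # d log P / d score_k
    grad_emb = coeff @ ctx                    # gradient w.r.t. emb[target]
    grad_ctx = np.outer(coeff, emb[target])   # gradient w.r.t. each ctx row
    emb[target] += lr * grad_emb
    ctx += lr * grad_ctx

def log_prob(emb, ctx, target, context):
    scores = ctx @ emb[target]
    m = scores.max()
    return float(scores[context] - (m + np.log(np.sum(np.exp(scores - m)))))

rng = np.random.default_rng(0)
emb = rng.normal(scale=0.1, size=(4, 8))   # initialise embedding vectors
ctx = rng.normal(scale=0.1, size=(4, 8))   # initialise context vectors
pairs = [(0, 1), (1, 0), (2, 3), (3, 2)]   # assumed (target, context) pairs
before = sum(log_prob(emb, ctx, t, c) for t, c in pairs)
for _ in range(200):                        # fixed number of iterations
    for t, c in pairs:
        sga_step(emb, ctx, t, c)
after = sum(log_prob(emb, ctx, t, c) for t, c in pairs)
```

Comparing `before` and `after` shows whether the objective increased over the iterations; only `emb` would be kept as the final output.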
[0055] The final output of this invention is the embedding vector of each node in the time-sensitive bipartite network. The context vector is only used during training to represent the context node of the target node in a random walk path; it is equivalent to temporary context node information in that scenario. The node embedding vectors represent the nodes of the network in a reduced-dimensional vector form, which makes node properties simpler to obtain; for example, the similarity between two nodes can be calculated as the inner product of their vectors. This can be applied to subsequent work such as recommendation systems and link prediction.
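As a small illustration of the inner-product similarity mentioned above, the sketch below ranks items for one user. The vectors, the identifiers, and the helper name `top_k_items` are made up for the example.

```python
import numpy as np

def top_k_items(user_vec, item_vecs, item_ids, k=2):
    """Rank items by inner product with the user's embedding vector."""
    scores = item_vecs @ user_vec
    order = np.argsort(-scores)[:k]   # indices of the k largest scores
    return [(item_ids[i], float(scores[i])) for i in order]

user_vec = np.array([1.0, 0.0])                            # assumed U-type embedding
item_vecs = np.array([[0.9, 0.1], [0.0, 1.0], [0.5, 0.5]])  # assumed V-type embeddings
item_ids = ["v1", "v2", "v3"]
ranked = top_k_items(user_vec, item_vecs, item_ids)
```

Here `v1` and `v3` score highest because their vectors point most nearly in the user's direction.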
[0056] The above step S1 specifically includes:
[0057] Step S11: Read the original data of the time-sensitive bipartite network. For all V-type nodes, find all their edges and sort them by timestamp to form historical paths. For example, if node $v$ has $k$ edges, then the historical path of node $v$ is $s_v = [u_k, u_{k-1}, \ldots, u_2, u_1]$, and the set of historical paths for V-type nodes is $S_V$. Similarly, for U-type nodes, the set is $S_U$.
[0058] Step S12: For a path $s_v = [u_k, u_{k-1}, \ldots, u_2, u_1]$, if all nodes of a path $s$ exist in $s_v$ and the time order of all nodes in $s$ is consistent with that in $s_v$, then $s$ is a subpath of $s_v$. The probability distribution $d_{s_v}$ of the path $s_v = [u_k, u_{k-1}, \ldots, u_2, u_1]$ is defined as the probability that the next step of the path $[u_k, u_{k-1}, \ldots, u_2]$ is $u_1$. In $S_V$ and $S_U$, find all sub-paths containing only two nodes and calculate the probability distribution $d$ of each sub-path;
[0059] Step S13: Expand the path. Expand all the found paired sub-paths. A paired sub-path is a sub-path with only two nodes, where both nodes are nodes in the parent path and their order is consistent with that in the parent path.
[0060] The extension method is as follows. Taking a path of U-type nodes as an example, for a path $s_v = [u_k, u_{k-1}, \ldots, u_2, u_1]$ containing $k$ nodes, if there exists $s_{v'} = [u_{k+1}, u_k, u_{k-1}, \ldots, u_2, u_1]$ that is a subpath of a path in $S_V$ and satisfies $d_{KL}(d_{s_{v'}} \| d_{s_v}) > \delta$, then $s_{v'}$ is an extension path of $s_v$. Here $d_{s_{v'}}$ and $d_{s_v}$ are the probability distributions of paths $s_{v'}$ and $s_v$, $d_{KL}(d_{s_{v'}} \| d_{s_v})$ is the KL divergence between $d_{s_{v'}}$ and $d_{s_v}$, and $\delta$ is the set threshold.
[0061] For all extended paths, repeat step S13 until no new extended paths can be found; the paths that cannot be extended further are the final higher-order paths, and the higher-order paths of the homogeneous nodes obtained are written to a file.
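Steps S11 to S13 can be sketched as follows. For simplicity this sketch matches histories as contiguous runs inside a historical path, whereas the text defines subpaths by consistent time order only; that restriction, the base-2 logarithm, the function names, and the toy records are all assumptions.

```python
import math
from collections import Counter, defaultdict

def historical_paths(records):
    """Group (u, timestamp, v) records by v and order the u's by timestamp."""
    by_v = defaultdict(list)
    for u, t, v in records:
        by_v[v].append((t, u))
    return {v: [u for _t, u in sorted(pairs)] for v, pairs in by_v.items()}

def next_step_distribution(paths, history):
    """Empirical distribution of the node that follows `history` in any path."""
    counts = Counter()
    h = len(history)
    for path in paths:
        for start in range(len(path) - h):
            if tuple(path[start:start + h]) == tuple(history):
                counts[path[start + h]] += 1
    total = sum(counts.values())
    return {node: c / total for node, c in counts.items()} if total else {}

def kl_divergence(p, q):
    """KL(p || q) in bits; every outcome of p must also have mass in q."""
    return sum(pv * math.log2(pv / q[node]) for node, pv in p.items())

def is_extension(paths, history, extended_history, delta):
    """True if prepending one older node changes the next-step distribution
    by more than delta. `extended_history` must end with `history`."""
    d_short = next_step_distribution(paths, history)
    d_long = next_step_distribution(paths, extended_history)
    return bool(d_long) and kl_divergence(d_long, d_short) > delta

# Toy data (assumed): after u2 the next node depends on what preceded u2.
records = [("u1", 1, "v1"), ("u2", 2, "v1"), ("u3", 3, "v1"),
           ("u5", 1, "v2"), ("u2", 2, "v2"), ("u4", 3, "v2")]
paths_by_v = historical_paths(records)
paths = list(paths_by_v.values())
kl = kl_divergence(next_step_distribution(paths, ("u1", "u2")),
                   next_step_distribution(paths, ("u2",)))
```

With these records, knowing that `u1` came before `u2` makes `u3` certain, so the longer history carries a dependency that the pair `[u2, u3]` alone does not.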
[0062] The above step S2 specifically includes:
[0063] Step S21: In the original data of the time-sensitive bipartite network, find all paired nodes and record two paths for each pair. For example, if node $u$ and node $v$ have an edge at time $t$, record the two paths $[u, v]$ and $[v, u]$. The set of these paths is $S_H$;
[0064] Step S22: Expand the path. Expand all paired paths found. Taking the path $s_H = [u_1, v_1, v_2, u_2, v_3]$ as an example, if there exists $s_{H'} = [u_{new}, u_1, v_1, v_2, u_2, v_3]$ or $s_{H'} = [v_{new}, u_1, v_1, v_2, u_2, v_3]$ that satisfies $d_{KL}(d_{s_{H'}} \| d_{s_H}) > \delta$, then $s_{H'}$ is an extended path of $s_H$;
[0065] For all extended paths, repeat step S22 until no new extended paths can be found.
[0066] Write the obtained higher-order dependencies of heterogeneous nodes into a file. The higher-order dependencies of the heterogeneous nodes are the higher-order paths of the heterogeneous nodes.
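The seeding part of step S2 (step S21) can be sketched as below; the KL-based extension of step S22 is not reproduced because the text does not spell out how the distribution of a mixed-type path is computed. Skipping duplicate seed paths for repeated interactions is an assumption, and `seed_paired_paths` is a hypothetical name.

```python
def seed_paired_paths(records):
    """Return the initial path set S_H: both orientations of every edge.

    records: iterable of (u, timestamp, v) interactions.
    """
    paths = []
    seen = set()
    for u, _timestamp, v in records:
        for path in ((u, v), (v, u)):
            if path not in seen:  # repeated interactions add no new seed
                seen.add(path)
                paths.append(list(path))
    return paths

records = [("u1", 1, "v1"), ("u1", 2, "v1"), ("u2", 3, "v1")]
s_h = seed_paired_paths(records)
```

Each seed path would then be a candidate for extension by prepending one more node of either type.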
[0067] The above step S3 specifically includes:
[0068] Initialize the higher-order dependency matrix with the number of rows and columns equal to the number of nodes, and all elements set to 0.
[0069] Read the higher-order path file of homogeneous nodes. For node pairs with higher-order dependencies, add weights to the corresponding elements in the higher-order dependency matrix. The higher the order of the dependency, the smaller the added weight value. Obtain the higher-order dependency matrix of homogeneous nodes.
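A minimal sketch of step S3 follows. The text only says that higher-order dependencies receive smaller weights; the specific choice of weight `1 / distance`, the pairing of each history node with the final node of its path, and the name `dependency_matrix` are assumptions.

```python
import numpy as np

def dependency_matrix(higher_order_paths, node_index):
    """Accumulate dependency weights between same-type nodes.

    higher_order_paths: list of node lists, oldest node first.
    node_index: dict node -> row/column index.
    """
    n = len(node_index)
    matrix = np.zeros((n, n))  # all elements start at 0
    for path in higher_order_paths:
        target = node_index[path[-1]]
        # distance 1 is the immediate predecessor; larger distance
        # means a higher-order dependency and a smaller weight
        for distance, node in enumerate(reversed(path[:-1]), start=1):
            matrix[node_index[node], target] += 1.0 / distance
    return matrix

node_index = {"u1": 0, "u2": 1, "u3": 2}
dep = dependency_matrix([["u1", "u2", "u3"]], node_index)
```

For the single path `[u1, u2, u3]`, the pair (`u2`, `u3`) gets weight 1 and the more distant pair (`u1`, `u3`) gets weight 0.5.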
[0070] The above step S4 specifically includes:
[0071] Read the raw data of the time-sensitive bipartite network. The raw data contains three columns: the identifier of node 1, the timestamp, and the identifier of node 2. Node 1 and node 2 belong to different types of nodes, such as users and products.
[0072] Establish a network data structure, add the read nodes to the network data structure, add an edge to each pair of read nodes 1 and 2, and assign weights according to the number of interactions.
[0073] Read the high-order path file of heterogeneous nodes. For each path where the first and last nodes are heterogeneous, add edges and weights to the first and last node pairs. The higher the order of the dependency, the smaller the added weight value. This results in a weighted bipartite network graph structure.
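Step S4 might look like the sketch below. It assumes that node 1 is always the U-type node, that each interaction contributes a weight of 1, and that a higher-order path of `n` nodes contributes `1 / (n - 1)`; these choices, the `is_u` predicate, and the function name are illustrative assumptions.

```python
from collections import defaultdict

def build_weighted_bipartite(records, higher_order_paths, is_u):
    """Return dict (u, v) -> weight for the weighted bipartite graph.

    records: (node1, timestamp, node2) rows with node1 of type U.
    higher_order_paths: node lists from the heterogeneous path file.
    is_u: function telling whether a node is U-type.
    """
    weights = defaultdict(float)
    for node1, _timestamp, node2 in records:
        weights[(node1, node2)] += 1.0  # one unit per observed interaction
    for path in higher_order_paths:
        first, last = path[0], path[-1]
        if is_u(first) != is_u(last):   # endpoints must be heterogeneous
            u, v = (first, last) if is_u(first) else (last, first)
            # longer path = higher order = smaller added weight (assumed form)
            weights[(u, v)] += 1.0 / (len(path) - 1)
    return dict(weights)

records = [("u1", 1, "v1"), ("u1", 2, "v1"), ("u2", 3, "v1")]
paths = [["u2", "v1", "u1", "v2"]]
graph = build_weighted_bipartite(records, paths, lambda n: n.startswith("u"))
```

The pair (`u2`, `v2`) never interacted directly, yet it receives an edge through the higher-order path.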
[0074] In summary, the method of this invention can fully capture the high-order dependencies of nodes in time-sensitive bipartite networks, establish a joint optimization framework, effectively obtain node attributes in time-sensitive bipartite networks, and improve the accuracy and performance of node attribute embedding vectors in tasks such as link prediction and recommendation systems.
[0075] The beneficial effects of the method in the embodiments of the present invention also include the following aspects:
[0076] More accurate relationship prediction: This invention can more comprehensively capture the complex relationships between nodes. The obtained embeddings can provide a deeper representation, enabling more accurate prediction of future possible connection relationships in link prediction tasks.
[0077] More accurate similarity measurement: Node embeddings capture higher-order dependencies, providing better node representations. In recommender systems, such node embeddings can be used to more accurately measure the similarity between nodes, thereby improving the quality of recommendations and making them more aligned with user interests or needs.
[0078] Effective information transmission: Consideration of higher-order dependencies allows node embeddings to better capture the information transmission paths between nodes. For link prediction and recommendation systems, this means more efficient information dissemination and reasoning, thereby improving the accuracy of predictions and recommendations.
[0079] Combating noise and bias in networks: Embedding higher-order dependencies can mitigate the effects of noise and bias in networks. In link prediction and recommendation systems, such node embeddings can better cope with data incompleteness or noise in the network, improving the robustness of the system.
[0080] Those skilled in the art will understand that the accompanying drawings are merely schematic diagrams of one embodiment, and the modules or processes shown in the drawings are not necessarily essential for implementing the present invention.
[0081] As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that the present invention can be implemented by means of software plus necessary general-purpose hardware platforms. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in various embodiments or some parts of the embodiments of the present invention.
[0082] The embodiments in this specification are described progressively. Identical or similar parts of different embodiments may be cross-referenced, and each embodiment focuses on its differences from the others. In particular, because the apparatus and system embodiments are essentially similar to the method embodiments, they are described only briefly; for the relevant parts, refer to the description of the method embodiments. The apparatus and system embodiments described above are merely illustrative. Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the embodiment. Those skilled in the art can understand and implement this without creative effort.
[0083] The above description is merely a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.
Claims
1. A method for obtaining embedding vectors of node attributes in a time-dependent bipartite search network, characterized in that, in a search engine, users and web pages are treated as nodes, query actions are treated as edges, and the time information of the query actions is retained, thereby forming a time-dependent bipartite search network, the method comprising: representing different types of nodes in the time-dependent bipartite network with different symbols, identifying the dependency relationships between homogeneous nodes, and establishing a higher-order dependency matrix of homogeneous nodes; identifying the dependency relationships between heterogeneous nodes and generating higher-order dependency paths between heterogeneous nodes; establishing a time-dependent bipartite network graph structure based on the higher-order dependency matrix of the heterogeneous nodes; performing random walks on the time-dependent bipartite network graph structure to generate a corpus of homogeneous nodes and a corpus of heterogeneous nodes, respectively, the corpus being a set of random walk paths in the homogeneous unweighted network of that type of node; modeling the implicit relationships of homogeneous nodes based on the corpus of homogeneous nodes, and modeling the explicit relationships of heterogeneous nodes based on the higher-order dependency paths of heterogeneous nodes; establishing a joint optimization model by combining the implicit relationships of homogeneous nodes and the explicit relationships of heterogeneous nodes; and iteratively optimizing the embedding vectors and context vectors of the nodes according to the joint optimization model to obtain the embedding vectors of the nodes in the time-dependent bipartite network.
2. The method according to claim 1, characterized in that representing different types of nodes in the time-dependent bipartite search network with different symbols, identifying the dependency relationships between homogeneous nodes, and establishing a higher-order dependency matrix of homogeneous nodes comprises: reading the raw data of the time-dependent bipartite search network, extracting the higher-order dependencies of homogeneous nodes, and using the symbols U and V to represent the two types of nodes, where nodes of the same type are homogeneous nodes and nodes of different types are heterogeneous nodes; for every V-type node and every U-type node, finding all of its edges and sorting them by timestamp to form a historical path, such that if a V-type node v has k edges, the historical path of node v is $S_V = [u_k, u_{k-1}, \ldots, u_2, u_1]$, the set of historical paths of V-type nodes is $S_V$, and the set of historical paths of U-type nodes is $S_U$; for $S_V = [u_k, u_{k-1}, \ldots, u_2, u_1]$, any path $s$ whose nodes all appear in $S_V$ and whose nodes are in a temporal order consistent with $S_V$ is a subpath of $S_V$; the probability distribution $d_{S_V}$ of the path $S_V = [u_k, u_{k-1}, \ldots, u_2, u_1]$ is defined as the probability that the next step after the path $[u_k, u_{k-1}, \ldots, u_2]$ is $u_1$; finding, in $S_V$ and $S_U$, all subpaths containing only two nodes and calculating the probability distribution $d$ of each such subpath; extending all paired subpaths found, a paired subpath being a subpath with only two nodes, both of which are nodes of the parent path and appear in the same order as in the parent path; and, for all extended paths, repeating the operations of finding paired subpaths and extending all paired subpaths until no new extended path can be found, thereby obtaining the higher-order dependency matrix of homogeneous nodes.
3. The method according to claim 2, characterized in that identifying the dependency relationships between heterogeneous nodes and generating higher-order dependency paths between heterogeneous nodes comprises: finding all paired nodes in the raw data of the time-dependent bipartite search network and recording two paths for each pair, for example, if node u and node v have an edge at time t, recording the two paths $[u, v]$ and $[v, u]$, the set of paths being $S_H$; extending all paired paths found, such that, for a path $S_H = [u_1, v_1, v_2, u_2, v_3]$, if there exists a path $S_{H'} = [u_{new}, u_1, v_1, v_2, u_2, v_3]$ or $[v_{new}, u_1, v_1, v_2, u_2, v_3]$ satisfying $d_{KL}(d_{S_{H'}} \,\|\, d_{S_H}) > \delta$, then $S_{H'}$ is an extended path of $S_H$; and, for all extended paths, repeating the operations of finding paired subpaths and extending all paired subpaths until no new extended path can be found, thereby obtaining the higher-order dependency paths of heterogeneous nodes.
4. The method according to claim 3, characterized in that performing random walks on the time-dependent bipartite network graph structure to generate a corpus of homogeneous nodes and a corpus of heterogeneous nodes, respectively, the corpus being a set of random walk paths in the homogeneous unweighted network of that type of node, comprises: reading the raw data of the time-dependent bipartite search network and generating two homogeneous unweighted networks, one consisting of all U-type nodes and the other of all V-type nodes, where two nodes are connected in a homogeneous unweighted network if they share a common neighbor in the time-dependent bipartite search network; reading the higher-order dependency matrix of homogeneous nodes and storing the homogeneous node pairs that have higher-order dependencies in a dictionary; calculating the centrality of all nodes; and performing a set number of truncated random walks in each of the two homogeneous unweighted networks, where the higher a node's centrality, the higher its probability of being selected as the initial node, each step moves to a homogeneous node connected by an edge or by a higher-order dependency, the walk stops when the path length reaches a preset maximum, and at each step the walk has a certain probability of returning to the initial node or of stopping; once a walk stops, the path is stored in the corpus of that node type and a new truncated random walk begins; finally, a corpus of U-type nodes and a corpus of V-type nodes are generated, each being the set of random walk paths in the homogeneous unweighted network of that node type.
5. The method according to claim 4, characterized in that modeling the implicit relationships of homogeneous nodes based on the corpus of homogeneous nodes, modeling the explicit relationships of heterogeneous nodes based on the higher-order dependency paths of heterogeneous nodes, and establishing a joint optimization model by combining the implicit relationships of homogeneous nodes and the explicit relationships of heterogeneous nodes comprises: modeling the implicit relationships between homogeneous nodes, where the implicit relationship measures the difference between the global similarity of nodes in the vector space and the global similarity of nodes in the actual bipartite network, the objective function established for the implicit relationship is the product of the conditional probabilities of the target node and all of its context nodes, and the conditional probability of target node i and context node c is a fraction whose numerator is the dot product of the embedding vector of i and the context vector of c and whose denominator is the sum of the dot products of the embedding vector of i and the context vectors of all nodes of the same type; modeling the explicit relationships between heterogeneous nodes, where the explicit relationship measures the difference between the local similarity of nodes in the vector space and the local similarity of nodes in the actual bipartite network, the local similarity of nodes i and j in the final vector space is represented by the inner product of the embedding vectors of i and j, the local similarity of nodes i and j in the actual bipartite network is represented by the weight of the edge between them, and the objective function of the explicit relationship is the KL divergence between the local similarity of nodes in the vector space and the local similarity of nodes in the actual bipartite network; and subtracting the implicit relationships of homogeneous nodes from the explicit relationships of heterogeneous nodes to obtain the objective function of the joint optimization model.
6. The method according to claim 5, characterized in that iteratively optimizing the embedding vectors and context vectors of the nodes according to the joint optimization model to obtain the embedding vectors of the nodes in the time-dependent bipartite search network comprises: initializing the embedding vectors and context vectors of the nodes in the time-dependent bipartite network; iteratively optimizing the embedding vectors and context vectors of the nodes according to the joint optimization model, where in each iteration the embedding vectors and context vectors are updated by stochastic gradient ascent to maximize the objective function of the joint optimization model; and, after a set number of iterations, obtaining the final embedding vectors of the nodes in the time-dependent bipartite network, where the embedding vector of a node represents the attributes of that node in the network as a vector of reduced dimensionality.
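The iterative update of claim 6 can be illustrated with a heavily simplified sketch. It is not the claimed joint optimization model. It covers only an explicit-relation term, and it assumes the sigmoid inner-product form that the Background section attributes to the BINE model, with the weighted log-likelihood sum of w_ij * log(sigmoid(u_i . v_j)) as the quantity to maximize. The implicit-relation term, context vectors, and negative sampling are omitted. All names (`train`, `explicit_objective`) and hyperparameters (`dim`, `lr`, `iterations`, `seed`) are hypothetical.

```python
import math
import random
from typing import Dict, List, Tuple

Edges = Dict[Tuple[str, str], float]
Vectors = Dict[str, List[float]]


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def explicit_objective(edges: Edges, emb_u: Vectors, emb_v: Vectors) -> float:
    """Assumed objective: sum of w_ij * log(sigmoid(u_i . v_j)) over edges."""
    return sum(
        w * math.log(sigmoid(dot(emb_u[u], emb_v[v])))
        for (u, v), w in edges.items()
    )


def train(
    edges: Edges,
    dim: int = 4,
    lr: float = 0.1,
    iterations: int = 200,
    seed: int = 0,
) -> Tuple[Vectors, Vectors]:
    rng = random.Random(seed)
    # Initialise one small random embedding vector per node.
    emb_u = {u: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for u, _ in edges}
    emb_v = {v: [rng.uniform(-0.1, 0.1) for _ in range(dim)] for _, v in edges}
    for _ in range(iterations):
        for (u, v), w in edges.items():
            vec_u, vec_v = emb_u[u], emb_v[v]
            # Gradient of w * log(sigmoid(u.v)) w.r.t. u is w*(1-sigmoid)*v,
            # and symmetrically for v; step uphill along it (ascent).
            step = lr * w * (1.0 - sigmoid(dot(vec_u, vec_v)))
            emb_u[u] = [a + step * b for a, b in zip(vec_u, vec_v)]
            emb_v[v] = [b + step * a for a, b in zip(vec_u, vec_v)]
    return emb_u, emb_v


# Hypothetical weighted edges between U-type and V-type nodes.
edges = {("u1", "v1"): 2.0, ("u2", "v1"): 1.0, ("u2", "v2"): 1.0}
before = explicit_objective(edges, *train(edges, iterations=0))
after = explicit_objective(edges, *train(edges))
```

Comparing `before` and `after` shows the objective rising as the iterations proceed. A full model in the sense of claims 5 and 6 would also update the context vectors and include the implicit-relation term in each step.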