Knowledge graph construction method and device, electronic equipment and storage medium
By constructing a node network and using random walk and word embedding algorithms to merge objects belonging to similar nodes, the problem of objects having the same name in knowledge graphs is solved, thus improving the accuracy and completeness of knowledge graphs.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- CHINA UNIONPAY
- Filing Date
- 2019-10-18
- Publication Date
- 2026-06-26
AI Technical Summary
When objects in a knowledge graph have the same name, it is difficult to determine whether they are the same object, which reduces the accuracy of the constructed knowledge graph.
By acquiring the direct relationships and subordinate objects of multiple nodes, a node network is constructed. Then, node embedding vectors are trained using random walk and word embedding algorithms. Subordinate objects of nodes with similarity higher than a threshold are merged to form a knowledge graph.
It improves the accuracy and completeness of knowledge graphs, avoids omissions and errors in relationship mining, and simplifies the construction process.
Description
Technical Field
[0001] This invention belongs to the field of information retrieval, and particularly relates to a method, apparatus, electronic device and storage medium for constructing a knowledge graph.
Background Technology
[0002] Knowledge graphs can be used to describe various entities and concepts, as well as the relationships between them, and have powerful semantic processing and interconnection capabilities.
[0003] In the development of knowledge graphs, there are instances where objects have the same name. However, objects in a knowledge graph lack unique attributes, making it difficult to determine whether objects with the same name are the same object. Treating objects with the same name as different objects may cause relationships associated with those objects to be missed. Conversely, treating them as the same object may lead to incorrect relationship discovery. Both of these situations reduce the accuracy of the constructed knowledge graph.
Summary of the Invention
[0004] This invention provides a knowledge graph construction method, apparatus, electronic device, and storage medium, which can improve the accuracy of the constructed knowledge graph.
[0005] In a first aspect, embodiments of the present invention provide a knowledge graph construction method, including:
[0006] Obtain multiple nodes, construct a node network based on the direct relationships between the nodes and whether the nodes have subordinate objects with the same name. The node network includes multiple nodes, the connection relationships between the nodes, and the subordinate objects of each node.
[0007] In a node network, each node is used as the initial node for random walks, resulting in multiple walk sequences;
[0008] Using a word embedding algorithm, the embedding vector of each node is obtained by training multiple walk sequences;
[0009] Merge subordinate objects with the same name that belong to nodes in the target node set into one subordinate object, and use the node network after merging subordinate objects as a knowledge graph. The target node set includes at least two nodes whose embedding vectors have a similarity higher than the similarity recognition threshold.
[0010] In a second aspect, embodiments of the present invention provide a knowledge graph construction apparatus, comprising:
[0011] The node network construction module is used to obtain multiple nodes, construct a node network based on the direct relationships between the nodes and whether the nodes have subordinate objects with the same name. The node network includes multiple nodes, the connection relationships between the nodes, and the subordinate objects of each node.
[0012] The random walk module is used to perform a random walk on each node in a node network, resulting in multiple walk sequences.
[0013] The training module is used to train the embedding vector of each node on multiple walk sequences using a word embedding algorithm.
[0014] The knowledge graph construction module is used to merge subordinate objects with the same name that belong to nodes in the target node set into one subordinate object, and use the node network after merging subordinate objects as a knowledge graph. The target node set includes at least two nodes whose embedding vectors have a similarity higher than the similarity recognition threshold.
[0015] In a third aspect, embodiments of the present invention provide an electronic device, including a processor, a memory, and a computer program stored in the memory and executable on the processor. When the computer program is executed by the processor, it implements the knowledge graph construction method in the first aspect of the technical solution.
[0016] In a fourth aspect, embodiments of the present invention provide a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the knowledge graph construction method in the first aspect of the technical solution.
[0017] This invention provides a knowledge graph construction method, apparatus, electronic device, and storage medium. It constructs a node network, including nodes, connections between nodes, and subordinate objects of nodes, based on the direct relationships between multiple nodes and the subordinate objects of each node. Random walks and word embedding algorithms are used in the node network to obtain embedding vectors for identifiable nodes. The similarity of the embedding vectors of different nodes is used to determine whether subordinate objects with the same name belonging to different nodes are the same subordinate object. Subordinate objects with the same name corresponding to two nodes whose similarity is higher than the similarity recognition threshold are merged as the same subordinate object. This avoids the situation where all subordinate objects with the same name are considered different objects, leading to missed relationships in relation mining, and also avoids the situation where all subordinate objects with the same name are considered the same subordinate object, resulting in erroneous relation mining, thereby improving the accuracy of the constructed knowledge graph.
Attached Figure Description
[0018] The invention can be better understood from the following description of specific embodiments of the invention in conjunction with the accompanying drawings, wherein the same or similar reference numerals denote the same or similar features.
[0019] Figure 1 is a flowchart of a knowledge graph construction method according to an embodiment of the present invention;
[0020] Figure 2 is a schematic diagram of a node network according to an embodiment of the present invention;
[0021] Figure 3 is a schematic diagram of the knowledge graph corresponding to the node network shown in Figure 2, according to an embodiment of the present invention;
[0022] Figure 4 is a flowchart of a specific implementation of a knowledge graph construction method according to an embodiment of the present invention;
[0023] Figure 5 is a flowchart of another specific implementation of a knowledge graph construction method according to an embodiment of the present invention;
[0024] Figure 6 is a planar schematic diagram of nodes in a two-dimensional coordinate system according to an embodiment of the present invention;
[0025] Figure 7 is a schematic structural diagram of a knowledge graph construction device according to an embodiment of the present invention;
[0026] Figure 8 is a schematic structural diagram of a specific implementation of a knowledge graph construction device according to an embodiment of the present invention;
[0027] Figure 9 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Implementation
[0028] The features and exemplary embodiments of various aspects of the present invention will now be described in detail. Numerous specific details are set forth in the following detailed description to provide a thorough understanding of the invention. However, it will be apparent to those skilled in the art that the invention may be practiced without requiring some of these specific details. The following description of embodiments is merely intended to provide a better understanding of the invention by illustrating examples of the invention. The invention is by no means limited to any specific configurations and algorithms presented below, but covers any modifications, substitutions, and improvements to elements, components, and algorithms without departing from the spirit of the invention. Well-known structures and techniques are not shown in the drawings and the following description in order to avoid unnecessarily obscuring the invention.
[0029] This invention provides a method, apparatus, electronic device, and storage medium for constructing a knowledge graph, applicable to scenarios where multiple nodes are used to build a knowledge graph. This invention does not limit the type of knowledge graph; for example, the knowledge graph can be an enterprise knowledge graph, a historical knowledge graph, etc. In this invention, each node has at least one subordinate object. No other auxiliary knowledge graphs are needed; the knowledge graph is generated based on random walk and word embedding algorithms, utilizing the direct relationships between multiple nodes and their subordinate objects.
[0030] Figure 1 is a flowchart of a knowledge graph construction method according to an embodiment of the present invention. As shown in Figure 1, the knowledge graph construction method may include steps S101 to S104.
[0031] In step S101, multiple nodes are obtained, and a node network is constructed based on the direct relationships between the multiple nodes and whether the multiple nodes have subordinate objects with the same name.
[0032] Each node has at least one subordinate object, and the subordinate object belongs to the node. A direct association exists between two nodes, indicating a relationship between them. If the name of a subordinate object of one node is the same as the name of a subordinate object of another node, it means that the subordinate object of this node and the subordinate object of the other node may be the same subordinate object, i.e., there may be a relationship between this node and the other node.
[0033] A node network can be constructed based on multiple nodes and the possible relationships between them. A node network includes multiple nodes, the connections between them, and the subordinate objects of each node.
[0034] For example, Figure 2 is a schematic diagram of a node network according to an embodiment of the present invention. As shown in Figure 2, there are four nodes: A1, A2, A3, and A4. Node A1 has subordinate objects named Object 1, Object 2, and Object 3; Node A2 has subordinate objects named Object 2, Object 4, Object 5, and Object 6; Node A3 has subordinate objects named Object 2, Object 3, Object 8, and Object 9; and Node A4 has subordinate objects named Object 4, Object 7, and Object 8. Node A1 has a direct relationship with both Node A2 and A3; Node A2 has a direct relationship with Node A3; and Node A3 has a direct relationship with Node A4.
[0035] In step S102, in the node network, each node is used as the initial node for random walks, resulting in multiple walk sequences.
[0036] In a node network, each node is traversed, meaning each node is used as an initial node for a random walk, resulting in multiple walk sequences. It should be noted that a random walk on an initial node can yield one or more walk sequences. The number of walk sequences obtained from random walks on each initial node can be set according to the work scenario or requirements, and is not limited here. The length of each walk sequence can be preset, and its specific value can be determined according to the work scenario or requirements, and is not limited here.
[0037] In step S103, a word embedding algorithm is trained on the multiple walk sequences to obtain the embedding vector of each node.
[0038] After the multiple walk sequences are obtained, a word embedding algorithm can treat each walk sequence as a sentence and each node as a word, and the embedding vector of each node can be obtained by training.
[0039] In some examples, word embedding algorithms may include continuous skip-gram model (Skip-gram) or continuous bag of words (CBOW) algorithms, etc., and are not limited here.
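As a rough illustration of how walk sequences can be treated as sentences, the sketch below generates the (center, context) training pairs that a Skip-gram model would consume. The function name `skipgram_pairs` and the `window` parameter are hypothetical and not part of the patent text; the training of the embedding vectors themselves is omitted, and the three walks are taken from the Figure 2 example in the description.

```python
from typing import List, Tuple


def skipgram_pairs(walks: List[List[str]], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) node pairs, treating each walk as a sentence.

    `window` (an assumed hyperparameter) is the maximum distance between a
    center node and a context node inside one walk sequence.
    """
    pairs = []
    for walk in walks:
        for i, center in enumerate(walk):
            lo = max(0, i - window)
            hi = min(len(walk), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pairs.append((center, walk[j]))
    return pairs


# Walk sequences from the Figure 2 example (preset number of steps = 2).
walks = [["A1", "A2", "A3"], ["A1", "A3", "A4"], ["A1", "A3", "A2"]]
pairs = skipgram_pairs(walks, window=1)
```

A Skip-gram model would then be trained to predict the context node from the center node for every pair; CBOW reverses the direction and predicts the center node from its context.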
[0040] In step S104, subordinate objects with the same name that belong to nodes in the target node set are merged into one subordinate object, and the node network after merging subordinate objects is used as a knowledge graph.
[0041] The target node set includes at least two nodes whose embedding vectors have a similarity higher than a similarity recognition threshold. The similarity recognition threshold is a limit that determines whether the embedding vectors of two nodes are sufficiently similar. If the similarity of the embedding vectors of two nodes is higher than this threshold, it indicates that the embedding vectors of the two nodes are sufficiently similar. The specific similarity recognition threshold can be set according to the specific work scenario and requirements, and is not limited here. For example, the similarity recognition threshold could be 0.9.
[0042] Specifically, similarity can be calculated for the embedding vectors of any pair of nodes. If the similarity between the embedding vectors of two nodes is higher than the similarity recognition threshold, then the subordinate objects with the same name of these two nodes are considered to be the same subordinate object. Therefore, the subordinate objects with the same name of these two nodes can be merged into one subordinate object. The node network after merging subordinate objects can then serve as a knowledge graph.
[0043] For example, suppose the node network is as shown in Figure 2, the similarity between the embedding vectors of node A1 and node A2 is higher than the similarity recognition threshold, the similarity between the embedding vectors of node A1 and node A3 is higher than the similarity recognition threshold, the similarity between the embedding vectors of node A4 and node A2 is lower than the similarity recognition threshold, and the similarity between the embedding vectors of node A4 and node A3 is lower than the similarity recognition threshold. Then the subordinate objects named Object 2 of nodes A1, A2, and A3 are the same subordinate object and can be merged into one subordinate object. Similarly, the subordinate objects named Object 3 of nodes A1 and A3 are the same subordinate object and can be merged into one subordinate object. The merged node network, i.e., the knowledge graph, is shown in Figure 3. Figure 3 is a schematic diagram of the knowledge graph corresponding to the node network shown in Figure 2, according to an embodiment of the present invention.
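The merging in step S104 can be sketched as follows, using the Figure 2 example. This is only one possible reading: it groups same-name subordinate objects with a union-find structure, so merges are transitive across node pairs. The function name, the data layout, and the numeric similarity values are hypothetical; the description only states which pairs are above or below the threshold, and pairs it does not mention are treated here as not similar.

```python
from itertools import combinations


def merge_same_name_objects(subordinates, similarity, threshold=0.9):
    """Group same-name subordinate objects of sufficiently similar nodes.

    subordinates: dict mapping node -> set of subordinate object names
    similarity:   dict mapping frozenset({u, v}) -> embedding similarity
    Returns a dict mapping object name -> list of node groups; the nodes in
    one group share a single merged subordinate object.
    """
    parent = {}  # union-find over (node, object name) pairs

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for node, names in subordinates.items():
        for name in names:
            find((node, name))

    for u, v in combinations(sorted(subordinates), 2):
        if similarity.get(frozenset((u, v)), 0.0) > threshold:
            for name in subordinates[u] & subordinates[v]:
                parent[find((u, name))] = find((v, name))

    groups = {}
    for node, name in parent:
        groups.setdefault(name, {}).setdefault(find((node, name)), []).append(node)
    return {
        name: sorted(sorted(nodes) for nodes in roots.values())
        for name, roots in groups.items()
    }


subordinates = {
    "A1": {"Object 1", "Object 2", "Object 3"},
    "A2": {"Object 2", "Object 4", "Object 5", "Object 6"},
    "A3": {"Object 2", "Object 3", "Object 8", "Object 9"},
    "A4": {"Object 4", "Object 7", "Object 8"},
}
# Hypothetical similarity values, chosen only to match the stated outcome.
similarity = {
    frozenset(("A1", "A2")): 0.95,
    frozenset(("A1", "A3")): 0.93,
    frozenset(("A2", "A4")): 0.40,
    frozenset(("A3", "A4")): 0.30,
}
merged = merge_same_name_objects(subordinates, similarity, threshold=0.9)
```

With these inputs, Object 2 ends up as one subordinate object shared by A1, A2, and A3, while Object 4 and Object 8 remain separate objects for their respective nodes.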
[0044] In some examples, the similarity of embedded vectors can be calculated using algorithms such as cosine similarity or Euclidean distance; this is not a limitation. The following explanation uses the cosine similarity algorithm to calculate the similarity of embedded vectors.
[0045] For nodes u and v, the similarity between the embedding vector $s_u$ of node u and the embedding vector $s_v$ of node v can be calculated using the following formula (1):
[0046] $$\mathrm{sim}(s_u, s_v) = \frac{s_u^{\top} s_v}{\lVert s_u \rVert_2 \, \lVert s_v \rVert_2} \qquad (1)$$
[0047] Where $\mathrm{sim}(s_u, s_v)$ is the similarity between the embedding vector $s_u$ of node u and the embedding vector $s_v$ of node v, $s_u^{\top}$ is the transpose of the embedding vector $s_u$ of node u, $\lVert s_u \rVert_2$ is the 2-norm of the embedding vector $s_u$ of node u, and $\lVert s_v \rVert_2$ is the 2-norm of the embedding vector $s_v$ of node v.
[0048] The higher the value of $\mathrm{sim}(s_u, s_v)$, the higher the correlation between node u and node v. The higher the correlation, the greater the likelihood that the subordinate objects with the same name of node u and node v are the same subordinate object.
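The cosine similarity described for formula (1) can be computed as in the following minimal sketch. The function name and the handling of zero vectors are assumptions for illustration; the description does not say how a zero vector should be treated.

```python
import math


def cosine_similarity(s_u, s_v):
    """Cosine similarity of two embedding vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(s_u, s_v))
    norm_u = math.sqrt(sum(a * a for a in s_u))
    norm_v = math.sqrt(sum(b * b for b in s_v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumed behaviour: the similarity is undefined for a zero vector.
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Vectors pointing in the same direction give a value close to 1, which would exceed a threshold such as 0.9, while orthogonal vectors give 0.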
[0049] In this embodiment of the invention, a node network is constructed based on the direct relationships between multiple nodes and the subordinate objects of each node. This network includes nodes, the connections between nodes, and the subordinate objects of each node. Random walks and word embedding algorithms are used in the node network to obtain embedding vectors for identifiable nodes. The similarity of the embedding vectors of different nodes is used to determine whether subordinate objects with the same name belonging to different nodes are the same subordinate object. Subordinate objects with the same name belonging to two nodes whose embedding vectors have a similarity higher than the similarity recognition threshold are merged as the same subordinate object. This avoids the situation where all subordinate objects with the same name are considered different objects, which would lead to missed relationships in relation mining, and also avoids the situation where all subordinate objects with the same name are considered the same subordinate object, resulting in erroneous relation mining. This improves the accuracy and confidence of the constructed knowledge graph.
[0050] The knowledge graph constructed using the knowledge graph construction method in this embodiment of the invention can display not only direct relationships between nodes but also indirect relationships, thus improving the completeness of the knowledge graph. Furthermore, the knowledge graph construction method in this embodiment of the invention does not require other pre-aligned entities or additional knowledge graphs as references or supervision, simplifying the complexity of the knowledge graph construction process.
[0051] Figure 4 is a flowchart of a specific implementation of a knowledge graph construction method according to an embodiment of the present invention. Figure 4 differs from Figure 1 in that step S101 in Figure 1 can be further refined into steps S1011 to S1013 in Figure 4, and step S102 can be further refined into steps S1021 to S1024.
[0052] In step S1011, for any two nodes among the multiple nodes, if there is a direct association between the two nodes, a connection relationship is established between the two nodes.
[0053] In some examples, direct relationships can include dependency and / or control relationships. If node 1 is subordinate to node 2, then there is a dependency relationship between node 1 and node 2. If node 1 controls at least some of the attributes or subordinate objects of node 2, then there is a control relationship between node 1 and node 2. For example, if a node is a company, and company 1 is a subsidiary of company 2, then there is a dependency relationship between company 1 and company 2. Or, if a node is a company, and there is an equity relationship between company 1 and company 2, such as company 1 being the controlling shareholder of company 2, then there is a control relationship between company 1 and company 2.
[0054] Multiple nodes can be traversed pairwise to check for direct relationships and determine whether a connection should be established between them. Specifically, the connection between two nodes can be represented in a node network as an edge or line connecting the two nodes.
[0055] In step S1012, for any two nodes among multiple nodes, if any two nodes have subordinate objects with the same name, a connection relationship is established between the two nodes.
[0056] If two nodes have subordinate objects with the same name, it means that the subordinate objects in the two nodes may be the same object, that is, there may be a relationship between the two nodes, and a connection relationship is established between the two nodes.
[0057] In step S1013, a node network is constructed based on multiple nodes and the connection relationships between them.
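Steps S1011 to S1013 can be sketched as below for the Figure 2 example. The function name and the representation of an edge as a `frozenset` of two nodes are assumptions made for illustration; connection weights are left out.

```python
from itertools import combinations


def build_node_network(subordinates, direct_relations):
    """Return the undirected connections of the node network.

    subordinates:     dict mapping node -> set of subordinate object names
    direct_relations: iterable of (u, v) pairs with a direct association
    """
    # S1011: connect every pair of nodes that has a direct association.
    edges = {frozenset(pair) for pair in direct_relations}
    # S1012: connect every pair of nodes sharing a subordinate object name.
    for u, v in combinations(sorted(subordinates), 2):
        if subordinates[u] & subordinates[v]:
            edges.add(frozenset((u, v)))
    # S1013: the nodes plus these connections form the node network.
    return edges


subordinates = {
    "A1": {"Object 1", "Object 2", "Object 3"},
    "A2": {"Object 2", "Object 4", "Object 5", "Object 6"},
    "A3": {"Object 2", "Object 3", "Object 8", "Object 9"},
    "A4": {"Object 4", "Object 7", "Object 8"},
}
direct_relations = [("A1", "A2"), ("A1", "A3"), ("A2", "A3"), ("A3", "A4")]
edges = build_node_network(subordinates, direct_relations)
```

In this example A2 and A4 have no direct association but are still connected because both have a subordinate object named Object 4, whereas A1 and A4 share no name and stay unconnected.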
[0058] In some examples, connections can have weights. For instance, the connection between two nodes with a direct association can have a first weight, and the connection between two nodes that have subordinate objects with the same name can have a second weight. The first and second weights can be set according to the work scenario and requirements; they can be the same or different, and this is not limited here. In some cases, the first weight can be set higher than the second weight.
[0059] Different or the same first weight can be set for different types of direct relationships, which is not limited here. For example, the first weight can be 1. The second weight can be set according to the number of subordinate objects with the same name of the two nodes. For example, the larger the number of subordinate objects with the same name of the two nodes, the larger the second weight of the connection relationship between the two nodes. That is, the second weight is positively correlated with the number of subordinate objects with the same name of the two nodes.
[0060] Specifically, for a target node pair (i.e., any two nodes with subordinate objects having the same name), the intersection and union of the names of the subordinate objects of these two nodes can be obtained. Based on the obtained intersection and union, and a preset weight adjustment parameter, a second weight is obtained. Further, the second weight can be obtained based on the number of elements in the obtained intersection and union, and the preset weight adjustment parameter. The preset weight adjustment parameter can be a positive number; by adjusting the preset weight adjustment parameter, the similarity between nodes calculated subsequently can be closer to 1. The preset weight adjustment parameter can be set according to specific work scenarios and requirements, and is not limited here. For example, the preset weight adjustment parameter can be 10.
[0061] For example, set the first weight to 1. For nodes u and v, if there is a direct association between nodes u and v, then the weight of the connection between nodes u and v is 1.
[0062] If there is no direct association between node u and node v, and node u and node v do not have subordinate objects with the same name, that is, the intersection of the set $a_u$ of names of the subordinate objects of node u and the set $a_v$ of names of the subordinate objects of node v is empty, then there is no connection between nodes u and v.
[0063] If there is no direct association between node u and node v, but node u and node v have subordinate objects with the same name, that is, the intersection of the set $a_u$ of names of the subordinate objects of node u and the set $a_v$ of names of the subordinate objects of node v is not empty, the second weight of the connection between nodes u and v can be calculated according to the following formula (2):
[0064]
[0065] Where $e_{uv}$ is the second weight of the connection between node u and node v, $|a_u \cap a_v|$ is the number of elements in the intersection of $a_u$ and $a_v$, $|a_u \cup a_v|$ is the number of elements in the union of $a_u$ and $a_v$, α is the preset weight adjustment parameter, and exp is the exponential function.
[0066] In step S1021, if there is no node that has a connection with the initial node, it is determined that the initial node has no corresponding walk sequence.
[0067] In this embodiment, all nodes are traversed, and each node is used as an initial node for random walk. If there is no node connected to an initial node, the random walk ends, and the initial node has no corresponding walk sequence.
[0068] In step S1022, if there are nodes connected to the initial node, the walk moves randomly from the initial node to one of those nodes, and the node reached becomes the current node.
[0069] In step S1023, if there are nodes connected to the current node, the walk moves randomly from the current node to one of those nodes, and the node reached becomes the new current node. This is repeated until the number of random walk steps reaches the preset number of steps.
[0070] The random walk ends when the preset number of steps is reached. This preset number of steps can be set according to the work scenario and requirements, and is not limited here. It should be noted that for an initial node, a random walk can produce one or multiple walk sequences, which is not limited here. For example, each initial node can correspond to 20 walk sequences, with a length of 80.
[0071] For example, node networks Figure 2 As shown, taking node A1 as the initial node as an example, the goal is to obtain three walk sequences with node A1 as the initial node. The preset number of steps can be set according to requirements, and one walk sequence corresponds to one walk path. For example, the preset number of steps can be 20, 10, or 2. For ease of explanation, let's assume the preset number of steps is 2. The first walk path is node A1→node A2→node A3, the second walk path is node A1→node A3→node A4, and the third walk path is node A1→node A3→node A2. The three walk sequences can be obtained from the first, second, and third walk paths, respectively.
[0072] The following explanation uses the first walk path as an example. Node A1 is the initial node. Nodes connected to node A1 include nodes A2 and A3. A random walk can proceed from node A1 to either node A2 or node A3. If the random walk reaches node A2, then node A2 becomes the current node. Nodes connected to node A2 include nodes A1 and A3. A random walk can proceed from node A2 to node A3, and node A3 becomes the new current node. Since the preset number of steps is 2, the random walk ends at node A3, and the walk sequence corresponding to the first walk path is node A1, node A2, node A3.
[0073] In step S1024, multiple walk sequences are obtained based on the nodes visited by the random walk corresponding to each initial node.
[0074] Specifically, multiple walk sequences can be obtained based on the multiple walk paths formed by the nodes visited by the random walk corresponding to each initial node.
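Steps S1021 to S1024 can be sketched as an unweighted random walk over an adjacency map. The function name, its parameters, the fixed seed, and the isolated node A5 added to exercise step S1021 are all hypothetical; the adjacency below follows the connections of the Figure 2 example, including the A2 to A4 connection that arises from the shared name Object 4.

```python
import random


def random_walks(adjacency, walks_per_node=3, num_steps=2, seed=None):
    """Generate walk sequences, using every node as an initial node."""
    rng = random.Random(seed)
    walks = []
    for start in sorted(adjacency):
        # S1021: an initial node without connections has no walk sequence.
        if not adjacency[start]:
            continue
        for _ in range(walks_per_node):
            walk = [start]
            current = start
            # S1022 / S1023: move to a random connected node at each step
            # until the preset number of steps is reached.
            for _ in range(num_steps):
                candidates = sorted(adjacency[current])
                if not candidates:
                    break
                current = rng.choice(candidates)
                walk.append(current)
            # S1024: the visited nodes form one walk sequence.
            walks.append(walk)
    return walks


adjacency = {
    "A1": {"A2", "A3"},
    "A2": {"A1", "A3", "A4"},
    "A3": {"A1", "A2", "A4"},
    "A4": {"A2", "A3"},
    "A5": set(),  # hypothetical isolated node, not part of Figure 2
}
walks = random_walks(adjacency, walks_per_node=3, num_steps=2, seed=0)
```

Each walk sequence contains the initial node plus one node per step, so a preset number of steps of 2 yields sequences of three nodes, matching the walk paths in the example above.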
[0075] In some examples, the random walk is specifically a biased random walk, where the probability of a random walk is related to the weights of the connections. Nodes connected to the initial node or the current node are called candidate nodes. For each candidate node, the probability of a random walk from the initial node or the current node to the candidate node is positively correlated with the weights of the connections between the initial node or the current node and the candidate node.
[0076] In other words, the greater the weight of the connection between the initial node and the candidate node, the greater the probability of traversing from the initial node to the candidate node. Similarly, the greater the weight of the connection between the current node and the candidate node, the greater the probability of traversing from the current node to the candidate node.
[0077] It should be noted that the candidate nodes for the initial node are nodes that are connected to the initial node. The candidate nodes for the current node are nodes that are connected to the current node.
[0078] In some examples, the probability of traversing from the initial node or the current node to a candidate node is obtained by normalizing the weights of the connection relationships between the initial node or the current node and the candidate node.
[0079] It should be noted that the probability of traversing from the initial node to a candidate node is obtained by normalizing the weights of the connection relationships between the initial node and the candidate nodes. The candidate nodes of the initial node are those nodes that are connected to the initial node.
[0080] The probability of traversing from the current node to a candidate node is obtained by normalizing the weights of the connection relationships between the current node and the candidate nodes. Candidate nodes are nodes that are connected to the current node.
[0081] The probability of randomly walking from the initial node or the current node to a candidate node can be calculated using the following formula (3):
[0082] $$p_{uv} = \frac{e_{uv}}{\sum_{k \in N(u)} e_{uk}} \qquad (3)$$
[0083] Where $p_{uv}$ is the probability of randomly walking from the initial node u or the current node u to the candidate node v, $e_{uv}$ is the weight of the connection relationship between the initial node u or the current node u and the candidate node v, and $N(u)$ is the set of all candidate nodes of the initial node u or the current node u.
[0084] The initial node u or the current node u randomly walks to candidate node v with probability $p_{uv}$.
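A biased random-walk step can be sketched as follows, assuming that "normalizing" means dividing each connection weight by the sum of the weights over all candidate nodes. The function names and the weight values are hypothetical; in particular, 0.5 is only a placeholder for a second weight and is not derived from formula (2).

```python
import random


def transition_probabilities(weights, u):
    """Normalize the connection weights of node u into walk probabilities.

    weights: dict mapping node u -> dict mapping candidate node v -> weight
    """
    total = sum(weights[u].values())
    return {v: e_uv / total for v, e_uv in weights[u].items()}


def biased_step(weights, u, rng):
    """Pick the next node from u with probability proportional to the weight."""
    candidates = sorted(weights[u])
    return rng.choices(candidates, weights=[weights[u][v] for v in candidates], k=1)[0]


# Hypothetical weights for node A2: direct associations use a first weight
# of 1, and the same-name connection to A4 uses a placeholder second weight.
weights = {"A2": {"A1": 1.0, "A3": 1.0, "A4": 0.5}}
probs = transition_probabilities(weights, "A2")
next_node = biased_step(weights, "A2", random.Random(0))
```

With these placeholder weights, the walk leaves A2 towards A1 or A3 twice as often as towards A4, which reflects the positive correlation between connection weight and walk probability.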
[0085] Figure 5 is a flowchart of another specific implementation of a knowledge graph construction method according to an embodiment of the present invention. Figure 5 differs from Figure 1 in that the knowledge graph construction method shown in Figure 5 may also include steps S105 and S106.
[0086] In step S105, a dimensionality reduction algorithm is used to map the N-dimensional embedding vector of each node to an M-dimensional embedding vector.
[0087] The dimensionality reduction algorithm may include Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), Locally Linear Embedding (LLE), or Laplacian eigenmap algorithm, etc., and is not limited here.
[0088] Both N and M are positive integers, and N > M. The specific value of N, the dimension of the initially obtained embedding vectors, can be set according to the work scenario and requirements, and is not limited here. For example, N = 128.
[0089] By using dimensionality reduction algorithms, high-dimensional embedding vectors are mapped to low-dimensional embedding vectors, thus defining a low-dimensional mapping. In subsequent analysis or observation, low-dimensional embedding vectors are easier to analyze or observe than high-dimensional embedding vectors, reducing the difficulty of analysis or observation.
[0090] In step S106, each node is displayed in the M-dimensional coordinate system.
[0091] In an M-dimensional coordinate system, the coordinates of a node are its M-dimensional embedding vector. By displaying each node in M-dimensional coordinates, the correlation between nodes can be observed more intuitively. In the M-dimensional coordinate system, the farther apart two nodes are, the weaker their correlation; the closer two nodes are, the stronger their correlation.
[0092] In some examples, M=2, meaning that a dimensionality reduction algorithm is used to map the N-dimensional embedding vector to a 2-dimensional embedding vector, thus defining a low-dimensional mapping. In a two-dimensional coordinate system, the relative positions of each node can be visualized, allowing for a more intuitive observation and measurement of the correlation between nodes.
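As an illustration of step S105 with M = 2, the following standard-library sketch projects embedding vectors onto their first two principal components using power iteration. It is a simplified stand-in for a PCA routine from a numerical library; the function names, the iteration count, and the toy 3-dimensional vectors are assumptions for illustration only.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(vec):
    norm = math.sqrt(_dot(vec, vec))
    return [x / norm for x in vec] if norm > 1e-12 else vec


def pca_project(vectors, m=2, iterations=200):
    """Map N-dimensional vectors to m dimensions (PCA via power iteration)."""
    n, count = len(vectors[0]), len(vectors)
    mean = [sum(v[i] for v in vectors) / count for i in range(n)]
    centered = [[v[i] - mean[i] for i in range(n)] for v in vectors]
    cov = [
        [sum(row[i] * row[j] for row in centered) / count for j in range(n)]
        for i in range(n)
    ]
    components = []
    for k in range(m):
        vec = _normalize([1.0 + 0.1 * (i + k) for i in range(n)])
        for _ in range(iterations):
            vec = [_dot(row, vec) for row in cov]
            # Stay orthogonal to the components that were already found.
            for comp in components:
                scale = _dot(vec, comp)
                vec = [a - scale * b for a, b in zip(vec, comp)]
            vec = _normalize(vec)
        components.append(vec)
    return [[_dot(row, comp) for comp in components] for row in centered]


# Toy 3-dimensional "embedding vectors" lying on one line (N = 3, M = 2).
vectors = [[float(t), 2.0 * t, 3.0 * t] for t in range(5)]
points_2d = pca_project(vectors, m=2)
```

The resulting 2-dimensional points can be plotted directly; nodes whose points are close together are the ones with strongly correlated embedding vectors.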
[0093] For example, Figure 6 is a planar schematic diagram of nodes in a two-dimensional coordinate system according to an embodiment of the present invention. As shown in Figure 6, nodes A1, A2, and A3 are very close to each other, while nodes A4 and A5 are far from nodes A1, A2, and A3. Therefore, nodes A1, A2, and A3 are highly correlated, node A4 is not highly correlated with other nodes, and node A5 is not highly correlated with other nodes.
[0094] The knowledge graph construction method described in the above embodiments can be applied to scenarios involving name disambiguation of enterprise personnel. In the above embodiments, nodes can specifically represent enterprises, and subordinate objects of nodes can be enterprise employees, whose names can be the names of the enterprise employees. Enterprise employees include legal representatives, shareholders, and senior management personnel, etc., and are not limited here. Direct relationships in the above embodiments can include enterprise affiliation relationships, equity relationships, etc., and are not limited here.
[0095] In this embodiment of the invention, in the enterprise knowledge graph constructed using the above-described knowledge graph construction method, employees with the same name in enterprises with high relevance are merged and considered to be the same employee. Employees with the same name in enterprises with low relevance do not need to be merged and are considered to be different employees with the same name, thereby improving the accuracy of the enterprise knowledge graph.
[0096] To observe the relevance of companies within a knowledge graph more intuitively, a planar diagram of multiple companies in a two-dimensional coordinate system can be generated. For example, as shown in Figure 6, node A1 is B Medical Device Co., Ltd., node A2 is C Medical Technology Co., Ltd., node A3 is D Pharmaceutical Co., Ltd., node A4 is E Information Technology Co., Ltd., and node A5 is F Investment Partnership. Each of these companies has an employee named "Zhang San". As shown in the two-dimensional coordinate system diagram of Figure 6, B Medical Device Co., Ltd., C Medical Technology Co., Ltd., and D Pharmaceutical Co., Ltd. are very close in the coordinate system and belong to similar industries, all related to the medical and pharmaceutical fields. E Information Technology Co., Ltd. and F Investment Partnership are far from the other companies in the coordinate system and have very low industry relevance. Therefore, the employees named "Zhang San" in B Medical Device Co., Ltd., C Medical Technology Co., Ltd., and D Pharmaceutical Co., Ltd. are considered to be the same employee. The employee named "Zhang San" in E Information Technology Co., Ltd. is considered to be a different person from the employees named "Zhang San" in the other four companies, and the same holds for the employee named "Zhang San" in F Investment Partnership.
[0097] Figure 7 is a schematic diagram of a knowledge graph construction device according to an embodiment of the present invention. As shown in Figure 7, the knowledge graph construction device 200 may include a node network construction module 201, a random walk module 202, a training module 203, and a knowledge graph construction module 204.
[0098] The node network construction module 201 is used to obtain multiple nodes and to construct a node network based on the direct relationships between the multiple nodes and on whether the multiple nodes have subordinate objects with the same name.
[0099] The node network includes multiple nodes, the connections between the nodes, and the subordinate objects of each node.
[0100] The random walk module 202 is used to perform random walks in the node network, taking each node as the initial node, to obtain multiple walk sequences.
[0101] The training module 203 is used to obtain the embedding vector of each node by training on the multiple walk sequences using a word embedding algorithm.
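As a non-limiting illustration of how walk sequences can serve as input to a word embedding algorithm, the sketch below treats each walk sequence as a "sentence" and each node as a "word", and extracts (center, context) training pairs. It assumes a skip-gram style algorithm, which this passage does not itself specify; the function name `skipgram_pairs` and the `window` parameter are hypothetical.

```python
def skipgram_pairs(walks, window=2):
    """Return (center, context) node pairs from walk sequences.

    Each walk is treated like a sentence and each node like a word; every
    node is paired with the nodes at most `window` positions away from it.
    The skip-gram formulation is an assumption made for this sketch.
    """
    pairs = []
    for walk in walks:
        for i, center in enumerate(walk):
            start = max(0, i - window)
            stop = min(len(walk), i + window + 1)
            for j in range(start, stop):
                if j != i:
                    pairs.append((center, walk[j]))
    return pairs


example_pairs = skipgram_pairs([["A", "B", "C"]], window=1)
print(example_pairs)
```

Nodes that frequently co-occur within the window across many walks produce many shared training pairs, which is what drives their embedding vectors toward one another during training.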
[0102] The knowledge graph construction module 204 is used to merge subordinate objects with the same name that belong to nodes in the target node set into one subordinate object, and to use the node network after the merging as the knowledge graph. The target node set includes at least two nodes whose embedding vectors have a similarity higher than the similarity recognition threshold.
[0103] In this embodiment of the invention, a node network is constructed based on the direct relationships between multiple nodes and the subordinate objects of each node. This network includes nodes, the connections between nodes, and the subordinate objects of each node. Random walks and word embedding algorithms are used in the node network to obtain embedding vectors for identifiable nodes. The similarity of the embedding vectors of different nodes is used to determine whether subordinate objects with the same name belonging to different nodes are the same subordinate object. Subordinate objects with the same name belonging to two nodes whose embedding vectors have a similarity higher than the similarity recognition threshold are merged as the same subordinate object. This avoids the situation where all subordinate objects with the same name are considered different objects, which would lead to missed relationships in relation mining, and also avoids the situation where all subordinate objects with the same name are considered the same subordinate object, resulting in erroneous relation mining. This improves the accuracy and confidence of the constructed knowledge graph.
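As a non-limiting illustration, the threshold-based merging described above can be sketched as follows. The sketch assumes cosine similarity as the similarity measure and groups nodes transitively with a union-find structure; both choices, as well as the example embeddings and the threshold value, are assumptions made for illustration only.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_same_name_objects(embeddings, subordinates, threshold):
    """Return {(name, representative_node): [nodes sharing one merged object]}.

    Nodes whose similarity exceeds the threshold are placed in one set
    (transitive grouping is an assumption); same-name subordinate objects
    inside a set are treated as a single object.
    """
    parent = {node: node for node in embeddings}

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path compression
            node = parent[node]
        return node

    nodes = sorted(embeddings)
    for i, u in enumerate(nodes):
        for v in nodes[i + 1:]:
            if cosine(embeddings[u], embeddings[v]) > threshold:
                parent[find(v)] = find(u)

    merged = defaultdict(list)
    for node in nodes:
        for name in sorted(subordinates.get(node, ())):
            merged[(name, find(node))].append(node)
    return dict(merged)


# Hypothetical embeddings for five enterprises that each employ a "Zhang San".
embeddings = {
    "B": (1.0, 0.1), "C": (0.9, 0.2), "D": (1.0, 0.0),
    "E": (0.0, 1.0), "F": (-1.0, 0.2),
}
subordinates = {node: {"Zhang San"} for node in embeddings}
result = merge_same_name_objects(embeddings, subordinates, threshold=0.9)
print(result)
```

With these assumed values, the three mutually similar nodes share one merged "Zhang San", while the remaining two nodes each keep a separate object of that name.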
[0104] The knowledge graph constructed using the knowledge graph construction apparatus in this embodiment of the invention can display not only direct relationships between nodes but also indirect relationships, thus improving the completeness of the knowledge graph. Furthermore, the knowledge graph construction method in this embodiment of the invention does not require other pre-aligned entities or additional knowledge graphs as references or supervision, simplifying the complexity of the knowledge graph construction process.
[0105] In some examples, the node network construction module 201 described above can be specifically used to: establish a connection relationship between any two nodes if there is a direct association relationship between them; establish a connection relationship between any two nodes if they have subordinate objects with the same name; and construct the node network based on the multiple nodes and the connection relationships between them.
[0106] In some examples, the direct relationships mentioned above include subordinate and / or control relationships.
[0107] Specifically, the connection relationships in the above embodiments have weights.
[0108] The weight of the connection relationship between two nodes that have a direct association relationship is the first weight. The weight of the connection relationship between two nodes that have subordinate objects with the same name is the second weight.
[0109] Furthermore, the first weight is higher than the second weight.
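As a non-limiting illustration, a weighted node network of the kind described above can be sketched as follows. The data layout, the function name `build_node_network`, and the rule that the first weight is used when two nodes are both directly related and share a same-name subordinate object are assumptions made for this sketch.

```python
from itertools import combinations


def build_node_network(subordinates, direct_relations, first_weight, second_weight):
    """Return {frozenset({u, v}): weight} for an undirected node network.

    subordinates:     {node: set of subordinate-object names}
    direct_relations: iterable of (u, v) pairs with a direct relationship
    """
    edges = {}
    # Connect nodes that have at least one subordinate object with the same name.
    for u, v in combinations(sorted(subordinates), 2):
        if subordinates[u] & subordinates[v]:
            edges[frozenset((u, v))] = second_weight
    # Connect directly related nodes; the first weight takes precedence when
    # both conditions hold (an assumption, not stated in the text).
    for u, v in direct_relations:
        edges[frozenset((u, v))] = first_weight
    return edges


# Hypothetical example data.
example_subordinates = {
    "B": {"Zhang San", "Li Si"},
    "C": {"Zhang San"},
    "E": {"Wang Wu"},
}
example_edges = build_node_network(
    example_subordinates,
    direct_relations=[("B", "E")],
    first_weight=1.0,
    second_weight=0.5,
)
print(example_edges)
```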
[0110] In some examples, the random walk module 202 described above can be specifically used to: if there is no node connected to the initial node, determine that the initial node has no corresponding walk sequence; if there is a node connected to the initial node, randomly walk from the initial node to one of the nodes connected to the initial node, and take the node reached by the random walk as the current node; if there is a node connected to the current node, randomly walk from the current node to one of the nodes connected to the current node, and take the node reached by the random walk as the new current node, until the number of random walk steps reaches a preset number of steps; and obtain multiple walk sequences based on the nodes traversed by the random walk corresponding to each initial node.
[0111] Furthermore, for each candidate node, the probability of a random walk from the initial node or the current node to the candidate node is positively correlated with the weight of the connection relationship between the initial node or the current node and the candidate node, where the candidate node is a node that has a connection relationship with the initial node or the current node.
[0112] Specifically, the probability of traversing from the initial node or the current node to a candidate node is obtained by normalizing the weights of the connection relationships between the initial node or the current node and the candidate node.
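As a non-limiting illustration, the weighted random walk and the normalization of connection weights into transition probabilities can be sketched as follows. The adjacency layout, the function names, and the fixed random seed are assumptions made for this sketch.

```python
import random


def transition_probabilities(adjacency, node):
    """Normalize the weights of a node's connections into probabilities."""
    neighbours = adjacency.get(node, {})
    total = sum(neighbours.values())
    return {n: w / total for n, w in neighbours.items()} if total else {}


def random_walk(adjacency, start, num_steps, rng):
    """Walk num_steps steps from start; return None for an isolated start node."""
    if not adjacency.get(start):
        return None  # no connected node: no walk sequence for this initial node
    walk = [start]
    current = start
    for _ in range(num_steps):
        probs = transition_probabilities(adjacency, current)
        if not probs:
            break  # defensive stop; not reached in an undirected network
        candidates = list(probs)
        weights = [probs[c] for c in candidates]
        current = rng.choices(candidates, weights=weights, k=1)[0]
        walk.append(current)
    return walk


def all_walks(adjacency, num_steps, seed=0):
    """Start one walk from every node and collect the resulting sequences."""
    rng = random.Random(seed)
    walks = []
    for node in sorted(adjacency):
        walk = random_walk(adjacency, node, num_steps, rng)
        if walk is not None:
            walks.append(walk)
    return walks


# Hypothetical network: {node: {connected node: connection weight}}.
adjacency = {
    "A": {"B": 1.0, "C": 0.5},
    "B": {"A": 1.0},
    "C": {"A": 0.5},
    "D": {},
}
walks = all_walks(adjacency, num_steps=4)
print(transition_probabilities(adjacency, "A"))
```

In this example the connection from A to B carries twice the weight of the connection from A to C, so a walk at A moves to B with probability 2/3 and to C with probability 1/3, and the isolated node D yields no walk sequence.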
[0113] Figure 8 is a structural schematic diagram of a specific implementation of a knowledge graph construction device in an embodiment of the present invention. Figure 8 differs from Figure 7 in that the knowledge graph construction device 200 shown in Figure 8 may also include a calculation module 205, a dimensionality reduction mapping module 206, and a display module 207.
[0114] The calculation module 205 can be used to obtain the intersection and union of the names of the subordinate objects of two nodes in a target node pair, wherein the target node pair includes any two nodes with subordinate objects having the same name; and to obtain a second weight based on the intersection and union of the names of the subordinate objects of two nodes in the target node pair and a preset weight adjustment parameter.
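As a non-limiting illustration, one way of deriving the second weight from the intersection and union of subordinate-object names is sketched below. This passage does not give the exact formula, so the form used here (the adjustment parameter multiplied by the ratio of intersection size to union size) is an assumption, as are the function and parameter names.

```python
def second_weight(names_u, names_v, adjustment=1.0):
    """Assumed form: adjustment * |intersection| / |union| of the name sets."""
    shared = names_u & names_v
    combined = names_u | names_v
    if not shared:
        return 0.0  # no same-name subordinate objects, so no such connection
    return adjustment * len(shared) / len(combined)


# Hypothetical example: one shared name out of three distinct names.
weight = second_weight(
    {"Zhang San", "Li Si"},
    {"Zhang San", "Wang Wu"},
    adjustment=0.5,
)
print(weight)
```

Under this assumed form, two nodes that share a larger fraction of their subordinate-object names receive a larger second weight, and the adjustment parameter can be chosen so that the second weight stays below the first weight.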
[0115] The dimensionality reduction mapping module 206 is used to map the N-dimensional embedding vector of each node to an M-dimensional embedding vector using a dimensionality reduction algorithm.
[0116] Where N and M are both positive integers, and N > M.
[0117] Display module 207 is used to display each node in an M-dimensional coordinate system, where the coordinates of the node are the M-dimensional embedding vector of the node.
[0118] Furthermore, M = 2.
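As a non-limiting illustration, the mapping from N-dimensional to M-dimensional embedding vectors can be sketched as follows. This passage does not name a specific dimensionality reduction algorithm; principal component analysis (PCA), computed here by power iteration with deflation, is assumed for the sketch, and all function names and example vectors are hypothetical.

```python
import math
import random


def top_eigenvector(matrix, iterations=200, seed=0):
    """Power iteration for the dominant eigenvector of a symmetric matrix."""
    rng = random.Random(seed)
    n = len(matrix)
    vec = [rng.gauss(0.0, 1.0) for _ in range(n)]  # random start direction
    norm = math.sqrt(sum(x * x for x in vec))
    vec = [x / norm for x in vec]
    for _ in range(iterations):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:
            break  # nothing left to extract (zero matrix)
        vec = [x / norm for x in nxt]
    return vec


def pca_project(vectors, m=2):
    """Project N-dimensional vectors onto their first m principal components."""
    n = len(vectors[0])
    count = len(vectors)
    mean = [sum(v[k] for v in vectors) / count for k in range(n)]
    centered = [[v[k] - mean[k] for k in range(n)] for v in vectors]
    cov = [
        [sum(row[i] * row[j] for row in centered) / count for j in range(n)]
        for i in range(n)
    ]
    components = []
    for _ in range(m):
        axis = top_eigenvector(cov)
        eigenvalue = sum(
            axis[i] * cov[i][j] * axis[j] for i in range(n) for j in range(n)
        )
        components.append(axis)
        # Deflation: remove the extracted component before finding the next one.
        cov = [
            [cov[i][j] - eigenvalue * axis[i] * axis[j] for j in range(n)]
            for i in range(n)
        ]
    return [
        [sum(row[k] * axis[k] for k in range(n)) for axis in components]
        for row in centered
    ]


# Hypothetical 3-dimensional embedding vectors mapped to M = 2 dimensions.
vectors = [[1.0, 0.0, 0.0], [2.0, 0.1, 0.0], [3.0, -0.1, 0.0], [4.0, 0.0, 0.0]]
points = pca_project(vectors, m=2)
print(points)
```

Each returned pair can be used directly as the coordinates of a node in a two-dimensional coordinate system.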
[0119] Figure 9 is a schematic diagram of the structure of an electronic device according to an embodiment of the present invention. As shown in Figure 9, the electronic device 300 includes a memory 301, a processor 302, and a computer program stored in the memory 301 and executable on the processor 302.
[0120] In one example, the processor 302 described above may include a central processing unit (CPU), or an application-specific integrated circuit (ASIC), or one or more integrated circuits that may be configured to implement the embodiments of this application.
[0121] Memory 301 may include mass storage for data or instructions. By way of example and not limitation, memory 301 may include a hard disk drive (HDD), floppy disk drive, flash memory, optical disk, magneto-optical disk, magnetic tape, or Universal Serial Bus (USB) drive, or a combination of two or more of these. Where appropriate, memory 301 may include removable or non-removable (or fixed) media. Where appropriate, memory 301 may be internal or external to the electronic device 300. In a particular embodiment, memory 301 is a non-volatile solid-state memory. In a particular embodiment, memory 301 includes read-only memory (ROM). Where appropriate, the ROM may be a mask-programmed ROM, a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), an electrically alterable ROM (EAROM), or flash memory, or a combination of two or more of these.
[0122] The processor 302 reads the executable program code stored in the memory 301 to run the computer program corresponding to the executable program code, so as to implement the knowledge graph construction method in the above embodiments.
[0123] In one example, the electronic device 300 may also include a communication interface 303 and a bus 304. As shown in Figure 9, the memory 301, the processor 302, and the communication interface 303 are connected through the bus 304 and communicate with one another over it.
[0124] The communication interface 303 is mainly used to realize communication between various modules, devices, units and / or equipment in the embodiments of this application. Input devices and / or output devices can also be connected through the communication interface 303.
[0125] Bus 304 includes hardware, software, or both, that couples the components of the electronic device 300 together. By way of example and not limitation, bus 304 may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local bus (VLB), or other suitable buses, or combinations of two or more of these. Where appropriate, bus 304 may include one or more buses. Although specific buses are described and illustrated in embodiments of this application, any suitable bus or interconnect is contemplated herein.
[0126] An embodiment of this application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, can implement the knowledge graph construction method described in the above embodiments.
[0127] It should be noted that the embodiments in this specification are described in a progressive manner. For parts that are the same or similar, the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. For the device embodiments, electronic device embodiments, and computer-readable storage medium embodiments, reference may be made to the relevant description of the method embodiments. The present invention is not limited to the specific steps and structures described above and shown in the figures. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of the present invention. Furthermore, for the sake of brevity, detailed descriptions of known methods and techniques are omitted here.
[0128] Those skilled in the art will understand that the above embodiments are exemplary and not restrictive. Different technical features appearing in different embodiments can be combined to achieve beneficial effects. Based on a study of the drawings, specification, and claims, those skilled in the art should be able to understand and implement other variations of the disclosed embodiments. In the claims, the term "comprising" does not exclude other means or steps; the indefinite article "a" does not exclude a plurality; the terms "first" and "second" are used to identify names and not to indicate any particular order. No reference numerals in the claims should be construed as limiting the scope of protection. The functionality of multiple parts appearing in the claims can be implemented by a single hardware or software module. The appearance of certain technical features in different dependent claims does not mean that these technical features cannot be combined to achieve beneficial effects.
Claims
1. A method for constructing a knowledge graph, characterized in that the method includes:
obtaining multiple nodes, and constructing a node network based on the direct relationships between the multiple nodes and on whether the multiple nodes have subordinate objects with the same name, wherein the node network includes the multiple nodes, the connection relationships between the multiple nodes, and the subordinate objects of each node;
performing random walks in the node network with each node as an initial node to obtain multiple walk sequences;
obtaining the embedding vector of each node by training on the multiple walk sequences using a word embedding algorithm;
merging subordinate objects with the same name that belong to nodes in a target node set into one subordinate object, and using the node network after the merging as the knowledge graph, wherein the target node set includes at least two nodes whose embedding vectors have a similarity higher than a similarity recognition threshold;
wherein the step of constructing a node network based on the direct relationships between the multiple nodes and on whether the multiple nodes have subordinate objects with the same name includes:
for any two nodes among the multiple nodes, if there is a direct association relationship between the two nodes, establishing a connection relationship between the two nodes, wherein the direct association relationship includes a subordinate relationship and / or a control relationship;
if any two nodes have subordinate objects with the same name, establishing a connection relationship between the two nodes;
constructing the node network based on the multiple nodes and the connection relationships between the multiple nodes;
wherein the connection relationships between the nodes have weights, and the probability of a random walk is related to the weight of the connection relationship;
the weight of the connection relationship between two nodes that have a direct association relationship is a first weight, the weight of the connection relationship between two nodes that have subordinate objects with the same name is a second weight, and the second weight is set according to the number of subordinate objects with the same name of the two nodes.
2. The method according to claim 1, characterized in that the method further includes: obtaining the intersection and union of the names of the subordinate objects of two nodes in a target node pair, wherein the target node pair includes any two nodes that have subordinate objects with the same name; and obtaining the second weight based on the intersection and union of the names of the subordinate objects of the two nodes in the target node pair and a preset weight adjustment parameter.
3. The method according to claim 1, characterized in that the first weight is higher than the second weight.
4. The method according to claim 1, characterized in that performing random walks in the node network with each node as an initial node to obtain multiple walk sequences includes: if there is no node that has a connection relationship with the initial node, determining that the initial node has no corresponding walk sequence; if there is a node that has a connection relationship with the initial node, randomly walking from the initial node to one of the nodes that have a connection relationship with the initial node, and taking the node reached by the random walk as the current node; if there is a node that has a connection relationship with the current node, randomly walking from the current node to one of the nodes that have a connection relationship with the current node, and taking the node reached by the random walk as the new current node, until the number of random walk steps reaches a preset number of steps; and obtaining the multiple walk sequences based on the nodes traversed by the random walk corresponding to each initial node.
5. The method according to claim 4, characterized in that, for each candidate node, the probability of a random walk from the initial node or the current node to the candidate node is positively correlated with the weight of the connection relationship between the initial node or the current node and the candidate node, wherein the candidate node is a node that has a connection relationship with the initial node or the current node.
6. The method according to claim 5, characterized in that the probability of a random walk from the initial node or the current node to the candidate node is obtained by normalizing the weights of the connection relationships between the initial node or the current node and the candidate nodes.
7. The method according to claim 1, characterized in that the embedding vector is an N-dimensional embedding vector, and the method further includes: mapping the N-dimensional embedding vector of each node to an M-dimensional embedding vector using a dimensionality reduction algorithm, where N and M are both positive integers and N > M; and displaying each node in an M-dimensional coordinate system, where the coordinates of the node are its M-dimensional embedding vector.
8. The method according to claim 7, characterized in that M = 2.
9. A knowledge graph construction device, characterized in that the device includes:
a node network construction module, used to obtain multiple nodes and to construct a node network based on the direct relationships between the multiple nodes and on whether the multiple nodes have subordinate objects with the same name, wherein the node network includes the multiple nodes, the connection relationships between the multiple nodes, and the subordinate objects of each node;
a random walk module, used to perform random walks in the node network with each node as an initial node to obtain multiple walk sequences;
a training module, used to obtain the embedding vector of each node by training on the multiple walk sequences using a word embedding algorithm;
a knowledge graph construction module, used to merge subordinate objects with the same name that belong to nodes in a target node set into one subordinate object, and to use the node network after the merging as the knowledge graph, wherein the target node set includes at least two nodes whose embedding vectors have a similarity higher than a similarity recognition threshold;
wherein the node network construction module is specifically used to: for any two nodes among the multiple nodes, if there is a direct association relationship between the two nodes, establish a connection relationship between the two nodes, wherein the direct association relationship includes a subordinate relationship and / or a control relationship; if any two nodes have subordinate objects with the same name, establish a connection relationship between the two nodes; and construct the node network based on the multiple nodes and the connection relationships between the multiple nodes;
wherein the connection relationships between the nodes have weights, and the probability of a random walk is related to the weight of the connection relationship;
the weight of the connection relationship between two nodes that have a direct association relationship is a first weight, the weight of the connection relationship between two nodes that have subordinate objects with the same name is a second weight, and the second weight is set according to the number of subordinate objects with the same name of the two nodes.
10. An electronic device, characterized in that it includes a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the knowledge graph construction method as described in any one of claims 1 to 8.
11. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the knowledge graph construction method as described in any one of claims 1 to 8.