Subgraph learning based multi-modal knowledge graph representation learning system and product
The multimodal knowledge graph representation learning system based on subgraph learning solves the problem that existing models cannot effectively utilize image information, and realizes efficient representation and application of multimodal knowledge graphs.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XI AN JIAOTONG UNIV
- Filing Date
- 2023-04-18
- Publication Date
- 2026-06-23
AI Technical Summary
Existing knowledge graph representation learning models are unable to effectively find useful information from image information, making them inefficient in handling multimodal knowledge graph application scenarios.
A multimodal knowledge graph representation learning system based on subgraph learning is adopted. The system extracts multimodal structural information of target triples through a multimodal subgraph construction subsystem, fuses multiple channel feature components of node features through a neighborhood feature aggregation subsystem, and predicts missing entities and relationships through a link prediction subsystem. The system achieves efficient fusion of multimodal information by combining feature decoupling layer and graph alignment mechanism.
It achieves efficient fusion of visual modal information in multimodal knowledge graphs, improves the stability and representation ability of embedded representations, and enhances the application flexibility in scenarios where only a small number of entities possess multimodal information.
Figure CN116467463B_ABST
Description
Technical Field
[0001] This invention relates to the field of representation learning technology, and in particular to a multimodal knowledge graph representation learning system and product based on subgraph learning.
Background Technology
[0002] Representation learning has become a valuable method for learning from relational data, but encoding increasingly rich multimodal information has become a significant challenge. Since most knowledge in real-world applications can be represented in the form of graphs, graph representation learning techniques can transform raw knowledge acquired from the real world into low-dimensional vectors that preserve the intrinsic properties of the graph, enabling us to discover deeper relationships within this complex knowledge. Therefore, accurately and efficiently representing multimodal knowledge graphs existing in the real world is a crucial issue.
[0003] Current knowledge graph representation learning techniques have been widely applied in information retrieval, recommendation systems, and question answering. However, representing multimodal knowledge graphs is still beyond the capabilities of most existing knowledge graph representation methods. In practical applications, the rich information within images is often leveraged to enhance the understanding of single-modal knowledge. However, even when embeddings need to be learned for only part of the image information, existing knowledge graph representation learning models still struggle to extract valuable information from images, and thus fail to efficiently address the application scenarios of multimodal knowledge graphs.
Summary of the Invention
[0004] This invention provides a multimodal knowledge graph representation learning system and product based on subgraph learning, in order to solve the problem that existing knowledge graph representation learning models are still unable to effectively find useful information from image information, and therefore cannot efficiently cope with the application scenarios of multimodal knowledge graphs.
[0005] Based on the first aspect, embodiments of the present invention provide a multimodal knowledge graph representation learning system based on subgraph learning, comprising:
[0006] The multimodal subgraph construction subsystem extracts the multimodal structural information of the head entity in the target triple to obtain a multimodal subgraph, which includes a visual scene graph and an egocentric graph of the head entity;
[0007] The neighborhood feature aggregation subsystem fuses multiple channel feature components of node features to obtain a multimodal embedding representation of the entity;
[0008] The link prediction subsystem predicts the missing entities and relations in the target triple based on the multimodal embedding representation of the entity.
[0009] Based on the first aspect, extracting the multimodal structural information of the head entity in the target triple to obtain a multimodal subgraph includes:
[0010] Inputting a head entity image, using Faster R-CNN to extract visual objects, and constructing a visual scene graph;
[0011] Inputting the target triple (h, r, t) and, taking the head entity h as the center, searching its k-hop neighboring nodes to construct the egocentric graph of the head entity.
[0012] Based on the first aspect, the neighborhood feature aggregation subsystem includes:
[0013] The feature decoupling module maps the multimodal subgraph to different modal decoding channels based on the feature decoupling layer, obtains the initial embedding space vector representation of the head entity, projects the visual encoding information to the representation space of the structured information, performs linear fusion, and obtains the preliminary multimodal entity feature embedding.
[0014] The neighborhood feature learning module uses a graph alignment mechanism to align the multimodal subgraphs and obtain a visual information-guided graph alignment weight matrix.
[0015] The multimodal information fusion module, based on the graph alignment weight matrix, fuses the multimodal entity feature embeddings to obtain a multimodal embedding representation of the entity.
[0016] Based on the first aspect, the initial embedding space vector representation includes visual information feature representation and structured information feature representation;
[0017] The feature-based decoupling layer maps multimodal subgraphs to different modal decoding channels to obtain the initial embedding space vector representation of the head entity, including:
[0018] The visual information in the visual scene graph is encoded using a first encoder to obtain the visual information feature representation;
[0019] The structured information in the egocentric graph of the head entity is encoded using a second encoder to obtain the structured information feature representation.
[0020] Based on the first aspect, the method of aligning the multimodal subgraphs using a graph alignment mechanism to obtain a visual information-guided graph alignment weight matrix includes:
[0021] Obtain the node embedding representations of any two graphs in the multimodal subgraph, calculate the pairwise similarity values between all nodes, and obtain the similarity matrix;
[0022] A greedy algorithm is used to calculate the alignment score between the visual scene graph and the egocentric graph in the multimodal subgraph using soft alignment.
[0023] Based on the alignment scores, a graph alignment weight matrix is constructed.
[0024] Based on the first aspect, and based on the multimodal embedding representation of the entity, predicting the missing entities and relations in the target triple includes:
[0025] Construct a scoring function for the link prediction task;
[0026] A network parameter regularization term and a multimodal embedding regularization term are added to the scoring function to construct the overall scoring function;
[0027] Based on the overall scoring function described above, predict the missing entities and relations in the target triples and optimize the model.
[0028] Based on the first aspect, the overall scoring function is defined as follows:
[0029] ,
[0030] In the formula, the multimodal embedding regularization term is a weighted sum of the multimodal embedding scores, whose two subscripts denote the multimodal mode and the single graph-structured information mode, respectively; the score of the target triple is calculated from the multimodal embedding representation; the hyperparameter adjusts the trade-off between diversity and relevance; the set of negatively sampled triples is obtained by randomly replacing the head or tail entity of the input target triple with another entity, or its relation with another relation; and the remaining two symbols denote the target triple and the predicted triple, respectively.
[0031] Based on the second aspect, embodiments of the present invention provide a multimodal knowledge graph representation learning method based on subgraph learning, the method being used in the multimodal knowledge graph representation learning system based on subgraph learning according to any one of the first aspects, comprising:
[0032] Select the head entity image of a target triple from the head entity image set and an untraversed target triple from the multimodal knowledge graph triple set, input them into the multimodal subgraph construction subsystem, and output a visual scene graph and a head entity egocentric graph;
[0033] Input the visual scene graph and the head entity egocentric graph into the neighborhood feature aggregation subsystem, and output a multimodal embedding representation of the entity;
[0034] Input the multimodal feature embedding of the target triple into the link prediction subsystem, and output the prediction result for the entity or relation;
[0035] Repeat the above learning process until every target triple in the multimodal knowledge graph triple set has been traversed.
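The traversal described in the method steps can be sketched as a simple loop. This is a minimal illustration, not the patented implementation: the function name `run_learning_pass` and the three callables `build_subgraphs`, `aggregate`, and `predict` are hypothetical stand-ins for the three subsystems, and no training update is shown.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def run_learning_pass(
    triples: Iterable[Triple],
    images: Dict[str, object],
    build_subgraphs: Callable,
    aggregate: Callable,
    predict: Callable,
) -> List:
    """Traverse every target triple once and collect the predictions."""
    results = []
    for triple in triples:
        head = triple[0]
        # Not every entity needs an image, so the lookup may return None.
        image = images.get(head)
        # Subsystem 1: visual scene graph + head-entity egocentric graph.
        scene_graph, ego_graph = build_subgraphs(image, triple)
        # Subsystem 2: multimodal embedding of the entity.
        embedding = aggregate(scene_graph, ego_graph)
        # Subsystem 3: entity or relation prediction.
        results.append(predict(embedding, triple))
    return results
```

Passing the subsystems as callables keeps the loop independent of how each stage is realized.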
[0036] Thirdly, embodiments of the present invention provide an electronic device, comprising:
[0037] A memory, used to store one or more programs;
[0038] A processor;
[0039] When the processor executes the one or more programs, it implements the multimodal knowledge graph representation learning system based on subgraph learning as described in any of the first aspects above.
[0040] Fourthly, embodiments of the present invention provide a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the multimodal knowledge graph representation learning system based on subgraph learning as described in any of the first aspects.
[0041] This invention has at least the following advantages:
[0042] (1) This invention uses a novel image structure information extraction mechanism and adopts an efficient graph alignment mechanism to introduce structured information in the image, realize the effective aggregation of target entity neighborhood topology information, lay the foundation for the efficient fusion of subsequent multimodal information, and can efficiently learn multimodal knowledge graph embedding representation;
[0043] (2) The present invention uses a feature decoupling layer to decompose and map the multimodal features of entities to multiple embedding spaces, projects the visual modal features into the graph structure feature space, and aggregates the features of all entities in the egocentric graph through the aligned multimodal-aware attention weights while updating the multimodal features of the target entity accordingly. This fully considers the complex relationships implied at the feature level and improves the stability and representation ability of the embedding representation;
[0044] (3) The present invention first condenses and refines the most important neighborhood topology for each target entity node, and then performs neighborhood information propagation and aggregation on this basis. This improves application flexibility in scenarios where only a small number of entities have multimodal information and embedded representations still need to be learned, so the invention can be flexibly applied wherever multimodal knowledge graph embedded representations are required.
Description of the Attached Figures
[0045] Figure 1 is an architecture diagram of a multimodal knowledge graph representation learning system based on subgraph learning in an embodiment of the present invention;
[0046] Figure 2 is a flowchart of the multimodal subgraph construction subsystem in an embodiment of the present invention;
[0047] Figure 3 is a schematic diagram of a multimodal knowledge graph representation learning system based on subgraph learning in an embodiment of the present invention;
[0048] Figure 4 is an architecture diagram of the neighborhood feature aggregation subsystem in an embodiment of the present invention;
[0049] Figure 5 is a flowchart of the neighborhood feature aggregation subsystem in an embodiment of the present invention;
[0050] Figure 6 is a flowchart of the link prediction subsystem in an embodiment of the present invention;
[0051] Figure 7 is a flowchart of a multimodal knowledge graph representation learning method based on subgraph learning in an embodiment of the present invention;
[0052] Figure 8 is a schematic structural block diagram of an electronic device provided in an embodiment of the present invention.
Detailed Implementation
[0053] In practical applications, the rich information in images can compensate for the deficiencies of single-modal knowledge. However, even when embeddings need to be learned for only part of the image information, existing knowledge graph representation learning models still cannot effectively extract useful information from images, and thus fail to efficiently address multimodal knowledge graph application scenarios. To overcome these shortcomings, the present invention provides a multimodal knowledge graph representation learning system based on subgraph learning. On the one hand, the system employs a graph alignment mechanism to efficiently align the visual scene graph with the entity egocentric graph and introduces the structured information in images into representation learning, which addresses the difficulty of fusing and learning visual modal information in multimodal knowledge graphs. On the other hand, the system does not require every entity to have an image, so it adapts to a wide range of real-world knowledge graphs.
[0054] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0055] In a first aspect, an embodiment of the present invention provides a multimodal knowledge graph representation learning system based on subgraph learning. Referring to Figure 1, which is an architecture diagram of the system, the system includes:
[0056] The multimodal subgraph construction subsystem 110 extracts the multimodal structural information of the head entity in the target triple to obtain a multimodal subgraph, which includes a visual scene graph and an egocentric graph of the head entity;
[0057] The neighborhood feature aggregation subsystem 120 fuses multiple channel feature components of entity features to obtain a multimodal embedding representation of the entity;
[0058] The link prediction subsystem 130 predicts the missing entities and relations in the target triple based on the multimodal embedding representation of the entity.
[0059] Specifically, in the multimodal subgraph construction subsystem 110, for the target triple, the head entity is taken as the egocentric entity, its k-hop neighboring nodes are searched, and the selected nodes are used to construct the egocentric graph. For the image corresponding to the head entity, Faster R-CNN is used for object detection, the objects with the highest confidence scores are selected, and a scene graph is constructed from these objects and the relationships between them. In the neighborhood feature aggregation subsystem 120, to reveal the complex interactions of multimodal factors, a decoupling layer is first used to decompose and map the features of each node to multiple feature modalities, and the embedding features of the visual modality are then mapped to the graph-structured feature vector space. A graph alignment mechanism is then used to align the visual scene graph with the egocentric graph of the head entity, yielding the graph alignment weight matrix. Feature aggregation over all entities in the egocentric graph is then performed using channel-aware attention normalized from the alignment weights, which gives the multimodal embedding representation of the entity. In the link prediction subsystem 130, candidates for the missing entity or relation in the target triple are ranked by the score function, computed from the multimodal entity and relation representations, to obtain the final prediction result. In addition, on top of the score function constructed for the modality-specific embeddings, a network parameter regularization term and a regularization term applied to the modality-aware embeddings are added, which forces a certain degree of independence between the embedding representations of different modalities.
[0060] In the multimodal subgraph construction subsystem, extracting the multimodal structural information of the head entity in the target triple to obtain the multimodal subgraph includes:
[0061] Inputting a head entity image, using Faster R-CNN to extract visual objects, and constructing a visual scene graph;
[0062] Inputting the target triple (h, r, t) and, taking the head entity h as the center, searching its k-hop neighboring nodes to construct the egocentric graph of the head entity;
[0063] The input to the multimodal subgraph construction subsystem consists of the multimodal knowledge graph triples and a set of entity images. Each traversal only requires selecting one head entity image from the entity image set and one target triple (h, r, t) from the multimodal knowledge graph triples for learning, where h denotes the head entity, r the relation, and t the tail entity. Each triple expresses the relation r from the head entity h to the tail entity t. By continuously adjusting the vector representations of h, r, and t, h + r is made as close as possible to t, that is, h + r ≈ t. For example, if the input target triple has person A as the head entity, win as the relation, and NBA championship as the tail entity, then during learning the vector representations of the head entity (person A), the relation (win), and the tail entity (NBA championship) are continuously adjusted so that the predicted result is as close as possible to "person A wins the NBA championship".
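One common reading of "continuously adjusting the vector representations" of head, relation, and tail is a translation-style objective in which the head vector plus the relation vector should land near the tail vector. The sketch below assumes that reading; the function names, the squared-error objective, and the learning rate are illustrative choices and are not taken from the patent.

```python
import math


def translation_distance(h, r, t):
    """Euclidean distance between h + r and t; smaller means a better fit."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def sgd_step(h, r, t, lr=0.1):
    """One gradient step on 0.5 * ||h + r - t||^2 with respect to h, r and t."""
    residual = [hi + ri - ti for hi, ri, ti in zip(h, r, t)]
    h_new = [hi - lr * e for hi, e in zip(h, residual)]  # d/dh = residual
    r_new = [ri - lr * e for ri, e in zip(r, residual)]  # d/dr = residual
    t_new = [ti + lr * e for ti, e in zip(t, residual)]  # d/dt = -residual
    return h_new, r_new, t_new


# Toy vectors for (person A, win, NBA championship); values are made up.
h, r, t = [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]
before = translation_distance(h, r, t)
h, r, t = sgd_step(h, r, t)
after = translation_distance(h, r, t)  # strictly smaller than `before`
```

Each step shrinks the residual by a constant factor of (1 - 3 * lr), so repeated steps pull h + r toward t.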
[0064] Referring to Figure 2, which is a flowchart of the multimodal subgraph construction subsystem in an embodiment of the present invention: specifically, for the input head entity image, Faster R-CNN is used in the multimodal subgraph construction subsystem for object detection to obtain the set of visual objects with the highest confidence, and a set of possible relationships between these visual objects is derived from their positions and semantic information in the image. Taking the set of visual objects as the node set of the graph and the set of relationships as the edge set, the visual scene graph is constructed. Since the egocentric graph of the head entity effectively represents the local information of the head entity in a knowledge graph, it is chosen as the subgraph carrying the structured information of the knowledge graph in this system. For the input target triple (h, r, t), with the head entity h as the egocentric entity, the k-hop neighboring entities around it are searched, and each entity is recorded in a node set according to its distance M from the central entity. The search stops once the size of this node set is greater than or equal to the number of visual objects above. These nodes are taken as the node set and their relations in the knowledge graph as the edge set; the graph constructed from these nodes and edges is called the egocentric graph of the head entity h.
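The egocentric graph search can be sketched as a breadth-first traversal that stops as soon as enough nodes have been collected. This is a minimal sketch under stated assumptions: the function name `build_egocentric_graph` is hypothetical, edges are followed in both directions, and the head entity itself is counted in the node set, none of which the text specifies.

```python
from collections import deque


def build_egocentric_graph(triples, head, num_visual_objects):
    """Breadth-first search around `head`; stop once the node set is at
    least as large as the number of visual objects."""
    # Undirected adjacency list (assumption: direction is ignored for search).
    neighbors = {}
    for h, _, t in triples:
        neighbors.setdefault(h, []).append(t)
        neighbors.setdefault(t, []).append(h)

    distance = {head: 0}  # node -> hop distance from the central entity
    queue = deque([head])
    while queue and len(distance) < num_visual_objects:
        node = queue.popleft()
        for nxt in neighbors.get(node, []):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)
                if len(distance) >= num_visual_objects:
                    break

    # Edge set: knowledge graph relations whose endpoints were both collected.
    edges = [(h, r, t) for h, r, t in triples if h in distance and t in distance]
    return distance, edges


kg = [
    ("A", "team", "Warriors"),
    ("Warriors", "win", "NBA"),
    ("A", "is", "player"),
    ("NBA", "in", "USA"),
]
nodes, edges = build_egocentric_graph(kg, "A", 4)
```

Tying the stopping rule to the number of visual objects keeps the two subgraphs at a comparable size before alignment.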
[0065] For example, referring to Figure 3, which is a schematic diagram of a multimodal knowledge graph representation learning system based on subgraph learning in an embodiment of the present invention, an image containing person A is input into the multimodal subgraph construction subsystem and Faster R-CNN is used for object detection. The system obtains the set of the five visual objects with the highest confidence (e.g., man, trophy, basketball, logo, clothing) and, based on their positions and semantic information in the image, derives a set of four possible relationships between the visual objects (e.g., man holding a trophy, basketball in front of man, logo on man, man wearing clothes). The visual object set is treated as the node set of the graph and the relationship set as the edge set, which constructs the visual scene graph. Taking person A as the head entity of the target triple (e.g., head entity person A, relation win, tail entity NBA championship), the system searches the k-hop neighboring entities of person A and records each entity in a set according to its distance M from the central entity (e.g., person A's team is the Golden State Warriors, the Golden State Warriors won the NBA championship, person A is a basketball player). The search continues until the number of recorded nodes is greater than or equal to the number of visual objects. These nodes form the node set, and their relations in the multimodal knowledge graph form the edge set; the graph constructed from these nodes and edges is the egocentric graph of the head entity (person A). The visual scene graph and the egocentric graph of the head entity are then aligned using the graph alignment mechanism, and the resulting representation is input into the link prediction subsystem. The resulting relation prediction is that person A won the NBA championship.
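The scene graph part of this example can be sketched without a real detector. The snippet assumes detections are already available as (label, confidence) pairs and candidate relations as (subject, predicate, object) triples; the confidence values, the extra "crowd" object, and the function name `build_scene_graph` are made up for illustration, and no Faster R-CNN call is shown.

```python
def build_scene_graph(detections, relations, top_n=5):
    """Keep the top_n most confident objects as nodes and the relations
    between kept objects as edges."""
    ranked = sorted(detections, key=lambda d: d[1], reverse=True)[:top_n]
    nodes = [label for label, _ in ranked]
    kept = set(nodes)
    edges = [(s, p, o) for s, p, o in relations if s in kept and o in kept]
    return nodes, edges


# Hypothetical detector output for the image of person A.
detections = [
    ("man", 0.99), ("trophy", 0.95), ("basketball", 0.90),
    ("logo", 0.80), ("clothing", 0.70), ("crowd", 0.30),
]
relations = [
    ("man", "holding", "trophy"),
    ("basketball", "in front of", "man"),
    ("logo", "on", "man"),
    ("man", "wearing", "clothing"),
    ("crowd", "behind", "man"),
]
nodes, edges = build_scene_graph(detections, relations)
# nodes holds the five most confident objects; edges holds the four
# relations whose endpoints both survived the confidence cut.
```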
[0066] Referring to Figure 4, which is an architecture diagram of the neighborhood feature aggregation subsystem in an embodiment of the present invention, the neighborhood feature aggregation subsystem includes:
[0067] The feature decoupling module 210 maps the multimodal subgraph to different modal decoding channels based on the feature decoupling layer, obtains the initial embedding space vector representation of the head entity, projects the visual encoding information to the representation space of the structured information, performs linear fusion, and obtains the preliminary multimodal entity feature embedding.
[0068] The neighborhood feature learning module 220 uses a graph alignment mechanism to align the multimodal subgraphs and obtain a graph alignment weight matrix guided by visual information.
[0069] The multimodal information fusion module 230, based on the graph alignment weight matrix, fuses the multimodal entity feature embeddings to obtain a multimodal embedding representation of the entity.
[0070] The neighborhood feature aggregation subsystem is used to realize the knowledge embedding feature update. It decouples the multimodal features of an entity into multiple embedding feature components, and then uses the weights of the aligned multimodal subgraphs according to the graph alignment mechanism to propagate and aggregate neighborhood information. It fuses and weights the multiple channel feature components of the entity features and updates the features to obtain a multimodal embedding feature representation, thereby improving the entity's embedding representation capability for subsequent link prediction.
[0071] Specifically, in the feature decoupling module 210, the initial embedding space vector representation includes visual information feature representation and structured information feature representation;
[0072] The feature-based decoupling layer maps multimodal subgraphs to different modal decoding channels to obtain the initial embedding space vector representation of the head entity, including:
[0073] The visual information in the visual scene graph is encoded using a first encoder to obtain the visual information feature representation;
[0074] The structured information in the egocentric graph of the head entity is encoded using a second encoder to obtain the structured information feature representation.
[0075] In this embodiment of the application, the first encoder is ViT and the second encoder is ComplEx.
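The second encoder named here is presumably the ComplEx knowledge graph embedding model, which scores a triple by the real part of a trilinear product of complex-valued embeddings. The sketch below shows that standard score using Python's built-in complex numbers; the embedding values are toy numbers and the function name `complex_score` is illustrative.

```python
def complex_score(head, relation, tail):
    """ComplEx score: Re( sum_i head_i * relation_i * conj(tail_i) )."""
    total = sum(h * r * t.conjugate() for h, r, t in zip(head, relation, tail))
    return complex(total).real


# One-dimensional toy embeddings chosen to make the asymmetry visible.
person = [1 + 0j]
championship = [0 + 1j]
win = [0 + 1j]

forward = complex_score(person, win, championship)   # 1.0
backward = complex_score(championship, win, person)  # -1.0
```

Because the tail is conjugated, swapping head and tail can change the score, which is what lets this family of models represent asymmetric relations.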
[0076] Referring to Figure 5, which is a flowchart of the neighborhood feature aggregation subsystem in an embodiment of the present invention: specifically, the feature decoupling module decouples the complex interactions in the multimodal features of entities. By defining a feature decoupling layer, the multimodal subgraph is mapped to different modal decoding channels, which gives the initial embedding space vector representation of the entity. Visual information is encoded using the first encoder, ViT, to obtain the visual feature representation. ViT represents image information more accurately than traditional methods such as CNNs, because it is pre-trained at large scale on data from various domains and then fine-tuned on image datasets from a specific domain, so that it captures the domain-specific information in images more accurately. Structured information is encoded using the second encoder, ComplEx, which provides the initial structural feature representation. Because semantic matching-based methods use the inner product as the final score function, they distinguish the information in 1-N and N-N relation triples more clearly than translation-based models, and they model the structured information in knowledge graphs better than text-image pre-trained models. Starting from these initial representations, the visual encoding information is projected onto the representation space of the structured information and linearly fused to obtain the preliminary multimodal entity feature embedding. In the fusion formula, the visual features are multiplied by a linear projection matrix, and a linear fusion weighting factor adjusts the information contribution of the two modalities.
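The projection and linear fusion step can be illustrated as follows. The exact fusion formula is not recoverable from this text, so the sketch assumes a convex combination, alpha times the structural vector plus (1 - alpha) times the projected visual vector; the names `project`, `fuse`, and `alpha` and all numeric values are illustrative.

```python
def project(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def fuse(structural, visual, projection, alpha=0.5):
    """Assumed form: alpha * structural + (1 - alpha) * (projection @ visual)."""
    projected = project(projection, visual)
    return [alpha * s + (1 - alpha) * p for s, p in zip(structural, projected)]


structural = [1.0, 2.0]      # e.g. from the structural encoder
visual = [1.0, 0.0, 2.0]     # e.g. from the visual encoder (other dimension)
projection = [               # maps the 3-d visual space to the 2-d structural space
    [2.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
fused = fuse(structural, visual, projection, alpha=0.5)  # [1.5, 2.0]
```

The projection matrix is what lets two encoders with different output dimensions be combined in a single space.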
[0077] In the neighborhood feature learning module 220, aligning the multimodal subgraphs based on the graph alignment mechanism to obtain a visual information-guided graph alignment weight matrix includes:
[0078] Obtain the node embedding representations of any two graphs in the multimodal subgraph, calculate the pairwise similarity values between all nodes, and obtain the similarity matrix;
[0079] A greedy algorithm is used to calculate the alignment score between the visual scene graph and the egocentric graph in the multimodal subgraph using soft alignment.
[0080] Based on the alignment scores, a graph alignment weight matrix is constructed.
[0081] In the neighborhood feature learning module, a graph alignment mechanism is used to learn neighborhood features. Aligning the two multimodal subgraphs obtained by the multimodal subgraph construction subsystem, namely the visual scene graph and the egocentric graph of the head entity, to obtain the visually guided graph alignment weight matrix mainly includes the following steps:
[0082] Obtain the node embedding representations of the two graphs;
[0083] Use a greedy algorithm to softly align the visual scene graph with the egocentric graph of the head entity;
[0084] Specifically, the node embedding representations of the two graphs are obtained as follows. For each node, its in-degree and out-degree features are first calculated; in the corresponding formula, the terms are the degrees of the node's k-hop neighbors, the diameter of the graph, and a factor that weights the importance of the node degrees. The pairwise similarity values between all nodes are then calculated to obtain the similarity matrix; in the similarity formula, a scalar parameter controls the effect of the structured information, and the degree features are the in-degree and out-degree features of the two nodes being compared. A number of landmark nodes are selected from the visual scene graph and the egocentric graph of the head entity, and the similarities between the landmark nodes and all nodes are calculated to obtain one matrix; the similarity values among the landmark nodes themselves are then extracted to obtain a second, smaller matrix. The first matrix can then be decomposed to obtain the node similarity embedding representation matrix, using the full-rank singular value decomposition of the generalized inverse of the second matrix. Through this decomposition, the embedding representations of the nodes of the two graphs are obtained.
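The degree-based similarity step can be sketched in plain Python. The formulas themselves are missing from this text, so the snippet assumes one plausible form: each node gets a degree histogram over its k-hop neighbors, discounted per hop by a factor `delta`, and two nodes are compared with exp(-gamma * squared distance). It uses undirected degree for brevity and does not show the landmark selection or the singular value decomposition; all names are illustrative.

```python
import math


def khop_layers(adj, start, max_hops):
    """Return [nodes at hop 1, nodes at hop 2, ...] around `start`."""
    seen, frontier, layers = {start}, [start], []
    for _ in range(max_hops):
        nxt = []
        for u in frontier:
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    nxt.append(v)
        layers.append(nxt)
        frontier = nxt
    return layers


def structural_feature(adj, node, max_hops, delta, max_degree):
    """Degree histogram of k-hop neighbors, weighted by delta ** (hop - 1)."""
    feat = [0.0] * (max_degree + 1)
    for k, layer in enumerate(khop_layers(adj, node, max_hops)):
        for v in layer:
            feat[len(adj[v])] += delta ** k
    return feat


def similarity(fu, fv, gamma=1.0):
    """exp(-gamma * ||fu - fv||^2); gamma scales the structural effect."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(fu, fv)))


# Path graph a - b - c: the two end nodes are structurally identical.
adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
fa = structural_feature(adj, "a", max_hops=2, delta=0.5, max_degree=2)
fb = structural_feature(adj, "b", max_hops=2, delta=0.5, max_degree=2)
fc = structural_feature(adj, "c", max_hops=2, delta=0.5, max_degree=2)
```

Since the features depend only on local degree structure, nodes from two different graphs can be compared without any shared identifiers, which is what makes them usable for aligning a scene graph with a knowledge subgraph.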
[0085] A greedy algorithm is then used to softly align the visual scene graph with the egocentric graph of the head entity. Specifically, to limit time complexity, a soft alignment method is used to align nodes by similarity: rather than matching each node in one graph against every node in the other graph and then selecting the pair with the highest similarity score, only the most likely top matching nodes are computed. To this end, the node embeddings of one graph are stored in a KD-tree, and for each node in the other graph the nearest neighbor algorithm quickly selects its closest nodes. The similarity scores of these node pairs are then calculated, and the alignment scores between the two graphs are used to construct the graph alignment weight matrix. In the formula, the score is the similarity between the embedding vectors of the two corresponding nodes.
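The soft alignment idea, keeping only the few most similar counterparts per node, can be sketched as below. Two simplifications are assumptions of this sketch: a brute-force scan with `heapq.nlargest` stands in for the KD-tree nearest-neighbor lookup, and the similarity is taken as exp(-squared distance) between embeddings. The function name `soft_align` and the toy embeddings are illustrative.

```python
import heapq
import math


def soft_align(emb_scene, emb_ego, top=2):
    """For each scene-graph node, keep the `top` most similar ego-graph
    nodes and return a sparse weight matrix {(u, v): score}."""
    weights = {}
    for u, yu in emb_scene.items():
        scored = []
        for v, yv in emb_ego.items():
            dist_sq = sum((a - b) ** 2 for a, b in zip(yu, yv))
            scored.append((math.exp(-dist_sq), v))
        # Greedy choice: only the best `top` candidates are retained.
        for score, v in heapq.nlargest(top, scored):
            weights[(u, v)] = score
    return weights


emb_scene = {"man": [0.0, 0.0]}
emb_ego = {"A": [0.0, 0.0], "Warriors": [1.0, 0.0], "NBA": [3.0, 0.0]}
alignment = soft_align(emb_scene, emb_ego, top=2)
```

For large graphs a spatial index such as a KD-tree avoids the full pairwise scan, which is the motivation the text gives for using one.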
[0086] In the multimodal information fusion module 230, multimodal information fusion is performed. After the graph alignment weight matrix guided by visual information has been obtained, the multimodal feature information of the neighborhood needs to be fused. Based on the obtained alignment weights, each entity in the visual scene graph corresponds to the head entity in the egocentric graph, and this correspondence can be used to deeply fuse the multimodal entity feature embeddings of the neighborhood and obtain the resulting multimodal entity embedding. In the formula, a normalization function is applied to the alignment weights, and the embeddings being fused are the preliminary multimodal entity feature embeddings.
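A weighted aggregation of this kind can be sketched as follows. The text only says that a normalization function is applied to the alignment weights, so dividing by the sum is an assumption of this sketch (a softmax would be another option); the function names and values are illustrative.

```python
def normalize(weights):
    """Scale non-negative weights so they sum to 1 (assumed normalization)."""
    total = sum(weights)
    return [w / total for w in weights]


def aggregate(embeddings, weights):
    """Weighted sum of preliminary entity embeddings using normalized
    alignment weights."""
    norm = normalize(weights)
    dim = len(embeddings[0])
    return [sum(a * e[i] for a, e in zip(norm, embeddings)) for i in range(dim)]


# Two neighborhood embeddings with alignment weights 3 and 1.
neighbors = [[1.0, 0.0], [0.0, 1.0]]
fused_entity = aggregate(neighbors, [3.0, 1.0])  # [0.75, 0.25]
```

Neighbors with higher alignment weight contribute more to the final entity embedding.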
[0087] For example, the visual scene graph and egocentric graph of person A obtained by the multimodal subgraph construction subsystem are input into the neighborhood feature aggregation subsystem. In the feature decoupling module, the visual information in the visual scene graph is encoded using ViT to obtain the visual feature representation, and the structured information in the egocentric graph is encoded using ComplEx to obtain the structured feature representation. These are then linearly fused to obtain the preliminary multimodal entity feature embedding. Neighborhood features are learned with the graph alignment mechanism, which yields a visual information-guided graph alignment weight matrix for the visual scene graph and the egocentric graph. Based on this weight matrix, the preliminary multimodal entity feature embeddings are fused to obtain the multimodal embedding representation of the entity.
[0088] In the link prediction subsystem 130, based on the multimodal embedding representation of the entity, the missing entities and relations in the target triple are predicted, including:
[0089] Construct a scoring function for the link prediction task;
[0090] A network parameter regularization term and a multimodal embedding regularization term are added to the scoring function to construct the overall scoring function;
[0091] Based on the overall scoring function, predict the missing entities and relations in the target triples and optimize the model.
[0092] The overall scoring function is defined as follows:
[0093] [formula omitted in the source text]
[0094] In the formula, the first term is the weighted sum of the multimodal embedding scores, whose two subscripts denote the multimodal mode and the single graph-structure information mode, respectively; the score term is the score of the target triple calculated from the multimodal embedding representation; the hyperparameter adjusts the trade-off between diversity and relevance; the negative sample set contains the triples obtained by randomly replacing the head or tail entity of an input target triple with another entity, or by randomly replacing its relation with another relation; and the remaining two symbols denote the target triple and the predicted triple, respectively.
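The negative sampling described above (replacing the head entity, the tail entity, or the relation of a target triple at random) can be sketched as follows. This is a minimal illustration. The function names, the fixed random seed, and the filtering of corrupted triples that already exist in the knowledge graph are assumptions rather than details stated in the text.

```python
import random


def corrupt(triple, entities, relations, rng):
    """Replace exactly one slot (head, relation, or tail) with a different value."""
    head, relation, tail = triple
    slot = rng.choice(("head", "relation", "tail"))
    if slot == "head":
        head = rng.choice([e for e in entities if e != head])
    elif slot == "tail":
        tail = rng.choice([e for e in entities if e != tail])
    else:
        relation = rng.choice([r for r in relations if r != relation])
    return (head, relation, tail)


def sample_negatives(triple, entities, relations, known, count, seed=0):
    """Draw up to `count` distinct corrupted triples that are not known facts.

    Assumption: corrupted triples that already exist in the graph are skipped.
    """
    rng = random.Random(seed)
    negatives = set()
    for _ in range(100 * count):  # bounded attempts so the loop always ends
        if len(negatives) == count:
            break
        candidate = corrupt(triple, entities, relations, rng)
        if candidate not in known:
            negatives.add(candidate)
    return sorted(negatives)


entities = ["Person A", "Golden State Warriors", "NBA championship", "basketball player"]
relations = ["team", "win", "occupation"]
known = {
    ("Person A", "team", "Golden State Warriors"),
    ("Golden State Warriors", "win", "NBA championship"),
}
target = ("Person A", "win", "NBA championship")
negatives = sample_negatives(target, entities, relations, known, count=5)
```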
[0095] The link prediction subsystem predicts missing entities or relations based on the multimodal embedding representation of entities output by the neighborhood feature aggregation subsystem. See Figure 6, which is a flowchart of the link prediction subsystem according to an embodiment of the present invention. Specifically, after the multimodal embedding representation of the target triple is obtained, the link prediction subsystem proceeds as follows:
[0096] The scoring function is defined as follows. Since the main task of this system is to predict the missing entities and relations in target triples, the focus is primarily on the link prediction task. In the multimodal knowledge graph representation learning system based on subgraph learning provided in this embodiment of the invention, the scoring function for the link prediction task [formula omitted in the source text] combines the following terms: a multimodal embedding regularization term, which is the weighted sum of the multimodal embedding scores, with subscripts denoting the multimodal mode and the single graph-structure information mode, respectively; the score calculated for each triple from the embedding representation obtained in the previous step; and a network regularization term, together with a hyperparameter used to adjust the trade-off between diversity and relevance. The set of negatively sampled triples is formed by randomly replacing the head or tail entity of an input target triple with another entity, or by randomly replacing its relation with another relation.
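Because the formula itself is missing from this text, the following is only a hedged sketch of how an objective with these ingredients could be assembled, not the patented function. It assumes a ComplEx-style triple score, a logistic loss over the target triple and its negative samples, a weighted sum over the two modes, and a single coefficient `lam` applied to squared-norm regularizers. The relation table stands in for the network parameters, and all names are hypothetical.

```python
import numpy as np


def complex_score(h, r, t):
    """ComplEx-style score: real part of sum(h * r * conj(t))."""
    return float(np.real(np.sum(h * r * np.conj(t))))


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return float(np.logaddexp(0.0, x))


def mode_loss(positive, negatives, ent, rel):
    """Logistic loss: push the target triple's score up and the negatives' scores down."""
    h, r, t = positive
    loss = softplus(-complex_score(ent[h], rel[r], ent[t]))
    for nh, nr, nt in negatives:
        loss += softplus(complex_score(ent[nh], rel[nr], ent[nt]))
    return loss


def squared_norm(table):
    """Sum of squared magnitudes over every vector in an embedding table."""
    return float(sum(np.sum(np.abs(v) ** 2) for v in table.values()))


def overall_objective(positive, negatives, ent_by_mode, rel, mode_weights, lam):
    """Weighted sum of per-mode losses plus embedding and parameter regularization.

    Assumption: one coefficient `lam` scales both regularization terms.
    """
    data_term = sum(
        w * mode_loss(positive, negatives, ent_by_mode[m], rel)
        for m, w in mode_weights.items()
    )
    embedding_reg = sum(squared_norm(ent) for ent in ent_by_mode.values())
    parameter_reg = squared_norm(rel)
    return data_term + lam * (embedding_reg + parameter_reg)
```

Here `ent_by_mode` maps a mode name (for example a multimodal mode and a structure-only mode) to an entity embedding table, and `mode_weights` holds the weights of the weighted sum.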
[0097] For example, after each target triple in the multimodal knowledge graph triple set has been traversed (e.g., Person A's team is the Golden State Warriors, the Golden State Warriors won the NBA championship, Person A is a basketball player), a score is obtained for each target triple. The triples are then sorted by score to obtain the final prediction result that Person A won the NBA championship.
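The ranking step in this example can be sketched as follows. The scores in `toy_scores` are made-up illustrative numbers, not outputs of the system, and `rank_tails` is a hypothetical helper that accepts any scoring callable.

```python
def rank_tails(head, relation, candidates, score_fn):
    """Order candidate tail entities from highest to lowest triple score."""
    return sorted(
        candidates,
        key=lambda tail: score_fn(head, relation, tail),
        reverse=True,
    )


# Illustrative scores only (assumed values for the sketch).
toy_scores = {
    ("Person A", "win", "NBA championship"): 0.91,
    ("Person A", "win", "Golden State Warriors"): 0.20,
    ("Person A", "win", "basketball player"): 0.05,
}


def toy_score(head, relation, tail):
    """Look up a toy score; unseen triples score zero."""
    return toy_scores.get((head, relation, tail), 0.0)


ranking = rank_tails(
    "Person A",
    "win",
    ["basketball player", "Golden State Warriors", "NBA championship"],
    toy_score,
)
```

The first element of `ranking` is taken as the predicted tail entity for the query (Person A, win, ?).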
[0098] In the above implementation, the present invention provides a multimodal knowledge graph representation learning system based on subgraph learning. A multimodal subgraph construction subsystem extracts the multimodal structural information of the head entity in the target triple to obtain a multimodal subgraph, which includes a visual scene graph and an egocentric graph of the head entity; a neighborhood feature aggregation subsystem fuses multiple channel feature components of the entity features to obtain a multimodal embedding representation of the entity; and a link prediction subsystem predicts the missing entities and relations in the target triple based on that representation. The system employs a novel image structure information extraction mechanism and an efficient graph alignment mechanism to introduce structured information from images and to aggregate the neighborhood topology information of target entities effectively. This lays the foundation for the subsequent fusion of multimodal information and enables efficient learning of multimodal knowledge graph embedding representations. A feature decoupling layer decomposes entity multimodal features and maps them to multiple embedding spaces, projecting visual modality features into the graph structure feature space. At the same time, the aligned multimodal-aware attention weights are used to aggregate the features of entities in all egocentric graphs and to update the multimodal features of the target entity accordingly. This improves the stability and representational power of the embedding representation and takes into account the complex relationships implicit at the feature level. For each target entity node, the most important neighborhood topology is condensed and refined, and neighborhood information is propagated and aggregated on that basis, which improves application flexibility in scenarios where only a few entities possess multimodal information. The system can be flexibly applied to scenarios requiring multimodal knowledge graph embedding representations.
[0099] Based on the same inventive concept as the first aspect described above, this invention also proposes a multimodal knowledge graph representation learning method based on subgraph learning, which is used in the multimodal knowledge graph representation learning system based on subgraph learning described in the first aspect. See Figure 7, which is a flowchart of the method according to an embodiment of the present invention. The method includes:
[0100] Select a target triple head entity image from the head entity image set and an untraversed target triple from the multimodal knowledge graph triple set, input them into the multimodal subgraph construction subsystem, and output a visual scene graph and a head entity egocentric graph;
[0101] Input the visual scene graph and the head entity egocentric graph into the neighborhood feature aggregation subsystem, and output a multimodal embedding representation of the entity;
[0102] Input the multimodal embedding representation of the entity into the link prediction subsystem, and output the prediction result for the entity or relation;
[0103] Repeat the above learning process until every target triple in the multimodal knowledge graph triple set has been traversed.
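The four steps above amount to a loop over the triple set. The sketch below shows only that control flow. The three subsystems are passed in as callables, and their names and signatures are assumptions. Picking the first image of the head entity is also an assumption, since the text does not state how the image is selected.

```python
def run_learning_pass(triples, head_images, build_subgraphs, aggregate, predict):
    """Traverse every target triple once and collect a prediction for each."""
    results = {}
    for triple in triples:
        head = triple[0]
        images = head_images.get(head, [])
        # Assumption: use the first available image; None if the entity has no image.
        image = images[0] if images else None
        scene_graph, ego_graph = build_subgraphs(image, triple)
        embedding = aggregate(scene_graph, ego_graph)
        results[triple] = predict(embedding, triple)
    return results


# Stub subsystems that only demonstrate the data flow between the steps.
def stub_build(image, triple):
    return ("scene", image), ("ego", triple[0])


def stub_aggregate(scene_graph, ego_graph):
    return (scene_graph, ego_graph)


def stub_predict(embedding, triple):
    return triple[2]


triples = [
    ("Person A", "win", "NBA championship"),
    ("Person A", "team", "Golden State Warriors"),
]
head_images = {"Person A": ["image_01.jpg"]}
results = run_learning_pass(triples, head_images, stub_build, stub_aggregate, stub_predict)
```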
[0104] For example, the multimodal knowledge graph triple set and the head entity image set are input into the multimodal subgraph construction subsystem, which generates a scene graph for each image and an egocentric graph for each head entity. Assuming Person A is the head entity and there are 40 images containing Person A, these 40 images are used as the image set for the head entity. A target triple head entity image is selected from this image set (as shown in Figure 3), together with an untraversed target triple from the multimodal knowledge graph triple set (also as shown in Figure 3). For example, if the head entity is Person A, the relation is "win", and the tail entity is "NBA championship", the multimodal subgraph construction subsystem constructs a visual scene graph and an egocentric graph of Person A. The constructed visual scene graph and the head entity's egocentric graph are then input into the neighborhood feature aggregation subsystem, where a graph alignment mechanism aligns the visual scene graph with the egocentric graph, resulting in a multimodal embedding representation of the entity. Subsequently, the multimodal embedding representation of the entity is input into the link prediction subsystem, and the final prediction result (e.g., Person A wins the NBA championship) is obtained from the ranking of the scores calculated by the scoring function. This process is repeated until every target triple in the multimodal knowledge graph triple set has been traversed.
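Constructing the egocentric graph of a head entity can be sketched as a breadth-first search over the triple set, bounded by a hop limit. This is a minimal illustration. The name `egocentric_graph`, the `max_hops` parameter, and the choice to follow edges in both directions are assumptions.

```python
from collections import deque


def egocentric_graph(triples, head, max_hops):
    """Return the nodes within `max_hops` of `head` and the triples among them."""
    adjacency = {}
    for h, _, t in triples:
        # Assumption: edges are followed in both directions during the search.
        adjacency.setdefault(h, []).append(t)
        adjacency.setdefault(t, []).append(h)

    depth = {head: 0}
    queue = deque([head])
    while queue:
        node = queue.popleft()
        if depth[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adjacency.get(node, []):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)

    nodes = set(depth)
    edges = [tr for tr in triples if tr[0] in nodes and tr[2] in nodes]
    return nodes, edges


triples = [
    ("Person A", "team", "Golden State Warriors"),
    ("Golden State Warriors", "win", "NBA championship"),
    ("Person A", "occupation", "basketball player"),
    ("Person B", "team", "Other Team"),
]
one_hop_nodes, one_hop_edges = egocentric_graph(triples, "Person A", max_hops=1)
two_hop_nodes, two_hop_edges = egocentric_graph(triples, "Person A", max_hops=2)
```

With two hops, the fact that Person A's team won the NBA championship enters the neighborhood, which is the structural evidence the example relies on when predicting (Person A, win, NBA championship).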
[0105] In the above implementation, the present invention provides a multimodal knowledge graph representation learning method based on subgraph learning. The method selects a target triple head entity image from the head entity image set and an untraversed target triple from the multimodal knowledge graph triple set, and inputs them into the multimodal subgraph construction subsystem to output a visual scene graph and a head entity egocentric graph. The visual scene graph and the head entity egocentric graph are then input into the neighborhood feature aggregation subsystem to output a multimodal embedding representation of the entity, which is in turn input into the link prediction subsystem to output the prediction result for the entity or relation. This learning process is repeated until every target triple in the multimodal knowledge graph triple set has been traversed. The method employs a novel image structure information extraction mechanism and an efficient graph alignment mechanism to introduce structured information from images and to aggregate the neighborhood topology information of target entities effectively. This lays the foundation for the subsequent fusion of multimodal information and enables efficient learning of multimodal knowledge graph embedding representations. A feature decoupling layer decomposes entity multimodal features and maps them to multiple embedding spaces, projecting visual modality features into the graph structure feature space. At the same time, the aligned multimodal-aware attention weights are used to aggregate the features of entities in all egocentric graphs and to update the multimodal features of the target entities accordingly. This improves the stability and representational power of the embedding representation and takes into account the complex relationships implicit at the feature level. For each target entity node, the most important neighborhood topology is condensed and refined, and neighborhood information is propagated and aggregated on that basis, which improves application flexibility in scenarios where only a few entities possess multimodal information. The method can be flexibly applied to scenarios requiring multimodal knowledge graph embedding representations.
[0106] See Figure 8, which is a schematic structural block diagram of an electronic device provided in an embodiment of the present invention. The electronic device includes a memory 101, a processor 102, and a communication interface 103. The memory 101, processor 102, and communication interface 103 are electrically connected to each other directly or indirectly to realize data transmission or interaction. For example, these components can be electrically connected to each other through one or more communication buses or signal lines. The memory 101 can be used to store software programs and modules, such as the program instructions/modules corresponding to the multimodal knowledge graph representation learning system based on subgraph learning provided in an embodiment of the present invention. The processor 102 performs various functional applications and data processing by executing the software programs and modules stored in the memory 101. The communication interface 103 can be used for signaling or data communication with other node devices.
[0107] The memory 101 may be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), etc.
[0108] The processor 102 can be an integrated circuit chip with signal processing capabilities. The processor 102 can be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it can also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
[0109] It can be understood that the structure shown in Figure 8 is for illustrative purposes only; the electronic device may include more or fewer components than those shown in Figure 8, or have a configuration different from that shown in Figure 8. The components shown in Figure 8 can be implemented using hardware, software, or a combination thereof.
[0110] In the embodiments provided by this invention, it should be understood that the disclosed systems and methods can also be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions marked in the blocks may occur in a different order than those marked in the drawings. For example, two consecutive blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram and / or flowchart, and combinations of blocks in block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or action, or using a combination of dedicated hardware and computer instructions.
[0111] In addition, the functional modules in the various embodiments of the present invention can be integrated together to form an independent part, or each module can exist independently, or two or more modules can be integrated to form an independent part.
[0112] If the aforementioned functions are implemented as software functional modules and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this invention, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0113] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
[0114] It will be apparent to those skilled in the art that the present invention is not limited to the details of the exemplary embodiments described above, and that the invention can be implemented in other specific forms without departing from its spirit or essential characteristics. Therefore, the embodiments should be considered in all respects as exemplary and non-limiting, and the scope of the invention is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be included within the present invention. No reference numerals in the claims should be construed as limiting the scope of the claims.
Claims
1. A multimodal knowledge graph representation learning system based on subgraph learning, characterized in that it comprises:
a multimodal subgraph construction subsystem, which extracts the multimodal structural information of the head entity in a target triple to obtain a multimodal subgraph, the multimodal subgraph including a visual scene graph and an egocentric graph of the head entity;
a neighborhood feature aggregation subsystem, which fuses multiple channel feature components of entity features to obtain a multimodal embedding representation of the entity; and
a link prediction subsystem, which predicts the missing entities and relations in the target triple based on the multimodal embedding representation of the entity;
wherein extracting the multimodal structural information of the head entity in the target triple to obtain the multimodal subgraph includes: inputting a head entity image and using Faster R-CNN to extract visual objects and construct the visual scene graph; and inputting the target triple and, centered on the head entity, searching the neighbor nodes within a given number of hops to construct the egocentric graph of the head entity;
wherein the neighborhood feature aggregation subsystem includes: a feature decoupling module, which maps the multimodal subgraph to different modal decoding channels based on a feature decoupling layer, obtains the initial embedding space vector representation of the head entity, projects the visual encoding information into the representation space of the structured information, and performs linear fusion to obtain a preliminary multimodal entity feature embedding; a neighborhood feature learning module, which aligns the multimodal subgraphs using a graph alignment mechanism to obtain a visual-information-guided graph alignment weight matrix; and a multimodal information fusion module, which fuses the multimodal entity feature embeddings based on the graph alignment weight matrix to obtain the multimodal embedding representation of the entity;
wherein aligning the multimodal subgraphs using the graph alignment mechanism to obtain the visual-information-guided graph alignment weight matrix includes: obtaining the node embedding representations of any two graphs in the multimodal subgraph and calculating the similarity values between all pairs of nodes to obtain a similarity matrix; using a greedy algorithm for soft alignment to calculate the alignment scores between the visual scene graph and the egocentric graph in the multimodal subgraph; and constructing the graph alignment weight matrix based on the alignment scores.
2. The learning system according to claim 1, characterized in that the initial embedding space vector representation includes a visual information feature representation and a structured information feature representation; and mapping the multimodal subgraph to different modal decoding channels based on the feature decoupling layer to obtain the initial embedding space vector representation of the head entity includes: encoding the visual information in the visual scene graph using a first encoder to obtain the visual information feature representation; and encoding the structured information in the egocentric graph of the head entity using a second encoder to obtain the structured information feature representation.
3. The learning system according to claim 1, characterized in that predicting the missing entities and relations in the target triple based on the multimodal embedding representation of the entity includes: constructing a scoring function for the link prediction task; adding a network parameter regularization term and a multimodal embedding regularization term to the scoring function to construct an overall scoring function; and, based on the overall scoring function, predicting the missing entities and relations in the target triple and optimizing the model.
4. The learning system according to claim 3, characterized in that the overall scoring function is defined as follows: [formula omitted in the source text]. In the formula, the first term is the weighted sum of the multimodal embedding scores, whose two subscripts denote the multimodal mode and the single graph-structure information mode, respectively; the score term is the score of the target triple calculated from the multimodal embedding representation; the hyperparameter adjusts the trade-off between diversity and relevance; the negative sample set contains the triples obtained by randomly replacing the head or tail entity of an input target triple with another entity, or by randomly replacing its relation with another relation; and the remaining two symbols denote the target triple and the predicted triple, respectively.
5. A multimodal knowledge graph representation learning method based on subgraph learning, characterized in that the method is used in the multimodal knowledge graph representation learning system based on subgraph learning as described in any one of claims 1-4, and comprises: selecting a target triple head entity image from the head entity image set and an untraversed target triple from the multimodal knowledge graph triple set, inputting them into the multimodal subgraph construction subsystem, and outputting a visual scene graph and a head entity egocentric graph; inputting the visual scene graph and the head entity egocentric graph into the neighborhood feature aggregation subsystem, and outputting a multimodal embedding representation of the entity; inputting the multimodal embedding representation of the entity into the link prediction subsystem, and outputting the prediction result for the entity or relation; and repeating the above learning process until every target triple in the multimodal knowledge graph triple set has been traversed.
6. An electronic device, characterized in that it comprises: a memory for storing one or more programs; and a processor; wherein, when the processor executes the one or more programs, the multimodal knowledge graph representation learning system based on subgraph learning as described in any one of claims 1-4 is implemented.
7. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the computer program implements the multimodal knowledge graph representation learning system based on subgraph learning as described in any one of claims 1-4.