A multimodal temporal knowledge graph alignment method, device and medium
By constructing a hierarchically stacked neural symbol evolution hypergraph and agent collaboration groups, the alignment uncertainty problem of multimodal knowledge graphs in dynamic environments is solved, achieving more efficient and accurate entity alignment.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NAT UNIV OF DEFENSE TECH
- Filing Date
- 2026-03-11
- Publication Date
- 2026-06-19
AI Technical Summary
Existing knowledge graph alignment methods cannot effectively cope with the asynchronous evolution of multimodal evidence over time, and it is difficult to make reliable alignment decisions in dynamic and uncertain environments.
By constructing a hierarchical stacked neural symbol evolution hypergraph, the hypergraph and hyperedges are used to perform temporal and modal slicing of entities in a structured manner. Combined with alignment entropy, the intelligent agent cooperative groups are dynamically triggered to identify and resolve evidence conflicts between different modal layers and update node weights to optimize the alignment probability distribution.
It significantly improves the alignment accuracy and robustness of multimodal temporal knowledge graphs in open-world scenarios, effectively handles the temporal evolution and modal asynchrony of multimodal evidence, and improves the reliability and accuracy of alignment.
Figure CN122242678A_ABST
Description
Technical Field
[0001] This invention relates to the field of knowledge graph alignment technology, specifically to a method, device, and medium for aligning multimodal temporal knowledge graphs.
Background Technology
[0002] Knowledge graphs store real-world entities and their relationships in a structured format, serving as a crucial infrastructure for the field of artificial intelligence. Traditional knowledge graphs primarily focus on static facts, but in practical applications, entity attributes and their relationships often evolve dynamically over time, giving rise to research on temporal knowledge graphs. Temporal knowledge graphs introduce a time dimension, representing facts as quadruples (head entity, relation, tail entity, time interval) to capture the dynamic evolution of knowledge.
[0003] With the explosive growth of multimodal data, entities and their states at different points in time are often recorded through various media types, including images, text, audio, and video. Multimodal temporal knowledge graphs have emerged to address this need. Building upon traditional temporal knowledge graphs, they associate entities with multimodal observation data that changes over time, thus providing a more comprehensive reflection of the evolutionary nature of entities in the real world. These graphs offer crucial support for downstream tasks such as information retrieval, recommendation systems, and natural language processing.
[0004] In practical applications, due to differences in the construction methods, coverage, and update frequency of knowledge graphs from different sources, the representation of the same real-world entity often varies across different graphs, and incomplete information is a common problem in each graph. Therefore, entity alignment across heterogeneous multimodal temporal knowledge graphs has become a key technology for integrating distributed knowledge and building a more comprehensive knowledge base.
[0005] Existing knowledge graph alignment methods fall mainly into two categories. The first is alignment based on temporal structure modeling, which treats time as a structural constraint or relational attribute and captures the evolution of topological structure through techniques such as graph neural networks and sequence models. However, such methods usually assume that entities have only structural information or simplified text attributes, and they lack mechanisms for processing high-dimensional time-varying observation data such as images, audio, and video. The second is alignment based on multimodal evidence fusion, which aggregates heterogeneous observation data such as text and images through static fusion mechanisms or attention mechanisms. However, such methods treat multimodal data as time-invariant global descriptors and ignore how entity attributes evolve over time, so conflicting evidence from different periods is merged indiscriminately, leading to alignment errors in dynamic scenarios.
[0006] In summary, existing alignment methods face challenges in open-world scenarios: the multimodal evidence of entities evolves over time, with different modalities potentially appearing, drifting, or disappearing at different points in time, and the evolution process often occurring asynchronously across different knowledge graphs. Existing methods cannot effectively handle this temporally evolving evidence flow and struggle to make reliable alignment decisions in dynamic and uncertain environments.
Summary of the Invention
[0007] This invention provides a method, device, and medium for aligning multimodal temporal knowledge graphs. Its purpose is to solve the technical problem that existing technologies cannot effectively cope with the asynchronous evolution of multimodal evidence over time and make reliable alignment decisions in dynamic and uncertain environments.
[0008] To achieve the above objectives, the first aspect of the present invention provides a method for aligning multimodal temporal knowledge graphs, comprising the following steps:
Obtain a first multimodal temporal knowledge graph and a second multimodal temporal knowledge graph, each multimodal temporal knowledge graph containing multiple entities and multimodal observation data with time indexes associated with the entities;
Perform neural retrieval on the source entities in the first multimodal temporal knowledge graph and the target entities in the second multimodal temporal knowledge graph to obtain a candidate entity set for the source entities;
Apply temporal projection constraints and modal projection constraints to the target entities in the candidate entity set to obtain projection instances that are temporally and modally aligned with the source entity;
Using the projection instances as nodes and the source entities as hyperedges, construct modality-specific hypergraphs for each modality;
Stack the modality-specific hypergraphs and connect different modal layers through the identity of the source entity to construct a neural symbolic evolution hypergraph;
Based on the probability distribution of candidate alignments in the evolutionary hypergraph, calculate the alignment entropy of the source entity in each modal layer;
When the alignment entropy exceeds a preset threshold, select at least one agent from the agent pool to form a cooperative group;
The cooperative group identifies conflicts between target entities supported by hyperedge clusters corresponding to different modal layers and performs inference actions to update the weights of nodes in the evolutionary hypergraph;
Iteratively execute the steps of calculating the alignment entropy, selecting the agent, identifying conflicts, and updating weights until the iteration termination condition is met, and then output the alignment relationship between the source entity and the target entity.
[0009] Furthermore, the method for obtaining the candidate entity set of the source entity includes: calculating the embedding similarity between the source entity and each target entity in the second multimodal temporal knowledge graph based on the text modal data of the source entity; and selecting, from the second multimodal temporal knowledge graph, a preset number of target entities with the highest embedding similarity as the candidate entity set of the source entity.
[0010] Furthermore, the method for applying temporal projection constraints to the target entities in the candidate entity set includes: Obtain the set of valid timestamps of the source entity; Observation data whose timestamps belong to the set of valid timestamps are selected from the multimodal observation data of the target entity as observation evidence of time alignment; Based on the observational evidence of temporal alignment, a temporally aligned projection instance of the target entity is generated.
[0011] Furthermore, the method for applying modal projection constraints to the target entities in the candidate entity set includes: Obtain the set of modal types in which the source entity exists; Observation data whose modal type belongs to the modal type set are selected from the multimodal observation data of the target entity as observation evidence for modal alignment; Based on the observational evidence of modal alignment, a projection instance of the target entity aligned modally is generated.
[0012] Furthermore, the method for constructing modality-specific hypergraphs for each modality includes: For each modality, when the target entity has a projection instance in the modality, a node in the corresponding modality is created for the target entity; For each source entity, a hyperedge is constructed from the nodes of those target entities in the source entity's candidate entity set for which a node has been created; A weight is assigned to each node in the hyperedge, the weight being based on the initial retrieval similarity between the source entity and the target entity.
[0013] Furthermore, methods for constructing neural symbolic evolution hypergraphs include: The modality-specific hypergraphs are stacked to form a global evolutionary hypergraph containing multiple modality layers; For each source entity, the hyperedges corresponding to the source entity in different modal layers are associated to construct a cross-layer hyperedge cluster of the source entity; The cross-layer hyperedge cluster is used as the basic unit for cross-modal reasoning in the neural symbol evolution hypergraph.
[0014] Furthermore, the method for calculating the alignment entropy of the source entity at each modal layer includes: For each modal layer, obtain the current weight of each projected node in the hyperedge corresponding to the source entity; Calculate the probability that each projection node is correctly aligned based on its current weight. Based on the probability of each projection node, the alignment entropy of the source entity on this modal layer is calculated according to the definition of entropy.
[0015] Furthermore, the method for updating the weights of nodes in the evolutionary hypergraph includes: For each source entity, obtain the candidate target entity with the highest current probability in the hyperedge corresponding to each modal layer, and collect these into a set of top candidates; If this set contains more than one distinct target entity, it is determined that the source entity has a conflict between different modal layers; For a conflicting source entity, the agents in the cooperative group reason over the multimodal time-series observation data of the source entity to obtain a confidence increment for each candidate target entity; The weights of nodes within the corresponding modal layer hyperedges in the evolutionary hypergraph are updated based on the confidence increment.
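The conflict check and weight update described in this paragraph can be illustrated with a minimal Python sketch. It is not the claimed implementation: the dictionary layout (`cluster[modality][target_id] = weight`), the function names, and the additive update rule are assumptions made for illustration, since the text only states that weights are updated based on the confidence increment.

```python
def top_candidates(cluster):
    """Return the highest-weight target per modal layer.

    Under any monotonic normalization of weights into probabilities,
    the highest-weight node is also the most probable one.
    """
    return {m: max(edge, key=edge.get) for m, edge in cluster.items() if edge}


def has_conflict(cluster):
    """A source entity conflicts when its layers back different targets."""
    return len(set(top_candidates(cluster).values())) > 1


def apply_increments(cluster, increments):
    """Assumed additive update: w <- w + delta for each candidate target."""
    for edge in cluster.values():
        for target in edge:
            edge[target] += increments.get(target, 0.0)


# Hypothetical hyperedges of one source entity in two modal layers.
cluster = {
    "text": {"t1": 0.9, "t2": 0.6},
    "image": {"t1": 0.4, "t2": 0.7},
}
conflict_before = has_conflict(cluster)   # text backs t1, image backs t2
apply_increments(cluster, {"t1": 0.5})    # increment supplied by the agents
conflict_after = has_conflict(cluster)
```

In this toy run the text layer initially favors `t1` and the image layer favors `t2`; after the assumed increment for `t1` is applied, both layers agree.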
[0016] To achieve the above objectives, a second aspect of the present invention provides an electronic device including a memory and a processor, the memory being used to store a program that supports the processor in executing the multimodal temporal knowledge graph alignment method, and the processor being configured to execute the program stored in the memory.
[0017] To achieve the above objectives, a third aspect of the present invention provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, performs the steps of the multimodal temporal knowledge graph alignment method.
[0018] The beneficial effects of this invention are: Compared with existing technologies, the present invention provides a multimodal temporal knowledge graph alignment method, device, and medium. By introducing hypergraphs and hyperedges to structure and organize temporal and modal slices of entities, a hierarchically stacked neural symbol evolution hypergraph is constructed. This unifies asynchronously evolving multimodal observation data into a comparable inference space. Simultaneously, based on alignment entropy, it dynamically triggers agent collaboration groups to identify and resolve evidence conflicts between different modal layers. Node weights are updated as needed to optimize the alignment probability distribution. Thus, during iterative inference, the uncertainties caused by temporal drift and modality loss are gradually eliminated, significantly improving the accuracy and robustness of multimodal temporal knowledge graph alignment in open-world scenarios.
Attached Figure Description
[0019] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below.
[0020] Figure 1 is a flowchart of a multimodal temporal knowledge graph alignment method disclosed in an embodiment of the present invention.
[0021] Figure 2 is a diagram of the agent collaboration framework on the neural symbolic evolution hypergraph disclosed in an embodiment of the present invention.
Detailed Implementation
[0022] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.
[0023] According to embodiments of the present invention, it should be noted that the steps shown in the flowcharts of the accompanying drawings can be executed in a computer system such as a set of computer-executable instructions, and although a logical order is shown in the following methods, in some cases the steps shown or described may be executed in a different order than that shown here.
[0024] As shown in Figure 1, this invention provides a method for aligning multimodal temporal knowledge graphs, including: Step S100: Obtain the first multimodal temporal knowledge graph and the second multimodal temporal knowledge graph. Each multimodal temporal knowledge graph contains multiple entities and multimodal observation data with time indexes associated with the entities. This step obtains the two multimodal temporal knowledge graphs to be aligned, denoted as the first multimodal temporal knowledge graph $\mathcal{G}_1$ and the second multimodal temporal knowledge graph $\mathcal{G}_2$. The two graphs originate from different data sources, such as public knowledge bases like Wikipedia, YAGO, and ICEWS, or are constructed by fusing multiple heterogeneous databases. Each graph contains a large number of entities and associated multimodal observation data with explicit time indexes. The observation data includes not only traditional structured facts but also various media types that change over time, such as text descriptions, images, audio, and video, thus providing a more comprehensive reflection of the dynamic evolutionary characteristics of real-world entities.
[0025] For clarity, the following gives a formal definition of a multimodal temporal knowledge graph and related concepts. A multimodal temporal knowledge graph $\mathcal{G}$ can be represented as a six-tuple:
$$\mathcal{G} = (\mathcal{E}, \mathcal{R}, \mathcal{F}, \mathcal{T}, \mathcal{M}, \mathcal{O})$$
[0026] where $\mathcal{E}$ is the set of entities, representing objects or concepts in the knowledge graph, such as people, organizations, and events; $\mathcal{R}$ is the set of relations, representing the semantic associations between entities, such as "employed in" or "location of occurrence"; $\mathcal{F}$ is the set of facts, each fact being represented as a quintuple $(h, r, t, \tau, \mu)$, in which $h$ and $t$ are the head entity and the tail entity, respectively, $r$ is the relation type, $\tau$ is the timestamp or time interval during which the fact is valid, and $\mu$ is the set of multimodal information types associated with the fact; $\mathcal{T}$ is the set of timestamps used to mark the valid time of facts or entity observations, supporting time-series alignment and evolutionary analysis; $\mathcal{M}$ is the set of modal types, typically including text, images, audio, and video, used to represent the multi-dimensional semantic information of entities; and $\mathcal{O}$ is the collection of multimodal observation streams associated with the entities.
[0027] During acquisition of the graphs, each entity $e \in \mathcal{E}$ must be linked to the multimodal observation data recorded for it, possibly at multiple time points; these data constitute its observation stream $\mathcal{O}(e)$, defined as:
$$\mathcal{O}(e) = \{(x_i, \tau_i, m_i)\}_{i=1}^{N_e}$$
[0028] where $x_i$ is the feature representation of the modality data (such as the feature vector of an image or the transcribed text of audio), $\tau_i \in \mathcal{T}$ is the observation timestamp, $m_i \in \mathcal{M}$ is the modal type, and $N_e$ is the number of observations of entity $e$. This observation stream reflects the temporal evolution of entities, meaning that the same entity may exhibit different visual appearances, semantic descriptions, or media representations at different points in time.
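As an illustration of the observation stream just defined, the following Python sketch stores each observation as a (feature, timestamp, modality) record and derives an entity's valid timestamp set and modal type set from its stream. The class name, field names, and example values are hypothetical and chosen only for illustration; real feature representations would come from modality encoders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    feature: tuple   # feature representation (toy placeholder values)
    timestamp: int   # observation timestamp, here a year
    modality: str    # modal type, e.g. "text", "image", "audio", "video"


def valid_timestamps(stream):
    """Set of timestamps at which the entity has any observation."""
    return {obs.timestamp for obs in stream}


def modality_types(stream):
    """Set of modal types present in the entity's observation stream."""
    return {obs.modality for obs in stream}


# Hypothetical observation stream of one entity.
stream = [
    Observation((0.1, 0.9), 2019, "text"),
    Observation((0.4, 0.2), 2019, "image"),
    Observation((0.3, 0.8), 2021, "text"),
]
```

Here `valid_timestamps(stream)` and `modality_types(stream)` give the two sets that a time filter and a modality filter would need from an entity.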
[0029] In practical applications, the first graph $\mathcal{G}_1$ and the second graph $\mathcal{G}_2$ may be constructed in significantly different ways. For example, $\mathcal{G}_1$ may focus more on structured event data (such as political events in ICEWS), while $\mathcal{G}_2$ may contain rich multimodal descriptions (such as biographical text and historical photos on a Wikipedia page for a person). Furthermore, the observational data for the same entity in the two graphs are often asynchronous in terms of temporal distribution and modal completeness, reflecting the asynchronous evolutionary characteristics of an "open world."
[0030] This step yielded two structurally similar but content-heterogeneous multimodal temporal knowledge graphs, providing raw data input for subsequent entity alignment. During implementation, these graphs typically require preprocessing, including constructing entity alignment annotations, extracting features from modal data (such as using BLIP-2 for image description generation and Whisper for audio-to-text conversion), and timestamp normalization, to provide a unified input format for subsequent neural retrieval and evolutionary hypergraph construction.
[0031] Step S200: Perform neural retrieval on the source entities in the first multimodal temporal knowledge graph and the target entities in the second multimodal temporal knowledge graph to obtain a candidate entity set of the source entities; In step S200, for each source entity $e_s$ in the first multimodal temporal knowledge graph $\mathcal{G}_1$, neural retrieval over the target entity set $\mathcal{E}_2$ of the second multimodal temporal knowledge graph $\mathcal{G}_2$ quickly filters out a small number of target entities that are most likely to match the source entity, forming the candidate entity set of that source entity. This process relies on the semantic stability of the text modality, that is, on the assumption that the textual descriptions of entities (such as names, tags, and summaries) are expressed in a relatively consistent way across different knowledge graphs and can therefore serve as a reliable basis for coarse-grained alignment.
[0032] Specifically, for each source entity $e_s$, its textual modal data (e.g., entity names or descriptions) is first encoded into a semantic vector using a pre-trained language model (such as BERT); similarly, the text modal data of all entities in the target graph $\mathcal{G}_2$ are vectorized in the same way to construct a semantic index of the target entities. Then, an embedding similarity search (usually based on cosine similarity or Euclidean distance) is used to calculate the semantic similarity $\operatorname{sim}(e_s, e_t)$ between $e_s$ and each target entity $e_t \in \mathcal{E}_2$, and the $K$ target entities with the highest similarity are selected to form the candidate entity set $\mathcal{C}(e_s)$:
$$\mathcal{C}(e_s) = \operatorname{TopK}_{e_t \in \mathcal{E}_2} \operatorname{sim}(e_s, e_t)$$
[0033] where $K$ is a preset hyperparameter that controls the size of the candidate set, keeping the search space for subsequent fine-grained alignment small while maintaining recall. This step acts as a fast "neural filter," significantly reducing the computational complexity of subsequent processing. To improve retrieval efficiency, specialized approximate nearest neighbor search libraries (such as Faiss) can be used to accelerate the similarity calculation over large-scale embedding vectors, thus meeting the speed requirements of practical applications.
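The retrieval step can be sketched in a few lines of Python. This is a brute-force illustration that assumes the text embeddings are already available as plain lists of floats and that cosine similarity is the chosen measure; the function names and toy vectors are hypothetical, and a production system would more likely use an approximate nearest neighbor index as noted above.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_k_candidates(source_vec, target_vecs, k):
    """Return the k (target_id, similarity) pairs with highest similarity."""
    scored = [(tid, cosine(source_vec, vec)) for tid, vec in target_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional text embeddings of three target entities.
targets = {"t1": [1.0, 0.0], "t2": [0.7, 0.7], "t3": [0.0, 1.0]}
candidates = top_k_candidates([1.0, 0.1], targets, k=2)
```

The returned similarity scores are kept alongside the identifiers because later steps reuse them as initial node weights.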
[0034] It should be noted that the neural retrieval stage relies solely on the semantic features of the text modality, without incorporating other modalities (such as images and audio) or temporal information. Its purpose is to quickly obtain a candidate set with high recall. The candidate set may contain noisy or mismatched entities, which will be further identified and corrected in subsequent steps through temporal projection, modal projection, and agent reasoning.
[0035] Step S300: Apply temporal projection constraints and modal projection constraints to the target entities in the candidate entity set to obtain projection instances that are temporally and modally aligned with the source entity. After the candidate entity set $\mathcal{C}(e_s)$ of the source entity $e_s$ has been obtained in step S200, this step applies temporal and modal projections to each target entity $e_t$ in the candidate set to extract observational evidence comparable to the source entity in both the temporal and modal dimensions, thereby generating fine-grained projection instances. This process addresses the asynchronicity of evidence evolution in open-world multimodal temporal knowledge graphs: two equivalent entities may present informative evidence at different times or in different modalities, and directly comparing the original observation streams can lead to mismatches or noise interference. Therefore, the target entity needs to be decoupled into conditional views, so that it is compared with the source entity only within a shared temporal context and a compatible modal space.
[0036] Specifically, for a target entity $e_t$, its multimodal time-series observation stream $\mathcal{O}(e_t)$ is represented as:
$$\mathcal{O}(e_t) = \{(x_j, \tau_j, m_j)\}_{j=1}^{N_t}$$
[0037] where $x_j$ is the feature representation of the observed data, $\tau_j$ is the observation timestamp, and $m_j$ is the modal type. To obtain observational evidence that is temporally aligned with the source entity, a temporal projection operator $\Pi_T$ is defined, which filters the observation stream of the target entity according to the valid timestamp set of the source entity. The mathematical form of the temporal projection is:
$$\Pi_T(e_t \mid e_s) = \{(x_j, \tau_j, m_j) \in \mathcal{O}(e_t) \mid \tau_j \in \mathcal{T}(e_s) \cap \mathcal{T}(e_t)\}$$
[0038] where $\mathcal{T}(e_s)$ is the set of valid timestamps of the source entity $e_s$, that is, all time points in the source graph at which the entity has observed data, and $\mathcal{T}(e_t)$ is the set of valid timestamps of the target entity $e_t$. Through temporal projection, only those observations of the target entity whose timestamps also fall within the valid time of the source entity are retained, so that the comparison is strictly limited to the common temporal intersection and cross-temporal confusion is avoided.
[0039] Meanwhile, to address missing or inconsistent modalities, a modal projection operator $\Pi_M$ is defined, which filters the observation stream $\mathcal{O}(e_t)$ of the target entity $e_t$ according to the set of modal types possessed by the source entity $e_s$. The mathematical form of the modal projection is:
$$\Pi_M(e_t \mid e_s) = \{(x_j, \tau_j, m_j) \in \mathcal{O}(e_t) \mid m_j \in \mathcal{M}(e_s)\}$$
[0040] where $\mathcal{M}(e_s) \subseteq \mathcal{M}$ is the subset of modal types (e.g., text, images, audio, video) in which the source entity $e_s$ has observations. Through modal projection, only those observations of the target entity whose modal type matches a modal type of the source entity are retained, thus ensuring that the comparison is performed within a shared modal space.
[0041] In practice, to obtain projection instances that are aligned both temporally and modally, the two projection operators must be applied jointly, that is, the observations that simultaneously satisfy the temporal and modal conditions are filtered out of the observation stream of the target entity. This joint filtering can be expressed as:
$$\Pi(e_t \mid e_s) = \{(x_j, \tau_j, m_j) \in \mathcal{O}(e_t) \mid \tau_j \in \mathcal{T}(e_s) \cap \mathcal{T}(e_t),\ m_j \in \mathcal{M}(e_s)\}$$
where $\mathcal{O}(e_t)$ is the observation stream of the target entity $e_t$, $\mathcal{T}(\cdot)$ denotes the valid timestamp set of an entity, and $\mathcal{M}(e_s)$ is the modal type set of the source entity $e_s$.
[0042] After the above projection process, the original target entity $e_t$ is decoupled into a set of projection instances, denoted $\{\Pi^{m}(e_t \mid e_s)\}_{m \in \mathcal{M}(e_s)}$, where $\Pi^{m}(e_t \mid e_s)$ collects the retained observations of modal type $m$. Each projection instance represents a slice of the target entity under specific time-aligned and modality-specific conditions. It encapsulates observational evidence consistent with the evolution of the source entity, providing atomic inference units for the subsequent construction of the evolutionary hypergraph. It is worth noting that if a candidate target entity has no valid observations after projection (i.e., $\Pi(e_t \mid e_s) = \emptyset$), it does not participate in the subsequent construction of the hypergraph, which further reduces the alignment search space.
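The temporal, modal, and joint projections described above amount to simple filters over the target entity's observation stream. The Python sketch below assumes each observation is a `(feature, timestamp, modality)` tuple and that timestamps are compared by exact equality; the function names and sample data are hypothetical.

```python
def temporal_projection(target_stream, source_stream):
    """Keep target observations whose timestamp also occurs for the source."""
    source_times = {tau for _, tau, _ in source_stream}
    return [obs for obs in target_stream if obs[1] in source_times]


def modal_projection(target_stream, source_stream):
    """Keep target observations whose modality the source also has."""
    source_modalities = {m for _, _, m in source_stream}
    return [obs for obs in target_stream if obs[2] in source_modalities]


def joint_projection(target_stream, source_stream):
    """Apply both filters and group the survivors by modality.

    Each non-empty group plays the role of one projection instance.
    """
    kept = modal_projection(
        temporal_projection(target_stream, source_stream), source_stream
    )
    instances = {}
    for obs in kept:
        instances.setdefault(obs[2], []).append(obs)
    return instances


source = [("s_txt", 2019, "text"), ("s_img", 2020, "image")]
target = [
    ("t_txt", 2019, "text"),    # shared time and shared modality: kept
    ("t_aud", 2019, "audio"),   # source has no audio: dropped
    ("t_img", 2022, "image"),   # source has no 2022 observation: dropped
]
instances = joint_projection(target, source)
```

A candidate whose `instances` dictionary is empty would be excluded from hypergraph construction, matching the pruning rule stated above.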
[0043] Step S400: Using the projection instances as nodes and the source entities as hyperedges, construct a modality-specific hypergraph for each modality; After the projection instances of each target entity under a specific modality have been obtained, this step constructs a structured modality-specific hypergraph for each modality type, organizing fragmented observational evidence into a unified representation. The modality-specific hypergraph explicitly models the alignment relationship between the source and target entities under a specific modality through the formal definition of nodes, hyperedges, and weights.
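[0044] First, for each modality $m \in \mathcal{M}$, where $\mathcal{M}$ is the modal type set, the hypergraph in this modality is defined as $\mathcal{H}^m = (V^m, E^m)$, where $V^m$ is the node set and $E^m$ is the hyperedge set. The node set is constructed as shown in the following formula:
$$V^m = \{\, v_t^m \mid \exists\, e_s \in \mathcal{E}_1 \ \text{s.t.}\ e_t \in \mathcal{C}(e_s) \wedge \Pi^m(e_t \mid e_s) \neq \emptyset \,\}$$
[0045] In the formula, $v_t^m$ is the projection node created for the target entity $e_t$; $\exists$ is the existential quantifier, indicating "there exists"; $\mathcal{E}_1$ is the source entity set, i.e., the set of entities in the first graph; "s.t." means "such that"; $\mathcal{C}(e_s)$ is the candidate entity set of the source entity $e_s$; $\Pi^m(e_t \mid e_s)$ is the slice of observations of $e_t$ in modality $m$ that is retained after projection against $e_s$; and $\emptyset$ is the empty set.
[0046] In other words, for a target entity $e_t$, if there exists a source entity $e_s$ such that $e_t$ belongs to the candidate set $\mathcal{C}(e_s)$ and, after projection, the target entity still retains valid observational evidence in modality $m$ (i.e., $\Pi^m(e_t \mid e_s)$ is not empty), then a projection node $v_t^m$ is created for that target entity. This node encapsulates the time-aligned observation slice $\Pi^m(e_t \mid e_s)$ and serves as the atomic unit for subsequent reasoning. A node therefore represents a slice of the target entity with comparable evidence under a specific modality.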
[0047] The construction of the hyperedge set is shown in the following formula:
[0048] $$E^m = \{\, h_s^m \mid e_s \in \mathcal{E}_1 \,\}, \qquad h_s^m = \{\, v_t^m \in V^m \mid e_t \in \mathcal{C}(e_s) \,\}$$
[0049] In the formula, $E^m$ is the set of hyperedges under modality $m$, that is, all hyperedges in the hypergraph specific to this modality; $h_s^m$ is the hyperedge constructed for the source entity $e_s$ under modality $m$, which associates the source entity with the projection nodes of all its candidate target entities in that modality; $\mathcal{E}_1$ is the source entity set, i.e., all entities in the first graph; $v_t^m$ is the projection node created for the target entity $e_t$, which encapsulates the observation slice of the target entity in modality $m$ after time-series projection alignment; $\mathcal{C}(e_s)$ is the candidate entity set of $e_s$; and $V^m$ is the set of nodes under modality $m$, that is, the set of all projection nodes obtained by the node set construction formula in step S400.
[0050] For each source entity $e_s$, a hyperedge $h_s^m$ is constructed. The hyperedge contains the nodes of all target entities that belong to the candidate set $\mathcal{C}(e_s)$ of the source entity and for which a projection node has been created under modality $m$. Hyperedges associate the source entity with all its possible candidate targets in a specific modality, forming an alignment candidate group within the modality. This hyperedge structure allows a source entity to connect to multiple candidate targets simultaneously, reflecting the uncertainty within the modality.
[0051] To carry over the prior information from the neural retrieval phase, each node within the hyperedge is assigned an initial weight, as shown in the following formula:
$$w_{s,t}^{m} = \operatorname{sim}(e_s, e_t)$$
[0052] where $w_{s,t}^{m}$ is the weight, under modality $m$, of the projection node of the target entity $e_t$ in the hyperedge corresponding to the source entity $e_s$, and $\operatorname{sim}(e_s, e_t)$ is the embedding similarity score between the source entity $e_s$ and the target entity $e_t$ obtained through neural retrieval in step S200. This weight serves as the initial confidence of a node in a hyperedge, reflecting the likelihood of a match based on textual semantics.
[0053] Through the above steps, an independent modality-specific hypergraph is constructed for each modality $m$, organizing projection instances as nodes and source entities as hyperedges, with initial weights indicating the importance of nodes. These hypergraphs collectively constitute the "modal layers" of the evolutionary hypergraph, providing structured subgraph components for the subsequent construction of a unified neural symbolic evolution hypergraph. Within this representation space, alignment evidence from different modalities is stored separately, maintaining modality independence while preserving an entry point for subsequent cross-modal collaborative reasoning.
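Step S400 can be sketched in Python as a two-pass construction: the first pass creates a node for a target entity in a modality whenever some source entity leaves it a non-empty projection slice there, and the second pass builds one hyperedge per source entity per modality, weighted by the retrieval similarity. The nested-dictionary layout and all names below are illustrative assumptions, not the patented data structure.

```python
def build_modality_hypergraphs(candidates, projections):
    """Build one hypergraph layer per modality.

    candidates:  {source_id: {target_id: retrieval_similarity}}
    projections: {(source_id, target_id): {modality: [observations]}}
    returns:     {modality: {"nodes": set, "hyperedges": {source_id: {target_id: weight}}}}
    """
    layers = {}
    # Pass 1: create a node for target t in modality m when some source
    # entity that lists t as a candidate leaves it a non-empty slice in m.
    for (s, t), slices in projections.items():
        if t not in candidates.get(s, {}):
            continue
        for m, observations in slices.items():
            if observations:
                layer = layers.setdefault(m, {"nodes": set(), "hyperedges": {}})
                layer["nodes"].add(t)
    # Pass 2: the hyperedge of source s in modality m holds every candidate
    # of s that has a node in that layer, weighted by retrieval similarity.
    for layer in layers.values():
        for s, scored in candidates.items():
            edge = {t: sim for t, sim in scored.items() if t in layer["nodes"]}
            if edge:
                layer["hyperedges"][s] = edge
    return layers


candidates = {"s1": {"t1": 0.9, "t2": 0.6}}
projections = {
    ("s1", "t1"): {"text": ["obs"], "image": ["obs"]},
    ("s1", "t2"): {"text": ["obs"], "image": []},   # no image evidence survives
}
layers = build_modality_hypergraphs(candidates, projections)
```

In this toy input `t2` appears in the text layer only, so the image-layer hyperedge of `s1` contains a single candidate.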
[0054] Step S500: Stack the modality-specific hypergraphs and connect different modal layers through the identity of the source entity to construct a neural symbolic evolution hypergraph; After the modality-specific hypergraphs $\mathcal{H}^m$ (with $m \in \mathcal{M}$) have been constructed, this step integrates these independent modal layers into a unified neural symbolic evolution hypergraph $\mathcal{H}$, which provides a global representation space for cross-modal collaborative reasoning. The evolutionary hypergraph does not simply place the modal layers side by side; it uses the identity of the source entity as a logical link to associate hyperedges in different modal layers that correspond to the same source entity, forming cross-layer hyperedge clusters. This enables agents to aggregate evidence and detect conflicts between different modalities.
[0055] Specifically, the modality-specific hypergraphs $\mathcal{H}^m = (V^m, E^m)$ are first stacked structurally to form a global evolutionary hypergraph containing multiple modal layers, $\mathcal{H} = (V, E, W)$, where the node set $V = \bigcup_{m \in \mathcal{M}} V^m$ is the union of the node sets of the modal layers, the hyperedge set $E = \bigcup_{m \in \mathcal{M}} E^m$ is the union of the hyperedge sets of the modal layers, and the weights $W$ inherit the initial weights of the nodes within each hyperedge. The modal layers are initially disjoint: every node and hyperedge belongs to one specific modality, and there are no direct connections across layers.
[0056] To achieve cross-modal collaborative reasoning, connections must be established between the modal layers. For each source entity $e_s$, the hyperedges corresponding to the source entity in different modal layers are associated to form the cross-layer hyperedge cluster $\mathcal{B}(e_s)$ of the source entity. Its formal definition is shown in the following formula:
$$\mathcal{B}(e_s) = \{\, h_s^m \mid m \in \mathcal{M} \,\}$$
[0057] where $h_s^m$ is the hyperedge constructed for the source entity $e_s$ under modality $m$ (see the definition in step S400), which contains the projection nodes of all candidate target entities. The cross-layer hyperedge cluster $\mathcal{B}(e_s)$ organizes all candidate evidence of the same source entity in different modalities into a logical "evidence package". The hyperedges in the package share the same source entity identity, but each contains candidate target nodes in a different modality.
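The cross-layer association can be illustrated with a short Python sketch that regroups per-modality hyperedges by source entity. It assumes the layers are stored as nested dictionaries (`layers[modality]["hyperedges"][source_id]` mapping target identifiers to weights); this layout and the function name are hypothetical. The cluster holds references to the same hyperedge dictionaries rather than copies, so a later weight update made through a cluster is visible in the corresponding layer.

```python
def cross_layer_clusters(layers):
    """Group the hyperedges of each source entity across modal layers.

    returns: {source_id: {modality: hyperedge}}, sharing the hyperedge objects
    """
    clusters = {}
    for modality, layer in layers.items():
        for source_id, hyperedge in layer["hyperedges"].items():
            clusters.setdefault(source_id, {})[modality] = hyperedge
    return clusters


# Hypothetical stacked layers for one source entity "s1".
layers = {
    "text": {"hyperedges": {"s1": {"t1": 0.9, "t2": 0.6}}},
    "image": {"hyperedges": {"s1": {"t1": 0.9}}},
}
clusters = cross_layer_clusters(layers)

# Updating a weight through the cluster also changes the layer's hyperedge.
clusters["s1"]["image"]["t1"] = 1.2
```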
[0058] Through this association mechanism, the originally discrete mode-specific hypergraphs are integrated into a neural symbolic evolution hypergraph with cross-layer connectivity. This hypergraph has the following properties: 1) The construction of nodes and hyperedges is based on the similarity of neural retrieval. This demonstrates the soft matching capability of neural representation; while the construction of cross-layer hyperedge clusters is based on the symbolic identity of the source entity, reflecting the symbolic logical constraints, which transforms the alignment problem into a navigation task on the hypergraph.
[0059] 2) The evolutionary hypergraph preserves modal independence (each layer maintains a hyperedge structure) and establishes cross-layer associations through the identity of the source entity, which facilitates intra-modal comparison and inter-modal conflict detection during subsequent on-demand reasoning.
[0060] 3) Agents can perform intra-modal reasoning on a single hyperedge $h_{e_s}^{m}$ (i.e., comparing the weights of different candidate nodes within the same modality), and can also perform cross-modal reasoning on a cross-layer hyperedge cluster $\mathcal{C}(e_s)$ (i.e., comparing candidate results supported by different modal layers, aggregating complementary evidence, or identifying conflicts).
[0061] The completed neural-symbolic evolutionary hypergraph $\mathcal{G}$ serves as the foundational state space for on-demand agent reasoning in subsequent steps. Its node weights are dynamically updated as the reasoning process iterates, ultimately guiding the system to converge to a low-entropy, definite aligned state.
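The stacking and cross-layer association described above can be illustrated with a minimal sketch. The class name `EvolvingHypergraph`, its methods, and the toy weights are hypothetical and chosen only for illustration; the sketch assumes each hyperedge is stored as a mapping from candidate target entity to weight.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class EvolvingHypergraph:
    """Stacked modal layers; one hyperedge per (modality, source entity)."""

    # layers[modality][source_entity] -> {candidate_target: weight}
    layers: dict = field(default_factory=lambda: defaultdict(dict))

    def add_hyperedge(self, modality, source, candidates):
        """Register the hyperedge of `source` in one modal layer.

        `candidates` maps each candidate target entity to its initial weight
        (assumed here to be a neural retrieval similarity).
        """
        self.layers[modality][source] = dict(candidates)

    def cluster(self, source):
        """Cross-layer hyperedge cluster: every hyperedge sharing `source`."""
        return {
            modality: edges[source]
            for modality, edges in self.layers.items()
            if source in edges
        }


graph = EvolvingHypergraph()
graph.add_hyperedge("image", "s1", {"t1": 0.9, "t2": 0.4})
graph.add_hyperedge("text", "s1", {"t1": 0.3, "t2": 0.8})
graph.add_hyperedge("text", "s2", {"t3": 0.7})

evidence_package = graph.cluster("s1")
print(sorted(evidence_package))  # modalities holding evidence for s1
```

Layers stay independent (each modality keeps its own hyperedges), and the only cross-layer link is the shared source entity key, which mirrors the symbolic identity constraint in the text.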
[0062] Step S600: Based on the probability distribution of candidate alignments in the evolutionary hypergraph, calculate the alignment entropy of the source entity in each modal layer. After the neural-symbolic evolutionary hypergraph $\mathcal{G}$ has been constructed, this step quantifies the alignment uncertainty of each source entity $e_s$ at the different modal layers, providing a basis for decision-making in subsequent on-demand agent inference. The alignment entropy is calculated from the current weights of the nodes within the hyperedges of each modal layer: the weights are normalized into a probability distribution, and the discriminability between candidate targets in that modality is then measured according to the definition of information entropy.
[0063] For a given source entity $e_s$ and modality $m$, the corresponding hyperedge is $h_{e_s}^{m}$. The hyperedge contains several projection nodes $v$, each representing the projection instance of a candidate target entity in that modality. At the $k$-th iteration, the weight of node $v$ within the hyperedge is denoted $w_k^{m}(v)$. These weights are initially derived from the similarity obtained during neural retrieval in step S400 and may be updated during subsequent inference. Based on these weights, the probability that each node is the correct alignment in the current iteration is calculated using the softmax function:
$$P_k(v \mid e_s, m) = \frac{\exp\left(w_k^{m}(v)\right)}{\sum_{v' \in h_{e_s}^{m}} \exp\left(w_k^{m}(v')\right)}$$
[0064] In the formula, $P_k(v \mid e_s, m)$ is the probability, at the $k$-th iteration and given source entity $e_s$ and modality $m$, that the candidate target entity represented by projection node $v$ is correctly aligned; $\exp(\cdot)$ is the exponential function, used to convert weights to non-negative values for normalization; $w_k^{m}(v)$ is the weight of projection node $v$ in the hyperedge $h_{e_s}^{m}$ of source entity $e_s$ in modality $m$ at the $k$-th iteration; and the sum over $v' \in h_{e_s}^{m}$ traverses all projection nodes within the hyperedge.
[0065] The denominator sums over all nodes within the hyperedge $h_{e_s}^{m}$ to ensure that the probabilities are normalized. This probability distribution reflects, given the current evidence, the probability that source entity $e_s$ matches each candidate target in modality $m$.
[0066] To quantify the uncertainty of the alignment result in a modal layer, the alignment entropy is defined using information entropy, as shown in the following formula:
$$H_k(e_s, m) = -\sum_{v \in h_{e_s}^{m}} P_k(v \mid e_s, m)\, \log P_k(v \mid e_s, m)$$
[0067] In the above formula, $H_k(e_s, m)$ is the alignment entropy of source entity $e_s$ in modality $m$. A higher entropy value indicates a more uniform probability distribution among candidate targets in the current modality, meaning the evidence is more ambiguous and correct matches are harder to distinguish; a lower entropy value indicates that the probability distribution is concentrated on one candidate node and the alignment result tends to be certain. When the entropy value exceeds a preset threshold $\tau$, the evidence provided by the modal layer is insufficient to support a reliable alignment decision, and subsequent agent reasoning needs to be triggered to reduce uncertainty.
[0068] In practice, for each source entity $e_s$, all modalities $m$ are traversed and the corresponding alignment entropy $H_k(e_s, m)$ is calculated for each. These entropy values are a crucial component of the system state $S_k$, providing a quantitative indicator with which the meta-agent monitors entities with high uncertainty. Note that the above formula only measures uncertainty within a single modality; cross-modal conflict detection is performed in subsequent steps based on the candidate results supported by different modal layers. Through this step, uncertainty in the evolutionary hypergraph is explicitly modeled, laying the foundation for triggering and iteratively optimizing on-demand reasoning.
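The per-layer computation described in this step (softmax over the node weights of one hyperedge, followed by information entropy) can be sketched as follows. The function names and the toy weights are illustrative assumptions; the natural logarithm is assumed because the text does not state the base.

```python
import math


def alignment_probabilities(weights):
    """Softmax over the node weights of one hyperedge."""
    peak = max(weights.values())  # subtract the max for numerical stability
    exps = {node: math.exp(w - peak) for node, w in weights.items()}
    total = sum(exps.values())
    return {node: value / total for node, value in exps.items()}


def alignment_entropy(weights):
    """Shannon entropy (in nats) of the softmax distribution."""
    probs = alignment_probabilities(weights)
    return -sum(p * math.log(p) for p in probs.values() if p > 0.0)


ambiguous = {"t1": 1.0, "t2": 1.0, "t3": 1.0}  # evidence cannot separate candidates
confident = {"t1": 8.0, "t2": 1.0, "t3": 1.0}  # one candidate clearly dominates

print(alignment_entropy(ambiguous))  # equals log(3) for a uniform distribution
print(alignment_entropy(confident))  # much smaller
```

A uniform hyperedge gives the maximum entropy for its size, which is the situation that would exceed the threshold and trigger agent reasoning.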
[0069] After the alignment entropy of each source entity at each modal layer has been calculated, the overall optimization objective of this invention is to minimize the weighted sum of entropy over the entire evolutionary hypergraph, which is formally defined as follows:
$$\mathcal{L}(\mathcal{G}_k) = \sum_{e_s} \sum_{m \in \mathcal{M}} \lambda_m\, H_k(e_s, m)$$
[0070] where $\mathcal{L}(\mathcal{G}_k)$ is the global entropy objective function of the entire evolutionary hypergraph $\mathcal{G}$ at the $k$-th iteration; $\lambda_m$ is the reliability weight of modality $m$, which can be set from the prior reliability of the modality or obtained through dynamic learning; and $H_k(e_s, m)$ is the alignment entropy of source entity $e_s$ in modality $m$ at the $k$-th iteration. By minimizing the sum of the weighted entropies of all source entities across all modalities, this objective function drives the system toward a deterministic low-entropy state in which true matches gradually become clear.
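As a rough illustration of the global objective, the following sketch sums reliability-weighted entropies over all source entities and modalities. The data layout, the function names, and the reliability values are assumptions made for the example and are not specified in the text.

```python
import math


def entropy(weights):
    """Shannon entropy of the softmax over a list of node weights."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def global_entropy(hypergraph, reliability):
    """Weighted sum of alignment entropies over all entities and modalities.

    hypergraph: {source_entity: {modality: [node weights]}}
    reliability: {modality: weight}, an assumed per-modality prior.
    """
    return sum(
        reliability[modality] * entropy(weights)
        for layers in hypergraph.values()
        for modality, weights in layers.items()
    )


hypergraph = {
    "s1": {"image": [1.0, 1.0], "text": [3.0, 0.0]},
    "s2": {"text": [0.5, 0.5, 0.5]},
}
reliability = {"image": 0.6, "text": 1.0}
print(global_entropy(hypergraph, reliability))
```

Lowering the weight of a noisy modality reduces its contribution to the objective, so the system is not penalized as heavily for residual ambiguity in an unreliable layer.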
[0071] Step S700: When the alignment entropy exceeds a preset threshold, select at least one agent from the agent pool to form a cooperative group. After the alignment entropy of each source entity at the different modal layers has been calculated, this step dynamically triggers the agent inference mechanism based on the entropy monitoring results. The core idea is "on-demand processing": instead of spending expensive inference resources on all entities, a dedicated agent cooperative group is formed for a source entity only when its alignment entropy in some modality exceeds the preset threshold, so that only cases with high uncertainty are handled. This mechanism focuses computational resources on the most difficult alignment cases, improving overall efficiency while maintaining alignment accuracy.
[0072] To achieve this dynamic triggering and resource allocation, a meta-agent is introduced. Acting as a monitoring and scheduling center, the meta-agent continuously tracks the alignment entropy $H_k(e_s, m)$ of each source entity $e_s$ in each modality $m$ of the evolutionary hypergraph $\mathcal{G}$. For any source entity $e_s$ and modality $m$, if the following condition is met:
$$H_k(e_s, m) > \tau$$
[0073] then, in the current modality $m$, the probability distribution over the candidate target nodes of source entity $e_s$ is too uniform and the evidence is ambiguous, making it difficult to make a reliable alignment decision directly. At this point, the meta-agent determines that the entity faces "insufficient evidence" in this modality, and external reasoning capabilities must be introduced to reduce uncertainty.
[0074] The meta-agent then dynamically selects a group of specialized agents from a pre-built agent resource pool $\mathcal{A}$ to form a cooperative group targeting the source entity $e_s$. This selection process is formally defined as:
$$\mathcal{T}_k(e_s) = f_{\text{meta}}(\mathcal{A}, S_k, e_s)$$
[0075] In the formula, $\mathcal{T}_k(e_s)$ is the agent cooperative group formed for source entity $e_s$ at the $k$-th iteration, consisting of several specialized agents; $f_{\text{meta}}$ is the meta-agent selection function, which determines which agents to invoke based on the current system state and the target entity; $\mathcal{A}$ is the agent resource pool, containing agents with specialized capabilities, such as visual temporal agents (adept at analyzing temporal changes in images and videos), text reasoning agents (adept at handling text descriptions and semantic relationships), and cross-modal coordination agents (adept at resolving inter-modal conflicts); $S_k$ is the system state at the $k$-th iteration, including the current weight of each node in the evolutionary hypergraph, the alignment entropy of each entity, and the conflicts identified so far; and $e_s$ is the target source entity currently triggered for processing.
[0076] The selection process is neither random nor fixed: the meta-agent makes adaptive decisions based on the current system state $S_k$ and the characteristics of the entity $e_s$ to be processed. For example, if the high entropy of a source entity occurs mainly in the image modality, the meta-agent may preferentially choose the visual temporal agent; if multiple modalities exhibit high entropy simultaneously, it may choose the cross-modal coordination agent. This dynamic combination mechanism matches reasoning ability precisely to the specific problem.
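The triggering and selection logic can be sketched as a simple rule-based function. This is only an illustrative stand-in for the meta-agent: the threshold value, the agent names, and the rule "add a coordinator when several layers are uncertain" are assumptions drawn from the examples in the text, not a specification of the selection function.

```python
THRESHOLD = 0.5  # assumed entropy threshold; the text does not give a value

# Hypothetical agent pool, keyed by the modality each agent specializes in.
AGENT_POOL = {
    "image": "visual_temporal_agent",
    "video": "visual_temporal_agent",
    "text": "text_reasoning_agent",
}
COORDINATOR = "cross_modal_coordination_agent"


def select_team(entropies, threshold=THRESHOLD):
    """Return the agents to invoke for one source entity.

    entropies: {modality: alignment entropy} for that entity.
    An empty list means no layer is uncertain enough to trigger reasoning.
    """
    uncertain = [m for m, h in entropies.items() if h > threshold]
    if not uncertain:
        return []
    team = sorted({AGENT_POOL[m] for m in uncertain if m in AGENT_POOL})
    if len(uncertain) > 1:
        team.append(COORDINATOR)  # several layers are ambiguous at once
    return team


print(select_team({"image": 0.9, "text": 0.1}))
print(select_team({"image": 0.9, "text": 0.8}))
print(select_team({"image": 0.2, "text": 0.1}))
```

The third call returns an empty team, which is the on-demand behavior: no agent is invoked when every layer is already below the threshold.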
[0077] This step achieves a key shift from "passive computation" to "active reasoning": the deep semantic understanding capability of the agent is only introduced to intervene when conventional static weight comparison fails to distinguish candidate targets.
[0078] Step S800: The cooperative group identifies the conflicts between target entities supported by the hyperedge clusters corresponding to different modal layers, and performs inference actions to update the weights of nodes in the evolutionary hypergraph. After an agent cooperative group has been formed for a high-entropy source entity in step S700, this step resolves inter-modal evidence conflicts through collaborative reasoning and dynamically updates the node weights in the evolutionary hypergraph. This process is the core of agent-based on-demand reasoning. Its goal is to reassess the reliability of each candidate target by leveraging the agents' semantic understanding capabilities when evidence is ambiguous or contradictory, thereby gradually eliminating uncertainty.
[0079] The cooperative group first examines the cross-layer hyperedge cluster $\mathcal{C}(e_s)$ of the source entity $e_s$. For each modal layer $m$, the current probability distribution $P_k(v \mid e_s, m)$ determines the best candidate target currently supported by that modal layer, i.e., the target entity corresponding to the node with the highest probability:
$$\hat{e}^{m}(e_s) = \arg\max_{v \in h_{e_s}^{m}} P_k(v \mid e_s, m)$$
[0080] In the formula, $\hat{e}^{m}(e_s)$ is the target entity currently predicted, in modality $m$, to be the most likely correct alignment within the hyperedge corresponding to the source entity, and is the basic unit for subsequent conflict detection; $\arg\max$ returns the value of the variable that maximizes the objective function.
[0081] The best candidate targets supported by all modal layers are collected into the set $\mathcal{B}(e_s) = \{\, \hat{e}^{m}(e_s) \mid m \in \mathcal{M} \,\}$, where $\mathcal{B}(e_s)$ is the set of best candidate target entities currently supported by all modal layers in the cross-layer hyperedge cluster of the source entity, and $\hat{e}^{m}(e_s)$ is the best candidate target entity identified in modality $m$. If the set contains more than one distinct target entity, the evidence provided by different modal layers points to different candidates, i.e., a conflict exists. This conflict state is formally defined as:
$$c(e_s) = \mathbb{1}\left[\, |\mathcal{B}(e_s)| > 1 \,\right]$$
[0082] where $c(e_s)$ is an indicator variable indicating whether the source entity has a cross-modal conflict; $\mathbb{1}[\cdot]$ is the indicator function, which returns 1 if the size of the set is greater than 1 (indicating a conflict) and 0 otherwise; and $|\mathcal{B}(e_s)|$ is the cardinality of the set $\mathcal{B}(e_s)$, i.e., the number of distinct target entities in the set. The existence of a conflict means that the evidence across the current modalities is inconsistent, so a reliable alignment decision cannot be made directly through simple voting or weighted fusion, and a deeper reasoning approach must be introduced for coordination.
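The conflict check reduces to comparing the top candidate of each modal layer. A minimal sketch, with hypothetical function names and toy probabilities:

```python
def best_candidate(probabilities):
    """Target entity with the highest probability in one modal layer."""
    return max(probabilities, key=probabilities.get)


def has_conflict(cluster):
    """Return 1 if modal layers support different best candidates, else 0.

    cluster: {modality: {candidate_target: probability}}
    """
    supported = {best_candidate(probs) for probs in cluster.values()}
    return 1 if len(supported) > 1 else 0


agreeing = {
    "image": {"person_A": 0.7, "person_B": 0.3},
    "text": {"person_A": 0.6, "person_B": 0.4},
}
conflicting = {
    "image": {"person_A": 0.7, "person_B": 0.3},
    "text": {"person_A": 0.2, "person_B": 0.8},
}
print(has_conflict(agreeing))     # both layers support person_A
print(has_conflict(conflicting))  # image supports person_A, text supports person_B
```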
[0083] For a source entity with a conflict (i.e., $c(e_s) = 1$), the agents in the cooperative group perform specific reasoning actions. These actions leverage the semantic understanding capabilities of large language models (LLMs), combined with the multimodal temporal observation data of the source entity $e_s$, to reassess the reliability of each candidate target. For example, a visual temporal agent might analyze image changes at specific time points to verify the continuity of entity identities, while a text reasoning agent might compare the consistency of descriptions across time periods. Through reasoning, the agent generates a confidence increment $\Delta_k(e_s, v)$, which reflects how far the current evidence corrects the belief that candidate node $v$ is the correct match.
[0084] Subsequently, the weights of the nodes inside the hyperedge of the corresponding modal layer in the evolutionary hypergraph are updated using this increment, and the update rule is shown in the following formula:
$$w_{k+1}^{m}(v) = w_k^{m}(v) + \Delta_k(e_s, v)$$
[0085] In the formula, $w_{k+1}^{m}(v)$ is the weight of node $v$ in the hyperedge $h_{e_s}^{m}$ at iteration $k+1$, and $\Delta_k(e_s, v)$ is the confidence increment returned by the agent for the matching relationship between source entity $e_s$ and the target entity corresponding to candidate node $v$ (it can be positive or negative, indicating an increase or decrease in the candidate's confidence). This update directly changes the subsequent probability distribution $P_{k+1}(v \mid e_s, m)$ and provides the computational foundation for evolving the aligned state at the evolutionary hypergraph level.
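To show how a confidence increment reshapes the probability distribution, the sketch below assumes a purely additive update (old weight plus increment) followed by softmax renormalization. The additive form and the increment values are assumptions for illustration; in the method itself the increments come from agent reasoning.

```python
import math


def softmax(weights):
    """Normalize node weights into a probability distribution."""
    peak = max(weights.values())
    exps = {n: math.exp(w - peak) for n, w in weights.items()}
    total = sum(exps.values())
    return {n: e / total for n, e in exps.items()}


def apply_increments(weights, increments):
    """Additive update: new weight = old weight + confidence increment."""
    return {n: w + increments.get(n, 0.0) for n, w in weights.items()}


weights = {"person_A": 1.0, "person_B": 1.0}        # ambiguous layer
increments = {"person_A": 1.5, "person_B": -0.5}    # hypothetical agent output
updated = apply_increments(weights, increments)

before = softmax(weights)["person_A"]
after = softmax(updated)["person_A"]
print(before, after)
```

Because softmax depends only on weight differences, a positive increment for one candidate and a negative one for its rival both push probability mass in the same direction.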
[0086] Step S900: Iteratively execute the steps of calculating the alignment entropy, selecting the agent, identifying conflicts, and updating weights until the iteration termination condition is met, and output the alignment relationship between the source entity and the target entity.
[0087] After completing the weight update, this step performs meta-evaluation and loop control on the entire iterative process to ensure that the alignment decision gradually converges to a deterministic state. The EvoWildAlign framework treats alignment as a dynamic optimization process, gradually reducing the uncertainty in the evolutionary hypergraph through multiple iterations until the preset termination condition is met, ultimately outputting reliable entity alignment relationships.
[0088] After each weight update, a meta-evaluation phase checks whether the system state $S_{k+1}$ after the current iteration meets the convergence criterion. The core of meta-evaluation is to determine, by means of a reward function $R$, whether the alignment uncertainty of all triggered source entities has been eliminated. The reward function is defined as follows:
$$R(S_{k+1}) = \begin{cases} 1, & \text{if } c(e_s) = 0 \text{ and } H_{k+1}(e_s, m) < \tau \text{ for all } e_s \in \mathcal{U}_k \text{ and all } m \in \mathcal{M} \\ 0, & \text{otherwise} \end{cases}$$
[0089] In the formula, $S_{k+1}$ is the system state after the $k$-th iteration, including the updated weights of each node in the evolutionary hypergraph and the alignment entropy of each entity in each modality; $\mathcal{U}_k$ is the set of source entities that need to be coordinated in the $k$-th iteration, i.e., the set of high-entropy entities that trigger agent reasoning in step S700; $c(e_s)$ is the conflict indicator function for source entity $e_s$, where $c(e_s) = 0$ indicates that the modal layers have reached a consensus on the best candidate target and there is no cross-modal conflict; $H_{k+1}(e_s, m)$ is the alignment entropy of source entity $e_s$ in modality $m$ after the $k$-th iteration, and requiring $H_{k+1}(e_s, m) < \tau$ for all modalities $m$ means that the entropy values of all modal layers are below the preset threshold, i.e., uncertainty has been reduced to an acceptable level; and $R(\cdot)$ is the reward function, which returns 1 when all entities requiring coordination satisfy both the no-conflict and low-entropy conditions, and 0 otherwise.
[0090] If the reward $R(S_{k+1})$ equals 1, the system has reached a convergence state and the iteration process terminates. At this point, for each source entity $e_s$, based on the final probability distribution, the candidate target entity with the highest overall confidence in the cross-layer hyperedge cluster $\mathcal{C}(e_s)$ is selected as its alignment result, and the final alignment set is constructed:
$$\Psi = \{\, (e_s, e_t^{*}) \mid e_t^{*} = \arg\max_{e_t} \mathrm{Score}(e_s, e_t) \,\}$$
[0091] The overall score $\mathrm{Score}(e_s, e_t)$ can be calculated by a weighted summation of the probabilities from each modal layer or by taking the maximum value; the specific strategy can be set according to the actual application scenario. This alignment set represents the equivalence relationship between the source entity and the target entity output in step S900.
[0092] If the reward $R(S_{k+1})$ equals 0, the system has not yet converged and requires further iterative optimization. In that case, the updated system state $S_{k+1}$ is used as the starting point for a new iteration: the system returns to step S600 to recalculate the alignment entropy of each source entity at each modal layer, re-identifies high-entropy entities, and triggers agent selection and conflict resolution until the termination condition is met. This closed-loop feedback mechanism ensures that even if the initial evidence is ambiguous or conflicting, the system can gradually approach the correct alignment decision through multiple rounds of agent collaboration and weight adjustment.
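The closed loop of steps S600 to S900 can be sketched for a single source entity as below. This is a simplified illustration under several assumptions: the threshold and round cap are arbitrary, `toy_agent` is a stand-in that always favors candidate "A" (real agents would reason over multimodal evidence), and the overall score is an unweighted sum of per-layer probabilities, which is one of the options the text mentions.

```python
import math

THRESHOLD = 0.5  # assumed entropy threshold
MAX_ROUNDS = 10  # assumed safety cap on iterations


def softmax(weights):
    peak = max(weights.values())
    exps = {n: math.exp(w - peak) for n, w in weights.items()}
    total = sum(exps.values())
    return {n: e / total for n, e in exps.items()}


def entropy(weights):
    return -sum(p * math.log(p) for p in softmax(weights).values() if p > 0.0)


def reward(cluster):
    """1 if all layers agree on one candidate and are below the threshold."""
    best = {max(layer, key=layer.get) for layer in cluster.values()}
    low_entropy = all(entropy(layer) < THRESHOLD for layer in cluster.values())
    return 1 if len(best) == 1 and low_entropy else 0


def toy_agent(cluster):
    """Stand-in for agent reasoning: always adds support for candidate 'A'."""
    return {modality: {"A": 1.0, "B": -1.0} for modality in cluster}


def align(cluster, agent=toy_agent):
    """Iterate until the reward is 1 (or the cap is hit); updates `cluster` in place."""
    for _ in range(MAX_ROUNDS):
        if reward(cluster) == 1:
            break
        for modality, delta in agent(cluster).items():
            layer = cluster[modality]
            for node in layer:
                layer[node] += delta.get(node, 0.0)
    # Overall score: unweighted sum of per-layer probabilities.
    scores = {}
    for layer in cluster.values():
        for node, p in softmax(layer).items():
            scores[node] = scores.get(node, 0.0) + p
    return max(scores, key=scores.get)


cluster = {"image": {"A": 1.0, "B": 0.8}, "text": {"A": 0.2, "B": 0.9}}
print(align(cluster))
```

The example starts in a conflicting state (the image layer favors "A", the text layer favors "B") and stops once both layers agree and their entropies fall below the threshold.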
[0093] Through the above iterative process, EvoWildAlign (i.e., this method) achieves adaptive reasoning of evolutionary evidence in open-world multimodal temporal knowledge graphs, and stably outputs high-quality entity alignment relationships in dynamic and uncertain environments.
[0094] This invention's method can be directly applied to entity fusion scenarios in cross-source multimodal temporal knowledge graphs. For example, when building a large comprehensive knowledge base, it is necessary to align multimodal temporal data of people in Wikipedia with political figures data in the ICEWS event database, thereby integrating scattered time-varying multimodal information and providing more complete knowledge support for subsequent downstream tasks such as intelligent question answering, event prediction, and personalized recommendations. Specifically, in intelligent question answering systems, when a user queries multimodal information (such as photos, news videos, and biographical texts) of a public figure at different times, the knowledge graph aligned by this invention can accurately link cross-source data, achieving cross-modal and cross-time knowledge retrieval and fusion, significantly improving the comprehensiveness and accuracy of the answers. In recommendation systems, the aligned multimodal temporal knowledge graph can capture the multimodal characteristics of user interests evolving over time. For example, by analyzing the images, audio, and text descriptions of film and television works that users follow at different times, it can more accurately model the dynamics of user preferences, thereby optimizing recommendation performance. In event prediction tasks, by aligning temporal event graphs from different sources, multi-source evidence can be fused to improve the prediction accuracy of event evolution.
[0095] For example, in a cross-camera target tracking system in the public safety field, different surveillance cameras capture images and video clips of the same target person at different times, possibly accompanied by audio information of the surrounding environment and text descriptions recorded in the police system. This multimodal data is distributed in heterogeneous temporal knowledge graphs. For instance, one graph stores video stream data from cameras in various areas of the city, while another graph stores case text records and suspect files from the police system. When it is necessary to track the complete activity trajectory of a target person, the images and videos captured by different cameras must be precisely aligned with the time point information in the case text descriptions to determine the target person's location and behavior at different times.
[0096] The method of this invention can be applied to such scenarios: First, a surveillance map containing video image data and a case map containing text descriptions are acquired; then, candidate targets are initially screened through neural retrieval; next, temporal projection constraints are applied to ensure that only observation data that overlaps in time are retained, such as comparing only images captured by cameras within the same time period with descriptions in case records; then, through modal projection constraints, images, videos, and text descriptions are unified into a comparable representation space; on this basis, an evolutionary hypergraph containing multimodal layers such as images, videos, and text is constructed, and various types of evidence of the same target person at different time points are organized into cross-layer hyperedge clusters; when the candidate targets supported by different modal layers are inconsistent (for example, the image layer points to person A while the text layer points to person B), the intelligent agent collaboration group dynamically updates the confidence of each candidate target by analyzing the evolution of facial features in the image over time and changes in clothing details in the text description, and finally outputs the accurately aligned target entity.
[0097] Through this process, the present invention can effectively correlate multimodal evidence scattered across different monitoring systems and at different times, restore the complete spatiotemporal trajectory of the target person, significantly improve the accuracy and robustness of cross-camera tracking, and avoid misjudgments or omissions caused by modality loss, time drift, or evidence conflict.
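The temporal filtering mentioned in this scenario (retaining only observations that overlap in time) can be illustrated with a simple interval test. The interval representation, the field names, and the numeric timestamps are hypothetical; the actual temporal projection constraint may be defined differently.

```python
def overlaps(first, second):
    """True if two closed intervals (start, end) share at least one instant."""
    return first[0] <= second[1] and second[0] <= first[1]


def filter_by_time(source_interval, observations):
    """Keep only observations whose interval overlaps the source interval."""
    return [o for o in observations if overlaps(source_interval, o["interval"])]


case_record = (100, 200)  # hypothetical time window of a case description
camera_clips = [
    {"id": "cam1", "interval": (50, 120)},
    {"id": "cam2", "interval": (210, 260)},
    {"id": "cam3", "interval": (150, 180)},
]
kept = filter_by_time(case_record, camera_clips)
print([clip["id"] for clip in kept])
```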
[0098] Figure 2 is a schematic diagram of the overall framework of the EvoWildAlign method of the present invention. As shown in Figure 2, this invention employs a two-stage design to address the challenge of aligning open-world multimodal temporal knowledge graphs. The first stage is evolutionary hypergraph representation, which reorganizes heterogeneous, time-indexed multimodal observation data into a unified neural-symbolic evolutionary hypergraph, using time as the organizational axis and modality as the hierarchical structure, so that evolving evidence becomes comparable across graphs in both the structural and temporal dimensions. The second stage is on-demand agent hypergraph reasoning, which introduces a multi-agent collaboration mechanism on the evolutionary hypergraph and dynamically selects reliable temporal-modal combinations to achieve robust alignment under missing data, noise, and temporal asynchrony. Figure 2 illustrates the entire process, from inputting the original multimodal temporal knowledge graphs, through constructing the evolutionary hypergraph, to agent collaborative reasoning and outputting the aligned results.
[0099] To verify the effectiveness of the proposed method, EvoWildAlign, in open-world multimodal temporal knowledge graph alignment tasks, this section provides a comprehensive evaluation through multiple experiments. The experiments cover benchmark datasets, comparative methods, evaluation metrics, key results, ablation studies, efficiency analysis, and generalization capability verification, demonstrating the superiority of this method compared to existing technologies.
[0100] To address the lack of standard benchmarks for the Open-World Multimodal Temporal Knowledge Graph Alignment (OpenMTKGA) task, this invention constructs two new benchmark datasets: OpenMTKGA(WI) and OpenMTKGA(YI). These datasets are obtained by connecting domain-specific ICEWS (Political Event Graph, spanning from 1995 to 2021) with general knowledge graphs (Wikidata and YAGO). Unlike existing benchmarks that assume static or synchronous modalities, these two datasets preserve the asynchronous evolution of evidence; for example, news images, audio, video, and text often appear several days after the event, thus realistically reflecting the challenges of multimodal evidence evolution over time in the real world.
[0101] In addition, to verify the generalization ability of this method, it was evaluated on seven established non-OpenMTKGA benchmark datasets, including four unimodal time series datasets (ICEWS-WIKI, ICEWS-YAGO, BETA, YAGO-WIKI50K), two unimodal static datasets (DBP15K(EN-FR), DBP-WIKI), and one multimodal static dataset (FB15K-DB15K). Detailed statistical information for each dataset is shown in Table 1.
[0102] Table 1: Dataset Statistics
[0103] To comprehensively evaluate the method of this invention, 27 representative baseline methods were selected and divided into three categories: (1) unimodal static knowledge graph alignment methods (11 methods), including MTransE, AlignE, BootEA, GCN-Align, RDGCN, Dual-AMN, BERT, FuAlign, BERT-INT, PARIS, and NaiveRAG; (2) unimodal temporal knowledge graph alignment methods (11 methods), including TEA-GNN, TREA, STEA, LightTEA, Dual-Match, Simple-HHEA, ChatEA, MGTEA, AdaCoAgentEA, Self-Consistency, and Self-RAG; (3) multimodal static knowledge graph alignment methods (5 methods), including EVA, MMEA, MEAformer, MMKG-CoT, and MMKG-RAG. All methods based on large language models (LLM) used the same model version (GPT-4 or GPT-3.5) to achieve fair comparison.
[0104] The evaluation used Hits@1, Hits@5, Hits@10, and Mean Reciprocal Rank (MRR) as the primary metrics. For models that output only the final alignment result, Hits@1 was replaced by accuracy. Efficiency metrics included runtime (seconds) and token consumption. All experiments were implemented using the PyTorch framework and conducted on a server equipped with four NVIDIA GeForce RTX 4090 graphics cards. Multimodal data (images, audio) was converted into text descriptions using off-the-shelf models (such as BLIP-2 for image captioning and Whisper for speech transcription) for processing in a unified semantic space.
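For reference, Hits@k and MRR follow their standard definitions and can be computed as in the sketch below. The function names and the toy rankings are illustrative and are not taken from the experiments.

```python
def hits_at_k(ranked_lists, gold, k):
    """Fraction of queries whose gold target appears in the top-k candidates."""
    hits = sum(1 for q, ranking in ranked_lists.items() if gold[q] in ranking[:k])
    return hits / len(ranked_lists)


def mean_reciprocal_rank(ranked_lists, gold):
    """Average of 1/rank of the gold target; 0 if it is not ranked at all."""
    total = 0.0
    for q, ranking in ranked_lists.items():
        if gold[q] in ranking:
            total += 1.0 / (ranking.index(gold[q]) + 1)
    return total / len(ranked_lists)


rankings = {
    "s1": ["t1", "t4", "t7"],  # gold target at rank 1
    "s2": ["t9", "t2", "t5"],  # gold target at rank 2
    "s3": ["t8", "t6", "t0"],  # gold target missing
}
gold = {"s1": "t1", "s2": "t2", "s3": "t3"}

print(hits_at_k(rankings, gold, 1))
print(hits_at_k(rankings, gold, 5))
print(mean_reciprocal_rank(rankings, gold))
```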
[0105] The comparison results of the method of this invention with the 27 baselines on OpenMTKGA and multiple benchmark datasets are shown in Tables 2 and 3. Experimental results show that EvoWildAlign outperforms all baselines on all datasets, achieving a maximum improvement of 33.2% in the Hits@1 metric. Specifically, compared to the strongest unimodal static baseline (such as NaiveRAG), EvoWildAlign achieves a relative improvement of up to 76.7% on Hits@1, and compared to the strongest unimodal temporal baseline (such as AdaCoAgentEA), it achieves a relative improvement of up to 33.2%. This indicates that modeling temporal dynamics or multimodal features in isolation is insufficient for the OpenMTKGA task.
[0106] Compared to advanced LLM-based paradigms (such as chain-of-thought prompting, retrieval-augmented generation, and multi-agent collaboration), EvoWildAlign consistently achieves superior results, validating the effectiveness of formulating open-world alignment as an agent hypergraph collaboration problem.
[0107] The core insight is that EvoWildAlign successfully addresses the challenges of evolutionary diversity and dynamic imbalance that baseline methods fail to capture by adaptively fusing evolutionary multimodal evidence.
[0108] Table 2: Main experimental results of OpenMTKGA and TKGA datasets
[0109] Table 2 compares the method of the present invention with the three categories of baselines on OpenMTKGA(WI), OpenMTKGA(YI), ICEWS-WIKI, and ICEWS-YAGO in terms of Hits@1, MRR, and other metrics. EvoWildAlign ranks first against all three categories.
[0110] Table 3: Further experimental results for DBP15K (EN-FR), DBP-WIKI, BETA, and YAGO-WIKI50K-1K
[0111] As shown in Tables 2 and 3, existing baselines achieve Hits@1 scores of only 0.625 and 0.622 on OpenMTKGA(WI) and OpenMTKGA(YI), respectively, significantly lower than their performance on traditional tasks (many baselines achieve Hits@1 scores exceeding 0.9 on unimodal temporal or multimodal static tasks). This highlights the inherent difficulty of the OpenMTKGA task: the temporally evolving evidence flow makes traditional alignment methods difficult to apply. The method of this invention effectively addresses this challenge through evolutionary hypergraphs and on-demand reasoning, validating the value of the task setting and the challenging nature of the dataset.
[0112] To verify the importance of each component of the present invention, an ablation experiment was conducted on OpenMTKGA(WI), and the results are shown in Table 4.
[0113] Table 4: OpenMTKGA(WI) Ablation Experiment
[0114] The results show that: evolutionary hypergraphs provide key symbolic constraints and topological connectivity, and adaptive decoupling prevents noise propagation; on-demand agent reasoning outperforms static attention and can adapt to instance-level evidence fluctuations; collaborative decision-making and meta-evaluation loops are crucial for performance improvement.
[0115] To verify the advantages of this invention in terms of computational efficiency and economic cost, EvoWildAlign was compared with two types of baselines: (i) LLM-based baseline configurations (such as ChatEA, Self-RAG, AdaCoAgentEA, MMKG-RAG); and (ii) lightweight baseline configurations (Simple-HHEA (structure)). The results are shown in Table 5.
[0116] Table 5: Efficiency Analysis of EvoWildAlign and Baseline Configurations on OpenMTKGA (WI)
[0117] EvoWildAlign achieves an optimal balance between performance and efficiency: compared to the leading agent baseline AdaCoAgentEA, execution time is reduced by 85.3% (4.1 seconds vs. 27.8 seconds), token consumption is reduced by 49.1%, and Hits@1 is improved by 30.8%. This is thanks to the on-demand triggering mechanism, which only invokes agent inference when high entropy or conflict is detected, avoiding unnecessary computational overhead.
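As a quick arithmetic check, the reported runtime reduction follows directly from the two execution times quoted above:

```latex
\frac{27.8 - 4.1}{27.8} = \frac{23.7}{27.8} \approx 0.853 \quad (\text{an } 85.3\% \text{ reduction})
```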
[0118] To verify whether this method depends on a specific LLM, the agent backbone was replaced with different models (GPT-4, GPT-3.5, Llama3-8B, Claude 3.5 Sonnet), and the results are shown in Table 6. Even when driven by the weaker model (Llama3-8B), EvoWildAlign still achieves 0.690 Hits@1 on OpenMTKGA(WI), which is better than the strongest baseline ChatEA (0.635) using GPT-4, indicating that the evolutionary hypergraph acts as a capability amplifier.
[0119] Table 6: Performance of EvoWildAlign using different LLM-based agents
[0120] The experimental results on the static multimodal dataset FB15K-DB15K are shown in Table 7. EvoWildAlign achieves a 58.5% improvement in Hits@1 compared to previous methods, indicating that the "hypergraph + agent" design can be generalized to solve evidence ambiguity problems beyond temporal tasks.
[0121] Table 7: Baseline Configuration and Experimental Results of EvoWildAlign on FB15K-DB15K
[0122] In summary, the method of this invention, EvoWildAlign, significantly outperforms existing technologies in the task of aligning open-world multimodal temporal knowledge graphs, demonstrating high efficiency, robustness, and good generalization ability, and verifying its effectiveness and advancement in practical applications.
[0123] According to another aspect of the embodiments of this application, an electronic device is also provided, including a processor and a memory, wherein the processor is configured to implement the steps of the method when executing a computer program stored in the memory.
[0124] In the above embodiments of the present invention, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.
[0125] In the several embodiments provided in this application, it should be understood that the disclosed technical content can be implemented in other ways. The device embodiments described above are merely illustrative; for example, the division of units can be a logical functional division, and in actual implementation, there may be other division methods. For instance, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the displayed or discussed mutual coupling, direct coupling, or communication connection may be through some interfaces; the indirect coupling or communication connection between units or modules may be electrical or other forms.
[0126] Furthermore, the functional units in the various embodiments of the present invention can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0127] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.
[0128] The above description is only a preferred embodiment of the present invention. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.
Claims
1. A multimodal temporal knowledge graph alignment method, characterized in that it includes the following steps: obtaining a first multimodal temporal knowledge graph and a second multimodal temporal knowledge graph, each of which contains multiple entities and time-indexed multimodal observation data associated with the entities; performing neural retrieval on the source entities in the first multimodal temporal knowledge graph and the target entities in the second multimodal temporal knowledge graph to obtain a candidate entity set for each source entity; applying temporal projection constraints and modal projection constraints to the target entities in the candidate entity set to obtain projection instances that are temporally and modally aligned with the source entity; using the projection instances as nodes and the source entities as hyperedges, constructing a modality-specific hypergraph for each modality; stacking the modality-specific hypergraphs and connecting the different modal layers by the identity of the source entity to construct a neural symbolic evolution hypergraph; calculating, based on the probability distribution of candidate alignments in the evolution hypergraph, the alignment entropy of the source entity in each modal layer; when the alignment entropy exceeds a preset threshold, selecting at least one agent from an agent pool to form a cooperative group, the cooperative group identifying conflicts between the target entities supported by the hyperedge clusters corresponding to different modal layers and performing inference actions to update the weights of nodes in the evolution hypergraph; and iteratively executing the steps of calculating the alignment entropy, selecting the agent, identifying conflicts, and updating weights until an iteration termination condition is met, and then outputting the alignment relationship between the source entity and the target entity.
2. The multimodal temporal knowledge graph alignment method as described in claim 1, characterized in that obtaining the candidate entity set of the source entity comprises: calculating, based on the text modal data of the source entity, the embedding similarity between the source entity and each target entity in the second multimodal temporal knowledge graph; and selecting the K target entities with the highest embedding similarity from the second multimodal temporal knowledge graph as the candidate entity set of the source entity.
3. The multimodal temporal knowledge graph alignment method as described in claim 1, characterized in that applying temporal projection constraints to the target entities in the candidate entity set includes: obtaining the set of valid timestamps of the source entity; selecting, from the multimodal observation data of the target entity, the observation data whose timestamps belong to the set of valid timestamps as temporally aligned observation evidence; and generating a temporally aligned projection instance of the target entity based on the temporally aligned observation evidence.
4. The multimodal temporal knowledge graph alignment method as described in claim 1, characterized in that applying modal projection constraints to the target entities in the candidate entity set includes: obtaining the set of modal types in which the source entity exists; selecting, from the multimodal observation data of the target entity, the observation data whose modal type belongs to the set of modal types as modally aligned observation evidence; and generating a modally aligned projection instance of the target entity based on the modally aligned observation evidence.
5. The multimodal temporal knowledge graph alignment method as described in claim 1, characterized in that constructing the modality-specific hypergraph for each modality includes: for each modality, when a target entity has a projection instance in that modality, creating a node for the target entity in the corresponding modality; for each source entity, constructing a hyperedge by combining the nodes that have been created for the target entities in the candidate entity set of the source entity; and assigning a weight to each node in the hyperedge, the weight being based on the initial retrieval similarity between the source entity and the target entity.
6. The multimodal temporal knowledge graph alignment method as described in claim 1, characterized in that constructing the neural symbolic evolution hypergraph includes: stacking the modality-specific hypergraphs to form a global evolution hypergraph containing multiple modal layers; for each source entity, associating the hyperedges corresponding to the source entity in the different modal layers to construct a cross-layer hyperedge cluster of the source entity; and using the cross-layer hyperedge cluster as the basic unit for cross-modal reasoning in the neural symbolic evolution hypergraph.
7. The multimodal temporal knowledge graph alignment method as described in claim 1, characterized in that calculating the alignment entropy of the source entity in each modal layer includes: for each modal layer, obtaining the current weight of each projection node in the hyperedge corresponding to the source entity; calculating, from its current weight, the probability that each projection node is the correct alignment; and calculating the alignment entropy of the source entity in that modal layer from the probabilities of the projection nodes according to the definition of entropy.
8. The multimodal temporal knowledge graph alignment method as described in claim 1, characterized in that updating the weights of nodes in the evolution hypergraph includes: for each source entity, obtaining the candidate target entity with the highest current probability in the hyperedge corresponding to each modal layer, and collecting these highest-probability candidates into a candidate entity set; if the candidate entity set contains more than one distinct target entity, determining that the source entity has a conflict between different modal layers; for a conflicting source entity, performing inference, by the agents in the cooperative group, on the multimodal time-series observation data of the source entity to obtain a confidence increment for each candidate target entity; and updating the weights of the nodes within the hyperedge of the corresponding modal layer in the evolution hypergraph based on the confidence increments.
9. An electronic device, comprising a memory and a processor, characterized in that the memory is used to store a program that supports the processor in executing the multimodal temporal knowledge graph alignment method of any one of claims 1-8, and the processor is configured to execute the program stored in the memory.
10. A computer-readable storage medium storing a computer program thereon, characterized in that the computer program, when run by a processor, executes the steps of the multimodal temporal knowledge graph alignment method of any one of claims 1-8.
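The candidate selection, projection, hyperedge construction, alignment entropy, and conflict detection steps recited in claims 2 to 8 can be illustrated with the following minimal Python sketch. It is a sketch under stated assumptions, not the implementation of the invention: all function names and the dictionary-based data layout are hypothetical; the weight-to-probability mapping is assumed to be a softmax; the entropy is assumed to be Shannon entropy with the natural logarithm; the weight update is assumed to be additive; and the agents' inference is not modeled, so the confidence increments are supplied by the caller.

```python
import math
from collections import defaultdict


def top_k_candidates(similarities, k):
    """Keep the K target entities with the highest embedding similarity."""
    ranked = sorted(similarities.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])


def project(observations, valid_times, valid_modalities):
    """Keep observations whose timestamp and modality both match the source entity."""
    return [o for o in observations
            if o["t"] in valid_times and o["modality"] in valid_modalities]


def build_layers(candidates, target_obs, valid_times, valid_modalities):
    """Build one hyperedge per modal layer for a single source entity.

    Returns {modality: {target: weight}}. A node is created only when the
    target has a projection instance in that modality; its weight is
    initialised with the retrieval similarity.
    """
    layers = defaultdict(dict)
    for target, sim in candidates.items():
        for obs in project(target_obs.get(target, []), valid_times, valid_modalities):
            layers[obs["modality"]][target] = sim
    return dict(layers)


def probabilities(weights):
    """Map node weights to alignment probabilities (softmax is an assumption)."""
    peak = max(weights.values())
    exps = {t: math.exp(w - peak) for t, w in weights.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}


def alignment_entropy(weights):
    """Shannon entropy (natural log, an assumption) of one hyperedge."""
    return -sum(p * math.log(p) for p in probabilities(weights).values() if p > 0)


def top_choices(layers):
    """Highest-weight (hence highest-probability) candidate per modal layer."""
    return {m: max(w, key=w.get) for m, w in layers.items()}


def has_conflict(layers):
    """A conflict exists when the modal layers support different target entities."""
    return len(set(top_choices(layers).values())) > 1


def apply_increments(layers, increments):
    """Add a confidence increment to each node's weight (additive update is an assumption)."""
    for weights in layers.values():
        for target in weights:
            weights[target] += increments.get(target, 0.0)
```

In use, a caller would compute `alignment_entropy` for each modal layer of a source entity, compare it with the preset threshold, and, when `has_conflict` is true, pass the increments produced by the cooperative group to `apply_increments` before recomputing the entropy in the next iteration.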