A multi-modal event prediction method and system fusing dynamic multi-space features

CN122241574A · Pending · Publication Date: 2026-06-19 · HUAZHONG UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
HUAZHONG UNIV OF SCI & TECH
Filing Date
2026-03-16
Publication Date
2026-06-19


Abstract

This invention discloses a multimodal event prediction method and system that integrates dynamic multi-spatial features, belonging to the fields of artificial intelligence and data mining technology. At the knowledge acquisition level, this invention integrates temporal structural features from different spaces into the message passing framework of a relation-aware graph neural network to learn deep representations, reflecting human intelligence characteristics in associative thinking, higher-order abstraction, and logical reasoning. The pre-trained model further endows this invention with time-sensitive visual and linguistic intelligence characteristics. At the knowledge fusion level, this invention proposes a dual-fusion evolutionary attention mechanism, assigning dynamically learned weights to different modalities at different historical timestamps. Finally, based on the effective acquisition and fusion of multimodal temporal knowledge from history, this invention integrates multi-spatial features through a curvature adaptive decoder to obtain scores, thereby achieving the prediction of future multimodal events.

Description

Technical Field

[0001] This invention belongs to the field of artificial intelligence and data mining technology, and more specifically, relates to a multimodal event prediction method and system that integrates dynamic multi-spatial features.

Background Technology

[0002] Multimodal knowledge graphs are efficient carriers of multimodal knowledge, and effective multimodal knowledge representation is crucial for predicting multimodal events in complex real-world scenarios. However, existing technologies primarily focus on static scenarios, neglecting the dynamic acquisition and fusion of multimodal knowledge. In dynamic scenarios, it is necessary not only to handle the evolution of structured event quadruples (subject entity, relation, object entity, time) over time, but also to continuously update rich auxiliary modal information (such as text and images), thus forming multimodal temporal knowledge. In the multi-intelligence paradigm, human-like cognitive systems encompass associative thinking, higher-order abstraction, logical reasoning, and especially visual and linguistic intelligence, effectively integrating these capabilities to support future decision-making. In this context, the acquisition and fusion of historical multimodal temporal knowledge constitutes a representation learning process consistent with cognition, which can naturally be extended to the prediction of future multimodal events.

[0003] At the knowledge acquisition level, associative thinking, higher-order abstraction, and logical reasoning abilities originate from dynamic structural modalities; visual and linguistic abilities, on the other hand, originate from dynamic auxiliary modalities. Humans collect multimodal memories through multiple forms rather than single intelligence to predict future events. Furthermore, different geometric spaces have unique effects when embedding various types of structured data. For example, Euclidean space excels at modeling chain structures, hyperbolic space can capture hierarchical patterns, and complex space can effectively represent spherical shell geometry. Existing single-modal dynamic graph learning methods are mostly limited to a single space, and multi-space methods based on shallow pairwise transformations struggle to capture the deep graph structures between multimodal events. Therefore, effectively integrating information from different geometric spaces and naturally extending it to deep structures in dynamic multimodal scenarios remains a pressing challenge.

[0004] At the knowledge fusion level, the multimodal fusion process aims to integrate the different modal features of multimodal events. However, traditional static knowledge fusion methods are no longer applicable to dynamic scenarios. How to dynamically fuse the dynamic features of multimodal events is a key challenge that urgently needs to be addressed. Furthermore, high-quality future multimodal event prediction systems rely on effectively extracting useful information from historical multimodal data. Traditional static collaborative attention methods treat different modalities as attention allocators and learners respectively, only capturing the interactions between modalities but failing to optimize the modal weights required for subsequent prediction. This imbalance is further exacerbated in time-series scenarios. Therefore, assigning differentiated weights to modalities at different historical moments, much like humans do, is crucial for establishing a refined temporal correlation between historical multimodal information and future events.

Summary of the Invention

[0005] To address the aforementioned deficiencies or improvement needs of existing technologies, this invention provides a multimodal event prediction method and system that integrates dynamic multi-spatial features. By integrating the unique structural representations of historical multimodal entities in Euclidean, hyperbolic, and complex geometric spaces, and by fusing structural, visual, and linguistic modal features in an evolutionary manner, it achieves accurate prediction of unknown multimodal events at the next future moment.

[0006] To achieve the above objectives, according to a first aspect of the present invention, a multimodal event prediction method integrating dynamic multi-spatial features is provided, comprising: S1, Dynamic structural modality acquisition stage: The event knowledge graph composed of multimodal entities is serialized according to the time flow. Based on the message passing mechanism, the structural modality features of multimodal entities at each moment are captured from different geometric spaces. Evolutionary characteristics are assigned to the structural features to obtain the dynamic structural modality feature matrix.

[0007] S2, Dynamic Auxiliary Modality Acquisition Stage: Text features and image features are encoded for multimodal entities at each time step using a pre-trained language model and a pre-trained visual model, respectively. Evolutionary properties are assigned to the text features and image features to obtain a dynamic language modality feature matrix and a dynamic visual modality feature matrix (i.e., the language and visual information associated with the event).

[0008] S3, Dual Fusion Evolutionary Attention Stage: Through a two-layer symmetric attention mechanism, the dynamic structural modality feature matrix, dynamic language modality feature matrix, and dynamic visual modality feature matrix at each moment are fused to obtain the fused feature matrix at each moment. By integrating the differences in attention weights of the fused feature matrices at different moments for future events, the historical multimodal temporal feature matrix is obtained.

[0009] S4, Curvature Adaptive Decoding and Prediction Stage: The historical multimodal temporal feature matrix is mapped to a variable negative curvature space, and the multimodal event score for the next future moment is calculated based on the hyperbolic distance to achieve prediction.

[0010] Furthermore, in S1, the event knowledge graph composed of multimodal entities is serialized according to the time flow. Based on the message passing mechanism, the structural modal features of multimodal entities at each moment are captured from different geometric spaces. The structural features are given evolutionary characteristics to obtain a dynamic structural modal feature matrix. This endows the invention with biomimetic deep associative thinking, high-order abstraction, and logical reasoning capabilities, and specifically includes: Based on the time flow, the event knowledge graph is divided into multiple event subgraphs, one for each moment. Each event subgraph contains the event interactions occurring at the current moment (the structural modality), the entity text descriptions of the events (the linguistic modality), and the entity image descriptions (the visual modality). Based on Euclidean addition, Möbius addition, and complex Hadamard product operations, a multi-space message mechanism for relation-aware graph neural networks is designed. The distinct message paradigms of the different geometric spaces are integrated through additive attention. The deep structural modal features of multimodal entities are captured through a multi-layer message passing framework. A recurrent neural network is used as the update module to give the structural modal features at different times a dynamic time-shift dependency.

[0011] Furthermore, in S2, pre-trained language models and pre-trained visual models are used to encode text features and image features for the multimodal entities at each time step, respectively. Evolutionary properties are then applied to these text features and image features to obtain dynamic language modality feature matrices and dynamic visual modality feature matrices, thereby endowing the invention with biomimetic visual and linguistic capabilities. Specifically, this includes: The pre-trained visual model acquires image features at each historical moment, and an update module based on a recurrent neural network is also introduced to give the visual modality features dynamic time-shift dependence, resulting in a dynamic visual modality feature matrix. The pre-trained language model acquires text features at each historical moment, and an update module based on a recurrent neural network is also introduced to give the language modality features dynamic time-shift dependence, resulting in a dynamic language modality feature matrix.

[0012] Furthermore, in S3, a two-layer symmetric attention mechanism is used to fuse the dynamic structural modality feature matrix, dynamic language modality feature matrix, and dynamic visual modality feature matrix at each time step to obtain a fused feature matrix at each time step. By integrating the differences in attention weights of the fused feature matrices at different time steps for future events, a historical multimodal temporal feature matrix is obtained. This gives the invention a biomimetic emphasis capability, enabling it to assign different emphases to different modal features at different time steps, and specifically includes: The first-layer fusion attention mechanism assigns learnable weights to the different modal features in parallel at each historical moment, thereby obtaining a fusion feature matrix that includes structural, visual, and linguistic modal features at each moment. The second-layer evolutionary attention mechanism assigns dynamically learned weights to the fusion feature matrix at each historical moment, and then obtains the historical multimodal temporal feature matrix by integrating the differences in attention weights of the fusion feature matrices at different moments with respect to future events.

[0013] Furthermore, in S4, the historical multimodal temporal feature matrix is mapped to a variable negative curvature space. Based on the hyperbolic distance, the scores of all candidate entities for the multimodal event at the next future time are calculated, thereby enabling the prediction of future missing events. Specifically, this includes: Based on the multi-spatial features of multimodal temporal events, especially the potential hierarchical structure and tree-like associations, a learnable curvature parameter is introduced to construct a Poincaré ball model space with dynamically changing negative curvature. Using the exponential mapping operation, the entity embedding vectors in the historical multimodal temporal feature matrix are projected onto the variable hyperbolic manifold. For all multimodal event candidate entities that may appear at the next time step, the hyperbolic distance between the context embedding of the unknown event query and the embedding of each candidate entity is calculated in hyperbolic space. A scoring function is constructed based on the hyperbolic distance to calculate the occurrence probability score of each candidate entity. The entity with the highest score is predicted as the target entity of the missing future event. Cross-entropy loss is used, and the parameters are updated through the Adam algorithm and the Euclidean backpropagation mechanism under projection constraints.
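The training step named at the end of this paragraph can be illustrated with a minimal sketch. It shows only a softmax cross-entropy over candidate scores and a projection that keeps a Euclidean-updated point inside the ball; the Adam update itself is omitted. The function names, the `eps` margin, and the convention that the ball of curvature -c has radius 1/sqrt(c) are assumptions for illustration, not taken from the patent text.

```python
import numpy as np

def cross_entropy(scores, target_index):
    """Softmax cross-entropy over the scores of all candidate entities."""
    shifted = scores - np.max(scores)  # stabilise the exponentials
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[target_index]

def project_to_ball(x, c, eps=1e-5):
    """Clip x back inside the Poincare ball of curvature -c (assumed radius 1/sqrt(c))."""
    max_norm = (1.0 - eps) / np.sqrt(c)
    norm = np.linalg.norm(x)
    if norm > max_norm:
        return x * (max_norm / norm)
    return x

# Hypothetical scores for three candidate entities; index 0 is the true target.
scores = np.array([2.0, 0.5, -1.0])
loss = cross_entropy(scores, target_index=0)

# A point that left the ball after a Euclidean gradient step is pulled back in.
point = project_to_ball(np.array([3.0, 4.0]), c=1.0)
```

Applying the projection after every optimizer step is one common way to realise a "projection constraint"; whether the patent applies it per step or per epoch is not stated.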

[0014] According to a second aspect of the present invention, a multimodal event prediction system integrating dynamic multi-spatial features is provided, comprising: The dynamic structural modality acquisition unit is used to serialize the event knowledge graph composed of multimodal entities according to the time flow, capture the structural modality features of multimodal entities at each moment from different geometric spaces based on the message passing mechanism, assign evolutionary characteristics to the structural features, and obtain the dynamic structural modality feature matrix.

[0015] The dynamic auxiliary modality acquisition unit is used to encode text features and image features for multimodal entities at each time step using a pre-trained language model and a pre-trained visual model, respectively, and to assign evolutionary properties to the text features and image features to obtain dynamic language modality feature matrices and dynamic visual modality feature matrices.

[0016] The dual-fusion evolutionary attention unit is used to fuse the dynamic structural modality feature matrix, dynamic language modality feature matrix, and dynamic visual modality feature matrix at each time step through a two-layer symmetric attention mechanism to obtain the fused feature matrix at each time step. By integrating the differences in attention weights of the fused feature matrices at different time steps for future events, the historical multimodal temporal feature matrix is obtained.

[0017] The curvature adaptive decoding prediction unit is used to map the historical multimodal temporal feature matrix to a variable negative curvature space, calculate the multimodal event score for the next future moment based on the hyperbolic distance, and achieve prediction.

[0018] According to a third aspect of the present invention, an electronic device is provided, comprising: a computer-readable storage medium and a processor; The computer-readable storage medium is used to store executable instructions; The processor is configured to read executable instructions stored in the computer-readable storage medium and execute the method as described in the first aspect.

[0019] According to a fourth aspect of the invention, a computer-readable storage medium is provided, the computer-readable storage medium storing computer instructions for causing a processor to perform the method as described in the first aspect.

[0020] In summary, compared with the prior art, the above technical solutions conceived by this invention achieve the following beneficial effects: Through a complete chain of four stages (dynamic structural modality acquisition, dynamic auxiliary modality acquisition, dual fusion evolutionary attention, and curvature adaptive decoding prediction), this invention provides a holistic framework for acquiring and fusing multimodal knowledge in dynamic scenarios. On the one hand, at the knowledge acquisition level, existing dynamic learning techniques are limited to shallow or single heterogeneous spaces and cannot capture deep relation-aware geometric features; to address this bottleneck, this invention integrates temporal structural features from Euclidean space, hyperbolic space, and complex space into the message-passing framework of a relation-aware graph neural network to learn deep representations that reflect human intelligence in associative thinking, higher-order abstraction, and logical reasoning. The pre-trained models further endow this invention with time-sensitive visual and linguistic intelligence. On the other hand, at the knowledge fusion level, static fusion techniques fail to capture the dynamic impact of different historical modalities on future events; to address this limitation, this invention proposes a dual fusion evolutionary attention mechanism that treats all modalities equally as attention learners and assigns dynamically learned weights to different modalities at different historical timestamps. Finally, based on the effective acquisition and fusion of historical multimodal temporal knowledge, this invention integrates multi-spatial features through a curvature adaptive decoder to obtain scores, thereby achieving prediction of future multimodal events.

Attached Figure Description

[0021] Figure 1 is a flowchart of the multimodal event prediction method that integrates dynamic multi-spatial features, as provided in an embodiment of the present invention; Figure 2 is a schematic diagram of the multimodal event prediction process that integrates dynamic multi-spatial features, as provided in an embodiment of the present invention; Figure 3 is a schematic diagram of the structure of a multimodal event prediction system that integrates dynamic multi-spatial features, as provided in an embodiment of the present invention.

Detailed Implementation

[0022] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention. Furthermore, the technical features involved in the various embodiments of this invention described below can be combined with each other as long as they do not conflict with each other.

[0023] This invention provides a multimodal event prediction method that integrates dynamic multi-spatial features. As shown in Figure 1, the method includes: S1. The event knowledge graph composed of multimodal entities is serialized according to the time flow. Based on the message passing mechanism, the structural modal features of multimodal entities at each moment are captured from different geometric spaces. Evolutionary characteristics are assigned to the structural features to obtain a dynamic structural modal feature matrix.

[0024] S2. Using a pre-trained language model and a pre-trained visual model, respectively, text features and image features are encoded for multimodal entities at each time step, and evolutionary properties are assigned to the text features and image features to obtain a dynamic language modality feature matrix and a dynamic visual modality feature matrix.

[0025] S3. Through a two-layer symmetric attention mechanism, the dynamic structural modality feature matrix, dynamic language modality feature matrix, and dynamic visual modality feature matrix at each moment are fused to obtain the fused feature matrix at each moment. By integrating the differences in attention weights of the fused feature matrices at different moments for future events, the historical multimodal temporal feature matrix is obtained.

[0026] S4. The historical multimodal temporal feature matrix is mapped to a variable negative curvature space, and the multimodal event score for the next future moment is calculated based on the hyperbolic distance to achieve prediction.

[0027] In step S1, the multimodal temporal knowledge graph can be formally represented as a tuple of six sets, which denote the entities, images, texts, relations, timestamps, and events, respectively. Partitioned along the time stream, it becomes a sequence of multimodal knowledge graphs, one per timestamp, each containing all time-sensitive events, images, and text information at that moment. A multimodal event in the multimodal temporal knowledge graph is a structural quadruple consisting of a subject entity, the relation that connects it to an object entity, the object entity, and the moment at which the fact occurs. The subject and object entities are each associated with auxiliary visual and language modal information constrained to the specific time t.
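The time-stream partitioning can be sketched as follows. This is a minimal illustration assuming events are stored as plain (subject, relation, object, time) tuples; the function name and the toy events are hypothetical.

```python
from collections import defaultdict

def split_by_time(quadruples):
    """Group (subject, relation, object, time) events into per-timestamp subgraphs."""
    snapshots = defaultdict(list)
    for subject, relation, obj, time in quadruples:
        snapshots[time].append((subject, relation, obj))
    # Return the subgraphs as a sequence ordered by timestamp.
    return [(time, snapshots[time]) for time in sorted(snapshots)]

# Toy events, purely illustrative.
events = [
    ("A", "meets", "B", 2),
    ("B", "visits", "C", 1),
    ("A", "calls", "C", 1),
]
sequence = split_by_time(events)
```

In a full pipeline each subgraph would also carry the images and text descriptions valid at that timestamp; they are left out here to keep the sketch short.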

[0028] Subsequently, as shown in Figure 2, Euclidean messages are designed based on Euclidean addition to replicate human associative thinking: through chain-like associations of relational semantics, the local neighborhood interactions that influence the central multimodal event are aggregated directly. The Euclidean message is computed from the Euclidean-space embeddings of the subject entity and the relation in the neighborhood of the central entity. Hyperbolic messages are designed based on Möbius addition to replicate human high-level abstraction, perceiving the global high-level structure of concurrent events through hyperbolic isometric embedding. The hyperbolic message is computed from the hyperbolic-space embeddings of the subject entity and the relation, using a logarithmic mapping, which maps an embedding vector from hyperbolic space to Euclidean space, and a Möbius addition bound to the relation-specific curvature. By combining hyperbolic isometric reflection and rotation operations, the inherent heterogeneous logic of relations is preserved during hierarchical learning; this combination uses an introduced learnable weight vector, relation-specific block-diagonal matrices that represent the reflection and rotation operations, and an exponential mapping, which maps an embedding vector from Euclidean space to hyperbolic space. Complex messages are designed based on the complex Hadamard product to replicate human logical reasoning; the message applies the Hadamard product to the complex-space embeddings and then takes the real part.
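A minimal sketch of the three message types may help make the operations concrete. It assumes the Poincaré ball model with curvature -c (c > 0) and uses the standard closed forms for Möbius addition and for the exponential and logarithmic maps at the origin. The reflection and rotation step of the hyperbolic message is omitted, and the exact message formulas of the patent are not reproduced here; all function names are hypothetical.

```python
import numpy as np

def mobius_add(x, y, c):
    """Mobius addition of two points in the Poincare ball of curvature -c."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den

def expmap0(v, c):
    """Exponential map at the origin: Euclidean tangent vector -> ball."""
    norm = max(np.linalg.norm(v), 1e-12)
    return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)

def logmap0(y, c):
    """Logarithmic map at the origin: ball -> Euclidean tangent vector."""
    norm = max(np.linalg.norm(y), 1e-12)
    scaled = min(np.sqrt(c) * norm, 1 - 1e-7)  # keep arctanh finite
    return np.arctanh(scaled) * y / (np.sqrt(c) * norm)

def euclidean_message(e_s, e_r):
    # Euclidean addition of subject and relation embeddings.
    return e_s + e_r

def hyperbolic_message(h_s, h_r, c):
    # Mobius addition in the ball, then mapped back to Euclidean space.
    return logmap0(mobius_add(h_s, h_r, c), c)

def complex_message(z_s, z_r):
    # Real part of the element-wise (Hadamard) product of complex embeddings.
    return np.real(z_s * z_r)

c = 1.0
e_s, e_r = np.array([0.3, -0.4]), np.array([0.1, 0.2])
m_euc = euclidean_message(e_s, e_r)
m_hyp = hyperbolic_message(expmap0(e_s, c), expmap0(e_r, c), c)
m_cpx = complex_message(np.array([1 + 2j]), np.array([3 - 1j]))
```

Mapping the hyperbolic message back through the logarithmic map puts all three messages in a Euclidean space, which is one plausible way to make them comparable before they are combined.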

[0029] Next, to extend the shallow messages to a deeper structural modality representation, the multi-space messages are integrated through an additive attention mechanism, whose weighting coefficients are produced by the alignment model of additive attention. The integrated message is then applied in the message-passing architecture of the graph neural network to obtain the deep structural topology at each time step. The update module takes as input the structural modality feature matrix output by the above multi-space graph neural network at the corresponding time step, and yields the dynamic structural modality feature matrix at that time step.
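The integration of the multi-space messages can be sketched with a generic additive (tanh-based) attention. The parameter shapes, the names `W` and `v`, and the use of a single shared projection for all three spaces are assumptions for illustration only.

```python
import numpy as np

def additive_attention(messages, W, v):
    """Combine K messages of dimension d with additive attention.

    messages: (K, d) array, one row per geometric space.
    W: (hidden, d) projection, v: (hidden,) scoring vector (both learnable in practice).
    """
    scores = np.tanh(messages @ W.T) @ v  # one alignment score per space, shape (K,)
    scores = scores - scores.max()        # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ messages, weights

rng = np.random.default_rng(0)
d, hidden = 4, 8
messages = rng.normal(size=(3, d))  # Euclidean, hyperbolic, complex messages
W = rng.normal(size=(hidden, d))
v = rng.normal(size=hidden)
fused, weights = additive_attention(messages, W, v)
```

The weights sum to one, so the fused message is a convex combination of the per-space messages; when all three messages coincide, the result equals that common message regardless of the parameters.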

[0030] In step S2, as shown in Figure 2, a pre-trained VGG vision model is used to acquire the image features unique to each historical time step. An update module is then introduced to model the time-shifting effect of the dynamic visual modality. Its input is derived from the collection of images associated with each event (entity) at the given moment: the pre-trained visual model encodes these images at each time step, and average pooling is applied over all image embeddings associated with each entity. The result is the dynamic visual modality feature matrix at that time step.
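The per-entity average pooling can be sketched as follows, assuming the image embeddings have already been produced by the pre-trained visual model. Leaving entities without images at zero is an assumption of this sketch, as are the function and variable names.

```python
import numpy as np

def pool_entity_images(images_by_entity, num_entities, dim):
    """Average all image embeddings of each entity at one time step.

    images_by_entity: dict mapping entity index -> list of (dim,) embeddings.
    Entities with no images keep a zero vector (an assumption of this sketch).
    """
    features = np.zeros((num_entities, dim))
    for entity_id, embeddings in images_by_entity.items():
        if len(embeddings) > 0:
            features[entity_id] = np.mean(embeddings, axis=0)
    return features

# Hypothetical embeddings for one time step: entity 0 has two images, entity 2 has one.
images_t = {
    0: [np.array([1.0, 3.0]), np.array([3.0, 5.0])],
    2: [np.array([0.5, 0.5])],
}
V_t = pool_entity_images(images_t, num_entities=3, dim=2)
```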

[0031] Then, a parallelized pre-trained BERT language model is used to obtain time-sensitive text features. Similarly, an update module is introduced to model the temporal shifts of the dynamic language modality. Its input is derived from the collection of time-sensitive text descriptions associated with each event (entity), encoded by the pre-trained language model at each time step. Processing follows the same approach as for the dynamic structural modality: a recurrent neural network endows the visual and language feature matrices with time-shifting properties. The result is the dynamic language modality feature matrix at that time step.
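The recurrent update shared by the structural, visual, and language modalities can be illustrated with a gated recurrent unit. The text only specifies a recurrent neural network, so the choice of a GRU, the gate convention, and the random initialisation below are assumptions of this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell applied row-wise to an (N, dim) feature matrix."""

    def __init__(self, dim, rng):
        scale = 1.0 / np.sqrt(dim)
        self.W = rng.uniform(-scale, scale, (3, dim, dim))  # input weights (z, r, n)
        self.U = rng.uniform(-scale, scale, (3, dim, dim))  # hidden weights (z, r, n)
        self.b = np.zeros((3, dim))

    def __call__(self, x, h):
        z = sigmoid(x @ self.W[0] + h @ self.U[0] + self.b[0])        # update gate
        r = sigmoid(x @ self.W[1] + h @ self.U[1] + self.b[1])        # reset gate
        n = np.tanh(x @ self.W[2] + (r * h) @ self.U[2] + self.b[2])  # candidate state
        return (1.0 - z) * n + z * h

rng = np.random.default_rng(0)
cell = GRUCell(dim=4, rng=rng)
H = np.zeros((5, 4))                  # dynamic feature matrix for 5 entities
for _ in range(3):                    # three historical time steps
    X_t = rng.normal(size=(5, 4))     # stand-in for the encoder output at this step
    H = cell(X_t, H)
```

Carrying `H` across time steps is what gives the per-step features their dependence on earlier steps.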

[0032] In step S3, as shown in Figure 2, a dual fusion evolutionary attention mechanism is designed to capture the continuously evolving fusion features; its architecture consists of multiple stacked Transformer components. To simulate the different attentional emphases that humans place on different times and modalities when predicting the future, an initialization matrix is introduced as a third-party attention allocator, and the acquired modality-specific and time-specific embedding feature matrices are used equally as attention learners. Specifically, the initialization matrix generates the query matrix of multi-head attention, and the matrices of the different modalities at a specific time generate the key and value matrices. The attention computation uses a ReLU activation function and a scaling factor that mitigates gradient vanishing. A feedforward neural network is then introduced to enhance temporal semantics, yielding the fusion feature matrix of the first-layer fusion attention.
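The first-layer fusion attention can be sketched as scaled dot-product attention in which a separate query matrix attends over the three modality matrices. For brevity this sketch uses a single head with a softmax and omits the ReLU and feedforward parts; all names and shapes are assumptions.

```python
import numpy as np

def fuse_modalities(Q, S, V, L, Wq, Wk, Wv):
    """Fuse structural (S), visual (V), and language (L) features per entity.

    Q: (N, d) query matrix acting as the third-party attention allocator.
    S, V, L: (N, d) modality feature matrices at one time step.
    """
    M = np.stack([S, V, L], axis=1)  # (N, 3, d): three modalities per entity
    q, k, v = Q @ Wq, M @ Wk, M @ Wv
    scores = np.einsum("nd,nmd->nm", q, k) / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # (N, 3) modality weights
    fused = np.einsum("nm,nmd->nd", weights, v)
    return fused, weights

rng = np.random.default_rng(0)
N, d = 5, 4
Q, S, V, L = (rng.normal(size=(N, d)) for _ in range(4))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
fused_t, modality_weights = fuse_modalities(Q, S, V, L, Wq, Wk, Wv)
```

Because the query is not derived from any one modality, no modality acts as the allocator for the others; each entity receives its own weight over the three modalities.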

[0033] The second-layer evolutionary attention mechanism further assigns dynamic weights to the fused feature matrices at different historical moments, thereby extracting rich evolutionary patterns that change over time and help predict unknown future multimodal events. It reuses the multi-head attention operation and the feedforward neural network described above, and yields a unified historical multimodal temporal feature matrix.
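The second layer can be sketched in the same style, with attention now running over the time axis instead of the modality axis. Projections, multiple heads, and the feedforward network are left out, and the names are hypothetical.

```python
import numpy as np

def evolutionary_attention(Q, fused_seq):
    """Weight the fused features of T historical steps for each entity.

    Q: (N, d) query matrix; fused_seq: (T, N, d) fused features per time step.
    """
    d = Q.shape[-1]
    scores = np.einsum("nd,tnd->nt", Q, fused_seq) / np.sqrt(d)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # (N, T) weights over history
    history = np.einsum("nt,tnd->nd", weights, fused_seq)
    return history, weights

rng = np.random.default_rng(0)
T, N, d = 3, 5, 4
fused_seq = rng.normal(size=(T, N, d))
Q = rng.normal(size=(N, d))
history, time_weights = evolutionary_attention(Q, fused_seq)
```

With an all-zero query the weights become uniform and the result reduces to a plain average over time, which is a convenient sanity check.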

[0034] Finally, in step S4, as shown in Figure 2, for a future multimodal event query at the next time step, and to adapt to curvature variations arising during multi-space learning, an exponential mapping is first used to project the unified multimodal temporal entity embedding matrix and the relation matrix onto a variable negative curvature space, so that the inherent features of the different geometric spaces are accommodated simultaneously during decoding. Next, the hyperbolic distance between each candidate entity and the event query context is calculated; the distance is defined between arbitrary points (embedding vectors) on the relation-specific geometric manifold in hyperbolic space. The score for predicting a future multimodal event is then obtained from this distance together with parameterized learnable biases for the subject entity and for the candidate object. Finally, the scores of all candidate entities are recorded, and the candidate with the highest score is selected as the prediction target of the future multimodal event.
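A minimal sketch of such a decoder is given below. It assumes the Poincaré ball model, a curvature kept positive through a softplus, a query formed by Möbius addition of the subject and relation points, and a score of the form "negative squared distance plus biases". These are common choices in hyperbolic knowledge graph models and are assumptions here, not the patent's exact formulas.

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def expmap0(v, c):
    """Exponential map at the origin of the Poincare ball of curvature -c."""
    norm = max(np.linalg.norm(v), 1e-12)
    return np.tanh(np.sqrt(c) * norm) * v / (np.sqrt(c) * norm)

def mobius_add(x, y, c):
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / den

def hyp_distance(x, y, c):
    """Geodesic distance between two points of the ball."""
    diff_norm = np.linalg.norm(mobius_add(-x, y, c))
    return 2.0 / np.sqrt(c) * np.arctanh(min(np.sqrt(c) * diff_norm, 1 - 1e-7))

def score_candidates(e_s, e_r, candidates, b_s, b_o, c):
    """Score every candidate object for the query (subject, relation, ?)."""
    query = mobius_add(expmap0(e_s, c), expmap0(e_r, c), c)
    scores = []
    for e_o, bias_o in zip(candidates, b_o):
        dist = hyp_distance(query, expmap0(e_o, c), c)
        scores.append(-dist ** 2 + b_s + bias_o)
    return np.array(scores)

raw_c = 0.0              # unconstrained learnable parameter (hypothetical)
c = softplus(raw_c)      # curvature magnitude stays positive
e_s, e_r = np.array([0.3, -0.4]), np.zeros(2)
candidates = np.array([[0.3, -0.4], [1.0, 1.0], [-0.5, 0.2]])
scores = score_candidates(e_s, e_r, candidates, b_s=0.0, b_o=np.zeros(3), c=c)
predicted = int(np.argmax(scores))
```

In this toy setup the relation embedding is zero, so the query coincides with the subject point and the first candidate, which shares the subject's embedding, is at distance zero and receives the highest score.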

[0035] In summary, through a complete chain of four stages (dynamic structural modality acquisition, dynamic auxiliary modality acquisition, dual fusion evolutionary attention, and curvature adaptive decoding prediction), the method provided by this invention achieves a holistic framework for acquiring and fusing multimodal knowledge in dynamic scenarios. On the one hand, at the knowledge acquisition level, existing dynamic learning techniques are limited to shallow or single heterogeneous spaces and struggle to capture deep relation-aware geometric features; to address this bottleneck, this invention integrates temporal structural features in Euclidean space, hyperbolic space, and complex space into the message-passing framework of a relation-aware graph neural network to learn deep representations that reflect human intelligence in associative thinking, higher-order abstraction, and logical reasoning. The pre-trained models further endow this invention with time-sensitive visual and linguistic intelligence characteristics. On the other hand, at the knowledge fusion level, static fusion techniques fail to capture the dynamic impact of different historical modalities on future events; to address this limitation, this invention proposes a dual fusion evolutionary attention mechanism that treats all modalities equally as attention learners and assigns dynamically learned weights to different modalities at different historical timestamps. Ultimately, based on the effective acquisition and fusion of historical multimodal temporal knowledge, this invention integrates multi-spatial features through a curvature adaptive decoder to obtain scores, thereby enabling the prediction of future multimodal events.

[0036] The following describes the multimodal event prediction system that integrates dynamic multi-spatial features provided by the present invention. The system described below and the multimodal event prediction method described above may be read with reference to each other.

[0037] This invention provides a multimodal event prediction system that integrates dynamic multi-spatial features. As shown in Figure 3, the system includes: The dynamic structural modality acquisition unit 310 is used to serialize the event knowledge graph composed of multimodal entities according to the time flow, capture the structural modality features of multimodal entities at each moment from different geometric spaces based on the message passing mechanism, assign evolutionary characteristics to the structural features, and obtain the dynamic structural modality feature matrix.

[0038] The dynamic auxiliary modality acquisition unit 320 is used to encode text features and image features for multimodal entities at each time step using a pre-trained language model and a pre-trained visual model, respectively, and to assign evolutionary properties to the text features and image features to obtain a dynamic language modality feature matrix and a dynamic visual modality feature matrix.

[0039] The dual-fusion evolutionary attention unit 330 is used to fuse the dynamic structural modality feature matrix, dynamic language modality feature matrix, and dynamic visual modality feature matrix at each time step through a two-layer symmetric attention mechanism to obtain the fusion feature matrix at each time step. By integrating the differences in attention weights of the fusion feature matrices at different time steps for future events, the historical multimodal temporal feature matrix is obtained.

[0040] The curvature adaptive decoding prediction unit 340 is used to map the historical multimodal temporal feature matrix to a variable negative curvature space, calculate the multimodal event score of the next future moment based on the hyperbolic distance, and realize prediction.

[0041] Through a complete chain of four stages (dynamic structural modality acquisition, dynamic auxiliary modality acquisition, dual fusion evolutionary attention, and curvature adaptive decoding prediction), the system provided in this invention achieves a holistic framework for acquiring and fusing multimodal knowledge in dynamic scenarios. On the one hand, at the knowledge acquisition level, existing dynamic learning techniques are limited to shallow or single heterogeneous spaces and cannot capture deep relation-aware geometric features; to address this bottleneck, this invention integrates temporal structural features in Euclidean space, hyperbolic space, and complex space into the message-passing framework of a relation-aware graph neural network to learn deep representations that reflect human intelligence characteristics in associative thinking, higher-order abstraction, and logical reasoning. The pre-trained models further endow this invention with time-sensitive visual and linguistic intelligence characteristics. On the other hand, at the knowledge fusion level, static fusion techniques fail to capture the dynamic impact of different historical modalities on future events; to address this limitation, this invention proposes a dual fusion evolutionary attention mechanism that treats all modalities equally as attention learners and assigns dynamically learned weights to different modalities at different historical timestamps. Ultimately, based on the effective acquisition and fusion of historical multimodal temporal knowledge, this invention integrates multi-spatial features through a curvature adaptive decoder to obtain scores, thereby enabling the prediction of future multimodal events.

[0042] Based on any of the above embodiments, the dynamic structural modality acquisition stage serializes the event knowledge graph composed of multimodal entities according to the time flow and uses the inherent geometric properties of Euclidean space, hyperbolic space, and complex space to model the dynamic interactions of historical multimodal events, which endows the present invention with biomimetic deep associative thinking, higher-order abstraction, and logical reasoning capabilities. This stage specifically includes: The multimodal event knowledge graph is divided into event subgraphs based on the time flow. The event subgraph at each moment contains the event interactions that occur at the current moment, i.e., the structural modality; the entity text descriptions of the events, i.e., the linguistic modality; and the entity image descriptions, i.e., the visual modality. Based on Euclidean addition, Möbius addition, and complex Hadamard product operations, a multi-space message mechanism for relation-aware graph neural networks is designed. Based on addition and attention, the distinct message paradigms of the different geometric spaces are integrated. The deep structural modal features of multimodal entities are captured through a multi-layer message passing framework. A recurrent neural network is used as an update module to give the structural modal features at different times a dynamic time-shift dependency, resulting in a dynamic structural modality feature matrix.
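
As an illustration only, the three composition operations named above can be sketched in plain Python. The function names, the choice of combining an entity embedding with a relation embedding, the curvature magnitude `c`, and the use of Python complex numbers for the complex space are assumptions made for this sketch; the specification does not fix these details, and the attention-based integration of the three messages is not shown.

```python
def _dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def euclidean_message(entity, relation):
    """Euclidean space: element-wise vector addition."""
    return [e + r for e, r in zip(entity, relation)]

def mobius_message(entity, relation, c=1.0):
    """Hyperbolic space: Mobius addition on the Poincare ball.

    c > 0 is the magnitude of the negative curvature (assumed parameter).
    """
    xy = _dot(entity, relation)
    x2 = _dot(entity, entity)
    y2 = _dot(relation, relation)
    coef_x = 1 + 2 * c * xy + c * y2
    coef_y = 1 - c * x2
    denom = 1 + 2 * c * xy + c * c * x2 * y2
    return [(coef_x * e + coef_y * r) / denom
            for e, r in zip(entity, relation)]

def complex_message(entity, relation):
    """Complex space: element-wise (Hadamard) product of complex vectors."""
    return [e * r for e, r in zip(entity, relation)]
```

For example, `mobius_message([0.5, 0.0], [0.5, 0.0])` stays inside the unit ball, whereas plain addition of the same vectors would land on its boundary.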

[0043] Based on any of the above embodiments, the dynamic auxiliary modality acquisition stage encodes the time-evolving auxiliary modality information of the multimodal entity at each moment using a pre-trained language model and a pre-trained visual model, respectively, endowing the present invention with biomimetic visual and linguistic capabilities, specifically including: The pre-trained visual model acquires unique image features at each historical moment, and an update module based on a recurrent neural network is also introduced to give the visual modality features dynamic time-shift dependence, resulting in a dynamic visual modality feature matrix. The pre-trained language model acquires unique text features at each historical moment, and an update module based on a recurrent neural network is also introduced to give the language modality features dynamic time-shift dependence, resulting in a dynamic language modality feature matrix.
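
The recurrent update can be pictured with a deliberately simplified gated cell. The specification only says "recurrent neural network", so the gate form, the element-wise weights `w_x`, `w_h`, `bias`, and the zero initial state below are all assumptions of this sketch, not the invention's actual update module. The per-step input vectors stand in for the outputs of the pre-trained language or visual model.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gated_update(h_prev, x_t, w_x, w_h, bias):
    """One recurrent step with element-wise (diagonal) gate weights.

    The gate z decides how much of the previous state is kept and how
    much of the current pre-trained feature x_t is written in.
    """
    h_new = []
    for h, x, wx, wh, b in zip(h_prev, x_t, w_x, w_h, bias):
        z = sigmoid(wx * x + wh * h + b)
        h_new.append((1.0 - z) * h + z * math.tanh(x))
    return h_new

def evolve(features_per_step, w_x, w_h, bias):
    """Run the update over all historical moments, oldest first.

    Returns one state per moment; stacking the states of all entities
    would give a dynamic modality feature matrix.
    """
    h = [0.0] * len(features_per_step[0])  # assumed zero initial state
    states = []
    for x_t in features_per_step:
        h = gated_update(h, x_t, w_x, w_h, bias)
        states.append(h)
    return states
```

Because each state depends on the previous one, identical features observed at two different moments generally produce different states, which is the time-shift dependence the paragraph describes.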

[0044] Based on any of the above embodiments, the dual-fusion evolutionary attention stage dynamically fuses multimodal features from different historical moments through two carefully designed symmetric attention mechanisms, thereby constructing a temporal dependency relationship between future unknown events and historical multimodal data. This endows the invention with a biomimetic emphasis capability, enabling it to assign different emphases to different modal features at different times, specifically including: The first-layer fusion attention mechanism assigns learnable weights to different modal features in parallel at each historical moment, thereby obtaining a fusion feature matrix that includes structural, visual, and linguistic modal features at different times. The second-layer attention mechanism assigns dynamically learned weights to the fusion feature matrix at each historical moment, and then obtains the historical multimodal temporal feature matrix by integrating the differences in attention weights of the fusion feature matrices at different moments to future events.
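
A minimal sketch of the two attention layers follows, assuming dot-product scoring against learnable query vectors (`modality_query` and `time_query` are hypothetical names; the specification does not state the scoring function or what "symmetric" means in detail). The first layer pools the three modality vectors of one entity at each moment; the second layer pools the resulting fused vectors across moments.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(vectors, query):
    """Softmax-weighted sum of vectors, scored by dot product with query."""
    weights = softmax([sum(a * b for a, b in zip(v, query)) for v in vectors])
    dim = len(vectors[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, vectors))
              for i in range(dim)]
    return pooled, weights

def dual_fusion(history, modality_query, time_query):
    """history: one (structural, linguistic, visual) triple per moment."""
    # Layer 1: weight the three modalities within each moment.
    fused = [attend(list(step), modality_query)[0] for step in history]
    # Layer 2: weight the fused vectors across moments.
    return attend(fused, time_query)
```

With an all-zero query, both layers reduce to plain averaging; trained queries shift the weights toward the modalities and moments that matter most for the event being predicted.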

[0045] Based on any of the above embodiments, the curvature adaptive decoding prediction stage maps the entity embedding vectors in the historical multimodal temporal feature matrix to a variable negative curvature space, calculates the scores of all candidate entities for the multimodal event at the next future moment based on hyperbolic distance, and thereby predicts the missing future events, specifically including: Based on the multi-spatial features of multimodal temporal events, especially their latent hierarchical structure and tree-like associations, a learnable curvature parameter is introduced to construct a Poincaré ball model space with dynamically changing negative curvature. Using the exponential map, the entity embedding vectors in the historical multimodal temporal feature matrix are projected onto the variable-curvature hyperbolic manifold. For all multimodal event candidate entities that may appear in the next time step, the hyperbolic distance between the context embedding of the unknown event query and the embedding of each candidate entity is calculated in hyperbolic space. A scoring function is constructed based on the hyperbolic distance to calculate the occurrence probability score of each candidate entity. The entity with the highest score is predicted as the target entity of the missing future event. Cross-entropy loss is used for training, and the parameters are updated through the Adam algorithm and Euclidean backpropagation under projection constraints.
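
The decoding steps can be sketched with the standard Poincaré ball formulas. This is an illustration under assumptions: the curvature is written as -c with c > 0 passed in as a plain number rather than learned, the exponential map is taken at the origin, and the score is the negative hyperbolic distance. The specification does not give the exact scoring function, and all function names here are hypothetical.

```python
import math

def _dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def mobius_add(x, y, c):
    xy, x2, y2 = _dot(x, y), _dot(x, x), _dot(y, y)
    coef_x = 1 + 2 * c * xy + c * y2
    coef_y = 1 - c * x2
    denom = 1 + 2 * c * xy + c * c * x2 * y2
    return [(coef_x * a + coef_y * b) / denom for a, b in zip(x, y)]

def expmap0(v, c):
    """Exponential map at the origin: tangent vector -> point in the ball."""
    norm = math.sqrt(_dot(v, v))
    if norm == 0.0:
        return list(v)
    scaled = math.sqrt(c) * norm
    return [math.tanh(scaled) * a / scaled for a in v]

def hyp_distance(x, y, c):
    """Geodesic distance on the Poincare ball of curvature -c."""
    diff = mobius_add([-a for a in x], y, c)
    # Clamp to keep atanh finite for points numerically on the boundary.
    arg = min(math.sqrt(c) * math.sqrt(_dot(diff, diff)), 1.0 - 1e-12)
    return 2.0 / math.sqrt(c) * math.atanh(arg)

def predict(query, candidates, c):
    """Score every candidate by negative distance; return best index, scores."""
    q = expmap0(query, c)
    scores = [-hyp_distance(q, expmap0(e, c), c) for e in candidates]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

def cross_entropy(scores, target):
    """Softmax cross-entropy of the true candidate, via log-sum-exp."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return lse - scores[target]
```

In a trainable version, `c` would be a positive learnable parameter updated together with the embeddings, which is what makes the decoder curvature adaptive.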

[0046] This invention provides an electronic device, including: a computer-readable storage medium and a processor; The computer-readable storage medium is used to store executable instructions; The processor is configured to read executable instructions stored in the computer-readable storage medium and execute the method as described in any of the above embodiments.

[0047] This invention provides a computer-readable storage medium storing computer instructions that cause a processor to perform the method described in any of the above embodiments.

[0048] This invention provides a computer program product, including a computer program or instructions, which, when executed by a processor, implement the method described in any of the above embodiments.

[0049] The technical features of the embodiments described above can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as the combination of these technical features does not contradict each other, it should be considered within the scope of this specification. It should be noted that the terms "in one embodiment," "for example," and "again" in this invention are intended to illustrate the invention and are not intended to limit the invention.

[0050] The embodiments described above are merely examples of several implementations of the present invention, and while the descriptions are relatively specific and detailed, they should not be construed as limiting the scope of the patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these modifications and improvements all fall within the scope of protection of the present invention.

Claims

1. A multimodal event prediction method integrating dynamic multi-spatial features, characterized in that it includes: S1, serializing the event knowledge graph composed of multimodal entities according to the time flow, capturing the structural modal features of the multimodal entities at each moment from different geometric spaces based on a message passing mechanism, and assigning evolutionary characteristics to the structural modal features to obtain a dynamic structural modality feature matrix; S2, encoding text features and image features for the multimodal entities at each moment using a pre-trained language model and a pre-trained visual model, respectively, and assigning evolutionary characteristics to the text features and image features to obtain a dynamic language modality feature matrix and a dynamic visual modality feature matrix; S3, fusing, through a two-layer symmetric attention mechanism, the dynamic structural modality feature matrix, dynamic language modality feature matrix and dynamic visual modality feature matrix at each moment to obtain the fusion feature matrix at each moment, and obtaining the historical multimodal temporal feature matrix by integrating the fusion feature matrices of the different moments according to their differing attention weights with respect to future events; S4, mapping the historical multimodal temporal feature matrix to a variable negative curvature space, and calculating the multimodal event score for the next future moment based on the hyperbolic distance to achieve prediction.

2. The multimodal event prediction method integrating dynamic multi-spatial features as described in claim 1, characterized in that, S1 includes: Based on the time flow, the event knowledge graph is divided into multiple event subgraphs corresponding to each moment. Each event subgraph includes the event interaction that occurs at the current moment, i.e., the structural modality, the entity text description of the event, i.e., the linguistic modality, and the entity image description, i.e., the visual modality. Based on Euclidean addition, Möbius addition, and complex Hadamard product operations, a multi-space message mechanism for relation-aware graph neural networks is designed. Based on addition and attention, message paradigms of different geometric spaces are integrated. The structural modal features of multimodal entities are captured through a multi-layer message passing framework. A recurrent neural network is used as an update module to give the structural modal features at different times a dynamic time-shift dependency.

3. The multimodal event prediction method integrating dynamic multi-spatial features as described in claim 1, characterized in that S2 includes: The pre-trained visual model acquires image features at each historical moment, and an update module based on a recurrent neural network is also introduced to give the visual modality features dynamic time-shift dependence, resulting in a dynamic visual modality feature matrix. The pre-trained language model acquires text features at each historical moment, and an update module based on a recurrent neural network is also introduced to give the language modality features dynamic time-shift dependence, resulting in a dynamic language modality feature matrix.

4. The multimodal event prediction method integrating dynamic multi-spatial features as described in claim 1, characterized in that S3 includes: The first-layer attention mechanism assigns learnable weights to different modal features in parallel at each historical moment, thereby obtaining a fusion feature matrix that includes structural, visual, and linguistic modal features at different times. The second-layer attention mechanism assigns dynamically learned weights to the fusion feature matrix at each historical moment, and then obtains the historical multimodal temporal feature matrix by integrating the fusion feature matrices of the different moments according to their differing attention weights with respect to future events.

5. The multimodal event prediction method integrating dynamic multi-spatial features as described in claim 1, characterized in that S4 includes: A learnable curvature parameter is introduced to construct a Poincaré ball model space with dynamically changing negative curvature. Using the exponential map, the entity embedding vectors in the historical multimodal temporal feature matrix are projected onto the variable-curvature hyperbolic manifold. For all multimodal event candidate entities that may appear in the next time step, the hyperbolic distance between the context embedding of the unknown event query and the embedding of each candidate entity is calculated in hyperbolic space. A scoring function is constructed based on the hyperbolic distance to calculate the occurrence probability score of each candidate entity. The entity with the highest score is predicted as the target entity of the missing future event. Cross-entropy loss is used for training, and the parameters are updated through the Adam algorithm and Euclidean backpropagation under projection constraints.

6. A multimodal event prediction system integrating dynamic multi-spatial features, characterized in that it includes: The dynamic structural modality acquisition unit is used to serialize the event knowledge graph composed of multimodal entities according to the time flow, capture the structural modal features of the multimodal entities at each moment from different geometric spaces based on a message passing mechanism, and assign evolutionary characteristics to the structural modal features to obtain the dynamic structural modality feature matrix. The dynamic auxiliary modality acquisition unit is used to encode text features and image features for the multimodal entities at each time step using a pre-trained language model and a pre-trained visual model, respectively, and to assign evolutionary characteristics to the text features and image features to obtain the dynamic language modality feature matrix and the dynamic visual modality feature matrix. The dual-fusion evolutionary attention unit is used to fuse the dynamic structural modality feature matrix, dynamic language modality feature matrix, and dynamic visual modality feature matrix at each time step through a two-layer symmetric attention mechanism to obtain the fusion feature matrix at each time step, and to obtain the historical multimodal temporal feature matrix by integrating the fusion feature matrices of the different time steps according to their differing attention weights with respect to future events. The curvature adaptive decoding prediction unit is used to map the historical multimodal temporal feature matrix to a variable negative curvature space, and calculate the multimodal event score for the next future moment based on the hyperbolic distance to achieve prediction.

7. An electronic device, characterized in that it includes: a computer-readable storage medium and a processor; the computer-readable storage medium is used to store executable instructions; the processor is configured to read the executable instructions stored in the computer-readable storage medium and execute the method as described in any one of claims 1-5.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions for causing a processor to perform the method as described in any one of claims 1-5.