An asymmetric fusion-based cross-modal time series prediction method, system, medium and device

By performing asymmetric fusion within the time series feature space, the problems of insufficient utilization of prior text information and feature distortion in existing technologies are solved, achieving higher accuracy and more stable time series prediction.

CN122263005A · Pending · Publication Date: 2026-06-23 · XI AN JIAOTONG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: XI AN JIAOTONG UNIV
Filing Date: 2026-03-17
Publication Date: 2026-06-23

Application Information

Patent Timeline

17 Mar 2026: Application
23 Jun 2026: Publication (CN122263005A)

IPC: G06F18/25; G06F18/213; G06F18/27; G06F40/30; G06N3/0455; G06N3/084; G06N3/09; G06F123/02

AI Technical Summary

Technical Problem

Existing time series prediction methods struggle to effectively utilize prior textual information when dealing with complex scenarios, especially for extreme or rare events where prediction accuracy is low. Furthermore, cross-modal fusion suffers from incomplete information and feature distortion.

Method used

A cross-modal time series prediction method based on asymmetric fusion is adopted. By projecting, modulating and orthogonally decomposing within the time series feature space, prior textual information is fused into the time series, and consistent features and modality-specific features are extracted to enhance prediction capabilities.

Benefits of technology

It improves the adaptability to complex scenarios and distribution changes, reduces the risk of feature distortion, and enhances prediction accuracy and stability, enabling high-quality time series prediction in multiple fields.

✦ Generated by Eureka AI based on patent content.

Abstract

The application discloses a cross-modal time series prediction method, system, medium and device based on asymmetric fusion, comprising the following steps: S1, obtaining input data; S2, extracting time series modal features and text modal features from the historical time series data, the historical text prompts and the future text prompts, respectively, by a single-modal feature extraction module; S3, fusing the historical text modal features in the time series modal feature space by an asymmetric fusion module, and calculating cross-modal consistent features and modality-specific features; S4, fusing the historical text modal features and the future text modal features by a history-future text fusion module to obtain fused text features containing prior knowledge of the prediction period; S5, decoding the fused text features and outputting the future time series prediction result for the prediction target time period. The application provides more sufficient prior support for the prediction period.

Description

Technical Field

[0001] This invention relates to the field of time series forecasting and cross-modal information fusion technology, specifically to a cross-modal time series forecasting method, system, medium, and device based on asymmetric fusion.

Background Technology

[0002] Time series forecasting aims to infer the trend of numerical changes in the future based on the time dependence and variable dependence of historical observation data. It is an important basic capability in fields such as economic decision-making, energy scheduling, traffic management and public health early warning. Many existing methods (such as TimeMixer, iTransformer, TimesNet, PatchTST, DLinear, etc. based on Transformer, CNN, linear model or hybrid structure) mainly extract from a single numerical sequence: (1) the periodicity, trend and mutation of the time dimension; (2) the correlation, causality or coupling relationship between variables. Although these methods have strong performance on standard datasets, they often face the following difficulties in complex real-world scenarios: Incomplete information: It is difficult to express "non-numerical but strongly correlated" factors such as holidays, weather background, policy changes, regional differences, business rules, etc. based on historical numerical sequences alone. Insufficient transferability: The model is sensitive to changes in domain distribution and its performance degrades when there are changes in topology across cities, regions, seasons or sensors. Difficulty in modeling extreme / rare events: Sudden events, temporary controls, epidemics, etc. are often scarce in numerical sequences, but important clues can be obtained through text priors.

[0003] For example, the modeling of DLinear (document number https://doi.org/10.1609/aaai.v37i9.26317) relies solely on historical numerical time series data. It can only extract time and variable dependencies from numerical sequences, and it is insufficient in characterizing time series changes driven by multiple factors in complex real-world scenarios (such as sudden increases in power load caused by holidays or sudden changes in traffic flow caused by policy controls), making it difficult to express non-numerical related factors.

[0004] Meanwhile, DLinear lacks any channels for introducing prior textual information, making it impossible to supplement the model with textual clues such as disease control warnings, policy notices, and event announcements. This results in extremely low accuracy in predicting time series mutations caused by such events, failing to solve the core problem of "difficulty in modeling extreme / rare events" proposed in this application.

[0005] With the increasing capabilities of Large Language Models (LLMs) in text understanding and knowledge encoding, incorporating text priors into time series prediction has become a hot topic. However, existing LLM-assisted cross-modal time series prediction mechanisms generally fall into three categories:
Independent fusion: Extracting time series features and text features separately and then directly concatenating or weighting them together. This approach is prone to "data entanglement" when modal differences are significant, and text features are easily treated as noise rather than information.
Modality transformation: Mapping time series features to the text space, or reprogramming the sequence using an LLM so that it is processed within the text embedding space. However, this type of method often fails to fully leverage the "non-numerical prior" advantage of textual representation, and forcing the sequence into the text space can lead to representational bias.
Cross-modal alignment: Constructing a shared space, projecting the two modalities into a new feature space and performing alignment, comparison, or consistency constraints. However, because time series and text encoding mechanisms differ fundamentally, the inherent differences between the modal feature spaces cause distortion when the modalities are aligned in the new shared space.

Summary of the Invention

[0006] To overcome the shortcomings of the existing technologies, this invention provides a cross-modal time series prediction method, system, medium, and device based on asymmetric fusion. By reusing the time series feature space, textual priors are injected into the prediction model using a "projection + modulation + orthogonal decomposition" approach. This completes cross-modal fusion within the time series feature space, reducing alignment distortion; extracting and enhancing coherent (collinear) information between the two modalities while retaining unique (orthogonal) information of the text; and fusing complementary contexts of historical and future texts to provide more sufficient prior support for the prediction period.

[0007] To achieve the above objectives, the technical solution adopted by the present invention is as follows: A cross-modal time series prediction method based on asymmetric fusion includes the following steps:
S1. Obtain input data, the input data including: historical time series data, historical text prompts, and future text prompts; wherein the historical time series data is denoted $S \in \mathbb{R}^{N \times T}$, where N is the number of variables and T is the number of time steps. The future text prompt is used to describe prior information for the prediction target time period.
S2. Extract time series modal features and text modal features from the historical time series data, the historical text prompts, and the future text prompts, respectively, using the single-modal feature extraction module.
S3. Fuse the historical text modal features in the time series modal feature space using the asymmetric fusion module, and calculate cross-modal consistent features and modality-specific features. The cross-modal consistent features are used to enhance effective patterns in the time series and suppress noise, while the modality-specific features are used to preserve the specific information of the text modality.
S4. Fuse the historical text modal features and the future text modal features through the historical-future text fusion module to obtain fused text features containing prior knowledge of the prediction period.
S5. Decode the fused text features and output the future time series prediction results for the prediction target time period.

[0008] In S1: the historical time series data is the core numerical basis for time series forecasting. It is the set of historical observations of the variables related to the forecast target, denoted $S \in \mathbb{R}^{N \times T}$, where N is the number of variables and T is the number of time steps; as a whole it is a numerical matrix that records the observed values of each variable at consecutive time steps. The historical text prompts include the time range, the numerical sequence of each variable, and the trend value. The future text prompt includes the current time, the historical period, and the prediction target period. The prompt information includes one or more of the following: historical statistical range of variables, trend direction, holiday information, geographical location information, and sensor node topology information. The future text prompt includes at least one of the following: prediction period, holiday information, and geographical location description, but does not contain unknown future numerical information.

[0009] In S2: the extraction of time series modal features includes: embedding and mapping the historical time series data to obtain a time series embedding representation; inputting the time series embedding representation into a pre-layer-normalization (Pre-LN) Transformer encoder, and extracting the time series modal features through a multi-head self-attention mechanism, wherein the multi-head self-attention includes linear mapping of query, key, and value, attention weight calculation, and multi-head concatenation and projection output. The multi-head self-attention is expressed as $\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^{O}$, in which $\mathrm{head}_i = \mathrm{softmax}\left(Q_i K_i^{\top} / \sqrt{d_k}\right) V_i$.
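
As a minimal sketch of the multi-head self-attention described in S2 (linear mapping of query, key, and value, attention weight calculation, head concatenation, and output projection), the following plain-Python example uses nested lists in place of tensors. The function names, the weight layout (`heads`, `w_o`), and the scaling by the square root of the head dimension are illustrative assumptions and are not taken from the patent text.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def multi_head_self_attention(x, heads, w_o):
    """heads: list of (W_q, W_k, W_v) per head (assumed layout); w_o: output projection."""
    outputs = [attention(matmul(x, wq), matmul(x, wk), matmul(x, wv))
               for wq, wk, wv in heads]
    # Concatenate the per-head outputs row by row, then project.
    concat = [sum((out[i] for out in outputs), []) for i in range(len(x))]
    return matmul(concat, w_o)
```

In practice a tensor library would replace the list arithmetic; the sketch only makes the order of operations explicit.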

[0010] In step S2: text modality feature extraction includes: inputting the historical text prompts and the future text prompts into a pre-trained large language model to obtain the corresponding hidden representations of the text sequences; selecting the hidden representation of the aggregation token, which can represent the semantics of the entire text segment, as the static text representation; and inputting the static text representation into a Pre-LN encoder isomorphic to the one used for time series feature extraction, to obtain the historical text modality features and the future text modality features. The static text representation is configured to be pre-computed and cached offline to reduce computational overhead during the online prediction phase.

[0011] In S3: the asymmetric fusion module includes consistent feature calculation and modality-specific feature calculation. Consistent feature calculation includes: generating, from the historical text modality features, point-by-point modulation parameters (a scaling parameter and a translation parameter β) using learnable functions; aligning the modulation parameters with the time series features in the variable (or channel) dimension and the time dimension; and performing point-by-point linear modulation on the time series features to obtain the consistent features. The point-by-point linear modulation satisfies the following: for the feature point at any variable (or channel) index, the corresponding scaling parameter and β are used to scale and translate that feature point, thereby enhancing or suppressing the time series features to reduce the impact of noise.
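
The point-by-point linear modulation can be illustrated with a small sketch. It assumes the text features are already aligned row by row with the time series features, and that the scaling and translation parameters come from two linear maps; the names `modulate`, `w_scale`, `b_scale`, `w_shift`, and `b_shift` are hypothetical and chosen only for this example.

```python
def linear(vec, weight, bias):
    """Apply y = vec @ weight + bias, with weight stored as a list of rows."""
    return [sum(v * w for v, w in zip(vec, col)) + b
            for col, b in zip(zip(*weight), bias)]


def modulate(ts_feat, text_feat, w_scale, b_scale, w_shift, b_shift):
    """Scale and translate each time series feature point using text-derived parameters.

    ts_feat:   N x C time series features
    text_feat: N x D text features (assumed aligned row by row with ts_feat)
    """
    consistent = []
    for ts_row, txt_row in zip(ts_feat, text_feat):
        scale = linear(txt_row, w_scale, b_scale)   # scaling parameter per feature point
        shift = linear(txt_row, w_shift, b_shift)   # translation parameter (beta)
        consistent.append([s * x + t for s, x, t in zip(scale, ts_row, shift)])
    return consistent
```

A scale above 1 strengthens a feature point, a scale near 0 suppresses it, and the shift moves it, which is how the modulation can emphasize useful patterns and damp noise.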

[0012] In S3: the calculation of modality-specific features includes: mapping the historical text modality features into the time series feature space, by a linear mapping or a learnable projection, to obtain projected text features; based on the aforementioned consistent features, calculating the collinear component of the projected text features along the consistent feature direction, and calculating the orthogonal component, orthogonal to the collinear component, using Gram-Schmidt orthogonalization. The orthogonal component is identified as the modality-specific feature and is used to preserve text-modality-specific information in the time series feature space.
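
A single Gram-Schmidt step is enough to illustrate the collinear / orthogonal split. The sketch below works on one feature vector at a time; applying it per variable, and the small `eps` guard against a zero-norm consistent feature, are assumptions made for this example.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def decompose(projected_text, consistent, eps=1e-12):
    """Split projected_text into a part collinear with consistent and an orthogonal remainder."""
    coeff = dot(projected_text, consistent) / (dot(consistent, consistent) + eps)
    collinear = [coeff * c for c in consistent]
    orthogonal = [p - c for p, c in zip(projected_text, collinear)]
    return collinear, orthogonal
```

By construction the two parts sum back to the projected text feature, so nothing is discarded: the collinear part overlaps with the consistent feature, and the orthogonal part carries what only the text contributes.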

[0013] In S4: the historical-future text fusion module adopts a Transformer-based decoder structure, using the future text features as the query input and the historical text features as the key and value inputs, and performs cross-attention fusion; the output is the fused text features, which are used to enhance the expression of semantics and prior knowledge during the prediction period.
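
The cross-attention in S4 can be sketched as follows, with the future text features as queries and the historical text features as both keys and values. This is a single-head simplification that omits the learned query, key, and value projections and the rest of the decoder layer; the function name and that simplification are assumptions of this example.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(future_text, hist_text):
    """Each future-text row attends over all historical-text rows."""
    d = len(future_text[0])
    fused = []
    for q in future_text:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in hist_text]
        weights = softmax(scores)
        # Weighted sum of historical rows (used here as the values).
        fused.append([sum(w * v[j] for w, v in zip(weights, hist_text))
                      for j in range(len(hist_text[0]))])
    return fused
```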

[0014] In S5: the consistent features, the modality-specific features, and the fused text features are concatenated along the channel dimension to obtain the prediction feature representation; the prediction feature representation is then input into a linear projection layer, and the output is a multivariate future time series prediction result of length L.
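
A sketch of the S5 prediction head, under the assumption that all three feature groups have already been brought to one row per variable. The name `predict` and the shapes of `weight` and `bias` are illustrative.

```python
def predict(consistent, specific, fused_text, weight, bias):
    """Concatenate three feature groups per variable and project to L future steps.

    consistent, specific, fused_text: N rows each (assumed to share the leading dimension)
    weight: (C1 + C2 + C3) x L matrix stored as a list of rows; bias: length L
    """
    predictions = []
    for c, s, f in zip(consistent, specific, fused_text):
        feat = c + s + f  # channel-wise concatenation of the three feature vectors
        predictions.append([sum(x * w for x, w in zip(feat, col)) + b
                            for col, b in zip(zip(*weight), bias)])
    return predictions
```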

[0015] The method uses mean squared error as the loss function for model training and supports zero-shot prediction tasks.
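
For reference, the mean squared error over a multivariate forecast can be written in a few lines. The sketch averages over all variables and all predicted steps; whether the method averages in exactly this way is an assumption.

```python
def mse(pred, target):
    """Mean squared error over every variable and every predicted time step."""
    errors = [(p - t) ** 2
              for pred_row, target_row in zip(pred, target)
              for p, t in zip(pred_row, target_row)]
    return sum(errors) / len(errors)
```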

[0016] A cross-modal time series prediction system based on asymmetric fusion includes a data input module, a single-modal feature extraction module, an asymmetric fusion module, a historical-future text fusion module, and a prediction output module.
Data input module: used to receive historical time series data, historical text prompts, and future text prompts.
Single-modal feature extraction module: used to separately extract the time series features, the historical text modal features, and the future text modal features.
Asymmetric fusion module: used to perform point-by-point linear modulation with the historical text modal features in the time series feature space and to perform collinear / orthogonal decomposition, outputting the consistent features and the modality-specific features.
Historical-future text fusion module: used to fuse the historical text modal features with the future text modal features through cross-attention to obtain the fused text features.
Prediction output module: used to integrate and decode the consistent features, the modality-specific features, and the fused text features, and to output the future time series prediction results.

[0017] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the cross-modal time series prediction method based on asymmetric fusion as described above.

[0018] A computing device includes a processor and a memory, the memory storing a computer program that, when executed by the processor, causes the computing device to perform the cross-modal time series prediction method based on asymmetric fusion as described above.

[0019] This invention is applicable to prediction scenarios that require the simultaneous use of historical observation sequences and external textual priors (such as holidays, policies, geographical locations, climate backgrounds, event schedules, topological relationships, business rules, etc.), including but not limited to: economic indicator prediction, power load prediction, traffic flow prediction, meteorological element prediction, influenza incidence (ILI) prediction, financial exchange rate prediction, equipment status prediction, etc.

[0020] The beneficial effects of this invention are: This invention introduces an asymmetric cross-modal fusion mechanism, enabling textual information to effectively participate in prediction modeling within the time series feature space. This reduces the risk of feature distortion during the cross-modal fusion process, ensuring prediction accuracy while improving the model's adaptability to complex scenarios and distribution changes, thereby enhancing the stability and robustness of the prediction. This invention can achieve higher-quality time series prediction outputs in various prediction tasks such as economics, energy, transportation, meteorology, and public health.

[0021] Based on the data input module, by standardizing the construction of historical and future text prompts, the system can utilize the prior context information available during the prediction period without introducing future real numerical information, thereby enhancing its ability to characterize changing factors during the prediction period.

[0022] Based on the single-modal feature extraction module, feature encoding is performed on time series modalities and text modalities respectively, so that information from different sources can be expressed under a unified feature dimension, thereby improving the operability and computational efficiency of cross-modal information interaction.

[0023] Based on the aforementioned asymmetric fusion module, cross-modal consistent features and cross-modal unique features are extracted in the time series feature space. This allows the system to retain text-specific prior information while utilizing modal common information, avoiding the loss of text information due to noise or over-alignment, thereby improving prediction performance and interpretability.

[0024] Based on the aforementioned historical-future text fusion module, by fusing historical and future texts, the model can simultaneously utilize historical pattern summaries and prediction context priors, enhancing its adaptability to scenarios with long prediction steps or complex time spans.

[0025] Based on the prediction output module, by decoding and outputting the fused multi-source features, it is possible to generate prediction results with multiple variables and multiple step lengths within the prediction target period, and to output the corresponding prediction form according to the task requirements.

[0026] In summary, this invention provides a cross-modal time series prediction system applicable to multiple fields, based on comprehensive consideration of fusion stability, prior information utilization capability, and prediction accuracy. It has good engineering deployability and application value.

Attached Figure Description

[0027] Figure 1 is a schematic diagram of the feature space.

[0028] Figure 2 is a schematic diagram of the fusion strategy based on the time series feature space.

[0029] Figure 3 is a framework diagram of the asymmetric cross-modal fusion network model.

[0030] Figure 4 is a schematic diagram of the asymmetric fusion features.

Detailed Implementation

[0031] The present invention will now be described in further detail with reference to the accompanying drawings.

[0032] This invention provides a time series prediction method based on asymmetric cross-modal fusion. It acquires historical time series data of the target prediction scene and constructs historical text prompts based on this data. It also constructs future text prompts containing prior information such as the prediction target period, holidays, geographical location, event schedule, or sensor node topology, thus forming a prediction input dataset containing both numerical and textual modalities. The constructed prediction input dataset is input to a single-modal feature extraction module to obtain time series features, historical text features, and future text features. Within the time series feature space, the asymmetric fusion module injects the historical text features into the time series features using a "projection-modulation-orthogonal decomposition" method, resulting in cross-modal consistent features and modality-specific features. Furthermore, the historical-future text fusion module fuses the historical and future text features across time periods to obtain fused text features. These features are integrated and input into a prediction network to obtain prediction results. The model parameters are iteratively optimized using a loss function, ultimately achieving high-precision long-term, short-term, and zero-shot time series prediction for the target scene. This invention can be widely applied to multivariate time series forecasting scenarios such as economic indicator forecasting, power load forecasting, traffic flow forecasting, meteorological element forecasting, and public health early warning.

[0033] A time series prediction method based on asymmetric cross-modal fusion includes the following steps: S1. Construct a multimodal prediction input dataset: Acquire historical time series data, which can be represented as a numerical matrix containing N variables and T time steps; generate historical text prompts based on the historical time series data, which at least include the historical time range, a summary of the numerical sequence of each variable, statistical range and trend information; construct future text prompts, which are used to describe the available prior information for predicting the target period, which at least includes one or more of the following: the time range of the target period, holiday / workday attributes, geographical location description, event schedule information, and sensor node topology relationship information, but does not contain future real numerical information that cannot be known in advance.

[0034] Historical time series data: Input: the historical observation matrix $S \in \mathbb{R}^{N \times T}$. Optional preprocessing: missing value imputation, outlier truncation or correction, normalization by variable, etc.
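
The optional preprocessing can take many forms. As one hedged illustration, the sketch below fills missing values by carrying the last observation forward and truncates outliers to a band of k standard deviations around the mean; both choices, and the function names, are assumptions for this example rather than the procedure prescribed by the text.

```python
from statistics import mean, pstdev


def forward_fill(values):
    """Replace None with the most recent observation (leading gaps use the first observation)."""
    first = next(v for v in values if v is not None)
    filled, last = [], first
    for v in values:
        last = v if v is not None else last
        filled.append(last)
    return filled


def clip_outliers(values, k=3.0):
    """Truncate values to the band mean +/- k * standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    low, high = mu - k * sigma, mu + k * sigma
    return [min(max(v, low), high) for v in values]
```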

[0035] Historical text prompts: Historical text prompts are used to convert historical numerical sequences into natural language "statistical summaries," including the time range, a list of values / statistics for each variable, and the magnitude of the trend.

[0036] The historical text prompt can be generated in the following manner: Time range: the start and end of the historical window; For each variable n: the sequence of its historical values.

[0037] Trend amplitude: the trend value of each variable over the historical window.

[0038] Future text prompts: Future text prompts emphasize the "forecast period context," including the current time, the historical interval, the forecast target interval, holiday / weekend markers, and the geographic location. The future text prompt can be generated in the following manner:
Current time: the time at which the prediction is made;
Historical period: the historical interval;
Prediction target: the forecast target interval;
A summary of each variable, including minimum value, maximum value, and trend value;
Holidays: one of "It is a weekday," "It is the weekend," or a named holiday such as "It is 'National Day'";
Geographic location: "Predicted to occur in [a certain location]."
Please see the left side of Figure 3, which is the data description. It is the input data layer of the entire prediction model, corresponding to the multimodal prediction input dataset constructed in step S1 of the method. It visually presents all the input data types, structures and core contents required for model training / online prediction.
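
The prompt construction above can be sketched as simple string templates. The exact wording of the prompts, the function names, and the definition of the trend as "last value minus first value" are assumptions made for this example; the text only fixes which kinds of information each prompt carries.

```python
from statistics import mean


def build_history_prompt(series, start, end):
    """series: dict mapping a variable name to its list of historical values."""
    lines = [f"Time range: {start} to {end}"]
    for name, values in series.items():
        trend = values[-1] - values[0]  # assumed trend definition
        lines.append(
            f"{name}: values={values}, min={min(values)}, max={max(values)}, "
            f"mean={mean(values):.2f}, trend={trend:+.2f}"
        )
    return "\n".join(lines)


def build_future_prompt(now, hist_range, target_range, day_type, location):
    """Only context known in advance is included; no future numerical values."""
    return "\n".join([
        f"Current time: {now}",
        f"Historical period: {hist_range}",
        f"Prediction target: {target_range}",
        f"Holidays: {day_type}",
        f"Geographic location: Predicted to occur in {location}",
    ])
```

Keeping the future prompt free of any target-period measurements is what prevents information leakage from the period being predicted.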

[0039] S2. Extract time series modal features and text modal features.

[0040] The input data constructed in step S1 is fed into the single-modal feature extraction module to obtain multimodal feature representations for fusion and prediction; these include time series feature extraction and text feature extraction. Time series feature extraction: Historical time series data is normalized and embedded, mapping the time series to an embedded representation with a preset feature dimension. This embedded representation is then input into a pre-normalized Transformer encoder to extract time series features. The core of this process includes: normalization, linear projection, multi-layer multi-head attention, residuals, and a feedforward network. Multi-head self-attention mechanisms are used to model inter-variable and temporal dependencies, outputting time series features. The multi-head self-attention mechanism includes at least linear mapping of queries, keys, and values, attention weight calculation, concatenation of each head, and linear projection output.
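
The front of the time series branch (per-variable normalization, linear projection from the time dimension to the feature dimension, and a Pre-LN residual wrapper) can be sketched in plain Python. The function names, the `eps` constants, and the row-by-row treatment are assumptions for illustration; the attention and feed-forward sub-layers would be passed in as `sublayer`.

```python
from statistics import mean, pstdev


def standardize(series, eps=1e-8):
    """Per-variable z-score over all T time steps (series is N x T)."""
    out = []
    for row in series:
        mu, sigma = mean(row), pstdev(row)
        out.append([(v - mu) / (sigma + eps) for v in row])
    return out


def project(series, weight, bias):
    """Map each variable's T values to C features; weight is T x C, bias has length C."""
    return [[sum(v * w for v, w in zip(row, col)) + b
             for col, b in zip(zip(*weight), bias)] for row in series]


def layer_norm(vec, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mu = mean(vec)
    var = mean([(v - mu) ** 2 for v in vec])
    return [(v - mu) / (var + eps) ** 0.5 for v in vec]


def pre_ln_residual(x, sublayer):
    """Pre-LN block: x + sublayer(LayerNorm(x)), applied row by row."""
    return [[a + b for a, b in zip(row, sublayer(layer_norm(row)))] for row in x]
```

Normalizing before the sub-layer rather than after it is what distinguishes a Pre-LN encoder; the residual path then carries the unnormalized input through unchanged.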

[0041] Specific implementations may include: (1) Standardization: for the values of the n-th variable at all time steps in the original historical time series, calculate the mean and the standard deviation to obtain the standardized sequence. (2) Linear projection onto the feature dimension C: the time dimension is mapped to the feature space by a linear projection function, which yields the feature matrix.

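[0042] (3) Multi-layer multi-head self-attention (MHA): multi-layer MHA is performed on the feature matrix to capture dependencies between variables: $\mathrm{MHA}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\,W^{O}$, in which $\mathrm{head}_i = \mathrm{softmax}\left(Q_i K_i^{\top} / \sqrt{d_k}\right) V_i$; Q, K, V are the query, key, and value matrices of the self-attention mechanism; $\mathrm{head}_1, \dots, \mathrm{head}_h$ are the outputs of attention heads 1 through h; softmax is the activation function; $\sqrt{d_k}$ is the scaling factor for the attention scores; Concat is the concatenation operation; and $W^{O}$ is the output projection.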

[0043] (4) Residual connection + feed-forward network (FFN): to enhance critical patterns and preserve original information, residual connections and an FFN are added, using a layer normalization function, to obtain the time series features.
Text feature extraction: The historical text prompts and the future text prompts are input into a pre-trained large language model to obtain the corresponding text hidden representations; the hidden representation of the aggregation token, which can represent the semantics of the entire text, is selected as the static text representation, and the static text representation is input into an encoder isomorphic to the one used for time series feature extraction, to obtain the historical text features and the future text features respectively. In some embodiments, the static text representation can be configured to be pre-computed and cached offline, thereby reducing the computational overhead in the online prediction stage.

[0044] Specifically, it includes: Text encoding and static representation extraction: the historical text prompt and the future text prompt are input into an LLM, and the last token is extracted and passed through an embedding layer to obtain the initial text representation. The static representation is extracted offline for dimension alignment and encoding enhancement. The LLM output is then sent to the Pre-LN encoder to obtain the historical text features and the future text features.
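
The idea of computing the static text representation once and reusing it can be sketched with an in-memory cache. Here `toy_llm_encode` is a hypothetical stand-in for a frozen LLM (it merely hashes the prompt into a fixed-length vector), and `functools.lru_cache` stands in for whatever offline store a deployment would actually use; both are assumptions of this example.

```python
import hashlib
from functools import lru_cache


def toy_llm_encode(prompt: str, dim: int = 8) -> tuple:
    """Hypothetical placeholder for a frozen LLM: deterministic vector derived from the prompt."""
    digest = hashlib.sha256(prompt.encode("utf-8")).digest()
    return tuple(b / 255.0 for b in digest[:dim])


@lru_cache(maxsize=None)
def static_text_representation(prompt: str) -> tuple:
    """Encode a prompt once; repeated prompts are served from the cache."""
    return toy_llm_encode(prompt)
```

Because the language model is not updated during prediction, identical prompts always map to the same representation, which is what makes caching safe.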

[0045] Please see Figure 3, which is the complete network model framework diagram. S3. Perform asymmetric fusion in the time series feature space to obtain cross-modal consistent features and modality-specific features.

[0046] The asymmetric fusion module is used to fuse historical text features in the time series feature space, specifically including: Consistent feature calculation: Generate point-by-point modulation parameters based on the historical text features, and align the modulation parameters with the time series features in the variable dimension / channel dimension and the time dimension; perform point-by-point linear modulation on the time series features to obtain cross-modal consistent features, wherein the consistent features are used to enhance the effective modes in the time series and suppress noise interference, and the learnable function that generates the modulation parameters is a linear projection.

[0047] Please see Figure 1. Figure 1 illustrates the feature spaces related to different cross-modal fusion methods. It shows the feature spaces of the time series modality and the text modality, as well as the shared feature space constructed by traditional cross-modal fusion. Because the time series and text modalities differ significantly in encoding mechanism, feature essence, and distribution patterns, traditional methods that forcibly project the features of these two heterogeneous modalities into an artificially constructed shared feature space for alignment and fusion force the original features of both modalities to change their distribution patterns, core attributes, and internal correlation logic. Ultimately, this leads to significant distortion and information loss in the modal features, revealing the inherent defects of traditional cross-modal alignment methods. This provides the key theoretical basis and visual evidence for the core design idea of this invention, which abandons the shared feature space and reuses the time series feature space to carry out asymmetric cross-modal fusion.

[0048] Modality-specific feature calculation: Historical text features are mapped to the time series feature space through linear mapping or learnable projection to obtain projected text features; based on the consistent features, the collinear components of the projected text features in the consistent feature direction are calculated, and orthogonal components orthogonal to the collinear components are calculated through Gram-Schmidt orthogonalization; the orthogonal components are determined as modality-specific features to preserve text modality-specific information in the time series feature space; Feature Decoding: Consistent features and modality-specific features are input into the decoder structure for decoding to further extract cross-variable dependencies and focus on key patterns, and output the decoded consistent features and decoded modality-specific features.

[0049] Please see Figure 2. Figure 2 illustrates the asymmetric fusion strategy for cross-modal fusion: Step (1): Extract time series features and text features respectively; Step (2): Linearly modulate the time series features using parameters learned from the text features; Step (3): Project the text features into the time series feature space; Step (4): Calculate the orthogonal components relative to the collinear features using the Gram-Schmidt orthogonalization method.

[0050] S4. Merge historical text features and future text features to obtain merged text features: The historical-future text fusion module performs cross-time-period fusion of historical and future text features. In some embodiments, the historical-future text fusion module adopts a Transformer-based decoder structure, uses future text features as query input and historical text features as key and value input, and performs cross-attention fusion to obtain fused text features that include prediction context and prior knowledge, thereby enhancing the ability to express semantic information during the prediction period.

[0051] Referring to Figure 4, the diagram illustrates the relationship between the cross-modal consistent features (collinear features) and the modality-specific features (orthogonal features) output by the asymmetric fusion module. It shows the spatial distribution of the two types of features, their separate modeling mechanisms, and their complementary roles in cross-modal fusion. This illustrates the core design goal of this invention, "extracting common modal information and preserving unique textual priors", and visualizes how the method addresses the data entanglement and information loss that often occur in traditional cross-modal fusion. The point set of the consistent features closely matches the spatial distribution of the original time series features and represents the common, collinear information shared by the historical text features and the time series features. This collinearity results from pointwise linear modulation, which enhances the effective information of the time series and suppresses noise. The point set of the modality-specific features is orthogonal to and does not overlap with the consistent features. It is the text modality-specific information obtained by Gram-Schmidt orthogonal decomposition. Its independent distribution reflects the intended effect of preserving textual features that are not assimilated by the time series features, supplementing non-numerical prior information such as holidays, policies, and geographical locations that a time series cannot express. Figure 4 also shows that the two types of features are separated from each other in the time series feature space and do not interfere with each other, yet form an effective complementary relationship. This avoids the "data entanglement" problem in which text features are submerged by time series features, and also avoids the distortion caused by over-alignment of features. It is consistent with the core mechanism of the asymmetric fusion of the present invention: "first extract common information between modalities to strengthen time series features, and then retain the unique information of the text in the form of orthogonal components." It also supports the subsequent feature integration step, in which the two types of features are concatenated along the channel dimension with the fused text features.

[0052] S5. Feature Integration and Prediction Output: The decoded consistent features and decoded modality-specific features output from step S3 are concatenated along the channel dimension with the fused text features output from step S4 to obtain the predicted feature representation. The predicted feature representation is then input into the prediction network / linear projection layer, which outputs a multivariate future time series prediction result of length L, thereby achieving numerical prediction of the target period.
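The integration step can be sketched as a channel-wise concatenation followed by a single linear projection to the prediction length L. The sketch assumes that all three feature groups share the shape (variables, feature width); the names `predict`, `w_out`, `b_out`, and `horizon` are hypothetical.

```python
import numpy as np

def predict(consistent, specific, fused_text, w_out, b_out):
    """Concatenate the three feature groups along the channel axis
    and project each variable's features to the forecast horizon."""
    features = np.concatenate([consistent, specific, fused_text], axis=-1)
    return features @ w_out + b_out

rng = np.random.default_rng(0)
n_vars, d_model, horizon = 3, 4, 5              # horizon plays the role of L
consistent = rng.normal(size=(n_vars, d_model))
specific = rng.normal(size=(n_vars, d_model))
fused_text = rng.normal(size=(n_vars, d_model))

w_out = rng.normal(size=(3 * d_model, horizon))  # stand-in for a trained layer
b_out = np.zeros(horizon)

forecast = predict(consistent, specific, fused_text, w_out, b_out)
```

The result has one row per variable and one column per future time step, which corresponds to a multivariate forecast of length L.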

[0053] S6. Model Training and Parameter Iterative Optimization: During the training phase, a loss function is constructed from the predicted results and the true values for supervised learning. In some embodiments, mean squared error is used as the main loss to minimize the difference between the predicted results and the true values, and the parameters of the single-modal feature extraction module, the asymmetric fusion module, the historical-future text fusion module, and the prediction output module are updated through backpropagation.
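The supervised objective can be sketched with a mean squared error loss and plain gradient descent. For brevity the sketch trains only a final linear projection with a hand-derived gradient, whereas the paragraph above describes updating all modules through backpropagation. The names `mse` and `sgd_step`, the learning rate, and the synthetic data are assumptions.

```python
import numpy as np

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def sgd_step(features, target, w, b, lr=0.01):
    """One gradient descent step on the MSE of a linear projection."""
    pred = features @ w + b
    err = pred - target
    grad_pred = 2.0 * err / err.size      # derivative of the mean squared error
    grad_w = features.T @ grad_pred       # chain rule through pred = features @ w + b
    grad_b = grad_pred.sum(axis=0)
    return w - lr * grad_w, b - lr * grad_b, mse(pred, target)

rng = np.random.default_rng(0)
n_samples, d_feat, horizon = 8, 12, 5
features = rng.normal(size=(n_samples, d_feat))   # stand-in for fused features
target = rng.normal(size=(n_samples, horizon))    # stand-in for true future values

w = np.zeros((d_feat, horizon))
b = np.zeros(horizon)
losses = []
for _ in range(50):
    w, b, loss = sgd_step(features, target, w, b)
    losses.append(loss)
```

The recorded loss falls over the 50 steps because the objective is a convex quadratic in `w` and `b` and the step size is small.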

[0054] Experimental results show that after constructing a multimodal prediction input containing historical time series data, historical text prompts, and future text prompts, this invention can inject text priors into the prediction modeling process in the manner of "projecting to the time series feature space + pointwise linear modulation + Gram-Schmidt orthogonal decomposition". This achieves effective separation and complementary fusion of the information common to both modalities and the text-specific prior information, thereby obtaining high-precision time series prediction outputs in different prediction scenarios such as long-term, short-term, and zero-shot prediction, and effectively alleviating the information loss caused by modal alignment in traditional cross-modal schemes.

[0055] Furthermore, compared with existing methods that require constructing a shared feature space or direct splicing and fusion, this invention performs asymmetric fusion within the time series feature space, enabling consistent features to enhance effective patterns in the time series and suppress noise. At the same time, it preserves the unique prior information of the text modality in the form of orthogonal components. Therefore, while ensuring prediction accuracy, it can reduce the risk of feature distortion in the cross-modal fusion process and improve the prediction stability and robustness of the model under complex scenarios and distribution changes, demonstrating the comprehensive superiority of the proposed solution.

Claims

1. A cross-modal time series prediction method based on asymmetric fusion, characterized in that it includes the following steps: S1. Obtain input data, which includes historical time series data, historical text prompts, and future text prompts; S2. Extract time series modal features and text modal features from the historical time series data, historical text prompts, and future text prompts using the single-modal feature extraction module, wherein the text modal features are divided into historical text modal features and future text modal features; S3. Fuse the historical text modal features in the time series modal feature space using the asymmetric fusion module to calculate cross-modal consistent features and modality-specific features, wherein the cross-modal consistent features are used to enhance effective patterns in the time series and suppress noise, and the modality-specific features are used to preserve the specific information of the text modality; S4. Fuse the historical text modal features and future text modal features through the historical-future text fusion module to obtain fused text features containing prior knowledge of the prediction period; S5. Decode the fused text features and output the future time series prediction result for the prediction target period.

2. The cross-modal time series prediction method based on asymmetric fusion according to claim 1, characterized in that, in S1: the historical time series data is the core numerical basis for time series forecasting and is the set of historical observations of the variables related to the forecast target, denoted as $S \in \mathbb{R}^{N \times T}$, where N is the number of variables and T is the number of time steps; it is a numerical matrix that records the observed values of the N variables at T consecutive time steps; the historical text prompt includes the time range, the numerical sequence of each variable, and the trend value; the future text prompt includes the current time, the historical period, the prediction target period, and one or more of the following: historical statistical range of variables, trend direction, holiday information, geographical location information, and sensor node topology information.

3. The cross-modal time series prediction method based on asymmetric fusion according to claim 1, characterized in that the future text prompt is used to describe prior information about the prediction target period; the future text prompt includes at least one of the following: prediction period, holiday information, and geographical location description, but does not contain unknown future numerical information.

4. The cross-modal time series prediction method based on asymmetric fusion according to claim 1, characterized in that, in S2: the extraction of time series modal features includes: embedding and mapping the historical time series data to obtain a time series embedded representation; inputting the time series embedded representation into a pre-layer-normalization (Pre-LN) Transformer encoder, and extracting time series modal features through a multi-head self-attention mechanism, wherein the multi-head self-attention includes linear mapping of query, key, and value, attention weight calculation, and multi-head concatenation and projection output; the multi-head self-attention is expressed by a formula [formula missing in the source text]; in S2: the extraction of text modal features includes: inputting the historical text prompts and future text prompts into a pre-trained large language model to obtain the corresponding hidden representations of the text sequences; selecting the hidden representation corresponding to the aggregation token, which represents the semantics of the entire text segment, as the static text representation, and inputting the static text representation into a Pre-LN encoder with the same structure as the one used for time series feature extraction to obtain the historical text modal features and the future text modal features; the static text representation is configured to be pre-computed and cached offline to reduce computational overhead during the online prediction phase.
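For reference, a standard formulation of multi-head self-attention that matches the three operations named in claim 4 (linear mapping of query, key, and value; attention weight calculation; multi-head concatenation and output projection) is given below. The symbols $X$, $W_i^Q$, $W_i^K$, $W_i^V$, $W^O$, $d_k$, and $h$ follow the common Transformer convention and are assumptions; they are not taken from the claim.

```latex
\begin{aligned}
Q_i &= X W_i^{Q}, \qquad K_i = X W_i^{K}, \qquad V_i = X W_i^{V}, \\
\mathrm{head}_i &= \mathrm{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i, \\
\mathrm{MultiHead}(X) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}.
\end{aligned}
```

Here $X$ is the embedded input sequence, $h$ is the number of heads, and $d_k$ is the width of each key vector.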

5. The cross-modal time series prediction method based on asymmetric fusion according to claim 4, characterized in that, in S3: the asymmetric fusion module includes consistent feature calculation and modality-specific feature calculation; the consistent feature calculation includes: based on the historical text modal features, generating pointwise modulation parameters, namely a scale parameter and a shift parameter β, using a learnable function; the modulation parameters are aligned with the time series features in the variable dimension or channel dimension and in the time dimension; pointwise linear modulation is performed on the time series features to obtain the consistent features; the pointwise linear modulation satisfies the following: for any feature point at a given variable index or channel index, the corresponding scale parameter and β are used to scale and translate the feature point, thereby enhancing or suppressing the time series features to reduce the impact of noise; the modality-specific feature calculation includes: mapping the historical text modal features into the time series feature space through a linear mapping or a learnable projection to obtain projected text features; based on the aforementioned consistent features, calculating the collinear component of the projected text features along the consistent feature direction, and calculating the component orthogonal to the collinear component using Gram-Schmidt orthogonalization; the orthogonal component is taken as the modality-specific feature and is used to preserve text modality-specific information in the time series feature space.

6. The cross-modal time series prediction method based on asymmetric fusion according to claim 5, characterized in that, in S4: the historical-future text fusion module adopts a Transformer-based decoder structure, uses the future text features as the query input and the historical text features as the key and value inputs, and performs cross-attention fusion; the output fused text features are used to enhance the expression of semantics and prior knowledge during the prediction period.

7. The cross-modal time series prediction method based on asymmetric fusion according to claim 6, characterized in that, in S5: the consistent features, the modality-specific features, and the fused text features are concatenated along the channel dimension to obtain the predicted feature representation; the predicted feature representation is then input into a linear projection layer, and the output is a multivariate future time series prediction result of length L.

8. A cross-modal time series prediction system based on asymmetric fusion for implementing the method of any one of claims 1-7, characterized in that it includes a data input module, a single-modal feature extraction module, an asymmetric fusion module, a historical-future text fusion module, and a prediction output module; the data input module is used to receive historical time series data, historical text prompts, and future text prompts; the single-modal feature extraction module is used to separately extract the time series features, the historical text modal features, and the future text modal features; the asymmetric fusion module is used to perform pointwise linear modulation based on the historical text modal features in the time series feature space and to perform collinear / orthogonal decomposition, outputting the consistent features and the modality-specific features; the historical-future text fusion module is used to fuse the historical text modal features with the future text modal features through cross-attention to obtain the fused text features; the prediction output module is used to integrate and decode the consistent features, the modality-specific features, and the fused text features to output the future time series prediction results.

9. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the cross-modal time series prediction method based on asymmetric fusion as described in any one of claims 1-7.

10. A computing device, characterized in that the device includes a processor and a memory, wherein the memory stores a computer program, and when the computer program is executed by the processor, the computing device performs the cross-modal time series prediction method based on asymmetric fusion as described in any one of claims 1-7.