Industrial equipment multi-modal information fusion and management method
By employing multimodal data acquisition, temporal alignment, feature decomposition, and progressive fusion, the heterogeneity and temporal misalignment issues were resolved, enabling deep fusion of sensor signals and visual information and improving the accuracy and robustness of equipment status prediction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHEJIANG UNIV
- Filing Date
- 2026-03-02
- Publication Date
- 2026-06-16
AI Technical Summary
Existing technologies lack comprehensive single-modal information, and multimodal fusion methods fail to effectively address heterogeneity, temporal misalignment, and long-range dependencies, resulting in low accuracy and poor robustness in equipment state prediction.
By collecting multimodal data, performing temporal alignment and feature decomposition, and employing a progressive cross-modal fusion and recursive refinement mechanism, combined with frequency domain cross-attention and residual interpolation, deep fusion of sensor signals and visual information is achieved.
It significantly improves the accuracy and robustness of equipment condition prediction, possesses anti-interference capabilities and adaptability to operating conditions, while maintaining computational efficiency and interpretability.
Figure CN122221135A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of predictive maintenance technology for industrial equipment, and in particular to a method for multimodal information fusion and management of industrial equipment.
Background Technology
[0002] As industrial manufacturing systems evolve towards intelligence and high precision, equipment health monitoring and predictive maintenance have become crucial for ensuring production safety and efficiency. Existing technologies fall mainly into two categories, each with its own limitations:
[0003] The first category comprises single-modal methods. Some of these methods use only sensor signals such as vibration, acoustic emission, or cutting force, and apply statistical indicators or machine learning models for condition assessment. While computationally lightweight, they struggle to capture the spatial morphology and complete mechanisms of equipment degradation, and they generalize poorly under varying operating conditions and strong noise interference. Other single-modal methods are purely visual and detect wear or cracks from images of the equipment surface. However, these methods are extremely sensitive to changes in lighting and to occlusion, and lack robustness in real-world industrial environments.
[0004] The second category comprises simple multimodal fusion methods. To overcome the limitations of single-modal data, some studies have attempted to fuse visual and sensor data. However, existing fusion strategies are mostly limited to simple concatenation at the feature level or score weighting at the decision level, failing to fundamentally address issues such as heterogeneity, temporal misalignment, and lack of long-range dependencies among multimodal data. This shallow fusion is prone to information confusion and cannot effectively extract and correlate complementary cues across modalities in noisy environments, thus limiting further improvements in prediction accuracy.
[0005] Therefore, there is an urgent need in industrial sites for a predictive maintenance method that can deeply understand and integrate heterogeneous temporal and visual information and is highly adaptable to changes in noise and operating conditions.
Summary of the Invention
[0006] To address the problems of incomplete single-modal information and neglect of data heterogeneity and temporal dependence in existing technologies, which lead to low accuracy and poor robustness in equipment status prediction, this invention proposes a multimodal information fusion and management method for industrial equipment.
[0007] The specific technical solution is as follows: A method for multimodal information fusion and management of industrial equipment, comprising the following steps:
[0008] S1: Collect multimodal data during the operation of industrial equipment, including sensor time-series data and visual image sequence data;
[0009] S2: Preprocess the multimodal data to generate multimodal temporal features that are aligned in time sequence;
[0010] S3: Decompose the multimodal time-series features into trend component features and seasonal component features;
[0011] S4: Perform progressive cross-modal fusion of the trend component features and seasonal component features from different modalities to obtain a fused feature representation;
[0012] S5: Iteratively optimize the fused feature representation through a recursive refinement mechanism;
[0013] S6: Based on the optimized fused feature representation, perform equipment status prediction. Through multimodal data acquisition, temporal alignment, feature decomposition, progressive fusion, recursive refinement, and prediction, the method effectively integrates the physical dynamics of sensor signals with the spatial descriptiveness of visual images, overcomes the limitations of single-modal information, and supports the accuracy, robustness, and interpretability of the final prediction results at the level of the overall architecture.
[0014] Furthermore, the sensor time series data is the cutting force signal acquired by the multi-channel sensor, and the visual image sequence data is an image of the equipment surface including the tool wear area;
[0015] The equipment status prediction includes equipment health status classification and future signal sequence regression.
[0016] Furthermore, step S2 includes:
[0017] The sensor time series is downsampled at multiple scales, and feature projection is performed through an embedding layer to obtain the sensor time series features;
[0018] Spatial features of the visual image sequence are extracted using a convolutional neural network, and temporal alignment and feature projection are performed on these spatial features to obtain visual temporal features. By performing multi-scale downsampling on the sensor sequence and temporal alignment and projection on the visual features, heterogeneous and asynchronous raw data can be efficiently transformed into temporally aligned and dimensionally unified feature representations, laying a solid foundation for subsequent deep fusion.
[0019] Furthermore, the embedding implemented by the embedding layer or feature projection includes value embedding, position embedding, and periodic embedding, wherein the periodic embedding is constructed using Fourier basis functions to capture cyclical patterns in the data. By fusing value, position, and periodic embeddings, especially the periodic embedding constructed using Fourier basis functions, it is possible to explicitly model the cyclical patterns prevalent in industrial data. This enhances the model's ability to understand temporal patterns, increases the information content of feature representations, and helps extract stable patterns in complex noise backgrounds.
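The three-part embedding described above can be illustrated with a minimal NumPy sketch. It is not the patented implementation: the value embedding is reduced to a single linear map (a stand-in for the multilayer perceptron), the three embeddings are assumed to be summed, and the names `embed`, `fourier_periodic`, `W_val`, `W_per`, and the period list are hypothetical.

```python
import numpy as np

def sinusoidal_position(T, d):
    """Standard sine/cosine position encoding of shape (T, d); d must be even."""
    pos = np.arange(T)[:, None]
    i = np.arange(d // 2)[None, :]
    angle = pos / (10000.0 ** (2 * i / d))
    pe = np.zeros((T, d))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

def fourier_periodic(T, periods):
    """Fourier basis features: one sin and one cos column per assumed period."""
    t = np.arange(T)[:, None]
    p = np.asarray(periods, dtype=float)[None, :]
    return np.concatenate([np.sin(2 * np.pi * t / p), np.cos(2 * np.pi * t / p)], axis=1)

def embed(x, W_val, W_per, periods):
    """Map x of shape (T, M) to (T, d) as value + position + periodic embedding.
    Summing the three parts is an assumption of this sketch."""
    T, d = x.shape[0], W_val.shape[1]
    value = x @ W_val                                # linear stand-in for the MLP
    position = sinusoidal_position(T, d)
    periodic = fourier_periodic(T, periods) @ W_per  # project the Fourier basis to d dims
    return value + position + periodic

rng = np.random.default_rng(0)
T, M, d = 64, 3, 16
periods = [8, 16]                                    # hypothetical cycle lengths in samples
x = rng.normal(size=(T, M))
W_val = rng.normal(size=(M, d)) * 0.1
W_per = rng.normal(size=(2 * len(periods), d)) * 0.1
emb = embed(x, W_val, W_per, periods)
```

In a trained model the projection matrices would be learned, and the periods would be chosen to match known machine cycles such as spindle rotation.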
[0020] Furthermore, step S3 is performed using a local trend decomposer, and the decomposition includes:
[0021] Extract local segments of the input features using a sliding window;
[0022] Each local segment is compared with a set of predefined kernel basis vectors to obtain the weight distribution;
[0023] Based on the weight distribution, the kernel basis vectors are combined to reconstruct the local trend components;
[0024] The local trend components of the overlapping windows are aggregated to obtain the global trend component, and the seasonal component is obtained by subtracting the global trend component from the input feature.
[0025] Furthermore, the kernel basis vector is initialized by principal component analysis (PCA) on the sampled training data. Local trends are adaptively extracted by comparing the similarity between the sliding window and the fixed kernel basis vector. This method is lightweight and robust to noise, and can more cleanly separate long-term trends from short-term seasonal fluctuations in the signal. The kernel basis vector is initialized through PCA and kept fixed, ensuring the stability and consistency of the decomposition process and effectively preventing overfitting.
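As a rough illustration of the PCA initialization, the sketch below samples local windows from training features, mean-centers each window, and takes the leading principal directions as kernel basis vectors. It assumes the kernels live in the feature dimension (shape `(K, d)`); the function name `init_kernels_pca` and all sizes are hypothetical.

```python
import numpy as np

def init_kernels_pca(x, w, K, n_windows=100, seed=0):
    """x: training features of shape (T, d). Returns K kernel basis vectors (K, d).
    Requires K <= d and T >= w."""
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, x.shape[0] - w + 1, size=n_windows)
    # Mean-center every sampled window, then pool all centered vectors.
    centered = [x[s:s + w] - x[s:s + w].mean(axis=0, keepdims=True) for s in starts]
    vectors = np.concatenate(centered, axis=0)             # (n_windows * w, d)
    vectors = vectors - vectors.mean(axis=0, keepdims=True)
    # PCA via SVD: rows of vt are principal directions, strongest first.
    _, _, vt = np.linalg.svd(vectors, full_matrices=False)
    return vt[:K]

rng = np.random.default_rng(1)
train = rng.normal(size=(300, 6))
kernels = init_kernels_pca(train, w=8, K=3)
```

Because the kernels are computed once and then frozen, they add no trainable parameters to the decomposition step.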
[0026] Furthermore, the progressive cross-modal fusion includes the following stages executed sequentially:
[0027] In the first stage, a frequency domain cross-attention mechanism is used to achieve long-range dependency interactions between features of different modal components;
[0028] In the second stage, the interactive features and the original modal features are weighted and fused through a residual interpolation mechanism to preserve modal priors.
[0029] In the third stage, enhanced features from different modalities are adaptively merged through a gating network.
[0030] Furthermore, the frequency domain cross-attention mechanism is calculated in the following way:
[0031] The query vector Q and the key vector K are transformed to the frequency domain using a fast Fourier transform;
[0032] Perform conjugate multiplication of Q and K in the frequency domain;
[0033] The computational result is transformed back to the time domain using an inverse fast Fourier transform, and then multiplied with the value vector V to generate attention interaction features. Long-range dependencies are efficiently captured through frequency domain cross-attention, and the original characteristics of each modality are preserved through residual interpolation to suppress fusion contamination. Finally, adaptive weighting is achieved through a gating network. This three-stage design realizes a refined fusion process from global interaction to local preservation, and then to adaptive integration, significantly improving the efficiency and quality of information complementarity between heterogeneous modalities.
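The frequency domain computation in the steps above can be sketched with NumPy's FFT routines. This is only an illustration under assumptions: the transform is taken along the time axis, K is the conjugated operand, the final product with V is element-wise, and no normalization is applied. The function names are hypothetical.

```python
import numpy as np

def circular_correlation(q, k):
    """FFT Q and K along time, multiply Q by the conjugate of K, transform back.
    For real inputs this equals the circular cross-correlation per channel."""
    qf = np.fft.fft(q, axis=0)
    kf = np.fft.fft(k, axis=0)
    return np.fft.ifft(qf * np.conj(kf), axis=0).real

def freq_cross_attention(q, k, v):
    """Attention interaction features: time-domain correlation times V (element-wise)."""
    return circular_correlation(q, k) * v

rng = np.random.default_rng(0)
T, d = 16, 4
q_s, k_s, v_s = (rng.normal(size=(T, d)) for _ in range(3))  # sensor branch
q_v, k_v, v_v = (rng.normal(size=(T, d)) for _ in range(3))  # visual branch

# Bidirectional interaction: each modality queries the other one.
sensor_from_visual = freq_cross_attention(q_s, k_v, v_v)
visual_from_sensor = freq_cross_attention(q_v, k_s, v_s)
```

The FFT route costs on the order of T log T per channel, whereas evaluating the same correlation directly requires on the order of T squared operations.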
[0034] Furthermore, the specific process of the recursive refining mechanism is as follows:
[0035] The original multimodal temporal features generated in step S2 are used as anchor point residuals;
[0036] In each refining iteration, the anchor point residual is added to the fusion feature representation of the previous round to form the input of the current iteration;
[0037] Steps S3 and S4 are repeated on the input for n iterations, where n ≥ 2. By repeatedly injecting the original features as anchor residuals in multiple iterations, the information decay or representation drift problem in deep networks is effectively avoided, the training process is stabilized, and the model can gradually refine the fusion results, thereby obtaining a more reliable and accurate fusion feature representation.
[0038] Furthermore, step S6 specifically includes:
[0039] Information is extracted from fused feature representations at different temporal resolutions, and multi-scale prediction is performed using a scale-specific prediction head.
[0040] The results of the multi-scale predictions are weighted and averaged to generate the final prediction result. By employing a multi-scale prediction and ensemble strategy, and utilizing fusion features at different resolutions for prediction and weighted averaging, information from both macro trends and micro fluctuations can be comprehensively utilized, making the final prediction result more robust and reducing prediction bias caused by incomplete information from a single scale. This is particularly beneficial for long-term time-series prediction tasks.
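A minimal sketch of the multi-scale prediction and weighted averaging follows. The linear heads, the ensemble weights, and the names `head_predict` and `ensemble` are hypothetical; the source does not specify the head architecture or how the weights are obtained.

```python
import numpy as np

def head_predict(feat, W):
    """Hypothetical scale-specific head: flatten (T_l, d) features, map to tau outputs."""
    return feat.reshape(-1) @ W

def ensemble(preds, weights):
    """Weighted average of per-scale predictions; weights are normalized to sum to 1."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return np.tensordot(w, np.stack(preds), axes=1)

rng = np.random.default_rng(0)
tau, d = 5, 4
feats = [rng.normal(size=(T_l, d)) for T_l in (16, 8, 4)]     # three temporal resolutions
heads = [rng.normal(size=(f.size, tau)) * 0.1 for f in feats]  # one head per scale
preds = [head_predict(f, W) for f, W in zip(feats, heads)]
final = ensemble(preds, weights=[0.5, 0.3, 0.2])
```

Each head maps a different sequence length to the same horizon, so the per-scale outputs can be averaged directly.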
[0041] The above technical solution has the following advantages or technical effects:
[0042] 1. This invention significantly improves the accuracy and reliability of industrial equipment condition prediction through a systematic multimodal fusion framework. The fusion of visual and sensor information provides a more comprehensive perspective on equipment degradation, while the progressive fusion and recursive refinement mechanisms ensure the depth and quality of information integration.
[0043] 2. This invention possesses excellent anti-interference capabilities and adaptability to various operating conditions. The local trend decomposer effectively separates noise from the signal, and the frequency domain attention and residual interpolation design enhance the model's robustness in noisy and complex dynamic environments.
[0044] 3. This invention balances performance with computational efficiency and interpretability. Frequency domain computation reduces the complexity of the attention mechanism, while fixed kernel bases and structured fusion modules make model behavior easier to analyze and understand, facilitating industrial deployment and application.
Attached Figure Description
[0045] Figure 1 is a flowchart of the method of the present invention;
[0046] Figure 2 is a schematic diagram of the overall structure of the multimodal time series prediction model of the present invention;
[0047] Figure 3 is a schematic diagram of the cross-modal and temporal feature extraction module of the present invention;
[0048] Figure 4 is a schematic diagram of the cross-modal fusion module structure of the present invention.
Detailed Implementation
[0049] To make the technical solution of the present invention clearer, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0050] As shown in Figure 1 and Figure 2, a multimodal information fusion and management method for industrial equipment systematically addresses the heterogeneity of sensor signals and visual images in terms of temporal sequence, semantics, and feature space, and includes the following steps:
[0051] S1: Multimodal Data Acquisition: The data acquisition module simultaneously acquires two types of heterogeneous data from the target industrial equipment.
[0052] Sensor time-series data: Multi-dimensional signals reflecting the physical state of equipment operation are acquired in real time through sensors (such as force, vibration, and acoustic emission sensors) installed on the equipment. These signals are sampled at a fixed frequency to form a time series $X \in \mathbb{R}^{M \times T}$, where M is the signal dimension and T is the sequence length.
[0053] Visual image sequence data: Images of key components of equipment (such as cutting tools, bearings, and gear surfaces) are acquired at set time intervals using visual acquisition devices such as industrial cameras to form image sequences, which are used to observe changes in their spatial shape, color, texture, and other visual states.
[0054] S2: Data Preprocessing and Temporal Feature Generation: The collected raw data needs to be preprocessed to improve quality and achieve cross-modal alignment. As shown in Figure 3, the specific steps are as follows:
[0055] Data cleaning and synchronization: The sensor signals are denoised (e.g., using median filtering) and the images are standardized to reduce environmental interference such as lighting conditions. The sensor sequences and image sequences are then timestamp-aligned against a unified time reference.
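The cleaning and synchronization step can be sketched as follows. This is an illustrative example only: the window size, the nearest-frame matching rule, and the names `median_filter_1d` and `align_frames` are assumptions, not details given in the source.

```python
import numpy as np

def median_filter_1d(signal, k=5):
    """Sliding median with an odd window k; edges are padded by repeating the end values."""
    pad = k // 2
    padded = np.pad(signal, pad, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, k)
    return np.median(windows, axis=1)

def align_frames(sensor_t, frame_t):
    """For each sensor timestamp, return the index of the nearest image frame.
    frame_t must be sorted and contain at least two timestamps."""
    sensor_t = np.asarray(sensor_t, dtype=float)
    frame_t = np.asarray(frame_t, dtype=float)
    idx = np.clip(np.searchsorted(frame_t, sensor_t), 1, len(frame_t) - 1)
    left, right = frame_t[idx - 1], frame_t[idx]
    return np.where(sensor_t - left <= right - sensor_t, idx - 1, idx)

spiky = np.array([1.0, 1.0, 9.0, 1.0, 1.0])   # one impulsive outlier
clean = median_filter_1d(spiky, k=3)
frame_index = align_frames([0.1, 0.6, 1.4, 1.9], [0.0, 1.0, 2.0])
```

Median filtering removes isolated spikes without smearing them into neighbouring samples, which is why it is a common choice for impulsive sensor noise.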
[0056] Multi-scale feature generation:
[0057] Sensor branch: Perform multi-level one-dimensional pooling downsampling on the aligned sensor sequence X. Specifically, the sequence at the l-th resolution level is obtained by pooling with step size k. Subsequently, each downsampled sequence is projected into a high-dimensional space by an embedding layer to obtain temporal features. The embedding scheme combines value embedding (processed via a multilayer perceptron), position embedding (using sinusoidal encoding), and periodic embedding (constructed via a Fourier basis).
[0058] Visual branch: Spatial features are extracted from the image sequence using an efficient convolutional network (such as EfficientNet). To align with the temporal resolution of the sensor branch, temporal average pooling is applied. The pooled features are then spatially flattened and projected via a linear transformation. Finally, the same unified embedding scheme as in the sensor branch is applied to obtain the visual temporal features.
[0059] Preliminary cross-modal interaction: Sensor features and visual features of the same scale are concatenated along the feature dimension, and joint feature extraction is performed through two-dimensional convolution to facilitate early cross-modal information exchange.
[0060] By employing systematic downsampling and a unified deep embedding strategy, force signals and image data with vastly different sampling rates and data structures were successfully transformed into a set of representations that are strictly aligned in time and unified in feature space. This fundamentally removes the obstacles to deep fusion of heterogeneous data. The introduction of periodic embedding enhances the model's ability to capture inherent cyclical patterns during equipment operation or process flow.
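The multi-level downsampling and temporal pooling in S2 can be sketched as below. The sketch assumes average pooling with equal window and stride k, and assumes the frame features have already been placed on the sensor's base time axis; the feature-map shape and function names are hypothetical.

```python
import numpy as np

def avg_pool_time(x, k):
    """Average-pool along the time axis (axis 0) with window and stride k.
    Trailing steps that do not fill a whole window are dropped."""
    usable = (x.shape[0] // k) * k
    return x[:usable].reshape(usable // k, k, *x.shape[1:]).mean(axis=1)

def multi_scale(x, k=2, levels=2):
    """Return [level 0, level 1, ...], each level pooled by k from the previous one."""
    scales = [x]
    for _ in range(levels):
        scales.append(avg_pool_time(scales[-1], k))
    return scales

sensor = np.arange(16, dtype=float).reshape(16, 1)      # (T=16, M=1)
sensor_scales = multi_scale(sensor, k=2, levels=2)      # lengths 16, 8, 4

frame_feats = np.ones((16, 4, 4, 8))                    # (time, height, width, channels)
visual_scales = [f.reshape(f.shape[0], -1)              # spatial flattening per time step
                 for f in multi_scale(frame_feats, k=2, levels=2)]
```

After this step both branches share the same sequence length at every level, so features of the same scale can be concatenated along the feature dimension.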
[0061] S3: Local Trend Decomposition and Multi-Scale Enhancement: To clearly separate the long-term performance evolution trend and short-term periodic / random fluctuations of the equipment, this step uses a local trend decomposer (LocTrend) to decompose the above features:
[0062] Decomposition operation: For any input feature sequence, LocTrend performs the following operations:
[0063] Sliding window and centering: A sliding window of length w and step size s is applied along the time axis to generate local segments. Each window is mean-centered to remove static offset:
[0064] $\tilde{x}_{i,t} = x_{i,t} - \mu_i$, where $x_{i,t}$ is the feature vector at time step t of window i and $\mu_i$ is the mean of window i.
[0065] Kernel similarity matching: Each centered vector is compared with a set of K predefined kernel basis vectors.
[0066] The comparison yields a soft-assignment weight $\alpha_{i,t,k}$ for kernel k at time step t in window i.
[0067] The kernel basis vectors are initialized by principal component analysis on sampled local windows to capture the dominant variation patterns, and they are kept fixed during training to ensure consistent trend modeling and prevent overfitting.
[0068] Trend reconstruction and aggregation: Local trends are reconstructed by combining the kernel bases with the similarity weights and restoring the mean. All local trends are aggregated by overlap averaging to form the global trend:
[0069] $X_{\text{trend}} = \big(\sum_i \hat{x}_i\big) \oslash C$,
[0070] where $\hat{x}_i$ is the reconstructed local trend of window i placed at its position on the time axis, C records the number of overlapping windows at each time step, and $\oslash$ denotes element-wise division.
[0071] Seasonal component acquisition: The seasonal component is computed as the residual, that is, the input feature minus the global trend. By employing local singular value decomposition, LocTrend achieves noise-robust decomposition, improves trend identification, and reduces computational cost, making it efficient for long-sequence modeling.
[0072] Multi-scale information enhancement: After decomposition, trend and seasonal components exchange information across different resolutions. The seasonal component employs a bottom-up enhancement path, progressively aggregating fine-resolution features to a coarse resolution; the trend component employs a top-down enhancement path, using coarse-resolution features to guide the refinement of fine-resolution features. The enhancement process uses a hybrid module containing convolutional and linear layers, and fuses the outputs of different paths through a learnable Softmax mechanism. The seasonal enhancement and the trend enhancement take corresponding forms.
[0073] LocTrend provides a lightweight decomposition method that is fully data-driven and requires no pre-defined periodic model. It adaptively captures local trend directions and is robust to noise. Combined with a multi-scale enhancement mechanism, it achieves semantic information complementarity across resolutions, making the physical meaning of the extracted trend and seasonal components clearer and providing a high-quality foundation for subsequent fusion.
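The LocTrend operations described above (sliding window, mean-centering, soft assignment to fixed kernels, reconstruction, overlap averaging, residual) can be put together in a short sketch. It is a hedged illustration, not the patented code: the similarity is assumed to be a dot product, the weights are assumed to come from a softmax over kernels, and the random kernels stand in for the PCA-initialized ones.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def loc_trend(x, kernels, w, s):
    """x: (T, d) float features with T >= w; kernels: (K, d) fixed basis vectors.
    Returns (trend, seasonal), both of shape (T, d)."""
    T = x.shape[0]
    starts = list(range(0, T - w + 1, s))
    if starts[-1] != T - w:                 # add a last window so the tail is covered
        starts.append(T - w)
    trend_sum = np.zeros_like(x)
    count = np.zeros((T, 1))                # number of windows covering each time step
    for st in starts:
        seg = x[st:st + w]                             # local segment (w, d)
        mu = seg.mean(axis=0, keepdims=True)           # per-window mean
        centered = seg - mu                            # remove static offset
        weights = softmax(centered @ kernels.T)        # (w, K) soft assignment per time step
        local = weights @ kernels + mu                 # combine kernels, restore the mean
        trend_sum[st:st + w] += local
        count[st:st + w] += 1
    trend = trend_sum / count                          # overlap averaging
    return trend, x - trend                            # seasonal is the residual

rng = np.random.default_rng(0)
x = rng.normal(size=(50, 4))
kernels = rng.normal(size=(3, 4))           # stand-in for PCA-initialized, frozen kernels
trend, seasonal = loc_trend(x, kernels, w=8, s=4)
```

By construction the two components always add back to the input, so no information is lost in the decomposition step.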
[0074] S4: Progressive Cross-Modal Fusion: This step deeply integrates the trend and seasonal components from different modalities. The fusion is performed by the cross-gated fusion module (XGateFusion), which operates in three progressive stages. Taking the fusion of the seasonal components of the two modalities as an example, as shown in Figure 4:
[0075] Phase 1: Frequency Domain Cross-Attention Interaction. This phase aims to capture long-range cross-modal dependencies with low computational complexity. First, a query vector Q, a key vector K, and a value vector V are generated for each modal component. Then, bidirectional interaction occurs in the frequency domain, with the query of each modality interacting with the key and value of the other modality:
[0076] $\mathcal{F}^{-1}\big(\mathcal{F}(Q) \odot \overline{\mathcal{F}(K)}\big) \odot V$,
[0077] where $\mathcal{F}$ and $\mathcal{F}^{-1}$ represent the fast Fourier transform (FFT) and the inverse fast Fourier transform (IFFT), the overline denotes complex conjugation, and $\odot$ represents element-wise multiplication. Because the operation is implemented with the FFT, its complexity is low, and the two-way interaction enables efficient long-range context modeling.
[0078] Phase 2: Residual interpolation preserves prior knowledge. To suppress "modal contamination" and preserve the inherent characteristics of each modality, the attention output is weighted and fused with the original components:
[0079] each enhanced component is a weighted combination of the attention output and the corresponding original component,
[0080] where the combination coefficients are learnable interpolation weights. This acts as an adaptive low-pass filter, ensuring that the core identity information of each modality is preserved after deep interaction.
[0081] Phase 3: Gated Adaptive Merging. The enhanced features from the two modalities are concatenated and fed into a gating network, which generates channel-level weights through the Sigmoid function σ. Multi-head self-attention then refines the intra-modal dependencies, and a residual connection on the concatenated features yields the fused representation. After fusion, the results are aggregated to compensate for lost information, giving the seasonal representation. A similar process is applied to generate the trend representation. The seasonal and trend representations are then combined by learnable weighted addition:
[0082] $Z = \beta_s Z_{\text{season}} + \beta_t Z_{\text{trend}}$, where $\beta_s$ and $\beta_t$ are learnable scalar weights. Subsequently, a shared feedforward block is applied to Z to enhance representation capacity.
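The second and third fusion phases can be illustrated with a small sketch. The exact formulas are not recoverable from the text, so the forms below are assumptions: residual interpolation is written as a convex blend with weight `lam`, and the gate mixes the two modalities as `g * a + (1 - g) * b`. All names and shapes are hypothetical, and the multi-head self-attention and feedforward block are omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def residual_interpolate(original, attended, lam):
    """Blend the attention output with the original component (assumed convex form).
    lam = 0 keeps the original unchanged; lam = 1 keeps only the attention output."""
    return lam * attended + (1.0 - lam) * original

def gated_merge(a, b, W, bias):
    """Concatenate two (T, d) features, compute channel-level gates with a sigmoid,
    and mix the modalities with those gates (assumed mixing rule)."""
    gate = sigmoid(np.concatenate([a, b], axis=1) @ W + bias)   # (T, d), values in (0, 1)
    return gate * a + (1.0 - gate) * b

rng = np.random.default_rng(0)
T, d = 16, 4
sensor, visual = rng.normal(size=(T, d)), rng.normal(size=(T, d))
sensor_att, visual_att = rng.normal(size=(T, d)), rng.normal(size=(T, d))  # phase-1 outputs

sensor_enh = residual_interpolate(sensor, sensor_att, lam=0.3)
visual_enh = residual_interpolate(visual, visual_att, lam=0.3)
W, bias = rng.normal(size=(2 * d, d)) * 0.1, np.zeros(d)
fused = gated_merge(sensor_enh, visual_enh, W, bias)
```

With this blend, a small `lam` keeps each modality close to its original component, which is one way to read the stated goal of preserving modal priors.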
[0083] S5: Recursive Refinement Mechanism
[0084] To overcome the subtle information loss or representation drift that may occur in a single fusion pass, this invention introduces a recursive refinement mechanism. This mechanism takes the original aligned features generated in step S2, which have not yet been decomposed or fused, as anchor residuals and injects them into n refinement rounds (n ≥ 2):
[0085] $Z^{(r)} = \mathcal{G}\big(Z^{(r-1)} + X_{\text{anchor}}\big)$, $r = 1, \dots, n$,
[0086] where $X_{\text{anchor}}$ denotes the anchor residual, $Z^{(r)}$ is the fused feature representation after round r, and $\mathcal{G}$ represents the sequence decomposition and multimodal fusion operations, together with auxiliary operations such as feedforward networks. This allows the model to progressively refine the seasonal and trend signals of the residuals, promoting cross-modal collaboration while avoiding the degradation of the original fusion semantics.
[0087] By repeatedly injecting the original information as a constant reference through multiple iterations, the optimization dynamics of the fusion process are effectively stabilized, preventing the information decay problem common in deep networks. This allows the model to progressively refine and correct the fusion results, ultimately obtaining a more reliable and accurate unified device state representation.
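The anchor-residual loop can be sketched in a few lines. Here `fuse_step` is a hypothetical stand-in for one pass of decomposition (S3) and fusion (S4), and producing the starting representation by one pass over the anchor alone is an assumption of this sketch.

```python
import numpy as np

def recursive_refine(anchor, fuse_step, n=2):
    """Re-inject the S2 features (anchor) before every refinement round."""
    if n < 2:
        raise ValueError("the method specifies n >= 2 refinement rounds")
    z = fuse_step(anchor)              # assumed initial fusion pass
    for _ in range(n):
        z = fuse_step(z + anchor)      # previous fused result plus anchor residual
    return z

anchor = np.array([1.0, 2.0])
refined = recursive_refine(anchor, fuse_step=lambda x: 0.5 * x, n=2)
# With this toy fuse_step: 0.5a, then 0.75a, then 0.875a.
```

The toy `fuse_step` halves its input, which makes it easy to see that the anchor keeps the representation from shrinking towards zero across rounds.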
[0088] S6: Multi-scale prediction and decision output:
[0089] As shown in Figure 2, after recursive refinement, the optimized fused feature representation at each scale l is obtained and used to generate predictions for the future τ time steps:
[0090] Multi-scale prediction: The features at each scale are fed into an independent, scale-specific regression head, which adjusts the temporal resolution and maps the features to the target space.
[0091] Results ensemble: The prediction results from all L scales are integrated to generate the final output. For classification tasks, the class probabilities output at each scale can be averaged.
[0092] Maintenance decision generation: Based on the final prediction results (such as the trend of future signal sequences, the category and confidence level of health status), combined with the preset operation and maintenance knowledge base or rules, the system automatically generates equipment status assessment reports and maintenance timing suggestions, realizing a closed loop from perception, prediction to decision-making.
[0093] By leveraging multi-scale features for parallel prediction and integration, degradation information at different time granularities is incorporated, resulting in smoother and more stable predictions with greater robustness to abrupt changes and noise. The final output not only provides accurate state assessments and trend predictions but also directly drives maintenance decisions, achieving true predictive maintenance intelligence.
[0094] The embodiments described above are merely illustrative of several implementations of the present invention, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of the present invention, and these modifications and improvements all fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent should be determined by the appended claims.
Claims
1. A method for multimodal information fusion and management of industrial equipment, characterized in that it includes the following steps: S1: Collect multimodal data during the operation of industrial equipment, including sensor time-series data and visual image sequence data; S2: Preprocess the multimodal data to generate multimodal temporal features that are aligned in time sequence; S3: Decompose the multimodal time-series features into trend component features and seasonal component features; S4: Perform progressive cross-modal fusion of the trend component features and seasonal component features from different modalities to obtain a fused feature representation; S5: Iteratively optimize the fused feature representation through a recursive refinement mechanism; S6: Based on the optimized fusion feature representation, perform device status prediction.
2. The method for multimodal information fusion and management of industrial equipment according to claim 1, characterized in that, The sensor time series data is the cutting force signal acquired by the multi-channel sensor, and the visual image sequence data is the equipment surface image including the tool wear area; The equipment status prediction includes equipment health status classification and future signal sequence regression.
3. The method for multimodal information fusion and management of industrial equipment according to claim 1, characterized in that, Step S2 includes: The sensor time series is downsampled at multiple scales, and feature projection is performed through an embedding layer to obtain the sensor time series features; The spatial features of the visual image sequence are extracted using a convolutional neural network, and the spatial features are temporally aligned and projected to obtain visual temporal features.
4. The method for multimodal information fusion and management of industrial equipment according to claim 3, characterized in that, The embedding implemented by the embedding layer or feature projection includes value embedding, position embedding and periodic embedding, wherein the periodic embedding is constructed using Fourier basis functions to capture cyclic patterns in the data.
5. The method for multimodal information fusion and management of industrial equipment according to claim 1, characterized in that step S3 is performed using a local trend decomposer, and the decomposition includes: Extract local segments of the input features using a sliding window; Each local segment is compared with a set of predefined kernel basis vectors to obtain the weight distribution; Based on the weight distribution, the kernel basis vectors are combined to reconstruct the local trend components; The local trend components of the overlapping windows are aggregated to obtain the global trend component, and the seasonal component is obtained by subtracting the global trend component from the input feature.
6. The method for multimodal information fusion and management of industrial equipment according to claim 5, characterized in that, The kernel basis vectors are initialized by principal component analysis of the training data samples.
7. The method for multimodal information fusion and management of industrial equipment according to claim 1, characterized in that, The progressive cross-modal fusion includes the following stages, executed sequentially: In the first stage, a frequency domain cross-attention mechanism is used to achieve long-range dependency interactions between features of different modal components; In the second stage, the interactive features and the original modal features are weighted and fused through a residual interpolation mechanism to preserve modal priors. In the third stage, enhanced features from different modalities are adaptively merged through a gating network.
8. The method for multimodal information fusion and management of industrial equipment according to claim 7, characterized in that, The frequency domain cross-attention mechanism is calculated in the following way: The query vector Q and the key vector K are transformed to the frequency domain using a fast Fourier transform; Perform conjugate multiplication of Q and K in the frequency domain; The result of the operation is transformed back to the time domain by inverse fast Fourier transform, and then multiplied with the value vector V to generate attention interaction features.
9. The method for multimodal information fusion and management of industrial equipment according to claim 1, characterized in that, The specific process of the recursive refining mechanism is as follows: The original multimodal temporal features generated in step S2 are used as anchor point residuals; In each refining iteration, the anchor point residual is added to the fusion feature representation of the previous round to form the input of the current iteration; Repeat steps S3 and S4 for the input, performing n iterations, where n ≥ 2.
10. The method for multimodal information fusion and management of industrial equipment according to claim 1, characterized in that, Step S6 specifically includes: Information is extracted from fused feature representations at different temporal resolutions, and multi-scale prediction is performed using a scale-specific prediction head. The results of the multi-scale predictions are weighted and averaged to generate the final prediction result.