Instrument state prediction method and system based on multi-modal sensor data fusion

By identifying and fusing multimodal sensing data through a one-dimensional convolutional encoder and a feature dimensionality reduction mapping module, and extracting frequency domain features by combining Fourier transform, the problems of high computational overhead and information loss in instrument status prediction are solved, and a more comprehensive prediction effect is achieved.

CN122241380A · Pending · Publication Date: 2026-06-19 · ZENITH INSTR CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ZENITH INSTR CO LTD
Filing Date
2026-05-15
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies in instrument condition prediction have high computational costs, are prone to losing key information, and fail to fully utilize frequency domain information, resulting in insufficient prediction accuracy.

Method used

An initial word sequence is generated by a one-dimensional convolutional encoder. A feature dimensionality reduction mapping module separates important words from redundant words. Semantically overlapping word pairs are identified based on cosine similarity. Attention entropy values are used to decide which word of each pair to retain and which to merge. Soft fusion is performed through a gated fusion unit. A Fourier transform is executed in parallel to extract frequency-domain periodic features, and a multi-domain fusion of time-domain and frequency-domain features is constructed.

Benefits of technology

While reducing the computational complexity of the model, it effectively retains core information, improving the accuracy of instrument state prediction and its anti-interference generalization ability.



Abstract

This invention belongs to the field of electronic digital data processing technology, and specifically relates to an instrument condition prediction method and system based on multimodal sensor data fusion. The method includes: acquiring raw time-series monitoring data; generating an initial word sequence via a one-dimensional convolutional encoder; inputting the sequence into a first encoder stage to generate a contextual state representation and an attention weight distribution for each word; linearly transforming and normalizing the contextual state representations to output predicted importance scores; dividing the words into an important word set and a redundant word set based on a preset first threshold; and, within the redundant word set and based on the original temporal order, identifying two words whose contextual state representations have a cosine similarity higher than a preset second threshold as a target word pair. This invention enhances predictive insight when dealing with massive, multi-channel, complex industrial monitoring data, and improves the accuracy and anti-interference generalization ability of fault detection and instrument health assessment.

Description

Technical Field

[0001] This invention relates to the field of electronic digital data processing technology. More specifically, this invention relates to an instrument condition prediction method and system based on multimodal sensor data fusion.

Background Technology

[0002] Instrument condition prediction is a key technology in the Industrial Internet of Things (IIoT) and smart manufacturing. In air or water pressure testing, instruments typically require high-frequency continuous sampling and analysis to capture pressure surges or micro-leakage. During high-frequency sampling, instruments inevitably generate massive amounts of multi-channel time-series monitoring data. By analyzing this data generated during equipment operation, the future health status of the equipment can be predicted, thereby avoiding losses from unplanned downtime. Traditional instrument condition prediction methods mainly rely on basic signal processing techniques and statistical models. However, when faced with the multi-channel and highly complex data generated by modern industrial equipment, traditional methods often fail to effectively detect deep-seated temporal dependencies and pattern characteristics within the data. Furthermore, traditional methods heavily rely on expert experience for feature engineering, resulting in extremely limited generalization capabilities for practical applications.

[0003] Currently, deep learning models are becoming increasingly prevalent in time series data analysis. These models can automatically learn temporal dependencies in data, improving the accuracy of instrument status predictions to some extent. Among them, related deep learning models, through parallel computing architectures and global receptive fields, demonstrate powerful capabilities in detecting long-range dependencies. However, while this technology enhances feature extraction capabilities, it upsets the balance between computational efficiency and performance: the computational complexity and memory consumption of the core self-attention mechanism grow quadratically with the length of the input sequence. For industrial monitoring data containing thousands or even tens of thousands of consecutive time points, this generates extremely high computational overhead, severely limiting the real-time predictive capabilities of related models.

[0004] To address the problem of excessive computational overhead in modeling, a common approach is to identify and remove redundant or secondary information points from the sequence to shorten its length, thereby reducing costs and increasing efficiency. However, existing information processing methods typically perform absolute, hard removal based on attention scores or the norm of the representation vector. This approach, lacking a soft transition, fails to establish a balanced information preservation mechanism and neglects the necessity of multimodal sensor data fusion, easily leading to the permanent loss of key information in the original sequence. On the other hand, many existing prediction models are limited to unidirectional modeling and analysis of monitoring data in the time domain, failing to fully utilize key features such as the periodicity of equipment operation and vibration frequency contained in the frequency domain information, resulting in a lack of comprehensiveness in the final extracted feature representation.

[0005] Therefore, how to effectively reduce the computational complexity of the model while constructing a balanced feature extraction framework that retains the core information in the original sequence to the maximum extent and integrates multi-domain features for a more comprehensive state representation is a problem that needs to be solved in current instrument state prediction methods based on multimodal sensor data fusion.

Summary of the Invention

[0006] To address the technical problems of existing technologies, such as high computational overhead, easy loss of key information, and failure to fully utilize the equipment operating periodicity and vibration frequency contained in the frequency domain information, this invention provides solutions in the following aspects.

[0007] In a first aspect, the present invention provides an instrument state prediction method based on multimodal sensor data fusion, comprising: acquiring raw time-series monitoring data, generating an initial word sequence via a one-dimensional convolutional encoder, and inputting the sequence into a first encoder stage to generate a contextual state representation and an attention weight distribution for each word; linearly transforming and normalizing the contextual state representations to output predicted importance scores, and dividing the words into an important word set and a redundant word set based on a preset first threshold; within the redundant word set, based on the original temporal order, identifying two words whose contextual state representations have a cosine similarity higher than a preset second threshold as a target word pair; comparing the entropy values of the attention weight distributions of the two words in each target word pair, defining the word with the smaller entropy value as the fusion target word and the word with the larger entropy value as the word to be pruned; through a gated fusion unit, weighting the contextual state representation of the word to be pruned and incorporating it into the fusion target word to obtain an updated fusion target word; forming a fusion sequence from the important word set and the updated fusion target words, concatenating the fusion sequence with a category word, and inputting it into a second encoder stage to extract a first prediction feature; performing a frequency domain transformation on the raw time-series monitoring data to extract frequency domain periodic features as a second prediction feature; and fusing the first and second prediction features and inputting them into a classifier to output the instrument state prediction result.

[0008] This invention separates important words from redundant words using a feature dimensionality reduction mapping module. Within the redundant set, it identifies semantically overlapping word pairs based on cosine similarity. Attention entropy values are used to decide which word of each pair to retain and which to merge. A gated fusion unit smoothly integrates the words to be pruned into the fusion target words. This soft fusion significantly reduces sequence length, model computational complexity, and memory usage, while effectively reducing the probability of core feature loss. To address the failure to fully utilize frequency domain information, this invention, in addition to extracting deep features in the time domain, performs Fourier transforms in parallel to extract global frequency domain periodic features, achieving multi-domain fusion of time-domain and frequency-domain features. This overcomes the limitations of a single perspective and enhances the accuracy of instrument status prediction.

[0009] Preferably, the step of acquiring the original time-series monitoring data and generating an initial word sequence using a one-dimensional convolutional encoder includes: inputting the original time-series monitoring data of dimension M × N into the one-dimensional convolutional encoder consisting of two stacked one-dimensional convolutional layers, wherein M is the number of channels and N is the length of the time series; the stride of the first convolutional layer is 1 and the stride of the second convolutional layer is 2, with a normalization layer following each of the first and second convolutional layers; and outputting the initial word sequence, in which each word has the output word dimension.

[0010] This invention utilizes an encoder composed of two stacked one-dimensional convolutional layers. By combining different strides and convolution kernel sizes, it can effectively extract local features from multi-channel monitoring data and shorten the time series. This transforms the original multimodal sensing data into a structured representation suitable for Transformer processing, while layer normalization ensures the stability of feature transfer.

[0011] Preferably, the step of outputting the predicted importance score by linear transformation and normalization of the context state representation includes: inputting the context state representation of each word into a feature dimensionality reduction mapping module; the feature dimensionality reduction mapping module performs dimensionality reduction and compression on the input dimension through at least one layer of linear transformation network and performs nonlinear activation, outputs a scalar value through the output layer, and normalizes the scalar value to between 0 and 1 through the Sigmoid function as the predicted importance score.

[0012] This invention utilizes a feature dimensionality reduction mapping module constructed from a two-layer fully connected network to establish a screening mechanism for automatically identifying the value of feature information. Through dimensionality reduction mapping and Sigmoid nonlinear activation, the model can accurately output standardized prediction importance scores, thereby successfully separating important features containing core anomaly indicators from redundant features with normal fluctuations. This provides a reliable basis for subsequent differentiated processing and reduces the waste of computing resources.

[0013] Preferably, identifying two words whose context state representations have a cosine similarity higher than a preset second threshold as a target word pair includes: setting a sliding time window in the redundant word set, traversing the word pairs within the sliding time window, calculating the dot product of the context state representation vectors of the two words in each word pair, dividing it by the product of the magnitudes of the two vectors to obtain a cosine similarity value, and identifying word pairs whose cosine similarity value is higher than the preset second threshold as target word pairs.

[0014] This invention can accurately capture feature pairs that are highly overlapping in semantic or temporal patterns in redundant feature sets by calculating the cosine similarity of context state representation. This similarity evaluation mechanism enables the model to scientifically screen redundant words with merging potential, providing an accurate target basis for subsequent soft fusion.

[0015] Preferably, the updated fusion target word satisfies a relation (the formula itself is not reproduced in this text) among the following quantities: the updated fusion target word, the fusion target word, the word to be pruned, and a gate value. The gate value is calculated by concatenating the fusion target word with the word to be pruned and passing the result through a linear transformation layer and a Sigmoid activation function.

[0016] Preferably, the extraction of the first prediction feature includes: inserting the category word at the starting position of the fusion sequence, inputting the concatenated fusion sequence into the second encoder stage composed of multiple stacked Transformer encoder layers, each Transformer encoder layer containing a multi-head self-attention module and a feedforward neural network module, and extracting the first prediction feature from the output corresponding to the category word.

[0017] Preferably, the frequency domain transformation is a one-dimensional fast Fourier transform, and the extraction of frequency domain periodic features as the second prediction feature includes: independently performing a one-dimensional fast Fourier transform on each channel of the original time-series monitoring data, selecting a preset number of frequency components in the spectrum of each channel according to the descending order of energy amplitude, and concatenating the amplitude and phase information of the selected frequency components from all channels to obtain the second prediction feature.
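The per-channel frequency feature extraction described above can be illustrated with a minimal NumPy sketch. The function name `frequency_features`, the value `k = 3`, and the synthetic three-channel signal are assumptions for illustration only; the text does not state the preset number of frequency components or whether the DC component is removed first, so none is removed here.

```python
import numpy as np

def frequency_features(x: np.ndarray, k: int = 3) -> np.ndarray:
    """x has shape (M, N): M channels, N time points.

    Returns a 1-D vector of length M * 2 * k holding, per channel,
    the k largest amplitudes followed by the matching phases.
    """
    spectrum = np.fft.rfft(x, axis=1)        # one-sided FFT of each channel
    amplitude = np.abs(spectrum)
    phase = np.angle(spectrum)
    # Indices of the k strongest components, in descending amplitude order.
    top = np.argsort(-amplitude, axis=1)[:, :k]
    top_amp = np.take_along_axis(amplitude, top, axis=1)
    top_phase = np.take_along_axis(phase, top, axis=1)
    return np.concatenate([top_amp, top_phase], axis=1).ravel()

# Hypothetical input: 10 s sampled at 100 Hz, three synthetic channels.
t = np.arange(1000) / 100.0
x = np.stack([
    np.sin(2 * np.pi * 5 * t),
    np.sin(2 * np.pi * 12 * t),
    np.cos(2 * np.pi * 20 * t),
])
features = frequency_features(x, k=3)
```

With real sensor data that carries a large constant offset, the zero-frequency bin would usually dominate the ranking, so subtracting each channel's mean beforehand may be worth considering.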

[0018] Preferably, the step of fusing the first prediction feature and the second prediction feature, inputting the data into a classifier, and outputting the instrument state prediction result includes: concatenating the first prediction feature and the second prediction feature to obtain a fused feature vector, inputting the fused feature vector into a classifier, outputting the probability distribution of the instrument in each preset state through a Softmax function, and selecting the state corresponding to the maximum probability value as the instrument state prediction result.
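As a rough illustration of this fusion and classification step, the sketch below concatenates two feature vectors and applies a Softmax classifier. The single linear layer, the three instrument states, the feature sizes (256 and 18), and the random weights are all assumptions chosen for the example; the text only specifies concatenation, a classifier, Softmax, and selection of the maximum probability.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable Softmax over a 1-D logit vector."""
    shifted = z - np.max(z)
    e = np.exp(shifted)
    return e / e.sum()

rng = np.random.default_rng(0)
first_feature = rng.normal(size=256)   # stand-in for the time-domain feature
second_feature = rng.normal(size=18)   # stand-in for the frequency-domain feature

# Fuse by concatenation, as described in the text.
fused = np.concatenate([first_feature, second_feature])

# Hypothetical linear classifier over three preset instrument states.
num_states = 3
W = rng.normal(scale=0.01, size=(num_states, fused.size))
b = np.zeros(num_states)

probs = softmax(W @ fused + b)
predicted_state = int(np.argmax(probs))  # state with the maximum probability
```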

[0019] This invention, by independently performing a one-dimensional fast Fourier transform and combining it with energy amplitude descending sorting, can efficiently extract the most important periodic components and vibration frequencies contained in the equipment during operation. The comprehensive splicing of amplitude and phase constructs a high-density frequency domain feature vector, enriching the multi-dimensional representation of the instrument status and improving the comprehensiveness of the prediction model.

[0020] Preferably, the one-dimensional convolutional encoder, the first encoder stage, the second encoder stage, the feature dimensionality reduction mapping module, and the classifier are optimized through supervised training. The supervised training steps include: acquiring raw time-series monitoring data samples with state labels, performing Z-score normalization on the samples; calculating the network loss using a weighted cross-entropy loss function, wherein corresponding class weights are set for different state labels based on the sample ratio differences of different instrument states; and updating the model parameters through an optimizer based on the network loss until a preset convergence criterion is met.
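The preprocessing and loss described here can be sketched as follows. The inverse-frequency weighting rule, the helper names, and the toy labels are assumptions; the text only states that class weights are derived from the sample ratio differences between instrument states.

```python
import numpy as np

def zscore(x: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize each channel of x (shape (M, N)) to zero mean and unit variance."""
    mean = x.mean(axis=1, keepdims=True)
    std = x.std(axis=1, keepdims=True)
    return (x - mean) / (std + eps)

def inverse_frequency_weights(labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Assumed rule: weight each class by total / (num_classes * class_count)."""
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    return counts.sum() / (num_classes * np.maximum(counts, 1.0))

def weighted_cross_entropy(logits, labels, class_weights) -> float:
    """Class-weighted cross-entropy, averaged over the sum of sample weights."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(len(labels)), labels]
    w = class_weights[labels]
    return float((w * nll).sum() / w.sum())

# Toy imbalanced labels: 7 samples of state 0, 2 of state 1, 1 of state 2.
labels = np.array([0, 0, 0, 0, 0, 0, 1, 1, 2, 0])
weights = inverse_frequency_weights(labels, num_classes=3)
logits = np.zeros((10, 3))                  # an untrained, uniform predictor
loss = weighted_cross_entropy(logits, labels, weights)
normalized = zscore(np.array([[1.0, 2.0, 3.0]]))
```

Rarer states receive larger weights under this rule, so misclassifying a rare fault state contributes more to the loss than misclassifying a common normal state.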

[0021] In a second aspect, the present invention provides an instrument condition prediction system based on multimodal sensor data fusion, including a processor and a memory, wherein the memory stores computer program instructions, and when the computer program instructions are executed by the processor, the above-mentioned instrument condition prediction method based on multimodal sensor data fusion is implemented.

[0022] By adopting the above technical solution, the instrument state prediction method based on multimodal sensing data fusion is generated into a computer program and stored in a memory for loading and execution by a processor. This allows for the creation of a terminal device based on the memory and processor, facilitating its use.

[0023] The beneficial effects of this invention are as follows: This invention uses a pioneering lexical feature dimensionality reduction mapping module and attention entropy calculation to accurately distinguish between core abnormal indicators and regular fluctuation redundant information. It also uses a gated fusion unit to dynamically weight and merge highly similar redundant lexical units, achieving adaptive feature compression while maintaining high accuracy. This effectively avoids the problem of permanent discontinuity of key temporal features caused by traditional direct discarding methods.

[0024] Furthermore, on top of the deep mining of long-distance dependencies in the temporal context using an improved Transformer encoder, a one-dimensional Fast Fourier Transform is introduced across domains to comprehensively extract frequency domain features characterizing the macroscopic periodicity and vibration properties of the equipment. By constructing a multi-dimensional feature representation that integrates the time and frequency domains, the predictive power in dealing with massive, multi-channel, and complex industrial monitoring data is enhanced, improving the accuracy and anti-interference generalization ability of fault detection and instrument health assessment.

Attached Figure Description

[0025] Figure 1 is a flowchart of the instrument state prediction method based on multimodal sensor data fusion of the present invention; Figure 2 is a time-series dynamic response diagram of the multimodal sensing data prediction results and the original monitoring pressure in an embodiment of the present invention.

Detailed Implementation

[0026] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.

[0027] The specific embodiments of the present invention will now be described in detail with reference to the accompanying drawings.

[0028] This invention discloses an instrument condition prediction method based on multimodal sensor data fusion. Referring to Figure 1, the method includes steps S1-S5: S1. Obtain the original time-series monitoring data, generate the initial word sequence through a one-dimensional convolutional encoder, and input it into the first encoder stage to generate the context state representation and attention weight distribution of each word.

[0029] In an optional embodiment, the Pandas library is used to read the M channels of raw time-series monitoring data, which are organized into a three-dimensional tensor of batch size × M × N. The torch.nn.Conv1d one-dimensional convolutional layer in the PyTorch deep learning framework is called as the one-dimensional convolutional encoder: the number of input channels is set to M, the number of output channels is the embedding dimension of the initial words, and the convolution kernel size and stride are configured. The three-dimensional tensor is convolved to generate an initial word sequence with dimensions batch size × sequence length × embedding dimension.

[0030] The raw time-series monitoring data are preferably collected temperature, vibration, and pressure data, in which the M channels are the data collected from M monitoring points. Specifically, in the air pressure test, 5 monitoring points are set up, so M is 5.

[0031] The raw time-series monitoring data of dimension M × N is input into a one-dimensional convolutional encoder composed of two stacked one-dimensional convolutional layers. The first convolutional layer uses a kernel of size 3, a stride of 1, and a ReLU activation function, and is followed by a normalization layer. The second convolutional layer uses a kernel of size 5, a stride of 2, and a GELU activation function, and is also followed by a normalization layer. The output is an initial sequence of L words, where each word has a fixed embedding dimension.

[0032] The one-dimensional convolutional encoder is a neural network model. Its network structure includes an input layer, a first convolutional layer, a first normalization layer, a second convolutional layer, a second normalization layer, and an output layer. The input is the original time-series monitoring data tensor of dimension M × N, where M is the number of channels and N is the length of the time series; the output is the initial word sequence tensor, whose dimensions are the number of words L and the dimension of each word.

[0033] The raw time-series monitoring data is transformed into a fixed-dimensional initial word sequence suitable for processing by the Transformer model, and local temporal patterns are detected through convolution operations. For example, for an instrument with three monitoring channels (vibration, temperature, and current), M is 3. If sampling at a fixed frequency for 10 seconds yields N = 1000 consecutive time points, the original time-series monitoring data tensor has a dimension of 3×1000. This 3×1000 tensor is input into a one-dimensional convolutional encoder consisting of two stacked one-dimensional convolutional layers. The first convolutional layer has 128 convolution kernels of size 3, a stride of 1, and uses same padding to maintain the sequence length. After the ReLU activation function and the subsequent layer normalization, the resulting feature map has a dimension of 128×1000.

[0034] The 128×1000 feature map then enters the second convolutional layer, which has 256 kernels of size 5 and a stride of 2. The number of kernels corresponds to the initial word dimension, which is 256. Since the stride is 2, the sequence length is downsampled to half, resulting in an output dimension of 256×500. GELU is chosen as the activation function for the second convolutional layer. Similarly, a normalization layer is connected after the second convolutional layer to stabilize training. The resulting 256×500 tensor is then transposed to 500×256, forming an initial word sequence of L=500 words, where each word is a 256-dimensional vector carrying the encoded information of a local temporal pattern.
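The shape arithmetic of this example can be checked with the standard output-length formula for a one-dimensional convolution. The padding of 2 for the second layer is an assumption chosen so that the length halves exactly to 500; the text specifies same padding only for the first layer.

```python
def conv1d_output_length(length: int, kernel_size: int, stride: int, padding: int) -> int:
    """Output length of a 1-D convolution with dilation 1."""
    return (length + 2 * padding - kernel_size) // stride + 1

n = 1000  # time points in the 3 x 1000 example

# First layer: kernel 3, stride 1, same padding (padding = 1) keeps the length.
after_first = conv1d_output_length(n, kernel_size=3, stride=1, padding=1)

# Second layer: kernel 5, stride 2; padding = 2 is an assumed value.
after_second = conv1d_output_length(after_first, kernel_size=5, stride=2, padding=2)

# After transposition: L words, each of dimension 256.
initial_sequence_shape = (after_second, 256)
```

Here `after_first` evaluates to 1000 and `after_second` to 500, matching the 128×1000 and 256×500 dimensions given in the example.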

[0035] The initial word sequence is input into the first encoder stage, which consists of at least one Transformer encoder layer. The multi-head self-attention module inside the first encoder stage calculates the attention score between each word. After normalization by the Softmax function, the attention weight distribution is formed. The attention weight distribution is a probability vector with the same length as the initial word sequence. Combined with the feedforward neural network layer, the first encoder stage outputs the context state representation of each word after context information fusion.

[0036] In this way, by extracting local features of multi-channel monitoring data through a one-dimensional convolutional encoder and reducing the dimensionality of the time series, and by using the first encoder stage to establish global associations between various lexical units, the context state representation contains macroscopic operational features, providing high-quality feature input for subsequent processing.

[0037] S2. The context state representations are linearly transformed and normalized to output predicted importance scores, and the words are then divided into an important word set and a redundant word set according to a preset first threshold.

[0038] In an optional embodiment, a learnable feature dimensionality reduction mapping module consisting of a torch.nn.Linear fully connected layer and a sigmoid activation function is constructed. The context state representation of each word output from the first encoder stage is input into the feature dimensionality reduction mapping module to calculate a prediction importance score between 0 and 1. Based on a preset first threshold, word indices with prediction importance scores higher than the first threshold are recorded as important word sets, and word indices with scores not higher than the first threshold are recorded as redundant word sets.

[0039] Specifically, the context state representation of each word is input into a feature dimensionality reduction mapping module consisting of two fully connected layers. The first fully connected layer maps the input dimension to half of its own and uses the ReLU activation function. The second fully connected layer outputs a scalar value, which is normalized to between 0 and 1 using the Sigmoid function as the prediction importance score.

[0040] The learnable feature dimensionality reduction mapping module is preferably a neural network model. The network structure includes an input layer, a first fully connected layer, a second fully connected layer, and an output layer. The input of the network is the context state representation vector of a single word, and the output is a scalar value between 0 and 1 representing the prediction importance score.

[0041] For example, after processing in the first encoder stage, context state representations for L = 500 words are obtained, where each context state representation is a 256-dimensional vector. For each word in the sequence, its 256-dimensional representation vector is independently input into the feature dimensionality reduction mapping module. The weight matrix of the first fully connected layer has a dimension of 256×128; it linearly transforms the input vector from 256 dimensions to 128 dimensions, and non-linearity is introduced through the ReLU activation function.

[0042] The 128-dimensional vector produced by the first layer is fed into the second fully connected layer. The weight matrix of this layer has a dimension of 128×1, mapping the 128-dimensional vector to a single scalar value (the logit). To transform this unbounded value into a standardized importance score, the scalar is input into the Sigmoid function, which outputs a floating-point number between 0 and 1 representing the predicted importance score of the word. After processing all 500 words, a list of 500 scores is obtained. Based on a preset first threshold of 0.2, words with scores greater than 0.2 are assigned to the important word set, while words with scores less than or equal to 0.2 are assigned to the redundant word set.
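The scoring and thresholding in this example can be sketched in NumPy as below. The randomly initialized weights and the random input stand in for trained parameters and real encoder outputs, so the resulting split is purely illustrative; only the layer sizes (256 to 128 to 1), the ReLU and Sigmoid activations, and the 0.2 threshold come from the text.

```python
import numpy as np

def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))

def importance_scores(h, w1, b1, w2, b2) -> np.ndarray:
    """h: (L, 256) context state representations -> (L,) scores in [0, 1]."""
    hidden = np.maximum(h @ w1 + b1, 0.0)   # first layer + ReLU, shape (L, 128)
    logit = hidden @ w2 + b2                # second layer, shape (L, 1)
    return sigmoid(logit).squeeze(-1)

rng = np.random.default_rng(0)
L, D = 500, 256
h = rng.normal(size=(L, D))                 # placeholder encoder output

# Placeholder (untrained) parameters with the dimensions from the example.
w1 = rng.normal(scale=0.05, size=(D, D // 2))
b1 = np.zeros(D // 2)
w2 = rng.normal(scale=0.05, size=(D // 2, 1))
b2 = np.zeros(1)

scores = importance_scores(h, w1, b1, w2, b2)

first_threshold = 0.2
important_idx = np.flatnonzero(scores > first_threshold)
redundant_idx = np.flatnonzero(scores <= first_threshold)
```

Keeping index arrays rather than copying vectors preserves each word's original position, which the later pairing step relies on.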

[0043] In this way, an automatic screening mechanism for identifying the value of feature information was constructed, successfully separating important features containing core abnormal indicators from redundant features of normal fluctuations, thus avoiding the waste of computing resources in subsequent processing.

[0044] S3. In the redundant word set, based on the original temporal order, identify two words whose context state representations have a cosine similarity higher than the preset second threshold as a target word pair; compare the entropy values of the attention weight distributions of the two words in the target word pair, define the word with the smaller entropy value as the fusion target word, and define the word with the larger entropy value as the word to be pruned.

[0045] In an optional embodiment, for the context state representations of any two words in the redundant word set, the `torch.nn.functional.cosine_similarity` function is called to calculate the pairwise cosine similarity. Word pairs with similarity values higher than a preset second threshold are selected. For each selected word pair, their respective attention weight distribution vectors are extracted. The information entropy is calculated using the `scipy.stats.entropy` function of the SciPy library or a custom function based on the Shannon entropy formula. The entropy values of the two words in the same word pair are compared. The word with the smaller entropy value is designated as the fusion target word, and the word with the larger entropy value is designated as the word to be pruned.

[0046] In the redundant word set, when identifying word pairs whose context state representations have a cosine similarity higher than the preset second threshold, a sliding time window of preset length is set to avoid the excessive computational complexity caused by a global search. Only the word pairs within the sliding time window are traversed: the dot product of the context state representation vectors of each word pair is calculated and divided by the product of the magnitudes of the two vectors to obtain the cosine similarity value. Word pairs with cosine similarity values greater than the second threshold are identified as pairs to be processed.

[0047] In addition, when calculating information concentration, the Shannon entropy can be computed with the scipy.stats.entropy function of the SciPy library; when engineering computing power is limited, the variance of the attention weight distribution vector can instead be calculated directly as an alternative dispersion index. A smaller variance corresponds to a higher entropy value, indicating that the information is more dispersed; a larger variance corresponds to a lower entropy value, indicating that the information is more concentrated. This avoids the logarithm evaluations required by Shannon entropy and reduces the hardware overhead of floating-point operations.

[0048] Taking this embodiment as an example, the redundant word set contains 80 words, and the sliding time window length is set to 10. For each word in the redundant word set, pairing calculations are performed only with other words within the temporally adjacent window, so each word is paired with at most a window-limited number of other words, and the total number of word pairs evaluated is far smaller than in a global pairwise traversal. For the two words of each candidate pair, their respective context state representation vectors are extracted; both are 256-dimensional.

[0049] The cosine similarity between the two vectors is calculated using the following formula: $\cos(\mathbf{h}_i, \mathbf{h}_j) = \dfrac{\mathbf{h}_i \cdot \mathbf{h}_j}{\lVert \mathbf{h}_i \rVert_2 \, \lVert \mathbf{h}_j \rVert_2}$, where $\mathbf{h}_i$ and $\mathbf{h}_j$ are the context state representation vectors of the two lexical units, $\cdot$ represents the vector dot product, and $\lVert \cdot \rVert_2$ represents the L2 norm of a vector; the result is a value between -1 and 1. The calculated cosine similarity is compared with the preset second threshold of 0.95. If it is greater than 0.95, the two lexical units are considered highly similar in semantics or pattern and are identified as a pair to be processed. This process iterates through all candidate word pairs to obtain a list of all highly similar word pairs for the subsequent fusion operation.
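The windowed search described above can be sketched as follows. This is an illustration in NumPy rather than the embodiment's own code: the function name `find_similar_pairs` is hypothetical, the data are random placeholders, and interpreting a window length of 10 as "positions less than 10 apart" is an assumption.

```python
import numpy as np

def find_similar_pairs(states, window=10, threshold=0.95):
    """Return (i, j, similarity) for pairs inside the window above the threshold.

    `states` is an (n_words, dim) array kept in the original temporal order.
    ASSUMPTION: two words share a window when j - i < window.
    """
    norms = np.linalg.norm(states, axis=1)
    n = len(states)
    pairs = []
    for i in range(n):
        for j in range(i + 1, min(i + window, n)):
            denom = norms[i] * norms[j]
            if denom == 0.0:
                continue  # cosine similarity is undefined for a zero vector
            sim = float(states[i] @ states[j] / denom)
            if sim > threshold:
                pairs.append((i, j, sim))
    return pairs

rng = np.random.default_rng(0)
states = rng.normal(size=(80, 256))            # 80 redundant words, 256-dim each
states[5] = 2.0 * states[3] + 1e-3 * rng.normal(size=256)  # near-duplicate of word 3
states[40] = states[3]                         # duplicate of word 3, outside its window
pairs = find_similar_pairs(states)
```

Under this window definition, 80 words give 675 candidate pairs instead of the 3160 pairs of a global traversal, and the duplicate at position 40 is never compared with word 3.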

[0050] The entropy of the attention weight distribution reflects how widely a word's attention is spread across the other positions in the sequence. When the attention entropy of a word is low, its attention weights are concentrated on a few specific positions, indicating that the word encodes discriminative features closely related to specific local temporal patterns. When the entropy is high, the attention weights tend to be evenly distributed across positions, indicating that the word has failed to capture discriminative local patterns and that its encoded information is less unique. Based on this characteristic, words with lower entropy values are selected and retained as the fusion target words, while words with higher entropy values are designated as the words to be pruned; this compresses the sequence length while prioritizing the retention of feature representations with stronger discriminative capabilities.

[0051] S4. Through the gating fusion unit, the context state representation of the word to be pruned is weighted and integrated into the fusion target word to obtain the updated fusion target word.

[0052] In an optional embodiment, a gated fusion unit containing a torch.nn.Linear fully connected layer and a sigmoid activation function is implemented. For each word to be pruned and its corresponding target word, the context state representation is concatenated and input into the unit to generate a gated scalar. A gated weighted fusion mechanism is adopted, and the representations of the word to be pruned and the target word are weighted and summed using the gate weight to update the representation of the target word, thus obtaining the updated target word.

[0053] Specifically, for each identified word pair, the representation of the fusion target word and the representation of the word to be pruned are concatenated, and a gate value between 0 and 1 is calculated from the concatenated vector using a linear transformation layer and a sigmoid activation function. The updated fusion target word satisfies the following relation: [formula missing]

[0054] Specifically, lexical units with lower entropy values are used as fusion target lexical units, while lexical units with higher entropy values are used as lexical units to be pruned. For example, the representations of the fusion target lexical unit and the lexical unit to be pruned are both 256-dimensional vectors. They are concatenated along the feature dimension to form a 512-dimensional vector, which is fed into a learnable gating network. The gating network consists of a linear transformation layer with a weight matrix of dimension 512×1 and a sigmoid activation function: it maps the 512-dimensional input to a scalar and constrains that scalar to the range between 0 and 1, generating the gating value. The gating value determines how much information from the lexical unit to be pruned is incorporated. The updated fusion target lexical unit is then calculated using the above relation. After this operation is completed, the lexical unit to be pruned is discarded, and the updated fusion target lexical unit replaces the original one in the next stage, thereby reducing the number of lexical units while retaining key information.
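The gating computation can be sketched in NumPy under a stated assumption. The text specifies the 512-dimensional concatenation, the 512×1 linear layer and the sigmoid, but the exact update relation is not reproduced here; the convex combination used below, (1 - g) × target + g × pruned, is only one plausible form and is an assumption, as are the function names and the random placeholder weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fuse(target, pruned, weight, bias):
    """Fuse `pruned` into `target` with a scalar gate.

    ASSUMPTION: the update is the convex combination
    (1 - gate) * target + gate * pruned; the source relation may differ.
    """
    concat = np.concatenate([target, pruned])       # 512-dim concatenated vector
    gate = float(sigmoid(concat @ weight + bias))   # scalar in (0, 1)
    return (1.0 - gate) * target + gate * pruned, gate

rng = np.random.default_rng(0)
dim = 256
target = rng.normal(size=dim)                   # fusion target word (placeholder)
pruned = rng.normal(size=dim)                   # word to be pruned (placeholder)
weight = rng.normal(scale=0.02, size=2 * dim)   # stands in for the learnable 512x1 matrix
updated, gate = gated_fuse(target, pruned, weight, bias=0.0)
```

With this form, a gate near 0 leaves the fusion target essentially unchanged, and a gate near 1 replaces it with the pruned word's representation.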

[0055] In this way, the fusion ratio of redundant information is dynamically controlled by the gating value, and a smooth transition from secondary features to primary features is achieved. It is worth noting that the word fusion step is performed only once, as a preprocessing computation whose cost is linear in the sequence length under the sliding time window constraint. After fusion, the sequence entering the second encoder stage is much shorter. Because the computational complexity of the multi-head self-attention module in the second encoder stage is proportional to the square of the sequence length, the computation saved by shortening the sequence far outweighs the additional overhead introduced by the fusion step itself, and the overall scheme achieves a net gain in computational efficiency.

[0056] S5. Construct a fusion sequence by combining the important word set with the updated fusion target word. Concatenate the fusion sequence with a category word and input it into the second encoder stage to extract the first prediction feature. Perform frequency domain transformation on the original time-series monitoring data and extract frequency domain periodic features as the second prediction feature. Fuse the first prediction feature and the second prediction feature and input them into the classifier to output the instrument status prediction result.

[0057] In an optional embodiment, the `torch.cat` function is used to concatenate the lexical units corresponding to the important lexical unit set with all the updated fusion target lexical units along the dimension of the initial lexical unit sequence, forming a pruned and fused lexical unit sequence. A learnable `torch.nn.Parameter` is defined as a category lexical unit, with the same dimension as other lexical units in the sequence. The `torch.cat` function is used to concatenate the category lexical unit to the beginning of the sequence. The concatenated complete sequence is input into the second encoder stage, which consists of another set of `torch.nn.TransformerEncoderLayer` modules. After the model forward propagation is completed, the output vector corresponding to the position of the category lexical unit is extracted as the first predicted feature.

[0058] In this process, a learnable category term is inserted into the beginning of the pruned and fused initial term sequence. The concatenated sequence is then input into the second encoder stage, which consists of four stacked standard Transformer encoder layers. Each Transformer encoder layer contains an 8-head multi-head self-attention module and a feedforward neural network module. The first predicted feature is extracted from the output corresponding to the category term.

[0059] The second encoder stage is a neural network model whose structure consists of four stacked standard Transformer encoder layers. Each Transformer encoder layer contains an 8-head multi-head self-attention module and a feedforward neural network module, and each module is followed by a residual connection and layer normalization. The network input is the lexical unit sequence with the category lexical unit inserted and positional encodings added; it has one row for each lexical unit remaining after pruning and fusion plus one row for the category lexical unit, and each row has the lexical unit dimension. The network output is the first predictive feature, a vector of the lexical unit dimension taken from the output state corresponding to the category lexical unit.

[0060] The fusion sequence consists of the set of important lexical units and all updated fusion target lexical units. For example, the number of lexical units remaining after pruning and fusion is 460. A learnable category lexical unit (CLS) of dimension 256 is created and inserted at the beginning of the fused sequence to form a lexical sequence of length 461. Since the sequence length has changed after lexical fusion and pruning, the positional encoding of the original initial lexical sequence is no longer applicable. Therefore, after the fused sequence is constructed, a positional encoding of the same length as the sequence is regenerated according to the actual order of the lexical units in the fused sequence and added to each lexical representation, so that the model can perceive the order of the lexical units and the positional information received by the second encoder stage is consistent with the compressed sequence structure.
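The shapes involved in this step can be checked with a short NumPy sketch. The text does not state which positional encoding is used; the standard sinusoidal encoding below is an assumption, and the zero-valued category lexical unit and random fused sequence are placeholders for learned values.

```python
import numpy as np

def sinusoidal_positions(length, dim):
    """Sinusoidal positional encoding of shape (length, dim); dim must be even.

    ASSUMPTION: the source does not name the encoding; this is the common choice.
    """
    positions = np.arange(length)[:, None]
    freqs = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    enc = np.zeros((length, dim))
    enc[:, 0::2] = np.sin(positions * freqs)
    enc[:, 1::2] = np.cos(positions * freqs)
    return enc

rng = np.random.default_rng(0)
fused = rng.normal(size=(460, 256))    # placeholder for the pruned-and-fused sequence
cls = np.zeros((1, 256))               # placeholder for the learnable category lexical unit
sequence = np.concatenate([cls, fused], axis=0)   # category unit goes first: (461, 256)
# Regenerate the encoding for the new length instead of reusing the original one.
encoder_input = sequence + sinusoidal_positions(*sequence.shape)
```

The encoding is generated for length 461, the length of the compressed sequence, rather than being sliced from an encoding of the original sequence.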

[0061] The fused sequence, with positional encodings added, is fed into a deep encoder containing four standard Transformer encoder layers. In each encoder layer, the sequence passes through an eight-head multi-head self-attention module, enabling the model to simultaneously attend to information from different representation subspaces. The output of this module, after a residual connection and layer normalization, enters a feedforward neural network module, which, for example, maps 256 dimensions to 1024 dimensions and then back to 256 dimensions to enhance nonlinear representation capability. After the data has passed through all four encoder layers, the 256-dimensional output state of the category word at the beginning of the sequence is extracted as the first predictive feature, representing the high-level abstract pattern of the entire temporal data.

[0062] Then, a Fourier transform is performed on the original time-series monitoring data to extract the global periodic features in the frequency domain, thus obtaining the second prediction feature. Specifically, a one-dimensional fast Fourier transform is performed independently on each channel of the original time-series monitoring data. In the spectrum of each channel, the 10 frequency components with the largest energy amplitude are selected. The amplitude and phase information of the selected frequency components from all channels are then concatenated to obtain the second predictive feature.

[0063] For example, taking original time-series data with three channels of 1000 data points each, a one-dimensional FFT is performed on the 1000-point sequence of each channel to obtain a spectrum containing the complex value at each frequency point. For each frequency point in the spectrum, the energy amplitude is obtained as the modulus of the complex value, and the phase is obtained as its argument.

[0064] The energy amplitudes of all frequency points are sorted in descending order, and the 10 frequency points with the largest amplitudes are selected; these represent the most dominant periodic components of the signal. For these 10 selected frequency components, both the energy amplitude and the phase angle are extracted, giving 20 values in total, which constitute the frequency domain feature vector of that channel. This process is repeated for all three channels, generating a 20-dimensional feature vector for each channel. These three vectors are then concatenated sequentially to form a single 60-dimensional frequency domain feature vector, which is the second prediction feature.
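The frequency domain feature extraction above can be sketched with NumPy. Several details are assumptions rather than statements of the source: the one-sided transform `np.fft.rfft` is used because the input is real-valued, the per-channel layout is ten amplitudes followed by ten phases, and the three test signals are synthetic.

```python
import numpy as np

def frequency_features(signal, top_k=10):
    """Build the frequency domain feature vector.

    `signal` has shape (channels, length); the result has
    channels * 2 * top_k entries (amplitudes then phases per channel).
    """
    features = []
    for channel in signal:
        spectrum = np.fft.rfft(channel)             # one-sided spectrum (assumption)
        amplitude = np.abs(spectrum)                # modulus of each complex value
        phase = np.angle(spectrum)                  # argument of each complex value
        top = np.argsort(amplitude)[::-1][:top_k]   # largest amplitudes first
        features.append(np.concatenate([amplitude[top], phase[top]]))
    return np.concatenate(features)

t = np.arange(1000)
rng = np.random.default_rng(0)
signal = np.stack([
    np.sin(2 * np.pi * 5 * t / 1000),                                   # 5 cycles
    np.sin(2 * np.pi * 20 * t / 1000) + 0.5 * np.sin(2 * np.pi * 50 * t / 1000),
    rng.normal(size=1000),                                              # noise channel
])
second_feature = frequency_features(signal)
```

For three channels and ten components per channel, the result has 3 × 20 = 60 entries, matching the 60-dimensional second prediction feature of this example.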

[0065] Finally, the `torch.cat` function is used to concatenate the first and second predicted features along the feature dimension to obtain a fused feature vector. This fused feature vector is then input into a classifier consisting of several `torch.nn.Linear` layers and a ReLU activation function. This classifier is a non-linear classifier, and the last layer of the classifier uses the Softmax activation function to output the probability of the instrument in each preset state. The `torch.argmax` function selects the category with the highest probability value as the instrument state prediction result, such as leakage, compressor failure, pump failure, etc.

[0066] The classifier is a neural network model whose structure includes an input layer, a fully connected layer, and an output layer. The network input is the fused feature vector, whose dimension is the sum of the first predictive feature dimension and the second predictive feature dimension. The network output is the probability distribution of the instrument over the different states, calculated using the Softmax function. The first predictive feature is obtained from the category token output of the second encoder stage and has a dimension of 256. The second predictive feature is extracted via Fourier transform and has a dimension of 60. The fusion operation concatenates the two feature vectors end to end, forming a 316-dimensional fused feature vector.

[0067] The 316-dimensional fused feature vector is input to the classifier, which is a non-linear classifier with a fully connected layer containing learnable weights and bias terms. The weight matrix has a dimension of 316×3, where 3 corresponds to the three preset states of the instrument: normal, warning, and fault. The linear layer outputs a vector of three raw scores (logits). The Softmax function converts this vector into a probability distribution that sums to 1, for example, [0.05, 0.25, 0.7]. The instrument state prediction result is determined by taking the index of the maximum value in this probability distribution using the argmax operation.
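The final Softmax and argmax step can be illustrated with a small standard-library sketch. The mapping of indices 0, 1, 2 to normal, warning, fault follows the order in which the states are listed and is an assumption; the logits are chosen artificially so that Softmax reproduces the example distribution.

```python
import math

STATES = ("normal", "warning", "fault")   # index order is an assumption

def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict_state(logits):
    """Return the most probable state name together with the distribution."""
    probs = softmax(logits)
    index = max(range(len(probs)), key=probs.__getitem__)   # argmax
    return STATES[index], probs

# Logits chosen so that softmax yields the example distribution [0.05, 0.25, 0.7].
logits = [math.log(0.05), math.log(0.25), math.log(0.7)]
state, probs = predict_state(logits)
```

With these logits the largest probability sits at the third position, so the predicted state under the assumed index order is the fault state.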

[0068] All models containing learnable parameters in this invention, including the one-dimensional convolutional encoder, the first encoder stage, the second encoder stage, the feature dimensionality reduction mapping module, and the classifier, are optimized through supervised training. The specific training process is as follows. First, time-series monitoring data of each channel is collected from the instrument and labeled with three categories: "Normal," "Warning," and "Fault." The data is then divided into training, validation, and test sets in a 7:2:1 ratio. Z-score normalization (mean 0, variance 1) is applied to all data, and missing values are handled. To address the sample imbalance caused by the low proportion of fault and warning samples, a weighted cross-entropy loss is used with class weights of 1:2:3 (Normal:Warning:Fault). The AdamW optimizer is selected, with a base learning rate of 1e-4 and a weight decay of [missing value]. The batch size is 32, the maximum number of training epochs is set to 100, and an early stopping strategy terminates training if the validation set loss does not decrease for 5 consecutive epochs. For regularization, a Dropout layer with a dropout rate of 0.2 is added before the classifier to prevent overfitting. In addition, layer normalization and residual connections are included in the model to improve convergence stability. The evaluation metrics are accuracy, fault class recall, and warning class F1 score on the test set, with accuracy ≥95% and fault class recall ≥90% as the convergence criteria for the model.
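The weighted loss and the early stopping rule can be sketched without any framework. The function names are hypothetical, and normalizing the loss by the sum of the sample weights is an assumption (it matches the mean reduction of `torch.nn.CrossEntropyLoss` when a `weight` argument is given); the probabilities and validation losses are made-up values.

```python
import math

CLASS_WEIGHTS = (1.0, 2.0, 3.0)   # Normal : Warning : Fault

def weighted_cross_entropy(probs_batch, labels, weights=CLASS_WEIGHTS):
    """Weighted mean of -log p[label], normalized by the sum of sample weights."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs_batch, labels))
    return total / sum(weights[y] for y in labels)

def early_stop_epoch(val_losses, patience=5):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = math.inf
    stale = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, stale = loss, 0     # validation loss decreased: reset the counter
        else:
            stale += 1
            if stale >= patience:
                return epoch
    return None

# One Normal sample predicted well and one Fault sample predicted poorly.
loss = weighted_cross_entropy([[0.9, 0.05, 0.05], [0.2, 0.3, 0.5]], [0, 2])
stop = early_stop_epoch([1.0, 0.8, 0.7, 0.71, 0.72, 0.7, 0.73, 0.74, 0.6])
```

In the example, the Fault sample contributes three times the weight of the Normal sample, and training stops at epoch 8, after five consecutive epochs without a decrease below the best validation loss of 0.7.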

[0069] As shown in Figure 2, under normal operating conditions, the original monitored pressure exhibits normal periodic fluctuations and slight noise, and the prediction probability output by the system remains at an extremely low, safe level. At the first dashed line, a micro-leakage occurs in the system; the monitored pressure shows only a slight downward trend, so a traditional threshold alarm is highly likely to miss it. However, because the model of this invention retains the core anomaly indicators through the feature dimensionality reduction mapping in the earlier stage and fuses Transformer features with frequency domain features, the prediction probability output by the system responds quickly and significantly, issuing a timely warning.

[0070] When the time reaches the second dashed line, a pressure surge event occurs in the instrument, causing a sharp drop in the original monitored pressure and a sudden spike in the fault prediction probability output by the system. The above time-series comparison demonstrates that the instrument state prediction method based on multimodal sensor data fusion proposed in this invention can not only effectively extract the temporal dependencies in multi-channel data, but also accurately and quickly capture micro-leakage and sudden fault phenomena of equipment while greatly reducing the computational overhead of sequence processing. This verifies the high sensitivity and high accuracy of this invention in actual industrial monitoring.

[0071] This invention also discloses an instrument condition prediction system based on multimodal sensor data fusion, including a processor and a memory. The memory stores computer program instructions, which, when executed by the processor, implement the instrument condition prediction method based on multimodal sensor data fusion according to this invention.

[0072] The instrument state prediction method and system based on multimodal sensor data fusion described above also include other components well known to those skilled in the art, such as communication buses and communication interfaces. Their settings and functions are known in the art and will not be described in detail here.

Claims

1. An instrument state prediction method based on multimodal sensor data fusion, characterized in that it includes: acquiring raw time-series monitoring data and generating an initial word sequence by a one-dimensional convolutional encoder; inputting the initial word sequence into a first encoder stage to generate a context state representation and an attention weight distribution for each word; linearly transforming and normalizing the context state representations to output predicted importance scores, and dividing the words into an important word set and a redundant word set based on a preset first threshold; in the redundant word set, based on the original temporal order, identifying two words whose context state representations have a cosine similarity higher than a preset second threshold as a target word pair; comparing the entropy values of the attention weight distributions of the two words in the target word pair, and defining the word with the smaller entropy value as the fusion target word and the word with the larger entropy value as the word to be pruned; weighting the context state representation of the word to be pruned and incorporating it into the fusion target word to obtain an updated fusion target word; forming a fusion sequence from the important word set and the updated fusion target word, concatenating the fusion sequence with a category word, and inputting it into a second encoder stage to extract a first prediction feature; performing a frequency domain transformation on the raw time-series monitoring data and extracting frequency domain periodic features as a second prediction feature; and fusing the first prediction feature and the second prediction feature and inputting them into a classifier to output an instrument state prediction result.

2. The instrument state prediction method based on multimodal sensor data fusion according to claim 1, characterized in that acquiring the raw time-series monitoring data and generating the initial word sequence by the one-dimensional convolutional encoder includes: inputting the time-series monitoring data of dimension [dimension missing], one dimension of which is the number of channels, into the one-dimensional convolutional encoder, which consists of two stacked one-dimensional convolutional layers, wherein the stride of the first convolutional layer is 1, the stride of the second convolutional layer is 2, and a normalization layer follows each of the first and second convolutional layers; and outputting the initial word sequence with the output word dimension.

3. The instrument state prediction method based on multimodal sensor data fusion according to claim 1, characterized in that linearly transforming and normalizing the context state representations to output the predicted importance scores includes: inputting the context state representation of each word into a feature dimensionality reduction mapping module, wherein the feature dimensionality reduction mapping module reduces the input dimension through at least one linear transformation layer followed by a nonlinear activation, outputs a scalar value through an output layer, and normalizes the scalar value to between 0 and 1 through a Sigmoid function as the predicted importance score.

4. The instrument state prediction method based on multimodal sensor data fusion according to claim 1, characterized in that identifying two words whose context state representations have a cosine similarity higher than the preset second threshold as a target word pair includes: setting a sliding time window in the redundant word set; traversing the word pairs within the sliding time window; for each word pair, calculating the dot product of the context state representation vectors of the two words and dividing it by the product of the magnitudes of the two vectors to obtain a cosine similarity value; and identifying word pairs whose cosine similarity value is greater than the preset second threshold as target word pairs.

5. The instrument state prediction method based on multimodal sensor data fusion according to claim 1, characterized in that the updated fusion target word satisfies the following relation: [formula missing]; wherein the symbols of the relation denote, respectively, the updated fusion target word, the fusion target word, the word to be pruned, and the gate value, the gate value being calculated by concatenating the fusion target word with the word to be pruned and passing the result through a linear transformation layer and a Sigmoid activation function.

6. The instrument state prediction method based on multimodal sensor data fusion according to claim 1, characterized in that extracting the first prediction feature includes: inserting the category word at the beginning of the fusion sequence; inputting the concatenated fusion sequence into the second encoder stage, which is composed of multiple stacked Transformer encoder layers, each containing a multi-head self-attention module and a feedforward neural network module; and extracting the first prediction feature from the output corresponding to the category word.

7. The instrument state prediction method based on multimodal sensor data fusion according to claim 1, characterized in that the frequency domain transformation is a one-dimensional fast Fourier transform, and extracting the frequency domain periodic features as the second prediction feature includes: independently performing a one-dimensional fast Fourier transform on each channel of the raw time-series monitoring data; in the spectrum of each channel, selecting a preset number of frequency components in descending order of energy amplitude; and concatenating the amplitude and phase information of the selected frequency components from all channels to obtain the second prediction feature.

8. The instrument state prediction method based on multimodal sensor data fusion according to claim 1, characterized in that fusing the first prediction feature and the second prediction feature and inputting them into the classifier to output the instrument state prediction result includes: concatenating the first prediction feature and the second prediction feature to obtain a fused feature vector; inputting the fused feature vector into the classifier and outputting, through a Softmax function, the probability distribution of the instrument over the preset states; and selecting the state with the highest probability value as the instrument state prediction result.

9. The instrument state prediction method based on multimodal sensor data fusion according to claim 1, characterized in that the one-dimensional convolutional encoder, the first encoder stage, the second encoder stage, the feature dimensionality reduction mapping module, and the classifier undergo parameter optimization through supervised training, the supervised training including: obtaining raw time-series monitoring data samples with state labels and performing Z-score normalization on the samples; calculating the network loss using a weighted cross-entropy loss function, in which corresponding class weights are set for the different state labels to account for the differences in the sample proportions of the different instrument states; and updating the model parameters with the network loss through an optimizer until a preset convergence criterion is met.

10. An instrument state prediction system based on multimodal sensor data fusion, characterized in that it includes: a processor and a memory, the memory storing computer program instructions that, when executed by the processor, implement the instrument state prediction method based on multimodal sensor data fusion according to any one of claims 1-9.