Industrial time series data enhancement method and device
By using a conditional generation model based on multimodal semantic information and multi-scale structured state space modeling, the problem of lack of semantic consistency in multimodal data fusion and generated data in industrial scenarios is solved. This enables high-fidelity, controllable time-series data synthesis with online updates and environmental adaptability.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHINA ORDNANCE EQUIP GRP AUTOMATION RES INST CO LTD
- Filing Date
- 2025-06-30
- Publication Date
- 2026-06-12
Figure CN122196357A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of intelligent manufacturing and Industrial Internet of Things (IIoT) technology, and in particular to a multimodal prompt-driven industrial time-series data augmentation method and apparatus based on multi-scale SSSD.
Background Technology
[0002] In the fields of intelligent manufacturing and the Industrial Internet of Things (IIoT), data augmentation is one of the key technologies for equipment health management and for improving the intelligent decision-making capabilities of systems. Existing time-series data augmentation methods address imbalanced data samples through oversampling, synthesis of minority-class samples (such as SMOTE), and Generative Adversarial Networks (GANs). By synthesizing new minority-class samples, these methods mitigate the imbalance, reduce the bias of downstream classification and prediction models towards the dominant class, and improve recognition and prediction accuracy to a certain extent. However, they still face the following key challenges in practical industrial applications. First, industrial field data is often multimodal and heterogeneous, comprising textual data, image data, and multi-channel sensor time-series data. These modalities have inconsistent distributions, different sampling frequencies, and sparse prompts. Traditional time-series generation methods cannot effectively fuse information from different modalities, so the generated data lacks semantic consistency. Second, industrial discrete data typically has varied temporal structures and physical dependencies. For example, equipment failures often follow a gradual degradation process with a clear temporal evolution pattern, while sensor time-series data may show short-term jumps. Generation methods based on GANs or autoregressive models often model temporal structure poorly, making it difficult to generate time-series data with realistic dynamic evolution characteristics. Third, limited computing resources in industrial settings prevent existing methods from meeting the demand for real-time, high-quality data synthesis on edge devices. Finally, most current data augmentation methods are static generation mechanisms that cannot update dynamically in response to environmental changes, prompts, and task feedback. They therefore adapt poorly to the frequently changing operating conditions and task objectives of manufacturing environments, and the relevance and effectiveness of the generated samples cannot be controlled.
[0003] The closest existing technology to this invention is "A Method and System for Generating Industrial Process Time Series Data" (Patent No. CN116894186A), which discloses a method for generating industrial time series data based on a Variational Autoencoder (VAE) and a Generative Adversarial Network (GAN). This technology constructs a VAE-LSTM network as the generator, uses a Siamese network as the discriminator, and introduces a Mode Seeking Regularization (MSR) strategy to enhance the diversity and realism of generated samples, giving it certain technical advantages in improving the quality of industrial time series data generation. However, this existing technology still has the following shortcomings:
1. Lack of modeling capability for multimodal prompt information. The method is a typical unconditional generation strategy that models only sensor numerical data and cannot integrate the multimodal discrete information present in industrial scenarios. As a result, the generated samples are difficult to keep consistent with the actual industrial scenario or its semantic information.
[0004] 2. Lack of a controllable generation mechanism guided by prompts. Because no semantically relevant control vectors are set, the generation process cannot be modulated according to the task objective or key prompts, thus failing to guarantee the semantic relevance and task-specificity of the synthesized data.
[0005] 3. Limited ability to model time dependencies. This method mainly relies on LSTM / Bi-LSTM to model time dependencies, but such networks suffer from problems such as gradient vanishing, limited structural depth, and low inference efficiency in long sequence modeling, making it difficult to effectively model long-term dynamic changes such as equipment wear and aging in manufacturing systems.
[0006] 4. Lack of online learning and adaptive update mechanisms. Existing solutions employ static, offline training methods, which cannot dynamically adjust the model based on real-time task changes and environmental feedback, thus lacking adaptability to time-varying systems in real-world manufacturing scenarios.
Summary of the Invention
[0007] In view of the above problems, the present invention provides an industrial time-series data enhancement method and apparatus for overcoming or at least partially solving the above problems.
[0008] This invention provides the following solution: an industrial time-series data augmentation method, including:
acquiring text data, image data, industrial discrete data, and label information of classification features in industrial scenarios;
obtaining a fused prompt vector by applying a prompt condition generation model that integrates multimodal semantic information to the text data, the image data, the industrial discrete data, and the label information of the classification features;
embedding the time step using the modulation parameter generation module and combining it with the fused prompt vector to generate modulation parameters for modulating the internal features of the network;
sampling multivariate Gaussian noise of the same dimension as the target data from a normal distribution;
inputting the multivariate Gaussian noise and the modulation parameters into the backbone network for reverse diffusion to obtain the noise prediction at the current time step, wherein the backbone network includes a conditional diffusion model based on multi-scale structured state space modeling, which replaces the convolutional modules in the U-Net structure of the diffusion model with SSM variant modules to construct the SSSD backbone model and uses SSM modules of two scales to perform short-term and long-term modeling of the industrial time series data;
calculating the time series of the previous time step from the noise predicted at the current time step; and
repeating the above until time step t=0 to obtain the final generated target sequence that matches the input label.
[0009] Preferably, the multimodal semantic information prompt condition generation model includes a text prompt encoding module, an image prompt encoding module, a numerical prompt encoding module, a label prompt encoder module, and a modality fusion module. The text prompt encoding module is used to encode text data in industrial scenarios to obtain the text prompt embedding vector. The image prompt encoding module is used to extract semantic features from the input image to obtain the image prompt embedding vector. The numerical prompt encoding module is used to transform the multiple manually entered discrete prompts in the industrial discrete data into structured vector representations and to generate, through weighted fusion, a unified numerical prompt embedding vector that serves as a conditional input guiding the time series generation or prediction model. The label prompt encoder module is used to produce a structured representation of label information with discrete classification features in industrial scenarios to obtain the label prompt embedding vector. The modality fusion module takes the text prompt embedding vector, the image prompt embedding vector, the numerical prompt embedding vector, and the label prompt embedding vector as inputs and fuses them into a single fused prompt vector that guides the downstream time-series data generation model.
[0010] Preferably, the text prompt encoding module is used to perform the following operations: The input text data is segmented and embedded. WordPiece word segmentation technology is used to map each word or subword into a fixed-dimensional word vector. The text prompt embedding vector is obtained by extracting semantic information from the processed text using a pre-trained Transformer model through a multi-layer self-attention mechanism.
[0011] Preferably, each layer of the Transformer model is calculated using the following formula:
[0012] $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$
[0013] In the formula, $Q$, $K$, and $V$ represent the query, key, and value matrices, respectively; $d_k$ is the dimension of the key; and $\mathrm{Attention}(Q, K, V)$ is the weighted value obtained through the calculation, which represents the semantic information of each word.
[0014] Preferably, the image prompt encoding module is used to perform the following operations: a pre-trained lightweight convolutional neural network, ResNet-18, is used as the backbone network; the discrete image data in the industrial discrete dataset is resized to a standard size and pixel-normalized so that the pixel values lie between 0 and 1; the images are then input into the ResNet-18 network to extract semantic features, and a fixed-dimensional semantic vector is output through pooling layers and fully connected layers and used as the image prompt embedding vector.
[0015] Preferably, the numerical prompt encoding module is used to perform the following operations: construct, for each type of discrete numerical value, the embedding matrix, the index corresponding to the input field value, and the embedding vector representation; calculate the weight corresponding to each embedding vector; and multiply each discrete embedding vector by its weight coefficient and sum the results to obtain a unified discrete prompt vector, which is used as the numerical prompt embedding vector.
[0016] Preferably, the label prompt encoder module is used to perform the following operations: map the original labels to fixed integer indices; and construct an embedding matrix for the label set, from which the label prompt embedding vector is calculated.
[0017] Preferably, the modality fusion module is used to perform the following operations: receive the text prompt embedding vector, the image prompt embedding vector, the numerical prompt embedding vector, and the label prompt embedding vector through the input interface; calculate the weight coefficients using the weight calculation unit; and obtain the fused prompt vector as the weighted sum of the modality embedding vectors according to the generated weights.
[0018] Preferably, the modulation parameter generation module is used to perform the following operations: take the time step and the fused prompt vector as inputs and convert the time step into a fixed-dimensional time vector through sinusoidal position encoding; concatenate the time vector and the fused prompt vector to obtain a 2d-dimensional joint vector; and obtain the modulation parameters, which include scaling factors and translation factors, through a multilayer perceptron structure.
[0019] An industrial time-series data augmentation apparatus is provided for performing the aforementioned industrial time-series data augmentation method, wherein the method is applied to a high aspect ratio warhead assembly system, and the apparatus comprises:
a data acquisition unit, used to acquire text data, image data, industrial discrete data, and label information of classification features in industrial scenarios;
a front-end data processing unit, used to obtain a fused prompt vector by applying a prompt condition generation model that integrates multimodal semantic information to the text data, the image data, the industrial discrete data, and the label information of the discrete classification features;
a modulation parameter acquisition unit, used to embed the time step using the modulation parameter generation module and combine it with the fused prompt vector to generate modulation parameters for modulating the internal features of the network;
a noise acquisition unit, used to sample multivariate Gaussian noise of the same dimension as the target data from a normal distribution;
a noise prediction unit, used to input the multivariate Gaussian noise and the modulation parameters into the backbone network for reverse diffusion to obtain the noise prediction at the current time step, wherein the backbone network includes a conditional diffusion model based on multi-scale structured state space modeling, which replaces the convolutional modules in the U-Net structure of the diffusion model with SSM variant modules to construct the SSSD backbone model and uses SSM variant modules of two scales to perform short-term and long-term modeling of the industrial time series data;
a previous time-series sequence unit, used to calculate the time series of the previous time step from the noise predicted at the current time step; and
a target sequence generation unit, used to repeat the above until time step t=0 to obtain the final generated target sequence that is consistent with the input label.
[0020] According to specific embodiments provided by the present invention, the present invention discloses the following technical effects: This application provides an industrial time-series data augmentation method and apparatus that combines the long-term dependency modeling capability of the Structured State-Space Model (SSM) with the high-quality generation capability of the diffusion model. It enables high-fidelity time-series data synthesis guided by multimodal conditions in complex industrial scenarios, and its multi-scale structure accounts for both the short-term abrupt changes and the long-term trends of industrial discrete data. By introducing a prompt encoder and a prompt weight adaptive module, discrete prompt information from different modalities such as images, text, and labels can be embedded into the generation process. Furthermore, by combining expert experience with data-driven strategies, the guiding strength of each modality on the generated results is dynamically modulated, thereby improving the semantic consistency and fault feature saliency of the generated data. In addition, the model supports online sampling and dynamic update mechanisms, enabling real-time optimization of the prompt fusion method and generation strategy based on feedback from downstream tasks (such as fault classification accuracy and prediction stability). This achieves a closed-loop augmentation process of "discrete prompt - conditional modeling - continuous generation" with strong robustness and environmental adaptability. Of course, a product implementing this invention does not necessarily need to achieve all of the advantages described above at the same time.
Attached Figure Description
[0021] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the embodiments will be briefly described below. Obviously, the drawings described below are merely some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without any creative effort.
[0022] Figure 1 is a flowchart of an industrial time-series data augmentation method provided in an embodiment of the present invention;
Figure 2 is a schematic diagram of the prompt condition generation model provided in an embodiment of the present invention;
Figure 3 is a structural diagram of a multilayer perceptron provided in an embodiment of the present invention;
Figure 4 is a schematic diagram of the multi-scale conditional diffusion model provided in an embodiment of the present invention;
Figure 5 is a schematic diagram of the multi-scale S4 module provided in an embodiment of the present invention;
Figure 6 shows the multimodal prompt-driven industrial time-series data generation process based on multi-scale SSSD provided in an embodiment of the present invention;
Figure 7 is a schematic diagram of an industrial time-series data enhancement device provided in an embodiment of the present invention;
Figure 8 is a schematic diagram of an industrial time-series data enhancement device provided in an embodiment of the present invention.
Detailed Implementation
[0023] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, and not all of them. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention are within the scope of protection of the present invention.
[0024] Referring to Figure 1, this invention provides an industrial time-series data augmentation method. As shown in Figure 1, the method may include:
S101: Acquire text data, image data, industrial discrete data, and label information of classification features in industrial scenarios;
S102: Obtain a fused prompt vector by applying a prompt condition generation model that integrates multimodal semantic information to the text data, the image data, the industrial discrete data, and the label information of the classification features. The multimodal semantic information prompt condition generation model includes a text prompt encoding module, an image prompt encoding module, a numerical prompt encoding module, a label prompt encoder module, and a modality fusion module. The text prompt encoding module is used to encode text data in industrial scenarios to obtain the text prompt embedding vector. The image prompt encoding module is used to extract semantic features from the input image to obtain the image prompt embedding vector. The numerical prompt encoding module is used to transform the multiple manually entered discrete prompts in the industrial discrete data into structured vector representations and to generate, through weighted fusion, a unified numerical prompt embedding vector that serves as a conditional input guiding the time series generation or prediction model. The label prompt encoder module is used to produce a structured representation of label information with discrete classification features in industrial scenarios to obtain the label prompt embedding vector. The modality fusion module takes the text prompt embedding vector, the image prompt embedding vector, the numerical prompt embedding vector, and the label prompt embedding vector as inputs and fuses them into a single fused prompt vector that guides the downstream time-series data generation model.
[0025] The text prompt encoding module is used to perform the following operations: The input text data is segmented and embedded. WordPiece word segmentation technology is used to map each word or subword into a fixed-dimensional word vector. The text prompt embedding vector is obtained by extracting semantic information from the processed text using a pre-trained Transformer model through a multi-layer self-attention mechanism.
[0026] Each layer of the Transformer model is calculated using the following formula:
[0027] $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$
[0028] In the formula, $Q$, $K$, and $V$ represent the query, key, and value matrices, respectively; $d_k$ is the dimension of the key; and $\mathrm{Attention}(Q, K, V)$ is the weighted value obtained through the calculation, which represents the semantic information of each word.
[0029] The image prompt encoding module is used to perform the following operations: a pre-trained lightweight convolutional neural network, ResNet-18, is used as the backbone network; the discrete image data in the industrial discrete dataset is resized to a standard size and pixel-normalized so that the pixel values lie between 0 and 1; the images are then input into the ResNet-18 network to extract semantic features, and a fixed-dimensional semantic vector is output through pooling layers and fully connected layers and used as the image prompt embedding vector.
[0030] The numerical prompt encoding module is used to perform the following operations: construct, for each type of discrete numerical value, the embedding matrix, the index corresponding to the input field value, and the embedding vector representation; calculate the weight corresponding to each embedding vector; and multiply each discrete embedding vector by its weight coefficient and sum the results to obtain a unified discrete prompt vector, which is used as the numerical prompt embedding vector.
[0031] The label prompt encoder module is used to perform the following operations: map the original labels to fixed integer indices; and construct an embedding matrix for the label set, from which the label prompt embedding vector is calculated.
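The two label operations above (index mapping, then embedding lookup) can be sketched as follows. This is a minimal illustration only: the label names, the embedding dimension `d = 16`, and the random initialization of `E_label` are assumptions, and in practice the embedding matrix would be trained jointly with the generation model.

```python
import numpy as np

# Hypothetical label set; the real labels come from the industrial scenario.
labels = ["normal", "bearing_wear", "overheating"]

# Step 1: map each original label to a fixed integer index.
label_to_index = {name: i for i, name in enumerate(labels)}

# Step 2: embedding matrix with one row per label (C x d), randomly initialized here.
d = 16  # assumed embedding dimension
E_label = np.random.default_rng(0).normal(scale=0.1, size=(len(labels), d))

def encode_label(name):
    """Return the label prompt embedding vector for one label name."""
    return E_label[label_to_index[name]]

e_label = encode_label("bearing_wear")
```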
[0032] The modality fusion module is used to perform the following operations: receive the text prompt embedding vector, the image prompt embedding vector, the numerical prompt embedding vector, and the label prompt embedding vector through the input interface; calculate the weight coefficients using the weight calculation unit; and obtain the fused prompt vector as the weighted sum of the modality embedding vectors according to the generated weights.
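A minimal sketch of the weighted fusion step is given below. The internals of the weight calculation unit are not specified at this point in the text, so the softmax over per-modality scores used here is an assumption, as are the function name `fuse_prompts`, the modality names, and the example scores standing in for expert-initialized priors.

```python
import numpy as np

def fuse_prompts(embeddings, logits):
    """Weighted sum of per-modality prompt embeddings.

    embeddings: dict mapping modality name -> (d,) vector
    logits:     dict mapping modality name -> float score
    The softmax over scores is an assumed stand-in for the weight calculation unit.
    """
    names = list(embeddings)
    z = np.array([logits[n] for n in names], dtype=float)
    w = np.exp(z - z.max())
    w /= w.sum()  # weights are positive and sum to 1
    fused = np.sum([wi * embeddings[n] for wi, n in zip(w, names)], axis=0)
    return fused, dict(zip(names, w))

rng = np.random.default_rng(0)
d = 16  # assumed shared embedding dimension
embeddings = {m: rng.normal(size=d) for m in ("text", "image", "numeric", "label")}
logits = {"text": 0.5, "image": 0.0, "numeric": 0.0, "label": 1.0}  # hypothetical priors
c, weights = fuse_prompts(embeddings, logits)
```

With equal scores the fused vector reduces to the plain average of the four embeddings, which is a convenient sanity check when the weights are later made learnable.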
[0033] S103: The time step is embedded using the modulation parameter generation module and combined with the fused cue vector to generate modulation parameters for modulating the internal features of the network; the modulation parameter generation module is used to perform the following operations: The time step and fusion cue vector are used as inputs, and the time step is converted into a fixed-dimensional time vector through sinusoidal position encoding; The time vector and the fused prompt vector are concatenated to obtain a 2d-dimensional joint vector; Modulation parameters are obtained through a multilayer perceptron structure, and the modulation parameters include scaling factors and translation factors.
[0034] S104: Sample multivariate Gaussian noise of the same dimension as the target data from a normal distribution;
S105: Input the multivariate Gaussian noise and the modulation parameters into the backbone network for reverse diffusion to obtain the noise prediction at the current time step; the backbone network includes a conditional diffusion model based on multi-scale structured state space modeling, which replaces the convolutional modules in the U-Net structure of the diffusion model with SSM combined variant modules to construct an SSSD backbone diffusion model and uses SSM combined variant modules of two scales to perform short-term and long-term modeling of the industrial time series data;
S106: Calculate the time series of the previous time step from the noise predicted at the current time step;
S107: Repeat the above until time step t=0 to obtain the final generated target sequence that matches the input label.
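Steps S104 to S107 follow the shape of a standard denoising-diffusion sampling loop, which can be sketched as below. This is an illustration under stated assumptions rather than the patented implementation: the linear noise schedule, the number of steps `T = 50`, the DDPM-style update rule, and the names `make_schedule`, `sample`, and `dummy_predictor` are all hypothetical, and the trained SSSD backbone (with its modulation parameters) is replaced by a trivial stand-in function.

```python
import numpy as np

def make_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule; returns betas, alphas, cumulative alphas."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    return betas, alphas, np.cumprod(alphas)

def sample(predict_noise, shape, T=50, seed=0):
    rng = np.random.default_rng(seed)
    betas, alphas, alpha_bars = make_schedule(T)
    x = rng.standard_normal(shape)          # S104: Gaussian noise, same shape as target
    for t in range(T - 1, -1, -1):          # S107: repeat down to t = 0
        eps = predict_noise(x, t)           # S105: noise predicted at the current step
        # S106: series at the previous step computed from the predicted noise
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x

def dummy_predictor(x, t):
    """Hypothetical stand-in for the conditioned SSSD backbone."""
    return 0.1 * x

series = sample(dummy_predictor, shape=(4, 128))  # 4 channels, 128 time points
```

In a real system `predict_noise` would close over the fused prompt vector so that the same loop yields label-consistent sequences.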
[0035] The method provided in this application aims to address the problems of scarce fault data, modal heterogeneity, and unbalanced data distribution during manufacturing, testing, and maintenance. The method uses discrete modalities that are widely available in industrial settings, such as images, text, and labels, as semantic prompt inputs. It uses a multi-scale SSSD diffusion model to model the dynamic features of time series in a structured state space, enabling the generation of continuous multi-source time series data guided by prompt conditions. The model achieves semantic fusion of multimodal prompt information and adaptive learning of its key content through a prompt encoder and a dynamic weight module. Combined with the reverse-diffusion modeling capability of multi-scale SSSD, it synthesizes high-confidence samples while keeping the physical structure of the time series plausible. It also supports initializing the prompt modality weights from expert knowledge and adjusting them through data-driven optimization, thereby improving the semantic consistency between the generated data and the actual state of industrial equipment. Compared with traditional data augmentation, this invention has several advantages: it models temporal structure and semantic conditions simultaneously, avoiding random interpolation and unreasonable anomalous samples; it supports a prompt generation mechanism that fuses multimodal heterogeneous information, improving the realism of simulated equipment state data; it uses a multi-scale SSSD model to model the diffusion process in the state space, improving the temporal coherence and industrial interpretability of the generated data; and it combines online updating with adaptive learning mechanisms, giving it good real-time performance and scalability.
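The idea of modeling short-term and long-term behavior at two scales can be illustrated with a toy diagonal state space model. This sketch is not an S4 or SSSD layer: the diagonal poles, the uniform output weights, the two step sizes, and the names `diagonal_ssm` and `two_scale_ssm` are assumptions chosen only to show how a large discretization step gives a short memory and a small step gives a long memory.

```python
import numpy as np

def diagonal_ssm(u, a, b, c, dt):
    """Run x' = a*x + b*u, y = c.x with diagonal a, discretized by zero-order hold."""
    a_bar = np.exp(dt * a)            # discrete state transition
    b_bar = (a_bar - 1.0) / a * b     # discrete input matrix for diagonal a
    x = np.zeros_like(a)
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        x = a_bar * x + b_bar * u_k
        y[k] = c @ x
    return y

def two_scale_ssm(u, n_state=8, dt_short=1.0, dt_long=0.01):
    """Sum of a fast-forgetting branch and a slow-forgetting branch."""
    a = -np.arange(1, n_state + 1, dtype=float)   # stable (negative) poles
    b = np.ones(n_state)
    c = np.ones(n_state) / n_state
    y_short = diagonal_ssm(u, a, b, c, dt_short)  # captures abrupt, local changes
    y_long = diagonal_ssm(u, a, b, c, dt_long)    # retains slow trends
    return y_short + y_long, y_short, y_long

impulse = np.zeros(64)
impulse[0] = 1.0
y, y_short, y_long = two_scale_ssm(impulse)
```

Feeding an impulse shows the difference in memory: the short-scale response has essentially vanished after 20 steps, while the long-scale response is still a sizeable fraction of its initial value.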
[0036] This method provides a new paradigm with practical value and cutting-edge innovation for industrial artificial intelligence in extreme condition simulation, small-sample training, equipment health prediction, and intelligent diagnosis and optimization. It can be widely applied in typical industrial IoT environments such as high-end manufacturing, smart factories, predictive maintenance of industrial equipment, aerospace, wind power operation and maintenance, semiconductor manufacturing, chemical production lines, and rail transportation. It drives the intelligent evolution of traditional manufacturing systems towards data-driven and model-generated collaborative optimization.
[0037] The industrial time-series data augmentation method provided in this application will be described in detail below.
[0038] This invention provides a multimodal cue-driven industrial time-series data augmentation method based on multi-scale SSSD. With a structured state-space diffusion model as the core, it introduces multiple modal cue information such as images, text, and labels during the generation process. Combined with cue encoding and dynamic weight adjustment mechanism, it realizes semantically controllable time-series data generation and improves the quality of synthetic data in small-sample industrial scenarios.
[0039] This method employs a prompt condition generation module that integrates multimodal semantic information as a pre-module of the conditional diffusion generation model, as shown in Figure 2. By constructing a conditional vector that integrates image, text, discrete numerical, and label information, it achieves controllable generation with strong task relevance. This module mainly includes six structural units: a text prompt encoding module, an image prompt encoding module, a numerical prompt encoding module, a label prompt encoding unit, a modality fusion module, and a modulation parameter generation unit. Specifically:
1. Text prompt encoding module.
[0040] The text prompt encoding module uses the pre-trained language model BERT to encode text data from industrial scenarios. The input text data primarily originates from industrial equipment operation logs, fault diagnosis reports, operating condition manuals, alarm information, etc. The text typically contains key information such as equipment operating status, fault type, and alarm content, which is crucial for generating time-series data related to equipment faults. First, the input text data undergoes word segmentation and embedding: WordPiece sub-word segmentation maps each word or sub-word to a fixed-dimensional word vector. Next, the processed text is input into a pre-trained Transformer model, which extracts semantic information through a multi-layer self-attention mechanism. Specifically, each layer of the Transformer model is calculated using formulas (1.1) and (1.2), which together give the scaled dot-product attention:
$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$
where $Q$, $K$, and $V$ represent the query, key, and value matrices, respectively, $d_k$ is the dimension of the key, and $\mathrm{Attention}(Q, K, V)$ is the weighted value obtained through the calculation, which represents the semantic information of each word.
[0041] After multiple Transformer encoding layers, the output is a contextual representation of each word. To obtain a fixed-dimensional text vector, average pooling is applied to the outputs. The pooled representation is as follows:
$e_{\text{text}} = \frac{1}{L}\sum_{i=1}^{L} h_i$ (1.3)
where $h_i$ is the $i$-th output of the layer, $L$ is the maximum length of the text, and the pooled text embedding vector $e_{\text{text}}$ has dimension $d$. The resulting text prompt vector $e_{\text{text}}$ is used as a conditional input for the subsequent generative model, providing semantic guidance on information such as equipment operating status and fault type.
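The attention and pooling computations described for the text prompt encoder can be sketched in a few lines. This is a single-head toy example, not BERT: the random projection matrices `W_q`, `W_k`, `W_v`, the sequence length of 6 tokens, and the dimension of 8 are assumptions, and a real encoder would use pre-trained weights and WordPiece token embeddings.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # stabilize the exponentials
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, returning the outputs and the attention weights."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return weights @ V, weights

def mean_pool(H):
    """Average the per-token outputs into one fixed-dimensional vector."""
    return H.mean(axis=0)

rng = np.random.default_rng(0)
L, d = 6, 8                              # hypothetical: 6 tokens, 8-dimensional vectors
X = rng.normal(size=(L, d))              # stand-in for WordPiece token embeddings
W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
H, attn = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
e_text = mean_pool(H)                    # text prompt embedding vector
```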
[0042] 2. Image prompt encoding module.
[0043] The goal of the image prompt encoding module is to extract semantic features from input images (such as thermal images, equipment screenshots, etc.). These images typically contain important information related to equipment operating status, fault type, or changes in operating conditions. Specifically, a pre-trained lightweight convolutional neural network, ResNet-18, is used as the backbone network. First, the discrete image data in the industrial discrete dataset is resized to a fixed size (224×224) so that all input images have a consistent size before entering the network. Then, pixel normalization is performed, scaling the pixel values of the images to between 0 and 1 to reduce data bias and keep the model training process stable. Finally, the images are input into the ResNet-18 network for semantic feature extraction, and a fixed-dimensional semantic vector $e_{\text{img}}$ is output through pooling layers and fully connected layers.
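The two preprocessing steps (resizing to 224×224 and scaling pixel values to between 0 and 1) can be sketched as follows. Only the preprocessing is shown; the ResNet-18 feature extractor is omitted. The nearest-neighbour resizing, the function name `preprocess_image`, and the random 480×640 input standing in for a thermal image are assumptions made to keep the example dependency-free.

```python
import numpy as np

def preprocess_image(img_uint8, size=224):
    """Resize an (H, W, C) uint8 image to (size, size, C) and scale it to [0, 1]."""
    h, w = img_uint8.shape[:2]
    rows = np.arange(size) * h // size   # nearest-neighbour source rows
    cols = np.arange(size) * w // size   # nearest-neighbour source columns
    resized = img_uint8[rows][:, cols]
    return resized.astype(np.float32) / 255.0

# Hypothetical input: a random 480x640 RGB array in place of a real thermal image.
thermal = np.random.default_rng(0).integers(0, 256, size=(480, 640, 3), dtype=np.uint8)
x = preprocess_image(thermal)
```

A production pipeline would more likely use an interpolating resize from an imaging library; nearest-neighbour sampling is used here only for brevity.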
[0044] 3. Numerical prompt encoding module.
[0045] The numerical prompt encoding module transforms the multiple manually entered discrete prompts in the industrial discrete data into structured vector representations. It then generates a unified prompt vector through weighted fusion, which serves as a conditional input guiding the time series generation or prediction model. This module handles a variety of unstructured discrete fields (such as material codes, equipment models, operator IDs, process numbers, etc.), improving the model's responsiveness and task adaptability to structured control information. Specifically, it first constructs an embedding matrix for each type of discrete numerical value, in the form $E_j \in \mathbb{R}^{N_j \times d}$, where $j$ indicates the $j$-th field, $N_j$ is the total number of categories in this field, and $d$ is the embedding dimension. For an input field value $x_j$ with corresponding index $i_j$, the embedding vector is $e_j = E_j[i_j]$. Then, the weight corresponding to each embedding vector is calculated. This sub-unit takes the concatenation of all embedding vectors as input and outputs the weight coefficient of each field:
$\alpha = \mathrm{softmax}\left(W[e_1; e_2; \dots; e_M] + b\right)$ (1.4)
where $W$ and $b$ are learnable parameters, $M$ is the number of fields, and $\alpha = (\alpha_1, \dots, \alpha_M)$ satisfies $\sum_{j=1}^{M} \alpha_j = 1$. Finally, each discrete embedding vector is multiplied by its weight coefficient and the results are summed to obtain a unified discrete prompt vector:
[0046] $e_{\text{num}} = \sum_{j=1}^{M} \alpha_j e_j$ (1.5)
This prompt vector also serves as a guiding condition for the downstream generative model, and its weight update mechanism can be interfaced with the dynamic weight adjustment module described below.
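The per-field embedding lookup and weighted fusion can be sketched as below. The class name `NumericPromptEncoder`, the three field vocabularies, the dimension `d = 16`, the random initialization, and the use of a softmax to normalize the weights are assumptions for illustration; in practice the tables, `W`, and `b` would be learned.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

class NumericPromptEncoder:
    """One embedding table per discrete field, fused by learned (here random) weights."""

    def __init__(self, vocab_sizes, d, seed=0):
        rng = np.random.default_rng(seed)
        # One (N_j x d) embedding matrix per field.
        self.tables = [rng.normal(scale=0.1, size=(n, d)) for n in vocab_sizes]
        m = len(vocab_sizes)
        self.W = rng.normal(scale=0.1, size=(m, m * d))  # maps concatenation -> M scores
        self.b = np.zeros(m)

    def __call__(self, indices):
        E = np.stack([tbl[i] for tbl, i in zip(self.tables, indices)])  # (M, d)
        alpha = softmax(self.W @ E.reshape(-1) + self.b)  # field weights, sum to 1
        return alpha @ E, alpha                           # weighted sum of embeddings

# Hypothetical fields: material code, equipment model, operator ID.
enc = NumericPromptEncoder(vocab_sizes=[50, 12, 200], d=16)
e_num, alpha = enc([7, 3, 41])
```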
[0047] 4. Label prompt encoding unit.
[0048] The label prompt encoding unit is used to structure and represent label information with discrete classification features in industrial scenarios (such as equipment health status labels, process stage identifiers, fault type codes, etc.) so that label prompts can act as a conditional control in time series data generation tasks. This unit constructs a trainable embedding mapping structure to transform symbolic, discontinuous discrete labels into dense, low-dimensional vector representations, which are then output to the prompt fusion module. Specifically, the original labels are first mapped to fixed integer indices, represented as $y \in \{1, 2, \dots, C\}$, where $C$ represents the number of categories of all possible labels. An embedding matrix of the label set, $E_{\mathrm{label}} \in \mathbb{R}^{C \times d}$, is then constructed, where $d$ is the embedding dimension. For label index $y$, the corresponding label prompt embedding vector is:

$$p_{\mathrm{label}} = E_{\mathrm{label}}[y] \quad (1.6)$$

5. Modal fusion module.
[0049] The modality fusion module takes the aforementioned text cue embedding vectors, image cue embedding vectors, numerical cue embedding vectors, and label cue embedding vectors as input, and performs unified fusion to form a conditional vector to guide the downstream time-series data generation model. This unit adopts a weighted fusion strategy and automatically learns and optimizes the importance of each modality through a dynamic weight adjustment mechanism, thereby improving the task adaptability and semantic control capability of the fused cue information under different spatiotemporal conditions.
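The weighted fusion of the four modality vectors can be sketched as follows. This is a minimal illustration assuming that the modality weights are a softmax over learnable logits and that the logits are initialized from expert-provided prior weights; the function name `fuse_prompts`, the prior values, and the random embeddings are hypothetical.

```python
import numpy as np

def fuse_prompts(prompts, logits):
    """Weighted sum of same-dimension modality vectors with softmax weights."""
    names = sorted(prompts)
    z = np.array([logits[n] for n in names], dtype=float)
    w = np.exp(z - z.max())
    w /= w.sum()
    fused = sum(wi * prompts[n] for wi, n in zip(w, names))
    return fused, dict(zip(names, w))

rng = np.random.default_rng(0)
d = 16
prompts = {m: rng.standard_normal(d) for m in ("text", "image", "numeric", "label")}

# Hypothetical expert prior; taking logs makes the softmax reproduce it exactly.
prior = {"text": 0.4, "image": 0.3, "numeric": 0.2, "label": 0.1}
logits = {m: np.log(p) for m, p in prior.items()}

p_fuse, weights = fuse_prompts(prompts, logits)
```

Training would then adjust the logits by gradient descent, so the weights drift away from the expert prior only as far as the data supports.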
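[0050] Specifically, the text, image, numerical, and label prompt vectors $p_{\mathrm{text}}$, $p_{\mathrm{img}}$, $p_{\mathrm{num}}$, and $p_{\mathrm{label}}$ are obtained through the input interface, and the weight calculation unit calculates the weight coefficients $w_m$ (the detailed implementation of the weight calculation unit is given below). Finally, the modality embedding vectors are weighted by the generated coefficients and summed to yield the fused cue vector:

$$p_{\mathrm{fuse}} = \sum_{m} w_m\, p_m, \quad m \in \{\mathrm{text}, \mathrm{img}, \mathrm{num}, \mathrm{label}\} \quad (1.7)$$

6. Modulation parameter generation unit.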
[0051] To enable the model to apply conditional modulation at different diffusion steps (i.e., noise levels), the time step is embedded and combined with the fused multimodal cue information to generate control parameters for modulating the network's internal features. Specifically, the time step $t$ and the fused cue vector $p_{\mathrm{fuse}}$ are taken as input, and the time step is converted into a fixed-dimensional time vector using sinusoidal position encoding:

$$\mathrm{TE}(t)_{2i} = \sin\left(\frac{t}{10000^{2i/d}}\right), \quad \mathrm{TE}(t)_{2i+1} = \cos\left(\frac{t}{10000^{2i/d}}\right) \quad (1.8)$$

where $i$ is the vector dimension index. The time vector and the fused cue vector are then concatenated to obtain a $2d$-dimensional joint vector:

$$z_t = [\mathrm{TE}(t);\ p_{\mathrm{fuse}}] \in \mathbb{R}^{2d} \quad (1.9)$$

Finally, the modulation parameters $(\gamma, \beta) = \mathrm{MLP}(z_t)$ are obtained through a multilayer perceptron (MLP), where $\gamma$ is the scaling factor and $\beta$ is the translation factor. The multilayer perceptron structure is shown in Figure 3.
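The path from time step and fused cue vector to modulation parameters can be sketched as below. This is a minimal illustration assuming the standard Transformer-style sinusoidal encoding with base 10000, a two-layer MLP with ReLU, and one scale and one shift value per feature channel; the layer sizes and random weights are hypothetical.

```python
import numpy as np

def sinusoidal_embedding(t, d):
    """Sinusoidal encoding of a scalar time step into d dimensions (d even)."""
    i = np.arange(d // 2)
    freqs = 1.0 / (10000.0 ** (2 * i / d))
    emb = np.empty(d)
    emb[0::2] = np.sin(t * freqs)
    emb[1::2] = np.cos(t * freqs)
    return emb

def modulation_params(t, p_fuse, W1, b1, W2, b2):
    """Concatenate time and cue vectors, then map them to (scale, shift)."""
    d = p_fuse.shape[0]
    joint = np.concatenate([sinusoidal_embedding(t, d), p_fuse])  # 2d dimensions
    hidden = np.maximum(0.0, W1 @ joint + b1)  # ReLU is an assumption
    out = W2 @ hidden + b2
    gamma, beta = np.split(out, 2)  # first half scales, second half shifts
    return gamma, beta

rng = np.random.default_rng(0)
d, hidden_dim, channels = 8, 32, 4
W1 = rng.normal(0.0, 0.1, size=(hidden_dim, 2 * d))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(0.0, 0.1, size=(2 * channels, hidden_dim))
b2 = np.zeros(2 * channels)

gamma, beta = modulation_params(t=10, p_fuse=rng.standard_normal(d),
                                W1=W1, b1=b1, W2=W2, b2=b2)
```

Since the time step enters through the same MLP as the cue vector, the strength of the conditioning can differ between early (noisy) and late (nearly clean) diffusion steps.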
[0052] This invention provides a conditional diffusion model based on multi-scale structured state space modeling, the structure of which is shown in Figure 4. The model builds on the SSSD conditional diffusion model and employs a multi-scale structural modeling approach, because a single SSSD model has difficulty modeling both the short-term abrupt changes and the long-term trends of industrial time-series data at the same time. The SSSD model combines structured state space modeling (SSM) with denoising diffusion probabilistic modeling (DDPM); its key feature is that the convolutional modules of the traditional U-Net structure are replaced with S4 modules, a structured variant of the SSM. This invention uses S4 modules at two different scales for short-term and long-term modeling of industrial time-series data. Specifically, the noisy time series sample $x_t$ is obtained by superimposing noise on the original time series sample $x_0$:

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon \quad (2.1)$$

where $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian noise and $\bar{\alpha}_t$ is the cumulative noise-schedule coefficient at diffusion step $t$. The noisy time series sample is then input into S4 modules at two different scales. The state-space model of the S4 module is represented as:

$$h'(t) = A h(t) + B u(t), \quad y(t) = C h(t) + D u(t) \quad (2.2)$$

where $u(t)$ is the input sequence, $y(t)$ is the output sequence, and $h(t)$ is the hidden state. Short-term modeling is represented as $y_s = \mathrm{S4}_s(x_t)$. When the sequence is input into the long-term modeling unit, it is first downsampled, which reduces computational resource consumption and decouples long-term modeling from local noise; the output is then upsampled to restore a uniform dimension, giving $y_l = \mathrm{Up}(\mathrm{S4}_l(\mathrm{Down}(x_t)))$. The multi-scale outputs are then fused by weighting:

$$y = \lambda_s\, y_s + \lambda_l\, y_l \quad (2.3)$$

where $\lambda_s$ and $\lambda_l$ are learnable weight parameters. The structure is shown in Figure 5.
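The noising step and the two-scale modeling path can be sketched as follows. This is a simplified illustration, not an S4 implementation: a plain diagonal linear recurrence stands in for each S4 module, downsampling is average pooling, upsampling is nearest-neighbour repetition, and the decay values, fusion weights, and function names are hypothetical.

```python
import numpy as np

def add_noise(x0, alpha_bar_t, rng):
    """Forward diffusion: mix the clean sample with Gaussian noise."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * eps, eps

def ssm_scan(u, a, b, c):
    """Discrete diagonal state-space recurrence: h = a*h + b*u_k, y_k = c.h."""
    h = np.zeros_like(a)
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        h = a * h + b * u_k
        y[k] = c @ h
    return y

def multiscale_ssm(x, short, long, w_short, w_long, factor=4):
    """Fuse a full-resolution path with a downsampled long-range path."""
    y_short = ssm_scan(x, *short)
    # Average pooling; len(x) must be divisible by factor.
    x_down = x.reshape(-1, factor).mean(axis=1)
    # Nearest-neighbour upsampling back to the original length.
    y_long = np.repeat(ssm_scan(x_down, *long), factor)
    return w_short * y_short + w_long * y_long

rng = np.random.default_rng(0)
x0 = np.sin(np.linspace(0.0, 6.0 * np.pi, 64))
x_t, eps = add_noise(x0, alpha_bar_t=0.7, rng=rng)

n_state = 4
short = (np.full(n_state, 0.5), np.ones(n_state), np.ones(n_state) / n_state)
long = (np.full(n_state, 0.95), np.ones(n_state), np.ones(n_state) / n_state)
y = multiscale_ssm(x_t, short, long, w_short=0.6, w_long=0.4)
```

The fast-decaying state reacts mainly to recent samples, while the slow-decaying state on the pooled sequence integrates over a much longer window at a quarter of the cost.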
[0053] The fused features are then modulated with the modulation parameters to obtain the modulated features:

$$\tilde{y} = \gamma \odot y + \beta \quad (2.4)$$

where $y$ is the fused multi-scale feature, $\gamma$ is the scaling factor, and $\beta$ is the translation factor. The output of each diffusion layer is then obtained through a residual connection and normalization. Finally, the predicted noise $\epsilon_\theta$ used for training is obtained through a multi-layer encoder-decoder structure.
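One diffusion layer's modulation, residual connection, and normalization can be sketched as below. This is a minimal illustration assuming a scale-and-shift modulation per channel and layer normalization over the time axis; the source does not specify the normalization type, and `modulate_layer` is a hypothetical name.

```python
import numpy as np

def modulate_layer(x, y, gamma, beta, eps=1e-5):
    """Scale and shift y, add the residual input x, then normalize."""
    y_mod = gamma * y + beta          # per-channel scale and shift
    z = x + y_mod                     # residual connection
    mean = z.mean(axis=-1, keepdims=True)
    var = z.var(axis=-1, keepdims=True)
    return (z - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
channels, length = 4, 64
x = rng.standard_normal((channels, length))       # layer input
y = rng.standard_normal((channels, length))       # fused multi-scale features
gamma = rng.standard_normal((channels, 1))        # broadcast over time
beta = rng.standard_normal((channels, 1))
out = modulate_layer(x, y, gamma, beta)
```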
[0054] The specific training process of the proposed model is as follows: Step 1: Each modal cue is encoded by its respective encoder and transformed into an embedding of unified dimension.
[0055] Step 2: Weight and fuse the various cue vectors to obtain the modality fusion vector.
[0056] Step 3: Concatenate the fused prompt vector with the time step embedding and input the result into the modulation network to obtain the modulation parameters.
[0057] Step 4: Apply noise perturbation to the original sample to obtain the noisy sample.
[0058] Step 5: Input the noisy sample into the hierarchical backbone network to obtain the long-term modeling feature output and the short-term modeling feature output.
[0059] Step 6: Weight and fuse the two modeling features to obtain the fused feature output.
[0060] Step 7: Use the modulation parameters obtained in Step 3 to modulate the fused time-series features and obtain the modulated features.
[0061] Step 8: Repeat steps 5-7 through the encoding-decoding process to obtain the noise prediction output.
[0062] Step 9: Calculate the main loss using the mean squared error loss.
[0063] Step 10: Repeat steps 3-8 with the prompt conditions removed to obtain the unprompted noise prediction output.
[0064] Step 11: Calculate the auxiliary cue loss using the L1 variant loss.
[0065] Step 12: Backpropagate to update the modal fusion weights, the hierarchical fusion weights, and the model parameters.
[0066] If the model is also to be optimized through feedback from downstream tasks, the following steps are performed.
[0067] Step 1: Calculate the generated data from the predicted noise output.
[0068] Step 2: Input the generated data into the downstream task network to execute the task.
[0069] Step 3: Based on the evaluation criteria, calculate the task performance (accuracy, recall) and use it as a loss for the diffusion model.
[0070] Step 4: Update and optimize the weight parameters of the diffusion model using the main loss, auxiliary cue loss, and task loss.
[0071] 3. This invention designs a weight adaptive mechanism to guide the model to dynamically learn the importance of different cue modalities and the fusion strategy of the multi-scale modeling paths. Specifically, an auxiliary loss $\mathcal{L}_{\mathrm{aux}}$ is added to the original loss function, giving the total loss of Equation (3.1). The model's initial weights are provided by expert knowledge, and the corresponding weights are then adaptively updated using a data-driven approach. The original loss is the mean squared error between the true noise $\epsilon$ and the predicted noise $\epsilon_\theta$:

$$\mathcal{L}_{\mathrm{MSE}} = \mathbb{E}\left[\lVert \epsilon - \epsilon_\theta \rVert_2^2\right] \quad (3.2)$$

The auxiliary loss uses a variant of the L1 loss (Equation (3.3)), the core of which is to compare the output obtained with prompts against the output obtained without prompts; the unprompted output is produced by setting the cue conditions to 0. During training, the modal weights and multi-scale modeling path weights in the model are updated through backpropagation of the loss.
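The two quantities that enter the loss can be sketched as follows. This is a minimal illustration with a hypothetical linear stand-in for the backbone (`toy_predictor`); it computes the mean squared error of the prompted prediction and the mean absolute gap between prompted and unprompted predictions, where the unprompted pass sets the cue vector to zero. How the gap is turned into the auxiliary loss and weighted against the main loss is not specified here and is left to the caller.

```python
import numpy as np

def toy_predictor(x_t, cue, W_x, W_c):
    """Hypothetical stand-in for the backbone: linear in sample and cue."""
    return W_x @ x_t + W_c @ cue

def loss_terms(x_t, eps_true, cue, W_x, W_c):
    """Return (main MSE loss, L1 gap between prompted and unprompted outputs)."""
    eps_cond = toy_predictor(x_t, cue, W_x, W_c)
    # Unprompted pass: the cue conditions are set to 0.
    eps_uncond = toy_predictor(x_t, np.zeros_like(cue), W_x, W_c)
    main = float(np.mean((eps_true - eps_cond) ** 2))
    cue_gap = float(np.mean(np.abs(eps_cond - eps_uncond)))
    return main, cue_gap

rng = np.random.default_rng(0)
n, d = 32, 8
x_t = rng.standard_normal(n)
eps_true = rng.standard_normal(n)
cue = rng.standard_normal(d)
W_x = rng.normal(0.0, 0.1, size=(n, n))
W_c = rng.normal(0.0, 0.1, size=(n, d))

main, cue_gap = loss_terms(x_t, eps_true, cue, W_x, W_c)
# A model that ignores the cue has zero gap.
_, gap_ignored = loss_terms(x_t, eps_true, cue, W_x, np.zeros((n, d)))
```

A gap of zero means the prompt has no influence on the prediction, so the gap is a direct measure of how strongly the cue path is being used.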
[0072] This invention provides a downstream task feedback optimization mechanism that dynamically adjusts the generation strategy based on task performance indicators to improve model adaptability. Specifically, the aforementioned time-series data generation network is decoupled from the various downstream tasks, such as fault diagnosis and classification, and digital twin simulation modules. Data generated by the time-series data generation network is input into the downstream tasks to obtain their evaluation indicators. These indicators measure how well the generated data serves the tasks and constitute the task feedback loss. Together with the main loss and cue loss described above, the task feedback loss is used to optimize the parameters of the generation model. Furthermore, when deployed on edge devices, this mechanism supports freezing the backbone diffusion network and updating only the cue weight parameters, enabling fine-tuning that meets lightweight requirements.
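The edge-side fine-tuning mode, in which the backbone is frozen and only the cue weights are updated, can be sketched as below. This is a minimal illustration using plain gradient descent on a parameter dictionary; the parameter names, the learning rate, and the random gradients are hypothetical.

```python
import numpy as np

def sgd_step(params, grads, trainable, lr=0.01):
    """Apply a gradient step only to the parameters listed in `trainable`."""
    return {
        name: value - lr * grads[name] if name in trainable else value
        for name, value in params.items()
    }

rng = np.random.default_rng(0)
params = {
    "backbone": rng.standard_normal((8, 8)),   # frozen diffusion network weights
    "cue_logits": np.zeros(4),                 # modality fusion weights
}
grads = {name: rng.standard_normal(value.shape) for name, value in params.items()}

updated = sgd_step(params, grads, trainable={"cue_logits"})
```

Only four numbers change per step in this example, which is why this mode suits devices with limited compute and memory.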
[0073] This application describes an industrial equipment simulation data generation system based on multimodal prompts. The system is applied to opto-mechatronic equipment in a factory, and the data to be generated is fault type data for this equipment.
[0074] Due to the high degree of integration of opto-mechatronic equipment in a certain factory, physical modeling is difficult. Therefore, a data-driven approach is adopted for fault diagnosis. Because actual fault samples are scarce, training the fault diagnosis model is challenging, necessitating fault sample generation to balance the dataset. A multi-modal cue-driven industrial time-series data augmentation method based on multi-scale SSSD is used to generate fault time-series data samples for use by the fault diagnosis model. The time-series data generation process is shown in Figure 6. The specific implementation steps are as follows: Step 1: After the operation logs, thermal imaging images, fault tags, and discrete recording parameters of the opto-mechatronic equipment are encoded, a prompt vector is constructed through weighted fusion.
[0075] Step 2: Concatenate the cue vector with the time step embedding and input the result into the modulation network to obtain the modulation parameters.
[0076] Step 3: Draw multivariate Gaussian noise with the same dimensions as the target data from the normal distribution.
[0077] Step 4: Input the noise and the modulation parameters into the backbone network for reverse diffusion, yielding the noise prediction for the current time step.
[0078] Step 5: From the noise prediction at the current time step, calculate the time series at the previous time step.
[0079] Step 6: Iterate through steps 3-5 until time step t=0, at which point the final generated target sequence, consistent with the input label, is obtained.
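Steps 3 to 6 can be sketched as a reverse diffusion loop. This is a minimal illustration assuming the standard DDPM ancestral sampling update with a linear noise schedule; the source does not state the exact update rule, and `predict_noise` is a hypothetical stand-in for the conditioned backbone network.

```python
import numpy as np

def sample(predict_noise, shape, betas, rng):
    """Generate one sequence by iterating the reverse diffusion update."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)                 # Step 3: start from noise
    for t in range(len(betas) - 1, -1, -1):
        eps_hat = predict_noise(x, t)              # Step 4: noise prediction
        # Step 5: estimate the sample at the previous time step.
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(shape) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    return x                                       # Step 6: result at t = 0

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)

def predict_noise(x, t):
    """Placeholder predictor; a trained, prompt-conditioned network goes here."""
    return np.zeros_like(x)

generated = sample(predict_noise, shape=(3, 32), betas=betas, rng=rng)
```

In the method described here, the modulation parameters from Step 2 would be passed into the network at every iteration, so the prompt steers each denoising step rather than only the final result.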
[0080] Through the above steps, the industrial time-series data samples required for fault diagnosis of opto-mechatronic equipment are generated. In numerical trends and periodic structure, the generated samples are consistent with the operating conditions, fault types, and operating modes described by the prompt conditions, and they closely match the statistical distribution of the real samples on multiple industrial indicators (such as mean square error, spectral characteristics, periodic jitter, and abnormal envelope).
[0081] In summary, the final output of this invention not only possesses high-quality simulation capabilities, but also achieves industrial-specific data synthesis guided by multimodal semantics, providing an effective solution to problems such as data scarcity and imbalance in industrial scenarios.
[0082] A method for constructing multimodal cue coding units is presented, encompassing text cue coding units, image cue coding units, numerical cue coding units, label cue coding units, modality fusion units, and modulation parameter generation modules. This method supports the unified mapping of various heterogeneous cue information commonly found in industrial settings, such as operation logs, equipment thermal images, operating parameters, and fault labels, to a cue vector space of the same dimension, forming a semantically consistent conditional control signal. Furthermore, to achieve flexible control of different modalities and field information, a dual cue weight fusion mechanism is employed. In discrete numerical modalities (such as multi-field operating parameters), multiple embedded fields are aggregated through weighted summation. Simultaneously, among multimodalities (such as text, images, and labels), the importance ratio of cue modalities is obtained through softmax learning. The final fused cue vector can be used to guide the diffusion-based data generation process, exhibiting good controllability, interpretability, and industrial adaptability.
[0083] A conditional diffusion model architecture for multi-scale structured state-space modeling: This architecture uses the SSSD model as the backbone network, employs two different scale S4 modules in parallel to model the local short-term dynamics and global long-term dependencies of industrial time-series data, and combines this with a weighted fusion mechanism to generate the industrial time-series data. This network preserves local response details while enhancing the ability to model long-term trends, significantly improving the quality and controllability of industrial time-series data generation.
[0084] Conditional diffusion generation and weight adaptive mechanism: Protection is sought for the mechanism by which the cue vector guides the industrial time-series data generation process, together with the cue information loss design method. A combination of the main loss and the cue information loss is used for adaptive optimization of model parameters. This mechanism enhances the adaptive modeling capability for different operating conditions, equipment types, and abnormal behaviors.
[0085] Downstream task feedback mechanism: Protection is sought for the mechanism by which the data generation model is back-optimized based on the results of downstream tasks (such as equipment fault diagnosis, production line anomaly detection, and equipment lifespan prediction). This mechanism forms a closed loop of prompt adjustment, result generation, downstream task, and back-adjustment, and can be deployed on edge devices with limited computing resources for fine-tuning of model parameters. This mechanism improves the model's practical adaptability and deployment usability in complex industrial scenarios.
[0086] In summary, the industrial time-series data augmentation method provided in this application integrates the long-term dependency modeling capability of the Structured State-Space Model (SSM) with the high-quality generation capability of the diffusion model. It enables high-fidelity time-series data synthesis guided by multimodal conditions in complex industrial scenarios, employing a multi-scale structure while considering both the short-term abrupt changes and long-term trends of discrete industrial data. By introducing a cue encoder and a cue weight adaptive module, discrete cue information from different modalities such as images, text, and labels can be embedded into the generation process. Furthermore, by combining expert experience and data-driven strategies, the guiding strength of each modality on the generated results is dynamically adjusted, thereby improving the semantic consistency and fault feature saliency of the generated data. In addition, the model supports online sampling and dynamic update mechanisms, enabling real-time optimization of the cue fusion method and generation strategy based on feedback from downstream tasks (such as fault classification accuracy and prediction stability). This achieves a closed-loop augmentation process from "discrete cue - conditional modeling - continuous generation," exhibiting strong robustness and environmental adaptability.
[0087] Referring to Figure 7, this application embodiment can also provide an industrial time-series data enhancement device. As shown in Figure 7, the apparatus for performing the above-described industrial time-series data augmentation method includes:
The data acquisition unit 701, used to acquire text data, image data, industrial discrete data, and label information of classification features in industrial scenarios;
The front-end data processing unit 702, used to obtain a fused prompt vector by combining the text data, the image data, the industrial discrete data, and the label information of the classification features with a prompt condition generation model that integrates multimodal semantic information;
The modulation parameter acquisition unit 703, used to embed the time step using the modulation parameter generation module and combine it with the fused cue vector to generate modulation parameters for modulating the internal features of the network;
The noise acquisition unit 704, used to extract multivariate Gaussian noise of the same dimension as the target data from the normal distribution;
The noise prediction unit 705, used to input the multivariate Gaussian noise and the modulation parameters into the backbone network for reverse diffusion to obtain the noise prediction at the current time. The backbone network includes a conditional diffusion model based on multi-scale structured state space modeling, which replaces the convolutional modules in the U-Net structure of the diffusion model with SSM combined variant modules to construct the SSSD backbone diffusion model, and uses SSM combined variant modules at two scales to perform short-term and long-term modeling of industrial time series data;
The previous time sequence unit 706, used to calculate the time series at the previous time step based on the predicted noise at the current time;
The target sequence generation unit 707, used to repeat the above process until time step t=0 to obtain the final generated target sequence that is consistent with the input label.
[0088] This application embodiment can also provide an industrial time-series data enhancement device, the device including a processor and a memory: The memory is used to store program code and transmit the program code to the processor; The processor is used to execute the steps of the above-described industrial time-series data augmentation method according to the instructions in the program code.
[0089] As shown in Figure 8, an industrial time-series data enhancement device provided in this application embodiment may include: a processor 10, a memory 11, a communication interface 12, and a communication bus 13. The processor 10, memory 11, and communication interface 12 all communicate with each other through the communication bus 13.
[0090] In this embodiment, the processor 10 may be a central processing unit (CPU), a graphics processing unit (GPU), an application-specific integrated circuit, a digital signal processor, a field-programmable gate array, or other programmable logic devices.
[0091] The processor 10 can call programs stored in the memory 11. Specifically, the processor 10 can execute operations in embodiments of the industrial time-series data augmentation method.
[0092] The memory 11 is used to store one or more programs, which may include program code, including computer operation instructions. In this embodiment, the memory 11 stores at least a program for implementing the following functions: applied to a high aspect ratio warhead assembly system, the method comprising: Acquire text data, image data, discrete industrial data, and label information for classification features in industrial scenarios; A fused prompt vector is obtained by using a prompt condition generation model that integrates multimodal semantic information, combined with the text data, the image data, the industrial discrete data, and the label information of the classification features; The time step is embedded using the modulation parameter generation module and combined with the fused cue vector to generate modulation parameters for modulating the internal features of the network. Extract multivariate Gaussian noise of the same dimension as the target data from a normal distribution; The multivariate Gaussian noise and the modulation parameters are input into the backbone network for reverse diffusion to obtain the noise prediction at the current time. The backbone network includes a conditional diffusion model based on multi-scale structured state space modeling. The conditional diffusion model based on multi-scale structured state space modeling replaces the convolutional module in the diffusion model U-Net structure with an SSM combined variant module to construct an SSSD backbone diffusion model. Two scales of SSM combined variant modules are used for short-term and long-term modeling of industrial time series data. The time series sequence of the previous time step is calculated based on the predicted noise at the current time step; Repeat the process until time step t=0, to obtain the final generated target sequence that matches the input label.
[0093] In one possible implementation, the memory 11 may include a program storage area and a data storage area. The program storage area may store the operating system and applications required for at least one function (such as file creation or data read / write). The data storage area may store data created during use, such as initialization data.
[0094] In addition, memory 11 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device or other volatile solid-state storage device.
[0095] Communication interface 12 can be an interface of a communication module, used to connect with other devices or systems.
[0096] Of course, it should be noted that the structure shown in Figure 8 does not constitute a limitation on the industrial time-series data augmentation device in the embodiments of this application. In practical applications, the industrial time-series data augmentation device may include more or fewer components than shown in Figure 8, or may combine certain components.
[0097] This application embodiment may also provide a computer-readable storage medium for storing program code for executing the steps of the above-described industrial time-series data augmentation method.
[0098] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0099] As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary general-purpose hardware platforms. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in various embodiments or some parts of the embodiments of this application.
[0100] The various embodiments in this specification are described in a progressive manner. Similar or identical parts between embodiments can be referred to mutually. Each embodiment focuses on describing the differences from other embodiments. In particular, since the device and system embodiments are basically similar to the method embodiments, their description is relatively brief, and the relevant parts can be found in the descriptions of the method embodiments. The device and system embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without creative effort.
[0101] The above description is merely a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention are included within the scope of protection of the present invention.
Claims
1. A method for augmenting industrial time-series data, characterized in that it comprises: acquiring text data, image data, discrete industrial data, and label information for classification features in industrial scenarios; obtaining a fused prompt vector by using a prompt condition generation model that integrates multimodal semantic information, combined with the text data, the image data, the industrial discrete data, and the label information of the classification features; embedding the time step using the modulation parameter generation module and combining it with the fused cue vector to generate modulation parameters for modulating the internal features of the network; extracting multivariate Gaussian noise of the same dimension as the target data from a normal distribution; inputting the multivariate Gaussian noise and the modulation parameters into the backbone network for reverse diffusion to obtain the noise prediction at the current time, wherein the backbone network includes a conditional diffusion model based on multi-scale structured state space modeling, the conditional diffusion model replaces the convolutional module in the U-Net structure of the diffusion model with an SSM variant module to construct the SSSD backbone model, and SSM modules at two scales are used to perform short-term and long-term modeling of industrial time series data; calculating the time series at the previous time step based on the predicted noise at the current time step; and repeating the process until time step t = 0, to obtain the final generated target sequence that matches the input label.
2. The industrial time-series data augmentation method according to claim 1, characterized in that, The multimodal semantic information prompt condition generation model includes a text prompt encoding module, an image prompt encoding module, a numerical prompt encoding module, a label prompt encoder module, and a modality fusion module. The text prompt encoding module is used to encode text data in industrial scenarios to obtain text prompt embedding vectors; The image cue encoding module is used to extract semantic features from the input image to obtain the image cue embedding vector; The numerical prompting encoding module is used to transform multiple discrete prompting messages manually entered in industrial discrete data into structured vector representations, and a unified numerical prompting embedding vector is generated through weighted fusion to serve as a conditional input to guide the behavioral decisions of time series generation or prediction models. The label prompt encoder module is used to perform structured representation of label information with discrete classification features in industrial scenarios to obtain label prompt embedding vectors; The modality fusion module uses the text cue embedding vector, the image cue embedding vector, the numerical cue embedding vector, and the label cue embedding vector as inputs to perform unified fusion, forming a fused cue vector to guide the downstream time-series data generation model.
3. The industrial time-series data augmentation method according to claim 2, characterized in that, The text prompt encoding module is used to perform the following operations: The input text data is segmented and embedded. WordPiece word segmentation technology is used to map each word or subword into a fixed-dimensional word vector. The text prompt embedding vector is obtained by extracting semantic information from the processed text using a pre-trained Transformer model through a multi-layer self-attention mechanism.
4. The industrial time-series data augmentation method according to claim 3, characterized in that, Each layer of the Transformer model is calculated using the following formula: $Q = XW_Q$, $K = XW_K$, $V = XW_V$, $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$. In the formula, $Q$, $K$, and $V$ represent the query, key, and value matrices, respectively, and $d_k$ represents the dimension of the key; the weighted value $\mathrm{Attention}(Q, K, V)$ obtained by the calculation represents the semantic information of each word.
5. The industrial time-series data augmentation method according to claim 2, characterized in that, The image prompt encoding module is used to perform the following operations: A pre-trained lightweight convolutional neural network, ResNet-18, is used as the backbone network to standardize the size of discrete image data in the industrial discrete dataset and normalize the pixels, scaling the pixel values of the images to between 0 and 1. The images are then input into the ResNet-18 network to extract semantic features, and the output, via pooling layers and fully connected layers, is a fixed-dimensional semantic vector used as the image cue embedding vector.
6. The industrial time-series data augmentation method according to claim 2, characterized in that, The numerical prompt encoding module is used to perform the following operations: Construct the embedding matrix, the corresponding index of the input field value, and the embedding vector representation for each type of discrete numerical value; Calculate the weights corresponding to each embedding vector; Each discrete embedding vector is weighted by its weight coefficient and the results are summed to obtain a unified discrete vector hint, which is then used to obtain the numerical hint embedding vector.
7. The industrial time-series data augmentation method according to claim 2, characterized in that, The label prompt encoder module is used to perform the following operations: Map the original labels to a fixed integer index form; Construct an embedding matrix for the tag set in order to calculate the tag cue embedding vector.
8. The industrial time-series data augmentation method according to claim 2, characterized in that, The modality fusion module is used to perform the following operations: The text prompt embedding vector, the image prompt embedding vector, the numerical prompt embedding vector, and the label prompt embedding vector are received by the input interface; The weighting coefficients are calculated using the weighting calculation unit. The fusion cue vector is obtained by summing the embedding vectors of each modality according to the generated weights.
9. The industrial time-series data augmentation method according to claim 1, characterized in that, The modulation parameter generation module is used to perform the following operations: The time step and fusion cue vector are used as inputs, and the time step is converted into a fixed-dimensional time vector through sinusoidal position encoding; The time vector and the fused prompt vector are concatenated to obtain a 2d-dimensional joint vector; Modulation parameters are obtained through a multilayer perceptron structure, and the modulation parameters include scaling factors and translation factors.
10. An industrial time-series data enhancement device, characterized in that the device is used to perform the industrial time-series data augmentation method according to any one of claims 1-9, the method being applied to a high aspect ratio warhead assembly system, the device comprising: a data acquisition unit, used to acquire text data, image data, industrial discrete data, and label information of discrete classification features in industrial scenarios; a front-end data processing unit, used to obtain a fused prompt vector by applying a prompt condition generation model that integrates multimodal semantic information to the text data, the image data, the industrial discrete data, and the label information of the discrete classification features; a modulation parameter acquisition unit, used to embed the time step using the modulation parameter generation module and combine it with the fused prompt vector to generate modulation parameters for modulating the internal features of the network; a noise acquisition unit, used to sample multivariate Gaussian noise with the same dimensions as the target data from a normal distribution; a noise prediction unit, used to input the multivariate Gaussian noise and the modulation parameters into the backbone network for reverse diffusion to obtain the predicted noise at the current time step, wherein the backbone network includes a conditional diffusion model based on multi-scale structured state space modeling, and the conditional diffusion model replaces the convolutional modules in the U-Net structure of the diffusion model with SSM variant modules to construct the SSSD backbone model, and uses SSM variant modules of two scales to perform short-term and long-term modeling of industrial time-series data; a previous time-series sequence unit, used to calculate the sequence at the previous time step based on the noise predicted at the current time step; and a target sequence generation unit, used to repeat the above steps until time step t=0 to obtain the final generated target sequence that is consistent with the input label.
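The sampling loop formed by the noise acquisition, noise prediction, previous-sequence, and target sequence generation units of claim 10 can be sketched as follows. The sketch assumes the standard DDPM ancestral sampling update with a linear beta schedule, which the claim does not spell out. The function `placeholder_backbone` is a hypothetical stand-in for the multi-scale SSSD backbone, scalar `scale` and `shift` values stand in for the modulation parameters, and a flat list stands in for the multivariate sequence.

```python
import math
import random

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule (assumed); requires num_steps >= 2."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alphas = [1.0 - b for b in betas]
    alpha_bars, prod = [], 1.0
    for a in alphas:
        prod *= a
        alpha_bars.append(prod)
    return betas, alphas, alpha_bars

def placeholder_backbone(x, t, scale, shift):
    """Hypothetical stand-in for the SSSD backbone: a modulated copy of x."""
    return [scale * v + shift for v in x]

def sample(length, num_steps, scale, shift, rng, predict_noise=placeholder_backbone):
    betas, alphas, alpha_bars = make_schedule(num_steps)
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]  # start from Gaussian noise
    for t in range(num_steps - 1, -1, -1):  # iterate down to time step t = 0
        eps = predict_noise(x, t, scale, shift)  # predicted noise at step t
        coef = betas[t] / math.sqrt(1.0 - alpha_bars[t])
        mean = [(xi - coef * ei) / math.sqrt(alphas[t]) for xi, ei in zip(x, eps)]
        if t > 0:
            sigma = math.sqrt(betas[t])
            x = [m + sigma * rng.gauss(0.0, 1.0) for m in mean]  # previous sequence
        else:
            x = mean  # no noise is added at the final step
    return x

generated = sample(length=16, num_steps=50, scale=0.1, shift=0.0, rng=random.Random(0))
```

Replacing `placeholder_backbone` with a trained conditional noise predictor, and passing the scaling and translation factors produced by the modulation parameter generation module, would turn this loop into a label-conditioned generator.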