Online anomaly early warning method for key equipment in a tobacco primary processing workshop based on diffusion-based reconstruction anomaly detection

By combining dual-channel sound acquisition and a diffusion-based reconstruction network, the system extracts equipment acoustic signatures and background sound conditions, solving the stability problem of online anomaly early warning for tobacco processing workshop equipment and achieving effective early warning under background sound changes.

CN122201345A · Pending · Publication Date: 2026-06-12 · CHINA TOBACCO ANHUI IND CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHINA TOBACCO ANHUI IND CO LTD
Filing Date
2026-03-16
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

In tobacco processing workshops, existing technologies struggle to provide effective online anomaly warnings for critical equipment when the background noise changes, leading to false alarms or missed alarms. Furthermore, the output of the reconstruction network is inconsistent under different background conditions.

Method used

A diffusion-based reconstruction method is adopted. Through dual-channel sound acquisition, device voiceprints and background sound conditions are extracted to construct a diffusion-based reconstruction network. The background sound conditions are used for consistency training constraints, and the reconstruction deviation is calculated for early warning.

🎯Benefits of technology

It improves the stability of early warning under different background noise conditions, reduces false alarms and missed alarms, enhances the stability of the model across shifts and working conditions, and can effectively reflect the structural deviation of equipment sound patterns.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122201345A_ABST
Patent Text Reader

Abstract

The application discloses an online anomaly early warning method for key equipment in a tobacco primary processing workshop based on diffusion-based reconstruction anomaly detection, comprising: (1) synchronously acquiring sound segments from the equipment near-field channel and the reference channel, and extracting, respectively, a 128×96 equipment voiceprint representation and a 64-dimensional background sound condition formed from energy distribution, rhythm change and frequency components; (2) training a conditional diffusion reconstruction network with a consistency constraint on historical normal samples, so that it reconstructs the normal voiceprint under a given background condition; (3) calculating online the reconstruction deviation between the voiceprint and the reconstructed voiceprint to form a sequence, determining abnormal events under an early warning threshold and a duration threshold, and generating an early warning signal containing the equipment identification, time and deviation. The application can improve early warning stability under different background sound conditions.

Description

Technical Field

[0001] This invention relates to the field of industrial equipment condition monitoring and anomaly detection, and in particular to an online anomaly early warning method for key equipment in a tobacco processing workshop based on diffusion-based reconstruction anomaly detection.

Background Technology

[0002] Key equipment in tobacco processing workshops, such as cutting, drying, and feeding machines, is prone to early abnormalities during long-term operation, such as bearing wear, loosening and rubbing, and foreign object ingress, and the resulting abnormal sounds often precede shutdown failures. Because the workshop is an open space with dense equipment and frequently changing operating conditions, background noise fluctuates with shifts, output, and the status of fans and conveyor systems. As a result, the acoustic environment of the same equipment differs significantly from one time to another. Online status monitoring using sound signals therefore has to consider both the equipment's own sound and the influence of ambient noise.

[0003] In existing technologies, solutions for anomaly detection in industrial equipment typically employ single-channel audio or vibration sensors to acquire signals. The audio is then segmented into frames and analyzed over time and frequency to extract features such as MFCC, spectrograms, or Mel-frequency spectra. Status identification and early warning are then achieved through threshold rules, traditional machine learning models, or deep networks. In the data-driven approach, there are supervised learning-based multi-classification models, as well as reconstruction-based anomaly detection methods trained solely on normal data. These include autoencoders, variational autoencoders, GAN reconstruction, and models based on temporal prediction errors. Anomaly scores are calculated based on the "deviation between input and reconstruction / prediction," and online alerts are achieved using fixed thresholds or sliding windows.

[0004] The aforementioned solutions often face stability issues caused by background noise changes in tobacco processing workshop scenarios. With single-channel pickup, equipment noise and workshop background noise couple within the same channel. Changes in background energy distribution and rhythm directly affect voiceprint characteristics, causing the model's learned "normal pattern" to drift with environmental changes. Relying solely on noise reduction or simple normalization cannot guarantee a consistent background state for each acquisition, so reconstruction errors obtained under different backgrounds are not comparable. This easily leads to false alarms triggered by background fluctuations, or to missed alarms when the background masks genuine anomalies. Furthermore, without explicit modeling of background conditions, the consistency of the reconstruction network's output under similar environments is difficult to constrain, so online scores jump with scene disturbances, which hinders continuous judgment.

[0005] Therefore, a method for online anomaly early warning of key equipment in tobacco processing workshops that can overcome the shortcomings of the existing technology is a problem that needs to be solved by those skilled in the art.

Summary of the Invention

[0006] One objective of this invention is to propose an online anomaly early warning method for key equipment in a tobacco processing workshop based on diffusion-based reconstruction anomaly detection. The aim is to achieve stable and sustainable anomaly early warning for key equipment based on online dual-channel sound acquisition, under the condition that the background sound in the tobacco processing workshop continuously changes with the working conditions and environment. This allows the anomaly judgment to reflect the changes in the equipment's own sound rather than the fluctuations in the background sound as much as possible, thereby improving the stability of the early warning under different background sound conditions.

[0007] To achieve the above-mentioned objectives, the present invention adopts the following technical solution: The present invention provides an online anomaly early warning method for key equipment in a tobacco processing workshop based on diffusion-based reconstruction anomaly detection, characterized by comprising:
S1. Collect device channel sound segments and reference channel sound segments of the target key equipment in the tobacco processing workshop to form a dual-channel sound segment stream, and associate the dual-channel sound segment stream with the identifier of the target key equipment and the acquisition time.
S2. Based on the dual-channel sound segment stream, extract the background sound condition from the reference channel sound segment, extract the device voiceprint representation from the device channel sound segment of the target key equipment, and combine the background sound condition with the device voiceprint representation to form a conditional voiceprint sample.
S3. Construct a diffusion-based reconstruction network comprising a diffusion step sequence number encoding unit, a background sound condition encoding unit, a voiceprint reconstruction unit, and an output mapping unit. Process the conditional voiceprint samples under their background sound conditions to obtain the reconstructed device voiceprint representation. Combine the reconstructed device voiceprint representation with the real device voiceprint representation under the same background sound conditions to construct a reconstruction training constraint and a consistency training constraint, which are used to train the diffusion-based reconstruction network and obtain the conditional diffusion reconstruction model.
S4. Process the real-time conditional voiceprint samples with the conditional diffusion reconstruction model to obtain the optimal reconstructed device voiceprint representation. Calculate, point by point over frequency bands and time frames, the difference amplitude between the device voiceprint representation in the real-time conditional voiceprint sample and the optimal reconstructed device voiceprint representation, and average all difference amplitudes to obtain the reconstruction deviation at the acquisition time. The reconstruction deviations at consecutive acquisition times are arranged in time order to form a reconstruction deviation sequence.
S5. Under the constraints of the preset warning threshold and duration threshold, judge the reconstruction deviation sequence to form an abnormal judgment event or a normal judgment event.
S6. Generate an early warning signal based on the abnormal judgment event. The early warning signal includes the identifier of the target key equipment, the acquisition time, and the reconstruction deviation information.

[0008] The online anomaly early warning method for key equipment in tobacco processing workshops based on diffusion-based reconstruction anomaly detection, as described in this invention, is also characterized in that S1 includes: The equipment channel acquisition unit is placed in the near field of the target key equipment and pointed to the main sound source of the target key equipment. The reference channel acquisition unit is placed at the acquisition position of the preset reference channel to acquire the background sound of the workshop, thereby completing the dual-channel acquisition configuration. The device channel acquisition unit and the reference channel acquisition unit are synchronously triggered under a unified clock, so as to generate a pair of device channel sound segments and reference channel sound segments at the same acquisition time. The continuously acquired signals from the two channels are framed according to a preset sampling rate, and continuous device channel audio segment streams and reference channel audio segment streams are formed according to preset segment durations, so that each device channel audio segment and each reference channel audio segment carries the corresponding acquisition time. The device channel audio segments and reference channel audio segments at the same acquisition time are merged into a dual-channel audio segment stream according to their correspondence, and the identifier of the target key device and the acquisition time are associated with the dual-channel audio segment stream.
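The pairing described in S1 can be illustrated with a minimal sketch. It assumes that each channel delivers segments already stamped with the shared trigger time; the class names, the field names, and the choice to skip a device segment that has no synchronous reference partner are hypothetical and are not taken from the patent text.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class ChannelSegment:
    """One sound segment from a single channel (hypothetical structure)."""
    acquisition_time: int          # shared trigger timestamp, e.g. in ms
    samples: Sequence[float]       # raw samples of the segment


@dataclass(frozen=True)
class DualChannelSegment:
    """Paired record: device identifier, time, and both channel segments."""
    device_id: str
    acquisition_time: int
    device_samples: Sequence[float]
    reference_samples: Sequence[float]


def merge_streams(device_id: str,
                  device_stream: List[ChannelSegment],
                  reference_stream: List[ChannelSegment]) -> List[DualChannelSegment]:
    """Pair segments that share the same acquisition time, in time order."""
    reference_by_time: Dict[int, ChannelSegment] = {
        seg.acquisition_time: seg for seg in reference_stream
    }
    merged = []
    for dev in sorted(device_stream, key=lambda s: s.acquisition_time):
        ref = reference_by_time.get(dev.acquisition_time)
        if ref is None:
            continue  # assumption: skip rather than pair with another time
        merged.append(DualChannelSegment(device_id, dev.acquisition_time,
                                         dev.samples, ref.samples))
    return merged
```

Keying the reference stream by acquisition time keeps the lookup exact, so a background segment is never attached to a device segment from another trigger.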

[0009] Furthermore, S2 includes: Obtain a pair of reference channel audio segment and device channel audio segment with the same acquisition time from the dual-channel audio segment stream. Perform time-frequency analysis on the reference channel audio segment to obtain a reference channel time-frequency feature map containing 128 frequency bands and 96 time frames. Group the reference channel time-frequency feature map along the frequency band direction in groups of 4 frequency bands to obtain 32 frequency band groups, and calculate the average energy of each frequency band group over the 96 time frames to obtain a 32-dimensional energy distribution vector. Group the reference channel time-frequency feature map along the time frame direction in groups of 6 time frames to obtain 16 time periods, and calculate the total energy change in each time period to obtain a 16-dimensional rhythm change vector. Average the reference channel time-frequency feature map along the time frame direction to obtain the band mean spectrum, divide the band mean spectrum into 16 frequency band segments in groups of 8 frequency bands, and extract the energy proportion of each frequency band segment to obtain a 16-dimensional frequency component distribution vector. Concatenate the 32-dimensional energy distribution vector, the 16-dimensional rhythm change vector, and the 16-dimensional frequency component distribution vector in a fixed order to form the 64-dimensional background sound condition. Perform time-frequency analysis on the device channel audio segment to obtain a device voiceprint representation containing 128 frequency bands and 96 time frames. Combine the 64-dimensional background sound condition and the device voiceprint representation according to the corresponding acquisition time to form a conditional voiceprint sample.
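As an illustration of how a 128×96 reference-channel map could be compressed into the 64-dimensional background sound condition, the following sketch uses plain Python lists. Since 128 bands are split into 32 groups, each group holds 4 bands. The text does not spell out how the "total energy change" of a time period is computed, so the sketch assumes the summed absolute frame-to-frame change of total frame energy; the function name is hypothetical.

```python
from typing import List

N_BANDS, N_FRAMES = 128, 96


def background_condition(tf_map: List[List[float]]) -> List[float]:
    """Compress a 128x96 reference-channel energy map into 64 values."""
    assert len(tf_map) == N_BANDS and all(len(r) == N_FRAMES for r in tf_map)

    # Mean energy of each band over the 96 frames (band mean spectrum).
    band_mean = [sum(row) / N_FRAMES for row in tf_map]

    # 32-dim energy distribution: average over groups of 4 adjacent bands.
    energy = [sum(band_mean[4 * g:4 * g + 4]) / 4 for g in range(32)]

    # Total energy of each time frame (sum over all bands).
    frame_energy = [sum(tf_map[k][n] for k in range(N_BANDS))
                    for n in range(N_FRAMES)]

    # 16-dim rhythm change: ASSUMED to be the summed absolute frame-to-frame
    # change of total energy inside each 6-frame period.
    rhythm = []
    for p in range(16):
        period = frame_energy[6 * p:6 * p + 6]
        rhythm.append(sum(abs(b - a) for a, b in zip(period, period[1:])))

    # 16-dim frequency component distribution: energy share of each
    # 8-band segment of the band mean spectrum.
    total = sum(band_mean) or 1.0   # guard against an all-zero map
    share = [sum(band_mean[8 * q:8 * q + 8]) / total for q in range(16)]

    return energy + rhythm + share   # fixed order: 32 + 16 + 16 = 64
```

For a perfectly flat map, the energy entries equal the flat level, the rhythm entries are zero, and each of the 16 frequency shares is 1/16.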

[0010] Furthermore, in S2 the 64-dimensional background sound condition $c_t$ at acquisition time $t$ is obtained using equation (1):

$$c_t = \mathrm{col}(e_t,\ r_t,\ f_t), \qquad e_t = \mathrm{col}(e_{t,1},\dots,e_{t,32}), \quad r_t = \mathrm{col}(r_{t,1},\dots,r_{t,16}), \quad f_t = \mathrm{col}(f_{t,1},\dots,f_{t,16}) \tag{1}$$

In equation (1), $e_t$ is the 32-dimensional energy distribution vector at acquisition time $t$, and $e_{t,g}$ is the average energy of the $g$-th frequency band group over the 96 time frames; $r_t$ is the 16-dimensional rhythm change vector at acquisition time $t$, and $r_{t,p}$ is the total energy change over the $p$-th time period; $f_t$ is the 16-dimensional frequency component distribution vector at acquisition time $t$, and $f_{t,q}$ is the energy proportion of the $q$-th frequency band segment. These components are computed from $X_t(k,n)$, the energy value at the $k$-th frequency band and $n$-th time frame of the reference channel time-frequency feature map obtained by time-frequency analysis of the reference channel audio segment at acquisition time $t$; from $E_t(n)$, the total energy of the $n$-th time frame at acquisition time $t$; and from $\bar{X}_t(k)$, the average energy of the $k$-th frequency band over the 96 time frames at acquisition time $t$. $\mathrm{col}(\cdot)$ indicates that the scalar components arranged in order within the parentheses are combined into a column vector; $k$ is the frequency band index, $n$ is the time frame index, $g$ is the frequency band group index, $p$ is the time period index, and $q$ is the frequency band segment index.

[0011] Furthermore, S3 includes: A training sample set is constructed by selecting conditional voiceprint samples from historical normal operation periods. Each conditional voiceprint sample in the training sample set contains a 64-dimensional background sound condition and a 128×96 device voiceprint representation. Conditional voiceprint samples are drawn from the training sample set in batches, and a 128-dimensional diffusion step sequence number vector is generated for each conditional voiceprint sample.
The 64-dimensional background sound conditions in each batch are input into the background sound condition encoding unit, which generates a 128-dimensional condition code using two fully connected layers. The 128-dimensional diffusion step sequence number vectors in each batch are input into the diffusion step sequence number encoding unit, which generates a 128-dimensional step code using one fully connected layer. Concatenating the 128-dimensional condition code with the 128-dimensional step code yields a 256-dimensional reconstruction condition vector.
The 128×96 device voiceprint representation is input as a two-dimensional time-frequency feature map into the first convolutional block of the three-layer downsampling path in the voiceprint reconstruction unit, and convolution and downsampling are performed layer by layer along the downsampling path to output feature maps at each scale. The 256-dimensional reconstruction condition vector is also input into the voiceprint reconstruction unit. In each convolutional block of the three-layer downsampling path and the three-layer upsampling path, the vector is mapped by a fully connected layer to a conditional bias vector with the same number of channels as the convolutional block. The conditional bias vector is repeated along the frequency band and time frame dimensions until it matches the size of the block's normalized output, and is then added to that normalized output. The result is the intermediate reconstruction feature for the given diffusion step sequence number, that is, the feature map output by the voiceprint reconstruction unit under that diffusion step sequence number. This completes the conditional reconstruction under the constraints of the current background sound condition and diffusion step sequence number.
The intermediate reconstruction features output by the voiceprint reconstruction unit are input into the output mapping unit, which uses a 1×1 two-dimensional convolutional layer to output the reconstructed device voiceprint representation with the same structure as the input device voiceprint representation. A reconstruction error between the reconstructed device voiceprint representation and the corresponding real device voiceprint representation is constructed as the reconstruction training constraint of the diffusion reconstruction network. Within each batch, sample pairs whose background sound conditions are the same or fall within the same range are constructed based on the vector distance threshold of the background sound conditions, and the consistency deviation between the reconstructed device voiceprint representations obtained by each sample pair under the same diffusion step sequence number is used as the consistency training constraint. The joint training objective consists of the reconstruction training constraint and the consistency training constraint. The parameters of the diffusion reconstruction network are jointly optimized and updated under this objective, and iterative training yields the conditional diffusion reconstruction model.
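The conditioning step inside each convolutional block (concatenate the two 128-dimensional codes, map the 256-dimensional vector to one bias per channel, repeat it over bands and frames, add it to the normalized output) can be sketched without a deep learning framework. This is a toy illustration on nested lists, not the network itself; the function names and the weight layout are assumptions.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[List[float]]]   # [channel][band][frame]


def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Fully connected layer: y = W x + b, one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def inject_condition(normalized: FeatureMap, cond_code: Vector,
                     step_code: Vector, weight: List[Vector],
                     bias: Vector) -> FeatureMap:
    """Add a per-channel conditional bias to a block's normalized output."""
    recon_cond = cond_code + step_code            # 128 + 128 = 256 values
    channel_bias = linear(recon_cond, weight, bias)
    assert len(channel_bias) == len(normalized)   # one bias per channel
    # Repeat each channel's bias over every band and frame position.
    return [[[v + channel_bias[c] for v in row] for row in channel]
            for c, channel in enumerate(normalized)]
```

Because the bias is constant across bands and frames within a channel, the condition shifts whole feature channels rather than individual time-frequency points.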

[0012] Furthermore, the joint training objective $\mathcal{L}$ is constructed using equation (2):

$$\mathcal{L} = \lambda_1\,\mathcal{L}_{\mathrm{rec}} + \lambda_2\,\mathcal{L}_{\mathrm{con}} \tag{2}$$

In equation (2), $\lambda_1$ is the weight of the reconstruction training constraint $\mathcal{L}_{\mathrm{rec}}$, and $\lambda_2$ is the weight of the consistency training constraint $\mathcal{L}_{\mathrm{con}}$; $B$ is the batch sample size; $i$ and $j$ are indices of conditional voiceprint samples within a batch; $c_i$ is the 64-dimensional background sound condition of the $i$-th conditional voiceprint sample, and $c_j$ is the 64-dimensional background sound condition of the $j$-th conditional voiceprint sample; $\tau$ is the vector distance threshold of the background sound conditions; $\lVert\cdot\rVert_2$ represents the Euclidean distance; $\mathbb{1}(\cdot)$ is an indicator function that takes the value 1 when the condition within the parentheses is true, and 0 otherwise. $\mathcal{L}_{\mathrm{rec}}$ is built over the batch from the point-wise differences between $x_i(k,n)$ and $\hat{x}_i^{(s)}(k,n)$, and $\mathcal{L}_{\mathrm{con}}$ is built from the point-wise differences between $\hat{x}_i^{(s)}(k,n)$ and $\hat{x}_j^{(s)}(k,n)$ for the sample pairs selected by $\mathbb{1}(\lVert c_i - c_j\rVert_2 \le \tau)$, where $x_i(k,n)$ is the value at the $k$-th frequency band and $n$-th time frame of the device voiceprint representation of the $i$-th conditional voiceprint sample; $\hat{x}_i^{(s)}(k,n)$ is the value at the $k$-th frequency band and $n$-th time frame of the reconstructed device voiceprint representation output for the $i$-th conditional voiceprint sample under diffusion step sequence number $s$; $\hat{x}_j^{(s)}(k,n)$ is the corresponding value for the $j$-th conditional voiceprint sample under the same diffusion step sequence number $s$; $k$ is the frequency band index, $k \in [1, 128]$; $n$ is the time frame index, $n \in [1, 96]$; and $s$ is the diffusion step sequence number.
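To make the structure of the joint objective concrete, the sketch below combines a reconstruction term with a consistency term restricted to sample pairs whose background sound conditions lie within the distance threshold. Several details are assumptions rather than statements of the patent: squared differences, averaging over map entries and over selected pairs, the non-strict comparison with the threshold, and the default weights.

```python
import math
from typing import List

Map = List[List[float]]   # [band][frame]


def mean_sq_diff(a: Map, b: Map) -> float:
    """Mean squared point-wise difference between two equal-size maps."""
    total = sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    return total / (len(a) * len(a[0]))


def joint_objective(real: List[Map], recon: List[Map],
                    conds: List[List[float]], tau: float,
                    w_rec: float = 1.0, w_con: float = 0.1) -> float:
    """Weighted sum of a reconstruction term and a consistency term."""
    batch = len(real)
    # Reconstruction term: real vs. reconstructed voiceprint, batch average.
    rec = sum(mean_sq_diff(x, y) for x, y in zip(real, recon)) / batch

    con, pairs = 0.0, 0
    for i in range(batch):
        for j in range(i + 1, batch):
            # Indicator: pair only samples whose conditions are close.
            if math.dist(conds[i], conds[j]) <= tau:
                con += mean_sq_diff(recon[i], recon[j])
                pairs += 1
    con = con / pairs if pairs else 0.0   # assumption: average over pairs
    return w_rec * rec + w_con * con
```

With a very small threshold no pairs are selected and the objective reduces to the reconstruction term alone, which is a quick way to check the pairing logic.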

[0013] Furthermore, in S4 the reconstruction deviation $d_t$ at acquisition time $t$ is obtained using equation (3):

$$d_t = \sum_{k=1}^{128} w_{t,k}\cdot\frac{1}{96}\sum_{n=1}^{96}\left|\,x_t(k,n)-\hat{x}_t(k,n)\,\right|, \qquad w_{t,k} = \frac{\tilde{w}_{t,k}}{\sum_{k'=1}^{128}\tilde{w}_{t,k'}} \tag{3}$$

In equation (3), $x_t(k,n)$ is the value at the $k$-th frequency band and $n$-th time frame of the device voiceprint representation at acquisition time $t$; $\hat{x}_t(k,n)$ is the value at the $k$-th frequency band and $n$-th time frame of the optimal reconstructed device voiceprint representation at acquisition time $t$; $w_{t,k}$ is the normalized frequency band weight of the $k$-th frequency band at acquisition time $t$; $\tilde{w}_{t,k}$ is the unnormalized frequency band weight of the $k$-th frequency band at acquisition time $t$; $k'$ is the frequency band index used in the weighted normalization sum, $k' \in [1, 128]$. The unnormalized weight $\tilde{w}_{t,k}$ is constructed from $e_{t,g(k)}$, the component of the energy distribution vector $e_t$ at acquisition time $t$ for the frequency band group $g(k)$, and from the preset energy bias scalar $\varepsilon$; $g(k)$ is the index of the frequency band group to which frequency band index $k$ belongs, obtained with the floor operation $\lfloor\cdot\rfloor$; and $|\cdot|$ indicates taking the absolute value of the value within the parentheses.
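A band-weighted reconstruction deviation of this kind can be sketched as follows. The text states only that the unnormalized band weight is built from the energy distribution component of the band's group and a preset energy bias scalar; the additive form used here (group energy plus bias), the default bias value, and the function name are assumptions made for illustration.

```python
from typing import List

Map = List[List[float]]   # [band][frame], 128 x 96 in the method


def reconstruction_deviation(real: Map, recon: Map,
                             energy: List[float],
                             eps: float = 1e-6) -> float:
    """Band-weighted mean absolute difference between two voiceprints."""
    n_bands = len(real)
    bands_per_group = n_bands // len(energy)      # 128 // 32 = 4
    # ASSUMPTION: raw weight = group energy + bias; the source only says
    # the weight is built from these two quantities.
    raw = [energy[k // bands_per_group] + eps for k in range(n_bands)]
    norm = sum(raw)
    deviation = 0.0
    for k in range(n_bands):
        # Mean absolute difference of band k over its time frames.
        band_err = sum(abs(x - y) for x, y in zip(real[k], recon[k]))
        deviation += (raw[k] / norm) * band_err / len(real[k])
    return deviation
```

Because the normalized weights sum to one, the result stays on the same scale as a plain mean absolute difference, so a single warning threshold remains meaningful across background conditions.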

[0014] Furthermore, S5 includes: Obtain the acquisition time sequence corresponding to the reconstruction deviation sequence, and convert the duration threshold into a number of consecutive acquisition times, which serves as the duration determination length. Using the acquisition time as the index, construct a sliding judgment window of the duration determination length on the reconstruction deviation sequence, and compare the reconstruction deviations within the sliding judgment window with the warning threshold one by one. When the reconstruction deviations at all consecutive acquisition times within the sliding judgment window exceed the warning threshold, output an abnormal judgment event containing the start and end acquisition times of that window. If the reconstruction deviation at any acquisition time within the sliding judgment window does not exceed the warning threshold, output a normal judgment event containing the current acquisition time.
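The threshold-and-duration judgment can be sketched as a sliding window over the deviation sequence. The sketch assumes evenly spaced acquisition times, so that the duration threshold converts to a window length by integer division, and it treats windows that are not yet full as normal; both choices, like the function and parameter names, are assumptions.

```python
from typing import List, Tuple

Event = Tuple[str, int, int]   # (kind, start_time, end_time)


def judge(times: List[int], deviations: List[float],
          warn_threshold: float, duration: int, period: int) -> List[Event]:
    """Slide a window of `duration // period` samples over the sequence."""
    length = max(1, duration // period)    # duration -> consecutive samples
    events: List[Event] = []
    for end in range(len(deviations)):
        start = end - length + 1
        window = deviations[max(start, 0):end + 1]
        # Abnormal only if the window is full and every value exceeds
        # the warning threshold.
        if start >= 0 and all(d > warn_threshold for d in window):
            events.append(("abnormal", times[start], times[end]))
        else:
            events.append(("normal", times[end], times[end]))
    return events
```

A single spike above the threshold never fills a whole window, which is how short impact noise is kept from raising an event.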

[0015] Furthermore, step S6 specifically involves: The start and end times of the abnormal judgment window are determined based on the abnormal judgment event, and the identifier of the target key device corresponding to the window where the abnormal judgment event is located is retrieved in the association relationship of the dual-channel audio segment stream. Based on the start and end times of data acquisition, extract the subsequence of reconstruction deviation corresponding to the window where the anomaly judgment event is located from the reconstruction deviation sequence. The reconstruction deviation subsequence is traversed to determine the maximum reconstruction deviation and the acquisition time corresponding to the maximum reconstruction deviation. The maximum reconstruction deviation is then used as the reconstruction deviation information of the corresponding anomaly judgment event. The identification of the target key equipment, the acquisition time, and the reconstruction deviation information are combined to generate an early warning signal and output it.
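Step S6 amounts to locating the peak deviation inside the abnormal window and packaging it with the device identifier and times. A minimal sketch follows; the dictionary keys and the function name are hypothetical, and the device identifier is passed in directly rather than looked up from the dual-channel stream association.

```python
from typing import Dict, List


def build_warning(device_id: str, times: List[int],
                  deviations: List[float],
                  start_time: int, end_time: int) -> Dict[str, object]:
    """Assemble a warning from the deviations inside one abnormal window."""
    # Subsequence of (time, deviation) pairs that fall inside the window.
    window = [(t, d) for t, d in zip(times, deviations)
              if start_time <= t <= end_time]
    if not window:
        raise ValueError("abnormal window contains no acquisition time")
    # Traverse the subsequence to find the maximum deviation and its time.
    peak_time, peak_dev = max(window, key=lambda item: item[1])
    return {
        "device_id": device_id,
        "window": (start_time, end_time),
        "peak_time": peak_time,
        "max_deviation": peak_dev,
    }
```

Reporting the peak time alongside the window bounds gives maintenance staff a single moment to inspect in the recorded audio.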

[0016] The present invention provides an electronic device, including a memory and a processor, characterized in that the memory is used to store a program supporting the processor in performing the method described therein, and the processor is configured to execute the program stored in the memory.

[0017] Compared with the prior art, the beneficial effects of the present invention are as follows: 1. This invention proposes an improved anomaly detection method based on diffusion-based reconstruction. By using the background sound conditions extracted from the reference channel as explicit input conditions, a consistency training constraint is introduced to ensure that the diffusion-based reconstruction network produces consistent reconstruction results under the same or similar background sound conditions. Compared to schemes that rely solely on single-channel acoustic features and reconstruct using autoencoders or unconditional diffusion networks, this invention constructs sample pairs using the vector distance threshold of the background sound conditions during the training phase. It applies a consistency deviation penalty to the reconstruction output under the same diffusion step, thereby suppressing reconstruction drift caused by changes in workshop background or the randomness of diffusion steps. This allows the anomaly score to more accurately reflect the structural deviation of the equipment's acoustic signature, helping to reduce false positives and false negatives caused by changes in background noise, and enhancing the model's stability under background disturbances across shifts and operating conditions during online inference.

[0018] 2. This invention proposes a novel dual-channel synchronous acquisition and conditional voiceprint construction method. The device channel is pointed in the near field at the main sound source of the equipment, while the reference channel acquires the workshop background sound at a fixed position. A unified clock is used for synchronous triggering to form paired segments at the same acquisition time, ensuring a strict temporal correspondence between background information and equipment voiceprints. Furthermore, the time-frequency feature map of the reference channel is compressed into a 64-dimensional background sound condition composed of energy distribution, rhythmic changes, and frequency component proportions, and normalized to stabilize the conditional encoding input. The device channel outputs a 128×96 device voiceprint representation. Unlike existing methods that use only a single voiceprint feature as model input and implicitly mix background changes into device features, this design separates and parameterizes the "background state" from the original sound, forming a consistent conditional field that can be used for both training and inference. This allows the reconstruction network to learn normal voiceprint patterns under the same background conditions, reducing the contamination of the equipment state representation by background sound changes.

[0019] 3. The present invention proposes an overall method for online warning. By means of conditional diffusion reconstruction, the final reconstructed device voiceprint representation is obtained, and the reconstruction deviation is formed by the weighted average of point-by-point differences. The frequency band weights are constructed and normalized by the energy distribution components in the background sound condition, so that the deviation calculation is consistent with the current background energy distribution in the frequency band dimension, avoiding the sensitivity of simple averaging to energy-sparse or background-dominated frequency bands. Subsequently, a sliding window judgment is constructed with a warning threshold and a duration threshold, requiring that the deviation exceeds the threshold at consecutive moments to output an abnormal event, so as to distinguish transient fluctuations such as short-term impact noise and occasional interference from persistent abnormalities; at the same time, when an event is generated, the device identifier is traced back through the correlation relationship of dual-channel segments, and the moment of peak deviation within the window is located, so that the warning signal carries traceable device, time, and deviation information, facilitating subsequent linkage disposal and recording.

Brief Description of the Drawings

[0020] Figures 1 to 9 illustrate the online anomaly early warning method for key equipment in a tobacco processing workshop based on diffusion-based reconstruction anomaly detection of the present invention:
Figure 1 is the flow chart of the method;
Figure 2 is the flow chart of dual-channel sound segment acquisition and association;
Figure 3 is the flow chart of conditional voiceprint sample generation;
Figure 4 is the flow chart of conditional diffusion reconstruction model training;
Figure 5 is the flow chart of online reconstruction and reconstruction deviation calculation;
Figure 6 is the flow chart of warning threshold and duration threshold judgment;
Figure 7 is the flow chart of warning signal generation and output;
Figure 8 is the schematic diagram of feature generation;
Figure 9 is the schematic diagram of reconstruction difference comparison.

Detailed Description of the Embodiments

[0021] Example 1 describes an online anomaly early warning method for key equipment in a tobacco processing workshop based on diffusion-based reconstruction anomaly detection. Referring to Figure 1, the method includes: S1. Collect sound segments from the device channel and the reference channel of the target key equipment in the tobacco processing workshop to form a dual-channel sound segment stream, and associate the target key equipment identifier and the acquisition time with the dual-channel sound segment stream.

[0022] Referring to Figure 2, step S1 is as follows: When deploying an online anomaly early warning system in a tobacco processing workshop, a device channel acquisition unit and a reference channel acquisition unit are set up for each target key piece of equipment, and each target key piece of equipment is assigned an identifier. The device channel acquisition unit is installed on the body or bracket of the target key equipment in a near-field arrangement, with the sound pickup directed towards the main sound source area of the target key equipment, so that the device channel signal is mainly the operating sound of the target key equipment. The reference channel acquisition unit is installed at a preset reference channel acquisition position. This position is fixed relative to the target key equipment and avoids the direction of its main sound source, so that the reference channel signal is mainly the background sound of the workshop, thereby providing an independent acoustic input for the subsequent background sound condition extraction.

[0023] To ensure that the reference channel audio segment and the device channel audio segment enter the feature generation link in pairs at the same acquisition time, a unified clock is used to synchronously trigger the two channel acquisition links. The unified clock is provided by the synchronization trigger controller, which simultaneously outputs trigger signals to the device channel acquisition unit and the reference channel acquisition unit, and generates the acquisition time at the moment of triggering. At each acquisition time, the device channel acquisition unit outputs the device channel audio segment and the reference channel acquisition unit outputs the reference channel audio segment. Through synchronous triggering, the two segments share the same acquisition time marker, which avoids pairing a device voiceprint representation with a background sound state from a different moment. Here, "feature generation link" refers to the subsequent feature extraction process in step S2, which includes extracting the 64-dimensional background sound condition from the reference channel audio segment, extracting the 128×96 device voiceprint representation from the device channel audio segment, and forming the conditional voiceprint sample.

[0024] During continuous acquisition, the two acquisition units sample at a preset sampling rate $f_s$ and output continuous digital audio streams, and the two continuous streams are divided into frames. The framing parameters are the frame length $N_w$ and the frame shift $N_h$, and consecutive frames are combined into a sound segment according to the preset segment length $T_{\mathrm{seg}}$. Specifically, the starting position of the segment is taken as the start of the first frame, and the $N_w$ sampling points from that position form the first frame; the window then slides backward by the frame shift $N_h$ to form each subsequent frame until the sampling range corresponding to $T_{\mathrm{seg}}$ is covered, which gives the number of time frames $N_f$ contained in the sound segment. $N_w$, $N_h$, and $T_{\mathrm{seg}}$ are configured so that each sound segment yields 96 time frames after time-frequency analysis, matching the 128×96 time-frequency structure of the subsequent device voiceprint representation. Segment boundaries are indexed by the acquisition time $t$, so that each device channel sound segment $x^{\mathrm{dev}}_t$ covers device channel data of a fixed duration, each reference channel sound segment $x^{\mathrm{ref}}_t$ covers reference channel data of the same duration, and every sound segment carries its acquisition time $t$.
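The relationship between frame length, frame shift, and the 96 time frames per segment can be checked with a small calculation. The sketch below is illustrative only: the sampling rate, frame length, and frame shift are hypothetical values that the source does not specify, and `frame_count` is a helper name introduced here.

```python
def frame_count(segment_samples: int, frame_len: int, frame_shift: int) -> int:
    """Number of complete frames obtained by sliding a window over one segment."""
    if segment_samples < frame_len:
        return 0
    return 1 + (segment_samples - frame_len) // frame_shift


# Hypothetical parameters (not given in the source).
SAMPLE_RATE = 16_000   # samples per second
FRAME_LEN = 1024       # samples per frame
FRAME_SHIFT = 512      # samples between frame starts

# Segment length chosen so that exactly 96 complete frames fit.
SEGMENT_SAMPLES = FRAME_LEN + 95 * FRAME_SHIFT
SEGMENT_SECONDS = SEGMENT_SAMPLES / SAMPLE_RATE
```

With these assumed values a segment spans 49,664 samples (about 3.1 s); other combinations work equally well as long as the frame count evaluates to 96.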

[0025] After the segments are formed, the device channel sound segment stream and the reference channel sound segment stream are merged by matching acquisition time $t$ to generate a dual-channel sound segment stream. The dual-channel sound segment at acquisition time $t$ is denoted $D_t$, where $D_t$ includes the target key equipment identifier $u$, the acquisition time $t$, the device channel sound segment $x^{\mathrm{dev}}_t$, and the reference channel sound segment $x^{\mathrm{ref}}_t$. Here $x^{\mathrm{dev}}_t$ is used for the subsequent extraction of the device voiceprint representation, and $x^{\mathrm{ref}}_t$ is used for the subsequent extraction of the background sound condition. This structured organization of the dual-channel sound segment stream allows the background sound information of the reference channel to be carried as an independent field throughout the construction of conditional voiceprint samples and the inference of the conditional diffusion reconstruction model, providing a stable source of conditional input under the consistency training constraint.
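The merging step can be pictured as a join of the two segment streams on acquisition time. The following is a minimal sketch, assuming each stream is held as a dictionary keyed by acquisition time; the class name, field names, and the choice to drop unpaired segments are assumptions made for illustration and are not stated in the source.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class DualChannelSegment:
    """One record of the dual-channel sound segment stream (hypothetical layout)."""
    equipment_id: str
    acquisition_time: float
    device_segment: np.ndarray      # near-field device channel samples
    reference_segment: np.ndarray   # reference (background) channel samples


def merge_by_time(equipment_id, device_stream, reference_stream):
    """Pair segments that share an acquisition time, in ascending time order.

    Both streams map acquisition time -> sample array. Segments without a
    partner in the other stream are dropped (an assumption of this sketch).
    """
    shared_times = sorted(device_stream.keys() & reference_stream.keys())
    return [
        DualChannelSegment(equipment_id, t, device_stream[t], reference_stream[t])
        for t in shared_times
    ]
```

Keeping the reference segment as its own field mirrors the role described above: the background information travels with the record as an independent field.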

[0026] S2. Based on the dual-channel audio segment stream, extract background sound conditions from the reference channel audio segment, extract device voiceprint representation from the device channel audio segment, and combine the background sound conditions with the device voiceprint representation to form a conditional voiceprint sample.

[0027] In this embodiment, referring to Figure 3, step S2 proceeds as follows. When conditional voiceprint samples are generated, a paired reference channel sound segment and device channel sound segment are read from the dual-channel sound segment stream according to acquisition time. The acquisition time is denoted $t$, the reference channel sound segment is denoted $x^{\mathrm{ref}}_t$, and the device channel sound segment is denoted $x^{\mathrm{dev}}_t$. $x^{\mathrm{ref}}_t$ and $x^{\mathrm{dev}}_t$ are fed into the same feature generation link, which ensures that the background sound condition and the device voiceprint representation obtained afterwards correspond to the same acquisition time $t$.

[0028] Time-frequency analysis is performed on $x^{\mathrm{ref}}_t$ to obtain the time-frequency feature map of the reference channel. The analysis uses a fixed frame length and frame shift to divide $x^{\mathrm{ref}}_t$ into frames, applies a window function to each frame, performs a Discrete Fourier Transform (DFT) to obtain the spectral amplitude, squares the spectral amplitude to obtain the spectral energy, maps the spectral energy through a filter bank of 128 frequency bands to obtain the energy values of those 128 bands, and stacks these values in chronological order over 96 time frames. The resulting time-frequency feature map of the reference channel is denoted $R_t$, where $R_t$ is a two-dimensional 128×96 array whose frequency band dimension corresponds to the 128 frequency bands and whose time dimension corresponds to the 96 time frames.
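The analysis chain described above (framing, windowing, DFT, squared magnitude, 128-band filter bank, stacking 96 frames) can be sketched with NumPy. The source does not state the window type, the frame parameters, or the filter bank shape, so the Hann window, the 1024/512 framing, and the rectangular bands of contiguous DFT bins used below are all assumptions; `band_energy_map` is a hypothetical helper name.

```python
import numpy as np


def band_energy_map(segment, frame_len=1024, frame_shift=512,
                    n_bands=128, n_frames=96):
    """Return an (n_bands, n_frames) band-energy map for one sound segment."""
    segment = np.asarray(segment, dtype=float)
    needed = frame_len + (n_frames - 1) * frame_shift
    if segment.size < needed:
        raise ValueError(f"segment needs at least {needed} samples")

    # Index matrix of shape (n_frames, frame_len): one row per frame.
    starts = frame_shift * np.arange(n_frames)[:, None]
    frames = segment[starts + np.arange(frame_len)[None, :]]

    # Window each frame, take the DFT, and square the magnitude.
    windowed = frames * np.hanning(frame_len)
    power = np.abs(np.fft.rfft(windowed, axis=1)) ** 2   # (n_frames, bins)

    # Assumed filter bank: contiguous, non-overlapping groups of DFT bins.
    bin_groups = np.array_split(np.arange(power.shape[1]), n_bands)
    return np.stack([power[:, g].sum(axis=1) for g in bin_groups], axis=0)
```

The same function could be applied to both the reference channel and the device channel segments, since the text calls for identical frame length, frame shift, and filter bank on both.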

[0029] Along the frequency band direction, the time-frequency feature map of the reference channel is divided into groups of 4 frequency bands to obtain 32 frequency band groups, and the average energy of each frequency band group over the 96 time frames is calculated to obtain a 32-dimensional energy distribution vector. Along the time frame direction, the time-frequency feature map of the reference channel is divided into 16 time periods of 6 time frames each, and the total energy change in each time period is calculated to obtain a 16-dimensional rhythm change vector.

[0030] After averaging the time-frequency feature map of the reference channel along the time frame direction, the mean spectrum of the frequency band is obtained. Then, the mean spectrum of the frequency band is divided into 16 frequency band segments in groups of 8, and the energy proportion of each frequency band segment is extracted to obtain a 16-dimensional frequency component distribution vector. By splicing the 32-dimensional energy distribution vector, the 16-dimensional rhythm change vector, and the 16-dimensional frequency component distribution vector in a fixed order, a 64-dimensional background sound condition is formed.
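The three feature groups can be computed directly from the 128×96 reference map. In the sketch below, the 32 band groups contain 4 bands each (128 / 32), the 16 time periods contain 6 frames each (96 / 16), and the 16 band segments contain 8 bands each. The text does not spell out how "total energy change" within a period is measured; the difference between the last and first frame energy of each period is an assumption of this sketch, as is the function name.

```python
import numpy as np


def background_condition(ref_map):
    """Build the raw (unnormalized) 64-dimensional background sound condition."""
    ref_map = np.asarray(ref_map, dtype=float)
    if ref_map.shape != (128, 96):
        raise ValueError("expected a 128x96 time-frequency map")

    # 32-dim energy distribution: mean over 4 bands x 96 frames per group.
    energy = ref_map.reshape(32, 4, 96).mean(axis=(1, 2))

    # 16-dim rhythm change from per-frame total energy, 6 frames per period.
    frame_energy = ref_map.sum(axis=0).reshape(16, 6)
    # ASSUMPTION: change = last frame energy minus first frame energy.
    rhythm = frame_energy[:, -1] - frame_energy[:, 0]

    # 16-dim frequency component distribution: share of each 8-band segment
    # in the time-averaged spectrum.
    mean_spectrum = ref_map.mean(axis=1)
    segment_energy = mean_spectrum.reshape(16, 8).sum(axis=1)
    total = segment_energy.sum()
    freq_share = segment_energy / total if total > 0 else np.zeros(16)

    # Fixed concatenation order: energy, rhythm, frequency components.
    return np.concatenate([energy, rhythm, freq_share])
```

For a constant map the energy entries equal the constant, the rhythm entries are zero, and each frequency share is 1/16, which is a convenient sanity check.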

[0031] As shown in Figure 8, the dual-channel sound segments enter the feature generation link at the same acquisition time $t$. Time-frequency analysis is performed on the reference channel sound segment and the device channel sound segment separately to obtain the 128×96 representations $R_t$ and $V_t$. The energy distribution, rhythm change, and frequency components are then extracted from $R_t$, and concatenation and normalization yield $c_t$; $c_t$ and $V_t$ together constitute the conditional voiceprint sample. Figure 8 directly reflects the decoupling of the "background sound condition" from the "device voiceprint representation", which reduces the impact of background fluctuations on subsequent reconstruction and scoring and improves stability under different background sound conditions.

[0032] Based on $R_t$, the 64-dimensional background sound condition is constructed according to equation (1), so that it simultaneously contains the energy distribution vector, the rhythm change vector, and the frequency component distribution vector:

$$e_{t,g}=\frac{1}{4\cdot 96}\sum_{f=4(g-1)+1}^{4g}\ \sum_{n=1}^{96}R_t(f,n),\qquad E_t(n)=\sum_{f=1}^{128}R_t(f,n),\qquad \bar R_t(f)=\frac{1}{96}\sum_{n=1}^{96}R_t(f,n),$$

$$q_{t,j}=\frac{\sum_{f=8(j-1)+1}^{8j}\bar R_t(f)}{\sum_{f=1}^{128}\bar R_t(f)},\qquad c_t=\big[\,e_t;\ r_t;\ q_t\,\big] \tag{1}$$

In equation (1), $t$ is the acquisition time; $x^{\mathrm{ref}}_t$ is the reference channel sound segment at acquisition time $t$; $R_t$ is the time-frequency feature map of the reference channel obtained from $x^{\mathrm{ref}}_t$ by time-frequency analysis; $R_t(f,n)$ is the energy value of $R_t$ at the $f$-th frequency band and the $n$-th time frame. $e_{t,g}$ is the average energy of the $g$-th frequency band group over the ninety-six time frames at acquisition time $t$, and the thirty-two values $e_{t,g}$, arranged in ascending order of $g$, form the thirty-two-dimensional energy distribution vector $e_t$. $E_t(n)$ is the total energy of the $n$-th time frame at acquisition time $t$; $r_{t,p}$ is the total energy change in the $p$-th time period, computed from the total energies $E_t(n)$ of the six time frames in that period, and the sixteen values $r_{t,p}$, arranged in ascending order of $p$, form the sixteen-dimensional rhythm change vector $r_t$. $\bar R_t(f)$ is the average energy of the $f$-th frequency band over the ninety-six time frames at acquisition time $t$; $q_{t,j}$ is the energy proportion of the $j$-th frequency band segment, and the sixteen values $q_{t,j}$, arranged in ascending order of $j$, form the sixteen-dimensional frequency component distribution vector $q_t$. The 64-dimensional background sound condition $c_t$ is obtained by concatenating $e_t$, $r_t$, and $q_t$ in a fixed order. $f$ is the frequency band index, $n$ is the time frame index, $g$ is the frequency band group index, $p$ is the time period index, and $j$ is the frequency band segment index.

[0033] To prevent differences in numerical scale between acquisition times from destabilizing the input to the subsequent background sound condition encoding unit, amplitude normalization is applied to $c_t$ using a dimension-wise linear mapping. Each dimension is mapped using the minimum and maximum values of that dimension observed during the historical normal operation period, so that each normalized dimension falls within a preset numerical range. When the minimum and maximum values of a dimension are equal, so that the denominator would be zero, the normalization result of that dimension is set to zero. This yields the sixty-four-dimensional background sound condition used as the conditional input of the conditional diffusion reconstruction model.
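A dimension-wise min-max mapping with the zero-denominator rule can be written in a few lines. This is a sketch under assumptions: the target range is taken as [0, 1] (the source only says "a preset numerical range"), and no clipping is applied to online values that fall outside the historical range, since the text does not mention it.

```python
import numpy as np


def normalize_condition(raw, hist_min, hist_max, low=0.0, high=1.0):
    """Map each dimension linearly from [hist_min, hist_max] to [low, high].

    Dimensions whose historical minimum equals the historical maximum are
    set to zero, following the zero-denominator rule in the text.
    """
    raw = np.asarray(raw, dtype=float)
    hist_min = np.asarray(hist_min, dtype=float)
    hist_max = np.asarray(hist_max, dtype=float)

    span = hist_max - hist_min
    degenerate = span == 0
    safe_span = np.where(degenerate, 1.0, span)   # avoid division by zero

    scaled = low + (raw - hist_min) / safe_span * (high - low)
    return np.where(degenerate, 0.0, scaled)
```

For example, a raw value of 5 in a dimension with historical range [0, 10] maps to 0.5, while any value in a dimension with range [2, 2] maps to 0.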

[0034] The same time-frequency analysis procedure applied to $x^{\mathrm{ref}}_t$ is applied to $x^{\mathrm{dev}}_t$, using the same frame length, frame shift, and 128-band filter bank. This maps $x^{\mathrm{dev}}_t$ to a device voiceprint representation comprising 128 frequency bands and 96 time frames, denoted $V_t$, where $V_t$ is a two-dimensional 128×96 array representing the time-frequency structure of the operating sound of the target key equipment body at acquisition time $t$. Finally, $c_t$ and $V_t$ are combined by acquisition time $t$ to form the conditional voiceprint sample, denoted $S_t=(c_t, V_t)$. $S_t$ serves as both the training input and the online inference input of the consistency-trained diffusion reconstruction network, so that the background sound condition enters the conditional diffusion reconstruction link as an independent condition field.

[0035] S3. A diffusion reconstruction network with consistent training is trained based on conditional voiceprint samples from historical normal operation periods. The diffusion reconstruction network takes background sound conditions as input and outputs the reconstructed device voiceprint representation. Under the consistency training constraint, the diffusion reconstruction network produces consistent reconstruction results for the normal device voiceprint representation under the same background sound conditions, thus obtaining the conditional diffusion reconstruction model.

[0036] In this embodiment, referring to Figure 4, step S3 proceeds as follows. When the conditional diffusion reconstruction model is trained, a training sample set is first formed from conditional voiceprint samples generated during historical normal operation periods. Each conditional voiceprint sample in the training sample set contains a 64-dimensional background sound condition and a 128×96 device voiceprint representation. The $i$-th conditional voiceprint sample in the training sample set is denoted $S_i=(c_i, V_i)$. The background sound condition $c_i$ is a 64-dimensional vector obtained by concatenating a 32-dimensional energy distribution vector, a 16-dimensional rhythm change vector, and a 16-dimensional frequency component distribution vector in a fixed order. The device voiceprint representation $V_i$ is a two-dimensional 128×96 time-frequency feature map. Because the training sample set consists only of normal operation data, the diffusion reconstruction network learns the characteristics of the normal device voiceprint representation under given background sound conditions.

[0037] The training sample set is read in batches, and the number of samples in each training batch is denoted $B$. For each conditional voiceprint sample, a diffusion step number $k$ is generated, with values ranging from 1 to $K$, where $K$ is the preset number of diffusion steps. The diffusion step number $k$ is encoded as a 128-dimensional diffusion step number vector, denoted $s_k$. The encoding uses a trainable embedding table that maps each diffusion step number $k$ to a fixed-length 128-dimensional vector, which serves as the input to the diffusion step number encoding unit.

[0038] The background sound condition $c_i$ is input to the background sound condition encoding unit, which consists of two fully connected layers, each with 128 output neurons. A normalization layer and a nonlinear activation layer follow each fully connected layer, and the unit outputs a 128-dimensional condition code, denoted $h_c$. The diffusion step number vector $s_k$ is input to the diffusion step number encoding unit, which consists of one fully connected layer with 128 output neurons and outputs a 128-dimensional step code, denoted $h_k$. $h_c$ and $h_k$ are concatenated along the feature dimension to obtain a 256-dimensional reconstruction condition vector, denoted $z$, which serves as the conditional input of the voiceprint reconstruction unit.
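The two encoding units and the concatenation into a 256-dimensional vector can be sketched as plain NumPy forward passes. Several details are assumptions of this sketch because the source leaves them open: layer normalization as the "normalization layer", ReLU as the "nonlinear activation", step numbers running from 1 to K, the value K = 50, and random (untrained) weights. All function and key names are hypothetical.

```python
import numpy as np

K = 50  # hypothetical preset number of diffusion steps


def layer_norm(x, eps=1e-5):
    return (x - x.mean()) / np.sqrt(x.var() + eps)


def relu(x):
    return np.maximum(x, 0.0)


def init_params(rng):
    """Random stand-ins for the trainable parameters."""
    def dense(n_out, n_in):
        return rng.standard_normal((n_out, n_in)) / np.sqrt(n_in), np.zeros(n_out)
    return {
        "embed": rng.standard_normal((K, 128)),  # step embedding table
        "fc1": dense(128, 64),                   # condition encoder, layer 1
        "fc2": dense(128, 128),                  # condition encoder, layer 2
        "step_fc": dense(128, 128),              # step encoder
    }


def reconstruction_condition(cond, step, params):
    """Return the 256-dim vector [condition code ; step code]."""
    w1, b1 = params["fc1"]
    w2, b2 = params["fc2"]
    ws, bs = params["step_fc"]
    h = relu(layer_norm(w1 @ cond + b1))
    cond_code = relu(layer_norm(w2 @ h + b2))          # 128-dim
    step_code = ws @ params["embed"][step - 1] + bs    # 128-dim
    return np.concatenate([cond_code, step_code])
```

In a real system these layers would be trained jointly with the reconstruction unit; the sketch only illustrates the shapes flowing through the two encoders.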

[0039] During each forward training iteration, noise is injected into the device voiceprint representation $V_i$ according to the diffusion step number $k$ to generate a noisy device voiceprint representation, denoted $\tilde V_i^{(k)}$. The injected noise amplitude is controlled by a preset noise intensity sequence containing $K$ noise intensity scalars, each corresponding to one diffusion step number. A random noise matrix with the same dimensions as $V_i$ is generated, scaled point by point by the noise intensity scalar $\sigma_k$ of the current step, and added point by point to $V_i$ to obtain $\tilde V_i^{(k)}$. $\tilde V_i^{(k)}$ and $z$ are then input to the voiceprint reconstruction unit, which adopts a two-dimensional convolutional encoder-decoder structure without a self-attention mechanism. It includes three downsampling paths, three upsampling paths, and skip connections. The first convolutional block has 64 output channels, the second 128, the third 256, the fourth 128, the fifth 64, and the sixth 64. Each convolutional block consists of two two-dimensional convolutional layers, two normalization layers, and two nonlinear activation layers, with a residual connection spanning the block: within each convolutional block, the block input is added point by point to the convolutional output at the end of the block to form the block's final output, which stabilizes deep training.
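The noise injection described above is additive: a random matrix of the same shape is scaled by the step's noise intensity and added to the voiceprint. The sketch below assumes Gaussian noise, a linearly increasing intensity schedule, and 1-based step numbers, none of which is specified in the source; `add_noise` is a hypothetical helper.

```python
import numpy as np


def add_noise(voiceprint, step, sigmas, rng):
    """Return voiceprint + sigma_step * noise, with 1-based step numbers."""
    voiceprint = np.asarray(voiceprint, dtype=float)
    noise = rng.standard_normal(voiceprint.shape)   # same shape as the input
    return voiceprint + sigmas[step - 1] * noise


# Hypothetical schedule: 50 steps with linearly increasing intensity.
SIGMAS = np.linspace(0.01, 1.0, 50)
```

A larger step number therefore produces a more heavily perturbed input for the voiceprint reconstruction unit to restore.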

[0040] The 128×96 two-dimensional time-frequency feature map input to the voiceprint reconstruction unit first enters the first convolutional block of the three-layer downsampling path, where a feature map is obtained and downsampled once. It then passes through the second and third convolutional blocks in sequence, with one downsampling in each block, while the outputs of the first and second convolutional blocks are retained for skip connections. The upsampling path starts from the output of the third convolutional block: this output is first upsampled to the scale of the second convolutional block's output, fused with that output along the channel dimension, and input to the fourth convolutional block; the result is then upsampled to the scale of the first convolutional block's output, fused with that output along the channel dimension, and input to the fifth convolutional block; the sixth convolutional block further reconstructs the fused features at the highest resolution. In this way, convolution and downsampling are performed layer by layer along the downsampling path, and feature maps are output at each scale.

[0041] In this embodiment, each convolutional block of the three downsampling paths outputs a feature map at the corresponding scale. The output used for further downsampling serves as the input to the next layer, while a feature map is retained for skip connections with the three upsampling paths. The three upsampling paths upsample layer by layer, starting from the deepest feature map. At each layer, the upsampled feature map is fused with the feature map retained by the corresponding downsampling layer along the channel dimension before entering the convolutional block of that layer, outputting a feature map that gradually restores the resolution. Here, the normalized output of the convolutional block is an intermediate result within the convolutional block. After "adding a conditional bias vector," it enters nonlinear activation and forms the output of the convolutional block. Therefore, the relationship between the normalized output and the outputs of the upsampling and downsampling paths is that of "internal intermediate state and final output of the layer."

[0042] The conditional bias vector is added at the same position in all six convolutional blocks. To condition the reconstruction, a fully connected mapping is applied to the reconstruction condition vector $z$ within each convolutional block to obtain a conditional bias vector $b_\ell$ whose length matches the number of output channels of that block, where $\ell$ is the convolutional block number. After being broadcast over the 96 time frames, $b_\ell$ is added to the normalized output of the convolutional block (before the nonlinear activation) to obtain the intermediate reconstruction feature for the current diffusion step number, i.e., the feature map output by the voiceprint reconstruction unit at that diffusion step number. This introduces the conditional constraints of background sound and diffusion step number at every scale of the downsampling and upsampling paths. The intermediate reconstruction feature output by the sixth convolutional block is then input to the output mapping unit, which uses a 1×1 two-dimensional convolutional layer to map the number of output channels to 128 while keeping the 96 time frames, yielding the reconstructed device voiceprint representation, denoted $\hat V_i^{(k)}$. Broadcasting here means that the conditional bias vector, obtained by mapping the reconstruction condition vector to the number of channels of the convolutional block, is repeated at all frequency band positions and all time frame positions so that its size matches the normalized output of the convolutional block, after which it is added point by point to the normalized output. Its sole function is to apply the same conditional bias at every frequency band and time frame position; it does not change the numerical meaning of the conditional bias vector itself.
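The broadcasting described above corresponds to adding a per-channel bias to a channel-first feature map. The sketch below assumes a (channels, bands, frames) layout and uses a plain matrix product as the fully connected mapping; the function name and argument layout are hypothetical.

```python
import numpy as np


def add_condition_bias(normalized, cond_vector, weight, bias):
    """Add a per-channel conditional bias to a block's normalized output.

    normalized:  (C, F, T) normalized feature map of one convolutional block
    cond_vector: (256,) reconstruction condition vector
    weight:      (C, 256) fully connected weights for this block
    bias:        (C,) fully connected bias for this block
    """
    channel_bias = weight @ cond_vector + bias          # shape (C,)
    # NumPy broadcasting repeats the bias over all band and frame positions.
    return normalized + channel_bias[:, None, None]
```

Because the same scalar is added at every band and frame position of a channel, the operation shifts each channel uniformly and leaves the bias values themselves unchanged.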

[0043] The reconstruction training constraint and the consistency training constraint are combined within the batch into a joint training objective $\mathcal{L}$, as shown in equation (2), and $\mathcal{L}$ is backpropagated to update the parameters of the diffusion reconstruction network:

$$\mathcal{L}=\lambda_{\mathrm{rec}}\cdot\frac{1}{B}\sum_{i=1}^{B}\frac{1}{128\cdot 96}\sum_{f=1}^{128}\sum_{n=1}^{96}\left|V_i(f,n)-\hat V_i^{(k)}(f,n)\right|+\lambda_{\mathrm{con}}\cdot\frac{\displaystyle\sum_{i=1}^{B}\sum_{j\neq i}\mathbb{1}\!\left(\lVert c_i-c_j\rVert_2\le\delta\right)\frac{1}{128\cdot 96}\sum_{f=1}^{128}\sum_{n=1}^{96}\left|\hat V_i^{(k)}(f,n)-\hat V_j^{(k)}(f,n)\right|}{\displaystyle\sum_{i=1}^{B}\sum_{j\neq i}\mathbb{1}\!\left(\lVert c_i-c_j\rVert_2\le\delta\right)} \tag{2}$$

In equation (2), $\lambda_{\mathrm{rec}}$ is the weight of the reconstruction training constraint; $\lambda_{\mathrm{con}}$ is the weight of the consistency training constraint; $B$ is the number of samples in the batch; $i$ and $j$ are sample indices within the batch; $c_i$ and $c_j$ are the sixty-four-dimensional background sound conditions of the $i$-th and $j$-th samples; $\delta$ is the vector distance threshold for background sound conditions; $\lVert\cdot\rVert_2$ denotes the Euclidean distance; $\mathbb{1}(\cdot)$ is an indicator function that takes the value 1 when the condition in parentheses holds and 0 otherwise. $V_i(f,n)$ is the value of the device voiceprint representation of the $i$-th sample at the $f$-th frequency band and the $n$-th time frame; $\hat V_i^{(k)}(f,n)$ is the value of the reconstructed device voiceprint representation output for the $i$-th sample at diffusion step number $k$ at the $f$-th frequency band and the $n$-th time frame; $\hat V_j^{(k)}(f,n)$ is the corresponding value for the $j$-th sample at the same diffusion step number $k$. $f$ is the frequency band index, ranging from 1 to 128; $n$ is the time frame index, ranging from 1 to 96; $|\cdot|$ denotes the absolute value; and $k$ is the diffusion step number.

[0044] When no sample pair in the batch satisfies the background sound condition distance threshold, so that the denominator of the consistency term would be zero, the consistency training constraint term is set to zero and only the reconstruction training constraint term is used to update the parameters. A gradient descent optimizer iteratively updates the parameters of the diffusion reconstruction network until the preset number of training rounds is reached or the joint training objective converges, yielding the trained conditional diffusion reconstruction model.
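The joint objective and its zero-denominator rule can be illustrated with a NumPy evaluation over one batch. This sketch makes several assumptions that the source does not confirm: absolute differences are averaged over all band and frame positions, the distance test is "less than or equal to", and pairs are counted once (i < j), which gives the same ratio as counting ordered pairs because the pair terms are symmetric. `joint_objective` is a hypothetical name, and gradient computation is left to whatever training framework is used.

```python
import numpy as np


def joint_objective(targets, recons, conds, w_rec, w_con, dist_threshold):
    """Weighted sum of a reconstruction term and a consistency term.

    targets, recons: (B, 128, 96) device voiceprints and reconstructions
    conds:           (B, 64) background sound conditions
    """
    targets = np.asarray(targets, dtype=float)
    recons = np.asarray(recons, dtype=float)
    conds = np.asarray(conds, dtype=float)

    # Reconstruction term: mean absolute error over batch, bands, frames.
    rec_term = np.abs(targets - recons).mean()

    # Consistency term: mean absolute difference between reconstructions of
    # sample pairs whose background conditions are close.
    pair_sum, pair_count = 0.0, 0
    batch = conds.shape[0]
    for i in range(batch):
        for j in range(i + 1, batch):
            if np.linalg.norm(conds[i] - conds[j]) <= dist_threshold:
                pair_sum += np.abs(recons[i] - recons[j]).mean()
                pair_count += 1

    # No qualifying pair: the consistency term is set to zero.
    con_term = pair_sum / pair_count if pair_count else 0.0
    return float(w_rec * rec_term + w_con * con_term)
```

The consistency term penalizes the network when two samples recorded under nearly identical background conditions produce different reconstructions at the same diffusion step.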

[0045] S4. Input the real-time conditional voiceprint sample into the conditional diffusion reconstruction model to obtain the reconstructed device voiceprint representation, and calculate the reconstruction deviation between the device voiceprint representation and the reconstructed device voiceprint representation to form a reconstruction deviation sequence.

[0046] In this embodiment, referring to Figure 5, step S4 proceeds as follows. During the online inference phase, the real-time conditional voiceprint sample is read and its acquisition time is determined, denoted $t$. The real-time conditional voiceprint sample is denoted $S_t$, and the 64-dimensional background sound condition $c_t$ and the device voiceprint representation $V_t$ are parsed from $S_t$. Here $c_t$ is a 64-dimensional vector obtained by concatenating a 32-dimensional energy distribution vector, a 16-dimensional rhythm change vector, and a 16-dimensional frequency component distribution vector in a fixed order, and $V_t$ is a two-dimensional 128×96 time-frequency feature map corresponding to 128 frequency bands and 96 time frames.

[0047] A diffusion step number sequence is generated based on the preset number of diffusion steps, denoted $K$. The $j$-th diffusion step number in the sequence is denoted $k_j$, where $j$ takes values from 1 to $K$ in sequence according to the preset number of diffusion steps. Each diffusion step number $k_j$ is mapped through the trainable embedding table to a 128-dimensional diffusion step number vector, denoted $s_j$, which serves as the diffusion step number input of the conditional diffusion reconstruction model.

[0048] Conditional encoding is performed on the background sound condition and the diffusion step number. $c_t$ is input to the background sound condition encoding unit, which consists of two fully connected layers, each with 128 output neurons; a normalization layer and a nonlinear activation layer follow each fully connected layer, and the unit outputs a 128-dimensional condition code, denoted $h_c$. $s_j$ is input to the diffusion step number encoding unit, which consists of one fully connected layer with 128 output neurons and outputs a 128-dimensional step code, denoted $h_j$. $h_c$ and $h_j$ are concatenated along the feature dimension to obtain a 256-dimensional reconstruction condition vector, denoted $z_j$.

[0049] The two-dimensional convolutional encoding and decoding of the voiceprint reconstruction unit is performed using the device voiceprint representation and the reconstruction condition vector, and the reconstructed device voiceprint representation is updated iteratively along the diffusion step number sequence. The voiceprint input of the $j$-th iteration is denoted $Y^{(j)}$, and $Y^{(1)}$ is initialized to $V_t$. In the $j$-th iteration, $Y^{(j)}$ and $z_j$ are input to the voiceprint reconstruction unit, which encodes and decodes the data through a three-layer downsampling path and a three-layer upsampling path and fuses features of corresponding scales through skip connections. Within each convolutional block, the voiceprint reconstruction unit applies a fully connected mapping to $z_j$ to obtain a conditional bias vector with the same number of elements as the block's output channels, broadcasts this conditional bias vector over the ninety-six time frames, and adds it to the normalized output of the convolutional block to obtain the intermediate reconstruction feature for that diffusion step number, denoted $H^{(j)}$. $H^{(j)}$ is input to the output mapping unit, which uses a single 1×1 two-dimensional convolutional layer to map the number of channels to 128 while keeping the 96 time frames, and outputs the reconstructed device voiceprint representation, denoted $\hat V_t^{(j)}$. $\hat V_t^{(j)}$ is used as the voiceprint input of the next iteration, $Y^{(j+1)}=\hat V_t^{(j)}$, and $j$ is incremented by one until the last diffusion step number $k_K$ in the sequence has been processed. The output $\hat V_t^{(K)}$ is taken as the final reconstructed device voiceprint representation, denoted $\hat V_t$.
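The iteration pattern, in which each step's output becomes the next step's input, can be sketched independently of the network itself. In the sketch below the trained reconstruction unit is replaced by a caller-supplied function, steps are assumed to be visited in ascending order from 1, and all names are hypothetical.

```python
import numpy as np


def iterative_reconstruct(voiceprint, cond, n_steps, reconstruct_step):
    """Run the reconstruction unit once per diffusion step number.

    reconstruct_step(current, cond, step) stands in for the trained
    voiceprint reconstruction unit plus output mapping unit.
    """
    current = np.asarray(voiceprint, dtype=float)   # initialised to the input
    for step in range(1, n_steps + 1):              # ASSUMPTION: ascending order
        current = reconstruct_step(current, cond, step)
    return current                                   # final reconstruction


def toy_step(current, cond, step):
    """Toy stand-in: pull the map halfway toward the condition mean."""
    return 0.5 * current + 0.5 * cond.mean()
```

With the toy stand-in, repeated application converges toward a constant map, which only demonstrates the control flow and says nothing about the behaviour of the real model.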

[0050] Figure 9 illustrates the reconstruction difference under the same background sound condition. In the normal case, the input voiceprint $V_t$ is passed through the conditional diffusion reconstruction model to obtain the reconstructed voiceprint $\hat V_t$, and the overall difference between the two, $|V_t-\hat V_t|$, is relatively small. In the abnormal case, the difference increases markedly across multiple frequency bands and time frames, which significantly raises the reconstruction deviation and triggers the continuous judgment. Figure 9 highlights that the invention uses the "reconstruction deviation" to characterize the structural deviation of the device voiceprint, and that scores obtained under a given $c_t$ are more comparable, which helps reduce false alarms and missed alarms.

[0051] The difference amplitude between the device voiceprint representation and the final reconstructed device voiceprint representation is calculated point by point over frequency bands and time frames, and all difference amplitudes are averaged to obtain the reconstruction deviation at that acquisition time. So that this averaging explicitly references the energy distribution information of the background sound condition, the first thirty-two components of $c_t$ are extracted in frequency band group order to form the energy distribution vector $e_t$, frequency band weights are constructed from $e_t$, and the point-by-point difference amplitudes are weighted to obtain the reconstruction deviation, as shown in equation (3):

$$d_t=\frac{1}{96}\sum_{f=1}^{128}\sum_{n=1}^{96}w_f\left|V_t(f,n)-\hat V_t(f,n)\right|,\qquad w_f=\frac{\tilde w_f}{\sum_{f'=1}^{128}\tilde w_{f'}},\qquad \tilde w_f=\frac{1}{e_{t,g(f)}+\varepsilon},\qquad g(f)=\left\lfloor\frac{f-1}{4}\right\rfloor+1 \tag{3}$$

In equation (3), $d_t$ is the reconstruction deviation at acquisition time $t$; $V_t(f,n)$ is the value of the device voiceprint representation at the $f$-th frequency band and the $n$-th time frame; $\hat V_t(f,n)$ is the value of the final reconstructed device voiceprint representation at the $f$-th frequency band and the $n$-th time frame; $f$ is the frequency band index, ranging from 1 to 128; $n$ is the time frame index, ranging from 1 to 96; $w_f$ is the normalized frequency band weight of the $f$-th frequency band, and the normalization ensures $\sum_{f=1}^{128}w_f=1$; $\tilde w_f$ is the unnormalized frequency band weight of the $f$-th frequency band; $f'$ is the frequency band index used in the normalization sum, ranging from 1 to 128; $e_{t,g(f)}$ is the component of the energy distribution vector $e_t$ for the frequency band group $g(f)$ determined by the frequency band index $f$; $\lfloor\cdot\rfloor$ is the floor operation; $\varepsilon$ is a preset energy bias scalar with the same unit as $e_{t,g(f)}$, used to avoid a zero denominator; and $|\cdot|$ denotes the absolute value.
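A band-weighted mean absolute difference of this kind can be computed as follows. The sketch assumes inverse-energy weights (one over the group energy plus a small bias), groups of 4 bands per energy component (128 bands / 32 components), and a hypothetical default for the bias scalar; the function name is also an assumption.

```python
import numpy as np


def reconstruction_deviation(voiceprint, recon, cond, eps=1e-3):
    """Band-weighted mean absolute difference between two 128x96 maps.

    cond is the 64-dim background sound condition; its first 32 entries
    are the energy distribution over band groups.
    """
    voiceprint = np.asarray(voiceprint, dtype=float)
    recon = np.asarray(recon, dtype=float)
    energy = np.asarray(cond, dtype=float)[:32]

    # Band f (0-based) belongs to group f // 4.
    group_of_band = np.arange(128) // 4
    raw_weights = 1.0 / (energy[group_of_band] + eps)   # ASSUMED weight form
    weights = raw_weights / raw_weights.sum()            # weights sum to 1

    diff = np.abs(voiceprint - recon)                    # (128, 96)
    # Weighted sum over bands, plain mean over the 96 frames.
    return float((weights[:, None] * diff).sum() / diff.shape[1])
```

With uniform background energy every band receives weight 1/128 and the result reduces to the ordinary mean absolute difference.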

[0052] The reconstruction deviations $d_t$ output at successive acquisition times are appended in ascending order of acquisition time to form a reconstruction deviation sequence, denoted $Q$, and $Q$ serves as the input for the subsequent warning threshold and duration threshold judgments.

[0053] S5. Under the constraints of the pre-set warning threshold and duration threshold, the reconstructed deviation sequence is judged to form an abnormal judgment event or a normal judgment event.

[0054] In this embodiment, referring to Figure 6, step S5 proceeds as follows. The reconstruction deviation sequence is read and its acquisition time order is obtained. The target key equipment is denoted $u$, where $u$ is determined by the device number bound to the device channel during deployment. The reconstruction deviation sequence is denoted $Q$ and is represented as a sequence arranged in ascending order of acquisition time, $Q=\{(t_m, d_{t_m})\}_{m=1}^{M}$, where $t_m$ is the $m$-th acquisition time, $d_{t_m}$ is the reconstruction deviation corresponding to acquisition time $t_m$, and $M$ is the current cumulative number of acquisition times.

[0055] The duration threshold is denoted $T_d$, a preset duration scalar in seconds. The collection period is denoted $\Delta t$, a scalar giving the time interval between adjacent acquisition times. The duration threshold is converted into a number of consecutive acquisition times by dividing $T_d$ by $\Delta t$ and rounding up, giving $N=\lceil T_d/\Delta t\rceil$, and $N$ is used as the continuous judgment length.

[0056] Using the acquisition time as the index, a sliding judgment window whose length equals the continuous judgment length is constructed on the reconstruction deviation sequence, and within the window each reconstruction deviation is compared with the warning threshold in turn. The warning threshold is denoted $\theta$, a scalar threshold with the same dimension as $d_t$. When acquisition time $t_m$ arrives, $m$ is compared with $N$; when $m\ge N$, the $N$ reconstruction deviations corresponding to the acquisition time interval $[t_{m-N+1}, t_m]$ constitute the sliding judgment window, denoted $W_m$. Within $W_m$, each $d_{t_i}$ is read in time order and compared with $\theta$, where $i$ is the acquisition time index within the window and ranges from $m-N+1$ to $m$.

[0057] When the reconstruction deviations at all consecutive acquisition times within the sliding judgment window exceed the warning threshold, an abnormal judgment event containing the start and end acquisition times of that window is output. Specifically, the conditional judgment is executed for all $i$ within window $W_m$; when $d_{t_i}>\theta$ holds for every $i$, an abnormal judgment event $A_m$ is constructed and output. $A_m$ includes at least the target key equipment identifier $u$, the window start acquisition time $t_{m-N+1}$, the window end acquisition time $t_m$, the continuous judgment length $N$, and the warning threshold $\theta$, and $A_m$ is written to the event queue for the early warning link to read.

[0058] When the reconstruction deviation at any acquisition time within the sliding judgment window does not exceed the warning threshold, a normal judgment event containing the current acquisition time is output. Here, "any acquisition time" means any one of the acquisition times covered by the sliding judgment window, and "current acquisition time" means the window end acquisition time (the latest acquisition time) at the moment the judgment is triggered. Specifically, as long as there exists any $i$ within window $W_m$ satisfying $d_{t_i}\le\theta$, a normal judgment event $Z_m$ is constructed and output. $Z_m$ includes at least the target key equipment identifier $u$, the current acquisition time $t_m$, the current reconstruction deviation $d_{t_m}$, the continuous judgment length $N$, and the warning threshold $\theta$, and $Z_m$ is written to the event queue for status recording. As each new acquisition time arrives, $m$ is incremented, the window is updated, and $A_m$ or $Z_m$ is output, which completes the continuous judgment of the reconstruction deviation sequence $Q$.
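The duration-to-length conversion and the window rule can be combined into a single judgment function that is called whenever a new acquisition time arrives. This is a minimal sketch: "exceeds" is read as strictly greater than the threshold, events are represented as plain dictionaries with hypothetical keys, and no event is produced until enough samples have accumulated.

```python
import math


def judge_latest(times, deviations, threshold, duration_s, period_s, equipment_id):
    """Judge the window ending at the latest acquisition time.

    times and deviations are equal-length lists in ascending time order.
    Returns None until at least N samples exist, otherwise an event dict.
    """
    # Continuous judgment length: duration threshold in acquisition counts.
    n = math.ceil(duration_s / period_s)
    if len(deviations) < n:
        return None

    win_times = times[-n:]
    win_devs = deviations[-n:]
    base = {"equipment_id": equipment_id, "length": n, "threshold": threshold}

    if all(d > threshold for d in win_devs):
        # Every deviation in the window exceeds the warning threshold.
        return {**base, "type": "abnormal",
                "start": win_times[0], "end": win_times[-1]}
    # At least one deviation does not exceed the threshold.
    return {**base, "type": "normal",
            "time": win_times[-1], "deviation": win_devs[-1]}
```

Requiring every sample in the window to exceed the threshold means a single isolated spike yields a normal event, which is how the duration threshold suppresses short transients. Floating-point division can round unexpectedly for some non-integer periods, so integer or rational timing values are safer in practice.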

[0059] S6. Generate an early warning signal based on the abnormal event. The early warning signal includes the target key equipment identifier, the acquisition time, and the reconstruction deviation information.

[0060] In this embodiment, referring to Figure 7, step S6 proceeds as follows. The abnormal judgment event is read and the abnormal judgment window is determined: the start and end acquisition times of the abnormal judgment window are read from the event and verified. The association relationship of the dual-channel audio segment stream is a mapping table indexed by acquisition time, in which each entry contains at least an acquisition time field and the device number field of the device-channel audio segment record. The device-channel audio segment record corresponding to the acquisition time is read from the mapping table, and the target key equipment identifier is obtained from its device number field. An associated search is performed for the abnormal judgment window, and the identifier retrieved is taken as the target key equipment identifier corresponding to this abnormal judgment window and verified. If verification fails, generation of the current early warning signal is terminated and a window association failure flag is output to the early warning interface.

[0061] Based on the start and end acquisition times, a reconstruction deviation subsequence is extracted from the reconstruction deviation sequence. The reconstruction deviation sequence is represented as a sequence of key-value pairs arranged in ascending order of acquisition time, where each key is an acquisition time, each value is the reconstruction deviation at that acquisition time, and the number of pairs is the cumulative number of acquisition times. Interval truncation is performed on this sequence: all key-value pairs whose acquisition time falls within the interval from the start acquisition time to the end acquisition time are selected and arranged in increasing order of acquisition time to form the reconstruction deviation subsequence, whose length is the number of selected pairs. When the subsequence length shows that data is missing, generation of the early warning signal is terminated and a data missing flag is output to the early warning interface.

[0062] The reconstruction deviation subsequence is traversed to determine the maximum reconstruction deviation and its corresponding acquisition time. Both are first initialized; the entries of the subsequence are then read in order of the traversal index, and whenever an entry's reconstruction deviation exceeds the current maximum, the maximum reconstruction deviation is updated to that value and the corresponding acquisition time is updated to that entry's acquisition time. The maximum reconstruction deviation is taken as the reconstruction deviation information of this abnormal judgment event, and the result is stored bound to the abnormal judgment window so that the early warning signal can be located at the peak acquisition time.

[0063] The target key equipment identifier, the acquisition time, and the reconstruction deviation information are combined to generate an early warning signal, which is output to the early warning interface. The fields of the early warning signal are written in the following order: the target key equipment identifier, the peak acquisition time, the maximum reconstruction deviation, the start acquisition time of the abnormal judgment window, and the end acquisition time of the abnormal judgment window. An abnormal judgment event identifier field is added to the early warning signal to support event deduplication and tracking. The signal is sent to the early warning interface, which, on receiving it, outputs an early warning record externally according to the equipment identifier and the acquisition time.

[0064] The notation used in step S6 denotes, in order: the abnormal judgment event; the start acquisition time of the abnormal judgment window; the end acquisition time of the abnormal judgment window; the acquisition time; the dual-channel audio segment stream; the association relationship of the dual-channel audio segment stream; the target key equipment identifier corresponding to an acquisition time; the target key equipment identifier corresponding to the abnormal judgment window; the reconstruction deviation sequence; an acquisition time in the reconstruction deviation sequence; the reconstruction deviation corresponding to that acquisition time; the cumulative number of acquisition times; the reconstruction deviation subsequence corresponding to the abnormal judgment window; an acquisition time in the subsequence; the length of the subsequence; the subsequence traversal index; the maximum reconstruction deviation; the acquisition time corresponding to the maximum reconstruction deviation; and the early warning signal.
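Step S6 can likewise be sketched in Python. This is a minimal illustration under stated assumptions, not the applicant's implementation: the names `WarningSignal` and `build_warning` and the flag strings are hypothetical, the verification is assumed to check that the window start does not follow its end and that a device number exists for the start acquisition time, an empty subsequence is assumed to mean missing data, and the peak search is assumed to start from the first entry of the subsequence.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class WarningSignal:
    device_id: str        # target key equipment identifier
    peak_time: float      # acquisition time of the maximum deviation
    max_deviation: float  # maximum reconstruction deviation in the window
    window_start: float   # start acquisition time of the abnormal window
    window_end: float     # end acquisition time of the abnormal window
    event_id: str         # supports deduplication and tracking


def build_warning(event_id: str,
                  window_start: float,
                  window_end: float,
                  association: Dict[float, str],
                  deviation_seq: List[Tuple[float, float]],
                  ) -> Tuple[Optional[WarningSignal], Optional[str]]:
    """Return (signal, None) on success or (None, flag) on failure."""
    # Assumed verification: a well-formed window with a known device number.
    device_id = association.get(window_start)
    if window_start > window_end or device_id is None:
        return None, "window_association_failed"
    # Interval truncation of the time-ordered (time, deviation) pairs.
    sub = [(t, d) for t, d in deviation_seq if window_start <= t <= window_end]
    if not sub:  # assumption: an empty subsequence means missing data
        return None, "data_missing"
    # Traverse the subsequence, tracking the maximum deviation and its time.
    peak_time, max_dev = sub[0]
    for t, d in sub[1:]:
        if d > max_dev:
            peak_time, max_dev = t, d
    signal = WarningSignal(device_id, peak_time, max_dev,
                           window_start, window_end, event_id)
    return signal, None


assoc = {1: "dryer-01", 2: "dryer-01", 3: "dryer-01"}
seq = [(0, 0.1), (1, 0.9), (2, 0.95), (3, 0.92), (4, 0.2)]
signal, flag = build_warning("evt-1", 1, 3, assoc, seq)
print(signal.peak_time, signal.max_deviation)  # 2 0.95
```

Returning a flag instead of raising an exception mirrors the text, which outputs a failure flag to the early warning interface and stops generating the current signal.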

[0065] In this embodiment, an electronic device includes a memory and a processor. The memory stores a program that supports the processor in executing the above-described method, and the processor is configured to execute the program stored in the memory.

[0066] In this embodiment, a computer-readable storage medium stores a computer program, which is executed by a processor to perform the steps of the above method.

Claims

1. A method for online anomaly early warning of key equipment in a tobacco processing workshop based on diffusion-based reconstruction anomaly detection, characterized by comprising: S1. collecting device channel sound segments and reference channel sound segments of the target key equipment in the tobacco processing workshop to form a dual-channel sound segment stream, and associating the dual-channel sound segment stream with the identifier of the target key equipment and the acquisition time; S2. based on the dual-channel sound segment stream, extracting background sound conditions from the reference channel sound segments, extracting a device voiceprint representation from the device channel sound segments of the target key equipment, and combining the background sound conditions with the device voiceprint representation to form conditional voiceprint samples; S3. constructing a diffusion-based reconstruction network comprising a diffusion step number encoding unit, a background sound condition encoding unit, a voiceprint reconstruction unit, and an output mapping unit, which processes the background sound conditions to obtain a reconstructed device voiceprint representation, and using the reconstructed device voiceprint representation and the real device voiceprint representation under the same background sound conditions to construct a reconstruction training constraint and a consistency training constraint, with which the diffusion-based reconstruction network is trained to obtain a conditional diffusion reconstruction model; S4. processing real-time conditional voiceprint samples with the conditional diffusion reconstruction model to obtain an optimal reconstructed device voiceprint representation, calculating point by point over frequency bands and time frames the difference amplitude between the device voiceprint representation in the real-time conditional voiceprint sample and the optimal reconstructed device voiceprint representation, averaging all difference amplitudes to obtain the reconstruction deviation at the acquisition time, and arranging the reconstruction deviations at consecutive acquisition times in time order to form a reconstruction deviation sequence; S5. judging the reconstruction deviation sequence under the constraints of a preset warning threshold and a preset duration threshold to form an abnormal judgment event or a normal judgment event; S6. generating an early warning signal based on the abnormal judgment event, the early warning signal including the target key equipment identifier, the acquisition time, and the reconstruction deviation information.

2. The online anomaly early warning method for key equipment in a tobacco processing workshop based on diffusion-based reconstruction anomaly detection as described in claim 1, characterized in that, S1 includes: The equipment channel acquisition unit is placed in the near field of the target key equipment and pointed to the main sound source of the target key equipment. The reference channel acquisition unit is placed at the acquisition position of the preset reference channel to acquire the background sound of the workshop, thereby completing the dual-channel acquisition configuration. The device channel acquisition unit and the reference channel acquisition unit are synchronously triggered under a unified clock, so as to generate a pair of device channel sound segments and reference channel sound segments at the same acquisition time. The continuously acquired signals from the two channels are framed according to a preset sampling rate, and continuous device channel audio segment streams and reference channel audio segment streams are formed according to preset segment durations, so that each device channel audio segment and each reference channel audio segment carries the corresponding acquisition time. The device channel audio segments and reference channel audio segments at the same acquisition time are merged into a dual-channel audio segment stream according to their correspondence, and the identifier of the target key device and the acquisition time are associated with the dual-channel audio segment stream.

3. The online anomaly early warning method for key equipment in a tobacco processing workshop based on diffusion-based reconstruction anomaly detection as described in claim 1, characterized in that S2 includes: obtaining a pair of reference channel and device channel audio segments at the same acquisition time from the dual-channel audio segment stream; performing time-frequency analysis on the reference channel audio segment to obtain a reference channel time-frequency feature map containing 128 frequency bands and 96 time frames; grouping the reference channel time-frequency feature map along the frequency band direction in groups of 4 frequency bands to obtain 32 frequency band groups, and calculating the average energy of each frequency band group over the 96 time frames to obtain a 32-dimensional energy distribution vector; grouping the reference channel time-frequency feature map along the time frame direction in groups of 6 time frames to obtain 16 time periods, and calculating the total energy change within each time period to obtain a 16-dimensional rhythm change vector; averaging the reference channel time-frequency feature map along the time frame direction to obtain the band mean spectrum, dividing the band mean spectrum into 16 frequency band segments in groups of 8 frequency bands, and extracting the energy proportion of each frequency band segment to obtain a 16-dimensional frequency component distribution vector; concatenating the 32-dimensional energy distribution vector, the 16-dimensional rhythm change vector, and the 16-dimensional frequency component distribution vector in a fixed order to form the 64-dimensional background sound condition; performing time-frequency analysis on the device channel audio segment to obtain a device voiceprint representation containing 128 frequency bands and 96 time frames; and combining the 64-dimensional background sound condition and the device voiceprint representation according to the corresponding acquisition time to form a conditional voiceprint sample.

4. The online anomaly early warning method for key equipment in a tobacco processing workshop based on diffusion-based reconstruction anomaly detection as described in claim 3, characterized in that, in S2, the 64-dimensional background sound condition at an acquisition time is obtained using equation (1). In equation (1), the background sound condition is the column vector formed by arranging in order the scalar components of the 32-dimensional energy distribution vector, the 16-dimensional rhythm change vector, and the 16-dimensional frequency component distribution vector at that acquisition time; each component of the energy distribution vector is the average energy of one frequency band group over the 96 time frames; each component of the rhythm change vector is the total energy change within one time period, computed from the total energies of the time frames at that acquisition time; each component of the frequency component distribution vector is the energy proportion of one frequency band segment, computed from the average energy of each frequency band over the 96 time frames; the energy values are those of the reference channel time-frequency feature map, obtained by time-frequency analysis of the reference channel audio segment at that acquisition time, at a given frequency band and time frame; and the indices used are the frequency band index, the time frame index, the frequency band group index, the time period index, and the frequency band segment index.

5. The online anomaly early warning method for key equipment in a tobacco processing workshop based on diffusion-based reconstruction anomaly detection according to claim 1, characterized in that S3 includes: constructing a training sample set by selecting conditional voiceprint samples from historical normal operation periods, each conditional voiceprint sample in the training sample set containing a 64-dimensional background sound condition and a 128×96-dimensional device voiceprint representation; obtaining conditional voiceprint samples from the training sample set in batches, and generating a 128-dimensional diffusion step number vector for each conditional voiceprint sample; inputting the 64-dimensional background sound conditions of each batch into the background sound condition encoding unit, which generates 128-dimensional condition codes using two fully connected layers; inputting the 128-dimensional diffusion step number vectors of each batch into the diffusion step number encoding unit, which generates 128-dimensional step number codes using one fully connected layer; concatenating the 128-dimensional condition code with the 128-dimensional step number code to obtain a 256-dimensional reconstruction condition vector; inputting the 128×96-dimensional device voiceprint representation, as a two-dimensional time-frequency feature map, into the first convolutional block of the three-layer downsampling path in the voiceprint reconstruction unit, and performing convolution and downsampling layer by layer along the downsampling path to output feature maps at each scale; inputting the 256-dimensional reconstruction condition vector into the voiceprint reconstruction unit, where, in each convolutional block of the three-layer downsampling path and the three-layer upsampling path, it is mapped by a fully connected layer to a conditional bias vector with the same number of channels as that convolutional block, the conditional bias vector is repeated along the frequency band and time frame dimensions until it matches the size of the normalized output of the convolutional block, and it is then added to that normalized output to obtain the intermediate reconstruction feature corresponding to the 128-dimensional diffusion step number vector, the intermediate reconstruction feature being the feature map output by the voiceprint reconstruction unit under that diffusion step number, thereby completing the conditional reconstruction under the constraints of the current background sound condition and diffusion step number; inputting the intermediate reconstruction features output by the voiceprint reconstruction unit into the output mapping unit, which uses a 1×1 two-dimensional convolutional layer to output a reconstructed device voiceprint representation with the same structure as the input device voiceprint representation; constructing a reconstruction error between the reconstructed device voiceprint representation and the corresponding real device voiceprint representation, which serves as the reconstruction training constraint of the diffusion-based reconstruction network; within each batch, constructing sample pairs whose background sound conditions are the same or fall within the same range, based on the vector distance threshold of the background sound conditions, and using the reconstructed device voiceprint representations obtained for each sample pair under the same diffusion step number to construct a consistency deviation, which serves as the consistency training constraint; and forming a joint training objective from the reconstruction training constraint and the consistency training constraint, jointly optimizing and updating the parameters of the diffusion-based reconstruction network, and training iteratively to obtain the conditional diffusion reconstruction model.

6. The online anomaly early warning method for key equipment in a tobacco processing workshop based on diffusion-based reconstruction anomaly detection according to claim 5, characterized in that the joint training objective is constructed using equation (2). In equation (2), the quantities are: the weight of the reconstruction training constraint; the weight of the consistency training constraint; the batch sample size; two indices of conditional voiceprint samples within a batch; the 64-dimensional background sound conditions of the two indexed conditional voiceprint samples; the vector distance threshold of the background sound conditions; the Euclidean distance; an indicator function that takes the value 1 when the condition within the parentheses is true and 0 otherwise; the value of the device voiceprint representation of a conditional voiceprint sample at a given frequency band and time frame; the values, at the same frequency band and time frame, of the reconstructed device voiceprint representations output for the two indexed conditional voiceprint samples under the same diffusion step number; the frequency band index, which ranges over the 128 frequency bands; the time frame index, which ranges over the 96 time frames; and the diffusion step number.

7. The online anomaly early warning method for key equipment in a tobacco processing workshop based on diffusion-based reconstruction anomaly detection as described in claim 1, characterized in that, in S4, the reconstruction deviation at an acquisition time is obtained using equation (3). In equation (3), the quantities are: the value of the device voiceprint representation at the acquisition time at a given frequency band and time frame; the value of the optimal reconstructed device voiceprint representation at the acquisition time at the same frequency band and time frame; the normalized frequency band weight of each frequency band at the acquisition time; the unnormalized frequency band weight of each frequency band at the acquisition time; the frequency band index used in the summation that normalizes the weights; the component of the energy distribution vector at the acquisition time for a given frequency band group; the index of the frequency band group to which a frequency band belongs, obtained from the frequency band index by a floor operation; a preset energy bias scalar; and the absolute value of the quantity within the parentheses.

8. The online anomaly early warning method for key equipment in a tobacco processing workshop based on diffusion-based reconstruction anomaly detection according to claim 1, characterized in that S5 includes: obtaining the acquisition time sequence corresponding to the reconstruction deviation sequence, and converting the duration threshold into a number of consecutive acquisition times, which serves as the continuous judgment length; using the acquisition time as an index, constructing on the reconstruction deviation sequence a sliding judgment window whose length equals the continuous judgment length, and comparing each reconstruction deviation within the sliding judgment window in turn with the warning threshold; when the reconstruction deviations at the consecutive acquisition times within the sliding judgment window all exceed the warning threshold, outputting an abnormal judgment event containing the start and end acquisition times of the corresponding window; and when the reconstruction deviation at any acquisition time within the sliding judgment window does not exceed the warning threshold, outputting a normal judgment event containing the current acquisition time.

9. The online anomaly early warning method for key equipment in a tobacco processing workshop based on diffusion-based reconstruction anomaly detection according to claim 1, characterized in that, Step S6 is as follows: The start and end times of the abnormal judgment window are determined based on the abnormal judgment event, and the identifier of the target key device corresponding to the window where the abnormal judgment event is located is retrieved in the association relationship of the dual-channel audio segment stream. Based on the start and end times of data acquisition, extract the subsequence of reconstruction deviation corresponding to the window where the anomaly judgment event is located from the reconstruction deviation sequence. The reconstruction deviation subsequence is traversed to determine the maximum reconstruction deviation and the acquisition time corresponding to the maximum reconstruction deviation. The maximum reconstruction deviation is then used as the reconstruction deviation information of the corresponding anomaly judgment event. The identification of the target key equipment, the acquisition time, and the reconstruction deviation information are combined to generate an early warning signal and output it.

10. An electronic device, comprising a memory and a processor, characterized in that, The memory is used to store a program that supports a processor in executing the method of any one of claims 1-9, the processor being configured to execute the program stored in the memory.
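The feature extraction of claim 3 and the unweighted reconstruction deviation of step S4 in claim 1 can be illustrated with a short, dependency-free Python sketch. It is not the applicant's implementation. The function names are hypothetical, and two details that the text leaves open are assumptions: the group energy is averaged over both the 4 bands and the 96 frames, and the "total energy change" of a time period is taken as the sum of absolute frame-to-frame changes in total frame energy within that period. The frequency band weighting of claim 7 is not modeled.

```python
from typing import List

Matrix = List[List[float]]  # rows are frequency bands, columns are time frames

N_BANDS, N_FRAMES = 128, 96


def background_condition(tf_map: Matrix) -> List[float]:
    """Build the 64-dimensional background sound condition from a 128x96 map."""
    if len(tf_map) != N_BANDS or any(len(r) != N_FRAMES for r in tf_map):
        raise ValueError("expected a 128 x 96 time-frequency map")
    # 32-dim energy distribution: mean energy of each group of 4 bands
    # (assumption: averaged over the 4 bands as well as the 96 frames).
    energy = [sum(sum(tf_map[f]) for f in range(4 * g, 4 * g + 4))
              / (4 * N_FRAMES) for g in range(32)]
    # Total energy of each time frame, summed over all bands.
    frame_total = [sum(tf_map[f][k] for f in range(N_BANDS))
                   for k in range(N_FRAMES)]
    # 16-dim rhythm change (assumption: sum of absolute frame-to-frame
    # changes inside each 6-frame period).
    rhythm = [sum(abs(frame_total[k] - frame_total[k - 1])
                  for k in range(6 * p + 1, 6 * p + 6)) for p in range(16)]
    # 16-dim frequency components: energy share of each 8-band segment
    # of the band mean spectrum.
    band_mean = [sum(tf_map[f]) / N_FRAMES for f in range(N_BANDS)]
    total = sum(band_mean) or 1.0  # guard against an all-zero map
    freq = [sum(band_mean[8 * q:8 * q + 8]) / total for q in range(16)]
    return energy + rhythm + freq  # fixed order: 32 + 16 + 16 = 64


def reconstruction_deviation(x: Matrix, x_hat: Matrix) -> float:
    """Mean absolute difference over all band/frame points (claim 1, S4)."""
    n = sum(len(row) for row in x)
    return sum(abs(a - b)
               for rx, rh in zip(x, x_hat) for a, b in zip(rx, rh)) / n


flat = [[1.0] * N_FRAMES for _ in range(N_BANDS)]
half = [[0.5] * N_FRAMES for _ in range(N_BANDS)]
cond = background_condition(flat)
print(len(cond), reconstruction_deviation(flat, half))  # 64 0.5
```

For a constant map the rhythm components are all zero and each frequency segment holds 8/128 of the energy, which gives a quick sanity check of the grouping arithmetic.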