A fire smoke detection method and system based on spatio-temporal feature discrimination
By employing wavelet multi-scale decomposition and temporal feature fusion, the problem of incomplete feature representation in smoke detection in existing technologies is solved, achieving high-precision and highly interference-resistant fire smoke detection.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Applications (China)
- Current Assignee / Owner: NANJING LINGSHU INTELLIGENT TECHNOLOGY CO LTD
- Filing Date: 2026-03-24
- Publication Date: 2026-06-12
AI Technical Summary
Existing video smoke detection methods rely on single-dimensional feature extraction, which cannot simultaneously preserve the global structure and detailed texture of smoke, and do not fully explore the temporal variation patterns of smoke, resulting in insufficient detection accuracy and anti-interference ability.
Multi-scale spatial features of smoke are extracted by wavelet multi-scale decomposition, a multi-scale temporal feature sequence is constructed, and cross-dimensional feature fusion is performed to generate smoke spatiotemporal fusion features. Finally, a smoke discrimination classifier outputs the fire smoke detection results.
The method improves the accuracy and anti-interference ability of smoke detection, adapts to smoke detection needs at different scales and in different scenarios, and balances detection accuracy and efficiency.
Figure CN122200512A_ABST
Description
Technical Field
[0001] This invention relates to the fields of computer vision and fire detection technology, specifically to a fire smoke detection method and system based on spatiotemporal feature discrimination.
Background Technology
[0002] Fire smoke is a crucial early-stage characteristic of fires. Timely and accurate detection of fire smoke can buy valuable time for fire prevention, evacuation, and emergency response, reducing loss of life and property. With the widespread adoption of video surveillance technology, fire smoke detection based on video frame sequences has become the mainstream method. Its core principle is to extract smoke features from video frames and combine this feature analysis to accurately distinguish between smoke and non-smoke components.
[0003] Currently, most existing video smoke detection methods rely on single-dimensional feature extraction. They either extract spatial features of smoke from a single frame, analyzing static features like grayscale and contours for detection, or simply stitch together features from multiple frames, failing to fully explore the dynamic changes of smoke over time. Detection methods relying solely on spatial features struggle to distinguish smoke from spatially similar interfering elements such as water vapor, dust, and light spots, leading to false positives. Methods lacking effective temporal feature analysis cannot capture the dynamic characteristics of smoke, such as its diffusion trajectory and concentration changes over time, making them unsuitable for detecting fine, diffuse smoke or distant smoke, and prone to missed detections.
[0004] Meanwhile, existing methods do not fully consider the multi-scale characteristics of smoke during feature processing, and cannot simultaneously take into account the global contour and fine texture features of smoke, resulting in poor adaptability to smoke at different scales (such as large-area smoke in the foreground and small-scale smoke in the background). Moreover, most methods have not achieved effective fusion of spatial features and temporal features, and the feature representation is not comprehensive enough, which further affects the accuracy and reliability of smoke detection and makes it difficult to meet the high precision and high anti-interference requirements of smoke detection in real-world scenarios.
Summary of the Invention
[0005] Existing video smoke detection methods rely solely on single-dimensional feature extraction: they cannot simultaneously preserve the global structure and detailed texture of smoke, nor accurately capture its temporal variations. The resulting feature representation is incomplete and struggles to meet the high-precision, high-interference-resistance requirements of smoke detection in real-world scenarios. To address these shortcomings, this invention proposes a fire smoke detection method based on spatiotemporal feature discrimination, comprising:
[0006] Obtain the video frame sequence to be detected;
[0007] Wavelet multi-scale decomposition is performed on each frame image in the video frame sequence to obtain multi-scale high-frequency sub-images and low-frequency sub-images of each frame. Based on the multi-scale high-frequency sub-images and low-frequency sub-images of each frame, the corresponding smoke multi-scale spatial features of each frame are extracted.
[0008] Based on the multi-scale spatial features of smoke corresponding to each frame, a multi-scale temporal feature sequence is constructed. Inter-frame dynamic correlation and feature extraction are then performed on the temporal feature sequence to obtain dynamic temporal features of smoke at different scales.
[0009] Cross-dimensional feature fusion is performed on the multi-scale spatial features of smoke and the dynamic temporal features of smoke at the corresponding scale to generate smoke spatiotemporal fusion features corresponding to different scales. The smoke spatiotemporal fusion features of different scales are then aggregated to obtain global smoke spatiotemporal fusion features.
[0010] Based on the global smoke spatiotemporal fusion features, a preset smoke discrimination classifier is used to output the fire smoke detection results corresponding to the video frame sequence.
[0011] Optionally, the step of performing wavelet multi-scale decomposition on each frame of the video frame sequence to obtain multi-scale high-frequency sub-images and low-frequency sub-images for each frame includes:
[0012] Perform multi-scale two-dimensional discrete wavelet decomposition on each frame of the video frame sequence to output low-frequency sub-images and high-frequency sub-images corresponding to different scales;
[0013] The low-frequency sub-images and high-frequency sub-images corresponding to different scales are classified and integrated to obtain the multi-scale high-frequency sub-images and low-frequency sub-images corresponding to each frame.
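The decomposition and the "classify and integrate" step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the patented implementation: it assumes a Haar wavelet, a grayscale frame whose side lengths are divisible by 2 at every level, and three decomposition levels. The function names `haar_dwt2` and `multiscale_decompose` and the dictionary layout are hypothetical. Sub-band names follow this document's convention (LH is the horizontal high-frequency sub-band).

```python
import numpy as np

def haar_dwt2(img):
    """One level of an orthonormal 2D Haar DWT on an even-sized grayscale image."""
    a = img[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2.0  # low-pass in both directions
    lh = (a - b + c - d) / 2.0  # high-pass along the horizontal direction
    hl = (a + b - c - d) / 2.0  # high-pass along the vertical direction
    hh = (a - b - c + d) / 2.0  # high-pass in both directions
    return ll, (lh, hl, hh)

def multiscale_decompose(frame, levels=3):
    """Decompose repeatedly and sort the outputs into low- and high-frequency groups."""
    low, high = [], []
    current = np.asarray(frame, dtype=np.float64)
    for _ in range(levels):
        current, details = haar_dwt2(current)
        low.append(current)    # low-frequency sub-image at this scale
        high.append(details)   # (LH, HL, HH) high-frequency sub-images at this scale
    return {"low": low, "high": high}

# Toy 8x8 frame (assumed input, not real video data).
frame = np.arange(64, dtype=np.float64).reshape(8, 8)
pyramid = multiscale_decompose(frame, levels=3)
low_shapes = [x.shape for x in pyramid["low"]]
```

Each level halves the resolution, so the toy frame yields low-frequency sub-images of size 4x4, 2x2 and 1x1. Other bases such as db4 need longer filters and boundary handling, which this sketch omits.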
[0014] Optionally, the step of extracting the multi-scale spatial features of smoke corresponding to each frame based on the multi-scale high-frequency sub-images and low-frequency sub-images of each frame includes:
[0015] For each frame, low-frequency features of the smoke region are extracted from the low-frequency sub-images at each scale;
[0016] High-frequency features of the smoke edges are extracted from the high-frequency sub-images at each scale;
[0017] The low-frequency features of the smoke region and the high-frequency features of the smoke edges at the same scale are concatenated to obtain the spatial features of the smoke at each scale;
[0018] The multi-scale spatial features of smoke for each frame are obtained from the spatial features of the smoke at the different scales.
[0019] The low-frequency features of the smoke region include one or more of the following: global structural features, gray-scale distribution features, and contour morphology features of the smoke region;
[0020] The high-frequency features of the smoke edge include one or more of the following: gradient abrupt change features, texture diffusion features, and high-frequency detail variation features.
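As a rough illustration of the per-scale extraction and concatenation described above, the sketch below uses simple statistics as hypothetical stand-ins for the listed feature types: mean and standard deviation for gray-scale distribution, and mean absolute coefficient for gradient abrupt change. The text does not specify the actual extractors, so all function names and descriptors here are assumptions.

```python
import numpy as np

def low_freq_features(ll):
    # Hypothetical stand-in for "gray-scale distribution" features.
    return np.array([ll.mean(), ll.std()])

def high_freq_features(lh, hl, hh):
    # Hypothetical stand-in for "gradient abrupt change" features:
    # mean absolute coefficient of each high-frequency sub-image.
    return np.array([np.abs(lh).mean(), np.abs(hl).mean(), np.abs(hh).mean()])

def spatial_features_per_scale(low, high):
    """Concatenate low- and high-frequency descriptors scale by scale."""
    feats = []
    for ll, (lh, hl, hh) in zip(low, high):
        feats.append(np.concatenate([low_freq_features(ll),
                                     high_freq_features(lh, hl, hh)]))
    return feats

# Toy sub-images for two scales (random placeholders, not real wavelet output).
rng = np.random.default_rng(0)
low = [rng.random((4, 4)), rng.random((2, 2))]
high = [tuple(rng.random((4, 4)) for _ in range(3)),
        tuple(rng.random((2, 2)) for _ in range(3))]
multi_scale = spatial_features_per_scale(low, high)
```

The list `multi_scale` holds one 5-element vector per scale. A learned encoder would produce feature maps rather than scalar statistics.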
[0021] Optionally, the step of constructing a multi-scale temporal feature sequence based on the multi-scale spatial features of smoke corresponding to each frame includes:
[0022] The smoke multi-scale spatial features corresponding to each frame are sorted according to the temporal order of each frame in the video frame sequence.
[0023] For each scale, spatial features of that scale are extracted from the sorted multi-scale spatial features of smoke in each frame and combined temporally to generate a temporal feature sequence corresponding to that scale.
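The sorting and per-scale temporal combination can be sketched as follows. The input layout, a list of `(timestamp, [feature_at_scale_0, feature_at_scale_1, ...])` pairs, is an assumption made for illustration, as is the function name `build_temporal_sequences`.

```python
import numpy as np

def build_temporal_sequences(frames_features):
    """Sort frames by time, then stack the features of each scale along a new time axis."""
    ordered = sorted(frames_features, key=lambda item: item[0])
    num_scales = len(ordered[0][1])
    return [np.stack([feats[s] for _, feats in ordered], axis=0)
            for s in range(num_scales)]

# Three frames given out of order, each with features at two scales (toy values).
frames = [
    (2, [np.full(3, 2.0), np.full(2, 20.0)]),
    (0, [np.full(3, 0.0), np.full(2, 0.0)]),
    (1, [np.full(3, 1.0), np.full(2, 10.0)]),
]
sequences = build_temporal_sequences(frames)
```

Each entry of `sequences` has shape `(T, ...)`, with the time axis first, so the later temporal analysis can operate on one scale at a time.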
[0024] Optionally, the step of performing inter-frame dynamic correlation and feature extraction on the temporal feature sequence to obtain dynamic temporal features of smoke at different scales includes:
[0025] For the temporal feature sequence of each scale, extract the temporal dynamic features at that scale;
[0026] Encode and aggregate the temporal dynamic features at each scale to generate the dynamic temporal features of smoke at the different scales;
[0027] The temporal dynamic features include one or more of the following: inter-frame difference features of adjacent frame features, feature motion offset features, regional grayscale temporal change features, and spatial diffusion trend features.
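Two of the listed temporal dynamic features can be sketched for a single scale: inter-frame differences of adjacent frames, and a per-location least-squares slope used here as a hypothetical stand-in for the regional grayscale temporal change. The input is assumed to be a `(T, H, W)` array, and stacking the two maps is only a placeholder for the unspecified "encode and aggregate" step.

```python
import numpy as np

def temporal_dynamic_features(seq):
    """seq: array of shape (T, H, W), one feature map per frame at a single scale."""
    diffs = np.diff(seq, axis=0)                # inter-frame differences, shape (T-1, H, W)
    mean_abs_diff = np.abs(diffs).mean(axis=0)  # average change magnitude per location
    t = np.arange(seq.shape[0], dtype=np.float64)
    t_centered = t - t.mean()
    # Least-squares slope of the value over time at each location.
    slope = np.einsum("t,thw->hw", t_centered, seq - seq.mean(axis=0)) / (t_centered ** 2).sum()
    return np.stack([mean_abs_diff, slope], axis=0)  # shape (2, H, W)

# Toy sequence whose value grows by 1 per frame everywhere.
seq = np.stack([np.full((2, 2), float(k)) for k in range(4)])
dynamic = temporal_dynamic_features(seq)
```

For this toy sequence both maps equal 1 everywhere, since every location brightens by one unit per frame.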
[0028] Optionally, the step of performing cross-dimensional feature fusion on the multi-scale spatial features of the smoke and the corresponding scale dynamic temporal features of the smoke to generate spatiotemporal fusion features of the smoke at different scales includes:
[0029] For each scale, the multi-scale spatial features of smoke at that scale are dimensionally aligned with the dynamic temporal features of smoke to obtain the dimensionally aligned multi-scale spatial features of smoke and dynamic temporal features of smoke.
[0030] The dimensionally aligned multi-scale spatial features of smoke are fused with the dimensionally aligned dynamic temporal features of smoke using any one of feature channel concatenation, weighted feature fusion, or attention-adaptive fusion, generating the spatiotemporal fusion features of smoke at that scale.
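The alignment step and the three fusion options named above (channel splicing, i.e. concatenation along the channel axis, weighted fusion, and attention-adaptive fusion) can be sketched for `(C, H, W)` feature maps. Everything specific here is an assumption: the averaging projection stands in for a learned alignment layer, the weight 0.5 is arbitrary, and the attention scores are simply channel-mean activations passed through a softmax.

```python
import numpy as np

def align_channels(x, channels):
    """Hypothetical dimensional alignment: a fixed linear map over the channel axis.
    In practice this would likely be a learned 1x1 convolution."""
    w = np.full((channels, x.shape[0]), 1.0 / x.shape[0])
    return np.einsum("oc,chw->ohw", w, x)

def fuse_concat(s, d):
    return np.concatenate([s, d], axis=0)  # channel concatenation

def fuse_weighted(s, d, w=0.5):
    return w * s + (1.0 - w) * d           # fixed scalar weighting

def fuse_attention(s, d):
    # Softmax over the two sources at each spatial position.
    scores = np.stack([s.mean(axis=0), d.mean(axis=0)], axis=0)  # (2, H, W)
    scores = scores - scores.max(axis=0, keepdims=True)          # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)
    return weights[0] * s + weights[1] * d

spatial = np.ones((4, 3, 3))                               # toy spatial features
dynamic = align_channels(np.full((2, 3, 3), 2.0), channels=4)  # toy temporal features
cat = fuse_concat(spatial, dynamic)
avg = fuse_weighted(spatial, dynamic)
att = fuse_attention(spatial, dynamic)
```

Concatenation doubles the channel count, while the other two options keep the shape and differ only in how the mixing weights are chosen.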
[0031] Optionally, the aggregation of spatiotemporal fusion features of smoke at different scales to obtain global spatiotemporal fusion features includes:
[0032] The spatiotemporal fusion features of smoke at different scales are aligned by resolution to obtain aligned spatiotemporal fusion features of smoke at different scales.
[0033] The aligned spatiotemporal fusion features of smoke at different scales are superimposed to obtain global spatiotemporal fusion features of smoke.
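Resolution alignment followed by superposition can be sketched as below, assuming nearest-neighbour upsampling to the finest scale and element-wise summation. The text does not name the interpolation method, so this choice, like the function names, is illustrative only. The sketch also assumes each coarser map divides the finest resolution evenly and uses the same factor for height and width.

```python
import numpy as np

def upsample_nearest(x, factor):
    """Repeat each value `factor` times along the last two (spatial) axes."""
    return np.repeat(np.repeat(x, factor, axis=-2), factor, axis=-1)

def aggregate_scales(features):
    """features: list of (C, H_k, W_k) arrays, finest scale first."""
    target_h = features[0].shape[-2]
    aligned = [upsample_nearest(f, target_h // f.shape[-2]) for f in features]
    return np.sum(aligned, axis=0)  # superimpose the aligned maps

# Toy fused features at three scales.
features = [np.ones((2, 8, 8)), 2.0 * np.ones((2, 4, 4)), 3.0 * np.ones((2, 2, 2))]
global_feature = aggregate_scales(features)
```

The result keeps the finest resolution, here `(2, 8, 8)`, with each location holding the sum of the contributions from all scales.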
[0034] Optionally, the step of outputting the fire smoke detection result corresponding to the video frame sequence based on the global smoke spatiotemporal fusion features and using a preset smoke discrimination classifier includes:
[0035] The global smoke spatiotemporal fusion features are input into a preset smoke discrimination classifier, and the fire smoke probability map corresponding to the video frame sequence is output.
[0036] Based on a preset confidence threshold, the fire smoke probability map is binarized and connected component analysis is performed to obtain the location and area of the smoke target in the video frame sequence.
[0037] Based on the location and area of the smoke target, generate fire smoke detection results;
[0038] The fire smoke detection results include one or more of the following: smoke area coordinates, fire confidence level, and fire severity level.
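The post-processing steps above (thresholding, binarization, connected component analysis, and result generation) can be sketched with a plain breadth-first search. The 4-connectivity, the default threshold of 0.5, and the result fields `bbox`, `area` and `confidence` are assumptions for illustration. A fire severity level is not computed, since the text does not define how it is derived.

```python
from collections import deque
import numpy as np

def detect_regions(prob_map, threshold=0.5):
    """Binarize a probability map and return one record per 4-connected region."""
    mask = prob_map >= threshold
    visited = np.zeros_like(mask, dtype=bool)
    h, w = mask.shape
    regions = []
    for r in range(h):
        for c in range(w):
            if not mask[r, c] or visited[r, c]:
                continue
            queue = deque([(r, c)])
            visited[r, c] = True
            pixels = []
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and mask[ny, nx] and not visited[ny, nx]:
                        visited[ny, nx] = True
                        queue.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            regions.append({
                "bbox": (min(ys), min(xs), max(ys), max(xs)),  # (top, left, bottom, right)
                "area": len(pixels),
                "confidence": float(np.mean([prob_map[p] for p in pixels])),
            })
    return regions

# Toy probability map with one 2x2 region and one isolated pixel.
prob = np.zeros((5, 5))
prob[0:2, 0:2] = 0.9
prob[4, 4] = 0.8
regions = detect_regions(prob, threshold=0.5)
```

Regions are reported in raster-scan order of their first pixel, so the 2x2 block comes before the isolated pixel.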
[0039] Based on the same inventive concept, the present invention also provides a fire smoke detection system based on spatiotemporal feature discrimination, comprising:
[0040] The video acquisition module is used to acquire the video frame sequence to be detected;
[0041] The scale decomposition module is used to perform wavelet multi-scale decomposition on each frame image in the video frame sequence to obtain multi-scale high-frequency sub-images and low-frequency sub-images of each frame, and extract the corresponding smoke multi-scale spatial features based on the multi-scale high-frequency sub-images and low-frequency sub-images of each frame.
[0042] The feature extraction module is used to construct a multi-scale temporal feature sequence based on the multi-scale spatial features of smoke corresponding to each frame, and to perform inter-frame dynamic correlation and feature extraction on the temporal feature sequence to obtain dynamic temporal features of smoke at different scales.
[0043] The feature aggregation module is used to perform cross-dimensional feature fusion on the multi-scale spatial features of smoke and the corresponding scale dynamic temporal features of smoke to generate smoke spatiotemporal fusion features corresponding to different scales, and to aggregate the smoke spatiotemporal fusion features of different scales to obtain global smoke spatiotemporal fusion features.
[0044] The smoke detection module is used to output the fire smoke detection results corresponding to the video frame sequence based on the global smoke spatiotemporal fusion features and a preset smoke discrimination classifier.
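The way the five modules hand data to one another can be sketched as a small container of callables. The class name, field names, and the toy lambdas are all hypothetical. They only show the data flow described above (acquisition, scale decomposition, feature extraction, feature aggregation, detection), not any real processing.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

@dataclass
class SmokeDetectionSystem:
    acquire: Callable[[], Sequence[Any]]            # video acquisition module
    decompose: Callable[[Sequence[Any]], Any]       # scale decomposition module
    extract_temporal: Callable[[Any], Any]          # feature extraction module
    aggregate: Callable[[Any, Any], Any]            # feature aggregation module
    detect: Callable[[Any], Any]                    # smoke detection module

    def run(self):
        frames = self.acquire()
        spatial = self.decompose(frames)
        temporal = self.extract_temporal(spatial)
        fused = self.aggregate(spatial, temporal)
        return self.detect(fused)

# Toy stand-ins: scalars play the role of frames and features.
system = SmokeDetectionSystem(
    acquire=lambda: [1.0, 2.0, 4.0],
    decompose=lambda frames: list(frames),
    extract_temporal=lambda spatial: [b - a for a, b in zip(spatial, spatial[1:])],
    aggregate=lambda spatial, temporal: sum(spatial) + sum(temporal),
    detect=lambda fused: fused > 5.0,
)
result = system.run()
```

Keeping each module behind a plain callable makes it easy to swap one stage, for example a different fusion strategy, without touching the others.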
[0045] Optionally, the scale decomposition module includes:
[0046] The wavelet decomposition submodule is used to perform multi-scale two-dimensional discrete wavelet decomposition on each frame of the video frame sequence and output low-frequency sub-images and high-frequency sub-images corresponding to different scales.
[0047] The classification and integration submodule is used to classify and integrate low-frequency sub-images and high-frequency sub-images corresponding to different scales to obtain multi-scale high-frequency sub-images and low-frequency sub-images for each frame.
[0048] Optionally, the scale decomposition module further includes:
[0049] The low-frequency extraction submodule is used to extract low-frequency features of the smoke region based on each frame and the low-frequency sub-images corresponding to different scales.
[0050] The high-frequency extraction submodule is used to extract high-frequency features of the smoke edge for high-frequency sub-images corresponding to different scales.
[0051] The feature stitching submodule is used to stitch together the low-frequency features of the smoke region and the high-frequency features of the smoke edge at the same scale to obtain the smoke spatial features corresponding to different scales.
[0052] The scale fusion submodule is used to obtain the multi-scale spatial features of smoke for each frame based on the spatial features of smoke at different scales.
[0053] The low-frequency features of the smoke region include one or more of the following: global structural features, gray-scale distribution features, and contour morphology features of the smoke region;
[0054] The high-frequency features of the smoke edge include one or more of the following: gradient abrupt change features, texture diffusion features, and high-frequency detail variation features.
[0055] Optionally, the feature extraction module includes:
[0056] The feature sorting submodule is used to sort the smoke multi-scale spatial features corresponding to each frame according to the temporal order of each frame in the video frame sequence.
[0057] The temporal combination submodule is used to extract the spatial features of the scale from the sorted multi-scale spatial features of smoke in each frame for each scale and perform temporal combination to generate the temporal feature sequence corresponding to the scale.
[0058] Optionally, the feature extraction module further includes:
[0059] The temporal extraction submodule is used to extract temporal dynamic features at different scales for the temporal feature sequences corresponding to each scale.
[0060] The encoding aggregation submodule is used to encode and aggregate temporal dynamic features at different scales to generate smoke dynamic temporal features corresponding to different scales.
[0061] The temporal dynamic features include one or more of the following: inter-frame difference features of adjacent frame features, feature motion offset features, regional grayscale temporal change features, and spatial diffusion trend features.
[0062] Optionally, the feature aggregation module includes:
[0063] The feature alignment submodule is used to perform dimensional alignment of the multi-scale spatial features of smoke and the dynamic temporal features of smoke at each scale, so as to obtain the dimensionally aligned multi-scale spatial features of smoke and dynamic temporal features of smoke.
[0064] The multi-scale fusion submodule is used to fuse the dimensionally aligned multi-scale spatial features of smoke with the dimensionally aligned dynamic temporal features of smoke using any one of feature channel concatenation, weighted feature fusion, or attention-adaptive fusion, generating the spatiotemporal fusion features of smoke at the corresponding scale.
[0065] Optionally, the feature aggregation module further includes:
[0066] The resolution alignment submodule is used to perform resolution alignment on the spatiotemporal fusion features of smoke at different scales to obtain aligned spatiotemporal fusion features of smoke at different scales.
[0067] The feature overlay submodule is used to overlay the aligned spatiotemporal fusion features of smoke at different scales to obtain global spatiotemporal fusion features of smoke.
[0068] Optionally, the smoke detection module includes:
[0069] The probability output submodule is used to input the global smoke spatiotemporal fusion features into a preset smoke discrimination classifier and output the fire smoke probability map corresponding to the video frame sequence.
[0070] The map analysis submodule is used to perform binarization and connected component analysis on the fire smoke probability map based on a preset confidence threshold, so as to obtain the location and area of the smoke target in the video frame sequence.
[0071] The results output submodule is used to generate fire smoke detection results based on the location and area of the smoke target;
[0072] The fire smoke detection results include one or more of the following: smoke area coordinates, fire confidence level, and fire severity level.
[0073] In another aspect, the present invention also provides an electronic device, comprising: at least one processor and a memory; the memory and the processor are connected via a bus;
[0074] The memory is used to store one or more programs;
[0075] When the one or more programs are executed by the at least one processor, a fire smoke detection method based on spatiotemporal feature discrimination as described above is implemented.
[0076] In another aspect, the present invention also provides a computer-readable storage medium having an executable program stored thereon, wherein the executable program, when executed, implements the fire smoke detection method based on spatiotemporal feature discrimination as described above.
[0077] Compared with the prior art, the beneficial effects of the present invention are as follows:
[0078] This invention provides a fire smoke detection method and system based on spatiotemporal feature discrimination, comprising: acquiring a video frame sequence to be detected; performing wavelet multi-scale decomposition on each frame image in the video frame sequence to obtain multi-scale high-frequency sub-images and low-frequency sub-images of each frame; extracting smoke multi-scale spatial features corresponding to each frame based on the multi-scale high-frequency sub-images and low-frequency sub-images of each frame; constructing a multi-scale temporal feature sequence based on the smoke multi-scale spatial features corresponding to each frame; performing inter-frame dynamic correlation and feature extraction on the temporal feature sequence to obtain smoke dynamic temporal features at different scales; performing cross-dimensional feature fusion on the smoke multi-scale spatial features and the smoke dynamic temporal features at the corresponding scale to generate smoke spatiotemporal fusion features corresponding to different scales; aggregating the smoke spatiotemporal fusion features at different scales to obtain global smoke spatiotemporal fusion features; and, based on the global smoke spatiotemporal fusion features, using a preset smoke discrimination classifier to output the fire smoke detection results corresponding to the video frame sequence. By employing wavelet multi-scale decomposition to extract spatial features of smoke, the invention can simultaneously preserve the overall structure and detailed texture features of the smoke. Based on the multi-scale temporal feature sequence, inter-frame dynamic correlation analysis can effectively capture the dynamic changes of smoke in the temporal dimension. Furthermore, spatial and temporal features are fused across dimensions and multi-scale features are aggregated to form global spatiotemporal features, and smoke detection is completed based on these global features. This avoids the problem of single-frame detection being easily affected by similar interfering objects, improves the anti-interference capability and recognition accuracy of smoke detection, and adapts to the detection needs of smoke at different scales. Therefore, the method of this invention can improve the accuracy and anti-interference capability of smoke detection, while adapting to smoke detection at different scales and in different scenarios, balancing detection accuracy and efficiency.
Attached Figure Description
[0079] Figure 1 is a schematic flowchart of a fire smoke detection method based on spatiotemporal feature discrimination provided by the present invention;
[0080] Figure 2 is a schematic diagram of the overall framework of a fire smoke detection method based on spatiotemporal feature discrimination provided by the present invention;
[0081] Figure 3 is a schematic diagram of a first fire in a fire smoke detection method based on spatiotemporal feature discrimination, provided in a specific embodiment of the present invention;
[0082] Figure 4 is a schematic diagram of the segmentation detection result corresponding to the first fire schematic diagram in a fire smoke detection method based on spatiotemporal feature discrimination, provided in a specific embodiment of the present invention;
[0083] Figure 5 is a schematic diagram of a second fire in a fire smoke detection method based on spatiotemporal feature discrimination, provided in a specific embodiment of the present invention;
[0084] Figure 6 is a schematic diagram of the segmentation detection result corresponding to the second fire schematic diagram in a fire smoke detection method based on spatiotemporal feature discrimination, provided in a specific embodiment of the present invention;
[0085] Figure 7 is a schematic diagram of the structural composition of a fire smoke detection system based on spatiotemporal feature discrimination provided by the present invention;
[0086] Figure 8 is a schematic diagram of the structure of an electronic device provided by the present invention.
Detailed Implementation
[0087] This invention proposes a fire smoke detection method, system, device, and medium based on spatiotemporal feature discrimination. The specific embodiments of this invention will be further described in detail below with reference to the accompanying drawings.
[0088] Example 1:
[0089] This invention provides a fire smoke detection method based on spatiotemporal feature discrimination. As shown in the flowchart of Figure 1, the method includes:
[0090] Step 1: Obtain the video frame sequence to be detected;
[0091] Step 2: Perform wavelet multi-scale decomposition on each frame of the video frame sequence to obtain multi-scale high-frequency sub-images and low-frequency sub-images of each frame. Based on the multi-scale high-frequency sub-images and low-frequency sub-images of each frame, extract the corresponding smoke multi-scale spatial features of each frame.
[0092] Step 3: Based on the multi-scale spatial features of smoke corresponding to each frame, construct a multi-scale temporal feature sequence, perform inter-frame dynamic correlation and feature extraction on the temporal feature sequence, and obtain dynamic temporal features of smoke at different scales.
[0093] Step 4: Perform cross-dimensional feature fusion on the multi-scale spatial features of the smoke and the corresponding scale dynamic temporal features of the smoke to generate smoke spatiotemporal fusion features corresponding to different scales, and aggregate the smoke spatiotemporal fusion features of different scales to obtain global smoke spatiotemporal fusion features.
[0094] Step 5: Based on the global smoke spatiotemporal fusion features, use a preset smoke discrimination classifier to output the fire smoke detection results corresponding to the video frame sequence.
[0095] In one implementation, the video frame sequence in step 1 above is obtained by enhancing and standardizing the acquired original video sequence;
[0096] The video frame sequence obtained after enhancement and normalization retains its two-dimensional structure, making it suitable for direct input into two-dimensional discrete wavelet transform (2D-DWT) for frequency decomposition. Therefore, to address the dynamic blurring of flames and smoke in videos caused by motion and disturbances, a wavelet multi-scale feature decomposition and fusion mechanism can be introduced as an important component of feature extraction. Due to its excellent time-frequency locality, wavelet transform can simultaneously preserve the overall structure and local texture details of an image, making it particularly suitable for processing image signals with non-stationary characteristics in fire detection (such as flame flickering and smoke drift). Compared to the global nature of Fourier transform, wavelet transform possesses stronger local representation capabilities in the space-frequency domain. Specifically:
[0097] In one implementation, step 2 above, which involves performing wavelet multi-scale decomposition on each frame of the video frame sequence to obtain multi-scale high-frequency and low-frequency sub-images for each frame, may include:
[0098] Perform multi-scale two-dimensional discrete wavelet decomposition on each frame of the video frame sequence to output low-frequency sub-images and high-frequency sub-images corresponding to different scales;
[0099] The low-frequency sub-images and high-frequency sub-images corresponding to different scales are classified and integrated to obtain the multi-scale high-frequency sub-images and low-frequency sub-images corresponding to each frame.
[0100] In this implementation, each frame in the video frame sequence undergoes multi-scale two-dimensional discrete wavelet decomposition, which can be represented as follows:
[0101] $$\{LL,\ LH,\ HL,\ HH\} = \mathrm{DWT}(I)$$
[0102] where $\mathrm{DWT}(\cdot)$ denotes the two-dimensional discrete wavelet decomposition; $I$ denotes an image frame in the video frame sequence; $LL$ denotes the low-frequency sub-band, which represents the overall contour information of the image; $LH$ denotes the horizontal high-frequency sub-band, which reflects the vertical edges in the image; $HL$ denotes the vertical high-frequency sub-band, which reflects the horizontal edges in the image; and $HH$ denotes the diagonal high-frequency sub-band, which captures diagonal detail textures. Among these, $LL$ belongs to the low-frequency sub-image, while $LH$, $HL$, and $HH$ belong to the high-frequency sub-images;
[0103] The transformation process is essentially a downsampling operation performed on the image after low-pass (L) and high-pass (H) filtering, as shown in the following expression:
[0104] $$LL(x, y) = \sum_{m}\sum_{n} \varphi(m - 2x)\,\varphi(n - 2y)\,I(m, n)$$
[0105] $$LH(x, y) = \sum_{m}\sum_{n} \varphi(m - 2x)\,\psi(n - 2y)\,I(m, n)$$
[0106] $$HL(x, y) = \sum_{m}\sum_{n} \psi(m - 2x)\,\varphi(n - 2y)\,I(m, n)$$
[0107] $$HH(x, y) = \sum_{m}\sum_{n} \psi(m - 2x)\,\psi(n - 2y)\,I(m, n)$$
[0108] where $LL(x, y)$ denotes the pixel value of the low-frequency sub-image (LL sub-band) at coordinates $(x, y)$, representing the global contour / smooth region features of the image at that position; $LH(x, y)$ denotes the pixel value of the horizontal high-frequency sub-image at coordinates $(x, y)$, representing the vertical edge feature at that position; $HL(x, y)$ denotes the pixel value of the vertical high-frequency sub-image at coordinates $(x, y)$, representing the horizontal edge feature at that position; $HH(x, y)$ denotes the pixel value of the diagonal high-frequency sub-image at coordinates $(x, y)$, representing the diagonal detail / texture features at that position; $I(m, n)$ denotes the pixel grayscale value of the original video frame image at coordinates $(m, n)$, where $m$ is the row coordinate and $n$ is the column coordinate; $\varphi$ denotes the scaling function (low-pass filter function); and $\psi$ denotes the wavelet function (high-pass filter function). In this implementation, Daubechies (db4) or Haar wavelets can be used as basis functions because they are efficient and accurate in detecting abrupt signal changes.
In applications, the low-frequency sub-band LL primarily preserves the general outline of the flame and the background structure, which helps in constructing macroscopic structural semantics. The high-frequency sub-bands LH, HL, and HH, on the other hand, emphasize edge information and detailed textures, particularly capturing blurred smoke boundaries and transitional areas in the flame. To fully utilize this sub-band information, each sub-band can be fed into a parallel, lightweight convolutional module (such as a Depthwise CNN) or a Transformer encoder for unified feature encoding. The expression can be as follows:
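To make the sub-band naming above concrete, the sketch below applies Haar low-pass and high-pass filtering followed by downsampling by 2, first along the horizontal direction and then along the vertical direction. It assumes the Haar basis and an even-sized image. The function names are hypothetical. The toy images check that a vertical edge shows up only in LH and a horizontal edge only in HL, matching the description in the text.

```python
import numpy as np

S = 1.0 / np.sqrt(2.0)  # Haar filter coefficient

def analyze_axis(x, axis):
    """Haar low-pass and high-pass filtering with downsampling by 2 along one axis."""
    even = np.take(x, np.arange(0, x.shape[axis], 2), axis=axis)
    odd = np.take(x, np.arange(1, x.shape[axis], 2), axis=axis)
    return S * (even + odd), S * (even - odd)

def haar_subbands(img):
    lo_h, hi_h = analyze_axis(img, axis=1)  # filter along the horizontal direction
    ll, hl = analyze_axis(lo_h, axis=0)     # then along the vertical direction
    lh, hh = analyze_axis(hi_h, axis=0)
    return ll, lh, hl, hh

# Vertical edge: the last column is bright, everything else is dark.
vertical_edge = np.zeros((4, 4))
vertical_edge[:, 3] = 1.0
ll_v, lh_v, hl_v, hh_v = haar_subbands(vertical_edge)

# Horizontal edge: the transposed image.
ll_h, lh_h, hl_h, hh_h = haar_subbands(vertical_edge.T)
```

Because of the downsampling, this simple transform only responds to an edge that falls inside a 2-pixel pair, which is why the edge is placed between the last two columns.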
[0109] $$F_{LL} = \mathrm{Enc}(LL),\quad F_{LH} = \mathrm{Enc}(LH),\quad F_{HL} = \mathrm{Enc}(HL),\quad F_{HH} = \mathrm{Enc}(HH)$$
[0110] where $F_{LL}$ denotes the low-frequency structural feature tensor obtained after the low-frequency sub-band is encoded by the encoder, which retains core structural information such as the global outline and overall shape of the smoke / flame; $F_{LH}$ denotes the horizontal high-frequency feature tensor obtained after the horizontal high-frequency sub-band is encoded by the encoder, focusing on the vertical edge details of the blurred smoke boundary and the flame transition region; $F_{HL}$ denotes the vertical high-frequency feature tensor obtained after the vertical high-frequency sub-band is encoded by the encoder; and $F_{HH}$ denotes the diagonal high-frequency feature tensor obtained after the diagonal high-frequency sub-band is encoded by the encoder, capturing the oblique detail texture of the blurred smoke boundary and the flame transition area. $\mathrm{Enc}(LL)$ indicates that the low-frequency sub-band is input into a preset lightweight convolutional module (such as a Depthwise CNN) or a Transformer encoder for unified feature encoding. $\mathrm{Enc}(LH)$ indicates that the horizontal high-frequency sub-band is input into the same type of encoder, which enhances the vertical edge detail information of the LH sub-band and outputs the encoded horizontal high-frequency features. $\mathrm{Enc}(HL)$ indicates that the vertical high-frequency sub-band is input into the same type of encoder, which enhances the horizontal edge detail information of the HL sub-band and outputs the encoded vertical high-frequency features. $\mathrm{Enc}(HH)$ indicates that the diagonal high-frequency sub-band is input into the same type of encoder, which enhances the diagonal detail texture information of the HH sub-band and outputs the encoded diagonal high-frequency features.
[0111] Subsequently, the four features are integrated through a multi-scale feature fusion mechanism. The fusion strategy adopts weighted channel fusion and spatial attention mechanism, and its overall expression can be as follows:
[0112] $$F_{\mathrm{fuse}} = \sum_{k \in \{LL,\, LH,\, HL,\, HH\}} \alpha_k \cdot A_k \odot F_k$$
[0113] where $F_{\mathrm{fuse}}$ denotes the sub-band fusion feature; $\alpha_k$ denotes the channel fusion weight of sub-band $k$ (which can be trained or adaptively set via entropy control); $A_k$ denotes the attention map of sub-band $k$; and $F_k$ denotes the encoded feature map of sub-band $k$. This fusion method takes into account the contribution of information at different frequencies, which helps strengthen edge recognition and structure preservation for flames and smoke in dynamic backgrounds, thereby constructing a more robust and discriminative multi-scale fire feature representation and improving the detection performance of the model in complex scenes.
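The weighted channel fusion with spatial attention described above can be sketched as a weighted sum over the four encoded sub-band feature maps, each multiplied element-wise by its attention map. How the attention map is produced is not specified in the text. The sigmoid of the channel-mean activation used here is purely an assumption, as are the fixed example weights and their normalization to sum to 1.

```python
import numpy as np

def spatial_attention(feat):
    """Hypothetical attention map: sigmoid of the channel-mean activation, shape (1, H, W)."""
    return 1.0 / (1.0 + np.exp(-feat.mean(axis=0, keepdims=True)))

def fuse_subbands(features, weights):
    """features: dict name -> (C, H, W) array; weights: dict name -> scalar fusion weight."""
    total = sum(weights.values())  # assumption: normalize weights to sum to 1
    fused = np.zeros_like(next(iter(features.values())))
    for name, feat in features.items():
        fused += (weights[name] / total) * spatial_attention(feat) * feat
    return fused

# Toy encoded features for the four sub-bands.
features = {k: np.ones((2, 3, 3)) for k in ("LL", "LH", "HL", "HH")}
weights = {"LL": 0.4, "LH": 0.2, "HL": 0.2, "HH": 0.2}
fused = fuse_subbands(features, weights)
```

In a trained model the weights would typically be learnable parameters, and the attention maps would come from a small sub-network rather than a fixed formula.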
[0114] Specifically, in the above implementation, the process of extracting the multi-scale spatial features of smoke corresponding to each frame based on the multi-scale high-frequency sub-images and low-frequency sub-images of each frame may include:
[0115] For each frame, low-frequency features of the smoke region are extracted from the low-frequency sub-images at the different scales;
[0116] High-frequency features of the smoke edge are extracted from the high-frequency sub-images at the different scales;
[0117] By concatenating the low-frequency features of the smoke region and the high-frequency features of the smoke edge at the same scale, the spatial features of the smoke at different scales are obtained.
[0118] Based on the spatial features of smoke at different scales, the multi-scale spatial features of smoke for each frame are obtained.
[0119] The low-frequency features of the smoke region include one or more of the following: global structural features, gray-scale distribution features, and contour morphology features of the smoke region;
[0120] The high-frequency features of the smoke edge include one or more of the following: gradient abrupt change features, texture diffusion features, and high-frequency detail variation features.
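The per-scale extraction and concatenation steps above can be illustrated with a small sketch. It assumes an averaging Haar wavelet and uses the raw subbands as stand-ins for the extracted features; the function names are hypothetical. Subband naming follows this document (LH is high-pass along the horizontal direction and responds to vertical edges); other references swap LH and HL.

```python
import numpy as np

def haar_dwt2(img):
    """One level of an averaging 2-D Haar transform.

    img: (H, W) array with even H and W.
    Returns LL, LH, HL, HH, each of shape (H/2, W/2).
    """
    a = img[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]  # top-right
    c = img[1::2, 0::2]  # bottom-left
    d = img[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 4.0   # local average: global structure
    lh = (a - b + c - d) / 4.0   # difference across columns: vertical edges
    hl = (a + b - c - d) / 4.0   # difference across rows: horizontal edges
    hh = (a - b - c + d) / 4.0   # diagonal detail
    return ll, lh, hl, hh

def multiscale_spatial_features(img, levels=2):
    """Concatenate low- and high-frequency subbands at each scale."""
    feats = []
    current = np.asarray(img, dtype=np.float64)
    for _ in range(levels):
        ll, lh, hl, hh = haar_dwt2(current)
        low = ll[None]                   # (1, h, w) low-frequency part
        high = np.stack([lh, hl, hh])    # (3, h, w) high-frequency parts
        feats.append(np.concatenate([low, high], axis=0))  # (4, h, w)
        current = ll                     # the next scale decomposes LL again
    return feats

feats = multiscale_spatial_features(np.ones((8, 8)), levels=2)
print([f.shape for f in feats])  # [(4, 4, 4), (4, 2, 2)]
```

In a full system the raw subbands would be passed through learned encoders before concatenation; the sketch only shows the data flow across scales.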
[0121] After completing the wavelet multi-scale feature decomposition and fusion through the above steps, a high-quality spatial representation of each frame of the image in multiple frequency dimensions can be obtained. Specific steps may include:
[0122] For the image frame at each time $t$, a joint representation of the low-frequency structural features $F_t^{\mathrm{low}}$ and the high-frequency edge features $F_t^{\mathrm{high}}$ can be obtained:
[0123] $$F_t = \mathrm{Concat}\bigl(F_t^{\mathrm{low}},\, F_t^{\mathrm{high}}\bigr) \in \mathbb{R}^{C \times H \times W}$$
[0124] In the formula,
[0125] $$F_t^{\mathrm{high}} = \bigl\{F_t^{LH},\, F_t^{HL},\, F_t^{HH}\bigr\}$$
[0126] where $F_t$ denotes the single-frame multi-scale spatial feature tensor of smoke, obtained by wavelet-decomposing the image frame at time $t$ and concatenating its low-frequency structural features with its high-frequency edge features; $\mathrm{Concat}(\cdot)$ denotes feature concatenation along the channel dimension, used to fuse the low-frequency and high-frequency features into a single feature tensor; $F_t^{\mathrm{low}}$ denotes the low-frequency structural features of smoke corresponding to the low-frequency sub-image obtained by the two-dimensional discrete wavelet transform of the image frame at time $t$, representing overall information such as the global contour and gray-level distribution; $F_t^{\mathrm{high}}$ denotes the set of high-frequency smoke edge features corresponding to all high-frequency sub-images obtained by the same transform; $\mathbb{R}$ denotes the real number field, indicating that the feature tensor is real-valued; $C$ is the number of feature channels; $H$ is the height of the image/feature map; $W$ is the width of the image/feature map; $F_t^{LH}$ denotes the smoke features of the horizontal high-frequency subband (LH) at time $t$, reflecting the vertical edge details of the smoke; $F_t^{HL}$ denotes the smoke features of the vertical high-frequency subband (HL) at time $t$, reflecting the horizontal edge details of the smoke; and $F_t^{HH}$ denotes the smoke features of the diagonal high-frequency subband (HH) at time $t$, capturing the oblique detail texture of the smoke. These features, however, are static and cannot capture the evolution of a fire over time.
For example, the typical flicker frequency of flames is concentrated in the 4–12 Hz range and belongs to short-period, high-frequency signals, whereas smoke behaves more like a low-frequency, slowly changing continuous process. Therefore, to achieve fire detection with higher robustness and a lower false alarm rate, a spatiotemporal joint modeling mechanism is used to construct a multi-scale spatial feature sequence, as follows:
[0127] The fused wavelet features $F_t$ (that is, the multi-scale spatial features of smoke described above) from multiple consecutive time points (e.g., 10 frames) are organized into a three-dimensional spatiotemporal feature sequence tensor:
[0128] $$\mathcal{F} = [F_1, F_2, \dots, F_T] \in \mathbb{R}^{T \times C \times H \times W}$$
[0129] where $\mathcal{F}$ denotes the three-dimensional spatiotemporal feature sequence tensor formed by stacking the single-frame multi-scale spatial features $F_t$ of $T$ consecutive time steps (from time $1$ to time $T$) along the time axis, preserving both the spatial features of the smoke and the temporal sequence information; and $T$ is the length of the time window. This tensor preserves the spatial multi-scale representation and is explicitly organized into an ordered time series, which provides a foundation for time-series modeling.
Wavelet decomposition, however, is essentially a fixed transformation and cannot dynamically determine which channels or regions are more important in a specific scenario. To address this, a joint structure combining a Channel Gated Spatial Network (CGN) and Frequency-domain Attention (FDA) can be used. The CGN enhances the response to flame and smoke regions in the current frame while suppressing background interference that may cause false alarms, such as elements that resemble flames in color but lack dynamic features (for example, red clothing, lights, and car headlights). It learns and generates channel weight vectors, assigning high weights to channels of salient regions (e.g., flame/smoke textures) while suppressing background channels. This process is performed independently for each frame and produces a semantically enhanced spatial feature sequence, which serves as input to the subsequent FDA mechanism for modeling temporal dynamics.
For the feature map of each frame, a channel attention mechanism can be applied. Its core idea is to use Global Average Pooling (GAP) to compress the two-dimensional features of each channel into a single value, yielding channel-level global information. The expression can be as follows:
[0130] $$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} F_t(c, i, j)$$
[0131] where $z_c$ denotes the channel-level global information scalar obtained by global average pooling (GAP) of the $c$-th feature channel, representing the overall response level of that channel in the spatial dimension; $H$ is the feature map height and $i$ is the pixel index in the vertical direction; $W$ is the feature map width and $j$ is the pixel index in the horizontal direction; $F_t$ is the single-frame multi-scale spatial feature tensor of smoke at time $t$; and $F_t(c, i, j)$ is the feature value at channel $c$, row $i$, and column $j$ of that tensor;
[0132] Subsequently, the attention weight of each channel can be generated using a squeeze-and-excitation (compression-excitation) structure consisting of two fully connected layers:
[0133] $$s = \sigma\bigl(W_2\, \delta(W_1 z)\bigr)$$
[0134] where $s = (s_1, \dots, s_C)$ is the channel weight vector, whose component $s_c$ is the attention weight of the $c$-th feature channel and measures the importance of that channel for smoke detection; $\sigma$ denotes the sigmoid activation function; $\delta$ denotes the ReLU activation function; $W_1$ is the learnable first parameter matrix; $W_2$ is the learnable second parameter matrix; and $z$ is the channel global information vector composed of the global information scalars of all channels. Finally, the channel weight vector $s$ is applied to the original feature map to achieve channel-level enhancement, which can be expressed as follows:
[0135] $$\tilde{F}_t(c, i, j) = s_c \cdot F_t(c, i, j)$$
[0136] where $\tilde{F}_t(c, i, j)$ denotes the enhanced feature value at channel $c$, row $i$, and column $j$ of the feature tensor at time $t$ after channel attention weighting, and $F_t(c, i, j)$ is the corresponding value before weighting. This mechanism assigns higher weights to channels of salient regions (such as flame and smoke feature channels) while suppressing background channels, which focuses the representation on key areas in the spatial dimension and reduces false alarm interference.
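The channel gating just described (global average pooling, two fully connected layers, and channel-wise reweighting) can be sketched as below. The function name `channel_gate`, the reduction ratio, and the random stand-in weights are assumptions for illustration; in practice the two matrices would be learned.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_gate(feat, w1, w2):
    """Squeeze-and-excitation style channel attention for one frame.

    feat: (C, H, W) single-frame feature tensor
    w1:   (C // r, C) first fully connected layer (no bias, for brevity)
    w2:   (C, C // r) second fully connected layer
    Returns the reweighted features and the channel weight vector.
    """
    z = feat.mean(axis=(1, 2))                  # GAP: one scalar per channel
    s = sigmoid(w2 @ np.maximum(w1 @ z, 0.0))   # ReLU, then sigmoid -> (C,)
    return s[:, None, None] * feat, s           # scale each channel by its weight

rng = np.random.default_rng(0)
C, r = 8, 2
feat = rng.standard_normal((C, 16, 16))
w1 = rng.standard_normal((C // r, C)) * 0.1     # stand-ins for learned weights
w2 = rng.standard_normal((C, C // r)) * 0.1
enhanced, s = channel_gate(feat, w1, w2)
print(enhanced.shape, s.shape)  # (8, 16, 16) (8,)
```

Because the gate is computed from a single frame, this step can be applied independently to every frame before the temporal stage.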
[0137] To further leverage temporal evolution patterns to improve fire detection robustness, these spatial features can be organized into a complete temporal structure as input for subsequent frequency domain modeling. Specifically:
[0138] In one implementation, step 3 above, which involves constructing a multi-scale temporal feature sequence based on the multi-scale spatial features of smoke corresponding to each frame, may include:
[0139] The smoke multi-scale spatial features corresponding to each frame are sorted according to the temporal order of each frame in the video frame sequence.
[0140] For each scale, the spatial features of the scale are extracted from the sorted multi-scale spatial features of smoke in each frame and combined temporally to generate the temporal feature sequence corresponding to the scale.
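The sorting and per-scale temporal combination can be sketched as follows. The input layout (a list of timestamp and per-scale feature pairs) and the function name are hypothetical choices made for this illustration.

```python
import numpy as np

def build_temporal_sequences(frames):
    """Build one temporal feature sequence per scale.

    frames: list of (timestamp, [feat_scale0, feat_scale1, ...]) in any order,
            where each feat_scale_s has shape (C, H_s, W_s).
    Returns a list with one array of shape (T, C, H_s, W_s) per scale.
    """
    ordered = sorted(frames, key=lambda item: item[0])   # temporal order
    n_scales = len(ordered[0][1])
    return [
        np.stack([feats[s] for _, feats in ordered])     # stack along time
        for s in range(n_scales)
    ]

# Three frames delivered out of order, two scales each.
frames = [
    (2, [np.full((4, 8, 8), 2.0), np.full((4, 4, 4), 2.0)]),
    (0, [np.full((4, 8, 8), 0.0), np.full((4, 4, 4), 0.0)]),
    (1, [np.full((4, 8, 8), 1.0), np.full((4, 4, 4), 1.0)]),
]
sequences = build_temporal_sequences(frames)
print([seq.shape for seq in sequences])  # [(3, 4, 8, 8), (3, 4, 4, 4)]
```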
[0141] In this implementation, a three-dimensional spatiotemporal feature tensor $X$ is constructed by stacking the consecutive features within a time window along the time axis:
[0142] $$X \in \mathbb{R}^{T \times C \times H \times W}$$
[0143] where $T$ is the length of the time window; $C$ is the number of feature channels; $H$ is the feature map height; and $W$ is the feature map width. This tensor preserves the spatial semantic information of each frame and forms an ordered sequence in the time dimension. In addition, a frequency-domain attention (FDA) mechanism can be introduced to dynamically model this feature sequence. First, a fast Fourier transform (FFT) is performed on the feature tensor along the time axis to obtain the spectral response of each channel and each spatial location in the time dimension, as shown in the following expression:
[0144] $$\hat{X}(f, c, x, y) = \mathrm{FFT}_t\bigl(X(t, c, x, y)\bigr)$$
[0145] where $\hat{X}$ denotes the frequency-domain feature tensor obtained by applying the fast Fourier transform to the spatiotemporal feature tensor $X$; $f$ is the frequency index, indicating the frequency component in the time dimension; $c$ is the feature channel index; $x$ is the pixel index in the horizontal direction of the feature map; $y$ is the pixel index in the vertical direction of the feature map; $\mathrm{FFT}_t$ denotes the fast Fourier transform along the time axis; and $X(t, c, x, y)$ is the value of the spatiotemporal feature tensor at time $t$, channel $c$, and spatial location $(x, y)$;
[0146] By statistically analyzing the spectral energy distribution, frequency components within the 4–12 Hz range (the typical flame flicker frequency) are identified, and a frequency-domain attention map $A$ is generated and used for weighted filtering of the spectrum. Strengthening the effective frequency bands and suppressing irrelevant ones enhances the response to the temporal dynamics of typical fires, as expressed below:
[0147] $$\tilde{X}(f, c, x, y) = A(f, c, x, y) \cdot \hat{X}(f, c, x, y)$$
[0148] where $\tilde{X}(f, c, x, y)$ denotes the enhanced frequency-domain feature value at frequency $f$, channel $c$, and spatial location $(x, y)$ after frequency-domain attention weighting; $A(f, c, x, y)$ is the corresponding attention weight; and $\hat{X}(f, c, x, y)$ is the frequency-domain feature value before weighting;
[0149] Finally, an inverse fast Fourier transform (IFFT) is performed to return to the time domain, yielding the dynamically modeled temporal feature tensor:
[0150] $$X^{\mathrm{FDA}} = \mathrm{IFFT}_T\bigl(\tilde{X}\bigr)$$
[0151] where $X^{\mathrm{FDA}}$ denotes the time-domain temporal feature tensor obtained after dynamic modeling with the frequency-domain attention (FDA) mechanism, which preserves the typical temporal dynamic characteristics of fire smoke; $\mathrm{IFFT}_T$ denotes the one-dimensional inverse fast Fourier transform performed along the time dimension ($T$ dimension); and $\tilde{X}$ denotes the full frequency-domain feature tensor after frequency-domain attention weighting;
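The three FDA steps (FFT along time, weighting of the spectrum, inverse FFT) can be sketched as below. This is a simplified illustration: a fixed band mask with hypothetical `boost` and `damp` gains stands in for the attention map derived from spectral energy statistics, and the frame rate `fps` is an assumed parameter. Note that frequencies above `fps / 2` cannot be represented in the sampled sequence, so at 10 frames per second only the lower end of the 4–12 Hz band is observable.

```python
import numpy as np

def frequency_domain_attention(x, fps, band=(4.0, 12.0), boost=1.0, damp=0.2):
    """Weight temporal frequency components of a feature sequence.

    x:    (T, C, H, W) real-valued spatiotemporal feature tensor
    fps:  sampling rate of the sequence in frames per second
    band: frequency range in Hz to emphasise
    """
    T = x.shape[0]
    spec = np.fft.rfft(x, axis=0)               # FFT along the time axis
    freqs = np.fft.rfftfreq(T, d=1.0 / fps)     # bin frequencies in Hz
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    gain = np.where(in_band, boost, damp)       # simple stand-in attention map
    spec = spec * gain[:, None, None, None]     # weighted filtering
    return np.fft.irfft(spec, n=T, axis=0)      # back to the time domain

fps, T = 30, 30
t = np.arange(T) / fps
flicker = np.sin(2 * np.pi * 6.0 * t)[:, None, None, None]   # 6 Hz, in band
drift = np.sin(2 * np.pi * 1.0 * t)[:, None, None, None]     # 1 Hz, out of band
out_flicker = frequency_domain_attention(flicker, fps)
out_drift = frequency_domain_attention(drift, fps)
print(np.abs(out_flicker).max(), np.abs(out_drift).max())
```

In this example the 6 Hz component passes through unchanged while the 1 Hz component is attenuated by the `damp` factor.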
[0152] This output is a high-quality temporal feature sequence that integrates spatial attention enhancement and frequency-domain temporal awareness, enabling it to recognize evolutionary patterns such as flame flickering and smoke diffusion. After spatial saliency enhancement and temporal frequency modeling, the features already possess a degree of spatial recognition and temporal dynamic response capability. Real-world fire scenarios, however, often involve occlusion, short-term extinction, camera shake, and long-distance blurring. These factors can cause flame or smoke areas to disappear momentarily in individual frames, leading the model to misjudge that the fire has ended or to raise false alarms. To further enhance structural memory and temporal robustness, a wavelet-based feature propagation chain (WavProp) is introduced as the final link in spatiotemporal feature processing. This chain is used for cross-frame structural alignment and dynamic propagation of information. Its output preserves high-frequency and structural information in the spatial dimension and maintains cross-frame consistency in the temporal dimension, which improves the ability to remember long-term fire evolution and keeps judgments stable in the face of frame drops, transient disappearance of the target, and occlusion.
[0153] In one implementation, the process of performing inter-frame dynamic correlation and feature extraction on the temporal feature sequence to obtain dynamic temporal features of smoke at different scales may include:
[0154] For each scale, extract the temporal dynamic features at different scales from the temporal feature sequence.
[0155] Encode and aggregate temporal dynamic features at different scales to generate dynamic temporal features of smoke at different scales;
[0156] The temporal dynamic features include one or more of the following: inter-frame difference features of adjacent frame features, feature motion offset features, regional grayscale temporal change features, and spatial diffusion trend features.
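Two of the listed temporal dynamic features, the inter-frame difference and a regional temporal change, can be sketched as follows. The function name, the returned keys, and the simple mean-based aggregation are assumptions for illustration and do not cover the motion offset or diffusion trend features.

```python
import numpy as np

def temporal_dynamic_features(seq):
    """Simple temporal dynamics for one scale.

    seq: (T, C, H, W) temporal feature sequence.
    """
    diff = np.diff(seq, axis=0)                 # (T-1, C, H, W) adjacent-frame difference
    mean_change = diff.mean(axis=(2, 3))        # (T-1, C) average change per channel
    motion_energy = np.abs(diff).mean(axis=0)   # (C, H, W) aggregated over time
    return {
        "frame_diff": diff,
        "mean_change": mean_change,
        "motion_energy": motion_energy,
    }

# A sequence whose values grow by 1 per frame everywhere.
ramp = np.arange(5, dtype=np.float64)[:, None, None, None] * np.ones((5, 2, 4, 4))
dyn = temporal_dynamic_features(ramp)
print(dyn["frame_diff"].shape, dyn["motion_energy"].shape)  # (4, 2, 4, 4) (2, 4, 4)
```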
[0157] In one implementation, step 4 above involves cross-dimensional feature fusion of the multi-scale spatial features of the smoke and the corresponding scale dynamic temporal features of the smoke to generate spatiotemporal fusion features of the smoke at different scales. This process may include:
[0158] For each scale, the multi-scale spatial features of smoke at that scale are dimensionally aligned with the dynamic temporal features of smoke to obtain the dimensionally aligned multi-scale spatial features of smoke and dynamic temporal features of smoke.
[0159] By employing any one of the following methods—feature channel splicing, weighted feature fusion, or attention-adaptive fusion—the multi-scale spatial features of smoke after dimensional alignment are fused with the dynamic temporal features of smoke to generate the spatiotemporal fusion features of smoke corresponding to the scale.
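The three fusion options named above can be sketched in one function. The function name `fuse`, the `alpha` parameter, and in particular the softmax gate used for the attention-adaptive branch are illustrative assumptions; a trained system would typically learn the gating.

```python
import numpy as np

def fuse(spatial, temporal, method="concat", alpha=0.5):
    """Fuse dimension-aligned spatial and temporal features of one scale.

    spatial, temporal: arrays of identical shape (C, H, W).
    """
    if spatial.shape != temporal.shape:
        raise ValueError("features must be dimension-aligned before fusion")
    if method == "concat":
        # Feature channel splicing: (2C, H, W)
        return np.concatenate([spatial, temporal], axis=0)
    if method == "weighted":
        # Fixed convex combination of the two branches
        return alpha * spatial + (1.0 - alpha) * temporal
    if method == "attention":
        # Per-channel softmax gate over the two branches, computed from
        # their global average responses (a simple stand-in for a learned gate).
        gs = spatial.mean(axis=(1, 2))
        gt = temporal.mean(axis=(1, 2))
        m = np.maximum(gs, gt)                   # for numerical stability
        es, et = np.exp(gs - m), np.exp(gt - m)
        ws = es / (es + et)
        return ws[:, None, None] * spatial + (1.0 - ws)[:, None, None] * temporal
    raise ValueError(f"unknown fusion method: {method}")

rng = np.random.default_rng(0)
sp = rng.standard_normal((4, 8, 8))
tp = rng.standard_normal((4, 8, 8))
print(fuse(sp, tp, "concat").shape)     # (8, 8, 8)
print(fuse(sp, tp, "weighted").shape)   # (4, 8, 8)
print(fuse(sp, tp, "attention").shape)  # (4, 8, 8)
```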
[0160] In one implementation, the process of aggregating the spatiotemporal fusion features of smoke at different scales to obtain global spatiotemporal fusion features may include:
[0161] The spatiotemporal fusion features of smoke at different scales are aligned by resolution to obtain aligned spatiotemporal fusion features of smoke at different scales.
[0162] The aligned spatiotemporal fusion features of smoke at different scales are superimposed to obtain global spatiotemporal fusion features of smoke.
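Resolution alignment followed by superposition can be sketched as below. Nearest-neighbour upsampling by an integer factor is an assumption chosen for simplicity; interpolation or learned upsampling would serve the same role, and the function names are hypothetical.

```python
import numpy as np

def upsample_nearest(feat, out_hw):
    """Nearest-neighbour upsampling of (C, h, w) to (C, H, W) by integer factors."""
    _, h, w = feat.shape
    H, W = out_hw
    if H % h or W % w:
        raise ValueError("target size must be an integer multiple of the input size")
    return feat.repeat(H // h, axis=1).repeat(W // w, axis=2)

def aggregate_scales(feats):
    """Align all scales to the finest resolution and superimpose them."""
    H = max(f.shape[1] for f in feats)
    W = max(f.shape[2] for f in feats)
    return sum(upsample_nearest(f, (H, W)) for f in feats)

scales = [np.ones((2, 4, 4)), np.ones((2, 2, 2))]
global_feat = aggregate_scales(scales)
print(global_feat.shape)  # (2, 4, 4)
```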
[0163] The global smoke spatiotemporal fusion features obtained through the above steps integrate multiple information such as space, time, frequency, and structure, possessing excellent flame and smoke perception capabilities. It is necessary to transform these high-dimensional spatiotemporal features into a human-readable and system-operable risk representation, namely, a fire probability map for each frame of image. This map not only serves as the basis for the final alarm decision but also facilitates visualization and risk quantification analysis. Specifically:
[0164] In one implementation, step 5 above, which involves using a preset smoke discriminator to output the fire smoke detection result corresponding to the video frame sequence based on the global smoke spatiotemporal fusion features, may include:
[0165] The global smoke spatiotemporal fusion features are input into a preset smoke discrimination classifier, and the fire smoke probability map corresponding to the video frame sequence is output.
[0166] Based on a preset confidence threshold, the fire smoke probability map is binarized and connected component analysis is performed to obtain the location and area of the smoke target in the video frame sequence.
[0167] Based on the location and area of the smoke target, generate fire smoke detection results;
[0168] The fire smoke detection results include one or more of the following: smoke area coordinates, fire confidence level, and fire severity level;
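The binarization and connected component analysis steps can be sketched as follows. The use of 4-connectivity, the mean probability as the confidence value, and the output dictionary layout are assumptions made for this illustration.

```python
from collections import deque

import numpy as np

def detect_regions(prob, threshold=0.5):
    """Binarize a probability map and describe each 4-connected region.

    prob: (H, W) array of fire/smoke probabilities in [0, 1].
    Returns a list of dicts with bounding box (x_min, y_min, x_max, y_max),
    pixel area, and mean probability as a confidence value.
    """
    mask = prob >= threshold
    visited = np.zeros(mask.shape, dtype=bool)
    H, W = mask.shape
    regions = []
    for r in range(H):
        for c in range(W):
            if not mask[r, c] or visited[r, c]:
                continue
            # Breadth-first search over the connected component.
            queue = deque([(r, c)])
            visited[r, c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W and mask[ny, nx] and not visited[ny, nx]:
                        visited[ny, nx] = True
                        queue.append((ny, nx))
            ys, xs = zip(*pixels)
            regions.append({
                "bbox": (min(xs), min(ys), max(xs), max(ys)),
                "area": len(pixels),
                "confidence": float(np.mean([prob[y, x] for y, x in pixels])),
            })
    return regions

prob = np.zeros((6, 6))
prob[1:3, 1:3] = 0.9   # a 2x2 smoke region
prob[4, 4] = 0.8       # an isolated pixel
for region in detect_regions(prob):
    print(region)
```

Small regions such as the isolated pixel can then be filtered by area before the detection result is generated.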
[0169] In this implementation, the key to training the smoke discrimination classifier lies in ensuring the diversity, accuracy, and standardization of the training data. The training sample library is built from publicly available fire video datasets (such as FIRESENSE, UCF-Fire, and RWFD) and self-collected high-resolution fire scene videos, covering complex scenarios such as urban blocks, warehouses, forests, and industrial parks. The data are mainly .mp4 or .avi video streams, generally at a resolution above 720p. Frame extraction is performed on the video sequences in the training sample library: continuous video is divided into image frame sequences at a fixed frame rate (e.g., 10 frames/second) to construct a temporally continuous spatiotemporal data stream, so that the subsequent model receives image input at a consistent rhythm. To improve training efficiency and model adaptability, all image frame sequences are uniformly scaled to a standard input resolution (e.g., 224×224 or 320×320), which facilitates batch processing and compatibility with the network structure. For supervised learning, each image frame can be precisely labeled with the bounding boxes of flame and smoke regions and the corresponding time segment labels. Labels are managed uniformly in COCO format to ensure standardization and reusability. On this basis, to enhance the classifier's robustness in complex real-world environments, several image enhancement strategies are introduced to simulate scenarios such as low light, blur, rain, fog, and noise interference. These strategies include brightness perturbation, gamma transformation, and Gaussian blur, and they expand the diversity of the training data and improve the generalization ability of the trained model.
Furthermore, to meet the structural input requirements of temporal modeling and frequency-domain analysis in the actual detection process, image frames can be further organized into sequence batches of consecutive frames. Each batch contains a fixed number of consecutive frames (e.g., 10 frames), providing a stable, structured input from which the classifier can capture spatiotemporal dynamics such as flame flickering and smoke drift. This data preparation process lays a solid foundation for improving the overall performance of fire smoke detection.
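The brightness and gamma augmentations and the grouping of frames into fixed-length sequence batches can be sketched as below. The function names, the value ranges, and the choice to drop a trailing incomplete clip are assumptions for this illustration; video decoding, resizing, and Gaussian blur are omitted.

```python
import numpy as np

def gamma_transform(frame, gamma):
    """Gamma transformation of a float image with values in [0, 1]."""
    return np.clip(frame, 0.0, 1.0) ** gamma

def brightness_perturb(frame, delta):
    """Add a brightness offset and clip back to [0, 1]."""
    return np.clip(frame + delta, 0.0, 1.0)

def make_clips(frames, clip_len=10, stride=10):
    """Group consecutive frames into fixed-length clips.

    frames: (N, H, W, 3) array of frames in temporal order.
    Returns (num_clips, clip_len, H, W, 3); a trailing incomplete clip is dropped.
    """
    starts = range(0, len(frames) - clip_len + 1, stride)
    clips = [frames[s:s + clip_len] for s in starts]
    if not clips:
        return np.empty((0, clip_len) + frames.shape[1:], dtype=frames.dtype)
    return np.stack(clips)

rng = np.random.default_rng(0)
video = rng.random((25, 32, 32, 3))               # 25 frames of a toy video
augmented = brightness_perturb(gamma_transform(video, 1.5), -0.1)
clips = make_clips(augmented, clip_len=10, stride=10)
print(clips.shape)  # (2, 10, 32, 32, 3)
```

Applying the same augmentation parameters to every frame of a clip, as done here, keeps the temporal dynamics within the clip intact.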
[0170] In this implementation, a lightweight spatial classification module can be introduced, using a 1×1 convolution combined with a sigmoid activation function to map the features of each spatial location to a fire probability value between [0,1], as shown in the following expression:
[0171] $$P_t(x, y) = \sigma\Bigl(\mathrm{Conv}_{1 \times 1}\bigl(F_t^{\mathrm{WavProp}}\bigr)\Bigr)(x, y)$$
[0172] where $P_t(x, y)$ denotes the fire probability value at spatial location $(x, y)$ in the image frame at time $t$, with values in $[0, 1]$; the larger the value, the higher the probability that the location is a fire/smoke area; $\sigma$ denotes the sigmoid activation function; $\mathrm{Conv}_{1 \times 1}$ denotes a 1×1 convolution operation, used to map the high-dimensional feature tensor to a single-channel probability map, achieving feature dimensionality compression and spatial classification; and $F_t^{\mathrm{WavProp}}$ denotes the enhanced smoke feature tensor obtained by processing the image frame at time $t$ through the wavelet feature propagation chain (WavProp);
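A 1×1 convolution with a single output channel is a per-pixel dot product over the channel axis, so the spatial classification module can be sketched in a few lines. The function name and the random stand-in weights are assumptions; the real weights would be learned.

```python
import numpy as np

def fire_probability_map(feat, weight, bias=0.0):
    """1x1 convolution to one channel followed by a sigmoid.

    feat:   (C, H, W) enhanced feature tensor of one frame
    weight: (C,) kernel of the 1x1 convolution (single output channel)
    Returns an (H, W) map of probabilities in (0, 1).
    """
    # Contract the channel axis at every spatial location.
    logits = np.tensordot(weight, feat, axes=([0], [0])) + bias
    return 1.0 / (1.0 + np.exp(-logits))

rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 8, 8))
weight = rng.standard_normal(16) * 0.1   # stand-in for learned weights
prob_map = fire_probability_map(feat, weight)
print(prob_map.shape)  # (8, 8)
```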
[0173] Because feature fluctuations between consecutive frames (such as flickering flame edges and moving smoke) can cause drastic fluctuations in the probability map, temporal consistency constraints can be introduced, such as sliding-window filtering or an exponential moving average, to smooth the probability sequence:
[0174] $$\tilde{P}_t(x, y) = \alpha\, P_t(x, y) + (1 - \alpha)\, \tilde{P}_{t-1}(x, y)$$
[0175] where $\tilde{P}_t(x, y)$ denotes the spatial fire probability value at time $t$ after smoothing by the temporal consistency constraint; $\alpha$ is the smoothing coefficient; $P_t(x, y)$ is the raw, unsmoothed fire probability value at time $t$; and $\tilde{P}_{t-1}(x, y)$ is the smoothed fire probability value at time $t-1$. This avoids misjudging "fire extinguished" when flames disappear in a single frame, while maintaining the dynamic continuity of the boundaries. Regions are then extracted from the smoothed probability map $\tilde{P}_t$ using an adaptive threshold (such as the Otsu algorithm), and real fire areas are selected by combining indicators such as area, shape, and stability to construct a candidate fire point set;
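The exponential moving average option for temporal smoothing can be sketched as follows. The function name, the initialisation with the first raw frame, and the convention that the smoothing coefficient weights the current frame are assumptions for this illustration.

```python
import numpy as np

def ema_smooth(prob_seq, alpha=0.6):
    """Exponential moving average over a sequence of probability maps.

    prob_seq: (T, H, W) raw per-frame probability maps.
    alpha:    weight of the current frame; (1 - alpha) weights the history.
    """
    smoothed = np.empty_like(prob_seq, dtype=np.float64)
    smoothed[0] = prob_seq[0]                    # no history for the first frame
    for t in range(1, len(prob_seq)):
        smoothed[t] = alpha * prob_seq[t] + (1.0 - alpha) * smoothed[t - 1]
    return smoothed

# A flame that vanishes for a single frame (index 2) and then returns.
raw = np.array([1.0, 1.0, 0.0, 1.0])[:, None, None] * np.ones((4, 2, 2))
smooth = ema_smooth(raw, alpha=0.5)
print(smooth[:, 0, 0])  # [1.   1.   0.5  0.75]
```

In the example the single-frame dropout lowers the smoothed probability to 0.5 instead of 0, so a threshold below that value would not report the fire as extinguished.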
[0176] In summary, existing video smoke detection methods rely solely on single-dimensional feature extraction. They cannot simultaneously preserve the global structure and detailed texture of smoke, nor accurately capture its temporal changes, so their feature representation is incomplete and they struggle to meet the high-precision, interference-resistant smoke detection requirements of real-world scenarios. This invention therefore proposes a fire smoke detection method based on spatiotemporal feature discrimination, whose overall framework is illustrated in Figure 2. As shown, the original video frame sequence is first preprocessed with basic operations such as image normalization and noise filtering. A two-dimensional discrete wavelet transform is then performed on each preprocessed image to decompose it into a low-frequency sub-image representing the overall contour information of the image and high-frequency sub-images reflecting image edges and texture details. Low-frequency features of the smoke region are extracted from the low-frequency sub-image, and high-frequency features of the smoke edge are extracted from the high-frequency sub-images, capturing the multi-scale spatial features of smoke. Next, the low-frequency and high-frequency features are fused at multiple scales to obtain a comprehensive representation that accounts for both the global structure and the subtle texture of the smoke. Finally, the fused features are input into a smoke discrimination classifier, which classifies smoke and non-smoke regions and outputs the fire smoke detection result.
The method of this invention can therefore preserve both the global structure and the detailed texture features of smoke, improving the anti-interference ability and recognition accuracy of smoke detection and meeting the needs of high-precision smoke detection in real-world scenarios.
[0177] Example 2:
[0178] To verify the smoke detection and segmentation effectiveness of the method of this invention, a comparison is provided using experimental results. Figure 3 shows a schematic diagram of the first fire scene, which includes a smoke area and a complex background; processing this image frame with the detection method of this invention yields the segmentation detection result shown in Figure 4. Figure 5 shows a schematic diagram of the second fire scene, and Figure 6 shows the corresponding segmentation detection result. As the comparison shows, the method of the present invention can segment the smoke region more accurately, suppress background interference, and improve the accuracy of smoke recognition. To further quantify the performance advantages of the method and clarify the specific contribution of each core component to the spatiotemporal modeling capability, multiple ablation experiments were designed and run on a standard dataset based on actual scene performance. The experiments evaluate the role of each component by removing the Channel Gated Spatial Network (CGN), the Frequency Domain Attention mechanism (FDA), and the Wavelet Feature Propagation Chain (WavProp) respectively. The specific experimental results are shown in Table 1.
[0179] Table 1 Experimental Results
[0180]
| Serial Number | Experimental group configuration | F1 score (%) | mIoU (%) | False alarm rate reduction (%) |
| --- | --- | --- | --- | --- |
| 1 | Baseline (without CGN / FDA / WavProp) | 68.9 | 61.2 | - |
| 2 | +CGN | 75.3 | 69.5 | ↓17.6% |
| 3 | +CGN+FDA | 84.7 | 78.6 | ↓2.1% |
| 4 | Complete model (CGN+FDA+WavProp) | 92.4 | 85.7 | ↓29.4% |
[0181] Starting with the baseline model (without CGN, FDA, and WavProp), the F1-score was only 68.9% and the mIoU was 61.2%, indicating that without structural modeling and dynamic temporal awareness, the model's performance was limited and susceptible to misinterpretations from background noise and short-term interference. However, with the gradual introduction of modules, the performance steadily improved: after adding CGN, the F1-score increased to 75.3%, and the mIoU improved by 8.3 percentage points to 69.5%, demonstrating that the channel attention mechanism significantly enhanced the model's spatial response to flame and smoke areas and suppressed misjudgments of non-fire areas. After adding FDA (+CGN + FDA), the F1-score further increased to 84.7%, and the mIoU reached 78.6%, indicating that frequency domain modeling effectively improved the ability to recognize dynamic evolutionary features such as flame flicker and smoke diffusion, enhancing consistency in the temporal dimension. After introducing wavelet feature propagation (complete model), the system's F1-score reached 92.4%, and the mIoU improved to 85.7%, making it the best-performing group among the four.
[0182] The method of this invention is compared with existing models, and the results are shown in Table 2.
[0183] Table 2 Model Comparison
[0184]
| Serial Number | Model Name | F1 score (%) | mIoU (%) | T-IoU (temporal consistency) | False alarm rate (%) | Average false negative rate (%) |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | YOLOv5 (2D CNN) | 81.1 | 72.9 | 0.68 | 14.7 | 13.5 |
| 2 | I3D (3D CNN) | 84.0 | 76.5 | 0.74 | 12.3 | 11.2 |
| 3 | ViT+LSTM | 86.5 | 78.2 | 0.77 | 10.9 | 10.1 |
| 4 | FlowNet+Mask-RCNN | 79.4 | 70.3 | 0.65 | 18.6 | 17.8 |
| 5 | The method of this invention (CGN+FDA+WavProp) | 92.4 | 85.7 | 0.89 | 7.4 | 7.1 |
[0185] As shown in Table 2 above, the method of this invention significantly outperforms other models in the two key accuracy metrics of F1-score (92.4%) and mIoU (85.7%). Compared with the second-best ViT+LSTM (F1-score 86.5%, mIoU 78.2%), it improves by 5.9 and 7.5 percentage points respectively. This indicates that multi-scale feature extraction based on wavelet decomposition can effectively improve the regional boundary recognition ability. The introduction of CGN and FDA can significantly enhance the model's ability to locate significant flame / smoke regions, especially in the segmentation performance of blurred and low-contrast scenes.
[0186] This embodiment demonstrates that the fire smoke detection method based on spatiotemporal feature discrimination proposed in this invention accurately extracts the global contour and edge detail features of smoke through wavelet multi-scale decomposition. It combines this with a channel-gated spatial network (CGN) to enhance the spatial response of the smoke region, a frequency domain attention mechanism (FDA) to strengthen the temporal dynamic perception of flame flicker and smoke diffusion, and a wavelet feature propagation chain (WavProp) to optimize feature transfer efficiency. The synergistic effect of these core components not only effectively suppresses interference from complex backgrounds and accurately segments the smoke region, but also significantly improves detection accuracy and temporal consistency. This invention maintains a significant lead in the two key accuracy indicators of F1-score and mIoU, while reducing both false alarm rate and false negative rate to varying degrees. This indicates that the method of this invention can effectively solve the problems of incomplete feature representation, weak anti-interference ability, and insufficient temporal modeling in existing methods. It can adapt to fire smoke detection scenarios in complex backgrounds, balancing detection accuracy and practicality, and meeting the core requirements of high accuracy, low false alarm rate, and low false negative rate for smoke detection in practical applications.
[0187] Example 3:
[0188] Based on the same inventive concept, this invention also provides a fire smoke detection system based on spatiotemporal feature discrimination. Its structural composition is shown schematically in Figure 7. As shown, it includes:
[0189] The video acquisition module is used to acquire the video frame sequence to be detected;
[0190] The scale decomposition module is used to perform wavelet multi-scale decomposition on each frame image in the video frame sequence to obtain multi-scale high-frequency sub-images and low-frequency sub-images of each frame, and extract the corresponding smoke multi-scale spatial features based on the multi-scale high-frequency sub-images and low-frequency sub-images of each frame.
[0191] The feature extraction module is used to construct a multi-scale temporal feature sequence based on the multi-scale spatial features of smoke corresponding to each frame, and to perform inter-frame dynamic correlation and feature extraction on the temporal feature sequence to obtain dynamic temporal features of smoke at different scales.
[0192] The feature aggregation module is used to perform cross-dimensional feature fusion on the multi-scale spatial features of smoke and the corresponding scale dynamic temporal features of smoke to generate smoke spatiotemporal fusion features corresponding to different scales, and to aggregate the smoke spatiotemporal fusion features of different scales to obtain global smoke spatiotemporal fusion features.
[0193] The smoke detection module is used to output the fire smoke detection results corresponding to the video frame sequence based on the global smoke spatiotemporal fusion features and a preset smoke discrimination classifier.
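The way the five modules hand data to one another can be sketched as a simple pipeline. The class name, field names, and callable signatures are hypothetical and only illustrate the data flow; each callable stands in for the corresponding module.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class SmokeDetectionSystem:
    """Wires the five modules together; each field is a pluggable callable."""
    acquire: Callable[[Any], Any]         # video acquisition module
    decompose: Callable[[Any], Any]       # scale decomposition module
    extract: Callable[[Any], Any]         # feature extraction module
    aggregate: Callable[[Any, Any], Any]  # feature aggregation module
    detect: Callable[[Any], Any]          # smoke detection module

    def run(self, source):
        frames = self.acquire(source)              # video frame sequence
        spatial = self.decompose(frames)           # multi-scale spatial features
        temporal = self.extract(spatial)           # dynamic temporal features
        fused = self.aggregate(spatial, temporal)  # global spatiotemporal fusion
        return self.detect(fused)                  # detection result

# Toy stand-ins that only demonstrate the call order.
system = SmokeDetectionSystem(
    acquire=lambda src: [src, src + 1],
    decompose=lambda frames: [f * 2 for f in frames],
    extract=lambda spatial: [b - a for a, b in zip(spatial, spatial[1:])],
    aggregate=lambda spatial, temporal: sum(spatial) + sum(temporal),
    detect=lambda fused: fused > 5,
)
print(system.run(1))  # True
```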
[0194] In one implementation, the scale decomposition module includes:
[0195] The wavelet decomposition submodule is used to perform multi-scale two-dimensional discrete wavelet decomposition on each frame of the video frame sequence and output low-frequency sub-images and high-frequency sub-images corresponding to different scales.
[0196] The classification and integration submodule is used to classify and integrate low-frequency sub-images and high-frequency sub-images corresponding to different scales to obtain multi-scale high-frequency sub-images and low-frequency sub-images for each frame.
[0197] In one implementation, the scale decomposition module further includes:
[0198] The low-frequency extraction submodule is used to extract low-frequency features of the smoke region based on each frame and the low-frequency sub-images corresponding to different scales.
[0199] The high-frequency extraction submodule is used to extract high-frequency features of the smoke edge for high-frequency sub-images corresponding to different scales.
[0200] The feature stitching submodule is used to stitch together the low-frequency features of the smoke region and the high-frequency features of the smoke edge at the same scale to obtain the smoke spatial features corresponding to different scales.
[0201] The scale fusion submodule is used to obtain the multi-scale spatial features of smoke for each frame based on the spatial features of smoke at different scales.
[0202] The low-frequency features of the smoke region include one or more of the following: global structural features, gray-scale distribution features, and contour morphology features of the smoke region;
[0203] The high-frequency features of the smoke edge include one or more of the following: gradient abrupt change features, texture diffusion features, and high-frequency detail variation features.
[0204] In one implementation, the feature extraction module includes:
[0205] The feature sorting submodule is used to sort the smoke multi-scale spatial features corresponding to each frame according to the temporal order of each frame in the video frame sequence.
[0206] The temporal combination submodule is used, for each scale, to extract the spatial features of that scale from the sorted smoke multi-scale spatial features of each frame and to combine them in temporal order, generating the temporal feature sequence corresponding to that scale.
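As a non-limiting illustration, the sorting and temporal combination could be sketched as follows. The sketch assumes each frame is represented as a (timestamp, {scale: feature vector}) pair; this data layout and the name `build_temporal_sequences` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

# (timestamp, {scale index: spatial feature vector of that scale})
FrameFeatures = Tuple[float, Dict[int, List[float]]]


def build_temporal_sequences(
    frames: Sequence[FrameFeatures],
) -> Dict[int, List[List[float]]]:
    """Sort frames by time, then collect the features of each scale in order."""
    ordered = sorted(frames, key=lambda item: item[0])
    sequences: Dict[int, List[List[float]]] = {}
    for _, per_scale in ordered:
        for scale, feature in per_scale.items():
            sequences.setdefault(scale, []).append(feature)
    return sequences
```

For example, three frames supplied out of order yield, for every scale, a list of three feature vectors arranged from the earliest frame to the latest.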
[0207] In one implementation, the feature extraction module further includes:
[0208] The temporal extraction submodule is used to extract temporal dynamic features at different scales for the temporal feature sequences corresponding to each scale.
[0209] The encoding aggregation submodule is used to encode and aggregate temporal dynamic features at different scales to generate smoke dynamic temporal features corresponding to different scales.
[0210] The temporal dynamic features include one or more of the following: inter-frame difference features of adjacent frame features, feature motion offset features, regional grayscale temporal change features, and spatial diffusion trend features.
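As a non-limiting illustration, the extraction and aggregation of temporal dynamic features for one scale could be sketched as follows. Only the inter-frame difference of adjacent frame features is shown; the aggregation by mean absolute change and mean signed change is an assumed encoding, and the function names are hypothetical.

```python
from typing import List, Sequence


def frame_differences(sequence: Sequence[Sequence[float]]) -> List[List[float]]:
    """Element-wise difference between each pair of adjacent frame features."""
    return [
        [after - before for before, after in zip(prev, nxt)]
        for prev, nxt in zip(sequence, sequence[1:])
    ]


def dynamic_temporal_feature(sequence: Sequence[Sequence[float]]) -> List[float]:
    """Aggregate adjacent-frame differences into one vector for the scale.

    Output layout (an assumption of this sketch):
    [mean absolute change per dimension..., mean signed change per dimension...]
    """
    diffs = frame_differences(sequence)
    if not diffs:
        raise ValueError("at least two frames are required")
    count = len(diffs)
    dims = len(diffs[0])
    mean_abs = [sum(abs(d[i]) for d in diffs) / count for i in range(dims)]
    mean_signed = [sum(d[i] for d in diffs) / count for i in range(dims)]
    return mean_abs + mean_signed
```

The mean absolute change reflects how strongly a feature fluctuates between frames, while the mean signed change reflects whether it drifts in one direction over the sequence.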
[0211] In one implementation, the feature aggregation module includes:
[0212] The feature alignment submodule is used to perform dimensional alignment of the multi-scale spatial features of smoke and the dynamic temporal features of smoke at each scale, so as to obtain the dimensionally aligned multi-scale spatial features of smoke and dynamic temporal features of smoke.
[0213] The multi-scale fusion submodule is used to fuse the multi-scale spatial features of smoke after dimensional alignment with the dynamic temporal features of smoke using any one of the following methods: feature channel splicing, weighted feature fusion, or attention adaptive fusion, to generate the spatiotemporal fusion features of smoke corresponding to the scale.
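As a non-limiting illustration, the dimensional alignment and two of the listed fusion options (feature channel splicing and weighted feature fusion) could be sketched as follows. Zero-padding as the alignment method, the default weight of 0.5, and the names `align` and `fuse` are assumptions of this sketch; attention adaptive fusion is not shown.

```python
from typing import List, Sequence, Tuple


def align(
    spatial: Sequence[float], temporal: Sequence[float]
) -> Tuple[List[float], List[float]]:
    """Zero-pad the shorter vector so both have the same dimension."""
    size = max(len(spatial), len(temporal))

    def pad(vec: Sequence[float]) -> List[float]:
        return list(vec) + [0.0] * (size - len(vec))

    return pad(spatial), pad(temporal)


def fuse(
    spatial: Sequence[float],
    temporal: Sequence[float],
    mode: str = "concat",
    weight: float = 0.5,
) -> List[float]:
    """Fuse the aligned spatial and temporal features of one scale."""
    s, t = align(spatial, temporal)
    if mode == "concat":
        # Feature channel splicing: place the two vectors side by side.
        return s + t
    if mode == "weighted":
        # Weighted feature fusion: convex combination of the two vectors.
        return [weight * a + (1.0 - weight) * b for a, b in zip(s, t)]
    raise ValueError(f"unknown fusion mode: {mode}")
```

Splicing preserves both inputs in full at the cost of doubling the dimension, whereas weighted fusion keeps the dimension fixed and exposes a single parameter for balancing spatial and temporal evidence.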
[0214] In one implementation, the feature aggregation module further includes:
[0215] The resolution alignment submodule is used to perform resolution alignment on the spatiotemporal fusion features of smoke at different scales to obtain aligned spatiotemporal fusion features of smoke at different scales.
[0216] The feature overlay submodule is used to overlay the aligned spatiotemporal fusion features of smoke at different scales to obtain global spatiotemporal fusion features of smoke.
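As a non-limiting illustration, the resolution alignment and feature overlay could be sketched as follows. The sketch assumes the fusion features of each scale are two-dimensional maps whose sizes differ by an integer factor, uses nearest-neighbour upsampling for alignment and element-wise summation for the overlay; these choices and the function names are hypothetical.

```python
from typing import List, Sequence

FeatureMap = List[List[float]]


def upsample_nearest(fmap: FeatureMap, factor: int) -> FeatureMap:
    """Repeat every value `factor` times along both axes."""
    out: FeatureMap = []
    for row in fmap:
        wide = [value for value in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def aggregate_scales(maps: Sequence[FeatureMap]) -> FeatureMap:
    """Align every map to the resolution of the first (finest) map and sum.

    Assumption: the finest map comes first and each coarser map is smaller
    by the same integer factor along rows and columns.
    """
    rows, cols = len(maps[0]), len(maps[0][0])
    total = [[0.0] * cols for _ in range(rows)]
    for fmap in maps:
        aligned = upsample_nearest(fmap, rows // len(fmap))
        for r in range(rows):
            for c in range(cols):
                total[r][c] += aligned[r][c]
    return total
```

For instance, a 2x2 fine-scale map and a 1x1 coarse-scale map produce a 2x2 global map in which the single coarse value is added to every fine-scale position.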
[0217] In one implementation, the smoke detection module includes:
[0218] The probability output submodule is used to input the global smoke spatiotemporal fusion features into a preset smoke discrimination classifier and output the fire smoke probability map corresponding to the video frame sequence.
[0219] The graph analysis submodule is used to perform binarization and connected component analysis on the fire smoke probability map based on a preset confidence threshold, so as to obtain the location and area of the smoke target in the video frame sequence.
[0220] The results output submodule is used to generate fire smoke detection results based on the location and area of the smoke target;
[0221] The fire smoke detection results include one or more of the following: smoke area coordinates, fire confidence level, and fire severity level.
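As a non-limiting illustration, the binarization and connected component analysis of the fire smoke probability map could be sketched as follows. The threshold of 0.5, the use of 4-connectivity, the bounding-box output format, and the use of mean probability as the confidence value are assumptions of this sketch; the fire severity level is not modelled, and `detect_regions` is a hypothetical name.

```python
from collections import deque
from typing import Dict, List, Sequence


def detect_regions(
    prob_map: Sequence[Sequence[float]], threshold: float = 0.5
) -> List[Dict[str, object]]:
    """Threshold the probability map and describe each connected region."""
    rows, cols = len(prob_map), len(prob_map[0])
    mask = [[p >= threshold for p in row] for row in prob_map]
    seen = [[False] * cols for _ in range(rows)]
    regions: List[Dict[str, object]] = []
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c] or seen[r][c]:
                continue
            # Breadth-first search over 4-connected neighbours.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            regions.append({
                "bbox": (min(xs), min(ys), max(xs), max(ys)),  # x0, y0, x1, y1
                "area": len(pixels),                            # pixel count
                "confidence": sum(prob_map[y][x] for y, x in pixels) / len(pixels),
            })
    return regions
```

Each returned entry corresponds to one candidate smoke target; a downstream step could discard regions below a minimum area before reporting results.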
[0222] Example 4:
[0223] As shown in Figure 8, the present invention also provides an electronic device, which may be a computer device, a microcontroller device, a smart mobile device, etc. The electronic device in this embodiment may include a processor, a memory, a transceiver component, etc. The memory, processor, and transceiver component are connected via a bus; the memory can be used to store executable programs, and an exemplary executable program may include instructions; the processor is used to execute the instructions stored in the memory. The memory can also be used to store data, which can be accessed and/or modified when instructions are executed.
[0224] The processor may be a Central Processing Unit (CPU), or it may be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. It is the computing core and control core of the terminal, and it is suitable for implementing one or more instructions. Specifically, it is suitable for loading and executing one or more instructions in the storage medium to implement the corresponding method flow or corresponding function, so as to implement the steps of the fire smoke detection method based on spatiotemporal feature discrimination in the above embodiments.
[0225] Example 5:
[0226] Based on the same inventive concept, this invention also provides a readable storage medium, specifically an electronic device readable storage medium (Memory). This readable storage medium is a memory device within an electronic device used to store programs and data. It is understood that the storage medium here can include both built-in storage media within the electronic device and extended storage media supported by the electronic device. The storage medium provides storage space, which stores the terminal's operating system. Furthermore, this storage space also stores one or more instructions suitable for loading and execution by a processor. These instructions can be one or more executable programs (including program code). It should be noted that the storage medium here can be high-speed RAM or non-volatile memory, such as at least one disk storage device. Loading and executing one or more instructions stored in the storage medium by the processor can implement the steps of the fire smoke detection method based on spatiotemporal feature discrimination in the above embodiments.
[0227] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0228] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0229] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0230] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0231] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit its protection scope. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that after reading the present invention, they can still make various changes, modifications or equivalent substitutions to the specific implementation methods of the application, but these changes, modifications or equivalent substitutions are all within the protection scope of the claims.
Claims
1. A fire smoke detection method based on spatiotemporal feature discrimination, characterized in that it includes: obtaining the video frame sequence to be detected; performing wavelet multi-scale decomposition on each frame image in the video frame sequence to obtain multi-scale high-frequency sub-images and low-frequency sub-images of each frame; extracting the smoke multi-scale spatial features corresponding to each frame based on the multi-scale high-frequency sub-images and low-frequency sub-images of each frame; constructing a multi-scale temporal feature sequence based on the smoke multi-scale spatial features corresponding to each frame, and performing inter-frame dynamic correlation and feature extraction on the temporal feature sequence to obtain smoke dynamic temporal features at different scales; performing cross-dimensional feature fusion on the smoke multi-scale spatial features and the smoke dynamic temporal features of the corresponding scale to generate smoke spatiotemporal fusion features corresponding to different scales, and aggregating the smoke spatiotemporal fusion features of different scales to obtain global smoke spatiotemporal fusion features; and outputting, based on the global smoke spatiotemporal fusion features and using a preset smoke discrimination classifier, the fire smoke detection results corresponding to the video frame sequence.
2. The method as described in claim 1, characterized in that the step of performing wavelet multi-scale decomposition on each frame image in the video frame sequence to obtain multi-scale high-frequency sub-images and low-frequency sub-images of each frame includes: performing multi-scale two-dimensional discrete wavelet decomposition on each frame of the video frame sequence to output low-frequency sub-images and high-frequency sub-images corresponding to different scales; and classifying and integrating the low-frequency sub-images and high-frequency sub-images corresponding to different scales to obtain the multi-scale high-frequency sub-images and low-frequency sub-images corresponding to each frame.
3. The method as described in claim 1, characterized in that the step of extracting the smoke multi-scale spatial features corresponding to each frame based on the multi-scale high-frequency sub-images and low-frequency sub-images of each frame includes: for each frame, extracting low-frequency features of the smoke region from the low-frequency sub-images corresponding to different scales; extracting high-frequency features of the smoke edge from the high-frequency sub-images corresponding to different scales; concatenating the low-frequency features of the smoke region and the high-frequency features of the smoke edge at the same scale to obtain the smoke spatial features corresponding to different scales; and obtaining the smoke multi-scale spatial features of each frame based on the smoke spatial features at different scales; wherein the low-frequency features of the smoke region include one or more of the following: global structural features, gray-scale distribution features, and contour morphology features of the smoke region; and the high-frequency features of the smoke edge include one or more of the following: gradient abrupt change features, texture diffusion features, and high-frequency detail variation features.
4. The method as described in claim 1, characterized in that the step of constructing a multi-scale temporal feature sequence based on the smoke multi-scale spatial features corresponding to each frame includes: sorting the smoke multi-scale spatial features corresponding to each frame according to the temporal order of each frame in the video frame sequence; and, for each scale, extracting the spatial features of that scale from the sorted smoke multi-scale spatial features of each frame and combining them in temporal order to generate the temporal feature sequence corresponding to that scale.
5. The method as described in claim 1 or 4, characterized in that the step of performing inter-frame dynamic correlation and feature extraction on the temporal feature sequence to obtain smoke dynamic temporal features at different scales includes: for each scale, extracting temporal dynamic features from the temporal feature sequence corresponding to that scale; and encoding and aggregating the temporal dynamic features at different scales to generate the smoke dynamic temporal features corresponding to different scales; wherein the temporal dynamic features include one or more of the following: inter-frame difference features of adjacent frame features, feature motion offset features, regional grayscale temporal change features, and spatial diffusion trend features.
6. The method as described in claim 1, characterized in that the step of performing cross-dimensional feature fusion on the smoke multi-scale spatial features and the smoke dynamic temporal features of the corresponding scale to generate smoke spatiotemporal fusion features corresponding to different scales includes: for each scale, performing dimensional alignment of the smoke multi-scale spatial features and the smoke dynamic temporal features at that scale to obtain dimensionally aligned smoke multi-scale spatial features and smoke dynamic temporal features; and fusing the dimensionally aligned smoke multi-scale spatial features with the smoke dynamic temporal features using any one of the following methods: feature channel splicing, weighted feature fusion, or attention adaptive fusion, to generate the smoke spatiotemporal fusion features corresponding to that scale.
7. The method as described in claim 1, characterized in that the step of aggregating the smoke spatiotemporal fusion features of different scales to obtain global smoke spatiotemporal fusion features includes: performing resolution alignment on the smoke spatiotemporal fusion features at different scales to obtain aligned smoke spatiotemporal fusion features at different scales; and overlaying the aligned smoke spatiotemporal fusion features at different scales to obtain the global smoke spatiotemporal fusion features.
8. The method as described in claim 1, characterized in that the step of outputting, based on the global smoke spatiotemporal fusion features and using a preset smoke discrimination classifier, the fire smoke detection results corresponding to the video frame sequence includes: inputting the global smoke spatiotemporal fusion features into the preset smoke discrimination classifier and outputting the fire smoke probability map corresponding to the video frame sequence; performing binarization and connected component analysis on the fire smoke probability map based on a preset confidence threshold to obtain the location and area of the smoke target in the video frame sequence; and generating the fire smoke detection results based on the location and area of the smoke target; wherein the fire smoke detection results include one or more of the following: smoke area coordinates, fire confidence level, and fire severity level.
9. A fire smoke detection system based on spatiotemporal feature discrimination, characterized in that it includes: a video acquisition module, used to acquire the video frame sequence to be detected; a scale decomposition module, used to perform wavelet multi-scale decomposition on each frame image in the video frame sequence to obtain multi-scale high-frequency sub-images and low-frequency sub-images of each frame, and to extract the smoke multi-scale spatial features corresponding to each frame based on the multi-scale high-frequency sub-images and low-frequency sub-images of each frame; a feature extraction module, used to construct a multi-scale temporal feature sequence based on the smoke multi-scale spatial features corresponding to each frame, and to perform inter-frame dynamic correlation and feature extraction on the temporal feature sequence to obtain smoke dynamic temporal features at different scales; a feature aggregation module, used to perform cross-dimensional feature fusion on the smoke multi-scale spatial features and the smoke dynamic temporal features of the corresponding scale to generate smoke spatiotemporal fusion features corresponding to different scales, and to aggregate the smoke spatiotemporal fusion features of different scales to obtain global smoke spatiotemporal fusion features; and a smoke detection module, used to output the fire smoke detection results corresponding to the video frame sequence based on the global smoke spatiotemporal fusion features and a preset smoke discrimination classifier.
10. The system as described in claim 9, characterized in that the scale decomposition module includes: a wavelet decomposition submodule, used to perform multi-scale two-dimensional discrete wavelet decomposition on each frame of the video frame sequence and output low-frequency sub-images and high-frequency sub-images corresponding to different scales; and a classification and integration submodule, used to classify and integrate the low-frequency sub-images and high-frequency sub-images corresponding to different scales to obtain multi-scale high-frequency sub-images and low-frequency sub-images for each frame.