A flame height recognition method and device based on improved Unet and binocular vision, a terminal device and a computer readable storage medium
By improving the Unet model and binocular vision technology, multi-scale semantic features are extracted and weighted flame feature maps are generated, which solves the problems of inaccurate flame segmentation and mismatch of weak textures in complex environments and achieves high-precision flame height recognition.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ELECTRIC POWER RES INST OF GUANGDONG POWER GRID CO LTD
- Filing Date
- 2026-03-20
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies suffer from inaccurate flame segmentation and low height measurement accuracy due to weak texture mismatch in complex environments.
An improved Unet model is constructed, which extracts multi-scale semantic features through an encoder, calculates channel contribution using an attention module and generates a weighted feature map, and performs feature space reconstruction and binarization using a boundary refinement module to generate a high-quality flame feature mask. The actual height of the flame is then calculated using binocular vision.
It significantly enhances the expression of flame features, suppresses background noise interference, and improves the accuracy and precision of flame height identification, meeting the high-precision requirements of power grid safety monitoring.
Figure CN122244792A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of computer vision and intelligent monitoring technology, and in particular to a method, apparatus, terminal device and computer-readable storage medium for flame height recognition based on improved Unet and binocular vision.
Background Technology
[0002] With the construction of smart grids, using binocular vision technology for non-contact monitoring of wildfires around power transmission lines has become an important means of assessing fire risk levels.
[0003] However, in actual field monitoring scenarios of power transmission lines, the image background is often extremely complex and is frequently accompanied by noise such as dense smoke obscuring the scene and interference from tree branches and leaves. General deep learning models currently used to process binocular images typically apply equal weights to the semantic features of all channels during feature extraction and lack a mechanism for filtering feature channels by importance. As a result, the model cannot distinguish which channels carry flame features and which carry smoke or background noise. The extracted multi-scale semantic features are therefore mixed with a large number of background elements, which directly interferes with the subsequent identification of flame targets. This leads to a significant error in the final determination of the actual flame height, making it difficult to meet the high-precision requirements of power grid safety monitoring.
Summary of the Invention
[0004] This invention provides a flame height recognition method based on an improved Unet and binocular vision, which can solve the problem of low height measurement accuracy caused by inaccurate flame segmentation and weak texture mismatch in complex environments in existing technologies.
[0005] One embodiment of the present invention provides a flame height recognition method based on improved Unet and binocular vision, comprising: acquiring binocular images from a binocular camera; inputting the binocular images into a pre-built improved Unet model, so that the improved Unet model generates a corresponding binarized binocular image mask based on the binocular images; and determining the actual height of the flame based on the binarized binocular image mask. The improved Unet model includes an encoder, an attention module, a decoder, and a boundary refinement module, wherein the attention module includes a global average pooling layer and a fully connected layer. Generating the binarized binocular image mask includes: based on the encoder, extracting multi-scale semantic features of the binocular images at different resolution levels; based on the attention module, pooling the multi-scale semantic features with the global average pooling layer to obtain channel descriptors that represent global information, calculating with the fully connected layer the contribution of each channel feature to flame target recognition from the channel descriptors, generating a channel scaling factor that is proportional to the contribution, and weighting the multi-scale semantic features by the channel scaling factor to generate a weighted feature map in which flame features are enhanced and background noise is suppressed; based on the decoder, reconstructing and fusing the weighted feature map in the feature space to obtain preliminary segmentation features; and based on the boundary refinement module, performing boundary context refinement and binarization on the preliminary segmentation features to generate the binarized binocular image mask.
[0006] Further, the boundary refinement module includes a dilated convolutional layer and an output convolutional layer; the step of reconstructing and fusing the weighted feature map based on the decoder to obtain preliminary segmentation features, and then performing boundary context refinement and binarization on the preliminary segmentation features based on the boundary refinement module to generate the binarized binocular image mask, includes: upsampling the weighted feature map with the decoder, and concatenating and fusing the upsampled feature map with the encoder feature map of the corresponding level to generate preliminary segmentation features containing flame outline information; applying dilated convolution to the preliminary segmentation features with the dilated convolutional layer to expand the receptive field and capture the contextual information of blurred flame edges, thereby obtaining boundary enhancement features; and reducing the channel dimension of the boundary enhancement features with the output convolutional layer, then applying probability mapping through an activation function followed by binarization to output a binarized binocular image mask that characterizes the flame pixel region.
Furthermore, the binarized binocular image mask includes a binarized left-eye image mask and a binarized right-eye image mask, wherein, in both masks, the regions marked with the valid target identifier value are flame regions and the remaining regions are background regions; the step of determining the actual height of the flame based on the binarized binocular image mask includes: traversing the left-eye image mask and extracting the set of pixels marked with the valid target identifier value; determining the highest and lowest points of the pixel set on the vertical coordinate axis, and calculating the difference between their vertical coordinates to obtain the flame pixel height; using the highest point as a reference, constructing a stereo matching search path based on epipolar constraints, searching for corresponding points in the right-eye image mask, and accepting as candidate matching points only those pixels in the right-eye image mask that are marked with the valid target identifier value and lie on the stereo matching search path; calculating the matching cost between the highest point and each candidate matching point, and taking the candidate matching point with the lowest matching cost as the best matching point; obtaining the flame region disparity value from the difference between the horizontal coordinates of the highest point and the best matching point; and calculating the actual flame height based on the flame region disparity value and the flame pixel height.
[0007] Further, the binocular image includes a left-eye image and a right-eye image; calculating the matching cost between the highest point and each of the candidate matching points includes: obtaining the first gray-level gradient feature of the neighborhood of the first pixel point, which corresponds to the highest point in the left-eye image, and the second gray-level gradient feature of the neighborhood of each second pixel point, which corresponds to a candidate matching point in the right-eye image; calculating the texture correlation score between the neighborhood of the first pixel point and the neighborhood of each second pixel point using an image statistical similarity algorithm; calculating the gradient similarity score between the first gray-level gradient feature and the second gray-level gradient feature; and fusing the texture correlation score and the gradient similarity score by weighting, then converting the fused score into a cost to obtain the matching cost between the highest point and each of the candidate matching points.
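The cost described above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: it assumes zero-mean normalized cross-correlation as the "image statistical similarity algorithm", cosine similarity of the patch gradients as the gradient similarity score, and `1 - score` as the cost transform. The function names, the patch radius `r`, and the weights `w_tex` and `w_grad` are hypothetical, and every point is assumed to lie at least `r` pixels from the image border.

```python
import numpy as np

def patch(img, y, x, r):
    """Return the (2r+1) x (2r+1) neighborhood centred on (y, x)."""
    return img[y - r:y + r + 1, x - r:x + r + 1].astype(np.float64)

def ncc(a, b, eps=1e-9):
    """Zero-mean normalized cross-correlation of two patches, in [-1, 1]."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / (np.sqrt((a * a).sum() * (b * b).sum()) + eps))

def gradient_similarity(a, b, eps=1e-9):
    """Cosine similarity between the stacked gray-level gradients of two patches."""
    ga = np.concatenate([g.ravel() for g in np.gradient(a)])
    gb = np.concatenate([g.ravel() for g in np.gradient(b)])
    return float(ga @ gb / (np.linalg.norm(ga) * np.linalg.norm(gb) + eps))

def matching_cost(left, right, top, candidates, r=3, w_tex=0.6, w_grad=0.4):
    """Cost of matching the left-image point `top` to each right-image candidate.

    The weights are illustrative assumptions; a lower cost means a better match.
    """
    y, x = top
    ref = patch(left, y, x, r)
    costs = []
    for cy, cx in candidates:
        cand = patch(right, cy, cx, r)
        score = w_tex * ncc(ref, cand) + w_grad * gradient_similarity(ref, cand)
        costs.append(1.0 - score)  # assumed cost transform
    return costs
```

The candidate with the smallest returned cost would then be taken as the best matching point.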
[0008] Further, the step of calculating the actual height of the flame based on the flame region disparity value and the flame pixel height includes: obtaining the calibrated focal length and baseline length of the binocular camera; calculating the depth distance from the flame to the camera based on the binocular triangulation principle, under which the depth is proportional to the calibrated focal length and the baseline length and inversely proportional to the flame region disparity value; and calculating the actual height of the flame based on the pinhole imaging projection principle and the linear proportional relationship between the depth distance, the calibrated focal length, and the flame pixel height.
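The two relationships named above can be written out as follows. The symbols are assumptions introduced here for illustration: $f$ is the calibrated focal length in pixels, $B$ the baseline length, $d$ the flame region disparity value in pixels, $h_p$ the flame pixel height, $Z$ the depth distance, and $H$ the actual flame height.

```latex
% Binocular triangulation: depth is inversely proportional to disparity
Z = \frac{f \, B}{d}

% Pinhole projection: real height scales linearly with depth and pixel height
H = \frac{Z \, h_p}{f} = \frac{B \, h_p}{d}
```

As a purely hypothetical example, with $f = 1200$ px, $B = 0.12$ m, $d = 8$ px and $h_p = 100$ px, the depth is $Z = 1200 \times 0.12 / 8 = 18$ m and the height is $H = 18 \times 100 / 1200 = 1.5$ m. Under this simplified model the focal length cancels out of $H$, so the height estimate depends mainly on the accuracy of the baseline and of the disparity.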
[0009] Furthermore, the encoder includes multiple parallel convolutional branches; the extraction of multi-scale semantic features of the binocular image at different resolution levels based on the encoder includes: using convolution kernels with different receptive field sizes in each of the convolutional branches to perform convolution operations on the binocular image, obtaining multiple sets of branch feature maps; and concatenating and fusing the branch feature maps along the feature channel dimension to generate the multi-scale semantic features used to capture the morphological features of flames at different scales.
[0010] Furthermore, the method for constructing the improved Unet model includes: constructing a dedicated dataset for power transmission line wildfires by including binocular wildfire image pairs from multiple scenarios and their corresponding pixel-level labeled masks; inputting the left and right eye images from the binocular wildfire image pairs as joint training samples into a deep neural network to be trained; calculating a pixel classification loss value to measure pixel classification accuracy and a boundary shape loss value to measure boundary contour overlap based on the prediction mask output by the deep neural network and the pixel-level labeled mask; and updating the model parameters through backpropagation based on the weighted sum of the pixel classification loss value and the boundary shape loss value until the model converges, thereby generating the improved Unet model.
[0011] Another embodiment of the present invention provides a flame height recognition device based on an improved Unet and binocular vision, comprising: an image acquisition module, a semantic segmentation module, and a height calculation module; the image acquisition module is used to acquire binocular images from a binocular camera; the semantic segmentation module is used to input the binocular images into a pre-constructed improved Unet model, so that the improved Unet model generates a corresponding binarized binocular image mask based on the binocular images; wherein the improved Unet model includes an encoder, an attention module, a decoder, and a boundary refinement module, and the attention module includes a global average pooling layer and a fully connected layer; based on the encoder, multi-scale semantic features of the binocular images at different resolution levels are extracted; based on the attention module, the multi-scale semantic features are pooled using the global average pooling layer to obtain a channel descriptor representing global information, the fully connected layer calculates the contribution of each channel feature to flame target recognition from the channel descriptor, and a channel scaling factor proportional to the contribution is generated; the multi-scale semantic features are weighted by the channel scaling factor to generate a weighted feature map with enhanced flame features and suppressed background noise; the weighted feature map is reconstructed and fused using the decoder to obtain preliminary segmentation features; and the preliminary segmentation features are refined and binarized using the boundary refinement module to generate the binarized binocular image mask. The height calculation module is used to determine the actual height of the flame based on the binarized binocular image mask.
[0012] Another embodiment of the present invention provides a terminal device, including: a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein when the processor executes the computer program, it implements the steps of the present invention's flame height recognition method based on improved Unet and binocular vision.
[0013] Another embodiment of the present invention also provides a computer-readable storage medium, including a stored computer program, wherein, when the computer program runs, it controls the device on which the computer-readable storage medium is located to execute the flame height recognition method of the present invention based on improved Unet and binocular vision.
[0014] The embodiments of the present invention have the following beneficial effects: This invention proposes a flame height recognition method based on an improved Unet and binocular vision. The core of this method lies in constructing an improved Unet model with feature filtering capabilities. The method first acquires binocular images and inputs them into the model. An encoder extracts multi-scale semantic features at different resolution levels. Based on this, a global average pooling operation is performed using an attention module to obtain channel descriptors representing global information. A fully connected layer is then used to calculate the contribution of each channel to flame target recognition, generating channel scaling coefficients. Finally, these coefficients are multiplied channel-by-channel with the features to generate a weighted feature map where flame features are enhanced and background noise is suppressed.
[0015] This method introduces a weight recalibration mechanism in the feature extraction stage, removing the limitation of traditional models that treat all feature channels equally. It can automatically suppress the feature responses of background noise such as smoke and trees according to the channel contribution while enhancing the feature expression of the flame target, thereby generating a high-quality weighted feature map. This provides a reliable feature basis for the subsequent determination of the actual flame height and addresses the problem of inaccurate flame height recognition caused by noise interference in complex backgrounds.
Brief Description of the Drawings
[0016] To more clearly illustrate the technical solution of this application, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0017] Figure 1 is a flowchart of a flame height recognition method based on improved Unet and binocular vision, provided by an embodiment of the present invention.
[0018] Figure 2 is a schematic structural diagram of a flame height recognition device based on improved Unet and binocular vision, provided by an embodiment of the present invention.
Detailed Implementation
[0019] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below with reference to the accompanying drawings of the embodiments. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0020] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains; the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the application; the terms “comprising” and “having”, and any variations thereof, in the specification, claims, and foregoing description of the drawings are intended to cover non-exclusive inclusion.
[0021] In the description of the embodiments of this application, technical terms such as "first" and "second" are used only to distinguish different objects and should not be construed as indicating or implying relative importance or implicitly specifying the number, specific order, or primary and secondary relationship of the indicated technical features. In the description of the embodiments of this application, "multiple" means two or more, unless otherwise explicitly defined.
[0022] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor to a separate or alternative embodiment that is mutually exclusive with other embodiments. Those skilled in the art will understand, both explicitly and implicitly, that the embodiments described herein can be combined with other embodiments.
[0023] In the description of the embodiments in this application, the term "and / or" is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. Additionally, the character " / " in this document generally indicates that the preceding and following related objects have an "or" relationship.
[0024] In the description of the embodiments of this application, the term "multiple" refers to two or more (including two), similarly, "multiple sets" refers to two or more (including two sets), and "multiple pieces" refers to two or more (including two pieces).
[0025] In the description of the embodiments of this application, unless otherwise expressly specified and limited, technical terms such as "installation," "connection," "joining," and "fixing" should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral part; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; they can refer to the internal communication of two components or the interaction between two components. For those skilled in the art, the specific meaning of the above terms in the embodiments of this application can be understood according to the specific circumstances.
[0026] To address the low accuracy of existing technologies in complex environments due to inaccurate flame segmentation and weak texture mismatch, an embodiment of the present invention provides a flame height recognition method based on an improved Unet and binocular vision, comprising: Step S1: Acquire binocular images from the binocular camera. Specifically, in a preferred embodiment, the binocular camera is pre-installed on the side of a power transmission line tower or mounted on a drone inspection platform; the binocular camera includes a left-eye camera and a right-eye camera that have the same optical parameters (such as focal length f and resolution) and whose optical axes are parallel and aligned; the binocular image includes a left-eye image and a right-eye image. For example, a synchronous triggering mechanism ensures that the left-eye and right-eye images are acquired at the same moment, so that their sampling timestamps are identical.
[0027] It is worth noting that an offline calibration of the binocular camera is performed before the image acquisition step to obtain the preset camera calibration parameters. Specifically, Zhang Zhengyou's calibration method is used to calibrate the binocular camera: first, a high-precision asymmetric black-and-white checkerboard (e.g., a 7×9 square array) is used as the physical calibration reference; then, the binocular camera is controlled to simultaneously capture multiple sets (e.g., 10 to 20 sets) of checkerboard images in different poses (including different tilt and rotation angles and coverage of the image edge regions); subsequently, a corner detection algorithm extracts the checkerboard corners with sub-pixel refinement, and a system of equations is solved to obtain the intrinsic parameter matrix of the camera (including the horizontal focal length $f_x$, the vertical focal length $f_y$, and the principal point coordinates $(c_x, c_y)$), the distortion coefficients (including the radial distortion coefficients $k_1, k_2$ and the tangential distortion coefficients $p_1, p_2$), and the extrinsic matrix between the left and right cameras (including the rotation matrix R and the translation vector T).
[0028] Furthermore, after the original binocular images are acquired, they are preprocessed using the parameters obtained from the above calibration. Specifically, the `initUndistortRectifyMap` function of a computer vision library (such as OpenCV) is called to build the pixel remapping that, combined with the intrinsic parameter matrix and distortion coefficients, removes the edge distortion of the original images and, combined with the extrinsic parameter matrix, stereo-rectifies the left and right images.
[0029] In a specific application scenario, the corrected binocular images eliminate lens physical distortion, and the corresponding pixel rows of the left and right images are on the same horizontal epipolar line, thus forming a row-aligned corrected image pair, providing a physical basis for subsequent epipolar constraint matching.
[0030] By first performing offline calibration to obtain precise parameters, and then acquiring real-time images with synchronous triggering and stereo rectification, the method described above keeps the flame morphology highly consistent between the left-eye and right-eye images in highly dynamic scenarios such as wildfires. This avoids the flame position shifts or morphological distortions that a time difference between exposures would introduce into the subsequent matching calculations. At the same time, aligning the images to the same horizontal epipolar lines through stereo rectification reduces the search in the subsequent stereo matching process from a two-dimensional search to a one-dimensional row search. This improves computational efficiency while eliminating lens distortion, securing the initial accuracy of the depth and distance calculations at the source and laying a sound image data foundation for the subsequent high-precision flame height recognition.
[0031] Step S2: Input the binocular image into a pre-constructed improved Unet model, so that the improved Unet model generates a corresponding binarized binocular image mask based on the binocular image; wherein the improved Unet model includes an encoder, an attention module, a decoder, and a boundary refinement module, and the attention module includes a global average pooling layer and a fully connected layer; generating the corresponding binarized binocular image mask based on the binocular image includes: extracting multi-scale semantic features of the binocular image at different resolution levels based on the encoder; based on the attention module, performing a pooling operation on the multi-scale semantic features using the global average pooling layer to obtain a channel descriptor representing global information; calculating, through the fully connected layer, the contribution of each channel feature to flame target recognition from the channel descriptor, and generating a channel scaling coefficient based on the contribution, wherein the channel scaling coefficient is proportional to the contribution; weighting the multi-scale semantic features by the channel scaling coefficient to generate a weighted feature map in which flame features are enhanced and background noise is suppressed; reconstructing and fusing the weighted feature map based on the decoder to obtain preliminary segmentation features; and performing boundary context refinement and binarization on the preliminary segmentation features based on the boundary refinement module to generate the binarized binocular image mask.
[0032] In a preferred embodiment, the method for constructing the improved Unet model includes: constructing a dedicated dataset for power transmission line wildfires by including binocular wildfire image pairs from multiple scenarios and their corresponding pixel-level labeled masks; inputting the left and right eye images from the binocular wildfire image pairs as joint training samples into a deep neural network to be trained; calculating a pixel classification loss value to measure pixel classification accuracy and a boundary shape loss value to measure boundary contour overlap based on the prediction mask output by the deep neural network and the pixel-level labeled mask; and updating the model parameters through backpropagation based on the weighted sum of the pixel classification loss value and the boundary shape loss value until the model converges, thereby generating the improved Unet model.
[0033] Specifically, when constructing the dedicated dataset, images of complex scenes such as daytime, nighttime, and smoke-covered environments are collected using drones equipped with dual-spectral cameras or tower monitoring devices. These images are then manually annotated using the Labelme tool, forming a four-tuple dataset of "image-mask-parameter-true height". During training, the pixel classification loss value can specifically employ Binary Cross-Entropy Loss (BCE Loss), and the boundary shape loss value can specifically employ Dice Loss. The model uses an end-to-end joint training approach to ensure that the semantic features output from the left and right images, after being processed by the same set of network parameters, have a high degree of consistency.
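The weighted sum of the pixel classification loss and the boundary shape loss can be sketched in NumPy as below. This is an illustrative sketch only: the source names BCE Loss and Dice Loss but does not give the weights, so `w_bce`, `w_dice`, and the Dice smoothing term are assumptions, and a real training loop would use a deep learning framework with automatic differentiation.

```python
import numpy as np

def bce_loss(prob, target, eps=1e-7):
    """Binary cross-entropy averaged over all pixels."""
    p = np.clip(prob, eps, 1.0 - eps)  # avoid log(0)
    return float(-(target * np.log(p) + (1 - target) * np.log(1 - p)).mean())

def dice_loss(prob, target, smooth=1.0):
    """1 - Dice coefficient; penalizes poor overlap between prediction and label."""
    inter = (prob * target).sum()
    return float(1.0 - (2.0 * inter + smooth) / (prob.sum() + target.sum() + smooth))

def combined_loss(prob, target, w_bce=0.5, w_dice=0.5):
    """Weighted sum of the two losses (weights are hypothetical)."""
    return w_bce * bce_loss(prob, target) + w_dice * dice_loss(prob, target)
```

Here `prob` is the predicted per-pixel flame probability and `target` is the pixel-level labeled mask with values 0 and 1.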
[0034] By using the above settings and a dedicated dataset covering multiple scenarios and a joint training strategy, the model's generalization ability to complex outdoor environments can be significantly improved. At the same time, the introduction of boundary shape loss values can force the network to focus on the details of the flame edges, effectively solving the problem of contour smoothing or breakage that traditional loss functions easily produce when dealing with flame targets with blurred edges.
[0035] In a preferred embodiment, the encoder includes multiple parallel convolutional branches; the step of extracting multi-scale semantic features of the binocular image at different resolution levels based on the encoder includes: using convolution kernels with different receptive field sizes in each convolutional branch to perform convolution operations on the binocular image, obtaining multiple sets of branch feature maps; and concatenating and fusing the branch feature maps along the feature channel dimension to generate the multi-scale semantic features used to capture the morphological features of flames at different scales.
[0036] Specifically, the encoder's backbone network is designed with a multi-branch structure at each downsampling stage. Illustratively, the parallel convolutional branches specifically include three branches, employing convolutional kernels of sizes 3×3, 5×5, and 7×7, respectively. Because the distance between the flame and the camera varies in power transmission line monitoring scenarios, the flame's scale in the image changes drastically. A 3×3 convolutional kernel can extract the subtle texture features of the flame tip, while a large 7×7 convolutional kernel can capture the macroscopic contour features of the flame's main body. By concatenating these features along the channel dimension, a feature map integrating multi-scale information is obtained.
[0037] By setting up convolutional kernels with different receptive fields in parallel, the encoder is equipped with multi-scale perception capabilities. It can capture both local details and global morphology of flames without increasing network depth, effectively solving the technical problem that a single convolutional kernel cannot adapt to the large scale changes of wildfire targets. This provides a rich source of information for subsequent feature selection.
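The parallel-branch idea can be illustrated with a small NumPy sketch. It is a simplification under stated assumptions: the input is a single-channel image, each branch has a single kernel, and the kernels passed in stand for learned filters of sizes 3×3, 5×5 and 7×7. The function names are hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_same(x, kernel):
    """Single-channel 2-D cross-correlation with zero padding (output size = input size)."""
    k = kernel.shape[0]                        # assumes a square kernel with odd size
    xp = np.pad(x, k // 2)
    windows = sliding_window_view(xp, (k, k))  # shape (H, W, k, k)
    return np.einsum("hwij,ij->hw", windows, kernel)

def multi_scale_block(x, kernels):
    """Run the parallel branches and stack their outputs along a new channel axis."""
    branches = [conv2d_same(x, k) for k in kernels]
    return np.stack(branches, axis=0)          # shape (num_branches, H, W)
```

Because every branch uses zero padding that preserves the spatial size, the branch outputs can be joined along the channel dimension without any resampling.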
[0038] In a preferred embodiment, the attention module utilizes a squeeze-and-excitation (SE) mechanism. First, each channel of the feature map, with spatial dimensions h×w, is "squeezed" into a 1×1 channel descriptor by global average pooling, where h is the pixel height and w is the pixel width of the input feature map. This descriptor is a global statistic of that channel's feature distribution over the entire image.
[0039] Next, the descriptor is input into a structure containing two fully connected layers (FC) and activation functions (such as ReLU and Sigmoid), allowing the model to automatically learn the non-linear dependencies between channels and thus calculate the weights (i.e., channel scaling factors) for each channel. Finally, these factors are multiplied back into the original feature map. Illustratively, in a wildfire scene, the model automatically assigns higher weights (close to 1) to channels carrying flame color (e.g., red, yellow) and brightness features, while assigning lower weights (close to 0) to channels carrying background noise such as smoke (gray) and trees (green).
[0040] Through the above settings, the weights of the feature channels are recalibrated. This SE attention mechanism enables the network to focus automatically on the flame target and to suppress the interfering responses of dense smoke and complex backgrounds, which improves the signal-to-noise ratio of the feature map and addresses the insufficient accuracy of flame feature extraction under complex backgrounds.
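The squeeze, excitation, and reweighting steps can be sketched in NumPy as follows. The weight matrices `w1`, `w2` and biases `b1`, `b2` stand for the learned parameters of the two fully connected layers; their names and the channel reduction between the layers are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_attention(features, w1, b1, w2, b2):
    """Squeeze-and-excitation on a (C, H, W) feature map.

    Returns the reweighted features and the per-channel scaling factors.
    """
    squeeze = features.mean(axis=(1, 2))         # global average pooling -> (C,)
    hidden = np.maximum(0.0, w1 @ squeeze + b1)  # first FC layer + ReLU
    scale = sigmoid(w2 @ hidden + b2)            # second FC layer + Sigmoid, each in (0, 1)
    return features * scale[:, None, None], scale
```

Each channel is multiplied by one scalar in (0, 1), so a channel judged to contribute little to flame recognition is attenuated across the whole image rather than pixel by pixel.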
[0041] In a preferred embodiment, the boundary refinement module includes a dilated convolutional layer and an output convolutional layer. The step of reconstructing and fusing the weighted feature map based on the decoder to obtain preliminary segmentation features, and then performing boundary context refinement and binarization processing on the preliminary segmentation features based on the boundary refinement module to generate the binarized binocular image mask, includes: upsampling the weighted feature map by the decoder, and splicing and fusing the upsampled feature map with the feature map of the corresponding level of the encoder to generate preliminary segmentation features containing flame outline information; performing dilated convolution on the preliminary segmentation features based on the dilated convolutional layer to expand the receptive field and capture the contextual information of blurred flame edges, thereby obtaining boundary enhancement features; and performing channel dimensionality reduction mapping on the boundary enhancement features using the output convolutional layer, then applying probability mapping through an activation function and binarization to the mapped features, to output a binarized binocular image mask characterizing the flame pixel region.
In a preferred embodiment, the boundary refinement module is located at the network output and employs dilated convolutions with a dilation rate greater than 1, which expand the receptive field without reducing resolution. A 1×1 convolutional layer maps the multi-channel features to a single-channel map, which a sigmoid function then maps to the (0, 1) interval as a probability map. A threshold (e.g., 0.5) is applied for binarization, and the left-eye and right-eye image masks are output respectively.
Through these settings, the dilated convolutions capture the contextual information around the flame edge and compensate for the loss of edge detail caused by downsampling, so that the mask contour follows the actual flame shape more closely.
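The boundary refinement steps can be illustrated with a small pure-Python sketch. It is a simplified example under stated assumptions, not the patented layer: the helper names `dilated_conv3x3` and `to_mask`, the single-channel 3×3 kernel, the zero padding, and the dilation rate of 2 are all hypothetical choices made for readability.

```python
import math

def dilated_conv3x3(img, kernel, rate=2):
    """3x3 convolution whose taps are `rate` pixels apart (zero padded),
    so the receptive field grows while the output keeps the input size."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    yy, xx = y + (ky - 1) * rate, x + (kx - 1) * rate
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[ky][kx] * img[yy][xx]
            out[y][x] = acc
    return out

def to_mask(channels, weights, bias=0.0, threshold=0.5):
    """1x1 convolution (per-pixel weighted sum over channels), sigmoid,
    then thresholding into a 0/1 mask."""
    h, w = len(channels[0]), len(channels[0][0])
    mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            logit = bias + sum(wt * ch[y][x] for wt, ch in zip(weights, channels))
            prob = 1.0 / (1.0 + math.exp(-logit))
            mask[y][x] = 1 if prob > threshold else 0
    return mask
```

With a dilation rate of 2, a 3×3 kernel covers a 5×5 area of the input, which is how the receptive field is enlarged without any downsampling.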
[0042] It should be noted that the binarized binocular image mask includes a binarized left-eye image mask and a binarized right-eye image mask; in both masks, the area marked with the valid target identifier value is the flame area, and the remaining area is the background area. In this embodiment, the combination of feature space reconstruction and boundary context refinement overcomes the segmentation blurring and contour adhesion caused by semi-transparent smoke and the varied shapes of flame edges in wildfire scenes, achieving high-precision pixel-level extraction of flames. More importantly, the generated binarized left-eye and right-eye image masks provide strict and precise constraints for the subsequent binocular stereo matching. The subsequent disparity search and matching process can therefore bypass complex background interference areas and operate only within the actual flame target, which significantly reduces the mismatch rate against weak-texture backgrounds and lays a reliable data foundation for the final high-precision calculation of the actual flame height.
[0043] Step S3: Determine the actual height of the flame based on the binarized binocular image mask.
[0044] In a preferred embodiment, determining the actual flame height based on the binarized binocular image mask includes: traversing the left-eye image mask and extracting a set of pixels marked with the valid target identifier value; determining the highest and lowest points of the pixel set on the vertical coordinate axis and calculating the difference in the vertical coordinates between the highest and lowest points to obtain the flame pixel height; constructing a stereo matching search path based on epipolar constraints using the highest point as a reference, performing a corresponding point search in the right-eye image mask, and determining only pixels in the right-eye image mask marked with the valid target identifier value and located on the stereo matching search path as candidate matching points; calculating the matching cost between the highest point and each of the candidate matching points, and determining the candidate matching point with the lowest matching cost as the best matching point; obtaining the flame region disparity value based on the difference between the highest point and the best matching point on the horizontal coordinate axis; and calculating the actual flame height based on the flame region disparity value and the flame pixel height.
[0045] Specifically, the valid target identifier value is typically 1 (representing foreground flame). When calculating disparity, the epipolar constraint that holds after stereo rectification is used, meaning that the corresponding point in the right image must lie on the same horizontal line (same row number) as the point in the left image. Through this setting, the mask constraint strategy strictly limits the search range of stereo matching to the flame mask area in the right image and directly excludes interference from the background area. This avoids the mismatches that traditional full-image matching easily produces in weak-texture areas and ensures the robustness of the disparity calculation.
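The mask traversal and the mask-constrained epipolar search can be sketched as follows. This is an illustrative sketch only: the function names, the `max_disparity` search bound, the choice of the leftmost pixel of the top row as the highest point, and the assumption that the match lies at a column no greater than the left-image column (the usual convention for a rectified pair with the left image as reference) are not specified in the text above.

```python
def flame_pixel_extent(left_mask, valid=1):
    """Return ((x_top, y_top), pixel_height) for the flame in the left mask,
    or None when the mask contains no flame pixel."""
    rows = [y for y, row in enumerate(left_mask) if valid in row]
    if not rows:
        return None
    top, bottom = rows[0], rows[-1]
    # Assumption: take the leftmost valid pixel on the top row as the highest point.
    x_top = left_mask[top].index(valid)
    # Pixel height is the difference of the vertical coordinates.
    return (x_top, top), bottom - top

def candidate_matches(right_mask, top_point, max_disparity=64, valid=1):
    """Columns on the same row of the right mask that are flame pixels.
    Background pixels are never considered as candidates."""
    x_left, y = top_point
    x_min = max(0, x_left - max_disparity)
    return [x for x in range(x_min, x_left + 1) if right_mask[y][x] == valid]
```

Because background columns are filtered out before any cost is computed, weak-texture background regions cannot produce a false match.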
[0046] In the above matching process, calculating the matching cost between the highest point and each of the candidate matching points includes: obtaining the first gray-level gradient feature of the neighborhood of the first pixel point corresponding to the highest point in the left-eye image, and the second gray-level gradient feature of the neighborhood of each second pixel point corresponding to each of the candidate matching points in the right-eye image; calculating the texture correlation score between the neighborhood of the first pixel point and the neighborhood of each second pixel point using an image statistical similarity algorithm; calculating the gradient similarity score between the first gray-level gradient feature and each second gray-level gradient feature; and performing weighted fusion of the texture correlation score and the gradient similarity score, followed by cost conversion, to obtain the matching cost between the highest point and each of the candidate matching points.
[0047] Illustratively, the image statistical similarity algorithm preferably employs normalized cross-correlation (NCC), because this method is robust to changes in illumination. Gradient features are introduced at the same time because the internal texture of flames is often weak, whereas gradients reflect local shape changes well. Calculating this dual matching cost from both texture and gradient information further improves matching accuracy in weak-texture flame areas and helps ensure that the best matching point found is the actual corresponding physical point.
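One possible form of the dual matching cost is sketched below. The text does not give the fusion weights or the cost conversion, so the following are assumptions made for illustration: gradients are central differences, gradient similarity is a cosine similarity, the fusion weight `alpha` is 0.5, and the cost is one minus the fused score. The sketch also assumes that every evaluated point lies at least `r + 1` pixels inside the image border.

```python
import math

def patch(img, x, y, r=1):
    """Flatten the (2r+1) x (2r+1) gray-level neighborhood centered on (x, y)."""
    return [img[yy][xx] for yy in range(y - r, y + r + 1)
            for xx in range(x - r, x + r + 1)]

def gradient_feature(img, x, y, r=1):
    """Central-difference gradients (dx, dy) for each pixel of the neighborhood."""
    feat = []
    for yy in range(y - r, y + r + 1):
        for xx in range(x - r, x + r + 1):
            feat.append(img[yy][xx + 1] - img[yy][xx - 1])
            feat.append(img[yy + 1][xx] - img[yy - 1][xx])
    return feat

def ncc(a, b):
    """Normalized cross-correlation of two equally sized patches."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    den = math.sqrt(sum((p - ma) ** 2 for p in a) * sum((q - mb) ** 2 for q in b))
    return num / den if den > 0 else 0.0

def cosine(a, b):
    """Cosine similarity, used here as the assumed gradient similarity score."""
    den = math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))
    return sum(p * q for p, q in zip(a, b)) / den if den > 0 else 0.0

def matching_cost(left, right, x_l, x_r, y, alpha=0.5, r=1):
    """Fuse texture and gradient scores, then convert similarity to a cost."""
    texture = ncc(patch(left, x_l, y, r), patch(right, x_r, y, r))
    grad = cosine(gradient_feature(left, x_l, y, r),
                  gradient_feature(right, x_r, y, r))
    return 1.0 - (alpha * texture + (1.0 - alpha) * grad)

def best_match(left, right, x_l, y, candidates, alpha=0.5, r=1):
    """Candidate column in the right image with the lowest matching cost."""
    return min(candidates,
               key=lambda x_r: matching_cost(left, right, x_l, x_r, y, alpha, r))
```

The disparity then follows as the difference between the left-image column and the column returned by `best_match`.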
[0048] Finally, the step of calculating the actual height of the flame based on the flame region disparity value and the flame pixel height includes: obtaining the calibrated focal length and baseline length of the binocular camera; calculating the depth distance from the flame to the camera based on the binocular triangulation principle, in which the depth is proportional to the calibrated focal length and the baseline length and inversely proportional to the flame region disparity value; and calculating the actual height of the flame based on the pinhole imaging projection principle and the linear proportional relationship among the depth distance, the calibrated focal length, and the flame pixel height.
[0049] Specifically, the depth distance is calculated using the formula Z = (f × B) / D, and the actual height is calculated using the formula H = (Z / f) × Y. Here, Z represents the perpendicular depth distance from the flame target point (i.e., the highest point of the flame) to the imaging plane of the binocular camera; f represents the calibrated focal length of the binocular camera, which is the camera intrinsic parameter obtained in step S1 using the Zhang Zhengyou calibration method (specifically, it can be the equivalent focal length fx in the horizontal direction or the equivalent focal length fy in the vertical direction; since the images have been rectified, the two are usually approximately equal); B represents the baseline length of the binocular camera, that is, the physical straight-line distance between the optical centers of the left and right cameras, determined by the magnitude of the translation vector in the extrinsic calibration matrix in step S1; D represents the flame region disparity value, which is the difference on the horizontal coordinate axis between the highest point in the left image and the best matching point in the right image calculated in the previous step; H represents the actual physical height of the flame to be solved; and Y represents the flame pixel height, which is the difference between the vertical coordinates of the highest point and the lowest point of the flame extracted by traversing the left-eye image mask in the previous step. Through the above settings, the two-dimensional image information is mapped back to three-dimensional physical space, realizing non-contact, automated, and accurate measurement of flame height and providing reliable data support for power grid operation and maintenance.
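The two formulas can be checked with a short worked example. The function name `flame_height` and the numerical values used below are illustrative assumptions; the sketch assumes f, D, and Y are expressed in pixels and B in meters, so that Z and H come out in meters.

```python
def flame_height(focal_px, baseline_m, disparity_px, pixel_height_px):
    """Apply Z = (f * B) / D and H = (Z / f) * Y.

    Returns (depth_m, height_m). A non-positive disparity has no valid
    depth, so it is rejected.
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    depth_m = focal_px * baseline_m / disparity_px
    height_m = depth_m / focal_px * pixel_height_px
    return depth_m, height_m

# Illustrative values: f = 1000 px, B = 0.2 m, D = 10 px, Y = 150 px.
depth, height = flame_height(1000.0, 0.2, 10.0, 150.0)
```

With these values the depth is about 20 m and the flame height about 3 m. Substituting Z into the second formula gives H = B × Y / D, so the focal length cancels out of the height when the same f is used in both steps.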
[0050] As shown in Figure 2, another embodiment of the present invention also provides a flame height recognition device based on improved Unet and binocular vision, including: an image acquisition module, a semantic segmentation module, and a height calculation module. The image acquisition module is used to acquire binocular images from a binocular camera. The semantic segmentation module is used to input the binocular images into a pre-constructed improved Unet model, so that the improved Unet model generates a corresponding binarized binocular image mask based on the binocular images; wherein the improved Unet model includes an encoder, an attention module, a decoder, and a boundary refinement module, and the attention module includes a global average pooling layer and a fully connected layer. The step of generating a corresponding binarized binocular image mask based on the binocular images includes: extracting multi-scale semantic features of the binocular images at different resolution levels based on the encoder; performing, based on the attention module, a pooling operation on the multi-scale semantic features using the global average pooling layer to obtain a channel descriptor representing global information; calculating, through the fully connected layer, the contribution of each channel feature to flame target recognition based on the channel descriptor, and generating a channel scaling coefficient based on the contribution, wherein the channel scaling coefficient is proportional to the contribution; weighting the multi-scale semantic features according to the channel scaling coefficient to generate a weighted feature map in which flame features are enhanced and background noise is suppressed; reconstructing and fusing the weighted feature map based on the decoder to obtain preliminary segmentation features; and performing boundary context refinement and binarization processing on the preliminary segmentation features based on the boundary refinement module to generate the binarized binocular image mask.
[0051] The height calculation module is used to determine the actual height of the flame based on the binarized binocular image mask.
[0052] It is understood that the above-described device embodiments correspond to the method embodiments of the present invention, and can realize the flame height recognition method based on improved Unet and binocular vision provided by any of the above-described method embodiments of the present invention.
[0053] It should be noted that the device embodiments described above are merely illustrative, and some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Furthermore, in the accompanying drawings of the device embodiments provided by this invention, the connection relationships between modules indicate that they have communication connections, which can specifically be implemented as one or more communication buses or signal lines. Those skilled in the art can understand and implement this without any creative effort.
[0054] Based on the above embodiment of a flame height recognition method based on improved Unet and binocular vision, another embodiment of the present invention provides a terminal device, which includes a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor. When the processor executes the computer program, it implements a flame height recognition method based on improved Unet and binocular vision according to any embodiment of the present invention.
[0055] For example, in this embodiment, the computer program can be divided into one or more modules, which are stored in the memory and executed by the processor to complete the present invention. The one or more modules may be a series of computer program instruction segments capable of performing a specific function, which describe the execution process of the computer program in the terminal device.
[0056] The terminal device may be a desktop computer, laptop, handheld computer, or cloud server, etc. The terminal device may include, but is not limited to, a processor and a memory.
[0057] The processor can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor can be a microprocessor or any conventional processor. The processor is the control center of the terminal device, connecting all parts of the terminal device via various interfaces and lines.
[0058] Based on the above-described method embodiments, another embodiment of the present invention provides a computer-readable storage medium including a stored computer program, wherein, when the computer program is executed, it controls the device where the computer-readable storage medium is located to execute the flame height recognition method based on improved Unet and binocular vision described in any of the above-described method embodiments of the present invention.
[0059] The modules / units integrated in the device / terminal equipment, if implemented as software functional units and sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the above embodiments of the present invention can also be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a portable hard drive, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, etc.
[0060] The above description represents the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principles of the present invention, and these improvements and modifications are also considered to be within the scope of protection of the present invention.
Claims
1. A flame height recognition method based on improved Unet and binocular vision, characterized in that it comprises: acquiring binocular images from a binocular camera; inputting the binocular images into a pre-constructed improved Unet model, so that the improved Unet model generates a corresponding binarized binocular image mask based on the binocular images; and determining the actual height of the flame based on the binarized binocular image mask; wherein the improved Unet model includes an encoder, an attention module, a decoder, and a boundary refinement module; the attention module includes a global average pooling layer and a fully connected layer; and the step of generating a corresponding binarized binocular image mask based on the binocular images includes: extracting, based on the encoder, multi-scale semantic features of the binocular images at different resolution levels; performing, based on the attention module, a pooling operation on the multi-scale semantic features using the global average pooling layer to obtain a channel descriptor representing global information; calculating, through the fully connected layer, the contribution of each channel feature to flame target recognition based on the channel descriptor, and generating a channel scaling coefficient based on the contribution, wherein the channel scaling coefficient is proportional to the contribution; weighting the multi-scale semantic features according to the channel scaling coefficient to generate a weighted feature map in which flame features are enhanced and background noise is suppressed; performing, based on the decoder, feature space reconstruction and fusion on the weighted feature map to obtain preliminary segmentation features; and performing, based on the boundary refinement module, boundary context refinement and binarization processing on the preliminary segmentation features to generate the binarized binocular image mask.
2. The flame height recognition method based on improved Unet and binocular vision as described in claim 1, characterized in that the boundary refinement module includes a dilated convolutional layer and an output convolutional layer; and the step of performing, based on the decoder, feature space reconstruction and fusion on the weighted feature map to obtain preliminary segmentation features, and performing, based on the boundary refinement module, boundary context refinement and binarization processing on the preliminary segmentation features to generate the binarized binocular image mask, includes: upsampling the weighted feature map by the decoder, and splicing and fusing the upsampled feature map with the feature map of the corresponding level of the encoder to generate preliminary segmentation features containing flame outline information; performing dilated convolution on the preliminary segmentation features based on the dilated convolutional layer to expand the receptive field and capture the contextual information of blurred flame edges, thereby obtaining boundary enhancement features; and performing channel dimensionality reduction mapping on the boundary enhancement features using the output convolutional layer, then applying probability mapping through an activation function and binarization to the mapped features, to output a binarized binocular image mask characterizing the flame pixel region.
3. The flame height recognition method based on improved Unet and binocular vision as described in claim 2, characterized in that the binarized binocular image mask includes a binarized left-eye image mask and a binarized right-eye image mask; wherein, in the binarized left-eye image mask and the binarized right-eye image mask, the region marked with a valid target identifier value is the flame region, and the remaining region is the background region; and the step of determining the actual height of the flame based on the binarized binocular image mask includes: traversing the binarized left-eye image mask and extracting the set of pixels marked with the valid target identifier value; determining the highest and lowest points of the pixel set on the vertical coordinate axis, and calculating the difference in the vertical coordinates between the highest and lowest points to obtain the flame pixel height; using the highest point as a reference, constructing a stereo matching search path based on epipolar constraints, performing a search for corresponding points in the binarized right-eye image mask, and determining only the pixels in the binarized right-eye image mask that are marked with the valid target identifier value and located on the stereo matching search path as candidate matching points; calculating the matching cost between the highest point and each of the candidate matching points, and determining the candidate matching point with the lowest matching cost as the best matching point; obtaining the flame region disparity value based on the difference between the highest point and the best matching point on the horizontal coordinate axis; and calculating the actual flame height based on the flame region disparity value and the flame pixel height.
4. The flame height recognition method based on improved Unet and binocular vision as described in claim 3, characterized in that the binocular images include a left-eye image and a right-eye image; and the step of calculating the matching cost between the highest point and each of the candidate matching points includes: obtaining the first gray-level gradient feature of the neighborhood of the first pixel point corresponding to the highest point in the left-eye image, and the second gray-level gradient feature of the neighborhood of each second pixel point corresponding to each candidate matching point in the right-eye image; calculating the texture correlation score between the neighborhood of the first pixel point and the neighborhood of each second pixel point using an image statistical similarity algorithm; calculating the gradient similarity score between the first gray-level gradient feature and each second gray-level gradient feature; and performing weighted fusion of the texture correlation score and the gradient similarity score, followed by cost conversion, to obtain the matching cost between the highest point and each of the candidate matching points.
5. The flame height recognition method based on improved Unet and binocular vision as described in claim 4, characterized in that the step of calculating the actual height of the flame based on the flame region disparity value and the flame pixel height includes: obtaining the calibrated focal length and baseline length of the binocular camera; calculating the depth distance from the flame to the camera based on the binocular triangulation principle, in which the depth is proportional to the calibrated focal length and the baseline length and inversely proportional to the flame region disparity value; and calculating the actual height of the flame based on the pinhole imaging projection principle and the linear proportional relationship among the depth distance, the calibrated focal length, and the flame pixel height.
6. The flame height recognition method based on improved Unet and binocular vision as described in claim 5, characterized in that the encoder includes multiple parallel convolutional branches; and the step of extracting multi-scale semantic features of the binocular images at different resolution levels based on the encoder includes: based on the encoder, performing convolution operations on the binocular images using convolution kernels with different receptive field sizes in each of the convolutional branches to obtain multiple sets of branch feature maps; and splicing and fusing the branch feature maps along the feature channel dimension to generate the multi-scale semantic features used to capture the morphological features of flames at different scales.
7. The flame height recognition method based on improved Unet and binocular vision as described in claim 6, characterized in that the method for constructing the improved Unet model includes: constructing a dedicated dataset for power transmission line wildfires that includes binocular wildfire image pairs from multiple scenarios and their corresponding pixel-level labeled masks; inputting the left-eye and right-eye images from the binocular wildfire image pairs as joint training samples into a deep neural network to be trained; calculating, based on the prediction mask output by the deep neural network and the pixel-level labeled mask, a pixel classification loss value that measures pixel classification accuracy and a boundary shape loss value that measures boundary contour overlap; and updating the model parameters through backpropagation based on the weighted sum of the pixel classification loss value and the boundary shape loss value until the model converges, thereby generating the improved Unet model.
8. A flame height recognition device based on improved Unet and binocular vision, characterized in that it comprises: an image acquisition module, a semantic segmentation module, and a height calculation module; the image acquisition module is used to acquire binocular images from a binocular camera; the semantic segmentation module is used to input the binocular images into a pre-constructed improved Unet model, so that the improved Unet model generates a corresponding binarized binocular image mask based on the binocular images; wherein the improved Unet model includes an encoder, an attention module, a decoder, and a boundary refinement module; the attention module includes a global average pooling layer and a fully connected layer; and the step of generating a corresponding binarized binocular image mask based on the binocular images includes: extracting multi-scale semantic features of the binocular images at different resolution levels based on the encoder; performing, based on the attention module, a pooling operation on the multi-scale semantic features using the global average pooling layer to obtain a channel descriptor representing global information; calculating, through the fully connected layer, the contribution of each channel feature to flame target recognition based on the channel descriptor, and generating a channel scaling coefficient based on the contribution, wherein the channel scaling coefficient is proportional to the contribution; weighting the multi-scale semantic features according to the channel scaling coefficient to generate a weighted feature map in which flame features are enhanced and background noise is suppressed; reconstructing and fusing the weighted feature map based on the decoder to obtain preliminary segmentation features; and performing boundary context refinement and binarization processing on the preliminary segmentation features based on the boundary refinement module to generate the binarized binocular image mask;
the height calculation module is used to determine the actual height of the flame based on the binarized binocular image mask.
9. A terminal device, characterized in that it comprises a processor, a memory, and a computer program stored in the memory and configured to be executed by the processor, wherein, when the processor executes the computer program, it implements the flame height recognition method based on improved Unet and binocular vision as described in any one of claims 1-7.
10. A computer-readable storage medium, characterized in that it comprises a stored computer program, wherein, when the computer program is executed, it controls the device where the computer-readable storage medium is located to perform the flame height recognition method based on improved Unet and binocular vision as described in any one of claims 1-7.