Real-time human body detection method and device for emergency rescue based on temporal motion feature enhancement

The method processes visible light and infrared image sequences captured during emergency rescue, generates a frequency domain dynamic attention mask, and builds a dual-branch neural network. This addresses missed detections and false alarms in human body detection in complex environments and enables efficient detection of people who show only weak vital signs.

CN121305671B · Active · Publication Date: 2026-06-26 · XI AN JIAOTONG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: XI AN JIAOTONG UNIV
Filing Date: 2025-10-14
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing technologies for human detection in complex emergency rescue environments suffer from high missed-detection and false-alarm rates, difficulty in effectively extracting weak motion features, difficulty in suppressing dynamic background interference, and a lack of adaptability, making it impossible to balance detection accuracy and real-time performance.

Method used

By acquiring visible light and infrared image sequences in real time, and combining continuous frame difference, optical flow algorithm and time-frequency transformation, a frequency domain dynamic attention mask is generated, a dual-branch neural network is constructed, feature weights are dynamically allocated, and infrared temperature verification is combined to improve the sensitivity to human micro-motion features and filter dynamic background noise.

Benefits of technology

It significantly reduced the false negative rate of stationary or low-mobility trapped personnel, reduced false alarms, improved the robustness and accuracy of detection, and met the real-time needs of emergency rescue.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a real-time human body detection method and device for emergency rescue based on temporal motion feature enhancement, comprising the following steps: acquiring visible light image sequences, infrared image sequences, and environmental parameters (including smoke concentration and light intensity) of a rescue area in real time, and performing spatial registration preprocessing; performing frame difference processing on the two types of image sequences respectively to generate binary motion masks, calculating the optical flow amplitude, and obtaining a multi-scale motion energy field through fusion that adapts to the smoke concentration; extracting human body micro-motion frequency band energy through time-frequency transformation, generating a frequency domain dynamic attention mask, and combining it with the multi-scale motion energy field to extract a salient moving target region; constructing a dual-branch neural network, extracting temporal motion features and multimodal appearance features, dynamically allocating weights and fusing the features, and outputting a human body bounding box and a detection confidence; calculating the environmental complexity from the environmental parameters, determining a dynamic confidence threshold, verifying the detection result against the average temperature of the human body bounding box region, and generating alarm information if the conditions are met. The application aims to solve the high missed detection rate and unstable recognition of traditional methods in complex environments.

Description

Technical Field

[0001] This invention relates to the fields of computer vision and intelligent sensing technology, and in particular to a real-time human detection method and device for emergency rescue based on temporal motion feature enhancement.

Background Technology

[0002] With the rapid development of artificial intelligence and computer vision technologies, video-based human detection methods have been widely applied in security monitoring, intelligent transportation, and public safety. In normal environments, deep learning-based target detection algorithms (such as YOLO, SSD, and Faster R-CNN) can effectively identify human targets, meeting the detection needs of most scenarios. However, in emergency rescue scenarios, such as fires, dense smoke, earthquake ruins, and mine collapses, traditional detection methods relying on appearance features face severe challenges. Low light, occlusion, smoke interference, and dynamic backgrounds (such as flickering flames and floating debris) severely degrade image quality, leading to blurred target outlines and feature loss, resulting in a significant decrease in detection accuracy and persistently high false positive and false negative rates.

[0003] To improve detection performance in complex environments, some studies have introduced motion-feature-based detection methods, such as frame difference, background modeling, and optical flow. These methods can capture moving human targets to some extent, reducing reliance on static appearance. However, in actual rescue scenarios, trapped individuals are often stationary or exhibit only weak vital signs (such as regular breathing or slight finger tremors). Traditional motion detection algorithms, lacking sensitivity to these subtle movements, are prone to missed detections. Furthermore, irregular movements in dynamic backgrounds (such as flames, smoke, and dust) frequently interfere with detection results, making it difficult for the system to distinguish between real human movement and environmental noise.

[0004] In recent years, frequency domain analysis and attention mechanisms have been applied to target detection in dynamic scenes, attempting to filter background interference by extracting features from specific frequency bands. However, most methods have failed to accurately locate the frequency bands of vital signs (0.1-2Hz) and lack the ability to collaboratively optimize dynamic and static features in complex environments. Furthermore, existing deep learning detection frameworks are mostly static models, unable to adaptively adjust feature weights according to environmental changes, resulting in insufficient robustness in extreme environments.

[0005] In summary, existing technologies for human detection in complex emergency rescue environments still have several shortcomings: they lack an effective mechanism for extracting weak motion features, they struggle to suppress dynamic background interference, and they rely on a single, non-adaptive strategy for fusing motion and appearance features. As a result, they cannot balance detection accuracy and real-time performance, and they fall short of the practical need to locate trapped personnel quickly and reliably in emergency rescue.

Summary of the Invention

[0006] To address the problems existing in the prior art, this invention provides a real-time human detection method and device for emergency rescue based on temporal motion feature enhancement. Its purpose is to solve the problems of high false negative rate and unstable recognition of traditional human detection methods in complex environments such as low light, occlusion and dynamic background interference.

[0007] To solve the above-mentioned technical problems, the present invention is achieved through the following technical solution:

[0008] According to a first aspect of the present invention, a real-time human detection method for emergency rescue based on temporal motion feature enhancement is provided, comprising:

[0009] The system acquires visible light image sequences, infrared image sequences, and environmental parameters synchronously collected from the rescue area in real time, and performs spatial registration preprocessing on the visible light image sequences and infrared image sequences; the environmental parameters include at least smoke concentration and light intensity.

[0010] The visible light image sequence and the infrared image sequence are processed by continuous frame difference to generate a binary motion mask. At the same time, the optical flow amplitude is calculated by an optical flow algorithm. The fusion coefficient is adaptively adjusted according to the smoke concentration, and the binary motion mask and the optical flow amplitude are fused according to the fusion coefficient to obtain a multi-scale motion energy field.

[0011] Time-frequency transformation is performed on the multi-scale motion energy field to extract the frequency band energy related to human micro-motion characteristics. A frequency domain dynamic attention mask is generated based on the frequency band energy. The frequency domain dynamic attention mask is combined with the multi-scale motion energy field, and then the salient moving target region is extracted by moving average filtering and a region growing algorithm.

[0012] A dual-branch neural network architecture consisting of a motion feature branch and an appearance feature branch is constructed. The salient moving target region is input into the motion feature branch, and temporal motion features are extracted through a temporal convolutional network. The preprocessed visible light image sequence and infrared image sequence are input into the appearance feature branch, and multimodal appearance features are extracted through a target detection network. Based on the motion saliency quantification index of the temporal motion features and the appearance reliability index of the multimodal appearance features, motion weights and appearance weights are dynamically allocated. The temporal motion features and multimodal appearance features are fused according to the motion weights and appearance weights to obtain fused features. Finally, the human body bounding box and detection confidence are output through the detection head.

[0013] The environmental complexity is calculated based on smoke concentration and light intensity, and a dynamic confidence threshold is determined based on the environmental complexity. The detection confidence is compared with the dynamic confidence threshold, and the average temperature of the human body bounding box region in the infrared image sequence is used for verification. If the detection confidence is not lower than the dynamic confidence threshold and the average temperature is within the range of human vital signs temperature, an alarm message is generated and output.
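
The two-part decision described in paragraph [0013] can be illustrated with a minimal sketch. The names `Detection` and `should_alarm` are hypothetical, and the default temperature range of 30 to 40 °C is an assumed placeholder, since the text only refers to "the range of human vital signs temperature". How the dynamic threshold itself is derived from environmental complexity is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    confidence: float   # detection confidence output by the detection head
    mean_temp_c: float  # mean infrared temperature inside the bounding box, in deg C


def should_alarm(det: Detection, threshold: float,
                 temp_range_c: tuple[float, float] = (30.0, 40.0)) -> bool:
    """Return True only when both checks pass.

    temp_range_c is an assumed placeholder range, not a value from the source.
    """
    low, high = temp_range_c
    confident = det.confidence >= threshold  # "not lower than" the dynamic threshold
    warm = low <= det.mean_temp_c <= high    # infrared temperature verification
    return confident and warm
```

Under these assumptions, a hot object such as an ember that scores a high confidence is still rejected, because its mean temperature falls outside the assumed range.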

[0014] In one possible implementation of the first aspect, the step of performing consecutive frame difference processing on the visible light image sequence and the infrared image sequence to generate a binarized motion mask specifically involves:

[0015]

[0016] In the formula, the result is the binarized motion mask of the i-th frame of the visible light or infrared image; the operands are the preprocessed visible light or infrared images of the corresponding frames; and ∧ denotes the pixel-level logical AND operation;

[0017] The optical flow amplitude is calculated using an optical flow algorithm, specifically as follows:

[0018] $$F_i = \sqrt{u_i^2 + v_i^2}$$

[0019] In the formula, $F_i$ is the optical flow amplitude of the i-th frame of the visible light or infrared image; $u_i$ is the horizontal optical flow component of the i-th frame of the visible light or infrared image; and $v_i$ is the vertical optical flow component of the i-th frame of the visible light or infrared image.

[0020] In one possible implementation of the first aspect, the adaptive adjustment of the fusion coefficient based on the smoke concentration specifically includes:

[0021]

[0022] In the formula, the quantities are the fusion coefficient, the normalized value of the smoke concentration, and two empirical parameters;

[0023] The process of fusing the binarized motion mask and the optical flow amplitude according to a fusion coefficient to obtain a multi-scale motion energy field is as follows:

[0024]

[0025] In the formula, the result is the multi-scale motion energy field of the i-th frame of the visible light or infrared image.

[0026] In one possible implementation of the first aspect, the step of performing time-frequency transformation on the multi-scale motion energy field to extract frequency band energy related to human micro-motion characteristics specifically includes:

[0027]

[0028]

[0029] In the formula, the quantities are, in order: the short-time Fourier transform; the multi-scale motion energy field of the i-th frame of the visible light or infrared image; the Hanning window function; the center of the time window; the frequency component; the differential element of time; and the frequency band energy related to human micro-motion characteristics;

[0030] The generation of the frequency domain dynamic attention mask based on the frequency band energy is specifically as follows:

[0031] Integrating the frequency band energy along the time axis yields the frequency domain energy distribution:

[0032]

[0033] In the formula, the result is the frequency domain energy distribution;

[0034] Based on the frequency domain energy distribution, the energy threshold is adaptively calculated:

[0035]

[0036] In the formula, the result is the energy threshold, and the two terms are the mean and the standard deviation of the frequency domain energy distribution, respectively.

[0037] Based on the energy threshold, the frequency band energy is mapped back to the spatial domain to generate a frequency domain dynamic attention mask:

[0038]

[0039] In the formula, the operators are the inverse short-time Fourier transform and the indicator function;

[0040] The process of combining a frequency-domain dynamic attention mask with a multi-scale motion energy field, and then extracting the salient moving target region through a moving average filter and a region growing algorithm, specifically involves:

[0041]

[0042]

[0043]

[0044] In the formulas, the quantities are, in order: the enhanced features of the i-th frame of the visible light or infrared image; pixel-by-pixel multiplication; the residual weight; the enhanced features of the i-th frame of the visible light or infrared image after moving average filtering; the size of the time window; the salient moving target region of the i-th frame of the visible light or infrared image; and the region growing operation.

[0045] In one possible implementation of the first aspect, the temporal convolutional network is a 3D convolutional neural network;

[0046] The extraction of temporal motion features through a temporal convolutional network specifically involves:

[0047]

[0048] In the formula, the result is the temporal motion features; the formula also specifies the kernel size and the stride, and the number of output channels increases layer by layer;

[0049] The target detection network is an improved YOLOv8 network, which specifically involves introducing a dynamic cross-modal attention module into the backbone of the YOLOv8 network.

[0050] The multimodal appearance features are extracted using a target detection network. Specifically:

[0051]

[0052] In the formula, the quantities are, in order: the multimodal appearance features; the i-th frame of the visible light image; the i-th frame of the infrared image; the Sigmoid function; and channel-wise multiplication.

[0053] In one possible implementation of the first aspect, the dynamic allocation of motion weights and appearance weights based on the motion saliency quantification index of temporal motion features and the appearance reliability index of multimodal appearance features specifically involves:

[0054] Based on the global average energy value of the salient moving target region, the motion saliency quantification index of temporal motion characteristics is calculated:

[0055]

[0056] In the formula, the quantities are, in order: the motion saliency quantification index of the temporal motion features; the height of the salient moving target region; the width of the salient moving target region; and the coordinates of the pixels in the salient moving target region;

[0057] Calculate the signal-to-noise ratio of the infrared image sequence and the average confidence score of keypoint detection in the visible light image sequence;

[0058] By combining the signal-to-noise ratio of infrared image sequences with the average confidence score of keypoint detection in visible light image sequences, the appearance reliability index of multimodal appearance features is calculated:

[0059]

[0060] In the formula, the quantities are, in order: the appearance reliability index of the multimodal appearance features; the signal-to-noise ratio of the infrared image sequence; and the average confidence score of keypoint detection in the visible light image sequence;

[0061] Dynamically assign motion weights and appearance weights using the Sigmoid function:

[0062]

[0063]

[0064] In the formulas, the quantities are, in order: the motion weight; the appearance weight; the parameter that controls the rate of change of the weights; and the bias term;

[0065] The process of fusing temporal motion features and multimodal appearance features according to motion weights and appearance weights to obtain fused features is as follows:

[0066]

[0067] In the formula, the result is the fused feature.
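
The weight allocation of paragraphs [0053] to [0067] can be sketched as follows. Only the global-average form of the motion saliency index, the use of a Sigmoid with a rate parameter and a bias term, and the weighted fusion are stated in the text. The plain average used in `appearance_reliability`, the argument of the Sigmoid, the complementary appearance weight, and the default values of `k` and `b` are assumptions made for illustration, and all function names are hypothetical.

```python
import math


def motion_saliency(region: list[list[float]]) -> float:
    """Global average energy of the salient moving target region (H x W)."""
    height, width = len(region), len(region[0])
    return sum(sum(row) for row in region) / (height * width)


def appearance_reliability(ir_snr: float, keypoint_conf: float) -> float:
    """Assumed combination: plain average of two scores already scaled to [0, 1]."""
    return 0.5 * (ir_snr + keypoint_conf)


def allocate_weights(saliency: float, reliability: float,
                     k: float = 5.0, b: float = 0.0) -> tuple[float, float]:
    """Assumed form: w_motion = sigmoid(k * (saliency - reliability) + b)."""
    w_motion = 1.0 / (1.0 + math.exp(-(k * (saliency - reliability) + b)))
    return w_motion, 1.0 - w_motion  # assumed: the two weights sum to 1


def fuse(motion_feat: list[float], appearance_feat: list[float],
         w_motion: float, w_appearance: float) -> list[float]:
    """Element-wise weighted sum of the two feature vectors."""
    return [w_motion * m + w_appearance * a
            for m, a in zip(motion_feat, appearance_feat)]
```

With this form, a strongly moving target seen through poor imagery shifts weight toward the motion branch, while a still target in a clean image shifts it toward the appearance branch.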

[0068] In one possible implementation of the first aspect, the step of calculating environmental complexity based on smoke concentration and light intensity, and determining a dynamic confidence threshold based on environmental complexity, specifically involves:

[0069]

[0070]

[0071] In the formulas, the quantities are, in order: the dynamic confidence threshold; the normalized value of the smoke concentration; the normalized value of the light intensity; and the light intensity threshold.

[0072] According to a second aspect of the present invention, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the aforementioned real-time human detection method for emergency rescue based on temporal motion feature enhancement.

[0073] According to a third aspect of the present invention, a computer-readable storage medium is provided, the computer-readable storage medium storing a computer program, which, when executed by a processor, implements the aforementioned real-time human body detection method for emergency rescue based on temporal motion feature enhancement.

[0074] According to a fourth aspect of the present invention, a computer program product is provided, which, when executed by a processor, implements the aforementioned real-time human body detection method for emergency rescue based on temporal motion feature enhancement.

[0075] Compared with the prior art, the present invention has at least the following beneficial effects:

[0076] This invention provides a real-time human detection method for emergency rescue based on temporal motion feature enhancement. It addresses the shortcomings of existing emergency rescue human detection technologies, namely insufficient sensitivity to trapped individuals who move only slightly, difficulty in suppressing dynamic background interference, limited feature fusion strategies, and poor adaptability of static models, through the following technical optimizations. Firstly, it constructs a multi-scale motion energy field by fusing continuous frame difference with optical flow, and adaptively adjusts the fusion coefficient according to the smoke concentration. Combined with a temporal convolutional network that extracts temporal motion features, this enhances the detection sensitivity to weak vital signs (such as breathing and slight limb movements) and significantly reduces the false negative rate for stationary or low-activity trapped individuals. Secondly, it extracts the frequency band energy of human vital signs through time-frequency transformation, constructs a frequency domain dynamic attention mask, and combines it with moving average filtering and the region growing algorithm to filter dynamic background noise such as flickering flames and drifting smoke, reducing false alarms. Thirdly, a dual-branch neural network architecture combining motion and appearance features is employed. Feature weights are dynamically allocated based on the motion saliency quantification index and the appearance reliability index, enabling adaptive fusion of the two kinds of features. This raises the weight of infrared appearance features in low-light environments and strengthens the role of motion features against dynamic backgrounds, overcoming the limitations of dependence on a single feature and improving the robustness of the system in complex environments. Furthermore, environmental complexity is calculated from smoke concentration and light intensity to determine the dynamic confidence threshold. This is combined with verification of the average temperature of the human body bounding box region in the infrared image (matching the temperature range of human vital signs), forming a dual decision-making mechanism of dynamic threshold and temperature verification that further improves detection accuracy.

[0077] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, preferred embodiments are described below in detail with reference to the accompanying drawings.

Description of the Drawings

[0078] To more clearly illustrate the technical solutions in the specific embodiments of the present invention, the drawings used in the description of the specific embodiments will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0079] Figure 1 is a flowchart of the real-time human detection method for emergency rescue based on temporal motion feature enhancement according to this invention.

[0080] Figure 2 is a diagram of the general architecture of the emergency rescue human detection system.

[0081] Figure 3 is a flowchart of the real-time human detection process for emergency rescue based on temporal motion feature enhancement in this embodiment.

[0082] Figure 4 is a diagram of the collaborative decision-making structure for motion and appearance features.

Detailed Implementation

[0083] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0084] In emergency rescue scenarios, such as complex environments like fires, dense smoke, and ruins, traditional human detection methods primarily rely on physical features (such as contours, textures, and colors). However, their performance significantly deteriorates under conditions of low light, severe obstruction, and dynamic background interference (such as flickering flames and drifting smoke), resulting in high false negative and false positive rates. Furthermore, trapped individuals are often stationary or exhibit only weak vital signs (such as breathing, heartbeat, and finger tremors), making traditional motion-based detection methods ineffective. Therefore, this invention proposes a technical solution that leverages subtle human motion features in complex dynamic environments to improve detection accuracy and robustness.

[0085] In the following embodiments, as shown in Figure 2, the emergency rescue detection system consists of a high-definition visible light camera, an infrared thermal imaging device, and an inertial navigation and environmental perception module. It can be flexibly deployed on drones, robots, or fixed monitoring points to meet the dynamic search and rescue needs in complex environments such as fires, dense smoke, and ruins. The environmental perception module collects real-time information on light intensity, smoke concentration, and dynamic background interference to assist in motion feature enhancement and adaptive adjustment of feature weights. Data synchronization errors between sensors are negligible, ensuring temporal consistency of multi-source information.

[0086] It should be noted that the baseline data for human vital signs (such as respiratory rate, heart rate characteristics, and subtle limb movements) were collected from historical rescue cases and laboratory simulation environments to construct a weak motion feature database, which serves as the basis for dynamic attention mechanisms and abnormal movement judgment. The overall system inference latency is controlled within 100ms, meeting the requirements of emergency rescue for real-time performance and high response speed, ensuring rapid completion of human detection and vital sign identification in complex environments, and improving rescue efficiency and safety assurance capabilities.

[0087] As shown in Figure 1 and Figure 3, this invention provides a real-time human detection method for emergency rescue based on temporal motion feature enhancement, specifically including the following steps:

[0088] Step 1: Acquire visible light image sequences, infrared image sequences, and environmental parameters synchronously collected from the rescue area in real time, and perform spatial registration preprocessing on the visible light image sequences and infrared image sequences; the environmental parameters include at least smoke concentration and light intensity.

[0089] Specifically, the system deploys visible light cameras and infrared thermal imaging devices on drones or rescue robot platforms to ensure comprehensive coverage of the rescue area. The visible light cameras capture visible light images, while the infrared thermal imaging devices capture infrared images. An inertial navigation and environmental perception module is also integrated to collect dynamic environmental parameters in real time, including key indicators such as smoke concentration and light intensity.

[0090] It should be understood that the acquired raw visible light and infrared images need to be preprocessed, specifically by performing the following operations:

[0091] First, the visible light image sequence undergoes enhancement processing: contrast-limited adaptive histogram equalization (CLAHE) improves detail in low-light areas, and Wiener filtering removes motion blur.

[0092]

[0093] Secondly, the infrared image sequence undergoes correction processing: non-uniformity correction (NUC) compensates for detector response differences, and wavelet threshold denoising then suppresses thermal noise.

[0094]

[0095] Finally, the scale-invariant feature transform (SIFT) is used to extract feature points from the enhanced visible light images and the corrected infrared images, and the two are spatially registered through an affine transformation to ensure consistency of the cross-modal data.

[0096] After the above processing, a standardized preprocessed dataset is finally obtained.

[0097]

[0098] where the elements are the preprocessed visible light image and infrared image of the i-th frame, and the environmental parameter vector at the corresponding time point.

[0099] Step 2: Perform continuous frame difference processing on the visible light image sequence and the infrared image sequence respectively to generate a binarized motion mask; at the same time, use the optical flow algorithm to calculate the optical flow amplitude; and adaptively adjust the fusion coefficient according to the smoke concentration, and fuse the binarized motion mask and the optical flow amplitude according to the fusion coefficient to obtain a multi-scale motion energy field.

[0100] In other words, this step constructs a multi-scale motion energy field by integrating continuous frame difference and optical flow analysis techniques, extracting and enhancing subtle human motion features from complex backgrounds while suppressing dynamic environmental interference. Continuous frame difference processing can effectively capture temporal motion features and suppress instantaneous noise.

[0101] In one possible implementation, a three-frame difference process is performed on the visible light image sequence and the infrared image sequence respectively to generate a binarized motion mask, specifically:

[0102]

[0103] In the formula, the result is the binarized motion mask of the i-th frame of the visible light or infrared image, marking potential motion regions; the three operands are the preprocessed visible light or infrared images of three consecutive frames; and ∧ denotes the pixel-level logical AND operation, which retains only the regions where both consecutive frame differences are significant, thus eliminating single-frame noise interference.
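
The three-frame difference can be sketched as follows. The sketch assumes that each of the two consecutive absolute differences is binarized with a fixed threshold before the logical AND; the parameter `thresh` and the function name `three_frame_mask` are hypothetical, and frames are represented as nested lists of gray levels purely to keep the example free of dependencies.

```python
Frame = list[list[float]]


def three_frame_mask(prev: Frame, curr: Frame, nxt: Frame,
                     thresh: float) -> list[list[int]]:
    """Binary mask: 1 where both consecutive differences exceed thresh."""
    mask = []
    for row_p, row_c, row_n in zip(prev, curr, nxt):
        mask.append([
            # Logical AND of the two thresholded frame differences.
            int(abs(c - p) > thresh and abs(n - c) > thresh)
            for p, c, n in zip(row_p, row_c, row_n)
        ])
    return mask


prev = [[10.0, 10.0, 10.0]]
curr = [[10.0, 50.0, 10.0]]
nxt = [[10.0, 90.0, 60.0]]
mask = three_frame_mask(prev, curr, nxt, thresh=20.0)
```

In this example only the middle pixel changes across both frame pairs, so it is the only one marked. The last pixel changes in a single pair and is discarded, which is the single-frame noise rejection that the AND operation provides.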

[0104] In one feasible approach, the Farneback dense optical flow algorithm is used to calculate the optical flow amplitude, specifically as follows:

[0105] $$F_i = \sqrt{u_i^2 + v_i^2}$$

[0106] In the formula, $F_i$ is the optical flow amplitude of the i-th frame of the visible light or infrared image; $u_i$ is the horizontal optical flow component of the i-th frame of the visible light or infrared image; and $v_i$ is the vertical optical flow component of the i-th frame of the visible light or infrared image.

[0107] In one feasible approach, the fusion coefficient is adaptively adjusted based on the smoke concentration in the dynamic environmental parameters, specifically as follows:

[0108]

[0109] In the formula, the quantities are the fusion coefficient, the normalized value of the smoke concentration, and two empirical parameters that control the steepness and the threshold of the weight change curve.

[0110] In one feasible approach, the binarized motion mask and the optical flow amplitude are fused using a fusion coefficient to obtain a multi-scale motion energy field, specifically:

[0111]

[0112] In the formula, the result is the multi-scale motion energy field of the i-th frame of the visible light or infrared image.
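
A minimal sketch of the smoke-adaptive fusion follows. It assumes a Sigmoid curve for the fusion coefficient, which is consistent with two parameters controlling steepness and threshold, and a convex blend of the binarized mask and the optical flow amplitude. Which of the two terms receives the coefficient, the default values `k = 10.0` and `s0 = 0.5`, and all function names are assumptions made for illustration.

```python
import math

Grid = list[list[float]]


def fusion_coefficient(smoke: float, k: float = 10.0, s0: float = 0.5) -> float:
    """Assumed Sigmoid in the normalized smoke concentration (0..1)."""
    return 1.0 / (1.0 + math.exp(-k * (smoke - s0)))


def flow_magnitude(u: float, v: float) -> float:
    """Optical flow amplitude from horizontal and vertical components."""
    return math.hypot(u, v)


def motion_energy(mask: Grid, flow_u: Grid, flow_v: Grid, smoke: float,
                  k: float = 10.0, s0: float = 0.5) -> Grid:
    """Assumed blend per pixel: alpha * mask + (1 - alpha) * flow amplitude."""
    alpha = fusion_coefficient(smoke, k, s0)
    return [
        [alpha * m + (1.0 - alpha) * flow_magnitude(u, v)
         for m, u, v in zip(row_m, row_u, row_v)]
        for row_m, row_u, row_v in zip(mask, flow_u, flow_v)
    ]


energy = motion_energy([[1.0, 0.0]], [[3.0, 0.0]], [[4.0, 0.0]], smoke=0.5)
```

At the assumed midpoint `smoke = s0` the coefficient is 0.5, so both cues contribute equally. Moving away from the midpoint shifts the balance smoothly rather than switching abruptly between the two cues.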

[0113] Step 3: Perform time-frequency transformation on the multi-scale motion energy field to extract frequency band energy related to human micro-motion characteristics, and generate a frequency domain dynamic attention mask based on the frequency band energy; combine the frequency domain dynamic attention mask with the multi-scale motion energy field, and then extract the salient motion target region through moving average filtering and region growing algorithm.

[0114] Specifically, this step uses frequency domain analysis and dynamic attention mechanism design to accurately extract human vital signs-related features (such as weak periodic movements like breathing and heartbeat, i.e., human micro-motion features) from multi-scale motion energy fields. It generates an attention mask that focuses on the saliency of dynamic motion in the vital sign frequency band, highlighting potential human tracking areas, improving feature saliency, reducing the risk of false detection, and suppressing high-frequency or low-frequency interference such as flame flickering and smoke drifting.

[0115] In one feasible approach, a short-time Fourier transform (STFT) is employed to perform time-frequency transformation on the multi-scale motion energy field, extracting frequency band energy related to the micro-motion characteristics of the human body. Specifically:

[0116]

[0117]

[0118] That is, by limiting the frequency range, the characteristics of human vital signs are preserved (0.1–2 Hz).

[0119] In the formulas, the quantities are, in order: the short-time Fourier transform; the multi-scale motion energy field of the i-th frame of the visible light or infrared image; the Hanning window function; the center of the time window; the frequency component; the differential element of time; and the frequency band energy related to human micro-motion characteristics.
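
The band-limited energy extraction can be illustrated on the energy history of a single pixel. The sketch evaluates one Hanning-windowed segment with a discrete Fourier sum in place of the integral and takes the squared magnitude as energy; the sampling rate `fs`, the window length, the use of squared magnitude, and the function names are assumptions, while the 0.1–2 Hz band comes from the text.

```python
import cmath
import math


def hann(n: int) -> list[float]:
    """Hann (Hanning) window of length n."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def band_energy(signal: list[float], fs: float,
                f_lo: float = 0.1, f_hi: float = 2.0) -> float:
    """Energy of one Hann-windowed segment inside [f_lo, f_hi] Hz."""
    n = len(signal)
    window = hann(n)
    energy = 0.0
    for k in range(n // 2 + 1):
        freq = k * fs / n  # frequency of DFT bin k in Hz
        if not (f_lo <= freq <= f_hi):
            continue       # keep only the vital-sign band
        coeff = sum(
            x * w * cmath.exp(-2j * math.pi * k * i / n)
            for i, (x, w) in enumerate(zip(signal, window))
        )
        energy += abs(coeff) ** 2
    return energy


fs = 10.0  # assumed frame rate in Hz
breathing = [math.sin(2.0 * math.pi * 0.3 * i / fs) for i in range(100)]
flicker = [math.sin(2.0 * math.pi * 4.0 * i / fs) for i in range(100)]
```

Here a 0.3 Hz oscillation, comparable to breathing, keeps almost all of its energy inside the band, whereas a 4 Hz oscillation, standing in for flame flicker, contributes only negligible leakage. Sliding the window along the time axis would give the full time-frequency representation.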

[0120] In one feasible approach, a frequency-domain dynamic attention mask is generated based on frequency band energy, as follows:

[0121] First, the frequency band energy is integrated along the time axis to obtain the frequency domain energy distribution:

[0122]

[0123] In the formula, the result is the frequency domain energy distribution.

[0124] Secondly, based on the frequency domain energy distribution, the energy threshold is adaptively calculated:

[0125]

[0126] In the formula, the symbols denote, in order: the energy threshold; and the mean and the standard deviation of the frequency domain energy distribution.

[0127] Finally, based on the energy threshold, the frequency band energy is mapped back to the spatial domain to generate a frequency domain dynamic attention mask:

[0128]

[0129] In the formula, the symbols denote the inverse short-time Fourier transform and the indicator function.
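The adaptive thresholding step can be illustrated with a simplified sketch. It assumes the in-band energy has already been integrated over time into one value per pixel, and it assumes the threshold takes the form mean plus `k` times the standard deviation; the text names the mean and standard deviation but the exact combination and the value of `k` are not recoverable here. The inverse short-time Fourier transform is omitted, so the output is a plain binary spatial mask rather than the full mask of the method.

```python
import numpy as np

def attention_mask(energy_map, k=1.0):
    """Binary mask of pixels whose in-band energy reaches the adaptive threshold.

    Assumption: threshold = mean + k * std, with k a hypothetical tuning factor.
    """
    energy_map = np.asarray(energy_map, dtype=float)
    threshold = energy_map.mean() + k * energy_map.std()
    mask = (energy_map >= threshold).astype(float)  # indicator function
    return mask, threshold

# Toy energy map: a 2x2 patch of strong in-band energy on a quiet background.
energy = np.zeros((6, 6))
energy[2:4, 2:4] = 10.0
mask, thr = attention_mask(energy, k=1.0)
```

Because the threshold is derived from the statistics of the current energy map, it rises in noisy scenes and falls in quiet ones, without a hand-set constant.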

[0130] In one feasible approach, a frequency-domain dynamic attention mask is combined with a multi-scale motion energy field to enhance human region features and suppress background noise, specifically:

[0131]

[0132] In the formula, the symbols denote, in order: the enhanced features of the i-th frame of the visible light or infrared image; pixel-by-pixel multiplication; and the residual weight, which has a preset default value and retains a small amount of global motion information to avoid over-suppression.

[0133] To avoid false enhancements caused by transient disturbances, spatiotemporal continuity constraints are introduced. First, moving average filtering is performed on the enhanced features:

[0134]

[0135] Here, the symbols denote the enhanced features of the i-th frame of the visible light or infrared image after moving average filtering, and the size of the time window.
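The enhancement and temporal smoothing steps above can be sketched as follows. The form `energy * mask + residual * energy` is an assumption consistent with "pixel-by-pixel multiplication" plus a residual weight; the default `residual=0.1`, the causal window, and both function names are hypothetical.

```python
import numpy as np

def enhance(energy, mask, residual=0.1):
    """Mask-weighted energy plus a small residual of the unmasked energy.

    Assumed form; the residual keeps some global motion information.
    """
    return energy * mask + residual * energy

def temporal_moving_average(frames, window=3):
    """Average each frame with up to (window - 1) preceding frames.

    Assumption: a causal window; the text only names the window size.
    """
    frames = np.asarray(frames, dtype=float)
    out = np.empty_like(frames)
    for i in range(len(frames)):
        lo = max(0, i - window + 1)
        out[i] = frames[lo:i + 1].mean(axis=0)
    return out

enhanced = enhance(np.array([2.0, 2.0]), np.array([1.0, 0.0]))

# A one-frame transient spike is spread out and attenuated by the filter.
spike = np.zeros((5, 1, 1))
spike[2] = 9.0
smoothed = temporal_moving_average(spike, window=3)
```

With a window of three frames, the single-frame spike of 9 is reduced to 3, which illustrates how transient disturbances are damped while persistent motion would keep its level.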

[0136] Finally, using a region growing algorithm, the largest connected region is extracted as the final salient motion target region, specifically:

[0137]

[0138] Here, the result is the salient motion target region of the i-th frame of the visible light or infrared image, and the operator denotes the region growing algorithm.
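One simple way to realize "grow regions, keep the largest connected one" is a breadth-first search over pixels above a threshold. The sketch below assumes 4-connectivity and a fixed growth threshold; neither is specified in the text, and `largest_region` is a hypothetical name.

```python
from collections import deque

def largest_region(feature, threshold):
    """Return the pixel list of the largest 4-connected region >= threshold."""
    rows, cols = len(feature), len(feature[0])
    seen = [[False] * cols for _ in range(rows)]
    best = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or feature[r][c] < threshold:
                continue
            # Grow a new region from this seed pixel.
            region, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                region.append((y, x))
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and feature[ny][nx] >= threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(region) > len(best):
                best = region
    return best

grid = [
    [0.9, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.8, 0.7],
    [0.0, 0.0, 0.9, 0.0],
]
region = largest_region(grid, threshold=0.5)
```

The isolated pixel in the top-left corner forms a region of size one and is discarded in favor of the three-pixel cluster.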

[0139] Step 4: Construct a dual-branch neural network architecture including a motion feature branch and an appearance feature branch; input the salient moving target region into the motion feature branch, and extract temporal motion features through a temporal convolutional network; input the preprocessed visible light image sequence and infrared image sequence into the appearance feature branch, and extract multimodal appearance features through a target detection network; dynamically allocate motion weights and appearance weights according to the motion saliency quantification index of the temporal motion features and the appearance reliability index of the multimodal appearance features, fuse the temporal motion features and multimodal appearance features according to the motion weights and appearance weights to obtain fused features, and then output the human body bounding box and detection confidence score through the detection head.

[0140] Specifically, this step uses a dual-branch neural network architecture and a dynamic weight allocation mechanism to achieve adaptive fusion of motion features and appearance features, thereby improving the robustness of human detection in complex environments.

[0141] As shown in Figure 4, in one possible implementation, the temporal convolutional network is a lightweight 3D convolutional neural network (3D-CNN), the network structure of which contains three 3D convolutional layers, each followed by batch normalization and ReLU activation.

[0142] That is, extracting temporal motion features through temporal convolutional networks, specifically:

[0143]

[0144] In the formula, the symbols denote the temporal motion features, the kernel size, and the stride; the number of output channels increases layer by layer.
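To make the three-layer structure concrete, the sketch below traces feature-map shapes through three 3D convolutions using the standard output-size rule. The kernel size, stride, padding, and channel counts (16, 32, 64) are placeholders chosen for the example, since the actual values are not legible in this text; it is shape arithmetic only, not a trained network.

```python
def conv3d_out_shape(shape, kernel=3, stride=1, padding=1):
    """Output (T, H, W) of a 3D convolution: floor((n + 2p - k) / s) + 1 per axis."""
    return tuple((n + 2 * padding - kernel) // stride + 1 for n in shape)

def trace_shapes(input_shape, channels=(16, 32, 64), kernel=3, stride=1, padding=1):
    """Return the (C, T, H, W) shape after each layer; channel counts are assumed."""
    shapes, shape = [], input_shape
    for ch in channels:
        shape = conv3d_out_shape(shape, kernel, stride, padding)
        shapes.append((ch,) + shape)
    return shapes

# Assumed input: 8 frames of a 64x64 salient-region crop.
shapes = trace_shapes((8, 64, 64))
```

With these placeholder settings the temporal and spatial sizes are preserved while the channel count grows layer by layer, matching the qualitative description above.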

[0145] In one possible implementation, the target detection network is an improved YOLOv8 network. The improvement is that a dynamic cross-modal attention module (DCM-Attn) is introduced into the backbone of the YOLOv8 network to adaptively fuse visible light and infrared features.

[0146] That is, the multimodal appearance features are extracted through the target detection network: DCM-Attn weights the infrared features with channel attention and adds them element-wise to the visible light features. Specifically:

[0147]

[0148] In the formula, the symbols denote, in order: the multimodal appearance features; the i-th frame of the visible light image; the i-th frame of the infrared image; the Sigmoid function; and channel-wise multiplication.
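The fusion rule described for DCM-Attn can be sketched on plain arrays. The sketch assumes the channel attention is the Sigmoid of the globally averaged infrared feature per channel, with no learned layers; the real module presumably contains trainable parameters that are not described here, and `cross_modal_fuse` is a hypothetical name.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cross_modal_fuse(vis_feat, ir_feat):
    """Fuse (C, H, W) features: visible + channel_attention(infrared) * infrared.

    Assumption: attention = sigmoid(global average pool of the infrared feature).
    """
    channel_desc = ir_feat.mean(axis=(1, 2))          # one descriptor per channel
    attn = sigmoid(channel_desc)                      # channel weights in (0, 1)
    return vis_feat + attn[:, None, None] * ir_feat   # channel-wise multiply, then add

vis = np.ones((2, 3, 3))
ir = np.stack([np.full((3, 3), 4.0), np.zeros((3, 3))])
fused = cross_modal_fuse(vis, ir)
```

Channels where the infrared response is strong contribute heavily to the fused feature, while channels with no infrared response leave the visible light feature unchanged.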

[0149] Preferably, improvements to the YOLOv8 network also include replacing standard convolutions with Ghost modules to reduce computational load.

[0150] In one feasible approach, motion weights and appearance weights are dynamically allocated based on the motion saliency quantification index of temporal motion features and the appearance reliability index of multimodal appearance features, as follows:

[0151] First, based on the global average energy value of the salient motion target region, the motion saliency quantification index of the temporal motion characteristics is calculated:

[0152]

[0153] In the formula, the symbols denote, in order: the motion saliency quantification index of the temporal motion features; the height of the salient motion target region; the width of the salient motion target region; and the coordinates of a pixel within the salient motion target region.
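Read literally, the index is the mean energy over the height-by-width region. A minimal sketch under that reading, with `motion_saliency` as a hypothetical name and the region given as a nested list of energy values:

```python
def motion_saliency(region_energy):
    """Global average energy of a salient region given as rows of pixel energies."""
    height = len(region_energy)
    width = len(region_energy[0])
    total = sum(sum(row) for row in region_energy)
    return total / (height * width)

saliency = motion_saliency([[0.2, 0.4], [0.6, 0.8]])
```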

[0154] Secondly, the signal-to-noise ratio of the infrared image sequence and the average confidence level of keypoint detection in the visible light image sequence are calculated.

[0155] Then, by combining the signal-to-noise ratio of the infrared image sequence and the average confidence level of keypoint detection in the visible light image sequence, the appearance reliability index of the multimodal appearance features is calculated:

[0156]

[0157] In the formula, the symbols denote, in order: the appearance reliability index of the multimodal appearance features; the signal-to-noise ratio of the infrared image sequence; and the average confidence level of keypoint detection in the visible light image sequence.

[0158] Finally, the motion weights and appearance weights are dynamically allocated using the Sigmoid function:

[0159]

[0160]

[0161] In the formula, the symbols denote, in order: the motion weight; the appearance weight; a parameter controlling the rate of change of the weights; and a bias term with a default value. When the stated condition holds (for example, when smoke obscures the view), the motion weight approaches 1 and motion features are prioritized; otherwise, appearance features are emphasized.

[0162] In one feasible approach, temporal motion features and multimodal appearance features are fused according to motion weights and appearance weights to obtain fused features, specifically:

[0163]

[0164] In the formula, the result is the fused feature.
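The weight allocation and fusion can be sketched as below. The argument of the Sigmoid is assumed to be `rate * (saliency - reliability) + bias`, and the appearance weight is assumed to be one minus the motion weight; the text confirms only that a Sigmoid, a rate parameter, and a bias term are used. The default values and function names are hypothetical, and the fused feature is taken to be the weighted sum of the two branches.

```python
import math

def allocate_weights(saliency, reliability, rate=5.0, bias=0.0):
    """Return (motion weight, appearance weight); assumed Sigmoid argument."""
    w_motion = 1.0 / (1.0 + math.exp(-(rate * (saliency - reliability) + bias)))
    return w_motion, 1.0 - w_motion

def fuse(motion_feat, appearance_feat, w_motion, w_appearance):
    """Weighted sum of two equally sized feature vectors."""
    return [w_motion * m + w_appearance * a
            for m, a in zip(motion_feat, appearance_feat)]

# Smoke-obscured scene: strong motion saliency, unreliable appearance.
w_m_smoke, w_a_smoke = allocate_weights(saliency=0.8, reliability=0.2)
# Clear scene: weak motion, reliable appearance.
w_m_clear, w_a_clear = allocate_weights(saliency=0.2, reliability=0.8)
fused = fuse([1.0, 1.0], [0.0, 0.0], 0.75, 0.25)
```

Under these assumptions the motion branch dominates when appearance cues degrade and recedes when they are reliable, which matches the qualitative behavior described for the weights.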

[0165] The detection head uses a YOLOv8 hybrid channel detection structure to output the human body bounding box and the detection confidence.

[0166] Step 5: Calculate the environmental complexity based on the smoke concentration and light intensity, and determine the dynamic confidence threshold based on the environmental complexity; compare the detection confidence with the dynamic confidence threshold, and verify it in conjunction with the average temperature of the human body bounding box region in the infrared image sequence. If the detection confidence is not lower than the dynamic confidence threshold and the average temperature is within the range of human vital signs temperature, then generate alarm information and output it.

[0167] In one feasible approach, environmental complexity is calculated based on smoke concentration and light intensity, and the dynamic confidence threshold is adaptively adjusted according to this environmental complexity, specifically:

[0168]

[0169]

[0170] In the formula, the symbols denote, in order: the dynamic confidence threshold; the environmental complexity, where a larger value indicates a more complex environment; the normalized value of the smoke concentration; the normalized value of the light intensity; and the light intensity threshold.
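Since the two formulas are not legible in this text, the sketch below is only one plausible shape for them. It assumes the complexity is a weighted sum of normalized smoke concentration and a low-light term measured against the light intensity threshold, and that the confidence threshold is a clamped linear function of the complexity. Every constant, including the sign of `delta` (whether the threshold falls or rises as the environment gets harder), is an assumption and is exposed as a parameter.

```python
def environmental_complexity(smoke, light, light_threshold=0.3, smoke_weight=0.5):
    """Complexity in [0, 1] from normalized smoke and light values (assumed form)."""
    low_light = max(0.0, light_threshold - light) / light_threshold
    return smoke_weight * smoke + (1.0 - smoke_weight) * low_light

def dynamic_threshold(complexity, base=0.5, delta=-0.2, lo=0.1, hi=0.9):
    """Clamped linear threshold; delta < 0 lowers it in complex scenes (assumption)."""
    return min(hi, max(lo, base + delta * complexity))

clear = dynamic_threshold(environmental_complexity(smoke=0.0, light=0.8))
smoky_dark = dynamic_threshold(environmental_complexity(smoke=1.0, light=0.0))
```

Keeping the adjustment direction and magnitude as explicit parameters makes it possible to tune the trade-off between missed detections and false alarms from field feedback.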

[0171] The infrared image is then used to verify whether the average temperature of the target region is within the range of human vital signs (36–42℃):

[0172]

[0173] Here, the symbols denote, in order: the verification result identifier; the total number of pixels in the target region; and the indicator function.

[0174] If the temperature is abnormal (such as a flame at high temperature or a cold object), it is identified as an interference source, and the alarm is suppressed.
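The temperature check and the alarm condition of Step 5 can be sketched together. The sketch assumes the infrared frame is already calibrated to degrees Celsius and stored as a nested list, and that the bounding box is given as `(x0, y0, x1, y1)` with exclusive upper bounds; both function names are hypothetical. The 36–42℃ range and the two-part alarm condition come from the text.

```python
def mean_box_temperature(thermal, box):
    """Average temperature inside box = (x0, y0, x1, y1), upper bounds exclusive."""
    x0, y0, x1, y1 = box
    values = [thermal[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return sum(values) / len(values)

def should_alarm(confidence, threshold, thermal, box, t_min=36.0, t_max=42.0):
    """Alarm only if confidence reaches the threshold and temperature is plausible."""
    temp_ok = t_min <= mean_box_temperature(thermal, box) <= t_max
    return confidence >= threshold and temp_ok

body = [
    [20.0, 20.0, 20.0],
    [20.0, 37.0, 36.5],
    [20.0, 37.5, 37.0],
]
flame = [[300.0] * 3 for _ in range(3)]
box = (1, 1, 3, 3)

alarm_body = should_alarm(0.7, 0.5, body, box)
alarm_flame = should_alarm(0.7, 0.5, flame, box)
alarm_low_conf = should_alarm(0.4, 0.5, body, box)
```

A confident detection over a flame is suppressed because its mean temperature falls outside the range, and a body-temperature region is suppressed when the detector's confidence is below the dynamic threshold.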

[0175] When the target satisfies both conditions simultaneously, that is, the detection confidence is not lower than the dynamic confidence threshold and the temperature verification passes, an alarm data packet is generated. All alarm events and related data (including video clips, fusion features, and environmental parameters) are uploaded to the emergency command center and automatically stored in a local or cloud database. Rescue personnel can mark false alarms or missed alarms via their terminals, and the system periodically calculates detection performance indicators. Preferably, the weight allocation parameters are dynamically optimized based on feedback data, eliminating the need to retrain the model.

[0176] In another embodiment of the present invention, a computer device is provided, comprising a processor and a memory. The memory stores a computer program, which includes program instructions. The processor executes the program instructions stored in the computer storage medium. The processor may be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. It is the computing and control core of the terminal, suitable for implementing one or more instructions, specifically suitable for loading and executing one or more instructions in the computer storage medium to achieve a corresponding method flow or corresponding function. The processor described in this embodiment of the present invention can be used in the operation of a real-time human detection method for emergency rescue based on enhanced temporal motion features.

[0177] In another embodiment of the present invention, a storage medium is provided, specifically a computer-readable storage medium (Memory), which is a memory device in a computer device used to store programs and data. It is understood that the computer-readable storage medium here can include both the built-in storage medium in the computer device and extended storage media supported by the computer device. The computer-readable storage medium provides storage space that stores the terminal's operating system. Furthermore, the storage space also stores one or more instructions suitable for loading and execution by a processor. These instructions can be one or more computer programs (including program code). It should be noted that the computer-readable storage medium here can be a high-speed RAM memory or a non-volatile memory, such as at least one disk storage device. The processor can load and execute one or more instructions stored in the computer-readable storage medium to implement the corresponding steps of the emergency rescue real-time human detection method based on temporal motion feature enhancement in the above embodiments.

[0178] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0179] This invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0180] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0181] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0182] This invention also provides a computer program product, which is used to execute any of the above-described methods for real-time human detection in emergency rescue based on enhanced temporal motion features. Since the computer program product provided by this invention belongs to the same inventive concept as the aforementioned method for real-time human detection in emergency rescue based on enhanced temporal motion features, it possesses all the advantages of the aforementioned method. Therefore, the beneficial effects of the computer program product provided by this invention will not be elaborated upon here.

[0183] In this invention, the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., refer to a specific feature, structure, material, or characteristic described in connection with that embodiment or example, which is included in at least one embodiment or example of the invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.

[0184] Finally, it should be noted that the above-described embodiments are merely specific implementations of the present invention, used to illustrate the technical solutions of the present invention, and not to limit them. The scope of protection of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that any person skilled in the art can still modify or easily conceive of changes to the technical solutions described in the foregoing embodiments within the scope of the technology disclosed in the present invention, or make equivalent substitutions for some of the technical features; and these modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be covered within the scope of protection of the present invention.

Claims

1. A real-time human detection method for emergency rescue based on temporal motion feature enhancement, characterized in that, Includes the following steps: The system acquires visible light image sequences, infrared image sequences, and environmental parameters synchronously collected from the rescue area in real time, and performs spatial registration preprocessing on the visible light image sequences and infrared image sequences; the environmental parameters include at least smoke concentration and light intensity. The visible light image sequence and the infrared image sequence are processed by continuous frame difference to generate a binarized motion mask; at the same time, the optical flow amplitude is calculated using an optical flow algorithm. Based on the smoke concentration, the fusion coefficient is adaptively adjusted, and the binarized motion mask and optical flow amplitude are fused according to the fusion coefficient to obtain a multi-scale motion energy field. The multi-scale motion energy field is transformed by time and frequency to extract the frequency band energy related to human micro-motion characteristics. A frequency domain dynamic attention mask is generated based on the frequency band energy. The frequency domain dynamic attention mask is combined with the multi-scale motion energy field, and then the salient motion target region is extracted by moving average filtering and region growing algorithm. 
A two-branch neural network architecture consisting of a motion feature branch and an appearance feature branch is constructed; the salient moving target region is input into the motion feature branch, and temporal motion features are extracted through a temporal convolutional network; The preprocessed visible light image sequence and infrared image sequence are input into the appearance feature branch, and multimodal appearance features are extracted through the target detection network; Based on the motion saliency quantification index of temporal motion features and the appearance reliability index of multimodal appearance features, motion weights and appearance weights are dynamically allocated. The temporal motion features and multimodal appearance features are then fused according to the motion weights and appearance weights to obtain fused features. Finally, the human body bounding box and detection confidence are output through the detection head. The environmental complexity is calculated based on smoke concentration and light intensity, and a dynamic confidence threshold is determined based on the environmental complexity. The detection confidence is compared with the dynamic confidence threshold, and the average temperature of the human body bounding box region in the infrared image sequence is used for verification. If the detection confidence is not lower than the dynamic confidence threshold and the average temperature is within the range of human vital signs temperature, an alarm message is generated and output.

2. The real-time human detection method for emergency rescue based on temporal motion feature enhancement according to claim 1, characterized in that the step of performing continuous frame difference processing on the visible light image sequence and the infrared image sequence respectively to generate a binarized motion mask is specified by a formula in which the symbols denote: the binarized motion mask of the i-th frame of the visible light or infrared image; the preprocessed visible light or infrared image of the indexed frame; and the pixel-level logical AND operation; and the optical flow amplitude is calculated using an optical flow algorithm according to a formula in which the symbols denote: the optical flow amplitude of the i-th frame of the visible light or infrared image; the horizontal optical flow component of the i-th frame of the visible light or infrared image; and the vertical optical flow component of the i-th frame of the visible light or infrared image.

3. The real-time human detection method for emergency rescue based on temporal motion feature enhancement according to claim 2, characterized in that the adaptive adjustment of the fusion coefficient based on smoke concentration is specified by a formula in which the symbols denote: the fusion coefficient; the normalized value of the smoke concentration; and two empirical parameters; and the fusion of the binarized motion mask and the optical flow amplitude according to the fusion coefficient to obtain a multi-scale motion energy field is specified by a formula whose result is the multi-scale motion energy field of the i-th frame of the visible light or infrared image.

4. The real-time human detection method for emergency rescue based on temporal motion feature enhancement according to claim 1, characterized in that the time-frequency transformation of the multi-scale motion energy field to extract frequency band energy related to human micro-motion characteristics is specified by a formula in which the symbols denote: the short-time Fourier transform; the multi-scale motion energy field of the i-th frame of the visible light or infrared image; the Hanning window function; the center of the time window; the frequency component; the differential element of time; and the frequency band energy related to the micro-motion characteristics of the human body; the generation of the frequency domain dynamic attention mask based on the frequency band energy is specifically as follows: the frequency band energy is integrated along the time axis to obtain the frequency domain energy distribution; based on the frequency domain energy distribution, the energy threshold is adaptively calculated from the mean and the standard deviation of the frequency domain energy distribution; 
based on the energy threshold, the frequency band energy is mapped back to the spatial domain to generate the frequency domain dynamic attention mask, using the inverse short-time Fourier transform and an indicator function; and the combination of the frequency domain dynamic attention mask with the multi-scale motion energy field, followed by extraction of the salient motion target region through moving average filtering and a region growing algorithm, is specified by formulas in which the symbols denote: the enhanced features of the i-th frame of the visible light or infrared image; pixel-by-pixel multiplication; the residual weight; the enhanced features of the i-th frame of the visible light or infrared image after moving average filtering; the size of the time window; the salient motion target region of the i-th frame of the visible light or infrared image; and the operation of the region growing algorithm.

5. The real-time human detection method for emergency rescue based on temporal motion feature enhancement according to claim 4, characterized in that the temporal convolutional network is a 3D convolutional neural network; the extraction of temporal motion features through the temporal convolutional network is specified by a formula in which the symbols denote the temporal motion features, the kernel size, and the stride, and the number of output channels increases layer by layer; the target detection network is an improved YOLOv8 network, in which a dynamic cross-modal attention module is introduced into the backbone of the YOLOv8 network; and the extraction of multimodal appearance features through the target detection network is specified by a formula in which the symbols denote: the multimodal appearance features; the i-th frame of the visible light image; the i-th frame of the infrared image; the Sigmoid function; and channel-wise multiplication.

6. The real-time human detection method for emergency rescue based on temporal motion feature enhancement according to claim 5, characterized in that the dynamic allocation of motion weights and appearance weights based on the motion saliency quantification index of the temporal motion features and the appearance reliability index of the multimodal appearance features is as follows: based on the global average energy value of the salient motion target region, the motion saliency quantification index of the temporal motion features is calculated by a formula in which the symbols denote: the motion saliency quantification index of the temporal motion features; the height of the salient motion target region; the width of the salient motion target region; and the coordinates of a pixel within the salient motion target region; the signal-to-noise ratio of the infrared image sequence and the average confidence level of keypoint detection in the visible light image sequence are calculated; by combining the signal-to-noise ratio of the infrared image sequence and the average confidence level of keypoint detection in the visible light image sequence, the appearance reliability index of the multimodal appearance features is calculated by a formula in which the symbols denote: the appearance reliability index of the multimodal appearance features; the signal-to-noise ratio of the infrared image sequence; and the average confidence level of keypoint detection in the visible light image sequence; the motion weight and the appearance weight are dynamically allocated using the Sigmoid function according to formulas in which the symbols denote: the motion weight; the appearance weight; a parameter controlling the rate of change of the weights; and a bias term; and the fusion of the temporal motion features and the multimodal appearance features according to the motion weight and the appearance weight is specified by a formula whose result is the fused feature.

7. The real-time human detection method for emergency rescue based on temporal motion feature enhancement according to claim 1, characterized in that the calculation of environmental complexity based on smoke concentration and light intensity, and the determination of a dynamic confidence threshold based on the environmental complexity, are specified by formulas in which the symbols denote: the dynamic confidence threshold; the normalized value of the smoke concentration; the normalized value of the light intensity; and the light intensity threshold.

8. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the computer program, it implements a real-time human detection method for emergency rescue based on temporal motion feature enhancement as described in any one of claims 1 to 7.

9. A computer-readable storage medium storing a computer program, characterized in that, When the computer program is executed by the processor, it implements a real-time human detection method for emergency rescue based on temporal motion feature enhancement as described in any one of claims 1 to 7.

10. A computer program product, characterized in that, When executed by a processor, the computer program product implements a real-time human detection method for emergency rescue based on temporal motion feature enhancement as described in any one of claims 1 to 7.