A multimodal intelligent forest fire patrol method and system based on a compound-wing unmanned aerial vehicle
A compound-wing UAV carrying a dual-light (visible and infrared) pod, a multimodal detection model, a rotation matrix, and a temporal sliding-window filtering algorithm together address insufficient multimodal fusion and inaccurate positioning in UAV forest fire patrol, achieving stable, high-precision fire point identification and positioning.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications (China)
- Current Assignee / Owner
- HUNAN UNIV
- Filing Date
- 2026-05-29
- Publication Date
- 2026-06-30
AI Technical Summary
Existing drone-based forest fire detection technologies suffer from insufficient multimodal perception fusion, high false alarm rates, poor long-distance target positioning accuracy, and susceptibility to drone vibrations, causing fire location coordinates to drift drastically on electronic maps and making it impossible to achieve stable and accurate fire source location.
A compound-wing UAV equipped with a dual-light pod is used to collect visible light video streams and infrared thermal imaging video streams in real time. The fire is detected in real time through a multimodal forest fire detection model of a ground station. The fire location is achieved by combining a rotation matrix and a time-series sliding window filtering algorithm. The modal weights are dynamically adjusted and the influence of fuselage vibration is eliminated to generate highly stable fire point latitude and longitude coordinates.
It enables accurate identification and high-precision positioning of suspected fire points in complex environments, reduces false alarm rates, ensures the stability and accuracy of fire source geographical location, and supports highly reliable forest fire monitoring.
Figure CN122313628A_ABST
Description
Technical Field
[0001] This invention belongs to the field of unmanned aerial vehicle (UAV) application and intelligent monitoring technology, and in particular relates to an intelligent forest fire inspection method and system that combines a compound-wing UAV platform, adaptive multimodal image processing, and real-time high-precision temporal geolocation.
Background Technology
[0002] Forest fires are characterized by their suddenness, rapid spread, and high destructiveness. Early detection and accurate location of fires are crucial to minimizing disaster losses. Traditional forest patrols rely mainly on manual ground patrols or observation from high towers and telescopes, which suffer from low patrol efficiency, large blind spots, and poor timeliness.
[0003] In recent years, UAV remote sensing technology has been widely used in forestry monitoring due to its maneuverability and wide field of view. However, existing UAV forest patrol technologies still face the following technical bottlenecks that urgently need to be addressed in practical applications:
- Insufficient depth of multimodal perception fusion, resulting in a high false alarm rate: Although some solutions have introduced dual-light pods for visible light and infrared, most existing fusion algorithms use simple image overlay or fixed-weight channel stitching. In complex fire scenarios (such as when dense smoke completely blocks visible light or sunlight reflection interferes with infrared thermal imaging), simple fusion methods cannot dynamically adjust modal weights, leading to feature mutual exclusion or noise amplification, which easily results in missed detections or false alarms.
- Poor accuracy and drastic fluctuations in long-distance target positioning: Most existing fire point positioning methods are based on the geometric projection of single-frame images, ignoring the slight vibrations of the UAV fuselage and airflow disturbances during high-altitude flight. When shooting at long distances with a long focal length and oblique view, small attitude angle jitters can be amplified into positioning errors of tens of meters on the ground, causing the calculated fire point coordinates to drift drastically on the electronic map. Ground command personnel cannot obtain a stable and accurate geographical location of the fire source, seriously affecting firefighting decisions.
[0004] Therefore, there is an urgent need for a comprehensive intelligent inspection system that integrates long-endurance flight capability, anti-interference multimodal detection based on an attention mechanism, and real-time high-precision positioning with anti-shake capability.
Summary of the Invention
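The amplification of attitude jitter in an oblique view can be checked with simple geometry. Assuming flat terrain, a line of sight tilted by an angle θ from straight down at height h meets the ground at horizontal distance h·tan θ, so the same small jitter moves the ground point much further when θ is large. The sketch below evaluates this for illustrative values (500 m height, 75° off nadir, 0.5° jitter); these numbers and the function name are assumptions chosen for the example, not figures from the patent.

```python
import math

def ground_shift_m(height_m, off_nadir_deg, jitter_deg):
    """Shift of the ground intersection when the line of sight tilts by jitter_deg.

    Assumes flat terrain; the tilt is measured from straight down (nadir).
    """
    before = height_m * math.tan(math.radians(off_nadir_deg))
    after = height_m * math.tan(math.radians(off_nadir_deg + jitter_deg))
    return after - before

# Illustrative values only (not taken from the patent text).
oblique = ground_shift_m(500.0, 75.0, 0.5)  # long-range oblique view
nadir = ground_shift_m(500.0, 0.0, 0.5)     # looking straight down
print(f"oblique view: {oblique:.1f} m, nadir view: {nadir:.1f} m")
```

Under these assumptions the same half-degree jitter moves the computed point by tens of meters in the oblique case but only a few meters at nadir.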
[0005] To address the above technical problems, this invention provides a multimodal intelligent forest fire inspection method and system based on a compound-wing UAV.
[0006] The technical solution adopted by this invention to solve its technical problem is a multimodal intelligent forest fire patrol method based on a compound-wing UAV, the method comprising the following steps:
S100: An inspection route is configured for the compound-wing UAV; the UAV flies autonomously along the preset route and uses the onboard dual-light pod to collect a visible light video stream and an infrared thermal imaging video stream simultaneously.
S200: The acquired dual-light video streams are encoded by the airborne communication gateway and transmitted to the ground control station in real time over the air-to-ground communication link using the RTSP protocol.
S300: The ground control station receives the video streams, calls the pre-trained multimodal forest fire detection model, and performs real-time detection on the transmitted video in the multimodal intelligent forest fire inspection system to determine whether there are any suspected fire points.
S400: When a suspected fire point is detected, the system automatically locks onto the target and saves the image frame, extracts the target's position in the image pixel coordinate system, and simultaneously acquires the UAV's current GPS position and full attitude information. Based on the image pixel coordinates, UAV attitude information, flight altitude, and camera sensor parameters, the system uses a collinearity equation based on the rotation matrix, together with a time-series sliding window filtering algorithm, to calculate the true latitude and longitude of the fire point in geographic space.
S500: The latitude and longitude of the fire point, the fire type, and the corresponding on-site images are associated and stored, an alarm is triggered, and a structured inspection record is generated.
[0007] Preferably, the compound-wing UAV in S100 combines rotor vertical take-off and landing with fixed-wing cruise; the dual-light pod keeps observing the ground throughout acquisition, and the infrared thermal imaging camera and the visible light camera undergo strict line-of-sight alignment calibration so that the two video streams are spatially aligned at the pixel level.
[0008] Preferably, the airborne communication gateway in S200 has a built-in hardware video encoder that compresses the dual-light video streams using the H.264 video encoding standard; the air-to-ground communication link uses a 5G mobile communication network to keep the video transmission delay below a preset threshold.
[0009] Preferably, the multimodal forest fire detection model in S300 adopts a dual-stream fusion convolutional neural network architecture, including: a visible light branch for receiving and processing the visible light video stream, which contains multiple sequentially connected convolutional blocks and pooling layers to extract the color and texture features of smoke; an infrared branch for receiving and processing the infrared thermal imaging video stream, which contains multiple sequentially connected convolutional blocks and pooling layers to extract the brightness and morphological features of high-temperature hot spots; and a feature fusion module connected after the visible light and infrared branches, which incorporates a dual-stream cross-attention fusion mechanism. This mechanism includes a channel attention unit and a spatial calibration unit. The channel attention unit first performs global average pooling on the input feature maps to compress spatial information, then uses a multilayer perceptron to calculate the channel weight vectors of the two branches, which determine the relative importance of visible light and infrared information in the current scene, and weights the feature maps accordingly. Subsequently, the spatial calibration unit concatenates the weighted feature maps and generates a spatial attention mask through convolution operations to lock onto high-response fire point regions in the image and suppress background forest noise. Finally, the doubly calibrated feature maps are added element-wise to generate joint features containing rich semantics. The joint detection network, connected after the feature fusion module, processes the joint features through deep convolutional layers, a global pooling layer, and fully connected layers, and then feeds them to parallel flame detection and smoke detection units, which output the bounding box coordinates and confidence scores of the fire target.
[0010] Preferably, S400 includes:
S410: Extract the position of the fire target in the image pixel coordinate system, that is, the position of the center point of the detection box.
S420: Extract the UAV's attitude information and the camera intrinsics from the image metadata. The attitude information includes heading angle, pitch angle, and roll angle; the camera intrinsics include image pixel resolution, sensor parameters, and camera focal length.
S430: Based on the pixel coordinates of the fire point and the camera intrinsics, construct the three-dimensional line-of-sight vector of the fire point in the normalized camera coordinate system using the pinhole imaging principle.
S440: Use the heading angle, pitch angle, and roll angle to construct a rotation matrix from the camera coordinate system to the geographic coordinate system, apply this rotation to the line-of-sight vector, and thereby correct the geometric distortion caused by gimbal tilt or fuselage tilt.
S450: Using the flight altitude, calculate the intersection of the rotated line-of-sight vector with the ground plane by the principle of similar triangles, thereby obtaining the instantaneous ground distance offset of the fire point relative to the UAV.
S460: Establish a coordinate buffer queue, perform dispersion analysis on the instantaneous coordinates calculated over multiple consecutive frames, remove abnormal noise points caused by fuselage vibration, perform time-series weighted smoothing on the valid coordinates based on the fire detection confidence, and finally calculate a highly stable longitude and latitude for the fire point.
[0011] Preferably, in S430, the line-of-sight vector of the fire point in the normalized camera coordinate system is constructed by a formula whose terms are: the line-of-sight vector of the fire point in the normalized camera coordinate system; the pixel coordinates of the fire point; the physical dimensions of the sensor; the image pixel resolution; and the camera focal length.
[0012] Preferably, in S440, a rotation matrix from the camera coordinate system to the geographic coordinate system is constructed from the heading angle, pitch angle, and roll angle, and the line-of-sight vector is transformed into the geographic coordinate system by a formula whose terms are: the rotation matrix; the heading angle; the pitch angle; the roll angle; and the resulting rotated vector in the geographic coordinate system.
[0013] Preferably, in S450, the ground distance offset of the fire point relative to the UAV is calculated by a formula whose terms are: the ground projection offset; and the flight altitude.
Preferably, in S460, the longitude and latitude of the fire point are calculated in two steps, instantaneous coordinate calculation and time-series weighted smoothing, by formulas whose terms are: the instantaneous latitude and longitude of the current frame; the longitude and latitude of the UAV center; the conversion factor for latitude and longitude units; the final smoothed output coordinates; the valid instantaneous coordinates of each frame in the buffer queue; the fire detection confidence corresponding to that frame; and the size of the sliding window.
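The formulas referred to in [0011] to [0013] are not reproduced in this text. As a reading aid only, one set of expressions consistent with the listed terms is sketched below. Every symbol is an assumption introduced here: (u, v) are the pixel coordinates, W × H the image resolution, s_w × s_h the sensor dimensions, f the focal length, ψ, θ, φ the heading, pitch, and roll angles, C a fixed camera-to-body axis mapping, h the flight altitude, k the number of meters per degree of latitude, c_i the detection confidence of frame i, and N the window size. The north-east-down axis convention is likewise assumed.

```latex
% Hypothetical reconstruction; symbols and conventions are assumptions.
\begin{aligned}
\mathbf{V}_c &= \begin{bmatrix}
  \dfrac{(u - W/2)\, s_w}{W f} & \dfrac{(v - H/2)\, s_h}{H f} & 1
\end{bmatrix}^{\mathsf T} \\
\mathbf{V}_g &= R_z(\psi)\, R_y(\theta)\, R_x(\phi)\, C\, \mathbf{V}_c \\
(\Delta N,\ \Delta E) &= \frac{h}{V_{g,D}}\,\bigl(V_{g,N},\ V_{g,E}\bigr) \\
\mathrm{lat}_t &= \mathrm{lat}_0 + \frac{\Delta N}{k}, \qquad
\mathrm{lon}_t = \mathrm{lon}_0 + \frac{\Delta E}{k \cos(\mathrm{lat}_0)} \\
\bar{P} &= \frac{\sum_{i=1}^{N} c_i\, P_i}{\sum_{i=1}^{N} c_i}
\end{aligned}
```

The third line is the similar-triangles step: the rotated vector is scaled until its downward component equals the flight altitude.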
[0014] Preferably, the inspection records in S500 are stored in JSON format, and the record content includes: fire detection timestamp, UAV location, calculated coordinates of the fire point, fire confidence level, and on-site screenshots with marked detection boxes.
[0015] A multimodal intelligent forest fire patrol system based on a compound-wing UAV includes an airborne data acquisition and transmission unit and a ground computer system connected in communication with it. The ground computer system is equipped with a streaming media preprocessing unit and a multimodal fire detection and localization model connected in sequence. The airborne acquisition and transmission unit is used to acquire dual-light video streams with real-time pose information of the UAV and send the dual-light video streams to the ground computer system using the RTSP protocol. The streaming media preprocessing unit is used to receive and decode dual-light video streams, extract the UAV full attitude information and dual-light image data corresponding to the current frame, and send the extracted data to the multimodal fire detection and localization model; The multimodal fire detection and localization model receives image data and pose information, and uses a multimodal intelligent forest fire inspection method based on a compound wing UAV to perform fire inference and coordinate calculation, thereby obtaining fire inspection results containing real latitude and longitude coordinates.
[0016] The aforementioned multimodal intelligent forest fire inspection method and system based on a compound-wing UAV first controls the compound-wing UAV to perform a long-endurance inspection mission along a preset route. It utilizes an onboard dual-light pod to simultaneously acquire visible light and infrared thermal imaging video streams, which are then encoded and encapsulated via an onboard gateway and transmitted back to the ground control station in real time using the RTSP protocol. Subsequently, the ground station invokes a multimodal detection model with a built-in dual-stream cross-attention fusion mechanism to perform frame-by-frame inference on the video stream. Through channel attention and spatial calibration units, it dynamically integrates infrared and visible light features to accurately identify suspected fire points even in complex lighting and smoke-covered environments. In the positioning phase, the system simultaneously acquires the UAV's heading, pitch, roll, and other full-attitude parameters. It uses a full-attitude rotation matrix to perform spatial correction on the imaging geometry and further combines a temporal sliding window filtering algorithm to perform weighted smoothing on the calculated coordinates of multiple consecutive frames to eliminate positioning drift caused by fuselage vibration. Finally, the time-optimized fire point latitude and longitude, fire confidence level, and on-site screenshots are structured and saved as a JSON log, triggering an alarm. This method effectively solves the problems of traditional single-modal detection being susceptible to interference and unstable long-distance positioning, and realizes all-weather, highly reliable perception and high-precision positioning of forest fires.
Attached Figure Description
[0017] Figure 1 is a flowchart of a multimodal intelligent forest fire inspection method based on a compound-wing UAV in one embodiment of the present invention; Figure 2 is a schematic diagram of a multimodal fusion convolutional neural network architecture in one embodiment of the present invention; Figure 3 is a schematic diagram of the interface of a multimodal intelligent forest fire patrol system in one embodiment of the present invention; Figure 4 is a schematic diagram of a multimodal intelligent forest fire patrol system based on a compound-wing UAV in one embodiment of the present invention.
Detailed Implementation
[0018] To enable those skilled in the art to better understand the technical solution of the present invention, the present invention will be further described in detail below with reference to the accompanying drawings.
[0019] Referring to Figure 1, which is a flowchart of a multimodal intelligent forest fire patrol method based on a compound-wing UAV according to an embodiment of the present invention, the method includes the following steps:
S100: An inspection route is configured for the compound-wing UAV; the UAV flies autonomously along the preset route and uses the onboard dual-light pod to collect a visible light video stream and an infrared thermal imaging video stream simultaneously.
[0020] In one embodiment, the compound-wing UAV in S100 combines rotor vertical take-off and landing with fixed-wing cruise; the dual-light pod keeps observing the ground throughout acquisition, and the infrared thermal imaging camera and the visible light camera undergo strict line-of-sight alignment calibration so that the two video streams are spatially aligned at the pixel level.
[0021] S200: The acquired dual-light video streams are encoded by the airborne communication gateway and transmitted to the ground control station in real time over the air-to-ground communication link using the RTSP protocol.
[0022] In one embodiment, the airborne communication gateway in S200 has a built-in hardware video encoder that compresses the dual-light video streams using the H.264 video encoding standard; the air-to-ground communication link uses a 5G mobile communication network to keep the video transmission delay below a preset threshold.
[0023] S300: The ground control station receives the video streams and calls the pre-trained multimodal forest fire detection model to perform real-time detection on the transmitted video in the multimodal intelligent forest fire inspection system, determining whether there are any suspected fire points.
[0024] Specifically, referring to Figure 2, which is a schematic diagram of a multimodal fusion convolutional neural network architecture in one embodiment of the present invention, the multimodal forest fire detection model in S300 adopts a dual-stream fusion convolutional neural network architecture, including: a visible light branch for receiving and processing the visible light video stream, comprising multiple sequentially connected convolutional blocks and pooling layers (convolutional block 1, pooling layer, convolutional block 2, pooling layer, and convolutional block 3) to extract the color and texture features of smoke; an infrared branch for receiving and processing the infrared thermal imaging video stream, comprising multiple sequentially connected convolutional blocks and pooling layers (convolutional block 1, pooling layer, convolutional block 2, pooling layer, and convolutional block 3) to extract the brightness and morphological features of high-temperature hot spots; and a feature fusion module connected after the visible light and infrared branches, which incorporates a dual-stream cross-attention fusion mechanism. This mechanism includes a channel attention unit and a spatial calibration unit. The channel attention unit first performs global average pooling on the input feature maps to compress spatial information, then uses a multilayer perceptron to calculate the channel weight vectors of the two branches, which determine the relative importance of visible light and infrared information in the current scene (e.g., automatically reducing the weight of visible light when the scene is obscured by dense smoke). The feature maps are then weighted accordingly. Subsequently, the spatial calibration unit concatenates the weighted feature maps and generates a spatial attention mask through convolution operations to lock onto high-response fire point regions in the image and suppress background forest noise.
Finally, the doubly calibrated feature maps are summed element-wise to generate joint features containing rich semantics. The joint detection network, connected after the feature fusion module, processes the joint features through deep convolutional layers, a global pooling layer, and fully connected layers, and then distributes them to parallel flame detection and smoke detection units, which output the bounding box coordinates and confidence scores of the fire targets.
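The order of operations in the fusion module (channel weighting, spatial mask, element-wise sum) can be illustrated with a small standard-library sketch. It is not the trained network: the multilayer perceptron is replaced by a per-channel softmax over the two pooled values, and the mask convolution by a 1×1 convolution with uniform weights followed by a sigmoid. Both substitutions, and all function names, are assumptions made to keep the example self-contained.

```python
import math

def global_avg_pool(fmap):
    """fmap: list of C channels, each an H x W nested list -> list of C means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]

def channel_weights(vis, ir):
    """Per-channel softmax over the two branches (stand-in for the MLP)."""
    w_vis, w_ir = [], []
    for a, b in zip(global_avg_pool(vis), global_avg_pool(ir)):
        ea, eb = math.exp(a), math.exp(b)
        w_vis.append(ea / (ea + eb))
        w_ir.append(eb / (ea + eb))
    return w_vis, w_ir

def scale(fmap, weights):
    """Multiply every channel by its scalar weight."""
    return [[[w * x for x in row] for row in ch] for ch, w in zip(fmap, weights)]

def spatial_mask(vis, ir):
    """Concatenate channels, apply a uniform 1x1 convolution, then a sigmoid."""
    chans = vis + ir
    h, w = len(chans[0]), len(chans[0][0])
    return [[1.0 / (1.0 + math.exp(-sum(ch[y][x] for ch in chans) / len(chans)))
             for x in range(w)] for y in range(h)]

def fuse(vis, ir):
    """Channel weighting, then spatial masking, then element-wise sum."""
    w_vis, w_ir = channel_weights(vis, ir)
    vis_w, ir_w = scale(vis, w_vis), scale(ir, w_ir)
    mask = spatial_mask(vis_w, ir_w)
    return [[[mask[y][x] * (cv[y][x] + ci[y][x]) for x in range(len(mask[0]))]
             for y in range(len(mask))] for cv, ci in zip(vis_w, ir_w)]

# Toy input: visible branch blank (dense smoke), infrared branch has one hot spot.
visible = [[[0.0, 0.0], [0.0, 0.0]]]
infrared = [[[0.0, 0.0], [0.0, 4.0]]]
print(channel_weights(visible, infrared))
print(fuse(visible, infrared))
```

With the blank visible input, the infrared branch receives the larger channel weight and the fused response is concentrated on the hot-spot pixel, which is the behavior the paragraph describes for smoke-obscured scenes.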
[0025] S400: When a suspected fire point is detected, the system automatically locks onto the target and saves the image frame, extracts the target's position in the image pixel coordinate system, and simultaneously acquires the UAV's current GPS position and full attitude information. Based on the image pixel coordinates, UAV attitude information, flight altitude, and camera sensor parameters, the system uses a collinearity equation based on the rotation matrix, together with a time-series sliding window filtering algorithm, to calculate the true latitude and longitude of the fire point in geographic space.
Specifically, refer to Figure 3, which is a schematic diagram of the interface of a multimodal intelligent forest fire patrol system in one embodiment of the present invention. To solve the problems of projection distortion and positioning jitter caused by UAV vibration when shooting from a distance and at a non-vertical angle, this step uses a full-attitude rotation matrix for spatial geometric correction and combines it with a multi-frame temporal smoothing strategy for accurate calculation. In one embodiment, S400 includes:
S410: Extract the position of the fire target in the image pixel coordinate system, that is, the position of the center point of the detection box.
S420: Extract the UAV's attitude information and the camera intrinsics from the image metadata. The attitude information includes heading angle, pitch angle, and roll angle; the camera intrinsics include image pixel resolution, sensor parameters, and camera focal length.
S430: Based on the pixel coordinates of the fire point and the camera intrinsics, construct the three-dimensional line-of-sight vector of the fire point in the normalized camera coordinate system using the pinhole imaging principle.
S440: Use the heading angle, pitch angle, and roll angle to construct a rotation matrix from the camera coordinate system to the geographic coordinate system, apply this rotation to the line-of-sight vector, and thereby correct the geometric distortion caused by gimbal tilt or fuselage tilt.
S450: Using the flight altitude, calculate the intersection of the rotated line-of-sight vector with the ground plane by the principle of similar triangles, thereby obtaining the instantaneous ground distance offset of the fire point relative to the UAV.
S460: Establish a coordinate buffer queue, perform dispersion analysis on the instantaneous coordinates calculated over multiple consecutive frames, remove abnormal noise points caused by fuselage vibration, perform time-series weighted smoothing on the valid coordinates based on the fire detection confidence, and finally calculate a highly stable longitude and latitude for the fire point.
[0026] In one embodiment, S430 constructs the line-of-sight vector of the fire point in the normalized camera coordinate system by a formula whose terms are: the line-of-sight vector of the fire point in the normalized camera coordinate system; the pixel coordinates of the fire point; the physical dimensions of the sensor; the image pixel resolution; and the camera focal length.
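A minimal sketch of S410 and S430 under the pinhole model follows. The function names, the millimeter units, and the camera axis convention (x to the image right, y down the image, z along the optical axis, principal point at the image center) are assumptions for illustration; the patent's own formula is not reproduced in this text.

```python
def box_center(x_min, y_min, x_max, y_max):
    """S410: pixel coordinates of the center of a detection box."""
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0

def line_of_sight(u, v, img_w, img_h, sensor_w_mm, sensor_h_mm, focal_mm):
    """S430: line-of-sight vector in the normalized camera frame.

    Assumes a distortion-free pinhole camera whose principal point is the
    image center. The pixel offset is converted to millimeters on the sensor
    and divided by the focal length, so the z component is 1.
    """
    x_mm = (u - img_w / 2.0) * (sensor_w_mm / img_w)
    y_mm = (v - img_h / 2.0) * (sensor_h_mm / img_h)
    return (x_mm / focal_mm, y_mm / focal_mm, 1.0)

# Hypothetical 1920x1080 camera with a 6.4 mm x 3.6 mm sensor and 8 mm lens.
u, v = box_center(1200, 500, 1300, 620)
print(line_of_sight(u, v, 1920, 1080, 6.4, 3.6, 8.0))
```

A pixel at the image center maps to (0, 0, 1), that is, straight along the optical axis.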
[0027] In one embodiment, S440 uses the heading angle, pitch angle, and roll angle to construct a rotation matrix from the camera coordinate system to the geographic coordinate system, and transforms the line-of-sight vector into the geographic coordinate system by a formula whose terms are: the rotation matrix; the heading angle; the pitch angle; the roll angle; and the resulting rotated vector in the geographic coordinate system.
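The rotation in S440 can be sketched with the common aerospace yaw-pitch-roll (Z-Y-X) sequence and a north-east-down geographic frame. The text does not state which convention the patent uses, so the rotation order, the axis definitions, and the assumption of a downward-looking camera whose image top faces forward are all illustrative choices, as are the function names.

```python
import math

def rotation_body_to_ned(yaw_deg, pitch_deg, roll_deg):
    """R = Rz(yaw) * Ry(pitch) * Rx(roll), mapping body axes to north-east-down."""
    y, p, r = (math.radians(a) for a in (yaw_deg, pitch_deg, roll_deg))
    cy, sy = math.cos(y), math.sin(y)
    cp, sp = math.cos(p), math.sin(p)
    cr, sr = math.cos(r), math.sin(r)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp, cp * sr, cp * cr],
    ]

def camera_to_ned(v_cam, yaw_deg, pitch_deg, roll_deg):
    """Rotate a camera-frame line-of-sight vector into north-east-down axes.

    Assumed mounting: optical axis along body-down, image right along
    body-right, image down along body-backward.
    """
    xc, yc, zc = v_cam
    v_body = (-yc, xc, zc)  # (forward, right, down)
    rot = rotation_body_to_ned(yaw_deg, pitch_deg, roll_deg)
    return tuple(sum(rot[i][j] * v_body[j] for j in range(3)) for i in range(3))

# Level flight heading east: a pixel above the image center looks east and down.
print(camera_to_ned((0.0, -1.0, 1.0), 90.0, 0.0, 0.0))
```

In level flight heading east, a point above the image center is ahead of the aircraft, so its rotated vector has a positive east component and essentially no north component.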
[0028] In one embodiment, S450 calculates the ground distance offset of the fire point relative to the UAV by a formula whose terms are: the ground projection offset; and the flight altitude.
In one embodiment, S460 calculates the longitude and latitude of the fire point in two steps, instantaneous coordinate calculation and time-series weighted smoothing, by formulas whose terms are: the instantaneous latitude and longitude of the current frame; the longitude and latitude of the UAV center; the conversion factor for latitude and longitude units; the final smoothed output coordinates; the valid instantaneous coordinates of each frame in the buffer queue; the fire detection confidence corresponding to that frame; and the size of the sliding window.
[0029] S500: The latitude and longitude of the fire point, the fire type, and the corresponding on-site images are associated and stored, an alarm is triggered, and a structured inspection record is generated.
[0030] In one embodiment, the inspection records in S500 are stored in JSON format, and the record content includes: fire detection timestamp, UAV location, calculated coordinates of the fire point, fire confidence level, and on-site screenshots with marked detection boxes.
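S450 and S460 can be sketched as follows, assuming flat terrain, a line-of-sight vector already expressed in north-east-down axes, and an approximate constant of 111,320 m per degree of latitude. The median-based outlier test, its threshold, the default window size, and all names are hypothetical stand-ins for the patent's dispersion analysis, whose exact form is not given in this text.

```python
import math
from collections import deque
from statistics import median

METERS_PER_DEG = 111_320.0  # approximate meters per degree of latitude (assumed)

def ground_offset(v_ned, altitude_m):
    """S450: scale the ray until its down component equals the altitude."""
    vn, ve, vd = v_ned
    if vd <= 0:
        raise ValueError("line of sight does not intersect the ground")
    s = altitude_m / vd
    return vn * s, ve * s  # (north, east) offset in meters

def to_lat_lon(lat0, lon0, north_m, east_m):
    """Instantaneous fire point coordinates from the UAV position and offset."""
    lat = lat0 + north_m / METERS_PER_DEG
    lon = lon0 + east_m / (METERS_PER_DEG * math.cos(math.radians(lat0)))
    return lat, lon

class FirePointSmoother:
    """S460: sliding window, outlier rejection, confidence-weighted mean."""

    def __init__(self, window=10, max_dev_deg=5e-4):
        self.buf = deque(maxlen=window)  # coordinate buffer queue
        self.max_dev_deg = max_dev_deg   # hypothetical outlier threshold

    def update(self, lat, lon, conf):
        self.buf.append((lat, lon, conf))
        med_lat = median(p[0] for p in self.buf)
        med_lon = median(p[1] for p in self.buf)
        valid = [p for p in self.buf
                 if abs(p[0] - med_lat) <= self.max_dev_deg
                 and abs(p[1] - med_lon) <= self.max_dev_deg]
        total = sum(p[2] for p in valid)
        if total <= 0:  # nothing usable: fall back to the window median
            return med_lat, med_lon
        return (sum(p[0] * p[2] for p in valid) / total,
                sum(p[1] * p[2] for p in valid) / total)

north, east = ground_offset((1.0, 0.0, 1.0), 100.0)
print(to_lat_lon(28.0, 113.0, north, east))
```

Weighting by detection confidence means frames in which the detector was unsure contribute less to the reported position, while the median test discards frames whose coordinates jump far from the rest of the window.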
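A structured record of this kind might be serialized as shown below. The field names, the ISO 8601 timestamp format, and the choice to store a screenshot path rather than image bytes are assumptions; the text only lists which items the record contains.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class InspectionRecord:
    """One fire inspection record; all field names are hypothetical."""
    timestamp: str        # fire detection time, ISO 8601 assumed
    uav_lat: float        # UAV location
    uav_lon: float
    fire_lat: float       # calculated fire point coordinates
    fire_lon: float
    confidence: float     # fire detection confidence
    screenshot_path: str  # on-site screenshot with marked detection box

def to_json(record):
    """Serialize a record to a JSON string."""
    return json.dumps(asdict(record), ensure_ascii=False)

# Placeholder values for illustration only.
example = InspectionRecord(
    timestamp="2026-01-01T08:00:00Z",
    uav_lat=28.0, uav_lon=113.0,
    fire_lat=28.0009, fire_lon=113.0,
    confidence=0.93,
    screenshot_path="screenshots/example.jpg",
)
print(to_json(example))
```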
[0031] In one embodiment, referring to Figure 4, which is a schematic diagram of a multimodal intelligent forest fire patrol system based on a compound-wing UAV according to an embodiment of the present invention, the system includes an airborne acquisition and transmission unit and a ground computer system communicatively connected to it. The ground computer system is equipped with a streaming media preprocessing unit and a multimodal fire detection and localization model connected in sequence. The airborne acquisition and transmission unit acquires dual-light video streams carrying the UAV's real-time pose information and sends them to the ground computer system using the RTSP protocol. The streaming media preprocessing unit receives and decodes the dual-light video streams, extracts the UAV full attitude information and the dual-light image data corresponding to the current frame, and sends the extracted data to the multimodal fire detection and localization model. The multimodal fire detection and localization model receives the image data and pose information and applies the multimodal intelligent forest fire inspection method based on a compound-wing UAV to perform fire inference and coordinate calculation, thereby obtaining fire inspection results containing real latitude and longitude coordinates.
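One way the preprocessing unit could pair a decoded frame with its pose is a nearest-timestamp lookup. This is only a sketch for the case where telemetry arrives as a separate timestamped sequence; the embodiment above reads attitude from the image metadata, and the function name and data layout here are assumptions.

```python
from bisect import bisect_left

def nearest_pose(frame_ts, pose_ts, poses):
    """Return the pose whose timestamp is closest to frame_ts.

    pose_ts must be sorted in ascending order and aligned with poses.
    """
    if not pose_ts:
        raise ValueError("no telemetry available")
    i = bisect_left(pose_ts, frame_ts)
    if i == 0:
        return poses[0]
    if i == len(pose_ts):
        return poses[-1]
    before, after = pose_ts[i - 1], pose_ts[i]
    return poses[i] if after - frame_ts < frame_ts - before else poses[i - 1]

# Hypothetical telemetry at 10 Hz: (heading, pitch, roll) in degrees.
times = [0.0, 0.1, 0.2]
attitudes = [(90.0, 1.0, 0.0), (90.5, 1.2, -0.3), (91.0, 0.8, 0.2)]
print(nearest_pose(0.16, times, attitudes))
```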
[0032] Specific limitations regarding the multimodal intelligent forest fire patrol system based on a compound-wing UAV can be found in the above description of the multimodal intelligent forest fire patrol method based on a compound-wing UAV, and will not be repeated here. Each module in the aforementioned multimodal intelligent forest fire patrol system based on a compound-wing UAV can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the corresponding operations of each module.
[0033] The aforementioned multimodal intelligent forest fire inspection method and system based on a compound-wing UAV first controls the compound-wing UAV to perform a long-endurance inspection mission along a preset route. It utilizes an onboard dual-light pod to simultaneously acquire visible light and infrared thermal imaging video streams, which are then encoded and encapsulated via an onboard gateway and transmitted back to the ground control station in real time using the RTSP protocol. Subsequently, the ground station invokes a multimodal detection model with a built-in dual-stream cross-attention fusion mechanism to perform frame-by-frame inference on the video stream. Through channel attention and spatial calibration units, it dynamically integrates infrared and visible light features to accurately identify suspected fire points even in complex lighting and smoke-covered environments. In the positioning phase, the system simultaneously acquires the UAV's heading, pitch, roll, and other full-attitude parameters. It uses a full-attitude rotation matrix to perform spatial correction on the imaging geometry and further combines a temporal sliding window filtering algorithm to perform weighted smoothing on the calculated coordinates of multiple consecutive frames to eliminate positioning drift caused by fuselage vibration. Finally, the time-optimized fire point latitude and longitude, fire confidence level, and on-site screenshots are structured and saved as a JSON log, triggering an alarm. This method effectively solves the problems of traditional single-modal detection being susceptible to interference and unstable long-distance positioning, and realizes all-weather, highly reliable perception and high-precision positioning of forest fires.
[0034] In one embodiment, a computer device is also provided, including a memory and a processor, the memory storing a computer program, the processor executing the computer program to implement the steps of a multimodal intelligent forest fire inspection method based on a compound-wing UAV.
[0035] In one embodiment, a computer-readable storage medium is also provided, on which a computer program is stored, which, when executed by a processor, implements the steps of a multimodal intelligent forest fire patrol method based on a compound-wing unmanned aerial vehicle.
[0036] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the methods described above. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, or optical storage, etc. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can be in various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM), etc.
[0037] The above provides a detailed description of a multimodal intelligent forest fire patrol method and system based on a compound-wing UAV, as provided by this invention. Specific examples have been used to illustrate the principles and implementation methods of this invention. The descriptions of the embodiments above are merely for the purpose of helping to understand the core ideas of this invention. It should be noted that those skilled in the art can make various improvements and modifications to this invention without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of this invention.
Claims
1. A multimodal intelligent forest fire patrol method based on a compound-wing unmanned aerial vehicle (UAV), characterized in that the method includes the following steps:
S100: A patrol route is configured for the compound-wing UAV; the UAV flies autonomously along the preset route and uses the onboard dual-light pod to simultaneously collect a visible light video stream and an infrared thermal imaging video stream;
S200: The airborne communication gateway encodes the acquired dual-light video streams and transmits them to the ground control station in real time over the air-to-ground communication link using the RTSP protocol;
S300: The ground control station receives the video streams and, within the multimodal intelligent forest fire patrol system, calls the pre-trained multimodal forest fire detection model to perform real-time detection on the transmitted video and determine whether any suspected fire point is present;
S400: When a suspected fire point is detected, the system automatically locks onto the target and saves the image frame, extracts the target's position in the image pixel coordinate system, and simultaneously acquires the UAV's current GPS position information and full attitude information; based on the image pixel coordinates, UAV attitude information, flight altitude, and camera sensor parameters, the system uses a collinearity equation based on the rotation matrix and a time-series sliding window filtering algorithm to calculate the true latitude and longitude coordinates of the fire point in geographic space;
S500: The latitude and longitude coordinates of the fire point, the fire type, and the corresponding on-site images are associated and stored, an alarm is triggered, and a structured inspection record is generated.
2. The method according to claim 1, characterized in that, in S100, the compound-wing UAV combines rotor-based vertical take-off and landing with fixed-wing cruise flight; the dual-light pod keeps observing the ground during acquisition, and the infrared thermal imaging camera and the visible light camera undergo strict line-of-sight alignment calibration so that the two video streams are spatially pixel-aligned.
3. The method according to claim 2, characterized in that, in S200, the airborne communication gateway has a built-in hardware video encoder that compresses the dual-light video streams using the H.264 video encoding standard; the air-to-ground communication link uses a 5G mobile communication network to keep video transmission latency below a preset threshold.
4. The method according to claim 3, characterized in that, in S300, the multimodal forest fire detection model employs a dual-stream fusion convolutional neural network architecture, comprising:
a visible light branch for receiving and processing the visible light video stream, containing multiple sequentially connected convolutional blocks and pooling layers to extract the color and texture features of smoke;
an infrared branch for receiving and processing the infrared thermal imaging video stream, containing multiple sequentially connected convolutional blocks and pooling layers to extract the brightness and morphological features of high-temperature hotspots;
a feature fusion module connected to the visible light and infrared branches, incorporating a dual-stream cross-attention fusion mechanism that includes a channel attention unit and a spatial calibration unit, wherein the channel attention unit first performs global average pooling on the input feature maps to compress spatial information and then uses a multilayer perceptron to calculate the channel weight vectors of the two branches, which determine the relative importance of visible light and infrared information in the current scene and are used to weight the feature maps; the spatial calibration unit then concatenates the weighted feature maps and generates a spatial attention mask through convolution operations to lock onto high-response fire points in the image and suppress background forest noise; finally, the double-calibrated feature maps are summed element-wise to generate semantically rich joint features; and
a joint detection network connected after the feature fusion module, which processes the joint features through deep convolutional layers, global pooling layers, and fully connected layers, distributes the result to parallel flame detection and smoke detection units, and finally outputs the bounding box coordinates and confidence scores of the fire targets.
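The fusion step of claim 4 can be illustrated with a small pure-Python sketch. It is not the trained network of the claim: the layer sizes, the per-channel softmax across the two branches, the 1x1 convolution used for the spatial mask, and all names (`fuse`, `channel_weights`, `w1`, `conv_w`, and so on) are assumptions chosen only to make the data flow concrete.

```python
import math

def global_avg_pool(fmap):
    """fmap: list of C channels, each an H x W list of lists -> C channel means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]

def dense(x, weights, bias):
    """Fully connected layer; `weights` holds one row per output unit."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def channel_weights(vis, ir, w1, b1, w2, b2):
    """Channel attention: GAP -> MLP -> softmax across the two branches."""
    pooled = global_avg_pool(vis) + global_avg_pool(ir)    # length 2C
    hidden = [max(0.0, h) for h in dense(pooled, w1, b1)]  # ReLU (assumed)
    logits = dense(hidden, w2, b2)                         # length 2C
    c = len(vis)
    a_vis, a_ir = [], []
    for k in range(c):
        m = max(logits[k], logits[c + k])                  # numerical stability
        ev, ei = math.exp(logits[k] - m), math.exp(logits[c + k] - m)
        a_vis.append(ev / (ev + ei))
        a_ir.append(ei / (ev + ei))
    return a_vis, a_ir

def fuse(vis, ir, w1, b1, w2, b2, conv_w, conv_b):
    """Weight both branches, build a spatial mask, and sum element-wise."""
    a_vis, a_ir = channel_weights(vis, ir, w1, b1, w2, b2)
    vis_w = [[[a * v for v in row] for row in ch] for a, ch in zip(a_vis, vis)]
    ir_w = [[[a * v for v in row] for row in ch] for a, ch in zip(a_ir, ir)]
    stacked = vis_w + ir_w                                 # channel concatenation
    h, w = len(vis[0]), len(vis[0][0])
    # Assumed 1x1 convolution over the 2C stacked channels, then a sigmoid.
    mask = [[1.0 / (1.0 + math.exp(-(sum(cw * ch[i][j]
                                         for cw, ch in zip(conv_w, stacked)) + conv_b)))
             for j in range(w)] for i in range(h)]
    # Apply the mask to both weighted maps and sum them element-wise.
    fused = [[[mask[i][j] * (vc[i][j] + ic[i][j]) for j in range(w)]
              for i in range(h)] for vc, ic in zip(vis_w, ir_w)]
    return fused, mask, a_vis, a_ir
```

In this sketch a hot infrared pixel raises the spatial mask at its location, so the fused response there is amplified relative to the uniform background.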
5. The method according to claim 4, characterized in that S400 includes:
S410: Extract the position of the fire target in the image pixel coordinate system, namely the position of the center point of the detection box in the image pixel coordinate system;
S420: Extract the UAV's attitude information and the camera intrinsic parameters from the image metadata, where the attitude information includes the heading angle, pitch angle, and roll angle, and the camera intrinsic parameters include the image pixel resolution, sensor parameters, and camera focal length;
S430: Based on the pixel coordinates of the fire point and the camera intrinsic parameters, construct a three-dimensional line-of-sight vector of the fire point in the normalized camera coordinate system using the pinhole imaging principle;
S440: Use the heading angle, pitch angle, and roll angle to construct a rotation matrix from the camera coordinate system to the geographic coordinate system, apply the spatial attitude rotation to the line-of-sight vector, and thereby correct the geometric distortion caused by gimbal tilt or fuselage tilt;
S450: Combining the flight altitude, calculate the intersection of the rotated line-of-sight vector with the ground plane using the principle of similar triangles, thereby obtaining the instantaneous ground distance offset of the fire point relative to the UAV;
S460: Establish a coordinate buffer queue, perform discrete analysis on the instantaneous coordinates calculated over multiple consecutive frames, remove abnormal noise points caused by fuselage vibration, apply time-series weighted smoothing to the remaining valid coordinates based on the fire detection confidence, and finally calculate highly stable longitude and latitude coordinates of the fire point.
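The claim does not specify how the "discrete analysis" of S460 identifies abnormal points. As a minimal sketch, assuming a median-absolute-deviation test is an acceptable stand-in, the function below drops buffered coordinates that lie far from the window median. The name `reject_outliers`, the threshold `k`, and the floor `min_scale_deg` are hypothetical and do not come from the source.

```python
from statistics import median

def reject_outliers(coords, k=3.0, min_scale_deg=1e-6):
    """Keep (lat, lon, conf) tuples that lie close to the window median.

    Assumption: a median-absolute-deviation (MAD) test stands in for the
    unspecified "discrete analysis" of S460.
    """
    if len(coords) < 3:
        return list(coords)  # too few samples to judge dispersion
    lat_med = median(c[0] for c in coords)
    lon_med = median(c[1] for c in coords)
    # Deviation of each sample from the median position, in degrees.
    dev = [max(abs(c[0] - lat_med), abs(c[1] - lon_med)) for c in coords]
    # MAD, floored so a perfectly still window does not reject every sample.
    scale = max(median(dev), min_scale_deg)
    return [c for c, d in zip(coords, dev) if d <= k * scale]
```

A median-based test is used here because a single vibration-induced jump barely moves the median, whereas it would shift a mean and its standard deviation.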
6. The method according to claim 5, characterized in that, in S430, the line-of-sight vector of the fire point in the normalized camera coordinate system is constructed by a formula [not reproduced in this text] whose quantities are: the line-of-sight vector of the fire point in the normalized camera coordinate system; the pixel coordinates of the fire point; the physical dimensions of the sensor; the image pixel resolution; and the camera's focal length.
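Since the formula of claim 6 is not available here, the following is only a plausible pinhole-model sketch built from the quantities the claim lists. It assumes the principal point sits at the image center, ignores lens distortion, and uses camera axes with x to the image right, y toward the image bottom, and z along the optical axis. The function name and parameter names are hypothetical.

```python
def line_of_sight(u, v, img_w, img_h, sensor_w_mm, sensor_h_mm, focal_mm):
    """Return the line-of-sight vector (x, y, 1) in normalized camera coordinates.

    Assumptions: principal point at the image center, no lens distortion,
    square-grid pixels whose pitch is sensor size divided by resolution.
    """
    pitch_x = sensor_w_mm / img_w            # physical width of one pixel (mm)
    pitch_y = sensor_h_mm / img_h            # physical height of one pixel (mm)
    x = (u - img_w / 2.0) * pitch_x / focal_mm
    y = (v - img_h / 2.0) * pitch_y / focal_mm
    return (x, y, 1.0)
```

For example, with a 1920 x 1080 image, a 6.4 mm x 3.6 mm sensor, and an 8 mm lens, the center pixel maps to the optical axis and a pixel on the right edge maps to x = 0.4.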
7. The method according to claim 6, characterized in that, in S440, the heading angle, pitch angle, and roll angle are used to construct a rotation matrix from the camera coordinate system to the geographic coordinate system, and the line-of-sight vector is transformed into the geographic coordinate system by a formula [not reproduced in this text] whose quantities are: the rotation matrix; the heading angle; the pitch angle; the roll angle; and the resulting rotated vector in the geographic coordinate system.
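The rotation convention of claim 7 is not recoverable from this text, so the sketch below fixes one common choice and should be read as an assumption throughout: a north-east-down (NED) geographic frame, a yaw-pitch-roll (Z-Y-X) Euler sequence, and a camera that looks straight down with the image top toward the nose when the attitude angles are zero. The function names are hypothetical.

```python
import math

def rotation_body_to_ned(yaw, pitch, roll):
    """Z-Y-X Euler rotation matrix (angles in radians), body frame to NED."""
    cy, sy = math.cos(yaw), math.sin(yaw)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cr, sr = math.cos(roll), math.sin(roll)
    return [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp,     cp * sr,                cp * cr],
    ]

def camera_to_ned(los, yaw, pitch, roll):
    """Rotate a camera-frame line-of-sight vector into the NED frame.

    Assumed mounting: camera x (image right) = body right, camera y
    (image down) = body backward, camera z (optical axis) = body down.
    """
    xc, yc, zc = los
    body = (-yc, xc, zc)                     # camera axes -> body axes
    r = rotation_body_to_ned(yaw, pitch, roll)
    return tuple(sum(r[i][j] * body[j] for j in range(3)) for i in range(3))
```

With zero attitude angles the optical axis maps to the NED down axis; with a heading of 90 degrees, a pixel above the image center maps to a vector pointing east and down.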
8. The method according to claim 7, characterized in that, in S450, the ground distance offset of the fire point relative to the UAV is calculated by a formula [not reproduced in this text] whose quantities are: the ground projection offset; and the flight altitude. In S460, the longitude and latitude of the fire point are calculated in two steps, instantaneous coordinate calculation and time-weighted smoothing, expressed by formulas [not reproduced in this text] whose quantities are: the instantaneous latitude and longitude coordinates of the current frame; the longitude and latitude of the UAV center; the conversion factor for latitude and longitude units; the final output smoothed coordinates; the valid instantaneous coordinates of each frame in the buffer queue; the fire detection confidence corresponding to that frame; and the size of the sliding window.
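The formulas of claim 8 are missing from this text, so the sketch below only illustrates one way the listed quantities could fit together. It assumes flat ground, a line-of-sight vector in north-east-down axes, a fixed conversion factor of 111,320 m per degree of latitude, and a confidence-weighted mean over the most recent frames. All names and the constant are assumptions, not values from the source.

```python
import math
from collections import deque

METERS_PER_DEG_LAT = 111_320.0  # assumed conversion factor (approximate)

def ground_offset(d_ned, altitude_m):
    """Intersect the line of sight with flat ground (similar triangles)."""
    north, east, down = d_ned
    if down <= 0:
        raise ValueError("line of sight does not intersect the ground")
    scale = altitude_m / down
    return north * scale, east * scale       # offsets in meters

def instantaneous_coords(uav_lat, uav_lon, north_m, east_m):
    """Convert a metric offset from the UAV into latitude and longitude."""
    lat = uav_lat + north_m / METERS_PER_DEG_LAT
    lon = uav_lon + east_m / (METERS_PER_DEG_LAT * math.cos(math.radians(uav_lat)))
    return lat, lon

class ConfidenceWeightedWindow:
    """Sliding window that averages coordinates weighted by detection confidence."""

    def __init__(self, size):
        self.buf = deque(maxlen=size)        # oldest frame drops out automatically

    def update(self, lat, lon, conf):
        self.buf.append((lat, lon, conf))
        total = sum(c for _, _, c in self.buf)
        if total <= 0:
            return lat, lon                  # no usable weights; pass through
        return (sum(la * c for la, _, c in self.buf) / total,
                sum(lo * c for _, lo, c in self.buf) / total)
```

Weighting by confidence means that frames in which the detector is less certain contribute less to the reported position, which is the behaviour the claim describes for the smoothing step.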
9. The method according to claim 8, characterized in that, in S500, the inspection records are stored in JSON format and include: the fire detection timestamp, the UAV location, the calculated coordinates of the fire point, the fire confidence level, and on-site screenshots with marked bounding boxes.
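Claim 9 names the fields of the inspection record but not its schema. As a hedged illustration, the record could be assembled and serialized as follows; the key names, the nesting, and the use of a file path for the screenshot are hypothetical choices.

```python
import json
from datetime import datetime, timezone

def build_record(uav_lat, uav_lon, uav_alt_m, fire_lat, fire_lon,
                 confidence, screenshot_path, when=None):
    """Assemble one inspection record; all key names are illustrative."""
    when = when or datetime.now(timezone.utc)
    return {
        "timestamp": when.isoformat(),
        "uav_position": {"lat": uav_lat, "lon": uav_lon, "alt_m": uav_alt_m},
        "fire_coordinates": {"lat": fire_lat, "lon": fire_lon},
        "confidence": confidence,
        "screenshot": screenshot_path,       # image with the marked bounding box
    }

def to_json(record):
    """Serialize the record as human-readable JSON text."""
    return json.dumps(record, ensure_ascii=False, indent=2)
```

Storing the timestamp in ISO 8601 form with an explicit UTC offset keeps records from different ground stations comparable.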
10. A multimodal intelligent forest fire patrol system based on a compound-wing unmanned aerial vehicle (UAV), characterized in that it includes an airborne data acquisition and transmission unit and a ground computer system connected to it, the ground computer system being equipped with a streaming media preprocessing unit and a multimodal fire detection and localization model connected in sequence; the airborne data acquisition and transmission unit is used to acquire dual-light video streams carrying real-time pose information of the UAV and to send the dual-light video streams to the ground computer system using the RTSP protocol; the streaming media preprocessing unit is used to receive and decode the dual-light video streams, extract the UAV full attitude information and dual-light image data corresponding to the current frame, and send the extracted data to the multimodal fire detection and localization model; the multimodal fire detection and localization model receives the image data and pose information and uses the method described in any one of claims 1 to 9 to perform fire inference and coordinate calculation, obtaining fire inspection results that contain true latitude and longitude coordinates.