A Method and System for Forest Fire Area Extraction Based on Swin Transformer Multi-Temporal Fusion
By employing the Swin Transformer multi-temporal fusion method, the problems of spectral confusion, temporal registration, and multi-scale feature fusion in forest fire area extraction were solved, achieving high-precision and reliable fire area extraction and meeting the needs of rapid, accurate, and large-scale mapping for forest fire emergency monitoring.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING INFORMATION SCI & TECH UNIV
- Filing Date
- 2026-02-27
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies for extracting forest fire area suffer from severe spectral confusion, insufficient temporal registration accuracy, inadequate utilization of spatial context, and lack of multi-scale feature fusion, resulting in insufficient accuracy and efficiency of extraction results, making it difficult to meet the needs of rapid and accurate large-scale mapping.
A multi-temporal fusion method based on Swin Transformer is adopted. Preprocessing is performed on pre-disaster and post-disaster remote sensing image data. Multi-scale features of fire are extracted using multi-level window attention modules and temporal attention modules. Feature fusion is performed by combining feature pyramid network. Fire area extraction results are generated by sub-pixel level area integration and confidence evaluation.
It significantly improves the accuracy and consistency of forest fire area extraction, reduces the fragmentation of classification results, and provides high-precision and reliable fire area estimation, meeting the needs of rapid, accurate, and large-scale mapping for forest fire emergency monitoring.
Figure CN122244130A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of remote sensing image processing and forest disaster monitoring technology, and in particular to a method and system for extracting forest fire area based on Swin Transformer multi-temporal fusion.
Background Technology
[0002] With the increasing impact of global climate change and human activities, the frequency and intensity of forest fires are on the rise, posing a serious threat to global ecosystems and human society. There is an urgent need to develop rapid and accurate fire loss assessment technologies to support post-disaster emergency response, loss assessment, and ecological restoration planning. In recent years, high spatiotemporal resolution and freely available remote sensing data, exemplified by Sentinel-2, have provided an unprecedented data foundation for large-scale, high-frequency forest fire monitoring. However, automatically and accurately extracting the burned area from multi-temporal remote sensing imagery remains a challenging technical problem.
[0003] Currently, forest fire area extraction mainly relies on three technical approaches: first, methods based on spectral indices, such as differencing the Normalized Burn Ratio (NBR) or the Normalized Difference Vegetation Index (NDVI) between two dates and applying thresholds to identify changed areas; second, methods based on traditional machine learning, such as Support Vector Machines (SVM) and Random Forests, which rely on manually designed features for classification; and third, methods based on deep learning, especially Convolutional Neural Networks (CNN) and segmentation networks such as U-Net, which achieve end-to-end feature learning and classification. Although these methods have improved automation to some extent, they still have several fundamental limitations.
[0004] Existing technologies have the following main drawbacks. First, spectral confusion is a significant problem: burned areas, shadows, and water bodies exhibit similar spectral responses in specific bands, so methods based on spectral indices are prone to misclassification. Their overall accuracy is typically only 60%-75%, which severely limits the reliability of assessment results. Second, temporal registration accuracy is insufficient. Existing methods often employ simple global registration strategies and ignore local geometric distortions caused by undulating mountainous terrain. This leaves a registration error of 1-3 pixels between pre-disaster and post-disaster images, which causes change detection to fail in edge areas; the false negative rate for small fire areas (e.g., less than 1 hectare) can be as high as 35%. Third, spatial context modeling capability is limited. Traditional CNNs are constrained by their local receptive fields, making it difficult to capture the large-scale spatial patterns and spatial continuity of fire spread. This produces fragmented extraction results and prevents the main burned area from being effectively distinguished from scattered interfering patches. Finally, multi-level feature fusion mechanisms are lacking. Most deep learning models extract features at a single scale and fail to integrate multi-scale information from local texture to landscape pattern, resulting in a limited ability to distinguish different fire intensities (light, moderate, and severe) and relatively coarse area statistics.
[0005] Therefore, current technology cannot achieve a good balance between accuracy, efficiency and robustness, making it difficult to meet the urgent need for rapid, accurate, and large-scale mapping in forest fire emergency monitoring.
Summary of the Invention
[0006] In view of this, embodiments of the present invention provide a method and system for extracting forest fire area based on Swin Transformer multi-temporal fusion, in order to solve the problems of severe spectral confusion, insufficient temporal registration accuracy, insufficient utilization of spatial context, and lack of multi-scale feature fusion in the existing technology.
[0007] On one hand, this invention provides a method for extracting forest fire area based on Swin Transformer multi-temporal fusion, the method comprising: Remote sensing image data from two time phases, before and after the disaster, are acquired. The remote sensing image data is preprocessed to obtain preprocessed pre-disaster images, post-disaster images, and a binary mask. The three images are then stitched together along the channel dimension to form three-channel input data. The three-channel input data is input into a pre-trained Swin Transformer model to extract multi-scale features of the fire. The Swin Transformer model includes a multi-level window attention module with different window sizes, which is used to extract hierarchical features from local texture to global spatial distribution. The model also uses a temporal attention module to fuse the feature changes of the pre-disaster and post-disaster images. Based on the feature map output by the Swin Transformer model, the fire area is classified to generate a pixel-level fire probability map. Based on the fire probability map and the preset ground sampling distance, the sub-pixel level forest fire damage area is calculated, and the confidence level of the calculation results is evaluated to output the final fire area extraction result.
[0008] In some embodiments of the present invention, preprocessing the remote sensing image data to obtain the preprocessed pre-disaster images, post-disaster images, and binary mask includes: applying radiometric and atmospheric corrections to the pre-disaster and post-disaster remote sensing images, respectively, to obtain surface reflectance images; wherein the radiometric correction satisfies the following formula:
$$\rho_{TOA} = \frac{\pi \cdot L_{TOA} \cdot d^{2}}{ESUN \cdot \cos\theta_{s}}$$
where $\rho_{TOA}$ denotes the top-of-atmosphere reflectance; $L_{TOA}$ denotes the top-of-atmosphere radiance; $d$ denotes the Earth-Sun distance correction factor; $ESUN$ denotes the solar spectral irradiance; and $\theta_{s}$ denotes the solar zenith angle; the atmospheric correction uses the Sen2Cor algorithm and satisfies the following formula:
$$\rho_{surf} = \rho_{TOA} - \Delta\rho_{atm} - \Delta\rho_{ray}$$
where $\rho_{surf}$ denotes the surface reflectance; $\Delta\rho_{atm}$ denotes the attenuation of the apparent reflectance caused by atmospheric absorption and scattering; and $\Delta\rho_{ray}$ denotes the attenuation of the apparent reflectance caused by Rayleigh scattering; performing geometric registration on the corrected pre-disaster and post-disaster images to achieve sub-pixel-level alignment; and, based on the registered pre-disaster and post-disaster images, calculating the differenced Normalized Burn Ratio (dNBR) and combining it with preset regions of interest to generate a binary mask that marks potential fire areas.
[0009] In some embodiments of the present invention, geometric registration of the corrected pre-disaster and post-disaster images includes: using the SIFT algorithm to extract matching feature point pairs between the pre-disaster and post-disaster images; using the RANSAC algorithm to filter the feature point pairs to obtain a filtered feature point set; and, based on the feature point set, using a block registration model to calculate local geometric transformation parameters, including: dividing the image into multiple regular grid blocks; for each image block, calculating a local geometric transformation matrix based on the feature points falling within the block; and performing a weighted average of the local transformation matrices of all image blocks to fuse them into a full-image transformation matrix; wherein the weighted averaging satisfies the following formula:
$$T_{global} = \frac{\sum_{i} w_{i} T_{i}}{\sum_{i} w_{i}}$$
where $T_{global}$ denotes the full-image transformation matrix; $T_{i}$ denotes the transformation matrix of the $i$-th image block; and $w_{i}$ denotes the weight of the $i$-th image block.
[0010] In some embodiments of the present invention, the window size of the multi-level window attention module in the Swin Transformer model is set to multiple different scales, including 7×7, 14×14, 28×28 and 56×56 pixels, and a shifted window mechanism is adopted, wherein the shift size is half of the window size.
[0011] In some embodiments of the present invention, the Swin Transformer model further includes a feature pyramid network, and after extracting the multi-scale fire features, it further includes: The feature pyramid network upsamples and fuses the feature maps at different levels output by the multi-level window attention module. The bottom-up path of the feature pyramid network receives the feature map sequence with decreasing resolution output by the multi-level window attention module, and the top-down path transmits the high-level semantic information from the deep feature map through upsampling and fuses it with the spatial detail information of the shallow feature map element by element to generate a fire feature map that integrates multi-scale semantic information.
[0012] In some embodiments of the present invention, using a temporal attention module to fuse the feature changes of the pre-disaster images and the post-disaster images includes: using the features of the pre-disaster images as the query matrix and the features of the post-disaster images as the key matrix and value matrix; and calculating the phase-change weight matrix by scaled dot-product attention, as expressed by the formula:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V$$
where $\mathrm{Attention}(Q, K, V)$ denotes the phase-change weight matrix; $\mathrm{softmax}$ denotes the normalized exponential function; $Q$ denotes the query matrix; $K$ denotes the key matrix; $T$ denotes the transpose; $d_{k}$ denotes the dimension of the key vector; and $V$ denotes the value matrix; the value matrix is weighted according to the phase-change weight matrix to output the enhanced phase-change features.
[0013] In some embodiments of the present invention, calculating the sub-pixel level forest fire damage area includes: based on the fire category probability value of each pixel in the fire probability map and the actual ground area represented by each pixel, performing a weighted summation, as expressed by the formula:
$$A = \sum_{(i,j)} P(i,j) \cdot w \cdot GSD^{2}$$
where $A$ denotes the forest fire damage area; $P(i,j)$ denotes the fire category probability value of the pixel at position $(i,j)$; $w$ denotes the pixel area weight; and $GSD$ denotes the ground sampling distance.
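The weighted summation described above can be illustrated with a short sketch. It assumes the reading in which each pixel contributes its fire probability times a scalar area weight times the squared ground sampling distance; the function name `subpixel_burned_area` and the sample values are hypothetical and are not taken from the patent.

```python
import numpy as np

def subpixel_burned_area(prob_map, gsd_m, pixel_weight=1.0):
    """Sum P(i, j) * w * GSD^2 over all pixels and return square metres.

    prob_map     : 2-D array of per-pixel fire probabilities in [0, 1]
    gsd_m        : ground sampling distance in metres (10 m for Sentinel-2 B8)
    pixel_weight : scalar pixel area weight (assumed constant here)
    """
    prob_map = np.asarray(prob_map, dtype=np.float64)
    if prob_map.min() < 0.0 or prob_map.max() > 1.0:
        raise ValueError("probabilities must lie in [0, 1]")
    return float(np.sum(prob_map * pixel_weight) * gsd_m ** 2)

# Four 10 m pixels: fully burned, half burned, quarter burned, unburned.
prob = np.array([[1.0, 0.5],
                 [0.25, 0.0]])
area_m2 = subpixel_burned_area(prob, gsd_m=10.0)
area_ha = area_m2 / 10_000.0
```

Treating the probability as a coverage fraction is what lets the total fall between whole-pixel counts: a hard 0.5 threshold would report two burned pixels (200 m²), whereas the probability-weighted sum reports 175 m².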
[0014] In some embodiments of the present invention, confidence evaluation of the calculation results includes: calculating the entropy of the fire probability map from the class probability distribution of each pixel, so as to quantify the uncertainty of the model's classification decision for each pixel; identifying pixel regions with entropy values below a preset threshold as high-confidence regions and generating a corresponding binary confidence mask; and, based on the binary confidence mask, assessing the reliability of the calculated forest fire damage area to obtain a reliable area estimate; wherein the reliable area estimate is calculated according to the following formula:
$$A_{reliable} = A \times \frac{\mathrm{sum}(M)}{N(M)}$$
where $A_{reliable}$ denotes the reliable area estimate; $A$ denotes the forest fire damage area; $\mathrm{sum}(M)$ denotes the sum of all pixel values in the binary confidence mask $M$; and $N(M)$ denotes the total number of pixels in the binary confidence mask $M$.
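A minimal sketch of this confidence evaluation follows, assuming a two-class (fire / non-fire) probability map so that the per-pixel entropy is the binary Shannon entropy in bits. The threshold of 0.5 bits and the function names are illustrative assumptions, not values given in the patent.

```python
import numpy as np

def binary_entropy(prob_map, eps=1e-12):
    """Per-pixel Shannon entropy (bits) of a fire / non-fire probability."""
    p = np.clip(np.asarray(prob_map, dtype=np.float64), eps, 1.0 - eps)
    return -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))

def reliable_area(prob_map, gsd_m, entropy_threshold=0.5):
    """Scale the probability-weighted area by the share of confident pixels."""
    prob_map = np.asarray(prob_map, dtype=np.float64)
    entropy = binary_entropy(prob_map)
    # Binary confidence mask: 1 where the model is confident (low entropy).
    mask = (entropy < entropy_threshold).astype(np.uint8)
    area = float(np.sum(prob_map) * gsd_m ** 2)      # area with unit pixel weight
    confidence_ratio = mask.sum() / mask.size        # sum(M) / N(M)
    return area * confidence_ratio, mask

prob = np.array([[0.99, 0.50],
                 [0.02, 0.95]])
rel_area_m2, conf_mask = reliable_area(prob, gsd_m=10.0)
```

In this example only the pixel with probability 0.50 has entropy above the threshold (exactly 1 bit), so three of four pixels are marked confident and the area is scaled by 0.75.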
[0015] On the other hand, the present invention provides a forest fire area extraction system based on Swin Transformer multi-temporal fusion, the system comprising: The data preprocessing module is used to acquire remote sensing image data from two time phases: before and after the disaster. It performs radiometric correction, atmospheric correction, and geometric registration on the remote sensing image data to generate pixel-level aligned pre-disaster images, post-disaster images, and binary masks. The three are then stitched together along the channel dimension to form three-channel input data. The fire feature extraction module includes a pre-trained Swin Transformer model, which receives the three-channel input data and uses its multi-level window attention mechanism and temporal attention module to extract and fuse multi-scale fire change features from local to global. The area calculation and evaluation module is used to perform pixel-level classification based on the feature map output by the fire feature extraction module, generate a fire probability map, calculate the sub-pixel level fire damage area based on the fire probability map and a preset ground sampling distance, evaluate the confidence level of the calculation results, and output the final fire area extraction result.
[0016] On the other hand, the present invention provides a computer-readable storage medium having a computer program / instructions stored thereon, which, when executed by a processor, implement the steps of any of the methods mentioned above.
[0017] This invention provides a method and system for extracting forest fire area based on Swin Transformer multi-temporal fusion, which has the following beneficial effects: By introducing a hierarchical window attention mechanism based on the Swin Transformer and a multi-temporal fusion structure, this invention can comprehensively capture the feature information of forest fires from local texture to global spatial distribution. Through window partitioning at different scales and the shifted window design, the model effectively models the spatial continuity of large-scale fire spread, significantly improving the accuracy of change detection. It overcomes the shortcomings of traditional convolutional neural networks, such as limited receptive fields and insufficient utilization of contextual information, making the extracted results more spatially coherent and greatly reducing the fragmentation of classification results.
[0018] This invention proposes a three-channel input system and a precise registration process, aligning and fusing pre-disaster and post-disaster images with a spatially guided mask at the pixel level. It addresses the issues of spectral confusion and missed detection of small targets caused by radiometric differences and geometric misalignments at the data source, providing high-quality, highly comparable input for subsequent deep learning models. Combined with a temporal attention module specifically designed for multi-temporal changes, the model can focus more intently on real areas of change, effectively distinguishing spectrally similar features such as burned areas, shadows, and water bodies, thus maintaining high-accuracy classification performance even in complex terrains and scenes.
[0019] In the result quantification stage, this invention proposes a sub-pixel-level area integration algorithm based on probabilistic maps. The classification probability of each pixel is interpreted as the fire coverage ratio, and the damaged area is calculated with sub-pixel precision through continuous integration, overcoming the statistical limitations of integer pixels. Simultaneously, an information entropy-based confidence assessment mechanism is introduced, which quantifies the uncertainty of the output results and provides a reliability-corrected area estimate. This makes the final area report not only more accurate but also includes clear credibility indicators, providing a more scientific and reliable quantitative basis for emergency decision-making and post-disaster assessment.
[0020] In summary, this invention has achieved technological breakthroughs in three key areas: feature extraction, change detection, and area quantization. It has formed a complete technical system with high precision, high reliability, and high degree of automation, which can meet the urgent need for rapid, accurate, and large-scale mapping in forest fire emergency monitoring.
[0021] Additional advantages, objects, and features of the invention will be set forth in part in the description which follows, and will also become apparent in part to those skilled in the art upon studying the description, or may be learned by practice of the invention. The objects and other advantages of the invention can be realized and obtained by means of the structures specifically pointed out in the description and drawings.
[0022] Those skilled in the art will understand that the objectives and advantages achievable with the present invention are not limited to those specifically described above, and that the above and other objectives achievable with the present invention will become clearer from the following detailed description.
Attached Figure Description
[0023] The accompanying drawings, which are included to provide a further understanding of the invention and form part of this application, are not intended to limit the scope of the invention. In the drawings: Figure 1 is a schematic diagram illustrating the steps of a forest fire area extraction method based on Swin Transformer multi-temporal fusion in one embodiment of the present invention.
[0024] Figure 2 is a schematic diagram of the overall process of a forest fire area extraction method based on Swin Transformer multi-temporal fusion in one embodiment of the present invention.
[0025] Figure 3(a) is a schematic diagram of the Swin Transformer Block structure in one embodiment of the present invention.
[0026] Figure 3(b) is a schematic diagram of the structure of the Swin Transformer model in one embodiment of the present invention.
Detailed Implementation
[0027] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the embodiments and accompanying drawings. Here, the illustrative embodiments and descriptions of this invention are used to explain the invention, but are not intended to limit the invention.
[0028] It should also be noted that, in order to avoid obscuring the invention with unnecessary details, only the structures and / or processing steps closely related to the solution according to the invention are shown in the accompanying drawings, while other details that are not closely related to the invention are omitted.
[0029] It should be emphasized that the term "including / comprises" as used herein refers to the presence of a feature, element, step, or component, but does not exclude the presence or addition of one or more other features, elements, steps, or components.
[0030] It should also be noted that, unless otherwise specified, the term "connection" in this article can refer not only to a direct connection, but also to an indirect connection involving an intermediary.
[0031] In the following description, embodiments of the invention will be illustrated with reference to the accompanying drawings. In the drawings, the same reference numerals represent the same or similar parts, or the same or similar steps.
[0032] It should be emphasized here that the step markers mentioned below are not a limitation on the order of the steps, but should be understood as meaning that the steps can be executed in the order mentioned in the embodiments, or in a different order than in the embodiments, or several steps can be executed simultaneously.
[0033] To address the problems of severe spectral confusion, insufficient temporal registration accuracy, inadequate utilization of spatial context, and lack of multi-scale feature fusion in existing technologies, this invention provides a forest fire area extraction method based on Swin Transformer multi-temporal fusion. As shown in Figure 1, the method includes the following steps S101 to S104: Step S101: Acquire remote sensing image data for two time phases, before and after the disaster. Preprocess the remote sensing image data to obtain preprocessed pre-disaster images, post-disaster images, and a binary mask. Then, stitch the three together along the channel dimension to form three-channel input data.
[0034] Step S102: Input the three-channel input data into the pre-trained Swin Transformer model to extract multi-scale features of the fire. The Swin Transformer model includes multi-level window attention modules with different window sizes for hierarchical feature extraction from local texture to global spatial distribution, and uses a temporal attention module to fuse feature changes between pre-disaster and post-disaster images.
[0035] Step S103: Based on the feature map output by the Swin Transformer model, classify the fire area and generate a pixel-level fire probability map.
[0036] Step S104: Based on the fire probability map and the preset ground sampling distance, calculate the sub-pixel level forest fire damage area, evaluate the confidence level of the calculation results, and output the final fire area extraction result.
[0037] Figure 2 shows a schematic diagram of the overall process of the forest fire area extraction method based on Swin Transformer multi-temporal fusion.
[0038] In step S101, remote sensing image data from both pre-disaster and post-disaster phases are first acquired and preprocessed to obtain processed pre-disaster images, post-disaster images, and a binary mask. Then, the pre-disaster images, post-disaster images, and the binary mask are stitched together along the channel dimension to form three-channel input data.
[0039] Step S101 aims to acquire high-quality, pixel-aligned pre- and post-disaster remote sensing images and construct a standardized data cube containing original information, change information, and spatially relevant regions, providing accurate and consistent input for subsequent deep learning models.
[0040] In some embodiments, Sentinel-2 L1C level data before and after a forest fire in the target area are acquired from a satellite data platform. Typically, multiple spectral bands are selected, including visible light (such as B2, B3, B4), near-infrared (B8), and short-wave infrared (B11, B12), to fully utilize the differences in reflectance characteristics of vegetation and burned areas in different bands. The original spatial resolution of the short-wave infrared bands B11 and B12 is 20 m, so they need to be resampled to 10 m to ensure consistent spatial resolution across bands, laying the foundation for subsequent pixel-level alignment.
[0041] In some embodiments, preprocessing of remote sensing image data includes three parts: radiometric and atmospheric correction, geometric registration, and mask generation and data cube construction. Specifically: Radiometric and atmospheric correction involves radiometric calibration of the raw digital number (DN) values, converting them to top-of-atmosphere radiance. Subsequently, mature atmospheric correction algorithms such as Sen2Cor are used to eliminate the effects of atmospheric absorption, aerosol scattering, and Rayleigh scattering, ultimately yielding a surface reflectance image. This process transforms non-comparable DN values into reflectance values with clear physical meaning, eliminating interference from differences in illumination and atmospheric conditions, and is fundamental to ensuring the comparability of multi-temporal data.
[0042] In some embodiments, the conversion to top-of-atmosphere radiance is shown in formula (1):
$$L_{TOA} = DN \times S \quad (1)$$
where $L_{TOA}$ denotes the top-of-atmosphere radiance; $DN$ denotes the raw digital number recorded by the sensor; and $S$ denotes the scaling factor.
[0043] In some embodiments, radiance is converted to top-of-atmosphere reflectance, and the conversion formula is shown in formula (2):
$$\rho_{TOA} = \frac{\pi \cdot L_{TOA} \cdot d^{2}}{ESUN \cdot \cos\theta_{s}} \quad (2)$$
where $\rho_{TOA}$ denotes the top-of-atmosphere reflectance; $L_{TOA}$ denotes the top-of-atmosphere radiance; $d$ denotes the Earth-Sun distance correction factor; $ESUN$ denotes the mean solar spectral irradiance at the top of the atmosphere; and $\theta_{s}$ denotes the solar zenith angle.
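The two radiometric conversions can be sketched as follows. This is an illustration only: it assumes a purely multiplicative calibration (radiance = DN × scaling factor) and the standard top-of-atmosphere relation ρ = π·L·d² / (ESUN·cos θ). The function names and the numeric inputs are hypothetical and are not calibration constants of any sensor.

```python
import numpy as np

def dn_to_radiance(dn, scale):
    """Convert raw digital numbers to top-of-atmosphere radiance (assumed linear)."""
    return np.asarray(dn, dtype=np.float64) * scale

def toa_reflectance(radiance, d_au, esun, sun_zenith_deg):
    """Convert radiance to top-of-atmosphere reflectance.

    d_au           : Earth-Sun distance correction factor (astronomical units)
    esun           : mean solar spectral irradiance for the band
    sun_zenith_deg : solar zenith angle in degrees
    """
    cos_theta = np.cos(np.deg2rad(sun_zenith_deg))
    return np.pi * np.asarray(radiance, dtype=np.float64) * d_au ** 2 / (esun * cos_theta)

# Illustrative numbers only.
radiance = dn_to_radiance(np.array([[1000, 2000]]), scale=0.1)   # -> [[100, 200]]
rho = toa_reflectance(radiance, d_au=1.0, esun=1000.0, sun_zenith_deg=60.0)
```

Dividing by cos θ normalizes for the sun's elevation, which is what makes images acquired on different dates comparable before any atmospheric correction is applied.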
[0044] In some embodiments, the Sen2Cor algorithm is used to eliminate atmospheric effects and obtain the true surface reflectance. The algorithm is shown in formula (3):
$$\rho_{surf} = \rho_{TOA} - \Delta\rho_{atm} - \Delta\rho_{ray} \quad (3)$$
where $\rho_{surf}$ denotes the surface reflectance; $\Delta\rho_{atm}$ denotes the attenuation of the apparent reflectance caused by atmospheric absorption and scattering; and $\Delta\rho_{ray}$ denotes the attenuation of the apparent reflectance caused by Rayleigh scattering.
[0045] Precise geometric registration includes: To eliminate geometric misalignment between pre-disaster and post-disaster images caused by satellite attitude, orbital parameters, and terrain undulations, this invention designs a block-based precise registration strategy, including feature extraction and matching, mismatch removal, and block transformation and fusion. Specifically: First, the SIFT (Scale Invariant Feature Transform) algorithm is used to automatically extract and match feature points on pre-disaster and post-disaster images to obtain an initial set of matching points.
[0046] Then, the RANSAC (Random Sample Consensus) algorithm is used to filter the matching point pairs and remove mismatched points, yielding a high-precision set of corresponding points.
[0047] Finally, the image is divided into multiple regular grid blocks, such as 256×256 pixels. A local geometric transformation matrix (such as an affine transformation matrix) is calculated based on the corresponding points in each grid block. Different weights are assigned according to the feature point density in each block. The final full-image transformation matrix is generated by weighted averaging and fusion, thereby achieving sub-pixel level geometric registration accuracy better than 1 pixel.
[0048] In some embodiments, the weighted average fusion is as shown in formula (4):
$$T_{global} = \frac{\sum_{i} w_{i} T_{i}}{\sum_{i} w_{i}} \quad (4)$$
where $T_{global}$ denotes the full-image transformation matrix; $T_{i}$ denotes the transformation matrix of the $i$-th image block; and $w_{i}$ denotes the weight of the $i$-th image block, which is determined by the density of the feature point set within that block.
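The block transformation and fusion step can be sketched as below. The sketch assumes 2×3 affine matrices fitted per block by least squares and uses the number of matched points in each equally sized block as a stand-in for feature point density; both choices, and the function names, are assumptions made for illustration.

```python
import numpy as np

def estimate_affine(src_pts, dst_pts):
    """Least-squares 2x3 affine matrix mapping src points onto dst points."""
    src = np.asarray(src_pts, dtype=np.float64)
    dst = np.asarray(dst_pts, dtype=np.float64)
    design = np.hstack([src, np.ones((len(src), 1))])        # rows are [x, y, 1]
    params, *_ = np.linalg.lstsq(design, dst, rcond=None)     # shape (3, 2)
    return params.T                                           # shape (2, 3)

def fuse_block_transforms(block_matrices, point_counts):
    """Weighted average of per-block matrices, weights = matched-point counts."""
    mats = np.asarray(block_matrices, dtype=np.float64)       # (N, 2, 3)
    weights = np.asarray(point_counts, dtype=np.float64)      # (N,)
    if weights.sum() <= 0:
        raise ValueError("at least one block needs matched points")
    return np.tensordot(weights, mats, axes=1) / weights.sum()

# Block A: no misalignment, 30 matched points.
identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
# Block B: points shifted by (+2, -1) pixels, 10 matched points.
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
shifted = estimate_affine(src, src + np.array([2.0, -1.0]))
fused = fuse_block_transforms([identity, shifted], point_counts=[30, 10])
```

Because block A carries three times the weight of block B, the fused translation is pulled toward zero: (0.5, -0.25) pixels rather than the simple mean of (1.0, -0.5).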
[0049] Mask generation and data cube construction include: based on precisely registered pre-disaster and post-disaster images, a binary mask can be generated by combining prior knowledge (such as land use maps and forest vector boundaries) or by simple threshold segmentation. For example, the pixel value of the target area (such as forest) is 1, and the pixel value of the background area is 0.
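One way to realize the threshold-based option is sketched below: a differenced Normalized Burn Ratio (dNBR) is thresholded and intersected with a forest mask taken from prior knowledge. Using B8 as the near-infrared band and B12 as the short-wave infrared band, and the threshold value of 0.1, are assumptions for illustration rather than parameters stated in the patent.

```python
import numpy as np

def nbr(nir, swir2, eps=1e-6):
    """Normalized Burn Ratio: (NIR - SWIR2) / (NIR + SWIR2)."""
    nir = np.asarray(nir, dtype=np.float64)
    swir2 = np.asarray(swir2, dtype=np.float64)
    return (nir - swir2) / (nir + swir2 + eps)   # eps guards against division by zero

def candidate_mask(pre_nir, pre_swir2, post_nir, post_swir2,
                   forest_mask, dnbr_threshold=0.1):
    """1 where dNBR exceeds the threshold inside the forest region, else 0."""
    dnbr = nbr(pre_nir, pre_swir2) - nbr(post_nir, post_swir2)
    inside_forest = np.asarray(forest_mask) == 1
    return ((dnbr > dnbr_threshold) & inside_forest).astype(np.uint8)

# Two pixels: the first loses NIR and gains SWIR after the fire, the second is unchanged.
pre_nir, pre_swir2 = [[0.40, 0.40]], [[0.10, 0.10]]
post_nir, post_swir2 = [[0.15, 0.40]], [[0.30, 0.10]]
mask_all_forest = candidate_mask(pre_nir, pre_swir2, post_nir, post_swir2, [[1, 1]])
mask_no_forest = candidate_mask(pre_nir, pre_swir2, post_nir, post_swir2, [[0, 1]])
```

The forest mask acts as a hard gate, so a large dNBR outside the region of interest (for example over harvested cropland) never enters the candidate set.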
[0050] Finally, the pre-disaster images (containing k bands), post-disaster images (containing k bands), and binary masks, after the above correction and registration, are stitched together along the channel dimension to construct a three-dimensional data cube. This data cube has a shape of (height H, width W, number of channels 2k+1) and serves as the unified input for subsequent deep learning models.
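The concatenation can be sketched in a few lines. The function name `build_data_cube` and the choice of k = 6 bands are illustrative assumptions; the resulting channel count follows the (H, W, 2k+1) shape stated above.

```python
import numpy as np

def build_data_cube(pre, post, mask):
    """Stack pre-disaster bands, post-disaster bands and the mask along channels.

    pre, post : arrays of shape (H, W, k)
    mask      : array of shape (H, W) with values 0 or 1
    returns   : array of shape (H, W, 2k + 1)
    """
    pre = np.asarray(pre, dtype=np.float32)
    post = np.asarray(post, dtype=np.float32)
    mask = np.asarray(mask, dtype=np.float32)
    if pre.shape != post.shape or pre.shape[:2] != mask.shape:
        raise ValueError("pre, post and mask must be pixel-aligned")
    return np.concatenate([pre, post, mask[..., None]], axis=-1)

H, W, k = 64, 64, 6   # e.g. six bands: B2, B3, B4, B8, B11, B12
cube = build_data_cube(np.zeros((H, W, k)), np.ones((H, W, k)), np.ones((H, W)))
```

The shape check is the important part of the sketch: concatenation is only meaningful after registration has made the three inputs pixel-aligned.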
[0051] In step S102, a Swin Transformer model is pre-trained, enabling it to automatically learn and extract multi-level features, ranging from local details to global semantics, from the preprocessed three-channel input data to characterize forest fire damage. The core of this approach lies in effectively fusing pre- and post-disaster spatiotemporal change information through the model's hierarchical window attention mechanism and temporal attention module, thereby distinguishing between actual fire-affected areas and spectrally similar features.
[0052] Figure 3(a) shows a schematic diagram of the Swin Transformer Block structure. Figure 3(b) shows a schematic diagram of the Swin Transformer model in one embodiment of the present invention. First, a Swin Transformer network specifically designed for multi-temporal forest fire changes is constructed and trained. This process can be divided into two stages: model structure design and model training.
[0053] The Swin Transformer model includes multi-level window attention modules with different window sizes for hierarchical feature extraction from local texture to global spatial distribution.
[0054] Preferably, the multi-level window attention module consists of four consecutive stages, each stage containing a set of Swin Transformer Blocks. Its core innovation lies in the introduction of a multi-scale window partitioning strategy and a shifted window attention mechanism.
[0055] For the multi-scale window partitioning strategy, the input feature map is divided into non-overlapping windows of different sizes at different stages, such as 7×7, 14×14, 28×28, or 56×56 pixels. Small windows (such as 7×7) can focus on extracting subtle texture variations (such as burn marks on a single tree canopy), while large windows (such as 56×56) can capture a wider range of fire spread patterns and spatial context.
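Non-overlapping window partitioning can be sketched with a reshape and a transpose. The function name `window_partition` and the 56×56 single-channel test map are assumptions chosen so that every window size listed above divides the map evenly.

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping square windows.

    Returns an array of shape (num_windows, window_size, window_size, C),
    with windows ordered row by row.
    """
    H, W, C = x.shape
    if H % window_size or W % window_size:
        raise ValueError("H and W must be multiples of window_size")
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size, window_size, C)

feature = np.arange(56 * 56, dtype=np.float32).reshape(56, 56, 1)
windows_7 = window_partition(feature, 7)     # 64 small windows
windows_56 = window_partition(feature, 56)   # one window covering the whole map
```

Self-attention is then computed inside each window independently, so the window size directly sets how far apart two pixels can be and still attend to each other within a single layer.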
[0056] In some embodiments, the hierarchical feature maps are as follows:
- Feature_Stage1: (H / 4, W / 4, C), shallow texture features;
- Feature_Stage2: (H / 8, W / 8, 2C), mid-level object features;
- Feature_Stage3: (H / 16, W / 16, 4C), deep semantic features;
- Feature_Stage4: (H / 32, W / 32, 8C), global context features.
In the hierarchical feature maps, the initial feature dimension C is set to 96 based on the model configuration, and it doubles at each subsequent stage to 2C, 4C, and 8C. That is, Feature_Stage1 has a dimension of C=96, Feature_Stage2 has 2C=192, Feature_Stage3 has 4C=384, and Feature_Stage4 has 8C=768. This dimension design gradually enhances the expressive power of semantic features as model depth increases, balancing the efficiency and effectiveness of feature extraction.
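The stage shapes follow a simple rule (resolution halves and channels double at each stage), which the sketch below makes explicit. The function name `stage_shapes` and the 224×224 input size are illustrative assumptions.

```python
def stage_shapes(height, width, base_channels=96, num_stages=4):
    """Return (H_i, W_i, C_i) for each stage of the hierarchical backbone.

    Stage 1 downsamples by 4; every later stage halves the resolution
    and doubles the channel count.
    """
    shapes = []
    for stage in range(num_stages):
        stride = 4 * 2 ** stage
        shapes.append((height // stride, width // stride, base_channels * 2 ** stage))
    return shapes

shapes = stage_shapes(224, 224)
for index, shape in enumerate(shapes, start=1):
    print(f"Feature_Stage{index}: {shape}")
```

For a 224×224 input this gives (56, 56, 96), (28, 28, 192), (14, 14, 384) and (7, 7, 768).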
[0057] Alongside the shifted window attention mechanism, in order to specifically capture the changing features between pre-disaster and post-disaster images, this invention embeds a temporal attention module in a specific layer of the network (e.g., after Stage 2). This module uses the features of the pre-disaster images as the query matrix and the features of the post-disaster images as the key and value matrices, and calculates the change weights through scaled dot-product attention.
[0058] In some embodiments, the change weights are calculated as shown in formula (5):
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V \quad (5)$$
where $\mathrm{Attention}(Q, K, V)$ denotes the phase-change weight matrix; $\mathrm{softmax}$ denotes the normalized exponential function; $Q$ denotes the query matrix; $K$ denotes the key matrix; $T$ denotes the transpose; $d_{k}$ denotes the dimension of the key vector; and $V$ denotes the value matrix.
In the shifted window mechanism, the shift size is set to half of the window size, rounded down when the window size is odd. For example, the shift size of a 7×7 window is 3×3 and the shift size of a 14×14 window is 7×7. This design effectively avoids the feature fragmentation caused by window partitioning and improves the global feature capture capability.
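The shift rule can be illustrated with a cyclic roll of the feature map, which is how shifted windows are commonly implemented. Rounding the half-window down with integer division is an assumption that matches the 7 → 3 and 14 → 7 examples in the text; the function names are hypothetical.

```python
import numpy as np

def shift_size_for(window_size):
    """Half of the window size, rounded down (7 -> 3, 14 -> 7)."""
    return window_size // 2

def cyclic_shift(x, window_size):
    """Roll an (H, W, C) feature map up and left by the shift size.

    After the roll, a regular window partition straddles the borders
    of the previous layer's windows, letting information cross them.
    """
    s = shift_size_for(window_size)
    return np.roll(x, shift=(-s, -s), axis=(0, 1))

x = np.arange(49).reshape(7, 7, 1)
shifted = cyclic_shift(x, window_size=7)
# The element originally at row 3, column 3 is now at the top-left corner.
```

Rolling by the negative shift and later rolling back by the positive shift restores the original layout, so the operation adds no parameters and loses no pixels.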
[0059] The output of the temporal attention module is a feature map that enhances temporal change information, highlighting the transition region from "vegetation" to "fire / scorched earth".
[0060] In some embodiments, in order to comprehensively utilize the features extracted at different stages, which differ in semantic granularity and spatial resolution, the present invention integrates a feature pyramid network (FPN) on top of the Swin Transformer backbone network. Through this feature pyramid network, the feature maps of different levels output by the multi-level window attention modules are upsampled and fused.
[0061] Specifically, the bottom-up path of the feature pyramid network receives a sequence of feature maps with decreasing resolution output from multi-level window attention modules, and then, through a top-down path, transmits high-level semantic information from deep feature maps via upsampling operations and fuses it with the spatial detail information of shallow feature maps element-wise to generate a fire feature map that incorporates multi-scale semantic information.
[0062] That is, the high-semantic, low-resolution feature maps output from the deep stages are upsampled (e.g., by bilinear interpolation) to the same size as the feature maps from the shallow stages, and then added element-wise. For example, the features of Stage 4 are upsampled by a factor of 2 and added to the features of Stage 3, then the result is upsampled by a factor of 2 and added to the features of Stage 2, and so on. This ultimately generates a set of feature maps that fuse multi-scale information, providing rich feature representations for subsequent classification.
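The top-down fusion described above can be sketched as follows. Two simplifications are assumptions of this sketch rather than details from the text: nearest-neighbour upsampling stands in for bilinear interpolation, and each stage is first projected to a common channel count by a per-pixel matrix product (acting as a 1×1 lateral convolution) so that element-wise addition is well defined. The 64-channel width and random weights are hypothetical.

```python
import numpy as np

def upsample2x(feature_map):
    """Nearest-neighbour 2x upsampling of an (H, W, C) map."""
    return feature_map.repeat(2, axis=0).repeat(2, axis=1)

def top_down_fuse(stage_features, lateral_weights):
    """Fuse stage outputs ordered from shallow to deep.

    stage_features: list of (H_i, W_i, C_i) arrays.
    lateral_weights: list of (C_i, C_out) matrices used as 1x1 convolutions.
    Returns fused maps in the same shallow-to-deep order.
    """
    laterals = [f @ w for f, w in zip(stage_features, lateral_weights)]
    fused = [laterals[-1]]                      # start from the deepest stage
    for lateral in reversed(laterals[:-1]):
        fused.append(lateral + upsample2x(fused[-1]))
    return fused[::-1]

rng = np.random.default_rng(0)
shapes = [(56, 56, 96), (28, 28, 192), (14, 14, 384), (7, 7, 768)]
stage_features = [rng.normal(size=s) for s in shapes]
lateral_weights = [rng.normal(size=(s[2], 64)) * 0.01 for s in shapes]
fused = top_down_fuse(stage_features, lateral_weights)
print([f.shape for f in fused])
```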
[0063] In some embodiments, the training method for the Swin Transformer model includes: Sentinel-2 pre-disaster and post-disaster image pairs from historical forest fire cases are collected and paired with carefully delineated ground truth maps of the burned areas. The aforementioned preprocessing step S101 is performed on each image pair to construct a large number of training sample pairs.
[0064] Obtain the initial Swin Transformer model to be trained, with the model structure as described above.
[0065] The initial Swin Transformer model is trained using the training samples. A mask-guided weighted loss function is constructed. With this weighted loss function as the optimization objective, the network parameters are iteratively optimized using an optimizer such as Adam, and the trained Swin Transformer model is finally obtained.
[0066] During training, appropriate learning rates, batch sizes, and training epochs are set, and a validation set is used to monitor performance and prevent overfitting.
[0067] The mask-guided weighted loss function consists of two parts: weighted cross-entropy loss and boundary-aware loss.
[0068] In some embodiments, the weighted cross-entropy loss employs the basic pixel-level classification loss, but applies a higher penalty weight to prediction errors in the fire region (the region where mask M equals 1) of the ground truth label, as shown in Equation (6):
$$L_{WCE} = -\sum_{i}\left(1+\lambda M_{i}\right)\sum_{c} y_{i,c}\log p_{i,c} \quad (6)$$
where $M_{i}$ indicates the mask value at pixel $i$; $\lambda$ indicates the penalty weight; and $y_{i,c}$ and $p_{i,c}$ represent, for pixel $i$ and category $c$ (e.g., burned / unburned), the true label and the model's predicted probability, respectively.
[0069] An additional loss term, the boundary-aware loss, is introduced for the fire zone boundary. The boundary-aware loss takes the model's predictions, the ground truth labels, and a binary mask as input. By calculating the difference between the predicted and ground truth boundaries, it accurately optimizes the classification precision of the fire zone boundary, reduces area calculation errors caused by boundary ambiguity, and thus encourages the model to make clearer predictions in transitional areas between fire and non-fire zones.
[0070] Therefore, the total loss function is shown in equation (7):
$$L_{total} = L_{WCE} + \alpha L_{BA} \quad (7)$$
where $L_{total}$ represents the total loss function; $L_{WCE}$ represents the weighted cross-entropy loss; $\alpha$ is the weighting coefficient used to balance the two losses; and $L_{BA}$ represents the boundary-aware loss.
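A minimal sketch of how the two loss terms could be combined is given below. The exact weighting form is not recoverable from the text, so the sketch assumes that pixels inside the fire mask are weighted by (1 + penalty) and all others by 1. The boundary-aware term is not specified in detail here, so it is passed in as a precomputed number. All names and numeric values are hypothetical.

```python
import numpy as np

def weighted_cross_entropy(probs, labels, mask, penalty=2.0, eps=1e-7):
    """Mask-weighted pixel-level cross-entropy, summed over pixels.

    probs, labels: (H, W, K) predicted probabilities and one-hot labels.
    mask: (H, W) binary fire mask M.
    Weighting by (1 + penalty * M) is an assumption of this sketch.
    """
    pixel_ce = -(labels * np.log(np.clip(probs, eps, 1.0))).sum(axis=-1)
    weights = 1.0 + penalty * mask
    return float((weights * pixel_ce).sum())

def total_loss(wce, boundary, balance=0.5):
    """Weighted cross-entropy plus a balanced boundary-aware term."""
    return wce + balance * boundary

# Toy 2 x 2 example with two categories and an uninformative prediction.
probs = np.full((2, 2, 2), 0.5)
labels = np.zeros((2, 2, 2))
labels[..., 0] = 1.0          # unburned everywhere ...
labels[0, 0] = [0.0, 1.0]     # ... except one burned pixel
mask = np.array([[1.0, 0.0], [0.0, 0.0]])
wce = weighted_cross_entropy(probs, labels, mask, penalty=2.0)
print(wce, total_loss(wce, boundary=0.2, balance=0.5))
```

In this toy case each pixel contributes ln 2, and the masked pixel counts three times, so the weighted sum is 6 ln 2.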
[0071] After the Swin Transformer model is trained, its parameters are fixed for use in the inference phase. In step S102, the three-channel input data of the region to be predicted, output from step S101, is input into the Swin Transformer model. The model propagates forward, sequentially passing through a hierarchical window attention module, a temporal attention module, and a feature pyramid fusion module, ultimately outputting a feature map that fuses multi-scale spatiotemporal variation information. These feature maps serve as the multi-scale features of the fire and are passed to the next stage for classification.
[0072] In step S103, the fire area is classified according to the feature map output by the Swin Transformer model to generate a pixel-level fire probability map.
[0073] Step S103 aims to utilize the fire features extracted in step S102, rich in multi-scale spatiotemporal contextual information, to make precise classification decisions for each pixel of the input image, determining whether it belongs to a forest fire-damaged area. The core output of this step is not a simple binary label, but a continuous, pixel-level fire probability map, which provides a crucial data foundation for subsequent sub-pixel-level area calculations and high-precision assessments.
[0074] To achieve refined classification of fire areas, this step constructs a lightweight yet efficient classification decoder on top of the multi-scale fusion feature map extracted by the Swin Transformer model.
[0075] In some embodiments, the fire zone is classified based on the feature map output by the Swin Transformer model, specifically including: Receive fused feature maps with different resolutions from the output of the Feature Pyramid Network (FPN). For example, these fused feature maps are denoted as {FPN1, FPN2, FPN3, FPN4}. Among them, FPN1 has the highest spatial resolution and contains rich detailed information; FPN4 has the strongest semantic information and contains global context.
[0076] First, all FPN feature maps at all levels are uniformly adjusted to the same number of channels through convolutional layers, and then the spatial resolution of all feature maps is unified to the same size as the original input image, or 1 / 4, 1 / 2, etc. of the original input, depending on the downsampling factor of the network design.
[0077] Then, the multi-level feature maps, now of uniform size, are concatenated along the channel dimension to form a comprehensive feature tensor that integrates detail and semantic information.
[0078] Finally, one or more convolutional layers (usually ending with a 1×1 convolution) are applied to the concatenated feature tensor to map the number of channels to the number of target categories. For example, two categories: fire / non-fire; or multiple categories: no fire, minor fire, severe fire, etc. The output of the last convolutional layer is usually called logits.
[0079] In some embodiments, generating a pixel-level fire probability map includes: The logits map output by the classifier is input into a Softmax function, which normalizes it along the channel dimension (category dimension). The Softmax function converts the logits value of each pixel across all categories into a probability distribution, ensuring that the sum of the probabilities of each pixel across all categories is 1.
[0080] By extracting the channels corresponding to the fire category, the required pixel-level fire probability map can be obtained. The range of each pixel value is between [0, 1], which intuitively represents the probability of forest fire damage at that location.
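The decoding path from fused feature maps to a fire probability map can be sketched as follows. The sketch assumes the feature maps have already been resized to a common spatial size, uses a single per-pixel matrix product in place of the final 1×1 convolution, and treats channel 1 as the fire category. The sizes and random weights are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable normalized exponential along one axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def fire_probability_map(fused_maps, head_weights, fire_channel=1):
    """Concatenate features, map them to class logits, apply Softmax.

    fused_maps: list of (H, W, C) arrays with identical H and W.
    head_weights: (total channels, K) matrix standing in for a 1x1 conv.
    """
    stacked = np.concatenate(fused_maps, axis=-1)  # (H, W, total channels)
    logits = stacked @ head_weights                # (H, W, K)
    probs = softmax(logits, axis=-1)
    return probs[..., fire_channel], probs

rng = np.random.default_rng(0)
fused_maps = [rng.normal(size=(8, 8, 16)) for _ in range(4)]
head_weights = rng.normal(size=(64, 2))            # fire / non-fire
fire_prob, probs = fire_probability_map(fused_maps, head_weights)
print(fire_prob.shape, float(fire_prob.min()), float(fire_prob.max()))
```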
[0081] In some embodiments, during the model training phase (described in S102), the classifier in step S103 and the Swin Transformer model are jointly trained end-to-end. The inputs to the mask-guided weighted loss function (including the weighted cross-entropy loss and the boundary-aware loss) used during training are the predicted probability map generated in this step and the ground truth label map. Through the backpropagation algorithm, the loss function optimizes not only the parameters of the classifier but also all the parameters of the Swin Transformer feature extractor, enabling the entire network to learn the feature representation and classification boundary most conducive to distinguishing fire areas.
[0082] In step S104, based on the fire probability map and the preset ground sampling distance, the sub-pixel level forest fire damage area is calculated, the confidence level of the calculation results is evaluated, and the final fire area extraction result is output.
[0083] The core objective of step S104 is to transform the pixel-level fire probability map generated in step S103 into an accurate and quantitative estimate of the forest fire damage area, and to quantitatively evaluate the reliability of this estimate. This step directly overcomes the problem of coarsened area statistics caused by hard classification (either 0 or 1) in traditional methods, and provides an objective measure of the reliability of the results, thereby outputting a final area extraction result that is both highly accurate and highly reliable.
[0084] In some embodiments, the sub-pixel level of forest fire damage area is calculated based on a fire probability map and a preset ground sampling distance, including: Traditional methods typically involve simply counting pixels in the classification result (binary image) and multiplying it by the area of each individual pixel. This approach ignores the mixing of features within a pixel and introduces significant errors at boundaries and in fragmented areas. Therefore, this invention proposes a sub-pixel-level area integration method based on probabilistic maps, achieving more refined area statistics.
[0085] The potential burned area of all pixels in the entire fire probability map is summed and integrated. The actual ground area represented by each pixel is determined by its ground sampling distance (GSD). GSD is a key parameter of remote sensing imagery, representing the actual ground size corresponding to one pixel in the image. For example, the GSD of the Sentinel-2 visible and near-infrared bands (B2-B4 and B8) is 10 meters, meaning that one pixel represents a 10 m × 10 m area on the ground, whereas the shortwave infrared band B12 is acquired at 20 meters.
[0086] In some embodiments, the area is calculated as shown in formula (8):
$$A_{fire} = \sum_{(i,j)} P(i,j)\cdot w(i,j)\cdot GSD^{2} \quad (8)$$
where $A_{fire}$ indicates the area damaged by forest fires; $P(i,j)$ indicates the fire category probability value of the pixel at position $(i,j)$; $w(i,j)$ indicates the pixel area weight; and $GSD$ represents the ground sampling distance. The pixel area weight is used to correct the impact of terrain undulation on the actual area of the pixel. It is set to 1.0 in flat areas and adaptively adjusted according to the slope value in areas with large terrain slopes to ensure accurate calculation of the area contribution of each pixel. Combined with the ground sampling distance (GSD = 10 m), the baseline area of a single pixel is 10 m × 10 m = 100 m².
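The probability-weighted area integration can be illustrated with a short sketch. It assumes flat terrain by default (pixel area weight 1.0) and a 10 m ground sampling distance; the 2×2 probability map is a made-up example, and the slope-based weighting is left to the caller because its exact form is not given here.

```python
import numpy as np

def burned_area_m2(prob_map, gsd_m=10.0, area_weight=None):
    """Sum probability x area weight x GSD^2 over all pixels (square meters)."""
    if area_weight is None:
        area_weight = np.ones_like(prob_map)   # flat terrain assumption
    return float((prob_map * area_weight).sum() * gsd_m ** 2)

# Hypothetical 2 x 2 fire probability map.
prob_map = np.array([[1.0, 0.5],
                     [0.25, 0.0]])
area = burned_area_m2(prob_map)
hard_area = float((prob_map >= 0.5).sum() * 10.0 ** 2)  # pixel counting
print(area, hard_area)
```

On this toy map the sub-pixel integration gives 175 m², whereas counting pixels at a 0.5 threshold gives 200 m², which shows how hard classification coarsens the estimate at mixed pixels.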
[0087] In some embodiments, a confidence assessment of the calculation results includes: Because model predictions are inherently uncertain (especially in challenging areas such as class boundaries, shadows, and cloud cover), directly integrating the area using the probabilities of all pixels may introduce errors due to low-confidence predictions. Therefore, this invention performs a confidence assessment on the area calculation results.
[0088] Information entropy is used to quantify the uncertainty of the model's classification decision for each pixel. Higher information entropy indicates greater uncertainty in the model's classification of that pixel. An entropy threshold is pre-set, and pixels with entropy values below the threshold are classified as high-confidence regions, indicating that the model's classification of these pixels is reliable. A binary confidence mask is then generated accordingly.
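The entropy-based confidence mask can be sketched as follows. The natural logarithm and the threshold of 0.5 are assumptions made for illustration only, since neither the logarithm base nor the threshold value is specified here.

```python
import numpy as np

def confidence_mask(probs, entropy_threshold, eps=1e-12):
    """Per-pixel information entropy and the resulting binary mask.

    probs: (H, W, K) class probabilities that sum to 1 per pixel.
    Pixels with entropy below the threshold are marked 1 (high confidence).
    """
    clipped = np.clip(probs, eps, 1.0)         # avoid log(0)
    entropy = -(probs * np.log(clipped)).sum(axis=-1)
    mask = (entropy < entropy_threshold).astype(np.uint8)
    return entropy, mask

# Hypothetical 2 x 2 map with two categories (non-fire, fire).
probs = np.array([[[0.99, 0.01], [0.5, 0.5]],
                  [[0.05, 0.95], [0.6, 0.4]]])
entropy, mask = confidence_mask(probs, entropy_threshold=0.5)
print(np.round(entropy, 3))
print(mask)
```

Confident pixels (0.99 / 0.01 and 0.05 / 0.95) fall below the threshold and are kept, while near-even splits are excluded.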
[0089] The original area calculation results are corrected or weighted based on the binary confidence mask to obtain a reliable area estimate, as shown in formula (9):
$$A_{reliable} = A_{fire}\times\frac{\sum C}{N_{C}} \quad (9)$$
where $A_{reliable}$ represents the reliable area estimate; $A_{fire}$ indicates the area damaged by forest fires; $\sum C$ represents the sum of all pixel values in the binary confidence mask $C$; and $N_{C}$ represents the total number of pixels in the binary confidence mask $C$.
[0090] The significance of formula (9) is that it uses the area proportion of high-confidence regions to weight the total area estimate. If the entire region is well-defined (proportion close to 1), the reliable area is close to the original area; if there are a large number of uncertain regions (proportion small), the reliable area will shrink accordingly, thus giving a more conservative but more reliable estimate.
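The weighting in formula (9) amounts to scaling the area by the fraction of high-confidence pixels. The sketch below illustrates this with made-up numbers; the 100-hectare area and the 2×2 mask are hypothetical.

```python
import numpy as np

def reliable_area(area, confidence_mask):
    """Scale the area by the proportion of high-confidence pixels."""
    mask = np.asarray(confidence_mask)
    return float(area * mask.sum() / mask.size)

mask = np.array([[1, 1],
                 [1, 0]])             # 3 of 4 pixels are high-confidence
print(reliable_area(100.0, mask))     # 75.0
```

When every pixel is high-confidence the estimate is unchanged, and it shrinks in proportion as the uncertain share grows.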
[0091] Corresponding to the above method, the present invention also provides a forest fire area extraction system based on Swin Transformer multi-temporal fusion, the system comprising: The data preprocessing module is used to acquire remote sensing image data from two time phases: before and after the disaster. It performs radiometric correction, atmospheric correction, and geometric registration on the remote sensing image data to generate pixel-level aligned pre-disaster images, post-disaster images, and binary masks. The three are then stitched together along the channel dimension to form three-channel input data.
[0092] The fire feature extraction module includes a pre-trained Swin Transformer model, which receives three-channel input data and uses its multi-level window attention mechanism and temporal attention module to extract and fuse multi-scale fire change features from local to global.
[0093] The area calculation and evaluation module is used to perform pixel-level classification based on the feature map output by the fire feature extraction module, generate a fire probability map, calculate the sub-pixel level fire damage area based on the fire probability map and the preset ground sampling distance, evaluate the confidence of the calculation results, and output the final fire area extraction result.
[0094] The present invention will be further described below with reference to a specific embodiment.
[0095] In this embodiment, a forest fire occurred in a forest area, with a burned area of approximately 2,000 hectares. Sentinel-2 L1C data from August 15, 2023 (before the disaster) and August 25, 2023 (after the disaster) were used, with spectral bands including B2-B4 (visible light), B8 (near infrared), and B12 (shortwave infrared) and a spatial resolution of 10-20m.
[0096] Pre-disaster and post-disaster data are shown in Table 1.
[0097] Table 1 Test Data Table
The training parameters of the Swin Transformer model are shown in Table 2.
[0098] Table 2 Model Training Parameters
The forest fire area extraction method based on Swin Transformer multi-temporal fusion provided by this invention, a traditional convolutional neural network (CNN), and the spectral index method were each used for area extraction. The accuracy evaluation results are shown in Table 3.
[0099] Table 3 Accuracy Evaluation Results
The area extraction results of the method of the present invention include the fire-damaged area, the degree of burning, and the spatial distribution accuracy.
[0100] The actual fire area was 1987.5 hectares, while the fire area extracted by the method of this invention was 1825.3 hectares, with an error of 8.2%. The fire area extracted by the traditional method was 2156.8 hectares, with an error of 17.3%.
[0101] The statistics are classified according to the degree of burning, as shown in Table 4. The spatial distribution accuracy assessment covers three core dimensions: first, boundary positioning accuracy, i.e., the deviation between the extracted fire area boundary and the actual boundary, for which the method of this invention achieves ±1.2 pixels (12 meters); second, the small-area fire point detection rate, which reaches 89.3% for fire points with an area greater than 0.5 ha; and third, the fragmentation index, which measures the integrity of the extraction results. The fragmentation index of this method is 0.23, which is lower than the ideal threshold of 0.3, indicating that the extraction results have good continuity.
[0102] Corresponding to the above method, the present invention also provides an electronic device including a computer device, the computer device including a processor and a memory, the memory storing computer instructions, the processor executing the computer instructions stored in the memory, and when the computer instructions are executed by the processor, the electronic device performs the steps of the method as described above.
[0103] This invention also provides a computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the steps of the aforementioned method. The computer-readable storage medium may be a tangible storage medium, such as random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, register, floppy disk, hard disk, removable storage disk, CD-ROM, or any other form of storage medium known in the art.
[0104] Those skilled in the art will understand that the exemplary components, systems, and methods described in conjunction with the embodiments disclosed herein can be implemented in hardware, software, or a combination of both. Whether implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this invention. When implemented in hardware, it can be, for example, electronic circuits, application-specific integrated circuits (ASICs), appropriate firmware, plug-ins, function cards, etc. When implemented in software, the elements of this invention are programs or code segments used to perform the desired tasks. The programs or code segments can be stored in a machine-readable medium or transmitted over a transmission medium or communication link via data signals carried in a carrier wave.
[0105] It should be clarified that the present invention is not limited to the specific configurations and processes described above and shown in the figures. For the sake of brevity, detailed descriptions of known methods are omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method process of the present invention is not limited to the specific steps described and shown. Those skilled in the art can make various changes, modifications, and additions, or change the order of steps, after understanding the spirit of the present invention.
[0106] In this invention, features described and / or illustrated for one embodiment may be used in the same or similar manner in one or more other embodiments, and / or combined with or in place of features of other embodiments.
[0107] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, various modifications and variations of the embodiments of the present invention are possible. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A method for extracting forest fire area based on Swin Transformer multi-temporal fusion, characterized in that, The method includes: Remote sensing image data from two time phases, before and after the disaster, are acquired. The remote sensing image data is preprocessed to obtain preprocessed pre-disaster images, post-disaster images, and a binary mask. The three images are then stitched together along the channel dimension to form three-channel input data. The three-channel input data is input into a pre-trained Swin Transformer model to extract multi-scale features of the fire. The Swin Transformer model includes a multi-level window attention module with different window sizes, which is used to extract hierarchical features from local texture to global spatial distribution. The model also uses a temporal attention module to fuse the feature changes of the pre-disaster and post-disaster images. Based on the feature map output by the Swin Transformer model, the fire area is classified to generate a pixel-level fire probability map. Based on the fire probability map and the preset ground sampling distance, the sub-pixel level forest fire damage area is calculated, and the confidence level of the calculation results is evaluated to output the final fire area extraction result.
2. The forest fire area extraction method based on Swin Transformer multi-temporal fusion according to claim 1, characterized in that, The remote sensing image data is preprocessed to obtain preprocessed pre-disaster images, post-disaster images, and a binary mask, including: Radiometric and atmospheric corrections are applied to the pre-disaster and post-disaster remote sensing images, respectively, to obtain surface reflectance images; wherein the radiometric correction satisfies the following formula:
$$\rho_{TOA} = \frac{\pi L_{\lambda} d^{2}}{ESUN_{\lambda}\cos\theta_{s}}$$
where $\rho_{TOA}$ indicates the top-of-atmosphere reflectance; $L_{\lambda}$ indicates the top-of-atmosphere radiance; $d$ represents the Earth-Sun distance correction factor; $ESUN_{\lambda}$ indicates the solar spectral irradiance; and $\theta_{s}$ indicates the solar zenith angle; The atmospheric correction uses the Sen2Cor algorithm and satisfies the following formula:
$$\rho_{s} = \rho_{TOA} - \Delta\rho_{atm} - \Delta\rho_{R}$$
where $\rho_{s}$ indicates the surface reflectance; $\Delta\rho_{atm}$ represents the attenuation of apparent reflectance caused by atmospheric absorption and scattering; and $\Delta\rho_{R}$ represents the attenuation of apparent reflectance caused by Rayleigh scattering; Geometric registration is performed on the corrected pre-disaster and post-disaster images to achieve sub-pixel-level pixel alignment. Based on the registered pre-disaster and post-disaster images, the normalized combustion index is calculated, and combined with the preset areas of interest, a binary mask is generated to mark potential fire areas.
3. The forest fire area extraction method based on Swin Transformer multi-temporal fusion according to claim 2, characterized in that, Geometric registration of the corrected pre-disaster and post-disaster images, including: The SIFT algorithm is used to extract matching feature point pairs between the pre-disaster and post-disaster images. The RANSAC algorithm is used to filter the feature point pairs to obtain the filtered feature point set; Based on the feature point set, a block registration model is used to calculate local geometric transformation parameters, including: dividing the image into multiple regular grid blocks; for each image block, calculating the local geometric transformation matrix based on the feature points falling within the block; and performing a weighted average of the local transformation matrices of all image blocks to fuse them into a full-image transformation matrix; wherein the weighted averaging process satisfies the following formula:
$$T = \sum_{k} w_{k} T_{k}$$
where $T$ represents the full-image transformation matrix; $T_{k}$ indicates the transformation matrix of the $k$-th image block; and $w_{k}$ indicates the weight of the $k$-th image block.
4. The forest fire area extraction method based on Swin Transformer multi-temporal fusion according to claim 1, characterized in that, The multi-level window attention module in the Swin Transformer model has multiple different window sizes, including 7×7, 14×14, 28×28 and 56×56 pixels, and adopts a shift window mechanism, where the shift size is half of the window size.
5. The forest fire area extraction method based on Swin Transformer multi-temporal fusion according to claim 1, characterized in that, The Swin Transformer model also includes a feature pyramid network, which, after extracting the multi-scale features of the fire, further includes: The feature pyramid network upsamples and fuses the feature maps at different levels output by the multi-level window attention module. The bottom-up path of the feature pyramid network receives the feature map sequence with decreasing resolution output by the multi-level window attention module, and the top-down path transmits the high-level semantic information from the deep feature map through upsampling and fuses it with the spatial detail information of the shallow feature map element by element to generate a fire feature map that integrates multi-scale semantic information.
6. The forest fire area extraction method based on Swin Transformer multi-temporal fusion according to claim 1, characterized in that, The feature changes between the pre-disaster and post-disaster images are fused using a temporal attention module, including: The features of the pre-disaster images are used as the query matrix, and the features of the post-disaster images are used as the key matrix and value matrix. The phase-change weight matrix is calculated by scaled dot-product attention, as expressed by the formula:
$$W = \mathrm{Softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V$$
where $W$ represents the phase-change weight matrix; $\mathrm{Softmax}$ represents the normalized exponential function; $Q$ represents the query matrix; $K$ represents the key matrix; $T$ indicates transpose; $d_k$ indicates the dimension of the key vector; and $V$ represents the value matrix; The value matrix is weighted according to the phase-change weight matrix to output the enhanced phase-change features.
7. The forest fire area extraction method based on Swin Transformer multi-temporal fusion according to claim 1, characterized in that, Calculating the sub-pixel level of forest fire damage includes: Based on the fire category probability value corresponding to each pixel in the fire probability map and the actual ground area represented by each pixel, a weighted integral is calculated and summed, as expressed by the formula:
$$A_{fire} = \sum_{(i,j)} P(i,j)\cdot w(i,j)\cdot GSD^{2}$$
where $A_{fire}$ indicates the area damaged by forest fires; $P(i,j)$ indicates the fire category probability value of the pixel at position $(i,j)$; $w(i,j)$ indicates the pixel area weight; and $GSD$ indicates the ground sampling distance.
8. The forest fire area extraction method based on Swin Transformer multi-temporal fusion according to claim 1, characterized in that, The confidence level of the calculation results is assessed, including: The entropy value of the fire probability map is calculated based on the probability distribution of each pixel to which it belongs, so as to quantify the uncertainty of the model's classification decision for each pixel. Pixel regions with entropy values below a preset threshold are identified as high-confidence regions, and corresponding binary confidence masks are generated. Based on the binary confidence mask, the reliability of the calculated forest fire damage area is assessed to obtain a reliable area estimate; the reliable area estimate is calculated according to the following formula:
$$A_{reliable} = A_{fire}\times\frac{\sum C}{N_{C}}$$
where $A_{reliable}$ represents the reliable area estimate; $A_{fire}$ indicates the area damaged by forest fires; $\sum C$ represents the sum of all pixel values in the binary confidence mask $C$; and $N_{C}$ represents the total number of pixels in the binary confidence mask $C$.
9. A forest fire area extraction system based on Swin Transformer multi-temporal fusion, characterized in that, The system includes: The data preprocessing module is used to acquire remote sensing image data from two time phases: before and after the disaster. It performs radiometric correction, atmospheric correction, and geometric registration on the remote sensing image data to generate pixel-level aligned pre-disaster images, post-disaster images, and binary masks. The three are then stitched together along the channel dimension to form three-channel input data. The fire feature extraction module includes a pre-trained Swin Transformer model, which receives the three-channel input data and uses its multi-level window attention mechanism and temporal attention module to extract and fuse multi-scale fire change features from local to global. The area calculation and evaluation module is used to perform pixel-level classification based on the feature map output by the fire feature extraction module, generate a fire probability map, calculate the sub-pixel level fire damage area based on the fire probability map and a preset ground sampling distance, evaluate the confidence level of the calculation results, and output the final fire area extraction result.
10. A computer-readable storage medium having a computer program / instructions stored thereon, characterized in that, When the computer program / instructions are executed by the processor, they implement the steps of the method as described in any one of claims 1 to 8.