Training method of image extraction model, image extraction method and device
An image extraction model comprising an encoder network, a difference extraction network, and a fusion decoding network addresses the high cost and the false positives and false negatives of traditional landslide detection methods, enabling efficient and accurate extraction of landslide feature information.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- AEROSPACE INFORMATION RES INST CAS
- Filing Date
- 2024-04-01
- Publication Date
- 2026-06-23
AI Technical Summary
Traditional landslide detection methods rely on manual visual interpretation, which is costly and has a high probability of false positives and false negatives, making it difficult to accurately extract landslide feature information.
An image extraction model including an encoder network, a difference extraction network, and a fusion decoding network is adopted. A base image and a reference image are acquired and encoded, the difference extraction network extracts a difference feature map from the encoded feature maps, and multi-scale fusion and decoding generate the image segmentation result used to train the model.
It improves the accuracy of landslide detection, reduces the risk of false positives and false negatives, and enhances the ability to detect landslide changes.
Figure CN118212487B_ABST
Description
Technical Field
[0001] This disclosure relates to the fields of image processing and pattern recognition, and more specifically, to a training method for an image extraction model, an image extraction method, a training device for the image extraction model, and an image extraction device.
Background Technology
[0002] With the development of remote sensing technology, landslide prediction based on remote sensing images has become an important direction. By analyzing remote sensing images from different time periods, feature information about landslides can be effectively extracted, enabling early warning of landslides. Traditional detection methods rely on manual visual interpretation, which has obvious drawbacks, requiring significant manpower and time. Using machine learning methods such as random forests to extract landslides saves considerable manpower and resources, and the cost of manual annotation is relatively low, making it a commonly used landslide extraction method. However, these methods have a high probability of mis-extraction and are highly dependent on prior knowledge of manually selected features.
Summary of the Invention
[0003] In view of the above problems, this disclosure provides a training method for an image extraction model, an image extraction method, a training device for an image extraction model, and an image extraction device.
[0004] According to a first aspect of this disclosure, a method for training an image extraction model is provided, wherein the image extraction model includes an encoder network, a difference extraction network, and a fusion decoding network, the method comprising: acquiring a base image and a reference image, wherein the acquisition time of the base image is earlier than the acquisition time of the reference image, and the base image and the reference image represent a target geographic environment region; inputting the base image into a first encoding network and outputting a first encoded feature map; inputting the reference image into a second encoding network and outputting a second encoded feature map, wherein the encoder network includes the first encoding network and the second encoding network; using the difference extraction network to extract difference features from the first encoded feature map and the second encoded feature map to obtain a difference feature map; inputting the difference feature map into the fusion decoding network to obtain an image segmentation result corresponding to the reference image, the image segmentation result representing the geological change attributes of the target geographic environment region; and training the image extraction model based on the image segmentation result and the label data corresponding to the image segmentation result to obtain a trained image extraction model.
[0005] According to embodiments of this disclosure, a difference extraction network is used to extract difference features from a first encoded feature map and a second encoded feature map to obtain a difference feature map. This includes: obtaining a preliminary difference feature map based on the first encoded feature map and the second encoded feature map; inputting the preliminary difference feature map into a difference extraction layer to output an intermediate difference feature map; obtaining a first intermediate feature map based on the first encoded feature map and the intermediate difference feature map; obtaining a second intermediate feature map based on the second encoded feature map and the intermediate difference feature map; and obtaining a difference extraction feature map based on the first intermediate feature map, the second intermediate feature map, and the intermediate difference feature map.
[0006] According to embodiments of this disclosure, inputting a preliminary difference feature map into a difference extraction layer and outputting an intermediate difference feature map includes: inputting the preliminary difference feature map into a convolutional layer to obtain a first feature map; processing the first feature map using a first convolutional unit to obtain a second feature map; processing the first feature map using a second convolutional unit to obtain a third feature map, wherein the first convolutional kernel of the first convolutional unit is different from the second convolutional kernel of the second convolutional unit; inputting the difference feature map into an attention unit to obtain an attention feature map, wherein the difference extraction layer includes a first convolutional unit, a second convolutional unit, and an attention unit; and fusing the second feature map, the third feature map, and the attention feature map to obtain an intermediate difference feature map.
[0007] According to embodiments of this disclosure, inputting a difference feature map into a fusion decoding network to obtain an image segmentation result corresponding to a reference image includes: fusing multi-level difference feature maps using a multi-scale fusion layer to obtain a fused feature map; and decoding the fused feature map using a decoding layer to obtain an image segmentation result, wherein the fusion decoding network includes a multi-scale fusion layer and a decoding layer.
[0008] According to embodiments of this disclosure, a multi-scale fusion layer is used to fuse multi-level difference feature maps to obtain a fused feature map, including: adding multi-level difference feature maps based on an attention mechanism to obtain an attention fusion feature map; connecting multi-level difference feature maps along the channel dimension to obtain a channel feature map; processing the channel feature map using an attention mechanism to obtain an attention channel feature map; obtaining an intermediate fusion feature map based on the attention fusion feature map, the channel feature map, and the attention channel feature map; and processing the intermediate fusion feature map using a pooling layer and adding it to the multi-level difference feature maps respectively to obtain the fusion feature map of the corresponding level.
[0009] According to embodiments of this disclosure, decoding the fused feature map using a decoding layer to obtain an image segmentation result includes: upsampling the i-th fused feature map to obtain the i-th upsampled fused feature map; concatenating the (i-1)-th fused feature map and the i-th upsampled fused feature map along the channel dimension to obtain the i-th intermediate fused feature map; inputting the i-th intermediate fused feature map into a convolutional pooling sub-layer to output the i-th decoded feature map; and performing image segmentation on the i-th decoded feature map to obtain an image segmentation result.
[0010] According to embodiments of this disclosure, inputting the i-th intermediate fused feature map into a convolutional pooling sub-layer and outputting the i-th decoded feature map includes: inputting the i-th intermediate fused feature map into a third convolutional unit to obtain the i-th intermediate convolutional fused feature map; inputting the i-th intermediate convolutional fused feature map into a pooling convolutional unit to obtain a pooling feature map; inputting the i-th intermediate convolutional fused feature map into a fourth convolutional unit to obtain a convolutional feature map; and obtaining the i-th decoded feature map based on the i-th intermediate convolutional fused feature map, the pooling feature map, and the convolutional feature map.
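The upsampling and channel-wise concatenation at the start of the decoding step can be sketched as follows. This is only a minimal illustration, not the disclosed implementation: it assumes that the i-th fused feature map has half the spatial resolution of the (i-1)-th one, that nearest-neighbor upsampling is used, and that feature maps are NumPy arrays laid out as (channels, height, width). The function names upsample_nearest and intermediate_fused are hypothetical.

```python
import numpy as np

def upsample_nearest(x, factor=2):
    """Nearest-neighbor upsampling of a (C, H, W) map (assumed upsampling mode)."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def intermediate_fused(fused_prev, fused_i, factor=2):
    """Upsample the i-th fused map and concatenate it with the (i-1)-th map
    along the channel dimension, giving the i-th intermediate fused map."""
    upsampled = upsample_nearest(fused_i, factor)
    if upsampled.shape[1:] != fused_prev.shape[1:]:
        raise ValueError("spatial sizes must match after upsampling")
    return np.concatenate([fused_prev, upsampled], axis=0)

# Hypothetical shapes: level i-1 is 16 x 8 x 8, level i is 32 x 4 x 4.
fused_prev = np.zeros((16, 8, 8))
fused_i = np.ones((32, 4, 4))
merged = intermediate_fused(fused_prev, fused_i)
print(merged.shape)
```

The concatenated map would then be passed to the convolutional pooling sub-layer, which is not modeled here.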
[0011] A second aspect of this disclosure provides an image extraction method, comprising:
[0012] Acquiring a first time-phase image and a second time-phase image, wherein the acquisition time of the first time-phase image is earlier than the acquisition time of the second time-phase image; and inputting the first time-phase image and the second time-phase image into the trained image extraction model to obtain the image segmentation result corresponding to the second time-phase image.
[0013] A third aspect of this disclosure provides a training apparatus for an image extraction model, comprising:
[0014] A first acquisition module is used to acquire a reference image and a base image, wherein the base image is acquired earlier than the reference image, and the base image and reference image represent the target geographic environment region; a first encoding module is used to input the base image into a first encoding network and output a first encoded feature map; a second encoding module is used to input the reference image into a second encoding network and output a second encoded feature map, wherein the encoder network includes a first encoding network and a second encoding network; a difference extraction module is used to extract difference features from the first encoded feature map and the second encoded feature map using a difference extraction network to obtain a difference feature map; a fusion decoding module is used to input the difference feature map into a fusion decoding network to obtain an image segmentation result corresponding to the reference image, the image segmentation result representing the geological change attributes of the target geographic environment region; and a model extraction module is used to train an image extraction model based on the image segmentation result and the label data corresponding to the image segmentation result to obtain a trained image extraction model.
[0015] A fourth aspect of this disclosure provides an image extraction apparatus, comprising:
[0016] The second acquisition module is used to acquire the first time-phase image and the second time-phase image, wherein the acquisition time of the first time-phase image is earlier than the acquisition time of the second time-phase image; the input module is used to input the first time-phase image and the second time-phase image into the trained image extraction model to obtain the image segmentation result corresponding to the second time-phase image.
[0017] Another aspect of this disclosure provides an electronic device, including: one or more processors; and a memory for storing one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors perform the method described above.
[0018] Another aspect of this disclosure provides a computer-readable storage medium having executable instructions stored thereon, which, when executed by a processor, cause the processor to perform the methods described above.
[0019] Another aspect of this disclosure provides a computer program product including a computer program that, when executed by a processor, implements the above-described method.
[0020] According to embodiments of this disclosure, a reference image and a base image are acquired. The base image is input into a first encoding network, which outputs a first encoded feature map. The reference image is input into a second encoding network, which outputs a second encoded feature map. A difference extraction network is used to extract difference features from the first and second encoded feature maps to obtain a difference feature map. This difference feature map is then input into a fusion decoding network to obtain an image segmentation result corresponding to the reference image. The image segmentation result characterizes the geological change attributes of the target geographical environment area. An image extraction model is trained based on the image segmentation result and the corresponding label data to obtain the trained image extraction model. By employing a difference extraction network, the image extraction model's ability to detect landslide differences is enhanced, its ability to detect landslide changes is improved, and thus the model's ability to identify image changes is enhanced.
Attached Figure Description
[0021] The foregoing contents, as well as other objects, features, and advantages of this disclosure, will become clearer from the following description of embodiments with reference to the accompanying drawings, in which:
[0022] Figure 1 illustrates an application scenario of the image extraction method according to an embodiment of the present disclosure;
[0023] Figure 2 schematically shows a flowchart of a training method for an image extraction model according to an embodiment of the present disclosure;
[0024] Figure 3 shows a schematic diagram of a difference extraction network according to an embodiment of the present disclosure;
[0025] Figure 4 shows a schematic diagram of a GL module in a difference extraction layer according to an embodiment of the present disclosure;
[0026] Figure 5 shows a schematic diagram of a multi-scale fusion layer according to an embodiment of the present disclosure;
[0027] Figure 6 shows a schematic diagram of a decoding layer according to an embodiment of the present disclosure;
[0028] Figure 7 shows a schematic diagram of the network structure of a difference extraction model according to an embodiment of the present disclosure;
[0029] Figure 8 schematically shows a flowchart of an image extraction method according to an embodiment of the present disclosure;
[0030] Figure 9 schematically shows a flowchart of the implementation of an image extraction method according to an embodiment of the present disclosure;
[0031] Figure 10 schematically shows an image extraction diagram according to an embodiment of the present disclosure;
[0032] Figure 11 schematically shows a structural block diagram of a training apparatus for an image extraction model according to an embodiment of the present disclosure;
[0033] Figure 12 shows a schematic block diagram of an image extraction apparatus according to an embodiment of the present disclosure; and
[0034] Figure 13 schematically shows a block diagram of an electronic device suitable for implementing an image extraction method according to an embodiment of the present disclosure.
Detailed Implementation
[0035] The embodiments of the present disclosure will now be described with reference to the accompanying drawings. However, it should be understood that these descriptions are exemplary only and are not intended to limit the scope of the disclosure. In the following detailed description, numerous specific details are set forth to provide a thorough understanding of the embodiments of the present disclosure for ease of explanation. However, it will be apparent that one or more embodiments may be practiced without these specific details. Furthermore, descriptions of well-known structures and techniques are omitted in the following description to avoid unnecessarily obscuring the concepts of the present disclosure.
[0036] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit this disclosure. The terms "comprising," "including," etc., as used herein indicate the presence of the stated features, steps, operations, and/or components, but do not exclude the presence or addition of one or more other features, steps, operations, or components.
[0037] All terms used herein (including technical and scientific terms) have the meanings commonly understood by those skilled in the art, unless otherwise defined. It should be noted that the terms used herein are to be interpreted in a manner consistent with the context of this specification, and not in an idealized or overly rigid way.
[0038] Expressions such as "at least one of A, B, and C" should generally be interpreted in accordance with the meaning commonly understood by a person skilled in the art (e.g., "a system having at least one of A, B, and C" should include, but is not limited to, a system having A alone, a system having B alone, a system having C alone, a system having A and B, a system having A and C, a system having B and C, and/or a system having A, B, and C, etc.).
[0039] In large countries prone to geological changes, the prevalence of mountainous areas and complex terrain mean that potential geological hazards are widespread. Rapid monitoring of landslide changes after events like earthquakes can be extremely helpful. Therefore, research on effective landslide monitoring is crucial for mitigating losses caused by geological changes.
[0040] With the development of remote sensing technology, landslide prediction based on remote sensing images has become an important direction. By analyzing remote sensing images from different time phases, such as optical and radar images, pre-landslide feature information can be effectively extracted, enabling early warning of landslide changes. Traditional manual methods can no longer meet current recognition requirements, and automated methods have become dominant. Traditional detection methods rely on manual visual interpretation, which has obvious drawbacks, requiring high manpower and time costs. Using machine learning methods such as random forests to extract landslides saves a lot of manpower and resources, and the cost of manual annotation is relatively low, making it a commonly used landslide extraction method. However, these methods have a high probability of mis-extraction and are highly dependent on prior knowledge of manually selected features.
[0041] The inventors discovered that the main problem with current methods is that landslides and certain background features may share similar spectral and textural characteristics in images, potentially leading to the model incorrectly extracting bare land and other features, thus reducing the accuracy of the extraction results. Secondly, landslides exhibit significant morphological variations, easily resulting in missed extractions. Landslides with substantial feature differences are easily overlooked or misclassified as non-landslide areas by the model, especially when the model's discrimination capabilities are limited. These challenges in landslide change detection make accurate landslide extraction a truly challenging task.
[0042] In view of this, this disclosure provides a training method for an image extraction model, an image extraction method, a training device for an image extraction model, and an image extraction device. The image extraction model includes an encoder network, a difference extraction network, and a fusion decoding network. The method includes: acquiring a reference image and a base image, wherein the base image was acquired earlier than the reference image, and the base image and reference image represent a target geographic environment region; inputting the base image into a first encoding network and outputting a first encoded feature map; inputting the reference image into a second encoding network and outputting a second encoded feature map, wherein the encoder network includes a first encoding network and a second encoding network; using the difference extraction network to extract difference features from the first and second encoded feature maps to obtain a difference feature map; inputting the difference feature map into the fusion decoding network to obtain an image segmentation result corresponding to the reference image, the image segmentation result representing the geological change attributes of the target geographic environment region; and training the image extraction model based on the image segmentation result and the corresponding label data to obtain the trained image extraction model.
[0043] In the technical solution disclosed herein, the user information (including but not limited to user personal information, user image information, user device information, such as location information) and data (including but not limited to data used for analysis, stored data, and displayed data) involved are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, storage, use, processing, transmission, provision, disclosure, and application of related data all comply with relevant laws, regulations, and standards, necessary confidentiality measures have been taken, and they do not violate public order and good morals. Corresponding operation entry points are provided for users to choose to authorize or refuse.
[0044] Figure 1 illustrates an application scenario of the image extraction method according to an embodiment of the present disclosure.
[0045] As shown in Figure 1, application scenario 100 according to this embodiment may include terminal devices 101, 102, and 103, network 104, and server 105. Network 104 serves as a medium providing a communication link between terminal devices 101, 102, and 103 and server 105. Network 104 may include various connection types, such as wired or wireless communication links or fiber optic cables.
[0046] Users can use terminal devices 101, 102, and 103 to interact with server 105 via network 104 to receive or send messages, etc. Various communication client applications can be installed on terminal devices 101, 102, and 103, such as shopping applications, web browser applications, search applications, instant messaging tools, email clients, social media platform software, etc. (for example only).
[0047] Terminal devices 101, 102, and 103 can be various electronic devices with displays and web browsing capabilities, including but not limited to smartphones, tablets, laptops, and desktop computers.
[0048] Server 105 can be a server that provides various services, such as a backend management server that supports websites browsed by users using terminal devices 101, 102, and 103 (for example only). The backend management server can analyze and process data such as received user requests, and feed back the processing results (such as web pages, information, or data obtained or generated according to user requests) to the terminal devices.
[0049] It should be noted that the image extraction model training method and image extraction method provided in this embodiment can generally be executed by server 105. Correspondingly, the image extraction model training device and image extraction device provided in this embodiment can generally be located in server 105. The image extraction model training method and image extraction method provided in this embodiment can also be executed by a server or server cluster that is different from server 105 and capable of communicating with terminal devices 101, 102, 103 and/or server 105. Correspondingly, the image extraction model training device and image extraction device provided in this embodiment can also be located in a server or server cluster that is different from server 105 and capable of communicating with terminal devices 101, 102, 103 and/or server 105.
[0050] It should be understood that the numbers of terminal devices, networks, and servers shown in Figure 1 are merely illustrative. Depending on implementation needs, any number of terminal devices, networks, and servers can be included.
[0051] Figure 2 schematically shows a flowchart of a training method for an image extraction model according to an embodiment of the present disclosure.
[0052] As shown in Figure 2, the method 200 includes operations S210 to S260.
[0053] In operation S210, a base image and a reference image are acquired.
[0054] In operation S220, the base image is input into the first encoding network, and the first encoded feature map is output.
[0055] In operation S230, the reference image is input into the second encoding network, and the second encoded feature map is output.
[0056] In operation S240, the difference extraction network is used to extract the difference features from the first and second encoded feature maps to obtain the difference feature map.
[0057] In operation S250, the difference feature map is input into the fusion decoding network to obtain the image segmentation result corresponding to the reference image. The image segmentation result characterizes the geological change attributes of the target geographic environment area.
[0058] In operation S260, an image extraction model is trained based on the image segmentation results and the corresponding label data to obtain the trained image extraction model.
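The data flow of operations S210 to S260 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the disclosed model: every network is replaced by a trivial stand-in (a 1x1 channel projection for each encoder branch, a plain subtraction for the difference extraction network, a channel mean followed by a sigmoid for the fusion decoding network), the images are random arrays, and binary cross-entropy is assumed as the training loss, which the text does not specify. All function and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(image, weight):
    """Stand-in encoder branch: 1x1 channel projection followed by ReLU."""
    return np.maximum(np.einsum("oc,chw->ohw", weight, image), 0.0)

def extract_difference(feat_first, feat_second):
    """Stand-in for the difference extraction network (S240)."""
    return feat_second - feat_first

def fuse_and_decode(diff_map):
    """Stand-in for the fusion decoding network (S250): per-pixel change probability."""
    logits = diff_map.mean(axis=0)
    return 1.0 / (1.0 + np.exp(-logits))

def bce_loss(pred, label, eps=1e-7):
    """Binary cross-entropy; the choice of loss is an assumption, not from the source."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(label * np.log(pred) + (1 - label) * np.log(1 - pred)))

# S210: base image (earlier) and reference image (later) of the same region.
base_image = rng.random((3, 8, 8))
reference_image = rng.random((3, 8, 8))
label = (rng.random((8, 8)) > 0.5).astype(np.float64)  # hypothetical change mask

# The two encoder branches keep separate (non-shared) weights.
w_first = rng.standard_normal((4, 3))
w_second = rng.standard_normal((4, 3))

feat_first = encode(base_image, w_first)                 # S220
feat_second = encode(reference_image, w_second)          # S230
diff_map = extract_difference(feat_first, feat_second)   # S240
segmentation = fuse_and_decode(diff_map)                 # S250
loss = bce_loss(segmentation, label)                     # S260: value a trainer would minimize
print(segmentation.shape, loss)
```

In an actual training loop, the loss would be minimized with respect to the parameters of all three networks; the sketch only computes it once.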
[0059] According to embodiments of this disclosure, a reference image and a base image can characterize a target geographic environment area, and the base image can be acquired earlier than the reference image. For example, the base image can be an image taken before a geological change occurred, and the reference image can be an image taken after the geological change occurred. However, this is not a limitation, and embodiments of this disclosure do not restrict the use of base images and reference images.
[0060] According to embodiments of this disclosure, the encoder network may include convolutional layers, a first encoding network, and a second encoding network. The first encoding network may include multiple residual network layers, which may serve as a backbone network for feature extraction. The base image is input to the convolutional layers to obtain a corresponding feature map, which is then input to the first encoding network to output the first encoded feature map. The second encoding network may likewise include multiple residual network layers. The reference image is input to the convolutional layers to obtain a corresponding feature map, which is then input to the second encoding network to output the second encoded feature map. The first and second encoding networks do not share convolutional layers or residual network layers.
[0061] According to embodiments of this disclosure, the difference extraction network may include multiple layers. The difference extraction network of a given layer can be used to extract difference features from the first encoded feature map and the second encoded feature map of that layer to obtain a difference feature map. By using the difference extraction network, the model's ability to detect landslide differences and landslide changes can be enhanced, thereby improving its ability to respond to image changes.
[0062] According to embodiments of this disclosure, for example, to address the issue of missed landslide extraction in existing methods, a difference extraction network can be used. The input temporal features are subtracted to obtain preliminary differences. After passing through the difference extraction layer, an intermediate difference feature map is obtained. Based on the first encoded feature map, the second encoded feature map, and the intermediate difference feature map, the final difference information is extracted through a convolutional layer to obtain the difference extraction feature map, thus avoiding the loss of change information.
[0063] According to embodiments of this disclosure, the fusion decoding network can fuse features at different scales, capture spatial contextual relationships within features, and extract information from different receptive fields.
[0064] According to embodiments of this disclosure, a difference feature map can be input into a fusion decoding network to obtain an image segmentation result corresponding to a reference image. The image segmentation result characterizes the geological change attributes of the target geographic environment area, such as geological changes after a landslide or earthquake.
[0065] According to embodiments of this disclosure, an image extraction model can be trained based on the image segmentation results and the label data corresponding to the image segmentation results to obtain the trained image extraction model.
[0066] According to embodiments of this disclosure, a reference image and a base image are acquired. The base image is input into a first encoding network, which outputs a first encoded feature map. The reference image is input into a second encoding network, which outputs a second encoded feature map. A difference extraction network is used to extract difference features from the first and second encoded feature maps to obtain a difference feature map. This difference feature map is then input into a fusion decoding network to obtain an image segmentation result corresponding to the reference image. The image segmentation result characterizes the geological change attributes of the target geographical environment area. An image extraction model is trained based on the image segmentation result and the corresponding label data to obtain the trained image extraction model. By employing a difference extraction network, the image extraction model's ability to detect landslide differences is enhanced, its ability to detect landslide changes is improved, and thus the model's ability to identify image changes is enhanced.
[0067] According to embodiments of this disclosure, a difference extraction network is used to extract difference features from a first encoded feature map and a second encoded feature map to obtain a difference feature map. This includes: obtaining a preliminary difference feature map based on the first encoded feature map and the second encoded feature map; inputting the preliminary difference feature map into a difference extraction layer to output an intermediate difference feature map; obtaining a first intermediate feature map based on the first encoded feature map and the intermediate difference feature map; obtaining a second intermediate feature map based on the second encoded feature map and the intermediate difference feature map; and obtaining a difference extraction feature map based on the first intermediate feature map, the second intermediate feature map, and the intermediate difference feature map.
[0068] Figure 3 shows a schematic diagram of a difference extraction network according to an embodiment of the present disclosure.
[0069] As shown in Figure 3, corresponding elements in the first encoded feature map 310 and the second encoded feature map 320 can be subtracted to obtain a preliminary difference feature map. This preliminary difference feature map can be input into the difference extraction layer 330 to obtain an intermediate difference feature map. Corresponding elements in the first encoded feature map and the intermediate difference feature map can be multiplied to obtain a first intermediate feature map. Similarly, corresponding elements in the second encoded feature map and the intermediate difference feature map can be multiplied to obtain a second intermediate feature map. Corresponding elements in the first encoded feature map and the first intermediate feature map are then added together to obtain a first summed feature map, and corresponding elements in the second encoded feature map and the second intermediate feature map are added together to obtain a second summed feature map. The first summed feature map, the second summed feature map, and the intermediate difference feature map can be concatenated along the channel dimension. Finally, the resulting feature map is passed through a convolutional layer to obtain the difference extraction feature map.
[0070] According to embodiments of this disclosure, preliminary differences are obtained by using feature maps at different times. Then, a difference extraction layer is used to further extract the differences between the two temporal images. The obtained differences are used as weights and multiplied to guide the two temporal feature maps (first coded feature map and second coded feature map). The first summed feature map, the second summed feature map and the intermediate difference feature map are then connected by channel dimensions. Finally, the final difference information is extracted through a convolutional layer, thus avoiding the loss of change information.
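The data flow of the difference extraction network can be illustrated with a minimal NumPy sketch. It is not the disclosed implementation: the function names, the sigmoid gate standing in for the difference extraction layer, the random weights, and the choice of a 1×1 kernel for the final convolutional layer are all assumptions made for illustration, since the text does not specify them.

```python
import numpy as np

def conv1x1(x, weight):
    """Pointwise (1x1) convolution: mixes channels at every pixel.
    x has shape (C_in, H, W); weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def difference_extraction(f1, f2, extraction_layer, weight):
    """f1, f2: first and second encoded feature maps, each of shape (C, H, W)."""
    prelim = f1 - f2                   # preliminary difference feature map
    diff = extraction_layer(prelim)    # intermediate difference feature map
    mid1 = f1 * diff                   # first intermediate feature map
    mid2 = f2 * diff                   # second intermediate feature map
    sum1 = f1 + mid1                   # first summed feature map
    sum2 = f2 + mid2                   # second summed feature map
    stacked = np.concatenate([sum1, sum2, diff], axis=0)  # shape (3C, H, W)
    return conv1x1(stacked, weight)    # difference extraction feature map

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
f1 = rng.normal(size=(C, H, W))
f2 = rng.normal(size=(C, H, W))
weight = rng.normal(size=(C, 3 * C))   # hypothetical, untrained weights

# Hypothetical stand-in for the difference extraction layer: a sigmoid gate.
gate = lambda x: 1.0 / (1.0 + np.exp(-x))

out = difference_extraction(f1, f2, gate, weight)
print(out.shape)
```

Because the intermediate difference feature map acts as a multiplicative weight and the encoded maps are added back afterwards, each summed map equals the encoded map scaled by (1 + weight), so the original features are preserved even where the weight is small.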
[0071] According to embodiments of this disclosure, inputting a preliminary difference feature map into a difference extraction layer and outputting an intermediate difference feature map includes: inputting the preliminary difference feature map into a convolutional layer to obtain a first feature map; processing the first feature map using a first convolutional unit to obtain a second feature map; processing the first feature map using a second convolutional unit to obtain a third feature map, wherein the first convolutional kernel of the first convolutional unit is different from the second convolutional kernel of the second convolutional unit; inputting the difference feature map into an attention unit to obtain an attention feature map, wherein the difference extraction layer includes a first convolutional unit, a second convolutional unit, and an attention unit; and fusing the second feature map, the third feature map, and the attention feature map to obtain an intermediate difference feature map.
[0072] Figure 4 shows a schematic diagram of a GL module in a difference extraction layer according to an embodiment of the present disclosure.
[0073] As shown in Figure 4, the preliminary difference feature map can be input into a convolutional layer to obtain the first feature map. The first feature map is processed by the first convolutional unit to obtain the second feature map, and by the second convolutional unit to obtain the third feature map. The convolution kernels of the first and second convolutional units can be different; for example, the first convolutional unit can be a 3×3 convolutional layer, and the second convolutional unit can be a 7×7 convolutional layer. The difference feature map is then input into an attention unit (i.e., a Transformer Block) to obtain the attention feature map. Corresponding elements of the second and third feature maps can be added together to obtain a new feature map. This new feature map is then added element-wise to the attention feature map to obtain the intermediate difference feature map.
[0074] According to embodiments of this disclosure, by using a difference extraction layer, high-dimensional and low-dimensional feature maps can be fused, which can enhance the detection capability of geological changes and thus improve the model's ability to identify image changes.
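The following is a minimal sketch of the three-branch structure described above, under several stated assumptions: uniform (box) kernels replace learned 3×3 and 7×7 convolutions, a single-head self-attention with identity projections replaces the Transformer Block, and the attention branch is fed the preliminary difference feature map (the text only says "the difference feature map"). All function names are hypothetical.

```python
import numpy as np

def box_conv(x, k):
    """Depthwise k x k convolution with a uniform kernel and zero 'same' padding.
    Stand-in for a learned convolution; x has shape (C, H, W)."""
    pad = k // 2
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    out = np.zeros_like(x)
    for dy in range(k):
        for dx in range(k):
            out += padded[:, dy:dy + h, dx:dx + w]
    return out / (k * k)

def self_attention(x):
    """Scaled dot-product self-attention over pixel positions (identity projections)."""
    c, h, w = x.shape
    tokens = x.reshape(c, h * w).T                 # (HW, C): one token per pixel
    scores = tokens @ tokens.T / np.sqrt(c)
    scores -= scores.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over positions
    return (weights @ tokens).T.reshape(c, h, w)

def gl_module(prelim):
    first = box_conv(prelim, 3)     # first feature map (initial convolutional layer)
    second = box_conv(first, 3)     # second feature map (small-kernel, local branch)
    third = box_conv(first, 7)      # third feature map (large-kernel branch)
    attn = self_attention(prelim)   # attention feature map (global branch, assumed input)
    return second + third + attn    # intermediate difference feature map

rng = np.random.default_rng(0)
prelim = rng.normal(size=(4, 8, 8))
intermediate = gl_module(prelim)
print(intermediate.shape)
```

The two kernel sizes give two receptive-field sizes, while the attention branch relates every pixel to every other pixel; adding the three outputs element-wise requires all branches to preserve the input shape, which is why "same" padding is assumed here.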
[0075] According to embodiments of this disclosure, inputting a difference feature map into a fusion decoding network to obtain an image segmentation result corresponding to a reference image includes: fusing multi-level difference feature maps using a multi-scale fusion layer to obtain a fused feature map; and decoding the fused feature map using a decoding layer to obtain an image segmentation result, wherein the fusion decoding network includes a multi-scale fusion layer and a decoding layer.
[0076] According to embodiments of this disclosure, a multi-scale fusion layer can be used to fuse multi-level difference feature maps to obtain a fused feature map. By fusing features at different scales, the multi-scale fusion layer can capture the spatial contextual relationships within the features and avoid information loss, thus improving the model's expressive power.
[0077] According to embodiments of this disclosure, a decoding layer can be used to decode the fused feature map to obtain image segmentation results. Specifically, the decoding layer enables the model to extract information from different receptive fields. Different receptive fields can better model spatial relationships. By integrating local and global information, the relationships between pixels can be understood more accurately, thereby ensuring the differentiation between landslides and bare land and reducing the risk of misidentification.
[0078] According to embodiments of this disclosure, a multi-scale fusion layer is used to fuse multi-level difference feature maps to obtain a fused feature map, including: adding multi-level difference feature maps based on an attention mechanism to obtain an attention fusion feature map; connecting multi-level difference feature maps along the channel dimension to obtain a channel feature map; processing the channel feature map using an attention mechanism to obtain an attention channel feature map; obtaining an intermediate fusion feature map based on the attention fusion feature map, the channel feature map, and the attention channel feature map; and processing the intermediate fusion feature map using a pooling layer and adding it to the multi-level difference feature maps respectively to obtain the fusion feature map of the corresponding level.
[0079] According to embodiments of this disclosure, multi-level difference feature maps can be upsampled to the same size, and then the upsampled multi-level difference feature maps can be added together using an attention mechanism to obtain an attention fusion feature map. The upsampled multi-level difference feature maps are then concatenated along the channel dimension to obtain a channel feature map. The channel feature map can be processed using an attention mechanism to obtain an attention channel feature map. Corresponding elements in the attention channel feature map and the attention fusion feature map can be added together to obtain a fused feature map. This fused feature map is then multiplied by the channel feature map to obtain an intermediate fused feature map. The intermediate fused feature map can be pooled and then added to the corresponding difference feature maps in each multi-level layer to obtain the fused feature map for the corresponding layer.
[0080] Figure 5 shows a schematic diagram of a multi-scale fusion layer according to an embodiment of the present disclosure.
[0081] As shown in Figure 5, features at different scales can provide information at multiple levels. The difference extraction layer upsamples three layers of features to the same scale. In the first branch, the multi-level difference feature maps are upsampled to the same scale, added together, and passed through channel attention (i.e., CA) to obtain the attention on the channels of each feature layer (i.e., the attention fusion feature map). In the second branch, the four feature maps are concatenated and channel attention is applied to obtain the attention on the channels of the merged four layers (i.e., the attention channel feature map). Finally, the obtained result (i.e., the intermediate fusion feature map) is pooled level by level to restore it to the corresponding feature level, improving the model's generalization ability and further reducing missed landslides during prediction.
[0082] According to embodiments of this disclosure, by fusing features at different scales, spatial contextual relationships within the features are captured, and information loss is avoided, thereby improving the expressive power of the model.
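One possible reading of the multi-scale fusion steps is sketched below. The text does not say how channel counts are reconciled when the attention fusion feature map (C channels) is added to the attention channel feature map (N×C channels), nor how the pooled result is assigned to each level. The sketch therefore assumes, purely for illustration, that the attention fusion map is tiled N times along the channel axis, that the intermediate map is split into N channel groups with one group per level, that adjacent levels differ by a factor of 2, and that channel attention is a sigmoid of the global average. All names are hypothetical.

```python
import numpy as np

def upsample(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def avg_pool(x, factor):
    """Average pooling of a (C, H, W) map by an integer factor."""
    c, h, w = x.shape
    return x.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def channel_attention(x):
    """Simplified channel attention: sigmoid of the per-channel global average."""
    pooled = x.mean(axis=(1, 2), keepdims=True)
    return x * (1.0 / (1.0 + np.exp(-pooled)))

def multi_scale_fusion(levels):
    """levels[i] has shape (C, H / 2**i, W / 2**i); returns one fused map per level."""
    n = len(levels)
    factors = [2 ** i for i in range(n)]
    ups = [upsample(f, s) for f, s in zip(levels, factors)]   # all at (C, H, W)
    attn_fused = channel_attention(sum(ups))                  # attention fusion feature map
    channel_map = np.concatenate(ups, axis=0)                 # channel feature map (n*C, H, W)
    attn_channel = channel_attention(channel_map)             # attention channel feature map
    combined = attn_channel + np.tile(attn_fused, (n, 1, 1))  # assumed channel alignment
    intermediate = combined * channel_map                     # intermediate fusion feature map
    chunks = np.split(intermediate, n, axis=0)                # assumed: one group per level
    return [lvl + avg_pool(chunk, s)                          # pool back, add to each level
            for lvl, chunk, s in zip(levels, chunks, factors)]

rng = np.random.default_rng(0)
levels = [rng.normal(size=(2, 8 // 2 ** i, 8 // 2 ** i)) for i in range(3)]
fused = multi_scale_fusion(levels)
print([f.shape for f in fused])
```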
[0083] According to embodiments of this disclosure, decoding the fused feature map using a decoding layer to obtain an image segmentation result includes: upsampling the i-th fused feature map to obtain the i-th upsampled fused feature map; concatenating the (i-1)-th fused feature map and the i-th upsampled fused feature map along the channel dimension to obtain the i-th intermediate fused feature map; inputting the i-th intermediate fused feature map into a convolutional pooling sub-layer to output the i-th decoded feature map; and performing image segmentation on the i-th decoded feature map to obtain an image segmentation result.
[0084] Figure 6 shows a schematic diagram of a decoding layer according to an embodiment of the present disclosure.
[0085] As shown in Figure 6, the i-th fused feature map 620 can be upsampled to obtain the i-th upsampled fused feature map. The (i-1)-th fused feature map 610 and the i-th upsampled fused feature map can be concatenated along the channel dimension to obtain the i-th intermediate fused feature map. The i-th intermediate fused feature map can be input into the convolutional pooling sub-layer 630, which outputs the i-th decoded feature map 640. Image segmentation is then performed on the i-th decoded feature map to obtain the image segmentation result.
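A single decoding step can be sketched as follows. The 2× nearest-neighbour upsampling, the shapes, and the placeholder passed in as the convolutional pooling sub-layer are assumptions for illustration; the text does not fix the upsampling method or the scale ratio between levels.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) map (assumed scale ratio)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def decode_step(fused_prev, fused_i, conv_pool_sublayer):
    """fused_prev: (i-1)-th fused feature map; fused_i: i-th fused feature map."""
    up = upsample2x(fused_i)                                  # i-th upsampled fused feature map
    intermediate = np.concatenate([fused_prev, up], axis=0)   # i-th intermediate fused feature map
    return conv_pool_sublayer(intermediate)                   # i-th decoded feature map

fused_3 = np.ones((4, 16, 16))
fused_4 = np.full((4, 8, 8), 2.0)

# Hypothetical placeholder sub-layer: average the two channel groups back to 4 channels.
halve = lambda x: 0.5 * (x[:4] + x[4:])

decoded = decode_step(fused_3, fused_4, halve)
print(decoded.shape)
```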
[0086] According to embodiments of this disclosure, inputting the i-th intermediate fused feature map into a convolutional pooling sub-layer and outputting the i-th decoded feature map includes: inputting the i-th intermediate fused feature map into a third convolutional unit to obtain the i-th intermediate convolutional fused feature map; inputting the i-th intermediate convolutional fused feature map into a pooling convolutional unit to obtain a pooling feature map; inputting the i-th intermediate convolutional fused feature map into a fourth convolutional unit to obtain a convolutional feature map; and obtaining the i-th decoded feature map based on the i-th intermediate convolutional fused feature map, the pooling feature map, and the convolutional feature map.
[0087] According to embodiments of this disclosure, pooling convolutional units may include 1/2-size pooling and 1/4-size pooling, but are not limited thereto, and this disclosure does not limit the pooling size.
[0088] According to embodiments of this disclosure, the number of parameters in the i-th intermediate fusion feature map can be reduced by controlling the number of channels through a 1×1 convolutional layer (i.e., the third convolutional unit). Next, the original-size feature map, the feature map after 1/2 pooling, and the feature map after 1/4 pooling are extracted through four branches respectively. Then, a residual connection is made with the input. The results obtained from the four branches are added together. Finally, features with multi-scale receptive fields are extracted through a 3×3 convolutional layer.
[0089] According to embodiments of this disclosure, pooling operations enable the model to extract information from different receptive fields. Different receptive fields can better model spatial relationships. By integrating local and global information, the relationship between pixels can be understood more accurately, thereby ensuring the differentiation between landslides and bare land and reducing the risk of mis-extraction.
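The convolutional pooling sub-layer can be approximated by the sketch below. The text names four branches but lists only the original-size, 1/2-pooled, and 1/4-pooled maps, and it does not say how pooled maps are brought back to a common size before addition. This sketch therefore uses those three branches plus the residual connection, and assumes nearest-neighbour upsampling after pooling and uniform 3×3 kernels in place of learned convolutions. All names and weights are hypothetical.

```python
import numpy as np

def conv1x1(x, weight):
    """Pointwise convolution; x: (C_in, H, W), weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def box_conv3x3(x):
    """Depthwise 3x3 uniform-kernel convolution with zero 'same' padding."""
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    h, w = x.shape[1:]
    out = np.zeros_like(x)
    for dy in range(3):
        for dx in range(3):
            out += padded[:, dy:dy + h, dx:dx + w]
    return out / 9.0

def avg_pool(x, factor):
    c, h, w = x.shape
    return x.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

def upsample(x, factor):
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def conv_pool_sublayer(x, weight):
    reduced = conv1x1(x, weight)                  # third convolutional unit: fewer channels
    full = box_conv3x3(reduced)                   # original-size branch
    half = upsample(avg_pool(reduced, 2), 2)      # 1/2 pooling branch, resized back (assumed)
    quarter = upsample(avg_pool(reduced, 4), 4)   # 1/4 pooling branch, resized back (assumed)
    summed = reduced + full + half + quarter      # residual connection plus branch sum
    return box_conv3x3(summed)                    # final 3x3 convolution

x = np.ones((8, 16, 16))
weight = np.full((4, 8), 1.0 / 8.0)               # hypothetical weights: channel averaging
out = conv_pool_sublayer(x, weight)
print(out.shape)
```

Pooling by a larger factor averages over a larger neighbourhood, so each branch summarizes context at a different spatial extent before the branches are summed.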
[0090] Figure 7 shows a schematic diagram of the network structure of the difference extraction model according to an embodiment of the present disclosure.
[0091] As shown in Figure 7, the reference image is input into a convolutional layer and convolved to obtain the corresponding feature map, which is then input into the first-level first encoding network to output the first-level first encoded feature map. Similarly, the reference image is input into a convolutional layer and convolved to obtain the corresponding feature map, which is then input into the first-level second encoding network to output the first-level second encoded feature map. The first-level first encoded feature map and the first-level second encoded feature map are then input together into the first-level difference extraction network to extract difference features, yielding the first-level difference feature map. Similarly, the first encoded feature map obtained at the first level is used as the input to the second-level first encoding network to obtain the first encoded feature map of that level, and the second encoded feature map obtained at the first level is used as the input to the second-level second encoding network to obtain the second encoded feature map of that level. The first encoded feature map and the second encoded feature map of that level are then input together into the difference extraction network of the corresponding level for difference feature extraction, yielding the difference feature map of that level, and so on. This embodiment of the disclosure does not limit the number of levels in the encoder network.
[0092] According to embodiments of this disclosure, the obtained multi-level difference feature maps are input into a fusion decoding network for image segmentation to obtain an image segmentation result corresponding to a reference image. Specifically, the multi-level difference feature maps are input into a multi-scale fusion layer for multi-scale fusion, and pooling layers are then used to output fused feature maps of the corresponding levels. A decoding layer is then used to decode the multi-level fused feature maps to obtain the image segmentation result. As shown in Figure 7, the fourth-level fused feature map and the third-level fused feature map are input together into the third-level decoding layer for decoding; the resulting fourth-level decoded feature map and the second-level fused feature map are input together into the second-level decoding layer for decoding; and the resulting third-level decoded feature map and the first-level fused feature map are input together into the first-level decoding layer. The resulting second decoded feature map is passed through the segmentation head to obtain the image segmentation result.
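The level-by-level data flow of the whole network can be traced with the skeleton below. Every stage is a deliberately trivial placeholder chosen only to make the flow runnable: average pooling stands in for each encoding network, an absolute difference stands in for each difference extraction network, the multi-scale fusion layer is skipped, and an average stands in for each decoding layer. The names `image_a` and `image_b` are neutral and hypothetical, and four levels are assumed to match the example above.

```python
import numpy as np

def downsample(x):
    """Placeholder encoder stage: 2x average pooling of a (C, H, W) map."""
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def upsample(x):
    return x.repeat(2, axis=1).repeat(2, axis=2)

def encode(image, num_levels):
    """Each level consumes the previous level's output, as in the two encoder branches."""
    feats, x = [], image
    for _ in range(num_levels):
        x = downsample(x)
        feats.append(x)
    return feats

def forward(image_a, image_b, num_levels=4):
    feats_a = encode(image_a, num_levels)      # first encoded feature maps, per level
    feats_b = encode(image_b, num_levels)      # second encoded feature maps, per level
    # Placeholder for the per-level difference extraction networks.
    diffs = [np.abs(a - b) for a, b in zip(feats_a, feats_b)]
    fused = diffs                              # placeholder: multi-scale fusion skipped
    decoded = fused[-1]                        # start from the deepest level
    for level in range(num_levels - 2, -1, -1):
        # Placeholder decoding layer: combine the shallower fused map with the
        # upsampled result from the deeper level.
        decoded = 0.5 * (fused[level] + upsample(decoded))
    return decoded                             # would be passed to a segmentation head

rng = np.random.default_rng(0)
image_a = rng.normal(size=(1, 32, 32))
image_b = rng.normal(size=(1, 32, 32))
result = forward(image_a, image_b)
print(result.shape)
```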
[0093] Figure 8 schematically shows a flowchart of an image extraction method according to an embodiment of the present disclosure.
[0094] As shown in Figure 8, the method includes operations S810 to S820.
[0095] In operation S810, a first time-phase image and a second time-phase image are acquired, wherein the first time-phase image is acquired earlier than the second time-phase image.
[0096] In operation S820, the first time-phase image and the second time-phase image are input into the trained image extraction model to obtain the image segmentation result corresponding to the second time-phase image.
[0097] According to embodiments of this disclosure, a first time-phase image and a second time-phase image can be obtained first, and the first time-phase image and the second time-phase image can then be input into a trained image extraction model to obtain an image segmentation result corresponding to the second time-phase image.
[0098] Figure 9 shows a flowchart of the implementation of an image extraction method according to an embodiment of the present disclosure.
[0099] As shown in Figure 9, a public dataset can first be obtained (S910), and the acquired images are then uniformly cropped (S920), for example, by cropping the images of each region to a size of 256×256. The dataset is then divided by region (S930): different regions are selected, with each region forming an image set, and these image sets are divided into a training set and test sets. The training set is used to train a landslide change detection model based on the image extraction model structure (S940), and the test sets are used to verify the model's accuracy on each test set (S950).
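The cropping and region-based split (S920 and S930) can be sketched as follows. The handling of image edges is not described in the text; this sketch assumes non-overlapping tiles and drops partial tiles at the right and bottom edges. The function names and region labels are hypothetical.

```python
import numpy as np

def crop_tiles(image, size=256):
    """Cut an (H, W, C) image into non-overlapping size x size tiles.
    Partial tiles at the edges are dropped (an assumption)."""
    h, w = image.shape[:2]
    return [image[r:r + size, c:c + size]
            for r in range(0, h - size + 1, size)
            for c in range(0, w - size + 1, size)]

def split_by_region(tiles_by_region, test_regions):
    """Keep each test region as its own test set; pool all other regions for training."""
    train, test = [], {}
    for region, tiles in tiles_by_region.items():
        if region in test_regions:
            test[region] = tiles
        else:
            train.extend(tiles)
    return train, test

image = np.zeros((600, 520, 3), dtype=np.uint8)
tiles = crop_tiles(image)

tiles_by_region = {"region_1": tiles[:1], "region_2": tiles[1:2], "region_6": tiles[2:]}
train, test = split_by_region(tiles_by_region, {"region_1", "region_2"})
print(len(tiles), len(train), sorted(test))
```

Splitting by region rather than by tile keeps tiles from the same area out of both sets, so the test accuracy reflects performance on regions that were not seen during training.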
[0100] Figure 10 schematically shows an image extraction diagram according to an embodiment of the present disclosure.
[0101] As shown in Figure 10, the trained model was tested on the test set, and landslide extraction results were obtained on five test sets, where (a) is the image before the landslide, (b) is the image after the landslide, (c) is the corresponding label, and (d) is the prediction result. It can be seen that the proposed image extraction model alleviates the problems of missed and false landslide detections in landslide change detection, and can successfully extract most landslides. In addition, the Intersection over Union (IOU), Precision, Recall, and F1 score were calculated on the test sets to evaluate the accuracy of the model. The IOU is the overlap ratio between the "predicted bounding box" and the "real bounding box", that is, the ratio of their intersection to their union; the larger the ratio, the better. Likewise, the larger the values of Precision, Recall, and F1, the better the performance of the image extraction model. The calculation methods of IOU, Precision, Recall, and F1 are shown in formulas (1)-(4).
[0102]
[0103] Where TP represents the number of true landslide pixels correctly extracted as landslides, TN represents the number of true background pixels correctly extracted as background, FP represents the number of true background pixels misclassified as landslides, and FN represents the number of true landslide pixels incorrectly predicted as background by the model.
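Formulas (1)-(4) are not reproduced in this text. The sketch below therefore assumes the standard pixel-wise definitions, which are consistent with the TP, FP, and FN counts defined above: IOU = TP / (TP + FP + FN), Precision = TP / (TP + FP), Recall = TP / (TP + FN), and F1 = 2 × Precision × Recall / (Precision + Recall). The function names and the toy masks are hypothetical.

```python
def confusion_counts(pred, label):
    """Count TP, TN, FP, FN over flat binary masks (1 = landslide, 0 = background)."""
    tp = sum(1 for p, t in zip(pred, label) if p == 1 and t == 1)
    tn = sum(1 for p, t in zip(pred, label) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, label) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, label) if p == 0 and t == 1)
    return tp, tn, fp, fn

def segmentation_metrics(tp, fp, fn):
    """Standard definitions (assumed); TN does not enter any of the four metrics."""
    iou = tp / (tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return iou, precision, recall, f1

pred = [1, 1, 0, 0, 1, 0]    # toy prediction mask
label = [1, 0, 0, 1, 1, 0]   # toy ground-truth mask
tp, tn, fp, fn = confusion_counts(pred, label)
iou, precision, recall, f1 = segmentation_metrics(tp, fp, fn)
print(tp, tn, fp, fn, iou)   # prints: 2 2 1 1 0.5
```

In this toy case two of the three predicted landslide pixels are correct and two of the three true landslide pixels are found, so Precision, Recall, and F1 all equal 2/3, while the IOU is 2/4.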
[0104] According to embodiments of this disclosure, training and test samples are selected based on different regions. Regions 1, 2, 3, 4, and 5 are selected as the test set, and the remaining 12 datasets are used as the training set. Images from each region are uniformly cropped to 256×256 pixels. The training set contains 5426 images, while the test set includes 160 images for Region 1, 336 images for Region 2, 495 images for Region 3, 448 images for Region 4, and 462 images for Region 5. The accuracy of the image extraction model on each test set is verified, as shown in Table 1.
[0105] Table 1
[0106]
[0107]
[0108] Based on the training method of the image extraction model described above, this disclosure also provides a training device for the image extraction model. The device is described in detail below with reference to Figure 11.
[0109] Figure 11 shows a schematic block diagram of a training apparatus for an image extraction model according to an embodiment of the present disclosure.
[0110] As shown in Figure 11, the training device 1100 for the image extraction model in this embodiment includes a first acquisition module 1110, a first encoding module 1120, a second encoding module 1130, a difference extraction module 1140, a fusion decoding module 1150, and a model extraction module 1160.
[0111] The first acquisition module 1110 is used to acquire a reference image and a base image, wherein the base image is acquired earlier than the reference image, and the base image and the reference image represent the target geographic environment area. In one embodiment, the first acquisition module 1110 can be used to perform the operation S210 described above, which will not be repeated here.
[0112] The first encoding module 1120 is used to input the reference image into the first encoding network and output the first encoded feature map. In one embodiment, the first encoding module 1120 can be used to perform the operation S220 described above, which will not be repeated here.
[0113] The second encoding module 1130 is used to input the reference image into the second encoding network and output a second encoded feature map, wherein the encoder network includes a first encoding network and a second encoding network. In one embodiment, the second encoding module 1130 can be used to perform the operation S230 described above, which will not be repeated here.
[0114] The difference extraction module 1140 is used to extract difference features from the first encoded feature map and the second encoded feature map using a difference extraction network to obtain a difference feature map. In one embodiment, the difference extraction module 1140 can be used to perform the operation S240 described above, which will not be repeated here.
[0115] The fusion decoding module 1150 is used to input the difference feature map into the fusion decoding network to obtain an image segmentation result corresponding to the reference image. The image segmentation result characterizes the geological change attributes of the target geographic environment area. In one embodiment, the fusion decoding module 1150 can be used to perform the operation S250 described above, which will not be repeated here.
[0116] The model extraction module 1160 is used to train an image extraction model based on the image segmentation results and the label data corresponding to the image segmentation results, thereby obtaining the trained image extraction model. In one embodiment, the model extraction module 1160 can be used to perform the operation S260 described above, which will not be repeated here.
[0117] According to embodiments of this disclosure, the difference extraction module 1140 includes: a first difference unit, a second difference unit, a first intermediate unit, a second intermediate unit, and a difference extraction unit.
[0118] The first difference unit is used to obtain a preliminary difference feature map based on the first encoded feature map and the second encoded feature map.
[0119] The second difference unit is used to input the preliminary difference feature map into the difference extraction layer and output the intermediate difference feature map.
[0120] The first intermediate unit is used to obtain the first intermediate feature map based on the first encoded feature map and the intermediate difference feature map.
[0121] The second intermediate unit is used to obtain the second intermediate feature map based on the second encoded feature map and the intermediate difference feature map.
[0122] The difference extraction unit is used to obtain a difference extraction feature map based on the first intermediate feature map, the second intermediate feature map, and the intermediate difference feature map.
[0123] According to embodiments of this disclosure, the second difference unit includes: a first feature subunit, a second feature subunit, a third feature subunit, an attention feature subunit, and an intermediate difference subunit.
[0124] The first feature subunit is used to input the preliminary difference feature map into the convolutional layer to obtain the first feature map.
[0125] The second feature subunit is used to process the first feature map using the first convolution unit to obtain the second feature map.
[0126] The third feature subunit is used to process the first feature map using the second convolution unit to obtain the third feature map. The first convolution kernel of the first convolution unit is different from the second convolution kernel of the second convolution unit.
[0127] The attention feature subunit is used to input the difference feature map into the attention unit to obtain the attention feature map. The difference extraction layer includes a first convolutional unit, a second convolutional unit, and an attention unit.
[0128] The intermediate difference subunit is used to fuse the second feature map, the third feature map, and the attention feature map to obtain the intermediate difference feature map.
[0129] According to embodiments of this disclosure, the fusion decoding module 1150 includes a fusion unit and a decoding unit.
[0130] The fusion unit is used to fuse multi-level difference feature maps using a multi-scale fusion layer to obtain a fused feature map.
[0131] The decoding unit is used to decode the fused feature map using the decoding layer to obtain the image segmentation result. The fusion decoding network includes a multi-scale fusion layer and a decoding layer.
[0132] According to embodiments of this disclosure, the fusion unit includes: an attention fusion subunit, a dimension connection subunit, an attention channel subunit, an intermediate fusion subunit, and a processing subunit.
[0133] The attention fusion subunit is used to add the multi-level difference feature maps together based on the attention mechanism to obtain the attention fusion feature map.
[0134] The dimensional connection subunit is used to connect the multi-level difference feature maps along the channel dimension to obtain the channel feature map.
[0135] The attention channel subunit is used to process the channel feature map using the attention mechanism to obtain the attention channel feature map.
[0136] The intermediate fusion subunit is used to obtain the intermediate fusion feature map based on the attention fusion feature map, the channel feature map, and the attention channel feature map.
[0137] The processing subunit is used to process the intermediate fused feature map using the pooling layer, and add it to the difference feature maps of the multi-level layers respectively to obtain the fused feature map of the corresponding level.
[0138] According to embodiments of this disclosure, the decoding unit includes: an upsampling subunit, a feature fusion subunit, a convolutional pooling subunit, and a segmentation subunit.
[0139] The upsampling subunit is used to upsample the i-th fused feature map to obtain the i-th upsampled fused feature map.
[0140] The feature fusion subunit is used to connect the (i-1)th fused feature map and the i-th upsampled fused feature map along the channel dimension to obtain the i-th intermediate fused feature map.
[0141] The convolutional pooling sub-unit is used to input the i-th intermediate fused feature map into the convolutional pooling sub-layer and output the i-th decoded feature map.
[0142] The segmentation subunit is used to perform image segmentation on the i-th decoded feature map to obtain the image segmentation result.
[0143] According to an embodiment of this disclosure, the i-th intermediate fused feature map is input into a third convolutional unit to obtain the i-th intermediate convolutional fused feature map; the i-th intermediate convolutional fused feature map is input into a pooling convolutional unit to obtain a pooling feature map; the i-th intermediate convolutional fused feature map is input into a fourth convolutional unit to obtain a convolutional feature map; and the i-th decoded feature map is obtained based on the i-th intermediate convolutional fused feature map, the pooling feature map, and the convolutional feature map.
[0144] Figure 12 shows a schematic block diagram of an image extraction apparatus according to an embodiment of the present disclosure.
[0145] As shown in Figure 12, the image extraction device 1200 of this embodiment includes a second acquisition module 1210 and an input module 1220.
[0146] The second acquisition module 1210 is used to acquire a first time-phase image and a second time-phase image, wherein the acquisition time of the first time-phase image is earlier than the acquisition time of the second time-phase image. In one embodiment, the second acquisition module 1210 can be used to perform the operation S810 described above, which will not be repeated here.
[0147] The input module 1220 is used to input the first time-phase image and the second time-phase image into the trained image extraction model to obtain the image segmentation result corresponding to the second time-phase image. In one embodiment, the input module 1220 can be used to perform the operation S820 described above, which will not be repeated here.
[0148] Figure 13 schematically shows a block diagram of an electronic device suitable for implementing an image extraction method according to an embodiment of the present disclosure.
[0149] As shown in Figure 13, an electronic device 1300 according to an embodiment of the present disclosure includes a processor 1301, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1302 or a program loaded from a storage portion 1308 into a random access memory (RAM) 1303. The processor 1301 may include, for example, a general-purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset, and/or a special-purpose microprocessor (e.g., an application-specific integrated circuit (ASIC)), etc. The processor 1301 may also include onboard memory for caching purposes. The processor 1301 may include a single processing unit or multiple processing units for performing different actions of the method flow according to an embodiment of the present disclosure.
[0150] RAM 1303 stores various programs and data required for the operation of electronic device 1300. Processor 1301, ROM 1302, and RAM 1303 are interconnected via bus 1304. Processor 1301 performs various operations of the method flow according to embodiments of the present disclosure by executing programs in ROM 1302 and / or RAM 1303. It should be noted that the programs may also be stored in one or more memories other than ROM 1302 and RAM 1303. Processor 1301 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in said one or more memories.
[0151] According to embodiments of this disclosure, the electronic device 1300 may further include an input / output (I / O) interface 1305, which is also connected to a bus 1304. The electronic device 1300 may also include one or more of the following components connected to the I / O interface 1305: an input section 1306 including a keyboard, mouse, etc.; an output section 1307 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and a speaker, etc.; a storage section 1308 including a hard disk, etc.; and a communication section 1309 including a network interface card such as a LAN card, modem, etc. The communication section 1309 performs communication processing via a network such as the Internet. A drive 1310 is also connected to the I / O interface 1305 as needed. A removable medium 1311, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on the drive 1310 as needed so that computer programs read from it can be installed into the storage section 1308 as needed.
[0152] This disclosure also provides a computer-readable storage medium, which may be included in the device / apparatus / system described in the above embodiments; or it may exist independently and not assembled into the device / apparatus / system. The computer-readable storage medium carries one or more programs that, when executed, implement the method according to the embodiments of this disclosure.
[0153] According to embodiments of this disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, such as including, but not limited to: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof. In this disclosure, the computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. For example, according to embodiments of this disclosure, the computer-readable storage medium may include ROM 1302 and / or RAM 1303 and / or one or more memories other than ROM 1302 and RAM 1303 described above.
[0154] Embodiments of this disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowchart. When the computer program product is run on a computer system, the program code enables the computer system to implement the image extraction method provided in the embodiments of this disclosure.
[0155] When the computer program is executed by the processor 1301, it performs the functions defined in the system/apparatus of the embodiments of this disclosure. According to embodiments of this disclosure, the systems, apparatuses, modules, units, etc., described above can be implemented by computer program modules.
[0156] In one embodiment, the computer program may rely on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of signals over a network medium, and may be downloaded and installed via the communication section 1309, and/or installed from the removable medium 1311. The program code contained in the computer program can be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination thereof.
[0157] In such an embodiment, the computer program can be downloaded and installed from a network via the communication section 1309, and/or installed from the removable medium 1311. When the computer program is executed by the processor 1301, it performs the functions defined in the system of the embodiments of this disclosure. According to embodiments of this disclosure, the systems, devices, apparatuses, modules, units, etc., described above can be implemented by computer program modules.
[0158] According to embodiments of this disclosure, program code for executing the computer programs provided in embodiments of this disclosure can be written in any combination of one or more programming languages. Specifically, these computer programs can be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, languages such as Java, C++, Python, "C", or similar programming languages. The program code can execute entirely on the user's computing device, partially on the user's device, partially on a remote computing device, or entirely on a remote computing device or server. In cases involving remote computing devices, the remote computing device can be connected to the user's computing device via any type of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (e.g., via the Internet using an Internet service provider).
[0159] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0160] Those skilled in the art will understand that the features described in the various embodiments and/or claims of this disclosure can be combined and/or integrated in various ways, even if such combinations or integrations are not explicitly described in this disclosure. In particular, the features described in the various embodiments and/or claims of this disclosure can be combined and/or integrated in various ways without departing from the spirit and teachings of this disclosure. All such combinations and/or integrations fall within the scope of this disclosure.
[0161] The embodiments of this disclosure have been described above. However, these embodiments are for illustrative purposes only and are not intended to limit the scope of this disclosure. Although various embodiments have been described above, this does not mean that the measures in the various embodiments cannot be used advantageously in combination. The scope of this disclosure is defined by the appended claims and their equivalents. Various substitutions and modifications can be made by those skilled in the art without departing from the scope of this disclosure, and all such substitutions and modifications should fall within the scope of this disclosure.
Claims
1. A method for training an image extraction model, characterized in that the image extraction model includes an encoder network, a difference extraction network, and a fusion decoding network, and the method includes:
acquiring a baseline image and a reference image, wherein the baseline image was acquired earlier than the reference image, and the baseline image and the reference image represent a target geographic environment region;
inputting the baseline image into a first encoding network to output a first encoded feature map;
inputting the reference image into a second encoding network to output a second encoded feature map, wherein the encoder network includes the first encoding network and the second encoding network;
obtaining a preliminary difference feature map based on the first encoded feature map and the second encoded feature map;
inputting the preliminary difference feature map into a difference extraction layer to output an intermediate difference feature map;
obtaining a first intermediate feature map based on the first encoded feature map and the intermediate difference feature map;
obtaining a second intermediate feature map based on the second encoded feature map and the intermediate difference feature map;
obtaining a difference feature map based on the first intermediate feature map, the second intermediate feature map, and the intermediate difference feature map;
inputting the difference feature map into the fusion decoding network to obtain an image segmentation result corresponding to the reference image, wherein the image segmentation result characterizes the geological change attributes of the target geographic environment region; and
training the image extraction model based on the image segmentation result and the label data corresponding to the image segmentation result to obtain the trained image extraction model;
wherein inputting the preliminary difference feature map into the difference extraction layer to output the intermediate difference feature map includes:
inputting the preliminary difference feature map into a convolutional layer to obtain a first feature map;
processing the first feature map using a first convolutional unit to obtain a second feature map;
processing the first feature map using a second convolutional unit to obtain a third feature map, wherein a first convolution kernel of the first convolutional unit is different from a second convolution kernel of the second convolutional unit;
inputting the difference feature map into an attention unit to obtain an attention feature map, wherein the difference extraction layer includes the first convolutional unit, the second convolutional unit, and the attention unit; and
fusing the second feature map, the third feature map, and the attention feature map to obtain the intermediate difference feature map.
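By way of non-limiting illustration, the data flow of the difference extraction layer recited in claim 1 may be sketched as follows. The sketch uses plain Python lists in place of single-channel feature maps and makes several assumptions that are not stated in the claims: the preliminary difference is taken as an element-wise absolute difference, the convolution kernels are fixed averaging kernels of sizes 3 and 5, the attention unit is a simple sigmoid gate applied to the preliminary difference feature map, and fusion is an element-wise sum. All function names are hypothetical.

```python
import math


def conv2d_same(x, kernel):
    """Single-channel 2D cross-correlation with zero padding (output keeps input size)."""
    h, w = len(x), len(x[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    ii, jj = i + di - pad, j + dj - pad
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += x[ii][jj] * kernel[di][dj]
            out[i][j] = acc
    return out


def box_kernel(k):
    """Assumed stand-in for a learned k x k kernel: a uniform averaging kernel."""
    return [[1.0 / (k * k)] * k for _ in range(k)]


def preliminary_difference(f1, f2):
    """Assumption: the preliminary difference is the element-wise absolute difference."""
    return [[abs(a - b) for a, b in zip(r1, r2)] for r1, r2 in zip(f1, f2)]


def sigmoid_gate(x):
    """Hypothetical attention unit: scale each value by a sigmoid of itself."""
    return [[v / (1.0 + math.exp(-v)) for v in row] for row in x]


def difference_extraction_layer(prelim_diff):
    h, w = len(prelim_diff), len(prelim_diff[0])
    first = conv2d_same(prelim_diff, box_kernel(3))  # convolutional layer
    second = conv2d_same(first, box_kernel(3))       # first convolutional unit
    third = conv2d_same(first, box_kernel(5))        # second unit, different kernel
    attention = sigmoid_gate(prelim_diff)            # attention unit (assumed input)
    # Assumed fusion: element-wise sum of the three branches.
    return [
        [second[i][j] + third[i][j] + attention[i][j] for j in range(w)]
        for i in range(h)
    ]
```

Under these assumptions, two identical encoded feature maps yield an all-zero intermediate difference feature map, which matches the intent that unchanged regions contribute no difference signal.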
2. The training method according to claim 1, characterized in that inputting the difference feature map into the fusion decoding network to obtain the image segmentation result corresponding to the reference image includes:
fusing difference feature maps of multiple levels using a multi-scale fusion layer to obtain a fused feature map; and
decoding the fused feature map using a decoding layer to obtain the image segmentation result, wherein the fusion decoding network includes the multi-scale fusion layer and the decoding layer.
3. The training method according to claim 2, characterized in that fusing the difference feature maps of the multiple levels using the multi-scale fusion layer to obtain the fused feature map includes:
adding the difference feature maps of the multiple levels together based on an attention mechanism to obtain an attention fusion feature map;
concatenating the difference feature maps of the multiple levels along the channel dimension to obtain a channel feature map;
processing the channel feature map using an attention mechanism to obtain an attention channel feature map;
obtaining an intermediate fused feature map based on the attention fusion feature map, the channel feature map, and the attention channel feature map; and
processing the intermediate fused feature map using a pooling layer and adding the result to each of the difference feature maps of the multiple levels to obtain the fused feature map of the corresponding level.
4. The training method according to claim 2, characterized in that decoding the fused feature map using the decoding layer to obtain the image segmentation result includes:
upsampling the i-th fused feature map to obtain an i-th upsampled fused feature map;
concatenating the (i-1)-th fused feature map and the i-th upsampled fused feature map along the channel dimension to obtain an i-th intermediate fused feature map;
inputting the i-th intermediate fused feature map into a convolutional pooling sub-layer to output an i-th decoded feature map; and
segmenting the i-th decoded feature map to obtain the image segmentation result.
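By way of non-limiting illustration, the upsampling and channel concatenation steps of claim 4 may be sketched as follows. Feature maps are represented as nested Python lists indexed as [channel][row][column]. The sketch assumes nearest-neighbour upsampling by a factor of 2 and assumes that the (i-1)-th fused feature map has twice the spatial resolution of the i-th one; neither assumption is stated in the claims. The convolutional pooling sub-layer and the final segmentation step are omitted, and all function names are hypothetical.

```python
def upsample_nearest_2x(fmap):
    """Nearest-neighbour 2x upsampling of a [C][H][W] nested-list feature map."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each column twice
            rows.append(wide)
            rows.append(list(wide))                    # repeat each row twice
        out.append(rows)
    return out


def concat_channels(a, b):
    """Concatenate two [C][H][W] feature maps along the channel dimension."""
    if len(a[0]) != len(b[0]) or len(a[0][0]) != len(b[0][0]):
        raise ValueError("spatial sizes must match before concatenation")
    return a + b


def decode_step(fused_prev, fused_i):
    """Build the i-th intermediate fused feature map from levels i-1 and i."""
    upsampled = upsample_nearest_2x(fused_i)
    return concat_channels(fused_prev, upsampled)
```

The size check in `concat_channels` makes the resolution assumption explicit: if the two levels do not differ by exactly a factor of 2, the sketch raises an error rather than silently misaligning the maps.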
5. The training method according to claim 4, characterized in that inputting the i-th intermediate fused feature map into the convolutional pooling sub-layer to output the i-th decoded feature map includes:
inputting the i-th intermediate fused feature map into a third convolutional unit to obtain an i-th intermediate convolutional fused feature map;
inputting the i-th intermediate convolutional fused feature map into a pooling convolutional unit to obtain a pooling feature map;
inputting the i-th intermediate convolutional fused feature map into a fourth convolutional unit to obtain a convolutional feature map; and
obtaining the i-th decoded feature map based on the i-th intermediate convolutional fused feature map, the pooling feature map, and the convolutional feature map.
6. An image extraction method, characterized in that the method includes:
acquiring a first time-phase image and a second time-phase image, wherein the acquisition time of the first time-phase image is earlier than the acquisition time of the second time-phase image; and
inputting the first time-phase image and the second time-phase image into a trained image extraction model to obtain an image segmentation result corresponding to the second time-phase image, wherein the trained image extraction model is obtained by the method according to any one of claims 1 to 5.
7. A training device for an image extraction model, characterized in that the device includes:
a first acquisition module, used to acquire a baseline image and a reference image, wherein the acquisition time of the baseline image is earlier than the acquisition time of the reference image, and the baseline image and the reference image represent a target geographic environment region;
a first encoding module, used to input the baseline image into a first encoding network and output a first encoded feature map;
a second encoding module, used to input the reference image into a second encoding network and output a second encoded feature map, wherein the encoder network includes the first encoding network and the second encoding network;
a difference extraction module, used to extract difference features from the first encoded feature map and the second encoded feature map using a difference extraction network to obtain a difference feature map;
a fusion decoding module, used to input the difference feature map into the fusion decoding network to obtain an image segmentation result corresponding to the reference image, wherein the image segmentation result characterizes the geological change attributes of the target geographic environment region; and
a model extraction module, used to train the image extraction model based on the image segmentation result and the label data corresponding to the image segmentation result, so as to obtain the trained image extraction model;
wherein the training device is constructed according to the method of any one of claims 1 to 5.
8. An image extraction device, characterized in that the device includes:
a second acquisition module, used to acquire a first time-phase image and a second time-phase image, wherein the acquisition time of the first time-phase image is earlier than the acquisition time of the second time-phase image; and
an input module, used to input the first time-phase image and the second time-phase image into a trained image extraction model to obtain an image segmentation result corresponding to the second time-phase image;
wherein the trained image extraction model is obtained by the method according to any one of claims 1 to 5.