Training method of image extraction model and image extraction method

By combining linear embedding, encoding, and decoding networks in the image extraction model with local and global attention layers and weighted feature fusion layers, the problem of low landslide extraction accuracy in complex terrain areas is solved, achieving higher landslide detail and edge extraction capabilities.

CN118212489B · Active · Publication Date: 2026-06-23 · AEROSPACE INFORMATION RES INST CAS

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: AEROSPACE INFORMATION RES INST CAS
Filing Date: 2024-04-17
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing technologies suffer from low accuracy and model confusion with background features in landslide extraction in complex terrains such as mountains and hills. In particular, when the landslide and background features have similar colors and textures, it is difficult to accurately extract landslide details and edges.

Method used

An image extraction model is adopted, including a linear embedding network, an encoding network, and a decoding network. A decoding sub-network is constructed using a local global attention layer and a weighted feature fusion layer. The image segmentation result is obtained through segmentation head mapping. The model is trained based on the label data, and the contribution of each feature map is balanced to improve robustness.

Benefits of technology

It improves the ability to extract landslide details and edges, reduces model overfitting, and enhances the accuracy and robustness of landslide extraction in complex terrain areas.

✦ Generated by Eureka AI based on patent content.


Abstract

The present disclosure provides a training method of an image extraction model and an image extraction method, which can be applied to the fields of image processing and pattern recognition. The image extraction model comprises a linear embedding network, an encoding network and a decoding network. The training method comprises: processing a sample image obtained by using the linear embedding network to obtain an embedding feature map; processing the embedding feature map by using the encoding network to obtain i encoding feature maps; processing the i encoding feature maps by using the decoding network to obtain a weight fusion feature map, wherein the decoding sub-network is constructed based on a local-global attention layer and a weight feature fusion layer; performing segmentation head mapping processing on the weight fusion feature map to obtain an image segmentation result, wherein the image segmentation result represents a geological change attribute of a target geographical environment region; and training the image extraction model according to the image segmentation result and label data corresponding to the image segmentation result to obtain a trained image extraction model.

Description

Technical Field

[0001] This disclosure relates to the fields of image processing and pattern recognition, and more specifically, to a training method for an image extraction model, an image extraction method, an apparatus, a device, a medium, and a program product.

Background Technology

[0002] With the development of computer technology, landslide extraction using machine learning models combined with the large number of high-resolution remote sensing images obtained by remote sensing technology has made great progress and is more efficient than manual visual interpretation. However, it still requires manual selection of features, and the landslide extraction results are subject to subjective influence.

[0003] Current methods perform well in extracting large landslides that are significantly different from the background features. However, for complex terrains such as mountains and hills, the large variations in terrain and lighting conditions can make landslide extraction difficult. When the landslide shares similarities in color, texture, or other features with the background features, the model may confuse the landslide with the background, thus affecting extraction accuracy.

Summary of the Invention

[0004] In view of the above problems, this disclosure provides a training method for an image extraction model, an image extraction method, an apparatus, a device, a medium, and a program product.

[0005] According to a first aspect of this disclosure, a training method for an image extraction model is provided. The image extraction model includes a linear embedding network, an encoding network, and a decoding network. The training method comprises: processing an acquired sample image using the linear embedding network to obtain an embedding feature map; processing the embedding feature map using the encoding network to obtain i encoded feature maps, wherein the encoding network includes i sequentially connected encoding sub-networks, one encoding sub-network corresponding to one encoded feature map, i > 1; processing the i encoded feature maps using the decoding network to obtain a weighted fusion feature map, wherein the decoding network includes j sequentially connected decoding sub-networks, the decoding sub-networks being constructed based on a local global attention layer and a weighted feature fusion layer, j > 0; performing segmentation head mapping processing on the weighted fusion feature map to obtain an image segmentation result, wherein the image segmentation result characterizes the geological change attributes of the target geographic environment region; and training the image extraction model based on the image segmentation result and the label data corresponding to the image segmentation result to obtain the trained image extraction model.

[0006] A second aspect of this disclosure provides an image extraction method, comprising: acquiring a test sample image; inputting the test sample image into a trained image extraction model to obtain an image segmentation result corresponding to the test sample image.

[0007] A third aspect of this disclosure provides a training apparatus for an image extraction model, comprising:

[0008] The embedding module is used to process the acquired sample images using a linear embedding network to obtain embedding feature maps;

[0009] The processing module is used to process the embedded feature maps using an encoding network to obtain i encoded feature maps, wherein the encoding network includes i sequentially connected encoding sub-networks, and one encoding sub-network corresponds to one encoded feature map, i>1;

[0010] The weight module is used to process i encoded feature maps using the decoding network to obtain a weighted fusion feature map. The decoding network includes j decoder sub-networks connected in sequence. The decoder sub-networks are constructed based on a local global attention layer and a weighted feature fusion layer, where j > 0.

[0011] The training results module is used to perform segmentation head mapping processing on the weighted fusion feature map to obtain the image segmentation result, where the image segmentation result represents the geological change attributes of the target geographic environment region; and

[0012] The training module is used to train the image extraction model based on the image segmentation results and the corresponding label data, resulting in the trained image extraction model.

[0013] A fourth aspect of this disclosure provides an image extraction apparatus, comprising:

[0014] The acquisition module is used to acquire test sample images;

[0015] The test results module is used to input the test sample image into the trained image extraction model to obtain the image segmentation result corresponding to the test sample image.

[0016] A fifth aspect of this disclosure provides an electronic device comprising: one or more processors; and a memory for storing one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors perform the method described above.

[0017] A sixth aspect of this disclosure also provides a computer-readable storage medium having executable instructions stored thereon, which, when executed by a processor, cause the processor to perform the methods described above.

[0018] The seventh aspect of this disclosure also provides a computer program product, including a computer program that, when executed by a processor, implements the above-described method.

[0019] According to embodiments of this disclosure, sample images are processed using a linear embedding network to obtain embedded feature maps. These embedded feature maps are then processed using an encoding network to obtain i encoded feature maps. Finally, the i encoded feature maps are processed using a decoding network to obtain a weighted fusion feature map. The decoding network comprises j sequentially connected decoding sub-networks, each constructed based on a local-to-global attention layer and a weighted feature fusion layer. The weighted fusion feature map is processed using a segmentation head mapping to obtain an image segmentation result. An image extraction model is trained based on the image segmentation result and the corresponding label data to obtain the trained image extraction model. By using the decoding network, the contributions of each feature map during the extraction process can be balanced, reducing overfitting. Furthermore, since the decoding sub-network is constructed based on a local-to-global attention layer, it can improve the extraction of landslide details and edges, thereby enhancing the robustness of the model.

Attached Figure Description

[0020] The foregoing contents, as well as other objects, features, and advantages of this disclosure, will become clearer from the following description of embodiments with reference to the accompanying drawings, in which:

[0021] Figure 1 schematically shows an application scenario of the image extraction method according to an embodiment of the present disclosure;

[0022] Figure 2 schematically shows a flowchart of a training method for an image extraction model according to an embodiment of the present disclosure;

[0023] Figure 3 schematically shows a weighted feature fusion layer according to an embodiment of the present disclosure;

[0024] Figure 4 schematically shows a local global attention layer according to an embodiment of the present disclosure;

[0025] Figure 5 schematically shows a local global attention sublayer according to an embodiment of the present disclosure;

[0026] Figure 6 schematically shows an image extraction model according to an embodiment of the present disclosure;

[0027] Figure 7 schematically shows a flowchart of an image extraction method according to an embodiment of the present disclosure;

[0028] Figure 8 schematically shows a flowchart of the implementation of an image extraction method according to an embodiment of the present disclosure;

[0029] Figure 9 schematically shows a landslide extraction map according to an embodiment of the present disclosure;

[0030] Figure 10 schematically shows a structural block diagram of a training apparatus for an image extraction model according to an embodiment of the present disclosure;

[0031] Figure 11 schematically shows a block diagram of an image extraction apparatus according to an embodiment of the present disclosure; and

[0032] Figure 12 schematically shows a block diagram of an electronic device suitable for implementing an image extraction method according to an embodiment of the present disclosure.

Detailed Implementation

[0033] The embodiments of the present disclosure will now be described with reference to the accompanying drawings. However, it should be understood that these descriptions are exemplary only and are not intended to limit the scope of the disclosure. In the following detailed description, numerous specific details are set forth to provide a thorough understanding of the embodiments of the present disclosure for ease of explanation. However, it will be apparent that one or more embodiments may be practiced without these specific details. Furthermore, descriptions of well-known structures and techniques are omitted in the following description to avoid unnecessarily obscuring the concepts of the present disclosure.

[0034] The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit this disclosure. The terms “comprising,” “including,” etc., as used herein indicate the presence of the stated features, steps, operations, and / or components, but do not exclude the presence or addition of one or more other features, steps, operations, or components.

[0035] All terms used herein (including technical and scientific terms) have the meanings commonly understood by those skilled in the art, unless otherwise defined. It should be noted that the terms used herein should be interpreted in a manner consistent with the context of this specification, and not in an idealized or overly rigid way.

[0036] When using expressions such as "at least one of A, B, and C", they should generally be interpreted in accordance with the meaning that is commonly understood by a person skilled in the art (e.g., "a system having at least one of A, B, and C" should include, but is not limited to, a system having A alone, a system having B alone, a system having C alone, a system having A and B, a system having A and C, a system having B and C, and / or a system having A, B, and C, etc.).

[0037] Landslide extraction research plays a crucial role in developing effective response strategies and mitigating the threats posed by landslides to infrastructure safety. Simultaneously, landslide extraction research provides management departments with detailed information on landslides, assisting them in formulating more effective rescue strategies. In recent years, with the rapid development of remote sensing technology, landslide extraction methods have evolved from traditional manual visual interpretation to automated landslide extraction using machine learning. Semantic segmentation methods from deep learning have also begun to be applied to landslide extraction tasks. Among these methods, manual visual interpretation is a traditional and widely used approach, where interpreters distinguish landslide boundaries by observing the texture, shape, and color features of landslides on remote sensing images with the naked eye. While highly accurate, it is susceptible to subjective influence from the interpreter and has relatively low efficiency. With the development of computer technology, significant progress has been made in landslide extraction using machine learning models combined with a large amount of high-resolution remote sensing imagery obtained through remote sensing technology. This approach is more efficient, but still requires manual feature selection, and the landslide extraction results are subject to subjective influence. Semantic segmentation methods using deep learning models can achieve automated feature extraction, enabling pixel-level classification of remote sensing images and more finely dividing different categories within the image, effectively distinguishing landslides from the background. To improve the generalization ability of semantic segmentation models, one strategy is to train the model on multiple different datasets and then apply it to other datasets.

[0038] The inventors discovered that the current method works well for extracting large landslides that are significantly different from the background features. However, for complex terrains such as mountains and hills, the large variations in terrain affect the lighting conditions of the image, which may make landslide extraction difficult. For some small landslides or those with fine edge details, accurate extraction may be impossible. When the landslide has similar color, texture, or other features to the background features, the model may confuse the landslide with the background features, affecting extraction accuracy. The method also lacks portability; factors such as soil and vegetation in different regions can cause differences in the spectral characteristics of landslide areas, leading to poor performance when extracting landslides in new areas.

[0039] In view of this, this disclosure provides a training method for an image extraction model and an image extraction method. The image extraction model includes a linear embedding network, an encoding network, and a decoding network. The training method comprises: processing the acquired sample image using the linear embedding network to obtain an embedding feature map; processing the embedding feature map using the encoding network to obtain i encoded feature maps, wherein the encoding network includes i sequentially connected encoding sub-networks, one encoding sub-network corresponding to one encoded feature map, i>1; processing the i encoded feature maps using the decoding network to obtain a weighted fusion feature map, wherein the decoding network includes j sequentially connected decoding sub-networks, the decoding sub-networks being constructed based on a local global attention layer and a weighted feature fusion layer, j>0; performing segmentation head mapping processing on the weighted fusion feature map to obtain an image segmentation result, wherein the image segmentation result characterizes the geological change attributes of the target geographical environment region; and training the image extraction model based on the image segmentation result and the label data corresponding to the image segmentation result to obtain the trained image extraction model.

[0040] In the technical solution disclosed herein, the user information (including but not limited to user personal information, user image information, user device information, such as location information) and data (including but not limited to data used for analysis, stored data, and displayed data) involved are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, storage, use, processing, transmission, provision, disclosure, and application of related data all comply with relevant laws, regulations, and standards, necessary confidentiality measures have been taken, and they do not violate public order and good morals. Corresponding operation entry points are provided for users to choose to authorize or refuse.

[0041] Figure 1 schematically shows an application scenario of the image extraction method according to an embodiment of the present disclosure.

[0042] As shown in Figure 1, application scenario 100 according to this embodiment may include terminal devices 101, 102, and 103, network 104, and server 105. Network 104 is used as a medium to provide a communication link between terminal devices 101, 102, and 103 and server 105. Network 104 may include various connection types, such as wired or wireless communication links or fiber optic cables.

[0043] Users can use terminal devices 101, 102, and 103 to interact with server 105 via network 104 to receive or send messages, etc. Various communication client applications can be installed on terminal devices 101, 102, and 103, such as shopping applications, web browser applications, search applications, instant messaging tools, email clients, social media platform software, etc. (for example only).

[0044] Terminal devices 101, 102, and 103 can be various electronic devices with displays and web browsing capabilities, including but not limited to smartphones, tablets, laptops, and desktop computers.

[0045] Server 105 can be a server that provides various services, such as a backend management server that supports websites browsed by users using terminal devices 101, 102, and 103 (for example only). The backend management server can analyze and process data such as received user requests, and feed back the processing results (such as web pages, information, or data obtained or generated according to user requests) to the terminal devices.

[0046] It should be noted that the image extraction model training method and image extraction method provided in this embodiment can generally be executed by server 105. Correspondingly, the image extraction model training device and image extraction device provided in this embodiment can generally be located in server 105. The image extraction model training method and image extraction method provided in this embodiment can also be executed by a server or server cluster that is different from server 105 and capable of communicating with terminal devices 101, 102, 103 and / or server 105. Correspondingly, the image extraction model training device and image extraction device provided in this embodiment can also be located in a server or server cluster that is different from server 105 and capable of communicating with terminal devices 101, 102, 103 and / or server 105.

[0047] It should be understood that the number of terminal devices, networks, and servers shown in Figure 1 is merely illustrative. Depending on implementation needs, any number of terminal devices, networks, and servers can be included.

[0048] Figure 2 schematically shows a flowchart of a training method for an image extraction model according to an embodiment of the present disclosure.

[0049] As shown in Figure 2, the training method of the image extraction model in this embodiment includes operations S210 to S250.

[0050] In operation S210, the acquired sample images are processed using a linear embedding network to obtain an embedding feature map.

[0051] In operation S220, the embedded feature maps are processed using an encoding network to obtain i encoded feature maps.

[0052] In operation S230, the decoding network processes i encoded feature maps to obtain a weighted fused feature map.

[0053] In operation S240, the weighted fusion feature map is processed by segmentation head mapping to obtain the image segmentation result.

[0054] In operation S250, an image extraction model is trained based on the image segmentation results and the corresponding label data to obtain the trained image extraction model.

[0055] According to embodiments of this disclosure, the sample images represent target geographical environments, such as landslide-prone mountainous areas, but are not limited to these. Embodiments of this disclosure do not limit the specific category of the sample images. The acquired sample images can be images that have undergone preprocessing of the original images, standardizing the size of the original images. For large-area remote sensing images, the images can be cropped and resized to 256×256. The model input channels can select three channels: red, green, and blue.
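As a minimal sketch of the cropping step described above, the following computes 256×256 crop windows over a large remote sensing image. The function name `tile_windows`, the non-overlapping stride, and the inward shift of edge tiles are illustrative assumptions; the disclosure only states that images can be cropped and resized to 256×256.

```python
def tile_windows(height, width, tile=256, stride=256):
    """Return (top, left, bottom, right) windows covering a height x width image.

    Windows at the bottom/right edge are shifted inward so that every tile
    is exactly tile x tile (assumed policy, not taken from the disclosure).
    """
    if height < tile or width < tile:
        raise ValueError("image smaller than one tile; resize it instead")
    tops = list(range(0, height - tile + 1, stride))
    lefts = list(range(0, width - tile + 1, stride))
    # Add one shifted window when the stride does not land on the edge.
    if tops[-1] + tile < height:
        tops.append(height - tile)
    if lefts[-1] + tile < width:
        lefts.append(width - tile)
    return [(t, l, t + tile, l + tile) for t in tops for l in lefts]


# Example: a 600 x 512 scene is covered by 3 x 2 tiles.
windows = tile_windows(600, 512)
```

Each window can then be read as a three-channel (red, green, blue) crop and passed to the model as one sample image.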

[0056] According to embodiments of this disclosure, a linear embedding network may segment a sample image into sub-images and convert the sub-images into a linear embedding feature map. The acquired sample image can be processed using the linear embedding network to obtain the embedding feature map, which is then input into the subsequent networks.

[0057] According to embodiments of this disclosure, an encoding network can be used to process embedded feature maps to obtain i encoded feature maps. The encoding network may include i sequentially connected encoding sub-networks, with one encoding sub-network corresponding to one encoded feature map, where i > 1. One encoding sub-network can generate a corresponding encoded feature map. Specifically, an encoding sub-network may include a Swin Transformer Block and image patch merging. The Swin Transformer Block may include self-attention computation, self-attention encoding, and relative position encoding. Image patch merging can combine the generated features to obtain the encoded feature map.

[0058] According to embodiments of this disclosure, a decoding network can be used to process i encoded feature maps to obtain a weighted fusion feature map. The decoding network may include j sequentially connected decoding sub-networks, which are constructed based on a local global attention layer and a weighted feature fusion layer, where j > 0.

[0059] According to embodiments of this disclosure, the segmentation head can refer to the part of the model specifically used to perform image segmentation tasks. For pixel-level classification tasks, it can assign a category label to each pixel in the image. The segmentation head can be used to map the weighted fusion feature map to obtain the image segmentation result. The image segmentation result characterizes the geological change attributes of the target geographic environment area, for example, the geological changes after a landslide; embodiments of this disclosure are not limited to this. The segmentation head can transform the feature map into a classification map with the same spatial dimensions as the input image. During training, the network parameters can be optimized by comparing the output of the segmentation head with the true label map. The image extraction model can be trained based on the image segmentation result and the corresponding label data to obtain the trained image extraction model.
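The per-pixel classification and the comparison with the true label map can be sketched as follows. This is an illustration only: the disclosure does not name the loss function, so the pixel-wise cross-entropy used here is an assumption, as are the function names `segment` and `pixel_cross_entropy` and the two-class (background, landslide) toy logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def segment(logit_map):
    """Per-pixel argmax over class logits, giving an H x W label map."""
    return [[max(range(len(px)), key=px.__getitem__) for px in row]
            for row in logit_map]


def pixel_cross_entropy(logit_map, label_map):
    """Mean cross-entropy between per-pixel logits and the true label map
    (assumed loss; the disclosure does not specify one)."""
    losses = [-math.log(softmax(px)[label])
              for row, labels in zip(logit_map, label_map)
              for px, label in zip(row, labels)]
    return sum(losses) / len(losses)


# Toy 1 x 2 "image" with two classes: 0 = background, 1 = landslide.
logit_map = [[[2.0, 0.0], [0.0, 3.0]]]
label_map = [[0, 1]]
predicted = segment(logit_map)
loss_correct = pixel_cross_entropy(logit_map, label_map)
loss_wrong = pixel_cross_entropy(logit_map, [[1, 0]])
```

The loss is lower when the label map agrees with the predicted classes, which is the signal an optimizer would use to update the network parameters.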

[0060] According to embodiments of this disclosure, sample images are processed using a linear embedding network to obtain embedded feature maps. These embedded feature maps are then processed using an encoding network to obtain i encoded feature maps. Finally, the i encoded feature maps are processed using a decoding network to obtain a weighted fusion feature map. The decoding network comprises j sequentially connected decoding sub-networks, each constructed based on a local-to-global attention layer and a weighted feature fusion layer. The weighted fusion feature map is processed using a segmentation head mapping to obtain an image segmentation result. An image extraction model is trained based on the image segmentation result and the corresponding label data to obtain the trained image extraction model. By using the decoding network, the contributions of each feature map during the extraction process can be balanced, reducing overfitting. Furthermore, since the decoding sub-network is constructed based on a local-to-global attention layer, it can improve the extraction of landslide details and edges, thereby enhancing the robustness of the model.

[0061] According to embodiments of this disclosure, processing the acquired sample image using a linear embedding network to obtain an embedding feature map includes: performing image segmentation processing on the sample image to obtain multiple sample sub-images; and performing linear embedding processing on the sample sub-images to obtain the embedding feature map.

[0062] According to embodiments of this disclosure, a sample image can be segmented into multiple sample sub-images. Here, embodiments of this disclosure do not limit the number of sample sub-images. Linear embedding processing can be performed on the sample sub-images to obtain embedded feature maps.
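A minimal sketch of these two steps, assuming square non-overlapping sub-images and a single shared linear projection. The patch size, the embedding dimension, and the names `partition_patches` and `linear_embed` are hypothetical choices for illustration; the disclosure does not fix them.

```python
import random


def partition_patches(image, patch):
    """Split an H x W x C nested-list image into flattened patch vectors.

    Returns grid[row][col], where each entry holds patch * patch * C values.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    grid = []
    for top in range(0, height, patch):
        row = []
        for left in range(0, width, patch):
            vec = [value
                   for y in range(top, top + patch)
                   for x in range(left, left + patch)
                   for value in image[y][x]]
            row.append(vec)
        grid.append(row)
    return grid


def linear_embed(grid, weight, bias):
    """Apply one shared linear map (weight: D x K, bias: D) to every patch."""
    return [[[sum(w * v for w, v in zip(w_row, vec)) + b
              for w_row, b in zip(weight, bias)]
             for vec in row]
            for row in grid]


rng = random.Random(0)
H, W, C, P, D = 8, 8, 3, 4, 6  # toy sizes, not values from the disclosure
image = [[[rng.random() for _ in range(C)] for _ in range(W)] for _ in range(H)]
weight = [[rng.uniform(-0.1, 0.1) for _ in range(P * P * C)] for _ in range(D)]
bias = [0.0] * D

patches = partition_patches(image, P)
embedded = linear_embed(patches, weight, bias)
shape = (len(embedded), len(embedded[0]), len(embedded[0][0]))
```

With these toy sizes the 8×8×3 image becomes a 2×2 grid of 6-dimensional embeddings, which plays the role of the embedding feature map passed to the encoding network.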

[0063] According to embodiments of this disclosure, obtaining a weighted fusion feature map by processing the i encoded feature maps using a decoding network includes: when m = i-1 and n = i, processing the n-th encoded feature map and the (n-1)-th encoded feature map using the m-th decoding sub-network to obtain the m-th intermediate fusion feature map, where i > 1; when m = n < i-1, processing the (m+1)-th intermediate fusion feature map and the n-th encoded feature map using the m-th decoding sub-network to obtain the m-th intermediate fusion feature map; and when m = 1, determining the m-th intermediate fusion feature map as the weighted fusion feature map.

[0064] Figure 3 schematically shows an image extraction model according to an embodiment of the present disclosure.

[0065] As shown in Figure 3, the sample images can be processed using the linear embedding network 310 to obtain the embedding feature map. Specifically, the sample images can be divided into multiple sample sub-images; the sample sub-images can be linearly embedded to obtain the embedding feature map.

[0066] According to embodiments of this disclosure, the embedding feature map output by the linear embedding network 310 can be processed using the encoding network 320. As shown in Figure 3, the encoding network 320 includes encoding sub-networks 321, 322, 323, and 324. However, it is not limited to this, and the embodiments of this disclosure do not limit the number of encoding sub-networks. The embedding feature map output by the linear embedding network 310 is input into the encoding sub-network 321 to obtain the corresponding encoded feature map. That encoded feature map is then input into the encoding sub-network 322 to obtain the encoded feature map corresponding to the encoding sub-network 322, and so on, ultimately resulting in four encoded feature maps. As shown in Figure 3, the encoding network consists of four sequentially connected encoding sub-networks, with each encoding sub-network corresponding to one encoded feature map.
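The sequential chaining described above, in which each encoding sub-network consumes the previous output and every output is kept, can be sketched as follows. The stages here are hypothetical stand-ins that operate on (height, width, channels) shape tuples; the halving of resolution and doubling of channels follows a common Swin-style convention and is an assumption, not a detail stated in the disclosure.

```python
def run_encoder(embedding, sub_networks):
    """Feed the embedding through sequentially connected sub-networks.

    Each sub-network takes the previous output as input, and every output
    is collected, so the result holds one feature map per sub-network.
    """
    if len(sub_networks) < 2:
        raise ValueError("the disclosure requires i > 1 encoding sub-networks")
    features = []
    current = embedding
    for stage in sub_networks:
        current = stage(current)
        features.append(current)
    return features


def make_toy_stage(downsample):
    """Hypothetical stand-in for an encoding sub-network, acting on shapes."""
    def stage(shape):
        h, w, c = shape
        # Assumed convention: patch merging halves H and W and doubles C.
        return (h // 2, w // 2, c * 2) if downsample else shape
    return stage


# Four stages, mirroring encoding sub-networks 321 to 324 in the example.
stages = [make_toy_stage(False)] + [make_toy_stage(True) for _ in range(3)]
encoded_shapes = run_encoder((64, 64, 96), stages)
```

The list `encoded_shapes` has four entries, one per encoding sub-network, and all four are later consumed by the decoding network.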

[0067] According to embodiments of this disclosure, the decoding network 330 may include three sequentially connected decoding sub-networks, specifically decoding sub-network 331, decoding sub-network 332, and decoding sub-network 333. However, it is not limited to this; the embodiments of this disclosure only illustrate the case of three decoding sub-networks. The decoding sub-networks are constructed based on a local global attention layer (i.e., the Swin-CNNTransformer Block module in the figure) and a weighted feature fusion layer (i.e., the weighted fusion module in the figure). Decoding network 330 can process the four encoded feature maps to obtain a weighted fusion feature map. Specifically, the encoded feature map generated by encoding sub-network 324 is input via a skip connection to the local global attention layer in decoding sub-network 333 to obtain the corresponding feature map. This feature map, along with the encoded feature map generated by encoding sub-network 323, is then input to the weighted feature fusion layer in decoding sub-network 333 to obtain the third intermediate fusion feature map. Similarly, the second intermediate fusion feature map output by decoding sub-network 332 and the encoded feature map output by encoding sub-network 321 are input to decoding sub-network 331, finally obtaining the weighted fusion feature map. Image segmentation results can be obtained by performing segmentation head mapping on the weighted fusion feature map. This disclosure only illustrates four encoding sub-networks and three decoding sub-networks, but it is not limited to this; the number of encoding sub-networks is not limited in this disclosure.

[0068] According to embodiments of this disclosure, the Adam (adaptive moment estimation) optimizer is used when training the image extraction model. To prevent overfitting, L2 regularization is applied to the model parameters, i.e., a penalty on the squared weights is added to the loss function. The initial learning rate (LR) is set to 0.001. To improve the stability and effectiveness of training, a learning rate adjuster is created that dynamically reduces the learning rate when the model's performance on the validation set no longer improves; here, the reduction factor is set to 0.5. The model input image size is set to 256×256 pixels, with 3 input channels. The model simultaneously reads in the image and its corresponding segmentation mask for training.
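The learning rate adjustment and the L2 penalty can be sketched in a framework-independent way as follows. The initial learning rate of 0.001 and the reduction factor of 0.5 come from the text; the patience of 2 epochs, the minimum learning rate, the penalty coefficient, and the names `PlateauLRReducer` and `l2_penalty` are assumptions made for illustration.

```python
class PlateauLRReducer:
    """Halve the learning rate when the validation loss stops improving."""

    def __init__(self, lr=0.001, factor=0.5, patience=2, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience  # assumed value; not given in the disclosure
        self.min_lr = min_lr      # assumed floor; not given in the disclosure
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the current rate."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


def l2_penalty(weights, coefficient):
    """L2 regularization term: coefficient times the sum of squared weights."""
    return coefficient * sum(w * w for w in weights)


reducer = PlateauLRReducer()
history = [reducer.step(loss) for loss in [1.0, 0.9, 0.95, 0.95, 0.95]]
penalty = l2_penalty([1.0, -2.0], coefficient=0.1)
```

In this run the validation loss fails to improve for two consecutive epochs, so the rate drops from 0.001 to 0.0005 at the fourth epoch. The penalty value would be added to the segmentation loss before the optimizer update.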

[0069] According to embodiments of this disclosure, a decoding network can be used to process i encoded feature maps, where i > 1, to obtain a weighted fusion feature map. When m = i-1 and n = i, the m-th decoding sub-network is used to process the n-th and (n-1)-th encoded feature maps to obtain the m-th intermediate fusion feature map. For example, as shown in Figure 3, the third decoding sub-network 333 can be used to process the fourth encoded feature map generated by the encoding sub-network 324 and the third encoded feature map generated by the encoding sub-network 323 to obtain the third intermediate fusion feature map. Here, the fourth encoded feature map is generated by the fourth encoding sub-network, and the third encoded feature map is generated by the third encoding sub-network.

[0070] According to embodiments of this disclosure, when m = n < i-1, the m-th decoding sub-network processes the (m+1)-th intermediate fused feature map and the n-th encoded feature map to obtain the m-th intermediate fused feature map. For example, as shown in Figure 3, when m = n = 2, the second decoding sub-network can be used to process the third intermediate fused feature map and the second encoded feature map to obtain the second intermediate fused feature map. When m = 1, the first decoding sub-network can be used to process the second intermediate fused feature map and the first encoded feature map to obtain the first intermediate fused feature map (i.e., the result output by the decoding sub-network 331). The first intermediate fused feature map is then taken as the weighted fused feature map.
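The indexing of the decoding sub-networks can be made concrete with a short sketch. The function name `decode` and the string stand-ins for feature maps and sub-networks are hypothetical; only the order in which the maps are combined follows the description above.

```python
def decode(encoded, decoders):
    """Run the decoding sub-networks from the deepest to the shallowest.

    encoded  : list of i encoded feature maps; encoded[0] is the 1st map
    decoders : list of i-1 callables; decoders[m-1] is the m-th sub-network
               and takes (deep_input, skip_input)
    """
    i = len(encoded)
    assert i > 1 and len(decoders) == i - 1
    # Case m = i-1, n = i: fuse the n-th and (n-1)-th encoded feature maps.
    fused = decoders[i - 2](encoded[i - 1], encoded[i - 2])
    # Case m = n < i-1: fuse the (m+1)-th intermediate map with the n-th encoded map.
    for m in range(i - 2, 0, -1):
        fused = decoders[m - 1](fused, encoded[m - 1])
    return fused  # first intermediate map, i.e. the weighted fusion feature map


# Toy stand-ins: feature maps are strings and each sub-network records its inputs.
encoded = ["E1", "E2", "E3", "E4"]
decoders = [lambda deep, skip, m=m: f"D{m}({deep},{skip})" for m in (1, 2, 3)]
trace = decode(encoded, decoders)
```

With four encoded maps, `trace` spells out the nesting: the third sub-network sees E4 and E3, the second sees that result and E2, and the first sees that result and E1.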

[0071] According to embodiments of this disclosure, the m-th intermediate fusion feature map is generated as follows: A first attention feature map is obtained by processing a first input feature map using a local-to-global attention layer, wherein the first input feature map includes either an encoded feature map or the m-th intermediate fusion feature map; the first attention feature map and the encoded feature map are processed using a weighted feature fusion layer to obtain an output feature map, wherein the output feature map includes either the m-th intermediate fusion feature map or a weighted fusion feature map; wherein processing the first attention feature map and the encoded feature map using a weighted feature fusion layer to obtain the output feature map includes: adjusting the weight ratio of the first attention feature map and the encoded feature map using learnable parameters, wherein the learnable parameters characterize the weight ratio between different feature maps; and adding the first attention feature map and the encoded feature map according to the weight ratio to obtain the output feature map.

[0072] According to embodiments of this disclosure, a first input feature map can be processed using a local-to-global attention layer to obtain a first attention feature map. The first input feature map may include an encoded feature map or an m-th intermediate fused feature map.

[0073] According to embodiments of this disclosure, a weighted feature fusion layer can be used to process the first attention feature map and the encoded feature map to obtain an output feature map. The output feature map may include the m-th intermediate fused feature map or the weighted fused feature map.

[0074] Figure 4 schematically shows a weighted feature fusion layer according to an embodiment of the present disclosure.

[0075] As shown in Figure 4, X1 represents the encoding feature map and X2 represents the first attention feature map. First, X2 can be upsampled to the same dimensions as X1. Then, the weight ratio can be adjusted using the learnable weight parameters. Based on the weight ratio, the first attention feature map and the encoding feature map can be added together to produce the output feature map.

[0076] According to embodiments of this disclosure, in the weighted feature fusion layer, learnable parameters can be used to adjust the weight ratio of the first attention feature map and the encoding feature map, where the learnable parameters represent the weight ratio between different feature maps. Based on the weight ratio, the first attention feature map and the encoding feature map can be added together to obtain the output feature map. For example, suppose the learnable parameters set the weight of the first attention feature map to 0.4 and the weight of the encoding feature map to 0.6. The output feature map is then obtained by adding 0.4 times the first attention feature map to 0.6 times the encoding feature map. The output feature map fuses the features of the first attention feature map and the encoding feature map, balancing the contributions of each feature map during the extraction process and reducing overfitting of the model.
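A minimal sketch of the weighted fusion, using the 0.4 and 0.6 weights from the example above. The nearest-neighbour upsampling, the function names, and the toy values are assumptions; the disclosure does not specify the upsampling method, and in the model the two weights would be learnable parameters rather than fixed floats.

```python
def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a 2-D map (the method is an assumption)."""
    return [[v for v in row for _ in range(factor)]
            for row in x for _ in range(factor)]


def weighted_fusion(x1, x2, w1, w2):
    """Return w1 * x1 + w2 * x2 after bringing x2 to the size of x1.

    x1 : encoding feature map, x2 : first attention feature map.
    w1, w2 stand in for the learnable weights.
    """
    factor = len(x1) // len(x2)
    x2_up = upsample_nearest(x2, factor)
    return [[w1 * a + w2 * b for a, b in zip(r1, r2)]
            for r1, r2 in zip(x1, x2_up)]


x1 = [[1.0, 1.0, 2.0, 2.0] for _ in range(4)]  # 4x4 encoding feature map
x2 = [[10.0, 20.0], [30.0, 40.0]]              # 2x2 first attention feature map
out = weighted_fusion(x1, x2, w1=0.6, w2=0.4)
```

Each output element is a convex combination of the two inputs, so neither feature map can dominate unless training drives its weight up.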

[0077] According to embodiments of this disclosure, a local-to-global attention layer can be used to improve the extraction of landslide details and landslide edges, thereby enhancing the model's extraction capability. By using a weighted feature fusion layer, the contributions of each feature map during the extraction process can be balanced, reducing overfitting and improving the model's robustness.

[0078] According to embodiments of this disclosure, processing a first input feature map using a local global attention layer to obtain a first attention feature map includes: processing the first input feature map using a local global attention sub-layer to obtain a first feature map; obtaining a second feature map based on the first input feature map and the first feature map; processing the second feature map using a multilayer perceptron sub-layer to obtain a local feature map, wherein the local global attention layer includes a local global attention sub-layer and a multilayer perceptron sub-layer; and obtaining the first attention feature map based on the second feature map and the local feature map.

[0079] Figure 5 schematically shows a local-global attention layer according to an embodiment of the present disclosure.

[0080] As shown in Figure 5, the first input feature map can be processed using the normalization unit 522 (i.e., the Norm module) to obtain a normalized feature map. This normalized feature map is then input to the local-global attention unit 521 (i.e., the Local-Global Attention module), which outputs the first feature map. The first input feature map and the first feature map can be added together to obtain the second feature map. The second feature map can be processed using the multilayer perceptron sublayer 510 to obtain a local feature map. The local-global attention layer includes the local-global attention sublayer 520 and the multilayer perceptron sublayer 510. The second feature map and the local feature map can be added together to obtain the first attention feature map.

[0081] According to embodiments of this disclosure, a first input feature map can be processed using the local-global attention sublayer 520 to obtain a first feature map. A second feature map can be obtained based on the first input feature map and the first feature map. A multilayer perceptron sublayer 510 can be used to process the second feature map to obtain a local feature map. The multilayer perceptron sublayer 510 may include a first normalization unit 512 (i.e., a Norm module) and a multilayer perceptron unit 511 (i.e., an MLP module), connected in series. The second feature map is input to the first normalization unit 512 in the multilayer perceptron sublayer 510, which outputs a corresponding first normalized feature map. The first normalized feature map is input to the multilayer perceptron unit 511, which outputs a local feature map. The local feature map and the second feature map can then be added together to obtain the first attention feature map.
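The two residual additions in the local-global attention layer can be sketched as follows. Feature maps are reduced to flat lists of floats, and the `attention` and `mlp` callables are hypothetical placeholders for the real units; using layer normalization for the Norm modules is also an assumption.

```python
import math


def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (assumed Norm module)."""
    mean = sum(v) / len(v)
    var = sum((t - mean) ** 2 for t in v) / len(v)
    return [(t - mean) / math.sqrt(var + eps) for t in v]


def local_global_attention_layer(x, attention, mlp):
    """Norm -> attention -> add, then Norm -> MLP -> add."""
    first = attention(layer_norm(x))                # first feature map
    second = [a + b for a, b in zip(x, first)]      # second feature map
    local = mlp(layer_norm(second))                 # local feature map
    return [a + b for a, b in zip(second, local)]   # first attention feature map


# Placeholder units, for illustration only.
attention = lambda v: [0.5 * t for t in v]
mlp = lambda v: [max(0.0, t) for t in v]

x = [1.0, 2.0, 3.0]
out = local_global_attention_layer(x, attention, mlp)
```

Because both sublayers add their output to their input, the layer can fall back to passing the input through largely unchanged when a sublayer contributes little.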

[0082] According to an embodiment of this disclosure, processing a first input feature map using a local global attention sublayer to obtain a first feature map includes: processing the first input feature map using a normalization unit to obtain a normalized feature map; inputting the normalized feature map into a local global attention unit and outputting the first feature map, wherein the local global attention sublayer includes a normalization unit and a local global attention unit.

[0083] According to embodiments of this disclosure, a normalization unit can be used to process the first input feature map to obtain a normalized feature map. This normalized feature map is then input to a local-global attention unit, which outputs the first feature map. The local-global attention sublayer can include both a normalization unit and a local-global attention unit, enabling the simultaneous learning of local and global features, thus improving the extraction of landslide details.

[0084] According to embodiments of this disclosure, inputting a normalized feature map into a local global attention unit to obtain a first feature map includes: performing convolutional normalization on the normalized feature map to obtain a convolutional feature map; performing function mapping on the normalized feature map to obtain a mapped feature map; summing the convolutional feature map and the mapped feature map to obtain a normalized mapped feature map; performing convolutional processing on the normalized mapped feature map to obtain the first feature map; wherein, performing convolutional normalization on the normalized feature map to obtain the convolutional feature map includes: processing the normalized feature map using a first convolutional unit to obtain a first convolutional feature map; processing the normalized feature map using a second convolutional unit to obtain a second convolutional feature map, wherein the first convolutional kernel of the first convolutional unit is different from the second convolutional kernel of the second convolutional unit; and adding the first convolutional feature map and the second convolutional feature map to obtain the convolutional feature map.

[0085] According to embodiments of this disclosure, a normalized feature map can be subjected to convolutional normalization processing to obtain a convolutional feature map; the normalized feature map can then be mapped using a function to obtain a mapped feature map. Specifically, this can include: processing the normalized feature map using a first convolutional unit to obtain a first convolutional feature map; and processing the normalized feature map using a second convolutional unit to obtain a second convolutional feature map, wherein the first convolutional kernel of the first convolutional unit and the second convolutional kernel of the second convolutional unit are different; for example, the first convolutional kernel in the first convolutional unit can be a 3×3 convolution, and the second convolutional kernel in the second convolutional unit can be a 1×1 convolution. Finally, the first convolutional feature map and the second convolutional feature map can be added together to obtain the final convolutional feature map.
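A single-channel sketch of the two parallel convolutional units and their sum. The kernel values are illustrative only, the function names are hypothetical, and the normalization step inside each convolutional unit is omitted to keep the example short.

```python
def conv2d_same(x, kernel):
    """Single-channel 2-D cross-correlation with zero padding (output size = input size)."""
    k = len(kernel)
    pad = k // 2
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for dr in range(k):
                for dc in range(k):
                    rr, cc = r + dr - pad, c + dc - pad
                    if 0 <= rr < h and 0 <= cc < w:
                        acc += kernel[dr][dc] * x[rr][cc]
            out[r][c] = acc
    return out


def local_branch(x, k3, k1):
    """Sum of the 3x3 branch and the 1x1 branch."""
    a = conv2d_same(x, k3)  # first convolutional unit, 3x3 kernel
    b = conv2d_same(x, k1)  # second convolutional unit, 1x1 kernel
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


x = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0],
     [7.0, 8.0, 9.0]]
k3 = [[1.0 / 9.0] * 3 for _ in range(3)]  # illustrative 3x3 averaging kernel
k1 = [[1.0]]                              # illustrative 1x1 identity kernel
conv_map = local_branch(x, k3, k1)
```

The 3×3 branch mixes each pixel with its neighbours while the 1×1 branch keeps per-pixel information, so the sum carries both.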

[0086] According to embodiments of this disclosure, the convolutional feature map and the mapping feature map can be summed to obtain a normalized mapping feature map. The normalized mapping feature map is then convolved to obtain a first feature map. Specifically, the normalized mapping feature map is convolved using the same convolution kernel as in the linear embedding network. After this convolution, the result can be normalized, and finally a 1×1 convolution is performed to obtain the first feature map.

[0087] According to embodiments of this disclosure, a normalized feature map is mapped using a function to obtain a mapped feature map, including: determining a query matrix, a key matrix, and a value matrix based on the feature sequence of the normalized feature map; obtaining a mapping encoding matrix based on the query matrix and the key matrix; and multiplying the mapping encoding matrix and the value matrix to obtain the mapped feature map.

[0088] According to embodiments of this disclosure, a query matrix, a key matrix, and a value matrix can be determined based on the feature sequence of a normalized feature map. After multiplying corresponding elements of the query matrix and the key matrix, a softmax mapping is performed to obtain a mapped encoding matrix. The mapped encoding matrix and the value matrix can then be multiplied to obtain a mapped feature map.
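The query, key, and value computation can be illustrated with a small sketch. Note the assumptions: the text speaks of multiplying corresponding elements of the query and key matrices, whereas this sketch uses the common matrix-product form softmax(QKᵀ/√d)·V, including the 1/√d scaling, which the disclosure does not mention. Q, K, and V are given directly here; how they are derived from the feature sequence (typically by learned linear projections) is not shown.

```python
import math


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention(q, k, v):
    """Return (mapped feature map, mapping encoding matrix)."""
    d = len(q[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d) for s in row] for row in matmul(q, k_t)]
    weights = softmax_rows(scores)      # mapping encoding matrix
    return matmul(weights, v), weights  # mapped feature map


q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
mapped, weights = attention(q, k, v)
```

Each row of `weights` sums to one, so every output row is a weighted average of the rows of `v`, which is what lets each position draw on the whole feature sequence.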

[0089] Figure 6 schematically shows a local-global attention sublayer according to an embodiment of the present disclosure.

[0090] As shown in Figure 6, the normalized feature map can be processed using the first convolutional unit 610 to obtain the first convolutional feature map. The first convolutional unit 610 includes normalization and a 3×3 convolution. The normalized feature map can be processed using the second convolutional unit 620 to obtain the second convolutional feature map. The second convolutional unit 620 includes normalization and a 1×1 convolution. That is, the first convolutional kernel of the first convolutional unit 610 is a 3×3 kernel, and the second convolutional kernel of the second convolutional unit 620 is a 1×1 kernel. The first and second convolutional feature maps can be added together to obtain the final convolutional feature map. On the right side of Figure 6 (640), the query matrix (Q), key matrix (K), and value matrix (V) can be determined based on the feature sequence of the normalized feature map. The query matrix and key matrix can be multiplied element-wise and then mapped using Softmax to obtain the mapped encoding matrix. Multiplying the mapped encoding matrix and the value matrix yields the mapped feature map. The convolutional feature map and the mapped feature map are summed to obtain the normalized mapped feature map. The normalized mapped feature map is then convolved (as shown in 630) to obtain the first feature map. The convolutional processing can include a win_size×win_size convolution (using the same kernel as in the linear embedding network), normalization, and a 1×1 convolution.

[0091] According to embodiments of this disclosure, processing the embedded feature map using an encoding network to obtain i encoded feature maps includes: when n=1, processing the embedded feature map using the nth encoding sub-network to obtain the nth encoded feature map; when n is greater than 1, processing the (n-1)th encoded feature map using the nth encoding sub-network to obtain the nth encoded feature map.

[0092] According to embodiments of this disclosure, when n = 1, the first encoding sub-network can be used to process the embedded feature map to obtain the first encoded feature map. When n is greater than 1, the (n-1)-th encoded feature map can be processed using the n-th encoding sub-network to obtain the n-th encoded feature map. For example, when n = 2, the first encoded feature map can be processed using the second encoding sub-network to obtain the second encoded feature map.
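The chaining of the encoding sub-networks amounts to a simple loop that keeps every intermediate output. In this sketch the function name `encode` is hypothetical, a feature map is represented only by its side length, and the halving at each stage is an assumption made so that the stages are distinguishable.

```python
def encode(embedded, encoders):
    """Apply the encoding sub-networks in sequence and keep every output."""
    feature_maps = []
    current = embedded
    for encoder in encoders:
        # n = 1 consumes the embedded map; n > 1 consumes the (n-1)-th encoded map.
        current = encoder(current)
        feature_maps.append(current)
    return feature_maps


# Toy stand-ins: each sub-network halves the side length (assumed ratio).
encoders = [lambda size: size // 2 for _ in range(4)]
maps = encode(64, encoders)
```

All of the encoded maps are retained because the decoding network consumes each of them, not just the last one.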

[0093] Figure 7 schematically shows a flowchart of an image extraction method according to an embodiment of the present disclosure.

[0094] As shown in Figure 7, the image extraction method of this embodiment includes operations S710 to S720.

[0095] In operation S710, a test sample image is acquired.

[0096] In operation S720, the test sample image is input into the trained image extraction model, and the image segmentation result corresponding to the test sample image is output.

[0097] According to embodiments of this disclosure, test sample images can be obtained first, then input into a trained image extraction model to output the corresponding image segmentation results.

[0098] Figure 8 schematically shows a flowchart of the implementation of an image extraction method according to an embodiment of the present disclosure.

[0099] As shown in Figure 8, the dataset is first acquired and preprocessed. The task is based on existing datasets from Region 1, Region 2, Region 3, and Region 4. To standardize the size of the input images, large remote sensing images are cropped and resized to 256×256. The red, green, and blue bands are selected as the model input channels. To improve the model's generalization ability, the Region 1 and Region 2 datasets are used for training, and the Region 3 dataset is used for testing. Training on multiple different datasets allows the model to learn the characteristics of different landslides, thus improving the model's generalization ability.
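The cropping step can be sketched as computing the top-left corners of 256×256 tiles over a large image. The function name `tile_origins` is hypothetical, and dropping incomplete border tiles is an assumption; the disclosure only states that images are cropped and resized to 256×256, so padding or resizing the remainder would be an equally plausible reading.

```python
def tile_origins(height, width, tile=256):
    """Top-left corners of non-overlapping tile x tile crops.

    Border strips smaller than a full tile are dropped in this sketch.
    """
    return [(r, c)
            for r in range(0, height - tile + 1, tile)
            for c in range(0, width - tile + 1, tile)]


origins = tile_origins(1000, 600)  # a made-up 1000 x 600 image
```

For the made-up 1000×600 image this yields three rows and two columns of tiles.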

[0100] According to embodiments of this disclosure, an image extraction model can be trained using a dataset from region 1. After obtaining the trained image extraction model, a prediction can be made using a dataset from region 3 to obtain the corresponding segmentation result. The segmentation result is then analyzed, and the network parameters in the image extraction model are adjusted. The image extraction model can then be retrained using a dataset from region 2. After obtaining the trained image extraction model, prediction is performed again to obtain the segmentation result, and its accuracy is evaluated.

[0101] Figure 9 schematically shows a landslide extraction map according to an embodiment of the present disclosure.

[0102] As shown in Figure 9, the test set was input into the trained model, and the landslide extraction results for the Region 3 dataset were obtained. The figure displays landslide samples that are more complex and harder to extract. Panels (a) and (c) are the original images, and panels (b) and (d) are the corresponding landslide extraction results. The extraction results show that the model performs well: although the original images have complex backgrounds and unclear edges, the model can still extract the landslides. In addition to the visual results, four semantic segmentation metrics were selected to evaluate the model performance: Recall, Precision, F1 Score, and mean intersection over union (mIoU). TP, FP, FN, and TN represent the pixels correctly classified as landslide, the background pixels incorrectly identified as landslide, the landslide pixels incorrectly identified as background, and the pixels correctly classified as background, respectively. The calculation formulas for these four metrics are shown in (1)-(4). The evaluation metric results are shown in Table 1.

[0103]

[0104]

[0105]

[0106]

[0107] Table 1 Landslide Extraction Assessment Index

[0108]
Recall    Precision    F1 Score    mIoU
74.02     73.74        73.88       64.45
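Formulas (1)-(4) are not reproduced in this text. The sketch below uses the standard definitions of recall, precision, and F1 score in terms of TP, FP, FN, and TN. Computing mIoU as the mean of the landslide and background IoU values is an assumption about how the metric was defined, and the pixel counts are made up.

```python
def segmentation_metrics(tp, fp, fn, tn):
    """Standard binary segmentation metrics from pixel counts."""
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    iou_landslide = tp / (tp + fp + fn)
    iou_background = tn / (tn + fn + fp)
    miou = (iou_landslide + iou_background) / 2  # assumed two-class mean
    return {"recall": recall, "precision": precision, "f1": f1, "miou": miou}


m = segmentation_metrics(tp=80, fp=20, fn=20, tn=880)  # made-up pixel counts
```

Under these definitions the F1 score in Table 1 is consistent with the reported recall and precision: 2 × 74.02 × 73.74 / (74.02 + 73.74) ≈ 73.88.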

[0109] Figure 10 schematically shows a block diagram of a training apparatus for an image extraction model according to an embodiment of the present disclosure.

[0110] As shown in Figure 10, the training device 1000 for the image extraction model in this embodiment includes an embedding module 1010, a processing module 1020, a weighting module 1030, a training result module 1040, and a training module 1050.

[0111] The embedding module 1010 is used to process the acquired sample image using a linear embedding network to obtain an embedded feature map. In one embodiment, the embedding module 1010 can be used to perform the operation S210 described above, which will not be repeated here.

[0112] The processing module 1020 is used to process the embedded feature maps using an encoding network to obtain i encoded feature maps, wherein the encoding network includes i sequentially connected encoding sub-networks, and one encoding sub-network corresponds to one encoded feature map, i>1. In one embodiment, the processing module 1020 can be used to perform the operation S220 described above, which will not be repeated here.

[0113] The weighting module 1030 is used to process i encoded feature maps using a decoding network to obtain a weighted fusion feature map. The decoding network includes j sequentially connected decoding sub-networks, which are constructed based on a local-global attention layer and a weighted feature fusion layer, where j > 0. In one embodiment, the weighting module 1030 can be used to perform the operation S230 described above, which will not be repeated here.

[0114] The training result module 1040 is used to perform segmentation head mapping processing on the weighted fusion feature map to obtain image segmentation results, wherein the image segmentation results characterize the geological change attributes of the target geographic environment area. In one embodiment, the training result module 1040 can be used to perform the operation S240 described above, which will not be repeated here.

[0115] The training module 1050 is used to train an image extraction model based on the image segmentation results and the corresponding label data, thereby obtaining the trained image extraction model. In one embodiment, the training module 1050 can be used to perform the operation S250 described above, which will not be repeated here.

[0116] According to embodiments of this disclosure, sample images are processed using a linear embedding network to obtain embedded feature maps. These embedded feature maps are then processed using an encoding network to obtain i encoded feature maps. Finally, the i encoded feature maps are processed using a decoding network to obtain a weighted fusion feature map. The decoding network comprises j sequentially connected decoding sub-networks, each constructed based on a local-to-global attention layer and a weighted feature fusion layer. The weighted fusion feature map is processed using a segmentation head mapping to obtain an image segmentation result. An image extraction model is trained based on the image segmentation result and the corresponding label data to obtain the trained image extraction model. By using the decoding network, the contributions of each feature map during the extraction process can be balanced, reducing overfitting. Furthermore, since the decoding sub-network is constructed based on a local-to-global attention layer, it can improve the extraction of landslide details and edges, thereby enhancing the robustness of the model.

[0117] According to embodiments of this disclosure, the weighting module 1030 includes: a first processing unit, a second processing unit, and a weighted fusion unit.

[0118] The first processing unit is used to process the nth encoded feature map and the (n-1)th encoded feature map using the mth decoding sub-network when m = i-1 and n = i, to obtain the mth intermediate fused feature map, where i > 1.

[0119] The second processing unit is used to process the (m+1)th intermediate fusion feature map and the nth encoding feature map using the mth decoding sub-network when m = n < i-1, to obtain the mth intermediate fusion feature map.

[0120] The weighted fusion unit is used to determine the m-th intermediate fusion feature map as the weighted fusion feature map when m=1.

[0121] According to embodiments of this disclosure, the m-th intermediate fusion feature map is generated as follows:

[0122] An attention unit is used to process the first input feature map using a local global attention layer to obtain a first attention feature map, wherein the first input feature map includes an encoded feature map or an m-th intermediate fused feature map.

[0123] The output feature unit is used to process the first attention feature map and the encoding feature map using the weighted feature fusion layer to obtain the output feature map, wherein the output feature map includes the m-th intermediate fused feature map or the weighted fused feature map.

[0124] According to embodiments of this disclosure, the output feature unit includes: a weight ratio subunit and an addition subunit.

[0125] The weight ratio subunit is used to adjust the weight ratio of the first attention feature map and the encoding feature map using learnable parameters, where the learnable parameters represent the weight ratio between different feature maps.

[0126] The addition subunit is used to add the first attention feature map and the encoding feature map according to the weight ratio to obtain the output feature map.

[0127] According to embodiments of this disclosure, the attention unit includes: a first feature subunit, a second feature subunit, a local feature subunit, and a first attention subunit.

[0128] The first feature subunit is used to process the first input feature map using the local global attention sublayer to obtain the first feature map.

[0129] The second feature subunit is used to obtain the second feature map based on the first input feature map and the first feature map.

[0130] The local feature subunit is used to process the second feature map using the multilayer perceptron sublayer to obtain the local feature map. The local global attention layer includes the local global attention sublayer and the multilayer perceptron sublayer.

[0131] The first attention subunit is used to obtain the first attention feature map based on the second feature map and the local feature map.

[0132] According to embodiments of this disclosure, the first feature subunit includes:

[0133] The first input feature map is processed using a normalization unit to obtain a normalized feature map.

[0134] The normalized feature map is input into the local global attention unit to obtain the first feature map, wherein the local global attention sub-layer includes a normalization unit and a local global attention unit.

[0135] The normalized feature map is input into the local global attention unit to obtain the first feature map, which includes:

[0136] The normalized feature map is subjected to convolutional normalization to obtain a convolutional feature map; the normalized feature map is mapped using a function to obtain a mapped feature map; the convolutional feature map and the mapped feature map are summed to obtain a normalized mapped feature map; the normalized mapped feature map is then convolved to obtain a first feature map; wherein the convolutional normalization of the normalized feature map to obtain the convolutional feature map includes: processing the normalized feature map using a first convolutional unit to obtain a first convolutional feature map; processing the normalized feature map using a second convolutional unit to obtain a second convolutional feature map, wherein the first convolutional kernel of the first convolutional unit and the second convolutional kernel of the second convolutional unit are different; and summing the first convolutional feature map and the second convolutional feature map to obtain the convolutional feature map.

[0137] According to embodiments of this disclosure, a normalized feature map is mapped using a function to obtain a mapped feature map, including: determining a query matrix, a key matrix, and a value matrix based on the feature sequence of the normalized feature map; obtaining a mapping encoding matrix based on the query matrix and the key matrix; and multiplying the mapping encoding matrix and the value matrix to obtain the mapped feature map.

[0138] According to embodiments of this disclosure, the embedding module 1010 includes a partitioning unit and an embedding unit.

[0139] The partitioning unit is used to perform image partitioning processing on the sample image to obtain multiple sample sub-images.

[0140] The embedding unit is used to perform linear embedding processing on the sample sub-images to obtain the embedded feature map.
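The partitioning and linear embedding steps can be sketched on a single-channel image. The patch size, the embedding dimension, the weight values, and the function names are all hypothetical; the disclosure only states that the sample image is partitioned into sub-images that are then linearly embedded.

```python
def partition(image, patch):
    """Split an H x W image into non-overlapping patch x patch blocks,
    each flattened row by row. H and W must be multiples of patch."""
    h, w = len(image), len(image[0])
    return [[image[r + dr][c + dc] for dr in range(patch) for dc in range(patch)]
            for r in range(0, h, patch) for c in range(0, w, patch)]


def linear_embed(patches, weight, bias):
    """Project every flattened patch with the same weight matrix and bias.

    weight has one row per output dimension."""
    return [[sum(p * w for p, w in zip(vec, row)) + b
             for row, b in zip(weight, bias)]
            for vec in patches]


image = [[float(4 * r + c) for c in range(4)] for r in range(4)]  # values 0..15
weight = [[0.25, 0.25, 0.25, 0.25],  # output 0: mean of the patch
          [1.0, 0.0, 0.0, 0.0]]      # output 1: top-left pixel of the patch
bias = [0.0, 0.0]
tokens = linear_embed(partition(image, 2), weight, bias)
```

Every patch is projected with the same weights, so the embedding behaves like a strided convolution whose kernel size equals the patch size.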

[0141] According to embodiments of this disclosure, the processing module includes: a first processing unit and a second processing unit.

[0142] The first processing unit is used to process the embedded feature map using the n-th encoding sub-network when n = 1, to obtain the n-th encoded feature map.

[0143] The second processing unit is used to process the (n-1)-th encoded feature map using the n-th encoding sub-network when n is greater than 1, to obtain the n-th encoded feature map.

[0144] Figure 11 schematically shows a block diagram of an image extraction apparatus according to an embodiment of the present disclosure.

[0145] As shown in Figure 11, the image extraction device 1100 of this embodiment includes an acquisition module 1110 and a test result module 1120.

[0146] The acquisition module 1110 is used to acquire test sample images. In one embodiment, the acquisition module 1110 can be used to perform the operation S710 described above, which will not be repeated here.

[0147] The test result module 1120 is used to input the test sample image into the trained image extraction model to obtain the image segmentation result corresponding to the test sample image. In one embodiment, the test result module 1120 can be used to perform the operation S720 described above, which will not be repeated here.

[0148] According to embodiments of this disclosure, any two or more of the embedding module 1010, processing module 1020, weighting module 1030, training result module 1040, training module 1050, acquisition module 1110, and test result module 1120 can be combined into one module, or any one of these modules can be split into multiple modules. Alternatively, at least some of the functionality of one or more of these modules can be combined with at least some of the functionality of other modules and implemented in one module. According to embodiments of this disclosure, at least one of the embedding module 1010, processing module 1020, weighting module 1030, training result module 1040, training module 1050, acquisition module 1110, and test result module 1120 can be at least partially implemented as hardware circuits, such as field-programmable gate arrays (FPGAs), programmable logic arrays (PLAs), systems-on-a-chip, systems-on-a-substrate, systems-on-package, application-specific integrated circuits (ASICs), or any other reasonable means of integrating or packaging circuits, or implemented in any one of software, hardware, and firmware, or in any suitable combination of the three. Alternatively, at least one of the embedding module 1010, processing module 1020, weighting module 1030, training result module 1040, training module 1050, acquisition module 1110, and test result module 1120 can be at least partially implemented as a computer program module, which can perform corresponding functions when the computer program module is run.

[0149] Figure 12 schematically shows a block diagram of an electronic device suitable for implementing an image extraction method according to an embodiment of the present disclosure.

[0150] As shown in Figure 12, an electronic device 1200 according to an embodiment of the present disclosure includes a processor 1201, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1202 or a program loaded from a storage portion 1208 into a random access memory (RAM) 1203. The processor 1201 may include, for example, a general-purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset and/or a special-purpose microprocessor (e.g., an application-specific integrated circuit (ASIC)), etc. The processor 1201 may also include onboard memory for caching purposes. The processor 1201 may include a single processing unit or multiple processing units for performing different actions of the method flow according to an embodiment of the present disclosure.

[0151] RAM 1203 stores various programs and data required for the operation of electronic device 1200. Processor 1201, ROM 1202, and RAM 1203 are interconnected via bus 1204. Processor 1201 performs various operations of the method flow according to embodiments of the present disclosure by executing programs in ROM 1202 and/or RAM 1203. It should be noted that the programs may also be stored in one or more memories other than ROM 1202 and RAM 1203. Processor 1201 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in said one or more memories.

[0152] According to embodiments of this disclosure, the electronic device 1200 may further include an input/output (I/O) interface 1205, which is also connected to the bus 1204. The electronic device 1200 may also include one or more of the following components connected to the I/O interface 1205: an input section 1206 including a keyboard, mouse, etc.; an output section 1207 including a cathode ray tube (CRT), liquid crystal display (LCD), etc., and a speaker, etc.; a storage section 1208 including a hard disk, etc.; and a communication section 1209 including a network interface card such as a LAN card, modem, etc. The communication section 1209 performs communication processing via a network such as the Internet. A drive 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211, such as a disk, optical disk, magneto-optical disk, semiconductor memory, etc., is installed on the drive 1210 as needed so that computer programs read from it can be installed into the storage section 1208 as needed.

[0153] This disclosure also provides a computer-readable storage medium, which may be included in the device/apparatus/system described in the above embodiments, or may exist independently without being assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs that, when executed, implement the method according to the embodiments of this disclosure.

[0154] According to embodiments of this disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, including, but not limited to: portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof. In this disclosure, the computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device. For example, according to embodiments of this disclosure, the computer-readable storage medium may include the ROM 1202 and/or the RAM 1203 described above, and/or one or more memories other than the ROM 1202 and the RAM 1203.

[0155] Embodiments of this disclosure also include a computer program product comprising a computer program containing program code for performing the methods shown in the flowchart. When the computer program product is run on a computer system, the program code enables the computer system to implement the image extraction method provided in the embodiments of this disclosure.

[0156] When the computer program is executed by the processor 1201, it performs the functions defined in the system/apparatus of the embodiments of this disclosure. According to embodiments of this disclosure, the systems, apparatuses, modules, units, etc., described above can be implemented by computer program modules.

[0157] In one embodiment, the computer program may rely on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of signals over a network medium, and may be downloaded and installed via the communication section 1209, and/or installed from the removable medium 1211. The program code contained in the computer program can be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination thereof.

[0158] In such an embodiment, the computer program can be downloaded and installed from a network via the communication section 1209, and/or installed from the removable medium 1211. When the computer program is executed by the processor 1201, it performs the functions defined in the system of the embodiments of this disclosure. According to embodiments of this disclosure, the systems, devices, apparatuses, modules, units, etc., described above can be implemented by computer program modules.

[0159] According to embodiments of this disclosure, program code for executing the computer programs provided in embodiments of this disclosure can be written in any combination of one or more programming languages. Specifically, these computer programs can be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Programming languages include, but are not limited to, languages such as Java, C++, Python, "C", or similar programming languages. The program code can execute entirely on the user's computing device, partially on the user's device, partially on a remote computing device, or entirely on a remote computing device or server. In cases involving remote computing devices, the remote computing device can be connected to the user's computing device via any type of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (e.g., via the Internet using an Internet service provider).

[0160] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this disclosure. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in a block diagram or flowchart, and combinations of blocks in a block diagram or flowchart, may be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.

[0161] Those skilled in the art will understand that the features described in the various embodiments and/or claims of this disclosure can be combined and/or integrated in various ways, even if such combinations or integrations are not explicitly described in this disclosure. In particular, the features described in the various embodiments and/or claims of this disclosure can be combined and/or integrated in various ways without departing from the spirit and teachings of this disclosure. All such combinations and/or integrations fall within the scope of this disclosure.

[0162] The embodiments of this disclosure have been described above. However, these embodiments are for illustrative purposes only and are not intended to limit the scope of this disclosure. Although various embodiments have been described above, this does not mean that the measures in the various embodiments cannot be used advantageously in combination. The scope of this disclosure is defined by the appended claims and their equivalents. Various substitutions and modifications can be made by those skilled in the art without departing from the scope of this disclosure, and all such substitutions and modifications should fall within the scope of this disclosure.

Claims

1. A training method for an image extraction model, the image extraction model comprising a linear embedding network, an encoding network, and a decoding network, characterized in that the method comprises:
processing a sample image using the linear embedding network to obtain an embedding feature map;
processing the embedding feature map using the encoding network to obtain i encoding feature maps, wherein the encoding network includes i sequentially connected encoding sub-networks, one encoding sub-network corresponds to one encoding feature map, and i>1;
when m=i-1 and n=i, processing the n-th and (n-1)-th encoding feature maps using the m-th decoding sub-network to obtain the m-th intermediate fusion feature map, where i>1;
when m=n<i-1, processing the (m+1)-th intermediate fusion feature map and the n-th encoding feature map using the m-th decoding sub-network to obtain the m-th intermediate fusion feature map;
when m=1, determining the m-th intermediate fusion feature map as the weight fusion feature map, wherein the decoding network includes j sequentially connected decoding sub-networks, the decoding sub-networks are constructed based on a local-global attention layer and a weight feature fusion layer, and j>0;
performing segmentation head mapping processing on the weight fusion feature map to obtain an image segmentation result, wherein the image segmentation result characterizes a geological change attribute of a target geographical environment region; and
training the image extraction model according to the image segmentation result and label data corresponding to the image segmentation result to obtain a trained image extraction model;
wherein the m-th intermediate fusion feature map is generated as follows:
processing a first input feature map using the local-global attention layer to obtain a first attention feature map, wherein the first input feature map includes an encoding feature map or an m-th intermediate fusion feature map;
processing the first attention feature map and the encoding feature map using the weight feature fusion layer to obtain an output feature map, wherein the output feature map includes the m-th intermediate fusion feature map or the weight fusion feature map;
wherein processing the first attention feature map and the encoding feature map using the weight feature fusion layer to obtain the output feature map includes:
adjusting a weight ratio of the first attention feature map and the encoding feature map using learnable parameters, wherein the learnable parameters represent the weight ratio between different feature maps; and
adding the first attention feature map and the encoding feature map together based on the weight ratio to obtain the output feature map.
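
The decoder wiring and the weighted fusion recited in claim 1 can be illustrated with a minimal Python sketch. It is not the patented implementation. The names `weighted_fusion`, `decode`, `attention`, and `fusion_params` are hypothetical. The sketch also assumes that all feature maps are flat lists of equal length (no upsampling step), and that the two learnable parameters are clamped to be non-negative and normalised so that they sum to one. The claim itself only requires that learnable parameters set the weight ratio before the two maps are added.

```python
def weighted_fusion(attn_map, enc_map, params, eps=1e-8):
    """Blend two equally sized feature maps using two learnable scalars."""
    # Assumption: clamp to non-negative values, then normalise to a ratio.
    weights = [max(p, 0.0) for p in params]
    total = sum(weights) + eps
    w_attn, w_enc = weights[0] / total, weights[1] / total
    return [w_attn * a + w_enc * e for a, e in zip(attn_map, enc_map)]


def decode(encoded, attention, fusion_params):
    """Run decoding sub-networks m = i-1, ..., 1 and return the final fused map.

    encoded[k] holds the (k+1)-th encoding feature map, so len(encoded) == i.
    fusion_params[m-1] holds the two learnable scalars of decoder m.
    """
    i = len(encoded)
    if i < 2:
        raise ValueError("claim 1 requires i > 1 encoding feature maps")
    # Decoder m = i-1 takes the i-th encoding feature map as its first input.
    first_input = encoded[i - 1]
    for m in range(i - 1, 0, -1):
        attn = attention(first_input)  # local-global attention layer
        # Fuse with the m-th encoding feature map; the result is the m-th
        # intermediate fusion feature map and feeds decoder m-1.
        first_input = weighted_fusion(attn, encoded[m - 1], fusion_params[m - 1])
    return first_input  # m == 1: the weight fusion feature map
```

With i = 3 the loop runs twice: decoder 2 fuses the third and second encoding feature maps, and decoder 1 fuses that result with the first encoding feature map.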

2. The training method according to claim 1, characterized in that processing the first input feature map using the local-global attention layer to obtain the first attention feature map includes:
processing the first input feature map using a local-global attention sublayer to obtain a first feature map;
obtaining a second feature map based on the first input feature map and the first feature map;
processing the second feature map using a multilayer perceptron sublayer to obtain a local feature map, wherein the local-global attention layer includes the local-global attention sublayer and the multilayer perceptron sublayer; and
obtaining the first attention feature map based on the second feature map and the local feature map.

3. The training method according to claim 2, characterized in that processing the first input feature map using the local-global attention sublayer to obtain the first feature map includes:
processing the first input feature map using a normalization unit to obtain a normalized feature map; and
inputting the normalized feature map into a local-global attention unit and outputting the first feature map, wherein the local-global attention sublayer includes the normalization unit and the local-global attention unit.
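
The data flow of claims 2 and 3 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the claimed implementation. The claims say only that the second feature map and the first attention feature map are obtained "based on" two inputs; the sketch assumes element-wise residual addition. It also assumes the normalization unit is a layer normalization over a flat feature vector. The names `layer_norm`, `lga_sublayer`, `attention_layer`, `lga_unit`, and `mlp_sublayer` are hypothetical.

```python
import math


def layer_norm(x, eps=1e-5):
    """Assumed normalization unit: zero mean, unit variance over the vector."""
    mean = sum(x) / len(x)
    var = sum((t - mean) ** 2 for t in x) / len(x)
    return [(t - mean) / math.sqrt(var + eps) for t in x]


def lga_sublayer(x, lga_unit):
    """Claim 3: normalization unit followed by the local-global attention unit."""
    return lga_unit(layer_norm(x))


def attention_layer(x, lga_unit, mlp_sublayer):
    """Claim 2: attention sublayer, then MLP sublayer, each with a skip path."""
    first = lga_sublayer(x, lga_unit)                 # first feature map
    second = [a + b for a, b in zip(x, first)]        # assumed residual sum
    local = mlp_sublayer(second)                      # local feature map
    return [a + b for a, b in zip(second, local)]     # first attention feature map
```

Passing the unit and the perceptron in as callables keeps the sketch independent of how either is built.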

4. The training method according to claim 3, characterized in that inputting the normalized feature map into the local-global attention unit and outputting the first feature map includes:
performing convolutional normalization on the normalized feature map to obtain a convolutional feature map;
performing function mapping on the normalized feature map to obtain a mapping feature map;
summing the convolutional feature map and the mapping feature map to obtain a normalized mapping feature map; and
convolving the normalized mapping feature map to obtain the first feature map;
wherein performing convolutional normalization on the normalized feature map to obtain the convolutional feature map includes:
processing the normalized feature map using a first convolutional unit to obtain a first convolutional feature map;
processing the normalized feature map using a second convolutional unit to obtain a second convolutional feature map, wherein a first convolutional kernel of the first convolutional unit is different from a second convolutional kernel of the second convolutional unit; and
adding the first convolutional feature map and the second convolutional feature map together to obtain the convolutional feature map.
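
The two-branch structure of claim 4 can be illustrated with a simplified one-dimensional Python sketch. It is a rough illustration only: it works on a single-channel sequence, and the normalization that the claim folds into "convolutional normalization" is omitted. The helper `conv1d_same` and the parameters `kernel_a`, `kernel_b`, `global_branch`, and `out_kernel` are hypothetical names. `global_branch` stands in for the function mapping detailed in claim 5.

```python
def conv1d_same(x, kernel):
    """Zero-padded 1-D cross-correlation; output length equals input length."""
    if len(kernel) % 2 == 0:
        raise ValueError("this sketch assumes an odd kernel length")
    half = len(kernel) // 2
    padded = [0.0] * half + list(x) + [0.0] * half
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(x))
    ]


def local_global_unit(x, kernel_a, kernel_b, global_branch, out_kernel):
    """Sum a two-kernel local branch with a global branch, then convolve."""
    # Local branch: two convolutions with different kernels, added together.
    conv_a = conv1d_same(x, kernel_a)
    conv_b = conv1d_same(x, kernel_b)
    local = [a + b for a, b in zip(conv_a, conv_b)]
    # Global branch: the function mapping (for example, self-attention).
    mapped = global_branch(x)
    # Sum both branches, then apply the final convolution.
    mixed = [l + g for l, g in zip(local, mapped)]
    return conv1d_same(mixed, out_kernel)
```

Kernels of length 1 and 3 are one way to satisfy the requirement that the two kernels differ. The claim does not fix the kernel sizes.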

5. The training method according to claim 4, characterized in that performing function mapping on the normalized feature map to obtain the mapping feature map includes:
determining a query matrix, a key matrix, and a value matrix based on a feature sequence of the normalized feature map;
obtaining a mapping encoding matrix based on the query matrix and the key matrix; and
multiplying the mapping encoding matrix by the value matrix to obtain the mapping feature map.
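
The function mapping in claim 5 has the shape of a standard self-attention step. The Python sketch below assumes that the mapping encoding matrix is the row-wise softmax of the scaled product of the query matrix and the transposed key matrix. The claim states only that this matrix is obtained from the query and key matrices, so the scaling and the softmax are assumptions. The projection weights `w_q`, `w_k`, `w_v` and all function names are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[r][k] * b[k][c] for k in range(len(b))) for c in range(len(b[0]))]
        for r in range(len(a))
    ]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def function_mapping(tokens, w_q, w_k, w_v):
    """tokens: feature sequence, one row per position."""
    q = matmul(tokens, w_q)   # query matrix
    k = matmul(tokens, w_k)   # key matrix
    v = matmul(tokens, w_v)   # value matrix
    scale = math.sqrt(len(q[0]))
    scores = [[s / scale for s in row] for row in matmul(q, transpose(k))]
    mapping = softmax_rows(scores)   # assumed form of the mapping encoding matrix
    return matmul(mapping, v)        # mapping feature map
```

Each output row is a weighted average of the value rows, which is how this branch carries global context across the whole feature sequence.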

6. The training method according to claim 1, characterized in that processing the sample image using the linear embedding network to obtain the embedding feature map includes:
segmenting the sample image to obtain multiple sample sub-images; and
performing linear embedding processing on the sample sub-images to obtain the embedding feature map.
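
The two steps of claim 6 correspond to the patch partition and linear projection commonly used in vision transformers. The sketch below assumes a single-channel image stored as a list of rows, square non-overlapping sub-images, and a plain affine projection. The names `split_into_patches`, `linear_embed`, `patch`, `weight`, and `bias` are hypothetical, and the claim does not specify the sub-image size.

```python
def split_into_patches(image, patch):
    """Cut an H x W image into flattened patch x patch sub-images, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append(
                [image[top + r][left + c] for r in range(patch) for c in range(patch)]
            )
    return patches


def linear_embed(patches, weight, bias):
    """Project each flattened sub-image; weight is (patch*patch) x dim."""
    return [
        [
            sum(p[k] * weight[k][d] for k in range(len(p))) + bias[d]
            for d in range(len(bias))
        ]
        for p in patches
    ]


# Usage: a 4 x 4 image split into four 2 x 2 sub-images, embedded to 1 dimension.
image = [[4 * r + c for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2)
embedded = linear_embed(patches, [[1.0]] * 4, [0.0])
```

In a trained model `weight` and `bias` would be learned. Here they are fixed values chosen only to make the data flow visible.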

7. The training method according to claim 1, characterized in that processing the embedding feature map using the encoding network to obtain i encoding feature maps includes:
when n=1, processing the embedding feature map using the n-th encoding sub-network to obtain the n-th encoding feature map; and
when n>1, processing the (n-1)-th encoding feature map using the n-th encoding sub-network to obtain the n-th encoding feature map.
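
The chaining rule of claim 7 amounts to a simple loop in which each encoding sub-network consumes the output of the previous one and every intermediate output is kept. The Python sketch below treats each sub-network as an opaque callable. The names `encode` and `encoders` are hypothetical.

```python
def encode(embedded, encoders):
    """Return the list of encoding feature maps, one per encoding sub-network."""
    if len(encoders) < 2:
        raise ValueError("claim 1 requires i > 1 encoding sub-networks")
    feature_maps = []
    current = embedded  # n = 1 starts from the embedding feature map
    for stage in encoders:
        current = stage(current)      # n > 1 consumes the (n-1)-th output
        feature_maps.append(current)  # keep every map for the decoder
    return feature_maps
```

All i maps are retained, not just the last one, because the decoding network of claim 1 consumes every encoding feature map.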

8. An image extraction method, characterized by comprising:
obtaining a test sample image; and
inputting the test sample image into a trained image extraction model and outputting an image segmentation result corresponding to the test sample image, wherein the trained image extraction model is obtained by training with the method according to any one of claims 1 to 7.