A human body parsing network training method and device based on unmanned aerial vehicle images

By using a pre-defined encoder, an improved spatial pyramid pooling module, and a detail extractor in the human body parsing network of UAV images, combined with boundary residual calculation, the problems of large number of parameters and slow speed in UAV image human body parsing algorithms are solved, achieving high-precision and fast human body parsing.

CN116778528B · Active · Publication Date: 2026-06-23 · 深圳市华赛睿飞智能科技有限公司

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
深圳市华赛睿飞智能科技有限公司
Filing Date
2023-06-21
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing human body parsing algorithms based on UAV images suffer from a large number of model parameters, resulting in slow inference speed and low parsing accuracy.

Method used

A human body parsing network is constructed using a pre-set encoder, an improved spatial pyramid pooling module, and a pre-set detail extractor. Combined with boundary residual calculation, data augmentation and image preprocessing are used to improve the model's generalization ability and boundary recognition accuracy.

Benefits of technology

While reducing the number of model parameters, the accuracy and speed of human body analysis in UAV images have been significantly improved.


Abstract

A method, device, and medium for training a human body parsing network based on unmanned aerial vehicle (UAV) images. The method includes: performing data augmentation and image preprocessing on multiple UAV images to obtain a training image set; using an encoder to extract features from the training images to obtain multiple semantic feature maps and a target feature map that meets the screening requirements; using a detail extractor to perform boundary supervision on the target feature map to obtain a boundary binary classification result; using an improved spatial pyramid pooling module and boundary residuals to perform multi-scale pooling on the preset layer feature map among the semantic feature maps to obtain multiple human body parsing results; training the human body parsing network with a final loss function obtained by summing the cross-entropy loss function corresponding to the multiple human body parsing results and the binary classification loss function corresponding to the boundary binary classification result; and optimizing and adjusting the human body parsing network to obtain a standard human body parsing network. The standard human body parsing network can improve parsing accuracy while reducing the number of model parameters.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, specifically to a method, device, and medium for training a human body parsing network based on drone images.

Background Technology

[0002] Drones have wide applications in security protection, tracking and identifying suspected individuals. However, due to the unique characteristics of drones, the designed algorithms must possess real-time performance and high accuracy. Drone-based human recognition refers to the fine-grained analysis of human bodies using drone images, specifically segmenting multiple parts of the body such as the head, clothing, and legs. Human body analysis is a fine-grained classification task.

[0003] The challenge of human body segmentation based on drone images lies in the fact that drone images are typically taken from the air or at a considerable distance, resulting in the human body occupying only a small portion of the image, thus falling under the category of small-object semantic segmentation. Traditional machine learning-based drone image human body segmentation suffers from low accuracy, performing only a coarse segmentation. While existing deep learning-based algorithms can achieve high accuracy, the large number of parameters hinders the rapid inference speed of human body recognition based on drone images.

Summary of the Invention

[0004] The main technical problem solved by this invention is how to improve the parsing accuracy of human body parsing networks on UAV images while reducing the number of model parameters.

[0005] According to the first aspect, one embodiment provides a method for training a human body parsing network based on UAV images, comprising:

[0006] Multiple drone images are acquired, data augmentation processing is performed on the multiple drone images, and image preprocessing is performed on the data-augmented images to obtain a training image set;

[0007] Boundary residuals are calculated on the training images in the training image set to obtain the boundary residuals;

[0008] The training images in the training image set are subjected to feature extraction using a preset encoder to obtain multiple semantic feature maps. The semantic feature maps that meet the screening requirements are selected as target feature maps.

[0009] The target feature map is subjected to boundary supervision processing using a preset detail extractor to obtain a boundary binary classification result.

[0010] The improved spatial pyramid pooling module and the boundary residual are used to perform multi-scale pooling on the preset number of feature maps in the multiple semantic feature maps to obtain multiple human body parsing results.

[0011] Based on the multiple human body parsing results, a corresponding cross-entropy loss function is constructed, and a corresponding binary classification loss function is constructed based on the boundary binary classification results. The human body parsing network is trained using the final loss function obtained by summing the cross-entropy loss function and the binary classification loss function. The trained human body parsing network is then optimized and adjusted to obtain a standard human body parsing network. The human body parsing network is constructed and generated by a preset encoder, an improved spatial pyramid pooling, and a preset detail extractor.

[0012] In one embodiment, the step of calculating boundary residuals from the training images in the training image set to obtain boundary residuals includes:

[0013] The training images in the training image set are downsampled to obtain a downsampled image set;

[0014] The image that meets the preset number of layers in the downsampled image set is selected as the image of the current layer, and the image with a preset interval from the image of the current layer is selected as the image of the next layer.

[0015] The next-layer image is upsampled to obtain an upsampled image. The difference between the current layer image and the upsampled image is calculated to obtain the boundary residual.

[0016] In one embodiment, the improved spatial pyramid pooling module and the boundary residual are used to perform multi-scale pooling on the preset number of feature maps in the multiple semantic feature maps to obtain multiple human body parsing results, including:

[0017] The preset layer feature map is input into the improved spatial pyramid pooling module for feature convolution processing to obtain multiple pooled features.

[0018] The multiple pooling features are concatenated to obtain concatenated features. The concatenated features are then reduced in dimensionality using a convolutional layer to obtain the reduced concatenated features.

[0019] The dimensionality-reduced concatenated features are processed by a preset feature refinement module to obtain the final output features.

[0020] Human body analysis is performed based on the final output features and the boundary residuals to obtain multiple human body analysis results.

[0021] In one embodiment, the step of inputting a feature map with a preset number of layers into the improved spatial pyramid pooling module for feature convolution processing yields multiple pooled features, including:

[0022] The preset number of feature maps are convolved to obtain the first pooling feature of the improved spatial pyramid pooling module.

[0023] The preset number of feature maps are subjected to dilated convolution processing using dilated convolution layers with different dilation rates to obtain multiple dilated pooling features.

[0024] The first layer pooling feature and the multiple dilated pooling features are combined to obtain the multiple pooled features.

[0025] In one embodiment, the step of using a preset feature refinement module to perform feature selection processing on the dimensionality-reduced concatenated features to obtain the final output features includes:

[0026] The dimensionality-reduced concatenated features are subjected to average pooling using a global average pooling layer to obtain channel weight features.

[0027] The channel weight features are subjected to convolution, normalization, and activation processing to obtain the final weight features;

[0028] The channel weight feature and the final weight feature are multiplied together to obtain the final output feature.

[0029] In one embodiment, the human body parsing process based on the final output features and the boundary residuals to obtain multiple human body parsing results includes:

[0030] The final output features are input into multiple convolutional layers for convolution processing to obtain the first parsing result;

[0031] The final output feature and the first parsing result are upsampled respectively to obtain the first upsampled feature and the first upsampled result. The first upsampled feature, the first upsampled result and the boundary residual are concatenated to obtain the initial concatenated feature. The initial concatenated feature is then convolved and refined to obtain the second parsing result.

[0032] The second parsing result and the boundary residual are again subjected to upsampling, feature concatenation, convolution, and feature refinement until the third and fourth parsing results are obtained.

[0033] The first parsing result, the second parsing result, the third parsing result, and the fourth parsing result are collected to obtain the multiple human body parsing results.

[0034] In one embodiment, the data augmentation processing of the plurality of UAV images includes:

[0035] The data augmentation processes include image translation, image flipping, random brightness transformation, and median filtering.

[0036] According to a second aspect, one embodiment provides a human body recognition method based on drone images, comprising:

[0037] Obtain the human image to be parsed;

[0038] The human image to be parsed is input into a standard human parsing network for human parsing processing to obtain the human parsing result; wherein the standard human parsing network is trained by the method described in any one of claims 1 to 7.

[0039] According to a third aspect, embodiments of the present invention provide an apparatus comprising:

[0040] Memory, used to store programs;

[0041] A processor for implementing the target detection method based on a patrol robot as described in any of the preceding claims by executing a program stored in the memory.

[0042] The human body recognition method, apparatus, and medium based on UAV images according to the above embodiments include constructing a human body parsing network based on a preset encoder, an improved spatial pyramid pooling module, and a preset detail extractor, and training, optimizing, and adjusting the human body parsing network to obtain a standard human body parsing network. The improved spatial pyramid pooling module enables the standard human body parsing network to extract features at different scales, thereby improving the ability to recognize small targets. The preset detail extractor can perform boundary supervision processing on the image, increasing the accuracy of the standard human body parsing network for boundary recognition. Simultaneously, the boundary residuals obtained from boundary residual calculation are incorporated during the training process of the human body parsing network, which can be used to solve the boundary segmentation problem in human body parsing by the standard human body parsing network. Therefore, the standard human body parsing network obtained after training can improve the accuracy of human body parsing in UAV images.

Attached Figure Description

[0043] Figure 1 is a flowchart illustrating the training process of a human body parsing network based on UAV images, as described in an embodiment of this application.

[0044] Figure 2 is a schematic diagram of the boundary residual calculation process in one embodiment.

[0045] Figure 3 is a schematic diagram of a multi-scale pooling process in one embodiment.

[0046] Figure 4 is a schematic diagram of a multi-scale pooling process in another embodiment.

[0047] Figure 5 is a schematic diagram of a multi-scale pooling process in another embodiment.

[0048] Figure 6 is a schematic diagram of a multi-scale pooling process in another embodiment.

[0049] Figure 7 is a schematic diagram of a human body recognition process based on drone images, according to another embodiment.

[0050] Figure 8 is a structural block diagram of a human body parsing network training device based on UAV images, according to an embodiment of this application.

Detailed Implementation

[0051] The present invention will now be described in further detail with reference to specific embodiments and accompanying drawings. Similar elements in different embodiments are referred to by associated similar element reference numerals. In the following embodiments, many details are described to facilitate a better understanding of this application. However, those skilled in the art will readily recognize that some features may be omitted in different situations, or may be replaced by other elements, materials, or methods. In some cases, certain operations related to this application are not shown or described in the specification. This is to avoid obscuring the core parts of this application with excessive description. For those skilled in the art, detailed description of these related operations is not necessary; they can fully understand the related operations based on the description in the specification and general technical knowledge in the art.

[0052] Furthermore, the features, operations, or characteristics described in the specification can be combined in any suitable manner to form various embodiments. At the same time, the steps or actions in the method description can be rearranged or adjusted in a manner obvious to those skilled in the art. Therefore, the various orders in the specification and drawings are only for the clear description of a particular embodiment and do not imply a necessary order, unless otherwise stated that a particular order must be followed.

[0053] The serial numbers assigned to components in this document, such as "first" and "second," are used only to distinguish the described objects and have no sequential or technical meaning. The terms "connection" and "linkage" used in this application, unless otherwise specified, include both direct and indirect connections (linkages).

[0054] In this embodiment of the invention, a human body parsing network is constructed based on a preset encoder, an improved spatial pyramid pooling, and a preset detail extractor. A boundary residual calculation process is added, and the calculated boundary residuals are used to train, optimize, and adjust the human body parsing network to obtain a standard human body parsing network. This can improve the accuracy of human body parsing in UAV images using the standard human body parsing network.

[0055] Please refer to Figure 1. Some embodiments of the present invention provide a method for training a human body parsing network based on UAV images, including steps S10 to S60, which are described in detail below.

[0056] Step S10: Acquire multiple drone images, perform data augmentation on the multiple drone images, and perform image preprocessing on the data-augmented images to obtain a training image set.

[0057] In some embodiments, data augmentation processing is performed on multiple drone images, including:

[0058] Data augmentation processing includes image translation, image flipping, random brightness transformation, and median filtering.

[0059] In some embodiments, data augmentation of multiple drone images can enhance the generalization ability of the model during subsequent model training.

[0060] In some embodiments, the data-enhanced images are preprocessed to obtain a training image set, wherein the image preprocessing includes image cropping and image normalization.
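The augmentation and preprocessing operations of step S10 can be sketched as follows. This is a minimal NumPy illustration, not the patented implementation: the function names, the zero fill after translation, the 3×3 median window, the brightness range, and the center-crop policy are all assumptions chosen for the example.

```python
import numpy as np

def translate(img, dx, dy):
    """Shift an H x W x C image by (dx, dy) pixels; exposed areas are zero-filled (assumed)."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    # Part of the source image that stays inside the frame after the shift.
    src = img[max(0, -dy):max(0, min(h, h - dy)), max(0, -dx):max(0, min(w, w - dx))]
    y0, x0 = max(0, dy), max(0, dx)
    out[y0:y0 + src.shape[0], x0:x0 + src.shape[1]] = src
    return out

def horizontal_flip(img):
    """Mirror the image left to right."""
    return img[:, ::-1]

def random_brightness(img, rng, low=0.7, high=1.3):
    """Scale intensities by a random factor; the range [low, high] is an assumption."""
    return np.clip(img * rng.uniform(low, high), 0, 255)

def median_filter3(img):
    """3x3 median filter with edge replication (window size is an assumption)."""
    h, w = img.shape[:2]
    padded = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode="edge")
    windows = [padded[i:i + h, j:j + w] for i in range(3) for j in range(3)]
    return np.median(np.stack(windows, axis=0), axis=0).astype(img.dtype)

def center_crop(img, size):
    """Crop a size x size window from the image center (crop policy is an assumption)."""
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

def normalize(img, mean, std):
    """Map 0..255 intensities to 0..1, then standardize with the given mean and std."""
    return (img / 255.0 - mean) / std
```

Each function takes and returns an H×W×C array, so the operations can be chained in any order before the images are added to the training image set.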

[0061] Step S20: Calculate the boundary residuals of the training images in the training image set to obtain the boundary residuals.

[0062] See Figure 2. In some embodiments, step S20, calculating the boundary residuals of the training images in the training image set to obtain the boundary residuals, includes steps S21 to S23, as described in detail below.

[0063] Step S21: Downsample the training images in the training image set to obtain a downsampled image set.

[0064] Step S22: Select an image from the downsampled image set that meets the preset number of layers as the current layer image, and select an image with a preset interval from the current layer image as the next layer image.

[0065] In some embodiments, the downsampled image set includes layer 1, layer 2, ..., layer k and layer (k+1). The layer k image in the downsampled image set is selected as the current layer image, and the layer (k+1) image in the downsampled image set is selected as the next layer image. Here, k represents the layer index of the Laplacian pyramid.

[0066] Step S23: Upsample the next-layer image to obtain an upsampled image, calculate the difference between the current layer image and the upsampled image, and obtain the boundary residual.

[0067] In some embodiments, the next-layer image is upsampled to obtain the upsampled image $\mathrm{up}(L_{k+1})$. The difference between the current layer image and the upsampled image is then calculated, and the boundary residual is expressed as follows:

[0068] $R_k = L_k - \mathrm{up}(L_{k+1})$

[0069] Where $R_k$ is the boundary residual, $L_k$ is the current layer image, $\mathrm{up}(L_{k+1})$ is the upsampled image, and k is the layer index of the Laplacian pyramid.
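Steps S21 to S23 can be illustrated with a short NumPy sketch. The text does not name the downsampling or upsampling operators, so 2×2 average pooling and nearest-neighbour upsampling are assumptions here, as are the function names and the default of three levels.

```python
import numpy as np

def downsample2(img):
    """Halve H and W by 2x2 average pooling (assumed downsampling operator)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    x = img[:h, :w].astype(np.float64)
    return (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2]) / 4.0

def upsample_to(img, shape):
    """Nearest-neighbour upsampling to a target (H, W) (assumed upsampling operator)."""
    rows = np.arange(shape[0]) * img.shape[0] // shape[0]
    cols = np.arange(shape[1]) * img.shape[1] // shape[1]
    return img[rows][:, cols]

def boundary_residuals(image, levels=3):
    """Return [R_1, ..., R_levels] with R_k = L_k - up(L_{k+1})."""
    pyramid = [image.astype(np.float64)]          # L_1 is the input image
    for _ in range(levels):
        pyramid.append(downsample2(pyramid[-1]))  # L_{k+1} from L_k
    return [
        pyramid[k] - upsample_to(pyramid[k + 1], pyramid[k].shape[:2])
        for k in range(levels)
    ]
```

In flat regions the residual is zero, while intensity edges leave non-zero values, which is why the residual can serve as a boundary cue.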

[0070] Step S30: Use a preset encoder to extract features from the training images in the training image set to obtain multiple semantic feature maps. Select the semantic feature maps that meet the screening requirements from the multiple semantic feature maps as target feature maps.

[0071] In some embodiments, the preset encoder can be a ResNet18 encoder, which has a four-layer encoding structure. Therefore, by using the ResNet18 encoder to extract features from the training image, four feature maps I1, I2, I3 and I4 are obtained.

[0072] In some embodiments, the network formula for the ResNet18 encoder is as follows:

[0073] $F(\cdot) = (F_4 \circ F_3 \circ F_2 \circ F_1)(\cdot)$

[0074] Where $F_k(\cdot)$ represents the k-th layer encoder: $F_4$ is the fourth layer encoder, $F_3$ is the third layer encoder, $F_2$ is the second layer encoder, and $F_1$ is the first layer encoder.

[0075] In some embodiments, the filtering requirements are predefined requirements, which are about filtering feature maps corresponding to different coding structures in a preset encoder, and selecting the feature map corresponding to the second coding structure as the target feature map.
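To make the four feature maps concrete, the following sketch computes their shapes, assuming the standard ResNet18 layout (a 7×7 stride-2 stem convolution, a 3×3 stride-2 max pool, and four stages with 64, 128, 256, and 512 channels, where stages two to four start with a stride-2 convolution). The function names are hypothetical, and the patent's encoder may differ in detail.

```python
def conv_out(n, kernel, stride, padding):
    """Output length of a convolution or pooling layer along one axis."""
    return (n + 2 * padding - kernel) // stride + 1

def resnet18_stage_shapes(height, width):
    """Return the (channels, H, W) shapes of I1..I4 for a standard ResNet18."""
    # Stem: 7x7 convolution (stride 2, padding 3), then 3x3 max pool (stride 2, padding 1).
    h, w = conv_out(height, 7, 2, 3), conv_out(width, 7, 2, 3)
    h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
    shapes = []
    for index, channels in enumerate((64, 128, 256, 512)):
        if index > 0:  # stages 2 to 4 open with a stride-2 3x3 convolution
            h, w = conv_out(h, 3, 2, 1), conv_out(w, 3, 2, 1)
        shapes.append((channels, h, w))
    return shapes

def select_target_feature_map(shapes):
    """Screening rule from the text: take the map of the second encoding structure (I2)."""
    return shapes[1]
```

For a 512×512 input this gives I1 to I4 at 1/4, 1/8, 1/16, and 1/32 of the input resolution, so the target feature map I2 retains more spatial detail than I4, which feeds the pooling module.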

[0076] Step S40: Use a preset detail extractor to perform boundary supervision processing on the target feature map to obtain the boundary binary classification result.

[0077] In some embodiments, the target feature map is input into a preset detail extractor to obtain a boundary binary classification result, wherein the boundary binary classification result represents the boundary information of the input human body, thereby improving the model's ability to recognize the human body boundary.

[0078] In some embodiments, the boundary binary classification result is expressed as $Seg_{detail} \in \mathbb{R}^{2 \times H \times W}$, where H is the height of the target feature map and W is the width of the target feature map.

[0079] Step S50: Use the improved spatial pyramid pooling module and boundary residuals to perform multi-scale pooling on the preset number of feature maps in multiple semantic feature maps to obtain multiple human body parsing results.

[0080] See Figure 3. In one embodiment, step S50, using the improved spatial pyramid pooling module and boundary residuals to perform multi-scale pooling on the preset layer feature map among the multiple semantic feature maps to obtain multiple human body parsing results, includes steps S51 to S54, which are described in detail below.

[0081] Step S51: Input the preset layer feature map into the improved spatial pyramid pooling module for feature convolution processing to obtain multiple pooled features.

[0082] See Figure 4. In one embodiment, step S51, inputting the preset layer feature map into the improved spatial pyramid pooling module for feature convolution processing to obtain multiple pooled features, includes steps S511 to S513, which are described in detail below.

[0083] Step S511: Perform convolution processing on the feature map of the preset number of layers to obtain the first layer pooling feature of the improved spatial pyramid pooling module.

[0084] Step S512: Perform dilated convolution processing on the feature map of the preset number of layers using dilated convolution layers with different dilation rates to obtain multiple dilated pooling features.

[0085] Step S513: Combine the first layer pooling feature and the multiple dilated pooling features to obtain the multiple pooled features.

[0086] In some embodiments, the preset layer feature map is the feature I4 output by the fourth layer encoder in the preset encoder. I4 is first processed by a 1×1 convolution to obtain the first layer pooling feature of the improved spatial pyramid pooling module. Dilated convolutional layers with different dilation rates are then applied to I4 to obtain multiple dilated pooling features. Specifically, I4 is passed through a 3×3 dilated convolutional layer with a dilation rate of 6 to obtain the second layer pooling feature, through a 3×3 dilated convolutional layer with a dilation rate of 12 to obtain the third layer pooling feature, and through a 3×3 dilated convolutional layer with a dilation rate of 18 to obtain the fourth layer pooling feature of the improved spatial pyramid pooling module.

[0087] Step S52: Perform feature concatenation on multiple pooling features to obtain concatenated features, and use a convolutional layer to reduce the dimensionality of the concatenated features to obtain the dimensionality-reduced concatenated features.

[0088] In some embodiments, the multiple pooling features are concatenated along the channel dimension to obtain the concatenated features. The concatenated features are input into a 1×1 convolutional layer to reduce the number of channels, resulting in the dimensionality-reduced concatenated features.
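Steps S51 and S52 can be sketched in NumPy as four parallel branches over I4 followed by channel concatenation and a 1×1 reduction. The weight dictionary keys, the zero padding equal to the dilation rate (which keeps the spatial size unchanged), and the omission of biases and normalization layers are assumptions made for brevity.

```python
import numpy as np

def conv1x1(x, w):
    """x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", w, x)

def dilated_conv3x3(x, w, rate):
    """x: (C_in, H, W), w: (C_out, C_in, 3, 3); zero padding of `rate` keeps H x W."""
    _, h, wd = x.shape
    padded = np.pad(x, ((0, 0), (rate, rate), (rate, rate)))
    out = np.zeros((w.shape[0], h, wd))
    for i in range(3):
        for j in range(3):
            # Kernel tap (i, j) reads the input shifted by (i - 1, j - 1) * rate pixels.
            patch = padded[:, i * rate:i * rate + h, j * rate:j * rate + wd]
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], patch)
    return out

def improved_spp(i4, weights, rates=(6, 12, 18)):
    """Four branches (1x1 plus three dilated 3x3), concatenated, then reduced by 1x1."""
    branches = [conv1x1(i4, weights["conv1x1"])]
    branches += [dilated_conv3x3(i4, weights[f"rate{r}"], r) for r in rates]
    concatenated = np.concatenate(branches, axis=0)   # channel dimension
    return conv1x1(concatenated, weights["reduce"])
```

A 3×3 kernel with dilation rate r spans 2r + 1 pixels per side, so rates 6, 12, and 18 cover 13, 25, and 37 pixels without adding parameters beyond those of an ordinary 3×3 kernel.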

[0089] Step S53: Use the preset feature refinement module to perform feature selection processing on the dimensionality-reduced concatenated features to obtain the final output features.

[0090] See Figure 5. In one embodiment, step S53, using a preset feature refinement module to perform feature selection processing on the dimensionality-reduced concatenated features to obtain the final output features, includes steps S531 to S533, which are described in detail below.

[0091] Step S531: Use a global average pooling layer to perform average pooling on the dimensionality-reduced concatenated features to obtain channel weight features.

[0092] Step S532: Perform convolution, normalization, and activation processing on the channel weight features to obtain the final weight features.

[0093] Step S533: Multiply the channel weight features and the final weight features to obtain the final output features.

[0094] In some embodiments, a global average pooling layer is used to perform average pooling on the dimensionality-reduced concatenated features to obtain the channel weight features. The channel weight features are passed through a 1×1 convolution, a batch normalization layer, and a sigmoid activation function layer to obtain the final weight features, where the batch normalization layer performs the normalization and the sigmoid layer performs the activation. The channel weight features and the final weight features are then multiplied to obtain the final output features.

[0095] In some embodiments, the formulas for steps S531 to S533 are expressed as follows:

[0096]

[0097] $\mathrm{Filter}(I_4)_{3 \times 3,\, rate=6},\ \mathrm{Filter}(I_4)_{3 \times 3,\, rate=12},\ \mathrm{Filter}(I_4)_{3 \times 3,\, rate=18})$

[0098]

[0099]

[0100] In these formulas, the first result denotes the concatenated features; concat represents concatenation along the channel dimension; $\mathrm{Filter}(\cdot)_{n \times n,\, rate=k}$ represents a dilated convolution with an n×n kernel and a dilation rate of k; GAP represents the global average pooling operation, which produces the channel weight features; the last result denotes the final output features; I4 represents the features output by the fourth layer encoder; and sigmoid represents the activation function.
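Steps S531 to S533 can be sketched as a small channel-attention function. Two assumptions are made and should be read as such: batch normalization is applied in inference mode with given running statistics, and the sigmoid weights are broadcast over the dimensionality-reduced feature map, as in common channel-attention designs, whereas the text states the product as being between the channel weight features and the final weight features. All names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_refinement(x, w, gamma=1.0, beta=0.0,
                       running_mean=0.0, running_var=1.0, eps=1e-5):
    """x: (C, H, W) reduced concatenated features; w: (C, C) weights of the 1x1 convolution."""
    channel_weights = x.mean(axis=(1, 2))   # global average pooling -> (C,)
    z = w @ channel_weights                 # a 1x1 convolution acting on a 1x1 map
    # Inference-mode batch normalization with assumed running statistics.
    z = gamma * (z - running_mean) / np.sqrt(running_var + eps) + beta
    final_weights = sigmoid(z)              # per-channel weights in (0, 1)
    # Assumption: the weights rescale the input map channel by channel.
    return x * final_weights[:, None, None]
```

Because the weights lie between 0 and 1, the module can suppress channels that carry little useful information while passing the others nearly unchanged.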

[0101] Step S54: Perform human body analysis processing based on the final output features and boundary residuals to obtain multiple human body analysis results.

[0102] See Figure 6. In one embodiment, step S54, performing human body parsing processing based on the final output features and boundary residuals to obtain multiple human body parsing results, includes steps S541 to S544, which are described in detail below.

[0103] Step S541: Input the final output features into multiple convolutional layers for convolution processing to obtain the first parsing result.

[0104] Step S542: Perform upsampling processing on the final output features and the first parsing result respectively to obtain the first upsampled features and the first upsampled result. Perform feature concatenation processing on the first upsampled features, the first upsampled result and the boundary residual to obtain the initial concatenated features. After performing convolution processing and feature refinement processing on the initial concatenated features, the second parsing result is obtained.

[0105] Step S543: Repeat upsampling, feature concatenation, convolution, and feature refinement on the second parsing result and the boundary residual until the third and fourth parsing results are obtained.

[0106] Step S544: Collect the first, second, third, and fourth parsing results to obtain the multiple human body parsing results.

[0107] In some embodiments, the final output features are passed through four convolutional layers to obtain the first parsing result D4. The first parsing result D4 is upsampled by one layer to obtain the first upsampled result, and the final output features are upsampled by one layer to obtain the first upsampled features. The first upsampled result, the first upsampled features, and the boundary residual R3 are concatenated to obtain the initial concatenated features. The initial concatenated features are passed through the feature refinement module and four convolutional layers to obtain the second parsing result D3. The second parsing result and the boundary residual are then subjected to upsampling, feature concatenation, convolution, and feature refinement again until the third and fourth parsing results are obtained.

[0108] In some embodiments, the formulas for steps S541 to S544 are expressed as follows:

[0109]

[0110] Where conv represents a four-layer convolution operation, ARM is the feature refinement module, concat represents feature concatenation along the channel dimension, up represents an upsampling operation, $D_i$ represents the parsing result of the i-th layer, $D_{i+1}$ represents the parsing result of layer i+1, and $R_i$ represents the boundary residual of the i-th layer. The remaining term represents the semantic features of layer i+1.
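The coarse-to-fine flow of steps S541 to S544 can be traced with a shape-level NumPy sketch. It is only an illustration of the data flow: a random 1×1 projection stands in for both the four-layer convolution and the feature refinement module, upsampling is nearest-neighbour by a factor of two, the refined features of one level are assumed to be the semantic features carried to the next, and each boundary residual is assumed to already match the upsampled resolution.

```python
import numpy as np

def upsample2(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) array (assumed operator)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def project(x, out_channels, rng):
    """Hypothetical stand-in for a convolution stack: a random 1x1 projection."""
    w = rng.standard_normal((out_channels, x.shape[0])) / np.sqrt(x.shape[0])
    return np.einsum("oc,chw->ohw", w, x)

def decode(features, residuals, num_classes, rng):
    """features: (C, H, W); residuals: arrays ordered coarse to fine, each twice
    the resolution of the previous level. Returns the parsing results, coarse to fine."""
    result = project(features, num_classes, rng)        # first parsing result
    results = [result]
    for residual in residuals:
        stacked = np.concatenate(
            [upsample2(features), upsample2(result), residual], axis=0)
        # Stand-in for feature refinement followed by convolution.
        features = project(stacked, features.shape[0], rng)
        result = project(features, num_classes, rng)    # next parsing result
        results.append(result)
    return results
```

With three residual levels the function returns four parsing results, each at twice the resolution of the previous one, matching the four results collected in step S544.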

[0111] Because different parts of the human body vary in size, it is difficult to simultaneously identify large and small objects using a single convolutional kernel size. Therefore, an improved spatial pyramid pooling is added after the last encoder layer. Unlike traditional spatial pyramid pooling, an attention-based feature refinement module is added afterward. This module allows the network to adaptively select features for output.

[0112] Step S60: Construct a corresponding cross-entropy loss function based on multiple human body parsing results, and construct a corresponding binary classification loss function based on the boundary binary classification results. Use the final loss function obtained by summing the cross-entropy loss function and the binary classification loss function to train the human body parsing network, and optimize and adjust the trained human body parsing network to obtain a standard human body parsing network. The human body parsing network is constructed and generated by a preset encoder, an improved spatial pyramid pooling, and a preset detail extractor.

[0113] In some embodiments, a corresponding cross-entropy loss function is constructed based on multiple human body parsing results, including:

[0114] The cross-entropy loss function is:

[0115]

[0116] Where $\xi_{seg\text{-}loss}$ represents the cross-entropy loss function, $D_i$ represents the i-th parsing result, the label term paired with it is the ground-truth human body parsing label for the Laplacian features of the i-th layer, and $\xi_{ce}$ represents the label difference value.
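One plausible reading of this loss is a sum of pixel-wise cross-entropy terms, one per parsing result. The sketch below assumes an unweighted sum and a mean over pixels, with labels supplied at the resolution of each result; the function names are hypothetical.

```python
import numpy as np

def cross_entropy(logits, labels):
    """logits: (K, H, W) class scores; labels: (H, W) integer class indices.
    Returns the mean pixel-wise cross-entropy."""
    shifted = logits - logits.max(axis=0, keepdims=True)      # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    h, w = labels.shape
    picked = log_probs[labels, np.arange(h)[:, None], np.arange(w)[None, :]]
    return -picked.mean()

def segmentation_loss(results, labels_per_scale):
    """Sum the cross-entropy of each parsing result against its own label map."""
    return sum(cross_entropy(d, y) for d, y in zip(results, labels_per_scale))
```

Supervising every scale in this way gives the coarse results a training signal of their own instead of relying only on the finest output.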

[0117] In some embodiments, a corresponding binary classification loss function is constructed based on the boundary binary classification results, including:

[0118] The binary classification loss function is:

[0119]

[0120] Where $\xi_{detail\text{-}loss}$ represents the binary classification loss function, $Seg_{detail}$ is the boundary binary classification result, the label term paired with it is the ground-truth human body boundary label, and $\xi_{dice}$ represents the similarity measurement function.
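A Dice-style similarity loss for the boundary branch, and its sum with the segmentation loss as described in step S60, might look as follows. The smoothing constant and the use of a single boundary-probability map (for example, the boundary channel of the two-channel result after a softmax) are assumptions, as are the function names.

```python
import numpy as np

def dice_loss(pred_prob, target, smooth=1.0):
    """pred_prob: (H, W) boundary probabilities; target: (H, W) binary boundary labels.
    Returns 1 - Dice coefficient, with an assumed smoothing constant."""
    intersection = (pred_prob * target).sum()
    dice = (2.0 * intersection + smooth) / (pred_prob.sum() + target.sum() + smooth)
    return 1.0 - dice

def final_loss(seg_loss, detail_loss):
    """Final loss from step S60: the sum of the two loss terms."""
    return seg_loss + detail_loss
```

Dice-type losses are often chosen for boundary maps because boundary pixels are a small fraction of the image, and the overlap ratio is less dominated by the background than a per-pixel average.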

[0121] In some embodiments, the trained human body parsing network is optimized and adjusted to obtain a standard human body parsing network. The optimization can use the SGD optimizer, with the network learning rate adjusted dynamically by a poly learning rate strategy. The ResNet18 encoder backbone is pre-trained on the ImageNet dataset. The batch size and number of epochs are set to 4 and 300, respectively. Under the poly strategy, the initial learning rate is multiplied by a decay factor after each epoch. The network is trained using mini-batch stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0001, yielding the final trained standard human body parsing network.
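The training schedule can be sketched in plain Python. The commonly used poly form, a factor of (1 - epoch / max_epochs) raised to a power, is assumed here, with the exponent 0.9 and the initial learning rate as illustrative values that the text does not state. The update rule follows the usual momentum SGD formulation with weight decay added to the gradient.

```python
def poly_lr(base_lr, epoch, max_epochs=300, power=0.9):
    """Poly schedule (assumed form): base_lr * (1 - epoch / max_epochs) ** power."""
    return base_lr * (1.0 - epoch / max_epochs) ** power

def sgd_step(weight, grad, velocity, lr, momentum=0.9, weight_decay=0.0001):
    """One momentum SGD update on a scalar parameter.
    Weight decay is added to the gradient before the momentum update."""
    grad = grad + weight_decay * weight
    velocity = momentum * velocity + grad
    return weight - lr * velocity, velocity

# Illustrative use: the learning rate for every epoch of a 300-epoch run.
# The initial learning rate of 0.01 is an assumption for the example.
schedule = [poly_lr(0.01, epoch) for epoch in range(300)]
```

The schedule starts at the initial learning rate and decays smoothly towards zero at the final epoch, which tends to stabilize the last stage of training.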

[0122] See also Figure 7. Some embodiments provide a human body recognition method based on drone images, including steps 1 and 2:

[0123] Step 1: Obtain the human image to be parsed.

[0124] Step 2: Input the human image to be parsed into a standard human parsing network for human parsing processing to obtain the human parsing result; wherein, the standard human parsing network is trained by the method as described in steps S10 to S60.

[0125] Step S10: Acquire multiple drone images, perform data augmentation on the multiple drone images, and perform image preprocessing on the data-augmented images to obtain a training image set.

[0126] In some embodiments, data augmentation processing is performed on multiple drone images, including:

[0127] Data augmentation processing includes image translation, image flipping, random brightness transformation, and median filtering.

[0128] In some embodiments, data augmentation of multiple drone images can enhance the generalization ability of the model during subsequent model training.

[0129] In some embodiments, the data-augmented images are preprocessed to obtain a training image set, wherein the image preprocessing includes image cropping and image normalization.
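
The augmentation and preprocessing operations listed above can be sketched on a grayscale image stored as nested lists. The zero fill for translated pixels, the 3×3 median window, the brightness range, and the division by 255 for normalization are assumptions chosen for illustration.

```python
import random
from statistics import median

def flip_horizontal(img):
    return [row[::-1] for row in img]

def translate(img, dx, dy, fill=0):
    """Shift the image by (dx, dy) pixels, filling uncovered pixels with `fill`."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            sy, sx = y - dy, x - dx
            if 0 <= sy < h and 0 <= sx < w:
                out[y][x] = img[sy][sx]
    return out

def random_brightness(img, rng, max_delta=30):
    """Add one random offset to all pixels and clamp to [0, 255]."""
    delta = rng.randint(-max_delta, max_delta)
    return [[min(255, max(0, v + delta)) for v in row] for row in img]

def median_filter3(img):
    """3x3 median filter; border windows are truncated to the image."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(h):
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - 1), min(h, y + 2))
                      for i in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = median(window)
    return out

def crop(img, top, left, height, width):
    return [row[left:left + width] for row in img[top:top + height]]

def normalize(img):
    return [[v / 255.0 for v in row] for row in img]
```

Passing a seeded `random.Random` instance as `rng` keeps the brightness change reproducible across runs.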

[0130] Step S20: Calculate the boundary residuals of the training images in the training image set to obtain the boundary residuals.

[0131] Referring to Figure 2, in some embodiments, step S20, in which boundary residuals are calculated for the training images in the training image set, includes steps S21 to S23, described in detail below.

[0132] Step S21: Downsample the training images in the training image set to obtain a downsampled image set.

[0133] Step S22: Select an image from the downsampled image set that meets the preset number of layers as the current layer image, and select an image with a preset interval from the current layer image as the next layer image.

[0134] In some embodiments, the downsampled image set includes layer 1, layer 2, ..., layer k and layer (k+1). The layer k image in the downsampled image set is selected as the current layer image, and the layer (k+1) image in the downsampled image set is selected as the next layer image. Here, k represents the layer index of the Laplacian pyramid.

[0135] Step S23: Upsample the next layer image to obtain an upsampled image, and calculate the difference between the current layer image and the upsampled image to obtain the boundary residual.

[0136] In some embodiments, the next layer image is upsampled to obtain the upsampled image $\mathrm{up}(L_{k+1})$. The difference between the current layer image and the upsampled image is then calculated, and the boundary residual is expressed as follows:

[0137] $R_k = L_k - \mathrm{up}(L_{k+1})$

[0138] Where $R_k$ is the boundary residual, $L_k$ is the current layer image, $\mathrm{up}(L_{k+1})$ is the upsampled image, and k is the layer index of the Laplacian pyramid.
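
The residual formula can be traced with a minimal Laplacian pyramid sketch. It assumes 2× downsampling by averaging 2×2 blocks and 2× nearest-neighbour upsampling. The text does not specify either operator, so both are illustrative choices, as are the function names.

```python
def downsample(img):
    """Halve the resolution by averaging 2x2 blocks (even height and width assumed)."""
    h, w = len(img), len(img[0])
    return [
        [(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1]) / 4.0
         for x in range(0, w, 2)]
        for y in range(0, h, 2)
    ]

def upsample(img):
    """Double the resolution by nearest-neighbour repetition."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(wide[:])
    return out

def boundary_residuals(image, levels):
    """Return [R_1, ..., R_levels] with R_k = L_k - up(L_{k+1})."""
    pyramid = [image]
    for _ in range(levels):
        pyramid.append(downsample(pyramid[-1]))
    residuals = []
    for k in range(levels):
        up = upsample(pyramid[k + 1])
        residuals.append([[a - b for a, b in zip(row_l, row_u)]
                          for row_l, row_u in zip(pyramid[k], up)])
    return residuals

# A single bright pixel produces a non-zero residual only around that pixel.
image = [[4.0, 0.0, 0.0, 0.0]] + [[0.0] * 4 for _ in range(3)]
residual_levels = boundary_residuals(image, 2)
```

Flat regions cancel out in the subtraction, so the residual keeps mostly edge and detail information.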

[0139] Step S30: Use a preset encoder to extract features from the training images in the training image set to obtain multiple semantic feature maps. Select the semantic feature maps that meet the screening requirements from the multiple semantic feature maps as target feature maps.

[0140] In some embodiments, the preset encoder can be a ResNet18 encoder, which has a four-layer encoding structure. Therefore, by using the ResNet18 encoder to extract features from the training image, four feature maps I1, I2, I3 and I4 are obtained.

[0141] In some embodiments, the network formula for the ResNet18 encoder is as follows:

[0142] $F(\cdot) = F_4 \circ F_3 \circ F_2 \circ F_1$

[0143] Where $F_k(\cdot)$ represents the k-th layer encoder: $F_4$ is the fourth-layer encoder, $F_3$ is the third-layer encoder, $F_2$ is the second-layer encoder, and $F_1$ is the first-layer encoder.
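
The composition of the four encoder stages can be sketched at the level of tensor shapes. The stages below are stand-ins that only track (channels, height, width). The channel counts 64, 128, 256, 512 and the strides follow the standard ResNet18 layout, and it is an assumption that the four encoding structures in the text map onto them. The 512×512 input size is also hypothetical.

```python
def make_stage(out_channels, stride):
    """Return a stand-in stage that maps an input shape to an output shape."""
    def stage(shape):
        _, h, w = shape
        return (out_channels, h // stride, w // stride)
    return stage

def run_encoder(stages, x):
    """Apply F1, F2, F3, F4 in order and keep every intermediate output."""
    features = []
    for stage in stages:
        x = stage(x)
        features.append(x)
    return features

stages = [make_stage(64, 4), make_stage(128, 2), make_stage(256, 2), make_stage(512, 2)]
I1, I2, I3, I4 = run_encoder(stages, (3, 512, 512))

# The screening requirement selects the second stage output as the target feature map.
target_feature_map = I2
```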

[0144] In some embodiments, the screening requirements are predefined: among the feature maps corresponding to the different encoding structures of the preset encoder, the feature map corresponding to the second encoding structure is selected as the target feature map.

[0145] Step S40: Use a preset detail extractor to perform boundary supervision processing on the target feature map to obtain the boundary binary classification result.

[0146] In some embodiments, the target feature map is input into a preset detail extractor to obtain a boundary binary classification result, wherein the boundary binary classification result represents the boundary information of the input human body, thereby improving the model's ability to recognize the human body boundary.

[0147] In some embodiments, the boundary binary classification result is expressed as $Seg_{detail} \in \mathbb{R}^{2 \times H \times W}$, where H is the height of the target feature map and W is the width of the target feature map.

[0148] Step S50: Use the improved spatial pyramid pooling module and boundary residuals to perform multi-scale pooling on the preset number of feature maps in multiple semantic feature maps to obtain multiple human body parsing results.

[0149] Referring to Figure 3, in one embodiment, step S50, in which the improved spatial pyramid pooling module and the boundary residuals are used to perform multi-scale pooling on the preset layer feature maps among the multiple semantic feature maps to obtain multiple human body parsing results, includes steps S51 to S54, described in detail below.

[0150] Step S51: Input the preset layer feature map into the improved spatial pyramid pooling module for feature convolution processing to obtain multiple pooled features.

[0151] Referring to Figure 4, in one embodiment, step S51, in which the preset layer feature map is input into the improved spatial pyramid pooling module for feature convolution processing to obtain multiple pooled features, includes steps S511 to S513, described in detail below.

[0152] Step S511: Perform convolution processing on the feature map of the preset number of layers to obtain the first layer pooling feature of the improved spatial pyramid pooling module.

[0153] Step S512: Perform dilated convolution processing on the feature map of the preset number of layers using dilated convolution layers with different dilation rates to obtain multiple dilated pooling features.

[0154] Step S513: Collect the first layer pooling feature and the multiple dilated pooling features to obtain multiple pooling features.

[0155] In some embodiments, the preset layer feature map is the feature I4 output by the fourth-layer encoder of the preset encoder. It is first processed by a 1×1 convolution to obtain the first-layer pooling feature of the improved spatial pyramid pooling module. Dilated convolutional layers with different dilation rates are then applied to the preset layer feature map to obtain multiple dilated pooling features. Specifically, I4 is passed through a 3×3 dilated convolutional layer with a dilation rate of 6 to obtain the second-layer pooling feature, through a 3×3 dilated convolutional layer with a dilation rate of 12 to obtain the third-layer pooling feature, and through a 3×3 dilated convolutional layer with a dilation rate of 18 to obtain the fourth-layer pooling feature of the improved spatial pyramid pooling module.
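
To make the effect of the dilation rate concrete, the sketch below applies a 3×3 dilated convolution to a single-channel map at rates 6, 12, and 18, alongside a 1×1 branch. The single channel, the zero padding that keeps the output the same size as the input, and the all-ones kernel are simplifying assumptions for illustration only.

```python
def dilated_conv3x3(feat, kernel, rate):
    """3x3 convolution whose taps are `rate` pixels apart, with zero padding."""
    h, w = len(feat), len(feat[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for ky in range(3):
                for kx in range(3):
                    sy = y + (ky - 1) * rate
                    sx = x + (kx - 1) * rate
                    if 0 <= sy < h and 0 <= sx < w:
                        acc += kernel[ky][kx] * feat[sy][sx]
            out[y][x] = acc
    return out

def conv1x1(feat, weight):
    """1x1 convolution on a single channel reduces to a per-pixel scaling."""
    return [[weight * v for v in row] for row in feat]

def pyramid_branches(feat, kernel, rates=(6, 12, 18)):
    """Return the 1x1 branch followed by one dilated branch per rate."""
    return [conv1x1(feat, 1.0)] + [dilated_conv3x3(feat, kernel, r) for r in rates]

# A 13x13 map with a single active pixel at the centre.
feat = [[0.0] * 13 for _ in range(13)]
feat[6][6] = 1.0
ones = [[1.0] * 3 for _ in range(3)]
branches = pyramid_branches(feat, ones)
```

With rate 6 the corner pixel already sees the centre pixel, which shows how larger rates widen the receptive field without adding kernel weights.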

[0156] Step S52: Perform feature concatenation on multiple pooling features to obtain concatenated features, and use a convolutional layer to reduce the dimensionality of the concatenated features to obtain the dimensionality-reduced concatenated features.

[0157] In some embodiments, multiple pooling features are concatenated along the channel dimension to obtain concatenated features. The concatenated features are input into a 1×1 convolutional layer to reduce the number of channels, resulting in dimensionality-reduced concatenated features.
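
Channel concatenation followed by a 1×1 convolution can be sketched as follows. Features are stored as nested lists indexed as feature[channel][row][col], and the 1×1 convolution is written as a per-pixel weighted sum over channels without bias. The weights are hypothetical.

```python
def concat_channels(*features):
    """Concatenate features along the channel dimension."""
    out = []
    for feature in features:
        out.extend(feature)
    return out

def conv1x1_reduce(feature, weights):
    """1x1 convolution: weights[o][c] mixes input channel c into output channel o."""
    h, w = len(feature[0]), len(feature[0][0])
    return [
        [[sum(wc * feature[c][y][x] for c, wc in enumerate(row_w)) for x in range(w)]
         for y in range(h)]
        for row_w in weights
    ]

branch_a = [[[1.0, 2.0]]]  # 1 channel, 1x2 map
branch_b = [[[3.0, 4.0]]]  # 1 channel, 1x2 map
stacked = concat_channels(branch_a, branch_b)         # 2 channels
reduced = conv1x1_reduce(stacked, [[0.5, 0.5]])       # back to 1 channel
```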

[0158] Step S53: Use the preset feature refinement module to perform feature selection processing on the dimensionality-reduced concatenated features to obtain the final output features.

[0159] Referring to Figure 5, in one embodiment, step S53, in which a preset feature refinement module is used to perform feature selection processing on the dimensionality-reduced concatenated features to obtain the final output features, includes steps S531 to S533, described in detail below.

[0160] Step S531: Use a global average pooling layer to perform average pooling on the dimensionality-reduced concatenated features to obtain channel weight features.

[0161] Step S532: Perform convolution, normalization, and activation processing on the channel weight features to obtain the final weight features.

[0162] Step S533: Multiply the channel weight features and the final weight features to obtain the final output features.

[0163] In some embodiments, a global average pooling layer is used to perform average pooling on the dimensionality-reduced concatenated features to obtain channel weight features. The channel weight features pass through a 1×1 convolution, a batch normalization layer, and a sigmoid activation function layer to obtain the final weight features. The batch normalization layer is used for normalization, and the sigmoid activation function layer is used for activation. The channel weight features and the final weight features are then multiplied to obtain the final output features.
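
A minimal sketch of the refinement steps is given below. The text describes the last step as a product of the channel weight features and the final weight features. The sketch instead assumes the common attention-refinement reading, in which the sigmoid gates rescale each channel of the input feature map. It also assumes batch normalization with fixed statistics and no learned scale or shift. All names and values are hypothetical.

```python
import math

def global_average_pool(feature):
    """Average each channel over its spatial positions (channel weight features)."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]

def refine(feature, conv_weights, bn_mean, bn_var, eps=1e-5):
    """GAP -> 1x1 conv -> batch norm -> sigmoid -> channel-wise rescaling."""
    pooled = global_average_pool(feature)
    # A 1x1 convolution on a 1x1 map is a matrix-vector product.
    mixed = [sum(w * p for w, p in zip(row, pooled)) for row in conv_weights]
    normed = [(m - mu) / math.sqrt(var + eps)
              for m, mu, var in zip(mixed, bn_mean, bn_var)]
    gates = [1.0 / (1.0 + math.exp(-n)) for n in normed]  # final weight features
    # Assumption: the gates are broadcast over the spatial map of each channel.
    return [[[g * v for v in row] for row in ch] for g, ch in zip(gates, feature)]

feature = [[[2.0, 2.0]], [[4.0, 4.0]]]   # 2 channels, 1x2 map
identity = [[1.0, 0.0], [0.0, 1.0]]
refined = refine(feature, identity, bn_mean=[2.0, 4.0], bn_var=[1.0, 1.0])
```

In this toy setting the normalized values are zero, so every gate equals 0.5 and each channel is halved.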

[0164] In some embodiments, the formulas for steps S531 to S533 are expressed as follows:

[0165]

[0166] $\mathrm{Filter}(I_4)_{3\times 3,\,rate=6},\ \mathrm{Filter}(I_4)_{3\times 3,\,rate=12},\ \mathrm{Filter}(I_4)_{3\times 3,\,rate=18})$

[0167]

[0168]

[0169] Where the formulas give the concatenated features, the channel weight features, and the final output features; concat represents concatenation along the channel dimension; $\mathrm{Filter}(\cdot)_{n\times n,\,rate=k}$ represents a dilated convolution with an n×n kernel and a dilation rate of k; GAP represents the global average pooling operation; I4 represents the features output by the fourth-layer encoder; and sigmoid represents the activation function.

[0170] Step S54: Perform human body parsing processing based on the final output features and boundary residuals to obtain multiple human body parsing results.

[0171] Referring to Figure 6, in one embodiment, step S54, in which human body parsing processing is performed based on the final output features and boundary residuals to obtain multiple human body parsing results, includes steps S541 to S544, described in detail below.

[0172] Step S541: Input the final output features into multiple convolutional layers for convolution processing to obtain the first parsing result.

[0173] Step S542: Perform upsampling processing on the final output features and the first parsing result respectively to obtain the first upsampled features and the first upsampled result. Perform feature concatenation processing on the first upsampled features, the first upsampled result and the boundary residual to obtain the initial concatenated features. After performing convolution processing and feature refinement processing on the initial concatenated features, the second parsing result is obtained.

[0174] Step S543: Repeat the upsampling, feature concatenation, convolution, and feature refinement on the second parsing result and the boundary residual until the third and fourth parsing results are obtained.

[0175] Step S544: Collect the first, second, third, and fourth parsing results to obtain multiple human body parsing results.

[0176] In some embodiments, the final output features pass through four convolutional layers to obtain the first parsing result D4. The first parsing result D4 is upsampled by one layer to obtain the first upsampled result, and the final output features are upsampled by one layer to obtain the first upsampled features. The first upsampled result, the first upsampled features, and the boundary residual R3 are concatenated to obtain the initial concatenated features. The initial concatenated features pass through the feature refinement module and four convolutional layers to obtain the second parsing result D3. The upsampling, feature concatenation, convolution, and feature refinement are then repeated on the second parsing result and the boundary residual until the third and fourth parsing results are obtained.

[0177] In some embodiments, the formulas for steps S541 to S544 are expressed as follows:

[0178]

[0179] Where conv represents a four-layer convolution operation, ARM is the feature refinement module, concat represents feature concatenation along the channel dimension, up represents an upsampling operation, $D_i$ represents the parsing result of the i-th layer, $D_{i+1}$ represents the parsing result of layer i+1, $R_i$ represents the boundary residual of the i-th layer, and the remaining feature term represents the semantic features of layer i+1.
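
The cascade in steps S541 to S544 can be sketched structurally. The operators are passed in as callables, and the example tracks only (channels, height, width) shapes. It is an assumption that the refined features of each level serve as the semantic features for the next level. The class count of 20, the channel widths, and the residual resolutions are hypothetical.

```python
def decode(final_features, residuals, conv, arm, up, concat):
    """Return {4: D4, 3: D3, 2: D2, 1: D1} from coarse to fine."""
    feats = final_features
    results = {4: conv(feats)}                       # first parsing result D4
    for i in (3, 2, 1):
        merged = concat(up(feats), up(results[i + 1]), residuals[i])
        feats = arm(merged)                          # feature refinement
        results[i] = conv(feats)                     # parsing result D_i
    return results

# Shape-only stand-ins for the real operators.
def conv(shape):
    return (20, shape[1], shape[2])                  # 20 classes: assumption

def arm(shape):
    return (256, shape[1], shape[2])                 # channel width: assumption

def up(shape):
    return (shape[0], shape[1] * 2, shape[2] * 2)

def concat(*shapes):
    h, w = shapes[0][1], shapes[0][2]
    assert all(s[1] == h and s[2] == w for s in shapes), "spatial sizes must match"
    return (sum(s[0] for s in shapes), h, w)

residuals = {3: (3, 32, 32), 2: (3, 64, 64), 1: (3, 128, 128)}
results = decode((256, 16, 16), residuals, conv, arm, up, concat)
```

Each level doubles the resolution, so the boundary residual at that level must match the upsampled size before concatenation.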

[0180] Because different parts of the human body vary in size, it is difficult to simultaneously identify large and small objects using a single convolutional kernel size. Therefore, an improved spatial pyramid pooling is added after the last encoder layer. Unlike traditional spatial pyramid pooling, an attention-based feature refinement module is added afterward. This module allows the network to adaptively select features for output.

[0181] Step S60: Construct a corresponding cross-entropy loss function based on multiple human body parsing results, and construct a corresponding binary classification loss function based on the boundary binary classification results. Use the final loss function obtained by summing the cross-entropy loss function and the binary classification loss function to train the human body parsing network, and optimize and adjust the trained human body parsing network to obtain a standard human body parsing network. The human body parsing network is constructed and generated by a preset encoder, an improved spatial pyramid pooling, and a preset detail extractor.

[0182] In some embodiments, a corresponding cross-entropy loss function is constructed based on multiple human body parsing results, including:

[0183] The cross-entropy loss function is:

[0184]

[0185] Where $\xi_{\text{seg-loss}}$ represents the cross-entropy loss function, the label term denotes the true human body parsing labels for the Laplacian features of the i-th layer, $D_i$ represents the i-th parsing result, and $\xi_{ce}$ represents the label difference value.

[0186] In some embodiments, a corresponding binary classification loss function is constructed based on the boundary binary classification results, including:

[0187] The binary classification loss function is:

[0188]

[0189] Where $\xi_{\text{detail-loss}}$ represents the binary classification loss function, the label term is the real human body boundary label, $Seg_{detail}$ is the boundary binary classification result, and $\xi_{dice}$ represents the similarity measurement function.

[0190] In some embodiments, the trained human body parsing network is optimized and adjusted to obtain a standard human body parsing network. The optimization can use the SGD optimizer, with the learning rate adjusted dynamically by a poly learning rate strategy. The ResNet18 encoding backbone is pre-trained on the ImageNet dataset. The batch size and the number of epochs are set to 4 and 300, respectively. Under the poly strategy, the initial learning rate is multiplied by a decay factor after each epoch. The network is trained with mini-batch stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0001, yielding the final trained standard human body parsing network.

[0191] Referring to Figure 8, some embodiments provide a training device for a human body parsing network based on UAV images, including a residual calculation module 10, a boundary supervision module 20, a multi-scale pooling module 30, and a network training module 40.

[0192] The residual calculation module 10 is used to acquire multiple UAV images, perform data augmentation on the multiple UAV images, and perform image preprocessing on the data-augmented images to obtain a training image set. The module then performs boundary residual calculation on the training images in the training image set to obtain the boundary residual.

[0193] In some embodiments, the residual calculation module 10 performs data augmentation processing on multiple UAV images, including:

[0194] Data augmentation processing includes image translation, image flipping, random brightness transformation, and median filtering.

[0195] In some embodiments, data augmentation of multiple drone images can enhance the generalization ability of the model during subsequent model training.

[0196] In some embodiments, the data-augmented images are preprocessed to obtain a training image set, wherein the image preprocessing includes image cropping and image normalization.

[0197] In some embodiments, the residual calculation module 10 performs boundary residual calculation on the training images in the training image set to obtain the boundary residuals, which can be achieved as follows:

[0198] The residual calculation module 10 performs downsampling processing on the training images in the training image set to obtain a downsampled image set. It selects images in the downsampled image set that meet the preset number of layers as the current layer images and selects images with a preset interval from the current layer images as the next layer images.

[0199] In some embodiments, the downsampled image set includes layer 1, layer 2, ..., layer k and layer (k+1). The layer k image in the downsampled image set is selected as the current layer image, and the layer (k+1) image in the downsampled image set is selected as the next layer image. Here, k represents the layer index of the Laplacian pyramid.

[0200] The residual calculation module 10 performs upsampling processing on the next layer image to obtain an upsampled image, and calculates the difference between the current layer image and the upsampled image to obtain the boundary residual.

[0201] In some embodiments, the next layer image is upsampled to obtain the upsampled image $\mathrm{up}(L_{k+1})$. The difference between the current layer image and the upsampled image is then calculated, and the boundary residual is expressed as follows:

[0202] $R_k = L_k - \mathrm{up}(L_{k+1})$

[0203] Where $R_k$ is the boundary residual, $L_k$ is the current layer image, and $\mathrm{up}(L_{k+1})$ is the upsampled image.

[0204] The boundary supervision module 20 is used to extract features from the training images in the training image set using a preset encoder to obtain multiple semantic feature maps. The semantic feature maps that meet the screening requirements are selected as target feature maps. The target feature maps are then subjected to boundary supervision processing using a preset detail extractor to obtain boundary binary classification results.

[0205] In some embodiments, the preset encoder can be a ResNet18 encoder, which has a four-layer encoding structure. Therefore, by using the ResNet18 encoder to extract features from the training image, four feature maps I1, I2, I3 and I4 are obtained.

[0206] In some embodiments, the network formula for the ResNet18 encoder is as follows:

[0207] $F(\cdot) = F_4 \circ F_3 \circ F_2 \circ F_1$

[0208] Where $F_k(\cdot)$ represents the k-th layer encoder.

[0209] In some embodiments, the screening requirements are predefined: among the feature maps corresponding to the different encoding structures of the preset encoder, the feature map corresponding to the second encoding structure is selected as the target feature map.

[0210] In some embodiments, the boundary supervision module 20 uses a preset detail extractor to perform boundary supervision processing on the target feature map to obtain a boundary binary classification result.

[0211] In some embodiments, the boundary supervision module 20 inputs the target feature map into a preset detail extractor to obtain a boundary binary classification result, wherein the boundary binary classification result represents the boundary information of the input human body, thereby improving the model's ability to recognize the human body boundary.

[0212] In some embodiments, the boundary binary classification result is expressed as $Seg_{detail} \in \mathbb{R}^{2 \times H \times W}$, where H is the height of the target feature map and W is the width of the target feature map.

[0213] The multi-scale pooling module 30 is used to perform multi-scale pooling on the preset number of feature maps in multiple semantic feature maps using the improved spatial pyramid pooling module and boundary residuals, so as to obtain multiple human body parsing results.

[0214] In some embodiments, the multi-scale pooling module 30 inputs the preset layer feature map into the improved spatial pyramid pooling module for feature convolution processing to obtain multiple pooled features, which can be implemented as follows:

[0215] The multi-scale pooling module 30 performs convolution processing on the feature map of the preset number of layers to obtain the first layer pooling feature of the improved spatial pyramid pooling module. It then performs dilated convolution processing on the feature map of the preset number of layers using dilated convolution layers with different dilation rates to obtain multiple dilated pooling features. Finally, it summarizes the first layer pooling feature and the multiple dilated pooling features to obtain multiple pooling features.

[0216] In some embodiments, the preset layer feature map is the feature I4 output by the fourth-layer encoder of the preset encoder. It is first processed by a 1×1 convolution to obtain the first-layer pooling feature of the improved spatial pyramid pooling module. Dilated convolutional layers with different dilation rates are then applied to the preset layer feature map to obtain multiple dilated pooling features. Specifically, I4 is passed through a 3×3 dilated convolutional layer with a dilation rate of 6 to obtain the second-layer pooling feature, through a 3×3 dilated convolutional layer with a dilation rate of 12 to obtain the third-layer pooling feature, and through a 3×3 dilated convolutional layer with a dilation rate of 18 to obtain the fourth-layer pooling feature of the improved spatial pyramid pooling module.

[0217] The multi-scale pooling module 30 performs feature concatenation on multiple pooling features to obtain concatenated features. The concatenated features are then dimensionality-reduced using a convolutional layer to obtain the dimensionality-reduced concatenated features.

[0218] In some embodiments, the multi-scale pooling module 30 concatenates multiple pooling features along the channel dimension to obtain concatenated features. The concatenated features are input into a 1×1 convolutional layer to reduce the number of channels, resulting in dimensionality-reduced concatenated features.

[0219] The multi-scale pooling module 30 uses a preset feature refinement module to perform feature selection processing on the dimensionality-reduced concatenated features to obtain the final output features.

[0220] In one embodiment, the multi-scale pooling module 30 uses a preset feature refinement module to perform feature selection processing on the dimensionality-reduced concatenated features to obtain the final output features. This can be achieved as follows: the multi-scale pooling module 30 uses a global average pooling layer to perform average pooling processing on the dimensionality-reduced concatenated features to obtain channel weight features.

[0221] The multi-scale pooling module 30 performs convolution, normalization, and activation processing on the channel weight features to obtain the final weight features. The channel weight features and the final weight features are then multiplied to obtain the final output features.

[0222] In some embodiments, a global average pooling layer is used to perform average pooling on the dimensionality-reduced concatenated features to obtain channel weight features. The channel weight features pass through a 1×1 convolution, a batch normalization layer, and a sigmoid activation function layer to obtain the final weight features. The batch normalization layer is used for normalization, and the sigmoid activation function layer is used for activation. The channel weight features and the final weight features are then multiplied to obtain the final output features.

[0223] In some embodiments, the formulas for steps S531 to S533 are expressed as follows:

[0224]

[0225] $\mathrm{Filter}(I_4)_{3\times 3,\,rate=6},\ \mathrm{Filter}(I_4)_{3\times 3,\,rate=12},\ \mathrm{Filter}(I_4)_{3\times 3,\,rate=18})$

[0226]

[0227]

[0228] Where the formulas give the concatenated features, the channel weight features, and the final output features; concat represents concatenation along the channel dimension; $\mathrm{Filter}(\cdot)_{n\times n,\,rate=k}$ represents a dilated convolution with an n×n kernel and a dilation rate of k; and GAP represents the global average pooling operation.

[0229] The multi-scale pooling module 30 performs human body parsing processing based on the final output features and boundary residuals to obtain multiple human body parsing results.

[0230] In one embodiment, the multi-scale pooling module 30 performs human body parsing processing based on the final output features and boundary residuals to obtain multiple human body parsing results, which can be implemented as follows:

[0231] The multi-scale pooling module 30 inputs the final output features into multiple convolutional layers for convolution processing to obtain the first parsing result. Upsampling processing is then performed on the final output features and the first parsing result to obtain the first upsampled features and the first upsampled result. Feature concatenation processing is then performed on the first upsampled features, the first upsampled result, and the boundary residual to obtain the initial concatenated features. After convolution processing and feature refinement processing on the initial concatenated features, the second parsing result is obtained. Upsampling processing, feature concatenation processing, convolution processing, and feature refinement processing are then performed again on the second parsing result and the boundary residual until the third parsing result and the fourth parsing result are obtained. The first parsing result, the second parsing result, the third parsing result, and the fourth parsing result are then summarized to obtain multiple human body parsing results.

[0232] In some embodiments, the final output features pass through four convolutional layers to obtain the first parsing result D4. The first parsing result D4 is upsampled by one layer to obtain the first upsampled result, and the final output features are upsampled by one layer to obtain the first upsampled features. The first upsampled result, the first upsampled features, and the boundary residual R3 are concatenated to obtain the initial concatenated features. The initial concatenated features pass through the feature refinement module and four convolutional layers to obtain the second parsing result D3. The upsampling, feature concatenation, convolution, and feature refinement are then repeated on the second parsing result and the boundary residual until the third and fourth parsing results are obtained.

[0233] In some embodiments, the above process is expressed by the following formula:

[0234]

[0235] Where conv represents a four-layer convolution operation, ARM is the feature refinement module, concat represents feature concatenation along the channel dimension, up represents an upsampling operation, $D_{i+1}$ represents the parsing result of layer i+1, $R_i$ represents the boundary residual of the i-th layer, and the remaining feature term represents the semantic features of layer i+1.

[0236] Because different parts of the human body vary in size, it is difficult to simultaneously identify large and small objects using a single convolutional kernel size. Therefore, an improved spatial pyramid pooling is added after the last encoder layer. Unlike traditional spatial pyramid pooling, an attention-based feature refinement module is added afterward. This module allows the network to adaptively select features for output.

[0237] The network training module 40 is used to construct the corresponding cross-entropy loss function based on multiple human body parsing results, and construct the corresponding binary classification loss function based on the boundary binary classification results. The final loss function obtained by summing the cross-entropy loss function and the binary classification loss function is used to train the human body parsing network. The trained human body parsing network is then optimized and adjusted to obtain the standard human body parsing network. The human body parsing network is constructed and generated by a preset encoder, an improved spatial pyramid pooling, and a preset detail extractor.

[0238] In some embodiments, a corresponding cross-entropy loss function is constructed based on multiple human body parsing results, including:

[0239] The cross-entropy loss function is:

[0240]

[0241] Where $\xi_{\text{seg-loss}}$ represents the cross-entropy loss function, the label term denotes the true human body parsing labels for the Laplacian features of the i-th layer, and $D_i$ represents the i-th parsing result.

[0242] In some embodiments, a corresponding binary classification loss function is constructed based on the boundary binary classification results, including:

[0243] The binary classification loss function is:

[0244]

[0245] Where $\xi_{\text{detail-loss}}$ represents the binary classification loss function, the label term is the real human body boundary label, and $Seg_{detail}$ is the boundary binary classification result.

[0246] In some embodiments, the trained human body parsing network is optimized and adjusted to obtain a standard human body parsing network. The optimization can use the SGD optimizer, with the learning rate adjusted dynamically by a poly learning rate strategy. The ResNet18 encoding backbone is pre-trained on the ImageNet dataset. The batch size and the number of epochs are set to 4 and 300, respectively. Under the poly strategy, the initial learning rate is multiplied by a decay factor after each epoch. The network is trained with mini-batch stochastic gradient descent (SGD) with a momentum of 0.9 and a weight decay of 0.0001, yielding the final trained standard human body parsing network.

[0247] The embodiment constructs a human body parsing network based on a preset encoder, improved spatial pyramid pooling, and a preset detail extractor. A boundary residual calculation process is added, and the calculated boundary residuals are used to train, optimize, and adjust the human body parsing network to obtain a standard human body parsing network. This can improve the accuracy of human body parsing in UAV images using the standard human body parsing network.

[0248] Those skilled in the art will understand that all or part of the functions of the various methods in the above embodiments can be implemented by hardware or by computer programs. When all or part of the functions in the above embodiments are implemented by computer programs, the program can be stored in a computer-readable storage medium, which may include: read-only memory, random access memory, disk, optical disk, hard disk, etc., and the program is executed by a computer to achieve the above functions. For example, the program can be stored in the memory of a device, and when the program in the memory is executed by the processor, all or part of the above functions can be achieved. In addition, when all or part of the functions in the above embodiments are implemented by computer programs, the program can also be stored in a server, another computer, disk, optical disk, flash drive, or external hard drive, etc., and can be downloaded or copied to the memory of a local device, or the system of the local device can be updated. When the program in the memory is executed by the processor, all or part of the functions in the above embodiments can be achieved.

[0249] The above examples illustrate the present invention only to aid in understanding it and are not intended to limit the scope of the invention. Those skilled in the art can make various simple deductions, modifications, or substitutions based on the principles of this invention.

Claims

1. A method for training a human body parsing network based on UAV images, characterized in that it comprises:
acquiring multiple UAV images, performing data augmentation on the multiple UAV images, and performing image preprocessing on the augmented images to obtain a training image set;
performing boundary residual calculation on the training images in the training image set to obtain boundary residuals;
performing feature extraction on the training images in the training image set using a preset encoder to obtain multiple semantic feature maps, and selecting the semantic feature maps that meet the screening requirements as target feature maps;
performing boundary supervision on the target feature maps using a preset detail extractor to obtain a boundary binary classification result;
inputting the preset-layer feature maps from the multiple semantic feature maps into an improved spatial pyramid pooling module for feature convolution processing to obtain multiple pooled features, wherein the improved spatial pyramid pooling module includes a feature refinement module;
concatenating the multiple pooled features to obtain concatenated features, reducing the dimensionality of the concatenated features using a convolutional layer to obtain dimensionality-reduced concatenated features, and performing feature selection on the dimensionality-reduced concatenated features using the feature refinement module to obtain final output features;
wherein performing feature selection on the dimensionality-reduced concatenated features using the feature refinement module to obtain the final output features includes: performing average pooling on the dimensionality-reduced concatenated features using a global average pooling layer to obtain channel weight features; performing convolution, normalization, and activation on the channel weight features to obtain final weight features; and multiplying the channel weight features and the final weight features to obtain the final output features;
performing human body parsing based on the final output features and the boundary residuals to obtain multiple human body parsing results;
constructing a corresponding cross-entropy loss function based on the multiple human body parsing results, constructing a corresponding binary classification loss function based on the boundary binary classification result, and training the human body parsing network using the final loss function obtained by summing the cross-entropy loss function and the binary classification loss function;
optimizing and adjusting the trained human body parsing network to obtain a standard human body parsing network;
wherein the human body parsing network is constructed from the preset encoder, the improved spatial pyramid pooling module, and the preset detail extractor.
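The final loss recited in claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions: the binary classification loss is taken to be binary cross-entropy, every parsing result is assumed to have been resized to the label resolution and converted to per-pixel class probabilities, and the terms are summed with equal weight. The function names are hypothetical.

```python
import math

EPS = 1e-12  # guards against log(0)


def cross_entropy(probs, labels):
    """Mean pixel-wise cross-entropy.

    probs:  one list of class probabilities per pixel
    labels: one ground-truth class index per pixel
    """
    return -sum(math.log(p[y] + EPS) for p, y in zip(probs, labels)) / len(labels)


def binary_cross_entropy(boundary_probs, boundary_labels):
    """Mean binary cross-entropy over boundary / non-boundary pixels (assumed loss)."""
    total = 0.0
    for p, y in zip(boundary_probs, boundary_labels):
        total -= y * math.log(p + EPS) + (1 - y) * math.log(1 - p + EPS)
    return total / len(boundary_labels)


def final_loss(parsing_probs_list, labels, boundary_probs, boundary_labels):
    """Sum of the cross-entropy of every parsing result and the boundary loss."""
    ce = sum(cross_entropy(probs, labels) for probs in parsing_probs_list)
    bce = binary_cross_entropy(boundary_probs, boundary_labels)
    return ce + bce
```

With four parsing results, as in claim 4, `parsing_probs_list` would hold four entries, one per decoding stage.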

2. The method as described in claim 1, characterized in that performing boundary residual calculation on the training images in the training image set to obtain the boundary residuals includes:
downsampling the training images in the training image set to obtain a downsampled image set;
selecting, from the downsampled image set, the image at a preset layer number as the current-layer image, and selecting the image at a preset interval from the current-layer image as the lower-layer image;
upsampling the lower-layer image to obtain an upsampled image;
calculating the difference between the current-layer image and the upsampled image to obtain the boundary residual.
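The boundary residual of claim 2 resembles one level of a Laplacian pyramid, and a minimal sketch is given here on a single-channel image stored as a list of rows. The 2x2 average pooling used for downsampling, the nearest-neighbour upsampling, and the function names are all assumptions, since the claim does not fix the resampling operators. Image sides are assumed to be divisible by 2 at every level used.

```python
def downsample2x(img):
    """Halve both dimensions by 2x2 average pooling (assumed operator)."""
    h, w = len(img) // 2, len(img[0]) // 2
    return [[(img[2 * r][2 * c] + img[2 * r][2 * c + 1]
              + img[2 * r + 1][2 * c] + img[2 * r + 1][2 * c + 1]) / 4.0
             for c in range(w)] for r in range(h)]


def upsample2x(img):
    """Double both dimensions by nearest-neighbour repetition (assumed operator)."""
    out = []
    for row in img:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def build_pyramid(img, levels):
    """Return [img, img/2, img/4, ...] with `levels` downsampled layers."""
    pyramid = [img]
    for _ in range(levels):
        pyramid.append(downsample2x(pyramid[-1]))
    return pyramid


def boundary_residual(pyramid, level, interval=1):
    """Current layer minus the lower layer upsampled back to the same size."""
    current = pyramid[level]
    lower = pyramid[level + interval]
    for _ in range(interval):
        lower = upsample2x(lower)
    return [[c - u for c, u in zip(cur_row, up_row)]
            for cur_row, up_row in zip(current, lower)]


# A vertical step edge: the residual is non-zero only next to the edge.
image = [[0, 8, 8, 8] for _ in range(4)]
residual = boundary_residual(build_pyramid(image, levels=1), level=0)
```

Flat regions survive the downsample and upsample round trip unchanged, so the residual concentrates on edges, which is what makes it usable as a boundary cue.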

3. The method as described in claim 1, characterized in that inputting the preset-layer feature maps from the multiple semantic feature maps into the improved spatial pyramid pooling module for feature convolution processing to obtain multiple pooled features includes:
performing convolution on the preset-layer feature maps to obtain the first pooled feature of the improved spatial pyramid pooling module;
performing dilated convolution on the preset-layer feature maps using dilated convolution layers with different dilation rates to obtain multiple dilated pooled features;
combining the first pooled feature and the multiple dilated pooled features to obtain the multiple pooled features.
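The branches of claim 3 can be sketched on a single-channel feature map as below. This is only an illustration: real layers would use learned multi-channel kernels, whereas fixed kernels are used here, and the dilation rates (1, 2, 4) are hypothetical because the claim does not state them. `conv2d` and `aspp_branches` are illustrative names.

```python
def conv2d(feature, kernel, dilation=1):
    """Single-channel 2D convolution with zero padding and same-size output.

    A dilation of d spaces the kernel taps d pixels apart, enlarging the
    receptive field without adding weights.
    """
    h, w = len(feature), len(feature[0])
    k = len(kernel)
    half = k // 2
    out = [[0.0] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            acc = 0.0
            for i in range(k):
                for j in range(k):
                    rr = r + (i - half) * dilation
                    cc = c + (j - half) * dilation
                    if 0 <= rr < h and 0 <= cc < w:  # taps outside the map read as zero
                        acc += kernel[i][j] * feature[rr][cc]
            out[r][c] = acc
    return out


def aspp_branches(feature, rates=(1, 2, 4)):
    """Return [plain 1x1 branch, dilated 3x3 branch for each rate]."""
    point_kernel = [[1.0]]                             # stand-in for a learned 1x1 conv
    box_kernel = [[1.0] * 3 for _ in range(3)]         # stand-in for a learned 3x3 conv
    branches = [conv2d(feature, point_kernel)]         # first pooled feature
    branches += [conv2d(feature, box_kernel, dilation=d) for d in rates]
    return branches                                    # concatenated channel-wise downstream


# A single active pixel at the centre of a 5x5 map.
feature = [[0.0] * 5 for _ in range(5)]
feature[2][2] = 1.0
branches = aspp_branches(feature, rates=(1, 2))
```

With dilation 2, the corner output at (0, 0) already sees the centre pixel, while with dilation 1 it does not, which shows how different rates capture context at different scales.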

4. The method as described in claim 1, characterized in that performing human body parsing based on the final output features and the boundary residuals to obtain multiple human body parsing results includes:
inputting the final output features into multiple convolutional layers for convolution processing to obtain a first parsing result;
upsampling the final output features and the first parsing result respectively to obtain first upsampled features and a first upsampled result;
concatenating the first upsampled features, the first upsampled result, and the boundary residual to obtain initial concatenated features, and performing convolution and feature refinement on the initial concatenated features to obtain a second parsing result;
repeating the upsampling, feature concatenation, convolution, and feature refinement on the second parsing result and the boundary residual until a third parsing result and a fourth parsing result are obtained;
collecting the first parsing result, the second parsing result, the third parsing result, and the fourth parsing result to obtain the multiple human body parsing results.
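The cascade in claim 4 can be traced at the level of tensor shapes. The sketch below tracks only (channels, height, width) tuples and performs no arithmetic. The channel counts, the 2x upsampling factor, the three-channel boundary residuals at successively doubled resolutions, and the number of classes are all hypothetical values chosen for illustration, and the reading that later stages concatenate only the previous result with the residual follows the claim wording rather than a confirmed design.

```python
def conv(shape, out_channels):
    """Shape of a same-size convolution (with refinement) output."""
    _, h, w = shape
    return (out_channels, h, w)


def upsample(shape, factor=2):
    """Shape after spatial upsampling by `factor` (assumed to be 2)."""
    c, h, w = shape
    return (c, h * factor, w * factor)


def concat(*shapes):
    """Shape after channel-wise concatenation; spatial sizes must agree."""
    h, w = shapes[0][1:]
    if any(s[1:] != (h, w) for s in shapes):
        raise ValueError("spatial sizes must match before concatenation")
    return (sum(s[0] for s in shapes), h, w)


def decode(final_features, residuals, num_classes):
    """Return the shapes of the first to fourth parsing results."""
    first = conv(final_features, num_classes)
    results = [first]
    # Stage 1: upsampled features + upsampled first result + boundary residual.
    fused = concat(upsample(final_features), upsample(first), residuals[0])
    results.append(conv(fused, num_classes))
    # Later stages: upsampled previous result + boundary residual.
    for residual in residuals[1:]:
        fused = concat(upsample(results[-1]), residual)
        results.append(conv(fused, num_classes))
    return results


shapes = decode(
    final_features=(128, 32, 32),                           # hypothetical
    residuals=[(3, 64, 64), (3, 128, 128), (3, 256, 256)],  # hypothetical
    num_classes=5,                                          # hypothetical
)
```

Each stage doubles the spatial resolution, so the four parsing results form a coarse-to-fine sequence that can each be supervised by the cross-entropy loss of claim 1.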

5. The method as described in claim 1, characterized in that the data augmentation performed on the multiple UAV images includes image translation, image flipping, random brightness transformation, and median filtering.
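The four augmentation operations of claim 5 can be sketched on a grayscale image stored as a list of rows with values in 0 to 255. The parameter ranges (shift of up to 2 pixels, brightness factor between 0.7 and 1.3, 3x3 median window, flip probability 0.5) and the function names are assumptions, since the claim names the operations without specifying them.

```python
import random
import statistics


def translate(img, dx, dy, fill=0):
    """Shift the image dx pixels right and dy pixels down, filling the gap."""
    h, w = len(img), len(img[0])
    out = [[fill] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            sr, sc = r - dy, c - dx
            if 0 <= sr < h and 0 <= sc < w:
                out[r][c] = img[sr][sc]
    return out


def flip_horizontal(img):
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def random_brightness(img, rng, low=0.7, high=1.3):
    """Scale all pixels by one random factor and clamp to [0, 255]."""
    factor = rng.uniform(low, high)
    return [[min(255, max(0, round(v * factor))) for v in row] for row in img]


def median_filter3(img):
    """3x3 median filter; border pixels use the part of the window inside the image."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for r in range(h):
        for c in range(w):
            window = [img[rr][cc]
                      for rr in range(max(0, r - 1), min(h, r + 2))
                      for cc in range(max(0, c - 1), min(w, c + 2))]
            out[r][c] = statistics.median_low(window)
    return out


def augment(img, rng):
    """Apply the four operations in sequence with random parameters."""
    if rng.random() < 0.5:
        img = flip_horizontal(img)
    img = translate(img, rng.randint(-2, 2), rng.randint(-2, 2))
    img = random_brightness(img, rng)
    return median_filter3(img)
```

For a parsing task, the geometric operations (translation and flipping) would also have to be applied identically to the label map so that pixels and labels stay aligned; the brightness change and median filter apply to the image only.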

6. A human body recognition method based on UAV images, characterized in that it comprises: obtaining a human body image to be parsed; and inputting the human body image to be parsed into a standard human body parsing network for human body parsing to obtain a human body parsing result, wherein the standard human body parsing network is trained by the method described in any one of claims 1 to 5.

7. A training device for a human body parsing network based on UAV images, characterized in that it comprises: a memory for storing a program; and a processor for implementing the method as described in any one of claims 1-5 by executing the program stored in the memory.

8. A computer-readable storage medium, characterized in that the medium stores a program that can be executed by a processor to implement the method as described in any one of claims 1-6.