A multi-scale, multi-directional remote sensing target identification method, system, and medium
The multi-scale, multi-directional remote sensing target recognition method based on adaptive training and inference solves the problems of unbalanced angle boundaries, anchor frame allocation, and scale changes in remote sensing target recognition, and achieves accurate detection of multi-scale, multi-directional targets, thereby improving prediction accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HUNAN UNIV
- Filing Date
- 2023-11-13
- Publication Date
- 2026-06-30
AI Technical Summary
Existing remote sensing target recognition technologies suffer from several problems when dealing with multi-scale and multi-directional targets, including boundary and square issues caused by angular periodicity, uneven distribution of positive and negative anchor frame samples, imperfect evaluation criteria for predicted anchor frame quality, and inconsistent target scale variations.
A multi-scale, multi-directional remote sensing target recognition method is adopted. Through adaptive training and inference, adaptive angle encoding is used to eliminate loss spikes, and positive and negative anchor box samples are adaptively allocated to improve anchor box quality. A quality evaluation function is constructed to screen the best anchor box, thereby achieving accurate detection of multi-scale, multi-directional targets.
It achieves accurate detection of remote sensing targets at multiple scales and in multiple directions, improves prediction accuracy, can adaptively handle scale changes between and within classes of targets, eliminates boundary problems caused by angular periodicity, and improves the recognition accuracy of a small number of samples.
Figure CN117475323B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of remote sensing image processing technology, specifically to a multi-scale, multi-directional remote sensing target recognition method, system, and medium.
Background Technology
[0002] With the rapid development of remote sensing technology, target recognition technology in remote sensing has received increasing attention. Traditional target recognition techniques mainly rely on manually designed features. These features are then reduced in dimensionality or sparsified to decrease computational load, and further regressed and located using machine learning methods. Therefore, the detection performance of traditional target recognition techniques largely depends on manually designed features. With the development of deep learning, the method of manually designed features is gradually being replaced by neural networks because neural networks have better learning capabilities for nonlinear models. As a result, neural networks are increasingly being applied to various complex modeling scenarios.
[0003] Deep learning-based horizontal bounding box target recognition has developed significantly in recent years. Single-stage frameworks such as YOLO, FCOS, and SSD offer advantages in speed and small model size, while the two-stage R-CNN series offers high prediction accuracy. However, in complex scenarios with dense targets of arbitrary orientation, horizontal bounding box prediction often produces boxes that overlap adjacent targets and yields inaccurate size estimates. This is particularly problematic in military scenarios requiring precision strikes, where horizontal bounding box prediction can cause collateral damage. Rotated bounding box prediction is therefore used to handle dense targets of arbitrary orientation. For example, BBAVectors (Oriented Object Detection in Aerial Images with Box Boundary-Aware Vectors. IEEE Winter Conference on Applications of Computer Vision, 2021.00220) employs keypoint detection, abandoning anchor-box-based hyperparameter design and predicting vectors from the center point to the four sides. However, vector confusion easily occurs when detecting horizontal and vertical bounding boxes, so a classification branch is designed to distinguish whether a detection is oriented, and the regressed vectors still suffer from non-perpendicularity. RDD (Single-Stage Rotation-Decoupled Detector for Oriented Objects. Remote Sens, 10.3390) proposes decoupling the rotated bounding box into a horizontal bounding box and an absolute angle, but it does not solve the boundary and square problems introduced by angle periodicity. SCRDet++ (Detecting Small, Cluttered and Rotated Objects via Instance-Level Feature Denoising and Rotation Loss Smoothing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.3166956) designs an instance denoising module to solve the boundary blurring problem for small and cluttered targets.
SCRDet++ also designs a SmoothL1-IOU loss to solve the boundary problem introduced by angle periodicity, but this loss is unstable during training and prone to numerical overflow. RSDet-II (Learning Modulated Loss for Rotated Object Detection. AAAI Conference on Artificial Intelligence, 10.48550.) designs a modulated loss that symmetrically handles the points where the L1 loss changes abruptly, and it sorts the four corner points of the target for regression to address parameter inconsistency. However, it does not address the loss discontinuity at the angle boundary from the perspective of the rotated target's angle. In summary, existing remote sensing rotated target recognition algorithms still face the following problems: (1) Boundary and square problems caused by angle periodicity. In anchor-based target recognition frameworks, when the anchor box exceeds the predicted angle range, the angle loss surges, which makes optimization difficult and causes large deviations in angle prediction. For square targets, the loss also increases even when the predicted angle is within the range, so the loss is inconsistent with the evaluation criteria. (2) Uneven distribution of positive and negative anchor box samples in anchor-based detection. Anchor-based methods generate a large number of anchor boxes on each feature map for target recognition. Many strong anchor-based algorithms use a threshold to assign positive and negative samples, which lets a large number of low-quality anchor boxes regress to the target. Furthermore, different threshold hyperparameters must be set for different complex and dense target scenes.
For categories with few ground-truth label boxes, the many low-quality anchor boxes that regress to the target cause prediction bias. How anchor boxes are adaptively assigned during training therefore determines the quality of the trained model. (3) Imperfect quality evaluation criteria for predicted anchor boxes. In the inference stage, the quality of each predicted box is evaluated solely by classification confidence, which biases the results towards the classification branch and makes them depend to some extent on the classifier design. A good predicted box should therefore account for both classification ability and localization ability, with the contribution of each adjusted dynamically for different classes and scales. (4) Inconsistent target scale. High-resolution remote sensing images come from different satellites and sensors, resulting in large scale variance. Different sampling resolutions, both between categories and within the same category, also cause targets to appear at multiple scales. Handling intra-class and inter-class scale changes is a key challenge in positive and negative sample assignment.
Summary of the Invention
[0004] The technical problem to be solved by the present invention is to provide a multi-scale, multi-directional remote sensing target recognition method, system and medium to address the above-mentioned problems of the prior art. The present invention aims to achieve accurate detection of multi-scale, multi-directional targets, which are characterized by large scale differences and arbitrary target orientations between and within remote sensing images.
[0005] Adaptive training and inference are performed based on the situation of each target. Adaptive angle encoding eliminates the loss surge caused by angle boundaries and the square problem. Adaptive assignment of positive and negative sample anchor boxes to the target improves the quality of regression anchor boxes. Adaptive quality evaluation of predicted anchor boxes is performed to improve prediction accuracy.
[0006] To solve the above-mentioned technical problems, the technical solution adopted by the present invention is as follows:
[0007] A multi-scale, multi-directional remote sensing target recognition method includes segmenting a remote sensing image into image patches of a specified size, inputting the image patches into a given network model for target recognition, including:
[0008] S101, extracts features from the input image patch through the backbone network;
[0009] S102, through the multi-scale aggregation module, by top-down and bottom-up multi-scale aggregation, connects multiple features of different scales across layers of the backbone network and then aggregates them again to obtain aggregated features;
[0010] S103: The detection head extracts orientation-sensitive features by performing multi-directional rotational convolution on the aggregated features. These orientation-sensitive features are then fed into the localization and classification branches. In the localization branch, multiple convolutions are used for regression to predict anchor boxes, uncertainty, and intersection-over-union (IoU). The anchor box prediction results include the anchor box's position, size, and orientation angle. In the classification branch, rotational pooling is used to extract orientation-invariant features, which are then classified using multiple convolutions to predict the category. Based on the scores obtained from the category prediction, the uncertainty prediction results, and the IoU prediction results, a quality evaluation function is constructed to filter the multiple anchor boxes of the same target obtained from the anchor box predictions and obtain the optimal anchor box.
[0011] Optionally, before inputting the image patch into the given network model for target recognition, the step of training the network model is further included. The training sample data during the training of the network model includes: dividing the high-resolution remote sensing image with target labels into image patches of a specified size, so that the targets in the divided image patches are automatically labeled as training sample data. The labels include the position, size, orientation angle and category of the anchor box in the high-resolution remote sensing image.
[0012] Optionally, the angle encoding method in the label includes: using the target center point as a reference point, taking the line from the midpoint of the shortest side in the upper half of the target to the target center point as the included-angle coordinate axis, and taking the corrected angle between the included-angle coordinate axis and the horizontal rightward coordinate axis as the target's direction angle, so that the direction angle of a target with any orientation is limited to the range [0, 180) degrees; the predicted angle range of the target box is limited to [0, 180) degrees for a rectangular target and to [90, 180) degrees for a square target. The function expression for correcting the angle between the included-angle coordinate axis and the horizontal rightward coordinate axis is:
[0013]
[0014]
[0015] In the above formula, $\theta_{rec}$ is the corrected angle between the included-angle coordinate axis of a rectangular target and the horizontal rightward coordinate axis, and $\theta_{sqr}$ is the corrected angle between the included-angle coordinate axis of a square target and the horizontal rightward coordinate axis.
[0016] Optionally, the position of the anchor frame is in a rotating coordinate system, and its coordinate transformation function expression is:
[0017]
[0018] In the above formula, $(x_r, y_r)$ are the coordinates of a position after transformation to the rotated coordinate system, $\theta$ is the direction angle, $(x, y)$ are the original coordinates of that position in the original image coordinate system before transformation, and $(x_c, y_c)$ are the coordinates of the target center point.
[0019] Optionally, when training the network model, the method further includes establishing an elliptical boundary with the rotated coordinate system as the XY axis and the half-length and half-width of the anchor frame as the major and minor axes of the ellipse, transforming all the center points of the anchor frames to the rotated coordinate system, and performing adaptive positive and negative sample allocation for the anchor frames; the adaptive positive and negative sample allocation includes:
[0020] S201, calculate the distance between the center point of each anchor box and the center point of the rotated target after coordinate transformation, and the intersection-over-union (IoU) of the anchor box and the rotated target;
[0021] S202, take the maximum of the length and width of each ground-truth label box, and calculate the adaptively selected parameter k from this maximum and from strides, the set of multiples between the sizes of the multi-scale feature maps and the original image;
[0022] S203: For each ground-truth label box, select the k anchor boxes with the largest IoU as candidate boxes. Among these k anchor boxes, select those whose center points fall inside the ellipse boundary as positive sample anchor boxes. If no anchor box center point falls inside the ellipse boundary, select the k anchor boxes whose center points are closest to the rotated target center as positive sample anchor boxes. If an anchor box matches multiple ground-truth label boxes, match it to the ground-truth label box with which it has the largest IoU.
[0023] Optionally, the functional expression for calculating the adaptively selected parameter k in step S202 is:
[0024]
[0025] In the above formula, ceil is the ceiling (round-up) function, max{w,h} is the larger of the width and the length, where w is the width and h is the length, and strides is the set of multiples between the sizes of the multi-scale feature maps and the original image.
[0026] Optionally, when training the network model, the loss function used has the following expression:
[0027]
[0028]
[0029]
[0030] $\mathrm{Loss}_{class} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$,
[0031] In the above formula, $\mathrm{Loss}_{total}$ is the total loss function. $\mathrm{Loss}_{reg}$ is the uncertainty prediction loss of the localization branch, computed for positive samples; SmoothL1 is the smooth L1 loss; $pred_i$ and $target_i$ are the predicted and labeled target localization parameters; $\sigma_i$ is the predicted uncertainty; and $IOU_i$ is the IoU between rotated boxes. $\mathrm{Loss}_{iou}$ is the IoU prediction loss of the localization branch, computed for positive samples; CrossEntropy is the cross-entropy loss, taken between the IoU of the i-th sample predicted by the localization branch and the ground-truth IoU of the i-th sample. $\mathrm{Loss}_{class}$ is the loss of the classification branch, $\alpha_t$ and $\gamma$ are hyperparameters, $p_t$ is the classification prediction probability obtained from the category prediction, and N is the total number of samples.
[0032] Optionally, the functional expression of the quality evaluation function is:
[0033] $\mathrm{Score} = \mathrm{class\_score}^{\alpha} \cdot \mathrm{IOU}^{1-\alpha}$,
[0034] In the above formula, Score is the quality evaluation function, class_score is the class prediction result, IOU is the intersection-union ratio prediction result, and α is the uncertainty prediction result.
[0035] Furthermore, the present invention also provides a multi-scale, multi-directional remote sensing target recognition system, including a microprocessor and a memory interconnected thereto, wherein the microprocessor is programmed or configured to execute the multi-scale, multi-directional remote sensing target recognition method.
[0036] Furthermore, the present invention also provides a computer-readable storage medium storing a computer program that is programmed or configured by a microprocessor to perform the multi-scale, multi-directional remote sensing target recognition method.
[0037] Compared with existing technologies, the present invention has the following main advantages: The method of the present invention includes segmenting remote sensing images into image patches and inputting the image patches into a network model for target recognition, including: extracting features from the input image patches through a backbone network; obtaining aggregated features through a multi-scale aggregation module; extracting direction-sensitive features from the aggregated features through a detection head and sending them to a localization branch and a classification branch; implementing anchor box, uncertainty, and intersection-over-union (IoU) regression in the localization branch; extracting direction-invariant features in the classification branch and then classifying them; and selecting the best anchor boxes based on a quality evaluation function constructed from the scores obtained from category prediction, the uncertainty prediction results, and the IoU prediction results. The present invention addresses the problem of varying scales for both inter-class and intra-class targets in remote sensing images, and can achieve accurate detection of multi-scale and multi-directional targets. The present invention can achieve the accuracy of two-stage methods while also enabling real-time processing, adaptively processing inter-class and intra-class targets of different scales, eliminating boundary and square problems caused by angular periodicity, improving the recognition accuracy for categories with a small number of samples, and adaptively inferring the best prediction results.
Attached Figure Description
[0038] Figure 1 is a schematic diagram of the basic process of the method in an embodiment of the present invention.
[0039] Figure 2 is a schematic diagram of the network model in an embodiment of the present invention.
[0040] Figure 3 is a schematic diagram of the angle encoding of a rectangular target in an embodiment of the present invention.
[0041] Figure 4 is a schematic diagram of the angle encoding of a square target in an embodiment of the present invention.
[0042] Figure 5 is a schematic diagram of the orientation angle allocation in an embodiment of the present invention.
[0043] Figure 6 is a schematic diagram of the quality evaluation function in an embodiment of the present invention.
Detailed Implementation
[0044] As shown in Figure 1 and Figure 2, the multi-scale, multi-directional remote sensing target recognition method in this embodiment includes dividing the remote sensing image into image patches of a specified size (e.g., 1024×1024 in this embodiment), and inputting the image patches into a given network model for target recognition, including:
[0045] S101, extracts features from the input image patch through the backbone network;
[0046] S102, through the multi-scale aggregation module, uses top-down and bottom-up multi-scale aggregation (Feature Pyramid Network-Path Aggregation Network, FPN-PAN) to connect multiple features of different scales across layers from the backbone network and then aggregate them again to obtain aggregated features;
[0047] S103: The detection head extracts orientation-sensitive features from the aggregated features using multi-directional rotational convolution. These orientation-sensitive features are then fed into the localization and classification branches. In the localization branch, multiple convolutions are used for regression to predict anchor boxes, uncertainty, and intersection-over-union (IoU). The anchor box prediction results include the anchor box's position, size, and orientation angle. In the classification branch, rotational pooling is used to extract orientation-invariant features, which are then classified using multiple convolutions to predict the category. Based on the scores obtained from the category prediction, the uncertainty prediction results, and the IoU prediction results, a quality evaluation function is constructed to filter the multiple anchor boxes of the same target obtained from the anchor box predictions and obtain the optimal anchor box.
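The patch-splitting step that precedes S101 can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function name `split_into_patches`, the non-overlapping layout, and the zero padding at the image border are all assumptions, since the text only states the patch size (1024×1024).

```python
import numpy as np

def split_into_patches(image, patch=1024):
    """Split an H x W x C image into patch x patch tiles.

    Assumptions (not stated in the source): tiles do not overlap and
    border tiles are zero-padded to the full patch size.
    Returns a list of (top, left, tile) so detections can be mapped back.
    """
    h, w = image.shape[:2]
    tiles = []
    for top in range(0, h, patch):
        for left in range(0, w, patch):
            # Start from an all-zero tile, then copy the available pixels in.
            tile = np.zeros((patch, patch) + image.shape[2:], dtype=image.dtype)
            crop = image[top:top + patch, left:left + patch]
            tile[:crop.shape[0], :crop.shape[1]] = crop
            tiles.append((top, left, tile))
    return tiles
```

Keeping the `(top, left)` offset with each tile makes it possible to translate predicted box centers back into the coordinates of the full image after inference.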
[0048] See Figure 1. As an optional implementation, the backbone network in step S101 of this embodiment adopts the ConvNeXt-T backbone network. Other backbone networks can also be used as needed and are not listed here. The ConvNeXt-T backbone network extracts features from the input image patch and obtains features at four scales, denoted as T1, T2, T3 and T4 respectively.
[0049] In step S102, the multi-scale aggregation module uses top-down and bottom-up multi-scale aggregation to connect multiple features of different scales across layers from the backbone network and then aggregates them again to obtain aggregated features. This includes multi-scale aggregation of T1, T2, and T3 using top-down and bottom-up methods, and connecting three features of different scales across layers from the backbone network to obtain aggregated features of T4.
[0050] In this embodiment, orientation-sensitive features are extracted using eight-directional rotating convolution kernels for localization, and orientation-invariant features are extracted using rotation pooling for classification. In the localization branch, four convolutions are used for regression to predict anchor frames, uncertainty, and intersection-over-union (IoU). The anchor frame prediction results include the anchor frame's position, size, and orientation angle. In the classification branch, orientation-invariant features are extracted using rotation pooling, and then four convolutions are used for classification to predict the category.
[0051] In this embodiment, before inputting the image patch into the given network model for target recognition, there is also a step of training the network model. The training sample data during the training of the network model includes: dividing the high-resolution remote sensing image with target labels into image patches of a specified size, so that the targets in the divided image patches are automatically labeled as training sample data. The labels include the position, size, orientation angle and category of the anchor frame in the high-resolution remote sensing image.
[0052] To address the loss surge at the angle boundary and the square problem caused by angular periodicity, the angle encoding method in this embodiment includes: using the target center point as a reference point, and taking the line from the midpoint of the shortest side in the upper half of the target to the target center point as the included-angle coordinate axis. The angle between this included-angle coordinate axis and the horizontal rightward coordinate axis is corrected and used as the target's direction angle, thus limiting the direction angle of a target with any orientation to the range [0, 180) degrees. Furthermore, the predicted target box angle range is limited to [0, 180) degrees for rectangular targets and to [90, 180) degrees for square targets. The function expression for correcting the angle between the included-angle coordinate axis and the horizontal rightward coordinate axis is as follows:
[0053]
[0054]
[0055] In the above formula, $\theta_{rec}$ is the corrected angle between the included-angle coordinate axis of a rectangular target and the horizontal rightward coordinate axis, and $\theta_{sqr}$ is the corrected angle between the included-angle coordinate axis of a square target and the horizontal rightward coordinate axis. Since the neural network regresses the offset between the target box and the anchor box during training, both the anchor box and the ground-truth label box need to be angle-encoded. First, the anchor box and the ground-truth label box are angle-encoded as described above, and then the offset is calculated for regression. The target angle, i.e., the direction angle θ, is obtained from the horizontal rightward vector p and the included-angle vector q. The calculation expression is as follows:
[0056]
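The expression for the direction angle is not reproduced in this text. As a hedged sketch, assuming it is the usual angle between two vectors (the arccosine of their normalized dot product), the computation from the horizontal rightward vector p and the included-angle vector q could look like this. The function name `direction_angle` is hypothetical.

```python
import math

def direction_angle(p, q):
    """Angle in degrees between 2-D vectors p and q.

    Assumption: the angle is the arccosine of the normalized dot product,
    which yields a value in [0, 180]. The source's own expression is not
    available, so this is an illustration only.
    """
    dot = p[0] * q[0] + p[1] * q[1]
    norm = math.hypot(p[0], p[1]) * math.hypot(q[0], q[1])
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_t = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_t))
```

Note that the arccosine can return exactly 180 degrees for opposite vectors, which lies outside the half-open range [0, 180) used by the encoding, so a real implementation would still need the correction step described above.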
[0057] When half of the predicted bounding box extends beyond the prediction boundary, the other half enters the angle range. The angle range of each box is always limited to 180 degrees, so the loss does not surge at the angle boundary or within the angle range. For square targets, the included-angle coordinate axis is selected according to the Y-axis coordinates of the two corner points in the upper half of the square target, in two cases. If the Y-axis coordinates are equal, the included-angle coordinate axis is horizontal to the left, and since the angle range is [0, 180) degrees, its angle is 0 degrees; if the Y-axis coordinates are not equal, the included-angle coordinate axis is established on the left side of the corner point with the smallest Y-axis coordinate. In Figure 3 and Figure 4, solid lines with arrows indicate the regression direction of the direction angle, solid lines are anchor boxes, dotted lines are label boxes, and dashed lines are predicted boxes. Figure 3(a) shows the boundary problem under the long-side representation, where the angle range is [-90, 90) degrees: the anchor box is at -90 degrees, the label box at 65 degrees, and the predicted box at -115 degrees, so |label box - prediction box| = 180 degrees even though the IOU is approximately 1. In Figure 3(b), angle encoding is used with the angle range [0, 180) degrees: the anchor box is at 90 degrees, the label box at 115 degrees, and the predicted box at 115 degrees, so |label box - prediction box| = 0 degrees and the IOU is approximately 1. Figure 3(c) shows the boundary problem solved by angle encoding: the angle range is [0, 180) degrees, the anchor box is at 90 degrees, the label box at 30 degrees, and the predicted box at 25 degrees, so |label box - prediction box| = 5 degrees and the IOU is about 0.85.
Figure 4(a) shows the square problem under the long-side definition, with an angle range of [-90, 90) degrees: the anchor box is at 0 degrees, the label box at -60 degrees, and the predicted box at 30 degrees, so |label box - prediction box| = 90 degrees even though the IOU is approximately 1. In Figure 4(b), angle encoding is used to solve the square problem: the angle range is [90, 180) degrees, the anchor box is at 90 degrees, the label box at 120 degrees, and the predicted box at 120 degrees, so |label box - prediction box| = 0 degrees and the IOU is approximately 1.
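The periodicity behind these examples can be illustrated with a short sketch. It is not the correction function of this embodiment, which is not reproduced in this text: it only assumes that a rectangle is unchanged by a 180-degree rotation and a square by a 90-degree rotation, and wraps angles accordingly. The function names are hypothetical, and the sign convention of the embodiment's angles may differ.

```python
def wrap_rect_angle(theta_deg):
    """Wrap a rectangle's angle into [0, 180), using its 180-degree period."""
    return theta_deg % 180.0

def wrap_square_angle(theta_deg):
    """Wrap a square's angle into [90, 180), using its 90-degree period."""
    return 90.0 + theta_deg % 90.0
```

Under this assumption, the two rectangle angles from Figure 3(a), 65 and -115 degrees, wrap to the same value, and the two square angles from Figure 4(a), -60 and 30 degrees, both wrap to 120 degrees, matching the 120-degree label and prediction in Figure 4(b). In both cases the angle difference drops to 0, in line with an IOU of approximately 1.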
[0058] See Figure 5. In this embodiment, the position of the anchor box is expressed in a rotated coordinate system, and its coordinate transformation function expression is:
[0059]
[0060] In the above formula, $(x_r, y_r)$ are the coordinates of a position after transformation to the rotated coordinate system, $\theta$ is the direction angle, $(x, y)$ are the original coordinates of that position in the original image coordinate system before transformation, and $(x_c, y_c)$ are the coordinates of the target center point. Through this coordinate system transformation, the coordinate system and included angle are redefined at the rotated target center and the angle is re-encoded, which solves the loss surge at the angle boundary and the square problem caused by angle periodicity.
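The transformation expression is not reproduced in this text. A minimal sketch, assuming the standard 2-D change of coordinates (translate by the target center, then rotate by the direction angle), is given below. The function name `to_rotated_frame` and the sign convention are assumptions; with image coordinates whose Y axis points down, the signs of the sine terms may need to be swapped.

```python
import math

def to_rotated_frame(x, y, xc, yc, theta_deg):
    """Map (x, y) into a frame centered at (xc, yc) and rotated by theta.

    Assumption: a standard rotation, so a point lying along the direction
    theta from the center lands on the positive x_r axis.
    """
    t = math.radians(theta_deg)
    dx, dy = x - xc, y - yc
    xr = dx * math.cos(t) + dy * math.sin(t)
    yr = -dx * math.sin(t) + dy * math.cos(t)
    return xr, yr
```

For example, the target center itself maps to the origin, and a point one unit away from the center along the direction angle maps to approximately (1, 0).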
[0061] In this embodiment, when training the network model, the method further includes establishing an elliptical boundary with the rotating coordinate system as the XY axis and the half-length and half-width of the anchor frame as the major and minor axes of the ellipse, transforming all the center points of the anchor frames to the rotating coordinate system, and performing adaptive positive and negative sample allocation for the anchor frames. By adopting adaptive positive and negative sample allocation, a coordinate system is established for the rotating target center, an elliptical equation is established, and the coordinates of each anchor frame are rotated to the elliptical coordinate system with the ellipse as the boundary, and positive and negative samples are adaptively selected.
[0062] Specifically, adaptive positive and negative sample allocation includes:
[0063] S201, calculate the distance between the center point of each anchor box and the center point of the rotated target after coordinate transformation, and the intersection-over-union (IoU) of the anchor box and the rotated target;
[0064] S202, take the maximum of the length and width of each ground-truth label box, and calculate the adaptively selected parameter k from this maximum and from strides, the set of multiples between the sizes of the multi-scale feature maps and the original image; in this embodiment, strides = [4,8,16,32,64];
[0065] S203: For each ground-truth label box, select the k anchor boxes with the largest IoU as candidate boxes. Among these k anchor boxes, select those whose center points fall inside the ellipse boundary as positive sample anchor boxes. If no anchor box center point falls inside the ellipse boundary, select the k anchor boxes whose center points are closest to the rotated target center as positive sample anchor boxes. If an anchor box matches multiple ground-truth label boxes, match it to the ground-truth label box with which it has the largest IoU.
[0066] As can be seen from the steps above, the adaptive positive and negative sample allocation process has no hyperparameters. It builds the allocation adaptively from each rotated target box, which improves the recognition and localization accuracy for categories with few samples.
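Steps S201 to S203 can be sketched as follows. This is an illustration under stated assumptions rather than the embodiment's code: the function name `assign_samples` is hypothetical, the rotated IoU matrix and the per-target k are taken as inputs (the expression for k is not reproduced in this text), the ellipse semi-axes are taken as half the width and half the height of the ground-truth box, and the fallback picks the k closest anchors among all anchors.

```python
import numpy as np

def assign_samples(anchor_xy, gt_boxes, iou, k):
    """Assign each anchor to a ground-truth box index, or -1 (negative).

    anchor_xy: (A, 2) anchor centers; gt_boxes: (G, 5) rows of
    (xc, yc, w, h, theta_deg); iou: (G, A) rotated IoU, computed elsewhere;
    k: length-G sequence of candidate counts (assumed given).
    """
    num_gt, num_anchor = iou.shape
    positive = np.zeros((num_gt, num_anchor), dtype=bool)
    for g, (xc, yc, w, h, theta) in enumerate(gt_boxes):
        # S201: move anchor centers into the target's rotated frame.
        t = np.deg2rad(theta)
        dx, dy = anchor_xy[:, 0] - xc, anchor_xy[:, 1] - yc
        xr = dx * np.cos(t) + dy * np.sin(t)
        yr = -dx * np.sin(t) + dy * np.cos(t)
        inside = (xr / (w / 2)) ** 2 + (yr / (h / 2)) ** 2 <= 1.0
        # S203: top-k candidates by IoU, kept only if inside the ellipse.
        kg = min(int(k[g]), num_anchor)
        cand = np.argsort(-iou[g])[:kg]
        chosen = cand[inside[cand]]
        if chosen.size == 0:
            # Fallback: the k anchors closest to the target center.
            chosen = np.argsort(np.hypot(dx, dy))[:kg]
        positive[g, chosen] = True
    # An anchor claimed by several targets goes to the one with largest IoU.
    best_gt = np.where(positive, iou, -1.0).argmax(axis=0)
    return np.where(positive.any(axis=0), best_gt, -1)
```

The only quantities the sketch needs are derived from each target (its size, angle, and k), which reflects the point made above that no fixed IoU threshold is tuned per scene.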
[0067] In this embodiment, the function expression for calculating the adaptively selected parameter k in step S202 is:
[0068]
[0069] In the above formula, ceil is the ceiling (round-up) function, max{w,h} is the larger of the width and the length, where w is the width and h is the length, and strides is the set of multiples between the sizes of the multi-scale feature maps and the original image.
[0070] In this embodiment, the loss function used when training the network model is expressed as follows:
[0071]
[0072]
[0073]
[0074] $\mathrm{Loss}_{class} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$,
[0075] In the above formula, $\mathrm{Loss}_{total}$ is the total loss function. $\mathrm{Loss}_{reg}$ is the uncertainty prediction loss of the localization branch, computed for positive samples; SmoothL1 is the smooth L1 loss; $pred_i$ and $target_i$ are the predicted and labeled target localization parameters; $\sigma_i$ is the predicted uncertainty; and $IOU_i$ is the IoU between rotated boxes. $\mathrm{Loss}_{iou}$ is the IoU prediction loss of the localization branch, computed for positive samples; CrossEntropy is the cross-entropy loss, taken between the IoU of the i-th sample predicted by the localization branch and the ground-truth IoU of the i-th sample. $\mathrm{Loss}_{class}$ is the loss of the classification branch, $\alpha_t$ and $\gamma$ are hyperparameters, $p_t$ is the classification prediction probability obtained from the category prediction, and N is the total number of samples. In addition:
[0076]
[0077] In the above formula, x represents the input, and "otherwise" denotes the case in which the condition |x| < 1 does not hold.
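For reference, the commonly used definition of the smoothed L1 loss with a threshold of 1 is shown below. This is an assumption based on the standard form and on the description of x and "otherwise" above; the embodiment may use a different threshold or scaling.

```latex
\mathrm{SmoothL1}(x) =
\begin{cases}
0.5\,x^{2}, & |x| < 1 \\
|x| - 0.5, & \text{otherwise}
\end{cases}
```

The quadratic branch keeps gradients small near zero, while the linear branch limits the influence of large localization errors.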
[0078] $\mathrm{CrossEntropy}(p, q) = -\sum_{x} p(x) \log(q(x))$,
[0079] In the above formula, p(x) is the classification prediction probability corresponding to x, and q(x) is the actual distribution probability corresponding to x.
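A minimal sketch of the cross-entropy expression, following the formula exactly as written, is given below. The function name `cross_entropy` and the representation of the two distributions as equal-length lists are assumptions made for illustration.

```python
import math

def cross_entropy(p, q):
    """Return -sum_x p(x) * log(q(x)) for two discrete distributions given as lists."""
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    # Terms with p(x) == 0 contribute nothing and are skipped.
    # A term with p(x) > 0 and q(x) == 0 raises ValueError from math.log.
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0.0)
```

When both arguments are the same one-hot distribution the value is zero, and it grows as the second argument assigns less probability to the outcomes weighted by the first.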
[0080]
[0081] In the above formula, p is the classification prediction probability, and y is the true label.
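The classification loss $-\alpha_t (1 - p_t)^{\gamma} \log(p_t)$ has the form of the focal loss. The sketch below assumes the usual binary definitions, $p_t = p$ and $\alpha_t = \alpha$ when y = 1, and $p_t = 1 - p$ and $\alpha_t = 1 - \alpha$ otherwise. These definitions, the default values 0.25 and 2.0, and the function name `focal_loss` are assumptions, not values stated in this embodiment.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Focal loss for one binary prediction.
    p: predicted probability of the positive class, y: true label (0 or 1).
    alpha and gamma defaults are illustrative assumptions."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma down-weights samples that are already classified well.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
```

With gamma set to 0 the modulating factor disappears and the expression reduces to an alpha-weighted log loss.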
[0082] Traditional methods use Non-Maximum Suppression (NMS), which keeps the box with the highest classification confidence score and eliminates other boxes according to their Intersection over Union (IoU) with it. Rotated boxes differ from horizontal boxes in that a box with a high classification confidence score may still have a low IoU, which makes it unsuitable as the final prediction result. Therefore, this embodiment constructs a comprehensive quality assessment function based on the classification and localization capabilities of each predicted box. The detector predicts five parameters (x, y, w, h, θ) and their uncertainties, along with the IoU. The average of the five parameter uncertainties represents the overall uncertainty of each box and adaptively adjusts the weighting of the classification and localization capabilities. The function expression for the quality assessment function in this embodiment is:
[0083] $\mathrm{Score} = \mathrm{class\_score}^{\alpha} \cdot \mathrm{IOU}^{1-\alpha}$,
[0084] In the above formula, Score is the quality evaluation function, class_score is the class prediction result, IOU is the IoU prediction result, and α is the uncertainty prediction result. For example, as shown in Figure 6, the bounding box with a ClassScore of 0.94, obtained by performing NMS on the classification confidence score alone, has a very low IoU compared with the bounding box with a ClassScore of 0.84, obtained by the adaptive quality assessment function that takes both the IoU and the classification confidence score into account. In Figure 6, the solid line represents the predicted anchor box, and the dashed line represents the anchor box corresponding to the label.
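The quality evaluation function can be sketched in a few lines of Python. Only the class scores 0.94 and 0.84 come from the Figure 6 example; the IoU values, the uncertainty values, the assumption that uncertainties lie in [0, 1], and the names `quality_score` and `best_box` are hypothetical.

```python
def quality_score(class_score, iou, uncertainties):
    """Score = class_score**alpha * iou**(1 - alpha),
    where alpha is the mean of the five parameter uncertainties (assumed in [0, 1])."""
    alpha = sum(uncertainties) / len(uncertainties)
    return (class_score ** alpha) * (iou ** (1.0 - alpha))

def best_box(candidates):
    """candidates: list of (class_score, iou, uncertainties) for boxes of one target.
    Returns the index of the box with the highest quality score."""
    scores = [quality_score(c, i, u) for c, i, u in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)

# Hypothetical boxes: a confident but poorly localized box, and a slightly
# less confident box that overlaps the target well.
candidates = [
    (0.94, 0.30, [0.4] * 5),
    (0.84, 0.80, [0.4] * 5),
]
```

Ranking by classification score alone would pick the first box, whereas `best_box(candidates)` selects the second because its predicted IoU is much higher.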
[0085] By constructing a comprehensive quality evaluation function for the predicted target boxes, the weighting of the category prediction result and the IoU prediction result is adaptively adjusted according to each box's localization and classification capabilities. Uncertainty is predicted by predicting a Gaussian distribution, in which the five predicted parameters serve as the mean and the predicted uncertainty serves as the variance, thereby extracting reliable information about the target.
[0086] Taking the MSE loss as an example, the expression for $L_{MSE}$ used in the uncertainty calculation is shown in the following formula:
[0087]
[0088] where $f_i$ is the predicted value and $y_i$ is the label.
[0089] In this embodiment, the log-likelihood function, log(likelihood), of the predicted distribution is:
[0090]
[0091]
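As an illustration of predicting a mean and a variance, the sketch below evaluates the log-likelihood of a label under a univariate Gaussian. It assumes the standard Gaussian log-density and treats each localization parameter independently; neither assumption is confirmed by the text, and the function name `gaussian_log_likelihood` is hypothetical.

```python
import math

def gaussian_log_likelihood(y, mean, variance):
    """log N(y | mean, variance) for a univariate Gaussian with variance > 0.
    mean plays the role of a predicted parameter, variance of its predicted uncertainty."""
    return -0.5 * math.log(2.0 * math.pi * variance) - (y - mean) ** 2 / (2.0 * variance)
```

For a fixed variance, maximizing this log-likelihood is equivalent to minimizing the squared error between prediction and label. A larger predicted variance softens the penalty on a large error at the cost of the log-variance term.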
[0092] To verify the method of this embodiment (hereinafter referred to as MSMD), it was compared experimentally with the single-stage method BBAVectors (Oriented Object Detection in Aerial Images with Box Boundary-Aware Vectors. IEEE Winter Conference on Applications of Computer Vision, 2021.00220), the two-stage method SCRDet++ (Detecting Small, Cluttered and Rotated Objects via Instance-Level Feature Denoising and Rotation Loss Smoothing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.3166956), and RSDet-II (Learning Modulated Loss for Rotated Object Detection. AAAI Conference on Artificial Intelligence, 10.48550). The public DOTA 1.0 dataset released by Wuhan University was used for training and testing. The DOTA 1.0 dataset consists of 2806 aerial images from Google Earth, JL-1 satellites, and the GF-2 satellite of the China Centre for Resources Satellite Data and Application. The images range in size from 800×800 to 4000×4000 pixels and contain 188,282 instances across 15 categories, all annotated with rotated bounding boxes. The experiment uses the AdamW optimizer with an initial learning rate of 0.0002, a batch size of 4, and 25 training epochs. The learning rate is reduced to 0.00002 at epoch 12 and to 0.000002 at epoch 22.
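The training settings described above can be written down as a small configuration with a step learning-rate schedule. The sketch assumes that each reduced rate takes effect at the stated epoch and holds until the next change; whether epochs are counted from 0 or 1 is not stated, and the names `TRAIN_CONFIG` and `learning_rate` are hypothetical.

```python
# Settings taken from the experiment description; the dictionary layout is illustrative.
TRAIN_CONFIG = {
    "optimizer": "AdamW",
    "batch_size": 4,
    "epochs": 25,
}

def learning_rate(epoch):
    """Step schedule: 0.0002 initially, 0.00002 from epoch 12, 0.000002 from epoch 22."""
    if epoch >= 22:
        return 0.000002
    if epoch >= 12:
        return 0.00002
    return 0.0002
```

Each step lowers the rate by a factor of ten, which matches the three values given in the text.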
[0093] Table 1 compares the experimental results of the method in this embodiment (MSMD) with existing methods.
[0094]
[0095]
[0096] Table 1 compares the recognition accuracy across the 15 categories and the mAP50 of the method in this embodiment with those of four existing methods. Using ConvNeXt-T (the Tiny version of ConvNeXt) as the backbone, the method in this embodiment achieves an mAP50 that is 7.54%, 2.11%, 3.05%, and 3.52% higher than those of the compared single-stage and two-stage frameworks, respectively, demonstrating the superiority of the method in this embodiment. On experimental equipment with an NVIDIA RTX 3090 GPU and an i7-13700K CPU, the method in this embodiment achieves 29 frames per second.
[0097] In summary, the multi-scale, multi-directional remote sensing target recognition method in this embodiment first employs a ConvNeXt-T backbone network for feature extraction, followed by top-down and bottom-up multi-scale aggregation. Backbone network features are introduced into the bottom-up aggregation to enhance the influence of gradient flow on the backbone network. Second, the coordinate system of the rotated bounding boxes is redefined, and the angles are adaptively encoded. Third, the anchor boxes of the multi-scale feature maps are mapped back to the original image size, and adaptive sample allocation is performed on the anchor boxes. Finally, in the prediction stage, rotational convolution is used to extract direction-sensitive features for localization, rotational pooling is used to extract direction-invariant features for classification, and the score of each target box is adaptively determined from its localization and classification capabilities. The key to this embodiment is improving the network's adaptability to multi-scale, multi-directional rotated remote sensing targets. Adaptive angle encoding eliminates the loss spikes and the square-target problem caused by angle boundaries for rotated remote sensing targets. Positive and negative sample boundaries are established based on the geometric appearance of the remote sensing target, and an adaptive sample allocation strategy is adopted for the anchor boxes.
During training, high-quality anchor boxes are adaptively selected for regression, which helps to improve detection accuracy for categories with few samples. In the inference stage, each predicted box is scored by a comprehensive quality evaluation function, and the contributions of the predicted box's classification and localization capabilities are adaptively adjusted, making the inference results more accurate.
[0098] Furthermore, this embodiment also provides a multi-scale, multi-directional remote sensing target recognition system, including a microprocessor and a memory connected to each other, wherein the microprocessor is programmed or configured to execute the multi-scale, multi-directional remote sensing target recognition method. This embodiment also provides a computer-readable storage medium storing a computer program, wherein the computer program is programmed or configured to be executed by a microprocessor so as to perform the multi-scale, multi-directional remote sensing target recognition method.
[0099] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code. This application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram. The computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
[0100] The above description is merely a preferred embodiment of the present invention. The scope of protection of the present invention is not limited to the above embodiments. All technical solutions falling within the scope of the present invention's concept are within the scope of protection of the present invention. It should be noted that for those skilled in the art, any improvements and modifications made without departing from the principles of the present invention should also be considered within the scope of protection of the present invention.
Claims
1. A multi-scale, multi-directional remote sensing target recognition method, characterized in that it includes segmenting a remote sensing image into image patches of a specified size and inputting the image patches into a given network model for target recognition, including: S101, extracting features from the input image patch through the backbone network; S102, through the multi-scale aggregation module, using top-down and bottom-up multi-scale aggregation to connect multiple features of different scales across layers of the backbone network and then aggregate them again to obtain aggregated features; S103, in the detection head, extracting orientation-sensitive features by performing multi-directional rotational convolution on the aggregated features, and feeding the orientation-sensitive features into the localization branch and the classification branch; in the localization branch, multiple convolutions are used for regression to predict the anchor boxes, the uncertainty, and the intersection-over-union (IoU), the anchor box prediction results including the position, size, and orientation angle of the anchor box; in the classification branch, rotational pooling is used to extract orientation-invariant features, which are then classified using multiple convolutions to predict the category; and, based on the scores obtained from the category prediction, the uncertainty prediction results, and the IoU prediction results, a quality evaluation function is constructed to filter the multiple anchor boxes of the same target obtained from the anchor box prediction to obtain the optimal anchor box.
2. The multi-scale, multi-directional remote sensing target recognition method according to claim 1, characterized in that, before the image patch is input into the given network model for target recognition, the method includes a step of training the network model; the training sample data used during network model training are obtained by dividing high-resolution remote sensing images with target labels into image patches of a specified size, so that the targets in the divided image patches are automatically labeled and serve as training sample data; the labels include the position, size, orientation angle, and category of the anchor box in the high-resolution remote sensing image.
3. The multi-scale, multi-directional remote sensing target identification method according to claim 2, characterized in that the angle encoding method in the label includes: using the target center point as a reference point, using the line from the center point of the shortest side of the upper half of the target to the target center point as the included-angle coordinate axis, and correcting the angle between the included-angle coordinate axis and the horizontal rightward coordinate axis to obtain the target's direction angle, so that the direction angle of any target orientation is limited to the range of [0, 180) degrees, the predicted angle range of the target box of a rectangular target is limited to the range of [0, 180) degrees, and the predicted angle range of the target box of a square target is limited to the range of [90, 180) degrees; and the function expression for correcting the angle between the included-angle coordinate axis and the horizontal rightward coordinate axis is: In the above formula, $\theta_{rec}$ is the corrected angle between the included-angle coordinate axis of a rectangular target and the horizontal rightward coordinate axis, and $\theta_{sqr}$ is the corrected angle between the included-angle coordinate axis of a square target and the horizontal rightward coordinate axis.
4. The multi-scale, multi-directional remote sensing target identification method according to claim 2, characterized in that the position of the anchor box is expressed in a rotated coordinate system, and its coordinate transformation function expression is: In the above formula, $(x_r, y_r)$ represents the coordinates of the position after transformation to the rotated coordinate system, θ is the direction angle, (x, y) are the original coordinates of that position in the original image coordinate system before transformation, and $(x_c, y_c)$ represents the coordinates of the target center point.
5. The multi-scale, multi-directional remote sensing target identification method according to claim 4, characterized in that the training of the network model also includes establishing an elliptical boundary with the rotated coordinate system as the XY axes and the half-length and half-width of the anchor box as the major and minor axes of the ellipse, transforming all anchor box center points to the rotated coordinate system, and performing adaptive positive and negative sample allocation for the anchor boxes; the adaptive positive and negative sample allocation includes: S201, calculating the distance between the center point of each anchor box and the center point of the rotated target after coordinate transformation, and the intersection-over-union (IoU) of the anchor box and the rotated target; S202, selecting the maximum of the length and width of each ground-truth label box, and calculating the adaptively selected parameter k based on the strides set of multiples between the feature maps generated at multiple scales and the original image; S203, for each ground-truth label box, selecting the k anchor boxes with the largest IoU as candidate boxes; among these k candidates, selecting the anchor boxes whose center points fall inside the ellipse boundary as positive sample anchor boxes; if no anchor box center point falls inside the ellipse boundary, selecting the k anchor boxes whose center points are closest to the center of the rotated target as positive sample anchor boxes; and if an anchor box matches multiple ground-truth label boxes, assigning it to the ground-truth label box with which it has the largest IoU.
6. The multi-scale, multi-directional remote sensing target identification method according to claim 5, characterized in that the function expression for calculating the adaptively selected parameter k in step S202 is: In the above formula, ceil is the round-up (ceiling) function, max{w, h} is the larger of the width and the height, where w is the width and h is the height, and strides is the set of scaling multiples between the feature maps generated at multiple scales and the original image.
7. The multi-scale, multi-directional remote sensing target identification method according to claim 2, characterized in that the loss function used when training the network model is expressed as follows: $\mathrm{Loss}_{class} = -\alpha_t (1 - p_t)^{\gamma} \log(p_t)$, In the above formulas, $\mathrm{Loss}_{total}$ is the loss function; $\mathrm{Loss}_{reg}$ is the loss for the uncertainty prediction of the localization branch and is computed for positive samples; SmoothL1 is the smoothed L1 loss, $pred_i$ and $target_i$ are the predicted and labeled target localization parameters, $\sigma_i$ is the uncertainty obtained from the prediction, and $\mathrm{IOU}_i$ is the intersection-over-union (IoU) between rotated boxes; $\mathrm{Loss}_{iou}$ is the loss for the IoU prediction of the localization branch and is likewise computed for positive samples; CrossEntropy is the cross-entropy loss, taken between the IoU of the i-th sample predicted by the localization branch and the true-label IoU of the i-th sample; $\mathrm{Loss}_{class}$ is the loss of the classification branch, $\alpha_t$ and $\gamma$ are hyperparameters, $p_t$ is the classification prediction probability obtained from the category prediction, and N is the total number of samples.
8. The multi-scale, multi-directional remote sensing target identification method according to claim 1, characterized in that the function expression of the quality evaluation function is: $\mathrm{Score} = \mathrm{class\_score}^{\alpha} \cdot \mathrm{IOU}^{1-\alpha}$, In the above formula, Score is the quality evaluation function, class_score is the class prediction result, IOU is the IoU prediction result, and α is the uncertainty prediction result.
9. A multi-scale, multi-directional remote sensing target recognition system, comprising a microprocessor and a memory connected to each other, characterized in that the microprocessor is programmed or configured to perform the multi-scale, multi-directional remote sensing target recognition method according to any one of claims 1 to 8.
10. A computer-readable storage medium storing a computer program, characterized in that the computer program is programmed or configured to be executed by a microprocessor so as to perform the multi-scale, multi-directional remote sensing target recognition method according to any one of claims 1 to 8.