A synthetic aperture radar target recognition method based on physical perception and visual fusion
By constructing a GAT-Former dual-stream heterogeneous network, the scattering center of SAR targets is explicitly extracted and cross-modal semantic alignment is performed by combining dynamic graph attention and visual convolutional features. This solves the problems of insufficient utilization of physical topology information and difficulty in feature semantic alignment in existing SAR ATR methods, and achieves high-precision fine-grained target recognition and improved robustness.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NANJING UNIV OF POSTS & TELECOMM
- Filing Date
- 2026-02-07
- Publication Date
- 2026-06-26
AI Technical Summary
Existing SAR ATR methods suffer from insufficient utilization of physical topology information, difficulty in aligning physical and visual features semantically, and limited performance of single models when dealing with large-scale fine-grained target recognition, resulting in inadequate recognition accuracy and robustness.
We construct a GAT-Former dual-stream heterogeneous network, explicitly extract scattering centers through deep unfolding technology, combine dynamic graph attention mechanism with visual convolution features for cross-modal semantic alignment, and use heterogeneous integration strategy to solve local extremum problems in non-convex optimization space to achieve physical perception and visual fusion.
It improves the accuracy and robustness of fine-grained target recognition, achieves high-precision fine-grained classification, raises the upper limit of the recognition system, maintains a high level of perception capability in complex electromagnetic environments, and effectively solves the problem of confusion between similar-looking targets.
Figure CN122289752A_ABST
Description
Technical Field
[0001] This invention relates to a synthetic aperture radar target recognition method based on the fusion of physical perception and vision, belonging to the field of automatic radar target recognition technology.
Background Technology
[0002] Synthetic Aperture Radar (SAR), as an advanced active microwave remote sensing imaging sensor, plays an irreplaceable role in geological exploration and disaster monitoring due to its all-time and all-weather data acquisition capabilities. In SAR image interpretation tasks, Automatic Target Recognition (ATR) technology is the core component for achieving intelligent perception, aiming to automatically locate and accurately classify key targets from massive radar echoes. With the advent of the Internet of Everything (IoE) era, higher demands are placed on the real-time performance and robustness of ATR systems. Although traditional methods based on template matching or Support Vector Machines (SVMs) have been widely used, their feature design relies on expert experience and is difficult to adapt to complex and ever-changing battlefield environments. In recent years, Deep Learning (DL) technology, represented by Convolutional Neural Networks (CNNs), has become the mainstream paradigm in the SAR ATR field due to its powerful feature extraction capabilities.
[0003] However, as recognition tasks evolve from coarse-grained classification to fine-grained classification over 40 or more target categories, existing methods face significant challenges. For example, when distinguishing T72 from T62 tanks, which are extremely similar in appearance, relying on a single modality is often ineffective. Mainstream deep learning methods typically treat SAR data as a single-channel optical grayscale image, using only the amplitude and discarding the complex phase information that carries the target's crucial electromagnetic scattering characteristics. Furthermore, the speckle noise inherent in SAR makes conventional CNNs prone to overfitting to background clutter, making it difficult to capture the target's essential geometric topology.
[0004] To overcome the limitations of purely data-driven methods, researchers have begun exploring the integration of physics and deep learning. L. Liao, Z. Hong, and others proposed an Integrated Physically Interpretable Model (IPIM) in their paper "Integrated Physically Interpretable Model for SAR Target Recognition: Unified Fusion of Electromagnetic and Deep Features" (IEEE Geoscience and Remote Sensing Letters). This method uses a Deep Unfolding Network (DUN) to incorporate the complex SAR imaging process into the network structure, implicitly solving for physical features, and combines physical and deep features through a fusion module. While IPIM improves interpretability to some extent, its feature interactions rely mainly on simple concatenation or weighting and do not mine the dynamic topological relationships between scattering points in depth. In fine-grained tasks, the scattering centers of different target types differ only subtly in topological distribution, and coordinate regression alone is insufficient to construct a discriminative geometric description. Furthermore, this method does not fully exploit the ability of large-scale pre-trained visual models to extract texture details. Another representative work is the Part Attention Network (PAN) proposed by S. Feng, K. Ji et al. in their paper "PAN: Part Attention Network Integrating Electromagnetic Characteristics for Interpretable SAR Vehicle Target Recognition" (IEEE Transactions on Geoscience and Remote Sensing). This work designs a target part model based on the Attributed Scattering Center (ASC) model and proposes a PAN architecture with global and local streams, using an attention mechanism to guide the network toward physically meaningful regions.
However, PAN's integration of the physical model still relies on predefined ASC assumptions, and its "attention" mechanism is mainly limited to spatial weighting within the visual domain, so it does not construct true cross-modal semantic interaction. Faced with the heterogeneity of discrete scattering points and continuous texture features, it lacks a mechanism that can actively map physical geometry to the visual context. Furthermore, none of the above works involves a heterogeneous model integration strategy, making it difficult to overcome the performance bottleneck of a single architecture in ultra-fine-grained classification.
[0005] Existing technologies still suffer from insufficient utilization of physical topology information, difficulty in semantically aligning physical and visual features, and the limited performance of single models in large-scale fine-grained SAR target recognition. There is therefore an urgent need for a recognition method based on the integration of physics-aware dynamic graphs and visual heterogeneous features, which achieves high-precision fine-grained target recognition by deeply mining the geometric topology of the complex domain and letting it interact deeply with visual textures.
Summary of the Invention
[0006] This invention addresses the low utilization of complex phase information, the difficulty of extracting geometric topology, and the limited accuracy of large-scale fine-grained classification in existing SAR ATR methods. It proposes a SAR fine-grained target recognition method based on the integration of a physics-aware dynamic graph and visual heterogeneous features. The invention constructs a GAT-Former dual-stream heterogeneous network, explicitly extracts scattering centers with deep unfolding, combines a dynamic graph attention mechanism with visual convolutional features for cross-modal semantic alignment, and uses a heterogeneous integration strategy to mitigate the local-extremum problem in the non-convex optimization space, thereby maximizing the recognition accuracy of fine-grained vehicle targets.
[0007] The technical solution adopted by this invention is a SAR fine-grained target recognition method based on the integration of a physics-aware dynamic graph and visual heterogeneous features, comprising the following steps:
Step 1: Construct a dual-stream heterogeneous dataset for SAR fine-grained target recognition. Let the original SAR complex dataset be $\mathcal{D} = \{(S_i, y_i)\}_{i=1}^{N}$. For the $i$-th sample $S_i$, construct a two-stream input pair $(X_i^{phy}, X_i^{vis})$ covering both the complex domain and the visual domain, where $N$ is the total number of samples and $y_i$ is the category label.
[0008] Step 1-1: Construct the physical flow complex tensor $X^{phy}$. Read the real part $\Re(S)$ and the imaginary part $\Im(S)$ of the raw SAR data and construct the tensor $X^{phy} = [\Re(S); \Im(S)]$, which preserves the complete electromagnetic phase information.
[0009] Step 1-2: Construct the visual flow amplitude tensor $X^{vis}$. A logarithmic transformation and normalization are applied to the modulus of the complex data to generate a three-channel image: $X^{vis} = \mathcal{N}\big(\log(|S| + \epsilon)\big)$, where $\epsilon$ is a small constant that prevents numerical errors and $\mathcal{N}(\cdot)$ is the normalization mapping function.
[0010] Step 2: Construct a physical perception branch to generate dynamic topological graph features of the scattering centers. Define the physical branch mapping function as $f_{phy}: X^{phy} \mapsto F^{phy} \in \mathbb{R}^{M \times d}$, where $M$ is the number of scattering centers and $d$ is the feature dimension.
[0011] Step 2-1: Extract the scattering center set $\mathcal{P} = \{(\mathbf{p}_m, a_m)\}_{m=1}^{M}$ with a deep unfolding network. The geometric parameters of the target are regressed through cascaded convolutional layers, where $\mathbf{p}_m$ denotes the normalized spatial coordinates of the $m$-th scattering center and $a_m$ denotes its scattering intensity.
[0012] Step 2-2: Construct a dynamic scattering topology graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$. The scattering center set $\mathcal{P}$ serves as the node set $\mathcal{V}$, and the edge set $\mathcal{E}$ is built with k-NN on Euclidean distance: for any node $v_i$, if $v_j$ belongs to the $k$ nearest neighbors of $v_i$, a directed edge $e_{ij}$ is established.
[0013] Step 2-3: Aggregate topological features with a graph attention network. The physical feature vector $\mathbf{h}_i'$ of node $v_i$ is computed as $\mathbf{h}_i' = \sigma\big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} W \mathbf{h}_j\big)$, where $\mathcal{N}(i)$ is the neighborhood node set, $W$ is the weight matrix, $\alpha_{ij}$ is the attention coefficient, and $\sigma(\cdot)$ is the activation function.
[0014] Step 3: Construct the visual perception branch and extract the visual context semantic sequence. Define the visual branch mapping function as $f_{vis}: X^{vis} \mapsto F^{vis} \in \mathbb{R}^{L \times d}$. A pre-trained ConvNeXt network extracts high-dimensional feature maps, which are flattened into the visual feature sequence matrix $F^{vis}$, where $L$ is the length of the feature sequence.
[0015] Step 4: Physics-guided cross-attention fusion and classification. Construct a physical-visual cross-attention fusion module that uses the physical features $F^{phy}$ as the query matrix $Q$ and the visual features $F^{vis}$ as the key matrix $K$ and value matrix $V$ to compute the fusion features $F^{fus} = F^{phy} + \mathrm{softmax}\big(QK^\top / \sqrt{d_h}\big)V$. A global average pooling layer $\mathrm{GAP}(\cdot)$ and a fully connected layer $\mathrm{FC}(\cdot)$ are then applied to obtain the predicted probability vector $\hat{\mathbf{p}}^{GAT}$.
[0016] Step 5: Joint optimization based on hybrid regularization and heterogeneous integration.
Step 5-1: Introduce the Mixup augmentation strategy. A mixing coefficient $\lambda$ is sampled from a $\mathrm{Beta}(\alpha, \alpha)$ distribution, and virtual training samples $(\tilde{x}, \tilde{y})$ are constructed to smooth the decision boundary: $\tilde{x} = \lambda x_a + (1 - \lambda) x_b$, $\tilde{y} = \lambda y_a + (1 - \lambda) y_b$.
Step 5-2: Construct a heterogeneous ensemble decision model. ResNet-50 is introduced as the texture expert model and outputs the predicted probability $\hat{\mathbf{p}}^{Res}$. The final target recognition decision function is defined as a weighted fusion: $\hat{y} = \arg\max_c \big(w_1 \hat{p}_c^{GAT} + w_2 \hat{p}_c^{Res}\big)$, where $w_1$ and $w_2$ are the heterogeneous integration weighting coefficients and $\hat{y}$ is the final fine-grained target category.
[0017] Beneficial effects: 1. This invention proposes a SAR fine-grained target recognition method that fuses physical perception and vision. By introducing a heterogeneous integration strategy, it overcomes the recognition bottleneck of a single modality and achieves high-precision classification of fine-grained targets. Experimental results show that on the large-scale ATRNet-STAR dataset, which contains 40 categories of civilian vehicles and a total of 67,780 training samples, the final recognition accuracy of the proposed method reaches 96.64%. This result exceeds that of the contemporary purely visual state-of-the-art model ConvNeXt (96.00%), effectively reducing the confusion of highly similar targets in SAR images and raising the upper limit of the recognition system.
[0018] 2. The physical perception dynamic graph model constructed in this invention can explicitly solve for the scattering center of the target and establish topological connections, effectively enhancing the robustness of the model in complex electromagnetic environments. The geometric structure features extracted by the GAT-Former branch in this invention have a strong ability to suppress speckle noise, achieving a recognition accuracy of 93.84% on the test set relying solely on the physical flow single mode. This proves that the dynamic graph features extracted in this invention can effectively lock the essential geometric "skeleton" of the target, maintaining a high level of perception capability even in the absence of texture information or the presence of strong background clutter.
[0019] 3. The physics-guided cross-attention fusion and heterogeneous integration mechanism designed in this invention achieves deep complementarity between physical mechanisms and visual textures. Through cross-attention modules and weighted decision fusion, the system successfully improved the recognition accuracy of a single physical model by 2.8 percentage points, from 93.84% to 96.64%. This bidirectional feedback mechanism not only solves the problem of inconsistent heterogeneous feature representations but also effectively avoids the problem of a single network architecture easily getting trapped in local extrema in a non-convex optimization space.
[0020] 4. The three-stage collaborative optimization strategy and inference enhancement method proposed in this invention balance the training stability and generalization ability of the model. Addressing the issue of small inter-class differences in fine-grained classification tasks, this invention employs a Mixup and label smoothing strategy, combined with TTA technology, to demonstrate excellent generalization performance in complex tasks involving 40 fine-grained subclasses. Confusion matrix analysis shows that this invention maintains extremely high classification confidence even when facing easily confused models of the same class, effectively avoiding the risk of overfitting on large-scale datasets.
[0021] 5. The heterogeneous fusion architecture constructed in this invention has good versatility. This invention not only achieved excellent results in the ultra-fine-grained classification of civilian vehicles, but its methodology based on the fusion of physical scattering mechanisms and visual context is also applicable to SAR target identification tasks for other targets with significant scattering characteristics, such as ships and aircraft, providing a high-precision technical solution for improving the automation and intelligence level of remote sensing interpretation systems.
Attached Figure Description
[0022] Figure 1 is the overall system architecture diagram based on GAT-Former and heterogeneous integration proposed in this invention.
[0023] Figure 2 is an ablation comparison radar chart showing the contributions of the physical flow and the visual flow across multiple performance dimensions in the method of this invention.
[0024] Figure 3 is a bar chart comparing the accuracy of the method of this invention with 10 other mainstream models (DenseNet, ViT, etc.) on the ATRNet-STAR dataset.
[0025] Figure 4 is a confusion matrix of the method of this invention on 40 categories of civilian vehicle targets, showing the fine-grained classification performance.
Detailed Implementation
[0026] The invention will now be described in further detail with reference to the accompanying drawings.
[0027] Example 1
As shown in Figure 1, this invention provides a fine-grained SAR target recognition method based on the integration of a physics-aware dynamic graph and visual heterogeneous features. The method includes the following steps:
Step 1: Construct a dual-stream heterogeneous dataset for SAR fine-grained target recognition.
[0028] This step transforms the raw SAR complex echo data into heterogeneous input pairs that characterize both the physical scattering mechanism and the visual texture. Let the raw SAR complex dataset be $\mathcal{D} = \{(S_i, y_i)\}_{i=1}^{N}$. For the $i$-th sample $S_i$, construct a two-stream input pair $(X_i^{phy}, X_i^{vis})$ covering both the complex domain and the visual domain, where $N$ is the total number of samples and $y_i$ is the category label. This embodiment addresses a fine-grained vehicle target recognition task with a total of $C = 40$ categories.
[0029] Step 1-1: Construct the physical flow complex tensor $X^{phy}$.
[0030] The core of physical flow construction lies in preserving the complex phase information of SAR data to characterize the inherent electromagnetic scattering properties of the target. First, the original complex matrix $S$ is read from the storage path and decomposed into its in-phase component $S_I = \Re(S)$ and quadrature component $S_Q = \Im(S)$. To let the network directly perceive the phase-difference information produced by wave interference, this invention abandons the traditional amplitude-only preprocessing and directly concatenates $S_I$ and $S_Q$ along the channel dimension. A bilinear resampling operator $\mathcal{R}(\cdot)$ then normalizes the spatial size to a uniform $H \times W$, generating the physical flow input tensor $X^{phy} = \mathcal{R}([S_I; S_Q]) \in \mathbb{R}^{2 \times H \times W}$. As the input to the subsequent deep unfolding network, this tensor provides complete complex-domain physical support for the parameterization of the scattering centers.
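The construction of the physical flow tensor can be illustrated with a minimal NumPy sketch. It is not the patent's implementation: the 64×64 target size, the function names, and the align-corners style of bilinear interpolation are assumptions made only for illustration, and a random complex matrix stands in for a real SAR chip.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Resize a 2-D array with bilinear interpolation (align-corners style)."""
    in_h, in_w = img.shape
    ys = np.linspace(0.0, in_h - 1, out_h)
    xs = np.linspace(0.0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]          # vertical interpolation weights
    wx = (xs - x0)[None, :]          # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

def build_physical_stream(s, size=(64, 64)):
    """Stack real and imaginary parts as two channels; size is a hypothetical choice."""
    real = bilinear_resize(s.real, *size)
    imag = bilinear_resize(s.imag, *size)
    return np.stack([real, imag], axis=0).astype(np.float32)

# Synthetic complex matrix standing in for one SAR sample.
rng = np.random.default_rng(0)
s = rng.standard_normal((48, 48)) + 1j * rng.standard_normal((48, 48))
x_phy = build_physical_stream(s)
print(x_phy.shape)
```

Keeping the two components as separate channels, rather than reducing them to amplitude, is what lets a downstream network see relative phase.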
[0031] Step 1-2: Construct the visual flow amplitude tensor $X^{vis}$.
[0032] The visual flow construction focuses on enhancing the texture saliency of the target so that it suits the visual perception capabilities of large-scale pre-trained models. The system first computes the pixel-level magnitude of the complex matrix to obtain the original magnitude map $A = |S|$. A logarithmic transformation then nonlinearly compresses its dynamic range: $A_{log} = \log(A + \epsilon)$, where $\epsilon$ is a bias factor that prevents numerical singularity. To reduce the effect of speckle noise on contrast, $A_{log}$ is normalized by its extrema, $\mathcal{N}(A_{log}) = (A_{log} - \min A_{log}) / (\max A_{log} - \min A_{log})$, quantized to the [0, 255] grayscale range, and expanded into a three-channel RGB image by channel duplication. Finally, its size is adjusted by spatial interpolation to generate the visual flow tensor $X^{vis}$, which is intended to capture the macroscopic geometric outline of vehicle targets through a global view.
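A short sketch of the visual flow preprocessing follows, assuming NumPy. The value of `eps`, the function name, and the synthetic input are illustrative assumptions, and the final resizing step is omitted for brevity.

```python
import numpy as np

def build_visual_stream(s, eps=1e-6):
    """Log-compress the magnitude, min-max normalize, quantize, and copy to 3 channels."""
    amp = np.abs(s)                              # pixel-level magnitude
    log_amp = np.log(amp + eps)                  # dynamic-range compression
    lo, hi = log_amp.min(), log_amp.max()
    norm = (log_amp - lo) / max(hi - lo, eps)    # extrema-based normalization to [0, 1]
    gray = np.round(norm * 255.0).astype(np.uint8)
    return np.repeat(gray[None, :, :], 3, axis=0)  # channel duplication -> (3, H, W)

# Synthetic complex matrix standing in for one SAR sample.
rng = np.random.default_rng(0)
s = rng.standard_normal((32, 32)) + 1j * rng.standard_normal((32, 32))
x_vis = build_visual_stream(s)
print(x_vis.shape, x_vis.dtype)
```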
[0033] Step 1-3: Perform physical mechanism-based data augmentation and robust regularization.
[0034] During training, to simulate radar imaging fluctuations in complex battlefield environments and alleviate overfitting in fine-grained tasks, this invention applies a collaborative data augmentation transformation to the input pair $(X^{phy}, X^{vis})$. Gaussian white noise with a specified distribution is injected into the physical stream to simulate thermal noise interference, and salt-and-pepper noise is added to the visual stream to simulate strong speckle effects, improving the model's noise robustness. To address the extremely high azimuth sensitivity of SAR images, the system also applies a random rotation transformation combined with horizontal flipping, constructing an augmented sample space with orientation invariance: $(\tilde{X}^{phy}, \tilde{X}^{vis}) = \mathcal{T}(X^{phy}, X^{vis})$, where $\mathcal{T}(\cdot)$ denotes the composite geometric and noise transformation operator. This augmentation strategy forces the model to learn the essential structural features of the target rather than the unstable background texture, laying a data foundation for subsequent high-precision recognition.
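The joint augmentation of the two streams might look like the sketch below. The noise level `sigma`, the salt-and-pepper ratio `sp_ratio`, and the flip probability are hypothetical values, since the text does not state them, and rotation is simplified here to random multiples of 90 degrees via `np.rot90` rather than arbitrary angles.

```python
import numpy as np

def augment_pair(x_phy, x_vis, rng, sigma=0.05, sp_ratio=0.02):
    """Apply noise per stream, then the same flip/rotation to both streams."""
    # Gaussian white noise on the physical stream (thermal-noise stand-in).
    x_phy = x_phy + rng.normal(0.0, sigma, size=x_phy.shape)
    # Salt-and-pepper noise on the visual stream, shared across channels.
    x_vis = x_vis.copy()
    mask = rng.random(x_vis.shape[1:])
    x_vis[:, mask < sp_ratio / 2] = 0
    x_vis[:, mask > 1 - sp_ratio / 2] = 255
    # Horizontal flip with probability 0.5, applied to both streams together.
    if rng.random() < 0.5:
        x_phy = x_phy[:, :, ::-1]
        x_vis = x_vis[:, :, ::-1]
    # Simplified rotation: a random multiple of 90 degrees in the image plane.
    k = int(rng.integers(0, 4))
    x_phy = np.rot90(x_phy, k, axes=(1, 2))
    x_vis = np.rot90(x_vis, k, axes=(1, 2))
    return np.ascontiguousarray(x_phy), np.ascontiguousarray(x_vis)

rng = np.random.default_rng(0)
x_phy = rng.standard_normal((2, 32, 32))
x_vis = rng.integers(0, 256, size=(3, 32, 32), dtype=np.uint8)
a_phy, a_vis = augment_pair(x_phy, x_vis, rng)
print(a_phy.shape, a_vis.shape)
```

Applying the geometric part with identical parameters to both streams keeps the scattering geometry and the texture spatially aligned.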
[0035] Step 2: Construct the physical perception branch and generate dynamic topological map features of the scattering center.
[0036] The core of this step is to use a deep neural network to emulate the sparse solution process: starting from the complex-domain input $X^{phy}$, the scattering centers that reflect the target's geometry are explicitly extracted, and the spatial correlation and electromagnetic coupling between scattering points are captured through dynamic graph modeling. The physical branch mapping function is defined as $f_{phy}: X^{phy} \mapsto F^{phy} \in \mathbb{R}^{M \times d}$, where $M$ is the preset number of scattering center feature points and $d$ is the dimension of the mapped high-dimensional feature space.
[0037] Step 2-1: Extract the parameterized scattering center set with the deep unfolding network.
[0038] The physical branch first uses a cascaded deep unfolding network (DUN) to decode features from the physical flow complex tensor $X^{phy}$. The network consists of six cascaded dual-channel convolutional operators. By emulating the unfolded form of a sparse solution algorithm, it turns the complex electromagnetic scattering computation into a learnable nonlinear regression task. At the end of the feature extraction layers, parallel regression heads predict the spatial coordinates and the scattering intensity: a Sigmoid activation normalizes the scattering point coordinates to the [0, 1] range, and a ReLU activation ensures that the scattering intensity is non-negative. Through this network, the physical branch explicitly computes the set of discrete scattering point attributes contained in the original signal, $\mathcal{P} = \{(\mathbf{p}_m, a_m)\}_{m=1}^{M}$, where $\mathbf{p}_m \in [0, 1]^2$ denotes the two-dimensional centroid coordinates of the $m$-th scattering center and $a_m \ge 0$ represents the equivalent intensity of its radar cross section. This parameterized representation suppresses background speckle noise and preserves the essential geometric "skeleton" of the target.
[0039] Step 2-2: Construct a dynamic topology graph of scattering centers $\mathcal{G}$.
[0040] To further explore the spatial topological distribution of the scattering points, this invention constructs a dynamic scattering topology graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ from the computed parameter set $\mathcal{P}$. The node set $\mathcal{V}$ consists of the $M$ scattering points in $\mathcal{P}$; the initial feature of each node is generated by a linear projection layer that maps the original parameters $(\mathbf{p}_m, a_m)$ to a $d$-dimensional continuous space. The edge set $\mathcal{E}$ follows a dynamic $k$-NN criterion: the system computes the Euclidean distance $\mathrm{dist}(u, v) = \|\mathbf{p}_u - \mathbf{p}_v\|_2$ between any two nodes $u$ and $v$ in real time and, for each node, searches its spatial neighborhood for the $k$ nearest nodes, establishing a directed edge with each of them. This dynamic graph construction, guided by Euclidean distance, lets the model adaptively establish topological relationships from the target's geometric configuration, thereby enhancing the robustness of the recognition algorithm to target translation and rotation.
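The k-NN edge construction can be sketched as follows. The edge orientation used here, (neighbor, center), and the toy coordinates are assumptions for illustration; the text does not fix the direction of the edge or the value of k.

```python
import numpy as np

def knn_edges(points, k):
    """Return directed edges (j, i), where j is one of the k nearest neighbors of i."""
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))       # pairwise Euclidean distances
    np.fill_diagonal(dist, np.inf)            # a node is not its own neighbor
    nbrs = np.argsort(dist, axis=1)[:, :k]    # indices of the k closest nodes per row
    return [(int(j), i) for i in range(len(points)) for j in nbrs[i]]

# Four toy scattering centers in normalized [0, 1] coordinates.
points = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.0]])
edges = knn_edges(points, k=1)
print(edges)  # [(1, 0), (0, 1), (3, 2), (2, 3)]
```

Because the graph is rebuilt from the predicted coordinates of each sample, the neighborhood structure depends only on relative distances, which is why it is unaffected by a rigid shift or rotation of the whole point set.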
[0041] Step 2-3: Aggregate topological features using graph attention networks.
[0042] On the constructed dynamic graph $\mathcal{G}$, this invention uses a Graph Attention Network (GAT) to perform multi-layer message passing and generate a feature matrix that deeply encodes physical topological semantics. For any central node $v_i$ in the graph, the attention mechanism first computes the energy coefficient $e_{ij}$ between it and each neighboring node $v_j$, and the attention weight $\alpha_{ij}$ is then obtained by Softmax normalization over the neighborhood: $e_{ij} = \mathrm{LeakyReLU}\big(\mathbf{a}^\top [W\mathbf{h}_i \,\|\, W\mathbf{h}_j]\big)$, $\alpha_{ij} = \exp(e_{ij}) / \sum_{n \in \mathcal{N}(i)} \exp(e_{in})$, where $W$ is a shared linear weight matrix, $\|$ is the vector concatenation operator, and $\mathbf{a}$ is a learnable attention vector. Node $v_i$ then aggregates the structural information within its neighborhood and, combined with a residual connection and layer normalization, the output node feature vector is computed as $\mathbf{h}_i' = \mathrm{LayerNorm}\big(\mathbf{h}_i + \sigma\big(\sum_{j \in \mathcal{N}(i)} \alpha_{ij} W \mathbf{h}_j\big)\big)$, where $\sigma(\cdot)$ is a non-linear activation function. The resulting physical feature matrix $F^{phy} \in \mathbb{R}^{M \times d}$ integrates the geometric topology and physical scattering semantics of the scattering centers, providing structured physical constraints for the subsequent heterogeneous cross-modal semantic alignment.
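A single-head graph attention layer in the standard GAT form can be sketched in NumPy as below. The LeakyReLU slope, the choice of ReLU as the activation, the square weight matrix (needed for the residual connection), and the toy graph are all assumptions; the patent's actual layer configuration is not specified in the text.

```python
import numpy as np

def gat_layer(h, edges, W, a, slope=0.2, eps=1e-6):
    """One single-head GAT layer with residual connection and layer normalization.

    edges holds directed pairs (j, i): node i attends to its neighbor j.
    """
    z = h @ W                                   # shared linear projection
    m = z.shape[0]
    alpha = np.zeros((m, m))
    agg = np.zeros_like(z)
    for i in range(m):
        nbrs = [j for (j, dst) in edges if dst == i]
        e = np.array([a @ np.concatenate([z[i], z[j]]) for j in nbrs])
        e = np.where(e > 0, e, slope * e)       # LeakyReLU on raw scores
        w = np.exp(e - e.max())
        alpha[i, nbrs] = w / w.sum()            # softmax over the neighborhood
        agg[i] = alpha[i, nbrs] @ z[nbrs]       # attention-weighted aggregation
    out = h + np.maximum(agg, 0.0)              # residual + ReLU
    mu = out.mean(axis=1, keepdims=True)
    sd = np.sqrt(out.var(axis=1, keepdims=True) + eps)
    return (out - mu) / sd, alpha               # layer-normalized features, weights

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))                 # 4 nodes, 8-dim features
W = rng.standard_normal((8, 8)) * 0.1
a = rng.standard_normal(16) * 0.1
edges = [(1, 0), (2, 0), (0, 1), (3, 1), (3, 2), (0, 2), (2, 3), (1, 3)]
h_out, alpha = gat_layer(h, edges, W, a)
print(h_out.shape, alpha.sum(axis=1))
```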
[0043] Step 3: Construct a visual perception branch and extract the visual context semantic sequence.
[0044] This step uses a deep convolutional architecture pre-trained on a large-scale optical dataset to extract high-level semantic texture features from the visual domain tensor $X^{vis}$, providing macroscopic contextual information for the discrete topological structures extracted by the physical branch. The visual branch mapping function is defined as $f_{vis}: X^{vis} \mapsto F^{vis} \in \mathbb{R}^{L \times d}$, where $L$ is the length of the feature sequence and $d$ is the embedding dimension in which visual and physical features are aligned.
[0045] First, this invention employs ConvNeXt-Tiny as the backbone network of the visual perception branch. While retaining the inductive bias of traditional CNNs, this network obtains a Transformer-like large receptive field by introducing large 7×7 convolutional kernels, effectively capturing the semantic relationship between the macroscopic contours and the shadow regions of targets in SAR images. After the visual flow tensor $X^{vis}$ is fed into the network, multi-level downsampling and residual blocks extract a deep feature map $F_{map} \in \mathbb{R}^{768 \times H' \times W'}$ rich in texture detail.
[0046] Second, to align the visual features with the physical topological features in the semantic space, the system builds a feature projection module composed of LayerNorm2d and a $1 \times 1$ convolution. The module first normalizes the mean and variance of the feature map $F_{map}$ to stabilize its numerical distribution and then uses the $1 \times 1$ convolution to compress the number of feature channels from 768 to the target dimension $d$, yielding the mapped visual feature map $\tilde{F}_{map} = \mathrm{Conv}_{1 \times 1}\big(\mathrm{LayerNorm2d}(F_{map})\big) \in \mathbb{R}^{d \times H' \times W'}$, where $d = 256$ in this embodiment.
[0047] Finally, this invention serializes the spatial dimensions: the two-dimensionally arranged feature map $\tilde{F}_{map}$ is flattened along the height and width dimensions and transposed into the visual context semantic sequence $F^{vis} = \mathrm{Transpose}\big(\mathrm{Flatten}(\tilde{F}_{map})\big) \in \mathbb{R}^{L \times d}$, with $L = H'W'$. Each vector $\mathbf{f}_l$ in this sequence represents the visual semantics of a specific local region of the image. Through this sequential representation, the visual perception branch provides a highly abstract "texture dictionary" for the subsequent physics-guided cross-attention mechanism, helping the model locate and identify the fine-grained attributes of targets under complex background interference.
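The projection and serialization of the backbone feature map can be sketched as follows. The backbone itself is not run here: a random array stands in for its output, the 7×7 spatial size is an assumption (the input image size is not given in the text), and the channel-wise normalization omits the learnable scale and shift of a real LayerNorm2d.

```python
import numpy as np

def project_and_flatten(fmap, w_proj, eps=1e-6):
    """Channel-wise normalization, 1x1 projection, then flatten to a token sequence."""
    # Normalize over the channel axis at every spatial location.
    mu = fmap.mean(axis=0, keepdims=True)
    var = fmap.var(axis=0, keepdims=True)
    normed = (fmap - mu) / np.sqrt(var + eps)
    # A 1x1 convolution is a per-pixel linear map over channels.
    proj = np.einsum('dc,chw->dhw', w_proj, normed)
    d, h, w = proj.shape
    return proj.reshape(d, h * w).T             # (L, d) with L = h * w

rng = np.random.default_rng(0)
fmap = rng.standard_normal((768, 7, 7))         # stand-in for the backbone output
w_proj = rng.standard_normal((256, 768)) * 0.01 # 768 -> 256 channel compression
seq = project_and_flatten(fmap, w_proj)
print(seq.shape)  # (49, 256)
```

Row `r * 7 + c` of the sequence corresponds to spatial location `(r, c)`, so each token keeps a fixed link to one image region.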
[0048] Step 4: Physics-guided cross-attention fusion and classification.
[0049] The core of this step lies in achieving deep semantic alignment between discrete physical topological features and continuous visual texture features through a designed physical-visual cross-attention fusion module. This mechanism allows the model to use scattering centers with strong physical discriminative power as "anchor points" to actively extract key detail evidence from complex visual contexts, thereby achieving complementary fusion of heterogeneous information at the fine-grained feature level.
[0050] Step 4-1: Construct a physics-guided cross-modal interaction mechanism.
[0051] The system uses a multi-head cross-attention (MHCA) mechanism to process the physical feature matrix $F^{phy}$ together with the visual feature sequence $F^{vis}$. First, three sets of learnable weight matrices $W_Q$, $W_K$, $W_V$ map the physical features to the query matrix $Q$ and the visual features to the key matrix $K$ and value matrix $V$: $Q = F^{phy} W_Q \in \mathbb{R}^{M \times d}$, $K = F^{vis} W_K \in \mathbb{R}^{L \times d}$, $V = F^{vis} W_V \in \mathbb{R}^{L \times d}$, where $M$ is the number of scattering center nodes and $L$ is the number of visual feature blocks. Next, the attention weight scores $A$ of the physical nodes over the visual regions are computed, using the scaled dot-product operator to measure the correlation between physical topological points and visual pixel blocks: $A = \mathrm{softmax}\big(QK^\top / \sqrt{d_h}\big) \in \mathbb{R}^{M \times L}$, where $d_h$ is the dimension of each attention head and $\sqrt{d_h}$ is the scaling factor used to prevent gradient vanishing. Finally, the visual values $V$ are aggregated with these weights and a residual connection is introduced to obtain the fused feature matrix $F^{fus} = F^{phy} + AV$. This "point-to-surface" interaction forces the model to use the target geometric skeleton determined by the physical flow to suppress background clutter and speckle noise in the visual flow.
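The physics-guided cross-attention can be illustrated with a single-head NumPy sketch in which physical node features act as queries and visual tokens act as keys and values. The single head, the small dimensions, and the random weights are simplifications for illustration; the embodiment described in the text uses 8 heads.

```python
import numpy as np

def cross_attention(f_phy, f_vis, w_q, w_k, w_v):
    """Scaled dot-product cross-attention with a residual connection."""
    q = f_phy @ w_q                              # queries from scattering-center nodes
    k = f_vis @ w_k                              # keys from visual tokens
    v = f_vis @ w_v                              # values from visual tokens
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (M, L) similarity, scaled
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability for softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)      # each node's weights sum to 1
    return f_phy + attn @ v, attn                # residual fusion, attention map

rng = np.random.default_rng(0)
m, l, d = 6, 49, 32                              # nodes, visual tokens, feature dim
f_phy = rng.standard_normal((m, d))
f_vis = rng.standard_normal((l, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
f_fus, attn = cross_attention(f_phy, f_vis, w_q, w_k, w_v)
print(f_fus.shape, attn.shape)
```

The output keeps one row per scattering center, so the fused representation stays indexed by physical nodes rather than by image patches.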
[0052] Step 4-2: Feature enhancement and nonlinear semantic projection.
[0053] To further enhance the representational capability of the fused features, the system applies a feed-forward network (FFN) to $F^{fus}$ for nonlinear transformation. The FFN comprises two linear projection layers with an intermediate GELU activation function and mines deep discriminative attributes of the target through feature mapping in a high-dimensional space. A Dropout layer is included to reduce the risk of overfitting during fine-grained feature mining.
[0054] Step 4-3: Classification decision and probability mapping.
[0055] After the deep coupling of the multimodal features, this invention applies a Global Average Pooling (GAP) layer to reduce the matrix $F^{fus}$ along the scattering point dimension and extract the global physical-visual consistency feature vector $\mathbf{g}$. The vector $\mathbf{g}$ is then fed into a specially designed fine-grained classification head consisting of layer normalization, a multilayer perceptron, and a Softmax function, which outputs the predicted probability vector $\hat{\mathbf{p}}^{GAT}$ of the target: $\mathbf{g} = \mathrm{GAP}(F^{fus})$, $\hat{\mathbf{p}}^{GAT} = \mathrm{Softmax}\big(W_c\,\mathrm{LN}(\mathbf{g}) + b_c\big)$, where $W_c$ and $b_c$ are the weights and biases of the classification layer. Each element of $\hat{\mathbf{p}}^{GAT}$ represents the confidence that the input sample belongs to the corresponding fine-grained vehicle model, providing the baseline decision for the subsequent heterogeneous model integration.
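The pooling and classification step can be sketched as below. As a simplification, the multilayer perceptron is reduced to a single linear layer, and the feature sizes and random weights are illustrative assumptions; only the class count of 40 comes from the text.

```python
import numpy as np

def classify(f_fus, w_c, b_c, eps=1e-6):
    """GAP over nodes, layer normalization, linear layer, softmax."""
    g = f_fus.mean(axis=0)                        # pool along the scattering-point axis
    g = (g - g.mean()) / np.sqrt(g.var() + eps)   # layer normalization (no affine terms)
    logits = w_c @ g + b_c
    z = np.exp(logits - logits.max())             # stable softmax
    return z / z.sum()

rng = np.random.default_rng(0)
f_fus = rng.standard_normal((6, 32))              # 6 nodes, 32-dim fused features
w_c = rng.standard_normal((40, 32)) * 0.1         # 40 fine-grained classes
b_c = np.zeros(40)
p_gat = classify(f_fus, w_c, b_c)
print(p_gat.shape, p_gat.sum())
```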
[0056] Step 5: Solve the joint optimization problem based on hybrid regularization and heterogeneous integration.
[0057] This step aims to further optimize the model's search capabilities in non-convex optimization spaces by employing a strongly regularized training paradigm and a multi-model decision fusion strategy, thereby addressing the persistent problems of getting trapped in local minima and overfitting with small samples in fine-grained recognition tasks. This invention adopts a three-stage collaborative optimization approach, through warm-up training, robust fine-tuning, and heterogeneous integration, ultimately achieving a significant improvement in recognition accuracy.
[0058] Step 5-1: Introduce a strong regularization training strategy that combines hybrid enhancement and label smoothing.
[0059] To smooth the decision boundaries of deep neural networks and enhance their generalization in high-dimensional feature spaces, this invention introduces a Mixup augmentation strategy during training. For each training batch, a mixing coefficient $\lambda$ is randomly sampled from a $\mathrm{Beta}(\alpha, \alpha)$ distribution, with the hyperparameter $\alpha$ in the range [0.4, 0.8]. Virtual training samples are constructed by pixel-level linear interpolation of the input two-stream sample pairs: $\tilde{x} = \lambda x_a + (1 - \lambda) x_b$, $\tilde{y} = \lambda y_a + (1 - \lambda) y_b$, where $x_a$ and $x_b$ are input tensors of different categories and $y_a$, $y_b$ are their labels. In addition, this invention introduces label smoothing to modify the cross-entropy loss function, defining a smoothing factor $\varepsilon$ that transforms hard labels into soft labels with distribution characteristics. This forces the model to learn more discriminative local details of the target rather than simply fitting the overall contour. The hybrid regularization mechanism suppresses the model's memorization of training noise and improves its sensitivity in distinguishing similar targets.
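Mixup and label smoothing can be sketched together as follows. The sample shapes and the value `alpha=0.6` (picked from inside the stated [0.4, 0.8] range) are illustrative assumptions; the smoothing factor 0.1 matches the value given in the experimental settings, and the same mixing coefficient is applied to both streams and the label.

```python
import numpy as np

def smooth_labels(y_onehot, eps=0.1):
    """Turn a hard one-hot label into a soft label with smoothing factor eps."""
    c = y_onehot.shape[-1]
    return (1.0 - eps) * y_onehot + eps / c

def mixup_pair(sample_a, sample_b, alpha, rng):
    """Interpolate every tensor of two samples with one shared Beta-sampled lambda."""
    lam = float(rng.beta(alpha, alpha))
    mixed = tuple(lam * a + (1.0 - lam) * b for a, b in zip(sample_a, sample_b))
    return mixed, lam

rng = np.random.default_rng(0)
num_classes = 4
# Each sample: (physical stream, visual stream, smoothed label).
sample_a = (rng.standard_normal((2, 8, 8)), rng.random((3, 8, 8)),
            smooth_labels(np.eye(num_classes)[0]))
sample_b = (rng.standard_normal((2, 8, 8)), rng.random((3, 8, 8)),
            smooth_labels(np.eye(num_classes)[2]))
(x_phy_mix, x_vis_mix, y_mix), lam = mixup_pair(sample_a, sample_b, alpha=0.6, rng=rng)
print(lam, y_mix)
```

Because both operations are convex combinations, the mixed label remains a valid probability distribution and can be used directly with a cross-entropy loss.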
[0060] Step 5-2: Construct a heterogeneous integrated decision model.
[0061] Considering the inherent inductive-bias limitation of a single network architecture in feature extraction, this invention constructs a heterogeneous ensemble system based on GAT-Former (the "structure expert") and ResNet-50 (the "texture expert"). GAT-Former focuses on mining the topological geometric relationships of discrete scattering points with dynamic graphs, while ResNet-50 focuses on extracting texture details of dense surfaces through continuous convolution, so the two are strongly complementary in feature representation. In the inference phase, the system first applies test-time augmentation (TTA) to the input image, averaging the predicted probabilities of the original and horizontally flipped inputs to reduce azimuth bias. The final target recognition decision function is then defined as the weighted probability fusion of the two models: $\hat{y} = \arg\max_c \big(w_1 \hat{p}_c^{GAT} + w_2 \hat{p}_c^{Res}\big)$, where $\hat{\mathbf{p}}^{GAT}$ and $\hat{\mathbf{p}}^{Res}$ are the output probability vectors of the two heterogeneous models and $w_1$, $w_2$ are the integration weighting coefficients. This embodiment determines the optimal weight allocation by grid search on the validation set. The ensemble strategy improved the recognition accuracy on 40 fine-grained target classes from 93.84% with a single model to 96.64%, breaking through the performance bottleneck of a single algorithm.
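The inference-time fusion can be sketched as below. The weights 0.45 and 0.55 are the values reported for GAT-Former and ResNet-50 in the experimental settings; `toy_model` is a hypothetical stand-in for a trained network, and the two-class probability vectors are made up purely to show the arithmetic.

```python
import numpy as np

def tta_predict(model, x):
    """Average the predictions for the input and its horizontal flip."""
    return 0.5 * (model(x) + model(x[..., ::-1]))

def fuse(p_gat, p_res, w_gat=0.45, w_res=0.55):
    """Weighted probability fusion followed by arg-max decision."""
    p = w_gat * p_gat + w_res * p_res
    return int(np.argmax(p)), p

def toy_model(x):
    """Hypothetical 2-class model: compares energy in the left and right image halves."""
    s = np.array([x[..., :2].sum(), x[..., 2:].sum()])
    e = np.exp(s - s.max())
    return e / e.sum()

x = np.arange(16, dtype=float).reshape(1, 4, 4) / 16.0
p_tta = tta_predict(toy_model, x)        # flip averaging removes the left/right bias

p_gat = np.array([0.7, 0.3])             # made-up structure-expert output
p_res = np.array([0.2, 0.8])             # made-up texture-expert output
label, p_final = fuse(p_gat, p_res)
print(p_tta, label, p_final)
```

In this toy case the two experts disagree, and the fused probabilities 0.425 and 0.575 select class 1, showing how the weighting resolves conflicts between the models.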
[0062] The effects of the present invention are further explained below with reference to simulation experiments. 1. Experimental hardware conditions. The experiments were conducted on a deep learning simulation platform using Python 3.10 and PyTorch 2.1.2, with Ubuntu 22.04 as the operating system. The hardware configuration was as follows: the CPU was a 16-vCPU Intel(R) Xeon(R) Platinum 8474C processor; the GPU was a single NVIDIA GeForce RTX 4090D graphics card with 24 GB of video memory, supporting CUDA 11.8 acceleration; the system had 80 GB of RAM; and storage consisted of a 30 GB system disk and a 50 GB data disk.
[0063] 2. Experimental system and parameter settings. This invention uses the large-scale, fine-grained SAR vehicle target dataset ATRNet-STAR for algorithm validation. The dataset contains 40 civilian vehicle models spanning four main categories and 21 subcategories, including sedans, SUVs, pickup trucks, buses, trucks, and tankers, and is currently a highly challenging benchmark in the field of SAR target recognition. The experiments focus on the SOC-40classes task. The input is raw floating-point complex data in .mat format in slant-range coordinates. The training set contains 67,780 images, and the test set contains 29,169 images.
[0064] The physical perception branch of the core model GAT-Former integrates a 6-layer cascaded convolutional deep unfolding network that extracts a preset number $N$ of scattering centers and constructs a topological graph through a dynamic $k$-nearest-neighbor algorithm; its feature embedding dimension $D$ is uniformly set to 256. The visual perception branch uses the pre-trained ConvNeXt-Tiny as its backbone and normalizes the input image to a fixed size. A multi-head attention mechanism (8 heads, dropout rate 0.1) is used to achieve cross-modal fusion of physical and visual features.
[0065] In the joint optimization phase, this invention employs a three-stage training paradigm. The warm-up phase uses the AdamW optimizer, with one learning rate set for the backbone and another for the remaining modules. During the strong-regularization fine-tuning phase, both learning rates are correspondingly lowered, and Mixup enhancement with mixing coefficient $\lambda$, a label smoothing factor of 0.1, and a weight decay of 0.05 are introduced to suppress overfitting. In the final stage, the heterogeneous ensemble system introduces ResNet-50 as a supplementary texture expert. The optimal fusion weights of GAT-Former and ResNet-50 were determined by grid search to be 0.45 and 0.55, respectively, achieving a fine-grained recognition accuracy of 96.64% on the test set.
[0066] 3. Simulation content. Figure 2 shows a radar chart analyzing the contributions of the physical flow, the visual flow, and the heterogeneous integration method across five core dimensions: accuracy, noise robustness, geometric sensitivity, texture detail, and few-sample performance. As shown in Figure 2, the ResNet-based visual flow excels in the "texture detail" dimension, demonstrating the advantage of deep convolutional neural networks in capturing low-level local pixel features and shadow variations in images. However, it exhibits significant weaknesses in handling speckle noise interference and in recognizing essential geometric topology. In contrast, the GAT-based physical flow shows clear advantages in "geometric sensitivity" and "noise robustness." This indicates that the physical perception branch, by extracting essential scattering center features such as the distribution of corner reflectors, can lock onto the rigid geometric structure of the target and is therefore more robust under complex electromagnetic environments and clutter interference. The final "heterogeneous integration" scheme leads on all evaluation dimensions, with a polygon area far exceeding that of either single-flow model. The experimental results reveal the strong complementarity between the physical flow and the visual flow: the physical flow provides a robust structural "skeleton," while the visual flow supplements rich texture "details." By integrating the two, the system not only improves conventional classification accuracy but also shows stronger generalization in small-sample scenarios, which demonstrates the necessity and superiority of the heterogeneous flow fusion strategy proposed in this invention for improving the overall performance of SAR target recognition systems.
[0067] Figure 3 is a bar chart comparing the accuracy of the proposed recognition method, based on GAT-Former and heterogeneous integration, with 10 mainstream benchmark models (including VGG16, ResNet34, DenseNet, ViT, ConvNeXt, and SARATR-X) on the ATRNet-STAR dataset. As shown in Figure 3, the classification accuracy of the final integrated system of this invention reaches 96.64%, higher than the purely visual state-of-the-art model ConvNeXt (96.00%) and the masked-autoencoder-based SARATR-X model (96.40%). The experimental results show that although the single GAT-Former model reaches only 93.84% when focusing on structural features, heterogeneous integration with the ResNet-50 texture expert can fully utilize the geometric topological "skeleton" provided by the physical flow and the detailed texture evidence provided by the visual flow, so that the two complement each other and break through the performance bottleneck of a single architecture in the large-scale 40-class fine-grained classification task.
[0068] Figure 4 is the confusion matrix of the method of this invention on 40 categories of civilian vehicle targets, demonstrating the system's discriminative performance in ultra-fine-grained classification tasks. The vertical axis of the confusion matrix represents the true category and the horizontal axis represents the predicted category; the main diagonal elements exhibit extremely high brightness. The results show that the method of this invention maintains high classification confidence even when facing fine-grained models with highly similar appearances and scattering distributions, such as SUVs of different brands in the same class. This is attributed to the physical perception dynamic graph model designed in this invention, which can accurately capture the essential geometric features of the target, such as the distribution of corner reflectors. In addition, with the physics-guided cross-attention fusion mechanism, the model can automatically retrieve the key visual textures corresponding to the physical structure while suppressing background clutter and speckle noise interference. This shows that the system has excellent robustness and discriminative ability when processing large-scale fine-grained targets in complex remote sensing scenarios.
[0069] Based on the above experimental results and analysis, the recognition method proposed in this invention, based on the integration of physical perception dynamic graphs and visual heterogeneity, achieves a significant improvement in recognition accuracy compared to existing mainstream deep learning schemes. This fully verifies the necessity of explicitly extracting physical scattering features through deep unfolding technology and combining it with dynamic graph attention and heterogeneous integration strategies. This scheme can efficiently and stably find the optimal joint representation scheme for fine-grained targets in the electromagnetic and visual domains.
[0070] Example 2. This invention addresses the problems of low utilization of physical features, difficulty in visual semantic alignment, and weak generalization ability of single-modal models in existing SAR fine-grained target recognition methods. It proposes a SAR fine-grained target recognition method based on the integration of a physical perception dynamic graph and visual heterogeneous features. The method constructs a GAT-Former two-stream heterogeneous network, explicitly extracts scattering centers using deep unfolding technology, combines a dynamic graph attention mechanism with visual convolutional features for cross-modal semantic alignment, and uses a heterogeneous integration strategy to resolve confusion in fine-grained classification.
[0071] This invention constructs a dual-stream heterogeneous data input and a physical-visual feature extraction model, with the following specific steps. Step (2a): Construct a dual-stream heterogeneous input space for fine-grained SAR target recognition. To balance the integrity of the target's electromagnetic scattering characteristics with the saliency of visual texture, this invention maps the original single-channel complex SAR data into two heterogeneous tensors: a physical stream and a visual stream. Let the original SAR complex echo data be $S \in \mathbb{C}^{H \times W}$, where $H$ and $W$ are the dimensions in the azimuth and range directions, respectively. First, to preserve phase information for the subsequent explicit calculation of scattering centers, a physical flow input tensor $X_{phy}$ is constructed by extracting the real part $\mathrm{Re}(S)$ and the imaginary part $\mathrm{Im}(S)$ of the complex data and concatenating them along the channel dimension, i.e. $X_{phy} = \mathrm{Concat}(\mathrm{Re}(S), \mathrm{Im}(S)) \in \mathbb{R}^{2 \times H \times W}$. Second, to adapt to the input domain of the pre-trained visual model and enhance texture details interpretable to the human eye, a visual flow input tensor $X_{vis}$ is constructed. The process first calculates the amplitude $A = |S|$ of the complex data and applies a logarithmic transformation to compress the dynamic range: $A_{\log} = \log(A + \epsilon)$, where $\epsilon$ is a small constant that prevents numerical singularity. A min-max normalization function then maps the amplitude values to the [0, 1] interval, and a pseudo-color visual image is generated through a three-channel copy operation. This creates a visual contextual input containing rich contour and shadow information.
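The construction of the two input tensors in step (2a) can be illustrated with a dependency-free sketch. The function name `build_dual_stream`, the natural logarithm, the value `eps=1e-6`, and the nested-list layout are assumptions made for illustration; resizing and any dataset-specific loading are deliberately left out.

```python
import math

def build_dual_stream(s, eps=1e-6):
    """s: H x W nested list of complex samples. Returns (physical, visual) tensors."""
    # Physical stream: real and imaginary parts stacked as 2 channels (phase preserved).
    real = [[z.real for z in row] for row in s]
    imag = [[z.imag for z in row] for row in s]
    x_phy = [real, imag]  # shape 2 x H x W
    # Visual stream: log-compressed amplitude, min-max normalized to [0, 1].
    log_amp = [[math.log(abs(z) + eps) for z in row] for row in s]
    lo = min(min(row) for row in log_amp)
    hi = max(max(row) for row in log_amp)
    span = (hi - lo) or 1.0  # guard against a constant image
    norm = [[(v - lo) / span for v in row] for row in log_amp]
    x_vis = [[row[:] for row in norm] for _ in range(3)]  # 3 identical channels
    return x_phy, x_vis

s = [[1 + 1j, 0j], [3 - 4j, 0.5j]]  # toy 2 x 2 complex "image"
x_phy, x_vis = build_dual_stream(s)
```

Keeping the real and imaginary channels separate, rather than amplitude and phase, avoids phase-wrapping discontinuities in the physical stream input.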
[0072] Step (2b): Establish a physical feature extraction model based on deep unfolding and dynamic graphs. To extract geometrically meaningful features from complex data affected by speckle noise, this invention designs a physical perception branch comprising a deep unfolding network and a dynamic graph attention network. First, the deep unfolding mapping function is defined, whose iterative steps simulate a sparse signal recovery algorithm. The physical flow tensor $X_{phy}$ is fed into an encoder consisting of cascaded convolutional layers, which explicitly regresses the parameter set of the target scattering centers $\mathcal{P} = \{(x_i, y_i, a_i)\}_{i=1}^{N}$, where $N$ is the preset number of scattering centers; $(x_i, y_i)$ denotes the normalized spatial coordinates of the $i$-th scattering center in the image domain, constrained by the Sigmoid activation function; and $a_i$ is the corresponding scattering intensity, constrained by the ReLU activation function. Next, a dynamic topological graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$ is constructed from the extracted scattering point set, where the node set $\mathcal{V}$ corresponds to the set $\mathcal{P}$. The $k$-NN algorithm dynamically generates the edge set $\mathcal{E}$: for any pair of nodes $i$ and $j$, the Euclidean distance $d_{ij}$ is calculated, and if $j$ is among the $k$ nearest neighbors of $i$, a directed edge is established. Finally, the graph attention mechanism aggregates topological features to compute a high-dimensional physical feature vector for each node. This process calculates the attention coefficient $\alpha_{ij}$ of each neighboring node $j$ with respect to the central node $i$ and uses it to weight and aggregate neighborhood information, where $\mathbf{W}$ is a learnable linear transformation matrix, $\mathbf{a}$ is the attention weight vector, and $\|$ denotes vector concatenation, thereby generating a physical feature matrix $F_{phy} \in \mathbb{R}^{N \times D}$ that contains geometric structural semantics.
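The dynamic k-NN edge construction over scattering centers can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `knn_edges` is hypothetical, the edge direction (from neighbor `j` into center `i`, written as the pair `(j, i)`) is an assumption because the text does not state it unambiguously, and the four toy coordinates stand in for regressed scattering centers.

```python
import math

def knn_edges(points, k):
    """Directed edges (j, i): j is one of the k nearest neighbors of node i."""
    edges = []
    for i, (xi, yi) in enumerate(points):
        # Euclidean distance from node i to every other node.
        dists = [(math.hypot(xi - xj, yi - yj), j)
                 for j, (xj, yj) in enumerate(points) if j != i]
        for _, j in sorted(dists)[:k]:
            edges.append((j, i))  # assumed direction: neighbor -> center
    return edges

# Toy normalized (x, y) coordinates of four scattering centers.
centers = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.2), (0.9, 0.9)]
edges = knn_edges(centers, k=2)
```

Because the graph is rebuilt from the regressed coordinates of each sample, the topology adapts to the target rather than being fixed in advance, which is what makes the graph "dynamic". Note that k-NN relations are not symmetric: the isolated fourth point receives edges from its two nearest neighbors but is nobody's neighbor itself.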
[0073] Step (2c): Establish a texture feature extraction model based on visual context. To capture the macroscopic contours and subtle texture differences of the target, this invention uses a convolutional neural network pre-trained on a large-scale optical dataset to construct the visual perception branch. The visual flow tensor $X_{vis}$ is fed into the ConvNeXt backbone network and passes through multiple stages of downsampling and feature transformation to extract a high-dimensional feature map $M \in \mathbb{R}^{C \times h \times w}$ with spatial resolution $h \times w$. To align with the physical feature dimension while preserving spatial location information, layer normalization is applied first, and a convolutional layer then projects the number of feature channels from $C$ to the target dimension $D$, yielding the transformed feature map $M'$. Subsequently, a spatial serialization operation flattens and transposes the two-dimensional feature map along the spatial dimensions to generate a visual context semantic sequence $F_{vis} \in \mathbb{R}^{L \times D}$, where the sequence length is $L = h \times w$. Each feature vector in the sequence encodes the visual texture information of a specific local region of the image, providing a dense visual evidence library for subsequent cross-modal interaction.
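The spatial serialization step (flatten, then transpose) is easy to get wrong in index order, so a tiny sketch may help. The function name `serialize` and the 2-channel, 2 x 2 toy feature map are assumptions for illustration; normalization and channel projection are omitted.

```python
def serialize(feature_map):
    """C x h x w nested lists -> (h*w) x C sequence, one feature vector per location."""
    c = len(feature_map)
    h = len(feature_map[0])
    w = len(feature_map[0][0])
    # Row-major scan over spatial positions; gather all channels at each position.
    return [[feature_map[ch][r][col] for ch in range(c)]
            for r in range(h) for col in range(w)]

fmap = [[[1, 2], [3, 4]],          # channel 0
        [[10, 20], [30, 40]]]      # channel 1
seq = serialize(fmap)              # 4 tokens, each of dimension 2
```

Each resulting token corresponds to one spatial cell of the feature map, so a cross-attention weight on a token can be traced back to a local image region.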
[0074] This invention achieves high-precision recognition through a physics-guided cross-attention mechanism and a heterogeneous integration strategy. The specific steps are as follows. Step (3a): Construct a physics-guided cross-modal semantic fusion and classification model. To address the misalignment between discrete physical topology and continuous visual texture in the semantic space, this invention designs a multi-head cross-attention fusion module. This module uses scattering center features, which have strong geometric discriminative power, to actively retrieve key texture evidence from the visual features. Specifically, the feature matrix $F_{phy}$ output by the physical branch is projected to the query matrix $Q = F_{phy} W_Q$, and the feature sequence $F_{vis}$ output by the visual branch is projected to the key matrix $K = F_{vis} W_K$ and the value matrix $V = F_{vis} W_V$, where $W_Q$, $W_K$, and $W_V$ are learnable linear projection parameters. The attention weight matrix of the physical nodes over the visual regions is calculated, and visual context information is aggregated accordingly to obtain the fused features. To enhance the nonlinearity and robustness of the feature representation, residual connections and a feedforward neural network are introduced to produce the final fused feature representation $Z$. Finally, a global average pooling layer compresses $Z$ into a global feature vector $z$, which is fed into a classification head consisting of layer normalization and fully connected layers to calculate the predicted probability vector $P$ of the target.
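To make the query/key/value roles concrete, the following sketch computes single-head scaled dot-product cross-attention in plain Python, with physical node features as queries and visual region features as keys and values. It is a simplification under stated assumptions: the learnable projections are taken to be identity, only one head is used, and the residual connection, feedforward network, and pooling are omitted; all names and toy values are illustrative.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(q, k, v):
    """q: N x d queries; k: L x d keys; v: L x d_v values. Returns (N x d_v, N x L)."""
    d = len(q[0])
    out, weights = [], []
    for qi in q:
        # Scaled dot-product score of this physical node against every visual region.
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(wj * vj[c] for wj, vj in zip(w, v)) for c in range(len(v[0]))])
    return out, weights

q = [[1.0, 0.0], [0.0, 1.0]]               # 2 physical nodes
k = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]   # 3 visual regions
v = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
fused, attn = cross_attention(q, k, v)
```

The output has one row per physical node, not per visual region, which is why the later pooling runs over the scattering point dimension.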
[0075] Step (3b): Establish a joint optimization objective based on hybrid regularization. To address the small inter-class differences and tendency to overfit in fine-grained target recognition tasks, this invention introduces Mixup enhancement and label smoothing during the model training phase. For any two samples $(x_i, y_i)$ and $(x_j, y_j)$ in the training batch, a mixing coefficient $\lambda$ is sampled from a $\mathrm{Beta}(\alpha, \alpha)$ distribution to construct the virtual training sample $\tilde{x} = \lambda x_i + (1 - \lambda) x_j$ with the corresponding soft label $\tilde{y} = \lambda y_i + (1 - \lambda) y_j$. At the same time, to smooth the decision boundary, a smoothing factor $\epsilon$ is applied to the true label $y$ to generate the smoothed label distribution $y^{LS} = (1 - \epsilon)\, y + \epsilon / K$, where $K$ is the total number of categories. The optimization objective of the system is to minimize the cross-entropy loss after mixing. This strong regularization strategy forces the network to focus on the subtle local discriminative features of the target rather than background noise, thereby improving the model's generalization ability on unseen samples.
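Label smoothing and the resulting soft-target cross-entropy can be checked numerically with a short sketch. The smoothing factor 0.1 is the value given in the experimental settings; the uniform-prior form of smoothing, the function names, and the four-class toy prediction are assumptions for illustration.

```python
import math

def smooth_labels(y_onehot, eps=0.1):
    """Move eps of the probability mass from the hard label to a uniform prior."""
    k = len(y_onehot)
    return [(1.0 - eps) * y + eps / k for y in y_onehot]

def soft_cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a soft target distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, probs))

y_soft = smooth_labels([0.0, 1.0, 0.0, 0.0])           # 4-class toy label
loss = soft_cross_entropy([0.1, 0.7, 0.1, 0.1], y_soft)
```

With smoothing, the loss stays above the hard-label value even for a correct prediction, so the network is discouraged from pushing one class probability toward 1.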
[0076] Step (3c): Construct a heterogeneous integrated decision model that accounts for azimuth sensitivity. To overcome the inductive-bias limitations of a single network architecture, this invention constructs a heterogeneous integrated system comprising the "structure expert" GAT-Former and the "texture expert" ResNet-50. Considering the sensitivity of SAR images to target azimuth angle, test-time augmentation is introduced during the inference phase: the input image $x$ is horizontally flipped to obtain $x_{flip}$, and for each model the average of its predicted probabilities on the two images is taken as that model's output, i.e. $P(x) = \tfrac{1}{2}\left(f(x) + f(x_{flip})\right)$. The final recognition decision is obtained by a weighted fusion of the predictions from the two heterogeneous models, with the decision function defined as $P_{final} = w \cdot P_{GAT} + (1 - w) \cdot P_{Res}$, where $w$ is a balancing coefficient determined through a grid search on the validation set to maximize fine-grained classification accuracy, thereby achieving complementary enhancement of physical geometry and visual texture detail.
[0077] This invention uses a three-stage collaborative optimization strategy and a grid search algorithm to solve for the optimal recognition model. The specific steps are as follows. Step (4a): Perform collaborative training based on differential learning rates and cosine annealing. Given that the visual branch uses pre-trained weights while the physical branch is randomly initialized, this invention employs a parameter grouping optimization strategy. The AdamW optimizer is initialized with the network parameters divided into a visual backbone parameter set $\Theta_{vis}$ and a physical branch and classification head parameter set $\Theta_{phy}$, each assigned its own initial learning rate, $\eta_{vis}$ and $\eta_{phy}$. During the training iterations, the learning rate is dynamically adjusted using a cosine annealing strategy. The learning rate $\eta_t$ of the next iteration is updated as $\eta_t = \eta_{min} + \tfrac{1}{2}(\eta_{max} - \eta_{min})\left(1 + \cos\left(\pi\, T_{cur} / T_{max}\right)\right)$, where $T_{cur}$ is the current cumulative iteration round and $T_{max}$ is the maximum iteration period. The gradient $g_t$ of the mixed loss function with respect to the parameters $\theta$ is calculated by backpropagation, and the parameters are updated as $\theta_{t+1} = \theta_t - \eta_t\left(\hat{m}_t / (\sqrt{\hat{v}_t} + \delta) + \lambda_{wd}\, \theta_t\right)$, where $\hat{m}_t$ and $\hat{v}_t$ are the bias-corrected estimates of the first and second moments of the gradient, respectively, $\delta$ is a small constant for numerical stability, and $\lambda_{wd}$ is the weight decay coefficient. This promotes stable convergence of the heterogeneous network in the non-convex optimization space.
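The interaction of differential learning rates with cosine annealing can be sketched as one schedule per parameter group. The two base learning rates below are hypothetical placeholders (the source does not preserve the actual values), as are the group names, the 10-step horizon, and the minimum learning rate of zero; the schedule is the commonly used cosine annealing form.

```python
import math

def cosine_lr(base_lr, t_cur, t_max, eta_min=0.0):
    """Cosine annealing from base_lr (at t_cur = 0) down to eta_min (at t_cur = t_max)."""
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * t_cur / t_max))

# Hypothetical values: a small rate for the pre-trained backbone,
# a larger one for the randomly initialized physical branch and head.
base_lrs = {"visual_backbone": 1e-5, "physical_and_head": 1e-4}
schedule = {name: [cosine_lr(lr, t, 10) for t in range(11)]
            for name, lr in base_lrs.items()}
```

Both groups follow the same cosine shape, so the ratio between their learning rates stays constant throughout training while the pre-trained weights are always updated more gently.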
[0078] Step (4b): Perform robust inference based on TTA. Once model training has converged and the optimal parameters have been saved, the inference phase begins. To address the inherent azimuth sensitivity of SAR images, the system applies geometric transformation augmentation to each test sample $x$. A horizontal flip operator $T_{flip}$ is constructed, the posterior probability distributions of the original sample and the flipped sample are calculated separately, and the predictions are averaged: $P_{TTA}(x) = \tfrac{1}{2}\left(P(x) + P(T_{flip}(x))\right)$. This step reduces the interference of imaging-perspective differences with scattering topological features, improving the model's recognition confidence under complex poses.
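Flip-based test-time augmentation amounts to two forward passes and an average, as the sketch below shows. The classifier `toy_model` is a hypothetical stand-in that is deliberately sensitive to left/right orientation; it is not the patented network, and the function names are illustrative.

```python
def hflip(image):
    """Horizontally flip an H x W image given as nested lists."""
    return [row[::-1] for row in image]

def tta_predict(model, image):
    """Average the model's probabilities on the image and its horizontal flip."""
    p = model(image)
    p_flip = model(hflip(image))
    return [0.5 * (a + b) for a, b in zip(p, p_flip)]

def toy_model(image):
    """Hypothetical stand-in classifier: left-half vs right-half energy share."""
    w = len(image[0])
    left = sum(sum(row[: w // 2]) for row in image)
    right = sum(sum(row[w // 2:]) for row in image)
    total = (left + right) or 1.0
    return [left / total, right / total]

img = [[1.0, 0.0], [1.0, 0.0]]
p_tta = tta_predict(toy_model, img)
```

The toy classifier flips its answer completely when the image is mirrored, and averaging cancels that orientation dependence; a real network would show a milder version of the same effect.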
[0079] Step (4c): Perform heterogeneous decision fusion based on grid search. To determine the optimal decision boundary of the heterogeneous ensemble model, this invention executes a weight search algorithm on the validation set $\mathcal{D}_{val}$. The ensemble weight space $\mathcal{W}$ is defined, and for each candidate weight $w \in \mathcal{W}$ the ensemble prediction accuracy $\mathrm{Acc}(w)$ is calculated. The weight that maximizes accuracy, $w^{*} = \arg\max_{w \in \mathcal{W}} \mathrm{Acc}(w)$, is selected as the final fusion coefficient, and the final fine-grained target category is output. This achieves optimal complementarity between physical mechanisms and visual textures at the decision-making level.
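A grid search over the fusion weight can be sketched as a simple loop over candidates in [0, 1]. The step size 0.05 follows the claims; the function names, the tie-breaking rule (keep the first best weight), and the two-sample toy validation set are assumptions for illustration.

```python
def accuracy(probs, labels):
    """Fraction of samples whose arg-max class matches the label."""
    hits = sum(max(range(len(p)), key=p.__getitem__) == y
               for p, y in zip(probs, labels))
    return hits / len(labels)

def grid_search_weight(p_gat, p_res, labels, step=0.05):
    """Try w = 0, step, ..., 1 and keep the first weight with the best accuracy."""
    best_w, best_acc = 0.0, -1.0
    n_steps = round(1.0 / step)
    for i in range(n_steps + 1):
        w = i * step
        fused = [[w * a + (1.0 - w) * b for a, b in zip(pg, pr)]
                 for pg, pr in zip(p_gat, p_res)]
        acc = accuracy(fused, labels)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc

# Toy validation set: each expert alone gets exactly one of the two samples right.
p_gat = [[0.9, 0.1], [0.6, 0.4]]
p_res = [[0.3, 0.7], [0.1, 0.9]]
labels = [0, 1]
best_w, best_acc = grid_search_weight(p_gat, p_res, labels)
```

In this toy case neither expert alone exceeds 50% accuracy, while an intermediate weight classifies both samples correctly, which is the complementarity the ensemble relies on.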
[0080] In this field, the above description is merely a preferred example of the present invention and is not intended to limit the invention. Any modifications, substitutions, improvements, and refinements made by those skilled in the art within the spirit and concept of the present invention should be covered within the protection scope of the present invention.
Claims
1. A synthetic aperture radar target recognition method based on the fusion of physical perception and vision, characterized in that the method includes the following steps:
Step 1: Construct a dual-stream heterogeneous dataset for SAR fine-grained target recognition; the raw complex echo data of the SAR system is acquired. For each sample in the dataset, a physical flow complex tensor that preserves electromagnetic phase information and a visual flow amplitude tensor that enhances visual texture information are constructed to form a dual-flow input set containing the complex domain and the visual domain. The physical flow complex tensor is subsequently used to explicitly solve for the physical scattering centers of the target, and the visual flow amplitude tensor is used to extract the optical texture features of the target.
Step 2: Construct a physical perception branch to generate dynamic topological graph features of the scattering centers; a physical branch mapping model is established, and the physical flow complex tensor is processed using a deep unfolding network. By simulating the iterative process of a sparse signal recovery algorithm, the parameterized scattering center set of the target is explicitly regressed. A dynamic scattering topology graph is constructed based on the spatial coordinates of the set of scattering centers, and the topological association information between nodes is aggregated using GAT to generate physical feature vectors containing geometric structure semantics.
Step 3: Construct a visual perception branch and extract a visual context semantic sequence; a visual branch mapping model is established, using a convolutional neural network pre-trained on a large-scale dataset as the backbone network to perform multi-level feature extraction on the visual flow amplitude tensor, capturing the macroscopic contour and texture details of the target; the extracted high-dimensional feature map is mapped to a dimension aligned with the physical features using layer normalization and projection layers, and a spatial serialization transformation is performed to generate a visual context semantic sequence.
Step 4: Physics-guided cross-attention fusion and classification; a physical-visual cross-attention fusion module is constructed, using the physical feature vectors as the query and the visual context semantic sequence as the key and value. The semantic correlation between the physical scattering centers and the visual texture regions is calculated through a multi-head attention mechanism, and key visual evidence is actively retrieved using the physical topology to generate physical-visual joint features. Subsequently, the feature representation is further enhanced by a feedforward neural network and residual connections, and the predicted probability of the target is calculated by a classification head.
Step 5: Joint optimization solution based on hybrid regularization and heterogeneous integration; during the model training phase, a Mixup and label smoothing strategy is introduced to construct a strongly regularized loss function, and the parameters of the physical and visual branches are optimized collaboratively through a differential learning rate strategy. During the model inference phase, a heterogeneous integrated decision system containing a structure expert model and a texture expert model is constructed; combined with TTA technology, the predicted probabilities of the two heterogeneous models are weighted and fused to output the final fine-grained target recognition result.
2. The SAR fine-grained target recognition method based on the integration of physical perception dynamic maps and visual heterogeneity as described in claim 1, characterized in that, in step 1, the original SAR complex data is denoted $S$, comprising a real part $\mathrm{Re}(S)$ and an imaginary part $\mathrm{Im}(S)$; Step 1-1: Construct the physical flow complex tensor. The real part $\mathrm{Re}(S)$ and the imaginary part $\mathrm{Im}(S)$ of the raw SAR data are read and concatenated along the channel dimension to construct the initial tensor, and the feature space scale is normalized to a uniform size using bilinear interpolation to generate the physical flow complex tensor $X_{phy}$, whose first channel is the real component and whose second channel is the imaginary component, so as to preserve complete electromagnetic phase information; Step 1-2: Construct the visual flow amplitude tensor. First, the pixel-level modulus of the complex data is calculated to obtain the original amplitude map $A = |S|$. To highlight weak scattering characteristics and compress the dynamic range, a logarithmic transform is introduced for enhancement, calculated as $A_{\log} = \log(A + \epsilon)$, where $\epsilon$ is a bias factor that prevents numerical singularity. A max-min normalization function then linearly maps $A_{\log}$ to the [0, 255] grayscale range, the result is expanded into a three-channel RGB image through channel duplication, and the image is resized using a bicubic interpolation algorithm to generate the visual flow tensor $X_{vis}$; Step 1-3: During the training phase, perform collaborative data augmentation on the dual-stream input: inject additive white Gaussian noise into the physical stream to simulate thermal noise, inject salt-and-pepper noise into the visual stream to simulate speckle, and simultaneously perform random rotation and horizontal flipping operations to enhance the robustness of the model.
3. The SAR fine-grained target recognition method based on the integration of physical perception dynamic maps and visual heterogeneity as described in claim 1, characterized in that, in step 2, the physical branch mapping function is defined with a preset number of scattering centers $N$ and a feature embedding dimension $D$; Step 2-1: Extract the set of scattering centers based on a deep unfolding network. A deep unfolding network consisting of 6 cascaded convolutional modules is constructed, with each layer containing a convolutional layer, an instance normalization layer, and a GELU activation function; the numbers of network channels are 32, 64, 128, and N, respectively; a spatial coordinate regression head predicts the normalized spatial coordinates $(x_i, y_i)$ of the $i$-th scattering center, and a scattering intensity regression head predicts the corresponding scattering intensity $a_i$; finally, the parameterized set $\mathcal{P} = \{(x_i, y_i, a_i)\}_{i=1}^{N}$ is obtained; Step 2-2: Construct a dynamic scattering topology graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, with the points of the scattering center set $\mathcal{P}$ as nodes. The edge set $\mathcal{E}$ is constructed dynamically using k-NN: for any two nodes $i$ and $j$, their Euclidean distance $d_{ij}$ is calculated; if node $j$ is one of the $k$ nearest neighbors of node $i$, a directed edge connecting the two nodes is established; Step 2-3: Aggregate topological features using a graph attention network. For any central node $i$ in the graph, the node features are first mapped to a high-dimensional space using a linear transformation matrix, and the attention coefficient $\alpha_{ij}$ between the node and each neighboring node $j$ is then calculated, where $\mathbf{W}$ is the shared weight matrix, $\mathbf{a}$ is the attention parameter vector, and $\|$ denotes feature concatenation; using the calculated coefficients $\alpha_{ij}$, the neighborhood features are weighted and aggregated and then processed through nonlinear activation and residual connections to generate the physical feature matrix $F_{phy}$.
4. The SAR fine-grained target recognition method based on the integration of physical perception dynamic maps and visual heterogeneity as described in claim 1, characterized in that, in step 3, the ConvNeXt-Tiny network is used as the backbone of the visual branch; Step 3-1: The visual flow tensor $X_{vis}$ is input to the ConvNeXt network, whose Stage modules with large convolutional kernels perform feature extraction, generating a deep feature map $M$ with a resolution of $7 \times 7$ and 768 channels; Step 3-2: Construct a feature projection and serialization module. First, the feature map $M$ is standardized using two-dimensional layer normalization, and a convolutional layer then compresses the number of channels from 768 to the target dimension $D$, yielding the feature map $M'$; finally, the feature map $M'$ is flattened and transposed along the height and width dimensions to generate the visual context semantic sequence $F_{vis} \in \mathbb{R}^{49 \times D}$, where 49 is the length of the visual feature sequence.
5. The SAR fine-grained target recognition method based on the integration of physical perception dynamic maps and visual heterogeneity according to claim 1, characterized in that, in step 4, a physics-guided multi-head cross-attention module comprising multiple attention heads is constructed; Step 4-1: Linear projection. The physical feature matrix $F_{phy}$ is defined as the query input, and the visual feature sequence $F_{vis}$ is defined as the key and value input; they are projected to the query matrix $Q$, the key matrix $K$, and the value matrix $V$ through three independent learnable fully connected layers: $Q = F_{phy} W_Q$, $K = F_{vis} W_K$, $V = F_{vis} W_V$; Step 4-2: Compute scaled dot-product attention. The relevance score between each physical node and each visual region is calculated and divided by the scaling factor $\sqrt{d_k}$ to prevent gradient vanishing; the attention weight matrix is then normalized using the Softmax function, and finally the visual feature values are weighted and aggregated: $\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(Q K^{\top} / \sqrt{d_k}\right) V$, where $d_k$ is the dimension of each attention head; Step 4-3: Feature fusion and feedforward network. The attention output is combined with the original physical features $F_{phy}$ through a residual connection and layer normalization to obtain the intermediate features $Z'$; $Z'$ is then input to an FFN containing two linear layers, a GELU activation function, and a Dropout layer, in which the first linear layer $W_1$ expands the dimension to $4D$ and the second linear layer $W_2$ restores the dimension to $D$; Step 4-4: Global classification decision. Global average pooling is performed on the fused feature matrix $Z$ along the scattering point dimension to generate a global feature vector $z$; $z$ is input to a classification head consisting of layer normalization, a linear layer, a GELU activation function, a Dropout layer, and a final linear classification layer, and the Softmax function is applied to output the probability vector $P$ over the fine-grained categories.
6. The SAR fine-grained target recognition method based on the integration of physical perception dynamic maps and visual heterogeneity according to claim 1, characterized in that the specific implementation process of step 5 includes: Step 5-1: Introduce a strong regularization training strategy that combines Mixup enhancement and label smoothing. In each training iteration, for any two samples $(x_i, y_i)$ and $(x_j, y_j)$ within a batch, a mixing coefficient $\lambda$ is randomly sampled from a $\mathrm{Beta}(\alpha, \alpha)$ distribution, and the virtual input sample is constructed as $\tilde{x} = \lambda x_i + (1 - \lambda) x_j$. At the same time, a label smoothing factor $\epsilon$ is introduced to convert the real one-hot label $y$ into the soft label distribution $y^{LS} = (1 - \epsilon)\, y + \epsilon / K$, where $K$ is the number of categories. The cross-entropy loss after mixing, $\mathcal{L} = \lambda\, \mathrm{CE}(P, y_i^{LS}) + (1 - \lambda)\, \mathrm{CE}(P, y_j^{LS})$, is calculated as the optimization objective. The network parameters are updated using an optimizer, with the weight decay coefficient set to 0.05 to prevent overfitting. Step 5-2: Construct a heterogeneous integrated decision model. To compensate for the potential loss of texture details in the physical perception model, a ResNet-50 network is introduced as an independent texture expert model; this texture expert model uses only the visual flow amplitude tensor $X_{vis}$ as input, is fine-tuned using the same training strategy as in step 5-1, and outputs a texture expert prediction probability vector $P_{Res}$. The heterogeneous integrated system consists of the GAT-Former model as the "structure expert" and the ResNet-50 model as the "texture expert". Step 5-3: Perform TTA. During the inference phase, for the test image $x$ to be recognized, its horizontally flipped image $x_{flip}$ is generated; $x$ and $x_{flip}$ are input to the GAT-Former model and the ResNet-50 model respectively. For each model, the arithmetic mean of the predicted probability of the original image and the predicted probability of the flipped image is calculated and used as the output probability of the model after TTA enhancement; Step 5-4: Calculate the final weighted ensemble decision result. The final target recognition decision function is defined as a weighted linear combination of the output probability $P_{GAT}$ of the GAT-Former model and the output probability $P_{Res}$ of the ResNet-50 model: $P_{final} = w \cdot P_{GAT} + (1 - w) \cdot P_{Res}$, where $w$ is the heterogeneous integration weight coefficient; on the validation set, a grid search over $w$ is performed in the interval [0, 1] with a step size of 0.05, and the value that yields the highest classification accuracy is selected as the optimal weight; the final recognition result is the category index corresponding to the largest element of the probability vector $P_{final}$.