A remote sensing target detection method and system based on brain-eye cognitive behavior distillation

By employing a brain-eye cognitive behavior distillation method, utilizing cross-modal attention mechanisms and mutual information constraints, and combining knowledge distillation strategies and private orthogonal projection modules, the problem of cross-modal semantic gap and information transmission in remote sensing target detection is solved, achieving efficient remote sensing target detection.

CN121746957B · Active · Publication Date: 2026-06-19 · HANGZHOU DIANZI UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
HANGZHOU DIANZI UNIV
Filing Date
2026-03-02
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing remote sensing target detection technologies lack generalization ability in complex scenarios, suffer from cross-modal semantic gaps and semantic drift in information transmission, and lack collaborative guidance from eye-tracking data, resulting in models being unable to obtain complete cognitive-behavioral joint guidance.

Method used

We employ a brain-eye cognitive behavior distillation method. By constructing a multimodal dataset, we use cross-modal attention mechanisms and mutual information constraints to learn the semantic consistency between image features and physiological features. Combined with a knowledge distillation strategy and a private orthogonal projection module, we achieve complementary fusion of image features and physiological features.

Benefits of technology

It significantly enhances the model's feature extraction capabilities and robustness in complex remote sensing scenarios, improves detection accuracy and generalization ability, reduces information loss during cross-modal mapping, and achieves human-like visual understanding.

✦ Generated by Eureka AI based on patent content.

Figure CN121746957B_ABST

Abstract

This invention discloses a remote sensing target detection method and system based on brain-eye cognitive behavioral distillation. The method includes multimodal data acquisition and preprocessing; teacher model construction and training; student model construction and training; model deployment and target detection inference. This invention employs an innovative brain-eye co-distillation framework, combining high temporal resolution EEG signals with high spatial resolution eye-tracking signals to jointly guide the visual model. This bimodal guidance strategy achieves complementary integration of human cognitive intelligence and behavioral patterns, overcoming the shortcomings of incomplete information in single-modal guidance, and significantly enhancing the model's feature extraction capability and robustness in complex remote sensing scenarios.

Description

Technical Field

[0001] This invention relates to the field of image data processing technology, and more specifically, to a remote sensing target detection method and system based on brain-eye cognitive behavior distillation.

Background Technology

[0002] With the widespread use of high-resolution remote sensing imagery, remote sensing target detection has significant application value in fields such as environmental monitoring, urban planning, and military reconnaissance. However, remote sensing images typically have complex backgrounds, targets of varying scales, severe occlusion, and scarce annotation data, limiting the performance of traditional visual models in such tasks.

[0003] To address these shortcomings, existing technical solutions have mainly developed in two directions. One is pure vision methods based on deep learning, such as convolutional neural networks and their variants, which have made some progress but generalize poorly with few samples and in complex scenarios. The other is brain-inspired visual computing methods, such as brain-machine coupled learning (BMCL), which attempt to introduce human cognitive priors by integrating neurophysiological data such as EEG signals to assist model training.

[0004] Brain-inspired visual computation methods have, to some extent, compensated for the cognitive deficiencies of pure visual models; however, they still have the following significant limitations:

[0005] 1. Cross-modal semantic gap: There are huge modal differences between neurophysiological signals and image visual features, making it extremely difficult to construct an accurate and consistent semantic mapping relationship between the two, and semantic drift is prone to occur in cross-modal information transmission.

[0006] 2. Insufficient spatial resolution of guidance information: Existing methods mainly rely on electroencephalogram (EEG) signals. Although EEG has high temporal resolution, its spatial resolution is low. It reflects the aggregate response of the cerebral cortex, so it can hardly provide the specific spatial gaze positions and fine-grained visual reasoning paths required for target detection.

[0007] 3. Lack of complementary behavioral modalities: Existing technologies generally neglect eye-tracking (ET) data. ET can directly and with high spatial resolution reflect human visual attention behavior, which precisely makes up for the shortcomings of EEG. Without the synergy of ET, the model cannot obtain complete "cognitive-behavioral" joint guidance.

[0008] Therefore, existing technologies lack a solution that can simultaneously integrate EEG cognitive information and ET behavioral information and effectively distill them into a visual model, which limits the model's ability to achieve human-like visual understanding and robustness.

Summary of the Invention

[0009] The purpose of this invention is to overcome the shortcomings of the prior art and provide a remote sensing target detection method and system based on brain-eye cognitive behavior distillation.

[0010] To achieve the above objectives, the present invention adopts the following technical solution:

[0011] A remote sensing target detection method based on brain-eye cognitive behavior distillation includes the following steps:

[0012] Step S1: Construct a multimodal dataset containing remote sensing images, EEG signals, and eye movement signals;

[0013] Step S2: Construct a teacher model that can simultaneously process remote sensing images, EEG signals, and eye-tracking signals. Through cross-modal attention mechanisms and mutual information constraints, learn a joint representation with semantic consistency between image features and physiological features.

[0014] Step S3: Construct a student model that takes remote sensing images as input only. Learn the cross-modal perception patterns obtained by the teacher model through a knowledge distillation strategy. Use the intramodal attention module and the private orthogonal projection module to simulate the teacher's cognitive behavior and preserve the image-specific details.

[0015] Step S4: In the inference phase, the student model is used for remote sensing target detection.

[0016] Furthermore, step S2 includes the following steps:

[0017] Step S21: Receive remote sensing images and extract EEG-guided and eye-tracking-guided visual embedding features; extract EEG embedding features from the EEG signals; extract eye-tracking embedding features from the eye-tracking signals;

[0018] Step S22: A bidirectional cross-modal attention mechanism is used to enhance the interaction between image features and corresponding physiological features to obtain enhanced features that can reflect cross-modal semantic associations.

[0019] Step S23: Utilize the InfoNCE loss function to maximize the mutual information between image enhancement features and corresponding physiological enhancement features, so that the two remain consistent in the semantic space;

[0020] Step S24: Input the enhanced features of each modality into the classifier for target classification, calculate the classification loss, and sum the weighted loss with the mutual information constraint loss to obtain the total loss of the teacher model.

[0021] Furthermore, in step S22, the calculation of cross-modal attention weights and the feature enhancement process are implemented through the following formula:

[0022] $$\tilde{v}_m = v_m \odot \mathrm{CMA}(v_m, p_m)$$

[0023] $$\tilde{p}_m = p_m \odot \mathrm{CMA}(p_m, v_m)$$

[0024] In the formula, $v_m$ denotes the visual embedding features guided by physiological modality $m$; $p_m$ denotes the embedding features of the corresponding physiological modality $m$; $\odot$ denotes element-wise multiplication; $\mathrm{CMA}(\cdot,\cdot)$ is the bidirectional cross-modal attention function; and $\tilde{v}_m$ and $\tilde{p}_m$ are the enhanced visual embedding features and physiological embedding features, respectively.

[0025] Furthermore, in step S23, the InfoNCE loss function used for mutual information constraints is:

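[0026] $$\mathcal{L}_{MI} = \mathcal{L}_{NCE}(\tilde{v}_{\mathrm{ET}}, \tilde{p}_{\mathrm{ET}}) + \mathcal{L}_{NCE}(\tilde{v}_{\mathrm{EEG}}, \tilde{p}_{\mathrm{EEG}})$$

[0027] $$\mathcal{L}_{NCE}(a, b) = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(\mathrm{sim}(a_i, b_i)/\tau\right)}{\sum_{j=1}^{N}\exp\left(\mathrm{sim}(a_i, b_j)/\tau\right)}$$

[0028] In the formula, $\tilde{v}_{\mathrm{ET}}$ and $\tilde{v}_{\mathrm{EEG}}$ are the enhanced eye-tracking-guided and EEG-guided visual embedding features, respectively; $\tilde{p}_{\mathrm{ET}}$ and $\tilde{p}_{\mathrm{EEG}}$ are the enhanced eye-tracking and EEG embedding features, respectively; $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function; $\tau$ is the temperature coefficient; $N$ is the training batch size; $a_i$ and $b_i$ are the $a$ features and $b$ features of the $i$-th sample in the batch, which together constitute a positive sample pair; and $b_j$ is the $b$ feature of the $j$-th sample in the batch.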

[0029] Furthermore, step S3 includes the following steps:

[0030] Step S31: Receive remote sensing images and extract EEG-guided and eye-tracking-guided visual embedding features;

[0031] Step S32: Generate attention weights based solely on unimodal image features and self-modulate the image features to generate student-oriented features, thereby simulating the attention distribution generated by cross-modal attention in the teacher model;

[0032] Step S33: Extract image-specific semantic features orthogonal to student guidance features from the same image features, and retain image detail information not covered by physiological signal alignment;

[0033] Step S34: A joint distillation strategy is adopted to align the student-oriented features of the student model with the corresponding enhancement features of the teacher model at both the feature level and the attention weight level. The distillation loss is composed of a weighted sum of feature-level loss and attention-level loss.

[0034] Step S35: After fusing the student guidance features with the image-specific semantic features, classify them, calculate the classification loss, and combine it with the distillation loss and the regularization loss used to constrain the orthogonality of the features to form the total loss of the student model for optimization.

[0035] Furthermore, in step S33, orthogonal constraints are applied to ensure the independence of student guidance features from image-specific semantic features. The orthogonal loss is calculated as follows:

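[0036] $$\mathcal{L}_{orth} = \frac{1}{N}\sum_{i=1}^{N}\left|\mathrm{sim}(\bar{r}_i, \bar{s}_i)\right|$$

[0037] In the formula, $\bar{r}_i$ and $\bar{s}_i$ are the L2-normalized image-specific semantic features and student-oriented features of the $i$-th sample in the batch; $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function; and $N$ is the training batch size.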

[0038] Furthermore, in step S34, for each physiological modality $m$, the knowledge distillation loss is expressed as:

[0039] $$\mathcal{L}_{KD}^{m} = c\left(w_{feat}\,\mathcal{L}_{feat}^{m} + w_{att}\,\mathcal{L}_{att}^{m}\right)$$

[0040] In the formula, $\mathcal{L}_{feat}^{m}$ is the feature-level mean squared error loss between the student-oriented features $s_m$ and the teacher image enhancement features $\tilde{v}_m$; $\mathcal{L}_{att}^{m}$ is the attention-level mean squared error loss between the student intramodal attention weights $\alpha_m^{S}$ and the teacher cross-modal attention weights $\alpha_m^{T}$; $w_{feat}$ and $w_{att}$ are adaptive weights dynamically calculated based on the current loss; and $c$ is the teacher confidence.

[0041] Furthermore, in step S35, the fusion process of student-oriented features and image-specific semantic features is as follows: the student-oriented features and image-specific semantic features are concatenated along the channel dimension to form a combined feature; a self-attention mechanism is used to process the combined feature to capture the dependencies between different feature components and generate a fusion feature for final classification.

[0042] Furthermore, in step S4, the model deployed and used for inference is the trained student model. After receiving the input remote sensing image, it sequentially goes through image encoding, intramodal attention modulation, private orthogonal projection, feature fusion and classification output steps to finally generate the target detection result.

[0043] The present invention also provides a remote sensing target detection system based on brain-eye cognitive behavioral distillation, the system comprising:

[0044] The multimodal data acquisition and preprocessing module is used to simultaneously acquire the electroencephalogram (EEG) and eye movement signals generated by the subjects when they view remote sensing images, and together with the remote sensing images, form a spatiotemporally aligned multimodal dataset.

[0045] The multimodal teacher model module is connected to the multimodal data acquisition and preprocessing module to receive multimodal datasets. This module integrates an image encoding submodule, a physiological signal encoding submodule, a cross-modal attention alignment submodule, and a mutual information constraint submodule. It learns a joint representation with semantic consistency between remote sensing image features and EEG and eye-tracking features through a joint optimization algorithm, and outputs enhanced visual features aligned with each physiological signal.

[0046] The unimodal student model module is connected to the multimodal teacher model module during the training phase and receives the remote sensing image to be detected during the inference phase. This module integrates an image encoding submodule, an intramodal attention submodule, a private orthogonal projection submodule, and a feature fusion submodule. Through a knowledge distillation strategy, it learns the cross-modal perception mode of the teacher model during the training phase and simulates human-like cognitive behavior and outputs the target detection result based solely on the image input during the inference phase.

[0047] The model deployment and inference module is used to deploy the trained student model and process new input remote sensing images to output the category and location information of targets in the image.

[0048] The beneficial effects of this invention are:

[0049] 1. This invention employs an innovative brain-eye co-distillation framework that combines high temporal resolution EEG signals with high spatial resolution eye-tracking signals to jointly guide a visual model. This bimodal guidance strategy achieves complementary integration of human cognitive intelligence and behavioral patterns, overcoming the shortcomings of incomplete information in single-modal guidance, and significantly enhancing the model's feature extraction capabilities and robustness in complex remote sensing scenarios.

[0050] 2. This invention combines cross-modal attention mechanisms with mutual information constraints, achieving adaptive matching of semantic features between heterogeneous modalities through a bidirectional attention alignment strategy. This method effectively reduces the limitations imposed by data heterogeneity between physiological signals and image data, minimizes information loss during cross-modal mapping, and ensures that the model can accurately capture key features with semantic consistency.

[0051] 3. This invention employs a decoupled distillation strategy based on a teacher-student network and introduces a private orthogonal projection module. During distillation, this strategy decouples features into "cognitive consistency features" and "image-specific features," maintaining their independence through orthogonal constraints. This design ensures that the model can learn robust, human-like perceptual patterns while preserving the image's unique details, thereby effectively improving the model's generalization ability and detection accuracy.

[0052] 4. This invention designs an intramodal attention mechanism, enabling student models to simulate human-like attention patterns under conditions of only image input. This design achieves "multimodal guidance during training and single-modal application during inference," completely eliminating the reliance on expensive EEG and eye-tracking acquisition equipment during the actual deployment phase of the model, and greatly improving the practicality and deployment efficiency of the system.

Attached Figure Description

[0053] Figure 1 is a flowchart of the remote sensing target detection method based on brain-eye cognitive behavior distillation in this embodiment;

[0054] Figure 2 is a structural framework diagram of the teacher model in this embodiment;

[0055] Figure 3 is a structural framework diagram of the student model in this embodiment;

[0056] Figure 4 is a structural framework diagram of the remote sensing target detection system based on brain-eye cognitive behavior distillation in this embodiment;

[0057] Figure 5 is a diagram showing the results of the cross-modal attention ablation experiment in this embodiment.

[0058] Figure labels: Multimodal data acquisition and preprocessing module 1, Multimodal teacher model module 2, Unimodal student model module 3, Model deployment and inference module 4.

Detailed Implementation

[0059] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0060] Example: A remote sensing target detection method based on brain-eye cognitive behavior distillation. The method first constructs a multimodal dataset containing remote sensing images, EEG signals, and eye-tracking (ET) signals. It then constructs a cognitive behavior distillation framework consisting of a multimodal teacher model and a unimodal student model. In the teacher model, image features are semantically aligned with EEG and ET features through a bidirectional cross-modal attention module (CMA) and mutual information constraints, to capture prior patterns of human cognition and behavior. In the student model, an intramodal attention module (IMA) simulates the perceptual attention of the teacher model when only images are input, and a private orthogonal projection module (POP) decouples image features into cognitively aligned guidance features and image-specific semantic features that preserve image details. Next, a joint distillation strategy transfers the teacher model's understanding of the multimodal data to the student model, with orthogonal constraints maintaining the independence of the two types of features. Finally, the decoupled features of the student model are fused to output the detection result. The core of the method is a cognitive behavior distillation framework that is "multimodal-guided during training and single-modal during inference": EEG and eye-tracking data serve as cognitive-behavioral guidance and, combined with the decoupled distillation mechanism, address the problems of complex targets and scarce samples in remote sensing images, achieving highly robust target detection.

[0061] Specifically, as shown in Figure 1, the method includes the following steps:

[0062] Step S1, Multimodal Data Acquisition and Preprocessing: Subjects are invited to view remote sensing images containing various remote sensing targets (such as vehicles, ships, and buildings). At the same time, neuroelectric signals with high temporal resolution are acquired using an EEG cap (e.g., 64 channels), and fixation data with high spatial resolution are acquired using an eye tracker, including channel information such as fixation coordinates, pupil diameter, and saccade events.

[0063] During signal acquisition, each signal is preprocessed simultaneously. For example, for EEG signals, bandpass filtering (e.g., 0.1-40Hz) is performed to remove power frequency interference and baseline drift, and independent component analysis or wavelet transform is used to remove artifacts such as electrooculography and electromyography. For eye movement signals, gaze point extraction and filtering are performed, and the original coordinate sequence is converted into a two-dimensional gaze heatmap or trajectory sequence corresponding to the image space.
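As a non-limiting illustration of the band-pass step, the sketch below removes a constant baseline offset and 50 Hz mains interference from a synthetic signal. It uses a simple FFT brick-wall filter rather than any particular filter design named in this embodiment; the function name `fft_bandpass`, the 250 Hz sampling rate, and the test signal are assumptions made only for the example.

```python
import numpy as np

def fft_bandpass(eeg, fs, low_hz=0.1, high_hz=40.0):
    """Zero out spectral components outside [low_hz, high_hz].

    eeg: array of shape (channels, samples); fs: sampling rate in Hz.
    This is a brick-wall filter chosen for brevity (an assumption),
    not a specific filter design taken from the source text.
    """
    eeg = np.asarray(eeg, dtype=float)
    spectrum = np.fft.rfft(eeg, axis=-1)
    freqs = np.fft.rfftfreq(eeg.shape[-1], d=1.0 / fs)
    keep = (freqs >= low_hz) & (freqs <= high_hz)
    spectrum[..., ~keep] = 0.0
    return np.fft.irfft(spectrum, n=eeg.shape[-1], axis=-1)

fs = 250.0                      # assumed sampling rate in Hz
t = np.arange(1000) / fs        # 4 seconds of samples
clean = np.sin(2 * np.pi * 10.0 * t)        # 10 Hz component to keep
mains = 0.5 * np.sin(2 * np.pi * 50.0 * t)  # power-frequency interference
drift = 2.0                                 # constant baseline offset
raw = (clean + mains + drift)[np.newaxis, :]  # one channel
filtered = fft_bandpass(raw, fs)
```

In practice a zero-phase IIR or FIR band-pass filter would usually be preferred, since a brick-wall filter can introduce ringing on real, non-periodic recordings.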

[0064] All signals are segmented to align with the time periods during which the images are presented and normalized to the same scale; then, a multimodal dataset containing remote sensing images, EEG signals, and eye-tracking signals is constructed.
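The conversion of fixation coordinates into a two-dimensional gaze heatmap in image space can be sketched as follows. This is a minimal example under stated assumptions: the duration weighting, the Gaussian spread `sigma`, the peak normalization, and the function name `gaze_heatmap` are illustrative choices and are not specified in this embodiment.

```python
import numpy as np

def gaze_heatmap(fixations, height, width, sigma=2.0):
    """Accumulate a duration-weighted Gaussian at each fixation point.

    fixations: iterable of (x, y, duration) in pixel coordinates.
    Returns a (height, width) map scaled so that its peak equals 1.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    heat = np.zeros((height, width), dtype=float)
    for x, y, duration in fixations:
        dist_sq = (xs - x) ** 2 + (ys - y) ** 2
        heat += duration * np.exp(-dist_sq / (2.0 * sigma ** 2))
    peak = heat.max()
    return heat / peak if peak > 0 else heat

# Two hypothetical fixations: a long one at (x=5, y=3), a short one at (12, 10).
fixations = [(5, 3, 0.30), (12, 10, 0.10)]
heatmap = gaze_heatmap(fixations, height=16, width=16)
peak_row_col = np.unravel_index(heatmap.argmax(), heatmap.shape)
```

The resulting map has its maximum at the longer fixation, which is the property a downstream eye-tracking encoder would rely on.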

[0065] Step S2, Teacher Model Construction and Training: Construct a teacher model that can simultaneously process remote sensing images, EEG signals, and eye-tracking signals. Through cross-modal attention mechanisms and mutual information constraints, learn a joint representation with semantic consistency between image features and physiological features.

[0066] The teacher model is a multimodal network designed to achieve efficient multimodal remote sensing target recognition by jointly training physiological signals with image data. It also establishes consistent semantic spaces between "image-EEG" and "image-eye movement," allowing the model to learn and capture cross-modal semantically consistent representations, as shown in Figure 2.

[0067] Furthermore, step S2 includes the following steps:

[0068] Step S21: Receive remote sensing images and extract EEG-guided and eye-tracking-guided visual embedding features; extract EEG embedding features from the EEG signals; extract eye-tracking embedding features from the eye-tracking signals. Specifically:

[0069] Image encoding: Using a VGG16 network as the backbone, a high-level feature map is extracted from the input image. To align image features with the different physiological modalities, two parallel modality-guided channels are designed: an EEG-guided channel and an ET-guided channel. Each channel contains two convolutional layers (with batch normalization (BN) and ReLU activation). The EEG-guided channel flattens the feature map and then applies a global projection to obtain the EEG-guided visual embedding features $v_{\mathrm{EEG}}$. The ET-guided channel uses local max pooling to aggregate spatial information and obtain the eye-tracking-guided visual embedding features $v_{\mathrm{ET}}$.

[0070] EEG encoding: The lightweight EEGNet network is used, which combines depthwise and separable convolutions to effectively capture the spatiotemporal dynamics of EEG signals, and outputs the EEG embedding features $p_{\mathrm{EEG}}$.

[0071] Eye-tracking encoding: A convolution-based spatiotemporal feature extractor is designed, consisting of two spatiotemporal blocks (each containing a Conv-BN-ReLU unit). The convolution kernels of the first block extract local patterns along the time dimension; those of the second block aggregate spatial information along the channel dimension. The blocks are followed by average pooling, Dropout, and a final convolution, which outputs the eye-tracking embedding features $p_{\mathrm{ET}}$.

[0072] It should be noted that the dimension of all embedding features is unified as $d$ (for example, $d = 256$).

[0073] Step S22: A bidirectional cross-modal attention mechanism is used to enhance the interaction between image features and corresponding physiological features to obtain enhanced features that can reflect cross-modal semantic associations.

[0074] By designing a bidirectional cross-modal attention mechanism to align model output with physiological signal responses, semantically related patterns between heterogeneous modalities are captured. This mechanism takes the embedded features of a pair of modalities as input, calculates bidirectional attention weights through a bidirectional cross-modal attention module (CMA), and modulates and enhances the original features.

[0075] The attention coefficient is calculated as follows:

[0076] $$\mathrm{CMA}(x, y) = \sigma\big(\mathrm{FFN}([x \,\|\, y])\big)$$

[0077] In the formula, $x$ and $y$ are the embedding features of the two modalities; $[\cdot\,\|\,\cdot]$ denotes concatenation along the channel dimension; $\mathrm{FFN}$ is a lightweight feedforward network comprising linear layers, GELU activation, Dropout, and layer normalization (LN); and $\sigma(\cdot)$ is a bounding activation function that restricts the coefficients to a fixed interval, to avoid overscaling of the representation.

[0078] Specifically, taking the eye-tracking-guided visual embedding features $v_{\mathrm{ET}}$ and the eye-tracking embedding features $p_{\mathrm{ET}}$ as an example: first, $v_{\mathrm{ET}}$ and $p_{\mathrm{ET}}$ are concatenated along the channel dimension to obtain the mixed features $[v_{\mathrm{ET}} \,\|\, p_{\mathrm{ET}}]$; the mixed features are then fed into the lightweight feedforward network $\mathrm{FFN}$; finally, the network output is passed through $\sigma(\cdot)$, which restricts its values to a fixed interval and yields the attention weight coefficients. These coefficients dynamically encode "which regions or features in the image are most relevant to eye movement behavior".

[0079] Feature enhancement is calculated as follows:

[0080] $$\tilde{v}_m = v_m \odot \mathrm{CMA}(v_m, p_m)$$

[0081] $$\tilde{p}_m = p_m \odot \mathrm{CMA}(p_m, v_m)$$

[0082] In the formula, $v_m$ denotes the visual embedding features guided by physiological modality $m$; $p_m$ denotes the embedding features of the corresponding physiological modality $m$ (EEG or eye tracking); $\odot$ denotes element-wise multiplication; $\mathrm{CMA}(\cdot,\cdot)$ is the bidirectional cross-modal attention function, whose output is the attention weight coefficients; and $\tilde{v}_m$ and $\tilde{p}_m$ are the enhanced visual embedding features and physiological embedding features, respectively.

[0083] Specifically, again taking the eye-tracking-guided visual embedding features $v_{\mathrm{ET}}$ and the eye-tracking embedding features $p_{\mathrm{ET}}$ as an example:

[0084] The generated attention weights are used to recalibrate the original features, giving the enhanced visual embedding features $\tilde{v}_{\mathrm{ET}} = v_{\mathrm{ET}} \odot \mathrm{CMA}(v_{\mathrm{ET}}, p_{\mathrm{ET}})$ and the enhanced eye-tracking embedding features $\tilde{p}_{\mathrm{ET}} = p_{\mathrm{ET}} \odot \mathrm{CMA}(p_{\mathrm{ET}}, v_{\mathrm{ET}})$.

[0085] This process strengthens the parts of the image features that are consistent with eye-movement behavior, while giving the eye-tracking features visual context. The image-EEG pair is processed in the same way.
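The gating of Step S22 can be sketched as below for the image and eye-tracking pair. This is only an illustrative sketch: the Sigmoid as the bounding function, the hidden width, the layer order inside the feedforward network, the random weights, and the reuse of one module for both directions are all assumptions, and Dropout is omitted as it would be at inference time.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

class CrossModalAttention:
    """Hypothetical gate: concat -> Linear -> GELU -> LN -> Linear -> Sigmoid."""

    def __init__(self, dim, hidden=64):
        self.w1 = rng.normal(0.0, 0.1, size=(2 * dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, size=(hidden, dim))
        self.b2 = np.zeros(dim)

    def __call__(self, query, context):
        mixed = np.concatenate([query, context], axis=-1)  # channel-wise concat
        hidden = layer_norm(gelu(mixed @ self.w1 + self.b1))
        return sigmoid(hidden @ self.w2 + self.b2)          # assumed range (0, 1)

dim, batch = 8, 4
cma = CrossModalAttention(dim)
v_et = rng.normal(size=(batch, dim))  # eye-tracking-guided visual embeddings
p_et = rng.normal(size=(batch, dim))  # eye-tracking embeddings

a_v = cma(v_et, p_et)   # weights applied to the visual features
a_p = cma(p_et, v_et)   # weights applied to the eye-tracking features
v_hat = v_et * a_v      # enhanced visual embedding features
p_hat = p_et * a_p      # enhanced eye-tracking embedding features
```

Because every coefficient lies strictly between 0 and 1 under this assumption, the recalibration can only attenuate feature components, which is one way to read the stated goal of avoiding overscaling.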

[0086] Step S23: Utilize the InfoNCE loss function to maximize the mutual information between image enhancement features and corresponding physiological enhancement features, so that the two remain consistent in the semantic space.

[0087] To ensure that image features and corresponding physiological features are highly semantically consistent, mutual information (MI) constraints are introduced and optimized using the InfoNCE loss function.

[0088] The loss function is:

[0089] $$\mathcal{L}_{MI} = \mathcal{L}_{NCE}(\tilde{v}_{\mathrm{ET}}, \tilde{p}_{\mathrm{ET}}) + \mathcal{L}_{NCE}(\tilde{v}_{\mathrm{EEG}}, \tilde{p}_{\mathrm{EEG}})$$

[0090] The specific calculation of InfoNCE loss is as follows:

[0091] $$\mathcal{L}_{NCE}(a, b) = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{\exp\left(\mathrm{sim}(a_i, b_i)/\tau\right)}{\sum_{j=1}^{N}\exp\left(\mathrm{sim}(a_i, b_j)/\tau\right)}$$

[0092] In the formula, $\tilde{v}_{\mathrm{ET}}$ and $\tilde{v}_{\mathrm{EEG}}$ are the enhanced eye-tracking-guided and EEG-guided visual embedding features, respectively; $\tilde{p}_{\mathrm{ET}}$ and $\tilde{p}_{\mathrm{EEG}}$ are the enhanced eye-tracking and EEG embedding features, respectively; $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function; $\tau$ is the temperature coefficient; $N$ is the training batch size; $a_i$ and $b_i$ are the $a$ features and $b$ features of the $i$-th sample in the batch, which together constitute a positive sample pair; and $b_j$ is the $b$ feature of the $j$-th sample in the batch.

[0093] For each pair of enhanced features (such as the enhanced visual embedding features $\tilde{v}_{\mathrm{ET}}$ and the enhanced eye-tracking embedding features $\tilde{p}_{\mathrm{ET}}$), the InfoNCE loss treats the features from the same trial in a batch, $(a_i, b_i)$, as a positive sample pair (i.e., an aligned "image-physiology" pair), and treats combinations with features from other trials, $(a_i, b_j)$ with $j \neq i$, as negative sample pairs (i.e., mismatched pairs). By optimizing this loss, the model is trained to maximize the cosine similarity of positive sample pairs and minimize the cosine similarity of negative sample pairs.
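The positive and negative pairing used by the InfoNCE loss can be illustrated with a small NumPy sketch. The one-directional form, the batch averaging, the temperature of 0.1, and the toy one-hot features are assumptions made for the example.

```python
import numpy as np

def info_nce(a, b, temperature=0.1):
    """InfoNCE over a batch: row i of `a` and row i of `b` form the positive
    pair; row i of `a` with every other row of `b` forms the negatives."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = (a @ b.T) / temperature                      # cosine similarity / tau
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

# Toy batch of four trials with one-hot "features".
eye = np.eye(4)
aligned = info_nce(eye, eye)                          # each trial matches itself
mismatched = info_nce(eye, np.roll(eye, 1, axis=0))   # pairs shifted by one trial
```

The aligned batch yields a loss close to zero, while the shifted batch yields a large loss, which is the behavior that pushes matched image and physiological features together in the shared semantic space.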

[0094] Step S24: Input the enhanced features of each modality into the classifier for target classification, calculate the classification loss, and sum the weighted sum with the mutual information constraint loss to obtain the total loss of the teacher model. Specifically:

[0095] Each enhanced feature of the teacher model is fed into a separate fully connected (FC) classifier to obtain a prediction score.

[0096] The total loss of the teacher model is:

[0097] $$\mathcal{L}_{T} = \mathcal{L}_{CE} + \lambda\,\mathcal{L}_{MI}$$

[0098] In the formula, $\lambda$ is the balance coefficient; $\mathcal{L}_{CE}$ is the sum of the cross-entropy losses over all modalities; and $\mathcal{L}_{MI}$ is the mutual information constraint based on the InfoNCE loss.
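How the classification losses and the mutual information constraint combine in Step S24 can be sketched as follows. The placement of the balance coefficient on the mutual information term, its value of 0.5, the four classifier branches, and the function names are assumptions for illustration only.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy of integer labels under softmax(logits)."""
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(len(labels)), labels].mean())

def teacher_total_loss(logits_per_branch, labels, mi_loss, balance=0.5):
    """Sum the per-branch classification losses and add the weighted MI term.

    ASSUMPTION: the balance coefficient scales the mutual information term.
    """
    ce_total = sum(cross_entropy(branch, labels) for branch in logits_per_branch)
    return ce_total + balance * mi_loss

labels = np.array([0, 2])
uniform = np.zeros((2, 3))  # untrained classifier: equal scores for 3 classes
total = teacher_total_loss([uniform] * 4, labels, mi_loss=2.0, balance=0.5)
```

With uniform scores each branch contributes ln 3, so the total here is 4 ln 3 + 1.0.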

[0099] Step S3, Construction and Training of Student Model: Construct a student model that takes remote sensing images as input only. Learn the cross-modal perception patterns obtained by the teacher model through a knowledge distillation strategy. Use the intramodal attention module and the private orthogonal projection module to simulate the teacher's cognitive behavior and preserve the image-specific details.

[0100] The student model is guided by a teacher model with frozen weights during the training phase, and only receives image input during the inference phase, as shown in Figure 3.

[0101] Furthermore, step S3 includes the following steps:

[0102] Step S31: Receive remote sensing images and extract EEG-guided and eye-tracking-guided visual embedding features. Specifically:

[0103] Image encoding: The image encoder of the student model is identical to that of the teacher model. The image is fed into the backbone network (VGG16), and the EEG-guided and ET-guided channels then yield the initial EEG-guided visual embedding features $v_{\mathrm{EEG}}$ and eye-tracking-guided visual embedding features $v_{\mathrm{ET}}$.

[0104] Step S32: Generate attention weights based solely on unimodal image features and self-modulate the image features to generate student-oriented features, thereby simulating the attention distribution generated by cross-modal attention in the teacher model.

[0105] By introducing an intramodal attention mechanism, the attention regulation pattern of a teacher model is simulated in the absence of physiological signal input, and features in the image modality that are semantically consistent with the physiological signal representation are deeply mined. In this mechanism, the intramodal attention module (IMA) generates attention weights using only visual cues, and its calculation is as follows:

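[0106] $$s_{\mathrm{EEG}} = v_{\mathrm{EEG}} \odot \mathrm{IMA}(v_{\mathrm{EEG}})$$

[0107] $$s_{\mathrm{ET}} = v_{\mathrm{ET}} \odot \mathrm{IMA}(v_{\mathrm{ET}})$$

[0108] In the formula, $\mathrm{IMA}(\cdot)$ adopts the same lightweight feedforward network structure as $\mathrm{CMA}$ in the teacher model, but its input is only the single-modal image features; its output features $s_{\mathrm{EEG}}$ and $s_{\mathrm{ET}}$ are referred to as "student-oriented features".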
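A minimal sketch of the intramodal self-modulation in Step S32 follows. The network here is simplified to two linear maps with a GELU between them and a Sigmoid output, and a single module is shared by both guided channels; these choices, the random weights, and the class name are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

class IntraModalAttention:
    """Hypothetical gate computed from image features alone."""

    def __init__(self, dim, hidden=32):
        self.w1 = rng.normal(0.0, 0.1, size=(dim, hidden))
        self.w2 = rng.normal(0.0, 0.1, size=(hidden, dim))

    def __call__(self, v):
        return sigmoid(gelu(v @ self.w1) @ self.w2)  # assumed range (0, 1)

dim, batch = 8, 4
ima = IntraModalAttention(dim)
v_eeg = rng.normal(size=(batch, dim))  # EEG-guided visual embeddings
v_et = rng.normal(size=(batch, dim))   # eye-tracking-guided visual embeddings

att_eeg, att_et = ima(v_eeg), ima(v_et)
s_eeg = v_eeg * att_eeg  # student-oriented features (EEG channel)
s_et = v_et * att_et     # student-oriented features (ET channel)
```

No physiological input appears anywhere in the computation, which is what allows the student to run on images alone at inference time.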

[0109] Step S33: Extract image-specific semantic features orthogonal to student guidance features from the same image features, and retain image detail information not covered by physiological signal alignment.

[0110] To fully utilize the unique semantic information in the image modality and avoid information loss due to excessive alignment with physiological signals, this embodiment designs a private orthogonal projection (POP) module that decouples "modal alignment semantics" from "image-inherent semantics." This module extracts, from the visual embedding features, image-specific semantic features $r_{\mathrm{EEG}}$ and $r_{\mathrm{ET}}$ that are complementary to the student-oriented features, thereby preserving the modality-specific semantic information in the visual embedding features.

[0111] The feature extraction calculation is as follows:

[0112] $$r_{\mathrm{EEG}} = v_{\mathrm{EEG}} \odot \mathrm{POP}(v_{\mathrm{EEG}})$$

[0113] $$r_{\mathrm{ET}} = v_{\mathrm{ET}} \odot \mathrm{POP}(v_{\mathrm{ET}})$$

[0114] In the formula, $\mathrm{POP}(\cdot)$ denotes the attention weight generation function in the private orthogonal projection module, whose structure is similar to that of the attention weight generation function in the intramodal attention module, to avoid abrupt discontinuities in the representation.

[0115] To ensure that image-specific semantic features are unrelated to student-oriented features, a cosine orthogonal regularization loss is applied, and the orthogonality of the two features is enforced to maximize effective semantic information. The orthogonal loss is calculated as follows:

[0116]

[0117] In the formula, the two feature symbols denote, for each sample in a batch, the L2-normalized image-specific semantic features and student-oriented features; the similarity function is the cosine similarity function; and the batch symbol is the training batch size.
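A minimal sketch of a cosine orthogonality penalty is given below. The exact expression of the orthogonal loss is not reproduced in this text, so the squared-cosine form averaged over the batch is an assumption, as are the function names; an absolute-value form would serve the same purpose.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Row-wise L2 normalization of a (batch, d) array."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def cosine_orthogonal_loss(private_feat, student_feat):
    """Assumed form: batch mean of the squared cosine similarity."""
    p = l2_normalize(private_feat)
    s = l2_normalize(student_feat)
    cos = np.sum(p * s, axis=1)       # cosine similarity per sample
    return float(np.mean(cos ** 2))   # 0 when every pair is orthogonal

orthogonal = cosine_orthogonal_loss(np.array([[1.0, 0.0]]), np.array([[0.0, 1.0]]))
parallel = cosine_orthogonal_loss(np.array([[1.0, 2.0]]), np.array([[2.0, 4.0]]))
```

Under this assumed form the penalty is smallest when the image-specific features carry no component along the student-oriented features and largest when the two are collinear.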

[0118] Step S34: A joint distillation strategy is adopted to align the student-oriented features of the student model with the corresponding teacher image enhancement features of the teacher model at both the feature level and the attention weight level. The distillation loss is a weighted sum of the feature-level loss and the attention-level loss. Specifically:

[0119] During student model training, the teacher model's parameters are frozen, and the student model aligns its student-oriented features with the teacher's bidirectionally enhanced features using an offline distillation strategy. To transfer the teacher's multimodal knowledge, a two-level cascaded distillation is employed that combines attention-level and feature-level distillation. For each physiological modality, the distillation loss includes:

[0120] Feature-level distillation loss: minimizes the mean squared error (MSE) between the student-oriented features and the teacher image enhancement features, forcing the student-oriented features to approximate the teacher image enhancement features in numerical distribution.

[0121] Attention-level distillation loss: minimizes the mean squared error (MSE) between the student's intramodal attention weights and the teacher's cross-modal attention weights, teaching the student "how to allocate attention".

[0122] Because the relative contributions of the feature-level distillation loss and the attention-level distillation loss may change dynamically during training, adaptive weights are introduced to balance the two. The adaptive weights are calculated as follows:

[0123]

[0124]

[0125] In the formula, the constant term is a very small value, and the two weights are adaptive weights dynamically calculated from the current losses.

[0126] To reduce noise caused by the uncertainty in teacher model predictions, a teacher confidence is introduced. It is calculated as the batch mean of the maximum class probability obtained by applying Softmax to the teacher's classification logits.
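The teacher confidence described above can be sketched directly from its definition. The function name and the example logits are illustrative only.

```python
import numpy as np

def teacher_confidence(logits):
    """Batch mean of the maximum softmax probability.

    logits: teacher classification logits of shape (batch, num_classes).
    """
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=1, keepdims=True)
    return float(probs.max(axis=1).mean())

uncertain = teacher_confidence(np.zeros((3, 5)))             # uniform over 5 classes
confident = teacher_confidence(np.array([[10.0, 0.0], [10.0, 0.0]]))
```

A teacher that spreads probability evenly over five classes yields a confidence of 0.2, while sharply peaked predictions push the value toward 1.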

[0127] The single-mode distillation loss is:

[0128]

[0129] The total distillation loss is:

[0130]

[0131] In the formula, the two loss terms are the single-modality distillation losses of the EEG modality and the eye-tracking modality, respectively, and the two coefficients are hyperparameters used to balance the distillation contributions of the EEG and eye-tracking modalities.
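The following sketch shows one way the pieces of the joint distillation loss could fit together. The formulas themselves are not reproduced in this text, so several choices are assumptions: each adaptive weight is taken as that loss's share of the summed losses (with a small constant `eps` in the denominator), the teacher confidence is applied as a multiplicative factor, and the names `lam_eeg` and `lam_et` stand in for the two balancing hyperparameters.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def adaptive_weights(l_feat, l_att, eps=1e-8):
    """Assumed rule: weight each loss by its share of the current total."""
    total = l_feat + l_att + eps
    return l_feat / total, l_att / total

def modality_distill_loss(s_feat, t_feat, s_att, t_att, confidence):
    """Single-modality loss: confidence-scaled weighted sum (assumed form)."""
    l_feat = mse(s_feat, t_feat)   # feature-level MSE
    l_att = mse(s_att, t_att)      # attention-level MSE
    alpha, beta = adaptive_weights(l_feat, l_att)
    return confidence * (alpha * l_feat + beta * l_att)

def total_distill_loss(loss_eeg, loss_et, lam_eeg=1.0, lam_et=1.0):
    """Weighted sum of the EEG and eye-tracking distillation losses."""
    return lam_eeg * loss_eeg + lam_et * loss_et

rng = np.random.default_rng(0)
s_feat, t_feat = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
s_att, t_att = rng.uniform(size=(4, 8)), rng.uniform(size=(4, 8))
loss_eeg = modality_distill_loss(s_feat, t_feat, s_att, t_att, confidence=0.9)
```

In a gradient-based framework the adaptive weights and the confidence would normally be treated as constants (no gradient flows through them), since the teacher is frozen and the weights are computed from the current loss values.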

[0132] Step S35: After fusing the student guidance features with the image-specific semantic features, classify them, calculate the classification loss, and combine it with the distillation loss and the regularization loss used to constrain the orthogonality of the features to form the total loss of the student model for optimization.

[0133] To fully utilize student-oriented features and image-specific semantic features to obtain more representative feature representations, this embodiment employs an attention-based approach to fuse student-oriented features and image-specific semantic features. Specifically:

[0134] The student-oriented features are concatenated with the image-specific semantic features along the channel dimension to form a combined feature;

[0135] Then, a self-attention mechanism is used to capture the dependencies and complementary cues between these different features:

[0136]

[0137] In the formula, the three projection matrices are the learnable weight matrices for the queries, keys, and values, respectively; the dimension term is the embedding dimension of the features; the scaling factor adjusts the magnitude of the dot-product values to prevent vanishing gradients; and the output is the set of features fused by the self-attention mechanism.

[0138] Next, average pooling is performed on the self-attention output to obtain the final fused feature:

[0139]

[0140] In the formula, each term denotes the i-th sub-feature vector after enhancement by the self-attention mechanism. The sub-feature vectors correspond to the student-oriented features and the image-specific semantic features of the EEG-guided and eye-tracking-guided branches.

[0141] Then, the fused feature is input into a fully connected layer and a softmax layer for classification, and the cross-entropy loss is calculated;
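The fusion step can be sketched as standard scaled dot-product self-attention followed by average pooling. Treating the combined feature as four sub-feature vectors (two student-oriented, two image-specific) of equal dimension is an assumption drawn from the description of the pooled terms; the function and variable names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fuse(tokens, w_q, w_k, w_v):
    """Self-attention over sub-feature vectors, then average pooling.

    tokens: (batch, n_sub, d), here n_sub = 4 sub-feature vectors.
    w_q, w_k, w_v: learnable (d, d) projection matrices.
    Returns fused features of shape (batch, d).
    """
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    d = q.shape[-1]
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(d)   # (batch, n_sub, n_sub)
    attended = softmax(scores, axis=-1) @ v            # (batch, n_sub, d)
    return attended.mean(axis=1)                       # average pooling

rng = np.random.default_rng(0)
d = 8
tokens = rng.normal(size=(2, 4, d))
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
fused = fuse(tokens, w_q, w_k, w_v)
```

Dividing by the square root of the embedding dimension keeps the dot products in a range where the softmax does not saturate, which is the gradient issue the scaling factor is meant to address.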

[0142] The final total loss of the student model is:

[0143]

[0144] In the formula, the coefficient balances the orthogonal constraint; the distillation term is the total distillation loss during student model training; and the regularization term is the orthogonal regularization loss.

[0145] Through the joint optimization described above, the student model not only acquires semantic alignment capabilities consistent with human perception, but also retains the discriminative information unique to images, thereby achieving high-performance remote sensing target detection using only image input at inference time.
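As a rough illustration of how the three terms could be combined, the sketch below computes a cross-entropy loss from classifier logits and adds the distillation and orthogonality terms. The additive form with a single coefficient `lam_orth` on the orthogonal loss is an assumption based on the symbol descriptions; the default value 0.1 is purely illustrative.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy from raw logits (batch, classes) and integer labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def student_total_loss(ce, distill, orth, lam_orth=0.1):
    """Assumed form: classification + distillation + weighted orthogonality."""
    return ce + distill + lam_orth * orth

ce = cross_entropy(np.zeros((2, 5)), [0, 3])   # uniform prediction over 5 classes
total = student_total_loss(1.0, 0.5, 2.0, lam_orth=0.1)
```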

[0146] Step S4, Model Deployment and Target Detection Inference: In the inference phase, the student model is used for remote sensing target detection. Specifically:

[0147] The model deployed and used for inference is the trained student model. After receiving a new remote sensing image as input, it goes through image encoding, intramodal attention modulation, private orthogonal projection, feature fusion and classification output steps in sequence to finally generate the target detection result.

[0148] This embodiment also provides a remote sensing target detection system based on brain-eye cognitive behavior distillation. As shown in Figure 4, the system includes a multimodal data acquisition and preprocessing module 1, a multimodal teacher model module 2, a unimodal student model module 3, and a model deployment and inference module 4.

[0149] Among them, the multimodal data acquisition and preprocessing module 1 is used to simultaneously acquire the electroencephalogram (EEG) signals and eye movement signals generated when the subject views the remote sensing images, and together with the remote sensing images, it constitutes a spatiotemporally aligned multimodal dataset.

[0150] The multimodal teacher model module 2 is connected to the multimodal data acquisition and preprocessing module 1 to receive multimodal datasets. This module integrates an image encoding submodule, a physiological signal encoding submodule, a cross-modal attention alignment submodule, and a mutual information constraint submodule. It learns a joint representation with semantic consistency between remote sensing image features and EEG and eye movement features through a joint optimization algorithm, and outputs enhanced visual features aligned with each physiological signal.

[0151] The unimodal student model module 3 is connected to the multimodal teacher model module 2 during the training phase and receives the remote sensing image to be detected during the inference phase. This module integrates an image encoding submodule, an intramodal attention submodule, a private orthogonal projection submodule, and a feature fusion submodule. Through a knowledge distillation strategy, it learns the cross-modal perception mode of the teacher model during the training phase and simulates human-like cognitive behavior and outputs the target detection result based solely on the image input during the inference phase.

[0152] The Model Deployment and Inference Module 4 is used to deploy the trained student model and process the newly input remote sensing images to output the category and location information of the targets in the images.

[0153] To fully verify the effectiveness of the proposed method, this embodiment uses multiple sets of evaluation items for analysis, as follows:

[0154] 1. Evaluation Indicators

[0155] Since the experimental dataset used in this example is an imbalanced dataset (with fewer positive samples than negative samples), Balanced Accuracy (BA) and F1 score are used as evaluation metrics.
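For reference, the two metrics can be computed as follows. Balanced accuracy is the mean of the per-class recalls, and the F1 score is the harmonic mean of precision and recall. Computing F1 for a single positive class is an assumption here, since the text does not state whether a binary or averaged multi-class F1 was used; the label vectors are made-up examples.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean recall over the classes present in y_true."""
    recalls = []
    for c in sorted(set(y_true)):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

def f1_score(y_true, y_pred, positive=1):
    """F1 for one positive class: 2TP / (2TP + FP + FN)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Illustrative imbalanced labels: 2 positives, 6 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
ba = balanced_accuracy(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
```

In this example plain accuracy would be 6/8 = 0.75, while balanced accuracy is about 0.667 because the minority class is only half recovered, which is why BA is preferred on imbalanced data.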

[0156] Furthermore, to eliminate random errors caused by the randomness of dataset partitioning and to verify the model's stability across different data subsets, this invention employs a five-fold cross-validation strategy to calculate the final performance metrics. The metric values obtained from the five rounds of experiments are averaged, and the average serves as the final measure of overall model performance. The dispersion of the five rounds of results relative to the mean is also calculated to evaluate the model's robustness; a smaller variance indicates lower sensitivity to data partitioning and more stable performance.
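The fold aggregation amounts to a mean and a dispersion measure over five scores. The sketch below uses the population variance as the dispersion measure, which is an assumption (the text does not say whether variance or standard deviation is reported), and the fold scores are invented placeholders, not results from the experiments.

```python
import statistics

def summarize_folds(scores):
    """Return (mean, population variance) of per-fold metric values."""
    return statistics.mean(scores), statistics.pvariance(scores)

# Hypothetical per-fold balanced accuracies, for illustration only.
fold_scores = [0.90, 0.92, 0.94, 0.91, 0.93]
mean_ba, var_ba = summarize_folds(fold_scores)
```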

[0157] 2. EEG data assessment

[0158] Table 1. Analysis of EEG performance of each subject

[0159]

[0160] In this example, EEG data from 5 subjects were collected for model training. EEGNet was used as the encoder, and a classifier with the same architecture as this method was connected to evaluate the quality of the EEG.

[0161] 3. Performance Comparison

[0162] To verify the effectiveness of this method, comparative experiments were conducted with several baseline models. The baseline models used for comparison were divided into two categories:

[0163] General visual backbone networks include VGG16, ResNet18, DenseNet121, EfficientNet-B0, and MobileNet V2, all of which were pre-trained on ImageNet and fine-tuned according to the same protocol as our approach; in addition, Swin Transformer and YOLOv8, which were pre-trained on the NAIP dataset, are also included.

[0164] EEG-guided reverse engineering methods include HBDVC and BMCL.

[0165] To ensure fairness in the comparison, all network models participating in the comparison were fine-tuned using the same training scheme as our method and evaluated using the same fully connected (FC) classifier.

[0166] Table 2 Performance comparison of this method with various baseline models

[0167]

[0168] Experimental results (see Table 2) show that unimodal vision models have limited performance on this task: classic CNN models (such as VGG16 and ResNet18) only achieved a balanced accuracy (BA) of about 72%, indicating that image input alone cannot fully capture the semantic and spatial cues required for five-class region localization. DenseNet121 performs poorly under few-shot conditions. Lightweight models (such as EfficientNet and MobileNet) show only minor performance improvements. Swin Transformer performs the worst, indicating that large pre-trained models have poor adaptability when transferring to small-object, low-shot scenes. YOLOv8 also shows limited adaptability under few-shot and small-object conditions.

[0169] In contrast, EEG-guided reverse engineering significantly improved the results: HBDVC showed a modest performance improvement, and BMCL achieved a balanced accuracy of 85.96%, confirming the effectiveness of EEG guidance.

[0170] Our method improves the balanced accuracy (BA) to 93.03% and the F1 score to 90.19%. This result outperforms the pure image baseline model by 19.2% and the previous EEG-guided method by 7.1% (93.03% versus the 85.96% BA of BMCL, a gap of about 7.1 percentage points). This demonstrates the superiority and effectiveness of the proposed "image-EEG-eye-tracking" fusion framework in remote sensing target detection tasks.

[0171] 4. Modal contribution analysis

[0172] To evaluate the specific contribution of different physiological signals to model performance, a modal ablation experiment was conducted in this embodiment. The experiment compared four different student model variants:

[0173] a. Image only: Baseline model, not guided by physiological signals.

[0174] b. EEG+Image: A student model that only accepts knowledge distillation of EEG signals.

[0175] c. Eye-tracking + Image (ET+Image): A student model that only accepts knowledge distillation of eye-tracking signals.

[0176] d. EEG+ET+Image: A complete student model that simultaneously receives EEG and eye-tracking signals.

[0177] All variants share the same teacher model, differing only in the target mode of distillation. The experimental results are shown in Table 3.

[0178] The student models guided by a single modality (EEG only and ET only) outperformed the pure image baseline model. The student model that received both EEG and eye-tracking signal distillation achieved the highest balanced accuracy (BA) and F1 score. This indicates that the cognitive information provided by EEG and the behavioral information provided by eye-tracking are complementary, and joint guidance can significantly enhance the model's generalization ability.

[0179] Table 3 Comparison of contributions from four different modes

[0180]

[0181] 5. Ablation test

[0182] This embodiment conducted ablation experiments to evaluate the contributions of four key components: the dual-channel image encoder, the cross-modal attention (CMA) module, the joint distillation strategy, and the private orthogonal projection (POP) module. In each experiment, only the component under investigation was modified, while all other model structures and training settings remained unchanged.

[0183] (1) Dual-channel image encoder

[0184] This experiment evaluated the effectiveness of a dual-channel modality-guided image encoder in a teacher model, comparing four configurations: a backbone network (VGG16+FC); an MLP with mutual information constraints (MLP w/ MI, where backbone features are processed through two identical MLP heads with mutual information constraints for EEG and eye movements); a dual-channel configuration without mutual information constraints (dual-channel w/o MI, a backbone network with both EEG and eye movement channels, but without mutual information guidance during training); and a dual-channel configuration with mutual information constraints (dual-channel w/ MI, a dual-channel EEG/eye movement backbone network trained under the standard mutual information-based teacher objective).

[0185] As shown in Table 4, the introduction of dual channels improves visual feature extraction compared to the backbone network, while the fully dual-channel model guided by mutual information further enhances performance compared to the MLP variant. These results demonstrate that dual channels not only help the image encoder capture richer visual features but also generate embeddings that better match the corresponding physiological modalities, supporting the framework's cross-modal consistency objective.

[0186] Table 4 Analysis of the dual-channel modal guided image encoder in the teacher model

[0187]

[0188] (2) Cross-modal attention

[0189] To evaluate the effectiveness of the CMA module in capturing cross-modal semantic consistency, four teacher model variants were compared under the same settings: Direct-NCE (image and physiological embeddings are directly constrained by InfoNCE loss), Direct-Cos (embeddings are directly constrained by cosine similarity loss), Mapped-NCE (embeddings are linearly projected to a shared space and then constrained by InfoNCE), and CMA-NCE (applying bidirectional cross-modal attention to enhance embeddings, followed by InfoNCE alignment).

[0190] As shown in Figure 5, the proposed CMA consistently achieves the highest Balanced Accuracy (BA) and F1 score, demonstrating its superior ability to extract cross-modal semantically consistent representations.
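The InfoNCE constraint shared by the Direct-NCE, Mapped-NCE, and CMA-NCE variants can be sketched as below. This is a one-directional (visual to physiological) form with cosine similarity and a temperature; whether the patent's loss is symmetric, and the temperature value, are not stated in this text, so both are assumptions.

```python
import numpy as np

def info_nce(visual, physio, temperature=0.1):
    """One-directional InfoNCE over a batch of paired embeddings.

    visual, physio: (batch, d); row i of each forms the positive pair.
    """
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    p = physio / np.linalg.norm(physio, axis=1, keepdims=True)
    logits = v @ p.T / temperature                       # cosine similarity / tau
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))         # positives on the diagonal

eye = np.eye(4)
aligned = info_nce(eye, eye)                        # each sample matches its pair
shuffled = info_nce(eye, np.roll(eye, 1, axis=0))   # pairs deliberately mismatched
```

The loss is close to zero when each visual embedding is most similar to its own physiological embedding and grows when the pairing is broken, which is the behavior the alignment objective relies on.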

[0191] (3) Combined distillation strategy

[0192] To evaluate the contribution of each component in the joint distillation scheme, five student model variants were evaluated: attention-only, feature-only, without confidence, without adaptive weighting, and joint (a full scheme that includes attention-level and feature-level distillation, adaptive weights, and teacher confidence).

[0193] As shown in Table 5, single-level supervision yielded moderate gains, while the fully joint approach achieved the highest BA and F1 scores. Removing the confidence term or adaptive weights both degraded performance, confirming that both are essential for stable and effective knowledge transfer.

[0194] Table 5 Ablation analysis of the student distillation scheme

[0195]

[0196] (4) Private orthogonal projection

[0197] To evaluate the ability of the POP module to capture task-related private features, ablation studies were conducted in three settings: no POP (w/o POP, the module was removed), POP without orthogonality (POP w/o Orthogonality, the module exists but there is no orthogonality constraint), and POP with orthogonality (POP w/ Orthogonality, the full module with cosine orthogonal regularization), while keeping all other training settings the same.

[0198] As shown in Table 6, the introduction of POP improves the generalization ability, and the orthogonality constraint further enhances the performance.

[0199] Table 6 Ablation Analysis of Private Orthogonal Projection Module

[0200]

[0201] The above description is merely a preferred embodiment of the present invention. The scope of protection of the present invention is not limited to the above embodiments. All technical solutions falling within the scope of the present invention's concept are within the scope of protection of the present invention. It should be noted that for those skilled in the art, any improvements and modifications made without departing from the principles of the present invention should also be considered within the scope of protection of the present invention.

Claims

1. A remote sensing target detection method based on brain-eye cognitive behavioral distillation, characterized in that, Includes the following steps: Step S1: Construct a multimodal dataset containing remote sensing images, EEG signals, and eye movement signals; Step S2: Construct a teacher model that can simultaneously process remote sensing images, EEG signals, and eye movement signals. Through cross-modal attention mechanisms and mutual information constraints, learn a joint representation with semantic consistency between the visual features of remote sensing images and the physiological features of EEG and eye movement. Step S3: Construct a student model that takes remote sensing images as input only; Step S4: In the inference phase, the student model is used for remote sensing target detection; Step S3 includes the following steps: Step S31: Receive remote sensing images and extract visual embedding features of EEG guidance and eye-tracking guidance; Step S32: Generate attention weights based solely on unimodal image features and self-modulate the image features to generate student-oriented features, thereby simulating the attention distribution generated by cross-modal attention in the teacher model; Step S33: Extract image-specific semantic features orthogonal to student guidance features from the same image features, and retain image detail information not covered by physiological signal alignment; Step S34: A joint distillation strategy is adopted to align the student-oriented features of the student model with the teacher image enhancement features corresponding to the teacher model at both the feature level and the attention weight level. The distillation loss is composed of a weighted sum of feature-level loss and attention-level loss. 
Step S35: After fusing the student guidance features with the image-specific semantic features, classify them, calculate the classification loss, and combine it with the distillation loss and the regularization loss used to constrain the orthogonality of the features to form the total loss of the student model for optimization.

2. The remote sensing target detection method based on brain-eye cognitive behavior distillation according to claim 1, characterized in that, Step S2 includes the following steps: Step S21: Receive remote sensing images and extract visual embedding features of EEG guidance and eye-tracking guidance; extract EEG embedding features of EEG signals; extract eye-tracking embedding features of eye-tracking signals; Step S22: A bidirectional cross-modal attention mechanism is used to interactively enhance the visual embedding features and the corresponding EEG and eye-tracking physiological embedding features to obtain enhanced features that can reflect cross-modal semantic associations. Step S23: Using the InfoNCE loss function, maximize the mutual information between the enhanced visual features and the corresponding enhanced physiological features so that they remain consistent in the semantic space. Step S24: Input the enhanced features of each modality into the classifier for target classification, calculate the classification loss, and sum the weighted loss with the mutual information constraint loss to obtain the total loss of the teacher model.

3. The remote sensing target detection method based on brain-eye cognitive behavior distillation according to claim 2, characterized in that, in step S22, the calculation of cross-modal attention weights and the feature enhancement process are implemented through the following formula: In the formula, the terms denote, respectively: the visual embedding features guided by a given physiological modality; the physiological embedding features of the corresponding physiological modality; element-wise multiplication; the bidirectional cross-modal attention function; and the enhanced visual embedding features and enhanced physiological embedding features.

4. The remote sensing target detection method based on brain-eye cognitive behavior distillation according to claim 2, characterized in that, in step S23, the InfoNCE loss function used for mutual information constraints is: In the formula, the terms denote, respectively: the enhanced eye-tracking-guided and EEG-guided visual embedding features; the enhanced eye-tracking and EEG embedding features; the cosine similarity function; the temperature coefficient; the training batch size; the two features corresponding to the i-th sample in the batch, which together constitute a positive sample pair; and the feature corresponding to the j-th sample in the batch.

5. The remote sensing target detection method based on brain-eye cognitive behavior distillation according to claim 1, characterized in that, in step S33, orthogonal constraints are applied to ensure the independence of the student guidance features from the image-specific semantic features. The orthogonal loss is calculated as follows: In the formula, the two feature symbols denote, for each sample in a batch, the L2-normalized image-specific semantic features and student guidance features; the similarity function is the cosine similarity function; and the batch symbol is the training batch size.

6. The remote sensing target detection method based on brain-eye cognitive behavior distillation according to claim 1, characterized in that, in step S34, for each physiological modality, the expression for its knowledge distillation loss is: In the formula, the feature-level term is the mean squared error loss between the student guidance features and the teacher image enhancement features; the attention-level term is the mean squared error loss between the student intramodal attention weights and the teacher cross-modal attention weights; the two weights are adaptive weights dynamically calculated based on the current losses; and the remaining factor is the teacher confidence.

7. The remote sensing target detection method based on brain-eye cognitive behavior distillation according to claim 1, characterized in that, In step S35, the fusion process of student guidance features and image-specific semantic features is as follows: the student guidance features and image-specific semantic features are concatenated along the channel dimension to form a combined feature; a self-attention mechanism is used to process the combined feature to capture the dependencies between different feature components and generate a fusion feature for final classification.

8. A remote sensing target detection method based on brain-eye cognitive behavioral distillation according to any one of claims 1-7, characterized in that, In step S4, the model deployed and used for inference is the trained student model. After receiving a new remote sensing image, it sequentially goes through image encoding, intramodal attention modulation, private orthogonal projection, feature fusion and classification output steps to finally generate the target detection result.

9. A remote sensing target detection system based on brain-eye cognitive behavioral distillation for implementing the method of claim 1, characterized in that, The system includes: The multimodal data acquisition and preprocessing module (1) is used to simultaneously acquire the electroencephalogram (EEG) and eye movement signals generated when the subject views the remote sensing image, and together with the remote sensing image, it constitutes a spatiotemporally aligned multimodal dataset. The multimodal teacher model module (2) is connected to the multimodal data acquisition and preprocessing module (1) to receive multimodal datasets. This module integrates an image encoding submodule, a physiological signal encoding submodule, a cross-modal attention alignment submodule, and a mutual information constraint submodule. It learns the joint representation of the remote sensing image visual features and EEG and eye movement physiological features with semantic consistency through a joint optimization algorithm, and outputs enhanced visual features aligned with each physiological signal. The single-modal student model module (3) is connected to the multimodal teacher model module (2) during the training phase and receives the remote sensing image to be detected during the inference phase. This module integrates an image encoding submodule, an intramodal attention submodule, a private orthogonal projection submodule, and a feature fusion submodule. Through a knowledge distillation strategy, it learns the cross-modal perception mode of the teacher model during the training phase and simulates human-like cognitive behavior and outputs the target detection result based solely on the image input during the inference phase. The model deployment and inference module (4) is used to deploy the trained student model and process the new input remote sensing image to output the category and location information of the target in the image.