Asymmetric fusion and sparse projection method for timing classification of infrared dim small target
By employing asymmetric fusion and sparse projection methods, the problems of insufficient image domain features and multimodal asymmetry in infrared weak target detection are solved, enabling adaptive feature extraction and classification, thereby improving the detection rate and reducing the false alarm rate.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- CHANGCHUN INST OF OPTICS FINE MECHANICS & PHYSICS CHINESE ACAD OF SCI
- Filing Date
- 2026-05-21
- Publication Date
- 2026-06-19
AI Technical Summary
Existing infrared weak target detection technologies face challenges in large-aperture photoelectric measurement equipment, including insufficient discriminative power of image domain features, failure to consider signal quality asymmetry in multimodal fusion, lack of adaptive optimization in multi-scale feature extraction, lack of manifold topological constraints in sparse coding, and failure of loss functions to constrain the geometric structure of dictionary prototypes. These issues lead to a decrease in detection rate and an increase in false alarm rate.
An asymmetric fusion and sparse projection method is adopted. Time-series preprocessing is performed on infrared image data and multi-dimensional physical quantity time-series data to generate multimodal time series samples. Modal feature encoding and dynamic routing fusion are then performed. Multi-scale features are extracted using learnable multi-scale pooling operators. Sparse projection is performed in the category-aware dictionary space to generate sparse response vectors. Finally, classification prediction and iterative optimization are performed.
It achieves directional complementary enhancement between image modalities and physical measurement modalities in infrared weak target scenarios, adaptively optimizes downsampling strategies, eliminates binning boundary effects, improves the discriminative ability and generalization performance of feature representations, reduces false alarm rate, and improves detection rate.
Figure CN122244567A_ABST
Description
Technical Field
[0001] This application relates to the field of infrared weak target detection technology, and more specifically, to a method for temporal classification of infrared weak targets based on asymmetric fusion and sparse projection.
Background Technology
[0002] Infrared detection of small targets is a key technology in the field of optoelectronic countermeasures and target detection, and is widely used in ground-based optoelectronic measurement systems, airborne early warning platforms, and precision-guided equipment. With increasingly complex battlefield environments and ever-increasing observation distances, large-aperture ground-based optoelectronic measurement equipment plays an increasingly important role in the continuous monitoring of aerial targets at distances exceeding 70 kilometers.
[0003] However, existing infrared weak target detection technologies face the following technical bottlenecks: Image domain features lack discriminative power. When the observation distance increases to over 70 kilometers, the target appears as a point signal of about 3×3 pixels in the infrared image, containing almost no spatial structure information. Traditional image spatial domain-based methods (such as morphological filtering, local contrast enhancement, and deep learning target detection networks) struggle to extract sufficiently discriminative spatial features from such weak signals, leading to a significant decrease in detection rate and an increase in false alarm rate.
[0004] Multimodal fusion methods fail to consider signal quality asymmetry. Large-aperture photoelectric measurement systems, in addition to acquiring infrared images, can simultaneously obtain time-series data of multidimensional physical quantities such as azimuth, elevation, altitude, distance, pixel count, and radiation intensity of the target. Existing multimodal fusion methods generally employ symmetrical information interaction and fixed-gating weighting strategies, implicitly assuming that the signal quality of each modality is comparable. However, in infrared scenarios with small targets, the image modal signal-to-noise ratio is extremely low, and the information entropy is small, while the physical measurement modalities are relatively rich, continuous, and stable. Existing methods fail to effectively handle this significant asymmetry, leading to strong modal information being interfered with by weak modal noise, resulting in limited fusion gain.
[0005] Downsampling strategies for multi-scale feature extraction cannot be adaptively optimized. Existing methods for multi-scale decomposition of temporal features often employ fixed wavelet bases combined with fixed step-size downsampling, or use learnable convolutional networks but lack mathematical guarantees for the decomposition structure. The former downsampling strategy cannot be adaptively optimized end-to-end according to data characteristics, while the latter, although learnable, sacrifices the rigorous mathematical structure of wavelet multi-resolution analysis. Furthermore, existing normalization methods do not consider the dynamic impact of the quality of each modal signal on the feature distribution, and cannot adaptively adjust the normalization behavior based on the input signal quality.
[0006] Sparse coding and feature statistics methods suffer from a lack of geometric constraints and discretization errors. Existing sparse coding methods typically perform dictionary mapping in Euclidean space, failing to utilize prior knowledge that the target physical state is distributed in a manifold in the feature space. This results in a lack of manifold topological constraints in the dictionary prototype, leading to insufficient compactness of samples of the same class in the feature space. Furthermore, traditional bag-of-features coding methods use discrete histograms for feature statistics, which suffers from binning boundary effects, resulting in discontinuous and non-differentiable feature representations, thus limiting end-to-end joint optimization capabilities.
[0007] The loss function does not directly constrain the geometric structure of the dictionary prototypes. Existing classification methods mostly use the cross-entropy loss function, which is optimized in the probability space, failing to directly constrain the geometric structure of the dictionary prototypes in the feature space. The manifold smoothness between similar prototypes and the discriminative margin between dissimilar prototypes cannot be effectively guaranteed, affecting the discriminative power and generalization performance of the feature representation.
[0008] Therefore, there is an urgent need for a method for temporal classification of infrared weak targets using asymmetric fusion and sparse projection to solve one of the aforementioned technical problems.
Summary of the Invention
[0009] The purpose of this application is to provide a method for temporal classification of infrared weak targets using asymmetric fusion and sparse projection, which can solve at least one of the aforementioned technical problems. The specific solution is as follows:
According to a specific embodiment of this application, this application provides a method for temporal classification of infrared weak targets based on asymmetric fusion and sparse projection, comprising the following steps:
Time-series preprocessing is performed on infrared image data and multidimensional physical quantity time-series data to generate multimodal time series samples;
Modal feature encoding is performed on the multimodal time series samples, selective information injection is performed according to the signal quality corresponding to each mode, and fused spatiotemporal features are generated through dynamic routing fusion to complete the time series processing;
The fused spatiotemporal features are subjected to dual-path multi-scale feature extraction to generate a multi-scale feature representation that includes independent variable path features and cross-variable coupling features;
The multi-scale feature representation is mapped to a category-aware dictionary space constrained by the physical state manifold. Sparse projection is performed based on feature similarity and manifold topological constraints to generate a sparse response vector. A continuous feature distribution descriptor is then generated based on the sparse response vector. Classification prediction is performed based on the continuous feature distribution descriptor. The dictionary mapping relationship and manifold topology are iteratively optimized based on the classification results until the classification results converge, and the infrared weak target classification results are output.
[0010] Furthermore, the selective information injection based on signal quality includes the following steps: Calculate the information entropy corresponding to the modal features of infrared images and the modal features of physical measurements; High signal quality modes and low signal quality modes are determined based on information entropy; Complementary information from the high-signal-quality mode is injected into the low-signal-quality mode.
[0011] Furthermore, the dynamic routing fusion includes the following steps: Initialize the routing coefficients for each mode; Update each routing coefficient based on the consistency between the modal features and the fused features; Calculate the fusion result iteratively based on the updated routing coefficients.
[0012] Furthermore, the dual-path multi-scale feature extraction includes the following steps: The independent variable path extracts univariate dynamic features through wavelet decomposition; The joint variable path extracts cross-variable coupling features through wavelet decomposition; Features at different levels are adaptively downsampled using a learnable multi-scale pooling operator; Features at each level are normalized using signal-quality-aware conditional normalization; Features from each level and path are adaptively aggregated using attention weighting.
[0013] Furthermore, the learnable multi-scale pooling operator is implemented using a parameterized pooling kernel; The parameterized pooling kernel is a weighted combination of Haar wavelet basis, Daubechies wavelet basis and Gaussian derivative basis.
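A parameterized pooling kernel of this kind can be sketched as follows. This is a minimal illustration, not the implementation of this application: the 4-tap kernel length, the zero-padding of the Haar filter, the sampling points of the Gaussian derivative, the softmax normalization of the combination coefficients, and the stride of 2 are all assumptions, and the names `pooling_kernel` and `pool` are hypothetical.

```python
import math

SQ2, SQ3 = math.sqrt(2.0), math.sqrt(3.0)

# Fixed bases, all brought to 4 taps (assumption) so that they can be mixed.
HAAR = [1 / SQ2, 1 / SQ2, 0.0, 0.0]  # Haar low-pass filter, zero-padded
DB2 = [(1 + SQ3) / (4 * SQ2), (3 + SQ3) / (4 * SQ2),
       (3 - SQ3) / (4 * SQ2), (1 - SQ3) / (4 * SQ2)]  # Daubechies-2 low-pass filter
GAUSS_D = [-t * math.exp(-t * t / 2) for t in (-1.5, -0.5, 0.5, 1.5)]  # sampled Gaussian derivative


def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]


def pooling_kernel(logits):
    """Mix the three bases with softmax-normalized weights (the learnable part)."""
    w = softmax(logits)
    return [w[0] * h + w[1] * d + w[2] * g for h, d, g in zip(HAAR, DB2, GAUSS_D)]


def pool(signal, kernel, stride=2):
    """Strided correlation of the signal with the kernel (valid positions only)."""
    n = len(kernel)
    return [sum(k * signal[i + j] for j, k in enumerate(kernel))
            for i in range(0, len(signal) - n + 1, stride)]


kernel = pooling_kernel([0.0, 0.0, 0.0])  # equal weights, as before any training
pooled = pool([1.0] * 8, kernel)          # downsample a constant toy signal
```

In an end-to-end setting the three `logits` would be a trainable parameter per decomposition level; they are fixed here only to keep the sketch self-contained.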
[0014] Furthermore, the category-aware dictionary space includes multiple feature package prototypes; The various feature package prototypes are connected by a topological adjacency relationship of the same category.
[0015] Further, the sparse projection based on feature similarity and manifold topological constraints includes: The projection response is calculated based on the similarity between the input features and the prototypes of each feature package; The projection response is constrained based on the topological adjacency relationships between the various feature package prototypes; The constrained projected response is sparsified to generate the sparse response vector.
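The three stages named above (similarity response, topological constraint, sparsification) can be sketched as follows. The concrete choices are assumptions made only for illustration: cosine similarity as the similarity measure, neighbor averaging with a mixing weight `alpha` as the topological constraint, and top-`k` selection as the sparsification. The function name `sparse_projection` is hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0


def sparse_projection(x, prototypes, adjacency, alpha=0.3, k=2):
    # 1. Projection response: similarity between the input and every prototype.
    sims = [cosine(x, p) for p in prototypes]
    # 2. Topological constraint: pull each response toward the mean response
    #    of its same-category neighbors (assumed form of the constraint).
    constrained = []
    for i, s in enumerate(sims):
        nbrs = adjacency.get(i, [])
        nb_mean = sum(sims[j] for j in nbrs) / len(nbrs) if nbrs else s
        constrained.append((1 - alpha) * s + alpha * nb_mean)
    # 3. Sparsification: keep the k largest responses, zero the rest.
    order = sorted(range(len(constrained)), key=lambda i: constrained[i], reverse=True)
    keep = set(order[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(constrained)]


# Two hypothetical categories with two prototypes each; neighbors share a category.
prototypes = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
adjacency = {0: [1], 1: [0], 2: [3], 3: [2]}
response = sparse_projection([1.0, 0.2], prototypes, adjacency)
```

For this toy input only the two prototypes of the first category keep a non-zero response.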
[0016] Furthermore, the continuous feature distribution descriptor is generated through kernel density estimation; The kernel density estimation uses a Gaussian kernel function to fit the probability density distribution of the sparse response vector.
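A minimal sketch of a Gaussian kernel density estimate used as a descriptor is given below. It assumes a one-dimensional estimate over the response values, a fixed bandwidth, and a fixed set of evaluation points; none of these details is specified above, and `gaussian_kde` is a hypothetical helper, not a library call.

```python
import math


def gaussian_kde(samples, points, bandwidth=0.1):
    """Evaluate a 1-D Gaussian kernel density estimate at the given points."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((p - s) / bandwidth) ** 2) for s in samples)
            for p in points]


# Toy sparse response vector: two zeroed entries and two strong responses.
responses = [0.0, 0.0, 0.95, 0.99]
points = [0.0, 0.25, 0.5, 0.75, 1.0]  # assumed evaluation locations
descriptor = gaussian_kde(responses, points)
```

Unlike a histogram, every entry of `descriptor` changes smoothly when a response value moves, so the descriptor stays differentiable with respect to the responses.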
[0017] Furthermore, a joint loss function is used for optimization during the classification prediction process; The joint loss function includes prototype distance loss, mapping supervision loss, manifold regularization term, and Gaussian filter regularization term.
[0018] Furthermore, the iterative optimization includes the following steps: Update the dictionary mapping based on the classification results; Update the manifold topology based on the classification results; Re-execute classification predictions based on the updated dictionary mappings and manifold topology until the classification results converge.
[0019] Compared with the prior art, the above-described solutions of this application have at least the following beneficial effects: 1. This application discloses a method for temporal classification of infrared weak targets using asymmetric fusion and sparse projection. It quantifies signal quality by calculating the information entropy of each modality's encoded features, determines high-quality and low-quality modes based on the information entropy, injects complementary information from the high-quality modes into the low-quality modes, and adaptively controls the injection intensity using a quality ratio. Furthermore, it determines the fusion contribution weight of each mode through dynamic routing fusion iteration. This achieves directional complementary enhancement from strong to weak modes, avoiding the problem of strong-mode information being interfered with by weak-mode noise in symmetric fusion. It also solves the technical problem of significant quality differences between image modes and physical measurement modes in infrared weak target scenarios.
[0020] 2. This application discloses a method for temporal classification of infrared weak targets using asymmetric fusion and sparse projection. It employs a learnable multi-scale pooling operator, using a weighted combination of Haar wavelet basis, Daubechies wavelet basis, and Gaussian derivative basis to form a parameterized pooling kernel. The combination coefficients at each level are adaptively adjusted through end-to-end training, achieving task-adaptive optimization of the downsampling strategy while preserving the mathematical structure of wavelet multi-resolution analysis. Simultaneously, signal quality evaluation values are used as conditional inputs to dynamically generate scaling and offset parameters for the normalization layer. When signal quality is low, feature contrast is enhanced; when signal quality is high, the original distribution characteristics are maintained, allowing the normalization behavior to adaptively adjust according to the input quality.
[0021] 3. This application discloses a method for temporal classification of infrared weak targets using asymmetric fusion and sparse projection. It constructs topological adjacency relationships between prototypes of the same category feature pack. When projecting input features into the manifold space, it simultaneously considers the similarity with each prototype and manifold topological constraints. Furthermore, it replaces the discrete histogram binning operation with a Gaussian kernel function through kernel density estimation. Kernel density estimates are sampled at the prototype locations of the feature pack to generate continuous feature distribution descriptors, eliminating binning boundary effects and making the feature representation continuously differentiable. Simultaneously, a joint loss function including prototype distance loss, mapping supervision loss, manifold regularization term, and Gaussian filter regularization term is used to directly constrain the geometric structure of dictionary prototypes in the feature space, ensuring that similar prototypes form compact clusters and dissimilar prototypes are fully separated.
Attached Figure Description
[0022] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application. It is obvious that the drawings described below are merely some embodiments of this application, and those skilled in the art can obtain other drawings based on these drawings without any inventive effort. In the drawings: Figure 1 is a flowchart of the method for temporal classification of infrared weak targets using asymmetric fusion and sparse projection disclosed in an embodiment of this application.
[0023] Figure 2 is a schematic diagram of the overall framework of the method for temporal classification of infrared weak targets using asymmetric fusion and sparse projection disclosed in an embodiment of this application.
Detailed Implementation
[0024] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0025] The terminology used in the embodiments of this application is for the purpose of describing particular embodiments only and is not intended to limit the application. The singular forms “a,” “said,” and “the” used in the embodiments of this application and the appended claims are also intended to include the plural forms, and “multiple” generally includes at least two unless the context clearly indicates otherwise.
[0026] It should be understood that the term "and / or" used in this article is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. Additionally, the character " / " in this article generally indicates that the preceding and following related objects have an "or" relationship.
[0027] It should be understood that although the terms first, second, third, etc., may be used in the embodiments of this application, these descriptions should not be limited to these terms. These terms are only used to distinguish the descriptions. For example, first may also be referred to as second without departing from the scope of the embodiments of this application, and similarly, second may also be referred to as first.
[0028] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that an article or device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such an article or device. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the article or device that includes said element.
[0029] The optional embodiments of this application are described in detail below with reference to the accompanying drawings.
[0030] As shown in Figure 1 and Figure 2, this application provides a method for time-series classification of infrared weak targets using asymmetric fusion and sparse projection, which reformulates the infrared weak target detection task as a multivariate time-series classification task.
[0031] The method includes the following steps: S1. Perform time-series preprocessing on infrared image data and multidimensional physical quantity time-series data to generate multimodal time series samples.
[0032] In the technical solution of this application embodiment, a large-aperture photoelectric measuring device is used to track and measure aerial targets to acquire raw infrared image data. The infrared detector in the large-aperture photoelectric measuring device has a resolution of 640×512 pixels and a frame rate of 100Hz. At an observation distance of over 70 kilometers, the aerial target occupies 3×3 pixels in the image acquired by the infrared detector. In this application embodiment, there are four types of aerial targets: flocks of birds, civilian aircraft, balloons, and helicopters.
[0033] While acquiring infrared image data, an atmospheric correction device is used to perform inversion calculations on the raw data, and six-dimensional physical quantity time-series data is obtained through the actual parameters of the measuring equipment. The six-dimensional physical quantity time-series data includes azimuth, elevation, altitude, distance, number of pixels, and radiation intensity. The technical solution of this application embodiment uses six-dimensional physical quantities; in practical applications, the number of dimensions can be reduced according to the complexity of the application scenario, and dimensions without parameters can be set to 0. Alternatively, empirical parameters with preset weights can be used, with the more important quantities weighted higher than the others. The actual parameters of the measuring equipment are as follows: the azimuth ranges from ... to ...; the pitch angle ranges from ... to ...; the altitude ranges from 0 to 20,000 meters; the distance ranges from 10 to 150 kilometers; and the number of pixels ranges from 1 to 25 pixels. The measuring device in this embodiment is a photoelectric theodolite, but other existing devices can also be used for measurement; this application does not limit the choice of device.
[0034] This application's embodiments reconstruct the problem of detecting weak targets at the 3×3 pixel level, which is difficult to handle in the image domain, into a multivariate time series classification task. By simultaneously acquiring infrared image data and six-dimensional physical quantity time series data, the spatially scarce weak target identification is transformed into a time series pattern recognition problem, thereby avoiding the technical problem of insufficient spatial features in infrared images.
[0035] Infrared image data and multidimensional physical quantity time-series data are processed to generate multimodal time-series samples. The time-series processing includes the following steps: Intermediate data segments are generated by dividing the synchronously acquired infrared image data into infrared image segments by 500 time steps; the synchronously acquired six-dimensional physical quantity time-series data are also divided into six-dimensional physical quantity time-series data segments by the same 500 time steps. Each 500 time steps corresponds to a 5-second imaging time for the infrared detector.
[0036] The final sample consists of infrared image segments and six-dimensional physical quantity time-series data segments derived from the same 500-time-step window. These segments are then paired. Each pair constitutes an independent multimodal time-series sample.
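The windowing and pairing described above can be sketched as follows. The 500-step window comes from the text; dropping a trailing partial window, the toy data, and the name `make_samples` are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

WINDOW = 500  # time steps per segment (5 s at the 100 Hz frame rate)


def make_samples(frames: Sequence, physical: Sequence,
                 window: int = WINDOW) -> List[Tuple[list, list]]:
    """Cut two synchronized streams into non-overlapping, paired windows."""
    if len(frames) != len(physical):
        raise ValueError("streams must be synchronized (same number of time steps)")
    samples = []
    # Only full windows are kept; a trailing partial window is dropped (assumption).
    for start in range(0, len(frames) - window + 1, window):
        stop = start + window
        samples.append((list(frames[start:stop]), list(physical[start:stop])))
    return samples


# Toy streams of 1,200 time steps: frames are stood in for by integer ids,
# physical quantities by 6-tuples (azimuth, elevation, altitude, distance,
# number of pixels, radiation intensity).
frames = list(range(1200))
physical = [(0.0, 0.0, 0.0, 0.0, 9, 0.0)] * 1200
samples = make_samples(frames, physical)
```

Each element of `samples` is one multimodal time-series sample: an image segment and the physical quantity segment from the same window.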
[0037] In this embodiment, 50 samples are taken from each of the four target categories (flocks of birds, civil aircraft, balloons, and helicopters), for a total of 200 samples, which are divided into training and testing sets at a 4:1 ratio. A trajectory-level protocol is used for the division: each independent measurement trajectory is allocated as a whole to either the training set or the testing set, and non-overlapping segments are then extracted from each trajectory. This ensures that all segments of the same measurement trajectory appear in only one of the two sets, avoiding the overfitting caused by the same trajectory data appearing in both sets and ensuring that the assessment of the model's generalization ability is accurate.
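A trajectory-level split of this kind can be sketched as below. The trajectory identifiers, the fixed random seed, and the name `split_by_trajectory` are hypothetical; the sketch also does not stratify by target category, which a real split over four classes would probably need.

```python
import random
from typing import List, Tuple


def split_by_trajectory(segments: List[Tuple[str, int]],
                        test_fraction: float = 0.2, seed: int = 0):
    """segments holds (trajectory_id, segment_index) pairs; returns (train, test)."""
    trajectories = sorted({traj for traj, _ in segments})
    rng = random.Random(seed)
    rng.shuffle(trajectories)
    # Whole trajectories are assigned to the test set, never single segments.
    n_test = max(1, round(len(trajectories) * test_fraction))
    test_ids = set(trajectories[:n_test])
    train = [s for s in segments if s[0] not in test_ids]
    test = [s for s in segments if s[0] in test_ids]
    return train, test


# 10 hypothetical trajectories with 5 non-overlapping segments each.
segments = [(f"traj{t:02d}", i) for t in range(10) for i in range(5)]
train, test = split_by_trajectory(segments)
```

Because the split is made over trajectory identifiers, the 4:1 ratio holds exactly at the sample level only when every trajectory contributes the same number of segments, as in this toy data.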
[0038] The number of time steps is a balance between the infrared detector frame rate (100Hz) and the target motion characteristics: if the value is too small (e.g., less than 200 steps), the time window is insufficient to cover the complete motion cycle of targets such as bird flapping wings or helicopter rotor modulation, resulting in insufficient wavelet deepest detail components and unstable kernel density estimation; if the value is too large (e.g., greater than 1000 steps), although longer trajectories can be captured, redundant noise will be introduced, the number of independent samples will be reduced, and inference latency will be significantly increased. The number of time steps can be determined according to the actual situation, and this application does not limit it in this respect.
[0039] S2. Modal feature encoding is performed on the multimodal time series samples. Selective information injection is performed according to the signal quality corresponding to each mode. The fused spatiotemporal features are generated through dynamic routing fusion to complete the time series processing.
[0040] Modal feature encoding is performed on the multimodal time series samples obtained in step S1, that is, feature encoding is performed on the infrared image data and the six-dimensional physical quantity time series data segments respectively, and the original data is converted into high-dimensional feature vectors.
[0041] The steps for infrared image modal feature encoding are as follows: Infrared image segments are input into a ResNet-18 network, which uses residual connections to extract low-level spatial features such as edges, gradients, and intensity distributions from the weak signals. The expression for the extracted image modality coding features is:
$$F_{img} = f_{ResNet}(I)$$
[0042] Where $I$ represents the input infrared image segment; $f_{ResNet}(\cdot)$ represents the ResNet-18 network coding function; $F_{img}$ represents the image modality coding features.
[0043] The steps for encoding the six-dimensional physical quantity modal features, also known as physical measurement modal features, are as follows: A six-dimensional physical quantity time-series data segment (azimuth, elevation, altitude, distance, number of pixels, and radiation intensity) is input into a four-layer one-dimensional convolutional network, which extracts the temporal pattern features of the segment through convolution operations. The expression for the physical measurement modality coding features is:
$$F_{phy} = f_{Conv1D}(T)$$
[0044] Where $T$ represents the input six-dimensional physical quantity time series data segment; $f_{Conv1D}(\cdot)$ represents the encoding function of the four-layer one-dimensional convolutional network; $F_{phy}$ represents the physical measurement modality coding features.
[0045] This application provides a preferred technical solution, where selective information injection includes the following steps: Calculate the information entropy corresponding to each modal feature. The information entropy $H_{img}$ of the image modality coding features $F_{img}$ and the information entropy $H_{phy}$ of the physical measurement modality coding features $F_{phy}$ are calculated separately and serve as the signal quality indicator of each modality. Information entropy quantifies the uniformity of the probability distribution of the features; a higher entropy value indicates richer information and higher signal quality. In this application embodiment the same formula is used for both modalities, and the expression is:
$$H_m = -\sum_{j} p_{m,j} \log\left(p_{m,j} + \epsilon\right)$$
[0046] Where $H_m$ represents the information entropy of mode $m$; $p_{m,j}$ represents the probability value of the $j$-th element after Softmax normalization of the coding features of mode $m$; $\epsilon$ represents a minimal constant.
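[0047] High-quality and low-quality modes are determined based on information entropy: the value of the image modality information entropy $H_{img}$ is compared with the value of the physical measurement modality information entropy $H_{phy}$, and the mode with the higher entropy is taken as the high signal quality mode.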
[0048] Information entropy is used as a quantitative indicator of signal quality. By calculating the probability distribution entropy value of each modality's coding feature, the signal quality of image modality and physical measurement modality is adaptively evaluated. In infrared weak target scenarios, image modality has extremely low signal-to-noise ratio and low information entropy, while physical measurement modality is relatively rich and has high information entropy. The technical solution of this application embodiment can automatically identify this significant asymmetry, providing a quantitative basis for subsequent selective injection.
[0049] In the technical solution of this application embodiment, under large-aperture photoelectric measurement scenarios, the target in the infrared image occupies only 3×3 pixels. The probability distribution of the image modality coding features $F_{img}$ is concentrated in a few elements, so the information entropy $H_{img}$ is low. The probability distribution of the physical measurement modality coding features $F_{phy}$ is relatively uniform, so the information entropy $H_{phy}$ is significantly higher than $H_{img}$. Therefore, the physical measurement mode is identified as the high signal quality mode, and the image mode is identified as the low signal quality mode.
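The entropy-based quality assessment can be illustrated with a short sketch. It assumes Shannon entropy over Softmax-normalized features with the small constant placed inside the logarithm; the value `EPS = 1e-8`, the toy feature vectors, and the function names are assumptions, not values taken from this application.

```python
import math

EPS = 1e-8  # assumed value of the small constant


def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]


def feature_entropy(features, eps=EPS):
    """Entropy of the Softmax-normalized feature vector."""
    p = softmax(features)
    return -sum(pj * math.log(pj + eps) for pj in p)


# Toy coding features: a peaked image vector and a nearly flat physical vector.
f_img = [8.0, 0.1, 0.0, 0.2, 0.1, 0.0]
f_phy = [1.0, 1.2, 0.9, 1.1, 1.0, 0.8]

h_img = feature_entropy(f_img)
h_phy = feature_entropy(f_phy)
high_quality = "physical" if h_phy > h_img else "image"
```

With these toy vectors the peaked image features yield the lower entropy, so the physical measurement mode is selected as the high signal quality mode, matching the situation described above.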
[0050] Complementary information from high-signal-quality modes is injected into low-signal-quality modes. Based on signal quality assessment results, directional information injection is performed from modes with higher signal quality to modes with lower quality. The injection intensity is controlled by the quality ratio; the greater the difference in signal quality, the higher the injection intensity. In large-aperture measurement scenarios, the injection direction is mainly from the physical measurement mode to the image mode, effectively solving the problem of insufficient image mode information, while avoiding the drawback of strong mode information being interfered with by weak mode noise in symmetrical fusion.
[0051] In the technical solution of this application embodiment, complementary information from the physical measurement mode is injected into the image mode. In this application embodiment, complementary information refers to kinematic and radiation features such as azimuth, elevation, altitude, distance, number of pixels, and radiation intensity contained in the physical measurement mode. These features are difficult to obtain in the infrared image mode due to the small size of the target pixels.
[0052] The formula for the selective complementary information injection is expressed as follows:
$$\tilde{F}_{low} = F_{low} + \lambda \cdot \frac{H_{high}}{H_{high} + H_{low}} \cdot \phi\left(F_{high}\right)$$
[0053] Where $\tilde{F}_{low}$ represents the enhanced image modality features after selective complementary information injection; $F_{low}$ represents the coding features of the low signal quality mode, which in this embodiment are the image modality coding features $F_{img}$; $F_{high}$ represents the coding features of the high signal quality mode, which in this embodiment are the physical measurement modality coding features $F_{phy}$; $H_{low}$ represents the information entropy of the low signal quality mode, which in this embodiment is $H_{img}$; $H_{high}$ represents the information entropy of the high signal quality mode, which in this embodiment is $H_{phy}$; $\lambda$ represents the injection intensity control coefficient, used to adjust the intensity of the injected information, and is set to 0.5 in this embodiment; $\phi(\cdot)$ represents a learnable linear transformation function used to map features of the high signal quality mode to the feature space of the low signal quality mode. In the technical solution of this application embodiment, $\phi(\cdot)$ is implemented using a single-layer fully connected network, and its weight parameters are obtained through model training.
[0054] In the technical solution of this application embodiment, the value of the injection intensity control coefficient $\lambda$ is determined by traversing candidate values on the validation set and selecting the value with the best classification accuracy. This balances the injection intensity of the complementary information against the preservation of the original features of the weak mode: with $\lambda = 0.5$, the amount of injected information is 50% of the amplitude of the high signal quality mode features, and the best classification performance was achieved on the actual test dataset. If $\lambda$ is too small, the injection is insufficient and the effect of weak mode enhancement is limited; if $\lambda$ is too large, over-injection results, which may introduce redundant information or overwhelm the original structure of the weak mode. $\lambda$ exhibits good robustness in the range of 0.3 to 0.7 and can be adjusted according to specific circumstances; this embodiment of the application does not limit the value selected.
[0055] In the technical solution of this application embodiment, the injection intensity is controlled by the signal quality ratio. When the signal quality of the image modality is far below that of the physical measurement modality, the ratio is close to 1 and the injection intensity is high. When the signal quality of the two modalities is comparable, the ratio is approximately 0.5, giving a moderate injection intensity.
[0056] After selective information injection, the image-modality features are represented by their enhanced form, while the physical measurement modality features remain unchanged.
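The injection step can be illustrated with a minimal sketch. The exact formula is not recoverable from the text, so the form below is an assumption chosen to match the described behaviour: the ratio is taken as `h_phy / (h_img + h_phy)`, which approaches 1 when the image modality is far weaker and equals 0.5 when the two are comparable. The names `inject`, `linear`, `h_img`, and `h_phy` are hypothetical.

```python
def linear(weights, bias, x):
    """Single fully connected layer: y = W x + b."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def inject(f_img, f_phy, h_img, h_phy, weights, bias, lam=0.5):
    """Assumed form: f_img + lam * ratio * phi(f_phy).

    ratio = h_phy / (h_img + h_phy) is an assumption consistent with the
    described behaviour (near 1 for a weak image modality, 0.5 when equal).
    """
    ratio = h_phy / (h_img + h_phy)
    mapped = linear(weights, bias, f_phy)  # map into the image feature space
    return [a + lam * ratio * m for a, m in zip(f_img, mapped)]

# Toy example: 3-dimensional physical features mapped to 2 dimensions.
f_img = [0.1, 0.2]
f_phy = [1.0, -1.0, 0.5]
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b = [0.0, 0.0]

balanced = inject(f_img, f_phy, h_img=1.0, h_phy=1.0, weights=W, bias=b)
weak_img = inject(f_img, f_phy, h_img=0.01, h_phy=1.0, weights=W, bias=b)
```

In this toy setting the weak-image case receives roughly twice the injected amplitude of the balanced case, and the physical measurement features themselves are never modified.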
[0057] After selective information injection is completed, fused spatiotemporal features are generated through dynamic routing fusion.
[0058] This application provides a preferred technical solution, in which dynamic routing fusion includes the following steps. First, the routing coefficients corresponding to each modality are initialized. The enhanced image-modality features and the physical measurement modality features are taken as the input of dynamic routing fusion, and the image-modality routing coefficient and the physical-measurement-modality routing coefficient are both initialized to zero. Because the routing coefficients b are defined in logarithmic space, zero initialization means that the fusion weights of the two modalities are equal at the initial moment. For ease of formula expression, a uniform notation indexed by m is used for the enhanced modal features, where m is the modality index: m=1 denotes the image modality and m=2 denotes the physical measurement modality.
[0059] The routing coefficients are updated based on the consistency between the modal features and the fused features. The following operations are performed for each routing iteration r = 1, 2, ..., R, where R is the total number of routing iterations; in this embodiment, R = 3.
[0060] In the technical solution of this application embodiment, the total number of iterations R is determined by a search over the validation set that balances classification accuracy against computational efficiency. Once the fusion weights have converged, further iterations offer limited performance improvement while significantly increasing inference latency. If R is too small, the routing coefficients do not converge sufficiently, resulting in suboptimal fusion weights; if R is too large, computational overhead increases while the marginal gain diminishes. In this embodiment, R=3 gives the best balance between accuracy and efficiency.
[0061] First, the current routing coefficients are normalized with the Softmax function to obtain the fusion weight of each modality. The fusion weight is calculated as:
$$w_m^{(r)} = \frac{\exp\left(b_m^{(r)}\right)}{\sum_{n} \exp\left(b_n^{(r)}\right)}$$
[0062] where $w_m^{(r)}$ denotes the fusion weight of modality m in the r-th iteration; $b_m^{(r)}$ denotes the routing coefficient of modality m in the r-th iteration; $b_n^{(r)}$ denotes the routing coefficient of modality n in the r-th iteration; and the sum over n runs over the exponentiated routing coefficients of all modalities, which ensures that the fusion weights of all modalities sum to 1.
[0063] The technical solution of this application embodiment dynamically updates the routing coefficients through multiple iterations, enabling the fusion weights of each modality to be automatically adjusted based on the consistency between modal characteristics and the fusion result, without the need for preset fixed weights. This dynamic routing mechanism adaptively re-evaluates the contribution of each modality based on the characteristics of the input data, and the fusion result is robust to data of varying quality.
[0064] The fusion result is calculated iteratively based on the updated routing coefficients. The fusion result of the current iteration is computed from the fusion weights $w_m^{(r)}$, the learnable projection matrices $W_m$, and the enhanced modal features $\tilde{F}_m$:
$$F_{\mathrm{fuse}}^{(r)} = \sum_{m=1}^{2} w_m^{(r)}\, W_m \tilde{F}_m$$
[0065] where $F_{\mathrm{fuse}}^{(r)}$ denotes the fused spatiotemporal features of the r-th iteration and $W_m$ denotes the learnable projection matrix of modality m. As before, when m=1, $W_1$ is the projection matrix of the image modality; when m=2, $W_2$ is that of the physical measurement modality.
[0066] The technical solution of this application embodiment introduces a learnable projection matrix in the routing fusion process. By projecting each modal feature onto a shared semantic space and then performing weighted fusion, the semantic alignment capability of cross-modal features is enhanced, and the discriminativeness of the fused features is improved.
[0067] After the fusion result of the current iteration has been calculated, the routing coefficients are updated for the next iteration:
$$b_m^{(r+1)} = b_m^{(r)} + \left(W_m \tilde{F}_m\right)^{\top} F_{\mathrm{fuse}}^{(r)}$$
[0068] where $b_m^{(r+1)}$ denotes the updated routing coefficient and $\left(W_m \tilde{F}_m\right)^{\top}$ denotes the transpose of the projected feature vector.
[0069] In the technical solution of this application embodiment, the update amount of the routing coefficient equals the dot product between the projected features and the fusion result, that is, $\left(W_m \tilde{F}_m\right)^{\top} F_{\mathrm{fuse}}^{(r)}$. The larger the dot product, the higher the consistency between the modal features and the fusion result, and the greater the fusion weight of the corresponding modality in the next iteration.
[0070] The above steps are repeated until R routing iterations are completed. The final fusion result is the fusion result of the R-th iteration:
$$F_{\mathrm{fuse}} = F_{\mathrm{fuse}}^{(R)}$$
[0071] where $F_{\mathrm{fuse}}$ denotes the final fusion result and $F_{\mathrm{fuse}}^{(R)}$ denotes the fusion result of the R-th iteration.
[0072] In the technical solution of the embodiments of this application, the final fusion result incorporates the spatial features of the image modality and the temporal features of the physical measurement modality. The features of the two modalities are first complemented and enhanced through selective information injection, and adaptive weighted aggregation is then achieved through dynamic routing fusion.
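The routing loop described above (Softmax weights, weighted sum of projected features, dot-product update of the routing coefficients) can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, the projection matrices are fixed identity matrices rather than trained parameters, and the toy feature vectors are invented for the example.

```python
import math

def softmax(logits):
    """Numerically stable Softmax over a list of routing coefficients."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Matrix-vector product for list-of-lists matrices."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def routing_fusion(features, projections, iterations=3):
    """Fuse modal features with R routing iterations (R=3 in the embodiment)."""
    projected = [matvec(W, f) for W, f in zip(projections, features)]
    logits = [0.0] * len(features)  # zero start means equal initial weights
    dim = len(projected[0])
    for _ in range(iterations):
        weights = softmax(logits)
        fused = [sum(w * p[i] for w, p in zip(weights, projected))
                 for i in range(dim)]
        # Agreement (dot product) with the fused result raises the coefficient.
        logits = [b + sum(pi * fi for pi, fi in zip(p, fused))
                  for b, p in zip(logits, projected)]
    return fused, weights

identity = [[1.0, 0.0], [0.0, 1.0]]
fused, weights = routing_fusion([[1.0, 0.0], [0.2, 0.0]], [identity, identity])
```

In this toy run the first modality agrees more strongly with the fused vector, so its weight grows over the three iterations while the weights continue to sum to 1.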
[0073] The technical solution of this application embodiment aligns, segments, and pairs infrared images and six-dimensional physical quantity time-series data in step S1 to form multimodal time-series samples. Step S2 encodes the two types of data and then fuses them in the feature space. Through the collaborative design of steps S1 and S2, a complete temporal processing link from raw data to fused spatiotemporal features is constructed. Step S1 performs time alignment and fragmentation of infrared image data and multi-dimensional physical quantity time-series data, reconstructing the problem of detecting small targets with scarce spatial information into a multivariate time-series classification task, laying the foundation for time-series samples in subsequent processing. Step S2, based on this, encodes the features of the multimodal time-series samples, quantifies the signal quality of each modality using information entropy, selectively injects information from high-signal-quality modalities to low-signal-quality modalities according to signal quality differences, and determines the fusion contribution weight of each modality through dynamic routing fusion iteration. The two steps are connected and complement each other. They transform discriminative information that is difficult to obtain in the image domain into temporal pattern features through temporal processing, and effectively solve the technical problem of significant signal quality asymmetry between infrared image modes and physical measurement modes through an asymmetric fusion mechanism. This allows the fused spatiotemporal features to simultaneously contain the spatial structure information of the target and the multi-dimensional physical change laws, providing a rich and discriminative feature foundation for subsequent multi-scale feature extraction and classification.
[0074] S3. Perform dual-path multi-scale feature extraction on the fused spatiotemporal features to generate a multi-scale feature representation that includes univariate dynamic features and cross-variable coupled features.
[0075] This application provides a preferred technical solution for dual-path multi-scale feature extraction, which includes the following steps: extracting univariate dynamic features through an independent variable path; extracting cross-variable coupled features through a joint variable path; adaptively downsampling features at different levels using a learnable multi-scale pooling operator; normalizing features at each level through signal quality-aware conditional normalization; and forming a complete multi-scale feature representation through attention-weighted adaptive aggregation.
[0076] S301. The independent-variable path performs a multi-level discrete wavelet transform on each sensor variable. In this embodiment, the six-dimensional physical quantity time-series data includes six sensor variables: azimuth, elevation, altitude, distance, number of pixels, and radiation intensity. Let $x_v$ denote the time-series signal of the v-th sensor variable, where v = 1, 2, ..., 6.
[0077] An L-level discrete wavelet transform is performed independently on each sensor variable. In this embodiment, L=4 and the Haar wavelet is used as the wavelet basis. The discrete wavelet transform decomposes the signal into an approximation component and multiple detail components. The approximation component represents the low-frequency trend of the signal, and the detail components represent its high-frequency fluctuations.
[0078] In the technical solution of this application embodiment, the wavelet decomposition level L is set to 4, which is determined based on the signal length and the target motion period characteristics. The wavelet decomposition level should cover multi-scale features from high-frequency details to low-frequency trends, and ensure that the length of the deepest approximate component is not less than 16, avoiding excessive decomposition that leads to the loss of effective information. If L is too small, the multi-scale representation capability is insufficient, making it difficult to capture the target's cross-scale dynamic patterns; if L is too large, the deepest signal is too short, statistical stability decreases, and computational complexity increases; L=4 achieves the optimal balance between feature richness and stability.
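As a quick check of the length constraint above, assume that each decomposition level halves the signal length (dyadic decimation, as with the Haar basis) and write N for the length of the input segment. N is a symbol introduced here for illustration only.

```latex
\frac{N}{2^{L}} \ge 16
\quad\Longrightarrow\quad
N \ge 16 \cdot 2^{L} = 16 \cdot 2^{4} = 256 \qquad (L = 4)
```

Under this assumption, a segment of at least 256 samples is needed for L=4 to leave 16 or more coefficients in the deepest approximation component; each additional level doubles that minimum length.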
[0079] The L-level discrete wavelet transform of the v-th sensor variable is expressed as:
$$\mathrm{DWT}_L\left(x_v\right) = \left\{A_L^{(v)},\; D_L^{(v)},\; D_{L-1}^{(v)},\; \dots,\; D_1^{(v)}\right\}$$
[0080] where $A_L^{(v)}$ denotes the approximation component coefficients of the v-th sensor variable at level L; $D_L^{(v)}$ denotes the detail component coefficients of the v-th sensor variable at level L; $D_{L-1}^{(v)}$ denotes the detail component coefficients at level L-1; $D_1^{(v)}$ denotes the detail component coefficients at level 1; and $\mathrm{DWT}_L$ denotes the L-level discrete wavelet transform.
[0081] In the technical solution of this application embodiment, the approximation component preserves the low-frequency information of the signal, and the detail components preserve the high-frequency information of the signal at different scales.
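The multi-level Haar decomposition of a single sensor variable can be sketched in a few lines. This is a minimal illustration using the orthonormal Haar filters; the function names and the synthetic sine signal are hypothetical, and a practical implementation would more likely rely on a wavelet library.

```python
import math

def haar_step(signal):
    """One level of the orthonormal Haar DWT: returns (approximation, detail).

    A trailing sample is dropped if the length is odd.
    """
    s = 1 / math.sqrt(2)
    pairs = range(0, len(signal) - 1, 2)
    approx = [(signal[i] + signal[i + 1]) * s for i in pairs]
    detail = [(signal[i] - signal[i + 1]) * s for i in pairs]
    return approx, detail

def haar_dwt(signal, levels=4):
    """Return (A_L, [D_L, D_{L-1}, ..., D_1]) for an L-level decomposition."""
    approx, details = list(signal), []
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.insert(0, detail)  # deepest level goes first
    return approx, details

# Synthetic stand-in for one sensor variable (256 samples).
x = [math.sin(2 * math.pi * t / 64) for t in range(256)]
a4, ds = haar_dwt(x, levels=4)
lengths = [len(d) for d in ds]  # detail lengths from level 4 down to level 1
```

With 256 input samples the deepest approximation has 16 coefficients, and because the filters are orthonormal the total energy of the coefficients equals the energy of the input.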
[0082] S302, the joint variable path concatenates all sensor variables and then performs a multi-level discrete wavelet transform to capture cross-variable coupling relationships.
[0083] The time-series signals of the 6 sensor variables are concatenated along the feature dimension to form a multidimensional joint signal, and an L-level discrete wavelet transform is then performed on the joint signal:
$$\mathrm{DWT}_L\left(x_1 \,\|\, x_2 \,\|\, \cdots \,\|\, x_6\right) = \left\{A_L^{\mathrm{joint}},\; D_L^{\mathrm{joint}},\; \dots,\; D_1^{\mathrm{joint}}\right\}$$
[0084] where $A_L^{\mathrm{joint}}$ denotes the approximation component coefficients of the joint signal at level L; $D_L^{\mathrm{joint}}$ denotes the detail component coefficients of the joint signal at level L; $D_1^{\mathrm{joint}}$ denotes the detail component coefficients of the joint signal at level 1; and $\|$ denotes the vector concatenation operation.
[0085] The technical solution of this application embodiment uses joint variable wavelet decomposition to reflect the cooperative change pattern between different physical quantities, such as the coupling relationship between azimuth and elevation angles, and the correlation between distance and pixel count.
[0086] The technical solution of this application embodiment performs multi-level wavelet decomposition on each sensor variable through an independent variable path, capturing the dynamic features of single variables at different scales and preserving the independent variation law of each physical quantity. The joint variable path performs joint multi-level wavelet decomposition on all variables, capturing cross-variable coupling relationships and revealing the cooperative variation patterns between different physical quantities. The complementary design of the two paths enables the multi-scale feature representation to reflect both individual characteristics and overall correlations.
[0087] S303. During the decomposition process at each level, a learnable multi-scale pooling operator is used to adaptively downsample the features at different levels. The learnable multi-scale pooling operator is implemented using a parameterized pooling kernel, replacing the fixed filter with a learnable parameterized pooling kernel.
[0088] For the l-th level, the parameterized pooling kernel is defined as:
$$\psi_l(t) = \sum_{k=1}^{M} \alpha_{l,k}\, \phi_k(t)$$
[0089] where $\psi_l(t)$ denotes the parameterized pooling kernel function of the l-th level; t denotes the time index; M denotes the total number of basis functions constituting the parameterized pooling kernel, including fixed basis functions and learnable basis functions, with M=3 in this embodiment; $\alpha_{l,k}$ denotes the learnable combination coefficient of the k-th basis function at the l-th level; and $\phi_k(t)$ denotes the k-th basis function.
[0090] In this embodiment, M=3 is determined based on the complementarity of the expressive power of the basis functions and the stability of optimization. The fixed basis functions are the Haar wavelet basis and the Daubechies wavelet basis, and the learnable basis function is the learnable Gaussian derivative basis. The Haar wavelet basis captures abrupt signal features, such as the sudden appearance or disappearance of a target or noise spikes. The Daubechies wavelet basis captures local smooth features, such as the smooth motion trajectory of a target. The learnable Gaussian derivative basis provides adaptive capability. The combination of the three covers the most critical signal patterns with the fewest basis functions. If M is too small, the pooling kernel's expressive power is insufficient, making it difficult to adapt to complex temporal patterns; if M is too large, the basis functions are redundant, prone to overfitting, and difficult to optimize; M=3 achieves the optimal balance between expressive power and generalization performance.
[0091] The parameterized pooling kernel is formed by a weighted combination of the Haar wavelet basis, the Daubechies wavelet basis, and the Gaussian derivative basis.
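[0092] The Haar wavelet basis is expressed as:
$$\phi_{\mathrm{Haar}}(t) = \begin{cases} 1, & 0 \le t < \tfrac{1}{2} \\ -1, & \tfrac{1}{2} \le t < 1 \\ 0, & \text{otherwise} \end{cases}$$
[0093] where $\phi_{\mathrm{Haar}}(t)$ denotes the Haar wavelet basis, which has compact support and orthogonality and can effectively capture abrupt changes in signals.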
[0094] The expression for the Daubechies wavelet basis is:
[0095] Here, the expression denotes the Daubechies wavelet basis, which has better smoothness and vanishing-moment properties and can characterize the local structure of the signal more finely.
[0096] The Gaussian derivative basis is learnable and is expressed as:
[0097] Here, the expression denotes the learnable Gaussian derivative basis, which is parameterized by a learnable position parameter and a learnable scale parameter for the l-th level. The learnable Gaussian derivative basis has a smooth waveform: the position parameter controls the center position of the wave crest, and the scale parameter controls the width of the waveform. Both are adjusted through training to adapt to the signal characteristics of each level.
[0098] The downsampling operation of the l-th level is defined as convolving the parameterized pooling kernel with the input signal and then downsampling with a stride of 2:
$$y_l = \left(\psi_l * x_l\right) \downarrow_2$$
[0099] where $y_l$ denotes the signal after downsampling at the l-th level; $*$ denotes the convolution operation; $\downarrow_2$ denotes that one sampling point is retained out of every two time steps; and $x_l$ denotes the input signal.
[0100] The technical solution of this application embodiment enables the pooling operation to be adaptively adjusted according to data characteristics through the weighted combination of parameterized pooling kernels and end-to-end optimization of learnable parameters. This achieves task adaptive optimization of the downsampling strategy while preserving the mathematical structure of wavelet multi-resolution analysis.
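The parameterized pooling kernel and the stride-2 downsampling can be sketched as below. Several choices here are assumptions rather than details stated in the text: the Daubechies basis is taken to be the 4-tap db2 high-pass filter, the Gaussian derivative is taken to be the first derivative, all three bases are sampled on 4 taps, and the combination coefficients are fixed numbers standing in for trained parameters. All names are hypothetical.

```python
import math

SQ2, SQ3 = math.sqrt(2), math.sqrt(3)

# Fixed bases sampled on 4 taps (Haar is zero-padded from 2 taps).
HAAR = [1 / SQ2, -1 / SQ2, 0.0, 0.0]
DB2_LOW = [(1 + SQ3) / (4 * SQ2), (3 + SQ3) / (4 * SQ2),
           (3 - SQ3) / (4 * SQ2), (1 - SQ3) / (4 * SQ2)]
DB2_HIGH = [(-1) ** k * DB2_LOW[3 - k] for k in range(4)]  # assumed db2 order

def gaussian_derivative(mu, sigma, taps=4):
    """First derivative of a Gaussian; mu and sigma would be learned."""
    return [-(t - mu) / sigma ** 2 * math.exp(-((t - mu) ** 2) / (2 * sigma ** 2))
            for t in range(taps)]

def pooling_kernel(alpha, mu, sigma):
    """Weighted combination of the M=3 basis functions."""
    bases = [HAAR, DB2_HIGH, gaussian_derivative(mu, sigma)]
    return [sum(a * basis[t] for a, basis in zip(alpha, bases)) for t in range(4)]

def pool_downsample(signal, kernel, stride=2):
    """Sliding inner product (correlation form) followed by stride-2 sampling."""
    n, k = len(signal), len(kernel)
    return [sum(kernel[j] * signal[i + j] for j in range(k))
            for i in range(0, n - k + 1, stride)]

kernel = pooling_kernel(alpha=[1.0, 0.0, 0.0], mu=1.5, sigma=1.0)
flat = pool_downsample([1.0] * 8, kernel)              # constant input
ramp = pool_downsample([float(t) for t in range(8)], kernel)  # linear input
```

With only the Haar coefficient active, a constant input is mapped to zero and a unit-slope ramp to a constant, which is the expected behaviour of a difference filter; training would move the coefficients away from this starting point.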
[0101] S304. Perform signal-quality-aware conditional normalization on the features of each level. Signal-quality-aware conditional normalization uses the signal quality evaluation value calculated in step S2, i.e., the information entropy, as the conditional input to dynamically generate the scaling parameter and offset parameter of the normalization layer.
[0102] The larger of the image-modality information entropy and the physical-measurement-modality information entropy is used as the conditional input, reflecting the overall signal quality level of the input data.
[0103] The normalization scaling parameter and the normalization offset parameter are generated dynamically as:
[0104] $$\gamma = W_\gamma Q + b_\gamma, \qquad \beta = W_\beta Q + b_\beta, \qquad Q = \max\left(H_{\mathrm{img}},\, H_{\mathrm{phy}}\right)$$
[0105] where $\gamma$ denotes the normalization scaling parameter; $\beta$ denotes the normalization offset parameter; $W_\gamma$ and $W_\beta$ denote the learnable weight matrices; $b_\gamma$ and $b_\beta$ denote the learnable bias parameters; and Q denotes the larger of the two information entropies $H_{\mathrm{img}}$ and $H_{\mathrm{phy}}$, reflecting the overall signal quality level of the input data.
[0106] The technical solution of this application uses signal quality evaluation values as conditional inputs to dynamically generate scaling and offset parameters for the normalization layer. When the signal quality is low (i.e., the Q value is small), conditional normalization enhances feature contrast; when the signal quality is high (i.e., the Q value is large), conditional normalization maintains the original distribution characteristics. Compared to the fixed parameter method used in existing technologies, signal quality-aware conditional normalization can adaptively adjust its normalization behavior according to the input quality, improving the robustness of feature representation.
[0107] Conditional normalization is performed on the features at each level:
$$\hat{x} = \gamma \cdot \frac{x - \mu_B}{\sigma_B} + \beta$$
[0108] where $\hat{x}$ denotes the features after conditional normalization; $x$ denotes the input feature vector; $\mu_B$ denotes the mean of the features in the current batch; and $\sigma_B$ denotes the standard deviation of the features in the current batch.
[0109] The normalization operation first adjusts the feature distribution to a standard distribution with a mean of 0 and a variance of 1, and then scales and shifts it by $\gamma$ and $\beta$, so that the normalization behavior adapts to the signal quality.
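A minimal sketch of signal-quality-aware conditional normalization follows. It assumes that the scaling and offset parameters are affine functions of the scalar quality value Q, one pair per feature dimension, and it adds a small constant `eps` for numerical stability; both are assumptions, and the weights are fixed numbers standing in for trained parameters. All names are hypothetical.

```python
import math

def cond_params(q, w_gamma, b_gamma, w_beta, b_beta):
    """Generate per-dimension scale and offset from the quality value q."""
    gamma = [w * q + b for w, b in zip(w_gamma, b_gamma)]
    beta = [w * q + b for w, b in zip(w_beta, b_beta)]
    return gamma, beta

def cond_norm(batch, gamma, beta, eps=1e-5):
    """Normalize each feature dimension over the batch, then scale and shift."""
    n, d = len(batch), len(batch[0])
    mean = [sum(row[j] for row in batch) / n for j in range(d)]
    std = [math.sqrt(sum((row[j] - mean[j]) ** 2 for row in batch) / n + eps)
           for j in range(d)]
    return [[gamma[j] * (row[j] - mean[j]) / std[j] + beta[j] for j in range(d)]
            for row in batch]

h_img, h_phy = 0.8, 1.5          # illustrative entropies of the two modalities
q = max(h_img, h_phy)            # the larger entropy is the condition
gamma, beta = cond_params(q, w_gamma=[0.5, 0.5], b_gamma=[0.25, 0.25],
                          w_beta=[0.0, 0.0], b_beta=[0.0, 0.0])
out = cond_norm([[1.0, 10.0], [3.0, 30.0]], gamma, beta)
```

With these illustrative weights the scale evaluates to 1 and the offset to 0, so the output reduces to plain batch normalization; other weight values would make the scale and shift vary with Q.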
[0110] After downsampling by the learnable pooling operator, a learnable convolution transformation, and conditional normalization, the complete processing flow for the features of each level is:
$$F_v^{(l)} = \mathrm{ReLU}\left(\mathrm{CondNorm}\left(\mathrm{Conv1D}_l\left(\mathrm{LPool}_l\left(F_v^{(l-1)}\right)\right)\right)\right)$$
[0111] where $F_v^{(l)}$ denotes the features of the v-th variable after processing at the l-th level, and $F_v^{(l-1)}$ denotes the input to that level; $\mathrm{ReLU}$ denotes the linear rectification activation function, which sets negative values to zero and keeps positive values unchanged, thus introducing nonlinearity; $\mathrm{CondNorm}$ denotes signal-quality-aware conditional normalization; $\mathrm{Conv1D}_l$ denotes the one-dimensional convolutional transformation of the l-th level, used to extract higher-level features; and $\mathrm{LPool}_l$ denotes the learnable multi-scale pooling downsampling operation.
[0112] S305. Adaptive aggregation of features at each path and level through attention weighting is performed to highlight the feature representation of important levels and suppress noise interference from irrelevant levels.
[0113] The attention weight of the l-th level is calculated as:
$$a_l = \mathrm{softmax}_l\left(u_l^{\top} \tanh\left(W_a F_l + b_a\right)\right)$$
[0114] where $a_l$ denotes the attention weight of the l-th level; $u_l$ denotes the learnable importance vector of the l-th level, which is adjusted automatically through end-to-end training and measures the contribution of the features at each level to the classification task; $u_l^{\top}$ denotes the transpose of $u_l$; $F_l$ denotes the feature vector of the l-th level; $b_a$ denotes the bias parameter of the attention mechanism; $\tanh$ denotes the hyperbolic tangent activation function; $W_a$ denotes the weight matrix of the attention mechanism; and $\mathrm{softmax}$ denotes the normalized exponential function, which ensures that the attention weights across all levels sum to 1.
[0115] The aggregated feature is obtained by weighting and summing the features at each level according to the attention weights:
$$F_{\mathrm{agg}} = \sum_{l} a_l F_l$$
[0116] where $F_{\mathrm{agg}}$ denotes the aggregated feature and the summation runs over all levels. The larger the attention weight $a_l$, the more important the l-th level feature is to the classification task and the higher its proportion in the aggregation result.
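The attention-weighted aggregation over levels can be sketched as follows. The sketch assumes that the features of all levels have already been brought to a common dimension, which the text does not state explicitly, and it uses fixed illustrative values in place of the trained importance vectors, weight matrix, and bias. All names are hypothetical.

```python
import math

def attention_aggregate(levels, importance, w_att, b_att):
    """Score each level with u_l^T tanh(W F_l + b), Softmax, then weighted sum."""
    scores = []
    for feat, u in zip(levels, importance):
        hidden = [math.tanh(sum(w * x for w, x in zip(row, feat)) + b)
                  for row, b in zip(w_att, b_att)]
        scores.append(sum(ui * hi for ui, hi in zip(u, hidden)))
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(levels[0])
    aggregated = [sum(w * f[i] for w, f in zip(weights, levels))
                  for i in range(dim)]
    return weights, aggregated

levels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy features for 3 levels
importance = [[1.0, 1.0]] * 3                   # stand-in importance vectors
w_att = [[1.0, 0.0], [0.0, 1.0]]
b_att = [0.0, 0.0]
weights, aggregated = attention_aggregate(levels, importance, w_att, b_att)
```

In this toy case the third level has the largest score and therefore the largest share of the aggregated feature, while the first two levels receive equal weights.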
[0117] After the independent-variable path and the joint-variable path have been processed through the above steps, each path yields a set of multi-level features. Both paths use the same attention aggregation scheme to adaptively aggregate their multi-level features. The aggregated features of the independent-variable path and the aggregated features of the joint-variable path are then concatenated along the feature dimension to form the complete multi-scale feature representation.
[0118] The technical solution of this application embodiment uses an attention mechanism to weighted aggregate features at each level. The learnable level importance vector can automatically identify the discriminative contribution of features at different scales, highlighting the feature representation of important levels and suppressing noise interference from irrelevant levels. Dual-path wavelet decomposition preserves the rigorous mathematical structure of discrete wavelet transform, avoiding the mathematical uninterpretability problem of multi-scale decomposition in pure convolutional networks. Simultaneously, the introduction of learnable pooling operators enables end-to-end gradient backpropagation and joint optimization of the entire technical solution, achieving a balance between mathematical safeguards and task adaptability.
[0119] S4. Map the multi-scale feature representation into a category-aware dictionary space with physical-state manifold constraints. Sparse projection is performed based on feature similarity and manifold topological constraints to generate sparse response vectors. A continuous feature distribution descriptor is then generated based on the sparse response vectors.
[0120] The technical solution of this application employs sparse coding of neural feature packages and statistical structures to encode and statistically process multi-scale features. The category-aware dictionary space includes multiple feature package prototypes, with topological adjacency relationships of the same category established between these prototypes. When performing sparse projection based on feature similarity and manifold topological constraints, the projection response is first calculated based on the similarity between the input features and each feature package prototype. Then, the projection response is constrained based on the topological adjacency relationships between the feature package prototypes. Finally, the constrained projection response is sparsified to generate a sparse response vector. Continuous feature distribution descriptors are generated through kernel density estimation, and a Gaussian kernel function is used to fit the probability density distribution of the sparse response vector.
[0121] We consider K×B feature package prototypes as anchor points on the manifold, where K represents the number of categories and B represents the number of feature package prototypes assigned to each category. In this embodiment, there are four categories of aerial targets: flocks of birds, civil aircraft, balloons, and helicopters, therefore K=4. The optimal value of B is adjusted between 7 and 50 depending on the specific dataset. The adjustment is determined by optimization on the validation set, and the adjustment range needs to balance classification accuracy and computational efficiency.
[0122] The total capacity of the dictionary is set to K×B, and the dictionary structure is defined as:
$$D = \left\{p_{1,1}, \dots, p_{1,B},\; p_{2,1}, \dots, p_{2,B},\; \dots,\; p_{K,1}, \dots, p_{K,B}\right\}$$
[0123] where D denotes the dictionary; $p_{K,B}$ denotes the B-th feature package prototype vector of the K-th class; $p_{1,B}$ denotes the B-th feature package prototype vector of class 1; and $p_{2,B}$ denotes the B-th feature package prototype vector of class 2. Each feature package prototype is a high-dimensional vector representing a typical feature pattern of its target class.
[0124] Each feature package prototype is connected by a topological adjacency relationship of the same category. Topological connections are established between prototypes of the same category, but not between prototypes of different categories. That is, topological adjacency weights are calculated based on the Euclidean distance only when two prototypes belong to the same category; the topological adjacency weights between prototypes of different categories are zero. The formula for calculating topological adjacency relationships is as follows:
[0125] In the formula, the terms denote, respectively: the topological adjacency weight between prototype i and prototype j; the i-th and j-th feature package prototype vectors; the topological neighborhood scale parameter, which controls the radius of topological connections; the indicator function, which takes the value 1 when the condition inside the parentheses is true and 0 otherwise; the category to which the i-th prototype belongs; the category to which the j-th prototype belongs; and the squared Euclidean distance between the two prototype vectors.
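The same-category adjacency rule can be illustrated with a short sketch. The functional form is an assumption: a Gaussian of the squared Euclidean distance, `exp(-d^2 / (2 * sigma^2))`, multiplied by the same-category indicator, with self-connections excluded. The text only fixes the ingredients (squared distance, scale parameter, indicator), not this exact expression. All names are hypothetical.

```python
import math

def adjacency(prototypes, labels, sigma=1.0):
    """Same-category topological adjacency weights (assumed Gaussian form)."""
    n = len(prototypes)
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Indicator: connect only distinct prototypes of the same category.
            if i != j and labels[i] == labels[j]:
                d2 = sum((a - b) ** 2
                         for a, b in zip(prototypes[i], prototypes[j]))
                weights[i][j] = math.exp(-d2 / (2 * sigma ** 2))
    return weights

# Prototypes 0 and 1 share a category; prototype 2 belongs to another one.
prototypes = [[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]]
labels = [0, 0, 1]
A = adjacency(prototypes, labels)
```

Prototypes 0 and 2 coincide in feature space yet receive zero weight because their categories differ, which is the behaviour the indicator function is meant to enforce.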
[0126] Sparse projection is performed based on feature similarity and manifold topological constraints. When the input feature vector is projected onto the manifold space, the technical solution of this application computes both the similarity between the input feature and each feature package prototype and the aggregated response of the prototypes within the topological neighborhood. This dual constraint ensures that the projection result not only reflects the degree of matching between the input feature and individual prototypes, but is also constrained by the neighboring prototypes of the same category, enhancing the stability and robustness of the sparse representation.
[0127] The formula for calculating sparse response is expressed as follows:
[0128] In the formula, the terms denote, respectively: the projected response of the j-th prototype; the feature vector of the multi-scale feature representation at the t-th time step; the projection scale parameter, which controls the rate at which similarity decays; the neighborhood set of prototype j on the topological graph; the k-th feature package prototype vector; and the topological adjacency weight between prototype j and prototype k.
[0129] Calculate the projected response for each feature bag prototype in the category-aware dictionary space. The projected response vector is obtained.
[0130] A sparsification operation is applied to the projected response vector: only the s largest responses are retained, and all other responses are set to zero:
$$z = \mathrm{TopK}\left(r,\, s\right)$$
[0131] where z denotes the sparse response vector generated by manifold-constrained sparse projection; s denotes the sparsity, which is 3 in the embodiments of this application; r denotes the projected response vector; and $\mathrm{TopK}$ returns a vector with the same shape as the input vector in which only the s largest entries are kept.
[0132] In the technical solutions of the embodiments of this application, the sparsity value is determined from the total number of feature package prototypes and the information-preservation principle of sparse coding. When the number of prototypes B per class is between 7 and 50, retaining the 3 largest responses covers the vast majority of effective projected responses, with a cumulative contribution rate exceeding 85%, while keeping the sparsity of the response vector above 90%, which balances preservation of discriminative information against computational efficiency. Too small a value causes information loss, while too large a value reduces sparsity.
[0133] Applying TopK sparsification to the projected response vector and retaining only the largest responses reduces the computational complexity of the subsequent kernel density estimation. Sparsification also acts as a form of feature selection, filtering out noise interference from low responses.
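TopK sparsification itself is simple to illustrate. The sketch below keeps the s largest entries of a response vector and zeroes the rest; the function name and the example responses are hypothetical.

```python
def top_k_sparsify(response, s=3):
    """Keep the s largest responses and set all others to zero."""
    order = sorted(range(len(response)), key=lambda i: response[i], reverse=True)
    keep = set(order[:s])
    return [value if i in keep else 0.0 for i, value in enumerate(response)]

r = [0.05, 0.40, 0.10, 0.30, 0.02, 0.13]
z = top_k_sparsify(r, s=3)
# z == [0.0, 0.40, 0.0, 0.30, 0.0, 0.13]
```

The output has the same length as the input, so later stages can still index it by prototype position.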
[0134] The continuous feature distribution descriptor is generated through kernel density estimation. In this embodiment, the kernel density estimation uses a Gaussian kernel function to fit the probability density distribution of the sparse response vector, replacing the traditional discrete histogram binning operation.
[0135] Kernel density estimation is performed on the sparse response vector, and the probability density function is calculated as:
$$\hat{f}(x) = \frac{1}{T h} \sum_{t=1}^{T} G\left(\frac{x - z_t}{h}\right)$$
[0136] where $\hat{f}(x)$ denotes the probability density function; x denotes the independent variable of the probability density function, reflecting the position in the feature space; T denotes the length of the sparse response vector; $z_t$ denotes the value of the t-th element of the sparse response vector; h denotes the kernel bandwidth parameter, which controls the smoothness of the kernel function; and G denotes the Gaussian kernel function.
[0137] The Gaussian kernel function G is expressed as:
$$G(u) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{u^{2}}{2}\right)$$
[0138] where $G(u)$ denotes the Gaussian kernel function; u denotes the independent variable of the Gaussian kernel function; exp denotes the exponential function; and $\pi$ denotes the circle constant pi.
[0139] In the technical solution of this application embodiment, the Gaussian kernel function has the characteristics of smoothness, continuity and differentiability, which makes the kernel density estimation result continuously differentiable and supports end-to-end gradient backpropagation optimization.
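[0140] The kernel density estimate is sampled at the locations of the K×B feature bag prototypes to generate the continuous feature distribution descriptor:
$$\mathrm{CFD} = \left[\hat{f}(p_1),\; \hat{f}(p_2),\; \dots,\; \hat{f}(p_{K \times B})\right]$$
[0141] where $\hat{f}(p_1)$, $\hat{f}(p_2)$, ..., $\hat{f}(p_{K \times B})$ denote the probability density values obtained by kernel density estimation at the locations of the feature bag prototype vectors $p_1$, $p_2$, ..., $p_{K \times B}$, respectively.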
[0142] In the technical solution of this application embodiment, the Continuous Feature Distribution Descriptor (CFD) is a K×B dimensional vector, where each element represents the kernel density estimate at the corresponding feature bag prototype location. CFD replaces the discrete histogram in traditional feature bag methods, eliminating binning boundary effects and making the feature representation continuously differentiable.
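A fixed-bandwidth version of the descriptor can be sketched as follows. The sketch treats the sparse response vector as a set of scalar samples and evaluates the Gaussian kernel density estimate at a few scalar query positions; using scalar query positions in place of prototype locations is a simplifying assumption, and the bandwidth value and all names are hypothetical.

```python
import math

def gaussian_kernel(u):
    """Standard Gaussian kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, samples, h):
    """Fixed-bandwidth kernel density estimate at position x."""
    return sum(gaussian_kernel((x - z) / h) for z in samples) / (len(samples) * h)

def descriptor(samples, query_points, h):
    """Sample the density estimate at the given query positions."""
    return [kde(q, samples, h) for q in query_points]

z = [0.0, 0.40, 0.0, 0.30, 0.0, 0.13]      # a TopK-sparsified response vector
cfd = descriptor(z, query_points=[0.0, 0.2, 0.4], h=0.1)

# Riemann-sum check that the estimate integrates to approximately 1.
area = sum(kde(-2.0 + 0.01 * i, z, 0.1) * 0.01 for i in range(500))
```

Because the estimate is a sum of smooth kernels, each descriptor entry varies continuously with the responses, unlike a histogram count that jumps when a value crosses a bin edge.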
[0143] The kernel bandwidth parameter employs an adaptive strategy, automatically adjusting based on the local density of the sparse response. The expression is:
[0144] In the formula, the terms denote, respectively: the adaptive kernel bandwidth; the global baseline bandwidth, initialized using Silverman's rule; and the pilot density value estimated with the initial fixed bandwidth, that is, the probability density value obtained by fixed-bandwidth kernel density estimation at the corresponding location.
[0145] The technical solution of this application embodiment uses an adaptive strategy to enable high-density areas... Larger bandwidth can be used to improve resolution in low-density areas. Using a smaller bandwidth ensures smoothness and improves the fitting accuracy of the probability density distribution.
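One way to realize such an adaptive bandwidth is sketched below. The exact expression used in this application is not reproduced here. The sketch assumes the common square-root law, in which the local bandwidth is the baseline bandwidth scaled by the inverse square root of the pilot density, normalized by the geometric mean of the pilot densities. The function names and sample values are hypothetical.

```python
import math
import statistics

def gauss(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def silverman_bandwidth(values):
    """Silverman's rule of thumb: 0.9 * min(std, IQR / 1.34) * n^(-1/5)."""
    n = len(values)
    std = statistics.stdev(values)
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    spread = min(std, iqr / 1.34) if iqr > 0 else std
    return 0.9 * spread * n ** (-0.2)

def adaptive_bandwidths(values):
    """Per-sample bandwidths h_t = h0 * (pilot(s_t) / g) ** -0.5 (assumed form)."""
    h0 = silverman_bandwidth(values)
    n = len(values)
    # Pilot density at each sample, estimated with the fixed bandwidth h0.
    pilot = [sum(gauss((s - v) / h0) for v in values) / (n * h0) for s in values]
    # Geometric mean keeps the average bandwidth close to h0.
    g = math.exp(sum(math.log(p) for p in pilot) / n)
    return h0, [h0 * (p / g) ** -0.5 for p in pilot]

values = [0.00, 0.02, 0.03, 0.05, 0.06, 0.08, 0.95]  # dense cluster plus an outlier
h0, h = adaptive_bandwidths(values)
```

In this example the isolated value 0.95 sits in a low-density region and receives a larger bandwidth than the values inside the dense cluster.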
[0146] S5. Perform classification prediction based on the continuous feature distribution descriptor CFD, iteratively optimize the dictionary mapping relationship and manifold topology based on the classification results until the classification results converge, and output the infrared weak target classification results.
[0147] The classification prediction process employs a joint loss function for optimization. This joint loss function includes prototype distance loss, mapping supervision loss, manifold regularization term, and Gaussian filter regularization term.
[0148] The technical solution in this application uses a two-layer fully connected structure to map the continuous feature distribution descriptor to a predefined label space for classification prediction. The expression for calculating the classification prediction is:
[0149] $$z = \mathrm{ReLU}\left(W_{1}\,\mathrm{CFD} + b_{1}\right), \qquad \hat{y} = \mathrm{softmax}\left(W_{2}\,z + b_{2}\right)$$
[0150] where $z$ represents the hidden layer feature vector; ReLU represents the rectified linear unit activation function; $W_{1}$ represents the hidden layer weight matrix; $b_{1}$ represents the hidden layer bias; $\hat{y}$ represents the predicted category probability distribution; $W_{2}$ represents the output layer weight matrix; and $b_{2}$ represents the output layer bias.
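The two-layer fully connected mapping can be sketched as a plain forward pass. The softmax output normalization is an assumption made so that the output is a probability distribution. All sizes, weights, and names below are hypothetical.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def affine(weights, bias, v):
    """Compute W v + b for a row-major weight matrix."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(cfd, w_hidden, b_hidden, w_out, b_out):
    hidden = relu(affine(w_hidden, b_hidden, cfd))
    return hidden, softmax(affine(w_out, b_out, hidden))

# Hypothetical sizes: 4-dimensional CFD, 3 hidden units, 2 classes.
cfd = [0.2, 1.4, 0.1, 0.6]
w_hidden = [[0.5, -0.2, 0.1, 0.0], [-0.3, 0.8, 0.0, 0.4], [0.1, 0.1, -0.6, 0.2]]
b_hidden = [0.0, 0.1, -0.5]
w_out = [[1.0, -1.0, 0.5], [-0.5, 1.2, 0.3]]
b_out = [0.0, 0.0]
hidden, probs = classify(cfd, w_hidden, b_hidden, w_out, b_out)
```

The hidden vector is returned alongside the probabilities because the prototype distance loss described later operates on the hidden layer output.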
[0151] The centroid of the B feature bag prototype vectors of each category in the structured dictionary is used as the category prototype. The category prototype $c_{k}$ of the k-th class is calculated as:
$$c_{k} = \frac{1}{B} \sum_{j=1}^{B} d_{k,j}$$
[0152] where $c_{k}$ represents the class prototype of the k-th class, and $d_{k,j}$ represents the j-th feature bag prototype vector of the k-th class.
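The centroid computation is straightforward, and a short sketch makes the data layout explicit. The dictionary contents and the name `class_prototype` are hypothetical.

```python
def class_prototype(bag_prototypes):
    """Centroid of the B feature bag prototype vectors of one class."""
    count = len(bag_prototypes)
    dim = len(bag_prototypes[0])
    return [sum(p[d] for p in bag_prototypes) / count for d in range(dim)]

# Hypothetical dictionary: K = 2 classes, B = 3 prototypes each, 2-D features.
dictionary = {
    0: [[0.0, 0.0], [0.2, 0.1], [0.1, 0.2]],
    1: [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1]],
}
prototypes = {k: class_prototype(v) for k, v in dictionary.items()}
```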
[0153] The prototype distance loss compares the manifold geodesic distance between a sample feature and its correct class prototype with the manifold geodesic distance between the sample feature and the nearest incorrect class prototype. The prototype distance loss is calculated as:
$$L_{proto} = \frac{1}{N} \sum_{i=1}^{N} \left[ d_{M}(z_{i}, c_{y_{i}}) - \min_{j \neq y_{i}} d_{M}(z_{i}, c_{j}) + \mathrm{margin} \right]_{+}$$
[0154] where N represents the total number of samples; $z_{i}$ represents the hidden layer output feature vector of the i-th sample; $c_{y_{i}}$ represents the category prototype of the category to which the i-th sample belongs; $y_{i}$ represents the true class label of the i-th sample; $c_{j}$ represents the category prototype of the j-th class; $d_{M}(\cdot,\cdot)$ represents the manifold geodesic distance metric, used to calculate the shortest path length between two points on the manifold; margin represents the margin hyperparameter, with margin = 1.0 in this embodiment of the application; and $[\cdot]_{+}$ represents the positive-part operation, that is, $[a]_{+} = \max(0, a)$.
[0155] The value of the margin hyperparameter is determined from the scale of the distance distributions between same-class and different-class feature vectors in the feature space. The manifold geodesic distance between same-class sample features ranges from 0.5 to 0.8, while the manifold geodesic distance between different-class sample features ranges from 1.5 to 2.0. A margin that is too small blurs the class decision boundaries, while a margin that is too large makes the model difficult to converge. In this embodiment, the margin is set to 1.0, which lies within the interval between the upper limit of the same-class distance and the lower limit of the different-class distance, balancing discrimination clarity against convergence stability. The prototype distance loss forces the manifold geodesic distance between a sample feature and the correct class prototype to be smaller than the manifold geodesic distance between the sample feature and the nearest incorrect class prototype by at least the margin; that is, the correct distance plus the margin does not exceed the nearest incorrect distance. When the nearest incorrect distance exceeds the correct distance by at least the margin, the loss is zero; when it exceeds the correct distance by less than the margin, the loss is positive, driving the model to widen the gap between the two distances. This forms a class decision boundary of width margin in the feature space.
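The margin behavior described above can be checked with a small sketch. Euclidean distance is used here only as a stand-in for the manifold geodesic distance, and averaging over the samples is an assumption. The prototypes and features are hypothetical.

```python
import math

def distance(a, b):
    # Euclidean stand-in; the source uses a manifold geodesic distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def prototype_distance_loss(features, labels, prototypes, margin=1.0):
    """Hinge on (correct distance - nearest incorrect distance + margin)."""
    total = 0.0
    for z, y in zip(features, labels):
        d_correct = distance(z, prototypes[y])
        d_wrong = min(distance(z, c) for k, c in prototypes.items() if k != y)
        total += max(0.0, d_correct - d_wrong + margin)
    return total / len(features)

prototypes = {0: [0.0, 0.0], 1: [2.0, 0.0]}
well_separated = prototype_distance_loss([[0.1, 0.0]], [0], prototypes)
ambiguous = prototype_distance_loss([[0.9, 0.0]], [0], prototypes)
```

The first sample is 0.1 from its own prototype and 1.9 from the other, so the gap exceeds the margin and the loss is zero. The second sample is 0.9 and 1.1 away respectively, so the gap of 0.2 falls short of the margin and the loss is positive.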
[0156] A supervision signal is introduced during the sparse mapping process by assigning an expected label R to the mapping result Y of each input X. The mapping supervision loss uses the mean squared error, calculated as follows:
$$L_{map} = \mathrm{MSE}(Y, R)$$
[0157] where $L_{map}$ represents the mapping supervision loss; MSE represents the mean squared error function; Y represents the output of the sparse mapping; and R represents the expected label matrix.
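A minimal sketch of the mean squared error between the mapping output and the expected label matrix follows. The matrices are hypothetical, and averaging over all matrix entries is an assumption about the normalization.

```python
def mapping_supervision_loss(mapped, expected):
    """Mean squared error between mapping outputs Y and expected labels R."""
    squared = [(y - r) ** 2
               for row_y, row_r in zip(mapped, expected)
               for y, r in zip(row_y, row_r)]
    return sum(squared) / len(squared)

Y = [[0.9, 0.1], [0.2, 0.7]]   # hypothetical sparse mapping outputs
R = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical expected labels
loss = mapping_supervision_loss(Y, R)
```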
[0158] The manifold regularization term applies manifold smoothness regularization during training to ensure that topologically adjacent feature bag prototypes maintain their nearest-neighbor relationship in the feature space. The formula for calculating the manifold regularization term is:
[0159] where $L_{manifold}$ represents the manifold regularization term, and $w_{ij}$ represents the topological adjacency weight between prototype i and prototype j.
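A common form of manifold smoothness regularization weights the squared distance between each pair of prototypes by their adjacency weight. The sketch below assumes that form, which may differ from the exact expression of this application. The adjacency matrix and prototype values are hypothetical.

```python
def manifold_regularizer(prototypes, adjacency):
    """Sum of w_ij * ||d_i - d_j||^2 over all prototype pairs (assumed form)."""
    total = 0.0
    for i, d_i in enumerate(prototypes):
        for j, d_j in enumerate(prototypes):
            sq_dist = sum((a - b) ** 2 for a, b in zip(d_i, d_j))
            total += adjacency[i][j] * sq_dist
    return total

# Hypothetical chain topology 0 - 1 - 2 with symmetric unit weights.
adjacency = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
smooth = manifold_regularizer([[0.0], [0.1], [0.2]], adjacency)
broken = manifold_regularizer([[0.0], [1.0], [0.2]], adjacency)
```

Prototypes that stay close to their topological neighbors give a small penalty, while moving the middle prototype away from both neighbors raises it sharply.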
[0160] The Gaussian filter regularization term penalizes abrupt changes of the features between adjacent time steps. The formula for calculating the Gaussian filter regularization term is:
[0161] where $L_{gauss}$ represents the Gaussian filter regularization term; $\Delta t$ represents the interval between adjacent time steps; the hyperparameter $\sigma$ represents the Gaussian kernel width; a and b represent the weighting coefficients; $x_{t,k}$ represents the value of the k-th feature dimension at the t-th time step; and $x_{t-1,k}$ represents the value of the k-th feature dimension at the (t-1)-th time step. The Gaussian weight decays as the time interval increases, so the penalty weight is largest for adjacent time steps and smallest for distant time steps.
[0162] The technical solution of this application addresses the common impulse noise and salt-and-pepper noise in signals from large-aperture photoelectric measurement equipment by designing a Gaussian filter regularization term. By penalizing abrupt changes between adjacent states, it is equivalent to a differentiable low-pass filter operation in the frequency domain, suppressing noise without disrupting the timing structure of the effective signal.
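The smoothing effect can be illustrated with a simplified penalty. This sketch is not the expression of this application: it assumes a Gaussian weight exp(-Δt² / (2σ²)) applied to the squared difference between adjacent time steps, and it omits the weighting coefficients a and b because their role cannot be recovered from the text. The sequences are hypothetical.

```python
import math

def gaussian_filter_regularizer(sequence, dt=1.0, sigma=1.0):
    """Gaussian-weighted penalty on squared differences between adjacent steps."""
    weight = math.exp(-(dt ** 2) / (2.0 * sigma ** 2))
    penalty = 0.0
    for prev, curr in zip(sequence, sequence[1:]):
        penalty += weight * sum((c - p) ** 2 for c, p in zip(curr, prev))
    return penalty

ramp = [[0.0], [0.1], [0.2], [0.3], [0.4]]
spike = [[0.0], [0.1], [2.0], [0.3], [0.4]]   # one impulsive sample
ramp_penalty = gaussian_filter_regularizer(ramp)
spike_penalty = gaussian_filter_regularizer(spike)
```

A slowly varying sequence is barely penalized, whereas a single impulsive sample dominates the penalty, which matches the noise-suppression role described above.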
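[0163] The weighted sum of the four loss terms yields the final joint loss function:
$$L = \lambda_{1} L_{proto} + \lambda_{2} L_{map} + \lambda_{3} L_{manifold} + \lambda_{4} L_{gauss}$$
[0164] where L represents the joint loss function, and $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$, and $\lambda_{4}$ represent the weighting coefficients of the prototype distance loss $L_{proto}$, the mapping supervision loss $L_{map}$, the manifold regularization term $L_{manifold}$, and the Gaussian filter regularization term $L_{gauss}$, respectively. In the embodiments of this application, $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$, and $\lambda_{4}$ are 1.0, 0.5, 0.1, and 0.3, respectively.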
[0165] The values of the weight coefficients in the joint loss function are determined based on the magnitude and functional priority of each loss term. The prototype distance loss serves as the dominant term and constrains the class decision boundary; the mapping supervision loss serves as the secondary term and guides the sparse coding process; and the manifold regularization and Gaussian filter regularization terms serve as auxiliary constraints that maintain manifold smoothness and suppress temporal noise. The weight coefficients are balanced after the independent contribution of each loss term is evaluated on a validation set, ensuring that the gradient magnitudes of the terms are comparable. Too small a value of $\lambda_{1}$ (prototype distance loss) results in weak decision boundary constraints and insufficient separation between classes. An excessively large $\lambda_{2}$ (mapping supervision loss) may weaken the dominance of the prototype distance loss. Excessively large values of $\lambda_{3}$ and $\lambda_{4}$ (the two regularization terms) lead to overly strong regularization and underfitting of the model, while excessively small values make the regularization ineffective and lead to overfitting. In the embodiments of this application, $\lambda_{1}$, $\lambda_{2}$, $\lambda_{3}$, and $\lambda_{4}$ are set to 1.0, 0.5, 0.1, and 0.3, respectively, achieving a reasonable balance between the dominant loss and the regularization constraints. Each weight coefficient can be adjusted according to the specific task: $\lambda_{1}$ should be increased when categories are easily confused, to strengthen the discrimination constraint; $\lambda_{4}$ should be increased when noise interference is severe, to strengthen the smoothness constraint; $\lambda_{3}$ should be increased when intra-class feature dispersion is large, to strengthen the compactness constraint; and $\lambda_{2}$ should be adjusted according to the stability of the sparse coding. This application does not limit these values.
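The weighted combination can be written in a few lines. The weights are the values stated in this embodiment; the individual loss values and the dictionary keys are hypothetical placeholders.

```python
# Weights from this embodiment: prototype 1.0, mapping 0.5, manifold 0.1, Gaussian 0.3.
LOSS_WEIGHTS = {"prototype": 1.0, "mapping": 0.5, "manifold": 0.1, "gaussian": 0.3}

def joint_loss(terms, weights=LOSS_WEIGHTS):
    """Weighted sum of the four loss terms."""
    return sum(weights[name] * value for name, value in terms.items())

# Hypothetical per-term loss values for one training batch.
terms = {"prototype": 0.8, "mapping": 0.0375, "manifold": 0.04, "gaussian": 0.05}
total = joint_loss(terms)
```

Keeping the weights in one mapping makes it easy to rebalance a single term for a specific task without touching the loss implementations.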
[0166] The sparse encoding process of neural feature packets and the classification prediction process are organized into nested iterative loops. Iterative optimization includes the following steps: updating the dictionary mapping relationship based on the classification result; updating the manifold topology based on the classification result; and re-executing the classification prediction based on the updated dictionary mapping relationship and manifold topology until the classification result converges.
[0167] In each training iteration, the encoding process first performs a manifold-constrained sparse projection on the input features and generates a continuous feature distribution descriptor (CFD) through kernel density estimation. The classification prediction process then generates intermediate classification results based on the CFD. The intermediate classification results are fed back to the encoding process through the joint loss function L to optimize the dictionary mapping relationship (i.e., to update the feature bag prototype vectors), the manifold topology (i.e., to update the topological adjacency weights), and the sparse coding parameters, which include the projection scale parameters and the kernel bandwidth parameters.
[0168] Specifically, the dictionary mapping is updated based on the classification results as follows: the gradient of the joint loss function L with respect to the feature bag prototype vectors is calculated, and the prototypes are updated along the gradient descent direction, causing same-class prototypes to move closer to the sample features and different-class prototypes to move away from them.
[0169] The manifold topology is updated based on the classification results as follows: the gradient of the joint loss function L with respect to the topological adjacency weights is calculated and used to adjust the connection strength and range of the topological neighborhood, so that the manifold structure better adapts to the data distribution.
[0170] Based on the updated dictionary mapping and manifold topology, re-execute classification prediction: re-execute the manifold-constrained sparse projection and kernel density estimation in step S4 using the updated feature bag prototype vector and topological adjacency weights to generate a new continuous feature distribution descriptor CFD, and then generate a new classification result through classification prediction.
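The predict, update, and re-predict loop can be sketched with a deliberately simplified stand-in. Instead of gradients of the joint loss, the sketch uses a nearest-prototype classifier with a pull and push update, which reproduces only the qualitative behavior described above: same-class prototypes move toward the sample and wrongly winning prototypes move away. The manifold topology update is omitted. All names and data are hypothetical.

```python
def nearest(sample, prototypes):
    """Label of the prototype closest to the sample (squared Euclidean distance)."""
    return min(prototypes,
               key=lambda k: sum((a - b) ** 2 for a, b in zip(sample, prototypes[k])))

def train(samples, labels, prototypes, rate=0.2, max_iters=50):
    """Repeat predict -> update until the predictions stop changing."""
    previous = None
    for iteration in range(max_iters):
        predictions = [nearest(x, prototypes) for x in samples]
        if predictions == previous:       # classification results have converged
            return predictions, iteration
        previous = predictions
        for x, y, y_hat in zip(samples, labels, predictions):
            # Pull the correct prototype toward the sample.
            prototypes[y] = [p + rate * (a - p) for p, a in zip(prototypes[y], x)]
            if y_hat != y:
                # Push the wrongly winning prototype away from the sample.
                prototypes[y_hat] = [p - rate * (a - p)
                                     for p, a in zip(prototypes[y_hat], x)]
    return previous, max_iters

samples = [[0.0], [0.2], [1.0], [1.2]]
labels = [0, 0, 1, 1]
prototypes = {0: [0.9], 1: [1.15]}        # poorly initialized on purpose
predictions, iterations = train(samples, labels, prototypes)
```

The loop stops as soon as two consecutive passes give identical predictions, which is one simple way to express "until the classification result converges".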
[0171] After multiple iterations, the dictionary prototype gradually converges into representative feature patterns of each category, and the manifold structure gradually characterizes the inherent laws of the target's physical state, ultimately outputting stable infrared weak target classification results.
[0172] The embodiments of this application verify the effectiveness of the technical solution through comparative experiments. The verification covers nine datasets, eight of which are publicly available datasets in the prior art and one of which is a self-built dataset. The comparison methods are InceptionTime, ResNet, Transformer, LSTM_att, and CBOSS. For each dataset, the optimal comparison method is whichever of these five algorithms performs best on that dataset, excluding the method of this application. The specific comparison data are shown in Table 1.
Table 1 Comparison of Data and Classification Performance
[0173] Based on the experimental results above, the proposed asymmetric fusion and sparse projection method for temporal classification of infrared weak targets has achieved excellent performance on multiple public datasets and a self-built experimental dataset. On AtrialFibrillation, EigenWorms, EthanolConcentration, SelfRegulationSCP1, SelfRegulationSCP2, and the self-built dataset, the classification accuracy of this application reaches or surpasses the existing best comparison methods, with an accuracy of 95% on the self-built experimental dataset. Regarding inference efficiency, the single-sample inference time of this application on all test datasets is comparable to or better than the best comparison method, with a single-sample inference time of 0.0868 seconds on the self-built dataset. These experimental results demonstrate that this application achieves a good balance between classification accuracy and computational efficiency in the infrared weak target detection task through signal quality assessment and selective information injection mechanisms, dynamic routing fusion strategies, dual-path multi-scale feature extraction, manifold-constrained sparse projection, and joint loss function optimization.
[0174] The technical solution of this application, a method for temporal classification of infrared weak targets using asymmetric fusion and sparse projection, effectively solves the problem of fusing image modalities and physical measurement modalities whose signal quality differs significantly, through signal quality assessment, a selective information injection mechanism, and a dynamic routing fusion strategy. The method calculates the information entropy of the encoded features of each modality to quantify signal quality, determines the high-signal-quality and low-signal-quality modalities based on the information entropy, injects complementary information from the high-signal-quality modality into the low-signal-quality modality, and determines the fusion contribution weight of each modality through dynamic routing fusion iterations, generating fused spatiotemporal features.
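The entropy-based quality assessment can be sketched as follows. The sketch estimates Shannon entropy from a histogram of the encoded feature values. Which direction of entropy counts as higher signal quality is not stated in this passage, so it is exposed as the hypothetical parameter `prefer_low_entropy`. The modality names and feature values are also hypothetical.

```python
import math

def feature_entropy(features, bins=8):
    """Shannon entropy (bits) of a histogram of the encoded feature values."""
    lo, hi = min(features), max(features)
    width = (hi - lo) / bins or 1.0       # guard against constant features
    counts = [0] * bins
    for value in features:
        counts[min(int((value - lo) / width), bins - 1)] += 1
    probs = [c / len(features) for c in counts if c]
    return -sum(p * math.log2(p) for p in probs)

def rank_modalities(encoded, prefer_low_entropy=True):
    """Return (high-quality name, low-quality name, entropies) under the assumed rule."""
    entropies = {name: feature_entropy(values) for name, values in encoded.items()}
    ordered = sorted(entropies, key=entropies.get, reverse=not prefer_low_entropy)
    return ordered[0], ordered[-1], entropies

encoded = {
    "infrared_image": [0.1, 0.9, 0.3, 0.7, 0.5, 0.2, 0.8, 0.4],
    "physical_measurement": [0.5, 0.5, 0.5, 0.6, 0.5, 0.5, 0.5, 0.5],
}
high, low, entropies = rank_modalities(encoded)
```

Once the two roles are assigned, the injection step would pass complementary information from the modality named `high` to the one named `low`.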
[0175] Multi-scale feature extraction is achieved in a mathematically rigorous manner through a dual-path multi-scale feature extraction method based on wavelet multi-resolution analysis and learnable multi-scale pooling operators. The independent variable path performs multi-level wavelet decomposition on each sensor variable to capture univariate dynamic features at different scales, while the joint variable path performs joint multi-level wavelet decomposition on all variables to capture cross-variable coupling relationships. Learnable multi-scale pooling operators are used for adaptive downsampling at each decomposition level, and signal quality-sensing conditional normalization is applied to normalize the features at each level. Finally, attention-weighted adaptive aggregation is used to form a multi-scale feature representation.
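The multi-level wavelet decomposition of a single sensor variable (the independent variable path) can be sketched with the Haar basis. The sketch uses fixed Haar filters rather than the learnable pooling operators, and it assumes the signal length is divisible by 2 raised to the number of levels. The signal values are hypothetical.

```python
import math

def haar_step(signal):
    """One Haar analysis level: pairwise sums and differences scaled by 1/sqrt(2)."""
    scale = 1.0 / math.sqrt(2.0)
    pairs = list(zip(signal[0::2], signal[1::2]))
    approx = [(a + b) * scale for a, b in pairs]
    detail = [(a - b) * scale for a, b in pairs]
    return approx, detail

def haar_decompose(signal, levels):
    """Return the final approximation and the detail coefficients of each level."""
    details = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        details.append(detail)
    return approx, details

signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]
approx, details = haar_decompose(signal, levels=3)
```

Each level halves the temporal resolution, so the detail coefficients at successive levels describe dynamics at progressively coarser scales, and the orthonormal scaling preserves the signal energy across the decomposition.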
[0176] By employing a nested iterative optimization approach driven by physical state manifold constraints, topological sparse projection, kernel density estimation of continuous feature distribution descriptors, and a combination of prototype distance loss and manifold regularization, intra-class diversity interference is effectively suppressed. Multi-scale feature representations are mapped to a class-aware dictionary space, constructing topological adjacency relationships between prototypes of the same class feature packages. Sparse projection is performed based on feature similarity and manifold topological constraints to generate sparse response vectors, and continuous feature distribution descriptors are generated through kernel density estimation. A joint loss function, including prototype distance loss, mapping supervision loss, manifold regularization, and Gaussian filter regularization, is used to feed the classification prediction results back to the encoding process to optimize the dictionary mapping relationships and manifold topological structure. After iterative convergence, the classification result is output.
[0177] The technical solution of this application embodiment achieves a good balance between detection accuracy, generalization ability and computational efficiency in the infrared weak target detection task of large-aperture photoelectric measurement equipment, and is suitable for engineering deployment of ground measurement stations under resource-constrained conditions.
[0178] The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each block in a flowchart or block diagram may represent a module, segment, or portion of code containing one or more executable instructions for implementing a specified logical function. It should also be noted that in some alternative implementations, the functions indicated in the blocks may occur in a different order than those indicated in the drawings. For example, two consecutively indicated blocks may actually be executed substantially in parallel, and they may sometimes be executed in reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and / or flowcharts, and combinations of blocks in the block diagrams and / or flowcharts, can be implemented using a dedicated hardware-based system that performs the specified function or operation, or using a combination of dedicated hardware and computer instructions.
[0179] The units described in the embodiments of this application can be implemented in software or hardware. In some cases, the names of the units do not limit the units themselves.
Claims
1. A method for temporal classification of infrared weak targets using asymmetric fusion and sparse projection, characterized in that it includes the following steps: performing time-series preprocessing on infrared image data and multidimensional physical quantity time-series data to generate multimodal time-series samples; performing modal feature encoding on the multimodal time-series samples, performing selective information injection according to the signal quality corresponding to each modality, and generating fused spatiotemporal features through dynamic routing fusion to complete the time-series processing; performing dual-path multi-scale feature extraction on the fused spatiotemporal features to generate a multi-scale feature representation that includes independent-variable path features and cross-variable coupling features; mapping the multi-scale feature representation to a category-aware dictionary space constrained by a physical state manifold, performing sparse projection based on feature similarity and manifold topological constraints to generate a sparse response vector, and generating a continuous feature distribution descriptor based on the sparse response vector; and performing classification prediction based on the continuous feature distribution descriptor, iteratively optimizing the dictionary mapping relationship and manifold topology based on the classification results until the classification results converge, and outputting the infrared weak target classification results.
2. The method according to claim 1, characterized in that performing selective information injection based on the signal quality includes the following steps: calculating the information entropy corresponding to the infrared image modal features and the physical measurement modal features; determining the high-signal-quality modality and the low-signal-quality modality based on the information entropy; and injecting complementary information from the high-signal-quality modality into the low-signal-quality modality.
3. The method according to claim 1, characterized in that the dynamic routing fusion includes the following steps: initializing the routing coefficient of each modality; updating each routing coefficient based on the consistency between the modal features and the fusion features; and iteratively calculating the fusion result based on the updated routing coefficients.
4. The method according to claim 1, characterized in that the dual-path multi-scale feature extraction includes the following steps: extracting univariate dynamic features through wavelet decomposition on the independent variable path; extracting cross-variable coupling features through wavelet decomposition on the joint variable path; performing adaptive downsampling of the features at each level using a learnable multi-scale pooling operator; normalizing the features at each level using signal quality-sensing conditional normalization; and performing adaptive aggregation of the features of each level and path using attention weighting.
5. The method according to claim 4, characterized in that the learnable multi-scale pooling operator is implemented using a parameterized pooling kernel, and the parameterized pooling kernel is a weighted combination of a Haar wavelet basis, a Daubechies wavelet basis, and a Gaussian derivative basis.
6. The method according to claim 1, characterized in that the category-aware dictionary space includes multiple feature bag prototypes, and the feature bag prototypes of the same category are connected by topological adjacency relationships.
7. The method according to claim 6, characterized in that performing sparse projection based on the feature similarity and the manifold topological constraints includes: calculating the projection response based on the similarity between the input features and each feature bag prototype; constraining the projection response based on the topological adjacency relationships between the feature bag prototypes; and sparsifying the constrained projection response to generate the sparse response vector.
8. The method according to claim 1, characterized in that the continuous feature distribution descriptor is generated through kernel density estimation, and the kernel density estimation uses a Gaussian kernel function to fit the probability density distribution of the sparse response vector.
9. The method according to claim 1, characterized in that the classification prediction process employs a joint loss function for optimization, and the joint loss function includes a prototype distance loss, a mapping supervision loss, a manifold regularization term, and a Gaussian filter regularization term.
10. The method according to claim 1, characterized in that the iterative optimization includes the following steps: updating the dictionary mapping relationship based on the classification results; updating the manifold topology based on the classification results; and re-executing the classification prediction based on the updated dictionary mapping relationship and manifold topology until the classification results converge.