A weakly supervised anomaly detection method based on double-branch feature fusion

By employing a dual-branch feature fusion method, combining the Mamba module and attention-enhanced residual convolution module, the problem of traditional methods struggling to capture long-range dependencies and local features under weak supervision is addressed, achieving efficient and accurate anomaly detection.

CN122241609A · Pending · Publication Date: 2026-06-19 · YANGTZE RIVER DELTA RES INST OF NPU TAICANG

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
YANGTZE RIVER DELTA RES INST OF NPU TAICANG
Filing Date
2026-05-18
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Under weak supervision, traditional anomaly detection methods struggle to capture both long-range dependencies and local features in high-dimensional complex data. Detection performance is therefore limited, especially when anomaly samples are scarce and the data is contaminated by noise.

Method used

A dual-branch feature fusion approach is adopted, which uses a Mamba-based adaptive anomaly perception module to capture global long-range dependencies and an attention-enhanced residual convolution module to extract local fine-grained features. The outputs of the two branches are then weighted and fused by the fusion module, and an anomaly score is generated from the fused features.

Benefits of technology

It achieves efficient and accurate anomaly detection in weakly supervised scenarios, and can take into account both global context understanding and accurate local feature extraction. It has strong generalization ability and practical application value.

✦ Generated by Eureka AI based on patent content.

Abstract

This invention relates to the field of data processing, specifically disclosing a weakly supervised anomaly detection method based on dual-branch feature fusion. The method first processes data samples to obtain a dataset, then inputs the dataset into a trained detection model to output anomaly prediction results. The detection model comprises four parts: a first branch module, which is a Mamba-based adaptive anomaly perception module used to capture global long-range dependencies in the dataset; a second branch module, which is an attention-enhanced residual convolution module used to extract local fine-grained features from the dataset; a fusion module, used to fuse the outputs of the two branch modules by weighting to obtain fused features; and an anomaly score generator, used to output the degree of anomaly corresponding to the fused features. This method can simultaneously consider global context understanding and accurate local feature extraction. Even with a small number of labeled anomaly samples and noise contamination, it still achieves optimal weakly supervised anomaly detection performance, demonstrating strong generalization ability and practical application value.

Description

Technical Field

[0001] This invention relates to the field of data processing technology, and specifically to a weakly supervised anomaly detection method based on dual-branch feature fusion.

Background Technology

[0002] Anomaly detection is a crucial research area in data mining and machine learning, aiming to identify anomalous samples that significantly deviate from normal patterns in large-scale datasets. It has wide-ranging applications in fields such as financial fraud detection, network intrusion detection, medical diagnosis, and industrial fault early warning. However, in real-world applications, anomalous samples are typically extremely scarce and difficult to fully label, making traditional fully supervised learning methods ineffective. Therefore, achieving efficient and accurate anomaly detection under weakly supervised conditions with only a small number of labeled anomalous samples is a critical problem that urgently needs to be solved.

[0003] In recent years, deep learning-based anomaly detection methods have attracted widespread attention due to their powerful feature learning capabilities. Convolutional Neural Networks (CNNs), as a representative deep learning model, can automatically learn multi-level nonlinear feature representations and have made significant progress in anomaly detection tasks. However, due to the inherent local receptive field limitations of convolution operations, CNNs struggle to effectively capture long-range dependencies in samples.

[0004] Traditional anomaly detection methods mainly include statistical and distance-based methods. Statistical methods assume that the data follows a certain probability distribution and classify samples that deviate from this distribution as anomalies; distance-based methods identify outliers by calculating the distance or density between samples. However, these traditional methods usually rely on manually designed features, making it difficult for them to handle high-dimensional and complex data effectively, and their assumptions about the data distribution often do not match reality, resulting in limited detection performance. While deep learning-based anomaly detection methods have shown good results in handling high-dimensional and complex data, the inherent limitations of the local receptive field in convolutional operations are becoming increasingly apparent. Each neuron in a convolutional neural network focuses only on local positional information in the input sequence. This local focus may lead to insufficient capture of global information, thus affecting the effective representation and reconstruction of long sequence signals.

Summary of the Invention

[0005] To overcome the shortcomings of the existing technologies, this invention provides a weakly supervised anomaly detection method based on dual-branch feature fusion, which can simultaneously take into account global context understanding and accurate local feature extraction. Even when there are few labeled anomaly samples and noise pollution, it can still achieve optimal weakly supervised anomaly detection performance, and has strong generalization ability and practical application value.

[0006] To achieve the above objectives, the present invention is implemented through the following technical solution: a weakly supervised anomaly detection method based on dual-branch feature fusion, which includes acquiring and processing data samples to obtain a dataset, then inputting the processed dataset into a trained detection model and outputting the abnormal state prediction results. The detection model includes: a first branch module, which is a Mamba-based adaptive anomaly perception module used to capture global long-range dependencies in the dataset; a second branch module, which is an attention-enhanced residual convolution module used to extract local fine-grained features of the dataset; a fusion module, used to fuse the outputs of the first branch module and the second branch module by weighting to obtain the fused features; and an anomaly score generator, used to output the degree of anomaly corresponding to the fused features.

[0007] Furthermore, in the detection method of this application, the first branch module includes m cascaded first sub-modules, each of which includes, in sequence, an attention layer, an attention reconstruction layer, an improved Mamba model, a residual connection and layer normalization, and a gated attention layer, wherein: the attention layer obtains the interaction relationships between features by performing query-key-value mapping on the dataset; the attention reconstruction layer reconstructs the features and aligns the dimensions of the output of the attention layer; the improved Mamba model adaptively adjusts the state update strategy based on the intensity of the abnormal features of the current sample; the residual connection and layer normalization ensure effective gradient propagation and stabilize the training process; and the gated attention layer adaptively weights the features concatenated after attention reconstruction and layer normalization to enhance the expressive power of anomaly-related features.

[0008] The output of each first sub-module is used as the input of the next first sub-module. After m layers (e.g., 5 layers) of iterative processing, the first branch module can extract increasingly abstract global feature representations layer by layer, and finally output a feature vector containing rich global semantic information.

[0009] Furthermore, in the detection method of this application, the step of obtaining the interaction relationships between features in the attention layer includes: a. Given an input feature matrix, generate a query matrix, a key matrix, and a value matrix through three independent linear transformations; b. Obtain the raw attention weights by calculating the dot product of the query matrix and the key matrix, then divide them by a scaling factor to prevent the dot product values from becoming too large and causing the gradient to vanish; c. Normalize the scaled attention weights using the Softmax function and multiply them with the value matrix to obtain the weighted aggregated attention output, which contains global interaction information between the input features.

[0010] Furthermore, in the detection method of this application, the improved Mamba model differs from the traditional Mamba model in the following respects: S1. In the feature projection stage, a dual-path projection mechanism is introduced to simultaneously project the input features into the normal mode space and the abnormal mode space, and the distribution difference of the samples between the two spaces is calculated to assess the abnormal tendency of the samples. S2. In the local feature extraction stage, variable receptive field convolution is used instead of a traditional fixed convolution kernel. The size of the convolution receptive field is dynamically adjusted according to the abnormal tendency: the higher the abnormal tendency of a region, the larger the corresponding receptive field. That is, a larger receptive field is used for suspected abnormal regions to capture contextual information, and a smaller receptive field is used for normal regions to improve computational efficiency. S3. In the state-space model calculation stage, this invention designs an anomaly-driven dynamic parameter generation module. This module dynamically generates three key parameters based on the deviation of the input features from the normal mode baseline: the discretization step size matrix, which controls the temporal granularity of state updates, allowing finer granularity for anomalous features; the input selection matrix, which prioritizes input dimensions with high anomaly responses in their contribution to state updates; and the output filtering matrix, which controls the output of hidden state information to suppress redundant information in normal patterns. This parameter generation mechanism enables the model to maintain fast coarse-grained modeling when processing normal samples, and to switch automatically to fine-grained deep modeling when encountering abnormal samples, thus adaptively tilting computing resources towards abnormal samples.

[0011] S4. In the output stage, residual enhancement projection is introduced to directly integrate the components of the original input features with high anomaly significance into the output through a residual connection, avoiding the attenuation of abnormal signals during deep modeling.

[0012] The revamped Mamba Block maintains linear computational complexity while achieving efficient capture of sparse anomaly patterns in the data through an anomaly-sensitive adaptive mechanism.

[0013] Furthermore, in the detection method of this application, the gated attention layer works as follows: after the input features pass through a linear transformation layer, a sigmoid activation function is used to generate gate weights with values ranging from 0 to 1. The gate weights reflect the importance of each feature dimension: the larger the weight, the more important the feature dimension is for anomaly detection. The gate weights are multiplied element-wise with the input features to enhance important features and suppress irrelevant features, thus obtaining the output of the current first sub-module.

[0014] Technical effects of the invention: (1) This invention combines the Mamba state-space model and convolutional neural networks to propose a new weakly supervised anomaly detection method. This method utilizes the ability of the Mamba network to capture global long-range dependencies and the advantage of the convolutional network to extract local features, and can achieve efficient and accurate anomaly detection in weakly supervised scenarios.

[0015] (2) The present invention designs an adaptive anomaly perception module based on Mamba. Compared with the quadratic computational complexity of the traditional Transformer architecture, Mamba has linear computational complexity and can efficiently model the global dependencies of long sequences. At the same time, the module achieves adaptive perception of anomaly patterns through a gated attention mechanism.

[0016] (3) The present invention designs an attention-enhanced residual convolution module, which achieves adaptive feature enhancement and denoising through channel attention mechanism and soft thresholding strategy, effectively suppressing noise interference and improving the ability to perceive local abnormal features.

[0017] (4) The dual-branch architecture proposed in this invention can simultaneously take into account global context understanding and accurate extraction of local features. Even when there are few labeled abnormal samples and noise pollution, it can still achieve the best weakly supervised anomaly detection performance, and has strong generalization ability and practical application value.

Attached Figure Description

[0018] Figure 1 is a system architecture diagram of the detection model in the embodiments of this application; Figure 2 is a comparison chart of the PR-AUC performance of the weakly supervised anomaly detection method in this embodiment and seven competing methods under different numbers of labeled anomaly samples (No. Anomalies); Figure 3 shows the PR curves of different feature extraction modules of the weakly supervised anomaly detection method in this embodiment on six datasets; Figure 4 shows the ROC curves of different feature extraction modules of the weakly supervised anomaly detection method in this embodiment on six datasets.

Detailed Implementation

[0019] This embodiment provides a weakly supervised anomaly detection method based on dual-branch feature fusion, including: obtaining and processing data samples to obtain a dataset; and inputting the processed dataset into the trained detection model and outputting the abnormal state prediction results. As shown in Figure 1, the detection model is MACAD-Net, a weakly supervised anomaly detection network based on dual-branch feature fusion, which includes: the first branch module, a Mamba-based Adaptive Anomaly Perception Module (MAAPM), used to capture global long-range dependencies in the dataset; the second branch module, an Attention-Enhanced Residual Convolution Module (AERCM), used to extract local fine-grained features of the dataset; the fusion module, used to fuse the outputs of the first branch module and the second branch module by weighting to obtain the fused features; and an anomaly score generator, which outputs the degree of anomaly corresponding to the fused features. Specifically, in the anomaly score generation stage, the concatenated fused features are fed into the anomaly score generator. The anomaly score generator consists of a fully connected layer and a sigmoid activation function. The fully connected layer performs a linear transformation on the fused features, mapping the high-dimensional features to a one-dimensional scalar. The sigmoid activation function normalizes this scalar to between 0 and 1, which serves as the model's predicted anomaly probability. The closer the output value is to 1, the higher the probability that the sample is anomalous; the closer the output value is to 0, the higher the probability that the sample is normal.
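As a minimal sketch of the anomaly score generator described above, the following NumPy code applies a single fully connected layer followed by a sigmoid. The feature size, the random weights standing in for trained parameters, and the function name `anomaly_score` are illustrative assumptions, not details taken from the patent.

```python
import numpy as np

def anomaly_score(fused, weight, bias):
    """Map fused features to one scalar per sample, then squash with a sigmoid."""
    logit = fused @ weight + bias          # (batch, d) @ (d,) -> (batch,)
    return 1.0 / (1.0 + np.exp(-logit))    # predicted anomaly probability in (0, 1)

rng = np.random.default_rng(0)
fused = rng.normal(size=(4, 8))   # 4 samples with 8 fused features (hypothetical sizes)
weight = rng.normal(size=8)       # stands in for trained fully connected weights
bias = 0.0
scores = anomaly_score(fused, weight, bias)
```

Scores near 1 would be read as likely anomalous and scores near 0 as likely normal, matching the interpretation given in the text.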

[0020] Furthermore, in the detection method of this embodiment, processing the data samples includes the following steps: Missing value imputation: for attributes with missing values in the data sample, the mean of all non-missing values of the attribute is used to fill them, ensuring the integrity of the data. Categorical attribute encoding: categorical attributes in the data samples are converted into numerical feature vectors using one-hot encoding, enabling the model to process non-numerical data. Normalization: Min-Max normalization is applied to all numerical attributes in the data sample, mapping them to the range of 0 to 1 to eliminate the influence of differences in units between attributes. The same data sample processing is used when training the detection model, and normalization can also accelerate the convergence of the model.
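The three preprocessing steps above can be sketched as follows. This is an illustration under assumptions: the column names `amount` and `protocol`, the sorted category order, and the handling of constant columns are hypothetical choices that the patent does not specify.

```python
import numpy as np

def impute_mean(col):
    """Replace NaN entries with the mean of the non-missing values."""
    col = np.asarray(col, dtype=float)
    return np.where(np.isnan(col), np.nanmean(col), col)

def one_hot(values):
    """Encode a categorical column as one-hot rows (categories in sorted order)."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(values), len(categories)))
    out[np.arange(len(values)), [index[v] for v in values]] = 1.0
    return out

def min_max(col):
    """Scale a numeric column to [0, 1]; a constant column maps to zeros (assumption)."""
    col = np.asarray(col, dtype=float)
    span = col.max() - col.min()
    return (col - col.min()) / span if span > 0 else np.zeros_like(col)

amount = [10.0, np.nan, 30.0, 20.0]        # numeric attribute with one missing value
protocol = ["tcp", "udp", "tcp", "icmp"]   # categorical attribute

numeric = min_max(impute_mean(amount))     # impute first, then normalize
features = np.column_stack([numeric, one_hot(protocol)])
```

Here the missing `amount` is filled with the mean of the other three values before scaling, so the imputed entry lands in the middle of the normalized range.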

[0021] In the process of training the detection model, in order to simulate the scenario of scarce abnormal samples in real-world applications, a preset number (e.g., 30) of abnormal samples are randomly retained from the abnormal classes of each training dataset as labeled abnormal samples, so that labeled abnormal samples account for only a very small proportion of the total training data.

[0022] Meanwhile, to simulate the situation where unlabeled data in real-world scenarios typically contains a small number of anomalies, a preset proportion (e.g., 2%) of anomalous contamination data is randomly added to the normal class in each training dataset.
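The training-set construction in the two paragraphs above can be sketched as below. The function name, the synthetic data, and the reading of the 2% figure as the anomaly fraction of the final unlabeled set are assumptions; the patent gives only the example values 30 and 2%.

```python
import numpy as np

def build_weak_training_set(normal, anomalies, n_labeled=30,
                            contamination=0.02, seed=0):
    """Return (labeled anomalies, contaminated unlabeled set)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(anomalies))
    labeled = anomalies[order[:n_labeled]]            # the few labeled anomalies
    # Assumption: contamination is the anomaly share of the final unlabeled set.
    n_contam = int(round(contamination * len(normal) / (1.0 - contamination)))
    n_contam = max(0, min(n_contam, len(anomalies) - n_labeled))
    contam = anomalies[order[n_labeled:n_labeled + n_contam]]
    unlabeled = np.concatenate([normal, contam])
    rng.shuffle(unlabeled)                            # shuffles rows in place
    return labeled, unlabeled

rng = np.random.default_rng(1)
normal = rng.normal(size=(980, 5))                    # synthetic normal samples
anomalies = rng.normal(loc=4.0, size=(100, 5))        # synthetic anomalous samples
labeled, unlabeled = build_weak_training_set(normal, anomalies)
```

With 980 normal rows, 20 unlabeled anomalies are mixed in, so anomalies make up 2% of the 1000-row unlabeled set while only 30 anomalies carry labels.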

[0023] Furthermore, in the detection method of this embodiment, the first branch module includes m cascaded first sub-modules. Each first sub-module includes, in sequence, an attention layer, an attention reconstruction layer, an improved Mamba model (i.e., MambaBlock), residual connections and layer normalization, and a gated attention layer, wherein: The attention layer obtains the interaction relationships between features by performing query-key-value mapping on the dataset; The attention reconstruction layer is used to reconstruct features and align dimensions of the output of the attention layer. Specifically, a linear transformation layer maps the attention output back to the dimensions of the original feature space and adds a bias term for adjustment to obtain the reconstructed feature representation. The role of this layer is to ensure the consistency of feature dimensions, which facilitates processing by subsequent modules. The improved Mamba model adaptively adjusts the state update strategy based on the intensity of the abnormal features of the current sample; Residual connections and layer normalization are used to ensure effective gradient propagation and stabilize the training process. Specifically, the output of the Mamba Block is added to the input of the first submodule through a residual connection, allowing the gradient to be directly propagated back to the shallow network, alleviating the training difficulties of deep networks. Layer normalization is then performed to standardize the features along the hidden dimension, accelerating model convergence and improving training stability.

[0024] The gated attention layer adaptively weights the features concatenated after attention reconstruction and layer normalization to enhance the expressive power of anomaly-related features.

[0025] The output of each first sub-module is used as the input of the next first sub-module. After m layers (e.g., 5 layers) of iterative processing, the first branch module can extract increasingly abstract global feature representations layer by layer, and finally output a feature vector containing rich global semantic information.

[0026] Furthermore, in the detection method of this embodiment, the step of obtaining the interaction relationships between features in the attention layer includes: a. Given an input feature matrix, generate a query matrix, a key matrix, and a value matrix through three independent linear transformations; b. Obtain the raw attention weights by calculating the dot product of the query matrix and the key matrix, then divide them by a scaling factor to prevent the dot product values from becoming too large and causing the gradient to vanish; c. Normalize the scaled attention weights using the Softmax function and multiply them with the value matrix to obtain the weighted aggregated attention output, which contains global interaction information between the input features.
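Steps a to c can be sketched as standard scaled dot-product attention. The scaling factor is assumed here to be the square root of the key dimension, which is the common choice; the patent only says "a scaling factor". Matrix sizes and random weights are illustrative.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # a. three independent linear maps
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # b. dot product, scaled by sqrt(d_k) (assumed)
    weights = softmax(scores, axis=-1)        # c. Softmax: each row sums to 1
    return weights @ V, weights               # weighted aggregation of the values

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                   # 6 positions, 4 features (hypothetical)
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
out, weights = attention(X, Wq, Wk, Wv)
```

Bias terms in the linear maps are omitted for brevity.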

[0027] Furthermore, in the detection method of this embodiment, the improved Mamba model differs from the traditional Mamba model in the following respects: S1. In the feature projection stage, a dual-path projection mechanism is introduced to simultaneously project the input features into the normal mode space and the abnormal mode space, and the distribution difference of the samples between the two spaces is calculated to assess the abnormal tendency of the samples. S2. In the local feature extraction stage, variable receptive field convolution is used instead of a traditional fixed convolution kernel. The size of the convolution receptive field is dynamically adjusted according to the abnormal tendency: the higher the abnormal tendency of a region, the larger the corresponding receptive field. That is, a larger receptive field is used for suspected abnormal regions to capture contextual information, and a smaller receptive field is used for normal regions to improve computational efficiency. S3. In the state-space model calculation stage, this invention designs an anomaly-driven dynamic parameter generation module. This module dynamically generates three key parameters based on the deviation of the input features from the normal mode baseline: the discretization step size matrix, which controls the temporal granularity of state updates, allowing finer granularity for anomalous features; the input selection matrix, which prioritizes input dimensions with high anomaly responses in their contribution to state updates; and the output filtering matrix, which controls the output of hidden state information to suppress redundant information in normal patterns. This parameter generation mechanism enables the model to maintain fast coarse-grained modeling when processing normal samples, and to switch automatically to fine-grained deep modeling when encountering abnormal samples, thus adaptively tilting computing resources towards abnormal samples.

[0028] S4. In the output stage, residual enhancement projection is introduced to directly integrate the components of the original input features with high anomaly significance (the salient anomalous components) into the output through a residual connection, so as to avoid the attenuation of abnormal signals during deep modeling.

[0029] The revamped Mamba Block maintains linear computational complexity while achieving efficient capture of sparse anomaly patterns in the data through an anomaly-sensitive adaptive mechanism.

[0030] Specifically, in step S1, the normal mode space projection is $H_{\mathrm{norm}} = \sigma(W_{\mathrm{norm}} X + b_{\mathrm{norm}})$, and the abnormal mode space projection is $H_{\mathrm{abn}} = \sigma(W_{\mathrm{abn}} X + b_{\mathrm{abn}})$.

[0031] Where $X$ is the input feature matrix; $H_{\mathrm{norm}}$ and $H_{\mathrm{abn}}$ represent the feature matrices projected onto the normal mode space and the abnormal mode space, respectively; $W_{\mathrm{norm}}$ and $W_{\mathrm{abn}}$ represent the two independent learnable weight matrices of the respective spaces; $b_{\mathrm{norm}}$ and $b_{\mathrm{abn}}$ represent the corresponding bias vectors; and $\sigma$ is the nonlinear activation function.

[0032] After obtaining the projection features of the two spaces, an adaptive gating mechanism is used to calculate the distribution difference between them, thereby quantifying the anomalous tendency of the samples. The mathematical expression of the adaptive gating mechanism is: $S_{\mathrm{anomaly}} = \mathrm{Sigmoid}(W_s\,|H_{\mathrm{norm}} - H_{\mathrm{abn}}| + b_s)$. The specific calculation process is as follows: Differential feature extraction: First, calculate the element-wise absolute difference matrix between the normal pattern spatial features $H_{\mathrm{norm}}$ and the abnormal pattern spatial features $H_{\mathrm{abn}}$, which directly captures and quantifies the degree of deviation of the feature responses of the input sample in the two different modal spaces.

[0033] Feature weighting and mapping: The absolute difference matrix is input into a linear transformation layer. This layer contains an independent, learnable weight matrix $W_s$ and a bias term $b_s$, which perform weighted aggregation and dimensional compression on the high-dimensional difference information, mapping the distribution differences into a comprehensive tendency value.

[0034] Probability normalization and evaluation: Finally, the above tendency values are processed using the Sigmoid activation function, normalizing them to the range of 0 to 1, to obtain the final distribution difference score $S_{\mathrm{anomaly}}$ (i.e., the anomaly tendency score). The closer the score is to 1, the greater the difference in feature expression between the normal and anomalous pattern spaces, indicating that the sample deviates significantly from the distribution benchmark of normal data and has a very high anomaly tendency; conversely, the closer the score is to 0, the more similar the feature responses are, indicating a tendency towards normal samples. This score subsequently serves as a key control weight for dynamically adjusting the convolutional receptive field and generating the Mamba state space parameters.
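The dual-path projection and adaptive gating of step S1 can be sketched as follows. Several choices here are assumptions: `tanh` stands in for the unspecified activation σ, the gating layer is taken to compress each position to a single score, and the code uses a row-vector convention (`X @ W`) instead of the `W X` form written in the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def anomaly_tendency(X, W_norm, b_norm, W_abn, b_abn, W_s, b_s, act=np.tanh):
    H_norm = act(X @ W_norm + b_norm)    # projection into the normal mode space
    H_abn = act(X @ W_abn + b_abn)       # projection into the abnormal mode space
    diff = np.abs(H_norm - H_abn)        # element-wise absolute difference
    return sigmoid(diff @ W_s + b_s)     # one tendency score in (0, 1) per position

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))                        # 6 positions, 4 features (hypothetical)
W_norm, W_abn = rng.normal(size=(4, 5)), rng.normal(size=(4, 5))
b_norm, b_abn = np.zeros(5), np.zeros(5)
W_s, b_s = rng.normal(size=(5, 1)), 0.0
S = anomaly_tendency(X, W_norm, b_norm, W_abn, b_abn, W_s, b_s)
```

If the two projections coincide, the difference is zero and the score reduces to `sigmoid(b_s)`, so any departure from that baseline comes from disagreement between the two spaces.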

[0035] Specifically, in step S2, using the anomaly tendency score matrix obtained in step S1, the outputs of the two convolution branches are fused by element-wise adaptive weighting at the spatial pixel level. The mathematical expression for this fusion is: $H_{\mathrm{local}} = (\mathbf{1} - S_{\mathrm{anomaly}}) \odot \mathrm{Conv}^{(\mathrm{small})}(X_{\mathrm{proj}}) + S_{\mathrm{anomaly}} \odot \mathrm{Conv}^{(\mathrm{large})}(X_{\mathrm{proj}})$.

[0036] Where $X_{\mathrm{proj}}$ represents the input feature matrix entering the local feature extraction stage; $S_{\mathrm{anomaly}}$ is the output calculated in step S1; $\mathrm{Conv}^{(\mathrm{small})}()$ represents a small receptive field convolution operation, which has low computational cost and focuses on extracting local features from normal patterns; $\mathrm{Conv}^{(\mathrm{large})}()$ represents a large receptive field convolution operation, emphasizing context capture for abnormal patterns; $\odot$ represents element-wise matrix multiplication, used to implement region-level spatial weighting; $\mathbf{1}$ denotes an all-ones matrix with the same dimensions as $S_{\mathrm{anomaly}}$; and $H_{\mathrm{local}}$ represents the local feature matrix output after dynamic adjustment via variable receptive field convolution.
Specifically, in step S3, the degree of deviation is represented by calculating the distance between the current input feature and the learned normal baseline vector: $V_{\mathrm{dev}} = |X_{\mathrm{proj}} - u_{\mathrm{norm}}|$, where $X_{\mathrm{proj}}$ represents the input feature matrix currently entering the state-space model computation stage; $u_{\mathrm{norm}}$ represents the baseline vector of the normal mode, a learnable parameter that represents the central distribution of features from normal samples extracted by the model during training; and $V_{\mathrm{dev}}$ represents the calculated deviation matrix.
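The step-S2 fusion formula can be sketched with two 1-D convolutions blended by the anomaly tendency score. The kernel sizes 3 and 7, the averaging kernels, and the per-position score shape are assumptions for illustration. Note that this dense sketch evaluates both convolutions everywhere and only blends their outputs, as the formula is written.

```python
import numpy as np

def conv_same(X, kernel):
    """Apply one 1-D kernel along the sequence axis of every channel ('same' length)."""
    return np.stack([np.convolve(X[:, c], kernel, mode="same")
                     for c in range(X.shape[1])], axis=1)

def variable_receptive_field(X_proj, S, k_small, k_large):
    small = conv_same(X_proj, k_small)    # small receptive field branch
    large = conv_same(X_proj, k_large)    # large receptive field branch
    return (1.0 - S) * small + S * large  # element-wise blend weighted by S_anomaly

rng = np.random.default_rng(0)
X_proj = rng.normal(size=(16, 3))         # 16 positions, 3 channels (hypothetical)
S = rng.uniform(size=(16, 1))             # stands in for the step-S1 score
k_small = np.ones(3) / 3.0                # width-3 averaging kernel (assumed)
k_large = np.ones(7) / 7.0                # width-7 averaging kernel (assumed)
H_local = variable_receptive_field(X_proj, S, k_small, k_large)
```

Positions with a score near 0 take almost all of their output from the small kernel, and positions with a score near 1 from the large kernel.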

[0037] All three parameters are generated from the aforementioned deviation matrix $V_{\mathrm{dev}}$ through independent linear transformations and activation functions, as follows: The discretization step size matrix $\Delta$ uses a linear layer to map the deviation to the step size space: $\Delta = \mathrm{Softplus}(W_{\Delta} V_{\mathrm{dev}} + b_{\Delta})$, where $V_{\mathrm{dev}}$ is the deviation matrix calculated in step S3; $W_{\Delta}$ is the learnable weight matrix of the discretization step-size mapping layer; $b_{\Delta}$ is the corresponding bias vector; and $\mathrm{Softplus}()$ is the smooth positive activation function.

[0038] The input selection matrix $B$ dynamically adjusts the input weights directly based on the intensity of the abnormal response in each dimension: $B = \mathrm{Linear}_B(V_{\mathrm{dev}})$, where $\mathrm{Linear}_B()$ denotes the independent linear transformation layer corresponding to the input selection matrix, containing the learnable weight matrix $W_B$ and the bias vector $b_B$.

[0039] The output filtering matrix $C$ is the generated gated filtering weight: $C = \mathrm{Sigmoid}(W_c V_{\mathrm{dev}} + b_c)$, where $W_c$ is the learnable weight matrix of the output filtering mapping layer; $b_c$ is the corresponding bias vector; and $\mathrm{Sigmoid}()$ is the gated activation function that restricts the output to the (0,1) interval to achieve soft gated filtering.
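The generation of the three dynamic parameters from the deviation can be sketched as below. The shapes (one step size per position, state size 3), the parameter dictionary `p`, and the row-vector convention are assumptions made for the example.

```python
import numpy as np

def softplus(z):
    return np.logaddexp(0.0, z)           # log(1 + exp(z)), always positive

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dynamic_params(X_proj, u_norm, p):
    V_dev = np.abs(X_proj - u_norm)                           # deviation from the normal baseline
    delta = softplus(V_dev @ p["W_delta"] + p["b_delta"])     # discretization step size
    B = V_dev @ p["W_B"] + p["b_B"]                           # input selection (plain linear layer)
    C = sigmoid(V_dev @ p["W_C"] + p["b_C"])                  # output filtering gate in (0, 1)
    return V_dev, delta, B, C

rng = np.random.default_rng(0)
L, d, N = 5, 4, 3                          # sequence length, features, state size (hypothetical)
X_proj = rng.normal(size=(L, d))
u_norm = np.zeros(d)                       # stands in for the learned normal baseline
p = {"W_delta": rng.normal(size=(d, 1)), "b_delta": 0.0,
     "W_B": rng.normal(size=(d, N)), "b_B": np.zeros(N),
     "W_C": rng.normal(size=(d, N)), "b_C": np.zeros(N)}
V_dev, delta, B, C = dynamic_params(X_proj, u_norm, p)
```

Softplus keeps every step size positive, and the sigmoid keeps the output gate strictly between 0 and 1, in line with the roles described for Δ and C.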

[0040] The generated dynamic parameters are applied to the state evolution equations through a discretization process to achieve adaptive computational bias: 1. Discretization update: $A' = \exp(\Delta A)$; $B' = (\Delta A)^{-1}(A' - I)\,\Delta B$, where $A$ is the continuous-time state transition matrix, a predefined learnable parameter in the SSM, describing the evolution of the hidden state in the continuous-time domain; $\Delta$ is the discretization step size matrix dynamically generated in step S3, controlling the time granularity with which the continuous system is discretized; $\exp()$ represents the matrix exponential operation; $A'$ is the discrete state transition matrix obtained after discretization; $B$ is the continuous-time input projection matrix, a predefined learnable parameter in the SSM, describing the driving strength of the input signal on the hidden state; $I$ is the identity matrix of the same dimension as $A'$; and $B'$ is the discrete input projection matrix obtained after discretization, mapping the input signal at the current moment to the hidden state space.

[0041] 2. State update and output: $h_k = A' h_{k-1} + B' x_k$; $y_k = C h_k$, where $h_{k-1}$ is the hidden state vector from the previous time step, carrying contextual information about the historical sequence; $x_k$ is the input feature vector at the current time step; $A'$ and $B'$ are the discrete state transition matrix and discrete input projection matrix obtained from the above discretization process, respectively; $h_k$ is the updated hidden state vector at the current time, which integrates historical dependency information and current input information; $C$ is the output filtering matrix dynamically generated in step S3, which performs gated filtering on the hidden state to suppress redundant information in the normal mode; and $y_k$ is the output feature vector at the current moment, which is the feature representation after adaptive state space modeling.
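The discretization and state update can be sketched as a sequential scan. This sketch assumes a diagonal state matrix stored as a vector `a` (as in common Mamba implementations), so the matrix exponential and inverse become element-wise operations, and it treats a single input channel. These simplifications are assumptions, not details stated in the patent.

```python
import numpy as np

def ssm_scan(x, delta, B, C, a):
    """x: (L,), delta: (L,), B and C: (L, N), a: (N,) diagonal of A (nonzero)."""
    L, N = B.shape
    h = np.zeros(N)                                  # initial hidden state
    y = np.empty(L)
    for k in range(L):
        dA = delta[k] * a                            # Delta * A for a diagonal A
        A_bar = np.exp(dA)                           # A' = exp(Delta A)
        B_bar = (A_bar - 1.0) / dA * (delta[k] * B[k])   # B' = (Delta A)^-1 (A' - I) Delta B
        h = A_bar * h + B_bar * x[k]                 # h_k = A' h_{k-1} + B' x_k
        y[k] = C[k] @ h                              # y_k = C h_k
    return y

# One-dimensional check with a = -1, Delta = 1, B = C = 1 and an impulse input.
y = ssm_scan(np.array([1.0, 0.0]), np.array([1.0, 1.0]),
             np.ones((2, 1)), np.ones((2, 1)), np.array([-1.0]))
```

In the impulse example the first output is 1 − e⁻¹ and the second is that value decayed by e⁻¹, which shows how a smaller Δ would slow the decay and give finer-grained state updates.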

[0042] Technical effect description: Adaptive step size: The state transition matrix A' is adjusted by Δ. When an anomaly is encountered, Δ becomes smaller, and the scanning step size of the model in the feature dimension direction becomes shorter, thereby capturing more subtle anomaly fluctuations.

[0043] Adaptive Input and Filtering: Through dynamic B and C, the model prioritizes the selection of anomalous information and actively suppresses normal redundant information during the calculation process, ensuring that computing resources are adaptively tilted towards anomalous samples.

[0044] Specifically, in step S4, to prevent abnormal signals from attenuating during the modeling process of deep networks, this embodiment dynamically extracts salient features from the original input using a spatial masking mechanism. Its mathematical expression is: $X_{\mathrm{salient}} = S_{\mathrm{anomaly}} \odot X_{\mathrm{raw}}$, where $X_{\mathrm{raw}}$ represents the original input features upon entering this sub-module, which retain the initial signal strength before state-space modeling; $S_{\mathrm{anomaly}}$ represents the anomaly tendency score matrix calculated in step S1, which serves as adaptive weights determining the degree to which the features of each region are preserved; and the resulting output $X_{\mathrm{salient}}$ contains the extracted salient anomalous components. The higher the abnormal tendency at a certain location, the more completely the corresponding original signal is preserved.

[0045] The extracted salient components are dimensionally aligned using a linear projection layer and then directly accumulated to the main output of the model via residual connections. The mathematical expression is: $H_{out} = \mathrm{LayerNorm}(Y_{SSM} + \mathrm{Linear}_{res}(X_{salient}))$, where $Y_{SSM}$ is the output $y_k$ obtained in step S3; $\mathrm{Linear}_{res}(\cdot)$ denotes the residual enhancement projection layer; $\mathrm{LayerNorm}(\cdot)$ denotes a layer normalization operation, which normalizes the features within the parentheses along the hidden dimension.
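The masking, projection, and residual steps can be sketched as follows. This is a minimal illustration for a single feature vector; the projection matrix `w_res`, the normalization without learnable scale and shift, and all function names are assumptions made for the example.

```python
import math


def layer_norm(v, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learnable affine)."""
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [(x - mean) / math.sqrt(var + eps) for x in v]


def salient_mask(x_raw, s_anomaly):
    """X_salient = S_anomaly (element-wise product) X_raw."""
    return [s * x for s, x in zip(s_anomaly, x_raw)]


def salient_residual(x_raw, s_anomaly, y_ssm, w_res):
    """Project the salient component and add it to the SSM output."""
    x_sal = salient_mask(x_raw, s_anomaly)
    # Linear projection (rows of w_res) aligns x_sal with the size of y_ssm
    proj = [sum(w * x for w, x in zip(row, x_sal)) for row in w_res]
    return layer_norm([y + p for y, p in zip(y_ssm, proj)])


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
h_out = salient_residual(
    x_raw=[1.0, 2.0, 3.0, 4.0],
    s_anomaly=[0.0, 0.0, 1.0, 1.0],  # only the last two positions look anomalous
    y_ssm=[0.0, 0.0, 0.0, 0.0],
    w_res=identity,
)
```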

[0046] Furthermore, in the detection method of this embodiment, the gated attention layer operates as follows: after the input features pass through a linear transformation layer, a sigmoid activation function is used to generate gate weights with values ranging from 0 to 1. The gate weights reflect the importance of each feature dimension: the larger the weight, the more important the feature dimension is for anomaly detection. The gate weights are multiplied element-wise with the input features to enhance important features and suppress irrelevant features, thus obtaining the output of the current first submodule.
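A minimal sketch of such a gate is given below. The weight matrix `W`, bias `b`, and function names are hypothetical; the linear layer is assumed to preserve the feature dimension so that the gate can be applied element-wise.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def gated_attention(x, W, b):
    """Return (gated features, gate weights) for one feature vector x."""
    # Linear transformation of the input features
    z = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
    # Sigmoid maps each value into (0, 1), giving per-dimension gate weights
    gate = [sigmoid(t) for t in z]
    # Element-wise product enhances or suppresses each feature dimension
    return [g * xi for g, xi in zip(gate, x)], gate


out, gate = gated_attention(
    x=[2.0, -1.0],
    W=[[0.0, 0.0], [0.0, 0.0]],  # zero weights: the gate depends only on the bias
    b=[0.0, 0.0],
)
```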

[0047] Furthermore, in the detection method of this embodiment, the second branch module includes n (e.g., 6) cascaded second sub-modules. Each second sub-module includes, in sequence, an average pooling layer, a deep convolutional layer, a global average pooling layer, a fully connected layer, a sigmoid activation function, and a soft-threshold shrinkage layer. The average pooling layer performs preliminary downsampling on the input samples to reduce feature dimensionality and computational cost while preserving the overall statistical information of the features: for a given input feature, the average of all elements within a specified pooling window is calculated as the output. Setting the pooling window size to 2 and the stride to 2 ensures that the feature dimensionality of the input sample is halved after passing through the average pooling layer.
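The halving effect of a window of 2 with a stride of 2 can be checked with a short sketch; the function name `avg_pool1d` is hypothetical and the input is treated as a one-dimensional feature vector.

```python
def avg_pool1d(x, window=2, stride=2):
    """Average each window of `window` elements, moving `stride` at a time."""
    return [
        sum(x[i:i + window]) / window
        for i in range(0, len(x) - window + 1, stride)
    ]


pooled = avg_pool1d([1.0, 3.0, 5.0, 7.0])
```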

[0048] The deep convolutional layer enhances anomalous features layer by layer by performing multiple cascaded convolutions (e.g., 3). The features extracted by the deep convolutional layer are globally average-pooled and then input into a fully connected layer and a sigmoid activation function to generate channel-level adaptive gating weights, which characterize the contribution of different channels to anomaly detection. The gate weights are multiplied element-wise with the globally average-pooled features, and the result is input into the soft-threshold shrinkage layer to perform channel-level denoising. The denoised effective features are added element-wise to the features output by the average pooling layer (passed through the bottom residual path) to obtain the output features of the current second sub-module. The output of the current second sub-module is used as the input of the next second sub-module. After n layers of iterative processing, the final output is a feature representation containing rich local fine-grained information.

[0049] Furthermore, the detection method in this embodiment innovatively modifies the traditional BN-ReLU-Conv unit to address the sparse distribution of local anomaly features in anomaly detection. The deep convolutional layer employs anomaly response-oriented cascaded convolutional units to construct a hierarchical feature extraction pathway, and each convolutional unit introduces an anomaly activation enhancement mechanism. In the batch normalization stage, not only are the features standardized, but the deviation index between the current batch features and the normal sample feature distribution baseline is also calculated. The deviation index is used for the dynamic modulation of the subsequent activation function. It is obtained by first calculating the average distribution of the current batch features along the channel dimension and then measuring its difference from the baseline distribution of normal sample features. The mathematical expression is: $\Gamma = \| \mathrm{Mean}(H_{curr}) - \mu_{base} \|_1$, where $H_{curr}$ is the input feature tensor for the current batch; $\mathrm{Mean}(\cdot)$ represents the batch mean calculation operation, used to extract the overall distribution center of the features in the current batch; $\mu_{base}$ is the baseline vector of the normal sample feature distribution that has been learned and stored in advance, representing the statistical characteristics of normal data; $\| \cdot \|_1$ represents the L1 norm (sum of absolute values), used to quantify the overall strength with which the current feature distribution deviates from the baseline; $\Gamma$ is the calculated deviation index, and the higher the index, the greater the possibility that the current batch of features contains abnormal information.
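As a worked illustration of the deviation index, the sketch below averages a batch along the sample axis and takes the L1 distance to a stored baseline. The batch layout (one list of per-channel values per sample) and the name `deviation_index` are assumptions made for the example.

```python
def deviation_index(batch, mu_base):
    """L1 distance between the per-channel batch mean and the normal baseline."""
    n = len(batch)
    # Mean over the batch for every channel
    mean = [sum(sample[c] for sample in batch) / n for c in range(len(mu_base))]
    # L1 norm of the difference from the baseline
    return sum(abs(m - b) for m, b in zip(mean, mu_base))


# Two samples with two channels each; the baseline is [1, 1]
gamma = deviation_index(batch=[[1.0, 2.0], [3.0, 4.0]], mu_base=[1.0, 1.0])
```

Here the batch mean is [2, 3], so the index is |2 - 1| + |3 - 1| = 3; a batch whose mean equals the baseline yields 0.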

[0050] In the nonlinear activation stage, an adaptive activation threshold strategy is adopted, which dynamically adjusts the activation threshold according to the deviation index. The higher the deviation index, the lower the activation threshold. That is, a lower threshold is used for suspected abnormal areas to retain more weak abnormal signals, and a standard threshold is used for normal areas to suppress redundant activation. In the convolution extraction stage, an anomaly-aware convolution kernel weight modulation mechanism is designed. Based on the anomaly response intensity identified by the previous convolutional unit, the weight allocation of each dimension of the current convolution kernel is dynamically adjusted so that the convolution operation is tilted towards the dimensions with high anomaly response, thereby achieving focused extraction of local anomaly patterns. Anomaly response intensity is essentially a measure of the saliency of the output features of the previous convolutional unit in the channel dimension. In convolutional neural networks, different channels typically represent different feature patterns, and anomaly response intensity measures which channels have captured signals related to anomalies. It is acquired as follows: the output feature map $H_{L-1}$ of the previous convolutional unit is passed through a global average pooling layer, which extracts global statistical information for each channel to obtain the channel description vector $z$. Each element $z_c$ of this vector represents the anomaly response intensity of the c-th channel.

[0051] The dynamic adjustment process involves a weight prediction network that generates a set of modulation coefficients based on the aforementioned response intensity and applies them to the original weights of the current convolutional layer. The specific calculation process and mathematical expression are as follows: First, using the anomaly response intensity $z$ from the previous layer, a channel weight modulation vector $\alpha$ is generated through a small fully connected network (two-layer MLP). Its mathematical expression is: $\alpha = \mathrm{Sigmoid}(W_2\,\mathrm{ReLU}(W_1 z))$, where $z = [z_1, z_2, \ldots, z_C]$ is the channel description vector extracted by global average pooling, in which each element $z_c$ represents the anomaly response intensity of the c-th channel; $W_1$ is the learnable weight matrix of the first fully connected layer in the two-layer MLP, responsible for nonlinear feature extraction of the channel response intensity; $\mathrm{ReLU}(\cdot)$ is the nonlinear activation function after the first layer, which introduces nonlinear mapping capability and suppresses negative responses; $W_2$ is the learnable weight matrix of the second fully connected layer in the two-layer MLP, responsible for mapping intermediate features to channel-level modulation coefficients; $\mathrm{Sigmoid}(\cdot)$ is the gated activation function after the second layer, which normalizes the modulation coefficients to the (0, 1) interval; $\alpha$ is the generated channel weight modulation vector, the value of each dimension of which reflects the importance of the corresponding channel to the perception of local abnormal patterns.

[0052] The generated modulation vector is multiplied element-wise with the original convolution kernel weights $W_{orig}$ of the current convolutional layer to obtain the modulated perceptual convolution kernel $W_{mod}$. Its mathematical expression is: $W_{mod} = \alpha \odot W_{orig}$, where $W_{orig}$ denotes the original, unmodulated convolution kernel weights of the current convolutional layer; $\odot$ represents an element-wise multiplication operation, which broadcasts the modulation vector along the channel dimension and multiplies it element-wise with the corresponding weights of the convolution kernel; $W_{mod}$ is the modulated perceptual convolution kernel, whose channel weights are adaptively redistributed according to the anomaly response intensity. The weights of channels with strong anomaly responses are enhanced, while the weights of channels with weak responses are suppressed.

[0053] By using the modulated convolution kernel to perform the convolution operation on the input features, focused extraction of local abnormal patterns is achieved: $H_L = \mathrm{Conv}(X; W_{mod})$, where $X$ is the input feature of the current convolutional unit; $W_{mod}$ is the above perceptual convolution kernel modulated by the anomaly response intensity; $\mathrm{Conv}(\cdot)$ represents the convolution operation performed on the input feature $X$ with $W_{mod}$ as the convolution kernel; $H_L$ is the output feature of the L-th convolutional unit, i.e., the local feature representation extracted by the anomaly-aware convolution kernel. Compared with the original fixed-weight convolution, this output has a stronger response to anomaly patterns.
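The modulation and convolution steps can be sketched together as follows. The sketch makes several simplifying assumptions that are not stated in the text: a one-dimensional convolution with a single output channel, modulation applied per input channel, and no bias terms. All function names and toy weights are hypothetical.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def modulation_vector(prev_out, W1, W2):
    """alpha = Sigmoid(W2 ReLU(W1 z)), with z from global average pooling."""
    # z_c: average of channel c of the previous unit's output
    z = [sum(ch) / len(ch) for ch in prev_out]
    hidden = [max(0.0, t) for t in matvec(W1, z)]    # ReLU(W1 z)
    return [sigmoid(t) for t in matvec(W2, hidden)]  # Sigmoid(W2 ...)


def modulated_conv1d(x, W_orig, alpha, stride=2, pad=1):
    """Convolve x (channels x length) with W_mod = alpha * W_orig per channel."""
    # Broadcast alpha[c] over the kernel positions of channel c
    W_mod = [[a * w for w in row] for a, row in zip(alpha, W_orig)]
    k = len(W_orig[0])
    # Zero-pad both ends of every channel
    xp = [[0.0] * pad + list(ch) + [0.0] * pad for ch in x]
    n_out = (len(xp[0]) - k) // stride + 1
    return [
        sum(W_mod[c][j] * xp[c][t * stride + j]
            for c in range(len(xp)) for j in range(k))
        for t in range(n_out)
    ]


eye = [[1.0, 0.0], [0.0, 1.0]]
alpha = modulation_vector([[1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0]], eye, eye)
h_l = modulated_conv1d(
    x=[[1.0, 2.0, 3.0, 4.0], [1.0, 1.0, 1.0, 1.0]],
    W_orig=[[1.0, 1.0], [1.0, 1.0]],  # kernel size 2, as in this embodiment
    alpha=alpha,
)
```

In this toy run the first channel has the larger average response, so its kernel weights receive the larger coefficient and dominate the output.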

[0054] Specifically, the initial dimensionality-reduced features output by the average pooling layer are used as the starting input of the deep convolutional layer. After iterative processing by L modified convolutional units, the expression of anomalous features is strengthened layer by layer while interference from normal patterns is suppressed. To suit the interactive features of anomalous data, this invention sets the number of stacked convolutional units L to 3 to balance feature abstraction levels and computational efficiency, the kernel size to 2 to capture anomalous correlation patterns between adjacent features, the stride to 2 to achieve layer-by-layer compression of feature dimensions, and the zero-padding to 1 to maintain the integrity of boundary features.

[0055] Furthermore, in this embodiment, the detection method addresses the problem of mixed labeled noise and unlabeled anomalies in weakly supervised anomaly detection by innovatively modifying the traditional residual shrinkage network. A dynamic balance between noise suppression and anomaly feature preservation is achieved through an anomaly discrimination-guided channel-level adaptive threshold denoising mechanism. Specifically, absolute value global average pooling is performed on the multi-channel features output by the deep convolutional layer to obtain channel-level statistical information. Then, the channel activation intensity is acquired, and an anomaly discrimination difference index between channels is introduced to measure the contribution of each channel to distinguishing between anomalous and normal samples. Specifically, absolute value global average pooling is performed on the multi-channel features $H_{conv}$ output by the deep convolutional layer to obtain the channel statistics vector $z = [z_1, z_2, \ldots, z_C]$, where the statistical value $z_c$ of the c-th channel is calculated as: $z_c = \frac{1}{W \cdot H} \sum_{i=1}^{W} \sum_{j=1}^{H} \left| H_{conv}(c, i, j) \right|$, where $H_{conv}$ is the multi-channel feature tensor output by the deep convolutional layer; c is the index of the current channel; i and j are the spatial position indices of the feature map in the width and height directions, respectively; W and H are the width and height (i.e., spatial dimensions) of the feature map, respectively; the absolute value operation in the formula eliminates the mutual cancellation of positive and negative activations so that the activation intensity at each position is truly reflected; $z_c$ is the statistical value of the c-th channel, representing the average activation intensity of the channel across the entire spatial range. The larger the value, the stronger the overall response of the channel.

[0056] Using each element $z_c$ in the above statistical vector, the inter-channel anomaly discrimination difference index $D_c$ is introduced. This metric measures the contribution of the c-th channel to distinguishing between abnormal and normal samples, and its mathematical expression is: $D_c = \frac{\left| z_c - \mu_{c,norm} \right|}{\sigma_{c,norm} + \epsilon}$.

[0057] Where $z_c$ is the statistical value of the c-th channel obtained from the above calculation; $\mu_{c,norm}$ is the mean activation intensity of the c-th channel on normal samples, which is collected and stored during training and represents the baseline response level of this channel in normal mode; $\sigma_{c,norm}$ is the standard deviation of the activation intensity of the c-th channel on normal samples, reflecting the fluctuation range of the channel's response under normal conditions; $\epsilon$ is a numerical stability constant (a very small positive value) that prevents numerical overflow caused by a denominator of zero; $D_c$ is the anomaly discrimination difference index for the c-th channel, which is essentially a standardized deviation of the current activation intensity from the normal baseline. The larger $D_c$ is, the more significantly the current activation of the channel deviates from the normal pattern, and the higher its contribution to the discrimination of abnormal samples.
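The two channel-level quantities can be sketched as follows. The sketch assumes that the discrimination index takes the form of an absolute standardized deviation, |z_c - mu| / (sigma + eps), which matches the verbal definitions above but is an assumption about the exact expression; the function names are hypothetical.

```python
def channel_stats(h_conv):
    """Absolute-value global average pooling: one statistic z_c per channel.

    h_conv is indexed as h_conv[c][i][j] (channel, then two spatial indices).
    """
    return [
        sum(abs(v) for row in ch for v in row) / sum(len(row) for row in ch)
        for ch in h_conv
    ]


def discrimination_index(z, mu_norm, sigma_norm, eps=1e-8):
    """Assumed form: standardized absolute deviation from the normal baseline."""
    return [abs(zc - m) / (s + eps) for zc, m, s in zip(z, mu_norm, sigma_norm)]


h_conv = [
    [[1.0, -1.0], [2.0, -2.0]],  # channel 0: signed activations do not cancel
    [[0.0, 0.0], [0.0, 0.0]],    # channel 1: silent
]
z = channel_stats(h_conv)
d = discrimination_index(z, mu_norm=[0.5, 0.0], sigma_norm=[0.5, 1.0])
```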

[0058] In the soft threshold shrinkage layer, a personalized soft threshold parameter is generated for each channel based on the joint characteristics of channel activation intensity and anomaly discrimination difference index. The higher the anomaly discrimination contribution of a channel, the smaller the threshold is assigned. That is, a smaller threshold is assigned to channels with high anomaly discrimination contribution (channels that are strongly activated on abnormal samples but weakly activated on normal samples) to preserve the abnormal signal to the maximum extent, and a larger threshold is assigned to channels with low anomaly discrimination contribution or noise sensitivity to enhance noise suppression. Next, a tiered soft thresholding process is performed, comparing channel-level statistics with the corresponding thresholds. Based on the comparison results, the channels are categorized into three types: strong anomaly response zone (absolute value of channel-level statistics is much greater than the threshold), weak anomaly response zone (absolute value of channel-level statistics is slightly greater than the threshold), and noise zone (absolute value of channel-level statistics is less than the threshold). The strong anomaly response zone is kept at its original value, the weak anomaly response zone is appropriately shrunk to remove uncertainty, and the noise zone is set to zero. Selective residual connections fuse the effective features retained after soft thresholding with the highly significant features (abnormally significant components) output by the average pooling layer, avoiding the propagation of noise components through the residual path. This modification allows the residual shrinking layer to suppress labeled noise while maximizing the preservation of discriminative features of sparse outlier samples. The output of the above second submodule serves as the input to the next second submodule, undergoing n-layer iterative processing.

[0059] Specifically, graded soft thresholding includes the following process: by introducing a region determination coefficient $\lambda_1$ and a moderate shrinkage coefficient $\lambda_2$, the multi-channel features are processed hierarchically. The specific steps are as follows: the channel-level statistic $z_c$ is quantitatively compared with the corresponding channel-level threshold $\tau_c$, and based on the preset determination coefficient $\lambda_1$ (2.0 in this embodiment), the feature space is divided into the following three levels: Strong anomaly response region: satisfies $|z_c| \geq \lambda_1 \tau_c$. Here the channel statistic is significantly greater than the threshold, indicating that the channel contains extremely significant and definite anomalous feature signals.

[0060] Weak anomaly response region: satisfies $\tau_c \leq |z_c| < \lambda_1 \tau_c$. Here the channel statistic is slightly higher than the threshold, indicating that the channel contains a suspected abnormal signal or a weak abnormal signal affected by noise.

[0061] Noise region: satisfies $|z_c| < \tau_c$. Here the channel features are mainly composed of background noise.

[0062] Based on the above determination results, differential shrinkage is performed on the original features output by the deep convolutional layer: For regions with strong anomalous response, the original features are preserved without any shrinkage processing to ensure the lossless transmission of significant anomalous signals.

[0063] For the weak anomaly response region, a moderate shrinkage coefficient $\lambda_2$ (0.5 in this embodiment) is introduced to perform moderate denoising on the features. Its mathematical expression is as follows: $H_{out}(c) = \mathrm{sign}(H_{conv}(c)) \cdot \max(0, |H_{conv}(c)| - 0.5\,\tau_c)$, where $H_{conv}(c)$ represents the original feature value of the c-th channel output by the deep convolutional layer; $\mathrm{sign}(\cdot)$ is the sign function, used to preserve the positive and negative direction information of the original features, ensuring that the shrinkage operation only compresses the amplitude without changing the feature polarity; the absolute value operation is used to perform threshold comparison and shrinkage in the amplitude domain; $\tau_c$ is the personalized soft threshold parameter generated for the c-th channel by the differentiated threshold generation network, and the higher a channel's contribution to anomaly detection, the smaller its $\tau_c$; $0.5\,\tau_c$ is the moderate shrinkage amount, i.e., the product of the moderate shrinkage coefficient $\lambda_2 = 0.5$ and the channel threshold $\tau_c$, which is smaller than the standard soft-threshold shrinkage amount $\tau_c$ so as to preserve the effective components in weak anomalous signals; $\max(0, \cdot)$ ensures that the amplitude after shrinkage is non-negative, avoiding excessive shrinkage that would introduce inverse noise; $H_{out}(c)$ represents the output features of the c-th channel after moderate denoising, which retain suspected abnormal signals to the maximum extent while removing some uncertain noise.

[0064] For noise regions, all feature values of the channel are set to zero to achieve complete noise reduction.
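The three-way treatment of a channel (keep, moderately shrink, or zero) can be sketched for a single channel as follows. The function name and argument layout are hypothetical; the default coefficients 2.0 and 0.5 are the values given in this embodiment.

```python
import math


def graded_soft_threshold(h_conv_c, z_c, tau_c, lam1=2.0, lam2=0.5):
    """Apply graded soft thresholding to the feature values of one channel."""
    if abs(z_c) >= lam1 * tau_c:
        # Strong anomaly response region: keep the original features
        return list(h_conv_c)
    if abs(z_c) >= tau_c:
        # Weak anomaly response region: shrink amplitudes by lam2 * tau_c,
        # keeping the sign of every value
        shrink = lam2 * tau_c
        return [math.copysign(max(0.0, abs(v) - shrink), v) for v in h_conv_c]
    # Noise region: suppress the whole channel
    return [0.0] * len(h_conv_c)


features = [2.0, -0.3]
strong = graded_soft_threshold(features, z_c=2.5, tau_c=1.0)
weak = graded_soft_threshold(features, z_c=1.5, tau_c=1.0)
noise = graded_soft_threshold(features, z_c=0.5, tau_c=1.0)
```

With a threshold of 1.0, the weak-region call shrinks 2.0 to 1.5 and removes the value -0.3 entirely, because its amplitude is below the shrinkage amount of 0.5.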

[0065] Furthermore, in the detection method of this embodiment, the process of obtaining the fusion weight coefficients in the fusion module includes: calculating the complementarity index of the output features of the two branch modules (the first branch module and the second branch module), and dynamically generating the fusion weight coefficient of the corresponding branch module based on the complementarity index. The complementarity indicators include: feature space separation, which measures the degree of difference between the output information of the two branch modules; and feature synergy, which measures the mutual support relationship between the output information of the two branch modules.

[0066] Specifically, the complementarity index is used to measure the degree of difference in spatial distribution between global perception features and local denoising features. Its calculation formula is as follows: $I_{comp} = 1 - \mathrm{CosineSimilarity}(H_{MAAPM}, H_{AERCM})$, where $H_{MAAPM}$ is the global feature output by the first branch module (the Mamba-based adaptive anomaly perception module); $H_{AERCM}$ is the local feature output by the second branch module (the attention-enhanced residual convolution module); $\mathrm{CosineSimilarity}(\cdot)$ represents the cosine similarity calculation, used to measure how closely two sets of feature vectors agree in direction. The final output is the complementarity index $I_{comp}$. The lower the similarity, the higher the complementarity index, indicating that the information extracted by the two branches is strongly different and complementary.

[0067] Based on the above complementarity index, the contribution weight coefficients w1 and w2 of the two branches are dynamically generated using a gating network.

[0068] First, the complementarity index is passed through a lightweight weight mapping layer to generate the original weight vector: $a = \mathrm{Softmax}(W_f I_{comp} + b_f)$, where $W_f$ is the weight matrix and $b_f$ is the bias vector. Subsequently, the corresponding fusion weight coefficients are obtained: $w_1 = a[0]$, $w_2 = a[1]$.

[0069] Where a[0] and a[1] are the first and second elements of the original weight vector a, respectively; w1 is the fusion weight coefficient of the first branch module (MAAPM), reflecting the contribution ratio of global long-range dependency features in the fusion of the current sample; w2 is the fusion weight coefficient of the second branch module (AERCM), reflecting the contribution ratio of local fine-grained features in the fusion of the current sample. The two satisfy w1+w2=1: the more the abnormal pattern of the current sample is biased towards a global distribution abnormality, the larger w1 is, and the more it is biased towards a local fine-grained abnormality, the larger w2 is.
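The path from the complementarity index to the two fusion weights can be sketched as follows. The sketch assumes that both branch outputs are vectors of equal length, that the mapping layer holds one weight and one bias per branch, and that the weighted fusion is a weighted sum; these choices and all names are assumptions made for the example.

```python
import math


def cosine_similarity(u, v, eps=1e-12):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + eps)  # eps guards against zero vectors


def fusion_weights(h_global, h_local, W_f, b_f):
    """Return (I_comp, w1, w2) with (w1, w2) = Softmax(W_f * I_comp + b_f)."""
    i_comp = 1.0 - cosine_similarity(h_global, h_local)
    logits = [w * i_comp + b for w, b in zip(W_f, b_f)]
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(t - m) for t in logits]
    total = sum(exps)
    w1, w2 = (e / total for e in exps)
    return i_comp, w1, w2


def fuse(h_global, h_local, w1, w2):
    """Weighted sum of the two branch outputs (assumed fusion rule)."""
    return [w1 * g + w2 * l for g, l in zip(h_global, h_local)]


h_maapm, h_aercm = [1.0, 0.0], [0.0, 1.0]  # orthogonal: fully complementary
i_comp, w1, w2 = fusion_weights(h_maapm, h_aercm, W_f=[0.0, 0.0], b_f=[0.0, 0.0])
fused = fuse(h_maapm, h_aercm, w1, w2)
```

The softmax guarantees that the two weights are positive and sum to 1, whatever the values of the mapping layer.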

[0070] This ensures that, for samples in which global or local pattern anomalies predominate, the corresponding branch receives a higher fusion weight. The generation process of the fusion weights incorporates the feature statistics of the current sample, enabling the fusion strategy to adaptively match the distribution of anomalous patterns in the samples.

[0071] To comprehensively perceive abnormal patterns in both dimensions, the output features of the two branches are first concatenated to obtain the joint feature matrix $H_{joint}$, which serves as the basis for calculating the statistical properties: $H_{joint} = [H_{MAAPM} \oplus H_{AERCM}]$. The feature statistical properties specifically refer to three quantitative indicators that reflect the distribution, fluctuation, and offset of the sample signal, which together form the feature statistics vector $\Phi$: $\Phi = [\mathrm{Avg}(H_{joint}), \mathrm{Var}(H_{joint}), \Gamma]$, where $\mathrm{Avg}(H_{joint})$ is the global mean of the current sample features, used to reflect the overall shift of the sample signal; $\mathrm{Var}(H_{joint})$ is the global variance of the current sample features, used to measure the overall volatility of the signal; and $\Gamma$ is the deviation index calculated during the batch normalization stage, used to reflect the instantaneous deviation intensity of the sample relative to the normal baseline.
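The feature statistics vector can be illustrated with a short sketch. It assumes that each branch output is a flat vector, that the concatenation is a simple joining of the two vectors, and that the variance is the population variance; the function name is hypothetical.

```python
def feature_statistics(h_maapm, h_aercm, gamma):
    """Return [Avg(H_joint), Var(H_joint), Gamma] for one sample."""
    h_joint = list(h_maapm) + list(h_aercm)  # concatenate the branch outputs
    avg = sum(h_joint) / len(h_joint)
    var = sum((x - avg) ** 2 for x in h_joint) / len(h_joint)
    return [avg, var, gamma]


phi = feature_statistics([1.0, 3.0], [5.0, 7.0], gamma=0.2)
```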

[0072] The fusion module achieves adaptive fusion based on the dynamic evaluation of the complementarity of information between branches, forming a unified feature representation that takes into account both global semantics and local details, and then feeds it into the anomaly score generator to produce anomaly scores.

[0073] Furthermore, in this embodiment, the loss function for training the detection model is: $L_{MACAD} = L_{main} + \lambda_1 L_{branch} + \lambda_2 L_{consist}$. The main loss $L_{main}$ adopts an adaptive weighting mechanism based on confidence level; the branch consistency loss $L_{consist}$ constrains the predictive consistency of the two branch modules; and the branch discrimination loss $L_{branch}$ enables both branch modules to independently possess anomaly detection capabilities. In these terms, $y_i \in \{0, 1\}$ represents the true label of sample i; the remaining inputs are the predicted probability after fusion and the predicted probabilities of the first branch module and the second branch module, respectively; $\alpha_i$ is the class balancing weight, $\gamma$ is the focusing parameter, $\lambda_1$ and $\lambda_2$ are the weight coefficients, and N is the total number of samples.

[0074] This application proposes a novel branch collaborative perception weighted loss function (MACAD-Loss), which dynamically adjusts sample weights to make the model pay more attention to samples that are difficult to classify, thereby alleviating the class imbalance problem between normal and abnormal samples.

[0075] Specifically, the training set is fed into MACAD-Net for training. The Adam optimizer is used during the training process, with a batch size of 512 and 200 training epochs. Early stopping is applied when the loss function value stops decreasing or oscillates, to avoid overfitting. The balance factor α of the loss function is set to 0.8 and the focusing parameter γ to 2. The validation results at the end of each training phase are used to evaluate the generalization ability of the model and help decide whether to retain the trained model.

[0076] During testing, the trained detection model is used to make predictions on the test set. The network outputs an anomaly score between 0 and 1 for each test sample. The higher the score, the higher the degree of anomaly of the sample.

[0077] Effect verification: PR-AUC and ROC-AUC were used as quantitative evaluation metrics to verify model performance. A higher PR-AUC value indicates that the model can maintain high precision and high recall in imbalanced scenarios with sparse anomalous samples, i.e., it detects more true anomalies with fewer false positives. A higher ROC-AUC value indicates that the model maintains a high true positive rate and a low false positive rate at different discrimination thresholds, demonstrating stronger overall classification ability and robustness. When both PR-AUC and ROC-AUC values are high, the model output is excellent and can accurately distinguish between normal and anomalous samples; when the metric values are low, the model output is poor, and the classification performance is close to random guessing or even worse (for ROC-AUC, random guessing corresponds to a value of about 0.5).
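Both metrics can be computed from labels and anomaly scores with a short sketch. It assumes that ROC-AUC is evaluated through its pairwise-ranking interpretation and that PR-AUC is estimated by average precision with tied scores ignored; the text does not state which estimator was used, and the function names are hypothetical.

```python
def roc_auc(labels, scores):
    """Probability that a random anomaly scores above a random normal sample."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    # Count pairwise wins; ties contribute one half
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of every true anomaly."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
```

In this toy example one of the four anomaly/normal pairs is ranked in the wrong order, giving a ROC-AUC of 0.75, and the two anomalies sit at ranks 1 and 3, giving an average precision of (1 + 2/3) / 2.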

[0078] The superior detection performance of the proposed method is verified by comparing the PR-AUC and ROC-AUC values of the method of this invention with those of existing methods on the test set.

[0079] Experimental verification and analysis: To verify the effectiveness and generalization ability of the method of this invention, eight publicly available benchmark datasets covering multiple application fields such as finance, healthcare, cybersecurity, and biometrics were selected for experimental evaluation. These datasets include: Dataset 1 Bank, Dataset 2 CelebA, Dataset 3 Census, Dataset 4 Fraud, Dataset 5 Shuttle, Dataset 6 Thyroid, Dataset 7 Backdoor, and Dataset 8 Mammography. Outlier samples in these datasets were labeled according to the common judgment criteria for each field, and their essential characteristic is a significant distribution deviation from the majority of samples in the same field. Details of the sample size and category distribution of each dataset are shown in Table 1.

[0080] Table 1 shows the sample distribution of the eight datasets used in the experiment. In this embodiment, the datasets are eight real-world datasets from different fields: Dataset 1 Bank, Dataset 2 CelebA, Dataset 3 Census, Dataset 4 Fraud, Dataset 5 Shuttle, Dataset 6 Thyroid, Dataset 7 Backdoor, and Dataset 8 Mammography. Dataset 1, Bank, originates from: Moro S, Rita P, Cortez P. Bank marketing [Z]. 2012. Dataset 2, CelebA, originates from: Liu Z, Luo P, Wang X, et al. Deep learning face attributes in the wild [C]. Proceedings of International Conference on Computer Vision. 2015. Dataset 3, Census, originates from: Census-income (kdd) [Z]. 2000. Dataset 4, Fraud, originates from: Zhou X, Zhang Z, Wang L, et al. A model based on siamese neural network for online transaction fraud detection [C]. 2019 International Joint Conference on Neural Networks. 2019: 1-7. Dataset 5, Shuttle, originates from: Rayana S. Statlog (shuttle and mammography) [Z]. 2016. Dataset 6, Thyroid, originates from: Qui R. Thyroid disease [Z]. 1987. Dataset 7, Backdoor, originates from: Moustafa N, Slay J. Unsw-nb15: A comprehensive data set for network intrusion detection systems (unsw-nb15 network data set) [C]. 2015 Military Communications and Information Systems Conference. 2015: 1-6. Dataset 8, Mammography, originates from: Rayana S. Statlog (shuttle and mammography) [Z]. 2016.
Specifically, Dataset 1 (Bank) and Dataset 4 (Fraud) are two well-known financial datasets. Dataset 1 (Bank) comes from telemarketing campaigns by Portuguese banking institutions, where anomalous samples are successful telemarketing cases (approximately 0.65% of the records). Dataset 4 (Fraud) is used to detect fraudulent credit card transactions, where fraudulent transactions are considered anomalous data (approximately 6.10% of the records). Dataset 2 (CelebA) is a large-scale image dataset containing over 200,000 celebrity images, each labeled with 40 attributes. Baldness is used as the target for anomaly detection; it accounts for less than 1% of the dataset, while the other 39 attributes form the feature space used for learning. Dataset 3 (Census) is a well-known US census dataset, sourced from the US Census Bureau database. Individuals with an annual income exceeding $50,000 are defined as anomalous data, accounting for approximately 0.16% of the total samples. Dataset 5 (Shuttle) is a multi-class classification dataset in the space shuttle field, where the combination of the smallest classes in the corresponding classification task forms the "anomaly" class. Therefore, in the experiment, the five smallest categories (2, 3, 5, 6, 7) were merged into a single anomalous category, which accounts for approximately 7% of the total records. Dataset 7 (Backdoor) is a network intrusion detection dataset in which detected backdoor attacks are considered anomalous, while the "normal" category was extracted from the UNSW-NB15 dataset. Dataset 6 (Thyroid) and Dataset 8 (Mammography) are from the medical field. In the Thyroid dataset, anomalous data refers to patients diagnosed with hypothyroidism, accounting for approximately 5.62% of the records. In the Mammography dataset, anomalous data comprises a small number of patients diagnosed with calcification, accounting for approximately 2.32% of the total records.

[0081] In the experiments, MACAD-Net was compared with seven methods, including Algorithm 1 iForest, Algorithm 2 FSNet+, Algorithm 3 REPEN+, Algorithm 4 DevNet, Algorithm 5 Deep-SAD, Algorithm 6 FEWAE, and Algorithm 7 Dual-MGAN. Of these methods, only Algorithm 1 iForest is a classic unsupervised anomaly detection algorithm; the other six are popular weakly supervised anomaly detection methods. Furthermore, Algorithm 2 FSNet+ and Algorithm 3 REPEN+ are optimized versions of FSNet and REPEN, respectively.

[0082] In the experiment, the number of available labeled anomalous samples was set to 30, representing on average 0.13% of all samples, while the contamination (noise) level of the normal samples was maintained at 2%. This setting simulates the scarcity of anomalous samples and the presence of noise in the data in real-world scenarios. The performance of MACAD-Net and the seven comparative algorithms on the eight public datasets was evaluated using the PR-AUC and ROC-AUC metrics. In Tables 2 and 3, the last row summarizes the average performance of each method, and the last column displays the results of the proposed MACAD-Net, along with a comparison with the second-best method.

[0083] Table 2 compares the PR-AUC performance of MACAD-Net and the seven competing methods. As shown in Table 2, MACAD-Net outperforms the competing methods on all eight datasets in terms of PR-AUC. Specifically, in terms of average PR-AUC, MACAD-Net outperforms Algorithm 1 iForest by 41.13%, Algorithm 2 FSNet+ by 37.79%, Algorithm 3 REPEN+ by 30.47%, Algorithm 4 DevNet by 13.13%, Algorithm 5 Deep-SAD by 13.19%, Algorithm 6 FEWAE by 15.20%, and Algorithm 7 Dual-MGAN by 19.35%.

[0084] Table 3 compares the ROC-AUC performance of MACAD-Net and the seven competing methods. Similarly, as shown in Table 3, MACAD-Net outperforms the competing methods on all eight datasets in terms of ROC-AUC. Specifically, in terms of average ROC-AUC, MACAD-Net improves on Algorithm 1 iForest by 18.28%, Algorithm 2 FSNet+ by 15.77%, Algorithm 3 REPEN+ by 9.11%, Algorithm 4 DevNet by 2.82%, Algorithm 5 Deep-SAD by 4.56%, Algorithm 6 FEWAE by 2.89%, and Algorithm 7 Dual-MGAN by 6.00%.

[0085] Overall, MACAD-Net effectively combines the local high-frequency information and the global long-term dependencies of samples. Even with a small number of labeled anomalous samples and with contamination (noise) in the data, MACAD-Net achieves the best weakly supervised anomaly detection performance among the compared methods. This demonstrates that MACAD-Net's efficiency makes it more feasible in real-world scenarios.

[0086] Ablation experiments: (1) Sample efficiency. Sample efficiency was used to evaluate the impact of different numbers of labeled anomalies on the model. Accordingly, the performance of MACAD-Net and the seven comparison algorithms on the eight public datasets was evaluated by varying the number of labeled anomalies. Specifically, the number of available labeled anomalies was set to 5, 15, 30, 60, and 120, while the normal sample contamination level was maintained at 2%. Figure 2 shows the PR-AUC performance of the eight methods on the eight datasets with varying numbers of labeled anomalies.

[0087] As can be seen from Figure 2, on the eight public datasets, the sample efficiency of these anomaly detection methods generally increases as the number of labeled anomalous samples increases. MACAD-Net consistently maintains the highest sample efficiency among all the methods; that is, MACAD-Net achieves the best anomaly detection performance for every number of labeled anomalous samples tested. However, in some cases, the PR-AUC of competing methods declines as the number of labeled anomalies increases. For example, on the Census dataset (Dataset 3), when the number of labeled anomalous samples increases from 30 to 60, the PR-AUC of FEWAE (Algorithm 6) and Dual-MGAN (Algorithm 7) decreases by 1.6% and 0.4%, respectively; on the Shuttle dataset (Dataset 5), when the number increases from 30 to 120, the PR-AUC of REPEN+ (Algorithm 3) decreases by 4.59%. This may be because the added labeled anomalous samples differ greatly from one another and their features conflict, causing these models to optimize for the features of some labeled anomalous samples while ignoring others. A similar phenomenon can be observed in the ROC-AUC performance. Overall, MACAD-Net has higher sample utilization efficiency: even with a limited number of labeled anomalous samples, it achieves better performance in weakly supervised anomaly detection. This demonstrates that MACAD-Net makes more efficient use of limited labeled anomalous samples and has strong generalization ability.

[0088] (2) Effectiveness of the feature extraction modules. This experiment validates the effectiveness of the MACAD-Net network structure by constructing two variants of the anomaly detection network: Network 1 (MA-Net), a weakly supervised anomaly detection network containing only the Mamba-based adaptive anomaly perception module, and Network 2 (CA-Net), a weakly supervised anomaly detection network containing only the attention-enhanced residual convolution module. In the experiment, the number of labeled anomalous samples was kept at 30, while the contamination level of normal samples was maintained at 2%. The reliability and effectiveness of MACAD-Net were evaluated by comparing its PR-AUC and ROC-AUC performance with that of the single-branch networks on the eight public datasets. PR and ROC curves were plotted on six datasets: Figure 3 and Figure 4 present the PR and ROC curves, respectively, for Datasets 1 (Bank), 2 (CelebA), 3 (Census), 4 (Fraud), 6 (Thyroid), and 7 (Backdoor). A larger area under the PR and ROC curves indicates better detection performance.
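
As an illustration of the two metrics, the following sketch computes ROC-AUC through its pairwise-ranking interpretation and estimates the PR area by average precision. Average precision is one common estimator of the area under the PR curve; the text does not state which estimator was used, so that choice, the function names, and the toy scores are assumptions.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random anomaly outscores a random
    normal sample (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Mean of the precision values at the rank of each true anomaly.
    Assumes at least one anomaly and no tied scores."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Toy example: 1 marks an anomaly, scores are anomaly scores.
labels = [0, 0, 1, 0, 1, 0, 0, 1]
scores = [0.1, 0.4, 0.35, 0.2, 0.8, 0.05, 0.3, 0.9]
auc = roc_auc(labels, scores)
ap = average_precision(labels, scores)
print(round(auc, 4), round(ap, 4))
```

With heavily imbalanced data such as the datasets described here, the PR-based metric is usually the more sensitive of the two, because it is not inflated by the large number of correctly ranked normal samples.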

[0089] As can be seen, MACAD-Net outperforms both MA-Net (Network 1) and CA-Net (Network 2) on the selected datasets. Comparing PR-AUC and ROC-AUC reveals that MACAD-Net improves the average PR-AUC and ROC-AUC by 0.67% and 0.69% respectively across the six datasets compared to MA-Net (Network 1), and by 11.96% and 4.78% respectively compared to CA-Net (Network 2) across the same datasets.

[0090] The technical principles of the present invention have been described above with reference to specific embodiments. These descriptions are merely for explaining the principles of the invention and should not be construed as limiting the scope of protection of the invention in any way. Based on this explanation, those skilled in the art can conceive of other specific embodiments of the invention without creative effort, and these embodiments will all fall within the scope of protection of the present invention.

Claims

1. A weakly supervised anomaly detection method based on dual-branch feature fusion, comprising: acquiring and processing data samples to obtain a dataset; and inputting the dataset into a trained detection model, which outputs anomaly prediction results; characterized in that the detection model includes: a first branch module, which is a Mamba-based adaptive anomaly perception module used to capture global long-range dependencies of the dataset; a second branch module, which is an attention-enhanced residual convolution module used to extract local fine-grained features of the dataset; a fusion module, used to perform weighted fusion of the outputs of the first branch module and the second branch module to obtain fused features; and an anomaly score generator, used to output the degree of anomaly corresponding to the fused features.

2. The detection method according to claim 1, characterized in that the processing of data samples includes: missing value imputation: for attributes with missing values in the data samples, the mean of all non-missing values of the attribute is used for imputation; categorical attribute encoding: for categorical attributes in the data samples, one-hot encoding is used to convert them into numerical feature vectors; normalization: all numerical attributes in the data samples are normalized.
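
The three preprocessing steps can be sketched as follows. This is a minimal illustration under stated assumptions: the claim does not name the normalization method, so min-max scaling is assumed, and the function name `preprocess`, the column names, and the toy rows are hypothetical.

```python
def preprocess(rows, numeric_cols, categorical_cols):
    """rows: list of dicts; None marks a missing numeric value."""
    # 1) Mean imputation over the non-missing values of each numeric attribute.
    means = {}
    for c in numeric_cols:
        present = [r[c] for r in rows if r[c] is not None]
        means[c] = sum(present) / len(present)
    filled = [{c: (r[c] if r[c] is not None else means[c]) for c in numeric_cols}
              for r in rows]
    # 2) One-hot vocabulary per categorical attribute, in sorted order.
    vocab = {c: sorted({r[c] for r in rows}) for c in categorical_cols}
    # 3) Min-max ranges for normalization (assumed; computed after imputation).
    lo = {c: min(f[c] for f in filled) for c in numeric_cols}
    hi = {c: max(f[c] for f in filled) for c in numeric_cols}
    out = []
    for r, f in zip(rows, filled):
        vec = []
        for c in numeric_cols:
            span = hi[c] - lo[c]
            vec.append((f[c] - lo[c]) / span if span else 0.0)
        for c in categorical_cols:
            vec.extend(1.0 if r[c] == v else 0.0 for v in vocab[c])
        out.append(vec)
    return out

rows = [
    {"age": 20, "income": 10.0, "job": "admin"},
    {"age": None, "income": 30.0, "job": "tech"},
    {"age": 40, "income": 20.0, "job": "admin"},
]
features = preprocess(rows, ["age", "income"], ["job"])
print(features)
```

In practice the imputation means, vocabularies, and ranges would be computed on the training split and reused for test data; the sketch computes them on a single table for brevity.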

3. The detection method according to claim 1, characterized in that: The first branch module comprises m cascaded first sub-modules. Each first sub-module includes, in sequence, an attention layer, an attention reconstruction layer, an improved Mamba model, residual connections and layer normalization, and a gated attention layer, wherein: The attention layer obtains the interaction relationships between features by performing query-key-value mapping on the dataset; The attention reconstruction layer is used to reconstruct features and align dimensions of the output of the attention layer; The improved Mamba model adaptively adjusts the state update strategy based on the intensity of the abnormal features of the current sample; Residual connections and layer normalization are used to ensure effective gradient propagation and stabilize the training process; The gated attention layer adaptively weights the features concatenated after attention reconstruction and layer normalization to enhance the expressive power of anomaly-related features.

4. The detection method according to claim 3, characterized in that the steps by which the attention layer obtains the interaction relationships between features include: a. given an input feature matrix, generate a query matrix, a key matrix, and a value matrix through three independent linear transformations; b. obtain the attention scores by calculating the dot product of the query matrix and the key matrix, and scale them by a scaling factor; c. normalize the scaled attention scores using the Softmax function to obtain the attention weights, and multiply them with the value matrix to obtain the weighted aggregated attention output.
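
Steps a to c correspond to the familiar scaled dot-product attention, which can be sketched in plain Python as follows. The claim only mentions "a scaling factor"; the square root of the key dimension is assumed here because it is the usual choice. The helper names and the toy matrices are illustrative.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    m = max(row)                      # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def attention(x, w_q, w_k, w_v):
    # a. three independent linear transformations
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    # b. dot product of queries and keys, scaled by sqrt(d_k) (assumed factor)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    # c. Softmax normalization, then weighted aggregation of the values
    weights = [softmax(row) for row in scores]
    return matmul(weights, v), weights

x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three feature tokens of width 2
eye = [[1.0, 0.0], [0.0, 1.0]]             # identity projections for the demo
out, weights = attention(x, eye, eye, eye)
print(weights[2])
```

Each row of `weights` sums to one, so every output row is a convex combination of the value rows.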

5. The detection method according to claim 3, characterized in that the improved Mamba model includes: S1. In the feature projection stage, the input features are simultaneously projected into a normal pattern space and an abnormal pattern space, and the difference between the sample's distributions in the two spaces is calculated to assess the abnormal tendency of the sample. S2. In the local feature extraction stage, variable receptive field convolution is used to dynamically adjust the size of the convolution receptive field according to the abnormal tendency; the higher the abnormal tendency of a region, the larger the corresponding receptive field. S3. In the state-space model calculation stage, parameters are dynamically generated based on the deviation of the input features from the normal pattern baseline. These parameters include: a discretization step size matrix, used to control the temporal granularity of state updates, allowing finer granularity for anomalous features; an input selection matrix, which prioritizes input dimensions with high anomaly responses based on their contribution to state updates; and an output filtering matrix, used to control the output of hidden state information so as to suppress redundant information of the normal pattern. S4. In the output stage, samples with high abnormal significance in the original input features are directly incorporated into the output through residual connections.

6. The detection method according to claim 3, characterized in that the gated attention layer includes the following process: after the input features pass through a linear transformation layer, the Sigmoid activation function is used to generate gate weights with values ranging from 0 to 1; the gate weights reflect the importance of each feature dimension, and the larger the weight, the more important the feature dimension is for anomaly detection; the gate weights are multiplied element-wise with the input features to obtain the output of the current first sub-module.
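
The gating process can be sketched for a single feature vector as follows. The function name `gated_attention` and the toy weights are illustrative; in a trained model the linear layer's weights and bias would be learned.

```python
import math

def gated_attention(x, weight, bias):
    """x: feature vector of length d; weight: d x d matrix; bias: length d."""
    gates = []
    for row, b in zip(weight, bias):
        z = sum(w * v for w, v in zip(row, x)) + b     # linear transformation
        gates.append(1.0 / (1.0 + math.exp(-z)))        # Sigmoid -> value in (0, 1)
    # Element-wise product of the gate weights and the input features.
    return [g * v for g, v in zip(gates, x)], gates

x = [2.0, -4.0, 1.0]
weight = [[0.0] * 3 for _ in range(3)]   # zero weights give neutral gates of 0.5
bias = [0.0, 0.0, 0.0]
out, gates = gated_attention(x, weight, bias)
print(gates, out)
```

A gate close to 1 passes its feature dimension through almost unchanged, while a gate close to 0 suppresses it.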

7. The detection method according to claim 1, characterized in that: the second branch module includes n cascaded second sub-modules, each of which includes an average pooling layer, a deep convolutional layer, a global average pooling layer, a fully connected layer, a sigmoid activation function, and a soft-threshold shrinkage layer connected in sequence, wherein: the average pooling layer performs preliminary downsampling of the input samples to reduce feature dimensionality; the deep convolutional layer enhances anomalous features layer by layer by performing multiple cascaded convolutions; the features extracted by the deep convolutional layer are globally average-pooled and then input into the fully connected layer and the sigmoid activation function to generate channel-level adaptive gate weights; the gate weights are multiplied element-wise with the globally average-pooled features, and the result is input into the soft-threshold shrinkage layer to perform channel-level denoising; the denoised effective features are added element-wise to the features output by the average pooling layer to obtain the output features of the current second sub-module; the output of the current second sub-module is used as the input of the next second sub-module, and after processing by the n cascaded sub-modules, the final output is a feature representation containing rich local fine-grained information.
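
One possible reading of a single second sub-module is sketched below. It follows the pattern of residual shrinkage blocks, in which the product of the gate weight and the pooled channel statistic serves as a per-channel soft threshold; this interpretation, the use of absolute values in the pooling, the identity stand-in for the convolution stack, and all names are assumptions made for illustration.

```python
import math

def avg_pool(channel, size=2):
    """Non-overlapping 1-D average pooling."""
    return [sum(channel[i:i + size]) / size
            for i in range(0, len(channel) - size + 1, size)]

def soft_threshold(v, tau):
    """Shrink v toward zero by tau; values with |v| <= tau become zero."""
    return math.copysign(max(abs(v) - tau, 0.0), v)

def shrinkage_block(channels, fc_weight, fc_bias, conv=lambda c: c):
    pooled = [avg_pool(c) for c in channels]                 # preliminary downsampling
    feats = [conv(c) for c in pooled]                        # stand-in for cascaded convolutions
    gap = [sum(abs(v) for v in c) / len(c) for c in feats]   # global average per channel
    out = []
    for idx, c in enumerate(feats):
        z = sum(w * g for w, g in zip(fc_weight[idx], gap)) + fc_bias[idx]
        gate = 1.0 / (1.0 + math.exp(-z))                    # channel-level gate weight
        tau = gate * gap[idx]                                # gate x pooled statistic
        denoised = [soft_threshold(v, tau) for v in c]       # channel-level denoising
        out.append([d + p for d, p in zip(denoised, pooled[idx])])  # residual addition
    return out

channels = [[1.0, 3.0, -0.2, 0.2], [4.0, 4.0, -4.0, -4.0]]
fc_weight = [[0.0, 0.0], [0.0, 0.0]]    # zero weights give gates of 0.5
fc_bias = [0.0, 0.0]
print(shrinkage_block(channels, fc_weight, fc_bias))
```

Cascading n such blocks amounts to feeding the returned channels back into `shrinkage_block`, with separate fully connected weights per block.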

8. The detection method according to claim 7, characterized in that: the deep convolutional layer employs anomaly-response-oriented cascaded convolutional units to construct a hierarchical feature extraction pathway, and each convolutional unit introduces an anomaly activation enhancement mechanism: in the batch normalization stage, the features are standardized, and a deviation index between the current batch features and the feature distribution benchmark of normal samples is calculated; in the nonlinear activation stage, the activation threshold is dynamically adjusted based on the deviation index, such that the higher the deviation index, the lower the activation threshold; in the convolution extraction stage, the weight allocation of each dimension of the current convolution kernel is dynamically adjusted according to the intensity of the anomaly response identified by the previous convolutional unit, so that the convolution operation is biased toward dimensions with a high anomaly response. Absolute-value global average pooling is performed on the multi-channel features output by the deep convolutional layer to obtain channel-level statistical information, from which the channel activation intensity is obtained, and an inter-channel anomaly discrimination difference index is introduced. In the soft-threshold shrinkage layer, an individual soft threshold parameter is generated for each channel based on the joint characteristics of the channel activation intensity and the anomaly discrimination difference index; the higher the anomaly discrimination contribution of a channel, the smaller the threshold assigned to it. Graded soft thresholding is then performed: the channel-level statistics are compared with the corresponding thresholds, and based on the comparison results the features are divided into strong anomaly response regions, weak anomaly response regions, and noise regions. The strong anomaly response regions are kept at their original values, the weak anomaly response regions are appropriately shrunk to remove uncertainty, and the noise regions are directly set to zero. A selective residual connection performs residual fusion between the effective features retained after soft thresholding and the features with high anomaly significance output by the average pooling layer, thus avoiding the transmission of noise components through the residual path.
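
The graded soft thresholding can be illustrated with a small sketch. It assumes that each channel carries two thresholds, a lower one separating noise from weak responses and an upper one separating weak from strong responses, and that the comparison is applied to individual feature values; the claim does not fix these details, and the names `tau_low` and `tau_high` are hypothetical.

```python
import math

def graded_soft_threshold(v, tau_low, tau_high):
    mag = abs(v)
    if mag >= tau_high:        # strong anomaly response: keep the original value
        return v
    if mag >= tau_low:         # weak anomaly response: shrink toward zero
        return math.copysign(mag - tau_low, v)
    return 0.0                 # noise: set directly to zero

def shrink_channel(values, tau_low, tau_high):
    """Apply the graded rule to every value of one channel."""
    return [graded_soft_threshold(v, tau_low, tau_high) for v in values]

channel = [2.0, -1.0, 0.3, -0.1, 1.5]
print(shrink_channel(channel, tau_low=0.5, tau_high=1.5))
```

Assigning smaller thresholds to channels with a higher anomaly discrimination contribution then simply means calling `shrink_channel` with smaller `tau_low` and `tau_high` values for those channels.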

9. The detection method according to claim 1, characterized in that: The process of obtaining the fusion weight coefficients in the fusion module includes: Calculate the complementarity index of the output features of the two branch modules, and dynamically generate the fusion weight coefficient of the corresponding branch modules based on the complementarity index. The complementarity index includes: Feature space separation is used to measure the degree of difference in the output information of two branch modules; Feature synergy is used to measure the mutual support relationship between the output information of two branch modules.

10. The detection method according to claim 1, characterized in that the loss function used to train the detection model is:

$$L_{\mathrm{MACAD}} = L_{\mathrm{main}} + \lambda_1 L_{\mathrm{branch}} + \lambda_2 L_{\mathrm{consist}}$$

where $y_i \in \{0, 1\}$ represents the true label of sample $i$; [symbol missing] is the prediction probability after fusion; [symbols missing] are the prediction probabilities of the first branch module and the second branch module, respectively; $\alpha_i$ and $\gamma$ denote the class balancing weight terms; $\lambda_1$ and $\lambda_2$ are the weight coefficients; and $N$ is the total number of samples.
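
The way the three terms combine can be illustrated with a short sketch. The component losses are not spelled out in this text, so the forms used below are assumptions chosen only for illustration: a class-weighted binary cross-entropy on the fused probability for the main term, the sum of the two branches' binary cross-entropies for the branch term, and the mean squared difference between the branch probabilities for the consistency term. The parameter γ is not modeled, and the default values of `lam1` and `lam2` are arbitrary.

```python
import math

def bce(y, p, eps=1e-7):
    """Binary cross-entropy for one sample, with clipping for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def macad_loss(y, p_fused, p_b1, p_b2, alpha, lam1=0.5, lam2=0.1):
    n = len(y)
    # Main term (assumed form): class-weighted BCE on the fused prediction.
    l_main = sum(a * bce(t, p) for a, t, p in zip(alpha, y, p_fused)) / n
    # Branch term (assumed form): BCE of each branch's own prediction.
    l_branch = sum(bce(t, p1) + bce(t, p2)
                   for t, p1, p2 in zip(y, p_b1, p_b2)) / n
    # Consistency term (assumed form): squared disagreement between branches.
    l_consist = sum((p1 - p2) ** 2 for p1, p2 in zip(p_b1, p_b2)) / n
    total = l_main + lam1 * l_branch + lam2 * l_consist
    return total, (l_main, l_branch, l_consist)

y = [1, 0, 0, 1]
p_fused = [0.9, 0.2, 0.1, 0.6]
p_b1 = [0.8, 0.3, 0.1, 0.5]
p_b2 = [0.7, 0.1, 0.2, 0.7]
alpha = [3.0, 1.0, 1.0, 3.0]     # larger weight on the rare anomalous class
total, parts = macad_loss(y, p_fused, p_b1, p_b2, alpha)
print(round(total, 4), [round(v, 4) for v in parts])
```

Setting `lam1` and `lam2` to zero recovers training on the fused prediction alone, which makes the role of the two auxiliary terms easy to isolate.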