An image anomaly detection method, device and equipment based on feature reconstruction

By performing feature extraction, alignment, and reconstruction on images, the problems of false detection and missed detection in multi-category image anomaly detection in existing technologies are solved, achieving efficient anomaly detection and localization, and improving the robustness and generalization ability of the model.

CN122223005A · Pending · Publication Date: 2026-06-16 · GUIZHOU UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: GUIZHOU UNIV
Filing Date: 2026-04-24
Publication Date: 2026-06-16

AI Technical Summary

Technical Problem

Existing image anomaly detection methods are prone to false positives and false negatives in low-sample, multi-anomaly scenarios, making it difficult to achieve unified modeling of multiple categories. Furthermore, traditional methods have limited generalization ability for unknown anomalies.

Method used

By extracting features from the image to be detected, multi-layer feature maps are obtained, and then aligned and aggregated in spatial scale. The pre-trained feature reconstruction network is used for reconstruction, and the anomaly localization and overall anomaly score are determined based on the reconstruction error.

Benefits of technology

It significantly improves the model's robustness and background interference suppression capabilities, can automatically identify abnormal regions, improve detection accuracy and generalization ability, and does not rely on abnormal samples.

Generated by Eureka AI based on patent content.


Abstract

The application provides a feature reconstruction-based image anomaly detection method, device and equipment, and relates to the technical field of image detection. The method provided by the application comprises: performing feature extraction on a target image to be detected to obtain a plurality of layers of feature maps; aligning the layers of feature maps in a spatial scale, and performing neighborhood aggregation processing and cross-layer fusion processing on each layer of feature maps to obtain an aggregated feature map corresponding to the target image; reconstructing the aggregated feature map by using a pre-trained feature reconstruction network to obtain a reconstructed feature map having the same size as the aggregated feature map; and determining an anomaly positioning map corresponding to the target image and an overall anomaly score of the target image according to a reconstruction error of the reconstructed feature map and the aggregated feature map; wherein the anomaly positioning map is used to record a position of an abnormal region, and the overall anomaly score is used to evaluate an abnormal degree of the target image.

Description

Technical Field

[0001] This application relates to the field of image detection technology, and in particular to an image anomaly detection method, apparatus and device based on feature reconstruction.

Background Technology

[0002] With the rapid development of computer vision technology, image anomaly detection has gained widespread attention in various application fields such as industrial quality monitoring and medical image analysis, and has gradually become one of the key technologies for ensuring system reliability and improving product consistency. The core objective of image anomaly detection is to automatically identify abnormal regions that deviate from the normal pattern from complex image data, thereby providing a valid basis for subsequent analysis and decision-making.

[0003] However, image anomaly detection still faces many challenges in practical applications. First, the number of anomalous samples in real-world scenarios is often limited, making it difficult to cover all potential anomaly patterns, thus limiting the model's generalization ability to unknown anomalies. Second, traditional methods typically train independent models for a single category, making it difficult to achieve unified modeling for multiple categories, resulting in high deployment costs and low efficiency. Under the combined influence of highly imbalanced sample distribution, limited training data scale, and the complexity of multi-category anomaly modeling, existing detection methods are prone to false positives and false negatives in low-sample, multi-anomaly scenarios, thereby restricting the improvement of overall detection performance.

Summary of the Invention

[0004] This application provides a method, apparatus, and device for image anomaly detection based on feature reconstruction, which improves the accuracy and generalization ability of image anomaly detection.

[0005] Specifically, this application is implemented through the following technical solution:

[0006] A first aspect of this application provides an image anomaly detection method based on feature reconstruction, the method comprising:

[0007] Feature extraction is performed on the target image to be detected to obtain a multi-layer feature map;

[0008] The feature maps of each layer are aligned in spatial scale, and neighborhood aggregation and cross-layer fusion processing are performed on each feature map to obtain the aggregated feature map corresponding to the target image;

[0009] The aggregated feature map is reconstructed using a pre-trained feature reconstruction network to obtain a reconstructed feature map with the same size as the aggregated feature map;

[0010] Based on the reconstruction error between the reconstructed feature map and the aggregated feature map, an anomaly localization map corresponding to the target image and an overall anomaly score of the target image are determined; wherein the anomaly localization map is used to record the location of the abnormal region, and the overall anomaly score is used to evaluate the degree of anomaly of the target image.
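By way of a non-limiting sketch, the final step can be illustrated as follows. The error metric (squared L2 distance across channels at each spatial location) and the reduction to an overall score (the maximum over locations) are assumptions chosen for illustration; this passage does not fix either choice, and the function name `anomaly_from_reconstruction` is hypothetical.

```python
import numpy as np

def anomaly_from_reconstruction(aggregated, reconstructed):
    """Return (anomaly_map, overall_score) from two (C, H, W) feature maps."""
    # Per-location reconstruction error: squared L2 distance across channels
    # (assumed metric, not specified by the source).
    anomaly_map = np.sum((aggregated - reconstructed) ** 2, axis=0)
    # Overall anomaly score: the largest per-location error (assumed reduction).
    overall_score = float(anomaly_map.max())
    return anomaly_map, overall_score

# Toy example: a perfect reconstruction everywhere except one location.
aggregated = np.zeros((2, 3, 3))
reconstructed = aggregated.copy()
reconstructed[:, 1, 2] = [3.0, 4.0]
anomaly_map, score = anomaly_from_reconstruction(aggregated, reconstructed)
```

In this toy case the anomaly map is zero everywhere except at the poorly reconstructed location, which also determines the overall score.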

[0011] A second aspect of this application provides an image anomaly detection device based on feature reconstruction, the device comprising an extraction module, a fusion module, a reconstruction module, and a processing module;

[0012] The extraction module is used to extract features from the target image to be detected, and obtain a multi-layer feature map;

[0013] The fusion module is used to align the feature maps of each layer in spatial scale, and to perform neighborhood aggregation and cross-layer fusion processing on each feature map to obtain the aggregated feature map corresponding to the target image.

[0014] The reconstruction module is used to reconstruct the aggregated feature map using a pre-trained feature reconstruction network to obtain a reconstructed feature map with the same size as the aggregated feature map.

[0015] The processing module is used to determine the anomaly localization map corresponding to the target image and the overall anomaly score of the target image based on the reconstruction error of the reconstructed feature map and the aggregated feature map; wherein, the anomaly localization map is used to record the location of the abnormal region, and the overall anomaly score is used to evaluate the degree of anomaly of the target image.

[0016] A third aspect of this application provides an image anomaly detection device based on feature reconstruction, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of any of the methods provided in the first aspect of this application.

[0017] A fourth aspect of this application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of any of the methods provided in the first aspect of this application.

[0018] This application provides a method, apparatus, and device for image anomaly detection based on feature reconstruction. The method involves extracting features from the target image to obtain multi-layer feature maps, aligning these feature maps spatially, and performing neighborhood aggregation and cross-layer fusion processing on each feature map to obtain an aggregated feature map corresponding to the target image. This aggregated feature map effectively integrates information from different scales and semantic levels, enabling the model to simultaneously focus on fine-grained textures and high-level structures in the image during detection, significantly enhancing the ability to perceive minor or local anomalies. Furthermore, the aggregated feature map is reconstructed to obtain a reconstructed feature map. The anomaly localization map corresponding to the target image and the overall anomaly score of the target image are then determined based on the reconstruction error between the aggregated and reconstructed feature maps. In this way, since normal regions are sufficiently modeled during training, the reconstruction error is usually small; while anomaly regions, due to their feature distribution deviating from the training distribution, result in significant reconstruction errors, thereby achieving automatic identification of anomaly regions. This method relies on unsupervised modeling of the feature distribution of normal images, eliminating the need for anomaly samples, significantly alleviating the modeling difficulties caused by the scarcity of anomaly samples, and effectively improving the model's robustness and background interference suppression capabilities.

Attached Figure Description

[0019] Figure 1 is a flowchart of an embodiment of the image anomaly detection method based on feature reconstruction provided in this application;

[0020] Figure 2 is a schematic diagram illustrating the implementation principle of the feature reconstruction-based image anomaly detection method provided in this application;

[0021] Figure 3 is a schematic diagram of the structure of a feature reconstruction network shown in an exemplary embodiment of this application;

[0022] Figure 4 is a schematic diagram of the structure of a first LCC module and/or a second LCC module shown in an exemplary embodiment of this application;

[0023] Figure 5 is a schematic diagram illustrating the implementation principle of the training process of a feature reconstruction network shown in an exemplary embodiment of this application;

[0024] Figure 6 is a schematic diagram of the structure of Embodiment 1 of the image anomaly detection device based on feature reconstruction provided in this application.

Detailed Implementation

[0025] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application.

[0026] The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The singular forms “a,” “an,” and “the” used herein are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and/or” as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.

[0027] It should be understood that although the terms first, second, third, etc., may be used in this application to describe various information, this information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other. For example, without departing from the scope of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein can be interpreted as "when," "upon," or "in response to a determination." Specific embodiments are given below to describe the technical solutions of this application in detail.

[0028] Figure 1 is a flowchart of an embodiment of the image anomaly detection method based on feature reconstruction provided in this application. Figure 2 is a schematic diagram illustrating the implementation principle of that method. Referring to Figure 1 and Figure 2 together, the method provided in this embodiment may include:

[0029] S101. Extract features from the target image to be detected to obtain a multi-layer feature map.

[0030] Specifically, a backbone network can be used to extract features from the target image. The backbone network can be any pre-trained backbone network containing multiple cascaded feature extraction modules. For example, in one possible implementation, the backbone network can be either ResNet50 or ResNet101.

[0031] The target image to be detected is the image that needs to be checked for anomalies. The target image can be an image from any of various scenes; this embodiment does not limit it.

[0032] It should be noted that the backbone network includes multiple cascaded feature extraction modules. The target image propagates layer by layer through these modules, with each module outputting a feature map at a corresponding level, so that feature maps of the target image are captured from low to high levels. In this step, the outputs of at least two designated feature extraction modules within the backbone network are obtained, yielding feature maps at two or more levels and thus a multi-layer feature map.

[0033] It should be noted that the specific feature extraction module designated is determined based on actual needs. In this embodiment, no limitation is imposed. For example, in one possible implementation, the designated feature extraction module is an intermediate feature extraction module. For instance, when the backbone network includes cascaded first, second, third, and fourth feature extraction modules, the designated feature extraction module can be at least two of the second, third, and fourth feature extraction modules. The following explanation uses the second and third feature extraction modules as examples.
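As a non-limiting illustration of tapping the outputs of designated modules in a cascade, the following sketch uses a toy stand-in for each feature extraction module. The function `toy_stage` is hypothetical and is not a ResNet block; it only mimics the typical pattern of halving the spatial resolution while increasing the channel count. The stage indices (second and third) follow the example in the text.

```python
import numpy as np

def toy_stage(x):
    """Hypothetical stand-in for one feature extraction module.

    Halves height and width by 2x2 average pooling and doubles the channel
    count by stacking the pooled map with itself. Not a real backbone block.
    """
    c, h, w = x.shape
    pooled = x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))
    return np.concatenate([pooled, pooled], axis=0)

def extract_multilayer(image, num_stages=4, tapped=(2, 3)):
    """Run the cascade and keep the outputs of the designated (1-based) stages."""
    feats = {}
    x = image
    for idx in range(1, num_stages + 1):
        x = toy_stage(x)
        if idx in tapped:
            feats[idx] = x
    return feats

image = np.random.default_rng(0).random((3, 64, 64))
features = extract_multilayer(image)
```

The two tapped feature maps have different spatial resolutions, which is why the subsequent step aligns them before fusion.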

[0034] S102. Align the feature maps of each layer in terms of spatial scale, and perform neighborhood aggregation and cross-layer fusion processing on each feature map to obtain the aggregated feature map corresponding to the target image.

[0035] Optionally, in one possible implementation, the specific implementation process of this step includes:

[0036] (1) Align the feature maps of each layer in the multi-layer feature map in terms of spatial scale so that each layer of feature map has the same spatial resolution.

[0037] (2) For each spatial location in the feature map of each layer, perform average pooling on multiple feature vectors in the specified neighborhood of the spatial location, and determine the processing result as the aggregated feature of the spatial location on the feature map of that layer.

[0038] (3) Perform channel concatenation and dimension adjustment on the aggregated features of each spatial location on each layer feature map to obtain the aggregated features corresponding to the spatial location, and use the aggregated features corresponding to all spatial locations to construct the aggregated feature map.

[0039] It should be noted that, as described above, a multi-layer feature map includes at least two layers of feature maps, each with a different spatial resolution. In this step, following the order of the designated feature extraction modules from front to back, the feature map output by the first designated feature extraction module is used as a reference, and the feature maps output by the subsequent designated feature extraction modules are upsampled or downsampled so that their spatial resolution matches that of the reference feature map.

[0040] In a specific implementation, for example, in one embodiment, the target image to be detected is denoted $I$, with height $H$ and width $W$; the first-layer feature map output by the second feature extraction module is denoted $F_1$, with dimensions $C_1 \times H_1 \times W_1$, where $C_1$ is the number of channels, $H_1$ is the height of the first-layer feature map, and $W_1$ is the width of the first-layer feature map; the second-layer feature map output by the third feature extraction module is denoted $F_2$, with dimensions $C_2 \times H_2 \times W_2$, where $C_2$ is the number of channels, $H_2$ is the height of the second-layer feature map, and $W_2$ is the width of the second-layer feature map.

[0041] As can be understood from the preceding description, in this step the second-layer feature map is upsampled or downsampled, with the first-layer feature map as the reference, so that its spatial resolution is the same as that of the first-layer feature map. That is, after spatial alignment, every feature map layer has height $H_1$ and width $W_1$.
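As a minimal sketch of the spatial alignment described above, the following example resizes every later feature map to the resolution of the first one. Nearest-neighbor sampling is an assumption made for simplicity; the text only states that the maps are upsampled or downsampled, and an actual implementation might use bilinear interpolation instead. The function names are hypothetical.

```python
import numpy as np

def resize_nearest(feat, out_h, out_w):
    """Resize a (C, H, W) map to (C, out_h, out_w) by nearest-neighbor sampling."""
    _, h, w = feat.shape
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return feat[:, rows[:, None], cols[None, :]]

def align_to_first(feature_maps):
    """Bring every feature map to the spatial resolution of the first one."""
    ref_h, ref_w = feature_maps[0].shape[1:]
    return [feature_maps[0]] + [
        resize_nearest(f, ref_h, ref_w) for f in feature_maps[1:]
    ]

first = np.zeros((4, 16, 16))
second = np.arange(2 * 8 * 8, dtype=float).reshape(2, 8, 8)
aligned = align_to_first([first, second])
```

The same helper handles downsampling, since the index computation works for any ratio between the source and reference resolutions.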

[0042] Furthermore, after aligning the feature maps of each layer in terms of spatial scale, in this step, for each spatial location in each layer of feature map, the multiple feature vectors in the specified neighborhood of that spatial location are combined through a specific aggregation strategy to generate the aggregated feature of that location on that layer of feature map.

[0043] It should be noted that each spatial location in the feature map corresponds to the location point for each set of width and height. Specifically, the feature map is typically a three-dimensional tensor, and each spatial location refers to the location point corresponding to a given height and a given width. The specific range of the specified neighborhood is set according to actual needs; in this embodiment, it is not limited, and its specific range is as described in the following formula:

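[0044] [The formula defining the specified neighborhood is not reproduced in this text.]

[0045] In that formula, the specified neighborhood of a spatial location is defined in terms of r, a preset value (its value is likewise not reproduced here); h represents the height coordinate of the spatial location, w represents its width coordinate, and the remaining index pair gives the coordinates of each spatial location within the neighborhood.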

[0046] In a specific implementation, in one embodiment, the specified neighborhood of each spatial location is a 3×3 area centered on that spatial location.
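By way of a hedged illustration only, one neighborhood definition consistent with the 3×3 embodiment is given below. The notation $\mathcal{N}_{(h,w)}$ and $(i, j)$ is assumed, and treating $r$ as a half-width is also an assumption; the source may instead define $r$ as the full side length of the neighborhood.

```latex
\mathcal{N}_{(h,w)} = \bigl\{\, (i, j) \;:\; |i - h| \le r,\ |j - w| \le r \,\bigr\},
\qquad r = 1 \;\Longrightarrow\; \text{a } 3 \times 3 \text{ neighborhood centered on } (h, w)
```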

[0047] In practice, multiple feature vectors within a specified neighborhood of the spatial location can be subjected to average pooling, and the processing result can be determined as the aggregated feature of the spatial location on the feature map of that layer.

[0048] In practice, for each channel, average pooling can be performed on multiple feature vectors within a specified neighborhood of the spatial location using the following formula.

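[0049] $\tilde{f}_l(h, w) = \operatorname{AvgPool}\bigl(\{\, f_l(i, j) \mid (i, j) \in \mathcal{N}_{(h,w)} \,\}\bigr)$;

[0050] where $\tilde{f}_l(h, w)$ is the aggregated feature of spatial location $(h, w)$ on the $l$-th layer feature map, $\operatorname{AvgPool}(\cdot)$ denotes average pooling processing, $\mathcal{N}_{(h,w)}$ is the specified neighborhood of spatial location $(h, w)$, and $f_l(i, j)$ is the feature value at spatial location $(i, j)$ on the $l$-th layer feature map.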

[0051] Referring to the preceding description, applying the above processing to each channel of the first-layer feature map yields, for each channel, an $H_1 \times W_1$ aggregated feature vector (composed of the aggregated features at each position in that channel), where $H_1$ and $W_1$ are the height and width of the first-layer feature map. Across all $C_1$ channels, this gives a $C_1 \times H_1 \times W_1$ aggregated feature (composed of the aggregated feature vectors corresponding to each channel). Similarly, after spatial scale alignment the second-layer feature map also has height $H_1$ and width $W_1$, so a $C_2 \times H_1 \times W_1$ aggregated feature can likewise be obtained for the second-layer feature map, where $C_2$ is its number of channels.

[0052] Understandably, by using average pooling, multiple feature vectors within a specified neighborhood of a spatial location are fused into a single aggregated feature. This aggregated feature not only retains spatial location information but also incorporates other feature information from the neighborhood, enhancing the robustness of the feature representation and providing a more discriminative feature representation for subsequent anomaly detection.
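The neighborhood average pooling described above can be sketched as follows. This is a minimal illustration under stated assumptions: the parameter `r` is treated as a half-width (so `r=1` gives the 3×3 neighborhood of the embodiment), and neighborhoods at the border are truncated to in-bounds positions. Neither convention is confirmed by the source, and the function name is hypothetical.

```python
import numpy as np

def neighborhood_avg_pool(feat, r=1):
    """Average each location of a (C, H, W) map over its (2r+1)x(2r+1) neighborhood.

    Border neighborhoods are truncated to in-bounds positions (assumption).
    """
    c, h, w = feat.shape
    padded = np.pad(feat, ((0, 0), (r, r), (r, r)))   # zero padding
    valid = np.pad(np.ones((h, w)), r)                # 1 where a position is in bounds
    total = np.zeros((c, h, w))
    count = np.zeros((h, w))
    for di in range(2 * r + 1):
        for dj in range(2 * r + 1):
            total += padded[:, di:di + h, dj:dj + w]  # shifted copy of the map
            count += valid[di:di + h, dj:dj + w]      # neighbors actually present
    return total / count

feat = np.arange(16, dtype=float).reshape(1, 4, 4)
pooled = neighborhood_avg_pool(feat, r=1)
```

Because the pooling uses a stride of 1, the output has the same height and width as the input, which matches the per-location aggregated features described in the text.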

[0053] Furthermore, in practical implementation, the aggregated features on each layer's feature map can be processed by channel concatenation and dimensional adjustment according to the following formula:

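[0054] $F = f_{\mathrm{adj}}\bigl(f_{\mathrm{cat}}(\tilde{f}_l,\ \tilde{f}_{l+1})\bigr)$;

[0055] where $F$ is the aggregated feature map, $f_{\mathrm{adj}}(\cdot)$ is the function used for dimension adjustment, $f_{\mathrm{cat}}(\cdot)$ is the function used for channel concatenation, $\tilde{f}_l$ is the aggregated feature corresponding to the $l$-th layer, and $\tilde{f}_{l+1}$ is the aggregated feature corresponding to the $(l+1)$-th layer.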

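[0056] Based on the example above, the aggregated feature corresponding to the l-th layer has dimension $C_1 \times H_1 \times W_1$ and the aggregated feature corresponding to the (l+1)-th layer has dimension $C_2 \times H_1 \times W_1$, where $C_1$ and $C_2$ are the channel counts of the two layers and $H_1 \times W_1$ is the aligned spatial resolution. After channel concatenation, a $(C_1 + C_2) \times H_1 \times W_1$ concatenated feature is obtained, whose dimension is then further adjusted to obtain the aggregated feature map.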

[0057] It should be noted that the specific dimensions of the aggregated feature map are set according to actual needs, and are not limited in this embodiment. For example, in one possible implementation, the channel dimension of the aggregated feature map is a preset value.
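As a non-limiting sketch of the cross-layer fusion, the following example concatenates two spatially aligned aggregated features along the channel axis and then adjusts the channel dimension. The dimension adjustment is modeled here as a per-location linear map with a hypothetical weight matrix `proj`; the source does not specify which adjustment function is used, so this is an assumption.

```python
import numpy as np

def fuse_layers(agg_l, agg_l1, proj):
    """Concatenate two aligned (C, H, W) aggregated features and adjust channels."""
    if agg_l.shape[1:] != agg_l1.shape[1:]:
        raise ValueError("feature maps must be spatially aligned first")
    stacked = np.concatenate([agg_l, agg_l1], axis=0)   # (C1 + C2, H, W)
    # Dimension adjustment as a per-location linear map (hypothetical choice).
    return np.einsum("dc,chw->dhw", proj, stacked)      # (D, H, W)

rng = np.random.default_rng(0)
agg_l = rng.random((4, 6, 6))        # aggregated feature of layer l
agg_l1 = rng.random((8, 6, 6))       # aggregated feature of layer l+1, already aligned
proj = rng.standard_normal((5, 12))  # maps 12 concatenated channels to D = 5
fused = fuse_layers(agg_l, agg_l1, proj)
```

The spatial resolution is unchanged by this step; only the channel dimension moves from the concatenated width to the preset target.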

[0058] S103. Reconstruct the aggregated feature map using a pre-trained feature reconstruction network to obtain a reconstructed feature map with the same size as the aggregated feature map.

[0059] Specifically, the feature reconstruction network is pre-trained according to actual needs. The feature reconstruction network can reconstruct the input aggregated feature map and output the reconstructed feature map.

[0060] It should be noted that the feature reconstruction network includes an encoder and a decoder. The encoder encodes the aggregated feature map to obtain an encoded feature map; the decoder decodes the encoded feature map to output the corresponding reconstructed feature map. Figure 3 is a schematic diagram of the structure of a feature reconstruction network shown in an exemplary embodiment of this application. Referring to Figure 3, in one possible implementation, the encoder and the decoder are structured as follows:

[0061] The encoder includes an encoding start layer, a first LCC module (Local Context Constraint, or LCC module for short), an encoding compression layer, and an encoding output layer. The encoding start layer performs convolution processing on the input aggregated feature map to compress the original channel dimension, obtaining an initial compressed feature map. The first LCC module applies local context constraints to the initial compressed feature map to generate a fused feature map. The encoding compression layer extracts key features and compresses the channel dimension of the fused feature map to obtain an encoded compressed feature map. The encoding output layer performs channel compression on the encoded compressed feature map, reducing its dimensionality to encoded features in the latent space. The latent encoding dimension of the encoded features is a preset value.

[0062] The decoder includes a decoding start layer, a decoding extension layer, a second LCC module, and a decoding output layer. The decoding start layer is used to perform channel expansion on the encoded features, initially restoring the feature dimensions to obtain an initial decoded feature map. The decoding extension layer is used to extract high-dimensional features from the initial decoded feature map to obtain a decoded extended feature map. The second LCC module is used to apply local context constraints to the decoded extended feature map to obtain a decoded fusion feature map. The decoding output layer is used to restore the decoded fusion feature map back to the original channel dimensions to obtain the reconstructed feature map.

[0063] Optionally, in one possible implementation, the encoder and the decoder have symmetrical structures; wherein the encoding start layer and the decoding output layer have the same structure, and both the encoding start layer and the decoding output layer include a first convolutional layer, a first instance normalization layer and a first activation processing layer.

[0064] The first LCC module and the second LCC module have the same structure. Both the first LCC module and the second LCC module include an average pooling layer, a first inner convolutional module, a concatenation module, and a second inner convolutional module. The average pooling layer is used to perform average pooling on the input features to obtain semantic representation features. The first inner convolutional module is used to project the semantic representation features onto a low-dimensional space to obtain fused spatial constraint features. The concatenation module is used to concatenate the input features and the fused spatial constraint features to obtain enhanced features. The second inner convolutional module is used to perform nonlinear mapping on the enhanced features to obtain output features with the same spatial size as the input features. Both the first inner convolutional module and the second inner convolutional module include an inner convolutional layer, a batch normalization layer, and an inner activation layer.

[0065] The encoding compression layer and the decoding extension layer have the same structure. Both the encoding compression layer and the decoding extension layer include N convolutional modules. Each convolutional module includes a second convolutional layer, a second instance normalization layer and a second activation processing layer.

[0066] The encoding output layer and the decoding start layer have the same structure, and both the encoding output layer and the decoding start layer include a third convolutional layer, a third instance normalization layer and a third activation processing layer.

[0067] In practical implementation, the kernel size of the first convolutional layer is 1×1, and the stride is 1. In this application, all convolutional layers in the feature reconstruction network use 1×1 convolutional kernels to achieve spatial decoupling. The 1×1 convolutional kernel performs linear combinations between channels only at a single spatial location, without relying on spatial neighborhood information, thus cutting off the path of feature copying using spatially adjacent pixels and effectively suppressing the identity shortcut problem. By using 1×1 convolutional kernels, the model focuses more independently on the features of each channel, significantly reducing the risk of feature passthrough and enhancing the diversity of feature learning.
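The spatial decoupling property of a 1×1 convolution can be checked with a short sketch. A 1×1 convolution is a linear combination of channels applied independently at every spatial location, so changing the input at one location can only change the output at that same location. The function name `conv1x1` and the random weights are illustrative.

```python
import numpy as np

def conv1x1(x, weight, bias=None):
    """1x1 convolution on a (C_in, H, W) map with weight of shape (C_out, C_in)."""
    y = np.einsum("oc,chw->ohw", weight, x)   # mixes channels at each location only
    if bias is not None:
        y = y + bias[:, None, None]
    return y

rng = np.random.default_rng(0)
x = rng.random((3, 5, 5))
weight = rng.random((4, 3))

perturbed = x.copy()
perturbed[:, 2, 2] += 1.0                     # change a single spatial location
diff = conv1x1(perturbed, weight) - conv1x1(x, weight)
changed = np.abs(diff).max(axis=0) > 1e-12    # (H, W) mask of affected locations
```

Only the perturbed location is affected, which illustrates why such a layer cannot copy features from spatially adjacent positions.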

[0068] Furthermore, the first instance normalization layer is used to perform instance normalization processing. It should be noted that instance normalization processing is performed on a single sample basis. For each input sample, instance normalization processing independently calculates the mean and variance of its features and performs normalization processing. Compared to traditional batch normalization processing, which masks individual differences among samples during the normalization process and affects the ability to distinguish abnormal patterns, instance normalization processing preserves the feature independence of each sample.
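A minimal sketch of instance normalization follows, showing that the statistics of one sample do not depend on the other samples in the batch. The learnable scale and shift that a trained layer would normally carry are omitted here as a simplification, and the value of `eps` is an assumed default.

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Normalize a (N, C, H, W) batch per sample and per channel over H x W."""
    mean = x.mean(axis=(2, 3), keepdims=True)
    var = x.var(axis=(2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
batch = rng.random((2, 3, 4, 4))
out = instance_norm(batch)

# Replace the second sample with very different values.
other = batch.copy()
other[1] = other[1] * 100.0 + 7.0
out_other = instance_norm(other)
```

The normalized first sample is identical in both cases, whereas batch normalization would compute its statistics across both samples and so let one sample influence the other.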

[0069] Furthermore, the activation function of the first activation processing layer is the ReLU activation function, which increases the nonlinear expressive power of the model.

[0070] Furthermore, the first LCC module and the second LCC module have the same structure. Figure 4 is a schematic diagram of the structure of a first LCC module and/or a second LCC module shown in an exemplary embodiment of this application. Referring to Figure 4, the LCC module includes an average pooling layer, a first inner convolutional module, a concatenation module, and a second inner convolutional module. Specifically, the average pooling layer performs average pooling over a 7×7 region with a stride of 1; it extracts local contextual semantic features and imposes local contextual constraints on the input features.

[0071] Furthermore, the first inner convolutional module may include an inner convolutional layer, a batch normalization layer, and an inner activation layer. In specific implementation, the inner convolutional layer uses a 1×1 convolutional kernel, and the inner activation layer uses the ReLU activation function. It extracts features by compressing the number of channels through 1×1 convolutional units, accelerates training convergence and introduces nonlinearity through the batch normalization layer and the ReLU activation function, and outputs fused spatial constraint features.

[0072] Furthermore, the concatenation module is used to concatenate the fused spatial constraint features and the input features to output enhanced features.

[0073] Furthermore, the specific structure of the second inner convolutional module is determined according to actual needs, and this embodiment does not limit it. For example, in one embodiment, the second inner convolutional module can be set to have the same structure as the first inner convolutional module.

[0074] Furthermore, the enhanced features are nonlinearly mapped through a second inner convolutional module to obtain output features with the same spatial size as the input features.
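The data flow through an LCC module can be sketched as follows. This is an illustration under stated assumptions rather than the disclosed implementation: batch normalization is omitted for brevity, the pooling uses zero padding of 3 and divides by the full window size, and the weight shapes (including the reduced channel count of the projection) are hypothetical.

```python
import numpy as np

def avg_pool_same(x, k=7):
    """k x k average pooling, stride 1, zero padding k // 2 on a (C, H, W) map.

    Padded zeros are counted in the average (assumption).
    """
    pad = k // 2
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros((c, h, w))
    for di in range(k):
        for dj in range(k):
            out += padded[:, di:di + h, dj:dj + w]
    return out / (k * k)

def conv1x1_relu(x, weight):
    """1x1 convolution followed by ReLU (batch normalization omitted)."""
    return np.maximum(np.einsum("oc,chw->ohw", weight, x), 0.0)

def lcc_forward(x, w_proj, w_out):
    context = avg_pool_same(x, k=7)                    # semantic representation features
    constraint = conv1x1_relu(context, w_proj)         # fused spatial constraint features
    enhanced = np.concatenate([x, constraint], axis=0) # concatenation module
    return conv1x1_relu(enhanced, w_out)               # back to the input channel count

rng = np.random.default_rng(0)
x = rng.random((8, 10, 10))
w_proj = rng.standard_normal((4, 8))     # projects 8 channels down to 4
w_out = rng.standard_normal((8, 12))     # maps 8 + 4 channels back to 8
y = lcc_forward(x, w_proj, w_out)
```

The only spatial operation in the module is the average pooling, so neighboring information enters the output solely as smoothed local statistics.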

[0075] Furthermore, please continue to refer to Figure 4. The encoding compression layer and the decoding extension layer have the same structure. Both the encoding compression layer and the decoding extension layer include N convolutional modules. Each convolutional module includes a second convolutional layer, a second instance normalization layer and a second activation processing layer.

[0076] Specifically, the number N of convolutional modules is set according to actual needs; in this embodiment, it is not limited. Stacking multiple convolutional modules helps to gradually extract and restore deeper and more complex features.

[0077] Furthermore, the second convolutional layer can use a 1×1 convolutional kernel, the second instance normalization layer can be an IN layer, and the activation function of the second activation processing layer is the ReLU activation function.

[0078] Finally, the structure of the encoding output layer and the decoding start layer is the same, and both the encoding output layer and the decoding start layer include a third convolutional layer, a third instance normalization layer and a third activation processing layer.

[0079] Specifically, the third convolutional layer can use a 1×1 convolutional kernel, the third instance normalization layer can be an IN layer, and the third activation processing layer can be a ReLU activation function.

[0080] As described above, the specific structure of the feature reconstruction network is shown in Table 1 below.

[0081] Table 1: Specific Structure of Feature Reconstruction Network

[0082] [Table 1 is not reproduced in this text.]

[0083] In Table 1, the latent coding dimension of the encoded feature is a preset value and, as described above, the dimension of the aggregated feature map is also a preset value. It should be noted that in this embodiment, an LCC module is introduced into the feature reconstruction network and integrated into the encoding and decoding process (it can be integrated between the encoder's encoding start layer and encoding compression layer, and between the decoder's decoding extension layer and decoding output layer, respectively). As shown in Figure 4 and in the preceding description, the LCC module includes an average pooling layer and two convolutional layers, designed to extract spatial context information while suppressing interference from invalid details, thus maintaining a compact structure. Specifically, for the input features, this module extracts region statistics through an average pooling operation with a kernel size of 7×7, a stride of 1, and padding of 3, obtaining a semantic representation of the local region. This process enhances local semantic awareness while intentionally discarding some detailed information, helping to reduce the dependence of the reconstruction process on spatial details. To further limit the model's focus on details, the module projects the semantic representation onto a low-dimensional space, obtaining fused spatial constraint features. Finally, the input features and fused spatial constraint features are concatenated and mapped using a 1×1 convolutional layer, a batch normalization layer, and a ReLU activation function to ensure that the output features have the same dimension as the original input features. Given an input image, its aggregated feature map is reconstructed through the feature reconstruction network to obtain the reconstructed feature map.

[0084] In other words, the LCC module introduces local context constraints into the feature transmission path, forcing the model to ignore unnecessary pixel-level high-frequency details and instead focus on the statistical distribution patterns within local regions. This constraint mechanism effectively refines the feature representation of normal samples, thereby improving sensitivity to abnormal samples.

[0085] It should be noted that the feature reconstruction network achieves efficient reconstruction of aggregated feature maps through a carefully designed encoder and decoder structure. Specifically: the encoder first compresses the original channel dimension through the encoding start layer, and then uses the first LCC module to fuse spatial information, enhancing the expressive power of the features; the encoding compression layer further performs nonlinear mapping and dimensionality compression, and the encoding output layer reduces the dimensionality of the features to the encoded features in the latent space. The decoder, through a symmetrical structure, sequentially completes the processes of channel expansion, feature fusion, and mapping back to the original channel dimension, finally obtaining the reconstructed feature map. The LCC modules (including the first and second LCC modules) effectively extract and fuse spatial context information through operations such as average pooling, low-dimensional projection, feature concatenation, and nonlinear mapping; the encoding compression layer and decoding expansion layer, as well as the encoding output layer and decoding start layer, all adopt the same structure, ensuring the consistency and stability of the network during processing. This not only improves the accuracy of feature reconstruction but also effectively suppresses the feature pass-through problem, enhancing the network's generalization ability and robustness for multi-class image anomaly detection.
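The symmetric channel schedule of the encoder and decoder can be sketched as follows. This is only an illustration under assumptions: the channel widths are hypothetical (Table 1 is not reproduced here), the weights are random rather than trained, each layer is modeled as a 1×1 convolution followed by instance normalization and ReLU, and the two LCC modules are left out to keep the sketch short.

```python
import numpy as np

def layer(x, weight, eps=1e-5):
    """1x1 convolution, instance normalization, then ReLU on a (C, H, W) array."""
    y = np.einsum("oc,chw->ohw", weight, x)
    y = (y - y.mean(axis=(1, 2), keepdims=True)) / np.sqrt(y.var(axis=(1, 2), keepdims=True) + eps)
    return np.maximum(y, 0.0)

def build_weights(widths, rng):
    """One (C_out, C_in) matrix for each consecutive pair of channel widths."""
    return [rng.normal(scale=0.1, size=(o, i)) for i, o in zip(widths[:-1], widths[1:])]

def reconstruct(x, enc_weights, dec_weights):
    z = x
    for wgt in enc_weights:      # encoding start, compression and output layers
        z = layer(z, wgt)
    latent = z                   # encoded features in the latent space
    for wgt in dec_weights:      # decoding start, extension and output layers
        z = layer(z, wgt)
    return z, latent

rng = np.random.default_rng(0)
enc_widths = [32, 24, 16, 8]     # hypothetical: original channels down to latent dimension
dec_widths = enc_widths[::-1]    # the decoder mirrors the encoder
enc = build_weights(enc_widths, rng)
dec = build_weights(dec_widths, rng)
x = rng.normal(size=(32, 6, 6))
recon, latent = reconstruct(x, enc, dec)
print(latent.shape, recon.shape)
```

Because the decoder widths are the reverse of the encoder widths, the reconstructed feature map has the same size as the aggregated feature map that was fed in.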

[0086] It should be noted that the feature reconstruction network is a pre-trained network. Figure 5 is a schematic diagram illustrating the implementation principle of the training process of the feature reconstruction network, as shown in an exemplary embodiment of this application. Referring to Figure 5, in one possible implementation, the training process of the feature reconstruction network may include:

[0087] (1) Acquire multiple normal images; the multiple normal images belong to multiple different categories.

[0088] Specifically, multiple normal images are collected from several different categories. These normal images can represent the typical normal state of their respective categories and cover all categories to be detected.

[0089] In a specific implementation, for example, in one embodiment, the multiple images can form a dataset, which consists of a training set and a test set.

[0090] (2) For each of the multiple normal images, extract the normal aggregated feature map of that normal image.

[0091] In practice, the specific methods for extracting aggregated feature maps can be found in the description of the above embodiments, and will not be repeated here.

[0092] (3) For each normal aggregated feature map, a noise tensor with the same dimension as the normal aggregated feature map is constructed based on a preset Gaussian distribution; wherein each element in the noise tensor is independently sampled from the preset Gaussian distribution.

[0093] Specifically, a Gaussian distribution can be set according to actual needs, and its mean and variance can be adjusted according to experimental requirements.

[0094] In practice, for each normal aggregated feature map, a noise tensor with the same dimension as the normal aggregated feature map is constructed based on a preset Gaussian distribution. Each element in the noise tensor is obtained by independent sampling from the Gaussian distribution, and therefore has randomness.

[0095] For example, in one embodiment, each element in the noise tensor follows an independent and identically distributed Gaussian distribution.

[0096] (4) The noise tensor is used to process the normal aggregated feature map to generate the abnormal aggregated feature map corresponding to the normal aggregated feature map.

[0097] Specifically, the noise tensor is added to the normal aggregated feature map to obtain the corresponding abnormal aggregated feature map.

[0098] In practice, the abnormal aggregated feature map can be calculated using the following formula:

[0099] $F^{a} = F^{n} + \epsilon$;

[0100] where $F^{a}$ is the abnormal aggregated feature map, $F^{n}$ is the normal aggregated feature map, and $\epsilon$ is the noise tensor.
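Steps (3) and (4) can be sketched in a few lines of NumPy. The function name `make_abnormal`, the feature map shape, and the mean and standard deviation used here are illustrative assumptions; the application only states that the Gaussian distribution is preset and can be adjusted as needed.

```python
import numpy as np

def make_abnormal(normal_map, mean=0.0, std=0.1, rng=None):
    """Add element-wise i.i.d. Gaussian noise to a normal aggregated feature map."""
    rng = np.random.default_rng() if rng is None else rng
    # The noise tensor has the same shape as the feature map; every element is sampled independently.
    noise = rng.normal(loc=mean, scale=std, size=normal_map.shape)
    return normal_map + noise, noise

rng = np.random.default_rng(0)
f_normal = rng.normal(size=(16, 8, 8))          # hypothetical (C, H, W) aggregated feature map
f_abnormal, noise = make_abnormal(f_normal, std=0.1, rng=rng)
print(f_abnormal.shape, float(noise.std()))
```

Since the noise is added in feature space, no real abnormal image is needed to produce the abnormal aggregated feature maps used for training.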

[0101] (5) The initial feature reconstruction network is trained using the normal aggregated feature map and the abnormal aggregated feature map to obtain the pre-trained feature reconstruction network.

[0102] In practice, normal aggregated feature maps and abnormal aggregated feature maps are input into the initial feature reconstruction network to obtain the trained feature reconstruction network.

[0103] In a specific implementation, in one possible approach, the loss function of the initial feature reconstruction network is:

[0104] ;

[0105] Here, the quantities in the loss function are as follows: the height and the width of the normal aggregated feature map, which are equal to the height and width of the aggregated feature map described above; the hyperparameter λ; the normal aggregated feature map and its reconstructed feature map, which together form the encouraging reconstruction term for the normal aggregated feature map; and the abnormal aggregated feature map and its reconstructed feature map, which together form the inhibitory reconstruction term for the abnormal aggregated feature map.

[0106] Understandably, the loss function consists of two parts, corresponding to the reconstruction accuracy of normal samples and the reconstruction suppression of abnormal samples, respectively. Using this loss function, the feature reconstruction network can not only learn high-quality normal feature reconstruction paths, but also deliberately create reconstruction failures on abnormal samples, thereby strengthening the discriminative ability of abnormal features and significantly improving the robustness and sensitivity of anomaly detection and localization.

[0107] It should be noted that by constructing the loss function in the above form, the initial feature reconstruction network can be trained using an encouraging reconstruction term for normal samples and an inhibiting reconstruction term for abnormal samples. In other words, the Differential Reconstruction Constraint Mechanism (DRCM) can be used to train the initial feature reconstruction network.
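One possible shape of such a two-term loss is sketched below. It should be read as an assumption, not as the loss function of this application: the encouraging term is taken as the squared reconstruction error of the normal map averaged over the H×W locations, and the inhibitory term is modeled as a hinge that penalizes abnormal locations whose reconstruction error falls below a hypothetical `margin`. The weighting of the inhibitory term by λ (`lam`) is likewise an assumption.

```python
import numpy as np

def drcm_loss(f_n, rec_n, f_a, rec_a, lam=0.5, margin=1.0):
    """Two-term loss on (C, H, W) maps: encourage normal reconstruction, inhibit abnormal."""
    h, w = f_n.shape[1:]
    err_n = np.sum((f_n - rec_n) ** 2, axis=0)   # per-location squared error, shape (H, W)
    err_a = np.sum((f_a - rec_a) ** 2, axis=0)
    encourage = err_n.sum() / (h * w)
    # Hinge (assumed form): zero once the abnormal error reaches the margin.
    suppress = np.maximum(0.0, margin - err_a).sum() / (h * w)
    return encourage + lam * suppress, encourage, suppress

f = np.ones((4, 3, 3))
# Identity shortcut: both maps are copied perfectly, so only the inhibitory term is non-zero.
print(drcm_loss(f, f, f, f))
# Abnormal map reconstructed poorly: the inhibitory term vanishes.
print(drcm_loss(f, f, f, f + 1.0))
```

Under this assumed form, a network that simply copies its input is penalized on the abnormal maps, which is the behavior the differential reconstruction constraint is meant to induce.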

[0108] In this embodiment, normal images of multiple categories are acquired, their normal aggregated feature maps are extracted, and noise tensors sampled from a preset Gaussian distribution are added to construct abnormal aggregated feature maps. The feature reconstruction network is then trained on both the normal and the abnormal aggregated feature maps, which effectively improves its ability to distinguish abnormal features across multiple categories of images. The trained network can accurately reconstruct normal features while producing significant reconstruction errors on abnormal features, thereby achieving accurate detection of image anomalies and enhancing the robustness and generalization of multi-category image anomaly detection. In other words, the training method not only requires the model to reconstruct normal samples accurately, but also uses a guidance mechanism to cause reconstruction failures on abnormal samples, which prevents the network from learning similar reconstruction patterns on normal and abnormal samples. Together, the model structure and the training method keep the feature reconstruction network from degenerating into simple feature copying through over-reliance on the spatial neighborhood of the input features during training, i.e., the so-called "identity shortcut" problem.

[0109] It should be noted that feature-based reconstruction is one of the mainstream technical approaches for image anomaly detection. Its basic principle is to train a reconstruction network using data containing only normal samples, so that the network learns the distribution of normal features. During the inference phase, normal regions can be accurately reconstructed, while abnormal regions deviate from the training distribution, resulting in a significant increase in reconstruction error, thereby achieving anomaly detection.

[0110] However, existing reconstruction methods are prone to the "identity shortcut" problem in multi-class unified modeling scenarios. When the training data covers multiple classes, the network tends to learn a simple identity mapping, that is, it copies the input features directly to the output rather than truly learning a semantic reconstruction of the features. This identity shortcut allows the network to reconstruct abnormal regions well too, thereby weakening the anomaly detection capability.

[0111] In this application, the feature reconstruction network adopts an encoder-decoder structure. All convolutional layers use 1×1 convolutional kernels to achieve independent channel processing (each spatial location is mapped without reference to its neighbors), and an LCC module (Local Context Constraint module) is introduced to enhance feature representation. During the training phase, a differential reconstruction constraint mechanism is employed, using a loss function that contains both an encouragement term and a suppression term. This effectively solves the identity shortcut problem of reconstruction methods in multi-class unified modeling scenarios, further improving detection accuracy and generalization ability. Referring to the preceding description, the feature reconstruction network in this application has the following properties. First, the 1×1 convolutional kernels achieve independent channel processing, preventing the network from using spatial neighborhood information to copy features and effectively suppressing the identity shortcut problem. Second, the LCC module enhances the semantic representation ability of the features while maintaining channel independence. Third, training with the differential reconstruction constraint mechanism applies the dual constraints of the encouragement and suppression terms, so that the network accurately reconstructs normal features while producing significant reconstruction errors for abnormal features. Fourth, the network supports multi-class unified modeling, eliminating the need to train a separate model for each class and thus improving deployment efficiency and detection accuracy.
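The effect of using only 1×1 kernels can be checked numerically: perturbing the input at one spatial location changes the output at that location only. The sketch below uses random weights and a hypothetical 8-channel, 5×5 feature map; it demonstrates the general property of 1×1 convolutions rather than any trained network of this application.

```python
import numpy as np

def conv1x1(x, weight):
    """1x1 convolution: the same (C_out, C_in) matrix is applied at every location."""
    return np.einsum("oc,chw->ohw", weight, x)

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 5, 5))
weight = rng.normal(size=(8, 8))
y = conv1x1(x, weight)

# Perturb all channels at a single location (row 2, column 2).
x_perturbed = x.copy()
x_perturbed[:, 2, 2] += 10.0
y_perturbed = conv1x1(x_perturbed, weight)

# Boolean (H, W) mask of locations whose output changed in any channel.
changed = np.any(~np.isclose(y, y_perturbed), axis=0)
print(changed.astype(int))
```

With a larger kernel, the change would spread to neighboring locations, which is the route by which a network can copy features from the spatial neighborhood.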

[0112] S104. Based on the reconstruction errors of the reconstructed feature map and the aggregated feature map, determine the anomaly localization map corresponding to the target image and the overall anomaly score of the target image; wherein, the anomaly localization map is used to record the location of the abnormal region, and the overall anomaly score is used to evaluate the degree of anomaly of the target image.

[0113] In practice, the specific implementation process of this step may include:

[0114] (1) Based on the reconstruction error of the aggregated feature map and the reconstructed feature map, determine the anomaly score of each spatial location corresponding to the aggregated feature map, and perform upsampling processing on the anomaly scores of all spatial locations to obtain an anomaly score map with the same size as the target image.

[0115] (2) Perform edge smoothing on the anomaly score map to obtain an anomaly localization map that represents the location of the abnormal region.

[0116] (3) Select, in descending order, a preset number of target anomaly scores with the highest anomaly scores from the anomaly score map.

[0117] (4) The average of the preset number of target anomaly scores is determined as the overall anomaly score of the target image.

[0118] It should be noted that the reconstruction error is the difference between the aggregated feature map and the reconstructed feature map, and can be characterized using L2 norm (mean squared error), L1 norm (absolute error), etc. Furthermore, the anomaly score for each spatial location can be equal to the reconstruction error for that spatial location. Further, the anomaly scores for all spatial locations constitute the initial score map.
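The per-location reconstruction error can be sketched as follows. The function name `error_map` and the choice to average over the channel dimension are assumptions; the text above only states that an L2-type or L1-type difference between the two feature maps can be used.

```python
import numpy as np

def error_map(aggregated, reconstructed, norm="l2"):
    """Reduce the channel dimension of two (C, H, W) maps to an (H, W) initial score map."""
    diff = aggregated - reconstructed
    if norm == "l2":
        return np.mean(diff ** 2, axis=0)      # mean squared error per location
    if norm == "l1":
        return np.mean(np.abs(diff), axis=0)   # mean absolute error per location
    raise ValueError("norm must be 'l2' or 'l1'")

agg = np.zeros((4, 3, 3))
rec = np.zeros((4, 3, 3))
rec[:, 1, 1] = 2.0                             # one poorly reconstructed location
print(error_map(agg, rec, "l2"))
print(error_map(agg, rec, "l1"))
```

Each entry of the returned map is the anomaly score of one spatial location of the aggregated feature map.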

[0119] Furthermore, after obtaining the initial score map, the initial score map is upsampled to the same spatial resolution as the target image to obtain the anomaly score map.

[0120] Furthermore, the anomaly score map can be smoothed by applying a Gaussian filter to obtain an anomaly localization map that characterizes the location of the anomaly region.

[0121] Furthermore, to obtain an image-level anomaly detection score, a preset number of the largest values can be selected from the anomaly score map and averaged to give the overall anomaly score. Averaging balances the contribution of multiple high-scoring areas and prevents a single extreme value from dominating the overall judgment.
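The post-processing from the initial score map to the two outputs can be sketched as follows. Several details are assumptions chosen for a dependency-free illustration: nearest-neighbor upsampling by an integer factor (the interpolation method is not specified above), a separable Gaussian filter with edge padding, and the values of `sigma` and `k`.

```python
import numpy as np

def upsample_nearest(score, factor):
    """Enlarge an (H, W) map by an integer factor using nearest-neighbor repetition."""
    return np.repeat(np.repeat(score, factor, axis=0), factor, axis=1)

def gaussian_smooth(img, sigma=2.0):
    """Separable Gaussian filter; edge padding keeps the output the same size as the input."""
    radius = int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (t / sigma) ** 2)
    kernel /= kernel.sum()
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def image_score(score_map, k):
    """Mean of the k largest values of the anomaly score map."""
    flat = np.sort(score_map.ravel())[::-1]
    return float(flat[:k].mean())

initial = np.zeros((4, 4))
initial[1, 2] = 1.0                                   # one anomalous feature location
score_map = upsample_nearest(initial, 8)              # anomaly score map, 32 x 32
localization = gaussian_smooth(score_map, sigma=2.0)  # anomaly localization map
overall = image_score(score_map, k=10)                # overall anomaly score
peak = np.unravel_index(np.argmax(localization), localization.shape)
print(score_map.shape, peak, overall)
```

Averaging the top k values rather than taking the single maximum makes the image-level score less sensitive to one isolated high value.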

[0122] It should be noted that, first, acquiring multi-layer feature maps captures both fine details and global context in the image, providing a rich feature foundation for subsequent anomaly detection. Spatial scale alignment ensures that feature maps at different levels are spatially consistent, which facilitates subsequent processing, and the feature aggregation strategy fuses feature information within the neighborhood, enhancing the robustness of the feature representation while preserving spatial location information and providing more discriminative features for anomaly detection. Second, assembling the aggregated features of all spatial locations into an aggregated feature map yields a map that contains local details from every location in the image and also fuses the global context of the multi-layer features, so that each aggregated feature carries rich semantic information. Compared with ordinary feature representations, the aggregated feature map pays more attention to features that are prone to anomalies, which improves the sensitivity and accuracy of anomaly detection. Third, the feature reconstruction network can effectively reconstruct normal aggregated features; when the input contains abnormal aggregated features, the reconstruction error increases significantly because the model has not learned the corresponding distribution, so the error accurately reflects the abnormal region. Calculating the reconstruction error therefore gives an anomaly score for each spatial location, providing a basis for subsequent anomaly localization. Fourth, the overall anomaly score of the target image is determined from the anomaly scores of the spatial locations of the aggregated feature map; it comprehensively reflects the degree of anomaly in the image and provides an intuitive basis for evaluation. Furthermore, since the anomaly score is calculated from local information at each spatial location, it can accurately capture local abnormal regions in the image, improving the accuracy and reliability of anomaly detection. Thus, through multi-layer feature extraction, spatial scale alignment, feature aggregation, aggregated feature generation, feature reconstruction, and reconstruction error analysis, efficient anomaly detection and accurate localization are achieved for multi-class images, effectively distinguishing abnormal data and improving the performance of anomaly detection.

[0123] The image anomaly detection method based on feature reconstruction provided in this application extracts features from the target image to obtain multi-layer feature maps. These feature maps are then aligned spatially, and neighborhood aggregation and cross-layer fusion are performed on each layer to obtain an aggregated feature map corresponding to the target image. This aggregated feature map effectively integrates information from different scales and semantic levels, enabling the model to simultaneously focus on fine-grained textures and high-level structures in the image during detection, significantly enhancing its ability to perceive minor or local anomalies. Furthermore, the aggregated feature map is reconstructed to obtain a reconstructed feature map. The anomaly localization map corresponding to the target image and the overall anomaly score of the target image are then determined based on the reconstruction error between the aggregated and reconstructed feature maps. In this way, since normal regions are sufficiently modeled during training, the reconstruction error is usually small; while abnormal regions, due to their feature distribution deviating from the training distribution, result in significant reconstruction errors, thereby achieving automatic identification of abnormal regions. This method relies on unsupervised modeling of the feature distribution of normal images, eliminating the need for abnormal samples, significantly alleviating the modeling difficulties caused by the scarcity of abnormal samples, and effectively improving the model's robustness and background interference suppression capabilities.

[0124] Optionally, in one possible implementation, the method for determining the latent coding dimension of the encoded features includes:

[0125] (1) For each category of the multiple normal images, obtain the multi-layer feature map of a normal image of that category.

[0126] Specifically, for each category, a representative normal image can be selected, and its multi-layer feature map can be extracted.

[0127] (2) For each spatial location in the multi-layer feature map, obtain the normal aggregated feature corresponding to that spatial location.

[0128] The specific implementation process and principle of this step are described in the previous embodiments and will not be repeated here.

[0129] (3) Perform principal component analysis on all the collected normal aggregation features and calculate the cumulative explained variance ratio of each principal component.

[0130] In practice, the normal aggregated features can be vectorized first, converting them into a two-dimensional matrix form suitable for principal component analysis. Principal component analysis can then be applied to the normal aggregated features, projecting the high-dimensional data into a low-dimensional space through a linear transformation while retaining their main variation information.

[0131] Furthermore, the normal aggregated features are processed using a principal component analysis algorithm to obtain the explained variance ratio of each principal component.

[0132] It should be noted that the explained variance ratio reflects the proportion of the principal component in the data variation.

[0133] Furthermore, the explained variance ratios of each principal component are summed to obtain the cumulative explained variance ratio, which measures the importance of the top N principal components.

[0134] (4) When the cumulative explained variance ratio exceeds a preset threshold, the latent coding dimension is determined based on the number of principal components at which the threshold is exceeded.

[0135] Specifically, the specific value of the preset threshold is set according to actual needs, and this embodiment does not limit it.

[0136] In practice, the latent coding dimension is determined based on the number of principal components at which the cumulative explained variance ratio first exceeds the preset threshold. For example, in one embodiment, the latent coding dimension defaults to 64, principal component analysis is applied to the complete feature map, and the preset threshold is 80%. In this case, the latent coding dimension is set according to the number of principal components at which the cumulative explained variance ratio first exceeds 80%, and its value is limited to the range of 64 to 96.
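The selection rule can be sketched with NumPy as follows. The function name `latent_dim_from_pca`, the use of singular value decomposition to obtain the explained variance ratios, and the toy data are assumptions; the 80% threshold and the 64 to 96 range are the example values given above.

```python
import numpy as np

def latent_dim_from_pca(features, threshold=0.80, low=64, high=96):
    """features: (N, C) matrix with one normal aggregated feature vector per row."""
    centered = features - features.mean(axis=0, keepdims=True)
    singular = np.linalg.svd(centered, compute_uv=False)   # sorted in descending order
    variance = singular ** 2
    ratio = variance / variance.sum()                      # explained variance ratio
    cumulative = np.cumsum(ratio)                          # cumulative explained variance ratio
    # Index of the first component at which the cumulative ratio exceeds the threshold.
    first = int(np.searchsorted(cumulative, threshold, side="right"))
    n_components = min(first + 1, len(cumulative))
    return int(np.clip(n_components, low, high)), n_components

# Toy data: four equally strong directions in a 6-dimensional feature space.
features = np.zeros((8, 6))
for j in range(4):
    features[2 * j, j] = 1.0
    features[2 * j + 1, j] = -1.0

print(latent_dim_from_pca(features, low=1, high=6))   # unclamped view of the toy case
print(latent_dim_from_pca(features))                  # clamped to the 64 to 96 range
```

In the toy case each of the four directions explains 25% of the variance, so the cumulative ratio first exceeds 80% at the fourth component; with the example range applied, the result is raised to the lower bound of 64.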

[0137] The method provided in this embodiment first selects a representative normal image for each category from the multiple normal images and extracts its aggregated features. This ensures that each category contributes sufficient feature representations to the subsequent analysis, while reducing data redundancy and improving processing efficiency. Principal component analysis is then performed on all extracted aggregated features (all feature vectors) to determine the latent coding dimension adaptively. Complex data, or data covering many categories, has richer aggregated features and can be compressed less, whereas simple data with few categories does not need a large dimension (a larger dimension increases computational cost and the number of model parameters). Choosing the dimension from the data therefore makes the latent coding dimension of the reconstruction network better founded. Specifically, principal component analysis yields the explained variance ratio of each principal component, which reflects the proportion of the data variation carried by that component. Summing these ratios gives the cumulative explained variance ratio, which measures the total data variation that the top N principal components explain and provides a quantitative criterion for determining the latent coding dimension. In this way, the feature reconstruction network achieves effective dimensionality reduction of multi-class image features while maintaining detection accuracy, which improves its training efficiency and detection accuracy and enhances its generalization to multi-class data.

[0138] Corresponding to the aforementioned embodiment of an image anomaly detection method based on feature reconstruction, this application also provides an embodiment of an image anomaly detection device based on feature reconstruction.

[0139] An embodiment of the image anomaly detection device based on feature reconstruction disclosed in this application can be applied to image anomaly detection equipment based on feature reconstruction. The device embodiment can be implemented in software, hardware, or a combination of both. Taking software implementation as an example, as a logical device, it is formed by the processor of the device loading the corresponding computer program instructions from non-volatile memory into memory for execution. From a hardware perspective, in addition to a processor, memory, network interface, and non-volatile memory, the image anomaly detection device based on feature reconstruction may also include other hardware, which will not be elaborated further.

[0140] Figure 6 is a schematic diagram of the structure of Embodiment 1 of an image anomaly detection device based on feature reconstruction provided in this application. Referring to Figure 6, the apparatus provided in this embodiment includes an extraction module 510, a fusion module 520, a reconstruction module 530, and a processing module 540.

[0141] The extraction module 510 is used to extract features from the target image to be detected, and obtain a multi-layer feature map.

[0142] The fusion module 520 is used to align the feature maps of each layer in spatial scale, and to perform neighborhood aggregation and cross-layer fusion processing on each feature map to obtain the aggregated feature map corresponding to the target image.

[0143] The reconstruction module 530 is used to reconstruct the aggregated feature map using a pre-trained feature reconstruction network to obtain a reconstructed feature map with the same size as the aggregated feature map.

[0144] The processing module 540 is used to determine the anomaly localization map corresponding to the target image and the overall anomaly score of the target image based on the reconstruction error of the reconstructed feature map and the aggregated feature map; wherein, the anomaly localization map is used to record the location of the abnormal region, and the overall anomaly score is used to evaluate the degree of anomaly of the target image.
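The way the four modules hand data to one another can be sketched as a small container class. The class name, the callable signatures, and the toy stand-ins for each module are hypothetical; they only show the order of the calls, not how any module is actually implemented.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple
import numpy as np

@dataclass
class AnomalyDetectionDevice:
    extract: Callable[[np.ndarray], Sequence[np.ndarray]]                 # extraction module 510
    fuse: Callable[[Sequence[np.ndarray]], np.ndarray]                    # fusion module 520
    reconstruct: Callable[[np.ndarray], np.ndarray]                       # reconstruction module 530
    process: Callable[[np.ndarray, np.ndarray], Tuple[np.ndarray, float]] # processing module 540

    def __call__(self, image: np.ndarray) -> Tuple[np.ndarray, float]:
        feature_maps = self.extract(image)
        aggregated = self.fuse(feature_maps)
        reconstructed = self.reconstruct(aggregated)
        return self.process(aggregated, reconstructed)

def toy_process(aggregated, reconstructed):
    """Toy stand-in: per-location mean squared error and its maximum as the score."""
    err = np.mean((aggregated - reconstructed) ** 2, axis=0)
    return err, float(err.max())

device = AnomalyDetectionDevice(
    extract=lambda img: [img, 2.0 * img],               # toy "multi-layer" features
    fuse=lambda maps: np.concatenate(maps, axis=0),     # toy channel concatenation
    reconstruct=lambda f: np.zeros_like(f),             # toy reconstruction
    process=toy_process,
)
loc_map, score = device(np.ones((3, 4, 4)))
print(loc_map.shape, score)
```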

[0145] This application also provides an image anomaly detection device based on feature reconstruction, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of any of the methods provided in the first aspect of this application.

[0146] This application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of any of the methods provided in the first aspect of this application.

[0147] The specific implementation process of the functions and roles of each unit in the above device can be found in the implementation process of the corresponding steps in the above method, and will not be repeated here.

[0148] For the device embodiments, since they basically correspond to the method embodiments, the relevant parts can be referred to in the description of the method embodiments. The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this application according to actual needs. Those skilled in the art can understand and implement this without creative effort.

[0149] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the scope of protection of this application.

Claims

1. An image anomaly detection method based on feature reconstruction, characterized in that, The method includes: Feature extraction is performed on the target image to be detected to obtain a multi-layer feature map; Align the feature maps of each layer in spatial scale, and perform neighborhood aggregation and cross-layer fusion processing on each feature map to obtain the aggregated feature map corresponding to the target image. The aggregated feature map is reconstructed using a pre-trained feature reconstruction network to obtain a reconstructed feature map with the same size as the aggregated feature map; Based on the reconstruction errors of the reconstructed feature map and the aggregated feature map, an anomaly localization map corresponding to the target image and an overall anomaly score of the target image are determined; wherein, the anomaly localization map is used to record the location of the abnormal region, and the overall anomaly score is used to evaluate the degree of anomaly of the target image.

2. The method according to claim 1, characterized in that, determining the anomaly localization map corresponding to the target image and the overall anomaly score of the target image based on the reconstruction errors of the reconstructed feature map and the aggregated feature map includes: based on the reconstruction error of the aggregated feature map and the reconstructed feature map, determining the anomaly score of each spatial location corresponding to the aggregated feature map, and upsampling the anomaly scores of all spatial locations to obtain an anomaly score map with the same size as the target image; performing edge smoothing on the anomaly score map to obtain an anomaly localization map used to represent the location of the abnormal region; selecting, in descending order, a preset number of target anomaly scores with the highest anomaly scores from the anomaly score map; and determining the average of the preset number of target anomaly scores as the overall anomaly score of the target image.

3. The method according to claim 1, characterized in that, The process of performing neighborhood aggregation and cross-layer fusion on each layer of feature maps to obtain the aggregated feature map corresponding to the target image includes: For each spatial location in the feature map of each layer, average pooling is performed on multiple feature vectors in the specified neighborhood of that spatial location, and the processing result is determined as the aggregated feature of that spatial location on the feature map of that layer. Channel splicing and dimension adjustment are performed on the aggregated features of each spatial location on the feature map of each layer to obtain the aggregated features corresponding to that spatial location, and the aggregated feature map is restored by using the aggregated features corresponding to all spatial locations.

4. The method according to claim 3, characterized in that, The feature reconstruction network includes an encoder and a decoder; The encoder includes an encoding start layer, a first LCC module, an encoding compression layer, and an encoding output layer; the encoding start layer is used to perform convolution processing on the input aggregated feature map to compress the original channel dimension and obtain an initial compressed feature map; The first LCC module is used to apply local context constraints to the initial compressed feature map to generate a fused feature map; The encoding compression layer is used to extract key features and compress channel dimensions of the fused feature map to obtain an encoded compressed feature map; The encoding output layer is used to perform channel compression on the encoded compressed feature map, reducing the dimension of the encoded compressed feature map to the encoded features in the latent space; the latent encoding dimension of the encoded features is a preset value. The decoder includes a decoding start layer, a decoding extension layer, a second LCC module, and a decoding output layer. The decoding start layer is used to perform channel expansion on the encoded features, initially restoring the feature dimensions to obtain an initial decoded feature map. The decoding extension layer is used to extract high-dimensional features from the initial decoded feature map to obtain a decoded extended feature map. The second LCC module is used to apply local context constraints to the decoded extended feature map to obtain a decoded fusion feature map. The decoding output layer is used to restore the decoded fusion feature map back to the original channel dimensions to obtain the reconstructed feature map.

5. The method according to claim 4, characterized in that, The encoder and the decoder have a symmetrical structure; Both the encoding start layer and the decoding output layer include a first convolutional layer, a first instance normalization layer, and a first activation processing layer. Both the first LCC module and the second LCC module include an average pooling layer, a first inner convolutional module, a concatenation module, and a second inner convolutional module. The average pooling layer performs local average pooling on the input features to extract semantic representation features. The first inner convolutional module projects the semantic representation features into a low-dimensional space to obtain fused spatial constraint features. The concatenation module concatenates the input features and the fused spatial constraint features to obtain enhanced features. The second inner convolutional module performs non-linear mapping on the enhanced features to obtain output features with the same channel dimension as the input features. Both the first and second inner convolutional modules include an inner convolutional layer, a batch normalization layer, and an inner activation layer. Both the encoding compression layer and the decoding expansion layer include N convolutional modules, and each convolutional module includes a second convolutional layer, a second instance normalization layer and a second activation processing layer. Both the encoding output layer and the decoding start layer include a third convolutional layer, a third instance normalization layer, and a third activation processing layer.

6. The method according to claim 5, characterized in that the training process of the feature reconstruction network includes: acquiring multiple normal images, the multiple normal images belonging to multiple different categories; for each of the multiple normal images, extracting the normal aggregated feature map of that normal image; for each normal aggregated feature map, constructing, based on a preset Gaussian distribution, a noise tensor with the same dimensions as the normal aggregated feature map, wherein each element of the noise tensor is independently sampled from the preset Gaussian distribution; processing the normal aggregated feature map with the noise tensor to generate the abnormal aggregated feature map corresponding to the normal aggregated feature map; and training the initial feature reconstruction network using the normal aggregated feature maps and the abnormal aggregated feature maps to obtain the pre-trained feature reconstruction network.
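
The noise construction in claim 6 can be illustrated with a short sketch. The claim states only that the noise tensor is used to process the normal aggregated feature map; element-wise addition is assumed here, and the function name `make_abnormal_features` and the default mean and standard deviation are hypothetical.

```python
import random


def make_abnormal_features(normal, mean=0.0, std=0.1, seed=None):
    """Add independently sampled Gaussian noise to every element of `normal`.

    `normal` is a nested list of floats of any depth. Additive noise and the
    default mean/std are assumptions made for illustration only.
    """
    rng = random.Random(seed)

    def perturb(node):
        if isinstance(node, list):
            return [perturb(child) for child in node]
        # One independent draw per element, so the noise has the same
        # dimensions as the feature map it perturbs.
        return node + rng.gauss(mean, std)

    return perturb(normal)


# Toy 2 x 2 x 2 "normal aggregated feature map".
normal = [[[0.2, 0.4], [0.6, 0.8]], [[1.0, 1.2], [1.4, 1.6]]]
abnormal = make_abnormal_features(normal, std=0.1, seed=0)
```

Because the abnormal maps are synthesized from normal ones, this step needs no real abnormal samples; how the normal and abnormal maps are paired in the training loss is not stated in the claim and is left open here.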

7. The method according to claim 6, characterized in that the method for determining the latent encoding dimension of the encoded features includes: for each normal image of each category among the multiple normal images, obtaining the multi-layer feature map of that normal image; for each spatial location in the multi-layer feature map, obtaining the normal aggregated feature corresponding to that spatial location; performing principal component analysis on all collected normal aggregated features and calculating the cumulative explained variance ratio of each principal component; and when the cumulative explained variance ratio exceeds a preset threshold, determining the latent encoding dimension based on the number of principal components corresponding to that cumulative explained variance ratio.
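
The selection rule in claim 7 can be sketched as below. The sketch assumes the principal component analysis has already been run and starts from the per-component variances (the PCA eigenvalues); the function name `latent_dim_from_variances` and the example threshold are hypothetical.

```python
def latent_dim_from_variances(variances, threshold=0.95):
    """Smallest number of leading principal components whose cumulative
    explained variance ratio exceeds `threshold`.

    `variances` are per-component variances (PCA eigenvalues) in any order.
    """
    if not variances or not 0.0 < threshold < 1.0:
        raise ValueError("need non-empty variances and 0 < threshold < 1")
    ordered = sorted(variances, reverse=True)   # strongest components first
    total = sum(ordered)
    if total <= 0.0:
        raise ValueError("total variance must be positive")
    cumulative = 0.0
    for k, var in enumerate(ordered, start=1):
        cumulative += var
        if cumulative / total > threshold:      # "exceeds", as worded in the claim
            return k
    return len(ordered)


# Cumulative ratios here are 0.6, 0.8, 0.9, 0.95, 1.0.
dim = latent_dim_from_variances([6.0, 2.0, 1.0, 0.5, 0.5], threshold=0.85)
print(dim)
```

Choosing the dimension this way ties the size of the latent space to how much of the variance in normal features must be retained, rather than to a hand-picked constant.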

8. An image anomaly detection device based on feature reconstruction, characterized in that the device includes an extraction module, a fusion module, a reconstruction module, and a processing module; the extraction module is used to perform feature extraction on the target image to be detected to obtain a multi-layer feature map; the fusion module is used to align the feature maps of each layer in spatial scale, and to perform neighborhood aggregation processing and cross-layer fusion processing on each layer of feature maps to obtain the aggregated feature map corresponding to the target image; the reconstruction module is used to reconstruct the aggregated feature map using a pre-trained feature reconstruction network to obtain a reconstructed feature map with the same size as the aggregated feature map; and the processing module is used to determine the anomaly localization map corresponding to the target image and the overall anomaly score of the target image based on the reconstruction error between the reconstructed feature map and the aggregated feature map, wherein the anomaly localization map is used to record the location of the abnormal region, and the overall anomaly score is used to evaluate the degree of anomaly of the target image.
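
The role of the processing module can be illustrated with a minimal sketch that turns a reconstruction error into a localization map and an overall score. The L2 norm as the per-location error, the maximum as the overall score, and the function name `anomaly_map_and_score` are assumptions; the method claims may define these quantities differently, and upsampling the map to the image resolution is omitted.

```python
import math


def anomaly_map_and_score(aggregated, reconstructed):
    """Per-location L2 reconstruction error and an image-level score.

    Both inputs are H x W grids of equal-length channel vectors. The L2 norm
    and the max-based score are illustrative assumptions.
    """
    if len(aggregated) != len(reconstructed):
        raise ValueError("feature maps must have the same height")
    amap = []
    for row_a, row_r in zip(aggregated, reconstructed):
        if len(row_a) != len(row_r):
            raise ValueError("feature maps must have the same width")
        amap.append([math.sqrt(sum((a - r) ** 2 for a, r in zip(va, vr)))
                     for va, vr in zip(row_a, row_r)])
    score = max(max(row) for row in amap)   # largest local error in the image
    return amap, score


# Toy 2 x 2 grid with 2 channels; one location is reconstructed poorly.
aggregated = [[[1.0, 2.0], [0.0, 0.0]], [[3.0, 4.0], [1.0, 1.0]]]
reconstructed = [[[1.0, 2.0], [0.0, 0.0]], [[0.0, 0.0], [1.0, 1.0]]]
amap, score = anomaly_map_and_score(aggregated, reconstructed)
print(amap, score)
```

Locations that the network reconstructs well contribute an error near zero, so the large entry marks the abnormal region and also drives the overall score.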

9. Image anomaly detection equipment based on feature reconstruction, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, the steps of the method according to any one of claims 1 to 7 are implemented.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 7 are implemented.