An image difference detection method, system and electronic device

By combining feature extraction networks and attention interaction networks, the problems of low accuracy and efficiency in image difference detection are solved, and more efficient image difference detection is achieved.

CN115810110B · Active · Publication Date: 2026-06-30 · ZHEJIANG DAHUA TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: ZHEJIANG DAHUA TECH CO LTD
Filing Date: 2022-12-05
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing image difference detection methods based on deep learning technology suffer from low detection accuracy and efficiency. In particular, unsupervised detection algorithms rely on manually designed features, which are time-consuming and labor-intensive, while supervised algorithms have many parameters and are difficult to optimize.

Method used

A feature extraction network is used to extract features from foreground and background images, an attention interaction network is used to calculate the interaction information between feature maps, and a detection network is used to perform difference detection. This avoids manually designing features and optimizes network parameters to improve detection accuracy and efficiency.

Benefits of technology

It improves the accuracy and efficiency of image difference detection, ensuring the optimization and speed of detection results.

Abstract

This application discloses an image difference detection method, system, and electronic device. The method includes: acquiring a foreground image and a background image; inputting the foreground image and the background image into a target feature extraction network, respectively, and outputting N foreground feature maps corresponding to the foreground image and N background feature maps corresponding to the background image; inputting M foreground feature maps from the N foreground feature maps and M background feature maps from the N background feature maps corresponding to the M foreground feature maps into a target attention interaction network, and outputting M interaction feature maps; and performing difference detection on the M interaction feature maps through a target detection network to obtain difference information between the foreground image and the background image. The technical solution provided by this application avoids manually designing the feature information of the foreground image and background image, thus improving the detection accuracy and efficiency of image difference detection.

Description

Technical Field

[0001] This application relates to the field of video analysis and processing technology, and in particular to an image difference detection method, system and electronic device.

Background Technology

[0002] In recent years, with the continuous development of deep learning technology, image difference detection based on deep learning technology has been widely used in many fields such as video surveillance and manufacturing quality control.

[0003] For example, in video surveillance scenarios, image difference detection based on deep learning technology can determine whether there are missing, stolen, or changed items in the monitored area; in manufacturing quality control scenarios, image difference detection based on deep learning technology can analyze the color differences of precision instruments, and thus assess the quality of products made from precision instruments.

[0004] Currently, image difference detection based on deep learning technology is mainly divided into image difference detection based on unsupervised detection algorithms and image difference detection based on supervised detection algorithms.

[0005] Image difference detection based on unsupervised detection algorithms typically uses manually designed features to perform cluster analysis on the foreground and background images (the two images whose differences are to be detected), thereby detecting the differences between them. Alternatively, a mean-based method can be used to compute a difference map between the foreground and background images, and the difference map is then used as a pseudo ground-truth label to train a neural network, thereby detecting the differences between the foreground and background images.

[0006] Image difference detection based on supervised detection algorithms typically involves inputting the foreground and background images into a detection network. The feature values output by the detection network are then encoded, and the loss value between the encoded feature values and the ground truth labels is calculated. The detection network is trained by optimizing the loss value. After the detection network is trained, the difference information between the foreground and background images can be obtained through the trained detection network, thereby achieving the difference detection between the foreground and background images.

[0007] However, when performing difference detection between foreground and background images based on the aforementioned unsupervised detection algorithms, the human-designed features lack objectivity and are time-consuming and labor-intensive, resulting in low detection accuracy and efficiency. On the other hand, when performing difference detection between foreground and background images based on supervised algorithms, the supervised detection algorithms make poor use of the difference information between the foreground and background images, and the large number of parameters in the detection network makes it difficult to obtain the optimal parameters in a short time, which also leads to low detection accuracy and efficiency.

Summary of the Invention

[0008] This application provides an image difference detection method, system, and electronic device to solve the problems of low detection accuracy and low detection efficiency in image difference detection. The specific implementation scheme is as follows:

[0009] In a first aspect, this application provides an image difference detection method, the method comprising:

[0010] Obtain the foreground and background images;

[0011] The foreground image and the background image are respectively input into the target feature extraction network, and N foreground feature maps corresponding to the foreground image and N background feature maps corresponding to the background image are output, where N is an integer greater than zero;

[0012] M foreground feature maps from the N foreground feature maps and M background feature maps from the N background feature maps corresponding to the M foreground feature maps are input into the target attention interaction network, and M interaction feature maps are output, where M is an integer greater than zero.

[0013] Difference detection is performed on the M interaction feature maps through the target detection network to obtain difference information between the foreground image and the background image.

[0014] The feature extraction network extracts features from the foreground and background images respectively. Then, the interaction information between the feature maps output by the feature extraction network is calculated based on the attention interaction network to obtain the interaction feature map. Finally, the interaction feature map is detected by the detection network to obtain the difference information between the foreground and background images. This avoids the need to manually design the feature information of the foreground and background images and effectively utilizes the difference information between the foreground and background images, thereby improving the detection accuracy and efficiency of image difference detection.

[0015] In one possible implementation, prior to acquiring the foreground and background images, the method further includes:

[0016] Obtain the foreground training image and the background training image;

[0017] Based on the foreground training image and the background training image, a feature extraction network, an attention interaction network, and a detection network are trained to obtain a trained feature extraction network, a trained attention interaction network, and a trained detection network.

[0018] If the trained feature extraction network, the trained attention interaction network, and the trained detection network all meet the preset conditions, they are used as the first target feature extraction network, the target attention interaction network, and the target detection network, respectively.

[0019] The feature extraction network, attention interaction network, and detection network are trained on the foreground and background training images, and the first target feature extraction network, target attention interaction network, and target detection network are determined according to the preset conditions, so that the parameters used in the image difference detection process are all optimal. This in turn makes the image difference detection result optimal and further improves the accuracy of image difference detection.

[0020] In one possible implementation, after using the trained feature extraction network, the trained attention interaction network, and the trained detection network as the first target feature extraction network, the target attention interaction network, and the target detection network, the method further includes:

[0021] Optimize the specified parameters in the first target feature extraction network to obtain the second feature extraction network;

[0022] The third feature extraction network is obtained by cropping a specified feature extraction block in the second feature extraction network according to a preset cropping rate.

[0023] The predicted feature map in the first target feature extraction network is used as the real feature map of the third feature extraction network to train the third feature extraction network, thereby obtaining the target feature extraction network.

[0024] Optimizing the specified parameters in the first target feature extraction network results in a more concentrated weight distribution in the second feature extraction network, laying the foundation for accelerating the operation of the feature extraction network. Simultaneously, by cropping the specified feature extraction blocks in the second feature extraction network and guiding the training of the third feature extraction network by the first target feature extraction network, the operation speed of the target feature extraction network is accelerated while maintaining the accuracy of image difference detection. This reduces the time required for detecting differences between the foreground and background images, further improving the efficiency of image difference detection.
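The cropping and teacher-guided training described above can be sketched roughly as follows. This is only an illustrative reading, not the patented procedure: the source does not say how the blocks are cropped or which loss is used, so the sketch assumes magnitude-based channel cropping (keeping the channels with the largest L1 weight norm) and a mean-squared-error loss between the teacher's predicted feature values and the student's. The names keep_channels, distillation_loss, and crop_rate are hypothetical.

```python
from typing import List, Sequence


def keep_channels(channel_weights: Sequence[Sequence[float]], crop_rate: float) -> List[int]:
    """Indices of the channels kept after cropping one block at the given rate.

    Assumption: channels are ranked by L1 weight norm (magnitude pruning).
    """
    if not 0.0 <= crop_rate < 1.0:
        raise ValueError("crop_rate must be in [0, 1)")
    n_keep = max(1, round(len(channel_weights) * (1.0 - crop_rate)))
    norms = [sum(abs(w) for w in weights) for weights in channel_weights]
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    return sorted(ranked[:n_keep])


def distillation_loss(teacher: Sequence[float], student: Sequence[float]) -> float:
    """Mean squared error, treating the teacher's prediction as the real label."""
    if len(teacher) != len(student):
        raise ValueError("feature maps must contain the same number of values")
    return sum((t - s) ** 2 for t, s in zip(teacher, student)) / len(teacher)


# Toy block with four channels, cropped at a hypothetical rate of 0.5.
block_weights = [[0.9, -0.8], [0.01, 0.02], [0.5, 0.4], [0.0, 0.03]]
kept = keep_channels(block_weights, crop_rate=0.5)

# Teacher (first target network) output versus student (third network) output.
loss = distillation_loss([1.0, 2.0], [1.0, 4.0])
```

In this reading, the first target feature extraction network plays the role of the teacher and the cropped third network the student; minimizing the loss pulls the smaller network's feature maps toward the larger network's.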

[0025] In one possible implementation, acquiring the foreground image and the background image includes:

[0026] The original foreground image and the original background image are obtained, and the original foreground image and the original background image are transformed according to the data preprocessing method to obtain the foreground image and the background image respectively;

[0027] The step of performing difference detection on the M interaction feature maps through the target detection network to obtain difference information between the foreground image and the background image includes:

[0028] The target detection network performs difference detection on the M interaction feature maps and outputs the difference information between the foreground image and the background image.

[0029] Based on data preprocessing operations, the original foreground image and the original background image are transformed into the foreground image and the background image, so that the foreground image and the background image can be directly input into the subsequent network without any further operations; and the difference information between the foreground image and the background image is obtained through the output of the target detection network.

[0030] In one possible implementation, before inputting M foreground feature maps from the N foreground feature maps and M background feature maps from the N background feature maps corresponding to the M foreground feature maps into the target attention interaction network and outputting M interaction feature maps, the method further includes:

[0031] Based on the size of the feature map, the M foreground feature maps are determined from the N foreground feature maps, and

[0032] From the N background feature maps, determine the M background feature maps that correspond to the M foreground feature maps.

[0033] Based on the size of the feature maps, M foreground feature maps and M background feature maps are selected so that they can be directly input into the target attention interaction network, laying the foundation for calculating the interaction features between the foreground feature maps and background feature maps in the target attention interaction network.

[0034] In one possible implementation, determining the M foreground feature maps from the N foreground feature maps according to the feature map size includes:

[0035] The N foreground feature maps are sorted according to their size.

[0036] From the sorted N foreground feature maps, determine the M foreground feature maps with the smallest size.

[0037] The M smallest foreground feature maps are selected. Because the smaller a feature map extracted by the target feature extraction network is, the more important the feature information it contains, these M foreground feature maps contain the most essential feature information of the foreground image. Furthermore, because these M foreground feature maps are the smallest, the computational load of the target attention interaction network is also reduced.
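The selection step can be sketched as follows, under stated assumptions: a feature map is represented only by its (channels, height, width) shape, "size" is taken to mean height times width, and the background feature map "corresponding" to a foreground feature map is taken to be the one at the same index in the network output. None of these details are fixed by the text, and the example shapes are hypothetical.

```python
from typing import List, Sequence, Tuple

# (channels, height, width); an illustrative stand-in for a real feature map.
Shape = Tuple[int, int, int]


def spatial_size(shape: Shape) -> int:
    """Size of a feature map, taken here as height * width."""
    _, height, width = shape
    return height * width


def select_smallest(fg_shapes: Sequence[Shape], m: int) -> List[int]:
    """Indices of the M smallest foreground feature maps, smallest first."""
    if not 0 < m <= len(fg_shapes):
        raise ValueError("M must satisfy 0 < M <= N")
    order = sorted(range(len(fg_shapes)), key=lambda i: spatial_size(fg_shapes[i]))
    return order[:m]


# Hypothetical N = 4 outputs for a 416 x 416 input.
fg_shapes = [(64, 104, 104), (128, 52, 52), (256, 26, 26), (512, 13, 13)]
bg_shapes = list(fg_shapes)  # both streams are assumed to yield matching shapes

picked = select_smallest(fg_shapes, m=3)
# Pair each selected foreground map with the background map at the same index.
groups = [(fg_shapes[i], bg_shapes[i]) for i in picked]
```

Selecting by index rather than by value keeps the foreground and background selections aligned, which is what the later grouping into M feature map groups relies on.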

[0038] In one possible implementation, the step of inputting M foreground feature maps from the N foreground feature maps and M background feature maps from the N background feature maps corresponding to the M foreground feature maps into the target attention interaction network, and outputting M interaction feature maps, includes:

[0039] Each foreground feature map in the M foreground feature maps and the corresponding background feature map in the M background feature maps are taken as a feature map group to obtain M feature map groups;

[0040] For each of the M feature map groups, the following interaction feature calculation operation is performed on the target attention interaction network:

[0041] Obtain the target feature map group input to the target attention interaction network;

[0042] Calculate the difference feature map between the foreground feature map and the background feature map in the target feature map group;

[0043] The dimensions of the difference feature map, the foreground feature map in the target feature map group, and the background feature map in the target feature map group are all converted to preset dimensions to obtain the difference feature matrix, the foreground feature matrix, and the background feature matrix;

[0044] The foreground feature matrix is multiplied by the background feature matrix according to the matrix multiplication operation to obtain the first interaction information matrix;

[0045] The first interaction information matrix is multiplied by the difference feature matrix according to the above matrix multiplication operation to obtain the second interaction information matrix;

[0046] The second interaction information matrix and the foreground feature matrix are input into the feedforward neural network to obtain the interaction feature map;

[0047] Each of the M feature map groups is input into the target attention interaction network to perform the interaction feature calculation operation, and M interaction feature maps are output.

[0048] Based on the interaction feature calculation between the foreground feature map, background feature map, and difference feature map in the target attention interaction network, the interaction feature information of the foreground feature map, background feature map, and difference feature map is obtained. Furthermore, the difference information between the foreground image and the background image is extracted, thereby improving the detection accuracy of image difference detection.
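The interaction feature calculation for a single feature map group can be sketched as below. Several details are assumptions made so that the shapes line up, since the text does not specify them: the "preset dimensions" are taken as a (C, H*W) matrix per feature map, the background matrix is transposed before the first multiplication, and the feedforward neural network is replaced by a simple residual sum with the foreground matrix. The helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (r x k) and b (k x c)."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions do not match")
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def flatten_map(fmap: List[Matrix]) -> Matrix:
    """Reshape a (C, H, W) feature map into a (C, H*W) matrix."""
    return [[value for row in channel for value in row] for channel in fmap]


def interact(fg: List[Matrix], bg: List[Matrix]) -> Matrix:
    """Interaction features for one (foreground, background) feature map group."""
    # Element-wise difference feature map.
    diff = [[[f - b for f, b in zip(f_row, b_row)]
             for f_row, b_row in zip(f_ch, b_ch)]
            for f_ch, b_ch in zip(fg, bg)]
    fg_m, bg_m, diff_m = flatten_map(fg), flatten_map(bg), flatten_map(diff)
    first = matmul(fg_m, transpose(bg_m))  # (C, C): foreground x background
    second = matmul(first, diff_m)         # (C, H*W): applied to the difference
    # Stand-in for the feedforward network: residual sum with the foreground.
    return [[s + f for s, f in zip(s_row, f_row)]
            for s_row, f_row in zip(second, fg_m)]


# Two channels, each 1 x 2, chosen so the result is easy to check by hand.
fg = [[[1.0, 2.0]], [[0.0, 1.0]]]
bg = [[[1.0, 0.0]], [[0.0, 1.0]]]
interaction = interact(fg, bg)
```

With this stand-in, identical foreground and background maps give a zero difference matrix, so the output reduces to the foreground matrix; any non-zero difference is weighted by the foreground-background product before being added.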

[0049] In one possible implementation, the step of performing difference detection on the M interaction feature maps through the target detection network to obtain difference information between the foreground image and the background image includes:

[0050] The M interaction feature maps are input into the convolutional layer and the specified activation layer in the target detection network to obtain G confidence values and the coordinates of the detection boxes corresponding to the G confidence values, where G is an integer greater than zero.

[0051] Determine whether all G confidence values are less than a preset threshold;

[0052] If so, output that there is no difference between the foreground image and the background image;

[0053] If not, map the coordinates of the detection boxes corresponding to the K confidence values that are not less than the preset threshold from the G confidence values to the background image to obtain K sets of difference box coordinates, and output the K sets of difference box coordinates that indicate the difference between the foreground image and the background image, where K is an integer greater than zero.

[0054] The confidence values that the convolutional layer and activation layer of the detection network output from the interaction feature maps are compared with a preset threshold to determine whether there is a difference between the foreground and background images. In addition, by mapping the detection box coordinates to difference box coordinates on the foreground or background image, the specific locations and number of differences between the foreground and background images are determined.
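The thresholding and coordinate mapping can be sketched as follows. The box format (x_min, y_min, x_max, y_max), the use of coordinates normalized to [0, 1], and the function name difference_boxes are assumptions for illustration; the text only states that boxes whose confidence is not less than the preset threshold are mapped onto the image.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # normalized (x_min, y_min, x_max, y_max)
PixelBox = Tuple[int, int, int, int]     # same layout, in image pixels


def difference_boxes(
    confidences: Sequence[float],
    boxes: Sequence[Box],
    threshold: float,
    image_width: int,
    image_height: int,
) -> List[PixelBox]:
    """Return the K difference boxes in pixel coordinates.

    An empty list means every confidence value was below the threshold,
    i.e. no difference between the two images was reported.
    """
    if len(confidences) != len(boxes):
        raise ValueError("each confidence value needs exactly one box")
    kept: List[PixelBox] = []
    for confidence, (x0, y0, x1, y1) in zip(confidences, boxes):
        if confidence >= threshold:  # "not less than the preset threshold"
            kept.append((
                round(x0 * image_width),
                round(y0 * image_height),
                round(x1 * image_width),
                round(y1 * image_height),
            ))
    return kept


# G = 3 hypothetical detections on a 416 x 416 image.
confidences = [0.9, 0.2, 0.5]
boxes = [(0.25, 0.25, 0.5, 0.5), (0.1, 0.1, 0.2, 0.2), (0.0, 0.5, 0.25, 1.0)]
found = difference_boxes(confidences, boxes, threshold=0.5,
                         image_width=416, image_height=416)
none_found = difference_boxes([0.1, 0.2], boxes[:2], 0.5, 416, 416)
```

Here the number of returned boxes is K, so the same return value answers both whether a difference exists and where the differences are.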

[0055] In a second aspect, this application also provides an image difference detection system, the system comprising:

[0056] The acquisition module is used to acquire foreground and background images;

[0057] The feature extraction module is used to input the foreground image and the background image into the target feature extraction network respectively, and output N foreground feature maps corresponding to the foreground image and N background feature maps corresponding to the background image, where N is an integer greater than zero;

[0058] The interaction module is used to input M foreground feature maps from the N foreground feature maps and M background feature maps from the N background feature maps corresponding to the M foreground feature maps into the target attention interaction network, and output M interaction feature maps, where M is an integer greater than zero;

[0059] The processing module is used to perform difference detection on the M interaction feature maps through the target detection network to obtain the difference information between the foreground image and the background image.

[0060] In one possible implementation, the acquisition module is specifically used to acquire a foreground training image and a background training image;

[0061] Based on the foreground training image and the background training image, a feature extraction network, an attention interaction network, and a detection network are trained to obtain a trained feature extraction network, a trained attention interaction network, and a trained detection network.

[0062] If the trained feature extraction network, the trained attention interaction network, and the trained detection network all meet the preset conditions, they are used as the first target feature extraction network, the target attention interaction network, and the target detection network, respectively.

[0063] In one possible implementation, the acquisition module is specifically used to optimize specified parameters in the first target feature extraction network to obtain a second feature extraction network;

[0064] The third feature extraction network is obtained by cropping a specified feature extraction block in the second feature extraction network according to a preset cropping rate.

[0065] The predicted feature map in the first target feature extraction network is used as the real feature map of the third feature extraction network to train the third feature extraction network, thereby obtaining the target feature extraction network.

[0066] In one possible implementation, the acquisition module is specifically used to acquire the original foreground image and the original background image, and to perform conversion processing on the original foreground image and the original background image respectively according to the data preprocessing method to obtain the foreground image and the background image;

[0067] The processing module is specifically used to perform difference detection on the M interaction feature maps through the target detection network, and output the difference information between the foreground image and the background image.

[0068] In one possible implementation, the interaction module is specifically configured to determine the M foreground feature maps from the N foreground feature maps according to the size of the feature map.

[0069] From the N background feature maps, determine the M background feature maps that correspond to the M foreground feature maps.

[0070] In one possible implementation, the interaction module is specifically used to sort the N foreground feature maps according to their size.

[0071] From the sorted N foreground feature maps, determine the M foreground feature maps with the smallest size.

[0072] In one possible implementation, the interaction module is specifically used to take each of the M foreground feature maps and the background feature map corresponding to each of the foreground feature maps in the M background feature maps as a feature map group, to obtain M feature map groups.

[0073] For each of the M feature map groups, the following interaction feature calculation operation is performed on the target attention interaction network:

[0074] Obtain the target feature map group input to the target attention interaction network;

[0075] Calculate the difference feature map between the foreground feature map and the background feature map in the target feature map group;

[0076] The dimensions of the difference feature map, the foreground feature map in the target feature map group, and the background feature map in the target feature map group are all converted to preset dimensions to obtain the difference feature matrix, the foreground feature matrix, and the background feature matrix;

[0077] The foreground feature matrix is multiplied by the background feature matrix according to the matrix multiplication operation to obtain the first interaction information matrix;

[0078] The first interaction information matrix is multiplied by the difference feature matrix according to the above matrix multiplication operation to obtain the second interaction information matrix;

[0079] The second interaction information matrix and the foreground feature matrix are input into the feedforward neural network to obtain the interaction feature map;

[0080] Each of the M feature map groups is input into the target attention interaction network to perform the interaction feature calculation operation, and M interaction feature maps are output.

[0081] In one possible implementation, the processing module is specifically used to input the M interaction feature maps into the convolutional layer and a specified activation layer in the target detection network to obtain G confidence values and the coordinates of the detection boxes corresponding to each of the G confidence values, where G is an integer greater than zero.

[0082] Determine whether all G confidence values are less than a preset threshold;

[0083] If so, output that there is no difference between the foreground image and the background image;

[0084] If not, map the coordinates of the detection boxes corresponding to the K confidence values that are not less than the preset threshold from the G confidence values to the background image to obtain K sets of difference box coordinates, and output the K sets of difference box coordinates that indicate the difference between the foreground image and the background image, where K is an integer greater than zero.

[0085] In a third aspect, this application provides an electronic device, comprising:

[0086] A memory, used to store a computer program;

[0087] A processor, used to implement the steps of the above-described image difference detection method when executing the computer program stored in the memory.

[0088] In a fourth aspect, this application provides a computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps of the image difference detection method described above.

[0089] For the technical effects that the second to fourth aspects and their possible implementations may achieve, please refer to the above description of the technical effects of the first aspect and its possible implementations, which will not be repeated here.

Attached Figure Description

[0090] Figure 1 is a flowchart of an image difference detection method provided in this application;

[0091] Figure 2 is a schematic diagram illustrating the computational principle of an attention interaction network provided in this application;

[0092] Figure 3 is a schematic diagram illustrating the processing steps of an image difference detection method provided in this application;

[0093] Figure 4 is a schematic diagram of an image difference detection system provided in this application;

[0094] Figure 5 is a schematic diagram of an electronic device provided in this application.

Detailed Implementation

[0095] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings. The specific operational methods in the method embodiments can also be applied to the device embodiments or system embodiments. It should be noted that in the description of this application, "multiple" is understood as "at least two". "And/or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and/or B can represent: A existing alone, A and B existing simultaneously, and B existing alone. "A connected to B" can represent: A and B directly connected, or A and B connected through C. Furthermore, in the description of this application, terms such as "first" and "second" are used only to distinguish between the items described and should not be construed as indicating or implying relative importance or order.

[0096] The embodiments of this application will now be described in detail with reference to the accompanying drawings.

[0097] Currently, image difference detection based on deep learning technology is mainly divided into image difference detection based on unsupervised detection algorithms and image difference detection based on supervised detection algorithms. However, when performing difference detection between foreground and background images using unsupervised detection algorithms, the manually designed features lack objectivity and are time-consuming and labor-intensive, resulting in low detection accuracy and efficiency. On the other hand, when performing difference detection between foreground and background images using supervised algorithms, the supervised detection algorithms make poor use of the difference information between the foreground and background images, and the large number of parameters in the detection network makes it difficult to obtain optimal parameters in a short time, which also leads to low detection accuracy and efficiency.

[0098] Therefore, this application proposes an image difference detection method. It extracts features from the foreground and background images using a feature extraction network, then calculates the interaction features between the foreground and background feature maps output by the feature extraction network using a Correlation Attention Data Process (CADP) network, thus obtaining an interaction feature map. Finally, the interaction feature map is detected using a detection network to obtain the difference information between the foreground and background images. This method avoids manually designing the feature information of the foreground and background images and effectively utilizes the difference information between them, thereby improving the detection accuracy and efficiency of image difference detection.

[0099] Referring to Figure 1, which is a flowchart of an image difference detection method provided in an embodiment of this application, the method includes:

[0100] S1, acquire the foreground and background images;

[0101] Before acquiring the foreground and background images, it is necessary to first acquire the original foreground and background images.

[0102] It should be noted that, in the embodiments of this application, the original foreground image and the original background image are two images captured at arbitrary times. They can be two consecutive frames or two non-consecutive frames in the same video, or two images from different scenes.

[0103] Furthermore, the original foreground and background images are transformed through data preprocessing to obtain a foreground image and a background image in a specified format. For example, the original PNG-format foreground and background images are converted into Tensor-format foreground and background images through image conversion. The Tensor-format foreground image and the Tensor-format background image both have size (3, 416, 416), and they can be input directly into the network.
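As a rough illustration of such a conversion, the sketch below turns decoded pixels into a channel-first array of shape (3, 416, 416). The text only fixes the target format and size; the nearest-neighbor resize, the scaling of pixel values to [0, 1], the use of nested Python lists in place of a framework tensor, and the name to_chw_tensor are all assumptions.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def to_chw_tensor(pixels: Sequence[Sequence[Pixel]], size: int = 416) -> List[List[List[float]]]:
    """Convert an H x W grid of RGB pixels to a (3, size, size) nested list.

    Assumptions: nearest-neighbor resizing and scaling to the range [0, 1].
    """
    src_h, src_w = len(pixels), len(pixels[0])
    tensor: List[List[List[float]]] = []
    for channel in range(3):
        plane = []
        for y in range(size):
            src_row = pixels[y * src_h // size]  # nearest source row
            plane.append([src_row[x * src_w // size][channel] / 255.0
                          for x in range(size)])
        tensor.append(plane)
    return tensor


# A tiny 2 x 2 "decoded PNG": red, green / blue, white.
tiny = [[(255, 0, 0), (0, 255, 0)],
        [(0, 0, 255), (255, 255, 255)]]
tensor = to_chw_tensor(tiny)
shape = (len(tensor), len(tensor[0]), len(tensor[0][0]))
```

Applying the same conversion to both the original foreground image and the original background image gives two inputs of identical shape, which is what allows them to be fed to the same feature extraction network.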

[0104] In the above manner, the original foreground image and the original background image are transformed into a foreground image and a background image based on data preprocessing operations, so that the foreground image and the background image can be directly input into the subsequent network without any further operations.

[0105] In one possible implementation, the network needs to be trained before acquiring the foreground and background images to obtain the target network.

[0106] Specifically, the foreground training image and background training image used for network training are first obtained. The foreground training image and background training image are both images obtained after the original foreground training image and the original background training image have been transformed by the above data preprocessing method.

[0107] Then, based on the foreground training image and the background training image, the feature extraction network, CADP, and detection network are trained to obtain the trained feature extraction network, trained CADP, and trained detection network.

[0108] It should be noted that when training the feature extraction network, CADP, and detection network, the output of the feature extraction network is used as the input of CADP, and the output of CADP is used as the input of the detection network. Finally, the output of the detection network is the difference information between the foreground training image and the background training image.

[0109] Furthermore, it should be noted that in this embodiment, a two-stream network is used as the feature extraction network, wherein the two-stream network contains Q feature extraction blocks consisting of convolutional layers, normalization layers, and activation layers, where Q is an integer greater than zero.

[0110] Furthermore, it is determined whether the trained feature extraction network, the trained CADP, and the trained detection network all meet the preset conditions.

[0111] It should be noted that the above preset conditions can be that the accuracy of image difference detection reaches a first preset threshold, or that the error of image difference detection is less than a second preset threshold. In this embodiment, the preset conditions can be selected according to the specific application scenario.

[0112] If the trained feature extraction network, the trained CADP, and the trained detection network do not all meet the preset conditions, the modifiable parameters in the feature extraction network, the CADP, and the detection network are modified, and the modified feature extraction network, the modified CADP, and the modified detection network continue to be trained until the trained feature extraction network, the trained CADP, and the trained detection network meet the preset conditions.

[0113] It should be noted that, in this embodiment, when modifying the modifiable parameters in the feature extraction network, the modifiable parameters in CADP, and the modifiable parameters in the detection network, the modifiable parameters can be modified using either a grid search-based parameter modification method or a random search-based parameter modification method. In this embodiment, the parameter modification method can be selected according to the specific application scenario.

[0114] If the trained feature extraction network, the trained CADP, and the trained detection network all meet the preset conditions, they are used as the first target feature extraction network, the target CADP, and the target detection network, respectively.
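The train, check, and modify loop described in paragraphs [0110] to [0114] can be sketched as follows for the random search variant. The function names `search_until_condition`, `sample_params`, and `train_and_evaluate`, the accuracy threshold, the round limit, and the toy accuracy values are all hypothetical placeholders; a real implementation would train the three networks end to end in place of the toy evaluation.

```python
import random


def search_until_condition(train_and_evaluate, sample_params, accuracy_threshold, max_rounds=100):
    """Retrain with newly sampled parameters until the accuracy condition is met."""
    for round_index in range(max_rounds):
        params = sample_params()
        accuracy = train_and_evaluate(params)
        if accuracy >= accuracy_threshold:
            return params, accuracy, round_index + 1
    return None  # the preset condition was not met within max_rounds


rng = random.Random(0)


def sample_params():
    # Random search over a hypothetical modifiable parameter.
    return {"learning_rate": rng.choice([0.1, 0.01, 0.001])}


def train_and_evaluate(params):
    # Toy stand-in for training and measuring detection accuracy.
    return {0.1: 0.70, 0.01: 0.85, 0.001: 0.93}[params["learning_rate"]]


result = search_until_condition(train_and_evaluate, sample_params, accuracy_threshold=0.9)
```

A grid search variant would replace `sample_params` with an iterator over a fixed list of parameter combinations.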

[0115] By using the above method, the feature extraction network, CADP, and detection network were trained based on the foreground and background training images. The first target feature extraction network, target CADP, and target detection network were determined based on preset conditions, so that the parameters used in the image difference detection process are all optimal, thereby making the image difference detection result optimal and further improving the accuracy of image difference detection.

[0116] Furthermore, due to the structural design of the first target feature extraction network, it takes a relatively long time to extract image features. Therefore, this embodiment of the application accelerates the first target feature extraction network using the following Sparse Pruning Teacher-net (TSPT) network acceleration framework:

[0117] First, optimize the specified parameters in the first target feature extraction network to obtain the second feature extraction network.

[0118] Specifically, after obtaining the first target feature extraction network, the first target feature extraction network is sparsely trained by compressing the weights of the regularization layer. Then, the specified parameters in the first target feature extraction network are optimized based on the sparse training to obtain the second feature extraction network.

[0119] In this process, sparsification training involves attenuating specified parameters according to a decay method in each loop of the first target feature extraction network training process.

[0120] It should be noted that, in this embodiment of the application, when attenuating a specified parameter according to the attenuation method, the specified parameter can be attenuated by constant attenuation, by global attenuation, or by local attenuation. In this embodiment of the application, the attenuation method can be selected according to the specific application scenario.

[0121] By using the above method to sparsely train the first target feature extraction network based on the regularization layer operation weights, the resulting second feature extraction network has more concentrated weights, laying the foundation for accelerating the operation speed of the feature extraction network.

[0122] Furthermore, after sparsification training, the second feature extraction network is pruned; that is, a specified feature extraction block in the second feature extraction network is pruned according to a preset pruning rate to obtain the third feature extraction network. Finally, the third feature extraction network is guided by the first target feature extraction network; that is, the predicted feature map of the first target feature extraction network is used as the real feature map of the third feature extraction network, and the third feature extraction network is trained on this basis to obtain the target feature extraction network.
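One possible reading of pruning by a preset pruning rate is sketched below. It assumes that each channel of a feature extraction block has a single scale weight after sparsification training and that the channels with the smallest absolute weights are removed; the application states only the pruning rate, so this selection criterion and the name `select_channels_to_keep` are assumptions.

```python
def select_channels_to_keep(channel_weights, pruning_rate):
    """Return the sorted indices of channels kept after pruning the smallest-magnitude ones."""
    num_pruned = int(len(channel_weights) * pruning_rate)
    # Rank channel indices from smallest to largest absolute weight.
    ranked = sorted(range(len(channel_weights)), key=lambda i: abs(channel_weights[i]))
    return sorted(ranked[num_pruned:])


# Hypothetical per-channel weights after sparsification training.
weights = [0.9, 0.01, -0.5, 0.002, 0.3, -0.04]
kept = select_channels_to_keep(weights, pruning_rate=0.5)
```

With these example weights and a pruning rate of 0.5, channels 0, 2, and 4 are kept.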

[0123] It should be noted that, in this embodiment of the application, the relative entropy loss function (Kullback-Leibler Divergence Loss, KLDivLoss) is used to train the third feature extraction network.

[0124] Specifically, KLDivLoss is as follows:

[0125]

[0126] Here, y_pred is the predicted feature map of the third feature extraction network, and y_true is the true feature map of the third feature extraction network, which is also the predicted feature map of the first target feature extraction network.
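For reference, the standard relative entropy of a prediction y_pred with respect to a target y_true is the sum over all elements i of y_true[i] * log(y_true[i] / y_pred[i]). The sketch below computes this quantity, assuming that both feature maps are flattened and normalized into probability distributions with a softmax; the normalization step and the function names are assumptions, not details given in this application.

```python
import math


def softmax(values):
    """Normalize raw values into a probability distribution."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def kl_div_loss(y_pred, y_true):
    """Relative entropy KL(y_true || y_pred) for two distributions of equal length."""
    return sum(t * math.log(t / p) for t, p in zip(y_true, y_pred) if t > 0)


# Hypothetical flattened feature values from the two networks.
teacher = softmax([2.0, 1.0, 0.1])  # y_true: first target feature extraction network
student = softmax([1.5, 1.2, 0.3])  # y_pred: third feature extraction network
loss = kl_div_loss(student, teacher)
```

The loss is zero when the two distributions coincide and positive otherwise, so minimizing it pulls the pruned network toward the output of the unpruned network.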

[0127] By employing the methods described above, based on the sparse training, network pruning, and pre-pruning network guidance of the post-pruning network within the TSPT network acceleration framework, the running speed of the target feature extraction network is accelerated while ensuring the detection accuracy of image difference detection. This reduces the time required for detecting differences between foreground and background images, thereby further improving the detection efficiency of image difference detection.

[0128] S2, input the foreground image and background image into the target feature extraction network respectively, and output N foreground feature maps corresponding to the foreground image and N background feature maps corresponding to the background image;

[0129] After acquiring the foreground and background images, the foreground image is directly input into the target feature extraction network, which then outputs N foreground feature maps corresponding to the foreground image. The background image is then directly input into the target feature extraction network, which outputs N background feature maps corresponding to the background image, where N is an integer greater than zero.

[0130] It should be noted that in this application, when the foreground image and the background image are input into the target feature extraction network respectively, the input order of the foreground image and the background image can be the foreground image first, the background image first, or the foreground image and the background image simultaneously. In this application embodiment, the input order of the foreground image and the background image is not limited, but the foreground image and the background image are input into the target feature extraction network with the same parameters.

[0131] For example, a two-stream network (i.e., a target feature extraction network) contains 43 feature extraction blocks consisting of convolutional layers, normalization layers, and activation layers. Among them, the convolutional layers in the feature extraction blocks at positions 2, 5, 7, 20, and 33 are convolutional layers with a stride of 2. This transforms the foreground image and background image of size (3, 416, 416) input to the two-stream network into foreground feature maps of sizes (64, 208, 208), (128, 104, 104), (256, 52, 52), (512, 26, 26), and (1024, 13, 13) and background feature maps of sizes (64, 208, 208), (128, 104, 104), (256, 52, 52), (512, 26, 26), and (1024, 13, 13) respectively.
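The sizes listed above follow from halving the height and width at each of the five stride-2 blocks. A minimal sketch, assuming padding is chosen so that each stride-2 convolution exactly halves the spatial size and taking the channel counts from the sizes listed above (the function name `downsampled_sizes` is illustrative):

```python
def downsampled_sizes(input_size, channels, stride=2):
    """Track the (C, H, W) size after each stride-2 feature extraction block."""
    _, height, width = input_size
    sizes = []
    for out_channels in channels:
        height //= stride
        width //= stride
        sizes.append((out_channels, height, width))
    return sizes


sizes = downsampled_sizes((3, 416, 416), [64, 128, 256, 512, 1024])
```

Here N equals 5, and the same sizes are obtained for the foreground and background images because both pass through the network with the same parameters.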

[0132] S3, input M foreground feature maps from N foreground feature maps and M background feature maps from N background feature maps corresponding to the M foreground feature maps into the target attention interaction network, and output M interaction feature maps;

[0133] After obtaining N foreground feature maps and N background feature maps, it is necessary to calculate the interaction features between the foreground feature maps and the background feature maps.

[0134] Specifically, before inputting M foreground feature maps from N foreground feature maps and M background feature maps from N background feature maps corresponding to the M foreground feature maps into the target CADP, M foreground feature maps are first determined from the N foreground feature maps according to the size of the feature maps, and M background feature maps corresponding to the M foreground feature maps are determined from the N background feature maps, where M is an integer greater than zero.

[0135] Using the above method, based on the size of the feature map, M foreground feature maps and M background feature maps are extracted, so that the M foreground feature maps and M background feature maps can be directly input into the target CADP, laying the foundation for calculating the interaction features between the foreground feature maps and background feature maps in the subsequent target CADP.

[0136] In one possible implementation, when determining M foreground feature maps from N foreground feature maps according to their size, the N foreground feature maps are first sorted according to their size; then, the M foreground feature maps with the smallest size are determined from the sorted N foreground feature maps.

[0137] For example, the sizes of five foreground feature maps are (64, 208, 208), (1024, 13, 13), (256, 52, 52), (128, 104, 104), and (512, 26, 26). Then, according to the size of the feature maps, the sizes of these five foreground feature maps are sorted from largest to smallest as (64, 208, 208), (128, 104, 104), (256, 52, 52), (512, 26, 26), and (1024, 13, 13). Then, the three foreground feature maps with the smallest sizes are determined from these five foreground feature maps, and the sizes of these three foreground feature maps are (256, 52, 52), (512, 26, 26), and (1024, 13, 13).
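The selection in this example can be sketched as follows, assuming that the size of a feature map is compared by its total element count (channels x height x width). The comparison key and the function names are assumptions; comparing by height x width alone gives the same order for these sizes.

```python
def element_count(size):
    """Total number of elements in a (C, H, W) feature map."""
    channels, height, width = size
    return channels * height * width


def smallest_feature_maps(sizes, m):
    """Sort from largest to smallest and keep the m smallest (m > 0)."""
    ordered = sorted(sizes, key=element_count, reverse=True)
    return ordered[-m:]


foreground_sizes = [(64, 208, 208), (1024, 13, 13), (256, 52, 52), (128, 104, 104), (512, 26, 26)]
selected = smallest_feature_maps(foreground_sizes, 3)
```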

[0138] Using the above method, the smallest foreground feature maps are selected as M foreground feature maps. Since the smaller the size of the feature map extracted by the target feature extraction network, the more important the feature information contained in the feature map, the M foreground feature maps contain the most core feature information of the foreground image. Furthermore, since the M foreground feature maps are the smallest, the computational load of the target CADP is also reduced.

[0139] In one possible implementation, after determining the M foreground feature maps and the M background feature maps in the N background feature maps that correspond to the M foreground feature maps, each foreground feature map in the M foreground feature maps and the background feature map in the M background feature maps corresponding to each foreground feature map are taken as a feature map group, thus obtaining M feature map groups.

[0140] For example, the sizes of the three foreground feature maps are (256, 52, 52), (512, 26, 26), and (1024, 13, 13), respectively, and the sizes of the three background feature maps are (256, 52, 52), (512, 26, 26), and (1024, 13, 13), respectively. Then, the foreground feature map of size (256, 52, 52) and the background feature map of size (256, 52, 52) form one feature map group, the foreground feature map of size (512, 26, 26) and the background feature map of size (512, 26, 26) form another feature map group, and the foreground feature map of size (1024, 13, 13) and the background feature map of size (1024, 13, 13) form a third feature map group, thus obtaining three feature map groups.
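The grouping can be sketched as a pairing of maps with equal size. The sketch assumes that correspondence between a foreground feature map and a background feature map is determined by matching size, as in the example above; the map names and the function `build_feature_map_groups` are hypothetical.

```python
def build_feature_map_groups(foreground_maps, background_maps):
    """Pair each foreground map with the background map of the same size.

    Each map is represented as a (name, size) tuple.
    """
    background_by_size = {size: name for name, size in background_maps}
    groups = []
    for name, size in foreground_maps:
        if size not in background_by_size:
            raise ValueError(f"no background feature map of size {size}")
        groups.append((name, background_by_size[size]))
    return groups


foreground = [("fg_52", (256, 52, 52)), ("fg_26", (512, 26, 26)), ("fg_13", (1024, 13, 13))]
background = [("bg_13", (1024, 13, 13)), ("bg_52", (256, 52, 52)), ("bg_26", (512, 26, 26))]
groups = build_feature_map_groups(foreground, background)
```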

[0141] Furthermore, for each of the M feature map groups, the following interactive feature calculation operation is performed on the target CADP:

[0142] First, obtain the target feature map group of the input target CADP, such as a feature map group consisting of a foreground feature map of size (256, 52, 52) and a background feature map of size (256, 52, 52).

[0143] Then, the difference feature map between the foreground feature map and the background feature map in the target feature map group is calculated; next, the size of the difference feature map, the size of the foreground feature map in the target feature map group, and the size of the background feature map in the target feature map group are all converted to preset sizes to obtain the difference feature matrix, the foreground feature matrix, and the background feature matrix.

[0144] It should be noted that, in this embodiment of the application, when converting the size of the difference feature map, the size of the foreground feature map in the target feature map group, and the size of the background feature map in the target feature map group to preset sizes, the dimensions of the difference feature map, the foreground feature map in the target feature map group, and the background feature map in the target feature map group are first converted to preset dimensions to obtain a first difference feature map, a first foreground feature map, and a first background feature map; then, the first difference feature map, the first foreground feature map, and the first background feature map are transposed to obtain a second difference feature map, a second foreground feature map, and a second background feature map; finally, a fully connected layer is used to convert the size of the second difference feature map, the size of the second foreground feature map, and the size of the second background feature map to preset sizes.

[0145] For example, if the dimensions of the difference feature map, the foreground feature map, and the background feature map are all (256, 52, 52), then the dimensions of the difference feature map, the foreground feature map, and the background feature map are all three-dimensional. First, the difference feature map is flattened, that is, its dimension is transformed from three-dimensional to two-dimensional. The resulting first difference feature map has dimensions of (256, 2704). Then, it is transposed to obtain a second difference feature map with dimensions of (2704, 256). Similarly, the foreground feature map and the background feature map are flattened and transposed to obtain a two-dimensional second foreground feature map and a two-dimensional second background feature map, and the size of the second foreground feature map and the second background feature map is (2704, 256). Then, the second difference feature map, the second foreground feature map and the second background feature map, which are all of size (2704, 256), are input into a 1024-dimensional fully connected layer to obtain the difference feature matrix, the foreground feature matrix and the background feature matrix, which are all of size (2704, 1024).
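The flatten, transpose, and fully connected steps can be traced on a small tensor. In the sketch below, C = 2, H = 2, W = 3, and an output dimension of 4 stand in for C = 256, H = W = 52, and 1024; the weight values are hypothetical and the bias of the fully connected layer is omitted for brevity.

```python
def flatten(feature_map):
    """(C, H, W) -> (C, H*W): concatenate the rows of each channel."""
    return [[value for row in channel for value in row] for channel in feature_map]


def transpose(matrix):
    """(R, S) -> (S, R)."""
    return [list(column) for column in zip(*matrix)]


def fully_connected(matrix, weights):
    """(N, C) times (C, D) -> (N, D)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*weights)] for row in matrix]


feature_map = [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]]  # size (2, 2, 3)
weights = [[1, 0, 1, 0], [0, 1, 0, 1]]  # hypothetical (C, D) = (2, 4) weights

first = flatten(feature_map)                       # size (2, 6)
second = transpose(first)                          # size (6, 2)
feature_matrix = fully_connected(second, weights)  # size (6, 4)
```

With the sizes of the example above, the same steps give (256, 2704), then (2704, 256), then (2704, 1024), since 52 x 52 = 2704.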

[0146] After obtaining the difference feature matrix, foreground feature matrix, and background feature matrix, the foreground feature matrix is first multiplied by the background feature matrix to obtain the first interaction information matrix, and then the first interaction information matrix is multiplied by the difference feature matrix to obtain the second interaction information matrix.

[0147] It should be noted that in this embodiment, since the size of both the foreground feature matrix and the background feature matrix is (a, b), when a and b are not equal, the foreground feature matrix and the background feature matrix cannot be directly multiplied when multiplying the foreground feature matrix by the background feature matrix according to the matrix multiplication operation. In this case, the background feature matrix needs to be transposed first to obtain a background feature matrix of size (b, a), and then the foreground feature matrix is multiplied by the transposed background feature matrix according to the matrix multiplication operation, where a and b are both integers greater than zero.

[0148] For example, refer to the CADP calculation principle diagram shown in Figure 2. The difference feature matrix A is [[1, 2, 3], [2, 3, 1]], the background feature matrix B is [[2, 2, 3], [3, 3, 6]], and the foreground feature matrix C is [[3, 4, 6], [5, 6, 7]]. That is, the difference feature matrix A, the background feature matrix B, and the foreground feature matrix C all have a size of (2, 3). First, the background feature matrix B is transposed to obtain the transposed background feature matrix [[2, 3], [2, 3], [3, 6]]. Then, the foreground feature matrix C is multiplied by the transposed background feature matrix to obtain the first interaction information matrix [[32, 57], [43, 75]]. Finally, the first interaction information matrix is multiplied by the difference feature matrix A to obtain the second interaction information matrix [[146, 235, 153], [193, 311, 204]].
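The arithmetic of this example can be checked with a short sketch using plain nested lists. The helper names `transpose` and `matmul` are illustrative; the matrices are those given above.

```python
def transpose(matrix):
    """(R, S) -> (S, R)."""
    return [list(column) for column in zip(*matrix)]


def matmul(left, right):
    """Matrix product of an (R, S) matrix and an (S, T) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*right)] for row in left]


A = [[1, 2, 3], [2, 3, 1]]  # difference feature matrix
B = [[2, 2, 3], [3, 3, 6]]  # background feature matrix
C = [[3, 4, 6], [5, 6, 7]]  # foreground feature matrix

first_interaction = matmul(C, transpose(B))         # size (2, 2)
second_interaction = matmul(first_interaction, A)   # size (2, 3)
```

The second interaction information matrix has the same size as the input matrices, which is what allows it to be combined with the foreground feature matrix in the next step.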

[0149] Next, the second interaction information matrix and the foreground feature matrix are input into the feedforward neural network to obtain the interaction feature map. The size of the interaction feature map is the same as the size of the background feature map in the target feature map group.

[0150] It should be noted that, in this embodiment of the application, the feedforward neural network includes P layers of R-dimensional fully connected layers and a regularization layer, where P and R are both integers greater than zero.

[0151] Finally, after inputting each of the M feature map groups into the target CADP to perform the above interactive feature calculation operation, M interactive feature maps can be output.

[0152] By using the above method, based on the interaction feature calculation between the foreground feature map, background feature map and difference feature map in the target CADP, the interaction feature information of the foreground feature map, background feature map and difference feature map is obtained, and the difference information between the foreground image and the background image is further extracted, thereby improving the detection accuracy of image difference detection.

[0153] S4, use the target detection network to perform difference detection on the M interaction feature maps to obtain the difference information between the foreground image and the background image;

[0154] After obtaining M interaction feature maps, a target detection network is used to perform difference detection on the M interaction feature maps, thereby obtaining the difference information between the foreground image and the background image.

[0155] Specifically, the M interaction feature maps are input into the convolutional layer and the specified activation layer in the target detection network to obtain G confidence values and the coordinates of the detection box corresponding to each of the G confidence values, where G is an integer greater than zero.

[0156] It should be noted that, in this embodiment, the designated activation layer can be a rectified linear unit (ReLU) activation layer or a hyperbolic tangent (tanh) activation layer. In this embodiment, the activation layer can be selected according to the specific application scenario.

[0157] Furthermore, it should be noted that in the embodiments of this application, G is obtained by the following calculation formula:

[0158] G = M * H * W

[0159] Where H is the length of each of the M interactive feature maps, and W is the width of each of the M interactive feature maps. For example, if the size of the interactive feature map is (256, 26, 52), the length of the interactive feature map is 26, which means H equals 26, and the width of the interactive feature map is 52, which means W equals 52.
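As a worked instance of the formula, the sketch below evaluates G for H = 26 and W = 52 from the example above, with M = 3 taken from the earlier three-map example. The value of M and the function name `detection_count` are assumptions for illustration, and the formula is applied as stated, with a single H and W shared by all M maps.

```python
def detection_count(num_maps, height, width):
    """G = M * H * W: one confidence value per position of each interaction feature map."""
    return num_maps * height * width


g = detection_count(3, 26, 52)
```

For these values, G equals 4056.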

[0160] Further, determine whether all of the above G confidence values are less than a preset threshold;

[0161] If all G confidence values are less than the preset threshold, it means that there is no difference between the foreground image and the background image. Therefore, the output difference information is that there is no difference between the foreground image and the background image.

[0162] If any of the G confidence values is not less than a preset threshold, then the coordinates of the detection boxes corresponding to the K confidence values that are not less than the preset threshold are mapped to the background image to obtain K sets of difference box coordinates. At this time, the output difference information is that there is a difference between the foreground image and the background image and K sets of difference box coordinates representing the difference location, where K is an integer greater than zero.

[0163] It should be noted that, in this embodiment of the application, when mapping the coordinates of the detection boxes corresponding to the K confidence values that are not less than a preset threshold out of the G confidence values to the background image, the coordinates of the detection boxes corresponding to the K confidence values can also be mapped to the foreground image. In this embodiment of the application, it is not limited whether the image mapped to the coordinates of the detection boxes corresponding to the K confidence values is a foreground image or a background image.
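The thresholding and mapping steps can be sketched as follows. The sketch assumes that detection box coordinates are expressed in feature map cells as (x1, y1, x2, y2) and are mapped to the image by linear scaling; the application does not specify the mapping, so the scaling rule, the threshold value of 0.5, and the function name `map_detections` are assumptions.

```python
def map_detections(confidences, boxes, threshold, map_size, image_size):
    """Keep boxes whose confidence is not less than the threshold and rescale them to image pixels."""
    map_height, map_width = map_size
    image_height, image_width = image_size
    scale_y = image_height / map_height
    scale_x = image_width / map_width
    difference_boxes = []
    for confidence, (x1, y1, x2, y2) in zip(confidences, boxes):
        if confidence >= threshold:
            difference_boxes.append((x1 * scale_x, y1 * scale_y, x2 * scale_x, y2 * scale_y))
    return difference_boxes


# Hypothetical outputs of the detection network on a 13 x 13 interaction feature map.
confidences = [0.10, 0.80, 0.95]
boxes = [(0, 0, 1, 1), (2, 3, 4, 5), (10, 10, 13, 13)]
difference_boxes = map_detections(confidences, boxes, 0.5, (13, 13), (416, 416))
has_difference = len(difference_boxes) > 0
```

Here K equals 2. An empty result corresponds to the case in which all G confidence values are below the threshold and no difference is reported.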

[0164] In the above manner, the confidence values output by the convolutional layer and the activation layer of the detection network for the interaction feature maps are compared with a preset threshold to determine whether there is a difference between the foreground image and the background image. In addition, the coordinates of the difference boxes in the difference information give the specific locations and the number of differences between the foreground image and the background image.

[0165] In one possible implementation, after obtaining K sets of difference box coordinates, it can be further determined whether the K sets of difference box coordinates contain difference box coordinates with the same coordinates.

[0166] If the K sets of difference box coordinates contain difference box coordinates with the same coordinates, the output difference information is that there is a difference between the foreground image and the background image, together with the L sets of mutually distinct difference box coordinates among the K sets, which represent the difference locations, where L is an integer greater than zero;

[0167] If the K sets of difference box coordinates are all distinct from one another, the output difference information is that there is a difference between the foreground image and the background image, together with the K sets of difference box coordinates indicating the difference locations.

[0168] By using the above method, based on the judgment of whether there are identical difference box coordinates in the K sets of difference box coordinates, the overlapping difference positions on the foreground image and the background image are removed.
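Removing repeated difference boxes amounts to order-preserving deduplication. A minimal sketch, assuming each set of difference box coordinates is a tuple (x1, y1, x2, y2) and that only exactly identical coordinates count as duplicates (the function name `remove_duplicate_boxes` is illustrative):

```python
def remove_duplicate_boxes(boxes):
    """Return the boxes with exact duplicates removed, keeping first occurrences in order."""
    seen = set()
    unique = []
    for box in boxes:
        if box not in seen:
            seen.add(box)
            unique.append(box)
    return unique


k_boxes = [(64, 96, 128, 160), (320, 320, 416, 416), (64, 96, 128, 160)]
l_boxes = remove_duplicate_boxes(k_boxes)
```

In this example K equals 3 and L equals 2.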

[0169] In summary, the image difference detection method proposed in this application extracts features from the foreground and background images separately through a feature extraction network, calculates the interaction information between the feature maps output by the feature extraction network based on CADP to obtain an interaction feature map, and finally detects the interaction feature map based on the detection network to obtain the difference information between the foreground and background images. This avoids manually designing the feature information of the foreground and background images and effectively utilizes the difference information between the foreground and background images, thereby improving the detection accuracy and efficiency of image difference detection.

[0170] Furthermore, in the process of obtaining the target feature extraction network, target CADP, and target detection network, the TSPT network acceleration framework is used to reduce the detection time of image difference detection while ensuring the detection accuracy of image difference detection, thereby further improving the detection efficiency of image difference detection; and through CADP, the foreground feature map and the background feature map can fully interact, further improving the detection accuracy of image difference detection.

[0171] The technical solution of this application will be further explained below with reference to a specific application process.

[0172] Figure 3 illustrates the processing steps of the image difference detection method. First, the original foreground and background training images used to train the image difference detection network are acquired. These images are then preprocessed to transform them into foreground and background training images that can be directly input into the image difference detection network. Second, the feature extraction network, CADP, and detection network in the image difference detection network are trained based on these foreground and background training images. Once the trained feature extraction network, CADP, and detection network meet the preset conditions, they are used as the first target feature extraction network, target CADP, and target detection network. Finally, the first target feature extraction network is optimized using the TSPT network acceleration framework to obtain the target feature extraction network.

[0173] Furthermore, after obtaining the target feature extraction network, target CADP, and target detection network, the original foreground image and original background image for which differences need to be detected are obtained. The original foreground image and original background image are then transformed into foreground image and background image that can be directly input into the target feature extraction network through the above data preprocessing. Next, the foreground image and background image are input into the target feature extraction network, and N foreground feature maps and N background feature maps are output, where N is an integer greater than zero.

[0174] Furthermore, the M smallest foreground feature maps are extracted from the N foreground feature maps, and the M corresponding background feature maps are extracted from the N background feature maps. Each of the M foreground feature maps is then grouped with its corresponding background feature map into a feature map group, resulting in M feature map groups, where M is a positive integer. Each of the M feature map groups is input into the target CADP, and the following interactive feature calculation operation is performed:

[0175] Obtain the target feature map group input to the target CADP; then calculate the difference feature map between the foreground feature map and the background feature map in the target feature map group; then convert the difference feature map, the foreground feature map in the target feature map group, and the background feature map in the target feature map group to two dimensions to obtain the first difference feature map, the first foreground feature map, and the first background feature map; then transpose the first difference feature map, the first foreground feature map, and the first background feature map respectively to obtain the second difference feature map, the second foreground feature map, and the second background feature map; then input the second difference feature map, the second foreground feature map, and the second background feature map respectively into a preset fully connected layer to obtain a difference feature matrix, a foreground feature matrix, and a background feature matrix. The background feature matrix is then transposed to obtain a transposed background feature matrix. Next, the foreground feature matrix is multiplied by the transposed background feature matrix to obtain a first interaction information matrix. Then, the first interaction information matrix is multiplied by the difference feature matrix to obtain a second interaction information matrix. Finally, the second interaction information matrix and the foreground feature matrix are input into a feedforward neural network to obtain an interaction feature map with the same size as the background feature map in the target feature map group.

[0176] After inputting each of the M feature map groups into the target CADP to perform the above interactive feature calculation operation, M interactive feature maps are output.

[0177] Furthermore, the M interaction feature maps are input into the convolutional layer and ReLU activation layer in the target detection network to obtain G confidence values and the coordinates of the detection box corresponding to each of the G confidence values, where G is a positive integer; then it is determined whether all G confidence values are less than a preset threshold.

[0178] If all G confidence values are less than the preset threshold, the output difference information is that there is no difference between the original foreground image and the original background image;

[0179] If K out of the G confidence values are not less than the preset threshold, the detection box coordinates corresponding to each of the K confidence values are mapped to the original background image or the original foreground image to obtain K sets of difference box coordinates, where K is a positive integer. When the K sets of difference box coordinates contain difference box coordinates with the same coordinates, the output difference information is that the original foreground image and the original background image differ, together with the L sets of mutually distinct difference box coordinates among the K sets, where L is a positive integer. When the K sets of difference box coordinates do not contain difference box coordinates with the same coordinates, the output difference information is that the original foreground image and the original background image differ, together with the K sets of difference box coordinates.

[0180] The above method uses a feature extraction network to extract features from the foreground and background images respectively. Then, based on CADP, the interaction information between the feature maps output by the feature extraction network is calculated to obtain an interaction feature map. Finally, the interaction feature map is detected by the detection network to output the difference information between the original foreground and background images. This avoids manually designing the feature information of the original foreground and background images and effectively utilizes the difference information between the original foreground and background images, thereby improving the detection accuracy and efficiency of image difference detection.

[0181] Furthermore, in the process of obtaining the target feature extraction network, target CADP, and target detection network, the TSPT network acceleration framework is used to reduce the detection time of image difference detection while ensuring the detection accuracy, thereby further improving the detection efficiency of image difference detection. Moreover, through CADP, the foreground feature map and the background feature map can fully interact, which further improves the detection accuracy of image difference detection.

[0182] Based on the same inventive concept, this application also provides an image difference detection system. Figure 4 is a structural schematic of the image difference detection system provided in this application. The system includes:

[0183] The acquisition module 401 is used to acquire the foreground image and the background image;

[0184] The feature extraction module 402 is used to input the foreground image and the background image into the target feature extraction network respectively, and output N foreground feature maps corresponding to the foreground image and N background feature maps corresponding to the background image, where N is an integer greater than zero;

[0185] The interaction module 403 is used to input M foreground feature maps from N foreground feature maps and M background feature maps from N background feature maps corresponding to the M foreground feature maps into the target attention interaction network, and output M interaction feature maps, where M is an integer greater than zero.

[0186] The processing module 404 is used to perform difference detection on the M interaction feature maps through the target detection network to obtain the difference information between the foreground image and the background image.
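
As a non-limiting illustration, the cooperation of the four modules can be sketched as a small pipeline. The class name `DifferenceDetectionSystem` and the four callables are hypothetical stand-ins for modules 401 to 404; the actual networks are not reproduced here.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Sequence, Tuple


@dataclass
class DifferenceDetectionSystem:
    """Hypothetical wiring of modules 401-404; every callable is a stand-in."""

    acquire: Callable[[], Tuple[Any, Any]]  # module 401: (foreground, background)
    extract: Callable[[Any], List[Any]]  # module 402: image -> N feature maps
    interact: Callable[[Sequence[Any], Sequence[Any]], List[Any]]  # module 403
    detect: Callable[[Sequence[Any]], Any]  # module 404: M maps -> difference info

    def run(self) -> Any:
        foreground, background = self.acquire()
        # The same extraction network is applied to both images.
        fg_maps = self.extract(foreground)
        bg_maps = self.extract(background)
        if len(fg_maps) != len(bg_maps):
            raise ValueError("expected N foreground and N background feature maps")
        # Module 403 selects M of the N map pairs and computes M interaction maps.
        interaction_maps = self.interact(fg_maps, bg_maps)
        return self.detect(interaction_maps)
```

Passing the modules in as callables keeps the sketch independent of any particular network framework.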

[0187] In one possible implementation, the acquisition module 401 is specifically used to acquire the foreground training image and the background training image;

[0188] Based on the foreground training image and the background training image, a feature extraction network, an attention interaction network, and a detection network are trained to obtain the trained feature extraction network, trained attention interaction network, and trained detection network.

[0189] If the trained feature extraction network, the trained attention interaction network, and the trained detection network all meet the preset conditions, they are used as the first target feature extraction network, the target attention interaction network, and the target detection network, respectively.

[0190] In one possible implementation, the acquisition module 401 is specifically used to optimize specified parameters in the first target feature extraction network to obtain a second feature extraction network.

[0191] The third feature extraction network is obtained by cropping a specified feature extraction block in the second feature extraction network according to a preset cropping rate.

[0192] The predicted feature map in the first target feature extraction network is used as the real feature map of the third feature extraction network to train the third feature extraction network, thus obtaining the target feature extraction network.
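
As a non-limiting illustration, using the predicted feature map of the first target feature extraction network as the real feature map of the third feature extraction network can be sketched as a loss between the two maps. This application does not name the loss function; mean squared error, the function name `distillation_loss`, and the nested-list feature map layout are assumptions made for this sketch.

```python
from typing import List

# Assumed toy layout: one 2-D feature map channel as a list of rows.
FeatureMap = List[List[float]]


def distillation_loss(teacher_map: FeatureMap, student_map: FeatureMap) -> float:
    """Mean squared error, treating the teacher map as the ground truth.

    teacher_map: predicted feature map of the first target network.
    student_map: feature map predicted by the cropped (third) network.
    """
    if not teacher_map or len(teacher_map) != len(student_map):
        raise ValueError("feature maps must be non-empty and have equal height")
    total, count = 0.0, 0
    for t_row, s_row in zip(teacher_map, student_map):
        if not t_row or len(t_row) != len(s_row):
            raise ValueError("feature map rows must be non-empty and equal in width")
        for t, s in zip(t_row, s_row):
            total += (t - s) ** 2
            count += 1
    return total / count
```

Minimizing such a loss would drive the cropped network to reproduce the feature maps of the uncropped one, so no additional labels are needed for this training stage.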

[0193] In one possible implementation, the acquisition module 401 is specifically used to acquire the original foreground image and the original background image, and to perform conversion processing on the original foreground image and the original background image according to the data preprocessing method to obtain the foreground image and the background image.

[0194] The processing module 404 is specifically used to perform difference detection on the M interaction feature maps through the target detection network and output the difference information between the foreground image and the background image.

[0195] In one possible implementation, the interaction module 403 is specifically configured to determine M foreground feature maps from N foreground feature maps according to the size of the feature map.

[0196] From N background feature maps, determine M background feature maps that correspond to M foreground feature maps.

[0197] In one possible implementation, the interaction module 403 is specifically used to sort the N foreground feature maps according to the size of the feature map;

[0198] From the sorted N foreground feature maps, determine the M smallest foreground feature maps.
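
As a non-limiting illustration, the size-based selection can be sketched as follows. The sketch assumes that "size" means the spatial area (height times width) of a feature map and that the corresponding background feature maps are those at the same indices; both are assumptions, as is the function name `select_smallest`.

```python
from typing import List, Sequence, Tuple

# Assumed shape layout for illustration: (height, width).
Shape = Tuple[int, int]


def select_smallest(fg_shapes: Sequence[Shape], m: int) -> List[int]:
    """Return the indices of the M foreground feature maps with the smallest area."""
    if not 0 < m <= len(fg_shapes):
        raise ValueError("M must satisfy 0 < M <= N")
    # Sort map indices by spatial area, smallest first.
    order = sorted(
        range(len(fg_shapes)), key=lambda i: fg_shapes[i][0] * fg_shapes[i][1]
    )
    return order[:m]
```

The returned indices can then be used to pick both the M foreground feature maps and, under the stated assumption, the M corresponding background feature maps.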

[0199] In one possible implementation, the interaction module 403 is specifically used to take each foreground feature map in the M foreground feature maps and the background feature map corresponding to each foreground feature map in the M background feature maps as a feature map group, to obtain M feature map groups.

[0200] For each of the M feature map groups, the following interaction feature calculation operation is performed by inputting it into the target attention interaction network:

[0201] Obtain the target feature map group that is input into the target attention interaction network;

[0202] Calculate the difference feature map between the foreground feature map and the background feature map in the target feature map group;

[0203] The dimensions of the difference feature map, the foreground feature map in the target feature map group, and the background feature map in the target feature map group are all converted to preset dimensions to obtain the difference feature matrix, the foreground feature matrix, and the background feature matrix;

[0204] The foreground feature matrix is multiplied by the background feature matrix using matrix multiplication to obtain the first interaction information matrix.

[0205] The first interaction information matrix is multiplied by the difference feature matrix using matrix multiplication to obtain the second interaction information matrix.

[0206] The second interaction information matrix and the foreground feature matrix are input into the feedforward neural network to obtain the interaction feature map;

[0207] Each of the M feature map groups is input into the target attention interaction network to perform interaction feature calculation operations, and M interaction feature maps are output.
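
As a non-limiting illustration, the interaction feature calculation for one feature map group can be sketched with plain matrices. The sketch assumes that each feature map has already been flattened to an (H·W) × C matrix, that the difference feature map is an element-wise subtraction, and that the background matrix is transposed before the first multiplication so the shapes agree. The feedforward neural network is passed in as a callable, and `residual_add` is only a hypothetical stand-in for it.

```python
from typing import Callable, List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    if not a or not b or len(a[0]) != len(b):
        raise ValueError("inner dimensions must agree")
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def interaction_feature(
    fg: Matrix, bg: Matrix, ffn: Callable[[Matrix, Matrix], Matrix]
) -> Matrix:
    """fg and bg are flattened feature maps of shape (H*W) x C (assumed layout)."""
    # Difference feature matrix (element-wise subtraction is an assumption).
    diff = [[f - b for f, b in zip(f_row, b_row)] for f_row, b_row in zip(fg, bg)]
    # First interaction information matrix: (H*W) x (H*W).
    first = matmul(fg, transpose(bg))
    # Second interaction information matrix: (H*W) x C.
    second = matmul(first, diff)
    # The feedforward network receives the second matrix and the foreground matrix.
    return ffn(second, fg)


def residual_add(second: Matrix, fg: Matrix) -> Matrix:
    """Hypothetical stand-in for the feedforward network: element-wise sum."""
    return [[s + f for s, f in zip(s_row, f_row)] for s_row, f_row in zip(second, fg)]
```

In this sketch, identical foreground and background matrices give a zero difference matrix, so the second interaction information matrix is zero and the stand-in network returns the foreground matrix unchanged.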

[0208] In one possible implementation, the processing module 404 is specifically used to input the M interaction feature maps into the convolutional layer and a specified activation layer in the target detection network to obtain G confidence values and the coordinates of the detection boxes corresponding to each of the G confidence values, where G is an integer greater than zero.

[0209] Determine whether all G confidence values are less than a preset threshold;

[0210] If all G confidence values are less than the preset threshold, the output is that there is no difference between the foreground image and the background image.

[0211] If not all of the G confidence values are less than the preset threshold, the coordinates of the detection boxes corresponding to the K confidence values that are not less than the preset threshold are mapped onto the original background image to obtain K sets of difference box coordinates. The output is that there is a difference between the foreground image and the background image, together with the K sets of difference box coordinates indicating the location of the difference, where K is an integer greater than zero.
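
As a non-limiting illustration, mapping detection box coordinates onto the original background image can be sketched as a coordinate rescaling. This application does not specify the data preprocessing method; the sketch assumes it is a plain resize of the original image to the network input size, so that the mapping is a per-axis scale. The function name `map_box_to_original` and the `(width, height)` size layout are likewise assumptions.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # assumed: (x_min, y_min, x_max, y_max)
Size = Tuple[int, int]  # assumed: (width, height)


def map_box_to_original(box: Box, network_size: Size, original_size: Size) -> Box:
    """Rescale a box from network-input coordinates to original-image coordinates."""
    net_w, net_h = network_size
    orig_w, orig_h = original_size
    if net_w <= 0 or net_h <= 0:
        raise ValueError("network input size must be positive")
    # Per-axis scale factors that undo the assumed plain resize.
    scale_x = orig_w / net_w
    scale_y = orig_h / net_h
    x_min, y_min, x_max, y_max = box
    return (x_min * scale_x, y_min * scale_y, x_max * scale_x, y_max * scale_y)
```

If the preprocessing also involved padding or cropping, the corresponding offsets would have to be undone as well.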

[0212] Based on the same inventive concept, this application also provides an electronic device that can realize the functions of the aforementioned image difference detection system. Referring to Figure 5, the electronic device includes:

[0213] At least one processor 501 and a memory 502 connected to the at least one processor 501. In this embodiment, the specific connection medium between the processor 501 and the memory 502 is not limited. In Figure 5, the processor 501 and the memory 502 are connected via bus 500 as an example. Bus 500 is drawn as a thick line in Figure 5; the connections between other components are shown for illustrative purposes only and are not limiting. Bus 500 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, it is drawn in Figure 5 as a single thick line, but this does not imply that there is only one bus or one type of bus. Alternatively, the processor 501 can also be called a controller; there is no restriction on the name.

[0214] In this embodiment, memory 502 stores instructions executable by the at least one processor 501. By executing the instructions stored in memory 502, the at least one processor 501 can perform the image difference detection method described above. Processor 501 can implement the functions of each module of the system shown in Figure 5.

[0215] The processor 501 is the control center of the system. It can connect to various parts of the control device through various interfaces and lines. By running or executing instructions stored in memory 502 and calling data stored in memory 502, the system can perform various functions and process data, thereby monitoring the system as a whole.

[0216] In one possible design, processor 501 may include one or more processing units. Processor 501 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, and applications, and the modem processor mainly handles wireless communication. It is understood that the modem processor may also not be integrated into processor 501. In some embodiments, processor 501 and memory 502 may be implemented on the same chip; in some embodiments, they may also be implemented on separate chips.

[0217] Processor 501 can be a general-purpose processor, such as a central processing unit (CPU), digital signal processor, application-specific integrated circuit, field-programmable gate array or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component, capable of implementing or executing the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the image difference detection method disclosed in the embodiments of this application can be directly manifested as being executed by a hardware processor, or executed by a combination of hardware and software modules within the processor.

[0218] Memory 502, as a non-volatile computer-readable storage medium, can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. Memory 502 may include at least one type of storage medium, such as flash memory, hard disk, multimedia card, card-type memory, random access memory (RAM), static random access memory (SRAM), programmable read-only memory (PROM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), magnetic storage, magnetic disk, optical disk, etc. Memory 502 can be any other medium capable of carrying or storing desired program code in the form of instructions or data structures that can be accessed by a computer, but is not limited thereto. In the embodiments of this application, memory 502 can also be a circuit or any other device capable of implementing storage functions for storing program instructions and / or data.

[0219] By designing and programming the processor 501, the code corresponding to the image difference detection method described in the foregoing embodiments can be embedded into the chip, so that the chip, when running, can execute the steps of the image difference detection method of the embodiment shown in Figure 4. How to design and program the processor 501 is a technique well-known to those skilled in the art and will not be described further here.

[0220] Based on the same inventive concept, embodiments of this application also provide a storage medium storing computer instructions that, when executed on a computer, cause the computer to perform the image difference detection method described above.

[0221] In some possible implementations, various aspects of the image difference detection method provided in this application can also be implemented in the form of a program product, which includes program code that, when the program product is run on a device, causes the device to perform the steps of the image difference detection method according to the various exemplary embodiments of this application described above.

[0222] Those skilled in the art will understand that embodiments of this application can be provided as methods, systems, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0223] This application is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.

[0224] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.

[0225] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and / or one or more blocks of the block diagram.

[0226] Obviously, those skilled in the art can make various modifications and variations to this application without departing from the spirit and scope of this application. Therefore, if such modifications and variations fall within the scope of the claims of this application and their equivalents, this application also intends to include such modifications and variations.

Claims

1. An image difference detection method, characterized in that the method includes: acquiring a foreground image and a background image; inputting the foreground image and the background image into a target feature extraction network, respectively, and outputting N foreground feature maps corresponding to the foreground image and N background feature maps corresponding to the background image, where N is an integer greater than zero; inputting M foreground feature maps from the N foreground feature maps and M background feature maps, from the N background feature maps, corresponding to the M foreground feature maps into a target attention interaction network to perform an interaction feature calculation operation, and outputting M interaction feature maps, where M is an integer greater than zero; wherein the interaction feature calculation operation is as follows: calculating the difference feature map between the foreground feature map and the background feature map in a target feature map group; performing dimension transformation, transposition, and size transformation on the difference feature map, the foreground feature map, and the background feature map to obtain a difference feature matrix, a foreground feature matrix, and a background feature matrix; multiplying the foreground feature matrix by the background feature matrix according to a matrix multiplication operation, and multiplying the resulting product by the difference feature matrix to obtain a second interaction information matrix; inputting the second interaction information matrix and the foreground feature matrix into a feedforward neural network to obtain the interaction feature map; and performing difference detection on the M interaction feature maps through a target detection network to obtain difference information between the foreground image and the background image.

2. The method as described in claim 1, characterized in that, before acquiring the foreground image and the background image, the method further includes: acquiring a foreground training image and a background training image; training a feature extraction network, an attention interaction network, and a detection network based on the foreground training image and the background training image to obtain a trained feature extraction network, a trained attention interaction network, and a trained detection network; and if the trained feature extraction network, the trained attention interaction network, and the trained detection network all meet preset conditions, using them as a first target feature extraction network, the target attention interaction network, and the target detection network, respectively.

3. The method as described in claim 2, characterized in that, after using the trained feature extraction network, the trained attention interaction network, and the trained detection network as the first target feature extraction network, the target attention interaction network, and the target detection network, the method further includes: optimizing specified parameters in the first target feature extraction network to obtain a second feature extraction network; cropping a specified feature extraction block in the second feature extraction network according to a preset cropping rate to obtain a third feature extraction network; and training the third feature extraction network using the predicted feature map of the first target feature extraction network as the real feature map of the third feature extraction network, thereby obtaining the target feature extraction network.

4. The method as described in claim 1, characterized in that acquiring the foreground image and the background image includes: acquiring an original foreground image and an original background image, and performing conversion processing on the original foreground image and the original background image according to a data preprocessing method to obtain the foreground image and the background image, respectively; and performing difference detection on the M interaction feature maps through the target detection network to obtain difference information between the foreground image and the background image includes: performing difference detection on the M interaction feature maps through the target detection network and outputting the difference information between the foreground image and the background image.

5. The method as described in claim 1, characterized in that, before inputting the M foreground feature maps from the N foreground feature maps and the M background feature maps, from the N background feature maps, corresponding to the M foreground feature maps into the target attention interaction network to perform the interaction feature calculation operation and outputting the M interaction feature maps, the method further includes: determining the M foreground feature maps from the N foreground feature maps according to feature map size; and determining, from the N background feature maps, the M background feature maps corresponding to the M foreground feature maps.

6. The method as described in claim 5, characterized in that determining the M foreground feature maps from the N foreground feature maps according to feature map size includes: sorting the N foreground feature maps according to feature map size; and determining, from the sorted N foreground feature maps, the M foreground feature maps with the smallest size.

7. The method as described in claim 1, characterized in that inputting the M foreground feature maps from the N foreground feature maps and the M background feature maps, from the N background feature maps, corresponding to the M foreground feature maps into the target attention interaction network to perform the interaction feature calculation operation and outputting the M interaction feature maps includes: taking each foreground feature map in the M foreground feature maps and the corresponding background feature map in the M background feature maps as a feature map group to obtain M feature map groups; and inputting each of the M feature map groups into the target attention interaction network to perform the interaction feature calculation operation, and outputting the M interaction feature maps.

8. The method as described in claim 1, characterized in that performing difference detection on the M interaction feature maps through the target detection network to obtain difference information between the foreground image and the background image includes: inputting the M interaction feature maps into the convolutional layer and a specified activation layer in the target detection network to obtain G confidence values and the coordinates of the detection boxes corresponding to the G confidence values, where G is an integer greater than zero; determining whether all G confidence values are less than a preset threshold; if so, outputting that there is no difference between the foreground image and the background image; and if not, mapping the coordinates of the detection boxes corresponding to the K confidence values, among the G confidence values, that are not less than the preset threshold onto the background image to obtain K sets of difference box coordinates, and outputting the K sets of difference box coordinates indicating the difference between the foreground image and the background image, where K is an integer greater than zero.

9. An electronic device, characterized in that it includes: a memory, used to store a computer program; and a processor, used to implement the method steps of any one of claims 1-8 when executing the computer program stored in the memory.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the steps of the method described in any one of claims 1-8.