Image detection method, device and computer readable storage medium

By using clustering modules and weighted training of feature vector pairs in image detection, the problem of insufficient accuracy in image similarity detection is solved, achieving higher detection accuracy and better neural network training results.

CN116342909B · Active · Publication Date: 2026-06-16 · RICOH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
RICOH CO LTD
Filing Date
2021-12-21
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Current image similarity detection technologies face challenges such as low quantity and quality of abnormal samples, problems with sample data labeling, and significant environmental influences, resulting in insufficient detection accuracy.

Method used

Multiple clustering modules are used for clustering to obtain channel weights and feature vectors. Image detection is performed using the weights of the feature vector pairs, and the neural network is trained to adjust the parameters and improve detection accuracy.

Benefits of technology

By using clustering and weighted training of feature vector pairs, the accuracy of image similarity detection is improved, avoiding the problem of low training efficiency caused by abnormal samples and enhancing the detection performance of the neural network.


Abstract

Embodiments of the present application provide image detection methods, devices and computer readable storage media. The image detection method according to the embodiments of the present application comprises: acquiring two images to be detected, and acquiring feature maps corresponding to the two images to be detected respectively; performing clustering processing on each feature map to be detected by using a plurality of clustering modules, acquiring a channel weight of each clustering module, and acquiring a feature vector according to the channel weight, and using the feature vector of each clustering module to form a feature vector group; acquiring a feature vector pair formed by the feature vectors of the same clustering module in each feature vector group corresponding to the two feature maps to be detected respectively, and acquiring a weight of each feature vector pair; and performing image detection on the two images to be detected according to at least the acquired feature vector pair and the weight of the feature vector pair, to acquire a similarity of the two images to be detected.

Description

Technical Field

[0001] This invention relates to the field of image processing, and more particularly to an image detection method, apparatus, and computer-readable storage medium.

Background Technology

[0002] Image similarity detection is of great significance in contemporary society, especially in industrial settings. However, image similarity detection often faces various challenges (e.g., limited anomalous samples in neural network training, problems with sample data labeling, small differences between anomalous and normal samples, significant environmental influences, and inconsistent standards for anomalous samples), which greatly affect the accuracy of image similarity detection.

[0003] Therefore, there is a need for an image detection method and apparatus that can improve the accuracy of image similarity detection.

Summary of the Invention

[0004] To address the aforementioned technical problems, according to one aspect of the present invention, an image detection method is provided, comprising: acquiring two images to be detected, and acquiring feature maps to be detected corresponding to the two images to be detected respectively; performing clustering processing on each feature map to be detected using multiple clustering modules, acquiring channel weights of each clustering module and acquiring feature vectors based on the channel weights, and forming a feature vector group using the feature vectors of each clustering module; acquiring feature vector pairs composed of feature vectors from the same clustering module in the two feature vector groups corresponding to the two feature maps to be detected respectively, and acquiring the weights of each feature vector pair; performing image detection on the two images to be detected based at least on the acquired feature vector pairs and their weights, and acquiring the similarity between the two images to be detected.

[0005] According to another aspect of the present invention, an image detection method is provided, comprising: acquiring two images to be detected; performing image detection on the two images to be detected using a neural network, and obtaining the similarity between the two images to be detected; wherein the neural network is trained by: acquiring training feature maps corresponding to at least two training images respectively; performing clustering processing on each training feature map using multiple clustering modules, obtaining the channel weights of each clustering module, and obtaining training feature vectors according to the channel weights, and forming a training feature vector group using the training feature vectors of each clustering module; acquiring training feature vector pairs consisting of training feature vectors from the same clustering module for each of the two training feature vector groups corresponding to any two training feature maps respectively, and obtaining the weights of each training feature vector pair; predicting the similarity between the two training images corresponding to the two training feature vector groups respectively based at least on the acquired training feature vector pairs and the weights of the training feature vector pairs, and adjusting the parameters of the neural network according to the prediction results.

[0006] According to another aspect of the present invention, an image detection apparatus is provided, comprising: a feature map acquisition unit configured to acquire two images to be detected and acquire feature maps to be detected corresponding to the two images to be detected respectively; a clustering unit configured to perform clustering processing on each feature map to be detected using multiple clustering modules, acquire channel weights of each clustering module and acquire feature vectors according to the channel weights, and form a feature vector group using the feature vectors of each clustering module; a feature vector pair acquisition unit configured to acquire feature vector pairs composed of feature vectors from the same clustering module in two feature vector groups corresponding to the two feature maps to be detected respectively, and acquire the weights of each feature vector pair; and a detection unit configured to perform image detection on the two images to be detected at least according to the acquired feature vector pairs and the weights of the feature vector pairs, and acquire the similarity between the two images to be detected.

[0007] According to another aspect of the present invention, an image detection apparatus is provided, comprising: a processor; and a memory storing computer program instructions, wherein, when the computer program instructions are executed by the processor, the processor performs the following steps: acquiring two images to be detected, and acquiring feature maps to be detected corresponding to the two images to be detected respectively; performing clustering processing on each feature map to be detected using multiple clustering modules, acquiring channel weights of each clustering module and acquiring feature vectors based on the channel weights, and forming a feature vector group using the feature vectors of each clustering module; acquiring feature vector pairs composed of feature vectors from the same clustering module in the two feature vector groups corresponding to the two feature maps to be detected respectively, and acquiring the weights of each feature vector pair; performing image detection on the two images to be detected based at least on the acquired feature vector pairs and their weights, and acquiring the similarity between the two images to be detected.

[0008] According to another aspect of the present invention, a computer-readable storage medium is provided, having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, perform the following steps: acquiring two images to be detected, and acquiring feature maps to be detected corresponding to the two images to be detected respectively; performing clustering processing on each feature map to be detected using multiple clustering modules, acquiring channel weights of each clustering module and acquiring feature vectors based on the channel weights, and forming feature vector groups using the feature vectors of each clustering module; acquiring feature vector pairs composed of feature vectors from the same clustering module in the two feature vector groups corresponding to the two feature maps to be detected respectively, and acquiring the weights of each feature vector pair; performing image detection on the two images to be detected based at least on the acquired feature vector pairs and their weights, and acquiring the similarity between the two images to be detected.

[0009] According to another aspect of the present invention, an image detection apparatus is provided, comprising: an acquisition unit configured to acquire two images to be detected; and a processing unit configured to perform image detection on the two images to be detected using a neural network, and to acquire the similarity between the two images to be detected; wherein the neural network used by the processing unit is trained by: acquiring training feature maps corresponding to at least two training images respectively; performing clustering processing on each training feature map using multiple clustering modules, acquiring the channel weights of each clustering module, and acquiring training feature vectors according to the channel weights, and forming a training feature vector group using the training feature vectors of each clustering module; acquiring training feature vector pairs consisting of training feature vectors from the same clustering module for each of the two training feature vector groups corresponding to any two training feature maps respectively, and acquiring the weights of each training feature vector pair; predicting the similarity between the two training images corresponding to the two training feature vector groups respectively based at least on the acquired training feature vector pairs and the weights of the training feature vector pairs, and adjusting the parameters of the neural network according to the prediction results.

[0010] According to another aspect of the present invention, an image detection apparatus is provided, comprising: a processor; and a memory storing computer program instructions, wherein when the computer program instructions are executed by the processor, the processor performs the following steps: acquiring two images to be detected; performing image detection on the two images to be detected using a neural network, and acquiring the similarity between the two images to be detected; wherein the neural network is trained by: acquiring training feature maps corresponding to at least two training images respectively; performing clustering processing on each training feature map using multiple clustering modules, acquiring channel weights of each clustering module, and acquiring training feature vectors based on the channel weights, and forming training feature vector groups using the training feature vectors of each clustering module; acquiring training feature vector pairs consisting of training feature vectors from the same clustering module for each of the two training feature vector groups corresponding to any two training feature maps respectively, and acquiring the weights of each training feature vector pair; predicting the similarity between the two training images corresponding to the two training feature vector groups respectively based at least on the acquired training feature vector pairs and the weights of the training feature vector pairs, and adjusting the parameters of the neural network based on the prediction results.

[0011] According to another aspect of the present invention, a computer-readable storage medium is provided, on which computer program instructions are stored, wherein the computer program instructions, when executed by a processor, perform the following steps: acquiring two images to be detected; performing image detection on the two images to be detected using a neural network, and acquiring the similarity between the two images to be detected; wherein the neural network is trained in the following manner: acquiring training feature maps corresponding to at least two training images respectively; performing clustering processing on each training feature map using multiple clustering modules, acquiring the channel weights of each clustering module, and acquiring training feature vectors according to the channel weights, and forming training feature vector groups using the training feature vectors of each clustering module; acquiring training feature vector pairs consisting of training feature vectors from the same clustering module for each of the two training feature vector groups corresponding to any two training feature maps respectively, and acquiring the weights of each training feature vector pair; predicting the similarity between the two training images corresponding to the two training feature vector groups respectively, at least according to the acquired training feature vector pairs and the weights of the training feature vector pairs, and adjusting the parameters of the neural network according to the prediction results.

[0012] According to the above-described image detection method, apparatus, and computer-readable storage medium of the present invention, the channel weights of each clustering module and the weights of each feature vector pair during the clustering process can be used to detect the image to be detected, thereby enabling targeted focus on image features with higher weights for similarity comparison and improving the accuracy of image similarity detection.

[0013] Furthermore, the image detection method, apparatus, and computer-readable storage medium of the present invention employ a neural network that is trained considering the channel weights of each clustering module and the weights of feature vector pairs during the clustering process. This effectively avoids the problem of low quantity and quality of abnormal samples during training, thereby improving the training efficiency and detection effect of the neural network.

Description of the Drawings

[0014] The above and other objects, features, and advantages of the present invention will become clearer from the detailed description of the embodiments of the present invention in conjunction with the accompanying drawings.

[0015] Figure 1 shows a flowchart of an image detection method according to an embodiment of the present invention;

[0016] Figure 2(a) shows a schematic diagram of the structure of an SE module according to one embodiment of the present invention, and Figure 2(b) shows a schematic diagram of the structure of an SE module according to another embodiment of the present invention;

[0017] Figure 3 shows a schematic diagram of a clustering network performing clustering processing according to an embodiment of the present invention;

[0018] Figure 4 shows a block diagram of an image detection apparatus according to an embodiment of the present invention;

[0019] Figure 5 shows a block diagram of an image detection apparatus according to another embodiment of the present invention;

[0020] Figure 6 shows a block diagram of an image detection apparatus according to another embodiment of the present invention;

[0021] Figure 7 shows a block diagram of an image detection apparatus according to another embodiment of the present invention;

[0022] Figure 8 shows a block diagram of an image detection apparatus according to another embodiment of the present invention.

Detailed Implementation

[0023] The image detection method, apparatus, and computer-readable storage medium according to embodiments of the present invention will now be described with reference to the accompanying drawings. In the drawings, the same reference numerals denote the same elements throughout. It should be understood that the embodiments described herein are merely illustrative and should not be construed as limiting the scope of the invention.

[0024] This invention provides an image detection method, apparatus, and computer-readable storage medium. An image detection method according to an embodiment of the present invention is described below with reference to Figure 1. The image detection method of the present invention can be applied to static images or to video frames of a video that changes over time, and is not limited thereto. Correspondingly, the two images to be detected in the embodiments of the present invention can be two static images, video frames from two videos, a combination of one static image and one video frame, and so on, and are not limited thereto. Figure 1 shows a flowchart of the image detection method 100.

[0025] As shown in Figure 1, in step S101, two images to be detected are acquired, and the feature maps to be detected corresponding to the two images to be detected are acquired respectively.

[0026] In this step, information from the two images to be detected is extracted using a convolutional neural network, and the corresponding feature maps to be detected are output. Optionally, various convolutional neural networks such as ResNet, ResNeXt, and EfficientNet can be used to extract image information from the images to be detected and obtain the corresponding feature maps. Of course, this selection of convolutional neural networks is only an example; in practical applications, any appropriate convolutional neural network can be used to obtain the feature maps to be detected.

[0027] Furthermore, in this step, if there are more than two images to be detected, any two of them can be selected as input for detection, or all of the images can be grouped into pairs for detection. In the actual detection process, depending on the structure of the detection network, all images to be detected can be input at once and pairs then selected from them for detection; alternatively, the images can be input and detected one pair at a time, with the next pair input after detection of the current pair is complete. Neither approach is excluded.
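As an illustration of grouping all images into pairs, the following minimal sketch enumerates every unordered pair of image indices. The function name `all_image_pairs` is hypothetical and does not appear in the patent; the patent does not prescribe how pairs are enumerated.

```python
from itertools import combinations

def all_image_pairs(num_images):
    """Return every unordered pair (i, j), i < j, of image indices."""
    return list(combinations(range(num_images), 2))

# With N = 4 images there are 4 * 3 / 2 = 6 pairs to compare.
pairs = all_image_pairs(4)
```

Each pair can then be passed through steps S101 to S104 either in one batch or one pair at a time.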

[0028] Optionally, the i-th image among the N input images to be detected can be denoted $\mathrm{input}_i$ (for example, when there are two images to be detected, N = 2 and i takes the values 1 and 2), and the feature map to be detected extracted from it by the convolutional neural network can be denoted $U_i$, where $U_i \in \mathbb{R}^{W \times H \times C}$. Here $\mathbb{R}$ is the real number space, and W, H and C are the dimensions of the feature map to be detected, where C is the number of channels of the convolutional neural network.

[0029] In step S102, each feature map to be detected is clustered using multiple clustering modules, the channel weights of each clustering module are obtained, and feature vectors are obtained based on the channel weights. The feature vectors of each clustering module are then used to form a feature vector group.

[0030] In this step, the channel weights of different clustering modules in the clustering process can be obtained to output multiple weighted feature vectors with different responses in different image regions of the feature map to be detected.

[0031] In this embodiment of the invention, the clustering module used for clustering processing can select various functional modules in the neural network to implement the above operations. For example, attention mechanism modules, such as the SE module and the CBAM module, can be selected. Of course, other modules used for clustering processing can also be selected, and there is no limitation here.

[0032] Optionally, each feature map to be detected $U_i$ can be fed into P parallel SE modules, and P feature vectors are obtained based on the channel weights of the P SE modules to form a feature vector group. Figure 2(a) shows a schematic diagram of the structure of an SE module according to an embodiment of the present invention. As shown in Figure 2(a), after the feature map to be detected $U_i = [u_1, \ldots, u_C]$ passes through the pooling layer in Figure 2(a), a pooling layer feature $z_i = [z_1, \ldots, z_C]$ with 1×1×C channels is obtained, where:

[0033]

[0034] Then, after passing in sequence through a first fully connected layer and a ReLU layer, a second fully connected layer with 1×1×C channels, and a sigmoid activation function layer, the channel weights $S_i$ are obtained. $S_i$ is composed of the channel weights of each of the P SE modules, p = 1, ..., P, and the channel weights of the p-th SE module can be expressed as:

[0035]

[0036] where the first parameter term denotes the features of the first fully connected layer, the second parameter term denotes the features of the second fully connected layer (i.e., the fully connected layer with 1×1×C channels), sigmoid denotes normalization, and r is the reduction ratio.
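The channel-weight computation of a single SE module can be sketched as follows. This is a minimal NumPy illustration, not the patent's implementation: it assumes the pooling layer is global average pooling and that the first fully connected layer reduces C channels to C // r (as in the standard SE design), and the weight matrices `W1` and `W2` are random placeholders standing in for trained parameters.

```python
import numpy as np

def se_channel_weights(U, W1, W2):
    """Compute SE-style channel weights for one feature map.

    U  : feature map of shape (W, H, C)
    W1 : first fully connected layer, shape (C // r, C)   (assumed shape)
    W2 : second fully connected layer, shape (C, C // r)  (assumed shape)
    """
    z = U.mean(axis=(0, 1))                      # pooling: one value per channel, shape (C,)
    hidden = np.maximum(W1 @ z, 0.0)             # fully connected layer followed by ReLU
    return 1.0 / (1.0 + np.exp(-(W2 @ hidden)))  # fully connected layer followed by sigmoid

rng = np.random.default_rng(0)
W, H, C, r = 8, 8, 16, 4
U = rng.standard_normal((W, H, C))
W1 = 0.1 * rng.standard_normal((C // r, C))   # placeholder for trained weights
W2 = 0.1 * rng.standard_normal((C, C // r))   # placeholder for trained weights

s = se_channel_weights(U, W1, W2)   # one weight in (0, 1) per channel
weighted_U = U * s                  # broadcasting scales each channel by its weight
```

The last line corresponds to weighting the feature map to be detected by the channel weights, which is the operation described in the next paragraph.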

[0037] Based on the channel weights of each SE module, the feature map to be detected can be weighted to obtain the weighted feature map of each of the P SE modules, which can be expressed as:

[0038]

[0039] Furthermore, Figure 2(b) shows a schematic diagram of the SE module according to another embodiment of the present invention. Compared with Figure 2(a), the SE module in Figure 2(b) omits one fully connected layer and one ReLU layer, but its processing principle is similar to that of the SE module in Figure 2(a) and is not described in detail here. Whether the feature map to be detected passes through the SE module shown in Figure 2(a) or the SE module shown in Figure 2(b), the weighted feature map of each of the P SE modules can be obtained. Optionally, the weighted feature map corresponding to the obtained feature map to be detected $U_i$ can also be output for subsequent processing.

[0040] Figure 3 shows a schematic diagram of a clustering network performing clustering processing according to an embodiment of the present invention. As shown in Figure 3, after the feature map to be detected $U_i$ passes through the P parallel SE modules, the aforementioned weighted feature maps are obtained. The obtained weighted feature maps can then be input to the subsequent fully connected layer to extract feature vectors. Specifically, the feature vector $f_i^p$ of the p-th SE module can be expressed as:

[0041]

[0042] where the first term denotes the features of the fully connected layer shown in Figure 3, the flattening operation expands the weighted feature map into a vector, and D denotes the parameters of the clustering network. After the feature vector $f_i^p$ of the p-th SE module is obtained, the feature vector group corresponding to the feature map $U_i$ of the i-th image to be detected can be expressed as $F_i = (f_i^1, \ldots, f_i^P)$.
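The path from one feature map $U_i$ to its feature vector group $F_i$ can be sketched as below. This is an illustrative NumPy sketch under stated assumptions: the pooling is global average pooling, each of the P branches has its own fully connected layer, and all weights are random placeholders. The names `feature_vector_group`, `se_params`, `fc_weights`, and `dim` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def feature_vector_group(U, se_params, fc_weights):
    """Build F_i = (f_i^1, ..., f_i^P) from one feature map U of shape (W, H, C).

    se_params  : list of P tuples (W1, W2), one per SE module (assumed shapes
                 (C // r, C) and (C, C // r))
    fc_weights : list of P matrices of shape (dim, W * H * C), one per branch
    """
    group = []
    for (W1, W2), fc in zip(se_params, fc_weights):
        z = U.mean(axis=(0, 1))                    # pooling over spatial positions
        s = sigmoid(W2 @ np.maximum(W1 @ z, 0.0))  # channel weights of this SE module
        weighted = U * s                           # weighted feature map
        group.append(fc @ weighted.reshape(-1))    # flatten, then fully connected layer
    return np.stack(group)                         # shape (P, dim)

rng = np.random.default_rng(0)
W, H, C, r, P, dim = 4, 4, 8, 2, 3, 5
U_i = rng.standard_normal((W, H, C))
se_params = [(rng.standard_normal((C // r, C)), rng.standard_normal((C, C // r)))
             for _ in range(P)]
fc_weights = [rng.standard_normal((dim, W * H * C)) for _ in range(P)]

F_i = feature_vector_group(U_i, se_params, fc_weights)  # one row per SE module
```

Because each SE module has its own parameters, the P rows of `F_i` can respond to different image regions of the same feature map.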

[0043] Optionally, during training of the above clustering network, the loss function of the clustering network can be calculated from the obtained feature vector groups:

[0044]

[0045] where α1 denotes a preset distance value that is either fixed or adjustable, and d denotes a distance metric function. The distance metric function can be the L1 distance (based on the absolute value of the difference), the L2 distance (based on the square of the difference), or the Cos distance (based on the angle between two feature vectors), and gives the distance between the two feature vectors indicated in its subscript.
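The three candidate distance metric functions can be sketched as follows. The definitions are common conventions chosen for illustration: the L2 distance is taken as the sum of squared differences (following the "square" wording above), and the Cos distance is taken as one minus the cosine similarity, which is an assumption since the patent only says it reflects the angle between the vectors.

```python
import numpy as np

def l1_distance(a, b):
    """Sum of absolute differences."""
    return float(np.sum(np.abs(a - b)))

def l2_distance(a, b):
    """Sum of squared differences (no square root, per the 'square' wording)."""
    return float(np.sum((a - b) ** 2))

def cos_distance(a, b):
    """One minus cosine similarity; 0 for parallel vectors, 1 for orthogonal ones."""
    cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(1.0 - cos_sim)

a = np.array([1.0, 2.0])
b = np.array([3.0, 0.0])
d1 = l1_distance(a, b)   # |1 - 3| + |2 - 0| = 4
d2 = l2_distance(a, b)   # (1 - 3)^2 + (2 - 0)^2 = 8
```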

[0046] Based on the above loss function representation, during the training process of the clustering network, the parameters of the clustering network can be adjusted by minimizing the distance between the feature vectors output by the same SE module in the feature vector group corresponding to N images to be detected, while maximizing the distance between the feature vectors output by different SE modules. This enhances the ability of different SE modules to capture different image regions of the feature map to be detected.
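The exact loss formula is not reproduced in this text, so the following is only one plausible margin-based form that matches the verbal description: it pulls together feature vectors from the same SE module across images and pushes apart feature vectors from different SE modules until they are at least α1 apart. The function name `clustering_loss`, the hinge form, and the use of the squared L2 distance are all assumptions.

```python
import numpy as np

def clustering_loss(groups, alpha1=1.0):
    """Illustrative clustering loss over feature vector groups.

    groups : array of shape (N, P, dim), groups[i, p] is f_i^p
    alpha1 : preset distance value used as a margin (assumed role)
    """
    N, P, _ = groups.shape
    pull = 0.0  # same SE module, different images: distance should be small
    for p in range(P):
        for i in range(N):
            for j in range(i + 1, N):
                pull += np.sum((groups[i, p] - groups[j, p]) ** 2)
    push = 0.0  # different SE modules, same image: distance should exceed alpha1
    for i in range(N):
        for p in range(P):
            for q in range(p + 1, P):
                d = np.sum((groups[i, p] - groups[i, q]) ** 2)
                push += max(0.0, alpha1 - d)
    return float(pull + push)

# Two images whose modules agree across images and are far apart from each other.
well_separated = np.array([[[0.0, 0.0], [10.0, 0.0]],
                           [[0.0, 0.0], [10.0, 0.0]]])
# Two images where both modules collapse onto the same vector.
collapsed = np.zeros((2, 2, 2))

loss_good = clustering_loss(well_separated)
loss_bad = clustering_loss(collapsed)
```

In this sketch the well-separated configuration incurs no loss, while the collapsed one is penalized once per image for the modules that fail to separate.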

[0047] In step S103, feature vector pairs consisting of feature vectors from the same clustering module are obtained from the two feature vector groups corresponding to the two feature maps to be detected, and the weights of each feature vector pair are obtained.

[0048] Optionally, feature vector pairs can be formed by selecting the feature vectors of the same SE module (e.g., the p-th SE module) from the two feature vector groups corresponding to the two images to be detected. For example, the feature vector group corresponding to the i-th image $\mathrm{input}_i$ among the N images to be detected is expressed as $F_i = (f_i^1, \ldots, f_i^P)$, and the feature vector group corresponding to the j-th image $\mathrm{input}_j$ is expressed as $F_j = (f_j^1, \ldots, f_j^P)$. The feature vector pair corresponding to the p-th SE module in these two feature vector groups is therefore $(f_i^p, f_j^p)$. The following operation is then performed on each feature vector pair:

[0049]

[0050] Here, OP represents the distance vector of the feature vectors, or the ordered concatenation of two feature vectors.

[0051] Subsequently, the features with P channels can be obtained:

[0052]

[0053] where Cont denotes channel-by-channel concatenation, over the P channels, of the features generated from all the feature vector pairs, which yields the feature $F_{i,j}$ with P channels.
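The OP and Cont operations can be sketched as follows. This is a minimal illustration: the "distance vector" is taken to be the element-wise absolute difference, which is an assumption, and the function names `pair_op` and `pair_features` are hypothetical.

```python
import numpy as np

def pair_op(f_i, f_j, mode="distance"):
    """OP on one feature vector pair (f_i^p, f_j^p)."""
    if mode == "distance":
        return np.abs(f_i - f_j)            # distance vector (assumed element-wise)
    if mode == "concat":
        return np.concatenate([f_i, f_j])   # ordered concatenation
    raise ValueError(f"unknown mode: {mode}")

def pair_features(F_i, F_j, mode="distance"):
    """Cont: stack the OP result of every SE module into a P-channel feature F_{i,j}."""
    return np.stack([pair_op(a, b, mode) for a, b in zip(F_i, F_j)])

F_i = np.array([[1.0, 2.0], [3.0, 4.0]])   # P = 2 feature vectors of image i
F_j = np.array([[0.0, 2.0], [5.0, 1.0]])   # P = 2 feature vectors of image j

F_ij_dist = pair_features(F_i, F_j, mode="distance")  # shape (P, dim)
F_ij_cat = pair_features(F_i, F_j, mode="concat")     # shape (P, 2 * dim)
```

Only vectors with the same module index p are paired, so row p of the result always compares what the p-th SE module saw in the two images.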

[0054] After $F_{i,j}$ is obtained, the feature $F_{i,j}$ can be input to an SE module to obtain the weights of the feature pairs corresponding to the P channels. The structure of the SE module used here can be similar to the SE module shown in Figure 2(a) or Figure 2(b). For example, similar to Figure 2(a), the SE module used here can consist of, connected in sequence, a pooling layer with 1×1×P channels, a fully connected layer and a ReLU layer, a fully connected layer with 1×1×P channels, and a sigmoid activation function layer. Alternatively, similar to Figure 2(b), it can consist of a pooling layer with 1×1×P channels, a fully connected layer with 1×1×P channels, and a sigmoid activation function layer. The processing principle is similar to that of the SE module in Figure 2(a) or Figure 2(b) and is not elaborated further here. In either case, the input is the feature $F_{i,j}$, which comprises the features generated from the feature vector pairs of the P channels, and the output is the weight of the feature pair corresponding to each of the P channels.

[0055] Optionally, the SE module described above can also directly output the weighted feature vector pairs obtained from the feature $F_{i,j}$ of the feature vector pairs and the weights of each feature vector pair, for use in the subsequent similarity acquisition process.
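A sketch of how an SE module over the P pair channels might produce one weight per feature vector pair, along with the weighted pairs, is given below. It assumes average pooling over each channel and a two-layer (Figure 2(a)-style) structure; `pair_weights`, `hidden_size`, and the random weight matrices are hypothetical placeholders rather than the patent's parameters.

```python
import numpy as np

def pair_weights(F_ij, W1, W2):
    """SE-style weights for the P channels of F_{i,j}.

    F_ij : pair feature of shape (P, dim), one row per SE module
    W1   : first fully connected layer, shape (hidden_size, P)  (assumed shape)
    W2   : second fully connected layer, shape (P, hidden_size) (assumed shape)
    """
    z = F_ij.mean(axis=1)                        # pooling: one value per channel, shape (P,)
    hidden = np.maximum(W1 @ z, 0.0)             # fully connected layer followed by ReLU
    return 1.0 / (1.0 + np.exp(-(W2 @ hidden)))  # fully connected layer followed by sigmoid

rng = np.random.default_rng(0)
P, dim, hidden_size = 4, 6, 2
F_ij = rng.standard_normal((P, dim))
W1 = rng.standard_normal((hidden_size, P))   # placeholder for trained weights
W2 = rng.standard_normal((P, hidden_size))   # placeholder for trained weights

w = pair_weights(F_ij, W1, W2)        # one weight in (0, 1) per feature vector pair
weighted_pairs = F_ij * w[:, None]    # scale each pair feature by its weight
```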

[0056] In step S104, image detection is performed on the two images to be detected based at least on the obtained feature vector pairs and their weights, and the similarity between the two images to be detected is obtained.

[0057] In one example, the features corresponding to the feature vector pairs of the above P channels and their respective feature pair weights can be input to a regression network to calculate the similarity between the i-th and j-th images to be detected. Specifically, the similarity between the i-th and j-th images to be detected in the p-th channel can be expressed as:

[0058]

[0059] Here, R denotes the regression network, which can consist of fully connected layers and sigmoid activation function layers.

[0060] Furthermore, during training of the regression network, the loss function $L_{R1}$ can be expressed as:

[0061]

[0062] Here, label represents the regression target, which also signifies the similarity between the i-th and j-th images to be detected. Specifically, if $\mathrm{label}_i = \mathrm{label}_j$, then $\mathrm{label}_{i,j} = 1$; otherwise $\mathrm{label}_{i,j} = 0$.

[0063] Furthermore, in the above regression process, when the obtained weighted feature vector pairs are also taken into account, the similarity between the i-th and j-th images to be detected can additionally be expressed as:

[0064]

[0065] Therefore, after the two terms considered above are assigned the weights λ and (1−λ) respectively, the loss function $L_{R2}$ can be expressed as:

[0066]

[0067] where the remaining term denotes the weighted feature vector pairs of the i-th and j-th images to be detected. If $\mathrm{label}_i = \mathrm{label}_j$, then $\mathrm{label}_{i,j} = 1$; otherwise $\mathrm{label}_{i,j} = 0$.
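The pair label rule and a λ-weighted combination of two loss terms can be sketched as follows. The label rule follows the text directly. The per-term loss is not reproduced in this text, so binary cross-entropy is used here purely as an assumption, and the names `pair_label`, `bce`, and `combined_loss` are hypothetical.

```python
import numpy as np

def pair_label(label_i, label_j):
    """label_{i,j} = 1 if the two images share a label, else 0."""
    return 1.0 if label_i == label_j else 0.0

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy between predicted similarities and a 0/1 target (assumed loss)."""
    pred = np.clip(pred, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)))

def combined_loss(sim_plain, sim_weighted, target, lam=0.5):
    """Weight the two similarity terms by lam and (1 - lam)."""
    return lam * bce(sim_plain, target) + (1.0 - lam) * bce(sim_weighted, target)

target = pair_label("normal", "normal")   # same class, so the target is 1.0
sim_plain = np.array([0.9, 0.8])          # per-channel similarities from the plain pairs
sim_weighted = np.array([0.7, 0.6])       # per-channel similarities from the weighted pairs
loss = combined_loss(sim_plain, sim_weighted, target, lam=0.5)
```

Setting `lam=1.0` reduces the combined loss to the first term alone, which mirrors the relationship between the two loss functions described in the text.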

[0068] Either of the above loss functions, $L_{R1}$ or $L_{R2}$, can be used to adjust the regression network, thereby improving the accuracy of image similarity detection by changing the parameters of the regression network.

[0069] In another example, the distance between each feature vector pair can be calculated using the acquired feature vector pairs and their weights, and the similarity between the corresponding two images to be detected can be obtained based on the calculated distance. For example, the features corresponding to the feature vector pairs of the P channels mentioned above and their respective feature pair weights can be used to calculate the distance between each feature vector pair, from which the similarity between the i-th and j-th images to be detected is determined.

[0070] Specifically, the distance D between each feature pair can be expressed as:

[0071]

[0072] where d denotes the distance metric function, which can be the L1 distance (based on the absolute value of the difference), the L2 distance (based on the square of the difference), or the Cos distance (based on the angle between two feature vectors), and gives the distance between the feature vectors of the p-th SE module corresponding to $\mathrm{input}_i$ and $\mathrm{input}_j$, as indicated in its subscript. The accompanying weight term is the weight of that feature vector pair.

[0073] Optionally, the similarity between two images to be detected can be determined based on the calculated distance D between each feature pair. For example, if the distance D between each feature pair obtained from the two images to be detected meets a preset threshold, the two images to be detected are considered similar; conversely, if the distance D between each feature pair obtained from the two images to be detected does not meet the preset threshold, the two images to be detected are considered dissimilar. Of course, the specific calculation result of the distance D can also be used to specifically represent the numerical value of the similarity between the two images to be detected, which will not be elaborated here.
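The distance-based decision can be sketched as below. The formula for D is not reproduced in this text, so the sketch assumes D is the weighted sum of per-pair squared L2 distances and that "meeting" the threshold means being at or below it. Both are assumptions, and `weighted_distance` and `is_similar` are hypothetical names.

```python
import numpy as np

def weighted_distance(F_i, F_j, w):
    """Weighted sum over the P pairs of the squared L2 distance d(f_i^p, f_j^p)."""
    per_pair = np.sum((F_i - F_j) ** 2, axis=1)   # shape (P,)
    return float(np.sum(w * per_pair))

def is_similar(D, threshold):
    """Treat the two images as similar when D does not exceed the threshold."""
    return D <= threshold

F_i = np.array([[1.0, 2.0], [3.0, 4.0]])
F_j = np.array([[0.0, 2.0], [5.0, 1.0]])
w = np.array([0.5, 0.1])                 # pair weights, e.g. from the SE module

D = weighted_distance(F_i, F_j, w)       # 0.5 * 1 + 0.1 * 13
similar = is_similar(D, threshold=2.0)
```

A pair with a low weight contributes little to D even when its vectors differ strongly, which is how the weights steer the comparison toward the more informative SE modules.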

[0074] The similarity between two images to be detected calculated using a regression network and the similarity determined by distance can be applied together or selectively. Furthermore, the manner of calculating the similarity between two images to be detected from the aforementioned results is not limited here.

[0075] The image detection method described above according to embodiments of the present invention can utilize the channel weights of each clustering module and the weights of each feature vector pair during the clustering process to detect the image to be detected. This allows for targeted focus on image features with higher weights for similarity comparison, improving the accuracy of image similarity detection. In these embodiments, the neural network used is trained considering the channel weights of each clustering module and the weights of the feature vector pairs during the clustering process. This effectively avoids the problem of low quantity and quality of abnormal samples during training, improving the training efficiency and detection performance of the neural network.

[0076] An image detection method according to an embodiment of the present invention is described below with reference to Figure 4. The image detection method of the present invention can be applied to static images or to video frames in a video that changes over time, and is not limited thereto. Correspondingly, the two images to be detected in the embodiments of the present invention can be two static images, video frames from two videos, or a combination of one static image and one video frame, etc., and are not limited thereto. Figure 4 shows a flowchart of the image detection method 400.

[0077] In step S401, two images to be detected are acquired.

[0078] In this step, the image to be detected can come from the output or intermediate detection result of any intermediate layer in one or more neural networks, or it can come from the original input image to be detected, such as a two-dimensional image acquired by an image acquisition device such as a camera or video camera, or a two-dimensional frame image captured from a video.

[0079] In step S402, the two images to be detected are processed using a neural network to perform image detection and obtain the similarity between the two images. The neural network is trained as follows: At least two training images are obtained, each corresponding to a training feature map; each training feature map is clustered using multiple clustering modules, the channel weights of each clustering module are obtained, and training feature vectors are obtained based on the channel weights; training feature vector groups are formed using the training feature vectors from each clustering module; training feature vector pairs, consisting of training feature vectors from the same clustering module, are obtained for each of the two training feature vector groups corresponding to any two training feature maps, and the weights of each training feature vector pair are obtained; the similarity between the two training images corresponding to the two training feature vector groups is predicted based at least on the obtained training feature vector pairs and their weights, and the parameters of the neural network are adjusted based on the prediction results.

[0080] In this step, similarity detection can be performed on the images to be detected using a neural network to obtain the detection result. The training method of the neural network here is similar to the process shown in Figure 1; that is, at least two labeled training images can be used to perform similarity prediction following the process shown in Figure 1, and the parameters of the neural network are adjusted based on the results. For example, the parameters of the neural network can be adjusted using the loss functions $L_C$, $L_{R1}$ and/or $L_{R2}$ mentioned in the process shown in Figure 1, to improve its similarity detection accuracy. In this embodiment, the neural network is trained by iteratively updating its parameters using a large number of training images, with the aim of minimizing the difference between the detection results and the labeled true results. For specific operations, refer to the details described above with reference to Figure 1, which will not be repeated here.

[0081] In a training process similar to the steps shown in Figure 1, the neural network can be further trained as follows: at least based on the acquired training feature vector pairs and their weights, the distance between each training feature vector pair is calculated; the neural network is trained based on the distance between each training feature vector pair, adjusting the parameters of the neural network. Optionally, training the neural network based on the distance between each training feature vector pair and adjusting the parameters of the neural network may further include: training the neural network based on the category of each training feature vector pair, so as to minimize the distance between training feature vector pairs of the same category and maximize the distance between training feature vector pairs of different categories.

[0082] Optionally, as in the process shown in Figure 1, after the features corresponding to the feature vector pairs of the P channels and the corresponding feature pair weights are obtained, the loss function $L_T$ can be adjusted using the calculated distances between the training feature vector pairs, according to whether the training feature vector pairs belong to the same or different categories, to further improve the accuracy of image similarity detection.

[0083] Specifically, methods such as triplet loss can be used to adjust the loss function $L_T$ by calculating the categories of, and the corresponding distances between, the different training feature vector pairs in a triplet input. For example, three training images can be selected from N training images as the triplet input, which can be represented as $(input_{anc}, input_{pos}, input_{neg})$, where $input_{anc}$ and $input_{pos}$ belong to the same category, while $input_{anc}$ and $input_{neg}$ do not belong to the same category. The loss function $L_T$ can therefore be represented as:

[0084]

[0085] Where α2 represents a preset distance value that is fixed or adjustable, and d represents a distance metric function, which can be the L1 distance (the absolute value of the difference), the L2 distance (the square of the difference), or the cosine distance (based on the angle between two feature vectors), and gives the distance between the feature vectors shown in its subscript. The two weight terms are, respectively, the weight of the training feature vector pair of the p-th SE module corresponding to $input_{anc}$ and $input_{pos}$, and the weight of the training feature vector pair of the p-th SE module corresponding to $input_{anc}$ and $input_{neg}$.

[0086] Based on the above loss function $L_T$, during training the distance between training feature vector pairs of the same category can be minimized and the distance between training feature vector pairs of different categories can be maximized, so that the parameters of the neural network are further adjusted and the accuracy of image similarity detection is improved.
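The triplet-based adjustment described in paragraphs [0083] to [0086] can be sketched as follows. The expression for the loss is not reproduced in this text, so the sketch assumes the common hinge form of triplet loss, with the pair weights multiplying the per-module distances and `margin` playing the role of the preset distance value α2. The function names and sample values are hypothetical.

```python
def squared_l2(u, v):
    """Sum of squared element-wise differences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def weighted_triplet_loss(anc, pos, neg, w_pos, w_neg, margin, metric=squared_l2):
    """Hinge-style triplet loss over P feature vector pairs.

    anc, pos, neg: feature vector groups (P vectors each) of the anchor,
    same-category, and different-category training images.
    w_pos, w_neg: weights of the (anc, pos) and (anc, neg) pairs.
    Assumption: weighted distances are summed over the P modules.
    """
    d_pos = sum(w * metric(a, p) for a, p, w in zip(anc, pos, w_pos))
    d_neg = sum(w * metric(a, n) for a, n, w in zip(anc, neg, w_neg))
    # Zero loss once the negative is farther than the positive by the margin.
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical groups with a single module (P = 1).
anchor = [[0.0, 0.0]]
positive = [[1.0, 0.0]]
far_negative = [[3.0, 0.0]]
near_negative = [[1.0, 1.0]]
print(weighted_triplet_loss(anchor, positive, far_negative, [1.0], [1.0], margin=2.0))
print(weighted_triplet_loss(anchor, positive, near_negative, [1.0], [1.0], margin=2.0))
```

Under these assumptions, a negative sample that is already far from the anchor contributes no loss, while a nearby negative produces a positive loss that pushes the two categories apart.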

[0087] The image detection method according to embodiments of the present invention can utilize the channel weights of each clustering module and the weights of feature vector pairs during the clustering process to detect the image to be detected, thereby enabling targeted focus on image features with higher weights for similarity comparison and improving the accuracy of image similarity detection.

[0088] Furthermore, the image detection method described above according to embodiments of the present invention uses a neural network that is trained considering the channel weights of each clustering module and the weights of feature vector pairs during the clustering process. This effectively avoids the problem of low quantity and quality of abnormal samples during training, thereby improving the training efficiency and detection effect of the neural network.

[0089] An image detection apparatus according to embodiments of the present invention is described below with reference to Figure 5. Figure 5 shows a block diagram of an image detection apparatus 500 according to an embodiment of the present invention. As shown in Figure 5, the image detection apparatus 500 includes a feature map acquisition unit 510, a clustering unit 520, a feature vector pair acquisition unit 530, and a detection unit 540. Besides these units, the image detection apparatus 500 may also include other components; however, since these components are not relevant to the content of this embodiment, their illustrations and descriptions are omitted here. Furthermore, since the specific details of the operations performed by the image detection apparatus 500 according to this embodiment are the same as those described above with reference to Figures 1-3, repeated descriptions of the same details are omitted here.

[0090] The feature map acquisition unit 510 of the image detection apparatus 500 shown in Figure 5 acquires two images to be detected and acquires the feature maps to be detected corresponding to the two images to be detected respectively.

[0091] In this embodiment of the invention, the two images to be detected can be two static images, video frames from two videos, or a combination of a static image and a video frame, etc., and there are no limitations. The feature map acquisition unit 510 extracts information from the two images to be detected based on a convolutional neural network and outputs the corresponding feature maps to be detected. Optionally, various convolutional neural networks such as ResNet, ResNeXt, and EfficientNet can be used to extract image information from the images to be detected in order to obtain the corresponding feature maps to be detected. Of course, the above selection of convolutional neural networks is only an example. In practical applications, any appropriate convolutional neural network can be used to obtain the feature maps to be detected.

[0092] Furthermore, if there are more than two images to be detected, the feature map acquisition unit 510 can choose to input any two images as the images to be detected, or it can combine all the images into pairs to be detected. In the specific detection process, depending on the structure of the actual detection network, all the images to be detected can be input simultaneously, and then two images can be selected for detection separately; alternatively, one pair of images to be detected can be input and detected sequentially, and then another pair of images to be detected can be detected after the detection is completed. There are no restrictions on either approach.
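The option in paragraph [0092] of combining all input images into pairs to be detected can be sketched with the standard library. The function name `all_image_pairs` is hypothetical; image indices run from 1 to N, following the convention for i used in this description.

```python
from itertools import combinations


def all_image_pairs(num_images):
    """Return every unordered pair (i, j) with 1 <= i < j <= num_images."""
    return list(combinations(range(1, num_images + 1), 2))


# With N = 3 images there are three pairs to be detected.
for i, j in all_image_pairs(3):
    print(f"detect similarity between image {i} and image {j}")
```

Whether the pairs are processed together in one batch or one after another depends on the structure of the detection network, as noted above.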

[0093] Optionally, the feature map acquisition unit 510 can represent the i-th image to be detected among the N input images to be detected as $input_i$ (for example, when there are two images to be detected, N=2, and i can take the values 1 and 2), and the feature map to be detected extracted by the convolutional neural network as $U_i$, where $U_i$ can be represented as $U_i \in \mathbb{R}^{W \times H \times C}$. Here $\mathbb{R}$ is the real number space, and W, H and C represent the dimensions of the feature map to be detected, where C is the number of channels of the convolutional neural network.

[0094] Clustering unit 520 performs clustering processing on each feature map to be detected using multiple clustering modules, obtains the channel weights of each clustering module, obtains feature vectors based on the channel weights, and uses the feature vectors of each clustering module to form a feature vector group.

[0095] Clustering unit 520 can obtain the channel weights of different clustering modules in the clustering process, so as to output multiple weighted feature vectors that have high responses in different image regions of the feature map to be detected.

[0096] In this embodiment of the invention, the clustering module used for clustering processing can select various functional modules in the neural network to implement the above operations. For example, attention mechanism modules, such as the SE module and the CBAM module, can be selected. Of course, other modules used for clustering processing can also be selected, and there is no limitation here.

[0097] Optionally, each feature map to be detected $U_i$ can be input into P parallel SE modules for processing, and P feature vectors can be obtained based on the channel weights of the P SE modules to form a feature vector group. Figure 2(a) shows a schematic diagram of the structure of an SE module according to one embodiment of the present invention, and Figure 2(b) shows a schematic diagram of the structure of an SE module according to another embodiment of the present invention. Whether the feature map to be detected is processed by the SE module shown in Figure 2(a) or the SE module shown in Figure 2(b), a weighted feature map can be obtained from each of the P SE modules.

[0098] Figure 3 shows a schematic diagram of a clustering network performing clustering processing according to an embodiment of the present invention. As shown in Figure 3, after the feature map to be detected $U_i$ passes through the P parallel SE modules, the aforementioned weighted feature maps are obtained. Each weighted feature map can then be input to a subsequent fully connected layer to extract the feature vector $f_i^p$. After the feature vector $f_i^p$ of the p-th SE module is obtained, the feature vector group corresponding to the feature map $U_i$ of the i-th image to be detected can be represented as $F_i = (f_i^1, \ldots, f_i^P)$.
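The processing in paragraphs [0097] and [0098] can be illustrated with a minimal pure-Python sketch. The structures of Figures 2(a) and 2(b) are not reproduced in this text, so the sketch assumes the commonly used squeeze-and-excitation layout (global average pooling, two fully connected layers with ReLU and sigmoid, channel-wise scaling), followed by a fully connected layer over the flattened weighted feature map. For convenience the feature map is stored channel-first as C lists of H x W values. All names and weight values are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def se_channel_weights(feature_map, w1, w2):
    """Channel weights of one SE module (assumed standard layout)."""
    # Squeeze: global average pooling of each channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
                for ch in feature_map]
    # Excitation: fully connected + ReLU, then fully connected + sigmoid.
    hidden = [max(0.0, h) for h in matvec(w1, squeezed)]
    return [1.0 / (1.0 + math.exp(-z)) for z in matvec(w2, hidden)]


def se_feature_vector(feature_map, w1, w2, fc):
    """Scale each channel by its weight, then apply a fully connected layer."""
    weights = se_channel_weights(feature_map, w1, w2)
    weighted = [[[s * x for x in row] for row in ch]
                for ch, s in zip(feature_map, weights)]
    flat = [x for ch in weighted for row in ch for x in row]
    return matvec(fc, flat)


def feature_vector_group(feature_map, modules):
    """Run P parallel SE modules; each module is a (w1, w2, fc) tuple."""
    return [se_feature_vector(feature_map, w1, w2, fc) for w1, w2, fc in modules]


# Hypothetical feature map with C = 2 channels of size 2 x 2, and P = 2 modules.
U_i = [[[1.0, 1.0], [1.0, 1.0]], [[2.0, 2.0], [2.0, 2.0]]]
module = ([[0.0, 0.0]], [[0.0], [0.0]], [[1.0] * 8])
F_i = feature_vector_group(U_i, [module, module])
print(F_i)
```

In a trained network each of the P modules would hold its own learned parameters, so that different modules respond to different image regions; the identical modules above are only for illustration.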

[0099] The feature vector pair acquisition unit 530 acquires feature vector pairs consisting of feature vectors from the same clustering module in the two feature vector groups corresponding to the two feature maps to be detected, and acquires the weight of each feature vector pair.

[0100] Optionally, the feature vector pair acquisition unit 530 can select feature vectors from the same SE module (e.g., both from the p-th SE module) in the two feature vector groups corresponding to the two images to be detected to form a feature vector pair. For example, among the N images to be detected, the feature vector group corresponding to the i-th image to be detected $input_i$ is represented as $F_i = (f_i^1, \ldots, f_i^P)$, and the feature vector group corresponding to the j-th image to be detected $input_j$ is represented as $F_j = (f_j^1, \ldots, f_j^P)$. The feature vector pair corresponding to the p-th SE module in these two feature vector groups is therefore $(f_i^p, f_j^p)$. The features generated from all feature vector pairs are then concatenated channel by channel over the P channels to obtain the feature $F_{i,j}$ with P channels.

[0101] After obtaining $F_{i,j}$, the feature vector pair acquisition unit 530 can input the feature $F_{i,j}$ into an SE module to obtain the feature pair weights corresponding to the P channels. The structure of the SE module used here can be similar to that shown in Figure 2(a) or Figure 2(b).
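The pairing and weighting steps in paragraphs [0100] and [0101] can be sketched as follows. How the feature for each pair is generated is not specified in this text, so the sketch assumes that the two vectors of a pair are simply concatenated to form one channel of the P-channel feature, and that the SE module follows the common squeeze-and-excitation layout (per-channel averaging, two fully connected layers with ReLU and sigmoid). All names and weight values are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def pair_features(group_i, group_j):
    """Build a P-channel feature from two feature vector groups.

    Assumption: channel p is the concatenation of the p-th vectors
    of the two groups, which come from the same SE module.
    """
    return [list(f_i) + list(f_j) for f_i, f_j in zip(group_i, group_j)]


def pair_weights(f_ij, w1, w2):
    """One weight per feature vector pair from an SE-style module."""
    # Squeeze: average each of the P channels to a single value.
    squeezed = [sum(channel) / len(channel) for channel in f_ij]
    # Excitation: fully connected + ReLU, then fully connected + sigmoid.
    hidden = [max(0.0, h) for h in matvec(w1, squeezed)]
    return [1.0 / (1.0 + math.exp(-z)) for z in matvec(w2, hidden)]


# Hypothetical groups with P = 2 modules and 2-dimensional vectors.
F_i = [[1.0, 2.0], [3.0, 4.0]]
F_j = [[5.0, 6.0], [7.0, 8.0]]
F_ij = pair_features(F_i, F_j)
print(pair_weights(F_ij, w1=[[1.0, -1.0]], w2=[[1.0], [2.0]]))
```

The sigmoid keeps each pair weight between 0 and 1, so pairs whose module is more informative for the comparison can be emphasized in the later similarity calculation.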

[0102] The detection unit 540 performs image detection on the two images to be detected based at least on the acquired feature vector pairs and their weights, and obtains the similarity between the two images to be detected.

[0103] The image detection apparatus described above according to embodiments of the present invention can utilize the channel weights of each clustering module and the weights of feature vector pairs during the clustering process to detect the image to be detected. This allows for targeted focus on image features with higher weights for similarity comparison, improving the accuracy of image similarity detection. In these embodiments, the neural network used is trained considering the channel weights of each clustering module and the weights of feature vector pairs during the clustering process. This effectively avoids the problem of low quantity and quality of abnormal samples during training, improving the training efficiency and detection performance of the neural network.

[0104] An image detection apparatus according to embodiments of the present invention is described below with reference to Figure 6. Figure 6 shows a block diagram of an image detection apparatus 600 according to an embodiment of the present invention. As shown in Figure 6, the apparatus 600 can be a computer or a server.

[0105] As shown in Figure 6, the image detection device 600 includes one or more processors 610 and a memory 620. In addition, the image detection device 600 may also include input devices, output devices (not shown), etc., and these components can be interconnected via a bus system and/or other forms of connection mechanisms. It should be noted that the components and structure of the image detection device 600 shown in Figure 6 are merely exemplary and not limiting. The image detection device 600 may also have other components and structures as needed.

[0106] The processor 610 may be a central processing unit (CPU) or other processing unit with data processing and / or instruction execution capabilities, and may utilize computer program instructions stored in the memory 620 to perform desired functions, including: acquiring two images to be detected and acquiring feature maps corresponding to the two images to be detected respectively; performing clustering processing on each feature map to be detected using multiple clustering modules, acquiring channel weights of each clustering module and acquiring feature vectors based on the channel weights, and forming feature vector groups using the feature vectors of each clustering module; acquiring feature vector pairs composed of feature vectors from the same clustering module in the two feature vector groups corresponding to the two feature maps to be detected respectively, and acquiring the weights of each feature vector pair; performing image detection on the two images to be detected based at least on the acquired feature vector pairs and their weights, and acquiring the similarity between the two images to be detected.

[0107] The memory 620 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and / or non-volatile memory. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 610 may execute the program instructions to implement the functions of the image detection apparatus of the embodiments of the present invention described above, and / or other desired functions, and / or to execute the image detection method according to the embodiments of the present invention. Various application programs and various data may also be stored in the computer-readable storage medium.

[0108] The following describes a computer-readable storage medium according to embodiments of the present invention, having stored thereon computer program instructions, wherein the computer program instructions, when executed by a processor, perform the following steps: acquiring two images to be detected, and acquiring feature maps to be detected corresponding to the two images to be detected respectively; performing clustering processing on each feature map to be detected using multiple clustering modules, acquiring channel weights of each clustering module and acquiring feature vectors based on the channel weights, and forming feature vector groups using the feature vectors of each clustering module; acquiring feature vector pairs composed of feature vectors from the same clustering module in the two feature vector groups corresponding to the two feature maps to be detected respectively, and acquiring the weights of each feature vector pair; performing image detection on the two images to be detected based at least on the acquired feature vector pairs and their weights, and acquiring the similarity between the two images to be detected.

[0109] An image detection apparatus according to embodiments of the present invention is described below with reference to Figure 7. Figure 7 shows a block diagram of an image detection apparatus 700 according to an embodiment of the present invention. As shown in Figure 7, the image detection apparatus 700 includes an acquisition unit 710 and a processing unit 720. Besides these units, the image detection apparatus 700 may also include other components; however, since these components are not relevant to the content of this embodiment, their illustrations and descriptions are omitted here. Furthermore, since the specific details of the operations performed by the image detection apparatus 700 according to this embodiment are the same as those described above with reference to Figure 4, repeated descriptions of the same details are omitted here.

[0110] The acquisition unit 710 of the image detection apparatus 700 shown in Figure 7 acquires two images to be detected.

[0111] In this embodiment of the invention, the two images to be detected can be two static images, video frames from two videos, or a combination of a static image and a video frame, etc., and there are no limitations on this. The images to be detected can come from the output or intermediate detection result of any intermediate layer in one or more neural networks, or they can come from the original input images to be detected, such as two-dimensional images acquired using image acquisition devices such as cameras or video cameras, or two-dimensional frame images cropped from videos.

[0112] The processing unit 720 performs image detection on the two images to be detected using a neural network to obtain the similarity between the two images. The neural network is trained as follows: it acquires training feature maps corresponding to at least two training images; it performs clustering processing on each training feature map using multiple clustering modules, obtains the channel weights of each clustering module, and acquires training feature vectors based on the channel weights; it then uses the training feature vectors from each clustering module to form a training feature vector group; it acquires training feature vector pairs corresponding to the two training feature vector groups of any two training feature maps, each consisting of training feature vectors from the same clustering module, and acquires the weights of each training feature vector pair; it predicts the similarity between the two training images corresponding to the two training feature vector groups based at least on the acquired training feature vector pairs and their weights, and adjusts the parameters of the neural network based on the prediction results.

[0113] The processing unit 720 can perform similarity detection on the images to be detected using a neural network, thereby obtaining the detection result. The training method of the neural network here is similar to the process shown in Figure 1; that is, at least two labeled training images can be used to perform similarity prediction following the process shown in Figure 1, and the parameters of the neural network are adjusted based on the results. In this embodiment of the invention, the neural network is trained by iteratively updating its parameters using a large number of training images, with the aim of minimizing the difference between the detection results and the labeled true results. For specific operations, refer to the details described above with reference to Figure 1, which will not be repeated here.

[0114] In a training process similar to the steps shown in Figure 1, the neural network can be further trained as follows: at least based on the acquired training feature vector pairs and their weights, the distance between each training feature vector pair is calculated; the neural network is trained based on the distance between each training feature vector pair, adjusting the parameters of the neural network. Optionally, training the neural network based on the distance between each training feature vector pair and adjusting the parameters of the neural network may further include: training the neural network based on the category of each training feature vector pair, so as to minimize the distance between training feature vector pairs of the same category and maximize the distance between training feature vector pairs of different categories.

[0115] The image detection apparatus according to embodiments of the present invention can utilize the channel weights of each clustering module and the weights of feature vector pairs during the clustering process to detect the image to be detected, thereby enabling targeted focus on image features with higher weights for similarity comparison and improving the accuracy of image similarity detection.

[0116] Furthermore, the image detection device described above according to the embodiments of the present invention uses a neural network that is trained by considering the channel weights of each clustering module and the weights of feature vector pairs during the clustering process. This effectively avoids the problem of low quantity and quality of abnormal samples used during training, and improves the training efficiency and detection effect of the neural network.

[0117] An image detection apparatus according to embodiments of the present invention is described below with reference to Figure 8. Figure 8 shows a block diagram of an image detection apparatus 800 according to an embodiment of the present invention. As shown in Figure 8, the apparatus 800 can be a computer or a server.

[0118] As shown in Figure 8, the image detection device 800 includes one or more processors 810 and a memory 820. In addition, the image detection device 800 may also include input devices, output devices (not shown), etc., and these components can be interconnected via a bus system and/or other forms of connection mechanisms. It should be noted that the components and structure of the image detection device 800 shown in Figure 8 are merely exemplary and not limiting. The image detection device 800 may also have other components and structures as needed.

[0119] The processor 810 may be a central processing unit (CPU) or other processing unit with data processing and / or instruction execution capabilities, and may utilize computer program instructions stored in memory 820 to perform desired functions, including: acquiring two images to be detected; performing image detection on the two images to be detected using a neural network, and obtaining the similarity between the two images to be detected; wherein the neural network is trained in the following manner: acquiring training feature maps corresponding to at least two training images respectively; performing clustering processing on each training feature map using multiple clustering modules, obtaining the channel weights of each clustering module, and obtaining training feature vectors according to the channel weights, and forming training feature vector groups using the training feature vectors of each clustering module; acquiring training feature vector pairs consisting of training feature vectors from the same clustering module for each of the two training feature vector groups corresponding to any two training feature maps respectively, and obtaining the weights of each training feature vector pair; predicting the similarity between the two training images corresponding to the two training feature vector groups respectively based at least on the acquired training feature vector pairs and their weights, and adjusting the parameters of the neural network according to the prediction results.

[0120] The memory 820 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and / or non-volatile memory. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 810 may execute the program instructions to implement the functions of the image detection apparatus of the embodiments of the present invention described above, and / or other desired functions, and / or to execute the image detection method according to the embodiments of the present invention. Various application programs and various data may also be stored in the computer-readable storage medium.

[0121] The following describes a computer-readable storage medium according to an embodiment of the present invention, which stores computer program instructions, wherein the computer program instructions, when executed by a processor, perform the following steps: acquiring two images to be detected; performing image detection on the two images to be detected using a neural network, and acquiring the similarity between the two images to be detected; wherein the neural network is trained in the following manner: acquiring training feature maps corresponding to at least two training images respectively; performing clustering processing on each training feature map using multiple clustering modules, acquiring the channel weights of each clustering module, and acquiring training feature vectors according to the channel weights, and forming training feature vector groups using the training feature vectors of each clustering module; acquiring training feature vector pairs consisting of training feature vectors from the same clustering module for each of the two training feature vector groups corresponding to any two training feature maps respectively, and acquiring the weights of each training feature vector pair; predicting the similarity between the two training images corresponding to the two training feature vector groups respectively, based at least on the acquired training feature vector pairs and the weights of the training feature vector pairs, and adjusting the parameters of the neural network according to the prediction results.

[0122] This invention also provides a neural network training method, specifically including: acquiring training feature maps corresponding to at least two training images respectively; performing clustering processing on each training feature map using multiple clustering modules, acquiring the channel weights of each clustering module, and acquiring training feature vectors based on the channel weights, and forming training feature vector groups using the training feature vectors of each clustering module; acquiring training feature vector pairs consisting of training feature vectors from the same clustering module for each of the two training feature vector groups corresponding to any two training feature maps respectively, and acquiring the weights of each training feature vector pair; predicting the similarity between the two training images corresponding to the two training feature vector groups based at least on the acquired training feature vector pairs and their weights, and adjusting the parameters of the neural network based on the prediction results.

[0123] This invention also provides a neural network training apparatus, including a processor and a memory. The memory stores computer program instructions, wherein when the processor executes the computer program instructions, the processor performs the following steps: acquiring training feature maps corresponding to at least two training images; performing clustering processing on each training feature map using multiple clustering modules, acquiring channel weights for each clustering module, and acquiring training feature vectors based on the channel weights, forming training feature vector groups using the training feature vectors from each clustering module; acquiring training feature vector pairs, each consisting of training feature vectors from the same clustering module, corresponding to two training feature vector groups for any two training feature maps, and acquiring the weights of each training feature vector pair; predicting the similarity between two training images corresponding to the two training feature vector groups, at least based on the acquired training feature vector pairs and their weights, and adjusting the parameters of the neural network based on the prediction results.

[0124] This invention also provides a computer-readable storage medium storing computer program instructions, wherein the computer program instructions, when executed by a processor, perform the following steps: acquiring training feature maps respectively corresponding to at least two training images; performing clustering processing on each training feature map using multiple clustering modules, acquiring the channel weights of each clustering module, acquiring training feature vectors based on the channel weights, and forming a training feature vector group from the training feature vectors of the clustering modules; for the two training feature vector groups corresponding to any two training feature maps, acquiring training feature vector pairs, each consisting of the training feature vectors from the same clustering module, and acquiring the weight of each training feature vector pair; and predicting the similarity between the two training images corresponding to the two training feature vector groups based at least on the acquired training feature vector pairs and their weights, and adjusting the parameters of the neural network based on the prediction results.
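The pairing and weighting steps shared by the three embodiments above can be sketched as follows. The claims describe the processed feature of a pair as either a distance vector or an ordered concatenation of the two vectors; both are shown. How the pair weights are produced is not detailed in this passage, so the sketch substitutes a hypothetical linear score per pair followed by a softmax, and it maps the weighted distance to a similarity with `1 / (1 + d)`. Both choices, and all function names, are assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Normalize raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pair_features(group_a, group_b, mode="distance"):
    """Pair vectors that come from the same clustering module (same index).

    mode="distance": element-wise absolute difference of the two vectors.
    mode="concat":   ordered concatenation (vector from A, then from B).
    """
    if len(group_a) != len(group_b):
        raise ValueError("both groups need one vector per clustering module")
    features = []
    for vec_a, vec_b in zip(group_a, group_b):
        if mode == "distance":
            features.append([abs(x - y) for x, y in zip(vec_a, vec_b)])
        elif mode == "concat":
            features.append(list(vec_a) + list(vec_b))
        else:
            raise ValueError("mode must be 'distance' or 'concat'")
    return features


def pair_weights(features, projection):
    """Assumed stand-in for the weighting module: linear score, then softmax."""
    scores = [sum(p * f for p, f in zip(projection, feat)) for feat in features]
    return softmax(scores)


def weighted_distance(group_a, group_b, projection):
    """Weighted sum of the Euclidean norms of the per-pair distance vectors."""
    features = pair_features(group_a, group_b, "distance")
    weights = pair_weights(features, projection)
    norms = [math.sqrt(sum(f * f for f in feat)) for feat in features]
    return sum(w * n for w, n in zip(weights, norms))


def similarity(group_a, group_b, projection):
    """Map a weighted distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + weighted_distance(group_a, group_b, projection))


if __name__ == "__main__":
    a = [[1.0, 0.0], [0.0, 1.0]]
    b = [[1.0, 0.0], [3.0, 5.0]]
    print(similarity(a, a, [0.0, 0.0]), similarity(a, b, [0.0, 0.0]))
```

In a trained network the values in `projection` would be among the parameters adjusted from the prediction results; here they are fixed inputs so the arithmetic stays easy to follow.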

[0125] Of course, the specific embodiments described above are merely examples and not limitations. Those skilled in the art can combine and integrate some steps and devices from the various embodiments described separately above to achieve the effects of the present invention. Such combined and integrated embodiments are also included in the present invention, but will not be described one by one here.

[0126] Note that the advantages, benefits, and effects mentioned in this invention are merely examples and not limitations, and should not be considered as essential features of every embodiment of the invention. Furthermore, the specific details described above are for illustrative purposes and ease of understanding only, and are not intended to limit the invention. The invention is not required to be implemented using these specific details.

[0127] The block diagrams of components, apparatuses, devices, and systems involved in this invention are merely illustrative examples and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown in the block diagrams. As those skilled in the art will recognize, these components, apparatuses, devices, and systems can be connected, arranged, and configured in any manner. Words such as “comprising,” “including,” and “having” are open-ended terms meaning “including but not limited to,” and are used interchangeably with that phrase. The terms “or” and “and” as used herein mean “and / or,” and are used interchangeably with it unless the context clearly indicates otherwise. The term “such as” as used herein means “such as but not limited to,” and is used interchangeably with it.

[0128] The flowcharts and method descriptions in this invention are merely illustrative examples and are not intended to require or imply that the steps of the various embodiments must be performed in the given order. As those skilled in the art will recognize, the steps in the above embodiments can be performed in any order. Words such as "then" and "next" are not intended to limit the order of steps; these words are only used to guide the reader through the description of these methods. Furthermore, any reference to an element in the singular, such as the use of the articles "a," "an," or "the," is not to be construed as limiting that element to the singular.

[0129] Furthermore, the steps and apparatus in the various embodiments herein are not limited to any one embodiment. In fact, new embodiments can be conceived by combining relevant steps and apparatus in the various embodiments herein with the concepts of the present invention, and these new embodiments are also included within the scope of the present invention.

[0130] Each operation described above can be performed by any suitable means capable of performing the corresponding function. Such means may include various hardware and / or software components and / or modules, including but not limited to circuits, application-specific integrated circuits (ASICs), or processors.

[0131] The various exemplified logic blocks, modules, and circuits described herein can be implemented or performed using a general-purpose processor, digital signal processor (DSP), ASIC, field-programmable gate array (FPGA) or other programmable logic device (PLD), discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. The general-purpose processor may be a microprocessor, but alternatively, it may be any commercially available processor, controller, microcontroller, or state machine. The processor may also be implemented as a combination of computing devices, such as a combination of a DSP and a microprocessor, multiple microprocessors, one or more microprocessors cooperating with a DSP core, or any other such configuration.

[0132] The steps of the methods or algorithms described in this invention can be embodied directly in hardware, in a software module executed by a processor, or in a combination of both. The software module can reside in any form of tangible storage medium. Some examples of usable storage media include random access memory (RAM), read-only memory (ROM), flash memory, EPROM, EEPROM, registers, hard disks, removable disks, and CD-ROMs. The storage medium can be coupled to the processor so that the processor can read information from and write information to the storage medium. Alternatively, the storage medium can be integral with the processor. The software module can be a single instruction or many instructions, and can be distributed across several different code segments, different programs, and multiple storage media.

[0133] The method of this invention includes one or more actions for implementing the method. The method steps and / or actions may be interchanged with one another without departing from the scope of the claims. In other words, unless a specific order of actions is specified, the order and / or use of specific actions may be modified without departing from the scope of the claims.

[0134] The described functionality can be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functionality can be stored as one or more instructions on a tangible computer-readable medium. The storage medium can be any available tangible medium that can be accessed by a computer. By way of example, and not limitation, such a computer-readable medium can include RAM, ROM, EEPROM, CD-ROM or other optical disc storage, magnetic disk storage or other magnetic storage devices, or any other tangible medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. As used herein, disk and disc include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc.

[0135] Therefore, a computer program product can perform the operations described herein. For example, such a computer program product can be a computer-readable tangible medium having instructions tangibly stored (and / or encoded) thereon, which can be executed by one or more processors to perform the operations described herein. The computer program product may include packaging materials.

[0136] Software or instructions can also be transmitted via a transmission medium. For example, software can be transmitted from a website, server, or other remote source using transmission media such as coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, or microwave.

[0137] Furthermore, modules and / or other suitable means for carrying out the methods and techniques described herein can be downloaded and / or obtained by user terminals and / or base stations as appropriate. For example, such a device can be coupled to a server to facilitate the transmission of means for carrying out the methods described herein. Alternatively, the various methods described herein can be provided via storage components (e.g., RAM, ROM, physical storage media such as CDs or floppy disks) so that user terminals and / or base stations can obtain the various methods when coupled to the device or when providing storage components to the device. Furthermore, any other suitable techniques for providing the methods and techniques described herein to the device can be utilized.

[0138] Other examples and implementations are within the scope and spirit of this invention and the appended claims. For example, due to the nature of software, the functions described above can be implemented using software executed by a processor, hardware, firmware, hardwiring, or any combination thereof. Features implementing the functions can also be physically located in various places, including being distributed so that parts of the functions are implemented at different physical locations. Moreover, as used herein, including in the claims, the "or" used in a list of items beginning with "at least one" indicates a disjunctive list, such that a list of, for example, "at least one of A, B, or C" means A or B or C, or AB or AC or BC, or ABC (i.e., A and B and C). Furthermore, the word "exemplary" does not mean that the described examples are preferred or better than other examples.

[0139] Various changes, substitutions, and modifications can be made to the technology described herein without departing from the teachings defined by the appended claims. Furthermore, the scope of the claims is not limited to the specific aspects of the processes, machines, manufacturing processes, events, means, methods, and actions described above. Currently existing or later-developed processes, machines, manufacturing processes, events, means, methods, or actions that perform substantially the same function or achieve substantially the same result as the corresponding aspects described herein can be utilized. Therefore, the appended claims include such processes, machines, manufacturing processes, events, means, methods, or actions within their scope.

[0140] The above description of aspects of the invention is provided to enable any person skilled in the art to make or use the invention. Various modifications to these aspects will be readily apparent to those skilled in the art, and the general principles defined herein can be applied to other aspects without departing from the scope of the invention. Therefore, the invention is not intended to be limited to the aspects shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

[0141] The above description has been given for illustrative and descriptive purposes. Furthermore, this description is not intended to limit the embodiments of the invention to the forms described herein. Although numerous exemplary aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, alterations, additions, and sub-combinations therein.

Claims

1. An image detection method, comprising: acquiring two images to be detected, and acquiring the feature maps to be detected respectively corresponding to the two images; performing clustering processing on each feature map to be detected using multiple parallel clustering modules, acquiring the channel weights of each clustering module, acquiring feature vectors based on the channel weights, and forming a feature vector group from the feature vectors of the clustering modules; acquiring feature vector pairs, each consisting of the feature vectors from the same clustering module in the two feature vector groups corresponding to the two feature maps to be detected, and acquiring the weight of each feature vector pair; and performing image detection on the two images to be detected based at least on the acquired feature vector pairs and their weights, to obtain the similarity between the two images to be detected; wherein acquiring the feature vector pairs and acquiring the weight of each feature vector pair comprises: processing the two feature vectors in each feature vector pair to obtain a processed feature, the processed feature being either the distance vector between the two feature vectors in the pair or an ordered concatenation of the two feature vectors; concatenating the processed features channel by channel to obtain multi-channel features; and inputting the multi-channel features into the clustering module for processing to obtain the weight of each feature vector pair.

2. The method of claim 1, wherein acquiring the feature maps to be detected respectively corresponding to the two images to be detected comprises: extracting information from the two images to be detected using a convolutional neural network, and outputting the corresponding feature maps to be detected.

3. The method of claim 1, wherein acquiring the feature vector pairs consisting of feature vectors from the same clustering module in the two feature vector groups, and acquiring the weight of each feature vector pair, further comprises: obtaining weighted feature vector pairs based on the feature vector pairs and the weight of each feature vector pair.

4. The method of claim 3, wherein performing image detection on the two images to be detected based at least on the acquired feature vector pairs and their weights, to obtain the similarity between the two images to be detected, further comprises: performing image detection on the two images to be detected based on the weighted feature vector pairs, to obtain the similarity between the two images.

5. An image detection method, comprising: acquiring two images to be detected; and performing image detection on the two images to be detected using a neural network, to obtain the similarity between the two images to be detected; wherein the neural network is trained in the following manner: acquiring training feature maps respectively corresponding to at least two training images; performing clustering processing on each training feature map using multiple parallel clustering modules, acquiring the channel weights of each clustering module, acquiring training feature vectors based on the channel weights, and forming a training feature vector group from the training feature vectors of the clustering modules; for the two training feature vector groups corresponding to any two training feature maps, acquiring training feature vector pairs, each consisting of the training feature vectors from the same clustering module, and acquiring the weight of each training feature vector pair; and predicting the similarity between the two training images corresponding to the two training feature vector groups based at least on the acquired training feature vector pairs and their weights, and adjusting the parameters of the neural network according to the prediction results; wherein acquiring the training feature vector pairs and acquiring the weight of each training feature vector pair comprises: processing the two training feature vectors in each training feature vector pair to obtain a processed training feature, the processed training feature being either the distance vector between the two training feature vectors in the pair or an ordered concatenation of the two training feature vectors; concatenating the processed training features channel by channel to obtain multi-channel training features; and inputting the multi-channel training features into the clustering module for processing to obtain the weight of each training feature vector pair.

6. The method of claim 5, wherein the neural network is further trained in the following manner: calculating the distance between each training feature vector pair based at least on the acquired training feature vector pairs and their weights; and training the neural network based on the distance between each training feature vector pair, and adjusting the parameters of the neural network.

7. The method of claim 6, wherein training the neural network based on the distance between each training feature vector pair and adjusting the parameters of the neural network comprises: training the neural network according to the category of each training feature vector pair, so as to minimize the distance between training feature vector pairs of the same category and maximize the distance between training feature vector pairs of different categories.

8. An image detection apparatus, comprising: a feature map acquisition unit configured to acquire two images to be detected and to acquire the feature maps to be detected respectively corresponding to the two images; a clustering unit configured to perform clustering processing on each feature map to be detected using multiple parallel clustering modules, acquire the channel weights of each clustering module, acquire feature vectors based on the channel weights, and form a feature vector group from the feature vectors of the clustering modules; a feature vector pair acquisition unit configured to acquire feature vector pairs, each consisting of the feature vectors from the same clustering module in the two feature vector groups corresponding to the two feature maps to be detected, and to acquire the weight of each feature vector pair; and a detection unit configured to perform image detection on the two images to be detected based at least on the acquired feature vector pairs and their weights, to obtain the similarity between the two images to be detected; wherein the feature vector pair acquisition unit processes the two feature vectors in each feature vector pair to obtain a processed feature, the processed feature being either the distance vector between the two feature vectors in the pair or an ordered concatenation of the two feature vectors; concatenates the processed features channel by channel to obtain multi-channel features; and inputs the multi-channel features into the clustering module for processing to obtain the weight of each feature vector pair.

9. An image detection device, comprising: a processor; and a memory storing computer program instructions, wherein, when the computer program instructions are executed by the processor, the processor performs the following steps: acquiring two images to be detected, and acquiring the feature maps to be detected respectively corresponding to the two images; performing clustering processing on each feature map to be detected using multiple parallel clustering modules, acquiring the channel weights of each clustering module, acquiring feature vectors based on the channel weights, and forming a feature vector group from the feature vectors of the clustering modules; acquiring feature vector pairs, each consisting of the feature vectors from the same clustering module in the two feature vector groups corresponding to the two feature maps to be detected, and acquiring the weight of each feature vector pair; and performing image detection on the two images to be detected based at least on the acquired feature vector pairs and their weights, to obtain the similarity between the two images to be detected; wherein acquiring the feature vector pairs and acquiring the weight of each feature vector pair comprises: processing the two feature vectors in each feature vector pair to obtain a processed feature, the processed feature being either the distance vector between the two feature vectors in the pair or an ordered concatenation of the two feature vectors; concatenating the processed features channel by channel to obtain multi-channel features; and inputting the multi-channel features into the clustering module for processing to obtain the weight of each feature vector pair.

10. A computer-readable storage medium having computer program instructions stored thereon, wherein, when the computer program instructions are executed by a processor, the following steps are performed: acquiring two images to be detected, and acquiring the feature maps to be detected respectively corresponding to the two images; performing clustering processing on each feature map to be detected using multiple parallel clustering modules, acquiring the channel weights of each clustering module, acquiring feature vectors based on the channel weights, and forming a feature vector group from the feature vectors of the clustering modules; acquiring feature vector pairs, each consisting of the feature vectors from the same clustering module in the two feature vector groups corresponding to the two feature maps to be detected, and acquiring the weight of each feature vector pair; and performing image detection on the two images to be detected based at least on the acquired feature vector pairs and their weights, to obtain the similarity between the two images to be detected; wherein acquiring the feature vector pairs and acquiring the weight of each feature vector pair comprises: processing the two feature vectors in each feature vector pair to obtain a processed feature, the processed feature being either the distance vector between the two feature vectors in the pair or an ordered concatenation of the two feature vectors; concatenating the processed features channel by channel to obtain multi-channel features; and inputting the multi-channel features into the clustering module for processing to obtain the weight of each feature vector pair.

11. An image detection apparatus, comprising: an acquisition unit configured to acquire two images to be detected; and a processing unit configured to perform image detection on the two images to be detected using a neural network, to obtain the similarity between the two images to be detected; wherein the neural network used by the processing unit is trained in the following manner: acquiring training feature maps respectively corresponding to at least two training images; performing clustering processing on each training feature map using multiple parallel clustering modules, acquiring the channel weights of each clustering module, acquiring training feature vectors based on the channel weights, and forming a training feature vector group from the training feature vectors of the clustering modules; for the two training feature vector groups corresponding to any two training feature maps, acquiring training feature vector pairs, each consisting of the training feature vectors from the same clustering module, and acquiring the weight of each training feature vector pair; and predicting the similarity between the two training images corresponding to the two training feature vector groups based at least on the acquired training feature vector pairs and their weights, and adjusting the parameters of the neural network according to the prediction results; wherein acquiring the training feature vector pairs and acquiring the weight of each training feature vector pair comprises: processing the two training feature vectors in each training feature vector pair to obtain a processed training feature, the processed training feature being either the distance vector between the two training feature vectors in the pair or an ordered concatenation of the two training feature vectors; concatenating the processed training features channel by channel to obtain multi-channel training features; and inputting the multi-channel training features into the clustering module for processing to obtain the weight of each training feature vector pair.
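The training objective recited in claim 7 (small distances for pairs of the same category, large distances for pairs of different categories) can be illustrated with a margin-based contrastive loss. The claims do not name a loss function, so this specific form, the `margin` parameter, and the function names `contrastive_loss` and `batch_loss` are assumptions chosen for this sketch; it is one common way to express such an objective, not necessarily the one used by the applicant.

```python
def contrastive_loss(distance, same_category, margin=1.0):
    """Loss for one pair, given its (weighted) distance.

    Same category:      penalize any distance, pulling the pair together.
    Different category: penalize only distances below the margin, pushing
                        the pair apart until it is at least `margin` away.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    if same_category:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


def batch_loss(samples, margin=1.0):
    """Mean contrastive loss over (distance, same_category) tuples."""
    samples = list(samples)
    if not samples:
        raise ValueError("at least one sample is required")
    total = sum(contrastive_loss(d, same, margin) for d, same in samples)
    return total / len(samples)


if __name__ == "__main__":
    # A same-category pair at distance 0.5 and a different-category pair
    # that is already farther apart than the margin.
    print(batch_loss([(0.5, True), (2.0, False)]))
```

Under this sketch, lowering the distance of a same-category pair always reduces the loss, while raising the distance of a different-category pair reduces the loss only until the margin is reached, which keeps already well-separated pairs from dominating the parameter updates.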