Image feature extraction method and device, image processing system and storage medium
By integrating global, local, and kernel attention mechanisms in image feature extraction and determining attention weight information based on the query matrix and key matrix, the problem of image feature extraction accuracy is solved, achieving more efficient feature representation and improved accuracy.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- JINGDONG TECH HLDG CO LTD
- Filing Date
- 2023-07-11
- Publication Date
- 2026-06-16
AI Technical Summary
How to improve the accuracy of image feature extraction.
For each input feature, the query vector, key vector, and value vector are determined based on the query matrix and key matrix. The first and second attention weights are determined by combining the features in the global and local receptive fields, and the features are updated based on these weights. The global, local, and kernel attention mechanisms are fused to obtain a more accurate feature representation.
It improves the accuracy of image feature extraction, better captures local relationships and global contextual information in images, and achieves more efficient feature representation.
Figure CN116797797B_ABST (abstract drawing)
Abstract
Description
Technical Field
[0001] This disclosure relates to the fields of image processing and computer technology, and in particular to a method, apparatus, and computer-readable storage medium for image feature extraction.
Background Technology
[0002] With the continuous development of computer technology, computer vision technologies such as image recognition and object detection are widely used in various fields.
[0003] In the field of computer vision, backbone network architecture design is an important research direction. A primary function of backbone networks is image feature extraction. The accuracy of image feature extraction directly impacts the performance of subsequent downstream tasks such as object detection and image recognition.
Summary of the Invention
[0004] One of the technical problems this disclosure aims to solve is: how to improve the accuracy of image feature extraction.
[0005] According to some embodiments of this disclosure, an image feature extraction method is provided, comprising: for each input feature of an image, determining a query vector, a key vector, and a value vector for each input feature based on a query matrix and a key matrix; for each input feature, determining first attention weight information corresponding to the input feature based on the query vector of the input feature and the key vector of the feature in the global receptive field of the input feature; for each input feature, determining second attention weight information corresponding to the input feature based on the value vector of the feature in the local receptive field of the input feature; and updating each input feature based on the first attention weight information and the second attention weight information corresponding to each input feature to obtain a feature representation of the image.
[0006] In some embodiments, determining the first attention weight information corresponding to each input feature based on the query vector of the feature and the key vector of the feature in the global receptive field includes: dividing the query vector and key vector of each input feature into query sub-vectors and key sub-vectors of each input feature in each attention head according to the number of attention heads; determining the sub-attention weight information corresponding to the input feature in each attention head based on the query sub-vector of the input feature in each attention head and the key vector of the feature in the global receptive field; and determining the first attention weight information corresponding to each input feature based on the sub-attention weight information corresponding to each input feature in each attention head.
[0007] In some embodiments, the sub-attention weight information corresponding to each input feature in each attention head includes: the sub-attention weight of each feature in the global receptive field of each attention head relative to each input feature; the first attention weight information corresponding to each input feature includes: the first attention weight of each feature in the global receptive field relative to each input feature; and determining the first attention weight information of each input feature based on the sub-attention weights corresponding to each input feature in each attention head includes: for each feature in the global receptive field and each input feature, concatenating the sub-attention weights of the feature in the global receptive field of each attention head relative to the input feature to determine the first attention weight of the feature in the global receptive field relative to the input feature.
[0008] In some embodiments, determining the sub-attention weight information corresponding to the input feature in each attention head, based on the query sub-vector of the input feature in each attention head and the key sub-vector of the feature in the global receptive field, includes: in each attention head, determining the dot product of the query sub-vector of the input feature in the attention head and the key sub-vector of each feature in the global receptive field; normalizing each dot product to obtain the sub-attention weight of each feature in the global receptive field of the attention head relative to each input feature, as the sub-attention weight information corresponding to the input feature in the attention head.
[0009] In some embodiments, determining the second attention weight information corresponding to the input feature based on the value vector of the feature in the local receptive field of the input feature includes: determining a first context feature by using depthwise convolution based on the value vector of the feature in the local receptive field of the input feature; determining the local attention information corresponding to the input feature based on the first context feature; and determining the second attention weight information corresponding to the input feature based on the local attention information corresponding to the input feature.
[0010] In some embodiments, determining the second attention weight information corresponding to the input feature based on the value vector of the feature in the local receptive field of the input feature includes: determining the second context feature by using global average pooling based on the value vector of the feature in the local receptive field of the input feature; determining the kernel attention information corresponding to the input feature based on the second context feature; and determining the second attention weight information corresponding to the input feature based on the kernel attention information corresponding to the input feature.
[0011] In some embodiments, determining the second attention weight information corresponding to the input feature based on the value vector of the feature in the local receptive field of the input feature includes: performing a depthwise convolution based on the value vector of the feature in the local receptive field of the input feature to determine a first context feature; determining the local attention information corresponding to the input feature based on the first context feature; performing global average pooling based on the value vector of the feature in the local receptive field of the input feature to determine a second context feature; determining the kernel attention information corresponding to the input feature based on the second context feature; and determining the second attention weight information corresponding to the input feature based on the local attention information and the kernel attention information.
[0012] In some embodiments, determining the first context feature by performing a depthwise convolution based on the value vector of the feature in the local receptive field of the input feature includes: dividing the value vector of the feature in the local receptive field of the input feature into value sub-vectors of the feature in the local receptive field of each attention head according to the number of attention heads; and inputting the value sub-vectors of the feature in the local receptive field of each attention head into the depthwise convolution module and activation function module of each attention head to obtain the first context sub-feature in each attention head, wherein the first context feature includes multiple first context sub-features.
[0013] In some embodiments, determining the local attention information corresponding to the input feature based on the first context feature includes: inputting the first context sub-feature in each attention head into a series of first convolutional modules in each attention head to obtain the local sub-attention information corresponding to the input feature in each attention head; and concatenating the local sub-attention information corresponding to the input feature in each attention head to obtain the local attention information corresponding to the input feature.
[0014] In some embodiments, the kernel attention information includes spatial attention weight information and channel attention weight information. Determining the kernel attention information corresponding to the input feature based on the second context feature includes: inputting the second context feature into a series of second convolutional modules to obtain the spatial attention weight information corresponding to the input feature; and inputting the second context feature into a third convolutional module to obtain the channel attention weight information corresponding to the input feature.
[0015] In some embodiments, updating each input feature according to the first attention weight information and the second attention weight information corresponding to each input feature includes: updating the value vector of each input feature according to the channel attention weight information corresponding to each input feature to obtain the updated value vector of each input feature; and updating the input feature for each input feature according to the updated value vector of the feature in the global receptive field of the input feature, the updated value vector of the feature in the local receptive field of the input feature, and the first attention weight information and the second attention weight information corresponding to the input feature.
[0016] In some embodiments, the first attention weight information corresponding to each input feature includes: the first attention weight of each feature in the global receptive field relative to each input feature, and the second attention weight information corresponding to each input feature includes: the second attention weight of each feature in the local receptive field relative to each input feature. The features in the global receptive field include the features in the local receptive field. Each input feature is updated using the following method: For each input feature, the first attention weight and the second attention weight of each feature in the local receptive field of the input feature relative to the input feature are weighted and summed to obtain the attention weight of each feature in the local receptive field of the input feature relative to the input feature; the updated value vector of each feature in the local receptive field of the input feature is weighted and summed with the attention weight of each feature in the local receptive field of the input feature relative to the input feature, and then transformed by a transformation function to obtain the first updated value of the input feature; the value vector of each feature in the difference set between the global receptive field and the local receptive field of the input feature is weighted and summed with the first attention weight of each feature in the difference set relative to the input feature, and then transformed by a transformation function to obtain the second updated value of the input feature; the first updated value and the second updated value of each input feature are summed to update each input feature.
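The update rule in the preceding paragraph can be sketched for a single input feature as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the mixing coefficients `mix`, the identity default for `transform`, and the use of scalar (rather than per-channel) weights are all hypothetical simplifications, and plain Python lists are used to avoid dependencies.

```python
def weighted_sum(weights, vectors, dim):
    """Sum of vectors scaled by their weights; returns zeros for an empty set."""
    return [sum(w * v[ch] for w, v in zip(weights, vectors)) for ch in range(dim)]

def update_feature(first_w, second_w, values, updated_values, local_idx,
                   mix=(0.5, 0.5), transform=lambda v: v):
    """Update one input feature from its global and local receptive fields.

    first_w[j]        : first (global) attention weight of feature j
    second_w[j]       : second attention weight, given only for j in local_idx
    values[j]         : value vector of feature j
    updated_values[j] : value vector after the channel-attention update
    mix, transform    : assumed mixing coefficients and transformation function
    """
    dim = len(values[0])
    local = sorted(local_idx)
    # Weighted sum of first and second attention weights inside the local field.
    combined = [mix[0] * first_w[j] + mix[1] * second_w[j] for j in local]
    first_update = transform(
        weighted_sum(combined, [updated_values[j] for j in local], dim))
    # Features in the global field but outside the local field (difference set).
    rest = [j for j in range(len(values)) if j not in local_idx]
    second_update = transform(
        weighted_sum([first_w[j] for j in rest], [values[j] for j in rest], dim))
    return [a + b for a, b in zip(first_update, second_update)]

# Toy data: three one-dimensional features, the first two in the local field.
values = [[2.0], [4.0], [8.0]]
first_w = [0.5, 0.25, 0.25]
second_w = {0: 0.5, 1: 0.5}
new_feature = update_feature(first_w, second_w, values, values, {0, 1})
```

With these toy numbers the local branch contributes 0.5 × 2 + 0.375 × 4 = 2.5 and the difference-set branch contributes 0.25 × 8 = 2.0, so `new_feature` is `[4.5]`.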
[0017] In some embodiments, when the second attention weight information corresponding to each input feature is determined based on the local attention information and kernel attention information corresponding to each input feature, the second attention weight of each feature in the local receptive field relative to each input feature is the weighted sum of the local attention weight of each feature in the local receptive field relative to each input feature and the spatial attention weight of each feature in the local receptive field relative to each input feature.
[0018] In some embodiments, determining the query vector, key vector, and value vector of each input feature based on the query matrix and the key matrix includes: multiplying each input feature by the query matrix to obtain the query vector of each input feature; multiplying each input feature by the key matrix to obtain the key vector of each input feature; and concatenating the query vector and key vector of each input feature to obtain the value vector of each input feature.
[0019] In some embodiments, the method further includes at least one of the following: inputting the feature representation of the image into an object classification model to determine the category of the object in the image; and inputting the feature representation of the image into an object detection model to determine the location of the object in the image.
[0020] According to some embodiments of this disclosure, an image feature extraction apparatus is provided, comprising: a determining module, configured to determine, for each input feature of an image, a query vector, a key vector, and a value vector of each input feature based on a query matrix and a key matrix; a first attention module, configured to determine, for each input feature, first attention weight information corresponding to the input feature based on the query vector of the input feature and the key vector of the feature in the global receptive field of the input feature; a second attention module, configured to determine, for each input feature, second attention weight information corresponding to the input feature based on the value vector of the feature in the local receptive field of the input feature; and an updating module, configured to update each input feature based on the first attention weight information and the second attention weight information corresponding to each input feature to obtain a feature representation of the image.
[0021] According to some other embodiments of the present disclosure, an image feature extraction apparatus is provided, comprising: a processor; and a memory coupled to the processor for storing instructions, which, when executed by the processor, cause the processor to perform an image feature extraction method as described in any of the foregoing embodiments.
[0022] According to further embodiments of the present disclosure, a non-transitory computer-readable storage medium is provided, on which a computer program is stored, wherein the program, when executed by a processor, implements the image feature extraction method of any of the foregoing embodiments.
[0023] According to some other embodiments of this disclosure, an image processing system is provided, including: the image feature extraction apparatus of any of the foregoing embodiments; and a camera device for capturing images.
[0024] In some embodiments, the image processing system further includes a picking device for picking objects based on the category and position of objects in the image output by the image feature extraction device.
[0025] In this disclosed scheme, for each input feature of an image, first attention weight information and second attention weight information are determined based on the features in the global receptive field and the features in the local receptive field, respectively. Then, each input feature is updated based on the first and second attention weight information to obtain the image feature representation. This disclosed scheme integrates different attention mechanisms. The first attention weight information can represent point-to-point global context information, while the second attention weight information can effectively capture local relationships in the input features. By integrating different attention mechanisms, image feature representation can be better achieved, improving the accuracy of image feature extraction.
[0026] Other features and advantages of this disclosure will become clear from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Attached Figure Description
[0027] To more clearly illustrate the technical solutions in the embodiments of this disclosure or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0028] Figure 1 shows a schematic flowchart of an image feature extraction method according to some embodiments of the present disclosure.
[0029] Figure 2 shows a schematic diagram of the structure of an image feature extraction model according to some embodiments of the present disclosure.
[0030] Figure 3 shows a schematic diagram of the structure of an image feature extraction apparatus according to some embodiments of the present disclosure.
[0031] Figure 4 shows a schematic diagram of the structure of an image feature extraction apparatus according to other embodiments of the present disclosure.
[0032] Figure 5 shows a schematic diagram of the structure of an image feature extraction apparatus according to further embodiments of the present disclosure.
[0033] Figure 6 shows a schematic diagram of the structure of an image processing system according to some embodiments of the present disclosure.
Detailed Implementation
[0034] The technical solutions of the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this disclosure, and not all embodiments. The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit this disclosure or its application or use. All other embodiments obtained by those skilled in the art based on the embodiments of this disclosure without creative effort are within the scope of protection of this disclosure.
[0035] This disclosure proposes a method for image feature extraction, which is described below with reference to Figures 1-2.
[0036] Figure 1 is a flowchart of some embodiments of the image feature extraction method of this disclosure. As shown in Figure 1, the method of this embodiment includes steps S102 to S108.
[0037] In step S102, for each input feature of the image, the query vector, key vector, and value vector of each input feature are determined based on the query matrix and key matrix.
[0038] In some embodiments, the image is divided into patches of a preset size. Embedded encoding and positional encoding are performed on each patch. The embedded encoding and positional encoding of each patch are then fused to obtain the input feature matrix of each patch, thereby obtaining the input feature matrix of the image. Each feature in the input feature matrix of the image is used as an input feature.
[0039] For example, $X \in \mathbb{R}^{n \times d}$ represents the input feature matrix, where $n = H \times W$ is the number of input features, and $H$, $W$, and $d$ represent the height, width, and number of channels of the image, respectively.
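As a rough sketch of how the input feature matrix might be built from patches, the snippet below splits a single-channel image into non-overlapping patches, applies a linear embedding, and fuses it with a positional encoding. The fusion by element-wise addition, the embedding matrix `W_embed`, and the positional table `pos` are assumptions made for illustration; the text only states that the two encodings are fused.

```python
def split_patches(image, P):
    """Split an H x W image (list of rows) into flattened P x P patches, row-major."""
    H, W = len(image), len(image[0])
    patches = []
    for r0 in range(0, H - P + 1, P):
        for c0 in range(0, W - P + 1, P):
            patches.append([image[r][c]
                            for r in range(r0, r0 + P)
                            for c in range(c0, c0 + P)])
    return patches

def input_features(image, P, W_embed, pos):
    """Embed each patch linearly and add its positional encoding (assumed fusion)."""
    feats = []
    for idx, patch in enumerate(split_patches(image, P)):
        emb = [sum(patch[r] * W_embed[r][c] for r in range(len(patch)))
               for c in range(len(W_embed[0]))]
        feats.append([e + p for e, p in zip(emb, pos[idx])])
    return feats

# Toy 4 x 4 image, 2 x 2 patches, embedding dimension d = 2.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
W_embed = [[1.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
pos = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
X = input_features(image, 2, W_embed, pos)
```

Each row of `X` then plays the role of one input feature in the steps that follow.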
[0040] In some embodiments, each input feature is multiplied by the query matrix to obtain a query vector for each input feature; each input feature is multiplied by the key matrix to obtain a key vector for each input feature; and the query vector and key vector of each input feature are concatenated to obtain a value vector for each input feature.
[0041] Each input feature can be fed into the image feature extraction model. As shown in Figure 2, the image feature extraction model can include a first fully connected layer and a second fully connected layer. The parameters of the first fully connected layer can form the query matrix, and the parameters of the second fully connected layer can form the key matrix.
[0042] For example, let the $i$-th input feature be denoted as $x_i \in \mathbb{R}^d$, the query matrix as $W_q$, and the key matrix as $W_k$. The query vector is $Q_i = x_i W_q$ and the key vector is $K_i = x_i W_k$, with $Q_i \in \mathbb{R}^d$ and $K_i \in \mathbb{R}^d$. The value vector is $\beta(x_i) = \mathrm{Concat}(Q_i, K_i)$. This method of calculating the value vector expands the dimension without additional computational cost, thus improving efficiency.
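The computation of the three vectors can be illustrated with a small dependency-free sketch. The helper names `matvec` and `qkv` and the toy matrices are hypothetical; the only behavior taken from the text is that the value vector is the concatenation of the query and key vectors, so no third projection matrix is needed.

```python
def matvec(x, W):
    """Row vector x (length d) times matrix W (d x d_out), as plain lists."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]

def qkv(x, W_q, W_k):
    """Return the query, key, and value vectors of one input feature."""
    q = matvec(x, W_q)
    k = matvec(x, W_k)
    value = q + k  # list concatenation: the value vector has length 2 * d
    return q, k, value

# Toy example with d = 2; the numbers are illustrative only.
x_i = [1.0, 2.0]
W_q = [[1.0, 0.0], [0.0, 1.0]]  # identity
W_k = [[0.0, 1.0], [1.0, 0.0]]  # swaps the two channels
Q_i, K_i, V_i = qkv(x_i, W_q, W_k)
```

Here `V_i` has twice the dimension of `x_i` while reusing the two projections that attention already requires.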
[0043] In step S104, for each input feature, the first attention weight information corresponding to the input feature is determined based on the query vector of the input feature and the key vector of the feature in the global receptive field of the input feature.
[0044] As shown in Figure 2, the image feature extraction model may include a global attention module (Global Attention Expert), in which the first attention weight information, i.e., the global attention information, can be determined.
[0045] In some embodiments, the query vector and key vector of each input feature are divided into query sub-vectors and key sub-vectors of each input feature in each attention head according to the number of attention heads; for each input feature, the sub-attention weight information corresponding to the input feature in each attention head is determined based on the query sub-vector of the input feature in each attention head and the key vector of the feature in the global receptive field; and the first attention weight information corresponding to each input feature is determined based on the sub-attention weight information corresponding to each input feature in each attention head.
[0046] The global attention module can employ a multi-head attention mechanism that further decomposes the query vector $Q_i$ and the key vector $K_i$ along the channel dimension into $N_h$ heads. The query sub-vector and key sub-vector of the $h$-th head are denoted $Q_i^h \in \mathbb{R}^{d_h}$ and $K_i^h \in \mathbb{R}^{d_h}$, where $N_h$ is the number of attention heads and $d_h$ is the feature size processed by each attention head.
[0047] In some embodiments, in each attention head, the dot product of the query sub-vector of the input feature in the attention head and the key sub-vector of each feature in the global receptive field is determined; each dot product is normalized to obtain the sub-attention weight of each feature in the global receptive field of the attention head relative to each input feature, which is used as the sub-attention weight information corresponding to the input feature in the attention head.
[0048] For an input feature $x_i$ and a feature $x_j$ in the global receptive field, the sub-attention weight $\alpha_{ij}^h$ in the $h$-th attention head is calculated from the query sub-vector $Q_i^h$ of $x_i$ and the key sub-vector $K_j^h$ of $x_j$ using the dot product and softmax normalization, as shown in the following formula.
[0049] $$\alpha_{ij}^h = \frac{\exp\left(Q_i^h \cdot K_j^h\right)}{\sum_{j' \in R_g(i)} \exp\left(Q_i^h \cdot K_{j'}^h\right)}$$
[0050] Here, $R_g(i)$ represents the receptive field, which in the global attention module is the entire feature matrix of the image, i.e., the global receptive field.
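The per-head dot product followed by softmax normalization over the global receptive field can be sketched as below. The function names are hypothetical, and no scaling factor is applied to the dot products because the text does not mention one; a real model may well include it.

```python
import math

def split_heads(v, n_heads):
    """Split a vector along the channel dimension into n_heads equal sub-vectors."""
    d_h = len(v) // n_heads
    return [v[h * d_h:(h + 1) * d_h] for h in range(n_heads)]

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sub_attention_weights(Q, K, i, n_heads):
    """Return alpha[h][j]: weight of feature j relative to feature i in head h.

    The global receptive field is taken to be every row of K.
    """
    q_heads = split_heads(Q[i], n_heads)
    alpha = []
    for h in range(n_heads):
        scores = []
        for k_j in K:
            k_h = split_heads(k_j, n_heads)[h]
            scores.append(sum(a * b for a, b in zip(q_heads[h], k_h)))
        alpha.append(softmax(scores))
    return alpha

# Three features with d = 4 channels, split into two heads of size 2.
Q = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
K = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0], [0.0, 0.0, 0.0, 0.0]]
alpha_0 = sub_attention_weights(Q, K, i=0, n_heads=2)
```

For feature 0, head 0 attends most strongly to feature 0 and head 1 to feature 1, which shows that different heads can rank the same global features differently.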
[0051] In some embodiments, the sub-attention weight information corresponding to each input feature in each attention head includes: the sub-attention weight of each feature in the global receptive field of each attention head relative to each input feature, and the first attention weight information corresponding to each input feature includes: the first attention weight of each feature in the global receptive field relative to each input feature; for each feature in the global receptive field and each input feature, the sub-attention weights of the feature in the global receptive field of each attention head relative to the input feature are concatenated to determine the first attention weight of the feature in the global receptive field relative to the input feature.
[0052] As shown in the formula above, the sub-attention weight of each feature in the global receptive field relative to each input feature can be calculated in each attention head. The first attention weight $\alpha_{ij}$ is obtained by concatenating the corresponding sub-attention weights $\alpha_{ij}^h$ of the $N_h$ attention heads, as shown in the formula below.
[0053] $$\alpha_{ij} = \mathrm{Concat}\left(\alpha_{ij}^1, \alpha_{ij}^2, \ldots, \alpha_{ij}^{N_h}\right)$$
[0054] Concat(·) is the concatenation operation.
[0055] Based on the above method, the first attention weight (i.e., the global attention weight) of each feature in the global receptive field relative to each input feature can be obtained, and these weights together form the first attention weight matrix. In Figure 2, the first attention weight information for an input feature $x_i$ is obtained after the Softmax operation.
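The concatenation step can be shown in a few lines. In this hedged sketch, `alpha_heads[h][j]` is assumed to hold the sub-attention weight of global feature `j` in head `h` for one fixed input feature, and the numbers are invented for illustration.

```python
def concat_head_weights(alpha_heads):
    """Turn per-head weights alpha_heads[h][j] into one weight vector per feature j.

    Entry j of the result is [alpha^1_ij, ..., alpha^{N_h}_ij], i.e. one weight
    for each head's group of channels.
    """
    n_heads = len(alpha_heads)
    n_feats = len(alpha_heads[0])
    return [[alpha_heads[h][j] for h in range(n_heads)] for j in range(n_feats)]

# Two heads, three features in the global receptive field (illustrative values).
alpha_heads = [[0.5, 0.3, 0.2], [0.1, 0.1, 0.8]]
first_weights = concat_head_weights(alpha_heads)
```

Each feature thus receives a small vector of weights instead of a single scalar, one entry per channel group, which is what makes the resulting attention channel-adaptive.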
[0056] A significant advantage of the global attention module is its global receptive field, which enables global feature interactions across all features and thus facilitates feature learning. Furthermore, it dynamically learns attention weights for each spatial location, so attention learning is highly spatially adaptive. Traditional global attention determines the attention weights $\alpha_{ij}$ solely based on point-to-point contextual information. The method described in the above embodiment, by concatenating the sub-attention weights corresponding to multiple attention heads, forces the learned attention information to be shared among a set of channels, thereby making the attention weights channel-adaptive.
[0057] In step S106, for each input feature, the second attention weight information corresponding to the input feature is determined based on the value vector of the feature in the local receptive field of the input feature.
[0058] Steps S104 and S106 can be executed in parallel.
[0059] In some embodiments, for each input feature, a first context feature is determined by using depthwise convolution based on the value vector of the feature in the local receptive field of the input feature; local attention information corresponding to the input feature is determined based on the first context feature; and second attention weight information corresponding to the input feature is determined based on the local attention information corresponding to the input feature.
[0060] In other embodiments, for each input feature, a second context feature is determined by global average pooling based on the value vector of the feature in the local receptive field of the input feature; the kernel attention information corresponding to the input feature is determined based on the second context feature; and the second attention weight information corresponding to the input feature is determined based on the kernel attention information corresponding to the input feature.
[0061] In some other embodiments, for each input feature, a first context feature is determined by performing a depthwise convolution based on the value vector of the feature in the local receptive field of the input feature; local attention information corresponding to the input feature is determined based on the first context feature; global average pooling is performed based on the value vector of the feature in the local receptive field of the input feature to determine a second context feature; kernel attention information corresponding to the input feature is determined based on the second context feature; and second attention weight information corresponding to the input feature is determined based on the local attention information and kernel attention information corresponding to the input feature.
[0062] As shown in Figure 2, the image feature extraction model may include at least one of a local attention module (Local Attention Expert) and a kernel attention module (Kernel Attention Expert). The Local Attention Expert module can be used to determine the local attention information corresponding to each input feature. The Kernel Attention Expert module can be used to determine the kernel attention information corresponding to each input feature. If only the Local Attention Expert module is applied, the local attention information corresponding to each input feature can be used as the second attention weight information corresponding to that input feature. If only the Kernel Attention Expert module is applied, the kernel attention information corresponding to each input feature can be used as the second attention weight information corresponding to that input feature.
[0063] The local attention module will be introduced first.
[0064] In some embodiments, for each input feature, the value vector of the feature in the local receptive field of the input feature is divided into value sub-vectors of the feature in the local receptive field of each attention head according to the number of attention heads; the value sub-vectors of the feature in the local receptive field of each attention head are input into the depthwise convolution module and activation function module in each attention head to obtain the first context sub-feature in each attention head, wherein the first context feature includes multiple first context sub-features.
[0065] The local attention module also employs a multi-head attention mechanism, but it restricts attention learning to a local window to reduce memory usage and improve computational efficiency. Each attention head can use a K×K depthwise convolution ($\mathrm{Conv}_d$) and the ReLU6 activation function to learn a region-level contextual representation (the first context sub-feature) from all features within a K×K region. For example, for input feature $x_i$, the first context sub-feature obtained in the $h$-th attention head, written here as $c_i^h$, can be determined by the following formula.
[0066] $$c_i^h = \sigma_{r,h}\left(\mathrm{Conv}_{d,h}\left(\{\beta_h(x_j)\}_{j \in R_l(i)};\, w_h\right)\right)$$
[0067] Here, $w_h$ is the kernel of the K×K depthwise convolution $\mathrm{Conv}_{d,h}$, $\sigma_{r,h}$ is the ReLU6 activation function, and $\beta_h(x_j)$ is the value vector of feature $x_j$ in the $h$-th attention head. The receptive field $R_l(i)$ is defined as the K×K local region centered at $x_i$. For example, as shown in Figure 2, a 3×3 depthwise convolution $\mathrm{Conv}_{d,h}$ can be used.
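The region-level context computation for one attention head can be sketched as a depthwise K×K convolution followed by ReLU6, evaluated at a single grid position. The grid layout, the zero padding at the image border, and the function names are assumptions made for this illustration.

```python
def relu6(v):
    """ReLU6 activation: clamp to the range [0, 6]."""
    return min(max(v, 0.0), 6.0)

def depthwise_context(values, kernel, row, col):
    """First context sub-feature at (row, col) for one attention head.

    values[r][c] is the head's value sub-vector (length C) at grid cell (r, c);
    kernel[ch] is the K x K depthwise kernel of channel ch.
    Cells outside the grid count as zero (zero padding is an assumption).
    """
    n_channels = len(kernel)
    K = len(kernel[0])
    half = K // 2
    H, W = len(values), len(values[0])
    out = []
    for ch in range(n_channels):
        acc = 0.0
        for kr in range(K):
            for kc in range(K):
                r, c = row + kr - half, col + kc - half
                if 0 <= r < H and 0 <= c < W:
                    # Depthwise: channel ch only sees channel ch of its neighbors.
                    acc += kernel[ch][kr][kc] * values[r][c][ch]
        out.append(relu6(acc))
    return out

# 3 x 3 grid, two channels per head, all-ones 3 x 3 kernel for each channel.
values = [[[1.0, -1.0] for _ in range(3)] for _ in range(3)]
kernel = [[[1.0] * 3 for _ in range(3)] for _ in range(2)]
center = depthwise_context(values, kernel, 1, 1)
corner = depthwise_context(values, kernel, 0, 0)
```

At the center the first channel sums nine ones and is clipped to 6 by ReLU6, while the negative second channel is clipped to 0; at the corner only four cells fall inside the grid.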
[0068] In some embodiments, the first context sub-feature in each attention head is input into a series of first convolutional modules in each attention head to obtain local sub-attention information corresponding to the input feature in each attention head; the local sub-attention information corresponding to the input feature in each attention head is concatenated to obtain local attention information corresponding to the input feature.
[0069] The multiple consecutive first convolutional modules can be two consecutive 1×1 convolutions (as shown in Figure 2), followed by softmax normalization along the channels, with no activation function. For input feature $x_i$, the local sub-attention information obtained in the $h$-th attention head, written here as $\gamma_i^h$, can be determined using the following formula, where $c_i^h$ denotes the first context sub-feature of $x_i$ in the $h$-th attention head.
[0070] $$\gamma_i^h = \mathrm{Softmax}\left(\mathrm{Conv}^{(2)}_{1\times1,h}\left(\mathrm{Conv}^{(1)}_{1\times1,h}\left(c_i^h\right)\right)\right)$$
[0071] In Figure 2, each of the two 1×1 convolution blocks can be understood as a collective term for the corresponding first convolutional modules in the multiple attention heads. The local sub-attention information of input feature $x_i$ in each attention head is concatenated to obtain the local attention information corresponding to input feature $x_i$.
[0072] $$\gamma_i = \mathrm{Concat}\left(\gamma_i^1, \gamma_i^2, \ldots, \gamma_i^{N_h}\right)$$
[0073] Similar to the global attention module, the dynamically learned local attention weights in the local attention module are also spatially adaptive. Furthermore, compared to the global attention module, which only utilizes point-to-point contextual information, the local attention module mines richer region-level contextual information during attention learning. By concatenating the local attention weights corresponding to multiple attention heads, the learned attention information is forced to be shared across a set of channels, thus enabling the local attention weights to be channel-adaptive.
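The path from a head's context sub-feature to its local sub-attention information (two 1×1 convolutions, then softmax along the channels, then concatenation over heads) can be sketched as follows. At a single spatial position a 1×1 convolution reduces to a linear map over channels, which is what `conv1x1` implements; the weight shapes and values are hypothetical.

```python
import math

def conv1x1(x, W):
    """A 1x1 convolution at one position is a linear map over channels."""
    return [sum(x[r] * W[r][c] for r in range(len(x))) for c in range(len(W[0]))]

def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def local_attention(contexts, head_weights):
    """Concatenate the per-head local sub-attention information.

    contexts[h] is the first context sub-feature of head h;
    head_weights[h] = (W1, W2) are that head's two 1x1 convolution weights.
    """
    out = []
    for ctx, (W1, W2) in zip(contexts, head_weights):
        logits = conv1x1(conv1x1(ctx, W1), W2)  # no activation in between
        out.extend(softmax(logits))             # softmax along the channels
    return out

# Two heads, context size 2, three output channels per head (illustrative).
W1 = [[1.0, 0.0], [0.0, 1.0]]
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
contexts = [[2.0, 0.0], [0.0, 2.0]]
local_attn = local_attention(contexts, [(W1, W2), (W1, W2)])
```

Each head's slice of `local_attn` sums to one, and the largest weight in each slice follows that head's own context, so the concatenated result differs across channel groups.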
[0074] The kernel attention module is described below.
[0075] The kernel attention module is a dynamic depthwise convolution whose kernel consists of attention weights generated from the input. This allows the learned attention weights to be applied directly to dense channels in feature learning.
[0076] In some embodiments, for each input feature, global average pooling is applied to the value vectors of the features in the local receptive field of the input feature to determine a second context feature; the second context feature is input into multiple consecutive second convolutional modules to obtain the spatial attention weight information corresponding to the input feature; the second context feature is input into a third convolutional module to obtain the channel attention weight information corresponding to the input feature. The kernel attention information includes the spatial attention weight information and the channel attention weight information.
[0077] First, global average pooling (GAP) is applied to the input to obtain the second context feature. Here, the input refers to the value vectors of the features in the local receptive field of each input feature. For input feature x_i, the second context feature can be determined using the following formula.
[0078]
[0079] Here, R_k(i) is a local receptive field. It can be the same as R_l(i), that is, the K×K local region centered at x_i, or it can be different. Through global average pooling, a global context feature can be obtained for x_i.
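A minimal sketch of the pooling step, assuming that global average pooling here means the channel-wise mean of the value vectors in R_k(i). The function name `second_context_feature` and the toy receptive field are illustrative only.

```python
def second_context_feature(value_vectors):
    """Global average pooling: channel-wise mean of the value vectors in R_k(i)."""
    n = len(value_vectors)
    dim = len(value_vectors[0])
    return [sum(v[c] for v in value_vectors) / n for c in range(dim)]


# Value vectors of the features in a hypothetical 3-element receptive field.
field = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
g = second_context_feature(field)
```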
[0080] The multiple consecutive second convolutional modules can be similar to the multiple consecutive first convolutional modules, comprising two consecutive 1×1 convolutions followed by softmax normalization along the channels, with no activation function. The spatial attention weight information corresponding to input feature x_i can be determined using the following formula.
[0081]
[0082] The spatial attention weights are learned according to the formula above, and their output dimension is N_h.
[0083] The third convolutional module can be a 1×1 convolutional module. The channel attention weight information corresponding to input feature x_i can be determined using the following formula.
[0084]
[0085] Here, σ_s can be the sigmoid activation function.
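The two branches of the kernel attention module can be sketched together: the spatial branch passes the pooled context feature through two 1×1 convolutions and a softmax, and the channel branch passes it through one 1×1 convolution and a sigmoid. As before, a 1×1 convolution on a pooled vector is modelled as a matrix-vector product. All weights and sizes below (including N_h = 4) are hypothetical values chosen for the example.

```python
import math


def linear(vec, weight):
    """A 1x1 convolution on a pooled vector: a matrix-vector product over channels."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def softmax(vec):
    """Numerically stable softmax over a list of scores."""
    m = max(vec)
    exps = [math.exp(v - m) for v in vec]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    """Logistic sigmoid, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def kernel_attention(context, w_s1, w_s2, w_c):
    """Return (spatial weights, channel weights) from the pooled context feature."""
    spatial = softmax(linear(linear(context, w_s1), w_s2))  # two 1x1 convs + softmax
    channel = [sigmoid(z) for z in linear(context, w_c)]    # one 1x1 conv + sigmoid
    return spatial, channel


context = [1.0, 1.0]                                       # pooled second context feature
w_s1 = [[1.0, 0.0], [0.0, 1.0]]
w_s2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]    # N_h = 4 spatial weights
w_c = [[1.0, -1.0], [0.5, 0.5]]                            # one gate per input channel
spatial, channel = kernel_attention(context, w_s1, w_s2, w_c)
```

The spatial weights sum to one across their N_h entries, while each channel weight is an independent gate in (0, 1).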
[0086] Dividing kernel attention information into spatial attention weight information and channel attention weight information can effectively reduce the amount of computation in the feature aggregation stage and improve efficiency.
[0087] Unlike global or local attention modules, the channel dimension of attention weights in the kernel attention module is equal to the channel dimension of the input in the feature fusion stage, thus enabling the attention weights to be adaptive across channels. Furthermore, this module supports mining global contextual information for attention learning. However, similar to local attention modules, the kernel attention module also has a local receptive field, and the learned kernel attention weights are shared across all locations.
[0088] In step S108, each input feature is updated according to the first attention weight information and the second attention weight information corresponding to each input feature to obtain the feature representation of the image.
[0089] Updating each input feature is also a stage of aggregating each input feature based on different attention weight information. In some embodiments, the value vector of each input feature is updated according to the channel attention weight information corresponding to each input feature, resulting in an updated value vector for each input feature; for each input feature, the updated value vector of the feature in the global receptive field, the updated value vector of the feature in the local receptive field, and the corresponding first and second attention weight information are used to update the input feature.
[0090] During the feature aggregation stage, channel attention information can be used to update the value vector of each feature. The following formula can be used to update the value vector of each feature.
[0091]
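The formula for this step is not reproduced in the text above. One plausible reading, shown here purely as an assumption, is that each channel of the value vector is scaled by the corresponding channel attention weight (an element-wise product). The function name `update_value_vector` is hypothetical.

```python
def update_value_vector(value, channel_weights):
    """Scale each channel of a value vector by its channel attention weight.

    Assumption of this sketch: the update is an element-wise product.
    """
    if len(value) != len(channel_weights):
        raise ValueError("value and channel_weights must have the same length")
    return [a * v for a, v in zip(channel_weights, value)]


updated = update_value_vector([2.0, -4.0, 1.0], [0.5, 0.25, 1.0])
```

This reading matches the earlier statement that the channel dimension of the kernel attention weights equals the channel dimension of the input in the feature fusion stage.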
[0092] As in the above embodiment, the first attention weight information corresponding to each input feature includes: the first attention weight of each feature in the global receptive field relative to each input feature, and the second attention weight information corresponding to each input feature includes: the second attention weight of each feature in the local receptive field relative to each input feature. The global receptive field includes the local receptive field.
[0093] In some embodiments, for each input feature, a weighted sum of the first attention weight and the second attention weight of each feature in the local receptive field of the input feature relative to the input feature is computed to obtain the attention weight of that feature relative to the input feature; the updated value vectors of the features in the local receptive field of the input feature are weighted and summed using these attention weights, and the result is passed through a transformation function to obtain the first updated value of the input feature; the updated value vectors of the features in the difference set between the global receptive field and the local receptive field of the input feature are weighted and summed using the first attention weights of those features relative to the input feature, and the result is passed through a transformation function to obtain the second updated value of the input feature; the first updated value and the second updated value of each input feature are summed to update each input feature.
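The two-part update can be sketched as below. The transformation function is modelled as a single linear map, and the coefficients of the weighted sum of the first and second attention weights are passed in as `mix`; both choices, along with all names and the toy data, are assumptions for illustration.

```python
def weighted_sum(vectors, weights, dim):
    """Return sum_j weights[j] * vectors[j] as a list of length dim."""
    return [sum(w * v[c] for w, v in zip(weights, vectors)) for c in range(dim)]


def transform(vec, weight):
    """Transformation function, modelled here as one linear (fully connected) map."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def update_feature(values, first_w, second_w, local_idx, mix, weight):
    """Update one input feature from the features in its global receptive field.

    values:    updated value vectors of all features in the global receptive field.
    first_w:   first attention weight of each feature relative to the input feature.
    second_w:  second attention weight, keyed by index, for local features only.
    local_idx: set of indices of the features in the local receptive field.
    mix:       (a, b) coefficients of the weighted sum of first and second weights.
    """
    a, b = mix
    dim = len(values[0])
    local = sorted(local_idx)
    rest = [j for j in range(len(values)) if j not in local_idx]  # difference set
    local_w = [a * first_w[j] + b * second_w[j] for j in local]
    first_update = transform(
        weighted_sum([values[j] for j in local], local_w, dim), weight)
    second_update = transform(
        weighted_sum([values[j] for j in rest], [first_w[j] for j in rest], dim), weight)
    return [p + q for p, q in zip(first_update, second_update)]


values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
first_w = [0.5, 0.25, 0.25]
second_w = {0: 1.0, 1: 0.0}
identity = [[1.0, 0.0], [0.0, 1.0]]
y = update_feature(values, first_w, second_w, {0, 1}, (0.5, 0.5), identity)
```

Features 0 and 1 lie in the local receptive field and use the mixed weights; feature 2 lies only in the global field and uses its first attention weight unchanged.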
[0094] In some embodiments, when the second attention weight information corresponding to each input feature is determined based on the local attention information and kernel attention information corresponding to each input feature, the second attention weight of each feature in the local receptive field relative to each input feature is the weighted sum of the local attention weight of each feature in the local receptive field relative to each input feature and the spatial attention weight of each feature in the local receptive field relative to each input feature.
[0095] In the above embodiments, when the three attention mechanisms are applied simultaneously, for each feature in the local receptive field, the attention weight relative to each input feature is a weighted sum of the first attention weight, the local attention weight, and the spatial attention weight. For each feature in the difference set between the global receptive field and the local receptive field, the attention weight relative to each input feature is the first attention weight.
[0096] A global context-aware router can be used to integrate all three types of spatial attention. For input feature x_i and a feature x_j in its local or global receptive field, the attention weight of x_j relative to x_i can be represented by the following formula.
[0097]
[0098] Here, λ1, λ2, and λ3 are trade-off parameters learned by the gating network in the global context-aware router (for example, the Gate in Figure 2).
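Based on the description in the preceding paragraphs, the fused weight might be computed as in the sketch below: a weighted sum of the three weights for features inside the local receptive field, and the first attention weight alone for features outside it. The function name and argument names are hypothetical, and the trade-off parameters are treated as given inputs because the internals of the gating network are not specified here.

```python
def fused_attention_weight(in_local_field, global_w, local_w, spatial_w, gates):
    """Attention weight of feature x_j relative to x_i after gated fusion.

    gates = (lam1, lam2, lam3) are the trade-off parameters from the gating network.
    Outside the local receptive field only the first (global) weight applies.
    """
    lam1, lam2, lam3 = gates
    if in_local_field:
        return lam1 * global_w + lam2 * local_w + lam3 * spatial_w
    return global_w


gates = (0.5, 0.25, 0.25)
w_in = fused_attention_weight(True, 0.2, 0.4, 0.8, gates)
w_out = fused_attention_weight(False, 0.2, 0.4, 0.8, gates)
```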
[0099] Using these attention weights to aggregate the updated value vectors of the features x_j in the local or global receptive field of the input feature can be represented by the following formula.
[0100]
[0101] Here, Y represents the feature transformation function, which, as shown in Figure 2, can be implemented using a third fully connected layer. The result is the updated feature of input feature x_i, that is, the output feature of the image feature extraction model, from which the feature representation of the entire image is obtained.
[0102] In the methods described above, each attention mechanism (global / local / kernel attention) can be considered an attention expert. The strengths and weaknesses of each attention expert can be complementary. A global context-aware router is used to fuse the attention knowledge learned from the three expert networks through a gating mechanism. Under this attention-level fusion mechanism, only a single feature aggregation is needed throughout the learning process, achieving less computational overhead than feature-level fusion.
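One way to read the single-aggregation remark is through linearity. The symbols a^g, a^l, and a^s for the global, local, and spatial attention weights and v_j for the updated value vector are notation assumed for this note, not taken from the formulas of this disclosure. For the features in the local receptive field R_l(i):

```latex
\sum_{j \in R_l(i)} \left( \lambda_1 a^{g}_{ij} + \lambda_2 a^{l}_{ij} + \lambda_3 a^{s}_{ij} \right) v_j
= \lambda_1 \sum_{j \in R_l(i)} a^{g}_{ij} v_j
+ \lambda_2 \sum_{j \in R_l(i)} a^{l}_{ij} v_j
+ \lambda_3 \sum_{j \in R_l(i)} a^{s}_{ij} v_j
```

The left-hand side fuses the weights first and needs one weighted sum over the value vectors. The right-hand side, which corresponds to aggregating three feature sets separately and fusing them afterwards, needs three.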
[0103] In the above scheme, for each input feature of the image, first attention weight information and second attention weight information are determined based on the features in the global receptive field and the features in the local receptive field, respectively. Then, each input feature is updated based on the first and second attention weight information to obtain the image feature representation. This scheme integrates different attention mechanisms. The first attention weight information can represent point-to-point global context information, while the second attention weight information can effectively capture local relationships in the input features and achieve channel adaptation. By fusing different attention mechanisms, image feature representation can be better achieved, improving the accuracy of image feature extraction.
[0104] Following the image feature extraction model, downstream tasks such as object classification and object detection models can be set up. The image feature representation is input into the object classification model to determine the category of objects in the image; the image feature representation is input into the object detection model to determine the location of objects in the image. The object classification and object detection models can use existing modules or network structures, which will not be elaborated here.
[0105] This disclosure also provides an image feature extraction apparatus, which is described below in conjunction with Figure 3.
[0106] Figure 3 is a structural diagram of some embodiments of the image feature extraction apparatus of this disclosure. As shown in Figure 3, the apparatus 30 of this embodiment includes: a determination module 310, a first attention module 320, a second attention module 330, and an update module 340.
[0107] The determination module 310 is used to determine the query vector, key vector and value vector of each input feature for each input feature of the image, based on the query matrix and key matrix.
[0108] In some embodiments, the determining module 310 is used to multiply each input feature by the query matrix to obtain a query vector for each input feature; multiply each input feature by the key matrix to obtain a key vector for each input feature; and concatenate the query vector and key vector of each input feature to obtain a value vector for each input feature.
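The three steps performed by the determination module can be sketched as two matrix products and a concatenation. The function names and the small matrices below are illustrative values only.

```python
def matvec_rows(x, matrix):
    """Multiply row vector x (length d_in) by a d_in x d_out matrix."""
    d_out = len(matrix[0])
    return [sum(x[r] * matrix[r][c] for r in range(len(x))) for c in range(d_out)]


def query_key_value(x, w_q, w_k):
    """Return the query, key, and value vectors of one input feature."""
    q = matvec_rows(x, w_q)   # query vector: x times the query matrix
    k = matvec_rows(x, w_k)   # key vector: x times the key matrix
    v = q + k                 # value vector: list concatenation of q and k, not addition
    return q, k, v


x = [1.0, 2.0]
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[0.0, 1.0], [1.0, 0.0]]
q, k, v = query_key_value(x, w_q, w_k)
```

Note that no separate value matrix appears: the value vector reuses the query and key projections, so its length is the sum of their lengths.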
[0109] The first attention module 320 is used to determine the first attention weight information corresponding to each input feature based on the query vector of the input feature and the key vector of the feature in the global receptive field of the input feature.
[0110] In some embodiments, the first attention module 320 is used to divide the query vector and key vector of each input feature into query sub-vectors and key sub-vectors of each input feature in each attention head according to the number of attention heads; for each input feature, based on the query sub-vector of the input feature in each attention head and the key sub-vectors of the features in the global receptive field, determine the sub-attention weight information corresponding to the input feature in each attention head; and based on the sub-attention weight information corresponding to each input feature in each attention head, determine the first attention weight information corresponding to each input feature.
[0111] In some embodiments, the sub-attention weight information corresponding to each input feature in each attention head includes: the sub-attention weight of each feature in the global receptive field of each attention head relative to each input feature, and the first attention weight information corresponding to each input feature includes: the first attention weight of each feature in the global receptive field relative to each input feature. The first attention module 320 is used to concatenate the sub-attention weights of the feature in the global receptive field of each attention head relative to the input feature for each feature in the global receptive field and each input feature, and determine the first attention weight of the feature in the global receptive field relative to the input feature.
[0112] In some embodiments, the first attention module 320 is configured to, at each attention head, determine the dot product of the query sub-vector of the input feature in the attention head and the key sub-vector of each feature in the global receptive field; normalize each dot product to obtain the sub-attention weight of each feature in the global receptive field of the attention head relative to each input feature, as the sub-attention weight information corresponding to the input feature in the attention head.
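The head splitting, dot products, normalization, and concatenation performed by the first attention module can be sketched as follows. The text only says that the dot products are normalized; softmax is assumed here as the normalization. All function names and the toy vectors are hypothetical.

```python
import math


def split_heads(vec, num_heads):
    """Divide a vector into num_heads equally sized sub-vectors."""
    size = len(vec) // num_heads
    return [vec[h * size:(h + 1) * size] for h in range(num_heads)]


def softmax(scores):
    """Numerically stable softmax (assumed normalization)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_sub_attention(query, keys, num_heads):
    """Per head, the sub-attention weights of every feature in the global receptive field."""
    q_heads = split_heads(query, num_heads)
    k_heads = [split_heads(k, num_heads) for k in keys]
    weights = []
    for h, q in enumerate(q_heads):
        dots = [sum(a * b for a, b in zip(q, k[h])) for k in k_heads]
        weights.append(softmax(dots))
    return weights


def first_attention_weight(weights, j):
    """Concatenate the sub-attention weights of feature j across all heads."""
    return [head[j] for head in weights]


query = [1.0, 0.0, 0.0, 1.0]
keys = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
sub_w = global_sub_attention(query, keys, num_heads=2)
a_0 = first_attention_weight(sub_w, 0)
```

With this layout the first attention weight of a feature has one entry per attention head.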
[0113] The second attention module 330 is used to determine the second attention weight information corresponding to each input feature based on the value vector of the feature in the local receptive field of the input feature.
[0114] In some embodiments, the second attention module 330 includes at least one of a local attention module and a kernel attention module, as well as a second attention fusion module.
[0115] In some embodiments, the local attention module is used to determine a first context feature by using depthwise convolution based on the value vector of the feature in the local receptive field of the input feature; and to determine the local attention information corresponding to the input feature based on the first context feature.
[0116] In some embodiments, the kernel attention module is used to determine a second context feature by using global average pooling based on the value vector of the feature in the local receptive field of the input feature; and to determine the kernel attention information corresponding to the input feature based on the second context feature.
[0117] In some embodiments, the second attention fusion module is used to determine the second attention weight information corresponding to the input feature based on the local attention information corresponding to the input feature, or to determine the second attention weight information corresponding to the input feature based on the kernel attention information corresponding to the input feature, or to determine the second attention weight information corresponding to the input feature based on the local attention information and kernel attention information corresponding to the input feature.
[0118] In some embodiments, the local attention module is used to divide the value vector of the feature in the local receptive field of the input feature into value sub-vectors of the feature in the local receptive field for each attention head according to the number of attention heads; the value sub-vector of the feature in the local receptive field for each attention head is input into the depthwise convolution module and activation function module in each attention head to obtain the first context sub-feature in each attention head, wherein the first context feature includes multiple first context sub-features.
[0119] In some embodiments, the local attention module is used to input the first context sub-feature in each attention head into a series of first convolutional modules in each attention head to obtain the local sub-attention information corresponding to the input feature in each attention head; and to concatenate the local sub-attention information corresponding to the input feature in each attention head to obtain the local attention information corresponding to the input feature.
[0120] In some embodiments, the kernel attention information includes spatial attention weight information and channel attention weight information. The kernel attention module is used to input the second context feature into a series of second convolutional modules to obtain the spatial attention weight information corresponding to the input feature; and to input the second context feature into a third convolutional module to obtain the channel attention weight information corresponding to the input feature.
[0121] The update module 340 is used to update each input feature according to the first attention weight information and the second attention weight information corresponding to each input feature, so as to obtain the feature representation of the image.
[0122] In some embodiments, the update module 340 is used to update the value vector of each input feature according to the channel attention weight information corresponding to each input feature, so as to obtain the updated value vector of each input feature; for each input feature, the update module 340 updates the input feature according to the updated value vector of the feature in the global receptive field of the input feature, the updated value vector of the feature in the local receptive field of the input feature, and the corresponding first attention weight information and second attention weight information of the input feature.
[0123] In some embodiments, the first attention weight information corresponding to each input feature includes: the first attention weight of each feature in the global receptive field relative to each input feature; the second attention weight information corresponding to each input feature includes: the second attention weight of each feature in the local receptive field relative to each input feature; the features in the global receptive field include the features in the local receptive field; the update module 340 is used to, for each input feature, perform a weighted summation of the first attention weight and the second attention weight of each feature in the local receptive field relative to the input feature to obtain the attention weight of each feature in the local receptive field relative to the input feature; perform a weighted summation of the updated value vector of each feature in the local receptive field relative to the input feature and the attention weight of each feature in the local receptive field relative to the input feature, and pass it through a transformation function to obtain the first updated value of the input feature; perform a weighted summation of the value vector of each feature in the difference set between the global receptive field and the local receptive field of the input feature and the first attention weight of each feature in the difference set relative to the input feature, and pass it through a transformation function to obtain the second updated value of the input feature; and sum the first updated value and the second updated value of each input feature to update each input feature.
[0124] In some embodiments, when the second attention weight information corresponding to each input feature is determined based on the local attention information and kernel attention information corresponding to each input feature, the second attention weight of each feature in the local receptive field relative to each input feature is the weighted sum of the local attention weight of each feature in the local receptive field relative to each input feature and the spatial attention weight of each feature in the local receptive field relative to each input feature.
[0125] In some embodiments, the apparatus 30 further includes an object classification module 350 and an object detection module 360. The object classification module 350 is used to determine the category of objects in the image based on the feature representation of the image; the object detection module 360 is used to determine the position of objects in the image based on the feature representation of the image.
[0126] The image feature extraction apparatus in the embodiments of this disclosure can be implemented by various computing devices or computer systems, as described below in conjunction with Figure 4 and Figure 5.
[0127] Figure 4 is a structural diagram of other embodiments of the image feature extraction apparatus of this disclosure. As shown in Figure 4, the apparatus 40 of this embodiment includes a memory 410 and a processor 420 coupled to the memory 410. The processor 420 is configured to execute the image feature extraction method in any of the embodiments of this disclosure based on instructions stored in the memory 410.
[0128] The memory 410 may include, for example, system memory, fixed non-volatile storage media, etc. The system memory may store, for example, the operating system, application programs, boot loader, database, and other programs.
[0129] Figure 5 is a structural diagram of still other embodiments of the image feature extraction apparatus of this disclosure. As shown in Figure 5, the apparatus 50 of this embodiment includes a memory 510 and a processor 520, which are similar to the memory 410 and processor 420, respectively. It may also include an input / output interface 530, a network interface 540, a storage interface 550, etc. These interfaces 530, 540, 550, and the memory 510 and processor 520 can be connected, for example, via a bus 560. The input / output interface 530 provides a connection interface for input / output devices such as a display, mouse, keyboard, and touchscreen. The network interface 540 provides a connection interface for various networked devices, such as connecting to a database server or cloud storage server. The storage interface 550 provides a connection interface for external storage devices such as SD cards and USB flash drives.
[0130] This disclosure also provides an image processing system, which is described below in conjunction with Figure 6.
[0131] Figure 6 is a structural diagram of some embodiments of the image processing system of this disclosure. As shown in Figure 6, the system 6 of this embodiment includes: an image feature extraction apparatus 30 / 40 / 50 of any of the foregoing embodiments; and a camera device 62.
[0132] The camera device 62 is used to capture images.
[0133] In some embodiments, the image processing system further includes a picking device 64, which is used to pick objects in the image based on the category and position of the objects output by the image feature extraction apparatus.
[0134] Image processing system 6 may also include a picking device 64, such as a robotic arm. Image feature extraction devices 30 / 40 / 50 can be used to determine the position and category of objects in the image based on the image's feature representation, and the picking device 64 can pick out different objects and place them in designated locations.
[0135] Those skilled in the art will understand that embodiments of this disclosure can be provided as methods, systems, or computer program products. Therefore, this disclosure can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this disclosure can take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0136] This disclosure is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this disclosure. It should be understood that each process and / or block of the flowchart illustrations and / or block diagrams, and combinations of processes and / or blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in one or more processes of the flowcharts and / or one or more blocks of the block diagrams.
[0137] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which implement the functions specified in one or more processes of the flowcharts and / or one or more blocks of the block diagrams.
[0138] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowcharts and / or one or more blocks of the block diagrams.
[0139] The above description is only a preferred embodiment of this disclosure and is not intended to limit this disclosure. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this disclosure should be included within the protection scope of this disclosure.
Claims
1. A method for feature extraction from an image, comprising: For each input feature of the image, the query vector, key vector, and value vector of each input feature are determined based on the query matrix and key matrix; For each input feature, the first attention weight information corresponding to the input feature is determined based on the query vector of the input feature and the key vector of the feature in the global receptive field of the input feature; For each input feature, the second attention weight information corresponding to the input feature is determined based on the value vector of the feature in the local receptive field of the input feature; The feature representation of the image is obtained by updating each input feature according to the first attention weight information and the second attention weight information corresponding to each input feature. This includes: updating the value vector of each input feature according to the channel attention weight information corresponding to each input feature to obtain the updated value vector of each input feature; and updating the input feature for each input feature according to the updated value vector of the feature in the global receptive field of the input feature, the updated value vector of the feature in the local receptive field of the input feature, and the corresponding first attention weight information and second attention weight information.
2. The feature extraction method according to claim 1, wherein, For each input feature, determining the first attention weight information corresponding to that feature based on the query vector of that feature and the key vector of the feature in the global receptive field includes: The query vector and key vector of each input feature are divided into query sub-vectors and key sub-vectors for each input feature in each attention head according to the number of attention heads; For each input feature, the sub-attention weight information corresponding to the input feature in each attention head is determined based on the query sub-vector of the input feature in each attention head and the key sub-vector of the feature in the global receptive field; Based on the sub-attention weight information corresponding to each input feature in each attention head, the first attention weight information corresponding to each input feature is determined.
3. The feature extraction method according to claim 2, wherein, The sub-attention weight information corresponding to each input feature in each attention head includes: the sub-attention weight of each feature in the global receptive field of each attention head relative to each input feature; and the first attention weight information corresponding to each input feature includes: the first attention weight of each feature in the global receptive field relative to each input feature. The step of determining the first attention weight information of each input feature based on the sub-attention weights corresponding to each input feature in each attention head includes: For each feature in the global receptive field and each input feature, the sub-attention weights of that feature in the global receptive field relative to the input feature in each attention head are concatenated to determine the first attention weight of that feature in the global receptive field relative to the input feature.
4. The feature extraction method according to claim 2, wherein, The step of determining the sub-attention weight information corresponding to the input feature in each attention head based on the query sub-vector of the input feature in each attention head and the key sub-vector of the feature in the global receptive field includes: For each attention head, determine the dot product of the query subvector of the input feature in that attention head and the key subvector of each feature in the global receptive field; Normalize each dot product to obtain the sub-attention weight of each feature in the global receptive field of the attention head relative to each input feature, which is used as the sub-attention weight information corresponding to the input feature in the attention head.
5. The feature extraction method according to claim 1, wherein, The step of determining the second attention weight information corresponding to the input feature based on the value vector of the feature in the local receptive field of the input feature includes: Based on the value vector of the feature in the local receptive field of the input feature, a depthwise convolution is used to determine the first context feature; Based on the first contextual feature, determine the local attention information corresponding to the input feature; Based on the local attention information corresponding to the input feature, determine the second attention weight information corresponding to the input feature.
6. The feature extraction method according to claim 1, wherein, The step of determining the second attention weight information corresponding to the input feature based on the value vector of the feature in the local receptive field of the input feature includes: Based on the value vector of the feature in the local receptive field of the input feature, global average pooling is used to determine the second context feature; Based on the second contextual feature, determine the kernel attention information corresponding to the input feature; Based on the kernel attention information corresponding to the input feature, determine the second attention weight information corresponding to the input feature.
7. The feature extraction method according to claim 1, wherein, The step of determining the second attention weight information corresponding to the input feature based on the value vector of the feature in the local receptive field of the input feature includes: Based on the value vector of the feature in the local receptive field of the input feature, perform depthwise convolution to determine the first context feature; Based on the first contextual feature, determine the local attention information corresponding to the input feature; Based on the value vector of the feature in the local receptive field of the input feature, global average pooling is performed to determine the second context feature; Based on the second contextual feature, determine the kernel attention information corresponding to the input feature; Based on the local attention information and kernel attention information corresponding to the input feature, the second attention weight information corresponding to the input feature is determined.
8. The feature extraction method according to claim 5 or 7, wherein the step of performing a depthwise convolution based on the value vector of the feature in the local receptive field of the input feature to determine the first context feature includes: dividing the value vector of the feature in the local receptive field of the input feature, according to the number of attention heads, into sub-vectors of the feature in the local receptive field for each attention head; and inputting the sub-vector of the feature in the local receptive field of each attention head into the depthwise convolution module and the activation function module of that attention head to obtain the first context sub-feature in each attention head, wherein the first context feature includes multiple first context sub-features.
9. The feature extraction method according to claim 8, wherein the step of determining the local attention information corresponding to the input feature based on the first context feature includes: inputting the first context sub-feature in each attention head into multiple consecutive first convolutional modules of that attention head to obtain the local sub-attention information corresponding to the input feature in each attention head; and concatenating the local sub-attention information corresponding to the input feature in each attention head to obtain the local attention information corresponding to the input feature.
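The per-head split in claim 8 and the concatenation in claim 9 can be sketched as below. The sketch assumes the value vector length is evenly divisible by the number of attention heads and that heads take contiguous slices; the helper names `split_heads` and `merge_heads` are hypothetical, and the per-head convolution and activation modules are omitted.

```python
def split_heads(value, num_heads):
    """Divide one value vector into num_heads contiguous, equal-length sub-vectors."""
    if len(value) % num_heads != 0:
        raise ValueError("value length must be divisible by num_heads")
    size = len(value) // num_heads
    return [value[i * size:(i + 1) * size] for i in range(num_heads)]


def merge_heads(sub_vectors):
    """Concatenate per-head results back into a single vector."""
    return [x for sub in sub_vectors for x in sub]


value = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
per_head = split_heads(value, num_heads=3)
# Per-head processing (depthwise convolution, activation, further
# convolutions) would happen here; this sketch passes the slices through.
merged = merge_heads(per_head)
```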
10. The feature extraction method according to claim 6 or 7, wherein the kernel attention information includes spatial attention weight information and channel attention weight information, and the step of determining the kernel attention information corresponding to the input feature based on the second context feature includes: inputting the second context feature into multiple consecutive second convolutional modules to obtain the spatial attention weight information corresponding to the input feature; and inputting the second context feature into the third convolutional module to obtain the channel attention weight information corresponding to the input feature.
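One way to picture the two branches of claim 10 is sketched below. It rests on several assumptions that the claim does not state: each convolutional module is modeled as a 1x1 convolution on the pooled feature (equivalent to a linear map), a ReLU is placed between consecutive modules, the spatial branch ends in a softmax over the positions of a 3 x 3 window, and the channel branch ends in a sigmoid. All names and the zero-valued placeholder weights are hypothetical.

```python
import math


def linear(x, weights, bias):
    """A 1x1 convolution applied to a pooled feature, i.e. a linear map."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    total = sum(e)
    return [v / total for v in e]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def kernel_attention(context, spatial_layers, channel_layer):
    """Return (spatial_weights, channel_weights) from a pooled context feature.

    spatial_layers: list of (weights, bias) pairs applied one after another.
    channel_layer:  a single (weights, bias) pair.
    """
    hidden = context
    for i, (w, b) in enumerate(spatial_layers):
        hidden = linear(hidden, w, b)
        if i < len(spatial_layers) - 1:
            hidden = [max(0.0, v) for v in hidden]  # assumed ReLU between modules
    spatial = softmax(hidden)  # assumed normalization over window positions
    channel = [sigmoid(v) for v in linear(context, *channel_layer)]
    return spatial, channel


context = [3.0, 4.0]  # hypothetical pooled feature with two channels
spatial_layers = [
    ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]),  # 2 channels -> 2 hidden units
    ([[0.0, 0.0]] * 9, [0.0] * 9),           # 2 hidden units -> 9 window positions
]
channel_layer = ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
spatial, channel = kernel_attention(context, spatial_layers, channel_layer)
```

With the placeholder weights the spatial weights are uniform and the channel weights are all 0.5; learned weights would make both depend on the pooled context.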
11. The feature extraction method according to claim 1, wherein the first attention weight information corresponding to each input feature includes the first attention weight of each feature in the global receptive field relative to the input feature; the second attention weight information corresponding to each input feature includes the second attention weight of each feature in the local receptive field relative to the input feature; and the features in the global receptive field include the features in the local receptive field; and wherein each input feature is updated as follows: for each input feature, computing a weighted sum of the first attention weight and the second attention weight of each feature in the local receptive field of the input feature to obtain the attention weight of that feature relative to the input feature; weighting the updated value vector of each feature in the local receptive field of the input feature by the attention weight of that feature relative to the input feature, summing the weighted vectors, and passing the sum through a transformation function to obtain the first update value of the input feature; weighting the value vector of each feature in the difference set between the global receptive field and the local receptive field of the input feature by the first attention weight of that feature relative to the input feature, summing the weighted vectors, and passing the sum through the transformation function to obtain the second update value of the input feature; and summing the first update value and the second update value of each input feature to update the input feature.
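The update rule of claim 11 can be sketched for a single input feature as follows. The sketch makes assumptions beyond the claim: the mixing coefficients `alpha` and `beta` of the weighted sum are fixed scalars, the transformation function defaults to the identity, and one list of value vectors is used for both the local field and the difference set (the claim uses updated value vectors for the local field). The function and parameter names are hypothetical.

```python
def update_feature(first_w, second_w, values, local_idx,
                   alpha=0.5, beta=0.5, transform=lambda v: v):
    """Update one input feature from its global and local receptive fields.

    first_w:   first attention weight for every feature j in the global field.
    second_w:  dict mapping local indices j to second attention weights.
    values:    value vectors for every feature in the global field.
    local_idx: indices of the features that form the local receptive field.
    """
    dim = len(values[0])
    local = set(local_idx)
    first_sum = [0.0] * dim   # accumulates over the local receptive field
    second_sum = [0.0] * dim  # accumulates over global minus local
    for j, v in enumerate(values):
        if j in local:
            weight = alpha * first_w[j] + beta * second_w[j]
            target = first_sum
        else:
            weight = first_w[j]
            target = second_sum
        for d in range(dim):
            target[d] += weight * v[d]
    first_update = transform(first_sum)
    second_update = transform(second_sum)
    return [a + b for a, b in zip(first_update, second_update)]


values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three features in the global field
first_w = [0.5, 0.25, 0.25]
second_w = {0: 1.0, 1: 0.0}                    # features 0 and 1 are local
updated = update_feature(first_w, second_w, values, local_idx=[0, 1])
```

In this example the local features receive combined weights 0.75 and 0.125, while the one remaining global feature keeps its first attention weight of 0.25, so only the local neighborhood is influenced by the second attention branch.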
12. The feature extraction method according to claim 11, wherein, when the second attention weight information corresponding to each input feature is determined based on the local attention information and the kernel attention information corresponding to the input feature, the second attention weight of each feature in the local receptive field relative to the input feature is the weighted sum of the local attention weight of that feature relative to the input feature and the spatial attention weight of that feature relative to the input feature.
13. The feature extraction method according to any one of claims 1-7, 9, and 11-12, wherein the step of determining the query vector, key vector, and value vector of each input feature based on the query matrix and the key matrix includes: multiplying each input feature by the query matrix to obtain the query vector of the input feature; multiplying each input feature by the key matrix to obtain the key vector of the input feature; and concatenating the query vector and the key vector of each input feature to obtain the value vector of the input feature.
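The three steps of claim 13 can be sketched directly. The sketch assumes each input feature is a row vector multiplied on the right by the query and key matrices; the names `matvec` and `project_qkv` and the example matrices are hypothetical.

```python
def matvec(x, w):
    """Multiply a row vector x (length d_in) by a matrix w (d_in x d_out)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def project_qkv(features, w_query, w_key):
    """Return (queries, keys, values) for a list of input feature vectors.

    No separate value matrix is used: each value vector is the query vector
    followed by the key vector, as claim 13 describes.
    """
    queries = [matvec(x, w_query) for x in features]
    keys = [matvec(x, w_key) for x in features]
    values = [q + k for q, k in zip(queries, keys)]  # list concatenation
    return queries, keys, values


features = [[1.0, 2.0], [0.0, 1.0]]
w_query = [[1.0, 0.0], [0.0, 1.0]]
w_key = [[2.0, 0.0], [0.0, 2.0]]
queries, keys, values = project_qkv(features, w_query, w_key)
```

Each value vector is as long as the query and key vectors combined, so in this example the values have four components while the queries and keys have two.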
14. The feature extraction method according to any one of claims 1-7, 9, and 11-12, further comprising at least one of the following: inputting the feature representation of the image into an object classification model to determine the category of the object in the image; and inputting the feature representation of the image into an object detection model to determine the location of the object in the image.
15. An image feature extraction apparatus, comprising: a determination module configured to determine the query vector, key vector, and value vector of each input feature of the image based on the query matrix and the key matrix; a first attention module configured to determine the first attention weight information corresponding to each input feature based on the query vector of the input feature and the key vector of the feature in the global receptive field of the input feature; a second attention module configured to determine the second attention weight information corresponding to each input feature based on the value vector of the feature in the local receptive field of the input feature; and an update module configured to update each input feature according to the first attention weight information and the second attention weight information corresponding to the input feature to obtain the feature representation of the image, wherein the update module is configured to: update the value vector of each input feature according to the channel attention weight information corresponding to the input feature to obtain the updated value vector of the input feature; and, for each input feature, update the input feature according to the updated value vector of the feature in the global receptive field of the input feature, the updated value vector of the feature in the local receptive field of the input feature, and the corresponding first attention weight information and second attention weight information.
16. An image feature extraction apparatus, comprising: a processor; and a memory coupled to the processor and configured to store instructions that, when executed by the processor, cause the processor to perform the image feature extraction method according to any one of claims 1-14.
17. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the program, when executed by a processor, implements the steps of the method according to any one of claims 1-14.
18. An image processing system, comprising: the image feature extraction apparatus according to claim 15 or 16; and a camera device configured to capture images.
19. The image processing system according to claim 18, further comprising: a picking device configured to pick objects in the image based on the category and position of the objects in the image output by the image feature extraction apparatus.