Image processing method and device, electronic equipment and storage medium

By introducing a motion processing module into the Vision Transformer model and using configured or learnable convolution kernels for depthwise separable convolution, the problem of low efficiency in local feature extraction is solved, achieving efficient image processing and applicability to various devices.

CN116403061B · Active · Publication Date: 2026-06-19 · TSINGHUA UNIVERSITY

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: TSINGHUA UNIVERSITY
Filing Date: 2023-03-31
Publication Date: 2026-06-19

Patent Timeline

31 Mar 2023: Application
19 Jun 2026: Publication (CN116403061B)

IPC: G06V10/77; G06V10/82; G06V10/764; G06V10/26; G06N3/0464

AI Tagging

Application Domain

Character and pattern recognition; Biological models

AI Technical Summary

Technical Problem

Existing Vision Transformer models are inefficient in local feature extraction, and their local attention modules rely on CUDA kernels or use inefficient Im2Col functions, making them unsuitable for devices without CUDA support.

Method used

The multi-head attention layer is replaced with a motion processing module, which performs depthwise separable convolution on the image vector using configured or learnable convolution kernels to directly obtain the row-expanded key-value matrix. This avoids window sliding and works on devices without CUDA support.

Benefits of technology

It improves image processing efficiency, runs on devices without CUDA support, and improves computation speed and model performance.

Abstract

This disclosure provides an image processing method, apparatus, electronic device, and storage medium, relating to the field of computer vision technology. The method includes: acquiring an image vector of an input image; acquiring a plurality of query matrices from the image vector; performing a shifting process on the image vector to obtain a row-expanded key-value matrix; obtaining an output matrix corresponding to each query matrix based on the plurality of query matrices and the key-value matrix, wherein each output matrix corresponding to a query matrix includes: the attention of the image vector to the query matrix; and outputting the image processing result of the input image based on the plurality of output matrices.

Description

Technical Field

[0001] This disclosure relates to the field of computer vision technology, and in particular to an image processing method, apparatus, electronic device and storage medium.

Background Technology

[0002] Self-attention plays a crucial role in Vision Transformer models. Related techniques primarily employ sparse global attention or window attention to reduce the computational complexity of Vision Transformer models. However, sparse global attention is inefficient in extracting local features, and window attention may be subject to some manual settings.

[0003] In contrast, local attention restricts the receptive field of each query to its own neighboring pixels, combining the advantages of convolution and self-attention, namely local inductive bias and adaptive feature extraction. However, current local attention modules either use inefficient Im2Col (image to column) functions or rely on specific CUDA (Compute Unified Device Architecture) kernels, making them unsuitable for devices without CUDA support.

Summary of the Invention

[0004] In view of the above problems, this disclosure provides an image processing method, apparatus, electronic device, and storage medium to overcome or at least partially solve the above problems.

[0005] A first aspect of this disclosure provides an image processing method applied to an image processing model, the method comprising:

[0006] Obtain the image vector of the input image;

[0007] Multiple query matrices are obtained from the image vector;

[0008] The image vector is shifted to obtain a key-value matrix expanded by rows;

[0009] Based on the plurality of query matrices and the key-value matrix, an output matrix corresponding to each query matrix is obtained, and the output matrix corresponding to each query matrix includes: the self-attention of the image vector to the query matrix;

[0010] Based on the multiple output matrices, the image processing result of the input image is output.

[0011] Optionally, the step of shifting the image vector to obtain a row-expanded key-value matrix includes:

[0012] Obtain multiple configured convolutional kernels, wherein the multiple configured convolutional kernels represent different moving directions;

[0013] The image vector is subjected to depthwise separable convolution using the multiple configured convolution kernels to obtain multiple convolution result matrices of the image vector;

[0014] The key value matrix is obtained by expanding the multiple convolution result matrices of the image vector row by row.

[0015] Optionally, it also includes:

[0016] Obtain multiple learnable convolutional kernels, wherein the multiple learnable convolutional kernels correspond one-to-one with the multiple configured convolutional kernels;

[0017] The image vector is convolved using the multiple learnable convolution kernels to obtain multiple target convolution result matrices of the image vector;

[0018] The multiple convolution result matrices of the image vector and the multiple target convolution result matrices are added together to obtain the sum matrix of the multiple convolution results of the image vector;

[0019] The step of shifting the image vector to obtain a row-expanded key-value matrix includes:

[0020] The key value matrix is obtained by expanding the sum of multiple convolution results of the image vectors row by row.

[0021] Optionally, the step of shifting the image vector to obtain a row-expanded key-value matrix includes:

[0022] The image vector is moved in different directions to obtain multiple movement result matrices of the image vector;

[0023] Expand the multiple move result matrices row by row to obtain the key value matrix.

[0024] Optionally, obtaining the output matrix corresponding to each query matrix based on the plurality of query matrices and the key-value matrix includes:

[0025] Each query matrix is multiplied by the key-value matrix to obtain the output matrix corresponding to each query matrix;

[0026] The step of outputting the image processing result of the input image based on the plurality of output matrices includes:

[0027] The output matrices corresponding to each query matrix are summed and normalized to obtain the image features of the input image;

[0028] Based on the image features, the image processing result of the input image is output.

[0029] Optionally, the image processing model is an image classification model, and the image processing result is an image classification result; or,

[0030] The image processing model is an object detection model, and the image processing result is the object detection result; or,

[0031] The image processing model is a semantic segmentation model, and the image processing result is a semantic segmentation result.

[0032] A second aspect of this disclosure provides an image processing apparatus applied to an image processing model, the apparatus comprising:

[0033] The input module is used to obtain the image vector of the input image;

[0034] The acquisition module is used to acquire multiple query matrices from the image vector;

[0035] The processing module is used to perform shift processing on the image vector to obtain a key-value matrix expanded by rows;

[0036] The obtaining module is configured to obtain an output matrix corresponding to each query matrix based on the plurality of query matrices and the key-value matrix, wherein the output matrix corresponding to each query matrix includes: the attention of the image vector to the query matrix;

[0037] The output module is used to output the image processing result of the input image based on the plurality of output matrices.

[0038] A third aspect of this disclosure provides an electronic device, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the image processing method as described in the first aspect.

[0039] A fourth aspect of this disclosure provides a computer-readable storage medium that, when instructions in the computer-readable storage medium are executed by a processor of an electronic device, enables the electronic device to perform the image processing method as described in the first aspect.

[0040] The embodiments disclosed herein have the following advantages:

[0041] In this embodiment, an image vector of the input image is obtained; multiple query matrices are obtained from the image vector; the image vector is shifted to obtain a row-expanded key-value matrix; based on the multiple query matrices and the key-value matrix, an output matrix corresponding to each query matrix is obtained, and the output matrix corresponding to each query matrix includes: the self-attention of the image vector to the query matrix; based on the multiple output matrices, the image processing result of the input image is output. Thus, compared to related technologies that require window sliding when using the Im2Col function to obtain the key-value matrix, this embodiment does not require window sliding when obtaining the key-value matrix, saving time and improving efficiency, and is applicable to devices without a CUDA kernel.

Attached Figure Description

[0042] To more clearly illustrate the technical solutions of the embodiments of this disclosure, the accompanying drawings used in the description of the embodiments of this disclosure will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0043] Figure 1 is a schematic diagram of obtaining the key-value matrix using the Im2Col function;

[0044] Figure 2 is a flowchart of the steps of an image processing method according to an embodiment of the present disclosure;

[0045] Figure 3 is a schematic diagram of obtaining a key-value matrix by moving an image vector according to an embodiment of this disclosure;

[0046] Figure 4 is a schematic diagram of obtaining a key-value matrix by convolving an image vector according to an embodiment of this disclosure;

[0047] Figure 5 is a block diagram of an image processing apparatus according to an embodiment of the present disclosure.

Detailed Implementation

[0048] To make the above-mentioned objectives, features and advantages of this disclosure more apparent and understandable, the disclosure will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0049] Figure 1 is a schematic diagram illustrating the use of the Im2Col function to obtain the key-value matrix. As shown in part a of Figure 1, the local window is centered on a 2×2 query. The receptive field of each query includes the pixels surrounding it, and the convolutional kernel is 3×3. Sliding the window with the convolutional kernel yields four sliding results, as shown in part b of Figure 1. Expanding the four sliding results column-wise yields the four column vectors shown in part c of Figure 1, each containing nine elements. These four column vectors form a 4×9 key-value matrix expanded column by column.

[0050] Obtaining the key-value matrix using the Im2Col function requires sliding the window, a process that is time-consuming. Therefore, using the Im2Col function to obtain the key-value matrix is inefficient. Consequently, the Vision Transformer model using the Im2Col function to obtain the key-value matrix is also inefficient.
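The Im2Col procedure described above can be sketched in plain Python as follows. This is a minimal illustration, not the implementation referred to in the disclosure: the helper name `im2col_3x3`, the zero padding at the image border, and the sample pixel values are all assumptions made for this sketch.

```python
def im2col_3x3(image):
    """Slide a 3x3 window over a zero-padded image, one window per pixel.

    Returns one column (a list of 9 values) per query pixel, in row-major
    pixel order. Zero padding at the border is an assumption of this sketch.
    """
    h, w = len(image), len(image[0])
    # Pad one ring of zeros so every pixel has a full 3x3 neighbourhood.
    padded = [[0] * (w + 2)]
    padded += [[0] + list(row) + [0] for row in image]
    padded += [[0] * (w + 2)]
    columns = []
    for y in range(h):          # the window visits every query position
        for x in range(w):
            window = [padded[y + i][x + j] for i in range(3) for j in range(3)]
            columns.append(window)
    return columns


image = [[1, 2],
         [3, 4]]
columns = im2col_3x3(image)
print(len(columns), len(columns[0]))  # number of windows, elements per window
print(columns[0])                     # neighbourhood of the top-left pixel
```

The nested loop over `y` and `x` is the window sliding that the disclosure identifies as the time-consuming step.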

[0051] Figure 2 shows a flowchart of an image processing method according to an embodiment of this disclosure. The image processing method is applied to an image processing model. As shown in Figure 2, the image processing method may specifically include steps S11 to S15.

[0052] Step S11: Obtain the image vector of the input image;

[0053] Step S12: Obtain multiple query matrices from the image vector;

[0054] Step S13: The image vector is shifted to obtain a key-value matrix expanded by rows;

[0055] Step S14: Based on the plurality of query matrices and the key-value matrix, obtain the output matrix corresponding to each query matrix, wherein the output matrix corresponding to each query matrix includes: the self-attention of the image vector to the query matrix;

[0056] Step S15: Output the image processing result of the input image according to the plurality of output matrices.

[0057] The image processing model proposed in this embodiment has a structure similar to that of a Vision Transformer model with a multi-head attention layer. Specifically, the image processing model includes a motion processing module, which is a plug-and-play module: inserting it into the Vision Transformer model in place of the multi-head attention layer yields the image processing model. The image processing model can be an image classification model, an object detection model, a semantic segmentation model, or the like, and the corresponding image processing result can be an image classification result, an object detection result, a semantic segmentation result, or the like.

[0058] The image vector of the input image can be obtained using related techniques; for example, it can be generated from the pixel value of each pixel. The motion processing module of the image processing model has multiple heads, each corresponding to one query matrix. The image vector of the input image is input into the motion processing module, and each head obtains a query matrix from the image vector, in the same way that each head of the multi-head attention layer of the Vision Transformer model obtains its query matrix.

[0059] Each head of the motion processing module can move the image vector in multiple directions to obtain a row-expanded key-value matrix. The movement can be implemented either by performing depthwise separable convolution on the image vector with multiple configured convolution kernels or by directly moving the image vector in multiple directions. Based on the key-value matrix and its own query matrix, each head obtains the output matrix corresponding to that query matrix, which contains the self-attention of the image vector to that query matrix.

[0060] In some implementations, obtaining the output matrix corresponding to each query matrix based on the plurality of query matrices and the key-value matrix may include: multiplying each query matrix by the key-value matrix to obtain the output matrix corresponding to each query matrix. Outputting the image processing result of the input image based on the plurality of output matrices may include: summing and normalizing the output matrices corresponding to each query matrix to obtain the image features of the input image; and outputting the image processing result of the input image based on the image features. For a specific method of obtaining the image processing result of an image based on image features, refer to the relevant art, specifically the method of obtaining the image processing result of an image based on image features using the Vision Transformer model.
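To make the role of the key-value matrix concrete, the following is a hedged sketch of local attention over a 3×3 neighbourhood built from shifted copies of the features. It is one possible reading, not the computation specified by the disclosure: the softmax weighting, the use of a single head with one scalar feature per pixel, the reuse of the same features as keys and values, the zero padding (padded zeros take part in the softmax here), and the names `shift` and `local_attention` are all assumptions of this sketch.

```python
import math

# Offsets (dy, dx) for the nine positions of a 3x3 neighbourhood, row-major.
DIRECTIONS = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]


def shift(grid, dy, dx):
    """Return a grid whose entry (y, x) is grid[y + dy][x + dx], or 0.0 if outside."""
    h, w = len(grid), len(grid[0])
    return [[grid[y + dy][x + dx] if 0 <= y + dy < h and 0 <= x + dx < w else 0.0
             for x in range(w)]
            for y in range(h)]


def local_attention(query, features):
    """Each pixel attends to its 3x3 neighbourhood (single head, scalar features)."""
    h, w = len(query), len(query[0])
    # Nine shifted copies play the role of the row-expanded key-value matrix.
    shifted = [shift(features, dy, dx) for dy, dx in DIRECTIONS]
    output = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            scores = [query[y][x] * s[y][x] for s in shifted]
            peak = max(scores)  # subtract the maximum for numerical stability
            exps = [math.exp(score - peak) for score in scores]
            total = sum(exps)
            output[y][x] = sum(e / total * s[y][x] for e, s in zip(exps, shifted))
    return output


features = [[1.0, 2.0],
            [3.0, 4.0]]
zero_query = [[0.0, 0.0], [0.0, 0.0]]
uniform = local_attention(zero_query, features)    # equal weight on all neighbours
sharp_query = [[10.0, 0.0], [0.0, 0.0]]
focused = local_attention(sharp_query, features)   # top-left pixel favours large keys
print(uniform[0][0], focused[0][0])
```

With a zero query every neighbour receives the same weight, so the output is the neighbourhood mean; a large query value concentrates the weight on the neighbour with the largest key.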

[0061] The following describes the specific implementation of the motion processing module to perform motion processing on image vectors and obtain a row-expanded key-value matrix.

[0062] The motion processing module performs motion processing on the image vector, which can be done by directly moving the image vector in different directions to obtain multiple motion result matrices of the image vector; the multiple motion result matrices are expanded row by row to obtain the key value matrix.

[0063] Figure 3 is a schematic diagram illustrating the process of moving an image vector to obtain a key-value matrix according to an embodiment of this disclosure. The 2×2 query shown in part a of Figure 3 can be directly moved in nine directions to obtain the nine movement results shown in part b of Figure 3. The nine directions are, in order: upper left, directly above, upper right, left, center, right, lower left, directly below, and lower right. Moving the image vector towards the center means keeping its position unchanged. Expanding the nine movement results row by row yields the nine row vectors shown in part c of Figure 3, each containing four elements. These nine row vectors form a 4×9 key-value matrix expanded row by row.

[0064] As shown in part c of Figure 1 and part c of Figure 3, the column-expanded key-value matrix obtained using the Im2Col function is the same as the row-expanded key-value matrix obtained by directly moving the image vector. Therefore, directly moving the image vector in the various directions yields the correct key-value matrix, and subsequent processing can be performed on it.
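The movement-based construction can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions: the helper names `shift` and `key_value_rows` are hypothetical, zero padding is assumed at the border, and each direction is interpreted as naming the neighbour that a pixel reads (for example, "upper left" means every pixel takes the value of its upper-left neighbour). The figures may use a different but equivalent convention.

```python
# Offsets (dy, dx) in the order used in the text: upper left, directly above,
# upper right, left, center, right, lower left, directly below, lower right.
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1),
              (0, -1), (0, 0), (0, 1),
              (1, -1), (1, 0), (1, 1)]


def shift(image, dy, dx):
    """Return a grid whose entry (y, x) is image[y + dy][x + dx], or 0 if outside."""
    h, w = len(image), len(image[0])
    return [[image[y + dy][x + dx] if 0 <= y + dy < h and 0 <= x + dx < w else 0
             for x in range(w)]
            for y in range(h)]


def key_value_rows(image):
    """One flattened row per direction: nine rows, each with one entry per pixel."""
    rows = []
    for dy, dx in DIRECTIONS:
        moved = shift(image, dy, dx)
        rows.append([value for row in moved for value in row])
    return rows


image = [[1, 2],
         [3, 4]]
rows = key_value_rows(image)
for row in rows:
    print(row)

# Column p of the stacked rows collects the 3x3 neighbourhood of pixel p,
# which is what a sliding window centered on that pixel would have gathered.
neighbourhoods = [list(column) for column in zip(*rows)]
print(neighbourhoods[0])
```

Each movement touches the whole image at once, so no per-pixel window is ever materialised; reading the stacked rows column by column recovers the same neighbourhoods that a sliding window would produce.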

[0065] The motion processing module performs motion processing on the image vectors, which can be achieved by performing depthwise separable convolutions on the image vectors using multiple pre-configured convolution kernels. Specifically, the image processing model incorporates multiple pre-configured convolution kernels, each representing a different motion direction. The size of the convolution kernel is determined by the size of the query's receptive field. The image vectors are then subjected to depthwise separable convolutions using these pre-configured kernels, resulting in multiple convolution result matrices of the image vectors. These matrices are then expanded row-wise to obtain the key-value matrix.

[0066] Figure 4 is a schematic diagram illustrating the process of obtaining a key-value matrix by convolving an image vector according to an embodiment of this disclosure. The 2×2 query shown in part a of Figure 4 is processed by depthwise separable convolution with nine pre-configured convolutional kernels, yielding nine convolution results. These nine convolution results are identical to the nine movement results shown in part b of Figure 3. Expanding the nine convolution results row by row yields the nine row vectors shown in part c of Figure 4, each containing four elements. These nine row vectors form a 4×9 key-value matrix expanded row by row.

[0067] That is, the convolution result matrices obtained by performing depthwise separable convolution on the image vector with the multiple configured convolution kernels are equal to the movement result matrices obtained by moving the image vector in the different directions.

[0068] Each of the nine convolutional kernels contains eight 0s and one 1, and the position of the 1 determines the direction of movement that the kernel represents. When the 1 is at position (1,1), convolving the kernel with the image vector gives the same result as moving the image vector to the upper left; at position (2,1), directly upward; at position (3,1), to the upper right; at position (1,2), to the left; at position (2,2), to the center; at position (3,2), to the right; at position (1,3), to the lower left; at position (2,3), directly downward; and at position (3,3), to the lower right.

[0069] Compared to using the Im2Col function to obtain the key value matrix, which requires window sliding, this embodiment directly uses a convolution kernel to perform depthwise separable convolution on the image vector when obtaining the key value matrix, without the need for window sliding, which saves time and improves efficiency.
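The equivalence between one-hot kernels and movements can be checked with a small plain-Python sketch. The assumptions here are: zero padding, cross-correlation (the usual deep-learning meaning of "convolution"), row-major kernel indexing rather than the position notation of the text, and the hypothetical helper names `one_hot_kernel`, `conv3x3_same`, `depthwise_conv`, and `shift`. The explicit loops are for illustration only; in practice the depthwise convolution operator of a deep-learning framework would take their place.

```python
def one_hot_kernel(index):
    """3x3 kernel with a single 1 at row-major position `index`, zeros elsewhere."""
    return [[1 if 3 * i + j == index else 0 for j in range(3)] for i in range(3)]


def conv3x3_same(channel, kernel):
    """Cross-correlate one channel with a 3x3 kernel (zero padding, same size)."""
    h, w = len(channel), len(channel[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = 0
            for i in range(3):
                for j in range(3):
                    yy, xx = y + i - 1, x + j - 1
                    if 0 <= yy < h and 0 <= xx < w:
                        total += kernel[i][j] * channel[yy][xx]
            out[y][x] = total
    return out


def depthwise_conv(channels, kernel):
    """Depthwise step: the kernel is applied to each channel independently."""
    return [conv3x3_same(channel, kernel) for channel in channels]


def shift(channel, dy, dx):
    """Return a grid whose entry (y, x) is channel[y + dy][x + dx], or 0 if outside."""
    h, w = len(channel), len(channel[0])
    return [[channel[y + dy][x + dx] if 0 <= y + dy < h and 0 <= x + dx < w else 0
             for x in range(w)]
            for y in range(h)]


channels = [[[1, 2], [3, 4]],
            [[5, 6], [7, 8]]]

# Kernel k has its 1 at row k // 3, column k % 3, i.e. offset (k // 3 - 1, k % 3 - 1).
all_match = all(
    depthwise_conv(channels, one_hot_kernel(k))
    == [shift(channel, k // 3 - 1, k % 3 - 1) for channel in channels]
    for k in range(9)
)
print(all_match)  # True: each one-hot kernel reproduces one movement
```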

[0070] By using multiple pre-configured convolutional kernels to perform depthwise separable convolution on the image vector, the level of attention given to each pixel within the receptive field is consistent. However, the contribution of each pixel within the receptive field may differ for different queries. Therefore, based on the above technical solution, multiple learnable convolutional kernels can also be used to convolve the image vector.

[0071] The learnable convolutional kernels have the same size as the configured convolutional kernels, but the value of each element can be an arbitrary value learned during training and is not limited to 0 and 1.

[0072] Each head of the motion processing module includes multiple learnable convolutional kernels, which correspond one-to-one with the pre-configured convolutional kernels. Each learnable kernel convolves the image vector to yield a target convolution result matrix. The target convolution result matrix of each learnable kernel is added to the convolution result matrix of its pre-configured counterpart to obtain a sum matrix of convolution results of the image vector. Expanding the sum matrices row by row yields the key-value matrix.

[0073] The learnable convolutional kernels and the pre-configured convolutional kernels can convolve the image vector in parallel, yielding the target convolution result matrices and the convolution result matrices, respectively. Because the two sets of results are added, a re-parameterization technique can merge the two parallel paths into a single convolution, improving model performance while maintaining inference efficiency.
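Merging the two parallel paths relies on the linearity of convolution: convolving with two kernels and adding the results equals convolving once with the sum of the kernels. The sketch below checks this for a single kernel pair. The helper names and the values in `learned` are placeholders invented for illustration (they stand in for trained weights and are not taken from the disclosure), and zero padding with cross-correlation is assumed.

```python
def conv3x3_same(channel, kernel):
    """Cross-correlate one channel with a 3x3 kernel (zero padding, same size)."""
    h, w = len(channel), len(channel[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for i in range(3):
                for j in range(3):
                    yy, xx = y + i - 1, x + j - 1
                    if 0 <= yy < h and 0 <= xx < w:
                        out[y][x] += kernel[i][j] * channel[yy][xx]
    return out


def add_grids(a, b):
    """Element-wise sum of two equally sized grids."""
    return [[p + q for p, q in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


# Pre-configured one-hot kernel (reads the right-hand neighbour).
fixed = [[0.0, 0.0, 0.0],
         [0.0, 0.0, 1.0],
         [0.0, 0.0, 0.0]]
# Placeholder values standing in for a trained learnable kernel.
learned = [[0.5, -0.25, 0.0],
           [0.25, 0.0, -0.5],
           [0.0, 0.125, 0.25]]

channel = [[1.0, 2.0],
           [3.0, 4.0]]

# Training-time view: two parallel paths whose results are added.
two_paths = add_grids(conv3x3_same(channel, fixed), conv3x3_same(channel, learned))
# Inference-time view: a single convolution with the merged kernel.
merged_kernel = add_grids(fixed, learned)
one_path = conv3x3_same(channel, merged_kernel)

max_diff = max(abs(p - q)
               for row_p, row_q in zip(two_paths, one_path)
               for p, q in zip(row_p, row_q))
print(max_diff)  # difference between the two views
```

After merging, inference needs only one convolution per direction, which is why the extra learnable path adds no inference cost.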

[0074] The values of the elements of the learnable convolutional kernel are learned during the training of the image processing model. The training of the image processing model can refer to relevant techniques, such as supervised training, unsupervised training, and self-supervised training.

[0075] Thus, convolving the image vector with learnable convolution kernels allows more flexible extraction of key-value pairs, increasing model capacity and capturing more diverse image features. Convolution with learnable kernels can be viewed as a linear combination of the features within a local window, which helps accommodate additional spatial sampling locations and geometric transformations of the input image. With the re-parameterization technique, the two parallel paths are merged into a single convolution, improving model performance while maintaining inference efficiency.

[0076] Compared to using the Im2Col function to obtain the key-value matrix, the technical solution of this embodiment does not require window sliding, which can effectively improve efficiency and can be applied to devices that do not contain a CUDA kernel.

[0077] It should be noted that, for the sake of simplicity, the method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments of this disclosure are not limited to the described order of actions, because according to the embodiments of this disclosure, some steps can be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of this disclosure.

[0078] Figure 5 is a schematic diagram of the structure of an image processing apparatus according to an embodiment of the present disclosure. The image processing apparatus is applied to an image processing model. As shown in Figure 5, the apparatus includes an input module, an acquisition module, a processing module, an obtaining module, and an output module, wherein:

[0079] The input module is used to obtain the image vector of the input image;

[0080] The acquisition module is used to acquire multiple query matrices from the image vector;

[0081] The processing module is used to perform shift processing on the image vector to obtain a key-value matrix expanded by rows;

[0082] The obtaining module is configured to obtain an output matrix corresponding to each query matrix based on the plurality of query matrices and the key-value matrix, wherein the output matrix corresponding to each query matrix includes: the attention of the image vector to the query matrix;

[0083] The output module is used to output the image processing result of the input image based on the plurality of output matrices.

[0084] Optionally, the processing module is specifically used to perform:

[0085] Obtain multiple configured convolutional kernels, wherein the multiple configured convolutional kernels represent different moving directions;

[0086] The image vector is subjected to depthwise separable convolution using the multiple configured convolution kernels to obtain multiple convolution result matrices of the image vector;

[0087] The key value matrix is obtained by expanding the multiple convolution result matrices of the image vector row by row.

[0088] Optionally, it also includes:

[0089] The convolution kernel acquisition module is used to acquire multiple learnable convolution kernels, and the multiple learnable convolution kernels correspond one-to-one with the multiple configured convolution kernels;

[0090] The convolution module is used to convolve the image vector with the multiple learnable convolution kernels respectively to obtain multiple target convolution result matrices of the image vector;

[0091] The addition module is used to add the multiple convolution result matrices and the multiple target convolution result matrices of the image vector respectively to obtain the sum matrix of the multiple convolution results of the image vector;

[0092] The processing module is specifically used to execute:

[0093] The key value matrix is obtained by expanding the sum of multiple convolution results of the image vectors row by row.

[0094] Optionally, the processing module is specifically used to perform:

[0095] The image vector is moved in different directions to obtain multiple movement result matrices of the image vector;

[0096] Expand the multiple move result matrices row by row to obtain the key value matrix.

[0097] Optionally, the obtaining module is specifically used to perform:

[0098] Each query matrix is multiplied by the key-value matrix to obtain the output matrix corresponding to each query matrix;

[0099] The step of outputting the image processing result of the input image based on the plurality of output matrices includes:

[0100] The output matrices corresponding to each query matrix are summed and normalized to obtain the image features of the input image;

[0101] Based on the image features, the image processing result of the input image is output.

[0102] Optionally, the image processing model is an image classification model, and the image processing result is an image classification result; or,

[0103] The image processing model is an object detection model, and the image processing result is the object detection result; or,

[0104] The image processing model is a semantic segmentation model, and the image processing result is a semantic segmentation result.

[0105] It should be noted that the device embodiments are similar to the method embodiments, so the description is relatively simple. For relevant details, please refer to the method embodiments.

[0106] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. The same or similar parts between the various embodiments can be referred to each other.

[0107] Those skilled in the art will understand that embodiments of this disclosure can be provided as methods, apparatus, or computer program products. Therefore, embodiments of this disclosure can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, embodiments of this disclosure can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0108] This disclosure describes embodiments of methods, apparatus, electronic devices, and computer program products according to embodiments of this disclosure with reference to flowchart illustrations and / or block diagrams. It will be understood that each block of the flowchart illustrations and / or block diagrams, and combinations of blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, generate instructions for implementing the flowchart illustrations. Figure 1 One or more processes and / or boxes Figure 1 A device that provides the functions specified in one or more boxes.

[0109] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing terminal device to operate in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means, which are implemented in a process Figure 1 One or more processes and / or boxes Figure 1 The function specified in one or more boxes.

[0110] These computer program instructions can also be loaded onto a computer or other programmable data processing terminal device, causing a series of operational steps to be performed on the computer or other programmable terminal device to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

[0111] While preferred embodiments of the present disclosure have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the present disclosure.

[0112] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or terminal device that includes said element.

[0113] The above provides a detailed description of an image processing method, apparatus, electronic device, and storage medium provided by this disclosure. Specific examples have been used to illustrate the principles and implementation methods of this disclosure. The descriptions of the above embodiments are only for the purpose of helping to understand the method and its core ideas. At the same time, those skilled in the art will recognize that there will be changes in the specific implementation methods and application scope based on the ideas of this disclosure. Therefore, the content of this specification should not be construed as a limitation of this disclosure.

Claims

1. An image processing method, characterized in that the method is applied to an image processing model and comprises: obtaining an image vector of an input image; obtaining a plurality of query matrices from the image vector; performing shift processing on the image vector to obtain a key-value matrix expanded by rows; obtaining, based on the plurality of query matrices and the key-value matrix, an output matrix corresponding to each query matrix, wherein the output matrix corresponding to each query matrix includes the self-attention of the image vector to that query matrix; and outputting, based on the plurality of output matrices, an image processing result of the input image; wherein the step of performing shift processing on the image vector to obtain the key-value matrix expanded by rows comprises: obtaining a plurality of configured convolution kernels, which represent nine different movement directions: upper left, up, upper right, left, middle, right, lower left, down, and lower right; performing depthwise separable convolution on the image vector using the plurality of configured convolution kernels to obtain a plurality of convolution result matrices of the image vector; and expanding the plurality of convolution result matrices of the image vector row by row to obtain the key-value matrix.
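The shift step of claim 1 can be illustrated with a minimal pure-Python sketch. It assumes 3x3 one-hot kernels, zero padding, stride 1, and a key-value layout in which each of the nine convolution results is flattened row by row into one row of the matrix. None of these details is fixed by the claim, and the names `DIRECTIONS`, `one_hot_kernel`, `depthwise_conv3x3`, and `key_value_matrix` are hypothetical.

```python
# Illustrative sketch only: kernel size, padding and the layout of the
# key-value matrix are assumptions, not taken from the claim text.

# (dy, dx) offsets for the nine directions, from upper left to lower right.
DIRECTIONS = [(-1, -1), (-1, 0), (-1, 1),
              (0, -1), (0, 0), (0, 1),
              (1, -1), (1, 0), (1, 1)]


def one_hot_kernel(dy, dx):
    """3x3 kernel with a single 1 that selects the neighbor at offset (dy, dx)."""
    kernel = [[0.0] * 3 for _ in range(3)]
    kernel[dy + 1][dx + 1] = 1.0
    return kernel


def depthwise_conv3x3(x, kernel):
    """Apply the same 3x3 kernel to every channel independently.

    x is a list of channels, each an H x W list of lists.
    Uses cross-correlation with zero padding and stride 1.
    """
    out = []
    for channel in x:
        h, w = len(channel), len(channel[0])
        result = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                for a in range(3):
                    for b in range(3):
                        ii, jj = i + a - 1, j + b - 1
                        if 0 <= ii < h and 0 <= jj < w:
                            result[i][j] += kernel[a][b] * channel[ii][jj]
        out.append(result)
    return out


def key_value_matrix(x):
    """Stack the nine shifted copies of x, each flattened row by row."""
    rows = []
    for dy, dx in DIRECTIONS:
        shifted = depthwise_conv3x3(x, one_hot_kernel(dy, dx))
        rows.append([v for channel in shifted for row in channel for v in row])
    return rows


# One-channel 3x3 example input.
x = [[[1.0, 2.0, 3.0],
      [4.0, 5.0, 6.0],
      [7.0, 8.0, 9.0]]]
kv = key_value_matrix(x)
```

In this sketch the middle kernel reproduces the input unchanged, while every other kernel yields a copy in which each position holds the value of its neighbor in the corresponding direction, so no sliding-window extraction is required.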

2. The method of claim 1, further comprising: obtaining a plurality of learnable convolution kernels, which correspond one-to-one with the plurality of configured convolution kernels; performing convolution on the image vector using the plurality of learnable convolution kernels to obtain a plurality of target convolution result matrices of the image vector; and adding the plurality of convolution result matrices of the image vector to the plurality of target convolution result matrices to obtain a plurality of convolution result sum matrices of the image vector; wherein the step of performing shift processing on the image vector to obtain the key-value matrix expanded by rows comprises: expanding the plurality of convolution result sum matrices of the image vector row by row to obtain the key-value matrix.
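The addition described in claim 2 can be sketched for a single direction and a single channel as follows. The kernel values, the 3x3 size, the zero padding, and the helper names `conv3x3` and `add` are assumptions made for illustration; in a real model the learnable kernel would be a trained parameter.

```python
# Illustrative sketch only: one direction, one channel, hypothetical values.

def conv3x3(img, kernel):
    """3x3 cross-correlation of one channel with zero padding and stride 1."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            for a in range(3):
                for b in range(3):
                    ii, jj = i + a - 1, j + b - 1
                    if 0 <= ii < h and 0 <= jj < w:
                        out[i][j] += kernel[a][b] * img[ii][jj]
    return out


def add(p, q):
    """Element-wise sum of two equally sized matrices."""
    return [[u + v for u, v in zip(row_p, row_q)] for row_p, row_q in zip(p, q)]


img = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0],
       [7.0, 8.0, 9.0]]

# Configured (fixed) kernel for the "right" direction.
configured = [[0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0],
              [0.0, 0.0, 0.0]]

# Hypothetical learnable kernel paired with the configured one.
learnable = [[0.1, 0.0, 0.0],
             [0.0, -0.2, 0.0],
             [0.0, 0.0, 0.05]]

# Sum of the two convolution results, as described in the claim.
summed = add(conv3x3(img, configured), conv3x3(img, learnable))

# Because convolution is linear, convolving once with the summed kernel
# gives the same result up to floating-point rounding.
merged = conv3x3(img, add(configured, learnable))
```

The comparison with `merged` is only a sanity check that follows from the linearity of convolution; the claim itself describes adding the two convolution results.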

3. The method of claim 1, wherein the step of performing shift processing on the image vector to obtain the key-value matrix expanded by rows comprises: moving the image vector in nine different directions, namely upper left, up, upper right, left, middle, right, lower left, down, and lower right, to obtain a plurality of movement result matrices of the image vector; and expanding the plurality of movement result matrices row by row to obtain the key-value matrix.
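The direct movement described in claim 3 might look like the following sketch for a single channel. It assumes that vacated positions are filled with zeros and that "moving" in a direction means each position takes the value of its neighbor in that direction; the claim does not specify either convention, and the names `OFFSETS` and `shift` are hypothetical.

```python
# Illustrative sketch only: fill value and direction convention are assumptions.

# (dy, dx) offset of the neighbor read for each direction.
OFFSETS = {
    "upper left": (-1, -1), "up": (-1, 0), "upper right": (-1, 1),
    "left": (0, -1), "middle": (0, 0), "right": (0, 1),
    "lower left": (1, -1), "down": (1, 0), "lower right": (1, 1),
}


def shift(img, dy, dx, fill=0.0):
    """Return a copy of img where position (i, j) holds img[i + dy][j + dx].

    Positions whose source falls outside the image receive the fill value.
    """
    h, w = len(img), len(img[0])
    return [[img[i + dy][j + dx] if 0 <= i + dy < h and 0 <= j + dx < w else fill
             for j in range(w)]
            for i in range(h)]


img = [[1.0, 2.0, 3.0],
       [4.0, 5.0, 6.0],
       [7.0, 8.0, 9.0]]

# One movement result matrix per direction.
moved = {name: shift(img, dy, dx) for name, (dy, dx) in OFFSETS.items()}

# Expand each movement result row by row; one row of kv per direction.
kv = [[v for row in moved[name] for v in row] for name in OFFSETS]
```

Under these assumptions the result matches what a one-hot 3x3 kernel per direction would produce, which is why the two formulations can be used interchangeably.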

4. The method according to any one of claims 1 to 3, characterized in that the step of obtaining the output matrix corresponding to each query matrix based on the plurality of query matrices and the key-value matrix comprises: multiplying each query matrix by the key-value matrix to obtain the output matrix corresponding to that query matrix; and the step of outputting the image processing result of the input image based on the plurality of output matrices comprises: summing and normalizing the output matrices corresponding to the query matrices to obtain image features of the input image; and outputting the image processing result of the input image based on the image features.

5. The method according to any one of claims 1 to 3, characterized in that the image processing model is an image classification model and the image processing result is an image classification result; or the image processing model is an object detection model and the image processing result is an object detection result; or the image processing model is a semantic segmentation model and the image processing result is a semantic segmentation result.

6. An image processing apparatus, characterized in that the apparatus is applied to an image processing model and comprises: an input module, used to obtain an image vector of an input image; an acquisition module, used to obtain a plurality of query matrices from the image vector; a processing module, used to perform shift processing on the image vector to obtain a key-value matrix expanded by rows; a module configured to obtain, based on the plurality of query matrices and the key-value matrix, an output matrix corresponding to each query matrix, wherein the output matrix corresponding to each query matrix includes the attention of the image vector to that query matrix; and an output module, used to output an image processing result of the input image according to the plurality of output matrices; wherein the processing module is specifically used to: obtain a plurality of configured convolution kernels, which represent nine different movement directions: upper left, up, upper right, left, middle, right, lower left, down, and lower right; perform depthwise separable convolution on the image vector using the plurality of configured convolution kernels to obtain a plurality of convolution result matrices of the image vector; and expand the plurality of convolution result matrices of the image vector row by row to obtain the key-value matrix.

7. An electronic device, comprising: a processor; and a memory used to store instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the image processing method as described in any one of claims 1 to 5.

8. A computer-readable storage medium, characterized in that, when instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is enabled to perform the image processing method as described in any one of claims 1 to 5.