Image processing method, device and non-volatile computer readable storage medium

By introducing streamlined convolutions into the object recognition backbone network and dynamically learning the processing kernel, the problem of information loss in existing technologies is solved, and the recognition accuracy and performance are improved.

CN116468902B · Active · Publication Date: 2026-06-16 · JINGDONG TECH HLDG CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: JINGDONG TECH HLDG CO LTD
Filing Date: 2023-03-10
Publication Date: 2026-06-16

Application Information

IPC: G06V10/44; G06V10/764; G06V10/82; G06N3/08; G06N3/0464

CPC: G06V10/44; G06V10/764; G06V10/82; G06N3/08

AI Tagging

Application Domain

Character and pattern recognition; Neural learning methods

AI Technical Summary

Technical Problem

Existing object recognition backbone networks based on convolutional neural networks and Transformer structures cannot dynamically model and learn different inductive biases at different processing stages, resulting in information loss and decreased recognition accuracy.

Method Used

A streamlined convolution-based visual Transformer backbone network architecture is adopted. By introducing streamlined dependencies between inductive biases of different modules, processing kernels are dynamically allocated to learn specific processing kernels at each resolution stage.

Benefits of Technology

It improves the recognition accuracy and processing performance of object recognition models by combining the two-dimensional inductive bias information of convolutional neural networks and the global self-attention mechanism of the Transformer structure to dynamically learn information between different network blocks.

✦ Generated by Eureka AI based on patent content.

Abstract

The present disclosure relates to a processing method and device of an image and a nonvolatile computer readable storage medium, and relates to the technical field of computer. The processing method comprises: extracting a feature vector of a to-be-processed image by using a plurality of feature extraction layers connected in sequence in a machine learning model, a processing kernel of a current feature extraction layer being determined according to a processing kernel and a processing result of a previous feature extraction layer, the current feature extraction layer being a feature extraction layer other than a first feature extraction layer; and processing the to-be-processed image according to the feature vector. The technical solution of the present disclosure can dynamically learn the processing kernel of the feature extraction layer, improve the recognition accuracy of the feature extraction layer, and thus improve the processing performance.

Description

Technical Field

[0001] This disclosure relates to the field of computer technology, and in particular to an image processing method, an image processing apparatus, and a non-volatile computer-readable storage medium.

Background Technology

[0002] Object recognition is one of the fundamental topics in computer vision. Given images of objects and their corresponding category labels, the goal of object recognition is to use this data to learn an object recognition model that can correctly classify the objects in an image. The design of the backbone network structure for object recognition is an important research direction in this field.

[0003] In related technologies, the backbone network structure for object recognition mainly follows two design frameworks: network design based on CNN (Convolutional Neural Network) and network design based on Transformer modules.

Summary of the Invention

[0004] The inventors of this disclosure have discovered the following problems in the aforementioned related technologies: they are prone to information loss, leading to a decrease in processing performance.

[0005] In view of this, this disclosure proposes an image processing technology that can dynamically learn the processing kernel of the feature extraction layer, improve the recognition accuracy of the feature extraction layer, and thus improve processing performance.

[0006] According to some embodiments of this disclosure, an image processing method is provided, comprising: extracting feature vectors of an image to be processed using multiple sequentially connected feature extraction layers in a machine learning model, wherein the processing kernel of the current feature extraction layer is determined based on the processing kernel and processing result of the previous feature extraction layer, and the current feature extraction layer is a feature extraction layer other than the first feature extraction layer; and processing the image to be processed based on the feature vectors.

[0007] In some embodiments, the processing kernel of the current feature extraction layer is calculated by the following steps: estimating the estimated value of the processing kernel of the current feature extraction layer based on the processing result of the previous feature extraction layer; and determining the processing kernel of the current feature extraction layer based on the estimated value and the processing kernel of the previous feature extraction layer.

[0008] In some embodiments, estimating the processing kernel of the current feature extraction layer includes: dividing multiple channel components in the processing result of the previous feature extraction layer into multiple groups; estimating multiple sub-estimates of the processing kernel of the current feature extraction layer based on each of the multiple groups; and determining the estimated value of the processing kernel of the current feature extraction layer based on the multiple sub-estimates.

[0009] In some embodiments, dividing multiple channel components in the processing result of the previous feature extraction layer into multiple groups includes: performing downsampling processing on the processing result of the previous feature extraction layer to obtain downsampling results; expanding the channel dimensions of the downsampling results to obtain channel dimension expansion results; and dividing the channel dimension expansion results into multiple groups.

[0010] In some embodiments, estimating multiple sub-estimates of the processing kernel of the current feature extraction layer for each of the multiple groups includes: using an SFC (spatial fully connected) layer to process each of the multiple groups to obtain multiple sub-estimates.

[0011] In some embodiments, determining the estimated value of the processing kernel of the current feature extraction layer based on multiple sub-estimates includes: processing the connection results of multiple sub-estimates using a fully connected layer to obtain a fully connected processing result; and performing GN (Group Normalization) processing on the fully connected processing result to determine the estimated value of the processing kernel of the current feature extraction layer.

[0012] In some embodiments, each of the plurality of feature extraction layers includes a convolutional feedforward layer, which includes a convolutional layer and a fully connected layer.

[0013] In some embodiments, the convolutional layer is placed before the fully connected layer.

[0014] In some embodiments, the feature extraction layer includes a layer normalization layer, the output of the convolutional layer is used as the input of the layer normalization layer, and the output of the layer normalization layer is used as the input of the fully connected layer.

[0015] In some embodiments, the plurality of feature extraction layers include a first feature extraction layer and a second feature extraction layer. The first feature extraction layer includes an attention mechanism module, while the second feature extraction layer does not include an attention mechanism module. The output of the attention mechanism module is the input of the convolutional feedforward layer of the first feature extraction layer. The resolution of the data processed by the first feature extraction layer is lower than the resolution of the data processed by the second feature extraction layer.

[0016] In some embodiments, the second feature extraction layer is disposed before the first feature extraction layer.

[0017] In some embodiments, the convolutional layers of the convolutional feedforward layer include depthwise convolutional layers.

[0018] In some embodiments, processing the image to be processed based on the feature vector includes classifying the image to be processed based on the feature vector.

[0019] According to some other embodiments of this disclosure, an image processing apparatus is provided, comprising: an extraction unit, configured to extract feature vectors of an image to be processed using a plurality of sequentially connected feature extraction layers in a machine learning model, wherein the processing kernel of the current feature extraction layer is determined based on the processing kernel and processing result of the previous feature extraction layer, and the current feature extraction layer is a feature extraction layer other than the first feature extraction layer; and a processing unit, configured to process the image to be processed based on the feature vectors.

[0020] In some embodiments, the extraction unit calculates the processing kernel of the current feature extraction layer through the following steps: estimating the estimated value of the processing kernel of the current feature extraction layer based on the processing result of the previous feature extraction layer; and determining the processing kernel of the current feature extraction layer based on the estimated value and the processing kernel of the previous feature extraction layer.

[0021] In some embodiments, the extraction unit divides the multiple channel components in the processing result of the previous feature extraction layer into multiple groups; based on each of the multiple groups, it estimates multiple sub-estimates of the processing kernel of the current feature extraction layer; and based on the multiple sub-estimates, it determines the estimated value of the processing kernel of the current feature extraction layer.

[0022] In some embodiments, the extraction unit performs downsampling on the processing result of the previous feature extraction layer to obtain a downsampling result, expands the channel dimension of the downsampling result to obtain a channel dimension expansion result, and divides the channel dimension expansion result into multiple groups.

[0023] In some embodiments, the extraction unit utilizes the SFC layer to process each of the multiple groups to obtain multiple sub-estimates.

[0024] In some embodiments, the extraction unit utilizes a fully connected layer to process the connection results of multiple sub-estimates to obtain a fully connected processing result; GN processing is then performed on the fully connected processing result to determine the estimated value of the processing kernel of the current feature extraction layer.

[0025] In some embodiments, each of the plurality of feature extraction layers includes a convolutional feedforward layer, which includes a convolutional layer and a fully connected layer.

[0026] In some embodiments, the convolutional layer is placed before the fully connected layer.

[0027] In some embodiments, the feature extraction layer includes a layer normalization layer, the output of the convolutional layer is used as the input of the layer normalization layer, and the output of the layer normalization layer is used as the input of the fully connected layer.

[0028] In some embodiments, the plurality of feature extraction layers include a first feature extraction layer and a second feature extraction layer. The first feature extraction layer includes an attention mechanism module, while the second feature extraction layer does not include an attention mechanism module. The output of the attention mechanism module is the input of the convolutional feedforward layer of the first feature extraction layer. The resolution of the data processed by the first feature extraction layer is lower than the resolution of the data processed by the second feature extraction layer.

[0029] In some embodiments, the second feature extraction layer is disposed before the first feature extraction layer.

[0030] In some embodiments, the convolutional layers of the convolutional feedforward layer include depthwise convolutional layers.

[0031] In some embodiments, the processing unit classifies the image to be processed based on the feature vector.

[0032] According to further embodiments of the present disclosure, an image processing apparatus is provided, comprising: a memory; and a processor coupled to the memory, the processor being configured to perform the image processing method of any of the above embodiments based on instructions stored in the memory.

[0033] According to further embodiments of the present disclosure, a non-volatile computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the image processing method of any of the above embodiments.

[0034] In the above embodiments, during the learning process of the machine learning model, the processing kernel of each feature extraction layer is dynamically learned based on the information between different feature extraction layers, so as to improve the recognition accuracy of the feature extraction layers and thus improve the processing performance of the machine learning model.

Brief Description of the Drawings

[0035] The accompanying drawings, which form part of this specification, illustrate embodiments of this disclosure and, together with the specification, serve to explain the principles of this disclosure.

[0036] This disclosure will become clearer with reference to the accompanying drawings and the following detailed description, wherein:

[0037] Figure 1 is a flowchart illustrating some embodiments of the image processing method of this disclosure;

[0038] Figure 2 is a schematic diagram illustrating some embodiments of the image processing method of this disclosure;

[0039] Figure 3 is a schematic diagram illustrating other embodiments of the image processing method of this disclosure;

[0040] Figure 4 is a schematic diagram illustrating some embodiments of the image processing apparatus of this disclosure;

[0041] Figure 5 is a block diagram illustrating other embodiments of the image processing apparatus of this disclosure;

[0042] Figure 6 is a block diagram illustrating further embodiments of the image processing apparatus of this disclosure.

Detailed Description

[0043] Various exemplary embodiments of the present disclosure will now be described in detail with reference to the accompanying drawings. It should be noted that, unless otherwise specifically stated, the relative arrangement, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of the present disclosure.

[0044] At the same time, it should be understood that, for ease of description, the dimensions of the various parts shown in the accompanying drawings are not drawn according to actual scale.

[0045] The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit this disclosure or its application or use.

[0046] Techniques, methods, and equipment known to those skilled in the art may not be discussed in detail, but where appropriate, such techniques, methods, and equipment should be considered part of the specification.

[0047] In all examples shown and discussed herein, any specific values should be interpreted as merely exemplary and not as limitations. Therefore, other examples of exemplary embodiments may have different values.

[0048] It should be noted that similar labels and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be discussed further in subsequent figures.

[0049] As mentioned earlier, convolutional neural network (CNN)-based backbone network design is a mainstream choice for object recognition backbone network design. CNNs primarily utilize stacked convolutional kernels to extract features from local regions of images, and employ a pyramid-structure downsampling method to progressively expand the receptive field of the CNN in multiple stages, thereby achieving the extraction of global features from the image.
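The way stacked kernels and pyramid downsampling enlarge the receptive field can be illustrated with a short calculation. The sketch below is only an illustration: the function name `receptive_field` and the stage layout (a 3x3 convolution followed by a 2x2 stride-2 downsampling step) are assumptions chosen for the example and are not taken from this disclosure.

```python
def receptive_field(layers):
    """Return the receptive field, in input pixels, after a stack of layers.

    `layers` is a list of (kernel_size, stride) pairs applied in order.
    """
    rf = 1    # a single input pixel sees only itself
    jump = 1  # spacing, in input pixels, between neighbouring outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Hypothetical pyramid stage: 3x3 convolution, then 2x2 stride-2 downsampling.
stage = [(3, 1), (2, 2)]
for n_stages in range(1, 5):
    print(n_stages, receptive_field(stage * n_stages))
```

Under these assumed layers the receptive field grows slowly at first and only covers large image regions after several stages, which is the limitation discussed in the following paragraphs.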

[0050] However, in their early layers, convolutional neural networks can only extract local information from images and cannot directly process global information. Object recognition backbone networks based on the Transformer structure can effectively solve this problem.

[0051] The design of a backbone network based on the Transformer structure relies entirely on the self-attention mechanism between different image patches for feature fusion. Therefore, object recognition backbone networks built using the Transformer structure can obtain global information about the image in the early layers of the network. This characteristic has led to a surge in research and exploration into the design of object recognition backbone networks based on the Transformer structure.

[0052] It is evident that both convolutional neural network-based and Transformer-based design schemes have their own advantages and disadvantages.

[0053] For convolutional neural network (CNN)-based designs, the introduction of local 2D convolutional kernels with inductive biases based on prior knowledge enables CNNs to process high-resolution images quickly and achieve good results even with small training data. However, the inability of the network to obtain global information about the image in the early stages results in some performance loss.

[0054] For Transformer-based designs, training and prediction speeds are often slower than convolutional neural networks. Due to the dense self-attention computation of the Transformer structure, feature calculation for each image patch requires computation along with features from all image patches, making training this architecture typically slow for high-resolution inputs.
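The cost of dense self-attention at high resolution can be made concrete by counting patch pairs. This is a rough sketch only: the helper name `attention_pairs` and the 16-pixel patch size are assumptions for illustration, and the count ignores channel width and the number of heads.

```python
def attention_pairs(height, width, patch):
    """Return (number of patches, number of patch-to-patch interactions)."""
    n_patches = (height // patch) * (width // patch)
    return n_patches, n_patches * n_patches


for side in (224, 448, 896):
    n, pairs = attention_pairs(side, side, patch=16)
    print(f"{side}x{side}: {n} patches, {pairs} pairwise interactions")
```

Doubling the image side quadruples the number of patches and multiplies the pairwise interactions by sixteen, which is why the attention layers dominate the cost for high-resolution inputs.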

[0055] However, Transformer-based networks can perform global attention feature fusion operations on the entire image. With a sufficiently large amount of training data, visual backbone networks designed based on the Transformer architecture can achieve higher recognition accuracy than convolutional neural networks with similar model parameter counts. Although the Transformer typically offers superior performance compared to similar techniques, its computational complexity has gradually become a bottleneck restricting the progress of this useful architecture.

[0056] A hybrid network design that integrates convolutional neural networks and Transformer structures allows the visual backbone network to possess the advantages of both.

[0057] Self-attention operations can be integrated into convolutional neural network blocks. For example, local self-attention learning can be employed at each local block in a convolutional neural network architecture, thereby reducing the computationally intensive self-attention computations of the Transformer structure while preserving the prior knowledge of the convolutional neural network.

[0058] Convolutional operations can be integrated into the Transformer architecture. For example, convolutional layers can be inserted into self-attention modules or feedforward layers to build the backbone network; alternatively, the outputs of self-attention modules and convolutional layers can be fused within each Transformer module. The aim is to introduce an inductive bias for accurate 2D region structure modeling into the visual backbone network based on the Transformer architecture.

[0059] The machine learning models described above all rely on typical convolutions to impose inductive bias within each Transformer module. However, after training, the input resolution and kernel of the feature maps learned at each layer are statically fixed; moreover, the independent modeling of the two-dimensional structural inductive bias within each block ignores the inductive bias information of different blocks with inputs at other different resolutions. This results in information loss and limits the improvement of recognition accuracy.

[0060] It is evident that the object recognition backbone network based on a hybrid design of convolutional neural networks and Transformer structures suffers from the technical problem of being unable to dynamically model and learn different inductive biases at different processing stages.

[0061] To address the aforementioned technical issues, this disclosure proposes a novel hybrid architecture for a visual Transformer backbone network based on streamlined convolutions. By introducing streamlined dependencies between inductive biases of different modules, the processing kernels are dynamically allocated to streamlined convolutions.

[0062] Under this hybrid architecture, the model can learn a specific processing kernel for each resolution under the guidance of the processing kernel of the Transformer module. This allows the model to dynamically consider information from other stage blocks during the learning process at different stages, thereby improving the model's recognition accuracy.

[0063] For example, the technical solution of this disclosure can be implemented through the following embodiments.

[0064] Figure 1 shows a flowchart of some embodiments of the image processing method of this disclosure.

[0065] As shown in Figure 1, in step 110, multiple feature extraction layers connected sequentially in the machine learning model are used to extract the feature vector of the image to be processed. The processing kernel of the current feature extraction layer is determined based on the processing kernel and processing result of the previous feature extraction layer. The current feature extraction layer is a feature extraction layer other than the first feature extraction layer.

[0066] In some embodiments, each of the plurality of feature extraction layers includes a CFF (Convolutional Feedforward) layer, which comprises convolutional layers and fully connected layers. For example, the convolutional layers of a CFF layer include depthwise convolutional layers.

[0067] For example, the receptive field can be expanded by increasing the size of the processing kernel in the depthwise convolutional layer, thereby improving the performance of feature extraction.

[0068] In some embodiments, the convolutional layer is positioned before the fully connected layers. For example, a CFF layer may include a dynamically learnable depthwise convolutional (DWConv) layer and two fully connected (FC) layers. The DWConv layer can be positioned before the two FC layers, that is, moved up within the CFF layer, thereby improving the performance of feature extraction.

[0069] In some embodiments, the feature extraction layer includes an LN (layer normalization) layer, the output of the convolutional layer is used as the input of the LN layer, and the output of the LN layer is used as the input of the fully connected layer.

[0070] For example, the l-th feature extraction layer includes the l-th streamlined Transformer module, which in turn includes a CFF layer, and its input features are... The processing procedure for the CFF layer is shown in the following formula:

[0071]

[0072]

[0073] Here, DWConv(·) denotes a depthwise convolution operation, and θ_l denotes the learnable processing kernel parameters corresponding to the l-th streamlined Transformer module.
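The ordering described for the CFF layer (depthwise convolution, then layer normalization, then two fully connected layers) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the formula of this disclosure: the function names, the GELU activation between the two FC layers, the zero padding, and the absence of biases and residual connections are all assumptions made for the example.

```python
import math


def depthwise_conv(x, theta):
    """Filter channel c of x ([C][H][W]) only with theta[c] ([k][k]).

    Implemented as cross-correlation with zero padding and stride 1.
    """
    out = []
    for channel, kernel in zip(x, theta):
        k, h, w = len(kernel), len(channel), len(channel[0])
        pad = k // 2  # keeps the spatial size for odd k
        filtered = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                for di in range(k):
                    for dj in range(k):
                        ii, jj = i + di - pad, j + dj - pad
                        if 0 <= ii < h and 0 <= jj < w:
                            filtered[i][j] += kernel[di][dj] * channel[ii][jj]
        out.append(filtered)
    return out


def layer_norm(v, eps=1e-5):
    """Normalize one token (a list of channel values) without affine terms."""
    mean = sum(v) / len(v)
    var = sum((a - mean) ** 2 for a in v) / len(v)
    return [(a - mean) / math.sqrt(var + eps) for a in v]


def gelu(a):
    return 0.5 * a * (1.0 + math.erf(a / math.sqrt(2.0)))


def linear(v, weight):
    """Bias-free fully connected layer; weight has shape [out][in]."""
    return [sum(w * a for w, a in zip(row, v)) for row in weight]


def cff(x, theta, w1, w2):
    """Assumed CFF ordering: DWConv -> LN -> FC -> GELU -> FC, per position."""
    y = depthwise_conv(x, theta)
    h, w = len(y[0]), len(y[0][0])
    out = [[[0.0] * w for _ in range(h)] for _ in range(len(w2))]
    for i in range(h):
        for j in range(w):
            token = layer_norm([channel[i][j] for channel in y])
            hidden = [gelu(a) for a in linear(token, w1)]
            for c, value in enumerate(linear(hidden, w2)):
                out[c][i][j] = value
    return out


# Toy input: 2 channels, 2x2 spatial size, identity 3x3 kernels.
x = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
theta = [identity, identity]
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # expands 2 channels to 3
w2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # projects 3 channels back to 2
y = cff(x, theta, w1, w2)
print(len(y), len(y[0]), len(y[0][0]))
```

In the method described here, `theta` would not be a fixed weight but the kernel produced for the current layer, so the same `cff` function could be called with a different kernel at each layer.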

[0074] In some embodiments, the plurality of feature extraction layers includes a first feature extraction layer and a second feature extraction layer. The first feature extraction layer includes an attention mechanism module, while the second feature extraction layer does not. The output of the attention mechanism module is the input to the convolutional feedforward layer of the first feature extraction layer. The resolution of the data processed by the first feature extraction layer is lower than the resolution of the data processed by the second feature extraction layer. For example, the second feature extraction layer is positioned before the first feature extraction layer.

[0075] For example, the attention mechanism module includes an MHA (Multi-Head Self-Attention) layer.

[0076] For example, a machine learning model includes four processing stages from high-resolution input to low-resolution input. For the first two processing stages with high-resolution input, the cumbersome MHA is removed, and only the CFF layer is used as the second feature extraction layer; for the last two stages with low-resolution input, a streamlined Transformer module consisting of a stack of MHA and CFF layers is used as the first feature extraction layer.
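The four-stage layout in this example can be written down as a small configuration. The sketch below is illustrative only: `StageSpec`, `build_stages`, and the parameter `attention_from` are hypothetical names, and channel sizes are omitted because they are not given in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageSpec:
    index: int            # 0-based stage index, from high to low resolution
    use_attention: bool   # True for stages that stack MHA before CFF

    @property
    def layers(self):
        return ("MHA", "CFF") if self.use_attention else ("CFF",)


def build_stages(num_stages=4, attention_from=2):
    """Stages before `attention_from` use CFF only; later stages use MHA + CFF."""
    return [StageSpec(i, i >= attention_from) for i in range(num_stages)]


for spec in build_stages():
    print(f"stage {spec.index + 1}: {' -> '.join(spec.layers)}")
```

Keeping attention out of the high-resolution stages avoids the quadratic attention cost where the number of positions is largest.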

[0077] For example, a patch (block) embedding layer can be used to simultaneously increase the channel size of the feature extraction layer and reduce the spatial resolution.

[0078] In some embodiments, machine learning models with different model sizes can be set. For example, three model sizes can be set for the machine learning model: a small-sized feature extraction layer, a basic-sized feature extraction layer, and a large-sized feature extraction layer.

[0079] For example, Table 1 lists the architectures for the three model sizes. E_i and C_i are the expansion ratio and channel dimension of stage i, respectively, and each stage also specifies the number of heads in its multi-head self-attention / token feature mixer layer:

[0080] Table 1

[0081]

[0082] In some embodiments, the processing kernel of the current feature extraction layer is calculated by the following steps: estimating the estimated value of the processing kernel of the current feature extraction layer based on the processing result of the previous feature extraction layer; and determining the processing kernel of the current feature extraction layer based on the estimated value and the processing kernel of the previous feature extraction layer.

[0083] If the processing of each channel shares the same parameters, the learning complexity of the machine learning model will be limited, resulting in insufficient flexibility; if each channel is processed using an independent layer with different parameters, the computational load of the machine learning model will increase.

[0084] In some embodiments, to address the aforementioned technical problems, an MHM (multi-head mixer) layer can be used to group all channels; only channels within the same group share the same parameters during processing, and different groups do not share parameters.

[0085] For example, the multiple channel components in the processing result of the previous feature extraction layer are divided into multiple groups; based on each of the multiple groups, multiple sub-estimates of the processing kernel of the current feature extraction layer are estimated; based on the multiple sub-estimates, the estimated value of the processing kernel of the current feature extraction layer is determined.

[0086] This improves the flexibility of machine learning models and limits the growth rate of machine learning model parameters, thus achieving a balance between overhead and flexibility.
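The trade-off between sharing and flexibility can be shown by counting spatial weights. This is a rough sketch under assumptions: each group is taken to share one square matrix over the Kh×Kw spatial positions, biases are ignored, and the name `sfc_param_count` and the sizes used are hypothetical.

```python
def sfc_param_count(channels, spatial, groups):
    """Weights needed when each group of channels shares one spatial x spatial matrix."""
    if channels % groups != 0:
        raise ValueError("channels must be divisible by groups")
    return groups * spatial * spatial


channels, spatial = 64, 3 * 3  # hypothetical: 64 channels, 3x3 kernel positions
for groups in (1, 8, channels):
    print(groups, sfc_param_count(channels, spatial, groups))
```

With one group all channels share a single matrix, with one group per channel every channel has its own matrix, and an intermediate number of groups sits between the two extremes.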

[0087] In some embodiments, the processing result of the previous feature extraction layer is downsampled to obtain a downsampling result; the channel dimension of the downsampling result is expanded to obtain a channel dimension expansion result; and the channel dimension expansion result is divided into multiple groups.

[0088] For example, an AAP (Adaptive Average Pooling) operation is used to downsample the input features to the kernel size Kh×Kw, where Kh and Kw are the height and width of the output kernel, respectively. An FC layer and a GELU (Gaussian Error Linear Unit) activation function σ are then used to expand the channel dimension. Downsampling and channel dimension expansion can be performed using the following formula:

[0089]

[0090]
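The downsampling, channel expansion, and grouping steps of paragraphs [0087] and [0088] can be sketched as follows. This is a minimal sketch, not the formula of this disclosure: the function names, the window rule used for adaptive average pooling, the bias-free FC layer, and applying GELU directly after the FC layer are assumptions.

```python
import math


def adaptive_avg_pool(channel, out_h, out_w):
    """Average-pool one channel ([H][W]) down to [out_h][out_w]."""
    h, w = len(channel), len(channel[0])
    pooled = []
    for i in range(out_h):
        r0, r1 = (i * h) // out_h, ((i + 1) * h + out_h - 1) // out_h
        row = []
        for j in range(out_w):
            c0, c1 = (j * w) // out_w, ((j + 1) * w + out_w - 1) // out_w
            window = [channel[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


def gelu(a):
    return 0.5 * a * (1.0 + math.erf(a / math.sqrt(2.0)))


def expand_channels(pooled, weight):
    """Mix channels at each position with an FC layer, then apply GELU.

    pooled: [C][Kh][Kw]; weight: [C_out][C]; returns [C_out][Kh][Kw].
    """
    kh, kw = len(pooled[0]), len(pooled[0][0])
    return [
        [[gelu(sum(wc * pooled[c][i][j] for c, wc in enumerate(row)))
          for j in range(kw)] for i in range(kh)]
        for row in weight
    ]


def split_groups(features, num_groups):
    """Split the channel axis into equally sized groups."""
    if len(features) % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    size = len(features) // num_groups
    return [features[g * size:(g + 1) * size] for g in range(num_groups)]


# Toy input: one 4x4 channel pooled to a 2x2 kernel size, expanded to 4 channels.
channel = [[float(4 * r + c) for c in range(4)] for r in range(4)]
pooled = [adaptive_avg_pool(channel, 2, 2)]
expanded = expand_channels(pooled, [[1.0], [0.5], [-1.0], [2.0]])
groups = split_groups(expanded, 2)
print(pooled[0], len(expanded), len(groups))
```

Each resulting group would then be handled by its own head, so that parameters are shared only within a group.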

[0091] In some embodiments, the SFC layer is used to process each of the multiple groups separately to obtain multiple sub-estimates head_i. For example, a fully connected layer can be used to process the concatenation result z_l of the multiple sub-estimates to obtain a fully connected processing result; GN processing is then performed on the fully connected processing result to determine the estimated value of the processing kernel of the current feature extraction layer.

[0092] For example, to encourage spatial interaction in the input, the MHM layer only shares the same parameters within the channel groups of each head, thus achieving an overhead balance between parameter budget and flexibility; finally, a fully connected layer is used to generate the processing kernel.

[0093] For example, residual connections can be used to implement the streamlined kernel generation module. That is, the estimated value of the current layer's processing kernel is aggregated with the processing kernel θ_{l-1} of the previous layer, which enhances the propagation of kernel information.

[0094] For example, the overall operation of the kernel generation module can be implemented using the following formula:

[0095]

[0096]

[0097]

[0098] Here, the channel dimension expansion result is divided into hd groups of equal size, and each head processes the i-th group assigned to it. Concat(·) denotes the concatenation operation, GN(·) denotes group normalization, and SFC(·) denotes the spatial fully connected layer, which implements matrix multiplication across spatial locations.
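The operations named in this paragraph (SFC per group, Concat, a fully connected layer, and GN) can be chained in a short sketch. This is a sketch under assumptions and does not reproduce the formulas of this disclosure: spatial positions are flattened to a list of length S = Kh×Kw, the layers have no biases or affine terms, and the function names are hypothetical.

```python
import math


def sfc(group, weight):
    """Apply one S x S spatial matrix to every channel of a group ([Cg][S])."""
    return [[sum(w * a for w, a in zip(row, chan)) for row in weight] for chan in group]


def channel_fc(z, weight):
    """Mix channels at each spatial position; z: [C][S], weight: [C_out][C]."""
    positions = range(len(z[0]))
    return [[sum(w * z[c][p] for c, w in enumerate(row)) for p in positions]
            for row in weight]


def group_norm(z, num_groups, eps=1e-5):
    """Normalize each group of channels to zero mean and unit variance."""
    size = len(z) // num_groups
    out = []
    for g in range(num_groups):
        block = z[g * size:(g + 1) * size]
        values = [a for chan in block for a in chan]
        mean = sum(values) / len(values)
        var = sum((a - mean) ** 2 for a in values) / len(values)
        out.extend([[(a - mean) / math.sqrt(var + eps) for a in chan] for chan in block])
    return out


def estimate_kernel(groups, sfc_weights, fc_weight, gn_groups):
    """Heads -> Concat -> FC -> GN, giving the kernel estimate for the current layer."""
    heads = [sfc(group, weight) for group, weight in zip(groups, sfc_weights)]
    z = [chan for head in heads for chan in head]  # Concat along the channel axis
    return group_norm(channel_fc(z, fc_weight), gn_groups)


# Toy example: 2 heads, 1 channel per head, 2 spatial positions.
groups = [[[1.0, 2.0]], [[3.0, 4.0]]]
sfc_weights = [[[1.0, 0.0], [0.0, 1.0]],   # head 1 keeps positions as they are
               [[0.0, 1.0], [1.0, 0.0]]]   # head 2 swaps the two positions
fc_weight = [[1.0, 0.0], [0.0, 1.0]]
estimate = estimate_kernel(groups, sfc_weights, fc_weight, gn_groups=2)
print(estimate)
```

Only channels inside the same group share an SFC matrix, which is the grouping behaviour attributed to the MHM layer above.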

[0099] In step 120, the image to be processed is processed based on the feature vector. For example, the image to be processed is classified based on the feature vector.

[0100] In the above embodiments, to address the technical problems that the Transformer module lacks correct inductive bias in 2D region structure modeling, and that the hybrid network of convolutional neural network and Transformer module cannot dynamically learn the convolution kernel function of each module, a network module that can dynamically learn the convolution kernel function of different modules is proposed.

[0101] In other words, a Transformer architecture module based on streamlined convolution is proposed. This module upgrades the FF layer in the Transformer block with a streamlined convolution whose kernel is dynamically learned through a streamlined kernel generated in another path.

[0102] Figure 2 shows a schematic diagram of some embodiments of the image processing method of this disclosure.

[0103] As shown in Figure 2, the architecture of the streamlined Transformer module comprises two paths: the hybrid Transformer path and the streamlined kernel generation (KG) path. The hybrid Transformer path can consist of an MHA layer and the CFF layer proposed in this disclosure. The CFF layer incorporates a dynamically learned depthwise convolution into the FF layer to capture inductive bias; the KG path collects the input of the current layer and the processing kernel of the previous layer, and from them generates a dedicated processing kernel for the depthwise convolution in the current layer.

[0104] In some embodiments, the feature extraction layer of the machine learning model employs a hybrid architecture based on convolutional neural networks and Transformer structures. This hybrid architecture can be referred to as a streamlined Transformer module.

[0105] For example, each feature extraction layer consists of two paths. One is a hybrid Transformer path, which replaces the original FF layer in the Transformer module with the CFF layer, that is, an FF layer with an additional depthwise convolutional layer, to capture the inductive bias. The other is a streamlined KG path, which collects the input features of the current layer and the processing kernel of the depthwise convolution in the previous layer.

[0106] The goal of this hybrid architecture is to dynamically generate a dedicated processing kernel for the depthwise convolutional layer of each structure block (feature extraction layer).

[0107] The CFF layer can consist of a dynamically learned depthwise convolution (DWConv) layer and two FC layers. The DWConv layer can be placed before the two FC layers, that is, moved to the front of the CFF layer, thereby improving feature extraction performance.
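The ordering described above, a DWConv layer followed by two FC layers, can be sketched as follows. This is an illustrative example under stated assumptions rather than the implementation of this disclosure: the channels-first (C, H, W) layout, zero padding with stride 1, the GELU activation between the two FC layers, and the names `depthwise_conv`, `gelu` and `cff` are all chosen for the example, and normalization is omitted for brevity.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def depthwise_conv(x, kernel):
    """x: (C, H, W); kernel: (C, Kh, Kw) with odd Kh, Kw.
    Each channel is filtered by its own kernel (zero padding, stride 1)."""
    c, h, w = x.shape
    _, kh, kw = kernel.shape
    padded = np.pad(x, ((0, 0), (kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros((c, h, w))
    for i in range(kh):
        for j in range(kw):
            # Accumulate the contribution of kernel tap (i, j) for every channel.
            out += kernel[:, i, j][:, None, None] * padded[:, i:i + h, j:j + w]
    return out

def cff(x, kernel, w1, w2):
    """DWConv first, then two FC layers applied per spatial position."""
    y = depthwise_conv(x, kernel)
    tokens = y.reshape(y.shape[0], -1).T  # (H*W, C)
    hidden = gelu(tokens @ w1)            # first FC layer, (H*W, C_hidden)
    out = hidden @ w2                     # second FC layer, (H*W, C)
    return out.T.reshape(x.shape)
```

In this sketch `kernel` is passed in as an argument rather than stored as a fixed weight, which is where a kernel generated by the KG path would enter.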

[0108] For example, the receptive field can be expanded by increasing the size of the processing kernel in a depthwise convolutional layer, thereby improving the performance of feature extraction.

[0109] For example, the l-th feature extraction layer includes the l-th streamlined Transformer module, which in turn includes a CFF layer, and its input features are... The processing procedure for the CFF layer is shown in the following formula:

[0110]

[0111]

[0112] Each streamlined kernel generation module in the KG path is designed to generate a dedicated processing kernel for the depthwise convolutions within the corresponding CFF layer. For example, the KG path can be implemented as in the example shown in Figure 3.

[0113] Figure 3 is a schematic diagram illustrating other embodiments of the image processing method of this disclosure.

[0114] As shown in Figure 3, the input features are downsampled to a size of Kh×Kw using an AAP operation, where Kh and Kw are the height and width of the output kernel, respectively; an FC layer and the GELU activation function σ are then used to expand the channel dimension of the kernel.
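The downsampling and channel expansion steps can be sketched as follows. The sketch assumes that AAP denotes adaptive average pooling, which the text does not spell out, and uses the bin boundaries commonly used for that operation. The function names and the (C, H, W) layout are likewise assumptions made for illustration.

```python
import numpy as np

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def adaptive_avg_pool(x, kh, kw):
    """Average-pool x of shape (C, H, W) down to (C, kh, kw)."""
    c, h, w = x.shape
    out = np.empty((c, kh, kw))
    for i in range(kh):
        r0 = (i * h) // kh               # floor of the bin start
        r1 = -(-((i + 1) * h) // kh)     # ceiling of the bin end
        for j in range(kw):
            c0 = (j * w) // kw
            c1 = -(-((j + 1) * w) // kw)
            out[:, i, j] = x[:, r0:r1, c0:c1].mean(axis=(1, 2))
    return out

def expand_channels(pooled, weight):
    """FC layer plus GELU across channels: (C, kh, kw) -> (C_exp, kh * kw)."""
    c, kh, kw = pooled.shape
    tokens = pooled.reshape(c, kh * kw).T  # (kh*kw, C)
    return gelu(tokens @ weight).T         # weight has shape (C, C_exp)
```

Because the pooled size is fixed at Kh×Kw, the generated kernel has the same spatial size regardless of the input resolution.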

[0115] For example, to encourage spatial interaction in the input, the MHM layer shares the same parameters only within the channel group of each head, thus balancing the parameter budget against flexibility; finally, a fully connected layer is used to generate the processing kernel.

[0116] For example, residual connections can be used to implement the streamlined kernel generation modules. That is, the estimated value of the current layer's processing kernel is aggregated with the processing kernel θ of the previous layer l-1, which enhances the propagation of kernel information.
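The residual aggregation across layers can be sketched as follows. The sketch assumes that the aggregation is a plain element-wise sum, which is the usual form of a residual connection but is not stated explicitly in the text. The function name `streamline_kernels` and the callable estimators are hypothetical.

```python
import numpy as np

def streamline_kernels(first_kernel, estimators, features):
    """Propagate processing kernels layer by layer.

    first_kernel: kernel of the first layer, which has no predecessor.
    estimators:   one callable per later layer, mapping the previous layer's
                  output features to a kernel estimate.
    features:     outputs of the previous layers, aligned with `estimators`.
    """
    kernels = [first_kernel]
    for estimate, feat in zip(estimators, features):
        # Residual aggregation: previous kernel plus the current estimate.
        kernels.append(kernels[-1] + estimate(feat))
    return kernels
```

Each kernel therefore carries information from all earlier layers, not only from the features of the immediately preceding one.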

[0117] For example, the operation of the kernel generation module in the KG path can be implemented using the following formula:

[0118]

[0119]

[0120]

[0121]

[0122]

[0123] For example, a machine learning model includes four processing stages from high-resolution input to low-resolution input. For the first two processing stages with high-resolution input, the cumbersome MHA is removed, and only the CFF layer is used as the second feature extraction layer; for the last two stages with low-resolution input, a streamlined Transformer module consisting of a stack of MHA and CFF layers is used as the first feature extraction layer.
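The stage layout described above can be written down as a small configuration helper. This is only a sketch: the function name `build_stage_layout` and the string labels are hypothetical, and the number of blocks per stage is not modeled.

```python
def build_stage_layout(num_stages=4, attention_free_stages=2):
    """Return the sub-layers used by a block in each processing stage.

    High-resolution stages (the first `attention_free_stages`) use only
    the CFF layer; the remaining low-resolution stages stack MHA and CFF.
    """
    return [
        ["CFF"] if stage < attention_free_stages else ["MHA", "CFF"]
        for stage in range(num_stages)
    ]
```

Attention is dropped only where the token count is largest, that is, in the high-resolution stages, where self-attention is most expensive.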

[0124] For example, a block embedding layer approach can be used to simultaneously increase the channel size of the feature extraction layer and reduce the spatial resolution.
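One way to increase the channel size while reducing the spatial resolution is sketched below. It assumes that the block embedding layer behaves like what is commonly called a patch embedding, in which non-overlapping patches are flattened and linearly projected. The text does not confirm this, and the name `patch_embed` and the (C, H, W) layout are chosen for illustration.

```python
import numpy as np

def patch_embed(x, weight, patch):
    """x: (C, H, W) with H and W divisible by `patch`.
    weight: (C * patch * patch, C_out) linear projection.
    Returns an array of shape (C_out, H // patch, W // patch)."""
    c, h, w = x.shape
    hp, wp = h // patch, w // patch
    # Cut the image into non-overlapping patches: (hp, wp, C, patch, patch).
    patches = x.reshape(c, hp, patch, wp, patch).transpose(1, 3, 0, 2, 4)
    tokens = patches.reshape(hp * wp, c * patch * patch)
    out = tokens @ weight  # project each flattened patch to C_out channels
    return out.T.reshape(-1, hp, wp)
```

With `patch=2`, each stage transition halves the height and width, and C_out can be chosen larger than C to widen the channels.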

[0125] In some embodiments, machine learning models with different model sizes can be set. For example, three model sizes can be set for the machine learning model: a small-sized feature extraction layer, a basic-sized feature extraction layer, and a large-sized feature extraction layer.

[0126] For example, Table 1 lists the architectures for three model sizes.

[0127] In the above embodiments, a design scheme for a visual Transformer backbone network based on streamlined convolution is proposed. As a novel design paradigm for object recognition backbone networks, this backbone network, through its streamlined convolutional structure, effectively combines the two-dimensional inductive bias information of convolutional neural networks with the global self-attention mechanism of the Transformer structure. Furthermore, it can take into account the information passed between different network blocks during model learning and dynamically learn the depthwise convolution kernels in different network blocks, thereby further improving the recognition accuracy of the visual backbone network.

[0128] Figure 4 is a schematic diagram illustrating some embodiments of the image processing apparatus of this disclosure.

[0129] As shown in Figure 4, the image processing device 4 includes: an extraction unit 41, used to extract feature vectors of the image to be processed using multiple feature extraction layers sequentially connected in a machine learning model, wherein the processing kernel of the current feature extraction layer is determined based on the processing kernel and processing result of the previous feature extraction layer, and the current feature extraction layer is a feature extraction layer other than the first feature extraction layer; and a processing unit 42, used to process the image to be processed based on the feature vectors.

[0130] In some embodiments, the extraction unit 41 calculates the processing kernel of the current feature extraction layer through the following steps: estimating the estimated value of the processing kernel of the current feature extraction layer based on the processing result of the previous feature extraction layer; and determining the processing kernel of the current feature extraction layer based on the estimated value and the processing kernel of the previous feature extraction layer.

[0131] In some embodiments, the extraction unit 41 divides the multiple channel components in the processing result of the previous feature extraction layer into multiple groups; according to each of the multiple groups, it estimates multiple sub-estimates of the processing kernel of the current feature extraction layer; and according to the multiple sub-estimates, it determines the estimated value of the processing kernel of the current feature extraction layer.

[0132] In some embodiments, the extraction unit 41 performs downsampling on the processing result of the previous feature extraction layer to obtain a downsampling result, expands the channel dimension of the downsampling result to obtain a channel dimension expansion result, and divides the channel dimension expansion result into multiple groups.

[0133] In some embodiments, the extraction unit 41 utilizes the SFC layer to process each of the multiple groups to obtain multiple sub-estimates.

[0134] In some embodiments, the extraction unit 41 utilizes a fully connected layer to process the concatenation result of the multiple sub-estimates to obtain a fully connected processing result; the fully connected processing result is then subjected to GN processing to determine the estimated value of the processing kernel of the current feature extraction layer.

[0135] In some embodiments, each of the plurality of feature extraction layers includes a convolutional feedforward layer, which includes a convolutional layer and a fully connected layer.

[0136] In some embodiments, the convolutional layer is placed before the fully connected layer.

[0137] In some embodiments, the feature extraction layer includes a layer normalization layer, the output of the convolutional layer is used as the input of the layer normalization layer, and the output of the layer normalization layer is used as the input of the fully connected layer.

[0138] In some embodiments, the plurality of feature extraction layers include a first feature extraction layer and a second feature extraction layer. The first feature extraction layer includes an attention mechanism module, while the second feature extraction layer does not include an attention mechanism module. The output of the attention mechanism module is the input of the convolutional feedforward layer of the first feature extraction layer. The resolution of the data processed by the first feature extraction layer is lower than the resolution of the data processed by the second feature extraction layer.

[0139] In some embodiments, the second feature extraction layer is disposed before the first feature extraction layer.

[0140] In some embodiments, the convolutional layers of the convolutional feedforward layer include depthwise convolutional layers.

[0141] In some embodiments, the processing unit 42 classifies the image to be processed based on the feature vector.

[0142] Figure 5 is a block diagram illustrating other embodiments of the image processing apparatus of this disclosure.

[0143] As shown in Figure 5, the image processing apparatus 5 of this embodiment includes a memory 51 and a processor 52 coupled to the memory 51. The processor 52 is configured to execute an image processing method in any embodiment of this disclosure based on instructions stored in the memory 51.

[0144] The memory 51 may include, for example, system memory, fixed non-volatile storage media, etc. The system memory stores, for example, the operating system, application programs, boot loader, database, and other programs.

[0145] Figure 6 is a block diagram illustrating further embodiments of the image processing apparatus of this disclosure.

[0146] As shown in Figure 6, the image processing apparatus 6 of this embodiment includes a memory 610 and a processor 620 coupled to the memory 610. The processor 620 is configured to execute the image processing method of any of the foregoing embodiments based on instructions stored in the memory 610.

[0147] The memory 610 may include, for example, system memory, fixed non-volatile storage media, etc. The system memory may store, for example, the operating system, application programs, boot loader, and other programs.

[0148] The image processing device 6 may also include an input/output interface 630, a network interface 640, and a storage interface 650. These interfaces 630, 640, and 650, as well as the memory 610 and processor 620, can be connected, for example, via a bus 660. The input/output interface 630 provides a connection interface for input/output devices such as a monitor, mouse, keyboard, touchscreen, microphone, and speakers. The network interface 640 provides a connection interface for various networked devices. The storage interface 650 provides a connection interface for external storage devices such as SD cards and USB flash drives.

[0149] Those skilled in the art will understand that embodiments of this disclosure can be provided as methods, systems, or computer program products. Therefore, this disclosure can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this disclosure can take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0150] The image processing method, image processing apparatus, and non-volatile computer-readable storage medium according to this disclosure have been described in detail above. To avoid obscuring the concept of this disclosure, some details known in the art have not been described. Those skilled in the art will fully understand how to implement the technical solutions disclosed herein based on the above description.

[0151] The methods and systems of this disclosure may be implemented in many ways. For example, they may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order of steps for the methods is for illustrative purposes only, and the steps of the methods of this disclosure are not limited to the specific order described above unless otherwise specifically stated. Furthermore, in some embodiments, this disclosure may also be implemented as a program recorded on a recording medium, the program including machine-readable instructions for implementing the methods according to this disclosure. Thus, this disclosure also covers recording media storing programs for performing the methods according to this disclosure.

[0152] While specific embodiments of this disclosure have been described in detail by way of example, those skilled in the art should understand that the examples are for illustrative purposes only and not intended to limit the scope of this disclosure. Those skilled in the art should understand that modifications can be made to the above embodiments without departing from the scope and spirit of this disclosure. The scope of this disclosure is defined by the appended claims.

Claims

1. An image processing method, comprising: extracting a feature vector of an image to be processed by using multiple sequentially connected feature extraction layers in a machine learning model, wherein the processing kernel of a current feature extraction layer is determined based on the processing kernel and the processing result of the previous feature extraction layer, and the current feature extraction layer is a feature extraction layer other than the first feature extraction layer; and processing the image to be processed based on the feature vector; wherein the processing kernel of the current feature extraction layer is calculated through the following steps: estimating an estimated value of the processing kernel of the current feature extraction layer based on the processing result of the previous feature extraction layer; and determining the processing kernel of the current feature extraction layer based on the estimated value and the processing kernel of the previous feature extraction layer.

2. The processing method according to claim 1, wherein estimating the estimated value of the processing kernel of the current feature extraction layer comprises: dividing the multiple channel components in the processing result of the previous feature extraction layer into multiple groups; estimating, according to each of the multiple groups, multiple sub-estimates of the processing kernel of the current feature extraction layer; and determining the estimated value of the processing kernel of the current feature extraction layer based on the multiple sub-estimates.

3. The processing method according to claim 2, wherein dividing the multiple channel components in the processing result of the previous feature extraction layer into multiple groups comprises: downsampling the processing result of the previous feature extraction layer to obtain a downsampling result; expanding the channel dimension of the downsampling result to obtain a channel dimension expansion result; and dividing the channel dimension expansion result into the multiple groups.

4. The processing method according to claim 2, wherein estimating, according to each of the multiple groups, multiple sub-estimates of the processing kernel of the current feature extraction layer comprises: processing each of the multiple groups separately using a spatial fully connected (SFC) layer to obtain the multiple sub-estimates.

5. The processing method according to claim 2, wherein determining the estimated value of the processing kernel of the current feature extraction layer based on the multiple sub-estimates comprises: processing the concatenation result of the multiple sub-estimates using a fully connected layer to obtain a fully connected processing result; and performing group normalization (GN) processing on the fully connected processing result to determine the estimated value of the processing kernel of the current feature extraction layer.

6. The processing method according to claim 1, wherein each of the plurality of feature extraction layers includes a convolutional feedforward layer, which comprises a convolutional layer and a fully connected layer.

7. The processing method according to claim 6, wherein the convolutional layer is positioned before the fully connected layer.

8. The processing method according to claim 7, wherein the feature extraction layer includes a normalization layer, the output of the convolutional layer is used as the input of the normalization layer, and the output of the normalization layer is used as the input of the fully connected layer.

9. The processing method according to claim 6, wherein the plurality of feature extraction layers include a first feature extraction layer and a second feature extraction layer; the first feature extraction layer includes an attention mechanism module, while the second feature extraction layer does not include the attention mechanism module; the output of the attention mechanism module is the input of the convolutional feedforward layer of the first feature extraction layer; and the resolution of the data processed by the first feature extraction layer is lower than the resolution of the data processed by the second feature extraction layer.

10. The processing method according to claim 9, wherein the second feature extraction layer is positioned before the first feature extraction layer.

11. The processing method according to claim 6, wherein the convolutional layers of the convolutional feedforward layer include depthwise convolutional layers.

12. The processing method according to any one of claims 1 to 11, wherein processing the image to be processed based on the feature vector comprises: classifying the image to be processed based on the feature vector.

13. An image processing apparatus, comprising: an extraction unit configured to extract a feature vector of an image to be processed by using multiple sequentially connected feature extraction layers in a machine learning model, wherein the processing kernel of a current feature extraction layer is determined based on the processing kernel and the processing result of the previous feature extraction layer, and the current feature extraction layer is a feature extraction layer other than the first feature extraction layer; and a processing unit configured to process the image to be processed based on the feature vector; wherein the processing kernel of the current feature extraction layer is calculated through the following steps: estimating an estimated value of the processing kernel of the current feature extraction layer based on the processing result of the previous feature extraction layer; and determining the processing kernel of the current feature extraction layer based on the estimated value and the processing kernel of the previous feature extraction layer.

14. An image processing apparatus, comprising: a memory; and a processor coupled to the memory, the processor being configured to execute the image processing method according to any one of claims 1 to 12 based on instructions stored in the memory.

15. A non-volatile computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the image processing method according to any one of claims 1 to 12.