Image recognition method and device based on neural network model and terminal equipment

By introducing a MetaFormer structure into a lightweight convolutional neural network to extract global features, the loss of recognition accuracy in lightweight models is addressed, enabling efficient image recognition on terminal devices.

CN115187844B · Active · Publication Date: 2026-06-23 · SHENZHEN INTELLIFUSION TECHNOLOGIES CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SHENZHEN INTELLIFUSION TECHNOLOGIES CO LTD
Filing Date: 2022-06-30
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing lightweight convolutional neural network models suffer from reduced recognition accuracy in image recognition tasks, making it difficult to achieve efficient image recognition on terminal devices with low computing power.

Method used

We employ a MetaFormer structure based on pure convolution to extract global features from images, and combine it with a lightweight convolutional neural network to construct a lightweight neural network model. The model performs feature extraction followed by recognition, which improves image recognition accuracy.

Benefits of technology

While achieving a lightweight model, the image recognition accuracy of the neural network model was improved, enhancing the actual deployment effect on terminal devices.

Abstract

The application belongs to the technical field of computer vision and provides an image recognition method and device based on a neural network model, and a terminal device. The neural network model extracts global features of an image to be recognized through a MetaFormer structure based on pure convolution. The image recognition method comprises: inputting the image to be recognized into the trained neural network model, and sequentially performing feature extraction and recognition on the image by the neural network model to obtain a recognition result. The application can improve the accuracy of image recognition while keeping the model lightweight.

Description

Technical Field

[0001] This application belongs to the field of computer vision technology, and in particular relates to image recognition methods, devices, terminal equipment and computer-readable storage media based on neural network models.

Background Technology

[0002] In recent years, neural networks have been widely applied to solve image recognition tasks in computer vision, such as image classification, object detection, and image segmentation. Image classification, as a fundamental task in computer vision, is the main support for object detection and semantic segmentation. Object detection, as the core of computer vision tasks, is the foundation for scene understanding and cognition. For example, in scenarios such as face recognition, pedestrian tracking, and autonomous driving, detecting the target of interest is a prerequisite for understanding the scene. Semantic segmentation paves the way for achieving a complete understanding of the scene, and more and more applications are extracting knowledge from image data through semantic segmentation, including applications such as autonomous driving, human-computer interaction, virtual reality, and medical image analysis. Therefore, research on image recognition tasks such as image classification and object detection has become a hot topic in the field of computer vision.

[0003] Lightweight models refer to models that can run smoothly on terminal devices with low computing power and low computational overhead. Because convolutional neural networks have many parameters and high computational cost, and embedded and mobile terminal devices have limited computing power and storage capacity, lightweighting neural network models has become a research hotspot in recent years.

[0004] Currently, the neural network models used for image recognition tasks in actual deployment are mainly convolutional neural network models. Image recognition methods based on convolutional neural networks mostly rely on deep network structures to improve detection accuracy. Compressing convolutional neural network models to make them lightweight will affect the recognition accuracy of the network, resulting in a relative decrease in the accuracy of the lightweight neural network model.

Summary of the Invention

[0005] This application provides an image recognition method, apparatus, and terminal device based on a neural network model, which helps to improve the accuracy of image recognition while achieving a lightweight model.

[0006] In a first aspect, embodiments of this application provide an image recognition method based on a neural network model. The neural network model extracts global features of the image to be recognized through a MetaFormer structure based on pure convolution. The image recognition method includes:

[0007] The image to be recognized is input into the trained neural network model described above, and the neural network model sequentially performs feature extraction and recognition on the image to be recognized to obtain the recognition result.

[0008] In a second aspect, embodiments of this application provide an image recognition device, which includes:

[0009] An input module and a trained neural network model, wherein the neural network model extracts global features of the image to be recognized through a MetaFormer structure based on pure convolution.

[0010] The above input module is used to input the image to be recognized into the trained neural network model.

[0011] The aforementioned neural network model is used to sequentially extract and recognize features from the image to be recognized, thereby obtaining the recognition result.

[0012] In a third aspect, embodiments of this application provide a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor. When the processor executes the computer program, it implements the image recognition method based on a neural network model described in the first aspect.

[0013] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the image recognition method based on a neural network model described in the first aspect.

[0014] In a fifth aspect, embodiments of this application provide a computer program product that, when run on a terminal device, causes the terminal device to execute any of the image recognition methods based on a neural network model described in the first aspect.

[0015] Compared with the prior art, the embodiments of this application have the following beneficial effects: the image to be recognized is input into a trained neural network model, and the neural network model sequentially performs feature extraction and recognition on the image to obtain the recognition result. Because the neural network model extracts global features of the image through a convolution-based MetaFormer structure, a model built on a lightweight convolutional neural network can attend to the global features of the image to be recognized and obtain richer features. This reduces the accuracy drop caused by the lightweight convolutional neural network and improves the image recognition accuracy of the model while keeping it lightweight.

Attached Figure Description

[0016] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments or the prior art will be briefly introduced below.

[0017] Figure 1 is a schematic flowchart of an image recognition method based on a neural network model provided in an embodiment of this application;

[0018] Figure 2 is a schematic diagram of a global feature extraction module structure provided in an embodiment of this application;

[0019] Figure 3 is a schematic diagram of another global feature extraction module structure provided in an embodiment of this application;

[0020] Figure 4 is a schematic diagram of the process by which a global feature sub-extraction module extracts global features according to an embodiment of this application;

[0021] Figure 5 is a schematic diagram of a neural network model structure for image classification tasks provided in an embodiment of this application;

[0022] Figure 6 is a schematic diagram of a neural network model structure for an object detection task provided in an embodiment of this application;

[0023] Figure 7 is a schematic diagram of the detection frame structure for the object detection task provided in the embodiments of this application;

[0024] Figure 8 is a schematic diagram of a neural network model structure for semantic segmentation tasks provided in an embodiment of this application;

[0025] Figure 9 is a schematic diagram of the convolutional branch (segmentation module) structure of the semantic segmentation sub-model provided in this application embodiment;

[0026] Figure 10 is a schematic diagram of the structure of the image recognition device provided in the embodiments of this application;

[0027] Figure 11 is a schematic diagram of the structure of the terminal device provided in the embodiments of this application.

Detailed Implementation

[0028] In the following description, specific details such as particular system architectures and techniques are set forth for illustrative purposes and not for limitation, in order to provide a thorough understanding of the embodiments of this application. However, those skilled in the art will understand that this application may also be implemented in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, apparatuses, circuits, and methods have been omitted so as not to obscure the description of this application with unnecessary detail.

[0029] It should be understood that, when used in this application specification and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and / or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and / or collections thereof.

[0030] It should also be understood that the term “and / or” as used in this application specification and the appended claims means any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.

[0031] Furthermore, in the description of this application and the appended claims, the terms "first," "second," "third," etc., are used only to distinguish descriptions and should not be construed as indicating or implying relative importance.

[0032] References to "one embodiment" or "some embodiments" in this specification mean that one or more embodiments of this application include a specific feature, structure, or characteristic described in connection with that embodiment. Therefore, the phrases "in one embodiment," "in some embodiments," "in other embodiments," "in still other embodiments," etc., appearing in different parts of this specification do not necessarily refer to the same embodiment, but rather mean "one or more, but not all, embodiments," unless otherwise specifically emphasized.

[0033] Example 1:

[0034] Figure 1 shows a schematic flowchart of an image recognition method based on a neural network model according to an embodiment of this application, detailed as follows:

[0035] The image to be recognized is input into a trained neural network model, and the neural network model sequentially performs feature extraction and recognition on the image to obtain the recognition result. The neural network model is built on a lightweight convolutional neural network and extracts global features of the image to be recognized through a MetaFormer structure based on pure convolution.

[0036] The MetaFormer structure described above is derived from the ViT (Vision Transformer) model for computer vision tasks. The ViT model applies the self-attention-based Transformer model to image tasks and implements the self-attention mechanism through matrix operations. On large datasets, the ViT model achieves the same or even better results than traditional convolutional neural network models on image tasks, and at lower cost. However, the ViT model demands substantial computational resources and places high requirements on hardware, making it less hardware-friendly. Therefore, an improved MetaFormer structure based on pure convolution, which has good hardware support, is used to extract global features from images. This structure inherits the good hardware support and low computational overhead of convolutional neural networks while incorporating the global feature extraction capability of the ViT model, thus improving the accuracy of the neural network model.

[0037] Specifically, because a neural network model relies on deep network structures to improve image recognition accuracy, compressing the model to make it lightweight lowers its recognition accuracy. Therefore, when the neural network model is constructed, an improved MetaFormer structure based on pure convolution is incorporated into it to extract global features of the image to be recognized. The neural network model containing the MetaFormer structure is then used to perform feature extraction and recognition on the image to be recognized and obtain the recognition result.

[0038] In this embodiment, a trained neural network model sequentially performs feature extraction and recognition on the image to be recognized to obtain the recognition result. The neural network model extracts global features of the image through a MetaFormer structure based on pure convolution. Because the MetaFormer structure performs image recognition using global features, it achieves good recognition accuracy, so the neural network model can remain lightweight while maintaining good image recognition accuracy.

[0039] In this embodiment, a trained neural network model sequentially performs feature extraction and recognition on the image to be recognized and outputs the corresponding recognition result. Because the neural network model extracts global features of the image through a convolution-based MetaFormer structure, it can attend to those global features when performing image recognition tasks, which avoids the accuracy drop caused by lightweighting. This improves the recognition accuracy of the neural network model and the practical deployment effect of applications that rely on it for image recognition.

[0040] In some embodiments, prior to image recognition based on the aforementioned neural network model, the method further includes:

[0041] The constructed neural network model is trained.

[0042] Optionally, since the images to be recognized, the recognition targets, and the results vary for different applications' image recognition tasks, the neural network model should be trained according to the actual image recognition requirements of the application before performing the corresponding image recognition task based on the aforementioned neural network model. For example, to deploy a neural network model for an autonomous vehicle to detect pedestrians and identify their locations for safe and reliable driving, the aforementioned neural network model needs to be trained for pedestrian detection or pedestrian re-detection to enable it to detect and identify pedestrians and their location information.

[0043] Optionally, when training the neural network model, a standard labeled dataset can be obtained from an image database as the training set, or an unlabeled dataset can be used as the training set. During training, a validation set can be used to verify the generalization ability of the neural network model, to evaluate its capabilities, and to decide whether to stop training. Alternatively, a test set can be used to evaluate the generalization ability of the neural network model.
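The use of a validation set to decide whether to stop training can be illustrated with a minimal sketch. The patience rule below (stop once the validation loss has not improved for a fixed number of epochs) is an assumption chosen for illustration; the application does not specify a stopping criterion, and the names `should_stop`, `epochs_run`, and `patience` are hypothetical.

```python
def should_stop(val_losses, patience=3):
    """Return True when the last `patience` epochs brought no improvement.

    Assumed rule (not taken from the application): compare the best loss in
    the most recent `patience` epochs with the best loss seen before them.
    """
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    return min(val_losses[-patience:]) >= best_before


def epochs_run(val_losses, patience=3):
    """Replay per-epoch validation losses and return how many epochs are run."""
    for epoch in range(1, len(val_losses) + 1):
        if should_stop(val_losses[:epoch], patience):
            return epoch
    return len(val_losses)


# Hypothetical validation losses recorded after each training epoch.
losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.6]
stopped_at = epochs_run(losses, patience=3)
```

In practice the loss values would come from evaluating the model on the validation set after each epoch, and a held-out test set would be used only for the final assessment.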

[0044] In this embodiment of the application, before image recognition is performed, the neural network model is trained according to the needs of the actual deployment application, so that image recognition tasks for different deployment applications can be performed more accurately.

[0045] In some embodiments, prior to image recognition based on the aforementioned neural network model, the method further includes:

[0046] Obtain the image to be recognized.

[0047] Optionally, the image to be identified may be an image captured by a camera device or an image frame in a video stream captured by a camera device.

[0048] Optionally, since image recognition tasks may differ across application areas, the camera equipment used and the rules for acquiring the images to be detected may also vary. Therefore, the corresponding images to be recognized should be obtained according to the specific acquisition methods and rules for each application area. For example, in traffic applications, such as traffic violation detection, it is necessary to use cameras installed at fixed locations to acquire real-time video and obtain continuous image frames from the video stream as the images to be detected for the corresponding image recognition task.

[0049] In this embodiment of the application, based on the images required for image recognition tasks in various application fields, corresponding acquisition methods and acquisition rules are adopted to obtain images to be recognized that meet the requirements of image recognition tasks, so as to perform image recognition tasks.

[0050] In some embodiments, the neural network model described above includes a feature extraction module and a recognition module, wherein the feature extraction module includes a global feature extraction module constructed based on a convolutional MetaFormer structure;

[0051] Accordingly, the step in which the neural network model sequentially performs feature extraction and recognition on the image to be recognized to obtain the recognition result includes:

[0052] A1. Based on the feature extraction module, perform feature extraction on the image to be recognized.

[0053] A2. Based on the recognition module, recognize the extracted features to obtain the recognition result.

[0054] Optionally, since image recognition tasks include image classification, object detection, semantic segmentation, and instance segmentation, the features of the image to be recognized and the recognition methods for the extracted features required for each task are not entirely the same. Therefore, a feature extraction module is used to extract features from the image to be recognized. Based on the features required for each image recognition task and the corresponding recognition methods, the extracted features are recognized to obtain the corresponding recognition results. For example, pedestrian detection is used in fields such as intelligent driving, intelligent monitoring, pedestrian analysis, and intelligent robots to determine whether a pedestrian is present in an input image or video frame. When performing pedestrian detection, since the size of pedestrians in the image varies, it is necessary to use the extracted features of different sizes to detect the pedestrians present in the image to be recognized and output the pedestrian information and location.

[0055] In this embodiment, the feature extraction module extracts features from the image to be recognized. Since the features of the image to be recognized and the recognition methods for the features are different for each image recognition task, the recognition module obtains the required features for the corresponding image recognition task, recognizes the corresponding features, and obtains the corresponding recognition results, thereby improving the recognition accuracy and efficiency of each image recognition task.

[0056] In some embodiments, the feature extraction module further includes a first convolution module;

[0057] Accordingly, step A1 above includes:

[0058] A11. Based on the first convolution module, local features are extracted from the image to be identified to obtain the first local feature image.

[0059] Optionally, the first convolutional module can be built on a lightweight convolutional neural network structure, such as MobileNetV2 or SqueezeNet, to reduce its computational cost. MobileNetV2 is a lightweight neural network that replaces ordinary convolution with depthwise separable convolution and introduces a linear bottleneck to improve the model's expressive power and avoid the loss of feature information caused by nonlinear transformations. It also expands the feature map channels through an inverted residual structure, which avoids gradient vanishing or exploding problems and enriches the features, thus improving accuracy. SqueezeNet, on the other hand, is a simplified lightweight convolutional neural network structure that defines its own convolutional modules, which compress and then expand the number of data channels, and further compresses parameters through deep compression to achieve an ultra-lightweight model suitable for terminal devices with limited computing power.

[0060] The aforementioned depthwise separable convolution divides an ordinary convolution into two steps: channel-wise (depthwise) convolution and pointwise convolution. It can be defined by two independent layers: a lightweight depthwise convolution for spatial filtering and a 1x1 pointwise convolution for feature generation. The depthwise convolution convolves the image channel by channel, with one kernel responsible for one channel, so the output feature map has the same number of channels as the input feature map. The 1x1 convolution then increases or reduces the number of channels of this output and forms a weighted combination along the depth direction, combining the feature information from each channel without changing the spatial size of the feature map.
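The two steps of depthwise separable convolution can be sketched in plain Python as follows. This is an illustrative sketch only, not the implementation used in the application: feature maps are nested lists shaped [channels][height][width], stride is fixed at 1, biases are ignored, and the function names are hypothetical.

```python
def depthwise_conv(x, kernels, pad=1):
    """Depthwise step: one k*k kernel per channel, stride 1, zero padding.

    x: [C][H][W] nested lists; kernels: [C][k][k]. Channel count is unchanged.
    """
    out = []
    for ch, ker in zip(x, kernels):
        k, h, w = len(ker), len(ch), len(ch[0])
        padded = [[0.0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
        for r in range(h):
            for c in range(w):
                padded[r + pad][c + pad] = ch[r][c]
        oh, ow = h + 2 * pad - k + 1, w + 2 * pad - k + 1
        out.append([[sum(padded[r + i][c + j] * ker[i][j]
                         for i in range(k) for j in range(k))
                     for c in range(ow)] for r in range(oh)])
    return out


def pointwise_conv(x, weights):
    """Pointwise step: a 1x1 convolution that mixes channels at each pixel.

    x: [C_in][H][W]; weights: [C_out][C_in]. Spatial size is unchanged.
    """
    h, w = len(x[0]), len(x[0][0])
    return [[[sum(wt[ci] * x[ci][r][c] for ci in range(len(x)))
              for c in range(w)] for r in range(h)] for wt in weights]


def param_counts(c_in, c_out, k):
    """Weights of a standard k*k convolution vs. its separable counterpart."""
    standard = k * k * c_in * c_out
    separable = k * k * c_in + c_in * c_out
    return standard, separable


# Two-channel 4*4 input; the identity kernel leaves each channel unchanged.
x = [[[float(r * 4 + c) for c in range(4)] for r in range(4)] for _ in range(2)]
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
dw = depthwise_conv(x, [identity, identity], pad=1)             # still 2 channels
pw = pointwise_conv(dw, [[1.0, 1.0], [1.0, -1.0], [0.0, 2.0]])  # now 3 channels
```

For 32 input channels, 64 output channels, and a 3*3 kernel, `param_counts(32, 64, 3)` gives 18432 weights for the standard convolution against 2336 for the separable one, which is where the reduction in parameters comes from.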

[0061] A12. Based on the global feature extraction module described above, global feature extraction is performed on the first local feature image to obtain global features.

[0062] Optionally, the first local feature image extracted by the first convolution module is used as the input of the global feature extraction module, which extracts global features from it through the convolution-based MetaFormer structure to obtain the global features of the image to be recognized.

[0063] In this embodiment, by using a lightweight network structure to extract the first local features, the parameters and computational load of the neural network model are reduced, thereby accelerating the image recognition speed of the neural network model. Furthermore, by extracting global features through a convolution-based MetaFormer structure, subsequent image recognition tasks can be performed based on the global features, reducing the impact of the lightweight network structure on the accuracy of the neural network model and thus improving the recognition accuracy.

[0064] In some embodiments, the feature extraction module further includes a second convolution module and a fusion module, wherein the second convolution module uses a lightweight convolutional neural network to extract second local features;

[0065] Accordingly, step A1 above also includes:

[0066] A13. Based on the second convolution module described above, local features are extracted from the first local feature image to obtain the second local features.

[0067] Optionally, the second convolutional module can be constructed based on lightweight convolutional neural network architectures such as MobileNetV2 and SqueezeNet to reduce the computational cost of the second convolutional module, thereby extracting the second local features. Its principle is the same as that of the first convolutional module and will not be elaborated further here.

[0068] A14. Based on the above fusion module, the second local feature and the global feature are fused to obtain the fused feature.

[0069] Optionally, the aforementioned second local features and global features are connected in the channel direction to obtain fused features that describe the local and global features, thereby improving the information representation capability of the feature image and thus improving the accuracy of subsequent image recognition.

[0070] In this embodiment, by fusing the second local features and global features to obtain fused features that can represent both local and global features, the information representation capability of the feature image is improved. This results in relatively accurate image recognition results when performing corresponding image recognition tasks based on the fused features, thereby improving the recognition accuracy of the neural network model.
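The data flow of steps A11 to A14 can be sketched as below. The three modules are passed in as plain callables and the stand-ins used in the example are toy functions chosen only so the sketch runs; they are assumptions and do not represent the actual convolution or MetaFormer modules. Feature maps are nested lists shaped [channels][height][width].

```python
def concat_channels(a, b):
    """Fuse two [C][H][W] feature maps by concatenation along the channel axis."""
    assert len(a[0]) == len(b[0]) and len(a[0][0]) == len(b[0][0]), \
        "spatial sizes must match before channel concatenation"
    return a + b


def extract_features(image, first_conv, global_module, second_conv):
    """Wire the modules together in the order of steps A11 to A14."""
    local1 = first_conv(image)            # A11: first local feature image
    global_feat = global_module(local1)   # A12: global features
    local2 = second_conv(local1)          # A13: second local features
    return concat_channels(local2, global_feat)  # A14: fused features


def toy_global(x):
    """Stand-in for the global module: every pixel becomes the channel mean."""
    out = []
    for ch in x:
        mean = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        out.append([[mean] * len(ch[0]) for _ in ch])
    return out


def toy_local(x):
    """Stand-in for the second convolution module: scales every value by 2."""
    return [[[2.0 * v for v in row] for row in ch] for ch in x]


image = [[[1.0, 2.0], [3.0, 4.0]]]  # one channel, 2*2 pixels
fused = extract_features(image, lambda t: t, toy_global, toy_local)
```

Because the fusion is a channel-wise concatenation, the fused feature has as many channels as the second local features and the global features combined, while the spatial size is unchanged.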

[0071] It should be noted that in some embodiments, when extracting the first local features and the fused features, since image recognition tasks such as target detection require the recognition of feature images at different scales, it is necessary to extract the first local features and fused features at different scales to improve the recognition accuracy of the corresponding image recognition tasks.

[0072] In some embodiments, the global feature extraction module includes, from input to output, a first residual module, a global feature sub-extraction module, a first addition module, a second residual module, a feedforward network module, and a second addition module connected in sequence.

[0073] Accordingly, step A12 above specifically includes:

[0074] The first addition module is used to add and merge the input data of the first residual module and the output data of the global feature sub-extraction module.

[0075] The second addition module is used to add and merge the input data of the second residual module and the output data of the feedforward network module.

[0076] The global feature sub-extraction module includes a first branch, a second branch, and a merging module;

[0077] The first branch is used to extract local features from the input image through depthwise separable convolution;

[0078] The second branch is used to extract global features from the input image;

[0079] The merging module is used to merge the features output by the first branch and the second branch based on pixel position to obtain the global features.

[0080] The first residual module and the second residual module refer to the residual structures built around the global feature sub-extraction module and the feedforward network module, respectively, which are used to address gradient explosion and network performance degradation.
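The wiring of the two residual paths described above can be sketched as follows. This is a minimal illustration under stated assumptions: features are flattened to a plain list of floats, and `token_mixer` and `ffn` are hypothetical callables standing in for the global feature sub-extraction module and the feedforward network module.

```python
def add(a, b):
    """Addition (merging) module: elementwise sum of two equally sized lists."""
    assert len(a) == len(b), "residual addition needs matching sizes"
    return [x + y for x, y in zip(a, b)]


def metaformer_block(x, token_mixer, ffn):
    """Two residual stages in sequence, as in a MetaFormer-style block."""
    # First stage: the block input is added to the sub-extraction output.
    y = add(x, token_mixer(x))
    # Second stage: the intermediate result is added to the feedforward output.
    return add(y, ffn(y))


features = [1.0, 2.0]
out = metaformer_block(
    features,
    token_mixer=lambda v: [2.0 * t for t in v],  # toy stand-in
    ffn=lambda v: [t + 1.0 for t in v],          # toy stand-in
)
```

If both stand-ins return zeros, the block reduces to the identity, which is the property that lets residual structures keep gradients flowing through deep stacks.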

[0081] Optionally, the network structure of the global feature extraction module described above is shown in Figure 2. The first local feature image is used as the input of the first residual module. The first branch of the global feature sub-extraction module extracts local features from the input first local feature image, and the second branch extracts global features from it. The merging module then merges the features output by the first branch and the second branch to obtain the global features of the image to be recognized.

[0082] Optionally, the first branch performs local feature extraction on the input first local feature image using depthwise separable convolution to reduce the number of parameters and the computational cost. For example, the first branch uses a convolution with a kernel size of 3*3 and a stride of 1 and pads the input with a one-pixel border of zeros, so that the resolution is preserved after convolution, that is, the output of the first branch has the same size as the input first local feature image. The first branch can also use separable convolution, group convolution, or other conventional convolution for local feature extraction. These are conventional convolution operations and are not described in detail here.
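The reason this padding preserves the resolution follows from the standard output-size relation for a convolution. The symbols below ($W_{in}$, $W_{out}$ for the input and output width, $k$ for kernel size, $s$ for stride, $p$ for padding on each side) are notation assumed here for illustration; the same relation applies to the height.

```latex
W_{out} = \left\lfloor \frac{W_{in} + 2p - k}{s} \right\rfloor + 1
\qquad\Longrightarrow\qquad
W_{out}\Big|_{k=3,\; s=1,\; p=1} = \frac{W_{in} + 2 - 3}{1} + 1 = W_{in}
```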

[0083] Optionally, when merging the features output by the first branch and the second branch, the merging module adds the features output by the first branch and the second branch based on the pixel position to obtain global features. Without increasing the dimension of the feature image, the global features describe more information, that is, they contain feature information of both local and global features.

[0084] In this embodiment, local features are extracted by depthwise separable convolution, and the local features and the extracted global features are added together to obtain a global feature image containing the local features. Since the local features and global features are merged by addition, the dimension of the resulting feature map, i.e. the number of channels, remains unchanged, but the amount of information it describes increases. This improves the accuracy of subsequent image recognition without increasing the amount of computation, thereby improving the accuracy of the neural network model.
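The effect of merging by addition on the feature dimension can be checked with a short sketch. The helper names are hypothetical and the feature maps are toy nested lists shaped [channels][height][width]; list concatenation is shown only for contrast.

```python
def shape(x):
    """Return (channels, height, width) of a nested-list feature map."""
    return (len(x), len(x[0]), len(x[0][0]))


def add_features(local, global_):
    """Merge two feature maps by adding values at the same pixel position."""
    assert shape(local) == shape(global_), "addition needs identical shapes"
    return [[[a + b for a, b in zip(row_l, row_g)]
             for row_l, row_g in zip(ch_l, ch_g)]
            for ch_l, ch_g in zip(local, global_)]


local = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(2)]    # 2 channels, 2*2
global_ = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(2)]  # 2 channels, 2*2

merged = add_features(local, global_)  # channel count stays at 2
stacked = local + global_              # concatenation would give 4 channels
```

Since the merged map keeps the original channel count, the layers that follow see the same input size as before, which is why the addition does not increase their computation.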

[0085] In some embodiments, the structure of the global feature extraction module described above is shown in Figure 3. On the basis of the structure shown in Figure 2, a BN (Batch Normalization) layer is added to each layer in the first residual module and the second residual module to normalize the input, preventing the data distribution of intermediate layers from changing during training of the neural network model. This avoids gradient vanishing or gradient explosion problems and accelerates the training of the neural network model.
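What a BN layer computes at training time can be sketched as below for the values of a single channel. This is the standard batch normalization formula rather than anything specific to the application; `gamma`, `beta`, and `eps` are the usual learnable scale, learnable shift, and numerical-stability constant, with default values assumed for illustration.

```python
import math


def batch_norm(channel_values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel's values to zero mean and unit variance,
    then apply the scale (gamma) and shift (beta)."""
    n = len(channel_values)
    mean = sum(channel_values) / n
    var = sum((v - mean) ** 2 for v in channel_values) / n
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta
            for v in channel_values]


# Values of one channel collected across a batch (toy numbers).
normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
norm_mean = sum(normalized) / len(normalized)
norm_var = sum((v - norm_mean) ** 2 for v in normalized) / len(normalized)
```

Keeping each layer's inputs in a stable range in this way is what limits the drift in intermediate distributions during training.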

[0086] In some embodiments, when recognizing an image to be recognized, the second branch is specifically used for:

[0087] B1. Perform a convolution operation on the input image to obtain N feature vectors, where N is a positive integer greater than 1.

[0088] Optionally, when the convolution is performed on the input first local feature image, the input image is divided into N groups, and each group is processed by a large convolution kernel that covers all pixels in the group to extract a corresponding feature vector, resulting in a total of N feature vectors.

[0089] B2. Perform channel shuffling on each feature vector to obtain N new feature vectors.

[0090] Optionally, each of the above N feature vectors is divided into N groups along the channel direction, and the groups are then shuffled and recombined to generate N new feature vectors. Shuffling the channel order of the original feature vectors lets feature information flow between channels, achieving information exchange and fusion.

[0091] It is important to note that each of the newly generated N feature vectors is composed of N groups from different feature vectors. That is, the N groups that make up the new feature vectors come from N different feature vectors. This ensures that the newly generated feature vectors contain the feature information of other feature vectors, so that each newly generated feature vector can represent the features of the entire input image.

[0092] B3. The above new N feature vectors are sparsified and rearranged to obtain a sparse feature map.

[0093] Optionally, the resulting N feature vectors are rearranged to form a sparse feature map, and the parts of the sparse feature map without actual data are filled with zeros to diffuse the feature vector information to each group.

[0094] B4. After the sparse feature map is diffused into a dense feature map through convolution operation, it is output.

[0095] Optionally, the above sparse feature map is convolved to diffuse the information of the sparse feature map into a dense feature map, so that the information of each pixel position in the generated dense feature map describes the information of all pixels in the input image.
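The channel shuffling of step B2 can be sketched as follows. This is one possible arrangement, assuming each feature vector is a flat list whose channel count is divisible by N; the function name `channel_shuffle` and the rule that new vector j takes the j-th group of every original vector are assumptions, since the description only requires that the N groups of each new vector come from N different original vectors.

```python
def channel_shuffle(vectors):
    """Regroup N feature vectors so each new vector takes one group from every original."""
    n = len(vectors)
    channels = len(vectors[0])
    if channels % n != 0 or any(len(v) != channels for v in vectors):
        raise ValueError("each vector needs the same channel count, divisible by N")
    size = channels // n
    # groups[i][j] is the j-th channel group of original vector i.
    groups = [[v[j * size:(j + 1) * size] for j in range(n)] for v in vectors]
    # New vector j concatenates group j of every original vector.
    return [[x for i in range(n) for x in groups[i][j]] for j in range(n)]


# Two hypothetical feature vectors with four channels each, labelled by origin.
vectors = [[f"v{i}c{c}" for c in range(4)] for i in range(2)]
shuffled = channel_shuffle(vectors)
print(shuffled[0])  # ['v0c0', 'v0c1', 'v1c0', 'v1c1']
print(shuffled[1])  # ['v0c2', 'v0c3', 'v1c2', 'v1c3']
```

Each new vector now carries channels from both original vectors, which is the information exchange that step B2 aims for.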

[0096] Optionally, based on the depthwise separable convolution of the first branch, local features are extracted from the input image. Based on the second branch, steps B1-B4 are performed to extract global features from the input image. The extracted local and global features are then merged to obtain the output global features. For example, as shown in Figure 4, the input image size is 8*8. A depthwise separable convolution is performed on the input image using a convolution kernel with size k = 3*3, stride s = 1, and edge padding p = 1 to extract local features. Simultaneously, the input image is divided into four groups of 4*4 pixels each, and a large-stride convolution is performed on these groups using a convolution kernel with size k = 4, stride s = 4, and edge padding p = 0, resulting in four feature vectors. These four feature vectors are then shuffled and recombined to obtain four new feature vectors. These four new feature vectors are then sparsified to form a sparse feature map. A convolution kernel with size k = 4*3, stride s = 1, and edge padding p = 2 is then used to convolve the sparse feature map, propagating and diffusing the information to obtain a dense feature map (i.e., global features). Finally, the extracted local features and global features are added and merged according to element position to obtain the final output global features.
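The sizes in the Figure 4 example can be checked with the standard convolution output-size formula, floor((n + 2p - k) / s) + 1. The sketch below is only an arithmetic check for the first two convolutions of the example; the helper name `conv_output_size` is an assumption.

```python
def conv_output_size(n, k, s, p):
    """Output length along one axis for input length n, kernel k, stride s, padding p."""
    return (n + 2 * p - k) // s + 1


# First branch: 3*3 kernel, stride 1, padding 1 on an 8*8 input.
local_size = conv_output_size(8, k=3, s=1, p=1)

# Second branch, step B1: kernel 4, stride 4, padding 0 on the same 8*8 input.
strided_size = conv_output_size(8, k=4, s=4, p=0)
num_vectors = strided_size * strided_size

print(local_size)    # 8: the local feature map keeps the 8*8 size
print(strided_size)  # 2: one output position per 4*4 group along each axis
print(num_vectors)   # 4: four feature vectors in total
```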

[0097] In this embodiment, by performing convolution, channel shuffling, and sparsification diffusion on the input image, global features are obtained in which each pixel describes the information of all pixels in the image to be recognized. Since pure convolution operations are used throughout the process of extracting global features, the neural network model is well supported by hardware while extracting global features based on the MetaFormer structure to improve image recognition accuracy, thus facilitating the deployment of the neural network model on terminal devices.

[0098] In some embodiments, the recognition module described above includes an image classification sub-model;

[0099] Accordingly, step A2 above includes:

[0100] A21. Based on the above image classification sub-model, the extracted features are classified to obtain the image classification result.

[0101] Image classification refers to distinguishing different categories of targets based on the different features reflected in an image.

[0102] Optionally, the extracted fusion features can be input into a trained image classification sub-model to output the image classification result. For example, in face recognition, after the fusion features of a face image are extracted through the feature extraction module described above, the image classification sub-model is used to classify the extracted fusion features of the face image to obtain the category of the face image, i.e., which person it belongs to.

[0103] In this embodiment, the extracted fusion features are classified by a trained image classification sub-model to obtain image classification results. Since the classification is based on fusion features, which describe more semantic and detailed information, the image classification sub-model is more accurate in classifying the target, thereby improving the recognition accuracy of the neural network model.

[0104] In some embodiments, the above image classification sub-model includes a third convolutional module;

[0105] Accordingly, step A21 above includes:

[0106] Based on the third convolution module mentioned above, the extracted features are classified to obtain the image classification result.

[0107] Optionally, the third convolutional module described above constructs a fully connected layer based on convolution to classify the extracted features.

[0108] Optionally, as shown in Figure 5, the neural network model for image classification tasks works as follows. Based on the feature extraction module, a first convolutional module C1 extracts local features from the image to be identified to obtain the first local features. The second convolutional module C2 and the global feature extraction module E are stacked N1 and M1 times respectively to extract features from the first local features. The resulting second local features and global features are then fused through a fusion module F to obtain fused features. A third convolutional module C (i.e., a fully connected layer) receives the fused features required for the image classification task. Based on the fully connected layer, the extracted fused features are mapped to the sample label space to obtain the probability values of the image to be identified belonging to each category. The label with the highest probability value is selected as the image classification result, thus achieving the image classification task.
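The final mapping from fused features to a class label can be sketched as a linear mapping followed by selection of the most probable label. The use of softmax to turn scores into probabilities, the function name `classify`, and the weights, biases and label names are all assumptions for illustration; the description does not name the normalization used.

```python
import math


def classify(features, weights, biases, labels):
    """Map a feature vector to per-label probabilities and pick the best label."""
    # One score per label: a fully connected layer over the feature vector.
    logits = [sum(w * x for w, x in zip(row, features)) + b
              for row, b in zip(weights, biases)]
    # Softmax (assumed), shifted by the maximum for numerical stability.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return labels[best], probs


# Hypothetical fused feature vector and layer parameters.
features = [1.0, 2.0]
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
biases = [0.0, 0.0, 0.0]
labels = ["person_a", "person_b", "person_c"]

label, probs = classify(features, weights, biases, labels)
print(label)  # person_b
```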

[0109] In this embodiment, the extracted features are classified through convolution operations, and the classification result is obtained based on the sample labels.

[0110] In some embodiments, before performing the image classification task using the neural network model based on the above-described image classification sub-model, the method further includes:

[0111] Image classification training is performed on the neural network model that includes the above-mentioned image classification sub-model.

[0112] Optionally, the neural network model including the image classification sub-model can be trained using common training methods to achieve the image classification task. For example, based on the actual image classification requirements of the deployed application, a corresponding training set can be obtained, and the neural network model including the image classification sub-model can be iteratively trained based on the training set until the trained neural network model including the image classification sub-model meets the preset conditions, thus obtaining the trained neural network model.

[0113] In this embodiment, the neural network model including the image classification sub-model is trained for image classification according to the actual application requirements, so as to achieve the image classification task of the deployed application.

[0114] In some embodiments, the above-described recognition module further includes a target detection sub-model;

[0115] Accordingly, step A2 above also includes:

[0116] A22. Based on the above target detection sub-model, the extracted features are processed for target detection to obtain the target detection results.

[0117] Optionally, the task of object detection is to find all objects of interest in an image and determine their category and location, and the shapes and sizes of objects vary. Therefore, multi-scale detection based on the SSD (Single Shot MultiBox Detector) algorithm receives the extracted features of different sizes and uses prior boxes of different scales and aspect ratios to detect them, thereby outputting the confidence score of each category in each detection box and the offset of the detection box relative to the prior box, i.e., the position information of the detection box.

[0118] The SSD algorithm described above is a type of one-stage object detection algorithm. It performs dense sampling at different locations in the image, using different scales and aspect ratios to set prior boxes during sampling. Then, it uses a convolutional neural network to extract image features from the prior boxes and directly performs classification and regression, offering the advantage of high detection speed. To detect targets of different scales, the SSD algorithm employs a grid partitioning approach, scanning the feature maps of different convolutional layers. This allows it to detect targets of different sizes based on the feature maps at different scales—that is, detecting small objects based on large-scale feature maps and large objects based on small-scale feature maps—thus improving detection accuracy.
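The dense placement of prior boxes can be illustrated with a short sketch. The function name `prior_boxes`, the normalized (cx, cy, w, h) box format, and the sample grid sizes, scales and aspect ratios are assumptions for illustration; the width and height follow the usual SSD convention of scale * sqrt(ratio) and scale / sqrt(ratio).

```python
import math


def prior_boxes(grid_size, scale, aspect_ratios):
    """Place one prior box per aspect ratio at the center of every grid cell."""
    boxes = []
    for row in range(grid_size):
        for col in range(grid_size):
            cx = (col + 0.5) / grid_size
            cy = (row + 0.5) / grid_size
            for ratio in aspect_ratios:
                width = scale * math.sqrt(ratio)
                height = scale / math.sqrt(ratio)
                boxes.append((cx, cy, width, height))
    return boxes


ratios = [1.0, 2.0, 0.5]
# A large-scale (8*8) feature map with small boxes for small objects,
# and a small-scale (2*2) feature map with large boxes for large objects.
fine = prior_boxes(8, scale=0.2, aspect_ratios=ratios)
coarse = prior_boxes(2, scale=0.8, aspect_ratios=ratios)

print(len(fine), len(coarse))  # 192 12
```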

[0119] Specifically, based on the aforementioned object detection sub-model, the system receives first local features and fused features of different scales extracted from the image to be recognized. The object detection head then uses prior boxes of different scales and aspect ratios to perform object detection on these first local features and fused features, obtaining the confidence score for each category within each detection box and the offset of the detection box relative to the prior box. For example, since face recognition requires detecting faces in the image to be recognized and then recognizing the detected faces, the aforementioned neural network model can be applied to face recognition applications such as access control systems. Based on the aforementioned object detection sub-model, the extracted first local features and fused features of the image to be recognized are detected using prior boxes of different scales and aspect ratios, resulting in detection boxes containing faces. These detection boxes limit the processing area of subsequent face recognition algorithms from the entire image to the face region within the detection box.

[0120] Optionally, to determine the category and location of targets with different shapes and sizes, multi-scale detection of the image to be detected can be achieved by scaling the image at different scales, predicting pyramid features, etc. The above-mentioned multi-scale detection methods are conventional detection methods and will not be elaborated on here.

[0121] In this embodiment, since target detection is performed based on feature maps of different scales and prior boxes of different scales and aspect ratios, targets of different sizes can be detected, thereby improving the accuracy of target detection and thus improving the accuracy of the aforementioned neural network model.

[0122] In some embodiments, step A22 specifically includes:

[0123] The above-mentioned target detection sub-model uses the Non-Maximum Suppression (NMS) algorithm to detect the location of targets and / or the category to which targets belong in images.

[0124] Optionally, a large number of detection boxes are generated at the same target location during target detection, and these detection boxes may overlap. Since usually only one detection box is needed for the same target, excess detection boxes will affect the accuracy of target detection. Therefore, as shown in Figure 6, the neural network model for object detection extracts local features from the image to be identified using a first convolutional module C1 to obtain first local features. A second convolutional module C2 and a global feature extraction module E are stacked N1 and M1 times respectively to extract features from the first local feature image. A fusion module F then fuses the obtained second local features and global features to obtain fused features. An object detection head detects the extracted first local features and fused features at different scales to obtain the detection boxes and probability values of the targets in the image to be identified. A non-maximum suppression (NMS) algorithm is used to filter out overlapping detection boxes, resulting in the optimal detection box, thereby determining the location of the target and / or the target's category in the image to be identified. For example, for the vehicle shown in Figure 7, a large number of detection boxes are obtained while locating the vehicle in the image, and non-maximum suppression is needed to remove the redundant detection boxes to obtain the vehicle detection result. Suppose there are 6 detection boxes, named A, B, C, D, E, and F in ascending order of their probability of belonging to the vehicle. Starting from the detection box with the highest probability, F, it is determined whether the overlap ratio (IOU) between F and each of the detection boxes A, B, C, D, and E is greater than a preset threshold. If the overlap ratios of detection boxes B and D with F exceed the preset threshold, then detection boxes B and D are removed, and F is marked as the first retained detection box. From the remaining detection boxes A, C, and E, the detection box with the highest probability, E, is selected. The overlap ratios between E and each of the detection boxes A and C are determined, detection boxes with an overlap ratio greater than the preset threshold are removed, and E is marked as the second retained detection box. This process is repeated to find all the retained detection boxes.
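The filtering procedure in the A to F example can be sketched as a greedy loop. This is a minimal sketch, not the sub-model's actual implementation: the box format (x1, y1, x2, y2), the threshold of 0.5, and the coordinates and scores below are assumptions chosen so that B and D overlap F and A overlaps E.

```python
def iou(a, b):
    """Overlap ratio (intersection over union) of two boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, threshold=0.5):
    """Return indices of retained boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    while order:
        best = order.pop(0)  # highest-probability box still remaining
        kept.append(best)
        # Drop every remaining box that overlaps the retained one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= threshold]
    return kept


names = ["A", "B", "C", "D", "E", "F"]
boxes = [
    (21.0, 21.0, 30.0, 30.0),  # A, overlaps E
    (1.0, 1.0, 10.0, 10.0),    # B, overlaps F
    (50.0, 50.0, 60.0, 60.0),  # C, isolated
    (0.0, 0.0, 9.0, 9.0),      # D, overlaps F
    (20.0, 20.0, 30.0, 30.0),  # E
    (0.0, 0.0, 10.0, 10.0),    # F
]
scores = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9]  # ascending from A to F

retained = [names[i] for i in nms(boxes, scores)]
print(retained)  # ['F', 'E', 'C']
```

With these sample boxes, F removes B and D, E removes A, and C is retained because it overlaps nothing.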

[0125] The aforementioned non-maximum suppression is an edge refinement technique used to suppress targets that are not maxima, thereby searching for targets with local maxima (optimal values).

[0126] Optionally, when removing the redundant detection boxes, the method for removing redundant detection boxes can be selected according to actual needs. For example, in multi-target detection, the required detection boxes can be obtained based on template matching, clustering algorithms, and other methods, thereby obtaining the corresponding detection results.

[0127] In this embodiment, since multiple detection boxes are generated for the same target and there is an overlap, the overlapping detection boxes are removed based on non-maximum suppression to obtain the optimal detection box, thereby obtaining the location of the target in the image and / or the category to which the target belongs, reducing interference items and making the target detection results more accurate.

[0128] In some embodiments, before the neural network model including the object detection sub-model performs the object detection task, the method further includes:

[0129] The neural network model, which includes the object detection sub-model, is trained for object detection.

[0130] Optionally, the neural network model including the object detection sub-model can be trained using common training methods to achieve the object detection task. For example, based on the actual object detection requirements of the deployed application, a labeled training set can be obtained, and two or more neural network models including the object detection sub-model can be trained on the training set. The two or more neural network models including the object detection sub-model can then be fused to obtain the trained neural network model.

[0131] In this embodiment, the neural network model including the target detection sub-model is trained for target detection according to the actual application requirements, so as to achieve the target detection task of the deployed application.

[0132] It should be noted that when the neural network model includes an image classification sub-model and an object detection sub-model, the image classification sub-model and the object detection sub-model can be trained separately and then fused together to achieve the image classification and object detection tasks.

[0133] In some embodiments, the above-mentioned recognition module further includes a semantic segmentation sub-model;

[0134] Accordingly, step A2 above also includes:

[0135] A23. Based on the above semantic segmentation sub-model, the extracted features are segmented to obtain the semantic segmentation result.

[0136] Semantic segmentation combines image classification, object detection, and image segmentation techniques. It uses a specific method to segment an image into regions with specific semantic meanings and identifies the semantic category of each semantic block, resulting in a segmented image with pixel-by-pixel semantic annotations.

[0137] Optionally, the extracted fusion features are input into a trained semantic segmentation sub-model for segmentation processing, outputting the corresponding semantic segmentation results. For example, in face recognition, tasks related to face segmentation typically involve the classification of features such as skin, hair, eyes, mouth, nose, and background. A neural network model extracts fusion features from the image to be recognized. Since these fusion features contain semantic and detailed information about the image, the trained semantic segmentation sub-model processes these features to obtain the semantic segmentation results for the face.

[0138] In this embodiment of the application, since the fused features contain semantic, positional, and detail information, pixel-by-pixel semantic segmentation is performed based on the fused features, resulting in more accurate semantic segmentation results.

[0139] In some embodiments, the semantic segmentation sub-model described above includes a segmentation module, a merging module, and a fourth convolution module;

[0140] Accordingly, step A23 above includes:

[0141] Based on the segmentation module described above, the extracted features are processed by multi-scale convolution to obtain feature maps of different sizes.

[0142] Based on the above merging module, the feature maps of different sizes are merged along the channel direction;

[0143] Based on the fourth convolution module, the feature map output by the merging module is convolved to obtain the semantic segmentation result.

[0144] Optionally, since large objects are detected better on small-scale feature maps, while small objects are detected better on large-scale feature maps, the neural network model for semantic segmentation shown in Figure 8 can be used. A first convolutional module C1 extracts local features from the image to be recognized to obtain first local features. A second convolutional module C2 and a global feature extraction module E are stacked N1 and M1 times respectively to extract features from the first local feature image. A fusion module F then fuses the obtained second local features and global features to obtain fused features. The convolutional branches of the segmentation module perform multi-scale convolution processing on the extracted fused features to obtain feature maps of different sizes. A merging module merges these feature maps along the channel direction. Finally, a fourth convolutional module C performs a 1*1 convolution on the merged feature maps to achieve cross-channel fusion of feature map information, thereby outputting the segmentation result.
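The merging and 1*1 convolution steps can be sketched with nested lists laid out as `[channel][row][column]`. The helper names `concat_channels` and `pointwise_conv`, the sample values and the weights are assumptions; the sketch also assumes that the branch outputs share the same height and width, which channel-wise concatenation requires.

```python
def concat_channels(feature_maps):
    """Merge several [C][H][W] feature maps along the channel direction."""
    return [channel for fmap in feature_maps for channel in fmap]


def pointwise_conv(fmap, weights):
    """1*1 convolution: each output channel is a weighted sum of input channels."""
    height, width = len(fmap[0]), len(fmap[0][0])
    return [
        [[sum(w * fmap[c][y][x] for c, w in enumerate(row)) for x in range(width)]
         for y in range(height)]
        for row in weights
    ]


# Two hypothetical branch outputs, one channel of 2x2 pixels each.
branch_a = [[[1.0, 2.0], [3.0, 4.0]]]
branch_b = [[[10.0, 20.0], [30.0, 40.0]]]

merged = concat_channels([branch_a, branch_b])   # 2 channels
fused = pointwise_conv(merged, [[1.0, 0.5]])     # 1 output channel

print(len(merged))  # 2
print(fused)        # [[[6.0, 12.0], [18.0, 24.0]]]
```

Every output pixel mixes the values of all input channels at the same position, which is the cross-channel fusion performed by the fourth convolutional module.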

[0145] In this embodiment, the extracted fusion features are processed by multi-scale convolution to obtain feature maps of different sizes, which are then merged for semantic segmentation. The feature maps of different sizes make the segmentation of objects of different scales more accurate, thereby improving the accuracy of semantic segmentation. Furthermore, the merged feature maps are processed by pointwise (1*1) convolution to fuse the information of the feature maps across channels, which further improves the accuracy of semantic segmentation.

[0146] In some embodiments, the segmentation module includes M parallel convolutional branches, the topmost convolutional branch uses 1*1 convolution, and the other convolutional branches use dilated convolution with increasing dilation factors, where M is a positive integer greater than 1.

[0147] Accordingly, when step A23 performs multi-scale convolution processing on the extracted features based on the segmentation module, it includes:

[0148] Based on the above convolutional branches, the extracted features are processed by multi-scale convolution.

[0149] Optionally, since semantic segmentation is a pixel-level classification task, guiding pixel classification with semantic information requires acquiring high-resolution feature images rich in semantic information. Dilated convolution can effectively increase the receptive field size of the semantic segmentation sub-model without increasing its model parameters. Therefore, parallel dilated convolutions with different dilation factors are used to extract semantic features at multiple scales. For example, the segmentation module uses the convolutional branch structure shown in Figure 9: the top branch uses a 1*1 convolution, while the other three branches use dilated convolutions with a kernel size of 3*3 and dilation factors of 6, 12, and 18 respectively to extract features from the fused features, thereby obtaining feature maps of different scales with richer semantic features.
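The claim that dilation enlarges the receptive field without adding parameters can be checked with the standard relation k_eff = k + (k - 1) * (d - 1) for kernel size k and dilation factor d. The helper name `effective_kernel_size` is an assumption; the dilation factors are those of the Figure 9 example.

```python
def effective_kernel_size(k, dilation):
    """Span covered along one axis by a k-tap kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)


# 3*3 kernels with the dilation factors used in the example.
spans = {d: effective_kernel_size(3, d) for d in (6, 12, 18)}
weights_per_filter = 3 * 3  # unchanged by dilation

print(spans)               # {6: 13, 12: 25, 18: 37}
print(weights_per_filter)  # 9
```

Each branch therefore covers a 13*13, 25*25 or 37*37 window while still using nine weights per filter and channel.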

[0150] In this embodiment, since dilated convolution can expand the receptive field size without increasing the number of parameters in the neural network model, by using dilated convolution with different dilation factors to extract features from the fused features and obtain feature maps of different scales with rich semantic information, the accuracy of semantic segmentation can be effectively improved, thereby improving the recognition accuracy of the neural network model.

[0151] In some embodiments, before the neural network model including the semantic segmentation sub-model performs the semantic segmentation task, the method further includes:

[0152] Semantic segmentation training is performed on the neural network model that includes the aforementioned semantic segmentation sub-model.

[0153] Optionally, the neural network model including the semantic segmentation sub-model can be trained using conventional training methods to achieve the semantic segmentation task.

[0154] It should be noted that the recognition module of the aforementioned neural network model may include one or more of the following: image classification sub-model, object detection sub-model, and semantic segmentation sub-model, but is not limited to these. In practical applications, the specific functions of the recognition module in the neural network model are set according to the image recognition task requirements of the deployed application. Features are extracted based on the aforementioned feature extraction module based on the MetaFormer structure to perform the corresponding image recognition task.

[0155] It should be understood that the sequence number of each step in the above embodiments does not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.

[0156] Example 2:

[0157] Corresponding to the image recognition method described in the above embodiments, Figure 10 shows a structural block diagram of the device provided in the embodiments of this application. For ease of explanation, only the parts related to the embodiments of this application are shown.

[0158] Referring to Figure 10, the device includes: an input module 101 and a trained neural network model 102, wherein the neural network model extracts global features of the image to be recognized based on a pure convolutional MetaFormer structure.

[0159] The input module 101 is used to input the image to be recognized into the trained neural network model described above;

[0160] The neural network model 102 is used to sequentially extract and recognize features from the above-mentioned image to be recognized, and obtain the recognition result.

[0161] In this embodiment, a trained neural network model sequentially extracts and recognizes features from the image to be recognized, thereby outputting the corresponding image recognition result. Since the neural network model extracts the global features of the image to be recognized based on a pure convolutional MetaFormer structure, it can focus on the global features of the image when performing image recognition tasks. This avoids the decrease in accuracy caused by lightweighting, improves the recognition accuracy of the neural network model, and enhances the actual deployment effect of applications that perform image recognition based on the neural network model.

[0162] In some embodiments, the image recognition device further includes:

[0163] The image acquisition module is used to acquire the image to be recognized.

[0164] In some embodiments, the aforementioned neural network model includes:

[0165] The feature extraction module is used to extract features from the image to be recognized using the aforementioned neural network model.

[0166] The recognition module is used to recognize the image to be recognized through the above neural network model and obtain the recognition result.

[0167] In some embodiments, the feature extraction module includes:

[0168] The global feature extraction module is used to extract global features of the image to be recognized based on the pure convolutional MetaFormer structure.

[0169] In some embodiments, the feature extraction module further includes:

[0170] The first convolution module is used to extract local features from the image to be identified to obtain a first local feature image.

[0171] Accordingly, the global feature extraction module is used to extract global features from the first local feature image to obtain global features.

[0172] In some embodiments, the feature extraction module further includes:

[0173] The second convolution module is used to extract local features from the first local feature image to obtain the second local features.

[0174] The fusion module is used to fuse the aforementioned second local feature with the aforementioned global feature to obtain the fused feature.

[0175] In some embodiments, the global feature extraction module includes:

[0176] The first addition module is used to add and merge the input data of the first residual module and the output data of the global feature extraction module.

[0177] The second addition module is used to add and merge the input data of the second residual module and the output data of the feedforward network module.

[0178] The global feature extraction module is used to extract global features from the input data.

[0179] The aforementioned global feature extraction module includes:

[0180] The first branch is used to extract local features from the input image through depthwise separable convolution.

[0181] The second branch is used to extract global features from the input image.

[0182] The merging module is used to merge the features output from the first branch and the second branch based on the pixel position to obtain global features.

[0183] In some embodiments, the second branch includes:

[0184] A convolutional unit is used to perform a convolution operation on the input image to obtain N feature vectors, where N is a positive integer greater than 1.

[0185] The channel shuffling unit is used to perform channel shuffling on each feature vector to obtain N new feature vectors.

[0186] The sparsification unit is used to sparsify and rearrange the above-mentioned new N feature vectors to obtain a sparse feature map.

[0187] The diffusion unit is used to diffuse the sparse feature map into a dense feature map through a convolution operation before outputting it.

[0188] In some embodiments, the recognition module includes:

[0189] The image classification sub-model is used to classify the extracted features to obtain the image classification result.

[0190] In some embodiments, the image classification sub-model includes:

[0191] The third convolutional unit is used to classify the extracted features and obtain the image classification result.

[0192] In some embodiments, the recognition module further includes:

[0193] The target detection sub-model is used to perform target detection processing on the extracted features to obtain the target detection result.

[0194] In some embodiments, the above-mentioned target detection sub-model includes:

[0195] The detection unit is used to detect the location of a target and / or the category of the target in an image based on the Non-Maximum Suppression (NMS) algorithm.

[0196] In some embodiments, the recognition module further includes:

[0197] The semantic segmentation sub-model is used to segment the extracted features to obtain the semantic segmentation result.

[0198] In some embodiments, the semantic segmentation sub-model described above includes:

[0199] Multi-scale convolutional units are used to perform multi-scale convolution processing on the extracted features to obtain feature maps of different sizes.

[0200] The feature merging unit is used to merge feature maps of different sizes along the channel direction.

[0201] The fourth convolutional unit is used to perform convolution processing on the feature map output by the feature merging unit above to obtain the semantic segmentation result.

[0202] In some embodiments, the multi-scale convolutional unit includes:

[0203] Convolutional branch units are used to extract features from the above-mentioned fused features based on parallel convolutional branches.

[0204] It should be noted that the information interaction and execution process between the above-mentioned devices / units are based on the same concept as the method embodiments of this application. For details on their specific functions and technical effects, please refer to the method embodiments section, and they will not be repeated here.

[0205] Example 3:

[0206] Figure 11 is a schematic diagram of the structure of a terminal device provided in an embodiment of this application. As shown in Figure 11, the terminal device 11 of this embodiment includes: at least one processor 110 (only one processor is shown in Figure 11), a memory 111, and a computer program 112 stored in the memory 111 and executable on the at least one processor 110. When the processor 110 executes the computer program 112, the steps in any of the above-described method embodiments are implemented.

[0207] For example, the computer program 112 can be divided into one or more modules / units, which are stored in the memory 111 and executed by the processor 110 to complete this application. The one or more modules / units can be a series of computer program instruction segments capable of performing specific functions, which describe the execution process of the computer program 112 in the terminal device 11. For example, the computer program 112 can be divided into an input module 101 and a trained neural network model 102, wherein the neural network model extracts global features of the image to be recognized based on a pure convolutional MetaFormer structure. The specific functions of each module are as follows:

[0208] The input module 101 is used to input the image to be recognized into the trained neural network model described above;

[0209] The neural network model 102 is used to sequentially extract and recognize features from the above-mentioned image to be recognized, and obtain the recognition result.

[0210] The terminal device 11 can be a desktop computer, laptop, handheld computer, cloud server, or other computing device. This terminal device may include, but is not limited to, a processor 110 and a memory 111. Those skilled in the art will understand that Figure 11 is merely an example of the terminal device 11 and does not constitute a limitation on the terminal device 11. The terminal device may include more or fewer components than shown, combine certain components, or use different components; for example, it may also include input/output devices, network access devices, etc.

[0211] The processor 110 may be a Central Processing Unit (CPU), or it may be other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor or any conventional processor.

[0212] In some embodiments, the memory 111 may be an internal storage unit of the terminal device 11, such as a hard disk or memory of the terminal device 11. In other embodiments, the memory 111 may be an external storage device of the terminal device 11, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, flash card, etc., equipped on the terminal device 11. Furthermore, the memory 111 may include both internal and external storage units of the terminal device 11. The memory 111 is used to store the operating system, applications, bootloader, data, and other programs, such as the program code of the computer program. The memory 111 can also be used to temporarily store data that has been output or will be output.

[0213] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the above-described division of functional units and modules is merely an example. In practical applications, the above functions can be assigned to different functional units and modules as needed, that is, the internal structure of the device can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit. Furthermore, the specific names of the functional units and modules are only for easy differentiation and are not intended to limit the scope of protection of this application. The specific working process of the units and modules in the above system can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.

[0214] This application also provides a network device, which includes: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, wherein the processor executes the computer program to implement the steps in any of the above method embodiments.

[0215] This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps described in the various method embodiments above.

[0216] This application provides a computer program product that, when run on a terminal device, enables the terminal device to implement the steps described in the above-described method embodiments.

[0217] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of this application can be implemented by a computer program instructing related hardware. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the various method embodiments described above. The computer program includes computer program code, which can be in the form of source code, object code, executable files, or certain intermediate forms. The computer-readable medium can include at least: any entity or device capable of carrying computer program code to a photographing device / terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium. Examples include USB flash drives, portable hard drives, magnetic disks, or optical disks. In some jurisdictions, according to legislation and patent practice, computer-readable media cannot be electrical carrier signals or telecommunication signals.

[0218] In the above embodiments, the descriptions of each embodiment have different focuses. For parts that are not described in detail or recorded in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0219] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0220] In the embodiments provided in this application, it should be understood that the disclosed apparatus/network devices and methods can be implemented in other ways. For example, the apparatus/network device embodiments described above are merely illustrative. For instance, the division of modules or units is only a logical functional division, and in actual implementation there may be other division methods: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.

[0221] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0222] The above-described embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application, and should all be included within the protection scope of this application.

Claims

1. An image recognition method based on a neural network model, characterized in that, The neural network model extracts global features of the image to be recognized through a Meta Former structure based on pure convolution. The Meta Former structure based on pure convolution means that the entire Meta Former structure is pure convolution. The image recognition method includes: The image to be recognized is input into the trained neural network model, which then sequentially performs feature extraction and recognition on the image to be recognized, thereby obtaining the recognition result. The neural network model includes a feature extraction module and a recognition module. The feature extraction module includes a global feature extraction module built based on the Meta Former structure of pure convolution. The process of sequentially performing feature extraction and recognition on the image to be recognized using the neural network model to obtain the recognition result includes: The feature extraction module performs feature extraction on the image to be recognized; The extracted features are recognized based on the recognition module to obtain the recognition result; The global feature extraction module comprises, from input to output, a first residual module, a global feature sub-extraction module, a first merging module, a second residual module, a feedforward network module, and a second merging module connected in sequence; The first merging module is used to add and merge the input data of the first residual module and the output data of the global feature sub-extraction module; The second merging module is used to add and merge the input data of the second residual module and the output data of the feedforward network module; The global feature sub-extraction module includes a first branch, a second branch, and a merging module; The first branch is used to extract local features from the input image through depthwise separable convolution; The second branch is used to extract global features from the input image; The merging module is used to merge the features output by the first branch and the second branch according to the pixel position to obtain global features.
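As a non-limiting illustration, the sequential wiring recited in claim 1 (residual path, global feature step, additive merge, second residual path, feedforward network, second additive merge) can be sketched as follows. The function names and the flat list-of-floats feature representation are hypothetical, and each residual module is assumed here to be a plain pass-through that forks the skip connection; the claim does not state what else it may contain.

```python
from typing import Callable, List

Features = List[float]  # hypothetical flat feature representation


def add_merge(a: Features, b: Features) -> Features:
    """Element-wise addition, standing in for an additive merging module."""
    if len(a) != len(b):
        raise ValueError("features must have the same length to be added")
    return [x + y for x, y in zip(a, b)]


def metaformer_style_block(x: Features,
                           global_step: Callable[[Features], Features],
                           feed_forward: Callable[[Features], Features]) -> Features:
    # First skip connection: input of the first residual path + global step output.
    y = add_merge(x, global_step(x))
    # Second skip connection: input of the second residual path + feedforward output.
    return add_merge(y, feed_forward(y))


# Toy stand-ins (assumptions), chosen only so the arithmetic is easy to follow.
out = metaformer_style_block(
    [1.0, 2.0],
    global_step=lambda v: [2.0 * t for t in v],
    feed_forward=lambda v: [1.0 for _ in v],
)
```

With these toy functions the first merge gives [3.0, 6.0] and the second gives [4.0, 7.0], which shows that each stage adds its output onto its own input rather than replacing it.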

2. The image recognition method as described in claim 1, characterized in that, The feature extraction module further includes a first convolution module; The feature extraction based on the feature extraction module for the image to be recognized includes: Based on the first convolutional module, local features are extracted from the image to be recognized to obtain a first local feature image; The global feature extraction module performs global feature extraction on the first local feature image to obtain global features.

3. The image recognition method as described in claim 2, characterized in that, The feature extraction module further includes a second convolution module and a fusion module. The second convolution module uses a lightweight convolutional neural network to extract second local features. The feature extraction based on the feature extraction module further includes: Based on the second convolution module, local features are extracted from the first local feature image to obtain the second local features; The fusion module fuses the second local feature with the global feature to obtain the fused feature.

4. The image recognition method as described in claim 3, characterized in that, The second branch is specifically used for: performing a convolution operation on the input image to obtain N feature vectors; performing a channel shuffling operation on each feature vector to obtain new N feature vectors; performing sparsification rearrangement on the new N feature vectors to obtain a sparse feature map; and outputting the sparse feature map after spreading it into a dense feature map through a convolution operation, where N is a positive integer greater than 1.
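As a non-limiting illustration of the channel shuffling step only, the sketch below interleaves channels across groups in the manner commonly used in grouped-convolution networks. The grouping scheme, the function name, and the use of integers as channel labels are assumptions; the claim does not specify how the shuffle is performed, and the sparsification rearrangement and the final convolution are not modeled here.

```python
from typing import List, TypeVar

T = TypeVar("T")


def channel_shuffle(channels: List[T], groups: int) -> List[T]:
    """Interleave channels across groups.

    Equivalent to viewing the channel list as a (groups, per_group) grid,
    transposing it, and flattening it again.
    """
    n = len(channels)
    if groups <= 0 or n % groups != 0:
        raise ValueError("channel count must be divisible by the number of groups")
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


# Six channels in two groups: [0, 1, 2] and [3, 4, 5].
shuffled = channel_shuffle([0, 1, 2, 3, 4, 5], groups=2)
```

After the shuffle, neighboring positions hold channels from different groups ([0, 3, 1, 4, 2, 5]), so a later grouped operation sees information from every original group.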

5. The image recognition method as described in claim 4, characterized in that, The recognition module includes an image recognition sub-model; The process of identifying the extracted features based on the identification module to obtain the identification result includes: The extracted features are classified based on the image recognition sub-model to obtain the image classification result.

6. The image recognition method as described in claim 5, characterized in that, The image recognition sub-model includes a third convolutional module; The process of classifying the extracted features based on the image recognition sub-model to obtain image classification results includes: The extracted features are classified based on the third convolution module to obtain the image classification result.

7. The image recognition method as described in claim 1, characterized in that, The recognition module also includes a target detection sub-model; The process of identifying the extracted features based on the identification module to obtain the identification result includes: The extracted features are processed by the target detection sub-model to obtain the target detection result.

8. The image recognition method as described in claim 7, characterized in that, The target detection sub-model detects the location of targets and/or the category of targets in an image based on a non-maximum suppression algorithm.
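As a non-limiting illustration, a greedy form of non-maximum suppression can be sketched as follows. The box format (x1, y1, x2, y2), the 0.5 overlap threshold, and the function names are assumptions; the claim does not fix any of them.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # assumed format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: Sequence[Box], scores: Sequence[float],
        iou_threshold: float = 0.5) -> List[int]:
    """Return indices of kept boxes, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any kept box.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
kept = nms(boxes, scores=[0.9, 0.8, 0.7])
```

In this toy input the second box overlaps the first with an IoU of 81/119 (about 0.68), so it is suppressed, while the distant third box is kept.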

9. The image recognition method as described in claim 1, characterized in that, The recognition module also includes a semantic segmentation sub-model; The step of identifying the extracted features through the identification module to obtain the identification result includes: The extracted features are segmented based on the semantic segmentation sub-model to obtain the semantic segmentation result.

10. The image recognition method as described in claim 9, characterized in that, The semantic segmentation sub-model includes a segmentation module, a merging module, and a fourth convolutional module; The process of segmenting the extracted features based on the semantic segmentation sub-model to obtain semantic segmentation results includes: Based on the segmentation module, the extracted features are processed by multi-scale convolution to obtain feature maps of different sizes; The merging module merges the feature maps of different sizes along the channel direction; The fourth convolutional module performs convolution processing on the feature map output by the merging module to obtain the semantic segmentation result.

11. The image recognition method as described in claim 10, characterized in that, The segmentation module includes M parallel convolutional branches, with the topmost convolutional branch using 1×1 convolution, and other convolution branches using dilated convolution with successively increasing dilation factors, where M is a positive integer greater than 1; The step of performing multi-scale convolution processing on the extracted features based on the segmentation module to obtain semantic segmentation results includes: The extracted features are processed using multi-scale convolution based on the convolutional branches.
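As a non-limiting illustration, parallel branches with successively increasing dilation factors can be sketched in one dimension as follows. The sketch assumes that the topmost branch is a pointwise (kernel size 1) convolution, and it uses a 1-D signal, zero padding, all-ones kernels of length 3, and hypothetical function names purely to keep the arithmetic visible; none of these details are taken from the claim.

```python
from typing import List, Sequence


def dilated_conv1d(signal: Sequence[float], kernel: Sequence[float],
                   dilation: int) -> List[float]:
    """Zero-padded, same-length 1-D dilated convolution (cross-correlation form)."""
    k = len(kernel)
    center = k // 2
    out: List[float] = []
    for i in range(len(signal)):
        acc = 0.0
        for t in range(k):
            j = i + (t - center) * dilation   # taps are spaced `dilation` apart
            if 0 <= j < len(signal):
                acc += kernel[t] * signal[j]
        out.append(acc)
    return out


def multi_scale_branches(signal: Sequence[float],
                         dilations: Sequence[int]) -> List[List[float]]:
    """Run a pointwise branch plus one dilated branch per dilation factor."""
    branches = [dilated_conv1d(signal, [1.0], 1)]            # pointwise branch
    for d in dilations:
        branches.append(dilated_conv1d(signal, [1.0, 1.0, 1.0], d))
    return branches  # stacking the outputs plays the role of channel-wise merging


impulse = [0, 0, 0, 1, 0, 0, 0]
branches = multi_scale_branches(impulse, dilations=[1, 2])
```

Feeding a single impulse shows the effect of dilation: the branch with dilation 1 responds at three adjacent positions, while the branch with dilation 2 responds at positions two apart, covering a wider context with the same number of kernel weights.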

12. An image recognition device, characterized in that, it includes: An input module and a trained neural network model, wherein the neural network model extracts global features of the image to be recognized based on a pure convolutional Meta Former structure, wherein the pure convolutional Meta Former structure refers to the entire Meta Former structure being pure convolution; The input module is used to: input the image to be recognized into the neural network model; The neural network model is used to: sequentially perform feature extraction and recognition on the image to be recognized, and obtain the recognition result; The neural network model includes a feature extraction module and a recognition module; the feature extraction module includes a global feature extraction module constructed based on the Meta Former structure of pure convolution; The neural network model is specifically used to: extract features from the image to be recognized based on the feature extraction module; The extracted features are recognized based on the recognition module to obtain the recognition result; The global feature extraction module comprises, from input to output, a first residual module, a global feature sub-extraction module, a first merging module, a second residual module, a feedforward network module, and a second merging module connected in sequence; The first merging module is used to add and merge the input data of the first residual module and the output data of the global feature sub-extraction module; The second merging module is used to add and merge the input data of the second residual module and the output data of the feedforward network module; The global feature sub-extraction module includes a first branch, a second branch, and a merging module; The first branch is used to extract local features from the input image through depthwise separable convolution; The second branch is used to extract global features from the input image; The merging module is used to merge the features output by the first branch and the second branch according to the pixel position to obtain global features.

13. The image recognition device according to claim 12, characterized in that, The feature extraction module further includes a first convolution module; The feature extraction based on the feature extraction module for the image to be recognized includes: Based on the first convolutional module, local features are extracted from the image to be recognized to obtain a first local feature image; Based on the global feature extraction module, global features are extracted from the first local feature image to obtain global features; The recognition module includes: The image classification unit is used to classify the extracted features based on the image recognition sub-model to obtain the image classification result.

14. The image recognition device according to claim 13, characterized in that, The feature extraction module further includes: a second convolution module and a fusion module, wherein the second convolution module uses a lightweight convolutional neural network to extract second local features; The second convolution module is used to extract local features from the first local feature image to obtain the second local features; The fusion module is used to fuse the second local feature with the global feature to obtain a fused feature.

15. A terminal device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, When the processor executes the computer program, it implements the method as described in any one of claims 1 to 11.

16. A computer-readable storage medium storing a computer program, characterized in that, When the computer program is executed by a processor, it implements the method as described in any one of claims 1 to 11.