Methods, devices, media and equipment for detecting facial landmarks

By constructing a facial landmark detection model from feature extraction, feature enhancement, and feature regression networks, the method balances speed and accuracy, achieving efficient and accurate facial landmark detection.

CN117315740B · Active · Publication Date: 2026-06-30 · CHINA AUTOMOTIVE INNOVATION CORP

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: CHINA AUTOMOTIVE INNOVATION CORP
Filing Date: 2023-08-31
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing technologies struggle to strike a balance between speed and high accuracy in facial landmark detection; typically, improving one aspect comes at the expense of the other.

Method used

A facial landmark detection model is adopted, including a feature extraction network, a feature enhancement network, and a feature regression network. The feature enhancement network is built from redundant bottleneck modules, which are in turn composed of redundant sub-modules. The feature regression network uses multi-scale fully connected layers for feature mapping, so that detection is achieved end to end.

Benefits of technology

It improves the efficiency and accuracy of facial landmark detection and reduces computational load, making it suitable for devices with limited computing resources, such as mobile devices.

✦ Generated by Eureka AI based on patent content.

Abstract

This application discloses a method, apparatus, medium, and device for detecting facial key points, relating to the field of object detection. The method includes: acquiring an image to be detected; inputting the image to be detected into a feature extraction network of a facial key point detection model for feature extraction processing to obtain a first feature map; inputting the first feature map into a feature enhancement network of the facial key point detection model for feature enhancement processing to obtain a second feature map, the second feature map containing redundant feature information of the first feature map; inputting the second feature map into a feature regression network of the facial key point detection model for feature point detection processing to obtain feature point detection results, the feature point detection results indicating whether each feature point in the second feature map is a facial key point; and determining the predicted location information corresponding to the facial key points in the image to be detected based on the feature point detection results. The technical solution provided by this application can balance speed and accuracy, improving the efficiency of facial key point detection.

Description

Technical Field

[0001] This application relates to the field of target detection, specifically to methods, devices, media, and equipment for detecting facial key points.

Background Technology

[0002] Facial landmark detection, also known as facial landmark localization or face alignment, refers to locating the key regions of a face, including eyebrows, eyes, nose, mouth, and facial contours, given a facial image. Due to the influence of factors such as pose and occlusion, facial landmark detection is a challenging task.

[0003] Facial landmark detection is a crucial step in the field of facial recognition and analysis. It serves as a prerequisite and breakthrough for other face-related problems such as automatic facial recognition, expression analysis, 3D face reconstruction, and 3D animation. It also plays a key role in numerous scientific research and application projects, such as facial pose correction, pose recognition, expression recognition, fatigue monitoring, and lip shape recognition.

[0004] Facial landmark detection, a fundamental task in face-related applications, still faces many challenges, including detection accuracy, processing speed, and model size. It is well known that improving accuracy inevitably sacrifices speed, and vice versa. However, in practical engineering development, it is desirable to achieve both.

Summary of the Invention

[0005] To achieve both speed and high accuracy in facial landmark detection, this application provides a method, apparatus, medium, and device for facial landmark detection. The technical solution is as follows:

[0006] Firstly, this application provides a method for detecting facial landmarks, the method comprising:

[0007] Acquire an image to be detected, wherein the image to be detected contains a face region;

[0008] The image to be detected is input into the feature extraction network of the face key point detection model for feature extraction processing to obtain the first feature map.

[0009] The first feature map is input into the feature enhancement network of the face key point detection model for feature enhancement processing to obtain a second feature map, which contains redundant feature information of the first feature map.

[0010] The second feature map is input into the feature regression network of the face key point detection model to perform feature point detection processing and obtain feature point detection results. The feature point detection results indicate whether each feature point in the second feature map is a face key point.

[0011] Based on the feature point detection results, the predicted location information corresponding to the facial key points in the image to be detected is determined.

[0012] Optionally, the step of inputting the image to be detected into the feature extraction network of the facial landmark detection model for feature extraction processing to obtain a first feature map includes:

[0013] The image to be detected is subjected to a 3×3 convolution to obtain an initial feature map;

[0014] The initial feature map is subjected to channel-by-channel convolution to obtain the first feature map.

[0015] Optionally, the feature enhancement network includes multiple redundant bottleneck modules. The step of inputting the first feature map into the feature enhancement network of the facial landmark detection model for feature enhancement processing to obtain the second feature map includes:

[0016] Determine the target input feature map of the target redundancy bottleneck module, wherein the target input feature map is either the first feature map or the previous output feature map of the previous redundancy bottleneck module; the target redundancy bottleneck module is any one of the plurality of redundancy bottleneck modules.

[0017] The target input feature map is input into at least one redundant sub-module in the target redundancy bottleneck module, and feature augmentation processing is performed to obtain the target redundancy feature map.

[0018] The target input feature map and the normalized target redundancy feature map are added together element-wise (tensor addition) to obtain the target output feature map of the target redundancy bottleneck module.

[0019] The target output feature map is used as the next input feature map of the next redundant bottleneck module, and the above process is repeated until the output of the last redundant bottleneck module is used as the second feature map.
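The chaining described in [0016] to [0019] can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation of this application: `channel_norm` is a simplified per-channel standardization standing in for the unspecified normalization, and the redundant sub-modules are passed in as hypothetical shape-preserving callables.

```python
import numpy as np

def channel_norm(x, eps=1e-5):
    """Standardize each channel of a (C, H, W) map (assumed stand-in for normalization)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def redundant_bottleneck(target_input, submodules):
    """One bottleneck: run the sub-modules, normalize, then add the input back."""
    redundancy = target_input
    for submodule in submodules:
        redundancy = submodule(redundancy)
    # Element-wise addition requires the redundancy map to keep the input shape.
    return target_input + channel_norm(redundancy)

def feature_enhancement(first_feature_map, bottlenecks):
    """Feed each bottleneck's output into the next; the last output is the second feature map."""
    x = first_feature_map
    for submodules in bottlenecks:
        x = redundant_bottleneck(x, submodules)
    return x

rng = np.random.default_rng(0)
first = rng.standard_normal((8, 16, 16))
toy_submodule = lambda t: np.maximum(t, 0.0)  # hypothetical placeholder sub-module
second = feature_enhancement(first, [[toy_submodule, toy_submodule]] * 3)
print(second.shape)  # (8, 16, 16)
```

Because each bottleneck adds its input back to the normalized redundancy map, the number of channels and the spatial size are unchanged from one module to the next in this sketch.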

[0020] Optionally, the step of inputting the target input feature map into at least one redundant submodule in the target redundancy bottleneck module and performing feature augmentation processing to obtain the target redundancy feature map includes:

[0021] Determine the target input feature sub-map of the target redundancy sub-module, wherein the target input feature sub-map is either the target input feature map or the previous output feature sub-map of the previous redundancy sub-module; the target redundancy sub-module is any one of the at least one redundancy sub-module;

[0022] The target input feature submap is subjected to pointwise convolution processing to obtain the intrinsic feature submap, which includes channel feature submaps of multiple channels;

[0023] Perform a linear transformation on the channel feature submap of each channel to obtain multiple redundant feature submaps;

[0024] The intrinsic feature sub-map and the multiple redundant feature sub-maps are concatenated along the channel dimension (tensor concatenation) to obtain the target output feature sub-map, which is used as the output of the target redundant sub-module.

[0025] The target output feature submap is normalized or activated to obtain the next input feature submap of the next redundant submodule. The above process is repeated until the output of the last redundant submodule is used as the target redundant feature map.
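A single redundant sub-module, as described in [0022] to [0024], might look like the sketch below. The text only says that a linear transformation is applied to each channel; the sketch assumes that this transformation is a per-channel 3×3 filter, which is one common choice for such cheap operations. All shapes and names (`pointwise_weights`, `cheap_kernels`) are illustrative.

```python
import numpy as np

def depthwise3x3(x, kernels):
    """Filter each channel of x (C, H, W) with its own 3x3 kernel (C, 3, 3), zero padded."""
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c, h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernels[:, dy, dx][:, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out

def redundant_submodule(sub_map, pointwise_weights, cheap_kernels):
    """sub_map: (C_in, H, W); pointwise_weights: (M, C_in); cheap_kernels: (S, M, 3, 3)."""
    # Pointwise (1x1) convolution yields M intrinsic channel sub-maps.
    intrinsic = np.einsum("mc,chw->mhw", pointwise_weights, sub_map)
    # Each cheap linear transform maps every intrinsic channel to one redundant channel.
    redundant = [depthwise3x3(intrinsic, k) for k in cheap_kernels]
    # Concatenate along the channel axis: M * (S + 1) output channels.
    return np.concatenate([intrinsic] + redundant, axis=0)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 8))
weights = rng.standard_normal((6, 4))
kernels = rng.standard_normal((2, 6, 3, 3))
out = redundant_submodule(x, weights, kernels)
print(out.shape)  # (18, 8, 8)
```

Only the pointwise convolution mixes channels; the redundant sub-maps are derived from the intrinsic ones with one small filter per channel, which is where the saving in computation comes from.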

[0026] Optionally, the step of inputting the second feature map into the feature regression network of the facial key point detection model for feature point detection processing to obtain the feature point detection result includes:

[0027] The second feature map is input into the convolutional layer of the feature regression network for convolution processing to obtain the third feature map;

[0028] The third feature map is input into the multi-scale pooling layer of the feature regression network and average pooling is performed to obtain multiple fourth feature maps.

[0029] The multiple fourth feature maps are input into the multi-scale fully connected layer of the feature regression network for feature mapping processing to obtain the feature point detection results.
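The pooling and mapping steps in [0028] and [0029] can be illustrated as follows. This sketch assumes that the pooled maps are flattened, concatenated, and passed through a single fully connected layer; the grid sizes, the layer sizes, and the output length of 136 are hypothetical, and the convolution that produces the third feature map is taken as already applied.

```python
import numpy as np

def avg_pool_to_grid(x, g):
    """Average-pool a (C, H, W) map down to (C, g, g); H and W must be divisible by g."""
    c, h, w = x.shape
    return x.reshape(c, g, h // g, g, w // g).mean(axis=(2, 4))

def regression_head(third_feature_map, grids, fc_weight, fc_bias):
    """Multi-scale average pooling followed by one fully connected mapping."""
    fourth_maps = [avg_pool_to_grid(third_feature_map, g) for g in grids]
    features = np.concatenate([m.ravel() for m in fourth_maps])
    return fc_weight @ features + fc_bias

rng = np.random.default_rng(0)
third = rng.standard_normal((32, 8, 8))
grids = (1, 2, 4)                                   # assumed pooling scales
n_features = 32 * sum(g * g for g in grids)         # 32 * (1 + 4 + 16) = 672
num_outputs = 136                                   # hypothetical output length
fc_weight = rng.standard_normal((num_outputs, n_features)) * 0.01
fc_bias = np.zeros(num_outputs)
result = regression_head(third, grids, fc_weight, fc_bias)
print(result.shape)  # (136,)
```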

[0030] Optionally, the method further includes:

[0031] Obtain training sample images;

[0032] Based on the scale-invariant feature transform matching algorithm, the feature description information of the facial key points in the training sample image is determined;

[0033] The feature description information of the facial key points is input into the neural network model to be trained, and used as the prior features of the training sample images.

[0034] The training sample image is input into the neural network model to perform facial key point detection processing, and the sample feature point detection result of the facial key point in the training sample image is obtained.

[0035] Based on the feature description information and the sample feature point detection results, the loss information is determined;

[0036] Based on the loss information, the neural network model is trained to obtain the facial landmark detection model.

[0037] Optionally, the scale-invariant feature transform matching algorithm for determining the feature description information of the facial key points in the training sample image includes:

[0038] The training sample images are subjected to multi-scale transformation to obtain images at multiple scales;

[0039] Determine the Gaussian difference image corresponding to every two adjacent scale images in the plurality of scale images;

[0040] Based on the Gaussian difference images corresponding to each pair of adjacent scale images, determine at least one extreme point in each Gaussian difference image;

[0041] Based on at least one extreme point in each Gaussian difference image, at least one candidate feature point in the training sample image is determined, and the at least one candidate feature point is used as the facial key point;

[0042] Based on the gradient information of the training sample images, the feature description information of the facial key points is determined.
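Steps [0038] to [0041] can be sketched with NumPy as below. This is a simplified illustration under stated assumptions: extreme points are searched only within a 3×3 spatial neighborhood of each Gaussian difference image (the full scale-invariant feature transform also compares neighboring scales), the sigma values and `threshold` are hypothetical, and the gradient-based description step of [0042] is omitted.

```python
import numpy as np

def gaussian_kernel(sigma):
    radius = max(1, int(3 * sigma + 0.5))
    x = np.arange(-radius, radius + 1)
    k = np.exp(-(x ** 2) / (2 * sigma ** 2))
    return k / k.sum()

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with edge padding; output has the same shape as img."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(img, r, mode="edge")
    rows = np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda v: np.convolve(v, k, mode="valid"), 0, rows)

def difference_of_gaussians(img, sigmas):
    """One Gaussian difference image per pair of adjacent scales."""
    blurred = [gaussian_blur(img, s) for s in sigmas]
    return [b - a for a, b in zip(blurred[:-1], blurred[1:])]

def local_extrema(dog, threshold=0.01):
    """Pixels that are the maximum or minimum of their 3x3 neighborhood."""
    h, w = dog.shape
    points = []
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            patch = dog[y - 1:y + 2, x - 1:x + 2]
            v = dog[y, x]
            if abs(v) > threshold and (v == patch.max() or v == patch.min()):
                points.append((y, x))
    return points

img = np.zeros((21, 21))
img[10, 10] = 1.0                                   # a single bright spot
dogs = difference_of_gaussians(img, sigmas=(1.0, 1.6, 2.56))
candidates = local_extrema(dogs[0])
print((10, 10) in candidates)  # True
```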

[0043] Optionally, determining the loss information based on the feature description information and the sample feature point detection results includes:

[0044] From the feature description information, determine the first feature description sub-information corresponding to the left side region of the face and the second feature description sub-information corresponding to the right side region of the face;

[0045] From the sample feature point detection results, determine the first sample feature point detection sub-result corresponding to the left side region of the face and the second sample feature point detection sub-result corresponding to the right side region of the face;

[0046] Based on the first feature description sub-information, the first sample feature point detection sub-result, the second feature description sub-information, and the second sample feature point detection sub-result, the first loss data is determined;

[0047] From the feature description information, determine the third feature description sub-information corresponding to the target key points, the target key points including the center point of the eye region, the center point of the mouth region, and the tip of the nose;

[0048] From the sample feature point detection results, determine the third sample feature point detection sub-result corresponding to the target key points;

[0049] The second loss data is determined based on the third feature description sub-information and the third sample feature point detection sub-result;

[0050] The sum of the first loss data and the second loss data is input into a preset loss function to obtain the loss information.
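The composition of the loss in [0044] to [0050] can be illustrated as follows. The text does not specify how description information and detection sub-results are compared, nor what the preset loss function is, so this sketch assumes a mean squared error for the comparison and takes `preset_loss` as a hypothetical parameter that defaults to the identity. The input arrays are made-up values of matching shape.

```python
import numpy as np

def mean_squared_error(description, detection):
    """Assumed comparison between description information and a detection sub-result."""
    description = np.asarray(description, dtype=float)
    detection = np.asarray(detection, dtype=float)
    return float(np.mean((description - detection) ** 2))

def loss_information(left, right, target, preset_loss=lambda total: total):
    """Each argument is a (description, detection) pair of equally shaped arrays."""
    # First loss data: left-side and right-side face regions.
    first_loss = mean_squared_error(*left) + mean_squared_error(*right)
    # Second loss data: target key points (eye centers, mouth center, nose tip).
    second_loss = mean_squared_error(*target)
    # The sum is passed to the preset loss function.
    return preset_loss(first_loss + second_loss)

left = ([1.0, 2.0], [1.0, 2.0])
right = ([3.0, 4.0], [3.0, 4.0])
target = ([0.0, 0.0], [3.0, 4.0])
print(loss_information(left, right, target))  # 12.5
```

Keeping a separate term for the target key points gives those points additional weight relative to the rest of the face in this reading of the text.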

[0051] Secondly, this application provides a facial landmark detection device, the device comprising:

[0052] The acquisition module is used to acquire the image to be detected, wherein the image to be detected contains a face region;

[0053] The feature extraction module is used to input the image to be detected into the feature extraction network of the face key point detection model, perform feature extraction processing, and obtain the first feature map;

[0054] The feature enhancement module is used to input the first feature map into the feature enhancement network of the face key point detection model, perform feature enhancement processing, and obtain a second feature map, wherein the second feature map contains redundant feature information of the first feature map;

[0055] The feature regression module is used to input the second feature map into the feature regression network of the face key point detection model, perform feature point detection processing, and obtain feature point detection results. The feature point detection results indicate whether each feature point in the second feature map is a face key point.

[0056] The key point location determination module is used to determine the predicted location information of the facial key points in the image to be detected based on the feature point detection results.

[0057] Thirdly, this application provides a computer-readable storage medium storing at least one instruction or at least one program, which is loaded and executed by a processor to implement a method for detecting facial key points as described in the first aspect.

[0058] Fourthly, this application provides a computer device including a processor and a memory, wherein the memory stores at least one instruction or at least one program, and the at least one instruction or at least one program is loaded and executed by the processor to implement a method for detecting facial key points as described in the first aspect.

[0059] Fifthly, this application provides a computer program product, which includes computer instructions that, when executed by a processor, implement a method for detecting facial key points as described in the first aspect.

[0060] The method, apparatus, medium, and equipment for detecting facial key points provided in this application have the following technical effects:

[0061] The solution provided in this application constructs a facial landmark detection model, including a feature extraction network, a feature enhancement network, and a feature regression network. These networks perform feature extraction, feature enhancement, and feature regression processing on the image to be detected, respectively, to obtain the feature point detection results for the image. The feature point detection results indicate whether each feature point is a facial landmark, and thus, based on the feature point detection results, the predicted location information corresponding to the facial landmarks in the image to be detected can be determined. In the technical solution provided in this application, the facial landmark detection model achieves end-to-end detection or localization, improving the efficiency of facial landmark detection and localization. Furthermore, compared to conventional convolution operations, the feature enhancement module utilizes redundant feature information from the first feature map output by the feature extraction module to enhance the first feature map. This operation is simple and effectively reduces the computational load, significantly improving processing efficiency. Moreover, feature enhancement can improve the prediction accuracy of the feature regression module.

[0062] Additional aspects and advantages of this application will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of this application.

Attached Figure Description

[0063] To more clearly illustrate the technical solutions and advantages in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0064] Figure 1 is a schematic diagram of the implementation environment of a facial key point detection method provided in an embodiment of this application;

[0065] Figure 2 is a flowchart illustrating a method for detecting facial key points provided in an embodiment of this application;

[0066] Figure 3 is a schematic diagram of a feature enhancement process provided in an embodiment of this application;

[0067] Figure 4 is a schematic diagram of the structure of a redundant bottleneck module provided in an embodiment of this application;

[0068] Figure 5 is a schematic diagram of another redundant bottleneck module provided in an embodiment of this application;

[0069] Figure 6 is a schematic flowchart of a feature augmentation process provided in an embodiment of this application;

[0070] Figure 7 is a schematic diagram of the structure of a redundant submodule provided in an embodiment of this application;

[0071] Figure 8 is a schematic diagram of a conventional convolution and a phantom convolution provided in an embodiment of this application;

[0072] Figure 9 is a hierarchical diagram of a facial landmark detection model provided in an embodiment of this application;

[0073] Figure 10 is a schematic diagram of a model training process provided in an embodiment of this application;

[0074] Figure 11 is a schematic diagram of a facial key point detection device provided in an embodiment of this application;

[0075] Figure 12 is a schematic diagram of the hardware structure of a device for implementing a method for detecting facial key points, provided in an embodiment of this application.

Detailed Implementation

[0076] Artificial Intelligence (AI) is the theory, methods, technology, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to achieve optimal results. In other words, AI is a comprehensive technology within computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines, enabling them to have perception, reasoning, and decision-making capabilities. AI technology is a comprehensive discipline involving a wide range of fields, encompassing both hardware and software technologies. Fundamental AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing technology, operating / interactive systems, and mechatronics.

[0077] The solutions provided in this application involve technologies such as deep learning (DL) in artificial intelligence.

[0078] Deep learning (DL) is a major research direction in the field of machine learning (ML), bringing it closer to its original goal—artificial intelligence. Deep learning learns the inherent patterns and hierarchical representations of sample data; the information gained during this learning process greatly aids in interpreting data such as text, images, and sound. Its ultimate goal is to enable machines to possess analytical and learning capabilities like humans, capable of recognizing data such as text, images, and sound. Deep learning is a complex machine learning algorithm that has achieved results in speech and image recognition far exceeding previous related technologies. Deep learning has yielded significant achievements in search technology, data mining, machine learning, machine translation, natural language processing, multimedia learning, speech recognition, recommendation and personalization technologies, and other related fields. Deep learning enables machines to mimic human activities such as sight, hearing, and thought, solving many complex pattern recognition problems and significantly advancing artificial intelligence-related technologies.

[0079] The solutions provided in this application can be deployed in the cloud, and also involve cloud technologies.

[0080] Cloud technology refers to a hosting technology that unifies hardware, software, and network resources within a wide area network (WAN) or local area network (LAN) to achieve data computation, storage, processing, and sharing. It can also be understood as a general term for network technologies, information technologies, integration technologies, management platform technologies, and application technologies based on cloud computing business models. These technologies can form resource pools, allowing for on-demand use and flexibility. Backend services of cloud computing systems require substantial computing and storage resources, such as video websites, image websites, and many portal websites. With the rapid development and application of the internet industry, every item may have its own identification mark in the future, requiring transmission to backend systems for logical processing. Data at different levels will be processed separately, and various industry data require robust system support; therefore, cloud technology relies on cloud computing as its foundation. Cloud computing is a computing model that distributes computing tasks across a resource pool composed of numerous computers, enabling various application systems to obtain computing power, storage space, and information services as needed. The network providing these resources is called the "cloud." From the user's perspective, resources in the "cloud" are infinitely scalable, readily available, and can be used on demand, expanded at any time, and paid for based on usage. As a provider of fundamental cloud computing capabilities, a cloud resource pool platform, often referred to as a cloud platform or Infrastructure as a Service (IaaS), is established. This platform deploys various types of virtual resources within the resource pool for external customers to choose from. 
The cloud resource pool primarily includes: computing devices (which can be virtualized machines containing operating systems), storage devices, and network devices.

[0081] To achieve both speed and high accuracy in facial landmark detection, embodiments of this application provide methods, apparatus, media, and devices for facial landmark detection. The technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of this application, and not all of them. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application. Examples of the embodiments are shown in the accompanying drawings, wherein the same or similar reference numerals denote the same or similar elements or elements having the same or similar functions throughout.

[0082] It should be noted that the terms "first," "second," etc., in the specification, claims, and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or server that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or devices.

[0083] Figure 1 is a schematic diagram illustrating the implementation environment of a facial landmark detection method provided in an embodiment of this application. As shown in Figure 1, the implementation environment may include at least client 01 and server 02.

[0084] Specifically, client 01 can include devices such as smartphones, desktop computers, tablets, laptops, in-vehicle terminals, digital assistants, smart wearable devices, and voice interaction devices. It can also include software running on the device, such as web pages provided to users by service providers, or applications provided by those service providers. Specifically, client 01 can capture an image to be detected, which contains a facial region. Client 01 sends the image to be detected to server 02, where server 02 performs facial landmark detection processing based on the image.

[0085] Specifically, server 02 can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. Server 02 may include network communication units, processors, and memory, etc. The terminal and the server can be connected directly or indirectly through wired or wireless communication, which is not limited herein. Specifically, server 02 can input the image to be detected into the feature extraction network of a facial landmark detection model for feature extraction processing to obtain a first feature map; input the first feature map into the feature enhancement network of the facial landmark detection model for feature enhancement processing to obtain a second feature map, which contains redundant feature information of the first feature map; input the second feature map into the feature regression network of the facial landmark detection model for feature point detection processing to obtain feature point detection results, which indicate whether each feature point in the second feature map is a facial landmark; and determine the predicted location information corresponding to the facial landmarks in the image to be detected based on the feature point detection results. Server 02 can also send the predicted location information corresponding to the facial landmarks in the image to be detected to client 01 to serve related applications on client 01.

[0086] Specifically, server 02 and the database are located in the cloud. Server 02 can be a physical machine or a virtual machine.

[0087] In another implementation environment provided in this application embodiment, a facial landmark detection model is configured in client 01. Client 01 directly performs facial landmark detection processing based on the captured image to be detected, and obtains the predicted location information corresponding to the facial landmarks in the image to be detected. The above is an example of an implementation environment. Other implementation environments are also possible in this application embodiment, which will not be elaborated here.

[0088] Figure 2 is a flowchart illustrating a method for detecting facial key points provided in an embodiment of this application. This application provides the operational steps of the method as described in the embodiments or flowchart, but more or fewer operational steps may be included on the basis of conventional or non-inventive work. The order of steps listed in the embodiments is merely one of many possible execution orders and does not represent the only one. When executed in an actual system or server product, the method can be executed sequentially according to the embodiments or drawings, or in parallel (e.g., in a parallel processor or multi-threaded processing environment). Referring to Figure 2, the method for detecting facial key points provided in this embodiment may include the following steps:

[0089] S210: Obtain the image to be detected, which contains a face region.

[0090] Understandably, obtaining high-quality facial images in real-world scenarios is difficult, mainly due to the following challenges: First, facial expressions, extreme local lighting (such as highlights and shadows), and occlusion can interfere with facial image detection; second, pose and image quality affect the overall appearance of the facial image. The facial landmark detection model constructed in this application can quickly and accurately detect and locate facial landmarks under the influence of local or global changes.

[0091] S220: Input the image to be detected into the feature extraction network of the face key point detection model, perform feature extraction processing, and obtain the first feature map.

[0092] In one embodiment of this application, the feature extraction network includes regular convolution and channel-wise convolution, which can effectively improve feature extraction efficiency. Specifically, it can be implemented as follows:

[0093] S221: Perform 3×3 convolution processing on the image to be detected to obtain the initial feature map.

[0094] It can be understood that convolving each channel of the image to be detected with a 3×3 kernel applies the same filter at every pixel position of the input, thereby generating a filtered feature map, which is the initial feature map.

[0095] S222: Perform channel-wise convolution on the initial feature map to obtain the first feature map.

[0096] Channel-wise convolution, more commonly called depthwise convolution (also known as spatial convolution, or simply DWConv), is used to extract local features from images or feature maps. Unlike a regular convolution, in which every output channel combines all input channels, a depthwise convolution convolves each input channel only with its own weight matrix.

[0097] Specifically, the process of channel-wise convolution involves first performing convolution on each channel of the initial feature map, and then stacking the convolutions across different channels to generate a multi-channel feature map. This multi-channel feature map can then be batch-normalized, non-linearly activated using ReLU layers, and finally passed through a convolutional layer to output the first feature map.
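The sequence in [0097] can be sketched in NumPy as follows. The sketch rests on several assumptions: the depthwise kernels are 3×3 with zero padding and stride 1, `channel_norm` is a single-sample stand-in for batch normalization, and the final convolutional layer is taken to be a 1×1 convolution, which the text does not specify. All weights are random placeholders.

```python
import numpy as np

def depthwise_conv3x3(x, kernels):
    """Depthwise 3x3 convolution: x is (C, H, W), kernels is (C, 3, 3), one per channel."""
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c, h, w))
    for dy in range(3):
        for dx in range(3):
            # Each channel is weighted only by its own kernel entry.
            out += kernels[:, dy, dx][:, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out

def channel_norm(x, eps=1e-5):
    """Per-channel standardization (assumed stand-in for batch normalization)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def relu(x):
    return np.maximum(x, 0.0)

def pointwise_conv(x, weights):
    """1x1 convolution mixing channels at every pixel; weights is (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weights, x)

rng = np.random.default_rng(0)
initial = rng.standard_normal((8, 16, 16))          # initial feature map
stacked = depthwise_conv3x3(initial, rng.standard_normal((8, 3, 3)))
first_feature_map = pointwise_conv(relu(channel_norm(stacked)), rng.standard_normal((16, 8)))
print(first_feature_map.shape)  # (16, 16, 16)
```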

[0098] In the above embodiments, combining traditional convolution and channel-wise convolution can effectively extract global and local features, improving the network's feature extraction capabilities. Furthermore, channel-wise convolution is a lightweight computational method in convolutional neural networks, effectively improving model efficiency and cost-effectiveness when computational resources are limited. This makes its advantages particularly pronounced when running on mobile devices.

[0099] S230: Input the first feature map into the feature enhancement network of the face key point detection model, perform feature enhancement processing, and obtain the second feature map. The second feature map contains redundant feature information of the first feature map.

[0100] Considering that convolutional neural networks generate highly similar feature maps during feature extraction, this embodiment assumes a positive correlation between the feature extraction capability of a convolutional neural network-based model and highly similar feature maps. Therefore, in this embodiment, the feature enhancement network extracts redundant feature information from the first feature map to enhance its features, thereby improving the feature extraction performance of the facial landmark detection model. In this embodiment, redundant feature information can be represented as one or more feature maps similar to the first feature map.

[0101] In one embodiment of this application, the backbone network of the feature enhancement network consists of one or more redundant bottleneck modules (Ghost Bottleneck), which are built from redundant sub-modules (Ghost Module, also known as phantom module or ghost module). As shown in Figure 3, in any redundant bottleneck module, feature enhancement processing can be implemented as follows:

[0102] S231: Determine the target input feature map of the target redundancy bottleneck module.

[0103] When the target redundant bottleneck module is the first redundant bottleneck module of the backbone network, the target input feature map is the first feature map; when the target redundant bottleneck module is not the first redundant bottleneck module of the backbone network, the target input feature map is the previous output feature map of the previous redundant bottleneck module; the target redundant bottleneck module can be any one of multiple redundant bottleneck modules.

[0104] S232: Input the target input feature map into at least one redundant sub-module in the target redundancy bottleneck module, perform feature augmentation processing, and obtain the target redundancy feature map.

[0105] Specifically, with a stride of 1, the network structure of the target redundant bottleneck module can be as shown in Figure 4. The target redundant bottleneck module includes two redundant sub-modules (Ghost Module). The output of the first redundant sub-module is processed by batch normalization (BN) and nonlinear activation (ReLU) before being input to the second redundant sub-module.

[0106] Specifically, with a stride of 2, the network structure of the target redundant bottleneck module can be as shown in Figure 5. The target redundant bottleneck module includes two redundant sub-modules. The output of the first redundant sub-module is processed by batch normalization (BN) and nonlinear activation (ReLU), then by a channel-wise convolution (DWConv) with a stride of 2, and, after normalization, is input into the second redundant sub-module.

[0107] Furthermore, the concatenation operation in the redundant sub-module increases the number of channels in the feature map, which can be regarded as feature augmentation. The specific processing steps in the redundant sub-module are described in the following embodiments and are not elaborated here. It is understood that the target redundant feature map obtained in the target redundant bottleneck module can characterize the redundant feature information of the target input feature map.

[0108] S233: Perform tensor addition on the target input feature map and the normalized target redundant feature map to obtain the target output feature map of the target redundant bottleneck module.

[0109] As shown in Figure 4 or Figure 5, the output of the last redundant sub-module is normalized and then combined with the target input feature map by tensor addition (Add) to obtain the final output of the target redundant bottleneck module. Tensor addition does not expand the dimensions; for example, adding two 104*104*128 feature maps still yields a 104*104*128 feature map.

[0110] S234: Use the target output feature map as the next input feature map of the next redundant bottleneck module, and repeat the above process until the output of the last redundant bottleneck module is used as the second feature map.

[0111] When the backbone network contains multiple redundant bottleneck modules, repeating the above process means that similar processing operations can be performed, such as feature augmentation, tensor addition, batch normalization, and nonlinear activation. However, it is not required that the internal structure and parameters of each layer of the multiple redundant bottleneck modules contained in the backbone network be the same. That is, the internal structure and parameters of each redundant bottleneck module can be different.

[0112] In the above embodiments, redundant feature information of the target input feature map is extracted by at least one redundant sub-module in the target redundancy bottleneck module, and the redundant feature information and the target input feature map are fused by tensor addition, which not only enhances the feature extraction effect, but also preserves the original input features.
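
The chaining of steps S231 to S234 can be sketched as follows for the stride-1 case. Everything here is illustrative: `tensor_add`, `shape`, `scale` and `run_backbone` are hypothetical helper names, and each "branch" is only a stand-in for the redundant sub-modules plus normalization of one bottleneck module.

```python
def tensor_add(a, b):
    """Element-wise addition of two equally shaped nested lists."""
    if isinstance(a, list):
        if len(a) != len(b):
            raise ValueError("tensor addition requires identical shapes")
        return [tensor_add(x, y) for x, y in zip(a, b)]
    return a + b


def shape(t):
    """Shape of a rectangular nested list, as a tuple."""
    dims = []
    while isinstance(t, list):
        dims.append(len(t))
        t = t[0]
    return tuple(dims)


def scale(t, factor):
    """Toy branch: multiply every element by a constant."""
    if isinstance(t, list):
        return [scale(v, factor) for v in t]
    return t * factor


def run_backbone(first_feature_map, branches):
    """Chain bottleneck modules: output = input + branch(input)."""
    x = first_feature_map
    for branch in branches:
        redundant = branch(x)         # stand-in for sub-modules + normalization
        x = tensor_add(x, redundant)  # shortcut keeps the original input features
    return x                          # output of the last module


branches = [lambda t: scale(t, 0.5), lambda t: scale(t, 2.0)]
x0 = [[[1.0, 2.0], [3.0, 4.0]]]  # 1 channel, 2 x 2
second_feature_map = run_backbone(x0, branches)
```

The shape is unchanged by every addition, which matches the statement that tensor addition does not expand the dimensions. With a stride of 2 the branch output is spatially smaller, so the shortcut would need matching downsampling, which this sketch does not model.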

[0113] In one embodiment of this application, as shown in Figure 6, step S232 may include:

[0114] S2321: Determine the target input feature submap of the target redundant sub-module.

[0115] When the target redundant sub-module is the first redundant sub-module in the target redundant bottleneck module, the target input feature submap is the target input feature map; when it is not the first redundant sub-module in the target redundant bottleneck module, the target input feature submap is the previous output feature submap of the previous redundant sub-module, more precisely that submap after batch normalization and nonlinear activation. The target redundant sub-module can be any one of the at least one redundant sub-module.

[0116] S2322: Perform pointwise convolution on the target input feature submap to obtain the intrinsic feature submap, which includes channel feature submaps of multiple channels.

[0117] As shown in Figure 7, pointwise convolution uses a 1×1 kernel to convolve each feature point of the target input feature submap. Because it can reduce or increase the number of channels, it can be used to control the number of parameters.

[0118] S2323: Perform a linear transformation on the channel feature submap of each channel to obtain multiple redundant feature submaps.

[0119] Understandably, linear transformations require less computation and can quickly generate a large number of redundant feature submaps, achieving the feature augmentation (enhancement) effect.

[0120] In one feasible implementation, as shown in Figure 7, the linear transformation can be implemented with channel-wise convolution (DWConv), which extracts redundant features from the target input feature submap while requiring less convolutional computation.

[0121] In the GhostModule, pointwise convolution and channelwise convolution can be combined into phantom convolution, also known as depthwise separable convolution.

[0122] S2324: Tensor concatenation of the intrinsic feature submap and multiple redundant feature submaps is performed to obtain the target output feature submap, which is used as the output of the target redundant submodule.

[0123] As shown in Figure 7, tensor concatenation expands the dimensions of a tensor. For example, concatenating two feature maps of 26*26*256 and 26*26*512 results in a feature map of 26*26*768.
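
Steps S2322 to S2324 can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: one redundant submap is generated per intrinsic channel, the linear transformation is a 3 x 3 channel-wise convolution with zero padding, and the names `pointwise_conv`, `depthwise_same` and `ghost_module` are hypothetical.

```python
def pointwise_conv(x, weights):
    """1 x 1 convolution: each output channel is a weighted sum of input channels.

    x: C_in x H x W nested lists; weights: C_out x C_in.
    """
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(wc * x[c][i][j] for c, wc in enumerate(row)) for j in range(w)]
         for i in range(h)]
        for row in weights
    ]


def depthwise_same(x, kernels):
    """Per-channel k x k convolution with zero padding (output keeps H x W)."""
    out = []
    for channel, kernel in zip(x, kernels):
        k = len(kernel)
        pad = k // 2
        h, w = len(channel), len(channel[0])

        def at(i, j):
            return channel[i][j] if 0 <= i < h and 0 <= j < w else 0.0

        out.append([
            [sum(at(i + di - pad, j + dj - pad) * kernel[di][dj]
                 for di in range(k) for dj in range(k))
             for j in range(w)]
            for i in range(h)
        ])
    return out


def ghost_module(x, pw_weights, dw_kernels):
    intrinsic = pointwise_conv(x, pw_weights)       # S2322: intrinsic submap
    ghosts = depthwise_same(intrinsic, dw_kernels)  # S2323: cheap linear transforms
    return intrinsic + ghosts                       # S2324: concatenate on channel axis


x = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [1.0, 0.0]]]
pw_weights = [[1.0, 0.0], [1.0, 1.0]]
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
doubling = [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 0.0]]
out = ghost_module(x, pw_weights, [identity, doubling])
```

Here two intrinsic channels yield four output channels: the two intrinsic submaps followed by two redundant submaps derived from them, so the channel count doubles while the spatial size stays the same.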

[0124] S2325: Normalize or activate the target output feature submap to obtain the next input feature submap of the next redundant sub-module. Repeat the above process until the output of the last redundant sub-module is used as the target redundant feature map.

[0125] When the target redundancy bottleneck module contains multiple redundant sub-modules, repeating the above process means that similar processing operations can be performed, such as pointwise convolution, channelwise convolution, tensor concatenation, batch normalization, and nonlinear activation, but it is not required that the parameters of each layer of multiple redundant sub-modules be the same.

[0126] Figure 8 is a schematic diagram illustrating a conventional convolution and a phantom convolution provided in an embodiment of this application. As shown in Figure 8, in a conventional convolution operation, the input feature map w*h*c is convolved with n sets of k×k convolutional kernels to generate an output with n channels and a size of w'*h'. The computational cost is approximately w*h*c*n*w'*h' (ignoring bias calculation). In the Ghost Module, m sets of k×k convolutional kernels are first convolved with the input, where m can be half of n, generating an intrinsic map with m channels and a size of w'*h'. The computational cost of this part is approximately w*h*c*m*w'*h' (ignoring bias calculation), which is significantly lower. Then the intrinsic map undergoes the linear transformations Φ1, Φ2, ..., Φk to generate multiple redundant feature maps, ghost1 to ghost(s-1). The intrinsic map and the multiple redundant feature maps are then concatenated as the output.
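
The saving can be made concrete with a small calculation. The sketch below uses the multiply-accumulate (MAC) count that is commonly used for convolutions, n*h'*w'*c*k*k, which is an assumption of this example and differs in form from the approximate expression quoted above. The function names, the layer sizes and the d×d size of the cheap per-channel operation are likewise illustrative.

```python
def conv_macs(c_in, c_out, k, h_out, w_out):
    """MAC count of a standard k x k convolution (bias ignored)."""
    return c_out * h_out * w_out * c_in * k * k


def ghost_macs(c_in, c_out, k, d, h_out, w_out, s=2):
    """MAC count when only c_out / s channels come from the primary convolution.

    The remaining channels come from d x d per-channel (depthwise) operations.
    """
    m = c_out // s                               # intrinsic channels
    primary = conv_macs(c_in, m, k, h_out, w_out)
    cheap = (s - 1) * m * h_out * w_out * d * d  # one d x d filter per ghost map
    return primary + cheap


standard = conv_macs(c_in=64, c_out=128, k=3, h_out=56, w_out=56)
ghost = ghost_macs(c_in=64, c_out=128, k=3, d=3, h_out=56, w_out=56, s=2)
ratio = standard / ghost
```

With m equal to half of n, the ratio comes out slightly below 2, because the cheap per-channel operations add only a small cost on top of the halved primary convolution.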

[0127] In the above embodiments, the Ghost Module is used to extract redundant features. Compared with conventional convolution, it is lightweight and efficient. In addition, the Ghost Module is easy to port and deploy, which can quickly improve processing efficiency in practice.

[0128] S240: Input the second feature map into the feature regression network of the face key point detection model, perform feature point detection processing, and obtain the feature point detection result. The feature point detection result indicates whether each feature point in the second feature map is a face key point.

[0129] In one embodiment of this application, the feature regression network includes a convolutional layer, a multi-scale pooling layer, and a multi-scale fully connected layer. It maps the feature information of each feature point in the second feature map to feature point detection results indicating whether each feature point in the second feature map is a facial key point. Specifically, it can be implemented as follows:

[0130] S241: Input the second feature map into the convolutional layer of the feature regression network and perform convolution processing to obtain the third feature map.

[0131] S242: Input the third feature map into the multi-scale pooling layer of the feature regression network and perform average pooling to obtain multiple fourth feature maps.

[0132] S243: Input multiple fourth feature maps into the multi-scale fully connected layer of the feature regression network for feature mapping processing to obtain feature point detection results.

[0133] In the above embodiments, the number of multi-scale fully connected layers is increased, which can increase the receptive field and better capture the global structure of the face, thereby enabling accurate localization of facial key points.
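
A rough sketch of steps S242 and S243 is given below. The text does not specify how the pooled maps are combined, so this example makes several assumptions: block-wise average pooling to 1 x 1 and 2 x 2 grids, flattening and concatenation of all pooled values, and a single fully connected mapping. All names and weights are hypothetical, and the preceding convolutional layer (S241) is omitted.

```python
def avg_pool(channel, out_size):
    """Average-pool an H x W map into out_size x out_size equal blocks."""
    h, w = len(channel), len(channel[0])
    bh, bw = h // out_size, w // out_size  # assumes H and W divisible by out_size
    return [
        [
            sum(channel[i][j]
                for i in range(r * bh, (r + 1) * bh)
                for j in range(c * bw, (c + 1) * bw)) / (bh * bw)
            for c in range(out_size)
        ]
        for r in range(out_size)
    ]


def multi_scale_features(feature_map, scales=(1, 2)):
    """Pool every channel at each scale and flatten everything into one vector."""
    vector = []
    for s in scales:
        for channel in feature_map:
            for row in avg_pool(channel, s):
                vector.extend(row)
    return vector


def fully_connected(vector, weights, bias):
    """Linear mapping: one output per weight row."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


third = [[[1.0, 2.0, 3.0, 4.0],
          [5.0, 6.0, 7.0, 8.0],
          [9.0, 10.0, 11.0, 12.0],
          [13.0, 14.0, 15.0, 16.0]]]  # 1 channel, 4 x 4
features = multi_scale_features(third)
weights = [[1.0, 0.0, 0.0, 0.0, 0.0],
           [0.0, 0.25, 0.25, 0.25, 0.25]]
outputs = fully_connected(features, weights, bias=[0.0, 0.0])
```

The coarse 1 x 1 pooling summarizes the whole map, while the 2 x 2 pooling keeps some spatial layout, which illustrates how combining several scales lets the regression stage see both global structure and local detail.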

[0134] Figure 9 is a schematic diagram of the structure of a facial key point detection model provided in an embodiment of this application. For the network modules of each layer shown in the figure, refer to the aforementioned embodiments; they are not repeated here. In addition, the parameters t, c, n, and s represent the input size, the number of output channels, the number of repetitions, and the stride, respectively.

[0135] S250: Based on the feature point detection results, determine the predicted location information of the facial key points in the image to be detected.

[0136] Facial key points may include, but are not limited to, points such as the corners of the eyes, the inner corners of the eyebrows, the outer corners of the eyebrows, the corners of the mouth, and the tip of the nose. They can be used to locate key areas of the face, including but not limited to the eyebrows, eyes, nose, mouth, and facial contour.

[0137] As described in the above embodiments, the facial key point detection method provided by this application utilizes a facial key point detection model to achieve end-to-end detection or localization, thereby improving the efficiency of facial key point detection. Furthermore, compared to conventional convolution operations, in the feature enhancement module, redundant feature information of the first feature map output by the feature extraction module is used to enhance the first feature map. This method is simple to operate and effectively reduces the amount of computation, thereby improving processing efficiency. Moreover, feature enhancement can improve the prediction accuracy of the feature regression module.

[0138] Figure 10 is a schematic diagram illustrating the training process of a facial landmark detection model provided in an embodiment of this application. Referring to Figure 10, the facial landmark detection method provided in this embodiment may further include the following steps:

[0139] S310: Obtain training sample images.

[0140] S320: Based on the scale-invariant feature transform matching algorithm, determine the feature description information of facial key points in training sample images.

[0141] In the embodiments of this application, a scale-invariant feature transform matching algorithm is used to detect the location of candidate points of interest in the facial region. Specifically, step S320 may include the following steps:

[0142] S321: Perform multi-scale transformation on the training sample images to obtain images at multiple scales.

[0143] The description function L(x, y, σ) of the training sample images in different scale spaces can be expressed as shown in Equation (1):

[0144] L(x,y,σ)=G(x,y,σ)*s(x,y) (1)

[0145] Where L(x, y, σ) is the scaled image, s(x, y) represents the training sample image, (x, y) can be the pixel coordinates, and G(x, y, σ) is the Gaussian convolution kernel function, as shown in formula (2):

[0146] G(x,y,σ)=(1/(2πσ²))·exp(-(x²+y²)/(2σ²)) (2)

[0147] Here, σ is the scale factor. The smaller the σ value, the less the image is smoothed, and the smaller the scale. Large scales correspond to the overall features of the image, while small scales correspond to the detailed features of the image.
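
The effect of σ can be illustrated with a small sketch that samples the standard two-dimensional Gaussian kernel and applies it to an image. The kernel radius, the renormalization of the sampled weights and the replication of border pixels are assumptions of this example, as are the function names.

```python
import math


def gaussian_kernel(sigma, radius):
    """Sampled 2-D Gaussian G(x, y, sigma), renormalized so the weights sum to 1."""
    size = range(-radius, radius + 1)
    kernel = [
        [math.exp(-(x * x + y * y) / (2 * sigma ** 2)) / (2 * math.pi * sigma ** 2)
         for x in size]
        for y in size
    ]
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]


def smooth(image, kernel):
    """L = G * s, with border pixels replicated (an assumed edge handling)."""
    h, w = len(image), len(image[0])
    r = len(kernel) // 2

    def px(i, j):
        return image[min(max(i, 0), h - 1)][min(max(j, 0), w - 1)]

    return [
        [sum(kernel[di + r][dj + r] * px(i + di, j + dj)
             for di in range(-r, r + 1) for dj in range(-r, r + 1))
         for j in range(w)]
        for i in range(h)
    ]


# A single bright pixel in the middle of a 5 x 5 image.
image = [[0.0] * 5 for _ in range(5)]
image[2][2] = 1.0
fine = smooth(image, gaussian_kernel(sigma=0.5, radius=2))
coarse = smooth(image, gaussian_kernel(sigma=1.5, radius=2))
```

The peak at the center is much lower in `coarse` than in `fine`: a larger σ spreads the bright pixel over more neighbors, so detail is lost and only the overall structure remains.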

[0148] S322: Determine the Gaussian difference image corresponding to every two adjacent scale images in multiple scale images.

[0149] Extreme points in scale space can be calculated using the difference of the Gaussian (DoG) function of the training sample images, which can be expressed as shown in Equation (3):

[0150] D(x,y,σ)=[G(x,y,kσ)-G(x,y,σ)]*s(x,y)

[0151] =L(x,y,kσ)-L(x,y,σ) (3)

[0152] Where D(x, y, σ) is the difference-of-Gaussians (DoG) function of the training sample image, and k is a constant factor. In one embodiment of this application, the number of intervals n can be set to 3 to form n+2 DoG images, and k can be set to 2^(1/3).
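
The relationship between the number of intervals, the factor k and the number of DoG images can be checked with a short sketch. The base scale of 1.6 and the constant placeholder images are assumptions made only to keep the example self-contained; in practice each scale image would be a Gaussian-smoothed version of the training sample image.

```python
def scale_ladder(sigma0, intervals):
    """Scales sigma0 * k**i with k = 2 ** (1 / intervals)."""
    k = 2 ** (1 / intervals)
    # intervals + 3 scale images give intervals + 2 DoG images.
    return [sigma0 * k ** i for i in range(intervals + 3)]


def difference_of_gaussians(scale_images):
    """D = L(k * sigma) - L(sigma) for every pair of adjacent scale images."""
    return [
        [[b - a for a, b in zip(row_lo, row_hi)]
         for row_lo, row_hi in zip(lo, hi)]
        for lo, hi in zip(scale_images, scale_images[1:])
    ]


sigmas = scale_ladder(sigma0=1.6, intervals=3)
# Placeholder scale images: constant 2 x 2 images whose value is the scale itself.
scale_images = [[[s, s], [s, s]] for s in sigmas]
dogs = difference_of_gaussians(scale_images)
```

With three intervals, the scale doubles after three steps, and six scale images produce the five (n+2) DoG images mentioned above.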

[0153] S323: Based on the Gaussian difference images corresponding to each pair of adjacent scale images, determine at least one extreme point in each Gaussian difference image.

[0154] Each pixel in each Gaussian difference image is compared with its eight neighbors at the same scale and with the nine neighbors at the corresponding position in each of the adjacent larger-scale and smaller-scale Gaussian difference images. If the value of this pixel is the minimum or maximum among the compared pixels, the pixel can be considered an extreme point.
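
A minimal sketch of this comparison is shown below, assuming the usual scale-space arrangement in which a pixel has 8 neighbors at its own scale and 9 in each of the two adjacent Gaussian difference images (26 in total). The function name is hypothetical, a strict comparison is assumed, and the caller is expected to pass an interior pixel of an interior level.

```python
def is_extremum(dogs, level, i, j):
    """True if dogs[level][i][j] is strictly above or below all 26 neighbors."""
    value = dogs[level][i][j]
    neighbors = [
        dogs[level + dl][i + di][j + dj]
        for dl in (-1, 0, 1)
        for di in (-1, 0, 1)
        for dj in (-1, 0, 1)
        if (dl, di, dj) != (0, 0, 0)
    ]
    return value > max(neighbors) or value < min(neighbors)


def blank():
    return [[0.0] * 3 for _ in range(3)]


# Three adjacent 3 x 3 Gaussian difference images with one peak in the middle.
dogs = [blank(), blank(), blank()]
dogs[1][1][1] = 5.0
peak_found = is_extremum(dogs, level=1, i=1, j=1)
```

A strict comparison rejects flat regions, where every pixel ties with its neighbors and no position is a meaningful candidate.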

[0155] S324: Based on at least one extreme point in each Gaussian difference image, determine at least one candidate feature point in the training sample image, and use at least one candidate feature point as a facial key point.

[0156] Furthermore, non-maximum suppression processing can be applied to at least one extreme point in each Gaussian difference image to obtain at least one candidate feature point as a facial keypoint. Simultaneously, the position and scale of each candidate feature point can be determined.

[0157] S325: Determine the feature description information of facial key points based on the gradient information of the training sample images.

[0158] First, the gradient information of each pixel in the training sample image can be determined. The gradient information can be represented as a gradient direction histogram. The gradient information includes the gradient magnitude and gradient direction. The formula for calculating the gradient magnitude m(x, y) is shown in formula (4), and the formula for calculating the gradient direction θ(x, y) is shown in formula (5), that is:

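[0159] m(x,y)=sqrt((L(x+1,y)-L(x-1,y))²+(L(x,y+1)-L(x,y-1))²) (4)

[0160] θ(x,y)=arctan((L(x,y+1)-L(x,y-1))/(L(x+1,y)-L(x-1,y))) (5)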

[0161] Where L is a scaled image with scale σ.

[0162] Specifically, a neighborhood F centered on a candidate feature point is selected, and a histogram of gradient directions is obtained from the gradient directions of the pixels within F. The directions calculated by the above formula span 360 degrees, but an orientation histogram with 360 bins is computationally expensive. To reduce the computational cost, the range can be divided into 36 equal parts, each covering 10 degrees, so that the orientation histogram has 36 bins. The gradient direction of the candidate feature point is the direction of the largest of the 36 bins in the histogram.
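
The 36-bin orientation histogram can be sketched as follows. The gradient is computed with central differences, which is the usual formulation for this algorithm and is assumed here; `atan2` is used so that the full 360-degree range is resolved. The function names, the square neighborhood and the choice to report the lower edge of the winning bin are illustrative.

```python
import math


def gradient(L, x, y):
    """Central-difference gradient magnitude and direction (degrees in [0, 360))."""
    dx = L[y][x + 1] - L[y][x - 1]
    dy = L[y + 1][x] - L[y - 1][x]
    magnitude = math.hypot(dx, dy)
    direction = math.degrees(math.atan2(dy, dx)) % 360.0
    return magnitude, direction


def dominant_orientation(L, cx, cy, radius, bins=36):
    """Accumulate magnitudes into 10-degree bins and return the strongest bin."""
    hist = [0.0] * bins
    width = 360.0 / bins
    for y in range(cy - radius, cy + radius + 1):
        for x in range(cx - radius, cx + radius + 1):
            m, theta = gradient(L, x, y)
            hist[int(theta // width) % bins] += m
    peak = max(range(bins), key=hist.__getitem__)
    return peak * width, hist  # lower edge of the strongest bin, in degrees


# Brightness increases from left to right: the gradient points along +x.
ramp = [[float(x) for x in range(5)] for _ in range(5)]
orientation, hist = dominant_orientation(ramp, cx=2, cy=2, radius=1)

# Brightness increases along the diagonal: the gradient points at 45 degrees.
diagonal = [[float(x + y) for x in range(5)] for y in range(5)]
diag_orientation, _ = dominant_orientation(diagonal, cx=2, cy=2, radius=1)
```

For the horizontal ramp all nine pixels vote for the first bin, and for the diagonal ramp they all vote for the bin covering 40 to 50 degrees.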

[0163] This embodiment also constructs a set of descriptors that use gradient information from the training sample images for feature description. The feature description information is a vector containing the values of all orientation histogram entries. A neighborhood window centered on each candidate feature point (facial key point) is selected and divided into 16 sub-regions of size 4×4. Using the above formulas, the orientation and magnitude of all pixels in each sub-region are obtained and accumulated into an orientation histogram. From the orientation histogram, the distribution over eight directions (0, π/4, π/2, 3π/4, π, 5π/4, 3π/2, 7π/4) can be calculated, where the length for each direction corresponds to the sum of the gradient magnitudes near that direction within the sub-region. The gradient magnitudes are weighted by a Gaussian function when the orientation histogram of each sub-region is created. By concatenating the orientation descriptions of all sub-regions, the feature description information of each candidate feature point (facial key point) is obtained.
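
The construction of the description vector can be sketched as below, under simplifying assumptions: a 16 x 16 pixel window split into 4 x 4 sub-regions of 4 x 4 pixels each, eight direction bins per sub-region, central-difference gradients, and no Gaussian weighting. The function name `descriptor` and its parameters are hypothetical.

```python
import math


def descriptor(L, cx, cy, cell=4, grid=4, bins=8):
    """Concatenate an 8-bin orientation histogram for each of the 16 sub-regions."""
    half = cell * grid // 2
    step = 2 * math.pi / bins
    vector = []
    for gy in range(grid):
        for gx in range(grid):
            hist = [0.0] * bins
            for y in range(cy - half + gy * cell, cy - half + (gy + 1) * cell):
                for x in range(cx - half + gx * cell, cx - half + (gx + 1) * cell):
                    dx = L[y][x + 1] - L[y][x - 1]
                    dy = L[y + 1][x] - L[y - 1][x]
                    theta = math.atan2(dy, dx) % (2 * math.pi)
                    b = int(theta / step + 0.5) % bins  # nearest of the 8 directions
                    hist[b] += math.hypot(dx, dy)       # accumulate gradient magnitude
            vector.extend(hist)
    return vector


# 20 x 20 image whose brightness increases from left to right.
ramp = [[float(x) for x in range(20)] for _ in range(20)]
d = descriptor(ramp, cx=10, cy=10)
```

The result has 16 × 8 = 128 entries. For this ramp every sub-region puts all of its gradient magnitude into the direction-0 bin, so only one entry per sub-region is non-zero.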

[0164] In the above embodiments, the prior features for model training can be constructed by using the scale-invariant feature transformation matching algorithm, which can improve the localization accuracy during model training.

[0165] S330: Input the feature description information of facial key points into the neural network model to be trained, as the prior features of the training sample images.

[0166] S340: Input the training sample image into the neural network model to perform facial key point detection processing, and obtain the sample feature point detection results of the facial key points in the training sample image.

[0167] The detection and processing of facial landmarks during the training phase can be referred to the aforementioned embodiments, and will not be repeated here.

[0168] S350: Determine the loss information based on the feature description information and the sample feature point detection results.

[0169] In one embodiment of this application, multi-part loss data is constructed to determine the final loss information, thereby ensuring the positioning accuracy of small-scale targets. Specifically, step S350 may include:

[0170] S351: Determine the first feature descriptor information corresponding to the left side of the face and the second feature descriptor information corresponding to the right side of the face from the feature description information.

[0171] S352: Determine the first sample feature point detection sub-result corresponding to the left side of the face and the second sample feature point detection sub-result corresponding to the right side of the face from the sample feature point detection results.

[0172] S353: Determine the first loss data based on the first feature descriptor information, the first sample feature point detection result, the second feature descriptor information, and the second sample feature point detection result.

[0173] Considering the symmetry of the human face, the face region is divided into left and right parts. The loss data of the left region is determined based on the first feature descriptor information and the first sample feature point detection result. The loss data of the right region is determined based on the second feature descriptor information and the second sample feature point detection result. Then, the difference between the loss data of the left region and the loss data of the right region is used as the final first loss data L1.

[0174] S354: Determine the third feature descriptor information corresponding to the target key points from the feature description information. The target key points include the center point of the eye region, the center point of the mouth region, and the tip of the nose.

[0175] S355: Determine the third sample feature point detection sub-result corresponding to the target key point from the sample feature point detection results.

[0176] S356: Determine the second loss data based on the third feature descriptor information and the third sample feature point detection results.

[0177] To improve the accuracy of local keypoint localization, loss data corresponding to local facial keypoints is also introduced. Specifically, the distance dist1 from the center points of the two eye regions and the center point of the mouth region to the tip of the nose can be calculated based on the third feature descriptor information. The distance dist2 is obtained by performing corresponding calculations based on the third sample feature point detection results. Finally, the second loss data L2 can be expressed as abs(dist1-dist2), where abs() is the absolute value function.
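
A minimal sketch of the second loss data follows. It assumes that both the reference information and the detection results are available as 2-D coordinates, and that dist1 and dist2 are each the sum of the three distances to the nose tip; the text does not state how the three distances are combined, so the summation is an assumption, as are the dictionary keys and function names.

```python
import math


def spread_to_nose(points):
    """Sum of distances from both eye centers and the mouth center to the nose tip."""
    nose = points["nose_tip"]
    return sum(math.dist(points[name], nose)
               for name in ("left_eye", "right_eye", "mouth"))


def second_loss(reference, predicted):
    """L2 = abs(dist1 - dist2)."""
    dist1 = spread_to_nose(reference)
    dist2 = spread_to_nose(predicted)
    return abs(dist1 - dist2)


reference = {"nose_tip": (0.0, 0.0), "left_eye": (-3.0, 4.0),
             "right_eye": (3.0, 4.0), "mouth": (0.0, -3.0)}
predicted = {"nose_tip": (0.0, 0.0), "left_eye": (-3.0, 4.0),
             "right_eye": (3.0, 4.0), "mouth": (0.0, -4.0)}
l2 = second_loss(reference, predicted)
```

In this example only the predicted mouth center is off by one unit, so the two distance sums differ by one.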

[0178] S357: Input the sum of the first loss data and the second loss data into the preset loss function to obtain loss information.

[0179] The preset loss function can be the Wing loss function, which improves training accuracy when errors are small.

[0180] However, because automatically generated dataset labels may be imperfect, the logarithmic term in the Wing loss function can amplify errors, which may cause the Wing loss to oscillate and fail to converge to a satisfactory level. In that case, other loss functions can be used, such as L1 Loss, L2 Loss, or Smooth L1 Loss. L1 Loss, also known as Mean Absolute Error (MAE), measures the average absolute difference between the predicted and true values and ranges from 0 to positive infinity. L2 Loss, also known as Mean Squared Error (MSE), measures the average squared difference between the predicted and true values and also ranges from 0 to positive infinity. Smooth L1 Loss combines the advantages of the L1 and L2 loss functions.
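
For comparison, the loss functions named above can be written out in a few lines. The Wing loss follows its commonly published piecewise form; the parameter values w = 10 and eps = 2 and the Smooth L1 threshold beta = 1 are typical defaults assumed for this sketch, not values taken from this application.

```python
import math


def wing_loss(error, w=10.0, eps=2.0):
    """Wing loss for one scalar error: logarithmic near zero, linear beyond w."""
    x = abs(error)
    if x < w:
        return w * math.log(1.0 + x / eps)
    c = w - w * math.log(1.0 + w / eps)  # keeps the two pieces continuous at |x| = w
    return x - c


def l1_loss(pred, target):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def l2_loss(pred, target):
    """Mean squared error."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def smooth_l1_loss(pred, target, beta=1.0):
    """Quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)


pred, target = [1.0, 2.0, 3.0], [1.0, 2.0, 5.0]
mae = l1_loss(pred, target)
mse = l2_loss(pred, target)
huber_like = smooth_l1_loss(pred, target)
small_error_penalty = wing_loss(0.5)
```

For an error of 0.5, the Wing loss with these parameters is larger than both the absolute error (0.5) and the squared error (0.25), which illustrates why it emphasizes small errors and why noisy labels can make it unstable.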

[0181] In the above embodiments, multiple loss data are constructed to determine the final loss information in order to ensure the positioning accuracy of small-scale targets.

[0182] In one embodiment of this application, if the training sample image is a frame of a video, then, considering the continuity between video frames, it can be assumed that the distance between corresponding facial key points in adjacent frames changes little. A larger loss penalty weight is therefore assigned to the error of facial key points whose distance changes are large. Finally, the weighted sum is added to the first loss data and the second loss data and input into the preset loss function. Specifically, the distance between corresponding facial key points in adjacent frames is calculated, and the variance of the distance change of each facial key point is calculated. A large variance indicates a large distance change, which in turn indicates that the facial key point may be mislocated, so the loss penalty c for that point is increased. The final loss data is L = L1 + L2 + (e1*c1 + e2*c2 + ... + em*cm), where m is the number of facial key points and e is the normalized mean error (NME), as shown in formula (6):

[0183] e=(1/N)·Σ(i=1..N) ||xi-xi*||₂/d (6)

[0184] Where N is the number of keypoints, xi represents the predicted value of the i-th keypoint, xi* represents the actual value of the i-th keypoint, and d can be 1.
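
The weighting idea can be sketched as follows. The text states only that the penalty is increased for key points whose frame-to-frame distance change has a large variance; the specific rule used here (penalty = 1 + variance) is a hypothetical choice for illustration, as are the function names and the sample values.

```python
import math
from statistics import pvariance


def displacement_variance(track):
    """Variance of the frame-to-frame displacement of one key point."""
    steps = [math.dist(a, b) for a, b in zip(track, track[1:])]
    return pvariance(steps)


def total_loss(l1, l2, errors, penalties):
    """L = L1 + L2 + (e1*c1 + e2*c2 + ... + em*cm)."""
    return l1 + l2 + sum(e * c for e, c in zip(errors, penalties))


# Positions of two key points over four consecutive frames.
tracks = [
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)],  # steady motion
    [(0.0, 0.0), (0.0, 0.0), (4.0, 0.0), (4.0, 0.0)],  # sudden jump
]
variances = [displacement_variance(t) for t in tracks]
penalties = [1.0 + v for v in variances]  # hypothetical penalty rule
errors = [0.5, 0.5]                       # per-point normalized errors (sample values)
loss = total_loss(0.2, 0.1, errors, penalties)
```

Both key points have the same error, but the one that jumps between frames contributes more to the final loss because its displacement variance, and therefore its penalty, is larger.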

[0185] S360: Based on loss information, a neural network model is trained to obtain a facial landmark detection model.

[0186] This embodiment also provides a facial landmark detection device 1100. As shown in Figure 11, the device may include:

[0187] The acquisition module 1110 is used to acquire an image to be detected, wherein the image to be detected contains a face region;

[0188] The feature extraction module 1120 is used to input the image to be detected into the feature extraction network of the face key point detection model, perform feature extraction processing, and obtain a first feature map;

[0189] The feature enhancement module 1130 is used to input the first feature map into the feature enhancement network of the face key point detection model, perform feature enhancement processing, and obtain a second feature map, wherein the second feature map contains redundant feature information of the first feature map;

[0190] The feature regression module 1140 is used to input the second feature map into the feature regression network of the face key point detection model, perform feature point detection processing, and obtain feature point detection results. The feature point detection results indicate whether each feature point in the second feature map is a face key point.

[0191] The key point location determination module 1150 is used to determine the predicted location information of the facial key points in the image to be detected based on the feature point detection results.

[0192] In one embodiment of this application, the feature extraction module 1120 may include:

[0193] A conventional convolutional unit is used to perform 3×3 convolution processing on the image to be detected to obtain an initial feature map;

[0194] A depthwise convolutional unit is used to perform channel-wise convolution processing on the initial feature map to obtain the first feature map.

[0195] In one embodiment of this application, the feature enhancement network includes multiple redundant bottleneck modules, and the feature enhancement module 1130 may include:

[0196] An input determination unit is used to determine the target input feature map of the target redundancy bottleneck module, wherein the target input feature map is the first feature map or the previous output feature map of the previous redundancy bottleneck module; the target redundancy bottleneck module is any one of the plurality of redundancy bottleneck modules.

[0197] The feature augmentation unit is used to input the target input feature map into at least one redundant sub-module in the target redundancy bottleneck module, perform feature augmentation processing, and obtain the target redundancy feature map.

[0198] The tensor addition unit is used to perform tensor addition on the target input feature map and the normalized target redundant feature map to obtain the target output feature map of the target redundant bottleneck module.

[0199] The output determination unit is used to take the target output feature map as the next input feature map of the next redundant bottleneck module, and repeat the above process until the output of the last redundant bottleneck module is taken as the second feature map.

[0200] In one embodiment of this application, the feature augmentation unit may include:

[0201] An input determination subunit is used to determine the target input feature submap of the target redundancy submodule, wherein the target input feature submap is either the target input feature map or the previous output feature submap of the previous redundancy submodule; the target redundancy submodule is any one of the at least one redundancy submodule.

[0202] A pointwise convolution sub-unit is used to perform pointwise convolution processing on the target input feature sub-map to obtain an intrinsic feature sub-map, wherein the intrinsic feature sub-map includes channel feature sub-maps of multiple channels;

[0203] The linear transformation subunit is used to perform, for each channel, a linear transformation on the corresponding channel feature submap to obtain multiple redundant feature submaps;

[0204] The tensor concatenation subunit is used to perform tensor concatenation on the intrinsic feature submap and the multiple redundant feature submaps to obtain the target output feature submap, which is used as the output of the target redundant sub-module.

[0205] The output determination subunit is used to normalize or activate the target output feature submap to obtain the next input feature submap of the next redundant submodule. The above process is repeated until the output of the last redundant submodule is used as the target redundant feature map.

[0206] In one embodiment of this application, the feature regression module 1140 may include:

[0207] The regression convolution unit is used to input the second feature map into the convolutional layer of the feature regression network, perform convolution processing, and obtain the third feature map;

[0208] The regression pooling unit is used to input the third feature map into the multi-scale pooling layer of the feature regression network, perform average pooling processing, and obtain multiple fourth feature maps.

[0209] The regression mapping unit is used to input the multiple fourth feature maps into the multi-scale fully connected layer of the feature regression network for feature mapping processing to obtain the feature point detection results.

[0210] In one embodiment of this application, the device 1100 may further include:

[0211] The training data acquisition unit is used to acquire training sample images;

[0212] The feature description determination unit is used to determine the feature description information of the facial key points in the training sample image based on the scale-invariant feature transform matching algorithm.

[0213] The prior input unit is used to input the feature description information of the facial key points into the neural network model to be trained, as the prior features of the training sample image;

[0214] The training detection unit is used to input the training sample image into the neural network model, perform facial key point detection processing, and obtain the sample feature point detection result of the facial key point in the training sample image;

[0215] The loss calculation unit is used to determine loss information based on the feature description information and the sample feature point detection results;

[0216] The model training unit is used to train the neural network model based on the loss information to obtain the facial key point detection model.

[0217] In one embodiment of this application, the feature description determination unit may include:

[0218] The scaling subunit is used to perform multi-scale transformation on the training sample images to obtain images at multiple scales.

[0219] The Gaussian difference subunit is used to determine the Gaussian difference image corresponding to every two adjacent scale images in the plurality of scale images;

[0220] The extreme point determination subunit is used to determine at least one extreme point in each Gaussian difference image based on the Gaussian difference images corresponding to each pair of adjacent scale images.

[0221] The candidate feature point determination subunit is used to determine at least one candidate feature point of the training sample image based on at least one extreme point in each Gaussian difference image, and to use the at least one candidate feature point as the facial key point.

[0222] The feature description information determination subunit is used to determine the feature description information of the facial key points based on the gradient information of the training sample images.
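
The subunits above follow the general shape of scale-invariant feature transform processing, which can be sketched as follows. This is an illustrative simplification, not the procedure of the application: the multi-scale transformation is assumed to be Gaussian blurring at increasing sigma, extreme points are searched only within the 3×3 neighborhood of each Gaussian difference image (standard SIFT also compares against the adjacent scales), and the gradient-based description is reduced to a single orientation histogram. The function names, sigma values, and threshold are hypothetical.

```python
import numpy as np

def gaussian_blur(image, sigma):
    """Separable Gaussian blur with edge padding."""
    radius = max(1, int(3 * sigma + 0.5))
    xs = np.arange(-radius, radius + 1)
    kernel = np.exp(-(xs ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(image.astype(float), radius, mode="edge")
    rows = np.apply_along_axis(lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def difference_of_gaussians(image, sigmas=(1.0, 1.6, 2.56)):
    """One Gaussian difference image per pair of adjacent scales."""
    blurred = [gaussian_blur(image, s) for s in sigmas]
    return [b - a for a, b in zip(blurred, blurred[1:])]

def local_extrema(dog, threshold=0.01):
    """Pixels that are the unique max or min of their 3x3 neighborhood."""
    points = []
    h, w = dog.shape
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            value = dog[y, x]
            if abs(value) < threshold:
                continue
            patch = dog[y - 1:y + 2, x - 1:x + 2]
            is_extreme = value == patch.max() or value == patch.min()
            if is_extreme and np.count_nonzero(patch == value) == 1:
                points.append((y, x))
    return points

def orientation_histogram(image, point, radius=4, bins=8):
    """Gradient-orientation histogram around a point, weighted by magnitude."""
    gy, gx = np.gradient(image.astype(float))
    y, x = point
    ys = slice(max(y - radius, 0), y + radius + 1)
    xs = slice(max(x - radius, 0), x + radius + 1)
    magnitude = np.hypot(gx[ys, xs], gy[ys, xs])
    angle = np.arctan2(gy[ys, xs], gx[ys, xs]) % (2 * np.pi)
    hist, _ = np.histogram(angle, bins=bins, range=(0, 2 * np.pi), weights=magnitude)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist
```

In this sketch, the points returned by `local_extrema` play the role of candidate feature points, and `orientation_histogram` stands in for the feature description information derived from gradient information.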

[0223] In one embodiment of this application, the loss calculation unit may include:

[0224] The first segmentation subunit of feature description information is used to determine, from the feature description information, a first feature description sub-information corresponding to the left side region of the face and a second feature description sub-information corresponding to the right side region of the face;

[0225] The first segmentation subunit of the sample feature point detection result is used to determine, from the sample feature point detection result, a first sample feature point detection sub-result corresponding to the left side region of the face and a second sample feature point detection sub-result corresponding to the right side region of the face;

[0226] The first loss data determination subunit is used to determine the first loss data based on the first feature description sub-information, the first sample feature point detection sub-result, the second feature description sub-information, and the second sample feature point detection sub-result.

[0227] The second segmentation subunit of feature description information is used to determine the third feature description sub-information corresponding to the target key point from the feature description information. The target key point includes the center point of the eye region, the center point of the mouth region, and the tip of the nose.

[0228] The second segmentation subunit of the sample feature point detection result is used to determine the third sample feature point detection sub-result corresponding to the target key point from the sample feature point detection result;

[0229] The second loss data determination subunit is used to determine the second loss data based on the third feature description sub-information and the third sample feature point detection sub-result.

[0230] The loss information determination subunit is used to input the sum of the first loss data and the second loss data into a preset loss function to obtain the loss information.
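
The flow of the loss calculation subunits above can be sketched as follows. The application does not state how each loss datum is computed or what the preset loss function is, so this sketch assumes that both the feature description information and the detection results can be compared as (K, 2) arrays, uses a mean squared distance per region, and uses `np.log1p` purely as a placeholder for the preset loss function. The names `region_error` and `loss_information` and the index arguments are hypothetical.

```python
import numpy as np

def region_error(reference, predicted, indices):
    """Mean squared distance over the key points selected by `indices` (assumed metric)."""
    diff = reference[indices] - predicted[indices]
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def loss_information(reference, predicted, left_idx, right_idx, target_idx,
                     preset_loss=np.log1p):
    """Combine left/right region loss and target key point loss.

    `preset_loss` is a placeholder; the application only says the sum of the
    two loss data is input into a preset loss function.
    """
    # First loss data: left-side and right-side face regions.
    first = (region_error(reference, predicted, left_idx)
             + region_error(reference, predicted, right_idx))
    # Second loss data: target key points (eye centers, mouth center, nose tip).
    second = region_error(reference, predicted, target_idx)
    return float(preset_loss(first + second))
```

For example, with five key points where indices 0 and 1 lie on the left side, 2 and 3 on the right side, and 4 is a target key point, the call would be `loss_information(ref, pred, [0, 1], [2, 3], [4])`.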

[0231] It should be noted that the apparatus provided in the above embodiments is only illustrated by the division of the above functional modules when implementing its functions. In actual applications, the above functions can be assigned to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above. In addition, the apparatus and method embodiments provided in the above embodiments belong to the same concept, and the specific implementation process can be found in the method embodiments, which will not be repeated here.

[0232] This application provides a computer device including a processor and a memory. The memory stores at least one instruction or at least one program, which is loaded and executed by the processor to implement a facial key point detection method as provided in the above method embodiments.

[0233] Figure 12 shows a schematic diagram of the hardware structure of a device for implementing a facial key point detection method provided in an embodiment of this application. This device may constitute or include the apparatus or system provided in the embodiment of this application. As shown in Figure 12, device 10 may include one or more processors 1002 (shown as 1002a, 1002b, ..., 1002n in the figure; processor 1002 may include, but is not limited to, a microprocessor (MCU) or a programmable logic device (FPGA)), a memory 1004 for storing data, and a transmission device 1006 for communication functions. In addition, it may also include: a display, an input/output interface (I/O interface), a universal serial bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera. Those skilled in the art will understand that the structure shown in Figure 12 is for illustrative purposes only and does not limit the structure of the electronic device described above. For example, device 10 may also include more or fewer components than shown in Figure 12, or have a configuration different from that shown in Figure 12.

[0234] It should be noted that the aforementioned one or more processors 1002 and / or other data processing circuits are generally referred to herein as "data processing circuits". These data processing circuits may be embodied, in whole or in part, in software, hardware, firmware, or any other combination thereof. Furthermore, the data processing circuits may be a single, independent processing module, or may be wholly or partially integrated into any other element within device 10 (or mobile device). As involved in the embodiments of this application, the data processing circuits serve as a processor control mechanism (e.g., selection of a variable resistor termination path connected to an interface).

[0235] The memory 1004 can be used to store software programs and modules of application software, such as the program instructions / data storage device corresponding to the method described in the embodiments of this application. The processor 1002 executes various functional applications and data processing by running the software programs and modules stored in the memory 1004, thereby realizing the above-mentioned method for detecting facial key points. The memory 1004 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 1004 may further include memory remotely located relative to the processor 1002, and these remote memories can be connected to the device 10 via a network. Examples of the above-mentioned networks include, but are not limited to, the Internet, corporate intranets, local area networks, mobile communication networks, and combinations thereof.

[0236] The transmission device 1006 is used to receive or send data via a network. Specific examples of the network described above may include a wireless network provided by the communication provider of device 10. In one example, the transmission device 1006 includes a Network Interface Controller (NIC), which can connect to other network devices via a base station to communicate with the Internet. In another example, the transmission device 1006 may be a Radio Frequency (RF) module, used for wireless communication with the Internet.

[0237] The display may be, for example, a touchscreen liquid crystal display (LCD) that allows a user to interact with the user interface of device 10 (or a mobile device).

[0238] An embodiment of this application also provides a computer-readable storage medium, which can be disposed in a server to store at least one instruction or at least one program related to implementing a facial key point detection method in the method embodiment. The at least one instruction or the at least one program is loaded and executed by the processor to implement the facial key point detection method provided in the above method embodiment.

[0239] Optionally, in this embodiment, the storage medium may be located at at least one of the multiple network servers in a computer network. Optionally, in this embodiment, the storage medium may include, but is not limited to, various media capable of storing program code, such as USB flash drives, read-only memory (ROM), random access memory (RAM), portable hard drives, magnetic disks, or optical disks.

[0240] This invention also provides a computer program product or computer program, which includes computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes the computer instructions, causing the computer device to perform a facial key point detection method provided in the various optional embodiments described above.

[0241] It should be noted that the order of the embodiments described above is merely for descriptive purposes and does not represent the superiority or inferiority of the embodiments. Furthermore, the above description focuses on specific embodiments of this application. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps described in the claims can be performed in a different order than that shown in the embodiments and still achieve the desired results. Additionally, the processes depicted in the drawings do not necessarily require a specific or sequential order to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.

[0242] The various embodiments in this application are described in a progressive manner. For similar or identical parts, the embodiments can be referred to one another. Each embodiment focuses on describing its differences from the other embodiments. In particular, the apparatus, device, and storage medium embodiments are basically similar to the method embodiments, so their descriptions are relatively brief; for relevant parts, refer to the descriptions of the method embodiments.

[0243] Those skilled in the art will understand that all or part of the steps of the above embodiments can be implemented by hardware or by a program instructing related hardware. The program can be stored in a computer-readable storage medium, such as a read-only memory, a disk, or an optical disk.

[0244] The above description is only a preferred embodiment of this application and is not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.

Claims

1. A method for detecting facial key points, characterized in that, The method includes: Acquire an image to be detected, wherein the image to be detected contains a face region; The image to be detected is input into the feature extraction network of the face key point detection model for feature extraction processing to obtain the first feature map. The first feature map is input into the feature enhancement network of the face key point detection model for feature enhancement processing to obtain a second feature map, which contains redundant feature information of the first feature map. The second feature map is input into the feature regression network of the face key point detection model to perform feature point detection processing and obtain feature point detection results. The feature point detection results indicate whether each feature point in the second feature map is a face key point. Based on the feature point detection results, the predicted location information corresponding to the facial key points in the image to be detected is determined; The facial landmark detection model is obtained through the following steps: Obtain training sample images; Based on the scale-invariant feature transform matching algorithm, the feature description information of the facial key points in the training sample image is determined; The feature description information of the facial key points is input into the neural network model to be trained, and used as the prior features of the training sample images. The training sample image is input into the neural network model to perform facial key point detection processing, and the sample feature point detection result of the facial key point in the training sample image is obtained. 
Based on the feature description information and the sample feature point detection results, the loss information is determined; Based on the loss information, the neural network model is trained to obtain the facial key point detection model; The step of determining loss information based on the feature description information and the sample feature point detection results includes: From the feature description information, determine the first feature description sub-information corresponding to the left side region of the face and the second feature description sub-information corresponding to the right side region of the face; From the sample feature point detection results, determine the first sample feature point detection sub-result corresponding to the left side region of the face and the second sample feature point detection sub-result corresponding to the right side region of the face; Based on the first feature description sub-information, the first sample feature point detection sub-result, the second feature description sub-information, and the second sample feature point detection sub-result, the first loss data is determined; The third feature description sub-information corresponding to the target key point is determined from the feature description information, and the target key point includes the center point of the eye region, the center point of the mouth region, and the tip of the nose; Determine a third sample feature point detection sub-result corresponding to the target key point from the sample feature point detection results; The second loss data is determined based on the third feature description sub-information and the third sample feature point detection sub-result; The sum of the first loss data and the second loss data is input into a preset loss function to obtain the loss information.

2. The method according to claim 1, characterized in that, The step of inputting the image to be detected into the feature extraction network of the facial landmark detection model for feature extraction processing to obtain a first feature map includes: The image to be detected is subjected to a 3×3 convolution to obtain an initial feature map; The initial feature map is subjected to channel-by-channel convolution to obtain the first feature map.
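
For illustration only, and without limiting the claim, the two convolution steps above might be sketched as follows. The claim fixes the first kernel as 3×3 but does not specify stride, padding, or the kernel size of the channel-by-channel convolution, so stride 1, zero padding, and a 3×3 channel-by-channel kernel are assumptions, as are the function names.

```python
import numpy as np

def conv3x3(x, weight):
    """Standard 3x3 convolution. x: (C_in, H, W), weight: (C_out, C_in, 3, 3)."""
    _, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((weight.shape[0], h, w))
    for dy in range(3):
        for dx in range(3):
            window = padded[:, dy:dy + h, dx:dx + w]
            # Mix all input channels for this kernel offset.
            out += np.einsum("oc,chw->ohw", weight[:, :, dy, dx], window)
    return out

def channelwise_conv3x3(x, weight):
    """Channel-by-channel (depthwise) convolution. weight: (C, 3, 3)."""
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c, h, w))
    for dy in range(3):
        for dx in range(3):
            # Each channel is filtered only by its own kernel.
            out += weight[:, dy, dx][:, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out

def extract_first_feature_map(image, conv_weight, channel_weight):
    """Image -> initial feature map -> first feature map."""
    initial_map = conv3x3(image, conv_weight)
    return channelwise_conv3x3(initial_map, channel_weight)
```

The channel-by-channel step uses one kernel per channel instead of one kernel per input-output channel pair, which is what keeps its parameter count and computation low.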

3. The method according to claim 1, characterized in that, The feature enhancement network includes multiple redundant bottleneck modules. The step of inputting the first feature map into the feature enhancement network of the facial keypoint detection model for feature enhancement processing to obtain the second feature map includes: Determine the target input feature map of the target redundancy bottleneck module, wherein the target input feature map is either the first feature map or the previous output feature map of the previous redundancy bottleneck module; the target redundancy bottleneck module is any one of the plurality of redundancy bottleneck modules; The target input feature map is input into at least one redundant sub-module in the target redundancy bottleneck module, and feature augmentation processing is performed to obtain the target redundancy feature map; The target input feature map and the normalized target redundancy feature map are combined by tensor addition to obtain the target output feature map of the target redundancy bottleneck module; The target output feature map is used as the next input feature map of the next redundant bottleneck module; The above feature augmentation processing and tensor addition are repeated until the output of the last redundant bottleneck module is used as the second feature map.

4. The method according to claim 3, characterized in that, The step of inputting the target input feature map into at least one redundant sub-module in the target redundancy bottleneck module and performing feature augmentation processing to obtain the target redundancy feature map includes: Determine the target input feature sub-map of the target redundant sub-module, wherein the target input feature sub-map is either the target input feature map or the previous output feature sub-map of the previous redundant sub-module; the target redundant sub-module is any one of the at least one redundant sub-module; The target input feature sub-map is subjected to pointwise convolution processing to obtain the intrinsic feature sub-map, which includes channel feature sub-maps of multiple channels; Perform a linear transformation on the channel feature sub-map of each channel to obtain multiple redundant feature sub-maps; The intrinsic feature sub-map and the multiple redundant feature sub-maps are tensor-concatenated to obtain the target output feature sub-map, which is used as the output of the target redundant sub-module; The target output feature sub-map is normalized or activated to obtain the next input feature sub-map of the next redundant sub-module; The above pointwise convolution, linear transformation, tensor concatenation and normalization or activation processes are repeated until the output of the last redundant sub-module is used as the target redundancy feature map.
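
For illustration only, and without limiting claims 3 and 4, one redundant bottleneck module might be sketched as follows. The claims do not specify the form of the linear transformation or of the normalization, so a 3×3 channel-by-channel convolution and a per-channel standardization are assumed here, and a ReLU is assumed between sub-modules. All function names and weight shapes are hypothetical.

```python
import numpy as np

def pointwise_conv(x, weight):
    """1x1 convolution. x: (C_in, H, W), weight: (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x)

def channelwise_linear(x, kernels):
    """Assumed linear transformation: 3x3 filter per channel. kernels: (C, 3, 3)."""
    c, h, w = x.shape
    padded = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c, h, w))
    for dy in range(3):
        for dx in range(3):
            out += kernels[:, dy, dx][:, None, None] * padded[:, dy:dy + h, dx:dx + w]
    return out

def normalize(x, eps=1e-5):
    """Per-channel standardization, standing in for the unspecified normalization."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    std = x.std(axis=(1, 2), keepdims=True)
    return (x - mean) / (std + eps)

def redundant_submodule(x, pw_weight, kernels):
    """Pointwise conv -> per-channel linear transform -> channel concatenation."""
    intrinsic = pointwise_conv(x, pw_weight)
    redundant = channelwise_linear(intrinsic, kernels)
    return np.concatenate([intrinsic, redundant], axis=0)

def redundant_bottleneck(x, submodule_params):
    """Chain the sub-modules, then add the normalized result to the input."""
    y = x
    last = len(submodule_params) - 1
    for i, (pw_weight, kernels) in enumerate(submodule_params):
        y = normalize(redundant_submodule(y, pw_weight, kernels))
        if i < last:
            y = np.maximum(y, 0.0)  # assumed activation between sub-modules
    return x + y  # tensor addition with the module input
```

In this sketch each pointwise convolution outputs half of the input channels, so that concatenating the intrinsic and redundant sub-maps restores the channel count and the tensor addition is well defined.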

5. The method according to claim 1, characterized in that, The step of inputting the second feature map into the feature regression network of the facial key point detection model for feature point detection processing to obtain the feature point detection result includes: The second feature map is input into the convolutional layer of the feature regression network for convolution processing to obtain the third feature map; The third feature map is input into the multi-scale pooling layer of the feature regression network and average pooling is performed to obtain multiple fourth feature maps. The multiple fourth feature maps are input into the multi-scale fully connected layer of the feature regression network for feature mapping processing to obtain the feature point detection results.

6. The method according to claim 1, characterized in that, The scale-invariant feature transform matching algorithm determines the feature description information of the facial key points in the training sample image, including: The training sample images are subjected to multi-scale transformation to obtain images at multiple scales; Determine the Gaussian difference image corresponding to every two adjacent scale images in the plurality of scale images; Based on the Gaussian difference images corresponding to each pair of adjacent scale images, determine at least one extreme point in each Gaussian difference image; Based on at least one extreme point in each Gaussian difference image, at least one candidate feature point in the training sample image is determined, and the at least one candidate feature point is used as the facial key point; Based on the gradient information of the training sample images, the feature description information of the facial key points is determined.

7. A device for detecting facial key points, characterized in that, The device includes: The acquisition module is used to acquire the image to be detected, wherein the image to be detected contains a face region; The feature extraction module is used to input the image to be detected into the feature extraction network of the face key point detection model, perform feature extraction processing, and obtain the first feature map; The feature enhancement module is used to input the first feature map into the feature enhancement network of the face key point detection model, perform feature enhancement processing, and obtain a second feature map, wherein the second feature map contains redundant feature information of the first feature map; The feature regression module is used to input the second feature map into the feature regression network of the face key point detection model, perform feature point detection processing, and obtain feature point detection results. The feature point detection results indicate whether each feature point in the second feature map is a face key point. The key point location determination module is used to determine the predicted location information corresponding to the facial key points in the image to be detected based on the feature point detection results. The device further includes: The training data acquisition unit is used to acquire training sample images; The feature description determination unit is used to determine the feature description information of the facial key points in the training sample image based on the scale-invariant feature transformation matching algorithm. 
The prior input unit is used to input the feature description information of the facial key points into the neural network model to be trained, as the prior features of the training sample image; The training detection unit is used to input the training sample image into the neural network model, perform facial key point detection processing, and obtain the sample feature point detection result of the facial key point in the training sample image; The loss calculation unit is used to determine loss information based on the feature description information and the sample feature point detection results; The model training unit is used to train the neural network model based on the loss information to obtain the facial key point detection model. The loss calculation unit includes: The first segmentation subunit of feature description information is used to determine, from the feature description information, a first feature description sub-information corresponding to the left side region of the face and a second feature description sub-information corresponding to the right side region of the face; The first segmentation subunit of the sample feature point detection result is used to determine, from the sample feature point detection result, a first sample feature point detection sub-result corresponding to the left side region of the face and a second sample feature point detection sub-result corresponding to the right side region of the face; The first loss data determination subunit is used to determine the first loss data based on the first feature description sub-information, the first sample feature point detection sub-result, the second feature description sub-information, and the second sample feature point detection sub-result. The second segmentation subunit of feature description information is used to determine the third feature description sub-information corresponding to the target key point from the feature description information. 
The target key point includes the center point of the eye region, the center point of the mouth region, and the tip of the nose. The second segmentation subunit of the sample feature point detection result is used to determine the third sample feature point detection sub-result corresponding to the target key point from the sample feature point detection result; The second loss data determination subunit is used to determine the second loss data based on the third feature description sub-information and the third sample feature point detection sub-result. The loss information determination subunit is used to input the sum of the first loss data and the second loss data into a preset loss function to obtain the loss information.

8. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores at least one instruction or at least one program, which is loaded and executed by a processor to implement a method for detecting facial key points as described in any one of claims 1 to 6.

9. A computer device, characterized in that, The computer device includes a processor and a memory, the memory storing at least one instruction or at least one program, the at least one instruction or at least one program being loaded and executed by the processor to implement a method for detecting facial key points as described in any one of claims 1 to 6.