Face and gesture recognition method and product fusing attention mechanism
By integrating attention mechanisms into a face and gesture recognition method, utilizing the shared parameters of CenterNet and Siamese neural networks, and combining ResNet network for semantic segmentation, the environmental dependence and low accuracy issues of intelligent wheelchair face and gesture recognition systems are solved, achieving more efficient recognition results.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING SCI & TECH HUAHUI INTELLIGENT TECH CO LTD
- Filing Date
- 2022-10-19
- Publication Date
- 2026-06-23
AI Technical Summary
Existing intelligent wheelchair face and gesture recognition systems suffer from problems such as loose network structure, susceptibility to environmental influences, and low recognition accuracy and efficiency. In particular, when used by the elderly and people with mobility impairments, recognition efficiency is poor when the face is not centered or at an inappropriate distance.
A face and gesture recognition method employing a fusion attention mechanism is proposed. The CenterNet network is used to separate the regions of interest in face and hand images, and Siamese neural networks are used to generate features by sharing parameters. The ResNet network is then combined for semantic segmentation and classification to eliminate environmental interference and improve recognition accuracy and efficiency.
It improves the efficiency of facial recognition and the accuracy of gesture recognition, enhances the adaptability of the network, and can better handle situations where the face is not centered and the gesture is against a complex background, providing convenient intelligent wheelchair interaction.
Figure CN115471898B_ABST
Description
Technical Field
[0001] This invention relates to the field of face and gesture recognition, and in particular to a face and gesture recognition method and product that incorporates an attention mechanism.
Background Technology
[0002] In recent years, many elderly people have experienced difficulties walking and moving due to chronic diseases. Many families are equipping their seniors with smart wheelchairs to assist them in their mobility and allow them greater autonomy in their lives.
[0003] As a product designed for the elderly and disabled, the interaction method of smart wheelchairs should maximize the user's initiative and avoid the user's own limitations. Moreover, the microcomputers installed in smart wheelchairs generally have limited computing power, which places high demands on the efficiency of the algorithms.
[0004] Facial recognition and gesture recognition are important interaction methods for current smart wheelchairs. However, these methods still have some shortcomings. First, most smart wheelchairs currently perform facial and gesture recognition independently, without sharing network structures or parameters, resulting in a less compact overall wheelchair interaction system. Second, current facial and gesture recognition algorithms are easily affected by distance, position, and lighting; gesture recognition is also affected by the user's skin color and background. Finally, the accuracy and efficiency of current facial and gesture recognition algorithms still have room for improvement.
[0005] The primary users of smart wheelchairs are the elderly and people with mobility impairments. This group has limited mobility and limited proficiency in using smart products. When performing facial recognition, issues such as the face not being centered, not facing the camera directly, or being too close or too far from the camera are common. Traditional facial recognition algorithms require users to repeatedly adjust their posture in these situations to recognize the face, causing significant inconvenience.
Summary of the Invention
[0006] The purpose of this invention is to provide a face and gesture recognition method and product that integrates attention mechanisms to solve the problems of poor face recognition efficiency and low gesture recognition accuracy.
[0007] To achieve the above objectives, the present invention provides the following solution:
[0008] A face and gesture recognition method that integrates attention mechanisms includes:
[0009] Facial recognition process:
[0010] Acquire camera images and input them into the CenterNet network to generate regions of interest for facial images and hand images;
[0011] Acquire facial images from the database and input the facial images into a Siamese neural network that fuses channel and spatial attention mechanisms to generate facial image features. At the same time, input the region of interest of the facial image into another Siamese neural network that fuses channel and spatial attention mechanisms to generate features of the region of interest of the facial image; the two Siamese neural networks share parameters.
[0012] The face is identified by comparing the features of the facial image with the features of the region of interest in the facial image, and a facial recognition result is generated.
[0013] Gesture recognition process:
[0014] While performing the facial recognition process, the region of interest of the hand image is input into a ResNet network based on a multi-scale fusion mechanism for semantic segmentation, generating a hand binarized image;
[0015] The binarized hand image is input into a classification network to generate hand recognition results;
[0016] The smart wheelchair is controlled based on the facial recognition results or the hand recognition results.
[0017] Optionally, the Siamese neural network that integrates channel and spatial attention mechanisms specifically includes: a 7×7 convolutional layer, a max pooling layer, a ResBlock0 module, a ResBlock1 module, a first hybrid attention mechanism (MA) module, a ResBlock2 module, a second MA module, a ResBlock3 module, a third MA module, an average pooling layer, and a fully connected layer connected in sequence; the first MA module, the second MA module, and the third MA module have the same structure; the ResBlock1 module, the ResBlock2 module, and the ResBlock3 module have the same structure.
[0018] Optionally, the first MA module specifically includes: a spatial domain attention mechanism (SA) module and a channel domain attention mechanism (CA) module;
[0019] The feature map is input into the SA module, passes through a 1×1 convolutional layer, and then through three convolutional layers to output the convolutional feature map.
[0020] The convolutional feature map is input into the CA module to generate a CA module feature map. The CA module feature map is then added to the input feature map to generate the features of the facial image or of the region of interest of the facial image. The input feature map is derived from either the region of interest of the facial image or the facial image.
[0021] Optionally, the SA module specifically includes: two independent parallel branches, which respectively perform max pooling and average pooling on the feature map in the channel direction to generate a first feature map and a second feature map in a single channel;
[0022] Convolution and ReLU activation operations are performed on the first feature map and the second feature map respectively to generate the third feature map and the fourth feature map. The third feature map and the fourth feature map are then summed element by element to generate the fifth feature map.
[0023] The fifth feature map is transformed into a spatial attention weight matrix using the sigmoid function, and the spatial attention weight matrix is multiplied with the input feature map to generate the SA module feature map.
[0024] Optionally, the CA module specifically includes:
[0025] Global average pooling and global max pooling are performed on the feature maps of the SA module respectively to generate a sixth feature map and a seventh feature map with c channels;
[0026] Add the elements at corresponding positions of the sixth feature map and the seventh feature map to obtain the eighth feature map;
[0027] The eighth feature map is transformed through two fully connected layers to obtain the ninth feature map;
[0028] The ninth feature map is transformed into a channel attention weight matrix using the sigmoid function. The channel attention matrix is then multiplied by the input convolutional feature map to generate the CA module feature map.
[0029] Optionally, the ResBlock0 module specifically includes: 4 convolutional layers; each convolutional layer is followed by a normalization layer and a rectified linear unit (ReLU) layer; the ResBlock0 module does not change the size of the feature map output by the max pooling layer.
[0030] Optionally, the ResBlock1 module specifically includes: 3 convolutional layers; the feature map output by the ResBlock0 module is passed through a first branch including two convolutional layers and a second branch including one convolutional layer to generate two convolutional feature maps; in the first branch, a normalization layer and a ReLU layer are connected after the first convolutional layer, and a normalization layer is connected after the second convolutional layer;
[0031] The two convolutional feature maps are added together and passed through a ReLU layer to output the rectified feature map.
[0032] Optionally, the ResNet network specifically includes: four parallel branches;
[0033] Each branch includes an average pooling layer, a convolutional layer, a combined attention mechanism module, and an upsampling layer connected in sequence; the feature maps processed by the four parallel branches are concatenated, and then convolved by a convolutional layer to output a segmented hand binarized image; the combined attention mechanism module is the SA module and the CA module connected in series.
[0034] An electronic device includes a memory and a processor, the memory storing a computer program, and the processor running the computer program to enable the electronic device to perform the face and gesture recognition method described above with an attention fusion mechanism.
[0035] A computer-readable storage medium storing a computer program that, when executed by a processor, implements the aforementioned face and gesture recognition method with a fusion attention mechanism.
[0036] According to specific embodiments provided by the present invention, the following technical effects are disclosed: The present invention provides a face and gesture recognition method and product that integrates an attention mechanism. By inputting camera images into a CenterNet network, the user's facial image region of interest and hand image region of interest are separated simultaneously, reducing the number of parameters in the detection network by half and eliminating interference from positional and useless information in the image. Furthermore, by inputting the facial image region of interest into a Siamese neural network that integrates channel and spatial attention mechanisms, the network can focus more on important regions in the facial image and suppress interference from less important information, which enhances its performance. By combining a CenterNet network with a Siamese neural network that integrates channel and spatial attention mechanisms, the present invention can better adapt to situations where the face is not in the center of the image, the face is incomplete, or the face is not directly facing the camera, thereby improving face recognition efficiency.
[0037] Meanwhile, semantic segmentation and binarization of the hand image region of interest are performed using a ResNet network based on a multi-scale fusion mechanism. This eliminates the interference of complex backgrounds, hand textures, and hand colors on recognition, making the edges of the semantically segmented hand binarized image clearer, thereby improving the accuracy of gesture recognition.
Attached Figure Description
[0038] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0039] Figure 1 is a flowchart of the face and gesture recognition method with fusion attention mechanism provided by the present invention;
[0040] Figure 2 is the overall network framework diagram provided by the present invention;
[0041] Figure 3 is a structural diagram of the Siamese neural network that fuses channel and spatial attention mechanisms provided by the present invention;
[0042] Figure 4 is a structural diagram of the first MA module provided by the present invention;
[0043] Figure 5 is a structural diagram of the SA module provided by the present invention;
[0044] Figure 6 is a structural diagram of the CA module provided by the present invention;
[0045] Figure 7 is a structural diagram of the ResBlock0 module provided by the present invention;
[0046] Figure 8 is a structural diagram of the ResBlock1 module provided by the present invention;
[0047] Figure 9 is a flowchart of training the Siamese neural network with the triplets error provided by the present invention;
[0048] Figure 10 is a flowchart of the semantic segmentation algorithm provided by the present invention;
[0049] Figure 11 is a structural diagram of the combined attention mechanism module provided by the present invention;
[0050] Figure 12 is a flowchart of the classification network provided by the present invention;
[0051] Figure 13 is a structural diagram of the first three basic units in the classification network provided by the present invention;
[0052] Figure 14 is a structural diagram of the last two basic units in the classification network provided by the present invention.
Detailed Implementation
[0053] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0054] The purpose of this invention is to provide a face and gesture recognition method and product that integrates attention mechanisms, which can improve face recognition efficiency and gesture recognition accuracy.
[0055] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0056] Figure 1 is a flowchart of the face and gesture recognition method with fusion attention mechanism provided by the present invention, and Figure 2 is an overall network framework diagram provided by the present invention. As shown in Figures 1-2, a face and gesture recognition method that integrates an attention mechanism includes:
[0057] Facial recognition process:
[0058] Step 101: Acquire camera images and input the camera images into the CenterNet network to generate regions of interest for the face image and the hand image; wherein, the CenterNet network is a front-end target detection network.
[0059] By using CenterNet to extract features from the overall input, and then separating the hands and face, the interference of position and useless information in the image on subsequent processes is eliminated.
[0060] Step 102: Obtain face images from the database and input the face images into a Siamese neural network that fuses channel and spatial attention mechanisms to generate face image features. At the same time, input the region of interest of the face image into another Siamese neural network that fuses channel and spatial attention mechanisms to generate features of the region of interest of the face image. The two Siamese neural networks share parameters. The parameters include the weights of each ResBlock module, the MixAttention (MA) module, the max pooling layer, and the average pooling layer. The MA module includes a first MA module, a second MA module, and a third MA module.
[0061] In practical applications, Figure 3 is a structural diagram of the Siamese neural network that fuses channel and spatial attention mechanisms provided by the present invention. As shown in Figure 3, the Siamese neural network that integrates channel and spatial attention mechanisms specifically includes: a 7×7 convolutional layer, a max pooling layer, a ResBlock0 module, a ResBlock1 module, a first hybrid attention mechanism (MA) module, a ResBlock2 module, a second MA module, a ResBlock3 module, a third MA module, an average pooling layer, and a fully connected layer connected in sequence; the first MA module, the second MA module, and the third MA module have the same structure; the ResBlock1 module, the ResBlock2 module, and the ResBlock3 module have the same structure.
[0062] Figure 4 is a structural diagram of the first MA module provided by the present invention. As shown in Figure 4, the first MA module specifically includes: a spatial domain attention mechanism (SA) module and a channel domain attention mechanism (CA) module; the feature map is input into the SA module, passes through a 1×1 convolutional layer, and then passes through three convolutional layers respectively, outputting a convolutional feature map; the convolutional feature map is input into the CA module to generate a CA module feature map, and the CA module feature map is added to the input feature map to generate the features of the face image or of the region of interest of the face image; the input feature map is derived from either the region of interest of the face image or the face image.
[0063] In practical applications, the SA module specifically includes: two independent parallel branches, which perform max pooling and average pooling on the feature map in the channel direction respectively to generate a first feature map and a second feature map in a single channel; convolution and ReLU activation operations are performed on the first feature map and the second feature map respectively to generate a third feature map and a fourth feature map, and the third feature map and the fourth feature map are summed element by element to generate a fifth feature map; the fifth feature map is transformed into a spatial attention weight matrix through the sigmoid function, and the spatial attention weight matrix is multiplied with the input feature map to generate the SA module feature map.
[0064] Figure 5 is a structural diagram of the SA module provided by the present invention. As shown in Figure 5, the spatial domain attention mechanism (SA) pays more attention to important spatial information in the feature map, where w, h, and c represent the width, height, and number of channels of the module's input feature map, respectively.
[0065] This module contains two independent parallel branches that perform max pooling and average pooling on the feature map along the channel direction, respectively, to obtain two single-channel feature maps of size w×h×1: feature map 1_1 and feature map 1_2.
[0066] In each of the two branches, convolution and ReLU activation operations are performed on the obtained feature map, yielding feature map 2_2 and feature map 2_3. Feature map 2_2 and feature map 2_3 are then summed element by element to obtain feature map 3.
[0067] Feature map 3 is transformed by the sigmoid function into a spatial attention weight matrix, where the magnitude of each element reflects the importance of the corresponding location in the feature map. This spatial attention weight matrix is then multiplied by the input to obtain the output of the SA module.
[0068] Max pooling can adaptively assign higher weights to local features such as salient edges and contours, while average pooling can adaptively assign higher weights to global features of salient regions. The combination of the two can better enable neural networks to focus on facial components such as eyes and mouth, and has stronger expressive power compared with traditional single-branch attention mechanisms.
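The SA computation described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the text does not give the kernel size or weights of the SA convolutions, so the 3×3 kernels used here, and all function names, are hypothetical.

```python
import math

def conv2d_same(fmap, kernel):
    """Single-channel 2D convolution with zero padding; output keeps the input size."""
    h, w, k = len(fmap), len(fmap[0]), len(kernel)
    pad = k // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    y, x = i + di - pad, j + dj - pad
                    if 0 <= y < h and 0 <= x < w:
                        acc += fmap[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out

def relu_map(fmap):
    return [[max(0.0, v) for v in row] for row in fmap]

def spatial_attention(x, k_max, k_avg):
    """x[ch][i][j] is a feature map with c channels, h rows and w columns."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    # Two parallel branches: pool along the channel direction -> single-channel maps.
    max_map = [[max(x[ch][i][j] for ch in range(c)) for j in range(w)] for i in range(h)]
    avg_map = [[sum(x[ch][i][j] for ch in range(c)) / c for j in range(w)] for i in range(h)]
    # Convolution + ReLU on each branch (kernels are assumed, see lead-in).
    b1 = relu_map(conv2d_same(max_map, k_max))
    b2 = relu_map(conv2d_same(avg_map, k_avg))
    # Element-wise sum, then sigmoid -> spatial attention weight matrix.
    weights = [[1.0 / (1.0 + math.exp(-(b1[i][j] + b2[i][j]))) for j in range(w)]
               for i in range(h)]
    # Multiply every channel of the input by the weight matrix.
    out = [[[x[ch][i][j] * weights[i][j] for j in range(w)] for i in range(h)]
           for ch in range(c)]
    return out, weights

# Toy input: 2 channels of size 2x2; identity kernels chosen only for the demo.
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
x = [[[1.0, 2.0], [3.0, 4.0]], [[4.0, 3.0], [2.0, 1.0]]]
out, weights = spatial_attention(x, identity, identity)
```

Because the sigmoid is applied after ReLU, every weight in this sketch lies in [0.5, 1); the output has the same shape as the input.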
[0069] In practical applications, the CA module specifically includes: performing global average pooling and global max pooling on the feature maps of the SA module to generate a sixth feature map and a seventh feature map with c channels; adding the elements at corresponding positions of the sixth feature map and the seventh feature map to obtain an eighth feature map; transforming the eighth feature map through two fully connected layers to obtain a ninth feature map; converting the ninth feature map into a channel attention weight matrix using the sigmoid function; and multiplying the channel attention matrix with the input convolutional feature map to generate the CA module feature map.
[0070] Figure 6 is a structural diagram of the CA module provided by the present invention. As shown in Figure 6, the channel domain attention mechanism (CA) focuses more on the important channel features in the feature map tensor. The CA module first performs global average pooling and global max pooling on the feature map to obtain two 1×1 feature maps with c channels: feature map 4_1 and feature map 4_2.
[0071] The elements at corresponding positions of feature map 4_1 and feature map 4_2 are added to obtain feature map 5.
[0072] Feature map 5 is transformed through two fully connected layers to obtain feature map 6.
[0073] Feature map 6 is converted by the sigmoid function into a channel attention weight matrix, and the channel attention weight matrix is multiplied by the input to obtain the output of the CA module.
[0074] The magnitude of the weight at each channel position reflects the importance of the corresponding channel. Similar to the SA module, this module uses a dual-branch design, which offers better performance and accuracy than the traditional channel attention mechanism; at the same time, the bottleneck structure composed of two fully connected layers effectively reduces the number of parameters while maintaining accuracy.
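The CA steps (global pooling, element-wise sum, two fully connected layers, sigmoid, channel-wise multiplication) can be sketched as follows. This is an illustration only: the weight matrices `w1` and `w2`, their bottleneck width, and the function name are assumptions, and no activation is placed between the two fully connected layers because the text does not mention one.

```python
import math

def channel_attention(x, w1, w2):
    """x[ch][i][j]: feature map with c channels.
    w1: (hidden x c) weights of the first fully connected layer (bottleneck).
    w2: (c x hidden) weights of the second fully connected layer."""
    c = len(x)
    flat = [[v for row in x[ch] for v in row] for ch in range(c)]
    avg = [sum(f) / len(f) for f in flat]       # global average pooling -> 1x1xc
    mx = [max(f) for f in flat]                 # global max pooling -> 1x1xc
    pooled = [a + m for a, m in zip(avg, mx)]   # element-wise sum
    hidden = [sum(wt * p for wt, p in zip(row, pooled)) for row in w1]
    logits = [sum(wt * hv for wt, hv in zip(row, hidden)) for row in w2]
    weights = [1.0 / (1.0 + math.exp(-z)) for z in logits]  # channel attention weights
    # Scale each channel of the input by its weight.
    out = [[[v * weights[ch] for v in row] for row in x[ch]] for ch in range(c)]
    return out, weights

# Toy input: 2 channels of size 2x2, bottleneck of width 1 (hypothetical values).
x = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
w1 = [[0.5, 0.5]]
w2 = [[1.0], [-1.0]]
out, weights = channel_attention(x, w1, w2)
```

With these toy weights the first channel is kept almost unchanged and the second is almost fully suppressed, which shows how the per-channel weight acts as an importance score.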
[0075] Figure 7 is a structural diagram of the ResBlock0 module provided by the present invention. As shown in Figure 7, the ResBlock0 module specifically includes: 4 convolutional layers; each convolutional layer is followed by a normalization layer and a rectified linear unit (ReLU) layer; the ResBlock0 module does not change the size of the feature map output by the max pooling layer.
[0076] In Figure 7, Conv1-Conv4 each have a kernel size of 3×3, a stride of 1, and a padding of 1. Each convolutional layer is followed by a normalization layer (BN) and a rectified linear unit layer (ReLU); ResBlock0 does not change the size of the feature map.
[0077] Figure 8 is a structural diagram of the ResBlock1 module provided by the present invention. As shown in Figure 8, the ResBlock1 module specifically includes: 3 convolutional layers; the feature map output by the ResBlock0 module is passed through a first branch including two convolutional layers and a second branch including one convolutional layer to generate two convolutional feature maps; in the first branch, a normalization layer and a ReLU layer are connected after the first convolutional layer, and a normalization layer is connected after the second convolutional layer; the two convolutional feature maps are added together and passed through a ReLU layer to output a rectified feature map.
[0078] In Figure 8, Conv1 and Conv3 have a kernel size of 3×3, a padding of 1, and a stride of 2. Conv2 has a kernel size of 3×3, a padding of 1, and a stride of 1. ResBlock1 to ResBlock3 each reduce the feature map size to half of its input size.
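The size behaviour of these blocks follows from the standard convolution output-size formula, out = floor((in + 2·padding − kernel) / stride) + 1. The short check below traces one spatial dimension through ResBlock0 and ResBlock1 to ResBlock3. The starting size of 56 is a hypothetical value (the text gives no concrete size at this point), and the assignment of Conv1 and Conv2 to the first branch and Conv3 to the second is inferred from the stride values.

```python
def conv_out_size(size, kernel, stride, padding):
    """Standard output-size formula for a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def resblock0(size):
    # Conv1-Conv4: 3x3 kernel, stride 1, padding 1 -> size is preserved.
    for _ in range(4):
        size = conv_out_size(size, kernel=3, stride=1, padding=1)
    return size

def resblock_down(size):
    # First branch: Conv1 (stride 2) followed by Conv2 (stride 1).
    main = conv_out_size(size, kernel=3, stride=2, padding=1)
    main = conv_out_size(main, kernel=3, stride=1, padding=1)
    # Second branch: Conv3 (stride 2).
    shortcut = conv_out_size(size, kernel=3, stride=2, padding=1)
    # Both branches must have the same size so they can be added.
    assert main == shortcut
    return main

size = 56  # hypothetical input size, chosen only for illustration
sizes = [resblock0(size)]
for _ in range(3):  # ResBlock1, ResBlock2, ResBlock3
    sizes.append(resblock_down(sizes[-1]))
```

Here `sizes` lists the feature-map size after ResBlock0 and after each of the three downsampling blocks.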
[0079] Step 103: Compare the features of the face image with the features of the region of interest in the face image to identify the face and generate a face recognition result.
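A minimal sketch of the comparison in Step 103 is given below. It assumes that both feature vectors were produced by the same shared-parameter network and that matching is done by nearest Euclidean distance under a threshold; the text does not state the distance measure or any threshold, so both, along with the names `identify` and `database`, are hypothetical.

```python
import math

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def identify(roi_feature, database, threshold):
    """Return the name of the closest enrolled face, or None if no enrolled
    feature lies within `threshold`. `database` maps user name -> feature vector."""
    best_name, best_dist = None, float("inf")
    for name, feature in database.items():
        d = euclidean(roi_feature, feature)
        if d < best_dist:
            best_name, best_dist = name, d
    return best_name if best_dist <= threshold else None

# Toy 3-dimensional features (real embeddings would be much longer).
database = {"user_a": [0.0, 1.0, 0.0], "user_b": [1.0, 0.0, 0.0]}
match = identify([0.1, 0.9, 0.0], database, threshold=0.5)
no_match = identify([5.0, 5.0, 5.0], database, threshold=0.5)
```

The first query lies close to `user_a` and is accepted; the second is far from every enrolled feature and is rejected.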
[0080] In the face recognition part, the Siamese neural network MA-ResNet fuses channel and spatial attention mechanisms, which lets the convolutional neural network pay more attention to important regions in the image while suppressing interference from less important information, thus enhancing the network's performance.
[0081] The results show that the MA-ResNet proposed in this invention has higher accuracy, stronger adaptability, and higher robustness compared to convolutional neural networks with the same number of layers, such as ResNet18; and the Siamese neural network in this invention has fewer parameters while maintaining the same accuracy.
[0082] The front-end network (the CenterNet network) is an object detection network; the bounding box obtained from object detection is used to crop the feature map. Subsequent networks then compute on and recognize only a small area containing the face, which solves the problem of the face not being in the center of the image and eliminates interference from the surrounding background.
[0083] The presence of an attention mechanism allows the network to focus more on information with distinct local features, such as the eyes, mouth, and nose. This means that even when the face is incomplete, the existing information is sufficient for identification. Furthermore, adding an attention mechanism module to the existing structure also increases the network's depth. Deeper networks can better learn higher-level, abstract semantic information, thus exhibiting stronger adaptability and better handling situations where the face is not directly facing the camera.
[0084] Therefore, the use of pre-detection networks and attention mechanisms can better adapt to situations where the face is not in the center of the frame, the face is incomplete, or the face is not facing the camera directly when recognizing a face.
[0085] Gesture recognition process:
[0086] Step 104: While performing the face recognition process, input the region of interest of the hand image into the ResNet network based on the multi-scale fusion mechanism for semantic segmentation to generate a hand binarized image.
[0087] Step 105: Input the binarized hand image into the classification network to generate hand recognition results.
[0088] Step 106: Control the smart wheelchair based on the facial recognition result or the hand recognition result.
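The data flow of Steps 101 to 106 can be summarized in a short sketch. Every callable passed in below is a hypothetical placeholder standing in for one of the networks; none of these names comes from the patent. The sketch also runs the two processes one after the other, whereas the text describes gesture recognition as running alongside face recognition.

```python
def run_pipeline(image, detect, embed, match, segment, classify):
    """Wire the recognition steps together.
    detect   : image -> (face_roi, hand_roi)        (Step 101, CenterNet)
    embed    : face_roi -> feature vector            (Step 102, shared-parameter network)
    match    : feature vector -> identity or None    (Step 103, comparison with database features)
    segment  : hand_roi -> binarized hand image      (Step 104, semantic segmentation)
    classify : binarized hand image -> gesture label (Step 105, classification network)
    """
    face_roi, hand_roi = detect(image)
    face_result = match(embed(face_roi))
    hand_mask = segment(hand_roi)
    gesture = classify(hand_mask)
    # Step 106 would drive the wheelchair from either of these results.
    return face_result, gesture

# Toy stand-ins, used only to show the call pattern.
face_result, gesture = run_pipeline(
    image={"face": "F", "hand": "H"},
    detect=lambda img: (img["face"], img["hand"]),
    embed=lambda roi: roi.lower(),
    match=lambda feat: "user_a" if feat == "f" else None,
    segment=lambda roi: [[1, 0], [0, 1]],
    classify=lambda mask: "gesture_%d" % sum(map(sum, mask)),
)
```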
[0089] In the gesture recognition section, this invention eliminates interference from complex backgrounds, hand textures, and hand colors through semantic segmentation and binarization. The use of an attention mechanism makes the edges of the hand clearer in the segmentation results, and the use of a multi-scale fusion mechanism allows the algorithm to adapt to changes in the distance between the hand and the camera. The algorithm proposed in this invention recognizes gestures more accurately, and requires a smaller dataset to train the classification network.
[0090] The Siamese neural network of this invention is trained using the triplets error, in the form shown in Figure 9.
[0091] During network training, the input, its corresponding ground truth, and an irrelevant item are each fed into the Siamese neural network. The network processes each of them, and the results are then embedded (dimensionality reduction). The triplets error is calculated from the three dimensionality-reduced feature vectors. The formula for calculating the triplets error is as follows:
[0092] $L = \max\big(d(a,p) - d(a,n) + \mathrm{margin},\ 0\big)$
[0093] Where d(a,p) represents the distance between the feature vectors of the input and of the true value, d(a,n) represents the distance between the feature vectors of the input and of the irrelevant item, and margin is an adjustable parameter. The training goal of the triplets error is to minimize the distance between feature vectors of the same person and maximize the distance between feature vectors of different people. Using the triplets error can reduce the probability of false matches and increase the recall rate of the algorithm.
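As an illustration, the sketch below assumes the triplets error takes the standard triplet-loss form max(d(a,p) − d(a,n) + margin, 0) and that d is the Euclidean distance; the distance measure and the margin value are not stated in the text and are assumptions here.

```python
import math

def distance(u, v):
    """Euclidean distance between two feature vectors (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin):
    """Zero once the negative is at least `margin` farther from the anchor
    than the positive is; positive otherwise."""
    return max(distance(anchor, positive) - distance(anchor, negative) + margin, 0.0)

anchor, positive = [0.0, 0.0], [0.0, 1.0]
easy = triplet_loss(anchor, positive, [3.0, 4.0], margin=0.5)   # negative is far away
hard = triplet_loss(anchor, positive, [0.0, 1.2], margin=0.5)   # negative is too close
```

In the first case the unrelated sample is already well separated, so the loss is zero and contributes no gradient; in the second case the loss is positive, which pushes the network to pull the matching pair together and push the unrelated sample away.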
[0094] In summary, the face recognition method proposed in this invention has higher accuracy, higher robustness, stronger multi-scale adaptability, and is more convenient for entering user information, which can provide great convenience for the elderly and users with mobility difficulties.
[0095] Some elderly people suffer from upper limb weakness or Parkinson's syndrome, making it relatively difficult to operate a wheelchair using a joystick. Using different gestures to translate into different control signals can effectively solve these problems. Furthermore, gesture recognition is a good alternative when voice recognition is ineffective.
[0096] This invention combines semantic segmentation and a classification network to recognize gestures. First, semantic segmentation is used to binarize the hand and background to eliminate the influence of the background. Then, the binarized result is fed into a classification network for classification to obtain the gesture recognition result.
[0097] This invention incorporates an attention mechanism into the semantic segmentation network to improve the accuracy of edge segmentation and the clarity of hand boundary contours. A multi-scale fusion mechanism is added to enhance the segmentation algorithm's adaptability to different distances and scales. Using this invention, the influence of complex backgrounds, skin color, and hand texture can be effectively eliminated, resulting in high recognition accuracy. Furthermore, it eliminates the need for large-scale datasets, allowing users to customize the mapping between gestures and operations.
[0098] Figure 10 is a flowchart of the semantic segmentation algorithm provided by the present invention. As shown in Figure 10, the ResNet network specifically includes: four parallel branches; each branch includes an average pooling layer, a convolutional layer, a combined attention mechanism module, and an upsampling layer connected in sequence; the feature maps processed by the four parallel branches are concatenated, and then convolved by a convolutional layer to output a segmented hand binarized image; the combined attention mechanism module is the SA module and the CA module connected in series.
[0099] The hand ROI detected by the CenterNet network is input into ResNet. The feature map obtained by ResNet is processed into four parallel branches. The processed features are then concatenated, and finally output after another convolution.
[0100] Figure 11 is a structural diagram of the combined attention mechanism module provided by the present invention. As shown in Figure 11, the SA module and the CA module are connected in series to form the combined attention mechanism module.
[0101] The average pooling layers 1-4 have kernel sizes of 1*1, 2*2, 3*3, and 6*6, respectively, to obtain features at different scales. Convolutional layer 1 uses 1*1 kernels to fuse different channels within the same feature map. After passing through the combined attention mechanism module, feature maps at different scales are upsampled using bilinear interpolation and then concatenated, inputting into convolutional layer 2. Convolutional layer 2 also uses 1*1 kernels to compress the feature maps and achieve the fusion of feature maps at different scales.
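The multi-scale part of this design (pooling at four scales, upsampling back to a common size, concatenating) can be sketched on a single-channel map as follows. Several simplifications are assumptions of this sketch: the pooling stride is taken equal to the kernel size, nearest-neighbour upsampling is swapped in for the bilinear interpolation named in the text to keep the code short, and the 1×1 convolutions and the combined attention module are omitted.

```python
def avg_pool(fmap, k):
    """Average pooling with a k x k window and stride k (stride is an assumption)."""
    h, w = len(fmap), len(fmap[0])
    return [[sum(fmap[i + di][j + dj] for di in range(k) for dj in range(k)) / (k * k)
             for j in range(0, w - k + 1, k)]
            for i in range(0, h - k + 1, k)]

def upsample_nearest(fmap, out_h, out_w):
    """Nearest-neighbour upsampling (stand-in for bilinear interpolation)."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [[fmap[i * in_h // out_h][j * in_w // out_w] for j in range(out_w)]
            for i in range(out_h)]

def multi_scale_features(fmap, kernels=(1, 2, 3, 6)):
    h, w = len(fmap), len(fmap[0])
    branches = []
    for k in kernels:
        pooled = avg_pool(fmap, k)
        # The 1x1 convolution and the SA + CA attention module would act here.
        branches.append(upsample_nearest(pooled, h, w))
    # Concatenation along the channel axis: one h x w map per scale.
    return branches

fmap = [[float(i * 6 + j) for j in range(6)] for i in range(6)]
branches = multi_scale_features(fmap)
```

On this 6×6 toy input the four branches hold the original map, 2×2 and 3×3 block averages, and the global average, so the final 1×1 convolution would see fine detail and whole-image context side by side.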
[0102] Figure 12 is a flowchart of the classification network provided by the present invention. For the network shown in Figure 12, the structure of the first three basic units is shown in Figure 13, and the structure of the last two basic units is shown in Figure 14.
[0103] Each convolutional layer in the five basic units has a 3×3 kernel, a padding of 1, and a stride of 1. Each pooling layer has a 2×2 kernel, a padding of 0, and a stride of 2. After the five basic units, the resulting feature map has a size of 7×7×512. It is then processed by two fully connected layers and one softmax classification layer, which outputs a vector of size 1 × (number of classes). The position of the largest element in this vector gives the recognition result.
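The spatial size and the final decision rule can be checked with a short sketch. It assumes one pooling layer per basic unit and a 224×224 input; the input size is not stated in the text and is inferred here from the 7×7 output after five halvings (7 × 2^5 = 224). The function names are hypothetical.

```python
import math

def size_after_units(size, units=5):
    """Trace one spatial dimension through the basic units."""
    for _ in range(units):
        size = (size + 2 * 1 - 3) // 1 + 1   # 3x3 conv, padding 1, stride 1: unchanged
        size = (size + 2 * 0 - 2) // 2 + 1   # 2x2 pooling, padding 0, stride 2: halved
    return size

def softmax(logits):
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(logits):
    """Index of the largest softmax output, i.e. the recognized gesture class."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)

final_size = size_after_units(224)            # assumed input size
gesture_index = predict([0.1, 2.0, -1.0])     # toy scores for three classes
```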
[0104] This invention proposes a face and gesture recognition method that integrates an attention mechanism, resulting in a more compact structure, higher recognition accuracy, and better adaptability.
[0105] This invention provides an electronic device including a memory and a processor. The memory stores a computer program, and the processor runs the computer program to enable the electronic device to perform the face and gesture recognition method based on the fusion attention mechanism of Embodiment 1.
[0106] In practical applications, the aforementioned electronic devices can be servers.
[0107] In practical applications, electronic devices include: at least one processor, memory, bus, and communication interface.
[0108] The processor, communication interface, and memory communicate with each other via a communication bus.
[0109] A communication interface is used to communicate with other devices.
[0110] The processor is used to execute programs, specifically the methods described in the above embodiments.
[0111] Specifically, the program may include program code, which includes computer operation instructions.
[0112] The processor may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The electronic device may include one or more processors of the same type, such as one or more CPUs; or it may include processors of different types, such as one or more CPUs and one or more ASICs.
[0113] Memory is used to store programs. Memory may include high-speed RAM, and may also include non-volatile memory, such as at least one disk drive.
[0114] Based on the description of the above embodiments, this application provides a storage medium storing computer program instructions thereon, which can be executed by a processor to implement the methods described in any embodiment.
[0115] The electronic device that implements the face and gesture recognition method with fusion attention mechanism provided in this application exists in various forms, including but not limited to:
[0116] (1) Mobile communication devices: These devices are characterized by their mobile communication capabilities and primarily aim to provide voice and data communication. These terminals include: smartphones (e.g., iPhones), multimedia phones, feature phones, and low-end phones, etc.
[0117] (2) Ultra-mobile personal computer devices: These devices fall under the category of personal computers, possessing computing and processing capabilities, and generally also have mobile internet access capabilities. These terminals include PDAs, MIDs, and UMPCs, such as the iPad.
[0118] (3) Portable entertainment devices: These devices can display and play multimedia content. This category includes: audio and video players (such as iPods), handheld game consoles, e-books, as well as smart toys and portable car navigation devices.
[0119] (4) Other electronic devices with data interaction functions.
[0120] Specific embodiments of the subject matter have now been described. Other embodiments are within the scope of the appended claims. In some cases, the actions described in the claims can be performed in a different order and still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require a specific or sequential order to achieve the desired result. In some embodiments, multitasking and parallel processing can be advantageous.
[0121] The systems, devices, modules, or units described in the above embodiments can be implemented by computer chips or entities, or by products with certain functions. A typical implementation device is a computer. Specifically, a computer can be, for example, a personal computer, laptop computer, cellular phone, camera phone, smartphone, personal digital assistant, media player, navigation device, email device, game console, tablet computer, wearable device, or any combination of these devices.
[0122] For ease of description, the above apparatus is described by dividing it into various functional units. Of course, in implementing this application, the functions of each unit can be implemented in one or more software and / or hardware components. Those skilled in the art will understand that embodiments of this application can be provided as methods and products, or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0123] This application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of this application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create a device for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0124] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0125] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0126] In a typical configuration, a computing device includes one or more processors (CPU), input / output interfaces, network interfaces, and memory.
[0127] Memory may include non-persistent storage in computer-readable media, such as random access memory (RAM) and / or non-volatile memory, such as read-only memory (ROM) or flash RAM. Memory is an example of computer-readable media.
[0128] Computer-readable media, including both permanent and non-permanent, removable and non-removable media, can store information using any method or technology. Information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM),
[0129] digital versatile disc (DVD) or other optical storage, magnetic cassette tape, magnetic disk storage or other magnetic storage devices,
[0130] or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media does not include transitory media, such as modulated data signals and carrier waves.
[0131] It should also be noted that the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0132] Those skilled in the art will understand that embodiments of this application can be provided as methods and products or computer program products. Therefore, this application can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, this application can take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0133] This application can be described in the general context of computer-executable instructions that are executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform specific transactions or implement specific abstract data types. This application can also be practiced in distributed computing environments where transactions are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.
[0134] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from the other embodiments. For identical or similar parts, the embodiments may be cross-referenced. Because the systems disclosed in the embodiments correspond to the methods disclosed in the embodiments, their descriptions are relatively brief; for relevant details, refer to the description of the methods.
[0135] This document uses specific examples to illustrate the principles and implementation methods of the present invention. The descriptions of the above embodiments are only for the purpose of helping to understand the method and core ideas of the present invention. Furthermore, those skilled in the art will recognize that, based on the ideas of the present invention, there will be changes in the specific implementation methods and application scope. Therefore, the content of this specification should not be construed as a limitation of the present invention.
Claims
1. A face and gesture recognition method integrating an attention mechanism, characterized in that it comprises: a facial recognition process: camera images are acquired and input into a CenterNet network to generate regions of interest (ROIs) of the facial image and the hand image; the CenterNet network is a front-end target detection network; facial images are acquired from the database and input into a Siamese neural network that fuses channel and spatial attention mechanisms to generate facial image features; at the same time, the region of interest of the facial image is input into another Siamese neural network that fuses channel and spatial attention mechanisms to generate facial image region-of-interest features; the two Siamese neural networks share parameters; the Siamese neural network that fuses the channel and spatial attention mechanisms specifically includes, connected in sequence: 7... 7 convolutional layers, a max pooling layer, a ResBlock0 module, a ResBlock1 module, a first hybrid attention mechanism (MA) module, a ResBlock2 module, a second MA module, a ResBlock3 module, a third MA module, an average pooling layer, and a fully connected layer; the first MA module, the second MA module, and the third MA module have the same structure; the ResBlock1 module, the ResBlock2 module, and the ResBlock3 module also have the same structure; the face is identified by comparing the facial image features with the facial image region-of-interest features, and a facial recognition result is generated;
a gesture recognition process: while the facial recognition process is performed, the region of interest of the hand image is input into a ResNet network based on a multi-scale fusion mechanism for semantic segmentation, generating a binarized hand image; the binarized hand image is input into a classification network to generate a hand recognition result; and the smart wheelchair is controlled based on the facial recognition result or the hand recognition result.
2. The face and gesture recognition method integrating an attention mechanism according to claim 1, characterized in that the first MA module specifically includes: a spatial domain attention mechanism (SA) module and a channel domain attention mechanism (CA) module; the feature map is input into the SA module, and after 1... one convolutional layer, followed by three more convolutional layers, the convolutional feature map is output; the convolutional feature map is input into the CA module to generate a CA module feature map; the CA module feature map is then added to the feature map to generate the facial image region-of-interest feature or the facial image feature; the feature map corresponds to the region of interest of the facial image or to the facial image.
3. The face and gesture recognition method integrating an attention mechanism according to claim 2, characterized in that the SA module specifically includes: two independent parallel branches, which respectively perform max pooling and average pooling on the feature map in the channel direction to generate a single-channel first feature map and a single-channel second feature map; convolution and ReLU activation operations are performed on the first feature map and the second feature map respectively to generate a third feature map and a fourth feature map; the third feature map and the fourth feature map are summed element by element to generate a fifth feature map; the fifth feature map is transformed into a spatial attention weight matrix using the sigmoid function, and the spatial attention weight matrix is multiplied with the input feature map to generate the SA module feature map.
4. The face and gesture recognition method integrating an attention mechanism according to claim 3, characterized in that the CA module specifically includes: global average pooling and global max pooling are performed on the feature maps of the SA module respectively to generate a sixth feature map and a seventh feature map with c channels; the elements at corresponding positions of the sixth feature map and the seventh feature map are added to obtain an eighth feature map; the eighth feature map is transformed through two fully connected layers to obtain a ninth feature map; the ninth feature map is transformed into a channel attention weight matrix using the sigmoid function, and the channel attention weight matrix is multiplied by the input convolutional feature map to generate the CA module feature map.
5. The face and gesture recognition method integrating an attention mechanism according to claim 4, characterized in that the ResBlock0 module specifically includes: 4 convolutional layers; each convolutional layer is followed by a normalization layer and a rectified linear unit (ReLU) layer; the ResBlock0 module does not change the size of the feature map output by the max pooling layer.
6. The face and gesture recognition method integrating an attention mechanism according to claim 5, characterized in that the ResBlock1 module specifically includes: 3 convolutional layers; the feature map output by the ResBlock0 module is passed through a first branch including two convolutional layers and a second branch including one convolutional layer to generate two convolutional feature maps; in the first branch, a normalization layer and a rectified linear unit (ReLU) layer are connected after the first convolutional layer, and a normalization layer is connected after the second convolutional layer; the two convolutional feature maps are added together and passed through a rectified linear unit (ReLU) layer to output the corrected feature map.
7. The face and gesture recognition method integrating an attention mechanism according to claim 6, characterized in that the ResNet network specifically includes: four parallel branches; each branch includes an average pooling layer, a convolutional layer, a combined attention mechanism module, and an upsampling layer connected in sequence; the feature maps processed by the four parallel branches are concatenated and then convolved by a convolutional layer to output a segmented, binarized hand image; the combined attention mechanism module consists of the SA module and the CA module connected in series.
8. An electronic device, characterized in that the device includes a memory and a processor, the memory being used to store a computer program, and the processor running the computer program to cause the electronic device to perform the face and gesture recognition method integrating an attention mechanism according to any one of claims 1-7.
9. A computer-readable storage medium, characterized in that it stores a computer program that, when executed by a processor, implements the face and gesture recognition method integrating an attention mechanism according to any one of claims 1-7.