An image processing method, device, electronic device, and computer-readable storage medium

By extracting and splicing facial expression features and identity features in facial expression modeling, capturing their joint information, and adjusting facial expression features by minimizing mutual information, the shortcomings of independent modeling of facial expression and identity features in traditional methods are solved, achieving more accurate and robust facial expression recognition and identity verification.

CN120673453B · Active · Publication Date: 2026-06-26 · UBTECH ROBOTICS CORP LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
UBTECH ROBOTICS CORP LTD
Filing Date
2025-05-23
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Traditional facial expression modeling methods assume that expression features and identity features are independent, ignoring the potential correlation between the two, which leads to a decrease in expression classification accuracy or unstable identity recognition.

Method used

By extracting facial expression features and identity features from image data, performing feature stitching, capturing the joint information between facial expression and identity, and adjusting facial expression features to reduce identity information through mutual information minimization, more accurate facial expression recognition and identity verification can be achieved.

Benefits of technology

It improves the accuracy and robustness of facial expression recognition and identity verification, enhances the ability to model personalized facial expressions, and improves performance in multi-task scenarios.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides an image processing method and device, electronic equipment and computer readable storage medium; the method comprises: acquiring image data of a facial expression; extracting a first expression feature and a first identity feature from the image data; performing feature splicing on the first expression feature and the first identity feature to obtain a first joint feature, and extracting joint information between the first expression feature and the first identity feature from the first joint feature; estimating mutual information between the first expression feature and the first identity feature, adjusting the first expression feature to a second expression feature with the mutual information minimized as the target; and predicting a classification result of the facial expression based on the second expression feature and the joint information; in this way, expression and identity collaborative modeling is realized.

Description

[0001] This application relates to the field of artificial intelligence technology, and in particular to an image processing method, apparatus, electronic device, and computer-readable storage medium.

Background Technology

[0002] In the field of facial expression modeling, traditional methods typically decompose facial expression images into expression features and identity features and assume that the two are independent of each other, thus ignoring the potential correlation between them. However, in real-world scenarios, the manifestation of expression features is often influenced by identity features; for example, different individuals may exhibit different facial muscle changes when expressing the same emotion. Therefore, modeling expression or identity features in isolation cannot fully capture the complex relationships in the real world, potentially leading to decreased accuracy in expression classification or instability in identity recognition.

Summary of the Invention

[0003] To address the aforementioned technical problems, embodiments of this application provide an image processing method, apparatus, electronic device, and computer-readable storage medium.

[0004] In a first aspect, embodiments of this application provide an image processing method, including:

[0005] Acquire image data of facial expressions;

[0006] Extract the first facial expression feature and the first identity feature from the image data;

[0007] The first facial expression feature and the first identity feature are concatenated to obtain a first joint feature, and the joint information between the first facial expression feature and the first identity feature is extracted from the first joint feature.

[0008] The mutual information between the first facial expression feature and the first identity feature is estimated, and the first facial expression feature is adjusted to the second facial expression feature with the goal of minimizing the mutual information.

[0009] The classification result of the facial expression is predicted based on the second expression feature and the joint information.

[0010] In a second aspect, embodiments of this application provide an image processing apparatus, which is applied to an electronic device and includes:

[0011] The acquisition module is used to acquire image data of facial expressions;

[0012] The extraction module is used to extract a first facial expression feature and a first identity feature from the image data;

[0013] The processing module is used to perform feature concatenation on the first expression feature and the first identity feature to obtain the first joint feature;

[0014] The extraction module is further configured to extract joint information between the first expression feature and the first identity feature from the first joint feature;

[0015] The processing module is further configured to estimate the mutual information between the first expression feature and the first identity feature, adjust the first expression feature to a second expression feature with the goal of minimizing the mutual information, and predict the classification result of the facial expression based on the second expression feature and the joint information.

[0016] In a third aspect, embodiments of this application provide an electronic device, including:

[0017] A memory, configured to store executable instructions or computer programs;

[0018] A processor, configured to implement the image processing method provided in the embodiments of this application when executing the computer-executable instructions or computer programs stored in the memory.

[0019] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing a computer program or computer-executable instructions which, when executed by a processor, implement the image processing method provided in the embodiments of this application.

[0020] In the technical solution of this application embodiment, facial expression features and identity features are extracted from image data respectively to achieve independent feature extraction of facial expression and identity. Joint features are obtained by concatenating facial expression features and identity features, thereby effectively solving the shortcomings of independent modeling of facial expression and identity features in traditional methods by capturing the joint information between facial expression and identity. By estimating the mutual information between facial expression features and identity features, and aiming to minimize the mutual information, facial expression features are adjusted to contain less identity information, thereby achieving more accurate and robust facial expression recognition and identity verification. Furthermore, the classification results of facial expressions are predicted by the joint information and facial expression features containing less identity information, thereby synergistically improving facial expression classification. This not only enhances the ability to model personalized expressions, but also improves performance in multi-task scenarios.

Attached Figure Description

[0021] Figure 1 is a schematic diagram of the architecture of the image processing system 100 provided in an embodiment of this application;

[0022] Figure 2 is a schematic diagram of the structure of the image processing method provided in the embodiments of this application;

[0023] Figure 3 is a schematic flowchart of the image processing method provided in the embodiments of this application;

[0024] Figure 4 is a schematic flowchart of the image processing method provided in the embodiments of this application;

[0025] Figure 5 is a schematic diagram of facial expression feature classification provided in an embodiment of this application;

[0026] Figure 6 is a schematic flowchart of the image processing method provided in the embodiments of this application;

[0027] Figure 7 is a schematic flowchart of the image processing method provided in the embodiments of this application;

[0028] Figure 8 is a schematic diagram of the structural composition of the image processing apparatus provided in the embodiments of this application;

[0029] Figure 9 is a schematic structural diagram of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0030] To make the objectives, technical solutions, and advantages of this application clearer, the application will be further described in detail below with reference to the accompanying drawings. The described embodiments should not be regarded as limitations on this application. All other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0031] In the following description, references are made to “some embodiments,” which describe a subset of all possible embodiments. However, it is understood that “some embodiments” may be the same subset or different subsets of all possible embodiments and may be combined with each other without conflict.

[0032] In the embodiments of this application, the terms "module" or "unit" refer to a computer program or part of a computer program that has a predetermined function and works with other related parts to achieve a predetermined goal, and can be implemented wholly or partially using software, hardware (such as processing circuitry or memory), or a combination thereof. Similarly, a processor (or multiple processors or memory) can be used to implement one or more modules or units. Furthermore, each module or unit can be part of an overall module or unit that includes the functionality of that module or unit.

[0033] In the following description, the terms “first, second, ...” are used merely to distinguish similar objects and do not represent a specific ordering of objects. It is understood that “first, second, ...” may be interchanged in a specific order or sequence where permitted, so that the embodiments of this application described herein can be implemented in an order other than that illustrated or described herein.

[0034] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.

[0035] Before providing a further detailed description of the embodiments of this application, the nouns and terms involved in the embodiments of this application will be explained, and the nouns and terms involved in the embodiments of this application shall be interpreted as follows.

[0036] 1) Facial expression modeling refers to the process of constructing a computational model that can simulate, analyze, or recognize changes in human facial expressions. This model aims to capture dynamic features such as facial muscle movements and skin deformation, transforming human emotional states (such as joy, sadness, and anger) into quantifiable or visual digital representations.

[0037] 2) Expression features refer to dynamic information related to facial expressions in a facial image, namely short-term deformations caused by facial muscle movements (such as raising the corners of the mouth, furrowing the eyebrows, etc.). These features are directly related to an individual's emotional state (such as happiness, anger, sadness) and are independent of personal identity.

[0038] 3) Identity features refer to static information in a facial image that is related to an individual's identity, i.e., inherent facial attributes (such as bone structure, facial proportions, skin color, etc.) that are not affected by facial expressions. These features are used to distinguish different people.

[0039] 4) An image encoder (or visual encoder, feature extractor) is a component specifically designed for image processing. An image encoder receives an image as input and, through a series of neural network layers, such as ResNet or Vision Transformer (ViT), transforms the raw pixel data of the image into a set of dense vector representations. During this process, the image encoder learns an abstract representation of the image content and semantic information. Ultimately, this set of vectors is mapped to a fixed-dimensional space as image features to facilitate subsequent cross-modal interactions.

[0040] In facial expression modeling methods of related technologies, facial expression images are typically decomposed into two independent information components: expression features and identity features. Expression features represent dynamic information related to facial expressions in the facial image, i.e., short-term deformations caused by facial muscle movements, such as raising the corners of the mouth or furrowing the eyebrows. Expression features are directly related to an individual's emotional state, such as happiness, anger, or sadness, and are independent of individual identity. Identity features represent static information related to an individual's identity in the facial image, i.e., inherent facial attributes unaffected by facial expressions, such as bone structure, facial proportions, and skin color. Identity features can be used to distinguish different people.

[0041] In related technologies, traditional face modeling methods typically assume that expressions and identities are independent, thus ignoring the potential correlation between the two. However, in real-world scenarios, the manifestation of expression features is often influenced by identity features; for example, different individuals may exhibit different facial muscle changes when expressing the same emotion. Therefore, modeling expression features or identity features in isolation cannot fully capture the complex relationships in the real world, potentially leading to decreased accuracy in expression recognition or instability in identity recognition. Traditional methods thus suffer from two limitations. First, they model the features independently: when extracting expression and identity features, traditional methods often employ decoupling strategies, attempting to separate expression features from identity features as much as possible and pursuing their independence. This modeling approach ignores the synergistic relationship between expression and identity features, resulting in incomplete feature representations and difficulty in accurately modeling personalized facial expressions. Second, they lack joint information: the relationship between expression and identity features is dynamic and complex, and certain expressions may manifest differently in different individuals; these personalized differences need to be captured through joint information. Existing methods often lack joint representation mechanisms, making it difficult to effectively model the interaction between facial expressions and identities.

[0042] To address the aforementioned issues, embodiments of this application provide an image processing method, apparatus, electronic device, and computer-readable storage medium. By introducing collaborative feature modeling and joint information optimization mechanisms, the shortcomings of independent modeling of facial expressions and identity features in traditional methods are effectively resolved, thereby achieving more accurate and robust facial expression recognition and identity verification.

[0043] The following describes exemplary applications of the devices provided in the embodiments of this application. The electronic devices provided in the embodiments of this application can be implemented as various types of terminals such as laptops, tablets, desktop computers, set-top boxes, smartphones, smart speakers, smartwatches, smart TVs, and vehicle terminals, or they can be implemented as servers.

[0044] Referring to Figure 1, Figure 1 is a schematic diagram of the architecture of the image processing system provided in an embodiment of this application. As an example, the system in Figure 1 involves server 100, terminal device 200, and network 300. Terminal device 200 is connected to server 100 through network 300, which can be a wide area network (WAN), a local area network (LAN), or a combination of both.

[0045] In some embodiments, the present application embodiments can be implemented collaboratively by a server and a terminal device. For example, the terminal device 200 sends image data of facial expressions to the server 100, the server 100 predicts the classification result of the facial expressions using the image processing method provided in the present application embodiments, and sends the classification result of the facial expressions to the terminal device 200, and the terminal device 200 realizes facial expression recognition or identity verification based on the classification result of the facial expressions.

[0046] In other embodiments, the embodiments of this application can be implemented independently by a terminal device. Terminal device 200 sends image data of facial expressions to a server. Server 100 receives the image data of facial expressions and sends the facial expression classification model provided in the embodiments of this application to terminal device 200. Terminal device 200 receives the model sent by the server and downloads it locally. Using the model and image data, it obtains the classification result of the facial expressions. Terminal device 200 uses the classification result of the facial expressions to perform facial expression recognition or identity verification.

[0047] In some embodiments, the terminal device or server can implement the image processing method provided in this application embodiment by running various computer-executable instructions or computer programs. For example, computer-executable instructions can be microprogram-level commands, machine instructions, or software instructions. Computer programs can be native programs or software modules in an operating system. In summary, the aforementioned computer-executable instructions can be any form of instruction, and the aforementioned computer programs can be any form of application program, module, or plug-in. Terminal devices include, but are not limited to, mobile phones, computers, intelligent voice interaction devices, smart home appliances, vehicle terminals, aircraft, etc.

[0048] Referring to Figure 2, Figure 2 is a schematic diagram illustrating the principle of the image processing method provided in this application embodiment. First, image data of facial expressions is acquired, and a first expression feature and a first identity feature are extracted from the image data. Next, the first expression feature and the first identity feature are concatenated to obtain a first joint feature, and joint information between the first expression feature and the first identity feature is extracted from the first joint feature. Then, the mutual information between the first expression feature and the first identity feature is estimated, and the first expression feature is adjusted to a second expression feature with the goal of minimizing the mutual information. Finally, the classification result of the facial expression is predicted based on the second expression feature and the joint information. Referring to Figure 3, Figure 3 is a schematic flowchart of the image processing method provided in the embodiments of this application, which will be described in detail below.

[0049] In step 301, image data of facial expressions is acquired.

[0050] In some implementations, facial expression image data can be obtained from an expression dataset. For example, the source of facial expression data can be a publicly available dataset containing basic and complex expressions, covering a wide range of ages, ethnicities, and occluded scenes. Alternatively, the source of facial expression data can be a self-built dataset, obtained by recording photos under controlled lighting conditions using a data acquisition device.

[0051] In some implementations, low-quality samples in facial expression datasets can be removed through blur detection and extreme pose filtering. For example, the Laplacian operator can be used to calculate image sharpness scores, filtering samples with scores below a threshold; images with pitch angles greater than 0 degrees or yaw angles greater than 45 degrees can be removed based on head pose estimation. Class balancing of facial expression data can also be achieved through oversampling and undersampling. For instance, oversampling can be used to generate synthetic images from minority class samples, followed by undersampling to randomly remove majority class samples, making the sample sizes of each class more similar.
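The blur and pose filtering described above can be sketched as follows. This is a minimal illustration, not the implementation of this application: it assumes NumPy, a 4-neighbour discrete Laplacian, and pose angles already produced by some head pose estimator. The function names, the sharpness threshold, and the pitch limit are hypothetical and must be chosen by the caller; only the 45-degree yaw limit is taken from the text, and filtering on the absolute value of the angles is an assumption.

```python
import numpy as np


def laplacian_sharpness(gray: np.ndarray) -> float:
    """Variance of the 4-neighbour Laplacian response; higher means sharper."""
    g = gray.astype(np.float64)
    # Discrete Laplacian on interior pixels: up + down + left + right - 4 * centre.
    lap = (g[:-2, 1:-1] + g[2:, 1:-1] + g[1:-1, :-2] + g[1:-1, 2:]
           - 4.0 * g[1:-1, 1:-1])
    return float(lap.var())


def keep_sample(gray, pitch_deg, yaw_deg, sharp_thresh, max_pitch_deg,
                max_yaw_deg=45.0):
    """Return True if the sample passes both the blur and the head-pose check.

    sharp_thresh and max_pitch_deg are hypothetical, caller-supplied values.
    """
    if laplacian_sharpness(gray) < sharp_thresh:
        return False  # too blurry
    # Assumption: extreme poses are rejected symmetrically in both directions.
    return abs(pitch_deg) <= max_pitch_deg and abs(yaw_deg) <= max_yaw_deg
```

A perfectly flat image has a sharpness score of zero and is always rejected for any positive threshold, which is a convenient sanity check when tuning the threshold on a real dataset.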

[0052] In other implementations, after acquiring facial expression image data, the image data can be preprocessed. For example, face alignment can be performed on the facial expression image data to standardize the geometric structure of the face and ensure pose standardization. Alternatively, the image data can be normalized in color space and/or size, for example by converting the image data to grayscale to reduce computational load. Z-score normalization can also be performed to eliminate lighting differences, or histogram equalization can be performed to enhance details in low-contrast images. The resolution of the image data can be standardized by scaling the image to the model input size, using bilinear interpolation to maintain smoothness.
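The grayscale conversion, Z-score normalization, and histogram equalization mentioned above can be illustrated with a short NumPy sketch. The function names are hypothetical, the grayscale weights are the common ITU-R BT.601 luma coefficients (the text does not specify any), and the equalization follows the textbook cumulative-histogram formula rather than any method confirmed by this application.

```python
import numpy as np


def to_grayscale(rgb: np.ndarray) -> np.ndarray:
    """Convert an H x W x 3 uint8 RGB image to uint8 grayscale (BT.601 weights)."""
    weights = np.array([0.299, 0.587, 0.114])
    gray = rgb.astype(np.float64) @ weights
    return np.clip(np.rint(gray), 0, 255).astype(np.uint8)


def zscore(img: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Shift the image to zero mean and scale it to unit standard deviation."""
    x = img.astype(np.float64)
    return (x - x.mean()) / (x.std() + eps)


def equalize_histogram(gray: np.ndarray) -> np.ndarray:
    """Histogram equalization for a uint8 grayscale image."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]          # CDF value at the darkest occupied level
    denom = cdf[-1] - cdf_min
    if denom == 0:                     # constant image: nothing to equalize
        return gray.copy()
    lut = np.rint((cdf - cdf_min) / denom * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]                   # apply the lookup table per pixel
```

Z-score normalization returns floating-point values, so it would normally be the last step before the image is fed to an encoder, whereas equalization operates on 8-bit intensities.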

[0053] In other implementations, after acquiring facial expression image data, data augmentation processing can be performed on the image data. For example, geometric transformations can be applied to the image data, using small rotations to avoid excessive distortion of the facial structure, or random cropping can be used to preserve the core facial areas. Another example is photometric transformation processing, such as adjusting brightness, contrast, or color jitter. By performing data augmentation processing on the image data, the model can converge faster and its robustness can be increased.
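As a rough sketch of the augmentation step, the following NumPy code performs a random crop and a brightness/contrast jitter. The function names and the jitter ranges are hypothetical. A plain random crop does not by itself guarantee that the core facial area is preserved; the assumption here is that the crop is only slightly smaller than an already aligned face image.

```python
import numpy as np


def random_crop(img: np.ndarray, crop_h: int, crop_w: int,
                rng: np.random.Generator) -> np.ndarray:
    """Cut a random crop_h x crop_w window out of an H x W (x C) image."""
    h, w = img.shape[:2]
    if crop_h > h or crop_w > w:
        raise ValueError("crop size must not exceed the image size")
    top = int(rng.integers(0, h - crop_h + 1))
    left = int(rng.integers(0, w - crop_w + 1))
    return img[top:top + crop_h, left:left + crop_w]


def jitter_brightness_contrast(img: np.ndarray, rng: np.random.Generator,
                               max_brightness: float = 0.2,
                               max_contrast: float = 0.2) -> np.ndarray:
    """Randomly perturb brightness and contrast of a uint8 image."""
    x = img.astype(np.float64) / 255.0
    contrast = 1.0 + rng.uniform(-max_contrast, max_contrast)
    brightness = rng.uniform(-max_brightness, max_brightness)
    mean = x.mean()
    # Scale deviations from the mean (contrast), then shift (brightness).
    x = (x - mean) * contrast + mean + brightness
    return np.clip(np.rint(x * 255.0), 0, 255).astype(np.uint8)
```

Passing an explicit `np.random.Generator` keeps the augmentation reproducible, which helps when comparing training runs.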

[0054] In step 302, the first facial expression feature and the first identity feature are extracted from the image data.

[0055] In this embodiment, the first facial expression feature is used to describe the attribute information of a person in the image data, specifically the person's posture and facial expression information. It represents the dynamic information related to facial expressions in the face image, i.e., short-term deformations caused by facial muscle movements, such as raising the corners of the mouth or furrowing the eyebrows. The facial expression feature is directly related to an individual's emotional state and is independent of their identity. The first identity feature is used to describe the identity information of a person in the image data, representing the static information related to an individual's identity in the face image, i.e., inherent facial attributes unaffected by facial expressions, such as bone structure, facial proportions, and skin color.

[0056] In some implementations, extracting a first facial expression feature and a first identity feature from image data includes: extracting the first facial expression feature from image data using an expression encoder; and extracting the first identity feature from image data using an identity encoder; wherein the identity encoder is pre-trained.

[0057] Here, facial expression feature extraction is performed on the preprocessed image data. For example, an expression encoder can be used to identify facial expression features in the image data, capturing fine-grained information related to facial expressions. The expression encoder can use deep neural networks (DNNs), convolutional neural networks (CNNs), or other image processing techniques to identify facial expression features in the image data and encode this information into high-dimensional vectors to obtain the first facial expression feature. Identity feature extraction is likewise performed on the preprocessed image data. For example, an identity encoder can be used to identify facial features in the image data and extract identity features related to individual identity, such as facial geometry and muscle distribution. The identity encoder can use DNNs, CNNs, or other image processing techniques to identify identity features in the image data and encode this information into high-dimensional vectors to obtain the first identity feature. The identity encoder needs to be trained on a face dataset.

[0058] It should be noted that the feature extraction methods for the expression encoder and the identity encoder need to have different parameters but similar structures. For example, both the expression encoder and the identity encoder can use a Transformer structure for feature extraction, or both can use convolutional neural network structures such as ResNet or VGG for feature extraction, and the identity encoder can be pre-trained on a large-scale face dataset. The application does not specify any particular feature extraction method to be used.

[0059] In some implementations, the image data can be encoded using a convolutional neural network (CNN), such as a residual neural network (ResNet), to obtain the first facial expression feature. This application does not limit the specific implementation of the facial expression encoding process for image data. A convolutional neural network typically has fully connected layers or global average pooling layers at its end, which generate a final fixed-size feature vector. This first facial expression feature contains the facial expression information of the image data.

[0060] For example, ResNet convolutional architecture is used to extract facial expression features. Local features are extracted layer by layer through a sliding window of the convolutional kernel. The first convolutional layer downsamples the image to 112×112, followed by four sets of residual blocks: channels 64-256, 128-512, 256-1024, and 512-2048. Each channel set contains a Bottleneck structure. Finally, global average pooling is used to obtain the final feature map, which is then compressed into a 2048-dimensional vector. A fully connected layer then outputs 512-dimensional facial expression features.
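The layer dimensions in the example above can be traced with a few lines of plain Python. This sketch assumes a 224×224 input and the standard ResNet-50 layout, including the stride-2 max pooling after the first convolution, which the text does not mention; the function name is hypothetical. It only tracks channel counts and spatial sizes and does not perform any convolution.

```python
def resnet50_shape_trace(input_size: int = 224, feature_dim: int = 512):
    """List (layer name, output channels, spatial size) through the encoder."""
    trace = []
    size = input_size // 2            # stride-2 first convolution: 224 -> 112
    trace.append(("conv1", 64, size))
    size //= 2                        # stride-2 max pooling (standard ResNet-50)
    trace.append(("maxpool", 64, size))
    # (bottleneck width, output channels, stride of the first block in the group)
    stages = [(64, 256, 1), (128, 512, 2), (256, 1024, 2), (512, 2048, 2)]
    for idx, (mid, out, stride) in enumerate(stages, start=1):
        size //= stride
        trace.append((f"stage{idx} (bottleneck {mid}->{out})", out, size))
    trace.append(("global_avg_pool", 2048, 1))   # 2048-dimensional vector
    trace.append(("fc", feature_dim, 1))         # 512-dimensional feature
    return trace
```

Printing the trace makes it easy to check that the last residual group yields a 2048-channel feature map before pooling, which is the vector the fully connected layer reduces to 512 dimensions.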

[0061] For example, by performing facial expression feature extraction on image data using an expression encoder, the first facial expression feature is obtained, which can be expressed by formula (1):

[0062] $F_{e} = \mathrm{CNN}(I;\ \theta_{e})$ (1)

[0063] where $F_{e}$ denotes the first facial expression feature, CNN denotes the convolutional neural network, $I$ denotes the image data, and $\theta_{e}$ denotes the parameters of the convolutional neural network in the expression encoder.

[0064] It should be noted that the identity feature encoder can also extract identity features using the same ResNet convolutional architecture as the expression encoder, outputting 512-dimensional identity features to obtain the first identity feature, which can be represented by formula (2). The difference is that the parameters of the identity encoder are independent of those of the expression encoder, with no shared layers or weights.

[0065] $F_{id} = \mathrm{CNN}(I;\ \theta_{id})$ (2)

[0066] where $F_{id}$ denotes the first identity feature, CNN denotes the convolutional neural network, $I$ denotes the image data, and $\theta_{id}$ denotes the parameters of the convolutional neural network in the identity encoder.

[0067] For example, the image data can be encoded using a network model such as a DNN, for example an ArcFace (Additive Angular Margin Loss for Deep Face Recognition) network from the field of deep face recognition, to obtain the first facial expression feature and the first identity feature.

[0068] In other implementations, the image data can be segmented to obtain multiple pixel blocks; the multiple pixel blocks can be sorted into a sequence of pixel blocks to be encoded; the sequence of pixel blocks to be encoded can be encoded to obtain a first identity feature and a first expression feature. The expression encoder and the identity encoder use the same segmentation and embedding process, but the identity encoder is trained using an identity dataset.

[0069] For example, assuming the image data size is 224×224, it is divided according to a set pixel block size (e.g., a 16×16 pixel block size), that is, according to the set length and width of the pixel blocks. The resulting pixel blocks are then arranged in order (for example, left to right and top to bottom) to form a sequence of pixel blocks to be encoded, which is used as the input unit for the identity encoding processing.
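The division into pixel blocks can be sketched with NumPy as follows. The function name is hypothetical and a three-channel image is assumed. With a 224×224 image and 16×16 blocks, the sketch yields 14 × 14 = 196 blocks, each flattened to 16 × 16 × 3 = 768 values, in left-to-right, top-to-bottom order.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an H x W x C image into a sequence of flattened pixel blocks."""
    h, w, c = image.shape
    if h % patch or w % patch:
        raise ValueError("image size must be a multiple of the patch size")
    rows, cols = h // patch, w // patch
    # (rows, patch, cols, patch, c) -> (rows, cols, patch, patch, c)
    blocks = image.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    # Row-major flattening orders blocks left to right, then top to bottom.
    return blocks.reshape(rows * cols, patch * patch * c)
```

The same sequence can serve as input to both the expression encoder and the identity encoder, since the text states that both use the same segmentation process.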

[0070] For example, an embedding layer can be used to perform embedding encoding on the sequence of pixel blocks to be encoded to obtain embedded features. For instance, in the embedding layer, a convolution operation is performed on the sequence of pixel blocks to be encoded, and position embedding is then performed on the convolutionally processed sequence of pixel blocks to be encoded to obtain embedded features. Before the embedding encoding process, each pixel block to be encoded in the sequence of pixel blocks to be encoded can be flattened into a one-dimensional vector. For example, for a 16×16×3 pixel block to be encoded, the length of the flattened vector is 768. The position encoding process can be generated by a fixed algorithm, such as a combination of sine and cosine functions. For each position (e.g., the pixel position of each pixel in each pixel block to be encoded) pos and the dimension i of the feature, the position encoding of even-dimensional positions uses a sine function, while the position encoding of odd-dimensional positions uses a cosine function. The embodiments of this application do not limit the specific implementation of the position encoding.
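A fixed sine/cosine position encoding of the kind described above might look like the sketch below. It assumes the formulation popularized by the original Transformer, with a frequency base of 10000, and it indexes positions by pixel block rather than by individual pixel; both choices, and the function name, are assumptions, since the text leaves the specific implementation open.

```python
import numpy as np


def sinusoidal_position_encoding(num_positions: int, dim: int) -> np.ndarray:
    """Return a (num_positions, dim) table of fixed position encodings."""
    if dim % 2:
        raise ValueError("dim must be even")
    pos = np.arange(num_positions)[:, None]        # shape (P, 1)
    i = np.arange(0, dim, 2)[None, :]              # even dimension indices
    angle = pos / np.power(10000.0, i / dim)       # shape (P, dim / 2)
    pe = np.zeros((num_positions, dim))
    pe[:, 0::2] = np.sin(angle)                    # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                    # odd dimensions: cosine
    return pe
```

The resulting table is added element-wise to the embedded sequence, so its second dimension has to match the embedding width.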

[0071] For example, before attention encoding, the embedded features can be normalized (e.g., by layer normalization) to obtain normalized features. In attention encoding, a linear transformation is first applied to the normalized features to generate three matrices: a query matrix (Q), a key matrix (K), and a value matrix (V). For each attention head, the dot product of Q and K is calculated to obtain the attention score. Next, the attention score is normalized, for example by applying a normalization function (such as the Softmax function), to transform the attention score into a probability distribution. The attention probability distribution is then used to weight V to generate a new feature representation. Finally, the outputs of all attention heads are concatenated and subjected to a linear transformation, projecting the result into a 512-dimensional embedding. This yields a rich representation that includes the correlations between different positions in the normalized features, i.e., the attention-encoded features.
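The attention encoding steps can be written out in NumPy as a minimal sketch. The weight matrices are plain arrays supplied by the caller, and all names are hypothetical. Dividing the scores by the square root of the head dimension is standard practice but an assumption here, since the text only mentions the dot product; layer normalization is shown without learnable scale and shift for brevity.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """x: (n, d_in); w_q/w_k/w_v: (d_in, d_model); w_o: (d_model, d_out)."""
    n = x.shape[0]
    h = layer_norm(x)
    q, k, v = h @ w_q, h @ w_k, h @ w_v            # each (n, d_model)
    d_head = q.shape[1] // num_heads

    def split(t):                                  # -> (heads, n, d_head)
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(q), split(k), split(v)
    # Dot-product scores; the 1/sqrt(d_head) scaling is an assumption.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    probs = softmax(scores, axis=-1)               # rows sum to 1
    heads = probs @ v                              # (heads, n, d_head)
    concat = heads.transpose(1, 0, 2).reshape(n, num_heads * d_head)
    return concat @ w_o, probs                     # final linear projection
```

To match the dimensions in the text, `w_o` would have 512 columns so that each position is projected into a 512-dimensional embedding.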

[0072] For example, attention-encoded features can be mapped using a feed-forward neural network (FFN) layer to obtain 512-dimensional first identity features and 512-dimensional first facial expression features. The feed-forward neural network layer can, for example, employ a multilayer perceptron (MLP) structure.

[0073] It should be noted that the expression encoder and the identity encoder can use the same Transformer structure for feature extraction. The difference is that the expression encoder and the identity encoder have independent parameters, and the identity encoder is trained using the identity dataset.

[0074] In step 303, the first expression feature and the first identity feature are concatenated to obtain the first joint feature, and the joint information between the first expression feature and the first identity feature is extracted from the first joint feature.

[0075] Here, referring to Figure 5, different individuals exhibit varying facial muscle changes when expressing the same emotion. The relationship between facial expressions and identity features is dynamic and complex; for example, certain expressions may be expressed differently in different individuals. This individualized difference needs to be captured through joint information. By concatenating the first facial expression feature and the first identity feature, a first joint feature is obtained. From this first joint feature, the joint information between the first facial expression feature and the first identity feature is extracted.

[0076] In some embodiments, the first expression feature and the first identity feature can be spliced together using a feature splicing method to obtain spliced features. For example, splicing together 512-dimensional expression features and 512-dimensional identity features yields 1024-dimensional spliced features.

[0077] In some embodiments, the 1024-dimensional concatenated features can be fused using feature mapping to obtain the first joint features. For example, a multilayer perceptron (MLP) can be used to map the 1024-dimensional concatenated features to a fixed-size feature space, and then the mapped concatenated features can be fused to obtain the first joint features. An MLP is a feedforward neural network consisting of multiple fully connected layers, each using an activation function. An MLP can automatically learn the non-linear relationships between features and transform them into a form suitable for subsequent tasks.

[0078] For example, concatenating the first facial expression feature and the first identity feature to obtain the first joint feature can be represented by formula (3):

[0079] $$F_{joint} = \mathrm{Fuse}\left([F_e;\ F_{id}];\ \theta_f\right) \tag{3}$$

[0080] Here, $F_{joint}$ represents the first joint feature, $\mathrm{Fuse}(\cdot)$ represents the fusion process, $F_e$ represents the first facial expression feature, $F_{id}$ represents the first identity feature, $[\cdot\,;\cdot]$ denotes feature concatenation, and $\theta_f$ represents the parameters of the fusion process (e.g., the parameters of the MLP network when fusion processing is performed via an MLP).
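As a rough sketch of the splicing and fusion step, the following concatenates two 512-dimensional features into a 1024-dimensional vector and maps it with a two-layer MLP. The hidden width, the 512-dimensional output, the ReLU activation, and the random parameters in `params` are assumptions made for illustration; the text only states that an MLP maps the spliced features to a fixed-size feature space.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def fuse(expr_feat, id_feat, params):
    """Concatenate the two features, then map them with a two-layer MLP."""
    spliced = np.concatenate([expr_feat, id_feat], axis=-1)  # 512 + 512 -> 1024
    hidden = relu(spliced @ params["w1"] + params["b1"])
    return spliced, hidden @ params["w2"] + params["b2"]

rng = np.random.default_rng(0)
params = {  # hypothetical, untrained fusion parameters
    "w1": rng.normal(scale=0.02, size=(1024, 512)), "b1": np.zeros(512),
    "w2": rng.normal(scale=0.02, size=(512, 512)), "b2": np.zeros(512),
}
expr_feat = rng.normal(size=(4, 512))  # batch of 4 first expression features
id_feat = rng.normal(size=(4, 512))    # batch of 4 first identity features
spliced, joint = fuse(expr_feat, id_feat, params)
```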

[0081] In step 304, the mutual information between the first expression feature and the first identity feature is estimated, and the first expression feature is adjusted to the second expression feature with the goal of minimizing the mutual information.

[0082] In some implementations, the extracted first facial expression features and first identity features are optimized by minimizing the mutual information between facial expression and identity. Minimizing the mutual information between the first facial expression features and the first identity features reduces the identity information contained in the facial expression features and the facial expression information contained in the identity features, thereby making the two kinds of features as independent of each other as possible.

[0083] For example, the dependency between two random variables, the first facial expression feature and the first identity feature, can be measured by mutual information (MI), which can be defined by formula (4):

[0084] $$I(X;Y) = H(X) - H(X \mid Y) \tag{4}$$

[0085] Here, $X$ represents the first facial expression feature; $Y$ represents the first identity feature; $H(X)$ is the information entropy of variable $X$; and $H(X \mid Y)$ is the conditional entropy, which represents the remaining uncertainty of $X$ after $Y$ is known. The larger the value of $I(X;Y)$, the more redundant information there is between $X$ and $Y$. By minimizing $I(X;Y)$, the goal is to reduce the identity information contained in facial expression features, as well as the facial expression information contained in identity features.
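To make the definition concrete, the following worked example evaluates mutual information as entropy minus conditional entropy for two small discrete joint probability tables. The tables are toy data chosen for illustration and are unrelated to real expression or identity features; entropies are measured in bits.

```python
import numpy as np

def entropy(p):
    p = p[p > 0]  # 0 * log(0) is treated as 0
    return float(-(p * np.log2(p)).sum())

def mutual_information(joint):
    """I(X;Y) = H(X) - H(X|Y) for a joint probability table p(x, y)."""
    p_x, p_y = joint.sum(axis=1), joint.sum(axis=0)
    h_x_given_y = entropy(joint.ravel()) - entropy(p_y)  # H(X|Y) = H(X,Y) - H(Y)
    return entropy(p_x) - h_x_given_y

independent = np.outer([0.5, 0.5], [0.5, 0.5])   # knowing Y says nothing about X
dependent = np.array([[0.5, 0.0], [0.0, 0.5]])   # Y determines X exactly
mi_independent = mutual_information(independent)  # 0 bits
mi_dependent = mutual_information(dependent)      # 1 bit
```

The independent table gives 0 bits and the fully dependent table gives 1 bit, which is the sense in which a smaller value means less shared information.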

[0086] In some implementations, referring to Figure 4, estimating the mutual information between the first facial expression feature and the first identity feature in step 304 shown in Figure 3 can be achieved through the following steps 401 to 402, as explained in detail below.

[0087] Step 401: Calculate the joint distribution and marginal distribution between the first facial expression feature and the first identity feature using the mutual information estimator, and obtain the expected value of the joint distribution and the logarithmic expected value of the marginal distribution.

[0088] Step 402: Calculate the mutual information of the first facial expression feature and the first identity feature based on the joint distribution expectation and the marginal distribution log expectation.

[0089] For example, mutual information can be estimated with the neural-network-based Mutual Information Neural Estimation (MINE) method, which maximizes a variational lower bound of the mutual information. Specifically, this can be expressed by formula (5):

[0090] $$I(X;Y) \ge \sup_{\theta}\ \mathbb{E}_{p(x,y)}\left[T_\theta(x,y)\right] - \log \mathbb{E}_{p(x)p(y)}\left[e^{T_\theta(x,y)}\right] \tag{5}$$

[0091] Here, the mutual information estimator can be expressed as $T_\theta$; it can include a neural network (called a statistics network) that distinguishes the joint distribution $p(x,y)$ from the product of the marginal distributions $p(x)p(y)$. $\mathbb{E}_{p(x,y)}[T_\theta(x,y)]$ denotes the joint distribution expectation, and $\log \mathbb{E}_{p(x)p(y)}[e^{T_\theta(x,y)}]$ denotes the log-expected value of the marginal distribution. The mutual information between the first facial expression feature and the first identity feature is calculated from the joint distribution expectation and the log-expected value of the marginal distribution, and can be obtained through formula (6):

[0092] $$\hat{I}_\theta(X;Y) = \mathbb{E}_{p(x,y)}\left[T_\theta(x,y)\right] - \log \mathbb{E}_{p(x)p(y)}\left[e^{T_\theta(x,y)}\right] \tag{6}$$

[0093] The mutual information loss function can be obtained by maximizing the lower bound of mutual information, which is equivalent to minimizing the negative estimate.

[0094] In some implementations, referring to Figure 6, before step 401 shown in Figure 4, steps 601 to 603 can be performed to train the mutual information estimator, which will be explained in detail below.

[0095] Step 601: Obtain joint distribution sample pairs, which include first facial expression feature samples and first identity feature samples.

[0096] Step 602: Randomly permute (shuffle) the first identity feature samples to obtain the second identity feature samples. The second identity feature sample includes identity feature information that is unrelated to the first expression feature sample.

[0097] Step 603: Generate a marginal distribution sample pair based on the first expression feature sample and the second identity feature sample. The marginal distribution sample pair includes the first expression feature sample and the second identity feature sample.

[0098] For example, the input data includes the facial expression features extracted by the facial expression encoder, which can be represented as $\{x_i\}_{i=1}^{N}$, and the identity features extracted by the identity encoder, which can be represented as $\{y_i\}_{i=1}^{N}$. Based on the input data, sample pairs are constructed to obtain joint distribution sample pairs. These joint distribution sample pairs can also be called true sample pairs or positive sample pairs, and can be represented as $(x_i, y_i)$, where $x_i$ represents the first facial expression feature sample and $y_i$ represents the first identity feature sample.

[0099] For example, the first identity feature samples are randomly permuted to obtain the second identity feature samples $\tilde{y}_i$. The second identity feature sample includes identity feature information that is unrelated to the first facial expression feature sample, i.e., $\tilde{y}_i = y_{\pi(i)}$ for a random permutation $\pi$. The marginal distribution sample pair generated from the first facial expression feature sample and the second identity feature sample can then be represented as $(x_i, \tilde{y}_i)$.
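Steps 601 to 603 can be sketched as follows, assuming the expression and identity features arrive as row-aligned batches (row i of both arrays comes from the same image). The function name `build_sample_pairs`, the batch size of 8, and the random features are illustrative assumptions.

```python
import numpy as np

def build_sample_pairs(expr_feats, id_feats, rng):
    """Return (joint pairs, marginal pairs) from row-aligned feature batches."""
    joint_pairs = (expr_feats, id_feats)            # step 601: matched rows
    perm = rng.permutation(len(id_feats))           # step 602: shuffle identity rows
    marginal_pairs = (expr_feats, id_feats[perm])   # step 603: re-pair with expressions
    return joint_pairs, marginal_pairs

rng = np.random.default_rng(0)
expr_feats = rng.normal(size=(8, 512))
id_feats = rng.normal(size=(8, 512))
joint_pairs, marginal_pairs = build_sample_pairs(expr_feats, id_feats, rng)
```

A random permutation can leave a few rows in their original position, so for small batches some "marginal" pairs may coincide with joint pairs; larger batches make this effect negligible.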

[0100] In some implementations, referring to Figure 7, step 401 shown in Figure 4 can also be implemented through the following steps 701 to 704, which will be explained in detail below.

[0101] Step 701: Calculate the joint distribution of the joint distribution sample pairs using the mutual information estimator to obtain the joint distribution score.

[0102] For example, the mutual information estimator can be represented as $T_\theta$. The joint distribution sample pairs are scored by the mutual information estimator, and the joint distribution score can be expressed by formula (7), where $(x_i, y_i)$ is the $i$-th joint distribution sample pair:

[0103] $$s_i^{joint} = T_\theta(x_i, y_i) \tag{7}$$

[0104] Step 702: Calculate the marginal distribution of the marginal distribution sample pairs using the mutual information estimator to obtain the marginal distribution score value.

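[0105] For example, the marginal distribution sample pairs are scored by the mutual information estimator $T_\theta$, and the marginal distribution score can be expressed by formula (8), where $(x_i, \tilde{y}_i)$ is the $i$-th marginal distribution sample pair:

[0106] $$s_i^{marg} = T_\theta(x_i, \tilde{y}_i) \tag{8}$$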

[0107] Step 703: Perform expectation estimation on the joint distribution score to obtain the joint distribution expectation.

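[0108] For example, the joint distribution expectation can be expressed as $\mathbb{E}_{joint}$. It is calculated from the scores that the mutual information estimator $T_\theta$ assigns to the $N$ joint distribution sample pairs $(x_i, y_i)$, and can be expressed by formula (9):

[0109] $$\mathbb{E}_{joint} = \frac{1}{N}\sum_{i=1}^{N} T_\theta(x_i, y_i) \tag{9}$$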

[0110] Step 704: Perform expectation estimation on the marginal distribution score to obtain the expected logarithm of the marginal distribution.

[0111] For example, the log-expected value of the marginal distribution can be expressed as $\log \mathbb{E}_{marg}$. It is calculated from the scores that the mutual information estimator $T_\theta$ assigns to the $N$ marginal distribution sample pairs $(x_i, \tilde{y}_i)$, and can be expressed by formula (10):

[0112] $$\log \mathbb{E}_{marg} = \log\left(\frac{1}{N}\sum_{i=1}^{N} e^{T_\theta(x_i, \tilde{y}_i)}\right) \tag{10}$$
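Steps 701 to 704, followed by step 402, can be sketched as below. The one-hidden-layer `statistics_network` is a hypothetical stand-in for the mutual information estimator, its parameters are random and untrained, and the toy features are synthetic, so the resulting number only illustrates the computation and is not a meaningful mutual information value.

```python
import numpy as np

def statistics_network(x, y, params):
    """Hypothetical one-hidden-layer scoring network: one scalar per (x, y) pair."""
    h = np.tanh(np.concatenate([x, y], axis=-1) @ params["w1"] + params["b1"])
    return (h @ params["w2"]).squeeze(-1)

def mine_estimate(x, y, y_shuffled, params):
    joint_scores = statistics_network(x, y, params)              # step 701
    marginal_scores = statistics_network(x, y_shuffled, params)  # step 702
    joint_expectation = joint_scores.mean()                      # step 703
    # Step 704: log of the mean of exp(scores), computed stably.
    m = marginal_scores.max()
    log_marginal_expectation = m + np.log(np.exp(marginal_scores - m).mean())
    return joint_expectation - log_marginal_expectation          # step 402

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))
y = x + 0.1 * rng.normal(size=(64, 8))  # toy features that depend on x
params = {"w1": rng.normal(scale=0.1, size=(16, 32)), "b1": np.zeros(32),
          "w2": rng.normal(scale=0.1, size=(32, 1))}
estimate = mine_estimate(x, y, y[rng.permutation(64)], params)
```

In training, the estimator parameters would be adjusted to raise this value (tightening the lower bound), while the expression encoder would be adjusted to lower it.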

[0113] In some implementations, step 304 shown in Figure 3 includes: minimizing the joint distribution expectation and the marginal distribution log expectation based on the first expression feature and the first identity feature using a mutual information estimator to obtain the second expression feature; wherein the identity feature information contained in the second expression feature is less than the identity feature information contained in the first expression feature.

[0114] For example, a second expression feature is generated by minimizing the mutual information between the expression feature and the identity feature using a mutual information estimator. The second expression feature satisfies the characteristics of reduced identity information and preserved expression information. Specifically, the second expression feature contains less identity feature information than the first expression feature, while maintaining the expression discrimination ability of the first expression feature.

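[0115] For example, the first facial expression feature can be represented as $F_e$, the first identity feature as $F_{id}$, and the second facial expression feature as $F_e'$. Using a mutual information estimator $T_\theta$ based on a MINE network, the joint distribution expectation and the logarithmic expected value of the marginal distribution can be expressed by formula (11), where $F_e'$ is the expression feature obtained after this quantity has been minimized:

[0116] $$\hat{I}_\theta(F_e; F_{id}) = \mathbb{E}_{p(F_e, F_{id})}\left[T_\theta(F_e, F_{id})\right] - \log \mathbb{E}_{p(F_e)p(F_{id})}\left[e^{T_\theta(F_e, F_{id})}\right] \tag{11}$$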

[0117] In step 305, the classification result of the facial expression is predicted based on the second expression features and joint information.

[0118] In some implementations, the second facial expression feature is the expression feature obtained by minimizing the mutual information, and the joint information is obtained from the first joint feature, which is generated by MLP fusion after splicing the first facial expression feature and the first identity feature. The second facial expression feature and the joint information are input into the facial expression classifier, which outputs the facial expression classification result.

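[0119] In some implementations, the first loss function is determined based on a variational lower bound of mutual information. For example, maximizing the lower bound of mutual information is equivalent to minimizing the negative estimate, and the first loss function can be expressed by formula (12), where $T_\theta$ is the mutual information estimator, $p(x,y)$ is the joint distribution, and $p(x)p(y)$ is the product of the marginal distributions:

[0120] $$L_{MI} = -\left(\mathbb{E}_{p(x,y)}\left[T_\theta(x,y)\right] - \log \mathbb{E}_{p(x)p(y)}\left[e^{T_\theta(x,y)}\right]\right) \tag{12}$$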

[0121] In some implementations, a second loss function is determined based on the classification results and labels of facial expressions. For example, the second loss function is obtained by directly optimizing the expression classification task using cross-entropy loss, and can be expressed by formula (13):

[0122] $$L_{cls} = -\sum_{i=1}^{N} y_i \log \hat{y}_i \tag{13}$$

[0123] Here, $y_i$ represents the true label of the $i$-th sample, and $\hat{y}_i$ represents the category probability output by the facial expression classification model.

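[0124] In some implementations, the target loss function is obtained by summing or weighted summing the first and second loss functions. For example, by combining the classification loss $L_{cls}$ and the mutual information minimization loss $L_{MI}$, the target loss function can be expressed by formula (14), where $\lambda$ is the weight of the mutual information term ($\lambda = 1$ corresponds to a direct sum):

[0125] $$L = L_{cls} + \lambda L_{MI} \tag{14}$$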
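A small numeric sketch of the two losses and their weighted sum follows. The cross-entropy here is averaged over the batch and computed from raw logits with integer labels, which is one common convention and an assumption on our part; the seven classes, the weight of 0.5, and the placeholder mutual information loss of 0.25 are illustrative values only.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Mean cross-entropy between softmax(logits) and integer class labels."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def target_loss(classification_loss, mi_loss, weight=1.0):
    """Weighted sum of the classification loss and the mutual information loss."""
    return classification_loss + weight * mi_loss

logits = np.zeros((2, 7))                 # two samples, seven expression classes
labels = np.array([3, 5])
loss_cls = cross_entropy(logits, labels)  # uniform prediction gives ln(7)
total = target_loss(loss_cls, mi_loss=0.25, weight=0.5)
```

With `weight=1.0` the function reduces to the plain sum mentioned in the text.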

[0126] In some implementations, the facial expression encoder and mutual information estimator are updated based on the objective loss function.

[0127] Specifically, the facial expression encoder and the mutual information estimator can be optimized using the backpropagation algorithm based on the loss calculated by formula (14) above. This improves the performance of facial expression recognition and identity feature extraction.

[0128] For example, before backpropagation, forward propagation is performed first to obtain the prediction results and loss values of the expression classification model. Preprocessed image data is input into the model, and local features (such as edges and textures) are extracted through convolutional layers. Non-linearity is introduced through activation functions (such as ReLU). Next, the feature map dimensionality is reduced through pooling layers (such as max pooling), and then the features are mapped to the classification space through fully connected layers, such as outputting 7-dimensional expression probabilities. Finally, the loss is calculated by comparing the predicted values with the true labels.

[0129] For example, the gradient of the target loss function with respect to the model parameters is calculated through backpropagation to guide parameter updates. Specifically, the gradient is first calculated layer by layer, starting from the output layer, based on the chain rule. The output layer gradient includes the derivative of the loss with respect to the predicted probability, and the hidden layer gradients include the gradients propagated layer by layer (such as through fully connected layers, pooling layers, and convolutional layers). In practice, these gradients can be calculated automatically using automatic differentiation.

[0130] For example, the model parameters are adjusted based on the gradient to minimize the loss function. Specifically, the parameters can be updated using an optimization algorithm such as gradient descent, or by other methods. The gradient descent update can be expressed by formula (15):

[0131] $$\theta \leftarrow \theta - \eta \nabla_\theta L \tag{15}$$

[0132] Here, $\theta$ represents the model parameters, $\eta$ represents the learning rate, and $\nabla_\theta L$ is the gradient of the target loss function with respect to the parameters. It is worth noting that historical gradients can be cleared before each backpropagation to avoid gradient accumulation.
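The update rule can be illustrated on a toy quadratic loss whose gradient is known in closed form. The loss, learning rate, and step count below are assumptions chosen for the illustration; a real model would obtain the gradient from backpropagation instead of `grad_fn`.

```python
import numpy as np

def gradient_descent(grad_fn, theta, learning_rate=0.1, steps=100):
    """Repeatedly apply theta <- theta - learning_rate * gradient."""
    for _ in range(steps):
        grad = grad_fn(theta)  # recomputed from scratch each step, so nothing accumulates
        theta = theta - learning_rate * grad
    return theta

# Toy loss L(theta) = ||theta - target||^2, whose gradient is 2 * (theta - target).
target = np.array([1.0, -2.0])
theta = gradient_descent(lambda t: 2.0 * (t - target), np.zeros(2))
```

Because the gradient is recomputed in every iteration rather than added to a stored buffer, this loop mirrors the effect of clearing historical gradients before each backward pass.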

[0133] Thus, forward propagation is responsible for generating predictions and calculating the target loss, while backpropagation guides parameter optimization through gradient calculation. The two processes iterate in a loop until the model converges.

[0134] As can be seen from the above, the embodiments of this application provide an image processing method. Through steps 301 to 305, 401 to 402, 601 to 603, and 701 to 704, a collaborative modeling method for expression and identity is implemented. By extracting expression features and identity features from image data respectively, independent features of expression and identity are extracted. Furthermore, by concatenating the expression features and identity features to obtain joint features, the shortcomings of independent modeling of expression and identity features in traditional methods are effectively solved by capturing the joint information between expression and identity. Furthermore, by estimating the mutual information between expression features and identity features, the expression features are adjusted to contain less identity information with the goal of minimizing the mutual information, thereby achieving more accurate and robust expression recognition and identity verification. Then, by predicting the classification result of facial expressions through joint information and expression features containing less identity information, the classification of expressions is synergistically improved, thereby not only enhancing the modeling ability of personalized expressions, but also improving the performance in multi-task scenarios (such as expression classification and identity recognition).

[0135] Figure 8 is a schematic diagram of the structural composition of the image processing apparatus provided in the embodiments of this application. The apparatus is applied to an electronic device, such as the one shown in Figure 1. As shown in Figure 8, the image processing apparatus 800 includes:

[0136] The acquisition module 801 is used to acquire image data of facial expressions.

[0137] Extraction module 802 is used to extract a first facial expression feature and a first identity feature from the image data.

[0138] The processing module 803 is used to perform feature concatenation on the first expression feature and the first identity feature to obtain the first joint feature.

[0139] The extraction module 802 is further configured to extract joint information between the first expression feature and the first identity feature from the first joint feature.

[0140] The processing module 803 is further configured to estimate the mutual information between the first expression feature and the first identity feature, adjust the first expression feature to a second expression feature with the goal of minimizing the mutual information, and predict the classification result of the facial expression based on the second expression feature and the joint information.

[0141] In some embodiments, the extraction module 802 is further configured to extract a first facial expression feature from the image data via an expression encoder; and to extract a first identity feature from the image data via an identity encoder; wherein the identity encoder is pre-trained.

[0142] In some embodiments, the processing module 803 is further configured to calculate the joint distribution and marginal distribution between the first expression feature and the first identity feature using a mutual information estimator, to obtain the joint distribution expectation and the logarithmic expectation of the marginal distribution; and to calculate the mutual information between the first expression feature and the first identity feature based on the joint distribution expectation and the logarithmic expectation of the marginal distribution.

[0143] In some embodiments, before the processing module 803 calculates the joint distribution and marginal distribution between the first facial expression feature and the first identity feature, the acquisition module 801 is further configured to acquire a joint distribution sample pair, the joint distribution sample pair including the first facial expression feature sample and the first identity feature sample.

[0144] The processing module 803 is further configured to randomly permute the first identity feature sample to obtain a second identity feature sample, the second identity feature sample including identity feature information unrelated to the first facial expression feature sample; and generate a marginal distribution sample pair based on the first facial expression feature sample and the second identity feature sample, the marginal distribution sample pair including the first facial expression feature sample and the second identity feature sample.

[0145] In some embodiments, the processing module 803 is further configured to: calculate the joint distribution of the joint distribution sample pairs using the mutual information estimator to obtain a joint distribution score; calculate the marginal distribution of the marginal distribution sample pairs using the mutual information estimator to obtain a marginal distribution score; perform expectation estimation processing on the joint distribution score to obtain the joint distribution expectation; and perform expectation estimation processing on the marginal distribution score to obtain the marginal distribution log expectation.

[0146] In some embodiments, the processing module 803 is further configured to minimize the joint distribution expectation and the marginal distribution log expectation based on the first expression feature and the first identity feature using the mutual information estimator to obtain the second expression feature; wherein the identity feature information contained in the second expression feature is less than the identity feature information contained in the first expression feature.

[0147] In some implementations, the processing module 803 is further configured to determine a first loss function based on the variational lower bound of mutual information; determine a second loss function based on the classification results and classification labels of facial expressions; sum or weightedly sum the first loss function and the second loss function to obtain a target loss function; and update the expression encoder and mutual information estimator based on the target loss function.

[0148] Those skilled in the art should understand that the functions of each module in the image processing apparatus shown in Figure 8 can be understood by referring to the relevant descriptions of the aforementioned method. The functions of each module in the image processing apparatus shown in Figure 8 can be implemented by a program running on the processor or by specific logic circuits.

[0149] Figure 9 is a schematic structural diagram of an electronic device 900 provided in an embodiment of this application. The electronic device may be a terminal device or a server. The electronic device 900 shown in Figure 9 includes a processor 910, which can call and run computer programs from memory to implement the methods in the embodiments of this application.

[0150] Optionally, as shown in Figure 9, the electronic device 900 may further include a memory 920. The processor 910 can retrieve and run computer programs from the memory 920 to implement the methods described in the embodiments of this application.

[0151] The memory 920 can be a separate device independent of the processor 910, or it can be integrated into the processor 910.

[0152] Optionally, as shown in Figure 9, the electronic device 900 may also include a transceiver 930, which the processor 910 can control to communicate with other devices. Specifically, it can send information or data to other devices or receive information or data sent by other devices.

[0153] The transceiver 930 may include a transmitter and a receiver. The transceiver 930 may further include antennas, and the number of antennas may be one or more.

[0154] Optionally, the electronic device 900 may specifically be a server in the embodiments of this application, and the electronic device 900 may implement the corresponding processes implemented by the server in the various methods of the embodiments of this application. For the sake of brevity, it will not be described in detail here.

[0155] Optionally, the electronic device 900 may specifically be a mobile terminal / terminal device in the embodiments of this application, and the electronic device 900 may implement the corresponding processes implemented by the mobile terminal / terminal device in the various methods of the embodiments of this application. For the sake of brevity, it will not be described in detail here.

[0156] It should be understood that the processor in the embodiments of this application may be an integrated circuit chip with signal processing capabilities. In implementation, the steps of the above method embodiments can be completed by integrated logic circuits in the processor's hardware or by instructions in software form. The processor described above can be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this application. The general-purpose processor can be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application can be directly embodied in the execution of a hardware decoding processor, or executed by a combination of hardware and software modules in the decoding processor. The software modules can be located in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other mature storage media in the art. The storage medium is located in memory, and the processor reads information from the memory and, in conjunction with its hardware, completes the steps of the above method.

[0157] It is understood that the memory in the embodiments of this application can be volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. The volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced Synchronous DRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DR RAM). It should be noted that the memory used in the systems and methods described herein is intended to include, but is not limited to, these and any other suitable types of memory.

[0158] It should be understood that the above-described memory is exemplary and not a limiting description. For example, the memory in the embodiments of this application may also be static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct memory bus RAM (DR RAM), etc. That is to say, the memory in the embodiments of this application is intended to include, but is not limited to, these and any other suitable types of memory.

[0159] This application also provides a computer-readable storage medium for storing computer programs.

[0160] Optionally, the computer-readable storage medium can be applied to the network device in the embodiments of this application, and the computer program causes the computer to execute the corresponding processes implemented by the network device in the various methods of the embodiments of this application. For the sake of brevity, it will not be described in detail here.

[0161] Optionally, the computer-readable storage medium can be applied to the mobile terminal / terminal device in the embodiments of this application, and the computer program causes the computer to execute the corresponding processes implemented by the mobile terminal / terminal device in the various methods of the embodiments of this application. For the sake of brevity, it will not be described in detail here.

[0162] This application also provides a computer program product, including computer program instructions.

[0163] Optionally, the computer program product can be applied to the network device in the embodiments of this application, and the computer program instructions cause the computer to execute the corresponding processes implemented by the network device in the various methods of the embodiments of this application. For the sake of brevity, they will not be described in detail here.

[0164] Optionally, the computer program product can be applied to the mobile terminal / terminal device in the embodiments of this application, and the computer program instructions cause the computer to execute the corresponding processes implemented by the mobile terminal / terminal device in the various methods of the embodiments of this application. For the sake of brevity, they will not be described in detail here.

[0165] This application also provides a computer program.

[0166] Optionally, the computer program can be applied to the network device in the embodiments of this application. When the computer program is run on the computer, it causes the computer to execute the corresponding processes implemented by the network device in the various methods of the embodiments of this application. For the sake of brevity, it will not be described in detail here.

[0167] Optionally, the computer program can be applied to the mobile terminal / terminal device in the embodiments of this application. When the computer program is run on a computer, it causes the computer to execute the corresponding processes implemented by the mobile terminal / terminal device in the various methods of the embodiments of this application. For the sake of brevity, it will not be described in detail here.

[0168] Those skilled in the art will recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0169] Those skilled in the art will understand that, for the sake of convenience and brevity, the specific working processes of the systems, devices, and units described above can be referred to the corresponding processes in the foregoing method embodiments, and will not be repeated here.

[0170] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.

[0171] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0172] In addition, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit.

[0173] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0174] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application.

Claims

1. An image processing method, characterized in that the method comprises: acquiring image data of a facial expression; extracting a first expression feature and a first identity feature from the image data; performing feature concatenation on the first expression feature and the first identity feature to obtain a first joint feature, and extracting joint information between the first expression feature and the first identity feature from the first joint feature; estimating mutual information between the first expression feature and the first identity feature, and adjusting the first expression feature to a second expression feature with the goal of minimizing the mutual information; and predicting a classification result of the facial expression based on the second expression feature and the joint information.

2. The method according to claim 1, characterized in that extracting the first expression feature and the first identity feature from the image data comprises: extracting the first expression feature from the image data using an expression encoder; and extracting the first identity feature from the image data using an identity encoder, wherein the identity encoder is pre-trained.

3. The method according to claim 2, characterized in that estimating the mutual information between the first expression feature and the first identity feature comprises: calculating, using a mutual information estimator, a joint distribution and a marginal distribution between the first expression feature and the first identity feature to obtain a joint distribution expectation and a marginal distribution logarithmic expectation; and calculating the mutual information between the first expression feature and the first identity feature based on the joint distribution expectation and the marginal distribution logarithmic expectation.

4. The method according to claim 3, characterized in that, before calculating the joint distribution and the marginal distribution between the first expression feature and the first identity feature using the mutual information estimator, the method further comprises: obtaining a joint distribution sample pair, wherein the joint distribution sample pair includes a first expression feature sample and a first identity feature sample; randomly permuting the first identity feature sample to obtain a second identity feature sample, wherein the second identity feature sample includes identity feature information that is unrelated to the first expression feature sample; and generating a marginal distribution sample pair based on the first expression feature sample and the second identity feature sample, wherein the marginal distribution sample pair includes the first expression feature sample and the second identity feature sample.

5. The method according to claim 4, characterized in that calculating the joint distribution and the marginal distribution between the first expression feature and the first identity feature using the mutual information estimator to obtain the joint distribution expectation and the marginal distribution logarithmic expectation comprises: calculating, using the mutual information estimator, the joint distribution of the joint distribution sample pair to obtain a joint distribution score value; calculating, using the mutual information estimator, the marginal distribution of the marginal distribution sample pair to obtain a marginal distribution score value; performing expectation estimation on the joint distribution score value to obtain the joint distribution expectation; and performing expectation estimation on the marginal distribution score value to obtain the marginal distribution logarithmic expectation.

6. The method according to claim 4, characterized in that estimating the mutual information between the first expression feature and the first identity feature, and adjusting the first expression feature to the second expression feature with the goal of minimizing the mutual information, comprises: obtaining the second expression feature by using the mutual information estimator to minimize, based on the first expression feature and the first identity feature, the joint distribution expectation and the marginal distribution logarithmic expectation, wherein the second expression feature contains less identity feature information than the first expression feature.

7. The method according to any one of claims 1 to 6, characterized in that the method further comprises: determining a first loss function based on a variational lower bound of the mutual information; determining a second loss function based on the classification result of the facial expression and a label of the facial expression; obtaining a target loss function by summing or weighted summing the first loss function and the second loss function; and updating the expression encoder and the mutual information estimator based on the target loss function.

8. An optimization device for an expression classification model, characterized in that the device is applied to an electronic device and comprises: an acquisition module configured to acquire image data of a facial expression; an extraction module configured to extract a first expression feature and a first identity feature from the image data; and a processing module configured to perform feature concatenation on the first expression feature and the first identity feature to obtain a first joint feature; wherein the extraction module is further configured to extract joint information between the first expression feature and the first identity feature from the first joint feature; and the processing module is further configured to estimate mutual information between the first expression feature and the first identity feature, adjust the first expression feature to a second expression feature with the goal of minimizing the mutual information, and predict a classification result of the facial expression based on the second expression feature and the joint information.

9. An electronic device, characterized by comprising: a memory configured to store computer-executable instructions or a computer program; and a processor configured to implement the image processing method according to any one of claims 1 to 7 when executing the computer-executable instructions or the computer program stored in the memory.

10. A computer-readable storage medium storing computer-executable instructions or a computer program, characterized in that, when the computer-executable instructions or the computer program are executed by a processor, the image processing method according to any one of claims 1 to 7 is implemented.
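
Illustrative note (not part of the claims): the sample shuffling of claim 4, the two expectations of claims 3 and 5, and the summed loss of claim 7 can be sketched in a few lines of Python. The claims do not state the estimator formula, so the sketch assumes a Donsker-Varadhan style bound (joint distribution expectation minus the logarithm of the expected exponentiated marginal score), which is one common choice for mutual information estimators of this kind. The names `shuffle_identity`, `dot_critic`, `estimate_mutual_information`, `cross_entropy`, `target_loss`, and `mi_weight` are hypothetical, and the dot-product critic merely stands in for a learned mutual information estimator.

```python
import math
import random


def shuffle_identity(identity_samples, rng):
    """Return a randomly permuted copy of the identity samples (claim 4)."""
    shuffled = list(identity_samples)
    rng.shuffle(shuffled)
    return shuffled


def dot_critic(expr, ident):
    """Hypothetical score function standing in for a learned estimator."""
    return sum(e * i for e, i in zip(expr, ident))


def estimate_mutual_information(expr_samples, identity_samples, critic, rng):
    """Assumed Donsker-Varadhan form: E_joint[T] - log E_marginal[exp(T)]."""
    shuffled = shuffle_identity(identity_samples, rng)
    # Joint pairs keep each expression sample with its own identity sample.
    joint_scores = [critic(e, i) for e, i in zip(expr_samples, identity_samples)]
    # Marginal pairs combine each expression sample with a shuffled identity.
    marginal_scores = [critic(e, i) for e, i in zip(expr_samples, shuffled)]
    joint_expectation = sum(joint_scores) / len(joint_scores)
    marginal_log_expectation = math.log(
        sum(math.exp(s) for s in marginal_scores) / len(marginal_scores)
    )
    return joint_expectation - marginal_log_expectation


def cross_entropy(probabilities, label):
    """Classification loss for one sample given predicted class probabilities."""
    return -math.log(probabilities[label])


def target_loss(mi_estimate, classification_loss, mi_weight=1.0):
    """Weighted sum of the mutual information term and the classification loss."""
    return classification_loss + mi_weight * mi_estimate


# Toy data: identity features are noisy copies of the expression features.
rng = random.Random(0)
expr = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(8)]
ident = [[x + 0.1 * rng.gauss(0, 1) for x in e] for e in expr]

mi = estimate_mutual_information(expr, ident, dot_critic, rng)
loss = target_loss(mi, cross_entropy([0.7, 0.2, 0.1], 0), mi_weight=0.1)
print(f"estimated MI bound: {mi:.4f}, target loss: {loss:.4f}")
```

In a trained system the critic would be a neural network updated together with the expression encoder. How the two are updated against each other (for example, alternating or adversarial steps) is not specified in the claims and is left out of this sketch.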