A video depression recognition method and system based on uncertainty perception and label distribution learning

A depression identification model is constructed using uncertainty perception and label distribution learning. This addresses label noise and small sample sizes, improves the robustness and interpretability of the model, and outputs both a depression score and a confidence level. It is applicable to video-based depression detection.

CN122244924A · Pending · Publication Date: 2026-06-19 · SOUTHEAST UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: SOUTHEAST UNIV
Filing Date: 2026-03-25
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing deep learning-based methods for detecting depression perform poorly in the face of label noise and small sample size problems, and lack prediction confidence, resulting in decreased model generalization ability and insufficient interpretability.

Method used

A depression identification model is constructed using uncertainty perception and label distribution learning. Using a hybrid loss function and a layer-freezing training strategy, a Gaussian label distribution is constructed dynamically, and the model outputs depression scores and uncertainty bias values, which prevents overfitting and improves model robustness.

Benefits of technology

It improves the model's generalization ability under small sample conditions, and the output uncertainty bias value provides a confidence index of the diagnostic results, enhancing the model's interpretability and clinical application value.



Abstract

This invention discloses a video-based depression recognition method and system based on uncertainty perception and label distribution learning. The method includes: constructing a depression recognition model comprising a feature extraction network, a first output layer, and a second output layer, where the first output layer predicts the depression score and the second output layer predicts the uncertainty bias value of the model; training the depression recognition model using a facial image dataset; freezing the first half of the network parameters of the feature extraction network during training; constructing a Gaussian label distribution based on the depression score label and the uncertainty bias value; constructing a hybrid loss function that includes a distribution consistency loss calculated from the Gaussian label distribution and an adaptive bias regression loss calculated from the model prediction error; optimizing the model parameters; and obtaining video data of the object to be detected and inputting it into the trained model to predict the depression score and uncertainty bias value. This invention exhibits high robustness, adapts to small sample learning, and outputs a confidence measure alongside each prediction.

Description

Technical Field

[0001] This invention relates to neural network-based depression recognition technology, and more particularly to a video depression recognition method and system based on uncertainty perception and label distribution learning.

Background Technology

[0002] Depression is a common mental disorder, and its early detection and diagnosis are crucial for treatment outcomes. Traditional diagnosis relies primarily on doctor-patient interviews and assessments using scales (such as the PHQ-9 and BDI), a method that is not only time-consuming and labor-intensive but also highly dependent on the doctor's subjective experience. In recent years, with the rapid development of deep learning technology, automated depression detection technology based on computer vision analysis of nonverbal cues such as facial expressions and head posture has attracted widespread attention.

[0003] However, existing deep learning-based methods for detecting depression still face many challenges in practical applications:

[0004] 1. Label Noise and Subjectivity Issues: In publicly available depression datasets (such as AVEC2014), label scores are typically obtained by averaging scores given by multiple experts based on video performance. Because different experts have subjective differences in their understanding of symptoms, the labels themselves often contain noise. Traditional "hard regression" models directly fit a single numerical label, which easily forces the fit onto noisy data, leading to a decrease in the model's generalization ability.

[0005] 2. Small sample size overfitting problem: Medical image or video datasets typically have a small sample size. Directly applying deep convolutional neural networks (such as ResNet, VGG, etc.) can easily lead to overfitting on the training set, while performing poorly on the test set.

[0006] 3. Lack of predictive confidence: Most existing regression models only output a definite depression score, failing to inform users how "confident" the model is about the predicted outcome (i.e., uncertainty). In medical auxiliary diagnostic scenarios, the lack of uncertainty estimation makes it difficult for doctors to determine when to adopt the model's recommendations and when to conduct manual review, limiting the model's interpretability and clinical application value.

Summary of the Invention

[0007] In view of the problems existing in the prior art, the purpose of this invention is to provide a method and system for identifying depression that can be effectively trained under small sample conditions and can automatically perceive label noise and output prediction confidence.

[0008] To achieve the above-mentioned objectives, the present invention provides the following technical solution:

[0009] A video-based depression identification method based on uncertainty perception and label distribution learning includes the following steps:

[0010] (1) Obtain video data with known depression score labels, extract facial images of several facial regions from the frame sequence corresponding to each video data, and perform preprocessing to form a facial image dataset;

[0011] (2) Construct a depression identification model. The depression identification model includes a feature extraction network, a first branch output layer, and a second branch output layer. The feature extraction network is a pre-trained deep convolutional neural network used to extract image features from the facial images. The first branch output layer predicts the depression score based on the image features. The second branch output layer predicts the uncertainty deviation value of the depression identification model based on the image features.

[0012] (3) The depression recognition model is trained using a facial image dataset; the training rules are as follows:

[0013] Freeze the parameters of the first half of the feature extraction network and train only the parameters of the second half of the network;

[0014] A Gaussian label distribution is dynamically constructed based on the depression score labels and the uncertainty deviation value, and the width of the Gaussian label distribution adapts to the uncertainty deviation value.

[0015] The model parameters are optimized using a hybrid loss function, which includes a distribution consistency loss calculated based on the Gaussian label distribution and an adaptive bias regression loss calculated based on the prediction error of the depression identification model. The adaptive bias regression loss guides the depression identification model to output a large uncertainty bias value when the prediction error is large and a small uncertainty bias value when the prediction error is small.

[0016] (4) Obtain video data of the object to be detected, preprocess it to form a facial image, input it into the trained depression recognition model, and obtain the predicted depression score and uncertainty deviation value.

[0017] Furthermore, step (1) specifically includes:

[0018] (1.1) Obtain video data with known depression score labels;

[0019] (1.2) In the frame sequence corresponding to each video data, a number of frames are extracted at fixed intervals to form an image sequence;

[0020] (1.3) A multi-task cascaded convolutional neural network is used to detect faces in each frame of the image sequence. Affine transformation is performed based on key points to align the face region, remove background interference, and obtain several face images.

[0021] (1.4) Perform data augmentation on each facial image, including geometric transformation, color change, and random region occlusion, to form a facial image dataset.

[0022] Furthermore, the first branch output layer is composed of a fully connected layer, which is used to map the extracted image features into a scalar mean depression score, and output it as a depression score value.

[0023] The second branch output layer consists of a fully connected layer, a nonlinear activation function, and a numerical truncation operation. It is used to output a positive uncertainty deviation value. The numerical truncation operation limits the uncertainty deviation value to a preset numerical range to prevent numerical overflow and constrain the uncertainty range.

[0024] Furthermore, the method for constructing the Gaussian label distribution is as follows:

[0025] Using the depression score label y as the mean and the uncertainty deviation σ as the standard deviation, a Gaussian label distribution is generated according to the following formula:

[0026]

[0027] where the symbols in the formula denote, respectively, the predicted depression score and the normalization operation.

[0028] Furthermore, the distribution consistency loss specifically refers to:

[0029]

[0030] In the formula, the symbols denote, in order: the distribution consistency loss; the divergence function; the log probability distribution predicted by the depression identification model; and the Gaussian label distribution.

[0031] Furthermore, the adaptive bias regression loss is specifically as follows:

[0032]

[0033]

[0034] where the symbols denote, in order: the adaptive bias regression loss; the mean square error function; the depression score and the uncertainty deviation value predicted by the depression identification model; the dynamic target deviation value; the depression score label; the error sensitivity coefficient; and the fundamental uncertainty constant.

[0035] A video depression recognition system based on uncertainty perception and label distribution learning includes:

[0036] The dataset formation module is used to acquire video data with known depression score labels, extract facial images of several facial regions from the frame sequence corresponding to each video data, and perform preprocessing to form a facial image dataset.

[0037] The model building module is used to build a depression recognition model. The depression recognition model includes a feature extraction network, a first branch output layer, and a second branch output layer. The feature extraction network is a pre-trained deep convolutional neural network used to extract image features from facial images. The first branch output layer predicts depression scores based on the image features. The second branch output layer predicts the uncertainty bias value of the depression recognition model based on the image features.

[0038] The model training module is used to train the depression recognition model using a facial image dataset; the training rules are as follows:

[0039] Freeze the parameters of the first half of the feature extraction network and train only the parameters of the second half of the network;

[0040] A Gaussian label distribution is dynamically constructed based on the depression score labels and the uncertainty deviation value, and the width of the Gaussian label distribution adapts to the uncertainty deviation value.

[0041] The model parameters are optimized using a hybrid loss function, which includes a distribution consistency loss calculated based on the Gaussian label distribution and an adaptive bias regression loss calculated based on the prediction error of the depression identification model. The adaptive bias regression loss guides the depression identification model to output a large uncertainty bias value when the prediction error is large and a small uncertainty bias value when the prediction error is small.

[0042] The recognition module is used to acquire video data of the object to be detected, preprocess it to form a facial image, input it into a trained depression recognition model, and obtain the predicted depression score and uncertainty bias value.

[0043] A computer program product includes a computer program that, when executed by a processor, implements the above-described method.

[0044] A computer device includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the method described above.

[0045] A computer-readable storage medium having a computer program / instructions stored thereon, which, when executed by a processor, implements the above-described method.

[0046] Compared with the prior art, the beneficial effects of this invention are:

[0047] 1. Strong robustness: Unlike traditional hard regression, the hybrid loss function of this invention models uncertainty explicitly, so it can automatically identify difficult or noisy samples and reduce their impact during training, which improves the model's generalization ability and robustness.

[0048] 2. Adapting to small samples: Combining the layer-freezing training strategy with the hybrid loss function effectively prevents the model from overfitting on small-scale depression datasets and adapts to small sample learning.

[0049] 3. Interpretability: The uncertainty bias value output by the model can be used as a confidence index of the diagnostic results, providing doctors with more valuable auxiliary decision-making information.

Description of the Drawings

[0050] Figure 1 is a schematic diagram of the general process of the video depression recognition method based on uncertainty perception and label distribution learning provided in the embodiments of the present invention;

[0051] Figure 2 is a schematic diagram of the overall process of the video depression recognition method based on uncertainty perception and label distribution learning provided in the embodiments of the present invention;

[0052] Figure 3 is a schematic diagram of the structure of the depression recognition model provided in an embodiment of the present invention;

[0053] Figure 4 is an example diagram of the dynamic Gaussian label distribution provided in an embodiment of the present invention;

[0054] Figure 5 is a flowchart of the adaptive deviation regression loss calculation provided in an embodiment of the present invention;

[0055] Figure 6 is a schematic diagram of the computer device structure provided in an embodiment of the present invention.

Detailed Implementation

[0056] The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.

[0057] Embodiment 1

[0058] This invention provides a video-based depression recognition method based on uncertainty perception and label distribution learning. As shown in Figure 1 and Figure 2, it includes the following steps:

[0059] S101. Obtain video data with known depression score labels, extract facial images of several facial regions from the frame sequence corresponding to each video data, and perform preprocessing to form a facial image dataset.

[0060] This step specifically includes:

[0061] S1011. Obtain video data with known depression score labels; specifically, you can select video data from publicly available depression datasets (such as AVEC2014), or collect video data from patients with depression who have already been scored.

[0062] S1012. In the frame sequence corresponding to each video data, extract several frames of images at fixed intervals, for example, extract 100 frames of images evenly at fixed intervals to form an image sequence.

[0063] S1013. A multi-task cascaded convolutional neural network (MTCNN) is used to detect faces in each frame of the image sequence. The faces are aligned by an affine transformation based on the detected key points, the facial region is then cropped to a fixed pixel size, and background interference is removed, yielding several facial images.

[0064] S1014. Perform data augmentation on each facial image to form a facial image dataset, including:

[0065] Geometric transformations: random horizontal flip (probability 0.5), random rotation (±10 degrees);

[0066] Color variation: Randomly adjust brightness, contrast, and saturation (variation factor 0.1);

[0067] Random Erasing: With a probability of 0.5, a randomly selected rectangular region of the image is filled with the mean pixel value; the occluded area ratio ranges from 0.02 to 0.15.

[0068] Random erasing simulates missing local facial information, which encourages the model to learn contextual features and improves its robustness.
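As an illustration of the fixed-interval frame extraction in S1012 and the random erasing in S1014, the following minimal sketch works on plain nested lists of grayscale values. The function names, the square-ish patch shape, and the list-based image format are assumptions made for the example; the source gives only the frame count (100), the erasing probability (0.5), and the area ratio range (0.02 to 0.15).

```python
import random

def sample_frame_indices(total_frames, num_samples=100):
    """Pick up to num_samples evenly spaced frame indices from a video."""
    if total_frames <= 0:
        return []
    if total_frames <= num_samples:
        return list(range(total_frames))
    step = total_frames / num_samples          # fixed (possibly fractional) interval
    return [int(i * step) for i in range(num_samples)]

def random_erase(image, p=0.5, area_range=(0.02, 0.15), rng=random):
    """With probability p, fill one random rectangle with the mean pixel value.

    image is a list of rows of grayscale values; a new image is returned.
    """
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    if rng.random() >= p:
        return out                              # image left unchanged
    mean = sum(sum(row) for row in image) / (height * width)
    ratio = rng.uniform(*area_range)            # occluded fraction of the area
    target_area = ratio * height * width
    # Assumption: a roughly square patch; the source gives no aspect-ratio range.
    h = max(1, min(height, round(target_area ** 0.5)))
    w = max(1, min(width, round(target_area / h)))
    top = rng.randint(0, height - h)
    left = rng.randint(0, width - w)
    for r in range(top, top + h):
        for c in range(left, left + w):
            out[r][c] = mean
    return out
```

In a real pipeline the same logic would typically run on image tensors, and the geometric and color augmentations would be applied before the erasing step.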

[0069] S102. Construct a depression identification model. As shown in Figure 3, the depression identification model includes a feature extraction network, a first branch output layer, and a second branch output layer.

[0070] The feature extraction network is a pre-trained deep convolutional neural network. For example, ResNet18 can be selected as the backbone network and loaded with its ImageNet pre-trained weights to extract image features from the facial images; the extracted features are 512-dimensional.

[0071] The first branch output layer is used to predict the depression score based on the image features. It consists of fully connected layers and is used to map the extracted image features to a scalar mean depression score, which is then output as the depression score value. The output dimension is 64.

[0072] The second branch output layer is used to predict the uncertainty bias value of the depression recognition model based on the image features. It consists of a fully connected layer, the nonlinear activation function Softplus(), and a numerical truncation operation, and it outputs a positive uncertainty bias value. The truncation limits the uncertainty bias value to a preset numerical range to prevent numerical overflow and constrain the uncertainty range. In this embodiment, to prevent gradient explosion in the early stages of training, a hard truncation (Clamp) is applied to the output uncertainty bias value, forcibly limiting it to the interval [3.0, 15.0]. The lower limit of 3.0 provides basic smoothness, while the upper limit of 15.0 prevents the distribution from becoming too flat.
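The post-processing of the second branch can be sketched as follows, assuming the fully connected layer has already produced a raw scalar. The names `softplus` and `uncertainty_head` are illustrative; only the Softplus activation and the clamp bounds of 3.0 and 15.0 come from the text.

```python
import math

SIGMA_MIN, SIGMA_MAX = 3.0, 15.0   # clamp bounds stated in this embodiment

def softplus(x):
    """Numerically stable softplus, log(1 + exp(x)); the result is always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def uncertainty_head(raw):
    """Map a raw fully connected output to a clamped, positive deviation value."""
    return min(max(softplus(raw), SIGMA_MIN), SIGMA_MAX)
```

Note that a hard clamp passes no gradient for values outside the interval, so in this sketch the bounds act as fixed limits and not as soft penalties.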

[0073] S103. The depression recognition model is trained using a facial image dataset.

[0074] The training rules are as follows:

[0075] A. Freeze the first half of the network parameters of the feature extraction network and train only the second half of the network parameters.

[0076] The learning strategy is a hierarchical transfer learning strategy, specifically: loading the weight parameters of a feature extraction network pre-trained on a large-scale general image dataset as initialization parameters; freezing the parameters of the shallow convolutional blocks in the first half of the feature extraction network so that they receive no gradient updates during training; and fine-tuning only the parameters of the deep convolutional blocks in the second half of the feature extraction network and of the dual-branch output layers. For the ResNet18 feature extraction network, the parameters of the first two stages (layer1, layer2) and the convolutional layer preceding them can be frozen, which retains the general texture feature extraction capability. Only the later stages (layer3 and layer4) are left trainable, so that fine-tuning adapts them to the high-level semantic features of depression.

[0077] The hierarchical transfer learning strategy can adapt to small sample data and effectively prevent the model from overfitting on small-scale depression datasets.
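One way to express this freezing rule is by parameter name. The sketch below assumes the backbone uses torchvision-style ResNet18 parameter names (`conv1`, `bn1`, `layer1` to `layer4`); the head names `score_head` and `sigma_head` in the usage note are hypothetical.

```python
# Prefixes of the shallow blocks that stay frozen (assumed torchvision naming).
FROZEN_PREFIXES = ("conv1.", "bn1.", "layer1.", "layer2.")

def is_trainable(param_name):
    """Return True if the named parameter should receive gradient updates."""
    return not param_name.startswith(FROZEN_PREFIXES)

def split_parameters(param_names):
    """Partition parameter names into (frozen, trainable) lists."""
    frozen = [n for n in param_names if not is_trainable(n)]
    trainable = [n for n in param_names if is_trainable(n)]
    return frozen, trainable
```

With a PyTorch model, the same predicate could be applied as `param.requires_grad = is_trainable(name)` while iterating over `model.named_parameters()`; parameters named, for example, `score_head.weight` or `sigma_head.weight` would remain trainable under this rule.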

[0078] B. Based on the depression score labels and the uncertainty deviation value, a Gaussian label distribution is dynamically constructed, wherein the width of the Gaussian label distribution adaptively changes with the uncertainty deviation value. The specific method for constructing the Gaussian label distribution is as follows:

[0079] Using the depression score label y as the mean and the uncertainty deviation σ as the standard deviation, a Gaussian label distribution is generated according to the following formula:

[0080]

[0081] where the symbols in the formula denote, respectively, the predicted depression score and the normalization operation. An example of the Gaussian label distribution is shown in Figure 4.

[0082] This soft-labeling mechanism allows the model to make "fuzzy" predictions when there is uncertainty, reflecting the true distribution characteristics of the data. Instead of a traditional point-to-point regression loss, the method dynamically constructs a Gaussian label distribution from the true label and the uncertainty deviation value predicted by the model.
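A minimal sketch of such a label distribution is given below. It assumes the distribution is defined over 64 integer score bins (0 to 63, chosen to match the 64-dimensional output mentioned for the first branch) and that the normalization divides by the sum of the unnormalized weights; neither detail is confirmed by the text.

```python
import math

def gaussian_label_distribution(y, sigma, num_bins=64):
    """Discrete Gaussian over score bins 0..num_bins-1, normalized to sum to 1.

    y is the depression score label (mean); sigma is the uncertainty deviation.
    """
    weights = [math.exp(-((k - y) ** 2) / (2.0 * sigma ** 2)) for k in range(num_bins)]
    total = sum(weights)
    return [w / total for w in weights]
```

A larger `sigma` spreads the probability mass over more neighboring scores, which is how the width of the distribution follows the predicted uncertainty.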

[0083] C. The model parameters are optimized using a hybrid loss function, which includes the distribution consistency loss calculated based on the Gaussian label distribution and the adaptive bias regression loss calculated based on the prediction error of the depression identification model. The adaptive bias regression loss guides the depression identification model to output a large uncertainty bias value when the prediction error is large and a small uncertainty bias value when the prediction error is small.

[0084] The distribution consistency loss is specifically as follows:

[0085]

[0086] In the formula, the symbols denote, in order: the distribution consistency loss; the divergence function; the log probability distribution predicted by the depression identification model; and the Gaussian label distribution. The distribution consistency loss measures the difference between the log probability distribution predicted by the model and the constructed Gaussian label distribution.

[0087] As shown in Figure 5, the adaptive bias regression loss is defined as the mean square error between the uncertainty bias value output by the model and the dynamic target bias value, specifically:
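The sketch below shows one plausible form of this loss, assuming the divergence is the Kullback-Leibler divergence from the Gaussian label distribution to the model's predicted distribution, and that the predicted log probabilities come from a log-softmax over the first branch's outputs. Both the direction of the divergence and the function names are assumptions.

```python
import math

def log_softmax(logits):
    """Log probabilities from raw scores, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]

def kl_divergence(target, log_pred):
    """KL(target || pred), given target probabilities and predicted log probabilities."""
    return sum(t * (math.log(t) - lp) for t, lp in zip(target, log_pred) if t > 0.0)
```

This argument order (log probabilities for the prediction, probabilities for the target) follows the convention used by `torch.nn.KLDivLoss`, which may be what the description of a "log probability distribution" refers to.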

[0088]

[0089]

[0090] where the symbols denote, in order: the adaptive bias regression loss; the mean square error function; the depression score and the uncertainty deviation value predicted by the depression identification model; the dynamic target deviation value; the depression score label; the error sensitivity coefficient; and the fundamental uncertainty constant. In this embodiment, the error sensitivity coefficient is 0.5 and the fundamental uncertainty constant is 3.0. The physical meaning of the formula is: when the prediction error is large (the model's prediction is inaccurate), a greater uncertainty (i.e., a larger uncertainty deviation value) is allowed; even if the prediction is perfectly accurate, the underlying uncertainty is maintained at 3.0 to prevent overfitting.

[0091] The hybrid loss function is expressed as:
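A minimal sketch of this loss follows. It assumes the dynamic target grows linearly with the absolute prediction error and adds the base constant; the text fixes only the coefficient (0.5), the constant (3.0), and the use of a mean square error, so the exact dependence on the error is an assumption, as are the function names.

```python
def dynamic_target_deviation(y_pred, y_true, alpha=0.5, sigma_0=3.0):
    """Target deviation: base constant plus a term that grows with the error."""
    return alpha * abs(y_pred - y_true) + sigma_0

def adaptive_deviation_loss(y_preds, sigmas, y_trues, alpha=0.5, sigma_0=3.0):
    """Mean square error between predicted deviations and their dynamic targets."""
    targets = [dynamic_target_deviation(p, t, alpha, sigma_0)
               for p, t in zip(y_preds, y_trues)]
    return sum((s - tg) ** 2 for s, tg in zip(sigmas, targets)) / len(targets)
```

For instance, a sample predicted at 20 with label 10 gets a target deviation of 8.0 under these assumptions, while a perfectly predicted sample gets 3.0. In an autograd framework the target would usually be treated as a constant so that this loss trains only the uncertainty branch, which is a further assumption.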

[0092]

[0093] where the two coefficients in the formula are the weights of the two loss terms.

[0094] The loss function described above combines distribution consistency loss and adaptive bias regression loss, enabling the model to narrow the gap between the predicted distribution and the target Gaussian distribution, dynamically adjust uncertainty, and automatically identify samples with noisy labels, reducing their negative impact on the gradient by increasing variance. This allows the model to output accurate depression scores and uncertainty bias values.
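The formulas referred to in this embodiment are not reproduced in the text. The following block gives one set of expressions consistent with the surrounding definitions; the symbol names, the index range, the direction of the divergence, and the linear dependence of the target deviation on the absolute error are all assumptions, not the patent's own notation.

```latex
\begin{aligned}
P(k) &= \mathrm{Norm}\!\left(\exp\!\left(-\frac{(k-y)^2}{2\sigma^2}\right)\right),
  \qquad k = 0,\dots,K-1,\\
\mathcal{L}_{\mathrm{dist}} &= D_{\mathrm{KL}}\!\left(P \,\middle\|\, \hat{P}\right)
  = \sum_{k} P(k)\,\bigl(\log P(k) - \log \hat{P}(k)\bigr),\\
\sigma_{t} &= \alpha\,\lvert \hat{y} - y \rvert + \sigma_0,\\
\mathcal{L}_{\mathrm{dev}} &= \mathrm{MSE}\!\left(\sigma, \sigma_{t}\right),\\
\mathcal{L} &= \lambda_1\,\mathcal{L}_{\mathrm{dist}} + \lambda_2\,\mathcal{L}_{\mathrm{dev}}.
\end{aligned}
```

Here $y$ is the depression score label, $\hat{y}$ and $\sigma$ are the predicted score and uncertainty deviation, $\hat{P}$ is the predicted distribution, $\alpha = 0.5$ and $\sigma_0 = 3.0$ are the values given in this embodiment, and $\lambda_1$, $\lambda_2$ are the loss weights, whose values are not stated.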

[0095] S104. Obtain video data of the object to be detected, preprocess it to form a facial image, input it into the trained depression recognition model, and obtain the predicted depression score and uncertainty bias value.

[0096] The predicted depression score is a predictive conclusion, and the uncertainty bias value can be used as a confidence index of the predictive conclusion, providing doctors with more valuable information to assist in decision-making. It should be understood that the predicted depression score and uncertainty bias value cannot be used as the final disease detection or diagnosis result, but only to provide decision-making assistance to doctors. Everything should be based on the doctor's diagnosis.

[0097] The embodiments of the present invention were experimentally verified, and the experimental parameters are as follows:

[0098] (1) Optimizer: Adam, with a specified initial learning rate and weight decay.

[0099] (2) Learning rate scheduling: ReduceLROnPlateau; the learning rate is multiplied by a decay factor of 0.2 when the validation set MAE has not decreased for 5 epochs.

[0100] (3) Batch Size: 32; Training epochs: 200.
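The scheduling rule in item (2) can be illustrated with a simplified stand-in written in plain Python. It mimics the behavior of PyTorch's `torch.optim.lr_scheduler.ReduceLROnPlateau` with `factor=0.2` and `patience=5` in "min" mode, but omits that class's threshold and cooldown options. The class name is hypothetical, and `initial_lr` is left as a parameter because the value is not given in the text.

```python
class PlateauScheduler:
    """Reduce the learning rate when the validation MAE stops improving."""

    def __init__(self, initial_lr, factor=0.2, patience=5):
        self.lr = initial_lr
        self.factor = factor
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_mae):
        """Record one epoch's validation MAE and return the current learning rate."""
        if val_mae < self.best:
            self.best = val_mae
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        # Decay only after more than `patience` consecutive epochs without improvement.
        if self.bad_epochs > self.patience:
            self.lr *= self.factor
            self.bad_epochs = 0
        return self.lr
```

Calling `step` once per epoch with the validation MAE leaves the learning rate unchanged while the metric keeps improving and multiplies it by 0.2 once the plateau exceeds the patience window.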

[0101] Tests on the AVEC2014 dataset show that this method outperforms traditional hard regression methods on the MAE and RMSE metrics, and that the Uncertainty column in the prediction results effectively indicates prediction quality (samples with large errors are usually accompanied by large Sigma values), which verifies the effectiveness of the method.

[0102] Embodiment 2
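For reference, the two metrics named here can be computed as in the sketch below. The function names are illustrative, and any inputs passed to them in an example would be toy values, not results from the patent.

```python
import math

def mae(preds, labels):
    """Mean absolute error between predicted and labeled depression scores."""
    return sum(abs(p - y) for p, y in zip(preds, labels)) / len(labels)

def rmse(preds, labels):
    """Root mean square error; penalizes large errors more heavily than MAE."""
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels))
```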

[0103] This invention provides a video depression recognition system based on uncertainty perception and label distribution learning, comprising:

[0104] The dataset formation module is used to acquire video data with known depression score labels, extract facial images of several facial regions from the frame sequence corresponding to each video data, and perform preprocessing to form a facial image dataset.

[0105] The model building module is used to build a depression recognition model. The depression recognition model includes a feature extraction network, a first branch output layer, and a second branch output layer. The feature extraction network is a pre-trained deep convolutional neural network used to extract image features from facial images. The first branch output layer predicts depression scores based on the image features. The second branch output layer predicts the uncertainty bias value of the depression recognition model based on the image features.

[0106] The model training module is used to train the depression recognition model using a facial image dataset; the training rules are as follows:

[0107] Freeze the parameters of the first half of the feature extraction network and train only the parameters of the second half of the network;

[0108] A Gaussian label distribution is dynamically constructed based on the depression score labels and the uncertainty deviation value, and the width of the Gaussian label distribution adapts to the uncertainty deviation value.

[0109] The model parameters are optimized using a hybrid loss function, which includes a distribution consistency loss calculated based on the Gaussian label distribution and an adaptive bias regression loss calculated based on the prediction error of the depression identification model. The adaptive bias regression loss guides the depression identification model to output a large uncertainty bias value when the prediction error is large and a small uncertainty bias value when the prediction error is small.

[0110] The recognition module is used to acquire video data of the object to be detected, preprocess it to form a facial image, input it into a trained depression recognition model, and obtain the predicted depression score and uncertainty bias value.

[0111] The first branch output layer is composed of a fully connected layer, which is used to map the extracted image features into a scalar mean depression score, and output it as a depression score value.

[0112] The second branch output layer consists of a fully connected layer, a nonlinear activation function, and a numerical truncation operation. It is used to output a positive uncertainty deviation value. The numerical truncation operation limits the uncertainty deviation value to a preset numerical range to prevent numerical overflow and constrain the uncertainty range.

[0113] The method for constructing the Gaussian label distribution is as follows:

[0114] Using the depression score label y as the mean and the uncertainty deviation σ as the standard deviation, a Gaussian label distribution is generated according to the following formula:

[0115]

[0116] where the symbols in the formula denote, respectively, the predicted depression score and the normalization operation.

[0117] The distribution consistency loss is specifically as follows:

[0118]

[0119] In the formula, the symbols denote, in order: the distribution consistency loss; the divergence function; the log probability distribution predicted by the depression identification model; and the Gaussian label distribution.

[0120] The adaptive bias regression loss is specifically as follows:

[0121]

[0122]

[0123] where the symbols denote, in order: the adaptive bias regression loss; the mean square error function; the depression score and the uncertainty deviation value predicted by the depression identification model; the dynamic target deviation value; the depression score label; the error sensitivity coefficient; and the fundamental uncertainty constant.

[0124] The system provided in this embodiment of the invention can be used to execute the method provided in Embodiment 1 of the invention, and has the corresponding functions and beneficial effects of executing the method.

[0125] It is worth noting that in the embodiments of the above system, the various units and modules included are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be achieved; in addition, the specific names of each functional unit are only for easy distinction between each other and are not used to limit the scope of protection of the present invention.

[0126] The embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art will clearly understand that each implementation can be achieved using software plus necessary general-purpose hardware platforms, or it can be implemented solely through hardware, as long as the function or purpose can be achieved.

[0127] Embodiment 3

[0128] This invention also provides a computer program product, such as an app on a mobile phone or tablet, or an installer on a computer. This product includes a computer program / instructions that, when executed by a processor, implement the method described in Embodiment 1. The code for the computer-executable program used to perform the operations of this invention can be written in one or more programming languages or a combination thereof. Programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as "C" or similar languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (e.g., via the Internet using an Internet service provider).

[0129] Example 4

[0130] Figure 6 is a schematic diagram of the structure of a computer device provided in an embodiment of the present invention; the device serves to implement the method of Embodiment 1 described above. As shown in Figure 2, the device may include: a memory 301 storing a computer-executable program; and a processor 302 coupled to the memory 301, where the processor 302 calls the computer-executable program stored in the memory 301 to perform the steps of the method described in Embodiment 1.

[0131] Memory 301 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) and / or cache memory. The device may further include other removable / non-removable, volatile / non-volatile computer system storage media. By way of example only, memory 301 may be used to read and write non-removable, non-volatile magnetic media (commonly referred to as a "hard disk drive"). A program / utility having a set (at least one) of program modules may be stored, for example, in memory 301. Such program modules include, but are not limited to, an operating system, one or more application programs, other program modules, and program data. Each or some combination of these examples may include an implementation of a network environment. The computer-executable program of the program modules typically performs the functions and / or methods described in the embodiments of the present invention.

[0132] The processor 302 executes various functional applications and data processing by running programs stored in the memory 301, such as implementing the method provided in Embodiment 1 of the present invention.

[0133] The code of a computer-executable program can be written in one or more programming languages or a combination thereof. Programming languages include object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.

[0134] Example 5

[0135] This invention provides a storage medium containing a computer-executable program, which, when executed by a computer processor, is used to perform the method of Embodiment 1.

[0136] The storage medium of this invention can be any combination of one or more computer-readable media. A computer-readable medium can be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media (a non-exhaustive list) include: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this document, a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.

[0137] Of course, the computer-executable program in the storage medium provided in the embodiments of the present invention is not limited to the above-described method operations, but can also perform related operations in the methods provided in any embodiment of the present invention.

[0138] It should be understood that the embodiments and descriptions above merely illustrate the principles, main features, and advantages of the present invention. Various changes and modifications can be made to the present invention without departing from its spirit and scope, and all such changes and modifications fall within the protection scope of the present invention.

Claims

1. A video depression recognition method based on uncertainty perception and label distribution learning, characterized in that it includes the following steps:
(1) Obtain video data with known depression score labels, extract facial images of several facial regions from the frame sequence corresponding to each video data, and perform preprocessing to form a facial image dataset;
(2) Construct a depression recognition model, which includes a feature extraction network, a first branch output layer, and a second branch output layer; the feature extraction network is a pre-trained deep convolutional neural network used to extract image features of the facial images; the first branch output layer is used to predict the depression score based on the image features; the second branch output layer is used to predict the uncertainty deviation value of the depression recognition model based on the image features;
(3) Train the depression recognition model using the facial image dataset, with the following training rules: freeze the parameters of the first half of the feature extraction network and train only the parameters of the second half of the network; dynamically construct a Gaussian label distribution based on the depression score labels and the uncertainty deviation value, where the width of the Gaussian label distribution adapts to the uncertainty deviation value; optimize the model parameters using a hybrid loss function, which includes a distribution consistency loss calculated based on the Gaussian label distribution and an adaptive bias regression loss calculated based on the prediction error of the depression recognition model, where the adaptive bias regression loss guides the depression recognition model to output a large uncertainty deviation value when the prediction error is large and a small uncertainty deviation value when the prediction error is small;
(4) Obtain video data of the object to be detected, preprocess it to form a facial image, input it into the trained depression recognition model, and obtain the predicted depression score and uncertainty deviation value.
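
The freezing rule in step (3) can be illustrated with a small sketch. It assumes the "first half" is determined by counting the layers of the feature extraction network in order, which the claim does not specify; the `Layer` class, the `trainable` flag, and the block names are hypothetical and are not tied to any particular deep learning framework.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Layer:
    """Hypothetical stand-in for one block of the feature extraction network."""
    name: str
    trainable: bool = True


def freeze_first_half(layers: List[Layer]) -> List[Layer]:
    """Freeze the first half of the layers and leave the second half trainable.

    Assumption: the split is made by layer count; with an odd number of
    layers the extra layer stays trainable.
    """
    split = len(layers) // 2
    for index, layer in enumerate(layers):
        layer.trainable = index >= split
    return layers


# Example with a six-block backbone: blocks 1 to 3 are frozen.
backbone = freeze_first_half([Layer(f"block{i}") for i in range(1, 7)])
trainable_names = [layer.name for layer in backbone if layer.trainable]
```

In a real framework the same idea would be expressed by disabling gradient updates for the frozen parameters, so that only the second half of the network is adjusted on the small dataset.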

2. The video depression recognition method based on uncertainty perception and label distribution learning according to claim 1, characterized in that step (1) specifically includes:
(1.1) Obtain video data with known depression score labels;
(1.2) From the frame sequence corresponding to each video data, extract a number of frames at fixed intervals to form an image sequence;
(1.3) Use a multi-task cascaded convolutional neural network to detect faces in each frame of the image sequence, and perform an affine transformation based on key points to align the face region and remove background interference, obtaining several facial images;
(1.4) Perform data augmentation on each facial image, including geometric transformation, color change, and random region occlusion, to form the facial image dataset.
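
Step (1.2) can be sketched as follows. This is a minimal illustration, assuming frames are indexed from 0 and that sampling starts at the first frame; the function name `sample_frame_indices` and the interval value are hypothetical, since the claim does not fix them.

```python
from typing import List


def sample_frame_indices(num_frames: int, interval: int) -> List[int]:
    """Return the indices of the frames kept when sampling at a fixed interval.

    Assumption: sampling starts at frame 0 and keeps every `interval`-th frame.
    """
    if interval <= 0:
        raise ValueError("interval must be a positive integer")
    return list(range(0, num_frames, interval))


# Example: a 10-frame clip sampled every 3 frames.
indices = sample_frame_indices(10, 3)
```

The selected frames would then be passed to face detection and alignment as described in step (1.3).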

3. The video depression recognition method based on uncertainty perception and label distribution learning according to claim 1, characterized in that: the first branch output layer consists of a fully connected layer, which maps the extracted image features to a scalar mean depression score and outputs it as the depression score value; the second branch output layer consists of a fully connected layer, a nonlinear activation function, and a numerical truncation operation, and is used to output a positive uncertainty deviation value; the numerical truncation operation limits the uncertainty deviation value to a preset numerical range to prevent numerical overflow and constrain the uncertainty range.
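
The second branch output layer can be sketched as a linear map followed by a positive activation and a clamp. The claim does not name the activation function or the preset range, so the softplus activation and the bounds `sigma_min=0.1` and `sigma_max=10.0` below are assumptions chosen only for illustration, as are the function names.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable softplus, log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def uncertainty_head(features: Sequence[float],
                     weights: Sequence[float],
                     bias: float,
                     sigma_min: float = 0.1,
                     sigma_max: float = 10.0) -> float:
    """Fully connected layer, positive activation, then numerical truncation.

    Assumptions: softplus as the nonlinear activation and
    [sigma_min, sigma_max] as the preset range.
    """
    z = sum(w * f for w, f in zip(weights, features)) + bias
    sigma = softplus(z)
    return min(max(sigma, sigma_min), sigma_max)


# Example: a two-dimensional feature vector.
sigma = uncertainty_head([1.0, 2.0], [0.5, 0.5], 0.0)
```

The clamp keeps the deviation away from zero, which avoids division by a vanishing width when the Gaussian label distribution is built, and caps it so that the distribution cannot become arbitrarily flat.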

4. The video depression recognition method based on uncertainty perception and label distribution learning according to claim 1, characterized in that the Gaussian label distribution is constructed as follows: using the depression score label y as the mean and the uncertainty deviation value σ as the standard deviation, a Gaussian label distribution is generated according to the following formula: [formula], where [symbol] denotes the predicted depression score and [symbol] denotes a normalization operation.
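
The construction in claim 4 can be sketched numerically. This is a minimal sketch, assuming the distribution is evaluated over a discrete grid of candidate scores (the hypothetical `score_bins`, here the integers 0 to 63) and that the normalization operation divides by the sum so the values add up to 1; the function name and the grid are illustrative and are not taken from the patent.

```python
import math
from typing import List, Sequence


def gaussian_label_distribution(y: float,
                                sigma: float,
                                score_bins: Sequence[float]) -> List[float]:
    """Discrete Gaussian centred on the label y with standard deviation sigma.

    Assumption: normalization is division by the sum over all score bins.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    weights = [math.exp(-((s - y) ** 2) / (2.0 * sigma ** 2)) for s in score_bins]
    total = sum(weights)
    return [w / total for w in weights]


# Example: label 20 with a narrow and a wide uncertainty deviation value.
bins = list(range(64))
narrow = gaussian_label_distribution(20.0, 3.0, bins)
wide = gaussian_label_distribution(20.0, 6.0, bins)
```

A larger σ spreads the probability mass over more neighbouring scores, which is how the width of the label distribution adapts to the uncertainty deviation value.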

5. The video depression recognition method based on uncertainty perception and label distribution learning according to claim 1, characterized in that the distribution consistency loss is specifically as follows: [formula], where [symbol] denotes the distribution consistency loss; [symbol] denotes the divergence function; [symbol] denotes the logarithm of the predicted probability distribution of the depression recognition model; and [symbol] denotes the Gaussian label distribution.
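
A distribution consistency loss of this kind can be sketched as follows. The claim refers only to a "divergence function", so the use of the Kullback-Leibler divergence here is an assumption, as are the function names and the choice to obtain log-probabilities from raw scores with a log-softmax.

```python
import math
from typing import List, Sequence


def log_softmax(logits: Sequence[float]) -> List[float]:
    """Log-probabilities from raw scores, shifted by the maximum for stability."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def kl_divergence(target: Sequence[float], log_pred: Sequence[float]) -> float:
    """KL(target || pred), with the prediction given as log-probabilities.

    Assumption: KL divergence stands in for the unspecified divergence function.
    Bins where the target probability is zero contribute nothing.
    """
    return sum(t * (math.log(t) - lp) for t, lp in zip(target, log_pred) if t > 0)


# Example: a three-bin label distribution against a uniform prediction.
label_dist = [0.25, 0.5, 0.25]
loss_uniform = kl_divergence(label_dist, log_softmax([0.0, 0.0, 0.0]))
loss_exact = kl_divergence(label_dist, [math.log(p) for p in label_dist])
```

The loss is zero when the predicted distribution matches the Gaussian label distribution and grows as the two diverge.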

6. The video depression recognition method based on uncertainty perception and label distribution learning according to claim 1, characterized in that the adaptive bias regression loss is specifically as follows: [formula], [formula], where [symbol] denotes the adaptive bias regression loss; [symbol] denotes the mean square error function; [symbol] and [symbol] denote the depression score and the variance value predicted by the depression recognition model, respectively; [symbol] denotes the dynamic target deviation value; [symbol] denotes the depression score label; [symbol] is the error sensitivity coefficient; and [symbol] is the fundamental uncertainty constant.
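
One way such a loss could behave is sketched below. The exact formulas are not recoverable from this text, so the sketch assumes that the dynamic target deviation value grows linearly with the absolute prediction error, scaled by the error sensitivity coefficient and offset by the fundamental uncertainty constant, and that the loss is the mean square error between the predicted deviation and this target. The names `alpha`, `sigma_base`, and both function names are hypothetical.

```python
from typing import Sequence


def dynamic_target_sigma(y_pred: float, y_true: float,
                         alpha: float = 0.5, sigma_base: float = 1.0) -> float:
    """Assumed form of the target: alpha * |prediction error| + sigma_base."""
    return alpha * abs(y_pred - y_true) + sigma_base


def adaptive_bias_regression_loss(y_preds: Sequence[float],
                                  sigma_preds: Sequence[float],
                                  y_trues: Sequence[float],
                                  alpha: float = 0.5,
                                  sigma_base: float = 1.0) -> float:
    """Mean square error between predicted deviations and their dynamic targets."""
    targets = [dynamic_target_sigma(p, t, alpha, sigma_base)
               for p, t in zip(y_preds, y_trues)]
    squared = [(s - g) ** 2 for s, g in zip(sigma_preds, targets)]
    return sum(squared) / len(squared)


# Example: prediction 10 against label 14 gives a target deviation of 3.0.
target = dynamic_target_sigma(10.0, 14.0)
loss_matched = adaptive_bias_regression_loss([10.0], [3.0], [14.0])
loss_underconfident = adaptive_bias_regression_loss([10.0], [1.0], [14.0])
```

Under these assumptions the loss is minimized when the model reports a large deviation for samples with a large prediction error and a deviation close to the base constant for samples it predicts accurately, which matches the behaviour described in claim 1.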

7. A video depression recognition system based on uncertainty perception and label distribution learning, characterized in that it includes:
a dataset formation module, used to acquire video data with known depression score labels, extract facial images of several facial regions from the frame sequence corresponding to each video data, and perform preprocessing to form a facial image dataset;
a model building module, used to build a depression recognition model, which includes a feature extraction network, a first branch output layer, and a second branch output layer; the feature extraction network is a pre-trained deep convolutional neural network used to extract image features from the facial images; the first branch output layer is used to predict the depression score based on the image features; the second branch output layer is used to predict the uncertainty deviation value of the depression recognition model based on the image features;
a model training module, used to train the depression recognition model using the facial image dataset, with the following training rules: freeze the parameters of the first half of the feature extraction network and train only the parameters of the second half of the network; dynamically construct a Gaussian label distribution based on the depression score labels and the uncertainty deviation value, where the width of the Gaussian label distribution adapts to the uncertainty deviation value; optimize the model parameters using a hybrid loss function, which includes a distribution consistency loss calculated based on the Gaussian label distribution and an adaptive bias regression loss calculated based on the prediction error of the depression recognition model, where the adaptive bias regression loss guides the depression recognition model to output a large uncertainty deviation value when the prediction error is large and a small uncertainty deviation value when the prediction error is small;
a recognition module, used to acquire video data of the object to be detected, preprocess it to form a facial image, input it into the trained depression recognition model, and obtain the predicted depression score and uncertainty deviation value.

8. A computer program product, comprising a computer program, characterized in that when the computer program is executed by a processor, it implements the method of any one of claims 1-6.

9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor executes the computer program to implement the method of any one of claims 1-6.

10. A computer-readable storage medium having a computer program/instructions stored thereon, characterized in that the computer program/instructions, when executed by a processor, implement the method of any one of claims 1-6.