Deep learning-based stress detection method and apparatus

By using a deep learning-based stress detection method and leveraging facial frames and multi-layered attention mechanisms in video samples, a stress detection model is constructed, which solves the problem of poor accuracy in contactless detection and achieves personalized and accurate contactless psychological stress detection.

CN119541008B · Active · Publication Date: 2026-06-23 · TSINGHUA UNIVERSITY

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: TSINGHUA UNIVERSITY
Filing Date: 2023-08-29
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing contactless psychological stress detection methods have poor accuracy. Traditional questionnaire surveys are highly subjective, while sensor-based methods rely on wearable devices and are costly, making large-scale contactless detection difficult.

Method used

A deep learning-based stress detection method is adopted. By acquiring facial frames from video samples, a multi-layer attention mechanism is used to extract emotion-oriented descriptive features and group classification results to construct a stress detection model. Combined with user personalized features and group information, contactless psychological stress detection is carried out.

Benefits of technology

It achieves accurate, non-contact detection of psychological stress levels, improves the accuracy of stress detection, is suitable for large-scale personalized stress detection, and reduces equipment costs.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides a stress detection method and device based on deep learning. The method comprises: acquiring a video to be detected, and extracting facial frames of a target user from the video to be detected; inputting the facial frames of the target user into a pre-trained stress detection model to obtain a stress detection result output by the stress detection model; wherein the stress detection model is trained based on facial frame samples, the sample stress detection results corresponding to the facial frame samples, sample emotion-oriented description features, and sample group classification results; the facial frame samples are extracted from video samples; the emotion-oriented description features are obtained by superimposing representation features extracted from the facial frame samples through a multi-layer attention mechanism; and the group classification result is obtained from the facial frame samples through a group attention mechanism. This solves the technical problem of poor detection accuracy of contactless detection methods in the prior art.


Description

Technical Field

[0001] This invention relates to the field of artificial intelligence technology, and in particular to a stress detection method and apparatus based on deep learning.

Background Technology

[0002] With the fast pace of modern life and intensified social competition, people are under unprecedented psychological pressure. It is crucial to detect stress in a timely manner through stress monitoring before the adverse consequences of excessive stress occur.

[0003] Traditional stress assessment methods rely on psychological questionnaires or professional psychological counseling, which are usually applicable only to a small number of people and cannot effectively cope with large-scale testing needs. Furthermore, since the results of questionnaire surveys depend on the respondent's answers to the relevant questions, the measurement of stress levels is relatively subjective, especially when the respondent selectively expresses their psychological state, which introduces significant bias into the test results.

[0004] To overcome the limitations of questionnaire surveys, some existing technologies use specialized sensors or wearable devices (such as mobile phones with embedded sensors) to detect psychological stress by sensing physiological signals (such as heart rate variability, electrocardiogram, electroencephalogram, electromyography, blood pressure, and galvanic skin response). However, while this detection method offers high accuracy, it requires users to wear wearable devices or specialized sensors, which makes contactless measurement difficult, and the equipment is costly.

[0005] Therefore, providing a stress detection method and device based on deep learning to accurately detect the psychological stress level of target users without contact, thereby achieving non-contact stress detection and improving the accuracy of stress detection, has become an urgent problem to be solved by those skilled in the art.

Summary of the Invention

[0006] This invention provides a stress detection method and device based on deep learning to solve the technical problem of poor detection accuracy of existing contactless detection methods, aiming to achieve more accurate contactless detection of the psychological stress level of target users, realize contactless stress detection, and improve the accuracy of stress detection.

[0007] This invention provides a deep learning-based stress detection method, comprising:

[0008] Acquire the video to be detected and extract the facial frames of the target user from the video to be detected;

[0009] The target user's facial frame is input into a pre-trained stress detection model to obtain the stress detection result output by the stress detection model;

[0010] The stress detection model is trained based on facial frame samples, as well as the corresponding sample stress detection results, sample emotion-oriented description features, and sample group classification results.

[0011] The facial frame samples are extracted from video samples; the emotion-oriented description features are obtained by superimposing the representation features extracted from the facial frame samples through a multi-layer attention mechanism; and the group classification result is obtained from the facial frame samples through a group attention mechanism.

[0012] According to the deep learning-based stress detection method provided by the present invention, facial frame samples are extracted from video samples, specifically including:

[0013] Acquire video samples and extract frames that meet preset image quality requirements from the video samples;

[0014] Based on the pre-defined emotion classification, the probability of the detected emotion classification in each frame is calculated to obtain a sequence of emotion classification probabilities.

[0015] Extract a predetermined number of emotion classification probabilities from the sequence of emotion classification probabilities;

[0016] The frames corresponding to the predetermined number of emotion classification probabilities are used as the facial frame samples.

[0017] According to the deep learning-based stress detection method provided by the present invention, the multi-layer attention mechanism includes at least an emotion expression attention mechanism, a continuous negative emotion attention mechanism, and a video frame self-attention mechanism with emotional information.

[0018] Obtaining the emotion-oriented description features by superimposing the representation features extracted from the facial frame samples through the multi-layer attention mechanism specifically includes:

[0019] Based on the emotion expression attention mechanism, feature representations are extracted from the facial frame samples to obtain frame representation results containing emotional information.

[0020] Based on the negative emotion period and the continuous negative emotion attention mechanism, a video representation result with negative-emotion frame enhancement is extracted from the frame representation results containing emotional information.

[0021] Based on the video frame self-attention mechanism with emotional information, position information is embedded into the video representation result with negative-emotion frame enhancement to obtain the emotion-infused video frame self-attention result.

[0022] According to the deep learning-based stress detection method provided by the present invention, feature representations are extracted from the facial frame samples based on the emotion expression attention mechanism to obtain frame representation results containing emotional information. Specifically, it includes:

[0023] Construct a vector A of the facial frame samples X_F relative to the sample stress detection result X_C;

[0024] Normalize each element in the vector A to [0,1], and use the normalization result to represent the attention weight relative to the target emotion category;

[0025] Based on the attention weights relative to the target emotion category and the vector, obtain the categorical emotion attention X_A;

[0026] Align the categorical emotion attention X_A with each frame in the facial frame samples X_F, and compute the frame representation containing emotional information using a linear function.

[0027] According to the deep learning-based stress detection method provided by this invention, extracting the video representation result with negative-emotion frame enhancement from the frame representation results containing emotional information, based on the negative emotion period and the continuous negative emotion attention mechanism, specifically includes:

[0028] Extract, from the frame representation results containing emotional information, a negative emotion segment p of consecutive facial frames;

[0029] Based on the duration of the negative emotion segment p and its position within the segment, the frame representation of each frame in the negative emotion segment is weighted and enhanced to obtain the video representation of the facial frame sample that includes negative emotion frame enhancement.

[0030] According to the deep learning-based stress detection method provided by the present invention, obtaining the emotion-infused video frame self-attention result by embedding position information into the video representation result with negative-emotion frame enhancement, based on the video frame self-attention mechanism with emotional information, specifically includes:

[0031] Based on pre-set frame position information parameters, a position embedding operation is performed on each frame representation in the video representation result with negative-emotion frame enhancement to obtain the embedded representation result.

[0032] Based on the video frame self-attention mechanism with emotional information, frame self-attention is computed on the embedded representation result to obtain the emotion-infused video frame self-attention result.

[0033] The present invention also provides a stress detection device based on deep learning, comprising:

[0034] The video acquisition unit is used to acquire the video to be detected and extract the facial frames of the target user in the video to be detected.

[0035] The result generation unit is used to input the target user's facial frame into a pre-trained stress detection model to obtain the stress detection result output by the stress detection model.

[0036] The stress detection model is trained based on facial frame samples, as well as the corresponding sample stress detection results, sample emotion-oriented description features, and sample group classification results.

[0037] The facial frame samples are extracted from video samples; the emotion-oriented description features are obtained by superimposing the representation features extracted from the facial frame samples through a multi-layer attention mechanism; and the group classification result is obtained from the facial frame samples through a group attention mechanism.

[0038] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the deep learning-based stress detection method described above.

[0039] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the deep learning-based stress detection method described above.

[0040] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the deep learning-based stress detection method described above.

[0041] This invention provides a deep learning-based stress detection method. It involves acquiring a video to be detected and extracting facial frames of a target user from the video. These facial frames are then input into a pre-trained stress detection model to obtain the stress detection result output by the model. The stress detection model is trained based on facial frame samples, corresponding sample stress detection results, sample emotion-oriented descriptive features, and sample group classification results. The facial frame samples are extracted from video samples. The emotion-oriented descriptive features are obtained by superimposing representation features extracted from the facial frame samples through a multi-layer attention mechanism. The group classification result is obtained from the facial frame samples through a group attention mechanism.

[0042] Thus, the stress detection method provided by this invention can realize personalized stress detection based on ubiquitous video and deep learning. It adopts a three-layer attention mechanism, focusing on the user's specific emotional facial expressions, the duration of continuous negative emotions, and the transformation and self-association of emotional frames in the video. It can accurately identify stress levels, thereby solving the technical problem of poor detection accuracy of existing contactless detection methods. It can accurately detect the psychological stress level of the target user without contact, realize contactless stress detection, and improve the accuracy of stress detection.

Attached Figure Description

[0043] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0044] Figure 1 is the first flowchart of the deep learning-based stress detection method provided by the present invention;

[0045] Figure 2 is the second flowchart of the deep learning-based stress detection method provided by the present invention;

[0046] Figure 3 is the third flowchart of the deep learning-based stress detection method provided by the present invention;

[0047] Figure 4 is the fourth flowchart of the deep learning-based stress detection method provided by the present invention;

[0048] Figure 5 is the fifth flowchart of the deep learning-based stress detection method provided by the present invention;

[0049] Figure 6 is the sixth flowchart of the deep learning-based stress detection method provided by the present invention;

[0050] Figure 7 is the seventh flowchart of the deep learning-based stress detection method provided by the present invention;

[0051] Figure 8 is the eighth flowchart of the deep learning-based stress detection method provided by the present invention;

[0052] Figure 9 is the ninth flowchart of the deep learning-based stress detection method provided by the present invention;

[0053] Figure 10 is a schematic diagram of the deep learning-based stress detection device provided by the present invention;

[0054] Figure 11 is a schematic diagram of the structure of the electronic device provided by the present invention.

[0055] Reference numerals:

[0056] 1001: Video acquisition unit; 1002: Result generation unit.

Detailed Implementation

[0057] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0059] The deep learning-based stress detection method of the present invention is described below with reference to Figures 1-9.

[0060] In one specific implementation, as shown in Figure 1, the deep learning-based stress detection method provided by this invention includes the following steps:

[0061] S110: Acquire the video to be detected and extract the facial frames of the target user from the video to be detected. In specific application scenarios, the video to be detected can be surveillance video, such as surveillance video of a target area during a target time period. With the widespread deployment of contactless cameras in public spaces and specific locations, it has become possible to detect psychological stress from facial expressions and movements captured on video. Compared with traditional stress scales and dedicated sensors, surveillance cameras offer convenience, low cost, and non-invasiveness.

[0062] S120: Input the target user's facial frames into a pre-trained stress detection model to obtain the stress detection result output by the stress detection model; wherein, the stress detection model is trained based on facial frame samples, as well as the corresponding sample stress detection result, sample emotion-oriented description features, and sample group classification result; the facial frame samples are extracted from video samples; the emotion-oriented description features are obtained by superimposing the representation features extracted from the facial frame samples through a multi-layer attention mechanism; the group classification result is obtained from the facial frame samples through a group attention mechanism.

[0063] In addition to a multi-layered attention mechanism, the model provided in this embodiment also incorporates group classification results to improve prediction accuracy. That is, it learns and mines personalized user characteristics from videos, such as gender, age, and extraversion, to establish user groups and further enhance video-based stress detection performance.

[0064] When learning and constructing groups based on user personalized features, an RDF-based group user graph is built based on user demographic and personality characteristics automatically inferred from videos. Since users' gender, age, and personality traits influence their facial expressions, these traits are inferred from the videos themselves, in addition to the video content. Specifically, stress recognition capabilities are enhanced through a group attention mechanism, including the following steps:

[0065] S1: Use RDF (Resource Description Framework) statements to describe user information and construct a user graph based on RDF;

[0066] S2: Based on video, detect and identify user age, gender, and extraversion characteristics;

[0067] S3: Group attention mechanism based on user personality characteristics: In addition to the user's individual video content, it also pays attention to other users who share the same personality characteristics as the user, and enriches the user representation through group information.
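Steps S1 to S3 can be illustrated with a minimal sketch, assuming the RDF statements are held as plain (subject, predicate, object) tuples. The predicate names (`hasGender`, `hasAgeGroup`, `hasExtraversion`), the user identifiers, and the helper `same_group` are hypothetical and are not taken from the patent; the sketch only shows how users sharing the same characteristics could be looked up before group attention is applied.

```python
from collections import defaultdict

# Hypothetical RDF-style statements: (subject, predicate, object).
TRIPLES = [
    ("user:1", "hasGender", "female"),
    ("user:1", "hasAgeGroup", "20-29"),
    ("user:1", "hasExtraversion", "high"),
    ("user:2", "hasGender", "female"),
    ("user:2", "hasAgeGroup", "20-29"),
    ("user:2", "hasExtraversion", "high"),
    ("user:3", "hasGender", "male"),
    ("user:3", "hasAgeGroup", "30-39"),
    ("user:3", "hasExtraversion", "low"),
]


def build_user_graph(triples):
    """Index statements by subject: user -> {predicate: object}."""
    graph = defaultdict(dict)
    for subject, predicate, obj in triples:
        graph[subject][predicate] = obj
    return dict(graph)


def same_group(graph, user):
    """Return the other users whose characteristics all match `user`."""
    profile = graph[user]
    return sorted(u for u, p in graph.items() if u != user and p == profile)


graph = build_user_graph(TRIPLES)
group_of_user1 = same_group(graph, "user:1")
```

In this sketch a group is defined by an exact match on all three characteristics; a looser grouping rule would be an equally plausible reading of the text.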

[0068] Because facial expressions contain rich information reflecting a person's stress state, and stress is closely linked to negative emotions such as anger, disgust, fear, and sadness, stress detection can theoretically be transformed into instantaneous negative emotional state detection. The detection results for each frame of video are then aggregated, and the stress level is determined based on the percentage of frames displaying negative emotions. Compared to traditional techniques that only consider negative emotions, this embodiment combines all of the user's positive, neutral, and negative emotions. Furthermore, it analyzes the dynamic changes in the user's emotions, rather than simply aggregating and statistically analyzing negative emotions, to assess stress. Particular attention is paid to the duration of negative emotions and the facial expression features associated with various emotions. For example, these emotions can include eight emotion types: happiness, surprise, fear, sadness, anger, disgust, contempt, and neutrality.
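For reference, the frame-percentage baseline that this embodiment contrasts itself with can be sketched as follows. The function name and the exact set of negative labels are assumptions for illustration; the labels are drawn from the eight emotion types listed in the text.

```python
# Assumed set of negative emotion labels (illustrative).
NEGATIVE = {"fear", "sadness", "anger", "disgust", "contempt"}


def negative_frame_ratio(frame_emotions):
    """Fraction of frames whose detected emotion is negative."""
    if not frame_emotions:
        return 0.0
    negative = sum(1 for e in frame_emotions if e in NEGATIVE)
    return negative / len(frame_emotions)


ratio = negative_frame_ratio(["neutrality", "sadness", "anger", "happiness"])
```

Such a ratio ignores how long negative emotions persist and in what order emotions occur, which is the gap the duration-aware attention mechanisms described in this embodiment are meant to address.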

[0069] In practical applications, the video source can be surveillance video. Since surveillance video contains a lot of noise, it is necessary to preprocess the surveillance video data to ensure the accuracy of the results, thereby providing data support for the subsequent generation of video representation.

[0070] In some embodiments, according to the deep learning-based stress detection method provided by the present invention, extracting the facial frame samples from video samples, as shown in Figure 2, specifically includes:

[0071] S210: Obtain video samples and extract frames that meet preset image quality requirements from the video samples; that is, obtain a high-quality sequence of user face frames from the input surveillance video.

[0072] S220: Based on the pre-defined emotion classification, calculate the probability of the detected emotion classification in each frame to obtain a sequence of emotion classification probabilities; encode the user's facial frames and identify different emotion classifications from each facial frame.

[0073] S230: Extract a preset number of emotion classification probabilities from the sequence of emotion classification probabilities;

[0074] S240: The frames corresponding to the obtained preset number of emotion classification probabilities are used as the facial frame samples. Specifically, for each emotion category (e.g., happiness, surprise, fear, sadness, anger, disgust, contempt, and neutrality), the most characteristic facial frame is selected from the user's video, which can also be understood as the facial frame corresponding to the emotion category with the highest weight.

[0075] In a specific use case, user surveillance video is preprocessed to generate an internal video representation for further processing. MTCNN face detection is used to capture the face region in each frame of the original video, and the FaceNet face recognition system is used to identify the user's face. Low-quality facial frames (e.g., those with facial occlusion, poor lighting, or blurriness) are filtered out using the SER-FIQ model. In this way, F user facial frames, denoted as X_F, are obtained from the input surveillance video. Theoretically, given the high correlation between stress and emotion and the excellent performance of facial emotion recognition, stress detection models should be built on high-level emotional features rather than on low-level features such as facial landmarks and facial action units, in order to improve model performance.

[0076] When encoding each facial frame, models such as ResNet, Self-Cure Network (SCN), or Vision Transformer (ViT) can be used to identify eight types of emotions from each facial frame, which can include happiness, surprise, fear, sadness, anger, disgust, contempt, and neutrality.

[0077] Specifically, the emotion-oriented facial frame representation is first defined as follows:

[0078] X_F = (x_1, x_2, …, x_F)

[0079] where x_i (i = 1, 2, …, F) denotes the i-th facial frame.

[0080] Let the sequence of categorical emotion probabilities detected in the facial frames of X_F be:

[0081] X_E = (e_1, e_2, …, e_F)

[0082] where e_i = (e_{i,1}, e_{i,2}, …, e_{i,C}), e_{i,j} ∈ [0,1]; i = 1, 2, …, F; j = 1, 2, …, C.

[0083] Furthermore, the eight most emotional facial frames are selected from X_F to form the following representation:

[0084]

[0085] where the facial frame corresponding to the i-th emotion is selected from X_F based on X_E, and the selection rule is as follows:

[0086]

[0087] In this way, the eight most emotional facial frames selected will serve as a reference to provide data support for distinguishing frame-based emotions in the subsequent triple attention mechanism.
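The selection of reference frames can be sketched as follows. The selection rule itself is not reproduced in this text, so the sketch assumes an argmax rule: for each emotion category, pick the frame with the highest probability for that category, which matches the description of the most characteristic frame per category. The function name and the toy probabilities (three categories rather than eight, to keep the example short) are illustrative.

```python
def select_reference_frames(x_e):
    """For each emotion category j, return the index of the frame whose
    probability e[i][j] is highest (assumed argmax rule)."""
    if not x_e:
        raise ValueError("x_e must contain at least one frame")
    num_classes = len(x_e[0])
    return [max(range(len(x_e)), key=lambda i: x_e[i][j])
            for j in range(num_classes)]


# Toy X_E: 3 frames, 3 emotion categories.
x_e = [
    [0.7, 0.2, 0.1],
    [0.1, 0.1, 0.8],
    [0.2, 0.6, 0.2],
]
reference_indices = select_reference_frames(x_e)
```

With ties, `max` keeps the earliest frame; the patent text does not say how ties are handled, so this is a choice of the sketch.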

[0088] The method provided by this invention employs a three-layer attention mechanism that focuses on the user's specific emotional facial expressions, the duration of continuous negative emotions, and the transformation and self-association of emotional frames within the video. In other words, when performing emotion-oriented video learning, a deep learning framework attends to the user's emotional expression features, the duration of continuous negative emotions, and emotional dynamics through three attention mechanisms, thereby improving accuracy by mining information from the user's facial frames. Specifically, the three attention mechanisms are an emotion expression attention mechanism, a continuous negative emotion attention mechanism, and an emotion-infused video frame self-attention mechanism. The emotion expression attention mechanism focuses on the emotional expression in each frame. The continuous negative emotion attention mechanism increases the weight of continuous negative emotions based on the concept of a negative emotion period. A negative emotion period is a segment containing a series of consecutive frames that meets the following conditions: first, each frame displays a negative emotion, such as fear, sadness, anger, disgust, and/or contempt; second, the time interval between any two consecutive frames is less than a certain threshold, such as 30 minutes or 15 minutes in monitoring. The emotion-infused video frame self-attention mechanism adds an emotion frame self-attention layer that models the frame sequence as a video representation.

[0089] More specifically, in some embodiments, according to the deep learning-based stress detection method provided by the present invention, the multi-layer attention mechanism includes at least an emotion expression attention mechanism, a continuous negative emotion attention mechanism, and a video frame self-attention mechanism with emotional information;

[0090] Obtaining the emotion-oriented description features by superimposing the representation features extracted from the facial frame samples through the multi-layer attention mechanism, as shown in Figure 3, specifically includes:

[0091] S310: Based on the emotion expression attention mechanism, feature representations are extracted from the facial frame samples to obtain frame representation results containing emotional information.

[0092] S320: Based on the negative emotion period and the continuous negative emotion attention mechanism, a video representation result with negative-emotion frame enhancement is extracted from the frame representation results containing emotional information.

[0093] S330: Based on the video frame self-attention mechanism with emotional information, position information is embedded into the video representation result with negative-emotion frame enhancement to obtain the emotion-infused video frame self-attention result.

[0094] In step S310, according to the deep learning-based stress detection method provided by the present invention, extracting feature representations from the facial frame samples based on the emotion expression attention mechanism to obtain frame representation results containing emotional information, as shown in Figure 4, specifically includes:

[0095] S410: Construct a vector A of the facial frame samples X_F relative to the sample stress detection result X_C;

[0096] S420: Normalize each element in the vector A to [0,1], and use the normalization result to represent the attention weight relative to the target emotion category;

[0097] S430: Based on the attention weights relative to the target emotion category and the vector, obtain the categorical emotion attention X_A;

[0098] S440: Align the categorical emotion attention X_A with each frame in the facial frame samples X_F, and compute the frame representation containing emotional information using a linear function.

[0099] Specifically, in the process of using the emotion expression attention mechanism to obtain frame representations containing emotional information, as shown in Figure 5 and taking eight commonly used emotion types as an example, each frame in X_F is compared with the eight most emotional facial frames, and the closeness is computed as an 8-dimensional vector A. After the Softmax function is applied, each element of A is normalized to [0,1] to represent the attention weight relative to a specific emotion category.

[0100] Let [Q, K, V] = [X_F × W_q, X_C × W_k, X_C × W_v], where W_q, W_k, and W_v are trainable parameters that project X_F and X_C to the same dimension so that they can be multiplied, and d_k is the scaling parameter.

[0101]

[0102] X_A = A × V
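The attention step can be sketched in plain Python. The formula for A in paragraph [0101] is not reproduced in this text, so the sketch assumes the standard scaled dot-product form A = Softmax(Q × K^T / sqrt(d_k)), which is consistent with the Softmax normalization and the scaling parameter d_k described above. Q, K, and V are taken as already projected; the helper names and the toy values (three reference frames rather than eight) are illustrative.

```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [v / total for v in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def emotion_attention(q, k, v):
    """Return (A, X_A), assuming A = softmax(Q K^T / sqrt(d_k)).

    A holds one row of reference-frame weights per input frame,
    and X_A = A x V as in paragraph [0102].
    """
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    a = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return a, matmul(a, v)


# Toy input: 2 frames (Q) against 3 reference frames (K = V).
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
a, x_a = emotion_attention(q, k, k)
```

Each row of `a` sums to 1, so its entries can be read directly as attention weights in [0,1] relative to each reference frame.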

[0103] The categorical emotion attention X_A is aligned with each frame in X_F, and the dimensionality is reduced using a linear function to lower computational complexity.

[0104]

[0105]

[0106] where the parameters of the linear function are trainable, and the resulting frame representation containing emotional information provides a reference dimension for further emotion identification.

[0107] In step S320, according to the deep learning-based stress detection method provided by the present invention, extracting the video representation result with negative-emotion frame enhancement from the frame representation results containing emotional information, based on the negative emotion period and the continuous negative emotion attention mechanism, specifically includes:

[0108] Extract, from the frame representation results containing emotional information, a negative emotion segment p of consecutive facial frames;

[0109] Based on the duration of the negative emotion segment p and the position of each frame within the segment, the frame representation of each frame in the negative emotion segment is weighted and enhanced to obtain the video representation of the facial frame sample with negative-emotion frame enhancement. In principle, the weight of the user's consecutive negative emotions is increased based on the concept of a negative emotion period. A negative emotion period is a segment containing a series of consecutive frames that meets the following conditions: (a) each frame displays a negative emotion (fear, sadness, anger, disgust, and/or contempt), and (b) the time interval between any two consecutive frames is less than a certain threshold (e.g., 30 minutes in monitoring). Condition (b) is needed because some low-quality frames are discarded during data preprocessing, so the time interval between consecutive frames may be uneven; it requires that any two consecutive frames not be too far apart. Furthermore, frames with positive or neutral emotions do not appear in any negative emotion segment.
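Conditions (a) and (b) can be turned into a small segmentation routine. This is a sketch under stated assumptions: each frame is given as a (timestamp in minutes, emotion label) pair sorted by time, the 30-minute threshold is the example value from the text, and runs containing a single frame are dropped, following the bound 1 < k ≤ F on segment length. The function name and input layout are illustrative.

```python
NEGATIVE = {"fear", "sadness", "anger", "disgust", "contempt"}


def negative_periods(frames, max_gap_minutes=30.0):
    """Split time-sorted (timestamp_minutes, emotion) frames into negative
    emotion periods, returned as lists of frame indices."""
    periods, current = [], []
    for idx, (t, emotion) in enumerate(frames):
        if emotion in NEGATIVE:
            # Condition (b): a gap at or above the threshold ends the run.
            if current and t - frames[current[-1]][0] >= max_gap_minutes:
                if len(current) > 1:
                    periods.append(current)
                current = []
            current.append(idx)
        else:
            # Condition (a): positive or neutral frames end the run.
            if len(current) > 1:
                periods.append(current)
            current = []
    if len(current) > 1:
        periods.append(current)
    return periods


frames = [
    (0, "sadness"), (10, "anger"), (20, "neutrality"),
    (25, "fear"), (70, "fear"), (80, "disgust"),
]
periods = negative_periods(frames)
```

In the example, the frame at minute 25 forms no period on its own because the next negative frame is 45 minutes later, while the frames at minutes 70 and 80 form a second period.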

[0110] Accordingly, the continuous negative emotion attention mechanism is used to obtain the video representation of the facial frame samples enhanced with negative emotion frames. As shown in Figure 6, let $p$ denote a negative emotion segment consisting of $k$ consecutive facial frames, with segment length $|p| = k$ ($1 < k \le F$).

[0111] Based on the duration of $p$ and the position $i$ of a frame within the segment, the representation of each frame in the segment is enhanced to

[0112]

[0113]

[0114] $\Lambda(p) = |p| \cdot W_2 + b_2$

[0115] The function $\Lambda(p)$ returns the additional weight for the segment based on the duration of $p$, and the function $\lambda(p, i)$ returns the additional weight for the $i$-th frame in $p$ based on $\Lambda(p)$. The parameters, including $W_2$ and $b_2 \in [0, 1]$, are all trainable.

[0116] As the above formulas show, the longer the negative emotion cycle, the greater the increase in the cycle weight; and the later a frame lies within the negative emotion cycle, the greater the increase in its frame weight. Any frame in $X_F$ that does not show a negative emotion belongs to no negative emotion segment, so its representation remains identical to its original frame representation containing emotional information. The resulting sequence therefore serves as the video representation enhanced with negative emotion frames.
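The weighting behaviour described here can be sketched in Python under explicit assumptions. Only $\Lambda(p) = |p| \cdot W_2 + b_2$ comes from the text. The form of $\lambda(p, i)$ used below (a share of $\Lambda(p)$ that grows linearly with the position i), the multiplicative enhancement `(1 + λ) · x`, and the treatment of `w2` as a scalar are hypothetical choices that merely reproduce the two stated properties: longer segments and later frames receive larger increases.

```python
from typing import List


def segment_weight(length: int, w2: float, b2: float) -> float:
    """Lambda(p) = |p| * W2 + b2, as in paragraph [0114] (w2 is a scalar here)."""
    return length * w2 + b2


def frame_weight(length: int, i: int, w2: float, b2: float) -> float:
    """Hypothetical lambda(p, i): grows linearly with the 1-based position i."""
    return segment_weight(length, w2, b2) * i / length


def enhance_video(
    frames: List[List[float]], segments: List[List[int]], w2: float, b2: float
) -> List[List[float]]:
    """Scale frames inside negative segments; leave all other frames unchanged."""
    enhanced = [list(f) for f in frames]
    for seg in segments:
        k = len(seg)
        for pos, idx in enumerate(seg, start=1):
            scale = 1.0 + frame_weight(k, pos, w2, b2)  # assumed enhancement rule
            enhanced[idx] = [scale * v for v in frames[idx]]
    return enhanced


video = [[1.0, 2.0], [1.0, 1.0], [2.0, 0.0]]
# Frames 1 and 2 form one negative emotion segment; frame 0 is untouched.
print(enhance_video(video, segments=[[1, 2]], w2=0.1, b2=0.2))
```

In a trained model `w2` and `b2` would be learned; the constants above are placeholders.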

[0117] In step S330, according to the deep learning-based stress detection method provided by the present invention, position information is embedded into the video representation enhanced with negative emotion frames, based on the video frame self-attention mechanism with emotional information, to obtain the self-attention result of video frames with emotional information. Specifically, this includes:

[0118] Based on preset frame position information parameters, performing a position embedding operation on each frame representation in the video representation enhanced with negative emotion frames to obtain the embedded representation result;

[0119] Based on the video frame self-attention mechanism with emotional information, performing frame self-attention calculation on the embedded representation result to obtain the self-attention result of video frames with emotional information.

[0120] In a specific use case, when the video frame self-attention mechanism with emotional information is used to obtain the self-attention result of video frames with emotional information, an emotion-based frame self-attention layer is embedded into the mechanism to model the frame sequence as a video representation. As shown in Figure 7, a position embedding operation is performed on each frame representation in the sequence.

[0121]

[0122] Here, the position term carries the frame position information parameters. A trainable parameter matrix is added at the head of the frame sequence to serve as the classification token (CLS_TOKEN).

[0123] Mapping matrices are then set up to project the embedded representation, frame self-attention is performed, and the frame self-attention weight matrix is calculated. The self-attention result of video frames with emotional information is then obtained according to the following formula.

[0124]

[0125] Here, the attention weight is calculated from $q_i' \in Q'$ and $k_j' \in K'$. The video representation after the self-attention mechanism is calculated as follows:

[0126]

[0127] After normalization, as shown in Figure 8, the deep learning network learns a video representation, which is then fed into an MLP block consisting of two fully connected layers (for feature projection), a GELU nonlinear activation layer, and a dropout layer (to prevent overfitting). Steps S310-S330 are stacked M times (where M = 16), and the resulting representation is extracted as the representation of the video. This video representation is subsequently integrated with the other attributes of the user group.
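The position embedding, the classification token, and the frame self-attention step can be sketched as follows. Because the formulas are not reproduced in the text, the sketch assumes standard scaled dot-product attention and, for brevity, uses the embedded sequence directly as queries, keys, and values instead of learned mapping matrices. The MLP block, normalization, and the M-fold stacking are omitted. All function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def embed(frames: List[Vector], positions: List[Vector], cls_token: Vector) -> List[Vector]:
    """Add a position vector to each frame and prepend the CLS token."""
    embedded = [[f + p for f, p in zip(frame, pos)] for frame, pos in zip(frames, positions)]
    return [list(cls_token)] + embedded


def self_attention(seq: List[Vector]) -> List[Vector]:
    """Scaled dot-product self-attention with Q = K = V = seq (simplifying assumption)."""
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)  # one row of the frame self-attention weight matrix
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
positions = [[0.0, 0.0]] * 3  # placeholder position parameters
seq = embed(frames, positions, cls_token=[0.0, 0.0])
attended = self_attention(seq)
video_repr = attended[0]  # the CLS position summarizes the frame sequence
print(video_repr)
```

Reading the output at the CLS position is the usual way a prepended classification token is used in Transformer encoders; whether the patent reads the video representation from exactly this position is not stated in the translated text.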

[0128] Furthermore, in personalized feature learning and group attention, because users' individual characteristics and personality traits affect their facial expressions, other user attribute values, such as gender, age, and personality traits, are inferred from the video in addition to the video content and are incorporated to enhance stress recognition.

[0129] First, a user graph is constructed based on RDF. Figure 8 shows a sample RDF-based user graph, in which straight edges represent "containment" or "weight" relationships and curved edges represent "value" relationships. To avoid line intersections, the graph shows only the information corresponding to user 1. The user graph contains four attributes (video, gender, age, and personality). The value of the "video" attribute is the multi-head video representation output by the attention mechanism. The value of the "gender" attribute is "male" or "female," and the value of the "age" attribute is "young," "middle-aged," or "elderly." Among the Big Five personality traits (extroversion, agreeableness, conscientiousness, neuroticism, and openness), this embodiment considers only the "extroversion" dimension of the "personality" attribute, owing to the feasibility of inferring it from video and its potential impact on stress detection. The weights of the three attributes (gender, age, and personality) apply to all users and represent their importance in the stress detection task, and the values of each attribute for each user sum to 1.0.

[0130] Next, the user's age and gender are detected from the video. In many practical applications, gender and age are regarded as two important biometric traits, and estimating them from facial expressions displayed in photos or videos has been studied extensively with fruitful results. Inspired by the powerful global relation modeling capabilities of Vision Transformer (ViT) in computer vision tasks, a deep learning architecture based on PS-ViT and TimeSformer is constructed. This architecture uses PS-ViT to divide each video frame into a series of salient patches, and then uses TimeSformer to compute patch-level attention, including self-attention among all patches in the same frame and attention among patches at the same position in different frames. The model is trained separately for age estimation and gender estimation.

[0131] To detect user personality traits from video, the user's personality trait (extroversion) is estimated using the aforementioned PS-ViT and patch-level attention. Because personality estimation requires mining more detailed information than age and gender estimation, a model such as TNT (Transformer in Transformer) is used. Each salient patch is uniformly divided into several smaller sub-patches, and attention is computed for each sub-patch together with the other local sub-patches. Each salient patch is then grouped with its local sub-patches to enhance representational power. The estimated probabilities of extroversion, gender, and age are used as weights on the corresponding attribute values for users and user groups.

[0132] When leveraging group attention mechanisms, in addition to individual user video content, attention is also paid to other users with similar attribute values (e.g., "female," "young," etc.), enriching user representations through group information. To this end, user groups are defined by specific attribute values such as "gender," "age," or "personality." For example, all "male" users form one group, and all "young" users form another. In this study, six groups were set up (corresponding to "male," "female," "young," "middle-aged," "elderly," and "extroverted"), allowing for the aggregation of information from these groups and the definition of user representations.
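The group definition can be illustrated with a short Python sketch that builds, for each attribute value, the set of users sharing it, and looks up the groups of a given user. These correspond to the sets written USet(g) and GSet(u) in the group attention formulas. The dictionary layout, the user identifiers, and the function names are assumptions made for illustration. Users who are not classified as extroverted simply carry no "personality" entry, since only the "extroverted" group is defined.

```python
from typing import Dict, Set


def build_groups(users: Dict[str, Dict[str, str]]) -> Dict[str, Set[str]]:
    """USet: map each attribute value (group) to the set of users having it."""
    groups: Dict[str, Set[str]] = {}
    for user, attrs in users.items():
        for value in attrs.values():
            groups.setdefault(value, set()).add(user)
    return groups


def groups_of(user: str, groups: Dict[str, Set[str]]) -> Set[str]:
    """GSet: the set of groups to which the given user belongs."""
    return {g for g, members in groups.items() if user in members}


# Hypothetical users with attribute values inferred from video.
users = {
    "u1": {"gender": "male", "age": "young", "personality": "extroverted"},
    "u2": {"gender": "female", "age": "young"},
    "u3": {"gender": "male", "age": "elderly"},
}
groups = build_groups(users)
print(sorted(groups["young"]))          # users sharing the "young" group
print(sorted(groups_of("u1", groups)))  # groups that user u1 belongs to
```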

[0133] Specifically, let R(u) be the representation of user u, and its initial value be the multi-head video representation.

[0134] Let R(g) be the representation of group g, whose initial value is its semantically encoded attribute value. Both representations are projected into the same feature space to remain consistent with the video content representation;

[0135] The initial representations of user u and group g are obtained from this projection, whose parameters are trainable.

[0136] For each pair of user u and group g, iteratively compute and aggregate group information into the user representation:

[0137]

[0138]

[0139]

[0140]

[0141] where $\|$ denotes the concatenation (Concat) operation and the weight matrices are trainable parameters. The two aggregated terms represent the aggregated information of the groups to which user $u$ belongs and the aggregated information of the users in the same group $g$, respectively. $USet(g)$ denotes the set of users in group $g$, and $GSet(u)$ denotes the set of groups to which user $u$ belongs. $\alpha(g^*, u)$ and $\beta(u^*, g)$ denote the attention score of group $g^*$ for user $u$ and the attention score of user $u^*$ within group $g$, respectively, and are calculated as follows:

[0142]

[0143]

[0144] where $w_e(u^*, g)$ is the weight of the edge between user $u^*$ and group $g$, and $w_p(g)$ is the weight of group $g$ (equal to the weight of the corresponding attribute node in stress detection).

[0145] As shown in Figure 9, a GAT-based graph neural network is used to compute group attention. After (l+1) propagation steps, the resulting group-aware user representation $R^{(l+1)}(u)$ is fed into a fully connected layer, and the Softmax activation function then yields the probabilities of the different stress levels, corresponding to no stress, low stress, and high stress, respectively. The level with the highest probability is the detected stress level:

[0146] $[y_1, \ldots, y_n] = \mathrm{Softmax}\left(R^{(l+1)}(u) \times W_s\right)$

[0147] where $W_s$ is the weight matrix of the fully connected layer. Furthermore, the cross-entropy loss function is used to adjust the model weights throughout model training.
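The classification step and the loss can be sketched in plain Python. The sketch applies $\mathrm{Softmax}(R^{(l+1)}(u) \times W_s)$ to a toy user representation and evaluates the cross-entropy for a known label. The vector sizes, the values in `w_s`, and the function names are illustrative assumptions; in practice `w_s` would be learned during training.

```python
import math
from typing import List, Tuple

STRESS_LEVELS = ["no stress", "low stress", "high stress"]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def classify(user_repr: List[float], w_s: List[List[float]]) -> Tuple[List[float], str]:
    """Compute [y_1..y_n] = Softmax(R(u) x W_s); w_s has shape (d, n)."""
    n = len(w_s[0])
    logits = [sum(r * w_s[i][j] for i, r in enumerate(user_repr)) for j in range(n)]
    probs = softmax(logits)
    # The level with the highest probability is the detected stress level.
    return probs, STRESS_LEVELS[probs.index(max(probs))]


def cross_entropy(probs: List[float], true_index: int) -> float:
    """Cross-entropy loss for a single sample with a one-hot label."""
    return -math.log(probs[true_index])


user_repr = [1.0, 2.0]                      # toy representation (d = 2)
w_s = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]    # placeholder weights (d x n)
probs, level = classify(user_repr, w_s)
print(level, cross_entropy(probs, true_index=2))
```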

[0148] In the above specific embodiments, the deep learning-based stress detection method provided by the present invention acquires a video to be detected and extracts facial frames of the target user from the video; the facial frames of the target user are input into a pre-trained stress detection model to obtain the stress detection result output by the stress detection model; wherein, the stress detection model is trained based on facial frame samples, as well as the corresponding sample stress detection result, sample emotion-oriented description features, and sample group classification result; the facial frame samples are extracted from video samples; the emotion-oriented description features are obtained by superimposing the representation features extracted from the facial frame samples through a multi-layer attention mechanism; the group classification result is obtained from the facial frame samples through a group attention mechanism.

[0149] Thus, the stress detection method provided by this invention can realize personalized stress detection based on ubiquitous video and deep learning. It adopts a three-layer attention mechanism, focusing on the user's specific emotional facial expressions, the duration of continuous negative emotions, and the transformation and self-association of emotional frames in the video. It can accurately identify stress levels, thereby solving the technical problem of poor detection accuracy of existing contactless detection methods. It can accurately detect the psychological stress level of the target user without contact, realize contactless stress detection, and improve the accuracy of stress detection.

[0150] In addition to the methods described above, this invention also provides a stress detection device based on deep learning. As shown in Figure 10, the device includes:

[0151] The video acquisition unit 1001 is used to acquire the video to be detected and extract the facial frames of the target user in the video to be detected.

[0152] The result generation unit 1002 is used to input the facial frame of the target user into a pre-trained stress detection model to obtain the stress detection result output by the stress detection model;

[0153] The stress detection model is trained based on facial frame samples, as well as the corresponding sample stress detection results, sample emotion-oriented description features, and sample group classification results.

[0154] The facial frame samples are extracted from video samples; the emotion-oriented description features are obtained by superimposing the representation features extracted from the facial frame samples through a multi-layer attention mechanism; and the group classification result is obtained from the facial frame samples through a group attention mechanism.

[0155] According to the deep learning-based stress detection device provided by the present invention, the multi-layer attention mechanism includes at least an emotion expression attention mechanism, a continuous negative emotion attention mechanism, and a video frame self-attention mechanism with emotional information.

[0156] Obtaining the emotion-oriented description features by superimposing the representation features extracted from the facial frame samples through the multi-layer attention mechanism specifically includes:

[0157] Based on the emotion expression attention mechanism, extracting feature representations from the facial frame samples to obtain a frame representation result containing emotional information;

[0158] Based on the negative emotion cycle and the continuous negative emotion attention mechanism, extracting, from the frame representation result containing emotional information, a video representation enhanced with negative emotion frames;

[0159] Based on the video frame self-attention mechanism with emotional information, embedding position information into the video representation enhanced with negative emotion frames to obtain the self-attention result of video frames with emotional information.

[0160] According to the deep learning-based stress detection device provided by the present invention, feature representations are extracted from the facial frame samples based on the emotional expression attention mechanism to obtain frame representation results containing emotional information. Specifically, it includes:

[0161] Constructing a vector $A$ between the facial frame samples $X_F$ and the sample stress detection results $X_C$;

[0162] Normalizing each element of the vector $A$ to [0,1], and using the normalized result to represent the attention weight relative to the target emotion category;

[0163] Obtaining the categorical emotion attention $X_A$ based on the attention weight relative to the target emotion category and the vector $A$;

[0164] Aligning the categorical emotion attention $X_A$ with each frame in the facial frame samples $X_F$, and calculating the frame representation result containing emotional information using a linear function.
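These four steps can be illustrated with a rough Python sketch. The underlying formulas are not reproduced in the text, so every concrete choice below is an assumption: a single hypothetical reference vector stands in for $X_C$, the dot product is used as the proximity measure, min-max scaling performs the normalization to [0,1], the attention is the element-wise product of weights and proximities, and the final linear function is omitted.

```python
from typing import List

Vector = List[float]


def proximity_vector(frames: List[Vector], reference: Vector) -> Vector:
    """A_i: closeness of frame i to the reference vector (dot product, assumed)."""
    return [sum(f * c for f, c in zip(frame, reference)) for frame in frames]


def normalize(vec: Vector) -> Vector:
    """Min-max scale to [0, 1]; equal values all map to 1.0 (arbitrary choice)."""
    lo, hi = min(vec), max(vec)
    if hi == lo:
        return [1.0 for _ in vec]
    return [(v - lo) / (hi - lo) for v in vec]


def emotion_attention(frames: List[Vector], reference: Vector) -> List[Vector]:
    """Weight each frame by an attention value derived from its proximity."""
    a = proximity_vector(frames, reference)
    weights = normalize(a)                      # attention weights in [0, 1]
    x_a = [w * v for w, v in zip(weights, a)]   # assumed form of the attention
    # Align the attention with the frames: one scalar per frame.
    return [[x * f for f in frame] for x, frame in zip(x_a, frames)]


frames = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
reference = [1.0, 0.0]  # hypothetical stand-in for the target category
print(emotion_attention(frames, reference))
```

Frames closest to the reference receive the largest scaling, which is the qualitative effect the attention weights are meant to have.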

[0165] According to the deep learning-based stress detection device provided by the present invention, extracting, from the frame representation result containing emotional information, the video representation enhanced with negative emotion frames based on the negative emotion cycle and the continuous negative emotion attention mechanism specifically includes:

[0166] Extracting negative emotion segments p, each formed by consecutive facial frames, from the frame representation result containing emotional information;

[0167] Based on the duration of the negative emotion segment p and the position of each frame within the segment, weighting and enhancing the frame representation of each frame in the negative emotion segment to obtain the video representation of the facial frame samples enhanced with negative emotion frames.

[0168] According to the deep learning-based stress detection device provided by the present invention, embedding position information into the video representation enhanced with negative emotion frames, based on the video frame self-attention mechanism with emotional information, to obtain the self-attention result of video frames with emotional information specifically includes:

[0169] Based on preset frame position information parameters, performing a position embedding operation on each frame representation in the video representation enhanced with negative emotion frames to obtain the embedded representation result;

[0170] Based on the video frame self-attention mechanism with emotional information, performing frame self-attention calculation on the embedded representation result to obtain the self-attention result of video frames with emotional information.

[0171] In the above specific embodiments, the deep learning-based stress detection device provided by the present invention acquires a video to be detected and extracts facial frames of the target user from the video; the facial frames of the target user are input into a pre-trained stress detection model to obtain the stress detection result output by the stress detection model; wherein, the stress detection model is trained based on facial frame samples, as well as the corresponding sample stress detection result, sample emotion-oriented description feature, and sample group classification result; the facial frame samples are extracted from video samples; the emotion-oriented description feature is obtained by superimposing the representation features extracted from the facial frame samples through a multi-layer attention mechanism; the group classification result is obtained from the facial frame samples through a group attention mechanism.

[0172] Thus, the stress detection device provided by this invention can realize a personalized stress detection method based on ubiquitous video and deep learning. It adopts a three-layer attention mechanism, focusing on the user's specific emotional facial expressions, the duration of continuous negative emotions, and the transformation and self-association of emotional frames in the video. It can accurately identify the stress level, thereby solving the technical problem of poor detection accuracy of existing contactless detection methods. It can perform relatively accurate contactless detection of the psychological stress level of the target user, realize contactless stress detection, and improve the accuracy of stress detection.

[0173] Figure 11 is an example schematic diagram of the physical structure of an electronic device. As shown in Figure 11, the electronic device may include a processor 1110, a communications interface 1120, a memory 1130, and a communication bus 1140, wherein the processor 1110, the communications interface 1120, and the memory 1130 communicate with each other through the communication bus 1140. The processor 1110 can call logical instructions in the memory 1130 to execute the deep learning-based stress detection method, which includes: acquiring a video to be detected and extracting facial frames of the target user from the video to be detected;

[0174] By inputting the target user's facial frame into a pre-trained stress detection model, the stress detection result output by the stress detection model can be obtained.

[0175] The stress detection model is trained based on facial frame samples, as well as the corresponding sample stress detection results, sample emotion-oriented description features, and sample group classification results.

[0176] The facial frame samples are extracted from video samples; the emotion-oriented description features are obtained by superimposing the representation features extracted from the facial frame samples through a multi-layer attention mechanism; and the group classification result is obtained from the facial frame samples through a group attention mechanism.

[0177] Furthermore, the logical instructions in the aforementioned memory 1130 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0178] In another aspect, the present invention also provides a computer program product including a computer program that can be stored on a non-transitory computer-readable storage medium; when the computer program is executed by a processor, the computer is able to execute the deep learning-based stress detection method, the method including: acquiring a video to be detected, and extracting facial frames of a target user from the video to be detected;

[0179] By inputting the target user's facial frame into a pre-trained stress detection model, the stress detection result output by the stress detection model can be obtained.

[0180] The stress detection model is trained based on facial frame samples, as well as the corresponding sample stress detection results, sample emotion-oriented description features, and sample group classification results.

[0181] The facial frame samples are extracted from video samples; the emotion-oriented description features are obtained by superimposing the representation features extracted from the facial frame samples through a multi-layer attention mechanism; and the group classification result is obtained from the facial frame samples through a group attention mechanism.

[0182] In another aspect, the present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the deep learning-based stress detection method provided above, the method comprising: acquiring a video to be detected, and extracting facial frames of a target user from the video to be detected;

[0183] By inputting the target user's facial frame into a pre-trained stress detection model, the stress detection result output by the stress detection model can be obtained.

[0184] The stress detection model is trained based on facial frame samples, as well as the corresponding sample stress detection results, sample emotion-oriented description features, and sample group classification results.

[0185] The facial frame samples are extracted from video samples; the emotion-oriented description features are obtained by superimposing the representation features extracted from the facial frame samples through a multi-layer attention mechanism; and the group classification result is obtained from the facial frame samples through a group attention mechanism.

[0186] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0187] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0188] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A stress detection method based on deep learning, characterized in that it comprises: acquiring a video to be detected and extracting facial frames of a target user from the video to be detected; inputting the facial frames of the target user into a pre-trained stress detection model to obtain a stress detection result output by the stress detection model; wherein the stress detection model is trained based on facial frame samples, as well as the corresponding sample stress detection results, sample emotion-oriented description features, and sample group classification results; the facial frame samples are extracted from video samples; the emotion-oriented description features are obtained by superimposing the representation features extracted from the facial frame samples through a multi-layer attention mechanism; the multi-layer attention mechanism includes at least an emotion expression attention mechanism, a continuous negative emotion attention mechanism, and a video frame self-attention mechanism with emotional information; wherein the emotion expression attention mechanism is used to focus on the emotion expression of each frame; the continuous negative emotion attention mechanism is used to enhance the weight of the user's continuous negative emotions based on the concept of negative emotion cycles; the video frame self-attention mechanism with emotional information is used to embed an emotion frame self-attention layer to model the frame sequence as a video representation; the group classification result is obtained from the facial frame samples through a group attention mechanism; the group attention mechanism refers to a group attention mechanism based on user personality characteristics; the emotion-oriented description features are obtained in the following way: based on the emotion expression attention mechanism, extracting feature representations from the facial frame samples to obtain a frame representation result containing emotional information; based on the negative emotion cycle and the continuous negative emotion attention mechanism, extracting, from the frame representation result containing emotional information, a video representation enhanced with negative emotion frames; based on the video frame self-attention mechanism with emotional information, embedding position information into the video representation enhanced with negative emotion frames to obtain a self-attention result of video frames with emotional information; the group classification results are obtained in the following way: obtaining sample users and constructing a user graph based on attribute information of the sample users; based on the user graph, identifying and classifying the characteristics of the sample users through the group attention mechanism to obtain the group classification results.

2. The stress detection method based on deep learning according to claim 1, characterized in that extracting the facial frame samples from the video samples specifically includes: acquiring video samples and extracting frames that meet preset image quality requirements from the video samples; based on predefined emotion categories, calculating the probability of the detected emotion category in each frame to obtain a sequence of emotion classification probabilities; extracting a preset number of emotion classification probabilities from the sequence of emotion classification probabilities; using the frames corresponding to the preset number of emotion classification probabilities and the frames that meet the preset image quality requirements as the facial frame samples.

3. The stress detection method based on deep learning according to claim 1, characterized in that extracting feature representations from the facial frame samples based on the emotion expression attention mechanism to obtain the frame representation result containing emotional information specifically includes: based on the proximity between the facial frame samples $X_F$ and the sample stress detection results $X_C$, constructing a vector $A$ between the facial frame samples $X_F$ and the sample stress detection results $X_C$; normalizing each element of the vector $A$ to [0,1], and using the normalized result to represent the attention weight relative to the target emotion category; obtaining the categorical emotion attention $X_A$ based on the attention weight relative to the target emotion category and the vector $A$; aligning the categorical emotion attention $X_A$ with each frame in the facial frame samples $X_F$, and calculating the frame representation result containing emotional information using a linear function.

4. The stress detection method based on deep learning according to claim 1, characterized in that extracting, from the frame representation result containing emotional information, the video representation enhanced with negative emotion frames based on the negative emotion cycle and the continuous negative emotion attention mechanism specifically includes: extracting negative emotion segments $p$, each formed by consecutive facial frames, from the frame representation result containing emotional information; based on the duration of the negative emotion segment $p$ and the position of each frame within the segment, weighting and enhancing the frame representation of each frame in the negative emotion segment to obtain the video representation of the facial frame samples enhanced with negative emotion frames.

5. The stress detection method based on deep learning according to claim 1, characterized in that embedding position information into the video representation enhanced with negative emotion frames, based on the video frame self-attention mechanism with emotional information, to obtain the self-attention result of video frames with emotional information specifically includes: based on preset frame position information parameters, performing a position embedding operation on each frame representation in the video representation enhanced with negative emotion frames to obtain an embedded representation result; based on the video frame self-attention mechanism with emotional information, performing frame self-attention calculation on the embedded representation result to obtain the self-attention result of video frames with emotional information.

6. A stress detection device based on deep learning, characterized in that it comprises: a video acquisition unit, used to acquire a video to be detected and extract facial frames of a target user from the video to be detected; a result generation unit, used to input the facial frames of the target user into a pre-trained stress detection model to obtain a stress detection result output by the stress detection model; wherein the stress detection model is trained based on facial frame samples, as well as the corresponding sample stress detection results, sample emotion-oriented description features, and sample group classification results; the facial frame samples are extracted from video samples; the emotion-oriented description features are obtained by superimposing the representation features extracted from the facial frame samples through a multi-layer attention mechanism; the multi-layer attention mechanism includes at least an emotion expression attention mechanism, a continuous negative emotion attention mechanism, and a video frame self-attention mechanism with emotional information; wherein the emotion expression attention mechanism is used to focus on the emotion expression of each frame; the continuous negative emotion attention mechanism is used to enhance the weight of the user's continuous negative emotions based on the concept of negative emotion cycles; the video frame self-attention mechanism with emotional information is used to embed an emotion frame self-attention layer to model the frame sequence as a video representation; the group classification result is obtained from the facial frame samples through a group attention mechanism; the group attention mechanism refers to a group attention mechanism based on user personality characteristics; the emotion-oriented description features are obtained in the following way: based on the emotion expression attention mechanism, extracting feature representations from the facial frame samples to obtain a frame representation result containing emotional information; based on the negative emotion cycle and the continuous negative emotion attention mechanism, extracting, from the frame representation result containing emotional information, a video representation enhanced with negative emotion frames; based on the video frame self-attention mechanism with emotional information, embedding position information into the video representation enhanced with negative emotion frames to obtain a self-attention result of video frames with emotional information; the group classification results are obtained in the following way: obtaining sample users and constructing a user graph based on attribute information of the sample users; based on the user graph, identifying and classifying the characteristics of the sample users through the group attention mechanism to obtain the group classification results.

7. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the deep learning-based stress detection method as described in any one of claims 1 to 5.

8. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the deep learning-based stress detection method as described in any one of claims 1 to 5.