Emotion recognition model construction method, emotion recognition method, and electronic device

By constructing an emotion recognition model through the collaborative representation and feature fusion of multimodal physiological signals, the problem of low recognition accuracy in traditional methods is solved, and higher accuracy in emotion state recognition is achieved.

CN116383726B · Active · Publication Date: 2026-06-26 · BEIJING INST OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING INST OF TECH
Filing Date
2023-03-09
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Traditional emotion recognition methods suffer from low accuracy due to the complexity of human emotional states and the varying range of physiological signal characteristics among different individuals.

Method used

By acquiring multimodal physiological signals from multiple test subjects, the collaborative representation of each modality is determined, and these signals are fused in the same collaborative space. The signals are then input into a neural network for emotion state classification learning, thereby constructing a target emotion recognition model.

Benefits of technology

It improves the accuracy of emotional state recognition and can more accurately identify the emotional state of test subjects based on physiological signals of multiple modalities.

✦ Generated by Eureka AI based on patent content.


Abstract

Embodiments of the present application provide a construction method of an emotion recognition model, an emotion recognition method and an electronic device. First physiological signals of a plurality of test objects including at least two modalities are obtained. Based on the first physiological signals of each modality, a collaborative representation corresponding to the first physiological signals of each modality is determined. The collaborative representations corresponding to the first physiological signals of each modality are fused to obtain a fused feature. After obtaining the fused feature, the fused feature is input into a neural network for classification learning of an emotional state to obtain a target emotion recognition model. The collaborative representation is one-dimensional feature information of the first physiological signals of each modality in a same collaborative space, and the fusion of features of each modality is realized. The target emotion recognition model obtained by the scheme provided in the embodiments of the present application can recognize the emotional state of the test object based on the physiological signals of the plurality of modalities of the test object to be tested, and improve the accuracy of recognizing the emotional state.

Description

Technical Field

[0001] This application relates to the field of computer technology, and in particular to a method for constructing an emotion recognition model, an emotion recognition method, and an electronic device.

Background Technology

[0002] From a psychophysiological perspective, human emotional fluctuations can trigger changes in behavioral characteristics and physiological indicators. Therefore, emotion recognition can be performed by studying and analyzing human behavioral characteristics and physiological indicators. Emotion recognition is a new research field that intersects multiple disciplines such as engineering and cognitive science. It is defined as the process by which computers analyze and process signals collected from physiological sensors to assess the emotional state of a test subject.

[0003] Currently, emotion recognition methods typically involve acquiring the physiological signals of the test subject, extracting the features of those signals, and then determining the test subject's emotional state by analyzing the characteristics of those physiological signals and the range of those characteristics when the human is in different emotional states.

[0004] However, human emotional states are complex. Physiological signals fluctuate differently from person to person in different emotional states, so the ranges of physiological signal characteristics can differ among individuals. Therefore, traditional emotion recognition methods suffer from low recognition accuracy.

Summary of the Invention

[0005] In view of the above technical problems, it is necessary to provide a method for constructing an emotion recognition model, an emotion recognition method, and an electronic device that can improve the accuracy of emotion recognition results.

[0006] In a first aspect, embodiments of this application provide a method for constructing an emotion recognition model, comprising:

[0007] Acquire first physiological signals from multiple test subjects; the first physiological signals are physiological change data of the test subjects under different external stimuli, and the first physiological signals include at least two modalities.

[0008] Based on the first physiological signal of each modality, the collaborative representation corresponding to the first physiological signal of each modality is determined. The collaborative representation is the one-dimensional feature information of the first physiological signal of each modality in the same collaborative space.

[0009] The collaborative representations corresponding to the first physiological signals of each modality are fused to obtain fused features;

[0010] The fused features are input into a neural network to learn the classification of emotional states, resulting in a target emotion recognition model.

[0011] In one feasible implementation, based on the first physiological signal of each modality, the collaborative representation corresponding to the first physiological signal of each modality is determined, including:

[0012] Determine the modal characteristics of the first physiological signal of each modality; the modal characteristics include at least one of the statistical characteristics and initial frequency characteristics of the first physiological signal of each modality;

[0013] Tensor canonical correlation analysis is used to calculate the projection vectors of each modal feature in the cooperative space;

[0014] Obtain the collaborative representation of each modality feature on the corresponding projection vector.

[0015] In one feasible implementation, tensor canonical correlation analysis is used to calculate the projection vectors of each modal feature in the cooperative space, including:

[0016] Based on tensor canonical correlation analysis, the covariance tensor among the modal features is calculated.

[0017] Based on the covariance tensor, determine the tensors between modal features;

[0018] Tensor decomposition is performed on the tensor to determine the unit vectors corresponding to the rank-one approximation of the tensor; each unit vector corresponds to the first physiological signal of one modality.

[0019] Based on the unit vectors of the rank-one approximation of the tensor, the projection vectors of each modal feature in the cooperative space are determined.
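The steps in [0016] to [0019] can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation claimed in the application: it assumes exactly three modalities, assumes the intermediate tensor is the covariance tensor whitened by each modality's own covariance (the form commonly described for tensor canonical correlation analysis), and uses a higher-order power iteration as a stand-in for the unspecified tensor decomposition. The names `inv_sqrt`, `tcca_rank1`, `eps`, and `n_iter` are hypothetical.

```python
import numpy as np

def inv_sqrt(mat, eps=1e-6):
    """Inverse square root of a symmetric PSD matrix (eps is an assumed ridge term)."""
    vals, vecs = np.linalg.eigh(mat + eps * np.eye(mat.shape[0]))
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def tcca_rank1(x1, x2, x3, n_iter=100, seed=0):
    """Rank-one tensor CCA sketch for three views, each of shape (N, d_p)."""
    views = [v - v.mean(axis=0) for v in (x1, x2, x3)]   # center each modality
    n = views[0].shape[0]
    whiteners = [inv_sqrt(v.T @ v / n) for v in views]
    # Third-order moment of the whitened views = whitened covariance tensor.
    z1, z2, z3 = (v @ w for v, w in zip(views, whiteners))
    tensor = np.einsum("ni,nj,nk->ijk", z1, z2, z3) / n
    # Higher-order power iteration: one unit vector per modality.
    rng = np.random.default_rng(seed)
    u = [rng.standard_normal(d) for d in tensor.shape]
    u = [vec / np.linalg.norm(vec) for vec in u]
    for _ in range(n_iter):
        u[0] = np.einsum("ijk,j,k->i", tensor, u[1], u[2])
        u[0] /= np.linalg.norm(u[0])
        u[1] = np.einsum("ijk,i,k->j", tensor, u[0], u[2])
        u[1] /= np.linalg.norm(u[1])
        u[2] = np.einsum("ijk,i,j->k", tensor, u[0], u[1])
        u[2] /= np.linalg.norm(u[2])
    rho = np.einsum("ijk,i,j,k->", tensor, *u)           # rank-one weight
    projections = [w @ vec for w, vec in zip(whiteners, u)]
    # One-dimensional representation of each modality in the shared space.
    representations = [v @ h for v, h in zip(views, projections)]
    return rho, projections, representations
```

Each entry of `representations` is a length-N vector, which matches the description of the collaborative representation as one-dimensional feature information per modality.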

[0020] In one feasible implementation, based on tensor canonical correlation analysis, the covariance tensor between each modal feature is calculated, including:

[0021] For each test object, the modal features are standardized to obtain the corresponding standardized features; the values of the standardized features are within a preset range.

[0022] Based on tensor canonical correlation analysis, the covariance tensor among the standardized features is calculated.
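A per-subject standardization of the kind described in [0021] might look like the sketch below. The text only states that the standardized values fall within a preset range; min-max scaling, the default range [0, 1], and the names `standardize_per_subject`, `low`, and `high` are assumptions made for illustration.

```python
import numpy as np

def standardize_per_subject(features, subject_ids, low=0.0, high=1.0):
    """Min-max scale each feature column separately for every test subject."""
    features = np.asarray(features, dtype=float)
    subject_ids = np.asarray(subject_ids)
    scaled = np.empty_like(features)
    for sid in np.unique(subject_ids):
        rows = subject_ids == sid
        block = features[rows]
        col_min = block.min(axis=0)
        col_range = block.max(axis=0) - col_min
        col_range[col_range == 0] = 1.0   # constant columns map to `low`
        scaled[rows] = low + (block - col_min) / col_range * (high - low)
    return scaled
```

Scaling within each subject, rather than across the pooled data, is one way to reduce the between-subject differences in feature range that the background section identifies as a source of error.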

[0023] In one feasible implementation, the modal characteristics of the first physiological signal of each modality are determined, including:

[0024] The first physiological signal of each modality is preprocessed to obtain the preprocessed first physiological signal of each modality; the preprocessing includes at least one of downsampling, filtering, and removing invalid data;

[0025] Based on the first physiological signal of each modality after preprocessing, the modal characteristics of the first physiological signal of each modality are determined.
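The preprocessing in [0024] and the feature determination in [0025] can be illustrated for a single channel as follows. This is only a sketch: the moving-average filter, the decimation factor, the 4 to 8 Hz band, and the function names `preprocess` and `modal_features` are assumptions, since the text names the operations (downsampling, filtering, removing invalid data) and the feature families (statistical and frequency features) without fixing their parameters.

```python
import numpy as np

def preprocess(signal, factor=4, kernel=5):
    """Remove invalid samples, low-pass with a moving average, then downsample."""
    x = np.asarray(signal, dtype=float)
    x = x[np.isfinite(x)]                      # drop NaN/inf samples
    window = np.ones(kernel) / kernel
    x = np.convolve(x, window, mode="same")    # crude low-pass filter
    return x[::factor]                         # keep every `factor`-th sample

def modal_features(x, fs, band=(4.0, 8.0)):
    """Return [mean, standard deviation, band power] for one preprocessed channel."""
    spectrum = np.abs(np.fft.rfft(x - x.mean())) ** 2 / len(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    in_band = (freqs >= band[0]) & (freqs < band[1])
    return np.array([x.mean(), x.std(), spectrum[in_band].sum()])

# Usage on a synthetic 6 Hz sine sampled at 128 Hz with one invalid sample.
fs = 128
t = np.arange(0, 8, 1 / fs)
raw = np.sin(2 * np.pi * 6 * t)
raw[10] = np.nan
clean = preprocess(raw, factor=2)
feats = modal_features(clean, fs / 2)
```

Deleting invalid samples shifts the remaining samples in time; a real pipeline might interpolate instead, which is a design choice the text leaves open.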

[0026] In one feasible implementation, the number of test subjects is N, where N is a real number; the fused features are input into a neural network for emotion state classification learning to obtain a target emotion recognition model, including:

[0027] Obtain the training dataset from the fusion features; the training dataset consists of the fusion features corresponding to n test objects, where n is a positive integer not greater than N;

[0028] Based on multiple partitioning methods, the first feature subset and the second feature subset corresponding to the fusion features contained in the training dataset are determined respectively; wherein, corresponding to various partitioning methods, n test objects are divided into a first group of test objects and a second group of test objects, the first feature subset is the fusion feature of the first group of test objects, and the second feature subset is the fusion feature of the second group of test objects.

[0029] Based on the first and second feature subsets, independent cross-validation is performed to obtain the target emotion recognition model.

[0030] In one feasible implementation, independent cross-validation operations are performed, including:

[0031] The first feature subset determined by any partitioning method is input into the initial emotion classification model for training until all feature data in the first feature subset has been input into the emotion classification model, thus obtaining a reference emotion classification model.

[0032] The second feature subset, determined according to the same partitioning method, is input into the reference emotion classification model, and the emotional state result corresponding to the second feature subset is output.

[0033] Obtain reference emotion classification models trained based on the first feature subsets corresponding to various partitioning methods;

[0034] The emotional state labels corresponding to the first physiological signals of each modality are obtained, and the emotional state results corresponding to each second feature subset are compared with the emotional state labels of the corresponding first physiological signals to determine the accuracy of each reference emotion classification model; the emotional state labels correspond one-to-one with the first physiological signals of each modality.

[0035] The target emotion recognition model is determined based on the accuracy of each reference emotion classification model.
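The independent cross-validation in [0028] to [0035] can be sketched as below. The partitioning is done by test subject, so no subject contributes to both feature subsets. Two points are assumptions rather than statements from the text: a nearest-centroid classifier stands in for the neural network to keep the example short, and the target model is taken to be the reference model with the highest accuracy. All function and parameter names are hypothetical.

```python
import numpy as np

def nearest_centroid_fit(x, y):
    """Stand-in classifier: one mean feature vector per emotional state label."""
    labels = np.unique(y)
    return labels, np.stack([x[y == c].mean(axis=0) for c in labels])

def nearest_centroid_predict(model, x):
    labels, centroids = model
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return labels[dists.argmin(axis=1)]

def subject_independent_cv(fused, labels, subject_ids, n_splits=5, seed=0):
    """Train on one group of subjects, test on the other, keep the best model."""
    rng = np.random.default_rng(seed)
    subjects = rng.permutation(np.unique(subject_ids))
    results = []
    for held_out in np.array_split(subjects, n_splits):   # one partitioning method
        test = np.isin(subject_ids, held_out)              # second feature subset
        model = nearest_centroid_fit(fused[~test], labels[~test])  # first subset
        pred = nearest_centroid_predict(model, fused[test])
        results.append((np.mean(pred == labels[test]), model))
    best_acc, best_model = max(results, key=lambda r: r[0])
    return best_acc, best_model, [acc for acc, _ in results]
```

Here `fused` is an array of fused features with one row per trial, and `subject_ids` gives the test subject for each row.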

[0036] In a second aspect, embodiments of this application provide an emotion recognition method, including:

[0037] Acquire second physiological signals of multiple modalities of the test subject; the second physiological signals are physiological change data of the test subject under different external stimuli.

[0038] The second physiological signal of each modality is input into the target emotion recognition model obtained by any one of claims 1-7, and the emotional state result of the test subject output by the target emotion recognition model is obtained.

[0039] In a third aspect, embodiments of this application provide an apparatus for constructing an emotion recognition model, comprising:

[0040] The acquisition module is used to acquire the first physiological signals of multiple test subjects; the first physiological signals are physiological change data of the test subjects under different external stimuli, and the first physiological signals include at least two modalities.

[0041] The determination module is used to determine the collaborative representation corresponding to the first physiological signal of each modality based on the first physiological signal of each modality. The collaborative representation is the one-dimensional feature information of the first physiological signal of each modality in the same collaborative space.

[0042] The fusion module is used to fuse the collaborative representations corresponding to the first physiological signals of each modality to obtain fused features.

[0043] The learning module is used to input fused features into a neural network to learn the classification of emotional states, thereby obtaining a target emotion recognition model.

[0044] In a fourth aspect, embodiments of this application provide an electronic device, including: a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the method in any of the embodiments of the first and second aspects described above.

[0045] In a fifth aspect, embodiments of this application provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the methods in any of the embodiments of the first and second aspects described above.

[0046] In a sixth aspect, embodiments of this application provide a computer program product, including a computer program that, when executed by a processor, implements the steps of the methods in any of the embodiments of the first and second aspects described above.

[0047] The technical solution provided in this application can achieve at least the following beneficial effects:

[0048] This application provides a method for constructing an emotion recognition model, an emotion recognition method, and an electronic device. The method involves acquiring first physiological signals from multiple test subjects, including at least two modalities. Based on the first physiological signals of each modality, a collaborative representation corresponding to each modality's first physiological signal is determined. The collaborative representations corresponding to the first physiological signals of each modality are then fused to obtain fused features. These fused features are then input into a neural network for emotion state classification learning, resulting in a target emotion recognition model. The collaborative representation is the one-dimensional feature information of the first physiological signals of each modality within the same collaborative space. This fusion of features from various modalities enables simultaneous processing of multiple sets of physiological signal inputs for emotion recognition, capturing more modal data information that contributes to emotion recognition. Therefore, the target emotion recognition model obtained through the solution provided in this application can identify the emotional state of the test subject based on the physiological signals of multiple modalities, improving the accuracy of emotion state recognition.

Description of the Drawings

[0049] Figure 1 is a schematic diagram of an electronic device illustrated in an exemplary embodiment of this application;

[0050] Figure 2 is a schematic diagram of another electronic device illustrated in an exemplary embodiment of this application;

[0051] Figure 3 is a flowchart of a method for constructing an emotion recognition model illustrated in an exemplary embodiment of this application;

[0052] Figure 4 is a schematic diagram of the overall process of collaborative representation illustrated in an exemplary embodiment of this application;

[0053] Figure 5 is a schematic diagram of a model training method using fused features illustrated in an exemplary embodiment of this application;

[0054] Figure 6 is a flowchart of an emotion recognition method illustrated in an exemplary embodiment of this application;

[0055] Figure 7 is a schematic diagram of an apparatus for constructing an emotion recognition model illustrated in an exemplary embodiment of this application.

Detailed Implementation

[0056] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.

[0057] The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. The singular forms “a” and “the” as used in this application and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.

[0058] It should be understood that although the terms first, second, third, etc., may be used in this application to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to a determination."

[0059] The embodiments disclosed herein can be applied to electronic devices such as terminal devices, computer systems, and servers, and can operate together with a wide range of other general-purpose or special-purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and / or configurations suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, handheld or laptop devices, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, and distributed cloud computing environments that include any of the above systems.

[0060] Electronic devices such as terminal devices, computer systems, and servers can be described in the general context of computer system executable instructions (such as program modules) executed by a computer system. Typically, program modules can include routines, programs, object programs, components, logic, data structures, etc., which perform specific tasks or implement specific abstract data types. Computer systems / servers can be implemented in distributed cloud computing environments, where tasks are executed by remote processing devices linked through communication networks. In distributed cloud computing environments, program modules can reside on local or remote computing system storage media, including storage devices.

[0061] With the rapid development of artificial intelligence and computing technology in recent years, emotion recognition, a major branch of affective computing, has gradually become an important research direction in the field of modern human-computer interaction. Affective computing uses computers as a medium to recognize, understand, and express human emotions, thereby endowing computers with a higher level of more comprehensive intelligence.

[0062] Since its inception, affective computing has faced many challenges, such as (1) the acquisition of emotional information and emotional space modeling; (2) emotion recognition and understanding; and (3) the complexity of the correspondence between multimodal physiological signals and emotions. Among these, emotion recognition is defined as the process by which a computer analyzes and processes signals collected from physiological sensors to assess the emotional state of a subject. Emotion recognition is a new research field that intersects multiple disciplines, including engineering and cognitive science. From a psychophysiological perspective, human emotional fluctuations can lead to changes in behavioral characteristics and physiological indicators. Emotional physiology is an important manifestation of emotional fluctuations, and using it as a starting point to identify and analyze human emotions can help to better understand the emotional fluctuations of subjects. The main research process generally involves collecting physiological signals such as electroencephalogram (EEG), electrodermal activity (EDA), and electromyography (EMG) signals from the test subjects, and then identifying the emotional state of the test subjects based on these physiological signals.

[0063] Traditional techniques typically involve acquiring the physiological signals of a test subject, extracting features from those signals, and then determining the subject's emotional state by analyzing these features and their range across different emotional states. In this approach, if the physiological signal features fall within a certain range, the subject's emotional state is determined to be the emotional state corresponding to that range. However, human emotional states are complex, and physiological signals fluctuate depending on the individual's emotional state, leading to potentially different ranges for the features of different people's physiological signals. Therefore, traditional emotion recognition methods suffer from low accuracy.

[0064] Based on this, emotion recognition in this embodiment can be viewed as a modeling task that takes multimodal physiological signals as input. This application proposes a method for constructing an emotion recognition model, an emotion recognition method, and an electronic device. The method involves acquiring first physiological signals from multiple test subjects, including at least two modalities. Based on the first physiological signals of each modality, a collaborative representation corresponding to each modality's first physiological signal is determined. The collaborative representations corresponding to the first physiological signals of each modality are then fused to obtain fused features. These fused features are then input into a neural network for emotion state classification learning, resulting in a target emotion recognition model. The collaborative representation is the one-dimensional feature information of the first physiological signals of each modality in the same collaborative space. This fusion of features from various modalities enables simultaneous processing of multiple sets of physiological signal inputs for emotion recognition, capturing more modal data information that contributes to emotion recognition. Therefore, the target emotion recognition model obtained through the solution provided in this application can identify the emotional state of the test subject based on the physiological signals of multiple modalities, improving the accuracy of emotion state recognition.

[0065] The following description, in conjunction with the accompanying drawings, provides an exemplary illustration of the method for constructing an emotion recognition model according to embodiments of this application. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.

[0066] The method for constructing the emotion recognition model provided in this application can be applied to, for example, the electronic device shown in Figure 1. The electronic device can be a terminal, and its internal structure can be as shown in Figure 1. The computer device includes a processor, memory, communication interface, display screen, and input device connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, carrier networks, NFC (Near Field Communication), or other technologies. When the computer program is executed by the processor, it implements a method for constructing an emotion recognition model. The display screen can be an LCD screen or an e-ink screen. The input device can be a touch layer covering the display screen, buttons, a trackball, or a touchpad on the computer device's casing, or an external keyboard, touchpad, or mouse. The terminal can be, but is not limited to, various personal computers, laptops, smartphones, tablets, and portable wearable devices.

[0067] In another application scenario, this electronic device can also be a server, and its internal structure can be as shown in Figure 2. The computer device includes a processor, memory, and a network interface connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The database stores sample speech data. The network interface communicates with external terminals via a network connection. When the computer program is executed by the processor, it implements a method for constructing an emotion recognition model. The server can be implemented using a standalone server or a server cluster consisting of multiple servers.

[0068] Those skilled in the art will understand that the structures shown in Figure 1 and Figure 2 are merely block diagrams of the portions of the structure related to the present application and do not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figures, or combine certain components, or have different component arrangements.

[0069] Figure 3 is a flowchart of a method for constructing an emotion recognition model illustrated in an exemplary embodiment of this application. Referring to Figure 3, the method includes:

[0070] S202, acquire the first physiological signals of multiple test subjects; the first physiological signals are the physiological change data of the test subjects under different external stimuli, and the first physiological signals include at least two modalities.

[0071] The first physiological signal can include electroencephalogram (EEG) signals, electrooculogram (EOG) signals, electromyogram (EMG) signals, electrodermal activity (EDA) signals, respiratory band signals, plethysmography (PPS) signals, body temperature recording signals, etc., without limitation. Signals other than EEG signals can also be collectively referred to as peripheral physiological signals. The test subject can be simply referred to as the subject. Each modality represents a different type of physiological signal; for example, modality 1 is EEG signals, modality 2 is EOG signals, modality 3 is EMG signals, modality 4 is EDA signals, modality 5 is PPS signals, and modality 6 is body temperature recording signals.

[0072] Electroencephalogram (EEG) signals are among the most frequently used physiological electrical signals in emotion recognition because they can accurately reflect a person's emotional state. Electromyographic (EMG) signals are electrophysiological signals generated when muscles contract. A subject's emotional state influences EMG signals through neuromuscular activity, so EMG signals allow the current neural and muscular state to be analyzed to some extent, which facilitates emotion recognition tasks. Electrodermal activity (EDA) mainly consists of rapidly changing phasic skin conductance responses and slowly changing tonic skin conductance levels. As human emotions fluctuate, electrodermal responses, which combine psychological and physiological factors, also change significantly; therefore, EDA signals are highly beneficial for emotion recognition tasks.

[0073] The various first physiological signals can be acquired through different channels. Each channel corresponds to a physiological sensor that is connected to the test subject and acquires the first physiological signal. Table 1 lists the first physiological signals corresponding to the different channels.

[0074] Table 1

[0075]

[0076]

[0077] Specifically, the first physiological signals of each test subject can be obtained from a publicly available multimodal physiological signal dataset, for example the DEAP (Database for Emotion Analysis using Physiological Signals) dataset. This dataset uses a preset number of music videos of preset duration as emotional stimuli, and different physiological sensors collect the first physiological signals generated by each test subject in response to the stimuli; these signals make up the DEAP dataset. For instance, the same 40 one-minute music videos can be used to emotionally stimulate each test subject, and the EEG signals and peripheral physiological signals of each subject in response to the stimuli can then be collected.

[0078] S204. Based on the first physiological signal of each modality, determine the collaborative representation corresponding to the first physiological signal of each modality. The collaborative representation is the one-dimensional feature information of the first physiological signal of each modality in the same collaborative space.

[0079] Specifically, one-dimensional feature extraction is performed on the first physiological signal of each modality to obtain the collaborative representations corresponding to the first physiological signals of each modality. The collaborative space is a shared space among the spaces containing each collaborative representation.

[0080] It should be noted that physiological signals commonly used in emotion recognition research, such as electroencephalography (EEG), electromyography (EMG), and electrodermal activity (EDA), are collected from different physiological and nervous systems and possess different temporal, frequency, and spatial characteristics, so they must be processed as different modalities. Because data from different modalities are heterogeneous, multimodal learning aims to build models capable of processing and associating information from multiple modalities. Currently, multimodal learning faces five major challenges: representation, translation, alignment, fusion, and collaborative learning. Good modal representation is the foundation of multimodal learning and also positively impacts the performance of the other four tasks. Multimodal representation focuses on learning how to utilize the complementarity and redundancy among multiple modalities to represent and summarize multimodal data, and this is currently the main research direction.

[0081] Multimodal representation includes joint representation and collaborative representation. Joint representation merges signals from different modalities into a single representation space and is often used when multimodal data is involved in both the training and inference steps. Common methods for joint representation include simple concatenation of single modalities, deep multimodal autoencoders, and deep multimodal Boltzmann machines. Collaborative representation maps multiple single-modal signals to a shared collaborative space through similarity or structural constraints. It learns two or more collaborative representations from multimodal data in an unsupervised manner and optimizes the correlation of the learned collaborative representations in the shared collaborative space using a loss function. Common constraints include maximizing correlation (e.g., canonical correlation analysis), enforcing a partial order, and minimizing cosine distance. The purpose of using cross-modal similarity measurement models is to learn a shared collaborative space in which the distance between vectors from different modalities can be measured directly, preserving the similarity structure between and within modalities. Based on cross-modal correlation measurement models, a shared collaborative space that maximizes the correlation between different modal representations can be learned.

[0082] In this embodiment, canonical correlation analysis can be used to obtain the collaborative representation. The canonical correlation analysis algorithm proceeds as follows, starting with the notation:

[0083] Consider multidimensional data with two different modalities. Suppose there are N samples that have already been centered, $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$. Let $X = [x_1, x_2, \ldots, x_N]^T$ and $Y = [y_1, y_2, \ldots, y_N]^T$, where X represents the physiological signal of one modality and Y represents the physiological signal of another modality. $x_1, x_2, \ldots, x_N$ are the physiological signals of one modality corresponding to the N sample objects, and $y_1, y_2, \ldots, y_N$ are the physiological signals of the other modality corresponding to the N sample objects.

[0084] The covariance matrix of variables X and Y is denoted as C(X,Y), as shown in formula (1).

[0085] $$C(X, Y) = \begin{bmatrix} C_{XX} & C_{XY} \\ C_{YX} & C_{YY} \end{bmatrix} \quad (1)$$

[0086] Where $C_{XX} = E[X^T X]$ is the variance of X, $C_{YY} = E[Y^T Y]$ is the variance of Y, and $C_{XY} = C_{YX}^T = E[X^T Y]$ is the covariance of X and Y.

[0087] Let the projection vectors be $w_1$ and $w_2$. Projecting X and Y onto the directions of the projection vectors $w_1$ and $w_2$, respectively, yields two N-dimensional vectors U and V, where $U \in \mathbb{R}^{N \times 1}$ and $V \in \mathbb{R}^{N \times 1}$.

[0088] $$U = X w_1, \quad V = Y w_2 \quad (2)$$

[0089] At this point, the Pearson correlation coefficient is used to measure the degree of correlation between the variables U and V projected from the multidimensional random variables X and Y, as shown in formula (3).

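[0090] $$\rho = \frac{\mathrm{cov}(U, V)}{\sqrt{\mathrm{var}(U)\,\mathrm{var}(V)}} = \frac{w_1^T C_{XY} w_2}{\sqrt{w_1^T C_{XX} w_1}\,\sqrt{w_2^T C_{YY} w_2}} \quad (3)$$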

[0091] ρ is the linear correlation coefficient between X and Y in the collaborative space after they are projected onto the vectors $w_1$ and $w_2$, respectively. Different projection vectors yield different values of ρ. Canonical correlation analysis defines the maximum of ρ over all possible projection vectors as the linear correlation coefficient between the multidimensional variables X and Y. Therefore, the canonical correlation analysis algorithm is an unconstrained optimization problem of maximizing ρ.

[0092]

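[0093] $$\rho^{*} = \max_{w_1, w_2} \frac{w_1^T C_{XY} w_2}{\sqrt{w_1^T C_{XX} w_1}\,\sqrt{w_2^T C_{YY} w_2}} \quad (4)$$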

[0094] Observations show that scaling the projection vectors w1 and w2 simultaneously or individually in formula (4) does not affect the value of the linear correlation coefficient ρ, as shown in formula (5).

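[0095] $$\frac{(\alpha w_1)^T C_{XY} (\beta w_2)}{\sqrt{(\alpha w_1)^T C_{XX} (\alpha w_1)}\,\sqrt{(\beta w_2)^T C_{YY} (\beta w_2)}} = \frac{w_1^T C_{XY} w_2}{\sqrt{w_1^T C_{XX} w_1}\,\sqrt{w_2^T C_{YY} w_2}}, \quad \alpha, \beta > 0 \quad (5)$$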

[0096] Since $w_1$ and $w_2$ can be scaled arbitrarily, canonical correlation analysis reformulates the problem in the same way as support vector machines do: the denominator is fixed and the numerator is optimized. The denominator of canonical correlation analysis is fixed as in formula (6).

[0097] $$w_1^T C_{XX} w_1 = 1, \quad w_2^T C_{YY} w_2 = 1 \quad (6)$$

[0098] Therefore, the canonical correlation analysis algorithm is transformed into a constrained optimization problem of formula (7).

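[0099] $$\max_{w_1, w_2} \; w_1^T C_{XY} w_2 \quad \text{s.t.} \quad w_1^T C_{XX} w_1 = 1, \; w_2^T C_{YY} w_2 = 1 \quad (7)$$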

[0100] Formula (7) is the common optimization form of the canonical correlation analysis algorithm. Common methods for solving formula (7) include the Lagrange multiplier method and matrix singular value decomposition. With the Lagrange multiplier method, ρ is the square root of the largest eigenvalue of the matrix $C_{XX}^{-1} C_{XY} C_{YY}^{-1} C_{YX}$; with singular value decomposition, ρ is the largest singular value of the matrix $C_{XX}^{-1/2} C_{XY} C_{YY}^{-1/2}$.
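As a non-limiting illustration, the singular value decomposition route for two modalities can be sketched as follows. The sketch assumes NumPy, centered data matrices with one row per sample, and the standard whitened cross-covariance matrix; the function names `inv_sqrt` and `cca_first_pair` and the synthetic data are hypothetical and are not part of this embodiment.

```python
import numpy as np

def inv_sqrt(C, eps=1e-10):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T

def cca_first_pair(X, Y):
    """Return (rho, w1, w2) for centered X (N x d1) and Y (N x d2)."""
    N = X.shape[0]
    Cxx, Cyy, Cxy = X.T @ X / N, Y.T @ Y / N, X.T @ Y / N
    Cxx_is, Cyy_is = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # The largest singular value of Cxx^{-1/2} Cxy Cyy^{-1/2} is the maximal rho.
    A, s, Bt = np.linalg.svd(Cxx_is @ Cxy @ Cyy_is)
    w1 = Cxx_is @ A[:, 0]   # satisfies w1^T Cxx w1 = 1
    w2 = Cyy_is @ Bt[0, :]  # satisfies w2^T Cyy w2 = 1
    return s[0], w1, w2

# Synthetic two-modality data sharing one latent factor (illustrative only).
rng = np.random.default_rng(0)
shared = rng.standard_normal((200, 1))
X = np.hstack([shared, rng.standard_normal((200, 2))]) @ rng.standard_normal((3, 3))
Y = np.hstack([shared, rng.standard_normal((200, 3))]) @ rng.standard_normal((4, 4))
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)

rho, w1, w2 = cca_first_pair(X, Y)
U, V = X @ w1, Y @ w2                 # projections of the two modalities
pearson = np.corrcoef(U, V)[0, 1]     # agrees with rho up to rounding
```

Because the projection vectors satisfy the constraints of formula (6), the Pearson correlation of the projections U and V coincides with the largest singular value.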

[0101] Specifically, when extracting features of the first physiological signal corresponding to each modality, different feature extraction methods can be used. After extracting the features of the first physiological signal corresponding to each modality, the extracted features can be mapped to the same collaborative space.

[0102] In one alternative embodiment, step S204 includes:

[0103] S2042, determine the modal characteristics of the first physiological signal of each modality. The modal characteristics include at least one of the statistical characteristics and initial frequency characteristics of the first physiological signal of each modality.

[0104] Specifically, modal features can be extracted from the first physiological signal of each modality, and different modal features can be extracted for the first physiological signals of different modalities. Statistical features may include, without limitation, expected value, standard deviation, skewness, kurtosis, number of peaks, expected peak amplitude, peak interval time, first quartile, second quartile, and power. Initial frequency features can be the frequencies of each physiological signal.

[0105] For example, when the first physiological signals include electroencephalogram (EEG), electromyogram (EMG), and electrodermal (ESC) signals, the modal features are as shown in Table 2.

[0106] Table 2

[0107]

[0108]

[0109] Specifically, for the EEG signal, the number of channels (32) multiplied by the number of features per channel (6) gives 192 feature dimensions. For the EMG signal, the number of channels (2) multiplied by the number of features per channel (5) gives 10 feature dimensions. For the ESC signal, the number of channels (1) multiplied by the number of features per channel (7) gives 7 feature dimensions.

[0110] Among them, γ, β, α, slowα, θ, and δ in EEG signals are six different feature dimensions; mathematical expectation, standard deviation, skewness, and kurtosis in EMG signals are four different feature dimensions; and number of peaks, mathematical expectation of peak amplitude, peak interval time, mathematical expectation, standard deviation, first quartile, and second quartile in ESC signals are seven different feature dimensions.
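As a non-limiting illustration, the feature dimensions per modality follow from multiplying the channel count by the number of features per channel. The channel and feature counts below are taken from the description above; the dictionary layout is an assumption made only for this sketch.

```python
# (number of channels, number of features per channel) for each modality
modalities = {"EEG": (32, 6), "EMG": (2, 5), "ESC": (1, 7)}

# Feature dimensions = channels x features per channel
feature_dims = {name: ch * feats for name, (ch, feats) in modalities.items()}
print(feature_dims)  # {'EEG': 192, 'EMG': 10, 'ESC': 7}
```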

[0111] The above example is merely one selection method in this embodiment. If the first physiological signal includes a body temperature recording signal, then feature data such as the expected value and standard deviation of each subject's body temperature recording signal can be extracted, without limitation. The modal features may differ for the first physiological signal of different modalities.

[0112] In one alternative embodiment, step S2042 includes:

[0113] S2042a, preprocess the first physiological signal of each modality to obtain the preprocessed first physiological signal of each modality; the preprocessing includes at least one of downsampling, filtering, and removing invalid data.

[0114] S2042b, based on the preprocessed first physiological signal of each modality, determine the modal characteristics of the first physiological signal of each modality.

[0115] It should be noted that after acquiring the first physiological signal, it can be preprocessed. Preprocessing may include downsampling, filtering out noise signals, removing short resting-state signals, and correcting abnormal signals. For example, the sampling rate of the raw EEG signal can be reduced from 512Hz to 128Hz, the bandpass filter range can be 4Hz-45Hz, and the 3-second resting-state portion of the data can be removed.

[0116] Since the DEAP experiment collects signals during a 3-second silent period before each participant watches the video, and this data does not contribute to emotion recognition, the preprocessing stage removes the 3-second silent period sampling data for each participant's data set. The data size for each participant becomes 40 (number of videos for emotional stimulus input) × 40 (number of channels) × 7680 (60 seconds × 128Hz sampling points) physiological signals. The silent period signal can be set according to experimental requirements, and the corresponding signal is filtered out during subsequent filtering based on the set duration.
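As a non-limiting illustration, the preprocessing of one trial can be sketched with NumPy as follows. The sketch assumes a 63-second recording sampled at 512 Hz, uses a brick-wall FFT band-pass filter as a stand-in for whatever filter is actually used, and applies the filter before downsampling so that no aliasing occurs; the function name `preprocess` and the synthetic test signal are hypothetical.

```python
import numpy as np

def preprocess(raw, fs_in=512, fs_out=128, band=(4.0, 45.0), rest_seconds=3):
    """raw: array (channels, samples) sampled at fs_in.
    Returns the band-passed, downsampled signal without the resting period."""
    n = raw.shape[1]
    spectrum = np.fft.rfft(raw, axis=1)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs_in)
    # Brick-wall band-pass: zero every frequency bin outside the pass band.
    spectrum[:, (freqs < band[0]) | (freqs > band[1])] = 0.0
    filtered = np.fft.irfft(spectrum, n=n, axis=1)
    # The 45 Hz cut-off lies below the new Nyquist frequency (64 Hz),
    # so keeping every 4th sample does not alias.
    down = filtered[:, :: fs_in // fs_out]
    # Drop the resting-state samples at the start of the trial.
    return down[:, rest_seconds * fs_out:]

fs_in, seconds = 512, 63
t = np.arange(fs_in * seconds) / fs_in
raw = np.vstack([
    np.sin(2 * np.pi * 10 * t) + np.sin(2 * np.pi * 100 * t),  # 10 Hz plus out-of-band 100 Hz
    np.sin(2 * np.pi * 10 * t),                                # 10 Hz only
])
clean = preprocess(raw)
print(clean.shape)  # (2, 7680): 60 seconds x 128 Hz per channel
```

After filtering, the two synthetic channels coincide, since only the 10 Hz component lies inside the 4 Hz to 45 Hz band.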

[0117] For example, this embodiment selects channels corresponding to the above three modalities in the original 40 channels of DEAP (Table 1), and uses EEG signals of channels 1-32, EMG signals of channels 35-36, and ESC signals of channel 37 as data input for emotion recognition.

[0118] If the number of peaks (the first dimension of the skin conductance features) is 0 for subjects 6, 24, 28, 29, 30, 31, and 32 in the collected DEAP dataset, then the corresponding second and third dimensions (i.e., the expected value of the peak amplitude and the peak interval time) of these subjects are invalid data (Not a Number, NaN) and cannot be used directly for model training. These signals need to be corrected to ensure the accuracy of the final results.

[0119] Abnormal signals are corrected by replacing them with data derived from other features of the same physiological signal of the affected subject. For example, when processing subject 24, arousal and valence are used as labels. The second and third dimensions of an abnormal sample of subject 24 are replaced by the averages of the second and third dimensions of that subject's normal samples in the same category. If the category contains no normal samples, the second dimension is replaced by the fourth dimension (i.e., the mathematical expectation), and the third dimension is replaced by 60 (meaning no peak). All abnormal data are processed in this way, so the ESC data are divided into data processed with arousal labels and data processed with valence labels. These two sets of data are used as the ESC data for training on the arousal and valence labels, respectively.
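As a non-limiting illustration, the correction rule for one subject and one label type can be sketched as follows. The sketch assumes each sample is a list of the seven skin conductance features in the order given above (0-based index 0 is the peak count, 1 the expected peak amplitude, 2 the peak interval time, 3 the mathematical expectation); the function name `repair_subject` and the sample values are hypothetical.

```python
import math

def repair_subject(samples, labels, no_peak_interval=60.0):
    """Replace NaN peak features of one subject's samples.
    samples: list of 7-element feature lists; labels: one class label per sample."""
    def is_bad(s):
        return math.isnan(s[1]) or math.isnan(s[2])

    repaired = [list(s) for s in samples]
    for i, s in enumerate(samples):
        if not is_bad(s):
            continue
        # Normal samples of the same subject that share the abnormal sample's class.
        peers = [p for p, lab in zip(samples, labels)
                 if lab == labels[i] and not is_bad(p)]
        if peers:
            repaired[i][1] = sum(p[1] for p in peers) / len(peers)
            repaired[i][2] = sum(p[2] for p in peers) / len(peers)
        else:
            repaired[i][1] = s[3]              # fall back to the signal mean
            repaired[i][2] = no_peak_interval  # 60 means no peak
    return repaired

nan = float("nan")
samples = [
    [2, 0.4, 10.0, 1.0, 0.1, 0.9, 1.0],
    [4, 0.8, 20.0, 1.2, 0.2, 1.0, 1.1],
    [0, nan, nan, 1.1, 0.1, 1.0, 1.1],  # class 1: normal peers exist
    [0, nan, nan, 0.7, 0.1, 0.6, 0.7],  # class 0: no normal peers
]
labels = [1, 1, 1, 0]
fixed = repair_subject(samples, labels)
```

Running the rule once with arousal labels and once with valence labels would yield the two ESC data sets described above.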

[0120] It should be noted that arousal represents the degree of excitement or inhibition of emotion, ranging from calm and drowsy to extreme excitement. Valence distinguishes the positive or negative nature of emotion, ranging from negative to positive. This invention uses arousal and valence as labels to train the model, which can be used as the emotional attributes and intensity of subjects. The data for arousal and valence are obtained by subjects rating the arousal and valence of each music video on a self-rating discrete 9-point scale after watching it.

[0121] In this embodiment, the first physiological signal of each modality is preprocessed to obtain the processed first physiological signal of each modality. Based on the processed first physiological signal of each modality, the modal characteristics of the first physiological signal of each modality are determined. Invalid data in the data can be filtered out by at least one of downsampling, filtering, and invalid data removal to ensure the accuracy of the data.

[0122] S2044, use tensor canonical correlation analysis to calculate the projection vectors of each modal feature in the collaborative space.

[0123] It should be noted that canonical correlation analysis has some limitations: 1) it can only process data from two modalities; 2) it can only capture linear relationships between two modalities. However, physiological signals contributing to emotion recognition often involve more than two modalities. Therefore, considering that the concept of tensor multilinear mapping matches the logic by which the traditional canonical correlation analysis algorithm learns two linear projection vectors or matrices, this embodiment can also employ tensor canonical correlation analysis to learn a shared collaborative space for common multimodal physiological signals such as EEG, EMG, and ESC. Then, a multimodal physiological signal collaborative representation emotion recognition model is constructed based on a support vector machine classifier.

[0124] Recalling canonical correlation analysis, the optimization objective $w_1^T C_{XY} w_2$ can be viewed as follows: $C_{XY}$ is first left-multiplied by $w_1^T$, which reduces it from a matrix (2nd-order tensor) to a vector (1st-order tensor). The result is then right-multiplied by $w_2$, which reduces it to a 0th-order tensor, the linear correlation coefficient ρ.

[0125] Where $X = [x_1, x_2, \ldots, x_N]^T$ and $Y = [y_1, y_2, \ldots, y_N]^T$; X represents the physiological signal of one modality, and Y represents the physiological signal of another modality. $x_1, x_2, \ldots, x_N$ are the physiological signals of one modality corresponding to the N sample objects, and $y_1, y_2, \ldots, y_N$ are the physiological signals of the other modality corresponding to the N sample objects.

[0126] $C_{XX} = E[X^T X]$ is the variance of X, $C_{YY} = E[Y^T Y]$ is the variance of Y, and $C_{XY} = C_{YX}^T = E[X^T Y]$ is the covariance of X and Y.

[0127] Left-multiplying $C_{XY}$ by $w_1^T$ can be viewed as projecting each column vector of $C_{XY}$ onto the direction $w_1$ and replacing the original column vector with the resulting projection value. The t-th column vector of $C_{XY}$ represents the covariance values of the t-th dimension of Y with each dimension of X. Therefore, the t-th element of the vector obtained by left-multiplying $C_{XY}$ by $w_1^T$ represents the covariance between the t-th dimension of Y and the vector resulting from projecting X onto $w_1$. Similarly, right-multiplying this result by $w_2$ can be viewed as a dot product with $w_2$ as the projection direction, which yields the covariance between the vector resulting from projecting X onto $w_1$ and the vector resulting from projecting Y onto $w_2$. The dot product of two centered vectors represents their covariance value (a scalar).

[0128] The computational logic by which canonical correlation analysis calculates the covariance of two sets of multidimensional variables is now extended to the covariance tensor. Consider multidimensional data with m modalities. As in canonical correlation analysis, assume there are N sample groups, all of which have been centered: $S = \{(x_{11}, x_{12}, \ldots, x_{1m}), (x_{21}, x_{22}, \ldots, x_{2m}), \ldots, (x_{N1}, x_{N2}, \ldots, x_{Nm})\}$, where $x_{ij}$ denotes the j-th modality of the i-th sample group. Let $X_1 = [x_{11}, x_{21}, \ldots, x_{N1}]^T$, $X_2 = [x_{12}, x_{22}, \ldots, x_{N2}]^T$, …, $X_m = [x_{1m}, x_{2m}, \ldots, x_{Nm}]^T$. Tensor canonical correlation analysis is given by formula (8), where $\mathcal{C}_{12\cdots m} = \frac{1}{N}\sum_{i=1}^{N} x_{i1} \circ x_{i2} \circ \cdots \circ x_{im}$ and the symbol $\circ$ denotes the outer product (a special case of the Kronecker product). For example, the outer product of a vector of length n and a vector of length m is an n×m matrix whose elements are the products of each pair of elements of the two vectors.

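[0129] $$\max_{w_1, \ldots, w_m} \; \rho = \mathcal{C}_{12\cdots m} \,\bar{\times}_1\, w_1^T \,\bar{\times}_2\, w_2^T \cdots \bar{\times}_m\, w_m^T \quad (8)$$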

[0130] s.t. $w_k^T C_{kk} w_k = 1$, $k = 1, 2, \ldots, m$

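[0131] In formula (8), the operation $\bar{\times}_k$ is the mode-k product of a tensor and a vector. Following the approach of solving canonical correlation analysis by singular value decomposition, the form of formula (8) is modified. Let $u_k = C_{kk}^{1/2} w_k$; then $u_k^T u_k = w_k^T C_{kk} w_k = 1$. Let $\mathcal{M} = \mathcal{C}_{12\cdots m} \times_1 C_{11}^{-1/2} \times_2 C_{22}^{-1/2} \cdots \times_m C_{mm}^{-1/2}$; formula (9) is then obtained.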

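[0132] $$\max_{u_1, \ldots, u_m} \; \rho = \mathcal{M} \,\bar{\times}_1\, u_1^T \,\bar{\times}_2\, u_2^T \cdots \bar{\times}_m\, u_m^T \quad \text{s.t.} \quad u_k^T u_k = 1, \; k = 1, 2, \ldots, m \quad (9)$$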

[0133] Formula (8) and Formula (9) are equivalent.

[0134] Solving formula (9) is equivalent to finding the best rank-1 approximation of the tensor $\mathcal{M}$. Given the data, the tensor $\mathcal{M}$ can be obtained directly. Therefore, this invention transforms the multimodal maximum linear correlation problem into a tensor decomposition problem. Common methods for tensor decomposition include Tucker decomposition, CP decomposition, higher-order singular value decomposition, and the higher-order power method.

[0135] Higher-order singular value decomposition can be used to decompose the tensor $\mathcal{M}$. Taking a third-order tensor as an example, the tensor $\mathcal{M}$ is written in the rank-1 approximation form of formula (10).

[0136]

[0137] By convention, let the mode-k unfolding of the tensor $\mathcal{M}$ be written as $M_{(k)}$. $L^{(1)}$ is the left singular matrix obtained by performing singular value decomposition on the mode-1 unfolding $M_{(1)}$, i.e., the matrix formed by the eigenvectors of $M_{(1)} M_{(1)}^T$. $L^{(2)}$ is the left singular matrix obtained by performing singular value decomposition on the mode-2 unfolding $M_{(2)}$, i.e., the matrix formed by the eigenvectors of $M_{(2)} M_{(2)}^T$. $L^{(3)}$ is the left singular matrix obtained by performing singular value decomposition on the mode-3 unfolding $M_{(3)}$, i.e., the matrix formed by the eigenvectors of $M_{(3)} M_{(3)}^T$. Thus, the higher-order singular value decomposition of the tensor $\mathcal{M}$ is obtained.

[0138] This invention uses the higher-order power method to obtain the rank-1 approximation of the tensor $\mathcal{M}$. The higher-order power method finds the approximating tensor $\hat{\mathcal{M}}$ by minimizing the Frobenius norm of the difference between $\mathcal{M}$ and $\hat{\mathcal{M}}$, where $\hat{\mathcal{M}}$ can be described by one scalar coefficient λ and m unit vectors $v_1, v_2, \ldots, v_m$, as shown in formula (11).

[0139] $$\hat{\mathcal{M}} = \lambda \, v_1 \circ v_2 \circ \cdots \circ v_m \quad (11)$$

[0140] In one alternative embodiment, step S2044 specifically includes:

[0141] S2044a, based on tensor canonical correlation analysis, calculate the covariance tensor between the modal features.

[0142] Specifically, for each test object, the modal features are standardized to obtain the corresponding standardized features. The values of the standardized features are within a preset range.

[0143] Physiological signals vary significantly among individuals, so directly processing the physiological data of each subject would introduce substantial errors. Therefore, individual differences are removed on a subject-by-subject basis, i.e., each subject's features are standardized and scaled to a range between -1 and 1, where $X_l$ is the unstandardized feature of the l-th subject, $U_l$ is the mean value of the l-th subject, $S_l$ is the standard deviation of the l-th subject, and $Z_l$ is the standardized feature of the l-th subject.
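As a non-limiting illustration, subject-wise standardization can be sketched with NumPy as follows. The sketch assumes the standardized feature is the z-score (feature minus the subject's mean, divided by the subject's standard deviation), which is consistent with the symbols defined above but is not stated explicitly; a z-score alone does not bound values to the range from -1 to 1, so any further rescaling is left open here. The function name `standardize_per_subject` and the synthetic data are hypothetical.

```python
import numpy as np

def standardize_per_subject(features, subject_ids):
    """features: (samples, dims); subject_ids: (samples,).
    Z-score the rows of each subject using that subject's own mean and std."""
    out = np.empty_like(features, dtype=float)
    for sid in np.unique(subject_ids):
        rows = subject_ids == sid
        mean = features[rows].mean(axis=0)
        std = features[rows].std(axis=0)
        std[std == 0] = 1.0  # assumption: constant features are mapped to zero
        out[rows] = (features[rows] - mean) / std
    return out

rng = np.random.default_rng(0)
subject_ids = np.repeat(np.arange(3), 40)         # 3 subjects, 40 samples each
offsets = np.array([0.0, 5.0, -3.0])[subject_ids]  # subject-specific baselines
scales = np.array([1.0, 4.0, 0.5])[subject_ids]    # subject-specific ranges
features = rng.standard_normal((120, 2)) * scales[:, None] + offsets[:, None]

Z = standardize_per_subject(features, subject_ids)
```

After this step every subject's features have zero mean and unit standard deviation, so subject-specific baselines and ranges no longer dominate the covariance computation.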

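[0144] Furthermore, the covariance tensor $\mathcal{C}_{12\cdots m}$ of the physiological signals of the different modalities is calculated according to the formula $\mathcal{C}_{12\cdots m} = \frac{1}{N}\sum_{i=1}^{N} x_{i1} \circ x_{i2} \circ \cdots \circ x_{im}$.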

[0145] S2044b, based on the covariance tensor, determine the tensor between the modal features.

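[0146] Specifically, the tensor $\mathcal{M}$ between the modal features is calculated according to the formula $\mathcal{M} = \mathcal{C}_{12\cdots m} \times_1 C_{11}^{-1/2} \times_2 C_{22}^{-1/2} \cdots \times_m C_{mm}^{-1/2}$.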
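As a non-limiting illustration, steps S2044a and S2044b can be sketched for three modalities with NumPy as follows. The sketch assumes the covariance tensor is the sample average of the outer products of the centered feature vectors, and that the tensor between modal features is obtained by multiplying the covariance tensor along each mode by the inverse square root of that modality's covariance matrix, as in the standard formulation of tensor canonical correlation analysis; the helper `inv_sqrt` and the random data are hypothetical.

```python
import numpy as np

def inv_sqrt(C, eps=1e-10):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T

rng = np.random.default_rng(1)
N = 50
X1, X2, X3 = (rng.standard_normal((N, d)) for d in (4, 3, 2))
views = [X - X.mean(axis=0) for X in (X1, X2, X3)]  # centered modal features

# Covariance tensor: average over samples of the outer product x_i1 o x_i2 o x_i3.
C123 = np.einsum("ia,ib,ic->abc", *views) / N

# Mode-k products with C_kk^{-1/2} give the tensor between modal features.
W = [inv_sqrt(X.T @ X / N) for X in views]
M = np.einsum("abc,ax,by,cz->xyz", C123, *W)
print(C123.shape, M.shape)  # (4, 3, 2) (4, 3, 2)
```

Contracting `C123` with projection vectors along each mode gives the same scalar as averaging the product of the projected modalities, which is the quantity that tensor canonical correlation analysis maximizes.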

[0147] S2044c, perform tensor decomposition on the tensor to determine the unit vectors corresponding to the rank-1 approximation of the tensor; each unit vector corresponds to the first physiological signal of one modality.

[0148] Specifically, the higher-order power method can be used to process the tensor $\mathcal{M}$. The rank-1 approximation problem is transformed into the constrained optimization problem shown in formula (12).

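[0149] $$\min_{\lambda, v_1, \ldots, v_m} \left\| \mathcal{M} - \lambda \, v_1 \circ v_2 \circ \cdots \circ v_m \right\|_F \quad \text{s.t.} \quad v_k^T v_k = 1, \; k = 1, 2, \ldots, m \quad (12)$$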

[0150] The specific process of the higher-order power method is as follows.

[0151]
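As a non-limiting illustration, a standard higher-order power iteration for a third-order tensor can be sketched with NumPy as follows. The sketch may differ in detail from the procedure of this embodiment: the random initialization, the stopping rule, and the function name `hopm_rank1` are assumptions made only for this example.

```python
import numpy as np

def hopm_rank1(M, n_iter=100, tol=1e-10, seed=0):
    """Rank-1 approximation lam * v1 o v2 o v3 of a third-order tensor M."""
    rng = np.random.default_rng(seed)
    v = [rng.standard_normal(d) for d in M.shape]
    v = [x / np.linalg.norm(x) for x in v]
    lam = 0.0
    for _ in range(n_iter):
        # Update each vector by contracting M with the other two, then normalize.
        v[0] = np.einsum("abc,b,c->a", M, v[1], v[2])
        v[0] /= np.linalg.norm(v[0])
        v[1] = np.einsum("abc,a,c->b", M, v[0], v[2])
        v[1] /= np.linalg.norm(v[1])
        v[2] = np.einsum("abc,a,b->c", M, v[0], v[1])
        new_lam = np.linalg.norm(v[2])
        v[2] /= new_lam
        converged = abs(new_lam - lam) < tol
        lam = new_lam
        if converged:
            break
    return lam, v

# An exactly rank-1 tensor with coefficient 3 and unit factor vectors.
a = np.array([1.0, 2.0, 2.0]) / 3.0
b = np.array([3.0, 4.0]) / 5.0
c = np.array([2.0, 1.0, 2.0, 0.0]) / 3.0
M = 3.0 * np.einsum("a,b,c->abc", a, b, c)

lam, v = hopm_rank1(M)
approx = lam * np.einsum("a,b,c->abc", *v)  # reproduces M up to rounding
```

For a tensor with more than three modes, the same alternating update applies with one contraction per mode.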

[0152] S2044d, determine the projection vectors of each modal feature in the collaborative space based on the unit vectors of the rank-1 approximation of the tensor.

[0153] Specifically, based on the unit vectors $v_1, v_2, \ldots, v_m$ of the rank-1 approximation of the tensor, the transformation $w_k = C_{kk}^{-1/2} v_k$ is performed to obtain the corresponding projection vectors $w_1, w_2, \ldots, w_m$.

[0154] In this embodiment, the covariance tensor between the modal features is calculated based on tensor canonical correlation analysis. Based on the covariance tensor, the tensor between the modal features is determined. Tensor decomposition is performed on this tensor to determine the unit vectors corresponding to its rank-1 approximation. Based on these unit vectors, the projection vectors of each modal feature in the collaborative space are determined. The projection vector for each modality is calculated by maximizing the correlation after linear projection of the different physiological modalities. The derivation shows that the learned collaborative features have good interpretability. Tensor canonical correlation analysis can handle inputs from more than two modalities, which enables simultaneous processing of multiple sets of physiological signal inputs for emotion recognition and captures more modal information that contributes to emotion recognition. Therefore, using the multimodal physiological signal collaborative representation model of this invention for subsequent classification tasks achieves a relatively ideal emotion recognition accuracy.

[0155] S2046, obtain the collaborative representation of each modality feature on the corresponding projection vector.

[0156] Specifically, a projection operation is performed on $X_1, X_2, \ldots, X_m$ to obtain the multimodal collaborative representation. The overall process for obtaining the collaborative representation is illustrated by the flowchart shown in Figure 4. In this flowchart, each row of features in modalities 1, 2, and 3 represents the feature data corresponding to one subject.
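As a non-limiting illustration, steps S2044d and S2046 can be sketched with NumPy as follows. The sketch assumes each projection vector is obtained by multiplying the corresponding unit vector by the inverse square root of that modality's covariance matrix, which is the relation implied by the constraint that each projection has unit variance; the unit vectors used here are stand-ins rather than the output of an actual tensor decomposition, and the helper `inv_sqrt` is hypothetical.

```python
import numpy as np

def inv_sqrt(C, eps=1e-10):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T

rng = np.random.default_rng(2)
N = 40
views = [rng.standard_normal((N, d)) for d in (5, 3, 2)]
views = [X - X.mean(axis=0) for X in views]   # centered modal features
covs = [X.T @ X / N for X in views]           # C_kk for each modality

# Stand-in unit vectors; in the method they come from the rank-1 approximation.
units = [np.ones(X.shape[1]) / np.sqrt(X.shape[1]) for X in views]

w = [inv_sqrt(C) @ v for C, v in zip(covs, units)]  # projection vectors w_k
z = [X @ wk for X, wk in zip(views, w)]             # one value per sample and modality
```

Each entry of `z` is the one-dimensional collaborative representation of one modality, with one value per sample.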

[0157] S206, the collaborative representations corresponding to the first physiological signals of each modality are fused to obtain fused features.

[0158] Specifically, when fusing the collaborative representations corresponding to the first physiological signals of each modality, the representations of the modalities can be concatenated horizontally or vertically to obtain the fused features.
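As a non-limiting illustration, the two concatenation directions can be sketched with NumPy as follows. The representation values are hypothetical and chosen only to show the resulting shapes.

```python
import numpy as np

# Hypothetical collaborative representations: one value per sample for each modality.
z_eeg = np.array([0.2, -1.1, 0.7, 0.4])
z_emg = np.array([0.1, -0.9, 0.8, 0.3])
z_esc = np.array([0.3, -1.0, 0.5, 0.6])

fused_rows = np.column_stack([z_eeg, z_emg, z_esc])  # horizontal: one row per sample
fused_cols = np.vstack([z_eeg, z_emg, z_esc])        # vertical: one row per modality
print(fused_rows.shape, fused_cols.shape)  # (4, 3) (3, 4)
```

The horizontal layout, with one row per sample, is the form a classifier typically expects as input.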

[0159] S208, the fused features are input into the neural network for emotional state classification learning, resulting in a target emotion recognition model.

[0160] Here, a support vector machine can be used to perform classification learning on the fused representation to obtain the target emotion recognition model. The standard support vector machine is a non-probabilistic binary linear classifier, which matches the positive and negative classification labels of arousal and valence.

[0161] In one alternative embodiment, the number of objects being tested is N, where N is a real number, and step S208 specifically includes:

[0162] S2082, Obtain the training dataset from the fusion features. The training dataset consists of the fusion features corresponding to n test objects, where n is a positive integer not greater than N.

[0163] Specifically, the data corresponding to any one subject can be selected from the fusion features as test data. As shown in Figure 5, the data in any row of the fused features (i.e., the feature data of one subject) can be selected as test data, and the data of the other subjects can be used as the training dataset.

[0164] S2084, based on multiple partitioning methods, determine the first feature subset and the second feature subset corresponding to the fusion features contained in the training dataset. Specifically, corresponding to each partitioning method, n test objects are divided into a first group of test objects and a second group of test objects. The first feature subset is the fusion feature of the first group of test objects, and the second feature subset is the fusion feature of the second group of test objects.

[0165] Specifically, the fusion features corresponding to multiple subjects in the training dataset can be used as the first feature subset, and the fusion features corresponding to the other subjects can be used as the second feature subset, where the second feature subset can be used as validation data. The fusion features corresponding to each subject can be one row of data in Figure 5.

[0166] It should be noted that the partitioning method shown in Figure 5 is merely one possible method for reference, namely, selecting one subject's data from the training dataset as the second feature subset. This application is not limited to this partitioning method.

[0167] S2086, based on the first feature subset and the second feature subset, performs independent cross-validation to obtain the target emotion recognition model.

[0168] The independent cross-validation operation includes:

[0169] S302, input the first feature subset determined by any partitioning method into the initial emotion classification model for training, until all feature data in the first feature subset have been input into the emotion classification model, and obtain the reference emotion classification model.

[0170] S304, input the second feature subset determined by the same partitioning method into the reference emotion classification model, and output the emotional state result corresponding to the second feature subset. Here, the emotional state result indicates the test label of the subject's emotional state, such as happy or unhappy.

[0171] S306, obtain the reference emotion classification models trained on the first feature subsets corresponding to the various partitioning methods.

[0172] S308, obtain the emotional state label corresponding to the first physiological signal of each modality, and compare the emotional state result corresponding to each second feature subset with the emotional state label of the corresponding first physiological signal to determine the accuracy of each reference emotion classification model. The emotional state labels correspond one-to-one with the first physiological signals of each modality.

[0173] If the emotional state result matches the emotional state label of the corresponding first physiological signal, it is considered correct; otherwise, it is considered incorrect.

[0174] S310, based on the accuracy of each reference emotion classification model, determine the target emotion recognition model.

[0175] Specifically, the accuracies of the emotion classification models constructed using the various partitioning methods are compared, and the model constructed using the partitioning method with the highest accuracy is used as the target emotion recognition model.
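As a non-limiting illustration, steps S302 to S310 can be sketched with NumPy as follows. The sketch assumes each partitioning method holds out one training subject as the second feature subset, and it uses a nearest-centroid classifier purely as a stand-in for the support vector machine named in this embodiment; the function names and the synthetic fused features are hypothetical.

```python
import numpy as np

def fit_centroids(X, y):
    """Stand-in classifier: one mean vector per class (not a support vector machine)."""
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(model, X):
    classes = list(model)
    dists = np.stack([np.linalg.norm(X - model[c], axis=1) for c in classes], axis=1)
    return np.array(classes)[dists.argmin(axis=1)]

def select_model(X, y, subjects):
    """Hold out each training subject in turn as the second feature subset and
    return (accuracy, held-out subject, model) for the most accurate reference model."""
    best = (-1.0, None, None)
    for held_out in np.unique(subjects):
        val = subjects == held_out
        model = fit_centroids(X[~val], y[~val])                    # S302
        acc = float((predict(model, X[val]) == y[val]).mean())     # S304, S308
        if acc > best[0]:                                          # S310
            best = (acc, held_out, model)
    return best

# Synthetic fused features: 4 training subjects, 10 samples each, binary labels.
rng = np.random.default_rng(3)
subjects = np.repeat(np.arange(4), 10)
y = np.tile([0, 1], 20)
X = rng.standard_normal((40, 3)) * 0.3 + y[:, None] * 2.0

acc, held_out, model = select_model(X, y, subjects)
```

Because every sample of the held-out subject is excluded from training, the reported accuracy reflects performance on an unseen subject.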

[0176] In this embodiment, by acquiring the training dataset from the fusion features, and based on multiple partitioning methods, a first feature subset and a second feature subset corresponding to the fusion features included in the training dataset are determined. Based on the first and second feature subsets, independent cross-validation is performed to obtain the target emotion recognition model. A multimodal physiological signal collaborative representation emotion recognition model is constructed based on a support vector machine classifier. A validation set is used for parameter selection, and a test set is used to measure the accuracy of the method, effectively avoiding overfitting. Furthermore, this invention employs independent cross-validation of subjects when verifying the learning effect, fully considering individual physiological differences; therefore, the final result is more applicable to real-world scenarios.

[0177] In this embodiment, the first physiological signals of multiple test subjects are acquired, the first physiological signals including at least two modalities. Based on the first physiological signals of each modality, a collaborative representation corresponding to the first physiological signal of each modality is determined. The collaborative representations corresponding to the first physiological signals of each modality are fused to obtain fused features. These fused features are then input into a neural network for emotional state classification learning, resulting in a target emotion recognition model. The collaborative representation is the one-dimensional feature information of the first physiological signals of each modality in the same collaborative space. This fusion of features from the various modalities enables simultaneous processing of multiple sets of physiological signal inputs for emotion recognition and captures more modal information that contributes to emotion recognition. Therefore, the target emotion recognition model obtained through the scheme provided in this embodiment can identify the emotional state of the test subject based on the physiological signals of multiple modalities, improving the accuracy of emotional state recognition.

[0178] Based on the same inventive concept, after the target emotion recognition model is constructed, emotion recognition can be performed based on this model. In some embodiments, as shown in Figure 6, the emotion recognition method includes:

[0179] S402, acquire the second physiological signals of the test subject in multiple modalities; the second physiological signals are the physiological change data of the test subject under different external stimuli.

[0180] The second physiological data can also be signal data acquired from different channels in different modalities, as shown in Table 1.

[0181] S404, input the second physiological signals of each modality into the target emotion recognition model to obtain the emotional state results of the test subject output by the target emotion recognition model.

[0182] For specific limitations on the target emotion recognition model, please refer to the limitations on the construction method of the target emotion recognition model above, which will not be repeated here.

[0183] In this embodiment, multiple modalities of second physiological signals of the test subject are acquired. These second physiological signals represent physiological changes in the test subject under different external stimuli. The second physiological signals of each modality are input into a target emotion recognition model, and the emotional state result of the test subject is obtained from the output of the target emotion recognition model. This allows for simultaneous processing of multiple sets of physiological signal inputs for emotion recognition, capturing more modal data information that contributes to emotion recognition. Performing classification tasks yields a relatively ideal emotion recognition accuracy.

[0184] Figure 7 is a schematic diagram of an apparatus for constructing an emotion recognition model according to an exemplary embodiment of this application. As shown in Figure 7, this apparatus is used to implement all or part of the functions of the aforementioned method embodiments. Specifically, the apparatus for constructing the emotion recognition model includes:

[0185] The acquisition module 701 is used to acquire the first physiological signals of multiple test subjects; the first physiological signals are physiological change data of the test subjects under different external stimuli, and the first physiological signals include at least two modalities.

[0186] The determination module 702 is used to determine the collaborative representation corresponding to the first physiological signal of each modality based on the first physiological signal of each modality. The collaborative representation is the one-dimensional feature information of the first physiological signal of each modality in the same collaborative space.

[0187] The fusion module 703 is used to fuse the collaborative representations corresponding to the first physiological signals of each modality to obtain fused features.

[0188] Learning module 704 is used to input fused features into a neural network to learn the classification of emotional states, thereby obtaining a target emotion recognition model.

[0189] For specific limitations on the construction device of the emotion recognition model, please refer to the limitations on the construction method of the emotion recognition model above, which will not be repeated here.

[0190] In one embodiment, the determining module includes:

[0191] The first determining unit is used to determine the modal characteristics of the first physiological signal of each modality; the modal characteristics include at least one of the statistical characteristics and initial frequency characteristics of the first physiological signal of each modality.

[0192] The first computing unit is used to calculate the projection vectors of each modal feature in the collaborative space using tensor canonical correlation analysis.

[0193] The first acquisition unit is used to acquire the collaborative representation of each modality feature on the corresponding projection vector.

[0194] In one embodiment, the first computing unit is specifically used to calculate the covariance tensor between the modal features based on tensor canonical correlation analysis; determine the tensor between the modal features based on the covariance tensor; perform tensor decomposition on the tensor to determine the unit vectors corresponding to the rank-1 approximation of the tensor, where each unit vector corresponds to the first physiological signal of one modality; and determine the projection vectors of each modal feature in the collaborative space based on the unit vectors of the rank-1 approximation of the tensor.

[0195] In one embodiment, the first computing unit is specifically used to standardize the modal features for each test object to obtain the corresponding standardized features, where the values of the standardized features are within a preset value range; and to calculate the covariance tensor between the standardized features based on tensor canonical correlation analysis.

[0196] In one embodiment, the first computing unit is specifically used to preprocess the first physiological signal of each modality to obtain the processed first physiological signal of each modality; the preprocessing includes at least one of downsampling, filtering, and removing invalid data; and based on the processed first physiological signal of each modality, the modal characteristics of the first physiological signal of each modality are determined.
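The three preprocessing operations named above can be sketched as a small pipeline. The specific choices here (dropping missing or non-finite samples, a moving-average filter, and keeping every second sample) are assumptions; the text names only the categories of operation, and the function names and sample data are hypothetical.

```python
import math


def remove_invalid(signal):
    """Drop missing (None) and non-finite (NaN, inf) samples."""
    return [x for x in signal if x is not None and math.isfinite(x)]


def moving_average(signal, window=3):
    """Simple low-pass filter: average each sample with its neighbours."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        chunk = signal[max(0, i - half): i + half + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def downsample(signal, factor):
    """Keep every `factor`-th sample."""
    return signal[::factor]


def preprocess(signal, window=3, factor=2):
    """Remove invalid data, filter, then downsample."""
    return downsample(moving_average(remove_invalid(signal), window), factor)


# Hypothetical raw recording with two invalid samples.
raw = [1.0, float("nan"), 3.0, 5.0, None, 7.0, 9.0]
clean = preprocess(raw)
```

Note that dropping samples shifts the time axis; a real pipeline would more likely interpolate or mask invalid segments so that modalities stay aligned.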

[0197] In one embodiment, the number of objects being tested is N, where N is a real number; the fusion module includes:

[0198] The second acquisition unit is used to acquire a training dataset from the fused features; the training dataset consists of the fused features corresponding to n test objects, where n is a positive integer not greater than N.

[0199] The second determining unit is used to determine, for each of multiple partitioning methods, the first feature subset and the second feature subset of the fused features contained in the training dataset; under each partitioning method, the n test objects are divided into a first group of test objects and a second group of test objects, the first feature subset being the fused features of the first group and the second feature subset being the fused features of the second group.

[0200] The independent cross-validation unit is used to perform independent cross-validation operations based on the first feature subset and the second feature subset to obtain the target emotion recognition model.

[0201] In one embodiment, the independent cross-validation unit is specifically used to: input the first feature subset determined by any partitioning method into an initial emotion classification model for training until all feature data in the first feature subset have been input into the emotion classification model, thereby obtaining a reference emotion classification model; input the second feature subset determined by the same partitioning method into the reference emotion classification model, and output the emotional state result corresponding to the second feature subset; obtain the reference emotion classification models trained on the first feature subsets corresponding to the various partitioning methods; obtain the emotional state labels corresponding to the first physiological signals of each modality, and compare the emotional state result corresponding to each second feature subset with the emotional state labels of the corresponding first physiological signals to determine the accuracy of each reference emotion classification model, wherein the emotional state labels correspond one-to-one with the first physiological signals of each modality; and determine the target emotion recognition model based on the accuracy of each reference emotion classification model.
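The partition, train, evaluate, and select loop described above can be sketched as follows. Several choices are assumptions made only to keep the example short: the partitioning method is leave-one-test-object-out, a nearest-centroid classifier stands in for the neural network, and the target model is taken to be the reference model with the highest accuracy (the text says only that selection is based on accuracy). All names and data are hypothetical.

```python
def leave_one_subject_out(subjects):
    """Yield (first_group, second_group) partitions, one per held-out test object."""
    for held_out in subjects:
        yield [s for s in subjects if s != held_out], [held_out]


def train_centroids(samples):
    """Stand-in classifier: the per-label mean of the fused feature vectors."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {lab: [v / counts[lab] for v in acc] for lab, acc in sums.items()}


def predict(model, features):
    """Return the label whose centroid is closest to the feature vector."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(features, centroid))
    return min(model, key=lambda lab: dist(model[lab]))


def subject_independent_cv(data):
    """data: {test_object: [(fused_features, label), ...]}."""
    results = []
    for train_ids, test_ids in leave_one_subject_out(sorted(data)):
        first_subset = [s for sid in train_ids for s in data[sid]]
        second_subset = [s for sid in test_ids for s in data[sid]]
        reference_model = train_centroids(first_subset)
        correct = sum(predict(reference_model, f) == y for f, y in second_subset)
        results.append((correct / len(second_subset), reference_model))
    # Assumed selection rule: keep the reference model with the highest accuracy.
    _, best_model = max(results, key=lambda r: r[0])
    return best_model, [acc for acc, _ in results]


# Hypothetical fused features and emotional state labels for three test objects.
data = {
    "s01": [([0.1, 0.2], "calm"), ([0.9, 0.8], "excited")],
    "s02": [([0.2, 0.1], "calm"), ([0.8, 0.9], "excited")],
    "s03": [([0.0, 0.3], "calm"), ([1.0, 0.7], "excited")],
}
target_model, accuracies = subject_independent_cv(data)
```

Because each second feature subset comes from test objects that the reference model never saw during training, the resulting accuracies estimate how well the model generalizes to new individuals.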

[0202] Specific limitations regarding the apparatus for constructing the emotion recognition model can be found in the limitations on the construction method of the emotion recognition model described above, and will not be repeated here. Each module in the aforementioned apparatus for constructing the emotion recognition model can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device in hardware form, or stored in the memory of a computer device in software form, so that the processor can call and execute the operations corresponding to each module.

[0203] In one embodiment, an electronic device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to implement the steps of the method in any of the above embodiments.

[0204] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the steps of the method in any of the above embodiments.

[0205] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps of the method in any of the above embodiments.

[0206] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the methods described above. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, or optical storage, etc. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can be in various forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM), etc.

[0207] It is readily understood that, based on the several embodiments provided in this application, those skilled in the art can combine, split, or reorganize the embodiments of this application to obtain other embodiments, none of which exceed the protection scope of this application.

[0208] The above detailed embodiments further illustrate the purpose, technical solution, and beneficial effects of the embodiments of this application. It should be understood that the above are merely specific embodiments of the embodiments of this application and are not intended to limit the protection scope of the embodiments of this application. Any modifications, equivalent substitutions, improvements, etc., made on the basis of the technical solutions of the embodiments of this application should be included within the protection scope of the embodiments of this application.

Claims

1. A method for constructing an emotion recognition model, characterized in that it comprises: acquiring a first physiological signal, wherein the first physiological signal originates from multiple test subjects, the first physiological signal is physiological change data generated by the test subjects under different external stimuli, and the first physiological signal includes at least two modalities; determining the modal characteristics of the first physiological signal of each modality, wherein the modal characteristics include at least one of the statistical characteristics and initial frequency characteristics of the first physiological signal of each modality; for each of the test subjects, standardizing the modal features to obtain corresponding standardized features, wherein the values of the standardized features are within a preset range; calculating the covariance tensor among the standardized features based on tensor canonical correlation analysis; determining the tensor between the modal features based on the covariance tensor; performing tensor decomposition on the tensor between the modal features, wherein a rank-one approximation of the tensor between the modal features is computed based on a higher-order power algorithm to determine the unit vectors corresponding to the rank-one approximation, each unit vector corresponding to the first physiological signal of one modality; determining, based on the unit vectors of the rank-one approximation, the projection vector corresponding to each modal feature in the collaborative space; obtaining the collaborative representation of each modal feature on the corresponding projection vector, wherein the collaborative representation is the one-dimensional feature information of the first physiological signal of each modality in the same collaborative space; fusing the collaborative representations corresponding to the first physiological signals of each modality to obtain fused features; and inputting the fused features into a neural network to learn the classification of emotional states, thereby obtaining a target emotion recognition model.

2. The method according to claim 1, characterized in that determining the modal characteristics of the first physiological signal of each modality comprises: preprocessing the first physiological signal of each modality to obtain the preprocessed first physiological signal of each modality, wherein the preprocessing includes at least one of downsampling, filtering, and removing invalid data; and determining the modal characteristics of the first physiological signal of each modality based on the preprocessed first physiological signal of each modality.

3. The method according to claim 1 or 2, characterized in that the number of test subjects is N, where N is a real number; and the step of inputting the fused features into a neural network to learn the classification of emotional states to obtain the target emotion recognition model comprises: obtaining a training dataset from the fused features, wherein the training dataset consists of the fused features corresponding to n test subjects, and n is a positive integer not greater than N; determining, for each of multiple partitioning methods, the first feature subset and the second feature subset of the fused features contained in the training dataset, wherein, under each partitioning method, the n test subjects are divided into a first group of test subjects and a second group of test subjects, the first feature subset is the fused features of the first group of test subjects, and the second feature subset is the fused features of the second group of test subjects; and performing an independent cross-validation operation based on the first feature subset and the second feature subset to obtain the target emotion recognition model.

4. The method according to claim 3, characterized in that performing the independent cross-validation operation comprises: inputting the first feature subset determined by any partitioning method into an initial emotion classification model for training until all feature data in the first feature subset have been input into the emotion classification model, thereby obtaining a reference emotion classification model; inputting the second feature subset determined by the same partitioning method into the reference emotion classification model, and outputting the emotional state result corresponding to the second feature subset; obtaining the reference emotion classification models trained on the first feature subsets corresponding to the various partitioning methods; obtaining the emotional state labels corresponding to the first physiological signals of each modality, and comparing the emotional state result corresponding to each second feature subset with the emotional state labels of the corresponding first physiological signals to determine the accuracy of each reference emotion classification model, wherein the emotional state labels correspond one-to-one with the first physiological signals of each modality; and determining the target emotion recognition model based on the accuracy of each of the reference emotion classification models.

5. An emotion recognition method, characterized in that it comprises: acquiring second physiological signals of multiple modalities of a test subject, wherein the second physiological signals are physiological change data of the test subject under different external stimuli; and inputting the second physiological signal of each modality into the target emotion recognition model obtained by the method of any one of claims 1-4, and obtaining the emotional state result of the test subject output by the target emotion recognition model.

6. An apparatus for constructing an emotion recognition model, characterized in that the apparatus comprises: an acquisition module, used to acquire a first physiological signal, wherein the first physiological signal originates from multiple test subjects, the first physiological signal is physiological change data generated by the test subjects under different external stimuli, and the first physiological signal includes at least two modalities; a determining module, used to determine the modal characteristics of the first physiological signal of each modality, wherein the modal characteristics include at least one of the statistical characteristics and initial frequency characteristics of the first physiological signal of each modality; for each of the test subjects, standardize the modal features to obtain corresponding standardized features, wherein the values of the standardized features are within a preset range; calculate the covariance tensor among the standardized features based on tensor canonical correlation analysis; determine the tensor between the modal features based on the covariance tensor; perform tensor decomposition on the tensor between the modal features, wherein a rank-one approximation of the tensor between the modal features is computed based on a higher-order power algorithm to determine the unit vectors corresponding to the rank-one approximation, each unit vector corresponding to the first physiological signal of one modality; determine, based on the unit vectors of the rank-one approximation, the projection vector corresponding to each modal feature in the collaborative space; and obtain the collaborative representation of each modal feature on the corresponding projection vector, wherein the collaborative representation is the one-dimensional feature information of the first physiological signal of each modality in the same collaborative space; a fusion module, used to fuse the collaborative representations corresponding to the first physiological signals of each modality to obtain fused features; and a learning module, used to input the fused features into a neural network to learn the classification of emotional states, thereby obtaining a target emotion recognition model.

7. An electronic device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 5.

8. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 5.

9. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 5.