A child emotion intervention strategy matching method, system, terminal and medium
By using multimodal sensor technology and deep learning algorithms, the social-emotional information of children with autism is collected in real time. Combined with graph convolutional networks, the social-emotional state of children with autism can be accurately identified and quantitatively assessed, providing personalized intervention strategies and improving the intervention effect.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- LISHUI MATERNAL & CHILD HEALTH HOSPITAL
- Filing Date
- 2026-01-29
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies lack systematic intervention methods for the social emotions of children with autism, making it difficult to achieve quantitative analysis and personalized intervention for emotion recognition, understanding, regulation and expression. Furthermore, it is difficult to capture the multimodal behavior and neural information of children in natural social scenarios in real time, resulting in poor intervention effects.
Multimodal sensor technology is used to collect facial expressions, gaze, movements and brain activity of children with autism in real time. Combined with graph convolutional networks and deep learning algorithms, personalized intervention strategies are automatically matched through multimodal database and collaborative filtering technology to improve the accuracy and effectiveness of intervention.
It enables accurate identification and quantitative assessment of the social and emotional state of children with autism, providing personalized intervention strategies and improving the accuracy and effectiveness of interventions.
Figure CN122245634A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of data processing technology, and in particular to a method, system, terminal, and medium for matching intervention strategies for children's emotions.
Background Technology
[0002] Autism Spectrum Disorder (ASD) is a group of disorders characterized by impaired social communication, narrow interests, and repetitive, stereotyped behaviors. Social communication impairment is considered a core manifestation of autism, primarily involving impaired abilities to recognize, understand, regulate, and express emotions in social situations. Children with autism often lack sensitivity to various emotional cues such as facial expressions, tone of voice, and gestures, making it difficult for them to accurately perceive the emotional states of others and to express their own emotional responses appropriately in interactions. This deficit in emotional functioning leads to a lack of positive emotional resonance and suitable social feedback in interactions, further exacerbating isolation and difficulties in social adaptation. Traditional social-emotional interventions rely heavily on professional behavioral observation and experiential judgment, using structured teaching and psychological counseling for skills training. However, assessment indicators often lack objectivity and quantifiability, and the procedures are time-consuming and highly dependent on professional personnel, hindering their widespread adoption in large populations.
[0003] In recent years, engineering techniques such as machine vision, sensor technology, electroencephalography (EEG), and functional brain imaging have been introduced into autism intervention research in an attempt to provide objective, real-time information collection and analysis. However, existing systems mostly focus on single-dimensional behavioral detection or simple classification of children's facial expressions and motor characteristics, and lack a systematic description and in-depth analysis of social and emotional function deficits. They cannot fully reveal how autistic children's integration of multimodal emotional information is impaired, nor can they provide more targeted intervention feedback. Existing machine-assisted methods are still insufficient in assessing emotion understanding and regulation abilities: they fail to cover emotion recognition, emotion understanding, emotion regulation, and emotion expression together, and they cannot continuously track dynamic emotional changes in complex social situations.
[0004] Therefore, existing technologies still need to be improved and developed.
Summary of the Invention
[0005] The main purpose of this application is to provide a method, system, terminal and medium for matching intervention strategies for children's emotions. It aims to solve two problems of existing machine-assisted methods: they lack quantitative analysis and personalized intervention for the functions of emotion recognition, understanding, regulation and expression, and they struggle to capture children's multimodal behavioral and neural information in natural social scenarios in real time, which results in poor intervention effects for children's emotions.
[0006] The first aspect of this application provides a method for matching intervention strategies for children's emotions, the method comprising the following steps: constructing an assessment and intervention paradigm, and obtaining video information and physiological parameter information of the child while the child performs the assessment and intervention paradigm; performing multimodal representation extraction on the video information and the physiological parameter information to obtain a social-emotional competence profile of the child; obtaining standard state data, and determining the target intervention strategy matching the child based on the social-emotional competence profile and the standard state data.
[0007] Optionally, in one embodiment of this application, the video information includes image data of the interaction between the guide and the child, and the physiological parameter information includes the child's electroencephalogram (EEG) data, speech data, spectral data, and physiological data; The acquisition of the child's video information and physiological parameter information specifically includes: Collect multimodal data and corresponding timestamp data of guides and children at different frequencies; Synchronization and alignment are performed based on the multimodal data and the timestamp data to obtain image data of the interaction between the guide and the child, the child's EEG data, spectral data, and physiological data; wherein the time series of the image data, the EEG data, the speech data, the spectral data, and the physiological data are consistent.
[0008] Optionally, in one embodiment of this application, the social-emotional competence profile includes multiple emotion state representation vectors; the step of performing multimodal representation extraction on the video information and the physiological parameter information to obtain the child's social-emotional competence profile specifically includes: performing multimodal feature extraction on the image data, the electroencephalogram data, the speech data, the spectral data, and the physiological data to obtain a unified feature vector; constructing a cross-modal fusion model, and inputting the unified feature vector into the cross-modal fusion model to obtain the multiple emotion state representation vectors.
[0009] Optionally, in one embodiment of this application, the step of performing multimodal feature extraction on the image data, the electroencephalogram data, the speech data, the spectral data, and the physiological data to obtain a unified feature vector specifically includes: performing feature extraction on the image data to obtain local spatial structure features and social behavior features; performing feature extraction on the electroencephalogram (EEG) data, the speech data, the spectral data, and the physiological data to obtain pure brain features, speech prosody features, neural response features, and physiological features; and obtaining the unified feature vector based on the local spatial structure features, the social behavior features, the pure brain features, the speech prosody features, the neural response features, and the physiological features.
[0010] Optionally, in one embodiment of this application, the construction of the cross-modal fusion model specifically includes: constructing a multimodal social emotion network model; jointly modeling the multimodal social emotion network model with a temporal convolutional network and a graph convolutional network to obtain the cross-modal fusion model. The cross-modal fusion model is represented as follows: $E_t = \mathrm{TCN}(\mathrm{GCN}(X_t))$; where $E_t$ is the emotional state representation vector; $\mathrm{TCN}(\cdot)$ is the temporal convolutional network; $\mathrm{GCN}(\cdot)$ is the graph convolutional network; and $X_t$ is the multimodal feature matrix at timestamp $t$.
[0011] Optionally, in one embodiment of this application, determining the target intervention strategy matching the child based on the social-emotional competence profile and the standard state data specifically includes: The differences in the children's emotional states are obtained based on multiple emotional state representation vectors and the standard state data. Based on the differences in emotional states, multiple preset intervention paradigms are determined to have different intervention effects on the children. Based on the effects of multiple interventions, a target intervention strategy matching the child is determined.
[0012] Optionally, in one embodiment of this application, the step of determining a target intervention strategy matching the child based on the social-emotional competence profile and the standard state data further includes: The image data, EEG data, speech data, spectral data, and physiological data are updated to obtain updated multimodal data; Based on the updated multimodal data, the updated intervention effects are obtained; Based on the updated intervention effects, the target intervention strategies matching the children are updated.
[0013] A second aspect of this application also provides a system for matching intervention strategies for children's emotions, wherein the system is applied to the method for matching intervention strategies for children's emotions as described in any of the above-described solutions; the system includes: The paradigm construction and information perception module is used to construct an assessment and intervention paradigm and acquire the child's video information and physiological parameter information during the child's execution of the assessment and intervention paradigm. The representation extraction module is used to perform multimodal representation extraction on the video information and the physiological parameter information to obtain a social-emotional ability profile of the child. The personalized intervention matching module is used to acquire standard state data and determine the target intervention strategy to match the child based on the social-emotional ability profile and the standard state data.
[0014] A third aspect of this application also provides a terminal, wherein the terminal includes: a memory, a processor, and a child emotion intervention strategy matching program stored in the memory and executable on the processor, wherein when the child emotion intervention strategy matching program is executed by the processor, it implements the steps of the child emotion intervention strategy matching method as described above.
[0015] A fourth aspect of this application also provides a computer-readable storage medium, wherein the computer-readable storage medium stores a child emotion intervention strategy matching program, and when the child emotion intervention strategy matching program is executed by a processor, it implements the steps of the child emotion intervention strategy matching method as described above.
[0016] Beneficial effects: This application provides a method, system, terminal, and medium for matching intervention strategies for children's emotions. By collecting multidimensional data of autistic children in social situations, this application performs joint learning and fine-grained representation of the multimodal features extracted from the multidimensional data, thereby achieving accurate identification and quantitative assessment of the social emotional state of autistic children, and thus enabling the matching of personalized intervention strategies for children, improving the intervention effect on children's emotions.
Attached Figure Description
[0017] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0018] Figure 1 is a schematic diagram of a preferred embodiment of the intervention strategy matching system for children's emotions according to this application; Figure 2 is a flowchart of a preferred embodiment of the intervention strategy matching method for children's emotions according to this application; Figure 3 is a flowchart of multimodal information perception and representation extraction in a preferred embodiment of the intervention strategy matching method for children's emotions according to this application; Figure 4 is a structural diagram of a preferred embodiment of the intervention strategy matching system for children's emotions according to this application; Figure 5 is a structural diagram of a preferred embodiment of the terminal of this application.
[0019] Explanation of reference numerals in the attached figures: 100. Paradigm Construction and Information Perception Module; 200. Representation Extraction Module; 300. Personalized Intervention Matching Module.
Detailed Implementation
[0020] To make the objectives, technical solutions, and effects of this application clearer and more explicit, the technical solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. The described embodiments are only possible technical implementations of this application and not all possible implementations. Based on the embodiments in this application, those skilled in the art can obtain other embodiments without creative effort, and these embodiments are also within the protection scope of this application.
[0021] First, the terms used in the embodiments of this application are introduced. MSENet: Multimodal Social Emotion Network, used for multimodal feature fusion and emotional state representation.
[0022] CNN: Convolutional Neural Network, used to extract local spatial features of images (such as pose and expression).
[0023] VitPose: A human pose estimation algorithm (based on the Vision Transformer architecture) used to detect joint coordinates and model skeletal motion.
[0024] TCN: Temporal Convolutional Network, which models temporal context dependencies and captures the temporal evolution of emotional states.
[0025] GCN: Graph Convolutional Network, which describes high-order relationships between modal features.
[0026] MLP: Multi-Layer Perceptron, used for feature mapping and attention weight calculation.
[0027] EEG: Electroencephalogram, the electrical signals of the brain; fNIRS: functional Near-Infrared Spectroscopy.
[0028] VitPose: A human pose estimation algorithm.
[0029] ICA: Independent Component Analysis.
[0030] MFCC: Mel-Frequency Cepstral Coefficients; RMSSD: Root mean square of the differences between consecutive normal RR intervals (a measure of heart rate variability).
[0031] SDNN: Standard deviation of normal RR intervals (a measure of heart rate variability).
[0032] EDA: Electrodermal Activity, the electrical response of the skin.
[0033] GAP: Global Average Pooling.
[0034] GMP: Global Max Pooling.
[0035] Sigmoid: S-shaped activation function.
[0036] KL divergence: Kullback-Leibler divergence, a measure of the difference between probability distributions.
[0037] NTP: Network Time Protocol, a protocol for synchronizing clocks over a network.
[0038] HbO: Oxyhemoglobin.
[0039] HbR: Deoxyhemoglobin.
[0040] Currently, there is no systematic intervention method for the social emotions of children with autism. There is a lack of quantitative analysis and personalized intervention methods for core functions such as emotion recognition, understanding, regulation and expression. Furthermore, traditional methods struggle to capture the multimodal behavioral and neural information of children in natural social scenarios in real time, resulting in limited intervention effects.
[0041] To address the aforementioned issues, this application employs multimodal sensor technology (visual, EEG, fNIRS, speech, and physiological signals) to collect real-time data on facial expressions, gaze, movements, and brain activity in children with autism. This data is combined with graph convolutional networks and deep learning algorithms to achieve fine-grained multimodal emotional information representation. Based on relevant knowledge, a core social-emotional function intervention unit pool is constructed and pre-assessment is conducted. Through a multimodal database and collaborative filtering technology, personalized intervention strategies are automatically matched to improve the accuracy and effectiveness of the intervention.
[0042] Referring to Figure 1, this application presents a social-emotional intervention system for autistic children based on multimodal sensor fusion and deep learning (a system for matching intervention strategies for children's emotions). It integrates visual sensors, electroencephalography (EEG), functional near-infrared spectroscopy (fNIRS), speech, and physiological sensors. The system can collect, in real time, multidimensional information such as the facial expressions, gaze, movements, and brain activity of autistic children during social interactions. Through multimodal behavioral representation methods and graph convolutional networks, combined with the MSENet network structure proposed by the inventors, it achieves dynamic, fine-grained, and multi-level quantitative analysis of social-emotional states. Furthermore, based on a pool of intervention units for core social-emotional deficits, the system uses a collaborative filtering algorithm to match three types of intervention units (facial expression recognition, emotion imitation, and joint attention) from a multimodal database and generate personalized intervention plans. It can also continuously and dynamically monitor the intervention effects, providing feedback and optimizing clinical intervention strategies.
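The collaborative filtering step is not specified in detail in this application. The following is only a minimal sketch of one common variant (similarity-weighted, user-based collaborative filtering), under the assumption that the multimodal database stores, for previously treated children, a profile vector and an observed effect score for each intervention unit. The names `db_profiles`, `db_effects`, `predict_effects`, and all numeric values are hypothetical.

```python
import numpy as np

# The three intervention unit types named in the text.
UNITS = ["facial expression recognition", "emotion imitation", "joint attention"]

def cosine_similarity(query, rows):
    """Cosine similarity between one vector and each row of a matrix."""
    denom = np.linalg.norm(rows, axis=1) * np.linalg.norm(query) + 1e-12
    return (rows @ query) / denom

def predict_effects(new_profile, db_profiles, db_effects):
    """Similarity-weighted average of effects observed for similar children."""
    sims = np.clip(cosine_similarity(new_profile, db_profiles), 0.0, None)
    weights = sims / (sims.sum() + 1e-12)
    return weights @ db_effects

# Hypothetical database: 3 children, 2-dimensional profiles,
# and one effect score per intervention unit (assumed values).
db_profiles = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
db_effects = np.array([[0.9, 0.2, 0.1],
                       [0.1, 0.3, 0.8],
                       [0.5, 0.6, 0.4]])

new_profile = np.array([1.0, 0.1])
predicted = predict_effects(new_profile, db_profiles, db_effects)
best_unit = UNITS[int(np.argmax(predicted))]
print(best_unit, predicted.round(3))
```

In this toy example the new child's profile is closest to the first stored child, so the unit that worked best for that child receives the highest predicted score. A real system would need many more records and a validated effect measure.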
[0043] Compared with existing technologies that only focus on the extraction and classification of behavioral characteristics of children with autism, this application systematically describes and quantifies the social and emotional function deficits of children with autism by integrating multimodal sensor data such as vision and EEG, and combining network structures such as MSENet. Through fine-grained dynamic analysis and personalized intervention strategies, it can more accurately and comprehensively support the intervention and rehabilitation of social and emotional abilities.
[0044] This application employs multimodal sensors (vision, EEG, depth sensor, speech, physiological signals, etc.) to simultaneously collect multidimensional data on autistic children in social situations. To ensure the spatiotemporal alignment of the multimodal data, the system utilizes precise timestamp synchronization and spatial calibration techniques to eliminate differences in acquisition time and perspective between different devices. After preprocessing, the collected multimodal data is input into a deep learning model based on the MSENet architecture. By fusing Graph Convolutional Networks (GCN) and Temporal Convolutional Networks (TCN), spatiotemporal features from multi-source heterogeneous data such as human skeletons, facial expressions, and EEG are captured. The MSENet model performs joint learning and fine-grained representation of multimodal features, enabling accurate identification and quantitative assessment of the social emotional state of autistic children. This model provides a solid data foundation for subsequent matching of emotional functional unit intervention libraries and generation of personalized intervention programs based on the ICF-CY framework, supporting the system to achieve dynamic monitoring and feedback loops, and improving the accuracy and individual adaptability of interventions.
[0045] The technical solutions of this application will be described in detail below with specific embodiments. The following specific embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments.
[0046] The preferred embodiment of this application describes a method for matching intervention strategies for children's emotions. As shown in Figure 2, the intervention strategy matching method for children's emotions includes the following steps: In step S101, an assessment and intervention paradigm is constructed, and while the child performs the assessment and intervention paradigm, the child's video information and physiological parameter information are obtained.
[0047] It should be noted that this application achieves multi-level and multi-dimensional deep fusion and fine-grained feature extraction of behavioral and neural signals by simultaneously collecting multi-camera, EEG, functional near-infrared spectroscopy, speech and physiological sensor data, combined with the MSENet network. This enables intelligent matching of personalized intervention paradigms and dynamic monitoring of intervention effects, forming a closed-loop optimization mechanism.
[0048] In one possible implementation, the video information includes image data of the interaction between the guide and the child, and the physiological parameter information includes the child's electroencephalogram (EEG) data, speech data, spectral data, and physiological data. Multimodal data and corresponding timestamp data of the guide and child at different frequencies are collected; the multimodal data and timestamp data are synchronized and aligned to obtain the image data of the interaction between the guide and the child, the child's EEG data, spectral data, and physiological data; wherein the time series of the image data, the EEG data, the speech data, the spectral data, and the physiological data are consistent.
[0049] Specifically, this application divides the social-emotional function of children with autism into two aspects: first, emotion recognition and understanding, which refers to the child's ability to perceive their own basic emotions such as joy, sadness, and anger, as well as to recognize the emotions and feelings of others; and second, emotion regulation and expression, which refers to the child's ability to regulate internal emotions and external behavior, appropriately express emotions in social situations, and respond appropriately to changes in the emotions of others. To achieve personalized assessment and intervention recommendations for the social-emotional function of children with autism, this application constructs an integrated assessment and intervention paradigm library around the above two dimensions: $P = \{p_1, p_2, \dots, p_k\}$, where each paradigm $p_i$ has two levels of functionality, assessment and intervention. It can serve as an assessment paradigm to extract the child's performance characteristics on a specific emotional competence dimension, and as an intervention implementation unit for competence enhancement training. $p_1$ is the first paradigm unit, $p_2$ is the second paradigm unit, and $p_k$ is the $k$-th paradigm unit. The paradigm library includes the following three typical paradigms:
First, the facial emotion recognition paradigm: this uses methods such as card matching and video guidance to help children identify and imitate basic facial emotions (such as happiness, anger, sadness, and surprise), improving their emotion recognition ability.
Second, the emotion imitation response paradigm: based on speech and facial expression imitation training tasks, this assesses children's ability to imitate emotional expressions and the consistency of their responses, indirectly reflecting their emotion regulation and expression abilities and the activation level of their mirror neuron system.
Third, the joint attention guidance paradigm: this designs scenario tasks based on "eye-target" switching and object identification to capture children's responsiveness to social cues (such as eye contact and gestures) and their attention-sharing behavior, assessing their social motivation and social information perception abilities.
Each of the above paradigms $p_i$ corresponds to a feature vector $v_i \in \mathbb{R}^{d}$, composed of the emotional state representation vectors extracted during paradigm execution, where $v_i$ is the feature vector corresponding to the $i$-th paradigm and $\mathbb{R}^{d}$ denotes the $d$-dimensional real-valued space. After the paradigms are executed, a social-emotional functional ability profile of the individual child, $F = [v_1, v_2, \dots, v_k]$, can be generated, where $F$ denotes the social-emotional functional ability profile, that is, the matrix (with dimensions $k \times d$) formed by concatenating the feature vectors $v_i$ of the paradigms, which serves as the input to the intervention strategy recommendation algorithm; $v_1$ is the feature vector corresponding to the first paradigm, and $v_2$ is the feature vector corresponding to the second paradigm.
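As an illustration of the profile described above, the sketch below stacks one feature vector per paradigm into a k×d matrix and measures how far each row lies from a reference vector. The Euclidean distance, the reference values, and the function names `build_profile` and `deviation_from_standard` are assumptions made for this example; the application does not state which distance or reference is used.

```python
import numpy as np

def build_profile(paradigm_vectors):
    """Stack k feature vectors of length d into a (k, d) profile matrix."""
    return np.vstack(paradigm_vectors)

def deviation_from_standard(profile, standard):
    """Per-paradigm Euclidean distance from the reference (standard) state."""
    return np.linalg.norm(profile - standard, axis=1)

# Hypothetical feature vectors for the three paradigms (d = 4).
vectors = [
    np.array([1.0, 0.0, 0.0, 0.0]),  # facial emotion recognition
    np.array([0.0, 2.0, 0.0, 0.0]),  # emotion imitation response
    np.array([0.0, 0.0, 0.0, 0.0]),  # joint attention guidance
]
profile = build_profile(vectors)
standard = np.zeros((3, 4))          # assumed reference state per paradigm

deviation = deviation_from_standard(profile, standard)
ranking = np.argsort(-deviation)     # paradigms with the largest gap first
print(profile.shape, deviation, ranking)
```

Ranking the paradigms by deviation is one simple way to prioritize the dimensions on which a child differs most from the reference group.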
[0050] Then, multimodal information perception is performed. In order to realize the refined assessment of the social and emotional abilities of children with autism and the construction of personalized intervention paths, this application uses a multimodal information perception module and collects multimodal behavioral and physiological data of children with autism and normal children simultaneously based on the assessment and intervention paradigm library, which facilitates subsequent paradigm response difference modeling and comparative analysis.
[0051] Specifically, the system is equipped with three cameras, used to capture panoramic views of the doctor's interaction with the child, the doctor's facial expressions and gaze information, and the child's facial expressions, gaze, and body movements, comprehensively capturing details of social interaction and core behavioral characteristics of autism. Simultaneously, the system collects electroencephalograms (EEG), functional near-infrared spectroscopy (fNIRS), speech signals, and physiological parameters such as heart rate and skin conductance, forming multimodal heterogeneous data input. To ensure accurate timestamp correspondence across modalities, a unified system time base is achieved using the Network Time Protocol (NTP), and unified triggering of the cameras and physiological acquisition devices is achieved through hardware synchronization signal lines, reducing startup latency. During data acquisition, timestamp recording and time interpolation methods are combined to synchronize and align data from different sampling frequencies, forming a unified time series.
[0052] Let the sampling frequencies of the different modalities be $f_1, f_2, \dots, f_M$. The corresponding sampling time series are: $T_m = \{t_{m,0}, t_{m,1}, \dots, t_{m,N_m}\}$; where $f_1$ is the sampling frequency of the first modality, $f_2$ is the sampling frequency of the second modality, and $f_M$ is the sampling frequency of the $M$-th modality; $T_m$ is the sampling time series of the modality with index $m$; $N_m$ is the total number of sampling points of the $m$-th modality; and $t_{m,0}$, $t_{m,1}$, and $t_{m,N_m}$ are the 0th, first, and $N_m$-th sampling time points of the $m$-th modality, which together form the original time series of that modality.
[0053] Using the interpolation function $\mathcal{I}(\cdot)$: $\hat{x}_m(\tau_j) = \mathcal{I}\big(x_m(t_{m,n}), \tau_j\big)$; where $j = 1, 2, \dots, L$. $\tau_j$ is the $j$-th target time point in the unified time series; $\hat{x}_m(\tau_j)$ is the interpolated data of the $m$-th modality at the unified time point $\tau_j$; and $x_m(t_{m,n})$ is the original observation of the $m$-th modality at time point $t_{m,n}$.
[0054] Finally, a synchronized and aligned multimodal dataset $D$ is obtained: $D = \{\hat{x}_1(\tau_j), \hat{x}_2(\tau_j), \dots, \hat{x}_M(\tau_j)\}_{j=1}^{L}$.
[0055] Here, $L$ is the total length of the unified time series; $\hat{x}_1(\tau_j)$ is the aligned data value of the first modality (such as visual, audio, or physiological signals) at the unified time point $\tau_j$, obtained through linear interpolation; and $\hat{x}_2(\tau_j)$ is the aligned data value of the second modality at the unified time point $\tau_j$, obtained through linear interpolation.
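The alignment described above can be sketched with NumPy's linear interpolation routine `np.interp`. This is a minimal illustration, not the application's implementation: the function name `align_modalities`, the stream names, and the sampling rates (250 Hz and 4 Hz resampled to 50 Hz) are assumptions chosen for the example.

```python
import numpy as np

def align_modalities(streams, unified_rate_hz, duration_s):
    """Resample every stream onto one shared time grid by linear interpolation.

    streams maps a modality name to (timestamps_s, values), with timestamps
    increasing and expressed on a common clock (for example after NTP sync).
    """
    n_points = int(round(duration_s * unified_rate_hz))
    grid = np.arange(n_points) / unified_rate_hz
    aligned = {name: np.interp(grid, t, x) for name, (t, x) in streams.items()}
    return grid, aligned

# Two hypothetical streams observing the same linear ramp at different rates.
t_fast = np.arange(0.0, 2.0, 1.0 / 250.0)   # e.g. an EEG-like 250 Hz channel
t_slow = np.arange(0.0, 2.25, 1.0 / 4.0)    # e.g. an EDA-like 4 Hz channel
streams = {
    "fast": (t_fast, 2.0 * t_fast),
    "slow": (t_slow, 2.0 * t_slow),
}

grid, aligned = align_modalities(streams, unified_rate_hz=50, duration_s=1.0)
print(len(grid), aligned["fast"].shape, aligned["slow"].shape)
```

Because both toy streams sample a linear signal, linear interpolation recovers it exactly on the shared grid. Real signals with fast dynamics would lose detail when a low-rate stream is upsampled, so the unified rate is a design choice.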
[0056] This application utilizes feature vectors generated through paradigm execution to quantify children's ability differences across various dimensions, thereby matching the most suitable combination of intervention paradigms. Through three main steps—hardware configuration, time synchronization, and data alignment—this application ultimately outputs structured, synchronized time-series data, providing high-quality data input for subsequent multimodal feature fusion, emotional state modeling, and personalized intervention strategy matching.
[0057] In step S102, multimodal representation extraction is performed on the video information and the physiological parameter information to obtain a social-emotional ability profile of the child.
[0058] This application, through multimodal deep fusion, can capture subtle emotional changes that are difficult to detect with a single modality, thereby improving the accuracy of the assessment.
[0059] In one possible implementation, the social-emotional competence profile includes multiple emotion state representation vectors. Multimodal feature extraction is performed on the image data, EEG data, speech data, spectral data, and physiological data to obtain a unified feature vector; a cross-modal fusion model is constructed, and the unified feature vector is input into the cross-modal fusion model to obtain the multiple emotion state representation vectors.
[0060] In one possible implementation, feature extraction is performed on the image data to obtain local spatial structure features and social behavior features; feature extraction is performed on the EEG data, the speech data, the spectral data, and the physiological data to obtain pure brain features, speech prosody features, neural response features, and physiological features; and a unified feature vector is obtained based on the local spatial structure features, the social behavior features, the pure brain features, the speech prosody features, the neural response features, and the physiological features.
[0061] In one possible implementation, a multimodal social emotion network model is constructed; the multimodal social emotion network model is then jointly modeled with a temporal convolutional network and a graph convolutional network to obtain a cross-modal fusion model.
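[0062] The cross-modal fusion model is represented as follows: $E_t = \mathrm{TCN}(\mathrm{GCN}(X_t))$; where $E_t$ is the emotional state representation vector; $\mathrm{TCN}(\cdot)$ is the temporal convolutional network; $\mathrm{GCN}(\cdot)$ is the graph convolutional network; and $X_t$ is the multimodal feature matrix at timestamp $t$.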
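To make the joint use of a graph convolutional network and a temporal convolutional network more concrete, the following NumPy sketch applies one graph convolution across modality nodes at each time step and then one causal convolution over time. It is a toy forward pass under several assumptions: six modality nodes with a fully connected graph, random untrained weights, mean pooling over nodes, and the layer sizes shown. None of these details are given in the application, and the actual MSENet architecture may differ.

```python
import numpy as np

def normalized_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 with self-loops."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(x_t, a_norm, weight):
    """One graph convolution: ReLU(A_norm @ X_t @ W); x_t is (nodes, in_dim)."""
    return np.maximum(a_norm @ x_t @ weight, 0.0)

def causal_temporal_conv(seq, kernel):
    """Causal 1D convolution over time; seq is (T, C), kernel is (K, C, C_out)."""
    k = kernel.shape[0]
    # Left-pad with zeros so the output at time t only sees inputs up to t.
    padded = np.concatenate([np.zeros((k - 1, seq.shape[1])), seq], axis=0)
    out = np.stack([
        sum(padded[t + j] @ kernel[j] for j in range(k))
        for t in range(seq.shape[0])
    ])
    return np.maximum(out, 0.0)

def fuse(x_seq, adj, w_gcn, kernel):
    """Graph convolution per time step, node pooling, then temporal convolution."""
    a_norm = normalized_adjacency(adj)
    pooled = np.stack([gcn_layer(x_t, a_norm, w_gcn).mean(axis=0) for x_t in x_seq])
    return causal_temporal_conv(pooled, kernel)

rng = np.random.default_rng(0)
T, nodes, in_dim, hidden, out_dim = 20, 6, 8, 16, 4   # assumed sizes
x_seq = rng.normal(size=(T, nodes, in_dim))           # features per modality node
adj = np.ones((nodes, nodes)) - np.eye(nodes)         # assumed fully connected
w_gcn = rng.normal(size=(in_dim, hidden)) * 0.1       # untrained weights
kernel = rng.normal(size=(3, hidden, out_dim)) * 0.1  # untrained weights

emotion_states = fuse(x_seq, adj, w_gcn, kernel)
print(emotion_states.shape)
```

The graph step lets each modality's features be mixed with those of related modalities, while the causal temporal step makes each output depend only on the current and earlier time steps, which suits real-time monitoring.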
[0063] Specifically, the multimodal information representation extraction process is shown in Figure 3. First, based on the time-aligned multimodal data, feature extraction is performed separately for each modality. For the visual modality, local spatial structure features of the image frames in the intervention scene are first extracted through a convolutional neural network (CNN), capturing socially relevant information such as posture, facial expressions, and gestures. Then, the VitPose human pose estimation algorithm is used to detect key points and model the skeleton of the child in the images, obtaining time-series data containing the coordinates of 17 key points. These data characterize the child's movement patterns, facial changes, and nonverbal behavioral features during social interactions. The obtained features are $F_{\text{vis}}$ and $F_{\text{beh}}$, where $F_{\text{vis}}$ denotes the local spatial structure features of the visual modality, and $F_{\text{beh}}$ denotes the nonverbal behavioral features (i.e., the social behavior features), a time series that characterizes nonverbal behavior in social interactions.
[0064] EEG modality analysis uses the wavelet transform to perform multi-scale time-frequency analysis of neural signals. The wavelet transform decomposes the signal into sub-signals in different frequency bands, from which the energy distribution characteristics of each band (e.g., ...) are extracted. The wavelet transform is computed as: $W(a,b) = \frac{1}{\sqrt{a}} \int x(t)\, \psi^{*}\!\left(\frac{t-b}{a}\right) dt$; where $W(a,b)$ are the wavelet coefficients, $a$ is the scale parameter, $b$ is the time shift parameter, $\psi$ is the mother wavelet function, and $x(t)$ is the time series of the original input signal. Simultaneously, independent component analysis (ICA) is used to remove eye movement and muscle artifacts, yielding purer brain features with greater neurophysiological significance. ICA models the observed EEG signals as $X = AS$, where $X$ is the observed EEG signal, $A$ is an unknown mixing matrix, and $S$ contains the source components; by estimating $W$, the sources are estimated as $\hat{S} = WX$, where $W$ is the unmixing (inverse mixing) matrix. The estimation of $W$ is based on the principle of maximizing non-Gaussianity, that is, finding the transformation that maximizes the non-Gaussianity of the components so that the separated components are independent of each other. The fNIRS modality uses optical methods to monitor changes in local blood oxygen concentrations (HbO, HbR) in the brain, which indirectly reflect the metabolic activity of cortical neurons, especially in the frontal and prefrontal regions, and thus allow assessment of the child's neural responses in higher-order functional dimensions such as emotional perception and impulse control; the features obtained are the neural response features. Speech modality analysis extracts features such as Mel-frequency cepstral coefficients (MFCCs), fundamental frequency (F0), speech rate, and energy to characterize the prosodic structure and emotional expression of the child's speech during intervention.
Physiological modalities quantify the immediate responses of the child's autonomic nervous system to social situations using heart rate variability (e.g., RMSSD, SDNN) and electrodermal activity (EDA), reflecting physiological levels of emotional arousal and stress. These steps yield, respectively, the pure brain features (EEG), the neural response features (fNIRS), the speech prosody features, and the physiological features.
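The two heart rate variability indices named above have standard time-domain definitions, which the following minimal sketch illustrates. The function name `hrv_time_domain`, the use of RR intervals in milliseconds, and the sample values are assumptions for this example; SDNN is computed here as the sample standard deviation, which is one common convention.

```python
import numpy as np

def hrv_time_domain(rr_ms):
    """Return (SDNN, RMSSD) in milliseconds from a sequence of RR intervals.

    SDNN: standard deviation of the RR intervals (sample std, ddof=1).
    RMSSD: root mean square of successive RR-interval differences.
    """
    rr = np.asarray(rr_ms, dtype=float)
    if rr.size < 2:
        raise ValueError("need at least two RR intervals")
    sdnn = rr.std(ddof=1)
    rmssd = np.sqrt(np.mean(np.diff(rr) ** 2))
    return sdnn, rmssd

# Hypothetical RR intervals (ms) for a short recording window.
rr_example = [800.0, 810.0, 790.0, 805.0, 795.0]
sdnn, rmssd = hrv_time_domain(rr_example)
```

In practice these indices would be computed over sliding windows so that each timestamp of the aligned data stream receives its own physiological feature values.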
[0065] Through the heterogeneous feature extraction of the above six modalities, the system forms a unified feature vector $F_t^{(i)}$ at each timestamp $t$, where $F_t^{(i)}$ represents the child's multimodal feature vector at the $t$-th timestamp under the $i$-th paradigm and serves as the input for subsequent multimodal deep fusion network analysis and social emotion intervention decisions. Based on the time-synchronized multimodal features $F_t^{(i)}$ obtained above (hereinafter referred to as $F$ for ease of description), the MSENet (Multimodal Social Emotion Network) model is proposed. First, in the encoder stage, this application employs a channel attention mechanism to enhance salient features and suppress less informative ones. In the feature representation stage for each modality, global contextual features are first extracted using global average pooling (GAP) and global max pooling (GMP), respectively: $g_{\mathrm{avg}} = \mathrm{GAP}(F)$, $g_{\max} = \mathrm{GMP}(F)$. Subsequently, the two vectors are mapped to the attention space using a shared multilayer perceptron (MLP), and channel attention weights are obtained through sigmoid activation: $M_c = \sigma\big(\mathrm{MLP}(g_{\mathrm{avg}}) + \mathrm{MLP}(g_{\max})\big)$
[0066] where $\sigma$ represents the sigmoid function and $M_c$ represents the importance weight of each channel. Finally, the attention weights $M_c$ are applied to the original feature tensor $F$: $F' = M_c \otimes F$, where $\otimes$ represents element-wise multiplication along the channel dimension and $F'$ represents the weighted channel features.
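The channel attention step described above (global average and max pooling, a shared MLP, sigmoid activation, then channel-wise reweighting) can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the feature tensor is laid out as `(channels, time)`, the shared MLP has one ReLU hidden layer without biases, and the weights `W1` and `W2` are random stand-ins for trained parameters. None of these details are given in the application.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(F, W1, W2):
    """Reweight the channels of F (shape (C, T)) with attention weights in (0, 1).

    W1: (hidden, C) and W2: (C, hidden) are the shared MLP weights (assumed shapes).
    Returns (weighted features, per-channel weights).
    """
    g_avg = F.mean(axis=1)                    # global average pooling over time, (C,)
    g_max = F.max(axis=1)                     # global max pooling over time, (C,)

    def shared_mlp(v):
        return W2 @ np.maximum(W1 @ v, 0.0)   # same weights for both pooled vectors

    weights = sigmoid(shared_mlp(g_avg) + shared_mlp(g_max))
    return weights[:, None] * F, weights      # broadcast along the time axis

rng = np.random.default_rng(0)
C, T, hidden = 8, 20, 4
F = rng.normal(size=(C, T))
W1 = rng.normal(size=(hidden, C)) * 0.1
W2 = rng.normal(size=(C, hidden)) * 0.1
F_weighted, weights = channel_attention(F, W1, W2)
```

Because the weights lie strictly between 0 and 1, each channel is attenuated in proportion to its estimated importance rather than being switched off outright.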
[0067] Considering the heterogeneity of the different modalities (visual, speech, EEG, physiological) in acquisition methods, perception angles, and information structures, MSENet introduces a cross-modal KL divergence regularization term in the encoder stage to align the distributions of latent representations from different modalities before fusion, thereby imposing collaborative constraints between modalities. The modality alignment loss is defined as: $\mathcal{L}_{\mathrm{align}} = \sum_{m}\sum_{t} D_{\mathrm{KL}}\big(\hat{p}_t^{(m)} \,\|\, p_t^{(m)}\big)$; where $\mathcal{L}_{\mathrm{align}}$ is the modality alignment loss, $\hat{p}_t^{(m)}$ is the alignment-processed latent feature distribution of the $m$-th modality (such as EEG, fNIRS, speech, or physiological signals) at time point $t$, $p_t^{(m)}$ is the latent feature distribution of the $m$-th modality at time point $t$, and $D_{\mathrm{KL}}(\cdot\,\|\,\cdot)$ is the directional Kullback-Leibler divergence. This loss encourages the different modalities to be distributed as similarly as possible in the shared latent space, improving consistency and stability before multimodal fusion and contributing to the accurate construction of the emotion representation. To enhance the complementarity of multimodal information in social emotion representation, this embodiment introduces a multi-head self-attention mechanism and combines it with a lightweight message bottleneck module to limit redundant information flow, thereby strengthening the representational power of emotion-related features. During optimization, cross-modal attention weights adaptively adjust the weights of the different modalities so that each modality contributes reasonably to the emotional state. Specifically, the cross-modal attention weights dynamically and selectively enhance emotion-related modal features while suppressing redundant or noisy information flow, improving the accuracy and reliability of the emotion representation.
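To make the KL-based alignment idea concrete, the sketch below computes a summed KL divergence between each modality's latent distribution and a shared target. Several choices here are assumptions not stated in the application: latent vectors are turned into discrete distributions with a softmax, the shared target is the mean of the modality latents, and the names `alignment_loss` and `kl_divergence` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    """Directional KL(p || q) along the last axis for discrete distributions."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return np.sum(p * np.log(p / q), axis=-1)

def alignment_loss(latents, reference):
    """Sum over modalities and time points of KL(softmax(z_m) || softmax(reference)).

    latents: dict mapping modality name to an array of shape (T, D).
    reference: array of shape (T, D) used as the shared alignment target (assumption).
    """
    ref = softmax(reference)
    return float(sum(kl_divergence(softmax(z), ref).sum() for z in latents.values()))

rng = np.random.default_rng(0)
latents = {"eeg": rng.normal(size=(4, 6)), "speech": rng.normal(size=(4, 6))}
reference = np.mean(list(latents.values()), axis=0)
loss = alignment_loss(latents, reference)
aligned = alignment_loss({"eeg": reference, "speech": reference}, reference)
```

The loss is zero when every modality already matches the target and grows as the latent distributions drift apart, which is the behavior a regularization term of this kind relies on.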
[0068] Furthermore, social emotions exhibit significant temporal evolution characteristics. To address this, MSENet further integrates a Temporal Convolutional Network (TCN) and a Graph Convolutional Network (GCN): the TCN is used to model temporal context dependencies, while the GCN is used to characterize higher-order relationships between modal features. Through this structure, the system can extract emotional representation features from continuous, multimodal data streams during the spatiotemporal dynamic evolution process.
[0069] $z_t = \mathrm{TCN}(\mathrm{GCN}(F_t))$; the model outputs a low-dimensional emotional state representation vector $z_t$ in time-series form. Finally, $z_t$ is mapped to a one-dimensional space through an MLP and a sigmoid function, generating a score $\hat{y}_t$ that quantifies the subject's social-emotional functioning:
[0070] $\hat{y}_t = \sigma(\mathrm{MLP}(z_t))$.
[0071] where $\sigma$ represents the sigmoid activation function.
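The combination of a graph convolution over modality features, a temporal convolution over time, and a sigmoid-scored output can be sketched in NumPy as follows. This is a simplified illustration, not the MSENet architecture itself: it assumes one GCN layer with a fully connected modality graph, one causal convolution layer standing in for the TCN, a single linear layer in place of the MLP, and random weights in place of trained parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def normalize_adjacency(A):
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2} of a basic GCN layer."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def gcn_layer(X, A_norm, W):
    """X: (T, M, D) modality-node features per timestamp; returns (T, M, H)."""
    return np.maximum(A_norm @ X @ W, 0.0)    # ReLU activation

def causal_conv1d(H, kernel):
    """H: (T, C); kernel: (K, C, C_out). Output at t uses only H[t-K+1 .. t]."""
    K, T = kernel.shape[0], H.shape[0]
    padded = np.concatenate([np.zeros((K - 1, H.shape[1])), H], axis=0)
    out = np.zeros((T, kernel.shape[2]))
    for k in range(K):
        out += padded[k:k + T] @ kernel[k]
    return out

def emotion_score(F, A, W_gcn, kernel, w_out):
    H = gcn_layer(F, normalize_adjacency(A), W_gcn)   # relations between modalities
    H = H.reshape(H.shape[0], -1)                     # flatten nodes per timestamp
    Z = np.tanh(causal_conv1d(H, kernel))             # temporal context, (T, d)
    scores = sigmoid(Z @ w_out)                       # one score per timestamp
    return Z, scores

rng = np.random.default_rng(0)
T, M, D, Hd, d, K = 10, 6, 4, 3, 5, 3
F = rng.normal(size=(T, M, D))
A = np.ones((M, M)) - np.eye(M)       # assumed fully connected modality graph
W_gcn = rng.normal(size=(D, Hd)) * 0.5
kernel = rng.normal(size=(K, M * Hd, d)) * 0.2
w_out = rng.normal(size=d)
Z, scores = emotion_score(F, A, W_gcn, kernel, w_out)
```

Because the convolution is causal, the state at a given timestamp depends only on current and earlier features, which suits a stream that is scored while the intervention is still running.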
[0072] The system incorporates an expert scoring mechanism as a supervisory signal for model training. Specifically, after each round of paradigm execution, professional assessors manually score the child's performance during the task (such as emotional response, interactive participation, nonverbal expression, and attention maintenance). The scale ranges over [-1, 1] and is divided into ten levels: the smaller the value, the more pronounced the typical characteristics of autism; the larger the value, the closer the child's social and emotional functioning is to that of a typically developing child.
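[0073] During training, the overall loss function $\mathcal{L}$ is designed as follows: $\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{reg}} + \lambda_2 \mathcal{L}_{\mathrm{align}}$; where $\lambda_1$ and $\lambda_2$ are hyperparameters used to balance the weights of the regression loss $\mathcal{L}_{\mathrm{reg}}$ and the alignment loss $\mathcal{L}_{\mathrm{align}}$, the regression loss is computed between the predicted scores and the expert scores over all $T$ samples, and $T$ represents the total length of the time series (i.e., the total number of samples). The resulting low-dimensional emotional state representation vector serves as the input for matching downstream personalized intervention strategies; it can comprehensively, finely, and dynamically characterize the social and emotional state of children with autism during the intervention process.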
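A weighted sum of a regression term and an alignment term can be sketched as below. The concrete choices are assumptions: mean squared error is used as the regression loss, the default weights are arbitrary, and expert scores on the [-1, 1] scale are linearly rescaled to [0, 1] so that they are comparable with a sigmoid output. The application does not state how the two ranges are reconciled.

```python
import numpy as np

def total_loss(pred_scores, expert_scores, align_loss,
               lambda_reg=1.0, lambda_align=0.1):
    """Weighted sum of a regression loss and a precomputed alignment loss.

    pred_scores: model outputs in (0, 1), one per sample.
    expert_scores: expert ratings in [-1, 1], one per sample.
    """
    pred = np.asarray(pred_scores, dtype=float)
    expert = np.asarray(expert_scores, dtype=float)
    target = (expert + 1.0) / 2.0               # assumption: map [-1, 1] onto [0, 1]
    reg = np.mean((pred - target) ** 2)         # assumption: MSE over the T samples
    return lambda_reg * reg + lambda_align * align_loss

# Predictions that match the rescaled expert scores leave only the alignment term.
loss = total_loss([0.5, 1.0], [0.0, 1.0], align_loss=2.0)
worst = total_loss([0.0], [1.0], align_loss=0.0)
```

Raising `lambda_align` relative to `lambda_reg` would push training toward agreement between modalities at the cost of fitting the expert scores less tightly.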
[0074] This application establishes a linear workflow from time alignment to feature extraction and then to fusion modeling: data preprocessing, feature preparation, model processing, and result output. Furthermore, feature extraction for each modality is performed in parallel, but ultimately integrated into a unified feature vector. Within MSENet, each component processes features sequentially, progressively deepening feature fusion and temporal modeling. The scoring output of this application serves as input for intervention strategy matching, while dynamically collecting data to update the state representation, achieving adaptive optimization of assessment-intervention-reassessment.
[0075] In step S103, standard state data is obtained, and a target intervention strategy matching the child is determined based on the social-emotional competence profile and the standard state data.
[0076] In one possible implementation, the emotional state differences of the child are obtained based on multiple emotional state representation vectors and the standard state data; based on the emotional state differences, multiple preset intervention paradigms are determined to have multiple intervention effects on the child; and based on the multiple intervention effects, a target intervention strategy matching the child is determined.
[0077] In one possible implementation, the image data, the electroencephalogram (EEG) data, the speech data, the spectral data, and the physiological data are updated to obtain updated multimodal data; based on the updated multimodal data, multiple updated intervention effects are obtained; and based on the multiple updated intervention effects, the target intervention strategy matching the child is updated.
[0078] Specifically, in the process of information representation and strategy output of the intervention unit, the low-dimensional emotional state representation vectors $\{z^{(1)}, \dots, z^{(k)}\}$ under the $k$ intervention paradigms output in step S102 are obtained, after which the system enters the personalized intervention strategy matching stage. Each $z^{(i)}$ represents the child's emotional state under the $i$-th intervention paradigm, and the system analyzes the child's emotional response characteristics under the different intervention paradigms. The construction of personalized strategies considers two key factors: first, the difference between the current state of the child and that of healthy norm children under each intervention paradigm; and second, the feedback on intervention effects for similar individuals in the historical database under each intervention paradigm. First, to characterize the degree of abnormality of the child under each intervention paradigm, the system calculates the difference in emotional state between the child and the norm children under the $i$-th intervention paradigm, defined as: $D_i = \big\lVert z_t^{(i)} - \mu^{(i)} \big\rVert$; where $D_i$ is the difference in emotional state under the $i$-th intervention paradigm, $z_t^{(i)}$ is the current child's low-dimensional emotional state vector at time $t$ under the $i$-th intervention paradigm, and $\mu^{(i)}$ is the average emotional representation of healthy children under this paradigm. This indicator reflects the degree of social-emotional deviation of the child in a specific situation.
[0079] Second, the system extracts from the historical database the average intervention effect of this intervention paradigm on children with similar emotional states, denoted as: $G_i = \frac{1}{|N_i|}\sum_{j \in N_i} \big\lVert z_{j,\mathrm{post}}^{(i)} - z_{j,\mathrm{pre}}^{(i)} \big\rVert$; where $G_i$ is the average historical gain effect of the $i$-th intervention paradigm, $N_i$ is the set of children in the historical database who received the $i$-th intervention paradigm and were in a state similar to the current child's, and $z_{j,\mathrm{pre}}^{(i)}$ and $z_{j,\mathrm{post}}^{(i)}$ are the state vectors of the $j$-th historical child before and after the application of the $i$-th intervention paradigm, respectively. $G_i$ characterizes the average gain effect of the intervention unit. Taking both factors into account, the system defines a matching score $S_i$ for each intervention paradigm: $S_i = \alpha D_i + (1-\alpha) G_i$; where $\alpha \in [0, 1]$ is a hyperparameter that adjusts the weighting of individual state differences and historical effects. Intervention paradigms with higher scores are more likely to achieve effective improvement for the current child. Based on the scoring results, the system selects the top $Z$ (e.g., 3) highest-scoring intervention paradigms from the $k$ candidates as the personalized intervention unit combination for the current round, forming an individualized strategy. To verify the strategy's effectiveness, the system continuously collects the child's data during intervention execution, dynamically updates the child's state representation under each paradigm, and recalculates the state difference $D_i$ and historical gain $G_i$, allowing adaptive iterative optimization of the intervention strategy.
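The scoring and top-Z selection described above can be sketched as follows. The sketch assumes Euclidean norms for both the state difference and the historical gain and a convex combination with a single weight `alpha`; the function name `match_paradigms`, the array layouts, and the toy data are hypothetical and chosen only to make the ranking easy to follow.

```python
import numpy as np

def match_paradigms(child_states, norm_states, hist_pre, hist_post,
                    alpha=0.5, top_z=3):
    """Rank k intervention paradigms and return the indices of the top_z best.

    child_states, norm_states: arrays of shape (k, d), one state per paradigm.
    hist_pre, hist_post: lists of k arrays of shape (n_i, d) holding the states
    of similar historical children before and after each paradigm.
    """
    # State difference between the child and the norm under each paradigm.
    diff = np.linalg.norm(child_states - norm_states, axis=1)
    # Average historical state change produced by each paradigm.
    gain = np.array([np.linalg.norm(post - pre, axis=1).mean()
                     for pre, post in zip(hist_pre, hist_post)])
    scores = alpha * diff + (1.0 - alpha) * gain
    selected = np.argsort(-scores)[:top_z]     # highest score first
    return selected, scores

child_states = np.array([[1.0, 0.0], [0.0, 0.0], [3.0, 4.0], [0.0, 2.0]])
norm_states = np.zeros((4, 2))
hist_pre = [np.zeros((1, 2)) for _ in range(4)]
hist_post = [np.zeros((1, 2)), np.array([[4.0, 0.0]]),
             np.zeros((1, 2)), np.zeros((1, 2))]
selected, scores = match_paradigms(child_states, norm_states, hist_pre, hist_post)
```

In an iterative setting the same function would simply be called again after each round with the refreshed child states, so the selected combination can change as the child's state representation is updated.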
[0080] This application constructs an intelligent decision-making chain from individual ability assessment to intervention strategy matching through four sub-steps: quantification of state differences, feedback on historical effects, generation of matching scores, and dynamic optimization. Essentially, it uses a data-driven approach to transform the differences between children and norms, as well as historical intervention effects, into quantifiable matching scores. Based on real-time feedback, it continuously optimizes strategies, ultimately achieving precise intervention and dynamic improvement of the social and emotional functions of children with autism.
[0081] Next, the system for matching intervention strategies for children's emotions according to embodiments of this application is described with reference to the accompanying drawings. The system is applied to the method for matching intervention strategies for children's emotions described in any of the above schemes.
[0082] Figure 4 is a structural diagram of the intervention strategy matching system for children's emotions according to an embodiment of this application.
[0083] As shown in Figure 4, the intervention strategy matching system for children's emotions includes: a paradigm construction and information perception module 100, a representation extraction module 200, and a personalized intervention matching module 300.
[0084] Specifically, the paradigm construction and information perception module 100 is used to construct an assessment intervention paradigm and acquire the child's video information and physiological parameter information during the child's execution of the assessment intervention paradigm. The representation extraction module 200 is used to perform multimodal representation extraction on the video information and the physiological parameter information to obtain a social-emotional ability profile of the child. The personalized intervention matching module 300 is used to acquire standard state data and determine the target intervention strategy to match the child based on the social-emotional ability profile and the standard state data.
[0085] Figure 5 is a structural diagram of a terminal provided in an embodiment of this application. The terminal may include a memory 501, a processor 502, and a computer program stored in the memory 501 and executable on the processor 502.
[0086] When the processor 502 executes the program, it implements the intervention strategy matching method for children's emotions provided in the above embodiments.
[0087] Furthermore, the terminal also includes a communication interface 503, which is used for communication between the memory 501 and the processor 502.
[0088] The memory 501 is used to store computer programs that can run on the processor 502.
[0089] Memory 501 may include high-speed RAM, and may also include non-volatile memory, for example, at least one disk storage.
[0090] If the memory 501, processor 502, and communication interface 503 are implemented independently, they can be interconnected via a bus to communicate with one another. The bus can be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, etc. Buses can be categorized as address buses, data buses, control buses, etc. For ease of representation, the bus is represented in Figure 5 by a single thick line, but this does not mean that there is only one bus or one type of bus.
[0091] Optionally, in a specific implementation, if the memory 501, processor 502, and communication interface 503 are integrated on a single chip, then the memory 501, processor 502, and communication interface 503 can communicate with each other through an internal interface.
[0092] Processor 502 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of this application.
[0093] This embodiment also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the above-described method for matching intervention strategies for children's emotions.
[0094] One embodiment of this application provides a computer program product, including a computer program that, when executed by a processor, implements the method for matching intervention strategies for children's emotions provided in the embodiment corresponding to Figure 2.
[0095] It should be noted that the information (including but not limited to user device information, user personal information, etc.), data (including but not limited to user-analyzed data, user-stored data, user-displayed data, etc.) and signals involved in this invention are all information, data and signals authorized by the user or fully authorized by all parties; and the collection, use and processing of relevant information, data and signals comply with the laws, regulations and standards of relevant countries and regions.
[0096] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of this application. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.
[0097] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this application, "N" means at least two, such as two, three, etc., unless otherwise explicitly specified.
[0098] Any process or method described in the flowchart or otherwise herein can be understood as representing a module, segment, or portion of code comprising one or N executable instructions for implementing custom logic functions or processes, and the scope of the preferred embodiments of this application includes additional implementations in which functions may be performed not in the order shown or discussed, including substantially simultaneously or in reverse order depending on the functions involved, as should be understood by those skilled in the art to which embodiments of this application pertain.
[0099] The logic and/or steps represented in the flowchart or otherwise described herein can, for example, be considered a sequenced list of executable instructions for implementing logical functions, and can be embodied in any computer-readable storage medium for use by, or in conjunction with, an instruction execution system, apparatus, or device (such as a computer-based system, a system including a processor, or another system that can fetch instructions from the instruction execution system, apparatus, or device and execute them). For the purposes of this specification, a "computer-readable storage medium" can be any means that can contain, store, communicate, propagate, or transmit programs for use by, or in conjunction with, an instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires (electronic device), a portable computer diskette (magnetic device), random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), fiber optic devices, and portable compact disc read-only memory (CD-ROM). Alternatively, the computer-readable storage medium could be paper or another suitable medium on which the program is printed, since the program can be obtained electronically by optically scanning the paper or other medium, followed by editing, interpreting, or otherwise processing it as necessary, and then stored in a computer memory.
[0100] It should be understood that the various parts of this application can be implemented using hardware, software, firmware, or a combination thereof. In the above embodiments, the N steps or methods can be implemented using software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware as in another embodiment, it can be implemented using any one or a combination of the following techniques known in the art: discrete logic circuits having logic gates for implementing logical functions on data signals, application-specific integrated circuits (ASICs) having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), etc.
[0101] Those skilled in the art will understand that all or part of the steps of the methods in the above embodiments can be implemented by a program instructing related hardware. The program can be stored in a computer-readable storage medium, and when executed, the program includes one or a combination of the steps of the method embodiments.
[0102] Furthermore, the functional units in the various embodiments of this application can be integrated into a processing module, or each unit can exist physically separately, or two or more units can be integrated into a module. The integrated module can be implemented in hardware or as a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it can also be stored in a computer-readable storage medium.
[0103] The storage medium mentioned above can be a read-only memory, a disk, or an optical disk, etc. Although embodiments of this application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting this application. Those skilled in the art can make changes, modifications, substitutions, and variations to the above embodiments within the scope of this application.
[0104] It should be understood that the application of this application is not limited to the examples above. Those skilled in the art can make improvements or modifications based on the above description, and all such improvements and modifications should fall within the protection scope of the appended claims.
[0105] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of this application.
Claims
1. A method for matching intervention strategies for children's emotions, characterized in that, The intervention strategy matching method for children's emotions includes: An assessment and intervention paradigm is constructed, and video information and physiological parameter information of the children are obtained during the process of the children performing the assessment and intervention paradigm; Multimodal representation extraction is performed on the video information and the physiological parameter information to obtain a social-emotional ability profile of the child; Obtain standard state data, and determine the target intervention strategy to match the child based on the social-emotional competence profile and the standard state data.
2. The method for matching intervention strategies for children's emotions according to claim 1, characterized in that, The video information includes image data of the interaction between the guide and the child, and the physiological parameter information includes the child's electroencephalogram (EEG) data, speech data, spectral data, and physiological data. The acquisition of the child's video information and physiological parameter information specifically includes: Collect multimodal data and corresponding timestamp data of guides and children at different frequencies; Synchronization and alignment are performed based on the multimodal data and the timestamp data to obtain image data of the interaction between the guide and the child, the child's EEG data, spectral data, and physiological data; wherein the time series of the image data, the EEG data, the speech data, the spectral data, and the physiological data are consistent.
3. The method for matching intervention strategies for children's emotions according to claim 2, characterized in that, The social emotional competence profile includes multiple emotional state representation vectors; The step of extracting multimodal representations from the video information and the physiological parameter information to obtain the child's social-emotional competence profile specifically includes: Multimodal feature extraction is performed on the image data, the electroencephalogram data, the speech data, the spectral data, and the physiological data to obtain a unified feature vector; A cross-modal fusion model is constructed, and the unified feature vector is input into the cross-modal fusion model to obtain multiple emotion state representation vectors.
4. The method for matching intervention strategies for children's emotions according to claim 3, characterized in that, The step of performing multimodal feature extraction on the image data, the electroencephalogram data, the speech data, the spectral data, and the physiological data to obtain a unified feature vector specifically includes: Feature extraction is performed on the image data to obtain local spatial structure features and social behavior features; Feature extraction is performed on the electroencephalogram (EEG) data, the speech data, the spectral data, and the physiological data to obtain pure brain features, speech prosody features, neural response features, and physiological features; A unified feature vector is obtained based on the local spatial structure features, the social behavior features, the pure brain features, the speech prosody features, the neural response features, and the physiological features.
5. The method for matching intervention strategies for children's emotions according to claim 4, characterized in that, The construction of the cross-modal fusion model specifically includes: Construct a multimodal social emotion network model; The multimodal social emotion network model is jointly modeled with temporal convolutional networks and graph convolutional networks to obtain a cross-modal fusion model; The cross-modal fusion model is represented as follows: $z_t = \mathrm{TCN}(\mathrm{GCN}(F_t))$; where $z_t$ is the emotional state representation vector, $\mathrm{TCN}(\cdot)$ is the temporal convolutional network, $\mathrm{GCN}(\cdot)$ is the graph convolutional network, and $F_t$ is the multimodal feature matrix at timestamp $t$.
6. The method for matching intervention strategies for children's emotions according to claim 3, characterized in that, The step of determining the target intervention strategy matching the child based on the social-emotional competence profile and the standard state data specifically includes: The emotional state differences of the child are obtained based on multiple emotional state representation vectors and the standard state data; Based on the emotional state differences, multiple intervention effects of multiple preset intervention paradigms on the child are determined; Based on the multiple intervention effects, a target intervention strategy matching the child is determined.
7. The method for matching intervention strategies for children's emotions according to claim 4, characterized in that, The step of determining a target intervention strategy matching the child based on the social-emotional competence profile and the standard state data further includes: The image data, EEG data, speech data, spectral data, and physiological data are updated to obtain updated multimodal data; Based on the updated multimodal data, the updated intervention effects are obtained; Based on the updated intervention effects, the target intervention strategies matching the children are updated.
8. A system for matching intervention strategies for children's emotions, characterized in that, The intervention strategy matching system for children's emotions is applied to the intervention strategy matching method for children's emotions as described in any one of claims 1-7; the intervention strategy matching system for children's emotions includes: The paradigm construction and information perception module is used to construct an assessment and intervention paradigm and acquire the child's video information and physiological parameter information during the child's execution of the assessment and intervention paradigm. The representation extraction module is used to perform multimodal representation extraction on the video information and the physiological parameter information to obtain a social-emotional ability profile of the child. The personalized intervention matching module is used to acquire standard state data and determine the target intervention strategy to match the child based on the social-emotional ability profile and the standard state data.
9. A terminal, characterized in that, The terminal includes: a memory, a processor, and a child emotion intervention strategy matching program stored in the memory and executable on the processor. When the child emotion intervention strategy matching program is executed by the processor, it implements the steps of the child emotion intervention strategy matching method as described in any one of claims 1-7.
10. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a child emotion intervention strategy matching program, which, when executed by a processor, implements the steps of the child emotion intervention strategy matching method as described in any one of claims 1-7.