A neural rehabilitation action detection method based on domain generalization neural network
By using multimodal fusion and generative adversarial network training, a neural rehabilitation motion detection method based on domain generalization neural networks was constructed. This method solves the problems of insufficient multimodal information integration and domain generalization ability in existing technologies, achieving high-precision and stable motion detection and improving the intelligence and automation level of rehabilitation training.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- 中国人民解放军海军青岛特勤疗养中心
- Filing Date
- 2025-07-03
- Publication Date
- 2026-06-16
AI Technical Summary
Existing methods for detecting neurorehabilitation movements lack the ability to integrate multimodal information, cannot effectively integrate textual descriptions of movements with videos of neurorehabilitation movements, and lack domain generalization ability, resulting in a significant decrease in detection performance under different training environments or equipment conditions.
A multimodal fusion strategy is introduced to integrate the spatiotemporal features of neurorehabilitation motion videos with the semantic information of motion text descriptions. A domain adversarial training is then conducted through generative adversarial networks to reduce the gap between the feature distributions of the source and target domains, thereby constructing a detection model based on a domain generalization neural network.
It improves the accuracy and robustness of motion detection, enhances the stability of the model under different training conditions and equipment environments, realizes real-time monitoring and decision support, and improves the efficiency and effectiveness of rehabilitation training.
Figure CN120804582B_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of intelligent detection technology for neurorehabilitation based on multimodal data fusion, and particularly relates to a method for detecting neurorehabilitation movements based on domain generalized neural networks.
Background Technology
[0002] Patients with motor disorders caused by central nervous system injury typically recover their fine motor skills gradually through rehabilitation training that promotes neural function reorganization. Current rehabilitation guidance relies on the observation and assistance of clinicians, but medical resources are limited, especially given the lack of widespread availability of intelligent medical devices; over 90% of rehabilitation training must be completed at home. Due to the lack of professional rehabilitation assessment and real-time feedback, the training effectiveness and adherence of many patients are significantly affected. Studies have shown that timely and intuitive assessment and feedback can effectively improve patient motivation and rehabilitation outcomes. Therefore, developing an intelligent rehabilitation movement assessment method has significant clinical implications and practical application value.
[0003] Currently, traditional methods for assessing rehabilitation movements have evolved from manual observation to intelligent systems, but existing methods still face many challenges:
[0004] Sensor-based recognition methods: These methods typically acquire multi-dimensional data on human movements by wearing sensors such as accelerometers and gyroscopes, and then use traditional machine learning methods (such as support vector machines and decision trees) for movement classification. The advantage of these methods is that they accurately capture human motion characteristics and do not require extensive environmental dependence. However, because they require wearing multiple devices, and the sensors may interfere with human activity, they present inconvenience and discomfort in practical applications.
[0005] Action recognition methods based on computer vision: With the advancement of deep learning technology, action recognition based on video analysis has become an important research direction. Deep learning models, such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), can automatically extract action features from RGB videos or depth images, achieving high-precision action recognition. The advantage of this type of method is that it does not rely on wearable devices and can capture human movements more naturally, making it particularly suitable for remote monitoring in home or medical environments. However, video data often faces challenges such as background interference, pose occlusion, and changes in lighting, increasing the complexity of action recognition.
[0006] Deep Learning-Based Intelligent Recognition Methods: In recent years, with the widespread application of deep learning technology, neural network-based action recognition methods have made significant progress. By introducing advanced network structures, such as Convolutional Long Short-Term Memory (ConvLSTM) networks and 3D Convolutional Neural Networks (3D CNNs), spatiotemporal features can be automatically extracted and accurately classified. These methods improve action recognition accuracy while handling complex temporal information, significantly enhancing model robustness. However, these deep learning models often have a large number of parameters and high computational complexity, and they generalize poorly when training data is insufficient or inter-domain differences exist.
[0007] While existing methods for detecting neurorehabilitation movements perform well in specific scenarios, they typically lack multimodal information and cannot effectively integrate multimodal data such as textual descriptions of movements and videos of neurorehabilitation movements. Furthermore, these methods often lack domain generalization ability; that is, a model trained in a particular scenario will show a significant decrease in detection performance when applied under different training environments or device conditions.
Summary of the Invention
[0008] To address the aforementioned issues, this invention introduces a multimodal fusion strategy, effectively integrating the spatiotemporal features of neurorehabilitation motion videos with the semantic information in motion text descriptions. This fully leverages the complementarity of multimodal data, improving the accuracy and robustness of motion detection. To further address the problem of data distribution discrepancies, this invention incorporates Generative Adversarial Networks (GANs) for domain adversarial training, thereby reducing the gap in feature distribution between the source and target domains and enabling the model to maintain stable performance under different training conditions and device environments.
[0009] This invention provides a method for detecting neural rehabilitation movements based on a domain generalization neural network, comprising the following steps:
[0010] S1, real-time acquisition of video frame data and descriptive text data of the patient's neurorehabilitation movements, wherein the descriptive text data is generated from the guidance text in the rehabilitation exercise manual corresponding to different segments of the patient's neurorehabilitation movement video;
[0011] S2, preprocess the two types of data to obtain the standard input data for the model;
[0012] S3, input the standard input data into the trained neurorehabilitation movement detection model and output a score as the movement quality assessment result;
[0013] The neurorehabilitation movement detection model includes a feature extraction network and a multimodal neurorehabilitation movement detection network;
[0014] The feature extraction network includes a frame-level skeleton keypoint extraction module, a human joint motion coordinate graph construction module, a spatiotemporal graph convolutional network encoding module (STGCN), a multi-scale temporal convolutional module (TCN), and a motion description text encoding module. For video modal data, the frame-level skeleton keypoint extraction module and the human joint motion coordinate graph construction module extract the graph structure data of the joints. Then, STGCN and TCN are used as video feature encoders. STGCN processes the transformed skeleton data to obtain the high-order topological structure between body joints and extracts motion-related spatial features. Then, three TCNs are used to extract temporal features at different levels, and the high- and low-level spatiotemporal features over different time spans are concatenated. For text modal data, the motion description text encoding module converts the motion information in the video into feature vectors associated with the text description. The video modal feature outputs of each step are concatenated and fused with the text modal features to form the input of the multimodal neurorehabilitation motion detection network, which outputs a continuous evaluation score.
[0015] Preferably, the feature extraction network is trained using a constructed domain generalization generative adversarial network training strategy; the feature extraction network serves as the generator module of the generative adversarial network, combined with a feature mapping layer to achieve effective data transformation; the feature mapping layer uses a linear transformation to concatenate different modes of the feature vectors from the source domain obtained by the feature extraction network, and then maps them to the target space to maintain the similarity between the source and target domain data; the generator module shares parameters between the source and target domains, and the goal of the generator module is to minimize the distance between the generated features from the source and target domains, with the objective function being:
[0016] $\min L = \lVert S - T \rVert_2$
[0017] Where S and T represent the generated features of the source domain and the target domain, respectively, and L is the distance between the generated features of the source domain and the target domain;
[0018] The discriminator module includes a video discriminator and a text discriminator, which receive video feature data and text feature data output from the generator, respectively. The feature data is processed through two fully connected layers: the first layer has an output dimension of 512, and the second layer has an output dimension of 256, both using the ReLU activation function. Subsequently, the processed feature data is passed to the output layer, which contains a single neuron and uses the Sigmoid activation function for binary classification. The output value represents the similarity between the source and target domain features, mathematically expressed as:
[0019] $D(X) = \sigma(W_{\text{out}} \cdot X + b_{\text{out}}), \quad X \in \{S, T\}$
[0020] Where D(X) is the output of the discriminator for an input feature X produced by the generator G, σ is the sigmoid function, and $W_{\text{out}}$ and $b_{\text{out}}$ are the weights and biases of the output layer, respectively. The discriminator aims to maximize the difference between the source domain features and the target domain features, and its optimization objective function is given by the formula:
[0021] $\max L = \mathbb{E}[\log D(S)] + \mathbb{E}[\log(1 - D(T))]$
[0022] Where E represents the expectation and log represents the logarithmic function.
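The discriminator head and the two objectives above can be illustrated with a minimal pure-Python sketch. It is not the patented implementation: the input dimension of 32, the uniform weight initialization, the batch-mean estimate of the expectations, and the reading of the generator objective as a Euclidean distance are all assumptions made for illustration, and every function and class name here is hypothetical.

```python
import math
import random

def relu(v):
    return [max(0.0, x) for x in v]

def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def linear(v, weights, bias):
    """Dense layer: weights holds one row per output unit."""
    return [sum(w * x for w, x in zip(row, v)) + b for row, b in zip(weights, bias)]

def init_layer(rng, n_in, n_out):
    # Assumed initialization; the source does not specify one.
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

class Discriminator:
    """Two ReLU layers (512, then 256 units) and one sigmoid output neuron."""

    def __init__(self, n_in, seed=0):
        rng = random.Random(seed)
        self.l1 = init_layer(rng, n_in, 512)
        self.l2 = init_layer(rng, 512, 256)
        self.out = init_layer(rng, 256, 1)

    def __call__(self, x):
        h = relu(linear(x, *self.l1))
        h = relu(linear(h, *self.l2))
        return sigmoid(linear(h, *self.out)[0])

def discriminator_objective(d, source_batch, target_batch, eps=1e-12):
    """E[log D(S)] + E[log(1 - D(T))], with expectations taken as batch means."""
    src = sum(math.log(d(s) + eps) for s in source_batch) / len(source_batch)
    tgt = sum(math.log(1.0 - d(t) + eps) for t in target_batch) / len(target_batch)
    return src + tgt

def generator_distance(s, t):
    """Euclidean distance between one source and one target feature vector."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))

# Hypothetical feature batches standing in for generator outputs S and T.
rng = random.Random(1)
S = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(4)]
T = [[rng.gauss(0.5, 1.0) for _ in range(32)] for _ in range(4)]
d = Discriminator(n_in=32)
value = discriminator_objective(d, S, T)
```

In a full training loop the discriminator would take gradient steps that increase `discriminator_objective`, while the shared generator would take steps that decrease `generator_distance`; the gradient computation is omitted here.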
[0023] Preferably, the preprocessing includes denoising, supplementing, and standardizing the video data;
[0024] For video data, wavelet denoising and time-series smoothing are used. Discrete wavelet decomposition is performed on the depth image of the i-th frame, retaining low-frequency components and applying soft thresholding to high-frequency components to remove noise. Then, time-series smoothing is performed on the wavelet-denoised depth image sequence to eliminate inter-frame jitter and temporal discontinuities. Finally, the processed video data is standardized, as shown in the following formula:
[0025] $\hat{V}_i = \dfrac{\tilde{V}_i - \mu}{\sigma}$
[0026] Where $\hat{V}_i$ is the video data of the i-th frame after standardization, $\tilde{V}_i$ is the video data of the i-th frame after wavelet denoising and time-series smoothing, and μ and σ are the mean and standard deviation of all frame data, respectively.
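The three preprocessing stages (wavelet denoising, time-series smoothing, standardization) can be sketched as follows. This is a simplified illustration under several assumptions: each frame is reduced to a short one-dimensional row of depth values, a single-level Haar wavelet stands in for whichever wavelet the method uses, the soft threshold takes its common form sign(x)·max(|x| − λ, 0), and the smoothing is a centered moving average with a window of three frames. All function names and the threshold value are hypothetical.

```python
import math
import statistics

def haar_dwt(signal):
    """One-level Haar transform of an even-length signal -> (low, high)."""
    r = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / r for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / r for i in range(0, len(signal), 2)]
    return low, high

def haar_idwt(low, high):
    """Inverse of haar_dwt."""
    r = math.sqrt(2.0)
    out = []
    for l, h in zip(low, high):
        out.extend([(l + h) / r, (l - h) / r])
    return out

def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become zero."""
    return math.copysign(max(abs(x) - lam, 0.0), x)

def denoise(signal, lam):
    """Keep the low-frequency part, soft-threshold the high-frequency part."""
    low, high = haar_dwt(signal)
    return haar_idwt(low, [soft_threshold(h, lam) for h in high])

def smooth_over_time(frames, window=3):
    """Centered moving average across frames, applied per pixel position."""
    half = window // 2
    out = []
    for i in range(len(frames)):
        nbrs = frames[max(0, i - half): i + half + 1]
        out.append([sum(col) / len(nbrs) for col in zip(*nbrs)])
    return out

def standardize(frames):
    """Z-score using the mean and standard deviation of all frame data."""
    values = [v for f in frames for v in f]
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("constant data cannot be standardized")
    return [[(v - mu) / sigma for v in f] for f in frames]

# Hypothetical depth rows for three consecutive frames.
frames = [[1.0, 1.2, 5.0, 5.1], [1.1, 1.0, 5.2, 4.9], [0.9, 1.1, 5.1, 5.0]]
clean = standardize(smooth_over_time([denoise(f, lam=0.2) for f in frames]))
```

A real depth image would need a two-dimensional decomposition, typically with more than one level; the one-dimensional version is used only to keep the data flow visible.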
[0027] Preferably, the frame-level skeleton key point extraction module first processes each frame of the preprocessed video data V using the OpenPose model. The i-th frame is input into the OpenPose model to obtain its skeleton key point set $P_i = \{p_i^k\}_{k=1}^{K}$, where $p_i^k$ is the two-dimensional coordinates of the k-th skeleton keypoint in the i-th frame and K is the total number of skeleton keypoints. Inputting all video data into the OpenPose model yields the set of skeleton keypoint sequences $P_V = \{P_1, \dots, P_N\}$, where N is the total number of frames in the video. Secondly, to ensure consistency across frames, the nearest neighbor matching algorithm is used to match consecutive frames in $P_V$, generating a skeleton action sequence with temporal relationships. The skeleton information extracted from each frame serves as a time step, recording the motion trajectory of each joint of the target human body and providing temporal skeleton data for subsequent topology construction and feature extraction.
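The nearest neighbor matching step can be illustrated with a small greedy sketch. It assumes that every frame contains exactly K detected keypoints as (x, y) tuples and that matching is done greedily in joint order; the source does not specify the matching variant, and the function names are hypothetical.

```python
import math

def match_keypoints(prev_pts, next_pts):
    """Reorder next_pts so that index k is the unused point nearest to prev_pts[k]."""
    unused = list(range(len(next_pts)))
    ordered = []
    for p in prev_pts:
        j = min(unused, key=lambda idx: math.dist(p, next_pts[idx]))
        unused.remove(j)
        ordered.append(next_pts[j])
    return ordered

def build_action_sequence(frames):
    """Chain the matching over all frames to get temporally consistent tracks."""
    sequence = [frames[0]]
    for pts in frames[1:]:
        sequence.append(match_keypoints(sequence[-1], pts))
    return sequence

# Hypothetical detections: the second frame lists the same joints in a different order.
frames = [
    [(0.0, 0.0), (10.0, 0.0), (5.0, 8.0)],
    [(5.2, 8.1), (0.3, 0.1), (10.1, -0.2)],
]
sequence = build_action_sequence(frames)
```

After matching, index k refers to the same joint in every frame, so `sequence[i][k]` traces the trajectory of joint k over time.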
[0028] Preferably, the human joint motion coordinate graph construction module converts the skeletal motion sequence into a skeletal key point topology graph g = (A, V, θ), where V represents the joint as a node in the graph, θ represents the connection relationship between the joints, and A is the edge weight matrix; the node set of each topology graph represents all key points, while the edge set represents the connection relationship between each joint, ensuring that the spatial structure between the joints is preserved.
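As a rough illustration of the topology graph, the edge weight matrix A can be built from a list of joint connections. The four-joint arm, the unit edge weights, the added self-loops, and the symmetric degree normalization are assumptions chosen for the sketch, not details taken from the method.

```python
import math

def build_adjacency(num_joints, bones, self_loops=True):
    """Symmetric edge weight matrix A with unit weight on every connection."""
    A = [[0.0] * num_joints for _ in range(num_joints)]
    for i, j in bones:
        A[i][j] = 1.0
        A[j][i] = 1.0
    if self_loops:
        for i in range(num_joints):
            A[i][i] = 1.0
    return A

def normalize_adjacency(A):
    """Symmetric normalization: A[i][j] / sqrt(deg_i * deg_j) (assumed form)."""
    n = len(A)
    deg = [sum(row) for row in A]
    return [
        [A[i][j] / math.sqrt(deg[i] * deg[j]) if A[i][j] else 0.0 for j in range(n)]
        for i in range(n)
    ]

# Hypothetical joint set and connections for one arm.
JOINTS = ["shoulder", "elbow", "wrist", "hand"]
BONES = [(0, 1), (1, 2), (2, 3)]

A = build_adjacency(len(JOINTS), BONES)
A_hat = normalize_adjacency(A)
```

Here the node set corresponds to `JOINTS`, the connection relationship to `BONES`, and the edge weight matrix to `A`.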
[0029] Preferably, the spatiotemporal graph convolutional network encoding module is used to extract spatiotemporal features from the skeleton keypoint topology graph g, including two temporal convolutional blocks and one spatial convolutional block. The temporal convolutional block first extracts the spatial features of the skeleton keypoint topology graph g using graph convolution operations, calculated as follows:
[0030]
[0031] Where Γ denotes the connection operation and the accompanying operator denotes temporal convolution, μ is a temporal convolution kernel, $W_k$ is a learnable matrix, σ is the sigmoid activation function, $X_j$ is the video data of the j-th frame, and $h_S$ is the spatiotemporal feature of the current video frame; where,
[0032]
[0033] $A_k$ is the k-th submatrix of A, D is a diagonal matrix, $D_k$ is the k-th submatrix of D, and I is the identity matrix. Then, the spatiotemporal feature h and the skeleton keypoint topology graph g are input into the GLU activation function layer to obtain the feature GLU(h). The GLU activation function layer controls the information flow by performing element-wise weighted summation and activation on the input spatiotemporal feature h and skeleton keypoint topology graph g, enabling the network to capture more complex feature representations. Finally, GLU(h) is input into a spatial convolution to obtain the output feature. Spatial convolution helps extract the spatial dependencies between the nodes of GLU(h), providing richer spatial features for subsequent spatiotemporal modeling.
[0034] Preferably, the multi-scale temporal convolution module includes a multi-scale temporal convolutional layer and a feature concatenation layer. First, the temporal features are input into the multi-scale temporal convolutional layer, and convolution operations are performed on them at three time scales of 5, 10, and 20 frames to obtain the features $h_{\text{hop5}}$, $h_{\text{hop10}}$, and $h_{\text{hop20}}$. This process extracts local and global dependency information related to the current time step, effectively capturing motion change patterns in both the short and long term. Then, the outputs $h_{\text{hop5}}$, $h_{\text{hop10}}$, and $h_{\text{hop20}}$ of the multi-scale temporal convolutional layer are input into the feature concatenation layer, which yields the concatenated feature representation shown in the formula:
[0035] $h_{\text{TCN}} = [h_{\text{hop5}}, h_{\text{hop10}}, h_{\text{hop20}}]$
[0036] By capturing multi-level temporal dependencies at different time scales and integrating this multi-scale information, a richer and more comprehensive feature representation can be obtained.
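The multi-scale idea can be sketched on a single scalar feature per frame. The sketch assumes causal zero padding (so each output step only sees the current and earlier frames) and uses fixed averaging kernels in place of learned weights; both choices, and the function names, are illustrative assumptions.

```python
def temporal_conv(series, kernel):
    """Causal 1-D convolution: output[t] covers frames t-k+1..t, zero-padded at the start."""
    k = len(kernel)
    padded = [0.0] * (k - 1) + list(series)
    return [
        sum(w * padded[t + i] for i, w in enumerate(kernel))
        for t in range(len(series))
    ]

def multi_scale_features(series, scales=(5, 10, 20)):
    """Return one [h_hop5, h_hop10, h_hop20] triple per time step."""
    # Averaging kernels are placeholders for learned convolution weights.
    branches = [temporal_conv(series, [1.0 / s] * s) for s in scales]
    return [list(step) for step in zip(*branches)]

# Hypothetical per-frame feature: a steadily increasing joint coordinate.
series = [float(t) for t in range(30)]
h_tcn = multi_scale_features(series)
```

With this input, the 5-frame branch tracks the most recent values closely while the 20-frame branch responds more slowly, which is the short-term versus long-term contrast the concatenated feature is meant to carry.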
[0037] Preferably, the action description text encoding module uses OpenAI's open-source pre-trained CLIP model to extract and align action description text features. The action description text data T and the feature $h_{\text{TCN}}$ output by the multi-scale temporal convolution module are input into the CLIP model together to obtain a high-dimensional feature vector l of the action description text, whose dimension is the same as that of $h_{\text{TCN}}$. The vector l contains the semantic feature information of the specific action vocabulary and the corresponding joint locations when performing neurorehabilitation actions, including the action elements, time information, and spatial relationships in the action text description.
[0038] Preferably, the multimodal neurorehabilitation motion detection network includes a ConvLSTM pooling module, a multimodal fusion module, and a motion detection output module;
[0039] First, the pooling module of the convolutional long short-term memory network is responsible for modeling the temporal dependencies of the temporal video features, extracting long and short-term temporal features from the action data, enhancing the model's understanding of temporal dynamics, and achieving the dimensionality reduction effect of the pooling layer. Next, a multimodal fusion module is used to mine the intermodal shared features that are beneficial to the accurate prediction of rehabilitation movements by combining the spatiotemporal features of the neurorehabilitation exercise video with the semantic information of the text description through cross-attention. Finally, the action detection output module is used to evaluate the motion quality based on the fused features and output the result score.
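The cross-attention used for fusion can be sketched as plain scaled dot-product attention, with video features as queries and text features as keys and values. The learned projection matrices are omitted, and the choice of which modality supplies the queries, the per-token text features, and all numeric values are assumptions made for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys."""
    d = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return fused

# Hypothetical features: two video time steps and three text tokens.
video_feats = [[1.0, 0.0], [0.0, 1.0]]
text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(video_feats, text_feats, text_feats)
```

Each fused row is a weighted average of the text features, with larger weights on the text tokens that align best with that video time step.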
[0040] Preferably, the input to the convolutional long short-term memory network pooling module is the concatenated feature $h_{\text{TCN}} = [h_{\text{hop5}}, h_{\text{hop10}}, h_{\text{hop20}}]$ output by the multi-scale temporal convolution module, which includes the patient's movement trajectory characteristics over different time spans;
[0041] First, the feature vectors $h_{\text{hop5}}$, $h_{\text{hop10}}$, and $h_{\text{hop20}}$ at different strides are concatenated to obtain the feature Z; then, time series modeling is performed on Z; the information in the time dimension is dynamically processed using a gating mechanism, and complete spatiotemporal features are extracted under inputs of different lengths;
[0042] The ConvLSTM is configured as a single layer with 256 hidden units. For the input sequence Z, the calculation formula is as follows:
[0043] M = ConvLSTM(Z)
[0044] Where M represents the features output by the ConvLSTM unit. The feature sequence M output by the ConvLSTM is used to transform the adjacency matrix into a learnable attention feature map g' using the following formula:
[0045] $g' = \sigma(\varphi(A \odot M) Z W)$
[0046] Where ⊙ represents the Hadamard product, φ represents the normalization function, W is the weight vector obtained by 1D convolution, and A is the edge weight matrix of the skeleton keypoint topology graph g. Finally, the attention feature map g' is concatenated through temporal convolution to output the video features $h_f$.
[0047] Compared with the prior art, the present invention has the following beneficial effects:
[0048] (1) Enhancing the intelligence and automation of motion detection: This invention significantly improves the intelligence and automation of neurorehabilitation motion detection by constructing an end-to-end neurorehabilitation motion detection model based on STGCN and LSTM. The designed feature extraction model can efficiently extract complex nonlinear features from motion data, and the end-to-end architecture simplifies the entire detection process, improving the overall performance of the model. Compared with traditional methods that require a large amount of labeled data, this invention only requires a small amount of data to achieve high-precision prediction, significantly reducing the cost and difficulty of data collection.
[0049] (2) Enhancing the robustness and cross-domain generalization ability of the model: The generative adversarial network model constructed in this invention effectively solves the problem of the distribution difference of neural rehabilitation data between different domains through interactive training of the generator and discriminator, and achieves efficient generalization of source domain and target domain data. Combined with long short-term memory network, this invention can capture long-range dependencies in time series and optimize the feature extraction process, thereby improving the model's adaptability in a variable rehabilitation environment.
[0050] (3) Enhancing the Model's Multimodal Understanding Ability: This invention significantly improves the model's understanding of neurorehabilitation movements by introducing a multimodal fusion strategy. By combining the spatiotemporal features of neurorehabilitation movement videos with the semantic information of textual descriptions, complementary information from different modalities is fully explored. The inclusion of a ConvLSTM layer enables the model to analyze the role of different body joints in the movement assessment process. The model can integrate key clues contained in multimodal data to generate more semantically deep feature representations. This fusion strategy not only enhances the model's understanding of movement data but also provides a more comprehensive basis for the assessment of different rehabilitation movements.
[0051] (4) Enhancing Real-Time Monitoring and Decision Support Capabilities: This invention, through an optimized cross-domain model, has been successfully applied to actual neurorehabilitation training. It can monitor patients' motor performance in real time and provide accurate assessment results, helping clinicians to identify potential problems and adjust rehabilitation plans promptly. Compared to existing technologies that require extensive manual intervention, this invention achieves intelligent and automated operation throughout the entire process, effectively improving the efficiency and quality of neurorehabilitation and ensuring the scientific and efficient nature of patient rehabilitation. Simultaneously, by reducing errors in rehabilitation training, it significantly improves training effectiveness, increases the speed of patient rehabilitation, and ensures continuous progress during the rehabilitation process. This real-time detection mechanism not only improves the effectiveness of neurorehabilitation treatment but also safeguards the patient's rehabilitation progress, increases treatment efficiency, and ensures the efficient use of medical resources.
Attached Figure Description
[0052] Figure 1 is a flowchart illustrating the overall technical route of the present invention.
[0053] Figure 2 is a schematic diagram of the topology graph construction of the present invention.
[0054] Figure 3 is a schematic diagram of the spatiotemporal graph convolutional network encoding module of the present invention.
[0055] Figure 4 is a schematic diagram of the generative adversarial network of the present invention.
[0056] Figure 5 is a schematic diagram of the multimodal neurorehabilitation movement detection network of the present invention.
[0057] Figure 6 is a schematic diagram of the pooling module of the convolutional long short-term memory network of the present invention.
[0058] Figure 7 is a comparison chart of the experimental results of the present invention.
Detailed Implementation
[0059] This invention effectively integrates the spatiotemporal features of neurorehabilitation motion videos with the semantic information of motion text descriptions by introducing a multimodal fusion strategy. It fully leverages the complementarity of multimodal data to improve the accuracy and robustness of motion detection. To further address the issue of data distribution discrepancies, this invention combines generative adversarial networks (GANs) for domain adversarial training, thereby reducing the gap in feature distribution between the source and target domains and enabling the model to maintain stable performance under different training conditions and device environments. The overall technical approach is shown in Figure 1.
[0060] S1, Initial Dataset Composition: The neurorehabilitation movement detection method in this invention relies on two types of multimodal data: source domain data and target domain data. Source domain data is obtained by recording labeled neurorehabilitation movement videos and corresponding movement description texts using an RGBD camera in an experimental environment. Target domain data consists of unlabeled movement videos and auxiliary description texts recorded using the same device in the patient's home environment. Both types of data undergo consistent preprocessing, including denoising, standardization, and coordinate transformation of video data, and word segmentation, semantic normalization, and embedding representation of text data. For the video modality, coordinate information of each body joint is extracted and transformed into a graph structure representation. For the text modality, semantic feature vectors are generated through a text encoding module for subsequent multimodal fusion processing.
[0061] S2, Feature Extraction Network Construction: The feature extraction network consists of a frame-level skeleton keypoint extraction module, a human joint motion coordinate graph construction module, a spatiotemporal graph convolutional network (STGCN) encoding module, a multi-scale temporal convolutional network (TCN) module, and a motion description text encoding module. Together these capture the temporal, spatial, and semantic features of neurorehabilitation motion data. First, for video modal data, a video feature encoder is used for feature extraction. The frame-level skeleton keypoint extraction module and the human joint motion coordinate graph construction module extract the graph structure data of joint points. Then, a combination of STGCN and TCN is used as the video feature encoder. STGCN processes the transformed skeletal data to explore the high-order topological structure between body joints, extracting motion-related spatial features that reflect the motion patterns of different body parts. Next, three TCNs are used to extract temporal features at different levels, concatenating high- and low-level spatiotemporal features across different time spans to express the dynamic video information. For text modal data, patients are required to learn each type of rehabilitation exercise from specified text descriptions in the rehabilitation manual. The learned text of the neurorehabilitation exercise serves as input to the action description text encoding module, which uses a pre-trained CLIP model as the text feature encoder. Through the CLIP model's image-text contrastive learning mechanism, the action information in the video is converted into feature vectors associated with the text descriptions. Finally, the video modal feature outputs for each step are concatenated and fused with the text modal features to predict continuous evaluation scores.
[0062] S3, Constructing a Domain Generalization Generative Adversarial Network Training Strategy: To improve the model's adaptability in different rehabilitation environments, this invention proposes a domain generalization training strategy based on Generative Adversarial Networks (GANs). First, the generator module takes multimodal data from the source and target domains as input, using different modal features extracted by the video feature encoder and text feature encoder in the feature extraction network. Then, these different modal features are input to the feature mapping module to map them to the target space. During this process, the model can extract shared and unique features from the two domains. These features are then input to the discriminator module, which determines the source of the input sample by comparing the feature distribution differences between the input sample and the real target domain sample. If the feature distribution of the sample generated by the generator differs significantly from that of the real target domain sample, the discriminator will give a lower discrimination score. Through continuous adversarial training, the generator continuously improves, generating more realistic samples, making it difficult for the discriminator to distinguish between generated and real samples. The core idea of this strategy is to force the feature extraction network in the generator to learn a feature representation applicable to both the source and target domains through adversarial training, thereby making the model's prediction of neural rehabilitation movements more accurate in different scenarios. The proposed generative adversarial network training strategy is only used during the training phase to enhance the encoder's adaptability across different domains. 
During the testing and deployment phases, this training strategy is no longer needed; instead, the feature encoding networks containing different modalities in the generator are used directly.
[0063] S4, Constructing a Multimodal Weighted Neurorehabilitation Movement Prediction Network: To improve the accuracy of neurorehabilitation movement detection, this invention constructs a composite prediction model comprising a Convolutional Long Short-Term Memory (ConvLSTM) encoding module, a multimodal fusion module, and a movement detection output module. First, the ConvLSTM module models the temporal dependencies of temporal video features, extracting long and short-term temporal features from the movement data, enhancing the model's understanding of temporal dynamics, and achieving dimensionality reduction through pooling layers. Next, the multimodal fusion module uses cross-attention to mine the intermodal features shared by the spatiotemporal features of the neurorehabilitation movement video and the semantic information of the text description that are beneficial for accurate prediction of rehabilitation movements. Key features are mined from multiple dimensions, highlighting the influence of core joints in the movement and thus compensating for the shortcomings of single-modal features. Finally, the movement detection output module makes predictions based on the fused features. This model not only provides accurate predictions in complex environments but also maintains good performance in cross-domain situations, adapting to diverse patient data.
[0064] S5, Model Deployment and Use: The neurorehabilitation movement detection model designed in this invention can acquire patients' movement data in real time by combining with an RGBD camera. After data preprocessing and feature extraction, the model can evaluate patients' movements in real time. The real-time video stream and the learned rehabilitation movement guidance text serve, after standardization, as the model's input data. The feature extraction network transforms the video stream input and descriptive text into video features and text features, where the video feature encoder and text feature encoder are taken from the generator module whose domain generalization performance was optimized during training with the generative adversarial network training strategy. The features from both modalities are then used as input to the multimodal weighted neurorehabilitation movement prediction network, which predicts rehabilitation movement quality scores in a timely manner to provide feedback on rehabilitation progress or potential problems. In practical applications, this model can effectively evaluate patients' movement performance in a home environment and compare it with movement in a laboratory environment, helping rehabilitation physicians to develop personalized rehabilitation plans more accurately.
[0065] The invention will be further described below with reference to specific embodiments.
[0066] I. Dataset Construction
[0067] 1. Raw Neurorehabilitation Motion Data Acquisition: As a data-driven neurorehabilitation technology, this invention requires collecting RGBD camera data and textual descriptions of movements from different patients during neurorehabilitation. The camera data include the movement trajectories, joint angles, and movement sequences of different patients during neurorehabilitation exercises. These data vary in length and, after processing, are divided into source-domain patient data V_S and target-domain patient data V_T. To compensate for the limitations of visual data, this invention also introduces textual descriptions of the movements, which originate from the manuals that guide patients through rehabilitation exercises. Each video segment corresponds to a textual description of the movement in the manual, covering the type of rehabilitation exercise, the target body part, and the execution requirements, and thus provides key semantic information. The text data are likewise divided into source-domain patient data T_S and target-domain patient data T_T.
[0068] 2. Video Data Preprocessing and Conversion: Because motion data acquired by RGBD cameras is often affected by noise, lighting, and other factors, this invention performs noise reduction, supplementation, and standardization on the data. Specifically, wavelet denoising and time-series smoothing are applied to the video data from the source domain V_S and the target domain V_T. The depth image of the i-th frame in each domain undergoes discrete wavelet decomposition; the low-frequency components are preserved, and soft thresholding is applied to the high-frequency components to remove noise, as shown in the following formula:
[0069] D'_i = IW(L_i, TH(H_i))
[0070] where H_i and L_i are the high-frequency and low-frequency components of the wavelet decomposition of the i-th depth image, respectively, and D'_i is the denoised depth image.
[0071] TH(H) represents the soft threshold function for high-frequency components, defined as:
[0072] TH(H)=sign(H)·max(|H|-λ,0)
[0073] Where λ is the threshold, and IW(.) denotes the inverse wavelet transform. Furthermore, the wavelet-denoised depth image sequences are subjected to time-series smoothing to eliminate inter-frame jitter and temporal discontinuities.
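The soft-threshold step can be illustrated with a small sketch. It works on a single row of depth values and assumes a one-level Haar wavelet, which is only a stand-in because the description does not name the wavelet family; the function names and the toy values are hypothetical.

```python
import math

def soft_threshold(h, lam):
    """TH(H) = sign(H) * max(|H| - lam, 0) for one coefficient."""
    sign = (h > 0) - (h < 0)
    return sign * max(abs(h) - lam, 0.0)

def haar_decompose(x):
    """One-level Haar transform of an even-length sequence -> (low, high)."""
    s = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return low, high

def haar_reconstruct(low, high):
    """Inverse of haar_decompose (plays the role of IW)."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out.extend([(a + d) / s, (a - d) / s])
    return out

def wavelet_denoise(x, lam):
    """Keep the low-frequency part, shrink the high-frequency part, invert."""
    low, high = haar_decompose(x)
    return haar_reconstruct(low, [soft_threshold(h, lam) for h in high])

row = [1.0, 3.0, 2.0, 2.0]  # toy depth values, not real data
denoised = wavelet_denoise(row, lam=0.5)
```

With λ = 0 the row is reconstructed unchanged, and a very large λ removes all detail so each pair collapses to its mean. In practice a 2-D transform over the whole depth image would be used instead of this 1-D toy.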
[0074] For the depth sequence of the source domain, the formula used is:
[0075]
[0076] For the depth sequence of the target domain, the formula used is:
[0077]
[0078] where ω_j is a Gaussian weighting function and k is the sliding-window size, which controls the smoothing range. This yields the optimized video input data V_S1 and V_T1 and ensures data consistency and accuracy.
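A Gaussian-weighted sliding window of this kind might look like the sketch below. Several details are assumptions: the window is taken to span offsets -k to k, the weights are normalized to sum to one, borders are handled by clamping to the nearest frame, and `sigma` is a hypothetical width parameter. Each value stands for one pixel of the depth image followed over time.

```python
import math

def gaussian_weights(k, sigma=1.0):
    """Normalized Gaussian weights w_j for offsets j = -k..k."""
    raw = [math.exp(-(j * j) / (2.0 * sigma * sigma)) for j in range(-k, k + 1)]
    total = sum(raw)
    return [w / total for w in raw]

def smooth_sequence(values, k, sigma=1.0):
    """Temporally smooth a per-frame scalar sequence with window size k."""
    weights = gaussian_weights(k, sigma)
    n = len(values)
    out = []
    for i in range(n):
        acc = 0.0
        for offset, w in zip(range(-k, k + 1), weights):
            idx = min(max(i + offset, 0), n - 1)  # clamp at sequence borders
            acc += w * values[idx]
        out.append(acc)
    return out

jittery = [0.0, 0.0, 10.0, 0.0, 0.0]  # toy sequence with one spike
smoothed = smooth_sequence(jittery, k=1)
```

A larger k (or sigma) spreads the spike over more frames, which is the trade-off between jitter removal and temporal sharpness.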
[0079] After RGBD camera data from different scenarios have been collected, wavelet-denoised, and temporally smoothed, the processed source-domain and target-domain video data V_S1 and V_T1 are standardized to further improve data consistency and model generalization, as shown in the formula:
[0080] V'_i = (V_i − μ) / σ
[0081] where V'_i is the standardized video data of the i-th frame, V_i is the video data of the i-th frame after wavelet denoising and time-series smoothing, and μ and σ are the mean and standard deviation over all N frames, calculated as follows:
[0082] μ = (1/N) · Σ_{i=1..N} V_i
[0083] σ = sqrt( (1/N) · Σ_{i=1..N} (V_i − μ)^2 )
[0084] Applying this standardization to the frame-level data of V_S1 and of V_T1 yields source-domain video data V_S2 and target-domain video data V_T2 with a more uniform distribution, providing more stable input data for subsequent feature extraction and domain alignment.
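As a rough illustration of the standardization step, the sketch below z-scores a small set of frames using the mean and standard deviation taken over all frames. The population form of the standard deviation (division by the number of values) and the guard for constant input are assumptions, and the frame values are toy numbers.

```python
import math

def standardize(frames):
    """Z-score every value with the mean/std computed over all frames."""
    flat = [v for frame in frames for v in frame]
    mu = sum(flat) / len(flat)
    var = sum((v - mu) ** 2 for v in flat) / len(flat)
    sigma = math.sqrt(var) or 1.0  # avoid division by zero on constant input
    scaled = [[(v - mu) / sigma for v in frame] for frame in frames]
    return scaled, mu, sigma

frames = [[1.0, 2.0], [3.0, 4.0]]  # two toy frames, two values each
scaled, mu, sigma = standardize(frames)
```

The same routine would be applied separately to the source-domain and target-domain sequences, so each domain ends up with zero mean and unit variance.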
[0085] II. Construction of a Neurorehabilitation Movement Detection Model
[0086] The neurorehabilitation motion detection model comprises a feature extraction network and a multimodal neurorehabilitation motion detection network. The feature extraction network includes a frame-level skeleton keypoint extraction module, a human joint motion coordinate graph construction module, a spatiotemporal graph convolutional network encoding module (STGCN), a multi-scale temporal convolutional module (TCN), and a motion description text encoding module. For video modal data, the frame-level skeleton keypoint extraction module and the human joint motion coordinate graph construction module extract graph structure data of joint points. Then, STGCN and TCN are used as video feature encoders. STGCN processes the transformed skeletal data to obtain the high-order topological structure between body joints and extract motion-related spatial features. Next, three TCNs are used to extract temporal features at different levels, concatenating high- and low-level spatiotemporal features across different time spans. For text modal data, the motion description text encoding module converts motion information in the video into feature vectors associated with text descriptions. The video modal feature outputs for each step are concatenated with the text modal features for multimodal fusion, forming the input to the multimodal neurorehabilitation motion detection network, and predicting continuous evaluation scores for the output.
[0087] 1. Feature Extraction Network
[0088] The feature extraction model constructed in this invention includes a frame-level skeleton keypoint extraction module, a human joint motion coordinate map construction module, a spatiotemporal graph convolutional network (STGCN) encoding module, a multi-scale temporal convolutional network (TCN) module, and a motion description text encoding module; it captures the temporal and spatial features of neurorehabilitation motion data simultaneously. First, the frame-level skeleton keypoint extraction module extracts frame-level skeleton keypoint information from the preprocessed source-domain video data V_S2 and target-domain video data V_T2, forming preliminary feature data that represent the motion state of the body's joints. Next, the human joint motion coordinate map construction module transforms the keypoint data into a topological graph that reflects the connections and topological structure between joints, laying the foundation for subsequent feature extraction. The STGCN then encodes the constructed human joint motion coordinate map to explore higher-order topological relationships and dynamic interactions between joints; this stage extracts spatial features related to motion patterns and reflects the motion state of the various body parts. Following this, the multi-scale TCN module, which includes three temporal convolutional layers, models features at different time scales to extract temporal features at different levels. These high- and low-level features are concatenated to further enhance the representation of time-series changes. Finally, the feature outputs of each time step are concatenated to provide input features for the generator and discriminator modules of the subsequent generative adversarial network and for the multimodal neurorehabilitation motion detection model.
In complex neurorehabilitation scenarios, this feature extraction model can efficiently capture key features in both spatial and temporal dimensions. Even when data is dynamically changing or the amount of data is small, it can still maintain high-precision feature representation capabilities, providing reliable support for real-time motion assessment and rehabilitation training optimization.
[0089] The frame-level skeleton keypoint extraction module processes the input video and extracts skeleton keypoint information from the preprocessed video data in the source and target domains. The OpenPose model is applied to each frame of the video data in the source-domain multimodal data S = {V_S2, T_S} and the target-domain multimodal data T = {V_T2, T_T} to obtain the corresponding skeleton keypoint sets. Specifically, the i-th frame is input into the OpenPose model to obtain the skeleton keypoint set P_i, whose k-th element gives the two-dimensional coordinates of the k-th skeleton keypoint in the i-th frame, with k = 1, ..., K, where K is the total number of skeleton keypoints. Inputting all video frames into the OpenPose model yields the set of skeleton keypoint sequences P_V = {P_1, ..., P_N}, where N is the total number of frames in the video. Next, to ensure consistency across frames, a nearest-neighbor matching algorithm matches keypoints between consecutive frames of P_V, generating a skeleton action sequence with temporal relationships: V_S3 for the source-domain data and V_T3 for the target-domain data. The skeleton information extracted from each frame serves as one time step and records the motion trajectory of each joint of the target human body in both the source and target domains, providing temporal skeleton data for subsequent topology map construction and feature extraction.
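The nearest-neighbor matching between consecutive frames could be sketched as follows. The keypoints are assumed to have been extracted already (no pose-estimation library is called here), every frame is assumed to contain the same number of keypoints, and the greedy one-to-one assignment is just one simple variant of nearest-neighbor matching; the function names are hypothetical.

```python
import math

def match_keypoints(prev, curr):
    """Reorder curr so that index k holds the point nearest to prev[k]."""
    remaining = list(range(len(curr)))
    ordered = []
    for px, py in prev:
        best = min(
            remaining,
            key=lambda j: math.hypot(curr[j][0] - px, curr[j][1] - py),
        )
        remaining.remove(best)  # each detection is used at most once
        ordered.append(curr[best])
    return ordered

def track_sequence(frames):
    """Propagate keypoint identities from the first frame through the video."""
    tracked = [frames[0]]
    for frame in frames[1:]:
        tracked.append(match_keypoints(tracked[-1], frame))
    return tracked

# Two toy frames whose keypoints arrive in swapped order.
frames = [[(0.0, 0.0), (10.0, 10.0)], [(10.5, 9.5), (0.2, 0.1)]]
tracked = track_sequence(frames)
```

After tracking, index k refers to the same joint in every frame, which is what the later graph construction relies on.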
[0090] Human Joint Motion Topology Construction Module: This module transforms the skeleton keypoint information in the source-domain data V_S3 and the target-domain data V_T3 into topological graphs of human joint motion. The purpose is to establish the spatial relationships between joints, providing a foundation for subsequent feature extraction and spatiotemporal relationship modeling. Specifically, the skeleton keypoint information of the source-domain data V_S3 is transformed into a topological graph G_S = (V_S, θ_S), and that of the target-domain data V_T3 into a topological graph G_T = (V_T, θ_T), where the node sets V_S and V_T represent the joints (here V_S and V_T denote node sets rather than video data) and the edge sets θ_S and θ_T represent the connection relationships between joints, from which the edge weight matrices A_S and A_T are constructed. The node set of each topological graph contains all keypoints, and the edge set contains the connections between joints. A joint topology graph is generated for every frame in both the source and target domains, so the spatial structure between joints is preserved. In this way, human motion data in the source and target domains are transformed into graph-structured data, providing input for the subsequent Spatiotemporal Graph Convolutional Network (STGCN) encoding module. These topological graphs capture the dynamic interactions between the joints of the human body and thus provide rich spatial information for feature learning and action recognition. The specific process is shown in Figure 2.
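To make the graph construction concrete, the sketch below builds an edge weight (adjacency) matrix from a list of joint connections. The five-joint chain, the binary edge weights, the added self-loops, and the row normalization are all illustrative assumptions; the description does not specify how the edge weights are set.

```python
def build_adjacency(num_joints, bones, self_loops=True):
    """Symmetric adjacency matrix with 1.0 for every connected joint pair."""
    a = [[0.0] * num_joints for _ in range(num_joints)]
    for i, j in bones:
        a[i][j] = 1.0
        a[j][i] = 1.0
    if self_loops:
        for i in range(num_joints):
            a[i][i] = 1.0
    return a

def row_normalize(a):
    """Scale each row to sum to one so that node degree does not dominate."""
    out = []
    for row in a:
        s = sum(row)
        out.append([v / s if s else 0.0 for v in row])
    return out

# Hypothetical 5-joint chain, e.g. shoulder-elbow-wrist style connectivity.
bones = [(0, 1), (1, 2), (2, 3), (3, 4)]
adjacency = build_adjacency(5, bones)
normalized = row_normalize(adjacency)
```

Because the skeleton connectivity is fixed, the same matrix can be reused for every frame while the node features (joint coordinates) change over time.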
[0091] Spatiotemporal Graph Convolutional Network (STGCN) Encoding Module: In this invention, the STGCN encoding module extracts spatiotemporal features from the skeleton keypoint data of the source and target domains and is an important component of the video feature encoder. The inputs are the source-domain skeleton keypoint topology graph G_S and the target-domain skeleton keypoint topology graph G_T, in which each node represents a human joint and each edge represents a connection between joints. Each STGCN consists of two temporal convolutional blocks and one spatial convolutional block. Spatial features are first extracted using graph convolution operations; for the source and target domains they are calculated using the following formulas:
[0092]
[0093] where h^S_i and h^T_i are the new features of node i in the source and target domains, h^S_j and h^T_j are the features of neighbor node j, N(i) is the neighborhood set of node i, W_S and W_T are the learnable weight matrices for the source and target domains, and b_S and b_T are the bias terms. GLU activation is then applied to obtain the source-domain features GLU(h^S_i) and the target-domain features GLU(h^T_i). This activation function controls the information flow by splitting the input into two parts and gating one part element-wise with the activation of the other, enabling the network to capture more complex feature representations. Further spatial convolution helps extract the spatial dependencies between nodes in the source and target domains, providing richer spatial features for subsequent spatiotemporal modeling. Through this series of operations, the spatiotemporal graph convolutional network learns spatial and temporal features in the source and target domains separately, yielding the STGCN-extracted source-domain and target-domain features. This effectively extracts the spatiotemporal features of human joint movements and provides high-quality feature representations for subsequent motion detection tasks. The specific process is shown in Figure 3.
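One way to read the graph convolution and GLU step is sketched below. It assumes the common aggregation h_i = Σ_{j∈N(i)} W·h_j + b built from the symbols defined above (any normalization term is omitted), and the standard GLU form in which the first half of the vector is gated by the sigmoid of the second half. The tiny graph and weights are toy values.

```python
import math

def matvec(w, x):
    """Multiply matrix w (rows = output dims) by vector x."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def graph_conv(features, neighbors, w, b):
    """h_i = sum over j in N(i) of W h_j, plus b, for every node i."""
    out = []
    for i in range(len(features)):
        agg = [0.0] * len(b)
        for j in neighbors[i]:
            agg = [a + m for a, m in zip(agg, matvec(w, features[j]))]
        out.append([a + bi for a, bi in zip(agg, b)])
    return out

def glu(h):
    """Gated linear unit: first half multiplied by sigmoid(second half)."""
    half = len(h) // 2
    return [a / (1.0 + math.exp(-g)) for a, g in zip(h[:half], h[half:])]

features = [[1.0], [2.0], [3.0]]          # one scalar feature per joint
neighbors = [[0, 1], [0, 1, 2], [1, 2]]   # N(i), including the node itself
w = [[1.0], [0.0]]                        # maps 1 input dim to 2 output dims
b = [0.0, 0.0]
gated = [glu(h) for h in graph_conv(features, neighbors, w, b)]
```

With the gate half fixed at zero, each node's output is half of its neighborhood sum, which makes the gating effect easy to check by hand.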
[0094] Multi-scale Temporal Convolutional Module (TCN): In this invention, the multi-scale temporal convolutional module is used to extract temporal features at different levels from the spatiotemporal features of the source and target domains, and together with the STGCN, forms a video feature encoder. Specifically, the multi-scale temporal convolutional network (TCN) module models features from different time scales and concatenates high-level and low-level features to further enhance the model's ability to represent time series changes. First, the temporal convolutional layer extracts features related to time changes by performing convolution operations on the data at each time step. For the source domain S3 and the target domain T3, multiple STGCN modules are used for feature extraction at three different time steps, specifically including high-level and low-level temporal convolutions. Specifically, frame-level data with strides of 5, 10, and 20 are selected for feature encoding, using these data with different time spans to capture short-term and long-term dependencies in the time series.
[0095] At each time scale, the temporal features of the source and target domains undergo independent spatiotemporal graph convolution operations to further extract local and global dependency information relevant to the current time step. These features are processed by convolution kernels to effectively capture motion change patterns within short-term (e.g., 5 frames) and long-term (e.g., 20 frames) timeframes. For different time scales, the model uses convolution operations to capture data features within different time lengths, while adjusting the size of the convolution kernels to adapt to different time steps, thereby enhancing the model's expressive power. Then, the source domain S3 and target domain T3 features after convolution at different scales are concatenated to form a richer feature representation as shown in the following formula:
[0096] h^S_TCN = [h^S_hop5, h^S_hop10, h^S_hop20]
[0097] h^T_TCN = [h^T_hop5, h^T_hop10, h^T_hop20]
[0098] where h^S_hop5 and h^T_hop5 are the features obtained by STGCN encoding with a step size of 5 frames, h^S_hop10 and h^T_hop10 correspond to a step size of 10 frames, and h^S_hop20 and h^T_hop20 correspond to a step size of 20 frames. In this way, the model captures multi-level temporal dependencies at different time scales and integrates this multi-scale information into a richer and more comprehensive feature representation. These multi-level features serve as input to the subsequent generation and detection stages, improving performance in both the source and target domains, especially the accurate capture of temporal features in neurorehabilitation motion detection.
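The multi-scale concatenation can be sketched as below. Two points are assumptions: a "step size" of 5, 10, or 20 is read as keeping every 5th, 10th, or 20th frame, and `encode` is a hypothetical stand-in (a per-dimension mean) for the STGCN encoder, used only so the example runs.

```python
def sample_with_stride(frames, stride):
    """Keep every stride-th frame, starting from the first."""
    return frames[::stride]

def encode(frames):
    """Hypothetical placeholder for the STGCN encoder: mean over frames."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]

def multi_scale_features(frames, strides=(5, 10, 20)):
    """Concatenate [h_hop5, h_hop10, h_hop20] into one feature vector."""
    feats = []
    for stride in strides:
        feats.extend(encode(sample_with_stride(frames, stride)))
    return feats

frames = [[float(i)] for i in range(40)]  # 40 toy frames, one value each
h_tcn = multi_scale_features(frames)
```

The short stride keeps more frames and so reflects fine-grained motion, while the long stride summarizes the slower trend of the same sequence.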
[0099] Action Description Text Encoding Module: This module is designed to enhance the multimodal understanding capability of the neurorehabilitation action detection model by extracting semantic features from the action description text, and it includes an encoder for text features. Specifically, for the input text data T_S and T_T and the extracted video features h^S_TCN and h^T_TCN, a pre-trained CLIP model performs feature extraction to obtain the source-domain text features l_S and the target-domain text features l_T. The action description text is input into the CLIP model as supplementary information, and the CLIP model jointly encodes the text and video to generate a high-dimensional feature vector of the action description text. Through its pre-trained multimodal contrastive learning mechanism, the CLIP model aligns the semantic information of the text with the visual features of the video. The text input is processed by CLIP's text encoder, and the generated text features capture the action elements, temporal information, and spatial relationships in the description, all of which are crucial for recognizing neurorehabilitation actions. The pre-trained CLIP model not only understands specific action words in the text (such as "raise arm" and "squat") but also captures the execution method and the related body parts from the context. For example, for the description of a squatting action, the CLIP model can extract action information related to joints such as "ankle," "knee," and "spine," convert it into a numerical feature vector, and pass it to subsequent model processing.
[0100] 2. Domain Generalization Generative Adversarial Network Training Strategy
[0101] Generator Module (Encoding Module): The generator module G constructed in this invention is a feature extraction network, combined with a feature mapping module to achieve efficient data transformation. The feature mapping layer uses a linear transformation to concatenate different modalities of the source domain feature vectors obtained by the video feature encoder and the text feature encoder, and then maps the features to the target space to maintain the similarity between the source and target domain data. This is expressed as the formula:
[0102] S = G(f_map(l_S, h^S_TCN))
[0103] T = G(f_map(l_T, h^T_TCN))
[0104] The generator G shares parameters between the source and target domains. In the generator, the neural rehabilitation actions from the source and target domains are input separately and processed using a shared feature mapping module, making the generated features as similar as possible. This design not only enhances the comparability of the generated features but also presents a greater recognition challenge to the subsequent discriminator. The generator's goal is to minimize the distance between the generated features from the source and target domains, optimizing the objective function as follows:
[0105] min L = ||S − T||_2
[0106] Where S and T represent the generated features of the source and target domains, respectively, and L is the distance between the generated features of the source and target domains. This process ensures the robustness and adaptability of the model when processing data from different domains.
[0107] Discriminator Module (Differentiation Module): The discriminator module constructed in this invention is used to effectively distinguish between source and target domain neural rehabilitation movements. This discriminator module D includes a video discriminator and a text discriminator, which respectively receive video feature data and text feature data output from the generator. The feature data is processed through two fully connected layers. The output dimension of the first layer is 512, and the output dimension of the second layer is 256. Both layers use the ReLU activation function to enhance the non-linear representation capability of the features. Subsequently, the processed feature data is passed to the output layer, which contains a single neuron and uses the Sigmoid activation function for binary classification. The output value represents the similarity between the features of the source and target domains, and its mathematical expression is the formula:
[0108] D(X) = σ(W_out · X + b_out), X ∈ {S, T}
[0109] where D(X) is the output of the discriminator, X is its input (the output of the generator G), σ is the sigmoid function, and W_out and b_out are the weights and biases of the output layer, respectively. The discriminator aims to improve the model's discriminative ability by maximizing the difference between source-domain and target-domain features. Its objective function is given by the formula:
[0110] max L = E[log D(S)] + E[log(1 − D(T))]
[0111] Here, E represents the expectation, and log represents the logarithmic function. In this process, the discriminator module ensures effective differentiation between source and target domain data, improving the model's robustness. Through this design, the discriminator module not only distinguishes between source and target domain data but also strengthens the model's ability to judge the similarity of features between the source and target domains.
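The two objectives can be put side by side in a short sketch. It keeps only the sigmoid output layer of the discriminator (the two fully connected layers with 512 and 256 units are omitted for brevity), estimates the expectations as batch means, and adds a small `eps` inside the logarithms for numerical safety; these simplifications and all names are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def discriminator(x, w_out, b_out):
    """D(X) = sigmoid(W_out . X + b_out) for one feature vector."""
    return sigmoid(sum(w * v for w, v in zip(w_out, x)) + b_out)

def discriminator_objective(src, tgt, w_out, b_out, eps=1e-12):
    """E[log D(S)] + E[log(1 - D(T))], with expectations as batch means."""
    e_s = sum(math.log(discriminator(s, w_out, b_out) + eps) for s in src) / len(src)
    e_t = sum(math.log(1.0 - discriminator(t, w_out, b_out) + eps) for t in tgt) / len(tgt)
    return e_s + e_t

def feature_distance(s, t):
    """L2 distance between generated source and target features."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s, t)))

src, tgt = [[2.0]], [[-2.0]]                                # toy generated features
blind = discriminator_objective(src, tgt, [0.0], 0.0)      # cannot tell domains apart
sharp = discriminator_objective(src, tgt, [3.0], 0.0)      # separates the domains
```

The discriminator is trained to raise its objective (`sharp` is higher than `blind`), while the generator is trained to shrink `feature_distance` so that the two domains become harder to separate.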
[0112] Domain Generalization Adversarial Training Strategy: The domain generalization adversarial training strategy of this invention aims to improve the model's adaptability across different domains through adversarial learning between the generator and discriminator modules. During training, the generator and discriminator are optimized alternately to achieve effective fusion of source and target domain features. The generator aims to generate target domain features similar to source domain features, enabling it to "deceive" the discriminator, while the discriminator aims to accurately distinguish between source and target domain features. Specifically, the generator's loss function is defined by the formula:
[0113]
[0114] where both terms denote expectations. Through continuous interaction with the video and text discriminators, this adversarial training strategy optimizes the video feature encoder (STGCN and TCN) and the CLIP text feature encoder in the generator. As a result, the generated target-domain features have a distribution close to that of the source-domain features, which enhances the model's generalization ability and improves prediction accuracy under different neurorehabilitation action detection conditions. The specific structure is shown in Figure 4.
[0115] 3. Multimodal neurorehabilitation movement detection network
[0116] This invention uses the video and text features output by the feature extraction network as data samples, and patient rehabilitation scores as labels, to train a multimodal neurorehabilitation action detection network. The proposed network integrates a Convolutional Long Short-Term Memory (ConvLSTM) encoding module to improve the accuracy of neurorehabilitation action detection. Replacing traditional pooling layers with a ConvLSTM encoding module allows the model to accept variable-length video data and to focus on modeling the temporal dependencies of action data, extracting long- and short-term temporal features. More importantly, each joint plays a different role in each movement, and capturing this role is crucial for determining movement quality. Ordinary STGCNs, however, treat all body joints equally, whereas in the evaluation of rehabilitation movements the role of a joint should vary with temporal and spatial context. This motivates a multimodal fusion strategy in which skeletal data from the video modality is fused with the motion description text using a co-attention mechanism. Through the co-attention mechanism, the model attends to key information in the video and text modalities simultaneously and dynamically adjusts the importance of each joint in the movement. The overall structure of the multimodal neurorehabilitation action detection network is shown in Figure 5.
[0117] Convolutional Long Short-Term Memory (ConvLSTM) Pooling Module: This invention designs a ConvLSTM pooling module to address the dependency of traditional global pooling or attention pooling on fixed input lengths, enabling flexible processing of variable-length video inputs. Through this module, the model can effectively capture the spatiotemporal features of neurorehabilitation motion data and adapt to input sequences of different lengths, providing support for neurorehabilitation motion detection in diverse scenarios.
[0118] The input to this module is the source-domain video feature vector h^S_TCN = [h^S_hop5, h^S_hop10, h^S_hop20] and the target-domain video feature vector h^T_TCN = [h^T_hop5, h^T_hop10, h^T_hop20] obtained from feature extraction, which contain the patient's movement trajectory features at different time spans. First, the feature vectors of the different strides are concatenated to generate the feature representations Z_S and Z_T. Next, the extracted spatial features are input into a ConvLSTM unit for time-series modeling. By combining the advantages of convolutional operations and long short-term memory networks, ConvLSTM overcomes the input-length limitations of traditional pooling methods. It uses gating mechanisms (input gate, forget gate, and output gate) to process information dynamically along the time dimension, thereby extracting complete spatiotemporal features from inputs of different lengths.
[0119] In the specific implementation, the ConvLSTM unit is set to a single layer with 256 hidden units to maintain computational efficiency while enhancing feature representation capability. For the input sequences Z_S and Z_T, the outputs are calculated as follows:
[0120] M_S = ConvLSTM(Z_S)
[0121] M_T = ConvLSTM(Z_T)
[0122] where M_S and M_T are the features output for the variable-length source-domain and target-domain inputs, respectively. With this design, the ConvLSTM pooling module replaces traditional global pooling or attention pooling, improving the modeling of temporal features and the model's adaptability to data of different lengths. This provides a more flexible and efficient feature representation for the subsequent attention weighting mechanism and action classification module. The feature sequences M_S and M_T output by the ConvLSTM are combined with the adjacency matrices to form learnable attention feature maps, according to the formulas:
[0123] G_S = σ(φ(A_S ⊙ M_S) Z_S W_S)
[0124] G_T = σ(φ(A_T ⊙ M_T) Z_T W_T)
[0125] where ⊙ denotes the Hadamard product, φ denotes the standardization function, A_S and A_T are the adjacency matrices, Z_S and Z_T are the concatenated features, and W_S and W_T are weight vectors obtained through 1D convolution. Finally, the features are concatenated through temporal convolution to output the video features h^S_f and h^T_f. The specific process is shown in Figure 6.
[0126] Multimodal Fusion Module: The multimodal fusion module in this invention employs a co-attention mechanism to effectively fuse information from different modalities, thereby improving the accuracy and robustness of neurorehabilitation movement detection. During neurorehabilitation movement assessment, video data and movement description text serve as two important information sources, providing temporal and spatial features of the movement, as well as semantic understanding, respectively. To fully utilize the advantages of these two modalities, a co-attention mechanism is used to achieve dynamic information fusion between video and text modalities.
[0127] First, the video data is processed using the Spatiotemporal Graph Convolutional Network (STGCN) and the Convolutional Long Short-Term Memory (ConvLSTM) encoding module to extract the spatial features and temporal dependencies of the skeletal data. Then, for the video features h^S_f and h^T_f and the text features l_S and l_T, the co-attention mechanism calculates the similarity between the video and text modalities and dynamically assigns a weight to each modality in the motion recognition task. This mechanism automatically adjusts the contributions of the different modalities during model training, so that the model relies on the spatial features of the video data at critical moments and on the semantic information of the text description at other moments.
[0128] Ultimately, by fusing multimodal features, the model can more accurately capture the temporal information of motion and the spatial dynamics of joints, while also understanding the semantic level of motion. This strategy effectively improves the accuracy of neurorehabilitation motion detection, enhances the model's generalization ability in cross-domain environments, and provides strong support for real-time assessment and personalized feedback of patient rehabilitation status in practical applications.
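The description does not give the co-attention formulas, so the following is only a minimal sketch of one plausible form: each modality is summarized by its mean, that summary is used as a query over the other modality's tokens with scaled dot-product similarity, and the two attended vectors are concatenated. The shared feature dimension, the mean-pooled queries, and all names are assumptions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mean_vector(tokens):
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(len(tokens[0]))]

def attend(query, keys):
    """Similarity-based weights over keys and the resulting weighted sum."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    pooled = [sum(w * key[d] for w, key in zip(weights, keys)) for d in range(len(query))]
    return weights, pooled

def co_attention(video_tokens, text_tokens):
    """Text attends over video tokens and video attends over text tokens."""
    joint_weights, video_ctx = attend(mean_vector(text_tokens), video_tokens)
    word_weights, text_ctx = attend(mean_vector(video_tokens), text_tokens)
    return joint_weights, word_weights, video_ctx + text_ctx

video_tokens = [[1.0, 0.0], [0.0, 1.0]]  # e.g. one toy feature vector per joint
text_tokens = [[2.0, 0.0]]               # one toy text feature vector
joint_w, word_w, fused = co_attention(video_tokens, text_tokens)
```

Here the text feature points in the same direction as the first video token, so that token receives the larger weight; this is the sense in which the text can raise the importance of particular joints.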
[0129] Action Detection Output Module: The action detection output module in this invention converts the fused multimodal features into quality scores for neurorehabilitation actions, providing intuitive prediction results for rehabilitation action assessment. The core design of the module includes a multilayer perceptron and a fully connected layer. The multilayer perceptron applies nonlinear mapping to the fused multimodal features received from the multimodal fusion module, extracting key information highly correlated with action quality, and the fully connected layer then maps the processed features to a predicted score y_s. This module quantifies the quality of neurorehabilitation movements. Its design supports the rapid generation of movement quality scores, enabling the system to monitor patients' rehabilitation progress in real time and providing medical personnel with accurate and efficient movement assessment data.
[0130] III. Model Deployment and Application
[0131] Deployment of neurorehabilitation data acquisition equipment: High-precision RGBD cameras and motion capture devices are deployed in the patient's rehabilitation training environment to acquire real-time video data of neurorehabilitation movements in both the source and target domains. Descriptive text data is generated from the guidance text in the rehabilitation exercise manual corresponding to different segments of the patient's neurorehabilitation movement videos. High-quality data acquisition ensures accurate and complete training data to support subsequent model training and prediction.
[0132] Deployment of the Neurorehabilitation Movement Prediction Model: The optimized neurorehabilitation movement prediction model is deployed into the rehabilitation monitoring system. RGBD video data from different patients, along with corresponding descriptive text of rehabilitation movements, are collected and preprocessed in real time. The preprocessed data is then input into the trained model to generate multimodal features in the real-world scenario. The neurorehabilitation movement model is then used for real-time movement detection and evaluation, outputting a movement quality assessment score to help doctors understand the patient's rehabilitation status in a timely manner.
[0133] Model Validation and Predictive Performance Evaluation: After model deployment, actual patient rehabilitation data are collected periodically to validate and evaluate the model's predictive performance. Comparison with actual rehabilitation outcomes ensures the model's accuracy and stability; if necessary, the model is adjusted and optimized to improve its predictive accuracy and adaptability. To verify the effectiveness of the method, a test set consisting of 20 neurorehabilitation patients and their actual rehabilitation outcomes is used. As shown in Figure 7, sample numbers 1-20 represent the 20 neurorehabilitation patients; the blue histogram represents the actual values of the patients' rehabilitation outcomes, and the red histogram represents the predicted values. As can be seen from Figure 7, for most of the 20 test samples the error between the rehabilitation movement quality predicted by the method of the present invention and the expert evaluation was less than 5%, indicating that the method has high prediction accuracy and stability and can provide a reliable basis for the rehabilitation assessment of patients.
[0134] Intelligent Feedback and Decision Support: Based on model predictions and combined with the patient's rehabilitation data and actual situation, the system automatically generates personalized rehabilitation training feedback to help clinicians develop scientific training plans. According to predicted movement trends, the system can provide early warnings of problems in the patient's rehabilitation progress, allowing for timely adjustments to training intensity and content to ensure the patient's rehabilitation process is safe, effective, and meets medical requirements.
[0135] The above description is merely a preferred embodiment of this application and is not intended to limit this application. Various modifications and variations can be made to this application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.
[0136] While the specific embodiments of the present invention have been described above, they are not intended to limit the scope of protection of the present invention. Those skilled in the art should understand that various modifications or variations that can be made by those skilled in the art without creative effort based on the technical solutions of the present invention are still within the scope of protection of the present invention.
Claims
1. A method for detecting neural rehabilitation movements based on a domain generalization neural network, characterized in that it includes the following steps: S1, acquire in real time the video frame data and descriptive text data of the patient's neurorehabilitation movements, wherein the descriptive text data is generated from the guidance text in the rehabilitation exercise manual corresponding to different segments of the patient's neurorehabilitation movement video; S2, preprocess the two types of data to obtain the standard input data for the model; S3, input the standard input data into the trained neurorehabilitation movement detection model and output a movement quality assessment score; The neurorehabilitation movement detection model includes a feature extraction network and a multimodal neurorehabilitation movement detection network; The feature extraction network includes a frame-level skeleton keypoint extraction module, a human joint motion coordinate graph construction module, a spatiotemporal graph convolutional network (STGCN) encoding module, a multi-scale temporal convolution (TCN) module, and a motion description text encoding module. For video modal data, the frame-level skeleton keypoint extraction module and the human joint motion coordinate graph construction module extract the graph structure data of the joints. Then, the STGCN and the TCN are used as video feature encoders. The STGCN processes the transformed skeleton data to obtain the high-order topological structure between body joints and extracts motion-related spatial features. Then, three TCNs are used to extract temporal features at different levels, and the high-level and low-level spatiotemporal features over different time spans are concatenated. For text modal data, the motion description text encoding module converts the motion information in the video into feature vectors associated with the text description.
The video modal feature outputs of each step are concatenated and fused with the text modal features to form the input of the multimodal neurorehabilitation movement detection network, which predicts a continuous evaluation score as its output.
2. The method for detecting neural rehabilitation movements based on a domain generalization neural network as described in claim 1, characterized in that: The feature extraction network is trained using a constructed domain generalization generative adversarial network training strategy. The feature extraction network is used as the generator module of the generative adversarial network and is combined with a feature mapping layer to transform the data. The feature mapping layer concatenates the feature vectors of the different modalities of the source domain obtained by the feature extraction network and maps them to the target space through a linear transformation, so as to maintain the similarity between the source domain and target domain data. The generator module shares parameters between the source and target domains. Its goal is to minimize the distance between the generated features of the source and target domains, with the following objective function: $\min L = \|S - T\|_2$, where S and T represent the generated features of the source domain and the target domain, respectively, and L is the distance between them; The discriminator module includes a video discriminator and a text discriminator, which receive the video feature data and the text feature data output by the generator, respectively. The feature data is processed through two fully connected layers: the first layer has an output dimension of 512 and the second layer has an output dimension of 256, both using the ReLU activation function. Subsequently, the processed feature data is passed to the output layer, which contains a single neuron and uses the sigmoid activation function for binary classification.
The output value represents the similarity between the source and target domain features, mathematically expressed as: $D(X) = \sigma(W_{out} \cdot X + b_{out}),\ X \in \{S, T\}$, where D(X) is the output of the discriminator, i.e., the input of the generator G, σ is the sigmoid function, and $W_{out}$ and $b_{out}$ represent the weights and biases of the output layer, respectively. The discriminator aims to maximize the difference between the source domain features and the target domain features, and its optimization objective function is given by the formula: $\max L = E[\log D(S)] + E[\log(1 - D(T))]$, where E represents the expectation and log represents the logarithmic function.
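The two objectives of claim 2 can be sketched in NumPy as a forward pass only, with no training loop. The random weights, the feature dimension of 64, the batch size of 8, and all function names are assumptions made for illustration, and the generator distance is taken here to be the Euclidean norm.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_discriminator(in_dim):
    """Random weights for FC(512) -> FC(256) -> one sigmoid neuron (illustrative)."""
    dims = [in_dim, 512, 256, 1]
    return [(rng.normal(0.0, 0.05, (a, b)), np.zeros(b))
            for a, b in zip(dims[:-1], dims[1:])]

def discriminator(params, x):
    """Two ReLU layers followed by D(X) = sigmoid(W_out . h + b_out)."""
    h = x
    for w, b in params[:-1]:
        h = np.maximum(h @ w + b, 0.0)
    w_out, b_out = params[-1]
    return 1.0 / (1.0 + np.exp(-(h @ w_out + b_out)))

def generator_distance(s, t):
    """L = ||S - T||_2, the quantity the generator minimizes."""
    return float(np.linalg.norm(s - t))

def discriminator_objective(params, s, t, eps=1e-12):
    """E[log D(S)] + E[log(1 - D(T))], the quantity the discriminator maximizes."""
    d_s, d_t = discriminator(params, s), discriminator(params, t)
    return float(np.mean(np.log(d_s + eps)) + np.mean(np.log(1.0 - d_t + eps)))

# Hypothetical source and target feature batches: 8 samples, 64 dimensions.
S = rng.normal(size=(8, 64))
T = rng.normal(size=(8, 64))
params = init_discriminator(64)
scores = discriminator(params, S)
objective = discriminator_objective(params, S, T)
```

In adversarial training the two quantities would be optimized alternately; the video discriminator and the text discriminator would each be an instance of the same structure applied to their own modality.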
3. The method for detecting neural rehabilitation movements based on a domain generalization neural network as described in claim 1, characterized in that: The preprocessing includes denoising, supplementation, and standardization of the video data. For the video data, wavelet denoising and time-series smoothing are used. Discrete wavelet decomposition is performed on the depth image of the i-th frame, retaining the low-frequency components and applying soft thresholding to the high-frequency components to remove noise. Then, time-series smoothing is performed on the wavelet-denoised depth image sequence to eliminate inter-frame jitter and temporal discontinuities. Finally, the processed video data is standardized as shown in the following formula: $V_i = (\tilde{V}_i - \mu) / \sigma$, where $V_i$ represents the video data of the i-th frame after standardization, $\tilde{V}_i$ is the video data of the i-th frame after wavelet denoising and time-series smoothing, and μ and σ represent the mean and standard deviation of all frame data, respectively.
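The preprocessing chain of claim 3 can be sketched as follows. This is a simplified illustration under stated assumptions: a one-level Haar transform along the image width stands in for the discrete wavelet decomposition, a moving average stands in for the time-series smoothing, and the threshold, window size, and function names are hypothetical.

```python
import numpy as np

def soft_threshold(x, thr):
    """Shrink coefficients toward zero by thr (soft thresholding)."""
    return np.sign(x) * np.maximum(np.abs(x) - thr, 0.0)

def haar_denoise(frames, thr):
    """One-level Haar transform along the last axis (even width assumed):
    keep the low-frequency part, soft-threshold the high-frequency part,
    then reconstruct."""
    even, odd = frames[..., 0::2], frames[..., 1::2]
    low = (even + odd) / np.sqrt(2.0)
    high = soft_threshold((even - odd) / np.sqrt(2.0), thr)
    out = np.empty_like(frames, dtype=float)
    out[..., 0::2] = (low + high) / np.sqrt(2.0)
    out[..., 1::2] = (low - high) / np.sqrt(2.0)
    return out

def temporal_smooth(frames, window=3):
    """Moving average over the frame axis (odd window) with edge padding."""
    pad = window // 2
    padded = np.pad(frames, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    return np.mean([padded[k:k + len(frames)] for k in range(window)], axis=0)

def standardize(frames):
    """Z-score using the mean and standard deviation of all frame data."""
    return (frames - frames.mean()) / frames.std()

# Hypothetical depth sequence: 6 frames of 4 x 8 pixels.
rng = np.random.default_rng(0)
video = rng.normal(loc=2.0, scale=0.5, size=(6, 4, 8))
processed = standardize(temporal_smooth(haar_denoise(video, thr=0.1)))
```

A production pipeline would more likely use a multi-level 2D wavelet transform from a dedicated library; the sketch only shows the order of the three operations.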
4. The method for detecting neural rehabilitation movements based on a domain generalization neural network as described in claim 1, characterized in that: The frame-level skeleton keypoint extraction module first uses the OpenPose model to process each frame of the preprocessed video data V. The i-th frame is input into the OpenPose model to obtain the skeleton keypoint set $P_i = \{p_i^k \mid k = 1, \dots, K\}$, where $p_i^k$ represents the two-dimensional coordinates of the k-th skeleton keypoint in the i-th frame and K is the total number of skeleton keypoints. Inputting all video data into the OpenPose model yields the set of skeleton keypoint sequences $P_V = \{P_1, P_2, \dots, P_N\}$, where N represents the total number of frames in the video; Secondly, to ensure consistency across frames, the nearest neighbor matching algorithm is used to match consecutive frames in the set of skeleton keypoint sequences $P_V$, generating a skeleton action sequence with temporal relationships. The skeleton information extracted from each frame serves as a time step, recording the motion trajectory of each joint of the target human body and providing temporal skeleton data for subsequent topology construction and feature extraction.
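The nearest neighbor matching step of claim 4 can be illustrated with a minimal sketch. It assumes the keypoints of each frame are already available as (K, 2) coordinate arrays (the OpenPose call itself is not shown), and it uses a simple greedy rule; the function names and the two-frame example are hypothetical.

```python
import numpy as np

def match_nearest(prev_pts, curr_pts):
    """For each keypoint in prev_pts, index of the closest point in curr_pts."""
    dists = np.linalg.norm(prev_pts[:, None, :] - curr_pts[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def build_sequence(frames):
    """Reorder every frame's keypoints so that row k follows the same joint
    as row k of the first frame. Returns an array of shape (N, K, 2)."""
    ordered = [np.asarray(frames[0], dtype=float)]
    for pts in frames[1:]:
        pts = np.asarray(pts, dtype=float)
        ordered.append(pts[match_nearest(ordered[-1], pts)])
    return np.stack(ordered)

# Two hypothetical frames; the second lists the same two joints in swapped order.
frame_0 = [[0.0, 0.0], [10.0, 10.0]]
frame_1 = [[10.5, 9.5], [0.2, 0.1]]
sequence = build_sequence([frame_0, frame_1])
```

A greedy rule of this kind can assign the same detection to two joints when keypoints are close together; a one-to-one assignment method would avoid that at higher cost.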
5. The method for detecting neural rehabilitation movements based on a domain generalization neural network as described in claim 4, characterized in that: The human joint motion coordinate graph construction module transforms the skeleton action sequence into a skeleton keypoint topology graph g = (A, V, θ), where V is the set of joints, each represented as a node in the graph, θ represents the connection relationships between joints, and A is the edge weight matrix. The node set of each topology graph represents all keypoints, while the edge set represents the connections between joints, ensuring that the spatial structure between joints is preserved.
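As an illustration of the graph g = (A, V, θ), the following sketch builds the edge weight matrix A from a list of joint connections. The five-joint chain, the unit edge weights, and the function name are assumptions chosen for brevity, not the skeleton layout used by the method.

```python
import numpy as np

def build_adjacency(num_joints, edges):
    """Symmetric edge weight matrix A with weight 1.0 for each connected joint pair."""
    a = np.zeros((num_joints, num_joints))
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    return a

# Hypothetical chain of five joints, for example shoulder to fingertip.
theta = [(0, 1), (1, 2), (2, 3), (3, 4)]
A = build_adjacency(5, theta)
```

Each row of A then lists the joints directly connected to one node, which is the spatial structure the graph convolution operates on.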
6. The method for detecting neural rehabilitation movements based on a domain generalization neural network as described in claim 1, characterized in that: The spatiotemporal graph convolutional network encoding module is used to extract spatiotemporal features from the skeleton keypoint topology graph g. It includes two temporal convolutional blocks and one spatial convolutional block. The temporal convolutional block first extracts the spatial features of the skeleton keypoint topology graph g using graph convolution operations. In the corresponding formula, temporal convolution is applied with the temporal convolution kernel μ and Γ denotes the concatenation operation; $W_k$ is a learnable matrix, σ is the sigmoid activation function, $X_j$ is the video data of the j-th frame, and $h_S$ represents the spatiotemporal features of the current video frame; $A_k$ is the k-th submatrix of A, $D_k$ is the k-th submatrix of the diagonal matrix D, and I is the identity matrix. Then, the spatiotemporal feature h and the skeleton keypoint topology graph g are input into the GLU activation function layer to obtain the feature GLU(h). The GLU activation function layer controls the information flow by performing element-wise weighted summation and activation on the input spatiotemporal feature h and skeleton keypoint topology graph g, enabling the network to capture more complex feature representations. Finally, GLU(h) is input into a spatial convolution to obtain the output feature. The spatial convolution helps extract the spatial dependencies between the nodes of GLU(h), providing richer spatial features for subsequent spatiotemporal modeling.
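Because the formula of claim 6 does not survive in this text, the following is only a sketch of a commonly used form of graph convolution, assumed here for illustration: symmetric normalization of A + I by the degree matrix D, followed by a sigmoid, and a standard gated linear unit that splits the channels in half. The single adjacency partition, the shapes, and the function names are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def graph_conv(x, a, w):
    """sigma(D^-1/2 (A + I) D^-1/2 X W) for node features x of shape (K, C)."""
    a_hat = a + np.eye(a.shape[0])            # add self-connections
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return sigmoid(a_norm @ x @ w)

def glu(h):
    """Gated linear unit: first half of the channels gated by the second half."""
    value, gate = np.split(h, 2, axis=-1)
    return value * sigmoid(gate)

# Hypothetical graph of three joints in a chain, two input channels per joint.
rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
X = rng.normal(size=(3, 2))
W = rng.normal(size=(2, 4))
h = graph_conv(X, A, W)
gated = glu(h)
```

The gating halves the channel count, so a layer that should keep its width needs twice as many output channels in the preceding convolution.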
7. The method for detecting neural rehabilitation movements based on a domain generalization neural network as described in claim 1, characterized in that: The multi-scale temporal convolution module includes a multi-scale temporal convolutional layer and a feature concatenation layer. First, the temporal features are input into the multi-scale temporal convolutional layer, and convolution operations are performed on the temporal features at three time scales of 5 frames, 10 frames, and 20 frames, respectively, to obtain the features $h_{hop5}$, $h_{hop10}$, and $h_{hop20}$ at the 5, 10, and 20 time scales. This process extracts local and global dependency information related to the current time step, effectively capturing motion change patterns in both the short and long term. Then, the outputs $h_{hop5}$, $h_{hop10}$, and $h_{hop20}$ of the multi-scale temporal convolutional layer are input into the feature concatenation layer to obtain the concatenated feature representation shown in the formula: $h_{TCN} = [h_{hop5}, h_{hop10}, h_{hop20}]$. By capturing multi-level temporal dependencies at different time scales and integrating this multi-scale information, a richer and more comprehensive feature representation can be obtained.
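The multi-scale convolution and concatenation of claim 7 can be sketched as below. Uniform averaging kernels stand in for the learnable temporal kernels, and the sequence length of 40 frames, the three feature channels, and the function names are assumptions for illustration.

```python
import numpy as np

def temporal_conv(x, kernel):
    """Convolve each channel of x (shape (T, C)) along time, keeping length T."""
    return np.stack(
        [np.convolve(x[:, c], kernel, mode="same") for c in range(x.shape[1])],
        axis=1,
    )

def multi_scale_tcn(x, scales=(5, 10, 20)):
    """h_TCN = [h_hop5, h_hop10, h_hop20], concatenated along the feature axis."""
    outputs = [temporal_conv(x, np.full(s, 1.0 / s)) for s in scales]
    return np.concatenate(outputs, axis=1)

# Hypothetical temporal features: 40 frames, 3 channels.
rng = np.random.default_rng(0)
features = rng.normal(size=(40, 3))
h_tcn = multi_scale_tcn(features)
```

With three scales the output has three times as many channels as the input, one group per time scale, while the number of frames is unchanged.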
8. The method for detecting neural rehabilitation movements based on a domain generalization neural network as described in claim 1, characterized in that: The action description text encoding module uses the open-source pre-trained CLIP model from OpenAI to extract and align action description text features; the action description text data T and the feature $h_{TCN}$ output by the multi-scale temporal convolution module are simultaneously input into the CLIP model to obtain a high-dimensional feature vector l of the action description text, and the dimension of l is the same as that of $h_{TCN}$; l contains the semantic feature information of the specific action vocabulary and the corresponding joint locations when performing neurorehabilitation actions, including the action elements, time information, and spatial relationships in the action text description.
9. The method for detecting neural rehabilitation movements based on a domain generalization neural network as described in claim 1, characterized in that: The multimodal neurorehabilitation movement detection network includes a convolutional long short-term memory network (ConvLSTM) pooling module, a multimodal fusion module, and a movement detection output module; First, the ConvLSTM pooling module models the temporal dependencies of the temporal video features, extracting long-term and short-term temporal features from the action data, enhancing the model's understanding of temporal dynamics, and achieving the dimensionality reduction effect of a pooling layer. Next, the multimodal fusion module combines the spatiotemporal features of the neurorehabilitation exercise video with the semantic information of the text description through cross-attention, mining the shared intermodal features that benefit accurate prediction of rehabilitation movements. Finally, the movement detection output module evaluates the movement quality based on the fused features and outputs the result score.
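The cross-attention fusion named in claim 9 can be illustrated with a minimal single-head sketch in which the video features act as queries and the text features as keys and values. This assignment of roles, the random projection matrices, the dimensions, and the function names are assumptions; the claim does not specify them.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(video, text, w_q, w_k, w_v):
    """Scaled dot-product attention: each video time step attends over text tokens."""
    q, k, v = video @ w_q, text @ w_k, text @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v, weights

# Hypothetical inputs: 6 video time steps and 4 text tokens, 8 dimensions each.
rng = np.random.default_rng(0)
video_feat = rng.normal(size=(6, 8))
text_feat = rng.normal(size=(4, 8))
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
fused, attn = cross_attention(video_feat, text_feat, w_q, w_k, w_v)
```

Each row of the attention matrix shows how strongly one video time step draws on each text token, which is one way the shared intermodal features could be inspected.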
10. The method for detecting neural rehabilitation movements based on a domain generalization neural network as described in claim 9, characterized in that: The input to the ConvLSTM pooling module is the concatenated feature $h_{TCN} = [h_{hop5}, h_{hop10}, h_{hop20}]$ output by the multi-scale temporal convolution module, which includes the patient's movement trajectory characteristics over different time spans; First, the feature vectors $h_{hop5}$, $h_{hop10}$, and $h_{hop20}$ at different strides are concatenated to obtain the feature Z; then, time-series modeling is performed on Z. By using a gating mechanism to dynamically process information in the time dimension, complete spatiotemporal features can be extracted from inputs of different lengths. The ConvLSTM is configured as a single layer with 256 hidden units. For the input sequence Z, the calculation formula is as follows: $M = \mathrm{ConvLSTM}(Z)$, where M represents the features output by the ConvLSTM unit; the feature sequence M output by the ConvLSTM is used to transform the adjacency matrix into a learnable attention feature map g' using the following formula: $g' = \sigma(\varphi(A \odot M) Z W)$, where ⊙ represents the Hadamard product, φ represents the normalization function, W is the weight vector obtained by 1D convolution, and A is the edge weight matrix of the skeleton keypoint topology graph g. Finally, the attention feature map g' is concatenated through temporal convolution to output the video features $h_f$.