Daily activity assistance system for the elderly based on multi-modal large model and reinforcement learning
By using a multimodal large model and reinforcement learning to assist the elderly in their daily activities, the system addresses the inability of existing systems to adapt and adjust. It enables accurate perception of the elderly’s real-time physiological state and environment, together with personalized assistance, improving the system’s practicality and safety.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHUNENG TECHNOLOGY (JIAXING) CO LTD
- Filing Date
- 2026-01-14
- Publication Date
- 2026-06-12
AI Technical Summary
Existing daily activity assistance systems for the elderly cannot adaptively adjust to changes in the elderly’s real-time physiological state, dynamic environmental factors, and momentary activity intentions, which reduces the system’s practicality and may lead to safety accidents.
The daily activity assistance system for the elderly, which adopts a multimodal large model and reinforcement learning, realizes real-time perception and dynamic decision-making of the user's physiological state, behavior posture and environmental context through a multimodal perception fusion module, an intention and risk dynamic analysis module, a reinforcement learning decision engine module and an adaptive interaction execution module, and generates personalized assistance strategies.
It enables accurate analysis of users' real needs and potential environmental threats, improving the timeliness, safety, and personalization of the assistance system, and enhancing the system's long-term usability and user satisfaction.
Figure CN122201741A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of artificial intelligence and intelligent assistance systems, specifically relating to an assistive system for daily activities of the elderly based on multimodal large models and reinforcement learning.
Background Technology
[0002] Artificial intelligence (AI) technology is playing an increasingly important role in improving the quality of human life, especially in assisting specific groups in their daily activities. Multimodal large models, as an important branch of AI, provide the technological foundation for building intelligent assistance systems by integrating multiple information modalities such as vision, language, and sensors.
[0003] Among these, daily activity assistance systems for the elderly aim to monitor and assist the daily behaviors of seniors through intelligent technology, thereby improving their autonomy and safety in daily life. These systems are typically based on preset rules or static models to provide guidance and support for the activities of seniors.
[0004] Existing technologies mostly employ static activity planning methods, failing to adaptively adjust to real-time changes in the elderly's physiological state, dynamic environmental factors, and the user's instantaneous activity intentions. For example, the system struggles to recognize the balance needs of a user transitioning from a sitting to a standing posture, or fails to consider environmental risks such as slippery surfaces or insufficient lighting, resulting in activity suggestions that are severely out of touch with the actual situation. This mismatch between static planning and dynamic needs not only reduces the system's practicality but may also lead to safety accidents such as falls due to inappropriate suggestions. Therefore, there is an urgent need for a daily activity assistance solution for the elderly that can perceive in real time, integrate multiple modalities, and make dynamic decisions.
Summary of the Invention
[0005] The technical problem to be solved by this invention is to overcome the shortcomings of existing daily activity assistance systems for the elderly, which use static activity planning methods and cannot adaptively adjust according to real-time physiological changes, dynamic environmental factors and instantaneous activity intentions of the elderly, and to provide a daily activity assistance system for the elderly based on multimodal large model and reinforcement learning.
[0006] The technical solution of this invention is: a daily activity assistance system for the elderly based on a multimodal large model and reinforcement learning. This system includes a multimodal perception fusion module, an intention and risk dynamic analysis module, a reinforcement learning decision engine module, and an adaptive interaction execution module. The multimodal perception fusion module continuously collects and fuses multimodal data from visual sensors, inertial measurement units, environmental sensors, and wearable physiological monitoring devices to generate a comprehensive state representation with temporal correlation. The intention and risk dynamic analysis module, connected to the multimodal perception fusion module, receives the comprehensive state representation and performs deep semantic understanding and contextual reasoning through a built-in multimodal large model to identify the user's current activity intention in real time and simultaneously assess the potential risk levels at the environmental and physiological levels. The reinforcement learning decision engine module, connected to the intention and risk dynamic analysis module, receives the identified activity intention and the assessed risk level, and generates an optimal assistance strategy matching the current user state, intention, and environment based on a preset reward function and an online policy optimization algorithm. The adaptive interaction execution module, connected to the reinforcement learning decision engine module, receives the optimal assistance strategy and converts it into specific assistance commands, executing corresponding assistance actions through a speech synthesis unit, haptic feedback device, or environmental control interface.
[0007] Furthermore, the specific working process of the multimodal perception fusion module is as follows: the visual sensor acquires RGB images and depth information of the user's posture and surrounding scene; the inertial measurement unit acquires 3-axis acceleration and 3-axis angular velocity data of the user's body; the environmental sensor acquires ambient light intensity, ground humidity, and ambient noise decibel values; and the wearable physiological monitoring device acquires the user's heart rate, blood oxygen saturation, and surface electromyography signals. The multimodal perception fusion module incorporates a feature extraction and alignment network. This network first preprocesses and extracts features from the raw data of each modality, then performs alignment operations in the temporal dimension, and finally fuses features through a cross-modal attention mechanism, outputting a comprehensive state representation vector containing spatial, temporal, and multimodal correlation information.
[0008] Furthermore, the multimodal large model built into the intent and risk dynamic analysis module is a Transformer architecture model pre-trained with massive amounts of daily activity data and safety knowledge, and fine-tuned for specific scenarios involving the elderly. The analysis process of this model specifically includes: taking the comprehensive state representation vector as model input; the model extracting deep semantic features of the input through its encoder layer; the model capturing long-range dependencies between user posture sequences, environmental context, and user historical behavior patterns through its multi-head self-attention mechanism; and the model executing two tasks in parallel through its decoder layer. The first task is intention recognition, outputting a probability distribution representing the user's current intention category; the second task is risk assessment, outputting a quantified comprehensive risk score that integrates fall risk, excessive fatigue risk, and environmental discomfort risk.
[0009] Furthermore, the reinforcement learning decision engine module employs a proximal policy optimization algorithm as its core online policy optimization algorithm. This module combines the intent category probability distribution and comprehensive risk score output by the intent and risk dynamic analysis module, along with a portion of the comprehensive state representation vector output by the multimodal perception fusion module, to constitute the environment state for reinforcement learning. The module's preset reward function includes multiple reward items, specifically: positive rewards for successfully assisting in completing the intended activity, positive rewards for detecting and successfully avoiding potential risks, negative rewards when the provided assistance strategy causes physiological discomfort or imbalance in the user, and penalties for the energy consumption of executing the assistance strategy. Through continuous interaction between the agent and the aforementioned environment state, the reinforcement learning decision engine module continuously optimizes its policy network parameters, thereby outputting the optimal assistance strategy that maximizes long-term cumulative rewards.
[0010] Furthermore, the adaptive interaction execution module selects and executes at least one of the following assistance actions based on the specific content of the optimal assistance strategy: when the strategy instruction requires voice reminders or guidance, it drives the speech synthesis unit to generate natural language prompts that conform to the hearing characteristics of the elderly; when the strategy instruction requires providing balance assistance or warnings, it drives the haptic feedback device integrated into the user's clothing or assistive devices to generate vibrations in a specific pattern; when the strategy instruction requires adjusting the physical environment to reduce risks, it sends control commands to the intelligent lighting system or dehumidification equipment through the environmental control interface to adjust the ambient light intensity or ground dryness.
[0011] Furthermore, the system also includes a long-term personalized adaptation module, which connects to the reinforcement learning decision engine module and the multimodal perception fusion module. This module continuously records long-term response data of users to various assistance strategies, including changes in physiological indicators after strategy execution and feedback ratings provided by the user through a simple interactive interface. This module periodically uses this recorded data to fine-tune the policy network in the reinforcement learning decision engine module and update the prior knowledge of the multimodal large model in the intent and risk dynamic analysis module regarding the user's behavioral preferences, thereby achieving a progressive alignment between the system's decision model and the user's individual characteristics.
[0012] Furthermore, the multimodal perception fusion module, the intent and risk dynamic analysis module, the reinforcement learning decision engine module, and the adaptive interaction execution module all operate on an architecture that coordinates an edge computing gateway and a cloud server. Specifically, the multimodal perception fusion and preliminary risk assessment tasks, which have extremely high real-time requirements, are deployed on the edge computing gateway, while the computationally intensive deep intent analysis and reinforcement learning strategy optimization tasks are deployed on the cloud server. The two exchange data via an encrypted communication protocol.
[0013] Furthermore, in the risk assessment task, the multimodal large model in the intent and risk dynamic analysis module relies specifically on a sub-network based on dynamic analysis of human keypoint sequences for assessing fall risk. This sub-network extracts the coordinates of keypoints in the human skeleton from consecutive frames, calculates the statistical features of its centroid trajectory, support polygon changes, and joint angular velocities, and inputs these features into a dedicated risk classifier to quantify instantaneous posture stability and fall probability.
[0014] Compared with the prior art, the beneficial effects of the present invention are as follows: 1. This invention achieves high-precision, integrated perception of user physiological state, behavioral posture, and environmental context through a multimodal perception fusion module, overcoming the limitations of single-modal information and providing a comprehensive and reliable data foundation for subsequent intent recognition and risk analysis.
[0015] 2. This invention achieves synchronous and accurate analysis of users' instantaneous activity intentions and complex environmental risks through the multimodal large model in the intention and risk dynamic analysis module. This enables the system to deeply understand the user's real needs and the potential threats in the situation, thus solving the problem of the disconnect between static planning and dynamic needs.
[0016] 3. This invention constructs a decision core that can learn and self-optimize from continuous interaction with the environment through a reinforcement learning decision engine module. The assistance strategies output by this module are no longer fixed, but can be adaptively adjusted according to the real-time status, thereby improving the timeliness, safety, and personalization of the assistance.
[0017] 4. Through the long-term personalized adaptation module, this invention enables the system to learn and adapt to the individual behavioral habits and physiological characteristics of users over a long period of time, so that the system's assistance behavior becomes increasingly aligned with the specific needs of users over time, thereby improving the long-term usability of the system and user satisfaction.
[0018] 5. This invention, through an edge-cloud collaborative computing architecture, rationally allocates the computing load, ensuring the system's rapid response capability to high-risk events while supporting the deep computing needs of complex models, thus achieving a balance between performance and efficiency.
Attached Figure Description
[0019] Figure 1 is a schematic diagram of the overall technical architecture of the daily activity assistance system for the elderly based on a multimodal large model and reinforcement learning proposed in this invention;
Figure 2 is a schematic diagram of the core principle framework of the intent and risk dynamic analysis module in this invention;
Figure 3 is a logical flow diagram of the reinforcement learning decision engine module in this invention;
Figure 4 is a schematic diagram of the multi-level interaction relationship and data flow between the multimodal perception fusion module and the adaptive interaction execution module in this invention;
Figure 5 is a schematic diagram illustrating the deployment framework of the collaborative architecture between the edge computing gateway and the cloud server in this invention.
Detailed Implementation
This embodiment details the specific technical implementation scheme of a daily activity assistance system for the elderly based on multimodal large models and reinforcement learning. Referring to Figure 1, the system's overall architecture consists of a multimodal perception fusion module, an intent and risk dynamic analysis module, a reinforcement learning decision engine module, and an adaptive interaction execution module. These modules are connected and communicate via a high-speed data bus to ensure the real-time performance and integrity of the data stream. The system relies on a computing architecture that combines an edge computing gateway with a cloud server; the specific deployment and division of labor within this architecture will be explained in detail later.
[0020] The multimodal perception fusion module, serving as the system's data input, is responsible for continuously collecting and fusing raw data from various heterogeneous sensors. This module integrates four main types of sensor arrays. The first type is a visual sensor array, typically composed of high-resolution color cameras and depth-sensing cameras deployed in key indoor areas such as the living room, bedroom, and bathroom. These visual sensors acquire RGB image information of the user's posture and depth point cloud data of the surrounding scene at a rate of 30 frames per second. The depth information is used to accurately calculate the relative distance between the user and obstacles in the environment. The second type is an inertial measurement unit (IMU), integrated into the user's smart belt or insole, containing a 3-axis microelectromechanical system (MEMS) accelerometer and a 3-axis gyroscope. The accelerometer measures the linear acceleration of the user's body in the forward / backward, left / right, and up / down directions, typically with a range of ±16g and a sampling frequency of 100 Hz. The gyroscope measures the angular velocity of the user's body rotating around the three coordinate axes, typically with a range of ±2000 degrees per second. The third category is environmental sensor kits, which are deployed in the user's main activity areas. These kits include a digital light sensor to measure ambient light intensity (range 0 to 100,000 lux), a capacitive humidity sensor to detect liquid residue or humidity levels on the ground or key contact surfaces (range 0 to 100% relative humidity), and a digital microphone array to collect ambient noise and output noise values in decibels via a sound pressure level calculation module. The fourth category is wearable physiological monitoring devices, typically in the form of smart bracelets or chest patches. 
These devices integrate a photoplethysmography (PPG) sensor for continuous monitoring of the user's heart rate and blood oxygen saturation. The heart rate monitoring range is generally 30 to 220 beats per minute, and the blood oxygen saturation monitoring range is generally 70% to 100%. They also integrate a surface electromyography (EMG) sensor, which collects EMG signals through electrodes attached to the user's thigh or lower back muscles, with a signal bandwidth typically ranging from 10 Hz to 500 Hz.
[0021] The multimodal perception fusion module internally runs a feature extraction and alignment network, which consists of multiple parallel sub-networks responsible for preprocessing and feature encoding the four types of raw data. For visual data, background subtraction and user target segmentation are performed first to extract the user's human contour. Subsequently, a pre-trained 2D convolutional neural network, such as the ResNet50 architecture, is used to extract spatial features from the RGB image, outputting a 2048-dimensional feature vector. Simultaneously, the depth point cloud data, after voxelization, is input into a 3D convolutional neural network to extract the geometric structural features of the scene. For inertial measurement unit data, the raw acceleration and angular velocity signals first pass through a fourth-order Butterworth bandpass filter to remove high-frequency noise and DC drift, with the passband frequency set to 0.1 Hz to 20 Hz. The filtered signal is segmented through a sliding window, typically 2 seconds long, with a 50% overlap rate.
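The windowing step for the inertial data can be illustrated with a minimal sketch. It assumes the 100 Hz sampling rate, 2-second window, and 50% overlap stated in the text; the function name `sliding_windows` and the choice to drop a trailing partial window are illustrative assumptions, not details taken from the patent.

```python
from typing import List, Sequence


def sliding_windows(samples: Sequence[float], sample_rate_hz: int = 100,
                    window_s: float = 2.0, overlap: float = 0.5) -> List[List[float]]:
    """Split a 1-D signal into fixed-length windows with fractional overlap."""
    size = int(round(window_s * sample_rate_hz))        # 200 samples at 100 Hz
    step = max(1, int(round(size * (1.0 - overlap))))   # 100 samples at 50% overlap
    # Assumption: only full windows are kept; a trailing partial window is dropped.
    return [list(samples[i:i + size])
            for i in range(0, len(samples) - size + 1, step)]


# 5 seconds of dummy single-axis data at 100 Hz.
signal = [float(i) for i in range(500)]
windows = sliding_windows(signal)
print(len(windows), len(windows[0]))
```

Each window would then be passed to the frequency-domain and time-domain feature calculations for that axis.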
[0022] The signal within each window is converted into frequency domain features using a Fast Fourier Transform, and 12 statistical features, including mean, variance, and peak value, are calculated in the time domain. Environmental sensor data, namely light intensity, humidity, and noise levels in decibels, are directly standardized and scaled to the range of zero to one. Heart rate and blood oxygen saturation data collected by wearable physiological monitoring devices are directly used as features, while the raw electromyography (EMG) signals are full-wave rectified and smoothed, and their root mean square (RMS) values are calculated as a representation of muscle activation levels.
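As a rough illustration of the scaling and EMG steps, the sketch below rectifies a window of EMG samples, smooths it with a moving average, and takes the RMS, and it scales an environmental reading into the range zero to one using the sensor ranges given earlier (0 to 100,000 lux, 0 to 100% humidity). The moving-average smoother, its width, and all function names are assumptions made for illustration; the patent does not specify the smoothing method.

```python
import math
from typing import List, Sequence


def moving_average(xs: Sequence[float], width: int) -> List[float]:
    """Causal moving average; early samples use a shorter history."""
    out = []
    for i in range(len(xs)):
        chunk = xs[max(0, i - width + 1):i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def emg_activation(window: Sequence[float], smooth_width: int = 5) -> float:
    """Full-wave rectify, smooth, then take the RMS as an activation level."""
    rectified = [abs(x) for x in window]             # full-wave rectification
    envelope = moving_average(rectified, smooth_width)
    return math.sqrt(sum(x * x for x in envelope) / len(envelope))


def min_max_scale(value: float, lo: float, hi: float) -> float:
    """Scale a reading into [0, 1], clipping values outside the sensor range."""
    clipped = min(max(value, lo), hi)
    return (clipped - lo) / (hi - lo)


light = min_max_scale(50_000.0, 0.0, 100_000.0)      # lux
humidity = min_max_scale(85.0, 0.0, 100.0)           # % relative humidity
activation = emg_activation([1.0, -1.0, 1.0, -1.0], smooth_width=2)
print(light, humidity, activation)
```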
[0023] After passing through their respective feature extraction subnetworks, the features of all the aforementioned modalities are fed into a time alignment layer. This layer employs a dynamic time warping algorithm to align the feature sequences of all modalities to a common high-precision time base, provided by the system's global clock module with a synchronization accuracy better than 1 millisecond. The aligned multimodal feature sequences are then input into a cross-modal attention mechanism module. This mechanism first calculates the query vector, key vector, and value vector for each modality's features. Then, it calculates the attention weight between the query vector of the visual modality and the key vector of the inertial measurement unit (IMU) modality. This weight determines which parts of the IMU feature sequence should be focused on when fusing visual features. Similarly, cross-attention is calculated between all modalities. Finally, the value vectors of all modalities are weighted and summed according to the calculated cross-attention weights to generate a unified, comprehensive state representation vector containing spatial, temporal, and multimodal correlation information. This vector is a fixed-length dense vector, typically 1024-dimensional. As shown in Figure 4, this comprehensive state representation vector is the core output of the multimodal perception fusion module, and it is transmitted in real time to the intent and risk dynamic analysis module through a high-speed data interface.
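The cross-modal attention step between one visual query and an IMU feature sequence can be sketched as follows. This is a simplified illustration only: it omits the learned query, key, and value projections, uses tiny hand-written vectors, and the names `softmax` and `cross_attention` are not from the patent.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Attend from one modality's query over another modality's keys and values."""
    d_k = len(query)
    # Scaled dot-product score between the query and every key.
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    weights = softmax(scores)
    # Weighted sum of the value vectors.
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


visual_query = [1.0, 0.0]                 # hypothetical visual feature
imu_keys = [[1.0, 0.0], [1.0, 0.0]]       # hypothetical IMU keys (two time steps)
imu_values = [[2.0, 0.0], [0.0, 2.0]]     # hypothetical IMU values
fused = cross_attention(visual_query, imu_keys, imu_values)
print(fused)
```

In a full implementation the same computation would be repeated for every pair of modalities before the weighted value vectors are combined.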
[0024] The core of the intent and risk dynamic analysis module is a specially designed and trained multimodal large model. As shown in Figure 2, the model employs the Transformer architecture as its foundation and is pre-trained on massive amounts of daily activity video data, sensor time-series data, and a safety knowledge graph. After pre-training, the model is fine-tuned using a specially collected dataset of daily activities of the elderly, which contains thousands of hours of multimodal data labeled with activity intention categories and risk levels. The model's input is the comprehensive state representation vector from the multimodal perception fusion module. This vector is first projected into the model's high-dimensional latent space, typically 768-dimensional, through an input embedding layer. Subsequently, the embedded sequence is fed into a stack consisting of 12 encoder layers.
[0025] Each encoder layer contains a multi-head self-attention mechanism and a feedforward neural network. Specifically, the multi-head self-attention mechanism's computation process can be described as follows: for the representation of each position in the input sequence, a query matrix, a key matrix, and a value matrix are generated through a linear transformation. The attention score is calculated by taking the dot product of the query matrix and the key matrix, scaling it by dividing by the square root of the key vector dimension, normalizing it using the Softmax function, and then performing a weighted summation on the value matrix. This process enables the model to capture long-range dependencies between user pose sequences, environmental context, and user historical behavior patterns; for example, recognizing the key pattern of a user's gait suddenly becoming unsteady while walking towards the restroom.
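The computation described above corresponds to the standard scaled dot-product attention formula. In the notation below, $X$ is the input sequence representation, $W^{Q}$, $W^{K}$, $W^{V}$ are the learned linear projections, and $d_k$ is the key vector dimension; these symbols are introduced here for illustration and do not appear in the original text.

```latex
\[
\begin{aligned}
Q &= XW^{Q}, \qquad K = XW^{K}, \qquad V = XW^{V},\\
\mathrm{Attention}(Q, K, V) &= \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V .
\end{aligned}
\]
```

In the multi-head variant, this computation runs in parallel with separate projections per head, and the head outputs are concatenated and projected again.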
[0026] After deep semantic extraction by the encoder, the model enters a dual-task decoder layer. This decoder layer executes two independent output heads in parallel. The first output head is the intent recognition head, which is a fully connected layer followed by a softmax activation function, outputting a probability distribution vector. Each dimension of this vector corresponds to a predefined intent category, such as walking, sitting, standing, picking up objects, climbing stairs, drinking water, taking medication, and 15 other common daily activities of the elderly. The category with the highest probability value is determined as the user's current activity intent. The second output head is the risk assessment head, which is also a fully connected layer, but outputs a continuous scalar value, namely a comprehensive risk score. The calculation of this score integrates multiple dimensions of risk. Among them, the assessment of fall risk relies on a dedicated subnetwork based on dynamic analysis of human keypoint sequences.
[0027] This sub-network extracts the 2D or 3D coordinates of 17 keypoints of the human skeleton from consecutive visual frames. It then calculates the rate of change of the human center of mass trajectory, the rate of change of the area of the two-foot support polygon, and the root mean square values of the hip and knee joint angular velocities between consecutive frames. These dynamic features are fed into a support vector machine classifier, which outputs a fall probability value between zero and one. The risk of excessive fatigue is assessed by analyzing the low-frequency to high-frequency power ratio of heart rate variability and the decreasing trend of the mean power frequency of electromyography signals. The risk of environmental discomfort is derived by comprehensively analyzing factors such as whether the ambient light intensity is below 100 lux, whether the ground humidity exceeds 80%, and whether the ambient noise is consistently above 70 dB. Finally, these sub-risk scores are combined into a single comprehensive risk score through a linear weighted layer, ranging from zero to one, with higher values indicating greater risk. The intention category probability distribution and the comprehensive risk score together constitute the output of the intent and risk dynamic analysis module.
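The combination of sub-risk scores might look like the following sketch. The thresholds (100 lux, 80% humidity, 70 dB) come from the text; the weights, the fraction-of-flags scoring of environmental discomfort, and all names are hypothetical. The sketch also checks a single noise reading, whereas the text requires the noise to be consistently above the threshold.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Environment:
    light_lux: float
    floor_humidity_pct: float
    noise_db: float


def environment_risk(env: Environment) -> float:
    """Fraction of environmental thresholds violated (assumed scoring rule)."""
    flags = [
        env.light_lux < 100.0,            # insufficient lighting
        env.floor_humidity_pct > 80.0,    # wet or slippery floor
        env.noise_db > 70.0,              # loud environment (single reading here)
    ]
    return sum(flags) / len(flags)


def composite_risk(fall: float, fatigue: float, environment: float,
                   weights: Tuple[float, float, float] = (0.6, 0.2, 0.2)) -> float:
    """Linear weighted combination of sub-risks, clipped to [0, 1].

    The weights are hypothetical; in the described system they are learned.
    """
    score = weights[0] * fall + weights[1] * fatigue + weights[2] * environment
    return min(1.0, max(0.0, score))


env = Environment(light_lux=50.0, floor_humidity_pct=85.0, noise_db=60.0)
env_score = environment_risk(env)
total = composite_risk(fall=0.5, fatigue=0.5, environment=0.5)
print(env_score, total)
```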
[0028] The reinforcement learning decision engine module is the core of the system's intelligent decision-making. As shown in Figure 3, this module constructs a standard reinforcement learning framework. The agent's state space consists of three parts: the intent category probability distribution vector output by the intent and risk dynamic analysis module, the comprehensive risk score, and the first 256 dimensions of features extracted from the comprehensive state representation vector output by the multimodal perception fusion module. The action space defines all the basic assistance actions the system can execute, including but not limited to generating voice prompts with specific content, triggering haptic vibrations of specific patterns, sending commands to adjust lighting, and sending commands to start dehumidification, totaling 20 discrete actions. This module uses a proximal policy optimization algorithm as its online policy optimization algorithm. Its core is a policy network, which is a multilayer perceptron. The input is the current state, and the output is a probability distribution over the actions in the action space.
[0029] The goal of the proximal policy optimization algorithm is to maximize the expected cumulative reward by iteratively updating the policy network parameters. Its objective function is designed to control the step size of policy updates, avoiding drastic fluctuations during training. This objective function involves a clipping operation on the probability ratio between the new and old policies. In each iteration, the algorithm collects a certain amount of trajectory data from the agent's interaction with the environment, then uses this data to calculate the advantage function estimate, and updates the policy network using gradient ascent according to the aforementioned objective function.
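The operation applied to the probability ratio is commonly written as the clipped surrogate objective of proximal policy optimization. The sketch below evaluates that objective for a batch of samples; the clipping range `epsilon = 0.2` is a commonly used default and an assumption here, since the text does not state a value.

```python
from typing import Sequence


def clipped_surrogate(ratio: float, advantage: float, epsilon: float = 0.2) -> float:
    """Per-sample PPO term: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = min(max(ratio, 1.0 - epsilon), 1.0 + epsilon)
    return min(ratio * advantage, clipped_ratio * advantage)


def ppo_objective(ratios: Sequence[float], advantages: Sequence[float],
                  epsilon: float = 0.2) -> float:
    """Batch average of the clipped surrogate (to be maximized)."""
    terms = [clipped_surrogate(r, a, epsilon) for r, a in zip(ratios, advantages)]
    return sum(terms) / len(terms)


# ratio = pi_new(a|s) / pi_old(a|s); advantages come from an advantage estimator.
objective = ppo_objective(ratios=[1.5, 0.5, 1.0], advantages=[1.0, -1.0, 2.0])
print(objective)
```

With a positive advantage the gain from raising an action's probability is capped at `1 + epsilon`, and with a negative advantage the gain from lowering it is capped at `1 - epsilon`, which bounds the size of each policy update.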
[0030] The preset reward function in this module is crucial for driving the agent to learn correct behavior. The reward function is designed as a weighted sum of multiple reward items. Key reward items include: a positive reward (e.g., +10 points) when the system detects that the user has successfully completed the identified intended activity without showing discomfort; a higher positive reward (e.g., +15 points) when the system successfully predicts a potential risk and avoids it through an assistance action; a negative reward (e.g., -20 points) when data subsequently confirm that the system's assistance strategy caused an abnormally high heart rate or postural imbalance, or when the user actively cancels the assistance; and, to encourage energy conservation, a small penalty (e.g., -0.1 points) for each assistance action, balancing the assistance effect against system energy consumption. The reinforcement learning decision engine module runs continuously on a cloud server. The agent constantly observes the environment state, selects actions based on the current policy, observes new states and rewards after executing actions, and uses this experience data to update its policy network. After millions of simulation and online learning iterations, the policy network converges and can generate, for any input state, an optimal assistance strategy that maximizes long-term cumulative rewards. This optimal assistance strategy is a structured data object that specifies the type of assistance action to be performed, the action parameters, and the urgency of the execution.
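A minimal sketch of such a reward function is given below, using the example point values from the text (+10, +15, -20, -0.1). The `StepOutcome` structure and its field names are hypothetical, and any additional weighting of the individual items is omitted.

```python
from dataclasses import dataclass


@dataclass
class StepOutcome:
    intent_completed: bool    # user finished the identified activity without discomfort
    risk_avoided: bool        # a predicted risk was avoided through an assistance action
    caused_discomfort: bool   # abnormal heart rate, imbalance, or user cancelled assistance
    actions_executed: int     # number of assistance actions issued in this step


def step_reward(outcome: StepOutcome) -> float:
    """Sum of the example reward items described in the text."""
    reward = 0.0
    if outcome.intent_completed:
        reward += 10.0
    if outcome.risk_avoided:
        reward += 15.0
    if outcome.caused_discomfort:
        reward -= 20.0
    reward -= 0.1 * outcome.actions_executed   # small energy penalty per action
    return reward


good = step_reward(StepOutcome(True, True, False, 2))
bad = step_reward(StepOutcome(False, False, True, 1))
print(good, bad)
```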
[0031] The adaptive interaction execution module is responsible for translating abstract optimal assistance strategies into concrete, perceptible assistance behaviors. This module contains three main execution channels. The first channel is the speech synthesis unit. This unit receives instructions from the decision engine, which include the text content to be read aloud. The speech synthesis unit uses waveform-concatenation synthesis technology or parametric speech synthesis technology to generate clear and natural speech. Considering the hearing characteristics of the elderly, the fundamental frequency range of the synthesized speech is typically controlled between 100 Hz and 300 Hz, the speech rate is adjusted to 2 to 3 Chinese characters per second, and the gain of frequency components around 2000 Hz can be selectively increased to enhance intelligibility. The generated speech signal is played through directional speakers placed in the room or through bone conduction headphones worn by the user. The second channel is the haptic feedback devices. These devices are miniature vibration motors integrated into the user's smart belt, insole, or wristband.
[0032] When the strategy requires providing balance assistance or emergency alerts, this module drives the haptic feedback device to generate specific vibration patterns. For example, slow, intermittent vibrations might signal a warning to watch one's step, while rapid, continuous, strong vibrations signal the need to stop immediately or steady oneself. The vibration pattern, intensity, and duration can all be precisely controlled by strategy parameters. The third channel is the environmental control interface. This interface typically uses wireless communication protocols such as Zigbee or Z-Wave. When the strategy requires adjusting the physical environment to reduce risk, this interface sends control commands to smart devices deployed in the environment, for example, a command to a smart lighting system to increase the light intensity in a specific area from 50 lux to 200 lux, or a command to a dehumidifier to activate and reduce ground humidity. Referring again to Figure 4, the execution results of the adaptive interaction execution module, namely the execution log of the assistance action and the user's immediate reaction after execution, are recorded by the system as feedback data.
[0033] To enable long-term evolution and personalization, the system also includes a long-term personalized adaptation module. This module does not participate in the real-time decision-making loop but runs periodically in the background, for example, once every 24 hours. It continuously extracts data from the system log database, including the content of each previously executed assistance strategy, changes in the user's physiological indicators shortly after strategy execution, and user feedback ratings provided through a touchscreen interface with large buttons and simple icons. Feedback ratings are divided into five levels, from very dissatisfied to very satisfied. The module uses this long-term, personalized interaction data to fine-tune the policy network in the reinforcement learning decision engine module.
[0034] The fine-tuning process is similar to continued training in reinforcement learning, except that an embedding vector representing the user's long-term preferences is introduced into the state space, and a reward term directly related to the user's feedback rating is added to the reward function. At the same time, this module updates the prior knowledge about the user's behavioral preferences held by the multimodal large model in the intent and risk dynamic analysis module. For example, the model learns that the user typically requires 0.5 seconds longer than average to support themselves when transitioning from a sitting to a standing posture. Through this gradual alignment, the system's decision-making model adapts to the user's individual characteristics, making its assistance behavior increasingly tailored to their specific needs and habits.
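One possible way to express the two changes described above (a feedback-dependent reward term and a preference embedding appended to the state) is sketched below in Python. The linear mapping of the five rating levels onto the interval from -1 to 1, the weight of 0.5, and all function names are assumptions made for illustration; the description does not specify them.

```python
from typing import Optional, Sequence


def feedback_reward(rating: int) -> float:
    """Map a five-level rating (1 = very dissatisfied, 5 = very satisfied) to [-1, 1]."""
    if rating not in (1, 2, 3, 4, 5):
        raise ValueError("rating must be an integer from 1 to 5")
    return (rating - 3) / 2.0


def personalized_reward(base_reward: float, rating: Optional[int],
                        weight: float = 0.5) -> float:
    """Add the feedback term to the existing reward; skip it when no rating was given."""
    if rating is None:
        return base_reward
    return base_reward + weight * feedback_reward(rating)


def augment_state(state: Sequence[float],
                  preference_embedding: Sequence[float]) -> list:
    """Append the user's long-term preference embedding to the state vector."""
    return list(state) + list(preference_embedding)
```

With these assumptions, a neutral rating of 3 leaves the base reward unchanged, so the extra term only shifts the policy when the user expresses a clear preference.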
[0035] Finally, referring to Figure 5, which shows the edge-cloud collaborative computing architecture of the system, the entire system operates on two computing layers. The edge computing gateway is typically an embedded device with a certain amount of computing power deployed in the user's residence, connected directly to all local sensors and actuators. Tasks with extremely high real-time requirements run at the edge, including the full functionality of the multimodal perception fusion module and the sub-network in the intent and risk dynamic analysis module that assesses fall risk through dynamic analysis of human keypoint sequences. In this way, once an extremely high instantaneous fall risk is detected, the edge gateway can directly trigger an emergency haptic alert locally, with response latency kept within 100 milliseconds, without waiting for instructions to be transmitted back from the cloud.
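The local fast path described above can be pictured as a simple check that runs on the edge gateway for every fall-risk estimate. The Python sketch below is illustrative only: the 0.9 threshold, the callback names, and the returned dictionary are assumptions, while the 100 millisecond budget is taken from the description.

```python
import time
from typing import Callable

FALL_RISK_THRESHOLD = 0.9   # hypothetical cut-off for "extremely high" risk
LATENCY_BUDGET_S = 0.100    # 100 ms budget stated in the description


def handle_risk_estimate(fall_risk: float,
                         trigger_haptic_alert: Callable[[], None],
                         enqueue_for_cloud: Callable[[float], None],
                         clock: Callable[[], float] = time.monotonic) -> dict:
    """Alert locally first when risk is extreme, then forward the estimate to the cloud."""
    start = clock()
    alerted = False
    within_budget = None
    if fall_risk >= FALL_RISK_THRESHOLD:
        trigger_haptic_alert()            # no round trip to the cloud
        within_budget = (clock() - start) <= LATENCY_BUDGET_S
        alerted = True
    enqueue_for_cloud(fall_risk)          # upload happens after any local alert
    return {"alerted": alerted, "within_budget": within_budget}
```

Placing the alert before the upload keeps the time-critical action independent of network conditions, which is the property the edge layer is meant to provide.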
[0036] The cloud server handles computationally intensive tasks, including the complete multimodal large model in the intent and risk dynamic analysis module, the complete policy optimization and decision-making process of the reinforcement learning decision engine module, and the long-term personalized adaptation module. The edge gateway and cloud server communicate via an encrypted channel using a transport layer security protocol. The edge gateway uploads the processed comprehensive state representation vector and the preliminary risk assessment results to the cloud. After completing deep analysis and decision-making, the cloud server sends the generated optimal assistance strategy back to the edge gateway, where it is carried out by the adaptive interaction execution module. This architecture both ensures rapid response to high-risk events and supports the heavy computational needs of complex models, achieving a balance between performance and efficiency.
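The exchange described above could be sketched in Python as follows, using only the standard library. The JSON message layout, the field names, and the set of channel names are assumptions made for illustration; the description states only that the state representation vector and preliminary risk results are uploaded over a TLS-protected channel and that a strategy is returned. Opening the actual connection is omitted.

```python
import json
import ssl

VALID_CHANNELS = {"speech", "haptic", "environment"}  # the three execution channels


def make_tls_context() -> ssl.SSLContext:
    """Client-side TLS context with certificate and hostname verification enabled."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # assumption: refuse older versions
    return ctx


def build_upload(state_vector, preliminary_risk: float) -> bytes:
    """Serialize what the edge gateway sends to the cloud (hypothetical layout)."""
    message = {
        "state_vector": list(state_vector),
        "preliminary_risk": preliminary_risk,
    }
    return json.dumps(message).encode("utf-8")


def parse_strategy(raw: bytes) -> dict:
    """Decode a strategy returned by the cloud and check its target channel."""
    strategy = json.loads(raw.decode("utf-8"))
    if strategy.get("channel") not in VALID_CHANNELS:
        raise ValueError("strategy must name one of the three execution channels")
    return strategy
```

Validating the channel on receipt lets the edge gateway reject a malformed strategy before it reaches any actuator.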
[0037] The system described in this embodiment achieves comprehensive, adaptive, and personalized assistance for the daily activities of the elderly through the close collaboration of the aforementioned modules. After system startup, the multimodal perception fusion module continuously collects data and generates a comprehensive state representation. The intent and risk dynamic analysis module analyzes user intent and risk in real time. The reinforcement learning decision engine module generates the optimal assistance strategy based on the current state. The adaptive interaction execution module then translates the strategy into concrete actions. The long-term personalized adaptation module optimizes the system in the background, so that it understands the user increasingly well. The entire process is cyclical, forming a closed loop of intelligent assistance that continuously learns and optimizes. All data is encrypted during transmission and storage to protect user privacy and security. The system interface is designed around the usage habits of the elderly, with enlarged fonts, high color contrast, and very simple operating procedures to ensure ease of use.
[0038] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus.
[0039] Although embodiments of the invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the appended claims and their equivalents.
Claims
1. A daily activity assistance system for the elderly based on a multimodal large model and reinforcement learning, characterized in that it comprises: a multimodal perception fusion module, used to continuously collect and fuse multimodal data from visual sensors, inertial measurement units, environmental sensors, and wearable physiological monitoring devices to generate a comprehensive state representation with temporal correlation; an intent and risk dynamic analysis module, connected to the multimodal perception fusion module and used to receive the comprehensive state representation, perform deep semantic understanding and contextual reasoning through a built-in multimodal large model, identify the user's current activity intent in real time, and simultaneously assess the potential risk level at the environmental and physiological levels; a reinforcement learning decision engine module, connected to the intent and risk dynamic analysis module and used to receive the identified activity intent and the assessed risk level, and to generate, based on a preset reward function and through an online policy optimization algorithm, the optimal assistance strategy that matches the current user state, intent, and environment; and an adaptive interactive execution module, connected to the reinforcement learning decision engine module and used to receive the optimal assistance strategy, convert it into specific assistance instructions, and execute corresponding assistance actions through a speech synthesis unit, a haptic feedback device, or an environmental control interface.
2. The daily activity assistance system for the elderly based on multimodal large model and reinforcement learning according to claim 1, characterized in that, The multimodal perception fusion module includes a feature extraction and alignment network; The feature extraction and alignment network is used to preprocess and extract features from the raw data of each modality, perform alignment operations in the time dimension, and perform feature fusion through a cross-modal attention mechanism to output a comprehensive state representation vector containing spatial, temporal and multimodal correlation information.
3. The daily activity assistance system for the elderly based on multimodal large model and reinforcement learning according to claim 2, characterized in that, The multimodal data collected by the multimodal perception fusion module includes: The visual sensor collects the user's pose and the RGB image and depth information of the surrounding scene; The inertial measurement unit collects 3-axis acceleration and 3-axis angular velocity data of the user's body; Environmental sensors collect data on ambient light intensity, ground humidity, and ambient noise levels in decibels. Wearable physiological monitoring devices collect users' heart rate, blood oxygen saturation, and electromyography signals from the body surface.
4. The daily activity assistance system for the elderly based on multimodal large model and reinforcement learning according to claim 1, characterized in that, The multimodal large model built into the intent and risk dynamic analysis module is a Transformer architecture model that has been pre-trained with massive amounts of daily activity data and safety knowledge and fine-tuned for specific scenarios of the elderly. The multimodal large model extracts deep semantic features of the input through its encoder layer, captures long-range dependencies between user pose sequences, environmental context and user historical behavior patterns through a multi-head self-attention mechanism, and performs intent recognition and risk assessment tasks in parallel through the decoder layer.
5. The daily activity assistance system for the elderly based on multimodal large model and reinforcement learning according to claim 4, characterized in that, The intent recognition task outputs a probability distribution over the user's current intent categories; The risk assessment task outputs a quantitative comprehensive risk score, which integrates the risks of falls, excessive fatigue, and environmental discomfort.
6. The daily activity assistance system for the elderly based on multimodal large model and reinforcement learning according to claim 5, characterized in that, The assessment of fall risk relies on a subnetwork based on dynamic analysis of human keypoint sequences; The sub-network extracts the coordinates of key points of the human skeleton in consecutive frames, calculates the statistical features of the centroid trajectory, support polygon changes, and joint angular velocities, and inputs these features into a risk classifier to quantify instantaneous posture stability and fall probability.
7. The daily activity assistance system for the elderly based on multimodal large model and reinforcement learning according to claim 1, characterized in that, The reinforcement learning decision engine module uses the proximal policy optimization (PPO) algorithm as the core online policy optimization algorithm. The module combines the intent category probability distribution and the comprehensive risk score, along with a portion of the comprehensive state representation vector, to form the environment state for reinforcement learning. The preset reward function includes positive rewards for successfully assisting in completing the intended activity, positive rewards for detecting and successfully avoiding potential risks, negative rewards for providing assistance strategies that cause users to experience physiological discomfort or imbalance, and penalties for the energy consumed in executing the assistance strategy.
8. The daily activity assistance system for the elderly based on multimodal large model and reinforcement learning according to claim 1, characterized in that, The adaptive interactive execution module selects and executes at least one of the following assistance actions based on the specific content of the optimal assistance strategy: When the strategy indicates that voice prompts or guidance are required, the speech synthesis unit is driven to generate natural language prompts that are suitable for the hearing characteristics of the elderly; When the strategy requires providing body balance assistance or warnings, the haptic feedback device integrated into the user's clothing or assistive device is driven to generate a specific pattern of vibration; When the strategy indicates that the physical environment needs to be adjusted to reduce risk, control commands are sent to the intelligent lighting system or dehumidification equipment through the environmental control interface.
9. The daily activity assistance system for the elderly based on multimodal large model and reinforcement learning according to claim 1, characterized in that, It also includes a long-term personalized adaptation module; The long-term personalized adaptation module is connected to the reinforcement learning decision engine module and the multimodal perception fusion module, and is used to continuously record the user's long-term response data to various assistance strategies, including changes in physiological indicators and user feedback ratings after strategy execution. The module periodically uses the recorded data to fine-tune the policy network in the reinforcement learning decision engine module and updates the prior knowledge of user behavior preferences in the multimodal large model in the intent and risk dynamic analysis module.
10. The daily activity assistance system for the elderly based on multimodal large model and reinforcement learning according to claim 1, characterized in that, The system operates on an architecture that combines edge computing gateways and cloud servers. The multimodal perception fusion and preliminary risk assessment tasks, which have extremely high real-time requirements, are deployed on the edge computing gateway; The computationally intensive tasks of deep intent parsing and reinforcement learning strategy optimization are deployed on cloud servers; Edge computing gateways exchange data with cloud servers via encrypted communication protocols.