Multi-modal health monitoring and nutrition decision system and method for solitary elderly

By preprocessing and spatiotemporal graph structure analysis of multimodal health monitoring data of elderly people living alone, and combining dining table camera devices and health knowledge graphs, the problem of instability in nutritional risk identification and dietary adjustment decisions for elderly people living alone was solved, and efficient and reliable nutritional decision generation was achieved.

CN122201638A · Pending · Publication Date: 2026-06-12 · CHENGDU SIXIANG ZHONGHE TECHNOLOGY CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHENGDU SIXIANG ZHONGHE TECHNOLOGY CO LTD
Filing Date
2026-03-25
Publication Date
2026-06-12

Abstract

The application discloses a multimodal health monitoring and nutrition decision system and method for elderly people living alone, and relates to the technical field of health information processing. The method comprises the following steps: S1, collecting multimodal health monitoring data and preprocessing it; S2, constructing a spatiotemporal graph structure of the human skeleton, performing physical gating state determination, triggering posture determination, close-range interaction detection, and micro-motion behavior recognition, and generating a behavior health log; S3, confirming the dish category corresponding to an eating behavior and generating a dietary intake record; S4, constructing health knowledge graph association relationships, performing path traversal to generate a structured risk tuple set, triggering nutrition decision generation and consistency verification under structured slot constraints, and outputting the current nutritional status assessment and dietary adjustment decision information. The application thereby addresses the problems that, in existing health monitoring of elderly people living alone, multimodal data are not aligned in time and food intake is difficult to quantify, which lead to unstable nutritional risk identification and dietary adjustment decisions.

Description

Technical Field

[0001] This invention relates to the field of health information processing technology, specifically to a multimodal health monitoring and nutrition decision-making system and method for elderly people living alone.

Background Technology

[0002] With the aging population and the widespread adoption of home-based elderly care, intelligent technologies for chronic disease management, nutritional intervention, and health risk early warning are developing rapidly. In recent years, home behavior monitoring based on video and depth perception, visual recognition and nutritional quantification for dietary scenarios, and decision support integrating electronic health records and medical knowledge have gradually become active areas of research and industrial application. At the same time, the continuing maturation of integrated edge computing terminals based on SoC chips has made real-time acquisition, edge preprocessing, and intelligent inference of multi-source sensor data in home scenarios easier to deploy and to run continuously. Artificial intelligence technology is also expanding from traditional machine learning to deep learning and large model capabilities, making fusion reasoning over multi-source health information and the generation of interactive natural language suggestions possible, and driving applications such as nursing decision-making and dietary contraindication warnings towards more refined and personalized directions. Against this backdrop, various solutions combining health assessment data, image information, and intelligent decision engines have emerged in existing technologies, as exemplified below.

[0003] For example, application CN119673381B discloses an ultrasound-guided enteral nutrition decision support method and device for critically ill patients, relating to the field of intelligent nursing. It first performs a nutritional assessment on the critically ill patient to obtain nutritional assessment data including medical history and laboratory test data. Simultaneously, it acquires abdominal ultrasound images of the critically ill patient and performs image enhancement processing to obtain enhanced abdominal ultrasound images. Subsequently, the enhanced abdominal ultrasound images and nutritional assessment data are input into a large-model-based nutritional nursing decision support engine to obtain nutritional nursing decision support recommendations. In this way, by monitoring and analyzing the patient's health changes and intestinal function status in real time, it is beneficial to provide personalized, accurate, and timely nutritional nursing recommendations.

[0004] As another example, invention publication number CN119252433A discloses a personalized dietary taboo early warning system based on health profiles, including a personalized diet planning system. This personalized diet planning system is connected to a personalized nutritional needs monitoring system, a chronic disease management system, a food early warning system, and a drug response system. A health profile generation module, a physiological needs monitoring module, and a nutritional goal monitoring module are connected between the personalized diet planning system and the personalized nutritional needs monitoring system. The system provides personalized dietary recommendations according to each individual's health profile, physiological needs, and nutritional goals. By providing information and suggestions on healthy eating, it enhances individuals' awareness of the importance of healthy eating. It also uses big data and artificial intelligence technologies to provide data-driven decision support for medical professionals, helping them develop more effective dietary intervention plans.

[0005] However, while the aforementioned technologies can achieve, to some extent, the generation of health data-driven nutritional care recommendations and personalized dietary alerts, they still have several shortcomings from the perspective of continuous monitoring and actionable decision-making in the home environment of elderly people living alone. On the one hand, some solutions are more inclined towards assessment and image acquisition links within medical institutions, making it difficult to cover the continuous acquisition of the elderly's daily behavioral rhythms, mealtime processes, and actual intake, resulting in an insufficient closed loop between recommendation generation and actual home behavior. On the other hand, some recommendations based on intelligent algorithms or large models lack strong constraints and consistency verification mechanisms with the actual available conditions at home, which can easily lead to inconsistencies between the recommended content and actual dietary intake, the range of available ingredients, or medication management requirements, thereby affecting the feasibility and safety of the recommendations.

[0006] Therefore, in response to the above problems, there is an urgent need for a multimodal health monitoring and nutrition decision-making system and method for elderly people living alone.

Summary of the Invention

[0007] Technical problems to be solved

[0008] To address the shortcomings of existing technologies, this invention provides a multimodal health monitoring and nutrition decision-making system and method for elderly people living alone. It solves the problems of misaligned time sequence of multimodal data and difficulty in quantifying food intake in existing health monitoring of elderly people living alone, which leads to unstable nutritional risk identification and dietary adjustment decisions.

[0009] Technical solution

[0010] To achieve the above objectives, the present invention provides the following technical solution: a multimodal health monitoring and nutritional decision-making method for elderly people living alone, comprising: S1, collecting multimodal health monitoring data, and performing time alignment, smoothing and denoising, anomaly removal, missing data completion, and numerical standardization on the multimodal health monitoring data to generate preprocessed multimodal health monitoring data; S2, constructing a spatiotemporal graph structure of the human skeleton based on the preprocessed multimodal health monitoring data, calculating joint motion intensity assessment values and performing physical gating state determination, triggering posture determination, close-range interaction detection, and micro-motion behavior recognition when the gating conditions are met, and generating data including behavior type identifiers, time... S3, filtering the eating behavior time window based on the behavior health log, combining the video frames and depth map pixel distance values collected by the dining table camera device with the dish image recognition results to confirm the dish category corresponding to the eating behavior, and generating a dietary intake record based on 3D point cloud volume calculation and category density mapping; S4, constructing health knowledge graph association relationships based on the dietary intake record and historical medical record data, executing path traversal to generate a set of structured risk tuples, triggering nutrition decision generation and consistency verification under structured slot constraints, and outputting the current nutritional status assessment and dietary adjustment decision information for elderly people living alone.

[0011] Furthermore, the specific steps for collecting multimodal health monitoring data and performing time alignment, smoothing and noise reduction, anomaly removal, missing data completion, and numerical standardization on the data to generate preprocessed multimodal health monitoring data are as follows: Multimodal health monitoring data of elderly people living alone are collected in real time. This data includes identity identifiers, collection timestamps, camera installation area identifiers, room video frames, room depth map frames, dining table video frames, dining table depth map frames, depth map pixel distance values, three-axis acceleration, gravitational acceleration, bed height, ground height, room camera device identifiers, dining table camera device identifiers, dining table installation area identifiers, kitchen food storage device identifiers, and food names. A time alignment algorithm based on time window resampling is used to map multimodal health monitoring data from the same day onto the same time axis, constructing an intraday time axis; a sliding window mean filtering algorithm is used to smooth high-frequency noise introduced by sensor jitter and transient interference in the multimodal health monitoring data; an interquartile range anomaly detection algorithm is used to identify and remove abnormal sampling points caused by occlusion, communication jitter, and intermittent equipment acquisition, and the missing segments formed after removal are filled using an adjacent time slice interpolation algorithm; and a Z-score standardization algorithm is used to numerically standardize the multimodal health monitoring data to eliminate dimensional differences between different physical quantities.
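The preprocessing chain described above can be sketched for a single scalar channel (for example one acceleration axis). This is a minimal illustration, not the applicant's implementation: the function names, the window width, the 1.5 × IQR fence, and the use of linear interpolation for "adjacent time slice interpolation" are all assumptions.

```python
from statistics import mean, pstdev, quantiles

def resample_to_windows(samples, window_s):
    """Time alignment: average (timestamp, value) pairs into fixed windows."""
    buckets = {}
    for t, v in samples:
        buckets.setdefault(int(t // window_s), []).append(v)
    return {k: mean(vs) for k, vs in sorted(buckets.items())}

def sliding_mean(values, width=3):
    """Sliding-window mean filter; the window shrinks at the edges."""
    half = width // 2
    return [mean(values[max(0, i - half): i + half + 1])
            for i in range(len(values))]

def remove_iqr_outliers(values, k=1.5):
    """Replace points outside [Q1 - k*IQR, Q3 + k*IQR] with None."""
    q1, _, q3 = quantiles(values, n=4)
    lo, hi = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v if lo <= v <= hi else None for v in values]

def interpolate_gaps(values):
    """Fill None gaps linearly from the nearest valid neighbours (assumed method)."""
    valid = [i for i, v in enumerate(values) if v is not None]
    if not valid:
        raise ValueError("no valid samples to interpolate from")
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:
            out[i] = values[right]          # leading gap: copy next valid value
        elif right is None:
            out[i] = values[left]           # trailing gap: copy last valid value
        else:
            w = (i - left) / (right - left)
            out[i] = values[left] + w * (values[right] - values[left])
    return out

def z_score(values):
    """Z-score standardization using the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [0.0 if sigma == 0 else (v - mu) / sigma for v in values]

def preprocess(values):
    """Apply the steps in the order the text lists them."""
    return z_score(interpolate_gaps(remove_iqr_outliers(sliding_mean(values))))
```

Each modality would be resampled onto the same window index first, so that window `k` refers to the same interval of the intraday time axis in every channel.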

[0012] Furthermore, the specific steps for constructing the spatiotemporal graph structure of the human skeleton based on the preprocessed multimodal health monitoring data are as follows: Read the preprocessed multimodal health monitoring data, and perform frame-level pairing of room video frames and room depth map frames using the acquisition timestamp as the key; for each paired room video frame and room depth map frame, use a lightweight pose estimation algorithm to perform human key joint detection on the room video frame, obtaining key joint nodes, including the top of the head, neck, left and right shoulders, left and right elbows, left and right wrists, mid-hip, left and right hips, left and right knees, and left and right ankles; based on a convolutional neural network, acquire the two-dimensional pixel coordinates of the key joint nodes, read the depth values in the corresponding room depth map frame at those two-dimensional pixel coordinates, and map the two-dimensional pixel coordinates and depth values to three-dimensional spatial coordinates through the camera imaging model, thus obtaining a three-dimensional coordinate sequence of human key joints arranged by acquisition timestamp. Using the key joint nodes in a single frame as a node set, spatial connection edges are constructed within the same frame according to the human anatomical connection relationship, and temporal connection edges are constructed for the same joint node between adjacent acquisition timestamps, thus modeling the human skeleton as a spatiotemporal graph structure containing spatial edges and temporal edges.
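The 2D-to-3D mapping and the graph construction above can be illustrated as follows. The sketch assumes a standard pinhole camera model for the "camera imaging model"; the `Intrinsics` fields, the joint names, and the `BONES` edge list are hypothetical choices made for the example, since the text names the joints but not the exact edge set.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Intrinsics:
    fx: float  # focal length in pixels, x
    fy: float  # focal length in pixels, y
    cx: float  # principal point, x
    cy: float  # principal point, y

def back_project(u, v, depth, k):
    """Pinhole model: pixel (u, v) with depth Z -> camera-frame (X, Y, Z)."""
    return ((u - k.cx) * depth / k.fx, (v - k.cy) * depth / k.fy, depth)

# The 15 key joints listed in the text (identifier spelling is an assumption).
JOINTS = ["head_top", "neck", "l_shoulder", "r_shoulder", "l_elbow", "r_elbow",
          "l_wrist", "r_wrist", "mid_hip", "l_hip", "r_hip",
          "l_knee", "r_knee", "l_ankle", "r_ankle"]

# Assumed anatomical connections used as spatial edges.
BONES = [("head_top", "neck"), ("neck", "l_shoulder"), ("neck", "r_shoulder"),
         ("l_shoulder", "l_elbow"), ("l_elbow", "l_wrist"),
         ("r_shoulder", "r_elbow"), ("r_elbow", "r_wrist"),
         ("neck", "mid_hip"), ("mid_hip", "l_hip"), ("mid_hip", "r_hip"),
         ("l_hip", "l_knee"), ("l_knee", "l_ankle"),
         ("r_hip", "r_knee"), ("r_knee", "r_ankle")]

def build_st_graph(num_frames):
    """Return (spatial_edges, temporal_edges); a node is (frame_index, joint)."""
    spatial = [((t, a), (t, b)) for t in range(num_frames) for a, b in BONES]
    temporal = [((t, j), (t + 1, j))
                for t in range(num_frames - 1) for j in JOINTS]
    return spatial, temporal
```

For a window of T frames this yields 14·T spatial edges and 15·(T−1) temporal edges under the assumed bone list.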

[0013] Further, the specific steps for calculating the joint motion intensity assessment value and performing physical gating state determination are as follows: Calculate the difference in the three-dimensional coordinates of each key joint node in the human skeleton spatiotemporal graph structure between adjacent acquisition timestamps to obtain the three-dimensional velocity vector of each key joint node, denoted here as $\mathbf{v}_i(t)$ for the i-th key joint node at acquisition timestamp t; take the modulus of this velocity vector and square it to obtain the squared modulus $\lVert \mathbf{v}_i(t) \rVert^2$; add one to the squared modulus and take the reciprocal, then subtract the reciprocal from one to obtain the velocity adjustment factor $1 - \frac{1}{1 + \lVert \mathbf{v}_i(t) \rVert^2}$; multiply the joint mass weight $w_i$, the velocity adjustment factor, and the squared modulus of the current key joint node to obtain the weighted motion amount of that node; sum the weighted motion amounts of all key joint nodes to obtain the joint motion intensity assessment value at acquisition timestamp t, that is, $E(t) = \sum_i w_i \left(1 - \frac{1}{1 + \lVert \mathbf{v}_i(t) \rVert^2}\right) \lVert \mathbf{v}_i(t) \rVert^2$; and compare $E(t)$ in real time with the active threshold, denoted here as $\theta_a$, and the micro-motion threshold, denoted here as $\theta_m$: when $E(t) \ge \theta_a$, the current state is determined to be in the dynamic migration phase and the status flag is marked as dynamic; when $\theta_m < E(t) < \theta_a$, the current state is determined to be in the local micro-motion phase and the status flag is marked as micro-motion; when $E(t) \le \theta_m$, the current state is determined to be in the steady-state maintenance phase and the status flag is marked as steady state.
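A small sketch of the intensity computation and the three-way gate may make the steps above concrete. The function names, the dictionary layout of the joint coordinates, and the threshold values used in the usage note are assumptions; the arithmetic follows the steps as stated, with velocity taken as the coordinate difference between adjacent timestamps.

```python
def motion_intensity(prev, curr, weights):
    """Joint motion intensity for one timestamp.

    prev, curr: joint name -> (x, y, z) at adjacent acquisition timestamps.
    weights:    joint name -> joint mass weight.
    """
    total = 0.0
    for joint, w in weights.items():
        # Squared modulus of the velocity vector (coordinate difference).
        sq = sum((c - p) ** 2 for p, c in zip(prev[joint], curr[joint]))
        # Velocity adjustment factor: 1 - 1 / (1 + |v|^2), in [0, 1).
        factor = 1.0 - 1.0 / (1.0 + sq)
        total += w * factor * sq
    return total

def gate(intensity, active_threshold, micro_threshold):
    """Map an intensity value to the status flag described in the text."""
    if intensity >= active_threshold:
        return "dynamic"
    if intensity > micro_threshold:
        return "micro_motion"
    return "steady"
```

For example, with a single joint of weight 2.0 that moves one unit along x, the squared modulus is 1, the adjustment factor is 0.5, and the intensity is 1.0; with hypothetical thresholds of 0.8 and 0.1 the gate returns `"dynamic"`. Because the factor approaches 0 for small displacements and 1 for large ones, weak jitter contributes roughly quartically while large movements contribute roughly quadratically.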

[0014] Furthermore, the specific steps for triggering posture determination, close-range interaction detection, and micro-motion behavior recognition when gating conditions are met, and generating a behavior health log containing behavior type identifiers, time intervals, and spatial region identifiers, are as follows: When the status flag is steady state or micro-motion, the spatial vector between the three-dimensional coordinates of the hip joint and the three-dimensional coordinates of the neck joint is calculated in acquisition timestamp order, and the gravitational acceleration at the corresponding acquisition timestamp is extracted as a spatial direction reference. The spatial angle between the spatial vector and the gravity direction is calculated from the vector dot product and compared with the posture determination angle intervals to determine the human posture as standing, sitting, supine, or transitional. The absolute height of the mid-hip joint above the ground is calculated from the three-dimensional coordinates of the mid-hip joint and the ground height, and a fall warning flag is generated when the absolute height is less than the bed height. When the status flag is micro-motion and the human posture is standing or sitting, the forearm length is calculated from the three-dimensional coordinates of the elbow and wrist joints, and a near-body interaction field is constructed in three-dimensional space with the neck joint as the center and a ratio of the forearm length as the radius. Whether the left and right wrist joints enter the near-body interaction field is monitored in real time; when a hand is detected entering the near-body interaction field, object detection is performed only on the projection area of the near-body interaction field in the room video frame, and the categories of the detected interactive objects are associated and labeled. When an interactive object is detected and the dwell time exceeds the dwell threshold, the room video frames, the room depth map frames, and the three-dimensional coordinate sequence of the human key joints within the corresponding time window are extracted in acquisition timestamp order. The topological association between key joint nodes is constructed according to the anatomical connection relationship of human joints and, combined with the temporal association of the same key joint node at adjacent acquisition timestamps, the three-dimensional coordinate sequence of human key joints is structured to generate a temporal structured joint representation for behavior determination. Multimodal health monitoring data with historically determined behavior types are read to construct historical temporal structured joint representations. The historical temporal structured joint representations are used as training inputs, and the corresponding micro-motion behavior type identifiers are used as supervision labels, to construct and train a behavior recognition model based on the spatiotemporal graph convolutional network algorithm. The temporal structured joint representation generated in the current time window is input into the trained behavior recognition model, which outputs the corresponding behavior type identifier. The behavior type identifier is associated with the corresponding acquisition timestamp interval, human posture type, status flag, and camera installation area identifier to generate the behavior health log.
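The geometric checks in this paragraph (trunk angle against gravity, fall warning, and the near-body interaction field) can be sketched as below. All names are hypothetical; the mid-hip joint is used as the hip reference, the vertical axis index and the radius ratio of 1.5 are illustrative assumptions, and the posture angle intervals are left out because the text does not give their values.

```python
import math

def trunk_angle_deg(hip, neck, gravity):
    """Angle between the hip->neck vector and the gravity vector, via dot product."""
    trunk = [n - h for n, h in zip(neck, hip)]
    dot = sum(a * b for a, b in zip(trunk, gravity))
    norm = math.hypot(*trunk) * math.hypot(*gravity)
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def fall_warning(mid_hip, ground_height, bed_height, up_axis=1):
    """True when the mid-hip height above the ground is below the bed height."""
    return (mid_hip[up_axis] - ground_height) < bed_height

def in_near_body_field(wrist, neck, elbow, ratio=1.5):
    """True when the wrist lies inside a sphere centred on the neck whose
    radius is `ratio` times the forearm (elbow-to-wrist) length."""
    forearm = math.dist(elbow, wrist)
    return math.dist(wrist, neck) <= ratio * forearm
```

With gravity pointing along −y, an upright trunk gives an angle near 180° and a horizontal trunk gives about 90°, so the posture intervals would be defined over that range. Object detection would then run only on the image-plane projection of the sphere when `in_near_body_field` returns true for either wrist.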

[0015] Furthermore, the specific steps for filtering the eating behavior time window based on the behavior health log, and confirming the dish category corresponding to the eating behavior by combining the video frames and depth map pixel distance values collected by the dining table camera device with the dish image recognition results, are as follows: Read the behavior health log, filter the eating-related behavior records that occurred in the areas corresponding to the dining table installation area and the kitchen food storage device in acquisition timestamp order, and extract the corresponding start and end acquisition timestamps to generate an eating behavior time window sequence; for each eating behavior time window, read the video frames collected by the dining table camera device from the multimodal health monitoring data in acquisition timestamp order, use a target detection algorithm to identify the food name identifiers (ingredient names) in the video frames, and record the identified ingredient name identifiers in sequence; input the video frames captured by the dining table camera device into a deep learning-based dish image recognition model, which outputs the candidate dish category identifiers corresponding to the video frames; for the same eating behavior time window, read the set of ingredients corresponding to each candidate dish category identifier from the dish-ingredient correspondence table, and compare that set with the set of ingredient names detected in the current eating behavior time window; when the set of detected ingredient names contains an ingredient name identifier belonging to the set of ingredients, the corresponding candidate dish category identifier is retained as a valid dish category identifier, and the valid dish category identifier with the longest consecutive occurrence time in the time window is determined as the target dish category identifier.
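The cross-check between recognized dishes and detected ingredients can be sketched as follows. The function name, the `(timestamp, dish_id)` input layout, and the example dish table in the usage note are hypothetical; the logic keeps a candidate only if at least one of its ingredients was detected, then picks the valid dish with the longest consecutive run.

```python
def confirm_dish(frame_candidates, detected_ingredients, dish_ingredients):
    """Pick the target dish category for one eating behavior time window.

    frame_candidates:     list of (timestamp, dish_id), ordered by timestamp.
    detected_ingredients: ingredient names detected in the same window.
    dish_ingredients:     dish_id -> set of ingredient names (correspondence table).
    Returns the dish_id with the longest consecutive valid run, or None.
    """
    detected = set(detected_ingredients)
    runs = []  # each entry is [dish_id, start_ts, end_ts], or None as a run breaker
    for ts, dish in frame_candidates:
        # A candidate is valid only if it shares an ingredient with the detections.
        if not (detected & set(dish_ingredients.get(dish, ()))):
            runs.append(None)
            continue
        if runs and runs[-1] is not None and runs[-1][0] == dish:
            runs[-1][2] = ts            # extend the current run
        else:
            runs.append([dish, ts, ts])  # start a new run
    valid_runs = [r for r in runs if r is not None]
    if not valid_runs:
        return None
    return max(valid_runs, key=lambda r: r[2] - r[1])[0]
```

As a usage example, with a table mapping `"tomato_egg"` to `{"tomato", "egg"}` and `"mapo_tofu"` to `{"tofu", "pork"}`, detections of tomato and tofu, and frame candidates in which `"tomato_egg"` appears for three consecutive timestamps while `"mapo_tofu"` appears for at most two, the function returns `"tomato_egg"`.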

[0016] Furthermore, the specific steps for generating dietary intake records based on 3D point cloud volume calculation and category density mapping are as follows: Video frames captured by the dining table camera device within the corresponding eating behavior time window and their corresponding depth map pixel distance values are read in acquisition timestamp order; background subtraction is performed on the video frames to separate the food area, and a pixel back-projection algorithm is used with the depth map pixel distance values to map the pixels of the food area into a 3D spatial point set, generating 3D point cloud data of the food area; the Alpha Shape algorithm is used to construct the 3D bounding structure of the food point cloud data, and the food volume is calculated by volume integration; the category density parameter corresponding to the target dish category identifier is read, the food volume is multiplied by the category density parameter to obtain the food weight, and the nutritional intake data corresponding to the eating behavior is calculated from the nutritional component mapping relationship corresponding to the target dish category identifier, generating and outputting the dietary intake record.
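The volume-to-nutrient chain can be illustrated with a deliberately simplified sketch. Note the substitution: instead of the Alpha Shape bounding structure named in the text, this example integrates a height field above the table plane, which assumes a camera looking straight down and a known table depth. The function names, units, and per-100 g nutrient table format are assumptions as well.

```python
def food_volume_m3(depth, mask, table_depth, fx, fy):
    """Approximate food volume by height-field integration (not Alpha Shape).

    depth:       2D list of depth values in metres (camera looking straight down).
    mask:        2D list of booleans, True where background subtraction kept food.
    table_depth: depth of the empty table plane in metres.
    fx, fy:      focal lengths in pixels, used to size each pixel's footprint.
    """
    volume = 0.0
    for depth_row, mask_row in zip(depth, mask):
        for z, is_food in zip(depth_row, mask_row):
            if is_food and 0.0 < z < table_depth:
                footprint = (z / fx) * (z / fy)      # pixel area at depth z
                volume += (table_depth - z) * footprint
    return volume

def intake_record(dish_id, volume_m3, density_kg_m3, nutrients_per_100g):
    """Weight = volume x category density; nutrients scale with weight."""
    grams = volume_m3 * density_kg_m3 * 1000.0
    nutrients = {name: per_100g * grams / 100.0
                 for name, per_100g in nutrients_per_100g.items()}
    return {"dish": dish_id, "weight_g": grams, "nutrients": nutrients}
```

For instance, a volume of 0.0002 m³ with a category density of 1000 kg/m³ gives 200 g, so a hypothetical table entry of 10 g protein per 100 g yields 20 g of protein in the record. An Alpha Shape or convex hull volume could replace `food_volume_m3` without changing the rest of the chain.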

[0017] Furthermore, the specific steps for constructing health knowledge graph association relationships based on dietary intake records and historical medical record data, and generating a structured risk tuple set through path traversal, are as follows: Read the dietary intake records, extract the dish category identifiers, nutrient intake data, and time windows corresponding to the eating behavior, and map the dish category identifiers to food entity nodes in the health knowledge graph, the nutrient intake data to nutrient entity nodes, and the behavior type identifiers corresponding to the eating behavior time windows to behavior event nodes; simultaneously, retrieve the historical medical record data associated with the identity identifier of the current elderly person through the health record database interface, read the disease diagnosis records and long-term medication records in the historical medical record data, and map the disease diagnosis records to disease entity nodes and the long-term medication records to medication entity nodes; extract the corresponding dietary restrictions from the disease diagnosis records and map them to restriction rule nodes, extract the corresponding medication precautions from the long-term medication records and map them to medication constraint rule nodes, and establish the association relationships between these nodes in the health knowledge graph. Within the health knowledge graph, path traversal is performed over the entity nodes and their relationships: when a path from a disease entity node through a restriction rule node to a nutrient entity node is detected, a nutritional conflict risk identifier is generated; when a path from a medication entity node through a medication constraint rule node to a behavior event node is detected, a behavioral compliance risk identifier is generated. Upon detecting any risk identifier, a structured risk tuple set is generated based on the path that triggered the risk. Each risk tuple contains a risk type identifier, a risk cause identifier, and a risk source dish category identifier.
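The two path patterns above can be sketched with a plain adjacency dictionary. This is only an illustration under stated assumptions: nodes are `(kind, name)` pairs, the kind labels are hypothetical, the rule node's name is used as the risk cause identifier, and `dish_of` is an assumed lookup from a nutrient or behavior event node observed in the current meal to the dish that produced it.

```python
from collections import namedtuple

RiskTuple = namedtuple("RiskTuple", "risk_type cause source_dish")

# Three-node path patterns from the text, keyed by node kinds (labels assumed).
PATTERNS = {
    ("disease", "restriction_rule", "nutrient"): "nutritional_conflict",
    ("medication", "medication_rule", "behavior_event"): "behavioral_compliance",
}

def traverse_risks(graph, dish_of):
    """Enumerate start -> rule -> end paths and emit risk tuples.

    graph:   node -> list of successor nodes; each node is a (kind, name) pair.
    dish_of: end node present in the current intake record -> dish category id.
    """
    risks = []
    for start, rules in graph.items():
        for rule in rules:
            for end in graph.get(rule, ()):
                risk_type = PATTERNS.get((start[0], rule[0], end[0]))
                # Only flag end nodes that actually occur in the intake record.
                if risk_type and end in dish_of:
                    risks.append(RiskTuple(risk_type, rule[1], dish_of[end]))
    return risks
```

As an illustrative example (entity names invented for the sketch), a graph with the path `("disease", "hypertension")` → `("restriction_rule", "limit_sodium")` → `("nutrient", "sodium")`, together with `dish_of` mapping the sodium node to `"pickled_vegetables"`, yields one tuple `("nutritional_conflict", "limit_sodium", "pickled_vegetables")`.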

[0018] Furthermore, the specific steps for triggering nutritional decision generation and consistency verification under structured slot constraints, and outputting current nutritional status assessment and dietary adjustment decision information for elderly people living alone, are as follows: For the structured risk tuple set, a structured slot template containing fact slots, boundary slots, and logic slots is constructed, and slot mapping operation is performed, wherein: risk cause identifiers and risk source dish category identifiers are written into the fact slots; the set of ingredient name identifiers detected within the current eating behavior time window is written into the boundary slots; the constraint relationship path associated with the risk cause identifiers in the health knowledge graph is written into the logic slots to generate a structured decision input object; the structured decision input object is input into the large language model, restricting the large language model to generate nutritional decision results only based on the content of the fact slots, boundary slots, and logic slots in the structured slots; and consistency verification is performed on the nutritional decision results, comparing the dish category identifiers and ingredient name identifiers involved in the nutritional decision results with the dish category identifiers and ingredient name identifiers in the dietary intake records. When inconsistencies are detected, the current nutritional decision results are discarded; when the consistency verification passes, the nutritional decision results are output as current nutritional status assessment and dietary adjustment decision information for elderly people living alone.
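The slot mapping and the consistency check can be sketched as below. The large language model is represented by a caller-supplied `generate` function, and the decision format (a dictionary with `dishes`, `ingredients`, and `advice` keys) is a hypothetical structure chosen for the example; the consistency rule is interpreted here as "every dish and ingredient mentioned in the decision must appear in the dietary intake record".

```python
def build_decision_input(risks, detected_ingredients, constraint_paths):
    """Fill the fact, boundary, and logic slots from a risk tuple set.

    risks:            list of dicts with "cause" and "source_dish" keys.
    constraint_paths: risk cause id -> constraint path from the knowledge graph.
    """
    return {
        "fact": [{"cause": r["cause"], "source_dish": r["source_dish"]}
                 for r in risks],
        "boundary": sorted(set(detected_ingredients)),
        "logic": [constraint_paths[r["cause"]] for r in risks
                  if r["cause"] in constraint_paths],
    }

def is_consistent(decision, intake_dishes, intake_ingredients):
    """Reject decisions that mention dishes or ingredients not in the record."""
    return (set(decision["dishes"]) <= set(intake_dishes)
            and set(decision["ingredients"]) <= set(intake_ingredients))

def decide(decision_input, generate, intake_dishes, intake_ingredients):
    """Run the (stand-in) generator and discard inconsistent results."""
    decision = generate(decision_input)
    if is_consistent(decision, intake_dishes, intake_ingredients):
        return decision
    return None  # discarded, as the text specifies for failed verification
```

In use, `generate` would wrap a prompt that exposes only the three slots to the model; the sketch leaves retry behavior after a discarded result undefined because the text does not describe it.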

[0019] The second aspect of this invention provides a multimodal health monitoring and nutrition decision-making system for elderly people living alone, comprising: a multimodal health data acquisition and preprocessing module, a physically gated behavioral health log generation module, a dietary behavior modeling and nutrition profile construction module, and a nutrition risk assessment and constraint decision generation module. The multimodal health data acquisition and preprocessing module is used to collect multimodal health monitoring data and perform time alignment, smoothing and noise reduction, anomaly removal, missing data completion, and numerical standardization on the multimodal health monitoring data to generate preprocessed multimodal health monitoring data. The physically gated behavioral health log generation module is used to construct a spatiotemporal graph structure of the human skeleton based on the preprocessed multimodal health monitoring data, calculate joint motion intensity assessment values, and perform physical gating state determination; when gating conditions are met, it triggers posture determination, close-range interaction detection, and micro-motion behavior recognition to generate a behavior health log containing behavior type identifiers, time intervals, and spatial region identifiers. The dietary behavior modeling and nutrition profile construction module is used to filter eating behavior time windows based on the behavior health log, confirm the dish category corresponding to the eating behavior by combining the video frames and depth map pixel distance values collected by the dining table camera device with the dish image recognition results, and generate a dietary intake record based on 3D point cloud volume calculation and category density mapping. The nutrition risk assessment and constraint decision generation module is used to construct health knowledge graph association relationships based on dietary intake records and historical medical record data, execute path traversal to generate a set of structured risk tuples, trigger nutrition decision generation and consistency verification under structured slot constraints, and output the current nutritional status assessment and dietary adjustment decision information for elderly people living alone.

[0020] Beneficial effects

[0021] The present invention has the following beneficial effects:

[0022] (1) A multimodal health monitoring and nutrition decision-making system and method for elderly people living alone. By introducing joint motion intensity assessment value and physical gating state judgment mechanism based on the spatiotemporal graph structure of human skeleton, the three types of motion states, namely "dynamic migration - local micro-motion - steady state maintenance", are controlled in layers with quantifiable thresholds. The posture judgment, close interaction detection and micro-movement behavior recognition are triggered only when the gating conditions are met, thereby realizing adaptive start and stop of computing link and resource focus, and improving the operating efficiency and stability in long-term continuous monitoring scenarios.

[0023] (2) A multimodal health monitoring and nutrition decision-making system and method for elderly people living alone. By using joint mass weight and velocity adjustment factor to suppress the weighted aggregation of the motion contribution of key joint nodes, the joint motion intensity assessment value has the dynamic response characteristics of "insensitive to weak shaking and amplified to significant movement", which enhances the ability to distinguish between steady state maintenance and local micro-motion boundary, and provides a more robust physical basis for the pre-judgment of fall warning triggering and close interaction detection.

[0024] (3) A multimodal health monitoring and nutrition decision-making system and method for elderly people living alone. When the status flag is micro-motion and the posture is standing or sitting, a near-body interaction field is constructed with the neck joint node as the center and a ratio of the human forearm length as the radius, and object detection is limited to the projection area of the near-body interaction field in the video frame. This realizes "local visual computing guided by the human skeleton", thereby reducing the background interference and false detection probability caused by full-frame target detection and improving the positioning accuracy of eating-related interactive objects.

[0025] (4) A multimodal health monitoring and nutrition decision-making system and method for elderly people living alone. By mapping the dietary intake records obtained from eating behavior modeling and the historical medical records into the health knowledge graph and traversing its paths, a set of structured risk tuples is generated. Under the structured slot constraints of fact slots, boundary slots, and logic slots, nutrition decision generation and consistency verification are triggered, so that the nutrition decision output has a traceable risk cause link and verifiable consistency constraints, thereby improving the interpretability and reliability of dietary adjustment decisions.

Attached Figure Description

[0026] Figure 1 is a flowchart of the multimodal health monitoring and nutrition decision-making method for elderly people living alone;

[0027] Figure 2 is a structural diagram of the multimodal health monitoring and nutrition decision-making system for elderly people living alone;

[0028] Figure 3 is a bar chart showing the status flags determined from joint motion intensity assessment values;

[0029] Figure 4 is a schematic diagram of the behavior health log classification and determination process driven by physical gating.

Detailed Implementation

[0030] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0031] Referring to Figures 1-4, the present invention provides a technical solution: a multimodal health monitoring and nutrition decision-making method for elderly people living alone, comprising: S1, collecting multimodal health monitoring data and performing time alignment, smoothing and denoising, anomaly removal, missing data completion and numerical standardization on it to generate preprocessed multimodal health monitoring data; S2, constructing a spatiotemporal graph structure of the human skeleton based on the preprocessed multimodal health monitoring data, calculating joint motion intensity assessment values and performing physical gating state determination, triggering posture determination, close-range interaction detection and micro-motion behavior recognition when the gating conditions are met, and generating a behavioral health log including behavior type identifiers, time intervals and...; S3, based on the behavioral health log, filtering the eating behavior time window, combining the video frames and depth map pixel distance values collected by the dining table camera device with the dish image recognition results to confirm the dish category corresponding to the eating behavior, and generating a dietary intake record based on 3D point cloud volume calculation and category density mapping; S4, based on the dietary intake record and historical medical record data, constructing health knowledge graph association relationships, executing path traversal to generate a set of structured risk tuples, triggering nutrition decision generation and consistency verification under structured slot constraints, and outputting the current nutritional status assessment and dietary adjustment decision information for the elderly person living alone.

[0032] Specifically, the steps of collecting the multimodal health monitoring data and performing time alignment, smoothing and denoising, anomaly removal, missing data completion, and numerical standardization to generate the preprocessed multimodal health monitoring data are as follows: Multimodal health monitoring data is collected in real time from the elderly person living alone. The data includes identity identifiers, acquisition timestamps, camera installation area identifiers, room video frames, room depth map frames, dining table video frames, dining table depth map frames, depth map pixel distance values, three-axis acceleration, gravitational acceleration, bed height, ground height, room camera device identifiers, dining table camera device identifiers, dining table installation area identifiers, kitchen food storage device identifiers, and food name identifiers. The identity identifier is read from the RFID wristband worn by the elderly person on entering the acquisition coverage area and written into the acquisition record; the acquisition timestamp is generated by the unified clock source of the acquisition terminal each time acquisition is triggered and written into the acquisition record; the camera installation area identifier is taken from the camera installation registration information and fixed in the acquisition record when the device goes online; the room video frames are captured at the frame rate by the visible light camera installed in the room and bound to the acquisition record by acquisition timestamp; the room depth map frames are captured by the structured light depth camera installed coaxially with the visible light camera in the room and bound to the acquisition record by acquisition timestamp; the dining table video frames are captured by the visible light camera installed above the dining table and bound to the acquisition record by acquisition timestamp.
The dining table depth map frames are captured by the structured light depth camera installed coaxially with the visible light camera above the dining table and bound to the acquisition record by acquisition timestamp. The depth map pixel distance values are read pixel by pixel from the dining table depth map frame under calibrated camera intrinsic parameters and bound to the acquisition record by acquisition timestamp. The three-axis acceleration is output at the sampling frequency by a three-axis accelerometer worn by the elderly person and bound to the acquisition record by acquisition timestamp. The gravitational acceleration is the gravity-component acceleration output by an inertial measurement sensor after attitude calculation and is bound to the acquisition record by acquisition timestamp. The bed height is calculated from the measurement of a laser rangefinder installed at the side of the bed combined with the bed frame reference calibration value, written into the acquisition record and bound to the acquisition timestamp; the ground height is the ground reference height obtained by fitting the ground plane in the room depth map frame, written into the acquisition record and bound to the acquisition timestamp; the room camera device identifier is the factory serial number of the room camera device, fixed in the acquisition record at device registration; the dining table camera device identifier is the factory serial number of the dining table camera device, fixed in the acquisition record at device registration; the dining table installation area identifier is the pre-configured area code of the dining table location in the room floor plan, written into the acquisition record and associated with the dining table camera device identifier; the kitchen food storage device identifier is generated at registration by the kitchen food storage device, which has electronic lock control and weighing capability, and is carried in the acquisition record. The food name identifier is obtained by the barcode recognition camera on the kitchen food storage device, which reads the barcode and maps it to a preset food ingredient dictionary, and is written into the acquisition record.
A time alignment algorithm based on time window resampling maps the multimodal health monitoring data of the same day onto a single time axis, constructing an intraday time axis. The resampling time slice length is set to 1 to 5 seconds, the time slice step is set to 1 to 5 seconds, the intraday time axis covers 00:00:00 to 23:59:59, and the end of each time slice is used as the alignment sampling time. For room video frames, room depth map frames, dining table video frames, and dining table depth map frames, frame-level alignment binding is achieved by selecting the nearest-neighbor frame at the alignment sampling time. For depth map pixel distance values, three-axis acceleration, gravitational acceleration, bed height, and ground height, alignment value fields are generated by time-slice-based numerical aggregation, and the aggregation results are bound to the alignment sampling time and written into the data.
A sliding window mean filtering algorithm smooths the high-frequency noise introduced into the multimodal health monitoring data by sensor jitter and transient interference. The sliding window duration is set to 5 to 30 seconds and the sliding step is set to 1 to 5 seconds. Within each sliding window, the window mean of the depth map pixel distance value, three-axis acceleration, gravitational acceleration, bed height, and ground height is calculated and written in place of the value at the window's center time slice. Video frames and depth map frames are not numerically smoothed; window smoothing marks are written for them synchronously to keep field processing consistent.
An interquartile range (IQR) anomaly detection algorithm identifies and removes abnormal sampling points caused by occlusion, communication jitter, and intermittent data acquisition. The IQR statistical window duration is set to 60 to 600 seconds, and the statistical window step is set to 10 to 60 seconds. Within each statistical window, the lower and upper quartiles of the depth map pixel distance value, three-axis acceleration, gravitational acceleration, bed height, and ground height are calculated to obtain the IQR. Sampling points below the lower quartile minus 1.5 times the IQR or above the upper quartile plus 1.5 times the IQR are marked as abnormal and removed, and an anomaly removal flag is written for each removed point to form a traceable removal chain.
The missing segments left by the removal are filled using an adjacent time slice interpolation algorithm. The criterion for a missing segment is set to 2 to 20 consecutive missing time slices, and the search range for interpolation endpoints is set to 1 to 60 time slices on each side of the missing segment. Within the endpoint search range, the nearest valid time slice on each side is selected as an interpolation endpoint, and linear interpolation is performed on the depth map pixel distance value, three-axis acceleration, gravitational acceleration, bed height, and ground height to generate the completed numerical fields. For room video frames, room depth map frames, dining table video frames, and dining table depth map frames, the nearest valid frame in time is copied, and completion markers are written synchronously to keep frame-level data continuously available along the time axis.
The Z-score normalization algorithm is used to standardize the multimodal health monitoring data numerically, eliminating dimensional differences between physical quantities. The intraday mean and intraday standard deviation of each axis component of the three-axis acceleration are calculated, and the axis components of each time slice are Z-score transformed with them to obtain the standardized three-axis acceleration. The intraday mean and intraday standard deviation of the gravitational acceleration are calculated, and Z-score transformation gives the standardized gravitational acceleration. The intraday mean and intraday standard deviation of the bed height and ground height are calculated, and Z-score transformation gives the standardized bed height and standardized ground height. For the depth map pixel distance values, the intraday mean and intraday standard deviation are calculated per pixel coordinate index, and Z-score transformation gives the standardized depth map pixel distance values. The original values of the identity identifiers, acquisition timestamps, camera installation area identifiers, room camera device identifiers, dining table camera device identifiers, dining table installation area identifiers, kitchen food storage device identifiers, and food name identifiers are retained without numerical transformation. After time alignment, smoothing and denoising, anomaly removal, missing data completion, and numerical standardization, the data is assembled in acquisition timestamp order to generate the preprocessed multimodal health monitoring data.
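The anomaly removal, gap filling and standardization steps of [0032] can be sketched for a single numeric field as follows. This is a minimal illustration, not the patented implementation: the function names are hypothetical, the quartile method is whatever `statistics.quantiles` uses by default, the standard deviation is taken as the population form, and the endpoint search runs from each missing slice rather than from the edges of the missing segment. The statistical windowing and the 2-to-20-slice missing-segment criterion are omitted for brevity.

```python
import statistics


def remove_outliers(values, k=1.5):
    """Mark points outside [Q1 - k*IQR, Q3 + k*IQR] as missing (None)."""
    valid = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(valid, n=4)  # quartile method is an assumption
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    cleaned = [v if v is not None and lo <= v <= hi else None for v in values]
    # Removal flags keep the removal traceable.
    flags = [v is not None and c is None for v, c in zip(values, cleaned)]
    return cleaned, flags


def fill_gaps(values, max_search=60):
    """Linearly interpolate missing slices between the nearest valid neighbours."""
    out = list(values)
    n = len(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = next((j for j in range(i - 1, max(i - max_search, 0) - 1, -1)
                     if values[j] is not None), None)
        right = next((j for j in range(i + 1, min(i + max_search, n - 1) + 1)
                      if values[j] is not None), None)
        if left is None or right is None:
            continue  # no endpoint within the search range: leave missing
        w = (i - left) / (right - left)
        out[i] = values[left] + w * (values[right] - values[left])
    return out


def zscore(values):
    """Z-score transform using the mean and population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std if std > 0 else 0.0 for v in values]


# Hypothetical bed-height samples with one spike at index 4.
bed_height = [1.0, 1.1, 0.9, 1.0, 50.0, 1.05, 0.95, 1.0]
cleaned, flags = remove_outliers(bed_height)
filled = fill_gaps(cleaned)
normalized = zscore(filled)
```

In this example the spike at index 4 is flagged and removed, then replaced by the value interpolated between its two valid neighbours before standardization.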

[0033] By applying traceable end-to-end preprocessing constraints to the multimodal health monitoring data on a unified intraday time axis, this embodiment ensures that the identity identifiers, acquisition timestamps, camera installation area identifiers, room video frames, room depth map frames, dining table video frames, dining table depth map frames, depth map pixel distance values, three-axis acceleration, gravitational acceleration, bed height, ground height, room camera device identifiers, dining table camera device identifiers, dining table installation area identifiers, kitchen food storage device identifiers, and food name identifiers form an aligned, verifiable, and continuously usable input basis under the same statistical criteria. This gives the subsequent steps of constructing the human skeleton spatiotemporal graph structure indexed by acquisition timestamp, filtering eating behavior time windows, and generating dietary intake records a stable data structure to work on, and reduces the risk of behavioral health log deviations and falsely triggered nutrition decision consistency checks caused by time granularity drift, noise jumps, and residual abnormal segments.
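The time-window resampling described in [0032] can be illustrated with a short sketch. The helper names are hypothetical, timestamps are plain seconds, and the per-slice aggregation is assumed to be an arithmetic mean, since the text only states that a time-slice-based numerical aggregation is used.

```python
from bisect import bisect_left
from statistics import fmean


def slice_end_times(start, end, step):
    """Alignment sampling times: the end of each time slice within [start, end]."""
    times = []
    t = start + step
    while t <= end:
        times.append(t)
        t += step
    return times


def nearest_frame(frame_times, t):
    """Index of the frame whose timestamp is nearest to t (frame_times sorted, non-empty)."""
    pos = bisect_left(frame_times, t)
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(frame_times)]
    return min(candidates, key=lambda i: abs(frame_times[i] - t))


def aggregate(samples, t_end, length):
    """Mean of (timestamp, value) samples in the slice (t_end - length, t_end]."""
    window = [v for ts, v in samples if t_end - length < ts <= t_end]
    return fmean(window) if window else None  # None marks an empty slice


# Hypothetical inputs: frame timestamps and accelerometer samples in seconds.
frame_times = [0.0, 0.9, 2.1, 3.0, 4.2]
accel_x = [(0.5, 1.0), (1.5, 3.0), (2.5, 5.0)]

aligned = [
    (t, nearest_frame(frame_times, t), aggregate(accel_x, t, 2))
    for t in slice_end_times(0, 4, 2)
]
```

Each entry of `aligned` binds one alignment sampling time to the index of its nearest frame and to the aggregated numeric value for that slice.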

[0034] Specifically, the steps for constructing the spatiotemporal graph structure of the human skeleton based on the preprocessed multimodal health monitoring data are as follows: The preprocessed multimodal health monitoring data is read, and room video frames and room depth map frames are paired at frame level using the acquisition timestamp as the key. Specifically, the acquisition timestamps of the room video frames and room depth map frames are checked for consistency, and one-to-one binding is completed within the time slice corresponding to the same acquisition timestamp; when the room depth map frame is missing at the current acquisition timestamp, the most recent valid room depth map frame is selected by searching backward through the acquisition timestamps to complete the pairing, and a pairing completion mark is written so that the subsequent reading of depth map pixel distance values has a usable reference.
For each paired room video frame and room depth map frame, a lightweight pose estimation algorithm performs human key joint detection on the room video frame to obtain the key joint nodes: the top of the head, neck, left and right shoulders, left and right elbows, left and right wrists, mid-hip, left and right hips, left and right knees, and left and right ankles. Specifically, human region localization is performed first: an object-detection-based pedestrian detection method locates the bounding rectangle of the human body in the room video frame, and the frame is cropped to this rectangle to obtain the human region frame. The human region frame is then brought to a unified input scale by scaling it to a preset input width and preset input height while maintaining the aspect ratio, and its pixel grayscale values are normalized to the range zero to one to ensure a stable input distribution for pose estimation. The scaled and normalized human region frame is input into the lightweight pose estimation algorithm, which outputs a set of joint heatmaps corresponding one-to-one with the key joint nodes. The peak position of each joint heatmap represents the pixel position of that key joint node in the human region frame. Peak search on each joint heatmap yields the peak pixel coordinates, which are then restored using the offset of the human bounding rectangle in the original room video frame to give the two-dimensional pixel coordinates of the key joint nodes in the room video frame.
The two-dimensional pixel coordinates of the key joint nodes are obtained with a convolutional neural network, depth values are read from the corresponding room depth map frame at those coordinates, and a camera imaging model maps the two-dimensional pixel coordinates and depth values to three-dimensional spatial coordinates, giving a three-dimensional coordinate sequence of human key joints ordered by acquisition timestamp. The convolutional neural network takes the human region frame as input and outputs the set of joint heatmaps. The two-dimensional pixel coordinates are obtained from the joint heatmap set by peak search, and field bindings are established by key joint node name; the coordinates are represented as pixel row and pixel column coordinates and bound to the acquisition timestamp. When depth values are read from the corresponding room depth map frame, the two-dimensional pixel coordinates are used as pixel indices to locate the depth map pixel distance values.
A valid-interval check is applied to the depth map pixel distance values, with a lower limit of 0.2 meters and an upper limit of 6 meters. Values outside the valid interval are treated as invalid and trigger neighborhood completion, which reads the valid depth map pixel distance values within a 3x3 pixel neighborhood centered on the two-dimensional pixel coordinates and writes their median, reducing the impact of local holes and transient noise on the three-dimensional mapping. The camera imaging model uses the camera intrinsic and extrinsic parameters bound to the room camera device identifier to convert the two-dimensional pixel coordinates and depth map pixel distance values into three-dimensional spatial coordinates. The intrinsic parameters include the focal length and principal point parameters, and the extrinsic parameters include the rotation and translation parameters. The focal length and principal point parameters convert the two-dimensional pixel coordinates into normalized imaging plane coordinates, and the depth map pixel distance value determines the spatial scale of those coordinates, giving the three-dimensional coordinates in the camera coordinate system. The rotation and translation parameters convert the three-dimensional coordinates in the camera coordinate system into three-dimensional coordinates in the room reference coordinate system, and the height component is uniformly calibrated against the ground height so that the three-dimensional coordinate sequences of human key joints at different acquisition timestamps share the same ground reference plane. The camera intrinsic and extrinsic parameters are provided by the calibration results associated with the room camera device identifier. The calibration results are collected during equipment installation and written into a parameter storage record that is uniquely associated with the room camera device identifier, so that the matching calibration parameters can be retrieved during the three-dimensional mapping for each room camera device identifier.
Using the key joint nodes within a single frame as the node set, spatial connection edges are constructed within the same frame according to human anatomical connections, and temporal connection edges are constructed for the same joint node between adjacent acquisition timestamps, modeling the human skeleton as a spatiotemporal graph structure containing both spatial and temporal edges. Specifically, spatial connection edges are established one by one according to the following connection rules: top of the head to neck, neck to left and right shoulders, left and right shoulders to left and right elbows, left and right elbows to left and right wrists, neck to mid-hip, mid-hip to left and right hips, left and right hips to left and right knees, and left and right knees to left and right ankles. Temporal connection edges are established by matching the node identifier of the same joint node at adjacent acquisition timestamps, and the interval between the adjacent acquisition timestamps is written into each temporal connection edge to represent the temporal sampling interval, giving the spatiotemporal graph structure of the human skeleton quantifiable evolution constraints in the time dimension.
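The spatial and temporal edge rules above can be written down directly. The sketch below is one possible in-memory representation; the joint names, tuple layout and function name are illustrative choices, not taken from the source.

```python
JOINTS = [
    "head_top", "neck",
    "l_shoulder", "r_shoulder", "l_elbow", "r_elbow", "l_wrist", "r_wrist",
    "mid_hip", "l_hip", "r_hip", "l_knee", "r_knee", "l_ankle", "r_ankle",
]

# Anatomical connections listed in [0034].
SPATIAL_EDGES = [
    ("head_top", "neck"),
    ("neck", "l_shoulder"), ("neck", "r_shoulder"),
    ("l_shoulder", "l_elbow"), ("r_shoulder", "r_elbow"),
    ("l_elbow", "l_wrist"), ("r_elbow", "r_wrist"),
    ("neck", "mid_hip"),
    ("mid_hip", "l_hip"), ("mid_hip", "r_hip"),
    ("l_hip", "l_knee"), ("r_hip", "r_knee"),
    ("l_knee", "l_ankle"), ("r_knee", "r_ankle"),
]


def build_skeleton_graph(timestamps):
    """Return (nodes, spatial_edges, temporal_edges) for sorted acquisition timestamps."""
    nodes = [(t, j) for t in timestamps for j in JOINTS]
    # Spatial edges connect different joints within the same frame.
    spatial = [((t, a), (t, b)) for t in timestamps for a, b in SPATIAL_EDGES]
    # Temporal edges connect the same joint across adjacent timestamps and
    # carry the sampling interval between them.
    temporal = [
        ((t0, j), (t1, j), t1 - t0)
        for t0, t1 in zip(timestamps, timestamps[1:])
        for j in JOINTS
    ]
    return nodes, spatial, temporal


nodes, spatial, temporal = build_skeleton_graph([0.0, 0.5, 1.0])
```

With 15 joints and 14 anatomical connections, three frames give 45 nodes, 42 spatial edges and 30 temporal edges.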

[0035] In this embodiment, room video frames and room depth map frames are paired at frame level by acquisition timestamp, and pairing completion marks are introduced where frames are missing. The calibration results bound to the room camera device identifier constrain a single conversion path from two-dimensional pixel coordinates to three-dimensional spatial coordinates, so that the three-dimensional coordinate sequence of human key joints has a consistent spatial reference across acquisition timestamps relative to the ground height reference plane. Under the joint constraints of the spatial and temporal connection edges, the anatomical topology and temporal evolution of the key joint nodes are fixed into a reusable data structure. This provides the stable input field arrangement and consistent coordinate reference on which the subsequent joint motion intensity assessment value calculation and physical gating state determination depend, and reduces the risk of behavioral health log offsets caused by viewpoint changes, depth holes, and time slice mismatches.
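The conversion path from a two-dimensional pixel coordinate and depth value to a room-frame coordinate can be sketched as below. This assumes a standard pinhole camera model, treats the depth map pixel distance value as the distance along the optical axis, and assumes the extrinsic parameters map camera coordinates to room coordinates; the source does not spell out these conventions. The function names are hypothetical, and the ground-height calibration of the height component is not shown.

```python
from statistics import median


def complete_depth(depth, row, col, lo=0.2, hi=6.0):
    """Return the depth at (row, col), or the 3x3 neighbourhood median if it is invalid."""
    d = depth[row][col]
    if lo <= d <= hi:
        return d
    neighbours = [
        depth[r][c]
        for r in range(row - 1, row + 2)
        for c in range(col - 1, col + 2)
        if 0 <= r < len(depth) and 0 <= c < len(depth[0]) and lo <= depth[r][c] <= hi
    ]
    return median(neighbours) if neighbours else None


def pixel_to_room(row, col, d, fx, fy, cx, cy, R, t):
    """Back-project a pixel with depth d into the room reference coordinate system."""
    # Intrinsics: pixel coordinates -> normalized imaging plane coordinates.
    xn = (col - cx) / fx
    yn = (row - cy) / fy
    # Depth fixes the scale, giving camera-frame coordinates.
    pc = (xn * d, yn * d, d)
    # Extrinsics: rotate and translate into the room frame.
    return tuple(sum(R[i][k] * pc[k] for k in range(3)) + t[i] for i in range(3))


# Hypothetical depth patch with a hole (0.0) at the joint pixel.
depth_patch = [
    [1.0, 1.0, 1.0],
    [1.0, 0.0, 1.2],
    [1.4, 1.0, 1.0],
]
d = complete_depth(depth_patch, 1, 1)

IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
point = pixel_to_room(240, 570, 2.0, fx=500.0, fy=500.0, cx=320.0, cy=240.0,
                      R=IDENTITY, t=(0.0, 0.0, 0.0))
```

The median over valid neighbours ignores the hole itself, so a single missing pixel does not distort the recovered joint depth.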

[0036] Specifically, the steps for calculating the joint motion intensity assessment value and performing physical gating state determination are as follows: The difference between the three-dimensional coordinates of each key joint node in the human skeleton spatiotemporal graph at adjacent acquisition timestamps is calculated to obtain the three-dimensional velocity vector of each key joint node. The three-dimensional coordinate difference is obtained by subtracting the coordinates at the previous acquisition timestamp from those at the current acquisition timestamp, and the three-dimensional velocity vector is obtained by dividing this difference by the acquisition timestamp difference. The acquisition timestamp difference is obtained by differencing adjacent acquisition timestamps and is written into the same sampling record so that subsequent calculations use a consistent time interval. The three-dimensional velocity vector characterizes the instantaneous motion rate of the key joint node between adjacent acquisition timestamps and is the basic input for the joint motion intensity assessment value.
The modulus of the three-dimensional velocity vector of the i-th key joint node at acquisition timestamp t is taken and squared to obtain the squared modulus of the three-dimensional velocity vector. The modulus is the square root of the sum of the squares of the three components, so the squared modulus is obtained directly as the sum of the squares of the three components. The squared modulus converts the three-dimensional velocity vector from a direction-dependent expression into an amplitude-related, energy-like measure that characterizes the motion intensity of the key joint node at acquisition timestamp t. Using the squared modulus also avoids the square root operation, reducing numerical loss in the computation and improving comparability across acquisition timestamps.
One is added to the squared velocity modulus and the reciprocal of the result is taken; subtracting this reciprocal from one yields the velocity adjustment factor. Adding one prevents the reciprocal from being undefined when the squared modulus is zero and ensures that the velocity adjustment factor has a definite value in zero-velocity segments. The reciprocal applies nonlinear compression to the squared modulus, reducing the impact of sudden jumps on stability, and subtracting the reciprocal from one converts the compression result into a monotonically increasing gating coefficient: the velocity adjustment factor approaches zero as the squared modulus approaches zero and gradually approaches one as the squared modulus increases, suppressing low-speed jitter segments and saturating in real motion segments. The velocity adjustment factor weakens the amplification, by the squared modulus, of the key joint coordinate perturbations introduced by room video frame jitter and room depth map frame depth fluctuations, and suppresses falsely high motion accumulation caused by single-frame jumps.
The weighted motion amount of the current key joint node is obtained by multiplying its joint mass weight, its velocity adjustment factor, and the squared modulus of its three-dimensional velocity vector.
Before calculation, the joint mass weight is looked up in a fixed table by key joint node identifier and remains the same across acquisition timestamps for the same identifier. The joint mass weight reflects the different contributions of the key joint nodes to overall human motion intensity and keeps higher sensitivity for joint nodes related to close-range interaction. The velocity adjustment factor gates the squared modulus of the three-dimensional velocity vector to suppress low-amplitude jitter segments and weaken sudden noise segments. The squared modulus provides the basic quantification of joint motion amplitude, so that the weighted motion amount characterizes both the importance and the motion intensity of the joint node on the same scale.
The weighted motion amounts of all key joint nodes are summed to obtain the joint motion intensity assessment value at acquisition timestamp t. During the summation, the set of key joint nodes involved is checked for field integrity to ensure that the three-dimensional coordinate sequence has complete node records at that acquisition timestamp. If any key joint node lacks a three-dimensional coordinate record at that acquisition timestamp, the timestamp is marked as an undeterminable sampling point and the state flag of the previous acquisition timestamp is carried over to keep the gating chain continuous. The joint motion intensity assessment value characterizes the overall motion intensity of the human skeleton across all key joint nodes at acquisition timestamp t and serves as the unified input for physical gating state determination, so that the values at different acquisition timestamps are comparable and support continuous determination.
The joint motion intensity assessment value is compared in real time with an activity threshold and a micro-motion threshold. The activity threshold characterizes the lower limit of intensity for the dynamic migration phase and prevents low-intensity segments from accidentally entering the dynamic migration phase. The micro-motion threshold characterizes the upper limit of intensity for the steady-state maintenance phase and prevents small-motion segments from accidentally entering the steady-state maintenance phase. A continuous sampling consistency constraint is also applied to the joint motion intensity assessment value to avoid state jitter caused by a single point crossing a threshold: a state update is triggered only when the joint motion intensity assessment values at multiple adjacent acquisition timestamps keep the same relationship to the threshold. When the joint motion intensity assessment value is greater than or equal to the activity threshold, the current state is determined to be the dynamic migration phase and the state flag is marked as dynamic; the dynamic state flag triggers subsequent processing of large-displacement behavior and prevents recognition steps that are sensitive to steady-state features from running during the dynamic migration phase. When the joint motion intensity assessment value is greater than the micro-motion threshold and less than the activity threshold, the current state is determined to be the local micro-motion phase and the state flag is marked as micro-motion; the micro-motion state flag indicates that the human body is in a range of low-amplitude continuous motion and is the trigger condition for close-range interaction detection and micro-motion behavior recognition. When the joint motion intensity assessment value is less than or equal to the micro-motion threshold, the current state is determined to be the steady-state maintenance phase and the state flag is marked as steady state.
The steady-state flag indicates that the human body is in a posture maintenance range and provides stable input constraints for posture determination.
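The three-way threshold comparison with the continuous sampling consistency constraint in [0036] can be sketched as a small state machine. The source does not say how many consecutive samples are required or what the initial state is, so `hold` and `initial` below are assumptions, as are the threshold values used in the example.

```python
def classify(value, active, micro):
    """Raw three-way comparison against the activity and micro-motion thresholds."""
    if value >= active:
        return "dynamic"
    if value <= micro:
        return "steady"
    return "micro"


def gate(values, active, micro, hold=2, initial="steady"):
    """Update the state flag only after `hold` consecutive samples agree on a new state."""
    state, candidate, run = initial, None, 0
    flags = []
    for value in values:
        if value is None:
            # Undeterminable sampling point: carry the previous flag forward.
            flags.append(state)
            continue
        raw = classify(value, active, micro)
        if raw == state:
            candidate, run = None, 0
        elif raw == candidate:
            run += 1
        else:
            candidate, run = raw, 1
        if candidate is not None and run >= hold:
            state, candidate, run = candidate, None, 0
        flags.append(state)
    return flags


# Hypothetical assessment values; None marks a frame with missing joints.
series = [0.1, 0.5, 0.1, 0.5, 0.5, None, 0.25, 0.25]
flags = gate(series, active=0.4, micro=0.2, hold=2)
```

The isolated spike at index 1 does not change the flag, while the two consecutive high values at indices 3 and 4 switch it to dynamic.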

[0037] The specific formula for calculating the joint motion intensity assessment value is as follows:

[0038] $$E_t = \sum_{i=1}^{N} m_i \left(1 - \frac{1}{1 + \lVert \mathbf{v}_{i,t} \rVert^{2}}\right) \lVert \mathbf{v}_{i,t} \rVert^{2}$$;

[0039] In the formula, $E_t$ represents the joint motion intensity assessment value at acquisition timestamp t; $N$ represents the total number of key joint nodes of the human body; $i$ represents the index of the i-th key joint node; $m_i$ represents the joint mass weight corresponding to the i-th key joint node; and $\mathbf{v}_{i,t}$ represents the three-dimensional velocity vector of the i-th key joint node of the human body at acquisition timestamp t.
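The calculation described in [0036] (finite-difference velocity, velocity adjustment factor, weighted sum over joints) can be sketched as follows. The function names, joint names, weights and coordinates are illustrative assumptions; only the arithmetic follows the description.

```python
def velocity(p_prev, p_curr, dt):
    """3D velocity vector from coordinates at adjacent acquisition timestamps."""
    return tuple((c - p) / dt for p, c in zip(p_prev, p_curr))


def adjustment_factor(speed_sq):
    """Velocity adjustment factor: one minus the reciprocal of (1 + squared speed)."""
    return 1.0 - 1.0 / (1.0 + speed_sq)


def motion_intensity(prev, curr, dt, weights):
    """Sum of weight * adjustment factor * squared speed over all key joints.

    Returns None when any joint lacks a coordinate record, which marks the
    timestamp as an undeterminable sampling point.
    """
    total = 0.0
    for joint, weight in weights.items():
        if joint not in prev or joint not in curr:
            return None
        v = velocity(prev[joint], curr[joint], dt)
        speed_sq = sum(c * c for c in v)  # squared modulus, no square root needed
        total += weight * adjustment_factor(speed_sq) * speed_sq
    return total


# Hypothetical two-joint example: the wrist moves 1 m in 1 s, the neck is still.
weights = {"wrist": 2.0, "neck": 1.0}
prev = {"wrist": (0.0, 0.0, 0.0), "neck": (0.0, 0.0, 1.5)}
curr = {"wrist": (1.0, 0.0, 0.0), "neck": (0.0, 0.0, 1.5)}
value = motion_intensity(prev, curr, dt=1.0, weights=weights)
```

For the wrist the squared speed is 1, the adjustment factor is 0.5, and the weighted motion amount is 2.0 x 0.5 x 1 = 1.0; the stationary neck contributes zero.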

[0040] In this embodiment, Table 1 shows the joint motion intensity assessment values and related input data for the five acquisition timestamps within the verification window. At acquisition timestamp 1, the total number of key joint nodes is 3; the weighted motion amounts of the 1st, 2nd and 3rd key joint nodes are 0.1492, 0.0743 and 0.0366, and the joint motion intensity assessment value is 0.2600. At acquisition timestamp 2, the total number of key joint nodes is 3; the weighted motion amounts of the 1st, 2nd and 3rd key joint nodes are 0.2017, 0.1173 and 0.0533, and the joint motion intensity assessment value is 0.3723. At acquisition timestamp 3, the total number of key joint nodes is 3; the weighted motion amounts of the 1st, 2nd and 3rd key joint nodes are 0.0583, 0.0461 and 0.0290, and the joint motion intensity assessment value is 0.1335. At acquisition timestamp 4, the total number of key joint nodes is 3; the weighted motion amounts of the 1st, 2nd and 3rd key joint nodes are 0.2572, 0.1650 and 0.1364, and the joint motion intensity assessment value is 0.5586. At acquisition timestamp 5, the total number of key joint nodes is 3; the weighted motion amounts of the 1st, 2nd and 3rd key joint nodes are 0.1009, 0.0644 and 0.0447, and the joint motion intensity assessment value is 0.2100. The data in Table 1 quantifies and compares the motion contribution of each key joint node, under the joint mass weight and velocity adjustment factor, at different acquisition timestamps. It also provides the direct input for comparing the joint motion intensity assessment value with the activity threshold and micro-motion threshold to determine the state flag and trigger posture determination, close-range interaction detection, and micro-motion behavior recognition.

[0041] Table 1. Joint Movement Intensity Assessment Data Table

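[0042]

Acquisition timestamp   Key joint nodes   Weighted motion amount (node 1 / node 2 / node 3)   Joint motion intensity assessment value
1                       3                 0.1492 / 0.0743 / 0.0366                             0.2600
2                       3                 0.2017 / 0.1173 / 0.0533                             0.3723
3                       3                 0.0583 / 0.0461 / 0.0290                             0.1335
4                       3                 0.2572 / 0.1650 / 0.1364                             0.5586
5                       3                 0.1009 / 0.0644 / 0.0447                             0.2100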

[0043] Figure 3 compares the joint motion intensity assessment values with the state flag determination results for the five acquisition timestamps. The bar chart uses the acquisition timestamp as the x-axis and the joint motion intensity assessment value as the y-axis. The height of each bar is the joint motion intensity assessment value at the corresponding acquisition timestamp, accumulated from the key joint mass weight and the square of the three-dimensional velocity vector magnitude after adjustment by the velocity adjustment factor. Red bars indicate acquisition timestamps determined to be in a dynamic state, orange bars indicate timestamps determined to be in a micro-motion state, and blue bars indicate timestamps determined to be in a steady state. A green dashed line marks the activity threshold and a purple dashed line marks the micro-motion threshold, so that the assessment value at each timestamp can be compared directly with the threshold boundaries. The corresponding joint motion intensity assessment value is labeled at the top of each bar. Specifically, the joint motion intensity assessment values at timestamps 3 and 5 are both below the micro-motion threshold, are drawn as blue bars, and are determined to be in a steady state; the value at timestamp 1 lies between the micro-motion threshold and the activity threshold, is drawn as an orange bar, and is determined to be in a micro-motion state; the values at timestamps 2 and 4 are both above the activity threshold, are drawn as red bars, and are determined to be in a dynamic state. By combining the dashed threshold lines, the three bar colors, and the numerical labels at the top of the bars, Figure 3 presents the joint motion intensity assessment values and the state flag determination results at the different acquisition timestamps in a structured manner. This provides a directly applicable basis for the subsequent posture determination, close-range interaction detection, and micro-motion behavior recognition driven by the state flags.
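The mapping from weighted motion amounts to state flags described above can be sketched as follows. This is a minimal illustration, not the claimed implementation: the weighted motion amounts are the values reported for Table 1, while the two numeric thresholds are assumptions, since the text does not state them. They were chosen only so that the resulting flags agree with the description of Figure 3.

```python
# Weighted motion amounts per key joint node, as reported for Table 1.
WEIGHTED_MOTION = {
    1: [0.1492, 0.0743, 0.0366],
    2: [0.2017, 0.1173, 0.0533],
    3: [0.0583, 0.0461, 0.0290],
    4: [0.2572, 0.1650, 0.1364],
    5: [0.1009, 0.0644, 0.0447],
}

# ASSUMED values: the text gives no numeric thresholds.
MICRO_MOTION_THRESHOLD = 0.25
ACTIVITY_THRESHOLD = 0.30


def intensity(weighted_amounts):
    """Joint motion intensity assessment value: sum over key joint nodes."""
    return sum(weighted_amounts)


def state_flag(value):
    """Map an assessment value to one of the three state flags."""
    if value > ACTIVITY_THRESHOLD:
        return "dynamic"
    if value >= MICRO_MOTION_THRESHOLD:
        return "micro-motion"
    return "steady"


flags = {ts: state_flag(intensity(w)) for ts, w in WEIGHTED_MOTION.items()}
for ts, flag in flags.items():
    print(ts, round(intensity(WEIGHTED_MOTION[ts]), 4), flag)
```

Summing the three listed amounts reproduces the reported assessment values to within rounding of the fourth decimal place.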

[0044] In this implementation scheme, the three-dimensional coordinate sequence of the human skeleton spatiotemporal graph is uniformly converted into joint motion intensity assessment values and further mapped into status flags. Along the same acquisition timestamp chain, this forms a quantifiable, alignable, and verifiable benchmark for judging motion intensity. The dynamic migration stage, the local micro-motion stage, and the steady-state maintenance stage therefore have clear boundaries and stable transition relationships in the numerical domain. This reduces the risk of gating judgment drift caused by acquisition jitter, single-frame jumps, and inconsistent time intervals, improves the consistency and traceability of status flags over consecutive acquisition timestamps, and provides a directly referenceable judgment basis and a unified intensity scale for the subsequent generation of behavioral health logs driven by status flags.

[0045] Specifically, the steps for triggering posture determination, close-range interaction detection, and micro-motion behavior recognition when the gating conditions are met, and for generating a behavior health log containing behavior type identifiers, time intervals, and spatial region identifiers, are as follows: When the status flag is steady state or micro-motion, the three-dimensional coordinates of the hip joint and the neck joint are read from the three-dimensional coordinate sequence of key human joints according to the collection timestamp. The spatial vector pointing from the hip joint to the neck joint is calculated by vector difference and normalized to eliminate the influence of individual height differences on direction determination. Simultaneously, the gravitational acceleration at the corresponding acquisition timestamp is extracted as a spatial direction reference. Low-pass smoothing is applied to the gravitational acceleration to suppress instantaneous jitter, and the smoothed gravitational acceleration is then direction-normalized to form a unit gravity reference vector. The cosine of the angle between the spatial vector and the unit gravity reference vector is calculated from the vector dot product, and the inverse cosine of that value gives the spatial angle. The spatial angle is compared with the posture determination angle ranges to determine the human posture as standing, sitting, supine, or transitional. The specific process is as follows: First, a lower limit and an upper limit are preset for the standing posture angle to form a standing posture angle range. Then, a lower limit and an upper limit are preset for the sitting posture angle to form a sitting posture angle range.
Similarly, a lower limit and an upper limit are preset for the supine posture angle to form a supine posture angle range. Range inclusion determination is performed on the spatial angles in collection timestamp order: when the spatial angle falls within the standing posture angle range, the human posture is determined to be standing; when it falls within the sitting posture angle range, the human posture is determined to be sitting; when it falls within the supine posture angle range, the human posture is determined to be supine; and when it falls within none of the three ranges, the human posture is determined to be a transitional posture. To suppress posture jumps caused by single-frame jitter, a continuity check is performed on the posture determination results of adjacent acquisition timestamps. When the same posture determination result persists over N consecutive acquisition timestamps, with N not less than the posture retention threshold, it is output as the final human posture type; when the continuity check is not satisfied, the human posture type of the previous acquisition timestamp is kept unchanged. The recommended value of N is three to seven. When the status flag is micro-motion and the human posture is standing or sitting, the 3D coordinates of the elbow joint node and the wrist joint node are read in acquisition timestamp order. The forearm length is calculated as the Euclidean distance between them and is updated with the median over the time window to obtain a stable forearm length reference value. A near-person interaction field region is constructed in 3D space with the neck joint node as the center and the forearm length reference value multiplied by a scale coefficient as the radius.
The scale coefficient is limited to the range of 0.6 to 1.2 to balance near-person interaction coverage against false-trigger suppression across different body types. After the near-person interaction field region is constructed, the 3D coordinates of the left and right wrist joint nodes are monitored in real time, in acquisition timestamp order, to check whether they fall within the boundary of the near-person interaction field region. The number of consecutive acquisition timestamps on which a wrist falls inside the region is counted to confirm sustained entry, avoiding false entry marks caused by short-term wrist tremors. When a hand is detected entering the near-person interaction field region, the room video frame is read and the 3D boundary points of the region are projected onto the room video frame based on the camera imaging model to obtain the projection area. Boundary clipping is performed on the projection area to obtain a local video frame crop. Object detection is performed only on this crop, and the category of each detected interactive object is association-marked. The association mark is bound to the acquisition timestamp and the camera device installation area identifier so that subsequent behavior localization stays consistent with its spatial source. The dwell time of an interactive object is obtained by counting how long its category persists over consecutive acquisition timestamps, and the dwell time threshold is limited to a range of one to five seconds to accommodate the difference between everyday pick-and-place actions and short-term dwelling behaviors. When an interactive object is detected and its dwell time exceeds the dwell time threshold, the video frames, depth maps, and 3D coordinate sequences of key human joints within the corresponding time window are extracted in acquisition timestamp order.
Closed-interval recording is performed on the start and end acquisition timestamps of the extraction time window to form the behavior time interval. The topological association between key joint nodes is constructed according to the anatomical connection relationship of human joints. The topological association is written to a connection table using fixed joint numbers, so that the connection relationship does not drift when individual joints are missing. The 3D coordinate sequence of key human joints is then structured by combining this topology with the temporal association of the same key joint node across adjacent acquisition timestamps. The structured organization is a tensor representation with a fixed arrangement by joint number and a fixed alignment by acquisition timestamp order, which yields the temporal structured joint representation used for behavior determination. Missing-joint interpolation is also performed on the temporal structured joint representation to keep the input dimension constant. Multimodal health monitoring data with established historical behavior types are then read. Based on the collection timestamp intervals of the historical data, the historical video frames, depth maps, and 3D coordinate sequences of key human joints that match the current time window are extracted to construct a historical temporal structured joint representation, which serves as the training input. The corresponding micro-motion behavior type identifiers serve as supervision labels. A behavior recognition model is built and trained using a spatiotemporal graph convolutional network algorithm. During training, category-balanced sampling is performed on the micro-motion behavior type identifiers to avoid recognition bias caused by an excessively high proportion of common actions.
The temporal structured joint representations generated within the current time window are input into the trained behavior recognition model, outputting the corresponding behavior type identifiers. A majority vote consistency check is performed on the output behavior type identifiers within the time window to suppress single-frame jumps. The behavior type identifiers are associated with the corresponding collection timestamp intervals, human posture types, status flags, and camera installation area identifiers. Furthermore, the interactive object category and fall warning sign are written into the same associated record to form a unified log entry, generating a behavior health log. Each record in the behavior health log is bound to a unique identifier to ensure individual consistency during cross-day tracing.
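The posture determination and continuity check described in this paragraph can be sketched as follows. This is an illustrative sketch under stated assumptions: the three angle ranges are hypothetical placeholders (the text only says they are preset), gravity is assumed to be given as a vector pointing downward, and the low-pass smoothing of the gravity signal is omitted for brevity.

```python
import math

# ASSUMED angle ranges in degrees, tested as lower < angle <= upper.
POSTURE_RANGES = {
    "standing": (160.0, 180.0),
    "sitting": (130.0, 160.0),
    "supine": (70.0, 110.0),
}


def unit(v):
    """Normalize a 3D vector to unit length."""
    norm = math.sqrt(sum(c * c for c in v))
    return tuple(c / norm for c in v)


def spatial_angle(hip, neck, gravity):
    """Angle in degrees between the hip-to-neck vector and gravity."""
    torso = unit(tuple(n - h for n, h in zip(neck, hip)))
    g = unit(gravity)
    cos_a = sum(a * b for a, b in zip(torso, g))
    cos_a = max(-1.0, min(1.0, cos_a))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(cos_a))


def classify(angle):
    """Range inclusion test; anything outside all ranges is transitional."""
    for name, (lower, upper) in POSTURE_RANGES.items():
        if lower < angle <= upper:
            return name
    return "transitional"


def continuity_check(raw_labels, n=3):
    """Output a posture only after it has persisted for n consecutive
    timestamps; otherwise keep the previously output posture."""
    out, current, run_label, run_len = [], None, None, 0
    for label in raw_labels:
        if label == run_label:
            run_len += 1
        else:
            run_label, run_len = label, 1
        if run_len >= n:
            current = run_label
        out.append(current)
    return out
```

For example, a single "supine" frame inside a run of "standing" frames does not change the output of `continuity_check` when `n` is 3, which is the single-frame jitter suppression the text describes.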

[0046] In this implementation plan, a traceable behavioral health log is formed by using the three-dimensional coordinate sequence of key human joints, video frames of the room, depth map frames of the room, and acceleration in the direction of gravity under the same collection timestamp constraint. This establishes a stable one-to-one correspondence between behavior type identifiers, time intervals, and spatial area identifiers. Thus, even in the home-based working conditions of elderly people living alone, where daily activities are characterized by short pauses, frequent subtle movements, and dense spatial location changes, the behavioral judgment boundaries remain clear and the recording structure remains consistent. This avoids semantic drift of log entries due to changes in perspective and segmentation of action fragments. Furthermore, it provides a standardized behavioral evidence chain that can be directly reused for subsequent time window screening of eating behavior, generation of dietary intake records, and traversal of health knowledge graph paths, thereby improving the reliability of cross-scenario comparisons and cross-day continuous assessments.
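The near-person interaction field check that feeds these log entries can be sketched as follows. This is a minimal sketch under assumptions: the function names, the default scale coefficient, and the `min_consecutive` entry count are hypothetical; only the 0.6 to 1.2 range of the scale coefficient and the median-based forearm reference come from the description.

```python
import math
from statistics import median


def forearm_reference(elbows, wrists):
    """Median elbow-to-wrist Euclidean distance over a time window."""
    return median(math.dist(e, w) for e, w in zip(elbows, wrists))


def entered_field(neck, wrist_track, forearm_ref, scale=0.9, min_consecutive=3):
    """Return True once a wrist stays inside the spherical field, centered
    on the neck joint, for min_consecutive timestamps in a row."""
    if not 0.6 <= scale <= 1.2:
        raise ValueError("scale coefficient must lie in [0.6, 1.2]")
    radius = scale * forearm_ref
    run = 0
    for wrist in wrist_track:
        # Reset the counter whenever the wrist leaves the field.
        run = run + 1 if math.dist(neck, wrist) <= radius else 0
        if run >= min_consecutive:
            return True
    return False
```

Resetting the counter on every exit is what keeps a brief wrist tremor across the boundary from being recorded as an entry.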

[0047] Specifically, the steps for filtering the eating behavior time window based on the behavioral health log, and for confirming the dish category corresponding to the eating behavior by combining the video frames and depth map pixel distance values collected by the dining table camera device with the food image recognition results, are as follows: The behavioral health log is read, eating-related behavior records are filtered and verified according to the behavior type identifier in the log, and the behavior records whose behavior type identifier meets the eating association constraint are retained as candidate records. To provide a basis for identifying eating-related behavior records within the areas corresponding to the dining table installation area identifier and the kitchen food storage device identifier, the mapping relationship between the spatial area identifier and the camera device's field-of-view coordinates is established first. After the room camera device is installed, the camera installation area identifier is used as an index to record the extrinsic and intrinsic calibration results of the camera. The extrinsic calibration results provide the pose of the camera coordinate system relative to the room floor coordinate system, while the intrinsic calibration results provide the mapping from pixel coordinates to imaging rays. In the room floor coordinate system, the area boundaries corresponding to the dining table installation area and the kitchen food storage device are defined respectively. Each area boundary is represented by a closed polygon vertex sequence, which is obtained from on-site measurements and written into the area boundary data using a uniform length unit.
Any point on the area boundary polygon within the room floor coordinate system is taken as a 3D point. This 3D point is transformed into the camera coordinate system of the corresponding camera installation area using the extrinsic calibration relationship and is then projected to pixel coordinates through the intrinsic calibration relationship, giving the projected boundary of the area within the room video frame. A dining table attention area mask is generated from the projected boundary for the dining table installation area identifier, and a kitchen attention area mask is generated from the projected boundary for the kitchen food storage device identifier. This ensures that the spatial area identifiers in the behavioral health log have a visual coordinate basis that can be retrieved directly within the room video frame. Regional consistency verification is performed on the spatial area identifiers of the candidate records in collection timestamp order. Candidate records whose spatial area identifiers meet the occupancy ratio threshold within the dining table attention area mask and the kitchen attention area mask are extracted; the occupancy ratio threshold is set to twenty to sixty percent. The two types of candidate records are merged in collection timestamp order to form an eating candidate sequence. The collection timestamp difference between adjacent candidate records in the eating candidate sequence is calculated, and adjacent candidate records are merged into the same eating behavior segment when the difference is not greater than the window interval threshold, which is one to ten seconds. The segment start and segment end collection timestamps are extracted for each eating behavior segment to generate an eating behavior time window sequence, and window duration constraint verification is performed on each eating behavior time window.
The window duration constraint is twenty to twelve hundred seconds. Eating behavior time windows that do not meet the duration constraint are determined to be invalid windows and removed. For each eating behavior time window, the video frames captured by the dining table camera in the multimodal health monitoring data are read in collection timestamp order. Based on the collection timestamp, the corresponding dining table depth map frame is read at the same sampling point, and the pixel distance values of the depth map are extracted. Distortion correction and brightness normalization are first performed on the video frames. Then, based on the dining table installation area identifier, the dining table area of interest is determined within the video frame, and target detection is performed only on the area of interest. A target detection algorithm is used to identify the food name identifiers in the video frames. A set of food name identifiers is output for each collection timestamp and written into the food detection sequence. Finally, the food detection sequence is deduplicated and merged in collection timestamp order to obtain the set of food name identifiers, that is, ingredient names, detected within the current eating behavior time window. The video frames captured by the dining table camera are also input into a deep learning-based food image recognition model. Before input, the dining table area of interest is cropped to a uniform resolution and pixel normalization is performed. The model outputs a set of candidate dish category identifiers for the video frames and writes it into a candidate dish sequence. For the same eating behavior time window, the set of ingredients corresponding to the candidate dish category identifiers is read from the dish ingredient correspondence table and compared with the set of ingredient name identifiers detected within the current eating behavior time window.
The dish ingredient correspondence table provides deterministic association constraints between dish category identifiers and ingredient name identifiers. It includes a dish category identifier field, an ingredient name identifier field, a staple food identifier field, a seasoning identifier field, an optional side dish identifier field, a version number field, and an update timestamp field. The table is indexed by dish category identifier and versioned using the version number field. The table is obtained as follows: A dictionary of dish category identifiers and a dictionary of ingredient names are pre-maintained in the cloud database. The standard recipe ingredient names corresponding to each dish category identifier are written into the staple food identifier field. The core side dish ingredient names strongly correlated with the dish category identifier are written into the ingredient name identifier field. Seasoning-related ingredient names are written into the seasoning identifier field. Side dish ingredient names that may be absent without affecting the dish category determination are written into the optional side dish identifier field. When the dish category identifier dictionary is updated, a new version number is generated and the update timestamp field is written. On the local side, during the low-load period from midnight to 6 a.m. every day, the dish ingredient correspondence table with the latest version number is pulled through the health record database interface and written to the local cache. Consistency of the version number field is used to ensure the stability of the query relationship.
During the comparison process, the staple food identifier field corresponding to the candidate dish category identifier is first merged with the ingredient name identifier field to obtain the necessary ingredient set. The necessary ingredient set is then intersected with the detected ingredient name identifier set to obtain the matching ingredient set. When the number of elements in the matching ingredient set is not less than the ingredient matching quantity threshold, the corresponding candidate dish category identifier is retained as a valid dish category identifier. The ingredient matching quantity threshold is set to one to three. The continuous appearance duration of each valid dish category identifier within the candidate dish sequence is then calculated as the sum of the timestamp differences between adjacent candidate dish category identifiers of the same category; the threshold for treating adjacent same-category appearances as continuous ranges from 0.5 to 5 seconds. The dish category identifier with the longest continuous appearance duration within the time window is determined to be the target dish category identifier.
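The ingredient comparison and target dish selection described above can be sketched as follows. The table layout, field names, and function names are hypothetical. The sketch also assumes that the continuous appearance duration is accumulated over all qualifying adjacent pairs of the same dish, which is one possible reading of the description.

```python
def valid_dishes(candidates, table, detected, min_matches=2):
    """Keep candidate dishes whose necessary ingredient set shares at least
    min_matches elements with the detected ingredient name set."""
    valid = []
    for dish in candidates:
        entry = table[dish]
        necessary = set(entry["staple"]) | set(entry["ingredients"])
        if len(necessary & detected) >= min_matches:
            valid.append(dish)
    return valid


def target_dish(sequence, valid, max_gap=2.0):
    """sequence: time-ordered list of (timestamp_seconds, dish_id).
    Sum the gaps between adjacent entries of the same valid dish when the
    gap does not exceed max_gap, then return the dish with the largest sum."""
    durations = {dish: 0.0 for dish in valid}
    for (t0, d0), (t1, d1) in zip(sequence, sequence[1:]):
        if d0 == d1 and d0 in durations and t1 - t0 <= max_gap:
            durations[d0] += t1 - t0
    return max(durations, key=durations.get) if durations else None


# Hypothetical example data, for illustration only.
table = {
    "tomato_egg": {"staple": ["egg"], "ingredients": ["tomato"]},
    "mapo_tofu": {"staple": ["tofu"], "ingredients": ["minced_pork"]},
}
detected = {"egg", "tomato", "rice"}
valid = valid_dishes(["tomato_egg", "mapo_tofu"], table, detected)
```

Because the decision rests on accumulated duration rather than on any one frame, a single misrecognized frame cannot change the target dish on its own.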

[0048] In this implementation plan, by uniformly constraining the mapping relationship between the dining table installation area identifier, the kitchen food storage device identifier, and the camera device installation area identifier, and by introducing collection timestamp continuity rules, window duration constraints, and food name identifier set consistency comparison rules during the construction of the eating behavior time window, the eating-related behavior records can form a stable closed loop under the three constraints of spatial boundary, temporal boundary, and semantic boundary. This makes the determination of the target dish category identifier no longer dependent on the instantaneous fluctuation of the single frame recognition result, but dominated by the statistical results of the continuous occurrence duration within the eating behavior time window. This improves the verifiability of dietary intake records within the same elderly person's verification window on different dates, ensures that the structured risk tuple set generated by the subsequent health knowledge graph path traversal triggered by dietary intake records has a consistent data entry point, and reduces the risk of mis-entry of eating events into the database due to area definition drift.
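The collection timestamp continuity rule and the window duration constraint mentioned here can be sketched as follows. The function name and default values are assumptions, chosen from within the ranges given in the description (window interval threshold of one to ten seconds, window duration of twenty to twelve hundred seconds).

```python
def merge_windows(timestamps, gap_threshold=5.0, min_duration=20.0, max_duration=1200.0):
    """Merge candidate-record timestamps (seconds) into eating behavior time
    windows, then drop windows that violate the duration constraint."""
    windows = []
    start = prev = None
    for t in sorted(timestamps):
        if start is None:
            start = prev = t
        elif t - prev <= gap_threshold:
            prev = t  # still inside the same eating behavior segment
        else:
            windows.append((start, prev))
            start = prev = t
    if start is not None:
        windows.append((start, prev))
    return [(s, e) for s, e in windows if min_duration <= e - s <= max_duration]
```

With candidate records every 4 seconds from 0 to 40 seconds and two stray records at 100 and 103 seconds, the first group forms one valid window and the second is discarded as too short.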

[0049] Specifically, the steps for generating dietary intake records based on 3D point cloud volume calculation and category density mapping are as follows: The video frames captured by the dining table camera within the corresponding eating behavior time window, together with their depth map pixel distance values, are read sequentially in acquisition timestamp order. The video frames carry the dining table camera identifier and the dining table installation area identifier and are stored bound to the acquisition timestamp. The depth map pixel distance values are associated with the same acquisition timestamp as the video frames so that subsequent pixel-level geometric calculations are performed on synchronized data. Background region subtraction is performed on the video frames to separate the food region. In the background region subtraction, the field-of-view projection area corresponding to the dining table installation area identifier is used as the operation boundary. Foreground identification is performed on the pixels within this operation boundary, and a binary mask for the food region is generated. This mask limits subsequent pixel back-projection to the pixels within the food region, preventing non-food points from being mixed in through reflections at tableware edges and table textures. Based on the depth map pixel distance values, a pixel back-projection algorithm maps the food region pixels to a 3D point set, generating the 3D point cloud data of the food region. During pixel back-projection, the imaging intrinsic parameters of the dining table camera device are used as the mapping reference. The pixel coordinates of the food region pixels and the corresponding depth map pixel distance values are substituted into the camera imaging model to calculate the 3D point coordinates.
A validity filter is applied to the depth map pixel distance values to remove zero-value distance points, saturated distance points, and discontinuous jump distance points, which ensures the spatial connectivity of the 3D point cloud data. The AlphaShape algorithm is used to construct a 3D bounding structure for the food point cloud. During construction, the AlphaShape radius scale is adaptively determined from the point cloud density, and hole closure is performed on the bounding structure to avoid volume underestimation due to occlusion gaps. The food volume is calculated by volume integration, using the triangular mesh of the 3D bounding structure as the integration boundary for a closed-body volume calculation, and the volume result is bound to the acquisition timestamp. The category density parameter corresponding to the target dish category identifier is read. The category density parameter is derived from the category density field in the dish ingredient correspondence table, which provides a unique mapping to the target dish category identifier, and is used to convert volume into weight. The food weight is obtained by multiplying the food volume by the category density parameter. Based on the nutritional component mapping relationship corresponding to the target dish category identifier, the nutritional intake data corresponding to the eating behavior are calculated. The nutritional component mapping relationship uses the target dish category identifier as the key to index the per-unit-weight parameters for energy value, protein content, fat content, carbohydrate content, and sodium content, which are then multiplied by the food weight to obtain the nutritional intake data.
The collection timestamp, target dish category identifier, food volume, food weight, nutritional intake data, and eating behavior time window are associated to generate and output the diet intake record.
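The pixel back-projection step in the procedure above can be sketched with a standard pinhole camera model. This is a simplified sketch under assumptions: the intrinsic parameter names (`fx`, `fy`, `cx`, `cy`), the `max_range` saturation limit, and the dictionary-based depth lookup are hypothetical, and the jump-distance filter and AlphaShape reconstruction are not shown.

```python
def back_project(pixels, depth, fx, fy, cx, cy, max_range=4.0):
    """Map food-region pixels to 3D points in the camera frame.

    pixels: iterable of (u, v) pixel coordinates inside the food mask
    depth:  dict mapping (u, v) to a distance in metres
    """
    points = []
    for (u, v) in pixels:
        z = depth.get((u, v), 0.0)
        # Validity filter: drop zero-value and saturated distance points.
        if z <= 0.0 or z >= max_range:
            continue
        x = (u - cx) * z / fx
        y = (v - cy) * z / fy
        points.append((x, y, z))
    return points
```

A pixel at the principal point maps onto the optical axis, and pixels with no valid depth are simply skipped, so they never enter the point cloud.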

[0050] In this implementation scheme, the video frames and depth map pixel distance values captured by the dining table camera device are strongly bound at the acquisition timestamp level, and the computation boundary is defined by the dining table installation area identifier, so that the three-dimensional geometric reconstruction of the food area forms a traceable spatial expression under the same time and field-of-view constraints. This ensures that food volume, food weight, and nutrient intake data share a consistent data source path and a consistent quantification scale within the same dietary intake record, avoiding fluctuations in intake caused by field-of-view drift, non-food point contamination, and volume underestimation due to gaps. This provides a stable, comparable, and verifiable input basis for subsequent nutritional status assessment and dietary adjustment decisions based on dietary intake records.
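The conversion from food volume to weight and nutrient intake within one dietary intake record is simple arithmetic, sketched below. The units (cubic centimetres, grams per cubic centimetre, per-gram nutrient contents) and all field names are assumptions, as the description does not fix them.

```python
def dietary_intake_record(dish_id, volume_cm3, density_g_per_cm3, per_gram, window):
    """Build one dietary intake record from volume, category density, and
    per-unit-weight nutrient parameters."""
    weight_g = volume_cm3 * density_g_per_cm3
    nutrients = {name: amount * weight_g for name, amount in per_gram.items()}
    return {
        "dish_category_id": dish_id,
        "eating_window": window,
        "food_volume_cm3": volume_cm3,
        "food_weight_g": weight_g,
        "nutrient_intake": nutrients,
    }


# Hypothetical values, for illustration only.
record = dietary_intake_record(
    "dish_001", 200.0, 0.5, {"energy_kcal": 1.5, "protein_g": 0.25}, (0.0, 600.0)
)
```

In this example a 200 cubic centimetre portion at an assumed density of 0.5 grams per cubic centimetre weighs 100 grams, so every per-gram parameter is scaled by 100.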

[0051] Specifically, the steps for constructing a health knowledge graph based on dietary intake records and historical medical records, and for generating a structured risk tuple set through path traversal, are as follows: The dietary intake records are read, and the dish category identifier, nutritional intake data, and time window corresponding to the eating behavior are extracted. Unique location verification is performed on the dietary intake records using the elderly person's identity identifier and the start timestamp of the eating behavior time window as a joint index, to avoid mixing records across persons and concatenating records across windows. The dish category identifier is mapped to food entity nodes in the health knowledge graph, and coding consistency verification is performed based on the one-to-one correspondence between the dish category identifier and the food entity node code; any mismatched dish category identifier undergoes synonym normalization and is written back, to keep the risk source location field consistent. Nutritional intake data are mapped to nutrient entity nodes, and the nutrient name identifier in the nutritional intake data is aligned with the nutrient entity node encoding field. After the unit of measurement identifier is written into the nutritional intake data, unit consistency verification is performed to ensure the comparability of nutritional intake data across different eating behavior time windows. The behavior type identifier corresponding to the eating behavior time window is mapped to behavior event nodes, and the behavior event nodes are bound to the start and end timestamps of the eating behavior time window to support window backtracking during subsequent path traversal. The historical medical records associated with the current elderly person's identity are retrieved through a health record database interface.
During the retrieval process, the elderly person's identity is used as the search key, and the number of returned records is checked to avoid medical record drift caused by merging multiple identities. Disease diagnosis records and long-term medication records are read from the historical medical records. For disease diagnosis records, the diagnostic code format is checked, and time validity is filtered, marking disease diagnosis records exceeding the preset validity period as historical background records. For long-term medication records, drug name identifiers are normalized, and the start and end time intervals of medication are extracted. Time overlap checks are performed between the medication start and end time intervals and the eating behavior time window to determine valid medication records for risk assessment. Disease diagnosis records are mapped to disease entity nodes, and the diagnostic code is aligned one-to-one with the disease entity node code. Simultaneously, the diagnosis timestamp is written into the disease entity node's associated attributes. Long-term medication records are mapped to medication entity nodes, and the drug name identifier is aligned with the medication entity node code. Simultaneously, the medication start and end time intervals are written into the medication entity node's associated attributes. Based on the mapping results, node relationships are established in the health knowledge graph. Food entity nodes are associated with nutrient entity nodes through nutrient composition relationships, behavioral event nodes are associated with food entity nodes through intake relationships, disease entity nodes are associated with nutrient entity nodes through contraindication relationships, medication entity nodes are associated with nutrient entity nodes through interaction relationships, and medication entity nodes are associated with behavioral event nodes through compliance constraint relationships. 
Each relationship is written with a relationship type identifier and a source data identifier to support risk traceability and verification. In the health knowledge graph, path traversal operations are performed based on entity nodes and their relationships. Traversal starts from the behavior event node, and the maximum number of hops and the set of allowed relationship types are limited, to avoid false triggers caused by expansion along unrelated paths. When a path linking a disease entity node to a nutrient entity node is detected, it is further verified whether a nutrient component relationship path also exists between that nutrient entity node and a food entity node associated with the current eating behavior time window. If the verification passes, a nutrient conflict risk identifier is generated, and the codes of the disease entity node, nutrient entity node, and food entity node in the triggering path are extracted as risk cause location information. When a path linking a medication entity node to a behavior event node is detected, it is further verified that the time overlap between the medication start and end time interval and the eating behavior time window meets the valid overlap condition. When the valid overlap condition is met, a behavioral compliance risk identifier is generated, and the medication entity node code and the joint index information of the eating behavior time window are written into the risk cause location information. When any risk identifier is detected, a set of structured risk tuples is generated based on the association path that triggered the risk. Each structured risk tuple contains a risk type identifier, a risk cause identifier, and a risk source dish category identifier. The risk type identifier is used to distinguish between nutritional conflict risk identifiers and behavioral compliance risk identifiers.
The risk cause identifier is used to carry the combined expression of disease entity node code, medication entity node code, and nutrient entity node code. The risk source dish category identifier is used to locate the corresponding dish category identifier after tracing back from the behavioral event node to the eating behavior time window and to keep it consistent with the dietary intake record field, so as to support the generation and reference of subsequent nutritional decisions under structured slot constraints.
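
The bounded traversal described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the node codes, the relation names (`intake`, `nutrient_composition`, `contraindication`, `interaction`), the `RiskTuple` class, and the use of the food node as a stand-in for the risk source dish category identifier are all assumptions made for illustration. Edges are treated as undirected during traversal so that a path can run from the behavioral event node through a food node and a nutrient node to a disease node.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical edge list: (source node, relation type, target node).
EDGES = [
    ("event:meal_001", "intake", "food:braised_pork"),
    ("food:braised_pork", "nutrient_composition", "nutrient:sodium"),
    ("disease:I10", "contraindication", "nutrient:sodium"),
    ("drug:warfarin", "interaction", "nutrient:vitamin_k"),
]


@dataclass(frozen=True)
class RiskTuple:
    risk_type: str     # risk type identifier
    risk_cause: str    # combined disease / nutrient codes
    source_dish: str   # risk source (food node used as a stand-in)


def build_adjacency(edges):
    """Index edges in both directions so traversal ignores edge direction."""
    adj = {}
    for src, rel, dst in edges:
        adj.setdefault(src, []).append((rel, dst))
        adj.setdefault(dst, []).append((rel, src))
    return adj


def bounded_paths(adj, start, max_hops, allowed):
    """Yield simple paths from start that use only allowed relation types."""
    queue = deque([(start,)])
    while queue:
        path = queue.popleft()
        if len(path) > 1:
            yield path
        if len(path) - 1 == max_hops:
            continue  # hop limit reached, do not expand further
        for rel, nxt in adj.get(path[-1], []):
            if rel in allowed and nxt not in path:
                queue.append(path + (nxt,))


def nutrient_conflict_risks(edges, event, max_hops=3):
    """Emit a risk tuple for each event -> food -> nutrient -> disease path."""
    adj = build_adjacency(edges)
    allowed = {"intake", "nutrient_composition", "contraindication"}
    risks = set()
    for path in bounded_paths(adj, event, max_hops, allowed):
        kinds = [node.split(":")[0] for node in path]
        if kinds == ["event", "food", "nutrient", "disease"]:
            _, food, nutrient, disease = path
            risks.add(RiskTuple("nutrient_conflict", f"{disease}|{nutrient}", food))
    return risks


def intervals_overlap(a_start, a_end, b_start, b_end):
    """Closed-interval overlap test, e.g. medication interval vs. eating window."""
    return a_start <= b_end and b_start <= a_end
```

With the sample edges, `nutrient_conflict_risks(EDGES, "event:meal_001")` yields a single tuple tying the disease and nutrient codes to the food node, while lowering `max_hops` to 2 yields nothing because the disease node is three hops away. This shows how the hop limit bounds what can trigger a risk.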

[0052] In this implementation plan, by unifying the coding of dietary intake records and historical medical records under the constraints of the same elderly person's identity and eating behavior time window, and performing unit consistency verification and path traversal boundary control, the risk triggers in the health knowledge graph have a traceable and verifiable causal link expression. This avoids false risk triggers caused by field granularity drift, cross-person record mixing, and time interval mismatch. At the same time, it enables the structured risk tuple set to stably locate the risk source food category identifier and maintain consistency with the fields of dietary intake records. This provides a structured input foundation with clear source basis, time validity constraints, and interpretability for subsequent nutrition decision generation.

[0053] Specifically, the steps for triggering nutritional decision generation and consistency verification under structured slot constraints, and outputting current nutritional status assessment and dietary adjustment decision information for elderly people living alone, are as follows: For the structured risk tuple set, a structured slot template containing a fact slot, a boundary slot, and a logic slot is constructed, and slot mapping operations are performed. Specifically, the risk cause identifier, the risk source dish category identifier, the conflicting nutrient identifier corresponding to the nutritional conflict risk identifier, and the risk behavior type identifier corresponding to the behavioral compliance risk identifier are written into the fact slot. Each field in the fact slot is fixed as a key-value pair, namely the risk cause identifier field, risk source dish category identifier field, conflicting nutrient identifier field, and risk behavior type identifier field, and bound to the same eating behavior time window identifier. The set of ingredient name identifiers detected within the current eating behavior time window is written into the boundary slot. The disease diagnosis records from the historical medical records are mapped to a prohibited ingredient name identifier set, a prohibited nutrient identifier set, and an available dish category identifier set, which are also written into the boundary slot. The sets within the boundary slot are fixed as the available ingredient name identifier set field, the prohibited ingredient name identifier set field, the prohibited nutrient identifier set field, and the available dish category identifier set field, and their naming is consistent with the fields in the dietary intake records. The constraint relationship paths associated with the risk cause identifier in the health knowledge graph are written into the logic slot. 
Within the logic slot, the constraint relationship paths are expanded in structured form according to the path start entity node identifier, path end entity node identifier, relationship type identifier, dose upper limit threshold identifier, and frequency upper limit threshold identifier, forming the taboo rule path field, dose upper limit field, and frequency upper limit field. These fields are cross-referenced with the risk cause identifier field in the fact slot to generate a structured decision input object. The structured decision input object is input into the large language model, which is restricted to generating nutrition decision results based solely on the content of the fact slot, boundary slot, and logic slot. This restriction is achieved through a combination of fixed prompt constraints and output format constraints. The fixed prompt constraints declare that the large language model may generate conclusion sentences only by referencing field values from the structured decision input object. The output format constraints limit the nutrition decision result to a structured output consisting of the dish category identifier field, ingredient name identifier field, nutrient adjustment item field, adjustment range field, execution cycle field, and referenced risk source dish category identifier field. During generation, the available ingredient name identifier set field and available dish category identifier set field in the boundary slot are designated as allowed candidate fields, while the prohibited ingredient name identifier set field and prohibited nutrient identifier set field in the boundary slot are designated as prohibited candidate fields. 
The dose upper limit field and frequency upper limit field in the logic slot are designated as hard constraint fields, thus forming an explicit closure of the generation space. The nutrition decision result then undergoes a consistency check in which its dish category identifiers and ingredient name identifiers are compared with those in the dietary intake records. If an inconsistency is detected, the current nutrition decision result is discarded. The consistency check includes at least three types of hard rules. The first type is set inclusion verification: verifying that the dish category identifier field in the nutrition decision result belongs to the available dish category identifier set in the boundary slot, and that the ingredient name identifier field in the nutrition decision result belongs to the available ingredient name identifier set in the boundary slot. The second type is taboo conflict verification: verifying that the ingredient name identifier field in the nutrition decision result does not belong to the prohibited ingredient name identifier set in the boundary slot, and that the nutrient adjustment item field in the nutrition decision result does not belong to the prohibited nutrient identifier set in the boundary slot. The third type is logical constraint verification: performing path consistency verification on the nutrient adjustment item field and the risk cause identifier field based on the taboo rule path field in the logic slot, performing threshold consistency verification on the adjustment range field and the execution cycle field based on the dose upper limit field and the frequency upper limit field in the logic slot, and verifying that the referenced risk source dish category identifier field in the nutrition decision result is consistent with the risk source dish category identifier field in the fact slot. 
When consistency verification fails, a rollback and regeneration strategy is triggered. This strategy includes writing the failed verification rule type identifier into the rejection reason field and backfilling it into the verification feedback field in the structured decision input object. Under the constraints of the verification feedback field, nutrition decision generation is retried, and the output of the dish category identifier field, ingredient name identifier field, and nutrient adjustment item field corresponding to the rejection reason field is restricted. When the number of consecutive regenerations reaches the upper limit threshold, a conservative decision result is output. The conservative decision result only includes avoidance prompts corresponding to the risk source dish category identifier field and a recommended list of alternative ingredient name identifiers in the available ingredient name identifier set field. When consistency verification passes, the nutrition decision result is output as a current nutritional status assessment and dietary adjustment decision information for elderly people living alone. The nutrition decision result, the eating behavior time window, and the structured risk tuple set are linked and archived under the same elderly person's identity to support subsequent consistency comparison of decisions for the same risk cause identifier under multiple eating behavior time windows.
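
The verification chain and the rollback and regeneration strategy can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the `Slots` and `Decision` classes, their field names, the rule type strings, and the `stub_generate` function (a stand-in for the large language model call) are hypothetical, and the path consistency part of the third rule is omitted for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class Slots:
    risk_source_dish: str            # fact slot
    available_dishes: set            # boundary slot
    available_ingredients: set       # boundary slot
    prohibited_ingredients: set      # boundary slot
    prohibited_nutrients: set        # boundary slot
    dose_upper_limit: float          # logic slot
    frequency_upper_limit: int       # logic slot
    feedback: list = field(default_factory=list)  # verification feedback field


@dataclass
class Decision:
    dish: str
    ingredient: str
    nutrient_item: str
    adjustment: float
    cycles_per_week: int
    referenced_risk_source: str


def check(decision, slots):
    """Return the type of the first failed hard rule, or None if all pass."""
    if (decision.dish not in slots.available_dishes
            or decision.ingredient not in slots.available_ingredients):
        return "set_inclusion"
    if (decision.ingredient in slots.prohibited_ingredients
            or decision.nutrient_item in slots.prohibited_nutrients):
        return "taboo_conflict"
    if (decision.adjustment > slots.dose_upper_limit
            or decision.cycles_per_week > slots.frequency_upper_limit
            or decision.referenced_risk_source != slots.risk_source_dish):
        return "logic_constraint"
    return None


def decide(generate, slots, max_retries=3):
    """Retry generation with feedback; fall back to a conservative result."""
    for _ in range(max_retries):
        decision = generate(slots)
        reason = check(decision, slots)
        if reason is None:
            return decision
        slots.feedback.append(reason)  # backfill the rejection reason
    return {
        "avoid": slots.risk_source_dish,
        "alternatives": sorted(slots.available_ingredients - slots.prohibited_ingredients),
    }


def stub_generate(slots):
    """Hypothetical generator: first attempt is invalid, the retry is valid."""
    dish = "dish:steamed_fish" if slots.feedback else "dish:fried_rice"
    return Decision(dish, "ing:fish", "nutrient:protein", 0.1, 3, slots.risk_source_dish)


def make_slots():
    return Slots("dish:braised_pork", {"dish:steamed_fish"}, {"ing:fish", "ing:tofu"},
                 {"ing:pork_belly"}, {"nutrient:sodium"}, 0.3, 7)
```

Calling `decide(stub_generate, make_slots())` rejects the first attempt with a `set_inclusion` reason and accepts the second. A generator that never satisfies the rules ends in the conservative result, which contains only the dish to avoid and the alternative ingredients.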

[0054] In this implementation plan, by using structured risk tuple sets as fixed fields (fact slots, boundary slots, and logic slots) to form structured decision input objects, the nutrition decision generation process has verifiable input boundaries and traceable reference paths under field-level constraints. This allows the nutrition decision results to converge from free text inference to a restricted combination output of the relationship paths between dietary intake records, ingredient name identifier sets, dish category identifier sets, and health knowledge graph constraints. Furthermore, a hard rule consistency verification chain consisting of set inclusion verification, taboo conflict verification, and logical constraint verification is introduced, along with a verification feedback-driven rollback and regeneration strategy. This ensures that the output current nutritional status assessment and dietary adjustment decision information stably meets the available set constraints and taboo constraints, avoiding unexecutable situations such as suggested dish category identifiers not belonging to the available dish category identifier set, suggested ingredient name identifiers triggering the taboo ingredient name identifier set, and nutrient adjustment items triggering the dosage upper limit field constraint. This improves the feasibility, consistency, and auditability of nutrition decision results for elderly people living alone.

[0055] As shown in Figure 2, the second aspect of this invention provides a multimodal health monitoring and nutrition decision-making system for elderly people living alone, including: a multimodal health data acquisition and preprocessing module, a physically gated behavioral health log generation module, a dietary behavior modeling and nutrition profile construction module, and a nutrition risk assessment and constraint decision generation module. The multimodal health data acquisition and preprocessing module is used to collect multimodal health monitoring data and perform time alignment, smoothing and denoising, anomaly removal, missing data completion, and numerical standardization on the multimodal health monitoring data to generate preprocessed multimodal health monitoring data. The physically gated behavioral health log generation module is used to construct a spatiotemporal graph structure of the human skeleton based on the preprocessed multimodal health monitoring data, calculate joint motion intensity assessment values, and perform physical gating state determination. When the gating conditions are met, it triggers posture determination, close-range interaction detection, and micro-motion behavior recognition to generate a behavior health log containing behavior type identifiers, time intervals, and spatial region identifiers. 
The dietary behavior modeling and nutrition profile construction module is used to filter eating behavior time windows based on the behavior health log, confirm the dish category corresponding to the eating behavior by combining the video frames and depth map pixel distance values collected by the dining table camera device with the dish image recognition results, and generate a dietary intake record based on three-dimensional point cloud volume calculation and category density mapping. The nutrition risk assessment and constraint decision generation module is used to build health knowledge graph association relationships based on dietary intake records and historical medical record data, perform path traversal to generate a structured risk tuple set, and trigger nutrition decision generation and consistency verification under structured slot constraints, outputting current nutritional status assessment and dietary adjustment decision information for elderly people living alone.

[0056] As shown in Figure 4, the diagram illustrates the hierarchical judgment process and data flow in generating the behavior health log according to the present invention. First, based on the preprocessed multimodal health monitoring data, room video frames and room depth map frames are paired at the frame level using the acquisition timestamp as the key. Human key joint detection is then performed on the paired room video frames to obtain the key joint nodes corresponding to the top of the head, neck, left and right shoulders, left and right elbows, left and right wrists, mid-hip, left and right hips, left and right knees, and left and right ankles. Based on the two-dimensional pixel coordinates of each key joint node, the depth map pixel distance value is read from the room depth map frame, and the two-dimensional pixel coordinates and depth map pixel distance value are mapped to three-dimensional spatial coordinates using a camera imaging model, forming a three-dimensional coordinate sequence of human key joints ordered by acquisition timestamp. In the diagram, "continuous skeleton data stream" corresponds to the continuous input of this three-dimensional coordinate sequence along the time axis. Subsequently, the three-dimensional coordinates of each key joint node are differenced between adjacent acquisition timestamps to obtain the three-dimensional velocity vector of each key joint node. The joint motion intensity assessment value is calculated from the joint mass weight of each key joint node and the squared magnitude of its three-dimensional velocity vector. The joint motion intensity assessment value is compared with the active threshold and the micro-motion threshold, and a status flag is output. In the figure, "Level 1 Judgment: Motion State Grading" corresponds to this physical gating state determination. 
When the status flag is dynamic, it corresponds to "Dynamic Segment" in the figure. When the status flag is steady state or micro-motion, it corresponds to "Static Segment" in the figure. The static segment serves as the gating condition for triggering subsequent posture determination and close-range interaction detection. Next, when the status flag is steady state or micro-motion, the spatial vector between the three-dimensional coordinates of the hip joint node and those of the neck joint node is calculated in acquisition timestamp order. At the same time, the gravitational acceleration at the corresponding acquisition timestamp is extracted and normalized to a unit gravity reference vector. The cosine of the angle between the spatial vector and the unit gravity reference vector is calculated from their dot product, and the inverse cosine of that value gives the spatial angle. The spatial angle is then compared with the posture judgment angle ranges, and the human posture type is output as standing, sitting, supine, or transitional. In the figure, "Level 2 Judgment: Posture Topology Classification" corresponds to this posture type determination. At the same time, the absolute height of the hip joint node above the ground is calculated from the three-dimensional coordinates of the hip joint node and the ground height. When the absolute height is less than the bed height, a fall warning flag is generated and written into the behavior record for the corresponding acquisition timestamp interval. Furthermore, when the status flag is micro-motion and the human body is in a standing or sitting posture, the length of the human forearm is calculated from the three-dimensional coordinates of the elbow and wrist joint nodes. 
A near-body interaction field is constructed in three-dimensional space with the neck joint node as the center and a proportion of the forearm length as the radius. The left and right wrist joint nodes are monitored in real time to determine whether they enter the near-body interaction field. When a hand is detected entering the near-body interaction field, the projection area of the near-body interaction field in the room video frame is extracted, object detection is performed within the projection area, and the categories of the detected interactive objects are associated and labeled. When the dwell time of the interactive object exceeds the dwell threshold, the room video frames, room depth map frames, and three-dimensional coordinate sequence of human key joints within the corresponding time window are extracted in acquisition timestamp order. The three-dimensional coordinate sequence is structured according to human anatomical connections and the temporal associations between adjacent acquisition timestamps to generate a temporal structured joint representation, which is input into the trained behavior recognition model to output the behavior type identifier. In the figure, "Level 3 Judgment: Interaction and Micro-motion" corresponds to this close-range interaction detection and micro-motion behavior recognition. Finally, the behavior type identifier is associated with the corresponding acquisition timestamp interval, human posture type, status flag, and camera installation area identifier to generate the behavior health log. The behavior health log is used to form traceable intraday behavior statistics. 
The "all-day behavior profile" in the figure corresponds to the summary expression based on the behavior health log on the intraday timeline, providing a consistent behavioral event input basis for subsequent filtering of eating behavior time windows and generating dietary intake records based on the behavior health log.
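
The Level 2 posture determination and the fall warning rule can be sketched in Python as below. The sketch assumes a coordinate frame whose z-axis points upward and whose gravity vector points downward, so an upright trunk gives an angle near 180 degrees. The angle ranges in `EXAMPLE_RANGES` are placeholder values chosen only for illustration; the source does not state the posture judgment angle ranges.

```python
import math

# Placeholder ranges in degrees (assumed, not from the source). First match wins.
EXAMPLE_RANGES = {
    "standing": (165.0, 180.0),
    "sitting": (140.0, 165.0),
    "supine": (75.0, 105.0),
}


def posture_angle_deg(hip, neck, gravity):
    """Angle between the hip-to-neck vector and the gravity reference vector."""
    trunk = [n - h for h, n in zip(hip, neck)]
    trunk_norm = math.sqrt(sum(c * c for c in trunk))
    gravity_norm = math.sqrt(sum(c * c for c in gravity))
    if trunk_norm == 0 or gravity_norm == 0:
        raise ValueError("zero-length vector")
    cos_angle = sum(t * g for t, g in zip(trunk, gravity)) / (trunk_norm * gravity_norm)
    cos_angle = max(-1.0, min(1.0, cos_angle))  # guard against rounding error
    return math.degrees(math.acos(cos_angle))


def classify_posture(angle, ranges=EXAMPLE_RANGES):
    """Map the spatial angle to a posture type, defaulting to transitional."""
    for name, (low, high) in ranges.items():
        if low <= angle <= high:
            return name
    return "transitional"


def fall_warning(hip_z, ground_height, bed_height):
    """Flag a fall when the hip node is lower above the ground than the bed height."""
    return (hip_z - ground_height) < bed_height
```

For example, a hip node at height 0.9 with the neck node directly above it classifies as standing, while a horizontal trunk gives an angle of about 90 degrees and classifies as supine. In the described flow these checks would only run when the status flag is steady state or micro-motion.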

[0057] In this implementation plan, by forming a closed-loop relationship of sequential dependence among the preprocessing results of multimodal health monitoring data, the spatiotemporal structure of the human skeleton, joint motion intensity assessment values, physical gating status determination results, behavioral health logs, dietary intake records, and structured risk tuple sets under the same data link, the reasoning process from health monitoring to nutritional decision-making has a unified time benchmark, spatial region identification constraints, and field-level traceability. This enables end-to-end connectivity of eating behavior triggering, food category confirmation, nutritional intake quantification, risk cause location, decision generation, and consistency verification. Under this connectivity, nutritional decision results can stably reference the association path between dietary intake records and historical medical records and are subject to structured slot constraints. This avoids unexecutable suggestions that are inconsistent with the set of available food name identifiers, decoupled from the food category identifiers of risk sources, or conflict with the constraint relationship path. This improves the feasibility, auditability, and stability of nutritional status assessment and dietary adjustment decision-making information for elderly people living alone in the context of nutrition.

[0058] It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such process, method, article, or apparatus.

[0059] The preferred embodiments of the present invention disclosed above are merely illustrative of the invention. These preferred embodiments do not exhaustively describe all details, nor do they limit the invention to the specific implementations described. Clearly, many modifications and variations can be made based on the content of this specification. This specification selects and specifically describes these embodiments to better explain the principles and practical applications of the invention, thereby enabling those skilled in the art to better understand and utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims

1. A multimodal health monitoring and nutritional decision-making method for elderly people living alone, characterized in that it includes the following steps: S1, collecting multimodal health monitoring data, and performing time alignment, smoothing and noise reduction, anomaly removal, missing data completion, and numerical standardization on the multimodal health monitoring data to generate preprocessed multimodal health monitoring data; S2, based on the preprocessed multimodal health monitoring data, constructing a spatiotemporal graph structure of the human skeleton, calculating the joint motion intensity assessment value and performing physical gating state determination, triggering posture determination, close-range interaction detection, and micro-motion behavior recognition when the gating conditions are met, and generating a behavior health log containing behavior type identifiers, time intervals, and spatial region identifiers; S3, based on the behavior health log, filtering the eating behavior time windows, confirming the dish category corresponding to the eating behavior by combining the video frames and depth map pixel distance values collected by the dining table camera device with the dish image recognition results, and generating a dietary intake record based on three-dimensional point cloud volume calculation and category density mapping; S4, constructing health knowledge graph association relationships based on dietary intake records and historical medical records, performing path traversal to generate a structured risk tuple set, and triggering nutritional decision generation and consistency verification under structured slot constraints, outputting current nutritional status assessment and dietary adjustment decision information for elderly people living alone.

2. The multimodal health monitoring and nutritional decision-making method for elderly people living alone according to claim 1, characterized in that: The specific steps for collecting multimodal health monitoring data and performing time alignment, smoothing and denoising, anomaly removal, missing data completion, and numerical standardization on the multimodal health monitoring data to generate preprocessed multimodal health monitoring data are as follows: Real-time collection of multimodal health monitoring data for elderly people living alone. The multimodal health monitoring data includes identity identifiers, collection timestamps, camera installation area identifiers, room video frames, room depth map frames, dining table video frames, dining table depth map frames, depth map pixel distance values, three-axis acceleration, gravitational acceleration, bed height, ground height, room camera device identifiers, dining table camera device identifiers, dining table installation area identifiers, kitchen food storage device identifiers, and food name identifiers. A time alignment algorithm based on time window resampling is used to map multimodal health monitoring data from the same day to the same time axis, constructing an intraday time axis. A sliding window mean filtering algorithm is used to smooth high-frequency noise introduced by sensor jitter and transient interference in the multimodal health monitoring data. For abnormal sampling points in the multimodal health monitoring data caused by occlusion, communication jitter, and intermittent equipment acquisition, an interquartile range anomaly detection algorithm is used to identify and remove them, and the missing segments formed after removal are filled in using an adjacent time slice interpolation algorithm. 
A Z-score standardization algorithm is used to perform numerical standardization processing on the multimodal health monitoring data to eliminate dimensional differences between different physical quantities.
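
The preprocessing steps named in this claim can be sketched with the Python standard library as follows. This is an illustrative sketch for a single one-dimensional signal, not the claimed implementation: the function names, the window size, the 1.5 interquartile range multiplier, and the choice of linear interpolation for "adjacent time slice interpolation" are assumptions. The steps are shown as independent functions rather than as a fixed pipeline.

```python
import statistics


def resample(samples, step):
    """Average (timestamp, value) samples that fall in each window of width step."""
    buckets = {}
    for ts, value in samples:
        buckets.setdefault(int(ts // step), []).append(value)
    return {k * step: sum(v) / len(v) for k, v in sorted(buckets.items())}


def sliding_mean(xs, window=3):
    """Centered moving average; windows are truncated at the edges."""
    half = window // 2
    out = []
    for i in range(len(xs)):
        segment = xs[max(0, i - half): i + half + 1]
        out.append(sum(segment) / len(segment))
    return out


def remove_iqr_outliers(xs, k=1.5):
    """Replace points outside [Q1 - k*IQR, Q3 + k*IQR] with None."""
    q1, _, q3 = statistics.quantiles(xs, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [x if low <= x <= high else None for x in xs]


def interpolate_gaps(xs):
    """Fill None entries linearly from the nearest known neighbors."""
    known = [i for i, x in enumerate(xs) if x is not None]
    if not known:
        raise ValueError("no known samples to interpolate from")
    out = list(xs)
    for i, x in enumerate(xs):
        if x is not None:
            continue
        left = max((j for j in known if j < i), default=None)
        right = min((j for j in known if j > i), default=None)
        if left is None:
            out[i] = xs[right]
        elif right is None:
            out[i] = xs[left]
        else:
            weight = (i - left) / (right - left)
            out[i] = xs[left] + weight * (xs[right] - xs[left])
    return out


def z_score(xs):
    """Standardize to zero mean and unit (population) standard deviation."""
    mu = statistics.fmean(xs)
    sigma = statistics.pstdev(xs)
    return [0.0] * len(xs) if sigma == 0 else [(x - mu) / sigma for x in xs]
```

On a series such as `[1.0, 1.1, 0.9, 1.0, 50.0, 1.0, 1.1, 0.9]`, `remove_iqr_outliers` marks the spike as missing and `interpolate_gaps` fills it from its two neighbors.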

3. The multimodal health monitoring and nutritional decision-making method for elderly people living alone according to claim 1, characterized in that: The specific steps for constructing the spatiotemporal graph structure of the human skeleton based on the preprocessed multimodal health monitoring data are as follows: The preprocessed multimodal health monitoring data are read, and room video frames and room depth map frames are paired at the frame level using the acquisition timestamp as the key. For each pair of room video frame and room depth map frame, a lightweight pose estimation algorithm is used to perform human key joint detection on the room video frame to obtain the key joint nodes, including the top of the head, neck, left and right shoulders, left and right elbows, left and right wrists, mid-hip, left and right hips, left and right knees, and left and right ankles, with the two-dimensional pixel coordinates of the key joint nodes obtained by a convolutional neural network. The depth value is read from the corresponding room depth map frame at the two-dimensional pixel coordinates, and the two-dimensional pixel coordinates and depth value are mapped to three-dimensional spatial coordinates through a camera imaging model to obtain a three-dimensional coordinate sequence of human key joints ordered by acquisition timestamp. Using the key joint nodes within a single frame as the node set, spatial connection edges are constructed within the same frame based on human anatomical connections, and temporal connection edges are constructed for the same joint node between adjacent acquisition timestamps, thereby modeling the human skeleton as a spatiotemporal graph structure containing spatial edges and temporal edges.
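
The mapping to three-dimensional coordinates and the edge construction can be sketched in Python as below. The claim only refers to "a camera imaging model"; the pinhole model and its intrinsic parameters (`fx`, `fy`, `cx`, `cy`) are an assumption of this sketch, as are the joint names and the particular `BONES` list chosen to represent human anatomical connections.

```python
# The 15 key joint nodes listed in the claim (names are illustrative).
JOINTS = [
    "head_top", "neck", "l_shoulder", "r_shoulder", "l_elbow", "r_elbow",
    "l_wrist", "r_wrist", "mid_hip", "l_hip", "r_hip",
    "l_knee", "r_knee", "l_ankle", "r_ankle",
]

# Assumed anatomical connections used for spatial edges.
BONES = [
    ("head_top", "neck"), ("neck", "l_shoulder"), ("neck", "r_shoulder"),
    ("l_shoulder", "l_elbow"), ("r_shoulder", "r_elbow"),
    ("l_elbow", "l_wrist"), ("r_elbow", "r_wrist"), ("neck", "mid_hip"),
    ("mid_hip", "l_hip"), ("mid_hip", "r_hip"),
    ("l_hip", "l_knee"), ("r_hip", "r_knee"),
    ("l_knee", "l_ankle"), ("r_knee", "r_ankle"),
]


def back_project(u, v, depth, fx, fy, cx, cy):
    """Pinhole model: pixel (u, v) with depth Z maps to camera-frame (X, Y, Z)."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def build_st_graph(num_frames):
    """Return spatial and temporal edges over nodes keyed as (frame, joint)."""
    spatial = [((t, a), (t, b)) for t in range(num_frames) for a, b in BONES]
    temporal = [((t, j), (t + 1, j)) for t in range(num_frames - 1) for j in JOINTS]
    return spatial, temporal
```

For three frames this produces 42 spatial edges (14 per frame) and 30 temporal edges (15 joints across two frame transitions). A pixel at the principal point back-projects to a point on the optical axis at the measured depth.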

4. The multimodal health monitoring and nutritional decision-making method for elderly people living alone according to claim 1, characterized in that: The specific steps for calculating the joint motion intensity assessment value and performing physical gating state determination are as follows: The difference in the three-dimensional coordinates of each key joint node in the spatiotemporal graph structure of the human skeleton between adjacent acquisition timestamps is calculated to obtain the three-dimensional velocity vector of each key joint node. The modulus of the three-dimensional velocity vector of the i-th key joint node at acquisition timestamp t is taken and squared to obtain the squared velocity modulus. One is added to the squared velocity modulus and the reciprocal is taken; the reciprocal is then subtracted from one to obtain the velocity adjustment factor. The joint mass weight, the velocity adjustment factor, and the squared velocity modulus of the current key joint node are multiplied together to obtain the weighted motion of that key joint node. The weighted motion of all key joint nodes is summed to obtain the joint motion intensity assessment value at acquisition timestamp t. The joint motion intensity assessment value is compared in real time with the active threshold and the micro-motion threshold: when the joint motion intensity assessment value is greater than or equal to the active threshold, the current state is determined to be in the dynamic migration phase and the status flag is marked as dynamic; when the joint motion intensity assessment value is greater than the micro-motion threshold and less than the active threshold, the current state is determined to be in the local micro-motion phase and the status flag is marked as micro-motion; when the joint motion intensity assessment value is less than or equal to the micro-motion threshold, the current state is determined to be in the steady-state maintenance phase and the status flag is marked as steady state.
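
The calculation in this claim can be written compactly as follows. The symbols are introduced here only for illustration, since the original notation is not available: $\mathbf{p}_i(t)$ is the three-dimensional coordinate of the i-th key joint node, $\mathbf{v}_i(t)$ its velocity vector, $m_i$ its joint mass weight, $N$ the number of key joint nodes, $E(t)$ the joint motion intensity assessment value, and $\theta_a$ and $\theta_m$ the active and micro-motion thresholds.

```latex
\[
\mathbf{v}_i(t) = \mathbf{p}_i(t) - \mathbf{p}_i(t-1)
\]
\[
E(t) = \sum_{i=1}^{N} m_i
       \left(1 - \frac{1}{1 + \lVert \mathbf{v}_i(t) \rVert^{2}}\right)
       \lVert \mathbf{v}_i(t) \rVert^{2}
     = \sum_{i=1}^{N} m_i \,
       \frac{\lVert \mathbf{v}_i(t) \rVert^{4}}{1 + \lVert \mathbf{v}_i(t) \rVert^{2}}
\]
\[
\text{status}(t) =
\begin{cases}
\text{dynamic}      & E(t) \ge \theta_a \\
\text{micro-motion} & \theta_m < E(t) < \theta_a \\
\text{steady state} & E(t) \le \theta_m
\end{cases}
\]
```

The second form of $E(t)$ follows from $1 - \frac{1}{1+s} = \frac{s}{1+s}$ with $s = \lVert \mathbf{v}_i(t) \rVert^{2}$. It shows that the velocity adjustment factor suppresses the contribution of very small velocities (the term behaves like $m_i s^{2}$ when $s$ is small) and approaches the plain weighted squared velocity $m_i s$ when $s$ is large.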

5. The multimodal health monitoring and nutritional decision-making method for elderly people living alone according to claim 1, characterized in that: The specific steps for triggering posture determination, close-range interaction detection, and micro-motion behavior recognition when gating conditions are met, and generating a behavior health log containing behavior type identifiers, time intervals, and spatial region identifiers, are as follows: When the status flag is steady state or micro-motion, the spatial vector between the three-dimensional coordinates of the hip joint node and those of the neck joint node is calculated in acquisition timestamp order, and the gravitational acceleration at the corresponding acquisition timestamp is extracted as the spatial direction reference. The spatial angle between the spatial vector and the direction of gravity is calculated based on the vector dot product. The spatial angle is compared with the posture judgment angle ranges to determine the human posture as standing, sitting, supine, or transitional. The absolute height of the hip joint node above the ground is calculated from the three-dimensional coordinates of the hip joint node and the ground height, and a fall warning flag is generated when the absolute height is less than the bed height. When the status flag is micro-motion and the human body is in a standing or sitting posture, the length of the human forearm is calculated from the three-dimensional coordinates of the elbow and wrist joint nodes. A near-body interaction field is constructed in three-dimensional space with the neck joint node as the center and a proportion of the forearm length as the radius, and the left and right wrist joint nodes are monitored in real time to determine whether they enter the near-body interaction field. 
When a hand is detected entering the near-body interaction field, object detection is performed only on the projection area of the near-body interaction field in the room video frame, and the categories of the detected interactive objects are associated and labeled. When an interactive object is detected and its dwell time exceeds the dwell threshold, the room video frames, room depth map frames, and three-dimensional coordinate sequence of human key joints within the corresponding time window are extracted in acquisition timestamp order. Topological associations between key joint nodes are constructed according to the anatomical connections of human joints and, combined with the temporal associations of the same key joint node at adjacent acquisition timestamps, the three-dimensional coordinate sequence of human key joints is structured to generate a temporal structured joint representation for behavior determination. Historical multimodal health monitoring data with labeled behavior types are read to construct historical temporal structured joint representations; these are used as training inputs, with the corresponding micro-motion behavior type identifiers as supervision labels, to construct and train a behavior recognition model based on a spatiotemporal graph convolutional network algorithm. The temporal structured joint representation generated within the current time window is input into the trained behavior recognition model, which outputs the corresponding behavior type identifier. The behavior type identifier is associated with the corresponding acquisition timestamp interval, human posture type, status flag, and camera installation area identifier to generate the behavior health log.
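
The near-body interaction field test and the dwell check can be sketched in Python as follows. The `ratio` default of 1.5 is a placeholder, since the claim does not state the proportion of forearm length used as the radius. The `dwell_exceeded` helper works on a generic per-timestamp presence flag, which is one possible way to model the dwell time of an interactive object.

```python
import math


def forearm_length(elbow, wrist):
    """Euclidean distance between the elbow and wrist joint nodes."""
    return math.dist(elbow, wrist)


def in_interaction_field(wrist, neck, forearm_len, ratio=1.5):
    """True when the wrist lies inside the sphere centered on the neck node."""
    return math.dist(wrist, neck) <= ratio * forearm_len


def dwell_exceeded(timestamps, present_flags, dwell_threshold):
    """True if any uninterrupted run of presence lasts longer than the threshold."""
    start = None
    for ts, present in zip(timestamps, present_flags):
        if not present:
            start = None  # run broken, reset
            continue
        if start is None:
            start = ts
        if ts - start > dwell_threshold:
            return True
    return False
```

With a forearm length of 0.25 and the placeholder ratio, the field has a radius of 0.375 around the neck node. A wrist raised toward the face falls inside it, while a wrist hanging at hip level does not.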

6. The multimodal health monitoring and nutritional decision-making method for elderly people living alone according to claim 1, characterized in that the specific steps for filtering the eating behavior time window based on the behavior health log and confirming the dish category corresponding to the eating behavior by combining the video frames and depth map pixel distance values collected by the dining table camera device with the dish image recognition results are as follows:
The behavior health log is read, the behavior records related to eating that occurred in the areas corresponding to the dining table installation area and the kitchen food storage device are filtered by acquisition timestamp, and the corresponding start and end acquisition timestamps are extracted to generate a sequence of eating behavior time windows.
For each eating behavior time window, the video frames captured by the dining table camera device in the multimodal health monitoring data are read in acquisition timestamp order. An object detection algorithm identifies the ingredient name identifiers in the video frames, and the identified ingredient name identifiers are recorded in acquisition timestamp order. The video frames captured by the dining table camera device are also input into a deep learning-based dish image recognition model, which outputs the candidate dish category identifiers corresponding to the video frames.
For the same eating behavior time window, the set of ingredients corresponding to each candidate dish category identifier is read from the dish ingredient correspondence table and compared with the set of ingredient name identifiers detected in the current eating behavior time window. When the set of detected ingredient name identifiers contains an ingredient name identifier belonging to the set of ingredients, the corresponding candidate dish category identifier is retained as a valid dish category identifier. Among the valid dish category identifiers, the one with the longest continuous appearance in the time window is determined as the target dish category identifier.
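
The candidate filtering and target selection in this claim can be sketched as follows, by way of a non-limiting example. The function name `confirm_dish`, the per-frame input layout, and the choice to measure "longest continuous appearance" as a count of consecutive frames are assumptions made for illustration only.

```python
def confirm_dish(frames, dish_ingredients):
    """Select the target dish category identifier for one eating time window.

    frames: list of (candidate_dish_id, detected_ingredient_ids) tuples in
            acquisition timestamp order (hypothetical layout).
    dish_ingredients: dish ingredient correspondence table, mapping each
            dish id to the set of its ingredient ids.
    """
    # Union of all ingredient identifiers detected in the window.
    detected = set()
    for _, ingredients in frames:
        detected.update(ingredients)

    # A candidate is valid if at least one of its ingredients was detected.
    valid = {
        dish for dish, _ in frames
        if dish_ingredients.get(dish, set()) & detected
    }

    # Among valid candidates, keep the one with the longest consecutive run.
    best, best_run = None, 0
    current, run = None, 0
    for dish, _ in frames:
        if dish == current:
            run += 1
        else:
            current, run = dish, 1
        if dish in valid and run > best_run:
            best, best_run = dish, run
    return best
```

In this sketch a candidate that appears in many consecutive frames is still rejected when none of its listed ingredients is detected, which mirrors the cross-check between the image recognition model and the object detector.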

7. The multimodal health monitoring and nutritional decision-making method for elderly people living alone according to claim 1, characterized in that the specific steps for generating dietary intake records based on three-dimensional point cloud volume calculation and category density mapping are as follows:
The video frames and corresponding depth map pixel distance values captured by the dining table camera device within the corresponding eating behavior time window are read in acquisition timestamp order. Background subtraction is performed on the video frames to separate the food area, and a pixel back-projection algorithm uses the depth map pixel distance values to map the pixels of the food area into a set of three-dimensional points, generating three-dimensional point cloud data of the food area.
The Alpha Shape algorithm is used to construct a three-dimensional bounding structure of the food point cloud from the three-dimensional point cloud data, and the food volume is calculated by volume integration.
The category density parameter corresponding to the target dish category identifier is read, and the food volume is multiplied by the category density parameter to obtain the food weight. The nutritional intake data corresponding to the eating behavior is calculated from the nutritional component mapping relationship corresponding to the target dish category identifier, and a dietary intake record is generated and output.
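
A minimal sketch of the back-projection and density mapping steps is given below, assuming a pinhole camera model with intrinsics `fx`, `fy`, `cx`, `cy` and depth values measured along the optical axis in meters; neither assumption is stated in the claim. The Alpha Shape volume computation is not reproduced here, so `intake_record` takes the volume as an input. All names, units, and the per-100 g nutrient table layout are hypothetical.

```python
def back_project(u, v, depth_m, fx, fy, cx, cy):
    """Map pixel (u, v) with depth `depth_m` to a 3D point in the camera frame
    under an assumed pinhole model."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


def food_point_cloud(depth_map, food_mask, fx, fy, cx, cy):
    """Back-project every food pixel that has a valid (positive) depth.

    depth_map and food_mask are row-major nested lists of equal shape;
    food_mask marks pixels kept after background subtraction.
    """
    points = []
    for v, row in enumerate(depth_map):
        for u, depth in enumerate(row):
            if food_mask[v][u] and depth > 0:
                points.append(back_project(u, v, depth, fx, fy, cx, cy))
    return points


def intake_record(dish_id, volume_cm3, density_g_per_cm3, nutrients_per_100g):
    """Weight = volume x category density; nutrients scale with weight."""
    weight_g = volume_cm3 * density_g_per_cm3
    nutrients = {
        name: amount * weight_g / 100.0
        for name, amount in nutrients_per_100g.items()
    }
    return {"dish_id": dish_id, "weight_g": weight_g, "nutrients": nutrients}
```

As a worked example under these assumptions, a volume of 200 cm³ at a density of 1.0 g/cm³ gives a weight of 200 g, so a dish listed at 5 g of protein per 100 g contributes 10 g of protein to the record.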

8. The multimodal health monitoring and nutritional decision-making method for elderly people living alone according to claim 1, characterized in that the specific steps for constructing a health knowledge graph based on dietary intake records and historical medical record data, and generating a structured risk tuple set through path traversal, are as follows:
The system reads the dietary intake records and extracts the dish category identifiers, nutrient intake data, and time windows corresponding to the eating behavior. It maps the dish category identifiers to food entity nodes in the health knowledge graph, the nutrient intake data to nutrient entity nodes, and the behavior type identifiers corresponding to the eating behavior time windows to behavior event nodes.
Simultaneously, the system retrieves the historical medical records associated with the identity of the current elderly person through the health record database interface, reads the disease diagnosis records and long-term medication records from them, and maps the disease diagnosis records to disease entity nodes and the long-term medication records to medication entity nodes. Based on the disease diagnosis records, it extracts the corresponding dietary contraindications and maps them to contraindication rule nodes; based on the long-term medication records, it extracts the corresponding medication precautions and maps them to medication constraint rule nodes. The relationships between the nodes in the health knowledge graph are then established.
In the health knowledge graph, path traversal is performed over the entity nodes and their relationships. When an association path from a disease entity node to a nutrient entity node via a contraindication rule node is detected, a nutrient conflict risk identifier is generated. When an association path from a medication entity node to a behavior event node via a medication constraint rule node is detected, a behavior compliance risk identifier is generated.
When any risk identifier is generated, a set of structured risk tuples is generated from the association path that triggered the risk. Each risk tuple contains a risk type identifier, a risk cause identifier, and a risk source dish category identifier.
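
By way of a non-limiting illustration, the two path patterns in this claim can be matched over a small typed directed graph as sketched below. The `HealthGraph` class, the node type strings, the use of the rule node as the risk cause identifier, and the `source_dish` lookup table are all assumptions made for this example; the claim does not prescribe a graph representation.

```python
from collections import defaultdict


class HealthGraph:
    """Minimal typed directed graph (hypothetical structure)."""

    def __init__(self):
        self.node_type = {}            # node id -> node type string
        self.edges = defaultdict(set)  # node id -> set of successor ids

    def add_node(self, node_id, node_type):
        self.node_type[node_id] = node_type

    def add_edge(self, src, dst):
        self.edges[src].add(dst)


# (start type, rule type, end type, risk type identifier)
PATTERNS = [
    ("disease", "contraindication_rule", "nutrient", "nutrient_conflict"),
    ("medication", "medication_constraint_rule", "behavior_event",
     "behavior_compliance"),
]


def find_risk_tuples(graph, source_dish):
    """Return (risk type, risk cause, risk source dish) tuples.

    source_dish maps a nutrient or behavior event node id to the dish
    category identifier it came from (assumed lookup).
    """
    risks = []
    for start_type, rule_type, end_type, risk_type in PATTERNS:
        for start, node_type in graph.node_type.items():
            if node_type != start_type:
                continue
            for rule in sorted(graph.edges.get(start, ())):
                if graph.node_type.get(rule) != rule_type:
                    continue
                for end in sorted(graph.edges.get(rule, ())):
                    if graph.node_type.get(end) == end_type:
                        risks.append((risk_type, rule, source_dish.get(end)))
    return risks
```

Only two-hop paths whose intermediate node is a rule node of the expected type are reported, so an edge that links a disease node directly to a nutrient node does not by itself produce a risk tuple in this sketch.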

9. The multimodal health monitoring and nutritional decision-making method for elderly people living alone according to claim 1, characterized in that the specific steps for triggering nutrition decision generation and consistency verification under structured slot constraints, and outputting current nutritional status assessment and dietary adjustment decision information for elderly people living alone, are as follows:
For the set of structured risk tuples, a structured slot template containing fact slots, boundary slots, and logic slots is constructed, and a slot mapping operation is performed. Specifically, the risk cause identifier and the risk source dish category identifier are written into the fact slot; the set of ingredient name identifiers detected within the current eating behavior time window is written into the boundary slot; and the constraint relationship paths associated with the risk cause identifiers in the health knowledge graph are written into the logic slot, generating a structured decision input object.
The structured decision input object is fed into a large language model, which is restricted to generating nutrition decision results based solely on the content of the fact slots, boundary slots, and logic slots.
A consistency check is performed on the nutrition decision results by comparing the dish category identifiers and ingredient name identifiers involved in the nutrition decision results with the dish category identifiers and ingredient name identifiers in the dietary intake records. If an inconsistency is detected, the current nutrition decision result is discarded. If the consistency check passes, the nutrition decision result is output as the current nutritional status assessment and dietary adjustment decision information for elderly people living alone.
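
The slot mapping and consistency check of this claim might be sketched as follows. This is an illustrative example only: the dictionary layout of the slots, the field names `dishes` and `ingredients`, and the reading of "consistent" as "every identifier mentioned in the decision also appears in the dietary intake record" are assumptions, and the call to the large language model is deliberately left out.

```python
def build_decision_input(risk_tuples, detected_ingredients, constraint_paths):
    """Fill the fact, boundary, and logic slots from the risk tuples.

    risk_tuples: iterable of (risk type, risk cause, source dish) tuples.
    detected_ingredients: ingredient ids seen in the current time window.
    constraint_paths: risk cause id -> list of node ids on its constraint
                      path in the knowledge graph (hypothetical layout).
    """
    risk_tuples = list(risk_tuples)
    return {
        "fact": [
            {"cause": cause, "source_dish": dish}
            for _, cause, dish in risk_tuples
        ],
        "boundary": sorted(detected_ingredients),
        "logic": [
            constraint_paths[cause]
            for _, cause, _ in risk_tuples
            if cause in constraint_paths
        ],
    }


def is_consistent(decision, intake_record):
    """True if the decision mentions only dishes and ingredients that are
    present in the dietary intake record (assumed subset interpretation)."""
    return (
        set(decision["dishes"]) <= set(intake_record["dishes"])
        and set(decision["ingredients"]) <= set(intake_record["ingredients"])
    )


def checked_decision(decision, intake_record):
    """Return the decision if it passes the check, otherwise discard it."""
    return decision if is_consistent(decision, intake_record) else None
```

Keeping the check outside the model means a decision that refers to a dish or ingredient absent from the intake record is rejected regardless of how the model produced it.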

10. A multimodal health monitoring and nutrition decision-making system for elderly people living alone, characterized in that the system comprises a multimodal health data acquisition and preprocessing module, a physically gated behavior health log generation module, a dietary behavior modeling and nutritional profile construction module, and a nutritional risk assessment and constraint decision generation module, wherein:
The multimodal health data acquisition and preprocessing module is used to acquire multimodal health monitoring data and to perform time alignment, smoothing and noise reduction, anomaly removal, missing data completion, and numerical standardization on the multimodal health monitoring data to generate preprocessed multimodal health monitoring data.
The physically gated behavior health log generation module is used to construct a human skeleton spatiotemporal graph structure based on the preprocessed multimodal health monitoring data, calculate joint motion intensity assessment values and perform physical gating state determination, trigger posture determination, close-range interaction detection, and micro-motion behavior recognition when the gating conditions are met, and generate a behavior health log containing behavior type identifiers, time intervals, and spatial region identifiers.
The dietary behavior modeling and nutritional profile construction module is used to filter the eating behavior time window based on the behavior health log, confirm the dish category corresponding to the eating behavior by combining the video frames and depth map pixel distance values collected by the dining table camera device with the dish image recognition results, and generate a dietary intake record based on three-dimensional point cloud volume calculation and category density mapping.
The nutritional risk assessment and constraint decision generation module is used to construct a health knowledge graph based on the dietary intake records and historical medical record data, generate a set of structured risk tuples through path traversal, trigger nutrition decision generation and consistency verification under structured slot constraints, and output current nutritional status assessment and dietary adjustment decision information for elderly people living alone.