A false fall filtering method and system based on multi-modal spatio-temporal feature fusion
By employing a multimodal spatiotemporal feature fusion method that combines posture key point recognition with multimodal data analysis, the false alarm and missed alarm problems of existing fall detection technologies are addressed, thereby improving the accuracy and reliability of fall detection.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XIDIAN UNIV
- Filing Date
- 2026-04-10
- Publication Date
- 2026-06-23
AI Technical Summary
Existing fall detection technologies are susceptible to changes in lighting, occlusion, and complex backgrounds, making it difficult to distinguish between daily actions and falls. They also have high false alarm and false negative rates, especially in environments with diverse human behaviors.
A multimodal spatiotemporal feature fusion method is adopted. Suspected fall events are initially screened through a posture key point recognition model. Multimodal analysis and fusion are performed by combining multimodal data, including skeletal key points, acceleration and audio features. Thresholds are dynamically adjusted to improve recognition accuracy.
It significantly reduces the probability of bending over, sitting or lying down being misjudged as falls, improves the accuracy and reliability of fall recognition, and enables reliable determination of real fall events.
Figure CN122024329B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of fall detection technology, specifically to a false fall filtering method and system based on multimodal spatiotemporal feature fusion.
Background Technology
[0002] As the population ages, the risk of falls among the elderly increases significantly in daily life. Falls are often accompanied by serious consequences such as fractures and brain injuries, and are a major cause of disability and even death among the elderly. How to promptly and accurately identify falls in the elderly in daily home or care settings has become a key issue that needs to be addressed in the field of intelligent care and health monitoring.
[0003] Current fall detection technologies mainly rely on vision-only analysis, wearable sensors, or simple threshold rules. Vision-based methods are easily affected by changes in lighting, occlusion, and complex backgrounds, and have limited ability to distinguish falls from everyday movements such as bending over and sitting down; wearable-device-based methods depend on the elderly actively wearing the devices, resulting in low user compliance; and solutions based on a single sensor signal or fixed rules struggle to cope with the highly diverse human behavior patterns found in real-life environments.
[0004] Furthermore, falls among the elderly often occur in high-activity and rapidly changing environments such as kitchens and bathrooms. The range of normal human posture changes varies significantly across scenarios, so a uniform threshold can easily lead to false alarms or missed alarms. Additionally, distinguishing a "true fall" from a "brief fall after which the person can get up unaided" places higher demands on the system's reliability.
Summary of the Invention
[0005] This application aims to provide a false fall filtering method and system based on multimodal spatiotemporal feature fusion, which can improve the accuracy of fall recognition results.
[0006] The technical solution of this application is implemented as follows:
[0007] In a first aspect, embodiments of this application provide a false fall filtering method based on multimodal spatiotemporal feature fusion, the method comprising:
[0008] Acquire continuous video stream data containing the target person;
[0009] Using a pre-determined posture key point recognition model, perform skeletal key point recognition and posture indicator calculation on the video stream data to determine a primary suspected fall event;
[0010] Based on the time corresponding to the primary suspected fall event, multimodal data for a preset time period is obtained; and multimodal analysis and multimodal fusion are performed based on the multimodal data to determine the secondary suspected fall event;
[0011] Based on the secondary suspected fall event, static posture determination and sensitive speech segment recognition are performed on the video stream data to determine the final determination result;
[0012] If the final determination result indicates that the target person has actually fallen, an alarm mechanism is triggered.
[0013] In the above scheme, the step of identifying skeletal key points and calculating posture indicators in the video stream data using a pre-determined posture key point recognition model to determine a primary suspected fall event includes:
[0014] The pose key point recognition model is used to identify the skeletal key points of the target person in the video stream data to obtain the human skeletal key points and the first coordinate information corresponding to the human skeletal key points.
[0015] Based on the first coordinate information corresponding to the key points of the human skeleton, the posture index is calculated to determine the trunk main axis tilt angle and the change in head height.
[0016] If the tilt angle of the torso axis is greater than a preset tilt angle threshold and the change in head height is greater than a preset height threshold, the primary suspected fall event is determined.
[0017] In the above scheme, the step of calculating posture indicators and determining the trunk axis tilt angle and head height change based on the first coordinate information corresponding to the key points of the human skeleton includes:
[0018] Based on the first coordinate information corresponding to the key points of the human skeleton, the target coordinate information corresponding to two sets of target skeleton key points related to the torso posture is determined; wherein, the two sets of target skeleton key points are the upper body skeleton key points and the lower body skeleton key points.
[0019] Based on the target coordinate information, calculate the tilt angle of the torso's main axis;
[0020] Based on the first coordinate information corresponding to the key points of the human skeleton, select the longitudinal coordinate values of key points including at least the head or neck key points, and determine the longitudinal coordinate sequence of the key points.
[0021] The change in head height is calculated based on the longitudinal coordinate sequence of the key points.
[0022] In the above scheme, the multimodal data includes an initial human skeleton key point sequence, an initial triaxial acceleration sequence, and an initial environmental audio sequence;
[0023] The process of performing multimodal analysis and multimodal fusion based on the multimodal data to determine the secondary suspected fall event includes:
[0024] Based on the initial human skeleton key point sequence, the initial triaxial acceleration sequence, and the initial environmental audio sequence, time alignment processing is performed to obtain the human skeleton key point sequence, the triaxial acceleration sequence, and the environmental audio sequence.
[0025] Feature analysis and multimodal fusion are performed on the human skeleton key point sequence, the triaxial acceleration sequence, and the environmental audio sequence respectively to determine the fusion confidence.
[0026] Obtain attention weights for adjusting the dynamic threshold; and adjust the dynamic threshold based on the attention weights to determine the corrected fusion judgment threshold for the region where the target person is located;
[0027] If the fusion confidence is greater than the corrected fusion judgment threshold, the secondary suspected fall event is determined.
[0028] In the above scheme, the step of performing feature analysis and multimodal fusion on the human skeletal keypoint sequence, the triaxial acceleration sequence, and the environmental audio sequence to determine the fusion confidence includes:
[0029] By using a pre-determined neural network model for recognizing fall behavior, the similarity between the sequence of key points of the human skeleton and the preset sequence of real fall patterns is calculated to obtain the fall confidence level.
[0030] Feature extraction is performed on the triaxial acceleration sequence to obtain the energy distribution and frequency characteristics of the vibration signal; and the vibration confidence level is calculated based on the energy distribution and frequency characteristics.
[0031] Feature extraction is performed on the environmental audio sequence to obtain transient impact features; and acoustic modal confidence is calculated based on the transient impact features.
[0032] Based on the fall confidence, the vibration confidence, and the acoustic modal confidence, multimodal fusion is performed to determine the fusion confidence.
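The source does not state how the three confidences are combined. As a minimal sketch only, assuming a weighted average (the function names, the weights, and the value range [0, 1] are illustrative assumptions, not taken from the application), the fusion and the comparison against the corrected fusion judgment threshold could look like this:

```python
def fuse_confidences(fall_conf, vibration_conf, acoustic_conf,
                     weights=(0.5, 0.3, 0.2)):
    """Weighted average of three per-modality confidences.

    The weights are hypothetical placeholders; the application does not
    specify the fusion rule.
    """
    confs = (fall_conf, vibration_conf, acoustic_conf)
    total = sum(weights)
    return sum(w * c for w, c in zip(weights, confs)) / total


def is_secondary_suspected_fall(fusion_conf, corrected_threshold):
    """A secondary suspected fall is flagged when the fused confidence
    exceeds the (region-specific) corrected fusion judgment threshold."""
    return fusion_conf > corrected_threshold


fused = fuse_confidences(0.9, 0.8, 0.7)
print(is_secondary_suspected_fall(fused, corrected_threshold=0.75))
```

Normalizing by the sum of the weights keeps the fused value in the same range as the inputs even if the weights do not add up to one.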
[0033] In the above scheme, the step of performing static posture determination and sensitive speech segment recognition on the video stream data based on the secondary suspected fall event to determine the final determination result includes:
[0034] Based on the secondary suspected fall event, the video stream data is subjected to skeletal key point recognition using the posture key point recognition model to determine the final human skeletal key points and the corresponding second coordinate information; wherein, the final human skeletal key points include head key points, shoulder key points and hip key points.
[0035] Based on the final human skeleton key points and the second coordinate information corresponding to the final human skeleton key points, calculate the displacement variance corresponding to the head key points, the shoulder key points and the hip key points respectively;
[0036] Based on the displacement variances corresponding to the head key points, shoulder key points, and hip key points, the mean displacement variance is determined; and the mean displacement variance is compared with a preset static threshold to determine the static posture determination result.
[0037] Audio feature extraction and sensitive speech segment identification are performed on the video stream data to determine the sensitive speech segment identification result;
[0038] Based on the static posture determination result and the sensitive speech segment recognition result, the final determination result is determined.
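The static posture determination above can be sketched as follows. This is one possible reading, not the application's definition: "displacement variance" is assumed here to be the variance of each frame's distance from the key point's first observed position, and a mean variance below the static threshold is assumed to mean "static". The function names and the example threshold are hypothetical.

```python
import math
from statistics import mean, pvariance


def displacement_variance(track):
    """Variance of each frame's distance from the first position.

    `track` is a list of (x, y) coordinates of one key point over time.
    """
    x_ref, y_ref = track[0]
    distances = [math.hypot(x - x_ref, y - y_ref) for x, y in track]
    return pvariance(distances)


def is_static(head, shoulder, hip, static_threshold):
    """Average the three per-key-point variances and compare the mean
    with the static threshold (assumed: below threshold means static)."""
    variances = [displacement_variance(t) for t in (head, shoulder, hip)]
    return mean(variances) < static_threshold


still = [(10.0, 10.0)] * 5
moving = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (9.0, 12.0)]
print(is_static(still, still, still, static_threshold=1.0))
print(is_static(moving, moving, moving, static_threshold=1.0))
```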
[0039] In the above scheme, the step of extracting audio features and identifying sensitive speech segments from the video stream data, and determining the sensitive speech segment identification result, includes:
[0040] Audio features are extracted from the video stream data to obtain an audio feature sequence;
[0041] The audio feature sequence is extracted by pre-determined semantically sensitive branch, non-semantic anomaly branch and silence detection branch respectively, to obtain the first feature map, the second feature map and the third feature map corresponding to the semantically sensitive branch, the non-semantic anomaly branch and the silence detection branch respectively;
[0042] The first feature map, the second feature map, and the third feature map are fused to obtain fused features;
[0043] Based on the fusion features, sensitive speech segments are identified to obtain a three-dimensional vector; wherein, the three-dimensional vector includes the confidence scores corresponding to semantic sensitivity, non-semantic anomalies, and silent detection.
[0044] The sensitive speech segment recognition result is determined by comparing the confidence scores corresponding to the semantic sensitivity, non-semantic anomalies, and silence detection, as well as the preset confidence thresholds corresponding to the semantic sensitivity, non-semantic anomalies, and silence detection.
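The final comparison step can be illustrated with a short sketch. It assumes the three-dimensional vector holds confidences in the order (semantic sensitivity, non-semantic anomaly, silence detection) and that each branch is flagged when its score exceeds its own threshold; the label names and threshold values are hypothetical.

```python
LABELS = ("semantic_sensitive", "non_semantic_anomaly", "silence")


def classify_audio_segment(scores, thresholds):
    """Compare each branch confidence with its own preset threshold.

    Returns a dict mapping each (hypothetical) label to True when the
    corresponding confidence exceeds its threshold.
    """
    return {label: s > t for label, s, t in zip(LABELS, scores, thresholds)}


# Example: a clear call for help, no anomalous sound, no silence.
print(classify_audio_segment((0.9, 0.2, 0.1), (0.5, 0.5, 0.5)))
```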
[0045] In a second aspect, embodiments of this application provide a false fall filtering system based on multimodal spatiotemporal feature fusion. The system includes an acquisition module, a determination module, and an alarm module, wherein:
[0046] The acquisition module is used to acquire continuous video stream data containing the target person.
[0047] The determination module is used to perform skeletal key point recognition and posture indicator calculation on the video stream data using a pre-determined posture key point recognition model to determine a primary suspected fall event; based on the time corresponding to the primary suspected fall event, acquire multimodal data for a preset time period; perform multimodal analysis and multimodal fusion based on the multimodal data to determine a secondary suspected fall event; and, based on the secondary suspected fall event, perform static posture determination and sensitive speech segment recognition on the video stream data to determine the final determination result.
[0048] The alarm module is used to trigger an alarm mechanism when the final determination result indicates that the target person has actually fallen.
[0049] In a third aspect, embodiments of this application provide a false fall filtering device based on multimodal spatiotemporal feature fusion, comprising: a processor and a memory; wherein,
[0050] The memory is used to store computer programs;
[0051] The processor is configured to call and run the computer program from the memory to perform the method as described in the first aspect.
[0052] In a fourth aspect, embodiments of this application provide a computer-readable storage medium storing executable instructions for causing a processor to perform the method described in the first aspect.
[0053] This application provides a method and system for filtering false falls based on multimodal spatiotemporal feature fusion. The method includes: acquiring continuous video stream data containing a target person; performing skeletal keypoint recognition and posture index calculation on the video stream data using a pre-determined posture keypoint recognition model to determine a primary suspected fall event; acquiring multimodal data for a preset time period based on the time corresponding to the primary suspected fall event; performing multimodal analysis and multimodal fusion on the multimodal data to determine a secondary suspected fall event; performing static posture determination and sensitive speech segment recognition on the video stream data based on the secondary suspected fall event to determine a final determination result; and triggering an alarm mechanism if the final determination result indicates that the target person has actually fallen. The above scheme first uses a posture key point recognition model to identify skeletal key points and calculate posture indicators for the target person, enabling rapid initial screening of potential fall behaviors and reducing false positives and false negatives. Second, multimodal analysis and fusion are performed on the acquired multimodal data to cross-validate suspected fall behaviors from multiple dimensions, significantly reducing the probability of everyday actions such as bending over, squatting, sitting, and lying down being misjudged as falls. Finally, static posture determination and sensitive speech segment recognition are performed on the video stream data to reach the final determination for the target person, achieving reliable identification of real fall events and thus improving the accuracy of fall recognition results.
Description of the Drawings
[0054] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the specification, serve to explain the technical solutions of this application. Obviously, the drawings described below are merely some embodiments of this application, and those skilled in the art can obtain other drawings based on these drawings without any inventive effort.
[0055] The flowcharts shown in the accompanying drawings are merely illustrative and do not necessarily include all content and operations / steps, nor do they necessarily have to be performed in the described order. For example, some operations / steps can be broken down, while others can be combined or partially combined; therefore, the actual execution order may change depending on the specific circumstances.
[0056] Figure 1 is a schematic flowchart of an optional false fall filtering method based on multimodal spatiotemporal feature fusion provided in an embodiment of this application;
[0057] Figure 2 is a schematic diagram of an optional multimodal fusion process of the false fall filtering method based on multimodal spatiotemporal feature fusion provided in an embodiment of this application;
[0058] Figure 3 is a schematic structural diagram of a false fall filtering system based on multimodal spatiotemporal feature fusion provided in an embodiment of this application;
[0059] Figure 4 is a schematic structural diagram of a false fall filtering device based on multimodal spatiotemporal feature fusion provided in an embodiment of this application.
Detailed Implementation
[0060] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the specific technical solutions of this application will be further described in detail below with reference to the accompanying drawings of the embodiments of this application. The following embodiments are used to illustrate this application, but are not intended to limit the scope of this application.
[0061] Unless otherwise defined, all technical and scientific terms used in this application have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains. The terminology used in this application is for the purpose of describing embodiments of this application only and is not intended to be limiting of this application.
[0062] In the following description, references to "some embodiments," "this embodiment," "this application embodiment," and examples, etc., describe a subset of all possible embodiments. However, it is understood that "some embodiments" may be the same subset or different subset of all possible embodiments and may be combined with each other without conflict.
[0063] In the following description, the terms "first / second / third" are used only to distinguish similar objects and do not represent a specific order of objects. It is understood that "first / second / third" may be interchanged in a specific order or sequence where permitted, so that the embodiments of this application described herein can be implemented in an order other than that illustrated or described herein.
[0064] Based on this, embodiments of this application provide a false fall filtering method based on multimodal spatiotemporal feature fusion. Figure 1 is an optional flowchart of the false fall filtering method based on multimodal spatiotemporal feature fusion provided in an embodiment of this application; the method is explained below with reference to the steps shown in Figure 1.
[0065] S101. Obtain video stream data containing the target person over a continuous period of time.
[0066] In some embodiments of this application, the target person is someone who needs constant monitoring in daily life to prevent falls.
[0067] In some embodiments of this application, a false fall filtering method based on multimodal spatiotemporal feature fusion is adapted to fall recognition scenarios in various different regions.
[0068] In some embodiments of this application, a false fall filtering method based on multimodal spatiotemporal feature fusion is adapted to a false fall filtering system based on multimodal spatiotemporal feature fusion.
[0069] In some embodiments of this application, the robot's main camera performs high-speed, continuous visual monitoring of the target person to obtain continuous video stream data containing the target person.
[0070] S102. By using a pre-determined posture key point recognition model, skeletal key point recognition and posture index calculation are performed on the video stream data to identify primary suspected fall events.
[0071] In some embodiments of this application, a posture key point recognition model is used to identify the skeletal key points of a target person in video stream data, thereby obtaining the human skeletal key points and the first coordinate information corresponding to the human skeletal key points; based on the first coordinate information corresponding to the human skeletal key points, posture indicators are calculated to determine the torso main axis tilt angle and the change in head height; if the torso main axis tilt angle is greater than a preset tilt angle threshold and the change in head height is greater than a preset height threshold, a primary suspected fall event is determined.
[0072] S103. Based on the time corresponding to the primary suspected fall event, obtain multimodal data for a preset time period; and perform multimodal analysis and multimodal fusion based on the multimodal data to determine the secondary suspected fall event.
[0073] In some embodiments of this application, the multimodal data includes an initial human skeletal keypoint sequence, an initial triaxial acceleration sequence, and an initial environmental audio sequence.
[0074] In some embodiments of this application, multimodal data for a preset time period is obtained based on the time corresponding to the primary suspected fall event; time alignment processing is performed on the initial human skeletal keypoint sequence, the initial triaxial acceleration sequence, and the initial environmental audio sequence to obtain the human skeletal keypoint sequence, the triaxial acceleration sequence, and the environmental audio sequence; feature analysis and multimodal fusion are performed on the human skeletal keypoint sequence, the triaxial acceleration sequence, and the environmental audio sequence respectively to determine the fusion confidence; attention weights are obtained for adjusting the dynamic threshold; and dynamic threshold adjustment is performed based on the attention weights to determine the corrected fusion judgment threshold for the area where the target person is located; if the fusion confidence is greater than the corrected fusion judgment threshold, a secondary suspected fall event is determined.
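The application does not describe how time alignment is carried out. One common approach, shown here purely as an assumption, is to resample each timestamped sequence onto a shared timeline by linear interpolation; the function name `resample` and the example sampling times are hypothetical.

```python
from bisect import bisect_left


def resample(timestamps, values, target_times):
    """Linearly interpolate a timestamped scalar series at target_times.

    `timestamps` must be strictly increasing. Target times outside the
    recorded range are clamped to the first or last sample.
    """
    out = []
    for t in target_times:
        if t <= timestamps[0]:
            out.append(values[0])
        elif t >= timestamps[-1]:
            out.append(values[-1])
        else:
            j = bisect_left(timestamps, t)  # first index with timestamps[j] >= t
            t0, t1 = timestamps[j - 1], timestamps[j]
            w = (t - t0) / (t1 - t0)
            out.append(values[j - 1] + w * (values[j] - values[j - 1]))
    return out


# Align one acceleration axis (sampled at 0 s, 1 s, 2 s) to video frame times.
frame_times = [0.5, 1.0, 3.0]
print(resample([0.0, 1.0, 2.0], [0.0, 10.0, 20.0], frame_times))
```

Applying the same function to each acceleration axis and to per-window audio features would place all three modalities on the video frame timeline before feature analysis.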
[0075] S104. Based on the secondary suspected fall event, perform static posture determination and sensitive speech segment recognition on the video stream data to determine the final determination result.
[0076] In some embodiments of this application, based on the secondary suspected fall event, the posture keypoint recognition model is used to identify skeletal keypoints in the video stream data to determine the final human skeletal keypoints and their corresponding second coordinate information. These final human skeletal keypoints include head keypoints, shoulder keypoints, and hip keypoints. Based on the final human skeletal keypoints and their corresponding second coordinate information, the displacement variances corresponding to each of the head, shoulder, and hip keypoints are calculated. The mean displacement variance is determined based on these variances and is then compared with a preset static threshold to determine the static posture determination result. Audio feature extraction and sensitive speech segment recognition are performed on the video stream data to determine the sensitive speech segment recognition result. Finally, based on the static posture determination result and the sensitive speech segment recognition result, the final determination result is determined.
[0077] S105. If the final determination result indicates that the target person has actually fallen, trigger the alarm mechanism.
[0078] In some embodiments of this application, if the final determination result indicates that the target person has actually fallen, a preset follow-up processing flow is entered, including but not limited to triggering an alarm, sending a notification to a remote terminal, or activating an emergency response mechanism.
[0079] Understandably, the process first uses a posture key point recognition model to identify skeletal key points and calculate posture indicators for the target person. This allows rapid initial screening of potential fall behaviors, reducing both false positives and false negatives. Second, multimodal analysis and fusion of the acquired multimodal data cross-validate suspected fall behaviors from multiple dimensions, significantly reducing the probability of everyday actions such as bending, squatting, sitting, and lying down being misjudged as falls. Finally, static posture determination and sensitive speech segment recognition are performed on the video stream data to reach the final determination for the target person, ensuring reliable identification of real fall events and improving the accuracy of fall recognition results.
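The three-stage cascade of S102 to S105 can be summarized in a short control-flow sketch. The stage checks are passed in as callables because their internals are described elsewhere; the function name and the returned status strings are hypothetical and serve only to show that each stage runs only if the previous one raised a suspicion.

```python
def filter_false_fall(primary_check, secondary_check, final_check, trigger_alarm):
    """Run the screening stages in order, stopping at the first rejection.

    Each *_check is a zero-argument callable returning True when its stage
    still suspects a fall. The alarm is triggered only if all stages agree.
    """
    if not primary_check():        # S102: posture key point screening
        return "no_fall"
    if not secondary_check():      # S103: multimodal fusion check
        return "filtered_at_secondary"
    if not final_check():          # S104: static posture + speech check
        return "filtered_at_final"
    trigger_alarm()                # S105
    return "fall_confirmed"


alarms = []
status = filter_false_fall(lambda: True, lambda: False, lambda: True,
                           lambda: alarms.append("alarm"))
print(status, alarms)
```

Ordering the cheap posture check first means the more expensive multimodal and audio analyses run only on the small fraction of frames that pass the initial screening.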
[0080] In some embodiments of this application, S102 can be implemented by S201-S203, as follows:
[0081] S201. Using a pose key point recognition model, identify the skeletal key points of the target person in the video stream data to obtain the human skeletal key points and the first coordinate information corresponding to the human skeletal key points.
[0082] S202. Based on the first coordinate information corresponding to the key points of the human skeleton, calculate the posture index and determine the torso main axis tilt angle and head height change.
[0083] In some embodiments of this application, based on the first coordinate information corresponding to the key points of the human skeleton, target coordinate information corresponding to two sets of target skeletal key points related to the trunk posture is determined; wherein, the two sets of target skeletal key points are the upper body skeletal key points and the lower body skeletal key points, respectively; based on the target coordinate information, the trunk principal axis tilt angle is calculated; based on the first coordinate information corresponding to the key points of the human skeleton, longitudinal coordinate values including head or neck key points are selected to determine the longitudinal coordinate sequence of key points; based on the longitudinal coordinate sequence of key points, the change in head height is calculated.
[0084] S203. If the tilt angle of the torso's main axis is greater than a preset tilt angle threshold and the change in head height is greater than a preset height threshold, a primary suspected fall event is identified.
[0085] For example, the robot's main camera acquires continuous video stream data and performs real-time detection and analysis of the target person in the video stream based on a pose keypoint recognition model, extracting the coordinate information of the target person's skeletal key points. The skeletal key points include at least key points related to the head, torso, and limbs, used to characterize the overall posture structure and motion state of the target person. The lightweight pose keypoint recognition model is used to achieve efficient human target detection and skeletal keypoint extraction on resource-constrained mobile robot platforms. The overall structure of the lightweight pose keypoint recognition model includes, in sequence: an input layer, a lightweight convolutional feature extraction module, and a pose keypoint output module. The specific functions of the input layer, lightweight convolutional feature extraction module, and pose keypoint output module are as follows:
[0086] 1. Input layer.
[0087] The input layer receives multiple consecutive single-frame RGB image data (i.e., video stream data) from the robot's main camera. The images are uniformly scaled and normalized to a preset input size (e.g., 640×640 pixels). The input layer has three channels, corresponding to the three RGB channels, to ensure consistency of input data under different lighting conditions. The input size is H×W×3.
[0088] 2. Lightweight convolutional feature extraction module.
[0089] The lightweight convolutional feature extraction module is used to extract spatial features from the input image. It employs a lightweight convolutional network structure to reduce computational complexity. This structure combines multiple layers of depthwise separable convolution and pointwise convolution. The kernel size is primarily 3×3; the stride is set to 1 or 2 depending on the feature level; each convolutional layer is followed by batch normalization and a SiLU activation function; the number of feature channels gradually increases with network depth to extract human contours, joint neighborhood textures, and local structural features. While maintaining pose feature representation capability, the lightweight convolutional feature extraction module significantly reduces the number of model parameters and the inference latency, making it suitable for embedded or edge computing platforms. The output of the lightweight convolutional feature extraction module is a multi-scale pose feature map set. The output size of a single feature map in the set can be represented as h×w×c, where h×w is the downsampled spatial size and c is the number of feature channels (significantly greater than 3, used to carry high-dimensional pose semantic information).
[0090] 3. Pose key point output module.
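To make the parameter saving concrete, the following sketch compares weight counts for a standard 3×3 convolution and a depthwise separable one. The channel sizes (64 in, 128 out) are illustrative assumptions and bias terms are ignored; the counts follow the standard formulas rather than any figure given in the application.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k=3):
    """Weights in a depthwise k x k convolution followed by a 1 x 1
    pointwise convolution (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


std = standard_conv_params(64, 128)
sep = depthwise_separable_params(64, 128)
print(std, sep, round(std / sep, 1))
```

For these example sizes the separable layer uses 8,768 weights instead of 73,728, roughly an 8.4-fold reduction.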
[0091] After completing multi-scale feature extraction, the pose keypoint recognition model, based on the multi-branch prediction head structure of YOLOv8-Pose, performs joint prediction of human targets, including:
[0092] Two-dimensional coordinate regression of multiple key points of the human skeleton.
[0093] Key point prediction adopts a regression method that directly outputs the position of each key point in the coordinate system of the input image, avoiding complex post-processing steps and thereby improving overall inference efficiency.
[0094] The output data structure can be represented as: N×2K; where N is the preset number of skeletal key points; and 2K corresponds to the first coordinate information of the key points.
[0095] Based on the first coordinate information of the skeletal key points, the relevant features of posture change and height change (i.e., the tilt angle of the torso axis and the change in head height) of the target person are calculated frame by frame to characterize the degree of posture abnormality of the target object. The specific calculation process includes the following.
[0096] The calculation of the trunk's main axis tilt angle is as follows:
[0097] A. Establishment of coordinate system.
[0098] A two-dimensional coordinate system is established with the image plane as a reference, where:
[0099] The horizontal direction is the X-axis; the vertical direction is the Y-axis. An increase in the Y-axis value indicates an increase in the height of the target point relative to the ground.
[0100] B. Selection of key points.
[0101] Select two sets of target coordinate information corresponding to target skeletal key points related to trunk posture from the set of skeletal key points (i.e., human skeletal key points), and use them as upper body reference points and lower body reference points respectively. The target skeletal key points include at least: the center points of both shoulders; the center points of both hips; and can be extended to the center points of the head, neck or both knees as needed. They are denoted as upper body coordinates [x0, y0] and lower body coordinates [x1, y1].
[0102] C. Calculation of the tilt angle of the trunk's main axis.
[0103] Based on the target coordinate information of the key points of the target skeleton, the tilt angle of the torso's main axis relative to the vertical direction of the ground is calculated. The tilt angle of the torso's main axis is used to characterize the degree of deviation of the target object's torso posture. The calculation method is as follows:
[0104] $$\theta = \arctan\left(\frac{\lvert x_0 - x_1 \rvert}{\lvert y_0 - y_1 \rvert}\right)$$
[0105] where θ represents the tilt angle of the torso's main axis; [x0, y0] represents the upper body coordinates; and [x1, y1] represents the lower body coordinates.
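The tilt angle calculation of step C can be illustrated with a minimal sketch. It assumes the angle is measured between the line joining the lower body and upper body reference points and the vertical (Y) axis; the function name `torso_tilt_angle_deg` is hypothetical and does not appear in this application.

```python
import math


def torso_tilt_angle_deg(upper, lower):
    """Angle in degrees between the torso axis and the vertical Y axis.

    upper = (x0, y0) is the upper body reference point (e.g. shoulder center),
    lower = (x1, y1) is the lower body reference point (e.g. hip center).
    """
    x0, y0 = upper
    x1, y1 = lower
    dx = abs(x0 - x1)
    dy = abs(y0 - y1)
    # atan2 also covers dy == 0 (horizontal torso) without dividing by zero.
    return math.degrees(math.atan2(dx, dy))


upright = torso_tilt_angle_deg((0.0, 1.5), (0.0, 1.0))
lying = torso_tilt_angle_deg((1.0, 0.2), (0.5, 0.2))
diagonal = torso_tilt_angle_deg((1.0, 2.0), (0.0, 1.0))
```

Under this assumption an upright torso gives an angle near 0° and a torso lying flat gives an angle near 90°, so the example threshold of 60° separates the two cases.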
[0106] The calculation of head height change is as follows:
[0107] D. Establishment of coordinate system.
[0108] The coordinate system is established in the same way as in the calculation of the tilt angle of the torso's main axis.
[0109] E. Selection of key points.
[0110] Select the vertical coordinate values of at least the head or neck key points from the set of skeletal key points to form a continuous time series [y1, y2, ..., yt].
[0111] F. Calculation of head height change.
[0112] The instantaneous change in head height is calculated by accumulating the vertical coordinate changes of the head key point between adjacent frames. The change in head height reflects whether the target object shows a significant downward trend within a short period of time.
[0113] $$\Delta H = \sum_{i=1}^{t-1} \left( y_i - y_{i+1} \right)$$
[0114] where ΔH represents the change in head height; y_i represents the vertical coordinate value of the head or neck key point in the i-th frame; y_{i+1} represents the vertical coordinate value of the head or neck key point in the (i+1)-th frame; and t represents the current time.
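The accumulation in step F can be sketched as follows. This sketch assumes the per-frame change is taken as y_i minus y_(i+1), so that a descent is positive under the convention of [0099] that a larger Y value means a greater height; the function name `head_height_drop` is hypothetical.

```python
def head_height_drop(y_series):
    """Accumulate the frame-to-frame descent of the head (or neck) key point.

    y_series is the time series [y1, ..., yt] of vertical coordinates.
    The result is positive when the head ends lower than it started.
    """
    return sum(y_prev - y_next for y_prev, y_next in zip(y_series, y_series[1:]))


falling = head_height_drop([1.6, 1.5, 1.0, 0.4])
standing_up = head_height_drop([0.9, 1.2, 1.6])
```

Because the per-frame terms telescope, the accumulated value equals the first coordinate minus the last one, so `falling` is about 1.2 and `standing_up` is about -0.7 in the units of the input.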
[0115] The target person's current behavior is classified as a "primary suspected fall event" when all of the following conditions are met:
[0116] The trunk's main axis tilt angle exceeds a preset tilt angle threshold (e.g., 60°).
[0117] The change in head height exceeds a preset height threshold (e.g., 70 cm), meaning the head shows a significant downward trend within a preset time window.
[0118] After a primary suspected fall event is determined, the system triggers the subsequent multimodal spatiotemporal fusion verification process; if the above conditions are not met, the system remains in the normal visual monitoring state.
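The primary screening rule can be written as a two-condition check. The sketch below uses the example thresholds given above (60° and 70 cm) as defaults and assumes the head height change has already been converted to centimeters, a conversion this description does not specify; the function name is hypothetical.

```python
def is_primary_suspected_fall(tilt_deg, head_drop_cm,
                              tilt_threshold_deg=60.0,
                              drop_threshold_cm=70.0):
    """Return True only when both screening conditions are met."""
    tilt_exceeded = tilt_deg > tilt_threshold_deg
    drop_exceeded = head_drop_cm > drop_threshold_cm
    return tilt_exceeded and drop_exceeded


fall_like = is_primary_suspected_fall(tilt_deg=75.0, head_drop_cm=90.0)
bending_over = is_primary_suspected_fall(tilt_deg=75.0, head_drop_cm=20.0)
sitting_down = is_primary_suspected_fall(tilt_deg=10.0, head_drop_cm=80.0)
```

Requiring both conditions means that bending over (large tilt, small drop) or sitting down (large drop, small tilt) does not by itself trigger the multimodal verification stage.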
[0119] In some embodiments of this application, determining the secondary suspected fall event from the multimodal data through multimodal analysis and multimodal fusion in S103 can be achieved through S301-S304, as follows:
[0120] S301. Based on the initial human skeleton key point sequence, the initial triaxial acceleration sequence, and the initial environmental audio sequence, perform time alignment processing to obtain the human skeleton key point sequence, the triaxial acceleration sequence, and the environmental audio sequence.
[0121] S302. Perform feature analysis and multimodal fusion on the human skeleton key point sequence, triaxial acceleration sequence and environmental audio sequence respectively, and determine the fusion confidence.
[0122] In some embodiments of this application, a similarity between a sequence of key points on the human skeleton and a preset sequence of real fall patterns is calculated using a pre-determined neural network model for fall behavior recognition to obtain fall confidence. Feature extraction is performed on a triaxial acceleration sequence to obtain the energy distribution and frequency characteristics of the vibration signal. Based on the energy distribution and frequency characteristics, vibration confidence is calculated. Feature extraction is performed on an environmental audio sequence to obtain transient impact characteristics. Based on the transient impact characteristics, acoustic modal confidence is calculated. Multimodal fusion is performed based on fall confidence, vibration confidence, and acoustic modal confidence to determine fusion confidence.
[0123] S303. Obtain the attention weights used to adjust the dynamic threshold; and adjust the dynamic threshold based on the attention weights to determine the corrected fusion judgment threshold for the area where the target person is located.
[0124] S304. If the fusion confidence is greater than the corrected fusion judgment threshold, a secondary suspected fall event is identified.
[0125] For example, for a triggered primary suspected fall event, multimodal time-series information is introduced for cross-validation. Through collaborative analysis of multi-source data such as posture, vibration, and acoustics, actual fall behavior is further distinguished from daily activity, thereby significantly reducing the false alarm rate.
[0126] S1. Multimodal data synchronization and time alignment.
[0127] When a primary suspected fall event is detected, initial multimodal data for a preset time period is obtained, centered on the time of the event.
[0128] The preset time period covers the time before and after the primary suspected fall event, centered on the moment the event occurs, and is set to 3 seconds before and after the event. The initial multimodal data includes:
[0129] Visual modality: Initial sequence of key points on the human skeleton;
[0130] Motion modality: Initial three-axis acceleration sequence acquired by the robot chassis IMU;
[0131] Acoustic modality: Initial environmental audio sequence.
[0132] All of the above modal data are aligned to a unified timestamp to form a multimodal time-series data set in which every modality covers the same time span. The time-series data set includes the human skeletal key point sequence, the triaxial acceleration sequence, and the environmental audio sequence, providing a consistent time reference for subsequent fusion analysis.
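One way to picture the alignment in S1 is to resample every modality onto a shared timeline centered on the event. The sketch below assumes linear interpolation of scalar channels and a hypothetical common rate; the description does not state which resampling method is used, and the helper names are illustrative only.

```python
from bisect import bisect_left


def window_timeline(event_time, half_window=3.0, rate_hz=10.0):
    """Shared timestamps covering half_window seconds before and after the event."""
    n = int(round(2 * half_window * rate_hz))
    return [event_time - half_window + k / rate_hz for k in range(n + 1)]


def resample(timestamps, values, target_times):
    """Linearly interpolate one scalar channel onto target_times.

    timestamps must be strictly increasing; targets outside the recorded
    range are clamped to the first or last sample.
    """
    out = []
    for t in target_times:
        j = bisect_left(timestamps, t)
        if j == 0:
            out.append(values[0])
        elif j == len(timestamps):
            out.append(values[-1])
        else:
            t0, t1 = timestamps[j - 1], timestamps[j]
            w = (t - t0) / (t1 - t0)
            out.append(values[j - 1] + w * (values[j] - values[j - 1]))
    return out


timeline = window_timeline(event_time=10.0, half_window=3.0, rate_hz=1.0)
aligned = resample([0.0, 2.0, 4.0], [0.0, 20.0, 40.0], [1.0, 2.0, 3.0, 5.0])
```

Applying `resample` to each key point coordinate, each acceleration axis, and each audio feature channel with the same `timeline` yields sequences of equal length that share one time reference.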
[0133] S2. Spatial-Temporal Pattern Similarity Analysis.
[0134] The sequence of human skeletal key points within a time window is constructed into a spatiotemporal graph of human posture and input into a pre-trained neural network model for fall behavior recognition. The neural network model for fall behavior recognition analyzes the structural relationships of skeletal key points in the spatial dimension and their evolutionary characteristics in the temporal dimension, calculates the similarity between the current action sequence and a predefined real fall pattern, and outputs the fall confidence S_visual of the posture modality.
[0135] S3. IMU vibration energy and spectral characteristics analysis.
[0136] The IMU acceleration sequence within the same time window is analyzed, focusing on extracting the energy distribution and frequency characteristics of the vibration signal. Real falls usually produce low-frequency, high-energy instantaneous impact characteristics through ground conduction, whereas the vibration patterns generated by daily actions (such as walking, bending over, or stomping) differ significantly in frequency and energy distribution. On this basis, the vibration confidence S_imu of the IMU modality is calculated.
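The two vibration features named in S3, energy distribution and frequency content, can be sketched with a plain discrete Fourier transform. This is an illustrative feature computation only: the cutoff frequency is a hypothetical parameter, and the mapping from these features to S_imu is not specified here, so it is not modeled.

```python
import cmath
import math


def acceleration_magnitude(ax, ay, az):
    """Euclidean norm of one triaxial acceleration sample."""
    return math.sqrt(ax * ax + ay * ay + az * az)


def low_frequency_energy_ratio(samples, sample_rate_hz, cutoff_hz):
    """Fraction of the (mean-removed) spectral energy at or below cutoff_hz."""
    n = len(samples)
    if n == 0:
        return 0.0
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    low = 0.0
    total = 0.0
    # Naive O(n^2) DFT over the positive-frequency bins; fine for short windows.
    for k in range(1, n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(centered))
        power = abs(coeff) ** 2
        total += power
        if k * sample_rate_hz / n <= cutoff_hz:
            low += power
    return low / total if total > 0 else 0.0
```

For example, a 2 Hz oscillation sampled at 64 Hz gives a ratio close to 1 for a 5 Hz cutoff, while a 20 Hz oscillation gives a ratio close to 0.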
[0137] S4. Environmental acoustic characteristics analysis.
[0138] Acoustic features are extracted from the environmental audio sequence within the time window, with a focus on transient impact features in the low-frequency band. By filtering out speech, television noise, and other common environmental noises, it is determined whether there is an acoustic pattern that matches the characteristics of "a person falling and hitting the ground," and the acoustic modal confidence S_audio is output accordingly.
[0139] S5. Dynamic threshold adjustment based on scene semantics.
[0140] To further reduce misjudgments related to usage scenarios, scene semantic information is introduced to assist in the decision-making process. The robot identifies the type of area a target person is currently in, including but not limited to "living room," "bedroom," "kitchen," or "bathroom," through visual semantic segmentation or a pre-built indoor semantic map. For different scenarios, the system invokes the corresponding baseline model of normal activity behavior and dynamically adjusts the fusion judgment threshold. For example, in a kitchen scenario, because actions such as bending over and squatting occur frequently, the judgment threshold will be appropriately increased to require more substantial multimodal evidence.
[0141] S6. Integrated Decision Making.
[0142] The fusion confidence S_fused is calculated based on S_visual, S_imu, and S_audio. When S_fused exceeds the corrected fusion judgment threshold, the system marks the current event as a secondary suspected fall event and triggers the subsequent fall confirmation process.
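The decision in S6 can be illustrated with a simple stand-in for the fusion step. The sketch assumes a weighted average of the three confidences with hypothetical weights; this description does not give a closed-form fusion rule, and in the network embodiment the fusion is learned rather than fixed.

```python
def fuse_confidences(s_visual, s_imu, s_audio, weights=(0.5, 0.3, 0.2)):
    """Weighted average of the three modality confidences (weights are assumptions)."""
    w_visual, w_imu, w_audio = weights
    total = w_visual + w_imu + w_audio
    return (w_visual * s_visual + w_imu * s_imu + w_audio * s_audio) / total


def is_secondary_suspected_fall(s_fused, threshold):
    """The event is escalated only when the fused confidence exceeds the threshold."""
    return s_fused > threshold


consistent = fuse_confidences(0.9, 0.8, 0.7)
vision_only = fuse_confidences(0.9, 0.1, 0.1)
```

With a threshold of 0.8, `consistent` (about 0.83) is escalated, while `vision_only` (about 0.50) is not, which reflects the intent that a pose-only alarm without matching vibration or acoustic evidence is filtered out.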
[0143] The methods S1-S6 described above can be implemented using a multimodal fusion network. As shown in Figure 2, the multimodal fusion network includes an input module, a feature extraction module, a fusion and decision module, and an output module. The specific functions of each module are as follows:
[0144] 1. Input module (corresponding to S1).
[0145] Composition: Three independent input channels and corresponding normalization layers.
[0146] Function: Receives and preprocesses multimodal data, eliminates dimensional differences, and provides standard input for subsequent networks.
[0147] Pose sequence channel: Input is a T×17×2 sequence, representing T time frames, 17 keypoints, and (x,y) coordinates for each point. Scaled to [0,1] by a normalization layer. Output size: T×34 (coordinates flattened).
[0148] IMU vibration sequence channel: Input is a T×3 sequence, representing T time frames and 3-axis acceleration. Normalized. Output size: T×3.
[0149] Audio MFCC Sequence Channel: Input is a T×13 sequence, representing T time frames and 13-dimensional MFCC features. Normalized. Output size: T×13.
[0150] 2. Feature extraction module (corresponding to S2-S4).
[0151] Composition: Three parallel feature extraction subnetworks.
[0152] Function: Extract high-level, fall-related discriminative features from sequences of different modalities.
[0153] a. Pose Spatiotemporal Feature Extraction Subnetwork: The core of this subnetwork is a lightweight spatiotemporal graph convolutional network.
[0154] Graph Convolutional Layer: Constructs a spatiotemporal graph of the human skeleton, capturing spatial relationships between keypoints through two layers of graph convolution. Output size: T×256.
[0155] Temporal convolutional layer: Followed by one 1D temporal convolutional layer to capture dynamic patterns of pose changes. Output size: T×128.
[0156] Global pooling: Perform global average pooling along the time dimension T to obtain a fixed-length pose feature vector. Final output size: 1×128.
[0157] b. Vibration Feature Extraction Subnetwork: This subnetwork is a simple one-dimensional CNN.
[0158] Structure: It contains two one-dimensional convolutional layers (kernel size 3, number of channels 16 and 32), each followed by a ReLU activation and pooling layer.
[0159] Function: Extracts impulse energy and frequency pattern features from IMU sequences. Final output size: 1 × 32.
[0160] c. Acoustic Feature Extraction Subnetwork: This subnetwork has a similar structure to the vibration subnetwork and is a one-dimensional CNN.
[0161] Structure: Contains two one-dimensional convolutional layers (kernel size 5, number of channels 16 and 32) to capture low-frequency impact features in audio MFCC.
[0162] Function: Filters ambient noise and extracts acoustic features associated with fall impacts. Final output size: 1×32.
[0163] 3. Fusion and decision module (corresponding to S5-S6).
[0164] Composition: Feature concatenation layer, fully connected fusion layer, and decision output head.
[0165] Function: Deeply fuse multimodal features and make a final judgment based on the fused features.
[0166] Feature concatenation and fusion: The output vectors (128, 32, 32) of the three sub-networks are concatenated to obtain a 1×192 fused feature. This is then deep-fused and dimensionality-reduced using a fully connected layer (192 transformed to 64, ReLU activation). Output size: 1×64.
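The tensor sizes stated for the input, feature extraction, and fusion modules can be cross-checked with a small shape bookkeeping sketch. It only tracks dimensions and performs no learning or inference; the function name and the example window length are hypothetical.

```python
def fusion_network_shapes(t_frames, keypoints=17, mfcc_dim=13):
    """Return the stated tensor shape at each stage for a window of t_frames frames."""
    shapes = {
        "pose_input": (t_frames, keypoints, 2),
        "pose_flat": (t_frames, keypoints * 2),   # (x, y) coordinates flattened
        "imu_input": (t_frames, 3),
        "audio_input": (t_frames, mfcc_dim),
        "pose_feature": (1, 128),                 # after global average pooling
        "imu_feature": (1, 32),
        "audio_feature": (1, 32),
    }
    concat_dim = sum(shapes[name][1]
                     for name in ("pose_feature", "imu_feature", "audio_feature"))
    shapes["concatenated"] = (1, concat_dim)
    shapes["fused"] = (1, 64)                     # fully connected layer, ReLU
    return shapes


shapes = fusion_network_shapes(t_frames=90)
```

The concatenated width evaluates to 128 + 32 + 32 = 192, which matches the 1×192 fused feature that the fully connected layer reduces to 1×64.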
[0167] 4. Output module.
[0168] Based on the fused 64-dimensional features, two parallel output branches are set up:
[0169] 1. Fall confidence output head (64-to-1, Sigmoid): outputs the fusion confidence S_fused;
[0170] 2. Scene weight adjustment output head (64-to-N, Softmax): outputs the attention weights (i.e., scene adjustment factors) corresponding to the different scenes, which are used for subsequent dynamic threshold adjustment. The output is an N-dimensional vector of probability weights, one per scene; here N = 4, corresponding to the four scenes "living room," "bedroom," "kitchen," and "bathroom."
[0171] Finally, the fusion judgment threshold is adjusted based on the robot's current location and the scene probability weights. If S_fused exceeds the adjusted fusion judgment threshold (e.g., 0.8), the subsequent fall confirmation process is triggered.
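The scene-dependent threshold adjustment can be sketched as a base threshold shifted by a probability-weighted offset. The exact adjustment rule is not given in this description, so the additive form and every offset value below are assumptions; only the direction for the kitchen (a higher threshold) follows the example in S5.

```python
# Hypothetical per-scene offsets; only the kitchen increase is motivated by the text.
SCENE_OFFSETS = {
    "living room": 0.0,
    "bedroom": 0.0,
    "kitchen": 0.05,
    "bathroom": 0.0,
}


def adjusted_threshold(scene_probs, base=0.8, offsets=SCENE_OFFSETS):
    """Shift the base threshold by the expected offset under the scene weights."""
    shift = sum(prob * offsets.get(scene, 0.0) for scene, prob in scene_probs.items())
    # Keep the result a valid confidence threshold.
    return min(1.0, max(0.0, base + shift))


kitchen = adjusted_threshold({"kitchen": 1.0})
mixed = adjusted_threshold({"kitchen": 0.5, "living room": 0.5})
```

Using the softmax weights rather than a single hard scene label lets an uncertain scene estimate move the threshold only part of the way, as in `mixed`.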
[0172] In some embodiments of this application, S104 can be implemented by S401-S405, as follows:
[0173] S401. Based on the secondary suspected fall event, the video stream data is analyzed using the posture key point recognition model to identify skeletal key points and determine the final human skeletal key points and the corresponding second coordinate information; wherein the final human skeletal key points include head key points, shoulder key points, and hip key points.
[0174] S402. Based on the final human skeleton key points and the corresponding second coordinate information, calculate the displacement variance of the head key points, shoulder key points and hip key points respectively.
[0175] S403. Based on the displacement variances corresponding to the head key points, shoulder key points, and hip key points, determine the mean displacement variance; and compare the mean displacement variance with the preset static threshold to determine the static posture judgment result.
[0176] S404. Extract audio features and identify sensitive speech segments from the video stream data, and determine the result of the sensitive speech segment identification.
[0177] In some embodiments of this application, audio features are extracted from the video stream data to obtain an audio feature sequence. Features are then extracted from the audio feature sequence through a pre-determined semantically sensitive branch, non-semantic anomaly branch, and silence detection branch to obtain a first feature map, a second feature map, and a third feature map, respectively. The first, second, and third feature maps are fused to obtain fused features. Based on the fused features, sensitive speech segment recognition is performed to obtain a three-dimensional vector containing the confidence scores for semantic sensitivity, non-semantic anomaly, and silence detection. The sensitive speech segment recognition result is determined by comparing each of these confidence scores with its corresponding preset confidence threshold.
[0178] S405. Based on the static posture determination result and the sensitive speech segment recognition result, determine the final determination result.
[0179] For example, after the detection of a secondary suspected fall event is completed, the process proceeds to the post-fall confirmation stage. This stage performs the final verification of the secondary suspected fall event to determine whether the target object has actually fallen, thereby further reducing the false alarm rate and providing a reliable basis for subsequent alarm or intervention procedures. The post-fall confirmation method is initiated after the secondary suspected fall event is triggered. By fusing static analysis based on posture key points with sensitive speech segment recognition based on audio signals, it performs a rapid and objective joint determination of the suspected fall event and generates the final "actual fall" confirmation result. The specific determination process is as follows:
[0180] S21. Static determination based on posture key points.
[0181] S211. Posture sequence acquisition.
[0182] Within a preset confirmation time window, the pose keypoint recognition model is run again on the continuously acquired video streams to obtain the second coordinate information of the final human skeleton keypoints of the target person in each time frame. The final human skeleton keypoints include at least head keypoints, shoulder keypoints, and hip keypoints, which are used to characterize the overall posture and torso position changes of the target person.
[0183] S212. Calculation of key point displacement variance.
[0184] For each type of final skeletal keypoint within the time window, the temporal position changes of its location in the image coordinate system are statistically analyzed.
[0185] For any final skeletal key point, calculate the displacement change of its corresponding coordinate position in the time series, and further calculate the displacement variance to quantify the level of motion of the target character in that time period.
[0186] S213. Static posture determination.
[0187] The displacement variances of the head, shoulder, and hip key points are compared with a preset static threshold (e.g., 2). When the mean displacement variance of these key points is lower than a preset, very low threshold (e.g., 0.4) and the displacement variance of at least one type of key point does not exceed the preset static threshold, the system determines that the target person has remained static throughout the time window, and the static posture determination condition is met.
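Steps S212 and S213 can be sketched as follows. The sketch assumes that "displacement" means the frame-to-frame distance travelled by a key point and that the variance is the population variance of those distances; this description does not fix either choice, and the function names are hypothetical. The default thresholds are the example values 0.4 and 2.

```python
import math
from statistics import pvariance


def displacement_variance(points):
    """Variance of the frame-to-frame displacement magnitudes of one key point."""
    steps = [math.dist(a, b) for a, b in zip(points, points[1:])]
    return pvariance(steps) if steps else 0.0


def is_static(tracks, mean_threshold=0.4, static_threshold=2.0):
    """Apply the two-part static rule to a dict of key point tracks."""
    variances = [displacement_variance(points) for points in tracks.values()]
    mean_variance = sum(variances) / len(variances)
    return (mean_variance < mean_threshold
            and any(v <= static_threshold for v in variances))


still = {
    "head": [(100.0, 200.0)] * 5,
    "shoulder": [(110.0, 180.0)] * 5,
    "hip": [(120.0, 150.0)] * 5,
}
moving = {
    "head": [(0.0, 0.0), (10.0, 0.0), (10.0, 0.0), (30.0, 0.0), (30.0, 0.0)],
    "shoulder": [(0.0, 0.0)] * 5,
    "hip": [(0.0, 0.0)] * 5,
}
```

One caveat of this assumption is that perfectly uniform motion also yields zero variance of step lengths, so an implementation might instead measure displacement relative to the first frame of the window.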
[0188] S22. Sensitive speech segment recognition and determination based on audio.
[0189] S221. Input and Feature Extraction.
[0190] The environmental audio sequence is used as input, and the audio features are represented as an L×D matrix of MFCC features, where L is the number of time frames and D is the MFCC feature dimension (usually 13 or 26 dimensions, including difference features). MFCC features are a set of coefficients (usually 13 to 39) that concisely describe the spectral shape (i.e., timbre) of the sound within the current short time interval (e.g., 25 milliseconds).
[0191] The environmental audio sequence is first input into a shared one-dimensional convolutional feature extraction layer. The shared layer adopts a one-dimensional convolutional structure (with a kernel size of 3, 16 channels, and a ReLU activation function) to extract the basic common features of the audio signal and unify the feature dimension.
[0192] S222. Parallel processing of three branches.
[0193] The three branches receive features from the shared layer, process them in parallel, and each performs its own function:
[0194] Branch A (semantic sensitive branch):
[0195] Structure: 1 one-dimensional convolutional layer (kernel size = 5, number of channels = 16, ReLU activation).
[0196] Function: Captures specific acoustic patterns of distress keywords such as "help" and "help me".
[0197] Branch B (non-semantic anomaly branch):
[0198] Structure: 1 one-dimensional convolutional layer (kernel size = 5, number of channels = 16, ReLU activation).
[0199] Function: Captures abnormal but semantically meaningless acoustic events such as painful groans, heavy impacts, and rapid breathing.
[0200] Branch C (silence detection branch):
[0201] Structure: Energy statistics layer. In the time domain, it calculates the short-time energy moving average in units of M frames (e.g., M=10, corresponding to about 1 second) to obtain the energy envelope sequence.
[0202] One-dimensional convolutional layer: 1 one-dimensional convolutional layer (kernel size = M, number of channels = 8, ReLU activation).
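The energy statistics layer of Branch C can be illustrated with a moving average over per-frame short-time energy. The sketch assumes the short-time energy of a frame is its mean squared amplitude, which is a common definition but is not stated in this description; the function names are hypothetical.

```python
def frame_energy(samples):
    """Short-time energy of one audio frame, taken here as the mean squared amplitude."""
    return sum(s * s for s in samples) / len(samples)


def energy_envelope(frame_energies, m=10):
    """Moving average of per-frame energy over windows of m consecutive frames."""
    if m <= 0:
        raise ValueError("m must be positive")
    envelope = []
    running = 0.0
    for i, energy in enumerate(frame_energies):
        running += energy
        if i >= m:
            # Drop the frame that just left the window.
            running -= frame_energies[i - m]
        if i >= m - 1:
            envelope.append(running / m)
    return envelope


decaying = energy_envelope([4.0, 0.0, 0.0, 0.0], m=2)
```

An envelope that drops to and stays near zero, as in `decaying`, is the kind of pattern the subsequent convolutional layer can use as evidence that meaningful sound has disappeared.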
[0203] S223. Feature fusion and sensitive speech segment recognition output.
[0204] Feature concatenation: The feature maps output from the three branches are concatenated along the channel dimension.
[0205] Fusion and Classification: After passing through a global average pooling layer, it is connected to a fully connected layer (output dimension is 64, ReLU activation), and finally connected to a three-head output layer.
[0206] The final output is a three-dimensional vector [P_keyword, P_acoustic, P_silence], representing the model's confidence (range 0 to 1) that the current audio segment contains a distress keyword, an abnormal acoustic event, and the disappearance of meaningful sound, respectively. When any component of the vector exceeds its corresponding preset threshold (i.e., the preset confidence thresholds for semantic sensitivity, non-semantic anomaly, and silence detection), it is determined that a sensitive speech signal highly related to a fall is present within that time period, and the speech determination condition is met.
[0207] It should be noted that the preset confidence thresholds for semantic sensitivity, non-semantic anomaly, and silence detection are 0.6, 0.7, and 0.5, respectively.
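The speech determination rule can be sketched as a per-component threshold comparison, using the thresholds 0.6, 0.7, and 0.5 stated above. The function name and the dictionary keys are hypothetical labels for the three output components.

```python
SPEECH_THRESHOLDS = {"keyword": 0.6, "acoustic": 0.7, "silence": 0.5}


def sensitive_speech_detected(p_keyword, p_acoustic, p_silence,
                              thresholds=SPEECH_THRESHOLDS):
    """Return (met, triggered): met is True if any component exceeds its threshold."""
    scores = {"keyword": p_keyword, "acoustic": p_acoustic, "silence": p_silence}
    triggered = [name for name, score in scores.items() if score > thresholds[name]]
    return bool(triggered), triggered


groan = sensitive_speech_detected(0.2, 0.75, 0.1)
background = sensitive_speech_detected(0.5, 0.6, 0.4)
```

Returning the list of triggered components alongside the boolean keeps the reason for the decision (keyword, abnormal sound, or silence) available for logging or for the alarm message.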
[0208] S23. Fall Confirmation and Follow-up Handling.
[0209] S231. Joint confirmation and determination.
[0210] The final confirmation will be based on the following judgment results:
[0211] Static posture determination results;
[0212] Sensitive speech segment recognition results;
[0213] When all of the above conditions are met, it is confirmed that the target person has actually fallen.
[0214] The embodiments of this application have the following beneficial effects:
[0215] First, the target object is continuously visually monitored through a posture key point recognition model. Based on the posture and height change features of the skeletal key points, a rapid initial screening of potential fall behaviors is achieved. This effectively avoids the high false negative and high false positive problems caused by relying solely on a single threshold or simple rule judgment in traditional methods, and provides a reliable triggering basis for subsequent refined verification.
[0216] Secondly, this application introduces a multimodal spatiotemporal fusion verification mechanism, which aligns the visual posture temporal features, robot chassis IMU vibration information, and environmental acoustic features on a unified time base and analyzes them jointly, cross-verifying suspected fall behaviors from multiple dimensions such as spatial structure, motion dynamics, and external physical feedback. This significantly reduces the probability of everyday actions such as bending over, squatting, sitting, and lying down being misjudged as falls, and improves the accuracy and robustness of fall recognition results.
[0217] Furthermore, this application introduces a dynamic threshold adjudication mechanism based on scene semantics, which adaptively adjusts the fall judgment threshold according to the specific environment in which the elderly person is currently located (such as kitchen, bedroom, bathroom, etc.), fully considering the differences in normal human activity patterns in different life scenarios, avoiding scene-related misjudgments caused by using a uniform threshold, and making the system's judgment logic more in line with real-life behavior.
[0218] Finally, after a suspected fall incident occurs, this application combines high-precision static posture analysis with directional voice-sensitive signal recognition to make a final confirmation of whether the target object remains still on the ground for a long time and whether it is accompanied by cries for help, painful groans, or abnormal silence. This enables a reliable determination of a real fall incident, thereby providing a credible basis for subsequent alarms and emergency responses, and improving the overall safety, practicality, and user acceptance of the fall detection system.
[0219] Based on the above embodiments of a false fall filtering method based on multimodal spatiotemporal feature fusion, this application also provides a false fall filtering system based on multimodal spatiotemporal feature fusion. Figure 3 is a schematic diagram of a false fall filtering system based on multimodal spatiotemporal feature fusion provided in an embodiment of this application. As shown in Figure 3, the false fall filtering system 3 based on multimodal spatiotemporal feature fusion includes an acquisition module 301, a determination module 302, and an alarm module 303, wherein:
[0220] The acquisition module 301 is used to acquire continuous video stream data containing the target person.
[0221] The determination module 302 is used to identify skeletal key points and calculate posture indicators on the video stream data using a pre-determined posture key point recognition model to determine a primary suspected fall event; based on the time corresponding to the primary suspected fall event, acquire multimodal data for a preset time period; perform multimodal analysis and multimodal fusion based on the multimodal data to determine a secondary suspected fall event; and, based on the secondary suspected fall event, perform static posture determination and sensitive speech segment recognition on the video stream data to determine the final determination result.
[0222] The alarm module 303 is used to trigger an alarm mechanism when the final judgment result indicates that the target person has actually fallen.
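The way the three determination stages feed the alarm module can be sketched as a short-circuit cascade. The class, field, and dictionary key names below are hypothetical, and each stage is represented by a placeholder callable rather than by the models described in the embodiments.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

Stage = Callable[[Dict[str, Any]], bool]


@dataclass
class FallFilterCascade:
    primary_screen: Stage
    multimodal_check: Stage
    post_fall_confirm: Stage

    def should_alarm(self, observation: Dict[str, Any]) -> bool:
        # `and` short-circuits, so a later stage runs only when every
        # earlier stage has flagged the event.
        return (self.primary_screen(observation)
                and self.multimodal_check(observation)
                and self.post_fall_confirm(observation))


cascade = FallFilterCascade(
    primary_screen=lambda o: o["tilt_deg"] > 60 and o["head_drop_cm"] > 70,
    multimodal_check=lambda o: o["s_fused"] > o["threshold"],
    post_fall_confirm=lambda o: o["static"] and o["speech"],
)

real_fall = {"tilt_deg": 80, "head_drop_cm": 95, "s_fused": 0.9,
             "threshold": 0.8, "static": True, "speech": True}
lying_on_sofa = {"tilt_deg": 85, "head_drop_cm": 90, "s_fused": 0.4,
                 "threshold": 0.8, "static": True, "speech": False}
```

In this arrangement the inexpensive visual screen runs continuously, while the multimodal fusion and post-fall confirmation stages are evaluated only for events that have already passed the earlier stage.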
[0223] Based on the above embodiments of a false fall filtering method based on multimodal spatiotemporal feature fusion, this application also provides a false fall filtering device based on multimodal spatiotemporal feature fusion. Figure 4 is a schematic diagram of a false fall filtering device based on multimodal spatiotemporal feature fusion provided in an embodiment of this application. As shown in Figure 4, the false fall filtering device 4 based on multimodal spatiotemporal feature fusion includes a processor 401 and a memory 402. The memory 402 is used to store a computer program; the processor 401 is used to call and run the computer program from the memory to execute a false fall filtering method based on multimodal spatiotemporal feature fusion as described in the above embodiment.
[0224] In the embodiments of this application, the processor 401 described above can be at least one of the following: Application-Specific Integrated Circuit (ASIC), Digital Signal Processor (DSP), Digital Signal Processing Device (DSPD), Programmable Logic Device (PLD), Field-Programmable Gate Array (FPGA), Central Processing Unit (CPU), Controller, Microcontroller, and Microprocessor. It is understood that for different devices, the electronic device used to implement the above processor function can also be other types, and the embodiments of this application do not specifically limit it.
[0225] This application provides a computer-readable storage medium storing a computer program for implementing, when executed by a processor, a false fall filtering method based on multimodal spatiotemporal feature fusion as described in any of the above embodiments.
[0226] For example, the program instructions corresponding to a false fall filtering method based on multimodal spatiotemporal feature fusion in this embodiment can be stored on storage media such as optical discs, hard disks, and USB flash drives. When the program instructions corresponding to the false fall filtering method based on multimodal spatiotemporal feature fusion in the storage media are read or executed by an electronic device, a false fall filtering method based on multimodal spatiotemporal feature fusion as described in any of the above embodiments can be implemented.
[0227] Furthermore, in the embodiments of this application, the functional modules can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional module.
[0228] If the integrated unit is implemented as a software functional module and is not sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this embodiment, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute all or part of the steps of the method of this embodiment. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0229] It should be understood that the phrases "one embodiment," "an embodiment," or "some embodiments" mentioned throughout the specification mean that a specific feature, structure, or characteristic related to the embodiment is included in at least one embodiment of this application. Therefore, "in one embodiment," "in an embodiment," or "in some embodiments" appearing throughout the specification do not necessarily refer to the same embodiment. Furthermore, these specific features, structures, or characteristics can be combined in any suitable manner in one or more embodiments. It should be understood that in the various embodiments of this application, the sequence numbers of the above-described processes do not imply a sequential order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application. The sequence numbers of the embodiments in this application are merely for descriptive purposes and do not represent the superiority or inferiority of the embodiments. The descriptions of the various embodiments above tend to emphasize the differences between the various embodiments; their similarities or commonalities can be referred to mutually, and for the sake of brevity, these will not be repeated here.
[0230] The modules described above as separate components may or may not be physically separate. The components shown as modules may or may not be physical modules. They may be located in one place or distributed across multiple network units. Some or all of the modules may be selected to achieve the purpose of this embodiment according to actual needs.
[0231] In addition, each functional module in the various embodiments of this application can be integrated into one processing unit, or each module can be a separate unit, or two or more modules can be integrated into one unit; the integrated modules can be implemented in hardware or in the form of hardware plus software functional units.
[0232] Those skilled in the art will understand that all or part of the steps of the above method embodiments can be implemented by hardware related to program instructions. The aforementioned program can be stored in a computer-readable storage medium. When the program is executed, it performs the steps of the above method embodiments. The aforementioned storage medium includes various media that can store program code, such as mobile storage devices, read-only memory (ROM), magnetic disks, or optical disks.
[0233] The methods disclosed in the several method embodiments provided in this application can be arbitrarily combined without conflict to obtain new method embodiments.
[0234] The features disclosed in the several product embodiments provided in this application can be arbitrarily combined without conflict to obtain new product embodiments.
[0235] The features disclosed in the several method or device embodiments provided in this application can be arbitrarily combined without conflict to obtain new method or device embodiments.
[0236] The above description is merely an embodiment of this application, but the protection scope of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the protection scope of this application. Therefore, the protection scope of this application should be determined by the protection scope of the claims.
Claims
1. A method for filtering false falls based on multimodal spatiotemporal feature fusion, characterized in that the method includes: acquiring continuous video stream data containing a target person; performing skeletal key point recognition and posture indicator calculation on the video stream data using a predetermined posture key point recognition model to determine a primary suspected fall event; acquiring multimodal data for a preset time period based on the time corresponding to the primary suspected fall event, and performing multimodal analysis and multimodal fusion on the multimodal data to determine a secondary suspected fall event; wherein the multimodal data includes an initial human skeletal key point sequence, an initial triaxial acceleration sequence, and an initial environmental audio sequence; the multimodal analysis denotes performing time alignment processing and feature analysis on the initial human skeletal key point sequence, the initial triaxial acceleration sequence, and the initial environmental audio sequence; and the multimodal fusion denotes fusing the confidence levels obtained, based on the multimodal analysis, for the initial human skeletal key point sequence, the initial triaxial acceleration sequence, and the initial environmental audio sequence; performing, based on the secondary suspected fall event, static posture determination and sensitive speech segment recognition on the video stream data to determine a final determination result; and triggering an alarm mechanism if the final determination result indicates that the target person has actually fallen;
wherein the step of performing, based on the secondary suspected fall event, static posture determination and sensitive speech segment recognition on the video stream data to determine the final determination result includes: performing, based on the secondary suspected fall event, skeletal key point recognition on the video stream data using the posture key point recognition model to determine final human skeletal key points and second coordinate information corresponding to the final human skeletal key points, wherein the final human skeletal key points include head key points, shoulder key points, and hip key points; calculating, based on the final human skeletal key points and the corresponding second coordinate information, the displacement variance corresponding to each of the head key points, the shoulder key points, and the hip key points; determining a mean displacement variance based on the displacement variances corresponding to the head key points, the shoulder key points, and the hip key points, and comparing the mean displacement variance with a preset static threshold to determine a static posture determination result; extracting audio features from the video stream data to obtain an audio feature sequence; performing feature extraction on the audio feature sequence using a predetermined semantically sensitive branch, non-semantic anomaly branch, and silence detection branch, respectively, to obtain a first feature map, a second feature map, and a third feature map corresponding to the semantically sensitive branch, the non-semantic anomaly branch, and the silence detection branch, respectively; performing feature fusion on the first feature map, the second feature map, and the third feature map to obtain fused features; performing sensitive speech segment recognition based on the fused features to obtain a three-dimensional vector, wherein the three-dimensional vector includes the confidence scores corresponding to semantic sensitivity, non-semantic anomaly, and silence detection; determining a sensitive speech segment recognition result based on the confidence scores corresponding to semantic sensitivity, non-semantic anomaly, and silence detection and the preset confidence thresholds corresponding to semantic sensitivity, non-semantic anomaly, and silence detection; and determining the final determination result based on the static posture determination result and the sensitive speech segment recognition result.
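By way of illustration only, and without limiting the claims, the static posture determination and the per-category confidence comparison recited in claim 1 might be sketched as follows. The claims do not define how the displacement variance is computed, in which direction the static threshold is applied, or whether a confidence score must meet or exceed its threshold; this sketch assumes the variance is var(x) + var(y) of each key point track, that a mean below the threshold means "static", and that a score at or above its threshold raises the flag. The function names, sample coordinates, and threshold values are hypothetical.

```python
from statistics import mean, pvariance


def displacement_variance(track):
    """Positional variance of one key point over a window of (x, y) samples.

    Assumption: "displacement variance" is taken as var(x) + var(y).
    """
    xs = [p[0] for p in track]
    ys = [p[1] for p in track]
    return pvariance(xs) + pvariance(ys)


def static_posture(tracks, static_threshold):
    """Return (is_static, mean_variance) over the head, shoulder and hip tracks."""
    mean_var = mean([displacement_variance(t) for t in tracks.values()])
    # Assumption: a mean variance below the threshold counts as "static".
    return mean_var < static_threshold, mean_var


def speech_flags(scores, thresholds):
    """Compare each entry of the three-dimensional vector with its own threshold."""
    labels = ("semantic_sensitive", "non_semantic_anomaly", "silence")
    # Assumption: a score at or above its threshold raises the flag.
    return {name: s >= t for name, s, t in zip(labels, scores, thresholds)}


# Hypothetical pixel coordinates for a person lying almost motionless.
still = {
    "head": [(100.0, 400.0), (100.5, 400.2), (99.8, 399.9), (100.1, 400.1)],
    "shoulder": [(140.0, 402.0), (140.2, 402.1), (139.9, 401.8), (140.1, 402.0)],
    "hip": [(200.0, 405.0), (200.1, 405.1), (199.9, 404.9), (200.0, 405.0)],
}
# Hypothetical coordinates for a person who keeps moving after the event.
moving = {
    "head": [(100.0, 400.0), (110.0, 395.0), (120.0, 390.0), (130.0, 385.0)],
    "shoulder": [(140.0, 402.0), (150.0, 398.0), (160.0, 394.0), (170.0, 390.0)],
    "hip": [(200.0, 405.0), (208.0, 403.0), (216.0, 401.0), (224.0, 399.0)],
}

if __name__ == "__main__":
    print(static_posture(still, static_threshold=4.0))
    print(static_posture(moving, static_threshold=4.0))
    print(speech_flags((0.8, 0.2, 0.9), (0.5, 0.5, 0.5)))
```

How the static posture determination result and the sensitive speech segment recognition result are combined into the final determination result is not specified in the claims, so the sketch deliberately stops before that step.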
2. The method according to claim 1, characterized in that the step of performing skeletal key point recognition and posture indicator calculation on the video stream data using the predetermined posture key point recognition model to determine the primary suspected fall event includes: performing skeletal key point recognition on the target person in the video stream data using the posture key point recognition model to obtain human skeletal key points and first coordinate information corresponding to the human skeletal key points; performing posture indicator calculation based on the first coordinate information corresponding to the human skeletal key points to determine a torso main axis tilt angle and a head height change; and determining the primary suspected fall event if the torso main axis tilt angle is greater than a preset tilt angle threshold and the head height change is greater than a preset height threshold.
3. The method according to claim 2, characterized in that the step of performing posture indicator calculation based on the first coordinate information corresponding to the human skeletal key points to determine the torso main axis tilt angle and the head height change includes: determining, based on the first coordinate information corresponding to the human skeletal key points, target coordinate information corresponding to two sets of target skeletal key points related to the torso posture, wherein the two sets of target skeletal key points are upper body skeletal key points and lower body skeletal key points; calculating the torso main axis tilt angle based on the target coordinate information; selecting, based on the first coordinate information corresponding to the human skeletal key points, the longitudinal coordinate values of the key points including the head or neck key points to determine a key point longitudinal coordinate sequence; and calculating the head height change based on the key point longitudinal coordinate sequence.
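As a non-limiting sketch of the posture indicators recited in claims 2 and 3, the tilt angle and head height change could be computed as below. Several choices here are assumptions not fixed by the claims: image coordinates with y increasing downward, the shoulder midpoint as the upper body point, the hip midpoint as the lower body point, the tilt angle measured from the vertical, and the head height change taken as the largest downward movement relative to the first frame. All names and numeric thresholds are hypothetical.

```python
import math


def midpoint(a, b):
    """Midpoint of two (x, y) key points, e.g. left and right shoulder."""
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)


def torso_tilt_deg(upper, lower):
    """Angle in degrees between the torso main axis and the vertical direction.

    0 means upright, 90 means horizontal.
    """
    dx = upper[0] - lower[0]
    dy = upper[1] - lower[1]
    return math.degrees(math.atan2(abs(dx), abs(dy)))


def head_height_change(head_ys):
    """Largest downward movement of the head relative to the first frame.

    Assumption: image y grows downward, so a larger y means a lower head.
    """
    return max(head_ys) - head_ys[0]


def is_primary_suspected(upper, lower, head_ys, tilt_threshold_deg, height_threshold):
    """Both indicators must exceed their thresholds, as recited in claim 2."""
    return (torso_tilt_deg(upper, lower) > tilt_threshold_deg
            and head_height_change(head_ys) > height_threshold)


if __name__ == "__main__":
    # Hypothetical lying posture: shoulders and hips at nearly the same height.
    upper = midpoint((200.0, 290.0), (200.0, 310.0))
    lower = midpoint((100.0, 290.0), (100.0, 310.0))
    print(torso_tilt_deg(upper, lower))
    print(is_primary_suspected(upper, lower, [80.0, 200.0, 310.0], 40.0, 100.0))
```

Requiring both conditions is what lets a bending motion (large tilt, small head drop) pass without being flagged at this stage.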
4. The method according to claim 1, characterized in that the step of performing multimodal analysis and multimodal fusion on the multimodal data to determine the secondary suspected fall event includes: performing time alignment processing on the initial human skeletal key point sequence, the initial triaxial acceleration sequence, and the initial environmental audio sequence to obtain a human skeletal key point sequence, a triaxial acceleration sequence, and an environmental audio sequence; performing feature analysis and multimodal fusion on the human skeletal key point sequence, the triaxial acceleration sequence, and the environmental audio sequence, respectively, to determine a fusion confidence; obtaining attention weights for adjusting a dynamic threshold, and adjusting the dynamic threshold based on the attention weights to determine a corrected fusion determination threshold for the region where the target person is located; and determining the secondary suspected fall event if the fusion confidence is greater than the corrected fusion determination threshold.
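The claims do not state how the time alignment processing of claim 4 is carried out. One common approach, shown here purely as an assumed example, is to resample each sensor stream onto the video frame timestamps by linear interpolation. The function name `resample_linear` and the sample timestamps are hypothetical, and the source timestamps are assumed to be strictly increasing.

```python
from bisect import bisect_left


def resample_linear(src_t, src_v, target_t):
    """Linearly interpolate the samples (src_t, src_v) at each time in target_t.

    Assumption: src_t is strictly increasing. Times outside the source range
    are clamped to the first or last sample.
    """
    out = []
    for t in target_t:
        if t <= src_t[0]:
            out.append(src_v[0])
        elif t >= src_t[-1]:
            out.append(src_v[-1])
        else:
            i = bisect_left(src_t, t)          # src_t[i - 1] < t <= src_t[i]
            t0, t1 = src_t[i - 1], src_t[i]
            v0, v1 = src_v[i - 1], src_v[i]
            out.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return out


if __name__ == "__main__":
    # Hypothetical accelerometer magnitude sampled every 0.1 s,
    # aligned to video frames captured at 0.05 s, 0.10 s and 0.30 s.
    accel_t = [0.0, 0.1, 0.2]
    accel_v = [0.0, 1.0, 2.0]
    frame_t = [0.05, 0.1, 0.3]
    print(resample_linear(accel_t, accel_v, frame_t))
```

After this step every modality has one value per video frame, so the subsequent feature analysis can operate on sequences of equal length.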
5. The method according to claim 4, characterized in that the step of performing feature analysis and multimodal fusion on the human skeletal key point sequence, the triaxial acceleration sequence, and the environmental audio sequence, respectively, to determine the fusion confidence includes: calculating, using a predetermined neural network model for recognizing fall behavior, the similarity between the human skeletal key point sequence and a preset real fall pattern sequence to obtain a fall confidence; performing feature extraction on the triaxial acceleration sequence to obtain the energy distribution and frequency characteristics of the vibration signal, and calculating a vibration confidence based on the energy distribution and frequency characteristics; performing feature extraction on the environmental audio sequence to obtain transient impact features, and calculating an acoustic modal confidence based on the transient impact features; and performing multimodal fusion based on the fall confidence, the vibration confidence, and the acoustic modal confidence to determine the fusion confidence.
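The fusion rule of claim 5 is not specified in the claims. The sketch below assumes a normalized weighted average of the three confidences, which is only one possible choice; the weights (0.5, 0.3, 0.2) and the function name `fuse_confidences` are hypothetical.

```python
def fuse_confidences(fall, vibration, acoustic, weights=(0.5, 0.3, 0.2)):
    """Weighted average of the three per-modality confidences.

    Assumption: each confidence lies in [0, 1]; the weights are normalized
    by their sum so the result also lies in [0, 1].
    """
    w_fall, w_vib, w_aco = weights
    total = w_fall + w_vib + w_aco
    return (w_fall * fall + w_vib * vibration + w_aco * acoustic) / total


if __name__ == "__main__":
    # Strong skeletal evidence, clear vibration, moderate impact sound.
    print(fuse_confidences(0.9, 0.8, 0.6))
    # Bending over: the skeleton looks fall-like, but there is no impact.
    print(fuse_confidences(0.7, 0.1, 0.05))
```

Under this assumption, a posture that resembles a fall but produces no vibration or impact sound yields a low fusion confidence, which is the filtering effect that the comparison with the corrected fusion determination threshold in claim 4 relies on.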
6. A false fall filtering system based on multimodal spatiotemporal feature fusion, characterized in that the system includes an acquisition module, a determining module, and an alarm module, wherein: the acquisition module is used to acquire continuous video stream data containing a target person; the determining module is used to perform skeletal key point recognition and posture indicator calculation on the video stream data using a predetermined posture key point recognition model to determine a primary suspected fall event; acquire multimodal data for a preset time period based on the time corresponding to the primary suspected fall event; and perform multimodal analysis and multimodal fusion on the multimodal data to determine a secondary suspected fall event; wherein the multimodal data includes an initial human skeletal key point sequence, an initial triaxial acceleration sequence, and an initial environmental audio sequence; the multimodal analysis denotes performing time alignment processing and feature analysis on the initial human skeletal key point sequence, the initial triaxial acceleration sequence, and the initial environmental audio sequence; and the multimodal fusion denotes fusing the confidence levels obtained, based on the multimodal analysis, for the initial human skeletal key point sequence, the initial triaxial acceleration sequence, and the initial environmental audio sequence; and perform, based on the secondary suspected fall event, static posture determination and sensitive speech segment recognition on the video stream data to determine a final determination result;
the determining module is further configured to: perform, based on the secondary suspected fall event, skeletal key point recognition on the video stream data using the posture key point recognition model to determine final human skeletal key points and second coordinate information corresponding to the final human skeletal key points, wherein the final human skeletal key points include head key points, shoulder key points, and hip key points; calculate, based on the final human skeletal key points and the corresponding second coordinate information, the displacement variance corresponding to each of the head key points, the shoulder key points, and the hip key points; determine a mean displacement variance based on the displacement variances corresponding to the head key points, the shoulder key points, and the hip key points; compare the mean displacement variance with a preset static threshold to determine a static posture determination result; extract audio features from the video stream data to obtain an audio feature sequence; perform feature extraction on the audio feature sequence using a predetermined semantically sensitive branch, non-semantic anomaly branch, and silence detection branch, respectively, to obtain a first feature map, a second feature map, and a third feature map corresponding to the semantically sensitive branch, the non-semantic anomaly branch, and the silence detection branch, respectively; perform feature fusion on the first feature map, the second feature map, and the third feature map to obtain fused features; perform sensitive speech segment recognition based on the fused features to obtain a three-dimensional vector, wherein the three-dimensional vector includes the confidence scores corresponding to semantic sensitivity, non-semantic anomaly, and silence detection; determine a sensitive speech segment recognition result based on the confidence scores corresponding to semantic sensitivity, non-semantic anomaly, and silence detection and the preset confidence thresholds corresponding to semantic sensitivity, non-semantic anomaly, and silence detection; and determine the final determination result based on the static posture determination result and the sensitive speech segment recognition result;
the alarm module is used to trigger an alarm mechanism when the final determination result indicates that the target person has actually fallen.
7. A false fall filtering device based on multimodal spatiotemporal feature fusion, characterized in that the device includes a processor and a memory, wherein the memory is used to store a computer program, and the processor is configured to call and run the computer program from the memory to perform the method according to any one of claims 1 to 5.
8. A computer-readable storage medium, characterized in that the storage medium stores executable instructions which, when executed by a processor, implement the method according to any one of claims 1 to 5.