Behavior monitoring and companion system based on visual recognition and voice interaction

By employing a multimodal perception architecture that combines visual recognition and voice interaction, along with a bilinear pooling attention network and keypoint geometric calculation, the system addresses the shortcomings of existing systems in terms of recognition dimensions, feedback methods, and real-time performance. This enables high-precision analysis of children's learning behaviors and contextualized feedback, thereby improving the system's real-time performance and robustness.

CN122244944APending Publication Date: 2026-06-19XIDIAN UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
XIDIAN UNIV
Filing Date
2026-03-20
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing children's learning behavior monitoring systems have limited identification dimensions, single feedback methods, poor real-time performance, and insufficient interactivity. They cannot effectively identify multiple behavioral patterns and lack environmental adaptability and remote monitoring functions.

Method used

A multimodal perception architecture based on visual recognition and voice interaction is adopted, combined with a bilinear pooling attention network and key point geometric calculation to achieve high-precision analysis of gaze and posture. Real-time voice feedback and environmental adjustment are achieved through an edge-cloud collaborative architecture, and closed-loop control logic is constructed to improve the system's recognition accuracy and interactivity.

Benefits of technology

It achieves high-precision analysis of implicit distraction behaviors, improves the effectiveness and real-time nature of intervention, provides context-related semantic interaction and low-latency remote monitoring, and ensures the robustness of the system and vision protection under different lighting conditions.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122244944A_ABST
    Figure CN122244944A_ABST
Patent Text Reader

Abstract

This invention discloses a behavior monitoring and companionship system based on visual recognition and voice interaction. The system includes an image acquisition module that collects user image data; a behavior recognition module that analyzes the monitored person's behavior based on human posture key point detection, eye direction recognition, and hand movements, and uses multimodal feature fusion to identify abnormal behavior using a threshold discrimination algorithm; an anomaly detection and response module that executes response operations, performs voice broadcasting, and uploads data to the cloud; an intelligent voice assistant module that interacts with the user via voice; a dynamic visual feedback and auxiliary lighting module that generates dynamic facial expressions to provide visual psychological compensation for the user; and a remote terminal for remote monitoring. By improving the depth of recognition and adaptability to complex learning behaviors, the system achieves accurate identification and proactive intervention in children's learning states, thereby comprehensively improving children's learning concentration and the level of intelligent supervision of family education.
Need to check novelty before this filing date? Find Prior Art

Description

Technical Field

[0001] This invention relates to the field of behavior monitoring technology, specifically to a behavior monitoring and companionship system based on visual recognition and voice interaction. Background Technology

[0002] With the synergistic development of artificial intelligence, computer vision (CV), and Internet of Things (IoT) technologies, smart educational hardware has gradually penetrated into home learning scenarios. In order to solve problems such as vision impairment, attention deficit, and poor learning habits that children encounter during home learning, a variety of monitoring devices have emerged in the market.

[0003] Currently, existing technical solutions can be mainly divided into the following three categories: Static monitoring solutions based on physical sensors: Using pressure sensors (installed on a desktop or chair), ultrasonic ranging sensors, or infrared sensors, children's sitting height or chest distance can be monitored by setting distance thresholds.

[0004] Logical reminder scheme based on timed tasks: By using a built-in timer or a simple motion trigger, a fixed frequency of voice broadcasts is set (such as reminding "pay attention to posture" every 20 minutes), which is a kind of non-closed-loop preset interaction.

[0005] A pose estimation scheme based on single-modal vision: It uses an embedded camera to acquire images and adopts a traditional human pose estimation algorithm to perform single-dimensional geometric calculations on spinal curvature or head tilt angle.

[0006] Through in-depth analysis of the aforementioned existing technologies, the following significant technical shortcomings were found in their practical applications, making it difficult to meet the deeper needs of intelligent home education: The lack of visual recognition feature dimensions and weak semantic analysis capability for complex behaviors are problems. Existing visual monitoring solutions mostly focus on the topological analysis of large joints in the human body (such as spinal curvature), but neglect the correlation analysis between eye gaze direction (Gaze Estimation) and subtle hand movements (such as resting chin on hand or rubbing eyes). This results in the system's inability to distinguish between semantically significant behaviors such as "looking down at a book" and "looking down at a mobile phone," creating serious blind spots and hindering deep analysis of various complex learning behaviors.

[0007] Insufficient environmental adaptability leads to poor algorithm recognition accuracy and robustness. Existing camera surveillance equipment is highly sensitive to lighting conditions. In home learning scenarios, light intensity varies drastically with day and night and weather. Traditional visual algorithms exhibit keypoint drift and feature loss in low-brightness or strong backlight environments. Due to the lack of hardware-level light environment adaptive compensation mechanisms (such as the synergy between hardware dimming and software algorithms), the system's recognition accuracy drops significantly under non-ideal conditions, resulting in a high false alarm rate.

[0008] The feedback logic is rigid, and the interaction lacks immediacy and contextual relevance. Existing voice reminder devices mostly use "timed triggering" or "single threshold triggering" logic, resulting in feedback content that is not relevant to the child's immediate state. This mechanical voice prompt, detached from specific contexts, not only fails to elicit children's proactive corrective awareness but also easily leads to auditory fatigue and psychological resistance. The system lacks dynamic semantic feedback based on specific violations, resulting in an unnatural interactive experience.

[0009] The unreasonable data transmission mechanism leads to significant delays in parental monitoring. Existing networked monitoring products typically employ either full-stream video upload or periodic data aggregation. The former requires extremely high home bandwidth and poses a privacy risk, while the latter results in a noticeable lag in the information received by parents. Current technology lacks optimization for the instantaneous capture and asynchronous transmission of "keyframes of abnormal behavior," preventing parents from obtaining evidence of abnormal states in a timely manner and hindering their ability to implement immediate and effective remote intervention.

[0010] The system architecture lacks closed-loop control, and environmental comfort adjustment is inadequate. Existing learning monitoring devices are typically independent of the lighting system. During children's learning process, the quality of ambient lighting directly affects their fatigue level and concentration, but existing devices only focus on "behavioral monitoring" and fail to integrate "environmental optimization" and "behavioral intervention." The lack of closed-loop control logic to automatically adjust the lighting environment based on visual recognition status prevents the provision of comprehensive learning support.

[0011] In summary, while existing children's learning behavior monitoring systems have seen some application in areas such as posture detection and voice reminders, they still have several shortcomings. First, sensor-based posture reminder devices can only detect changes in body posture, resulting in limited functionality, low recognition accuracy, complex installation, and high cost. They also cannot recognize various behaviors such as distraction or resting one's chin on one's hand. Second, smart learning companions with voice reminder functions often use timed or fixed voice prompts, lacking dynamic feedback related to the actual learning state. The prompts are mechanical, lack interactivity, and are difficult to effectively guide children to correct poor posture independently. Third, while posture detection systems based on single-modal visual recognition can identify head or shoulder movements, their recognition dimensions are limited, they are easily affected by lighting and angle, leading to misjudgments, and they lack emotional feedback and real-time intervention capabilities. Furthermore, existing systems primarily focus on monitoring functions, failing to achieve real-time image transmission and remote supervision from the parent end, resulting in delayed feedback and isolated information. Summary of the Invention

[0012] The purpose of this invention is to address the aforementioned problems by providing a behavior monitoring and companionship system based on visual recognition and voice interaction. This system overcomes the limitations of existing children's learning behavior monitoring technologies, such as limited recognition dimensions, single feedback methods, poor real-time performance, and insufficient interactivity. It enables accurate identification of children's learning status, proactive intervention, emotional feedback, and remote monitoring by parents.

[0013] The technical solution adopted in this invention is as follows: A behavior monitoring and companionship system based on visual recognition and voice interaction, the system comprising: The image acquisition module is used to acquire the user's image data; The behavior recognition module is used to analyze the behavior of the monitored person based on the collected image data, including human posture key point detection, eye direction recognition, and hand movements, and through multimodal feature fusion, and to identify abnormal behavior based on a threshold discrimination algorithm. The anomaly detection and response module is used to automatically execute response operations when the identified abnormal behavior continues to exceed a set threshold. Through a continuous frame analysis mechanism, the behavior recognition results are smoothed over time. When the abnormal state continues to exceed a set number of frames, it is determined to be a valid anomaly, and voice broadcast and cloud upload are performed. The intelligent voice assistant module is used to detect keyword wake-up by real-time monitoring of the ambient audio stream, initiate the speech recognition process, and convert the monitored person's voice content into text input; The dynamic visual feedback and auxiliary lighting module is used to adjust the brightness of the system according to the ambient light and generate dynamic expressions through dynamic visual feedback to provide visual psychological compensation for users. The remote receiving end is used to synchronously receive behavior judgment data and visualize it.

[0014] Furthermore, the image acquisition module includes a high-definition camera for real-time acquisition of image data of the user's behavior.

[0015] Furthermore, the behavior recognition module is specifically a gaze and posture recognition based on the user's facial geometric mapping, specifically including a facial central axis offset calculation module and a torso compression quantization determination module; The facial central axis offset calculation module is used to calculate the horizontal offset between the tip of the user's nose and the center point of both eyes to determine the state of gaze separation. The torso compression quantification judgment module is used to determine abnormal user sitting posture behavior by measuring the rate of change of the vertical projection distance between the center of the user's mouth and the center of the shoulder.

[0016] Furthermore, the facial central axis offset calculation module specifically includes: Define feature points, and select the left eye feature point E_left, right eye feature point E_right, and nose tip feature point N of the user; Parameter calculation, calculate the geometric midpoint C_eyes of the user's binocular connection, and its abscissa x_avg = (x_left + x_right) / 2; Calculate the Euclidean distance Dh in the horizontal direction between the nose tip point x_nose and the binocular center point x_avg, Dh = |x_nose - x_avg|; Judgment logic: In the state of looking straight ahead and concentrating, the nose tip should be near the vertical bisector of the binocular connection, and Dh approaches 0; when the user deviates from the straight ahead, the nose tip coordinates will have a horizontal displacement relative to the binocular center. The system sets a horizontal offset threshold Th. When Dh > Th, it is determined as an abnormal behavior of line of sight detachment.

[0017] Furthermore, the trunk compression quantization determination module is specifically: Define feature points, select the left / right corner points of the user's mouth to calculate the center height Y_mouth of the mouth; select the left / right shoulder points to calculate the center height Y_shoulder of the shoulders; Geometric calculation, calculate the projected distance Dv in the vertical direction between the center height of the mouth and the center height of the shoulders, Dv = Y_shoulder - Y_mouth; Judgment logic, the distance Dv represents the degree of extension of the neck and thoracic vertebra. When the user has abnormal behaviors such as lowering the head or hunching the back, the mouth coordinates sink and the shoulder coordinates float relatively, resulting in a decrease in Dv. The system sets a vertical distance threshold Tv. When Dv < Tv, it is determined as an abnormal behavior of poor sitting posture; Cascaded state determination, adopt state cascaded logic, regard poor sitting posture as a subset of the non-concentrated state. When it is detected that Dv < Tv, the system logic variable unfocus is also forced to be set to True to ensure comprehensive coverage of hidden abnormal behaviors.

[0018] Furthermore, the behavior recognition module also includes precise line of sight detection based on a bilinear pooling attention network, specifically including: Bilinear pooling mechanism, extract the local convolutional features X of the facial image, calculate the outer product of the feature vector and its transposed vector, and obtain second-order context information; Nested attention enhancement mechanism, aiming at the spatial misalignment problem of the face in the video stream, dynamically calibrate the feature weights through the nested attention mechanism, specifically as follows: Global aggregation, perform global average pooling on the second-order local features to generate a global descriptor G representing the overall posture and ambient light; Adaptive weighting, use the global descriptor G to re-weight the local features A, such as D = Softmax((U + G) A) Through this mechanism, the system can automatically suppress background noise; The enhanced feature map of the gaze vector regression is input into the fully connected layer, which outputs a two-dimensional gaze vector (Yaw, Pitch): the yaw angle represents the degree of left and right deviation of the gaze in the horizontal direction, and the pitch angle represents the degree of up and down deviation of the gaze in the vertical direction. When the absolute value of any angle exceeds the preset safety threshold, the system determines that the user's gaze has left the view.

[0019] Furthermore, the behavior recognition module also includes a process for determining the persistence of abnormal behavior based on a time-series sliding window, specifically including time-series consistency verification, as follows: Frame counter and state locking: The system maintains two state machines in memory, which record the distraction start frame F_start_unfocus and the abnormal posture start frame F_start_posture, respectively. Time threshold logic: When the detection algorithm outputs an abnormality (True), if this is the first trigger, record the current frame number F_current as the starting frame; continuously calculate the number of continuous frames Delta_F = F_current - F_start, and only when Delta_F equals the preset time threshold T_cooldown is the abnormality confirmed as a valid abnormal event; State Reset: If the detection result returns to normal before the threshold T_cooldown is reached, the initial frame record is immediately cleared and monitoring restarts.

[0020] Furthermore, in the anomaly detection and response module, when an abnormal behavior is confirmed, the image data of the current frame that triggered the anomaly threshold is extracted, the image data is uploaded to the cloud via cloud storage, and the status text of the current abnormal behavior is superimposed and rendered on the video stream using image fusion technology to complete the visual feedback.

[0021] Furthermore, the dynamic visual feedback and auxiliary lighting module integrates an embedded microcontroller (MCU), an environmental sensing sensor, and a display driver unit, as detailed below: The microcontroller (MCU) collects ambient illuminance values ​​through an environmental sensing sensor, compares them with a preset standard illuminance range, and dynamically adjusts the PWM duty cycle to achieve constant illuminance. By using high-precision clock interrupts to control the refresh of the display memory, and through time-step logic, anthropomorphic blinking and smiling animations are achieved, providing visual psychological compensation.

[0022] Furthermore, the system also includes edge-based image capture and cloud-based monitoring, as detailed below: At the edge, key frame capture is performed. The system locks the current frame image data that triggers the abnormal threshold and uploads it to the cloud for storage. The current frame image data includes the abnormal behavior image, behavior type label and time. Cloud-based monitoring associates and displays abnormal behavior images, behavior type tags, and time on the terminal, and generates a focus analysis table.

[0023] In summary, due to the adoption of the above technical solution, the beneficial effects of the present invention are: This invention overcomes the limitations of existing technologies in terms of recognition dimensions, environmental adaptability, and interaction logic by constructing an edge-cloud collaborative multimodal perception architecture. Compared with existing single-sensor monitoring devices such as ultrasonic posture reminders, gravity-sensing posture correction garments, and monitoring equipment based on basic computer vision, it has the following significant advantages: 1. It breaks through the limitations of single-dimensional geometric monitoring and achieves high-precision analysis of latent distraction behavior. This invention introduces a two-stream detection mechanism combining a bilinear pooling attention network (BPA-Net) with keypoint geometric calculation. Utilizing second-order feature extraction, it captures the nonlinear coupling between pupil texture and eyelid shape. This not only corrects explicit posture problems such as "looking down," but also accurately detects implicit attention problems such as gaze deviation (yaw / pitch angle shift). Combined with temporal consistency verification logic, the false alarm rate is reduced by more than 30%, achieving a technological leap from "physical distance monitoring" to "behavioral semantic understanding."

[0024] 2. The intervention has been upgraded from a mechanical threshold alarm to a context-related semantic interaction, improving its effectiveness. This invention utilizes an edge-cloud collaborative architecture to map visually recognized behavioral tags (such as "chin resting") to dynamic text generated by a Large Language Model (LLM) in real time, and outputs emotionally rhythmic prompts through neural speech synthesis technology. The system can execute audio scheduling strategies such as "ducking" or "interaction interruption" based on the severity of the abnormality. This contextualized feedback based on semantic understanding makes the intervention process more targeted and approachable, significantly improving children's willingness to actively correct their behavior and solving the problem of poor long-term compliance caused by the "mechanical supervision" of traditional devices.

[0025] 3. A low-latency edge-cloud visualized evidence chain was established, solving the problem of information lag in remote supervision. This invention constructs an asynchronous synchronous mechanism based on "abnormal trigger capture". Once the timing verification confirms an abnormality, the edge device immediately captures keyframes and uploads them to the cloud within seconds. Through the polling push mechanism on the Android device, parents can obtain high-definition screenshots with timing tags in real time. This "what you see is what you get" visual monitoring method greatly shortens the time link from the occurrence of an abnormality to parental intervention, and improves the real-time and scientific nature of family education supervision.

[0026] 4. A closed-loop synergy between environmental adaptation and algorithm robustness is achieved, balancing recognition accuracy and vision protection. This invention integrates an ambient light negative feedback system based on UART communication with an algorithm featuring a nested attention (AiA) mechanism. On one hand, the algorithm automatically calibrates feature weights through the AiA mechanism to suppress noise caused by uneven illumination; on the other hand, the system achieves automatic stepless adjustment of lighting brightness through PWM duty cycle control. This closed-loop control logic not only ensures a constant eye-protection environment but also provides stable illuminance conditions for the visual algorithm, reducing the computational overhead of the algorithm while achieving dual protection of system robustness and visual health. Attached Figure Description

[0027] Figure 1 This is a flowchart illustrating the principle of auxiliary lighting and dynamic visual feedback in the behavior monitoring and companionship system based on visual recognition and voice interaction of the present invention, which relies on precise timing control, pulse width modulation (PWM) technology and serial communication protocol (UART). Figure 2 In this invention, the main interface of the APP displays historical instances of exceptions through a RecyclerView list component. Figure 3 This is a flowchart of the intelligent voice assistant in the system of the present invention; Figure 4 This is a flowchart of the parent monitoring terminal in the system of the present invention; Figure 5 This is the overall block diagram structure of the system of the present invention. Detailed Implementation

[0028] The present invention will now be described in detail with reference to the accompanying drawings.

[0029] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.

[0030] This invention integrates two core modules, visual recognition and voice interaction, to build an intelligent monitoring and companionship system in children's learning scenarios.

[0031] This invention utilizes a camera to collect real-time image data of children's learning process. Based on human posture key point detection, eye direction recognition, and hand movement analysis, it achieves joint identification of various undesirable learning behaviors such as distraction, inattentiveness, looking down, hunching over, and resting chin on hand. Through multimodal feature fusion and threshold discrimination algorithms, it effectively improves the accuracy and robustness of learning state identification. When abnormal behavior is detected and continues for more than a set time threshold, the system automatically triggers a voice broadcast module, generating targeted voice reminders based on different behavior types to guide children to actively adjust their posture or focus, shifting from passive supervision to active intervention. Simultaneously, the system automatically captures the current keyframe image and uploads it along with the behavior type and detection data to the cloud, enabling real-time synchronization and remote visual supervision for parents.

[0032] Furthermore, this invention integrates an intelligent voice assistant module, supporting natural language dialogue between children and the system. The system can provide emotional responses based on behavioral data and a knowledge base, creating a continuously interactive learning environment that enhances children's learning motivation and emotional stability. Through PWM control of the OLED expression screen and LED lighting system, the system can generate anthropomorphic visual expressions and dynamic lighting feedback, creating a friendly and warm learning environment. Simultaneously, combined with a light sensor to detect ambient brightness, it achieves adaptive adjustment of LED brightness, effectively protecting children's eyesight and providing a comfortable learning lighting environment.

[0033] In summary, the purpose of this invention is to construct an intelligent education system that integrates "multimodal perception, intelligent discrimination, emotional interaction, and remote supervision." This system can automatically identify, intelligently respond to, provide personalized companionship, and remotely supervise children's learning behaviors in home learning scenarios. It not only improves learning focus and behavioral norms but also enhances the naturalness of human-computer interaction and the scientific nature of the educational process, providing a comprehensive, efficient, intelligent, and personalized solution for modern family education.

[0034] Example This embodiment provides a system for monitoring and accompanying children's learning behavior based on multimodal visual recognition and intelligent voice interaction, such as... Figure 5 As shown, it specifically includes: Image acquisition module: Includes a high-definition camera for real-time image data acquisition during children's learning process. The camera covers the upper body learning area of ​​the child and supports 720P resolution and low-light environment recognition; The image acquisition module in this embodiment can also be equipped with a 3D depth camera (RGB-D): using a TOF (Time-of-Flight) camera or a binocular stereo vision camera, it can directly calculate the three-dimensional curvature of the human spine by acquiring depth information (Depth Map), thereby more accurately identifying postures such as "hunchback" without relying on pure geometric projection algorithms; it also includes an infrared camera, which can be used to solve the recognition problem in extremely dark environments, in conjunction with an infrared band eye-tracking algorithm to achieve monitoring in completely dark environments.

[0035] Behavior recognition module: The behavior recognition module is the intelligent core of this invention, responsible for analyzing the collected images and determining the behavior.

[0036] The module integrates a deep learning inference engine, runs a human pose estimation model and an eye orientation recognition model, and combines the spatial relationships of multiple key points such as the shoulder, mouth, eyes, and hands to calculate parameters such as the vertical distance between the eyes and shoulders, the horizontal offset distance of the face, the eye gaze vector, and the distance between the hands and eyes.

[0037] Through a multimodal feature fusion algorithm, the module can identify various types of abnormal learning behaviors, including distraction, inattentiveness, looking down, and resting one's chin on one's hand. The system employs a multi-layered convolutional feature extraction network and attention mechanism at the algorithm level to improve feature discrimination and reduces misjudgments caused by individual differences through an adaptive threshold discrimination strategy. The identification results are output to the anomaly detection module in the form of data labels, while key behavioral confidence parameters are generated to provide a basis for subsequent feedback decisions.

[0038] The specific principle is as follows: The child learning monitoring module in this embodiment is based on computer vision and key point detection technology. It achieves real-time analysis of the learning scene by constructing a processing chain of feature geometry calculation, temporal state filtering, and cascaded response feedback. The system runs in a Python environment, uses OpenCV for video stream acquisition, and integrates a pose estimation model to extract human topology. Its core operating logic is as follows: (1) Principle of gaze and posture recognition based on facial geometric mapping In this embodiment, the system first parses the video stream frame by frame, extracts the skeletal key points of the child's upper body through a pose estimation model, and constructs a human body structure vector based on pixel coordinates (x, y). The specific behavior determination algorithm is calculated based on the following geometric features: Based on the principle of estimating gaze / head pose according to the relative position of "eye-nose", i.e., distraction recognition, it is as follows: To address the instability of simple eye tracking at low resolution, this embodiment employs a facial center axis offset quantization algorithm to characterize the gaze direction. Feature point definition: The system selects the left eye feature point E_left, the right eye feature point E_right, and the nose tip feature point N; Geometric calculation: First, calculate the geometric midpoint C_eyes of the line connecting the two eyes, and its abscissa x_avg = (x_left + x_right) / 2; calculate the Euclidean distance Dh between the x_nose of the nose tip point and the x_avg of the center point of the two eyes in the horizontal direction. The formula is: Dh = |x_nose - x_avg|; Judgment logic: In the state of looking straight ahead (focused), the nose tip should be near the perpendicular bisector of the line connecting the two eyes, and Dh approaches 0. When the child turns his head left or right or the line of sight deviates significantly from the straight ahead, the nose tip coordinates will have a significant horizontal displacement relative to the center of the two eyes. The system sets a horizontal offset threshold Th (the value in the embodiment is 10). When Dh > Th, it is determined that the line of sight has left the learning area (i.e., "distracted").

[0039] Principle of recognizing abnormal sitting posture based on the vertical projection of "shoulder - mouth" The system in this embodiment determines bad sitting postures (such as lowering the head, lying on the table) by analyzing the degree of compression between the lower boundary of the face and the upper boundary of the torso.

[0040] Feature point definition: Select the left / right corner points of the mouth to calculate the center height Y_mouth of the mouth; select the left / right shoulder points to calculate the center height Y_shoulder of the shoulders; Geometric calculation: Calculate the projection distance Dv in the vertical direction between the two: Dv = Y_shoulder - Y_mouth; Judgment logic: The physical meaning of this distance Dv represents the degree of extension of the neck and thoracic vertebra. When the child is writing with his head down or hunching his back, the coordinates of the mouth sink and the coordinates of the shoulders float relatively, resulting in a significant decrease in Dv. The system sets a vertical distance threshold Tv (the value in the embodiment is 35). When Dv < Tv, it is determined as "bad sitting posture".

[0041] Cascaded state determination mechanism The system in this embodiment adopts the "State Cascading" logic: regard "bad sitting posture" as a subset of the "non - focused state"; that is, when it is detected that Dv < Tv (abnormal sitting posture), the system logic variable unfocus is also forced to be set to True. This mechanism ensures comprehensive coverage of implicit distracted behaviors (such as sleeping趴在桌上睡觉).

[0042] (2) Principle of accurate line - of - sight detection based on the bilinear pooling attention network (BPA - Net) To overcome the limitations of traditional geometric methods in scenarios with varying lighting, large head angulation, or low resolution, this embodiment introduces a bilinear pooling attention network (BPA-Net) module to achieve high-precision regression of gaze direction. This module effectively solves the nonlinear coupling problem between eye appearance and head posture by capturing the second-order correlation between feature channels.

[0043] Second-order feature extraction based on bilinear pooling: Traditional convolutional networks typically only focus on first-order features (pixel intensity), which easily leads to the loss of details. This embodiment introduces a bilinear pooling mechanism: Feature interaction: The system extracts local convolutional features X from the facial image and calculates the outer product of the feature vector and its transpose.

[0044] Physical meaning: The outer product operation generates a "feature autocorrelation matrix," where each element represents the co-occurrence relationship between two feature channels. For example, it can simultaneously capture specific combinations of "pupil texture" and "eyelid shape," thereby extracting more discriminative second-order contextual information than a single feature. The nested attention (AiA) enhancement mechanism addresses the potential spatial misalignment of faces in video streams. The system employs a nested attention mechanism to dynamically calibrate feature weights. Global aggregation: Perform global average pooling on second-order local features to generate a global descriptor G representing the overall pose and ambient lighting; Adaptive weighting: Local features A are reweighted using a global descriptor G. The calculation formula is: D = Softmax((U + G)). A). Through this mechanism, the system can automatically suppress background noise (such as hair obstruction) and significantly enhance the response weights to key visual areas such as the pupil and iris edge.

[0045] The enhanced feature map from the gaze vector regression is input into the fully connected layer, and the direct regression output is a two-dimensional gaze vector (Yaw, Pitch): Yaw (yaw angle): Indicates the degree of left or right deviation of the line of sight in the horizontal direction; Pitch: Represents the degree of vertical deviation of the line of sight. When the absolute value of any angle exceeds a preset safety threshold such as 15 degrees, the system determines it as "unfocused".

[0046] (3) Principle of determining the persistence of abnormal behavior based on time-series sliding window To eliminate false alarms caused by camera shake, children's unconscious momentary head turning or posture adjustment, this embodiment introduces a temporal consistency check mechanism.

[0047] Frame counter and state locking: The system maintains two state machines in memory, which record the "distraction start frame" F_start_unfocus and the "posture abnormality start frame" F_start_posture, respectively.

[0048] Time threshold logic: When the detection algorithm outputs an anomaly (True), if this is the first trigger, the current frame number F_current is recorded as the starting frame; the system continuously calculates the number of frames Delta_F = F_current - F_start; only when Delta_F equals the preset time threshold T_cooldown (set to 60 frames in the example, corresponding to approximately 2 seconds) will the system confirm the anomaly as a "valid anomaly event".

[0049] State Reset: If the detection result turns to normal (False) before reaching the threshold T_cooldown, the initial frame record is immediately cleared (set to None), and monitoring restarts. This mechanism effectively filters transient noise.

[0050] The anomaly detection and response module includes a duration determination unit, an automatic snapshot unit, a voice broadcast unit, and a cloud synchronization unit. When the identified abnormal behavior persists for more than a set threshold, the system automatically executes a response. The system uses a continuous frame analysis mechanism to smooth the behavior recognition results over time; when the abnormal state persists for more than a set number of frames, it is determined to be a valid anomaly. At this time, the module automatically performs the following operations: (1) Keyframe capture: Call the image acquisition module to save the current video frame as an abnormal image; (2) Voice prompt trigger: Send a command to the intelligent voice interaction module to activate the corresponding voice prompt; (3) Cloud synchronization: Upload abnormal images and behavior type tags to cloud storage via Wi-Fi.

[0051] The specific principle is as follows: Once the timing verification confirms that the abnormal event has occurred and the duration meets the conditions, the system immediately triggers the following response process: Keyframe capture and local persistence: The system locks the image data of the frame that triggers the anomaly threshold, names it with the current timestamp and anomaly type (Unfocus / BadPosture), and generates a local evidence image such as unfocus_frameID_timestamp.jpg through the cv2.imwrite interface.

[0052] Asynchronous cloud-based evidence storage: By calling the cloud storage interface, locally generated evidence images are uploaded to the cloud object storage server, and an accessible URL link is obtained; this step enables remote synchronization of monitoring data, allowing parents to view on-site screenshots in real time.

[0053] Visual Enhanced Feedback (OSD): The system uses image fusion technology cv2ImgAddText to overlay and render the current status text, such as "Distracted!" or "Poor posture," directly onto the video stream, forming an intuitive visual feedback loop.

[0054] Intelligent Voice Assistant Module: This module enables voice communication and learning assistance between children and the system. It includes a microphone, speaker, and voice processing unit. The system detects keyword wake-up by monitoring the ambient audio stream in real time. When a preset wake-up word such as "Xiao Zhi" is detected, the voice recognition process is immediately started to convert the user's voice content into text input.

[0055] The recognized text is fed into a semantic analysis and question-answering model. Based on the recognition results, the system automatically determines the user's intent and generates corresponding responses. The generated text is then converted into natural and fluent speech output by the speech synthesis module and broadcast through a speaker, achieving a closed-loop human-computer dialogue. In learning scenarios, the system can differentiate between question-answering and reminder functions based on different voice tasks: when a child actively asks a question, the system uses the knowledge-based question-answering model to provide subject-specific answers; when abnormal learning behavior is detected, the system automatically invokes the voice reminder module, synthesizing gentle prompts through preset phrases to help the child return to a learning state.

[0056] The specific principle is as follows: In this embodiment, the intelligent voice assistant module serves as the core of the system's interaction, such as... Figure 3 As shown, it adopts a hybrid architecture design of "end-cloud collaboration"; it runs on Android mobile terminals and edge computing nodes, and establishes a natural language communication bridge between children and the system through processing processes such as audio acquisition, keyword wake-up, speech recognition, semantic understanding, content generation and speech synthesis; at the same time, this module and the visual behavior recognition module are deeply linked through the message bus to realize proactive voice intervention and emotional companionship for abnormal behavior. (1) Full-duplex voice interaction principle based on Azure cloud services This submodule runs on the Android mobile device, referencing the project architecture: android_deepseek_app. It is responsible for handling user-initiated dialogue requests, forming a closed-loop interaction of "listening - understanding - feedback".

[0057] Audio Stream Acquisition and Wake-up Mechanism: In this embodiment, the system utilizes the AudioRecord interface of the Android system to acquire the raw PCM audio stream. To address the issue of limited background operation on mobile devices, the system starts a foreground service with FOREGROUND_SERVICE_MICROPHONE permissions and binds a persistent notification bar to ensure that the application can continuously listen to ambient sounds in the background environment of Android 9 (API 28) and above. The audio stream is sent in real time to the Azure Voice SDK (KeywordRecognizer module) integrated locally. This module runs a lightweight wake word model with the preset keyword "Xiao Zhi" and performs low-power feature matching at the edge. Once the confidence level exceeds the threshold, the system immediately switches to a high-power cloud interaction mode.

[0058] Continuous Speech Recognition and Text Conversion (ASR): After being triggered to wake up, the system establishes a WebSocket connection with the Azure cloud platform and enables Continuous Speech Recognition (ASR) mode. This mode supports streaming transmission, meaning that the user speaks while the cloud performs recognition, significantly reducing the latency of the first word. The system simultaneously loads pre-trained acoustic and language models and combines them with a contextual semantic filtering mechanism to automatically correct homophones in noisy environments, converting the speech stream into text commands in real time.

[0059] Semantic understanding based on the DeepSeek large model (LLM Processing): The recognized text is first cleaned locally using regularization to remove invalid filler words such as "that" and "um". Then, the cleaned text is used as the Prompt input and the DeepSeek large language model API is called via HTTP / HTTPS protocol. The DeepSeek model performs deep analysis of the user's intent based on the preset "child companion" role prompt (System Prompt) and generates a text response that is appropriate for children's cognitive level and has a friendly tone.

[0060] Neural TTS: The generated response text is sent to the Azure Text-to-Speech API; the system uses Neural TTS technology to dynamically adjust the speech rate, pitch and rhythm based on the emotional tags of the text, generating a human-like audio stream that is played on the mobile device, completing one round of human-computer interaction; after the broadcast ends, the system automatically releases playback resources and seamlessly reverts to the wake-up listening state through the state machine control of Audio Focus.

[0061] The voice interaction and large model architecture in this embodiment can also adopt the following solutions: Alternative to large model deployment: Lightweight large models on the edge can be deployed on local high-performance computing units such as NPUs. Lightweight large models with quantization pruning, such as Llama-3-8B-Int4 and ChatGLM-Edge, can be deployed to achieve fully offline intelligent dialogue, protect privacy and reduce latency.

[0062] Other cloud-based model services: These can be replaced by other generative artificial intelligence services such as Baidu Wenxin Yiyan, Xunfei Xinghuo, and ChatGPT.

[0063] Alternatives to speech synthesis: Offline TTS engines, using offline speech synthesis engines such as VITS and FastSpeech2 for edge implementation, to adapt to environments without network access.

[0064] (2) The principle of proactive alert for abnormal behavior based on visual linkage This submodule mainly runs on the visual computing node (Python environment) and is responsible for handling passive interventions initiated by the system to achieve cross-modal linkage of "visual discovery - voice response".

[0065] Cross-modal triggering mechanism: When the vision module described in Part 1 detects abnormal states such as "eye contact loss" or "poor posture" and meets the timing judgment conditions, it will generate a control instruction containing the abnormality type (Type) and severity level (Level); Lightweight speech synthesis (Edge-TTS): To ensure the real-time performance and low latency of the reminders, this reminder process does not go through large model generation, but directly calls the lightweight edge-tts Python library; the system matches the preset prompt text according to the type of abnormality, such as "Sit up straight" or "Take a break for your eyes", and uses edge-tts to quickly synthesize speech files or streaming media; Playback Priority Control: The voice assistant has priority arbitration logic. When an "Abnormal Reminder" is triggered, if the current conversation is in a "Large Model Dialogue", the system will decide whether to "Ducking" or "Interrupt the current conversation" based on the severity level, ensuring that the child can clearly receive the monitoring instructions.

[0066] In this embodiment, the dynamic visual feedback and auxiliary lighting module includes an OLED expression screen, an LED light strip, and a light sensor. The OLED screen displays facial expressions such as blinking and smiling. The LED lights achieve continuously adjustable brightness through PWM control. The light sensor collects ambient brightness information in real time and transmits it to the main control unit via a UART interface. The system automatically adjusts the light strip brightness based on the detection data to achieve eye-protective lighting. The specific principle is as follows: This embodiment integrates an embedded microcontroller (MCU), an environmental sensing sensor, and a display driver unit to construct a lighting system with environmental adaptability and a human-like visual interaction system. Its core operation relies on precise timing control, pulse width modulation (PWM) technology, and a serial communication protocol (UART). The flowchart is shown below. Figure 1 As shown, the specific principle is as follows: (1) Principle of stepless adjustment of LED brightness based on PWM This embodiment uses pulse width modulation (PWM) technology to achieve digital control of the brightness of the LED fill light, instead of the traditional analog voltage regulation, in order to ensure color consistency and eliminate flicker.

[0067] Control logic: The system uses the timer inside the MCU to generate a square wave signal with a fixed frequency, such as 20kHz, which is higher than the flicker fusion frequency of the human eye; by adjusting the value of the timer compare register (CCR, Capture Compare Register), the duty cycle of the output waveform is changed. Brightness mapping: Duty cycle is defined as the ratio of the high-level time T_on to the signal period T (D=T_on / T); the drive circuit controls the on and off of the MOSFET according to the PWM signal. When the duty cycle increases, the average current of the LED per unit time increases, which macroscopically manifests as increased brightness; When the duty cycle decreases, the brightness decreases.

[0068] Software implementation: During MCU initialization, GPIO is configured to multiplexed push-pull output mode, and the timer clock is enabled; the system calculates the target duty cycle value (0~100%) according to the lighting requirements, writes it into the comparison register, and the hardware automatically generates the corresponding waveform to achieve smooth and stepless adjustment of brightness.

[0069] (2) Ambient light closed-loop regulation principle based on UART communication To achieve "eye-protecting lighting", the system establishes communication with an external light sensor module through a UART (Universal Asynchronous Receiver / Transmitter) interface, forming a closed-loop negative feedback control of brightness.

[0070] Data acquisition (UART reading): The photoelectric conversion and ADC sampling are completed inside the light sensor module integrated with BH1750 or similar chips, and the ambient light intensity is converted into a digital quantity (unit: Lux).

[0071] The MCU configures the serial port interrupt (NVIC) or DMA reception mode and sets the baud rate, such as 9600 bps. When the sensor sends data frames at a predetermined frequency, the MCU receives through the RX pin and analyzes the data packets through the state machine to obtain the current ambient light intensity value L_current.

[0072] Closed-loop algorithm: The system sets the target standard brightness range [L_min, L_max], for example, 300 - 500 Lux.

[0073] Comparison and decision: The main program periodically compares L_current with the target value. If L_current < L_min, the system gradually increases the PWM duty cycle for supplementary lighting. If L_min > L_max, the PWM duty cycle is decreased.

[0074] The PWM adjustment step size is dynamically calculated according to the light deviation amount to make the light change soft and natural.

[0075] (3)OLED Dynamic Visual Feedback and Timing Control Principle In this embodiment, an OLED display screen is used to achieve anthropomorphic dynamic expressions through video memory (GRAM) operations and precise timing control. The principle is as follows: Time-based system (SysTick): The system uses the SysTick timer of the ARM core to build a high-precision time base. The system beat (Tick) is maintained through the SysTick_Handler interrupt service function, and combined with the fac_us and fac_ms multiplication factors, it provides precise delays at the microsecond level (delay_us) and millisecond level (delay_ms); this provides a time reference for the frame rate control of dynamic expressions.

[0076] Video memory mapping and refresh mechanism: The system allocates a two-dimensional array u8OLED_GRAM in memory as the video memory buffer (Buffer), and each bit corresponds to a pixel on the screen.

[0077] Drawing principle: When the OLED_ShowPicture function is called, the MCU writes the preset expression bitmap data into the OLED_GRAM array through address calculation instead of directly operating the screen hardware; Hardware communication: Using the analog I2C protocol, the levels of OLED_SCL and OLED_SDA are flipped through the GPIO port, and the video memory data is sent to the driver chip by page and column.

[0078] Dynamic interaction implementation (time step control): To achieve the "blinking" dynamic effect, the system designed animation logic based on time steps; the specific steps are as follows: 1) Display the "eyes open" image frame; 2) Call delay_ms(T1) to maintain the state (e.g., T1=2000ms); 3) Refresh the video memory to display the "closed eyes" image frame; 4) Call delay_ms(T2) to maintain the state (e.g., T2=100ms); 5) Restore "eyes open".

[0079] By precisely controlling the dwell time of each keyframe in the main loop using delay_ms, and in conjunction with high-speed I2C refresh, a smooth and natural visual feedback animation was achieved.

[0080] The parent-side APP receiving and display module in this embodiment, such as... Figure 4 As shown: This module is used to realize the remote synchronization and visualization of learning behavior data, including a cloud storage platform and a parent-side mobile application.

[0081] When the anomaly detection module triggers an upload event, the system sends the anomaly image and behavior tags to cloud storage via Wi-Fi. The cloud generates a timestamp and access link for each record. The parent's app periodically polls the cloud interface; if new data is detected, a notification is sent to the notification bar. Parents can click on the notification to view the anomaly image and behavior description. The application interface also provides functions such as historical records, behavior trend statistics, and learning focus report generation, making it convenient for parents to track their children's learning status over the long term. The specific principle is as follows: This module constructs a visual evidence chain from "edge capture" to "terminal display," ensuring that parents can monitor their children's status in real time through an Android app.

[0082] (1) After the edge key frame capture and cloud vision process locks the abnormal key frame, the cloud object storage SDK is called to compress the image into JPEG and upload it with the name "Device ID_Timestamp_Abnormal Type" to obtain the persistent image URL.

[0083] (2) Android client polling synchronization mechanism The parent app is developed based on native Android and adopts an active polling strategy: the background network service (based on Retrofit / OkHttp) sends queries to the cloud interface at preset intervals and filters out the latest abnormal records by comparing file timestamps.

[0084] (3) Visualized interactive process, display of historical records: such as Figure 2As shown, the app's main interface displays the history of abnormal activity using a RecyclerView list component. Each item includes a prominent prompt: "Your child's learning situation is abnormal! Click to view" and the upload time, accurate to the second, such as 2025-06-27 17:28:22. The bottom of the interface provides "Refresh" and "Clear History" buttons for easy data management by parents.

[0085] Immersive Details Viewing: When a parent clicks on a list item, the app redirects to the details activity. The system uses the Glide image loading library to retrieve a high-resolution screenshot from the cloud, displaying the child's unusual state at the time (such as looking down at a phone, lying on a table, etc.) in full screen, and provides a "back" button, achieving a complete evidence collection loop from "overview" to "evidence".

[0086] The data transmission and monitoring interaction in this embodiment can also adopt the following scheme: Communication protocol alternatives: Long-connection push (WebSocket / MQTT), which can use WebSocket long-connection or MQTT IoT protocol to proactively push messages to the APP when abnormal records are generated in the cloud, further reducing latency and saving power.

[0087] Real-time video streaming (WebRTC / RTMP): This uses the WebRTC or RTMP protocol to directly open a live video channel when an anomaly is detected, allowing parents to view continuous monitoring footage rather than static images.

[0088] Alternative storage methods: Private cloud / home storage (NAS) allows data to be encrypted and uploaded to a home private cloud (NAS) or transmitted directly to the parent's mobile phone via P2P technology, without the data passing through a public cloud server.

[0089] The alternative solutions for hardware interaction methods are as follows: Alternatives to visual feedback: LCD / e-ink / dot matrix screens. You can use an LCD color screen to provide richer information display, or use an e-ink screen (E-ink) with eye protection as a selling point, or use an LED dot matrix screen to display expressions in a pixel style.

[0090] Alternative to lighting control: DC dimming, which uses DC dimming circuits to control LED brightness, can also achieve the same flicker-free and eye-protecting effect.

[0091] This article uses specific embodiments to illustrate the principles and implementation methods of the present invention. The descriptions of the embodiments above are only for the purpose of helping to understand the method and core ideas of the present invention. It should be noted that those skilled in the art can make several improvements and modifications to the present invention without departing from the principles of the present invention, and these improvements and modifications also fall within the protection scope of the claims of the present invention.

Claims

1. A behavior monitoring and companionship system based on visual recognition and voice interaction, characterized in that, The system includes: The image acquisition module is used to acquire the user's image data; The behavior recognition module is used to analyze the behavior of the monitored person based on the collected image data, including human posture key point detection, eye direction recognition, and hand movements, and through multimodal feature fusion, and to identify abnormal behavior based on a threshold discrimination algorithm. The anomaly detection and response module is used to automatically execute response operations when the identified abnormal behavior continues to exceed a set threshold. Through a continuous frame analysis mechanism, the behavior recognition results are smoothed over time. When the abnormal state continues to exceed a set number of frames, it is determined to be a valid anomaly, and voice broadcast and cloud upload are performed. The intelligent voice assistant module is used to detect keyword wake-up by real-time monitoring of the ambient audio stream, initiate the speech recognition process, and convert the monitored person's voice content into text input; The dynamic visual feedback and auxiliary lighting module is used to adjust the brightness of the system according to the ambient light and generate dynamic expressions through dynamic visual feedback to provide visual psychological compensation for users. The remote receiving end is used to synchronously receive behavior judgment data and visualize it.

2. The behavior monitoring and companionship system based on visual recognition and voice interaction according to claim 1, characterized in that, The image acquisition module includes a high-definition camera, used to acquire image data of the user's behavior process in real time.

3. The behavior monitoring and companionship system based on visual recognition and voice interaction according to claim 1, characterized in that, The behavior identification module is specifically a gaze and posture recognition based on the user's facial geometric mapping, which includes a facial central axis offset calculation module and a torso compression quantization determination module. The facial central axis offset calculation module is used to calculate the horizontal offset between the tip of the user's nose and the center point of both eyes to determine the state of gaze separation. The torso compression quantification judgment module is used to determine abnormal user sitting posture behavior by measuring the rate of change of the vertical projection distance between the center of the user's mouth and the center of the shoulder.

4. The behavior monitoring and companionship system based on visual recognition and voice interaction according to claim 3, characterized in that, The facial central axis offset calculation module specifically includes: Define feature points, selecting the user's left eye feature point E_left, right eye feature point E_right, and nose tip feature point N; Parameter calculation: Calculate the geometric midpoint C_eyes of the line connecting the user's eyes, with its x-coordinate x_avg=(x_left+x_right) / 2; Calculate the Euclidean distance Dh between the nose tip point x_nose and the center points of both eyes x_avg in the horizontal direction, Dh=|x_nose-x_avg|; Judgment logic: When looking straight ahead with focus, the tip of the nose should be near the vertical bisector of the line connecting the eyes, and Dh should be close to 0; when the user deviates from straight ahead, the coordinates of the tip of the nose will be horizontally displaced relative to the center of the eyes. The system sets a horizontal offset threshold Th. When Dh > Th, it is judged as abnormal behavior of gaze deviation.

5. The behavior monitoring and companionship system based on visual recognition and voice interaction according to claim 3, characterized in that, The trunk compression quantification determination module is specifically as follows: Define feature points, select the left / right corners of the user's mouth to calculate the center height of the mouth Y_mouth; select the left / right shoulder points to calculate the center height of the shoulder Y_shoulder; Geometric calculation: Calculate the vertical projection distance Dv between the center height of the mouth and the center height of the shoulder, where Dv = Y_shoulder - Y_mouth; The determination logic. The distance Dv represents the degree of extension of the neck and thoracic vertebrae. When the user exhibits abnormal behaviors such as lowering the head or hunching the back, the mouth coordinate sinks, and the shoulder coordinate relatively floats, resulting in a decrease in Dv. The system sets a vertical distance threshold Tv. When Dv < Tv, it is determined as an abnormal behavior of poor sitting posture; Cascaded state determination. The state cascaded logic is adopted, and poor sitting posture is regarded as a subset of the unfocused state. When it is detected that Dv < Tv, the system logic variable unfocus is also forced to be set to True to ensure comprehensive coverage of implicit abnormal behaviors.

6. The behavior monitoring and companionship system based on visual recognition and voice interaction according to claim 3, characterized in that, The behavior recognition module further includes precise gaze detection based on the bilinear pooling attention network, specifically including: The bilinear pooling mechanism extracts the local convolutional features X of the facial image, calculates the outer product of the feature vector and its transposed vector, and obtains second-order context information; The nested attention enhancement mechanism, aiming at the spatial misalignment problem of the human face in the video stream, dynamically calibrates the feature weights through the nested attention mechanism, specifically as follows: Global aggregation performs global average pooling on the second-order local features to generate a global descriptor G representing the overall posture and environmental illumination; Adaptive weighting uses a global descriptor G to reweight local features A, such as D = Softmax((U + G)). A) Through this mechanism, the system can automatically suppress background noise; The enhanced feature map after gaze vector regression is input into the fully connected layer, and a two-dimensional gaze vector (Yaw, Pitch) is output: The yaw angle Yaw represents the left-right deviation degree of the gaze in the horizontal direction, and the pitch angle Pitch represents the up-down deviation degree of the gaze in the vertical direction. When the absolute value of any angle exceeds the preset safety threshold, the system determines that the user's gaze has deviated.

7. The behavior monitoring and companionship system based on visual recognition and voice interaction according to claim 3, characterized in that, The behavior recognition module further includes a determination process for the persistence of abnormal behaviors based on a time-series sliding window, specifically including time-series consistency verification, specifically as follows: Frame counter and state locking: The system maintains two state machines in memory, respectively recording the distraction start frame F_start_unfocus and the sitting posture abnormal start frame F_start_posture; Time threshold logic: When the detection algorithm outputs an abnormality, i.e., True, if it is the first trigger currently, the current frame number F_current is recorded as the start frame; Continuously calculate the continuous frame number Delta_F = F_current - F_start. Only when Delta_F is equal to the preset time threshold T_cooldown, it is confirmed that this abnormality is a valid abnormal event; State reset: If the detection result turns normal before reaching the threshold T_cooldown, the start frame record is immediately cleared and the monitoring starts again.

8. The behavior monitoring and companionship system based on visual recognition and voice interaction according to claim 1, characterized in that, In the abnormal detection and response module, when it is confirmed that an abnormal behavior is established, the image data of the current frame that triggers the abnormal threshold is extracted, and this image data is uploaded to the cloud through cloud storage. The status text of the current abnormal behavior is superimposed and rendered on the video stream screen through image fusion technology to complete visual feedback.

9. The behavior monitoring and companionship system based on visual recognition and voice interaction according to claim 1, characterized in that, The dynamic visual feedback and auxiliary lighting module integrates an embedded microcontroller MCU, an environmental perception sensor, and a display driver unit, specifically as follows: The microcontroller MCU is used to collect the environmental illuminance value through the environmental perception sensor, compare it with the preset standard illuminance range, and dynamically adjust the PWM duty cycle to achieve constant illuminance; By using high-precision clock interrupts to control the refresh of the display memory, and through time-step logic, anthropomorphic blinking and smiling animations are achieved, providing visual psychological compensation.

10. The behavior monitoring and companionship system based on visual recognition and voice interaction according to claim 1, characterized in that, The system also includes edge capture and cloud monitoring, as detailed below: At the edge, key frame capture is performed. The system locks the current frame image data that triggers the abnormal threshold and uploads it to the cloud for storage. The current frame image data includes the abnormal behavior image, behavior type label and time. Cloud-based monitoring associates and displays abnormal behavior images, behavior type tags, and time on the terminal, and generates a focus analysis table.