Efficient smart home embedded HAR system
By pre-extracting RoI videos and integrating them in the SlowFast network, the method addresses computational and memory issues, enhancing performance for multiple person actions in embedded systems.
Patent Information
- Authority / Receiving Office
- WO · WO
- Patent Type
- Applications
- Current Assignee / Owner
- KOREA ELECTRONICS TECH INST
- Filing Date
- 2024-12-27
- Publication Date
- 2026-07-02
AI Technical Summary
SlowFast networks face high computational load and memory usage due to RoI processing, especially in embedded systems with limited resources, and struggle with efficient recognition of multiple people's actions in complex real-world environments.
Pre-extract RoI videos during preprocessing and group interacting people to integrate RoIs, processing them as a single unit in the SlowFast network, optimizing resource usage and reducing unnecessary computations.
Efficiently operates SlowFast networks with limited resources by reducing memory usage and improving performance for multiple object recognition, especially in smart homes.
Abstract
Description
Efficient Smart Home Embedded HAR System
[0001] The present invention relates to artificial intelligence-based behavior recognition, and more specifically, to efficient Human Activity Recognition (HAR) in embedded systems utilizing computer vision technology.
[0002] The SlowFast network, widely used for conventional behavior recognition, is effective for extracting spatiotemporal features but has the following limitations.
[0003] First, both human location detection and behavior recognition within the image are required. Since Region of Interest (RoI) processing occurs after feature fusion, feature extraction from the entire image is necessary, placing a significant burden on the limited computing resources of embedded systems.
[0004] Second, when multiple people are present in an image, memory usage increases significantly during the process of handling fused features for each RoI. This is a major cause of system performance degradation in embedded environments.
[0005] Meanwhile, in real-world application environments such as smart homes, it is necessary to detect and recognize the simultaneous actions of multiple people, so there is an even greater need for technology for efficient action recognition.
[0006] The present invention has been devised to solve the aforementioned problems, and the objective of the present invention is to provide a method for recognizing behavior by pre-extracting RoI video during the video preprocessing process and inputting it into the SlowFast network, as a means to efficiently operate a SlowFast network with the limited computing resources of an embedded system.
[0007] Furthermore, another objective of the present invention is to provide a behavior recognition method for efficiently operating a SlowFast network with limited computing resources of an embedded system, which groups interacting people during the video preprocessing process to integrate their RoIs and process them as a single unit in the SlowFast network.
[0008] A behavior recognition method according to an embodiment of the present invention for achieving the above objective comprises: a step of receiving a video input; a step of extracting a RoI video from the input video; and a first behavior recognition step of inputting the extracted RoI video into a SlowFast network to recognize a behavior.
[0009] The first behavior recognition step may involve inputting the extracted RoI video into both the Slow path and the Fast path of the SlowFast network.
[0010] The extraction step may include: a step of detecting objects in frames constituting an input video and assigning IDs to them; a step of performing ID matching between frames for the detected objects; and a step of generating object-unit RoI videos based on the ID matching results.
[0011] The behavior recognition method according to the present invention may further include the steps of: assigning an integrated ID to an integrated object that groups together objects whose bounding boxes overlap in a plurality of consecutive frames, and performing ID matching between frames; and generating an RoI video in units of the integrated object based on the ID matching result.
[0012] The behavior recognition method according to the present invention may further include the step of performing ID matching between the previous frame and the subsequent frame of the frames in which ID matching is broken, for an object and an integrated object.
[0013] The behavior recognition method according to the present invention may further include a second behavior recognition step of recognizing a behavior by inputting an input video into the Slow path of a SlowFast network and inputting an extracted RoI video into the Fast path of a SlowFast network.
[0014] The second behavior recognition step may be performed when the complexity of the video exceeds a threshold during a defined interval.
[0015] The behavior recognition method according to the present invention may further include a third behavior recognition step of inputting an input video into the Fast path of a SlowFast network and inputting an extracted RoI video into the Slow path of a SlowFast network to recognize a behavior.
[0016] The third behavior recognition step may be performed when the average amount of movement of objects in the video during a defined interval exceeds a threshold or when the average amount of occlusion of objects exceeds a threshold.
[0017] According to another aspect of the present invention, a behavior recognition system is provided, characterized by comprising: a preprocessing module that receives a video input and extracts a RoI video from the input video; and a SlowFast network that receives the extracted RoI video and recognizes a behavior.
[0018] According to another aspect of the present invention, a behavior recognition method is provided, characterized by comprising: a step of receiving a video; a step of extracting an RoI video from the input video; and a step of selectively inputting the input video and the extracted RoI video into a SlowFast network to recognize a behavior.
[0019] According to another aspect of the present invention, a behavior recognition system is provided, characterized by comprising: an input unit for receiving a video; a preprocessing unit for extracting an RoI video from the input video; and a behavior recognition network for selectively receiving the input video and the extracted RoI video to recognize a behavior.
[0020] As described above, according to the embodiments of the present invention, by pre-extracting RoI video during the video preprocessing process and inputting it into a SlowFast network, the memory usage of the embedded system can be efficiently reduced and the processing efficiency of the SlowFast network can be improved through the separation of the RoI extraction and feature extraction processes.
[0021] In addition, according to embodiments of the present invention, by grouping interacting people during the video preprocessing process and integrating their RoIs so that they are processed as one in the SlowFast network, the SlowFast network can be operated more efficiently with the limited computing resources of the embedded system. Behavior recognition performance for multiple objects is also improved, and stable operation is possible even in complex situations.
[0022] FIG. 1 shows the structure of a SlowFast network to which the present invention is applicable.
[0023] FIG. 2 illustrates a HAR method utilizing a SlowFast network according to an embodiment of the present invention.
[0024] FIG. 3 illustrates a HAR system according to another embodiment of the present invention.
[0025] FIG. 4 shows a test data augmentation method.
[0026] FIGS. 5 and 6 illustrate a smart home embedded HAR method and system according to another embodiment of the present invention.
[0027] The present invention will be described in more detail below with reference to the drawings.
[0028] As the application scope of camera-based Human Activity Recognition (HAR) systems expands to everyday environments such as smart homes, efficient behavior recognition in embedded environments is required. While SlowFast networks can effectively extract temporal and spatial features, the high computational load and memory usage involved in the RoI processing within the network pose a significant technical challenge in environments with limited hardware resources.
[0029] Meanwhile, HAR methods have evolved around benchmark datasets such as NTU, Toyota Smarthome, and PKU-MMD, most of which contain a single action performed by a single person. However, in real-world home environments, multiple residents may be present simultaneously, and more complex situations may arise, such as each resident performing a different action or interacting with others.
[0030] FIG. 1 shows the structure of a SlowFast network to which the present invention is applicable. The SlowFast network processes an input video through two paths with different frame rates. The Slow path extracts spatial features at a low frame rate, and the Fast path extracts temporal features at a high frame rate. The extracted features are then fused to perform final action recognition. The SlowFast network has the following problems.
[0031] First, since feature extraction is performed on the entire input video and RoI processing is carried out only after feature fusion, the same amount of computation is required for backgrounds or unnecessary areas that are not the region of interest.
[0032] Second, when multiple RoIs exist within a video, individual feature fusion and processing are required for each RoI, causing memory usage to increase rapidly in proportion to the number of RoIs.
[0033] In environments with limited computing resources and memory, such as embedded systems or mobile devices, excessive computational load and memory usage resulting from these structural characteristics are major causes of performance degradation. Furthermore, in environments with limited battery capacity, increased power consumption caused by unnecessary computations also acts as a factor that reduces system efficiency.
[0034] Accordingly, an embodiment of the present invention proposes a method that improves system efficiency through preprocessing-based RoI optimization while maintaining the structure of the SlowFast network, thereby enabling effective behavior recognition. This technology involves pre-extracting RoI videos during the video preprocessing stage prior to SlowFast network processing, grouping interacting people to integrate the RoIs, and inputting the results into the SlowFast network.
[0035] FIG. 2 is a diagram illustrating the flow of a HAR method utilizing a SlowFast network according to an embodiment of the present invention. In the HAR method according to an embodiment of the present invention, a Region of Interest (RoI) video is extracted from an input video through a preprocessing process and input into a SlowFast network.
[0036] To this end, as described above, a video is first received as input, objects in the consecutive frames constituting the input video are detected as bounding boxes, and an ID is assigned to each detected object (S110).
[0037] Next, for the objects detected in step S110, identical objects are identified and inter-frame ID matching is performed (S120). As a result, identical objects have the same ID in consecutive frames. This configuration is designed to track the movement of objects while ensuring temporal continuity.
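The inter-frame ID matching of step S120 can be sketched as follows. The source does not state how identical objects are identified, so this minimal example assumes greedy matching on bounding-box IoU; the names `iou` and `match_ids` and the `min_iou` threshold are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_ids(prev: Dict[int, Box], detections: List[Box],
              next_id: int, min_iou: float = 0.3) -> Tuple[Dict[int, Box], int]:
    """Carry IDs from the previous frame over to the current detections.

    Assumed strategy: link the highest-IoU pairs first; detections left
    unmatched receive fresh IDs. min_iou is an illustrative threshold.
    """
    pairs = sorted(
        ((iou(box, det), pid, j)
         for pid, box in prev.items()
         for j, det in enumerate(detections)),
        reverse=True,
    )
    current: Dict[int, Box] = {}
    used = set()
    for score, pid, j in pairs:
        if score < min_iou:
            break  # remaining pairs overlap too little to be the same object
        if pid in current or j in used:
            continue
        current[pid] = detections[j]
        used.add(j)
    for j, det in enumerate(detections):
        if j not in used:
            current[next_id] = det
            next_id += 1
    return current, next_id


# Two people detected in frame 0, then detected in swapped order in frame 1.
frame0, next_id = match_ids({}, [(0, 0, 10, 20), (50, 0, 60, 20)], next_id=0)
frame1, next_id = match_ids(frame0, [(52, 0, 62, 20), (1, 0, 11, 20)], next_id)
```

Here each person keeps the ID assigned in frame 0 even though the detector returns them in a different order in frame 1. A production tracker would typically add appearance features or motion prediction, which this sketch omits.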
[0038] Then, objects whose bounding boxes overlap frequently or substantially across a number of consecutive frames are grouped into a single integrated object, an integrated ID is assigned to the integrated object, and ID matching between frames is performed (S130). This is intended to effectively recognize group-unit actions by multiple associated objects and to reduce the number of objects whose actions must be recognized.
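One way to picture the grouping of step S130 is shown below. This is a sketch under stated assumptions, not the claimed implementation: two tracks are merged when their boxes overlap for at least `min_frames` consecutive frames (an assumed criterion), groups are formed with a small union-find, and the integrated object's box in each frame is taken as the union of its members' boxes. All function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Track = Dict[int, Box]                   # frame index -> box


def overlaps(a: Box, b: Box) -> bool:
    """True when the two boxes share a region of positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def longest_overlap_run(a: Track, b: Track) -> int:
    """Length of the longest run of consecutive frames in which a and b overlap."""
    best = run = 0
    prev = None
    for f in sorted(a.keys() & b.keys()):
        if overlaps(a[f], b[f]):
            run = run + 1 if prev == f - 1 else 1
            best = max(best, run)
            prev = f
    return best


def group_tracks(tracks: Dict[int, Track], min_frames: int = 3) -> List[Set[int]]:
    """Group track IDs whose boxes overlap for min_frames consecutive frames."""
    parent = {tid: tid for tid in tracks}

    def find(t: int) -> int:
        while parent[t] != t:
            parent[t] = parent[parent[t]]  # path halving
            t = parent[t]
        return t

    for a, b in combinations(tracks, 2):
        if longest_overlap_run(tracks[a], tracks[b]) >= min_frames:
            parent[find(a)] = find(b)

    groups: Dict[int, Set[int]] = {}
    for tid in tracks:
        groups.setdefault(find(tid), set()).add(tid)
    return list(groups.values())


def merged_track(tracks: Dict[int, Track], members: Set[int]) -> Track:
    """Per-frame union box of all member tracks (the integrated object)."""
    frames = set().union(*(tracks[m].keys() for m in members))
    out: Track = {}
    for f in sorted(frames):
        boxes = [tracks[m][f] for m in members if f in tracks[m]]
        out[f] = (min(b[0] for b in boxes), min(b[1] for b in boxes),
                  max(b[2] for b in boxes), max(b[3] for b in boxes))
    return out


tracks = {
    0: {f: (0, 0, 10, 20) for f in range(4)},     # person A
    1: {f: (8, 0, 18, 20) for f in range(4)},     # person B, overlapping A
    2: {f: (100, 0, 110, 20) for f in range(4)},  # person C, far away
}
groups = group_tracks(tracks)
integrated = merged_track(tracks, {0, 1})
```

In this example persons A and B form one integrated object while person C stays a single object, so only two RoI videos would be passed on instead of three.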
[0039] Subsequently, for both single objects and integrated objects, ID matching is performed between the previous and subsequent frames of the frames where ID matching is broken (S140). This is intended to address situations where IDs are broken due to the non-detection of a single object or integrated object in some frames, and corresponds to a configuration designed to minimize information loss due to detection failure by ensuring temporal continuity.
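A minimal sketch of the gap handling in step S140 follows. The source only says that matching is performed between the frame before and the frame after the break; the IoU test, the `max_gap` limit, and the linear interpolation of boxes for the missed frames are assumptions added for illustration, and `bridge_gap` is a hypothetical name.

```python
from typing import Dict, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Track = Dict[int, Box]                   # frame index -> box


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter = (max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
             * max(0.0, min(a[3], b[3]) - max(a[1], b[1])))
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def bridge_gap(before: Track, after: Track,
               max_gap: int = 5, min_iou: float = 0.3) -> Optional[Track]:
    """Re-link two track fragments separated by frames with no detection.

    The last box before the break is matched against the first box after
    it. On success the fragments are merged under one ID and the missing
    frames are filled by linear interpolation (an assumed fill strategy).
    """
    last_f, first_f = max(before), min(after)
    gap = first_f - last_f - 1
    if gap < 1 or gap > max_gap:
        return None
    a, b = before[last_f], after[first_f]
    if iou(a, b) < min_iou:
        return None
    merged: Track = {**before, **after}
    for f in range(last_f + 1, first_f):
        t = (f - last_f) / (first_f - last_f)
        merged[f] = tuple(pa + t * (pb - pa) for pa, pb in zip(a, b))
    return merged


before = {0: (0, 0, 10, 20), 1: (2, 0, 12, 20)}  # detected in frames 0 and 1
after = {3: (5, 0, 15, 20)}                      # detection missed in frame 2
relinked = bridge_gap(before, after)
unrelated = bridge_gap(before, {3: (200, 0, 210, 20)})
```

The first call returns a single continuous track covering frames 0 to 3; the second returns `None` because the box after the break does not plausibly belong to the same object.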
[0040] Subsequently, based on the ID matching results, object-unit RoI videos and integrated object-unit RoI videos are generated (S150). The generated RoI videos are optimized RoI information that considers both temporal continuity and spatial comprehensiveness, allowing for the elimination of operations on unnecessary areas, focusing only on areas where meaningful actions occur, and enabling the integrated processing of related objects.
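The RoI video generation of step S150 can be illustrated with a simple crop. The sketch assumes that one fixed crop window, the union of the track's boxes, is used for the whole clip so that every frame of the RoI video has the same size; the source does not specify this. Nested Python lists stand in for image arrays, and `roi_video` is a hypothetical name.

```python
from typing import Dict, List, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), x2 and y2 exclusive
Frame = List[List[int]]          # rows of pixel values (stand-in for an image)


def roi_video(frames: List[Frame], track: Dict[int, Box]) -> List[Frame]:
    """Crop every tracked frame to the union of the track's boxes."""
    height, width = len(frames[0]), len(frames[0][0])
    # Union window over the whole track, clamped to the frame bounds.
    x1 = max(0, min(b[0] for b in track.values()))
    y1 = max(0, min(b[1] for b in track.values()))
    x2 = min(width, max(b[2] for b in track.values()))
    y2 = min(height, max(b[3] for b in track.values()))
    return [[row[x1:x2] for row in frames[f][y1:y2]] for f in sorted(track)]


# Three synthetic 6x8 frames; each pixel encodes frame, row, and column.
frames = [[[100 * f + 10 * y + x for x in range(8)] for y in range(6)]
          for f in range(3)]
track = {0: (1, 1, 3, 4), 1: (2, 1, 4, 4), 2: (2, 2, 5, 5)}
clip = roi_video(frames, track)
```

The resulting clip is 4x4 pixels per frame instead of 6x8, so the downstream network sees only the region in which the tracked object moves.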
[0041] Afterwards, the RoI videos generated in step S150 are input into the SlowFast network, per object or per integrated object, to recognize actions (S160). In step S160, the RoI video generated in step S150 is input into the Fast path as is, whereas its frame rate is lowered before it is input into the Slow path.
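The frame-rate handling of step S160 amounts to temporal subsampling for the Slow path. The sketch below assumes a fixed stride `alpha` between the two paths; the value 8 is an illustrative assumption, not a figure given in the source, and `pathway_inputs` is a hypothetical name.

```python
from typing import List, Sequence, Tuple, TypeVar

FrameT = TypeVar("FrameT")


def pathway_inputs(roi_clip: Sequence[FrameT],
                   alpha: int = 8) -> Tuple[List[FrameT], List[FrameT]]:
    """Split one RoI clip into Slow-path and Fast-path inputs.

    Fast path: every frame of the RoI clip, unchanged.
    Slow path: every alpha-th frame, i.e. a lowered frame rate.
    """
    fast = list(roi_clip)
    slow = list(roi_clip[::alpha])
    return slow, fast


roi_clip = [f"frame_{i}" for i in range(32)]  # placeholder frames
slow, fast = pathway_inputs(roi_clip)
```

With a 32-frame RoI clip and `alpha = 8`, the Slow path receives 4 frames and the Fast path receives all 32.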
[0042] FIG. 3 is a diagram illustrating the configuration of a HAR system according to another embodiment of the present invention. As illustrated, the HAR system according to an embodiment of the present invention is configured to include a preprocessing module (110) and a SlowFast network (120).
[0043] The preprocessing module (110) performs steps S110 to S150 of FIG. 2 described above to generate object-specific RoI videos and integrated object-specific RoI videos from the input video.
[0044] The SlowFast network (120) refers to a network that recognizes the behavior of an object using object-specific RoI videos generated by the preprocessing module (110) and integrated object-specific RoI videos, or hardware or a processor for executing the same.
[0045] Since the SlowFast network (120) processes RoI video of a smaller size than the entire input video, feature extraction is performed only on the necessary areas in both the Slow and Fast paths, and the amount of computation can be effectively reduced during the feature fusion process in the latter part. In particular, for behavior recognition in a limited environment such as a smart home, the resources of the embedded system can be utilized efficiently while maintaining the spatiotemporal feature extraction capabilities of the existing SlowFast network.
[0046] To verify the performance of the HAR method and system according to an embodiment of the present invention, test data for various situations was constructed through preprocessing. As shown in FIG. 4, an RoI was generated from single-action data of the NTU dataset, and the videos were combined to generate complex-situation data containing simultaneous actions of multiple people. Specifically, exploiting the fact that the camera angle is constant for each laboratory, the data was augmented by combining videos of different actions side by side. Experiments conducted using the generated data verified that the proposed structure effectively performs action recognition for each person even in complex situations where multiple people simultaneously perform different actions.
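The side-by-side augmentation described above can be sketched as a per-frame horizontal concatenation. Nested lists again stand in for image arrays; truncating to the shorter clip and requiring equal frame heights are assumptions of this example, and `combine_side_by_side` is a hypothetical name.

```python
from typing import List

Frame = List[List[int]]  # rows of pixel values (stand-in for an image)


def combine_side_by_side(clip_a: List[Frame], clip_b: List[Frame]) -> List[Frame]:
    """Place two single-person clips next to each other, frame by frame."""
    combined = []
    for frame_a, frame_b in zip(clip_a, clip_b):  # zip stops at the shorter clip
        if len(frame_a) != len(frame_b):
            raise ValueError("clips must have the same frame height")
        combined.append([row_a + row_b for row_a, row_b in zip(frame_a, frame_b)])
    return combined


# Clip A: 3 frames of 2x2 pixels, all ones. Clip B: 2 frames of 2x3 pixels, all twos.
clip_a = [[[1, 1], [1, 1]] for _ in range(3)]
clip_b = [[[2, 2, 2], [2, 2, 2]] for _ in range(2)]
multi_person = combine_side_by_side(clip_a, clip_b)
```

Each combined frame is 2 pixels high and 5 pixels wide, with person A on the left and person B on the right, and the combined clip has as many frames as the shorter input.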
[0047] Up to now, an efficient smart home embedded HAR method and system utilizing a SlowFast network has been described in detail with reference to preferred embodiments.
[0048] In the above embodiment, by pre-extracting RoI videos during the video preprocessing process and inputting them into the SlowFast network, the memory usage of the embedded system can be efficiently reduced and the processing efficiency of the SlowFast network improved through the separation of the RoI extraction and feature extraction processes.
[0049] Furthermore, in the above embodiment, by grouping interacting people during the video preprocessing process and integrating their RoIs so that they are processed as one in the SlowFast network, the SlowFast network can be operated more efficiently with the limited computing resources of the embedded system. Behavior recognition performance for multiple objects is also improved, and stable operation is possible even in complex situations.
[0050] Meanwhile, RoI video generation and input can be applied selectively and variably for each path of the SlowFast network. For example, as shown in FIG. 5, it is possible to input the RoI video into the Fast path of the SlowFast network, while inputting the input video at a reduced frame rate into the Slow path. This can be done to enhance spatial feature extraction when the complexity of the video increases during a defined interval.
[0051] Furthermore, as shown in FIG. 6, it is possible to input the RoI video at a reduced frame rate into the Slow path of the SlowFast network, while inputting the input video as is into the Fast path. This can be done to enhance temporal feature extraction when the average amount of movement of objects in the video exceeds a threshold during a defined interval, or when the average amount of occlusion of objects exceeds a threshold.
[0052] As such, in the smart home embedded HAR method and system according to the embodiment of the present invention, it is possible to execute the methods presented in FIGS. 3, 5, and 6 while adaptively changing them according to the state of the input video.
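The adaptive switching between the three input configurations can be sketched as a small selection function. The source does not define how complexity, movement, or occlusion are measured, nor which configuration takes priority when several conditions hold at once; the normalized scores, the threshold values, and the priority order below are all assumptions made for illustration, and the names `Mode` and `select_mode` are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    ROI_BOTH = "RoI video on both paths"
    FULL_SLOW_ROI_FAST = "input video on Slow path, RoI video on Fast path"
    ROI_SLOW_FULL_FAST = "RoI video on Slow path, input video on Fast path"


def select_mode(complexity: float, motion: float, occlusion: float,
                complexity_thr: float = 0.6, motion_thr: float = 0.5,
                occlusion_thr: float = 0.4) -> Mode:
    """Pick the input configuration for the next interval.

    All three scores are assumed to be averages over a defined interval,
    normalized to [0, 1]. Assumed priority: the motion/occlusion condition
    is checked before the complexity condition.
    """
    if motion > motion_thr or occlusion > occlusion_thr:
        # Strengthen temporal features: full input video on the Fast path.
        return Mode.ROI_SLOW_FULL_FAST
    if complexity > complexity_thr:
        # Strengthen spatial features: full input video on the Slow path.
        return Mode.FULL_SLOW_ROI_FAST
    # Default: RoI video on both paths for the lowest resource usage.
    return Mode.ROI_BOTH


calm = select_mode(complexity=0.2, motion=0.1, occlusion=0.0)
cluttered = select_mode(complexity=0.8, motion=0.1, occlusion=0.0)
fast_moving = select_mode(complexity=0.2, motion=0.9, occlusion=0.0)
```

A calm scene keeps the RoI video on both paths, a cluttered scene routes the full input video to the Slow path, and a fast-moving or heavily occluded scene routes it to the Fast path.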
[0053] Meanwhile, it goes without saying that the technical concept of the present invention may also be applied to a computer-readable recording medium containing a computer program that enables the device and method according to the present embodiment to perform their functions. Furthermore, the technical concept according to various embodiments of the present invention may be implemented in the form of computer-readable code recorded on a computer-readable recording medium. A computer-readable recording medium may be any data storage device that can be read by a computer and store data. For example, a computer-readable recording medium may be a ROM, RAM, CD-ROM, magnetic tape, floppy disk, optical disk, hard disk drive, etc. Additionally, computer-readable code or a program stored on a computer-readable recording medium may be transmitted through a network connected between computers.
[0054] Furthermore, although preferred embodiments of the present invention have been illustrated and described above, the present invention is not limited to the specific embodiments described above. Various modifications are possible by those skilled in the art without departing from the essence of the invention as claimed in the claims, and such modifications should not be understood separately from the technical spirit or perspective of the present invention.
Claims
1. A behavior recognition method characterized by comprising: a step of receiving a video input; a step of extracting an RoI video from the input video; and a first behavior recognition step of inputting the extracted RoI video into a SlowFast network to recognize a behavior.
2. The behavior recognition method of claim 1, characterized in that the first behavior recognition step inputs the extracted RoI video into both the Slow path and the Fast path of the SlowFast network.
3. A behavior recognition method characterized in that the extraction step comprises: a step of detecting objects in the frames constituting the input video and assigning IDs to them; a step of performing ID matching between frames for the detected objects; and a step of generating object-unit RoI videos based on the ID matching results.
4. The behavior recognition method of claim 3, characterized by further comprising: a step of assigning an integrated ID to an integrated object that groups together objects whose bounding boxes overlap in a number of consecutive frames, and performing ID matching between frames; and a step of generating RoI videos in units of the integrated object based on the ID matching results.
5. The behavior recognition method of claim 4, characterized by further comprising a step of performing, for an object and an integrated object, ID matching between the previous frame and the subsequent frame of the frames where ID matching is broken.
6. The behavior recognition method of claim 1, characterized by further comprising a second behavior recognition step of inputting the input video into the Slow path of the SlowFast network and inputting the extracted RoI video into the Fast path of the SlowFast network to recognize a behavior.
7. The behavior recognition method of claim 1, characterized in that the second behavior recognition step is performed when the complexity of the video exceeds a threshold during a defined interval.
8. The behavior recognition method of claim 1, characterized by further comprising a third behavior recognition step of inputting the input video into the Fast path of the SlowFast network and inputting the extracted RoI video into the Slow path of the SlowFast network to recognize a behavior.
9. The behavior recognition method of claim 8, characterized in that the third behavior recognition step is performed when the average amount of movement of objects in the video exceeds a threshold during a defined interval or when the average amount of occlusion of objects exceeds a threshold.
10. A behavior recognition system characterized by comprising: a preprocessing module that receives a video as input and extracts an RoI video from the input video; and a SlowFast network that receives the extracted RoI video as input and recognizes a behavior.
11. A behavior recognition method characterized by comprising: a step of receiving a video input; a step of extracting an RoI video from the input video; and a step of selectively inputting the input video and the extracted RoI video into a SlowFast network to recognize a behavior.
12. A behavior recognition system characterized by comprising: an input unit that receives a video; a preprocessing unit that extracts an RoI video from the input video; and a behavior recognition network that selectively receives the input video and the extracted RoI video to recognize a behavior.