A sleep behavior recognition method, system and medium under security monitoring

CN116311516B · Active · Publication Date: 2026-06-26 · NINGBO INST OF MATERIALS TECH & ENG CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NINGBO INST OF MATERIALS TECH & ENG CHINESE ACAD OF SCI
Filing Date
2023-03-07
Publication Date
2026-06-26


Abstract

The application discloses a sleep behavior recognition method and system under security monitoring, and a medium, and the method comprises the following steps: preparing a video sequence dataset; adopting a convolutional neural network to perform feature coding on the video sequence dataset, and coding image frame features; adopting a video segmentation strategy to segment the image frame features, and calculating segment-level features of a current segment; stacking the segment-level features of each segment in time sequence to form video-level features of the video sequence, and taking the video-level features as inputs of a time recurrent neural network module added with a spatiotemporal non-local attention mechanism; and passing output feature maps of the spatiotemporal non-local attention mechanism through a full connection layer of the time recurrent neural network module, and completing final sleep behavior recognition. The improved CNN-LSTM network architecture realizes sleep behavior detection under monitoring video.

Description

Technical Field

[0001] This invention belongs to the field of video action recognition, and specifically relates to a method, system, and medium for recognizing sleeping behavior under security monitoring.

Background Technology

[0002] Video action recognition algorithms are video classification algorithms that identify human actions within video sequences. The focus of action recognition is observing human behavior within a series of relationships between individual actions and the environment. Actions to be identified may include standing, sitting, speaking, and walking. Some of these actions can be directly determined from a single frame, while more complex actions, such as opening and closing doors, require combining the current frame with preceding and following frames to determine the specific action. Personnel sleeping on duty often sleep in chairs, and their sleeping posture may be the same as their office posture. Furthermore, due to the distance of security monitoring, it is difficult to clearly observe the state of a person's face. Using a single frame to determine if someone is sleeping on duty will lead to significant errors.

[0003] In the development of video sequence action recognition, there are methods based on traditional hand-designed features, such as those based on global and local features. There are also methods based on deep learning.

[0004] Traditional hand-designed feature methods, such as IDT, have achieved good results on public datasets like HMDB51 and UCF101. However, hand-designed features are computationally expensive, generalize poorly, and are difficult to deploy. Deep learning networks can effectively address these issues.

[0005] Currently, there are two main types of deep learning algorithms for video action recognition: the first uses a CNN (Convolutional Neural Network) alone, and the second uses a CNN to extract features from the image frames of a video sequence and then uses an RNN (Recurrent Neural Network) to model the entire video sequence. The first type mainly includes Two-Stream, C3D, SlowFast, and PoseC3D. The second type mainly includes CNN-LSTM (a convolutional neural network combined with a Long Short-Term Memory network), LRCN (Long-term Recurrent Convolutional Network), and bidirectional LSTM.

[0006] The steps for deep learning networks to solve video action recognition can be divided into data annotation, training, and prediction. Video annotation typically involves placing video clips containing a specific type of action in a separate folder. During training, video sequences for each type of action are sampled and fed into the neural network for iterative optimization of network parameters, ensuring the video classification result matches the target label. Representative methods based on convolutional networks include two-stream networks and C3D networks. Two-stream networks add an additional optical flow branch as a motion representation between video sequence frames, while C3D networks treat the video sequence as a three-dimensional tensor with two spatial dimensions and one temporal dimension, using 3D convolution as a general feature extractor for video action recognition. Methods based on recurrent neural networks treat the video as a time series. A common approach is to aggregate the features of a single frame of the CNN into video-level features, which are then used as input to the RNN network to obtain the final predicted label.

[0007] Existing video action recognition algorithms perform well on datasets with short video sequences, but their performance is not as good on long video sequences. For example, 3D convolutional networks can stack multiple short-time convolutions to achieve long-distance temporal connections. However, this method loses the connection information between image frames that are far apart. Moreover, most current algorithms are designed for scenarios with large changes in motion amplitude in video sequences, and their performance is unsatisfactory in scenarios with small changes in motion amplitude, such as sleeping positions.

[0008] Therefore, designing a video action recognition algorithm suitable for scenarios with small changes in motion amplitude, such as sleeping on duty, is an urgent problem to be solved.

Summary of the Invention

[0009] The main objective of this invention is to provide a method, system, and medium for recognizing sleeping behavior in security monitoring scenarios where the motion amplitude changes little, such as when someone is sleeping on duty.

[0010] To achieve the aforementioned objectives, the technical solution adopted by this invention includes: a method for recognizing sleeping behavior under security monitoring, comprising:

[0011] S1: create a dataset, which includes collecting sleeping and non-sleeping videos under security monitoring, processing the videos, and obtaining the video sequence dataset required by the neural network;

[0012] S2: use a convolutional neural network to perform feature encoding on the video sequence dataset and encode the image frame features;

[0013] S3: segment the image frame features using a video segmentation strategy, and calculate the segment-level features of the current segment;

[0014] S4: stack the segment-level features of each segment in chronological order to form the video-level features of the video sequence, which are used as the input to the temporal recurrent neural network module with the added spatiotemporal nonlocal attention mechanism;

[0015] S5: pass the output feature map of the spatiotemporal nonlocal attention mechanism through the fully connected layer of the temporal recurrent neural network module to obtain the final classification result, thus completing the final recognition of sleeping behavior.

[0016] In a preferred embodiment, in step S1, the acquired video is cropped in length and size, and randomly divided into a training set, a validation set, and a test set.

[0017] In a preferred embodiment, in step S2, a ResNet50 network is used to perform feature encoding on the video sequence dataset to obtain the image frame features represented by a single frame.

[0018] In a preferred embodiment, S3 includes:

[0019] S31, the image frame features before the fully connected layer of the ResNet50 network are segmented using a video segmentation strategy, and the image frame features are divided into multiple segments;

[0020] S32, based on the principle of metric learning, calculate the two feature vectors that are most similar to and least similar to the current feature vector in the current segment;

[0021] S33, the two feature vectors are concatenated, and the concatenated feature vector is used as the segment-level feature of the current segment.

[0022] In a preferred embodiment, in step S32, the minimum feature vector, which has the smallest Euclidean distance to the other feature vectors in the current segment, and the maximum feature vector, which has the largest distance to them, are calculated; in step S33, the minimum feature vector and the maximum feature vector are concatenated.

[0023] In a preferred embodiment, in step S32, the average Euclidean distance between the current feature vector and the other feature vectors in the current segment is calculated, and the maximum and minimum feature vectors corresponding to the largest and smallest of these average distances are selected.

[0024] In a preferred embodiment, in step S4, the spatiotemporal nonlocal attention mechanism is applied after the output layer of the temporal recurrent neural network and before the fully connected layer, and the output feature map of the spatiotemporal nonlocal attention mechanism is used as the input to the fully connected layer of the entire temporal recurrent neural network.

[0025] In a preferred embodiment, in step S4, the segment-level features of each segment are stacked in chronological order to form the video-level features of the video sequence.

[0026] On the other hand, the present invention also provides a sleeping behavior recognition system under security monitoring, comprising:

[0027] The dataset creation module is used to create datasets, which includes collecting sleeping and non-sleeping videos under security monitoring, processing the videos, and obtaining the video sequence dataset required by the neural network.

[0028] The image frame feature encoding module is used to perform feature encoding on the video sequence dataset using a convolutional neural network to encode image frame features.

[0029] The segment-level feature calculation module is used to segment the image frame features using a video segmentation strategy and calculate the segment-level features of the current segment.

[0030] The video-level feature acquisition module is used to stack the segment-level features of each segment in chronological order to form the video-level features of the video sequence, which serve as the input to the temporal recurrent neural network module with added spatiotemporal nonlocal attention mechanism;

[0031] The sleep behavior recognition module is used to pass the output feature map of the spatiotemporal non-local attention mechanism through the fully connected layer of the temporal recurrent neural network module to obtain the final classification result and complete the final recognition of sleep behavior.

[0032] In another aspect, the present invention also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the sleeping behavior recognition method under security monitoring described above.

[0033] Compared with the prior art, the beneficial effects of the present invention are at least as follows:

[0034] This invention builds upon the existing video action recognition network (CNN-LSTM) framework by incorporating a segmented feature representation method and a non-local attention mechanism. The segmented feature representation method, compared to image frame-level representation, reduces the distance between the beginning and end of a video sequence to some extent. The non-local attention mechanism captures long-term dependencies in both the spatial and temporal domains, achieving better performance without additional annotations. This invention implements sleep behavior detection in surveillance videos using an improved CNN-LSTM network architecture.

Description of the Drawings

[0035] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0036] Figure 1 is a flowchart of the method of the present invention;

[0037] Figure 2 is a schematic diagram of sleeping video image frame sequences and non-sleeping video image frame sequences in the training set of the video action recognition network;

[0038] Figure 3 is a schematic diagram of the overall network framework after the segmentation strategy is introduced in the feature fusion stage;

[0039] Figure 4 is a schematic diagram of the feature segmentation strategy and of the method for calculating the representative features of a segment;

[0040] Figure 5 is a schematic diagram of the structure of the spatiotemporal nonlocal attention mechanism module.

Detailed Implementation

[0041] The invention will be more fully understood through the following detailed description, which should be read in conjunction with the accompanying drawings. Detailed embodiments of the invention are disclosed herein; however, it should be understood that the disclosed embodiments are merely exemplary of the invention, which may be embodied in various forms. Therefore, the specific functional details disclosed herein should not be construed as limiting, but rather as the basis for the claims and as intended to teach those skilled in the art to employ the representative basis of the invention in different ways in any suitable detailed embodiment.

[0042] The present invention discloses a method, system and medium for recognizing sleeping behavior under security monitoring. It addresses the difficulties in feature extraction from long video sequences such as sleeping behavior and the recognition problem of small changes in movement amplitude. Based on existing video action recognition algorithms (such as the CNN-RNN architecture), it adds a video sequence segmentation strategy and a non-local block attention mechanism in the feature fusion stage to overcome the shortcomings of existing algorithms.

[0043] As shown in Figure 1, in an embodiment of the present invention, a method for recognizing sleeping behavior under security monitoring specifically includes the following steps:

[0044] S1, Create a dataset, which includes collecting sleeping and non-sleeping videos under security monitoring, processing the videos, and obtaining the video sequence dataset required by the neural network.

[0045] This invention uses videos as the dataset for sleeping behavior. Specifically, videos are collected from security surveillance cameras and manually annotated to produce sleeping and non-sleeping videos, and both kinds of video are cropped in length and size. The videos are then randomly divided into training, validation, and test sets, which serve as the video sequence dataset for a video action recognition network (such as a CNN-LSTM network). Figure 2 is a schematic diagram of sleeping video image frame sequences and non-sleeping video image frame sequences in the training set of the video action recognition network.
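The random division described above can be sketched as follows. This is only an illustrative sketch: the 7:2:1 ratio, the fixed seed, the file paths, and the function name `split_dataset` are assumptions, since the text does not specify split proportions or file layout.

```python
import random

def split_dataset(samples, ratios=(7, 2, 1), seed=0):
    """Randomly partition labelled clips into train / validation / test lists.

    `ratios` are integer weights (assumed 7:2:1 here; the source gives no ratio).
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    total = sum(ratios)
    n_train = len(shuffled) * ratios[0] // total
    n_val = len(shuffled) * ratios[1] // total
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test

# Hypothetical (path, label) pairs: 1 = sleeping, 0 = non-sleeping.
clips = [(f"sleeping/clip_{i:03d}.mp4", 1) for i in range(10)]
clips += [(f"non_sleeping/clip_{i:03d}.mp4", 0) for i in range(10)]

train, val, test = split_dataset(clips)
print(len(train), len(val), len(test))
```

Using integer weights avoids rounding surprises from floating-point fractions when computing subset sizes.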

[0046] S2, a convolutional neural network is used to encode the features of the video sequence dataset to encode the image frame features.

[0047] In practice, the ResNet50 network is specifically used as the feature encoding network for the image frame features of the video sequence. Specifically, the ResNet50 network is used to encode the features of the video sequence dataset obtained in step S1 to obtain the image frame features representing a single frame of the video.

[0048] S3, the image frame features are segmented using a video segmentation strategy, and the segment-level features of the current segment are calculated.

[0049] Specifically, Figure 3 illustrates the overall network framework after the segmentation strategy is introduced in the feature fusion stage. Step S3 includes the following steps:

[0050] S31, a video segmentation strategy is used to segment the image frame features before the fully connected layer of the ResNet50 network into multiple segments.
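Step S31 can be pictured as cutting the chronological list of per-frame feature vectors into contiguous segments. The sketch below assumes near-equal segment lengths and uses a hypothetical helper name `split_into_segments`; the text does not state how segment boundaries are chosen, and the toy 2-dimensional vectors stand in for the real pre-FC ResNet50 features.

```python
def split_into_segments(frame_features, num_segments):
    """Split a chronological list of frame feature vectors into contiguous segments.

    Segment sizes differ by at most one frame (an assumption of this sketch).
    """
    n = len(frame_features)
    if not 1 <= num_segments <= n:
        raise ValueError("num_segments must be between 1 and the number of frames")
    base, extra = divmod(n, num_segments)
    segments, start = [], 0
    for k in range(num_segments):
        size = base + (1 if k < extra else 0)  # first `extra` segments get one more frame
        segments.append(frame_features[start:start + size])
        start += size
    return segments

# Toy stand-in for per-frame features: 10 frames, 2 dimensions each.
frames = [[float(t), 0.5 * t] for t in range(10)]
segments = split_into_segments(frames, 3)
print([len(s) for s in segments])
```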

[0051] S32, based on the principle of metric learning, the minimum feature vector, which has the smallest Euclidean distance to the other feature vectors in the current segment, and the maximum feature vector, which has the largest distance to them, are calculated.

[0052] Specifically, the Euclidean distance between vectors is used to represent the similarity between feature vectors. The average distance between the current feature vector in the current segment and the other feature vectors is calculated, and the maximum and minimum feature vectors corresponding to the largest and smallest of these average distances are selected.

[0053] S33, the minimum feature vector and the maximum feature vector are concatenated, and the concatenated feature vector is used as the segment-level feature of the current segment.

[0054] Specifically, the two representative feature vectors selected in step S32 are concatenated, and the concatenated feature vector is used as the segment-level feature of the current segment, and the segment-level feature of the current segment is used as one dimension of the video-level feature.
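One way to read steps S32 and S33 is sketched below: for every feature vector in a segment, compute its mean Euclidean distance to the other vectors, pick the vectors with the smallest and largest mean distance, and concatenate them. This reading, the function names, and the toy data are assumptions for illustration, not the patent's reference implementation.

```python
import math

def mean_distance_to_others(index, segment):
    """Mean Euclidean distance from segment[index] to every other vector in the segment."""
    others = [v for j, v in enumerate(segment) if j != index]
    return sum(math.dist(segment[index], v) for v in others) / len(others)

def segment_level_feature(segment):
    """Concatenate the vectors with the smallest and largest mean distance (S32 + S33)."""
    if len(segment) < 2:
        raise ValueError("a segment needs at least two feature vectors")
    mean_dists = [mean_distance_to_others(i, segment) for i in range(len(segment))]
    i_min = min(range(len(segment)), key=mean_dists.__getitem__)  # most typical frame
    i_max = max(range(len(segment)), key=mean_dists.__getitem__)  # most atypical frame
    return list(segment[i_min]) + list(segment[i_max])

# Toy segment of five 2-dimensional frame features.
segment = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [4.0, 0.0], [10.0, 0.0]]
feature = segment_level_feature(segment)
print(feature)  # [2.0, 0.0, 10.0, 0.0]
```

The segment-level feature has twice the dimension of a frame feature because two vectors are concatenated.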

[0055] S4 stacks the segment-level features of each segment in chronological order to form the video-level features of the video sequence, which serve as the input to the temporal recurrent neural network module with added spatiotemporal nonlocal attention mechanism.

[0056] Specifically, the segment-level features of each segment are stacked in chronological order to form the video-level features of the video sequence, resulting in a two-dimensional tensor that serves as the input to a temporal recurrent neural network (LSTM) module with added spatiotemporal nonlocal attention. Figure 4 illustrates the feature segmentation strategy and the method for calculating the representative features of a segment.
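The stacking in step S4 amounts to placing one segment-level feature per row, in time order, which gives a (number of segments) x (feature dimension) array. A minimal sketch with nested lists follows; the helper names and toy values are hypothetical, and a real pipeline would more likely hold this in a tensor library.

```python
def stack_video_level(segment_features):
    """Stack chronological segment-level features into a 2-D (segments x dims) list."""
    dims = {len(f) for f in segment_features}
    if len(dims) != 1:
        raise ValueError("all segment-level features must have the same length")
    return [list(f) for f in segment_features]  # row order preserves time order

def shape(matrix):
    """Return (rows, columns) of a rectangular nested list."""
    return (len(matrix), len(matrix[0]))

# Three toy segment-level features, each the concatenation of two 2-D vectors.
segment_features = [
    [2.0, 0.0, 10.0, 0.0],
    [1.0, 1.0, 5.0, 3.0],
    [0.5, 0.2, 7.0, 1.0],
]
video_feature = stack_video_level(segment_features)
print(shape(video_feature))  # (3, 4)
```

Each row corresponds to one time step of the recurrent network's input sequence.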

[0057] Figure 5 shows the structure of the spatiotemporal nonlocal attention mechanism module. As in the self-attention mechanism, attention in the spatiotemporal nonlocal attention mechanism is not reflected in the relationship between input and output, but in the interrelationships between elements within the input. Mapped to a video action recognition network, this means establishing the relationships between different image frames in a video sequence. By establishing correlations between video image frames, the spatiotemporal nonlocal attention mechanism improves the model's performance during actual training. Like the self-attention mechanism, the spatiotemporal nonlocal attention mechanism retains the Q, K, and V branches, and its computation typically involves three steps: first, the similarity between the query and each key is calculated to obtain weights; second, the weight matrix is normalized using the softmax function; third, the values are weighted by the corresponding normalized weights and summed to obtain the final attention feature map. The spatiotemporal nonlocal attention mechanism captures long-range dependencies directly by calculating the weights between any two locations, rather than being limited to adjacent frames. It abandons the concept of distance, which solves the problem of network layers needing to transmit information across different distances. The most crucial aspect of the spatiotemporal nonlocal attention mechanism is how feature similarity is calculated and measured. Specifically, a Gaussian function is used to calculate similarity, typically by way of a dot product; using only the dot product reduces computational cost and complexity. The Gaussian function can then be expressed as: [formula not reproduced in this text]. Similarity can also be calculated using the embedded Gaussian function, a simple extension of the Gaussian function that calculates similarity within an embedding space.
The embedding constructs a mapping that projects entities into a linear vector space (the embedding space), where the distance between them is measured. This can be represented as: [formula not reproduced in this text]. The dot product similarity calculation can be expressed as: [formula not reproduced in this text].
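The three similarity formulas referred to in this paragraph are not reproduced in this text. For reference only, the forms commonly given for these three functions in the non-local neural network literature are shown below. Whether the patent uses exactly these expressions is an assumption, and the symbols (x_i and x_j for the features at two positions, W_θ, W_φ, and W_g for learned linear embeddings, C(x) for the normalization factor) are introduced here for illustration.

```latex
% Generic non-local operation: the response at position i aggregates all positions j
y_i = \frac{1}{\mathcal{C}(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j), \qquad g(x_j) = W_g x_j

% Gaussian similarity
f(x_i, x_j) = e^{x_i^{\mathsf{T}} x_j}

% Embedded Gaussian similarity, with \theta(x_i) = W_\theta x_i and \phi(x_j) = W_\phi x_j
f(x_i, x_j) = e^{\theta(x_i)^{\mathsf{T}} \phi(x_j)}

% Dot-product similarity
f(x_i, x_j) = \theta(x_i)^{\mathsf{T}} \phi(x_j)
```

With the embedded Gaussian form and C(x) = Σ_j f(x_i, x_j), the normalized weights are a softmax over j, which matches the three-step query, key, and value computation described above.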

[0058] S5 passes the feature map of the attention mechanism layer through the fully connected layer of the time recurrent neural network module to obtain the final classification result, thus completing the final recognition of sleeping behavior.
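The attention-then-classify path of steps S4 and S5 can be sketched in plain Python as follows. This is a simplified illustration under stated assumptions: the input rows stand in for the recurrent network's per-step outputs, the similarity is the embedded Gaussian form normalized by softmax, the residual connection used in some non-local block designs is omitted, and all weights and names (`nonlocal_attention`, `classify`, `W_theta`, `W_phi`, `W_g`, `W_fc`) are hypothetical toy values rather than trained parameters.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def nonlocal_attention(X, W_theta, W_phi, W_g):
    """Embedded-Gaussian attention over time steps: softmax(Q K^T) V."""
    Q = [matvec(W_theta, x) for x in X]
    K = [matvec(W_phi, x) for x in X]
    V = [matvec(W_g, x) for x in X]
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) for k in K]  # step 1: similarity
        weights = softmax(scores)                               # step 2: normalize
        out.append([sum(w * v[d] for w, v in zip(weights, V))   # step 3: weighted sum
                    for d in range(len(V[0]))])
    return out

def classify(attended, W_fc, b_fc):
    """Flatten the attention feature map and apply a fully connected layer + softmax."""
    flat = [v for row in attended for v in row]
    logits = [sum(w * x for w, x in zip(row, flat)) + b for row, b in zip(W_fc, b_fc)]
    return softmax(logits)

# Toy stand-in for recurrent outputs: 3 time steps, 2 features each.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]  # identity embeddings keep the example easy to follow
attended = nonlocal_attention(X, I2, I2, I2)
probs = classify(attended, W_fc=[[0.1] * 6, [-0.1] * 6], b_fc=[0.0, 0.0])
print(probs)  # two class probabilities (e.g. non-sleeping vs. sleeping)
```

Every time step attends to every other time step regardless of how far apart they are, which is the property the text relies on for long sequences.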

[0059] This invention uses a training set to train a video action recognition network, obtains predicted and true class labels for video actions using the network, and calculates a classification loss value using a loss function. Based on this classification loss value, the network parameters of the video action recognition algorithm are optimized. In one specific embodiment, the process of obtaining the predicted class labels for video actions follows the steps S1 to S4 described above, and will not be elaborated further here.
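The text does not name the loss function. As an illustration only, the sketch below assumes the cross-entropy loss that is commonly used for classification, computed from predicted class probabilities and true labels; the function names and the toy batch are hypothetical.

```python
import math

def cross_entropy(probs, true_label, eps=1e-12):
    """Negative log-probability assigned to the true class (clamped to avoid log(0))."""
    return -math.log(max(probs[true_label], eps))

def mean_loss(batch_probs, batch_labels):
    """Average cross-entropy over a batch of predictions."""
    losses = [cross_entropy(p, y) for p, y in zip(batch_probs, batch_labels)]
    return sum(losses) / len(losses)

# Toy batch: predicted [P(non-sleeping), P(sleeping)] and true labels.
batch_probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
batch_labels = [0, 1, 1]
loss = mean_loss(batch_probs, batch_labels)
print(loss)
```

During training, an optimizer would adjust the network parameters to reduce this value; the third sample, which is misclassified, contributes the largest term.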

[0060] Corresponding to the sleeping behavior recognition method under security monitoring disclosed in the above embodiments, the sleeping behavior recognition system under security monitoring disclosed in this invention specifically includes:

[0061] The dataset creation module is used to create datasets, which includes collecting sleeping and non-sleeping videos under security monitoring, processing the videos, and obtaining the video sequence dataset required by the neural network.

[0062] The image frame feature encoding module is used to perform feature encoding on the video sequence dataset using a convolutional neural network to encode image frame features.

[0063] The segment-level feature calculation module is used to segment the image frame features using a video segmentation strategy and calculate the segment-level features of the current segment.

[0064] The video-level feature acquisition module is used to stack the segment-level features of each segment in chronological order to form the video-level features of the video sequence, which serve as the input to the temporal recurrent neural network module with added spatiotemporal nonlocal attention mechanism;

[0065] The sleep behavior recognition module is used to pass the output feature map of the spatiotemporal non-local attention mechanism through the fully connected layer of the temporal recurrent neural network module to obtain the final classification result and complete the final recognition of sleep behavior.

[0066] The specific implementation principles of each module can be found in the above description, and will not be elaborated here.

[0067] In one embodiment, a computer-readable storage medium is also provided, which stores a computer program that, when executed by a processor, implements the steps of the aforementioned method for recognizing sleeping behavior under security monitoring.

[0068] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided in this application can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

[0069] This invention builds upon the existing video action recognition network (CNN-LSTM) framework by incorporating a segmented feature representation method and a non-local attention mechanism. The segmented feature representation method, compared to image frame-level representation, reduces the distance between the beginning and end of a video sequence to some extent. The non-local attention mechanism captures long-term dependencies in both the spatial and temporal domains, achieving better performance without additional annotations. This invention implements sleep behavior detection in surveillance videos using an improved CNN-LSTM network architecture.

[0070] All aspects, embodiments, features, and examples of this invention are to be regarded as illustrative in all respects and are not intended to limit the invention, the scope of which is defined only by the claims. Other embodiments, modifications, and uses will become apparent to those skilled in the art without departing from the spirit and scope of the invention as claimed.

[0071] The use of headings and sections in this invention is not intended to limit the invention; each section can be applied to any aspect, embodiment or feature of the invention.

Claims

1. A method for recognizing sleeping behavior under security monitoring, characterized in that the method includes:
S1: creating a dataset, which includes collecting sleeping and non-sleeping videos under security monitoring, processing the videos, and obtaining the video sequence dataset required by the neural network;
S2: using a convolutional neural network to perform feature encoding on the video sequence dataset to encode the image frame features;
S3: segmenting the image frame features using a video segmentation strategy, and calculating the segment-level features of the current segment, wherein S3 includes:
S31: segmenting the image frame features before the fully connected layer of the ResNet50 network using a video segmentation strategy, so that the image frame features are divided into multiple segments;
S32: based on the principle of metric learning, calculating the two feature vectors that are most similar to and least similar to the current feature vector in the current segment;
S33: concatenating the two feature vectors, and using the concatenated feature vector as the segment-level feature of the current segment;
wherein, in step S32, the minimum feature vector with the smallest Euclidean distance and the maximum feature vector with the largest distance from other feature vectors in the current segment are calculated; in step S33, the minimum feature vector and the maximum feature vector are concatenated; and in step S32, the average Euclidean distance between the current feature vector and other feature vectors in the current segment is calculated, and the maximum feature vector and the minimum feature vector corresponding to the maximum distance and the minimum distance are selected from the average distances;
S4: stacking the segment-level features of each segment in chronological order to form the video-level features of the video sequence, which are used as the input to the temporal recurrent neural network module with added spatiotemporal nonlocal attention mechanism;
S5: passing the output feature map of the spatiotemporal nonlocal attention mechanism through the fully connected layer of the temporal recurrent neural network module to obtain the final classification result, thus completing the final recognition of sleeping behavior.

2. The method for recognizing sleeping behavior under security monitoring according to claim 1, characterized in that: In step S1, the acquired video is cropped in length and size and randomly divided into training set, validation set and test set.

3. The method for recognizing sleeping behavior under security monitoring according to claim 1, characterized in that: In step S2, the ResNet50 network is used to encode the features of the video sequence dataset to obtain the image frame features represented by a single frame.

4. The method for recognizing sleeping behavior under security monitoring according to claim 1, characterized in that: In S4, the spatiotemporal nonlocal attention mechanism is applied after the output layer of the temporal recurrent neural network and before the fully connected layer, and the spatiotemporal nonlocal attention output feature map is used as the input of the fully connected layer of the entire temporal recurrent neural network.

5. The method for recognizing sleeping behavior under security monitoring according to claim 1, characterized in that: In step S4, the segment-level features of each segment are stacked in chronological order to form the video-level features of the video sequence.

6. A sleeping behavior recognition system under security monitoring, characterized in that the system includes:
a dataset creation module, used to create a dataset, which includes collecting sleeping and non-sleeping videos under security monitoring, processing the videos, and obtaining the video sequence dataset required by the neural network;
an image frame feature encoding module, used to perform feature encoding on the video sequence dataset using a convolutional neural network to encode the image frame features;
a segment-level feature calculation module, used to segment the image frame features using a video segmentation strategy and calculate the segment-level features of the current segment, including:
S31: segmenting the image frame features before the fully connected layer of the ResNet50 network using a video segmentation strategy, so that the image frame features are divided into multiple segments;
S32: based on the principle of metric learning, calculating the two feature vectors that are most similar to and least similar to the current feature vector in the current segment;
S33: concatenating the two feature vectors, and using the concatenated feature vector as the segment-level feature of the current segment;
wherein, in step S32, the minimum feature vector with the smallest Euclidean distance and the maximum feature vector with the largest distance from other feature vectors in the current segment are calculated; in step S33, the minimum feature vector and the maximum feature vector are concatenated; and in step S32, the average Euclidean distance between the current feature vector and other feature vectors in the current segment is calculated, and the maximum feature vector and the minimum feature vector corresponding to the maximum distance and the minimum distance are selected from the average distances;
a video-level feature acquisition module, used to stack the segment-level features of each segment in chronological order to form the video-level features of the video sequence, which serve as the input to the temporal recurrent neural network module with added spatiotemporal nonlocal attention mechanism;
a sleeping behavior recognition module, used to pass the output feature map of the spatiotemporal nonlocal attention mechanism through the fully connected layer of the temporal recurrent neural network module to obtain the final classification result and complete the final recognition of sleeping behavior.

7. A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, it implements the steps of the sleeping behavior recognition method under security monitoring as described in any one of claims 1 to 5.