Method and apparatus for recognizing the atomic behavior of a teacher based on an improved end-to-end network
An improved end-to-end network with a spatial adaptation unit, a time-series adaptation unit, and a parameter freezing unit addresses the accuracy and efficiency problems in recognizing a teacher's atomic actions: it reduces GPU and video memory consumption and improves discrimination of complex temporal patterns.
Patent Information
- Authority / Receiving Office
- JP · JP
- Patent Type
- Patents
- Current Assignee / Owner
- HUAZHONG NORMAL UNIV
- Filing Date
- 2026-03-24
- Publication Date
- 2026-06-25
Abstract
Description
Technical Field
[0001] This application belongs to the field of computer vision technology, and more specifically, relates to a method and device for recognizing a teacher's atomic actions based on an improved end-to-end network.
Background Art
[0002] The recognition of atomic actions (the smallest unit of actions) is an important research direction in the fields of computer vision and artificial intelligence, and is widely applied in various fields such as education. Currently, as a general method for atomic action recognition, (1) it is realized in an end-to-end manner using a single-stage action recognition network. In this realization process, no step division is required, and the spatial features and temporal features are directly extracted from a multi-frame video sequence by the 3D convolutional layer of the model or the "2D convolution + temporal attention" unit. However, when the entire teaching video is directly input, it is affected by the interference of complex backgrounds and misrecognition occurs, resulting in a problem that the accuracy of atomic action recognition decreases; (2) it depends on the conventional end-to-end network and uses Vision Transformer (ViT) as the backbone for training by masking. However, this network requires fine-tuning of all parameters during training and has a high requirement for video memory consumption, which affects the device performance and reduces the efficiency of atomic action recognition. Therefore, the recognition of a teacher's atomic actions by the above methods has low accuracy and efficiency.
Summary of the Invention
[0003] In view of the drawbacks of the prior art, the purpose of this application is to provide a method and device for recognizing a teacher's atomic actions based on an improved end-to-end network, aiming to solve the problem that the accuracy and efficiency of the teacher's atomic action recognition are reduced due to the interference of complex backgrounds, the necessity of fine-tuning all parameters during network training, and the high video memory requirements.
[0004] To achieve the above objective, a first aspect of the present invention provides a teacher atomic action recognition method based on an improved end-to-end network. The method includes: determining time-dimensional video features and spatial-dimensional video features from a target teacher's current lesson video based on an improved target end-to-end network, wherein the improved target end-to-end network includes a spatial adaptation unit, a time-series adaptation unit, and a parameter freezing unit; generating a multi-window feature group from the time-dimensional video features and the spatial-dimensional video features using a multi-time-window attention mechanism; and fusing the multi-window feature group using a frame selection network and recognizing the atomic action of the target teacher based on the fusion result.
[0005] In one embodiment, the step of determining time-dimensional video features and spatial-dimensional video features from the target teacher's current lesson video based on the improved target end-to-end network includes: determining a current image patch from the target teacher's current lesson video based on the preceding normal unit in the improved target end-to-end network; reconstructing the current image patch to obtain a current three-dimensional spatial structure image; using the spatial adaptation unit in the improved target end-to-end network, upsampling the current three-dimensional spatial structure image and extracting local features from the upsampled image to obtain local body features; downsampling the local body features to obtain the spatial-dimensional video features; using the time-series adaptation unit in the improved target end-to-end network, upsampling the spatial-dimensional video features and performing time-series convolution on the three-dimensional features of multiple consecutive upsampled frames to obtain time-series convolution features; and downsampling the time-series convolution features and applying a linear transformation to the downsampled features to obtain the time-dimensional video features.
[0006] In one embodiment, prior to the step of determining time-dimensional video features and spatial-dimensional video features from the target teacher's current lesson video based on the improved target end-to-end network, the method further includes: acquiring a set of historical single-frame images and training an action recognition network based on that set; acquiring a set of historical lesson videos and uniformly trimming them; dividing the uniformly trimmed historical lesson video segments into frames to obtain a set of historical lesson frames; recognizing the set of historical lesson frames with the action recognition network and converting the format of the target action of each recognized keyframe; verifying each format-converted target action; and, if verification succeeds, freezing the parameters of the backbone network with the parameter freezing unit and training the spatial adaptation unit and the time-series adaptation unit based on the set of historical lesson videos and the format-converted target actions, thereby obtaining the improved target end-to-end network.
[0007] In one embodiment, the step of generating a multi-window feature group from the time-dimensional video features and the spatial-dimensional video features using the multi-time-window attention mechanism includes: determining target frame features based on the time-dimensional video features and the spatial-dimensional video features; performing frame-level scoring on the target frame features using a target convolutional layer and normalizing the scores to obtain temporal attention weights; computing a weighted sum of the target frame features using the temporal attention weights to obtain a global feature group; and generating multiple local feature groups from the target frame features using the multi-time-window attention mechanism, and obtaining the multi-window feature group from the global feature group and the multiple local feature groups.
[0008] In one embodiment, the step of fusing the multi-window feature group using the frame selection network and recognizing the atomic action of the target teacher based on the fusion result includes: performing alignment processing on the multi-window feature group; performing region-of-interest detection on the aligned multi-window feature group and determining the features of each region of interest based on the detection results; generating a target group feature list based on the features of each region of interest; and fusing the target group feature list using the frame selection network and recognizing the atomic action of the target teacher based on the fusion result.
[0009] In one embodiment, the step of fusing the target group feature list using the frame selection network and recognizing the atomic action of the target teacher based on the fusion result includes: using a multi-branch prediction module, independently performing classification prediction on each region-of-interest feature in the target group feature list and outputting an action category prediction score for each group via a fully connected layer; fusing the action category prediction scores of each group to obtain an action prediction result stack; obtaining the number of groups in the action prediction result stack; in the prediction stage, selecting along the group dimension of the action prediction result stack, based on the number of groups, to obtain the maximum prediction score for each category; and recognizing the atomic action of the target teacher based on the maximum prediction score of each category.
[0010] In a second aspect, the present invention provides a teacher atomic action recognition device based on an improved end-to-end network. The device includes: a determination module for determining time-dimensional video features and spatial-dimensional video features from a target teacher's current lesson video based on an improved target end-to-end network, wherein the improved target end-to-end network includes a spatial adaptation unit, a time-series adaptation unit, and a parameter freezing unit; a generation module for generating a multi-window feature group from the time-dimensional video features and the spatial-dimensional video features using a multi-time-window attention mechanism; and a recognition module for fusing the multi-window feature group using a frame selection network and recognizing the atomic action of the target teacher based on the fusion result.
[0011] In a third aspect, the present invention provides an electronic device comprising at least one memory and at least one processor, wherein the memory is used to store a program, the processor is used to execute the program stored in the memory, and when the program stored in the memory is executed, the processor performs the method of the first aspect or any embodiment of the first aspect.
[0012] In a fourth aspect, the present invention provides a computer-readable storage medium in which a computer program is stored, and when the computer program is executed on a processor, the processor is made to execute the method described in the first aspect or any embodiment of the first aspect.
[0013] As a fifth aspect, the present application provides a computer program product, which, when executed on a processor, causes the processor to execute the method described in any of the first aspect or the embodiments of the first aspect.
[0014] As can be understood, regarding the beneficial effects according to the second to fifth aspects above, reference can be made to the relevant descriptions in the first aspect above, and repeated explanations are omitted here.
[0015] Overall, the technical solution proposed by the present application has the following beneficial effects compared with the prior art. (1) In the present application, Adapter fine-tuning is introduced, and a spatial adaptation unit (S-Adapter), a temporal adaptation unit (T-Adapter), and a parameter freezing unit are respectively embedded in the end-to-end network. Here, the spatial adaptation unit is used to extract local body features inside the video frame, the temporal adaptation unit is used to enhance the temporal information of adjacent frames and extract temporal convolution features, and the parameter freezing unit is used to freeze the parameters of the backbone network. During network training, only the parameters of the above units are trained, and fine-tuning of all parameters is not required, so GPU consumption can be reduced, video memory requirements can be reduced, and deployment difficulty can be reduced.
[0016] (2) In this application, the frame selection network further focuses on frames containing effective action information, reducing the interference of background redundancy with the model's feature learning. In addition, grouping features by a global window and multiple local windows captures the global time-series context and local key-frame information respectively, compensating for a single window's insufficient adaptation to complex time-series patterns. Furthermore, independent classification predictions are made by the multi-branch prediction module, and the action category prediction scores of each group are output through the fully connected layer. Selecting the maximum value for each category then adaptively picks the contribution of the most discriminative group features for that category, avoids interference from weak features, and improves overall classification accuracy, thereby effectively improving the accuracy of atomic action recognition.
[0017] As described above, based on the improved target end-to-end network, the time-dimensional video features and space-dimensional video features are determined from the current teaching video of the target teacher. The improved target end-to-end network includes a spatial adaptation unit, a time-series adaptation unit, and a parameter freezing unit. The multi-time window attention mechanism generates a multi-window feature group based on the time-dimensional video features and the space-dimensional video features, and the frame selection network fuses the multi-window feature group, and recognizes the atomic operation of the target teacher based on the fusion result. With the above configuration, the spatial adaptation unit and the time-series adaptation unit are embedded in the end-to-end network, extracting the time-dimensional video features and space-dimensional video features at the frame level respectively, and only training the parameters of the above units during network training to reduce GPU consumption. Furthermore, by improving the discrimination ability of complex time-series information in video action recognition through multi-branch feature processing and adaptive fusion, the accuracy and efficiency of atomic operation recognition can be effectively improved, while reducing the video memory requirement and the deployment difficulty.
Brief Description of the Drawings
[0018]
[Figure 1] A flowchart of one embodiment of the teacher atomic action recognition method based on an improved end-to-end network according to an embodiment of the present application.
[Figure 2] A schematic diagram of network training according to an embodiment of the present application.
[Figure 3] A comparison chart of mean average precision (mAP) for the embodiments of the present application.
[Figure 4] A flowchart of another embodiment of the teacher atomic action recognition method based on an improved end-to-end network according to an embodiment of the present application.
[Figure 5] A module configuration diagram of the teacher atomic action recognition device based on an improved end-to-end network according to an embodiment of the present application.
[Figure 6] A diagram showing the configuration of an electronic device according to an embodiment of the present application.
Modes for Carrying Out the Invention
[0019] To further clarify the purpose, technical solution, and advantages of this application, the application will be described in more detail below with reference to the drawings and embodiments. Please understand that the specific embodiments described herein are for illustrative purposes only and do not limit the application.
[0020] In this specification, the terms "and / or" indicate a relationship between related subjects, meaning that three relationships are possible. For example, "A and / or B" includes the cases where only A exists, where A and B exist simultaneously, and where only B exists. In this specification, the symbol " / " indicates that the related subjects are in an "or" relationship; for example, A / B means A or B.
[0021] The terms "first," "second," etc., used in this specification and in the claims are used to distinguish different subjects and do not indicate a specific order of subjects. For example, "first response message" and "second response message" are used to distinguish different response messages and do not indicate a specific order of response messages.
[0022] In the embodiments of this Application, terms such as “exemplary” or “for example” are used for illustrative, demonstrative, or explanatory purposes. Any embodiment or design solution described as “exemplary” or “for example” in the embodiments of this Application should not be construed as being superior or advantageous to other embodiments or design solutions. The use of these terms is intended to specifically illustrate the relevant concepts.
[0023] Based on this, the embodiments of the present application provide a teacher atomic motion recognition method based on an improved end-to-end network. Please refer to Figure 1. Figure 1 is a flowchart of one embodiment of the teacher atomic motion recognition method based on an improved end-to-end network according to the embodiments of the present application. In this embodiment, the teacher atomic motion recognition method based on an improved end-to-end network includes steps S10 to S30. In step S10, time-dimension video features and spatial-dimension video features are determined from the current lesson videos of the target teacher based on the improved target end-to-end network. Here, the improved target end-to-end network includes a spatial adaptation unit, a time-series adaptation unit, and a parameter freezing unit.
[0024] Here, compared with a conventional end-to-end network, the improved target end-to-end network additionally incorporates a spatial adaptation unit (S-Adapter), a time-series adaptation unit (T-Adapter), and a parameter freezing unit. These units are built from hierarchical adaptation modules and can be embedded directly in predetermined layers, for example the first five layers and layers six through eleven. The spatial adaptation unit extracts local body features within video frames, the time-series adaptation unit enhances time-series information from adjacent frames and extracts time-series convolution features, and the parameter freezing unit freezes the parameters of the backbone network. During network training, only the parameters of the above units are trained, which reduces GPU consumption. The core of the improved target end-to-end network is "backbone preservation + hierarchical adaptation unit embedding + parameter freezing." Backbone preservation means retaining the 3D patch embedding layer and the 12-layer Transformer block of the conventional end-to-end network, which may be a VideoMAE network. Since the Transformer block has 12 layers, the spatial adaptation units can be embedded in the first 5 layers and the time-series adaptation units in layers 6 through 11.
[0025] For clarity, the target teacher may be a classroom teacher or a student teacher (teacher trainee), and is not limited to these in this embodiment. In the improved target end-to-end network, after the current lesson video is input, local body features within the video frames are extracted in the initial stage to obtain the spatial-dimensional video features. In the later stage, time-series information of adjacent frames is enhanced and time-series convolution features are extracted to obtain the time-dimensional video features. In a teacher training setting, the current lesson video can be acquired by a camera installed in the room.
[0026] Furthermore, step S10 is preceded by the following steps: namely, the steps of acquiring a set of historical single-frame images and training an action recognition network based on the set of historical single-frame images; acquiring a set of historical lesson videos and performing uniform trimming on the set of historical lesson videos; dividing the uniformly trimmed set of historical lesson video segments into frames to acquire a set of historical lesson frames; recognizing the set of historical lesson frames based on the action recognition network and performing format conversion on the target action of each recognized keyframe; verifying each target action after format conversion; and, if verification is passed, freezing the parameters of the backbone network with a parameter freezing unit, training a spatial adaptation unit and a time-series adaptation unit based on the set of historical lesson videos and each target action after format conversion, and acquiring an improved target end-to-end network.
[0027] For clarity, a historical single-frame image set is a collection of single-frame images collected at past points in time. Taking teacher training as an example, and following the approach of Atomic Visual Actions (AVA), a total of 9316 target keyframe actions were collected from 100 lesson videos. There were a total of 10 types of actions, including explaining, nodding, looking down, walking, standing, standing to the side, writing on the board, frequently waving hands, pointing to a PowerPoint presentation, explaining towards the PowerPoint presentation, and holding a manuscript. For specifics, please refer to Table 1 below. [Table 1]
[0028] To make it clear, on the one hand, after acquiring a set of historical single-frame images, annotation can be performed on each single-frame image within the set, and then an action recognition network can be trained to assist in keyframe annotation. This action recognition network may be a YOLO network. On the other hand, a set of historical lesson videos can also be collected, which may consist of past lesson videos from multiple teacher trainees and teachers. In this case, a series of processes such as uniform trimming and frame splitting are further performed on the historical lesson video set. For example, the video can be uniformly trimmed into 9-minute segments, then action recognition can be performed using the aforementioned trained YOLO network, and the target action of each recognized keyframe can be format-converted. At this time, the format of the target action of the keyframe may be VIA format to facilitate subsequent processing. After that, further sorting and annotation are performed, and based on a multi-object tracking strategy, a unique and consistent label is assigned to the same object in the video sequence, which may be an ID, specifically represented as person_id. Note that both the set of historical single-frame images and the set of historical lesson videos can be obtained directly from the internet.
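The uniform trimming and frame-splitting steps described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names `uniform_segments` and `frame_indices`, the `keep_remainder` flag, and the default of dropping a trailing piece shorter than 9 minutes are all assumptions made for the example.

```python
def uniform_segments(duration_s, segment_s=9 * 60, keep_remainder=False):
    """Return (start, end) offsets in seconds of fixed-length segments.

    Assumption: a trailing piece shorter than segment_s is dropped unless
    keep_remainder is True; the source does not say how it is handled.
    """
    if duration_s < 0 or segment_s <= 0:
        raise ValueError("duration must be >= 0 and segment length > 0")
    segments = []
    start = 0
    while start + segment_s <= duration_s:
        segments.append((start, start + segment_s))
        start += segment_s
    if keep_remainder and start < duration_s:
        segments.append((start, duration_s))
    return segments


def frame_indices(segment, fps):
    """Frame indices covered by one (start, end) segment at a given frame rate."""
    start, end = segment
    return range(int(start * fps), int(end * fps))


# A 45-minute lesson yields five 9-minute segments.
segments = uniform_segments(45 * 60)
frames_in_first = len(frame_indices(segments[0], fps=25))
```

Each segment's frame range can then be passed to the keyframe recognition step; the frame rate of 25 used here is only an example value.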
[0029] Furthermore, it should be emphasized that in this embodiment, since the spatial adaptation unit and time-series adaptation unit are already embedded during the training of the backbone network, only fine-tuning of the parameters of the spatial adaptation unit and time-series adaptation unit is required, eliminating the need for fine-tuning of all parameters, thereby reducing GPU consumption. In addition, as a prerequisite for such fine-tuning, it is necessary to freeze the parameters of the backbone network. For example, please refer to Figure 2. Figure 2 is a schematic diagram of network training, in which frames A and C represent the backbone network portion and show the frozen parameters, and frame B represents the adaptation unit portion and shows the parameters to be trained. During the training process, the parameters of the backbone network are not involved, and only the parameters of the spatial adaptation unit and time-series adaptation unit are trained. The backbone network includes, but is not limited to, patch embedding, position encoding, Transformer attention layer and MLP layer. By training only the parameters of the spatial adaptation unit and time-series adaptation unit, the total number of parameters is significantly reduced compared to the fine-tuning of all parameters in conventional techniques, and video memory consumption is reduced, thereby achieving reduced video memory requirements and reduced deployment difficulty.
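The effect of freezing the backbone and training only the adaptation units can be illustrated with a small parameter inventory. This is a sketch under stated assumptions: the `Param` class, the function names, the bottleneck width of 64, and the rough per-block weight counts are hypothetical and chosen only to show the bookkeeping; only the 12-layer layout (spatial adaptation units in layers 1 to 5, time-series adaptation units in layers 6 to 11) comes from the description.

```python
from dataclasses import dataclass


@dataclass
class Param:
    name: str
    size: int               # number of scalar weights (illustrative)
    trainable: bool = True


def build_params(num_layers=12, hidden=768, bottleneck=64):
    """Hypothetical inventory: backbone blocks plus one adapter per chosen layer."""
    params = [Param("patch_embed", 3 * 2 * 16 * 16 * hidden)]
    for layer in range(1, num_layers + 1):
        # Rough size of the attention and MLP weights of one Transformer block.
        params.append(Param(f"block{layer}.backbone", 12 * hidden * hidden))
        if layer <= 5:
            kind = "s_adapter"      # spatial adaptation unit, layers 1-5
        elif layer <= 11:
            kind = "t_adapter"      # time-series adaptation unit, layers 6-11
        else:
            continue                # layer 12 carries no adapter in this layout
        params.append(Param(f"block{layer}.{kind}", 2 * hidden * bottleneck))
    return params


def freeze_backbone(params):
    """Mark only adapter parameters as trainable; everything else is frozen."""
    for p in params:
        p.trainable = "adapter" in p.name
    return params


def trainable_fraction(params):
    total = sum(p.size for p in params)
    trainable = sum(p.size for p in params if p.trainable)
    return trainable / total


params = freeze_backbone(build_params())
fraction = trainable_fraction(params)
```

With these illustrative sizes, the trainable share is a small fraction of the total, which is the mechanism by which the description expects optimizer state and gradient memory to shrink. In a deep-learning framework the same idea is usually expressed by disabling gradients on the backbone parameters.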
[0030] Furthermore, step S10 includes the following steps: determining the current image patch from the current lesson video of the target teacher based on the previous normal unit in the improved target end-to-end network; reconstructing the current image patch to obtain the current three-dimensional spatial structure image; upsampling the current three-dimensional spatial structure image using a spatial adaptation unit in the improved target end-to-end network, performing local feature extraction on the upsampled three-dimensional spatial structure image to obtain local body features; downsampling the local body features to obtain spatial-dimensional video features; upsampling the spatial-dimensional video features using a time-series adaptation unit in the improved target end-to-end network, performing time-series convolution on the three-dimensional features of the upsampled consecutive frames to obtain time-series convolution features; downsampling the time-series convolution features, performing a linear transformation on the downsampled time-series features to obtain time-dimensional video features.
[0031] For clarity, the current image patch is the image patch that the preceding normal unit in the improved target end-to-end network outputs for the current lesson video; this preceding normal unit may be the preceding Transformer layer. The spatial adaptation unit in the improved target end-to-end network performs upsampling, local feature extraction, and downsampling, with upsampling and downsampling each implemented by two linear layers. Local body features may include fine-grained features such as hand shape and body posture. Specifically,
[Equation image not reproduced: computation of the spatial adaptation unit]
[0032] As can be understood, the time-series adaptation unit in the improved target end-to-end network performs upsampling, time-series convolution, downsampling, and a linear transformation. Specifically,
[Equation image not reproduced: computation of the time-series adaptation unit]
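Because the formula for the time-series adaptation unit is not available in this text, the following is only a sketch of the four named stages (upsampling, time-series convolution, downsampling, linear transformation). It assumes that "upsampling" and "downsampling" are linear projections to a wider and back to a narrower channel dimension, that the time-series convolution is a depthwise 1D convolution along the frame axis with zero padding, and that each frame is represented by a single feature vector. All function names and weights are hypothetical.

```python
def linear(x, w):
    """Compute y = x @ w for one feature vector; w has shape [in_dim][out_dim]."""
    return [sum(xi * wi for xi, wi in zip(x, column)) for column in zip(*w)]


def temporal_conv(frames, kernel):
    """Depthwise 1D convolution along time (cross-correlation form), zero padded."""
    half = len(kernel) // 2
    out = []
    for t in range(len(frames)):
        acc = [0.0] * len(frames[0])
        for k, weight in enumerate(kernel):
            src = t + k - half
            if 0 <= src < len(frames):          # frames outside the clip count as zero
                for c, value in enumerate(frames[src]):
                    acc[c] += weight * value
        out.append(acc)
    return out


def t_adapter(frames, w_up, kernel, w_down, w_out):
    """Upsample -> temporal convolution -> downsample -> linear transformation."""
    up = [linear(f, w_up) for f in frames]
    mixed = temporal_conv(up, kernel)           # mixes information from adjacent frames
    down = [linear(f, w_down) for f in mixed]
    return [linear(f, w_out) for f in down]


# Toy weights: 2 channels projected to 4 and back, identity output transform.
w_up = [[1, 0, 0, 0], [0, 1, 0, 0]]
w_down = [[1, 0], [0, 1], [0, 0], [0, 0]]
w_out = [[1, 0], [0, 1]]
clip = [[1, 0], [2, 0], [3, 0]]
smoothed = t_adapter(clip, w_up, [1, 1, 1], w_down, w_out)
```

With the kernel `[1, 1, 1]`, each output frame sums its own features with those of its immediate neighbours, which is the sense in which information from adjacent frames is enhanced. Whether the actual unit adds a residual connection or a nonlinearity is not stated in this text, so neither is included.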
[0033] In step S20, the multi-time window attention mechanism generates a multi-window feature group based on the time-dimension video features and the spatial-dimension video features.
[0034] To ensure understanding, this embodiment introduces a multi-time window attention mechanism to capture important information across different time scales in current lecture videos. This mechanism differentiates feature groups, generating feature groups corresponding to multiple local time windows, i.e., multi-window feature groups. Features in different groups have different discriminatory capabilities for different behavioral categories; for example, the global window is suitable for long-duration operations, while local windows are suitable for short-duration operations. This compensates for the problem that a single window cannot adequately handle complex time-series patterns.
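The multi-time-window grouping can be sketched as below: a frame-level score is normalized into temporal attention weights, a weighted sum over all frames gives the global feature group, and the same operation restricted to each window gives the local feature groups. This is a hedged illustration: the softmax normalization, the dot-product scorer standing in for the target convolutional layer, the equal-length non-overlapping windows, and all function names are assumptions, since the text does not specify how the local groups are computed.

```python
import math


def softmax(scores):
    """Normalize scores into weights that sum to 1 (numerically stable form)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum(frames, weights):
    """Attention-weighted sum of per-frame feature vectors."""
    dim = len(frames[0])
    return [sum(w * f[c] for w, f in zip(weights, frames)) for c in range(dim)]


def multi_window_groups(frames, score_weights, num_windows=2):
    """Return [global_group, local_group_1, ..., local_group_n]."""
    if len(frames) % num_windows != 0:
        raise ValueError("this sketch assumes equal-length windows")
    # A 1x1 convolution over channels reduces to a dot product per frame.
    scores = [sum(w * v for w, v in zip(score_weights, f)) for f in frames]
    global_group = weighted_sum(frames, softmax(scores))
    size = len(frames) // num_windows
    local_groups = []
    for start in range(0, len(frames), size):
        window = frames[start:start + size]
        local_attention = softmax(scores[start:start + size])
        local_groups.append(weighted_sum(window, local_attention))
    return [global_group] + local_groups


frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
groups = multi_window_groups(frames, score_weights=[10.0, 0.0], num_windows=2)
```

The global group summarizes the whole clip, which suits long-duration actions, while each local group sees only its own window and can therefore retain short-duration actions that a single global average would dilute.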
[0035] In step S30, the multi-window feature group is merged using a frame selection network, and the atomic behavior of the target teacher is recognized based on the fusion result.
[0036] As can be understood, the frame selection network is used to focus, during fusion, on frames that contain valid action information. Compared with conventional techniques, which suffer from interference and misrecognition caused by complex backgrounds, this reduces the interference of background redundancy with the model's feature learning and addresses the drawback that redundant background frames, when all frames are input, interfere with feature extraction and lower recognition accuracy. In this embodiment, the atomic action of the target teacher is recognized based on the fusion result of the multi-window feature group, so the accuracy and efficiency of atomic action recognition can be effectively improved.
[0037] Furthermore, step S30 includes the steps of: performing alignment processing on the multi-window feature group; performing region of interest detection on the processed multi-window feature group and determining the features of the region of interest based on the detection results; generating a target group feature list based on the features of each of the regions of interest; and fusing the target group feature list using a frame selection network and recognizing the atomic behavior of the target teacher based on the fusing results.
[0038] To ensure understanding, in order to achieve feature grouping of Regions of Interest (ROIs), it is necessary to obtain multi-window feature groups, and then perform ROI alignment and region of interest detection on each window feature group to extract the features of the region of interest. The region of interest may be the bounding box of the human body, in which case a target group feature list can be generated based on the features of each region of interest. This target group feature list is represented as group_roi_feats_list, and the shape of each element is [R, M, 1, Ph, Pw], where R represents the number of ROIs, M represents the number of feature channels, Ph represents the height after pooling, and Pw represents the width after pooling.
[0039] Furthermore, the step of fusing the target group feature list using the frame selection network and recognizing the atomic action of the target teacher based on the fusion result includes: using a multi-branch prediction module, independently performing classification prediction on each feature of interest in the target group feature list and outputting the action category prediction scores of each group through a fully connected layer; fusing the action category prediction scores of each group to obtain an action prediction result stack; obtaining the number of groups in the action prediction result stack; in the prediction stage, selecting from the action prediction result stack based on the number of groups to obtain the maximum prediction score of each category; and recognizing the atomic action of the target teacher based on the maximum prediction score of each category.
[0040] As can be understood, in order to enhance the contribution of highly discriminative features, avoid interference from weak features, and improve overall classification accuracy, the present application further introduces a multi-branch prediction module that performs independent classification prediction on each feature of interest in the target group feature list. The multi-branch prediction module may be a multi-group bounding box head (MultiGroupBBoxHead). The action category prediction scores of each group are then output through a fully connected layer, and an action prediction result stack is obtained through fusion processing. The action prediction result stack may be a logits_stack with a shape of [G, R, C], where G represents the number of groups, R the number of ROIs, and C the number of categories.
[0041] Furthermore, in the prediction stage the action prediction result stack is processed along the group dimension: for each category c (0 ≤ c < C) and each ROI r (0 ≤ r < R), the maximum score of that category across all groups is selected as the final prediction result. Obtaining the maximum score for each category can be expressed as follows.
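$$\hat{s}_{r,c} = \max_{0 \le g < G} \text{logits\_stack}[g, r, c], \qquad 0 \le r < R,\ 0 \le c < C$$
where $\hat{s}_{r,c}$ denotes the final predicted score of category $c$ for ROI $r$.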
[0042] The above formula means that, for each region of interest and each action category, the maximum value is selected from the predicted scores of all feature groups and used as the final predicted score for that category in that region. This embodiment reduces confusion between similar action categories by reinforcing the category-specific optimal features of categories that are prone to confusion. Differences between similar actions often appear only in local features; for example, the "dribbling action" in "basketball" depends on one specific group of local features, while the "receiving action" in "volleyball" depends on a different group. Taking the maximum per category can capture such subtle differences, whereas conventional averaging and weighted fusion blur category-specific features and capture the differences poorly. Therefore, recognizing the atomic actions of the target teacher based on the maximum predicted score of each category can effectively improve the accuracy and efficiency of atomic action recognition.
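The per-category maximum over groups can be sketched in a few lines. The nested-list layout mirrors the logits_stack shape [G, R, C] given in the description; the function names and the toy scores are hypothetical, and `mean_over_groups` is included only to contrast with the averaging fusion that the description argues against.

```python
def max_over_groups(logits_stack):
    """final[r][c] = max over g of logits_stack[g][r][c]."""
    num_rois = len(logits_stack[0])
    num_classes = len(logits_stack[0][0])
    return [
        [max(group[r][c] for group in logits_stack) for c in range(num_classes)]
        for r in range(num_rois)
    ]


def mean_over_groups(logits_stack):
    """Averaging fusion over groups, shown for comparison only."""
    num_groups = len(logits_stack)
    num_rois = len(logits_stack[0])
    num_classes = len(logits_stack[0][0])
    return [
        [sum(group[r][c] for group in logits_stack) / num_groups
         for c in range(num_classes)]
        for r in range(num_rois)
    ]


# G = 2 groups, R = 1 ROI, C = 2 categories.
# Group 0 is confident about category 0, group 1 about category 1.
logits_stack = [
    [[0.9, 0.1]],
    [[0.2, 0.8]],
]
final_max = max_over_groups(logits_stack)
final_mean = mean_over_groups(logits_stack)
```

In this toy case the maximum keeps each category's strongest group score (0.9 and 0.8), while averaging pulls both toward the middle (about 0.55 and 0.45), which illustrates how averaging can blur category-specific evidence.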
[0043] To illustrate, refer to Figure 3, which is a comparison chart of the mean average precision (mAP) of a conventional end-to-end network and the improved target end-to-end network. When both networks use ViT-B as the backbone and both use additional labels, the mAP of the conventional end-to-end network is 31.8, while the mAP of the improved target end-to-end network is 35.4. When both networks use ViT-L as the backbone and both use additional labels, the mAP of the conventional end-to-end network is 37.0, while the mAP of the improved target end-to-end network is 39.1. In both cases, the mAP of the conventional end-to-end network is lower than that of the improved target end-to-end network in this embodiment, by 3.6 and 2.1 points respectively. In other words, in this embodiment, by using an end-to-end network with embedded spatial adaptation units, time-series adaptation units, and parameter freezing units, the mAP is improved, and the improved target end-to-end network achieves superior performance.
[0044] In this embodiment, based on an improved target end-to-end network, time-dimensional video features and spatial-dimensional video features are determined from the current lesson video of the target teacher. The improved target end-to-end network includes a spatial adaptation unit, a time-series adaptation unit, and a parameter freezing unit. A multi-time window attention mechanism generates multi-window feature groups based on the time-dimensional video features and spatial-dimensional video features. A frame selection network fuses the multi-window feature groups, and the atomic behavior of the target teacher is recognized based on the fusion result. With this configuration, a spatial adaptation unit and a time-series adaptation unit are embedded within the end-to-end network, extracting frame-level time-dimensional video features and spatial-dimensional video features, respectively. During network training, only the parameters of these units are trained, reducing GPU consumption. Furthermore, multi-branch feature processing and self-adaptive fusion improve the ability to distinguish complex time-series information in video behavior recognition, effectively improving the accuracy and efficiency of atomic behavior recognition, reducing video memory requirements, and enhancing the user experience.
[0045] In one embodiment, the present application provides a step for determining a multi-window feature group. See Figure 4. Figure 4 is a flowchart of another embodiment of a teacher atomic behavior recognition method based on an improved end-to-end network according to an embodiment of the present application. Step S20 comprises steps S201 to S204. In step S201, the target frame features are determined based on the time-dimension video features and the spatial-dimension video features.
[0046] Here, the shapes of the time-dimension video features and spatial-dimension video features are [B, M, T, H, W], where B represents the batch size, M represents the number of feature channels, T represents the number of time frames, H represents the height, and W represents the width. In this case, the spatial dimension can be compressed by global average pooling and the target frame features can be obtained, specifically as follows.
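$$f[b, m, t] = \frac{1}{H \cdot W} \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} F[b, m, t, h, w]$$
Here, F denotes the input video features of shape [B, M, T, H, W], and f denotes the target frame features of shape [B, M, T].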
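As a rough illustration of this spatial compression, the following plain-Python sketch averages each frame over its height and width. It assumes the features are stored as nested lists of shape [B, M, T, H, W]; the function name `spatial_global_average_pool` and the sample values are hypothetical.

```python
def spatial_global_average_pool(features):
    """Map nested lists shaped [B, M, T, H, W] to [B, M, T] by averaging over H and W."""
    pooled = []
    for sample in features:          # iterate over the batch dimension B
        pooled_sample = []
        for channel in sample:       # iterate over the feature channels M
            pooled_channel = []
            for frame in channel:    # iterate over the time frames T; each frame is H x W
                total = sum(sum(row) for row in frame)
                count = sum(len(row) for row in frame)
                pooled_channel.append(total / count)
            pooled_sample.append(pooled_channel)
        pooled.append(pooled_sample)
    return pooled


# Toy input (made-up values): B = 1, M = 1, T = 2, H = 2, W = 2.
features = [[[
    [[1.0, 2.0], [3.0, 4.0]],  # frame 0
    [[0.0, 0.0], [2.0, 2.0]],  # frame 1
]]]

frame_features = spatial_global_average_pool(features)
```

Each frame is reduced to one value per channel, so the result keeps only the batch, channel, and time dimensions.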
[0047] In step S202, the target convolutional layer performs frame-level scoring on the target frame features, and the scoring results are normalized to obtain temporal attention weights.
[0048] As can be understood, after acquiring the target frame features, frame-level scoring can be performed on those target frame features using a target convolutional layer. This target convolutional layer may be a 1D convolutional layer. In this case, the score can be converted into a temporal attention weight through normalization, meaning that the higher the score of a frame, the larger the temporal attention weight, thus realizing "dynamic focus". The sum of the temporal attention weights for all frames may be 1.
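A minimal sketch of frame-level scoring followed by normalization is given below. It assumes a single sample with frame features of shape [M, T], a 1D convolution with one output channel, an odd kernel size, and zero padding, and it assumes softmax as the normalization; none of these specifics is stated in the application, and the names `conv1d_frame_scores` and `softmax` are illustrative.

```python
import math


def conv1d_frame_scores(frame_features, kernel, bias=0.0):
    """Score each frame with a 1D convolution over time.

    frame_features: nested list [M][T]; kernel: nested list [M][K] with K odd.
    Zero padding keeps one score per frame.
    """
    num_channels = len(frame_features)
    num_frames = len(frame_features[0])
    kernel_size = len(kernel[0])
    pad = kernel_size // 2
    scores = []
    for t in range(num_frames):
        score = bias
        for m in range(num_channels):
            for k in range(kernel_size):
                src = t + k - pad
                if 0 <= src < num_frames:  # positions outside the clip count as zero
                    score += kernel[m][k] * frame_features[m][src]
        scores.append(score)
    return scores


def softmax(scores):
    """Normalize scores into non-negative weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy input (made-up values): M = 2 channels, T = 4 frames.
frame_features = [[0.0, 1.0, 3.0, 1.0],
                  [0.0, 1.0, 1.0, 0.0]]
kernel = [[0.0, 1.0, 0.0],
          [0.0, 1.0, 0.0]]  # hypothetical weights: sum the channels at each frame

scores = conv1d_frame_scores(frame_features, kernel)
weights = softmax(scores)
```

With these weights, the frame with the highest score receives the largest temporal attention weight, and the weights over all frames sum to 1.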
[0049] In step S203, a weighted sum is performed on the target frame features based on the temporal attention weights to obtain a global feature group.
[0050] As can be understood, global time-weighted pooling can be achieved by obtaining temporal attention weights and then performing a weighted sum on the target frame features based on those temporal attention weights.
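The global time-weighted pooling can be sketched as a weighted sum over the time axis. The example assumes one sample with frame features of shape [M, T] and temporal attention weights of length T that already sum to 1; the function name `temporal_weighted_sum` and the values are hypothetical.

```python
def temporal_weighted_sum(frame_features, weights):
    """Reduce [M][T] frame features to a length-M vector using per-frame weights."""
    return [
        sum(w * x for w, x in zip(weights, channel))
        for channel in frame_features
    ]


# Toy input (made-up values): M = 2 channels, T = 4 frames.
frame_features = [[1.0, 2.0, 3.0, 4.0],
                  [0.0, 0.0, 8.0, 0.0]]
weights = [0.25, 0.25, 0.5, 0.0]  # temporal attention weights, summing to 1

global_feature = temporal_weighted_sum(frame_features, weights)
```

Frames with larger weights dominate the pooled vector, and a frame with zero weight contributes nothing.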
[0051] In step S204, the multi-time window attention mechanism generates multiple local feature groups based on the target frame features, and obtains a multi-window feature group based on the global feature group and the multiple local feature groups.
[0052] As can be understood, after acquiring the target frame features, this embodiment further sets multi-window parameters using the multi-time window attention mechanism and introduces a Gaussian prior distribution into the calculation of the attention weights. Specifically, multiple local feature groups are generated by centering each window on the intermediate frame of the video and associating a different Gaussian variance with each window size. For example, the window sizes may be 2, 4, and 6, measured in time steps.
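One possible reading of this step is sketched below. The sketch assumes that each window is centered on the temporal midpoint of the clip, that frames outside the window receive zero weight, and that the Gaussian standard deviation is half the window size; these choices, and the names `gaussian_window_weights` and `local_feature_groups`, are assumptions made for illustration rather than details given in the application.

```python
import math


def gaussian_window_weights(num_frames, window, sigma):
    """Gaussian weights centered on the clip midpoint, zero outside the window, summing to 1."""
    center = (num_frames - 1) / 2.0
    raw = []
    for t in range(num_frames):
        if abs(t - center) <= window / 2.0:
            raw.append(math.exp(-((t - center) ** 2) / (2.0 * sigma ** 2)))
        else:
            raw.append(0.0)
    total = sum(raw)
    return [r / total for r in raw]


def local_feature_groups(frame_features, windows=(2, 4, 6)):
    """Build one pooled feature vector (length M) per window size from [M][T] features."""
    num_frames = len(frame_features[0])
    groups = []
    for window in windows:
        # Assumption: the Gaussian spread grows with the window size.
        weights = gaussian_window_weights(num_frames, window, sigma=window / 2.0)
        groups.append([
            sum(w * x for w, x in zip(weights, channel))
            for channel in frame_features
        ])
    return groups


# Toy input (made-up values): M = 1 channel, T = 8 frames.
frame_features = [[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]]
local_groups = local_feature_groups(frame_features)

# The multi-window feature group combines the global group with the local groups.
global_feature = [3.0]  # placeholder for the globally pooled feature
multi_window_group = [global_feature] + local_groups
```

With three window sizes, the multi-window feature group in this sketch holds four groups: one global and three local.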
[0053] In this embodiment, target frame features are determined based on the time-dimensional video features and the spatial-dimensional video features. A target convolutional layer performs frame-level scoring on the target frame features. The scoring results are normalized to obtain temporal attention weights. A weighted sum is performed on the target frame features based on the temporal attention weights to obtain a global feature group. Furthermore, a multi-time window attention mechanism generates multiple local feature groups based on the target frame features. A multi-window feature group is obtained based on the global feature group and the multiple local feature groups. With the above configuration, after determining target frame features based on time-dimensional video features and spatial-dimensional video features, a global feature group is determined on the one hand using a global time-weighted pooling method, and multiple local feature groups are generated on the other hand based on the target frame features. Subsequently, the global feature group and the multiple local feature groups are combined, and feature grouping is performed for the global window and multiple local windows, respectively. This captures the global time-series context and local keyframe information, thereby effectively improving the accuracy of obtaining the multi-window feature group.
[0054] The following describes the teacher atomic behavior recognition device based on an improved end-to-end network provided by the present application. The device described below corresponds to the teacher atomic behavior recognition method based on an improved end-to-end network described above, and the two descriptions may be read with reference to each other. Please refer to Figure 5. Figure 5 is a modular configuration diagram of the teacher atomic behavior recognition device based on an improved end-to-end network according to an embodiment of the present application; the device includes a determination module T10, a generation module T20, and a recognition module T30.
[0055] The determination module T10 is used to determine time-dimensional video features and spatial-dimensional video features from a target teacher's current lesson video based on an improved target end-to-end network, the improved target end-to-end network including a spatial adaptation unit, a time-series adaptation unit, and a parameter freezing unit.
[0056] The generation module T20 is used to generate multi-window feature groups based on the time-dimensional video features and the spatial-dimensional video features using a multi-time window attention mechanism.
[0057] The recognition module T30 is used to fuse the multi-window feature groups using a frame selection network and to recognize the atomic behavior of the target teacher based on the fusion result.
[0058] In this embodiment, based on an improved target end-to-end network, time-dimensional video features and spatial-dimensional video features are determined from the target teacher's current lesson video. The improved target end-to-end network includes a spatial adaptation unit, a time-series adaptation unit, and a parameter freezing unit. A multi-time window attention mechanism generates multi-window feature groups based on the time-dimensional and spatial-dimensional video features. A frame selection network fuses the multi-window feature groups, and the target teacher's atomic behavior is recognized based on the fusion result. With this configuration, a spatial adaptation unit and a time-series adaptation unit are embedded within the end-to-end network, extracting frame-level time-dimensional and spatial-dimensional video features, respectively. During network training, only the parameters of these units are trained to reduce GPU consumption. Subsequently, multi-branch feature processing and self-adaptive fusion improve the ability to distinguish complex time-series information in video behavior recognition, thereby effectively improving the accuracy and efficiency of atomic behavior recognition, reducing video memory requirements, and enhancing the user experience.
[0059] For clarity, detailed explanations of the functional implementation of each module described above can be found in the previous method examples; therefore, repeated explanations are omitted here.
[0060] As can be understood, the apparatus described above is used to carry out the method in the above-described embodiment, and the implementation principle and technical effect of the corresponding program module within the apparatus are similar to those described in the above-described method. The operating process of the apparatus can be described by referring to the corresponding process in the above-described method, so a repeated explanation is omitted here.
[0061] Based on the method described in the above-described embodiment, the embodiment of the present application provides an electronic device. Please refer to Figure 6. Figure 6 is a configuration diagram of the electronic device according to the embodiment of the present application.
[0062] The electronic device may include a processor 10, a communications interface 20, a memory 30, and a communications bus 40, and the processor 10, the communications interface 20, and the memory 30 communicate with each other via the communications bus 40. The processor 10 can execute the method in the above embodiment by calling and executing logical instructions in the memory 30.
[0063] Furthermore, the logical instructions in the memory 30 may be implemented in the form of a software function unit and, if sold or used as an independent product, may be stored on a computer-readable storage medium. Based on this understanding, the essence of the present invention, i.e., its contribution to the prior art, or a part of the present invention, may be embodied in the form of a software product. This computer software product may be stored on a storage medium, contain a number of instructions, and be used to cause a computer device (e.g., a personal computer, server, or network device) to perform all or part of the steps of the methods described in each embodiment of the present invention.
[0064] Based on the method described in the above-described embodiment, the embodiment of the present application provides a computer-readable storage medium. A computer program is stored in the computer-readable storage medium, and when the computer program is executed on a processor, the processor is made to execute the method described in the above-described embodiment.
[0065] Based on the method described in the above-described embodiment, the embodiment of the present application provides a computer program product. When the computer program product is executed on a processor, the processor is made to execute the method described in the above-described embodiment.
[0066] As can be understood, the processor in the embodiments of the present application may be a central processing unit, or it may be any other general-purpose processor, digital signal processor, application-specific integrated circuit, field-programmable gate array or other programmable logic device, transistor logic device, hardware component or any combination thereof. The general-purpose processor may be a microprocessor or any conventional processor.
[0067] The method steps in the embodiments of this application may be implemented by hardware or by a processor executing software instructions. The software instructions consist of corresponding software modules, which may be stored in random access memory, flash memory, read-only memory, programmable read-only memory, erasable programmable read-only memory, electrically erasable programmable read-only memory, registers, hard disks, removable hard disks, or any other form of storage medium well known to those skilled in the art. For example, the storage medium may be coupled to the processor, allowing the processor to read and write information to it. Of course, the storage medium may also be part of the processor's components.
[0068] As should be understood, the various reference numerals used in the embodiments of this Application are merely distinctions for explanatory purposes and do not limit the scope of the embodiments of this Application. As will be readily apparent to those skilled in the art, the foregoing is merely a preferred embodiment of this Application and does not limit it, and any modifications, equivalent substitutions, and improvements made within the spirit and principles of this Application are included within the scope of protection of this Application.
Claims
1. A method for recognizing the atomic behavior of a teacher based on an improved end-to-end network, the method being performed by a computer and comprising: a step of determining time-dimensional video features and spatial-dimensional video features from a target teacher's current lesson video based on an improved target end-to-end network, wherein the improved target end-to-end network includes a spatial adaptation unit, a time-series adaptation unit, and a parameter freezing unit; a step of generating a multi-window feature group based on the time-dimensional video features and the spatial-dimensional video features using a multi-time window attention mechanism; and a step of fusing the multi-window feature group using a frame selection network and recognizing the atomic behavior of the target teacher based on the fusion result; wherein the step of generating the multi-window feature group based on the time-dimensional video features and the spatial-dimensional video features using the multi-time window attention mechanism includes: a step of determining target frame features based on the time-dimensional video features and the spatial-dimensional video features; a step of performing frame-level scoring on the target frame features using a target convolutional layer and normalizing the scoring results to obtain temporal attention weights; a step of obtaining a global feature group by performing a weighted sum on the target frame features based on the temporal attention weights; and a step of generating, by the multi-time window attention mechanism, multiple local feature groups based on the target frame features, and obtaining the multi-window feature group based on the global feature group and the multiple local feature groups; and wherein the step of determining the time-dimensional video features and the spatial-dimensional video features from the target teacher's current lesson video based on the improved target end-to-end network includes: a step of determining a current image patch from the target teacher's current lesson video based on the previous normal unit in the improved target end-to-end network; a step of reconstructing the current image patch to obtain a current three-dimensional spatial structure image; a step of upsampling the current three-dimensional spatial structure image using the spatial adaptation unit in the improved target end-to-end network, and extracting local features from the upsampled three-dimensional spatial structure image to obtain local body features; a step of downsampling the local body features to obtain the spatial-dimensional video features; a step of upsampling the spatial-dimensional video features using the time-series adaptation unit in the improved target end-to-end network, and performing time-series convolution on the three-dimensional features of the series of upsampled frames to obtain time-series convolutional features; and a step of downsampling the time-series convolutional features and performing a linear transformation on the downsampled time-series features to obtain the time-dimensional video features.
2. The method according to claim 1, further comprising, before the step of determining the time-dimensional video features and the spatial-dimensional video features from the target teacher's current lesson video based on the improved target end-to-end network: a step of acquiring a set of historical single-frame images and training an action recognition network based on the set of historical single-frame images; a step of obtaining a set of historical lesson videos and uniformly trimming the set of historical lesson videos; a step of obtaining historical lesson frame sets by dividing the uniformly trimmed historical lesson video segments into frames; a step of recognizing the historical lesson frame sets based on the action recognition network and performing format conversion on the target action of each recognized keyframe; a step of verifying each target action after format conversion; and a step of, if the verification is successful, freezing the parameters of the backbone network by the parameter freezing unit and training the spatial adaptation unit and the time-series adaptation unit based on the set of historical lesson videos and the target actions after format conversion, to obtain the improved target end-to-end network.
3. The method according to claim 1, wherein the step of fusing the multi-window feature group using the frame selection network and recognizing the atomic behavior of the target teacher based on the fusion result includes: a step of performing alignment processing on the multi-window feature group; a step of performing region-of-interest detection on the processed multi-window feature group and determining region-of-interest features based on the detection results; a step of generating a target group feature list based on each of the region-of-interest features; and a step of fusing the target group feature list using the frame selection network and recognizing the atomic behavior of the target teacher based on the fusion result.
4. The method according to claim 3, wherein the step of fusing the target group feature list using the frame selection network and recognizing the atomic behavior of the target teacher based on the fusion result includes: a step of performing classification prediction independently on each region-of-interest feature in the target group feature list based on a multi-branch prediction module, and outputting an action category prediction score for each group via a fully connected layer; a step of fusing the action category prediction scores of each group to obtain an action prediction result stack; a step of obtaining the number of groups in the action prediction result stack; a step of, in the prediction stage, selecting from the action prediction result stack based on the number of groups to obtain the maximum prediction score for each category; and a step of recognizing the atomic behavior of the target teacher based on the maximum prediction score of each category.
5. An electronic device comprising: at least one memory for storing a computer program; and at least one processor for executing the program stored in the memory, wherein, when the program is executed, the processor performs the method according to any one of claims 1 to 4.
6. A computer-readable storage medium in which a computer program is stored, wherein, when the computer program is executed on a processor, the processor is caused to execute the method according to any one of claims 1 to 4.
7. A computer program which, when executed on a processor, causes the processor to execute the method according to any one of claims 1 to 4.