A method for monitoring and evaluating attention based on knowledge transmission rhythm
By capturing video with multiple cameras, constructing a rhythm spectrum, and using image recognition to monitor the head postures of trainers and trainees, the method avoids the high cost of wearable devices and their impact on the training experience, achieving low-cost and accurate attention monitoring.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- HEFEI UNIV OF TECH
- Filing Date
- 2024-04-07
- Publication Date
- 2026-06-19
Figure CN118379769B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer vision, specifically to an attention monitoring and evaluation method based on the rhythm of knowledge transfer.
Background Technology
[0002] Classroom attention monitoring and assessment help trainers understand the trainees' level of focus during training, enabling timely adjustments to training programs to improve effectiveness. Currently, attention monitoring is primarily achieved through the use of wearable devices, eye trackers, and electroencephalography (EEG). For example, EEG records brain activity, eye trackers measure fixation position and eye movement trajectories, and indicators such as completion of behavioral tasks are used to infer attention levels and assess attention.
[0003] Since the above-mentioned assessment methods rely on wearable devices to monitor each trainee, the research cost for group attention monitoring is too high, and wearable devices may affect the experience during normal training, causing distraction and affecting the accuracy of experimental data; therefore, the present invention aims to provide a low-cost attention monitoring and assessment method that can accurately reflect the overall attention changes in the training classroom.
Summary of the Invention
[0004] To address the shortcomings of existing technologies, this invention aims to provide an attention monitoring and evaluation method based on the rhythm of knowledge transfer. This method utilizes multiple cameras to collect data, constructs a rhythm spectrum to depict the posture of trainees in a state of focused attention, and combines image recognition technology to obtain changes in overall attention. Compared to wearable device monitoring, this method is low-cost, and the impact of the cameras on trainees is minimal (almost negligible), thus the obtained data more closely reflects the level of normal training.
[0005] To achieve the above objectives, the present invention provides the following technical solution:
[0006] An attention monitoring and evaluation method based on the rhythm of knowledge transfer includes the following steps:
[0007] (1) Use cameras to collect video data on the performance of trainers and trainees throughout the training process;
[0008] (2) Preprocess the video data to ensure that the video of the trainee is fully aligned with the video of the trainer and contains all the teaching points of the entire course.
[0009] (3) Perform facial key position recognition on the trainers and trainees in each frame of the preprocessed video, and determine the head posture of the trainers and trainees based on the spatial relationship of the identified facial key positions.
[0010] (4) Based on the head postures that trainees are expected to exhibit according to the trainer’s head posture, a rhythm spectrum is constructed to assess the trainees’ attention status.
[0011] (5) Determine whether the relative relationship between the postures of the trainees and the trainers matches the rhythm spectrum setting, and use the attention concentration rate to draw the classroom attention change curve.
[0012] In step (1) of this invention, during the video data acquisition process, multiple cameras are used to collect data in different areas of the training venue to ensure that the recording process does not cause facial obstruction.
[0013] The head postures of the trainers in this invention include looking down, looking straight ahead, and turning to the side towards the multimedia; the head postures of the trainees include looking at the trainers, looking at the multimedia area, and other inattentive postures.
[0014] The present invention identifies the head posture of trainers as follows: First, the head-down angle measured while the trainer is looking straight ahead is used as the baseline range; second, the key point coordinates of the chin and nose are obtained through image recognition technology, and the absolute values of the vertical and horizontal distances between the chin and nose are calculated; finally, the current head-down angle of the trainer is calculated using the arctangent function; when the head-down angle is within the baseline range, the trainer is considered to be looking straight ahead; when the head-down angle is not within the baseline range, the trainer is considered to be in a head-down state; if not all key points of the trainer's face can be identified, the trainer is considered to be turned sideways towards the multimedia.
[0015] The present invention identifies the head posture of trainees as follows: First, any state in which the trainee is looking at the trainer or the multimedia is selected as the baseline state, and the threshold range of the absolute value of the difference between the horizontal coordinate of the midline position of the facial key points and the horizontal coordinate of the chin key point is determined in the baseline state; second, facial recognition is performed on each trainee to obtain the coordinates of their facial key points, and the absolute value of the difference between the horizontal coordinate of the midline position of the facial key points and the horizontal coordinate of the chin key point is calculated; finally, a trainee whose facial key points cannot all be identified is considered to be in another, inattentive state; when the absolute value of the difference is within the baseline-state threshold range, the trainee is in the baseline state; when it is not within that range, the trainee is in the other state.
[0016] In the posture recognition process, this invention converts the frame state data of the trainee into corresponding second state data: the frame states belonging to each second are grouped together, and the state value that appears most frequently in the current frame state group is taken as the trainee's state in the current second.
[0017] In this invention, the distribution of the areas in the rhythm spectrum that the trainer expects the trainee's attention to follow is externalized as a correspondence between the trainee's head posture and the trainer's head posture: when the trainer is looking down, the trainee needs to focus their attention on the multimedia area, which is manifested by the trainee's head posture facing the multimedia area; when the trainer is looking straight ahead, the trainee focuses their attention on the trainer's area, which is manifested by the trainee's head posture facing the trainer; when the trainer is turned to the side facing the multimedia and explaining the content on the multimedia, the trainee needs to focus their attention on the multimedia area, which is manifested by the trainee's head posture facing the multimedia area.
[0018] Compared with the prior art, the beneficial effects of the present invention are:
[0019] 1. This invention uses face detection and head posture recognition to obtain the state of the trainer and the trainee respectively, and uses the constructed rhythm spectrum to determine the trainee's attention concentration rate. It is simple and convenient to operate.
[0020] 2. This invention converts the frame state data of trainees into corresponding second state data, and takes the state value that appears most frequently in the current frame state group as the state of the trainee in the current second. This reduces the randomness of frame state recognition to a certain extent, making the state recognition of trainees more accurate and more in line with the actual research needs.
Attached Figure Description
[0021] Figure 1 is a schematic diagram of the key facial features in this invention.
[0022] Figure 2 is a flowchart of the face detection and status recognition process of the present invention.
[0023] Figure 3 is a status diagram of the trainees in this invention.
[0024] Figure 4 is the classroom rhythm spectrum.
[0025] Figure 5 is a schematic diagram of the attention change curve in this invention.
Detailed Implementation
[0026] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0027] This invention discloses an attention monitoring and evaluation method based on the rhythm of knowledge transfer, as detailed below:
[0028] (I) Acquiring video data
[0029] The experimental equipment required for this invention mainly consists of three cameras and one video camera. To minimize the interference of camera placement on the experiment, the three cameras are mounted above the classroom to acquire video data from trainees at different positions; the video camera is placed at the back of the classroom to acquire video data from the trainers, with the camera facing the trainees. Before the experiment begins, the experimenters arrive at the classroom in advance to test the multimedia facilities and ensure they can play normally. The cameras are positioned in places that are not easily noticed by the trainees to record their true state. When the trainees arrive at the classroom, their seats are adjusted appropriately to ensure that their faces are not obstructed. Once everything is ready, the experiment begins.
[0030] To ensure the accuracy of the experimental data and prevent facial obstruction during recording, the classroom was divided into three areas based on the seating arrangement of the trainees. Each area was monitored and recorded by a different camera, which improved the reliability and accuracy of the experimental results to some extent. In the specific implementation, the monitoring cameras in each area were set at different angles according to their different spatial distribution. In the classroom, the trainees were positioned directly in front of the podium.
[0031] Based on the characteristics of the training content and the speaker's pace, this invention divides the podium area into a speaker area and a multimedia area.
[0032] (II) Video preprocessing
[0033] Data synchronization and calibration are required between multiple recording devices to ensure that the captured video data can be correctly mapped to the same time and spatial point. First, the video data collected by the three cameras is processed separately, including editing and audio-visual alignment, to ensure complete alignment with the trainees' videos. Preprocessing of the recorded videos ensures their length matches the standard classroom time and that all teaching points of the entire lesson are completed within the specified time frame. During video processing, due to the excessive length and frame rate of the recorded videos, this invention considers selecting the most reasonable algorithm based on multiple evaluation criteria, including convenience, efficiency, and accuracy.
[0034] In analyzing training videos, this invention uses frame-by-frame processing. In the specific implementation, the pre-processed video is imported into the detection model, and the `Video_get` function is used to obtain the current video's frame information. First, a `cv2.VideoCapture` object is created to open the video file or capture the video stream in real time. Then, the `get` method of the `cv2.VideoCapture` object is used to obtain the relevant video parameters, where the parameter `cv2.CAP_PROP_FRAME_COUNT` represents the total number of frames in the video. The number of images to be detected per second is the frame rate, which OpenCV exposes through the separate parameter `cv2.CAP_PROP_FPS`.
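The parameter lookup described in this paragraph can be sketched as follows. This is a minimal illustration rather than the `Video_get` function of the invention, whose body is not given in the text; the names `video_frame_info` and `seconds_covered` are hypothetical. Only standard OpenCV calls (`cv2.VideoCapture`, `get`, `cv2.CAP_PROP_FRAME_COUNT`, `cv2.CAP_PROP_FPS`) are used.

```python
import math


def video_frame_info(path):
    """Return (total_frames, fps) for a video file (hypothetical helper)."""
    # OpenCV is imported here so that seconds_covered() below stays usable
    # on machines where OpenCV is not installed.
    import cv2

    cap = cv2.VideoCapture(path)
    if not cap.isOpened():
        raise IOError(f"cannot open video: {path}")
    try:
        total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))  # all frames
        fps = cap.get(cv2.CAP_PROP_FPS)                        # frames per second
    finally:
        cap.release()
    return total_frames, fps


def seconds_covered(total_frames, fps):
    """Number of one-second groups needed to hold every frame.

    The frame rate is rounded to a whole number of frames per group,
    and the last group may be only partly filled.
    """
    return math.ceil(total_frames / round(fps))
```

For a hypothetical file `lesson.mp4`, `video_frame_info("lesson.mp4")` would return the two values that the later per-second grouping depends on.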
[0035] The trainee videos are processed with the same steps as the trainer video, and the videos of trainees recorded on different devices are processed separately. During this process, considering the different number of people and recording angles in each video, test data from videos recorded in different areas are used to set the video resolution and frame rate.
[0036] (III) Monitoring and Assessment of Attention
[0037] Step 1: Trainer Status Identification
[0038] (1) Face detection
[0039] Facial key position recognition is performed on the trainer in each frame of the preprocessed video, and the posture of the trainer is determined based on the spatial relationship of the identified facial key positions.
[0040] The Dlib library is used from Python for face recognition, detection, and head pose determination. The focus of face detection is on detecting facial landmarks, recording their absolute positions within the entire image, and then rendering them. Figure 1 shows the map of facial landmarks. By detecting facial landmarks in video frame data and observing the changes in their relative positions under different states, the current state of the person can be determined.
[0041] In the specific implementation process, a face detector file is imported, and faces in video frames are recognized using facial key points. During face recognition of video data, grayscale processing and appropriate scaling are required to facilitate recognition. After the preliminary work is completed, the video is processed using code, recognizing each frame. In this process, faces appearing in each frame are detected. After face recognition is completed, the face in the current frame is drawn, specifically including drawing a face recognition bounding box based on the relative position of the face and setting necessary facial key points. Simultaneously with face recognition, the coordinate information of key points on the face is obtained using the key point detector.
[0042] (2) Trainer Status Identification
[0043] In this invention, key point coordinates corresponding to different facial organs are used to determine the state in each video frame. Based on the recorded training videos, the trainer's status in the videos is divided into three categories: Status 1 is looking down at the built-in computer on the lectern, during which an educational video is being played; Status 2 is looking straight ahead, interacting with the trainees; and Status 3 is turning sideways towards the multimedia, explaining the content on the multimedia.
[0044] The face detection and state recognition process is shown in Figure 2. The process begins by identifying the trainer's state in each frame of the video and setting corresponding state values for easy identification. In states one and two, the trainer is facing the camera (the camera is positioned for this purpose), so key facial features and their positions can be identified in the video frame both when looking down and when looking straight ahead. Because the trainer's head is at different angles when looking straight ahead and when looking down, the identified facial key point coordinates can be used to calculate the relative angle of the trainer's head in the vertical direction and determine whether the trainer is looking down. In the specific implementation, the key point coordinates of the chin and nose are first identified. Since there are multiple key points for the chin and nose in the facial key point map, the coordinates representing the chin and nose positions are calculated using the mean value method. Second, the absolute values of the vertical and horizontal distances between the chin and nose are calculated from these representative coordinates. Finally, the arctangent function is used to calculate the trainer's current head-down angle. In the specific parameter settings, the head-down angle when the trainer is looking straight ahead is used as the baseline range. If the head-down angle is within the baseline range, "Forward" is displayed above the drawn face frame, which is state two (looking straight ahead). Otherwise, "Head down" is displayed above the drawn face frame, which is state one (looking down). In state three, turning sideways towards the multimedia, the face is not fully visible and cannot be recognized, so the horizontal deflection angle of the key points cannot simply be calculated. Instead, any frame in which not all key points can be recognized is identified as turning sideways towards the multimedia (PPT). These three judgment conditions were tested on a test set by comparing the states recognized by the computer with manually annotated video frames. The accuracy was high, indicating that the judgment conditions set in this invention are reasonable.
[0045] The coordinates of all key points on the chin and nose are identified separately. The average coordinates (chin.x, chin.y) of the chin and the average coordinates (nose.x, nose.y) of the nose are calculated. The angle (angle) in the vertical direction of the face in the experiment is then calculated using the following formula, where || represents the absolute value operation:
[0046]
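Formula (1) is not reproduced in this text, so the following is only a sketch of the calculation the surrounding paragraphs describe: mean chin and nose coordinates, absolute vertical and horizontal distances, and an arctangent. The orientation of the ratio (vertical over horizontal), the baseline range of 80 to 90 degrees, and the function names are all assumptions made for illustration, not values taken from the invention.

```python
import math
from statistics import mean


def mean_point(points):
    """Average a list of (x, y) key points into one representative point."""
    xs, ys = zip(*points)
    return mean(xs), mean(ys)


def head_down_angle(chin_points, nose_points):
    """Angle in degrees from the absolute chin-nose distances.

    ASSUMPTION: angle = arctan(|chin.y - nose.y| / |chin.x - nose.x|).
    The source only states that an arctangent of these distances is used.
    """
    chin_x, chin_y = mean_point(chin_points)
    nose_x, nose_y = mean_point(nose_points)
    dx = abs(chin_x - nose_x)
    dy = abs(chin_y - nose_y)
    # atan2 avoids a division by zero when the nose-chin line is vertical.
    return math.degrees(math.atan2(dy, dx))


def trainer_state(chin_points, nose_points, baseline=(80.0, 90.0)):
    """Classify one frame; the baseline range is a hypothetical placeholder."""
    if not chin_points or not nose_points:
        return "side"  # key points missing: turned towards the multimedia
    low, high = baseline
    angle = head_down_angle(chin_points, nose_points)
    return "forward" if low <= angle <= high else "head_down"
```

In practice the baseline range would be measured from frames in which the trainer is known to be looking straight ahead, as the text describes.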
[0047] After traversing all video frames, the trainer's status information is saved. Next, the status information file is processed. Because the video is long, the amount of frame status information is also large. To better represent the trainer's status, this invention converts the frame status data into corresponding second-by-second status data. First, the number of frames per second in the current video is obtained; a value of 30 means that each second corresponds to 30 frame statuses. Based on the video frame count obtained in step (II), the frame statuses are divided into groups of 30, with the last group holding the remaining frames if fewer than 30 are left. After grouping, Python code is used to calculate the mode of each frame status group, i.e., the status value that appears most frequently in the group, which represents the status for that second. This operation reduces the randomness of frame status recognition to a certain extent, making the status identification more accurate. Furthermore, using second-by-second statuses better meets the needs of practical research.
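The grouping and mode calculation described above can be sketched in a few lines. The function name `frames_to_seconds` is hypothetical, and the tie-breaking rule (the value seen first in the group wins) is an assumption, since the text does not say how ties are resolved.

```python
from collections import Counter


def frames_to_seconds(frame_states, fps=30):
    """Collapse per-frame states into one state per second by majority vote."""
    second_states = []
    for start in range(0, len(frame_states), fps):
        group = frame_states[start:start + fps]  # last group may be shorter
        # most_common(1) gives the most frequent state; on a tie, Counter
        # returns the value that was encountered first in the group.
        second_states.append(Counter(group).most_common(1)[0][0])
    return second_states
```

For example, 35 frame states at 30 frames per second produce two second-level states, the second one computed from the 5 leftover frames.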
[0048] Based on the second-level status identified by the code, Python code is used to draw a graph of the trainees' classroom status intervals in minutes, as shown in Figure 3.
[0049] (3) Construction of rhythm spectrum
[0050] The rhythm spectrum primarily refers to the distribution of areas where trainers expect trainees to focus their attention. According to training requirements, when the trainer is in State 1 (head down), trainees need to concentrate on the multimedia area during video playback; trainees are expected to follow the trainer's rhythm and watch the video attentively. When the trainer is in State 2 (facing forward, speaking to or interacting with trainees), trainees need to concentrate on the trainer's area and follow the class rhythm. When the trainer is in State 3 (turning sideways to the multimedia screen, explaining content), trainees need to concentrate on the multimedia area.
[0051] Based on the division of the podium area, a classroom rhythm spectrum is formed, as shown in Figure 4.
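One way to represent the rhythm spectrum in code is as a lookup from the trainer's state in each second to the area trainees are expected to face. The sketch below assumes the state numbering given in the text (1 head down, 2 facing forward, 3 turned sideways to the multimedia); the names `EXPECTED_FOCUS` and `build_rhythm_spectrum` and the string labels are hypothetical.

```python
# Trainer state -> area the trainees are expected to look at.
EXPECTED_FOCUS = {
    1: "multimedia",  # trainer looks down while a video is playing
    2: "trainer",     # trainer faces forward and interacts with trainees
    3: "multimedia",  # trainer turns sideways and explains the slides
}


def build_rhythm_spectrum(trainer_seconds):
    """Map second-by-second trainer states to the expected focus area."""
    return [EXPECTED_FOCUS[state] for state in trainer_seconds]
```

The resulting list has one entry per second and can be compared directly with the trainees' second-level states.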
[0052] Step 2: Monitoring the attention of trainees
[0053] (1) Status frame processing
[0054] Different states are determined by setting corresponding conditional ranges for the trainees' states in the horizontal direction. During state determination, some special situations may arise, requiring special handling. For example, when there are obstructions, insufficient lighting, or other interference factors in the video, errors in state determination may occur. These special situations need to be addressed to ensure the accuracy of state determination. When processing the video, it is necessary to ensure that the number of trainees' state values acquired in each frame of data is consistent with the actual number of trainees in the video. This can be achieved by performing state determination and count statistics on each frame of data to ensure the accuracy and completeness of the data.
[0055] After acquiring different frame state data, the data files are merged and processed to obtain the overall frame state file for the trainees. From the set of frame states corresponding to each second, the percentage of trainees in different states in the current second is obtained as the overall state representation for each second. Through the above steps, the percentage of trainees in different states in each second of the class can be obtained. Subsequently, this is matched with the rhythm spectrum to obtain the percentage of trainees whose states match the required state for each second.
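The merging and percentage steps in this paragraph might look like the following sketch. It assumes each camera area has already been reduced to a list with one entry per second, where each entry lists the states of the trainees visible in that area; the function names and state labels are hypothetical.

```python
from collections import Counter


def merge_areas(*areas):
    """Combine per-second trainee states from several camera areas."""
    return [sum(per_second, []) for per_second in zip(*areas)]


def state_shares(trainee_states_per_second):
    """Share of trainees in each state, second by second."""
    shares = []
    for states in trainee_states_per_second:
        total = len(states)
        counts = Counter(states)
        # An empty second (nobody detected) yields an empty dictionary.
        shares.append({s: c / total for s, c in counts.items()} if total else {})
    return shares
```

Checking that `len(states)` equals the known head count for each second is one simple way to perform the consistency check on trainee numbers that the previous paragraph calls for.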
[0056] (2) Attention monitoring
[0057] Since the direction of a trainee's gaze change in the classroom can be approximated as the direction of head rotation, the direction of the trainee's head rotation is an important indicator for assessing their level of concentration.
[0058] This invention determines the area that attracts the trainee's attention by the direction the trainee's head is facing. In this invention, the trainee's attention should be mainly focused on the trainer or the multimedia; otherwise, it is considered to be in a state of inattention.
[0059] Whether a trainee's state aligns with the classroom rhythm can be used to assess attention levels during training. For example, if a certain segment of the rhythm spectrum requires trainees to focus on the multimedia area, trainees whose heads are facing the multimedia area are considered attentive, while those whose heads are facing other areas are considered inattentive.
[0060] The identification of the trainee's state follows a similar approach to that of the trainer. It calculates the horizontal deflection angle of key points to determine orientation, thus identifying each trainee's current state. Since the relative positions of each trainee to the trainer (in this invention, the trainer is only positioned at the podium) and the multimedia equipment are different, the trainee's head deflection angle can be used to determine their focus. Specifically, the identification of key facial points and their locations is based on the changing relative position of the facial midline and chin.
[0061] For each area of video footage, a baseline state is selected in which the trainee is looking at the trainer or the multimedia. A threshold range for the absolute value of the difference between the horizontal coordinate of the midline of the trainee's face and the horizontal coordinate of the chin is determined within this baseline state. In the baseline state, the camera is positioned directly facing the trainee to reduce the number of subsequent judgment conditions and improve recognition accuracy. Specifically, the trainee's horizontal deflection is represented by the change in the relative position of the facial midline and the chin, obtained by calculating the difference between the horizontal coordinate of the facial midline and the horizontal coordinate of the chin. The chin position is represented by the average coordinates of all key points on the chin, as described above. The positions of the eyes, cheeks, and lips are represented in the same way. The horizontal coordinate of the facial midline is calculated by averaging the boundary points of the eyes, cheeks, and lips. When the absolute value of the difference is within the threshold range, the trainee is considered to be facing the camera, which represents the baseline state (looking at the trainer or the multimedia); when the absolute value is outside this range, the trainee is in the other state. Despite the use of separate classroom areas and separate cameras, there are still situations in actual experiments where a trainee's face is partially obscured or the trainee is looking down, making it impossible to recognize the face. This invention treats such situations as a state of inattention.
[0062] In judging trainees, only their horizontal deflection is considered to determine the area they are facing. The coordinates of all key points on the chin, eyes, cheeks, and lips are identified. The average coordinate of the chin (Chin) is calculated from all key points on the chin; the average coordinates of the left and right eyes (Eye_left, Eye_right) are calculated from all key points on the left and right eyes; the average coordinates of the left and right cheeks (Cheek_left, Cheek_right) are calculated from all key points on the left and right cheeks; and the average coordinates of the left and right sides of the lips (Lip_left, Lip_right) are calculated from all key points on the left and right sides of the lips. In formulas (2) to (4), Eye_left, Eye_right, Cheek_left, Cheek_right, Lip_left, and Lip_right denote the horizontal coordinates of the corresponding left and right facial landmarks, and Chin denotes the horizontal coordinate of the chin. Using formulas (2), (3), and (4), the horizontal coordinates of the face's midline under different standards can be obtained.
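[0063] Face_cente_line1=(Eye_left+Eye_right)/2 (2)
[0064] Face_cente_line2=(Cheek_left+Cheek_right)/2 (3)
[0065] Face_cente_line3=(Lip_left+Lip_right)/2 (4)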
[0066] After obtaining the midline position of the face in the horizontal direction under different standards, the difference between the midline position and the Chin point in the horizontal direction is calculated using formulas (5), (6), and (7):
[0067] D_Vaule1=Face_cente_line1-Chin (5)
[0068] D_Vaule2=Face_cente_line2-Chin (6)
[0069] D_Vaule3=Face_cente_line3-Chin (7)
[0070] After obtaining the differences D_Vaule1, D_Vaule2, and D_Vaule3 in the horizontal direction between the facial midline and the Chin key point under different standards, the positive or negative value of the difference is used to determine the current facing area of the subject.
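The midline and difference calculations can be sketched as below. The sketch assumes that each midline is the average of the left and right horizontal coordinates of one landmark pair (eyes, cheeks, lips), which is how the text describes it. How the three differences are combined into a single decision is not stated in the source; requiring all three absolute values to fall within the threshold is an assumption made here, and the function names and state labels are hypothetical.

```python
def midline_differences(eye_lr, cheek_lr, lip_lr, chin_x):
    """Return (D1, D2, D3): each facial midline minus the chin x-coordinate.

    Each *_lr argument is a (left_x, right_x) pair of horizontal coordinates.
    """
    def midline(left_right):
        left_x, right_x = left_right
        return (left_x + right_x) / 2

    return tuple(midline(pair) - chin_x for pair in (eye_lr, cheek_lr, lip_lr))


def trainee_state(diffs, threshold):
    """Classify a trainee from the three differences (assumed decision rule)."""
    if diffs is None:
        return "inattentive"  # face not fully recognized
    if all(abs(d) <= threshold for d in diffs):
        return "baseline"     # facing the camera, as in the baseline state
    return "other"
```

The signs of the returned differences are kept, so a caller can also inspect which side the head is turned towards, as the preceding paragraph suggests.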
[0071] When trainees look at the trainer, the judgment is still based on horizontal deflection, taking the trainer's spatial position into account. Throughout the experiment, the basic judgment method remains unchanged; only the compensation difference between the trainee's spatial position and the recording device needs to be adjusted.
[0072] Step 3: Matching trainees' attention assessment status
[0073] Through the above steps, using the Dlib library and related Python data processing methods, the overall second-by-second status files of both trainers and trainees are obtained. Using Python, the statuses of the two are matched at the same moment. When the trainer is looking down or turned to the side towards the multimedia, the trainee needs to look at the multimedia area; when the trainer is looking straight ahead at the trainees, the trainee needs to look at the trainer area.
[0074] At each moment, a matching analysis determines whether each trainee is in the state the trainer expects. The percentage of trainees whose state aligns with the classroom rhythm at that moment represents the overall classroom attention level at that moment. Data from the matching file is used to plot a curve showing the changes in classroom attention during the training process, as shown in Figure 5, which reveals the attentional patterns of trainees during training.
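The matching step and the attention concentration rate can be sketched as follows. The sketch assumes the rhythm spectrum is a per-second list of expected focus areas and that trainee states use the same labels; the function names, the labels, and the one-minute averaging window are hypothetical choices, and plotting is left to whatever charting library is in use.

```python
def attention_curve(rhythm_spectrum, trainee_states_per_second):
    """Fraction of trainees whose state matches the expected focus, per second."""
    rates = []
    for expected, states in zip(rhythm_spectrum, trainee_states_per_second):
        if not states:
            rates.append(0.0)  # nobody detected in this second
            continue
        matched = sum(1 for s in states if s == expected)
        rates.append(matched / len(states))
    return rates


def per_window_average(rates, window=60):
    """Average the per-second rates over fixed windows (default one minute)."""
    return [
        sum(rates[i:i + window]) / len(rates[i:i + window])
        for i in range(0, len(rates), window)
    ]
```

The list returned by `attention_curve` gives the points of a classroom attention change curve, with time in seconds on the horizontal axis and the attention concentration rate on the vertical axis.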
[0075] The above description is merely a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitutions or modifications made by those skilled in the art within the scope of the technology disclosed in the present invention, based on the technical solution and inventive concept of the present invention, should be covered within the scope of protection of the present invention.
Claims
1. A method for attention monitoring and evaluation based on the rhythm of knowledge transfer, characterized in that it includes the following steps: (1) use cameras to collect video data on the performance of trainers and trainees throughout the training process; (2) preprocess the video data to ensure that the video of the trainee is fully aligned with the video of the trainer and includes all the teaching points of the entire course; (3) perform facial key position recognition on the trainers and trainees in each frame of the preprocessed video, and determine the head posture of the trainers and trainees based on the spatial relationship of the recognized facial key positions; (4) based on the head postures the trainees are expected to exhibit, construct a rhythm spectrum to assess the trainees' attention status; (5) determine whether the relative relationship between the postures of the trainees and the trainers matches the rhythm spectrum setting, and use the attention concentration rate to draw the classroom attention change curve; wherein the head posture of the trainer includes looking down, looking straight ahead, and turning to the side towards the multimedia; the head posture of the trainee includes facing the trainer, facing the multimedia area, and other inattentive postures; during the posture recognition process, the trainee's frame state data is converted into corresponding second state data, the frame states belonging to each second are grouped together, and the state value that appears most frequently in the current frame state group represents the trainee's state in the current second; the distribution of areas in the rhythm spectrum that the trainer expects the trainee's attention to follow is externalized as a correspondence between the trainee's head posture and the trainer's head posture: when the trainer is looking down, the trainee needs to focus their attention on the multimedia area, which is manifested by the trainee's head posture facing the multimedia area; when the trainer is looking straight ahead, the trainee focuses their attention on the trainer's area, which is manifested by the trainee's head posture facing the trainer; when the trainer is turned to the side towards the multimedia and explaining the content on the multimedia, the trainee needs to focus their attention on the multimedia area, which is manifested by the trainee's head posture facing the multimedia area.
2. The method of claim 1, wherein, in step (1), during the video data acquisition process, multiple cameras are used to collect data in different areas of the training venue to ensure that faces are not obscured during recording.
3. The attention monitoring and evaluation method based on the rhythm of knowledge transfer according to claim 1, characterized in that the specific steps for identifying the head posture of trainers are as follows: First, the head-down angle measured while the trainer is looking straight ahead is used as the baseline range. Second, key point coordinates of the chin and nose are obtained through image recognition technology, and the absolute values of the vertical and horizontal distances between the chin and nose are calculated. Finally, the arctangent function is used to calculate the trainer's current head-down angle. When the head-down angle is within the baseline range, the trainer is considered to be in a forward-looking state; when the head-down angle is not within the baseline range, the trainer is considered to be in a head-down state. If not all key facial points of the trainer can be identified, the trainer is considered to be turned sideways towards the multimedia.
4. The attention monitoring and evaluation method based on the rhythm of knowledge transfer according to claim 1, characterized in that the specific steps for identifying the head posture of the trainee are as follows: First, select any state where the trainee's head is facing the trainer or the multimedia as the baseline state, and determine the threshold range of the absolute value of the difference between the horizontal coordinate of the midline of the facial key points and the horizontal coordinate of the chin key point in the baseline state. Second, perform facial recognition on each trainee to obtain the coordinates of their facial key points, and calculate the absolute value of the difference between the horizontal coordinate of the midline of the facial key points and the horizontal coordinate of the chin key point. Finally, if not all facial key points can be recognized, the trainee is considered to be in another, inattentive state. When the absolute value of the difference is within the threshold range of the baseline state, the trainee is in the baseline state; when the absolute value of the difference is not within the threshold range of the baseline state, the trainee is in the other state.