Attention recognition method based on AI education platform

CN115457617B · Active · Publication Date: 2026-06-30 · CHENGDU JIEGAO EDUCATION TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
CHENGDU JIEGAO EDUCATION TECH CO LTD
Filing Date
2022-01-19
Publication Date
2026-06-30


Abstract

This invention provides a focus recognition method based on an AI education platform. The method includes: capturing multiple input video frames of multiple users; detecting the users' face regions; computing the average pixel image of each user's set of face image windows to build a facial appearance model; generating paths for the users across the input video frames; estimating the orientation of each detected face to calculate focus; counting the faces with a frontal pose to detect users who have gazed at the displayed content for a predefined duration; calculating each user's focus level from the time spent gazing at the displayed content; associating body language with one of multiple emotion type labels; training a classifier on features extracted from the video frame data; and using the classifier to detect the users' emotional feedback. The method is better adapted to application scenarios with low-resolution images, and it combines visual recognition with emotion recognition so that the AI education platform can obtain the distribution of user focus in real time.

Description

Technical Field

[0001] This invention relates to intelligent education, and in particular to a method for recognizing attention based on an artificial intelligence education platform.

Background Technology

[0002] In recent years, image recognition has been increasingly integrated with education-related scenarios, finding applications in personalized education, automated scoring, and speech recognition assessment. Students receive tailored learning support, which fosters future-oriented adaptive education. To assess student focus, cameras can capture frontal video of students during class; extracting facial regions from the video images makes it possible to determine the number of attentive students and their facial expressions, providing data support for evaluating educational effectiveness. Current technologies for measuring attention levels rely on eye-gaze-based techniques, but measuring eye gaze typically requires close-range, high-resolution images. Gaze estimates from distant, low-resolution images are prone to error.

Summary of the Invention

[0003] To address the problems existing in the prior art, this invention proposes a focus recognition method based on an artificial intelligence education platform, comprising:

[0004] The image capture device captures multiple input video frames from multiple users listening to the lesson in the area where the content is displayed.

[0005] A machine learning-based face detection method is used to segment regions with skin color pixel values in multiple input video frames, and to detect the face regions of the users listening to the class in the multiple input video frames.

[0006] A facial appearance model of the user attending the class is established by calculating the average pixel image of the group of facial image windows.

[0007] By generating paths for users listening to the class in multiple input video frames, the detected faces are tracked individually and the identities assigned to the users are maintained. When a user's face is detected, a path for that user is generated, and the detected face is assigned to the generated path.

[0008] The orientation of each detected face is estimated to calculate focus; users who have been looking at the displayed content for a predefined duration are detected by counting the faces with a frontal pose.

[0009] The focus level of each student is calculated by measuring the time they spend looking at the displayed content.

[0010] The system processes video frame data to detect the physical behavior of users in a video frame sequence; it associates the observed physical behavior with one of a number of emotion type labels, where each type label corresponds to a specific emotional response; it trains a classifier using features extracted from the video frame data, and uses the classifier to detect the emotional responses of users in the video frame sequence.

[0011] Preferably, the method further includes:

[0012] Potential users of the displayed content are identified by tracking multiple behaviors of several users around the displayed content.

[0013] Preferably, each emotional feedback is a predicted facial expression expressing the emotional state of the user attending the class, and the method further includes capturing second video frame data of the user attending the class; and applying features extracted from the second video frame data to the classifier to determine the emotional state of the user attending the class.

[0014] Preferably, the method further includes:

[0015] The Viola-Jones face detector algorithm is applied to the input video frame to determine face regions; a deformable part-based model is applied to determine the Region of Interest (ROI) corresponding to the facial landmarks of the listening user within the face region; features are extracted from the ROI; the features are associated with emotion types; and a classifier is trained using the association results.

[0016] Preferably, a feature histogram is generated from the extracted features; coordinate transformation is performed on the ROI region in multiple video frames;

[0017] The extracted features are concatenated to generate feature descriptors;

[0018] The classifier is trained using the final feature descriptor and the feature histogram.

[0019] Compared with the prior art, the present invention has the following advantages:

[0020] This invention proposes a focus recognition method based on an artificial intelligence education platform, which is better adapted to the application scenarios of low-resolution images and combines visual recognition and emotion recognition to help the artificial intelligence education platform obtain the focus distribution status of students in real time.

Attached Figure Description

[0021] Figure 1 is a flowchart of a focus recognition method based on an artificial intelligence education platform according to an embodiment of the present invention.

Detailed Implementation

[0022] A detailed description of one or more embodiments of the invention is provided below, together with Figure 1, which illustrates the principles of the invention. The invention is described in conjunction with such embodiments, but is not limited to any particular embodiment. The scope of the invention is defined only by the claims, and the invention encompasses numerous alternatives, modifications, and equivalents. Numerous specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for illustrative purposes, and the invention may be practiced according to the claims without some or all of these specific details.

[0023] One aspect of the present invention provides a focus recognition method based on an artificial intelligence education platform. Figure 1 is a flowchart of a focus recognition method based on an artificial intelligence education platform according to an embodiment of the present invention.

[0024] This invention automatically measures the attention level of users to displayed content by counting the number of users and the duration of their gaze. The attention measurement also covers the number of attentive users (how many people actually looked at the displayed content), the average attention duration, the distribution of attention time, and ratings based on user responses. The displayed content is evaluated by tracking the behavior of users around it. An image capture device is used to collect information about users' proximity to the displayed content.

[0025] A forward-facing image capture device is used to measure the actual number of users attending to the displayed content. This device detects when people are looking at the screen. Attention time is counted once a user has looked towards the screen for a predefined minimum duration. The total number of users attentive to the displayed content constitutes the actual number of users attending to the displayed content.
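The minimum-duration rule above can be sketched as follows. This is only an illustration under stated assumptions: the per-frame frontal flags, the frame rate `fps`, and the function names are hypothetical and do not come from the patent, and the sketch counts the whole uninterrupted run once it reaches the minimum duration, which is one possible reading of the text.

```python
from typing import Dict, List, Tuple


def attention_episodes(frontal: List[bool], fps: float,
                       min_duration_s: float) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, in which the face
    stayed frontal for at least min_duration_s seconds."""
    min_frames = max(1, round(min_duration_s * fps))
    episodes: List[Tuple[int, int]] = []
    run_start = None
    # The trailing False closes a run that lasts until the final frame.
    for i, is_frontal in enumerate(list(frontal) + [False]):
        if is_frontal and run_start is None:
            run_start = i
        elif not is_frontal and run_start is not None:
            if i - run_start >= min_frames:
                episodes.append((run_start, i))
            run_start = None
    return episodes


def attention_seconds(frontal: List[bool], fps: float,
                      min_duration_s: float) -> float:
    """Total attention time: only runs that reach the minimum are counted."""
    frames = sum(end - start
                 for start, end in attention_episodes(frontal, fps, min_duration_s))
    return frames / fps


def count_attentive_users(per_user: Dict[str, List[bool]], fps: float,
                          min_duration_s: float) -> int:
    """Number of users with at least one qualifying attention episode."""
    return sum(1 for flags in per_user.values()
               if attention_episodes(flags, fps, min_duration_s))
```

Short glances that never reach the minimum duration contribute nothing, so a user who only passes by the screen is not counted as attentive.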

[0026] This invention utilizes a combination of skin color detection and pattern-based face detection to accurately detect faces in complex backgrounds, enabling the tracking method to accurately mark entry and exit times. Path continuity is achieved by combining face detection and face matching. The degree of attention is determined using 3D pose estimation based on overall facial pattern variations, thus enabling a more meaningful measurement of focus. It distinguishes users who are actually gazing at the displayed content from other users who appear near the displayed content but are not actually gazing at it.

[0027] When multiple users appear in the gaze area, the image capture device captures images of the multiple users. The captured images are processed by the control and processing system of a computer system, which applies various visual techniques, including face detection, face tracking, and 3D face pose estimation, to the captured visual information of the multiple users. In an exemplary embodiment, the invention also measures the effectiveness of the displayed content for the attending users. Because users gaze at the displayed content from within a limited spatial range, robust face detection, face tracking, and face pose estimation techniques can be applied. The sum of the number of users focused on the displayed content yields the actual number of users attending to the displayed content.

[0028] The AI education platform of this invention includes a skin color detection module, a face detection module, a user path management module, a 3D face pose estimation module, and a data collection module. The user path management module further includes a geometric matching module, a path generation module, a path maintenance module, and a path termination module. The skin color detection module identifies regions in video frames that resemble the skin color of a face. The face detection module then runs a face detection window on the regions identified by the skin color detection module. Detected faces are first processed by the geometric matching module to determine whether each face belongs to an existing path or to a new user, for whom a new path must be generated. If the face belongs to a new user, the path generation module is activated to generate a new path and add it to the path queue. If the face belongs to an existing path, the path maintenance module acquires the path data and activates the 3D face pose estimation module. If the geometric matching module cannot find a subsequent face belonging to a certain path, the path termination module is activated to store the path data and remove the path from the path queue. The data collection module then records the path data and the estimated face pose data.
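The path generation, maintenance, and termination flow described above might be organized roughly as in the sketch below. The class and parameter names (`PathManager`, `max_match_dist`, `timeout_s`) are hypothetical, and the geometric matching is deliberately reduced to a nearest-position test; the 3D pose estimation and data collection modules are left out.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple

Point = Tuple[float, float]


@dataclass
class Path:
    path_id: int
    first_seen: float
    last_seen: float
    positions: List[Point] = field(default_factory=list)


class PathManager:
    def __init__(self, max_match_dist: float, timeout_s: float):
        self.max_match_dist = max_match_dist  # assumed matching radius, pixels
        self.timeout_s = timeout_s            # assumed termination delay, seconds
        self.active: Dict[int, Path] = {}     # the path queue
        self.finished: List[Path] = []        # stored data of terminated paths
        self._next_id = 0

    def _match(self, pos: Point, taken: Set[int]) -> Optional[int]:
        """Simplified geometric matching: nearest free path within range."""
        best_id, best_dist = None, self.max_match_dist
        for pid, path in self.active.items():
            if pid in taken:
                continue
            px, py = path.positions[-1]
            dist = ((pos[0] - px) ** 2 + (pos[1] - py) ** 2) ** 0.5
            if dist <= best_dist:
                best_id, best_dist = pid, dist
        return best_id

    def update(self, faces: List[Point], t: float) -> None:
        taken: Set[int] = set()
        for pos in faces:
            pid = self._match(pos, taken)
            if pid is None:
                # Path generation: the face belongs to a new user.
                pid = self._next_id
                self._next_id += 1
                self.active[pid] = Path(pid, first_seen=t, last_seen=t)
            # Path maintenance: extend the path with the new detection.
            self.active[pid].positions.append(pos)
            self.active[pid].last_seen = t
            taken.add(pid)
        # Path termination: no face for longer than the timeout.
        stale = [pid for pid, p in self.active.items()
                 if t - p.last_seen > self.timeout_s]
        for pid in stale:
            self.finished.append(self.active.pop(pid))
```

Each terminated path keeps its `first_seen` and `last_seen` timestamps, which correspond to the entry and exit times that the data collection module records.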

[0029] The AI-powered education platform automatically calculates the level of attention to displayed content by processing video input frames from image capture devices near the displayed content. Using live video as input, it detects user faces in the video, tracks each user individually by user identity, estimates 3D facial pose, records appearance and disappearance timestamps, and collects data to determine the occurrence and duration of attention. A 3D pose estimation method is used to automatically correct the viewpoint offset between the camera and the displayed content.

[0030] In face detection, skin color segmentation is performed first. In this step, color information is used to segment regions in the video frame where faces might exist—the detected skin regions. A color space transformation is used to make skin colors form compact regions in the transformed space. Skin color detection serves as a means to accelerate face detection and significantly reduces false detections of faces from the background. The output of this step is a set of mask regions in the video frame. Next, the face detection process begins. A machine learning-based method is used to detect faces within the skin color regions identified in the previous step. The image is converted to grayscale and processed to detect faces. This step provides the location and size of the detected faces in a given video frame.
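As an illustration of the skin color segmentation step, the sketch below converts RGB pixels to YCbCr and thresholds the chroma channels. The patent does not name the color space or the thresholds, so the YCbCr choice, the `cb_range` and `cr_range` values, and all function names are assumptions made for this example only.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each 0..255


def rgb_to_ycbcr(r: float, g: float, b: float) -> Tuple[float, float, float]:
    """Full-range BT.601 (JFIF-style) RGB to YCbCr conversion."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def is_skin(pixel: Pixel,
            cb_range: Tuple[float, float] = (77, 127),
            cr_range: Tuple[float, float] = (133, 173)) -> bool:
    """Chroma-only test, so brightness changes matter less (assumed thresholds)."""
    _, cb, cr = rgb_to_ycbcr(*pixel)
    return cb_range[0] <= cb <= cb_range[1] and cr_range[0] <= cr <= cr_range[1]


def skin_mask(image: List[List[Pixel]]) -> List[List[bool]]:
    """Per-pixel boolean mask of skin-colored regions."""
    return [[is_skin(p) for p in row] for row in image]


def bounding_box(mask: List[List[bool]]) -> Optional[Tuple[int, int, int, int]]:
    """Inclusive (top, left, bottom, right) box around all skin pixels."""
    hits = [(r, c) for r, row in enumerate(mask)
            for c, flag in enumerate(row) if flag]
    if not hits:
        return None
    rows = [r for r, _ in hits]
    cols = [c for _, c in hits]
    return min(rows), min(cols), max(rows), max(cols)
```

A face detector would then be run only inside the returned box (or inside each connected mask region), which is where the speed-up and the reduction in background false detections come from.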

[0031] During face tracking, once a face is detected, an automatic face geometry correction step is initiated. The estimated face geometry is used to generate a corrected face from the detected face image, ensuring that the face features are placed in standard positions within the cropped face image window, thereby establishing a reliable face appearance model. Each time a face is added to the user path, the user's appearance model is constructed by calculating the average pixel image of the entire face image window in the path.
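The average pixel image of a path can be maintained incrementally, without storing every face window. The following is a minimal sketch that assumes all corrected face windows have already been cropped to the same size and converted to grayscale; the class name `AppearanceModel` and its methods are hypothetical.

```python
from typing import List, Optional

Window = List[List[float]]  # grayscale face window, rows of pixel values


class AppearanceModel:
    """Running per-pixel mean of the face windows added to one path."""

    def __init__(self) -> None:
        self.count = 0
        self.mean: Optional[Window] = None

    def add(self, window: Window) -> None:
        if self.mean is None:
            self.mean = [[float(v) for v in row] for row in window]
            self.count = 1
            return
        if len(window) != len(self.mean) or len(window[0]) != len(self.mean[0]):
            raise ValueError("face windows must share one size")
        self.count += 1
        for r, row in enumerate(window):
            for c, v in enumerate(row):
                # Incremental mean: m_n = m_{n-1} + (x_n - m_{n-1}) / n
                self.mean[r][c] += (v - self.mean[r][c]) / self.count

    def difference(self, window: Window) -> float:
        """Mean absolute pixel difference between a window and the model."""
        if self.mean is None:
            raise ValueError("model is empty")
        total = sum(abs(v - m)
                    for row, mean_row in zip(window, self.mean)
                    for v, m in zip(row, mean_row))
        return total / (len(self.mean) * len(self.mean[0]))
```

The `difference` value is the kind of appearance term that can later be compared against a newly corrected face when deciding which path it belongs to.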

[0032] The tracking step monitors the identity of users in the scene to measure how long a user gazes at the displayed content. Tracking uses two measurements between the tracking history and newly detected faces: geometric matching and appearance matching. Path management generates paths when new faces appear in the scene, assigns detected faces to paths, monitors the identity of users in the scene, and terminates paths when a user leaves the scene.

[0033] When a new face is detected in the current video frame, a face-path mapping table is constructed. Then, a geometric matching score is calculated for each face-path pair to measure the probability that a given face belongs to a given path. The geometric matching score is based on position, size, the time difference between the corrected face and the last face in the path, and the difference between the average face appearance stored in the path and the corrected face. If the total score is below a predefined threshold, the data pair is excluded from the mapping table. This process is repeated until all faces are assigned a matching path. If a path does not have a new face within a predefined time period, that path is terminated.
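The face-path assignment can be sketched as a greedy pass over a table of scored pairs. The patent lists the inputs to the score (position, size, time difference, appearance difference) but not the formula, so the exponential form, the scale constants, and the threshold used here are assumptions chosen only to make the example concrete.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Face:
    x: float
    y: float
    size: float
    t: float
    appearance: List[float]  # flattened corrected face window


@dataclass
class PathState:
    x: float                       # position of the last face in the path
    y: float
    size: float
    t: float                       # timestamp of the last face in the path
    mean_appearance: List[float]   # average face appearance of the path


def match_score(face: Face, path: PathState, pos_scale: float = 50.0,
                size_scale: float = 20.0, time_scale: float = 1.0,
                app_scale: float = 50.0) -> float:
    """Hypothetical score in (0, 1]; 1 means identical on every term."""
    pos = math.hypot(face.x - path.x, face.y - path.y) / pos_scale
    size = abs(face.size - path.size) / size_scale
    dt = abs(face.t - path.t) / time_scale
    app = sum(abs(a - b) for a, b in zip(face.appearance, path.mean_appearance))
    app = app / len(face.appearance) / app_scale
    return math.exp(-(pos + size + dt + app))


def assign_faces(faces: List[Face], paths: Dict[str, PathState],
                 threshold: float = 0.3) -> Tuple[Dict[int, str], List[int]]:
    """Return ({face index: path id}, indices of faces left without a path)."""
    table = []
    for i, face in enumerate(faces):
        for pid, path in paths.items():
            score = match_score(face, path)
            if score >= threshold:       # low-scoring pairs are excluded
                table.append((score, i, pid))
    table.sort(key=lambda entry: entry[0], reverse=True)
    assigned: Dict[int, str] = {}
    used_paths = set()
    for _, i, pid in table:              # best remaining pair first
        if i in assigned or pid in used_paths:
            continue
        assigned[i] = pid
        used_paths.add(pid)
    unmatched = [i for i in range(len(faces)) if i not in assigned]
    return assigned, unmatched
```

Faces returned in `unmatched` would trigger path generation, and paths that receive no face for the predefined period would be terminated.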

[0034] Furthermore, the focus level of the user is measured as the proportion of time the user pays attention to the displayed content relative to the total time the user's face is visible. The estimated face orientation is used to determine whether a face is facing forward. Then, the ratio of the number of forward-facing faces to the total number of detected faces is calculated.
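Both ratios in this paragraph reduce to simple counting once each face has an estimated orientation. The sketch below assumes that the pose estimator returns yaw and pitch angles in degrees and that a face within 15 degrees on both axes counts as frontal; the angle representation, the thresholds, and the function names are illustrative assumptions, not values from the patent.

```python
from typing import Sequence, Tuple

Pose = Tuple[float, float]  # (yaw_deg, pitch_deg), hypothetical representation


def is_frontal(pose: Pose, max_yaw: float = 15.0,
               max_pitch: float = 15.0) -> bool:
    """A face counts as forward-facing when both angles are small."""
    yaw, pitch = pose
    return abs(yaw) <= max_yaw and abs(pitch) <= max_pitch


def user_focus(path_poses: Sequence[Pose]) -> float:
    """Share of a user's visible time spent facing the content.

    Assumes evenly spaced frames, so a frame ratio equals a time ratio.
    """
    if not path_poses:
        return 0.0
    return sum(is_frontal(p) for p in path_poses) / len(path_poses)


def frontal_ratio(frame_poses: Sequence[Pose]) -> float:
    """Share of the faces detected in one frame that face the content."""
    if not frame_poses:
        return 0.0
    return sum(is_frontal(p) for p in frame_poses) / len(frame_poses)
```

`user_focus` is evaluated along one path (one user over time), while `frontal_ratio` is evaluated across all faces in a single frame (the class at one instant).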

[0035] In a preferred embodiment, after the focus recognition is completed, the video frame data is further processed to detect the physical behavior of the user in the video frame sequence; the observed physical behavior is associated with one of a plurality of emotion type labels, wherein each type label corresponds to a corresponding emotional feedback; a classifier is trained using features extracted from the video frame data, and the classifier is used to detect the emotional feedback of the user in the video frame sequence.

[0036] Each emotional feedback is a predicted facial expression that expresses the emotional state of the user attending the class, and the method further includes capturing second video frame data of the user attending the class; and applying features extracted from the second video frame data to the classifier to determine the emotional state of the user attending the class.

[0037] The step of detecting the face region of the user attending the class in the plurality of input video frames further includes:

[0038] The Viola-Jones face detector algorithm is applied to the input video frame to determine face regions; a deformable part-based model is applied to determine the Region of Interest (ROI) corresponding to the facial landmarks of the user in the face region; features are extracted from the ROI; the features are associated with emotion types; and a classifier is trained using the association results. A feature histogram is generated from the extracted features; coordinate transformations are performed on the ROI in multiple video frames; the extracted features are concatenated to generate a feature descriptor; and the classifier is trained using the final feature descriptor and the feature histogram.
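The histogram, concatenation, and training steps can be illustrated with a toy pipeline. The patent does not specify the feature type or the classifier, so this sketch substitutes plain pixel-intensity histograms for the features and a nearest-centroid rule for the classifier; face detection, landmark localization, and the ROI coordinate transformation are assumed to have been done already, and all names are hypothetical.

```python
from typing import Dict, List, Sequence


def histogram(values: Sequence[int], bins: int = 4,
              max_value: int = 256) -> List[float]:
    """Normalized histogram of the feature values of one ROI."""
    counts = [0] * bins
    for v in values:
        counts[min(int(v) * bins // max_value, bins - 1)] += 1
    total = len(values) or 1
    return [c / total for c in counts]


def descriptor(rois: Sequence[Sequence[int]]) -> List[float]:
    """Concatenate the per-ROI histograms into one feature descriptor."""
    desc: List[float] = []
    for roi in rois:
        desc.extend(histogram(roi))
    return desc


class NearestCentroid:
    """Stand-in classifier: one mean descriptor per emotion type label."""

    def fit(self, X: Sequence[Sequence[float]],
            y: Sequence[str]) -> "NearestCentroid":
        sums: Dict[str, List[float]] = {}
        counts: Dict[str, int] = {}
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for i, v in enumerate(x):
                acc[i] += v
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {label: [v / counts[label] for v in acc]
                          for label, acc in sums.items()}
        return self

    def predict(self, x: Sequence[float]) -> str:
        def sq_dist(label: str) -> float:
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[label]))
        return min(self.centroids, key=sq_dist)
```

In practice the stand-ins would be replaced by texture or gradient features and a stronger classifier; the structure of the pipeline (per-ROI histogram, concatenation, labeled training, prediction on second video frame data) stays the same.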

[0039] To implement the gaze detection process for determining the fixation point, in a further embodiment, a capture device with zoom capability photographs the user attending the class and outputs the captured image and the zoom value. The image of the user's iris is separated from the background of the image. The center of the user's eyeball is then located from the iris image, and the point where the perpendicular from the center of the eyeball meets the displayed content is taken as the reference point. A zoom value that brings the iris image to a predetermined size is set, and the distance from the iris to the displayed content is determined from that zoom value. The offset of the iris is determined from the offset of the iris image, and the gaze offset on the displayed content is determined from the iris offset and the distance from the iris to the displayed content. The fixation point is calculated from the reference point and the gaze offset.

[0040] Each time a change in the user's position is detected, the distance measurement step and the reference point determination step are iteratively executed. The distance measurement step further includes: acquiring the size of the iris image as a reference value, acquiring a zoom value as a reference value, acquiring the distance from the iris to the displayed content as a reference value, and pre-storing the image size, zoom value, and distance; controlling the zoom function so that the iris image size is equal to the iris image size used as the reference value; and determining the distance from the iris to the displayed content based on the zoom value used. The gaze offset is specified by using the pre-stored distance from the center of the eyeball to the iris.
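The distance and fixation-point calculations can be sketched with basic geometry. Two assumptions are made that the patent does not state: the zoom value is taken to be proportional to focal length under a pinhole camera model, and the gaze ray is taken to run from the eyeball center through the iris center, with horizontal and vertical offsets treated independently. All function and parameter names are hypothetical.

```python
import math
from typing import Tuple


def distance_from_zoom(zoom: float, ref_zoom: float,
                       ref_distance: float) -> float:
    """Distance to the displayed content from the zoom value in use.

    Under a pinhole model, image size is proportional to zoom / distance.
    If the zoom is adjusted until the iris image matches the reference
    size, the distance scales linearly with the zoom value.
    """
    return ref_distance * zoom / ref_zoom


def gaze_offset(iris_offset: float, eyeball_radius: float,
                distance: float) -> float:
    """Offset of the gaze on the content along one axis.

    The iris moves on a sphere of radius eyeball_radius (the pre-stored
    distance from eyeball center to iris), so the rotation angle is
    asin(iris_offset / eyeball_radius).
    """
    ratio = iris_offset / eyeball_radius
    if abs(ratio) >= 1.0:
        raise ValueError("iris offset must be smaller than the eyeball radius")
    return distance * math.tan(math.asin(ratio))


def fixation_point(reference: Tuple[float, float], iris_dx: float,
                   iris_dy: float, eyeball_radius: float,
                   distance: float) -> Tuple[float, float]:
    """Reference point plus the gaze offset on each axis (same units)."""
    return (reference[0] + gaze_offset(iris_dx, eyeball_radius, distance),
            reference[1] + gaze_offset(iris_dy, eyeball_radius, distance))
```

All lengths must share one unit (for example millimeters). When the iris offset is zero the fixation point coincides with the reference point, and whenever the user moves, both the distance and the reference point need to be recomputed before the offset is applied.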

[0041] In summary, this invention proposes a focus recognition method based on an artificial intelligence education platform, which is better adapted to application scenarios with low-resolution images and combines visual recognition and emotion recognition to help the artificial intelligence education platform obtain the focus distribution status of users in real time.

[0042] Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented using general-purpose computing systems. They can be centralized on a single computing system or distributed across a network of multiple computing systems. Optionally, they can be implemented using program code executable by a computing system, and thus stored in a storage system for execution by the computing system. Therefore, the present invention is not limited to any specific hardware and software combination.

[0043] It should be understood that the specific embodiments described above are merely illustrative or explanatory of the principles of the invention and do not constitute a limitation thereof. Therefore, any modifications, equivalent substitutions, improvements, etc., made without departing from the spirit and scope of the invention should be included within the protection scope of the invention. Furthermore, the appended claims are intended to cover all variations and modifications falling within the scope and boundaries of the appended claims, or equivalent forms of such scope and boundaries.

Claims

1. A method for recognizing attention based on an artificial intelligence education platform, characterized in that it comprises: capturing, by an image capture device, multiple input video frames of multiple users listening to the lesson in the area where the content is displayed; using a machine learning-based face detection method to segment regions with skin color pixel values in the multiple input video frames, and to detect the face regions of the users listening to the class in the multiple input video frames; establishing a facial appearance model of the user attending the class by calculating the average pixel image of the face image window; generating paths for the users listening to the class in the multiple input video frames, so that the detected faces are tracked individually and the identities assigned to the users are maintained, wherein when a user's face is detected, a path for that user is generated and the detected face is assigned to the generated path; estimating the orientation of the detected face to calculate focus; detecting users who have been looking at the displayed content for a predefined duration by detecting the number of faces with a frontal pose; calculating the focus level of each student by measuring the time they spend looking at the displayed content; processing video frame data to detect the physical behavior of users in a video frame sequence; associating the observed physical behavior with one of a number of emotion type labels, where each type label corresponds to a corresponding emotional feedback; training a classifier using features extracted from the video frame data, and using the classifier to detect the emotional feedback of the listening user in the video frame sequence; the method further comprising: when a new face is detected in the current video frame, constructing a face-path mapping table and calculating a geometric matching score for each face-path pair to measure the probability that a given face belongs to a given path, wherein the geometric matching score is based on position, size, and the time difference between the corrected face and the last face in the path, as well as the difference between the average face appearance stored in the path and the corrected face; if the total score is lower than a predefined threshold, the face-path pair is excluded from the mapping table; and this process is repeated until all faces are assigned a matching path.

2. The method according to claim 1, characterized in that the method further includes: identifying potential users of the displayed content by tracking multiple behaviors of several users around the displayed content.

3. The method according to claim 1, characterized in that each emotional feedback is a predicted facial expression expressing the emotional state of the user attending the class, and the method further includes capturing second video frame data of the user attending the class; and applying features extracted from the second video frame data to the classifier to determine the emotional state of the user attending the class.

4. The method according to claim 1, wherein detecting the face region of the user attending the class in the plurality of input video frames further comprises: applying the Viola-Jones face detector algorithm to the input video frame to determine face regions; applying a deformable part-based model to determine the Region of Interest (ROI) corresponding to the facial landmarks of the listening user within the face region; extracting features from the ROI; associating the features with emotion types; and training a classifier using the association results.

5. The method of claim 4, further comprising: generating a feature histogram from the extracted features; performing coordinate transformations on the ROI across multiple video frames; concatenating the extracted features to generate a feature descriptor; and training the classifier using the final feature descriptor and the feature histogram.