Remote conference participant attention detection method and device, and storage medium
By processing video and audio streams through image detection and voice endpoint detection, and combining calculation rules, the problem of inaccurate judgment of participant attention in remote meetings is solved. This enables accurate calculation and report generation of the attention of all participants, thereby improving the effectiveness of meeting management.
Patent Information
- Authority / Receiving Office: CN · China
- Patent Type: Patents (China)
- Current Assignee / Owner: CHINA PING AN LIFE INSURANCE CO LTD
- Filing Date: 2022-07-21
- Publication Date: 2026-06-30
AI Technical Summary
Existing technology cannot accurately determine the level of attention of some participants in a remote meeting, especially those who do not need to participate in the discussion or presentation, resulting in an inability to accurately obtain the attention of all participants.
By processing gazed screen frames in the video stream through image detection and speaking frames in the audio stream through voice endpoint detection, and combining these with preset calculation rules, the participant attention level of each user terminal is calculated, and an attention report is generated.
It enables accurate tracking of the attention of all participants in remote meetings, improving the management and optimization capabilities for meeting quality.
[Figure: abstract drawing, CN115272922B]
Abstract
Description
Technical Field
[0001] This application relates to, but is not limited to, the field of artificial intelligence technology, and in particular to a method, apparatus, and storage medium for detecting the attention of participants in a remote meeting.
Background Technology
[0002] Currently, in remote meetings, the attention level of participants is usually judged by the length of time they speak. When the attention level is low, the operators can improve the attention level by optimizing the meeting content. However, when some participants do not need to participate in the discussion or presentation, their speaking time is short or they do not speak at all. The above-mentioned attention level judgment method cannot accurately judge the attention level of these participants, resulting in the inability to accurately obtain the attention level of all participants.
Summary of the Invention
[0003] The following is an overview of the subject matter described in detail herein. This overview is not intended to limit the scope of the claims.
[0004] This application provides a method, apparatus, and storage medium for detecting participant attention in remote conferencing. It can accurately obtain the attention level of all participants by processing video streams through image detection and audio streams through voice endpoint detection.
[0005] To achieve the above objectives, a first aspect of this application proposes a method for detecting participant attention in a remote conference, applied to a server. The server is communicatively connected to multiple user terminals and to a management terminal. The method includes: receiving video and audio streams from the multiple user terminals; determining gaze frames in each video frame of the video stream based on image detection processing; determining speech frames in each audio frame of the audio stream based on voice endpoint detection processing, wherein the audio frames correspond one-to-one with the video frames, and the audio frames and their corresponding video frames are time-synchronized; obtaining the participant attention level of each user terminal based on preset calculation rules, according to the video frames, gaze frames, audio frames, and speech frames; generating an attention report based on the participant attention levels of each user terminal; and sending the attention report to the management terminal in response to a report query request from the management terminal.
[0006] In some embodiments, the step of obtaining the participant attention level of each user terminal based on preset calculation rules, according to the video frames, gaze screen frames, audio frames, and speech frames, includes: determining the total audio and video duration based on the video frames or the audio frames; determining non-gaze screen frames in each video frame based on the gaze screen frames; determining non-speech frames in each audio frame based on the speech frames; based on overlap processing, determining a first duration according to the gaze screen frames and the speech frames, determining a second duration according to the speech frames and the non-gaze screen frames, and determining a third duration according to the gaze screen frames and the non-speech frames; determining a first attention value based on the total audio and video duration, the first duration, and a preset first weight value; determining a second attention value based on the total audio and video duration, the second duration, and a preset second weight value, wherein the second weight value is less than the first weight value; determining a third attention value based on the total audio and video duration, the third duration, and a preset third weight value, wherein the third weight value is less than the second weight value; and obtaining the participant attention level of each user terminal based on the first attention value, the second attention value, and the third attention value.
[0007] In some embodiments, generating an attention report based on the participant attention levels of each user terminal includes: obtaining basic meeting information, wherein the basic meeting information includes meeting content information, meeting start time, and meeting end time, and the meeting content information is matched with the participant attention levels; determining the meeting duration based on the meeting start time and meeting end time; determining the attention duration based on the gaze screen frames and the speech frames; determining the attention time percentage of each user terminal based on the attention duration and the meeting duration; and generating an attention report based on the basic meeting information, the attention time percentage of each user terminal, and the participant attention levels.
[0008] In some embodiments, after the step of obtaining the participant attention level of each user terminal based on the video frame, the gaze screen frame, the audio frame, and the speech frame according to the preset calculation rules, the method further includes: when it is determined that the participant attention level of any user terminal is less than a preset attention level threshold, generating a reminder message, wherein the reminder message is matched with the user terminal; and sending the reminder message to the management terminal.
[0009] In some embodiments, determining gaze screen frames in each video frame of the video stream based on image detection processing includes: determining active state frames in each video frame of the video stream based on liveness detection processing; and determining gaze screen frames in each of the active state frames based on face detection processing and eye detection processing.
[0010] In some embodiments, before the step of determining the gaze screen frame in each video frame of the video stream based on image detection processing, the method further includes: dividing the video stream into multiple sub-video streams and dividing the audio stream into multiple sub-audio streams based on a preset segmentation duration, wherein the sub-audio streams correspond one-to-one with the sub-video streams, and the sub-video streams and the corresponding sub-audio streams are used to determine the participant's attention level.
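The segmentation described above can be sketched as follows. This is a minimal illustration, assuming each stream is already available as a list of equal-duration frames; the function name `split_stream` and its parameters are hypothetical and do not come from the application. Splitting the video and audio lists with the same parameters keeps the sub-streams in one-to-one correspondence.

```python
def split_stream(frames, frame_duration, segment_duration):
    """Split a list of frames into consecutive sub-streams.

    frame_duration and segment_duration are in seconds (assumed units).
    The last sub-stream may be shorter than the others.
    """
    if frame_duration <= 0 or segment_duration <= 0:
        raise ValueError("durations must be positive")
    # Number of whole frames that fit in one segment (at least one).
    frames_per_segment = max(1, round(segment_duration / frame_duration))
    return [
        frames[i:i + frames_per_segment]
        for i in range(0, len(frames), frames_per_segment)
    ]


# Hypothetical example: 10 synchronized frames of 1 s each, 4 s segments.
video_frames = [f"v{i}" for i in range(10)]
audio_frames = [f"a{i}" for i in range(10)]
sub_video = split_stream(video_frames, 1.0, 4.0)
sub_audio = split_stream(audio_frames, 1.0, 4.0)
```

Each pair `sub_video[k]`, `sub_audio[k]` can then be processed independently to obtain an attention level for that segment.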
[0011] To achieve the above objectives, a second aspect of this application proposes a method for detecting participant attention in a remote conference, applied to a management terminal. The management terminal is communicatively connected to a server, and the server is communicatively connected to multiple user terminals. The method includes: sending a report query request to the server; and receiving an attention report from the server. The attention report is generated based on the participant attention of each user terminal. The participant attention is obtained by the server based on preset calculation rules, according to video frames of a video stream, audio frames of an audio stream, gaze frames, and speech frames. The video stream and audio stream are obtained by each user terminal sending data to the server. The gaze frames are determined by the server based on image detection processing within each video frame of the video stream. The speech frames are determined by the server based on voice endpoint detection processing within each audio frame of the audio stream. Each audio frame corresponds one-to-one with a video frame, and the audio frames and their corresponding video frames are time-synchronized.
[0012] To achieve the above objectives, a third aspect of this application proposes a participant attention detection device for remote conferencing, applied to a server. The server is communicatively connected to multiple user terminals and a management terminal. The device includes: an acquisition unit for receiving video and audio streams from the multiple user terminals; a video processing unit for determining gaze frames in each video frame of the video stream based on image detection processing; an audio processing unit for determining speech frames in each audio frame of the audio stream based on speech endpoint detection processing, wherein the audio frames correspond one-to-one with the video frames, and the audio frames and their corresponding video frames are time-synchronized; a calculation unit for obtaining the participant attention of each user terminal based on preset calculation rules, according to the video frames, gaze frames, audio frames, and speech frames; a generation unit for generating an attention report based on the participant attention of each user terminal; and a sending unit for sending the attention report to the management terminal in response to a report query request from the management terminal.
[0013] To achieve the above objectives, a fourth aspect of this application provides an electronic device, which includes a memory, a processor, a program stored in the memory and executable on the processor, and a data bus for enabling communication between the processor and the memory. When the program is executed by the processor, it implements the participant attention detection method for remote conferencing described in the first aspect.
[0014] To achieve the above objectives, a fifth aspect of the present application provides a storage medium, which is a computer-readable storage medium for computer-readable storage. The storage medium stores one or more programs, which can be executed by one or more processors to implement the remote meeting participant attention detection method described in the first aspect above, or the remote meeting participant attention detection method described in the second aspect above.
[0015] The present application discloses a method, apparatus, and storage medium for detecting participant attention in remote conferencing. Embodiments of this application include: receiving video and audio streams from multiple user terminals; determining gaze frames in each video frame of the video stream based on image detection processing; determining speech frames in each audio frame of the audio stream based on voice endpoint detection processing, wherein each audio frame corresponds one-to-one with a video frame, and the audio frame and its corresponding video frame are time-synchronized; obtaining the participant attention level of each user terminal based on preset calculation rules, according to the video frames, gaze frames, audio frames, and speech frames; generating an attention report based on the participant attention levels of each user terminal; and sending the attention report to the management terminal in response to a report query request from the management terminal. According to the solution provided in the embodiments of this application, the server receives video streams and audio streams from each user terminal. Then, for each video stream and audio stream, the video frames in the video stream are processed by image detection to determine the gaze frame. Then, the audio frames in the audio stream are processed by voice endpoint detection to determine the speaking frame. Based on the calculation rules, the attention level of each participant on each user terminal is calculated, and then an attention report is generated and sent to the management terminal. This enables the accurate acquisition of the attention level of all participants in a remote meeting.
[0016] Other features and advantages of this application will be set forth in the description which follows, and will be apparent in part from the description, or may be learned by practicing the application. The objectives and other advantages of this application may be realized and obtained by means of the structures particularly pointed out in the description, claims and drawings.
Description of the Drawings
[0017] The accompanying drawings are used to provide a further understanding of the technical solutions of this application and constitute a part of the specification. They are used together with the embodiments of this application to explain the technical solutions of this application and do not constitute a limitation on the technical solutions of this application.
[0018] Figure 1 is a flowchart of a method for detecting participant attention in a remote conference applied to a server, provided in one embodiment of this application;
[0019] Figure 2 is a flowchart of obtaining the participant attention level, provided in another embodiment of this application;
[0020] Figure 3 is a flowchart of generating the attention report, provided in another embodiment of this application;
[0021] Figure 4 is a flowchart of generating and sending a reminder message, provided in another embodiment of this application;
[0022] Figure 5 is a flowchart illustrating the filtering of target data rows, provided in another embodiment of this application;
[0023] Figure 6 is a flowchart of a process for processing business requirement text, provided in another embodiment of this application;
[0024] Figure 7 is a flowchart of a method for detecting participant attention in a remote conference applied to a management terminal, provided in one embodiment of this application;
[0025] Figure 8 is a system block diagram of a participant attention detection method provided in one embodiment of this application;
[0026] Figure 9 is a schematic diagram of the structure of a participant attention detection device for a remote conference, provided in another embodiment of this application;
[0027] Figure 10 is a schematic diagram of the hardware structure of an electronic device provided in another embodiment of this application.
Detailed Description
[0028] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0029] In the description of this application, "several" means one or more, "multiple" means two or more, "greater than", "less than", "exceeding" etc. are understood to exclude the number itself, and "above", "below", "within" etc. are understood to include the number itself.
[0030] It should be noted that although functional modules are divided in the device schematic diagram and a logical order is shown in the flowchart, in some cases, the steps shown or described may be performed in a different order than the module division in the device or the order in the flowchart. The terms "first," "second," etc., in the specification, claims, or the aforementioned drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
[0031] First, some of the terms used in this application are explained:
[0032] Artificial Intelligence (AI) is a new branch of computer science that studies, develops, and applies theories, methods, technologies, and systems to simulate, extend, and expand human intelligence. It aims to understand the essence of intelligence and produce intelligent machines that can react in a way similar to human intelligence. Research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems. AI can simulate the information processes of human consciousness and thought. Furthermore, AI utilizes digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceiving the environment, acquiring knowledge, and using that knowledge to achieve optimal results.
[0033] Liveness detection is a method used in identity verification scenarios to determine the true physiological characteristics of an object. In facial recognition applications, liveness detection can verify whether a user is a real, living person by using technologies such as facial landmark localization and facial tracking through combined actions such as blinking, opening the mouth, shaking the head, and nodding. It can effectively resist common attack methods such as photos, face swapping, masks, occlusion, and screen re-photographing, thereby helping users identify fraudulent behavior and protecting their interests.
[0034] Face detection refers to the process of searching a given image using a specific strategy to determine whether it contains a human face, and if so, returning the position, size, and pose of the face.
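[0035] Voice Activity Detection (VAD), also called voice endpoint detection, is a technique used in speech processing to detect whether a speech signal is present, and thus to locate the points at which speech starts and ends.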
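As a rough illustration of the idea, the sketch below marks an audio frame as speech when its mean energy exceeds a fixed threshold. This is only one simple way to detect voice activity; the application does not say which VAD algorithm is used, and the function names and the threshold value are assumptions made for the example.

```python
def frame_energy(samples):
    """Mean squared amplitude of one audio frame (a list of floats)."""
    if not samples:
        return 0.0
    return sum(s * s for s in samples) / len(samples)


def detect_speech_frames(frames, energy_threshold=0.01):
    """Return one boolean per frame: True where energy exceeds the threshold.

    energy_threshold is an assumed, illustrative value; practical systems
    usually adapt it to the background noise level.
    """
    return [frame_energy(f) > energy_threshold for f in frames]


# Hypothetical example: one loud frame between two quiet frames.
frames = [[0.0, 0.01, -0.01], [0.5, -0.4, 0.6], [0.0, 0.0, 0.0]]
speech_flags = detect_speech_frames(frames)
```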
[0036] Currently, in remote meetings, the attention level of participants is usually judged by the length of time they speak. When the attention level is low, the operators can improve the attention level by optimizing the meeting content. However, when some participants do not need to participate in the discussion or presentation, their speaking time is short or they do not speak at all. The above-mentioned attention level judgment method cannot accurately judge the attention level of these participants, resulting in the inability to accurately obtain the attention level of all participants.
[0037] To address the problem of inaccurately obtaining the attention levels of all participants, this application provides a method, apparatus, and storage medium for detecting participant attention levels in remote conferencing. The method is applied to a server, which is communicatively connected to multiple user terminals and a management terminal. The method includes: receiving video and audio streams from multiple user terminals; determining the gaze screen frame in each video frame of the video stream based on image detection processing; determining the speech frame in each audio frame of the audio stream based on voice endpoint detection processing, wherein the audio frame corresponds one-to-one with the video frame, and the audio frame and its corresponding video frame are time-synchronized; obtaining the participant attention level of each user terminal based on preset calculation rules, according to the video frame, gaze screen frame, audio frame, and speech frame; generating an attention report based on the participant attention levels of each user terminal; and sending the attention report to the management terminal in response to a report query request from the management terminal. According to the solution provided in the embodiments of this application, the server can obtain the video stream and audio stream of each user terminal. For each video stream and audio stream, the server processes the video frames in the video stream through image detection to determine the gaze frame. Then, it processes the audio frames in the audio stream through voice endpoint detection to determine the speaking frame. Based on the calculation rules, the server calculates the attention level of each user terminal participant, generates an attention report, and sends the attention report to the management terminal. This enables the server to accurately obtain the attention level of all participants in a remote meeting.
[0038] The method, device, and storage medium for detecting participant attention in a remote meeting provided in the embodiments of this application are described in detail through the following embodiments. The method is described first.
[0039] The method for detecting participant attention in remote conferencing provided in this application relates to the field of artificial intelligence technology. This method can be applied to a terminal, a server, or software running on either a terminal or a server. In some embodiments, the terminal can be a smartphone, tablet, laptop, desktop computer, etc.; the server can be configured as an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms; the software can be an application implementing the method for detecting participant attention in remote conferencing, but is not limited to the above forms.
[0040] This application can be used in a wide variety of general-purpose or special-purpose computer system environments or configurations. Examples include: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. This application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform specific tasks or implement specific abstract data types. This application can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.
[0041] It should be noted that in all specific embodiments of this application, when processing data related to user identity or characteristics, such as user information, user behavior data, user historical data, and user location information, user permission or consent is obtained first. Furthermore, the collection, use, and processing of this data comply with relevant laws, regulations, and standards of the relevant countries and regions. In addition, when embodiments of this application require access to sensitive personal information of users, separate permission or consent from the user is obtained through pop-ups or redirects to confirmation pages. Only after obtaining the user's separate permission or consent is the necessary user-related data for the proper functioning of the embodiments of this application obtained.
[0042] The embodiments of this application will be further described below with reference to the accompanying drawings.
[0043] As shown in Figure 1, which is a flowchart of a method for detecting participant attention in a remote conference according to an embodiment of this application, the method can be applied to a server that is communicatively connected to multiple user terminals and a management terminal. The method includes, but is not limited to, the following steps:
[0044] Step S110: Receive video and audio streams from multiple user terminals;
[0045] Step S120: Based on image detection processing, determine the gaze screen frame in each video frame of the video stream;
[0046] Step S130: Based on the voice endpoint detection processing, determine the speaking frame in each audio frame of the audio stream, wherein the audio frame corresponds one-to-one with the video frame, and the audio frame and the corresponding video frame are time-synchronized.
[0047] Step S140: Based on preset calculation rules, the participant attention level of each user terminal is obtained according to video frames, gaze screen frames, audio frames and speech frames.
[0048] Step S150: Generate an attention report based on the attention levels of participants on each user terminal;
[0049] Step S160: In response to the report query request from the management terminal, send an attention report to the management terminal.
[0050] Understandably, user terminals upload video and audio streams in real time, and the server receives these streams from each user terminal. The server processes the video stream through image detection to determine the gaze screen frames, that is, the video frames in which the participant is looking at the screen. It then processes the audio stream through voice endpoint detection to detect the audio frames at which a human voice appears and disappears; all audio frames between the appearance and disappearance of the voice are treated as speech frames. The calculation rules are then applied to the video frames, gaze screen frames, audio frames, and speech frames to calculate the participant attention level for each user terminal, and an attention report is generated and sent to the management terminal. Operations personnel obtain the attention report through the management terminal and use it to optimize meeting content and improve meeting quality. In this way, the attention level of all participants in a remote meeting can be obtained accurately.
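The rule that every audio frame between the appearance and the disappearance of a voice counts as a speech frame can be sketched as below. The representation of endpoints as inclusive `(start, end)` frame-index pairs and the function name are assumptions made for illustration.

```python
def speech_frames_from_endpoints(num_frames, endpoints):
    """Build a per-frame speech mask from detected voice endpoints.

    endpoints is a list of (start, end) frame indices, both inclusive
    (an assumed convention), marking where a voice appears and disappears.
    """
    mask = [False] * num_frames
    for start, end in endpoints:
        first = max(start, 0)
        last = min(end, num_frames - 1)
        for i in range(first, last + 1):
            mask[i] = True
    return mask


# Hypothetical example: voice present in frames 1-2 and again in frames 4-5.
speech_mask = speech_frames_from_endpoints(6, [(1, 2), (4, 5)])
```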
[0051] It should be noted that the user terminal and the management terminal can be either wireless or wired terminal devices. Wireless terminal devices can refer to devices with wireless transceiver capabilities, including but not limited to mobile phones, tablets (Pads), and computers with wireless transceiver capabilities.
[0052] Additionally, referring to Figure 2, in one embodiment, step S140 of the embodiment shown in Figure 1 includes, but is not limited to, the following steps:
[0053] Step S210: Determine the total duration of audio and video based on video frames or audio frames;
[0054] Step S220: Based on the gaze screen frames, determine the non-gaze screen frames in each video frame;
[0055] Step S230: Based on the spoken frames, determine the non-spoken frames in each audio frame;
[0056] Step S240: Based on overlap processing, a first duration is determined according to the gazed screen frame and the speaking frame, a second duration is determined according to the speaking frame and the non-gaze screen frame, and a third duration is determined according to the gazed screen frame and the non-speaking frame.
[0057] Step S250: Determine the first attention value based on the total duration of audio and video, the first duration, and the preset first weight value;
[0058] Step S260: Determine the second attention value based on the total duration of audio and video, the second duration, and the preset second weight value, wherein the second weight value is less than the first weight value;
[0059] Step S270: Determine the third attention value based on the total duration of audio and video, the third duration, and the preset third weight value, wherein the third weight value is less than the second weight value;
[0060] Step S280: Based on the first attention value, the second attention value, and the third attention value, obtain the attention level of the participants on each user terminal.
[0061] Understandably, the attention level must be calculated separately for each participant, with each user terminal corresponding to one participant. For any given user terminal, since the video frames and audio frames are synchronized and correspond one-to-one, the total audio and video duration is determined by summing the durations of all video frames or of all audio frames. The gaze screen frames are then removed from the video frames to obtain the non-gaze screen frames, and the speech frames are removed from the audio frames to obtain the non-speech frames. Through overlap processing, the times corresponding to the overlapping portions of the gaze screen frames and the speech frames are summed to determine the first duration; the times corresponding to the overlapping portions of the speech frames and the non-gaze screen frames are summed to determine the second duration; and the times corresponding to the overlapping portions of the gaze screen frames and the non-speech frames are summed to determine the third duration. The attention value corresponding to each of the three durations is then calculated: the proportion of the first duration in the total audio and video duration is multiplied by the first weight value to obtain the first attention value, the proportion of the second duration is multiplied by the second weight value to obtain the second attention value, and the proportion of the third duration is multiplied by the third weight value to obtain the third attention value. Finally, the first, second, and third attention values are summed to obtain the participant attention level. The first weight value is greater than the second weight value, and the second weight value is greater than the third weight value. This means that time spent both looking at the screen and speaking carries the highest weight, time spent only speaking carries the next highest weight, and time spent only looking at the screen without speaking carries the lowest weight. Time during which the participant is neither looking at the screen nor speaking contributes nothing to the attention level. The first, second, and third weight values were obtained through repeated experiments and optimization, thus ensuring the reliability and accuracy of the participant attention level calculation.
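One reading of steps S250 to S280 can be written compactly as follows. The symbols are introduced here for illustration only: $T$ is the total audio and video duration, $t_1$, $t_2$, $t_3$ are the first, second, and third durations, and $w_1 > w_2 > w_3$ are the preset weight values. The normalization by $T$ is an assumption based on the steps naming the total duration as an input to each attention value.

```latex
A_k = w_k \cdot \frac{t_k}{T}, \quad k = 1, 2, 3,
\qquad
A = A_1 + A_2 + A_3
  = \frac{w_1 t_1 + w_2 t_2 + w_3 t_3}{T}
```

Because the three durations do not overlap and $t_1 + t_2 + t_3 \le T$, this reading gives $0 \le A \le w_1$.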
[0062] In practice, the first weight value can be 0.5, the second weight value can be 0.3, and the third weight value can be 0.2.
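The calculation in steps S210 to S280 can be sketched as below, using the example weights 0.5, 0.3, and 0.2. This is a minimal sketch rather than the applicant's implementation: it assumes the gaze and speech detections are available as equal-length boolean lists with one entry per synchronized frame, that every frame has the same duration, and that each attention value is the weighted share of its duration in the total. The function and parameter names are hypothetical.

```python
def attention_level(gaze, speech, frame_duration=1.0, weights=(0.5, 0.3, 0.2)):
    """Weighted attention level for one user terminal.

    gaze[i] is True if frame i is a gaze screen frame; speech[i] is True
    if frame i is a speech frame. Frames are assumed to be synchronized.
    """
    if len(gaze) != len(speech):
        raise ValueError("gaze and speech must have the same number of frames")
    total = len(gaze) * frame_duration  # total audio and video duration
    if total == 0:
        return 0.0
    pairs = list(zip(gaze, speech))
    first = sum(g and s for g, s in pairs) * frame_duration       # gazing and speaking
    second = sum(s and not g for g, s in pairs) * frame_duration  # speaking only
    third = sum(g and not s for g, s in pairs) * frame_duration   # gazing only
    w1, w2, w3 = weights
    # Frames with neither gaze nor speech contribute nothing.
    return w1 * first / total + w2 * second / total + w3 * third / total


# Hypothetical example: 10 frames, 2 gazing+speaking, 1 speaking only, 5 gazing only.
gaze = [True, True, False, False, True, True, True, False, True, True]
speech = [True, False, True, False, False, False, True, False, False, False]
level = attention_level(gaze, speech)
```

With these inputs the level is approximately 0.23, that is, 0.5 × 0.2 + 0.3 × 0.1 + 0.2 × 0.5.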
[0063] Additionally, referring to Figure 3, in one embodiment, step S150 of the embodiment shown in Figure 1 includes, but is not limited to, the following steps:
[0064] Step S310: Obtain basic meeting information, including meeting content information, meeting start time, and meeting end time, with the meeting content information matching the participants' attention levels.
[0065] Step S320: Determine the meeting duration based on the meeting start time and meeting end time;
[0066] Step S330: Determine the attention duration based on the gazed screen frame and the spoken frame;
[0067] Step S340: Determine the percentage of attention time for each user terminal based on the attention duration and meeting duration;
[0068] Step S350: Generate an attention report based on the basic information of the meeting, the percentage of attention time for each user terminal, and the attention level of the participants.
[0069] Understandably, for a remote meeting, a corresponding attention report is generated. The information in the attention report includes, but is not limited to: meeting content information, meeting start time, meeting end time, number of participants, percentage of time spent paying attention, and participant attention level.
[0070] It should be noted that the attention duration is obtained by adding the duration of the gazed screen frames to the duration of the speaking frames and then subtracting the duration of the portion in which gazed screen frames and speaking frames overlap. This is equivalent to counting the time during which the participant is looking at the screen or speaking as the attention duration. The ratio of the attention duration to the meeting duration is then calculated to obtain the percentage of attention time. The number of participants is the number of user terminals that upload video and audio streams.
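The attention duration and percentage of attention time from steps S320 to S340 can be sketched as below. This is an illustrative sketch only: the per-frame boolean representation, the fixed `frame_seconds`, and the function name `attention_time_ratio` are assumptions that do not appear in the application.

```python
from datetime import datetime
from typing import Sequence


def attention_time_ratio(gazed: Sequence[bool], speaking: Sequence[bool],
                         frame_seconds: float,
                         start: datetime, end: datetime) -> float:
    """Share of the meeting spent gazing at the screen or speaking."""
    # Step S320: meeting duration from the start and end times.
    meeting_seconds = (end - start).total_seconds()
    if meeting_seconds <= 0:
        raise ValueError("meeting end time must be after the start time")
    # Step S330: gaze time + speaking time - overlap (inclusion-exclusion).
    gaze_t = sum(gazed) * frame_seconds
    speak_t = sum(speaking) * frame_seconds
    overlap_t = sum(g and s for g, s in zip(gazed, speaking)) * frame_seconds
    attention_seconds = gaze_t + speak_t - overlap_t
    # Step S340: ratio of attention duration to meeting duration.
    return attention_seconds / meeting_seconds
```

Subtracting the overlap keeps a frame in which the participant both gazes and speaks from being counted twice.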
[0071] Additionally, referring to Figure 4, in one embodiment, the following steps may also be included, but are not limited to, after step S140 of the embodiment illustrated in Figure 1:
[0072] Step S410: When it is determined that the attention level of any user terminal is less than the preset attention level threshold, a reminder message is generated, wherein the reminder message is matched with the user terminal;
[0073] Step S420: Send a reminder message to the management terminal.
[0074] Understandably, by setting an attention threshold, when a participant's attention level is below the threshold (e.g., 0.4), it indicates that the participant is not paying attention to the meeting content, and an automatic reminder message will be generated. Operations personnel can identify the participant through the reminder message via the management terminal, analyze the meeting content, and then adjust the meeting content or interaction methods in real time to promptly improve participant attention and enhance meeting quality.
[0075] It should be noted that the attention threshold was obtained through repeated experiments and optimization.
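Steps S410 and S420 can be sketched as a simple threshold check. This is a minimal sketch under assumptions: the `Reminder` record, the `build_reminders` function, and the message wording are hypothetical, and 0.4 is only the example threshold mentioned in paragraph [0074].

```python
from dataclasses import dataclass
from typing import Dict, List

ATTENTION_THRESHOLD = 0.4  # example value from paragraph [0074]


@dataclass(frozen=True)
class Reminder:
    terminal_id: str   # the user terminal the reminder is matched with
    attention: float
    message: str


def build_reminders(attention_by_terminal: Dict[str, float],
                    threshold: float = ATTENTION_THRESHOLD) -> List[Reminder]:
    """Step S410: one reminder per terminal whose attention is below the threshold."""
    reminders = []
    for terminal_id, attention in attention_by_terminal.items():
        if attention < threshold:  # strictly less than, as in step S410
            text = (f"Attention {attention:.2f} on terminal {terminal_id} "
                    f"is below threshold {threshold:.2f}")
            reminders.append(Reminder(terminal_id, attention, text))
    return reminders
```

In a real deployment, step S420 would forward each `Reminder` to the management terminal over whatever transport the server uses; that part is outside this sketch.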
[0076] Additionally, referring to Figure 5, in one embodiment, step S120 of the embodiment illustrated in Figure 1 includes, but is not limited to, the following steps:
[0077] Step S510: Based on liveness detection processing, determine the active state frames in each video frame of the video stream.
[0078] Step S520: Based on face detection processing and eye detection processing, determine the gaze screen frame in each active state frame.
[0079] Understandably, by first performing liveness detection to determine whether real people exist in each video frame, and then using video frames containing real people as active state frames, and then using face detection and eye detection to identify active state frames where faces appear and eyes are looking directly at the screen, the accuracy and reliability of subsequent participant attention calculations can be ensured.
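The two-stage detection in steps S510 and S520 can be sketched as a cascade. The application does not name any detection library, so the three detectors here are passed in as plain callables; `is_live`, `has_face`, `eyes_on_screen`, and `gazed_screen_flags` are hypothetical names used only for illustration.

```python
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def gazed_screen_flags(frames: Sequence[Frame],
                       is_live: Callable[[Frame], bool],
                       has_face: Callable[[Frame], bool],
                       eyes_on_screen: Callable[[Frame], bool]) -> List[bool]:
    """Return one flag per video frame: True if it is a gazed screen frame."""
    flags = []
    for frame in frames:
        # Step S510: liveness detection selects the active state frames.
        active = bool(is_live(frame))
        # Step S520: face and eye detection run only on active state frames;
        # `and` short-circuits, so later detectors are skipped when earlier ones fail.
        gazed = active and bool(has_face(frame)) and bool(eyes_on_screen(frame))
        flags.append(gazed)
    return flags
```

Running liveness detection first means a photo or an empty chair never reaches the face and eye detectors, which matches the ordering described above.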
[0080] As shown in Figure 6, in one embodiment, the following step may also be included, but is not limited to, before step S120 of the embodiment illustrated in Figure 1:
[0081] Step S610: Based on a preset segmentation duration, the video stream is divided into multiple sub-video streams, and the audio stream is divided into multiple sub-audio streams. The sub-audio streams correspond one-to-one with the sub-video streams, and the sub-video streams and their corresponding sub-audio streams are used to determine the participants' attention levels.
[0082] Understandably, dividing the video stream into multiple sub-video streams of fixed duration and the audio stream into multiple sub-audio streams of the same fixed duration, and calculating the corresponding participant attention for each sub-video stream and its corresponding sub-audio stream, allows for the analysis of participant attention across different time intervals, thereby improving the comprehensiveness and real-time nature of participant attention analysis.
[0083] In practice, for example, a 60-minute video stream can be divided into 60 one-minute sub-video streams, and a 60-minute audio stream can be divided into 60 one-minute sub-audio streams. The participant attention level is then calculated for each sub-video stream and its corresponding sub-audio stream, which yields the participant attention level for each of the 60 time intervals. When the participant attention level in a given time interval falls below the attention threshold, a reminder message can be generated promptly. Operations personnel can identify the participant from the reminder message via the management terminal, analyze the meeting content, and adjust it, so that participant attention is improved in a timely manner and meeting quality is enhanced.
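The segmentation in step S610 can be sketched as slicing the synchronized per-frame flags into fixed-size windows and scoring each window. This is a hedged sketch: the boolean frame representation, the `frames_per_segment` parameter, and the injected `score` callable are assumptions, and any scoring rule (such as the weighted calculation in paragraph [0061]) can be supplied by the caller.

```python
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[List[bool], List[bool]]


def split_into_segments(gazed: Sequence[bool], speaking: Sequence[bool],
                        frames_per_segment: int) -> List[Segment]:
    """Step S610: cut both streams at the same frame boundaries."""
    if frames_per_segment <= 0:
        raise ValueError("frames_per_segment must be positive")
    if len(gazed) != len(speaking):
        raise ValueError("video and audio frames must correspond one-to-one")
    n = frames_per_segment
    # Identical slice bounds keep each sub-audio stream paired with its sub-video stream.
    return [(list(gazed[i:i + n]), list(speaking[i:i + n]))
            for i in range(0, len(gazed), n)]


def per_segment_attention(gazed: Sequence[bool], speaking: Sequence[bool],
                          frames_per_segment: int,
                          score: Callable[[List[bool], List[bool]], float]) -> List[float]:
    """Apply a caller-supplied scoring rule to every segment."""
    return [score(g, s)
            for g, s in split_into_segments(gazed, speaking, frames_per_segment)]
```

The last segment may be shorter than the others when the stream length is not a multiple of the segment length; a production system would need to decide how to treat it.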
[0084] Figure 7 is a flowchart of a method for detecting participant attention in a remote conference according to an embodiment of this application. This method can be applied to a management terminal that is communicatively connected to a server, and the server is communicatively connected to multiple user terminals. The method includes, but is not limited to, the following steps:
[0085] Step S710: Send a report query request to the server;
[0086] Step S720: Receive the attention report from the server. The attention report is generated based on the participant attention level of each user terminal. The participant attention level is obtained by the server, based on preset calculation rules, according to the video frames of the video stream, the audio frames of the audio stream, the gazed screen frames, and the speaking frames. The video stream and the audio stream are sent to the server by each user terminal. The gazed screen frames are determined by the server among the video frames of the video stream based on image detection processing. The speaking frames are determined by the server among the audio frames of the audio stream based on voice endpoint detection processing. The audio frames correspond one-to-one with the video frames, and each audio frame is time-synchronized with its corresponding video frame.
[0087] It is understood that the specific implementation of the participant attention detection method applied to the management terminal is basically the same as that of the participant attention detection method applied to the server, and is not repeated here. On this basis, the server receives the video stream and audio stream of each user terminal; for each pair of streams, it applies image detection to the video frames to determine the gazed screen frames and voice endpoint detection to the audio frames to determine the speaking frames. It then calculates the participant attention level of each user terminal according to the calculation rules, generates an attention report, and sends the report to the management terminal, so that the attention of all participants in the remote conference is obtained accurately.
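The exchange in steps S710 and S720 can be sketched with in-memory objects standing in for the network. This is an illustration only: the class names, the `meeting_id` key, and the method names are hypothetical, and the application does not specify a transport or message format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class AttentionReport:
    meeting_id: str                          # hypothetical lookup key
    attention_by_terminal: Dict[str, float]  # participant attention per user terminal


@dataclass
class Server:
    reports: Dict[str, AttentionReport] = field(default_factory=dict)

    def handle_report_query(self, meeting_id: str) -> AttentionReport:
        """Respond to a report query request with the stored report."""
        return self.reports[meeting_id]


@dataclass
class ManagementTerminal:
    server: Server

    def query_report(self, meeting_id: str) -> AttentionReport:
        # Step S710: send the report query request to the server.
        # Step S720: receive the attention report returned by the server.
        return self.server.handle_report_query(meeting_id)
```

In this sketch the report is computed ahead of time and only looked up on request; a server could equally compute it when the query arrives.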
[0088] Figure 8 is a system block diagram for the participant attention detection method provided in one embodiment of this application.
[0089] Understandably, the system diagram illustrates user A and user B participating in a remote conference. Multiple other users can also participate. The server receives audio and video streams from each user's terminal. Within the server, the video processing unit processes the video stream, and the audio processing unit processes the audio stream. The calculation unit calculates the participant attention level, and the generation unit generates an attention report. When a participant's attention level is below a threshold, a reminder message is generated and sent to management terminal A, notifying operations personnel A. When operations personnel B queries through management terminal B, they send a report query request and receive the attention report from the server. Operations personnel B can then view this report through management terminal B to determine the attention level of all participants. Based on the attention report, operations personnel B can optimize the meeting content, improving meeting quality.
[0090] Additionally, referring to Figure 9, this application also provides a participant attention detection device 900 for remote conferencing. The participant attention detection device 900 is applied to a server, which is communicatively connected to multiple user terminals and to a management terminal. The participant attention detection device 900 includes:
[0091] Acquisition unit 910 is used to receive video streams and audio streams from multiple user terminals;
[0092] The video processing unit 920 is used to determine the gaze screen frame in each video frame of the video stream based on image detection processing.
[0093] The audio processing unit 930 is used to determine the speaking frame in each audio frame of the audio stream based on speech endpoint detection processing, wherein the audio frame corresponds one-to-one with the video frame and the audio frame and the corresponding video frame are time-synchronized.
[0094] The calculation unit 940 is used to obtain the attention level of participants on each user terminal based on preset calculation rules, according to video frames, gaze screen frames, audio frames, and speech frames.
[0095] The generation unit 950 is used to generate an attention report based on the attention of participants on each user terminal.
[0096] The sending unit 960 is used to send the attention report to the management terminal in response to a report query request from the management terminal.
[0097] It is understood that the specific implementation of the participant attention detection device 900 for remote conferencing is basically the same as that of the participant attention detection method for remote conferencing described above, and is not repeated here. On this basis, the server receives video and audio streams from each user terminal; for each pair of streams, it applies image detection to the video frames to determine the gazed screen frames and voice endpoint detection to the audio frames to determine the speaking frames. It then calculates the participant attention level of each user terminal according to the calculation rules, generates an attention report, and sends the report to the management terminal, so that the attention of all participants in the remote conference is obtained accurately.
[0098] Furthermore, the calculation unit includes, but is not limited to: a total duration determination unit, a first frame determination unit, a second frame determination unit, a classification duration determination unit, a first attention value determination unit, a second attention value determination unit, a third attention value determination unit, and an aggregation unit (not shown in the figure).
[0099] The total duration determination unit is used to determine the total duration of audio and video based on video frames or audio frames;
[0100] The first frame determination unit is used to determine the non-gaze screen frames in each video frame based on the gaze screen frames.
[0101] The second frame determination unit is used to determine non-speaking frames in each audio frame based on the spoken frames;
[0102] The classification duration determination unit is used to determine, based on overlap processing, a first duration according to the gazed screen frames and the speaking frames, a second duration according to the speaking frames and the non-gazed screen frames, and a third duration according to the gazed screen frames and the non-speaking frames.
[0103] The first attention value determination unit is used to determine the first attention value based on the total duration of audio and video, the first duration, and the preset first weight value.
[0104] The second attention value determination unit is used to determine the second attention value based on the total duration of audio and video, the second duration, and the preset second weight value, wherein the second weight value is less than the first weight value;
[0105] The third attention value determination unit is used to determine the third attention value based on the total duration of audio and video, the third duration, and the preset third weight value, wherein the third weight value is less than the second weight value.
[0106] The aggregation unit is used to obtain the attention level of participants from each user terminal based on the first attention value, the second attention value, and the third attention value.
[0107] Furthermore, the generation unit includes, but is not limited to: a basic information determination unit, a meeting duration determination unit, an attention duration determination unit, a percentage determination unit, and a report generation unit (not shown in the figure).
[0108] The basic information determination unit is used to obtain basic meeting information, including meeting content information, meeting start time, and meeting end time. The meeting content information is matched with the participants' attention levels.
[0109] The meeting duration determination unit is used to determine the meeting duration based on the meeting start time and meeting end time.
[0110] The attention duration determination unit is used to determine the attention duration based on the gazed screen frame and the speech frame;
[0111] The percentage determination unit is used to determine the percentage of attention time for each user terminal based on the attention duration and meeting duration.
[0112] The report generation unit is used to generate an attention report based on the basic meeting information, the percentage of attention time for each user terminal, and the participant attention level.
[0113] Furthermore, the remote meeting participant attention detection device 900 also includes: a reminder determination unit and a reminder unit (not shown in the figure).
[0114] The reminder determination unit is used to generate a reminder message when it determines that the attention level of any participant on any user terminal is less than a preset attention threshold, wherein the reminder message is matched with the user terminal;
[0115] The reminder unit is used to send reminder information to the management terminal.
[0116] Furthermore, the video processing unit includes, but is not limited to, a first detection unit and a second detection unit (not shown in the figure).
[0117] The first detection unit is used to determine the active state frames in each video frame of the video stream based on liveness detection processing.
[0118] The second detection unit determines the gazed screen frame in each active state frame based on face detection processing and eye detection processing.
[0119] Furthermore, the remote meeting participant attention detection device 900 also includes a segmentation unit (not shown in the figure).
[0120] The segmentation unit is used to divide the video stream into multiple sub-video streams and the audio stream into multiple sub-audio streams based on a preset segmentation duration. The sub-audio streams correspond one-to-one with the sub-video streams, and the sub-video streams and their corresponding sub-audio streams are used to determine the participants' attention levels.
[0121] Additionally, referring to Figure 10, which illustrates the hardware structure of an electronic device according to another embodiment, the electronic device includes:
[0122] The processor 1001 can be implemented using a general-purpose CPU (Central Processing Unit), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this application.
[0123] The memory 1002 can be implemented as a read-only memory (ROM), static storage device, dynamic storage device, or random access memory (RAM). The memory 1002 can store the operating system and other applications. When the technical solutions provided in the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 1002 and is called by the processor 1001 to execute the participant attention detection method for remote conferencing applied to a server according to the embodiments of this application, for example, method steps S110 to S160 in Figure 1, method steps S210 to S280 in Figure 2, method steps S310 to S350 in Figure 3, method steps S410 to S420 in Figure 4, method steps S510 to S520 in Figure 5, and method step S610 in Figure 6;
[0124] Input / output interface 1003 is used to implement information input and output;
[0125] The communication interface 1004 is used to enable communication and interaction between this device and other devices. Communication can be achieved through wired means (such as USB, network cable, etc.) or wireless means (such as mobile network, WIFI, Bluetooth, etc.).
[0126] Bus 1005 transmits information between various components of the device (e.g., processor 1001, memory 1002, input / output interface 1003, and communication interface 1004);
[0127] The processor 1001, memory 1002, input / output interface 1003 and communication interface 1004 are connected to each other within the device via bus 1005.
[0128] An embodiment of this application also provides a storage medium, which is a computer-readable storage medium. The storage medium stores one or more programs, which can be executed by one or more processors to implement the above-described method for detecting participant attention in remote conferencing applied to a server, for example, method steps S110 to S160 in Figure 1, method steps S210 to S280 in Figure 2, method steps S310 to S350 in Figure 3, method steps S410 to S420 in Figure 4, method steps S510 to S520 in Figure 5, and method step S610 in Figure 6, or to implement the above-described method for detecting participant attention in remote conferencing applied to a management terminal, for example, method steps S710 to S720 in Figure 7.
[0129] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs. Furthermore, memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory may optionally include memory remotely located relative to the processor, and these remote memories can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
[0130] The present application provides a method, apparatus, and storage medium for detecting participant attention in remote conferencing. Applied to a server, the method: receives video and audio streams from multiple user terminals; determines gazed screen frames among the video frames of the video stream based on image detection processing; determines speaking frames among the audio frames of the audio stream based on voice endpoint detection processing, where the audio frames correspond one-to-one with the video frames and each audio frame is time-synchronized with its corresponding video frame; obtains the participant attention level of each user terminal, based on preset calculation rules, according to the video frames, gazed screen frames, audio frames, and speaking frames; generates an attention report based on the participant attention level of each user terminal; and, in response to a report query request from the management terminal, sends the attention report to the management terminal. On this basis, the server receives video and audio streams from each user terminal; for each pair of streams, it applies image detection to the video frames to determine the gazed screen frames and voice endpoint detection to the audio frames to determine the speaking frames. It then calculates the participant attention level of each user terminal according to the calculation rules, generates an attention report, and sends the report to the management terminal, thus enabling accurate acquisition of the attention level of all participants in remote meetings.
[0131] The embodiments described in this application are for the purpose of more clearly illustrating the technical solutions of the embodiments of this application, and do not constitute a limitation on the technical solutions provided by the embodiments of this application. As those skilled in the art will know, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of this application are also applicable to similar technical problems.
[0132] It will be understood by those skilled in the art that the technical solutions shown in Figures 1 to 7 do not constitute a limitation on the embodiments of this application, which may include more or fewer steps than shown, combine certain steps, or use different steps.
[0133] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.
[0134] Those skilled in the art will understand that all or some of the steps in the methods disclosed above, as well as the functional modules / units in the systems and devices, can be implemented as software, firmware, hardware, or suitable combinations thereof.
[0135] The terms “first,” “second,” “third,” “fourth,” etc. (if present) in the specification and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms “comprising” and “having,” and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.
[0136] It should be understood that in this application, "at least one (item)" means one or more, and "multiple" means two or more. "And / or" is used to describe the relationship between related objects, indicating that three relationships can exist. For example, "A and / or B" can represent three cases: only A exists, only B exists, and both A and B exist simultaneously, where A and B can be singular or plural. The character " / " generally indicates that the preceding and following related objects are in an "or" relationship. "At least one (item) of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one (item) of a, b, or c can represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can be single or multiple.
[0137] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For instance, the division of the units described above is only a logical functional division, and other division methods are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
[0138] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.
[0139] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.
[0140] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes multiple instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing programs, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0141] The preferred embodiments of the present application have been described above with reference to the accompanying drawings, but this does not limit the scope of the claims of the present application. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and substance of the embodiments of the present application shall be within the scope of the claims of the present application.
Claims
1. A method for detecting participant attention in a remote conference, applied to a server, the server being communicatively connected to a plurality of user terminals and to a management terminal, characterized in that the method includes: receiving video streams and audio streams from the plurality of user terminals; determining, based on image detection processing, gazed screen frames among the video frames of the video stream; determining, based on voice endpoint detection processing, speaking frames among the audio frames of the audio stream, wherein the audio frames correspond one-to-one with the video frames, and each audio frame is time-synchronized with its corresponding video frame; obtaining, based on preset calculation rules, the participant attention level of each user terminal according to the video frames, the gazed screen frames, the audio frames, and the speaking frames; generating an attention report based on the participant attention level of each user terminal; and, in response to a report query request from the management terminal, sending the attention report to the management terminal; wherein the step of obtaining the participant attention level of each user terminal based on preset calculation rules, according to the video frames, the gazed screen frames, the audio frames, and the speaking frames, includes: determining the total duration of the audio and video based on the video frames or the audio frames; determining, based on the gazed screen frames, non-gazed screen frames among the video frames; determining, based on the speaking frames, non-speaking frames among the audio frames; determining, based on overlap processing, a first duration according to the gazed screen frames and the speaking frames, a second duration according to the speaking frames and the non-gazed screen frames, and a third duration according to the gazed screen frames and the non-speaking frames; determining a first attention value based on the total duration of the audio and video, the first duration, and a preset first weight value; determining a second attention value based on the total duration of the audio and video, the second duration, and a preset second weight value, wherein the second weight value is less than the first weight value; determining a third attention value based on the total duration of the audio and video, the third duration, and a preset third weight value, wherein the third weight value is less than the second weight value; and obtaining the participant attention level of each user terminal based on the first attention value, the second attention value, and the third attention value.
2. The method according to claim 1, characterized in that the step of generating an attention report based on the participant attention level of each user terminal includes: obtaining basic meeting information, including meeting content information, a meeting start time, and a meeting end time, wherein the meeting content information is matched with the participant attention level; determining the meeting duration based on the meeting start time and the meeting end time; determining the attention duration based on the gazed screen frames and the speaking frames; determining the percentage of attention time for each user terminal based on the attention duration and the meeting duration; and generating the attention report based on the basic meeting information, the percentage of attention time for each user terminal, and the participant attention level.
3. The method according to claim 1, characterized in that, after the step of obtaining the participant attention level of each user terminal based on preset calculation rules, according to the video frames, the gazed screen frames, the audio frames, and the speaking frames, the method further includes: generating a reminder message when it is determined that the participant attention level of any of the user terminals is less than a preset attention threshold, wherein the reminder message is matched with the user terminal; and sending the reminder message to the management terminal.
4. The method according to claim 1, characterized in that the step of determining, based on image detection processing, gazed screen frames among the video frames of the video stream includes: determining, based on liveness detection processing, active state frames among the video frames of the video stream; and determining, based on face detection processing and eye detection processing, the gazed screen frames among the active state frames.
5. The method according to claim 1, characterized in that, before the step of determining, based on image detection processing, gazed screen frames among the video frames of the video stream, the method further includes: dividing, based on a preset segmentation duration, the video stream into multiple sub-video streams and the audio stream into multiple sub-audio streams, wherein the sub-audio streams correspond one-to-one with the sub-video streams, and each sub-video stream and its corresponding sub-audio stream are used to determine the participant attention level.
6. A method for detecting participant attention in a remote conference, applied to a management terminal, wherein the management terminal is communicatively connected to a server, and the server is communicatively connected to multiple user terminals, characterized in that the method includes: sending a report query request to the server; and receiving an attention report from the server, wherein the attention report is generated from the participant attention level of each user terminal; the participant attention level is obtained by the server according to preset calculation rules, based on video frames of a video stream, audio frames of an audio stream, gaze screen frames, and speaking frames; the video stream and the audio stream are sent to the server by each user terminal; the gaze screen frames are determined by the server among the video frames of the video stream based on image detection processing; the speaking frames are determined by the server among the audio frames of the audio stream based on voice endpoint detection processing; and the audio frames correspond one-to-one with the video frames, each audio frame being time-synchronized with its corresponding video frame.
The server is configured to: determine a total audio and video duration based on the video frames or the audio frames; determine non-gaze screen frames among the video frames based on the gaze screen frames; determine non-speaking frames among the audio frames based on the speaking frames; determine, based on overlap processing, a first duration according to the gaze screen frames and the speaking frames, a second duration according to the speaking frames and the non-gaze screen frames, and a third duration according to the gaze screen frames and the non-speaking frames; determine a first attention value based on the total audio and video duration, the first duration, and a preset first weight value; determine a second attention value based on the total audio and video duration, the second duration, and a preset second weight value, wherein the second weight value is less than the first weight value; determine a third attention value based on the total audio and video duration, the third duration, and a preset third weight value, wherein the third weight value is less than the second weight value; and obtain the participant attention level of each user terminal based on the first attention value, the second attention value, and the third attention value.
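The weighted calculation recited for the server can be illustrated with a short sketch. The claim fixes only the ordering of the three weights and the inputs to each attention value, so the sketch makes three assumptions: durations are measured in frames, each attention value is the weight multiplied by the ratio of its duration to the total duration, and the participant attention level is the sum of the three values. The default weights are illustrative only.

```python
def participant_attention(gaze_flags, speech_flags, weights=(1.0, 0.5, 0.25)):
    """Weighted attention level from per-frame gaze and speech flags."""
    if len(gaze_flags) != len(speech_flags) or len(gaze_flags) == 0:
        raise ValueError("need the same non-zero number of video and audio frames")
    w1, w2, w3 = weights
    if not (w1 > w2 > w3):
        raise ValueError("weights must satisfy first > second > third")

    total = len(gaze_flags)  # total audio and video duration, in frames
    pairs = list(zip(gaze_flags, speech_flags))

    # Overlap processing: classify each synchronized frame pair.
    first = sum(1 for g, s in pairs if g and s)       # gazing and speaking
    second = sum(1 for g, s in pairs if s and not g)  # speaking, not gazing
    third = sum(1 for g, s in pairs if g and not s)   # gazing, not speaking

    # Assumption: value = weight * duration / total; level = sum of values.
    values = [w * d / total for w, d in zip((w1, w2, w3), (first, second, third))]
    return sum(values)
```

Frames in which the participant is neither gazing at the screen nor speaking contribute nothing, so a participant who silently watches the screen still receives credit through the third term.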
7. A device for detecting participant attention in a remote conference, applied to a server, wherein the server is communicatively connected to multiple user terminals and to a management terminal, characterized in that the device includes: an acquisition unit configured to receive video streams and audio streams from the multiple user terminals; a video processing unit configured to determine gaze screen frames among the video frames of the video stream based on image detection processing; an audio processing unit configured to determine speaking frames among the audio frames of the audio stream based on voice endpoint detection processing, wherein the audio frames correspond one-to-one with the video frames, each audio frame being time-synchronized with its corresponding video frame; a calculation unit configured to obtain the participant attention level of each user terminal according to preset calculation rules, based on the video frames, the gaze screen frames, the audio frames, and the speaking frames; a generation unit configured to generate an attention report based on the participant attention level of each user terminal; and a sending unit configured to send the attention report to the management terminal in response to a report query request from the management terminal. Obtaining the participant attention level of each user terminal according to the preset calculation rules, based on the video frames, the gaze screen frames, the audio frames, and the speaking frames, includes the following. A total audio and video duration is determined based on the video frames or the audio frames.
Non-gaze screen frames are determined among the video frames based on the gaze screen frames. Non-speaking frames are determined among the audio frames based on the speaking frames. Based on overlap processing, a first duration is determined according to the gaze screen frames and the speaking frames, a second duration is determined according to the speaking frames and the non-gaze screen frames, and a third duration is determined according to the gaze screen frames and the non-speaking frames. A first attention value is determined based on the total audio and video duration, the first duration, and a preset first weight value. A second attention value is determined based on the total audio and video duration, the second duration, and a preset second weight value, wherein the second weight value is less than the first weight value. A third attention value is determined based on the total audio and video duration, the third duration, and a preset third weight value, wherein the third weight value is less than the second weight value. The participant attention level of each user terminal is obtained based on the first attention value, the second attention value, and the third attention value.
8. An electronic device, characterized in that the electronic device includes a memory, a processor, a program stored in the memory and executable on the processor, and a data bus enabling communication between the processor and the memory, wherein the program, when executed by the processor, implements the method for detecting participant attention in a remote conference according to any one of claims 1 to 5.
9. A computer-readable storage medium, characterized in that the storage medium stores one or more programs executable by one or more processors to implement the method for detecting participant attention in a remote conference according to any one of claims 1 to 5, or the method for detecting participant attention in a remote conference according to claim 6.