A method and apparatus for classroom analysis

By segmenting classroom audio data using voiceprint recognition technology, determining the number of audio switching between students and teachers, and constructing a classroom interaction structure, this approach solves the problem of limited dimensions in existing classroom interaction analysis, enabling precise analysis of student participation and teaching optimization.

CN115547304B · Active · Publication Date: 2026-06-16 · HANGZHOU HIKVISION DIGITAL TECHNOLOGY CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: HANGZHOU HIKVISION DIGITAL TECHNOLOGY CO LTD
Filing Date: 2022-09-01
Publication Date: 2026-06-16

IPC: G10L15/04; G10L15/02; G10L15/08; G10L21/0272; G10L25/51; G10L13/06; G10L13/02; G06Q50/20

CPC: G10L15/04; G10L15/02; G10L15/08; G10L25/51; G10L13/06; G10L13/02; G10L21/0272; G06Q50/205

AI Tagging

Application Domain

Data processing applications; Speech recognition

AI Technical Summary

⚠Technical Problem

Existing speech-based classroom analysis methods only determine teaching patterns by analyzing the duration of teacher speech. This approach is relatively one-dimensional and cannot comprehensively and accurately describe classroom interactions, thus limiting its reference value.

⚗Method used

By segmenting classroom audio data using voiceprint recognition technology, the number of audio switching between students and teachers, as well as among students, is determined, and a classroom interaction structure is constructed to analyze student participation.

🎯Benefits of technology

It enables precise analysis of student participation in the classroom, assisting teachers in providing targeted instruction.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides a classroom analysis method and device, relates to the technical field of speech recognition, and can accurately analyze the degree of student participation in a classroom by determining the number of switches between audio emitted by two students or between audio emitted by a student and a teacher in the classroom. The classroom analysis method comprises the following steps: acquiring classroom audio data; performing voiceprint recognition on the classroom audio data to segment the audio data of each sound source individual in the classroom, wherein the sound source individuals comprise a teacher and / or a student; determining a first switching count and a second switching count according to the audio data of each sound source individual in the classroom; and determining a classroom interaction structure of the classroom according to the first switching count and the second switching count, wherein the classroom interaction structure is used to indicate the interaction between the teacher and the students in the classroom.


Description

Technical Field

[0001] This application relates to the field of speech recognition technology, and in particular to a classroom analysis method and apparatus.

Background Technology

[0002] With the rapid development of educational informatization, the integration of teaching activities and artificial intelligence technology is becoming increasingly close. Objective classroom and teaching quality analysis and evaluation based on signal processing and artificial intelligence technologies has become a new trend. Timely teaching analysis is of great significance for educators to improve teaching quality. It helps educators to summarize and correct problems and shortcomings in the teaching process, thereby implementing in-depth, direct, and effective teaching activities, which is conducive to improving teaching quality.

[0003] Currently, speech-based classroom analysis methods often analyze the duration of teachers' and students' speech to assess classroom teaching patterns. However, this approach is relatively one-dimensional and cannot comprehensively and accurately describe classroom interactions. Consequently, the results offer limited value for improving teaching activities.

Summary of the Invention

[0004] This application provides a classroom analysis method and apparatus that can accurately analyze student participation in the classroom by determining the number of times audio is switched between two students or between a student and a teacher.

[0005] To achieve the above technical objectives, this application adopts the following technical solution:

[0006] In a first aspect, this application provides a classroom analysis method, which includes: acquiring classroom audio data; performing voiceprint recognition on the classroom audio data to segment the audio data of each individual sound source in the classroom, the individual sound sources including teachers or students; determining a first switching count and a second switching count based on the audio data of each individual sound source in the classroom, wherein the first switching count indicates the number of switches between audio emitted by students and audio emitted by teachers in the classroom, and the second switching count indicates the number of switches between audio emitted by two students in the classroom; and determining the classroom interaction structure based on the first switching count and the second switching count, the classroom interaction structure indicating the interaction between teachers and students in the classroom.

[0007] The classroom analysis method provided in this application has at least the following beneficial effects: by performing voiceprint recognition on classroom audio data, the method can accurately identify the individual to whom each piece of audio data belongs. The method can therefore determine the number of switches between student audio and teacher audio, as well as the number of switches between the audio of two students in the classroom. This allows student participation to be analyzed precisely, makes the classroom analysis more accurate, and helps teachers teach in a targeted way.

[0008] In one possible implementation, determining the first switching count and the second switching count based on the audio data of each individual sound source in the classroom includes: determining the temporal order of the audio data of the individual sound sources based on the start time of the audio data of each individual sound source in the classroom; and determining the first switching count and the second switching count based on that temporal order and the individual sound source to which each piece of audio data belongs.
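The counting step in [0008] can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Segment` type, its field names, and the rule that consecutive segments from the same speaker do not count as a switch are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One speaker-attributed audio segment (hypothetical structure)."""
    speaker_id: str  # label produced by voiceprint recognition
    role: str        # "teacher" or "student"
    start: float     # start time in seconds


def count_switches(segments):
    """Return (first_count, second_count).

    first_count:  switches between teacher audio and student audio
    second_count: switches between the audio of two different students
    """
    # Temporal order is derived from the start time of each segment.
    ordered = sorted(segments, key=lambda s: s.start)
    first = second = 0
    for prev, cur in zip(ordered, ordered[1:]):
        if prev.speaker_id == cur.speaker_id:
            continue  # same speaker continues: assumed not to be a switch
        if {prev.role, cur.role} == {"teacher", "student"}:
            first += 1
        elif prev.role == cur.role == "student":
            second += 1
    return first, second


segments = [
    Segment("T", "teacher", 0.0),
    Segment("S1", "student", 12.5),
    Segment("S2", "student", 20.0),
    Segment("T", "teacher", 31.0),
    Segment("T", "teacher", 40.0),
    Segment("S1", "student", 55.0),
]
first_count, second_count = count_switches(segments)
print(first_count, second_count)  # 3 1
```

Sorting by start time before counting means the result does not depend on the order in which the recognizer emitted the segments.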

[0009] In another possible implementation, the above-mentioned voiceprint recognition of classroom audio data to segment the audio data of each individual sound source in the classroom, including teachers or students, includes: performing voiceprint recognition on classroom audio data to obtain audio data from different individual sound sources; and determining whether each individual sound source is a teacher or a student based on preset rules, wherein the preset rules are to determine the oldest individual sound source as a teacher, the individual sound source with the longest total audio data duration as a teacher, or the individual sound source of the first audio data in the classroom as a teacher.
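Two of the preset rules in [0009] (longest total duration, first speaker) can be sketched as below. The tuple layout `(speaker_id, start, end)` and all function names are assumptions for illustration; the age-based rule is omitted because the text does not say how age would be estimated.

```python
from collections import defaultdict


def teacher_by_longest_duration(segments):
    """Pick the speaker with the longest total speaking time."""
    totals = defaultdict(float)
    for speaker_id, start, end in segments:
        totals[speaker_id] += end - start
    return max(totals, key=totals.get)


def teacher_by_first_speaker(segments):
    """Pick the speaker of the earliest segment in the class."""
    return min(segments, key=lambda seg: seg[1])[0]


def assign_roles(segments, rule=teacher_by_longest_duration):
    """Map every speaker id to "teacher" or "student" under one rule."""
    teacher = rule(segments)
    return {sid: ("teacher" if sid == teacher else "student")
            for sid, _, _ in segments}


# (speaker_id, start_seconds, end_seconds)
segs = [("A", 0.0, 5.0), ("B", 5.0, 30.0), ("A", 30.0, 40.0), ("C", 40.0, 45.0)]
print(assign_roles(segs))                            # B is the teacher
print(assign_roles(segs, teacher_by_first_speaker))  # A is the teacher
```

The two rules can disagree on the same recording, as the example shows, so a deployment would have to fix one rule in advance.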

[0010] In another possible implementation, the above-mentioned voiceprint recognition of classroom audio data to segment the audio data of each individual sound source in the classroom includes: segmenting the classroom audio data into multiple audio segments, wherein each audio segment belongs to a single individual sound source; and identifying the one or more audio segments of each individual sound source among the multiple audio segments.

[0011] In another possible implementation, the above determination of one or more audio segments for each sound source individual among multiple audio segments includes: comparing a first audio segment with historically recognized speech to determine whether the first audio segment and the historically recognized speech belong to the same sound source individual; wherein the first audio segment is one of multiple audio segments, and the historically recognized speech is previously recognized speech information.
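The idea in [0010] and [0011], attributing each segment by comparing it with previously recognized speech, can be sketched as a simple online assignment. This is an assumption-laden illustration: the segments are taken to be already represented as fixed-length voiceprint vectors, cosine similarity stands in for the comparison, and the threshold value and function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_speakers(embeddings, threshold=0.8):
    """Label each segment with the index of the sound source it matches.

    history holds one reference vector per sound source seen so far;
    a segment that matches nothing in history starts a new sound source.
    """
    history = []
    labels = []
    for emb in embeddings:
        scores = [cosine(emb, ref) for ref in history]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            labels.append(best)
        else:
            history.append(emb)
            labels.append(len(history) - 1)
    return labels


embeddings = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
print(assign_speakers(embeddings))  # [0, 1, 0, 1]
```

Keeping only the first vector per sound source is the simplest choice; averaging matched vectors into the reference would be a natural refinement.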

[0012] In another possible implementation, the above-mentioned comparison of the first audio segment with historical recognized speech to determine whether the first audio segment and the historical recognized speech belong to the same sound source individual includes: performing recognition processing on the first audio segment to obtain the first voiceprint feature and hierarchical speech information of the first audio segment, wherein the hierarchical speech information includes word-level speech information, character-level speech information and phoneme-level speech information of the first audio segment; splicing the hierarchical speech information of the first audio segment according to the historical recognized speech to obtain spliced speech; performing recognition processing on the spliced speech to obtain the second voiceprint feature of the spliced speech; and comparing the first voiceprint feature, the second voiceprint feature and the third voiceprint feature of the historical recognized speech to determine whether the first audio segment and the historical recognized speech belong to the same sound source individual.

[0013] In another possible implementation, the above-mentioned comparison of the first voiceprint feature, the second voiceprint feature, and the third voiceprint feature of the historical recognized speech to determine whether the first audio segment and the historical recognized speech belong to the same sound source individual includes: comparing the first voiceprint feature and the second voiceprint feature to obtain a first similarity parameter; comparing the first voiceprint feature and the third voiceprint feature to obtain a second similarity parameter; determining the third similarity parameter between the first audio segment and the historical recognized speech based on the first similarity parameter and the second similarity parameter; and determining that the first audio segment and the historical recognized speech belong to the same sound source individual when the third similarity parameter is greater than or equal to a preset threshold.
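The decision procedure in [0013] can be sketched as below. The way the first and second similarity parameters are combined is deliberately left as a pluggable `combine` function: the default (a plain average) is a placeholder assumption and is not the relationship claimed in the patent. Cosine similarity, the threshold value, and all names are likewise hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def same_source(first_vp, second_vp, third_vp, threshold=0.7,
                combine=lambda s1, s2: (s1 + s2) / 2):
    """Decide whether a segment and the historical speech share a source.

    first_vp:  voiceprint feature of the first audio segment
    second_vp: voiceprint feature of the spliced speech
    third_vp:  voiceprint feature of the historically recognized speech
    combine:   placeholder for the patent's combining relationship
    """
    s1 = cosine(first_vp, second_vp)  # first similarity parameter
    s2 = cosine(first_vp, third_vp)   # second similarity parameter
    s3 = combine(s1, s2)              # third similarity parameter
    return s3 >= threshold, s3


print(same_source([1.0, 0.0], [1.0, 0.0], [1.0, 0.0]))  # (True, 1.0)
print(same_source([1.0, 0.0], [0.0, 1.0], [0.0, 1.0]))  # (False, 0.0)
```

Passing a different `combine` callable, for example one that also closes over the segment duration, lets the same skeleton be reused once the actual relationship is known.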

[0014] In yet another possible implementation, the third similarity parameter mentioned above satisfies the following relationship:

[0015]

[0016] Where s is the third similarity parameter, The first similarity parameter, α is the second similarity parameter, d is the duration of the first audio segment, and α is the preset parameter value.

[0017] In a second aspect, this application provides a classroom analysis device, comprising: an acquisition module for acquiring classroom audio data; an identification module for performing voiceprint recognition on the classroom audio data to segment the audio data of each individual sound source in the classroom, the individual sound sources including teachers and / or students; and a processing module for determining a first switching count and a second switching count based on the audio data of each individual sound source in the classroom, wherein the first switching count indicates the number of switches between audio emitted by students and audio emitted by teachers in the classroom, and the second switching count indicates the number of switches between audio emitted by two students in the classroom. The processing module is further configured to determine the classroom interaction structure based on the first switching count and the second switching count, the classroom interaction structure indicating the interaction between teachers and students in the classroom.

[0018] In one possible implementation, the above processing module is specifically used to: determine the temporal order of the audio data of the sound source individuals based on the start time of the audio data of each sound source individual in the classroom; and determine the first switching count and the second switching count based on that temporal order and the sound source individual to which each piece of audio data belongs.

[0019] In another possible implementation, the above processing module is further used to: perform voiceprint recognition on classroom audio data to obtain audio data from different sound source individuals; and determine whether each sound source individual is a teacher or a student based on preset rules, wherein the preset rules are to determine the oldest sound source individual as a teacher, the sound source individual with the longest total audio data duration as a teacher, or the sound source individual of the first audio data in the classroom as a teacher.

[0020] In another possible implementation, the aforementioned identification module is specifically used to: segment classroom audio data into multiple audio segments; wherein an audio segment belongs to the same sound source individual; and identify one or more audio segments of each sound source individual among the multiple audio segments.

[0021] In another possible implementation, the aforementioned recognition module is further specifically used to: compare the first audio segment with historically recognized speech to determine whether the first audio segment and historically recognized speech belong to the same sound source; wherein the first audio segment is one of multiple audio segments, and the historically recognized speech is speech information that has been recognized in the past.

[0022] In another possible implementation, the above-mentioned device further includes a splicing module. The identification module is specifically used to perform identification processing on the first audio segment to obtain the first voiceprint feature and hierarchical speech information of the first audio segment, wherein the hierarchical speech information includes word-level speech information, character-level speech information and phoneme-level speech information of the first audio segment. The splicing module is used to splice the hierarchical speech information of the first audio segment according to the historical identified speech to obtain spliced speech. The identification module is also used to perform identification processing on the spliced speech to obtain the second voiceprint feature of the spliced speech, and to compare the first voiceprint feature, the second voiceprint feature and the third voiceprint feature of the historical identified speech to determine whether the first audio segment and the historical identified speech belong to the same sound source.

[0023] In another possible implementation, the aforementioned recognition module is specifically used to: compare the first voiceprint feature with the second voiceprint feature to obtain a first similarity parameter; compare the first voiceprint feature with the third voiceprint feature to obtain a second similarity parameter; determine a third similarity parameter between the first audio segment and the historical recognized speech based on the first similarity parameter and the second similarity parameter; and determine that the first audio segment and the historical recognized speech belong to the same sound source when the third similarity parameter is greater than or equal to a preset threshold.

[0024] In yet another possible implementation, the third similarity parameter mentioned above satisfies the following relationship:

[0025]

[0026] Where s is the third similarity parameter, The first similarity parameter, α is the second similarity parameter, d is the duration of the first audio segment, and α is the preset parameter value.

[0027] In a third aspect, this application provides an electronic device including a memory and a processor. The memory and processor are coupled. The memory stores computer program code, including computer instructions. When the processor executes the computer instructions, it causes the electronic device to perform the method described in the first aspect and any possible implementation thereof.

[0028] In a fourth aspect, this application provides a chip system applied to an electronic device; the chip system includes one or more interface circuits and one or more processors. The interface circuits and processors are interconnected via lines; the interface circuits are used to receive signals from the electronic device's memory and send signals to the processor, the signals including computer instructions stored in the memory. When the processor executes the computer instructions, it causes the electronic device to perform the method provided in the first aspect above.

[0029] In a fifth aspect, this application provides a computer-readable storage medium storing computer instructions that, when executed on an electronic device, cause the electronic device to perform the method provided in the first aspect above.

[0030] In a sixth aspect, this application provides a computer program product including computer instructions that, when executed on an electronic device, cause the electronic device to perform the method provided in the first aspect above.

[0031] For a detailed description of aspects two through six and their various implementations in this application, please refer to the detailed description in aspect one and its various implementations; and for a detailed description of the beneficial effects of aspects two through six and their various implementations, please refer to the beneficial effect analysis in aspect one and its various implementations, which will not be repeated here.

Attached Figure Description

[0032] Figure 1 is a schematic diagram of a classroom analysis system provided in an embodiment of this application;

[0033] Figure 2 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of this application;

[0034] Figure 3 is a flowchart of a classroom analysis method provided in an embodiment of this application;

[0035] Figure 4 is a flowchart of a method for identifying the sound source individual of speech information, provided in an embodiment of this application;

[0036] Figure 5 is a schematic diagram of a voiceprint recognition process provided in an embodiment of this application;

[0037] Figure 6 is a schematic diagram of the composition of a classroom analysis device provided in an embodiment of this application.

Detailed Implementation

[0038] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0039] In the description of this application, unless otherwise stated, " / " means "or," for example, A / B can mean A or B. The "and / or" in this document is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A alone, A and B simultaneously, and B alone. Furthermore, "at least one" means one or more, and "multiple" means two or more. The terms "first," "second," etc., do not limit the quantity or order of execution, and "first," "second," etc., do not necessarily imply differences.

[0040] It should be noted that, in this application, the terms "exemplary" or "for example" are used to indicate that something is being described as an example, illustration, or explanation. Any embodiment or design described as "exemplary" or "for example" in this application should not be construed as being more preferred or advantageous than other embodiments or design solutions. Specifically, the use of terms such as "exemplary" or "for example" is intended to present the relevant concepts in a concrete manner.

[0041] To facilitate understanding, we will first provide a brief introduction and explanation of some terms or basic concepts of technology involved in the embodiments of the present invention.

[0042] 1. Voiceprint

[0043] Voiceprints are sound wave spectra that carry speech information and are displayed using electroacoustic instruments. Modern scientific research shows that voiceprints are not only specific but also relatively stable. After adulthood, a person's voice can remain relatively stable for a long period. This is because the organs controlling vocal production include the vocal cords, soft palate, tongue, teeth, and lips; the vocal resonators include the pharynx, oral cavity, and nasal cavity. Even slight differences in these organs can lead to changes in airflow during vocalization, resulting in differences in tone quality and timbre. Furthermore, people's vocalization habits vary in speed and force, also causing differences in intensity and duration. Pitch, intensity, duration, and timbre are known as the "four elements" of speech in linguistics, and these factors can be further broken down into more than ninety characteristics. These characteristics represent the different wavelengths, frequencies, intensities, and rhythms of different sounds. Speech mapping instruments can convert changes in sound waves into changes in the intensity, wavelength, frequency, and rhythm of electrical signals. The instrument then plots these changes in electrical signals as a spectrum, which becomes the voiceprint. Therefore, there will be significant differences between the voiceprints of two different people.

[0044] 2. Voiceprint Recognition (VPR)

[0045] Voiceprint recognition is a type of biometric technology. Also known as speaker recognition, it includes speaker identification and speaker verification. Voiceprint recognition involves converting sound signals into electrical signals, which are then processed by a computer. Different tasks and applications utilize different voiceprint recognition technologies.

[0046] Voiceprint recognition has two types: text-dependent and text-independent.

[0047] Text-dependent voiceprint recognition technology means that, when a user's voiceprint model is built, the user's speech is acquired according to pre-defined content. The speech content used during recognition is also pre-defined. Text-dependent voiceprint recognition can therefore achieve relatively good recognition results. However, with this technology, if what the user says does not match the pre-defined content, the user cannot be correctly identified.

[0048] Text-independent voiceprint recognition technology refers to a technology that does not specify the user's pronunciation content. Model building is relatively difficult, but it is convenient for users and has a wide range of applications.

[0049] 3. Automatic speech recognition (ASR)

[0050] Speech recognition technology, also known as Automatic Speech Recognition (ASR), aims to convert the lexical content of human speech into computer-readable input, such as keystrokes, binary codes, or character sequences. Speech recognition is a multidisciplinary field, closely linked to acoustics, phonetics, linguistics, digital signal processing theory, information theory, computer science, and many other disciplines.

[0051] The above is an introduction to the technical terms involved in the embodiments of this application, which will not be repeated below.

[0052] As described in the background section, existing speech-based classroom analysis methods detect active speech in classroom audio and then use artificial intelligence to extract the interactive speech data between teachers and students. The teaching mode is then classified as practice-based, lecture-based, or blended based solely on the proportion of teacher speaking time to total class time. Because this method relies on that single proportion, its basis for judgment is one-dimensional.
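For contrast, the single-ratio baseline described in [0052] amounts to something like the following sketch. The cut-off values `lecture_min` and `practice_max` are hypothetical; the text does not state which thresholds existing methods use.

```python
def teaching_mode(teacher_seconds, class_seconds,
                  lecture_min=0.7, practice_max=0.3):
    """Classify a class from the teacher's share of total class time.

    lecture_min and practice_max are illustrative thresholds only.
    """
    if class_seconds <= 0:
        raise ValueError("class_seconds must be positive")
    ratio = teacher_seconds / class_seconds
    if ratio >= lecture_min:
        return "lecture-based"
    if ratio <= practice_max:
        return "practice-based"
    return "blended"


print(teaching_mode(1800, 2400))  # lecture-based
print(teaching_mode(1200, 2400))  # blended
```

Two classes with the same ratio receive the same label regardless of how often speakers alternate, which is the limitation the switching counts are meant to address.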

[0053] In view of this, embodiments of this application provide a classroom analysis method, which specifically includes: acquiring classroom audio data; performing voiceprint recognition on the classroom audio data to segment the audio data of each individual sound source in the classroom, wherein the individual sound source includes a teacher or a student; determining a first switching count and a second switching count based on the audio data of each individual sound source in the classroom, wherein the first switching count is used to indicate the number of switchings between the audio emitted by the student and the audio emitted by the teacher in the classroom, and the second switching count is used to indicate the number of switchings between the audio emitted by two students in the classroom; and determining the classroom interaction structure based on the first switching count and the second switching count.

[0054] Based on this, the level of student participation in class can be accurately analyzed by determining the number of times audio is switched between two students or between a student and a teacher.

[0055] This application provides a classroom analysis system, and the above-described method can be applied to this system.

[0056] As shown in Figure 1, the system 100 may include an audio acquisition device 10 and an audio analysis device 20. Optionally, the system 100 may also include a result output device 30.

[0057] The audio acquisition device 10 can be a pickup, microphone, or other device with audio acquisition capabilities. The audio acquisition device 10 may include a multi-array pickup, installed in the classroom, to collect audio data from the entire class. This audio data may include voice data emitted by the teacher and voice data emitted by the students.

[0058] The audio analysis device 20 can be an electronic device with audio data processing capabilities. The audio analysis device 20 can be used to analyze and process received audio data, obtain a first switching count and a second switching count, and analyze the student activity level in the classroom based on the first switching count and the second switching count.

[0059] Optionally, the audio analysis device 20 can be the management server of the classroom analysis system. It can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks, and big data servers.

[0060] In some embodiments, the audio analysis device 20 may be integrated into the audio acquisition device 10, or the audio analysis device 20 may be set up independently from the audio acquisition device 10.

[0061] The result output device 30 can be a user device used to output the student activity level of the classroom, as analyzed by the audio analysis device 20 based on the first and second switching counts, so that educators can refer to it and make timely improvements.

[0062] Optionally, the result output device 30 can be a mobile phone, tablet computer, desktop computer, laptop computer, handheld computer, notebook computer, ultra-mobile personal computer (UMPC), netbook, cellular phone, personal digital assistant (PDA), augmented reality (AR) / virtual reality (VR) device, or other terminal device. This application does not impose any special restrictions on the specific form of the electronic device.

[0063] In some embodiments, the audio analysis device 20 and the result output device 30 are integrated into one device, or the audio analysis device 20 and the result output device 30 are two independent devices.

[0064] This application also provides a classroom analysis device (hereinafter referred to as the classroom analysis device for ease of description), which is the executing entity of the above-described classroom analysis method. The classroom analysis device can be an electronic device with data processing capabilities, or a functional module within such an electronic device; there is no limitation on this. For example, the electronic device can be the audio analysis device 20 in the above-described classroom analysis system 100.

[0065] Taking the case where the classroom analysis device is an electronic device as an example, the following describes a hardware structure of the electronic device 200 with reference to Figure 2.

[0066] As shown in Figure 2, the electronic device 200 includes a processor 210, a communication line 220, and a communication interface 230.

[0067] Optionally, the electronic device 200 may also include a memory 240. The processor 210, memory 240, and communication interface 230 can be connected via a communication line 220.

[0068] The processor 210 can be a central processing unit (CPU), a network processor (NP), a digital signal processor (DSP), a microprocessor, a microcontroller, a programmable logic device (PLD), or any combination thereof. The processor 210 can also be any other device with processing capabilities, such as a circuit, device, or software module, without limitation.

[0069] In one example, processor 210 may include one or more CPUs, for example, CPU0 and CPU1 in Figure 2.

[0070] As an optional implementation, electronic device 200 may include multiple processors, for example, in addition to processor 210, it may also include processor 270. Communication line 220 is used to transmit information between the components included in electronic device 200.

[0071] Communication interface 230 is used for communicating with other devices or other communication networks. These other communication networks can be Ethernet, radio access network (RAN), wireless local area networks (WLAN), etc. Communication interface 230 can be a module, circuit, transceiver, or any device capable of enabling communication.

[0072] The memory 240 is used to store instructions. These instructions can be computer programs.

[0073] The memory 240 may be a read-only memory (ROM) or other type of static storage device capable of storing static information and / or instructions; it may also be a random access memory (RAM) or other type of dynamic storage device capable of storing information and / or instructions; it may also be an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media, or other magnetic storage devices, etc., without limitation.

[0074] It should be noted that the memory 240 can exist independently of the processor 210, or it can be integrated with the processor 210. The memory 240 can be used to store instructions, program code, or some data, etc. The memory 240 can be located inside or outside the electronic device 200, without restriction.

[0075] Processor 210 is configured to execute instructions stored in memory 240 to implement the methods provided in the following embodiments of this application. For example, when electronic device 200 is a terminal, or a chip or system-on-a-chip in a terminal, processor 210 can execute instructions stored in memory 240 to implement the classroom analysis method provided in this application.

[0076] As an optional implementation, the electronic device 200 also includes an output device 250 and an input device 260. The output device 250 can be a display screen, speaker, or other device capable of outputting data from the electronic device 200 to the user. For example, the output device 250 can output classroom analysis results.

[0077] The input device 260 can be a keyboard, mouse, microphone, joystick, or other device capable of inputting data to the electronic device 200.

[0078] It should be pointed out that the structure shown in Figure 2 does not constitute a limitation on the electronic device 200. In addition to the components shown in Figure 2, the electronic device 200 may include more or fewer components than illustrated, or combine certain components, or have different component arrangements.

[0079] Furthermore, this application also provides a device for identifying the sound source individual of speech information (hereinafter referred to as a speech information recognition device for ease of description). This speech information recognition device is used to recognize classroom audio data and identify sound source individuals based on voiceprint features. Moreover, this speech information recognition device can be an electronic device with data processing capabilities, or a functional module within that electronic device; there is no limitation on this. Optionally, the hardware structure of the speech information recognition device can also be as shown in Figure 2.

[0080] In some embodiments, the classroom analysis device and the voice information recognition device may be integrated into one device; or, the classroom analysis device and the voice information recognition device may be two separate devices.

[0081] The embodiments provided in this application will now be described in detail with reference to the accompanying drawings.

[0082] As shown in Figure 3, an embodiment of this application provides a classroom analysis method. Optionally, this method is executed by the electronic device 200 shown in Figure 2, and includes the following steps:

[0083] S101, Classroom analysis device acquires classroom audio data.

[0084] The aforementioned classroom audio data can be complete audio data from any single lesson, allowing the classroom analysis device to analyze the effectiveness of the lesson.

[0085] In addition, classroom audio data may include voice data emitted by the teacher and / or voice data emitted by the students.

[0086] It should be noted that the aforementioned classroom audio data may include both voice data emitted by the teacher and voice data emitted by the students. Alternatively, the aforementioned classroom audio data may include only voice data emitted by the teacher. Or, the aforementioned classroom audio data may include only voice data emitted by the students.

[0087] Optionally, the classroom analysis device can acquire classroom audio data in real time using an audio acquisition device installed in the classroom. Alternatively, the audio acquisition device can store the audio data from multiple lessons in a database, allowing the classroom analysis device to extract the audio data from any single lesson for which analysis is needed.

[0088] In some embodiments, the classroom analysis device can also perform noise reduction processing on the classroom audio data to obtain noise-reduced audio, thereby improving the accuracy of the analysis.

[0089] S102. The classroom analysis device performs voiceprint recognition on the classroom audio data and segments out the audio data of each individual sound source in the classroom.

[0090] Optionally, the classroom analysis device can segment classroom audio data into multiple audio segments, where each audio segment includes speech from the same individual source.

[0091] Optionally, the classroom analysis device can also compare the first audio segment with historical recognized speech to determine whether each audio segment belongs to the same sound source.

[0092] Since the classroom audio data includes voice data emitted by the teacher and / or students, the aforementioned sound source individuals include teachers or students. Typically, the complete classroom audio data of a lesson involves one teacher and one or more students. Furthermore, the first audio segment is one of the multiple audio segments, and the historically recognized speech refers to speech information previously recognized by the classroom analysis device. In this embodiment, the historically recognized speech can be the speech information that the classroom analysis device had already recognized before recognizing the first audio segment.

[0093] In some embodiments, the classroom analysis device can determine that each of the aforementioned sound sources is a teacher or a student based on preset rules.

[0094] Optionally, the above-mentioned preset rules can be set based on the actual situation of the teacher or classroom, and the preset rules have at least the following possible examples:

[0095] In one example, a preset rule could be to identify the oldest sound source individual as the teacher, thus classifying all other sound source individuals as students. It should be understood that the classroom analysis device can use voiceprint recognition to estimate the gender, age, dialect (region of residence), and other characteristics of each sound source individual. Therefore, the device can distinguish between the teacher and the students based on the age of each sound source individual.

[0096] In another example, the preset rule could be to identify the individual with the longest total audio data duration as the teacher, thus classifying all other individual audio sources identified by the classroom analysis device as students. Specifically, after determining whether each audio segment belongs to the same individual audio source, the classroom analysis device can calculate the total duration of audio segments belonging to the same individual audio source, thereby obtaining the speaking time of each individual audio source in the classroom.

[0097] In another example, the preset rule could be to identify the sound source of the first audio data in the classroom as the teacher, so that all other sound sources identified by the classroom analysis device are students.
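As a minimal sketch of the duration-based and first-speaker rules described above, the following Python code assumes that segmentation has already produced `(speaker_id, start_seconds, end_seconds)` tuples. The tuple layout, the speaker identifiers, and the function names are illustrative assumptions rather than anything specified in this application.

```python
from collections import defaultdict

def total_durations(segments):
    """Sum the speaking time of each sound source individual."""
    totals = defaultdict(float)
    for speaker, start, end in segments:
        totals[speaker] += end - start
    return dict(totals)

def teacher_by_longest_duration(segments):
    """Rule: the individual with the longest total audio duration is the teacher."""
    totals = total_durations(segments)
    return max(totals, key=totals.get)

def teacher_by_first_speaker(segments):
    """Rule: the individual who produced the first audio in the lesson is the teacher."""
    return min(segments, key=lambda seg: seg[1])[0]

def assign_roles(segments, teacher_id):
    """Label the teacher as "T" and every other individual as "S"."""
    speakers = {seg[0] for seg in segments}
    return {spk: ("T" if spk == teacher_id else "S") for spk in speakers}

# Hypothetical segments: (speaker_id, start_seconds, end_seconds)
segments = [
    ("spk_b", 0.0, 4.0),
    ("spk_a", 4.0, 30.0),
    ("spk_b", 30.0, 33.0),
    ("spk_c", 33.0, 36.0),
]

print(teacher_by_longest_duration(segments))  # spk_a
print(teacher_by_first_speaker(segments))     # spk_b
```

In this made-up example the two rules pick different individuals, which illustrates why the rule would need to be chosen to suit the actual classroom.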

[0098] In some embodiments, the classroom analysis device can also determine the identity of each sound source individual based on the voiceprint characteristics of the audio data of the sound source individual.

[0099] It should be noted that the classroom analysis device can pre-store the identity information and voiceprint information of the teachers and students in the classroom. Because voiceprints are specific to the individual and relatively stable, the voiceprints of any two people differ significantly. Therefore, during the process of voiceprint recognition of classroom audio data, the classroom analysis device can compare the voiceprint information of the currently identified audio segment with the pre-stored voiceprint information of the teachers and students, and thus determine the identity information of the individual to which the audio segment belongs.
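The comparison against pre-stored voiceprints can be illustrated with a short Python sketch. It assumes that each voiceprint is a fixed-length numeric vector and that similarity is measured by cosine similarity with a fixed threshold; neither the vector representation, the metric, the threshold value of 0.7, nor the names used here are specified in this application.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def identify(segment_voiceprint, enrolled, threshold=0.7):
    """Return the enrolled identity most similar to the segment, or None if no
    enrolled voiceprint reaches the (assumed) threshold."""
    best_name, best_score = None, -1.0
    for name, voiceprint in enrolled.items():
        score = cosine_similarity(segment_voiceprint, voiceprint)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None

# Hypothetical pre-stored voiceprints (toy 3-dimensional vectors).
enrolled = {"Teacher": [1.0, 0.0, 0.0], "Student 1": [0.0, 1.0, 0.0]}

print(identify([0.9, 0.1, 0.0], enrolled))  # Teacher
print(identify([0.0, 0.0, 1.0], enrolled))  # None
```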

[0100] In some embodiments, the classroom analysis device can also label each audio segment with the type of the sound source individual, the start time, the end time, etc. For example, the label of one audio segment may include "Student 1", "20220713 16:00:05", and "20220713 16:00:10", etc.
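One way to hold such a label in code is sketched below. The `SegmentLabel` class and its field names are hypothetical; the timestamp format string is inferred from the example values "20220713 16:00:05" and "20220713 16:00:10" and is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime

# Format inferred from the example label values (an assumption).
TIME_FORMAT = "%Y%m%d %H:%M:%S"

@dataclass(frozen=True)
class SegmentLabel:
    source: str      # e.g. "Student 1" or "Teacher"
    start: datetime  # start time of the audio segment
    end: datetime    # end time of the audio segment

    @classmethod
    def parse(cls, source, start_text, end_text):
        """Build a label from timestamp strings in TIME_FORMAT."""
        return cls(
            source,
            datetime.strptime(start_text, TIME_FORMAT),
            datetime.strptime(end_text, TIME_FORMAT),
        )

    @property
    def duration_seconds(self):
        """Length of the segment in seconds."""
        return (self.end - self.start).total_seconds()

label = SegmentLabel.parse("Student 1", "20220713 16:00:05", "20220713 16:00:10")
print(label.source, label.duration_seconds)  # Student 1 5.0
```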

[0101] S103. The classroom analysis device determines the first switching count and the second switching count based on the audio data of each sound source in the classroom.

[0102] The first switching count is used to indicate the number of times the audio emitted by students and the audio emitted by teachers are switched in the classroom, and the second switching count is used to indicate the number of times the audio emitted by two students is switched in the classroom.

[0103] Optionally, the classroom analysis device may use the number of times the audio emitted by students in the classroom audio data is switched to the audio emitted by the teacher as the first switching count. Alternatively, the classroom analysis device may also use the number of times the audio emitted by the teacher in the classroom audio data is switched to the audio emitted by students as the first switching count.

[0104] In some embodiments, the classroom analysis device can determine the temporal relationship between the audio data of the sound source individuals based on the start time of the audio data of each sound source individual in the classroom. Furthermore, the classroom analysis device can determine the first switching count and the second switching count based on the temporal relationship and the sound source individual to which each piece of audio data belongs.

[0105] Optionally, the classroom analysis device can arrange each audio segment based on the temporal relationship between them, according to the start and end times of each audio segment.

[0106] Furthermore, if the labels of the individual sound sources of each audio segment in the classroom are arranged in the above order as: "TSSSTSTSSS…", where "T" represents the teacher and "S" represents the student, and the number of adjacent pairs of labels "TS" is 3, and the number of adjacent pairs of labels "SS" is 4, then the classroom analysis device can determine that in this classroom, the first switching count is 3 and the second switching count is 4.
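The counting in this example can be reproduced with a short Python sketch. It follows the worked example literally: segments are ordered by start time, "TS" adjacent pairs give the first switching count, and "SS" adjacent pairs give the second switching count. The `(role, start, end)` tuple layout and the one-second timings are assumptions made only for illustration.

```python
def switching_counts(segments):
    """Return (first_count, second_count) for segments given as (role, start, end).

    role is "T" for the teacher or "S" for a student.
    """
    # Arrange segments by start time, breaking ties by end time.
    ordered = sorted(segments, key=lambda seg: (seg[1], seg[2]))
    labels = [role for role, _, _ in ordered]
    pairs = list(zip(labels, labels[1:]))
    first_count = pairs.count(("T", "S"))   # teacher segment followed by student segment
    second_count = pairs.count(("S", "S"))  # student segment followed by student segment
    return first_count, second_count

# Rebuild the label sequence from the text, one second per segment (timing is made up).
sequence = "TSSSTSTSSS"
segments = [(role, float(i), float(i + 1)) for i, role in enumerate(sequence)]

print(switching_counts(segments))  # (3, 4)
```

Counting "ST" pairs instead of "TS" pairs would implement the alternative definition of the first switching count mentioned in [0103]. Because the labels here do not distinguish between students, an implementation that tracks individual student identities might count an "SS" pair only when the speaker actually changes; that refinement is not shown.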

[0107] S104. The classroom analysis device determines the classroom interaction structure based on the first switching count and the second switching count.

[0108] In some embodiments, the classroom analysis device can also calculate the total duration of audio emitted by the same sound source individual in the classroom. Furthermore, the classroom analysis device can analyze the interaction structure of the classroom by combining information such as the first switching count, the second switching count, the teacher's total audio duration, and the students' total audio duration.

[0109] Optionally, the classroom analysis device can determine the classroom type based on information such as the first switching count, the second switching count, the total audio duration of the teacher, and the total audio duration of the students. The classroom type can include practice-based classrooms, lecture-based classrooms, discussion-based classrooms, or self-study classrooms.
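This application does not state how these quantities map to a classroom type. The following Python sketch shows one conceivable rule-based mapping purely for illustration: every threshold, the decision order, the extra `lesson_seconds` input, and the function name are hypothetical placeholders, not values taken from this application.

```python
def classify_classroom(first_switches, second_switches,
                       teacher_seconds, student_seconds, lesson_seconds):
    """Map switching counts and speaking durations to a classroom type.

    All thresholds below are illustrative assumptions.
    """
    if lesson_seconds <= 0:
        raise ValueError("lesson_seconds must be positive")
    speech_seconds = teacher_seconds + student_seconds
    # Very little speech overall: treat as self-study (assumed 20% cutoff).
    if speech_seconds / lesson_seconds < 0.2:
        return "self-study"
    teacher_share = teacher_seconds / speech_seconds
    # Students mostly talk to each other: treat as discussion.
    if second_switches > first_switches and teacher_share < 0.5:
        return "discussion-based"
    # Teacher dominates the speaking time: treat as lecture (assumed 80% cutoff).
    if teacher_share >= 0.8:
        return "lecture-based"
    return "practice-based"

print(classify_classroom(3, 4, 300.0, 900.0, 2400.0))    # discussion-based
print(classify_classroom(5, 0, 2000.0, 100.0, 2400.0))   # lecture-based
```

In practice such cutoffs would presumably be tuned against labelled lessons or replaced by a learned classifier.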

[0110] Optionally, the classroom analysis device can also compare parameters such as the first switching count and the second switching count of this classroom with those of other classrooms, so that teachers or schools can analyze the classroom effectiveness of the course.

[0111] The classroom analysis method provided in this application has at least the following beneficial effects: by performing voiceprint recognition on classroom audio data, the method can accurately determine the identity of the individual to whom each piece of audio data belongs. Therefore, this method can determine the number of switches between student audio and teacher audio, as well as the number of switches between the audio of two students in the classroom. This allows for precise analysis of student participation, accurate classroom analysis, and assistance to teachers in targeted teaching.

[0112] In some embodiments, as shown in Figure 4, an embodiment of this application also provides a method for identifying the sound source individual of speech information. This method is applied to a speech information recognition device and specifically includes the following steps:

[0113] S201, The voice information recognition device acquires the voice to be recognized.

[0114] The speech to be identified can be an audio segment from the aforementioned classroom audio data, such as the first audio segment, wherein an audio segment includes speech from the same individual source.

[0115] Optionally, the speech duration of the speech to be recognized can be 1 second, 2 seconds, 5 seconds or other reasonable durations.

[0116] S202. The speech information recognition device performs recognition processing on the speech to be recognized to obtain the first voiceprint feature and the hierarchical speech information of the speech to be recognized.

[0117] The hierarchical speech information includes the word-level speech information, character-level speech information and phoneme-level speech information of the speech to be recognized.

[0118] The word-level speech information refers to the speech corresponding to words in the speech to be recognized. Here, "word" covers both individual words (such as single words or compound words) and phrases, which are the smallest structural units that make up a sentence or article. For example, the word-level speech information may include the speech of words such as "attend class" and "Chinese". In addition, the character-level speech information refers to the speech corresponding to a single character in the speech to be recognized, for example, the speech of characters such as "shàng", "kè", "yǔ" or "wén".

[0119] A phoneme is the smallest speech unit divided according to the natural properties of speech. From an acoustic perspective, a phoneme is the smallest speech unit divided from the perspective of sound quality. From a physiological perspective, one pronunciation action forms one phoneme. The sounds produced by the same pronunciation action are the same phoneme, and the sounds produced by different pronunciation actions are different phonemes. Thus, the speech of a single character can be composed of multiple phonemes. That is, each character in the speech to be recognized can correspond to the speech of multiple phonemes.

[0120] The phoneme-level speech information refers to the speech corresponding to a single phoneme in the speech to be recognized, where multiple phonemes can form the speech of a single character in the speech to be recognized.

[0121] Optionally, as shown in Figure 5, the speech information recognition device can input the speech to be recognized into a preset ASR model 21 to obtain the word-level speech information, character-level speech information and phoneme-level speech information of the speech to be recognized.

[0122] In addition, as shown in Figure 5, the speech information recognition device can input the speech to be recognized into a preset voiceprint model 22 to obtain the first voiceprint feature of the speech to be recognized.

[0123] S203. The speech information recognition device splices the hierarchical speech information of the speech to be recognized according to the historically recognized speech to obtain the spliced speech.

[0124] Optionally, the speech information recognition device can acquire hierarchical speech information of historically recognized speech, which includes word-level speech information, character-level speech information and phoneme-level speech information of historically recognized speech.

[0125] Optionally, as shown in Figure 5, the speech information recognition device can input historically recognized speech into a preset ASR model 11 to obtain word-level speech information, character-level speech information, and phoneme-level speech information of the historically recognized speech. Furthermore, the speech information recognition device can also input historically recognized speech into a preset voiceprint model 11 to obtain the third voiceprint feature of the historically recognized speech.

[0126] Then, the speech information recognition device compares the hierarchical speech information of the speech to be recognized with the hierarchical speech information of the historical recognized speech, and determines the word-level speech information, character-level speech information, or phoneme-level speech information in the speech to be recognized that is the same as the hierarchical speech information of the historical recognized speech.

[0127] Specifically, the speech information recognition device can first match the word-level speech information in the speech to be recognized against the word-level speech information of the historically recognized speech. For the word-level speech information of the historically recognized speech that failed to match, it then attempts a match at the character level, using the character-level speech information of the still-unmatched part of the speech to be recognized. Finally, for the character-level speech information of the historically recognized speech that failed to match, it attempts a match at the phoneme level, using the phoneme-level speech information of the still-unmatched part of the speech to be recognized.

[0128] Therefore, the speech information recognition device can take the word-level, character-level, or phoneme-level speech information of the speech to be recognized that matches the hierarchical speech information of the historically recognized speech, and concatenate it in the temporal order in which the corresponding units occur in the historically recognized speech, thereby obtaining the spliced speech.
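The word, then character, then phoneme fallback described above can be sketched in Python as follows. This is one possible reading of the matching and splicing steps, not the implementation of this application. It assumes that the ASR output of the historically recognized speech is available as a nested list of words, characters, and phonemes in temporal order, and that the speech to be recognized is indexed by three dictionaries mapping each unit to a `(start, end)` time span; these data structures, the toneless pinyin units, and all time values are hypothetical.

```python
def splice_plan(history, cand_words, cand_chars, cand_phonemes):
    """Return the spans of the speech to be recognized, ordered to follow the
    historically recognized speech.

    history: list of (word, [(char, [phoneme, ...]), ...]) in temporal order.
    cand_*:  dicts mapping a unit to its (start, end) span in the speech to be recognized.
    """
    plan = []
    for word, chars in history:
        if word in cand_words:                  # 1) word-level match
            plan.append(("word", word, cand_words[word]))
            continue
        for char, phonemes in chars:
            if char in cand_chars:              # 2) character-level fallback
                plan.append(("char", char, cand_chars[char]))
                continue
            for phoneme in phonemes:            # 3) phoneme-level fallback
                if phoneme in cand_phonemes:
                    plan.append(("phoneme", phoneme, cand_phonemes[phoneme]))
    return plan

# Historically recognized speech: "shangke yuwen" (toneless pinyin, for illustration).
history = [
    ("shangke", [("shang", ["sh", "ang"]), ("ke", ["k", "e"])]),
    ("yuwen", [("yu", ["y", "u"]), ("wen", ["w", "en"])]),
]
# Units found in the speech to be recognized, with made-up time spans in seconds.
cand_words = {"yuwen": (0.0, 0.6)}
cand_chars = {"yu": (0.0, 0.3), "wen": (0.3, 0.6), "ke": (0.6, 0.9)}
cand_phonemes = {"sh": (1.0, 1.1), "k": (0.6, 0.7), "e": (0.7, 0.9)}

for level, unit, span in splice_plan(history, cand_words, cand_chars, cand_phonemes):
    print(level, unit, span)
```

Cutting the audio at the listed spans and concatenating the pieces in plan order would yield the spliced speech; units of the historically recognized speech with no match at any level (here the phoneme "ang") are simply skipped in this sketch.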

[0129] It should be understood that the spliced speech is obtained by splicing together parts of the speech to be recognized, and its content is similar to that of the historically recognized speech. Because of this similarity, the process by which the speech information recognition device performs voiceprint recognition on the spliced speech based on the historically recognized speech can be regarded as a text-dependent voiceprint recognition process. Therefore, this recognition process can achieve relatively good recognition results.

[0130] S204. The speech information recognition device performs recognition processing on the spliced speech to obtain the second voiceprint feature of the spliced speech.

[0131] Optionally, as shown in Figure 5, the speech information recognition device can input the spliced speech into a preset voiceprint model 23 to obtain the second voiceprint feature of the spliced speech.

[0132] S205. The voice information recognition device compares the first voiceprint feature, the second voiceprint feature, and the third voiceprint feature of the historical recognized voice to determine whether the voice to be recognized and the historical recognized voice belong to the same sound source.

[0133] The aforementioned historically recognized speech refers to one item among one or more pieces of previously recognized speech information.

[0134] During the recognition process, the voice information recognition device can sequentially compare the first voiceprint feature, the second voiceprint feature, and the third voiceprint feature of each historical voice information, and determine whether the voice to be recognized and each historical recognized voice belong to the same sound source individual.

[0135] Optionally, the voice information recognition device can compare the first voiceprint feature with the second voiceprint feature to obtain a first similarity parameter. It can also compare the first voiceprint feature with the third voiceprint feature to obtain a second similarity parameter.

[0136] Furthermore, the speech information recognition device can determine a third similarity parameter between the speech to be recognized and the previously recognized speech based on the first similarity parameter and the second similarity parameter.

[0137] Optionally, the third similarity parameter mentioned above can be determined according to the following formula (1):

[0138]

[0139] Where s is the third similarity parameter, d is the duration of the speech to be recognized, and α is a preset parameter value; the remaining two terms of the formula are the first similarity parameter and the second similarity parameter.

[0140] In one possible scenario, when only one historical speech information has a third similarity parameter greater than or equal to a preset threshold, the speech recognition device can determine that the speech to be recognized and the historically recognized speech belong to the same sound source. The preset threshold is a pre-defined similarity parameter value, which can be comprehensively set based on similarity data obtained during the actual recognition process of the speech recognition device.

[0141] In another possible scenario, when the third similarity parameter of multiple (two or more) historical speech information is greater than or equal to a preset threshold, the speech information recognition device can compare the third similarity parameter of each historical speech information in the multiple historical speech information, and determine that the historical speech information with the largest third similarity parameter belongs to the same sound source individual as the speech to be recognized.

[0142] In some embodiments, when the third similarity parameter is less than a preset threshold, the speech information recognition device can determine that the speech to be recognized and the previously recognized speech belong to different sound source individuals. Furthermore, when the speech to be recognized and all historical speech information belong to different sound source individuals, the speech information recognition device can store the speech to be recognized as the speech information of a new source individual.
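The three cases described above (exactly one historical entry at or above the threshold, several at or above it, or none) can be expressed with a single selection function. The following Python sketch takes the third similarity parameters as already computed inputs, since it does not attempt to reproduce formula (1); the dictionary layout, identifiers, and threshold value are illustrative assumptions.

```python
def assign_source(third_similarities, threshold):
    """Pick the historically recognized speech that shares a sound source with
    the speech to be recognized.

    third_similarities: dict mapping a history identifier to its third similarity parameter.
    Returns the best-matching identifier, or None when the speech should be
    stored as a new sound source individual.
    """
    candidates = {key: s for key, s in third_similarities.items() if s >= threshold}
    if not candidates:
        return None  # below threshold for every history entry: new individual
    # With one candidate this returns it; with several it returns the largest.
    return max(candidates, key=candidates.get)

print(assign_source({"hist_1": 0.4, "hist_2": 0.8}, threshold=0.6))  # hist_2
print(assign_source({"hist_1": 0.7, "hist_2": 0.9}, threshold=0.6))  # hist_2
print(assign_source({"hist_1": 0.1}, threshold=0.6))                 # None
```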

[0143] Based on the above embodiments, the method for identifying the sound source individual of speech information provided in this application can match and splice the speech to be recognized based on the historically recognized speech, and then compare the voiceprint features of the spliced speech with the voiceprint features of the historically recognized speech. Since the spliced speech is similar to the historically recognized speech, this process can improve the accuracy of voiceprint recognition.

[0144] The foregoing primarily describes the solutions provided by the embodiments of this application from a methodological perspective. To achieve the aforementioned functions, the device includes corresponding hardware structures and / or software modules for executing each function. Those skilled in the art should readily recognize that, in conjunction with the units and algorithm steps of the various examples described in the embodiments disclosed herein, this application can be implemented in hardware or a combination of hardware and computer software. Whether a function is executed in hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0145] Figure 6 is a structural schematic of a classroom analysis device 300 provided in an embodiment of this application. The device 300 may include: an acquisition module 301, a recognition module 302, and a processing module 303.

[0146] The acquisition module 301 is used to acquire classroom audio data.

[0147] The recognition module 302 is used to perform voiceprint recognition on classroom audio data and segment the audio data of each individual sound source in the classroom, including teachers and / or students.

[0148] The processing module 303 is used to determine the first switching count and the second switching count based on the audio data of each sound source in the classroom. The first switching count is used to indicate the number of switching between the audio emitted by the students and the audio emitted by the teacher in the classroom, and the second switching count is used to indicate the number of switching between the audio emitted by two students in the classroom.

[0149] The processing module 303 is also used to determine the classroom interaction structure based on the first switching count and the second switching count. The classroom interaction structure is used to indicate the interaction between teachers and students in the classroom.

[0150] In one possible implementation, the processing module 303 is specifically used to: determine the temporal relationship between the audio data of each sound source individual based on the start time of the audio data of each sound source individual in the classroom; and determine the first switching number and the second switching number based on the temporal relationship and the sound source individual to which each audio data belongs.

[0151] In another possible implementation, the above-mentioned processing module 303 is further used to: perform voiceprint recognition on classroom audio data to obtain audio data from different sound source individuals; and determine whether each sound source individual is a teacher or a student based on preset rules, wherein the preset rules are to determine the oldest sound source individual as the teacher, the sound source individual with the longest total audio data duration as the teacher, or the sound source individual of the first audio data in the classroom as the teacher.

[0152] In another possible implementation, the aforementioned recognition module 302 is specifically used to: segment classroom audio data into multiple audio segments, wherein each audio segment belongs to a single sound source individual; and determine one or more audio segments of each sound source individual among the multiple audio segments.

[0153] In another possible implementation, the aforementioned recognition module 302 is further specifically used to: compare the first audio segment with the historical recognized speech to determine whether the first audio segment and the historical recognized speech belong to the same sound source; wherein, the first audio segment is one of multiple audio segments, and the historical recognized speech is speech information that has been recognized in the past.

[0154] In another possible implementation, the above-mentioned device further includes a splicing module 304, and the recognition module 302 is specifically used to perform recognition processing on the first audio segment to obtain the first voiceprint feature and hierarchical speech information of the first audio segment. The hierarchical speech information includes word-level speech information, character-level speech information and phoneme-level speech information of the first audio segment. The splicing module 304 is used to splice the hierarchical speech information of the first audio segment according to the historically recognized speech to obtain spliced speech. The recognition module 302 is also used to perform recognition processing on the spliced speech to obtain the second voiceprint feature of the spliced speech. The recognition module 302 is also used to compare the first voiceprint feature, the second voiceprint feature and the third voiceprint feature of the historically recognized speech to determine whether the first audio segment and the historically recognized speech belong to the same sound source individual.

[0155] In another possible implementation, the aforementioned recognition module 302 is specifically used to: compare the first voiceprint feature with the second voiceprint feature to obtain a first similarity parameter; compare the first voiceprint feature with the third voiceprint feature to obtain a second similarity parameter; determine a third similarity parameter between the first audio segment and the historical recognized speech based on the first similarity parameter and the second similarity parameter; and determine that the first audio segment and the historical recognized speech belong to the same sound source when the third similarity parameter is greater than or equal to a preset threshold.

[0156] In yet another possible implementation, the third similarity parameter mentioned above satisfies the following relationship:

[0157]

[0158] Where s is the third similarity parameter, d is the duration of the first audio segment, and α is a preset parameter value; the remaining two terms of the relationship are the first similarity parameter and the second similarity parameter.

[0159] For a detailed description of the above-mentioned optional methods, please refer to the foregoing method embodiments, which will not be repeated here. Furthermore, the explanation of any of the classroom analysis devices 300 provided above, as well as the description of their beneficial effects, can be found in the corresponding method embodiments described above, and will not be repeated here.

[0160] Those skilled in the art will readily recognize that, based on the units and algorithm steps described in conjunction with the embodiments disclosed herein, this application can be implemented in hardware or a combination of hardware and computer software. Whether a function is implemented in hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0161] It should be noted that the module division shown in Figure 6 is illustrative and represents only one logical functional division; in actual implementation, other division methods are possible. For example, two or more functions can be integrated into a single processing module. These integrated modules can be implemented either in hardware or as software functional modules.

[0162] This application also provides a computer-readable storage medium including computer-executable instructions that, when run on a computer, cause the computer to perform any of the methods provided in the above embodiments. For example, one or more of steps S101 to S104 in Figure 3 can be performed by one or more computer-executable instructions stored in the computer-readable storage medium.

[0163] This application also provides a computer program product containing computer-executable instructions which, when run on a computer, cause the computer to perform any of the methods provided in the above embodiments.

[0164] This application also provides a chip, including a processor and an interface. The processor is coupled to a memory through the interface. When the processor executes a computer program or computer-executable instructions in the memory, any of the methods provided in the above embodiments is executed.

[0165] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented using software programs, implementation can be, in whole or in part, in the form of a computer program product. This computer program product includes one or more computer instructions. When these computer instructions are loaded and executed on a computer, all or part of the flow or function according to the embodiments of this application is generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium accessible to a computer, or a data storage device, such as a server or data center, that integrates one or more available media. The available media can be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid-state drives (SSDs)).

[0166] Although this application has been described in conjunction with specific features and embodiments, it is obvious that various modifications and combinations can be made thereto without departing from the spirit and scope of this application. Accordingly, this specification and drawings are merely exemplary illustrations of this application as defined by the appended claims, and are considered to cover any and all modifications, variations, combinations, or equivalents within the scope of this application. Clearly, those skilled in the art can make various alterations and modifications to this application without departing from the spirit and scope of this application. Thus, if such modifications and variations of this application fall within the scope of the claims of this application and their equivalents, this application is also intended to include such modifications and variations.

[0167] The above description is merely a specific embodiment of this application, but the scope of protection of this application is not limited thereto. Any changes or substitutions within the technical scope disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A classroom analysis method, characterized in that the method includes:
acquiring classroom audio data;
performing voiceprint recognition on the classroom audio data to segment the audio data of each individual sound source in the classroom, the individual sound sources including teachers and/or students;
determining, based on the audio data of each individual sound source in the classroom, a first switching count and a second switching count, wherein the first switching count indicates the number of switches between audio emitted by a student and audio emitted by the teacher in the classroom, and the second switching count indicates the number of switches between audio emitted by two students in the classroom; and
determining, based on the first switching count and the second switching count, a classroom interaction structure, wherein the classroom interaction structure indicates the interaction between the teacher and the students in the classroom;
wherein the step of performing voiceprint recognition on the classroom audio data to segment the audio data of each individual sound source in the classroom includes:
dividing the classroom audio data into multiple audio segments, wherein each audio segment belongs to a single sound source;
processing a first audio segment to obtain a first voiceprint feature and hierarchical speech information of the first audio segment, wherein the hierarchical speech information includes word-level speech information, character-level speech information and phoneme-level speech information of the first audio segment, and the first audio segment is one of the multiple audio segments;
splicing the hierarchical speech information of the first audio segment based on historical recognized speech to obtain spliced speech, wherein the historical recognized speech is speech information that has been recognized in the past;
processing the spliced speech to obtain a second voiceprint feature of the spliced speech; and
determining, by comparing the first voiceprint feature, the second voiceprint feature, and a third voiceprint feature of the historical recognized speech, whether the first audio segment and the historical recognized speech belong to the same individual sound source.

2. The method according to claim 1, characterized in that the step of determining the first switching count and the second switching count based on the audio data of each individual sound source in the classroom includes: determining, based on the start time of the audio data of each individual sound source in the classroom, the temporal relationship between the audio data of the individual sound sources; and determining the first switching count and the second switching count based on the temporal relationship and the individual sound source to which each piece of audio data belongs.
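
For illustration only, the counting described in claim 2 can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the `Segment` record, its field names, and the `count_switches` function are hypothetical, and the sketch assumes each segment already carries a speaker label and a teacher/student role from an earlier recognition step.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str  # hypothetical label of the individual sound source
    role: str     # "teacher" or "student" (assumed to be known already)
    start: float  # start time of the audio data, in seconds


def count_switches(segments):
    """Return (first_count, second_count) for a list of segments.

    first_count:  switches between a student's audio and the teacher's audio
    second_count: switches between the audio of two different students
    """
    # The temporal relationship is taken from the start times.
    ordered = sorted(segments, key=lambda s: s.start)
    first_count = 0
    second_count = 0
    for prev, cur in zip(ordered, ordered[1:]):
        if prev.speaker == cur.speaker:
            continue  # same sound source continues speaking: no switch
        if prev.role != cur.role:
            first_count += 1   # teacher <-> student
        elif prev.role == "student":
            second_count += 1  # student <-> different student
    return first_count, second_count


demo = [
    Segment("teacher", "teacher", 0.0),
    Segment("s1", "student", 5.0),
    Segment("s2", "student", 9.0),
    Segment("teacher", "teacher", 12.0),
    Segment("teacher", "teacher", 15.0),
    Segment("s1", "student", 20.0),
]
print(count_switches(demo))  # -> (3, 1)
```

In this sketch, consecutive segments from the same sound source are not counted as a switch; whether the claimed method treats them that way is not stated in the claim.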

3. The method according to claim 2, characterized in that the step of performing voiceprint recognition on the classroom audio data to segment the audio data of each individual sound source in the classroom, the individual sound sources including teachers or students, includes: performing voiceprint recognition on the classroom audio data to obtain audio data from different individual sound sources; and determining, based on preset rules, whether each individual sound source is a teacher or a student, wherein the preset rules are: determining the oldest individual sound source to be a teacher, determining the individual sound source with the longest total audio data duration to be a teacher, or determining the individual sound source of the first audio data in the classroom to be a teacher.
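
For illustration only, one of the preset rules in claim 3 (the individual sound source with the longest total audio data duration is the teacher) can be sketched as follows. The function name `label_roles_by_duration` and the `(speaker, start, end)` tuple layout are assumptions made for this sketch; the other two rules (oldest source, source of the first audio data) are not shown.

```python
from collections import defaultdict


def label_roles_by_duration(segments):
    """Label each sound source as "teacher" or "student".

    segments: non-empty iterable of (speaker, start, end) tuples, times in
    seconds. The source with the longest total duration is labeled teacher.
    """
    totals = defaultdict(float)
    for speaker, start, end in segments:
        totals[speaker] += end - start
    # Ties are resolved by first occurrence, an arbitrary choice here.
    teacher = max(totals, key=totals.get)
    return {
        speaker: "teacher" if speaker == teacher else "student"
        for speaker in totals
    }


segs = [("A", 0, 30), ("B", 30, 35), ("A", 35, 60), ("C", 60, 70)]
print(label_roles_by_duration(segs))
```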

4. The method according to claim 1, characterized in that determining, by comparing the first voiceprint feature, the second voiceprint feature, and the third voiceprint feature of the historical recognized speech, whether the first audio segment and the historical recognized speech belong to the same individual sound source includes: comparing the first voiceprint feature with the second voiceprint feature to obtain a first similarity parameter; comparing the first voiceprint feature with the third voiceprint feature to obtain a second similarity parameter; determining, based on the first similarity parameter and the second similarity parameter, a third similarity parameter between the first audio segment and the historical recognized speech; and, when the third similarity parameter is greater than or equal to a preset threshold, determining that the first audio segment and the historical recognized speech belong to the same individual sound source.
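
For illustration only, the decision flow of claim 4 can be sketched as follows. Several parts are assumptions and not taken from the claims: voiceprint features are treated as plain numeric vectors, cosine similarity stands in for the unspecified comparison, and the weighted average used for the third similarity parameter is a placeholder, since this text does not reproduce the relationship that the claims define for it. The names `cosine`, `same_source`, `weight` and `threshold` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def same_source(first_vp, second_vp, third_vp, weight=0.5, threshold=0.7):
    """Decide whether a segment matches the historical recognized speech.

    first_vp:  voiceprint feature of the first audio segment
    second_vp: voiceprint feature of the spliced speech
    third_vp:  voiceprint feature of the historical recognized speech
    weight, threshold: assumed values, not taken from the source
    """
    first_sim = cosine(first_vp, second_vp)
    second_sim = cosine(first_vp, third_vp)
    # Placeholder combination; the claimed relationship is not reproduced here.
    third_sim = weight * first_sim + (1 - weight) * second_sim
    return third_sim >= threshold, third_sim


match, score = same_source([1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
print(match, score)
```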

5. The method according to claim 4, characterized in that the third similarity parameter satisfies the following relationship: [formula not reproduced in this text], where s is the third similarity parameter, [symbol not reproduced] is the first similarity parameter, d is the second similarity parameter, α is the duration of the first audio segment, and α is a preset parameter value.

6. A classroom analysis device, characterized in that the device includes:
an acquisition module, used to acquire classroom audio data;
a recognition module, used to perform voiceprint recognition on the classroom audio data to segment the audio data of each individual sound source in the classroom, the individual sound sources including teachers and/or students; and
a processing module, used to determine a first switching count and a second switching count based on the audio data of each individual sound source in the classroom, wherein the first switching count indicates the number of switches between audio emitted by a student and audio emitted by the teacher in the classroom, and the second switching count indicates the number of switches between audio emitted by two students in the classroom;
the processing module is further used to determine the classroom interaction structure of the classroom based on the first switching count and the second switching count, wherein the classroom interaction structure indicates the interaction between the teacher and the students in the classroom;
the recognition module is specifically used to: divide the classroom audio data into multiple audio segments, wherein each audio segment belongs to a single sound source; and identify one or more audio segments for each individual sound source among the multiple audio segments;
the recognition module is also specifically used to: determine, by comparing a first audio segment with historical recognized speech, whether the first audio segment and the historical recognized speech belong to the same sound source, wherein the first audio segment is one of the multiple audio segments, and the historical recognized speech is speech information that has been recognized in the past;
the recognition module is further specifically used to perform recognition processing on the first audio segment to obtain a first voiceprint feature and hierarchical speech information of the first audio segment, wherein the hierarchical speech information includes word-level speech information, character-level speech information and phoneme-level speech information of the first audio segment;
a splicing module is used to splice the hierarchical speech information of the first audio segment based on the historical recognized speech to obtain spliced speech;
the recognition module is also used to perform recognition processing on the spliced speech to obtain a second voiceprint feature of the spliced speech; and
the recognition module is further used to compare the first voiceprint feature, the second voiceprint feature, and a third voiceprint feature of the historical recognized speech to determine whether the first audio segment and the historical recognized speech belong to the same individual sound source.

7. The device according to claim 6, characterized in that the processing module is specifically used to:
determine, based on the start time of the audio data of each individual sound source in the classroom, the temporal relationship between the audio data of the individual sound sources; and determine the first switching count and the second switching count based on the temporal relationship and the individual sound source to which each piece of audio data belongs;
the processing module is also specifically used to:
perform voiceprint recognition on the classroom audio data to obtain audio data from different individual sound sources;
determine, based on preset rules, whether each individual sound source is a teacher or a student, wherein the preset rules are: determining the oldest individual sound source to be a teacher, determining the individual sound source with the longest total audio data duration to be a teacher, or determining the individual sound source of the first audio data in the classroom to be a teacher;
compare the first voiceprint feature with the second voiceprint feature to obtain a first similarity parameter;
compare the first voiceprint feature with the third voiceprint feature to obtain a second similarity parameter;
determine, based on the first similarity parameter and the second similarity parameter, a third similarity parameter between the first audio segment and the historical recognized speech; and
when the third similarity parameter is greater than or equal to a preset threshold, determine that the first audio segment and the historical recognized speech belong to the same individual sound source;
wherein the third similarity parameter satisfies the following relationship: [formula not reproduced in this text], where s is the third similarity parameter, [symbol not reproduced] is the first similarity parameter, d is the second similarity parameter, α is the duration of the first audio segment, and α is a preset parameter value.

8. An electronic device, characterized in that the electronic device includes a memory and a processor; the memory and the processor are coupled; the memory is used to store computer program code, the computer program code including computer instructions; wherein, when the processor executes the computer instructions, the electronic device performs the method according to any one of claims 1 to 5.

9. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer instructions that, when executed, cause a computer to perform the method according to any one of claims 1 to 5.