A multi-modal classroom construction method and system based on digital twinning and course agent

By using multimodal data acquisition and processing, digital twin modeling, and collaborative control of intelligent course agents, the multidimensional limitations of smart classrooms are solved, enabling high-fidelity dynamic mapping, accurate perception, and personalized teaching, meeting educational data compliance requirements, and significantly improving teaching efficiency.

CN122199218A · Pending · Publication Date: 2026-06-12 · QINGDAO HENGXING UNIV OF SCI & TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
QINGDAO HENGXING UNIV OF SCI & TECH
Filing Date
2026-03-10
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing smart classroom technologies have limitations in virtual scene construction, multi-dimensional data collection, intelligent interaction, and privacy security, making it difficult to meet the comprehensive needs of dynamic adaptation, intelligent collaboration, privacy protection, and millisecond-level feedback.

Method used

By employing multimodal data acquisition and processing, digital twin modeling, intelligent course agent collaborative control, and end-to-end data security, high-fidelity virtual classrooms are constructed by collecting multi-dimensional data through devices such as high-definition cameras, array microphones, and eye trackers. Differential privacy, federated learning, and blockchain technologies are combined to ensure data security, enabling millisecond-level feedback and personalized teaching.

Benefits of technology

It achieves high-fidelity dynamic twin mapping, enhances classroom interaction and immersion, accurately perceives classroom status, provides personalized teaching, meets educational data compliance requirements, and significantly improves teaching efficiency and effectiveness.

✦ Generated by Eureka AI based on patent content.


Abstract

The application belongs to the technical field of education and artificial intelligence interaction, and provides a multi-modal classroom construction method and system based on digital twinning and course agents, aiming to solve the problems of static classroom modeling, insufficient multi-modal data fusion, unintelligent teaching interaction, weak privacy protection and feedback lag in the prior art. The method comprises the following steps: S1, multi-modal classroom data acquisition and standardization; S2, digital twinning classroom modeling and dynamic updating; S3, course agent construction and collaborative control; S4, multi-modal data fusion and classroom state analysis. The system comprises multi-modal acquisition, data preprocessing, digital twinning modeling, course agent, multi-modal fusion analysis and interactive feedback modules. The application can realize high-fidelity dynamic mapping, accurate classroom perception, personalized teaching, full-process privacy compliance and millisecond-level closed-loop optimization.

Description

Technical Field

[0001] This invention belongs to the field of educational technology and artificial intelligence interaction technology, and specifically relates to a method and system for constructing a multimodal classroom based on digital twins and course intelligent agents.

Background Technology

[0002] With the deep integration of digitalization and intelligence in education, technologies such as digital twins and multimodal interaction are gradually being applied to classroom teaching. Although existing technologies have made some progress in virtual scene construction and data collection and analysis, there are still multi-dimensional bottlenecks that make it difficult to meet the core needs of modern education for dynamic adaptation, intelligent collaboration, and privacy and security.

[0003] While existing smart classroom technologies have made some progress in virtual scene construction and basic data collection, limitations remain in several key dimensions. Some systems, such as the classroom teaching system and method based on digital twin technology proposed in CN115641232A, focus on the virtual reproduction of experimental operations, with limited responsiveness to dynamic elements such as students' real-time expressions and actions in the classroom. The classroom teaching method and device based on digital twin technology in CN113593383A primarily demonstrates principles through static models, failing to fully realize two-way synchronization of teacher-student interaction, thus affecting the immersion and interactive continuity of the virtual classroom. Regarding multimodal data processing, the digital teaching system based on multimodal intelligent technology in CN119559834A mainly relies on voice and screen interaction data to generate learning paths, with less integration of deep behavioral features such as eye movements and micro-expressions. While the multimodal positioning method and system for target objects in a three-dimensional digital classroom based on CN118071833A can locate student positions, it fails to effectively integrate emotional and attentional state information, limiting the comprehensive perception of the overall classroom situation. At the level of intelligent interaction, the smart classroom interaction system and method in CN117037552A improves interaction efficiency through smart whiteboards, but lacks a dynamic teaching adaptation mechanism based on intelligent agents. The cloud-based smart classroom interaction system in CN115410431A focuses on optimizing the homework grading process, but its responsiveness to real-time changes in classroom status is weak. 
Furthermore, systems such as CN119559834A and CN115410431A primarily employ basic encryption methods for data security, failing to integrate technologies such as differential privacy, federated learning, or blockchain, making it difficult to meet the compliance requirements of the Personal Information Protection Law regarding biometric and behavioral data. Regarding teaching feedback, the assessment module in CN117037552A typically generates reports after class, resulting in a long feedback cycle. Moreover, the indicators focus on the mastery of knowledge points, lacking real-time mapping of process dimensions such as attention span and interaction quality, thus hindering the immediate adjustment and optimization of teaching strategies. Therefore, current technologies are still insufficient to simultaneously meet the comprehensive requirements of dynamic twin mapping, deep multimodal fusion, intelligent collaborative interaction, end-to-end privacy protection, and millisecond-level feedback.

[0004] To address the problems raised in the background art, those skilled in the art have proposed a method and system for constructing a multimodal classroom based on digital twins and course intelligent agents.

Summary of the Invention

[0005] To address the aforementioned technical problems, this invention provides a method and system for constructing a multimodal classroom based on digital twins and course intelligent agents, thereby resolving the issues in the prior art.

[0006] A method for constructing a multimodal classroom based on digital twins and course intelligent agents includes the following steps:
S1. Multimodal classroom data collection and standardization;
S101. Multi-dimensional data collection: deploy multi-source collection devices in the physical classroom to acquire full-scene classroom data, including visual data, auditory data, behavioral data, and teaching data;
S2. Digital twin classroom modeling and dynamic updating;
S201. Twin model construction, which comprises three parts: basic modeling, entity mapping, and relationship definition;
S202. Dynamic model updating: achieve precise adaptation between the virtual classroom and the physical classroom through real-time synchronization, state prediction, and error correction;
S3. Construction and collaborative control of course intelligent agents;
S301. Intelligent agent architecture design, including a perception layer, a decision layer, and an execution layer;
S302. Intelligent agent collaboration mechanism design, including human-machine collaboration design and multi-agent collaboration design;
S4. Multimodal data fusion and classroom state analysis;
S5. Personalized teaching interaction and feedback;
S6. Data security and model iteration optimization;
S601. End-to-end data security, covering data transmission, data storage, and privacy protection;
S602. Feedback-driven iteration: collect new classroom data monthly to supplement the training set and optimize the multimodal fusion model and the course agent decision-making model; organize education experts and technology experts to review the teaching strategy rules every quarter and update the agent interaction logic; compare teaching effects before and after optimization through A/B testing to confirm the effectiveness of each model iteration.

[0007] Furthermore, S1 also includes: S102. Data standardization processing. Cleaning and denoising: median filtering is used to remove image noise, Mel-frequency cepstral coefficients are used to filter speech interference, and abnormal data is removed. Format unification: image data is normalized to the [0,1] interval, speech data is converted to 16kHz mono PCM format, and behavioral data is aligned by timestamp. Privacy desensitization: differential privacy technology is used to add Gaussian noise to student facial images, and local feature extraction is performed on behavioral data through federated learning.

Furthermore, in S101 the visual data is collected by high-definition cameras, which capture students' facial expressions, body movements, and seating arrangements; the auditory data is collected by array microphones, which capture the teacher's lecturing voice and the students' response voices, from which voice emotion and speech rate features are extracted; the behavioral data is collected by eye trackers, which capture students' gaze focus, and by motion capture devices, which record body movements; the teaching data is obtained by connecting to the teaching management system to collect courseware content, teaching progress, and student answer data.
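The cleaning and format-unification operations of S102 can be illustrated with a minimal NumPy sketch. The function names are hypothetical, 8-bit grayscale input is assumed, and the resampler uses plain linear interpolation with no anti-aliasing filter, which is a simplification chosen for brevity rather than anything the application specifies.

```python
import numpy as np


def median_filter_3x3(img: np.ndarray) -> np.ndarray:
    """3x3 median filter on a 2-D grayscale image; edges are padded by replication."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    # Stack the nine shifted views of the image and take the per-pixel median.
    windows = np.stack(
        [padded[r:r + h, c:c + w] for r in range(3) for c in range(3)]
    )
    return np.median(windows, axis=0)


def normalize_unit(img: np.ndarray) -> np.ndarray:
    """Map 8-bit pixel values to the [0, 1] interval."""
    return img.astype(np.float64) / 255.0


def to_16k_mono(pcm: np.ndarray, src_rate: int) -> np.ndarray:
    """Down-mix (samples, channels) audio to mono and resample to 16 kHz.

    Linear interpolation only: a real pipeline would low-pass filter first.
    """
    mono = pcm.mean(axis=1) if pcm.ndim == 2 else pcm
    n_out = int(round(len(mono) * 16000 / src_rate))
    src_t = np.arange(len(mono)) / src_rate
    dst_t = np.arange(n_out) / 16000
    return np.interp(dst_t, src_t, mono)


def align_by_timestamp(samples, window_ms=10):
    """Group (timestamp_ms, value) pairs into fixed windows keyed by window start."""
    buckets = {}
    for ts, value in samples:
        buckets.setdefault((ts // window_ms) * window_ms, []).append(value)
    return buckets
```

A single bright outlier pixel surrounded by uniform neighbours is replaced by the neighbourhood value, which is the salt-and-pepper suppression the text relies on.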

[0008] Furthermore, the basic modeling step in S201 uses Unity3D to construct a 1:1 virtual replica of the physical classroom. The model covers the classroom layout, desks and chairs, and teaching equipment, and supports real-time adjustment of lighting and viewing angle. The entity mapping step defines three core twin entities: student entities, teacher entities, and equipment entities. The student entity is associated with facial features, behavioral state, and knowledge mastery; the teacher entity is associated with teaching progress and interactive commands; and the equipment entity is associated with the operating status of the projection and audio equipment. The relationship definition step establishes the relationships between entities to ensure a one-to-one correspondence between virtual entities and physical entities. The real-time synchronization step in S202 is based on a low-latency communication protocol: it pushes standardized multimodal data to the virtual classroom and updates students' facial expressions, gaze direction, and action status, as well as teachers' teaching actions, in real time. The state prediction step uses a time-series prediction model to predict changes in classroom state over future time periods from historical data and updates the intelligent agent interaction prompts in the virtual classroom in advance. The error correction step uses a filtering algorithm to dynamically correct the synchronization error between the virtual and physical classrooms so that the model mapping accuracy reaches a preset threshold of no less than 95%.
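The entity mapping and the one-to-one correspondence requirement can be pictured as a small registry. This is a sketch under stated assumptions: the class names, the identifier strings, and the free-form attribute dictionary are all hypothetical, and the application does not describe its data structures at this level.

```python
from dataclasses import dataclass, field


@dataclass
class TwinEntity:
    """A virtual entity; kind is "student", "teacher" or "device" (assumed labels)."""
    virtual_id: str
    kind: str
    attributes: dict = field(default_factory=dict)


class TwinRegistry:
    """Keeps physical-to-virtual bindings strictly one-to-one."""

    def __init__(self):
        self._by_physical = {}
        self._physical_by_virtual = {}

    def bind(self, physical_id: str, entity: TwinEntity) -> None:
        # Reject any binding that would map two physical objects to one
        # avatar, or one physical object to two avatars.
        if (physical_id in self._by_physical
                or entity.virtual_id in self._physical_by_virtual):
            raise ValueError("binding would break the one-to-one mapping")
        self._by_physical[physical_id] = entity
        self._physical_by_virtual[entity.virtual_id] = physical_id

    def update(self, physical_id: str, **changes) -> TwinEntity:
        """Apply a synchronized state change to the bound virtual entity."""
        entity = self._by_physical[physical_id]
        entity.attributes.update(changes)
        return entity
```

Real-time synchronization then reduces to calling `update` with whatever attributes the incoming data packet carries, for example a new gaze direction for one student.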

[0009] Furthermore, in S301, the perception layer receives multimodal data and the virtual classroom status, and identifies students' attention levels, emotional states, and knowledge mastery; the decision layer trains a decision model based on a reinforcement learning algorithm and dynamically generates teaching strategies according to the classroom status; the execution layer outputs teaching instructions through the virtual classroom interface to control physical devices. The human-machine collaborative design in S302 clearly defines the course intelligent agent as responsible for process interaction, including real-time Q&A and attention guidance, while the teacher focuses on explaining the core teaching content and analyzing key points and difficulties. The multi-agent collaborative design configures a dedicated learning intelligent agent for each student, responsible for personalized learning recommendations, while a global intelligent agent is set up to control the teaching pace and coordinate device scheduling. The two achieve instruction synchronization and state sharing through message queues. In the multi-agent collaborative design, when the global intelligent agent detects that the class's attention distraction rate exceeds 30%, it triggers a collaborative operation to push interactive suggestions to the teacher and focus prompts to distracted students.
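The 30% trigger in the multi-agent collaboration can be sketched as a pure function that turns per-student attention labels into outgoing messages. The label strings, the `Message` type, and the message wording are assumptions for illustration; in the described system the messages would travel over a message queue rather than being returned as a list.

```python
from dataclasses import dataclass

DISTRACTION_THRESHOLD = 0.30  # from the text: collaboration triggers above 30%


@dataclass
class Message:
    recipient: str
    kind: str
    text: str


def distraction_rate(attention_by_student: dict) -> float:
    """Share of students whose current attention label is "distracted"."""
    if not attention_by_student:
        return 0.0
    distracted = sum(1 for s in attention_by_student.values() if s == "distracted")
    return distracted / len(attention_by_student)


def global_agent_step(attention_by_student: dict) -> list:
    """Return the messages to send; an empty list when no trigger fires."""
    rate = distraction_rate(attention_by_student)
    if rate <= DISTRACTION_THRESHOLD:
        return []
    messages = [
        Message("teacher", "interaction_suggestion",
                f"{rate:.0%} of the class is distracted; consider an interactive exercise")
    ]
    # One focus prompt per distracted student, in a stable order.
    for student_id, state in sorted(attention_by_student.items()):
        if state == "distracted":
            messages.append(Message(student_id, "focus_prompt", "Please refocus on the lesson"))
    return messages
```

Because the text says "exceeds 30%", a class sitting exactly at the threshold produces no messages in this sketch.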

[0010] Furthermore, S4 specifically includes the following steps: S401. Feature-level fusion. For single-modal feature extraction, a CNN extracts image features, a Transformer extracts speech features, and an MLP extracts behavioral features. For cross-modal fusion, an attention mechanism weights and fuses the multimodal features to generate a unified classroom state feature vector in which key information carries higher weight. S402. Classroom state reasoning: train a classification model on the fused features to identify classroom behaviors; construct an evaluation system of process indicators and outcome indicators, with process indicators such as attention duration and number of interactive participations and outcome indicators such as knowledge point mastery rate, to comprehensively evaluate teaching effectiveness.
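The attention-weighted fusion of S401 can be shown in a few lines of NumPy. This is a minimal single-query sketch, not the model the application trains: the projection matrices and the query vector stand in for learned parameters, and the function names are hypothetical.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def fuse_modalities(features: dict, projections: dict, query: np.ndarray):
    """Project each modality to a shared size, score it against a query,
    and return the attention-weighted sum plus the per-modality weights."""
    names = sorted(features)
    # (M, d): one row per modality after projection to the shared dimension.
    projected = np.stack([projections[n] @ features[n] for n in names])
    # Scaled dot-product score of each modality against the query.
    scores = projected @ query / np.sqrt(query.size)
    weights = softmax(scores)
    fused = weights @ projected
    return fused, dict(zip(names, weights))
```

The returned weights sum to one, so a modality that scores highly against the query (for instance, vision while a student frowns) contributes proportionally more to the unified state vector.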

[0011] Furthermore, S5 specifically includes the following steps: S501. Personalized teaching generation: the course intelligent agent dynamically generates personalized learning paths based on students' knowledge graphs and the classroom status; at the same time, it adjusts the graphic courseware according to students' visual and auditory preferences. S502. Real-time feedback output: on the student end, a personal attention report and a knowledge point mastery radar chart are displayed through the virtual classroom interface, together with personalized learning suggestions; on the teacher end, a global classroom heat map and a student status ranking are generated, and teaching optimization suggestions are pushed.

[0012] Furthermore, the data transmission described in S601 uses the TLS 1.3 protocol to encrypt multimodal data in transit and prevent eavesdropping and tampering; data storage uses blockchain technology to record classroom data, so that the data is immutable and traceable; privacy protection adopts federated learning to train the classroom analysis models, so that local data is not sent to the cloud and only model parameters are shared; and student identity information is desensitized through k-anonymity so that no individual can be singled out.
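The "immutable and traceable" storage property rests on hash chaining, which can be illustrated with the standard library alone. The sketch below is deliberately not a blockchain: it has no consensus, no distribution, and no signatures, and the class and field names are hypothetical. It only shows why editing a stored record becomes detectable.

```python
import hashlib
import json


def record_hash(payload: dict, prev_hash: str) -> str:
    """SHA-256 over the payload together with the previous block's hash."""
    body = json.dumps({"payload": payload, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class HashChain:
    GENESIS = "0" * 64

    def __init__(self):
        self.blocks = []

    def append(self, payload: dict) -> dict:
        prev = self.blocks[-1]["hash"] if self.blocks else self.GENESIS
        block = {"payload": payload, "prev": prev, "hash": record_hash(payload, prev)}
        self.blocks.append(block)
        return block

    def verify(self) -> bool:
        """Recompute every hash; any edited payload or broken link fails."""
        prev = self.GENESIS
        for block in self.blocks:
            if block["prev"] != prev or block["hash"] != record_hash(block["payload"], prev):
                return False
            prev = block["hash"]
        return True
```

Each block commits to its predecessor, so changing one classroom record invalidates that block and every block after it.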

[0013] A multimodal classroom construction system based on digital twins and curriculum intelligence agents includes a multimodal data acquisition module, a data preprocessing and privacy protection module, and a digital twin modeling module; The multimodal data acquisition module comprises a visual acquisition unit, an auditory acquisition unit, a behavioral acquisition unit, and a teaching data interface. The visual acquisition unit includes a high-definition camera and a facial recognition module for acquiring and recognizing student expressions and postures. The auditory acquisition unit includes an array microphone and a speech processing module for extracting emotional and semantic features of speech. The behavioral acquisition unit includes an eye tracker and motion capture equipment for acquiring gaze trajectory and body movement data. The teaching data interface connects to the academic affairs system and teaching platform via API to obtain courseware and answer data. The data preprocessing and privacy protection module includes a data cleaning unit, a standardization unit, and a privacy protection unit. The data cleaning unit removes noisy data and corrects outliers to ensure data quality. The standardization unit unifies data formats and aligns timestamps to generate a standardized dataset. The privacy protection unit integrates differential privacy, federated learning, and blockchain notarization functions to ensure data security and compliance. The digital twin modeling module includes a scene modeling submodule, an entity mapping submodule, and a dynamic update submodule. The scene modeling submodule is used to construct a 1:1 virtual classroom scene and supports adjustments to lighting and viewing angle. The entity mapping submodule is used to define and bind virtual entities and attributes of students, teachers, and devices. The dynamic update submodule updates the virtual classroom status based on real-time data and corrects synchronization errors.

[0014] Furthermore, it also includes a course intelligent agent module, a multimodal fusion and analysis module, and an interaction and feedback module; The course intelligent agent module includes a perception submodule, a decision-making submodule, and a collaborative control submodule; wherein, the perception submodule is used to receive multimodal data and identify the classroom status; the decision-making submodule generates personalized teaching strategies and interaction instructions based on reinforcement learning; and the collaborative control submodule realizes the collaborative scheduling of the course intelligent agent with teachers and physical devices. The multimodal fusion and analysis module includes a feature extraction submodule, a fusion submodule, and a state analysis submodule; wherein, the feature extraction submodule is used to extract the core features of each modality data; the fusion submodule achieves cross-modal feature fusion through an attention mechanism; and the state analysis submodule is used to identify classroom behavior and evaluate teaching effectiveness, and generate an analysis report. The interaction and feedback module includes a personalized interaction submodule, a feedback display submodule, and an iterative optimization submodule. The personalized interaction submodule is used to push suitable learning content and interaction formats to students. The feedback display submodule displays students' individual reports and teachers' overall classroom reports through web and mobile terminals. The iterative optimization submodule collects feedback from teachers and students to drive model and rule updates.

[0015] Compared with the prior art, the present invention has the following beneficial effects: 1. This invention achieves high-fidelity dynamic twin mapping, significantly enhancing the immersive experience of classroom interaction; by driving millisecond-level updates of the virtual classroom through multimodal data, combined with state prediction and error correction mechanisms, it solves the problems of static modeling and synchronization lag in traditional digital twin classrooms, enabling the virtual classroom to accurately reflect the real-time facial expressions, actions, and emotional changes of teachers and students, and significantly improving the interactive experience compared to existing technologies.

[0016] 2. This invention constructs a deep multimodal fusion analysis system to comprehensively and accurately perceive classroom status; and integrates multi-dimensional data such as visual, auditory, and behavioral data. Through feature-level fusion and attention mechanism weighting, it overcomes the shortcomings of the one-sidedness of single data source analysis. The accuracy of classroom status identification is significantly improved compared with traditional methods, providing comprehensive and objective data support for personalized teaching.

[0017] 3. This invention establishes a collaborative control mechanism for course intelligent agents, effectively implementing personalized teaching. Through the collaboration of global intelligent agents and student-specific intelligent agents, combined with reinforcement learning to dynamically generate teaching strategies, it achieves a precise adaptation from one-size-fits-all teaching to dynamically generated teaching strategies for each student, significantly improving students' mastery of knowledge points and significantly enhancing teaching effectiveness.

[0018] 4. This invention achieves end-to-end data security protection, meeting the compliance requirements for educational data; it innovatively integrates differential privacy, federated learning, and blockchain evidence storage technologies to build a privacy protection system covering the entire lifecycle of data collection, transmission, and storage, ensuring that biometric and behavioral data are not leaked, the risk of sensitive information leakage is extremely low, and it fully complies with relevant regulations.

[0019] 5. This invention provides millisecond-level real-time feedback and closed-loop optimization, which greatly improves teaching efficiency; it shortens the classroom feedback cycle from the traditional weekly level to the millisecond level, allowing teachers to adjust teaching strategies in real time, significantly reducing preparation time and greatly improving teaching efficiency; students simultaneously receive personalized learning suggestions, and the quality of their homework is significantly improved, forming an intelligent teaching closed loop from perception to decision-making, from decision-making to execution, from execution to feedback, and from feedback to optimization.

Attached Figure Description

[0020] Figure 1 is a flowchart of the method for constructing a multimodal classroom based on digital twins and course intelligent agents according to the present invention;
Figure 2 is a detailed flowchart of S1 in the method of the present invention;
Figure 3 is a detailed flowchart of S2 in the method of the present invention;
Figure 4 is a detailed flowchart of S3 in the method of the present invention;
Figure 5 is a detailed flowchart of S4 in the method of the present invention;
Figure 6 is a detailed flowchart of S5 in the method of the present invention;
Figure 7 is a detailed flowchart of S6 in the method of the present invention;
Figure 8 is a structural block diagram of the multimodal classroom construction system based on digital twins and course intelligent agents according to the present invention.

Detailed Implementation

[0021] The embodiments of the present invention will be described in further detail below with reference to the accompanying drawings and examples. The following examples are for illustrative purposes only and should not be construed as limiting the scope of the invention.

[0022] As shown in Figures 1 to 8: this embodiment takes the blended teaching scenario of the "Python Programming" course in the Artificial Intelligence major at Qingdao Hengxing Science and Technology College as the application background, and constructs a complete multimodal classroom system based on digital twins and course intelligent agents. The system is deployed in a standard smart classroom with a capacity of 60 people. The physical space is equipped with a high-density sensor network, edge computing nodes, a local server cluster, and interactive terminal devices, aiming to close the loop across the entire chain from data collection, through modeling and mapping and intelligent decision-making, to personalized feedback.

[0023] First, at the system architecture level, this embodiment deploys six core functional modules: multimodal data acquisition module, data preprocessing and privacy protection module, digital twin modeling module, course intelligent agent module, multimodal fusion and analysis module, and interaction and feedback module. Each module is interconnected through a high-speed local area network and relies on a unified time synchronization service, using the PTP protocol with an accuracy of ±1ms to ensure the consistency of the timing of the entire system.

[0024] The multimodal data acquisition module consists of four types of hardware units:
Visual acquisition unit: 8 high-definition cameras (Hikvision DS-2CD3T47G2-L, 1920×1080 at 30fps) cover the front, middle, and back areas of the classroom. Each camera is connected to a dedicated NVIDIA Jetson AGX Orin edge computing unit, which runs a lightweight YOLOv5s model to detect face regions in real time and outputs a cropped 224×224 pixel image stream;
Auditory acquisition unit: 4 sets of circular array microphones, each with 6 channels, a sampling rate of 48kHz, and a signal-to-noise ratio of ≥70dB, are distributed in the four corners of the ceiling and connected to the central audio processing server through USB 3.0 interfaces. The server runs WebRTC audio front-end algorithms to perform beamforming and voice activity detection, and separates the teacher's main speech from the students' response speech;
Behavior acquisition unit: 20 key observation students in the front rows are equipped with Tobii Pro Fusion eye trackers with a sampling rate of 250Hz, and two OptiTrack Prime 13W infrared motion capture cameras with a frame rate of 120fps are deployed alongside them. Gaze focus coordinates (x,y) and three-dimensional limb joint coordinates (x,y,z) are output at a granularity of 10ms through a dedicated SDK;
Teaching data interface unit: connects to the school's Moodle platform and the ClassIn SDK classroom quiz system via RESTful API to retrieve structured data in real time, including courseware metadata such as PPT page numbers and code snippets, teaching progress markers (e.g., the current chapter is loop structures), and student quiz results such as accuracy and answer time.

[0025] All raw data streams converge via gigabit Ethernet to the data preprocessing and privacy protection module, which is deployed in a local server rack. The server is a Dell PowerEdge R750 with dual Intel Xeon Silver 4310 CPUs and 128GB of RAM, and runs a custom data pipeline engine. The module contains three sub-units:
Data cleaning unit: applies 3×3 median filtering to the image stream to suppress salt-and-pepper noise; extracts Mel-frequency cepstral coefficients from the speech signal and filters out background interference below -40dB; uses a sliding window to remove abrupt outliers from the behavioral data, with a threshold > [missing in source];
Normalization unit: normalizes image pixel values to the [0,1] range, resamples speech PCM to 16kHz mono, aligns all modal data according to global timestamps, and generates data packets in a unified JSON-LD format, each packet containing the multimodal feature vector for one 10ms window;
Privacy protection unit: during the image preprocessing stage, it calls a differential privacy library to add L2-norm-constrained Gaussian noise to the face region so that individuals are not identifiable; at the same time, it launches a federated learning client based on the PySyft framework, which completes the MLP encoding of behavioral features locally at the edge nodes and outputs a 128-dimensional embedding vector. Only the encrypted gradient parameters are uploaded to the central aggregator, and the original biometric data never leaves the local device.
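"L2-norm-constrained Gaussian noise" corresponds to the usual clip-then-perturb pattern, sketched here with NumPy. The privacy budget (`epsilon`, `delta`) and the clipping bound are not given in the text and are assumptions, as are the function names. The noise scale uses the classical Gaussian-mechanism bound, which holds for 0 < epsilon < 1; the sketch says nothing about whether a face perturbed at that scale stays useful for expression analysis.

```python
import math

import numpy as np


def gaussian_sigma(epsilon: float, delta: float, sensitivity: float) -> float:
    """Classical Gaussian-mechanism noise scale (valid for 0 < epsilon < 1)."""
    return math.sqrt(2.0 * math.log(1.25 / delta)) * sensitivity / epsilon


def clip_l2(x: np.ndarray, max_norm: float) -> np.ndarray:
    """Scale x down so that its L2 norm does not exceed max_norm."""
    norm = np.linalg.norm(x)
    return x if norm <= max_norm else x * (max_norm / norm)


def privatize(x: np.ndarray, epsilon: float, delta: float,
              max_norm: float, rng: np.random.Generator) -> np.ndarray:
    """Clip to a known L2 bound, then add calibrated Gaussian noise."""
    clipped = clip_l2(x.astype(np.float64), max_norm)
    # Two inputs clipped to max_norm differ by at most 2 * max_norm in L2.
    sigma = gaussian_sigma(epsilon, delta, sensitivity=2.0 * max_norm)
    return clipped + rng.normal(0.0, sigma, size=clipped.shape)
```

Clipping is what makes the sensitivity finite: without a bound on the input norm there is no noise scale that yields a guarantee.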

[0026] The standardized and anonymized data stream is pushed to the digital twin modeling module, which is developed on the Unity3D 2021.3 LTS engine and runs on a high-performance graphics workstation with an NVIDIA RTX A6000 GPU and 64GB VRAM. The module consists of three sub-components:
Scene modeling submodule: loads the classroom CAD model exported by BIM tools, which contains 30 sets of desks and chairs, blackboards, projection screens, and lighting systems. Real-time light and shadow calculation is achieved through the HDRP high-definition rendering pipeline, and teachers can freely switch between first-person and third-person perspectives;
Entity mapping submodule: creates a virtual avatar for each student, binds it to a unique ID, and associates it with three types of attributes: 1. facial expression, as a 6-dimensional AU vector; 2. behavioral state, including gaze direction and sitting posture angle; 3. knowledge mastery level, a floating-point value between 0 and 1 derived from the knowledge graph. The teacher entity is associated with the teaching progress (the chapter ID or the current PPT page number) and with interactive commands, including asking questions and calling on students; the device entities, namely the projector and speakers, are bound to their on/off status and volume parameters;
Dynamic update submodule: receives standardized data streams via a WebSocket service (based on Node.js + Socket.IO, latency ≤80ms) to drive real-time updates of the virtual avatars. For example, when a student in the physical classroom raises a hand, the rotation angle of the corresponding avatar's arm is set to 90°. The submodule also integrates an LSTM prediction model and a Kalman filter.
The former predicts the student's attention trend over the next 2 seconds from historical data covering the past 5 seconds, while the latter dynamically corrects synchronization errors caused by network jitter, achieving a measured synchronization accuracy of 96.2%. The Kalman filter's state vector includes the position and angular velocity of the student's virtual avatar. The process noise covariance matrix Q is set to 0.01, and the observation noise covariance matrix R is dynamically adjusted according to the camera frame rate, with an initial value of 0.1. The LSTM model contains two hidden layers, each with 128 neurons. The input is a sequence of classroom state features sampled at 100ms intervals over the past 5 seconds, and the output is a sequence of predicted attention level values at 200ms intervals over the next 2 seconds. The model is trained end-to-end on a historical classroom dataset, with mean squared error as the loss function.
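The error-correction step can be illustrated with a scalar Kalman filter using the stated Q = 0.01 and initial R = 0.1. This is a reduced sketch: the filter described above tracks a state vector of position and angular velocity and adapts R to the camera frame rate, whereas the version below tracks one quantity (say, an arm angle in degrees) under a random-walk model with fixed noise values. The class name and the initial covariance are assumptions.

```python
class ScalarKalman:
    """One-dimensional Kalman filter with a random-walk process model."""

    def __init__(self, x0: float = 0.0, p0: float = 1.0,
                 q: float = 0.01, r: float = 0.1):
        self.x = x0  # state estimate
        self.p = p0  # estimate variance
        self.q = q   # process noise variance (Q in the text)
        self.r = r   # observation noise variance (initial R in the text)

    def update(self, z: float) -> float:
        # Predict: the state is assumed unchanged, uncertainty grows by q.
        p_pred = self.p + self.q
        # Correct: blend prediction and measurement by the Kalman gain.
        gain = p_pred / (p_pred + self.r)
        self.x = self.x + gain * (z - self.x)
        self.p = (1.0 - gain) * p_pred
        return self.x
```

With these values the steady-state gain settles near 0.27, so each noisy measurement moves the estimate about a quarter of the way toward it. For the prediction model, the stated sampling implies 50 input steps (5 s at 100 ms) and 10 output values (2 s at 200 ms).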

[0027] Based on this system, the course agent module is activated. This module adopts a microservice architecture deployed on a Kubernetes cluster of 3 nodes, each with 32 CPU cores, and contains two types of agent instances:
Sixty student-specific learning agents: each instance runs independently and loads the corresponding student's knowledge graph, which is stored in a Neo4j graph database with nodes representing knowledge points and edges representing mastery relationships. Its perception submodule subscribes to the multimodal fusion feature stream, its decision submodule generates personalized strategies based on the DQN (deep Q-network) reinforcement learning algorithm, and its execution submodule pushes interactive commands to student terminals via a gRPC interface. The state space of the decision model is the 512-dimensional classroom state feature vector, and the action space comprises four discrete actions: {no operation, pop-up prompt, push practice questions, adjust courseware difficulty}. The reward function R is defined over three quantities, each with its own weighting coefficient: the increase in knowledge mastery, the duration of student attention (in seconds), and the number of intervention operations;
One classroom global intelligent agent: responsible for macro-control. Its perception submodule aggregates the state features of the whole class, its decision-making submodule uses the PPO algorithm to optimize the teaching pace (such as inserting exercises and adjusting the speaking speed), and its execution submodule controls physical devices through the Modbus TCP protocol.
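The student agent's decision loop can be sketched around the four actions and the three reward quantities named above. The reward formula itself does not survive in the text, so the linear form, the sign of each term (interventions treated as a cost), and the default weights below are all assumptions; the epsilon-greedy selection and the one-step target are the standard DQN ingredients, shown without the neural network.

```python
import random

ACTIONS = ("no_operation", "popup_prompt", "push_practice", "adjust_difficulty")


def reward(delta_mastery: float, attention_seconds: float, interventions: int,
           w_mastery: float = 1.0, w_attention: float = 0.01,
           w_intervention: float = 0.1) -> float:
    """Assumed linear reward: gains for mastery and attention, cost per intervention."""
    return (w_mastery * delta_mastery
            + w_attention * attention_seconds
            - w_intervention * interventions)


def select_action(q_values, epsilon: float, rng: random.Random) -> int:
    """Epsilon-greedy choice over the four discrete actions."""
    if rng.random() < epsilon:
        return rng.randrange(len(ACTIONS))
    return max(range(len(ACTIONS)), key=lambda a: q_values[a])


def td_target(r: float, next_q_values, gamma: float = 0.99, done: bool = False) -> float:
    """One-step Q-learning target used to train the Q-network."""
    return r if done else r + gamma * max(next_q_values)
```

Penalizing the intervention count is one way to keep the agent from flooding students with prompts, which is presumably why the text lists it as a reward quantity.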

[0028] The two types of intelligent agents collaborate through a RabbitMQ message queue. When the global intelligent agent detects that the class's attention distraction rate exceeds 30%, it pushes a prompt to the teacher ("it is recommended to increase interactive exercises"), and at the same time the student intelligent agents push a pop-up prompt to the distracted students ("please focus on the loop case on page 15 of the courseware"). After the teacher initiates a classroom quiz, the intelligent agent automatically computes the answer accuracy rate (70% in this example) and pushes a personalized analysis of the incorrectly answered questions.
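
The threshold-triggered collaboration can be sketched as follows. An in-memory `deque` stands in for the RabbitMQ queue, and the message fields, function name, and student ids are hypothetical; only the 30% threshold and the two prompt texts come from the embodiment.

```python
from collections import deque

DISTRACTION_THRESHOLD = 0.30  # class-wide distraction rate from the embodiment


def coordinate(attention_by_student, queue):
    """Publish teacher and student prompts when the threshold is exceeded.

    attention_by_student maps a student id to "focused", "average" or
    "distracted". Returns the computed distraction rate.
    """
    distracted = [sid for sid, level in attention_by_student.items()
                  if level == "distracted"]
    rate = len(distracted) / len(attention_by_student)
    if rate > DISTRACTION_THRESHOLD:
        # Global agent: suggestion for the teacher.
        queue.append({"to": "teacher",
                      "text": "It is recommended to increase interactive exercises."})
        # Student agents: focus prompt for each distracted student.
        for sid in distracted:
            queue.append({"to": sid,
                          "text": "Please focus on the loop case on page 15 of the courseware."})
    return rate


# Hypothetical class of 10 students, 4 of them distracted (40% > 30%).
states = {f"s{i:02d}": "distracted" if i < 4 else "focused" for i in range(10)}
outbox = deque()  # in-memory stand-in for the RabbitMQ queue
rate = coordinate(states, outbox)
```

Here one teacher message and four student messages are queued; with three or fewer distracted students out of ten, nothing would be sent.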

[0029] The core of the system's operation lies in the multimodal fusion and analysis module, which is deployed on a dedicated AI inference server equipped with an NVIDIA A100 80GB GPU and executes the following process. Feature extraction submodule: images are input into a ResNet-18 (ImageNet pre-trained, first 5 layers frozen), which outputs 512-dimensional features; speech is input into a Wav2Vec 2.0 Base model, which outputs 768-dimensional context vectors; behavioral data is input into a 3-layer MLP, which outputs 128-dimensional embeddings. Fusion submodule: the three features are concatenated and input into a 4-head attention mechanism that calculates cross-modal weights; for example, when a student frowns (visual feature activation) and hesitates in their spoken response (low-energy speech feature), the attention weight for this combination is automatically increased. The final output is a 512-dimensional unified classroom state feature vector. State analysis submodule: the unified vector is input to a classification head consisting of 2 fully connected layers and Softmax, which outputs a three-part label: attention level (focused, average, or distracted), emotional state (positive, neutral, or confused), and knowledge mastery (mastered, partially mastered, or not mastered). Simultaneously, an evaluation index system is constructed. The process indicators are attention duration, calculated as the cumulative time in the focused state, and interaction participation frequency, calculated as the frequency of raising hands and responding. The outcome indicator is the knowledge mastery rate, calculated as a weighted average of answer accuracy and response time.
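
The two computed indicators can be illustrated with a small sketch. The source gives no weights and does not say how response time is turned into a score. The 100 ms label interval, the linear time score, the 60 s limit, and the 0.7/0.3 weights below are therefore all assumptions, and the function names are hypothetical.

```python
SAMPLE_INTERVAL_S = 0.1  # assumed: one attention label per 100 ms sample


def attention_duration(labels, interval_s=SAMPLE_INTERVAL_S):
    """Cumulative time (seconds) spent in the 'focused' state."""
    return sum(1 for label in labels if label == "focused") * interval_s


def mastery_rate(accuracy, response_time_s, time_limit_s=60.0,
                 w_accuracy=0.7, w_time=0.3):
    """Weighted average of answer accuracy and a response-time score.

    The time score maps faster answers to values nearer 1. The mapping,
    the 60 s limit and the 0.7/0.3 weights are assumptions.
    """
    time_score = max(0.0, 1.0 - response_time_s / time_limit_s)
    return w_accuracy * accuracy + w_time * time_score


# Hypothetical label stream: 40 focused samples and 10 distracted samples.
labels = ["focused"] * 30 + ["distracted"] * 10 + ["focused"] * 10
focused_s = attention_duration(labels)
rate = mastery_rate(accuracy=0.8, response_time_s=30.0)
```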

[0030] Finally, the interaction and feedback module transforms the analysis results into teaching actions. This module includes the following. Personalized interaction submodule: based on the weak points in the student's knowledge graph (for example, a mastery level of <0.6 for the chapter on for-loop initialization) and the student's learning preferences, it dynamically generates tiered practice questions, progressing from basic to advanced, and adjusts the presentation of courseware, prioritizing flowcharts and highlighted code blocks for visual learners and increasing the proportion of TTS voice explanations for auditory learners. Feedback display submodule: a mobile app developed with React Native displays individual reports to students, including attention curves, knowledge point radar charts, and improvement suggestions; the teacher's web interface (Vue3 + ECharts) receives class heatmaps, student status rankings, and teaching optimization suggestions (for example, "the back-row left area needs more interaction"), with the class heatmap encoding attention status by seat color. Iterative optimization submodule: it automatically collects no less than 100 hours of new data each month and updates the fusion model through incremental learning; it calls the expert review interface every quarter to update the agent's decision tree; and it compares knowledge point mastery rates under the new and old strategies through an A/B testing framework to ensure the effectiveness of each iteration. In the A/B test, two parallel classes receiving the same course content are randomly assigned as the experimental group, which uses the new strategy, and the control group, which uses the old strategy. The testing period is 2 weeks, and the main evaluation indicator is the improvement in knowledge point mastery rate; a p-value of <0.05 is considered significant. The expert review generates a structured rule update document listing the added and deleted if-then rules. For example, the rule "if the percentage of confused emotions in the class is >40% and the number of interactive participations is <3 per 10 minutes, then trigger the global agent to insert a micro-lesson video" is automatically parsed by the system into a decision tree node update instruction.
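
One way to picture the example if-then rule after it has been parsed into a structured form is sketched below. The `Rule` class, its field names, and the `evaluate` helper are hypothetical; the source does not describe the format of the rule update document or of the decision tree node instructions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """One if-then rule from the review document (field names are hypothetical)."""
    min_confused_ratio: float         # fires when the confused ratio is above this
    max_interactions_per_10min: int   # fires when interactions are below this
    action: str


MICRO_LESSON_RULE = Rule(min_confused_ratio=0.40,
                         max_interactions_per_10min=3,
                         action="insert micro-lesson video")


def evaluate(rule, confused_ratio, interactions_per_10min):
    """Return the rule's action if both conditions hold, else None."""
    if (confused_ratio > rule.min_confused_ratio
            and interactions_per_10min < rule.max_interactions_per_10min):
        return rule.action
    return None


# 45% confused with 2 interactions fires the rule; 5 interactions does not.
triggered = evaluate(MICRO_LESSON_RULE, confused_ratio=0.45, interactions_per_10min=2)
not_triggered = evaluate(MICRO_LESSON_RULE, confused_ratio=0.45, interactions_per_10min=5)
```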

[0031] With the support of the above system, this embodiment fully executes steps S1 to S6 of the invention. S1 execution: multi-source devices synchronously collect data from the entire scene, which is then cleaned, standardized, and desensitized with differential privacy to generate a standardized dataset with 10 ms timestamps; S2 execution: a 1:1 virtual replica of the classroom is built in Unity3D, and a mapping accuracy of over 96% is maintained through real-time synchronization via WebSocket, LSTM prediction, and Kalman filtering; S3 execution: 61 intelligent agents form a collaborative network in which the perception layer identifies the state, the decision-making layer generates strategies, and the execution layer controls the devices and pushes content; S4 execution: CNN, Transformer, and MLP models extract single-modal features, and the attention mechanism fuses them into a unified vector to support high-precision classroom status recognition; S5 execution: based on knowledge graphs and preference models, personalized paths are dynamically generated and real-time feedback is provided through the student-side and teacher-side interfaces; S6 execution: security throughout the process is ensured by TLS 1.3 encrypted transmission, blockchain notarization (using Hyperledger Fabric, with a SHA3-256 hash generated for each data record and uploaded to the chain), federated learning, and K-anonymity with K = 5, while monthly data supplementation and quarterly expert review drive the continuous evolution of the model.
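
Two of the S6 safeguards lend themselves to a short sketch: the per-record SHA3-256 hash and the K-anonymity check with K = 5. The record fields, the quasi-identifier columns, and the JSON serialisation are assumptions; the embodiment does not say how records are serialised before hashing, and the Hyperledger Fabric upload is omitted.

```python
import hashlib
import json
from collections import Counter


def record_hash(record):
    """SHA3-256 hex digest of a record, serialised with sorted keys so the
    same content always yields the same hash."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hashlib.sha3_256(payload).hexdigest()


def is_k_anonymous(rows, quasi_identifiers, k=5):
    """True if every combination of quasi-identifier values occurs at least k times."""
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return all(count >= k for count in groups.values())


# Hypothetical record and roster; field names are illustrative only.
digest = record_hash({"student": "anon-017", "t_ms": 120340, "attention": "focused"})
roster = ([{"grade": "G2", "seat_zone": "back-left"}] * 5
          + [{"grade": "G2", "seat_zone": "front"}] * 4)
ok = is_k_anonymous(roster, ["grade", "seat_zone"])  # the "front" group has only 4 rows
```

Sorting the keys before hashing matters because two records with the same content but different field order would otherwise produce different on-chain hashes.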

[0032] This embodiment verifies the feasibility and advancement of the present invention in a real teaching scenario. The various modules of the system are tightly coupled, and the method and hardware architecture are deeply integrated, effectively solving the core pain points of traditional smart classrooms such as static modeling, interaction lag, and weak privacy.

[0033] The embodiments of the present invention are given for the purposes of illustration and description. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention. Any changes, modifications, substitutions and variations made by those skilled in the art to the above embodiments within the scope of the present invention should be included within the protection scope of the present invention.

Claims

1. A method for constructing a multimodal classroom based on digital twins and curriculum intelligence agents, characterized in that it includes the following steps: S1. Multimodal classroom data collection and standardization; S101. Multi-dimensional data collection: deploy multi-source collection devices in the physical classroom to acquire full-scene classroom data, including visual data, auditory data, behavioral data, and teaching data; S2. Digital twin classroom modeling and dynamic updates; S201. Twin model construction, which specifically includes three parts: basic modeling, entity mapping, and relationship definition; S202. Dynamic model updating, which achieves precise adaptation between the virtual classroom and the physical classroom through real-time synchronization, state prediction, and error correction; S3. Construction and collaborative control of course intelligent agents; S301. Intelligent agent architecture design, including a perception layer, a decision layer, and an execution layer; S302. Design of the intelligent agent collaboration mechanism, including human-machine collaboration design and multi-agent collaboration design; S4. Multimodal data fusion and classroom status analysis; S5. Personalized teaching interaction and feedback; S6. Data security and model iteration optimization; S601. Provide end-to-end data security, including data transmission, data storage, and privacy protection; S602. Feedback-driven iteration: collect new classroom data monthly to supplement the training set and optimize the multimodal fusion model and the course agent decision-making model; organize education experts and technology experts to review teaching strategy rules every quarter and update the agent interaction logic; compare teaching effects before and after optimization through A/B testing to ensure the effectiveness of model iteration.

2. The method for constructing a multimodal classroom based on digital twins and curriculum intelligence agents as described in claim 1, characterized in that: S1 further includes: S102. Data standardization processing: median filtering is used to remove image noise, and Mel frequency cepstral coefficients are used to filter speech interference, remove abnormal data, and clean and denoise the data; format unification is also performed, image data is normalized to the [0,1] interval, speech data is converted to 16kHz mono PCM format, and behavioral data is aligned by timestamp; at the same time, differential privacy technology is used to add Gaussian noise to student facial images, and local feature extraction is performed on behavioral data through federated learning to perform privacy desensitization; The visual data mentioned in S101 is collected by a high-definition camera to capture students' facial expressions, body movements, and seating arrangements; the auditory data is collected by an array microphone to capture the teacher's lecturing voice and the students' response voice, and the voice emotion and speech rate features are extracted; the behavioral data is collected by an eye tracker to capture students' gaze focus, and a motion capture device to record body movements; the teaching data is connected to the teaching management system to collect courseware content, teaching progress, and student answer data.

3. The method for constructing a multimodal classroom based on digital twins and curriculum intelligence agents as described in claim 1, characterized in that: the basic modeling step in S201 uses Unity3D to construct a 1:1 virtual replica model of the physical classroom. The model includes the classroom layout, desks and chairs, and teaching equipment, and supports real-time adjustment of lighting and viewing angle. The entity mapping step defines three core twin entities: student entities, teacher entities, and equipment entities. The student entity is associated with facial features, behavioral state, and knowledge mastery; the teacher entity is associated with teaching progress and interactive commands; and the equipment entity is associated with the operating status of the projector and audio equipment. The relationship definition step establishes the relationships between entities to ensure a one-to-one correspondence between virtual and physical entities. The real-time synchronization step in S202 is based on a low-latency communication protocol, which pushes standardized multimodal data to the virtual classroom and updates students' facial expressions, gaze direction, and action status, as well as teachers' teaching actions, in real time. The state prediction step uses a time-series prediction model to predict changes in classroom state in future time periods based on historical data and updates the intelligent agent interaction prompts in the virtual classroom in advance. The error correction step uses a filtering algorithm to dynamically correct the synchronization error between the virtual and physical classrooms to ensure that the model mapping accuracy reaches a preset threshold, which is no less than 95%.

4. The method for constructing a multimodal classroom based on digital twins and curriculum intelligence agents as described in claim 1, characterized in that: The perception layer in S301 receives multimodal data and the virtual classroom status, and identifies students' attention levels, emotional states, and knowledge mastery; the decision layer trains a decision model based on a reinforcement learning algorithm and dynamically generates teaching strategies according to the classroom status; the execution layer outputs teaching instructions through the virtual classroom interface and controls physical devices. The human-machine collaborative design in S302 clearly defines the course intelligent agent as responsible for process interaction, including real-time Q&A and attention guidance, while the teacher focuses on explaining the core teaching content and analyzing key points and difficulties. The multi-agent collaborative design configures a dedicated learning intelligent agent for each student, responsible for personalized learning recommendations, while a global intelligent agent is set up to control the teaching pace and coordinate device scheduling. The two achieve instruction synchronization and state sharing through message queues. In the multi-agent collaborative design, when the global intelligent agent detects that the class's attention distraction rate exceeds 30%, it triggers a collaborative operation to push interactive suggestions to the teacher and focus prompts to distracted students.

5. The method for constructing a multimodal classroom based on digital twins and curriculum intelligence agents as described in claim 1, characterized in that: S4 specifically includes the following steps: S401. Feature-level fusion: for single-modal feature extraction, a CNN extracts image features, a Transformer extracts speech features, and an MLP extracts behavioral features; for cross-modal fusion, an attention mechanism weights and fuses the multimodal features to generate a unified classroom state feature vector that highlights the weight of key information. S402. Classroom state reasoning: train a classification model based on the fused features to identify classroom behaviors; construct an evaluation system of process indicators and outcome indicators, including process indicators such as attention duration and number of interactive participations, and outcome indicators such as knowledge point mastery rate, to comprehensively evaluate teaching effectiveness.

6. The method for constructing a multimodal classroom based on digital twins and curriculum intelligence agents as described in claim 1, characterized in that: S5 specifically includes the following steps: S501. Personalized teaching generation: the course intelligence agent dynamically generates personalized learning paths based on students' knowledge graphs and classroom status; at the same time, it adjusts the graphic courseware according to students' visual and auditory preferences. S502. Real-time feedback output: on the student end, a personal attention report and a knowledge point mastery radar chart are displayed through the virtual classroom interface, together with personalized learning suggestions; on the teacher end, a global classroom heat map and a student status ranking are generated, and teaching optimization suggestions are pushed.

7. The method for constructing a multimodal classroom based on digital twins and curriculum intelligence agents as described in claim 1, characterized in that: the data transmission described in S601 uses the TLS 1.3 protocol to encrypt multimodal data in transit to prevent eavesdropping and tampering; data storage uses blockchain technology to store classroom data to ensure that the data is immutable and traceable; privacy protection uses federated learning to train classroom analysis models, so that local data is not sent to the cloud and only model parameters are shared; and student identity information is desensitized through K-anonymity technology to ensure that individual students cannot be identified.

8. A multimodal classroom construction system based on digital twins and curriculum intelligent agents, applied to the multimodal classroom construction method based on digital twins and curriculum intelligent agents as described in any one of claims 1-7, characterized in that it includes a multimodal data acquisition module, a data preprocessing and privacy protection module, and a digital twin modeling module. The multimodal data acquisition module comprises a visual acquisition unit, an auditory acquisition unit, a behavioral acquisition unit, and a teaching data interface. The visual acquisition unit includes a high-definition camera and a facial recognition module for acquiring and recognizing student expressions and postures. The auditory acquisition unit includes an array microphone and a speech processing module for extracting emotional and semantic features of speech. The behavioral acquisition unit includes an eye tracker and motion capture equipment for acquiring gaze trajectory and body movement data. The teaching data interface connects to the academic affairs system and teaching platform via API to obtain courseware and answer data. The data preprocessing and privacy protection module includes a data cleaning unit, a standardization unit, and a privacy protection unit. The data cleaning unit removes noisy data and corrects outliers to ensure data quality. The standardization unit unifies data formats and aligns timestamps to generate a standardized dataset. The privacy protection unit integrates differential privacy, federated learning, and blockchain notarization functions to ensure data security and compliance. The digital twin modeling module includes a scene modeling submodule, an entity mapping submodule, and a dynamic update submodule. The scene modeling submodule is used to construct a 1:1 virtual classroom scene and supports adjustments to lighting and viewing angle. The entity mapping submodule is used to define and bind virtual entities and attributes of students, teachers, and devices. The dynamic update submodule updates the virtual classroom status based on real-time data and corrects synchronization errors.

9. A multimodal classroom construction system based on digital twins and curriculum intelligence agents as described in claim 8, characterized in that: It also includes a course intelligent agent module, a multimodal fusion and analysis module, and an interaction and feedback module; The course intelligent agent module includes a perception submodule, a decision-making submodule, and a collaborative control submodule; wherein, the perception submodule is used to receive multimodal data and identify the classroom status; the decision-making submodule generates personalized teaching strategies and interaction instructions based on reinforcement learning; and the collaborative control submodule realizes the collaborative scheduling of the course intelligent agent with teachers and physical devices. The multimodal fusion and analysis module includes a feature extraction submodule, a fusion submodule, and a state analysis submodule; wherein, the feature extraction submodule is used to extract the core features of each modality data; the fusion submodule achieves cross-modal feature fusion through an attention mechanism; and the state analysis submodule is used to identify classroom behavior and evaluate teaching effectiveness, and generate an analysis report. The interaction and feedback module includes a personalized interaction submodule, a feedback display submodule, and an iterative optimization submodule. The personalized interaction submodule is used to push suitable learning content and interaction formats to students. The feedback display submodule displays students' individual reports and teachers' overall classroom reports through web and mobile terminals. The iterative optimization submodule collects feedback from teachers and students to drive model and rule updates.
