Three-dimensional digital twin scene control method based on voice dialogue, medium and product
By constructing a modular system architecture and applying the deep semantic understanding of a large language model, the invention addresses the closedness and insufficient intelligence of existing digital twin voice control systems, enabling flexible deployment and intelligent decision-making across industries and scenarios and improving management efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANGHAI FOCUSVISION SECURITY TECH CO LTD
- Filing Date
- 2026-03-16
- Publication Date
- 2026-06-12
Abstract
Description
Technical Field
[0001] This invention relates to a digital twin and intelligent human-computer interaction method, and more particularly to a three-dimensional digital twin scene control method, medium and product based on voice dialogue.
Background Technology
[0002] With the deepening of digital transformation, the application value of digital twin technology in fields such as intelligent manufacturing, smart cities, building operation and maintenance, and the industrial metaverse is becoming increasingly prominent. As the core carrier connecting the physical world and virtual space, 3D digital twin systems are evolving towards real-time interaction and intelligent decision-making. Among them, voice dialogue, as the most natural and convenient human-computer interaction method, is regarded as a key entry point for controlling and exploring complex 3D twin scenarios. Directly scheduling production lines in virtual factories, adjusting energy strategies in smart parks, or querying the real-time status of facilities inside buildings through voice can greatly improve management efficiency and operational intuitiveness.
[0003] However, despite clear market demand, existing digital twin voice interaction solutions still face a series of technical bottlenecks on the path to universality, intelligence, and practicality. Current technologies are mostly limited to simple command control in specific scenarios; they cannot meet flexible deployment needs across industries and scenarios, still less achieve intelligent control based on deep semantic understanding. This severely restricts the potential of digital twin technology. Specifically, there are three core problems:
1. The system is closed, with poor versatility and scalability. Currently, most digital twin voice control systems are deeply bound to specific 3D rendering engines, data sources, and speech recognition models, forming a closed technology stack. Their instruction sets are typically hard-coded in advance, and the control logic is strongly coupled to specific scene models. This makes it difficult for the system to adapt to different domains (such as switching from smart manufacturing to smart buildings) or to integrate new large language models. When new equipment types or business rules need to be introduced, extensive modification of the underlying code and retraining are often required, resulting in high development and maintenance costs and falling short of the universal goal of "build once, deploy everywhere."
[0004] 2. Low level of intelligence, weak semantic understanding and data fusion capabilities. Traditional solutions often rely on simple keyword matching or fixed sentence templates to parse voice commands. This approach can only handle commands with clear structures and standardized expressions (such as "switch to night mode" or "open the floor"), and cannot understand natural language containing complex intentions, vague descriptions, or contextual relationships (such as "check the slow machine at the innermost part of the third floor"). More importantly, existing systems lack the ability to deeply integrate and understand voice commands with the dynamically changing multi-source structured data behind the digital twin (such as real-time equipment status, spatial metadata, and business logic rules), thus failing to make intelligent reasoning and decisions based on data context, and the interaction remains at the surface control level.
[0005] 3. Rigid architecture, insufficient modularization and service orientation. Existing technologies often take the form of a tightly coupled, monolithic application, with blurred boundaries and interdependence among the functional modules (speech recognition, intent understanding, scene driving). This architecture prevents the core intelligent interaction capabilities from being extracted as independent services that third-party business systems or northbound applications can invoke flexibly via standard APIs. As a result, advanced voice interaction capabilities are confined to a single application and cannot serve a broader business ecosystem, for example through convenient integration with enterprise MES systems, IoT platforms, or BI analysis tools, which limits the value of the technology.
[0006] Therefore, the industry urgently needs a breakthrough solution. This solution must be able to build a universal, modular, and intelligent voice interaction middleware layer that can flexibly adapt to different large language models and 3D twin scenarios, and achieve scene control based on deep semantic understanding through innovative data fusion technology, while providing capabilities externally through standard service interfaces. Such a technological breakthrough is the key to propelling digital twins from "visible and viewable" to a new stage of "manageable, controllable, interactive, and decision-making," unlocking their true commercial value.
Summary of the Invention
[0007] The technical problem to be solved by the present invention is to provide a three-dimensional digital twin scene control method based on voice dialogue, which can solve the problems of fixed instructions, shallow semantic understanding, closed system and poor scalability, and difficulty in deep integration with multi-source structured data in existing digital twin voice control methods.
[0008] To address the aforementioned technical problems, this invention provides a three-dimensional digital twin scene control method based on voice dialogue, comprising the following steps: S1, constructing a modular system architecture, receiving voice commands and outputting control results through a standardized northbound interface; S2, establishing a metadata API configuration layer corresponding to the three-dimensional scene, dynamically acquiring multi-dimensional structured data of the park, buildings, and equipment; S3, performing vectorized fusion processing on the multi-dimensional structured data and voice commands to form enhanced prompt information input to a large language model; S4, utilizing the deep semantic understanding and reasoning capabilities of the large language model to generate executable three-dimensional scene operation commands; S5, ultimately driving the three-dimensional digital twin scene to perform real-time interaction.
[0009] Further, in step S1, a local recognition unit and a streaming recognition proxy unit are constructed to form a front-end speech recognition module. The local recognition unit is built based on the browser's native Web Speech API and is used to directly capture audio streams, call the speech recognition engine built into the device's operating system for real-time transcription, and output the intermediate results and the final text through event callbacks. The streaming recognition proxy unit uploads the audio data collected by the front end to the back-end recognition server in real-time in the form of a stream through a low-latency bidirectional communication protocol.
[0010] Furthermore, the low-latency bidirectional communication protocol is the WebSocket communication protocol.
[0011] Furthermore, in step S2, a unified API gateway is established, managed by a metadata agent, which maintains a dynamically registerable directory of data sources.
[0012] Furthermore, when the instructions in step S1 involve building equipment, the metadata agent in step S2 obtains the spatial structure by scheduling the BIM model API, and at the same time schedules the IoT platform API to obtain the real-time status of the equipment.
[0013] Furthermore, step S3 selects and calls one or more of the most relevant data source APIs according to the instruction requirements, and merges and transforms the raw data from each API, combining it with a predefined domain ontology to construct, in memory, a dynamic semantic knowledge graph that supports reasoning.
[0014] Further, step S3 includes: standardization transformation: converting and semantically aligning raw data from different sources according to a predefined unified data model; conflict resolution and fusion: when multiple data sources provide information about the same entity, verifying and fusing the data based on the priority and real-time nature of the data sources to ensure the uniqueness and accuracy of the data; generating an enhanced context: encapsulating the converted standardized data, the original instructions, and the semantic relationships between the data into a complete, self-describing context data package.
[0015] Furthermore, the executable 3D scene operation instructions in step S4 include entity device control, spatial navigation and positioning, scene modes and effects, simulation and contingency plans, and query and annotation.
[0016] The present invention also provides a computer-readable storage medium storing computer instructions thereon, which are executed by a processor to implement the above-described three-dimensional digital twin scene control method.
[0017] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the above-described three-dimensional digital twin scene control method.
[0018] Compared with the prior art, the present invention has the following beneficial effects: by integrating vectorized data fusion with large-model semantic understanding, the three-dimensional digital twin scene control method based on voice dialogue provided by the present invention achieves universal, intelligent and scalable voice dialogue control of three-dimensional digital twin scenes.
Attached Figure Description
[0019] Figure 1 is a flowchart of the three-dimensional digital twin scene control method based on voice dialogue according to the present invention.
Detailed Implementation
[0020] The present invention will now be further described with reference to the accompanying drawings and embodiments.
[0021] Figure 1 is a flowchart of the three-dimensional digital twin scene control method based on voice dialogue according to the present invention.
[0022] Referring to Figure 1, the present invention provides a three-dimensional digital twin scene control method based on voice dialogue, comprising the following steps: S1, constructing a modular system architecture, receiving voice commands and outputting control results through a standardized northbound interface; S2, establishing a metadata API configuration layer corresponding to the 3D scene, and dynamically acquiring multi-dimensional structured data of the park, buildings and equipment; S3, performing vectorized fusion processing on the multi-dimensional structured data and the voice commands to form enhanced prompt information that is input to a large language model; S4, using the deep semantic understanding and reasoning capabilities of the large language model to generate executable 3D scene operation instructions; S5, ultimately driving the 3D digital twin scene to perform real-time interaction.
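Step S3 can be illustrated with a minimal sketch. The text does not name an embedding model or a prompt format, so the example below substitutes a toy bag-of-words vector with cosine similarity for a real embedding, and the names `embed`, `cosine`, `build_prompt` and the record fields are hypothetical. It only shows the general idea of ranking structured records against the spoken command and placing the most relevant ones in the prompt.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase token counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_prompt(command: str, records: list[dict], top_k: int = 2) -> str:
    """Rank structured records by similarity to the command and embed the best ones in the prompt."""
    query = embed(command)
    ranked = sorted(
        records,
        key=lambda r: cosine(query, embed(" ".join(map(str, r.values())))),
        reverse=True,
    )
    facts = "\n".join(f"- {r}" for r in ranked[:top_k])
    return f"Scene data:\n{facts}\nUser command: {command}\nReply with one scene operation."


# Hypothetical structured data returned by the metadata layer.
records = [
    {"entity": "chiller-1", "location": "floor 3", "status": "slow"},
    {"entity": "lamp-7", "location": "lobby", "status": "on"},
    {"entity": "pump-2", "location": "floor 3", "status": "normal"},
]
prompt = build_prompt("check the slow machine on floor 3", records)
```

In this sketch the two floor-3 records are kept and the lobby lamp is dropped, so the model sees only data that plausibly relates to the command.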
[0023] The specific modules for implementing the above control method according to the present invention are given below.
[0024] 1. Front-end speech recognition module
Local recognition unit: This unit is built on the browser's native Web Speech API (or an equivalent interface). It is a lightweight, purely front-end recognition solution that does not require back-end services.
[0025] Workflow: After the user grants microphone access, the unit directly captures the audio stream, calls the speech recognition engine built into the device's operating system to perform real-time transcription, and outputs the intermediate results and the final text through event callbacks.
[0026] Features and applicable scenarios: It offers zero network latency and good privacy, and suits offline, lightweight control scenarios: those that demand very fast responses, use simple commands, run on poor networks, or impose strict data-privacy requirements.
[0027] Streaming recognition proxy unit: This unit serves as a communication bridge between the front-end and the back-end high-performance speech recognition large model (such as streaming ASR service based on deep neural networks).
[0028] Workflow: This unit uses WebSocket or a similar low-latency bidirectional communication protocol to upload the audio data collected by the front end to the back-end recognition server in real time, in segments. The large model on the server side continuously processes the audio stream and returns the recognized text (including intermediate hypotheses and final results) to the front end in near real time.
[0029] Features and Applicable Scenarios: Leveraging the powerful computing and deep learning capabilities of the server's large model, this unit exhibits significant advantages in accuracy across complex noisy environments, specialized terminology recognition, long sentences, and dialects. It is suitable for industrial applications with stringent accuracy requirements, complex instructions, and noisy environments.
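The segmented upload performed by the streaming recognition proxy unit might be framed as in the sketch below. It covers only the segmentation step, not the WebSocket transport itself. The `AudioFrame` structure, the `segment_audio` function, and the 3200-byte frame size (100 ms of 16 kHz, 16-bit mono PCM) are assumptions for illustration; the text does not specify an audio format or a frame layout.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class AudioFrame:
    seq: int        # position of the segment in the stream
    payload: bytes  # raw audio bytes for this segment
    final: bool     # True on the last segment so the server can emit the final text


def segment_audio(pcm: bytes, frame_bytes: int = 3200) -> Iterator[AudioFrame]:
    """Split captured audio into fixed-size segments for streaming upload.

    3200 bytes corresponds to 100 ms of 16 kHz, 16-bit mono PCM (an assumed format).
    """
    if frame_bytes <= 0:
        raise ValueError("frame_bytes must be positive")
    total = max(1, -(-len(pcm) // frame_bytes))  # ceiling division, at least one frame
    for seq in range(total):
        chunk = pcm[seq * frame_bytes:(seq + 1) * frame_bytes]
        yield AudioFrame(seq=seq, payload=chunk, final=(seq == total - 1))


# 8000 bytes of silence become two full frames and one short final frame.
frames = list(segment_audio(b"\x00" * 8000))
```

Each frame would then be sent as one binary message over the bidirectional connection; the sequence number and the final flag let the server order segments and decide when to return the final result.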
[0030] 2. Metadata management module: a dynamic knowledge hub driven by intelligent agents. This module is the intelligent core through which the system perceives and schedules physical-world data. It is responsible for the unified modeling and on-demand scheduling of all entities and their relationships in the 3D digital twin scene.
[0031] API Management: Establish a unified API gateway managed by a metadata agent. This agent maintains a dynamically registerable directory of data sources and can intelligently select and invoke one or more of the most relevant data source APIs based on instruction requirements. For example, when the instruction involves "meeting room," the agent can simultaneously invoke the BIM (Building Information Modeling) API to obtain the spatial structure and the IoT (Internet of Things) platform API to obtain the real-time status of devices.
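A dynamically registerable data source directory of this kind could look like the following sketch. The class names `DataSource` and `MetadataAgent`, the topic-matching rule, and the lambda accessors are hypothetical; real BIM and IoT APIs would be called over the network, and the calls here run sequentially for brevity rather than simultaneously.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class DataSource:
    name: str
    topics: set[str]              # entity types this source can describe
    fetch: Callable[[str], dict]  # hypothetical accessor: entity id -> raw data


@dataclass
class MetadataAgent:
    directory: dict[str, DataSource] = field(default_factory=dict)

    def register(self, source: DataSource) -> None:
        """Add or replace a data source at runtime."""
        self.directory[source.name] = source

    def select(self, topic: str) -> list[DataSource]:
        """Return every registered source relevant to the topic."""
        return [s for s in self.directory.values() if topic in s.topics]

    def gather(self, topic: str, entity_id: str) -> dict[str, dict]:
        """Call each relevant source and key the raw results by source name."""
        return {s.name: s.fetch(entity_id) for s in self.select(topic)}


agent = MetadataAgent()
# Stub accessors standing in for a BIM API and an IoT platform API.
agent.register(DataSource("bim", {"meeting room", "floor"}, lambda eid: {"id": eid, "area_m2": 40}))
agent.register(DataSource("iot", {"meeting room", "device"}, lambda eid: {"id": eid, "lights": "on"}))
result = agent.gather("meeting room", "room-301")
```

Because sources are plain registry entries, adding a new data source is a `register` call rather than a change to the agent itself, which is the property the text emphasizes.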
[0032] Knowledge base management: The agent merges and transforms the raw data from the various APIs and combines it with a predefined domain ontology (such as spatial inclusion and device affiliation) to build, in memory, a dynamic semantic knowledge graph that supports reasoning and real-time decision-making.
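As a rough illustration of such an in-memory graph, the sketch below stores triples and answers one inference over the two ontology relations mentioned in the text, spatial inclusion and device affiliation. The `SceneGraph` class, the relation names `contains` and `belongs_to`, and the entity ids are assumptions, not taken from the source.

```python
from collections import defaultdict


class SceneGraph:
    """In-memory triple store with two assumed relations: 'contains' and 'belongs_to'."""

    def __init__(self) -> None:
        self.edges: dict[tuple[str, str], set[str]] = defaultdict(set)

    def add(self, subject: str, relation: str, obj: str) -> None:
        self.edges[(subject, relation)].add(obj)

    def contained_in(self, space: str) -> set[str]:
        """All spaces transitively contained in `space` (spatial inclusion)."""
        found: set[str] = set()
        stack = [space]
        while stack:
            for child in self.edges.get((stack.pop(), "contains"), ()):
                if child not in found:
                    found.add(child)
                    stack.append(child)
        return found

    def devices_in(self, space: str) -> set[str]:
        """Devices belonging to `space` or to any space nested inside it."""
        spaces = self.contained_in(space) | {space}
        return {
            subject
            for (subject, relation), objects in self.edges.items()
            if relation == "belongs_to" and objects & spaces
        }


graph = SceneGraph()
graph.add("building-A", "contains", "floor-3")
graph.add("floor-3", "contains", "room-301")
graph.add("ac-1", "belongs_to", "room-301")
graph.add("lamp-9", "belongs_to", "lobby")
devices = graph.devices_in("building-A")
```

The query combines both relations: the air conditioner is found under the building even though it was only registered against a room two levels down.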
[0033] Database management: A hybrid storage strategy is adopted to support the persistence and high-speed caching of static model metadata and dynamic runtime data, ensuring that the metadata agent can quickly access historical data and state snapshots to complete trend analysis or state comparison.
[0034] 3. Instruction parsing and scheduling agent
Step 1: Instruction parsing and structured requirement generation
The scheduler receives structured instructions from the semantic recognition module. It first performs semantic parsing on the instructions, automatically identifying the implicit target entities, the operations to be executed, and the required data dimensions. After parsing, it generates a machine-executable, formatted data requirement description file that explicitly lists the data objects, attributes, and constraints required to complete the instructions.
[0035] Step 2: Intelligent data source routing based on the registry
The scheduler has a built-in metadata registry that dynamically maintains standardized metadata for all available data sources, including data source type, access interface, data schema, real-time level, and performance characteristics.
[0036] The scheduler matches the data requirement description file against the registry, dynamically calculating and selecting the optimal data source access path for each data requirement. Three types of data sources are handled:
Knowledge base: For static or quasi-static knowledge such as concept definitions and entity relationships, the scheduler converts the requirements into query statements and retrieves from the pre-built domain meta-knowledge base (such as a vector knowledge base or knowledge graph).
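A data requirement description of the kind produced in Step 1 might be serialized as in this sketch. The JSON layout, the `DataRequirement` fields, and the `REQUIRED_ATTRIBUTES` lookup table are hypothetical; the text only states that the file lists data objects, attributes, and constraints.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DataRequirement:
    data_object: str                 # entity type or id the instruction targets
    attributes: list[str]            # data dimensions needed to carry it out
    constraints: dict[str, str] = field(default_factory=dict)


# Hypothetical lookup: which attributes each operation needs.
REQUIRED_ATTRIBUTES = {
    "inspect": ["status", "location"],
    "switch": ["power_state"],
}


def to_requirements(instruction: dict) -> str:
    """Turn a structured instruction into a JSON requirement description."""
    attrs = REQUIRED_ATTRIBUTES.get(instruction["operation"], ["status"])
    requirements = [
        DataRequirement(target, list(attrs), dict(instruction.get("constraints", {})))
        for target in instruction["targets"]
    ]
    return json.dumps(
        {"operation": instruction["operation"], "requirements": [asdict(r) for r in requirements]},
        indent=2,
    )


doc = to_requirements({"operation": "inspect", "targets": ["machine"], "constraints": {"floor": "3"}})
```

Keeping the description as plain data means the routing stage can consume it without knowing how the instruction was originally phrased.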
[0037] API Interface: For dynamic data that requires real-time status, the scheduler automatically assembles parameters and calls the corresponding real-time service interfaces in parallel according to the registered protocol specifications.
[0038] Database: For structured data such as historical records and business attributes, the scheduler generates and executes query commands through a standardized data access layer.
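The routing decision across the three source types can be sketched as a lookup over registry metadata. The `SourceMeta` fields, the sample registry entries, and the selection policy (the source must provide the attribute, real-time needs exclude non-real-time sources, lowest latency wins) are assumptions; the text does not define how the "optimal" path is calculated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceMeta:
    name: str
    kind: str                  # "knowledge_base", "api", or "database"
    attributes: frozenset[str]
    realtime: bool
    latency_ms: int


# Hypothetical registry contents.
REGISTRY = [
    SourceMeta("domain-kb", "knowledge_base", frozenset({"definition", "relations"}), False, 30),
    SourceMeta("iot-api", "api", frozenset({"status", "temperature"}), True, 80),
    SourceMeta("history-db", "database", frozenset({"status", "maintenance_log"}), False, 20),
]


def route(attribute: str, need_realtime: bool, registry: list[SourceMeta] = REGISTRY) -> SourceMeta:
    """Pick the source for one required attribute under the assumed policy."""
    candidates = [
        s for s in registry
        if attribute in s.attributes and (s.realtime or not need_realtime)
    ]
    if not candidates:
        raise LookupError(f"no registered source provides {attribute!r}")
    return min(candidates, key=lambda s: s.latency_ms)


live = route("status", need_realtime=True)
historical = route("status", need_realtime=False)
```

The same attribute resolves to the IoT API when live state is required and to the database otherwise, which mirrors the split between dynamic and historical data described above.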
[0039] Step 3: Unified acquisition, encapsulation, and context construction
Based on the routing results, the scheduler executes all necessary data acquisition operations in parallel and processes the returned results uniformly.
Standardization transformation: Converting and semantically aligning raw data from different sources according to a predefined unified data model.
[0040] Conflict resolution and fusion: When multiple data sources provide information about the same entity, the data is verified and fused according to the priority and real-time nature of the data sources to ensure the uniqueness and accuracy of the data.
[0041] Generate enhanced context: The transformed standardized data, original instructions, and semantic relationships between the data are encapsulated into a complete, self-describing context data package. This data package serves as the direct input for the downstream parameter generation module to perform precise operation derivation.
[0042] This process enables automatic, accurate, and efficient mapping from user intent to multi-source data, ensuring that system decisions are based on a comprehensive, real-time, and consistent scenario context.
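The conflict resolution and context packaging steps above can be sketched as follows. The rule "highest priority wins, newest timestamp breaks ties" is one plausible reading of "priority and real-time nature"; the `Observation` fields, the `context-package/v0` label, and the JSON layout are hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass
class Observation:
    entity: str
    attribute: str
    value: object
    source: str
    priority: int     # higher wins (assumed convention)
    timestamp: float  # seconds since epoch; newer wins on a priority tie


def fuse(observations: list[Observation]) -> dict[str, dict[str, object]]:
    """Keep one value per (entity, attribute): highest priority, then most recent."""
    best: dict[tuple[str, str], Observation] = {}
    for obs in observations:
        key = (obs.entity, obs.attribute)
        current = best.get(key)
        if current is None or (obs.priority, obs.timestamp) > (current.priority, current.timestamp):
            best[key] = obs
    fused: dict[str, dict[str, object]] = {}
    for (entity, attribute), obs in best.items():
        fused.setdefault(entity, {})[attribute] = obs.value
    return fused


def build_context(instruction: str, observations: list[Observation],
                  relations: list[tuple[str, str, str]]) -> str:
    """Package the instruction, fused data, and relations as a self-describing JSON document."""
    return json.dumps({
        "schema": "context-package/v0",  # hypothetical schema label
        "instruction": instruction,
        "entities": fuse(observations),
        "relations": [{"subject": s, "relation": r, "object": o} for s, r, o in relations],
    }, indent=2)


observations = [
    Observation("ac-1", "status", "off", "bim", 1, 100.0),
    Observation("ac-1", "status", "on", "iot", 2, 90.0),
    Observation("ac-1", "temperature", 26, "iot", 2, 90.0),
    Observation("ac-1", "temperature", 24, "iot", 2, 95.0),
]
package = json.loads(build_context(
    "turn off the AC in room 301", observations, [("ac-1", "belongs_to", "room-301")]
))
```

Here the IoT reading overrides the lower-priority BIM value for `status`, and the newer of two equal-priority temperature readings is kept, so each entity attribute appears exactly once in the package.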
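[0043] 4. 3D Scene Engine Response Module
This invention can parse and execute a variety of complex control instructions. They fall into the following typical types, which illustrate its broad application coverage and fine-grained control: entity device control, spatial navigation and positioning, scene modes and effects, simulation and contingency plans, and query and annotation.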
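How a scene engine might accept the operation instructions generated by the large language model is sketched below. The five category keys mirror the instruction types named elsewhere in this document, but the JSON command shape, the `HANDLERS` table, and the string results standing in for engine calls are all assumptions.

```python
import json
from typing import Callable

Handler = Callable[[dict], str]

# One stub handler per instruction category; a real engine would act on the scene.
HANDLERS: dict[str, Handler] = {
    "device_control": lambda p: f"set {p['device']} to {p['state']}",
    "navigation": lambda p: f"fly camera to {p['target']}",
    "scene_mode": lambda p: f"switch scene to {p['mode']} mode",
    "simulation": lambda p: f"run plan {p['plan']}",
    "query_annotation": lambda p: f"label {p['target']} with {p['text']}",
}


def execute(llm_output: str) -> list[str]:
    """Validate model output and dispatch each command to its handler, in order."""
    commands = json.loads(llm_output)
    if not isinstance(commands, list):
        raise ValueError("expected a JSON array of commands")
    results = []
    for cmd in commands:
        if not isinstance(cmd, dict):
            raise ValueError("each command must be a JSON object")
        handler = HANDLERS.get(cmd.get("type"))
        if handler is None:
            raise ValueError(f"unsupported command type: {cmd.get('type')!r}")
        results.append(handler(cmd.get("params", {})))
    return results


log = execute(
    '[{"type": "navigation", "params": {"target": "floor-3"}},'
    ' {"type": "device_control", "params": {"device": "ac-1", "state": "off"}}]'
)
```

Rejecting unknown command types before anything runs is a deliberate choice in this sketch: model output is treated as untrusted input, and a multi-step sequence is executed in the order the model produced it.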
[0044] The core of this invention lies in building a universal, intelligent, and decoupled "middleware" system, whose advantages span concept and architecture:
1. Fundamentally enhanced interactive intelligence and cognitive depth
Advantages: Traditional solutions respond mechanically: "hear a word, perform an action." This invention instead offers cognitive collaboration that "understands what you want and helps you achieve it in the virtual world." By using a large language model to fuse and understand multi-source data (space, device, business), the system can handle fuzzy intentions, perform logical reasoning (such as "finding the area with the highest energy consumption"), and generate multi-step operation sequences.
[0045] 2. System openness and scalability
Advantages: The invention completely decouples control logic from specific 3D engines and business scenarios. Through standardized metadata APIs and modular service interfaces, the system acts as a "smart adapter," connecting to different voice assistants or business applications on the front end and driving different digital twin platforms and data sources on the back end, achieving the universal goal of "build once, empower flexibly."
[0046] 3. Low-cost integration and maintenance in engineering practice
Advantages: The modular design makes system deployment, upgrades, and maintenance flexible. To support a new scenario, only new metadata needs to be connected through API configuration, without modifying the core code. This significantly reduces the cost of customized development and long-term technical debt, enabling advanced human-computer interaction capabilities to be integrated quickly and cost-effectively into various existing systems.
[0047] To objectively evaluate the effectiveness of this invention, "OpenAssistant" (a general-purpose dialogue assistant based on a large model), a representative open-source project in the field of intelligent interaction, is selected as a baseline for theoretical comparison. It should be noted that OpenAssistant itself is not designed for digital twin control, but its paradigm of command understanding based on a large model represents the current technological frontier. The comparison aims to highlight the significant performance improvements brought about by this invention's targeted optimization for a specific vertical domain.
[0048]
[0049] Although the present invention has been disclosed above with reference to preferred embodiments, it is not intended to limit the present invention. Any person skilled in the art can make some modifications and improvements without departing from the spirit and scope of the present invention. Therefore, the scope of protection of the present invention shall be defined by the claims.
Claims
1. A method for controlling a three-dimensional digital twin scene based on voice dialogue, characterized in that it comprises the following steps: S1, constructing a modular system architecture, receiving voice commands and outputting control results through a standardized northbound interface; S2, establishing a metadata API configuration layer corresponding to the 3D scene, and dynamically acquiring multi-dimensional structured data of the park, buildings and equipment; S3, performing vectorized fusion processing on the multi-dimensional structured data and the voice commands to form enhanced prompt information that is input to a large language model; S4, using the deep semantic understanding and reasoning capabilities of the large language model to generate executable 3D scene operation instructions; S5, ultimately driving the 3D digital twin scene to perform real-time interaction.
2. The three-dimensional digital twin scene control method based on voice dialogue as described in claim 1, characterized in that step S1 constructs a local recognition unit and a streaming recognition proxy unit to form a front-end speech recognition module; the local recognition unit is built on the browser's native Web Speech API and is used to directly capture audio streams, call the speech recognition engine built into the device's operating system for real-time transcription, and output the intermediate recognition results and the final text through event callbacks; the streaming recognition proxy unit uploads audio data collected by the front end to the back-end recognition server in real time in the form of a stream through a low-latency bidirectional communication protocol.
3. The three-dimensional digital twin scene control method based on voice dialogue as described in claim 2, characterized in that the low-latency bidirectional communication protocol is the WebSocket communication protocol.
4. The three-dimensional digital twin scene control method based on voice dialogue as described in claim 1, characterized in that step S2 establishes a unified API gateway managed by a metadata agent, which maintains a dynamically registerable data source directory.
5. The three-dimensional digital twin scene control method based on voice dialogue as described in claim 4, characterized in that, when the instructions in step S1 involve building equipment, the metadata agent in step S2 obtains the spatial structure by scheduling the BIM model API, and at the same time schedules the IoT platform API to obtain the real-time status of the equipment.
6. The three-dimensional digital twin scene control method based on voice dialogue as described in claim 1, characterized in that step S3 selects and calls one or more of the most relevant data source APIs according to the instruction requirements, merges and transforms the raw data from each API, and combines it with a predefined domain ontology to construct, in memory, a dynamic semantic knowledge graph that supports reasoning.
7. The three-dimensional digital twin scene control method based on voice dialogue as described in claim 1, characterized in that step S3 includes: standardization transformation: converting and semantically aligning raw data from different sources according to a predefined unified data model; conflict resolution and fusion: when multiple data sources provide information about the same entity, verifying and fusing the data according to the priority and real-time nature of the data sources to ensure the uniqueness and accuracy of the data; generating an enhanced context: encapsulating the transformed standardized data, the original instructions, and the semantic relationships between the data into a complete, self-describing context data package.
8. The three-dimensional digital twin scene control method based on voice dialogue as described in claim 1, characterized in that the executable 3D scene operation instructions in step S4 include physical device control, spatial navigation and positioning, scene modes and effects, simulation and contingency plans, and query and annotation.
9. A computer-readable storage medium storing computer instructions thereon, characterized in that the computer instructions are executed by a processor to implement the three-dimensional digital twin scene control method as described in any one of claims 1-8.
10. A computer program product, comprising a computer program, characterized in that, when executed by a processor, the computer program implements the three-dimensional digital twin scene control method as described in any one of claims 1-8.