A text description-based 3D virtual scene real-time generation method and system
By combining multimodal large models and large language models, the semantic understanding and spatial logic problems in 3D virtual scene construction were solved, enabling real-time personalized 3D virtual scene generation and improving generation efficiency and quality.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANGHAI YAOLINGZHEN INTELLIGENT TECHNOLOGY CO LTD
- Filing Date
- 2026-05-08
- Publication Date
- 2026-06-19
Figure CN122244329A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of virtual scene generation, and more specifically, to a method and system for real-time generation of 3D virtual scenes based on text description.
Background Technology
[0002] With the rapid development of technologies such as digital twins, virtual reality, and the metaverse, the construction of high-precision 3D virtual scenes has become a core requirement in many fields. Traditional 3D scene construction relies on manual modeling, layout, and rendering by professionals, which is time-consuming, costly, and difficult to scale. In recent years, text-to-3D technology has become a research hotspot, but existing technologies still have the following core shortcomings:
Insufficient semantic understanding depth: Most existing methods can only recognize object nouns in the text. They are weak at capturing the high-dimensional semantic features implied by adjectives that modify the style of objects (such as "cozy" and "post-apocalyptic wasteland"), time and weather information (such as "dusk" and "after a rainstorm"), and material state (such as "rusty" and "covered in snow"). As a result, the generated scene is merely a stack of objects, lacking overall atmosphere and visual consistency.
[0003] Lack of spatial logical reasoning ability: Current technology lacks common-sense reasoning about the topological relationships and functional connections between objects. The generated scenes often suffer from interpenetrating models, disproportionate scale, and layouts that violate physical laws, and therefore cannot meet the needs of applications that require a reasonable spatial structure (such as game levels and building simulation).
[0004] The contradiction between real-time performance and generation quality: methods relying on a pre-set asset library are fast but difficult to cover long-tail personalized needs; end-to-end generation methods (such as diffusion models) can generate diverse assets, but the generation speed is extremely slow (minutes to hours for a single object) and cannot achieve real-time interaction.
[0005] Therefore, there is an urgent need for a 3D virtual scene construction method that can combine real-time response and personalized content generation while understanding deep semantics and ensuring the rationality of spatial logic.
Summary of the Invention
[0006] The purpose of this application is to provide a method and system for real-time generation of 3D virtual scenes based on text description, which can take into account both real-time response and personalized content generation of 3D virtual scenes while understanding deep semantics and ensuring the rationality of spatial logic.
[0007] This application is implemented as follows: In a first aspect, this application provides a method for real-time generation of 3D virtual scenes based on text descriptions, comprising the following steps:
S1: Obtain the text description input by the user, and use a multimodal large model to perform deep semantic parsing on the text description. Extract entity information, abstract attributes and spatial relationships from the text description to generate a structured semantic blueprint. Abstract attributes include time and weather attributes, material state attributes and atmosphere and emotion attributes.
S2: Based on the semantic blueprint, scene layout planning is performed using a large language model that has been fine-tuned by instructions, generating a standardized scene description file containing geometric transformation information and environment rendering parameters; scene layout planning includes determining the spatial position, rotation angle and scaling ratio of each entity in the scene through hierarchical reasoning, and determining the environment rendering parameters.
S3: Based on the entity list in the standardized scene description file, use the description text of each entity as the search condition, and perform similarity search in the pre-built 3D asset vector database. When the search confidence is higher than the preset threshold, call the corresponding 3D model and texture; when the search confidence is lower than the preset threshold, call the lightweight 3D generation model to generate the 3D model and texture of the entity in real time.
S4: Based on the environment rendering parameters and layout information in the standardized scene description file, as well as the 3D model and textures obtained in step S3, the scene is automatically assembled in the rendering engine; at the same time, the time and weather attributes, material state attributes, and atmosphere and emotion attributes in the semantic blueprint are mapped to the corresponding specific parameters of the rendering pipeline; according to the environment rendering parameters, layout information, 3D model and textures, and specific parameters of the rendering pipeline, the rendering operation is executed to output a complete 3D virtual scene.
S5: Receive the user's incremental modification command, perform local semantic parsing on the incremental command, and generate a new semantic blueprint; compare the new semantic blueprint with the current scene description file to identify the changed entities or environmental parameters; only re-execute steps S2 to S4 for the changed entities or environmental parameters to achieve dynamic scene updates.
[0008] Based on the first aspect, the multimodal large model in step S1 is a pre-trained image-text alignment model. The image-text alignment model is used to map adjectives and adverbs in the text into visual parameters through its multimodal alignment capability. The visual parameters include color temperature, illumination angle, material reflectivity, surface roughness, fog concentration, and color lookup table index.
[0009] Based on the first aspect, determining the spatial position, rotation angle, and scaling ratio of each entity in the scene through hierarchical reasoning in step S2 includes:
Macro-layout reasoning: Use the chain-of-thought capability of the large language model to determine the topological relationships of scene elements, including the position of the focal object and the distribution range of surrounding objects;
Geometric constraint solving: Use the Poisson disk sampling algorithm to distribute objects uniformly within a specified area while ensuring that a minimum distance is maintained between them;
Output generation: Generate a scene description file in JSON or USD format. The scene description file includes ambient light parameters, directional light parameters, fog effect parameters, as well as asset query keywords, location coordinates, rotation angle, and scaling factor for each entity.
[0010] Based on the first aspect, the similarity retrieval in the pre-constructed 3D asset vector database in step S3 includes: Asset library construction: Generate multimodal descriptive text for each 3D model, extract feature vectors through a graphic encoding model, store them in a vector database and build an index; Retrieval execution: Encode the entity description text into a query feature vector, perform an approximate nearest neighbor search, sort by similarity, and determine whether the highest similarity score is greater than or equal to a dynamic threshold; The result is determined as follows: when the highest similarity score is greater than or equal to the dynamic threshold, the 3D model corresponding to that similarity score is confirmed as the matching retrieval result.
[0011] Based on the first aspect, in step S3, when the retrieval confidence is lower than a preset threshold, calling the lightweight 3D generation model to generate the 3D geometric model and texture of the entity in real time includes: The 3D generation module uses an optimized attention calculation mechanism to output a 3D mesh and basic texture after sampling a preset number of steps.
[0012] Based on the first aspect, step S4 maps the time-weather attribute, material state attribute, and atmosphere / emotion attribute in the semantic blueprint to the corresponding specific parameters of the rendering pipeline, including: Map the semantics of dusk or evening in the time and weather attributes to warm-toned light sources, low-angle lighting directions, and corresponding skybox labels; Map the snow cover semantics in the material state properties to high reflectivity material parameters and adjust the normal map intensity; The tranquility semantic in the emotional attribute of atmosphere is mapped to reducing the density of dynamic elements, reducing the intensity of ambient light, and increasing the softening effect of fog.
[0013] In a second aspect, this application provides a real-time 3D virtual scene generation system based on text description, comprising:
Semantic parsing module: Used to obtain the text description input by the user. It uses a multimodal large model to perform deep semantic parsing on the text description, extracting entity information, abstract attributes and spatial relationships from the text description, and generating a structured semantic blueprint; abstract attributes include time and weather attributes, material state attributes and atmosphere and emotion attributes;
Scene planning module: Based on the semantic blueprint, it uses a large language model that has been fine-tuned by instructions to plan the scene layout and generate a standardized scene description file containing geometric transformation information and environment rendering parameters. Scene layout planning includes determining the spatial position, rotation angle and scaling ratio of each entity in the scene through hierarchical reasoning, and determining the environment rendering parameters;
Similarity retrieval module: Based on the entity list in the standardized scene description file, it uses the description text of each entity as the retrieval condition to perform similarity retrieval in a pre-built 3D asset vector database. When the retrieval confidence is higher than a preset threshold, it calls the corresponding 3D model and texture; when the retrieval confidence is lower than the preset threshold, it calls a lightweight 3D generation model to generate the 3D model and texture of the entity in real time;
Assembly rendering module: This module is used to automatically assemble the scene in the rendering engine based on the environment rendering parameters and layout information in the standardized scene description file, as well as the 3D model and textures obtained in step S3. At the same time, it maps the time and weather attributes, material state attributes, and atmosphere and emotion attributes in the semantic blueprint to the corresponding specific parameters of the rendering pipeline. Based on the environment rendering parameters, layout information, 3D model and textures, and specific parameters of the rendering pipeline, it performs rendering operations and outputs a complete 3D virtual scene;
Interactive update module: This module receives incremental modification commands from users, performs local semantic parsing on the incremental commands, generates a new semantic blueprint, compares the new semantic blueprint with the current scene description file, and identifies changed entities or environmental parameters. The semantic parsing module, scene planning module, similarity retrieval module, and assembly rendering module are re-executed only for the changed entities or environmental parameters, to achieve dynamic scene updates.
[0014] In a third aspect, this application provides an electronic device, comprising: a memory for storing one or more programs; and a processor. The above method is implemented when the one or more programs are executed by the processor.
[0015] In a fourth aspect, this application provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the above-described method.
[0016] Compared with the prior art, this application has at least the following advantages or beneficial effects: 1. By using a multimodal large model, abstract attributes such as time, weather, material status, atmosphere, and emotion in the text are quantified into specific rendering parameters, achieving precise alignment between text description and visual effects.
[0017] 2. Utilizing the reasoning capabilities of large language models, spatial relationships such as "surrounding," "nearby," and "far away" are transformed into precise geometric constraints. Combined with the Poisson disk sampling algorithm, a reasonable layout that conforms to physical laws is generated.
[0018] 3. By adopting a hybrid "retrieval + generation" strategy, millisecond-level response is achieved for common assets and the generation time of long-tail assets is compressed to a few seconds, satisfying both content diversity and real-time interaction requirements.
Attached Figure Description
[0019] To more clearly illustrate the technical solutions of the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of this application and should not be regarded as a limitation of the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.
[0020] Figure 1 is a flowchart of a method for real-time generation of 3D virtual scenes based on text description, as described in this application; Figure 2 is a schematic diagram of the structure of a real-time 3D virtual scene generation system based on text description according to this application; Figure 3 is a schematic diagram of the structure of an electronic device according to this application.
[0021] Reference numerals: 1. Semantic parsing module; 2. Scene planning module; 3. Similarity retrieval module; 4. Assembly rendering module; 5. Interactive update module; 6. Processor; 7. Memory; 8. Communication interface.
Detailed Implementation
[0022] To make the objectives, technical solutions, and advantages of the embodiments of this application clearer, the technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. The components of the embodiments of this application described and shown in the accompanying drawings can generally be arranged and designed in various different configurations.
[0023] The following detailed description of some embodiments of this application is provided in conjunction with the accompanying drawings. Unless otherwise specified, the various embodiments and features described below can be combined with each other.
[0024] Example: This application provides a method and system for real-time generation of 3D virtual scenes based on text description, which can achieve both real-time response and personalized content generation of 3D virtual scenes while understanding deep semantics and ensuring the rationality of spatial logic.
[0025] The hardware and software environment used in this application is as follows: The server-side utilizes a Linux operating system and an NVIDIA A100 GPU cluster; the deep learning framework is PyTorch; semantic parsing employs the Qwen3-VL multimodal large model, supporting 256K context; INT8 quantization is used, compressing the model size from 140GB to 35GB with an accuracy loss of <1%; vLLM and FlashAttention-2 are used to accelerate inference, improving throughput by 5-10 times. The front-end uses WebGL / WebGPU, with rendering engines including Three.js r160+ and Babylon.js 6.0+, supporting PBR materials, HDR skyboxes, real-time shadows, and post-processing. The asset vector database uses Milvus, and HNSW indexing enables millisecond-level retrieval of millions of assets.
[0026] Please refer to Figure 1. The method for real-time generation of 3D virtual scenes based on text description includes the following steps:
S1: Obtain the text description input by the user, and use a multimodal large model to perform deep semantic parsing on the text description. Extract entity information, abstract attributes and spatial relationships from the text description to generate a structured semantic blueprint. Abstract attributes include time and weather attributes, material state attributes and atmosphere and emotion attributes.
Furthermore, user input supports text, voice (Web Speech API), multiple languages (Chinese / English / Japanese), and incremental dialogue input. Voice input is first recognized and converted into a text description.
[0027] Furthermore, the multimodal large model is a pre-trained image-text alignment model. The image-text alignment model is used to map adjectives and adverbs in the text to visual parameters through its multimodal alignment capabilities. The visual parameters include color temperature, illumination angle, material reflectivity, surface roughness, fog concentration, and color lookup table index.
[0028] Specifically, users input text descriptions via a web interface or voice, such as "a snowy forest shrouded in mist at dusk, with a wooden cabin with a chimney in the center, surrounded by pine trees." The system then calls upon the deployed Qwen3-VL multimodal large model to perform deep analysis of the text. The model first identifies entity information: wooden cabin, pine trees, and chimney; then extracts abstract attributes: time and weather attributes (dusk, mist) and material state attributes (snow cover); finally, it resolves spatial relationships: the wooden cabin is in the center, surrounded by pine trees.
[0029] A layered understanding strategy is adopted: the first layer extracts entities and spatial relationships through a Transformer encoder; the second layer uses a multimodal alignment mechanism to map "dusk" to a color temperature of approximately 2800K, an illumination angle of approximately 15°, and a skybox label pointing to a sunset scene; it maps "snow cover" to a material whose reflectivity is raised above 0.8, with upward-facing surfaces (selected by surface normal) blended with a white, highly reflective material; and it maps "fog" to fog density parameters. Finally, a 768-dimensional semantic blueprint vector is generated as the structured input for subsequent steps.
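The second-layer mapping can be pictured as a lookup from abstract attributes to rendering parameters. The sketch below is only an illustration under stated assumptions: in the application the mapping comes from the image-text alignment model, whereas here a hand-written table stands in for it. The names `VisualParams`, `ATTRIBUTE_RULES` and `map_attributes`, the default values, and the fog density (borrowed from the rendering example, 0.05) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VisualParams:
    """Rendering parameters carried by the semantic blueprint (hypothetical schema)."""
    color_temperature_k: float = 6500.0  # neutral daylight default (assumption)
    light_elevation_deg: float = 60.0    # default illumination angle (assumption)
    skybox_tag: str = "clear_day"        # hypothetical tag name
    snow_reflectivity: float = 0.0
    fog_density: float = 0.0


# Hand-written table standing in for the model's learned text-to-parameter mapping.
# The dusk values follow the description; 0.8 is the lower bound it gives for snow.
ATTRIBUTE_RULES = {
    "dusk": {"color_temperature_k": 2800.0, "light_elevation_deg": 15.0, "skybox_tag": "sunset"},
    "snow cover": {"snow_reflectivity": 0.8},
    "mist": {"fog_density": 0.05},
}


def map_attributes(attributes):
    """Apply every known attribute rule on top of the defaults; unknown names are ignored."""
    params = VisualParams()
    for name in attributes:
        for key, value in ATTRIBUTE_RULES.get(name, {}).items():
            setattr(params, key, value)
    return params


params = map_attributes(["dusk", "mist", "snow cover"])
```

A table like this cannot generalize to unseen adjectives, which is the gap the multimodal model is meant to close; the sketch only shows the shape of the output.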
[0030] This step utilizes the image-text alignment capability of a multimodal large model to quantify abstract descriptions in natural language (such as time, weather, material state, atmosphere, and emotion) into specific rendering parameters, providing precise semantic driving signals for subsequent scene assembly and solving the problem of "single semantic understanding dimension" in existing technologies.
[0031] S2: Based on the semantic blueprint, scene layout planning is performed using a large language model that has been fine-tuned by instructions, generating a standardized scene description file containing geometric transformation information and environment rendering parameters; scene layout planning includes determining the spatial position, rotation angle and scaling ratio of each entity in the scene through hierarchical reasoning, and determining the environment rendering parameters.
Furthermore, determining the spatial position, rotation angle, and scaling ratio of each entity in the scene through hierarchical reasoning includes:
Macro-layout reasoning: Use the chain-of-thought capability of the large language model to determine the topological relationships of scene elements, including the position of the focal object and the distribution range of surrounding objects;
Geometric constraint solving: Use the Poisson disk sampling algorithm to distribute objects uniformly within a specified area while ensuring that a minimum distance is maintained between them;
Output generation: Generate a scene description file in JSON or USD format. The scene description file includes ambient light parameters, directional light parameters, fog effect parameters, as well as asset query keywords, location coordinates, rotation angle, and scaling factor for each entity.
[0032] Specifically, the scene planning service receives the semantic blueprint generated in step S1. The instruction-fine-tuned Qwen3 large model first performs macro-layout reasoning: it takes the wooden cabin as the scene anchor point with coordinates (0, 0, 0); it infers that the pine trees should be arranged in a "surrounding" layout, distributed around the cabin but leaving space at the doorway; and it determines that the surrounding area is a ring-shaped area with an inner diameter of 8 meters and an outer diameter of 25 meters.
[0033] The geometric constraint solver is then invoked, employing the Poisson disk sampling algorithm to generate 20 sampling points within the annular region, ensuring a minimum distance of 3 meters between trees to prevent model overlap. Finally, a standardized JSON scene description file is output, containing environmental parameters (skybox, directional light, fog effect) and asset query keywords for each entity (e.g., "rustic log cabin with snow", "pine tree with snow"), location coordinates, rotation angle, and scaling factor.
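The geometric constraint step can be sketched as follows. This is a minimal illustration, not the solver of the application: it uses simple dart throwing (rejection sampling) to obtain a Poisson disk distribution. The function name `sample_annulus_poisson` and its parameters are hypothetical, the ring is given by radii (4 m and 12.5 m, i.e. half of the stated diameters), and the gap at the doorway is not handled.

```python
import math
import random


def sample_annulus_poisson(inner_radius, outer_radius, min_distance, count,
                           seed=0, max_attempts=10000):
    """Dart-throwing Poisson disk sampling inside a ring centred on the origin.

    Returns up to `count` (x, z) points whose pairwise distance is at least
    `min_distance`; fewer points are returned if the attempts run out.
    """
    rng = random.Random(seed)
    points = []
    for _ in range(max_attempts):
        if len(points) == count:
            break
        # Taking the square root of a uniform variate over [r_in^2, r_out^2]
        # makes the candidates uniform over the ring's area.
        radius = math.sqrt(rng.uniform(inner_radius ** 2, outer_radius ** 2))
        angle = rng.uniform(0.0, 2.0 * math.pi)
        candidate = (radius * math.cos(angle), radius * math.sin(angle))
        # Reject the candidate if it is too close to an accepted point.
        if all(math.dist(candidate, p) >= min_distance for p in points):
            points.append(candidate)
    return points


# Ring with inner diameter 8 m and outer diameter 25 m, 20 trees, 3 m spacing.
tree_positions = sample_annulus_poisson(4.0, 12.5, 3.0, 20)
```

Dart throwing is adequate for a few dozen objects; for dense layouts a grid-accelerated variant would likely be preferable.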
[0034] This step leverages the common-sense reasoning capabilities of large language models to transform ambiguous semantic descriptions into precise geometric layouts. Simultaneously, it uses the Poisson disk sampling algorithm to ensure that objects are evenly distributed and do not intersect, thus solving the core problems of "insufficient spatial logical reasoning capabilities and layouts that violate physical laws" in existing technologies.
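For orientation, a scene description file of the kind output by step S2 might look like the structure built below. The application does not publish a schema, so every key name here (`environment`, `entities`, `asset_query`, and so on) is an assumption, as is the choice of y as the up axis; the light and fog values are the ones given later in the rendering example.

```python
import json


def build_scene_description(tree_positions):
    """Assemble a scene description dict; all key names are hypothetical."""
    entities = [{
        "asset_query": "rustic log cabin with snow",
        "position": [0.0, 0.0, 0.0],      # scene anchor point
        "rotation_deg": [0.0, 0.0, 0.0],
        "scale": 1.0,
    }]
    for x, z in tree_positions:
        entities.append({
            "asset_query": "pine tree with snow",
            "position": [x, 0.0, z],      # y is treated as the up axis (assumption)
            "rotation_deg": [0.0, 0.0, 0.0],
            "scale": 1.0,
        })
    return {
        "environment": {
            "skybox": "sunset",
            "directional_light": {"color": [1.0, 0.6, 0.3], "intensity": 1.2},
            "fog": {"density": 0.05, "color": [0.95, 0.85, 0.7]},
        },
        "entities": entities,
    }


scene = build_scene_description([(6.0, 0.0), (-6.0, 3.0)])
scene_json = json.dumps(scene, indent=2)
```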
[0035] S3: Based on the entity list in the standardized scene description file, use the description text of each entity as the search condition, and perform similarity search in the pre-built 3D asset vector database. When the search confidence is higher than the preset threshold, call the corresponding 3D model and texture; when the search confidence is lower than the preset threshold, call the lightweight 3D generation model to generate the 3D model and texture of the entity in real time. Furthermore, similarity retrieval in the pre-built 3D asset vector database includes: Asset library construction: Generate multimodal descriptive text for each 3D model, extract feature vectors through a graphic encoding model, store them in a vector database and build an index; Retrieval execution: Encode the entity description text into a query feature vector, perform an approximate nearest neighbor search, sort by similarity, and determine whether the highest similarity score is greater than or equal to a dynamic threshold; The result is determined as follows: when the highest similarity score is greater than or equal to the dynamic threshold, the 3D model corresponding to that similarity score is confirmed as the matching retrieval result.
[0036] Furthermore, when the retrieval confidence level is lower than a preset threshold, a lightweight 3D generation model is invoked to generate the entity's 3D geometric model and texture in real time, including: The 3D generation module uses an optimized attention calculation mechanism to output a 3D mesh and basic texture after sampling a preset number of steps.
[0037] Specifically, the JSON file output from step S2 is parsed, and asset retrieval is performed for each entity. For "pine tree with snow", its descriptive text is encoded into a 768-dimensional query vector using Qwen-VL. An HNSW indexed approximate nearest neighbor search is performed in the pre-built Milvus vector database, calculating the cosine similarity with assets in the database. When the highest similarity score (0.92) is greater than the preset threshold of 0.85, the 3D model (.glb format) and texture corresponding to that asset are confirmed as a matching result.
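The threshold decision can be sketched without a vector database. The example below is an assumption-laden stand-in: it replaces the HNSW approximate search in Milvus with an exact brute-force scan, uses 3-dimensional toy vectors instead of 768-dimensional embeddings, and the names `cosine_similarity`, `retrieve_asset` and the asset file names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve_asset(query_vector, asset_index, threshold=0.85):
    """Return (asset_id, score) for the best match, or (None, score) below the threshold."""
    best_id, best_score = None, -1.0
    for asset_id, vector in asset_index.items():
        score = cosine_similarity(query_vector, vector)
        if score > best_score:
            best_id, best_score = asset_id, score
    if best_score >= threshold:
        return best_id, best_score
    return None, best_score


# Toy index; real entries would be embeddings of each asset's descriptive text.
asset_index = {
    "pine_snow.glb": [1.0, 0.0, 0.0],
    "log_cabin.glb": [0.0, 1.0, 0.0],
}
match, score = retrieve_asset([0.9, 0.1, 0.0], asset_index)
```

Returning the score even on a miss lets the caller log how far a query fell below the threshold before switching to generation.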
[0038] For personalized descriptions not found in the database (such as "magic stone tablet with glowing runes"), the retrieval confidence is below the threshold. The automatic generation path is triggered: a 3D generation module based on a latent diffusion model is invoked. This module uses FlashAttention to optimize attention calculations and outputs a 3D mesh and basic texture after 30 steps of DDIM sampling. The entire generation process takes approximately 3-5 seconds. The generated 3D model, after automatic simplification and UV unwrapping, is returned along with the retrieved assets.
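The routing between the two paths reduces to a small dispatch function. In this sketch the retrieval and generation back ends are passed in as callables, so nothing about the actual diffusion module is assumed; `resolve_asset`, `fake_retrieve` and `fake_generate` are hypothetical names, and the stubs only imitate the two outcomes described above.

```python
def resolve_asset(description, retrieve, generate, threshold=0.85):
    """Return (asset, source), where source is 'retrieved' or 'generated'.

    `retrieve(description)` must return (asset_or_None, confidence);
    `generate(description)` must return a newly generated asset.
    """
    asset, confidence = retrieve(description)
    if asset is not None and confidence >= threshold:
        return asset, "retrieved"
    # Confidence below the threshold: fall back to real-time generation.
    return generate(description), "generated"


# Stub back ends for illustration only.
_known = {"pine tree with snow": ("pine_snow.glb", 0.92)}


def fake_retrieve(description):
    return _known.get(description, (None, 0.0))


def fake_generate(description):
    return "generated:" + description


common = resolve_asset("pine tree with snow", fake_retrieve, fake_generate)
long_tail = resolve_asset("magic stone tablet with glowing runes", fake_retrieve, fake_generate)
```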
[0039] This step employs a hybrid strategy of "retrieval as the primary method and generation as the secondary method," which ensures millisecond-level response speed for common assets while generating long-tail personalized assets in real time, thus solving the technical challenge of "the contradiction between real-time performance and generation quality."
[0040] S4: Based on the environment rendering parameters and layout information in the standardized scene description file, as well as the 3D model and textures obtained in step S3, the scene is automatically assembled in the rendering engine; at the same time, the time and weather attributes, material state attributes, and atmosphere and emotion attributes in the semantic blueprint are mapped to the corresponding specific parameters of the rendering pipeline; according to the environment rendering parameters, layout information, 3D model and textures, and specific parameters of the rendering pipeline, the rendering operation is executed to output a complete 3D virtual scene. Furthermore, the temporal weather attributes, material state attributes, and atmosphere / emotion attributes in the semantic blueprint are mapped to corresponding specific rendering pipeline parameters, including: Map the semantics of dusk or evening in the time and weather attributes to warm-toned light sources, low-angle lighting directions, and corresponding skybox labels; Map the snow cover semantics in the material state properties to high reflectivity material parameters and adjust the normal map intensity; The tranquility semantic in the emotional attribute of atmosphere is mapped to reducing the density of dynamic elements, reducing the intensity of ambient light, and increasing the softening effect of fog.
[0041] Specifically, the rendering engine receives the JSON scene description file output in step S2 and the 3D models and textures produced in step S3. First, the scene is initialized according to the environmental parameters: the sunset HDR skybox is loaded, the directional light color is set to warm orange (1.0, 0.6, 0.3) with an intensity of 1.2, the shadow bias is tuned, and exponential height fog is enabled (density 0.05, color (0.95, 0.85, 0.7)).
[0042] Then, each entity is instantiated: for the 20 pine trees that share the same geometric mesh, GPU instancing is automatically enabled, so that only one copy of the mesh data is kept and the 20 instances are drawn from an array of transformation matrices, which significantly reduces draw calls.
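The per-instance data for such a draw is just one transformation matrix per tree. The sketch below builds that array on the CPU side; it assumes a right-handed system with y up, rotation only about the vertical axis, uniform scale, and a row-major layout for column vectors. `instance_matrix` and `build_instance_buffer` are hypothetical names, and uploading the array to the GPU (for example through an instanced mesh object in the rendering engine) is outside the sketch.

```python
import math


def instance_matrix(position, yaw_deg, scale):
    """4x4 matrix (row-major, column-vector convention): scale, rotate about Y, translate."""
    x, y, z = position
    c = math.cos(math.radians(yaw_deg)) * scale
    s = math.sin(math.radians(yaw_deg)) * scale
    return [
        [c,   0.0,   s,   x],
        [0.0, scale, 0.0, y],
        [-s,  0.0,   c,   z],
        [0.0, 0.0,   0.0, 1.0],
    ]


def build_instance_buffer(placements):
    """One matrix per placement; the shared mesh itself is stored only once."""
    return [instance_matrix(p["position"], p["yaw_deg"], p["scale"]) for p in placements]


placements = [
    {"position": (6.0, 0.0, 0.0), "yaw_deg": 0.0, "scale": 1.0},
    {"position": (-6.0, 0.0, 3.0), "yaw_deg": 90.0, "scale": 1.2},
]
instance_buffer = build_instance_buffer(placements)
```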
[0043] Finally, semantic-driven post-processing is performed: based on the "dusk" semantic, a bloom effect (intensity 0.5) is automatically enabled to enhance the sunset glow; based on the "fog" semantic, the fog density is adjusted; and based on the "snow" semantic, a warm-toned color lookup table is loaded. After all parameters are mapped, rendering is performed, outputting a real-time 3D virtual scene at 60fps.
[0044] This step fills the technical gap between semantic understanding and graphics rendering pipelines. It achieves deep integration of text semantics with ambient lighting, fog effects, and post-processing effects through automated parameter mapping. At the same time, it ensures real-time rendering performance through GPU instancing, eliminating the problem of "asset-environment" separation in traditional methods.
[0045] S5: Receive the user's incremental modification command, perform local semantic parsing on the incremental command, and generate a new semantic blueprint; compare the new semantic blueprint with the current scene description file to identify the changed entities or environmental parameters; only re-execute steps S2 to S4 for the changed entities or environmental parameters to achieve dynamic scene updates.
[0046] Specifically, the user issues an incremental modification command, "Turn all trees into dead trees." First, a local semantic parsing is performed on this command to generate a new semantic blueprint (dead trees, retaining their original positions). Then, the new semantic blueprint is compared with the current scene description file, identifying that only the "tree asset type" has changed, while the cabin and environment parameters remain unchanged. Only steps S2 and S3 are re-executed: the asset query keyword for trees in the entity list is updated to "dead tree," and the "dead tree" 3D model is retrieved again; the scene description files for the cabin and environment remain unchanged, and their layout and position data are also preserved. The entire update process only requires replacing the asset model files, without needing to re-layout or re-render the entire scene, achieving sub-second dynamic scene modification.
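The difference comparison can be sketched as a plain dictionary diff. The layout assumed here (entities keyed by a stable identifier, a separate `environment` block) and the name `diff_scene` are hypothetical; the application does not specify how the comparison is implemented.

```python
def diff_scene(current, updated):
    """Compare two scene descriptions and report what changed.

    Returns (changed_or_new_entity_ids, removed_entity_ids, environment_changed).
    """
    changed = [
        entity_id
        for entity_id, entity in updated["entities"].items()
        if current["entities"].get(entity_id) != entity
    ]
    removed = [e for e in current["entities"] if e not in updated["entities"]]
    environment_changed = current["environment"] != updated["environment"]
    return changed, removed, environment_changed


current = {
    "environment": {"skybox": "sunset"},
    "entities": {
        "cabin": {"asset_query": "rustic log cabin with snow", "position": [0.0, 0.0, 0.0]},
        "tree_0": {"asset_query": "pine tree with snow", "position": [6.0, 0.0, 0.0]},
        "tree_1": {"asset_query": "pine tree with snow", "position": [-6.0, 0.0, 3.0]},
    },
}
# "Turn all trees into dead trees": only the query keyword changes, positions stay.
updated = {
    "environment": {"skybox": "sunset"},
    "entities": {
        key: ({**value, "asset_query": "dead tree"} if key.startswith("tree_") else value)
        for key, value in current["entities"].items()
    },
}
changed, removed, environment_changed = diff_scene(current, updated)
```

Only the identifiers in `changed` would then be passed back through retrieval, which is what keeps the update local.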
[0047] This step, through incremental semantic parsing and difference comparison, only reprocesses the changed parts, avoiding full-scene reconstruction and enabling dynamic scene updates under real-time interaction, significantly improving the user experience.
[0048] Please refer to Figure 2. Based on the same inventive concept, this embodiment also provides a real-time 3D virtual scene generation system based on text description, including:
Semantic parsing module 1: obtains the text description input by the user and uses a multimodal large model to perform deep semantic parsing on it, extracting entity information, abstract attributes, and spatial relationships from the text description to generate a structured semantic blueprint; the abstract attributes include time and weather attributes, material state attributes, and atmosphere and emotion attributes;
Scene planning module 2: based on the semantic blueprint, uses an instruction-fine-tuned large language model to plan the scene layout and generate a standardized scene description file containing geometric transformation information and environment rendering parameters; the scene layout planning includes determining the spatial position, rotation angle, and scaling ratio of each entity in the scene through hierarchical reasoning, and determining the environment rendering parameters;
Similarity retrieval module 3: based on the entity list in the standardized scene description file, uses the description text of each entity as the retrieval condition to perform similarity retrieval in the pre-built 3D asset vector database; when the retrieval confidence is higher than the preset threshold, it calls the corresponding 3D model and texture; when the retrieval confidence is lower than the preset threshold, it calls the lightweight 3D generation model to generate the 3D model and texture of the entity in real time;
Assembly rendering module 4: automatically assembles the scene in the rendering engine based on the environment rendering parameters and layout information in the standardized scene description file and the 3D models and textures obtained by similarity retrieval module 3; at the same time, it maps the time and weather attributes, material state attributes, and atmosphere and emotion attributes in the semantic blueprint to the corresponding specific parameters of the rendering pipeline; based on the environment rendering parameters, the layout information, the 3D models and textures, and the specific parameters of the rendering pipeline, it performs the rendering operation and outputs a complete 3D virtual scene;
Interactive update module 5: receives incremental modification instructions from the user, performs local semantic parsing on the incremental instructions, and generates a new semantic blueprint; compares the new semantic blueprint with the scene description file of the current scene to identify changed entities or environmental parameters; and re-executes the semantic parsing module, scene planning module, similarity retrieval module, and assembly rendering module only for the changed entities or environmental parameters, to achieve dynamic scene updates.
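As a non-limiting illustration, the threshold decision made by the similarity retrieval module can be sketched as follows. The sketch assumes cosine similarity over small hand-written vectors and a brute-force scan in place of a real vector database; the names `resolve_asset`, `asset_index`, and `generate` are hypothetical. The case where the score equals the threshold is treated as a match here, which is an assumption of the sketch.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def resolve_asset(
    query_vec: Vector,
    asset_index: dict[str, Vector],
    threshold: float,
    generate: Callable[[], str],
) -> tuple[str, str]:
    """Return ("retrieved", asset_id) or ("generated", new_asset_id)."""
    best_id, best_score = None, -1.0
    # Brute-force scan; a real system would use an approximate nearest neighbor index.
    for asset_id, vec in asset_index.items():
        score = cosine(query_vec, vec)
        if score > best_score:
            best_id, best_score = asset_id, score
    if best_id is not None and best_score >= threshold:
        return ("retrieved", best_id)
    # Confidence too low: fall back to the (stubbed) generation model.
    return ("generated", generate())


# Hypothetical three-dimensional embeddings for two library assets.
asset_index = {"oak_table": [1.0, 0.0, 0.0], "street_lamp": [0.0, 1.0, 0.0]}
hit = resolve_asset([0.9, 0.1, 0.0], asset_index, 0.8, lambda: "generated_mesh")
miss = resolve_asset([0.5, 0.5, 0.7], asset_index, 0.8, lambda: "generated_mesh")
```

The first query lies close to the `oak_table` vector and is served from the library; the second is not close to any stored asset, so the fallback callable is invoked instead.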
[0049] For a detailed implementation of the real-time generation system for 3D virtual scenes based on text description, please refer to the detailed implementation of the real-time generation method for 3D virtual scenes based on text description mentioned above. Further details will not be provided here.
[0050] Please refer to Figure 3. This embodiment also provides an electronic device, including: memory 7, used to store one or more programs; and processor 6, connected to memory 7 via communication interface 8. When the one or more programs are executed by processor 6, all or part of the above method is implemented.
[0051] In a fourth aspect, this application provides a computer-readable storage medium having a computer program stored thereon which, when executed by processor 6, implements all or part of the method described above.
[0052] The memory 7 may be, but is not limited to, random access memory (RAM), read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), etc.
[0053] Processor 6 can be an integrated circuit chip with signal processing capabilities. This processor can be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
[0054] It will be apparent to those skilled in the art that this application is not limited to the details of the exemplary embodiments described above, and that this application can be implemented in other specific forms without departing from the spirit or essential characteristics of this application. Therefore, the embodiments should be considered illustrative and non-limiting in all respects, and the scope of this application is defined by the appended claims rather than the foregoing description. Thus, all variations falling within the meaning and scope of equivalents of the claims are intended to be included within this application. No reference numerals in the claims should be construed as limiting the scope of the claims.
Claims
1. A method for real-time generation of 3D virtual scenes based on text description, characterized in that it includes the following steps:
S1: Obtain the text description input by the user, and use a multimodal large model to perform deep semantic parsing on the text description, extracting entity information, abstract attributes, and spatial relationships from the text description to generate a structured semantic blueprint; the abstract attributes include time and weather attributes, material state attributes, and atmosphere and emotion attributes;
S2: Based on the semantic blueprint, use an instruction-fine-tuned large language model to plan the scene layout and generate a standardized scene description file containing geometric transformation information and environment rendering parameters; the scene layout planning includes determining the spatial position, rotation angle, and scaling ratio of each entity in the scene through hierarchical reasoning, and determining the environment rendering parameters;
S3: Based on the entity list in the standardized scene description file, use the description text of each entity as a retrieval condition to perform similarity retrieval in the pre-built 3D asset vector database; when the retrieval confidence is higher than a preset threshold, call the corresponding 3D model and texture; when the retrieval confidence is lower than the preset threshold, call the lightweight 3D generation model to generate the 3D model and texture of the entity in real time;
S4: Based on the environment rendering parameters and layout information in the standardized scene description file, and the 3D model and textures obtained in step S3, automatically assemble the scene in the rendering engine; at the same time, map the time and weather attributes, material state attributes, and atmosphere and emotion attributes in the semantic blueprint to the corresponding specific parameters of the rendering pipeline; according to the environment rendering parameters, layout information, 3D model and textures, and the specific parameters of the rendering pipeline, perform the rendering operation to output a complete 3D virtual scene;
S5: Receive the user's incremental modification instruction, perform local semantic parsing on the incremental instruction, and generate a new semantic blueprint; compare the new semantic blueprint with the scene description file of the current scene to identify the changed entities or environmental parameters; re-execute steps S2 to S4 only for the changed entities or environmental parameters to achieve dynamic scene updates.
2. The method for real-time generation of 3D virtual scenes based on text description according to claim 1, characterized in that the multimodal large model in step S1 is a pre-trained image-text alignment model; the image-text alignment model uses its multimodal alignment capability to map adjectives and adverbs in the text to visual parameters; the visual parameters include color temperature, illumination angle, material reflectivity, surface roughness, fog concentration, and color lookup table index.
3. The method for real-time generation of 3D virtual scenes based on text description according to claim 1, characterized in that, in step S2, determining the spatial position, rotation angle, and scaling ratio of each entity in the scene through hierarchical reasoning includes:
Macro-layout reasoning: use the chain-of-thought capability of the large language model to determine the topological relationship of scene elements, including the position of the focal object and the distribution range of surrounding objects;
Geometric constraint solving: use the Poisson disk sampling algorithm to distribute objects uniformly within a specified area while ensuring that a minimum distance is maintained between objects;
Output generation: generate a scene description file in JSON or USD format, which includes ambient light parameters, directional light parameters, and fog effect parameters, as well as the asset query keywords, location coordinates, rotation angle, and scaling factor for each entity.
4. The method for real-time generation of 3D virtual scenes based on text description according to claim 1, characterized in that, in step S3, performing similarity retrieval in the pre-built 3D asset vector database includes:
Asset library construction: generate multimodal descriptive text for each 3D model, extract feature vectors through an image-text encoding model, store them in a vector database, and build an index;
Retrieval execution: encode the entity description text into a query feature vector, perform an approximate nearest neighbor search, sort the results by similarity, and determine whether the highest similarity score is greater than or equal to a dynamic threshold;
Result determination: when the highest similarity score is greater than or equal to the dynamic threshold, the 3D model corresponding to that similarity score is confirmed as the matching retrieval result.
5. The method for real-time generation of 3D virtual scenes based on text description according to claim 4, characterized in that, in step S3, when the retrieval confidence is lower than the preset threshold, calling the lightweight 3D generation model to generate the 3D geometric model and texture of the entity in real time includes: the lightweight 3D generation model uses an optimized attention calculation mechanism and outputs a 3D mesh and a basic texture after a preset number of sampling steps.
6. The method for real-time generation of 3D virtual scenes based on text description according to claim 1, characterized in that, in step S4, mapping the time and weather attributes, material state attributes, and atmosphere and emotion attributes in the semantic blueprint to the corresponding specific parameters of the rendering pipeline includes:
the dusk or evening semantics in the time and weather attributes are mapped to warm-toned light sources, low-angle lighting directions, and corresponding skybox labels;
the snow cover semantics in the material state attributes are mapped to high-reflectivity material parameters and normal map intensity adjustments;
the tranquility semantics in the atmosphere and emotion attributes are mapped to a reduced density of dynamic elements, a reduced ambient light intensity, and an increased fog softening effect.
7. A real-time 3D virtual scene generation system based on text description, characterized in that it includes:
Semantic parsing module: used to obtain the text description input by the user and to perform deep semantic parsing on the text description using a multimodal large model, extracting entity information, abstract attributes, and spatial relationships from the text description to generate a structured semantic blueprint; the abstract attributes include time and weather attributes, material state attributes, and atmosphere and emotion attributes;
Scene planning module: based on the semantic blueprint, uses an instruction-fine-tuned large language model to plan the scene layout and generate a standardized scene description file containing geometric transformation information and environment rendering parameters; the scene layout planning includes determining the spatial position, rotation angle, and scaling ratio of each entity in the scene through hierarchical reasoning, and determining the environment rendering parameters;
Similarity retrieval module: based on the entity list in the standardized scene description file, uses the description text of each entity as a retrieval condition to perform similarity retrieval in a pre-built 3D asset vector database; when the retrieval confidence is higher than a preset threshold, it calls the corresponding 3D model and texture; when the retrieval confidence is lower than the preset threshold, it calls a lightweight 3D generation model to generate the 3D model and texture of the entity in real time;
Assembly rendering module: used to automatically assemble the scene in the rendering engine based on the environment rendering parameters and layout information in the standardized scene description file and the 3D model and textures obtained by the similarity retrieval module; at the same time, it maps the time and weather attributes, material state attributes, and atmosphere and emotion attributes in the semantic blueprint to the corresponding specific parameters of the rendering pipeline; based on the environment rendering parameters, layout information, 3D model and textures, and the specific parameters of the rendering pipeline, it performs the rendering operation and outputs a complete 3D virtual scene;
Interactive update module: receives incremental modification instructions from the user, performs local semantic parsing on the incremental instructions, generates a new semantic blueprint, compares the new semantic blueprint with the scene description file of the current scene to identify changed entities or environmental parameters, and re-executes the semantic parsing module, scene planning module, similarity retrieval module, and assembly rendering module only for the changed entities or environmental parameters to achieve dynamic scene updates.
8. An electronic device, characterized in that it includes: a memory, used to store one or more programs; and a processor; when the one or more programs are executed by the processor, the method as described in any one of claims 1-6 is implemented.
9. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the method as described in any one of claims 1-6.
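As a non-limiting illustration outside the claims, the Poisson disk sampling named in claim 3 can be sketched in two dimensions as follows. The sketch assumes a rectangular placement area and uses Bridson's grid-accelerated variant, which is one common way to realize Poisson disk sampling and is not stated to be the variant used in this application; the function name `poisson_disk` and its parameters are hypothetical.

```python
import math
import random


def poisson_disk(width: float, height: float, min_dist: float,
                 k: int = 30, seed: int = 0) -> list[tuple[float, float]]:
    """Sample points in a width x height area, each pair at least min_dist apart."""
    rng = random.Random(seed)
    cell = min_dist / math.sqrt(2)  # each grid cell holds at most one point
    cols = int(math.ceil(width / cell))
    rows = int(math.ceil(height / cell))
    grid = [[None] * cols for _ in range(rows)]

    def cell_of(p):
        return min(int(p[1] / cell), rows - 1), min(int(p[0] / cell), cols - 1)

    def fits(p):
        # Only cells within two steps can contain a point closer than min_dist.
        r, c = cell_of(p)
        for rr in range(max(r - 2, 0), min(r + 3, rows)):
            for cc in range(max(c - 2, 0), min(c + 3, cols)):
                q = grid[rr][cc]
                if q is not None and math.dist(p, q) < min_dist:
                    return False
        return True

    first = (rng.uniform(0, width), rng.uniform(0, height))
    points, active = [first], [first]
    r0, c0 = cell_of(first)
    grid[r0][c0] = first
    while active:
        idx = rng.randrange(len(active))
        base = active[idx]
        for _ in range(k):
            # Candidate in the annulus between min_dist and 2 * min_dist.
            radius = rng.uniform(min_dist, 2 * min_dist)
            angle = rng.uniform(0, 2 * math.pi)
            cand = (base[0] + radius * math.cos(angle),
                    base[1] + radius * math.sin(angle))
            if 0 <= cand[0] < width and 0 <= cand[1] < height and fits(cand):
                points.append(cand)
                active.append(cand)
                r, c = cell_of(cand)
                grid[r][c] = cand
                break
        else:
            # No candidate fit after k tries: this point is exhausted.
            active.pop(idx)
    return points


# Hypothetical 10 x 10 placement area with a 2-unit minimum spacing.
positions = poisson_disk(10.0, 10.0, 2.0, seed=1)
```

Each returned coordinate pair could serve as the ground-plane position of one surrounding object, with the minimum distance chosen from the footprint of the largest object to reduce the risk of intersecting models.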