Three-dimensional scene generation method, electronic device, and storage medium
By generating target-aware text and using multi-view rendering technology, the problems of blurred boundaries and inconsistent appearance in 3D model editing have been solved, achieving geometric and appearance consistency between the editing area and the non-editing area, thus improving the accuracy and efficiency of 3D editing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- INSPUR SUZHOU INTELLIGENT TECH CO LTD
- Filing Date
- 2026-05-21
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, the boundaries of the 3D model editing area are blurred, and the appearance and geometry are inconsistent from multiple perspectives, resulting in insufficient realism and low efficiency in the generated results.
By acquiring the source 3D model and reference image, target-aware text is generated. Based on the target-aware text and reference image, a target 3D model is constructed. The boundary band between the edited and non-edited areas is extracted through multi-view rendering. Geometric alignment is performed using depth images. The appearance features are optimized by combining scene-aware text with multi-view images to achieve visual consistency between the edited and non-edited areas.
It significantly improves the accuracy and efficiency of 3D editing, ensures the geometric and appearance consistency between the edited and non-edited areas, and enhances the realism and visual quality of the generated results.
Figure CN122244341A_ABST
Description
Technical Field
[0001] This application relates to the field of computer vision technology, and in particular to methods for generating 3D scenes, electronic devices, and storage media.
Background Technology
[0002] In related technologies, two-dimensional image editing tools are typically used, and the editing effect is lifted into three-dimensional space through multi-view iterative optimization. For example, local editing can be achieved by combining diffusion models with three-dimensional Gaussian splatting, enabling three-dimensional editing of a specified area under the guidance of text or reference images.
[0003] However, the related technologies still face significant problems in practical applications: (1) The editing areas determined from text or reference images, or specified manually, are imprecise, so the results are over-edited or under-edited; (2) After editing, the editing areas and non-editing areas of the 3D scene lack continuity, resulting in insufficient realism. When there are multiple editing requirements and multiple editing areas, the boundaries between those editing areas and the source scene are even more likely to be inconsistent; (3) After editing, the editing areas of the 3D scene exhibit multi-view and geometric inconsistencies with respect to the reference images.
Summary of the Invention
[0004] This application provides a method for generating three-dimensional scenes, an electronic device, and a storage medium to at least solve the problems in related technologies, such as the lack of realism and low efficiency of the generated three-dimensional models due to blurred boundaries of the editing area, inconsistencies between the appearance and geometry from multiple perspectives, and the lack of continuity between the editing area and the non-editing area.
[0005] This application provides a method for generating a 3D scene, comprising the following steps: acquiring a source 3D model and a reference image; generating target-aware text based on the reference image; generating a target 3D model based on the target-aware text and the reference image; inserting the target 3D model into a target region of the source 3D model to obtain a first 3D model; using the target region as an editing region and the region outside the target region as a non-editing region; rendering the first 3D model to obtain a first multi-view image; determining the boundary band between the editing region and the non-editing region in the first 3D model based on the first multi-view image; determining the depth image of the boundary band based on the depth image of the first multi-view image; aligning the geometric features between the editing region and the non-editing region in the first 3D model based on the depth image of the boundary band to obtain a second 3D model; generating scene-aware text based on the source 3D model; rendering the second 3D model to obtain a second multi-view image; fusing the first multi-view image and the second multi-view image to obtain a fused image; aligning the appearance features between the editing region and the non-editing region in the second 3D model based on the target-aware text, the scene-aware text, and the fused image; and generating a 3D scene based on the second 3D model aligned with the appearance features.
[0006] This application also provides an electronic device, including: a memory for storing a computer program; and a processor for executing the computer program to implement the steps of any of the above-described three-dimensional scene generation methods.
[0007] This application also provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of any of the above-described three-dimensional scene generation methods.
[0008] The embodiments of this application generate target-aware text from a reference image, rapidly generate a target 3D model using the reference image and the target-aware text, and insert the target 3D model into the target region of the source 3D model to generate a first 3D model. Boundary bands between the edited and non-edited regions are extracted based on multi-view rendering, and geometric alignment is performed using depth images to improve the geometric coherence between the edited region and the source 3D scene. Scene-aware text is combined with fused multi-view images and with the target-aware text to guide appearance optimization, achieving visual consistency between the edited and non-edited regions in terms of style, lighting, and materials. This significantly improves the efficiency, accuracy, and visual quality of reference image-based 3D editing. It therefore solves the technical problems in related technologies in which blurred boundaries of the edited region, inconsistencies in appearance and geometry across views, and a lack of coherence between the edited and non-edited regions lead to insufficient realism and low efficiency in the generated results.
Attached Figure Description
[0009] To more clearly illustrate the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0010] Figure 1 is a flowchart of a three-dimensional scene generation method provided in an embodiment of this application; Figure 2 is a flowchart for generating a 3D scene based on consistent boundaries provided in an embodiment of this application; Figure 3 is a flowchart of a 3D scene pre-editing process based on object-aware text and reference images provided in an embodiment of this application; Figure 4 is a block diagram of the three-dimensional scene generation device provided in an embodiment of this application.
Detailed Implementation
[0011] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the protection scope of this application.
[0012] It should be noted that, in the description of this application, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. The terms "first," "second," etc., in this application are used to distinguish similar objects and are not used to describe a specific order or sequence.
[0013] To enable those skilled in the art to better understand the present application, the present application will be further described in detail below with reference to the accompanying drawings and specific embodiments.
[0014] Specifically, Figure 1 is a schematic flowchart of a three-dimensional scene generation method provided by this application.
[0015] As shown in Figure 1, this 3D scene generation method includes the following steps: In step S101, the source 3D model and reference image are acquired, and target-aware text is generated based on the reference image.
[0016] Among them, the source 3D model is the original 3D scene or object model to be edited, serving as the basis for editing operations; the reference image is a 2D image that guides the generation of target content, usually containing object or style information that is expected to be inserted into the 3D scene; and the target-aware text is descriptive text that is automatically generated or manually provided based on the reference image, used to characterize the semantics, shape, category, or attributes of the target object and guide the generation of the 3D model.
[0017] It is understood that the embodiments of this application can perform semantic parsing on the reference image and automatically generate target-aware text describing key attributes such as the target object's category, shape, posture, or style, so as to facilitate the subsequent generation of the target's 3D model.
[0018] In this embodiment of the application, generating target-aware text based on a reference image includes: obtaining a pre-set first prompt text and a search tag; generating a descriptive text of the reference image based on the first prompt text and the search tag; and generating target-aware text based on the descriptive text of the reference image.
[0019] It is understood that the embodiments of this application can analyze the reference image based on the pre-set first prompt text and search tags, combined with visual recognition and natural language processing technology, to generate detailed descriptive text. It can understand the objects in the image and their attributes, spatial layout and other key information, extract and convert the key information into target-aware text, accurately capture and express the core content and style features of the original image, so as to facilitate the subsequent generation of the target 3D model.
[0020] It should be noted that the first prompt text can be a pre-designed, guiding natural language instruction input into a multimodal model (such as a vision language model). This instruction guides the model to focus on specific types of information in the reference image or to generate descriptions that meet the requirements of downstream tasks. For example, "Please describe in detail the category, appearance, pose, and environment of the object in the image" is a typical first prompt text. The role of the first prompt text is to activate, through prompt engineering, the model's ability to perceive the semantic elements of the image that are relevant to 3D generation, thereby improving the relevance and structure of the generated text.
[0021] Search tags can be predefined keywords or standardized terms used to characterize the high-level semantic concepts or attributes that a reference image may contain. These tags can come from classification systems, ontology libraries, or user-defined tags, and can be used to constrain or enhance the description generation process, or as an index basis for subsequent content retrieval or matching.
[0022] Specifically, as shown in Figure 2, when generating 3D objects from reference images, only a single reference image is available for generating each 3D object, so the generated object is prone to geometric or appearance differences from the reference image and tends to overfit to the single given viewpoint. Introducing text descriptions of the reference image improves editing generalization and effectively addresses overfitting to that viewpoint. However, in related technologies, the text descriptions of 3D models are too simplistic and easily introduce background information from the reference image, image-text ambiguity, and hallucinations.
[0023] This application proposes an object-aware text description acquisition method, which removes background interference and obtains fine-grained text descriptions of the object in a reference image to guide single-image 3D object generation. The main steps include: (a) Use a Vision Language Model (VLM) to identify the salient object in the input reference image, i.e., the object to be inserted or replaced in the source 3D scene, and have the VLM output a simple text description, such as "a gray bear". Then, starting from this simple description, use prompt engineering and retrieval augmentation to obtain an object-aware text description.
[0024] The prompt engineering is as follows: 1. You are a professional photography critic. Please answer this question in three paragraphs: 1) Visual style (20 words); 2) Lighting, including the type and direction of the light source, and the resulting highlight / shadow shapes (30 words); 3) Associative information about the material and tactile sensation of the object's surface (20 words).
[0025] 2. Structured fields must be output in JSON (JavaScript Object Notation) format to ensure field integrity and machine parsability. Points are deducted for each missing element. {"style":"", "lighting": {"type":"", "direction":"", "shape":""}, "material":"", "mood":""}
Retrieval augmentation: 1. Build a vector library of style tags and text descriptions in advance; during inference, retrieve text using the image, take the top five retrieved descriptions, and dynamically concatenate them into the prompt of the generation model, that is, concatenate style tags such as "Baroque, Bauhaus, vaporwave" into the prompt. The tags are not specifically limited.
[0026] 2. Build a vector library of lighting tags and text descriptions in advance; during inference, retrieve text using the image, take the top five retrieved descriptions, and dynamically concatenate them into the prompt of the generation model, that is, concatenate lighting tags such as "sunlight coming in from the left window" into the prompt.
[0027] (b) Simplify these descriptions using a pre-trained language model.
[0028] The prompt is as follows: Simplify and merge repeated text descriptions, while retaining fine-grained descriptions of object style, lighting, and materials.
[0029] Ultimately, the acquired fine-grained text descriptions are used as the text descriptions for object perception.
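The prompt construction and retrieval steps above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names `top_k_tags`, `build_prompt`, and `parse_description` are hypothetical, the embeddings are assumed to come from some image-text encoder that is not shown, and cosine similarity with k = 5 is an assumed retrieval rule. The validator simply checks the JSON fields listed in the prompt.

```python
import json
import numpy as np

# Fields required by the structured prompt (taken from the JSON template in the text).
REQUIRED_FIELDS = {"style": str, "lighting": dict, "material": str, "mood": str}
LIGHTING_FIELDS = ("type", "direction", "shape")


def top_k_tags(query_vec, tag_vecs, tags, k=5):
    """Return the k tags whose embeddings are most cosine-similar to the query.

    query_vec: (D,) embedding of the reference image (assumed precomputed).
    tag_vecs:  (M, D) embeddings of the tag library (assumed precomputed).
    """
    q = query_vec / np.linalg.norm(query_vec)
    t = tag_vecs / np.linalg.norm(tag_vecs, axis=1, keepdims=True)
    order = np.argsort(-(t @ q))[:k]
    return [tags[i] for i in order]


def build_prompt(base_prompt, style_tags, lighting_tags):
    """Concatenate the retrieved tags onto the fixed prompt text."""
    return (f"{base_prompt}\nStyle hints: {', '.join(style_tags)}"
            f"\nLighting hints: {', '.join(lighting_tags)}")


def parse_description(raw):
    """Parse the model's JSON reply and check that every required field exists."""
    data = json.loads(raw)
    for name, kind in REQUIRED_FIELDS.items():
        if not isinstance(data.get(name), kind):
            raise ValueError(f"missing or malformed field: {name}")
    for name in LIGHTING_FIELDS:
        if name not in data["lighting"]:
            raise ValueError(f"missing lighting field: {name}")
    return data
```

Rejecting a reply with a missing field, rather than silently filling it in, mirrors the prompt's requirement that every element be present and keeps malformed descriptions out of the later simplification step.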
[0030] It should be noted that Vision Language Models (VLMs) are a type of deep learning model capable of simultaneously understanding images (visual information) and text (linguistic information) and establishing semantic connections between the two. A VLM typically comprises two components: a visual encoder that converts the input image into a high-dimensional vector representation, capturing visual features such as objects, scenes, and layouts; and a language decoder that processes text input and generates at least one type of text, interacting with the visual features through an attention mechanism. During training, VLMs undergo contrastive learning or generative pre-training on large-scale image-text datasets. For example, given an image of "a gray bear standing on grass," the model learns to align it semantically with the corresponding text. In 3D scene editing tasks, VLMs are used to automatically parse the semantic content of reference images and output initial text descriptions, providing a foundation for subsequent structured prompt generation and retrieval augmentation and serving as a bridge between 2D images and 3D semantic generation.
[0031] A pre-trained language model is generally obtained by designing language-model training tasks on a large-scale corpus (language training material such as sentences and paragraphs) and training a large neural network on those tasks. Subsequent tasks can then use this model for feature extraction or fine-tuning to achieve specific objectives. The idea behind pre-training is to first train a set of model parameters on one task, then use these parameters to initialize the network, and finally train the initialized network on other tasks to obtain models adapted to them. By pre-training on large-scale corpora, neural language representation models learn powerful language representations and can extract rich syntactic and semantic information from text. Pre-trained language models can provide token-level and sentence-level features with rich semantic information for downstream tasks, and can also be fine-tuned directly on downstream tasks to obtain task-specific models quickly and conveniently.
[0032] The neural network algorithm structure used to train a pre-trained language model can be CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), LSTM (Long Short-Term Memory), etc., or it can be a model built with attention networks, such as Transformer (attention mechanism model), BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), CLIP (Contrastive Language-Image Pretraining), etc., and this application does not limit it. An attention network refers to a network model that uses an attention mechanism for training. This model extracts more important feature information from the input sequence by assigning different weights to each part of the input sequence, so that the model can finally obtain a more accurate output.
[0033] In step S102, a target 3D model is generated based on the target-aware text and reference image. The target 3D model is then inserted into the target region of the source 3D model to obtain the first 3D model. The target region is used as the editing region, and the region outside the target region is used as the non-editing region.
[0034] It is understood that the embodiments of this application can construct a target 3D model consistent with semantic description and visual content based on target-aware text and reference images through multimodal conditional generation technology, ensuring that it accurately reflects the reference intent in shape, style and structure; the target 3D model is precisely embedded into the preset target area in the source 3D model to form the preliminary editing result, i.e., the first 3D model, and the modified editing area and the unchanging non-editing area are clearly delineated, realizing the semantic alignment generation from 2D reference to 3D object, and completing the initial layout of scene fusion through spatial positioning, providing a structural foundation for the subsequent fine coordination of geometry and appearance, and significantly improving the semantic controllability and operational efficiency of 3D editing.
[0035] It should be noted that the editing area refers to the selected local spatial range within the source 3D model used for inserting, replacing, or modifying content; specifically, it is the location where the newly generated target 3D model is embedded. The non-editing area refers to all parts of the source 3D model other than the editing area, that is, the area that remains in its original state and does not undergo structural changes.
[0036] In this embodiment of the application, before inserting the target 3D model into the target region of the source 3D model to obtain the first 3D model, the method further includes: representing the target region of the source 3D model as a 3D editing box; identifying the target spatial relationship in the source 3D model; and inserting the target 3D model into the 3D editing box according to the target spatial relationship to obtain the first 3D model.
[0037] It is understood that, in the embodiments of this application, before inserting the target 3D model into the target area of the source 3D model, the target area can be modeled as a 3D editing box to define the spatial range to be replaced or inserted. By analyzing the target spatial relationships in the source 3D model, including the relative positions, scale ratios, and orientations of objects and other contextual geometric constraints, the target 3D model is adapted and adjusted. Based on the spatial relationships, the target 3D model is accurately embedded into the 3D editing box to generate a structurally reasonable first 3D model. This not only ensures that the newly inserted object is semantically consistent with the original scene in terms of position and scale, but also lays a geometric foundation for subsequent boundary processing and multi-view consistency optimization, effectively improving the rationality and automation level of 3D editing.
[0038] It should be noted that target spatial relationships refer to the geometric and semantic contextual information associated with the target region in the source 3D model. Examples include: relative position: the spatial coordinate relationship of the target region relative to other key objects or structures in the scene; scale: the size ratio of the target region within the overall scene; orientation and pose: the implicit directionality or viewpoint tendency of the target region; support or dependence: whether the target region is on the ground, a tabletop, or other objects to determine if the inserted object needs to meet physical plausibility; occlusion and visibility: the occlusion of the target region by surrounding objects, affecting the visible parts and rendering method of the inserted model. Spatial relationships are extracted through analysis of the source 3D model's geometric structure, semantic segmentation, or scene graph, and are used to guide the reasonable placement of the target 3D model within the 3D editing frame. This ensures that the generated first 3D model conforms to the contextual logic of the original scene in terms of spatial layout, improving the realism and plausibility of the editing result.
[0039] Specifically, this application, given a reference image, utilizes a single-image-based 3D generation large model and inputs the acquired object-aware text description as guidance to obtain a high-quality 3D model corresponding to the reference image. Given a source 3D model to be edited and specifying the insertion or replacement position (represented by a 3D editing box), an existing automatic layout algorithm is run to insert the generated object into the specified position in the source 3D model according to spatial relationships, thereby obtaining a preliminary result of the edited 3D scene.
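As a rough illustration of placing a generated object inside a 3D editing box, the sketch below uniformly scales a point set so that it fits an axis-aligned box, centers it, and optionally rests it on the bottom face of the box. It stands in for the "existing automatic layout algorithm" mentioned above only in the simplest sense: the function name `fit_to_edit_box`, the axis-aligned box, the z-up convention, and the "rest on the floor" rule are all assumptions made for this example, and real layout would also account for orientation, support, and occlusion.

```python
import numpy as np


def fit_to_edit_box(points, box_min, box_max, rest_on_floor=True):
    """Scale and translate points (N, 3) so they fit an axis-aligned edit box.

    Assumptions for this sketch: the box is axis-aligned, z is the up axis,
    and a single uniform scale factor is used to preserve the object's shape.
    """
    points = np.asarray(points, dtype=float)
    box_min = np.asarray(box_min, dtype=float)
    box_max = np.asarray(box_max, dtype=float)

    obj_min, obj_max = points.min(axis=0), points.max(axis=0)
    obj_size = np.maximum(obj_max - obj_min, 1e-9)  # avoid division by zero

    # Largest uniform scale that keeps the object inside the box on every axis.
    scale = np.min((box_max - box_min) / obj_size)

    # Center the object at the origin, scale it, then move it to the box center.
    placed = (points - (obj_min + obj_max) / 2) * scale + (box_min + box_max) / 2

    if rest_on_floor:
        # Drop the object so its lowest point touches the bottom face of the box.
        placed[:, 2] += box_min[2] - placed[:, 2].min()
    return placed
```

A uniform scale is chosen here so that the inserted object is not distorted; the unused space left in the box along the other axes is one reason a later boundary and geometry refinement step is still needed.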
[0040] This approach avoids the iterative image editing process required in existing 3D editing methods, allowing for the rapid acquisition of preliminary editing results. However, such preliminary editing results are too rough, with inconsistencies in geometry and appearance between the edited and non-edited areas. Therefore, this application will use reconstruction optimization and other methods described in the following steps to achieve high-quality 3D editing with consistent geometry and appearance between the edited and non-edited areas.
[0041] In step S103, the first three-dimensional model is rendered to obtain a first multi-view image. The boundary band between the edited area and the non-edited area in the first three-dimensional model is determined based on the first multi-view image. The depth image of the boundary band is determined based on the depth image of the first multi-view image. The geometric features between the edited area and the non-edited area in the first three-dimensional model are aligned based on the depth image of the boundary band to obtain a second three-dimensional model.
[0042] It is understood that the embodiments of this application can generate a series of images by rendering the first three-dimensional model from multiple perspectives, thereby accurately identifying the boundary band between the edited area and the non-edited area, and using depth images to obtain the depth information of the boundary band. Based on these depth data, the geometric features of the edited area and the non-edited area are aligned, ensuring a smooth transition and consistency between the two areas, and finally obtaining a second three-dimensional model with more unified geometric features. This significantly improves the accuracy and visual effect of three-dimensional model editing, making the edited model more natural and smooth in both detail and overall.
[0043] It should be noted that the boundary zone refers to a transitional area at the junction of the edited and non-edited areas in the first 3D model. Its spatial range covers the adjacent part of the two areas and is used to capture and analyze problems such as geometric discontinuities, depth jumps, or surface fractures caused by inserting the target 3D model.
[0044] In this embodiment of the application, rendering a first three-dimensional model to obtain a first multi-view image includes: calculating a three-dimensional Gaussian representation of the first three-dimensional model; calculating the pixel color of a single-view image based on the three-dimensional Gaussian representation; and generating the first multi-view image based on the pixel colors of multiple single-view images.
[0045] It is understood that the embodiments of this application can effectively capture the geometric and appearance features of the model by calculating its three-dimensional Gaussian representation on the first three-dimensional model. The color of each pixel in the single-view image can be calculated based on the three-dimensional Gaussian representation, which can accurately reflect the visual effect of the model under different viewpoints. By integrating the pixel color information of multiple single-view images to generate the first multi-view image, not only are the details and features of the original three-dimensional model preserved, but also a richer and more comprehensive visual experience is provided to the user.
[0046] Specifically, this application renders the preliminary 3D editing result to obtain multi-view images of that result, as follows: The 3D model after initial editing is represented by a 3D Gaussian representation $\mathcal{G} = \{g_i\}_{i=1}^{N}$, where $g_i = (\mu_i, \Sigma_i, c_i, \alpha_i)$, $N$ represents the number of three-dimensional Gaussian points contained in the 3D model $\mathcal{G}$, $\mu_i$ indicates the position of the center point of the i-th Gaussian, $\Sigma_i$ represents the covariance matrix of the i-th Gaussian, $c_i$ is the color of the i-th Gaussian, and $\alpha_i$ is the opacity of the i-th Gaussian. The 3D Gaussian representation of the pre-edited 3D model is then expressed as:
$$G_i(x) = \exp\left(-\tfrac{1}{2}\,(x-\mu_i)^{\top}\,\Sigma_i^{-1}\,(x-\mu_i)\right) \qquad (1)$$
[0047] where $x - \mu_i$ represents the distance between the intersection point $x$ of the ray with the three-dimensional Gaussian and the center point $\mu_i$ of the i-th three-dimensional Gaussian, $\top$ denotes the transpose, and $\Sigma_i^{-1}$ is the inverse of the covariance matrix.
[0048] The color $C(p)$ of a pixel $p$ in a single-view image $I$ of the edited 3D model is:
[0049]
$$C(p) = \sum_{i=1}^{N} c_i\,\sigma_i\,T_i, \qquad \sigma_i = \alpha_i\,G_i(x) \qquad (2)$$
$$T_i = \prod_{j=1}^{i-1} \left(1 - \sigma_j\right) \qquad (3)$$
[0050] where $N$ represents the number of three-dimensional Gaussian points contained in the 3D model $\mathcal{G}$, $c_i$ is the color of the i-th Gaussian, $\sigma_i$ is the weighted opacity of the i-th Gaussian, $\alpha_i$ is the opacity of the i-th Gaussian, $G_i(x)$ is the three-dimensional Gaussian representation of the three-dimensional model, that is, the weight of the i-th Gaussian at spatial position $x$, and $T_i$ is the cumulative transmittance before the i-th Gaussian.
[0051] Given multiple different rendering perspectives, images from multiple different perspectives $\{I_k\}_{k=1}^{K}$ are rendered using formulas (1)-(3).
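The per-pixel computation can be sketched numerically as below. This is a simplified illustration under stated assumptions, not the renderer used in the application: it evaluates the standard Gaussian falloff exp(-0.5 * d^T * inverse(covariance) * d) in 3D for a given sample point and then performs front-to-back alpha compositing, whereas practical Gaussian splatting renderers project each Gaussian to 2D and sort by depth per tile. The function names `gaussian_weight` and `composite_pixel` are hypothetical, and the Gaussians are assumed to be already sorted from nearest to farthest.

```python
import numpy as np


def gaussian_weight(x, mu, cov):
    """Falloff of one Gaussian at point x: exp(-0.5 * d^T * inv(cov) * d), d = x - mu."""
    d = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    return float(np.exp(-0.5 * d @ np.linalg.inv(cov) @ d))


def composite_pixel(colors, alphas, weights):
    """Front-to-back alpha compositing of N Gaussians along one ray.

    colors:  (N, 3) RGB color of each Gaussian.
    alphas:  (N,)   opacity of each Gaussian.
    weights: (N,)   Gaussian falloff of each Gaussian at the ray sample.
    All arrays are assumed sorted from nearest to farthest.
    """
    colors = np.asarray(colors, dtype=float)
    sigma = np.asarray(alphas, dtype=float) * np.asarray(weights, dtype=float)
    # Transmittance before Gaussian i: product of (1 - sigma_j) over all j < i.
    trans = np.concatenate(([1.0], np.cumprod(1.0 - sigma)[:-1]))
    return (colors * (sigma * trans)[:, None]).sum(axis=0)
```

For example, a half-opaque red Gaussian in front of a fully opaque blue one yields an even mix of the two colors, because the first Gaussian lets half of the light through to the second.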
[0052] In this embodiment of the application, determining the boundary band between the edited region and the non-edited region in the first three-dimensional model based on the first multi-view image includes: identifying the semantics of object instances and the spatial relationships between object instances in the edited region; dividing the edited region into a retaining layer and an editing layer according to the categories of object instance semantics and spatial relationships, segmenting object instances into replaced objects and non-replaced objects, assigning the replaced objects to the editing layer, and assigning the non-replaced objects to the retaining layer; calculating the mask of the replaced object in the editing layer, completing the remaining region of the replaced object according to the mask of the replaced object, and re-dividing the retaining layer and the editing layer in the completed edited region; calculating the mask of the re-divided editing layer and the mask of the retaining layer, and determining the boundary band between the edited region and the non-edited region according to the mask of the editing layer and the mask of the retaining layer.
[0053] It is understood that the embodiments of this application can analyze the first multi-view image to identify the semantics of object instances in the editing area and their spatial relationships, and divide the editing area into a retain layer and an editing layer accordingly: the replaced object is assigned to the editing layer, and the non-replaced object is assigned to the retain layer; then the mask of the replaced object in the editing layer is calculated, and the mask is used to semantically complete the residual empty areas caused by insertion or deletion operations to ensure geometric integrity; based on the completion, the retain layer and the editing layer are re-divided, and corresponding accurate masks are generated respectively; finally, by fusing the updated editing layer mask and the retain layer mask, the boundary band between the editing area and the non-editing area is accurately defined, thereby realizing fine-grained understanding and dynamic adjustment of the interaction relationship of multiple objects in complex scenes, effectively avoiding problems such as blurred boundaries, structural breaks, or semantic misalignment.
[0054] It's important to note that an object instance refers to each specific, distinguishable physical object in a 3D scene or image. Even if multiple objects belong to the same category, as long as they are separate and independently existing entities in space, they constitute different instances. For example, two chairs in an image are two object instances, while in a 3D model, each independently modeled piece of furniture, figure, decoration, etc., is considered an object instance. Therefore, object instances emphasize individuality and spatial uniqueness, and are typically identified and separated using instance segmentation techniques.
[0055] Object instance semantics refers to the high-level semantic labels or attribute descriptions assigned to each object instance, used to characterize its semantic information such as category, function, material, style, or role. For example, the semantics of an object instance might be "wooden dining table," "gray fabric sofa," or "metal chandelier." The semantic information not only includes basic categories (such as "table"), but may also include fine-grained attributes (color, material, purpose, etc.).
[0056] A replaced object is an existing object instance in the source 3D model located within the target area that is planned to be replaced or removed by the newly generated target 3D model. For example, if you want to replace an old sofa with a new one in the living room, the "old sofa" is the replaced object. A non-replaced object is an object instance located within or near the target area that should not be modified and should be retained in its original state. For example, a coffee table, carpet, or wall decoration in the same area, although within the editing area, are classified as non-replaced objects because they are not part of the editing intent.
[0057] The edit layer is a logical layer consisting of all replaced objects. It represents the set of geometric content that needs to be modified, deleted, or overwritten. This layer will be replaced by the target 3D model and serves as the primary object for subsequent geometric alignment and appearance optimization. The retain layer is a logical layer consisting of all non-replaced objects. It represents the parts whose geometric and appearance attributes must be fully preserved, even if they are located inside the target area. This layer acts as a contextual constraint to guide the fusion of the edit layers, ensuring the rationality of the scene structure and semantic coherence.
[0058] The mask for the edit layer is a binary or probabilistic spatial indicator function (typically defined in the image or multi-view projection domain) that marks all pixel or voxel locations belonging to the object being replaced. This mask is used to locate the areas to be reconstructed or filled and to update the boundaries after filling in any remaining holes. The mask for the preserve layer marks the spatial regions belonging to the non-replaced objects. It is used to protect these regions from modification in subsequent processing and serves as a reference for geometric and appearance alignment.
[0059] Specifically, during 3D scene editing, the editing area must be edited according to the user's specifications while the non-editing area remains unchanged. In addition, the user-defined 3D editing frame is often an oversized bounding box, which leads to over-editing in the final result. This application therefore proposes a semantically layered 3D scene representation for the rendered images of the initially edited 3D scene. The layers comprise a retain layer and an editing layer, with pixel-level masks $M_p$ and $M_e$ respectively, where $M_p, M_e \in \{0,1\}^{H \times W}$ and $H$ and $W$ are the height and width of the image.
[0060] This application can support multiple editing needs, such as inserting objects from multiple reference images into a source 3D scene. The following describes a single editing need. For multiple editing needs, the following steps can be repeated to obtain multiple editing layers and the boundary band between the editing area and the non-editing area.
[0061] Layer decomposition: Using the rich world knowledge of a vision-language model (VLM), semantic recognition is performed on the object instances within the editing area, and the spatial relationships between object instances are described, for example "a gray bear doll on the table." Objects are then assigned to the retain layer or the editing layer according to their semantic category and spatial relationships. For example, as shown in Figure 3, the doll within the editing area is assigned to the editing layer, while the other content within the editing area is assigned to the retain layer. Instance segmentation is then used to segment the object instances, generating a pixel-level mask $M_{obj}$ for the instances assigned to the editing layer. At the same time, the 2D editing frame $M_{box}$ is obtained by projecting the 3D editing frame (the user-specified editing area) onto the corresponding viewpoint.
[0062] When a replacement edit is performed, if the new object differs in shape from the replaced object in the source scene, remnants of the replaced object may remain in the editing area. To address this issue, a residual mask of the replaced object is calculated from the obtained pixel-level mask of the editing layer: $M_{res} = M_{box} \setminus M_{obj}$, where $M_{res}$ is the residual mask of the replaced object; $M_{box}$ is the 2D editing frame obtained by projecting the 3D editing frame (the user-specified editing area) onto the corresponding viewpoint, i.e., the projection area mask; and $M_{obj}$ is the pixel-level mask of the instance in the editing layer, i.e., the effective mask of the replaced object.
[0063] Then, existing mask-based image inpainting methods are used to complete the remaining areas of the replaced object and remove the remaining replaced object in the editing area, making the initial editing result clearer and the obtained instance segmentation result more accurate.
[0064] Layer decomposition is performed again to obtain an updated $M_{obj}$. The pixel-level mask of the editing layer is then obtained by set union: $M_e = M_{box} \cup M_{obj}$, where $M_{box}$ is the projection area mask and $M_{obj}$ is the effective mask of the replaced object. Taking the union gives a more general editing region for the editing layer mask: it conforms to the user-specified editing region without losing the pre-edited result, and it provides a good initial editing region for boundary consistency optimization. The pixel-level mask of the retain layer is $M_p = 1 - M_e$, where $M_e$ is the editing layer mask and $M_p$ is the retain layer mask.
[0065] In this embodiment of the application, the process of completing the residual area of the replaced object based on the mask of the replaced object includes: projecting the editing area onto a preset single viewpoint; calculating the mask of the projected area; determining the residual mask of the replaced object based on the difference between the mask of the projected area and the mask of the replaced object; determining the residual area of the replaced object based on the residual mask of the replaced object; and completing the residual area of the replaced object.
[0066] It is understood that, in this embodiment of the application, the editing area can be projected onto a preset single viewpoint to obtain a two-dimensional projection area under that viewpoint, and the corresponding projection mask can be calculated. Then, the difference operation is performed between the original mask of the object being replaced and the projection mask to accurately identify the areas that are not completely covered or omitted due to viewpoint occlusion, geometric breaks, or insertion operations, i.e., the residual mask of the object being replaced. Based on the residual mask, the residual area is located, and a context-aware completion strategy (such as based on surrounding geometry or texture information) is used to fill and repair it, thereby effectively eliminating holes, breaks, or incomplete structures generated during the editing process, ensuring that the editing layer is geometrically closed and semantically coherent.
[0067] It should be noted that residual mask refers to a binary or probabilistic mask used to identify incomplete, broken, or hollow areas left behind when the replaced object was not completely removed or covered during the editing process.
[0068] Specifically, when a replacement edit is performed, if the new object differs in shape from the replaced object in the source scene, remnants of the replaced object may remain in the editing area. To address this issue, a residual mask of the replaced object is calculated from the obtained pixel-level mask of the editing layer: $M_{res} = M_{box} \setminus M_{obj}$, where $M_{res}$ is the residual mask of the replaced object; $M_{box}$ is the 2D editing frame obtained by projecting the 3D editing frame (the user-specified editing area) onto the corresponding viewpoint, i.e., the projection area mask; and $M_{obj}$ is the pixel-level mask of the instance in the editing layer, i.e., the effective mask of the replaced object.
[0069] In this embodiment of the application, calculating the mask of the edit layer and the mask of the retain layer after re-division includes: projecting the re-divided edit region onto a preset single viewpoint; calculating the mask of the projected region; determining the mask of the edit layer based on the union of the mask of the projected region and the mask of the replaced object; and calculating the mask of the retain layer based on the mask of the edit layer.
[0070] It is understood that the embodiments of this application can obtain the two-dimensional projection range of the editing area by projecting it onto a preset single viewpoint and calculating the corresponding projection mask; by performing a difference operation between the projection mask and the original mask of the object being replaced, the residual parts that have not been effectively covered or removed can be accurately extracted to form a residual mask; based on the residual mask, the residual area is located, and a context-aware completion method is used to repair these empty or fragmented areas, which effectively solves the problem of incomplete editing caused by viewpoint occlusion, geometric mismatch or segmentation error, and ensures that the object being replaced is completely removed and the geometric structure of the editing layer is closed.
[0071] Specifically, a new layer decomposition is performed to obtain an updated $M_{obj}$. The pixel-level mask of the editing layer is then obtained by set union: $M_e = M_{box} \cup M_{obj}$, where $M_{box}$ is the projection area mask and $M_{obj}$ is the effective mask of the replaced object. Taking the union gives a more general editing region for the editing layer mask: it conforms to the user-specified editing region without losing the pre-edited result, and it provides a good initial editing region for boundary consistency optimization. The pixel-level mask of the retain layer is $M_p = 1 - M_e$, where $M_e$ is the editing layer mask and $M_p$ is the retain layer mask.
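The mask operations described above (difference for the residual mask, union for the editing layer mask, complement for the retain layer mask) can be illustrated with a minimal sketch. It uses plain Python lists to stay dependency-free; the function names and the tiny 1x6 example masks are hypothetical and are not taken from the application.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 = inside the region, 0 = outside


def mask_difference(a: Mask, b: Mask) -> Mask:
    """Pixels that are set in `a` but not in `b`."""
    return [[int(x and not y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mask_union(a: Mask, b: Mask) -> Mask:
    """Pixels that are set in `a` or in `b`."""
    return [[int(x or y) for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def mask_complement(a: Mask) -> Mask:
    """Pixels that are not set in `a`."""
    return [[1 - x for x in row] for row in a]


# Hypothetical single-row view: the projected editing frame covers columns 1..4,
# the segmented object in the editing layer covers columns 2..3.
box_mask = [[0, 1, 1, 1, 1, 0]]
object_mask = [[0, 0, 1, 1, 0, 0]]

# Residual area to be completed by inpainting.
residual_mask = mask_difference(box_mask, object_mask)

# After completion and a second layer decomposition the object mask may change;
# here (assumed) it extends one pixel beyond the projected frame.
new_object_mask = [[0, 0, 1, 1, 1, 1]]
edit_mask = mask_union(box_mask, new_object_mask)
retain_mask = mask_complement(edit_mask)
```

The union keeps every pixel of the user-specified frame as well as any part of the object that falls outside it, which is why the editing layer mask can be larger than either input.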
[0072] In this embodiment, determining the boundary band between the edited region and the non-edited region based on the mask of the editing layer and the mask of the retain layer includes: identifying the boundary between the editing layer and the retain layer based on the mask of the editing layer and the mask of the retain layer; obtaining a first annular band of a first proportion from the editing layer based on the boundary between the editing layer and the retain layer; obtaining a second annular band of a first proportion from the retain layer based on the boundary between the editing layer and the retain layer; and generating the boundary band between the edited region and the non-edited region based on the first annular band and the second annular band.
[0073] The first annular zone refers to the annular area formed by extending inward from the editing layer with a certain proportion of width, based on the boundary between the editing layer and the retaining layer. The second annular zone refers to the annular area formed by extending inward from the retaining layer with the same or corresponding proportion of width, based on the same boundary.
[0074] It is understood that the embodiments of this application can identify the boundary between the editing layer mask and the retaining layer mask in an image or multi-view projection by leveraging their complementary relationship. Then, based on this boundary, annular regions of a certain proportion of width are extracted from the editing layer side and the retaining layer side along the normal direction to form a first annular band and a second annular band. Finally, these two annular bands are merged to form a boundary band covering the transition area between the editing area and the non-editing area. By symmetrically expanding the boundary on both sides, the edge details of the edited content are preserved, while incorporating the contextual information of the original scene. This provides a structurally complete, semantically coherent, and spatially accurate local operation area for subsequent geometric alignment and multi-view appearance fusion based on depth images, effectively avoiding problems such as hard edges, geometric breaks, or visual discontinuities.
[0075] It should be noted that, at the boundary between the retain layer and the editing layer, this application takes the first annular band on the editing-layer side together with the second annular band on the retain-layer side as the boundary band. The first annular band is adjacent to the boundary and belongs to the edge of the editing area. It captures the geometric and appearance features of the newly inserted content near the boundary and is the key area to be adjusted or aligned in the subsequent fusion process. The second annular band belongs to the edge of the non-editing area (i.e., the original scene). It retains the unmodified context structure and visual attributes and serves as the reference for fusion, guiding the editing layer to connect smoothly with the original scene in both geometry and appearance.
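One way to extract such a two-sided band from a binary editing layer mask is morphological erosion: each layer is shrunk, and the pixels removed by the shrinking form that layer's ring. The sketch below is an illustration under assumptions, not the application's procedure: band widths are given in pixels rather than as proportions, a 4-neighbour erosion is used, and the names `erode` and `boundary_band` are hypothetical.

```python
from typing import List

Mask = List[List[int]]


def erode(mask: Mask, steps: int) -> Mask:
    """Shrink a binary mask by `steps` pixels using 4-neighbour erosion.

    Neighbours outside the image are ignored, so the image border itself
    is not treated as a layer boundary.
    """
    h, w = len(mask), len(mask[0])
    cur = [row[:] for row in mask]
    for _ in range(steps):
        nxt = [[0] * w for _ in range(h)]
        for y in range(h):
            for x in range(w):
                neighbours = ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                nxt[y][x] = int(
                    cur[y][x] == 1
                    and all(cur[ny][nx] for ny, nx in neighbours
                            if 0 <= ny < h and 0 <= nx < w)
                )
        cur = nxt
    return cur


def boundary_band(edit_mask: Mask, edit_width: int, retain_width: int) -> Mask:
    """Union of the ring inside the editing layer and the ring inside the retain layer."""
    retain_mask = [[1 - v for v in row] for row in edit_mask]
    edit_core = erode(edit_mask, edit_width)
    retain_core = erode(retain_mask, retain_width)
    # A pixel is in the band if it lies in a layer but not in that layer's eroded core.
    return [
        [int((e and not ec) or (r and not rc))
         for e, ec, r, rc in zip(row_e, row_ec, row_r, row_rc)]
        for row_e, row_ec, row_r, row_rc
        in zip(edit_mask, edit_core, retain_mask, retain_core)
    ]
```

For a single-row mask whose editing layer spans columns 3 to 6, `boundary_band(mask, 1, 1)` marks columns 2, 3, 6 and 7: one pixel on each side of both edges.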
[0076] In this embodiment of the application, determining the depth image of the boundary band based on the depth image of the first multi-view image includes: determining the depth image of the editing layer and the depth image of the retain layer based on the depth image of the first multi-view image, and determining the depth image of the boundary band based on the depth image of the editing layer and the depth image of the retain layer.
[0077] It is understood that, in this embodiment of the application, the depth information of the editing layer and the retaining layer at each viewpoint can be extracted from the depth image corresponding to the first multi-view image to obtain the depth image of the editing layer and the depth image of the retaining layer. According to the determined boundary zone spatial range, the depth values of the corresponding regions are cropped or merged from these two sets of depth images to generate a boundary zone-specific depth image. This ensures that the boundary zone depth information contains both the edited content and the geometric structure of the original scene, providing precise spatial constraints for subsequent geometric feature alignment. This effectively avoids distortion phenomena such as model breakage, floating, or interlacing caused by depth discontinuity, thereby improving the geometric consistency and visual realism of the edited and non-edited regions in three-dimensional space.
[0078] It should be noted that a depth image is a single-channel image generated when rendering a 3D model or scene from a specific camera viewpoint. The value of each pixel represents the distance (i.e., depth value) from the corresponding spatial point to the camera's imaging plane (or camera center). Depth images are a crucial intermediate representation connecting the 2D rendered view and the 3D geometry. In 3D editing, they are used to quantify and correct spatial positional differences, ensuring that the edited results blend naturally in shape and structure.
[0079] In this embodiment of the application, before aligning the geometric features between the edited and non-edited regions in the first 3D model based on the depth image of the boundary band to obtain the second 3D model, the method includes: calculating the depth difference of pixels on the boundary band based on the depth image of the boundary band; calculating a weighted cumulative distribution curve based on the depth difference of pixels on the boundary band; determining the translation amount of the boundary points between the edited and non-edited regions based on the weighted cumulative distribution curve; correcting the depth image of the boundary points between the edited and non-edited regions based on the translation amount of the boundary points; calculating a pseudo-ground-truth depth image of the first 3D model based on the depth image of the boundary points and the depth image of the first multi-view image; using the corrected first multi-view image as the pseudo-ground-truth image; and reconstructing and optimizing the first 3D model based on the depth image of the first multi-view image, the pseudo-ground-truth depth image, the pseudo-ground-truth image, and the first multi-view image to obtain the second 3D model.
[0080] It is understood that, before aligning geometric features, this embodiment can calculate the depth difference of each pixel at the junction of the editing layer and the preservation layer based on the depth image of the boundary band, reflecting the degree of geometric discontinuity between the editing area and the non-editing area; then, it can use the depth difference to construct a weighted cumulative distribution curve, and determine the required translation amount of the boundary point through statistical analysis to achieve a smooth transition at the depth level; the depth value of the boundary point is corrected according to the translation amount to generate an optimized boundary depth image, and the depth image of the first three-dimensional model is obtained by fusing it with the depth image of the original first multi-view image, while the corrected multi-view image is used as the pseudo-true value image; the original multi-view image and its depth, the pseudo-true value image and its depth together constitute a supervision signal, and the first three-dimensional model is jointly reconstructed and optimized to obtain the second three-dimensional model. By explicitly modeling and correcting the boundary geometric misalignment in a data-driven manner, the depth jump and structural break between the inserted object and the original scene are effectively eliminated, and the geometric continuity and overall structural consistency of the second three-dimensional model in the boundary area are significantly improved.
[0081] It should be noted that the cumulative distribution curve is a function curve obtained by statistically modeling the depth difference of all pixels on the boundary band. It is used to describe the distribution pattern of depth difference in the overall boundary region.
[0082] In this embodiment of the application, the weighted cumulative distribution curve is calculated based on the depth difference of pixels on the boundary band, including: obtaining the color gradient magnitude and pixel residual gradient magnitude of pixels on the boundary band; calculating the weight of pixels based on the color gradient magnitude and pixel residual gradient magnitude; calculating the proportion of pixel weight to the total weight based on the depth difference and weight of pixels on the boundary band; and generating the weighted cumulative distribution curve based on the proportion of pixel weight to the total weight.
[0083] It is understood that the embodiments of this application can obtain the color gradient magnitude and residual gradient magnitude of each pixel on the boundary band, which respectively reflect the saliency of the pixel at the visual edge and the spatial inconsistency of the depth error; based on these two gradient magnitudes, the confidence weight of each pixel is calculated, so that the region with smooth color and stable residual can obtain higher confidence; combined with the depth difference of the pixel and its corresponding weight, the proportion of the cumulative weight to the total weight under each depth threshold is statistically analyzed, thereby constructing a weighted cumulative distribution curve, which can robustly characterize the overall distribution characteristics of the depth deviation of the boundary band, effectively suppress the noise interference introduced by texture edges or outlier depth estimation, and provide a reliable data foundation for subsequent accurate solution of weighted median translation and realization of geometric seamless alignment.
[0084] It should be noted that the geometric consistency optimization algorithm based on boundary depth alignment in this application ensures geometric consistency between the edited and non-edited regions. Its inputs are the depths of the retain layer and the editing layer, $D_p$ and $D_e$, and the pixel-level mask of the boundary band, $M_b$. Its output is the depth of the boundary band after depth alignment, $D_b^{*}$.
[0085] Specifically, an existing image depth estimation method is applied to the initial editing result image to obtain its depth estimate; the depths of the retain layer and the editing layer are $D_p$ and $D_e$, respectively. Boundary alignment is then used to ensure geometric consistency between the edited and non-edited areas.
[0086] a) Obtain the initial depth difference of the boundary band, $\delta = D_p - D_e$; b) weight each pixel by its confidence weight $w_i$, which is determined jointly by color consistency and depth discontinuity; c) iteratively solve for the optimal translation amount $\Delta$ using a weighted median, until the difference between the current translation amount and the previous translation amount is less than the adaptive threshold $\varepsilon$; d) translate the boundary band as a whole by the final translation amount $\Delta$ to achieve seamless splicing of the editing layer and the retain layer.
[0087] Specifically as follows: Step 1: Calculate the depth of the initial boundary band:
[0088] $D_b^{(0)}(i) = M_p(i)\,D_p(i) + M_e(i)\,D_e(i)$, where $D_b^{*}$ is initialized to $D_b^{(0)}$, i.e., $D_b^{*} = D_b^{(0)}$; $D_b^{(0)}$ is the initial depth value within the boundary band; $M_p$ is the pixel-level mask of the retain layer; $M_e$ is the pixel-level mask of the editing layer; $D_p(i)$ is the depth value of the retain layer at pixel $i$; $D_e(i)$ is the depth value of the editing layer at pixel $i$; and $D_b^{*}$ is the boundary band depth map after geometric alignment optimization.
[0089] Step 2: Calculate the pixel depth difference of the boundary band, $\delta$: $\delta_i = M_b(i)\,\bigl(D_p(i) - D_e(i)\bigr)$, where $M_b$ is the pixel-level mask of the boundary band, $D_p$ is the depth value of the retain layer, $D_e$ is the depth value of the editing layer, and $\delta$ is the depth difference of the boundary band.
[0090] Step 3: Calculate the weighted cumulative distribution:
[0091] $F^{(k)}(t) = \dfrac{\sum_i w_i^{(k)}\,M_b(i)\,\mathbb{1}[\delta_i \le t]}{\sum_i w_i^{(k)}\,M_b(i)}$, where $F^{(k)}(t)$ is the weighted cumulative distribution function, representing the weighted proportion of pixels in the boundary band whose depth difference is less than or equal to $t$ at the $k$-th iteration; $t$ is the pixel depth difference threshold; $w_i^{(k)}$ is the confidence weight of pixel $i$ at the $k$-th iteration; $M_b$ is the boundary band mask; $\delta_i$ is the depth difference at pixel position $i$ in the boundary band; and $\mathbb{1}[\cdot]$ is the indicator function, which equals 1 if the depth difference $\delta_i$ of pixel $i$ is less than or equal to the threshold $t$ and 0 otherwise.
[0092] It should be noted that $t$, the pixel depth difference threshold, is a continuous variable. After the depth differences $\delta_i$ of all pixels are sorted from smallest to largest, $F^{(k)}(t)$ records the proportion of the total weight contributed by all pixels with $\delta_i \le t$. As $t$ sweeps from the minimum value $\min(\delta)$ to the maximum value $\max(\delta)$, $F^{(k)}(t)$ increases monotonically from 0 to 1, forming the weighted cumulative distribution curve. $\mathbb{1}[\cdot]$ is an indicator function, meaning that only pixels whose depth difference satisfies the threshold contribute to the weighted cumulative distribution.
[0093] The weighted median is taken as the next depth correction amount. The depth differences $\delta_i$ are sorted and their weights are accumulated into a probability distribution, and the 0.5 quantile is taken, i.e., the value $t^{*}$ is found such that $F^{(k)}(t^{*}) = 0.5$. This is the dividing point at half of the total weight, and it is used as the next translation amount $\Delta^{(k+1)}$:
[0094] $\Delta^{(k+1)} = t^{*} = \inf\{\, t : F^{(k)}(t) \ge 0.5 \,\}$, where $\Delta^{(k+1)}$ is the depth translation correction amount at the $(k+1)$-th iteration; $F^{(k)}(t)$ is the weighted proportion of pixels in the boundary band whose depth difference is less than or equal to $t$ at the $k$-th iteration; $t^{*}$ is the optimal depth adjustment threshold; and $F^{(k)}(\delta)$ is the weighted cumulative probability value corresponding to the depth difference $\delta$ in the boundary band at the $k$-th iteration.
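The weighted cumulative distribution and its 0.5 quantile can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, the inputs are assumed to be the depth differences and confidence weights of the boundary band pixels flattened into lists, and the quantile is taken as the smallest depth difference at which the accumulated weight reaches half of the total.

```python
from typing import Sequence


def weighted_cdf(deltas: Sequence[float], weights: Sequence[float], t: float) -> float:
    """Share of the total weight held by pixels whose depth difference is <= t."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return sum(w for d, w in zip(deltas, weights) if d <= t) / total


def weighted_median(deltas: Sequence[float], weights: Sequence[float]) -> float:
    """Smallest depth difference at which the accumulated weight reaches half the total."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    accumulated = 0.0
    for delta, weight in sorted(zip(deltas, weights)):
        accumulated += weight
        if accumulated >= 0.5 * total:
            return delta
    return max(deltas)  # only reachable through floating-point rounding


# Hypothetical boundary band: three consistent pixels and one low-confidence outlier.
deltas = [0.10, 0.12, 0.11, 0.90]
weights = [1.0, 1.0, 1.0, 0.1]
shift = weighted_median(deltas, weights)
```

Because the outlier carries little weight, it barely moves the half-weight point, which is the robustness property the weighted median is chosen for.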
[0095] Step 4: Update the weights and perform outlier suppression:
[0096]
[0097] where $w_i^{(k+1)}$ is the updated confidence weight of pixel $i$: the larger the color gradient magnitude or the pixel residual gradient magnitude at pixel $i$, the more likely it is an edge pixel and the lower its weight; $\tau_c$ and $\tau_r$ are preset thresholds, which are constants; $\delta_i$ is the depth difference at pixel position $i$ in the boundary band; $\Delta^{(k+1)}$ is the depth translation correction amount at the $(k+1)$-th iteration; and $r_i$ is the depth residual of pixel $i$.
[0098] Step 5: Boundary point depth correction: $D_b^{*} \leftarrow D_b^{*} + \Delta^{(k+1)}$, where $D_b^{*}$ is the aligned boundary band depth map and $\Delta^{(k+1)}$ is the depth translation correction amount at the $(k+1)$-th iteration.
[0099] It should be noted that $D_b^{*}$ is the geometric alignment result obtained after $k$ depth translation corrections.
[0100] Step 6: If $|\Delta^{(k+1)} - \Delta^{(k)}| < \varepsilon$, where $\Delta^{(k)}$ is the depth correction amount at the $k$-th iteration, $\Delta^{(k+1)}$ is the depth translation correction amount at the $(k+1)$-th iteration, and $\varepsilon$ is a preset small positive threshold called the convergence tolerance, then boundary depth alignment stops; otherwise, return to step 3 and repeat steps 3-5.
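Steps 2 to 6 can be put together as a short iterative loop. The sketch below is a simplified illustration under several assumptions that are not stated in the application: the boundary band is flattened to a 1-D list of pixels; the weight takes an exponential form `exp(-g/tau_c) * exp(-|r|/tau_r)`, with the absolute depth residual standing in for the residual gradient magnitude; and the final shift is applied once to the editing-layer depth. All function and parameter names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def weighted_median(values: Sequence[float], weights: Sequence[float]) -> float:
    """Smallest value at which the accumulated weight reaches half the total."""
    total = sum(weights)
    accumulated = 0.0
    for value, weight in sorted(zip(values, weights)):
        accumulated += weight
        if accumulated >= 0.5 * total:
            return value
    return max(values)


def align_boundary_depth(
    depth_retain: Sequence[float],
    depth_edit: Sequence[float],
    color_grad: Sequence[float],
    tau_c: float = 0.5,    # assumed scale for the color gradient term
    tau_r: float = 0.05,   # assumed scale for the depth residual term
    eps: float = 1e-4,     # convergence tolerance
    max_iter: int = 50,
) -> Tuple[float, List[float]]:
    # Step 2: per-pixel depth difference on the boundary band.
    deltas = [p - e for p, e in zip(depth_retain, depth_edit)]
    # Initial confidence: pixels on strong color edges start with a low weight.
    weights = [math.exp(-g / tau_c) for g in color_grad]
    shift = 0.0
    for _ in range(max_iter):
        # Step 3: weighted median of the depth differences.
        new_shift = weighted_median(deltas, weights)
        # Step 4: down-weight pixels that disagree with the current shift.
        weights = [
            math.exp(-g / tau_c) * math.exp(-abs(d - new_shift) / tau_r)
            for g, d in zip(color_grad, deltas)
        ]
        # Step 6: stop once the shift no longer changes.
        converged = abs(new_shift - shift) < eps
        shift = new_shift
        if converged:
            break
    # Step 5 (applied once here): translate the editing-layer depth.
    aligned_edit = [e + shift for e in depth_edit]
    return shift, aligned_edit
```

With four band pixels offset by 0.3 and one outlier on a strong color edge, the loop settles on a shift of about 0.3 and the outlier has almost no influence on it.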
[0101] In this embodiment of the application, reconstructing and optimizing the first three-dimensional model based on the depth image, the pseudo-ground-truth depth image, the pseudo-ground-truth image, and the first multi-view image to obtain the second three-dimensional model includes: initializing the pixel depth values of the depth image of the first three-dimensional model; reconstructing the initialized first three-dimensional model; calculating the loss value of the reconstruction process of the first three-dimensional model based on the depth image, the pseudo-ground-truth depth image, the pseudo-ground-truth image, and the first multi-view image; and optimizing the reconstruction process of the first three-dimensional model based on the loss value.
[0102] It is understood that the embodiments of this application can initialize the depth image of the first 3D model with pixel-level depth values to establish an optimizable geometric representation; then, a reconstruction process is performed based on the initial model, and a multi-source supervision signal is constructed by jointly utilizing the first multi-view image and its corresponding depth image, as well as the pseudo-ground value image and pseudo-ground value depth image generated by boundary alignment, to calculate the reconstruction loss value; this loss value comprehensively reflects multi-dimensional errors such as color consistency, depth accuracy, and boundary geometric continuity; finally, the loss value is minimized through backpropagation and optimization algorithms, and the geometric and appearance parameters of the first 3D model are iteratively adjusted to generate a second 3D model with a more complete structure, smoother boundaries, and higher overall visual quality. By introducing pseudo-ground values as enhanced supervision, the model is effectively guided to repair geometric breaks and appearance distortions in the editing area, significantly improving the fidelity and semantic consistency of 3D reconstruction.
[0103] Specifically, this application obtains the pseudo-ground-truth images of the edited 3D model and performs reconstruction optimization: (a) Using the boundary depth alignment strategy, the depth map $D_0$ corresponding to the initial editing result image $I_0$ is depth-aligned to give $\hat{D}$, which serves as the pseudo-ground-truth depth image of the edited 3D model.
[0104] $\hat{D} = (1 - M_b) \odot D_0 + M_b \odot D_b^{*}$
[0105] where $\hat{D}$ is the corrected initial edit depth map; $D_0$ is the original initial edit depth map, i.e., the depth values obtained by a depth estimation algorithm after multi-view rendering of the first 3D model (the scene after the target object is inserted); $D_b^{*}$ is the boundary band depth map after geometric alignment optimization; and $M_b$ is the boundary band mask.
[0106] Then, the depth-corrected multi-view images $\hat{I}$ are acquired and serve as the pseudo-ground-truth images of the edited 3D model.
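The construction of the pseudo-ground-truth depth image can be read as a per-pixel selection: inside the boundary band the aligned depth is used, and elsewhere the originally estimated depth is kept. The following minimal sketch assumes this reading and a binary band mask; the function name and the example values are hypothetical.

```python
from typing import List

Image = List[List[float]]
Mask = List[List[int]]


def compose_pseudo_gt_depth(initial_depth: Image, aligned_band_depth: Image,
                            band_mask: Mask) -> Image:
    """Take the aligned depth inside the boundary band, the initial depth elsewhere."""
    return [
        [band if inside else depth
         for depth, band, inside in zip(row_d, row_b, row_m)]
        for row_d, row_b, row_m in zip(initial_depth, aligned_band_depth, band_mask)
    ]


# Hypothetical 1x4 view: the two middle pixels belong to the boundary band.
initial = [[2.0, 1.7, 1.7, 1.7]]
aligned = [[0.0, 2.0, 2.0, 0.0]]  # values outside the band are ignored
band = [[0, 1, 1, 0]]
pseudo_gt = compose_pseudo_gt_depth(initial, aligned, band)
```

For a binary mask this selection gives the same result as blending the two depth maps with weights `1 - mask` and `mask`.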
[0107] (b) Reconstruction and optimization.
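[0108] For the initially edited 3D model $G$, the depth of a pixel in the rendered depth image $D$ is:
[0109] $D(r) = \sum_{i \in G} T_i\,\alpha_i\,d_i$
[0110] where $d_i$ is the projection depth of the Gaussian center $\mu_i$ on ray $r$; $o$ is the origin of ray $r$, from which the depth is measured; $\alpha_i$ is the opacity of the $i$-th Gaussian; $G$ is the 3D Gaussian representation of the 3D model; $T_i$ is the cumulative opacity term (accumulated transmittance) of the $i$-th Gaussian; and $\mu_i$ is the center position of the $i$-th Gaussian.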
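For a single ray, depth rendering of this kind is commonly implemented as front-to-back alpha compositing. The sketch below assumes that the Gaussians hit by the ray are already sorted from near to far and that each opacity is the effective per-pixel alpha after projection; the function name is hypothetical.

```python
from typing import Sequence


def render_depth(depths: Sequence[float], alphas: Sequence[float]) -> float:
    """Alpha-composite per-Gaussian depths along one ray, front to back."""
    transmittance = 1.0  # fraction of light not yet absorbed by nearer Gaussians
    depth = 0.0
    for d, alpha in zip(depths, alphas):
        depth += transmittance * alpha * d
        transmittance *= 1.0 - alpha
    return depth


# A half-transparent Gaussian at depth 1 in front of an opaque one at depth 2.
pixel_depth = render_depth([1.0, 2.0], [0.5, 1.0])
```

Here the nearer Gaussian contributes 0.5 x 1.0 and the farther one contributes the remaining 0.5 x 2.0.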
[0111] At the same time, the color image $I$ rendered from the initially edited 3D model is obtained using formulas (1)-(3). Reconstruction optimization is then performed with a multi-term loss to obtain a 3D edited model with boundary depth alignment. The loss is calculated as follows: $\mathcal{L} = \lambda_1\,\mathcal{L}_{depth}(\hat{D}, D) + \lambda_2\,\mathcal{L}_{color}(\hat{I}, I)$
[0112] where $\lambda_1$ and $\lambda_2$ are hyperparameters; $\mathcal{L}$ is the joint loss value for depth and color, with $\mathcal{L}_{depth}$ and $\mathcal{L}_{color}$ denoting its depth and color terms; $\hat{D}$ is the corrected pseudo-ground-truth depth image; $D$ is the initial edit depth map; $\hat{I}$ is the pseudo-ground-truth image; and $I$ is the initial editing image.
[0113] Gradients are propagated only to the 3D Gaussians corresponding to the editing region. This reduces the number of Gaussians being optimized, accelerates editing, and ensures that the non-edited areas remain unchanged, maintaining the realism of the edited model. The result of this boundary depth alignment optimization is the edited 3D model, i.e., the second 3D model.
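The joint loss and the restriction of updates to editable Gaussians can be sketched as follows. This is an illustration under assumptions: both loss terms are taken to be mean absolute errors, images are flattened to lists, and the gradient is supplied by the caller rather than computed by automatic differentiation. The names `joint_loss` and `masked_update` are hypothetical.

```python
from typing import List, Sequence


def mean_abs_error(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def joint_loss(rendered_depth: Sequence[float], pseudo_gt_depth: Sequence[float],
               rendered_color: Sequence[float], pseudo_gt_color: Sequence[float],
               lambda_depth: float = 1.0, lambda_color: float = 1.0) -> float:
    """Weighted sum of a depth term and a color term (L1 is an assumed choice)."""
    return (lambda_depth * mean_abs_error(rendered_depth, pseudo_gt_depth)
            + lambda_color * mean_abs_error(rendered_color, pseudo_gt_color))


def masked_update(params: Sequence[float], grads: Sequence[float],
                  editable: Sequence[bool], lr: float) -> List[float]:
    """Gradient step that leaves every non-editable parameter untouched."""
    return [p - lr * g if e else p for p, g, e in zip(params, grads, editable)]


loss = joint_loss([1.0, 2.0], [1.5, 2.0], [0.2, 0.4], [0.2, 0.8],
                  lambda_depth=2.0, lambda_color=1.0)
updated = masked_update([1.0, 1.0], [0.5, 0.5], [True, False], lr=0.1)
```

Skipping the update for non-editable parameters is what keeps the non-edited part of the scene bit-for-bit unchanged while also shrinking the optimization problem.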
[0114] In step S104, scene-aware text is generated based on the source 3D model, the second 3D model is rendered to obtain a second multi-view image, the first multi-view image and the second multi-view image are fused to obtain a fused image, the appearance features between the edited and non-edited areas in the second 3D model are aligned based on the target-aware text, the scene-aware text and the fused image, and a 3D scene is generated based on the second 3D model aligned with the appearance features.
[0115] It is understood that the embodiments of this application can generate scene-aware text through the source 3D model to capture the overall semantic and contextual information of the original scene; at the same time, the optimized second 3D model is rendered from multiple perspectives to obtain a second multi-view image, which is then fused with the first multi-view image from the initial editing stage to generate a fused image that takes into account both the edited content and the original scene structure; by jointly utilizing target-aware text (describing the fine-grained attributes of the inserted object), scene-aware text (describing the environmental context), and visual cues in the fused image, the appearance feature alignment process is driven, so that the appearance attributes such as material, lighting, and color of the edited area in the second 3D model are semantically and visually consistent with those of the non-edited area; finally, based on the appearance-aligned second 3D model, a high-fidelity, semantically coherent, and geometrically and appearance-seamlessly integrated 3D scene is generated, realizing a closed-loop editing process from semantic guidance to geometric optimization to appearance coordination, which significantly improves the realism and overall consistency of 3D scene editing.
[0116] In this embodiment of the application, generating scene-aware text based on a source 3D model includes: identifying object instances in the source 3D model; obtaining a pre-set second prompt text; generating descriptive text of the object instances in the source 3D model based on the second prompt text; and generating scene-aware text based on the descriptive text of the object instances.
[0117] It is understood that the embodiments of this application can identify each object instance in the source 3D model to obtain its spatial location, category, and semantic attributes; then, combined with a pre-set second prompt text, guide the large language model or visual language model to generate structured, fine-grained descriptive text for each object instance; and perform semantic aggregation and context integration of the descriptive text of all object instances to form scene-aware text covering the overall layout, spatial relationships between objects, and environmental semantics. This achieves automatic conversion from 3D geometric data to high-level semantic expression, providing key contextual support for the semantic and appearance-level coordination and alignment of subsequent edited content with the original scene, effectively enhancing the rationality of 3D scene editing.
[0118] Specifically, this application obtains scene-aware text for the non-edited area, which, together with the target-aware text for the edited area, serves as the editing text guidance. When a vision-language model (VLM) is used to describe the source 3D scene, its output is easily dominated by the salient objects in the scene, which makes it difficult to extract a text description of the non-edited area. A scene-aware method for obtaining the editing text guidance is therefore proposed.
[0119] (a) First, a vision-language model (VLM) identifies the salient objects in an image of the source 3D scene and outputs a short text description of them, such as “a table”, which is used as the negative prompt.
[0120] (b) Then, using prompt engineering and retrieval augmentation, the “simple text description” is transformed into a “scene-aware text description”. The prompt is specified as follows: 1. You are a professional photography critic. Please answer in three paragraphs: 1) Visual style (20 words); 2) Lighting, including the type and direction of the light source and the resulting highlight/shadow shapes (40 words); 3) Associative information about the surface material and tactile feel of the objects (30 words); 2. Structured fields, mandatory JSON output; points will be deducted for each missing item: {"style":"", "lighting": {"type":"", "direction":"", "shape":""}, "material":"", "mood":""}.
[0121] 3. The negative prompt is: "a table". (c) These descriptions are then simplified using a pre-trained language model with the following prompt: based on the rendered image of the source 3D model and its text description, simplify the text description while retaining the descriptions of the source 3D scene's style, lighting, and materials, and ensure that the text description does not mention the salient objects of the source 3D scene (i.e., the objects within the editing area). The simplified text description, together with the reference-object text description obtained in step one, forms the final editing text description.
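Steps (a) and (b) can be sketched as a small prompt builder plus a validator for the mandatory JSON fields. This is a minimal sketch under stated assumptions: the function names `build_scene_prompt` and `parse_scene_description` are hypothetical, no real VLM is called, and the leak check on negative objects is a simple substring test added for illustration only.

```python
import json

# Field layout taken from the JSON template in step (b).
REQUIRED_FIELDS = {"style": str, "lighting": dict, "material": str, "mood": str}
LIGHTING_FIELDS = ("type", "direction", "shape")


def build_scene_prompt(negative_objects):
    """Assemble the three-part prompt; negative_objects come from step (a)."""
    negative = ", ".join(f'"{name}"' for name in negative_objects)
    return "\n".join([
        "1. You are a professional photography critic. Answer in three paragraphs: "
        "1) visual style (20 words); 2) lighting, including the type and direction "
        "of the light source and the resulting highlight/shadow shapes (40 words); "
        "3) surface material and tactile associations (30 words).",
        '2. Mandatory JSON output: {"style":"", "lighting": {"type":"", '
        '"direction":"", "shape":""}, "material":"", "mood":""}.',
        f"3. The negative prompt is: {negative}",
    ])


def parse_scene_description(raw_json, negative_objects):
    """Validate a model response (hypothetical) against the required fields."""
    data = json.loads(raw_json)
    for key, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(data.get(key), expected_type):
            raise ValueError(f"missing or malformed field: {key}")
    for key in LIGHTING_FIELDS:
        if not isinstance(data["lighting"].get(key), str):
            raise ValueError(f"missing or malformed lighting field: {key}")
    # Illustrative check: salient objects must not leak into the scene text.
    text = json.dumps(data).lower()
    leaked = [name for name in negative_objects if name.lower() in text]
    if leaked:
        raise ValueError(f"negative objects mentioned: {leaked}")
    return data
```

A response that omits any field or mentions a negative object is rejected, which mirrors the "points will be deducted for each missing item" instruction in the prompt.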
[0122] In this embodiment, aligning the appearance features between edited and non-edited regions in a second 3D model based on target-aware text, scene-aware text, and fused image includes: performing diffusion iteration on the appearance features between edited and non-edited regions; calculating diffusion loss based on target-aware text, scene-aware text, and fused image; and optimizing the appearance features based on the diffusion loss during the diffusion iteration process.
[0123] It is understood that the embodiments of this application can incorporate the appearance features between the edited and non-edited regions into the iterative optimization framework of the diffusion model. In each diffusion step, target-aware text (describing the fine-grained attributes of the inserted object), scene-aware text (characterizing the contextual semantics of the original environment), and fused image (containing multi-view visual consistency information) are used to jointly construct semantic and visual constraints. Based on these three supervision signals, the diffusion loss is calculated. This loss function measures the deviation of the current appearance features from the desired target in terms of material, lighting, color, and style. The implicit appearance representation in the diffusion process is updated through backpropagation, gradually guiding the appearance of the edited region to evolve in a direction consistent with the non-edited region. With the progressive generation capability of the diffusion mechanism and combined with multimodal priors, the natural integration with the lighting, material, and style of the original scene is achieved while preserving the semantic identity of the target object, which significantly improves the visual realism and overall consistency of the 3D scene editing results.
[0124] Specifically, this application further optimizes the rendered appearance with a consistency fusion mechanism based on dynamic interpolation. Two-branch rendering is used. First branch: the initially edited 3D result is rendered to obtain the image $I_{\mathrm{init}}$. Second branch: the edited 3D result after boundary depth alignment optimization is rendered using formulas (1)-(3) to obtain the image $I_{\mathrm{align}}$. The two are then weighted and merged: $$I_{\mathrm{fuse}} = W \odot I_{\mathrm{align}} + (1 - W) \odot I_{\mathrm{init}} \qquad (15)$$ The consistency weight $W$ is a function of the distance $d$ from the boundary band and of two hyperparameters; it is approximately 1 at the center of the editing area, approximately 0 in the non-editing area, and transitions smoothly from 0 to 1 across the boundary. Here $\odot$ denotes element-wise multiplication, $W$ is the consistency weight, $I_{\mathrm{align}}$ is the image after depth alignment, and $I_{\mathrm{init}}$ is the initially edited image.
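The two-branch fusion can be illustrated with a short NumPy sketch. Several points here are assumptions and not taken from the application: the weight is modeled as a sigmoid of a signed distance to the boundary (positive inside the editing area), the two hyperparameters are named `sharpness` and `offset`, and the depth-aligned render is the branch multiplied by the weight.

```python
import numpy as np


def consistency_weight(signed_dist, sharpness=1.0, offset=0.0):
    """Assumed weight: ~1 deep inside the edit area, ~0 far outside it."""
    return 1.0 / (1.0 + np.exp(-sharpness * (signed_dist - offset)))


def fuse(initial_img, aligned_img, signed_dist, sharpness=1.0, offset=0.0):
    """Blend two (H, W, C) renders with a per-pixel (H, W) weight map."""
    w = consistency_weight(signed_dist, sharpness, offset)[..., None]
    # Element-wise weighted merge of the two branches.
    return w * aligned_img + (1.0 - w) * initial_img
```

A larger `sharpness` narrows the transition zone around the boundary band, while `offset` shifts where the 50/50 blend occurs.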
[0125] (3) Use diffusion loss for appearance optimization.
[0126] After editing in steps one and two, the quality of the 3D model basically meets the requirements, so the appearance optimization in step three can be completed with a small number of iterations, thus accelerating the editing process.
[0127] $$\nabla_\theta \mathcal{L}_{\mathrm{SDS}} = \mathbb{E}_{t,\epsilon}\left[\left(\hat{\epsilon}_\phi(z_t; y, t) - \epsilon\right)\frac{\partial x}{\partial \theta}\right]$$
[0128] where $\nabla_\theta \mathcal{L}_{\mathrm{SDS}}$ is an unbiased estimate of the gradient of the loss with respect to θ, used to update θ by gradient descent. The score distillation sampling (SDS) loss treats the pre-trained diffusion model as a "scorer" that pulls the image rendered from the 3D representation toward the text prompt, without requiring a ground-truth image. $z_t$ is the latent encoding of the rendered image x = g(θ) at time step t, obtained by adding noise; g(θ) is the 3D Gaussian model; θ denotes the parameters to be optimized in the 3D Gaussian model; and x is the image rendered from the 3D Gaussian model. $\hat{\epsilon}_\phi(z_t; y, t)$ is the noise predicted by the pre-trained diffusion model, and $\epsilon$ is the true added noise. $y$ is the embedding vector of the text prompt, a fixed feature vector obtained by passing the input text prompt through a pre-trained text encoder (such as CLIP or T5). $\mathbb{E}_{t,\epsilon}$ denotes the expectation over the two random variables t and ε, where t is the time step and $\epsilon \sim \mathcal{N}(0, I)$ is standard Gaussian noise, i.e. the actual noise sample added to the clean latent representation at time step t during the diffusion process.
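The SDS update can be illustrated with a deliberately tiny NumPy example. Everything here is a toy assumption: the "renderer" is the identity g(θ) = θ, the pre-trained diffusion model is replaced by an analytic noise predictor for data concentrated at a single `target`, and the noise schedule (`alpha_t`, `sigma_t`) is hypothetical. The sketch only shows how the residual between predicted and true noise drives θ without any ground-truth image.

```python
import numpy as np


def make_toy_predictor(target, alpha_t, sigma_t):
    """Ideal noise predictor when all data sits at `target` (toy stand-in)."""
    def predict_noise(z_t):
        return (z_t - alpha_t * target) / sigma_t
    return predict_noise


def sds_gradient(theta, predict_noise, alpha_t, sigma_t, rng):
    """One stochastic SDS gradient sample for the identity renderer."""
    x = theta                              # x = g(theta); dx/dtheta is identity
    eps = rng.standard_normal(x.shape)     # true noise, eps ~ N(0, I)
    z_t = alpha_t * x + sigma_t * eps      # noised encoding at time step t
    return predict_noise(z_t) - eps        # (predicted noise - true noise)


def optimize(theta0, target, steps=200, lr=0.05, seed=0):
    """Gradient descent on theta using SDS gradient samples."""
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float).copy()
    for _ in range(steps):
        t = rng.uniform(0.02, 0.98)        # random time step
        alpha_t, sigma_t = np.sqrt(1.0 - t), np.sqrt(t)
        predictor = make_toy_predictor(target, alpha_t, sigma_t)
        theta -= lr * sds_gradient(theta, predictor, alpha_t, sigma_t, rng)
    return theta
```

In this toy setting the noise residual reduces to a positive multiple of (θ − target), so each step moves θ toward the target. In the method described above, the predictor would instead be a text-conditioned diffusion model and the renderer a 3D Gaussian model.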
[0129] In summary, this application proposes a method for generating 3D scenes with consistent boundaries. From both appearance and geometry perspectives, it optimizes the consistency and coherence between edited and non-edited areas of the 3D scene, as well as the accuracy of the edited area, to achieve consistent 3D scene editing based on reference images, while improving the efficiency and realism of the generated 3D scene.
[0130] This application proposes a method for rapidly obtaining a preliminary editing result for a 3D scene based on target-aware text. The preliminary result serves as the initial value for subsequent optimization. In addition, guidance from both target-aware and scene-aware text balances fidelity to the reference image against multi-view generalization, which mitigates overfitting of the edited region to the reference image.
[0131] This application proposes a geometric coherence optimization method based on boundary depth alignment. By utilizing the hierarchical reconstruction of non-edited regions and edited regions and the depth alignment of overlapping boundary regions, geometric coherence between edited and non-edited regions is achieved, that is, the geometric consistency fusion between the edited region and the source scene is achieved.
[0132] This application proposes an appearance coherence optimization method based on scene-aware text guidance. The scene-aware text hierarchically describes the scene, from foreground to background, from artistic style to ambient lighting, solving the problem of appearance consistency (including style, lighting effects, etc.) between the edited and non-edited areas of the scene after editing.
[0133] This application proposes a consistency fusion mechanism based on dynamic interpolation, which renders the background and the area of the replaced / inserted edited object in layers, and then uses dynamic interpolation to fuse the editing and non-editing boundary areas, thereby further solving the problem of geometric and appearance consistency fusion between the editing area and the source scene.
[0134] The 3D scene generation method according to embodiments of this application generates target-aware text through reference images, quickly generates a target 3D model using the reference images and target-aware text, and inserts the target 3D model into the target region of the source 3D model to generate a first 3D model. It extracts the boundary band between edited and non-edited regions based on multi-view rendering, performs geometric alignment using depth images, and improves the geometric coherence between the edited region and the source 3D scene. By fusing multi-view images with scene-aware text and combining it with target-aware text to guide appearance optimization, it achieves visual consistency between the edited and non-edited regions in terms of style, lighting, and material, significantly improving the efficiency, accuracy, and visual quality of reference image-based 3D editing. Therefore, it solves the technical problems in related technologies, such as blurred boundaries of the edited region, inconsistencies in appearance and geometry from multiple perspectives, and a lack of coherence between the edited and non-edited regions, resulting in insufficient realism and low efficiency in the generated 3D model editing.
[0135] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method.
[0136] Embodiments of this application also provide a three-dimensional scene generation apparatus.
[0137] As shown in Figure 4, the 3D scene generation device 10 includes: an acquisition module 100, a generation module 200, a rendering module 300, and a fusion module 400.
[0138] The acquisition module 100 is used to acquire a source 3D model and a reference image, and generate target-aware text based on the reference image; the generation module 200 is used to generate a target 3D model based on the target-aware text and the reference image, insert the target 3D model into the target area of the source 3D model to obtain a first 3D model, use the target area as the editing area, and use the area outside the target area as the non-editing area; the rendering module 300 is used to render the first 3D model to obtain a first multi-view image, determine the boundary band between the editing area and the non-editing area in the first 3D model based on the first multi-view image, determine the depth image of the boundary band based on the depth image of the first multi-view image, and align the geometric features between the editing area and the non-editing area in the first 3D model based on the depth image of the boundary band to obtain a second 3D model; the fusion module 400 is used to generate scene-aware text based on the source 3D model, render the second 3D model to obtain a second multi-view image, fuse the first multi-view image and the second multi-view image to obtain a fused image, align the appearance features between the editing area and the non-editing area in the second 3D model based on the target-aware text, the scene-aware text and the fused image, and generate a 3D scene based on the second 3D model aligned with the appearance features.
[0139] In this embodiment of the application, the acquisition module 100 is further configured to: acquire a pre-set first prompt text and a search tag; generate a description text of a reference image based on the first prompt text and the search tag; and generate target-aware text based on the description text of the reference image.
[0140] In this embodiment of the application, it further includes: a representation module, used to represent the target area of the source 3D model as a 3D editing box before inserting the target 3D model into the target area of the source 3D model to obtain the first 3D model; identify the target spatial relationship in the source 3D model, and insert the target 3D model into the 3D editing box according to the target spatial relationship to obtain the first 3D model.
[0141] In this embodiment, the rendering module 300 is further configured to: calculate the three-dimensional Gaussian representation of the first three-dimensional model; calculate the pixel color of the single-view image based on the three-dimensional Gaussian representation; and generate a first multi-view image based on the pixel colors of multiple single-view images.
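The per-pixel color computation can be sketched with the front-to-back alpha blending commonly used in 3D Gaussian splatting. This is a minimal sketch: it assumes each Gaussian's color and its opacity contribution at the pixel have already been evaluated and depth-sorted, and the function name `composite_pixel` is hypothetical.

```python
import numpy as np


def composite_pixel(colors, alphas):
    """Blend N depth-sorted Gaussians (front first) into one RGB value.

    colors: (N, 3) per-Gaussian colors; alphas: (N,) opacities in [0, 1].
    """
    colors = np.asarray(colors, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Transmittance reaching Gaussian i: product of (1 - alpha_j) for j < i.
    transmittance = np.concatenate(([1.0], np.cumprod(1.0 - alphas)[:-1]))
    weights = alphas * transmittance
    return weights @ colors
```

Repeating this for every pixel of a view gives a single-view image, and repeating it over several camera poses gives the multi-view image.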
[0142] In this embodiment, the rendering module 300 is further configured to: identify the semantics of object instances and the spatial relationships between object instances in the editing region; divide the editing region into a retaining layer and an editing layer according to the categories of object instance semantics and spatial relationships, segment object instances into replaced objects and non-replaced objects, assign the replaced objects to the editing layer, and assign the non-replaced objects to the retaining layer; calculate the mask of the replaced object in the editing layer, complete the remaining area of the replaced object according to the mask of the replaced object, and re-divide the retaining layer and the editing layer in the completed editing region; calculate the mask of the re-divided editing layer and the mask of the retaining layer, and determine the boundary band between the editing region and the non-editing region according to the mask of the editing layer and the mask of the retaining layer.
[0143] In this embodiment of the application, the rendering module 300 is further configured to: project the editing area onto a preset single viewpoint; calculate the mask of the projected area; determine the residual mask of the replaced object based on the difference between the mask of the projected area and the mask of the replaced object; determine the residual area of the replaced object based on the residual mask of the replaced object; and complete the residual area of the replaced object.
[0144] In this embodiment, the rendering module 300 is further configured to: project the re-divided editing area onto a preset single viewpoint; calculate the mask of the projection area; determine the mask of the editing layer based on the union of the mask of the projection area and the mask of the replaced object; and calculate the mask of the retention layer based on the mask of the editing layer.
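The mask operations in the two preceding paragraphs reduce to set operations on boolean images. The sketch below assumes that the "difference" means projection mask minus replaced-object mask and that the retaining-layer mask is the complement of the editing-layer mask; both readings are assumptions, and the function names are hypothetical.

```python
import numpy as np


def residual_mask(projection_mask, replaced_mask):
    """Pixels covered by the projected edit area but not by the replaced object."""
    return projection_mask & ~replaced_mask


def layer_masks(projection_mask, replaced_mask):
    """Editing-layer mask as a union; retaining-layer mask as its complement."""
    edit_mask = projection_mask | replaced_mask
    return edit_mask, ~edit_mask
```

The residual mask marks the area to be completed, and the two layer masks partition the image without overlap.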
[0145] In this embodiment, the rendering module 300 is further configured to: identify the boundary between the editing layer and the retaining layer based on the mask of the editing layer and the mask of the retaining layer; obtain a first annular band of a first proportion from the editing layer based on the boundary between the editing layer and the retaining layer; obtain a second annular band of a first proportion from the retaining layer based on the boundary between the editing layer and the retaining layer; and generate a boundary band between the editing area and the non-editing area based on the first annular band and the second annular band.
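One way to build the boundary band is to take a thin ring on each side of the layer boundary and merge them. The sketch below is NumPy-only and makes two assumptions: the band width is given as a pixel radius and not as a proportion, and "near the boundary" is measured with a square neighborhood. The helper names are hypothetical.

```python
import numpy as np


def dilate(mask, radius):
    """Binary dilation with a (2r+1) x (2r+1) square structuring element."""
    padded = np.pad(mask, radius, constant_values=False)
    out = np.zeros_like(mask)
    h, w = mask.shape
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out |= padded[dy:dy + h, dx:dx + w]
    return out


def boundary_band(edit_mask, inner_radius, outer_radius):
    """Union of a ring inside the editing layer and a ring inside the retaining layer."""
    retain_mask = ~edit_mask
    inner_ring = edit_mask & dilate(retain_mask, inner_radius)
    outer_ring = retain_mask & dilate(edit_mask, outer_radius)
    return inner_ring | outer_ring
```

Multiplying the rendered depth image by this band selects the depth values that the subsequent alignment step compares.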
[0146] In this embodiment, the rendering module 300 is further configured to: determine the depth image of the editing layer and the depth image of the retaining layer based on the depth image of the first multi-view image, and determine the depth image of the boundary band based on the depth image of the editing layer and the depth image of the retaining layer.
[0147] In this embodiment, the device further includes a calculation module configured to perform the following before the geometric features between the edited and non-edited regions in the first 3D model are aligned according to the depth image of the boundary band to obtain the second 3D model: calculate the depth difference of pixels on the boundary band based on the depth image of the boundary band; calculate a weighted cumulative distribution curve based on the depth difference of pixels on the boundary band; determine the translation amount of the boundary points between the edited and non-edited regions based on the weighted cumulative distribution curve; correct the depth image of the boundary points between the edited and non-edited regions based on the translation amount of the boundary points; calculate a pseudo-ground-truth depth image of the first 3D model based on the depth image of the boundary points and the depth image of the first multi-view image, and use the corrected first multi-view image as the pseudo-ground-truth image; and reconstruct and optimize the first 3D model based on the depth image of the first multi-view image, the pseudo-ground-truth depth image, the pseudo-ground-truth image, and the first multi-view image to obtain the second 3D model.
[0148] In this embodiment, the calculation module is further configured to: obtain the color gradient magnitude and pixel residual gradient magnitude of the pixels on the boundary band; calculate the weight of the pixels based on the color gradient magnitude and pixel residual gradient magnitude; calculate the proportion of the pixel weight to the total weight based on the depth difference and weight of the pixels on the boundary band; and generate a weighted cumulative distribution curve based on the proportion of the pixel weight to the total weight.
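The weighted cumulative distribution step can be sketched as a weighted quantile over the boundary-band depth differences. The weighting rule below (trusting pixels with small color and residual gradients more) and the choice of the 0.5 quantile as the translation amount are assumptions made for illustration; the function name `boundary_translation` is hypothetical.

```python
import numpy as np


def boundary_translation(depth_diff, color_grad, residual_grad,
                         quantile=0.5, eps=1e-6):
    """Return (translation, sorted depth differences, weighted CDF)."""
    depth_diff = np.asarray(depth_diff, dtype=float).ravel()
    grad = (np.asarray(color_grad, dtype=float).ravel()
            + np.asarray(residual_grad, dtype=float).ravel())
    # Assumed rule: flat, low-residual pixels get larger weights.
    weights = 1.0 / (eps + grad)
    order = np.argsort(depth_diff)
    sorted_diff = depth_diff[order]
    # Share of each pixel's weight in the total, accumulated in depth order.
    cdf = np.cumsum(weights[order]) / weights.sum()
    idx = min(int(np.searchsorted(cdf, quantile)), len(sorted_diff) - 1)
    return sorted_diff[idx], sorted_diff, cdf
```

Because the quantile is read from a weighted curve, a few outlier pixels on strong edges have little influence on the chosen translation.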
[0149] In this embodiment of the application, the calculation module is further configured to: initialize the pixel depth values of the depth image of the first 3D model; reconstruct the initialized first 3D model; calculate the loss value of the reconstruction process of the first 3D model based on the depth image of the first multi-view image, the pseudo-ground-truth depth image, the pseudo-ground-truth image, and the first multi-view image; and optimize the reconstruction process of the first 3D model based on the loss value.
[0150] In this embodiment, the fusion module 400 is further configured to: identify object instances in the source 3D model; obtain a pre-set second prompt text; generate descriptive text of the object instances in the source 3D model based on the second prompt text; and generate scene-aware text based on the descriptive text of the object instances.
[0151] It should be noted that the description of the features in the embodiment corresponding to the 3D scene generation device can be found in the relevant description of the embodiment corresponding to the 3D scene generation method, and will not be repeated here.
[0152] Embodiments of this application also provide an electronic device, including a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform the steps in any of the above embodiments of the three-dimensional scene generation method.
[0153] Embodiments of this application also provide a computer-readable storage medium storing a computer program, wherein the computer program is configured to execute the steps in any of the above embodiments of the three-dimensional scene generation method at runtime.
[0154] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard disk, magnetic disk, or optical disk.
[0155] The embodiments of this application also provide a computer program product, which includes a computer program that, when executed by a processor, implements the steps in any of the above embodiments of the three-dimensional scene generation method.
[0156] Embodiments of this application also provide another computer program product, including a non-volatile computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps in any of the above-described embodiments of the three-dimensional scene generation method.
[0157] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0158] The above provides a detailed description of a three-dimensional scene generation method provided in this application. Specific examples have been used to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only intended to help understand the method and its core ideas. It should be noted that those skilled in the art can make various improvements and modifications to this application without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of this application.
Claims
1. A method for generating a three-dimensional scene, characterized in that it comprises: acquiring a source 3D model and a reference image, and generating target-aware text based on the reference image; generating a target 3D model based on the target-aware text and the reference image, inserting the target 3D model into a target region of the source 3D model to obtain a first 3D model, using the target region as the edited region, and using the region outside the target region as the non-edited region; rendering the first 3D model to obtain a first multi-view image, determining the boundary band between the edited region and the non-edited region in the first 3D model based on the first multi-view image, determining the depth image of the boundary band based on the depth image of the first multi-view image, and aligning the geometric features between the edited region and the non-edited region in the first 3D model based on the depth image of the boundary band to obtain a second 3D model; and generating scene-aware text based on the source 3D model, rendering the second 3D model to obtain a second multi-view image, fusing the first multi-view image and the second multi-view image to obtain a fused image, aligning the appearance features between the edited region and the non-edited region in the second 3D model based on the target-aware text, the scene-aware text, and the fused image, and generating a 3D scene based on the appearance-aligned second 3D model.
2. The three-dimensional scene generation method according to claim 1, characterized in that generating target-aware text based on the reference image includes: acquiring a pre-set first prompt text and a search tag; generating a descriptive text of the reference image based on the first prompt text and the search tag; and generating the target-aware text based on the descriptive text of the reference image.
3. The three-dimensional scene generation method according to claim 1, characterized in that, Before inserting the target 3D model into the target region of the source 3D model to obtain the first 3D model, the method further includes: The target region of the source 3D model is represented as a 3D editing box; Identify the target spatial relationships in the source 3D model, and insert the target 3D model into a 3D editing box according to the target spatial relationships to obtain the first 3D model.
4. The three-dimensional scene generation method according to claim 1, characterized in that, The process of rendering the first 3D model to obtain the first multi-view image includes: Calculate the three-dimensional Gaussian representation of the first three-dimensional model; The pixel color of the single-view image is calculated based on the three-dimensional Gaussian representation; The first multi-view image is generated based on the pixel colors of multiple single-view images.
5. The three-dimensional scene generation method according to claim 1, characterized in that determining the boundary band between the edited region and the non-edited region in the first 3D model based on the first multi-view image includes: identifying the semantics of object instances and the spatial relationships between object instances in the edited region; dividing the edited region into a retaining layer and an editing layer according to the categories of the object instance semantics and spatial relationships, dividing the object instances into replaced objects and non-replaced objects, assigning the replaced objects to the editing layer, and assigning the non-replaced objects to the retaining layer; calculating the mask of the replaced object in the editing layer, completing the remaining area of the replaced object based on the mask of the replaced object, and re-dividing the retaining layer and the editing layer in the completed edited region; and calculating the mask of the editing layer and the mask of the retaining layer after re-division, and determining the boundary band between the edited region and the non-edited region based on the mask of the editing layer and the mask of the retaining layer.
6. The three-dimensional scene generation method according to claim 5, characterized in that, The step of completing the remaining area of the replaced object based on the mask of the replaced object includes: Project the editing area onto a preset single viewpoint; Calculate the mask of the projection area, and determine the residual mask of the replaced object based on the difference between the mask of the projection area and the mask of the replaced object; The residual region of the replaced object is determined based on the residual mask of the replaced object, and the residual region of the replaced object is completed.
7. The three-dimensional scene generation method according to claim 5, characterized in that calculating the mask of the editing layer and the mask of the retaining layer after re-division includes: projecting the re-divided edited region onto a preset single viewpoint; calculating the mask of the projection area, and determining the mask of the editing layer based on the union of the mask of the projection area and the mask of the replaced object; and calculating the mask of the retaining layer based on the mask of the editing layer.
8. The three-dimensional scene generation method according to claim 5, characterized in that determining the boundary band between the edited region and the non-edited region based on the mask of the editing layer and the mask of the retaining layer includes: identifying the boundary between the editing layer and the retaining layer based on the mask of the editing layer and the mask of the retaining layer; obtaining a first annular band of a first proportion from the editing layer based on the boundary between the editing layer and the retaining layer, and obtaining a second annular band of a first proportion from the retaining layer based on the boundary between the editing layer and the retaining layer; and generating the boundary band between the edited region and the non-edited region based on the first annular band and the second annular band.
9. The three-dimensional scene generation method according to claim 5, characterized in that, Determining the depth image of the boundary band based on the depth image of the first multi-view image includes: The depth image of the editing layer and the depth image of the retaining layer are determined based on the depth image of the first multi-view image, and the depth image of the boundary band is determined based on the depth image of the editing layer and the depth image of the retaining layer.
10. The three-dimensional scene generation method according to claim 1, characterized in that, before aligning the geometric features between the edited region and the non-edited region in the first 3D model based on the depth image of the boundary band to obtain the second 3D model, the method further includes: calculating the depth difference of pixels on the boundary band based on the depth image of the boundary band; calculating a weighted cumulative distribution curve based on the depth difference of pixels on the boundary band; determining the translation amount of the boundary points between the edited region and the non-edited region based on the weighted cumulative distribution curve; correcting the depth image of the boundary points between the edited region and the non-edited region based on the translation amount of the boundary points; calculating a pseudo-ground-truth depth image of the first 3D model based on the depth image of the boundary points and the depth image of the first multi-view image, and using the corrected first multi-view image as the pseudo-ground-truth image; and reconstructing and optimizing the first 3D model based on the depth image of the first multi-view image, the pseudo-ground-truth depth image, the pseudo-ground-truth image, and the first multi-view image to obtain the second 3D model.
11. The three-dimensional scene generation method according to claim 10, characterized in that, The step of calculating the weighted cumulative distribution curve based on the depth difference of pixels on the boundary band includes: Obtain the color gradient magnitude and pixel residual gradient magnitude of the pixels on the boundary band; The weight of the pixel is calculated based on the color gradient magnitude and the pixel residual gradient magnitude; The proportion of pixel weight to total weight is calculated based on the depth difference of pixels on the boundary band and the weight, and the weighted cumulative distribution curve is generated based on the proportion of pixel weight to total weight.
12. The three-dimensional scene generation method according to claim 10, characterized in that reconstructing and optimizing the first 3D model based on the depth image of the first multi-view image, the pseudo-ground-truth depth image, the pseudo-ground-truth image, and the first multi-view image to obtain the second 3D model includes: initializing the pixel depth values of the depth image of the first 3D model; reconstructing the initialized first 3D model, and calculating the loss value of the reconstruction process of the first 3D model based on the depth image of the first multi-view image, the pseudo-ground-truth depth image, the pseudo-ground-truth image, and the first multi-view image; and optimizing the reconstruction process of the first 3D model based on the loss value.
13. The three-dimensional scene generation method according to claim 1, characterized in that, The step of generating scene-aware text based on the source 3D model includes: Identify object instances in the source 3D model; Obtain a pre-set second prompt text, and generate a description text of the object instance in the source 3D model based on the second prompt text; The scene-aware text is generated based on the description text of the object instance.
14. An electronic device, characterized in that it comprises: a memory configured to store a computer program; and a processor configured to implement the steps of the three-dimensional scene generation method according to any one of claims 1 to 13 when executing the computer program.
15. A non-volatile computer-readable storage medium, characterized in that, The non-volatile computer-readable storage medium stores a computer program, wherein the computer program, when executed by a processor, implements the steps of the three-dimensional scene generation method as described in any one of claims 1 to 13.