Building image intelligent processing method and device based on natural language instruction
By using a natural language-based intelligent image processing method for buildings, the problem of poor applicability of existing AI design methods to ordinary users has been solved. This method enables an efficient and intelligent building design process, supports natural language input, reduces operational complexity, and improves design efficiency and quality control.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- TSINGHUA UNIVERSITY
- Filing Date
- 2026-03-10
- Publication Date
- 2026-06-23
AI Technical Summary
Existing AI-assisted architectural design methods are poorly suited to ordinary users: the human-computer interaction barrier is high, design efficiency is low, and the quality of the design output is hard to control. As a result, they struggle to meet the demand for design that is accessible to the general public, efficient, and precise.
A natural language-based intelligent image processing method for buildings is adopted. By receiving natural language commands and building images input by users, tokenization processing is performed to generate processing results that match the commands, including token sequences and semantic tags for outlines and rooms. It supports bimodal input, reduces operational complexity, and improves design efficiency.
It enables ordinary users to directly describe their design needs using natural language, reduces manual correction costs, improves the level of design automation, and meets the needs of modern architectural design for high efficiency and intelligence.
Figure CN122263221A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of interdisciplinary technology of architecture and artificial intelligence, and in particular to a method and apparatus for intelligent processing of architectural images based on natural language commands.
Background Technology
[0002] As a core preliminary step in architectural engineering, the rationality of architectural floor plan design directly determines the building's functionality, space utilization, and end-user experience. Traditional architectural floor plan design heavily relies on the designer's accumulated professional experience, with the design process primarily involving manual drawing or simple auxiliary drawing tools. This approach has several inherent drawbacks: Firstly, it results in lengthy design cycles and low efficiency, making it difficult to quickly respond to diverse and personalized market demands. Secondly, the design outcome is easily influenced by the designer's subjective perception and experience preferences, making it difficult to achieve a precise balance between functional practicality and spatial aesthetics. Some design schemes require multiple rounds of iterative modifications to meet basic usage requirements, further increasing design and time costs.
[0003] With the development of Artificial Intelligence (AI) technology, some AI-aided design techniques have been gradually applied to the field of architectural floor plan design, attempting to address the pain points of traditional design methods. However, existing AI-aided design methods are poorly applicable to ordinary non-professional users, and the human-computer interaction threshold is high: most existing methods only support single-modal interaction methods such as structured parameter input or fixed-format drawing input, and cannot directly parse the natural language requirements commonly used by ordinary users. The design results often require a lot of manual correction to comply with building codes. This leads to problems such as difficulty for ordinary users to use directly, low design efficiency, inconvenient human-computer interaction, and poor controllability of design quality, making it difficult to meet the demands of modern architectural design for popularization, efficiency, intelligence, and precision.
Summary of the Invention
[0004] This application provides a method and apparatus for intelligent processing of architectural images based on natural language commands, in order to solve the defects of low design efficiency, inconvenient human-computer interaction, and poor controllability of design results in the prior art.
[0005] This application provides a method for intelligent processing of architectural images based on natural language commands, applied to electronic devices, comprising the following steps: receiving natural language commands and architectural images input by the user, where the architectural images are used to identify the spatial features of the target building; and processing the architectural images to generate processing results that match the natural language commands.
[0006] According to the intelligent image processing method for buildings based on natural language instructions provided in this application, the building image includes a floor plan image of the target building, and the natural language instructions are used to instruct the interpretation of the floor plan image. The above-mentioned processing of the building image to generate a processing result matching the natural language instructions may specifically include: The planar layout image is tokenized to obtain a spatial token sequence. Based on the spatial token sequence, target text is generated, which is used to describe the layout information in the planar layout image.
[0007] According to the intelligent architectural image processing method based on natural language commands provided in this application, the above-mentioned tokenization processing of the planar layout image to obtain a spatial token sequence may specifically include: Based on the planar layout image, the outline image and room images of the target building are determined. The outline image is used to identify the outline of the target building, and the room images are used to identify the rooms within the target building. The outline image is tokenized to obtain an outline token sequence, which is used to identify the outline features of the target building. Based on the outline image, the room images are tokenized to obtain a room token sequence, which is used to identify the geometric features of the rooms. Semantic tag tokens matching the room token sequences are obtained; these semantic tag tokens are used to identify the functional attributes of the rooms. The outline token sequence, room token sequence, and semantic tag tokens are cross-arranged to obtain a spatial token sequence.
[0008] According to the intelligent architectural image processing method based on natural language commands provided in this application, the contour image is tokenized to obtain a contour token sequence, which may specifically include: The contour image is standardized, and features are extracted from the standardized contour image to obtain multiple first feature vectors corresponding to the contour image. For each first feature vector, the first codebook vector most similar to the first feature vector is determined from the preset contour codebook. The contour codebook is a pre-constructed set of first codebook vectors. The first codebook vectors corresponding to each first feature vector are combined to obtain the contour token sequence.
[0009] According to the intelligent architectural image processing method based on natural language commands provided in this application, the above-mentioned tokenization of the room images based on the contour image to obtain a room token sequence may specifically include: Based on the contour image, features are extracted from the room image to obtain multiple second feature vectors corresponding to the room image. For each second feature vector, the second codebook vector most similar to the second feature vector is determined from a pre-constructed set of second codebook vectors. The second codebook vectors corresponding to each second feature vector are combined to obtain a room token sequence.
[0010] According to the intelligent architectural image processing method based on natural language instructions provided in this application, the architectural image includes a contour image of the target building, which is used to identify the contour of the target building. The natural language instructions are used to instruct the generation of a planar layout image that meets the target requirements. The above-mentioned processing of the architectural image to generate a processing result matching the natural language instructions may specifically include: The contour image is tokenized to obtain a contour token sequence, which is used to identify the contour features of the target building. Semantic label tokens and room token sequences that meet the target requirements are generated. The semantic label tokens that meet the target requirements identify the functional attributes of the rooms indicated in the target requirements, and the room token sequences that meet the target requirements identify the geometric features of the rooms indicated in the target requirements. Based on the contour token sequence, the semantic label tokens that meet the target requirements, and the room token sequences that meet the target requirements, a floor plan layout image that meets the target requirements is generated.
[0011] According to the intelligent processing method for architectural images based on natural language instructions provided in this application, the architectural image includes a plan view of the target building, and the natural language instructions are used to instruct the editing of a target area of the plan view. The above-mentioned processing of the architectural image to generate a processing result matching the natural language instructions may specifically include: The planar layout image is tokenized to obtain a spatial token sequence, which includes an outline token sequence, a room token sequence, and a semantic label token matching the room token sequence. Based on natural language instructions, the room token sequence and semantic label token corresponding to the target region in the spatial token sequence are updated to obtain an updated spatial token sequence. The updated spatial token sequence is then decoded to obtain the updated planar layout image.
[0012] According to the intelligent architectural image processing method based on natural language commands provided in this application, the electronic device can also perform structural correction and / or standardization processing on the initial image to obtain the target image. The initial image is either an updated floor plan or a floor plan that meets the target requirements. The target image in the target format is then output.
[0013] According to the intelligent processing method for architectural images based on natural language instructions provided in this application, a multimodal large model can be deployed in an electronic device. The multimodal large model is used to process architectural images and generate processing results that match the natural language instructions.
[0014] This application also provides a building image intelligent processing device based on natural language commands, which includes the following modules: The receiving module is used to receive natural language commands and architectural images input by the user. The architectural images are used to identify the spatial features of the target building.
[0015] The processing module is used to process architectural images and generate processing results that match natural language instructions.
[0016] According to the intelligent architectural image processing device based on natural language instructions provided in this application, the architectural image includes a floor plan image of the target building, and the natural language instructions are used to instruct the interpretation of the floor plan image. The aforementioned processing module is specifically used for: The planar layout image is tokenized to obtain a spatial token sequence. Based on the spatial token sequence, target text is generated, which is used to describe the layout information in the planar layout image.
[0017] According to the intelligent architectural image processing device based on natural language commands provided in this application, the processing module is specifically used for: Based on the planar layout image, the outline image and room images of the target building are determined. The outline image is used to identify the outline of the target building, and the room images are used to identify the rooms within the target building. The outline image is tokenized to obtain an outline token sequence, which is used to identify the outline features of the target building. Based on the outline image, the room images are tokenized to obtain a room token sequence, which is used to identify the geometric features of the rooms. Semantic tag tokens matching the room token sequences are obtained; these semantic tag tokens are used to identify the functional attributes of the rooms. The outline token sequence, room token sequence, and semantic tag tokens are cross-arranged to obtain a spatial token sequence.
[0018] This application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the building image intelligent processing method based on natural language instructions as described above.
[0019] This application also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the intelligent processing method for building images based on natural language instructions as described above.
[0020] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the building image intelligent processing method based on natural language instructions as described above.
[0021] The architectural image intelligent processing method based on natural language commands provided in this application supports dual-modal input of natural language commands and architectural images, breaking down the barriers of traditional human-computer interaction. It can directly parse the natural language requirement descriptions commonly used by designers in their work, significantly reducing operational complexity. This allows for rapid response to personalized design needs, solving problems such as low efficiency in traditional design, inconvenient human-computer interaction, and poor controllability of design output quality. Furthermore, this solution can generate processing results that match the user's input natural language commands, reducing the cost of manual corrections, effectively improving the automation level of architectural design, and further enhancing design efficiency to meet the demands of modern architectural design for high efficiency, intelligence, and precision.
Attached Figure Description
[0022] To more clearly illustrate the technical solutions in this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0023] Figure 1 is a flowchart illustrating the intelligent processing method for architectural images based on natural language commands provided in this application.
[0024] Figure 2 is a schematic diagram of the floor plan and room images provided in this application.
[0025] Figure 3 is a schematic diagram illustrating the tokenization process of contour images and room images provided in this application.
[0026] Figure 4 is a schematic diagram of tokenization processing of a contour image provided in this application.
[0027] Figure 5 is a schematic diagram of tokenization processing of room images provided in this application.
[0028] Figure 6 is a schematic diagram of the bubble chart provided in this application.
[0029] Figure 7 shows schematic diagrams of the original and updated floor plan layouts provided in this application.
[0030] Figure 8 is a comparative diagram of the initial image and the target image provided in this application.
[0031] Figure 9 is a schematic diagram of training a large multimodal model provided in this application.
[0032] Figure 10 is a schematic diagram of the structure of a building image intelligent processing device based on natural language commands provided in this application.
[0033] Figure 11 is a schematic diagram of the structure of an electronic device provided in this application.
Detailed Implementation
[0034] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.
[0035] In the description of this application, it should be understood that the terms "center," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," and "outer," etc., indicating orientation or positional relationships based on the orientation or positional relationships shown in the accompanying drawings, are used only for the convenience of describing this application and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation of this application. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Thus, a feature defined with "first," "second," or "third" may explicitly or implicitly include one or more of the stated features. In the description of this application, "a plurality of" means two or more, unless otherwise explicitly specified.
[0036] The method for intelligent processing of architectural images based on natural language instructions provided in this application is described below with reference to Figures 1 to 11.
[0037] Figure 1 is one of the flowcharts illustrating the intelligent architectural image processing method based on natural language commands provided in this application. As shown in Figure 1, the method includes the following steps S101-S102.
[0038] S101: Receive natural language commands and architectural images input by the user.
[0039] Architectural images are used to identify the spatial features of the target building. The type of target building in this application embodiment is not specifically limited. For example, the target building can be a residential building (such as a typical two- or three-bedroom home), a commercial building (such as a shopping mall or supermarket), an office building (such as an office tower or an industrial park office building), a public service building (such as a school, hospital, or library), or an industrial building (such as a standard factory building or warehouse).
[0040] This application does not impose specific limitations on the natural language commands input by the user.
[0041] In one example, natural language instructions can be used to instruct the performance of comprehension tasks, such as interpreting a floor plan image. For instance, a natural language instruction could be "describe this floor plan."
[0042] In another example, natural language instructions can be used to instruct the execution of generative tasks, such as generating a floor plan layout image that meets target requirements. For instance, a natural language instruction could be "Design a two-bedroom apartment for me."
[0043] In another example, natural language instructions can be used to instruct the execution of editing tasks, such as editing a target area of a planar layout image. For instance, a natural language instruction could be "convert the secondary bedroom into a study."
[0044] Architectural images correspond to natural language instructions. For example, when a natural language instruction is used to instruct the interpretation of a floor plan image, or to instruct the editing of a target area of the floor plan image, the architectural image can be a floor plan image of the target building, enabling the electronic device to interpret the floor plan image or edit the target area of the floor plan image. When a natural language instruction is used to instruct the generation of a floor plan image that meets the target requirements, the architectural image can be a contour image, which identifies the contour of the target building, enabling the electronic device to generate a floor plan image within the building contour identified by the contour image.
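The pairing between instruction type and image type described above can be summarized as a small lookup. The following is a minimal sketch only; the names `Task`, `ImageKind`, `REQUIRED_IMAGE`, and `check_input` are hypothetical and do not appear in this application.

```python
from enum import Enum


class Task(Enum):
    INTERPRET = "interpret"  # e.g. "describe this floor plan"
    GENERATE = "generate"    # e.g. "Design a two-bedroom apartment for me"
    EDIT = "edit"            # e.g. "convert the secondary bedroom into a study"


class ImageKind(Enum):
    FLOOR_PLAN = "floor_plan"
    CONTOUR = "contour"


# Pairing stated in the text: interpretation and editing operate on a floor
# plan image, while generation starts from a contour image.
REQUIRED_IMAGE = {
    Task.INTERPRET: ImageKind.FLOOR_PLAN,
    Task.EDIT: ImageKind.FLOOR_PLAN,
    Task.GENERATE: ImageKind.CONTOUR,
}


def check_input(task: Task, image_kind: ImageKind) -> bool:
    """Return True if the supplied image kind is the one the task expects."""
    return REQUIRED_IMAGE[task] is image_kind
```

How the electronic device actually infers the task type from the natural language instruction is not specified here; the sketch only captures which primary image accompanies which kind of instruction.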
[0045] Specifically, taking the use of natural language commands to instruct the interpretation of floor plan images as an example, if a user intends to understand the content of a certain floor plan image, they can input the floor plan image and the corresponding natural language command "describe this layout" into the electronic device. In this way, the electronic device can receive the natural language command input by the user and the architectural image.
[0046] S102: Process the architectural image to generate a processing result that matches the natural language command.
[0047] It is understandable that different natural language instructions result in different processing methods for architectural images, leading to different processing results. The following will describe the process of S102 in detail using three examples: 1. Natural language instructions for interpreting floor plan images; 2. Natural language instructions for editing target areas of floor plan images; 3. Natural language instructions for generating floor plan images that meet target requirements.
[0048] I. Natural language instructions are used to instruct the interpretation of planar layout images.
[0049] In one example, where a natural language instruction is used to instruct the interpretation of a floor plan image, the architectural image may include a floor plan image of the target building. Accordingly, the above-described processing of the architectural image to generate a processing result matching the natural language instruction can be implemented through Step 1 and Step 2 below.
[0050] Step 1: Tokenize the planar layout image to obtain a spatial token sequence.
[0051] The spatial token sequence includes the outline token sequence, the room token sequence, and the semantic tag token that matches the room token sequence.
[0052] In one example, the electronic device can first determine the outline image and room images of the target building based on a planar layout image. Then, it tokenizes the outline image to obtain an outline token sequence, and based on the outline image, it tokenizes the room images to obtain a room token sequence. Next, the electronic device can obtain semantic tag tokens that match the room token sequences, and cross-arrange the outline token sequence, room token sequence, and semantic tag tokens to obtain a spatial token sequence.
[0053] A contour image is used to identify the outline of a target building, that is, to represent the global boundary position of the target building. A contour image is a pixel image and is in binary mask format; therefore, it can also be called a contour mask image.
[0054] Room images are used to identify rooms within a target building. Room images are pixel-based images obtained by dividing a floor plan image by functional category (such as living room, kitchen, bedroom, etc.). Each room image can contain areas with a single functional attribute, and the shape of each room image is identical to that of the outline image. For example, as shown in Figure 2(a), assume that the planar layout image includes six regions, namely regions 1-6 shown in Figure 2(a). The room images can then include room images A-F shown in Figure 2(b), which correspond one-to-one with regions 1-6.
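As an illustration of how per-room images with the same shape as the outline image might be derived, the following minimal sketch splits an integer label map into one binary mask per region. It rests on assumptions not stated in this application: the floor plan is taken to be already available as a label map in which 0 means "outside the building", and the outline mask is taken to be the filled footprint. The function name `split_rooms` is hypothetical.

```python
import numpy as np


def split_rooms(label_map: np.ndarray) -> tuple[np.ndarray, dict[int, np.ndarray]]:
    """Split an integer label map into an outline mask and per-room masks.

    label_map: 2-D integer array; 0 = outside the building (assumed
    convention), each positive integer = one room region.
    """
    # Binary mask of everything that belongs to the building.
    contour_mask = (label_map > 0).astype(np.uint8)
    # One binary mask per region, each with the same shape as the outline mask.
    room_masks = {
        int(room_id): (label_map == room_id).astype(np.uint8)
        for room_id in np.unique(label_map)
        if room_id != 0
    }
    return contour_mask, room_masks


# Toy 4x6 plan with three regions.
plan = np.array([
    [0, 1, 1, 2, 2, 0],
    [0, 1, 1, 2, 2, 0],
    [0, 3, 3, 3, 3, 0],
    [0, 0, 0, 0, 0, 0],
])
contour, rooms = split_rooms(plan)
```

In this toy case the room masks are disjoint and together cover exactly the outline mask, which mirrors the one-to-one correspondence between regions and room images described above.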
[0055] Outline token sequences are used to identify the outline features of a target building. In other words, an outline token sequence consists of multiple outline tokens and can represent the global shape of the target building.
[0056] Room token sequences are used to identify the geometric features of a room, such as its shape, size, and location within the target building.
[0057] Semantic tag tokens are used to identify the functional attributes of a room, such as living room, kitchen, bedroom, etc.
[0058] The process of tokenizing contour images and room images can be collectively referred to as hierarchical tokenization. In other words, electronic devices can convert contour images and room images into discrete token sequences through hierarchical tokenization.
[0059] Specifically, the electronic device can first determine the outline image of the target building based on the overall shape of the planar layout image, and then determine the room image corresponding to each room contained in the planar layout image. Next, the electronic device can tokenize the outline image to obtain a sequence of outline tokens representing the global shape of the target building. For each room image, the electronic device can tokenize the room image based on the outline image to obtain a sequence of room tokens containing the geometric features of that room. The electronic device can obtain semantic tag tokens matching the room token sequence. The electronic device can then cross-arrange the outline token sequence, room token sequence, and semantic tag tokens in a preset order to obtain a spatial token sequence (also known as a structured token sequence).
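The cross-arrangement step can be sketched as follows. The application only says that the tokens are arranged "in a preset order" and does not give that order, so the layout used here (outline tokens first, then each room's semantic tag token followed by that room's tokens) is an assumption, as are the token string formats and the function name `build_spatial_sequence`.

```python
from typing import Sequence


def build_spatial_sequence(
    contour_tokens: Sequence[int],
    rooms: Sequence[tuple[str, Sequence[int]]],
) -> list[str]:
    """Interleave outline tokens, semantic tag tokens, and room tokens.

    Assumed order: outline block first, then for every room its semantic
    tag token followed by its room tokens.
    """
    seq = ["<contour>"]
    seq += [f"<c_{t}>" for t in contour_tokens]
    seq.append("</contour>")
    for label, room_tokens in rooms:
        seq.append(f"<label_{label}>")
        seq += [f"<r_{t}>" for t in room_tokens]
    return seq


spatial = build_spatial_sequence(
    contour_tokens=[12, 7],
    rooms=[("living_room", [3, 41]), ("kitchen", [8, 8])],
)
# spatial == ['<contour>', '<c_12>', '<c_7>', '</contour>',
#             '<label_living_room>', '<r_3>', '<r_41>',
#             '<label_kitchen>', '<r_8>', '<r_8>']
```

Keeping each semantic tag token adjacent to its room tokens is one way to make the match between functional attribute and geometry recoverable from the flat sequence.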
[0060] In one example, where natural language instructions are used to instruct the interpretation of a floor plan image, the architectural image may also include a contour image to avoid inaccuracies in the contour image determined by the electronic device based on the floor plan image. That is, the contour image of the target building can be input by the user.
[0061] In one example, two independent Vector Quantized Variational Autoencoders (VQ-VAEs) can be pre-built into the electronic device: one VQ-VAE for tokenizing the contour image and the other for tokenizing the room image. For example, as shown in Figure 3, assume the contour image is image M in Figure 3 and the room images include images A-F in Figure 3. The VQ-VAE used for tokenizing the contour image is VQ-VAE1 shown in Figure 3, and the VQ-VAE used for tokenizing room images is VQ-VAE2 shown in Figure 3. The electronic device can input image M into VQ-VAE1 to obtain the contour token sequence corresponding to image M, and input images A-F sequentially into VQ-VAE2 to obtain the room token sequences corresponding to images A-F.
[0062] In one optional implementation, the above-mentioned tokenization of the contour image to obtain a contour token sequence can be specifically implemented as follows: The contour image is standardized; features are extracted from the standardized contour image to obtain multiple first feature vectors corresponding to the contour image. For each first feature vector, a first codebook vector most similar to the first feature vector is determined from a preset contour codebook; the first codebook vectors corresponding to each first feature vector are combined to obtain the contour token sequence.
[0063] The contour codebook is a set of first codebook vectors obtained through joint training. Specifically, the contour codebook, encoder, and decoder are obtained through end-to-end joint training: the contour is used as input, mapped into an encoding vector by the encoder, quantized based on the codebook, and then input into the decoder to reconstruct the contour. The contour reconstruction error is used as the optimization objective to jointly train the encoder, decoder, and codebook, thereby obtaining the trained contour codebook.
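For orientation, the training objective can be written out. The application names only the contour reconstruction error as the optimization objective. The commonly used VQ-VAE formulation adds a codebook term and a commitment term, because the nearest-vector lookup itself passes no gradient to the codebook. Whether this application uses those extra terms is not stated, so the last two terms below are an assumption:

```latex
\mathcal{L}
  = \underbrace{\lVert x - D(e) \rVert_2^2}_{\text{reconstruction error}}
  + \underbrace{\lVert \operatorname{sg}[E(x)] - e \rVert_2^2}_{\text{codebook term (assumed)}}
  + \beta \, \underbrace{\lVert E(x) - \operatorname{sg}[e] \rVert_2^2}_{\text{commitment term (assumed)}}
```

Here $x$ is the input contour image, $E$ and $D$ are the encoder and decoder, $e$ is the codebook vector selected for the encoding $E(x)$, $\operatorname{sg}[\cdot]$ denotes the stop-gradient operator, and $\beta$ is a weighting hyperparameter.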
[0064] Standardization processing can include, but is not limited to, operations such as size normalization and pixel value normalization.
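The two standardization operations named above can be sketched as follows. This is a minimal illustration assuming a single-channel mask image; the target size of 64, nearest-neighbour resizing, and division by the maximum pixel value are assumptions, and the function name `standardize` is hypothetical.

```python
import numpy as np


def standardize(image: np.ndarray, size: int = 64) -> np.ndarray:
    """Resize a 2-D image to size x size and scale pixel values to [0, 1]."""
    h, w = image.shape
    # Size normalization: nearest-neighbour sampling of source rows/columns.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = image[rows[:, None], cols[None, :]].astype(np.float32)
    # Pixel value normalization: divide by the maximum (e.g. 255 for a mask).
    max_val = resized.max()
    return resized / max_val if max_val > 0 else resized


mask = np.zeros((100, 200), dtype=np.uint8)
mask[20:80, 40:160] = 255
out = standardize(mask)
```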
[0065] In one example, as shown in Figure 4, an electronic device may be equipped with a convolutional neural network encoder A. The convolutional neural network encoder A is pre-trained and has the ability to extract deep features from contour images. Its network structure may include convolutional layers, pooling layers and fully connected layers, which are used to map the input two-dimensional contour image into feature vectors in a high-dimensional space.
[0066] Continuing to refer to Figure 4, after obtaining the standardized contour image, the electronic device can input it into convolutional neural network encoder A to obtain multiple first feature vectors corresponding to the contour image. The electronic device can be equipped with the vector quantization module A shown in Figure 4, which can store a contour codebook (i.e., a first codebook vector set) of a preset size. For each first feature vector output by convolutional neural network encoder A, the electronic device can input the first feature vector into vector quantization module A. Vector quantization module A can calculate the similarity between the first feature vector and each first codebook vector in the contour codebook and, by comparing these similarities, select the first codebook vector most similar to the first feature vector. The electronic device can use the first codebook vector most similar to each first feature vector as the discrete contour token corresponding to that first feature vector, and arrange the discrete contour tokens corresponding to the first feature vectors in a preset order to obtain a contour token sequence of a preset dimension.
[0067] In one example, the convolutional neural network encoder and the vector quantization module can be the encoder and the vector quantization module deployed in the VQ-VAE.
[0068] In one example, similarity calculation can be achieved through Euclidean distance calculation. Accordingly, the first codebook vector most similar to the first feature vector refers to the first codebook vector that is closest to the first feature vector.
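The nearest-codebook-vector lookup under Euclidean distance can be sketched as follows. The codebook and feature values are toy data, and the function name `quantize` is hypothetical. The sketch returns the index of the selected codebook vector as the discrete token; the vector itself can be recovered as `codebook[token_ids]`.

```python
import numpy as np


def quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Return, for each feature vector, the index of its nearest codebook vector.

    features: (N, D) array of encoder outputs.
    codebook: (K, D) array of codebook vectors.
    """
    # Pairwise squared Euclidean distances, shape (N, K).
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # The smallest distance corresponds to the most similar codebook vector.
    return np.argmin(dists, axis=1)


codebook = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
features = np.array([[0.1, -0.1], [0.9, 1.2], [0.2, 1.9]])
token_ids = quantize(features, codebook)  # array([0, 1, 2])
```

Comparing squared distances gives the same selection as comparing distances, since the square root is monotonic, and avoids computing it.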
[0069] In one optional implementation, the above-mentioned tokenization of the room image based on the contour image to obtain the room token sequence can be specifically implemented as follows: based on the contour image, feature extraction is performed on the room image to obtain multiple second feature vectors corresponding to the room image; for each second feature vector, the second codebook vector most similar to the second feature vector is determined from the preset room codebook; and the second codebook vectors corresponding to each second feature vector are combined to obtain the room token sequence.
[0070] The room codebook is a set of second codebook vectors obtained through joint training. Specifically, the room codebook, encoder, and decoder are obtained through end-to-end joint training: the room layout is taken as input, mapped into an encoded vector by the encoder, quantized based on the codebook, and then input into the decoder to reconstruct the room layout. The room layout reconstruction error is used as the optimization objective, and the encoder, decoder, and codebook are jointly trained to obtain the trained room codebook.
[0071] In one example, as shown in Figure 5, the electronic device can be equipped with a convolutional neural network encoder B. Encoder B is pre-trained and can extract deep features by fusing the room image and the contour image. Its network structure includes conditional convolutional layers, pooling layers, and fully connected layers, which map the input two-dimensional room image and contour image into feature vectors in a high-dimensional space.
[0072] Continuing to refer to Figure 5, after obtaining the room image and the contour image, the electronic device can first perform standardization processing on the room image and the contour image. Then, the electronic device can input the standardized contour image and room image into convolutional neural network encoder B to obtain multiple second feature vectors corresponding to the room image. The second feature vectors contain both the geometric features of the room itself and the contextual constraints of the overall building contour.
[0073] The electronic device can also be equipped with the vector quantization module B shown in Figure 5, which can store a room codebook (i.e., a set of second codebook vectors) of a preset size. The electronic device can input each second feature vector output by convolutional neural network encoder B into vector quantization module B. Vector quantization module B can calculate the similarity between the second feature vector and each second codebook vector in the room codebook and, by comparing these similarities, select the second codebook vector most similar to the second feature vector. The electronic device can use the second codebook vector most similar to each second feature vector as the discrete room token corresponding to that second feature vector, and can arrange the discrete room tokens corresponding to the second feature vectors in a preset order to obtain a room token sequence of a preset dimension.
[0074] Step 2: Generate the target text based on the spatial token sequence.
[0075] The target text is used to describe the layout information in the planar layout image.
[0076] Specifically, electronic devices can also deploy multimodal large models. These models are pre-trained and capable of outputting corresponding natural language descriptions based on spatial token sequences and natural language instructions. The electronic device inputs the spatial token sequences and natural language instructions into the multimodal large model, which then outputs the corresponding natural language description, i.e., the target text.
[0077] In one example, the decoder can be a decoder deployed in VQ-VAE.
[0078] In one example, the target text could be "There is a living room in the middle of the layout, the kitchen is to the northeast of the living room, and the bathroom is to the west...".
[0079] In one example, in addition to the target text, the electronic device can also generate bubble diagrams and / or formatted structured information based on the spatial token sequence.
[0080] Bubble diagrams can be used to illustrate the spatial relationships between rooms within a target building. Specifically, a bubble diagram can use nodes to represent room types (such as living room, bedroom, kitchen) and use connections to visually display the spatial location and adjacency relationships of the rooms within the target building. In the bubble diagram shown in Figure 6, the living room node serves as the core and is connected by lines to the kitchen node, bathroom node, secondary bedroom node, master bedroom node, dining room node, etc., clearly presenting the topological layout and functional zoning logic of the apartment and making it easy for users to quickly understand the relationships between the rooms.
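A bubble diagram of this kind is naturally an undirected graph. The sketch below is one possible in-memory representation, assuming a simple adjacency map; the function name `build_bubble_graph` and the edge list are illustrative and only mirror the rooms named in the example above.

```python
from typing import Dict, List, Set, Tuple


def build_bubble_graph(edges: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map: node = room, edge = adjacency."""
    graph: Dict[str, Set[str]] = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


# Connections taken from the example: the living room links to every other room.
edges = [
    ("living room", "kitchen"),
    ("living room", "bathroom"),
    ("living room", "secondary bedroom"),
    ("living room", "master bedroom"),
    ("living room", "dining room"),
]
bubble = build_bubble_graph(edges)

# The node with the most connections acts as the core of the layout.
core = max(bubble, key=lambda room: len(bubble[room]))
```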
[0081] The structured information can store attributes such as the index, type, area, size, and spatial location of each room in key-value pairs. This structured information can be directly integrated with subsequent computer-aided design (CAD) software, floor plan rendering tools, or smart home configuration systems to achieve seamless import and secondary development of layout data, thereby improving the engineering efficiency of floor plan design and application.
[0082] In one example, the formatted structured information can be structured information in JSON format.
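For illustration, structured information in JSON format might look like the output of the sketch below. The field names (`index`, `type`, `area_m2`, `size_m`, `position`) and all values are invented for this example; the text above specifies only that index, type, area, size, and spatial location are stored as key-value pairs.

```python
import json

# Hypothetical room records; every value here is illustrative.
rooms = [
    {"index": 0, "type": "living room", "area_m2": 24.0,
     "size_m": {"width": 6.0, "depth": 4.0}, "position": "center"},
    {"index": 1, "type": "kitchen", "area_m2": 9.0,
     "size_m": {"width": 3.0, "depth": 3.0},
     "position": "northeast of living room"},
]

# Serialize for hand-off to a downstream tool, then parse it back.
structured_json = json.dumps({"rooms": rooms}, indent=2)
restored = json.loads(structured_json)
```

Because the format is plain JSON, a downstream tool needs only a standard JSON parser to read the room attributes.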
[0083] II. Natural language instructions are used to instruct the editing of a target area of the planar layout image.
[0084] In one example, when a natural language instruction is used to direct editing of a target area in a floor plan image, the architectural image may include the floor plan of the target building. Accordingly, the above-described processing of the architectural image to generate a processing result matching the natural language instruction can be implemented by tokenizing the floor plan image to obtain a spatial token sequence. Based on the natural language instruction, the room token sequence and semantic tag token corresponding to the target area in the spatial token sequence are updated to obtain an updated spatial token sequence. The updated spatial token sequence is then decoded to obtain the updated floor plan.
[0085] Editing the target area may include operations such as deleting, replacing, adjusting the shape, and adding to the target area, and this application embodiment does not specifically limit this.
[0086] For the process of tokenizing the planar layout image to obtain a spatial token sequence, refer to the description above of the case in which the natural language instruction is used to instruct interpretation of the planar layout image; it is not repeated here.
[0087] After obtaining the spatial token sequence, the electronic device can identify and locate, from the natural language instruction, the target area to be modified. For example, if the natural language instruction is to convert the bathroom on the west side of the living room into a walk-in closet, the electronic device can determine that the target area is that bathroom. The electronic device can then determine, from the spatial token sequence, the room token sequence and semantic tag token corresponding to the target area and update them. In the example above, the electronic device can replace the room token sequence and semantic tag token corresponding to the bathroom with the room token sequence and semantic tag token corresponding to a walk-in closet, while the room token sequences and semantic tag tokens corresponding to areas other than the target area remain unchanged. In this way, the electronic device obtains an updated spatial token sequence. The electronic device can be equipped with a decoder that is pre-trained and can infer a floor plan from a spatial token sequence. The electronic device can input the updated spatial token sequence into the decoder to obtain an updated floor plan.
[0088] In one example, the original floor plan is shown in image (a) of Figure 7, and the updated floor plan is shown in image (b) of Figure 7.
[0089] The above technical solution allows for updates and modifications to be made only to the target area, while keeping the geometric features and topological relationships of other areas unaffected, thus achieving precise local editing.
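The local update can be sketched as a replacement of one entry in the token sequence. The data layout below (`RoomEntry`, `SpatialSequence`) and the token values are assumptions for illustration; in practice the new room tokens would come from the model, not be supplied by hand.

```python
from dataclasses import dataclass, replace
from typing import Sequence, Tuple


@dataclass(frozen=True)
class RoomEntry:
    label_token: str              # semantic tag token (functional attribute)
    room_tokens: Tuple[int, ...]  # room token sequence (geometric features)


@dataclass(frozen=True)
class SpatialSequence:
    outline_tokens: Tuple[int, ...]
    rooms: Tuple[RoomEntry, ...]


def edit_room(seq: SpatialSequence, target_index: int,
              new_label: str, new_room_tokens: Sequence[int]) -> SpatialSequence:
    """Return a copy in which only the target room's tokens are replaced."""
    rooms = list(seq.rooms)
    rooms[target_index] = RoomEntry(new_label, tuple(new_room_tokens))
    return replace(seq, rooms=tuple(rooms))


seq = SpatialSequence(
    outline_tokens=(5, 9, 2),
    rooms=(
        RoomEntry("<living>", (11, 4)),
        RoomEntry("<bathroom>", (7, 3)),
        RoomEntry("<kitchen>", (8, 1)),
    ),
)
# "Convert the bathroom into a walk-in closet": only entry 1 changes.
edited = edit_room(seq, target_index=1, new_label="<closet>",
                   new_room_tokens=[6, 3])
```

Using immutable records means the original sequence is left intact, so the outline tokens and the untouched rooms are shared unchanged between the two versions.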
[0090] III. Natural language instructions are used to instruct the generation of planar layout images that meet the target requirements.
[0091] In one example, when a natural language instruction is used to instruct the generation of a floor plan image that meets the target requirements, the architectural image may include an outline image of the target building. Accordingly, the above-described processing of the architectural image to generate a processing result matching the natural language instruction can be implemented as follows: tokenizing the outline image to obtain an outline token sequence; generating semantic tag tokens and room token sequences that meet the target requirements; and generating a floor plan image that meets the target requirements based on the outline token sequence, the semantic tag tokens that meet the target requirements, and the room token sequences that meet the target requirements.
[0092] Semantic tag tokens that meet the target requirements are used to identify the functional attributes of the rooms indicated in the target requirements, and room token sequences that meet the target requirements are used to identify the geometric features of the rooms indicated in the target requirements.
[0093] For the process of tokenizing the contour image to obtain the contour token sequence, refer to the description above of the case in which the natural language instruction is used to instruct interpretation of the planar layout image; it is not repeated here.
[0094] After receiving the outline token sequence, the electronic device can use an autoregressive generative model to analyze the functional requirements (such as the number of rooms, functional zoning, and space allocation) and layout constraints in the natural language instructions. This process generates semantic tag tokens and a sequence of room tokens that meet the target requirements. During generation, the autoregressive generative model uses the outline token sequence as a contextual constraint to ensure that the semantic tag tokens and room token sequences match the geometric features of the building's outer outline, avoiding issues such as room layouts exceeding the building's outline or contradictory topological relationships.
[0095] After generating semantic tag tokens and room token sequences that meet the target requirements, the electronic device can combine the outline token sequence, semantic tag tokens, and room token sequences in a preset order to form a complete spatial token sequence. The electronic device may contain a decoder, which is pre-trained and possesses the ability to reason from the spatial token sequence to a floor plan. The electronic device can input the spatial token sequence into the decoder, which can then generate a floor plan image that meets the target requirements.
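The generation and assembly steps can be sketched as a greedy autoregressive loop. Everything here is an assumption for illustration: the trained model is replaced by a stub that replays a fixed script, the token strings are invented, and the "preset order" is taken to be outline tokens first, followed by each semantic tag token and its room tokens.

```python
from typing import Callable, List, Sequence

END = "<end>"


def generate(outline_tokens: Sequence[str], instruction_tokens: Sequence[str],
             next_token: Callable[[List[str]], str],
             max_len: int = 64) -> List[str]:
    """Greedy autoregressive loop: each token is predicted from all prior ones."""
    # The instruction and the outline tokens form the fixed context.
    context = list(instruction_tokens) + list(outline_tokens)
    generated: List[str] = []
    while len(generated) < max_len:
        token = next_token(context + generated)
        if token == END:
            break
        generated.append(token)
    return generated


def make_stub_model(context_len: int,
                    script: List[str]) -> Callable[[List[str]], str]:
    """Stand-in for a trained model: replays a fixed script (illustration only)."""
    def next_token(prefix: List[str]) -> str:
        return script[len(prefix) - context_len]
    return next_token


outline = ["c5", "c9"]
instruction = ["two", "rooms"]
model = make_stub_model(len(instruction) + len(outline),
                        ["<bedroom>", "r12", "r7", "<kitchen>", "r3", END])
room_part = generate(outline, instruction, model)

# Assemble the complete spatial token sequence in the assumed preset order.
spatial_sequence = outline + room_part
```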
[0096] In one example, the floor plan image generated by the decoder that meets the target requirements can be a color image. To ensure clear distinction between functional areas, a pre-defined standardized color code system can be used to encode the floor plan image. Specifically, different colors correspond to different functional types of rooms (such as bedrooms, living rooms, kitchens, and bathrooms), interior and exterior walls, and doors to avoid visual confusion.
[0097] In one alternative implementation, the electronic device may further perform structural correction and / or standardization processing on the updated planar layout diagram or the planar layout image that meets the target requirements (hereinafter referred to as the initial image) to obtain the target image and output the target image in the target format.
[0098] In one example, structural correction can be achieved using a combination of morphological opening and closing operations. The opening operation prioritizes erosion to precisely remove small noise pixels (such as isolated specks or tiny burrs) from the image, preventing irrelevant pixels from interfering with the visual clarity of the layout area. Then, a dilation operation is performed to restore the boundaries slightly contracted by the erosion operation while preserving the main outline of the room, preventing excessive erosion from causing incomplete room outlines or missing boundaries. The closing operation is performed in the reverse order: first, dilation fills small gaps and closes tiny cracks inside the room; then, erosion smooths the room boundary lines and repairs broken wall markings and area outlines. Ultimately, this ensures that all room outlines are closed and complete, with no overlapping or intersecting areas, and that wall lines are continuous and smooth, conforming to the actual logic of the architectural layout.
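The opening and closing operations can be sketched in pure Python on a binary grid. This is a minimal illustration, assuming a 3x3 structuring element and ignoring cells outside the image; a production pipeline would more likely use an image processing library, and the grid sizes below are arbitrary.

```python
from typing import List

Grid = List[List[int]]  # binary image: 1 = foreground pixel, 0 = background


def _window(img: Grid, r: int, c: int) -> List[int]:
    """Values in the 3x3 window around (r, c), ignoring out-of-image cells."""
    h, w = len(img), len(img[0])
    return [img[i][j]
            for i in range(max(0, r - 1), min(h, r + 2))
            for j in range(max(0, c - 1), min(w, c + 2))]


def erode(img: Grid) -> Grid:
    """A pixel stays 1 only if every pixel in its window is 1."""
    return [[int(all(_window(img, r, c))) for c in range(len(img[0]))]
            for r in range(len(img))]


def dilate(img: Grid) -> Grid:
    """A pixel becomes 1 if any pixel in its window is 1."""
    return [[int(any(_window(img, r, c))) for c in range(len(img[0]))]
            for r in range(len(img))]


def opening(img: Grid) -> Grid:
    """Erosion then dilation: removes small specks, keeps large regions."""
    return dilate(erode(img))


def closing(img: Grid) -> Grid:
    """Dilation then erosion: fills small holes and gaps."""
    return erode(dilate(img))


def blank(h: int, w: int, value: int = 0) -> Grid:
    return [[value] * w for _ in range(h)]


# Opening: a 4x4 room block survives, an isolated noise pixel is removed.
clean = blank(8, 8)
for r in range(1, 5):
    for c in range(1, 5):
        clean[r][c] = 1
noisy = [row[:] for row in clean]
noisy[6][6] = 1
opened = opening(noisy)

# Closing: a one-pixel hole inside a filled region is filled.
holed = blank(5, 5, 1)
holed[2][2] = 0
closed = closing(holed)
```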
[0099] Standardization processing can include pixel standardization and / or color standardization. Pixel standardization refers to scaling the initial image, after structural correction, to a preset pixel specification that precisely corresponds to the area ratio of the actual building, ensuring a fixed mapping between image pixel size and physical space size, thus meeting the needs of subsequent precise design and size calculation.
[0100] Color standardization processing remaps and calibrates the colors of rooms and building components according to preset unified color coding rules, ensuring that the same functional areas (such as bedrooms and living rooms of different apartment types) use unified color identification, completely avoiding color inconsistencies caused by generation deviations and differences in device display, and ensuring the accuracy and uniformity of functional area identification.
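Color standardization can be sketched as snapping each pixel to the nearest entry of a preset color code table. The table below is hypothetical, since the actual preset color codes are not given in the text, and squared RGB distance is an assumed choice of nearness measure.

```python
from typing import Dict, List, Tuple

RGB = Tuple[int, int, int]

# Hypothetical color code table; the real preset codes are not specified.
PALETTE: Dict[str, RGB] = {
    "bedroom": (255, 200, 0),
    "living room": (0, 160, 255),
    "kitchen": (0, 200, 100),
    "wall": (0, 0, 0),
}


def nearest_label(pixel: RGB) -> str:
    """Name of the palette entry with the smallest squared RGB distance."""
    return min(PALETTE,
               key=lambda name: sum((p - q) ** 2
                                    for p, q in zip(pixel, PALETTE[name])))


def standardize_colors(image: List[List[RGB]]) -> List[List[RGB]]:
    """Replace every pixel with the exact palette color closest to it."""
    return [[PALETTE[nearest_label(px)] for px in row] for row in image]


# Slightly off-palette colors, as might result from generation deviations.
generated = [[(250, 205, 10), (5, 3, 2)]]
standardized = standardize_colors(generated)
```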
[0101] The target format can be CAD format or Building Information Modeling (BIM) format, and this application embodiment does not specifically limit it.
[0102] Specifically, taking CAD format as an example, electronic devices perform structural correction and / or standardization processing on the initial image to obtain the target image. The target image can then be converted into CAD format so that the target image in CAD format can be directly imported into building structure analysis software for mechanical performance verification. This achieves closed-loop optimization from model design results to engineering implementation schemes, ensuring that the final output results meet the actual application requirements of building engineering.
[0103] For example, Figure 8 shows a comparison of two initial images and their target images. As shown in Figure 8, image 1 (a target image) is obtained by applying structural correction and standardization to image a (an initial image), and image 2 (a target image) is obtained by applying structural correction and standardization to image b (an initial image). As can be seen from Figure 8, compared with the initial images, the target images after structural correction and standardization have more complete and clearer outlines, more accurate color mapping, clearer functional area boundaries, and higher color differentiation between rooms, thus avoiding the blurring and deviation present in the initial images.
[0104] In one example, after obtaining the target image, compliance verification can be performed on the target image. Once the target image is confirmed to be compliant, the electronic device can output the target image to the user.
[0105] In one alternative implementation, a multimodal large model can be deployed in the electronic device, and the multimodal large model is used to process architectural images to generate processing results that match natural language instructions.
[0106] Multimodal large models can perform real-time inference on conventional computing devices, meeting the high-efficiency requirements of engineering design.
[0107] The original multimodal large model is not specifically adapted to architectural spatial scenes and cannot accurately process spatial token sequences and natural language instructions. Therefore, before the multimodal large model is used to process architectural images and generate processing results that match natural language instructions, it needs to be trained. Figure 9 illustrates the core steps of training the multimodal large model, which comprise two main stages: multimodal alignment training and instruction fine-tuning training. These two stages are explained in detail below.
[0108] 1. Multimodal alignment training: Multimodal alignment training is the foundation for building the model's ability to associate architectural space with language semantics. Its core objective is to bring architectural space features and natural language symbols into a unified representation system and to establish basic compatibility and preliminary correspondence rules between them. This stage can be divided into two phases: embedding initialization training and multimodal pre-training.
[0109] Embedding initialization training: The core task of embedding initialization training is to construct a unified cross-modal embedding space so that spatial tokens and natural language vocabulary are basically compatible. In a specific implementation, a dedicated spatial vocabulary containing contour tokens, semantic tag tokens, and room tokens (hereinafter referred to as spatial tokens) is first constructed. The dedicated spatial vocabulary is then added to the base vocabulary of the multimodal large model, so that the spatial tokens and the model's original natural language vocabulary share the same vocabulary system.
[0110] Subsequently, each spatial token can be assigned an independent trainable embedding vector, while freezing all original model parameters (i.e., retaining only the training permissions for the spatial token embedding parameters). This preserves the model's existing general language understanding capabilities and prevents new spatial tokens from interfering with the performance of the original model. During training, the embedding vectors of spatial tokens are adjusted by optimizing the alignment loss function, ensuring that the embedding vectors of spatial tokens are semantically compatible with the embedding vectors of the model's original text tokens in the same high-dimensional space. This establishes a basic association between textual and spatial information, guaranteeing that the model can correctly recognize and process various architectural spatial tokens.
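The vocabulary extension and the "train only the new embeddings" rule can be sketched without any deep learning framework. The token strings and the tiny base vocabulary are invented for illustration, and the sketch assumes base token ids are contiguous from 0; it shows only the bookkeeping, not the alignment loss itself.

```python
from typing import Dict, List


def extend_vocabulary(base_vocab: Dict[str, int],
                      spatial_tokens: List[str]) -> Dict[str, int]:
    """Append spatial tokens to the base vocabulary with fresh ids."""
    vocab = dict(base_vocab)  # the original ids are left untouched
    for token in spatial_tokens:
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab


def trainable_mask(vocab: Dict[str, int], base_size: int) -> Dict[str, bool]:
    """Only embeddings of newly added tokens are trainable; the rest are frozen."""
    return {token: idx >= base_size for token, idx in vocab.items()}


# Toy base vocabulary and spatial tokens (hypothetical names).
base_vocab = {"the": 0, "room": 1, "kitchen": 2}
spatial_tokens = ["<contour_0>", "<contour_1>", "<room_0>", "<label_bedroom>"]

vocab = extend_vocabulary(base_vocab, spatial_tokens)
trainable = trainable_mask(vocab, base_size=len(base_vocab))
```

In a real framework the same mask would decide which rows of the embedding matrix receive gradient updates while all other model parameters stay frozen.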
[0111] Multimodal pre-training: Multimodal pre-training is full-parameter training conducted on a specially constructed paired dataset of spatial token sequences and natural language descriptions. This dataset is built from open-source data and can be divided into training, validation, and test sets according to a preset ratio (e.g., 7:2:1). It covers architectural scenarios with different apartment types, functional zones, and spatial layouts, thus ensuring the comprehensiveness of the training.
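A split by preset ratio such as 7:2:1 can be sketched as below. The function name, the fixed random seed, and the decision to give any rounding remainder to the test set are assumptions made for this example.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_dataset(samples: Sequence[T],
                  ratios: Tuple[int, int, int] = (7, 2, 1),
                  seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle the samples and split them into train / validation / test sets."""
    items = list(samples)
    random.Random(seed).shuffle(items)  # seeded for reproducibility
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train_set = items[:n_train]
    val_set = items[n_train:n_train + n_val]
    test_set = items[n_train + n_val:]  # remainder goes to the test set
    return train_set, val_set, test_set


train_set, val_set, test_set = split_dataset(range(100))
```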
[0112] The training employs an autoregressive language modeling objective. After paired data is input into the model, the model must predict the next token in the sequence based on the input token sequence (such as spatial tokens or text tokens). Through this training process, the model deeply learns the bidirectional correspondence between natural language descriptions and spatial token sequences. This enables the multimodal large model to accurately parse the spatial token sequences corresponding to natural language descriptions and to infer natural language descriptions of building floor plans based on spatial token sequences, ultimately achieving bidirectional semantic alignment between text and spatial information.
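The autoregressive language modeling objective can be written in its usual form as the negative log-likelihood of each token given the tokens before it. The symbols are assumptions introduced for this illustration: $x_1, \dots, x_T$ is the concatenated sequence of spatial tokens and text tokens, and $\theta$ denotes the model parameters.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\left(x_t \mid x_1, \dots, x_{t-1}\right)
```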
[0113] 2. Instruction fine-tuning training: After multimodal alignment training is completed, the model has basic spatial semantic understanding capabilities. At this stage, further fine-tuning targeted at architectural design tasks is needed to improve the model's response accuracy and task adaptability for specific design instructions. The architectural design tasks can include the understanding, editing, and generation tasks described above. This stage of training can therefore be based on a specially compiled instruction dataset that covers the three categories of floor plan understanding, generation, and editing tasks, each with a clearly defined input-output paradigm, as follows:
Understanding task: The input is a natural language instruction and a spatial token sequence, and the output is a natural language description of the building layout corresponding to the spatial token sequence. The core purpose is to verify the ability of the multimodal large model to interpret existing layout information.
[0114] Generation task: The input is a natural language instruction and an outline token sequence, and the output is a complete spatial token sequence that meets the requirements of the natural language instruction and the constraints of the building outline. This spatial token sequence can be input into the decoder to generate an initial planar layout image.
[0115] Editing task: The input is a natural language command and a spatial token sequence. The output is a spatial token sequence after local adjustment of the target area based on the natural language command. After decoding and post-processing, a planar layout image with precise local modification can be obtained, realizing precise optimization and iteration of the layout.
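The three input-output paradigms can be sketched as records of a single sample type. The class name, field names, and token strings below are hypothetical; the sketch only encodes the rule that an understanding sample has text as its output, while generation and editing samples have a token sequence as their output.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class InstructionSample:
    task: str                      # "understanding", "generation", or "editing"
    instruction: str               # natural language instruction
    input_tokens: List[str]        # spatial or outline token sequence
    target: Union[str, List[str]]  # text for understanding, tokens otherwise


def validate(sample: InstructionSample) -> bool:
    """Check that the target type matches the task's input-output paradigm."""
    if sample.task == "understanding":
        return isinstance(sample.target, str)
    if sample.task in ("generation", "editing"):
        return isinstance(sample.target, list)
    return False


samples = [
    InstructionSample("understanding", "Describe this floor plan.",
                      ["c1", "<living>", "r4"],
                      "There is a living room in the middle of the layout."),
    InstructionSample("generation", "Design a layout with one bedroom.",
                      ["c1"], ["c1", "<bedroom>", "r9"]),
    InstructionSample("editing", "Turn the bathroom into a walk-in closet.",
                      ["c1", "<bathroom>", "r2"], ["c1", "<closet>", "r5"]),
]
```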
[0116] Fine-tuning training with instructions ensures that the model can accurately match the design requirements of different scenarios and stably output processing results that are highly consistent with natural language instructions.
[0117] The following describes the intelligent building image processing device based on natural language instructions provided in the embodiments of this application. The device described below and the intelligent building image processing method based on natural language instructions described above may be referred to in correspondence with each other.
[0118] Figure 10 illustrates a structural diagram of a building image intelligent processing device based on natural language instructions. The device includes: a receiving module 1001, used to receive a natural language instruction and an architectural image input by the user, wherein the architectural image is used to identify the spatial features of the target building.
[0119] The processing module 1002 is used to process architectural images and generate processing results that match natural language instructions.
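The two-module structure can be sketched as two small classes wired together. This is only an illustration of the division of responsibilities: the class names, the `Request` record, and the pluggable `handler` callable standing in for the actual processing logic are all assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class Request:
    instruction: str  # natural language instruction
    image: Any        # architectural image in any representation


class ReceivingModule:
    """Counterpart of receiving module 1001: accepts the user's input."""

    def receive(self, instruction: str, image: Any) -> Request:
        if not instruction.strip():
            raise ValueError("empty natural language instruction")
        return Request(instruction=instruction, image=image)


class ProcessingModule:
    """Counterpart of processing module 1002: produces the matching result."""

    def __init__(self, handler: Callable[[str, Any], Any]) -> None:
        self._handler = handler  # hypothetical stand-in for the real model

    def process(self, request: Request) -> Any:
        return self._handler(request.instruction, request.image)


class Device:
    def __init__(self, handler: Callable[[str, Any], Any]) -> None:
        self.receiving = ReceivingModule()
        self.processing = ProcessingModule(handler)

    def run(self, instruction: str, image: Any) -> Any:
        return self.processing.process(self.receiving.receive(instruction, image))


# Toy handler: echoes the instruction in upper case with the image row count.
device = Device(lambda instruction, image: (instruction.upper(), len(image)))
result = device.run("describe", [[0]])
```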
[0120] Figure 11 is a schematic diagram of the physical structure of an electronic device. As shown in Figure 11, the electronic device may include a processor 1110, a communications interface 1120, a memory 1130, and a communication bus 1140, wherein the processor 1110, the communications interface 1120, and the memory 1130 communicate with each other via the communication bus 1140. The processor 1110 can call logical instructions in the memory 1130 to execute the building image intelligent processing method based on natural language instructions, the method including: receiving a natural language instruction and an architectural image input by the user, wherein the architectural image is used to identify the spatial features of the target building; and processing the architectural image to generate a processing result that matches the natural language instruction.
[0121] Furthermore, the logical instructions in the aforementioned memory 1130 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0122] In another aspect, this application also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the intelligent architectural image processing method based on natural language instructions provided by the above method embodiments, the method including: receiving a natural language instruction and an architectural image input by the user, wherein the architectural image is used to identify the spatial features of the target building; and processing the architectural image to generate a processing result that matches the natural language instruction.
[0123] In a further aspect, this application also provides a non-transitory computer-readable storage medium storing a computer program which, when executed by a processor, performs the intelligent architectural image processing method based on natural language instructions provided by the above method embodiments, the method including: receiving a natural language instruction and an architectural image input by the user, wherein the architectural image is used to identify the spatial features of the target building; and processing the architectural image to generate a processing result that matches the natural language instruction.
[0124] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0125] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0126] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.
Claims
1. A method for intelligent processing of architectural images based on natural language instructions, characterized in that the method is applied to an electronic device and includes: receiving a natural language instruction and an architectural image input by a user, wherein the architectural image is used to identify the spatial features of a target building; and processing the architectural image to generate a processing result that matches the natural language instruction.
2. The method according to claim 1, characterized in that the architectural image includes a planar layout image of the target building, and the natural language instruction is used to instruct interpretation of the planar layout image; and processing the architectural image to generate a processing result that matches the natural language instruction includes: tokenizing the planar layout image to obtain a spatial token sequence; and generating, based on the spatial token sequence, target text that describes the layout information in the planar layout image.
3. The method according to claim 2, characterized in that tokenizing the planar layout image to obtain a spatial token sequence includes: determining, based on the planar layout image, a contour image and a room image of the target building, wherein the contour image is used to identify the contour of the target building and the room image is used to identify a room within the target building; tokenizing the contour image to obtain a contour token sequence, which is used to identify the contour features of the target building; tokenizing the room image based on the contour image to obtain a room token sequence, which is used to identify the geometric features of the room; obtaining a semantic tag token that matches the room token sequence, wherein the semantic tag token is used to identify the functional attributes of the room; and cross-arranging the contour token sequence, the room token sequence, and the semantic tag token to obtain the spatial token sequence.
4. The method according to claim 3, characterized in that tokenizing the contour image to obtain a contour token sequence includes: standardizing the contour image; performing feature extraction on the standardized contour image to obtain multiple first feature vectors corresponding to the contour image; for each first feature vector, determining, from a preset contour codebook, the first codebook vector most similar to the first feature vector, wherein the contour codebook is a pre-constructed set of first codebook vectors; and combining the first codebook vectors corresponding to the first feature vectors to obtain the contour token sequence.
5. The method according to claim 3, characterized in that tokenizing the room image based on the contour image to obtain a room token sequence includes: performing, based on the contour image, feature extraction on the room image to obtain multiple second feature vectors corresponding to the room image; for each second feature vector, determining, from a preset room codebook, the second codebook vector most similar to the second feature vector, wherein the room codebook is a pre-constructed set of second codebook vectors; and combining the second codebook vectors corresponding to the second feature vectors to obtain the room token sequence.
6. The method according to any one of claims 1-5, characterized in that, The building image includes an outline image of the target building, the outline image being used to identify the outline of the target building, and the natural language instruction being used to instruct the generation of a planar layout image that meets the target requirements; The process of processing the building image to generate a processing result that matches the natural language instruction includes: The contour image is tokenized to obtain a contour token sequence, which is used to identify the contour features of the target building. Generate a sequence of semantic tag tokens and room tokens that meet the target requirements. The semantic tag tokens that meet the target requirements are used to identify the functional attributes of the room indicated in the target requirements, and the sequence of room tokens that meet the target requirements are used to identify the geometric features of the room indicated in the target requirements. Based on the outline token sequence, the semantic tag token that meets the target requirements, and the room token sequence that meets the target requirements, a floor plan layout image that meets the target requirements is generated.
7. The method according to any one of claims 1-5, characterized in that the building image includes a planar layout image of the target building, and the natural language instruction is used to instruct the editing of a target area of the planar layout image; and processing the building image to generate a processing result that matches the natural language instruction includes: tokenizing the planar layout image to obtain a spatial token sequence, which includes a contour token sequence, a room token sequence, and a semantic tag token that matches the room token sequence; updating, based on the natural language instruction, the room token sequence and the semantic tag token corresponding to the target area in the spatial token sequence to obtain an updated spatial token sequence; and decoding the updated spatial token sequence to obtain an updated planar layout image.
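As an illustration only, the update step of claim 7 can be sketched as replacing the tokens of a single room while leaving the contour tokens and the other rooms untouched. The data layout, the class and function names, and the assumption that the natural language instruction has already been resolved into a target room index and new tokens are all hypothetical; the claim leaves these details to the model.

```python
from dataclasses import dataclass, replace
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Room:
    tag: str                  # semantic tag token (functional attribute of the room)
    tokens: Tuple[int, ...]   # room token sequence (geometric features of the room)


@dataclass(frozen=True)
class SpatialTokens:
    contour: Tuple[int, ...]  # contour token sequence
    rooms: Tuple[Room, ...]   # one entry per room


def update_room(seq: SpatialTokens, target_index: int,
                new_tag: str, new_tokens: Sequence[int]) -> SpatialTokens:
    """Return a copy of seq in which only the room at target_index is replaced."""
    if not 0 <= target_index < len(seq.rooms):
        raise IndexError("target room does not exist")
    rooms = list(seq.rooms)
    rooms[target_index] = Room(new_tag, tuple(new_tokens))
    return replace(seq, rooms=tuple(rooms))


# Toy example: turn the second room into a study with new geometry tokens.
layout = SpatialTokens(
    contour=(5, 9, 2),
    rooms=(Room("bedroom", (12, 7)), Room("storage", (3, 3))),
)
edited = update_room(layout, 1, "study", [4, 8, 1])
print(edited.rooms[1])  # Room(tag='study', tokens=(4, 8, 1))
```

Keeping the structures immutable means the original spatial token sequence stays available, which can be convenient if an edit has to be rolled back before decoding.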
8. The method according to claim 6 or 7, characterized in that the method further includes: performing structural correction and/or normalization on an initial image to obtain a target image, wherein the initial image is the updated planar layout image or the planar layout image that meets the target requirements; and outputting the target image in a target format.
9. The method according to any one of claims 1-5, characterized in that the electronic device is equipped with a multimodal large model, which is used to process the building image and generate the processing result that matches the natural language instruction.
10. A building image intelligent processing device based on natural language instructions, characterized in that the device includes: a receiving module, used to receive a natural language instruction and a building image input by a user, wherein the building image is used to identify the spatial features of a target building; and a processing module, used to process the building image and generate a processing result that matches the natural language instruction.
11. The device according to claim 10, characterized in that the building image includes a planar layout image of the target building, and the natural language instruction is used to instruct the interpretation of the planar layout image; and the processing module is specifically used for: tokenizing the planar layout image to obtain a spatial token sequence; and generating, based on the spatial token sequence, target text that describes the layout information in the planar layout image.
12. The device according to claim 11, characterized in that the processing module is specifically used for: determining, based on the planar layout image, a contour image and a room image of the target building, wherein the contour image is used to identify the contour of the target building, and the room image is used to identify a room within the target building; tokenizing the contour image to obtain a contour token sequence, which is used to identify the contour features of the target building; tokenizing, based on the contour image, the room image to obtain a room token sequence, which is used to identify the geometric features of the room; obtaining a semantic tag token that matches the room token sequence, wherein the semantic tag token is used to identify the functional attributes of the room; and interleaving the contour token sequence, the room token sequence, and the semantic tag token to obtain the spatial token sequence.
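As an illustration only, one possible way of arranging the three kinds of tokens of claim 12 into a single spatial token sequence is sketched below. The claim does not fix the ordering; this sketch assumes the contour tokens come first and each room then contributes its semantic tag token followed by its room tokens. The function name and the `(kind, value)` tuple representation are hypothetical.

```python
from typing import List, Sequence, Tuple, Union

Token = Tuple[str, Union[int, str]]  # (kind, value), kind in {"contour", "tag", "room"}


def build_spatial_sequence(
    contour_tokens: Sequence[int],
    rooms: Sequence[Tuple[str, Sequence[int]]],
) -> List[Token]:
    """Arrange contour, semantic tag, and room tokens into one flat sequence.

    Assumed order: all contour tokens, then for each room its tag token
    followed by that room's tokens.
    """
    sequence: List[Token] = [("contour", t) for t in contour_tokens]
    for tag, room_tokens in rooms:
        sequence.append(("tag", tag))
        sequence.extend(("room", t) for t in room_tokens)
    return sequence


# Toy example with two rooms (hypothetical token values).
spatial = build_spatial_sequence(
    [5, 9],
    [("bedroom", [12, 7]), ("kitchen", [3])],
)
print(spatial)
```

Placing each tag directly before its room tokens keeps every room's function and geometry adjacent, so a downstream text generator or editor can locate a room by scanning for its tag.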