Method for generating ancient furniture decoration patterns based on multi-modal large model
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING UNIV OF CHEM TECH
- Filing Date
- 2025-10-16
- Publication Date
- 2026-06-23
Abstract
Description
Technical Field
[0001] This invention relates to the fields of computer vision and artificial intelligence generation, and in particular to a method for generating decorative patterns for antique furniture based on a multimodal large model.
Background Technology
[0002] Decorative patterns on antique furniture are an important part of traditional Chinese furniture culture. Widely used in components such as tables, chair backs, screens, and beds, they possess rigorous compositional structures and rich cultural connotations, embodying auspicious concepts such as "fortune, prosperity, longevity, and happiness." Traditional pattern design primarily relies on hand-drawing, the inheritance of craftsmanship experience, and the collage of pattern elements. The design process is time-consuming and labor-intensive, and it is difficult to balance the continuity of historical culture with modern aesthetic demands, making it challenging to meet the rapid development requirements of digital cultural and creative industries and customized furniture design.
[0003] With the development of AIGC (Artificial Intelligence Generated Content) technology, text-driven image generation models have been widely used in various fields such as graphics, game design, and fashion design. Diffusion models, as one of the current mainstream image generation architectures, have significant advantages in image clarity, style transfer capabilities, and diverse outputs. Among them, the Stable Diffusion model has become a representative technology in text-image generation tasks due to its strong generation capabilities, open structure, and good adaptability.
[0004] Currently, some technologies attempt to introduce generative large language models into design assistance and text generation tasks, but they lack deep integration with domain knowledge. The generation process relies more on natural language prompts than on structured semantics, and the construction of prompts still suffers from strong subjectivity and poor controllability.
Summary of the Invention
[0005] This invention provides a method for generating decorative patterns for ancient furniture based on a multimodal large model. By constructing the relationship between pattern structure and cultural semantics, it enhances the semantic controllability and cultural expression ability of traditional pattern generation. It can be widely applied to scenarios such as digital cultural and creative design, cultural heritage reproduction, and intelligent generation of decorative patterns for Chinese furniture.
[0006] In a first aspect, embodiments of the present invention provide a method for generating decorative patterns for antique furniture based on a multimodal large model, including:
[0007] Obtain multiple examples of decorative patterns on antique furniture, as well as semantic tags for each pattern example. The semantic tags include compositional elements, symbolic meanings, and decorative parts.
[0008] The multimodal large model is used to analyze each pattern example and generate descriptive text for each pattern example. The descriptive text includes the composition structure, central element, and auxiliary elements of the pattern example.
[0009] The multimodal large model is fine-tuned using the semantic tags of each pattern example, so that the fine-tuned large model can output the descriptive text of each pattern example.
[0010] The natural language requirements for generating antique furniture decorative patterns are processed using a fine-tuned multimodal large model to obtain prompts for generating antique furniture decorative patterns; the prompts are then input into the fine-tuned image generation model to obtain antique furniture decorative patterns that meet the natural language requirements.
[0011] In a second aspect, embodiments of the present invention provide an electronic device, the electronic device comprising:
[0012] One or more processors;
[0013] Memory, used to store one or more programs.
[0014] When the one or more programs are executed by the one or more processors, the one or more processors implement the method for generating antique furniture decorative patterns based on a multimodal large model as described in any embodiment.
[0015] In a third aspect, embodiments of the present invention provide a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the method for generating decorative patterns for antique furniture based on a multimodal large model as described in any embodiment.
[0016] In summary, this invention provides a method for generating decorative patterns for antique furniture based on a multimodal large model. It collects information such as pattern elements and cultural connotations from antique furniture decorative patterns and uses the multimodal large model LLaVA to generate image description text, constructing a structured semantic dataset. Then, the multimodal large model LLaVA is fine-tuned using LoRA to automatically generate image generation prompts based on the target design intent. Finally, the prompts are input into a fine-tuned diffusion model to automatically generate antique furniture decorative patterns with traditional Chinese style and cultural symbolism. This method, through the fusion of structured semantic knowledge and generative language models, is suitable for intelligent generation, cultural display, and design assistance scenarios for classical furniture decorative patterns.
[0017] This method addresses the problems in traditional pattern design, such as reliance on human experience in pattern construction, difficulty in expressing cultural connotations, and lack of semantic control in image generation. It proposes a pattern design method that integrates semantic knowledge modeling and text-driven image generation, making up for the shortcomings of existing generation models in terms of cultural understanding and visual consistency, and effectively improving the intelligence level, cultural expression consistency, and design efficiency of the pattern generation process.
[0018] Specifically, this embodiment is able to:
[0019] First, by introducing a multimodal large model to generate image description text, the visual details of the image can be automatically analyzed and expressed in language. Compared with simply relying on manual annotation, it can more comprehensively and accurately depict image information. Furthermore, the multimodal large model can be fine-tuned to enable it to more accurately understand the relationship between the image's composition structure, pattern elements, and cultural connotations, and to construct semantically clear and structurally reasonable prompts for ancient furniture decorative patterns, thereby improving the controllability and expressive accuracy of the prompts.
[0020] Second, the Stable Diffusion model and LoRA fine-tuning strategy are adopted to ensure that the generated patterns have a unified style and clear details, meeting the visual requirements of classical furniture.
[0021] Third, it realizes an intelligent closed loop for the entire process of "dataset construction - prompts - image generation", improving the design efficiency, semantic consistency, and cultural dissemination value of traditional patterns.
[0022] Fourth, it is applicable to scenarios such as pattern design for Chinese classical furniture and development of digital cultural and creative products, and has broad practical prospects and promotional value.
Attached Figure Description
[0023] To more clearly illustrate the specific embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0024] Figure 1 is a flowchart of a method for generating decorative patterns for antique furniture based on a multimodal large model, provided by an embodiment of the present invention;
[0025] Figure 2 is a schematic diagram of decorative pattern images of antique furniture and the descriptive text generated for them by the multimodal large model, according to an embodiment of the present invention;
[0026] Figure 3 shows an antique furniture pattern generated by the fine-tuned Stable Diffusion model, according to an embodiment of the present invention;
[0027] Figure 4 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.
Detailed Implementation
[0028] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below. Obviously, the described embodiments are only a part of the embodiments of this invention, and not all of them. Based on the embodiments of this invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this invention.
[0029] In the description of this invention, it should be noted that the terms "center," "upper," "lower," "left," "right," "vertical," "horizontal," "inner," and "outer," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are used only for the convenience of describing the invention and for simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on the invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.
[0030] In the description of this invention, it should also be noted that, unless otherwise explicitly specified and limited, the terms "installation," "connection," and "linking" should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral connection; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; and they can refer to the internal connection of two components. Those skilled in the art can understand the specific meaning of the above terms in this invention based on the specific circumstances.
[0031] Figure 1 is a flowchart of a method for generating decorative patterns for antique furniture based on a multimodal large model, provided by an embodiment of the present invention. A semantic dataset of antique furniture patterns is constructed, and a multimodal large model is used to generate pattern descriptions. The multimodal large model is then fine-tuned using LoRA (Low-Rank Adaptation) so that it more accurately learns the correspondence between pattern elements, cultural connotations, and compositional structures and can automatically generate high-quality image prompts. These prompts are input into a LoRA-tuned diffusion model, which outputs pattern images with traditional Chinese style and cultural symbolism, yielding classical patterns that combine stylistic consistency with semantic accuracy. The method is executed by an electronic device. As shown in Figure 1, the specific steps are as follows:
[0032] S110. Construct a dataset of decorative patterns for ancient furniture.
[0033] This embodiment first collects and organizes information such as pattern elements and cultural connotations of antique furniture decorative patterns. At the same time, it uses a multimodal large model to automatically analyze the collected pattern images and generate semantic description text corresponding to the images, thus forming an antique furniture decorative pattern dataset.
[0034] In one specific implementation, examples of decorative patterns on antique furniture (i.e., example images) are first collected from publicly available resources, including typical style patterns from the Ming and Qing Dynasties, and a pattern index table is established. Then, for each image $I_i$, the compositional elements, symbolic meanings, and decorative parts are manually labeled to generate the corresponding semantic tags $T_i = \{t_i^{e}, t_i^{s}, t_i^{p}\}$, where $t_i^{e}$, $t_i^{s}$, and $t_i^{p}$ represent the compositional-element, symbolic-meaning, and decorative-part tag information of image $I_i$, respectively.
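One row of the pattern index table could be represented as sketched below. This is only an illustration: the class name `PatternRecord`, its field names, the file path, and the sample tag values are assumptions made for the example and do not come from the embodiment.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PatternRecord:
    """One row of a hypothetical pattern index table."""
    image_path: str                  # location of the example image
    elements: List[str]              # compositional elements
    symbolic_meanings: List[str]     # symbolic meanings
    decorative_part: str             # furniture part that carries the pattern
    description: str = ""            # filled in later by the multimodal model

    def tag_text(self) -> str:
        """Flatten the manual tags into one string usable as model input."""
        return (
            f"elements: {', '.join(self.elements)}; "
            f"meanings: {', '.join(self.symbolic_meanings)}; "
            f"part: {self.decorative_part}"
        )


record = PatternRecord(
    image_path="patterns/0001.png",  # placeholder path
    elements=["bat", "cloud"],       # made-up sample tags
    symbolic_meanings=["good fortune"],
    decorative_part="chair back",
)
print(record.tag_text())
```

Keeping the three tag categories as separate fields makes it straightforward to serialize them later in whatever textual form the fine-tuning step expects.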
[0035] Simultaneously, each image $I_i$ is analyzed using the multimodal large model LLaVA (Large Language and Vision Assistant) to generate the descriptive text $D_i$ corresponding to the image. The descriptive text can include information such as the compositional structure, central element, and auxiliary elements. The manually annotated information, combined with the descriptive text generated by the large model, forms a complete semantic database. The patterns and the generated descriptive text are shown in Figure 2.
[0036] Specifically, the LLaVA structure includes CLIP and Vicuna. CLIP extracts features from the input image $I_i$ to obtain the visual feature vector $Z_v = g(I_i)$, where $g(\cdot)$ represents the feature extraction function implemented by CLIP, used to map the input image to a high-dimensional semantic space. To match the dimensionality of the word embedding space of the language model, the visual feature vector is projected through a trainable linear projection matrix $W$ and mapped to visual tokens $H_v = W Z_v$. The visual tokens preserve the semantic information of the image and can be used directly as input to the language model. The projected visual tokens $H_v$ are concatenated with the text embedding $H_q = E(X_q)$ of the text command $X_q$, where $E(\cdot)$ is the embedding function consisting of a word embedding layer and a positional encoding layer, forming the multimodal input sequence $[H_v; H_q]$. This sequence is then fed into Vicuna, which generates the corresponding descriptive text $D_i$.
[0037] In this embodiment, any pattern sample is input into CLIP to obtain the visual feature vector of the pattern sample; simultaneously, a text command is input into LLaVA, instructing LLaVA to describe the pattern sample; LLaVA internally concatenates the visual feature vector with the text embedding of the text command and inputs it into Vicuna, which generates the descriptive text for the pattern sample. Optionally, the text command can specify "Please describe this image for me, including the image's composition, central element, and auxiliary elements".
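The projection and concatenation that LLaVA performs internally can be illustrated with a toy NumPy sketch. The array sizes are arbitrary, and the random arrays merely stand in for the CLIP features, the projection matrix, and the text embeddings; none of these values reflect the real model.

```python
import numpy as np

rng = np.random.default_rng(0)

num_patches, d_vision, d_model = 4, 8, 16   # toy sizes, not real model sizes
num_text_tokens = 5

# Stand-in for the CLIP output: one feature vector per image patch.
visual_features = rng.normal(size=(num_patches, d_vision))

# Trainable linear projection into the language model's embedding space.
projection = rng.normal(size=(d_vision, d_model))
visual_tokens = visual_features @ projection          # (num_patches, d_model)

# Stand-in for the embedded text command.
text_embeddings = rng.normal(size=(num_text_tokens, d_model))

# Concatenate along the sequence axis to form the multimodal input.
multimodal_input = np.concatenate([visual_tokens, text_embeddings], axis=0)
print(multimodal_input.shape)   # (9, 16)
```

After projection, both modalities share the same embedding width, which is what allows them to be placed in a single sequence for the language model.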
[0038] S120. Fine-tune the multimodal large model using the semantic tags of each pattern example, so that the fine-tuned large model can output the descriptive text of each pattern example.
[0039] This step builds on the constructed dataset of antique furniture decorative patterns. The multimodal large model undergoes lightweight fine-tuning so that it learns the correspondence between pattern elements, cultural connotations, and compositional structure. As a result, when a target design intent or semantic context is input, the model can automatically generate image generation prompts containing the target style, pattern elements, and cultural connotations.
[0040] In one specific implementation, for each antique furniture pattern image $I_i$, the pattern semantic tags $T_i$ and the descriptive text $D_i$ are first collected to construct training corpus pairs:
[0041] $$\mathcal{C} = \{(T_i, D_i)\}_{i=1}^{N}$$
[0042] where $T_i$ is used as the input for Vicuna's training, $D_i$ is the target output descriptive text, and $N$ is the number of pattern examples.
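The construction of training corpus pairs can be sketched as follows. The dictionary keys (`tags`, `description`, `input`, `target`) and the sample content are illustrative assumptions; the embodiment does not prescribe a storage format.

```python
import json
from typing import Dict, List


def build_corpus(samples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Turn (tags, description) samples into input/target training pairs."""
    pairs = []
    for sample in samples:
        pairs.append({
            "input": sample["tags"],            # semantic tags fed to the model
            "target": sample["description"],    # descriptive text to be produced
        })
    return pairs


# Made-up sample for illustration only.
samples = [
    {"tags": "elements: bat, cloud; part: chair back",
     "description": "A symmetrical composition with a bat at the center, "
                    "surrounded by cloud scrolls."},
]
corpus = build_corpus(samples)
print(json.dumps(corpus[0], indent=2))
```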
[0043] Then, LoRA is used to fine-tune the LLaVA model. The weight matrix of a pre-trained self-attention layer is denoted as $W_0 \in \mathbb{R}^{d_{out} \times d_{in}}$, where $d_{in}$ and $d_{out}$ represent the input and output dimensions of the self-attention layer, respectively. For each self-attention layer, a parallel network structure is introduced, containing two modules whose weight matrices are denoted as $A$ and $B$. Using low-rank matrix decomposition, the weight matrix of this parallel network structure is $\Delta W = BA$, where $\Delta W$ has the same size as $W_0$, $B \in \mathbb{R}^{d_{out} \times r}$, $A \in \mathbb{R}^{r \times d_{in}}$, and $r \ll \min(d_{in}, d_{out})$. During training, $W_0$ is frozen, and $A$ and $B$ contain the trainable parameters. For the input $x$ of the self-attention layer, the output $h$ is:
[0044] $$h = W_0 x + \Delta W x = W_0 x + BAx$$
[0045] where the weight matrix $A$ is initialized using a random Gaussian distribution and the weight matrix $B$ is initialized to 0; therefore, when training begins, $\Delta W = BA$ is a zero matrix.
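The effect of this initialization can be checked with a small NumPy sketch of a LoRA-adapted layer. The dimensions, the rank, and the Gaussian scale are toy values chosen for illustration, and the variable names (`W0`, `A`, `B`) are assumed labels for the frozen weight and the two low-rank modules.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 16, 16, 4                    # toy sizes; rank << min(d_in, d_out)

W0 = rng.normal(size=(d_out, d_in))              # frozen pre-trained weight
A = rng.normal(scale=0.01, size=(rank, d_in))    # Gaussian initialization
B = np.zeros((d_out, rank))                      # zero initialization


def lora_forward(x: np.ndarray) -> np.ndarray:
    """Frozen path plus low-rank parallel path."""
    return W0 @ x + B @ (A @ x)


x = rng.normal(size=d_in)
# At the start of training B is zero, so the adapted layer matches the original.
print(np.allclose(lora_forward(x), W0 @ x))      # True
```

Starting from a zero update means fine-tuning begins exactly at the pre-trained model's behavior, and only the small matrices `A` and `B` need gradients.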
[0046] Of course, fine-tuning can also be achieved without using the LoRA strategy, and this embodiment does not impose specific limitations.
S130. Generate image prompts based on the fine-tuned multimodal large model and the user's natural language requirements.
[0047] This step uses the fine-tuned multimodal large model to generate structured image prompts based on the semantic context; these prompts drive the image generation model to generate high-quality images.
[0048] Optionally, after fine-tuning, given the user's target design intent or set of cultural implications $S$, the fine-tuned LLaVA generates the prompt through the following mapping:
[0049] $$P = f_{\theta_0 + \Delta\theta}(S)$$
[0050] where $\theta_0$ denotes the original model parameters, $\Delta\theta$ represents the low-rank adaptation parameters learned through LoRA, and $P$ is the generated prompt. The image prompt is input into the fine-tuned Stable Diffusion image generation model to generate images of antique furniture decorative patterns that conform to the specified pattern elements or cultural connotations.
[0051] S140. Input the prompt into the fine-tuned image generation model to obtain an antique furniture decorative pattern that meets the natural language requirements.
[0052] Optionally, the image generation model used in this embodiment is the Stable Diffusion model, and a lightweight fine-tuning method using the LoRA training strategy is employed to train the model for style adaptation using labeled pattern data.
[0053] Specifically, the Stable Diffusion model consists of three parts: a VAE (Variational Autoencoder) composed of an image encoder and an image decoder, a CLIP text encoder, and a U-Net denoising network. During image generation, the model first encodes the input text information $y$ (descriptive text in the model fine-tuning phase, prompts in the model usage phase) into a semantic condition vector $c$ using the text encoder; $c$ is used to guide image generation.
[0054] During the fine-tuning phase, the model first compresses a real image $x$ into the latent image representation $z_0$ using the VAE's image encoder. Gaussian noise is then progressively added to $z_0$ to obtain latent representations $z_t$ under different noise intensities. The noisy latent representation $z_t$ and the semantic condition vector $c$ corresponding to the text annotation are input into the U-Net denoising network, which learns the denoising process to restore a clean latent image representation.
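The progressive noising of the latent representation can be sketched in NumPy. The embodiment does not state the noise schedule, so the sketch assumes the closed-form noising with a linear beta schedule that is common for diffusion models; the step count, the schedule endpoints, and the latent shape are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # assumed linear schedule
alpha_bars = np.cumprod(1.0 - betas)        # cumulative signal retention


def add_noise(z0: np.ndarray, t: int):
    """Sample a noisy latent at step t in closed form; returns (z_t, noise)."""
    noise = rng.normal(size=z0.shape)
    z_t = np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return z_t, noise


z0 = rng.normal(size=(4, 64, 64))           # stand-in for a VAE latent
z_early, _ = add_noise(z0, t=10)
z_late, _ = add_noise(z0, t=999)
# The latent stays close to z0 early on and is almost pure noise at the end.
corr_early = np.corrcoef(z0.ravel(), z_early.ravel())[0, 1]
corr_late = np.corrcoef(z0.ravel(), z_late.ravel())[0, 1]
print(corr_early > corr_late)               # True
```

In common practice the denoising network is trained to predict the `noise` term returned here, given the noisy latent, the step index, and the text condition.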
[0055] During the generation phase (i.e., the model usage phase), the model samples an initial noise vector $z_T$ from a standard Gaussian distribution and, guided by the semantic condition vector $c$ corresponding to the prompt, generates a latent representation of the target image through a multi-step iterative inverse diffusion process. The entire denoising process can be simplified to the following form:
[0056] $$z_0 = \epsilon_\theta(z_T, c), \quad z_T \sim \mathcal{N}(0, I)$$
[0057] where $\epsilon_\theta$ denotes the U-Net denoising network with parameters $\theta$, $z_0$ is the final generated latent image representation, $z_T$ denotes the initial Gaussian noise, and $\mathcal{N}(0, I)$ represents the standard normal distribution. Finally, the VAE's image decoder reconstructs the final high-quality image from this latent representation.
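The multi-step inverse diffusion loop can be illustrated with a toy example. The embodiment does not name a sampler, so the sketch assumes a deterministic DDIM-style update and a linear beta schedule. The function `toy_denoiser` is a hypothetical stand-in for the conditioned U-Net: it "knows" a fixed target latent, which makes the loop's behavior easy to verify.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
alpha_bars = np.cumprod(1.0 - np.linspace(1e-4, 0.02, T))   # assumed schedule

target = rng.normal(size=(4, 8, 8))         # latent the toy "model" aims for


def toy_denoiser(z_t: np.ndarray, t: int, cond) -> np.ndarray:
    """Hypothetical stand-in for the conditioned U-Net: returns a noise estimate."""
    return (z_t - np.sqrt(alpha_bars[t]) * target) / np.sqrt(1.0 - alpha_bars[t])


def sample(cond, shape):
    z = rng.normal(size=shape)              # initial noise from N(0, I)
    for t in range(T - 1, -1, -1):
        eps = toy_denoiser(z, t, cond)
        # Estimate the clean latent implied by the current noise prediction.
        z0_pred = (z - np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alpha_bars[t])
        ab_prev = alpha_bars[t - 1] if t > 0 else 1.0
        # Deterministic (DDIM-style) update toward the previous noise level.
        z = np.sqrt(ab_prev) * z0_pred + np.sqrt(1.0 - ab_prev) * eps
    return z


z0 = sample(cond=None, shape=target.shape)
print(np.allclose(z0, target))              # True
```

In a real pipeline the denoiser would be the fine-tuned U-Net conditioned on the prompt embedding, and the resulting latent would be passed to the VAE decoder.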
[0058] When fine-tuning the Stable Diffusion model using LoRA, the weight matrix of a pre-trained cross-attention layer can be denoted as $W_0 \in \mathbb{R}^{d_{out} \times d_{in}}$, where $d_{in}$ and $d_{out}$ represent the input and output dimensions of the cross-attention layer, respectively. The cross-attention layer belongs to the U-Net denoising network and uses the text vector as conditional information; through the cross-attention mechanism, it guides the U-Net to predict the noise component in the current image at each step. For each cross-attention layer, a parallel network structure is introduced, containing two modules whose weight matrices are denoted as $A$ and $B$. Using low-rank matrix decomposition, the weight matrix of this parallel network structure is $\Delta W = BA$, where $\Delta W$ has the same size as $W_0$, $B \in \mathbb{R}^{d_{out} \times r}$, $A \in \mathbb{R}^{r \times d_{in}}$, and $r \ll \min(d_{in}, d_{out})$. During training, $W_0$ is frozen, and $A$ and $B$ contain the trainable parameters. For the input $x$ of the cross-attention layer, the output $h$ is:
[0059] $$h = W_0 x + \Delta W x = W_0 x + BAx$$
[0060] where the weight matrix $A$ is initialized using a random Gaussian distribution and the weight matrix $B$ is initialized to 0; therefore, when training begins, $\Delta W = BA$ is a zero matrix. After fine-tuning, a separate low-rank parameter module (the LoRA weights) is obtained, which is used in conjunction with the original Stable Diffusion model.
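Using the separate low-rank module together with the original model is numerically equivalent to folding it into the base weight, as the toy NumPy sketch below shows. The sizes are arbitrary and the random matrices stand in for trained values; the names `W0`, `A`, and `B` are assumed labels.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 64, 4               # toy sizes

W0 = rng.normal(size=(d_out, d_in))         # frozen base weight
A = rng.normal(size=(rank, d_in))           # trained low-rank factors (random here)
B = rng.normal(size=(d_out, rank))

x = rng.normal(size=d_in)
separate = W0 @ x + B @ (A @ x)             # base model plus separate LoRA module
merged = (W0 + B @ A) @ x                   # LoRA folded into the base weight
print(np.allclose(separate, merged))        # True

# The adapter stores far fewer values than the full matrix.
print(A.size + B.size, W0.size)             # 512 4096
```

Keeping the adapter separate leaves the base model untouched, so several adapters (for example, from different training stages) can be swapped against the same base weights.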
[0061] The fine-tuning process used 100 images of antique furniture patterns, cropped them to a uniform size of 512×512 pixels, and used descriptive text generated by LLaVA as the label for each image.
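The embodiment does not say how the images are brought to 512×512. One plausible approach, shown purely as an assumption, is to take the largest centered square and then resize it. The helper `center_crop_box` and the example image size are hypothetical.

```python
from typing import Tuple


def center_crop_box(width: int, height: int) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


TARGET = 512                                 # output size from the embodiment

box = center_crop_box(1200, 800)             # made-up source image size
print(box)                                   # (200, 0, 1000, 800)
scale = TARGET / (box[2] - box[0])           # resize factor after cropping
print(scale)                                 # 0.64
```

The returned box uses the (left, top, right, bottom) convention that common image libraries accept for cropping.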
[0062] Because Stable Diffusion is relatively complex and flexible and requires many parameters to be adjusted, the XL model was selected as the Stable Diffusion version after repeated experiments, and satisfactory results were obtained with 10 training epochs. Overfitting or underfitting may occur during training; therefore, LoRA models from different training stages are saved (every two epochs), allowing users to select the most suitable fine-tuned model.
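The checkpointing schedule described above can be sketched as a simple loop. The epoch count and the save interval come from the embodiment; `train_one_epoch` is a hypothetical placeholder that returns a made-up value rather than a real loss.

```python
NUM_EPOCHS = 10      # from the embodiment
SAVE_EVERY = 2       # from the embodiment


def train_one_epoch(epoch: int) -> float:
    """Hypothetical placeholder; returns a made-up loss value."""
    return 1.0 / epoch


checkpoints = {}
for epoch in range(1, NUM_EPOCHS + 1):
    loss = train_one_epoch(epoch)
    if epoch % SAVE_EVERY == 0:
        # A real run would write the LoRA weights to disk here.
        checkpoints[epoch] = {"loss": loss}

print(sorted(checkpoints))   # [2, 4, 6, 8, 10]
```

Saving intermediate adapters lets a user compare samples from each stage and pick the one that neither underfits nor overfits the pattern style.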
[0063] By inputting the aforementioned prompts into the fine-tuned image generation model, antique furniture decorative patterns that conform to the specified central theme, style, composition, and elements can be obtained. The final output image has a resolution of no less than 512×512, with clear composition, delicate texture, and consistent style, as shown in Figure 3.
[0064] It is worth mentioning that the fine-tuned image generation model in the above embodiments can be reused in each subsequent image generation to quickly generate the antique furniture decorative patterns required by the user.
[0065] Additionally, it should be noted that, in order to illustrate the principle of LoRA fine-tuning, some of the same variable symbols were used in the two LoRA fine-tunings mentioned above. However, the data referred to by the same variable symbol in the two LoRA fine-tunings may not be the same, and should be treated differently depending on the specific data.
[0066] In summary, this embodiment provides a method for generating decorative patterns for antique furniture based on a multimodal large model. It collects information such as pattern elements and cultural connotations from antique furniture decorative patterns and uses the multimodal large model LLaVA to generate image description text, constructing a structured semantic dataset. Then, the multimodal large model LLaVA is fine-tuned using LoRA to automatically generate image generation prompts based on the target design intent. Finally, the prompts are input into a fine-tuned diffusion model to automatically generate antique furniture decorative patterns with traditional Chinese style and cultural symbolism. This method, through the fusion of structured semantic knowledge and generative language models, is suitable for intelligent generation, cultural display, and design assistance scenarios for classical furniture decorative patterns.
[0073] Figure 4 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention. As shown in Figure 4, the device includes a processor 60, a memory 61, an input device 62, and an output device 63. The number of processors 60 in the device can be one or more; Figure 4 takes one processor 60 as an example. The processor 60, memory 61, input device 62, and output device 63 in the device can be connected via a bus or other means; Figure 4 takes connection via a bus as an example.
[0074] The memory 61, as a computer-readable storage medium, can be used to store software programs, computer-executable programs, and modules, such as the program instructions / modules corresponding to the method for generating antique furniture decorative patterns based on a multimodal large model in this embodiment of the invention. The processor 60 executes various functional applications and data processing of the device by running the software programs, instructions, and modules stored in the memory 61, thereby realizing the aforementioned method for generating antique furniture decorative patterns based on a multimodal large model.
[0075] The memory 61 may primarily include a program storage area and a data storage area. The program storage area may store the operating system and at least one application program required for a given function; the data storage area may store data created based on terminal usage. Furthermore, the memory 61 may include high-speed random access memory and non-volatile memory, such as at least one disk storage device, flash memory, or other non-volatile solid-state storage device. In some instances, the memory 61 may further include memory remotely located relative to the processor 60, which can be connected to the device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
[0076] Input device 62 can be used to receive input digital or character information, and to generate key signal inputs related to user settings and function control of the device. Output device 63 may include display devices such as a display screen.
[0077] This invention also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the method for generating decorative patterns for antique furniture based on a multimodal large model, as described in any embodiment.
[0078] The computer storage medium of this invention can be any combination of one or more computer-readable media. A computer-readable medium can be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media (a non-exhaustive list) include: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination thereof. In this document, a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
[0079] Computer-readable signal media may include data signals propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such propagated data signals may take various forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination thereof. Computer-readable signal media may also be any computer-readable medium other than computer-readable storage media, capable of sending, propagating, or transmitting programs for use by or in connection with an instruction execution system, apparatus, or device.
[0080] Program code contained on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wire, optical fiber, RF, etc., or any suitable combination thereof.
[0081] Computer program code for performing the operations of this invention can be written in one or more programming languages or a combination thereof. Programming languages include object-oriented programming languages—such as Java, Smalltalk, and C++—as well as conventional procedural programming languages—such as C or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as a standalone software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In cases involving remote computers, the remote computer can be connected to the user's computer via any type of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (e.g., via the Internet using an Internet service provider).
[0082] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the technical solutions of the embodiments of the present invention.
Claims
1. A method for generating decorative patterns for antique furniture based on a multimodal large model, characterized in that it includes: obtaining multiple examples of decorative patterns on antique furniture, as well as semantic tags for each pattern example, the semantic tags including compositional elements, symbolic meanings, and decorative parts; using a multimodal large model to analyze each pattern example and generate descriptive text for each pattern example, the descriptive text including the compositional structure, central element, and auxiliary elements of the pattern example, wherein the multimodal large model is LLaVA, which includes CLIP and Vicuna, and this step specifically includes: inputting any pattern example into CLIP to obtain the visual feature vector of the pattern example; inputting a text command into LLaVA to instruct LLaVA to describe the pattern example; and concatenating the visual feature vector with the text embedding of the text command and inputting the result into Vicuna, which generates the descriptive text for the pattern example; embedding the semantic tags of each pattern example and inputting them into Vicuna for fine-tuning, so that the fine-tuned Vicuna can output the descriptive text of each pattern example, wherein the self-attention layers of Vicuna are fine-tuned through LoRA during the fine-tuning process; using the fine-tuned multimodal large model to process the natural language requirements for generating antique furniture decorative patterns to obtain prompts for generating the patterns, the natural language requirements representing the user's target design intent or a set of cultural implications; and inputting the prompts into the fine-tuned image generation model to obtain antique furniture decorative patterns that meet the natural language requirements.
2. The method according to claim 1, characterized in that, before inputting the prompt into the fine-tuned image generation model to obtain the antique furniture decorative pattern that meets the natural language requirement, the method further comprises: obtaining multiple pattern examples of antique furniture decorative patterns, together with descriptive text for each pattern example, the descriptive text including the compositional structure, central element, and auxiliary elements; and using each pattern example and its descriptive text as input to the image encoder and the text encoder of a Stable Diffusion model, respectively, and fine-tuning the Stable Diffusion model using LoRA.
3. The method according to claim 2, characterized in that the Stable Diffusion model includes a VAE consisting of an image encoder and an image decoder, a CLIP text encoder, and a U-Net denoising network; accordingly, the step of using each pattern example and its descriptive text as input to the image encoder and the text encoder of the Stable Diffusion model, respectively, and fine-tuning the Stable Diffusion model using LoRA includes: compressing, through the image encoder, any pattern example into an image representation in the latent space; encoding, through the text encoder, the descriptive text into a semantic condition vector; gradually adding Gaussian noise to the latent image representation to obtain latent representations at different noise intensities; and inputting the noised latent representations and the semantic condition vector into the U-Net denoising network to learn the denoising process, thereby restoring a clean latent image representation.
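As an illustration only, the noising step recited in claim 3 can be sketched as follows. The sketch assumes a DDPM-style closed form with a linear noise schedule and a noise-prediction training loss; the claim names none of these, so the schedule, the loss, and the names z0, zt, eps, and alpha_bars are assumptions. A short list of floats stands in for the latent produced by the image encoder.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def add_noise(z0, t, alpha_bars, rng):
    """Return (zt, eps): the latent noised to step t and the noise used."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    zt = [signal * z + noise * e for z, e in zip(z0, eps)]
    return zt, eps


def noise_prediction_loss(predicted_eps, eps):
    """Mean squared error between the predicted and the true noise."""
    return sum((p - e) ** 2 for p, e in zip(predicted_eps, eps)) / len(eps)


rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
z0 = [0.5, -1.0, 0.25, 2.0]          # stand-in for an encoded pattern example
zt, eps = add_noise(z0, 500, alpha_bars, rng)
```

In a training loop of this kind, the denoising network would receive zt, the step index, and the semantic condition vector, and its output would be compared against eps with noise_prediction_loss.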
4. The method according to claim 3, characterized in that the fine-tuning of the Stable Diffusion model using LoRA includes: introducing a parallel network structure for the pre-trained cross-attention layer in the U-Net denoising network, the weight matrix of the parallel network structure being decomposed into two low-rank matrices to be trained, wherein the input of the cross-attention layer is processed by the cross-attention layer and the parallel network structure simultaneously, and the two results are fused as the final output of the cross-attention layer; and using each pattern example and its descriptive text as input to the image encoder and the text encoder, respectively, training the parameters of each low-rank matrix, and using the trained low-rank matrices together with the original Stable Diffusion model.
5. The method according to claim 4, characterized in that, of the two low-rank matrices to be trained, the initial weights of one are drawn from a random Gaussian distribution, while the initial weights of the other are set to 0.
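As an illustration only, the parallel low-rank structure of claim 4 and the initialization of claim 5 can be sketched on a single linear projection. The sketch makes several assumptions the claims do not state: the two results are fused by addition, the down-projection matrix receives the Gaussian initialization while the up-projection matrix starts at zero, and a scale factor weights the low-rank branch. The names LoRALinear, down, up, and scale are hypothetical.

```python
import random


def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


class LoRALinear:
    """A frozen weight matrix with a trainable low-rank branch in parallel."""

    def __init__(self, frozen_weight, rank, scale=1.0, seed=0):
        rng = random.Random(seed)
        d_in, d_out = len(frozen_weight), len(frozen_weight[0])
        self.frozen_weight = frozen_weight  # d_in x d_out, never updated
        # Assumed choice: Gaussian init for the down-projection (d_in x rank).
        self.down = [[rng.gauss(0.0, 0.01) for _ in range(rank)]
                     for _ in range(d_in)]
        # Assumed choice: zero init for the up-projection (rank x d_out).
        self.up = [[0.0] * d_out for _ in range(rank)]
        self.scale = scale

    def forward(self, x):
        """Fuse both branches by addition (an assumption of this sketch)."""
        base = matmul(x, self.frozen_weight)
        delta = matmul(matmul(x, self.down), self.up)
        return [[b + self.scale * d for b, d in zip(base_row, delta_row)]
                for base_row, delta_row in zip(base, delta)]


weight = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]   # toy 3 x 2 pre-trained weight
layer = LoRALinear(weight, rank=2)
x = [[1.0, 0.0, -1.0]]
initial_output = layer.forward(x)
```

Because one of the two matrices starts at zero, the low-rank branch contributes nothing before training, so the layer initially reproduces the pre-trained output exactly; only the two small matrices would then be updated.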
6. The method according to claim 3, characterized in that the step of inputting the prompt into the fine-tuned image generation model to obtain an antique furniture decorative pattern that meets the natural language requirement includes: encoding, through the text encoder, the prompt into a semantic condition vector; sampling an initial noise vector from a standard Gaussian distribution; inputting the initial noise vector and the semantic condition vector into the U-Net denoising network and, guided by the semantic condition vector, generating a latent representation of the target image through a multi-step iterative reverse diffusion process; and restoring, through the image decoder, the latent representation into an antique furniture decorative pattern that meets the natural language requirement.
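As an illustration only, the multi-step reverse diffusion loop of claim 6 can be sketched as follows. The sketch assumes a deterministic DDIM-style update and a linear noise schedule, neither of which is named in the claim. The denoiser is a toy oracle, not a U-Net: its condition argument is simply the target latent, so that the loop can be checked end to end. All function and variable names are hypothetical, and decoding the final latent with the image decoder is omitted.

```python
import math
import random


def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / max(num_steps - 1, 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def make_oracle_denoiser(alpha_bars):
    """Toy stand-in for the U-Net: returns the exact noise for a known target."""
    def denoiser(z, t, condition):
        a_t = alpha_bars[t]
        return [(zi - math.sqrt(a_t) * ci) / math.sqrt(1.0 - a_t)
                for zi, ci in zip(z, condition)]
    return denoiser


def sample_latent(denoiser, condition, dim, alpha_bars, rng):
    """Start from Gaussian noise and iterate from the last step down to 0."""
    z = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    for t in reversed(range(len(alpha_bars))):
        eps = denoiser(z, t, condition)
        a_t = alpha_bars[t]
        a_prev = alpha_bars[t - 1] if t > 0 else 1.0
        # Estimate the clean latent, then re-noise it to the previous step.
        z0_pred = [(zi - math.sqrt(1.0 - a_t) * ei) / math.sqrt(a_t)
                   for zi, ei in zip(z, eps)]
        z = [math.sqrt(a_prev) * z0i + math.sqrt(1.0 - a_prev) * ei
             for z0i, ei in zip(z0_pred, eps)]
    return z


rng = random.Random(0)
alpha_bars = make_alpha_bars(50)
target = [0.5, -1.0, 0.25, 2.0]      # toy condition: the latent to recover
latent = sample_latent(make_oracle_denoiser(alpha_bars), target, 4,
                       alpha_bars, rng)
```

With a trained denoising network in place of the oracle, the condition would be the semantic condition vector encoded from the prompt, and the returned latent would be passed to the image decoder.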
7. An electronic device, characterized in that it comprises: one or more processors; and a memory for storing one or more programs; wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the method for generating decorative patterns for antique furniture based on a multimodal large model as described in any one of claims 1-6.
8. A computer-readable storage medium, characterized in that it stores a computer program that, when executed by a processor, implements the method for generating decorative patterns for antique furniture based on a multimodal large model as described in any one of claims 1-6.