Personalized image generation method and apparatus, electronic device, and storage medium
By employing cross-modal automated annotation and data augmentation methods, an image-text-instance segmentation mask association dataset is constructed, which solves the problems of unsatisfactory generation results and inconsistent features in existing technologies, and achieves high-quality personalized image generation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- GUANGDONG LAB OF ARTIFICIAL INTELLIGENCE & DIGITAL ECONOMY (SZ)
- Filing Date
- 2025-04-15
- Publication Date
- 2026-06-23
Figure CN120612396B_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of image processing technology, and in particular to a personalized image generation method, apparatus, electronic device, and storage medium.
Background Technology
[0002] With the continuous development of digital content creation, personalized image generation technology has gradually become a research hotspot. Existing image generation technologies mainly rely on a single reference image or text description, which is insufficient to meet users' needs for complex scenes and diverse features. In multi-reference image tasks, existing technologies often suffer from unsatisfactory generation results, such as image distortion and the absence of reference features in the generated image, which seriously affects the quality and usability of the generated image.
[0003] Furthermore, while some layout-based personalized feature image generation methods produce good results, they require users to input additional layout images, increasing user complexity and limiting the flexibility and usability of the technology. Meanwhile, although existing commercial image generation methods can perform personalized image generation tasks to some extent, they still cannot effectively solve the problem of inconsistency between the generated image and the reference image features in multi-reference image tasks. Therefore, existing technologies suffer from unsatisfactory generation results, complex user operations, and inconsistencies between the generated image and the reference image features, making it difficult to meet the demand for high-quality personalized image generation.
[0004] The preceding description is intended to provide general background information and does not necessarily constitute prior art.
Summary of the Invention
[0005] To address the aforementioned technical problems, this application provides a personalized image generation method, apparatus, electronic device, and storage medium, which can solve the problems of unsatisfactory generation results, complex user operations, and inconsistencies between the generated image and the reference image features in the prior art, thereby achieving high-quality, flexible, and accurate personalized image generation.
[0006] To address the aforementioned technical problems, this application provides a personalized image generation method, comprising the following steps:
[0007] Associating image data, text descriptions, and instance segmentation masks through a cross-modal automated annotation system to generate an associated dataset of image-text-instance segmentation masks;
[0008] Constructing a corresponding hybrid training dataset based on the associated dataset, and performing data augmentation on the hybrid training dataset;
[0009] Training a preset diffusion model based on the data-augmented hybrid training dataset;
[0010] Generating a target image through the trained diffusion model using a hybrid guidance strategy with time step control.
[0011] Furthermore, in some embodiments of this application, the step of associating image data, text descriptions, and instance segmentation masks through a cross-modal automated annotation system to generate an associated dataset of image-text-instance segmentation masks includes:
[0012] Extract cross-frame images containing the same instance from video keyframe data, and generate an initial instance segmentation mask using a pre-trained semantic segmenter;
[0013] Replace the feature numbers of the initial instance segmentation mask with those of the ground truth annotation mask of the video dataset through feature number mapping;
[0014] A pre-trained text extraction model is used to generate a global semantic description of the image, and the semantic alignment between the text and the instance is optimized through entity parsing and lemmatization to generate an associated dataset.
[0015] Furthermore, in some embodiments of this application, the step of constructing a corresponding hybrid training dataset based on the associated dataset and performing data augmentation on the hybrid training dataset includes:
[0016] Multiple reference images of the same video segment, or copies of any static image, are selected from the associated dataset, and the target instance region is randomly expanded and cropped.
[0017] The cropped reference image is combined with the corresponding text description, and then enhanced training samples are generated by random slicing.
[0018] Different denoising time step thresholds are set for video data and static images respectively, and a hybrid training dataset is constructed based on the enhanced training samples.
[0019] Furthermore, in some embodiments of this application, the step of training a preset diffusion model based on a data-augmented hybrid training dataset includes:
[0020] The U-shaped network of the diffusion model is copied as a reference feature extractor, and the features of multiple reference images are extracted and stored in the auxiliary module;
[0021] In the self-attention module, the features stored in the auxiliary module and the features extracted by the denoising network are spatially concatenated, and the result that matches the shape of the denoising features is retained by cropping, so as to dynamically fuse the features of multiple instances.
[0022] Furthermore, in some embodiments of this application, the dynamic fusion of multi-instance features includes:
[0023] In self-attention computation, reference feature weights are introduced to dynamically adjust the impact of multi-instance features on the generated results.
[0024] The concatenated feature map is cropped according to the spatial location of the denoising features, so that local information matching the current generation stage is retained.
[0025] Furthermore, in some embodiments of this application, the generation of the target image using a time-step controlled hybrid guidance strategy through the trained diffusion model includes:
[0026] In the initial generation stage, the main structure and global semantics of the image are generated based on the text description, and intermediate generation results are output.
[0027] In the refinement generation stage, the intermediate generation results are fused with the instance features of the reference image to output the target image.
[0028] Furthermore, in some embodiments of this application, fusing the intermediate generated result with instance features of the reference image includes:
[0029] The reference instance features are aligned with the spatial resolution of the intermediate generated result using an interpolation algorithm;
[0030] The aligned reference instance features are fused by weighting along the channel dimension.
[0031] Accordingly, this application provides a personalized image generation apparatus, comprising:
[0032] The data association module is used to associate image data, text descriptions, and instance segmentation masks through a cross-modal automated annotation system to generate an associated dataset of image-text-instance segmentation masks.
[0033] The data processing module is used to construct a corresponding hybrid training dataset based on the associated dataset and to perform data augmentation on the hybrid training dataset;
[0034] The model training module is used to train a pre-defined diffusion model based on a data-augmented hybrid training dataset.
[0035] The image generation module is used to generate target images using a hybrid guidance strategy with time step control through a trained diffusion model.
[0036] This application also provides an electronic device, including: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the personalized image generation method as described above.
[0037] This application also provides a storage medium storing a computer program that can be loaded by a processor to execute the personalized image generation method described above.
[0038] Implementing the embodiments of this application has the following beneficial effects:
[0039] As described above, this application provides a personalized image generation method, apparatus, electronic device, and storage medium. The method includes: first, associating image data, text descriptions, and instance segmentation masks through a cross-modal automated annotation system to generate an image-text-instance segmentation mask association dataset; then, constructing a corresponding hybrid training dataset based on the association dataset and performing data augmentation on the hybrid training dataset; next, training a preset diffusion model based on the data-augmented hybrid training dataset; and finally, generating a target image using a time-step controlled hybrid guidance strategy through the trained diffusion model. The personalized image generation scheme provided in this application generates an image-text-instance segmentation mask association dataset through a cross-modal automated annotation system, solving the problem of unsatisfactory generation results caused by relying on a single reference image or text description in existing technologies. By constructing a hybrid training dataset and performing data augmentation, this method can effectively improve the diversity and quality of training data, thereby enhancing the stability and consistency of the generated images. Finally, training the diffusion model based on the data-augmented hybrid training dataset and generating the target image using a time-step controlled hybrid guidance strategy can dynamically balance the fusion of global semantics and local features, avoiding problems such as distorted or inconsistent features in the generated images. It is evident that this application, through the optimization of multimodal data association, data augmentation, and diffusion models, can improve the quality and flexibility of personalized image generation, reduce user operational complexity, and meet users' needs for high-quality personalized image generation.
Attached Figure Description
[0040] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application. To more clearly illustrate the technical solutions of the embodiments of this application, the drawings used in the description of the embodiments will be briefly introduced below. Obviously, those skilled in the art can obtain other drawings based on these drawings without any creative effort.
[0041] Figure 1 is a schematic diagram illustrating an application scenario of the personalized image generation method provided in the embodiments of this application;
[0042] Figure 2 is a flowchart illustrating the personalized image generation method provided in the embodiments of this application;
[0043] Figure 3 is another schematic diagram of the personalized image generation method provided in the embodiments of this application;
[0044] Figure 4 is a schematic diagram of the structure of the personalized image generation device provided in the embodiments of this application;
[0045] Figure 5 is a schematic diagram of the structure of the electronic device provided in the embodiments of this application.
[0046] The realization of the objectives, functional features, and advantages of this application will be further explained in conjunction with the embodiments and with reference to the accompanying drawings. The accompanying drawings have illustrated specific embodiments of this application, which will be described in more detail below. These drawings and textual descriptions are not intended to limit the scope of the concept in any way, but rather to illustrate the concepts of this application to those skilled in the art through reference to specific embodiments.
Detailed Implementation
[0047] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numbers in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this application as detailed in the appended claims.
[0048] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element. Furthermore, components, features, and elements with the same names in different embodiments of this application may have the same meaning or different meanings, the specific meaning of which must be determined by its interpretation in that specific embodiment or further in conjunction with the context of that specific embodiment.
[0049] It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to limit this application.
[0050] In the following description, suffixes such as "module," "part," or "unit" are used to denote elements solely for ease of description and have no specific meaning in themselves. Therefore, "module," "part," and "unit" may be used interchangeably.
[0051] Personalized image generation is a core technology in digital content creation, with significant applications in e-commerce, virtual try-on, and advertising design. With the development of generative AI, users expect to precisely control specific instance features (such as clothing styles and furniture designs) in the generated results using multiple reference images, while maintaining a high semantic correlation with the text description. However, current technologies suffer from unsatisfactory generation results, complex user operations, and inconsistencies between the generated and reference images, making it difficult to meet the demand for high-quality personalized image generation.
[0052] To address the aforementioned technical problems, this application provides a personalized image generation method, apparatus, electronic device, and storage medium.
[0053] Specifically, the personalized image generation device can be integrated into an electronic device, such as a smartphone, tablet, laptop, or desktop computer, but is not limited to these. The electronic device can be directly or indirectly connected to the server via wired or wireless communication. The server can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. This application does not impose any restrictions on these aspects.
[0054] Referring to Figure 1, which is an application environment diagram of the personalized image generation method in one embodiment, the method can be applied to a personalized image generation system. The personalized image generation system can include a terminal 110 and a server 120, which are connected via a network. The terminal 110 can be a desktop terminal or a mobile terminal, and the mobile terminal can be at least one of a mobile phone, tablet computer, or laptop computer. The server 120 can be implemented as a standalone server or a server cluster consisting of multiple servers. Specifically, the server 120 can be used to associate image data, text descriptions, and instance segmentation masks through a cross-modal automated annotation system to generate an image-text-instance segmentation mask association dataset; construct a corresponding hybrid training dataset based on the association dataset and perform data augmentation on the hybrid training dataset; train a preset diffusion model based on the data-augmented hybrid training dataset; and generate the target image using a time-step controlled hybrid guidance strategy through the trained diffusion model.
[0055] The following sections provide detailed descriptions of each example. It should be noted that the order in which the embodiments are described is not intended to limit the priority of the embodiments.
[0056] Referring to Figure 2, which is a flowchart of the personalized image generation method provided in this embodiment, the method may specifically include the following steps:
[0057] S1. By associating image data, text descriptions, and instance segmentation masks through a cross-modal automated annotation system, an associated dataset of image-text-instance segmentation masks is generated;
[0058] Specifically, for step S1, the cross-modal automated annotation system can extract image data from multiple data sources (such as video keyframe data, static image data, etc.) and combine it with a pre-trained semantic segmenter and text extraction model to generate an initial instance segmentation mask and a global semantic description. Through feature number mapping replacement, the automatically annotated instance segmentation mask is aligned with the ground truth annotation mask to ensure annotation consistency for the same instance across images. The final generated associated dataset contains multimodal information of images, text descriptions, and instance segmentation masks, providing a rich data foundation for subsequent training. Furthermore, it can combine object detection and semantic segmentation techniques from deep learning to further improve the accuracy and efficiency of automatic annotation. For example, models such as Mask R-CNN or U-Net can be used for instance segmentation, combined with pre-trained language models such as BERT to extract text descriptions. It supports extracting information from video keyframe data and static image data, generating a more comprehensive associated dataset through time-series alignment and spatial feature fusion. By introducing feature number values from the ground truth annotation mask, it ensures annotation consistency for the same instance across images and reduces the impact of annotation noise on training.
[0059] S2. Construct a corresponding hybrid training dataset based on the associated dataset, and perform data augmentation on the hybrid training dataset;
[0060] Specifically, in step S2, multiple reference images of the same video segment or copies of any static image are selected from the associated dataset, and the target instance region is randomly expanded and cropped to generate diverse training samples. Through random slicing and feature concatenation, the cropped reference images are combined with corresponding text descriptions to form enhanced training samples. Furthermore, different denoising time step thresholds are set for video data and static images respectively to adapt to the characteristics of different data sources, constructing the final hybrid training dataset. In addition, more advanced data augmentation methods, such as random rotation, flipping, and color jitter, can be introduced to further enrich the diversity of training data. A dynamic number of reference image inputs is supported, and batch processing and feature concatenation improve the model's adaptability to multiple reference images. Different denoising time step thresholds are set for video data and static images respectively to optimize noise processing strategies during training.
[0061] S3. Train the pre-defined diffusion model based on the data-augmented hybrid training dataset;
[0062] Specifically, in step S3, the U-shaped network of the diffusion model is used as a reference feature extractor to extract features from multiple reference images and store them in the auxiliary module. In the self-attention module, the features stored in the auxiliary module are spatially concatenated with the features extracted by the denoising network, and cropping is used to retain the result that matches the shape of the denoised features, achieving dynamic fusion of multi-instance features. Through this feature write-read mechanism, the model can effectively fuse multi-instance features during training, improving the detail and semantic consistency of the generated images. Furthermore, more advanced feature extraction networks (such as Transformer-based networks) and feature fusion methods (such as attention mechanisms) can be introduced to further enhance the feature fusion effect. The model supports a dynamic number of reference image inputs, flexibly handling different numbers of reference images through batch processing and the feature storage module. The generation effect of the model can be further improved by adjusting the training parameters of the diffusion model (such as learning rate and noise scheduling strategy).
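The concatenate-then-crop step described above can be illustrated with a minimal sketch. It assumes that features are flattened into token lists, that the query, key, and value projections are identity maps, and that a single scalar `ref_weight` stands in for the reference feature weights; none of these details are specified in this application, and the function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attend_with_references(denoise_tokens, reference_tokens, ref_weight=1.0):
    """Self-attention over [denoise; reference] tokens, cropped to the denoise length.

    reference_tokens plays the role of the features stored in the auxiliary
    module. Identity Q/K/V projections are an assumption of this sketch.
    """
    n, dim = len(denoise_tokens), len(denoise_tokens[0])
    tokens = denoise_tokens + reference_tokens  # concatenation along the token axis
    scale = 1.0 / math.sqrt(dim)
    # Adding log(ref_weight) to a logit multiplies that key's unnormalized
    # attention weight by ref_weight; a weight of 0 masks the reference out.
    ref_bias = math.log(ref_weight) if ref_weight > 0 else float("-inf")
    bias = [0.0] * n + [ref_bias] * len(reference_tokens)
    out = []
    for q in tokens:
        logits = [scale * sum(a * b for a, b in zip(q, k)) + bias[j]
                  for j, k in enumerate(tokens)]
        w = softmax(logits)
        out.append([sum(w[j] * tokens[j][c] for j in range(len(tokens)))
                    for c in range(dim)])
    return out[:n]  # crop so the result matches the shape of the denoising features
```

Because the reference tokens are only appended and then cropped away, the number of reference images can vary from call to call without changing the output shape.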
[0063] S4. The target image is generated using a hybrid guidance strategy with time step control through the trained diffusion model;
[0064] Specifically, for step S4, a hybrid guidance strategy with time step control is used to generate the main structure and global semantics of the image based on the text description in the initial generation stage, outputting intermediate generation results. In the refinement generation stage, the intermediate generation results are fused with the instance features of the reference image to output the final target image. The spatial resolution of the reference instance features is aligned using an interpolation algorithm, and the aligned features are weighted and fused in the channel dimension to ensure the detail and semantic consistency of the generated image. In addition, more guidance modes (such as layout-based guidance, user interaction-based guidance, etc.) can be introduced to further improve the controllability and flexibility of the generated image; the noise processing strategy in the generation process can be optimized by dynamically adjusting the time step threshold to further improve the quality of the generated image; and the effect of feature alignment and fusion can be further improved by using more advanced interpolation algorithms (such as bilinear interpolation, multi-scale interpolation, etc.) and feature fusion methods (such as attention-weighted fusion).
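As a rough illustration of step S4, the sketch below switches guidance mode at a time step threshold, aligns a reference feature map to the intermediate result by nearest-neighbour interpolation, and blends the two per channel. The threshold direction (time steps counting down from high noise), the nearest-neighbour choice, the convex blend, and all function names are assumptions made for the example; this application does not fix these details.

```python
def guidance_mode(t, switch_step):
    """Text-only guidance for early (noisy) steps, text plus reference afterwards.

    Assumes time steps count down from high noise towards 0.
    """
    return "text" if t > switch_step else "text+reference"

def resize_nearest(channel, out_h, out_w):
    """Nearest-neighbour resize of one channel given as a list of rows."""
    in_h, in_w = len(channel), len(channel[0])
    return [[channel[y * in_h // out_h][x * in_w // out_w]
             for x in range(out_w)] for y in range(out_h)]

def fuse(intermediate, reference, channel_weights):
    """Per-channel blend: (1 - w_c) * intermediate + w_c * aligned reference.

    Both inputs are lists of channels; each channel is a list of rows.
    """
    out_h, out_w = len(intermediate[0]), len(intermediate[0][0])
    fused = []
    for inter_c, ref_c, w in zip(intermediate, reference, channel_weights):
        aligned = resize_nearest(ref_c, out_h, out_w)  # match spatial resolution
        fused.append([[(1 - w) * a + w * b for a, b in zip(row_i, row_r)]
                      for row_i, row_r in zip(inter_c, aligned)])
    return fused
```

In a sampling loop, `fuse` would only be called on steps where `guidance_mode` returns `"text+reference"`, so the early steps remain driven by the text description alone.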
[0065] As can be seen, this embodiment significantly improves the diversity and quality of training data and reduces the cost and complexity of manual annotation by associating multimodal data and automatically annotating it, providing a high-quality data foundation for subsequent model training. Through data augmentation and the construction of hybrid training datasets, it significantly improves the diversity and robustness of training data, enhancing the model's adaptability to multiple reference images and complex scenes, providing high-quality input for subsequent model training. Through feature write-read mechanisms and multi-instance feature fusion, it significantly improves the model's adaptability to multiple reference images, enhancing the detail and semantic consistency of generated images, providing strong model support for subsequent image generation. Through a hybrid guidance strategy with time step control, it significantly improves the quality and semantic consistency of generated images, reducing distortion and feature inconsistencies in generated images, meeting the needs of high-quality personalized image generation.
[0066] Furthermore, in some embodiments, step S1, "associating image data, text descriptions, and instance segmentation masks through a cross-modal automated annotation system to generate an associated dataset of image-text-instance segmentation masks," may specifically include:
[0067] S11. Extract cross-frame images containing the same instance from video keyframe data, and generate an initial instance segmentation mask using a pre-trained semantic segmenter;
[0068] Specifically, an automated annotation system extracts cross-frame images containing the same instance from video keyframe data. The system identifies and extracts images of the same object in different frames, ensuring temporal consistency. Pre-trained semantic segmenters (such as Mask R-CNN or U-Net) generate initial instance segmentation masks, identifying the location and contour of each instance in the image. Additionally, content-change-based algorithms, such as optical flow or inter-frame differencing, can be used to extract keyframes, ensuring that the extracted frames represent significant changes in the video. More advanced segmentation models, such as DETR (DEtection TRansformer), can be used to improve segmentation accuracy and efficiency. Instance tracking techniques (such as SORT or DeepSORT) can be introduced to ensure the continuity and consistency of the same instance across different frames.
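The inter-frame differencing option mentioned above can be sketched as follows. The example assumes grayscale frames stored as nested lists and a hypothetical mean-absolute-difference threshold; it only illustrates the keyframe selection idea, not the segmentation or tracking models.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute pixel difference between two equally sized grayscale frames."""
    total = sum(abs(a - b)
                for row_a, row_b in zip(frame_a, frame_b)
                for a, b in zip(row_a, row_b))
    return total / (len(frame_a) * len(frame_a[0]))

def select_keyframes(frames, threshold):
    """Keep frame 0, then every frame that differs enough from the last kept frame."""
    if not frames:
        return []
    keep = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[keep[-1]], frames[i]) >= threshold:
            keep.append(i)
    return keep
```

Comparing against the last kept frame, rather than the immediately preceding one, lets slow gradual changes accumulate until they cross the threshold.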
[0069] S12. Map and replace the feature numbers of the initial instance segmentation mask with those of the ground truth annotation mask of the video dataset;
[0070] Specifically, by mapping and replacing the feature numbers of the automatically generated initial instance segmentation mask with the ground truth labeled mask in the video dataset, the consistency of annotations for the same instance across images is ensured. For example, the system calculates the intersection-over-union (IoU) between the automatically generated mask and the ground truth labeled mask, and replaces the feature numbers of the automatically generated mask with those of the ground truth labeled mask. Additionally, graph matching or clustering-based algorithms can be used to improve the accuracy and efficiency of feature number mapping. A consistency check mechanism is introduced to ensure the consistency of the replaced labeled mask across different frames. Combined with a user feedback mechanism, the automatically generated annotations are corrected to further improve annotation quality.
[0071] S13. Use a pre-trained text extraction model to generate a global semantic description of the image, and optimize the semantic alignment between the text and the instance through entity parsing and lemmatization to generate an associated dataset;
[0072] Specifically, pre-trained text extraction models (such as BERT or CLIP) generate global semantic descriptions of images, capturing the overall semantic information of the images. Entity parsing and lemmatization techniques are used to optimize the semantic alignment between text descriptions and instances, ensuring that the text descriptions accurately reflect the features of the instances in the images. The resulting associated dataset contains multimodal information including images, text descriptions, and instance segmentation masks. Furthermore, more advanced multimodal models, such as FLAVA or ALIGN, are employed to improve the accuracy and semantic richness of text extraction. Knowledge graph-based entity parsing and Transformer-based lemmatization techniques are introduced to improve the semantic alignment accuracy between text and instances. Combining multimodal information from images and text, a cross-modal attention mechanism further optimizes the semantic consistency of the dataset.
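To make the text-to-instance alignment concrete, the following sketch lemmatizes caption words and matches them to instance class labels. It deliberately swaps in a crude suffix-stripping rule set in place of a real lemmatizer or entity parser, so it will mishandle many words (for example, "horses"); the irregular-form table and function names are hypothetical.

```python
import re

# Hypothetical table of irregular plurals; a real system would use a lexical database.
IRREGULAR = {"people": "person", "men": "man", "women": "woman", "children": "child"}

def lemmatize(word):
    """Very rough noun lemmatizer based on suffix rules (illustrative only)."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith(("ses", "xes", "ches", "shes")):
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def align_caption_to_instances(caption, instance_labels):
    """Map each instance id to the first caption word whose lemma equals its label."""
    lemma_to_word = {}
    for word in re.findall(r"[A-Za-z]+", caption):
        lemma_to_word.setdefault(lemmatize(word), word)
    return {inst_id: lemma_to_word[label]
            for inst_id, label in instance_labels.items()
            if label in lemma_to_word}
```

Instances whose label never appears in the caption are simply left out of the result, which gives a downstream step the chance to flag or discard weakly aligned samples.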
[0073] This embodiment ensures the continuity and consistency of instances in video data by extracting cross-frame images containing the same instance and generating an initial segmentation mask, providing high-quality basic data for subsequent annotation and training. By using feature number mapping replacement, it ensures the consistency between the automatically generated segmentation mask and the real annotation mask, reducing the impact of annotation noise on subsequent training and improving the quality of training data. By generating a global semantic description and optimizing semantic alignment, it ensures a high degree of correlation between text descriptions and image instances, improving the semantic consistency and multimodal fusion effect of the associated dataset.
[0074] In a specific embodiment, the training data is preprocessed, and associated data for image-text-instance segmentation masks is generated through an automated annotation system. For example, VIPseg720 video keyframe data and some Laion image data are used for automatic annotation, and the automatic annotation is adjusted based on existing semantic segmentation annotations to form associated data for image-text-instance segmentation masks.
[0075] In one embodiment, training data preprocessing includes: on the video keyframe dataset, instance segmentation mask annotations for the same object are consistent across images; using a pre-trained image text extraction model to extract global image description text; using a text library for noun entity parsing and lexical reconstruction; locating instance regions using a pre-trained object detector and generating initial masks by combining a pre-trained semantic segmenter; and fusing real and automatic annotations from the video dataset.
[0076] It is understood that in this embodiment, image data containing simple text descriptions can be obtained online based on the image links provided by the Laion dataset. This data can be mixed with directly downloaded video keyframe data, automatically annotated, and then subjected to instance-based random chunking for data augmentation. The resulting datasets can be constructed to produce consistent outputs and then concatenated for use as a mixed training dataset.
[0077] In a specific embodiment, the process of integrating real and automatic annotation of video datasets includes: constructing a mapping relationship between real and automatically annotated instances in the video dataset; replacing the instance segmentation mask of the automatic annotation with the instance segmentation mask from the real annotation; and constructing a mapping relationship between the instance segmentation mask and the text description of the automatic annotation.
[0078] It should be noted that since the semantic segmentation annotations contained in the video keyframe data are instance segmentation annotations, which contain the feature numbers of instances, the automatic annotation process achieves consistency of instance annotations across images by replacing the value of the automatically annotated instance segmentation mask with the feature number value of a real instance segmentation mask. That is, in different keyframes corresponding to the same video, instance segmentation annotations with the same feature number correspond to the same instance, and the feature number value of the real instance segmentation mask comes from the real segmentation mask with the largest intersection-over-union (IoU) with the automatically annotated instance segmentation mask.
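The largest-IoU rule described above can be sketched in a few lines. The example assumes each mask is given as a set of (row, column) pixel coordinates keyed by its feature number, and adds a hypothetical `min_iou` cutoff so that automatically annotated masks with no overlapping real mask stay unmapped.

```python
def iou(mask_a, mask_b):
    """Intersection-over-union of two masks given as sets of (row, col) pixels."""
    union = len(mask_a | mask_b)
    return len(mask_a & mask_b) / union if union else 0.0

def map_feature_numbers(auto_masks, gt_masks, min_iou=0.0):
    """For each automatic mask id, return the real mask id with the largest IoU."""
    mapping = {}
    for auto_id, auto_mask in auto_masks.items():
        best_id, best_score = None, min_iou
        for gt_id, gt_mask in gt_masks.items():
            score = iou(auto_mask, gt_mask)
            if score > best_score:
                best_id, best_score = gt_id, score
        if best_id is not None:
            mapping[auto_id] = best_id
    return mapping
```

Applying the returned mapping to every keyframe of a video gives the same feature number to the same instance across frames, since the real annotations already share feature numbers within a video.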
[0079] Furthermore, in some embodiments, step S2, "constructing a corresponding hybrid training dataset based on the associated dataset and performing data augmentation on the hybrid training dataset," may specifically include:
[0080] S21. Select multiple reference images of the same video segment from the associated dataset, or a copy of any static image, and randomly expand and crop the target instance region;
[0081] Specifically, the system selects multiple reference images containing the target instance from the associated dataset, or copies images from static images, and then randomly expands and crops the target instance region to generate diverse training samples. Additionally, a content similarity-based algorithm can be used to select the reference image most relevant to the target instance, improving the relevance and quality of the training samples. More advanced image processing techniques, such as adaptive cropping and multi-scale expansion, are introduced to further enrich the diversity of the training samples. Combined with a pre-trained target detector, the system accurately locates the target instance region, ensuring that the cropped image contains complete instance information.
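As a minimal sketch of the random expansion and cropping of a target instance region, the following assumes a binary mask stored as a nested list. The helper names and the `max_ratio` parameter (the maximum expansion per side, as a fraction of the box size) are hypothetical and are not specified in the original description.

```python
import random
from typing import List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


def mask_bbox(mask: List[List[int]]) -> Box:
    """Tightest rectangle containing all non-zero pixels of a binary mask."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        raise ValueError("mask is empty")
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def random_expand(box: Box, height: int, width: int,
                  max_ratio: float = 0.3,
                  rng: Optional[random.Random] = None) -> Box:
    """Expand each side independently by a random amount, clipped to the image."""
    rng = rng or random.Random()
    top, left, bottom, right = box
    h, w = bottom - top, right - left
    top = max(0, top - int(rng.uniform(0, max_ratio) * h))
    left = max(0, left - int(rng.uniform(0, max_ratio) * w))
    bottom = min(height, bottom + int(rng.uniform(0, max_ratio) * h))
    right = min(width, right + int(rng.uniform(0, max_ratio) * w))
    return top, left, bottom, right


def crop(image: List[List[int]], box: Box) -> List[List[int]]:
    """Cut the rectangle given by box out of a 2D image."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]
```

The expanded box always contains the original mask rectangle, so the cropped reference image keeps the complete instance.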
[0082] S22. Combine the cropped reference image with the corresponding text description, and generate enhanced training samples by random slicing;
[0083] Specifically, the system pairs the cropped reference image with the text description, then randomly segments the image to generate multiple sub-images as training samples. These sub-images retain the key features of the original image while increasing sample diversity through random segmentation. Additionally, grid-based or instance-mask-based segmentation methods can be used to ensure that the segmented images retain key instance information; text-image alignment techniques are introduced to ensure a high degree of correlation between the text description and the content of the segmented images; and other data augmentation techniques (such as rotation, flipping, and color jitter) are combined to further enrich the diversity of the training samples.
[0084] S23. Set different denoising time step thresholds for video data and still images respectively, and construct a hybrid training set based on the enhanced training samples;
[0085] Specifically, the system sets different denoising time step thresholds based on the data type (video or still image) to adapt to the characteristics of different data sources. Video data typically requires a longer denoising time step to handle noise in the time series, while still images can use a shorter denoising time step. In this way, a hybrid training set is constructed, containing augmented samples from different data sources. Additionally, an adaptive denoising time step strategy can be employed to dynamically adjust the denoising time step based on the image complexity. Data balancing techniques are introduced to ensure a reasonable proportion of samples from different data sources in the hybrid training set, preventing any one type of data from dominating the training process. Advanced noise processing techniques (such as Gaussian noise and salt-and-pepper noise) are combined to further improve the quality of the training samples.
[0086] This embodiment significantly improves the diversity and robustness of training samples by selecting multiple reference images or duplicate images and randomly expanding and cropping the target instance region, thereby enhancing the model's adaptability to different perspectives and scales. Through random slicing and text-image combinations, it significantly improves the diversity and semantic consistency of training samples, enhancing the model's adaptability to multiple reference images and complex scenes. By setting different denoising time step thresholds for video data and static images respectively, it significantly improves the robustness and adaptability of training samples, enhancing the model's ability to process different data sources and providing high-quality input for subsequent model training.
[0087] In a specific embodiment, each time the dataset is randomly sampled, two images from the same video frame dataset are randomly selected, or one image is randomly selected from the image dataset and copied into two images. Multiple reference images containing the target instance are cropped from one of the images, and text descriptions are extracted from the other image. After data augmentation using random cropping, the images are used as training data.
[0088] The method for constructing the multi-reference image training dataset includes: applying connectivity filtering, shape filtering, and position filtering methods to use only instance segmentation masks with complete segmentation results that do not belong to the background type; applying a random slicing method based on the mask range to obtain diverse reference images and real result images; applying a consistent data output format to stitch together the video keyframe data training set and the image data training set to construct the hybrid training dataset based on multiple datasets; and adopting a two-stage time step sampling strategy, where the sampling time step t>τ when using video keyframe data and the sampling time step t≤τ when using static image data.
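The two-stage time step sampling strategy can be sketched as follows. The discrete range 0 to `num_steps - 1`, the uniform sampling within each range, and the function name `sample_timestep` are assumptions made for illustration. Only the conditions t>τ for video keyframe data and t≤τ for static image data come from the description.

```python
import random
from typing import Optional


def sample_timestep(source: str, tau: int, num_steps: int = 1000,
                    rng: Optional[random.Random] = None) -> int:
    """Sample a training time step t according to the data source.

    Video keyframe data: t > tau.  Static image data: t <= tau.
    Time steps are assumed to be integers in [0, num_steps - 1].
    """
    rng = rng or random.Random()
    if not 0 <= tau < num_steps - 1:
        raise ValueError("tau must leave room for both ranges")
    if source == "video":
        return rng.randint(tau + 1, num_steps - 1)  # randint is inclusive
    if source == "image":
        return rng.randint(0, tau)
    raise ValueError(f"unknown data source: {source!r}")


rng = random.Random(0)
print(sample_timestep("video", tau=400, rng=rng))
print(sample_timestep("image", tau=400, rng=rng))
```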
[0089] Specifically, during the sampling process, the instance segmentation masks are first used to identify the instances that appear in both images, and instances are randomly selected from among them. The connectivity of each mask is calculated, filtering out instances whose masks are split into multiple disconnected blocks. The aspect ratio of the rectangle containing the mask is calculated, filtering out instances with excessively large or small aspect ratios (i.e., long and narrow). Whether the mask covers multiple corners of the image is also checked, filtering out masks of the background type. These connectivity-based, shape-based, and position-based filtering methods filter out low-quality segmentation masks, reducing the impact of low-quality results from the automatic annotation process on subsequent training. During random slicing, for the reference image of the instance, the rectangle containing the mask is randomly expanded to crop the original image; the cropped result is used as the reference image for that instance. For the original image, the rectangles containing all selected masks are randomly expanded to crop the original image; the cropped result is used as the noisy image and the ground truth image for evaluating the model's image generation results.
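The three mask filters can be sketched as below, assuming binary masks stored as nested lists. The thresholds `max_components`, `max_aspect`, and `max_corners` are hypothetical values chosen for illustration, since the description does not give concrete limits.

```python
from collections import deque
from typing import List

Mask = List[List[int]]


def count_components(mask: Mask) -> int:
    """Number of 4-connected components of non-zero pixels (connectivity filter)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    count = 0
    for r in range(h):
        for c in range(w):
            if mask[r][c] and not seen[r][c]:
                count += 1
                seen[r][c] = True
                queue = deque([(r, c)])
                while queue:  # breadth-first flood fill of one component
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
    return count


def aspect_ratio(mask: Mask) -> float:
    """Width divided by height of the rectangle containing the mask (shape filter)."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return (cols[-1] - cols[0] + 1) / (rows[-1] - rows[0] + 1)


def corners_covered(mask: Mask) -> int:
    """Number of image corners covered by the mask (position filter)."""
    h, w = len(mask), len(mask[0])
    return sum(1 for r, c in ((0, 0), (0, w - 1), (h - 1, 0), (h - 1, w - 1)) if mask[r][c])


def keep_mask(mask: Mask, max_components: int = 1,
              max_aspect: float = 4.0, max_corners: int = 1) -> bool:
    """Return True only for masks that pass all three filters."""
    if not any(any(row) for row in mask):
        return False
    if count_components(mask) > max_components:
        return False
    ratio = aspect_ratio(mask)
    if ratio > max_aspect or ratio < 1.0 / max_aspect:
        return False
    return corners_covered(mask) <= max_corners
```

A mask that touches two or more image corners is treated here as background, which is one possible reading of the position filter.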
[0090] Furthermore, in some embodiments, step S3, "training the preset diffusion model based on the data-augmented hybrid training dataset," may specifically include:
[0091] S31. A copy of the U-shaped network of the diffusion model is used as a reference feature extractor to extract features from multiple reference images and store them in the auxiliary module;
[0092] Specifically, a copy of the U-shaped network of the diffusion model is used as a reference feature extractor to extract features from multiple reference images, and these features are stored in an auxiliary module. The initial parameters and structure of the reference feature extractor are consistent with the U-shaped network of the denoising network, enabling the extraction of multi-level features from the reference images for use in subsequent denoising processes. Alternatively, more advanced feature extraction networks, such as Transformer-based networks (e.g., Swin Transformer) or deeper convolutional neural networks (e.g., ResNet-152), can be employed to improve the accuracy and efficiency of feature extraction. The auxiliary module can be designed as a dynamic feature storage and management module, supporting real-time feature updates and efficient retrieval to accommodate varying numbers of reference images. Feature compression techniques (e.g., PCA or AutoEncoder) are introduced to reduce the dimensionality of stored features, improving storage efficiency and computational speed.
[0093] S32. In the self-attention module, the features stored in the auxiliary module and the features extracted by the denoising network are spatially concatenated, and the result that matches the shape of the denoising features is retained by cropping, so as to dynamically fuse the features of multiple instances.
[0094] Specifically, in the self-attention module, the reference features stored in the auxiliary module are spatially concatenated with the features extracted by the denoising network, and the result is then cropped to retain the shape matching the denoised features. This process achieves dynamic fusion of multi-instance features, allowing the generated image to simultaneously retain features from multiple reference images. Furthermore, more advanced self-attention mechanisms, such as sparse attention or local attention, can be introduced to improve computational efficiency and feature fusion performance. Before concatenation, interpolation algorithms (such as bilinear interpolation) are used to spatially align the reference and denoised features, ensuring feature consistency during concatenation. A dynamic weighting mechanism is introduced to dynamically adjust the weights of the reference and denoised features according to the needs of the current generation stage, further optimizing the feature fusion effect.
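The concatenate-attend-crop operation can be illustrated with a simplified sketch. Features are flattened to token lists (one row per spatial position), a single attention head is used, and the learned query, key, and value projections of a real self-attention module are omitted. These simplifications and the function names are assumptions for illustration only.

```python
import math
from typing import List

Tokens = List[List[float]]  # one row per spatial position, one column per channel


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens: Tokens) -> Tokens:
    """Scaled dot-product self-attention with identity projections (Q = K = V)."""
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for query in tokens:
        weights = softmax([scale * sum(q * k for q, k in zip(query, key)) for key in tokens])
        output.append([sum(w * value[c] for w, value in zip(weights, tokens))
                       for c in range(dim)])
    return output


def fused_self_attention(denoise_tokens: Tokens, reference_tokens: Tokens) -> Tokens:
    """Concatenate along the spatial axis, attend, then crop to the denoising shape."""
    joint = denoise_tokens + reference_tokens   # spatial concatenation
    attended = self_attention(joint)
    return attended[:len(denoise_tokens)]       # keep only the denoising positions


denoise = [[1.0, 0.0], [0.0, 1.0]]
reference = [[5.0, 5.0]]
print(fused_self_attention(denoise, reference))
```

The cropped output has the same shape as the denoising features, so the subsequent modules of the denoising network need no structural change, while each retained position has already attended to the reference positions.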
[0095] This embodiment uses a replicated U-shaped network as a reference feature extractor, ensuring the efficiency and consistency of reference image feature extraction. The introduction of an auxiliary module enables the effective storage and management of features from multiple reference images, providing high-quality input for subsequent feature fusion. Through feature concatenation and cropping in the self-attention module, dynamic fusion of multi-instance features is achieved, significantly improving the detail and semantic consistency of the generated image. The dynamic fusion mechanism ensures that the generated image can simultaneously retain the key features of multiple reference images, enhancing the diversity and accuracy of the generated results.
[0096] Furthermore, in some embodiments, the "dynamic fusion of multi-instance features" in step S32 may specifically include:
[0097] In self-attention computation, reference feature weights are introduced to dynamically adjust the impact of multi-instance features on the generated results.
[0098] The concatenated feature map is cropped based on the spatial location of the denoising features, and local information that matches the current generation stage is retained.
[0099] Specifically, for dynamic fusion, in the self-attention module, the reference feature weights are dynamically adjusted according to the needs of the current generation stage, allowing the contribution of different instance features to the generated result to be optimized based on the context. This dynamic adjustment mechanism ensures that the generated image can better reflect the user's intent and the features of the reference image. Alternatively, a weight calculation mechanism based on attention scores can be used to dynamically adjust the weights according to the relevance of the reference features to the currently generated content. Introducing a multi-scale attention mechanism to adjust feature weights at different scales further improves the effect of feature fusion. Combining user input text descriptions or other interactive information, a user intent model is constructed to guide the direction of weight adjustment.
[0100] After feature concatenation, the concatenated feature map is cropped based on the spatial location information of the denoised features to ensure that the retained features match the needs of the current generation stage. This process ensures that local details during the generation process are fully preserved and utilized. Spatial location encoding technology is introduced to enhance the model's ability to perceive the spatial location of features. In addition, an adaptive cropping algorithm automatically adjusts the cropping region according to the dynamic needs of the generation stage. Combined with local feature enhancement technology, the retained local information is further highlighted, improving the detail representation of the generated image.
[0101] This embodiment introduces reference feature weights to dynamically adjust the influence of multi-instance features, significantly improving the semantic consistency and detail representation of the generated image, ensuring that the generated result can better reflect the user's intent and the features of the reference image; through spatial location cropping, it retains local information that matches the current generation stage, improving the detail retention and semantic consistency of the generated image, ensuring that the generated result maintains high quality at different stages.
[0102] In a specific embodiment, the denoising image generation network is trained using a mixed training dataset, and a feature write-read mechanism is employed to achieve multi-instance feature fusion. Specifically, the diffusion model training method for multi-reference image adaptation supports multiple reference images as input, including: designing a data batch processing mechanism to support a dynamic number of reference image inputs; configuring a data path to extract hierarchical features from multiple reference images; and fusing features through an improved self-attention mechanism.
[0103] Specifically, during data sampling and stitching, samples with fewer reference images are padded with zeros up to the number of reference images used in the current batch, thus enabling batch training. In the training process, a reference network is first used to extract features from the reference images. The initial parameters and model structure of the reference network are a copy of the U-shaped network used in the denoising network. During feature extraction, each input consists of a set of reference images, and intermediate results from feature extraction are stored in an auxiliary module. Within a batch of data, reference image feature extraction is repeated as many times as there are reference images, ensuring that the reference image features of the same sampled data are stored in the same space within the auxiliary module. During image denoising, a parameter-frozen denoising network estimates the noise of the noisy image corresponding to the current sampling time step. During estimation, in the self-attention module, the features stored in the auxiliary module and the features extracted by the denoising network are stitched together along the spatial domain. After calculating self-attention, the output of the self-attention module is cropped according to the shape of the features extracted by the denoising network. The cropped result is used as input to subsequent modules of the denoising network, thus achieving the fusion of features from multiple reference images.
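The zero padding used to batch samples with different numbers of reference images can be sketched as follows. Images are single-channel nested lists here, and the returned validity flags are an added assumption (they are not mentioned in the description) that lets later stages tell real references from padding.

```python
from typing import List, Tuple

Image = List[List[float]]


def zero_image(height: int, width: int) -> Image:
    """An all-zero placeholder image used for padding."""
    return [[0.0] * width for _ in range(height)]


def pad_reference_sets(samples: List[List[Image]], height: int, width: int
                       ) -> Tuple[List[List[Image]], List[List[bool]]]:
    """Pad every sample to the largest reference count in the batch.

    Returns the padded batch and, per sample, flags marking which slots
    hold real reference images (True) and which hold zero padding (False).
    """
    max_refs = max(len(refs) for refs in samples)
    padded, valid = [], []
    for refs in samples:
        missing = max_refs - len(refs)
        padded.append(list(refs) + [zero_image(height, width) for _ in range(missing)])
        valid.append([True] * len(refs) + [False] * missing)
    return padded, valid


ones = [[1.0, 1.0], [1.0, 1.0]]
batch = [[ones], [ones, ones, ones]]  # one sample with 1 reference, one with 3
padded, valid = pad_reference_sets(batch, height=2, width=2)
print([len(refs) for refs in padded], valid)
```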
[0104] Furthermore, in some embodiments, step S4, "generating the target image using a time-step controlled hybrid guidance strategy through the trained diffusion model," may specifically include:
[0105] S41. In the initial generation stage, the main structure and global semantics of the image are generated based on the text description, and intermediate generation results are output;
[0106] Specifically, the diffusion model starts with completely random noise and generates the main structure of the image through a gradual denoising process. Text descriptions are incorporated into the model via an embedding mechanism to guide the generation process and ensure that the generated image conforms to the semantics of the text description. More advanced text embedding techniques, such as CLIP or FLAVA, can be used to improve the semantic consistency between the text description and the generated image. A multi-stage generation strategy is introduced, first generating a low-resolution main structure and then gradually refining it to a high resolution, improving generation efficiency and quality. Combined with semantic segmentation techniques, different semantic regions are individually controlled during the generation process, further enhancing the semantic accuracy of the generated image.
[0107] S42. In the refinement generation stage, the intermediate generation results are fused with the instance features of the reference image to output the target image;
[0108] Specifically, in the refinement generation stage, the intermediate generated results are fused with instance features from the reference image to output the final target image. During the later denoising process of the diffusion model, instance features from the reference image are gradually introduced. Through a feature fusion mechanism, the detailed features of the reference image are integrated into the generated image, ensuring that the generated image not only conforms to the semantics of the text description but also retains the key details of the reference image. An attention-based feature fusion mechanism can be used to dynamically adjust the weights of reference features, ensuring the preservation of key features. Feature fusion of multiple reference images is supported; through feature aggregation and selection mechanisms, an image integrating features from multiple reference images is generated. Combined with super-resolution technology, image details are enhanced in the refinement generation stage, further improving the quality of the generated image.
[0109] This embodiment ensures that the main structure and global semantics of the generated image are highly consistent with the user's intent through initial generation based on text description, providing a high-quality foundation for subsequent detailed generation; by fusing with instance features of the reference image, it improves the detail preservation capability and semantic consistency of the generated image, ensuring that the generated image can simultaneously meet the requirements of the text description and the reference image.
[0110] Furthermore, in some embodiments, step S42, "fusing the intermediate generated result with instance features of the reference image," includes:
[0111] The spatial resolution of the reference instance features is aligned with that of the intermediate generated results using an interpolation algorithm.
[0112] The aligned reference instance features are fused by weighting along the channel dimension.
[0113] Specifically, for instance feature fusion, the reference instance features and the intermediate generated results may have different spatial resolutions. Interpolation algorithms (such as bilinear interpolation or nearest neighbor interpolation) are used to adjust the spatial resolution of the reference instance features to match that of the intermediate generated results, ensuring that they can be fused in the same spatial dimension. More advanced interpolation algorithms, such as bicubic interpolation or deep learning-based super-resolution interpolation, can be employed to improve interpolation accuracy and efficiency. Multi-scale interpolation techniques can be introduced to interpolate features at different scales, ensuring feature consistency across different scales. Furthermore, feature alignment detection techniques can be combined to automatically evaluate the alignment effect of the interpolated features, ensuring alignment accuracy.
[0114] By weighting the features of the reference instance along the channel dimension and dynamically adjusting the contribution of different feature channels, the fused features better reflect the key features of the reference instance. Weighted fusion can automatically determine weights through a learning mechanism or be adjusted using predefined rules. A dynamic weight adjustment mechanism is introduced to automatically adjust the weights of feature channels based on the needs of the current generation stage. Combined with an attention mechanism, weights are dynamically allocated along the channel dimension to highlight key feature channels. Multiple feature fusion strategies are supported, such as element-level fusion, channel-level fusion, and decision-level fusion, allowing the selection of the optimal strategy based on specific needs.
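The alignment and channel-wise weighted fusion can be sketched as below, using nearest neighbor interpolation for brevity. Feature maps are nested lists indexed as [channel][row][column]. The convex combination `(1 - w) * intermediate + w * reference` and the fixed per-channel weights are assumptions for illustration, since the description leaves the exact fusion rule open (learned or rule-based).

```python
from typing import List

Channel = List[List[float]]
FeatureMap = List[Channel]  # indexed as [channel][row][column]


def resize_nearest(channel: Channel, out_h: int, out_w: int) -> Channel:
    """Nearest neighbor interpolation of one channel to the target resolution."""
    in_h, in_w = len(channel), len(channel[0])
    return [[channel[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def fuse(intermediate: FeatureMap, reference: FeatureMap,
         channel_weights: List[float]) -> FeatureMap:
    """Align the reference to the intermediate resolution, then blend per channel."""
    out_h, out_w = len(intermediate[0]), len(intermediate[0][0])
    fused = []
    for inter_ch, ref_ch, w in zip(intermediate, reference, channel_weights):
        aligned = resize_nearest(ref_ch, out_h, out_w)
        fused.append([[(1.0 - w) * a + w * b for a, b in zip(row_i, row_r)]
                      for row_i, row_r in zip(inter_ch, aligned)])
    return fused


# Two-channel example: a 4x4 intermediate result and a 2x2 reference feature.
intermediate = [[[0.0] * 4 for _ in range(4)] for _ in range(2)]
reference = [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 10.0], [10.0, 10.0]]]
fused = fuse(intermediate, reference, channel_weights=[1.0, 0.5])
print(fused[0][0], fused[1][0])
```

A weight of 1.0 replaces a channel with the reference feature, while 0.0 leaves the intermediate result unchanged, so the weights control how strongly each channel reflects the reference instance.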
[0115] This embodiment aligns the spatial resolution of the reference instance features and the intermediate generated results using an interpolation algorithm, ensuring the accuracy of feature fusion and improving the quality and detail retention of the generated image. By weighting and fusing the aligned reference instance features along the channel dimension, the effect of feature fusion is improved, enhancing the detail and semantic consistency of the generated image and ensuring that the generated result better reflects the features of the reference instance.
[0116] In a specific embodiment, progressive inference generation is performed, and the target image is output through a hybrid guidance strategy controlled by time steps. The progressive inference generation method uses different guidance modes at different time steps, including: using plain text guidance to generate the main structure during the initialization phase; gradually injecting reference image instance features during the refinement phase; and setting time step thresholds to control the timing of guidance mode switching.
[0117] Specifically, the input to the denoising network in the inference process uses completely random noise. A planner samples several denoising time parameters from 1 to 0. The denoising network is repeatedly called, with each input noise being the current noise minus the output noise estimated in the previous call. The denoising time parameters used are those sampled by the planner. Different control conditions are selected based on the denoising time parameters. When the denoising time parameters are large, text description is used to obtain a more accurate overall structure. When the denoising time parameters are small, a denoising network based on a reference image and text description is used to fuse the features of multiple instances with the generated image results.
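The time-step-controlled switching between guidance modes can be sketched as a simple loop. The evenly spaced planner, the `threshold` value, the toy denoisers, and the plain subtraction update are illustrative assumptions. A real diffusion sampler would use a scheduler-specific update rule and trained networks.

```python
from typing import Callable, List, Tuple

Denoiser = Callable[[List[float], float], List[float]]


def plan_times(num_steps: int) -> List[float]:
    """Evenly spaced denoising time parameters descending from 1 towards 0."""
    return [1.0 - i / num_steps for i in range(num_steps)]


def guidance_mode(t: float, threshold: float) -> str:
    """Text-only guidance for large t, text plus reference guidance for small t."""
    return "text" if t > threshold else "text+reference"


def generate(noise: List[float], times: List[float], threshold: float,
             text_denoiser: Denoiser, reference_denoiser: Denoiser
             ) -> Tuple[List[float], List[str]]:
    """Repeatedly subtract the estimated noise, switching guidance by time step."""
    x, modes = list(noise), []
    for t in times:
        mode = guidance_mode(t, threshold)
        denoiser = text_denoiser if mode == "text" else reference_denoiser
        estimated = denoiser(x, t)
        x = [xi - ei for xi, ei in zip(x, estimated)]  # current minus estimated noise
        modes.append(mode)
    return x, modes


# Toy stand-ins for the trained networks: each removes a fixed fraction of x.
def text_only(x: List[float], t: float) -> List[float]:
    return [0.5 * v for v in x]


def text_and_reference(x: List[float], t: float) -> List[float]:
    return [0.25 * v for v in x]


result, modes = generate([1.0, -2.0], plan_times(4), 0.5, text_only, text_and_reference)
print(modes)
```

With four steps and a threshold of 0.5, the first two calls use text-only guidance to establish the overall structure, and the remaining calls use the reference-conditioned network to inject instance features.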
[0118] As shown in Figure 3, this embodiment also provides a specific implementation of the personalized image generation method, divided into offline and online stages. In the offline stage, data is first acquired from a mixed video-image dataset, and then associated data of image-text-instance segmentation masks is generated through an automated annotation system. Next, this associated data is augmented to increase the diversity and robustness of the training data. The augmented data is used to train a diffusion model that learns how to generate images from noise. In the online stage, the trained diffusion model receives personalized reference images and text descriptions as input. Using these inputs, the model generates the target image through a time-step-controlled hybrid guidance strategy, including an initial generation stage that generates the main structure and global semantics of the image based on the text description; and a refinement generation stage that fuses the intermediate generation results with the instance features of the reference image to output the final personalized image.
[0119] In summary, the personalized image generation method provided in this embodiment constructs a cross-modal automated annotation system, integrates the instance mapping relationship between real annotations and automatically annotated video datasets, and generates image-text-instance mask association data. It supports obtaining annotations from unannotated image data or selecting instances that are helpful for training from video keyframe data to construct data, thereby enabling the construction of diverse training datasets. By applying a feature write-read mechanism to achieve dynamic multi-instance feature fusion during the training phase, the method transforms the original method of generating images from single-instance features into a method of generating images from multi-instance features. This allows for the generation of personalized images using multiple reference images and text descriptions, solving the problems of unsatisfactory generation results, complex user operations, and inconsistencies between generated images and reference image features in existing technologies, thus achieving high-quality, flexible, and accurate personalized image generation.
[0120] To facilitate better implementation of the personalized image generation method of this application, this application also provides a personalized image generation apparatus. The meanings of the terms used are the same as in the personalized image generation method described above, and specific implementation details can be found in the descriptions of the method embodiments.
[0121] Please refer to Figure 4, which illustrates the structure of a personalized image generation device provided in this embodiment. Specifically, the personalized image generation device may include a data association module 201, a data processing module 202, a model training module 203, and an image generation module 204, as follows:
[0122] The data association module 201 is used to associate image data, text descriptions and instance segmentation masks through a cross-modal automated annotation system to generate an associated dataset of image-text-instance segmentation masks;
[0123] The data processing module 202 is used to construct a corresponding hybrid training dataset based on the associated dataset and to perform data augmentation on the hybrid training dataset;
[0124] The model training module 203 is used to train a preset diffusion model based on the data-augmented hybrid training dataset;
[0125] The image generation module 204 is used to generate target images using a hybrid guidance strategy with time step control through the trained diffusion model.
[0126] Furthermore, in some embodiments, the data association module 201 may specifically include:
[0127] The extraction unit is used to extract cross-frame images containing the same instance from video keyframe data and generate an initial instance segmentation mask through a pre-trained semantic segmenter.
[0128] The replacement unit is used to replace the initial instance segmentation mask with the real labeled mask of the video dataset by mapping the feature numbers;
[0129] The generation unit is used to generate a global semantic description of an image using a pre-trained text extraction model, and optimizes the semantic alignment between text and instances through entity parsing and lemmatization to generate an associated dataset.
[0130] Furthermore, in some embodiments, the data processing module 202 may specifically include:
[0131] The cropping unit is used to select multiple reference images of the same video segment from the associated dataset, or a copy of any static image, and to randomly expand and crop the target instance region.
[0132] The combination unit is used to combine the cropped reference image with the corresponding text description and generate enhanced training samples by random slicing.
[0133] The setting unit is used to set different denoising time step thresholds for video data and still images respectively, and to construct a hybrid training set based on the enhanced training samples.
[0134] Furthermore, in some embodiments, the model training module 203 may specifically include:
[0135] The replication unit is used to replicate the U-shaped network of the diffusion model as a reference feature extractor, extracting features from multiple reference images and storing them in the auxiliary module;
[0136] The fusion unit is used in the self-attention module to spatially concatenate the features stored in the auxiliary module with the features extracted by the denoising network, and retain the result that matches the shape of the denoised features by cropping, so as to dynamically fuse multi-instance features.
[0137] Furthermore, in some embodiments, the fusion unit is specifically used for:
[0138] In self-attention computation, reference feature weights are introduced to dynamically adjust the impact of multi-instance features on the generated results.
[0139] The concatenated feature map is cropped based on the spatial location of the denoising features, and local information that matches the current generation stage is retained.
[0140] Furthermore, in some embodiments, the image generation module 204 may specifically include:
[0141] The initial generation unit is used to generate the main structure and global semantics of the image based on the text description during the initial generation stage, and output the intermediate generation results.
[0142] The refinement generation unit is used to fuse the intermediate generation results with the instance features of the reference image during the refinement generation stage to output the target image.
[0143] Furthermore, in some embodiments, the refinement generation unit is specifically used for:
[0144] The spatial resolution of the reference instance features is aligned with that of the intermediate generated results using an interpolation algorithm.
[0145] The aligned reference instance features are fused by weighting along the channel dimension.
[0146] In summary, the personalized image generation device provided in this embodiment generates an associated dataset of image-text-instance segmentation masks through a cross-modal automated annotation system, solving the problem of unsatisfactory generation results caused by relying on a single reference image or text description in the prior art. By constructing a hybrid training dataset and performing data augmentation, this method can effectively improve the diversity and quality of training data, thereby enhancing the stability and consistency of generated images. Finally, the diffusion model is trained based on the data-augmented hybrid training dataset, and a hybrid guidance strategy with time step control is used to generate target images, which can dynamically balance the fusion of global semantics and local features, avoiding problems such as distorted or inconsistent features in the generated images.
[0147] Furthermore, embodiments of this application also provide an electronic device. Figure 5 illustrates the structure of an electronic device according to an embodiment of this application. Specifically, the electronic device may include components such as a processor 301 with one or more processing cores, a memory 302 with one or more computer-readable storage media, a power supply 303, and an input unit 304. Those skilled in the art will understand that the electronic device structure shown in Figure 5 does not constitute a limitation on the electronic device, which may include more or fewer components than shown, combine certain components, or have different component arrangements. Wherein:
[0148] The processor 301 is the control center of the electronic device. It connects various parts of the electronic device via various interfaces and lines, and performs various functions and processes data by running or executing software programs and / or modules stored in the memory 302, and by calling data stored in the memory 302, thereby providing overall monitoring of the electronic device. Optionally, the processor 301 may include one or more processing cores; preferably, the processor 301 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, and applications, and the modem processor mainly handles wireless communication. It is understood that the modem processor may not be integrated into the processor 301.
[0149] The memory 302 can be used to store software programs and modules. The processor 301 executes various functional applications and the personalized image generation method by running the software programs and modules stored in the memory 302. The memory 302 may mainly include a program storage area and a data storage area. The program storage area may store the operating system and the application programs required for at least one function (such as a sound playback function or an image playback function); the data storage area may store data created through use of the electronic device. In addition, the memory 302 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 302 may also include a memory controller to provide the processor 301 with access to the memory 302.
[0150] The electronic device also includes a power supply 303 that supplies power to the various components. Preferably, the power supply 303 can be logically connected to the processor 301 through a power management system, so that functions such as charging, discharging, and power consumption management are handled through the power management system. The power supply 303 may also include one or more DC or AC power supplies, recharging systems, power fault detection circuits, power converters or inverters, power status indicators, and any other such components.
[0151] The electronic device may also include an input unit 304, which can be used to receive input numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control.
[0152] Although not shown, the electronic device may also include a display unit, etc., which will not be described in detail here. Specifically, in this embodiment, the processor 301 in the electronic device loads the executable files corresponding to the processes of one or more applications into the memory 302 according to the following instructions, and the processor 301 runs the applications stored in the memory 302 to realize various functions, as follows:
[0153] Image data, text descriptions, and instance segmentation masks are associated through a cross-modal automated annotation system to generate an image-text-instance segmentation mask association dataset. Based on the association dataset, a corresponding hybrid training dataset is constructed, and data augmentation is performed on the hybrid training dataset. A preset diffusion model is trained based on the data-augmented hybrid training dataset. The target image is generated by the trained diffusion model using a hybrid guidance strategy with time step control.
[0154] For details on the implementation of each of the above operations, please refer to the preceding embodiments, which will not be repeated here.
[0155] This application's embodiments generate an associated dataset of image-text-instance segmentation masks through a cross-modal automated annotation system, solving the problem of unsatisfactory generation results caused by relying on a single reference image or text description in the prior art. By constructing a hybrid training dataset and performing data augmentation, this method can effectively improve the diversity and quality of training data, thereby enhancing the stability and consistency of generated images. Finally, the diffusion model is trained based on the data-augmented hybrid training dataset, and a hybrid guidance strategy with time step control is used to generate target images, which can dynamically balance the fusion of global semantics and local features, avoiding problems such as distorted generated images or inconsistent features.
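As a non-limiting illustration of the time step control described above, the following Python sketch shows one way the generation process might switch between the two stages recited in claim 1: text-only guidance in the initial generation stage, then text plus reference instance features in the refinement generation stage. The function names, the mode labels, and the `switch_fraction` threshold are hypothetical and are not taken from this application; the actual switching rule may differ.

```python
def guidance_mode(step_index, total_steps, switch_fraction=0.5):
    """Return the conditioning assumed to be active at one denoising step.

    step_index runs from 0 (first, noisiest step) to total_steps - 1.
    switch_fraction is a hypothetical threshold, not a value from the source.
    """
    if not 0 <= step_index < total_steps:
        raise ValueError("step_index out of range")
    if step_index < switch_fraction * total_steps:
        # Initial generation stage: main structure and global semantics from text.
        return "text_only"
    # Refinement generation stage: reference instance features are fused in.
    return "text_plus_reference"


def guidance_schedule(total_steps, switch_fraction=0.5):
    """List the assumed guidance mode for every denoising step, in order."""
    return [guidance_mode(i, total_steps, switch_fraction) for i in range(total_steps)]
```

Under these assumptions, a smaller `switch_fraction` would hand control to the reference features earlier, and a larger one would let the text description shape more of the image first.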
[0156] Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be performed by instructions, or by instructions controlling related hardware. These instructions can be stored in a computer-readable storage medium and loaded and executed by a processor.
[0157] Therefore, embodiments of this application provide a storage medium storing a plurality of instructions that can be loaded by a processor to execute steps in any of the personalized image generation methods provided in embodiments of this application. For example, the instructions can execute the following steps:
[0158] Image data, text descriptions, and instance segmentation masks are associated through a cross-modal automated annotation system to generate an image-text-instance segmentation mask association dataset. Based on the association dataset, a corresponding hybrid training dataset is constructed, and data augmentation is performed on the hybrid training dataset. A preset diffusion model is trained based on the data-augmented hybrid training dataset. The target image is generated by the trained diffusion model using a hybrid guidance strategy with time step control.
[0159] For details on the implementation of each of the above operations, please refer to the preceding embodiments, which will not be repeated here.
[0160] The storage medium may include: read-only memory (ROM), random access memory (RAM), disk, or optical disk, etc. Since the instructions stored in the storage medium can execute the steps of any of the personalized image generation methods provided in the embodiments of this application, the beneficial effects achievable by any of the personalized image generation methods provided in the embodiments of this application can be realized, as detailed in the preceding embodiments, and will not be repeated here.
[0161] The above provides a detailed description of the personalized image generation method, apparatus, electronic device, and storage medium provided in the embodiments of this application. Specific examples have been used to illustrate the principles and implementations of this application, and the description of the above embodiments is intended only to help in understanding the method and core ideas of this application. At the same time, those skilled in the art may, based on the ideas of this application, make changes to the specific implementations and the scope of application. Therefore, the content of this specification should not be construed as a limitation of this application.
Claims
1. A personalized image generation method, characterized in that it comprises the following steps: associating image data, text descriptions, and instance segmentation masks through a cross-modal automated annotation system to generate an associated dataset of image-text-instance segmentation masks; constructing a corresponding hybrid training dataset based on the associated dataset, and performing data augmentation on the hybrid training dataset; training a preset diffusion model based on the data-augmented hybrid training dataset, including: replicating the U-shaped network of the diffusion model as a reference feature extractor, extracting features from multiple reference images, and storing them in an auxiliary module; in the self-attention module, spatially concatenating the features stored in the auxiliary module with the features extracted by the denoising network, and cropping to retain the result that matches the shape of the denoising features; introducing reference feature weights in the self-attention calculation to dynamically adjust the influence of multiple instance features on the generation result; cropping and concatenating the feature map according to the spatial location of the denoising features, retaining local information that matches the current generation stage; and generating the target image by the trained diffusion model using a hybrid guidance strategy with time step control, including: in the initial generation stage, generating the main structure and global semantics of the image based on text descriptions and outputting intermediate generation results; in the refinement generation stage, aligning the spatial resolution of the reference instance features with the intermediate generation results using an interpolation algorithm, and performing weighted fusion of the aligned reference instance features in the channel dimension to output the target image.
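As a non-limiting illustration of the self-attention step recited in claim 1, the following Python sketch concatenates denoising tokens with stored reference tokens, applies a per-reference weight, and crops the output back to the shape of the denoising features. Several simplifications are assumptions made for this sketch and are not taken from this application: features are flattened to token lists, there are no learned query, key, or value projections, and each reference feature weight is applied as an additive `log(weight)` bias on the attention logits. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reference_weighted_self_attention(denoise_tokens, reference_tokens, reference_weights):
    """Self-attention over [denoising tokens; reference tokens], cropped to the denoising shape.

    denoise_tokens:    list of N feature vectors from the denoising network.
    reference_tokens:  one list of feature vectors per reference image.
    reference_weights: one positive float per reference image (assumed form).
    """
    if len(reference_tokens) != len(reference_weights):
        raise ValueError("one weight is required per reference image")
    tokens = list(denoise_tokens)
    bias = [0.0] * len(denoise_tokens)  # denoising tokens are left unweighted
    for ref, weight in zip(reference_tokens, reference_weights):
        if weight <= 0:
            raise ValueError("reference weights must be positive")
        tokens.extend(ref)                           # concatenation along the token axis
        bias.extend([math.log(weight)] * len(ref))   # weight w scales attention mass by w
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in tokens:
        logits = [dot(query, key) * scale + b for key, b in zip(tokens, bias)]
        attention = softmax(logits)
        outputs.append([sum(a * value[d] for a, value in zip(attention, tokens))
                        for d in range(dim)])
    # Cropping: keep only the rows that correspond to the denoising features.
    return outputs[:len(denoise_tokens)]
```

In this sketch, raising a reference weight shifts more attention mass toward that reference image's tokens, while a weight near zero makes the output approach plain self-attention over the denoising tokens alone.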
2. The personalized image generation method according to claim 1, characterized in that the step of associating image data, text descriptions, and instance segmentation masks through a cross-modal automated annotation system to generate an associated dataset of image-text-instance segmentation masks includes: extracting cross-frame images containing the same instance from video keyframe data, and generating an initial instance segmentation mask using a pre-trained semantic segmenter; replacing the initial instance segmentation mask with the ground-truth annotated mask of the video dataset through feature number mapping; and using a pre-trained text extraction model to generate a global semantic description of the image, and optimizing the semantic alignment between the text and the instance through entity parsing and lemmatization to generate the associated dataset.
3. The personalized image generation method according to claim 1, characterized in that the step of constructing a corresponding hybrid training dataset based on the associated dataset and performing data augmentation on the hybrid training dataset includes: selecting, from the associated dataset, multiple reference images from the same video segment or copies of any static image, and randomly expanding and cropping the target instance region; combining the cropped reference image with the corresponding text description, and then generating enhanced training samples by random slicing; and setting different denoising time step thresholds for video data and static images respectively, and constructing the hybrid training dataset based on the enhanced training samples.
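As a non-limiting illustration of the "randomly expanded and cropped" operation recited in claim 3, the following Python sketch takes the bounding box of an instance segmentation mask, grows each side by an independent random margin, clamps the box to the image, and crops. The function names, the `max_expand_ratio` parameter, and the choice of independent per-side margins are assumptions made for this sketch; the random slicing step and the time step thresholds are not modeled.

```python
import random


def mask_bounding_box(mask):
    """Return (top, left, bottom, right) of the nonzero region; bottom/right are exclusive."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    if not rows:
        raise ValueError("mask contains no instance pixels")
    return min(rows), min(cols), max(rows) + 1, max(cols) + 1


def random_expand_crop(image, mask, max_expand_ratio=0.3, rng=None):
    """Crop the instance region after growing its box by a random margin on each side.

    image and mask are H x W nested lists of the same size.
    max_expand_ratio is a hypothetical parameter: each margin is drawn
    uniformly from 0 to max_expand_ratio times the box height or width.
    """
    rng = rng or random.Random()
    top, left, bottom, right = mask_bounding_box(mask)
    box_h, box_w = bottom - top, right - left
    height, width = len(image), len(image[0])
    # Independent random margins, clamped so the crop stays inside the image.
    new_top = max(0, top - int(rng.uniform(0, max_expand_ratio) * box_h))
    new_left = max(0, left - int(rng.uniform(0, max_expand_ratio) * box_w))
    new_bottom = min(height, bottom + int(rng.uniform(0, max_expand_ratio) * box_h))
    new_right = min(width, right + int(rng.uniform(0, max_expand_ratio) * box_w))
    crop = [row[new_left:new_right] for row in image[new_top:new_bottom]]
    return crop, (new_top, new_left, new_bottom, new_right)
```

Passing a seeded `random.Random` instance as `rng` makes the augmentation reproducible, which can help when comparing training runs.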
4. A personalized image generation device, characterized in that it comprises: a data association module, used to associate image data, text descriptions, and instance segmentation masks through a cross-modal automated annotation system to generate an associated dataset of image-text-instance segmentation masks; a data processing module, used to construct a corresponding hybrid training dataset based on the associated dataset and to perform data augmentation on the hybrid training dataset; a model training module, used to train a preset diffusion model based on the data-augmented hybrid training dataset, including: replicating the U-shaped network of the diffusion model as a reference feature extractor; extracting features from multiple reference images and storing them in an auxiliary module; in the self-attention module, spatially concatenating the features stored in the auxiliary module with the features extracted by the denoising network, and cropping to retain the result matching the shape of the denoising features; introducing reference feature weights in the self-attention calculation to dynamically adjust the influence of multiple instance features on the generation result; and cropping the concatenated feature map according to the spatial location of the denoising features, retaining local information matching the current generation stage; and an image generation module, used to generate a target image by the trained diffusion model using a hybrid guidance strategy with time step control, including: in the initial generation stage, generating the main structure and global semantics of the image based on text descriptions and outputting intermediate generation results; in the refinement generation stage, aligning the spatial resolution of the reference instance features with the intermediate generation results using an interpolation algorithm, and performing weighted fusion of the aligned reference instance features in the channel dimension to output the target image.
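As a non-limiting illustration of the refinement generation stage recited above, the following Python sketch resizes each reference instance feature map to the spatial resolution of the intermediate result and then blends them channel by channel. The claims name only "an interpolation algorithm" and "weighted fusing in the channel dimension"; nearest-neighbour interpolation and the convex blending rule used here are assumptions made for this sketch, and the function names are hypothetical.

```python
def resize_nearest(feature_map, out_h, out_w):
    """Resize a C x H x W nested-list feature map with nearest-neighbour sampling."""
    resized = []
    for channel in feature_map:
        in_h, in_w = len(channel), len(channel[0])
        resized.append([
            [channel[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
            for y in range(out_h)
        ])
    return resized


def fuse_references(intermediate, references, weights):
    """Blend aligned reference features into the intermediate result, per channel.

    Assumed rule (not from the source):
        fused = (1 - sum(weights)) * intermediate + sum_i weights[i] * reference_i
    All feature maps must have the same number of channels.
    """
    if len(references) != len(weights):
        raise ValueError("one weight is required per reference feature map")
    total = sum(weights)
    if not 0.0 <= total <= 1.0:
        raise ValueError("weights must sum to a value in [0, 1]")
    height, width = len(intermediate[0]), len(intermediate[0][0])
    # Spatial alignment: bring every reference to the intermediate resolution.
    aligned = [resize_nearest(ref, height, width) for ref in references]
    fused = []
    for c, channel in enumerate(intermediate):
        fused.append([
            [(1.0 - total) * channel[y][x]
             + sum(w * ref[c][y][x] for w, ref in zip(weights, aligned))
             for x in range(width)]
            for y in range(height)
        ])
    return fused
```

A smoother interpolation such as bilinear could be substituted for `resize_nearest` without changing the fusion step.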
5. An electronic device, characterized in that it comprises: a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the personalized image generation method as described in any one of claims 1-3.
6. A storage medium, characterized in that it stores a computer program that can be loaded by a processor to execute the steps of the personalized image generation method as described in any one of claims 1-3.