A method for generating music based on visual semantic synapses of images and texts
By employing visual semantic synaptic mechanisms and multimodal contrastive learning, the problem of insufficient utilization of semantic information in image-to-music generation is solved, achieving high-quality, semantically consistent music generation suitable for various application scenarios.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- EAST CHINA NORMAL UNIV
- Filing Date
- 2026-03-25
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies cannot effectively utilize the semantic information of images in image-to-music generation methods, resulting in low-quality and semantically inconsistent generated music.
By introducing a visual semantic synapse mechanism, and combining the self-attention features of images with the cross-attention features of the music generation network through adaptive loss function and multimodal contrastive learning, the loss weights are dynamically adjusted to optimize the training process of the music generation network and improve semantic understanding and generation quality.
It significantly improves the semantic consistency and quality of music generation, enhances the model's generalization ability in various application scenarios, and is suitable for short video background music, advertising music creation, and game scene music.
Figure CN122245259A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of artificial intelligence and music generation technology, specifically to a method for jointly generating music from images and text based on visual semantic synapses.
Background Technology
[0002] Music is an indispensable part of creative media, widely used in film, social media content creation, and other fields. Traditional music generation methods mainly rely on text descriptions, generating music through text-to-music models. However, this method has significant limitations: it ignores the rich semantic information contained in images and fails to fully utilize visual content to guide music generation. Although some research has attempted to use image-guided audio generation, these methods typically rely only on the pixel information of images, ignoring the rich semantic content they contain. For example, when generating music matching "a bustling street market," using only the pixel information of the image cannot accurately capture key semantics such as "bustling" and "street market," resulting in low-quality generated music. With the successful application of diffusion models in image generation, music generation is also gradually adopting diffusion models. Methods such as Riffusion, Moûsai, and Noise2Music generate music using diffusion models, but most of these methods rely only on text descriptions and fail to fully utilize the semantic information of images.
[0003] Existing image-to-music generation methods typically employ two strategies: 1) converting images into text descriptions using an image description generator, and then using a text-to-music model to generate music; 2) directly generating music using only the pixel information of the image. However, both methods have significant shortcomings: the first method leads to the loss of semantic information and a decrease in quality; the second method ignores the semantic content of the image, failing to generate music that highly matches the image content. Therefore, there is an urgent need for a music generation method that can combine image and text semantic information to achieve higher quality and more semantically consistent music creation.
Summary of the Invention
[0004] The purpose of this invention is to address the shortcomings of existing technologies by providing a method for jointly generating music from images and text based on visual semantic synapses. The visual semantic synapse mechanism integrates the semantic information of images into the music generation process, so that the music generation network can understand the semantic content of an image and generate matching music, improving the quality and semantic consistency of the generated music. An adaptive loss function dynamically adjusts the weights of the perceptual and semantic losses, improving training stability and generation quality, especially when handling complex semantic content. Multimodal contrastive learning improves generalization ability, so that the method can provide high-quality audio output in various application scenarios. The evaluation results show high reliability and consistency, demonstrating promising application prospects in short video soundtracks, advertising music creation, and game scene music.
[0005] The objective of this invention is achieved as follows: A method for jointly generating music from images and text based on visual semantic synapses, characterized by the following specific steps:
[0006] 1) Extract the semantic description of the image, specifically including the following steps:
[0007] 1.1: The input image is uniformly adjusted to 224x224 pixels and scaled using bilinear interpolation to ensure consistent image size and maintain its original proportions and structure. The scaling ratio is 0.8 to 1.2.
[0008] 1.2: The RGB color channels of the image are standardized: the per-channel mean [0.485, 0.456, 0.406] is subtracted from the pixel value of each channel, and the result is divided by the per-channel standard deviation [0.229, 0.224, 0.225], so that the pixel values of each channel are distributed in the range of [-1, 1]. The mean and standard deviation are obtained from the ImageNet dataset.
[0009] 1.3: Input the processed image into the pre-trained CLIP model to extract semantic descriptions related to the image content.
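Steps 1.1 and 1.2 can be illustrated with a minimal sketch in plain Python. It assumes the image is given as three channels of nested lists with values already scaled to [0, 1]. The function names `bilinear_resize`, `normalize_channel`, and `preprocess` are hypothetical and are not part of the disclosed method; the random scaling of 0.8 to 1.2 and the CLIP step are omitted. Note that for inputs in [0, 1], these mean and standard deviation values produce outputs of roughly -2.12 to 2.64, so the [-1, 1] range stated above is best read as approximate.

```python
# Illustrative preprocessing sketch; names and data layout are assumptions.
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def bilinear_resize(channel, out_h, out_w):
    """Resize one channel (list of rows of floats) with bilinear interpolation."""
    in_h, in_w = len(channel), len(channel[0])
    resized = []
    for i in range(out_h):
        # Map the output pixel centre back into input coordinates, then clamp.
        y = min(max((i + 0.5) * in_h / out_h - 0.5, 0.0), in_h - 1.0)
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for j in range(out_w):
            x = min(max((j + 0.5) * in_w / out_w - 0.5, 0.0), in_w - 1.0)
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = channel[y0][x0] * (1 - wx) + channel[y0][x1] * wx
            bottom = channel[y1][x0] * (1 - wx) + channel[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        resized.append(row)
    return resized


def normalize_channel(channel, mean, std):
    """Subtract the channel mean and divide by the channel standard deviation."""
    return [[(value - mean) / std for value in row] for row in channel]


def preprocess(image_rgb, size=224):
    """image_rgb: three channels (R, G, B), each a list of rows in [0, 1]."""
    return [
        normalize_channel(bilinear_resize(channel, size, size), mean, std)
        for channel, mean, std in zip(image_rgb, IMAGENET_MEAN, IMAGENET_STD)
    ]
```

In practice this work would normally be delegated to an image library; the sketch only makes the arithmetic of the two steps explicit.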
[0010] 2) The image is converted into a latent spatial representation using the DDIM inverse process, and self-attention features are extracted.
[0011] Specifically, the following steps are included:
[0012]-[0013] 2.1: The image is converted into a latent spatial representation using the DDIM inverse process, ensuring that the original image can be reconstructed from the latent representation;
[0014] 2.2: Extracting self-attention features of latent spatial representation of images from a pre-trained text-image diffusion model.
[0015] 3) Integrating image semantics into the music generation network using visual semantic synapses, specifically including the following steps:
[0016] 3.1: Visual semantic synapses are introduced into the decoder layer of the music generation network to fuse the self-attention features of images with the cross-attention features of the music generation network;
[0017] 3.2: The fusion ratio of image features and music generation features is controlled by the learned α parameter;
[0018] 3.3: Music generation using the fused feature representation.
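The fusion in steps 3.1 and 3.2 can be pictured with the following minimal sketch. The text only states that a learned α controls the fusion ratio, so the convex blend used here, the sigmoid parameterization that keeps α in (0, 1), and the name `synapse_fuse` are all assumptions made for illustration. The sketch also assumes both feature maps already have the same shape; a real network would likely need a projection layer first.

```python
import math


def sigmoid(z):
    """Squash an unconstrained learnable value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def synapse_fuse(image_self_attn, music_cross_attn, alpha_logit):
    """Blend two equally shaped feature maps (lists of rows) with ratio alpha.

    alpha_logit stands in for the learned parameter; alpha = sigmoid(alpha_logit)
    weights the image features and (1 - alpha) weights the music features.
    """
    alpha = sigmoid(alpha_logit)
    return [
        [alpha * img + (1.0 - alpha) * mus for img, mus in zip(img_row, mus_row)]
        for img_row, mus_row in zip(image_self_attn, music_cross_attn)
    ]


# With alpha_logit = 0 the two sources contribute equally.
fused = synapse_fuse([[1.0, 0.0]], [[0.0, 1.0]], alpha_logit=0.0)
```

During training, `alpha_logit` would be updated by gradient descent together with the other network weights, which is one way to read "obtained through training".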
[0019]-[0020] 4) Adopt an adaptive loss function, combining perceptual loss and semantic loss, to improve the training stability and generation quality of the model, specifically including the following steps:
[0021] 4.1: Design an adaptive loss function that weights and sums the perceptual loss and semantic loss. The perceptual loss helps the network capture the perceptual quality of the music, while the semantic loss enhances the network's adaptability to the semantic content of the image.
[0022] 4.2: Dynamically adjust the loss weights so that different types of losses can be optimized at different stages of the training process, avoiding overfitting or instability during training.
[0023] 4.3: By optimizing the adaptive loss function, the music generation network's ability to predict music quality is improved, ensuring the stability of the training process and the high quality of the generated results.
[0024]-[0025] 5) Utilize multimodal contrastive learning to optimize the generalization ability of the music generation network and reduce reliance on manually annotated data, specifically including the following steps:
[0026] 5.1: Optimize the music generation network by using a multimodal contrastive learning method to compare the semantic feature representations of high-quality and low-quality music clips;
[0027] 5.2: During the comparative learning process, the music generation network gradually learns the relationship between music quality and image semantic content, enabling it to adapt to image and text data with different semantic content and to have strong generalization ability.
[0028] 5.3: Through multimodal contrastive learning, the reliance on a large amount of manually labeled data is reduced, enabling the model to achieve efficient and high-quality music generation in a variety of application scenarios.
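One common way to realize the comparison in step 5.1 is an InfoNCE-style contrastive loss, sketched below. This is an assumption: the text does not name a specific loss. Here the image semantic feature acts as the anchor, a high-quality clip as the positive, and low-quality clips as negatives; the function names, the temperature value, and the toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: small when the positive is closest to the anchor."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[0]


# Toy embeddings (made up): the high-quality clip lies near the image semantics.
image_semantics = [1.0, 0.0]
high_quality_clip = [0.9, 0.1]
low_quality_clips = [[0.0, 1.0], [-1.0, 0.0]]
loss = contrastive_loss(image_semantics, high_quality_clip, low_quality_clips)
```

Minimizing this quantity pulls the representation of well-matched, high-quality music toward the image semantics and pushes low-quality clips away, without requiring per-sample manual labels beyond the quality split.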
[0029]-[0030] 6) The trained music generation network accurately predicts the music corresponding to the input image and text and provides high-quality output for different application scenarios, specifically including the following steps:
[0031] 6.1: The trained music generation network is applied to real-world music generation tasks to accurately predict music that matches the input image and text;
[0032] 6.2: Based on the prediction results, the music generation network can provide high-quality music for various application scenarios such as short video background music, advertising music creation, and game scene music.
[0033] Compared with the prior art, the present invention has the following beneficial technical effects and significant technical progress:
[0034] 1) This invention proposes a "visual semantic synapse" mechanism, which controllably fuses the self-attention features of an image in its latent space with the cross-attention structure of a music generation network, enabling music generation driven from the semantic level of the image (rather than just pixels or low-level visual features). This mechanism utilizes high-level semantic information extracted by pre-trained multimodal models (such as CLIP) and combines it with the self-attention representation in the diffusion model, enabling the music generation network to truly understand the scene type, emotional tendency, and object semantics expressed by the image. This results in music that is highly consistent with the visual content in terms of emotion, rhythm, and style, significantly improving the semantic alignment and auditory performance quality of cross-modal generation.
[0035] 2) This invention innovatively introduces a multimodal contrastive learning strategy. Without requiring extensive manually labeled pairing data, it guides the model to learn the implicit correlation between image semantics and music quality by comparing the representational differences between high-quality and low-quality music clips in the semantic space. This method effectively alleviates the dependence on labeled data, enhances the model's generalization ability in open-domain scenarios, and enables it to robustly adapt to diverse input content. It is widely applicable to practical applications such as intelligent music accompaniment for short videos, background music generation for advertisements, and dynamic sound effect synthesis for games, significantly improving the system's practicality and deployment flexibility.
[0036] 3) This invention has higher music generation quality, stronger semantic consistency, and better generalization ability. Through the visual semantic synapse mechanism, the music generation network can understand the semantic content of the image and generate matching music accordingly. Combined with multimodal contrastive learning to improve generalization ability, it can provide high-quality audio output for music generation in a variety of application scenarios. The evaluation results have high reliability and consistency, and the invention has good application prospects in the fields of short video background music, advertising music creation, and game scene music.
Attached Figure Description
[0037] Figure 1 is a flowchart of the present invention;
[0038] Figure 2 is a schematic diagram illustrating the specific operation of Example 1.
Detailed Implementation
[0039] Referring to Figure 1, the present invention specifically includes the following steps:
[0040] 1) Extract semantic description of the image
[0041]-[0042] 1.1: Augmentation techniques such as random rotation, random scaling, random cropping, and brightness/contrast adjustment are employed to increase the diversity of the input data, further improving the model's adaptability to various image contents and its robustness, specifically including:
[0043] 1.1.1: The input images are uniformly adjusted to 224x224 pixels and scaled using bilinear interpolation to ensure consistent image size and maintain their original proportions and structure;
[0044] 1.1.2: Standardize the RGB color channels of the image, subtracting the mean value [0.485, 0.456, 0.406] from the pixel value of each channel;
[0045] 1.1.3: Divide by the standard deviation [0.229, 0.224, 0.225] to ensure that the pixel values of each channel are distributed in the range of [-1, 1].
[0046] 1.2: The processed image is input into the pre-trained CLIP model, which is trained through contrastive learning and can map images and text to the same semantic space, thereby extracting semantic descriptions related to the image content.
[0047] 1.3: The extracted semantic description is encoded to obtain a semantic feature vector, which contains high-level semantic information of the image content, such as scene type, sentiment, object category, etc.
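Because CLIP maps images and text into the same semantic space, one simple way to obtain a semantic description is to rank candidate descriptions by cosine similarity to the image embedding. The sketch below illustrates only that ranking step. The embeddings are made-up three-dimensional vectors rather than outputs of a real CLIP encoder, and `rank_descriptions` and the candidate texts are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_descriptions(image_embedding, text_embeddings):
    """Return (similarity, description) pairs, best match first."""
    scored = [
        (cosine_similarity(image_embedding, embedding), description)
        for description, embedding in text_embeddings.items()
    ]
    return sorted(scored, reverse=True)


# Toy stand-ins for encoder outputs; real embeddings have hundreds of dimensions.
image_embedding = [0.9, 0.1, 0.2]
candidates = {
    "a bustling street market": [0.8, 0.2, 0.1],
    "a quiet beach at dusk": [0.1, 0.9, 0.3],
}
ranking = rank_descriptions(image_embedding, candidates)
best_description = ranking[0][1]
```

The top-ranked description, or its embedding, would then serve as the semantic feature vector referred to in step 1.3.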
[0048] 2) The image is converted into a latent spatial representation using the DDIM inverse process, and self-attention features are extracted, specifically including:
[0049] 2.1: The image is converted into a latent spatial representation using the DDIM inverse process, ensuring that the original image can be reconstructed from the latent representation;
[0050] 2.2: Extract self-attention features of the latent spatial representation of images from a pre-trained text-image diffusion model. These features contain semantic information of the images.
[0051] 2.3: The extracted self-attention features are used for subsequent visual-semantic synaptic fusion.
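Step 2.1 can be sketched with the standard deterministic DDIM update, run from the clean latent toward noise (inversion) and then back again (reconstruction). The noise predictor below is a hypothetical toy that returns a fixed vector, which makes the round trip exact up to floating-point error; with a real diffusion network the reconstruction is only approximate. The schedule values and function names are assumptions, and extraction of self-attention features (step 2.2) is not shown.

```python
import math


def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update between two cumulative-alpha noise levels."""
    # Estimate the clean latent implied by the current sample and noise.
    x0_pred = [
        (xi - math.sqrt(1.0 - alpha_from) * ei) / math.sqrt(alpha_from)
        for xi, ei in zip(x, eps)
    ]
    # Re-noise that estimate to the target level using the same noise.
    return [
        math.sqrt(alpha_to) * x0i + math.sqrt(1.0 - alpha_to) * ei
        for x0i, ei in zip(x0_pred, eps)
    ]


def ddim_invert(x0, predict_noise, alphas):
    """Walk a clean latent toward noise; alphas must be decreasing."""
    x = list(x0)
    for t in range(len(alphas) - 1):
        x = ddim_step(x, predict_noise(x, t), alphas[t], alphas[t + 1])
    return x


def ddim_reconstruct(latent, predict_noise, alphas):
    """Walk the inverted latent back to the clean latent."""
    x = list(latent)
    for t in range(len(alphas) - 1, 0, -1):
        x = ddim_step(x, predict_noise(x, t), alphas[t], alphas[t - 1])
    return x


def toy_noise_predictor(x, t):
    """Hypothetical stand-in for the diffusion network; ignores its inputs."""
    return [0.3, -0.2, 0.1]


alphas = [0.999, 0.9, 0.7, 0.4, 0.1]  # assumed cumulative-alpha schedule
clean = [1.0, -0.5, 0.25]
latent = ddim_invert(clean, toy_noise_predictor, alphas)
restored = ddim_reconstruct(latent, toy_noise_predictor, alphas)
```

The reconstruction property is what allows features extracted along the inversion path to be treated as faithful to the original image.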
[0052] 3) Integrating image semantics into music generation networks using visual semantic synapses, specifically including:
[0053] 3.1: Visual semantic synapses are introduced into the decoder layer of the music generation network to fuse the self-attention features of images with the cross-attention features of the music generation network;
[0054] 3.2: The fusion ratio of image features and music generation features is controlled by the learned α parameter, which is obtained through training;
[0055] 3.3: Music generation is performed using the fused feature representations to ensure that the generated music is highly consistent with the semantic content of the input image and text.
[0056]-[0057] 4) Employ an adaptive loss function, combining perceptual loss and semantic loss, to improve the stability of the training process and the generation quality, specifically including:
[0058] 4.1: Design an adaptive loss function that weights and sums the perceptual loss and semantic loss.
[0059] Perceptual loss is mainly used to capture the subjective perceived quality of music, while semantic loss improves the network's adaptability to the semantic content of images.
[0060] 4.2: A dynamic adjustment method for loss weights is adopted to ensure that the network can adaptively adjust the weights of perceptual loss and semantic loss according to the training objectives at different stages during the training process, thereby avoiding overfitting and improving the stability of the training process.
[0061] 4.3: Through optimization of the adaptive loss function, the music generation network can more accurately predict the music that matches the input image and text, and can maintain a high level of generation quality, especially when faced with complex semantic content.
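The weighted sum in step 4.1 and the stage-dependent weights in step 4.2 can be sketched as follows. The text does not specify how the weights are adjusted, so the linear schedule here, which gradually shifts emphasis from perceptual loss to semantic loss, is purely an assumption; the function names and the start and end values are hypothetical.

```python
def loss_weights(step, total_steps, sem_start=0.2, sem_end=0.6):
    """Return (perceptual_weight, semantic_weight) for the current step.

    The semantic weight moves linearly from sem_start to sem_end over
    training, and the two weights always sum to 1.
    """
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)
    semantic_weight = sem_start + (sem_end - sem_start) * progress
    return 1.0 - semantic_weight, semantic_weight


def adaptive_loss(perceptual_loss, semantic_loss, step, total_steps):
    """Weighted sum of the two losses using the stage-dependent weights."""
    w_perceptual, w_semantic = loss_weights(step, total_steps)
    return w_perceptual * perceptual_loss + w_semantic * semantic_loss


# Halfway through training the weights are 0.6 (perceptual) and 0.4 (semantic).
midpoint_loss = adaptive_loss(1.0, 2.0, step=50, total_steps=100)
```

Other schemes, such as deriving the weights from the recent magnitude of each loss, would fit the same interface; only `loss_weights` would change.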
[0062] 5) Utilize multimodal contrastive learning to further improve the model's generalization ability and reduce reliance on manually labeled data. Specifically, this includes:
[0063] 5.1: Using a multimodal contrastive learning method, the music generation network is optimized by comparing the semantic feature representations of high-quality and low-quality music clips;
[0064] 5.2: During the contrastive learning process, the music generation network gradually learns the relationship between music quality and image semantic content by comparing the semantic features between music segments of different qualities, thereby improving its generalization ability and enabling the model to efficiently adapt to different types of image and text inputs.
[0065] 5.3: Multimodal contrastive learning reduces the reliance on manually labeled data, enabling music generation networks to generate music efficiently and with high quality in various application scenarios.
[0066] 6) A well-trained music generation network can accurately predict the music corresponding to input images and text, and provides high-quality output for different application scenarios, including:
[0067] 6.1: A well-trained music generation network can be used for practical music generation tasks and can accurately predict music that matches the input image and text, supporting various application scenarios such as short video background music, advertising music creation, and game scene music.
[0068] 6.2: Based on the prediction results, the model can provide high-quality music for different fields, supporting content creation and optimization decisions. The invention will be further illustrated below with specific embodiments.
[0069] Example 1
[0070] Referring to Figure 2, this embodiment first performs standardized preprocessing on the input image and extracts its high-level semantic description using a pre-trained CLIP model. Simultaneously, the image is mapped to the latent space of the diffusion model through the DDIM inverse process, and self-attention features are extracted from it. Subsequently, a visual semantic synapse mechanism is introduced into the decoder of the music generation network to dynamically fuse the self-attention features of the image with the cross-attention features of the music sequence; the fusion ratio is controlled by a learnable parameter α. During training, an adaptive loss function is used, combining perceptual and semantic losses, and the model is optimized through multimodal contrastive learning. This enables the model to generate high-quality music with style matching and emotional consistency based on the semantics of the image and text without requiring extensive manual annotation. This method has been successfully applied to scenarios such as short video background music and advertising background music generation, verifying its effectiveness and generalization ability.
[0071] The above embodiments are merely illustrative of the present invention and are not intended to limit the scope of this patent. Any equivalent implementations of the present invention should be included within the scope of the claims of this patent.
Claims
1. A method for jointly generating music from images and text based on visual semantic synapses, characterized in that, The method specifically includes the following steps: 1) Input images and text into a pre-trained visual-language model to extract semantic descriptions of the images; 2) Use the DDIM inverse process to convert the image into a latent spatial representation; 3) Extract the self-attention features of the latent spatial representation of the image, and integrate these features into the cross-attention layer of the music generation network through visual semantic synapses; 4) Apply an adaptive loss function to the input images and text to calculate perceptual loss and semantic loss; 5) Optimize the diffusion generation process through contrastive consistency loss, adaptive conditional weight adjustment, and structure-preserving regularization, so that the output is a music audio signal related to the semantics of the input image and language; 6) Use the trained music generation network to predict the music corresponding to the input image and text.
2. The method for jointly generating music from images and text based on visual semantic synapses according to claim 1, characterized in that, Step 1) specifically includes: 1.1: The input images are uniformly adjusted to 224x224 pixels and scaled using bilinear interpolation to ensure consistent image size while maintaining their original proportions and structure; 1.2: The RGB three color channels of the image are standardized. The pixel value of each channel is reduced by the mean [0.485, 0.456, 0.406] and then divided by the standard deviation [0.229, 0.224, 0.225], so that the pixel value of each channel is distributed in the range of [-1, 1]. 1.3: Input the processed image into the pre-trained CLIP model to extract semantic descriptions related to the image content.
3. The method for jointly generating music from images and text based on visual semantic synapses according to claim 1, characterized in that, Step 2) specifically includes: 2.1: The image is converted into a latent spatial representation using the DDIM inverse process, and the original image is reconstructed from the latent representation; 2.2: Extracting self-attention features of latent spatial representation of images from a pre-trained text-image diffusion model.
4. The method for jointly generating music from images and text based on visual semantic synapses according to claim 1, characterized in that, Step 3) specifically includes: 3.1: Visual semantic synapses are introduced into the decoder layer of the music generation network to fuse the self-attention features of images with the cross-attention features of the music generation network; 3.2: The fusion ratio of image features and music generation features is controlled by the learned α parameter; 3.3: Music generation using the fused feature representation.
5. The method for jointly generating music from images and text based on visual semantic synapses according to claim 1, characterized in that, Step 4) specifically includes: 4.1: The music generation network is trained using a combination of perceptual loss and semantic loss to enhance its adaptability to different semantic content and improve the overall quality and semantic consistency of music generation. 4.2: Dynamically adjust the weights of perceptual loss and semantic loss to optimize the loss function at different training stages.
6. The method for jointly generating music from images and text based on visual semantic synapses according to claim 1, characterized in that, Step 5) specifically includes: 5.1: The music generation network is optimized by comparing the semantic feature representations of high-quality and low-quality music clips; 5.2: The music generation network model is trained by calculating the semantic similarity and differences between music segments to improve the generalization ability of the music generation network; 5.3: Employing multimodal contrastive learning enables the music generation network to learn useful semantic features from a large amount of unlabeled data, improving its adaptability to different types of image and text inputs and reducing its reliance on manually labeled data.
7. The method for jointly generating music from images and text based on visual semantic synapses according to claim 2, characterized in that, The scaling ratio is: 0.8–1.2; The standardization process involves subtracting the mean from the pixel value of each color channel and then dividing by the standard deviation; The mean and standard deviation are derived from the ImageNet dataset.