A multi-modal feature alignment method based on an implicit feature space

By employing a multimodal feature alignment method based on implicit feature space, and utilizing the large model DeepSeek R1 and cross-modal attention mechanism, the integration challenge in multimodal data processing is solved, achieving efficient and accurate information processing and powerful correlation discovery capabilities for cross-modal tasks.

CN120705808B · Active · Publication Date: 2026-06-16 · BEIJING INST OF COMP TECH & APPL


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: BEIJING INST OF COMP TECH & APPL
Filing Date: 2025-06-19
Publication Date: 2026-06-16


IPC: G06F18/25; G06F18/213; G06N3/045; G06N3/096; G06N3/092

CPC: G06F18/253; G06F18/213; G06N3/045; G06N3/096; G06N3/092

AI Tagging

Application Domain

Biological models




AI Technical Summary

Technical Problem

Existing technologies struggle to fully integrate information from complex scenarios when processing multimodal data, especially text, image, and audio data, resulting in limited application effectiveness.

Method used

The method aligns multimodal features in an implicit feature space: it fine-tunes the pre-trained DeepSeek R1 large model on multimodal data, introduces a cross-modal attention mechanism, and uses the GRPO algorithm for reinforcement learning to fuse and map text, image, and audio features.

Benefits of technology

It improves the efficiency and accuracy of multimodal data analysis, enhances the ability to discover associations in cross-modal tasks, and is suitable for application scenarios such as visual question answering, sentiment analysis, and cross-modal retrieval.

✦ Generated by Eureka AI based on patent content.


Abstract

The present application relates to a multimodal feature alignment method based on an implicit feature space, belonging to the fields of artificial intelligence and multimodal data processing. The method is text-centered: it uses a pre-trained large model, with fine-tuning, to extract image and audio information. It then constructs an implicit feature space that captures deep associations between different data types, and fine-tunes the model within a contrastive learning framework to generate feature representations that reflect the intrinsic relationships among the modalities. Unlike traditional methods, this technique does not rely on explicit correspondence labels, which reduces the need for large-scale labeled datasets and improves the model's generalization ability and adaptability.


Description

Technical Field

[0001] This invention belongs to the fields of artificial intelligence and multimodal data processing, and specifically relates to a multimodal feature alignment method based on implicit feature space.

Background Technology

[0002] As the complexity and diversity of new media content continue to increase, traditional methods have encountered many limitations when processing multimodal data. These methods typically focus on a single type of data and cannot fully integrate the complex relationships between multiple information sources such as text, images, and audio, resulting in limited effectiveness in complex scenarios.

[0003] To address these issues, this invention proposes a multimodal feature alignment method based on an implicit feature space. This method centers on text and uses a pre-trained large model to extract and fine-tune features from image and audio information. Then, it constructs an implicit feature space capable of capturing deep-seated relationships between different data types. A contrastive learning framework is used to train the model, generating feature representations that reflect the intrinsic connections between each modality. Unlike traditional methods, this technique does not rely on explicit correspondence annotations, reducing the need for large-scale labeled datasets and thus improving the model's generalization ability and adaptability.

[0004] This invention is applicable to various scenarios, such as visual question answering, image caption generation, cross-modal retrieval, multimodal sentiment analysis, and human-computer interaction. It not only achieves efficient and accurate processing and understanding of new media information but also meets diverse needs such as intelligence mining, public opinion monitoring, topic tracking, and brand reputation analysis. This technical solution enhances the efficiency and accuracy of multimodal data analysis, providing strong support for intelligent information processing.

Summary of the Invention

[0005] (I) Technical Problems to Be Solved

[0006] The technical problem to be solved by this invention is how to provide a multimodal feature alignment method based on implicit feature space, so as to solve the shortcomings of multimodal data processing in the prior art, as well as the limitations of multimodal data in feature representation and alignment.

[0007] (II) Technical Solution

[0008] To address the aforementioned technical problems, this invention proposes a multimodal feature alignment method based on implicit feature space, which includes the following steps:

[0009] Step 1: Deploy a large DeepSeek R1 model locally;

[0010] Step 2: Fine-tuning the large model on multimodal data

[0011] Input the raw text data into DeepSeek to obtain the text features $H_T$ of the text information;

[0012] Input the raw text and audio data into DeepSeek as text-audio pairs; following the text prompts, DeepSeek outputs the audio features $H_A$ that the prompts request;

[0013] Input the raw text and image data into DeepSeek as text-image pairs; following the text prompts, DeepSeek outputs the image features $H_I$ that the prompts request;

[0014] Fine-tune the open-source DeepSeek large model using the GRPO algorithm;

[0015] Step 3: Cross-modal attention mechanism

[0016] For the aforementioned text, image, and speech features, a cross-modal attention layer is introduced to enable different modalities to learn from each other and strengthen the correlation between them;

[0017] Step 4: Feature Fusion and Mapping

[0018] The features of all modes are mapped to the shared latent space z through a linear transformation.

[0019] (III) Beneficial Effects

[0020] This invention proposes a multimodal feature alignment method based on implicit feature space. The beneficial effects of this method are reflected in the following aspects:

[0021] (1) In multimodal information processing, leveraging the powerful feature extraction and semantic understanding capabilities of large models, semantic feature extraction can be performed on heterogeneous data such as images and audio, centered on text. Specifically, through pre-trained large language models and multimodal interfaces, key features related to text descriptions can be extracted from original images or audio, while ignoring irrelevant or redundant information. Simultaneously, based on text guidance, semantic alignment and local correction can be performed on visual elements in images or speech content in audio, such as enhancing the recognition of specific objects and correcting ambiguous words in speech recognition. This method not only improves the consistency and accuracy of cross-modal data but also provides strong support for building text-driven intelligent content editing systems.

[0022] (2) An architecture combining cross-modal attention mechanisms was designed to capture complex relationships between different modalities. By introducing self-attention and cross-modal attention layers, not only is the model's understanding of information within each modality enhanced, but semantic consistency of data from different modalities in the shared latent space is also ensured. This enables the model to perform association discovery in subsequent cross-modal tasks, such as visual question answering, sentiment analysis, and cross-modal retrieval.

Attached Figure Description

[0023] Figure 1 shows the overall framework of the multimodal feature alignment method based on implicit feature space in this invention.

Detailed Implementation

[0024] To make the objectives, contents, and advantages of the present invention clearer, the specific embodiments of the present invention will be described in further detail below with reference to the accompanying drawings and examples.

[0025] To address the aforementioned technical problems, this invention proposes a multimodal feature alignment method based on implicit feature space, which includes the following steps:

[0026] Step 1: Deploy a large DeepSeek R1 model locally

[0027] Use open-source frameworks, such as DeepSeek, for model loading and inference. Utilize dedicated hardware (such as GPUs, CPUs, and TPUs) to accelerate computation. Deploy lightweight versions of the model to adapt to resource-constrained environments.

[0028] Step 2: Fine-tuning the large model on multimodal data

[0029] Input the raw text data into DeepSeek to obtain the text features $H_T$ of the text information;

[0030] Input the raw text and audio data into DeepSeek as text-audio pairs; following the text prompts, DeepSeek outputs the audio features $H_A$ that the prompts request;

[0031] Input the raw text and image data into DeepSeek as text-image pairs; following the text prompts, DeepSeek outputs the image features $H_I$ that the prompts request.

[0032] Based on the open-source DeepSeek model, fine-tuning was performed using the GRPO algorithm. The GRPO algorithm comprises three models: a policy model, a reference model, and a reward model (RM). The policy model is a pre-trained DeepSeek R1 model, which is fine-tuned through subsequent reinforcement learning to enable text-to-audio and text-to-image editing capabilities. The reference model, also a pre-trained DeepSeek R1 model, is used to constrain the differences between the policy model and the reference model. The reward model is a separately trained model whose purpose is to score the reasonableness of the images and audio output by the policy model. Fine-tuning employs the Group Relative Policy Optimization (GRPO) reinforcement learning training mode, directly enabling the model to autonomously develop its mapping capabilities through a pure reinforcement learning process.
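The group-relative step that gives GRPO its name can be sketched as follows. The patent text does not spell out the formula, so this is a minimal illustration under stated assumptions: several candidate outputs are sampled for one prompt, the reward model assigns each a scalar score, and each score is normalized against the group's mean and standard deviation. The function name, the example scores, and the use of the population standard deviation are assumptions for illustration; the clipped policy update and the penalty that ties the policy to the reference model are omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of sampled outputs.

    Each advantage is (reward - group mean) / (group std + eps), so outputs
    that score above the group average receive a positive advantage.
    """
    if not rewards:
        raise ValueError("rewards must contain at least one score")
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group (assumed)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reward-model scores for four candidate outputs of one prompt.
scores = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(scores)
```

Because the advantages are relative to the group, they sum to approximately zero, and a group whose outputs all receive the same score contributes no learning signal.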

[0033] Step 3: Cross-modal attention mechanism

[0034] For the aforementioned text, image, and speech features, a cross-modal attention layer is introduced to enable mutual learning between different modalities and strengthen their correlations. The attention representations for text-audio, text-image, and audio-image are $H_{TA}$, $H_{TI}$, and $H_{AI}$, respectively;

[0035] Step 4: Feature Fusion and Mapping

[0036] In order to align the features of different modalities in the shared latent space, the features of all modalities are mapped to the shared latent space z through a linear transformation.

[0037] Example 1:

[0038] This invention aims to address the shortcomings of existing multimodal data processing technologies by providing a multimodal feature alignment method based on implicit feature space. This technology enhances the extraction of key video information and the analysis of image and text data by efficiently fusing multiple modal information, including text, images, and audio. It optimizes the computational model for efficient processing and adapts to cross-domain application needs, while reducing the demand for labeled data and resources. This provides a more accurate, flexible, and efficient solution for subsequent task analysis and processing.

[0039] This invention relates to a multimodal feature alignment method based on implicit feature space, belonging to the fields of artificial intelligence, multimodal data processing, and feature representation and alignment. This invention enhances key information extraction and data analysis capabilities by efficiently fusing multiple modal information such as text, images, and audio, and adapts to cross-domain application needs. Specifically,

[0040] This solution first completes the local deployment of a large model, using open-source frameworks such as DeepSeek R1 for model loading and inference and leveraging dedicated hardware such as GPUs and CPUs to accelerate computation, while supporting lightweight deployment to adapt to resource-constrained environments. It then adopts a large model architecture based on DeepSeek R1, achieving efficient feature extraction for text-image and text-audio pairs by integrating key technologies such as Mixture of Experts (MoE), load balancing, and Multi-Head Latent Attention (MLA). A multimodal input method is constructed: text-audio pairs and text-image pairs are input into the model, and the model is guided to modify the audio or image content according to the text prompts, laying the foundation for subsequent cross-modal understanding. Next, a self-attention mechanism is applied within each modality, so that text, audio, and images each capture long-distance dependencies and generate high-dimensional feature representations with enhanced expressiveness. A cross-modal attention layer is then introduced, enabling different modalities to learn from each other and strengthen their correlations. Finally, to align features from different modalities in a shared latent space, the features of all modalities are mapped to the same latent space through a linear transformation.

[0041] This invention not only significantly improves the efficiency and accuracy of multimodal data analysis, but also provides strong support for intelligent information processing, and is applicable to various application scenarios such as visual question answering, sentiment analysis, cross-modal retrieval, and human-computer interaction.

[0042] To address the aforementioned technical problems, this invention provides a multimodal feature alignment method based on implicit feature space, comprising the following steps:

[0043] Step 1: Local Deployment of the Large Model

[0044] First, prepare a hardware environment that meets the computing requirements (such as GPU / TPU) and configure the operating system and software dependencies. Second, install the necessary deep learning frameworks (such as PyTorch, TensorFlow) and model libraries (such as Hugging Face Transformers or a specific model SDK). Third, download pre-trained models from open-source platforms and apply lightweight processing such as quantization and pruning according to the actual application scenario. Finally, integrate the model into a local service and call it for inference through an API or the command line, thereby achieving AI applications with stronger data privacy protection and faster response.
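The paragraph above names quantization as one lightweight-processing option without specifying a scheme. As a rough illustration of the idea only, the sketch below applies symmetric per-tensor int8 quantization to a short list of weights and measures the round-trip error. The function names and the weight values are hypothetical, and plain Python lists stand in for the tensors a deep learning framework would use.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization of a list of floats to int8 codes."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest code and clamp to the symmetric int8 range.
    codes = [max(-127, min(127, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int8(codes, scale):
    """Map int8 codes back to approximate float weights."""
    return [c * scale for c in codes]


# Hypothetical weight values, used only to illustrate the round trip.
weights = [0.12, -0.5, 0.33, 1.27, -1.0]
codes, scale = quantize_int8(weights)
restored = dequantize_int8(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Each weight is stored as one 8-bit code plus a shared scale, and the reconstruction error per weight is bounded by half the scale, which is the trade-off between model size and precision that lightweight deployment accepts.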

[0045] Step 2: Fine-tuning the multimodal data of the large model

[0046] 2.1 Fine-tuning of text and images

[0047] Prepare the original image and the prompt text. The text should clearly express what modifications or adjustments are desired to the image. Extract features from both the text and the image using a model. This step is fundamental to multimodal processing, enabling data from different modalities to be represented in a unified latent space. Based on the textual guidance, adjust the image features in the latent space, and then use a decoder to reconstruct the image format. Output the modified image and check if the desired effect has been achieved.

[0048] 2.2 Text-Audio Fine-tuning

[0049] Select the DeepSeek R1 pre-trained multimodal model for text-guided audio processing and load it into the environment via the API. Prepare the original audio file and the prompt text. Use the model to extract features from the text and audio separately, and then find the correspondence between the text features and audio features in the shared latent space. Finally, convert the modified features back into the audio format.

[0050] Step 3: Cross-modal attention mechanism

[0051] A cross-modal attention layer is introduced to enable different modalities to learn from each other and strengthen their connections. The attention representations for text-audio, text-image, and audio-image are $H_{TA}$, $H_{TI}$, and $H_{AI}$, respectively.

[0052] Step 4: Feature Fusion and Mapping

[0053] In order to align the features of different modalities in a shared latent space, the features of all modalities ($H_T$, $H_A$, and $H_I$) are mapped to the shared latent space z through a linear transformation.

[0054] Example 2:

[0055] This invention discloses a multimodal feature alignment method based on implicit feature space, which mainly solves the limitations of existing technologies in feature representation and alignment of multimodal data (such as text, audio and images). Specifically, it includes: (1) Cross-modal semantic gap: Traditional methods are difficult to effectively capture the deep semantic relationships between different modal data, resulting in poor performance in cross-modal tasks (such as retrieval, generation, etc.). (2) Insufficient feature representation: Single-modal feature extraction methods often cannot fully express the information in complex scenarios, limiting the model's understanding ability and generalization performance.

[0056] This invention proposes a multimodal feature alignment method based on implicit feature space. This framework leverages the powerful multimodal understanding capabilities of DeepSeek, combining self-attention and cross-modal attention mechanisms to capture complex relationships within and between modalities. Furthermore, it optimizes model performance by setting a reward / penalty function for reinforcement learning. The specific steps are as follows:

[0057] Step 1: Local Deployment of the Large Model

[0058] First, prepare a hardware environment that meets the computing requirements (such as GPU / TPU) and configure the operating system and software dependencies. Second, install the necessary deep learning frameworks (such as PyTorch, TensorFlow) and model libraries (such as Hugging Face Transformers or a specific model SDK). Third, download pre-trained models from open-source platforms and apply lightweight processing such as quantization and pruning according to the actual application scenario. Finally, integrate the model into a local service and call it for inference through an API or the command line, thereby achieving AI applications with stronger data privacy protection and faster response.

[0059] Step 2: Fine-tuning the multimodal data of the large model

[0060] 2.1 Fine-tuning of text and images

[0061] Prepare the original image and the prompt text. The text should clearly express what modifications or adjustments are desired to the image. Use a model to extract features from the text and the image separately. This step is fundamental to multimodal processing, enabling data from different modalities to be represented in a unified latent space. Based on the textual guidance, adjust the image features in the latent space, and then use a decoder to reconstruct the image format. Output the modified image and use the reinforcement learning algorithm GRPO to check whether the desired effect has been achieved.

[0062] 2.2 Text-Audio Fine-tuning

[0063] Select the DeepSeek R1 pre-trained multimodal model for text-guided audio processing and load it into the environment via the API. Prepare the original audio file and the prompt text. Use the model to extract features from the text and audio separately, and then find the correspondence between the text features and audio features in the shared latent space. Finally, convert the modified features back into the audio format. Use the reinforcement learning algorithm GRPO to check whether the expected results have been achieved.

[0064] Step 3: Cross-modal attention mechanism

[0065] After the fine-tuned DeepSeek model generates the audio and image features, a cross-modal attention layer is introduced, enabling different modalities to learn from each other and strengthen their connections. The attention representations for text-audio, text-image, and audio-image are $H_{TA}$, $H_{TI}$, and $H_{AI}$, calculated as follows:

[0066] $H_{TA} = \mathrm{CrossModalAttention}(H_T, H_A)$

[0067] $H_{TI} = \mathrm{CrossModalAttention}(H_T, H_I)$

[0068] $H_{AI} = \mathrm{CrossModalAttention}(H_A, H_I)$

[0069] Cross-modal attention can be implemented using a query-key-value (QKV) structure, i.e.:

[0070] $\mathrm{CrossModalAttention}(H_x, H_y) = \mathrm{softmax}\left(\frac{Q_x K_y^{\top}}{\sqrt{d_k}}\right) V_y$

[0071] where $Q_x$, $K_y$, and $V_y$ are the query, key, and value matrices, respectively, and $d_k$ is the key dimension.

[0072] Specifically, in the cross-modal attention computation between text and audio, the query matrix $Q_T$ of the text modality, the key matrix $K_A$ of the audio modality, and the value matrix $V_A$ of the audio modality are, respectively:

[0073] $Q_T = W_Q^{TA} H_T, \quad K_A = W_K^{TA} H_A, \quad V_A = W_V^{TA} H_A$

[0074] where $W_Q^{TA}$, $W_K^{TA}$, and $W_V^{TA}$ are the weight matrices that map the text features $H_T$ and the audio features $H_A$ to the query, key, and value spaces. The cross-modal attention representation $H_{TA}$ between text and audio is therefore:

[0075] $H_{TA} = \mathrm{softmax}\left(\frac{Q_T K_A^{\top}}{\sqrt{d_k}}\right) V_A$

[0076] In the cross-modal attention computation between text and image, the query matrix $Q_T$ of the text modality, the key matrix $K_I$ of the image modality, and the value matrix $V_I$ of the image modality are, respectively:

[0077] $Q_T = W_Q^{TI} H_T, \quad K_I = W_K^{TI} H_I, \quad V_I = W_V^{TI} H_I$

[0078] where $W_Q^{TI}$, $W_K^{TI}$, and $W_V^{TI}$ are the weight matrices that map the text features $H_T$ and the image features $H_I$ to the query, key, and value spaces. The cross-modal attention representation $H_{TI}$ between text and image is therefore:

[0079] $H_{TI} = \mathrm{softmax}\left(\frac{Q_T K_I^{\top}}{\sqrt{d_k}}\right) V_I$

[0080] In the cross-modal attention computation between audio and image, the query matrix $Q_A$ of the audio modality, the key matrix $K_I$ of the image modality, and the value matrix $V_I$ of the image modality are, respectively:

[0081] $Q_A = W_Q^{AI} H_A$

[0082] $K_I = W_K^{AI} H_I, \quad V_I = W_V^{AI} H_I$

[0083] where $W_Q^{AI}$, $W_K^{AI}$, and $W_V^{AI}$ are the weight matrices that map the audio features $H_A$ and the image features $H_I$ to the query, key, and value spaces. The cross-modal attention representation $H_{AI}$ between audio and image is therefore:

[0084] $H_{AI} = \mathrm{softmax}\left(\frac{Q_A K_I^{\top}}{\sqrt{d_k}}\right) V_I$
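The query-key-value cross-modal attention described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes the common scaled dot-product form (a softmax over the query-key scores divided by the square root of the key dimension, applied to the values), stores features as rows so that projections are applied as feature matrix times weight matrix, and uses identity matrices as stand-ins for the learned projection weights. The toy feature values and helper names are hypothetical, and plain Python lists replace the tensor library a real system would use.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(h_x, h_y, w_q, w_k, w_v):
    """Attend from modality x (queries) to modality y (keys and values)."""
    q = matmul(h_x, w_q)  # shape (n_x, d_k)
    k = matmul(h_y, w_k)  # shape (n_y, d_k)
    v = matmul(h_y, w_v)  # shape (n_y, d_v)
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose to (d_k, n_y)
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, v), weights


# Hypothetical toy features: 2 text tokens and 3 audio frames, feature size 2.
h_text = [[1.0, 0.0], [0.0, 1.0]]
h_audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for learned projection matrices
h_ta, attn = cross_modal_attention(h_text, h_audio, identity, identity, identity)
```

The output has one row per text token, each a weighted mix of audio frames; the first text token puts more weight on the audio frames that share its active feature than on the frame that does not.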

[0085] Step 4: Feature Fusion and Mapping

[0086] To align features from different modalities in a shared latent space, the features of all modalities are mapped to the shared latent space z through a linear transformation. The feature representations of text, audio, and image in the shared latent space are $z_T$, $z_A$, and $z_I$, calculated as follows:

[0087] $z_T = W_T H_T + W_{TI} H_I + W_{TA} H_A$

[0088] $z_A = W_A H_A + W_{AT} H_T + W_{AI} H_I$

[0089] $z_I = W_I H_I + W_{IT} H_T + W_{IA} H_A$

[0090] where $W_T$, $W_A$, and $W_I$ are the weight matrices of the respective modalities, while $W_{TI}$, $W_{TA}$, $W_{AT}$, $W_{AI}$, $W_{IT}$, and $W_{IA}$ adjust the contribution of the cross-modal features. This not only allows the features of each modality to align in the shared space, but also preserves the information gain provided by the cross-modal attention mechanism.
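The mapping into the shared latent space can be illustrated with a small numeric sketch of the text-side equation, in which the latent text vector is the sum of one own-modality term and two cross-modal terms. Everything concrete here is an assumption for illustration: features are reduced to single pooled vectors of dimension 2, the learned weight matrices are replaced by scaled identity matrices, and the helper names are hypothetical.

```python
def matvec(w, h):
    """Apply weight matrix w (list of rows) to feature vector h."""
    return [sum(wij * hj for wij, hj in zip(row, h)) for row in w]


def fuse(own, cross_a, cross_b):
    """Sum one own-modality term and two cross-modal terms, each a (W, H) pair."""
    terms = [matvec(w, h) for w, h in (own, cross_a, cross_b)]
    return [sum(vals) for vals in zip(*terms)]


def scaled_identity(s, n=2):
    """Hypothetical stand-in for a learned n x n weight matrix."""
    return [[s if i == j else 0.0 for j in range(n)] for i in range(n)]


# Hypothetical pooled features for text, audio, and image (dimension 2).
h_t, h_a, h_i = [1.0, 2.0], [0.5, 0.5], [2.0, 0.0]

# z_T = W_T H_T + W_TI H_I + W_TA H_A, with the cross terms down-weighted.
z_t = fuse((scaled_identity(1.0), h_t),
           (scaled_identity(0.25), h_i),
           (scaled_identity(0.25), h_a))
```

With these stand-in weights the text vector dominates and the image and audio vectors each contribute a quarter of their value, which is the role the cross-modal weight matrices play in adjusting contribution.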

[0091] The loss function $L$ consists of the task-specific loss $L_{task}$ and the contrastive loss $L_{contrastive}$:

[0092] $L = w_1 L_{task} + w_2 L_{contrastive}$

[0093] where $w_1$ and $w_2$ are weighting coefficients used to balance the importance of the two losses.

[0094] Task-specific loss refers to the loss function used when performing a specific task, which directly optimizes the model's performance on that task.

[0095] The contrastive loss for feature alignment ensures effective alignment of data from different modalities within a shared latent space, promoting the learning of implicit features and cross-modal consistency. Together, they constitute a comprehensive loss function that ensures the model's effectiveness on specific tasks while improving the overall performance of multimodal data processing.

[0096] The contrastive loss function encourages similarity between positive samples while widening the distance between negative samples. The contrastive loss $L_{contrastive}$ can be expressed by the following formula:

[0097] $L_{contrastive} = -\log \frac{\exp\left(D(z_x, z_x^{+})/\tau\right)}{\exp\left(D(z_x, z_x^{+})/\tau\right) + \sum_{j=1}^{N} \exp\left(D(z_x, z_j^{-})/\tau\right)}$

[0098] where $D$ denotes the similarity measure between two vectors (cosine similarity), $z_x$ is a sample in the latent space z, $z_x^{+}$ is the positive sample matched with $z_x$, $z_j^{-}$ is the j-th sample in the negative sample set, $N$ is the number of samples in the negative sample set, and $\tau$ is a temperature parameter that adjusts the influence of the similarity scores. Positive samples are those that have the same identity as the given sample or are highly related to it; in other words, they come from different perspectives, modalities, or time points of the same instance or object. For example, in text-image pairing, if a sample is an image, its positive sample might be the text description of that image. Negative samples are those that are unrelated to the given sample or come from a different identity, meaning they do not belong to the same instance or object. In the text-image example, the description of any other image, or any text unrelated to the current image, can be treated as a negative sample of that image. For a positive pair $(z_x, z_x^{+})$, the two samples are matched, so their distance in the shared latent space is expected to be smaller. For a negative pair $(z_x, z_j^{-})$, the two samples are not matched, so their distance in the shared latent space is expected to be larger. The temperature parameter $\tau$ controls the scaling of the similarity scores: a smaller $\tau$ makes the scores more extreme, causing the model to focus more on distinguishing positive from negative samples, while a larger $\tau$ makes the score distribution smoother, helping the model learn on more difficult tasks. Minimizing the loss maximizes the similarity of positive pairs while minimizing the similarity of negative pairs, thereby achieving better feature alignment and cross-modal consistency.
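The behavior described in this paragraph can be checked numerically with a small sketch. It assumes an InfoNCE-style form of the contrastive loss (the negative log of the positive pair's share of the exponentiated, temperature-scaled cosine similarities), which matches the quantities the text defines but is not confirmed as the patent's exact formula. The latent vectors, the default temperature, and the default loss weights are hypothetical values chosen for illustration.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, tau=0.1):
    """InfoNCE-style loss: -log of the positive's share of all exp(sim / tau)."""
    pos = math.exp(cosine_similarity(anchor, positive) / tau)
    neg = sum(math.exp(cosine_similarity(anchor, n) / tau) for n in negatives)
    return -math.log(pos / (pos + neg))


def total_loss(task_loss, contrastive, w1=1.0, w2=0.5):
    """Weighted sum L = w1 * L_task + w2 * L_contrastive."""
    return w1 * task_loss + w2 * contrastive


# Hypothetical latent vectors: an image, its caption, and two unrelated captions.
z_image = [1.0, 0.0]
z_caption = [0.9, 0.1]
z_unrelated = [[0.0, 1.0], [-1.0, 0.2]]

# Loss when the true caption is the positive, and when an unrelated one is.
aligned = contrastive_loss(z_image, z_caption, z_unrelated)
misaligned = contrastive_loss(z_image, z_unrelated[0], [z_caption, z_unrelated[1]])
```

The loss is close to zero when the matched caption is the positive sample and much larger when an unrelated caption is treated as the positive, which is the pressure that pulls matched pairs together in the shared latent space.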

[0099] This invention discloses a multimodal feature alignment method based on implicit feature space, the main advantages of which are as follows:

[0100] (1) In multimodal information processing, the powerful feature extraction and semantic understanding capabilities of large models can be leveraged to achieve semantic feature alignment of heterogeneous data such as images and audio with text as the center. Specifically, through pre-trained large language models and multimodal interfaces, key features related to the text description can be extracted from the original images or audio, while ignoring irrelevant or redundant information.

[0101] (2) An architecture combining cross-modal attention mechanisms was designed to capture complex relationships between different modalities. By introducing self-attention and cross-modal attention layers, not only is the model's understanding of information within each modality enhanced, but semantic consistency of data from different modalities in the shared latent space is also ensured. This enables the model to leverage its powerful association discovery capabilities in cross-modal tasks such as visual question answering, sentiment analysis, and cross-modal retrieval.

[0102] The above description is only a preferred embodiment of the present invention. It should be noted that for those skilled in the art, several improvements and modifications can be made without departing from the technical principles of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.

Claims

1. A multimodal feature alignment method based on implicit feature space, characterized in that the method includes the following steps:

Step 1: Deploy a large DeepSeek R1 model locally;

Step 2: Fine-tuning the large model on multimodal data: input the raw text data into DeepSeek to obtain the text features $H_T$ of the text information; input the raw text and audio data into DeepSeek as text-audio pairs, and, based on the text prompts, have DeepSeek output the audio features $H_A$ required by the text prompts; input the raw text and image data into DeepSeek as text-image pairs, and, based on the text prompts, have DeepSeek output the image features $H_I$ required by the text prompts; fine-tune the open-source DeepSeek large model using the GRPO algorithm;

Step 3: Cross-modal attention mechanism: for the aforementioned text, image, and speech features, a cross-modal attention layer is introduced to enable different modalities to learn from each other and strengthen the correlation between them;

Step 4: Feature fusion and mapping: the features of all modalities are mapped to the shared latent space z through a linear transformation;

wherein, after the fine-tuned DeepSeek model generates the audio and image features, a cross-modal attention layer is introduced to enable different modalities to learn from each other and strengthen their correlations; the attention representations for text-audio, text-image, and audio-image are $H_{TA}$, $H_{TI}$, and $H_{AI}$, calculated as follows:

$H_{TA} = \mathrm{CrossModalAttention}(H_T, H_A)$, $H_{TI} = \mathrm{CrossModalAttention}(H_T, H_I)$, $H_{AI} = \mathrm{CrossModalAttention}(H_A, H_I)$;

cross-modal attention is implemented through a query-key-value (QKV) structure, namely:

$\mathrm{CrossModalAttention}(H_x, H_y) = \mathrm{softmax}\left(\frac{Q_x K_y^{\top}}{\sqrt{d_k}}\right) V_y$

where $Q_x$, $K_y$, and $V_y$ are the query, key, and value matrices, respectively, and $d_k$ is the key dimension;

specifically, in the cross-modal attention computation between text and audio, the query matrix $Q_T$ of the text modality, the key matrix $K_A$ of the audio modality, and the value matrix $V_A$ of the audio modality are, respectively:

$Q_T = W_Q^{TA} H_T, \quad K_A = W_K^{TA} H_A, \quad V_A = W_V^{TA} H_A$

where $W_Q^{TA}$, $W_K^{TA}$, and $W_V^{TA}$ are the weight matrices that map the text features $H_T$ and the audio features $H_A$ to the query, key, and value spaces; the cross-modal attention representation $H_{TA}$ between text and audio is therefore:

$H_{TA} = \mathrm{softmax}\left(\frac{Q_T K_A^{\top}}{\sqrt{d_k}}\right) V_A$

in the cross-modal attention computation between text and image, the query matrix $Q_T$ of the text modality, the key matrix $K_I$ of the image modality, and the value matrix $V_I$ of the image modality are, respectively:

$Q_T = W_Q^{TI} H_T, \quad K_I = W_K^{TI} H_I, \quad V_I = W_V^{TI} H_I$

where $W_Q^{TI}$, $W_K^{TI}$, and $W_V^{TI}$ are the weight matrices that map the text features $H_T$ and the image features $H_I$ to the query, key, and value spaces; the cross-modal attention representation $H_{TI}$ between text and image is therefore:

$H_{TI} = \mathrm{softmax}\left(\frac{Q_T K_I^{\top}}{\sqrt{d_k}}\right) V_I$

in the cross-modal attention computation between audio and image, the query matrix $Q_A$ of the audio modality, the key matrix $K_I$ of the image modality, and the value matrix $V_I$ of the image modality are, respectively:

$Q_A = W_Q^{AI} H_A, \quad K_I = W_K^{AI} H_I, \quad V_I = W_V^{AI} H_I$

where $W_Q^{AI}$, $W_K^{AI}$, and $W_V^{AI}$ are the weight matrices that map the audio features $H_A$ and the image features $H_I$ to the query, key, and value spaces; the cross-modal attention representation $H_{AI}$ between audio and image is therefore:

$H_{AI} = \mathrm{softmax}\left(\frac{Q_A K_I^{\top}}{\sqrt{d_k}}\right) V_I$.

2. The multimodal feature alignment method based on implicit feature space as described in claim 1, characterized in that, in step two, the DeepSeek R1 large model constructs a multimodal input method, inputting text-audio pairs and text-image pairs into the model. Based on the text prompts, the model is guided to modify the audio or image content so that it meets the requirements. Then, the text, audio, and images independently use self-attention to capture long-distance dependencies and generate high-dimensional feature representations. Furthermore, a self-attention mechanism is applied within each modality to enhance feature expressiveness.

3. The multimodal feature alignment method based on implicit feature space as described in claim 1, characterized in that, in step two, fine-tuning the DeepSeek large model with the GRPO algorithm involves three models: a policy model, a reference model, and a reward model (RM); the policy model is a pre-trained DeepSeek R1 model, which is fine-tuned through subsequent reinforcement learning so that it can perform text-guided editing of audio and images; the reference model is also a pre-trained DeepSeek R1 model, used to constrain how far the policy model diverges from the reference model; the reward model is a separately trained model whose purpose is to score the reasonableness of the images and audio output by the policy model; during fine-tuning, the reinforcement learning training mode of the Group Relative Policy Optimization (GRPO) algorithm is adopted, allowing the model to develop its mapping capabilities autonomously through a pure reinforcement learning process.
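
The "group relative" part of GRPO can be sketched briefly. The claim does not state the advantage formula, so the following is an assumption based on the commonly published form of GRPO: several outputs are sampled for the same prompt, each is scored by the reward model, and each score is normalized against the mean and standard deviation of its own group. The function name, the `eps` stabilizer, and the reward values are hypothetical.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    Assumption: advantage = (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std of the group
    return [(r - mean) / (std + eps) for r in rewards]

# Made-up reward-model scores for four outputs sampled from one prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)
```

Outputs scored above the group average receive positive advantages and are reinforced; those below receive negative ones. In a full training loop the policy update would also carry a penalty tied to the reference model, which this sketch omits.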

4. The multimodal feature alignment method based on implicit feature space as described in claim 3, characterized in that, when fine-tuning on text-image pairs, the original image and the prompt text are prepared, the text clearly expressing the modifications or adjustments desired for the image; the model is used to extract features from the text and the image respectively; guided by the text information, the image features are adjusted in the latent space, and a decoder then converts the modified features back into image format; the modified image is output and checked to determine whether the expected effect has been achieved.

5. The multimodal feature alignment method based on implicit feature space as described in claim 3, characterized in that, when fine-tuning on text-audio pairs, the DeepSeek R1 pre-trained multimodal model for text-guided audio processing is selected and loaded into the environment via API; the original audio file and the prompt text are prepared, and the model is used to extract features from the text and the audio respectively; the correspondence between the text features and the audio features is then found in the shared latent space, and finally the modified features are converted back into audio format.

6. The multimodal feature alignment method based on implicit feature space as described in claim 1, characterized in that, in step four, the features of all modalities are mapped to the shared latent space z through a linear transformation; the feature representations of text, audio, and image in the shared latent space are each computed using the weight matrix of the respective modality, and six coefficients are used to adjust the contribution ratio of the cross-modal features.
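
One possible reading of this mapping can be sketched as follows. The exact formula is not available in this text, so the form used here is an assumption: each modality adds its two related cross-modal representations, scaled by two coefficients, to its own features and then applies its own weight matrix. The sketch also assumes that every feature has been pooled to a fixed-length vector. All names and numbers are hypothetical.

```python
def mix(own, cross_a, cross_b, coeff_a, coeff_b):
    """Add two scaled cross-modal vectors to a modality's own features."""
    return [o + coeff_a * a + coeff_b * b for o, a, b in zip(own, cross_a, cross_b)]

def linear_map(weight, vector):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]

def to_shared_space(own, cross_a, cross_b, weight, coeff_a, coeff_b):
    """Assumed form: z = W (h + coeff_a * h_cross_a + coeff_b * h_cross_b)."""
    return linear_map(weight, mix(own, cross_a, cross_b, coeff_a, coeff_b))

# Made-up pooled features for the text modality.
h_t = [1.0, 2.0]
h_ta = [0.5, 0.5]   # text-audio attention representation
h_ti = [1.0, 0.0]   # text-image attention representation
w_t = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # maps width 2 to latent width 3

z_t = to_shared_space(h_t, h_ta, h_ti, w_t, coeff_a=0.5, coeff_b=0.25)
```

Under this reading the audio and image modalities would each call `to_shared_space` with their own weight matrix and their own pair of coefficients, which accounts for three weight matrices and six coefficients in total.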

7. The multimodal feature alignment method based on implicit feature space as described in claim 6, characterized in that the loss function $L$ of this method consists of the task-specific loss $L_{task}$ and the contrastive loss $L_{contrastive}$:

$$L = w_1 L_{task} + w_2 L_{contrastive}$$

where $w_1$ and $w_2$ are weighting coefficients used to balance the importance of the different losses; the task-specific loss is the loss function used when completing a specific task and directly optimizes the model's performance on that task; the contrastive loss for feature alignment ensures that data from different modalities are effectively aligned in the shared latent space, promoting the learning of implicit features and cross-modal consistency.

8. The multimodal feature alignment method based on implicit feature space as described in claim 7, characterized in that the contrastive loss function encourages similarity between positive samples while widening the distance between negative samples; $L_{contrastive}$ is expressed using the following formula:

$$L_{contrastive} = -\log \frac{\exp\left(D(z_i, z_i^{+})/\tau\right)}{\exp\left(D(z_i, z_i^{+})/\tau\right) + \sum_{j=1}^{N} \exp\left(D(z_i, z_j^{-})/\tau\right)}$$

where $D$ represents the distance metric between two vectors, $z_i$ is a sample in the latent space, $z_i^{+}$ is the positive sample matching $z_i$, $z_j^{-}$ is the j-th sample in the negative sample set, and $N$ represents the number of samples in the negative sample set; the temperature $\tau$ is a parameter used to adjust the degree of influence of the similarity score; positive samples are samples that have the same identity as, or are highly related to, the given sample, or samples that come from different perspectives, modalities, or time points of the same instance or object; negative samples are samples that are unrelated to the given sample or come from different identities, meaning that they do not belong to the same instance or object.

9. The multimodal feature alignment method based on implicit feature space as described in claim 8, characterized in that the distance metric is computed using cosine similarity.
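
The loss described in claims 7 to 9 can be sketched as below. This is a hedged illustration rather than the patented formulation: it assumes an InfoNCE-style contrastive loss with cosine similarity as the metric, and the function names, default temperature, default weights, and sample vectors are all hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Assumed InfoNCE form: -log(pos / (pos + sum of negatives))."""
    pos = math.exp(cosine_similarity(anchor, positive) / temperature)
    neg = sum(math.exp(cosine_similarity(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

def total_loss(task_loss, contrast_loss, w1=1.0, w2=0.5):
    """Weighted sum of the task-specific and contrastive losses."""
    return w1 * task_loss + w2 * contrast_loss

# Made-up latent vectors: the positive is close to the anchor, negatives are not.
anchor = [1.0, 0.0]
aligned = contrastive_loss(anchor, [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]])
misaligned = contrastive_loss(anchor, [0.0, 1.0], [[0.9, 0.1], [-1.0, 0.0]])
combined = total_loss(task_loss=1.0, contrast_loss=aligned)
```

The loss is small when the matching pair is the most similar pair and grows when a negative sample sits closer to the anchor than the positive does. A lower temperature sharpens that contrast.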