A cross-modal alignment method for multimodal large models

By constructing a multimodal knowledge distiller and a lightweight data feature extractor, the problem of insufficient alignment accuracy of cross-modal feature vector space in large multimodal models is solved, and accurate alignment of cross-modal features in a unified semantic space is achieved, thereby improving the matching accuracy of multimodal data.

CN122196893A · Pending · Publication Date: 2026-06-12 · 青岛国实科技集团有限公司

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
青岛国实科技集团有限公司
Filing Date
2026-03-11
Publication Date
2026-06-12


Abstract

The application provides a cross-modal alignment method for multimodal large models and belongs to the technical field of multimodal large models. The method batch-produces multimodal corpora through a multimodal knowledge distiller that supports five types of distillation subsystems, forms a high-quality multimodal alignment dataset through two stages (coarse-grained segment alignment, then fine-grained word-level and frame-level semantic alignment), and then constructs a lightweight multimodal data feature extractor based on a shared lightweight backbone network, a single-stage projection alignment module, and a modality-adaptive routing mechanism. Driven by cross-modal semantic anchor vectors, a dual-stream semantic anchor fusion network and a sparsely gated mixture-of-experts cross-modal routing network aggregate, interact, and fuse the features of each modality in the semantic anchor vector space. A backtracking realignment process is triggered for low-quality samples according to their vector space alignment scores, thereby addressing the insufficient alignment precision of the cross-modal feature vector space in multimodal large models.

Description

Technical Field

[0001] This invention belongs to the field of multimodal large model technology, and more specifically, relates to a method for cross-modal alignment of multimodal large models.

Background Technology

[0002] Multimodal large models are a current research hotspot in the field of artificial intelligence. One of their core tasks is to achieve accurate alignment of heterogeneous modal data such as text, speech, images, and videos in a unified semantic space. In existing technologies, mainstream multimodal large models generally adopt an architecture in which a visual encoder is mapped sequentially into a large language model. That is, after visual features are extracted by a Vision Transformer, the visual tokens are mapped into the embedding space of the language model by a multilayer perceptron projector, and the large language model then completes cross-modal understanding. This architecture has been widely deployed in application scenarios such as image captioning, visual question answering, and cross-modal retrieval, and has formed a technical system whose basic paradigm is a visual encoder plus a multilayer perceptron plus a large language model. However, this sequential mapping architecture has inherent defects such as multi-stage information loss and accumulation of modal distribution bias. The nonlinear transformation in the multilayer perceptron projection causes the original modal features to lose semantic integrity, and the heterogeneous distributions of different modal features in the original feature space introduce a systematic bias into direct cross-modal attention alignment. Because existing technologies lack a vector space alignment mechanism mediated by semantic anchors, multimodal features rely solely on coordinate localization or shallow pixel matching during fusion. This fails to achieve accurate cross-modal matching of object attributes, relationships, and spatial structures within a unified vector space with clear semantic meaning. In other words, existing technologies suffer from insufficient accuracy in aligning the cross-modal feature vector spaces of multimodal large models.

Summary of the Invention

[0003] In view of this, the present invention provides a method for cross-modal alignment of multimodal large models, which can solve the technical problem of insufficient alignment accuracy of cross-modal feature vector space of multimodal large models in the prior art.

[0004] This invention is implemented as follows: This invention provides a method for cross-modal alignment of a multimodal large model, comprising the following steps:

[0005] A multimodal knowledge distiller is constructed, comprising five distillation subsystems: a text-text knowledge distillation subsystem, a text-speech bidirectional bimodal knowledge distillation subsystem, a text-image bidirectional bimodal knowledge distillation subsystem, a text-video knowledge distillation subsystem, and an image-video knowledge distillation subsystem. Multimodal corpora are generated in batches using the generative model corresponding to each distillation subsystem, and the generated corpora are subjected to duplication and reliability checks to obtain the multimodal raw corpora.

[0006] For the multimodal raw corpus, text seed banks, speech seed banks, image data sources, and image datasets are constructed respectively. After performing modality-corresponding preprocessing operations on each data source, bidirectional corpus generation is performed through speech generation model, automatic speech recognition model, text-to-image model, image-to-text model, text-to-video model, and image-to-video model respectively, to obtain text-speech bidirectional bimodal corpus, text-image bidirectional bimodal corpus, text-video corpus, and image-video corpus;

[0007] For the text-speech bimodal corpora, a sentence-level segmentation strategy combined with an automatic speech recognition model is used for segment alignment. For the text-image bimodal corpora, a panoptic segmentation model is used to perform panoptic segmentation, instance segmentation, and semantic segmentation. After entity region coordinates are mapped to entity names, the image entities are aligned with the text entities, forming alignment data between entities in images and entities in text and completing the coarse-grained semantic alignment of the cross-modal data.

[0008] Based on the text-speech segment alignment data, a word-level and frame-level forced alignment model combined with a Hidden Markov Model forced alignment algorithm is used to obtain word-level and frame-level text-speech alignment data. A self-supervised speech representation model is used to extract frame-level sentiment features and align them with sentiment words in the text to obtain timbre-and-sentiment text-speech alignment data. Based on the alignment data between entities in images and entities in text, visual object text-image alignment data is obtained through a scene graph alignment mechanism, visual space text-image alignment data is obtained through spatial predicate mapping, and action state text-image alignment data is obtained by extracting key image frames through a pose prediction model and aligning them with verb phrases in the text. For the text-video corpora, entity temporal trajectory text-video alignment data is obtained through a joint detection-tracking-recognition framework and the image-to-text model. Entity spatial trajectory text-video alignment data is obtained through a 3D pose estimation model. Frame-level semantic text-video alignment data is obtained through a joint framework of change point detection and phrase localization. A multi-resolution temporal pyramid structure combined with connectionist temporal classification (CTC) forced alignment constraints is used to refine the temporal-granularity correspondence between video frames and text level by level, completing the fine-grained semantic alignment of the cross-modal data.

[0009] All cross-modal data with fine-grained semantic alignment is stored in a structured manner according to a hierarchical directory structure. A data index table containing fields such as sample identifier, category, text path, image path, video path, and alignment information is constructed. Manual sampling verification is performed at a rate of 5% to 10% to check the alignment logic, annotation accuracy, and data usability. The verified data is encapsulated in TFRecord or Webdataset format to obtain a multimodal alignment dataset.

[0010] A lightweight multimodal data feature extractor is constructed, comprising a shared lightweight backbone network, a single-stage projection alignment module, and a modality-adaptive routing mechanism. A lightweight semantic parsing head is integrated at the end of the shared lightweight backbone network to output instance-level semantic tokens. For speech input, frame-level paralinguistic (sub-language) features are extracted using a self-supervised speech representation model and concatenated with the main content token. For video input, a lightweight temporal saliency detection module identifies video keyframes, and high-dimensional semantic parsing is performed only on those keyframes. Cross-modal semantic anchor vectors are constructed, and the visual object text-image alignment data, visual space text-image alignment data, and action state text-image alignment data are associated through a dual-stream semantic anchor fusion network. The data of all modalities in the multimodal alignment dataset are associated through a sparsely gated mixture-of-experts cross-modal routing network, which completes the alignment and fusion of the cross-modal feature vector space and outputs the cross-modal alignment feature representation.
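The routing network is described only at the architectural level. As a minimal sketch of what sparse gating commonly means, assuming top-k selection followed by a softmax over the selected experts (the value of `k`, the gate logits, and the toy experts are all hypothetical and not taken from the source):

```python
import math

def top_k_gate(gate_logits, k=2):
    """Sparse gating: softmax over the k largest logits, zero elsewhere."""
    order = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)
    top = order[:k]
    m = max(gate_logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(gate_logits[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(gate_logits))]

def mix_experts(x, experts, gate_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    weights = top_k_gate(gate_logits, k)
    out = [0.0] * len(x)
    for w, expert in zip(weights, experts):
        if w > 0.0:  # unselected experts are never evaluated
            out = [o + w * y for o, y in zip(out, expert(x))]
    return out

# Toy experts standing in for modality-specific sub-networks (assumption).
experts = [
    lambda v: [2 * a for a in v],
    lambda v: [a + 1 for a in v],
    lambda v: [0.0 for _ in v],
    lambda v: [-a for a in v],
]
print(mix_experts([1.0, 2.0], experts, gate_logits=[2.0, 2.0, -1.0, 0.0]))
```

Because only the selected experts run, the cost per sample stays roughly constant as more experts are added, which is the usual motivation for sparse gating.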

[0011] Furthermore, the method includes calculating, as the vector space alignment score, the mean cosine similarity between each modal feature in the cross-modal alignment feature representation and its corresponding cross-modal semantic anchor vector. For samples whose vector space alignment score is lower than a preset alignment threshold, the method backtracks to the fine-grained semantic alignment step of the cross-modal data and re-executes it. For samples whose vector space alignment score is not lower than the preset alignment threshold, the final cross-modal alignment feature representation is output, and the routing weight of the corresponding expert group is dynamically increased for feature paths whose score is lower than the preset alignment threshold.

[0012] Specifically, the duplication check involves calculating the cosine similarity between generated corpora and removing redundant samples with a cosine similarity of not less than 0.9; the reliability check involves removing corpora with incomplete semantics or incorrect content.

[0013] The construction of the text-to-text knowledge distillation subsystem involves designing a structured prompt template that includes four elements: task type, domain label, content requirements, and format specifications. It also sets style variables, difficulty variables, and constraints such as prohibiting ambiguous answers, grammatical errors, and omissions of key information. The system then uses a large language model with hundreds of billions of parameters to distill and produce general domain corpora and vertical domain corpora.

[0014] The text seed library is constructed to include Chinese, English, mixed language text, general domain text, vertical domain text, and dialogue scenario text, with the text length controlled between 1 and 30 characters. The text data in the text seed library is processed by removing special characters, standardizing spaces, unifying punctuation marks, and verifying semantic compliance. The speech data in the speech seed library is processed by noise reduction, speech enhancement, trimming blank speech, and removing noisy audio.

[0015] The image data sources include general aligned images and domain-specific images, with high image quality requirements, no blurring, no watermarks, and no noise. The original text data sources are processed by noise removal, expression standardization, and text semantic verification. The image data sources are processed by format unification, size cropping, and image enhancement. Based on the text-to-image model and the image-to-text model, bidirectional bimodal text-image corpus is generated in batches.

[0016] Specifically, generating the image-video corpus involves unifying the format of the image dataset and applying data augmentation, using the image-to-text model to analyze the elements of each image in the dataset in batches, organizing the static objects in the images into dynamic text, and then using the image-to-video model to generate the image-video corpora in batches from the dynamic text.

[0017] Specifically, the text-speech segment alignment involves using a sentence-level segmentation strategy to segment data according to text length and punctuation marks, identifying speech content through an automatic speech recognition model, segmenting the audio file according to the segment relationships of the speech content, and then performing semantic comparison between the segmented text data and the speech data to complete the text-speech segment alignment.

[0018] Specifically, the scene graph alignment mechanism parses the alignment data between entities in images and entities in text into a structured language scene graph. In the structured language scene graph, nodes represent entities and attributes, and edges represent relationships between entities. Fine-grained alignment of image and text is achieved through two layers of matching: node-level alignment and edge-level alignment.

[0019] Specifically, generating the entity temporal trajectory text-video alignment data involves using the image-to-text model to extract the semantic features of each entity temporal trajectory, parsing the semantic features into a temporal scene graph, aligning the event flow described in the text with the structural changes of the temporal scene graph through a graph temporal network, and representing the data as quadruples of timestamp, entity, attribute, and relation.

[0020] Specifically, the multi-resolution temporal pyramid structure constructs a hierarchical alignment framework between the coarse-grained segment layer and the fine-grained frame layer, refining the temporal-granularity correspondence between video frames and text at each level. The connectionist temporal classification (CTC) forced alignment constraint uses the CTC loss function to apply soft alignment constraints to the video frame sequence and the text sequence without requiring precise boundary labels.

[0021] This invention constructs a multimodal knowledge distiller to batch-produce multimodal corpora that are preliminarily aligned in semantics. Through two stages of coarse-grained and fine-grained cross-modal semantic alignment, a high-quality multimodal alignment dataset is formed. Then, in a lightweight multimodal data feature extractor, cross-modal semantic anchor vectors serve as the mediator of the semantic space and drive a dual-stream semantic anchor fusion network and a sparsely gated mixture-of-experts cross-modal routing network to aggregate and fuse modal features in the semantic anchor vector space rather than in the original feature space. This avoids the alignment bias caused by differences in modal distribution in the original feature space. Since the cross-modal semantic anchor vectors have clear semantic meanings, aggregating modal features to their corresponding semantic anchors through cosine similarity weighting is essentially a semantically constrained vector space projection. This allows fine-grained semantic information such as object attributes, entity relationships, and spatial structures to correspond accurately in a unified semantic vector space, rather than relying on shallow perception methods such as coordinate positioning or pixel-level matching. In summary, this invention solves the technical problem, described in the background, of insufficient alignment accuracy in the cross-modal feature vector space of multimodal large models.

Attached Figure Description

[0022] Figure 1 is a flowchart of the method of the present invention.

[0023] Figure 2 is a schematic diagram of the framework of the multimodal knowledge distiller of the present invention.

[0024] Figure 3 shows the convergence curves of the alignment loss at each level of the multi-resolution temporal pyramid of the present invention.

[0025] Figure 4 shows the distribution of the vector space alignment scores of each modality in the present invention.

Detailed Implementation

[0026] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below.

[0027] Figure 1 shows a flowchart of the cross-modal alignment method for multimodal large models provided by the present invention. The method includes the following steps:

[0028] S01. Construct a multimodal knowledge distiller, which includes five types of distillation subsystems: a text-text knowledge distillation subsystem, a text-speech bidirectional bimodal knowledge distillation subsystem, a text-image bidirectional bimodal knowledge distillation subsystem, a text-video knowledge distillation subsystem, and an image-video knowledge distillation subsystem. Use the generative model corresponding to each distillation subsystem to batch-produce multimodal corpora, and perform duplication checks and reliability checks on the corpora to obtain the multimodal raw corpora.

[0029] S02. Construct a text seed library, a speech seed library, an image data source, and an image dataset from the multimodal raw corpus obtained in S01. After performing modality-specific preprocessing operations on each data source, generate bidirectional corpus using a speech generation model, an automatic speech recognition model, a text-to-image model, an image-to-text model, a text-to-video model, and an image-to-video model, respectively, to obtain text-speech bidirectional bimodal corpus, text-image bidirectional bimodal corpus, text-video corpus, and image-video corpus.

[0030] S03. For the text-speech bimodal corpus obtained in S02, a sentence-level segmentation strategy is used together with an automatic speech recognition model to perform segment alignment. For the text-image bimodal corpus obtained in S02, a panoptic segmentation model is used to perform panoptic segmentation, instance segmentation, and semantic segmentation. After entity region coordinates are mapped to entity names, the image entities are aligned with the text entities to form alignment data between entities in images and entities in text, completing the coarse-grained semantic alignment of the cross-modal data.

[0031] S04. Based on the text-speech segment alignment data obtained in S03, a word-level and frame-level forced alignment model combined with a Hidden Markov Model forced alignment algorithm is used to obtain word-level and frame-level text-speech alignment data. A self-supervised speech representation model is used to extract frame-level sentiment features and align them with sentiment words in the text to obtain timbre-and-sentiment text-speech alignment data. Based on the alignment data between entities in images and entities in text obtained in S03, visual object text-image alignment data is obtained through a scene graph alignment mechanism, visual space text-image alignment data is obtained through spatial predicate mapping, and action state text-image alignment data is obtained by extracting key image frames through a pose prediction model and aligning them with verb phrases in the text. For the text-video corpus obtained in S02, entity temporal trajectory text-video alignment data is obtained through a joint detection-tracking-recognition framework and the image-to-text model; entity spatial trajectory text-video alignment data is obtained through a 3D pose estimation model; frame-level semantic text-video alignment data is obtained through a change point detection method and a phrase localization joint framework; and a multi-resolution temporal pyramid structure combined with connectionist temporal classification (CTC) forced alignment constraints is used to refine the temporal-granularity correspondence between video frames and text level by level, completing the fine-grained semantic alignment of the cross-modal data.

[0032] S05. All of the fine-grained semantically aligned cross-modal data obtained in S04 is stored in a structured manner according to a hierarchical directory structure. A data index table containing fields such as sample identifier, category, text path, image path, video path, and alignment information is constructed. Manual sampling verification is performed at a rate of 5% to 10% to check the alignment logic, annotation accuracy, and data usability. The verified data is packaged in TFRecord format or WebDataset format to obtain the multimodal alignment dataset.
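The index table and the sampling step of S05 can be sketched as follows. This is an illustration under stated assumptions: the field names mirror the fields listed above, while the directory layout, the file names, and the helper `sample_for_review` are hypothetical.

```python
import random

INDEX_FIELDS = ["sample_id", "category", "text_path", "image_path", "video_path", "alignment_info"]

def build_index(n_samples):
    """Build index rows for a hypothetical hierarchical directory layout."""
    return [
        {
            "sample_id": f"s{i:06d}",
            "category": "text-image",
            "text_path": f"text_image/s{i:06d}/caption.txt",   # hypothetical path
            "image_path": f"text_image/s{i:06d}/image.png",    # hypothetical path
            "video_path": "",
            "alignment_info": f"text_image/s{i:06d}/align.json",
        }
        for i in range(n_samples)
    ]

def sample_for_review(rows, rate=0.05, seed=0):
    """Draw a reproducible random subset for manual verification."""
    if not 0.05 <= rate <= 0.10:
        raise ValueError("the description specifies a 5% to 10% sampling rate")
    k = max(1, round(len(rows) * rate))
    return random.Random(seed).sample(rows, k)

rows = build_index(200)
review_set = sample_for_review(rows, rate=0.05)
print(len(review_set), review_set[0]["sample_id"])
```

A fixed seed keeps the review subset reproducible, so a second reviewer can audit exactly the same samples.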

[0033] S06. Construct a lightweight multimodal data feature extractor. This extractor includes a shared lightweight backbone network, a single-stage projection alignment module, and a modality-adaptive routing mechanism. A lightweight semantic parsing head is integrated at the end of the shared lightweight backbone network to output instance-level semantic tokens. For speech input, frame-level paralinguistic (sub-language) features are extracted using a self-supervised speech representation model and concatenated with the main content token. For video input, a lightweight temporal saliency detection module identifies video keyframes, and high-dimensional semantic parsing is performed only on these keyframes. Construct cross-modal semantic anchor vectors and associate the visual object text-image alignment data, visual space text-image alignment data, and action state text-image alignment data through a dual-stream semantic anchor fusion network. The data of all modalities in the multimodal alignment dataset obtained in S05 is associated through a sparsely gated mixture-of-experts cross-modal routing network to complete the alignment and fusion of the cross-modal feature vector space and output the cross-modal alignment feature representation. The mean cosine similarity between each modal feature in the cross-modal alignment feature representation and its corresponding cross-modal semantic anchor vector is calculated as the vector space alignment score. For samples whose vector space alignment score is lower than a preset alignment threshold, the method backtracks to S04 and re-executes the fine-grained semantic alignment of the cross-modal data. For samples whose vector space alignment score is not lower than the preset alignment threshold, the final cross-modal alignment feature representation is output, and the routing weight of the corresponding expert group is dynamically increased for feature paths whose score is lower than the preset alignment threshold.
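The scoring and backtracking rule at the end of S06 can be illustrated with a short sketch. The source does not give the threshold value, so `threshold=0.8` is an assumed placeholder, and the function names and toy vectors are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def alignment_score(modal_features, anchors):
    """Mean cosine similarity between each modality feature and its anchor."""
    sims = [cosine(modal_features[m], anchors[m]) for m in modal_features]
    return sum(sims) / len(sims)

def decide(modal_features, anchors, threshold=0.8):
    """Return the next action for one sample together with its score."""
    score = alignment_score(modal_features, anchors)
    action = "output" if score >= threshold else "backtrack_to_S04"
    return action, score

anchors = {"text": [1.0, 0.0], "image": [1.0, 0.0]}
good = {"text": [1.0, 0.0], "image": [1.0, 0.0]}
poor = {"text": [1.0, 0.0], "image": [0.0, 1.0]}
print(decide(good, anchors))  # score 1.0, at or above the assumed threshold
print(decide(poor, anchors))  # score 0.5, below the assumed threshold
```

The per-modality similarities computed inside `alignment_score` could also be kept individually to identify which feature paths should have their expert routing weights increased; that bookkeeping is omitted here.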

[0034] As shown in Figure 2, the multimodal knowledge distiller is a data production framework that adopts a modular design and supports five types of distillation subsystems. Depending on the subsystem, a large language model with hundreds of billions of parameters, an automatic speech recognition model, a speech generation model, a text-to-image model, an image-to-text model, a text-to-video model, or an image-to-video model is used to generate, in batches, multimodal corpus pairs that are preliminarily aligned in semantics.

[0035] Among them, the duplication check refers to calculating the cosine similarity between the generated corpora and removing redundant samples with a cosine similarity of not less than 0.9; the reliability check refers to removing corpora with incomplete semantics or incorrect content.
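As a minimal sketch of the duplication check, assuming each generated corpus item has already been embedded as a vector (the embedding model is not specified in the source, and `deduplicate` is a hypothetical helper), a greedy filter keeps a sample only if its cosine similarity to every previously kept sample is below 0.9:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def deduplicate(embeddings, threshold=0.9):
    """Return the indices of samples kept after removing near-duplicates.

    A sample is dropped when its similarity to any already-kept sample
    is not less than `threshold` (0.9 in the description).
    """
    kept = []
    for i, vec in enumerate(embeddings):
        if all(cosine(vec, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept

embeddings = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
print(deduplicate(embeddings))  # the second vector nearly duplicates the first
```

The pairwise comparison is quadratic in the number of kept samples; at corpus scale an approximate nearest-neighbor index would likely be needed, which the source does not discuss.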

[0036] In the construction step of the text-to-text knowledge distillation subsystem obtained in S01, a structured prompt template is designed for the text-to-text knowledge distillation subsystem. The structured prompt template includes four elements: task type, domain label, content requirements, and format specifications. Style variables, difficulty variables, and constraints such as prohibiting ambiguous answers, prohibiting grammatical errors, and prohibiting the omission of key information are set. A large language model with hundreds of billions of parameters is used to distill and produce general domain corpora and vertical domain corpora. The general domain corpora include question-answering task corpora, translation task corpora, and summarizing task corpora. The vertical domain corpora include education scenario corpora, medical scenario corpora, and financial scenario corpora.
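The structured prompt template can be pictured as a small data structure. The sketch below is illustrative only: the four elements, the two variables, and the three prohibitions come from the paragraph above, while the class name, default values, and rendered wording are assumptions.

```python
from dataclasses import dataclass

@dataclass
class PromptTemplate:
    # The four required elements named in the description.
    task_type: str
    domain_label: str
    content_requirements: str
    format_spec: str
    # Style and difficulty variables (default values are assumed).
    style: str = "neutral"
    difficulty: str = "medium"
    # The three prohibitions named in the description.
    constraints: tuple = (
        "Do not give ambiguous answers.",
        "Do not make grammatical errors.",
        "Do not omit key information.",
    )

    def render(self) -> str:
        """Flatten the template into the prompt text sent to the language model."""
        lines = [
            f"Task type: {self.task_type}",
            f"Domain: {self.domain_label}",
            f"Content requirements: {self.content_requirements}",
            f"Format: {self.format_spec}",
            f"Style: {self.style}",
            f"Difficulty: {self.difficulty}",
            "Constraints:",
        ]
        lines += [f"- {c}" for c in self.constraints]
        return "\n".join(lines)

template = PromptTemplate(
    task_type="question answering",
    domain_label="medical",
    content_requirements="one question with a complete answer",
    format_spec="JSON with keys 'question' and 'answer'",
)
print(template.render())
```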

[0037] In the step of generating bidirectional bimodal text-speech corpus obtained in S02, the text seed library is constructed to cover Chinese, English, mixed language text, general domain text, vertical domain text, and dialogue scenario text, with the text length controlled between 1 and 30 characters. The text data in the text seed library is processed by removing special characters, standardizing spaces, unifying punctuation marks, and verifying semantic compliance. The speech data in the speech seed library is processed by noise reduction, speech enhancement, trimming blank speech, and removing noisy audio. Then, the speech generation model and the automatic speech recognition model are used to generate bidirectional corpus from the preprocessed data.
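The text-side preprocessing can be sketched as below. The punctuation table is a small illustrative subset and the set of characters treated as "special" is an assumption; semantic compliance verification is not covered, since the source does not say how it is performed.

```python
import re

# Full-width to half-width punctuation (illustrative subset, assumed).
PUNCT_MAP = str.maketrans({"，": ",", "。": ".", "！": "!", "？": "?", "；": ";", "：": ":"})

def clean_seed_text(text, min_len=1, max_len=30):
    """Return the cleaned text, or None if it falls outside the 1 to 30 character range."""
    text = text.translate(PUNCT_MAP)               # unify punctuation marks
    text = re.sub(r"[^\w\s,.!?;:'\"-]", "", text)  # remove special characters
    text = re.sub(r"\s+", " ", text).strip()       # standardize spaces
    return text if min_len <= len(text) <= max_len else None

print(clean_seed_text("Hello，   world！★"))
```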

[0038] In the text-image bi-modal corpus generation step obtained in S02, the image data source covers general aligned images and domain-specific images, with high image quality requirements, no blurring, no watermarks, and no noise; the original text data source is subjected to noise removal, expression standardization, and text semantic verification; the image data source is subjected to format unification, size cropping, and image enhancement; and the text-image bi-modal corpus is generated in batches based on the text-to-image model and the image-to-text model.

[0039] In the text-video corpus generation of S02, the text content in the text seed library needs to cover multiple scenarios, contain detailed descriptions, and be temporally complete with clearly identifiable elements. After the subject, action, scene, and action sequence elements of the text data in the text seed library are sorted out, the text-video corpus is generated in batches using the text-to-video model.

[0040] In the image-video corpus generation of S02, the image dataset is required to have comprehensive scene coverage, rich objects, high resolution, and no occlusion. After the format of the image dataset is unified and data augmentation is applied, the elements of each image in the dataset are analyzed in batches using the image-to-text model. Static objects in the images are organized into dynamic text, and the image-video corpus is then generated in batches from the dynamic text using the image-to-video model.

[0041] Among them, the speech generation model refers to a model that converts text into speech, and the automatic speech recognition model refers to a model that converts speech into text. The automatic speech recognition model is used both in the text-speech bidirectional bimodal knowledge distillation subsystem, to generate text data from speech, and in S03, to recognize speech content.

[0042] Among them, the text-to-image model refers to a model that generates corresponding images in batches from text descriptions; the image-to-text model refers to a model that generates corresponding text descriptions in batches from images. The image-to-text model is also used for image element analysis in the text-image bidirectional bimodal knowledge distillation subsystem and the image-video knowledge distillation subsystem, as well as semantic feature extraction in S04; the text-to-video model refers to a model that generates corresponding videos in batches from text; and the image-to-video model refers to a model that generates corresponding videos in batches from images and dynamic text.

[0043] Among them, the panoptic segmentation model refers to an image segmentation model that supports three modes: panoptic segmentation, instance segmentation, and semantic segmentation. The output format is entity region coordinates-entity name mapping, which is used to establish coarse-grained associations between image regions and text entities.

[0044] In the text-speech segment alignment step obtained in S03, a sentence-level segmentation strategy is used to segment the data according to the text length and punctuation marks. After the speech content is recognized by the automatic speech recognition model, the audio file is segmented according to the segment relationship of the speech content. The segmented text data and speech data are semantically compared to complete the text-speech segment alignment and obtain the text-speech segment aligned data.
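The segment alignment step can be sketched as follows, under two assumptions that the source does not state: segments are capped at 30 characters, and the "semantic comparison" is approximated by string similarity between each text segment and the ASR transcript of the corresponding audio segment. The ASR call itself is left out; `asr_segments` stands in for its output.

```python
import re
from difflib import SequenceMatcher

def segment_text(text, max_len=30):
    """Split after sentence-ending punctuation, then cap each piece at max_len."""
    segments = []
    for part in re.split(r"(?<=[.!?。！？])\s*", text):
        part = part.strip()
        while len(part) > max_len:
            segments.append(part[:max_len])
            part = part[max_len:].strip()
        if part:
            segments.append(part)
    return segments

def pair_segments(text_segments, asr_segments, min_ratio=0.8):
    """Pair segments in order and keep pairs whose transcripts agree closely."""
    pairs = []
    for idx, (ref, hyp) in enumerate(zip(text_segments, asr_segments)):
        ratio = SequenceMatcher(None, ref.lower(), hyp.lower()).ratio()
        if ratio >= min_ratio:
            pairs.append((idx, ref, hyp))
    return pairs

text_segments = segment_text("Hi there. How are you? Fine.")
asr_segments = ["hi there.", "how are you?", "something else entirely"]
print(pair_segments(text_segments, asr_segments))
```

Pairs that fall below `min_ratio` are simply dropped here; a production pipeline might instead re-segment the audio and retry.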

[0045] When generating the alignment data between entities in images and entities in text in S03, a large language model is used to match the entity region coordinate-entity name mapping against the text content, identify the entities in the text, and align them with the entities in the mapping to form the alignment data between entities in images and entities in text.

[0046] Among them, the Hidden Markov Forced Alignment Algorithm refers to an algorithm that uses the Hidden Markov Model to force temporal boundary alignment between speech frame sequences and text word sequences, achieving precise word-level alignment within the framework of word-level and frame-level forced alignment models.

[0047] Among them, the self-supervised speech representation model refers to a speech feature extraction model trained by self-supervised learning, which is used to extract frame-level sentiment features and frame-level paralinguistic (sub-language) features. Frame-level paralinguistic features are frame-level features of the speech signal that go beyond the linguistic content itself, such as timbre, speech rate, tone, and emotion. After being concatenated with the main content token, these features support the fine-grained semantic expression of the timbre-and-sentiment text-speech alignment data.

[0048] The scene graph alignment mechanism parses the alignment data between entities in images and entities in text into a structured language scene graph. In the structured language scene graph, nodes represent entities and attributes, and edges represent relationships between entities. Fine-grained alignment of image and text is achieved through two layers of matching: node-level alignment and edge-level alignment.
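The two-layer matching can be illustrated with a toy example. The sketch assumes that each scene graph is given as a set of node labels and a set of (subject, relation, object) edges, and that matching is by exact label equality; a real system would more plausibly match by embedding similarity, which the source does not specify.

```python
def match_scene_graphs(image_graph, text_graph):
    """Two-layer matching: nodes first, then edges between matched nodes."""
    matched_nodes = image_graph["nodes"] & text_graph["nodes"]
    matched_edges = {
        (s, r, o)
        for (s, r, o) in image_graph["edges"] & text_graph["edges"]
        if s in matched_nodes and o in matched_nodes
    }
    return matched_nodes, matched_edges

image_graph = {
    "nodes": {"cup", "table", "lamp"},
    "edges": {("cup", "on", "table"), ("lamp", "next to", "table")},
}
text_graph = {
    "nodes": {"cup", "table"},
    "edges": {("cup", "on", "table")},
}
nodes, edges = match_scene_graphs(image_graph, text_graph)
print(sorted(nodes), sorted(edges))
```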

[0049] When generating the visual space text-image alignment data in S04, spatial predicates guide the visual-linguistic alignment of the alignment data between entities in images and entities in text. The spatial description in the language is mapped to the relative positional relationship in the image coordinate system to obtain the visual space text-image alignment data. Spatial predicates are the linguistic expressions used to describe the spatial positional relationship between entities, including positional descriptions such as "above XX" and "inside XX".
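Spatial predicate mapping can be sketched as a lookup from a predicate to a geometric test on entity regions. The sketch assumes axis-aligned boxes `(x_min, y_min, x_max, y_max)` in image coordinates with y increasing downward; the two predicates mirror the examples above, and the geometric definitions are simplifying assumptions.

```python
def is_above(a, b):
    """Box a lies entirely above box b (smaller y means higher in the image)."""
    return a[3] <= b[1]

def is_inside(a, b):
    """Box a is fully contained in box b."""
    return b[0] <= a[0] and b[1] <= a[1] and a[2] <= b[2] and a[3] <= b[3]

PREDICATES = {"above": is_above, "inside": is_inside}

def check_spatial(predicate, subject_box, object_box):
    """Verify that a textual spatial predicate holds between two entity regions."""
    return PREDICATES[predicate](subject_box, object_box)

cup = (40, 10, 60, 30)
table = (0, 30, 100, 80)
room = (0, 0, 200, 200)
print(check_spatial("above", cup, table), check_spatial("inside", cup, room))
```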

[0050] Among them, the pose prediction model refers to the model that performs key point estimation and state recognition on the action and pose of entities in an image. It is used to extract key image frames containing entity actions or states. The key image frames are aligned with verb phrases in the text to obtain action-state text-image aligned data.

[0051] The joint detection-tracking-recognition framework refers to a framework that simultaneously performs object detection, multi-object tracking, and semantic recognition on a video sequence, generating a time-stamped entity temporal trajectory for each entity. When generating the entity temporal trajectory text-video alignment data in S04, the image-to-text model is used to extract the semantic features of each entity temporal trajectory, and the semantic features are parsed into a temporal scene graph. In the temporal scene graph, nodes are entities and their attributes, and edges are relationships between entities. The event flow described in the text is aligned with the structural changes of the temporal scene graph through a graph temporal network, and the data is represented as quadruples of timestamp, entity, attribute, and relation to obtain the entity temporal trajectory text-video alignment data. The graph temporal network refers to a network that models how the nodes and edges of the temporal scene graph evolve over time.
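The quadruple representation can be pictured as a simple record type. In this sketch the field names follow the quadruple described above, while the timestamp unit (seconds), the example values, and the grouping helper are assumptions.

```python
from collections import defaultdict
from typing import NamedTuple

class TemporalFact(NamedTuple):
    timestamp: float  # seconds from the start of the video (assumed unit)
    entity: str
    attribute: str
    relation: str

def group_by_entity(facts):
    """Order facts by time and collect them into one trajectory per entity."""
    trajectories = defaultdict(list)
    for fact in sorted(facts, key=lambda f: f.timestamp):
        trajectories[fact.entity].append(fact)
    return dict(trajectories)

facts = [
    TemporalFact(2.0, "dog", "running", "chases ball"),
    TemporalFact(0.5, "dog", "sitting", "next to person"),
    TemporalFact(1.0, "person", "standing", "throws ball"),
]
trajectories = group_by_entity(facts)
print([f.attribute for f in trajectories["dog"]])
```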

[0052] Among them, the three-dimensional pose estimation model refers to the model that generates spatiotemporal trajectories containing the coordinates of key points of each entity in the video; in the step of generating text-video aligned data of entity spatial trajectory obtained in S04, the action and state of each entity in the video are analyzed by combining the coordinates of the key points of the entity and the corresponding image frames using the graph-to-text model, and the data is represented as a triplet of timestamp, spatial coordinates, action or state to obtain text-video aligned data of entity spatial trajectory.
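
As a minimal sketch, the quadruple (timestamp, entity, attribute, relationship) and the triplet (timestamp, spatial coordinates, action or state) described above could be stored as typed records. The class names, field names, and sample values are hypothetical and are not specified in the text.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class TemporalTrajectoryRecord:
    """Quadruple for entity time trajectory text-video aligned data."""
    timestamp_s: float   # assumed unit: seconds from the start of the video
    entity: str
    attribute: str
    relationship: str


@dataclass(frozen=True)
class SpatialTrajectoryRecord:
    """Triplet for entity spatial trajectory text-video aligned data."""
    timestamp_s: float
    keypoint_xyz: Tuple[float, float, float]  # assumed 3D key point coordinates
    action_or_state: str


# Illustrative values only; they do not come from the source.
temporal = TemporalTrajectoryRecord(1.2, "person", "red jacket", "holds cup")
spatial = SpatialTrajectoryRecord(1.2, (0.41, 0.63, 1.80), "raising arm")
```

Frozen dataclasses keep each aligned record immutable once written, which suits data that is later indexed and serialized.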

[0053] The phrase localization joint framework refers to a framework that maps action phrases in text to consecutive video frame intervals and outputs start and end timestamps for temporal frame-level semantic alignment. In the step of generating frame-level semantic text-video aligned data obtained in S04, semantic keyframes are automatically identified using a change point detection method, key events in the text are aligned with the semantic keyframes through cross-modal retrieval, and the phrase localization joint framework is used to map action phrases in text to consecutive video frame intervals to obtain frame-level semantic text-video aligned data.
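
The keyframe identification and interval output described above can be sketched as follows. The text does not name a specific change point detection method, so a simple threshold on the Euclidean distance between consecutive frame features is assumed here; the function names, the threshold, and the frame rate are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def change_points(features: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Return indices of frames whose feature vector jumps away from the previous frame.

    Assumption: a frame is a semantic keyframe when the Euclidean distance to
    the preceding frame's feature vector exceeds `threshold`.
    """
    keyframes = []
    for i in range(1, len(features)):
        if math.dist(features[i - 1], features[i]) > threshold:
            keyframes.append(i)
    return keyframes


def interval_to_timestamps(start_frame: int, end_frame: int, fps: float) -> Tuple[float, float]:
    """Convert an inclusive frame interval into start and end timestamps in seconds."""
    return start_frame / fps, (end_frame + 1) / fps


# Toy per-frame features: two abrupt changes, at frame 2 and frame 4.
frames = [[0.0, 0.0], [0.1, 0.0], [2.0, 2.0], [2.1, 2.0], [5.0, 0.0]]
keys = change_points(frames, threshold=1.0)
start_s, end_s = interval_to_timestamps(2, 3, fps=25.0)
```

In this toy input the detected keyframes are frames 2 and 4, and the action phrase mapped to frames 2 through 3 spans 0.08 s to 0.16 s at 25 frames per second.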

[0054] Among them, the multi-resolution temporal pyramid structure refers to a hierarchical alignment framework built between the coarse-grained segment layer and the fine-grained frame layer, which refines the temporal granularity correspondence between video frames and text step by step. It addresses the difference of 2 to 3 orders of magnitude in temporal granularity between the video frame level (millisecond level) and the text sentence level (second level). The connectionist temporal classification (CTC) forced alignment constraint refers to the use of the CTC loss function to apply soft alignment constraints to the video frame sequence and the text sequence without precise boundary labeling. It assists automatic alignment at each level of the multi-resolution temporal pyramid structure and reduces the cost of manual labeling.
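
To illustrate why a CTC loss needs no boundary labels, the following sketch computes the CTC negative log-likelihood with the standard forward recursion, which sums over every monotonic alignment between the frame sequence and the label sequence. It is a plain-probability toy, not the patent's training code; the function name and the uniform inputs are assumptions.

```python
import math
from typing import Sequence


def ctc_neg_log_likelihood(probs: Sequence[Sequence[float]],
                           target: Sequence[int],
                           blank: int = 0) -> float:
    """CTC loss for one sequence; probs[t][k] is the probability of symbol k at frame t.

    Works in plain probability space, so it suits short toy sequences only;
    practical implementations work in log space to avoid underflow.
    """
    # Interleave blanks around the labels: blank, y1, blank, y2, ..., blank.
    ext = [blank]
    for symbol in target:
        ext += [symbol, blank]
    size = len(ext)

    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        updated = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # A blank may be skipped only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            updated[s] = total * probs[t][ext[s]]
        alpha = updated

    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood)


# Two frames, vocabulary {blank, label 1}, uniform per-frame probabilities.
loss = ctc_neg_log_likelihood([[0.5, 0.5], [0.5, 0.5]], target=[1])
```

With two frames and one label there are three valid alignments (label-label, label-blank, blank-label), each with probability 0.25, so the loss equals -ln(0.75).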

[0055] Among them, TFRecord format refers to a binary serialized data format adapted to the TensorFlow framework; Webdataset format refers to a standardized data format adapted to the PyTorch framework that supports efficient streaming reading of large-scale datasets.

[0056] Among them, the lightweight multimodal data feature extractor refers to a cross-modal feature extraction module that adopts an end-to-end embedded architecture. By sharing a lightweight backbone network, a single-stage projection alignment module, and a modality adaptive routing mechanism, it avoids the computational overhead and information loss caused by the visual encoder, multilayer perceptron projection, and serial mapping of the large language model in traditional large visual language models.

[0057] In the construction step of the shared lightweight backbone network obtained in S06, the shared lightweight backbone network consists of a basic feature extraction layer, a multi-scale feature fusion module, a dynamic sparse attention mechanism, and a cross-modal feature alignment layer. The basic feature extraction layer uses depthwise separable convolution to decompose the standard convolution into two steps: depthwise convolution and pointwise convolution, and combines them with the SiLU activation function. The multi-scale feature fusion module introduces dilated spatial pyramid pooling combined with depthwise separable convolution to capture multi-scale contextual information. The dynamic sparse attention mechanism dynamically selects the most important attention connections through a predefined sparse mask and adaptively adjusts the sparsity according to the complexity of the input content. The cross-modal feature alignment layer uses 1×1 convolution and layer normalization to perform preliminary spatial alignment of features of different modalities.
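
The saving obtained by decomposing a standard convolution into depthwise and pointwise steps can be checked with a short parameter count. The kernel size and channel counts below are illustrative assumptions, and bias terms are ignored.

```python
def standard_conv_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a standard kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise (kernel x kernel per input channel) plus pointwise (1x1) pair."""
    depthwise = kernel * kernel * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


# Assumed example layer: 3x3 kernel, 64 input channels, 128 output channels.
standard = standard_conv_params(3, 64, 128)           # 73728
separable = depthwise_separable_params(3, 64, 128)    # 8768
ratio = separable / standard                          # equals 1/c_out + 1/kernel**2
```

For this assumed layer the separable form uses roughly 12% of the weights of the standard convolution.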

[0058] In the construction step of the single-stage projection alignment module obtained in S06, the single-stage projection alignment module uses learnable 1×1 convolution and layer normalization to directly map the visual token or audio token to the large language model embedding space, and introduces local residual connections to perform weighted fusion of features before and after projection. The projection output formula (the formula itself is not reproduced in this text) involves the following quantities: the feature vector output by the projection; the projected feature vector obtained after 1×1 convolution and layer normalization mapping; the input feature vector before projection; the reference magnitude of the feature vector; and learnable parameters. Independent dynamic scaling factors are set for different modal projection paths. Gradient pruning and feature distribution monitoring mechanisms are used; the shared lightweight backbone network parameters are fixed in the early stage of training, and the single-stage projection alignment module is gradually unfrozen.

[0059] In the construction step of the modal adaptive routing mechanism obtained in S06, the modal adaptive routing mechanism introduces a lightweight modal gating unit and uses a soft routing strategy to allocate computing resources according to the gating weight. For image input, it prioritizes the activation of the spatial attention branch, for audio input, it prioritizes the activation of the temporal convolution branch, and for video frames, it adopts a hybrid strategy. It integrates an energy consumption-aware scheduling module to automatically reduce the activation weight of non-critical branches when resources are limited.
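
One possible reading of the soft routing strategy is sketched below. The text does not give the gating formula, so the modality prior values, the two-branch layout, and the way the energy budget scales the non-dominant branch are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence

BRANCHES = ("spatial_attention", "temporal_convolution")

# Hypothetical prior logits: images favour the spatial branch, audio favours
# the temporal branch, and video frames use an even (hybrid) split.
MODALITY_PRIOR = {"image": (2.0, 0.0), "audio": (0.0, 2.0), "video": (1.0, 1.0)}


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(modality: str,
          content_logits: Sequence[float] = (0.0, 0.0),
          energy_budget: float = 1.0) -> Dict[str, float]:
    """Return soft activation weights per branch.

    energy_budget in (0, 1]: values below 1 shrink the weight of the
    non-dominant (non-critical) branch before renormalising.
    """
    prior = MODALITY_PRIOR[modality]
    weights = softmax([p + c for p, c in zip(prior, content_logits)])
    dominant = max(range(len(weights)), key=lambda i: weights[i])
    scaled = [w if i == dominant else w * energy_budget for i, w in enumerate(weights)]
    norm = sum(scaled)
    return {name: w / norm for name, w in zip(BRANCHES, scaled)}


image_full = route("image")
image_limited = route("image", energy_budget=0.5)
video_full = route("video")
```

Because the routing is soft, every branch keeps a non-zero weight; a limited energy budget only shifts weight toward the branch that matters most for the modality.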

[0060] Among them, SiLU activation function refers to Sigmoid linear unit activation function, which is used to achieve lightweight computation while maintaining feature extraction capability in each stage of depthwise separable convolution; dilated spatial pyramid pooling refers to a pooling method that captures multi-scale contextual information in parallel under different receptive fields through dilated convolution with different dilation rates.
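
Both definitions above reduce to short formulas: SiLU is x multiplied by sigmoid(x), and a dilated convolution with kernel size k and dilation rate d covers k + (k - 1)(d - 1) positions along each axis. The sketch below evaluates both; the dilation rates are illustrative assumptions, since the text does not specify them.

```python
import math


def silu(x: float) -> float:
    """Sigmoid linear unit: x * sigmoid(x), written to avoid overflow for large |x|."""
    if x >= 0:
        return x / (1.0 + math.exp(-x))
    exp_x = math.exp(x)
    return x * exp_x / (1.0 + exp_x)


def effective_kernel_size(kernel: int, dilation: int) -> int:
    """Extent covered along one axis by a dilated convolution kernel."""
    return kernel + (kernel - 1) * (dilation - 1)


# Assumed parallel branches of a dilated spatial pyramid pooling block.
rates = (1, 6, 12, 18)
extents = [effective_kernel_size(3, rate) for rate in rates]  # [3, 13, 25, 37]
```

The parallel branches therefore see receptive fields of different extents while each still uses only a 3×3 set of weights.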

[0061] The lightweight semantic parsing head refers to a lightweight instance-level semantic parsing module integrated at the end of a shared lightweight backbone network. It outputs instance-level semantic tokens for images or videos. These instance-level semantic tokens explicitly encode object categories, attributes, spatial boundaries, and local context information, replacing the shallow perception method that relies solely on raw pixel features.

[0062] The lightweight temporal saliency detection module is a lightweight module that automatically identifies semantic turning points and action keyframes in a video. It achieves a balance between computational cost and semantic expressive power by performing high-dimensional semantic parsing only on keyframes and using low-dimensional motion optical flow tokens to represent the remaining frames.

[0063] Among them, cross-modal semantic anchor vectors refer to a set of learnable semantic center vectors, each vector corresponding to a semantic dimension. The features of each modality are aggregated to the corresponding cross-modal semantic anchor vectors through cosine similarity weighting, so as to achieve alignment of the semantic space rather than the original feature space.
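
A minimal sketch of expressing a feature in the anchor space follows, assuming each feature has already been projected to the anchors' dimensionality. The coordinates are the cosine similarities to each anchor; the anchor values and feature values are made up for illustration.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def anchor_coordinates(feature: Sequence[float],
                       anchors: Sequence[Sequence[float]]) -> List[float]:
    """Describe a feature by its cosine similarity to every semantic anchor."""
    return [cosine(feature, anchor) for anchor in anchors]


# Two hypothetical anchors, each standing for one semantic dimension.
anchors = [[1.0, 0.0], [0.0, 1.0]]

# Same direction, very different magnitudes, as may happen across modalities.
image_feature = [3.0, 0.3]
text_feature = [0.5, 0.05]

image_coords = anchor_coordinates(image_feature, anchors)
text_coords = anchor_coordinates(text_feature, anchors)
```

Cosine similarity ignores vector magnitude, so the two features receive the same anchor coordinates even though their raw scales differ. This is one way to read the claim that alignment takes place in the semantic space rather than in the original feature space.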

[0064] The dual-stream semantic anchor fusion network adopts a dual-stream Transformer architecture. After the image stream and text stream are encoded independently, they are fused by a cross-modal semantic anchor vector as an intermediate bridge. The features of both streams are projected onto the cross-modal semantic anchor vector space and then fused by vector inner product. The dual-stream semantic anchor fusion network supports a cross-layer iterative refinement mechanism for cross-modal semantic anchor vectors. Each layer of cross-modal semantic anchor vectors receives residual updates from the two streams in the previous layer. After three iterations, the final aligned features are output. The dual-stream semantic anchor fusion network replaces direct cross-stream attention with cross-modal semantic anchor vectors, avoiding alignment deviations caused by modal distribution differences in the original feature space. This allows image and text features to be aligned in a cross-modal semantic anchor vector space with clear semantic meaning, improving the cross-modal matching accuracy of object attributes, relationships, and spatial structures. This enables the model to overcome the limitations of shallow perception that relies solely on coordinate positioning and to have the ability to understand and reason about the deep semantics of images, thus enhancing the generalization performance of aligned features in downstream multimodal tasks.

[0065] The sparse-gated hybrid expert cross-modal routing network adopts a hybrid expert (mixture-of-experts) architecture, with each modality corresponding to an independent expert group, each group containing 4 experts, and 2 cross-modal shared experts set at the top level. A lightweight gating network routes each input to the top 2 experts via softmax based on the input modality type and content complexity. The network includes an inter-expert information transfer mechanism: the outputs of different modalities for the same sample undergo a jump interaction through cross-expert attention before being merged into the cross-modal shared experts for fusion. By activating experts on demand using sparse computation, the network improves the capacity and specialization of cross-modal feature fusion without disproportionately increasing the total number of parameters. The cross-expert attention jump interaction mechanism ensures sufficient information transfer between different modalities before fusion, and the cross-modal shared experts act as a global semantic convergence center. This ensures consistent representation of heterogeneous modal features in a unified vector space, improving the inference accuracy and robustness of the model in multimodal joint tasks. In the sparse-gated hybrid expert cross-modal routing network, top-2 expert routing uses a discrete sorting operation, so during backpropagation the gradient cannot flow through unselected attention connections. This scheme employs the Gumbel-Softmax reparameterization method to approximate discrete sampling as a continuous differentiable operation. Gradient flow is maintained during training, while the inference phase switches to hard selection of the top 2 experts, balancing training stability and inference efficiency.
The Gumbel-Softmax reparameterization method involves adding Gumbel noise to the softmax input and introducing a temperature parameter to control the sharpness of the distribution. It approximates a discrete one-hot distribution when the temperature approaches 0 and maintains a differentiable, approximately continuous distribution when the temperature is high, thus achieving approximate gradient flow in the discrete selection operation.
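
The sampling step of the Gumbel-Softmax method and the hard top-2 selection used at inference can be sketched as follows. This pure-Python version shows only the forward computation (it has no automatic differentiation), and the gate logits, temperatures, and function names are assumptions.

```python
import math
import random
from typing import List, Sequence


def gumbel_softmax(logits: Sequence[float], temperature: float,
                   rng: random.Random) -> List[float]:
    """Soft expert weights: softmax((logit + Gumbel noise) / temperature)."""
    noisy = []
    for logit in logits:
        # Clamp u away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def hard_top2(logits: Sequence[float]) -> List[int]:
    """Inference-time routing: indices of the two largest gate logits."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:2]


gate_logits = [0.1, 2.0, -1.0, 1.5]  # one hypothetical logit per expert in a group
sharp = gumbel_softmax(gate_logits, temperature=0.1, rng=random.Random(0))
smooth = gumbel_softmax(gate_logits, temperature=5.0, rng=random.Random(0))
selected = hard_top2(gate_logits)
```

With the same noise draw, the low-temperature weights concentrate on one expert while the high-temperature weights stay spread out, which matches the behaviour described for the temperature parameter.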

[0066] Among them, the vector space alignment score refers to the mean of the cosine similarity between each modal alignment feature vector and the corresponding cross-modal semantic anchor vector, which serves as a quantitative indicator to measure the quality of cross-modal feature vector space alignment; the preset alignment threshold refers to the lower bound of the preset alignment score. Samples that are lower than the preset alignment threshold are backtracked to S04 to re-execute the cross-modal data fine-grained semantic alignment process.
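
The score and the backtracking decision can be sketched as below. The default threshold of 0.75 is the reference value recommended in the description of step S06; the dictionary layout, modality names, and function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def alignment_score(aligned: Dict[str, Sequence[float]],
                    anchors: Dict[str, Sequence[float]]) -> float:
    """Mean cosine similarity between each modality's aligned feature and its anchor."""
    sims = [cosine(aligned[modality], anchors[modality]) for modality in aligned]
    return sum(sims) / len(sims)


def needs_backtrack(score: float, threshold: float = 0.75) -> bool:
    """True when the sample should return to S04 for fine-grained realignment."""
    return score < threshold


anchors = {"image": [1.0, 0.0], "text": [1.0, 0.0]}
good_sample = {"image": [2.0, 0.0], "text": [0.5, 0.0]}
poor_sample = {"image": [2.0, 0.0], "text": [0.0, 1.0]}

good_score = alignment_score(good_sample, anchors)  # both modalities match: 1.0
poor_score = alignment_score(poor_sample, anchors)  # one orthogonal modality: 0.5
```

Here the first sample passes, while the second falls below the threshold and would be sent back for realignment.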

[0067] Optionally, the present invention also provides a method for forming a multimodal large model cross-modal alignment system by means of a computer, wherein the computer is provided with a readable storage medium, the readable storage medium storing program instructions, and the program instructions are used to execute the above-described method when the computer is run.

[0068] The specific implementation of step S01 is as follows: The goal of this step is to construct a multimodal knowledge distiller and mass-produce quality-controlled multimodal raw corpora. The multimodal knowledge distiller adopts a modular design, internally containing five distillation subsystems: text-to-text, text-to-speech bi-directional bimodal, text-to-image bi-directional bimodal, text-to-video, and image-to-video. In the text-to-text distillation subsystem, a structured prompt template containing four elements—task type, domain label, content requirements, and format specifications—is first designed. Style and difficulty variables are introduced into the template to achieve output diversity control. Hard constraints are set, such as prohibiting ambiguous answers, grammatical errors, and omissions of key information. Subsequently, a large-scale language model with hundreds of billions of parameters is called to mass-produce general domain corpora such as question-and-answer, translation, and summarization, as well as vertical domain corpora such as education, healthcare, and finance, according to the template. The speech distillation subsystem relies on a speech generation model and an automatic speech recognition model to form a bi-directional production link. The image distillation subsystem relies on a text-to-image model and an image-to-text model to achieve bi-directional mutual generation of text and images. The video distillation subsystem is driven by a text-to-video model and an image-to-video model, respectively. After each subsystem is completed, two quality control processes are performed on all generated corpora: a duplication check and a reliability check. The duplication check is based on the cosine similarity principle, which calculates the vector similarity between corpus pairs and removes redundant samples with a cosine similarity of not less than 0.9. 
This threshold is set based on the fact that sample pairs with a similarity higher than 0.9 have a high degree of semantic overlap, and keeping them would cause a bias in the distribution of training data. The reliability check removes corpora with semantic incompleteness or content errors through semantic integrity verification and content accuracy check, and finally obtains the multimodal original corpus.
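
The duplication check can be sketched as a greedy filter over corpus embeddings. The text does not say how the vectors are obtained or which sample of a redundant pair is kept, so the sketch assumes precomputed embeddings and keeps the earlier sample.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def deduplicate(embeddings: Sequence[Sequence[float]], threshold: float = 0.9) -> List[int]:
    """Return indices of samples to keep.

    A sample is dropped when its cosine similarity to any already kept
    sample is not less than `threshold` (0.9 in the text).
    """
    kept: List[int] = []
    for index, vector in enumerate(embeddings):
        if all(cosine(vector, embeddings[k]) < threshold for k in kept):
            kept.append(index)
    return kept


# Toy embeddings: the second vector nearly duplicates the first.
corpus = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept_indices = deduplicate(corpus)
```

The pairwise comparison is quadratic in the number of samples; for large corpora an approximate nearest-neighbour index would normally replace the inner loop.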

[0069] The specific implementation of step S02 is as follows: The goal of this step is to generate four types of cross-modal corpus pairs based on multimodal raw corpora through bidirectional corpus generation. In terms of text-speech corpus generation, the constructed text seed library covers Chinese, English, and mixed languages, encompassing general domains, vertical domains, and dialogue scenarios. The text length is controlled between 1 and 30 characters to ensure speech generation quality. After special character removal, space standardization, punctuation unification, and semantic compliance verification, the text data is input into the speech generation model to generate speech. The speech data, after noise reduction, speech enhancement, whitespace cropping, and noisy audio removal, is input into the automatic speech recognition model to generate text in reverse. This bidirectional generation ensures the initial semantic consistency between text and speech. In terms of text-image corpus generation, the text data source, after noise removal, expression standardization, and semantic verification, is input into the text-to-image model to generate images. The image data source is required to be unblurred, watermark-free, and noise-free. After format unification, size cropping, and image enhancement, it is input into the image-to-text model to generate text descriptions in reverse. In text-to-video corpus generation, the text in the seed library needs to clearly define elements such as the subject, action, scene, and action sequence to ensure the temporal integrity of the video content. After processing, videos are generated in batches based on the text-to-video model. In image-to-video corpus generation, after format unification and data augmentation, the image dataset is first analyzed in batches by the image-to-text model and organized into text with dynamic descriptive properties. 
Then, the image-to-video model generates videos in batches based on the dynamic text, thereby completing the semantic consistency extension from image to video.

[0070] The specific implementation of step S03 is as follows: The goal of this step is to establish a preliminary semantic correspondence between cross-modal data at the segment level, i.e., coarse-grained semantic alignment. For text-speech segment alignment, a sentence-level segmentation strategy is used to segment the text based on text length and punctuation position. Then, an automatic speech recognition model is used to recognize the entire content of the corresponding speech file. The audio is segmented based on the segment boundary information in the recognition results. The segmented text segments and speech segments are semantically compared one by one to complete text-speech segment alignment, forming text-speech segment aligned data. For text-image entity alignment, a panoramic segmentation model is used to simultaneously perform panoramic segmentation, instance segmentation, and semantic segmentation on the image. The output format is an entity region coordinate-entity name mapping, i.e., the mapping relationship between the pixel coordinate range of each image region and the corresponding entity name. Then, a large language model is used to match the entity region coordinate-entity name mapping with the text content, identifying entities appearing in the text and aligning them with the corresponding entities in the mapping. Finally, entity-text alignment data in the image is formed, completing coarse-grained semantic alignment of cross-modal data and providing a structured foundation for subsequent fine-grained alignment.
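
The entity region coordinate-entity name mapping and its matching against the text can be illustrated with the sketch below. In the described method a large language model performs the matching; here a case-insensitive substring search stands in for it, purely to show the shape of the data. The bounding boxes and the sentence are invented.

```python
from typing import Dict, List, Tuple

BBox = Tuple[int, int, int, int]  # assumed layout: (x_min, y_min, x_max, y_max) in pixels


def match_entities(text: str, region_map: Dict[BBox, str]) -> List[dict]:
    """Align entity names from the segmentation output with mentions in the text.

    Stand-in for the large language model matching step: a plain substring
    search, which would also match inside longer words.
    """
    lowered = text.lower()
    matches = []
    for bbox, name in region_map.items():
        start = lowered.find(name.lower())
        if start != -1:
            matches.append({
                "entity": name,
                "bbox": bbox,
                "text_span": (start, start + len(name)),
            })
    return matches


regions = {
    (10, 20, 110, 220): "dog",
    (130, 40, 300, 200): "bicycle",
    (0, 0, 50, 50): "tree",
}
aligned = match_entities("A dog runs beside a bicycle.", regions)
```

Only entities that appear in both the image mapping and the text are aligned; the tree region has no textual mention and is left out.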

[0071] The specific implementation of step S04 is as follows: The goal of this step is to further mine fine-grained cross-modal semantic correspondences at the word, frame, object, and spatial levels based on coarse-grained alignment. Regarding fine-grained text-speech alignment, firstly, a word-level and frame-level forced alignment model framework is used, employing a Hidden Markov Model (HMM) forced alignment algorithm to force temporal boundary alignment between speech frame sequences and text word sequences. The HMM achieves precise word-level timestamp localization by modeling speech state transition probabilities and observation probabilities, forming word-level and frame-level text-speech aligned data. Then, a self-supervised speech representation model is used to extract frame-level sentiment features, which are aligned with sentiment words in the text, forming timbre-sentiment text-speech aligned data. In terms of fine-grained text-image alignment, based on coarse-grained entity alignment data, the image and text data are parsed into a structured language scene graph through a scene graph alignment mechanism. Nodes represent entities and attributes, and edges represent relationships between entities. Visual object text-image alignment data is formed through node-level and edge-level two-layer matching. Spatial predicate guidance maps spatial descriptions in language to relative positional relationships in the image coordinate system, forming visual spatial text-image alignment data. A pose prediction model is used to estimate key points and identify states of entities in the image, extracting key image frames containing entity actions or states and aligning them with text verb phrases to form action-state text-image alignment data. In terms of fine-grained text-video alignment, the joint detection-tracking-recognition framework generates timestamped trajectories for each entity. 
A graph-to-text model extracts semantic features from these trajectories and parses them into a temporal scene graph. Event flows and graph structure changes are aligned using a graph-temporal network to form entity temporal trajectory text-video aligned data. A 3D pose estimation model generates spatiotemporal trajectories of entity keypoint coordinates. Combined with the graph-to-text model, action states are analyzed to form entity spatial trajectory text-video aligned data. A change point detection method identifies semantic keyframes and aligns them with text key events via cross-modal retrieval. A phrase localization joint framework maps text action phrases to continuous video frame intervals, forming frame-level semantic text-video aligned data. The above process employs a multi-resolution temporal pyramid structure to progressively refine the correspondence between coarse-grained segment layers and fine-grained frame layers. Combined with connectionist temporal classification (CTC) forced alignment constraints, soft alignment without precise boundary labeling is achieved, effectively solving the alignment problem of a 2-3 order of magnitude difference between the temporal granularity of video frames (milliseconds) and text sentences (seconds). Regarding the alignment threshold setting, the convergence value of the CTC loss is recommended not to exceed 0.15.

[0072] The specific implementation of step S05 is as follows: The goal of this step is to standardize and store the fine-grained alignment data and construct a multimodal alignment dataset that can be used for training. A hierarchical directory structure is adopted to organize all alignment data hierarchically from the dataset root directory to the modality subdirectory to the sample category directory to the sample file, ensuring that multimodal files of the same sample can be associated through a unique sample identifier. A data index table is constructed to record the identifier, category, text path, image path, video path, and alignment information fields of each sample, supporting fast querying and batch retrieval. Manual sampling verification is performed at a rate of 5% to 10%, checking one by one whether the alignment logic is reasonable, whether the annotation accuracy meets the standard, and whether the data availability is normal. The verified data is encapsulated into TFRecord format or Webdataset format according to the downstream framework type. The TFRecord format is adapted to the TensorFlow framework, and the Webdataset format is adapted to the PyTorch framework and supports efficient streaming reading of large-scale datasets, thus obtaining a standardized multimodal alignment dataset.
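
A sketch of the data index table and the manual sampling step follows. The field names mirror the fields listed above, while the path layout, identifiers, and the seeded random sampler are assumptions for illustration.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class IndexRow:
    sample_id: str        # unique identifier shared by all files of one sample
    category: str
    text_path: str
    image_path: str
    video_path: str
    alignment_info: str


def sample_for_review(rows: List[IndexRow], rate: float, seed: int = 0) -> List[IndexRow]:
    """Draw rows for manual verification at a rate within the stated 5% to 10% range."""
    if not 0.05 <= rate <= 0.10:
        raise ValueError("rate must lie between 0.05 and 0.10")
    count = min(len(rows), max(1, round(len(rows) * rate)))
    return random.Random(seed).sample(rows, count)


# Hypothetical layout: modality subdirectory / sample category / sample file.
rows = [
    IndexRow(
        sample_id=f"s{i:04d}",
        category="text-video",
        text_path=f"text/text-video/s{i:04d}.txt",
        image_path=f"image/text-video/s{i:04d}.jpg",
        video_path=f"video/text-video/s{i:04d}.mp4",
        alignment_info="frame-level",
    )
    for i in range(100)
]
review = sample_for_review(rows, rate=0.05)
```

Sampling without replacement guarantees that no sample is reviewed twice, and the fixed seed makes the review set reproducible.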

[0073] The specific implementation of step S06 is as follows: The goal of this step is to construct a lightweight multimodal data feature extractor, complete cross-modal feature vector space alignment and fusion, and drive the quality control loop through vector space alignment scores. The shared lightweight backbone network consists of four modules: a basic feature extraction layer, a multi-scale feature fusion module, a dynamic sparse attention mechanism, and a cross-modal feature alignment layer. The basic feature extraction layer uses depthwise separable convolution to decompose the standard convolution into two steps: depthwise convolution and pointwise convolution, and uses the SiLU activation function to reduce the computational cost. The multi-scale feature fusion module introduces dilated spatial pyramid pooling to capture multi-scale contextual information in parallel. The dynamic sparse attention mechanism adaptively adjusts the sparsity through a predefined sparse mask. The cross-modal feature alignment layer uses 1×1 convolution and layer normalization to complete the initial spatial alignment. At the end of the backbone network, a lightweight semantic parsing head is integrated to output instance-level semantic tokens. Each token explicitly encodes the object category, attributes, spatial boundaries, and local context. For speech input, the self-supervised speech representation model extracts frame-level paralinguistic features and concatenates them with the main content tokens. For video input, the lightweight temporal saliency detection module identifies keyframes and performs high-dimensional semantic parsing only on the keyframes; the remaining frames are represented using low-dimensional motion optical flow tokens.
The single-stage projection alignment module uses learnable 1×1 convolutions and layer normalization to directly map each modality token to the large language model embedding space, and introduces local residual connections that perform weighted fusion of features before and after projection according to learnable parameters; independent dynamic scaling factors are set for different modalities. Backbone network parameters are fixed during the initial training phase, and the projection module is gradually unfrozen. The modality-adaptive routing mechanism employs a soft routing strategy, prioritizing spatial attention branches for images, temporal convolution branches for audio, and a hybrid strategy for video. An integrated energy-aware scheduling module automatically reduces the weights of non-critical branches when resources are limited. The dual-stream semantic anchor fusion network uses cross-modal semantic anchor vectors as an intermediate bridge. Image and text streams are independently encoded and then projected onto the anchor vector space for inner product fusion. The anchor vectors are iteratively refined three times through a cross-layer refinement mechanism to output the final aligned features. In the sparse-gated hybrid expert cross-modal routing network, each modality corresponds to four experts, with two cross-modal shared experts at the top layer. The gating network uses the Gumbel-Softmax reparameterization method to address the problem that gradients cannot flow through discrete routing. A temperature parameter controls the sharpness of the distribution. Approximately continuous gradients are maintained during training, while inference switches to hard selection. The vector space alignment score is calculated as the mean cosine similarity between the alignment feature vector of each modality and the corresponding cross-modal semantic anchor vector. The recommended reference value for the preset alignment threshold is 0.75.
Samples below this threshold will automatically backtrack to step S04 to re-perform fine-grained semantic alignment. At the same time, the routing weight of the corresponding expert group will be dynamically increased for low-scoring feature paths to strengthen the fusion capability of the modality. Samples not below this threshold will output the final cross-modal alignment feature representation.

[0074] It should be noted that the first key technical idea of this invention is a vector space alignment mechanism based on cross-modal semantic anchor vectors. Traditional methods directly apply cross-modal attention to the original feature space. Due to the systematic differences in the statistical distribution of different modal features, the calculation of attention weights is biased. This invention constructs learnable cross-modal semantic anchor vectors as an intermediate bridge in the semantic space. Each modal feature is aggregated to the corresponding anchor through cosine similarity weighting, and alignment is completed in the semantic anchor vector space rather than the original feature space, thus eliminating at its root the impact of modal distribution heterogeneity on alignment accuracy. The second key technical idea is a two-stage coarse-grained and fine-grained semantic alignment data construction system. The alignment granularity of traditional multimodal datasets is usually limited to the image-text sentence pair level, which cannot support fine-grained cross-modal understanding tasks such as visual object attributes, spatial relationships, and action states. This invention establishes preliminary cross-modal associations through coarse-grained segment alignment, and then deepens the alignment granularity layer by layer through multiple mechanisms such as word-level and frame-level alignment, scene graph alignment, spatial predicate alignment, pose prediction, and joint detection, tracking, and recognition. This forms a complete fine-grained alignment dataset covering segment to frame level and entity to attribute relationship alignment, providing a high-quality training foundation for vector space alignment. The third key technical idea is a sparse-gated hybrid expert cross-modal routing network.
Traditional fully connected fusion methods increase computational cost proportionally with the increase of multimodal inputs, and information interference between different modalities is difficult to avoid. This invention improves fusion capacity without disproportionately increasing the number of parameters by activating experts on demand through sparse computation. The cross-expert attention jump interaction mechanism ensures that semantic information from different modalities is fully exchanged before being incorporated into the shared expert, guaranteeing the consistent expression of heterogeneous modalities in a unified vector space. When these three technical approaches work together, fine-grained aligned data provides precise semantic supervision signals for the learning of semantic anchor vectors, semantic anchor vectors provide semantically guided routing basis for sparse hybrid expert routing, and sparse hybrid expert routing networks use fine-grained aligned data to train more professional modal experts. The three form a complete technical loop that reinforces each other, jointly ensuring a comprehensive improvement in the alignment accuracy of cross-modal feature vector space.

[0075] It should be noted that this invention also solves the following technical problems: While addressing the core alignment accuracy issue, this invention also solves the problem of excessive computational overhead during multimodal feature extraction. Existing large multimodal models generally adopt a serial mapping architecture of visual Transformer plus multilayer perceptron plus large language model. The full attention computation of the visual encoder and the multi-stage nonlinear transformation of the multilayer perceptron bring a large amount of redundant computation. This invention, through a shared lightweight backbone network, adopts depthwise separable convolution and dynamic sparse attention mechanism, decomposing the computational cost of standard convolution into two steps: depthwise convolution and pointwise convolution. It also dynamically selects the most important attention connection through sparse masking, significantly reducing computational overhead while maintaining feature extraction capabilities. The single-stage projection alignment module replaces the multi-stage projection of the multilayer perceptron with learnable 1×1 convolution, reducing the number of nonlinear transformation layers and avoiding information loss introduced by multi-stage mapping. The modality adaptive routing mechanism dynamically activates the corresponding computation branch according to the input modality type, realizing on-demand computation and further reducing the proportion of invalid computation, thereby achieving a synergistic unity of lightweight and high-precision alignment. This invention also solves the technical problem of establishing cross-scale alignment between video frame-level temporal granularity and text sentence / word-level temporal granularity. Video frame-level temporal granularity is at the millisecond level, while text sentence / word-level temporal granularity is at the second level, a difference of 2 to 3 orders of magnitude. 
Traditional methods struggle to establish accurate temporal correspondences under such a large granularity difference. This invention constructs a hierarchical alignment framework from the coarse-grained segment layer to the fine-grained frame layer using a multi-resolution temporal pyramid structure, progressively refining the correspondence between video frames and text temporal granularity. Furthermore, it applies soft alignment constraints at each level through connectionist temporal classification (CTC) forced alignment constraints. This achieves automatic alignment of cross-scale temporal granularity without precise manual boundary labeling, effectively solving the fundamental technical challenge of the granularity gap in video-text temporal alignment.

[0076] Specifically, the principle of this invention is as follows. This invention can solve the technical problem of insufficient alignment accuracy in the cross-modal feature vector space because it introduces cross-modal semantic anchor vectors as a semantic mediator for aligning heterogeneous modal features. The original features of different modalities differ systematically in their respective statistical distributions. If cross-modal attention is applied directly in the original feature space, this distribution heterogeneity between modalities biases the computed attention weights. This invention constructs a set of learnable cross-modal semantic anchor vectors and maps the features of each modality into a common anchor space whose coordinate basis is formed by semantic dimensions, each anchor vector corresponding to one semantic dimension. The modal features are aggregated to the corresponding anchors by cosine similarity weighting, so that features with the same semantics but different modal representations tend toward the same position in the anchor space. The dual-stream semantic anchor fusion network gradually optimizes the anchor vectors through a cross-layer iterative refinement mechanism. The sparse gated hybrid expert (that is, mixture-of-experts) cross-modal routing network completes sufficient cross-modal information transmission before fusion through an inter-expert information transfer mechanism and cross-expert attention jump interaction. The two networks work together to ensure the consistent expression of heterogeneous modal features in a unified vector space, enabling the model to understand and reason about the deep semantics of images and thereby supporting, at the level of principle, the improvement of cross-modal feature vector space alignment accuracy.
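The cosine-similarity-weighted aggregation of modal features to anchors described above can be sketched as follows. This is a minimal illustration only: the softmax normalization of the similarity weights, the temperature value, and the function names are assumptions introduced for the example, since the description states only that aggregation is weighted by cosine similarity.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def aggregate_to_anchors(features, anchors, temperature=0.1):
    """For each anchor, return a similarity-weighted average of the features.

    Assumption: weights are a softmax over cosine(feature, anchor) / temperature.
    """
    aggregated = []
    for anchor in anchors:
        logits = [cosine(f, anchor) / temperature for f in features]
        peak = max(logits)                         # subtract max for stability
        exps = [math.exp(x - peak) for x in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        aggregated.append(
            [sum(w * f[d] for w, f in zip(weights, features))
             for d in range(len(anchor))]
        )
    return aggregated
```

With this weighting, features that point in nearly the same direction as an anchor dominate that anchor's aggregate regardless of which modality produced them, which is the behaviour the anchor space is intended to provide.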

[0077] Embodiment 1 of the present invention is provided below, and the specific implementation of each step in Embodiment 1 is described in detail.

[0078] The specific implementation of step S01 is as follows: a multimodal knowledge distiller is constructed, which includes five distillation subsystems: text-text, text-speech bidirectional bimodal, text-image bidirectional bimodal, text-video, and image-video. The text-text distillation subsystem designs structured prompt templates that include four elements: task type, domain label, content requirements, and format specifications. Style and difficulty variables are set, and the constraints include prohibiting ambiguous answers, grammatical errors, and omissions of key information. A large language model with hundreds of billions of parameters is used to distill and produce general-domain and vertical-domain corpora in batches. After the corpora are generated, a duplication check is performed on all generated samples by calculating the cosine similarity between any two samples. The formula for calculating cosine similarity is as follows:

[0079] $$\frac{\cos\theta_{ij}}{1}=\frac{\mathbf{v}_i\cdot\mathbf{v}_j}{\lVert\mathbf{v}_i\rVert_2\,\lVert\mathbf{v}_j\rVert_2}$$;

[0080] In the formula, $\theta_{ij}$ is the angle between the semantic vectors of sample $i$ and sample $j$; $\mathbf{v}_i$ and $\mathbf{v}_j$ are the semantic embedding vectors of sample $i$ and sample $j$, with dimension $d$, obtained from the embedding layer output of the large language model with hundreds of billions of parameters; $\lVert\mathbf{v}_i\rVert_2$ and $\lVert\mathbf{v}_j\rVert_2$ are respectively the $L_2$ norms of $\mathbf{v}_i$ and $\mathbf{v}_j$; the denominator on the left side of the equation is set to 1 to ensure that both sides of the equation are dimensionless. When $\cos\theta_{ij}$ is not less than 0.9, the corresponding redundant samples are removed; the reliability check then removes semantically incomplete and content-erroneous data, and finally the multimodal raw corpus is obtained.
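The duplication check above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the greedy keep-first strategy and the function names are hypothetical, while the 0.9 threshold follows the redundancy criterion used in this application.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def deduplicate(embeddings, threshold=0.9):
    """Return indices of samples to keep.

    Assumption: a sample is kept only if its similarity to every
    previously kept sample is below the threshold (greedy, order-dependent).
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept
```

For example, `deduplicate([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]])` keeps the first and third samples, because the second is nearly parallel to the first. The pairwise comparison is quadratic in the number of samples, so a production pipeline would likely need an approximate nearest-neighbour index.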

[0081] The specific implementation of step S02 is as follows: a text seed library, a speech seed library, an image data source, and an image dataset are constructed from the multimodal raw corpus. The text seed library covers Chinese, English, and mixed-language text as well as text from general, vertical, and dialogue scenarios, with text length controlled between 1 and 30 characters. The text data undergoes special character removal, space standardization, punctuation unification, and semantic compliance verification. The speech data undergoes noise reduction, speech enhancement, blank speech trimming, and noisy audio removal. The image data undergoes format unification, size cropping, and image enhancement. After preprocessing, bidirectional corpus generation is performed using a speech generation model, an automatic speech recognition model, a text-to-image model, an image-to-text model, a text-to-video model, and an image-to-video model, respectively, to obtain corpus pairs for each modality.

[0082] The specific implementation of step S03 is as follows: the text of the text-speech bidirectional bimodal corpus is segmented using a sentence-level segmentation strategy based on text length and punctuation. After the speech content is recognized using an automatic speech recognition model, the audio is segmented according to the segment relationships of the recognized content. The segmented text and speech are then semantically compared to complete coarse-grained segment alignment. The images of the text-image bidirectional bimodal corpus are segmented using a panoptic segmentation model, which performs panoptic segmentation, instance segmentation, and semantic segmentation and outputs an entity region coordinate-entity name mapping. A large language model then identifies the text entities in the text content and aligns them with the entities in the mapping, forming entity-text alignment data between entities in the image and entities in the text, thus completing cross-modal coarse-grained semantic alignment.
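A minimal sketch of a sentence-level segmentation strategy driven by punctuation and text length is given below. The punctuation set, the default maximum length of 30 characters, and the function name are assumptions chosen for illustration; the embodiment does not fix these details.

```python
import re

# Split after common Chinese and English sentence-ending punctuation.
# Caveat: a period inside a number or abbreviation also triggers a split.
_SENTENCE_END = re.compile(r"(?<=[。！？!?.;；])\s*")


def segment_text(text, max_len=30):
    """Split text at sentence punctuation, then cap each piece at max_len characters."""
    pieces = [p.strip() for p in _SENTENCE_END.split(text) if p.strip()]
    segments = []
    for piece in pieces:
        # Length-based fallback for sentences longer than max_len.
        for start in range(0, len(piece), max_len):
            segments.append(piece[start:start + max_len])
    return segments
```

Each resulting text segment can then be compared with the transcript of the corresponding audio segment produced by the automatic speech recognition model.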

[0083] The specific implementation of step S04 is as follows: based on the text-speech segment alignment data, a word-level and frame-level forced alignment model combined with a hidden Markov model (HMM) forced alignment algorithm is used to achieve word-level and frame-level text-speech alignment. The HMM forced alignment algorithm performs forced temporal boundary alignment between the speech frame sequence and the text word sequence. The joint estimation formula for its state transition probability and observation probability is expressed as follows:

[0084] $$\frac{P(O,Q\mid W)}{P_0}=\prod_{t=1}^{T} a_{q_{t-1}q_t}\,b_{q_t}(o_t)$$;

[0085] In the formula, $O=(o_1,\dots,o_T)$ is the speech observation sequence of length $T$; $T$ is the total number of audio frames; $Q=(q_1,\dots,q_T)$ is the corresponding hidden state sequence, $q_t$ is the hidden state corresponding to the $t$-th frame, and $q_0$ denotes the initial state; $W$ is the word sequence of the text; $b_{q_t}(o_t)$ is the emission probability of observation $o_t$ in state $q_t$, with a dimension of 1; $a_{q_{t-1}q_t}$ is the transition probability from state $q_{t-1}$ to state $q_t$, with a dimension of 1; $P(O,Q\mid W)$ is the joint probability of the observation sequence $O$ and the hidden state sequence $Q$ given the text $W$, with a dimension of 1; $P_0$ is the normalized reference probability, set to 1, with a dimension of 1. A self-supervised speech representation model is used to extract frame-level sentiment features and align them with text sentiment words, resulting in timbre-sensory text-speech alignment data and frame-level sub-language (paralinguistic) feature vectors $\mathbf{p}$. The complete representation after concatenation with the main content token $\mathbf{c}$ is $\mathbf{r}=[\mathbf{c}\oplus\mathbf{p}]\in\mathbb{R}^{d_c+d_p}$, where $d_c$ is the dimension of the main content token vector, $d_p$ is the dimension of the frame-level sub-language feature vector, and $\oplus$ denotes the vector concatenation operation. Based on the entity-text alignment data in images, a scene graph alignment mechanism parses the entity alignment data into structured language scene graphs, in which nodes represent entities and attributes and edges represent relationships between entities. Fine-grained alignment of text and image is achieved through two-layer matching at the node level and the edge level, resulting in visual object text-image alignment data. Spatial predicate mapping maps spatial descriptions in language to relative positional relationships in the image coordinate system, yielding visual space text-image alignment data. A pose prediction model extracts key image frames and aligns them with text verb phrases, obtaining action state text-image alignment data. For text-video corpora, a joint detection-tracking-recognition framework generates a timestamped entity time trajectory for each entity.
A graph-to-text model extracts semantic features from the trajectories and parses them into a temporal scene graph. A graph temporal network aligns the text event flow with the structural changes in the temporal scene graph, and the data is represented as a quadruple of timestamp, entity, attribute, and relationship. A 3D pose estimation model generates the spatiotemporal trajectory of entity keypoint coordinates, and actions and states are analyzed in combination with the image frames, the data being represented as a triple of timestamp, spatial coordinates, and action or state. A change point detection method automatically identifies semantic keyframes. A joint phrase localization framework maps text action phrases to consecutive video frame intervals and outputs start and end timestamps, obtaining frame-level semantic text-video alignment data. A multi-resolution temporal pyramid structure constructs a hierarchical alignment framework from the coarse-grained segment layer to the fine-grained frame layer. Combined with connectionist temporal classification (CTC) forced alignment constraints, it progressively refines the correspondence between video frames and text temporal granularity. Its soft alignment loss formula is expressed as follows:

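[0086] $$\frac{\mathcal{L}_{\mathrm{CTC}}}{\mathcal{L}_0}=-\log\sum_{\pi\in\mathcal{B}^{-1}(Y)}P(\pi\mid X)$$;

[0087] In the formula, $\mathcal{L}_{\mathrm{CTC}}$ is the connectionist temporal classification loss value, with a dimension of 1; $\mathcal{L}_0$ is the reference value for loss normalization, set to 1, with a dimension of 1; $Y$ is the target text sequence; $X$ is the input video frame feature sequence; $\pi$ is an alignment path; $\mathcal{B}^{-1}(Y)$ is the set of all valid alignment paths that map to the target sequence $Y$; $P(\pi\mid X)$ is the posterior probability of path $\pi$ given the input $X$, with a dimension of 1.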
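To make the sum over valid alignment paths concrete, the following sketch evaluates the negative log-likelihood with the standard CTC forward recursion. It is an illustration only: it assumes that per-frame label probabilities are already available, that label 0 is the blank, and it works in the probability domain for readability (a practical implementation would work in the log domain to avoid underflow on long sequences). The function name is hypothetical.

```python
import math


def ctc_neg_log_likelihood(probs, target, blank=0):
    """CTC loss for one sequence.

    probs[t][k] is the probability of label k at frame t;
    target is the label sequence without blanks.
    """
    # Extended target: blank, y1, blank, y2, ..., blank
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)

    alpha = [0.0] * size
    alpha[0] = probs[0][ext[0]]
    if size > 1:
        alpha[1] = probs[0][ext[1]]

    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new

    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood) if likelihood > 0 else math.inf
```

For two frames with uniform probabilities over the labels {blank, 1} and target [1], the valid paths are (1, 1), (1, blank), and (blank, 1), each with probability 0.25, so the function returns -log(0.75).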

[0088] The specific implementation of step S05 is as follows: all cross-modal fine-grained semantic alignment data are stored in a structured manner according to a hierarchical directory structure. A data index table containing fields such as sample identifier, category, text path, image path, video path, and alignment information is constructed. Manual sampling verification is performed at a ratio of 5% to 10% to check the alignment logic, annotation accuracy, and data usability. The verified data is encapsulated into a binary serialized data format adapted to the TensorFlow framework or a standardized data format adapted to the PyTorch framework to obtain a multimodal alignment dataset.
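As a non-limiting sketch of the index table and the manual sampling verification, the snippet below models one index row and draws a reproducible review sample. The field names are hypothetical English renderings of the fields listed above, and the seeded random sampling is an assumption; only the 5% to 10% ratio range comes from this step.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class IndexRecord:
    """One row of the data index table (field names are illustrative)."""
    sample_id: str
    category: str
    text_path: Optional[str]
    image_path: Optional[str]
    video_path: Optional[str]
    alignment_info: str


def sample_for_manual_check(records: List[IndexRecord],
                            ratio: float = 0.08,
                            seed: int = 0) -> List[IndexRecord]:
    """Draw a reproducible random sample for manual verification."""
    if not 0.05 <= ratio <= 0.10:
        raise ValueError("ratio must lie between 5% and 10%")
    if not records:
        return []
    count = max(1, round(len(records) * ratio))
    return random.Random(seed).sample(records, count)
```

Fixing the seed lets a later reviewer reproduce exactly which samples were inspected.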

[0089] The specific implementation of step S06 is as follows: a lightweight multimodal data feature extractor is constructed. The shared lightweight backbone network consists of a basic feature extraction layer, a multi-scale feature fusion module, a dynamic sparse attention mechanism, and a cross-modal feature alignment layer. The basic feature extraction layer adopts depthwise separable convolution combined with the SiLU activation function. The multi-scale feature fusion module introduces dilated spatial pyramid pooling to capture multi-scale contextual information. The cross-modal feature alignment layer employs... convolution and layer normalization to perform initial spatial alignment of features from different modalities. The single-stage projection alignment module employs a learnable 1×1 convolution and layer normalization to map visual or audio tokens directly to the embedding space of a large language model. Local residual connections are introduced to perform weighted fusion of the features before and after projection. The projection output formula is expressed as follows:

[0090] $$\frac{\mathbf{y}}{v_0}=\frac{\alpha\,\mathbf{h}+(1-\alpha)\,\mathbf{x}}{v_0}$$;

[0091] In the formula, $\mathbf{y}$ is the projection output feature vector, with dimension $d$; $\mathbf{h}$ is the projected feature vector after the 1×1 convolution and layer normalization mapping, with dimension $d$; $\mathbf{x}$ is the input feature vector before projection, with dimension $d$; $v_0$ is the reference magnitude of the feature vectors, has the same dimension as $\mathbf{y}$, $\mathbf{h}$, and $\mathbf{x}$, and is used to normalize both sides of the equation to dimensionless quantities; $\alpha$ is a learnable scalar parameter with a dimension of 1, used to control the fusion ratio between the projected features and the original features. In the dual-stream semantic anchor fusion network, image stream and text stream features are encoded independently and then projected and fused with the cross-modal semantic anchor vectors serving as an intermediate bridge. The cosine similarity used when the features of each modality are aggregated to the cross-modal semantic anchor vectors is expressed as follows:

[0092] $$\frac{s_{m,k}}{1}=\frac{\mathbf{f}_m\cdot\mathbf{a}_k}{\lVert\mathbf{f}_m\rVert_2\,\lVert\mathbf{a}_k\rVert_2}$$;

[0093] In the formula, $s_{m,k}$ is the cosine similarity between the intermediate feature vector $\mathbf{f}_m$ of modality $m$ and the $k$-th cross-modal semantic anchor vector $\mathbf{a}_k$, with a dimension of 1; $\mathbf{f}_m$ is the intermediate feature vector of modality $m$ after encoding by the dual-stream semantic anchor fusion network, with dimension $d$; $\mathbf{a}_k$ is the $k$-th cross-modal semantic anchor vector, with dimension $d$, consistent with the dimension of $\mathbf{f}_m$; $\lVert\mathbf{f}_m\rVert_2$ and $\lVert\mathbf{a}_k\rVert_2$ are respectively the $L_2$ norms of $\mathbf{f}_m$ and $\mathbf{a}_k$; the denominator on the left side of the equation is set to 1 to ensure that both sides of the equation are dimensionless. The sparse gated hybrid expert (mixture-of-experts) cross-modal routing network adopts a hybrid expert architecture in which each modality corresponds to an independent expert group containing 4 experts, and two cross-modal shared experts are set at the top level. The lightweight gating network routes each sample to the top two experts and uses the Gumbel-Softmax reparameterization method to approximate discrete sampling as a continuously differentiable operation, maintaining gradient flow during training and switching to hard selection during inference. The formula for the Gumbel-Softmax reparameterized routing weights is expressed as follows:

[0094] $$w_{i,e}=\frac{\exp\big((s_{i,e}+g_e)/\tau\big)}{\sum_{j=1}^{E}\exp\big((s_{i,j}+g_j)/\tau\big)}$$;

[0095] In the formula, $w_{i,e}$ is the soft routing weight with which sample $i$ is assigned to expert $e$, with a dimension of 1; $s_{i,e}$ is the raw score of the gating network for sample $i$ with respect to expert $e$, with a dimension of 1; $s_{i,j}$ is the raw score of the gating network for sample $i$ with respect to the $j$-th expert, with a dimension of 1; $g_e$ is the noise sampled from the Gumbel(0, 1) distribution for expert $e$, with a dimension of 1; $g_j$ is the noise sampled from the Gumbel(0, 1) distribution for the $j$-th expert, with a dimension of 1; $\tau$ is the temperature parameter, with a dimension of 1; $E$ is the total number of experts, that is, the sum of the four modality experts in each group and the two top-level cross-modal shared experts. Finally, the mean cosine similarity between the final aligned feature vector of each modality and the corresponding cross-modal semantic anchor vector is calculated as the vector space alignment score, expressed by the following formula:

[0096] $$\frac{S_{\mathrm{align}}}{1}=\frac{1}{M}\sum_{m=1}^{M}\frac{\mathbf{z}_m\cdot\mathbf{a}_m}{\lVert\mathbf{z}_m\rVert_2\,\lVert\mathbf{a}_m\rVert_2}$$;

[0097] In the formula, $S_{\mathrm{align}}$ is the vector space alignment score, with a dimension of 1; $M$ is the total number of modalities participating in the fusion; $\mathbf{z}_m$ is the final aligned feature vector of modality $m$ output after fusion by the sparse gated hybrid expert cross-modal routing network, with dimension $d$; $\mathbf{a}_m$ is the cross-modal semantic anchor vector corresponding to modality $m$, with dimension $d$; $\lVert\mathbf{z}_m\rVert_2$ and $\lVert\mathbf{a}_m\rVert_2$ are respectively the $L_2$ norms of $\mathbf{z}_m$ and $\mathbf{a}_m$; the denominator on the left side of the equation is set to 1 to ensure that both sides of the equation are dimensionless. For samples whose alignment score is below the preset alignment threshold, backtracking is triggered: the fine-grained semantic alignment of cross-modal data in step S04 is re-executed, and the routing weight of the corresponding expert group is dynamically increased. For samples whose alignment score is not lower than the preset alignment threshold, the final cross-modal alignment feature representation is output.
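The alignment score and the threshold decision can be sketched as follows. This is an illustrative sketch rather than the implementation of the invention: the dictionary-based interface and function names are hypothetical, and the default threshold of 0.75 is the value used in the application example later in this description.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def alignment_score(aligned, anchors):
    """Mean cosine similarity between each modality's aligned vector and its anchor.

    aligned and anchors are dicts keyed by modality name.
    """
    sims = [cosine(aligned[m], anchors[m]) for m in aligned]
    return sum(sims) / len(sims)


def decide(aligned, anchors, threshold=0.75):
    """Return ("realign", score) below the threshold, otherwise ("output", score)."""
    score = alignment_score(aligned, anchors)
    return ("realign", score) if score < threshold else ("output", score)
```

A sample that returns "realign" would be sent back to the fine-grained alignment of step S04; how the expert group routing weight is increased is not specified here and is left out of the sketch.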

[0098] To facilitate a better understanding and implementation of this invention, a specific application scenario of this invention is given below as Embodiment 2:

[0099] To verify the effectiveness of the present invention, the technicians set up a test environment and used a medical image multimodal question answering task as the scenario. They constructed a multimodal alignment dataset containing radiology report text, medical images and medical operation videos using the technical solution of the present invention, and completed the training and alignment evaluation of a lightweight multimodal data feature extractor based on the dataset.

[0100] The technicians first started the multimodal knowledge distiller. In the text-text distillation subsystem, they designed a structured prompt template containing four elements: medical scenarios as the domain label, question answering as the task type, medical diagnosis as the content requirement, and standardized description as the format specification. Constraints prohibiting ambiguous answers and grammatical errors were set. A large language model with hundreds of billions of parameters was used to distill and produce a total of 48,000 text corpus samples covering medical scenarios such as disease diagnosis, drug instructions, and surgical procedures. In the duplication check phase, cosine similarity was calculated pairwise for all samples, and redundant samples with a cosine similarity of not less than 0.9 were removed, leaving 41,200 valid text samples. The reliability check further removed semantically incomplete and content-erroneous samples, ultimately yielding 38,600 text samples of the multimodal raw corpus. In the text-image distillation subsystem, the medical image data sources underwent format unification, size cropping, and image enhancement preprocessing, after which a total of 22,000 pairs of text-image bidirectional bimodal corpora were generated in batches using the text-to-image and image-to-text models. In the image-video distillation subsystem, a medical operation image dataset was used as input. The image elements were analyzed in batches and organized into dynamic text using the image-to-text model, and the image-to-video model then generated medical operation videos, resulting in a total of 8,500 image-video corpus pairs.

[0101] Subsequently, the technicians performed cross-modal coarse-grained semantic alignment on the aforementioned multimodal raw corpus. For text-image coarse-grained alignment, a panoptic segmentation model was used to perform panoptic segmentation, instance segmentation, and semantic segmentation on the medical images and to output entity region coordinate-entity name mappings, which identified the pixel coordinate ranges of entities such as lesion regions and organ outlines. A large language model was then used to match the mapped content with the entity descriptions in the radiology report text, forming entity-text alignment data covering 19,800 sample pairs.

[0102] In the fine-grained semantic alignment stage of cross-modal data, the technicians performed visual object alignment, visual space alignment, and action state alignment in sequence on the basis of the coarse-grained text-image alignment. The scene graph alignment mechanism parsed the medical images and report text into structured language scene graphs, in which nodes represent lesion entities and their attributes (e.g., size, shape, density) and edges represent relationships between entities (e.g., adjacency, containment). After two-layer matching at the node level and the edge level, visual object text-image alignment data was formed. Spatial predicate mapping mapped the location descriptions in the report (e.g., above the lung lobe, inside the thoracic cavity) to relative positional relationships in the image coordinate system, forming visual space text-image alignment data. The pose prediction model estimated key points of the surgical actions in the surgical operation video, extracted key image frames containing the movement state of surgical instruments, and aligned them with verb phrases from the surgical step text, forming action state text-image alignment data. For text-video data, a joint detection-tracking-recognition framework generated timestamped trajectories for each surgical instrument and anatomical entity in the video. A graph-to-text model extracted semantic features from these trajectories and parsed them into a temporal scene graph, and a graph temporal network aligned the surgical step text with the changes in the graph structure, forming entity time trajectory text-video alignment data. A multi-resolution temporal pyramid structure combined with connectionist temporal classification (CTC) forced alignment constraints refined the correspondence between video frame-level and text sentence/word-level temporal granularity. The CTC loss eventually stabilized at 0.12, below the suggested threshold of 0.15, indicating that the frame-level semantic alignment quality met the standard. As shown in Figure 3, the convergence curves of the alignment loss at each level of the temporal pyramid indicate that the fine-grained frame-level alignment accuracy continuously improves as the level deepens.
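A coarse-to-fine temporal pyramid over frame features can be illustrated with the short sketch below. It is a hypothetical construction, not the structure used in the embodiment: it assumes that each coarser level is obtained by averaging adjacent pairs of frame features, and the function names are introduced only for this example.

```python
def pool_pairs(frames):
    """Average adjacent pairs of feature vectors; a trailing single frame is kept as is."""
    pooled = []
    for i in range(0, len(frames), 2):
        group = frames[i:i + 2]
        dim = len(group[0])
        pooled.append([sum(f[d] for f in group) / len(group) for d in range(dim)])
    return pooled


def build_temporal_pyramid(frames, num_levels):
    """Return feature sequences ordered from the coarsest level to the frame level."""
    levels = [frames]
    for _ in range(num_levels - 1):
        if len(levels[-1]) <= 1:
            break                      # cannot pool a single element any further
        levels.append(pool_pairs(levels[-1]))
    return levels[::-1]
```

Alignment would then start at the coarsest sequence, where one element covers a long span comparable to a sentence, and the match would be narrowed level by level down to individual frames.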

[0103] After completing fine-grained semantic alignment, the technicians structured and stored all aligned data according to a hierarchical directory structure, constructing a data index table. Sample fields in the data index table are shown in Table 1:

[0104] Table 1. Example of index fields for multimodal aligned datasets

[0105]

[0106] Manual verification was conducted at a sampling rate of 8%, yielding 2,460 sampled items. After the alignment logic, annotation accuracy, and data usability were checked, the verification pass rate reached 96.3%. The verified samples were packaged in WebDataset format to support efficient streaming reads in the PyTorch framework.

[0107] During the training and evaluation phase of the lightweight multimodal data feature extractor, the technicians trained the feature extractor on the aforementioned multimodal alignment dataset. The basic feature extraction layer of the shared lightweight backbone network employs depthwise separable convolution combined with the SiLU activation function. The multi-scale feature fusion module uses dilated spatial pyramid pooling to extract multi-scale contextual information in parallel across different receptive fields. The dynamic sparse attention mechanism adaptively adjusts sparsity based on the complexity of the medical image content. The learnable parameters of the single-stage projection alignment module converge automatically during training. With gradient pruning and feature distribution monitoring mechanisms in place, the backbone network parameters are fixed in the early stage of training and the projection module is gradually unfrozen, ensuring a smooth transition in the feature space mapping. In the sparse gated hybrid expert cross-modal routing network, each modality corresponds to four experts, and two cross-modal shared experts are responsible for global semantic convergence. The Gumbel-Softmax reparameterization method maintains an approximately continuous gradient flow during training and switches to hard selection of the top two experts during inference. As shown in Figure 4, the distribution of vector space alignment scores across modalities indicates that, after multiple rounds of backtracking realignment, the mean cosine similarity of the samples of each modality is concentrated above the preset alignment threshold of 0.75, and the proportion of samples below the threshold converges to 4.2%. The alignment quality indicators for each modality are shown in Table 2.

[0108] Table 2. Statistical table of alignment scores for each modality vector space

[0109]
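The routing behaviour described in paragraph [0107], soft Gumbel-Softmax weights during training and hard top-two selection during inference, can be sketched as follows. The function names are hypothetical, and renormalizing the two selected experts with a softmax at inference is an assumption made for this illustration.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one sample from the Gumbel(0, 1) distribution."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)   # keep log() finite
    return -math.log(-math.log(u))


def gumbel_softmax_weights(scores, temperature=1.0, rng=None):
    """Training-time soft routing weights over all experts."""
    rng = rng or random.Random()
    noisy = [(s + sample_gumbel(rng)) / temperature for s in scores]
    peak = max(noisy)                                # subtract max for stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def hard_top2(scores):
    """Inference-time routing: keep the two highest-scoring experts.

    Assumption: the two kept experts are renormalized with a softmax.
    """
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:2]
    peak = max(scores[i] for i in top)
    exps = {i: math.exp(scores[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: exps[i] / total for i in top}
```

With six gate scores (four modality experts plus two shared experts), `gumbel_softmax_weights` returns six weights that sum to 1, while `hard_top2` returns a dictionary with exactly two expert indices. Lower temperatures push the soft weights closer to the hard selection.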

[0110] Compared with traditional serial mapping architectures, this invention uses cross-modal semantic anchor vectors to aggregate the features of each modality in a semantically meaningful anchor space, so that fine-grained semantic information such as object attributes, entity relationships, and spatial structures is accurately mapped in a unified semantic vector space. The principle is that the semantic anchor vectors act as a common coordinate basis for cross-modal semantic mapping, eliminating the systematic bias that modal distribution differences introduce when cross-modal attention is calculated directly in the original feature space. This enables the model to understand and reason about the deep semantics of medical images, rather than relying solely on... the shallow perception method of coordinate positioning. The multi-resolution temporal pyramid structure addresses the alignment of video frame-level temporal granularity and text sentence/word-level temporal granularity, which differ by 2 to 3 orders of magnitude, enabling a precise temporal correspondence between surgical step text and operation video frames. The sparse gated hybrid expert cross-modal routing network completes sufficient cross-modal information transmission before fusion through a cross-expert attention jump interaction mechanism, ensuring the consistent expression of heterogeneous modal features in a unified vector space and giving the model stronger inference accuracy and robustness in medical image multimodal question answering tasks.

[0111] It should be noted that the variables involved in this invention are explained in detail in Tables 3 and 4.

[0112] Table 3. Variable Explanation Table (Part 1)

[0113]

[0114] Table 4. Variable Explanation Table (Part Two)

[0115]

[0116] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the scope of protection of the present invention.

Claims

1. A method for cross-modal alignment of a multimodal large model, characterized in that it comprises the following steps:
constructing a multimodal knowledge distiller comprising five distillation subsystems, namely a text-text knowledge distillation subsystem, a text-speech bidirectional bimodal knowledge distillation subsystem, a text-image bidirectional bimodal knowledge distillation subsystem, a text-video knowledge distillation subsystem, and an image-video knowledge distillation subsystem; generating multimodal corpora in batches using the generative models corresponding to each distillation subsystem, and performing a duplication check and a reliability check on the multimodal corpora to obtain a multimodal raw corpus;
performing preprocessing operations and bidirectional corpus generation on the multimodal raw corpus to obtain a text-speech bidirectional bimodal corpus, a text-image bidirectional bimodal corpus, a text-video corpus, and an image-video corpus;
performing segment alignment on the text-speech bidirectional bimodal corpus using a sentence-level segmentation strategy combined with an automatic speech recognition model, and performing segmentation and alignment on the text-image bidirectional bimodal corpus to form entity-text alignment data between entities in the image and entities in the text, thereby completing coarse-grained semantic alignment of cross-modal data;
obtaining word-level and frame-level text-speech alignment data using a word-level and frame-level forced alignment model combined with a hidden Markov model (HMM) forced alignment algorithm; extracting frame-level sentiment features using a self-supervised speech representation model and aligning them with text sentiment words to obtain timbre-sensory text-speech alignment data; obtaining visual object text-image alignment data through a scene graph alignment mechanism, obtaining visual space text-image alignment data through spatial predicate mapping, and obtaining action state text-image alignment data by extracting key image frames through a pose prediction model and aligning them with text verb phrases, thereby completing fine-grained semantic alignment of cross-modal data;
storing all cross-modal data with fine-grained semantic alignment in a structured manner according to a hierarchical directory structure to obtain a multimodal alignment dataset;
constructing a lightweight multimodal data feature extractor comprising a shared lightweight backbone network, a single-stage projection alignment module, and a modality adaptive routing mechanism, wherein a lightweight semantic parsing head is integrated at the end of the shared lightweight backbone network to output instance-level semantic tokens; for speech input, frame-level sub-language features are extracted using a self-supervised speech representation model and concatenated with the main content token; and for video input, a lightweight temporal saliency detection module identifies video keyframes and high-dimensional semantic parsing is performed only on the video keyframes;
constructing cross-modal semantic anchor vectors, associating the visual object text-image alignment data, the visual space text-image alignment data, and the action state text-image alignment data through a dual-stream semantic anchor fusion network, associating all modal data in the multimodal alignment dataset through a sparse gated hybrid expert cross-modal routing network, completing the alignment and fusion of the cross-modal feature vector space, and outputting a cross-modal alignment feature representation.

2. The method according to claim 1, characterized in that the duplication check specifically involves calculating the cosine similarity between generated corpora and removing redundant samples with a cosine similarity of not less than 0.9; the reliability check specifically involves removing corpora with incomplete semantics or incorrect content.

3. The method according to claim 2, characterized in that the construction of the text-text knowledge distillation subsystem involves designing a structured prompt template that includes four elements: task type, domain label, content requirements, and format specifications; setting style variables, difficulty variables, and constraints including prohibiting ambiguous answers, grammatical errors, and omissions of key information; and using a large language model with hundreds of billions of parameters to distill and produce general-domain corpora and vertical-domain corpora.

4. The method according to claim 3, characterized in that the text seed library is constructed to include Chinese, English, and mixed-language text, general-domain text, vertical-domain text, and dialogue scenario text, with text length controlled between 1 and 30 characters; the text data in the text seed library is processed by removing special characters, standardizing spaces, unifying punctuation marks, and verifying semantic compliance; and the speech data in the speech seed library is processed by noise reduction, speech enhancement, trimming blank speech, and removing noisy audio.

5. The method according to claim 4, characterized in that the image data sources include general aligned images and domain-specific images, with high image quality requirements, no blurring, no watermarks, and no noise; the original text data sources are processed by noise removal, expression standardization, and text semantic verification; the image data sources are processed by format unification, size cropping, and image enhancement; and the text-image bidirectional bimodal corpus is generated in batches based on the text-to-image model and the image-to-text model.

6. The method according to claim 5, characterized in that the generation of the image-video corpus involves formatting and data augmentation of the image dataset, then using an image-to-text model to analyze the elements of each image in the dataset in batches and organize the static objects in the images into dynamic text, and then generating the image-video corpus in batches from the dynamic text using an image-to-video model.

7. The method according to claim 6, characterized in that the text-speech segment alignment specifically employs a sentence-level segmentation strategy to segment the text data according to text length and punctuation marks; after the speech content is recognized through an automatic speech recognition model, the audio file is segmented based on the segment relationships of the speech content; and the segmented text data and speech data are then semantically compared to complete the text-speech segment alignment.

8. The method according to claim 7, characterized in that the scene graph alignment mechanism specifically parses the entity alignment data in the image and the entity alignment data in the text into a structured language scene graph, in which nodes represent entities and attributes and edges represent relationships between entities; fine-grained alignment of the image and text is achieved through two-layer matching of node-level alignment and edge-level alignment.

9. The method according to claim 8, characterized in that the generation of the entity time trajectory text-video alignment data specifically involves using a graph-to-text model to extract the semantic features of each entity time trajectory, parsing the semantic features into a temporal scene graph, and aligning the event flow described in the text with the structural changes of the temporal scene graph through a graph temporal network, the data being represented as a quadruple of timestamp, entity, attribute, and relationship.

10. The method according to claim 9, characterized in that the multi-resolution temporal pyramid structure specifically constructs a hierarchical alignment framework from the coarse-grained segment layer to the fine-grained frame layer, refining the temporal granularity correspondence between video frames and text at each level; the connectionist temporal classification (CTC) forced alignment constraint specifically uses the CTC loss function to apply soft alignment constraints to the video frame sequence and the text sequence without precise boundary labeling.