A method, system, terminal and medium for constructing and applying a multimodal wireless foundation model based on a dual-encoder cross-modal injection mechanism

By constructing a dual encoder structure and a cross-modal injection mechanism, the problem that existing wireless channel models cannot simultaneously handle communication and sensing tasks is solved. This enables efficient feature representation of the multimodal wireless fundamental model, improving the accuracy and efficiency of communication and sensing tasks.

CN121997277B · Active · Publication Date: 2026-06-26 · PENG CHENG LAB

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: PENG CHENG LAB
Filing Date: 2026-04-08
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing wireless channel models or basic models are usually designed for a single mode, making it difficult to take into account the performance differences between communication and perception tasks. This results in limited model generalization and fails to meet the needs of edge artificial intelligence for efficient and universal feature representation.

Method used

A dual encoder structure is constructed, including a CSI encoder and a point cloud encoder. Channel state information and point cloud data are cross-injected and fused through a cross-modal injection mechanism. Combined with a joint training method of autoregressive training and task-related training, a multimodal wireless basic model is generated.

Benefits of technology

It achieves deep fusion and alignment of heterogeneous data in the underlying feature space, improves the performance of communication and sensing tasks, and enhances the accuracy and efficiency of various downstream tasks such as channel estimation, channel prediction, target localization, and electromagnetic mapping.

✦ Generated by Eureka AI based on patent content.

Figure CN121997277B_ABST

Abstract

The application discloses a method, system, terminal and medium for constructing and applying a multimodal wireless foundation model based on a dual-encoder cross-modal injection mechanism. The method includes: constructing a dual-encoder structure comprising a CSI encoder and a point cloud encoder; acquiring multimodal data and performing data patching and position encoding; extracting a first feature with the CSI encoder and the point cloud encoder, extracting a second feature with an auxiliary encoder, cross-injecting and fusing the features of different modalities through a cross-modal attention mechanism, and outputting a multimodal general representation; pre-training with a joint training method to obtain the multimodal wireless foundation model; and, in the inference stage, selecting the general representation according to the type of downstream task and connecting it to a downstream task model for inference. The multimodal wireless foundation model trained according to the application can improve a variety of widely used communication and perception downstream tasks, including channel estimation, channel prediction, target positioning, electromagnetic mapping, and indoor perception.

Description

Technical Field

[0001] This invention relates to the field of artificial intelligence and wireless communication technology, and in particular to a method, system, terminal and medium for constructing and applying a multimodal wireless basic model based on a dual encoder cross-modal injection mechanism.

Background Technology

[0002] With the evolution of edge artificial intelligence (AI-Edge) technology, wireless networks are developing from single communication networks to integrated communication and sensing networks. In a wide range of scenarios, systems not only need to perform high-quality wireless signal communication tasks, but also need to have accurate sensing capabilities.

[0003] Existing wireless channel models or basic models are usually designed for a single mode, making it difficult to take into account the performance differences between communication and perception tasks at the underlying representation level. This results in limited model generalization and fails to meet the AI-Edge's need for efficient and universal feature representation.

[0004] Therefore, existing technologies still have shortcomings.

Summary of the Invention

[0005] To address the aforementioned shortcomings of existing technologies, this invention provides a method, system, terminal, and medium for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism. The technical solution adopted by this invention is as follows:

[0006] In a first aspect, the present invention provides a method for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism, the method comprising:

[0007] A dual encoder structure is constructed, comprising: a CSI encoder with channel state information as the primary mode, and a point cloud encoder with point cloud data as the primary mode;

[0008] Acquire time-sampled aligned multimodal data, and perform data patching and position encoding on the multimodal data. The multimodal data includes: channel state data or point cloud data as the primary mode, and sensor data as the auxiliary mode.

[0009] The first feature of the corresponding main modality is extracted by the CSI encoder and the point cloud encoder respectively, and the second feature of the auxiliary modality is extracted by the auxiliary encoder. Cross-modal injection is performed in the CSI encoder and the point cloud encoder. The features of different modalities are cross-injected and fused by the cross-modal attention mechanism to output a multimodal general representation.

[0010] Using the aforementioned multimodal general representation, the model is pre-trained using a joint training method that combines autoregressive training and task-related training to obtain a multimodal wireless basic model.

[0011] During the inference phase, the corresponding general representation output by the CSI encoder or point cloud encoder is selected based on the type of the downstream task via the task switch, and then connected to the downstream task model for inference.

[0012] In one implementation, the sensor data of the auxiliary modality includes RGB image data, the corresponding auxiliary encoder is a visual encoder, and the second feature is a visual feature.

[0013] In one implementation, both the CSI encoder and the point cloud encoder are based on the Transformer architecture. The CSI encoder and the point cloud encoder respectively extract the first feature of the corresponding main modality, including:

[0014] The Transformer modules in the first $N$ layers of the CSI encoder and the point cloud encoder extract the first feature of the corresponding main mode. These Transformer modules internally adopt a self-attention mechanism and a feedforward neural network.

[0015] In one implementation, the Transformer modules in layer $N+1$ and subsequent layers of the CSI encoder and the point cloud encoder internally use a cross-modal attention mechanism and a feedforward neural network; the CSI encoder injects features from the point cloud data and the auxiliary mode, and the point cloud encoder injects features from the channel state data and the auxiliary mode.

[0016] In one implementation, there is bidirectional cross-modal transmission between the Transformers corresponding to the two primary modalities, and the auxiliary modality is provided to the Transformers of all primary modalities at the same time. The transmission from the auxiliary modality to the primary modality is unidirectional.

[0017] In one implementation, the loss function for joint training is:

[0018] $$\mathcal{L}_{total} = \mathcal{L}_{AR} + \lambda \mathcal{L}_{task}$$

[0019] where $\mathcal{L}_{total}$ is the total loss, $\mathcal{L}_{AR}$ is the self-supervised autoregressive loss calculated from the autoregressive head output, $\mathcal{L}_{task}$ is the supervised task-related loss calculated from the downstream task head output, and $\lambda$ is the weight hyperparameter.

[0020] In one implementation, during the inference phase, a task switch selects the corresponding general representation output by the CSI encoder or the point cloud encoder based on the type of the downstream task, including:

[0021] When performing a communication task, select the general representation output by the CSI encoder;

[0022] When performing a perception task, the general representation output by the point cloud encoder is selected.

[0023] In a second aspect, embodiments of the present invention also provide a system for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism. The system is used to implement the method for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism as described in any of the above schemes. The system includes:

[0024] A dual encoder construction module is used to construct a dual encoder structure, which includes: a CSI encoder with channel state information as the main mode and a point cloud encoder with point cloud data as the main mode.

[0025] A multimodal data processing module is used to acquire time-sampled aligned multimodal data and perform data patching and position encoding on the multimodal data. The multimodal data includes: channel state data or point cloud data as the primary mode and sensor data as the auxiliary mode.

[0026] The feature extraction and fusion module is used to extract the first feature of the corresponding main modality through the CSI encoder and the point cloud encoder respectively, and extract the second feature of the auxiliary modality through the auxiliary encoder. Cross-modal injection is performed in the CSI encoder and the point cloud encoder. The features of different modalities are cross-injected and fused through the cross-modal attention mechanism to output a multimodal general representation.

[0027] The joint training module is used to pre-train the model using the multimodal general representation and a joint training method that combines autoregressive training and task-related training to obtain a multimodal wireless basic model.

[0028] The inference application module is used to select the corresponding general representation output by the CSI encoder or point cloud encoder according to the type of downstream task through a task switch during the inference stage, and connect it to the downstream task model for inference.

[0029] In a third aspect, embodiments of the present invention also provide a terminal. The terminal includes a memory, a processor, and a program for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism, which is stored in the memory and executable on the processor. When the processor executes the program, it implements the steps of the method for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism of any of the above-mentioned schemes.

[0030] In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium. The computer-readable storage medium stores a program for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism, and the program, when executed by a processor, implements the steps of the method for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism as described in any of the above schemes.

[0031] Beneficial Effects: Compared with existing technologies, this invention provides a method for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism, comprising: constructing a dual-encoder structure that includes a CSI encoder with channel state information as the primary modality and a point cloud encoder with point cloud data as the primary modality; acquiring time-sampled aligned multimodal data and performing data patching and position encoding on it, wherein the multimodal data includes channel state data or point cloud data as the primary modality and sensor data as the auxiliary modality; extracting the first feature of the corresponding primary modality through the CSI encoder and the point cloud encoder respectively, and extracting the second feature of the auxiliary modality through the auxiliary encoder; performing cross-modal injection in the CSI encoder and the point cloud encoder, where a cross-modal attention mechanism cross-injects and fuses the features of different modalities to output a multimodal general representation; pre-training the model on the multimodal general representation with a joint training method that combines autoregressive training and task-related training to obtain a multimodal wireless basic model; and, in the inference stage, selecting through a task switch the general representation output by the CSI encoder or the point cloud encoder according to the type of the downstream task and connecting it to the downstream task model for inference.

[0032] This invention achieves deep fusion and alignment of heterogeneous data (channel state, point cloud, image) in the underlying feature space through dual encoders and a cross-modal injection mechanism, effectively balancing the performance requirements of both communication and sensing tasks. The general representation generated by the multimodal wireless fundamental model trained based on this invention can significantly improve the accuracy and efficiency of a wide range of downstream communication and sensing tasks, including channel estimation, channel prediction, target localization, electromagnetic mapping, and indoor sensing.

Brief Description of the Drawings

[0033] Figure 1 is a flowchart of a preferred embodiment of the method for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism according to an embodiment of the present invention.

[0034] Figure 2 is a schematic diagram of the overall architecture of the multimodal wireless basic model based on the dual encoder cross-modal injection mechanism in an embodiment of the present invention.

[0035] Figure 3 is a comparative schematic diagram of the internal structure of the Transformer module in the method for constructing and applying a multimodal wireless basic model based on a dual encoder cross-modal injection mechanism, as described in an embodiment of the present invention.

[0036] Figure 4 is a schematic diagram of the structure of the system for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism, according to an embodiment of the present invention.

[0037] Figure 5 is a schematic diagram of a terminal provided in an embodiment of the present invention.

Detailed Implementation

[0038] To make the objectives, technical solutions, and effects of this invention clearer and more explicit, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

[0039] The flowchart shown in the attached diagram is for illustrative purposes only and does not necessarily include all content, operations, or steps, nor does it require execution in the described order. For example, some operations or steps can be broken down, combined, or partially merged, so the actual execution order may change depending on the actual situation.

[0040] It should be understood that the terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in this specification and the appended claims, the singular forms “a,” “an,” and “the” are intended to include the plural forms unless the context clearly indicates otherwise.

[0041] It should be understood that, in order to clearly describe the technical solutions of the embodiments of the present invention, the terms "first" and "second" are used in the embodiments of the present invention to distinguish identical or similar items with essentially the same function and effect. For example, "first control information" and "second control information" are only used to distinguish different control information and do not limit their order.

[0042] Those skilled in the art will understand that the words "first" and "second" do not limit the quantity or the order of execution, and that the words "first" and "second" do not necessarily imply that they are different.

[0043] It should also be understood that the term “and / or” as used in this specification and the appended claims refers to any combination of one or more of the associated listed items and all possible combinations, and includes such combinations.

[0044] To address the problems of existing technologies, this invention provides a method for constructing and applying a multimodal wireless fundamental model based on a dual-encoder cross-modal injection mechanism. The general representation generated by the multimodal wireless fundamental model trained using this method can significantly improve the accuracy and efficiency of a wide range of downstream communication and sensing tasks, including channel estimation, channel prediction, target localization, electromagnetic mapping, and indoor sensing. In specific applications, a dual-encoder structure is first constructed, comprising a CSI encoder with channel state information as the primary modality and a point cloud encoder with point cloud data as the primary modality. Then, time-sampled aligned multimodal data is acquired, and data patching and position encoding are performed on it. The multimodal data includes channel state data or point cloud data as the primary modality and sensor data as the auxiliary modality. Next, the first features of the corresponding primary modality are extracted by the CSI encoder and the point cloud encoder, respectively, and the second features of the auxiliary modality are extracted by the auxiliary encoder. Cross-modal injection is performed in the CSI encoder and the point cloud encoder, and the features of different modalities are cross-injected and fused using a cross-modal attention mechanism to output a multimodal general representation. Using this multimodal general representation, the model is then pre-trained with a joint training method that combines autoregressive training and task-related training to obtain the multimodal wireless basic model. Finally, in the inference stage, the corresponding general representation output by the CSI encoder or point cloud encoder is selected based on the type of the downstream task via a task switch and connected to the downstream task model for inference.

[0045] The method for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism in this embodiment can be applied in terminals, including intelligent product terminals such as computers and smart TVs. Specifically, as shown in Figure 1, the method for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism in this embodiment includes the following steps:

[0046] Step S100: Construct a dual encoder structure, which includes a CSI encoder with channel state information as the main mode and a point cloud encoder with point cloud data as the main mode.

[0047] Specifically, as shown in Figure 2, this embodiment first constructs a dual-encoder structure, which includes a CSI (Channel State Information) encoder that uses channel state data as the primary mode and a point cloud encoder that uses point cloud data as the primary mode. The CSI encoder corresponds to the CSI communication mode, and the point cloud encoder corresponds to the point cloud perception mode. In addition, this embodiment also sets an auxiliary encoder, which is a visual encoder. The visual encoder processes RGB image data, which serves as the auxiliary mode (the RGB image auxiliary mode).

[0048] Step S200: Acquire time-sampled aligned multimodal data, and perform data patching and position encoding on the multimodal data. The multimodal data includes: channel state data or point cloud data as the primary mode, and sensor data as the auxiliary mode.

[0049] Specifically, based on the dual-encoder structure and auxiliary encoder constructed above, this embodiment separately collects channel state data, point cloud data, and sensor data in the same scene to obtain multimodal data. Channel state data and point cloud data serve as the primary modalities, while sensor data (i.e., RGB image data) serves as the auxiliary modality. This embodiment uses a globally unified timestamp alignment technique to keep the sampling of the different modalities consistent in the time dimension. This ensures that at each sampling time $t$ the observation vector group $(H_t, P_t, I_t)$ is strictly time-consistent, where $H_t$ is the channel state data collected at sampling time $t$, $P_t$ is the point cloud data collected at sampling time $t$, and $I_t$ is the RGB image data collected at sampling time $t$.
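
The timestamp alignment described in this paragraph could be sketched as nearest-neighbour matching against a shared clock. This is only an illustration: the function names `nearest` and `align_by_timestamp`, the tolerance `tol`, the choice of the CSI stream as the reference clock, and the `(timestamp, payload)` sample layout are all assumptions, since the text does not specify the alignment algorithm.

```python
from bisect import bisect_left

def nearest(timestamps, t):
    """Return the index of the entry in sorted `timestamps` closest to t."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer to t.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1

def align_by_timestamp(csi, cloud, rgb, tol=0.01):
    """Build time-consistent (t, H_t, P_t, I_t) tuples.

    Each stream is a non-empty list of (timestamp, payload) sorted by time.
    Assumption: CSI samples act as the reference clock, and a tuple is kept
    only when the nearest point cloud and RGB samples lie within `tol` seconds.
    """
    cloud_ts = [t for t, _ in cloud]
    rgb_ts = [t for t, _ in rgb]
    aligned = []
    for t, h in csi:
        j = nearest(cloud_ts, t)
        k = nearest(rgb_ts, t)
        if abs(cloud_ts[j] - t) <= tol and abs(rgb_ts[k] - t) <= tol:
            aligned.append((t, h, cloud[j][1], rgb[k][1]))
    return aligned

csi = [(0.0, "h0"), (0.1, "h1"), (0.2, "h2")]
cloud = [(0.001, "p0"), (0.102, "p1"), (0.35, "p2")]
rgb = [(0.0, "i0"), (0.099, "i1"), (0.2, "i2")]
aligned = align_by_timestamp(csi, cloud, rgb)
```

In this toy input the third CSI sample has no point cloud sample within the tolerance, so it is dropped rather than paired with stale data.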

[0050] Furthermore, in this embodiment, data patching is performed on the multimodal data to transform it into one-dimensional sequence vectors. Position encoding is then added to each vector block to preserve the spatial or frequency distribution characteristics of the original data, so that the heterogeneous raw signals become unified serialized tokens. Specifically, for the channel state data, the original data is represented as a complex matrix $H \in \mathbb{C}^{N_{sc} \times N_{ant}}$, where $\mathbb{C}$ is the complex field, $N_{sc}$ is the number of subcarriers, and $N_{ant}$ is the number of antennas. During patching, the matrix $H$ is divided into sub-blocks of size $p_1 \times p_2$, and each sub-block is flattened into a one-dimensional vector. A linear projection follows: a learnable linear mapping matrix $E$ maps each flattened vector into an embedding space of dimension $D$, expressed as $z_i = E x_i$, where $z_i$ is the embedding-space vector obtained by linearly projecting the $i$-th CSI complex matrix sub-block and $x_i$ is the one-dimensional original vector obtained by flattening the $i$-th CSI complex matrix sub-block. Finally, position encoding is applied. Because the wireless channel is strongly correlated in the frequency and spatial domains, this embodiment introduces a two-dimensional learnable position encoding, denoted $E_{pos}$. The final serialized token sequence of the channel state data after position encoding, which serves as the input to the CSI encoder, is $Z_0 = [z_1; z_2; \dots; z_M] + E_{pos}$, that is, the embedding vectors of the 1st to $M$-th CSI complex matrix sub-blocks after linear projection plus the position encoding, where $M$ is the total number of channel state data blocks.
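
A minimal sketch of the CSI patching, linear projection, and position encoding steps is given here in plain Python so it runs without extra dependencies; a practical implementation would use a tensor library. The names `patchify` and `embed`, the patch size, the embedding dimension, and the random initial values are hypothetical. Splitting each complex entry into a (real, imaginary) pair before projection is also an assumption, because the text does not say how complex values are fed to the real-valued projection.

```python
import random

def patchify(H, p1, p2):
    """Split an N_sc x N_ant complex matrix into flattened p1 x p2 sub-blocks."""
    n_sc, n_ant = len(H), len(H[0])
    assert n_sc % p1 == 0 and n_ant % p2 == 0, "patch size must divide the matrix"
    patches = []
    for r in range(0, n_sc, p1):          # walk the subcarrier axis
        for c in range(0, n_ant, p2):     # walk the antenna axis
            block = [H[r + i][c + j] for i in range(p1) for j in range(p2)]
            # Assumption: each complex entry becomes a (real, imag) pair.
            patches.append([part for z in block for part in (z.real, z.imag)])
    return patches

def embed(patches, E, pos):
    """Compute z_i = E x_i and add the position encoding of patch i."""
    tokens = []
    for x, p in zip(patches, pos):
        z = [sum(w * v for w, v in zip(row, x)) for row in E]
        tokens.append([a + b for a, b in zip(z, p)])
    return tokens

rng = random.Random(0)
n_sc, n_ant, p1, p2, D = 8, 4, 2, 2, 6
H = [[complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_ant)]
     for _ in range(n_sc)]
patches = patchify(H, p1, p2)
M, patch_len = len(patches), len(patches[0])
# E and pos would be learned during training; random values stand in here.
E = [[rng.gauss(0, 0.1) for _ in range(patch_len)] for _ in range(D)]
pos = [[rng.gauss(0, 0.1) for _ in range(D)] for _ in range(M)]
tokens = embed(patches, E, pos)   # M tokens of dimension D
```

Each row of `pos` corresponds to one (subcarrier block, antenna block) grid position, which is one simple way to realise a two-dimensional learnable position encoding.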

[0051] When patching point cloud data, this embodiment uses farthest point sampling (FPS) to extract key points and extracts local features through grouping and a small PointNet (a deep learning framework for point cloud classification and segmentation). When patching RGB image data, this embodiment uses the standard Patch Partition method of the Vision Transformer (ViT) to convert the RGB image data into a sequence. The Vision Transformer extends the Transformer framework to computer vision and achieves good results in image classification tasks without relying on convolution. Patch Partition divides an image into small patches and is commonly used in vision tasks. Position encoding for point cloud data and RGB image data follows the same approach as for the channel state data, using learnable position encoding: the data blocks are first mapped to embedded sequences and then added element-wise to the learnable position encoding of the corresponding modality. The encoding for RGB image data is adapted to a two-dimensional pixel grid structure, and the encoding for point cloud data is adapted to a three-dimensional spatial point set structure, so that the spatial distribution characteristics of each modality are preserved. Performing data patching and position encoding on the multimodal data in this way yields CSI sequences, point cloud sequences, and RGB image sequences.
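
The farthest point sampling step mentioned in this paragraph can be illustrated with the standard greedy algorithm. The function name, the `start` parameter, and the sample points are hypothetical, and the grouping and PointNet stages that follow FPS in the embodiment are not shown.

```python
import math

def farthest_point_sampling(points, k, start=0):
    """Return indices of k points chosen greedily to be mutually far apart."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    chosen = [start]
    # min_dist[i] is the distance from point i to its nearest chosen point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(chosen) < k:
        # The next key point is the one farthest from everything chosen so far.
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        chosen.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return chosen

points = [(0, 0, 0), (1, 0, 0), (10, 0, 0), (0, 5, 0), (0.1, 0, 0)]
key_points = farthest_point_sampling(points, k=3)
```

Starting from the first point, the sampler picks the far end of the x-axis and then the point on the y-axis, skipping the two points that sit close to the start.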

[0052] Step S300: Extract the first feature of the corresponding main modality through the CSI encoder and the point cloud encoder respectively, and extract the second feature of the auxiliary modality through the auxiliary encoder. Perform cross-modal injection in the CSI encoder and the point cloud encoder, using a cross-modal attention mechanism to cross-inject and fuse the features of different modalities and output a multimodal general representation.

[0053] During feature extraction, for the main modalities, this embodiment feeds the processed CSI sequence into the Transformer modules of the first $N$ layers of the CSI encoder and feeds the point cloud sequence into the Transformer modules of the first $N$ layers of the point cloud encoder to extract the first feature of the corresponding main modality. Simultaneously, a visual encoder is used to extract visual features from the RGB image sequence. The Transformer modules in these layers employ a self-attention mechanism and a feedforward neural network.

[0054] Specifically, as shown in Figure 2, within the first $N$ layers of the CSI encoder and the point cloud encoder, each modality performs self-attention extraction within its own feature space. For any encoder branch, the update logic of layer $l$ ($1 \le l \le N$) is as follows:

[0055] 1. Self-attention calculation:

[0056] $$Q = X^{(l-1)} W_Q, \quad K = X^{(l-1)} W_K, \quad V = X^{(l-1)} W_V$$

[0057] $$\hat{X}^{(l)} = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

[0058] where $Q$, $K$, and $V$ are the query, key, and value vectors of the self-attention mechanism; $X^{(l-1)}$ is the output feature vector of encoder layer $l-1$ ($1 \le l \le N$, with $N$ the total number of self-attention layers); $W_Q$, $W_K$, and $W_V$ are the learnable projection matrices of the self-attention mechanism, used to map $X^{(l-1)}$ to the $Q$, $K$, and $V$ vectors respectively; $\hat{X}^{(l)}$ is the intermediate feature vector of encoder layer $l$ after the self-attention calculation (before the feedforward network); $\mathrm{softmax}$ is the normalized exponential activation function, used to normalize the attention scores so that the weights sum to 1; $Q K^{\top}$ is the matrix product of the query vector and the transposed key vector ($K^{\top}$ being the transpose of the key vector $K$), which yields the attention score matrix; $d_k$ is the feature dimension of the key vector; and $\sqrt{d_k}$ is the scaling factor that prevents the attention scores from becoming too large.

[0059] 2. Feedforward Network (FFN):

[0060] $$X^{(l)} = \mathrm{GELU}\left(\hat{X}^{(l)} W_1 + b_1\right) W_2 + b_2$$

[0061] Iterating over $N$ layers extracts the high-dimensional features $F_{CSI} = X_{CSI}^{(N)}$ and $F_{PC} = X_{PC}^{(N)}$, where $X^{(l)}$ is the final output feature vector of encoder layer $l$; $\mathrm{GELU}$ is the Gaussian error linear unit activation function, used to introduce a nonlinear feature transformation; $\hat{X}^{(l)}$ is the intermediate feature vector of layer $l$ after the self-attention calculation; $W_1$ and $W_2$ are the learnable weight matrices of the feedforward neural network, i.e., the weights of its two fully connected layers; $b_1$ and $b_2$ are the learnable bias terms of the feedforward neural network, i.e., the biases of its two fully connected layers; $F_{CSI}$ is the CSI main-modality feature vector output from the $N$-th layer of the CSI encoder; $F_{PC}$ is the point cloud main-modality feature vector output from the $N$-th layer of the point cloud encoder; $X_{CSI}^{(N)}$ is the final output feature vector of the $N$-th self-attention layer of the CSI encoder; and $X_{PC}^{(N)}$ is the final output feature vector of the $N$-th self-attention layer of the point cloud encoder. The Transformer modules of the first $N$ layers of the CSI encoder and the point cloud encoder therefore extract the CSI and point cloud features to obtain the first feature. In one implementation, the visual encoder of this embodiment can also adopt the same Transformer architecture as the CSI encoder to extract visual features layer by layer and obtain the second feature, thus ensuring the structural uniformity of the multimodal model.
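
The two-step update above (self-attention, then a feedforward network with GELU) can be sketched in plain Python for a handful of tokens. This is an illustrative sketch rather than the patented implementation: the helper names, the dimensions, the random initialisation, and the omission of residual connections and normalisation (none are mentioned in the text) are assumptions. A practical implementation would use a tensor library.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def gelu(x):
    """Gaussian error linear unit (exact erf form)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def attention(Q, K, V):
    """Scaled dot-product attention; returns (output, attention weights)."""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])  # Q K^T
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V), weights

def encoder_layer(X, p):
    """One self-attention step followed by a two-layer feedforward network."""
    Q, K, V = matmul(X, p["Wq"]), matmul(X, p["Wk"]), matmul(X, p["Wv"])
    X_hat, _ = attention(Q, K, V)
    hidden = [[gelu(h + b) for h, b in zip(row, p["b1"])]
              for row in matmul(X_hat, p["W1"])]
    return [[o + b for o, b in zip(row, p["b2"])]
            for row in matmul(hidden, p["W2"])]

def rand_matrix(rng, rows, cols, scale=0.5):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def init_layer(rng, d, d_ff):
    """Hypothetical random initialisation; real weights would be learned."""
    return {"Wq": rand_matrix(rng, d, d), "Wk": rand_matrix(rng, d, d),
            "Wv": rand_matrix(rng, d, d), "W1": rand_matrix(rng, d, d_ff),
            "b1": [0.0] * d_ff, "W2": rand_matrix(rng, d_ff, d), "b2": [0.0] * d}

rng = random.Random(0)
n_tokens, d, d_ff, N = 3, 4, 8, 2
layers = [init_layer(rng, d, d_ff) for _ in range(N)]
X = rand_matrix(rng, n_tokens, d)
for params in layers:            # iterate the first N self-attention layers
    X = encoder_layer(X, params)
F = X                            # first feature of this branch after N layers
```

The same `encoder_layer` would be run separately on the CSI tokens and the point cloud tokens, each branch with its own parameters.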

[0062] Furthermore, as shown in Figure 2, this embodiment performs cross-modal injection in the CSI encoder and the point cloud encoder, using a cross-modal attention mechanism to cross-inject and fuse features from different modalities and output a multimodal general representation. In Figure 3, (a) is a schematic diagram of the internal structure of a standard Transformer module, and (b) is a schematic diagram of the internal structure of the Transformer module with cross-modal injection in this embodiment. Specifically, in the Transformer modules of layer $N+1$ and subsequent layers of the CSI encoder and the point cloud encoder, a cross-modal attention mechanism and a feedforward neural network are employed internally. The CSI encoder injects features from the point cloud data and the auxiliary modality; that is, the extracted point cloud features and visual features are injected into the CSI encoding stream to correct communication feature deviations in non-line-of-sight environments. The point cloud encoder injects features from the channel state information and the auxiliary modality; that is, CSI features and visual features are injected into the point cloud encoding stream to enhance the perception accuracy for dynamic targets. In this embodiment, cross-modal transmission between the Transformers of the two main modalities is bidirectional. The auxiliary modality is provided simultaneously to the Transformers of all main modalities, and transmission from the auxiliary modality to the main modalities is unidirectional. The final output is a multimodal general representation containing heterogeneous information.
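
The injection topology in this paragraph (bidirectional between the two main branches, one-way from the auxiliary branch) and the split between self-attention layers and cross-modal layers can be summarised in a small sketch. The branch labels, the function names, and the layer counts in the example are hypothetical.

```python
def injection_sources(branch):
    """Modalities whose features a branch attends to during cross-modal injection."""
    sources = {
        "csi": ["point_cloud", "rgb"],
        "point_cloud": ["csi", "rgb"],
        "rgb": [],  # auxiliary branch: provides features but receives none
    }
    return sources[branch]

def layer_plan(branch, total_layers, n_self):
    """List (layer index, attention type, attended modalities), 1-indexed."""
    plan = []
    for layer in range(1, total_layers + 1):
        if layer <= n_self:
            # The first n_self layers attend only within the branch's own modality.
            plan.append((layer, "self", [branch]))
        else:
            # Later layers attend to the other main modality and the auxiliary one.
            plan.append((layer, "cross", injection_sources(branch)))
    return plan

csi_plan = layer_plan("csi", total_layers=4, n_self=2)
```

Here `csi_plan` marks layers 1 and 2 as self-attention and layers 3 and 4 as cross-modal layers that draw on the point cloud and RGB features.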

[0063] Taking CSI encoder injection as an example, in layer $N+1$ the system executes the following logic:

[0064] 1. Feature convergence: obtain the visual features $F_{V}$ from the central visual encoder, and obtain the perceptual features $F_{PC}$ from the point cloud branch.

[0065] 2. Cross-modal attention injection: the current CSI feature is used as the query vector, $Q_{CSI} = F_{CSI} W_Q^{c}$, and the concatenated point cloud and visual features are used as key-value pairs:

[0066] $$K_{mix} = [F_{PC}; F_{V}] \, W_K^{c}$$

[0067] $$V_{mix} = [F_{PC}; F_{V}] \, W_V^{c}$$

[0068] where $Q_{CSI}$ is the query vector for the cross-modal attention of the CSI encoder, generated from the CSI main-modality features $F_{CSI}$; $W_Q^{c}$ is the learnable query projection matrix for cross-modal injection, designed specifically for cross-modal attention in the CSI branch; $K_{mix}$ is the hybrid key vector for cross-modal attention, generated by concatenating the point cloud features and the visual auxiliary-modality features; $F_{V}$ is the auxiliary-modality feature vector of the RGB image extracted by the visual encoder; $F_{PC}$ is the point cloud main-modality feature vector output from the $N$-th layer of the point cloud encoder; $W_K^{c}$ is the learnable key projection matrix for cross-modal injection, used to map the mixed features to the key vector dimension; $V_{mix}$ is the hybrid value vector for cross-modal attention, generated by concatenating the point cloud features and the visual auxiliary-modality features; and $W_V^{c}$ is the learnable value projection matrix for cross-modal injection, used to map the mixed features to the value vector dimension.

[0069] 3. Injection formula representation:

[0070]

[0071] where the result is the output feature vector of layer N+1 of the CSI encoder after cross-modal injection (the result of the first cross-modal fusion).

[0072] Using the above formula, the CSI encoder can focus on the physical spatial structure information (such as obstacle positions) and visual semantic information represented by the point cloud when calculating the next layer representation. Thus, in non-line-of-sight scenarios, it can correct the phase shift of the channel estimation through spatial geometric constraints.
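
The injection step described above can be illustrated with a minimal NumPy sketch. It is not the embodiment's implementation: the function name `cross_modal_inject`, the token shapes, concatenation along the token axis, the scaled dot-product softmax, and the residual connection are all assumptions chosen for illustration, and the feedforward network that follows the attention is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_inject(h_csi, h_pc, f_vis, w_q, w_k, w_v):
    """One cross-modal attention step for the CSI stream (illustrative only).

    h_csi: (T_csi, d) CSI tokens, used as queries
    h_pc:  (T_pc, d)  point cloud tokens from the point cloud branch
    f_vis: (T_vis, d) visual tokens from the auxiliary (visual) encoder
    """
    # Assumption: point cloud and visual tokens are concatenated along the token axis.
    mixed = np.concatenate([h_pc, f_vis], axis=0)
    q = h_csi @ w_q          # queries come from the CSI stream
    k = mixed @ w_k          # hybrid keys
    v = mixed @ w_v          # hybrid values
    scores = q @ k.T / np.sqrt(k.shape[-1])
    attn = softmax(scores, axis=-1)
    # Assumption: a residual connection adds the injected context to the CSI tokens.
    return h_csi + attn @ v, attn

rng = np.random.default_rng(0)
d = 8
h_csi = rng.normal(size=(4, d))
h_pc = rng.normal(size=(6, d))
f_vis = rng.normal(size=(3, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

h_next, attn = cross_modal_inject(h_csi, h_pc, f_vis, w_q, w_k, w_v)
print(h_next.shape, attn.shape)
```

Each CSI token receives one attention weight per point cloud or visual token, so the attention matrix has one row per CSI token and one column per mixed token. Injection into the point cloud encoder would be symmetric, with point cloud tokens as queries and CSI plus visual tokens as keys and values.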

[0073] Step S400: Using the multimodal general representation, the model is pre-trained using a joint training method that combines autoregressive training and task-related training to obtain a multimodal wireless basic model.

[0074] As shown in Figure 2, during the pre-training phase, this embodiment performs self-supervised learning through an autoregressive head to calculate the autoregressive loss, and performs supervised learning through downstream task heads to calculate the task-related loss. The overall loss function for joint training is:

[0075]

[0076] where the terms are the total loss; the self-supervised autoregressive loss, calculated from the autoregressive head output; the supervised task-related loss, calculated from the downstream task head output; and the weight hyperparameter. In this embodiment, the autoregressive loss forces the model to learn the implicit patterns of channel evolution over time, while the task-related loss ensures that the generated general representation can be directly adapted to downstream sensing or communication tasks. The encoder weights are updated iteratively using an algorithm such as stochastic gradient descent until the model converges, yielding the multimodal wireless basic model.
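
As a rough illustration of how the two losses might be combined, the sketch below assumes a weighted sum in which the weight hyperparameter scales the task-related loss. The placement of the weight, the use of mean squared error for both terms, and the names `autoregressive_loss`, `task_loss`, `joint_loss`, and `lam` are assumptions made for this example only.

```python
import numpy as np

def autoregressive_loss(pred_next, true_next):
    # Self-supervised term: error of next-step predictions (MSE is an assumption).
    return float(np.mean((pred_next - true_next) ** 2))

def task_loss(pred, target):
    # Supervised term from a downstream task head (MSE is an assumption).
    return float(np.mean((pred - target) ** 2))

def joint_loss(l_ar, l_task, lam=0.5):
    # Assumed form: total = autoregressive loss + lam * task-related loss.
    return l_ar + lam * l_task

# Toy sequence: the autoregressive head predicts each next sample of the series.
series = np.array([1.0, 2.0, 3.0, 4.0])
pred_next = np.array([2.0, 3.0, 5.0])      # predictions for series[1:]
l_ar = autoregressive_loss(pred_next, series[1:])

# Toy downstream task output and label.
l_task = task_loss(np.array([1.0, 1.0]), np.array([1.0, 3.0]))

total = joint_loss(l_ar, l_task, lam=0.5)
print(l_ar, l_task, total)
```

In an actual training loop, the gradient of the total loss would be used to update the encoder weights, for example with stochastic gradient descent as the paragraph above describes.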

[0077] Step S500: In the inference stage, the corresponding general representation output by the CSI encoder or point cloud encoder is selected according to the type of the downstream task through the task switch, and then connected to the downstream task model for inference.

[0078] Specifically, during application deployment, the system triggers the task switch based on the current business instruction. Upon receiving a communication-related task request (such as channel estimation or channel prediction), the task switch routes the general representation output by the CSI encoder to the downstream model. Upon receiving a perception-related task request (such as target localization, electromagnetic mapping, or indoor sensing), the task switch routes the general representation output by the point cloud encoder to the downstream model. In this embodiment, the multimodal wireless basic model ultimately outputs inference results including channel prediction, localization, and electromagnetic mapping.
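
The routing behavior of the task switch can be sketched as a simple lookup. The names `TaskType`, `TASK_TYPES`, and `task_switch`, as well as the string task identifiers, are hypothetical and serve only to illustrate the selection rule described above.

```python
from enum import Enum

class TaskType(Enum):
    COMMUNICATION = "communication"
    PERCEPTION = "perception"

# Hypothetical task identifiers mapped to the task types named in the text.
TASK_TYPES = {
    "channel_estimation": TaskType.COMMUNICATION,
    "channel_prediction": TaskType.COMMUNICATION,
    "target_localization": TaskType.PERCEPTION,
    "electromagnetic_mapping": TaskType.PERCEPTION,
    "indoor_sensing": TaskType.PERCEPTION,
}

def task_switch(task_name, csi_repr, pc_repr):
    """Select which encoder's general representation feeds the downstream model."""
    task_type = TASK_TYPES[task_name]  # raises KeyError for unknown tasks
    if task_type is TaskType.COMMUNICATION:
        return csi_repr
    return pc_repr

# Placeholder representations; in practice these would be encoder outputs.
selected = task_switch("channel_prediction", csi_repr="csi_repr", pc_repr="pc_repr")
print(selected)
```

Communication tasks receive the CSI encoder output and perception tasks receive the point cloud encoder output; an unrecognized task name fails loudly instead of silently picking a branch.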

[0079] Therefore, this invention achieves deep fusion and alignment of heterogeneous data (CSI, point cloud, image) in the underlying feature space through dual encoders and cross-modal injection mechanisms, effectively balancing the performance requirements of both communication and sensing tasks. The general representation generated by this basic model can significantly improve the accuracy and efficiency of a wide range of downstream communication and sensing tasks, including channel estimation, channel prediction, target localization, electromagnetic mapping, and indoor sensing.

[0080] Based on the above embodiments, the present invention also provides a system for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism. This system is used to implement the steps in the above method embodiments. Specifically, as shown in Figure 4, the system in this embodiment includes: a dual encoder construction module 10, a multimodal data processing module 20, a feature extraction and fusion module 30, a joint training module 40, and an inference application module 50. The dual encoder construction module 10 is used to construct a dual encoder structure, which includes a CSI encoder with channel state information as the primary modality and a point cloud encoder with point cloud data as the primary modality. The multimodal data processing module 20 is used to acquire time-sampled aligned multimodal data and perform data blocking and positional encoding on the multimodal data. The multimodal data includes channel state data or point cloud data as the primary modality and sensor data as the auxiliary modality. The feature extraction and fusion module 30 is used to extract the first feature of the corresponding primary modality through the CSI encoder and the point cloud encoder respectively, and to extract the second feature of the auxiliary modality through the auxiliary encoder. Cross-modal injection is performed in the CSI encoder and the point cloud encoder, and the features of different modalities are cross-injected and fused through a cross-modal attention mechanism to output a multimodal general representation. The joint training module 40 is used to pre-train the model using the multimodal general representation with a joint training method combining autoregressive training and task-related training to obtain a multimodal wireless basic model. The inference application module 50 is used, during the inference phase, to select through a task switch the corresponding general representation output by the CSI encoder or the point cloud encoder according to the type of the downstream task, and to connect it to the downstream task model for inference.

[0081] The principles and implementation of each module in this system embodiment are the same as those of the corresponding steps in the method embodiments above and are not repeated here.

[0082] Based on the above embodiments, the present invention also provides a terminal, a schematic block diagram of which may be as shown in Figure 5. The terminal may include one or more processors 100 (only one is shown in Figure 5), a memory 101, and a computer program 102 stored in the memory 101 and executable on the one or more processors 100, for example, a program for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism. When the one or more processors 100 execute the computer program 102, they can implement the steps of the method embodiments for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism. Alternatively, when the one or more processors 100 execute the computer program 102, they can implement the functions of the modules/units in the system embodiment for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism, which is not limited here.

[0083] In one embodiment, the processor 100 may be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor.

[0084] In one embodiment, the memory 101 can be an internal storage unit of the terminal, such as a hard disk or RAM. The memory 101 can also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card. Furthermore, the memory 101 can include both internal and external storage units. The memory 101 is used to store the computer program and other programs and data required by the terminal. The memory 101 can also be used to temporarily store data that has been output or will be output.

[0085] Those skilled in the art will understand that the block diagram shown in Figure 5 is merely a partial structural diagram related to the present invention and does not constitute a limitation on the terminal to which the present invention is applied. A specific terminal may include more or fewer components than those shown in the figure, combine certain components, or have a different arrangement of components.

[0086] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can carry out the processes of the above method embodiments. Any reference to memory, storage, databases, or other media used in the embodiments provided by this invention can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

[0087] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism, characterized in that the method includes: constructing a dual encoder structure comprising a CSI encoder with channel state information as the primary modality and a point cloud encoder with point cloud data as the primary modality; acquiring time-sampled aligned multimodal data, and performing data blocking and positional encoding on the multimodal data, wherein the multimodal data includes channel state data or point cloud data as the primary modality and sensor data as the auxiliary modality; extracting the first feature of the corresponding primary modality through the CSI encoder and the point cloud encoder respectively, extracting the second feature of the auxiliary modality through the auxiliary encoder, performing cross-modal injection in the CSI encoder and the point cloud encoder, and cross-injecting and fusing the features of different modalities through a cross-modal attention mechanism to output a multimodal general representation; pre-training the model using the multimodal general representation with a joint training method that combines autoregressive training and task-related training to obtain a multimodal wireless basic model; and, in the inference phase, selecting through a task switch the corresponding general representation output by the CSI encoder or the point cloud encoder according to the type of the downstream task, and connecting it to the downstream task model for inference; wherein both the CSI encoder and the point cloud encoder are based on the Transformer architecture; extracting the first feature of the corresponding primary modality through the CSI encoder and the point cloud encoder respectively includes: extracting the first feature of the corresponding primary modality through the Transformer modules of the first N layers of the CSI encoder and the point cloud encoder, these Transformer modules internally adopting a self-attention mechanism and a feedforward neural network; and the Transformer modules of the (N+1)-th layer and subsequent layers of the CSI encoder and the point cloud encoder internally adopt a cross-modal attention mechanism and a feedforward neural network, wherein the CSI encoder injects features from the point cloud data and the auxiliary modality, and the point cloud encoder injects features from the channel state data and the auxiliary modality.

2. The method for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism according to claim 1, characterized in that, The sensor data of the auxiliary modality includes RGB image data, the corresponding auxiliary encoder is a visual encoder, and the second feature is a visual feature.

3. The method for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism according to claim 1, characterized in that cross-modal transfer between the Transformers corresponding to the two primary modalities is bidirectional; the auxiliary modality is provided simultaneously to the Transformers of all primary modalities; and transfer from the auxiliary modality to a primary modality is unidirectional.

4. The method for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism according to claim 1, characterized in that the loss function for joint training is expressed in terms of the total loss, the self-supervised autoregressive loss calculated from the autoregressive head output, the supervised task-related loss calculated from the downstream task head output, and a weight hyperparameter.

5. The method for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism according to claim 1, characterized in that, During the inference phase, the corresponding general representation output by the CSI encoder or point cloud encoder is selected based on the type of downstream task via a task switch, including: When performing a communication task, select the general representation output by the CSI encoder; When performing a perception task, the general representation output by the point cloud encoder is selected.

6. A system for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism, the system being used to implement the method for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism as described in any one of claims 1-5, characterized in that the system includes: a dual encoder construction module, used to construct a dual encoder structure comprising a CSI encoder with channel state information as the primary modality and a point cloud encoder with point cloud data as the primary modality; a multimodal data processing module, used to acquire time-sampled aligned multimodal data and perform data blocking and positional encoding on the multimodal data, wherein the multimodal data includes channel state data or point cloud data as the primary modality and sensor data as the auxiliary modality; a feature extraction and fusion module, used to extract the first feature of the corresponding primary modality through the CSI encoder and the point cloud encoder respectively, extract the second feature of the auxiliary modality through the auxiliary encoder, perform cross-modal injection in the CSI encoder and the point cloud encoder, and cross-inject and fuse the features of different modalities through a cross-modal attention mechanism to output a multimodal general representation; a joint training module, used to pre-train the model using the multimodal general representation with a joint training method that combines autoregressive training and task-related training to obtain a multimodal wireless basic model; and an inference application module, used, in the inference phase, to select through a task switch the corresponding general representation output by the CSI encoder or the point cloud encoder according to the type of the downstream task, and to connect it to the downstream task model for inference.

7. A terminal, characterized in that, The terminal includes a memory, a processor, and a multimodal wireless basic model construction and application program based on a dual-encoder cross-modal injection mechanism stored in the memory and capable of running on the processor. When the processor executes the multimodal wireless basic model construction and application program based on a dual-encoder cross-modal injection mechanism, it implements the steps of the method for constructing and applying the multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism as described in any one of claims 1-5.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a program for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism, and the program, when executed by a processor, implements the steps of the method for constructing and applying a multimodal wireless basic model based on a dual-encoder cross-modal injection mechanism as described in any one of claims 1-5.