Data processing method and device, electronic equipment and storage medium
By encoding and fusing multimodal data, and utilizing pre-trained multimodal collaborative models and low-rank adaptive methods, the challenges of multimodal data processing are addressed, improving the universality and processing efficiency of intelligent models.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING XIAOMI MOBILE SOFTWARE CO LTD
- Filing Date
- 2026-03-23
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies struggle to process multimodal data effectively, particularly data of various types such as text, audio, and images.
Data from multiple modalities are acquired, encoded separately, and then fused, and the fused data are processed with a pre-trained multimodal collaborative model. The model is trained with a low-rank adaptive method to improve its processing capabilities.
It enables intelligent processing of multi-modal data, improves the universality and processing efficiency of intelligent models, and can better integrate and process different types of data.
Figure CN122241595A_ABST (abstract drawing; image not reproduced)
Description
Technical Field
[0001] This disclosure relates to the field of artificial intelligence technology, and in particular to a data processing method, apparatus, electronic device, and storage medium.
Background Technology
[0002] With the development of the times, artificial intelligence technology has gradually become an indispensable part of daily life. While AI technology has matured in processing single types of data such as text and audio, how to utilize AI technology to process multiple types of data remains a pressing issue in this field.
[0003] It should be noted that the information disclosed in the background section above is only used to enhance the understanding of the background of this disclosure, and therefore may include information that does not constitute prior art known to those skilled in the art.
Summary of the Invention
[0004] To overcome the problems existing in related technologies, this disclosure provides a data processing method, apparatus, electronic device, and storage medium.
[0005] According to a first aspect of the present disclosure, a data processing method is provided, comprising: acquiring first data of at least one modality; encoding the first data of the at least one modality to obtain second data corresponding to different modalities; fusing the second data corresponding to the at least one modality to obtain fused data; and processing the fused data based on a pre-trained multimodal collaborative model to obtain a processing result.
[0006] In one embodiment of this disclosure, the first data of at least one modality includes at least one of text data, audio data, and image data, and sensor data includes time-series text data. Encoding the first data of at least one modality to obtain the second data corresponding to each modality includes: mapping text data into high-dimensional vector data based on a word embedding method; extracting spectral features from audio data based on an audio encoder; and extracting image features from image data based on a visual encoder.
[0007] In one embodiment of this disclosure, the method further includes: preprocessing the acquired text data, audio data, and image data to obtain preprocessed text data, audio data, and image data.
[0008] In one embodiment of this disclosure, fusing the second data corresponding to at least one modality to obtain fused data includes: determining a first correspondence between the spectral features and the vector features; embedding the spectral features into a first container that matches the vector data; and inserting the first container into the vector data based on the first correspondence to obtain first fused data.
[0009] In one embodiment of this disclosure, fusing the second data corresponding to multiple modalities to obtain fused data includes: determining a second correspondence between the image features and the vector features; embedding the image features into a second container that matches the vector data; and inserting the second container into the vector data based on the second correspondence to obtain second fused data.
[0010] In one embodiment of this disclosure, fusing the second data corresponding to multiple modalities to obtain fused data includes: determining the first correspondence between the spectral features and the vector features, and the second correspondence between the image features and the vector features; embedding the spectral features into a first container that matches the vector data; embedding the image features into a second container that matches the vector data; and inserting the first container and the second container into the vector data based on the first and second correspondences to obtain third fused data.
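The container-based fusion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: a "container" is assumed to be a block of rows with the same width as the text vector data, a "correspondence" is assumed to be an insertion position in the text sequence, and random projection matrices stand in for trained embedding layers. The names `to_container` and `insert_containers` are hypothetical.

```python
import numpy as np

def to_container(features: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Project modality features (n, d_in) into a container (n, d_model)."""
    return features @ proj

def insert_containers(text_vecs: np.ndarray, containers) -> np.ndarray:
    """Insert each container into the text vectors at its position.

    containers is a list of (position, container) pairs, where position
    indexes the original text sequence (the assumed "correspondence").
    """
    out = text_vecs
    # Insert from the last position backwards so earlier indices stay valid.
    for pos, block in sorted(containers, key=lambda pc: pc[0], reverse=True):
        out = np.concatenate([out[:pos], block, out[pos:]], axis=0)
    return out

rng = np.random.default_rng(0)
d_model = 8
text_vecs = rng.normal(size=(5, d_model))   # 5 text tokens
spectral = rng.normal(size=(3, 16))         # 3 audio frames, 16 features each
image = rng.normal(size=(2, 32))            # 2 image regions, 32 features each

# Random projections are placeholders for trained embedding layers.
first_container = to_container(spectral, rng.normal(size=(16, d_model)))
second_container = to_container(image, rng.normal(size=(32, d_model)))

# Assumed correspondences: audio belongs at token 1, image at token 4.
fused = insert_containers(text_vecs, [(1, first_container), (4, second_container)])
```

Passing only the first container or only the second container to `insert_containers` corresponds to the first and second fused data, respectively; passing both corresponds to the third fused data.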
[0011] In one embodiment of this disclosure, the method further includes: determining similarity data between multiple second containers; and clustering and compressing the multiple second containers based on the similarity data to obtain compressed second containers.
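One way to picture the clustering and compression of second containers is sketched below. The disclosure does not specify the similarity measure or the clustering algorithm, so cosine similarity, a greedy threshold-based grouping, and averaging within each group are all assumptions made for illustration. The function name `compress_containers` and the threshold value are hypothetical, and rows are assumed to be non-zero.

```python
import numpy as np

def compress_containers(containers: np.ndarray, threshold: float = 0.9) -> np.ndarray:
    """Greedily group rows by cosine similarity and average each group."""
    unit = containers / np.linalg.norm(containers, axis=1, keepdims=True)
    similarity = unit @ unit.T                 # pairwise cosine similarity
    assigned = np.zeros(len(containers), dtype=bool)
    merged = []
    for i in range(len(containers)):
        if assigned[i]:
            continue
        # All unassigned rows close enough to row i join its cluster.
        members = (similarity[i] >= threshold) & ~assigned
        assigned |= members
        merged.append(containers[members].mean(axis=0))
    return np.stack(merged)

containers = np.array([
    [1.0, 0.0, 0.0],
    [0.99, 0.05, 0.0],   # nearly parallel to row 0
    [0.0, 1.0, 0.0],
    [0.0, 0.98, 0.1],    # nearly parallel to row 2
])
compressed = compress_containers(containers)   # 4 rows reduced to 2
```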
[0012] According to a second aspect of this disclosure, a data processing apparatus is provided, comprising: an acquisition module, used to acquire first data of at least one modality; an encoding module, used to encode the first data of at least one modality to obtain second data corresponding to different modalities; a fusion module, used to fuse the second data corresponding to multiple modalities to obtain fused data; and a processing module, used to process the fused data based on a pre-trained multimodal collaborative model to obtain a processing result.
[0013] In one embodiment of this disclosure, the first data of at least one modality includes at least one of text data, audio data, and image data, and the encoding module includes: a mapping unit, used to map text data into high-dimensional vector data based on a word embedding method; a first extraction unit, used to extract spectral features from audio data based on an audio encoder; and a second extraction unit, used to extract image features from image data based on a visual encoder.
[0014] In one embodiment of this disclosure, the fusion module includes: a first determining unit, used to determine the first correspondence between the spectral features and the vector features; a first embedding unit, used to embed the spectral features into a first container that matches the vector data; and a second embedding unit, used to insert the first container into the vector data based on the first correspondence to obtain the first fused data.
[0015] In one embodiment of this disclosure, the fusion module includes: a second determining unit, used to determine the second correspondence between the image features and the vector features based on the word vector alignment and adaptation method; a third embedding unit, used to embed the image features into a second container that matches the vector data; and a fourth embedding unit, used to insert the second container into the vector data based on the second correspondence to obtain the second fused data.
[0016] In one embodiment of this disclosure, the fusion module includes: a third determining unit, used to determine the first correspondence between the spectral features and the vector features, and the second correspondence between the image features and the vector features; the first embedding unit, used to embed the spectral features into a first container that matches the vector data; the third embedding unit, used to embed the image features into a second container that matches the vector data; and a fifth embedding unit, used to insert the first container and the second container into the vector data based on the first correspondence and the second correspondence to obtain the third fused data.
[0017] In one embodiment of this disclosure, the apparatus further includes: a determination module, used to determine similarity data between multiple second containers; and a compression module, used to cluster and compress the multiple second containers based on the similarity data to obtain compressed second containers.
[0018] According to a third aspect of the present disclosure, an electronic device is provided, comprising: a processor; and a memory used to store processor-executable instructions; wherein the processor is configured to implement any of the data processing methods described in the first aspect above.
[0019] According to a fourth aspect of the present disclosure, a non-transitory computer-readable storage medium is provided. When the instructions in the storage medium are executed by a processor of a terminal, the terminal is enabled to perform any of the data processing methods described in the first aspect.
[0020] According to a fifth aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements any of the data processing methods described in the first aspect.
[0021] The technical solutions provided by the embodiments of this disclosure may include the following beneficial effects: this disclosure obtains first data of at least one modality, encodes the first data to obtain second data corresponding to different modalities, fuses the second data corresponding to multiple modalities to obtain fused data, and processes the fused data based on a pre-trained multimodal collaborative model to obtain a processing result. Because the intelligent model processes the first data of multiple modalities simultaneously, the universality of the intelligent model is improved.
[0022] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.
Attached Figure Description
[0023] The accompanying drawings, which are incorporated in and form a part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure.
[0024] Figure 1 is a flowchart of a data processing method according to an exemplary embodiment of the present disclosure.
[0025] Figure 2 is a flowchart of a data processing method according to an exemplary embodiment of the present disclosure.
[0026] Figure 3 is a flowchart of a data processing method according to an exemplary embodiment of the present disclosure.
[0027] Figure 4 is a flowchart of a data processing method according to an exemplary embodiment of the present disclosure.
[0028] Figure 5 is a flowchart of a data processing method according to an exemplary embodiment of the present disclosure.
[0029] Figure 6 is a flowchart of a data processing method according to an exemplary embodiment of the present disclosure.
[0030] Figure 7 is a flowchart of a data processing method according to an exemplary embodiment of the present disclosure.
[0031] Figure 8 is a flowchart of a data processing method according to an exemplary embodiment of the present disclosure.
[0032] Figure 9 is a flowchart of a data processing method according to an exemplary embodiment of the present disclosure.
[0033] Figure 10 is a block diagram of a data processing apparatus according to an exemplary embodiment of the present disclosure.
[0034] Figure 11 is a block diagram of an apparatus for the data processing method described above, according to some embodiments of the present disclosure.
Detailed Implementation
[0035] Exemplary embodiments of this disclosure will be described in detail herein, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. Various changes, modifications, and equivalents of the methods, apparatus, and / or systems described herein will become apparent upon understanding this disclosure. For example, the order of operations described herein is merely illustrative and is not limited to those orders set forth herein, but can be changed as will become apparent upon understanding this disclosure, except for operations that must be performed in a particular order. Furthermore, for clarity and brevity, descriptions of features known in the art may be omitted.
[0036] The embodiments described below, which are examples of some of the embodiments of this disclosure, do not represent all embodiments consistent with this disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this disclosure as detailed in the appended claims.
[0037] The specific implementation methods of the embodiments of this disclosure will now be described in detail with reference to the accompanying drawings.
[0038] Figure 1 is a flowchart of a data processing method according to an exemplary embodiment of the present disclosure. The data processing method can be applied to terminals, including but not limited to mobile phones, computers, smart wearable devices, smart home devices, smart cars, VR devices, and AR devices. It can also be applied to server-side devices such as local servers and cloud servers, which can be deployed on a single computer or on a computer cluster consisting of multiple computers.
[0039] As shown in Figure 1, the method includes the following steps.
[0040] S110, acquire first data for at least one modality.
[0041] In some embodiments, the first data for at least one modality may include data displayed in different formats. For example, the first data for at least one modality may include audio data, image data, text data, and video data.
[0042] It should be noted that different types of data can be acquired using corresponding acquisition devices. The acquisition devices then transmit the acquired data wirelessly or via wired connection to the terminal configured for the data processing method. Alternatively, the terminal configured for the data processing method can directly acquire first data in at least one modality.
[0043] It should be noted that when users use the aforementioned terminals to obtain first data in at least one modality, they can obtain both real-time data and historical data.
[0044] S120, encode the first data of at least one modality to obtain second data corresponding to different modalities.
[0045] In some embodiments, the encoding scheme can be determined based on the data type. It should be noted that different data types require different encoding schemes.
[0046] In some embodiments, intelligent models can be used to encode the first data of at least one modality, resulting in encoded data to be processed. It should be noted that different intelligent models can be used to encode different types of data.
[0047] For example, different types of data can be input into the corresponding intelligent model to obtain the data encoding results output by the intelligent model.
[0048] In some embodiments, the first data of at least one modality includes at least one of text data, audio data, and image data. Text data may include data recorded in the form of text, images, and symbols; for example, text data may include device operation logs and alarm records. Sensor data may include data generated by any sensor configured on the device. It should be noted that a sensor is a device that converts physical, chemical, biological, or other signals into electrical or other measurable signals, and sensor data may be data obtained when a sensor detects a measurement object. Audio data may include electronic data formed after sound is acquired, converted, and digitized by a sensor, as well as audio emitted by the user. Image data may include images and video data. It should be noted that image data may be electronic data obtained by converting acquired visual information.
[0049] In some embodiments, sensor data may include time-series text data, that is, text data sorted by time. For example, sensor data may include time-sorted line graphs, time-sorted parameters, and point sets.
[0050] In some embodiments, the data to be processed corresponding to each type can be data in the same format. It should be noted that data in the same format can include data that can be merged with each other.
[0051] In this embodiment of the disclosure, the first data of each modality is encoded separately to obtain second data corresponding to the different modalities, so that data of different modalities are transformed into data of the same modality. This removes the obstacles to data fusion while preserving data integrity.
[0052] S130, fuse the second data corresponding to at least one modality to obtain fused data.
[0053] In some embodiments, word vector alignment and adaptation methods can be used to fuse data of multiple types. It should be noted that word vector alignment and adaptation methods can be techniques that map word vector spaces from different sources to a unified semantic space, so that words with similar semantics have similar vector representations in the different spaces.
[0054] It should be noted that word vector alignment and adaptation methods can include linear alignment based on parallel corpora, unsupervised alignment, adaptation based on pre-trained models, cross-domain word vector adaptation, and non-linear alignment.
[0055] In some embodiments, fusing the data to be processed may include integrating, correlating, and co-processing multiple data sets to generate more accurate comprehensive information than a single data source.
[0056] In this embodiment of the disclosure, by fusing multiple data to be processed, it is possible to retain the data information of various types of data while also processing different types of data in a unified manner.
[0057] S140, process the fused data based on a pre-trained multimodal collaborative model to obtain a processing result.
[0058] In some embodiments, the pre-trained multimodal collaborative model may include an intelligent model based on a large language model. The pre-trained multimodal collaborative model is used to process data obtained by fusing data from different modalities.
[0059] In some embodiments, a pre-trained multimodal collaborative model can calculate attention weights between different modalities of data, enabling joint inference of different modalities of data.
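The attention-weight calculation between modalities can be illustrated with scaled dot-product attention, the standard attention form in Transformer models. The sketch below is a minimal example under that assumption. It omits the learned query, key, and value projections of a real model, and the feature arrays are random placeholders for already-encoded text and audio data.

```python
import numpy as np

def attention_weights(queries: np.ndarray, keys: np.ndarray) -> np.ndarray:
    """Row-wise softmax of scaled dot products between queries and keys."""
    scores = queries @ keys.T / np.sqrt(keys.shape[1])
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))    # 4 text positions, placeholder features
audio = rng.normal(size=(3, 8))   # 3 audio frames, placeholder features

# Each text position distributes its attention over the audio frames.
weights = attention_weights(text, audio)   # shape (4, 3)
attended = weights @ audio                 # audio information gathered per text position
```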
[0060] In some embodiments, the pre-trained multimodal collaborative model may include an intelligent model built with a Transformer architecture. It should be noted that the Transformer architecture may include at least one of an attention module, a fully connected feedforward module, an encoder, and a decoder.
[0061] In some embodiments, a pre-trained multimodal collaborative model can be obtained by training the multimodal collaborative model using a low-rank adaptive method.
[0062] In some embodiments, the low-rank adaptive method is a parameter-efficient fine-tuning technique that reduces the number of trainable parameters when fine-tuning a large pre-trained model through low-rank matrix factorization. It maintains or approaches the performance of full fine-tuning without increasing inference latency. The core idea is to freeze the pre-trained weights and train only the incremental portion of the low-rank updates.
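The core idea of freezing the pre-trained weights and training only a low-rank increment can be sketched as below. This is a minimal illustration of a LoRA-style linear layer, not the disclosed model: the class name `LoRALinear`, the `alpha / rank` scaling, and the zero initialization of `B` follow common LoRA practice and are assumptions here. Merging the update into the frozen weight shows why inference latency need not increase.

```python
import numpy as np

class LoRALinear:
    """Linear layer y = x W^T with a trainable low-rank update W + scale * B A."""

    def __init__(self, weight: np.ndarray, rank: int, alpha: float = 1.0, seed: int = 0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                  # frozen pre-trained weight
        self.A = rng.normal(scale=0.01, size=(rank, d_in))    # trainable
        self.B = np.zeros((d_out, rank))                      # trainable, zero so the update starts at 0
        self.scale = alpha / rank

    def forward(self, x: np.ndarray) -> np.ndarray:
        # Frozen path plus low-rank path; only A and B would receive gradients.
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def merged_weight(self) -> np.ndarray:
        # After training, the update can be folded into a single matrix.
        return self.weight + self.scale * self.B @ self.A

rng = np.random.default_rng(1)
layer = LoRALinear(rng.normal(size=(4, 6)), rank=2)
x = rng.normal(size=(3, 6))
trainable = layer.A.size + layer.B.size   # 2*6 + 4*2 = 20, versus 24 for the full matrix
```

In this toy size the saving is small; for a square weight of width d the trainable count grows as 2·r·d rather than d².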
[0063] In some embodiments, low-rank adaptive methods may include Low-Rank Adaptation (LoRA), Miniature Ensemble LoRA (MELoRA), LoRA-Gradient Alignment (LoRA-GA), Nonlinear Efficient Adapter Tuning (Neat), and Rank-Sparse Adaptation (RoSA). MELoRA enables ensemble learning by training multiple mini-LoRAs in parallel. LoRA-GA initializes the LoRA adapter matrices A and B using the first-step gradient of full fine-tuning, aligning the gradient of the low-rank update ΔW = BA with the direction of the full fine-tuning gradient. Neat freezes the pre-trained weights and adds a nonlinear adapter after each linear layer. The low-rank update of LoRA ignores the sparsity in weight updates, while RoSA enhances expressive power by combining low-rank and sparse components.
[0064] In this embodiment of the disclosure, a pre-trained multimodal collaborative model is obtained by training the multimodal collaborative model using a low-rank adaptive method, which can enhance the pre-trained multimodal collaborative model's ability to process different types of data.
[0065] In some embodiments, obtaining the pre-trained multimodal collaborative model by training the multimodal collaborative model using a low-rank adaptive method includes: inserting a low-rank matrix and an adapter into the multimodal collaborative model; and training the parameters inserted into the multimodal collaborative model based on a preset loss function value to obtain the pre-trained multimodal collaborative model.
[0066] In some embodiments, inserting low-rank matrices and adapters into a multimodal collaborative model may include inserting low-rank matrices and adapters into different modules of the multimodal collaborative model.
[0067] For example, inserting low-rank matrices and adapters into different modules of a multimodal collaborative model can include inserting low-rank matrices and adapters into the text processing module, speech processing module, image processing module, and Transformer layer of the multimodal collaborative model.
[0068] In some embodiments, inserting a low-rank matrix into the text processing module may include completely freezing the original weights of the Transformer backbone network parameters of the multimodal collaborative model to avoid catastrophic forgetting caused by full parameter fine-tuning, and inserting the low-rank matrix only in the middle layers of the model.
[0069] In some embodiments, inserting low-rank matrices and adapters into the speech processing module may include completely freezing the Transformer backbone parameters of the multimodal collaborative model to prevent speech tasks from damaging the pre-trained language capabilities, and hierarchically inserting LoRA adapters: a LoRA_A adapter is inserted to focus on the primary alignment of speech signals and text, and in the middle layers (such as layers 7-12) it collaborates with the LoRAI4 visual adapter to achieve cross-modal attention interaction. Through low-rank decomposition and a hierarchical adaptation strategy, dynamic activation is supported, achieving efficient speech-language collaboration with extremely low parameter overhead.
[0070] In some embodiments, inserting a low-rank matrix and adapter into the image processing module may include completely freezing the original weights of the Transformer of the large model, keeping the core language capabilities stable, and injecting trainable parameters only into the cross-modal attention layer.
[0071] In some embodiments, inserting a low-rank matrix and an adapter into the Transformer layer may include fusing information such as text, visual and audio features in the Transformer layer and achieving cross-modal fusion through inter-modal attention weight calculation, thereby promoting multimodal feature interaction and improving joint reasoning capabilities.
[0072] In some embodiments, training the parameters inserted into the multimodal collaborative model based on a preset loss function value may include: constructing training samples based on historical data of various types and historical processing results; training the multimodal collaborative model based on the training samples; determining the loss function value based on the training results; and obtaining the pre-trained multimodal collaborative model when the loss function value tends to converge.
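The training procedure above (train only the inserted parameters and stop when the loss converges) can be sketched on a toy problem. Everything in this example is an assumption made for illustration: synthetic arrays stand in for historical data and historical processing results, mean squared error stands in for the preset loss function, plain gradient descent is the optimizer, and the learning rate and tolerance are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, rank = 64, 6, 4, 2

X = rng.normal(size=(n, d_in))                    # stand-in for historical input data
W = rng.normal(size=(d_out, d_in))                # frozen pre-trained weight
delta_true = 0.5 * np.outer(rng.normal(size=d_out), rng.normal(size=d_in))
Y = X @ (W + delta_true).T                        # stand-in for historical results

W_before = W.copy()
A = rng.normal(scale=0.1, size=(rank, d_in))      # trainable inserted parameter
B = np.zeros((d_out, rank))                       # trainable inserted parameter

def loss_fn(A, B):
    """Mean squared error of the adapted layer and the raw error matrix."""
    error = X @ (W + B @ A).T - Y
    return np.mean(error ** 2), error

learning_rate, tolerance, max_steps = 0.05, 1e-9, 5000
history = []
for step in range(max_steps):
    loss, error = loss_fn(A, B)
    # Stop once the loss has effectively stopped changing (convergence).
    if history and abs(history[-1] - loss) < tolerance:
        break
    history.append(loss)
    grad_delta = 2.0 * error.T @ X / error.size   # gradient w.r.t. the update B @ A
    grad_A, grad_B = B.T @ grad_delta, grad_delta @ A.T
    A -= learning_rate * grad_A                   # W itself is never updated
    B -= learning_rate * grad_B
```

Only `A` and `B` change during the loop; `W` stays equal to `W_before`, which mirrors the frozen backbone described in the preceding paragraphs.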
[0073] This disclosure obtains first data of at least one modality, encodes the first data of each modality separately to obtain second data corresponding to different modalities, fuses the second data corresponding to at least one modality based on a word vector alignment method to obtain fused data, and processes the fused data based on a pre-trained multimodal collaborative model to obtain a processing result. Because the intelligent model processes the first data of at least one modality simultaneously, the universality of the intelligent model is improved.
[0074] Figure 2 is a flowchart of a data processing method according to an exemplary embodiment of the present disclosure. Steps S210, S230, and S240 in Figure 2 correspond to steps S110, S130, and S140, respectively, and are not described again here. As shown in Figure 2, in addition to the implementation process shown in Figure 1, the method includes the following step: S220, map text data into high-dimensional vector data based on a word embedding method.
[0075] In some embodiments, text data can be mapped into high-dimensional vector data based on a text tokenizer and a word embedding method.
[0076] In some embodiments, a text tokenizer is a core tool in Natural Language Processing (NLP) for segmenting raw text into structured units (tokens). It is the first step in transforming unstructured text into a discrete sequence of symbols that the model can understand. Essentially, it segments a continuous stream of characters into meaningful "basic semantic units" (such as words, subwords, and characters) using predefined rules or algorithms, and assigns a unique identifier (token ID) to each unit, laying the foundation for subsequent model input, semantic understanding, and generation.
[0077] In some embodiments, after the text tokenizer segments the text data, a word embedding method can be used to map the resulting words into a high-dimensional vector space, preserving the semantic relationships between words and transforming the text data from natural language into numerical vectors. This mapping can use the distributed semantic representation of words to transform each word into a dense real-valued vector, so that words with similar semantics or grammar are located close together in the vector space. For example, after segmenting the text data, a vocabulary can be constructed, a corresponding ID can be assigned to each word, and then a word embedding algorithm can be selected to complete the mapping. It should be noted that the word embedding algorithm can include at least one of the following: the continuous bag-of-words (CBOW) model, the skip-gram model, the Global Vectors (GloVe) model, and the fastText model.
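The tokenize, build-vocabulary, assign-ID, and embed sequence can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace splitting stands in for a real tokenizer, the `<unk>` entry for out-of-vocabulary words is a common convention rather than something the text specifies, and a random table stands in for embeddings that would in practice be learned by an algorithm such as CBOW or skip-gram.

```python
import numpy as np

def tokenize(text: str) -> list:
    """Whitespace tokenizer; a placeholder for a real word or subword tokenizer."""
    return text.lower().split()

def build_vocab(corpus: list) -> dict:
    """Assign a unique ID to every token seen in the corpus; ID 0 is reserved."""
    vocab = {"<unk>": 0}
    for sentence in corpus:
        for token in tokenize(sentence):
            vocab.setdefault(token, len(vocab))
    return vocab

corpus = ["device temperature alarm", "device restart log"]
vocab = build_vocab(corpus)

# Random vectors are placeholders for trained word embeddings.
embedding_table = np.random.default_rng(0).normal(size=(len(vocab), 16))

def embed(text: str) -> np.ndarray:
    """Map text to a (num_tokens, 16) array of dense vectors."""
    ids = [vocab.get(token, 0) for token in tokenize(text)]
    return embedding_table[ids]

vectors = embed("Device alarm unknown")   # the last word is out of vocabulary
```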
[0078] In this embodiment of the disclosure, text data is mapped into high-dimensional vector data based on the word embedding method. While maximizing the preservation of information in the text data, the text data encoding is completed, and data preparation is completed before data fusion.
[0079] Figure 3 is a flowchart of a data processing method according to an exemplary embodiment of the present disclosure. Steps S310, S330, and S340 in Figure 3 correspond to steps S110, S130, and S140, respectively, and are not described again here. As shown in Figure 3, in addition to the implementation process shown in Figure 1, the method includes the following step: S320, extract spectral features from audio data based on an audio encoder.
[0080] In some embodiments, the audio encoder may include a speech encoder module. It should be noted that the speech encoder module is a core component of the speech processing system. Its core function is to convert the raw speech signal (analog or digital) into a structured feature representation (such as frame-level features and contextual features), providing input for subsequent tasks such as speech recognition, synthesis, and encoding. Through operations such as dimensionality reduction, feature extraction, and semantic enhancement, it transforms high-dimensional, redundant speech signals into low-dimensional, information-rich feature vectors, supporting the performance of downstream tasks.
[0081] In some embodiments, the speech encoder module may extract spectral features from audio data through dimensionality reduction and compression, feature extraction, and semantic enhancement. It should be noted that spectral features may include features displayed in vector form.
[0082] In some embodiments, the spectral features may include Mel spectral features. Mel spectral features are frequency domain representations of audio signals on the Mel scale. By mapping the linear spectrum to a non-linear scale that conforms to the characteristics of human hearing, the frequency-energy distribution features are extracted. Mel spectral features are core input features for tasks such as speech recognition, emotion analysis, and voiceprint recognition.
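The mapping from a linear spectrum to the Mel scale can be sketched with NumPy alone. The example below is a minimal illustration, not the disclosed encoder: the frame length, hop size, number of Mel bands, and the common conversion formula m = 2595·log10(1 + f/700) are assumptions, and the test signal is a synthetic 440 Hz tone.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels: int, n_fft: int, sample_rate: int) -> np.ndarray:
    """Triangular filters evenly spaced on the Mel scale, shape (n_mels, n_fft // 2 + 1)."""
    fft_freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2), n_mels + 2)
    hz_points = mel_to_hz(mel_points)
    bank = np.zeros((n_mels, len(fft_freqs)))
    for i in range(n_mels):
        left, center, right = hz_points[i:i + 3]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def mel_spectrogram(signal, sample_rate, n_fft=512, hop=256, n_mels=20):
    """Frame the signal, take the power spectrum, and pool it into Mel bands."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop:i * hop + n_fft] * window for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return power @ mel_filterbank(n_mels, n_fft, sample_rate).T

sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440.0 * t)          # one second of a 440 Hz tone
mel = mel_spectrogram(tone, sample_rate)      # shape (frames, Mel bands)
```

Each row of `mel` is a vector-form spectral feature for one frame, which is the kind of representation the later fusion steps operate on.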
[0083] For example, the speech encoder module may include a Transformer encoder, a self-supervised pre-trained encoder, and a multimodal fusion encoder.
[0084] In this embodiment of the disclosure, the spectral features in the audio data are extracted by an audio encoder, and the audio data is encoded while retaining the information in the audio data, thus completing the data preparation before data fusion.
[0085] Figure 4 is a flowchart of a data processing method according to an exemplary embodiment of the present disclosure. Steps S410, S430, and S440 in Figure 4 correspond to steps S110, S130, and S140, respectively, and are not described again here. As shown in Figure 4, in addition to the implementation process shown in Figure 1, the method includes the following step: S420, extract image features from image data based on a visual encoder.
[0086] In some embodiments, the visual encoder may include an image encoder module. It should be noted that the image encoder module is a core component of the image intelligent processing system. Its function is to convert the raw image signal (pixel matrix) into a structured feature representation (such as frame-level / region-level features, context-related features), providing input for subsequent tasks such as image recognition, detection, segmentation, and multimodal fusion. Through operations such as dimensionality reduction, feature extraction, and semantic enhancement, it transforms high-dimensional, redundant pixel data into low-dimensional feature vectors rich in spatial-semantic information, supporting the performance of downstream tasks.
[0087] In some embodiments, the image encoder module may extract image features from image data through dimensionality reduction and compression, feature extraction, semantic enhancement, and multimodal alignment. It should be noted that image features may include features displayed in vector form.
[0088] For example, the image encoder module may include convolutional layers, pooling layers, and handcrafted feature extraction.
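A convolution layer followed by a pooling layer can be sketched in a few lines. This minimal example is an illustration only: the fixed Sobel kernels are handcrafted filters chosen as an assumption (a trained encoder would learn its kernels), and the 8x8 input is a synthetic image with one vertical edge.

```python
import numpy as np

def conv2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Valid-mode 2D cross-correlation, as used in convolutional layers."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel)
    return out

def max_pool(fmap: np.ndarray, size: int = 2) -> np.ndarray:
    """Non-overlapping max pooling; trailing rows/columns that do not fit are dropped."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    trimmed = fmap[:h * size, :w * size]
    return trimmed.reshape(h, size, w, size).max(axis=(1, 3))

def encode_image(image: np.ndarray, kernels: list) -> np.ndarray:
    """Convolution, ReLU, pooling, then flatten into one feature vector."""
    maps = [max_pool(np.maximum(conv2d(image, k), 0.0)) for k in kernels]
    return np.concatenate([m.ravel() for m in maps])

sobel_x = np.array([[-1.0, 0.0, 1.0], [-2.0, 0.0, 2.0], [-1.0, 0.0, 1.0]])
sobel_y = sobel_x.T

image = np.zeros((8, 8))
image[:, 4:] = 1.0                        # vertical edge between columns 3 and 4

features = encode_image(image, [sobel_x, sobel_y])   # 2 kernels x 3x3 pooled map = 18 values
```

The horizontal-gradient kernel responds at the edge while the vertical-gradient kernel stays at zero, so the pixel matrix is reduced to a short vector that records where the edge lies.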
[0089] In this embodiment of the disclosure, image features are extracted from image data based on a visual encoder. This allows data preparation to be completed before data fusion while preserving the information in the image data.
[0090] Figure 5 is a flowchart of a data processing method according to an exemplary embodiment of the present disclosure. Steps S510 and S530 to S550 in Figure 5 correspond to steps S110 to S140, respectively, and are not described again here. As shown in Figure 5, in addition to the implementation process shown in Figure 1, the method includes the following step: S520, preprocess the acquired text data, audio data, and image data respectively to obtain preprocessed text data, audio data, and image data.
[0091] In some embodiments, preprocessing of text data, audio data, and image data may include at least one operation of denoising, stop word filtering, and lexical normalization for text data; at least one operation of data cleaning, time alignment and resampling, denoising, and normalization for sensor data; at least one operation of loading and format conversion, resampling, framing and windowing, endpoint detection, and denoising for audio data; and at least one operation of resizing, normalization, color space conversion, denoising, and enhancement for image data.
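A few of the listed preprocessing operations can be sketched per modality as below. This is a minimal illustration, not a complete pipeline: the stop-word list is hypothetical, the Hann window is one common choice for windowing, and scaling by 255 assumes 8-bit image data.

```python
import numpy as np

STOP_WORDS = {"the", "a", "is", "of"}   # hypothetical stop-word list

def preprocess_text(text: str) -> list:
    """Lexical normalization (lowercasing) and stop-word filtering."""
    return [token for token in text.lower().split() if token not in STOP_WORDS]

def frame_audio(signal: np.ndarray, frame_len: int, hop: int) -> np.ndarray:
    """Framing and windowing: overlapping frames, each multiplied by a Hann window."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    return np.stack([signal[i * hop:i * hop + frame_len] * window for i in range(n_frames)])

def normalize_image(image: np.ndarray) -> np.ndarray:
    """Normalization of 8-bit pixel values to the range [0, 1]."""
    return image.astype(np.float64) / 255.0

tokens = preprocess_text("The status of the Device is OK")
frames = frame_audio(np.ones(10), frame_len=4, hop=2)
pixels = normalize_image(np.array([[0, 255]], dtype=np.uint8))
```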
[0092] In some embodiments, different methods can be used to preprocess the different types of data separately, or an intelligent model can be used to perform a unified preprocessing operation on the first data of at least one modality. This disclosure does not limit the preprocessing methods or specific operations.
[0093] In some embodiments, preprocessing of text data can enhance the semantic structure of the text, preprocessing of sensor data can enhance the temporal continuity of the sensor data, and preprocessing of audio data can facilitate acoustic feature extraction. Preprocessing of image data can focus on optimizing spatial information.
[0094] In this embodiment of the disclosure, by performing preprocessing operations on text data, audio data, and image data respectively, the data can be better encoded and fused, thereby improving the efficiency of data processing.
[0095] Figure 6 is a flowchart of a data processing method according to an exemplary embodiment of the present disclosure. In Figure 6, steps S610, S620, and S660 correspond to steps S110, S120, and S140 and will not be repeated here. As shown in Figure 6, in addition to the implementation process shown in Figure 1, the following steps are also included: S630, determine the first correspondence between spectral features and vector features.
[0096] In some embodiments, a first correspondence between spectral features and vector features can be determined based on a word vector alignment and adaptation method. The word vector alignment and adaptation method may include techniques for mapping word vectors from different sources, different spaces, or different languages to a unified vector space. This method enables words with similar semantics or functions to be close in distance within the aligned space. It should be noted that the word vector alignment and adaptation method may include alignment methods based on linear transformations, alignment methods based on nonlinear transformations, alignment methods based on adversarial learning, alignment methods based on optimal transmission, and alignment methods based on pre-trained models. This disclosure does not limit the specific word vector alignment and adaptation method used.
[0097] In some embodiments, determining a first correspondence between spectral features and vector features includes determining a first spatial correspondence between the spectral features and the vector features. For example, if the vector feature is a feature data segment 84 bytes long, then after the correspondence is determined, the spectral feature may correspond to bytes 20 to 24.
[0098] S640, embed the spectral features into a first container that matches the vector data.
[0099] In some embodiments, the first container matching the vector data can be a container with the same dimension as the vector data. For example, 80-dimensional log-Mel filter bank features can be processed by 2 convolutional layers and 12 Conformer blocks, with 1024-dimensional attention and 1536-dimensional feedforward layers, to obtain a first container matching the 4096-dimensional vector data.
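To make the idea of a dimension-matched container concrete, the sketch below maps 80-dimensional feature frames to 4096-dimensional vectors. It deliberately substitutes a single random linear projection for the convolutional layers and Conformer blocks described above, so it only illustrates the change of dimension and not the actual encoder; the function names, the random weights, and the seed are assumptions.

```python
import random


def make_projection(in_dim, out_dim, seed=0):
    """Build a random in_dim x out_dim matrix; a trained model would learn these weights."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def embed_into_container(features, projection):
    """Project each feature frame (length in_dim) to the vector-data dimension (out_dim)."""
    out_dim = len(projection[0])
    return [
        [sum(frame[i] * projection[i][j] for i in range(len(frame)))
         for j in range(out_dim)]
        for frame in features
    ]


# Three hypothetical 80-dimensional log-Mel frames.
spectral_features = [[1.0] * 80 for _ in range(3)]
projection = make_projection(in_dim=80, out_dim=4096)
first_container = embed_into_container(spectral_features, projection)
```

Each of the three frames becomes one 4096-dimensional vector, so the container can sit alongside 4096-dimensional text vectors without any further reshaping.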
[0100] S650, based on the first correspondence, insert the first container into the vector data to obtain the first fused data.
[0101] In some embodiments, a placeholder corresponding to the first container can be set in the vector data, and then the first container can be inserted into the placeholder corresponding to the first container to obtain the first fused data.
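The placeholder mechanism can be sketched as follows, assuming the vector data is held as a sequence of embedding vectors in which a marker entry reserves the position given by the first correspondence. The marker string, the function name, and the 4-dimensional toy vectors (standing in for 4096 dimensions) are illustrative assumptions.

```python
AUDIO_PLACEHOLDER = "<audio>"  # hypothetical marker reserving the aligned position


def insert_container(token_vectors, placeholder, container):
    """Replace the placeholder entry with the vectors of the container.

    token_vectors: list whose items are embedding vectors (lists of floats)
                   or the placeholder marker.
    container:     list of vectors with the same dimension as the embeddings.
    """
    dim = next(len(v) for v in token_vectors if v != placeholder)
    if any(len(v) != dim for v in container):
        raise ValueError("container dimension must match the vector data")
    fused = []
    for item in token_vectors:
        if item == placeholder:
            fused.extend(container)  # splice the container in at the reserved position
        else:
            fused.append(item)
    return fused


text_vectors = [[0.1] * 4, AUDIO_PLACEHOLDER, [0.3] * 4]
audio_container = [[0.5] * 4, [0.6] * 4]
first_fused = insert_container(text_vectors, AUDIO_PLACEHOLDER, audio_container)
```

Because the container is spliced in where the placeholder stood, the audio vectors keep their position relative to the surrounding text vectors.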
[0102] In this embodiment of the disclosure, a first correspondence between spectral features and vector features is determined, the spectral features are embedded into a first container that matches the vector data, and the first container is inserted into the vector data based on the first correspondence to obtain the first fused data. This enables the data of different modalities in the first fused data to correspond, avoiding the loss of correlation information between different modal data.
[0103] Figure 7 is a flowchart of a data processing method according to an exemplary embodiment of the present disclosure. In Figure 7, steps S710, S720, and S760 correspond to steps S110, S120, and S140 and will not be repeated here. As shown in Figure 7, in addition to the implementation process shown in Figure 1, the following steps are also included: S730, determine the second correspondence between image features and vector features.
[0104] In some embodiments, a second correspondence between image features and vector features can be determined based on a word vector alignment and adaptation method.
[0105] In some embodiments, determining a second correspondence between image features and vector features includes determining a second spatial correspondence between the image features and the vector features. For example, if the vector features are feature data 84 bytes long, then after the correspondence is determined, the image features may correspond to bytes 30 to 36.
[0106] S740, embed the image features into a second container that matches the vector data.
[0107] S750, based on the second correspondence, insert the second container into the vector data to obtain the second fused data.
[0108] In some embodiments, the second container matching the vector data can be a container with the same dimension as the vector data. For example, the image features can be obtained based on a multimodal vision-language model trained on large-scale image-text pairs and then embedded into the second container.
[0109] In some embodiments, the second container may be a container that matches the 4096-dimensional vector data.
[0110] In some embodiments, the method of inserting the second container into the vector data based on the second correspondence to obtain the second fused data is the same as the method of obtaining the first fused data, and will not be described again here.
[0111] In this embodiment of the disclosure, a second correspondence between image features and vector features is determined, the image features are embedded into a second container that matches the vector data, and the second container is inserted into the vector data based on the second correspondence to obtain the second fused data. This enables the data of different modalities in the second fused data to correspond, avoiding the loss of correlation information between data of different modalities.
[0112] Figure 8 is a flowchart of a data processing method according to an exemplary embodiment of the present disclosure. In Figure 8, steps S810, S820, and S870 correspond to steps S110, S120, and S140 and will not be repeated here. As shown in Figure 8, in addition to the implementation process shown in Figure 1, the following steps are also included: S830, determine the first correspondence between spectral features and vector features, and the second correspondence between image features and vector features; S840, embed the spectral features into a first container that matches the vector data; S850, embed the image features into a second container that matches the vector data; S860, based on the first correspondence and the second correspondence, insert the first container and the second container into the vector data to obtain the third fused data.
[0113] In some embodiments, inserting the first container and the second container into the vector data may include inserting the first container and the second container into the aforementioned multimodal placeholders.
[0114] In some embodiments, since the dimensions of both the first container and the second container match the vector data, a third fused data with consistent dimensions can be obtained after inserting the first container and the second container.
[0115] In this embodiment of the disclosure, a first correspondence and a second correspondence between spectral features, image features and vector features are determined. Then, the first container and the second container corresponding to the spectral features and image features are inserted into the vector data based on the correspondence to obtain the third fused data. This makes the data of different modalities in the third fused data correspond, thus avoiding the loss of correlation information between data of different modalities.
[0116] Figure 9 is a flowchart of a data processing method according to an exemplary embodiment of the present disclosure. In Figure 9, steps S910 to S940 and S960 correspond to steps S710 to S760 and will not be repeated here. As shown in Figure 9, in addition to the implementation process shown in Figure 7, the following steps are also included: S950, determine the similarity data between multiple second containers.
[0117] In some embodiments, similarity data between multiple second containers can be determined based on methods such as Euclidean distance, cosine similarity, and intelligent models. Euclidean distance is a distance-based similarity determination method. Its core logic is: the smaller the Euclidean distance between two objects in the feature space, the higher their similarity. It quantifies the "straight-line distance" between two points in multidimensional space, intuitively reflecting the proximity of objects, and is one of the most classic similarity calculation tools. Cosine similarity is an angle / direction-based similarity determination method. Its core logic is to measure the directional consistency of two vectors using the cosine value of the angle between them—the smaller the angle, the larger the cosine value, and the higher the similarity. It is a classic tool for calculating the similarity of high-dimensional data such as text, word vectors, and user behavior, and is especially suitable for scenarios where vector length is ignored and semantic / trend direction is the focus.
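The two measures described above can be written directly from their standard definitions. The sketch below is plain Python with illustrative function names; returning 0.0 for a zero-length vector is a convention chosen here, not something this disclosure specifies.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two points; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; larger means more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention for zero vectors (assumption)
    return dot / (norm_a * norm_b)


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the length

distance = euclidean_distance(a, b)
similarity = cosine_similarity(a, b)
```

Here `b` points in the same direction as `a` but is twice as long, so the cosine similarity is 1 (up to rounding) while the Euclidean distance is non-zero. This is the sense in which cosine similarity ignores vector length.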
[0118] S960, based on similarity data, cluster and compress multiple second containers to obtain compressed second containers.
[0119] In some embodiments, cluster compression may include merging second containers whose similarity data falls within a preset range into the same category.
[0120] In some embodiments, after merging multiple second containers into a second container of the same category, the second containers within the same category can be compressed. For example, compressing second containers within the same category may include deleting second containers within the same category.
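One possible reading of cluster compression is sketched below: containers whose cosine similarity to the first member of a category reaches a threshold are merged into that category, and each category is then reduced to a single averaged vector. The greedy grouping strategy, the threshold value, and the use of averaging (rather than simply deleting members) are assumptions made for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 for zero vectors, by convention)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_and_compress(containers, threshold=0.95):
    """Greedily group similar containers, then keep one averaged vector per group."""
    clusters = []  # each cluster is a list of member vectors
    for vec in containers:
        for members in clusters:
            # Compare against the first member of each existing category.
            if cosine_similarity(vec, members[0]) >= threshold:
                members.append(vec)
                break
        else:
            clusters.append([vec])  # no similar category found: start a new one
    # Compress: replace each category by the element-wise mean of its members.
    return [
        [sum(column) / len(members) for column in zip(*members)]
        for members in clusters
    ]


second_containers = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
compressed = cluster_and_compress(second_containers)
```

In this toy input the first two containers are nearly parallel and collapse into one vector, so three containers are reduced to two.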
[0121] In this embodiment of the disclosure, by clustering and compressing the second container based on similarity data, the number of second containers to be processed can be reduced while retaining the information contained in the second container, thereby improving the efficiency of processing the second container.
[0122] In this embodiment of the disclosure, similarity data among multiple second containers is determined, and the multiple second containers are clustered and compressed based on the similarity data to obtain compressed second containers. Based on the correspondence, the first container and the compressed second containers are inserted into the vector data to obtain fused data, which reduces the number of second containers to be processed and improves processing efficiency.
[0123] To illustrate this disclosure in detail, specific examples of this disclosure are provided below.
[0124] In this example, taking an automotive battery production line as the scenario, the first data of at least one modality can include image data, text data, sensor data, and audio data. Image data can be the color detected by an infrared camera at the battery terminal welding points. Text data can include data related to welding temperature from the service manual. Audio data can include spoken data from the engineering department. When the above data is input into a pre-trained multimodal collaborative model, the model can fuse the various data types to obtain fused data. The fused data is then processed to obtain the processing result. It should be noted that the processing result can include battery equipment health scores, maintenance node reports, and updated knowledge graph maintenance manuals.
[0125] The following are embodiments of the apparatus disclosed herein, which can be used to execute embodiments of the method disclosed herein. For details not disclosed in the apparatus embodiments of this disclosure, please refer to the embodiments of the method disclosed herein.
[0126] Figure 10 is a block diagram of a data processing apparatus according to an exemplary embodiment of the present disclosure.
[0127] Referring to Figure 10, the device 1000 includes: an acquisition module 1010, an encoding module 1020, a fusion module 1030, and a processing module 1040.
[0128] Acquisition module 1010 is used to acquire first data of at least one modality; Encoding module 1020 is used to encode first data of at least one modality to obtain second data corresponding to different modalities; The fusion module 1030 is used to fuse the second data corresponding to multiple modalities to obtain fused data; The processing module 1040 is used to process the fused data based on a pre-trained multimodal collaborative model to obtain the processing results.
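The division of labor among the four modules can be sketched as a small Python class that chains them in order. The class name, the callables passed in, and the toy encoders are hypothetical stand-ins; in particular, the `model` callable below merely sums its input and is not a pre-trained multimodal collaborative model.

```python
class DataProcessingDevice:
    """Chains acquisition, encoding, fusion, and processing (modules 1010 to 1040)."""

    def __init__(self, acquire, encoders, fuse, model):
        self.acquire = acquire    # acquisition module 1010
        self.encoders = encoders  # encoding module 1020: one encoder per modality
        self.fuse = fuse          # fusion module 1030
        self.model = model        # processing module 1040 (stand-in for the model)

    def run(self):
        first_data = self.acquire()  # first data, keyed by modality
        second_data = {
            modality: self.encoders[modality](data)
            for modality, data in first_data.items()
        }
        fused = self.fuse(second_data)
        return self.model(fused)


# Toy stand-ins for each module (all hypothetical).
device = DataProcessingDevice(
    acquire=lambda: {"text": "weld ok", "audio": [0.1, 0.2]},
    encoders={
        "text": lambda t: [float(len(word)) for word in t.split()],
        "audio": lambda samples: [2 * s for s in samples],
    },
    fuse=lambda parts: parts["text"] + parts["audio"],
    model=lambda fused: sum(fused),
)
result = device.run()
```

Passing each module in as a callable keeps the modules independently replaceable, which matches the way the apparatus embodiments vary the encoding and fusion modules.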
[0129] In one embodiment of this disclosure, the first data of at least one modality includes at least one of text data, audio data, and image data; the encoding module 1020 includes: The mapping unit is used to map text data into high-dimensional vector data based on word embedding methods; The first extraction unit is used to extract spectral features from audio data based on the audio encoder; The second extraction unit is used to extract image features from image data based on the visual encoder.
[0130] In one embodiment of this disclosure, the fusion module 1030 includes: The first determining unit is used to determine the first correspondence between the spectral features and the vector features; The first embedding unit is used to embed spectral features into a first container that matches the vector data; The second embedding unit is used to insert the first container into the vector data based on the first correspondence to obtain the first fused data.
[0131] In one embodiment of this disclosure, the fusion module 1030 includes: The second determining unit is used to determine the second correspondence between image features and vector features based on the word vector alignment and adaptation method; The third embedding unit is used to embed image features into a second container that matches the vector data; The fourth embedding unit is used to insert the second container into the vector data based on the second correspondence to obtain the second fused data.
[0132] In one embodiment of this disclosure, the fusion module 1030 includes: The third determining unit is used to determine the first correspondence between spectral features and vector features, and the second correspondence between image features and vector features; The first embedding module is used to embed spectral features into a first container that matches the vector data. The third embedding unit is used to embed image features into a second container that matches the vector data; The fifth embedding unit is used to insert the first container and the second container into the vector data based on the first correspondence and the second correspondence to obtain the third fused data.
[0133] In one embodiment of this disclosure, the apparatus further includes: The determination module is used to determine similarity data between multiple second containers; The compression module is used to cluster and compress multiple second containers based on similarity data to obtain compressed second containers.
[0134] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated upon here.
[0135] Figure 11 is a block diagram illustrating an apparatus for the data processing method described above, according to some embodiments of this disclosure. For example, apparatus 1100 may be a mobile phone, computer, digital broadcasting terminal, messaging device, game console, tablet device, medical device, fitness equipment, personal digital assistant, etc.
[0136] Referring to Figure 11, the device 1100 may include one or more of the following components: a processing component 1102, a memory 1104, a power component 1106, a multimedia component 1108, an audio component 1110, an input / output (I / O) interface 1112, a sensor component 1114, and a communication component 1116.
[0137] Processing component 1102 typically controls the overall operation of device 1100, such as operations associated with display, telephone calls, data communication, camera operation, and recording operations. Processing component 1102 may include one or more processors 1120 to execute instructions to perform all or part of the steps of the methods described above. Furthermore, processing component 1102 may include one or more modules to facilitate interaction between processing component 1102 and other components. For example, processing component 1102 may include a multimedia module to facilitate interaction between multimedia component 1108 and processing component 1102.
[0138] Memory 1104 is configured to store various types of data to support the operation of device 1100. Examples of this data include instructions for any application or method operating on device 1100, contact data, phonebook data, messages, pictures, videos, etc. Memory 1104 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic storage, flash memory, magnetic disk, or optical disk.
[0139] The power supply component 1106 provides power to the various components of the device 1100. The power supply component 1106 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power to the device 1100.
[0140] Multimedia component 1108 includes a screen that provides an output interface between device 1100 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touchscreen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundaries of touch or swipe actions but also the duration and pressure associated with the touch or swipe operation. In some embodiments, multimedia component 1108 includes a front-facing camera and / or a rear-facing camera. When device 1100 is in an operating mode, such as a shooting mode or a video mode, the front-facing camera and / or rear-facing camera may receive external multimedia data. Each front-facing camera and rear-facing camera may be a fixed optical lens system or have focal length and optical zoom capabilities.
[0141] Audio component 1110 is configured to output and / or input audio signals. For example, audio component 1110 includes a microphone (MIC) configured to receive external audio signals when device 1100 is in an operating mode, such as call mode, recording mode, and voice data processing mode. The received audio signals may be further stored in memory 1104 or transmitted via communication component 1116. In some embodiments, audio component 1110 also includes a speaker for outputting audio signals.
[0142] I / O interface 1112 provides an interface between processing component 1102 and peripheral interface modules, such as keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to, home buttons, volume buttons, power buttons, and lock buttons.
[0143] Sensor assembly 1114 includes one or more sensors for providing state assessments of various aspects of device 1100. For example, sensor assembly 1114 may detect the on / off state of device 1100, the relative positioning of components such as the display and keypad of device 1100, changes in the position of device 1100 or a component of device 1100, the presence or absence of user contact with device 1100, the orientation or acceleration / deceleration of device 1100, and temperature changes of device 1100. Sensor assembly 1114 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. Sensor assembly 1114 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, sensor assembly 1114 may also include an accelerometer, a gyroscope, a magnetometer, a pressure sensor, or a temperature sensor.
[0144] Communication component 1116 is configured to facilitate wired or wireless communication between device 1100 and other devices. Device 1100 can access wireless networks based on communication standards, such as WiFi, 3G, 4G, 5G, other communication standards, or combinations thereof. In some embodiments of this disclosure, communication component 1116 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In some embodiments of this disclosure, communication component 1116 also includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
[0145] In some embodiments of this disclosure, the apparatus 1100 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components to perform the methods described above.
[0146] In some embodiments of this disclosure, a computer-readable storage medium including instructions is also provided, such as a memory 1104 including instructions, which can be executed by a processor 1120 of device 1100 to perform the above-described method. For example, the computer-readable storage medium may be a ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, and optical data storage device, etc.
[0147] A computer-readable storage medium that, when instructions in the storage medium are executed by a terminal's processor, enables the terminal to perform an operation instruction execution method or a method for operation instruction execution.
[0148] Other embodiments of this disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this disclosure are indicated by the following claims.
[0149] It should be understood that this disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this disclosure is limited only by the appended claims.
Claims
1. A data processing method, characterized in that it includes: acquiring first data of at least one modality; encoding the first data of each of the at least one modality to obtain second data corresponding to different modalities; fusing the second data corresponding to multiple modalities to obtain fused data; and processing the fused data based on a pre-trained multimodal collaborative model to obtain a processing result.
2. The method according to claim 1, characterized in that, The first data of the at least one modality includes at least one of text data, audio data, and image data; The encoding of first data for at least one modality to obtain second data corresponding to different modalities includes at least one of the following: The text data is mapped into high-dimensional vector data based on the word embedding method; Spectral features are extracted from the audio data based on the audio encoder; Image features are extracted from the image data based on a visual encoder.
3. The method according to claim 2, characterized in that, The process of fusing the second data corresponding to the multiple modalities to obtain fused data includes: Determine the first correspondence between the spectral features and the vector features; The spectral features are embedded into a first container that matches the vector data; Based on the first correspondence, the first container is inserted into the vector data to obtain the first fused data.
4. The method according to claim 2, characterized in that, The process of fusing the second data corresponding to the multiple modalities to obtain fused data includes: Determine a second correspondence between the image features and the vector features; The image features are embedded into a second container that matches the vector data; The second container is inserted into the vector data based on the second correspondence to obtain the second fused data.
5. The method according to claim 2, characterized in that, The process of fusing the second data corresponding to the multiple modalities to obtain fused data includes: Determine a first correspondence between the spectral features and the vector features, and a second correspondence between the image features and the vector features; The spectral features are embedded into a first container that matches the vector data; The image features are embedded into a second container that matches the vector data; Based on the first correspondence and the second correspondence, the first container and the second container are inserted into the vector data to obtain the third fused data.
6. The method according to claim 4 or 5, characterized in that, The method further includes: Determine the similarity data between multiple second containers; Based on the similarity data, the multiple second containers are clustered and compressed to obtain the compressed second containers.
7. A data processing apparatus, characterized in that it includes: an acquisition module used to acquire first data of at least one modality; an encoding module used to encode the first data of the at least one modality respectively to obtain second data corresponding to different modalities; a fusion module used to fuse the second data corresponding to multiple modalities to obtain fused data; and a processing module used to process the fused data based on a pre-trained multimodal collaborative model to obtain a processing result.
8. The apparatus according to claim 7, characterized in that, The encoding module includes: The mapping unit is used to map the text data into high-dimensional vector data based on the word embedding method; The first extraction unit is used to extract spectral features from the audio data based on the audio encoder; The second extraction unit is used to extract image features from the image data based on the visual encoder.
9. The apparatus according to claim 7, characterized in that, The fusion module includes: A determining unit is configured to determine a first correspondence between the spectral features and the vector features based on the word vector alignment and adaptation method; A first embedding unit is used to embed the spectral features into a first container that matches the vector data; The first insertion unit is used to insert the first container into the vector data based on the first correspondence to obtain the first fused data.
10. An electronic device, characterized in that it includes: a processor; and a memory used to store processor-executable instructions; wherein the processor is configured to implement the data processing method according to any one of claims 1 to 6.
11. A non-transitory computer-readable storage medium, wherein instructions in the storage medium, when executed by a processor of a terminal, enable the terminal to perform the steps of a data processing method according to any one of claims 1 to 6.
12. A computer program product, said computer program product comprising a computer program or computer instructions, characterized in that, The computer program or the computer instructions are loaded and executed by the processor to enable the computer to implement the steps of the data processing method as described in any one of claims 1-6.