Model construction and data processing method, device, product and storage medium

By introducing a combination of pruned decoders and ordinary decoders into the multi-layer decoder of a multimodal language model, redundant visual tokens are pruned, solving the problem of slow inference speed in large multimodal language models and achieving a balance between computational load, inference latency, and accuracy.

CN122244629A · Pending · Publication Date: 2026-06-19 · CLOUD INTELLIGENCE ASSETS HOLDING (SINGAPORE) PTE LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CLOUD INTELLIGENCE ASSETS HOLDING (SINGAPORE) PTE LTD
Filing Date
2024-12-18
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing multimodal large language models suffer from slow inference speeds and redundant visual tokens when processing image and text tasks due to the excessive amount of visual data representations.

Method used

A pruning decoder is introduced into the multi-layer decoder. By pruning some visual tokens and combining them with the ordinary decoder, the pruning operation is performed in a step-by-step manner, ensuring that important visual information is preserved and improving inference efficiency.

Benefits of technology

This approach achieves the reduction of computational load and inference latency while maintaining model accuracy, thereby improving the overall inference efficiency of the model.

Abstract

This specification provides a model construction and data processing method, device, product, and storage medium. The multimodal language model includes a multi-layer decoder. The decoder is used to decode an input data representation sequence to obtain a new data representation sequence and transmit it to the next layer decoder. The method includes: acquiring a data representation sequence; wherein the data representation sequence includes an initial visual data representation obtained by encoding an input image; inputting the data representation sequence to the multi-layer decoder; wherein the multi-layer decoder includes a pruning decoder that performs a pruning operation, the pruning operation including: pruning the input visual data representation in the input data representation sequence so that the pruning decoder decodes based on the pruned data representation sequence to obtain a new data representation sequence; in the multi-layer decoder, at least one ordinary decoder that does not perform the pruning operation is spaced apart between the pruning decoders.

Description

Technical Field

[0001] This specification relates to the field of large model technology, and in particular to model construction and data processing methods, devices, products, and storage media.

Background Technology

[0002] Existing multimodal large language models often enhance their image-to-text understanding capabilities by fusing image information. For example, in question-answering tasks, the model encodes the input text and image into a sequence of data representations, each containing a latent representation of the data. However, current large models often use a simple visual encoder to encode the input image into a series of visual data representations, the number of which typically exceeds the number of text data representations required in question-answering tasks. Furthermore, the model's inference time is positively correlated with the total number of data representations. Therefore, it is necessary to improve the model's inference efficiency.

Summary of the Invention

[0003] To overcome the problems existing in related technologies, this specification provides model building and data processing methods, equipment, products and storage media.

[0004] According to a first aspect of the embodiments of this specification, a data processing method is provided, applied to a multimodal language model, the multimodal language model including a multi-layer decoder, the decoder being used to decode an input data representation sequence to obtain a new data representation sequence, and transmit it to the next layer decoder; the method includes:

[0005] Obtain a data representation sequence; wherein the data representation sequence includes an initial visual data representation obtained by encoding the input image;

[0006] The acquired data representation sequence is input into the multi-layer decoder;

[0007] The multi-layer decoder includes a pruning decoder that performs pruning operations;

[0008] The pruning operation includes: pruning the input visual data representation in the input data representation sequence so that the pruning decoder decodes based on the pruned data representation sequence to obtain a new data representation sequence;

[0009] In the multi-layer decoder, there is at least one ordinary decoder that does not perform the pruning operation between the pruning decoders.

[0010] According to a second aspect of the embodiments of this specification, a method for constructing a multimodal language model is provided, the method comprising:

[0011] Obtain an initial multimodal language model, wherein the initial multimodal language model contains multiple layers of decoders;

[0012] A pruning decoder is determined in the multi-layer decoder, and a pruning module for performing pruning operations is connected between the pruning decoder and the previous layer decoder to obtain a constructed multimodal language model; the constructed multimodal language model is used to perform the steps of the method described in the first aspect.

[0013] According to a third aspect of the embodiments of this specification, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the steps of the method embodiments described in the first or second aspects above.

[0014] According to a fourth aspect of the embodiments of this specification, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the steps of the method embodiments described in the first or second aspect above.

[0015] According to a fifth aspect of the embodiments of this specification, a computer program product is provided, including a computer program that, when executed by a processor, implements the steps of the method embodiments described in the first or second aspect above.

[0016] The technical solutions provided in the embodiments of this specification may include the following beneficial effects:

[0017] In this embodiment, the multimodal language model includes a multi-layer decoder. Some of the decoders are designed to perform a pruning operation that removes some of the visual data representations from the data representation sequence. The model therefore performs inference on a shorter data representation sequence, which improves inference efficiency. Furthermore, to avoid losing inference accuracy, at least one ordinary decoder that does not perform the pruning operation is placed between pruning decoders. In other words, pruning is performed at intervals across the decoder layers, only within specific decoders. Because only some decoders prune, the additional overhead of pruning does not erode the model's inference efficiency. The arrangement also ensures that the visual information in the data representation sequence can still be captured by the decoders, preventing the loss of visual information, and the resulting drop in inference accuracy, that excessive pruning would cause. Therefore, this embodiment achieves an optimal trade-off between computational load, inference latency, and accuracy.

[0018] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.

Description of the Drawings

[0019] Figure 1 is a flowchart illustrating a data processing method according to an exemplary embodiment of this specification.

[0020] Figure 2A is a schematic diagram illustrating a data processing method according to an exemplary embodiment of this specification.

[0021] Figure 2B is a schematic diagram of pruning according to an exemplary embodiment of this specification.

[0022] Figure 3 is a flowchart illustrating a method for constructing a multimodal language model according to an exemplary embodiment of this specification.

[0023] Figure 4 is a hardware structure diagram of a computer device containing a data processing apparatus / multimodal language model construction apparatus, according to an exemplary embodiment of this specification.

[0024] Figure 5 is a block diagram illustrating a data processing apparatus according to an exemplary embodiment of this specification.

[0025] Figure 6 is a block diagram illustrating an apparatus for constructing a multimodal language model according to an exemplary embodiment.

Detailed Description

[0026] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. When the following description relates to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this specification. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this specification as detailed in the appended claims.

[0027] The terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to be limiting of this specification. The singular forms “a,” “an,” and “the” as used in this specification and the appended claims are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the term “and / or” as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.

[0028] It should be understood that although the terms first, second, third, etc., may be used in this specification to describe various information, this information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this specification, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein may be interpreted as "when," "upon," or "in response to determining."

[0029] The user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for analysis, data stored, data displayed, etc.) involved in this disclosure are all information and data authorized by the user or fully authorized by all parties. Furthermore, the collection, use and processing of the relevant data shall comply with the relevant laws, regulations and standards of the relevant countries and regions, and corresponding operation entry points shall be provided for users to choose to authorize or refuse.

[0030] Currently, multimodal large language models can simultaneously process and understand text and image information. Their architecture typically includes the following modules:

[0031] 1. Vision Encoder: Converts an input image into a series of visual data representations.

[0032] 2. Large Language Model (LLM): Processes text and visual data representations to generate corresponding text output.

[0033] 3. Fusion Mechanism: This mechanism combines visual and textual information to achieve multimodal understanding and generation.

[0034] In a multimodal large language model, the large language model component contains multiple layers of decoders. These decoders are responsible for generating the final text response data representation based on the input's system data representation, visual data representation, and instruction data representation.

[0035] As an example, in the field of machine learning, data representation refers to transforming data into a form suitable for machine learning models to process, such as vectors. In the multimodal language model of this embodiment, data representation can be a latent representation token obtained by an encoder after encoding text or images. Taking tokens as an example, system tokens can provide system-level instructions or contextual information, setting the tone for dialogue or tasks. Visual tokens are latent representations of images generated by a visual encoder, containing various feature information of the image. Instruction tokens are generated based on user input (such as instructions or questions) and are used to guide the model to generate corresponding answers. System tokens and instruction tokens are usually text tokens. After these tokens are embedded, they are combined sequentially to form a token sequence, which is then passed to a multi-layer decoder.

[0036] Each layer of the decoder can contain the following network:

[0037] Multi-head self-attention mechanism: It can handle the relationships between input tokens and capture contextual information.

[0038] Cross-modal attention mechanism: It fuses visual tokens and text tokens, allowing the text generation part to focus on relevant information in the image.

[0039] Feedforward neural networks: can perform nonlinear transformations, enhancing the expressive power of the model.

[0040] Each layer of the decoder processes the input token sequence to gradually generate the next most likely text token until a complete answer is generated.

[0041] Research revealed that visual tokens constitute a significant portion of the input sequence in the decoding layer, and the model's inference speed is positively correlated with the total number of tokens. Furthermore, the study showed that most of these visual tokens are redundant and contribute little to text inference. Therefore, a large number of redundant visual tokens increases the model's inference time. Based on this, visual token pruning can be employed to reduce the number of visual tokens in the decoding layer's input sequence, thereby reducing the model's inference time.

[0042] Research suggests that one approach is to remove all visual tokens from the token sequence in the last 50% of the multi-layer decoders of the model. However, this approach destroys deep visual information, resulting in poor performance on some tasks.

[0043] Based on this, the embodiments of this specification provide a token processing method based on a multimodal language model. This method not only protects important visual information but also removes redundant visual tokens as much as possible, achieving an optimal trade-off between computational load, inference latency, and accuracy. The embodiments of this specification will now be described in detail.

[0044] Figure 1 is a flowchart illustrating a data processing method according to an exemplary embodiment of this specification, applied to a multimodal language model. The multimodal language model includes a multi-layer decoder, in which each decoder decodes an input data representation sequence to obtain a new data representation sequence and transmits it to the next-layer decoder. As shown in Figure 1, the method may include the following steps:

[0045] In step 102, obtain the data representation sequence.

[0046] The data representation sequence includes an initial visual data representation obtained by encoding the input image.

[0047] In step 104, the acquired data representation sequence is input into the multilayer decoder.

[0048] The multi-layer decoder includes a pruning decoder that performs pruning operations;

[0049] The pruning operation includes: pruning the input visual data representation in the input data representation sequence so that the pruning decoder decodes based on the pruned data representation sequence to obtain a new data representation sequence.

[0050] In the multi-layer decoder, there is at least one ordinary decoder that does not perform the pruning operation between the pruning decoders.

[0051] As an example, the data processing method of this embodiment can be run on a computer device, which includes, but is not limited to, a server, a cloud server, a server cluster, a tablet computer, a personal digital assistant, a laptop computer, a desktop computer, or a mobile device, etc.

[0052] As an example, the multimodal language model in this embodiment can be used to understand and process images and text. For instance, a user's question text and input image can be acquired as multimodal input data. Different encoders in the multimodal language model can be used to encode the question text and input image respectively.

[0053] As an example, data representation can take the form of tokens, etc. Taking tokens as an example, a token sequence can contain multiple system tokens, multiple visual tokens, and multiple instruction tokens. For example, a text encoder can generate system tokens and instruction tokens from a user's question text, and a visual encoder can generate visual tokens from an input image. The model can include a concatenation network to concatenate system tokens, instruction tokens, and visual tokens into a token sequence in a certain order or other manner, which is the data representation sequence in step 102. The visual tokens contained therein are referred to as the initial visual tokens in this embodiment.
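The assembly of the token sequence in step 102 can be sketched as follows. This is a minimal illustration rather than the specification's implementation: the `Token` type, the `build_sequence` helper, and the system / visual / instruction ordering are assumptions made for the example, and strings stand in for embedding vectors.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Token:
    kind: str   # "system", "visual", or "instruction"
    value: str  # stand-in for an embedding vector


def build_sequence(system: List[str], visual: List[str],
                   instruction: List[str]) -> List[Token]:
    """Concatenate the three token groups into one sequence (assumed order)."""
    return (
        [Token("system", v) for v in system]
        + [Token("visual", v) for v in visual]
        + [Token("instruction", v) for v in instruction]
    )


sequence = build_sequence(["s0"], ["v0", "v1", "v2"], ["q0", "q1"])
# The visual tokens present at this point are the "initial visual tokens".
initial_visual = [t for t in sequence if t.kind == "visual"]
```

Tagging each token with its kind makes it straightforward for a later pruning step to touch only the visual tokens.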

[0054] As an example, in this embodiment, the multimodal language model includes multiple layers of decoders. This embodiment does not limit the number of decoders included in the multimodal language model, but can determine it according to the actual structure of the multimodal language model used.

[0055] As an example, the multimodal language model in this embodiment can be an open-source multimodal language model or a self-built model.

[0056] In this embodiment, the decoder in the multimodal language model can decode the input token sequence to obtain a new token sequence, and then transmit it to the decoder of the next layer; wherein, the number of tokens in the input token sequence and the new token sequence is the same; that is, the number of visual tokens and non-visual tokens (system tokens and instruction tokens) in the input token sequence is the same as the number of visual tokens and non-visual tokens in the new token sequence.

[0057] To improve the inference efficiency of the model, a pruning operation is designed to be performed on some decoders. In this embodiment, the decoder that performs the pruning operation is called the pruned decoder, and the decoder that does not perform the pruning operation is called the normal decoder. The pruning operation includes: trimming the visual tokens in the input token sequence so that the pruned decoder decodes based on the trimmed token sequence to obtain a new token sequence. The input token sequence contains multiple visual tokens; here, trimming refers to deleting one or more input visual tokens from the input token sequence.
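The pruning operation itself can be sketched as a function that deletes visual tokens and leaves system and instruction tokens untouched. The selection rule used here (drop the visual tokens with the lowest importance score) is an assumption made for illustration; the paragraph above only states that one or more visual tokens are deleted. The function name and the importance values are hypothetical.

```python
from typing import List, Sequence


def prune_visual_tokens(tokens: Sequence[str], is_visual: Sequence[bool],
                        importance: Sequence[float],
                        num_to_remove: int) -> List[str]:
    """Delete `num_to_remove` visual tokens; non-visual tokens are always kept.

    Assumption: the visual tokens with the lowest importance are removed.
    """
    visual_idx = [i for i, v in enumerate(is_visual) if v]
    num_to_remove = min(num_to_remove, len(visual_idx))
    # Order visual positions by ascending importance and drop the first ones.
    drop = set(sorted(visual_idx, key=lambda i: importance[i])[:num_to_remove])
    return [t for i, t in enumerate(tokens) if i not in drop]


tokens = ["sys", "v0", "v1", "v2", "v3", "ask"]
is_visual = [False, True, True, True, True, False]
importance = [0.30, 0.05, 0.20, 0.02, 0.13, 0.30]  # made-up scores
pruned = prune_visual_tokens(tokens, is_visual, importance, num_to_remove=2)
```

The pruning decoder would then decode `pruned` instead of the original, longer sequence.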

[0058] Because pruning decoders are used, some visual tokens in the token sequence are removed. The model therefore performs inference on a shorter token sequence, which improves inference efficiency. Furthermore, to avoid losing inference accuracy, this embodiment places at least one ordinary decoder, which does not prune, between pruning decoders. In other words, pruning is performed at intervals across the decoder layers, only within specific decoders. Specifically, after a pruning decoder performs the pruning operation on its input token sequence, it is followed by an ordinary decoder that does not prune and therefore keeps the number of tokens unchanged. After an interval of one or more ordinary decoders, the next pruning decoder performs the pruning operation again. This interval reduces the computational overhead of pruning. At the same time, given the autoregressive nature of multimodal language models, the ordinary decoders between two pruning decoders allow the model to fully attend to and understand the visual information in the sequence over several decoding passes, so that inference efficiency improves without significantly affecting model performance.

[0059] Therefore, this embodiment performs pruning operations by skipping steps between multi-layer decoders, so that the information of visual tokens in the token sequence can be captured by the decoder, preventing the loss of visual information due to excessive pruning, which would reduce the inference accuracy of the model.

[0060] For example, a multi-layer decoder includes a pruned decoder and a regular decoder. The pruned decoder prunes the visual data representation in the input data representation sequence, decodes the pruned data representation sequence to obtain a new data representation sequence, and transmits it to the next layer's decoder. The regular decoder decodes the visual data in the input data representation sequence to obtain a new data representation sequence and transmits it to the next layer's decoder. There is at least one regular decoder between each pruned decoder.
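The interleaving of pruning decoders and ordinary decoders can be sketched as a loop over layers that records how many tokens each layer receives. Everything below is a simplified stand-in: `decode` does no real computation, `prune` keeps the first visual tokens rather than the most important ones, and the layer indices and keep ratio are hypothetical values.

```python
from typing import List, Set


def decode(tokens: List[str]) -> List[str]:
    """Stand-in for one decoder layer: same number of tokens out as in."""
    return list(tokens)


def prune(tokens: List[str], keep_visual: int) -> List[str]:
    """Stand-in pruning step: keep only the first `keep_visual` visual tokens."""
    kept, seen = [], 0
    for t in tokens:
        if t.startswith("v"):
            seen += 1
            if seen > keep_visual:
                continue
        kept.append(t)
    return kept


def run_decoders(tokens: List[str], num_layers: int,
                 pruning_layers: Set[int], keep_ratio: float) -> List[int]:
    """Run the stack and record the sequence length seen by each layer."""
    lengths = []
    for layer in range(num_layers):
        if layer in pruning_layers:  # pruning decoder: prune, then decode
            num_visual = sum(t.startswith("v") for t in tokens)
            tokens = prune(tokens, int(num_visual * keep_ratio))
        lengths.append(len(tokens))
        tokens = decode(tokens)      # ordinary decoders only decode
    return lengths


start = ["sys"] + [f"v{i}" for i in range(8)] + ["ask"]
lengths = run_decoders(start, num_layers=6, pruning_layers={1, 4}, keep_ratio=0.5)
# lengths == [10, 6, 6, 6, 4, 4]
```

Layers 2 and 3 sit between the two pruning layers and see an unchanged sequence, which mirrors the requirement that at least one ordinary decoder separates pruning decoders.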

[0061] As an example, the input token sequences fed to each pruning decoder can be different, and the pruning operations performed by each pruning decoder can be the same or different. For example, the number of input visual tokens pruned in each pruning decoder may be the same or different, or the proportion of input visual tokens pruned by each pruning decoder to the input token sequence may be the same or different, and so on.
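To make the "same or different" options concrete, the following sketch tracks how many visual tokens each layer receives under two hypothetical schedules: one that removes a fixed count at every pruning decoder and one that removes a different count each time. The initial count of 576 visual tokens, the layer indices, and the removal counts are illustrative assumptions only.

```python
from typing import Dict, List


def visual_token_schedule(initial_visual: int, num_layers: int,
                          remove_per_layer: Dict[int, int]) -> List[int]:
    """Return the number of visual tokens each decoder layer decodes."""
    counts, current = [], initial_visual
    for layer in range(num_layers):
        # Layers absent from the mapping are ordinary decoders (remove 0).
        current = max(0, current - remove_per_layer.get(layer, 0))
        counts.append(current)
    return counts


same = visual_token_schedule(576, 6, {1: 100, 3: 100, 5: 100})
different = visual_token_schedule(576, 6, {1: 288, 3: 144, 5: 72})
```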

[0062] As an example, the pruning decoder can be obtained in a variety of ways. For instance, a pruning module capable of performing pruning operations can be implemented and added to the decoder to make the decoder a pruning decoder capable of performing pruning operations.

[0063] As an example, an initial multimodal language model can be obtained, which includes multiple layers of decoders. A pruning decoder is determined within this model, and a pruning module is connected between the pruning decoder and the previous layer decoder to perform pruning operations, thus obtaining the constructed multimodal language model. Based on this, the input token sequence of the pruning decoder can first be pruned by the pruning module to obtain a pruned token sequence, which is then decoded by the pruning decoder to obtain a new token sequence. Therefore, this embodiment not only ensures that the model retains important visual information but also improves its inference efficiency.
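The construction step can be sketched as wrapping selected layers of an existing model so that a pruning module runs on their input first. This is a hedged illustration: the `Layer` alias, `with_pruning`, `build_model`, and the `drop_last_visual` pruning module are hypothetical names, and layers are modeled as plain functions over token lists rather than real decoder layers.

```python
from typing import Callable, List, Set

Layer = Callable[[List[str]], List[str]]


def ordinary_decoder(tokens: List[str]) -> List[str]:
    """Stand-in decoder layer that preserves the number of tokens."""
    return list(tokens)


def drop_last_visual(tokens: List[str]) -> List[str]:
    """Hypothetical pruning module: removes the last visual token, if any."""
    for i in range(len(tokens) - 1, -1, -1):
        if tokens[i].startswith("v"):
            return tokens[:i] + tokens[i + 1:]
    return list(tokens)


def with_pruning(decoder: Layer, pruning_module: Layer) -> Layer:
    """Connect a pruning module between the previous layer and `decoder`."""
    def pruning_decoder(tokens: List[str]) -> List[str]:
        return decoder(pruning_module(tokens))
    return pruning_decoder


def build_model(initial_layers: List[Layer], pruning_layers: Set[int],
                pruning_module: Layer) -> List[Layer]:
    """Return a layer list in which the selected layers prune before decoding."""
    return [
        with_pruning(layer, pruning_module) if i in pruning_layers else layer
        for i, layer in enumerate(initial_layers)
    ]


model = build_model([ordinary_decoder] * 5, {1, 3}, drop_last_visual)
tokens = ["sys", "v0", "v1", "v2", "ask"]
for layer in model:
    tokens = layer(tokens)
```

Wrapping leaves the original decoder layers unmodified, which fits the case where the initial model is an existing open-source model.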

[0064] In this embodiment, the number of ordinary decoders between each layer of pruning decoder and the next layer of pruning decoder can be one or more, and can be set according to actual needs. This embodiment does not limit this.

[0065] As an example, to achieve an optimal trade-off between computational load, inference latency, and accuracy, in some cases the number of ordinary decoders between a pruning decoder and the next pruning decoder is determined based on the execution time of the pruning operation; and / or

[0066] the importance of each input visual data representation in the input data representation sequence of each decoder, where each decoder includes a self-attention module used to extract the importance of each data representation in that decoder's input data representation sequence.

[0067] When the number is determined based on importance, among the decoders from the pruning decoder to the next pruning decoder, the first difference between the importance of the input visual data representations in the input data representation sequences of any two adjacent decoders is less than a first preset threshold.

[0068] As an example, existing multimodal language models do not perform pruning operations, while this embodiment introduces pruning operations. However, considering that performing pruning operations requires a certain execution time, which can be understood as the additional overhead that pruning operations bring to the existing model, if the execution time of the pruning operation is large, exceeding the decoding processing time of the decoder for the input token sequence, then the pruning operation does not improve the model's inference efficiency. Based on this, this embodiment can pre-determine the execution time of the pruning operation to determine the number of ordinary decoders between the pruned decoder and the next layer pruned decoder.

[0069] As an example, this number can be determined experimentally. Assuming the model has 20 decoder layers, the total time from acquiring the token sequence until the last decoder layer outputs its token sequence can be measured, along with the decoding time of each decoder layer. Without pruning, every decoder layer receives the same number of tokens, so the decoding time of each layer is the same. The number of visual tokens removed by the pruning operation can be set experimentally, and the execution time of the pruning operation can be obtained by running the corresponding program in advance. Since the token sequence is shorter after pruning, the decoding time of a decoder after pruning can also be determined. Based on these measurements, the number of ordinary decoders between a pair of pruning decoders can be chosen so that the processing time with pruning is lower than without it.

[0070] For example, suppose that without pruning the decoding time of each decoder layer is 10 milliseconds, the execution time of the pruning operation is 3 milliseconds, and after pruning the decoding time of a decoder drops to 6 milliseconds. Once a decoder performs the pruning operation, its total processing time falls from 10 milliseconds to 6 + 3 = 9 milliseconds. Therefore, as far as processing time is concerned, the number of ordinary decoders in the interval can be arbitrary, from 0 to any integer, because the processing time has already been reduced.

[0071] Now suppose the pruning operation takes 6 milliseconds and the decoder's decoding time drops to 6 milliseconds after pruning. Once a decoder performs the pruning operation, its total processing time rises from 10 milliseconds to 6 + 6 = 12 milliseconds. Therefore, at least one ordinary decoder must follow, so that the total processing time of the pruning decoder plus the ordinary decoder is 12 + 6 = 18 milliseconds, which is less than the 10 + 10 = 20 milliseconds needed when neither decoder prunes. More ordinary decoders can of course be placed in the interval, because a single one already reduces the processing time.
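The arithmetic in the two timing examples can be expressed as a small helper that finds the smallest interval for which pruning pays off. The function name, its parameters, and the `max_gap` search bound are assumptions introduced for this sketch; the model of time (one pruning cost followed by faster decodes) is the one used in the examples above.

```python
from typing import Optional


def min_ordinary_decoders(base_ms: float, prune_ms: float,
                          pruned_ms: float, max_gap: int = 64) -> Optional[int]:
    """Smallest number k of ordinary decoders after a pruning decoder such that
    pruning once and then running k + 1 faster decodes takes less time than
    running k + 1 decodes at the unpruned speed.

    Returns None if no k up to `max_gap` satisfies the condition.
    """
    for k in range(max_gap + 1):
        pruned_total = prune_ms + (k + 1) * pruned_ms
        unpruned_total = (k + 1) * base_ms
        if pruned_total < unpruned_total:
            return k
    return None


fast_pruning = min_ordinary_decoders(base_ms=10, prune_ms=3, pruned_ms=6)  # 9 < 10
slow_pruning = min_ordinary_decoders(base_ms=10, prune_ms=6, pruned_ms=6)  # 18 < 20
no_benefit = min_ordinary_decoders(base_ms=10, prune_ms=6, pruned_ms=10)
```

Here `fast_pruning` is 0, `slow_pruning` is 1, and `no_benefit` is `None`. Since the method requires at least one ordinary decoder between pruning decoders, a caller would typically take `max(1, k)` as the interval.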

[0072] Furthermore, in this embodiment the number can also be determined based on the importance of each input visual token in the decoder's input token sequence. Each decoder includes a self-attention module, which extracts the importance of each token in that decoder's input token sequence. For example, the self-attention module can capture the relationships between tokens in the sequence, thereby obtaining the dependency between the token currently being generated and previously generated tokens. It can also combine textual and visual information to achieve cross-modal understanding and generation. Moreover, the importance of each token in the current context can be determined by calculating attention weights; this importance also represents the degree of attention the decoder pays to the token.

[0073] As an example, the self-attention module of each decoder can generate an attention matrix whose last row contains an attention value for each token, which characterizes the importance of each token in the input token sequence. In this embodiment, the number of ordinary decoders can therefore be determined based on the importance of each input visual data representation in the input data representation sequence of the decoder; wherein, among the decoders from the pruning decoder to the next pruning decoder, the first difference between the importance of the input visual data representations in the input data representation sequences of any two adjacent decoders is less than a first preset threshold.

[0074] As an example, suppose the input token sequence contains N tokens, of which m are visual tokens. In decoder $D_i$, the importance of each of the m visual tokens can be obtained from the attention matrix of $D_i$, yielding the total importance of the m visual tokens. For example, the attention value of each token can be read from the last row of the attention matrix, and the attention values of the visual tokens can be summed to obtain the total attention value of the m visual tokens, which gives their overall importance $O_i$.

[0075] For the decoder $D_{i+1}$ adjacent to decoder $D_i$, the overall importance $O_{i+1}$ of the m visual tokens in the input token sequence of $D_{i+1}$ can be obtained in the same way. If the difference between $O_i$ and $O_{i+1}$ is small, the overall importance of the visual tokens differs little between the two adjacent decoders, and the two decoders pay similar attention to the visual tokens. Therefore, neither decoder needs to perform the pruning operation.

[0076] If the difference between $O_i$ and $O_{i+1}$ is significant, for example greater than or equal to the first preset threshold, this indicates that decoder $D_{i+1}$ pays noticeably less attention to the visual tokens than decoder $D_i$; therefore, decoder $D_{i+1}$ can be selected to perform the pruning operation. In this way, based on the importance of the visual tokens in each decoder's input token sequence, it is possible to determine which decoders are pruning decoders and how many ordinary decoders should separate two pruning decoders.
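The overall importance of the visual tokens and the comparison between adjacent decoders can be sketched as follows. The attention matrices are made-up 4 x 4 examples (N = 4 tokens, m = 2 visual tokens), the function names are hypothetical, and taking the absolute value of the difference is an assumption, since the text only refers to "the difference".

```python
from typing import Sequence


def overall_visual_importance(attention: Sequence[Sequence[float]],
                              is_visual: Sequence[bool]) -> float:
    """Sum the last-row attention values over the visual token positions."""
    last_row = attention[-1]
    return sum(a for a, v in zip(last_row, is_visual) if v)


def should_prune_next(o_i: float, o_next: float, threshold: float) -> bool:
    """True when importance changes by at least `threshold` between neighbors."""
    return abs(o_i - o_next) >= threshold


is_visual = [False, True, True, False]
attn_i = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.25, 0.25, 0.5, 0.0],
    [0.125, 0.25, 0.5, 0.125],   # last row: attention over all 4 tokens
]
attn_next = [
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
    [0.25, 0.25, 0.5, 0.0],
    [0.375, 0.125, 0.125, 0.375],
]
o_i = overall_visual_importance(attn_i, is_visual)        # 0.25 + 0.5
o_next = overall_visual_importance(attn_next, is_visual)  # 0.125 + 0.125
prune_at_next = should_prune_next(o_i, o_next, threshold=0.2)
```

In this example the visual tokens receive much less attention in the second decoder, so it would be selected as a pruning decoder.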

[0077] For example, suppose three ordinary decoders (D2, D3, and D4) lie between the pruning decoder D1 and the next pruning decoder D5. The difference between the importance of the visual tokens in the input token sequence of pruning decoder D1 and that in decoder D2 is less than the first preset threshold. Similarly, the difference between D2 and D3 is less than the first preset threshold, as is the difference between D3 and D4. Therefore, from D1 to D4, the difference in visual token importance between any two adjacent decoders is less than the first preset threshold. At decoder D5, the importance of the visual tokens in the input token sequence of D4 exceeds that in D5 by an amount greater than or equal to the first preset threshold. Therefore, it can be determined that D1 is a pruning decoder, D2 to D4 are ordinary decoders, and D5 is a pruning decoder.
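The D1 to D5 example can be reproduced with a short selection routine over per-decoder importance values. This is a sketch under stated assumptions: the importance values are made up, the first pruning decoder is given as input, and the explicit check that at least one ordinary decoder separates two pruning decoders is added here to match the method's requirement.

```python
from typing import List


def select_pruning_decoders(importance: List[float], first: int,
                            threshold: float) -> List[int]:
    """Pick pruning decoders (0-based indices) from per-decoder importance.

    A decoder is selected when the importance differs from the previous
    decoder's by at least `threshold` and at least one ordinary decoder
    lies between it and the previously selected pruning decoder.
    """
    selected = [first]
    for i in range(first + 1, len(importance)):
        changed = abs(importance[i - 1] - importance[i]) >= threshold
        separated = i - selected[-1] >= 2
        if changed and separated:
            selected.append(i)
    return selected


# Overall visual-token importance for D1..D5 (made-up numbers).
importance = [0.50, 0.48, 0.47, 0.45, 0.30]
layers = select_pruning_decoders(importance, first=0, threshold=0.1)
names = [f"D{i + 1}" for i in layers]
```

With these values `names` is `["D1", "D5"]`, leaving D2 to D4 as ordinary decoders.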

[0078] Optionally, the aforementioned number can also be determined comprehensively based on both the execution time of the pruning operation and the importance of each data representation in the input data representation sequence of the decoder. For example, the minimum number of ordinary decoders that must separate two pruning decoders for pruning to yield a net reduction in processing time can be determined first. Then, on that basis, the number of ordinary decoders to be spaced out can be further determined according to the importance, as described in the above embodiment.

[0079] In some examples, the number of ordinary decoders between each pair of pruning decoders is the same, where a pair of pruning decoders refers to a pruning decoder and the next pruning decoder after it.

[0080] Based on this, by setting the same number of ordinary decoders between each pair of pruning decoders, it can be ensured that the pruning operation is carried out in a uniform and predictable manner throughout the model. This helps to maintain a balance between different layers: it prevents some layers from suffering a sudden drop in performance due to overly frequent pruning, and prevents the amount of computation from being insufficiently reduced due to overly sparse pruning.

[0081] In practical applications, the first-layer pruning decoder in a multi-layer decoder can be determined in various ways. For example, in a model's multi-layer decoder, the visual tokens in the input token sequences of the earlier decoders have higher importance. The importance of the visual tokens in the input token sequence of each decoder can be determined sequentially from front to back, and the decoder at which the visual token importance drops the most can be designated as the first-layer pruning decoder. In some examples, the first-layer pruning decoder in the multi-layer decoder can be determined in the following way:

[0082] In the first n layers of the multi-layer decoder, after obtaining the importance of each input data representation in the input data representation sequence extracted by the self-attention module of each layer of the decoder, the proportion of the importance of the input visual data representation in the input data representation sequence is determined; where n is a preset value.

[0083] The first-layer pruning decoder is determined based on the difference between the proportions of the importance of the input visual data representations corresponding to the two adjacent layers of the decoder in the input data representation sequence.

[0084] Wherein, the second difference between the proportion of the importance of the input visual data representation corresponding to the first-layer pruning decoder in the input data representation sequence and the proportion of the importance of the input visual data representation corresponding to the previous layer decoder in the input data representation sequence is greater than a second preset threshold.

[0085] As an example, the proportion of importance of the input visual tokens in the input token sequence can be obtained by taking the attention value of each input visual token from the last row of the attention matrix, summing these attention values, and dividing the sum by the sum of the attention values of all tokens in the last row of the attention matrix.

[0086] For the first n layers of decoders, following the above method, the proportion of importance of the input visual tokens in the input token sequence corresponding to each decoder layer is obtained. This proportion value for each decoder layer can be compared. If the difference between the proportion value of the i-th layer decoder and the proportion value of the (i-1)-th layer decoder is large, for example, greater than the second preset threshold, it indicates that the overall attention of the decoder to all visual tokens decreases at the i-th layer. Therefore, the i-th layer decoder can be used as the first-layer pruned decoder. The second preset threshold can be set according to actual needs; this embodiment does not limit it. Based on this, this embodiment can accurately find the most suitable first-layer pruned decoder based on the importance of the input visual tokens, ensuring that highly important visual tokens are retained and guaranteeing the inference accuracy of the model.
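
As a non-limiting illustration, the proportion calculation and the comparison against the second preset threshold can be sketched in Python as follows. The function names, the attention values, and the threshold of 0.2 are assumptions made for illustration only, and the difference is assumed to be measured as a decrease from layer i-1 to layer i.

```python
from typing import Optional, Sequence


def visual_attention_share(last_row: Sequence[float],
                           is_visual: Sequence[bool]) -> float:
    """Share of the last attention row that falls on visual tokens."""
    total = sum(last_row)
    visual = sum(a for a, v in zip(last_row, is_visual) if v)
    return visual / total


def find_first_pruning_layer(shares: Sequence[float],
                             second_threshold: float) -> Optional[int]:
    """Return the 0-based index of the first layer whose visual share drops
    by more than second_threshold relative to the previous layer, or None."""
    for i in range(1, len(shares)):
        if shares[i - 1] - shares[i] > second_threshold:
            return i
    return None


# One illustrative last row: tokens 1 and 2 are visual tokens.
share = visual_attention_share([1.0, 3.0, 2.0, 2.0], [False, True, True, False])
print(share)  # 0.625

# Illustrative shares for the first n = 4 layers; the drop occurs at index 3.
print(find_first_pruning_layer([0.60, 0.58, 0.55, 0.20], second_threshold=0.2))  # 3
```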

[0087] After determining the first-layer pruning decoder, the number of decoders that need to be spaced out can be determined according to the above embodiment, and then the next-layer pruning decoder can be determined.

[0088] In some examples, the pruning of the input visual data representation in the input data representation sequence includes:

[0089] Using the self-attention module of the decoder in the previous layer, the importance of each input visual data representation in the input data representation sequence is obtained;

[0090] Based on the importance of each input visual data representation in the input data representation sequence, after selecting several input visual data representations with higher importance, the unselected input visual data representations are deleted from the input data representation sequence.

[0091] As an example, the token sequence L_i input to decoder D_i contains multiple tokens. The self-attention module in decoder D_i extracts the importance of each token in L_i and then generates a new token sequence L_{i+1}, which serves as the input token sequence of decoder D_{i+1}. When decoder D_i does not perform pruning, the number of tokens in the input token sequence L_i is the same as the number of tokens in the new token sequence L_{i+1}, although the information carried by the tokens differs.

[0092] In this embodiment, the decoder D_{i-1} preceding the pruning decoder D_i does not perform the pruning operation, so the importance of each token that the self-attention module of D_{i-1} extracted for the token sequence L_{i-1} can be obtained. The importance of each token in L_{i-1} corresponds to the importance of each token in the sequence L_i input to D_i. Therefore, several input visual tokens with higher importance can be selected from L_i, and the unselected input visual tokens are deleted from the input token sequence. The number of input visual tokens deleted by each pruning decoder can be configured according to actual needs; this embodiment does not limit this.

[0093] As an example, the input visual tokens can be sorted from highest to lowest importance, the most important tokens can be selected for retention, and the unselected ones can be deleted. Alternatively, the least important input visual tokens can be directly selected for deletion. The number of tokens selected can be configured as needed, and this embodiment does not limit this. Thus, this embodiment retains important visual tokens and deletes unimportant ones based on their importance, thereby ensuring the inference accuracy of the model.
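
As a non-limiting illustration, the selection and deletion described above can be sketched in Python as follows. The function name, the token labels, and the importance values are hypothetical; the sketch assumes that non-visual tokens are always kept and that the original token order is preserved.

```python
from typing import List, Sequence


def prune_visual_tokens(tokens: Sequence[str],
                        is_visual: Sequence[bool],
                        importance: Sequence[float],
                        keep: int) -> List[str]:
    """Keep the `keep` most important visual tokens and all other tokens."""
    visual_positions = [i for i, v in enumerate(is_visual) if v]
    # Rank visual positions from most to least important; ties keep
    # the earlier position first so the result is deterministic.
    ranked = sorted(visual_positions, key=lambda i: (-importance[i], i))
    kept = set(ranked[:keep])
    return [t for i, t in enumerate(tokens) if not is_visual[i] or i in kept]


tokens = ["sys", "v0", "v1", "v2", "v3", "ins"]
is_visual = [False, True, True, True, True, False]
importance = [0.30, 0.05, 0.20, 0.01, 0.10, 0.34]
print(prune_visual_tokens(tokens, is_visual, importance, keep=2))
# ['sys', 'v1', 'v3', 'ins']
```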

[0094] In practical applications, the pruning ratio of the input visual tokens pruned in the input token sequence during the pruning operation of the first-layer pruning decoder can be configured according to actual needs, and this embodiment does not impose any limitations on this. To minimize redundant visual tokens and prevent excessive pruning of input visual tokens, in some examples, the pruning ratio of the input visual data pruned in the input token sequence during the pruning operation of the first-layer pruning decoder can be positively correlated with the second difference.

[0095] As an example, the pruning ratio of the input visual tokens in the first-layer pruning decoder is determined by the difference between the importance percentage of the input visual tokens in the input token sequence of the first-layer pruning decoder and that of the previous layer decoder. For example, the larger the difference, the higher the pruning ratio of the visual tokens. A mapping relationship between the difference and the pruning ratio can be pre-defined. For example, the importance percentage of all visual tokens in the sequence corresponding to each decoder layer can be obtained. Assuming there are N tokens and m input visual tokens in the input token sequence, the importance of each token can be obtained, and thus the importance percentage of all input visual tokens in the sequence. Assuming the importance percentage corresponding to the first-layer pruning decoder is c_i%, and the importance percentage corresponding to the next layer decoder is c_{i+1}%, the difference between the former and the latter is u%. The corresponding pruning ratio can be determined based on u%, for example, from the ratio of u% to c_i%: this ratio can be used directly as the pruning ratio, or adjusted as needed. For example, if the ratio of u% to c_i% is one-half, the pruning ratio of the input visual tokens can be 50%, which means that the number of input visual tokens in the sequence is halved.
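
As a non-limiting illustration, the direct mapping from u% and c_i% to a pruning ratio can be sketched in Python as follows. The function name, the percentage values 40 and 20, and the clamping of the result to the range 0 to 1 are assumptions added for illustration; the specification leaves the exact mapping open.

```python
def pruning_ratio_from_shares(c_i: float, c_next: float) -> float:
    """Map the drop in visual importance percentage to a pruning ratio.

    c_i and c_next are the importance percentages of two adjacent layers,
    following the c_i% and c_{i+1}% notation of the example.
    """
    u = c_i - c_next          # difference u% between the two layers
    ratio = u / c_i           # ratio of u% to c_i%
    return min(max(ratio, 0.0), 1.0)  # clamping is an illustrative safeguard


ratio = pruning_ratio_from_shares(40.0, 20.0)
print(ratio)                      # 0.5
print(round(100 * (1 - ratio)))   # 50 of 100 visual tokens remain
```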

[0096] In practical applications, the pruning ratio of the number of input visual data representations pruned in each layer of the pruning decoder relative to the total number of data representations in the data representation sequence can be configured according to actual needs. In the example above, the first-layer pruning decoder has already pruned as many redundant visual tokens as possible. Therefore, in the pruning decoders after the first-layer pruning decoder, the pruning ratio can be relatively low, thereby ensuring that the sequence contains certain visual information and ensuring the model's inference ability. Based on this, in some examples, in each layer of the pruning decoder after the first-layer pruning decoder, the pruning ratio of the input visual data representations pruned in the input data representation sequence of each pruning decoder is less than the pruning ratio of the input visual data representations pruned in the input data representation sequence of the first-layer pruning decoder; and / or, in each layer of the pruning decoder after the first-layer pruning decoder, the pruning ratio of the input visual data representations pruned in the input data representation sequence of each pruning decoder is the same.

[0097] As an example, in the first layer of the pruning decoder, the pruning ratio of the input visual token is relatively large, such as 50%; in subsequent layers of the pruning decoder, the pruning ratio of each layer is less than 50%, and a smaller pruning ratio can be selected, such as 5%.

[0098] Optionally, the pruning ratio can be the same or different in each pruning decoder after the first pruning decoder. For example, it can be the same, such as 5%. The pruning ratio here can also be determined based on the degree of attention paid to the input visual token in each decoder after the first pruning decoder.

[0099] Based on this, in this embodiment, after the first-layer pruning decoder prunes a certain amount of input visual tokens, subsequent pruning decoders can be designed based on the pruning ratio of the first-layer pruning decoder. Each subsequent pruning decoder has a smaller pruning ratio than the first-layer pruning decoder, thus preserving some visual information after the first-layer pruning decoder to ensure the model's inference ability. Alternatively, the pruning ratio of each subsequent pruning decoder can be the same, making the pruning operation easier to implement while still preserving some visual information to ensure the model's inference ability.

[0100] In some examples, during the pruning operation of the last layer of the multi-layer decoder, the pruning ratio of the visual data representations in the input data representation sequence is greater than or equal to a set ratio threshold. This set ratio threshold can be set according to actual needs, such as 15%, to ensure that a certain number of visual tokens are still retained in the token sequence, thus guaranteeing the model's inference accuracy.

[0101] In practical applications, multimodal language models are used to perform multiple inferences using the multi-layer decoder. Each inference can be processed in the manner described in the aforementioned embodiments; for example, the pruning operations of each pruning decoder are the same in each inference. In other examples, the pruning operations of the pruning decoder can be different in each inference, for example, the pruning ratio can be different. For instance, the first inference can use the aforementioned embodiments, and from the second inference onwards, the pruning ratio in the pruning operations of the pruning decoder increases sequentially in each inference.

[0102] In some examples, the multimodal language model is used to perform multiple inferences using the multilayer decoder; wherein, in the multiple inferences, the input data representation sequence input to the multilayer decoder in the second inference and each subsequent inference includes: the input data representation sequence input to the multilayer decoder in the previous inference and the new data representation sequence output by the last layer decoder in the previous inference;

[0103] In each inference, the pruning decoder prunes the input visual data representation in the input data representation sequence by a higher percentage than in the previous inference.

[0104] As an example, the multimodal language model generates responses autoregressively. Autoregressive generation produces a sequence of tokens step by step, with each new token generated based on all previously generated tokens.

[0105] Specifically, input data can be acquired, including an input image and question text, such as a user's natural language question. An initial sequence of input tokens can be generated based on the input data, for example, by converting the question text into instruction tokens using a text encoder and converting the input image into a set of visual tokens using a visual encoder.

[0106] The initial inference can proceed as follows: the initial input token sequence (including text and visual tokens) is fed into the first layer of a multi-layer decoder, and then passed through each subsequent decoder layer. Specifically, each decoder layer (including a pruned decoder and a regular decoder) processes and decodes the input token sequence. In the pruned decoder, visual tokens are pruned, retaining only the most important ones. The final decoder layer processes the input token sequence, and its output, the processed token sequence, becomes the first response token. This first response token can be used to generate the first part of the response content.

[0107] To generate a complete response, the inference process is repeated multiple times to generate multiple response tokens. Specifically, starting from the second inference, the already generated tokens (e.g., the first response token) are appended to the initial token sequence to form a new input sequence.

[0108] For example, the initial input token sequence expands from "text token..., visual token..." to "text token..., visual token..., response token1".

[0109] The expanded token sequence is then fed back into the multi-layer decoder for the next inference. Similarly, each layer decoder processes the input token sequence, and the pruning decoder removes unimportant visual tokens. The final layer decoder outputs a new token sequence to generate the next response token (e.g., response token2).

[0110] Similarly, in the third inference, the newly generated token (response token2) is appended to the input token sequence, so that the expanded token sequence becomes "text token..., visual token..., response token1, response token2", and the next response token is generated.

[0111] This process continues in a loop until an ending token (such as a period, terminator, etc.) is generated or the preset maximum token length is reached. Thus, the response content corresponding to the question text can be obtained.

[0112] For example, the input and output token sequences in each inference are as follows. First inference:

[0113] The input token sequence fed into the first-layer decoder: text token..., visual token...

[0114] The token sequence output by the last-layer decoder: response token1

[0115] Second inference:

[0116] The input token sequence fed into the first-layer decoder: text token..., visual token..., response token1

[0117] The token sequence output by the last-layer decoder: response token2

[0118] Third inference:

[0119] The input token sequence fed into the first-layer decoder: text token..., visual token..., response token1, response token2

[0120] The token sequence output by the last-layer decoder: response token3

[0121] Continue in this manner until the entire answer is completed.
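
As a non-limiting illustration, the loop over inferences listed above can be sketched in Python as follows. The `generate` function, the `run_decoders` callable standing in for one complete pass through the multi-layer decoder, the `<eos>` ending token, and the toy decoder are all hypothetical placeholders introduced for illustration.

```python
from typing import Callable, List


def generate(prompt_tokens: List[str],
             run_decoders: Callable[[List[str]], str],
             end_token: str = "<eos>",
             max_new_tokens: int = 16) -> List[str]:
    """Repeat inference, appending each response token to the input sequence."""
    sequence = list(prompt_tokens)
    response: List[str] = []
    for _ in range(max_new_tokens):
        next_token = run_decoders(sequence)  # one complete inference
        if next_token == end_token:
            break
        response.append(next_token)
        sequence.append(next_token)          # expanded input for the next inference
    return response


def toy_decoders(sequence: List[str]) -> str:
    """Stand-in for the multi-layer decoder: emits three tokens, then ends."""
    produced = sum(1 for t in sequence if t.startswith("response"))
    return f"response{produced + 1}" if produced < 3 else "<eos>"


print(generate(["text", "visual"], toy_decoders))
# ['response1', 'response2', 'response3']
```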

[0122] In this embodiment, the pruning ratio of the pruning decoder is specifically designed for each inference iteration. Specifically, the pruning ratio is dynamically adjusted as the inference time progresses: each time a new response token is generated, the proportion of input visual tokens pruned gradually increases, meaning the proportion of input visual tokens retained in the sequence gradually decreases. In other words, in each pruning decoder, the pruning ratio of input visual tokens is positively correlated with the number of response tokens and the number of inference iterations of the response content. As the number of generated response tokens increases and the number of inference iterations increases, each pruning decoder prunes more input visual tokens. Conversely, in each inference iteration, the proportion of input visual tokens retained in the sequence gradually decreases.

[0123] Therefore, as the number of generated response tokens increases, the amount of contextual information the model needs to process also increases. By gradually increasing the pruning ratio, more visual tokens can be retained in the early generation stage to ensure the quality of the generated content, while the number of visual tokens can be reduced in the later stages of inference, thereby effectively reducing the computational burden and improving the overall inference efficiency.

[0124] For example, in the early stages of model generation, each pruning decoder retains a high proportion of visual tokens, ensuring that the model can fully understand and utilize the information from the input image, thereby generating high-quality, highly relevant responses. As the generation process progresses, gradually pruning more of the less critical visual tokens reduces the processing of redundant information, improving efficiency and avoiding the negative impact of information overload on the generated content.

[0125] As an example, across successive complete inference iterations, the pruning ratio of visual tokens in each pruning decoder gradually increases. That is, the pruning ratio used by a particular pruning decoder differs from one inference iteration to the next. For example, the first-layer pruning decoder might prune 50% of the input visual tokens in the first inference iteration, use a higher pruning ratio in the second inference iteration, and an even higher pruning ratio in the third inference iteration; the same logic applies to the other pruning decoders. The increase in the pruning ratio of visual tokens in each inference iteration relative to the previous iteration can be configured according to actual needs. For example, a fixed increment can be used, where each pruning decoder increases the pruning ratio of input visual tokens by a fixed percentage in each inference iteration relative to the previous iteration. Of course, the percentage increases for the individual pruning decoders can be the same or different.

[0126] Optionally, the amount by which each pruning decoder increases its pruning ratio of input visual tokens relative to the previous inference can be dynamically adjusted in each inference. That is, in each inference, the pruning ratio of the input visual tokens of a given pruning decoder is increased relative to the previous inference according to a set strategy.

[0127] For example, in the second inference relative to the previous inference, the pruning ratio of the visual tokens of a certain pruning decoder is determined using a set strategy; in the third inference relative to the previous inference, the pruning ratio of the visual tokens of the pruning decoder is determined using the set strategy, and so on. This set strategy can be configured according to actual needs, and this embodiment does not limit it.

[0128] For example, smoothing algorithms can be used to achieve smooth changes in the pruning ratio during each inference iteration. Taking the cosine annealing algorithm as an example, a cosine annealing strategy is applied to dynamically adjust the retention ratio of visual tokens in the sequence. The aforementioned pruning ratio refers to the proportion of pruned visual tokens to the total number of visual tokens generated from the input image; the retention ratio refers to the proportion of unpruned visual tokens (i.e., the retained visual tokens) to the total number of visual tokens generated from the input image. Thus, each pruning decoder can have a higher retention ratio in the early stages of inference to retain more visual tokens, ensuring the model has a sufficient understanding of the input image initially, thereby generating high-quality, highly relevant responses. In the middle stages, the retention ratio can be gradually reduced, decreasing computational burden as the generation process progresses while maintaining necessary visual information. In the later stages, a lower retention ratio is used, utilizing previously retained key information to further improve inference efficiency and avoid overcomputation. This scheduling method ensures the reasonable allocation of visual tokens at different generation stages, guaranteeing generation quality while optimizing the utilization of computational resources.
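
As a non-limiting illustration, a cosine annealing schedule for the retention ratio can be sketched in Python as follows. The specification does not give a formula; the half-cosine form used here is the common cosine annealing expression, and the function name, the bounds of 0.5 and 0.15, and the total of 10 inferences are assumptions made for illustration.

```python
import math


def cosine_retention(step: int, total_steps: int,
                     r_max: float, r_min: float) -> float:
    """Retention ratio for inference `step` (0-based) out of `total_steps`.

    Starts at r_max, decays smoothly along a half cosine, ends at r_min.
    """
    if total_steps <= 1:
        return r_max
    progress = min(step, total_steps - 1) / (total_steps - 1)
    return r_min + 0.5 * (r_max - r_min) * (1.0 + math.cos(math.pi * progress))


# Retention ratios of one pruning decoder over 10 inferences (illustrative).
schedule = [cosine_retention(k, 10, r_max=0.5, r_min=0.15) for k in range(10)]
for k, r in enumerate(schedule):
    print(k, round(r, 3))
```

The corresponding pruning ratio for each inference is one minus the retention ratio, so it rises slowly at first, faster in the middle stages, and levels off in the later stages.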

[0129] Therefore, this embodiment provides a smooth and gradual adjustment method for multiple inferences within a single response, avoiding sudden and drastic changes in the retention ratio and ensuring a stable and natural generation process. The gradual reduction of the retention ratio is synchronized with the generation process, allowing the model to adjust computational resource allocation in a targeted manner at different stages. In the early stages of generation, retaining more visual tokens ensures the model has a sufficient understanding of the input image, thereby generating high-quality, highly relevant responses. As generation progresses, the retention ratio is gradually reduced to decrease unnecessary computation, thereby improving inference efficiency and saving computational resources.

[0130] In practice, during adjacent inferences, each pruning decoder can select different visual tokens to prune in subsequent inferences. These visual tokens can include those pruned in the previous inference. For example, during the first inference, the first pruning decoder can record the importance of each visual token in the sequence during the pruning operation, for example, through caching, for access in the next inference. Assuming there are m visual tokens in the sequence, their importance can be recorded. During the first inference, assuming t of the least important visual tokens are pruned, in the second inference, the pruning decoder needs to prune t+j of the least important visual tokens. This can be achieved by accessing the already recorded importance of each visual token in the sequence, without needing to retrieve the importance of each visual token again. That is, of the t+j visual tokens pruned this time, t are the same as the t visual tokens pruned in the previous inference, while the other j visual tokens are determined based on their importance in the sequence. The processing of other pruning decoders is similar.
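
As a non-limiting illustration, the reuse of recorded importance across inferences can be sketched in Python as follows. The function names, token labels, and importance values are hypothetical; the sketch assumes the recorded importance is stored as a ranking computed once in the first inference.

```python
from typing import Dict, List


def rank_least_important(importance: Dict[str, float]) -> List[str]:
    """Rank visual tokens from least to most important (computed once)."""
    return sorted(importance, key=lambda tok: (importance[tok], tok))


def tokens_to_prune(cached_order: List[str], count: int) -> List[str]:
    """Pick the `count` least important tokens from the cached ranking."""
    return cached_order[:count]


# First inference: record the importance of the m = 5 visual tokens.
cached_order = rank_least_important(
    {"v0": 0.05, "v1": 0.20, "v2": 0.01, "v3": 0.10, "v4": 0.30}
)

first = tokens_to_prune(cached_order, 2)       # t = 2
second = tokens_to_prune(cached_order, 2 + 1)  # t + j with j = 1
print(first)   # ['v2', 'v0']
print(second)  # ['v2', 'v0', 'v3']
```

Because both selections read the same cached ranking, the t tokens pruned in the first inference are always contained in the t+j tokens pruned in the second.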

[0131] Figure 2A is a schematic diagram of a token processing method in a multimodal language model according to an exemplary embodiment of this specification. In this embodiment, the token sequence in the multimodal language model includes system tokens, visual tokens, and instruction tokens. The token processing in this embodiment is a progressive visual token pruning scheme that can remove redundant visual tokens. As shown in Figure 2A, the model includes multiple decoder layers (referred to as decoding layers). Visual token pruning operations can be inserted before specific decoding layers to reduce the number of visual tokens participating in the computation at that layer. As the number of layers increases, the number of visual tokens gradually decreases. It is worth noting that in this embodiment, the pruning operation is not performed in every layer, but rather inserted in a skip-step manner before certain specific layers. This reduces the overhead of pruning, and the skip-step approach does not cause a loss of accuracy compared to performing pruning at every layer.

[0132] The pruning operation of visual tokens can be achieved using the attention matrix extracted from the self-attention module in the previous decoding layer. Figure 2B illustrates the pruning method of this embodiment. The attention value for each visual token can be obtained from the last row of the attention matrix. The magnitude of the attention value of a visual token reflects its importance in the inference process; therefore, token pruning can be performed based on this value. Visual tokens can be sorted according to their attention values, and the top K most important visual tokens can be selected based on this sorting, with the remaining tokens removed. The number K of visual tokens retained can differ between pruning operations at different decoding layers. The value of K can be set according to actual needs, and this embodiment does not limit this.

[0133] Due to the inherent overhead of pruning operations, performing pruning before each decoding layer introduces significant overhead. To reduce this additional overhead, this embodiment proposes a skip-step pruning mechanism. Specifically, two pruning operations are separated by an equal number of unpruned decoding layers. This mechanism ensures that pruning operations are performed only on specific decoding layers, while the number of visual tokens in the decoding layers between pruning operations remains constant. Since the pruning step size determines the number of tokens in each decoding layer, the choice of step size determines the computational cost of the model. This step size can be considered a hyperparameter of the model and can be selected through ablation experiments, etc. In the experiments, when the step size is 4, the model achieves the shortest inference latency while maintaining stable accuracy. It should be understood that this is only illustrative; in practical applications, the step size should be determined based on the multimodal language model used.
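
As a non-limiting illustration, the placement of pruning operations under a skip-step mechanism can be sketched in Python as follows. The function names and the figures of 32 decoding layers and a first pruning operation at layer 4 are assumptions for illustration. The sketch also assumes that a step size of s means consecutive pruning operations are s layers apart, so that s - 1 unpruned layers lie between them; the specification does not fix this reading.

```python
from typing import List


def pruning_layers(num_layers: int, first_pruning_layer: int, step: int) -> List[int]:
    """1-based indices of decoding layers preceded by a pruning operation."""
    return list(range(first_pruning_layer, num_layers + 1, step))


def unpruned_layers_between(step: int) -> int:
    """Number of unpruned decoding layers between two pruning operations."""
    return step - 1


print(pruning_layers(num_layers=32, first_pruning_layer=4, step=4))
# [4, 8, 12, 16, 20, 24, 28, 32]
print(unpruned_layers_between(4))  # 3
```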

[0134] The pruning ratio is a crucial parameter affecting the computational cost of the model. Observations of visual token attention reveal that visual token attention accounts for a significant proportion of total attention in the first three layers of the model. However, after the third layer, the proportion of visual token attention decreases sharply. Therefore, it can be assumed that visual token pruning can begin after the third layer. For example, the first pruning operation can be performed at the fourth layer, using a relatively large pruning ratio, such as 50%. Subsequently, 5% of the visual tokens are gradually reduced in each pruning operation, leaving only 15% of the initial visual tokens in the final layer. Based on this, the method in this embodiment reduces the computational cost of the model by more than 50% compared to the original multimodal language model, thereby improving the model's inference efficiency.
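
As a non-limiting illustration, the progressive ratios in this example can be sketched in Python as follows. The sketch assumes a hypothetical 32-layer model with pruning operations at every fourth layer starting from layer 4, and reads "5%" as 5% of the initial number of visual tokens per pruning operation; neither assumption is stated in this specification. Percentages are kept as integers to avoid rounding effects.

```python
from typing import Dict, List


def retention_schedule(layers: List[int],
                       first_retained_pct: int = 50,
                       step_pct: int = 5,
                       floor_pct: int = 15) -> Dict[int, int]:
    """Percentage of initial visual tokens retained after each pruning layer."""
    schedule: Dict[int, int] = {}
    retained = first_retained_pct
    for n, layer in enumerate(layers):
        if n > 0:
            # Each later pruning operation removes a further step_pct of the
            # initial visual tokens, never going below floor_pct.
            retained = max(floor_pct, retained - step_pct)
        schedule[layer] = retained
    return schedule


layers = list(range(4, 33, 4))  # assumed pruning layers: 4, 8, ..., 32
print(retention_schedule(layers))
# {4: 50, 8: 45, 12: 40, 16: 35, 20: 30, 24: 25, 28: 20, 32: 15}
```

Under these assumptions, seven pruning operations of 5% each after the initial 50% cut leave 15% of the initial visual tokens at the final layer, which is consistent with the figures given in the example.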

[0135] Existing solutions all use a uniform, fixed pruning ratio; however, the progressive pruning strategy of this embodiment can vary spatially, meaning different decoding layers have different pruning ratios. Furthermore, the pruning strategy of this embodiment can also vary over time. For example, the inference of a large model is autoregressive, meaning that after one complete inference iteration, the model generates an output token; a complete inference iteration here refers to the token sequence starting from the input of the first decoder layer and ending with the output response token of the last decoder layer. Then, another complete inference iteration is performed, and this cycle continues until a complete response is generated. Starting from the second inference iteration, the already generated response token (e.g., the first response token) is appended to the initial token sequence to form a new input sequence. As the number of inference iterations increases, the pruning ratio of the visual tokens in each decoder layer increases sequentially.

[0136] The progressive visual token pruning technique used in this embodiment gradually reduces the number of visual tokens as the number of layers increases. Compared to some approaches that aim to prune to the maximum number of tokens, this embodiment considers the overhead of pruning and distributes the pruning operations evenly across specific layers, reducing the number of pruning operations. In the generation of the first response token, this embodiment achieves a 2x speedup compared to a multimodal language model without pruning (reduced from 128ms to 63ms in experiments).

[0137] Figure 3 is a flowchart of a method for constructing a multimodal language model according to an exemplary embodiment of this specification. The method includes:

[0138] In step 302, an initial multimodal language model is obtained, wherein the initial multimodal language model includes a multi-layer decoder;

[0139] In step 304, a pruning decoder is determined in the multi-layer decoder, and a pruning module for performing pruning operations is connected between the pruning decoder and the previous layer decoder to obtain a constructed multimodal language model; the constructed multimodal language model is used to perform the steps of the aforementioned token processing method based on the multimodal language model.

[0140] This embodiment describes the construction process of a multimodal language model. For details, please refer to the previous embodiment, which will not be repeated here.

[0141] Corresponding to the embodiments of the aforementioned data processing method / multimodal language model construction method, this specification also provides embodiments of a data processing device / multimodal language model construction device and the computer equipment on which it is applied.

[0142] The embodiments of the data processing device / multimodal language model construction device described in this specification can be applied to computer devices, such as servers or terminal devices. The device embodiments can be implemented through software, hardware, or a combination of both. Taking software implementation as an example, as a logical device, it is formed by its processor reading the corresponding computer program instructions from non-volatile memory into memory and executing them. From a hardware perspective, such as... Figure 4 The diagram illustrates a hardware structure of a computer device housing the data processing apparatus / multimodal language model construction apparatus described in this specification. This computer device may include a processor 410, a network interface 420, memory 430, and non-volatile memory 440. The data processing apparatus / multimodal language model construction apparatus is formed by the processor 410 reading the corresponding computer program instructions from the non-volatile memory 440 into the memory 430 and executing them. In addition, the computer device housing the token processing apparatus / multimodal language model construction apparatus based on the multimodal language model in this embodiment may also include other hardware depending on the actual functions of the computer device; these will not be elaborated further.

[0143] Figure 5 is a block diagram of a data processing apparatus according to an exemplary embodiment of this specification. The multimodal language model includes a multi-layer decoder, each decoder being used to decode an input data representation sequence to obtain a new data representation sequence and transmit it to the next-layer decoder. The apparatus may include:

[0144] The acquisition module 51 is configured to: acquire a data representation sequence; wherein the data representation sequence includes an initial visual data representation obtained by encoding the input image;

[0145] The input module 52 is configured to: input the acquired data representation sequence into the multi-layer decoder;

[0146] The multilayer decoder includes a pruning decoder that performs a pruning operation, wherein the pruning operation includes pruning the input visual data representation in the input data representation sequence so that the pruning decoder decodes based on the pruned data representation sequence to obtain a new data representation sequence;

[0147] In the multi-layer decoder, there is at least one ordinary decoder that does not perform the pruning operation between the pruning decoders.
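The arrangement of pruning decoders and ordinary decoders can be sketched as a per-layer label list. This is only an illustration: the function name `layer_schedule`, the zero-based layer indices, and the uniform gap between pruning decoders are assumptions made for simplicity.

```python
from typing import List


def layer_schedule(num_layers: int, first_pruning_layer: int, gap: int) -> List[str]:
    """Label each decoder layer as 'prune' or 'plain'.

    gap is the number of ordinary decoders between two pruning decoders;
    it must be at least 1 so that pruning decoders are never adjacent.
    """
    if gap < 1:
        raise ValueError("at least one ordinary decoder is required between pruning decoders")
    labels = []
    for layer in range(num_layers):
        offset = layer - first_pruning_layer
        is_pruning = offset >= 0 and offset % (gap + 1) == 0
        labels.append("prune" if is_pruning else "plain")
    return labels
```

With eight layers, a first pruning decoder at layer 2, and a gap of 2, the pruning decoders land on layers 2 and 5, and every other layer is an ordinary decoder.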

[0148] In some examples, the number of ordinary decoders between adjacent pruning decoders is determined based on: the execution duration of the pruning operation; and / or,

[0149] each decoder includes a self-attention module, which is used to extract the importance of each data representation in the decoder's input data representation sequence, and

[0150] the number is determined based on the importance of each input visual data representation in the decoder's input data representation sequence; wherein, for the decoders from one pruning decoder to the next pruning decoder, the first difference between the importance of each input visual data representation in the input data representation sequences of any two adjacent decoders is less than a first preset threshold.
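One possible reading of the importance-based criterion is sketched below. It counts how many consecutive decoders after a pruning decoder keep the visual importance scores nearly unchanged. The function name `stable_gap` is hypothetical, and measuring the "first difference" as the largest element-wise absolute change is an assumption, since the specification does not fix the metric.

```python
from typing import List, Sequence


def stable_gap(importances: Sequence[Sequence[float]], threshold: float) -> int:
    """Count consecutive layer transitions whose importance change stays below threshold.

    importances[k] holds the importance of each visual data representation
    at the k-th decoder, with k = 0 being a pruning decoder.
    """
    gap = 0
    for prev, cur in zip(importances, importances[1:]):
        # Assumed metric: largest absolute change of any single score.
        diff = max(abs(a - b) for a, b in zip(prev, cur))
        if diff >= threshold:
            break
        gap += 1
    return gap
```

While the scores remain stable, the earlier pruning decision remains valid, so the intervening decoders can stay ordinary; a large change suggests that the next pruning decoder should re-evaluate importance.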

[0151] In some examples, pruning the input visual data representation in the input data representation sequence includes:

[0152] Using the self-attention module of the decoder in the previous layer, the importance of each input visual data representation in the input data representation sequence is obtained;

[0153] Based on the importance of each input visual data representation in the input data representation sequence, after selecting several input visual data representations with higher importance, the unselected input visual data representations are deleted from the input data representation sequence.
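The two steps above can be sketched as follows. This is a minimal sketch under stated assumptions: importance is taken as the mean attention each position receives across all query rows, which is one common choice and is not stated in this specification, and the names `attention_importance`, `prune_visual_tokens`, and `keep_ratio` are hypothetical.

```python
import math
from typing import List, Sequence


def attention_importance(attn: Sequence[Sequence[float]]) -> List[float]:
    """Assumed importance: mean attention received by each position (column mean)."""
    n_rows = len(attn)
    n_cols = len(attn[0])
    return [sum(row[j] for row in attn) / n_rows for j in range(n_cols)]


def prune_visual_tokens(
    tokens: Sequence,
    is_visual: Sequence[bool],
    importance: Sequence[float],
    keep_ratio: float,
) -> list:
    """Keep the most important visual tokens; non-visual tokens are never removed."""
    visual = [i for i, flag in enumerate(is_visual) if flag]
    n_keep = math.ceil(len(visual) * keep_ratio)
    ranked = sorted(visual, key=lambda i: importance[i], reverse=True)
    kept = set(ranked[:n_keep])
    # Preserve the original order of the surviving positions.
    return [t for i, t in enumerate(tokens) if not is_visual[i] or i in kept]
```

The original order of the sequence is preserved, so positional relationships between the remaining visual representations and the text representations are unchanged.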

[0154] In some examples, the number of ordinary decoders between each pair of pruning decoders is the same; wherein, a pair of pruning decoders refers to a pruning decoder and the next pruning decoder.

[0155] In some examples, the first-layer pruning decoder in the multi-layer decoder is determined in the following way:

[0156] In the first n layers of the multi-layer decoder, after obtaining the importance of each input data representation in the input data representation sequence extracted by the self-attention module of each layer of the decoder, the proportion of the importance of the input visual data representation in the input data representation sequence is determined; where n is a preset value.

[0157] The first-layer pruning decoder is determined based on the difference between the proportions of the importance of the input visual data representations in the input data representation sequence for two adjacent decoder layers.

[0158] Wherein, the second difference between the proportion of the importance of the input visual data representation corresponding to the first-layer pruning decoder in the input data representation sequence and the proportion of the importance of the input visual data representation corresponding to the previous layer decoder in the input data representation sequence is greater than a second preset threshold.
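The selection of the first-layer pruning decoder can be sketched as follows. The names `visual_share` and `find_first_pruning_layer` are hypothetical, and taking the absolute value of the difference is an assumption, because the specification does not state the sign of the second difference.

```python
from typing import Optional, Sequence, Tuple


def visual_share(importance: Sequence[float], is_visual: Sequence[bool]) -> float:
    """Proportion of total importance carried by visual data representations."""
    total = sum(importance)
    visual = sum(w for w, flag in zip(importance, is_visual) if flag)
    return visual / total


def find_first_pruning_layer(
    shares: Sequence[float], threshold: float
) -> Optional[Tuple[int, float]]:
    """Return (layer index, second difference) for the first layer whose visual
    share differs from the previous layer by more than threshold, or None.

    shares[k] is the visual share at layer k among the first n layers.
    """
    for layer in range(1, len(shares)):
        diff = abs(shares[layer] - shares[layer - 1])  # assumed: absolute change
        if diff > threshold:
            return layer, diff
    return None
```

For example, if the visual share across the first four layers is 0.80, 0.78, 0.45, and 0.40 and the second preset threshold is 0.2, layer 2 is selected, because the share changes by about 0.33 relative to layer 1.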

[0159] In some examples, in the pruning operation of the first-layer pruning decoder, the pruning ratio of the input visual data representation in the input data representation sequence is positively correlated with the second difference.

[0160] In some examples, in each of the pruning decoders following the first-layer pruning decoder, the pruning ratio of the input visual data representation in the input data representation sequence of each pruning decoder is less than the pruning ratio of the input visual data representation in the input data representation sequence of the first-layer pruning decoder; and / or, in each of the pruning decoders following the first-layer pruning decoder, the pruning ratio of the input visual data representation in the input data representation sequence of each pruning decoder is the same.
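One way to realize such a ratio schedule is sketched below. All parameters (`scale`, `later_fraction`, `cap`) and the linear mapping from the second difference to the first pruning ratio are assumptions chosen for illustration; the specification only requires a positive correlation for the first ratio and smaller and / or equal ratios for the later pruning decoders.

```python
from typing import List


def pruning_ratios(
    second_difference: float,
    num_pruning_layers: int,
    scale: float = 1.5,
    later_fraction: float = 0.5,
    cap: float = 0.9,
) -> List[float]:
    """Return one pruning ratio per pruning decoder, first-layer pruning decoder first."""
    if num_pruning_layers < 1:
        raise ValueError("at least one pruning decoder is required")
    if second_difference <= 0 or not 0 < later_fraction < 1:
        raise ValueError("second_difference must be positive and later_fraction in (0, 1)")
    # Assumed mapping: linear in the second difference, clamped below 1.
    first = min(cap, scale * second_difference)
    # Later pruning decoders share one ratio that is smaller than the first.
    later = first * later_fraction
    return [first] + [later] * (num_pruning_layers - 1)
```

Because of the clamp, the first ratio is non-decreasing in the second difference and strictly increasing only below the cap.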

[0161] In some examples, in the pruning operation of the last-layer pruning decoder in the multi-layer decoder, the pruning ratio of the input visual data representation in the input data representation sequence is greater than or equal to a set ratio threshold.

[0162] In some examples, in the multimodal language model, a pruning module is connected between the pruning decoder and its previous-layer decoder, and the pruning module is used to perform the pruning operation.

[0163] In some examples, the multimodal language model is used to perform multiple inferences using the multi-layer decoder; wherein, in the multiple inferences, the input data representation sequence input to the multi-layer decoder in the second inference and each subsequent inference includes: the input data representation sequence input to the multi-layer decoder in the previous inference and the new data representation sequence output by the last layer of the decoder in the previous inference;

[0164] In each inference, the pruning decoder prunes the input visual data representation in the input data representation sequence by a higher percentage than in the previous inference.
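The multi-inference behavior can be sketched as follows. The geometric schedule and the names `ratio_for_inference`, `base_ratio`, `decay`, and `next_input` are assumptions; the specification only requires that each inference prunes a higher proportion than the previous one.

```python
from typing import List, Sequence


def ratio_for_inference(step: int, base_ratio: float = 0.5, decay: float = 0.5) -> float:
    """Pruning ratio for the step-th inference (step 0 is the first inference).

    Assumed schedule: the kept fraction shrinks geometrically, so the
    pruning ratio increases with each inference and stays below 1.
    """
    if not (0.0 <= base_ratio < 1.0 and 0.0 < decay < 1.0):
        raise ValueError("base_ratio must be in [0, 1) and decay in (0, 1)")
    return 1.0 - (1.0 - base_ratio) * decay ** step


def next_input(prev_input: Sequence, prev_output: Sequence) -> List:
    """Input of the next inference: previous input followed by the previous output."""
    return list(prev_input) + list(prev_output)
```

With the default values, the first three inferences use pruning ratios of 0.5, 0.75, and 0.875. For very large step counts the value saturates in floating point, so a practical schedule would need an explicit upper bound.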

[0165] Figure 6 is a block diagram of an apparatus for constructing a multimodal language model according to an exemplary embodiment of this specification. The apparatus includes:

[0166] The acquisition module 61 is used to: acquire an initial multimodal language model, wherein the initial multimodal language model includes a multi-layer decoder;

[0167] The construction module 62 is used to: determine the pruning decoder in the multi-layer decoder, and connect a pruning module for performing pruning operations between the pruning decoder and the previous layer decoder to obtain a constructed multimodal language model; the constructed multimodal language model is used to perform the steps of the aforementioned data processing method.

[0168] The specific implementation process of the functions and roles of each module in the above device can be found in the implementation process of the corresponding steps in the above method, and will not be repeated here.

[0169] Accordingly, embodiments of this specification also provide a computer program product, including a computer program that, when executed by a processor, implements the steps of the aforementioned data processing method / multimodal language model construction method embodiments.

[0170] Accordingly, embodiments of this specification also provide a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of an embodiment of a data processing method / multimodal language model construction method.

[0171] Accordingly, embodiments of this specification also provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of an embodiment of a data processing method / multimodal language model construction method.

[0172] For the device embodiments, since they basically correspond to the method embodiments, the relevant parts can be referred to in the description of the method embodiments. The device embodiments described above are merely illustrative. The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical modules, that is, they may be located in one place or distributed across multiple network modules. Some or all of the modules can be selected to achieve the purpose of the solution in this specification according to actual needs. Those skilled in the art can understand and implement this without creative effort.

[0173] The above embodiments can be applied to one or more computer devices. The computer device is a device that can automatically perform numerical calculations and / or information processing according to pre-set or stored instructions. The hardware of the computer device includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs), embedded devices, etc.

[0174] The computer device can be any electronic product that can interact with the user, such as a personal computer, tablet computer, smartphone, personal digital assistant (PDA), game console, interactive network television (IPTV), smart wearable device, etc.

[0175] The computer equipment may also include network equipment and / or user equipment. The network equipment includes, but is not limited to, a single network server, a server group consisting of multiple network servers, or a cloud based on cloud computing consisting of a large number of hosts or network servers.

[0176] The network in which the computer device is located includes, but is not limited to, the Internet, wide area network, metropolitan area network, local area network, and virtual private network (VPN).

[0177] The foregoing has described specific embodiments of this specification. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in a different order than that shown in the embodiments and may still achieve the desired result. Furthermore, the processes depicted in the drawings do not necessarily require the specific or sequential order shown to achieve the desired result. In some embodiments, multitasking and parallel processing are possible or may be advantageous.

[0178] The steps of the various methods described above are divided only for clarity of description. In practice, they can be combined into one step, or some steps can be split into multiple steps. As long as they include the same logical relationship, they are all within the scope of protection of this specification. Adding insignificant modifications to, or introducing insignificant designs into, the algorithm or process without changing the core design of the algorithm and process is also within the scope of protection of this specification.

[0179] While this specification contains numerous specific implementation details, these should not be construed as limiting the scope of any invention or the scope of the claims, but rather are primarily intended to describe features of specific embodiments of a particular invention. Certain features described in the various embodiments herein may also be implemented in combination in a single embodiment. Conversely, various features described in a single embodiment may also be implemented separately in various embodiments or in any suitable sub-combination. Furthermore, while features may function in certain combinations as described above and even initially claimed in this way, one or more features from a claimed combination may be removed from that combination in some cases, and a claimed combination may refer to a sub-combination or a variation thereof.

[0180] The terms "specific example" or "some examples," etc., refer to specific features, structures, materials, or characteristics described in connection with the embodiments or examples, which are included in at least one embodiment or example of this specification. In this specification, illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.

[0181] Other embodiments of this specification will readily occur to those skilled in the art upon consideration of the specification and practice of the invention claimed herein. This specification is intended to cover any variations, uses, or adaptations that follow the general principles of this specification and include common knowledge or customary techniques in the art not claimed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this specification are indicated by the following claims.

[0182] It should be understood that this specification is not limited to the precise structures described above and shown in the accompanying drawings, and various modifications and changes can be made without departing from its scope. The scope of this specification is limited only by the appended claims.

[0183] The above description is merely a preferred embodiment of this specification and is not intended to limit this specification. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this specification should be included within the scope of protection of this specification.

Claims

1. A data processing method applied to a multimodal language model, the multimodal language model comprising a multi-layer decoder, each decoder being used to decode an input data representation sequence to obtain a new data representation sequence and transmit it to the next-layer decoder; the method comprising: acquiring a data representation sequence, wherein the data representation sequence includes an initial visual data representation obtained by encoding an input image; and inputting the data representation sequence into the multi-layer decoder; wherein the multi-layer decoder includes a pruning decoder that performs a pruning operation; the pruning operation includes: pruning the input visual data representation in the input data representation sequence so that the pruning decoder decodes based on the pruned data representation sequence to obtain a new data representation sequence; and in the multi-layer decoder, there is at least one ordinary decoder that does not perform the pruning operation between the pruning decoders.

2. The method according to claim 1, wherein the number of ordinary decoders between the pruning decoders is determined based on: the execution duration of the pruning operation; and / or, each decoder includes a self-attention module, which is used to extract the importance of each data representation in the input data representation sequence of the decoder, and the number is determined based on the importance of each input visual data representation in the input data representation sequence of the decoder; wherein, for the decoders from one pruning decoder to the next pruning decoder, the first difference between the importance of each input visual data representation in the input data representation sequences corresponding to two adjacent decoders is less than a first preset threshold.

3. The method according to claim 2, wherein pruning the input visual data representation in the input data representation sequence comprises: using the self-attention module of the previous-layer decoder to obtain the importance of each input visual data representation in the input data representation sequence; and, based on the importance of each input visual data representation in the input data representation sequence, selecting several input visual data representations with higher importance and deleting the unselected input visual data representations from the input data representation sequence.

4. The method according to claim 2, wherein the number of ordinary decoders between each pair of pruning decoders is the same; wherein a pair of pruning decoders refers to a pruning decoder and the next pruning decoder.

5. The method according to claim 2, wherein the first-layer pruning decoder in the multi-layer decoder is determined in the following manner: in the first n layers of the multi-layer decoder, after obtaining the importance of each input data representation in the input data representation sequence extracted by the self-attention module of each decoder layer, determining the proportion of the importance of the input visual data representation in the input data representation sequence, wherein n is a preset value; and determining the first-layer pruning decoder based on the difference between the proportions of the importance of the input visual data representations in the input data representation sequence corresponding to two adjacent decoder layers; wherein the second difference between the proportion of the importance of the input visual data representation in the input data representation sequence corresponding to the first-layer pruning decoder and the proportion of the importance of the input visual data representation in the input data representation sequence corresponding to the previous-layer decoder of the first-layer pruning decoder is greater than a second preset threshold.

6. The method according to claim 5, wherein in the pruning operation of the first-layer pruning decoder, the pruning ratio of the input visual data representation in the input data representation sequence is positively correlated with the second difference.

7. The method according to claim 6, wherein in each layer of the pruning decoder after the first-layer pruning decoder, the pruning ratio of the input visual data representation in the input data representation sequence of each pruning decoder is less than the pruning ratio of the input visual data representation in the input data representation sequence of the first-layer pruning decoder; and / or, in each layer of the pruning decoder after the first-layer pruning decoder, the pruning ratio of the input visual data representation in the input data representation sequence of each pruning decoder is the same.

8. The method according to claim 7, wherein in the pruning operation of the last layer of the pruning decoder, the pruning ratio of the input visual data representation in the input data representation sequence is greater than or equal to a set ratio threshold.

9. The method according to any one of claims 1 to 8, wherein the multimodal language model is used to perform multiple inferences using the multi-layer decoder; wherein, in the multiple inferences, in the second inference and each subsequent inference, the data representation sequence input to the multi-layer decoder includes: the data representation sequence input to the multi-layer decoder in the previous inference and the new data representation sequence output by the last-layer decoder in the previous inference; and in each inference, the pruning decoder prunes the input visual data representation in the input data representation sequence by a higher percentage than in the previous inference.

10. The method according to any one of claims 1 to 8, wherein in the multimodal language model, a pruning module is connected between the pruning decoder and the previous-layer decoder of the pruning decoder, and the pruning module is used to perform the pruning operation.

11. A method for constructing a multimodal language model, the method comprising: obtaining an initial multimodal language model, wherein the initial multimodal language model contains a multi-layer decoder; and determining a pruning decoder in the multi-layer decoder, and connecting a pruning module for performing pruning operations between the pruning decoder and its previous-layer decoder to obtain a constructed multimodal language model; wherein the constructed multimodal language model is used to perform the steps of the method according to any one of claims 1 to 10.

12. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein, when the processor executes the computer program, it implements the steps of the method according to any one of claims 1 to 11.

13. A computer program product comprising a computer program that, when executed by a processor, implements the steps of the method according to any one of claims 1 to 11.

14. A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 11.