Model inference method and electronic device

By using principal component analysis and correlation difference indices to screen features of multimodal large models, the problem of increased complexity caused by noisy features is solved, and efficient feature screening and model inference are achieved.

CN122242784A · Pending · Publication Date: 2026-06-19 · INSPUR SUZHOU INTELLIGENT TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
INSPUR SUZHOU INTELLIGENT TECH CO LTD
Filing Date
2026-05-22
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

When multimodal large models process high-resolution images or audio, the large number of features, including noisy features, leads to increased time and space complexity, making it difficult for existing technologies to effectively filter and remove noisy features.

Method used

Principal component analysis (PCA) is used to reconstruct the initial features, obtain reconstruction error and correlation difference indices, and determine the importance of features by combining the positional relationships of sub-objects of the visual object. Based on the importance, the features are filtered to obtain the target feature set.

Benefits of technology

It reduces the time and space complexity of the model inference process, improves the accuracy and efficiency of feature selection, and reduces computational and storage requirements.

✦ Generated by Eureka AI based on patent content.

Figure CN122242784A_ABST

Abstract

This application discloses a model inference method and electronic device, relating to the field of machine learning technology. The method includes: after obtaining an initial feature set composed of features corresponding to sub-objects of visual objects in multimodal data, on the one hand, reconstructing each feature and obtaining a reconstruction error used to characterize the feature differences before and after reconstruction; on the other hand, obtaining an association difference index used to characterize the spatial visual differences between the sub-objects corresponding to each feature and adjacent sub-objects; further, determining the importance of each feature based on the reconstruction error and the association difference index, thereby combining the two dimensions of reconstruction error and association difference index to improve the effectiveness of determining the importance; on this basis, filtering the initial feature set according to the importance, and performing model inference on multimodal data based on the obtained target feature set, thereby achieving effective filtering of the initial feature set.

Description

Technical Field

[0001] This application relates to the field of machine learning technology, and in particular to a model inference method and an electronic device.

Background Technology

[0002] With the continuous development of deep learning technology, multimodal large models are widely used in fields such as image understanding, video analysis, cross-modal retrieval, human-computer interaction, and autonomous driving. However, when extracting features from high-resolution images or audio, multimodal large models can generate hundreds or even thousands of features. Although high-resolution input can improve the model's perception of the input to some extent, the computational complexity of the transformer is quadratically related to the number of features, so a large number of features increases the time and space complexity of multimodal large models. Moreover, the features also include features corresponding to input noise, which not only affect the output of the transformer but also add to the time and space complexity of multimodal large models. Therefore, how to effectively delete features corresponding to input noise is an increasingly important focus for model providers.

Summary of the Invention

[0003] This application provides a model inference method and an electronic device to at least solve the problem of how to effectively perform feature selection in related technologies.

[0004] This application provides a model inference method, including:
obtaining an initial feature set corresponding to a visual object in multimodal data, where the features in the initial feature set correspond respectively to the sub-objects of the visual object;
reconstructing each feature in the initial feature set based on the principal component analysis algorithm to obtain the reconstructed feature of each feature, and obtaining the reconstruction error of each feature based on its reconstructed feature, where the reconstruction error is used to characterize the feature difference before and after reconstruction;
determining the association difference index of each feature based on the positional relationship between the sub-objects of the visual object, where the association difference index is used to characterize the spatial visual difference between the sub-object corresponding to the feature and its adjacent sub-objects;
determining the importance of each feature based on its reconstruction error and association difference index;
filtering the initial feature set according to the importance to obtain the target feature set; and
performing model inference on the multimodal data based on the target feature set.

[0005] This application also provides a model inference apparatus, including:
a feature set acquisition module, used to acquire an initial feature set corresponding to a visual object in multimodal data, where the features in the initial feature set correspond respectively to the sub-objects of the visual object;
an error acquisition module, used to reconstruct each feature in the initial feature set based on the principal component analysis algorithm, obtain the reconstructed feature of each feature, and obtain the reconstruction error of each feature based on its reconstructed feature, where the reconstruction error is used to characterize the feature difference before and after reconstruction;
an indicator determination module, used to determine the correlation difference index of each feature based on the positional relationship between the sub-objects of the visual object, where the correlation difference index is used to characterize the spatial visual difference between the sub-object corresponding to the feature and its adjacent sub-objects;
an importance determination module, used to determine the importance of each feature based on the reconstruction error and correlation difference index of each feature;
a filtering processing module, used to filter the initial feature set according to the importance to obtain a target feature set; and
a model inference module, used to perform model inference on the multimodal data based on the target feature set.

[0006] This application also provides an electronic device, including: a memory for storing a computer program; and a processor for executing the computer program to implement the steps of any of the above-described model inference methods.

[0007] This application also provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of any of the above-described model inference methods.

[0008] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the steps of any of the above-described model inference methods.

[0009] After obtaining the initial feature set corresponding to the visual object, this application determines the importance of each feature in the initial feature set, filters the initial feature set according to the importance to obtain the target feature set, and then performs model inference on the multimodal data based on the target feature set. Filtering the features in this way reduces the number of features corresponding to the visual object in the model inference process, and thus the time and space complexity of model inference. In determining the importance of each feature, on the one hand, each feature in the initial feature set is reconstructed to obtain a reconstruction error that characterizes the difference between the feature before and after reconstruction; this reconstruction error reflects how difficult it is to represent the feature accurately in a low-dimensional space. On the other hand, based on the positional relationship between the sub-objects of the visual object, an association difference index is determined to characterize the spatial visual difference between a sub-object and its adjacent sub-objects; by introducing local information of the visual object, this index characterizes the difference in information density among features. The importance of each feature is then determined from both the reconstruction error and the association difference index, which makes the importance more reliable, makes the importance-based selection more effective, and thereby effectively reduces the number of features.

Description of the Drawings

[0010] To more clearly illustrate the embodiments of this application, the accompanying drawings used in the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0011] Figure 1 is a flowchart of a model inference method provided in an embodiment of this application;
Figure 2 is a flowchart of a process for obtaining an initial feature set corresponding to a visual object in multimodal data, as provided in an embodiment of this application;
Figure 3 is a schematic diagram of a multimodal processing model provided in an embodiment of this application;
Figure 4 is a flowchart of a model inference method applied to a multimodal data inference scenario, as provided in an embodiment of this application;
Figure 5 is a schematic structural diagram of a model inference device provided in an embodiment of this application.

Detailed Description

[0012] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the protection scope of this application.

[0013] It should be noted that, in the description of this application, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. The terms "first," "second," etc., in this application are used to distinguish similar objects and are not used to describe a specific order or sequence.

[0014] To enable those skilled in the art to better understand the present application, the present application will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0015] The embodiments of this application provide a model inference method, which is described in detail below following its execution flow.

[0016] As shown in Figure 1, this model inference method includes the following steps S11 to S16.

[0017] S11, Obtain the initial feature set corresponding to the visual object in the multimodal data.

[0018] In this embodiment, multimodal data can be data containing various data objects of different forms. For example, multimodal data includes text, images, audio, video, and / or signals. Visual objects include target entities or regions that can be recognized, detected, located, or understood by a visual perception system; in this embodiment, visual objects can be images and / or video frames.

[0019] The initial feature set corresponding to the visual object includes a feature set composed of features obtained after encoding each sub-object in the visual object. Optionally, the features in the initial feature set correspond to the sub-objects of the visual object.

[0020] In the specific execution process, when obtaining the initial feature set corresponding to visual objects in multimodal data, the visual object can first be divided into multiple sub-objects, and then a feature sequence can be generated as the initial feature set through linear mapping and positional encoding. As shown in Figure 2, the initial feature set corresponding to the visual object in the multimodal data can be obtained through the following steps S11-1 to S11-4.

[0021] S11-1, Divide the visual object to obtain multiple sub-objects.

[0022] Specifically, in the process of dividing a visual object into multiple sub-objects, a fixed-size sliding window can be used to divide the visual object into non-overlapping sub-objects.

[0023] For example, the visual object is an image with a resolution of $H \times W \times C$, where H represents the height, W represents the width, and C represents the number of channels. A fixed-size sliding window, such as a $16 \times 16$-pixel window, is used to divide the image into non-overlapping blocks, resulting in $N = HW / P^2$ image patches, where P is the side length of an image patch, here 16 pixels.

[0024] It should be noted that during the process of dividing visual objects into non-overlapping segments using a sliding window of a fixed size, visual objects whose resolution does not meet the requirement of being an integer multiple of the side length corresponding to the fixed size can be padded with zero to make them an integer multiple of the side length, thereby unifying the size of the sub-objects, i.e., image blocks.
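The division and zero-padding described in S11-1 can be sketched as follows. This is a minimal illustration, not the application's implementation: the function name `split_into_patches` is hypothetical, and padding on the bottom and right edges is an assumption, since the text only states that the image is zero-padded to a multiple of the side length.

```python
import numpy as np

def split_into_patches(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) image into non-overlapping (patch, patch, C) blocks.

    Assumption: zeros are appended at the bottom/right so that H and W
    become integer multiples of `patch`.
    """
    h, w, c = image.shape
    pad_h = (-h) % patch  # rows of zeros needed to reach a multiple of `patch`
    pad_w = (-w) % patch  # columns of zeros needed
    padded = np.pad(image, ((0, pad_h), (0, pad_w), (0, 0)), mode="constant")
    rows, cols = padded.shape[0] // patch, padded.shape[1] // patch
    # (rows, patch, cols, patch, C) -> (rows, cols, patch, patch, C)
    blocks = padded.reshape(rows, patch, cols, patch, c).transpose(0, 2, 1, 3, 4)
    return blocks.reshape(rows * cols, patch, patch, c)

image = np.ones((40, 50, 3), dtype=np.float32)
patches = split_into_patches(image, patch=16)
print(patches.shape)  # N patches in row-major order over the patch grid
```

Patches are returned in row-major order, so the row and column index of the p-th patch can be recovered from `p` and the number of columns in the patch grid.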

[0025] S11-2, extract features from each of the multiple sub-objects to obtain the initial features corresponding to each sub-object.

[0026] In this embodiment of the application, after dividing the visual object into multiple sub-objects, feature extraction can be performed on each sub-object to obtain the initial features corresponding to each sub-object.

[0027] Specifically, to extract features from each of the multiple sub-objects and obtain the initial features, each sub-object is first flattened into a one-dimensional vector, and a mapping matrix and a first bias vector are obtained. Then, the product of the mapping matrix and the one-dimensional vector of each sub-object is calculated to obtain the first intermediate vector of each sub-object. Finally, the sum of the first intermediate vector and the first bias vector is calculated as the initial feature of each sub-object. Optionally, the mapping matrix can be a linear mapping weight matrix.

[0028] Continuing with the previous example, the dimension of each image patch is $P \times P \times C$. Each image patch is flattened to obtain a one-dimensional vector of length $P^2 \cdot C$. Then, the product of the mapping matrix and the one-dimensional vector of each image patch is calculated to obtain the first intermediate vector. Finally, the sum of the first intermediate vector of each image patch and the pre-configured first bias vector is calculated as the initial feature of each image patch.

[0029] Specifically, the initial features of each image patch (i.e., sub-object) can be calculated using the following formula:

$$e_p = W_e \cdot \mathrm{Flatten}(x_p) + b_e$$

[0030] where $x_p$ is the p-th sub-object, $\mathrm{Flatten}(\cdot)$ is the flattening operation, $W_e$ is the mapping matrix, $b_e$ is the first bias vector, and $e_p$ is the initial feature of the p-th sub-object.

[0031] It should be noted that the dimension of the mapping matrix can be $D \times (P^2 \cdot C)$, where D can be the feature dimension of the visual feature, for example, 768, and the feature dimension of the initial feature is $D$.
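As a rough sketch of step S11-2, the flattening and linear mapping can be written with NumPy as below. The variable names (`W_e`, `b_e`, `initial_features`) and the random placeholder weights are illustrative assumptions; in practice the mapping matrix and bias would be learned parameters of the visual encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
P, C, D = 16, 3, 768   # patch side length, channels, feature dimension
N = 12                 # number of sub-objects (image patches)

patches = rng.standard_normal((N, P, P, C))        # stand-in for real patches
W_e = rng.standard_normal((D, P * P * C)) * 0.02   # placeholder mapping matrix
b_e = np.zeros(D)                                  # placeholder first bias vector

flat = patches.reshape(N, -1)            # flatten each patch: (N, P*P*C)
initial_features = flat @ W_e.T + b_e    # one D-dimensional feature per patch
print(initial_features.shape)
```

Computing `flat @ W_e.T` applies the same mapping matrix to every flattened patch in one matrix product, which is equivalent to multiplying each patch vector by the mapping matrix individually.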

[0032] S11-3, perform position encoding on each of the multiple sub-objects to obtain the position features corresponding to each sub-object.

[0033] In the specific execution process, since the division of sub-objects ignores spatial location information, in addition to extracting features from each sub-object in multiple sub-objects to obtain the initial features corresponding to each sub-object, the position encoding of each sub-object in multiple sub-objects is also performed to obtain the position features corresponding to each sub-object, thereby preserving the spatial structure information of the sub-objects.

[0034] In the embodiments of this application, to perform position encoding on each of the multiple sub-objects, the position information of each sub-object in the visual object is obtained, together with a position weight matrix and a second bias vector. The product of the position weight matrix and the position information is calculated to obtain the second intermediate vector. The sum of the second intermediate vector and the second bias vector is then calculated as the third intermediate vector, and the sine of the third intermediate vector is calculated to obtain the position features.

[0035] During the actual execution process, the position information of the sub-object can be generated based on the row and column information of the sub-object in the visual object.

[0036] For example, position features can be calculated using the following formula:

$$\mathrm{PE}_p = \sin\left(W_{pos} \cdot p + b_{pos}\right)$$

[0037] where $W_{pos}$ is the position weight matrix, $b_{pos}$ is the second bias vector, $p$ represents the position information of the sub-object, and $\mathrm{PE}_p$ is the positional feature corresponding to the p-th sub-object. It should be noted that the dimension of the positional feature corresponding to the sub-object is the same as the dimension of the initial feature mentioned above.
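The position encoding of step S11-3 can be illustrated with the short sketch below. It assumes that the position information of a sub-object is the two-element vector (row index, column index), which is one reading of "generated based on the row and column information"; the function name `position_features` and the random weights are hypothetical.

```python
import numpy as np

def position_features(rows: int, cols: int,
                      W_pos: np.ndarray, b_pos: np.ndarray) -> np.ndarray:
    """Return sin(W_pos @ position + b_pos) for every patch on a rows x cols grid.

    Assumption: the position information is the vector (row, column).
    """
    ii, jj = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    positions = np.stack([ii.ravel(), jj.ravel()], axis=1).astype(np.float64)  # (N, 2)
    return np.sin(positions @ W_pos.T + b_pos)  # (N, D)

rng = np.random.default_rng(0)
D = 768
W_pos = rng.standard_normal((D, 2))  # placeholder position weight matrix
b_pos = np.zeros(D)                  # placeholder second bias vector
pos_feats = position_features(3, 4, W_pos, b_pos)
print(pos_feats.shape)
```

Because the output has the same dimension as the initial features, the two can be added element-wise to form the feature of each sub-object.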

[0038] S11-4: Generate features for each sub-object based on its initial features and positional features, and obtain an initial feature set consisting of the features of multiple sub-objects.

[0039] Based on the initial features and positional features of each sub-object obtained above, the features of each sub-object are first generated according to the initial features and positional features of each sub-object.

[0040] In the specific execution process, when generating the features of each sub-object based on its initial features and positional features, the initial features and positional features of each sub-object can be added together to obtain the features of each sub-object.

[0041] Furthermore, an initial feature set consisting of the features of multiple sub-objects is obtained. In the process of obtaining the initial feature set consisting of the features of multiple sub-objects, a feature sequence consisting of the features of multiple sub-objects can be obtained as the initial feature set.

[0042] For example, the initial feature set can be $Z = [z_1, z_2, \dots, z_N]$, where $Z$ has a dimension of $N \times D$.

[0043] The above provides a detailed explanation of the process of obtaining the corresponding initial feature set from the visual object.

[0044] In this embodiment of the application, the initial feature set corresponding to the visual object can be implemented by a visual encoder. Specifically, in the process of obtaining the initial feature set of the visual object in the multimodal data, the visual object in the multimodal data can be input into the visual encoder for encoding, and the initial feature set output by the visual encoder can be obtained.

[0045] S12, based on the principal component analysis algorithm, reconstruct each feature in the initial feature set to obtain the reconstructed features of each feature, and obtain the reconstruction error of each feature based on the reconstructed features of each feature.

[0046] In the specific execution process, if a feature lies on the dominant structure of the visual object, its reconstruction error is small. If a feature is composed of noise, it is difficult for the low-dimensional model to reconstruct it accurately, and its reconstruction error is large. Based on this, the embodiments of this application obtain the reconstruction error, which reflects how difficult it is to represent a feature accurately in the low-dimensional space, in order to distinguish information from noise.

[0047] In this embodiment, each feature in the initial feature set can first be reconstructed using principal component analysis (PCA) to obtain the reconstructed features of each feature. Then, the reconstruction error of each feature can be obtained based on the reconstructed features. Optionally, the reconstruction error is used to characterize the feature differences before and after reconstruction.

[0048] Specifically, to reconstruct each feature in the initial feature set based on the principal component analysis (PCA) algorithm, each feature can first be processed using PCA to obtain the low-dimensional projection features corresponding to each feature. A low-dimensional projection matrix is then generated from these low-dimensional projection features, and the transpose of the low-dimensional projection matrix is calculated as an intermediate projection matrix. Next, each feature is mapped to low dimension based on the intermediate projection matrix to obtain its low-dimensional mapping feature. Finally, the reconstructed feature of each feature is determined based on the low-dimensional projection matrix and the corresponding low-dimensional mapping feature. In this way, the low-dimensional mapping of each feature is made more effective, which in turn improves the effectiveness of the reconstructed feature determined for each feature.

[0049] Specifically, the initial features in the initial feature set can be reduced in dimensionality using the principal component analysis algorithm to obtain a low-dimensional projection matrix. Then, the transpose of the low-dimensional projection matrix is calculated as an intermediate projection matrix. Based on the intermediate projection matrix, each feature is mapped to low dimension to obtain the low-dimensional mapping feature corresponding to each feature. Finally, based on the low-dimensional projection matrix and the low-dimensional mapping feature corresponding to each feature, the reconstructed feature corresponding to each feature is determined.

[0050] For example, when the initial feature set consists of feature sequences, PCA (Principal Component Analysis) is used as a dimensionality reduction method to reduce the dimensionality of the feature sequences and obtain a low-dimensional projection matrix. Then, the reconstructed features corresponding to each feature are calculated using the following formula:

$$\hat{z}_p = U \left( U^{\top} z_p \right)$$

[0051] where $U$ is the low-dimensional projection matrix, $U^{\top}$ is its transpose (the intermediate projection matrix), $U^{\top} z_p$ is the low-dimensional mapping feature, $z_p$ is the p-th feature among the multiple features, and $\hat{z}_p$ is its reconstructed feature.

[0052] Based on the reconstructed features corresponding to each feature, the reconstruction error of each feature is obtained. Specifically, in the process of obtaining the reconstruction error of each feature based on its reconstructed features, the square of the Euclidean distance between each feature and its corresponding reconstructed feature can be calculated as the reconstruction error of each feature.

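[0053] For example, the reconstruction error of a feature can be calculated using the following formula:

$$r_p = \left\| z_p - \hat{z}_p \right\|_2^2$$

where $z_p$ is the p-th feature, $\hat{z}_p$ is its reconstructed feature, and $r_p$ is the reconstruction error of the p-th feature.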

[0054] It should be noted that the larger the reconstruction error of a feature, the greater the probability that the feature corresponds to noise; conversely, the smaller the reconstruction error, the greater the probability that the feature corresponds to valid information.
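Step S12 can be sketched as follows, under stated assumptions. The projection matrix is taken to be the top `m` principal directions obtained from a singular value decomposition, and the features are mean-centered before projection, as is standard for PCA; the text does not say whether centering is applied, so the `center` flag is provided to switch it off. The function name `pca_reconstruction_errors` and the toy data are hypothetical.

```python
import numpy as np

def pca_reconstruction_errors(Z, m: int, center: bool = True) -> np.ndarray:
    """Squared Euclidean reconstruction error of each row of Z (N x D)
    after projecting onto the top-m principal directions and mapping back."""
    Z = np.asarray(Z, dtype=np.float64)
    X = Z - Z.mean(axis=0) if center else Z   # centering is an assumption
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    U = Vt[:m].T          # low-dimensional projection matrix, shape (D, m)
    low = X @ U           # low-dimensional mapping features, i.e. U^T x per row
    recon = low @ U.T     # reconstructed features, i.e. U (U^T x) per row
    return np.sum((X - recon) ** 2, axis=1)

# Five features lie in a 2-D plane; the last one has an out-of-plane component.
Z = np.array([[3, 0, 0, 0], [-3, 0, 0, 0], [0, 3, 0, 0],
              [0, -3, 0, 0], [2, 2, 0, 0], [1, 1, 1, 0]], dtype=float)
errors = pca_reconstruction_errors(Z, m=2)
print(errors.round(3), int(errors.argmax()))
```

In this toy example the feature with the out-of-plane component receives the largest error, which matches the observation that features poorly captured by the low-dimensional structure are more likely to correspond to noise.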

[0055] S13, determine the correlation difference index of each feature based on the positional relationship between the sub-objects of the visual object.

[0056] In images, the information entropy distribution across different spatial regions exhibits highly uneven characteristics. Target edges, texture abrupt changes, and / or semantically key regions in an image typically contain rich detail and semantic information, exhibiting high information entropy. Conversely, large background areas, smooth areas, and / or areas with repetitive textures contain less effective information, resulting in lower information entropy. Based on this, local statistics can effectively characterize the differences in information density among different features, providing a theoretical basis for filtering redundant features. Therefore, in this embodiment, after obtaining the initial feature set, in addition to acquiring the reconstruction error of each feature, the association difference index of each feature is also acquired. The association difference index in this embodiment can be used to characterize the spatial visual differences between the sub-objects corresponding to the features and their adjacent sub-objects.

[0057] In the specific execution process, the adjacent sub-objects of each sub-object can be obtained first based on the positional relationship between the sub-objects of the visual object. Then, based on the features corresponding to each sub-object and the features corresponding to its adjacent sub-objects, the correlation difference index of the features corresponding to each sub-object can be determined. In this way, combined with the sub-object and its adjacent sub-objects, the local neighborhood relationship of the features can be analyzed from the local information of the sub-object. In this way, the calculated correlation difference index can effectively characterize the texture complexity and information density of the region where the feature is located, thereby improving the effectiveness and accuracy of the obtained correlation difference index.

[0058] The following section details the process of obtaining the neighboring sub-objects of each sub-object based on the positional relationship between the sub-objects of a visual object, and determining the correlation difference index of the corresponding features of each sub-object based on the features corresponding to each sub-object and the features corresponding to its neighboring sub-objects.

[0059] (1) Obtain the adjacent sub-objects of each sub-object based on the positional relationship between the sub-objects of the visual object.

[0060] Specifically, in the process of obtaining the adjacent sub-objects of each sub-object based on the positional relationship between the sub-objects of a visual object, any sub-object that falls within the preset spatial window together with a given sub-object during the window sliding process can be determined as an adjacent sub-object of that sub-object.

[0061] In the specific execution process, a preset spatial window can be obtained first. Then, based on the preset spatial window and positional relationships, adjacent sub-objects of each sub-object can be obtained from multiple sub-objects of the visual object. Optionally, the adjacent sub-objects of any sub-object are the sub-objects that are simultaneously located within the preset spatial window during the window sliding process. In this way, sub-objects that are simultaneously located within the preset spatial window with a sub-object are determined as its adjacent sub-objects. This ensures that the determined adjacent sub-objects and the sub-object can be located within the same preset spatial window at the same time, avoiding the determination of sub-objects that are far away from the sub-object as its adjacent sub-objects, and improving the correlation between the obtained adjacent sub-objects.

[0062] For example, the position information of the p-th sub-object is $(i, j)$, where i is the row index and j is the column index; using a $k \times k$ preset spatial window, the adjacent sub-objects that fall within the preset spatial window together with the p-th sub-object during the window sliding process are determined. Here, k can be 3 or 5, or can be configured according to actual needs; this embodiment of the application does not limit it.

[0063] (2) Based on the features corresponding to each sub-object and the features corresponding to its adjacent sub-objects, determine the correlation difference index of the features corresponding to each sub-object.

[0064] In the specific implementation process, when determining the correlation difference index of the corresponding features of each sub-object based on the features corresponding to each sub-object and the features corresponding to its neighboring sub-objects, the reference features of each sub-object can be calculated first based on the features corresponding to its neighboring sub-objects. Then, the distance between the features corresponding to each sub-object and the reference features is calculated to obtain the reference distance of each sub-object. Then, the number of neighboring sub-objects of each sub-object is obtained, and the ratio of the reference distance of each sub-object to the number of neighboring sub-objects is calculated as the correlation difference index of the corresponding features of each sub-object. In this way, the reference features of each sub-object are calculated first from the features corresponding to its neighboring sub-objects, and then subsequent calculations are performed based on the reference features, rather than processing individual neighboring sub-objects, which improves the robustness of the calculation. On this basis, the difference between the features and the features of neighboring sub-objects is quantified, which improves the effectiveness and robustness of the determined correlation difference index.

[0065] Specifically, for any sub-object, the average value of the features corresponding to its neighboring sub-objects can be calculated as a reference feature. Therefore, for the correlation difference index of features, the average feature within the neighborhood is used instead of the feature of a single neighboring sub-object, improving the robustness of subsequent calculations. Then, the square of the Euclidean distance between the feature corresponding to the sub-object and the reference feature is calculated as the reference distance for the sub-object. This avoids the square root operation in the Euclidean distance calculation process, improving computational efficiency while preserving physical meaning. Finally, the ratio of the reference distance of the sub-object to the number of neighboring sub-objects is calculated as the correlation difference index of the feature corresponding to the sub-object. By introducing the number of neighboring sub-objects, the influence of neighborhood size on the correlation difference index is eliminated, making the correlation difference indices calculated from neighborhoods of different sizes comparable and improving the stability of the obtained correlation difference index.

[0066] It should be noted that the larger the calculated correlation difference index, the more complex the texture of the feature region and the higher the information density.

[0067] For example, the correlation difference index of a feature can be calculated using the following formula:

$$D_i = \frac{1}{k}\left\| x_i - \frac{1}{k}\sum_{q=1}^{k} x_q \right\|_2^2$$

[0068] where $D_i$ is the correlation difference index of the feature $x_i$, $x_q$ is the feature corresponding to the q-th adjacent sub-object among the adjacent sub-objects, and $k$ is the number of adjacent sub-objects of the sub-object corresponding to this feature.
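The calculation described in paragraphs [0064] and [0065] can be illustrated with a minimal NumPy sketch. It assumes the sub-object features are arranged on an (H, W, C) grid and that the neighbors of a sub-object are all other sub-objects inside a square window centered on it. The function name, the grid layout, and the centered-window rule are illustrative assumptions, not details stated in this application.

```python
import numpy as np


def correlation_difference_index(features: np.ndarray, window: int = 3) -> np.ndarray:
    """Return an (H, W) array of indices for an (H, W, C) grid of features.

    Assumption: neighbors are the other grid cells inside a window of the
    given size centered on the cell; border cells simply have fewer neighbors.
    """
    h, w, _ = features.shape
    r = window // 2
    index = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            neighbors = [
                features[a, b]
                for a in range(max(0, i - r), min(h, i + r + 1))
                for b in range(max(0, j - r), min(w, j + r + 1))
                if (a, b) != (i, j)
            ]
            k = len(neighbors)
            if k == 0:
                continue  # a single-cell grid has no neighbors; index stays 0
            reference = np.mean(neighbors, axis=0)  # reference feature (neighborhood mean)
            ref_dist = np.sum((features[i, j] - reference) ** 2)  # squared Euclidean distance
            index[i, j] = ref_dist / k  # ratio to the number of neighbors
    return index
```

Because border cells have fewer neighbors than interior cells, the division by k is what keeps their indices on a comparable scale in this sketch.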

[0069] It should be noted that, in the embodiments of this application, step S12 can be executed first, which reconstructs each feature in the initial feature set based on the principal component analysis algorithm to obtain the reconstructed features of each feature, and obtains the reconstruction error of each feature based on the reconstructed features of each feature. Then, step S13 can be executed to determine the correlation difference index of each feature based on the positional relationship between the sub-objects of the visual object. Alternatively, step S13 can be executed first, followed by step S12, or step S12 and step S13 can be executed simultaneously. The embodiments of this application do not limit the execution order of step S12 and step S13.

[0070] S14. Determine the importance of each feature based on the reconstruction error and correlation difference index of each feature.

[0071] Based on the reconstruction error and correlation difference index of each feature, the importance of each feature can be determined by combining the reconstruction error and correlation difference index. In this way, the two dimensions of data, reconstruction error and correlation difference index, are introduced to determine the importance, thereby improving the effectiveness of the determination of importance.

[0072] In this embodiment of the application, in the process of determining the importance of each feature based on the reconstruction error and correlation difference index of each feature, the reconstruction error and correlation difference index of each feature can be normalized first to obtain the target reconstruction error and target correlation difference index of each feature. In this way, the reconstruction error and correlation difference index are mapped to the range of 0 to 1 through normalization, so as to avoid the deviation in the process of calculating the importance caused by the difference in the dimensions of the reconstruction error and correlation difference index. Furthermore, the reconstruction weight and correlation difference weight are obtained. Based on the reconstruction weight, correlation difference weight, target reconstruction error and target correlation difference index of each feature, the importance of each feature is calculated. In this way, the importance is calculated by normalizing the target reconstruction error and target correlation difference index, thereby improving the effectiveness of the calculated importance.

[0073] For example, let the reconstruction error of a feature be $E_i$ and the correlation difference index of the feature be $D_i$. The reconstruction error and the correlation difference index are normalized using the min-max normalization method, resulting in:

[0074] $$\hat{D}_i = \frac{D_i - D_{\min}}{D_{\max} - D_{\min}}, \qquad \hat{E}_i = \frac{E_i - E_{\min}}{E_{\max} - E_{\min}}$$

[0075] where $\hat{D}_i$ is the target correlation difference index, $\hat{E}_i$ is the target reconstruction error, $D_{\min}$ is the smallest correlation difference index among the multiple features, $D_{\max}$ is the largest correlation difference index among the multiple features, $E_{\min}$ is the minimum reconstruction error among the multiple features, and $E_{\max}$ is the maximum reconstruction error among the multiple features. Based on the target correlation difference index and the target reconstruction error, a fuzzy weighted fusion function can be used to calculate the importance of each feature. The correlation difference weight $\alpha$ and the reconstruction weight $\beta$ can be set in the fuzzy weighted fusion function. The importance $S_i$ can be calculated using the following formula:

$$S_i = \hat{D}_i^{\,\alpha}\left(1 - \hat{E}_i\right)^{\beta}$$

[0076] It should be noted that the association difference weight and reconstruction weight can each be 0.5, or they can be configured according to actual needs or obtained through model training. This embodiment of the application does not limit this. Optionally, the sum of the association difference weight and reconstruction weight can be a preset value.

[0077] Specifically, when calculating the importance of each feature based on the reconstruction weight, the correlation difference weight, the target reconstruction error, and the target correlation difference index, the difference between the preset value and the target reconstruction error can be calculated first, and this difference raised to the power of the reconstruction weight is taken as the first parameter. Likewise, the target correlation difference index raised to the power of the correlation difference weight is taken as the second parameter. The product of the first parameter and the second parameter is then calculated as the importance.

[0078] It should be noted that, since a larger reconstruction error corresponds to a lower probability that the feature carries valid information, the first parameter is built from the difference between the preset value 1 and the target reconstruction error, raised to the power of the reconstruction weight.
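The normalization and fusion steps of paragraphs [0072] to [0078] can be sketched as follows. The sketch takes one reading of the description, a weighted geometric fusion in which each normalized quantity is raised to the power of its weight. That reading, the function names, and the handling of a constant input vector are assumptions made for illustration.

```python
import numpy as np


def min_max(values: np.ndarray) -> np.ndarray:
    """Map values to [0, 1]. A constant vector maps to zeros (assumed convention)."""
    span = values.max() - values.min()
    if span == 0:
        return np.zeros_like(values, dtype=float)
    return (values - values.min()) / span


def importance(recon_error, diff_index, w_recon=0.5, w_diff=0.5, preset=1.0):
    """Fuse the normalized reconstruction error and correlation difference index."""
    e = min_max(np.asarray(recon_error, dtype=float))  # target reconstruction error
    d = min_max(np.asarray(diff_index, dtype=float))   # target correlation difference index
    first = (preset - e) ** w_recon   # low reconstruction error -> large first parameter
    second = d ** w_diff              # high local difference -> large second parameter
    return first * second
```

With both weights set to 0.5 this reduces to the geometric mean of `1 - e` and `d`, so a feature scores highly only when it is both well reconstructed and locally distinctive.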

[0079] S15: Filter the initial feature set according to its importance to obtain the target feature set.

[0080] In the embodiments of this application, based on determining the importance of each feature, the initial feature set can be filtered according to the importance to obtain the target feature set.

[0081] In practice, an importance threshold can be pre-configured. When the initial feature set is filtered according to importance, features whose importance is less than the importance threshold are deleted, resulting in a target feature set composed of the features whose importance is greater than or equal to the importance threshold. The importance threshold can be 0.5, or it can be configured according to actual needs; this embodiment does not limit this.
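The threshold-based filtering can be illustrated with a short sketch. It assumes the features are stored as rows of an (N, C) array with one importance score per row. The function name and the choice to return the kept indices are illustrative.

```python
import numpy as np


def filter_features(features: np.ndarray, importance: np.ndarray, threshold: float = 0.5):
    """Keep the rows whose importance is greater than or equal to the threshold."""
    keep = np.asarray(importance) >= threshold
    # Returning the kept indices lets a caller recover the original positions.
    return features[keep], np.flatnonzero(keep)
```

Returning the indices alongside the target feature set is a design choice for cases where a downstream module still needs to know which sub-object each retained feature came from.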

[0082] S16, perform model inference on multimodal data based on the target feature set.

[0083] Based on the obtained target feature set, model inference can be performed on the multimodal data. Specifically, in the process of model inference on multimodal data based on the target feature set, a reference feature set can first be obtained. The reference feature set consists of the feature sets corresponding to data objects other than visual objects in the multimodal data. Then, the target feature set and the reference feature set are input into the inference module for multimodal data inference. In this way, once a valid and relatively small target feature set has been obtained, model inference is performed by combining it with the feature sets corresponding to the other data objects. This reduces the computational complexity and memory usage of the model inference process while preserving the comprehensiveness of the features used for inference. Furthermore, since the more effective features are retained in the target feature set, model inference performance is improved.

[0084] The model inference method provided in this application can be applied to multimodal processing models. Multimodal processing models include large models capable of simultaneously understanding and processing multiple types or forms of data. The multimodal processing model in this application can be used to generate text-based response data based on visual objects.

[0085] For example, taking multimodal data that includes visual objects and text objects, as shown in Figure 3, the multimodal processing model may include a visual encoder, a filtering module, a text encoder, and an inference module. The visual encoder encodes visual objects to obtain an initial feature set; the filtering module executes steps S12 to S15 to obtain a target feature set; the text encoder encodes text objects to obtain a text feature set; and the inference module performs model inference on the multimodal data based on the target feature set and the text feature set. The inference module can use a transformer architecture, but this embodiment does not limit its implementation.

[0086] Specifically, the filtering module may include a reconstruction analysis submodule, an association difference analysis submodule, an importance analysis submodule, and a filtering submodule. The reconstruction analysis submodule can be used to reconstruct each feature in the initial feature set based on the principal component analysis algorithm, obtain the reconstructed features of each feature, and obtain the reconstruction error of each feature based on the reconstructed features. The association difference analysis submodule can be used to determine the association difference index of each feature based on the positional relationship between the sub-objects of the visual object. The importance analysis submodule is used to determine the importance of each feature based on the reconstruction error and association difference index of each feature. The filtering submodule can be used to filter the initial feature set according to the importance to obtain the target feature set.

[0087] When the model inference method provided in this embodiment is applied to a multimodal processing model, the multimodal processing model can be adjusted as follows: obtain sample data and sample labels; obtain, through the multimodal processing model, the inference result corresponding to the sample data and the feature importance of each sample feature; calculate a first loss based on the feature importance, and calculate a second loss based on the sample labels and the inference result; and adjust the multimodal processing model based on the first loss and the second loss.

[0088] Optionally, each sample feature corresponds to a sample visual sub-object contained in the sample data. The sample visual sub-object includes sub-objects obtained by segmenting the sample visual object.

[0089] In practice, a first loss calculated based on feature importance and a second loss calculated based on sample labels and inference results are introduced to jointly adjust the multimodal processing model, thereby improving the effectiveness and efficiency of screening while enhancing model performance.

[0090] Specifically, based on sample labels and inference results, the second loss can be calculated as either classification loss or retrieval loss; for example, the second loss could be cross-entropy loss or contrastive learning loss. The first loss, calculated based on feature importance, can be the filtering regularization loss. Specifically, in calculating the first loss based on feature importance, the average value of the feature importance can be used as the first loss.

[0091] For example, if there are N sample features, the first loss is calculated using the following formula:

$$L_1 = \frac{1}{N}\sum_{p=1}^{N} S_p$$

[0092] where $L_1$ is the first loss and $S_p$ is the feature importance of the p-th sample feature; the calculated second loss is denoted $L_2$.

[0093] In this embodiment, when adjusting the multimodal processing model based on the first loss and the second loss, the target loss can first be calculated from the first loss and the second loss. The adjustment then proceeds in two stages. In the first stage, the second model parameters, that is, the parameters other than the first model parameters, are frozen, and the first model parameters of the multimodal processing model are adjusted based on the target loss to obtain an intermediate model. The first model parameters are those used to control the acquisition of the reconstruction error, the determination of the correlation difference index, and the determination of the importance; in other words, they are the parameters of the filtering module, so this stage trains the model's screening ability. In the second stage, the global parameters of the intermediate model are adjusted based on the target loss, so that model performance is improved by jointly training all the parameters of the intermediate model.

[0094] Specifically, in the process of calculating the target loss based on the first loss and the second loss, the product of the first loss and the pre-configured loss coefficient can be calculated first to obtain the intermediate loss, and then the sum of the intermediate loss and the second loss can be calculated as the target loss.

[0095] Using the previous example, if the first loss is $L_1$ and the second loss is $L_2$, the target loss can be:

$$L = \lambda L_1 + L_2$$

[0096] where $\lambda$ is the pre-configured loss coefficient; for example, $\lambda$ can be 0.1.
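The loss construction of paragraphs [0090] to [0096] can be sketched as below. The sketch assumes the second loss is a cross-entropy loss over classification logits, which is one of the options the description names. The function names and the single-sample interface are illustrative assumptions.

```python
import numpy as np


def first_loss(importance) -> float:
    """Filtering regularization loss: the mean feature importance."""
    return float(np.mean(importance))


def cross_entropy(logits, label: int) -> float:
    """Cross-entropy of one sample, computed with a numerically stable log-softmax."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return float(-log_probs[label])


def target_loss(importance, logits, label: int, coeff: float = 0.1) -> float:
    """Target loss = coeff * first loss + second loss."""
    return coeff * first_loss(importance) + cross_entropy(logits, label)
```

Since the first loss is the mean importance, minimizing it pushes importance scores down overall, while the second loss penalizes discarding features the task needs; the coefficient sets the balance between the two.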

[0097] The procedure above calculates the target loss from the first loss and the second loss, freezes the second model parameters, adjusts the first model parameters of the multimodal processing model based on the target loss to obtain the intermediate model, then unfreezes the second model parameters and adjusts the intermediate model based on the target loss. As an alternative, the multimodal processing model can be adjusted by first freezing the second model parameters and adjusting the first model parameters based on the first loss to obtain a second intermediate model, and then unfreezing the second model parameters and adjusting the second intermediate model based on the second loss.
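The two-stage schedule (freeze the second model parameters, adjust the first model parameters, then unfreeze and adjust everything) can be illustrated with a toy gradient-descent sketch. The parameter dictionary, the `grad_fn` callback, and the plain gradient-descent update are hypothetical stand-ins for a real training framework.

```python
import numpy as np


def train_two_stage(params, grad_fn, first_keys, steps=50, lr=0.1):
    """Toy staged training.

    params:     dict mapping parameter name -> array
    grad_fn:    callable returning a dict of gradients of the target loss
    first_keys: names of the first model parameters (the filtering module)
    """
    params = {k: np.array(v, dtype=float) for k, v in params.items()}
    # Stage 1: every parameter outside first_keys is frozen (never updated).
    for _ in range(steps):
        grads = grad_fn(params)
        for k in first_keys:
            params[k] = params[k] - lr * grads[k]
    intermediate = {k: v.copy() for k, v in params.items()}
    # Stage 2: all parameters are unfrozen and adjusted jointly.
    for _ in range(steps):
        grads = grad_fn(params)
        for k in params:
            params[k] = params[k] - lr * grads[k]
    return intermediate, params
```

In a deep-learning framework the same schedule is usually expressed by toggling which parameters receive gradient updates rather than by selecting dictionary keys.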

[0098] It should be noted that, in addition to being applied to multimodal processing models, the embodiments of this application can also be applied to visual object processing models, such as image classification models, object detection models, and / or semantic segmentation models. In this case, steps S11 to S16 can be replaced by: obtaining an initial feature set corresponding to the visual object; reconstructing each feature in the initial feature set based on the principal component analysis algorithm to obtain the reconstructed features of each feature; obtaining the reconstruction error of each feature based on the reconstructed features of each feature; determining the correlation difference index of each feature based on the positional relationship between the sub-objects of the visual object; determining the importance of each feature based on the reconstruction error and correlation difference index of each feature; filtering the initial feature set according to the importance to obtain a target feature set; and performing model inference on the visual object based on the target feature set.

[0099] It should also be noted that, in addition to using the above method to filter the initial feature set corresponding to the visual object to obtain the target feature set of the visual object, a similar method can also be used to filter audio objects to obtain the target feature set of the audio object. During the filtering process of the audio object, the aforementioned preset spatial window can be modified to a preset time window. Other processing procedures are similar to those described above, and can be referred to the relevant procedures. This application embodiment does not impose limitations here. Alternatively, text objects can also be filtered to obtain the target feature set of the text object. During the filtering process of the text object, the aforementioned preset spatial window can also be modified to a preset time window.
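For audio objects, replacing the spatial window with a time window might look like the following sketch. It assumes the audio features form a (T, C) sequence and that the neighbors of a frame are the other frames inside a window centered on it. The function name and the centering rule are illustrative assumptions.

```python
import numpy as np


def temporal_difference_index(features: np.ndarray, window: int = 3) -> np.ndarray:
    """Correlation difference index over a (T, C) sequence with a time window."""
    t = features.shape[0]
    r = window // 2
    out = np.zeros(t)
    for i in range(t):
        idx = [j for j in range(max(0, i - r), min(t, i + r + 1)) if j != i]
        if not idx:
            continue  # a one-frame sequence has no neighbors
        reference = features[idx].mean(axis=0)  # mean of neighboring frames
        out[i] = np.sum((features[i] - reference) ** 2) / len(idx)
    return out
```

Only the neighbor lookup changes relative to the spatial case; the reference feature, squared distance, and division by the neighbor count are computed in the same way.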

[0100] The model inference method provided in this application, for each feature in the initial feature set corresponding to a visual object, on the one hand, obtains the reconstruction error used to characterize the difference before and after reconstruction; on the other hand, obtains the correlation difference index characterizing the spatial visual difference between the corresponding sub-object and adjacent sub-objects. It integrates the feature expression difficulty based on low-dimensional reconstruction error and the information density of the feature's location region to determine the importance of the feature. By jointly determining the importance by reconstruction error and information density, it avoids the limitations of a single index and improves the effectiveness of the determined importance of each feature. Then, it filters the initial feature set based on the effective importance. In this way, it filters based on more effective importance, effectively distinguishing features corresponding to information from features corresponding to noise, improving the accuracy of filtering, reducing the number of features, reducing the computational complexity and memory usage of the model, and improving the model inference efficiency.

[0101] The following uses an embodiment of this application in a multimodal data inference scenario to further explain the model inference method provided by the embodiments of this application. As shown in Figure 4, the model inference method applied to multimodal data inference scenarios includes the following steps.

[0102] S41, Obtain the initial feature set corresponding to the image object in the multimodal data.

[0103] S42, based on the principal component analysis algorithm, performs dimensionality reduction processing on the feature sequences in the initial feature set to obtain a low-dimensional projection matrix.

[0104] S43, Reconstruct each feature in the initial feature set based on the low-dimensional projection matrix to obtain the reconstructed features of each feature.

[0105] S44, obtain the reconstruction error of each feature based on the reconstruction features of each feature.

[0106] S45: Based on the positional relationship between the image blocks corresponding to each feature, obtain the adjacent features of each feature.

[0107] S46. Based on the adjacent features of each feature, determine the correlation difference index of each feature.

[0108] It should be noted that steps S42 to S44 can be executed before steps S45 to S46, after steps S45 to S46, or simultaneously with steps S45 to S46. This embodiment of the application does not limit the execution order of these steps.

[0109] S47. Determine the importance of each feature based on the reconstruction error and correlation difference index of each feature.

[0110] S48. The initial feature set is filtered according to its importance to obtain the target feature set.

[0111] S49, obtain the reference feature set corresponding to the data objects other than the image objects in the multimodal data, and perform model inference of the multimodal data based on the target feature set and the reference feature set.

[0112] It should be noted that any one or more of steps S41 to S49 can be combined with any one or more of steps S11 to S16 to form a new implementation method according to the needs of implementation and deployment. In addition, any one or more technical features in steps S41 to S49 can be selected according to the actual deployment needs and combined with any one or more technical features provided in steps S11 to S16 to form a new implementation method. Alternatively, any one or more technical features in steps S41 to S49 can be replaced with any one or more technical features provided in steps S11 to S16 to form a new implementation method according to the actual deployment needs. These will not be elaborated on here.

[0113] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods according to the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method.

[0114] Figure 5 shows a schematic diagram of the structure of a model inference device 500 provided in this application. As shown in Figure 5, the model inference device 500 of this embodiment includes: the feature set acquisition module 510, used to acquire an initial feature set corresponding to a visual object in multimodal data, where the features in the initial feature set correspond respectively to the sub-objects of the visual object; the error acquisition module 520, used to reconstruct each feature in the initial feature set based on the principal component analysis algorithm, obtain the reconstructed features of each feature, and obtain the reconstruction error of each feature based on the reconstructed features of each feature, where the reconstruction error is used to characterize the feature difference before and after reconstruction; the index determination module 530, used to determine the correlation difference index of each feature based on the positional relationship between the sub-objects of the visual object, where the correlation difference index is used to characterize the spatial visual difference between the sub-object corresponding to the feature and its adjacent sub-objects; the importance determination module 540, used to determine the importance of each feature based on the reconstruction error and correlation difference index of each feature; the filtering processing module 550, used to filter the initial feature set according to the importance to obtain the target feature set; and the model inference module 560, used to perform model inference on the multimodal data based on the target feature set.

[0115] As an optional implementation of this application, the error acquisition module 520 is specifically used for: processing each feature based on the principal component analysis algorithm to obtain the low-dimensional projection features corresponding to each feature; generating a low-dimensional projection matrix based on the low-dimensional projection features corresponding to each feature, and calculating the transpose of the low-dimensional projection matrix as an intermediate projection matrix; performing low-dimensional mapping on each feature based on the intermediate projection matrix to obtain the low-dimensional mapping features corresponding to each feature; and determining the reconstructed features corresponding to each feature based on the low-dimensional projection matrix and the low-dimensional mapping features corresponding to each feature.
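The reconstruction steps listed above (projection matrix, its transpose, low-dimensional mapping, reconstruction) can be sketched with NumPy. The sketch assumes the features are mean-centered before projection, that the principal directions come from a singular value decomposition, and that the reconstruction error is the squared Euclidean distance between a feature and its reconstruction. None of these details is fixed by the paragraph, so they should be read as assumptions.

```python
import numpy as np


def pca_reconstruction_error(features: np.ndarray, n_components: int):
    """Return (reconstructed (N, C), error (N,)) for an (N, C) feature matrix."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    projection = vt[:n_components].T    # (C, d) low-dimensional projection matrix
    intermediate = projection.T         # (d, C) intermediate projection matrix (transpose)
    mapped = centered @ intermediate.T  # (N, d) low-dimensional mapped features
    reconstructed = mapped @ projection.T + mean
    error = np.sum((features - reconstructed) ** 2, axis=1)
    return reconstructed, error
```

Features that lie close to the dominant low-dimensional subspace reconstruct almost exactly, while features that deviate from it produce a larger error.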

[0116] As an optional implementation of this application, the index determination module 530 is specifically used to: obtain the adjacent sub-objects of each sub-object according to the positional relationship between the sub-objects of the visual object; and determine the correlation difference index of the corresponding features of each sub-object based on the features corresponding to each sub-object and the features corresponding to its adjacent sub-objects.

[0117] As an optional implementation of this application, the index determination module 530, when obtaining the adjacent sub-objects of each sub-object based on the positional relationship between the sub-objects of the visual object, is specifically used to: obtain a preset space window; and, based on the preset space window and the positional relationship, obtain the adjacent sub-objects of each sub-object among the multiple sub-objects of the visual object; wherein, the adjacent sub-object of any sub-object among the sub-objects is the sub-object that is simultaneously located in the preset space window with any sub-object during the window sliding process.

[0118] As an optional implementation of this application, the index determination module 530, when determining the correlation difference index of the corresponding features of each sub-object based on the features corresponding to each sub-object and the features corresponding to its neighboring sub-objects, is specifically used for: calculating the reference features of each sub-object according to the features corresponding to the neighboring sub-objects of each sub-object; performing distance calculation on the features corresponding to each sub-object and the reference features to obtain the reference distance of each sub-object; obtaining the number of neighboring sub-objects of each sub-object, and calculating the ratio of the reference distance of each sub-object to the number of neighboring sub-objects as the correlation difference index of the corresponding features of each sub-object.

[0119] As an optional implementation of this application, the importance determination module 540 is specifically used to: normalize the reconstruction error and correlation difference index of each feature respectively to obtain the target reconstruction error and target correlation difference index of each feature; obtain the reconstruction weight and correlation difference weight; and calculate the importance of each feature based on the reconstruction weight, the correlation difference weight, and the target reconstruction error and target correlation difference index of each feature.

[0120] As an optional implementation of this application, the model inference device operates on a multimodal processing model; the device is further configured to: acquire sample data and sample labels; acquire, through the multimodal processing model, the inference result corresponding to the sample data and the feature importance of each sample feature; each sample feature corresponds to a sample visual sub-object contained in the sample data; calculate a first loss based on the feature importance, and calculate a second loss based on the sample labels and the inference result; adjust the multimodal processing model according to the first loss and the second loss.

[0121] As an optional implementation of this application, the device, when adjusting the multimodal processing model based on the first loss and the second loss, is specifically used for: calculating a target loss based on the first loss and the second loss; adjusting the first model parameters of the multimodal processing model based on the target loss to obtain an intermediate model; the first model parameters are used to control the acquisition of reconstruction error, the determination of correlation difference indicators, and the determination of importance; and adjusting the intermediate model based on the target loss.

[0122] As an optional implementation of this application, the model inference module 560 is specifically used for: obtaining a reference feature set based on the target feature set; the reference feature set is the feature set corresponding to data objects other than the visual objects in the multimodal data; and inputting the target feature set and the reference feature set into the inference module to perform model inference on the multimodal data.

[0123] For a description of the features in the embodiment corresponding to the model inference device, please refer to the relevant description of the embodiment corresponding to the model inference method, which will not be repeated here.

[0124] Embodiments of this application also provide an electronic device, including a memory and a processor, wherein the memory stores a computer program and the processor is configured to run the computer program to perform the steps in any of the above-described model inference method embodiments.

[0125] Embodiments of this application also provide a computer-readable storage medium storing a computer program, wherein the computer program is configured to execute the steps in any of the above-described model inference method embodiments at runtime.

[0126] In one exemplary embodiment, the aforementioned computer-readable storage medium may include, but is not limited to, various media capable of storing computer programs, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), portable hard disk, magnetic disk, or optical disk.

[0127] Embodiments of this application also provide a computer program product, which includes a computer program that, when executed by a processor, implements the steps in any of the above-described model inference method embodiments.

[0128] Embodiments of this application also provide another computer program product, including a non-volatile computer-readable storage medium storing a computer program, which, when executed by a processor, implements the steps in any of the above-described model inference method embodiments.

[0129] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0130] The above provides a detailed description of the model inference method and electronic device provided in this application. Specific examples have been used to illustrate the principles and implementation of this application. The descriptions of the embodiments above are intended only to help in understanding the method and its core ideas. It should be noted that those skilled in the art can make various improvements and modifications to this application without departing from its principles, and these improvements and modifications also fall within the protection scope of the claims of this application.

Claims

1. A model reasoning method, characterized in that, include: Obtain an initial feature set corresponding to a visual object in multimodal data; the features in the initial feature set correspond to the sub-objects of the visual object respectively; Based on the principal component analysis algorithm, each feature in the initial feature set is reconstructed to obtain the reconstructed features of each feature, and the reconstruction error of each feature is obtained based on the reconstructed features of each feature. The reconstruction error is used to characterize the feature difference before and after reconstruction. Based on the positional relationship between the sub-objects of the visual object, the association difference index of each feature is determined. The association difference index is used to characterize the spatial visual difference between the sub-object corresponding to the feature and its adjacent sub-objects. The importance of each feature is determined based on its reconstruction error and correlation difference index. The initial feature set is filtered according to its importance to obtain the target feature set; Model inference is performed on the multimodal data based on the target feature set.

2. The model reasoning method according to claim 1, characterized in that, The principal component analysis algorithm is used to reconstruct each feature in the initial feature set to obtain the reconstructed features of each feature, including: The principal component analysis algorithm is used to process each feature to obtain the low-dimensional projected features corresponding to each feature. A low-dimensional projection matrix is ​​generated based on the low-dimensional projection features corresponding to each feature, and the transpose of the low-dimensional projection matrix is ​​calculated as an intermediate projection matrix. Based on the intermediate projection matrix, each feature is mapped in a low dimension to obtain the low-dimensional mapped features corresponding to each feature. Based on the low-dimensional projection matrix and the low-dimensional mapping features corresponding to each feature, the reconstructed features corresponding to each feature are determined.

3. The model inference method according to claim 1, characterized in that determining the association difference index of each feature based on the positional relationship between the sub-objects of the visual object comprises: obtaining the adjacent sub-objects of each sub-object based on the positional relationship between the sub-objects of the visual object; and determining the association difference index of the feature corresponding to each sub-object based on the feature corresponding to that sub-object and the features corresponding to its adjacent sub-objects.

4. The model inference method according to claim 3, characterized in that obtaining the adjacent sub-objects of each sub-object based on the positional relationship between the sub-objects of the visual object comprises: obtaining a preset spatial window; and obtaining the adjacent sub-objects of each sub-object from the multiple sub-objects of the visual object based on the preset spatial window and the positional relationship, wherein the adjacent sub-objects of any sub-object are the sub-objects located in the preset spatial window together with that sub-object during sliding of the window.
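
The sliding-window neighbor search of claim 4 can be sketched as follows. The sketch assumes the sub-objects are image patches laid out on a regular grid and indexed in row-major order, that the window is square, and that it slides with a fixed stride; the names `window_neighbors`, `window` and `stride` are hypothetical.

```python
def window_neighbors(rows, cols, window=2, stride=1):
    """Map each patch index (row-major) to the sorted indices of the patches
    that share at least one window position with it."""
    neighbors = {i: set() for i in range(rows * cols)}
    for top in range(0, rows - window + 1, stride):
        for left in range(0, cols - window + 1, stride):
            # All patch indices covered by the window at this position.
            inside = [r * cols + c
                      for r in range(top, top + window)
                      for c in range(left, left + window)]
            for i in inside:
                neighbors[i].update(j for j in inside if j != i)
    return {i: sorted(s) for i, s in neighbors.items()}

# 3 x 3 grid of patches, 2 x 2 window, stride 1.
nbrs = window_neighbors(3, 3, window=2)
```

In this example the corner patch 0 has three neighbors, while the center patch 4 appears in every window position and is therefore adjacent to all other patches.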

5. The model inference method according to claim 4, characterized in that determining the association difference index of the feature corresponding to each sub-object based on the feature corresponding to that sub-object and the features corresponding to its adjacent sub-objects comprises: calculating a reference feature of each sub-object based on the features corresponding to the adjacent sub-objects of that sub-object; calculating the distance between the feature corresponding to each sub-object and its reference feature to obtain a reference distance of each sub-object; and obtaining the number of adjacent sub-objects of each sub-object, and calculating the ratio of the reference distance of each sub-object to its number of adjacent sub-objects as the association difference index of the feature corresponding to that sub-object.
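
The computation in claim 5 can be sketched as below. The claim does not fix how the reference feature or the distance is computed, so this sketch assumes the reference feature is the mean of the neighbor features and the distance is Euclidean; a sub-object without neighbors is given an index of 0, which is also an assumption.

```python
import numpy as np

def association_difference_index(features, neighbors):
    """Per-feature ratio of (distance to the neighbor reference feature)
    to (number of neighbors).

    `neighbors` maps a feature index to a list of neighbor indices.
    """
    features = np.asarray(features, dtype=float)
    index = np.zeros(features.shape[0])
    for i, nbr in neighbors.items():
        if not nbr:
            continue  # no neighbors: index left at 0 (assumption)
        reference = features[nbr].mean(axis=0)                # reference feature
        distance = np.linalg.norm(features[i] - reference)   # reference distance
        index[i] = distance / len(nbr)
    return index

features = np.array([[0.0, 0.0], [0.0, 0.0], [3.0, 4.0]])
neighbors = {0: [1], 1: [0, 2], 2: [1]}
adi = association_difference_index(features, neighbors)
```

The third feature differs most from its single neighbor and so receives the largest index, while the first feature is identical to its neighbor and receives 0.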

6. The model inference method according to claim 1, characterized in that determining the importance of each feature based on its reconstruction error and association difference index comprises: normalizing the reconstruction error and the association difference index of each feature respectively to obtain a target reconstruction error and a target association difference index of each feature; obtaining a reconstruction weight and an association difference weight; and calculating the importance of each feature based on the reconstruction weight, the association difference weight, and the target reconstruction error and target association difference index of that feature.
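
A minimal sketch of the importance calculation in claim 6 follows. The claim names neither the normalization nor the combination rule, so min-max normalization and a weighted sum are assumptions here, as are the default weights of 0.5 and the names `min_max` and `importance_scores`.

```python
import numpy as np

def min_max(x, eps=1e-12):
    """Scale values to [0, 1]; a constant input maps to all zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span < eps else (x - x.min()) / span

def importance_scores(recon_error, assoc_index, w_recon=0.5, w_assoc=0.5):
    """Weighted sum of the normalized reconstruction error and the
    normalized association difference index (assumed combination rule)."""
    return w_recon * min_max(recon_error) + w_assoc * min_max(assoc_index)

scores = importance_scores([0.0, 1.0, 2.0], [10.0, 30.0, 20.0])
```

Normalizing both quantities first keeps either one from dominating the score merely because of its numeric scale.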

7. The model inference method according to claim 1, characterized in that the model inference method is applied to a multimodal processing model, and the method further comprises: obtaining sample data and sample labels; obtaining, by means of the multimodal processing model, an inference result corresponding to the sample data and the feature importance of each sample feature, wherein each sample feature corresponds to a sample visual sub-object contained in the sample data; calculating a first loss based on the feature importance, and calculating a second loss based on the sample labels and the inference result; and adjusting the multimodal processing model based on the first loss and the second loss.

8. The model inference method according to claim 7, characterized in that adjusting the multimodal processing model based on the first loss and the second loss comprises: calculating a target loss based on the first loss and the second loss; adjusting first model parameters of the multimodal processing model according to the target loss to obtain an intermediate model, wherein the first model parameters control the acquisition of the reconstruction error, the determination of the association difference index, and the determination of the importance; and adjusting the intermediate model based on the target loss.

9. The model inference method according to claim 1, characterized in that performing model inference on the multimodal data based on the target feature set comprises: obtaining a reference feature set based on the target feature set, wherein the reference feature set is the feature set corresponding to data objects other than the visual object in the multimodal data; and inputting the target feature set and the reference feature set into an inference module to perform model inference on the multimodal data.
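
Claim 9 does not say how the two feature sets are presented to the inference module. One common arrangement, assumed here purely for illustration, is to concatenate the filtered visual features with the features of the other modality (for example, text tokens) along the sequence axis, which requires both to share the same feature width. The name `build_inference_input` is hypothetical, and the inference module itself is outside the scope of the sketch.

```python
import numpy as np

def build_inference_input(target_features, reference_features):
    """Concatenate filtered visual features with non-visual features
    along the sequence axis (assumed input layout)."""
    target_features = np.asarray(target_features, dtype=float)
    reference_features = np.asarray(reference_features, dtype=float)
    if target_features.shape[1] != reference_features.shape[1]:
        raise ValueError("feature widths differ")
    return np.concatenate([target_features, reference_features], axis=0)

visual = np.ones((3, 4))   # 3 retained visual features of width 4
text = np.zeros((5, 4))    # 5 placeholder text features of width 4
sequence = build_inference_input(visual, text)
```

Because only the retained visual features enter the sequence, its length shrinks in proportion to the number of features removed by the filter.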

10. An electronic device, characterized by comprising: a memory configured to store a computer program; and a processor configured to implement the steps of the model inference method according to any one of claims 1 to 9 when executing the computer program.