Visual question answering data processing method and device, and computer device
A dual-branch fusion network and a multilayer perceptron network process visual question-answering images and query information, addressing the accuracy and diversity problems of visual question-answering models in financial business scenarios and achieving more efficient answer prediction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INDUSTRIAL AND COMMERCIAL BANK OF CHINA
- Filing Date
- 2023-05-04
- Publication Date
- 2026-06-19
AI Technical Summary
In existing technologies, visual question answering models struggle to effectively explore visual information in financial business scenarios, leading to inaccurate answer predictions and difficulties in cross-modal fusion, resulting in a high failure rate.
A fusion network with a dual-branch structure is adopted, comprising a text feature extraction network adjusted based on a hierarchical attention mechanism and an image feature extraction network. It is combined with a multilayer perceptron for pixel-level classification processing, and a pre-trained visual question answering prediction model is used to process visual question answering images and query information.
It improves the accuracy and diversity of visual question answering prediction models, provides a comprehensive visual language view, inspires new ways of thinking about human intelligence, and optimizes the performance of visual question answering prediction models.
Description
Technical Field
[0001] This application relates to the field of financial technology, and in particular to a visual question-answering data processing method, apparatus, computer equipment, storage medium, and computer program product.
Background Technology
[0002] Visual question answering is a challenging task that requires providing natural language answers to a given image and a natural language question about the image.
[0003] Currently, visual question answering models in related technologies involve representation learning and cross-modal fusion across different modalities, both of which are challenging. Such models have difficulty exploring the visual information in a given image, resulting in a high failure rate. Furthermore, for the complex financial information found in financial business scenarios, they cannot reliably produce accurate answer predictions, resulting in poor visual question answering performance.
Summary of the Invention
[0004] Therefore, it is necessary to provide a visual question-answering data processing method, apparatus, computer equipment, storage medium, and computer program product that can solve the above-mentioned technical problems.
[0005] In a first aspect, this application provides a visual question-answering data processing method, the method comprising:
[0006] The visual question-answering image to be predicted and the query information for the visual question-answering image are obtained and input into a pre-trained visual question-answering prediction model. The pre-trained visual question-answering prediction model includes a fusion network with a dual-branch structure and a multilayer perceptual network. Both the visual question-answering image and the query information are generated in a financial business scenario.
[0007] The query information is processed by the text feature extraction network in the fusion network to obtain text feature information, and the visual question-and-answer image is processed by the image feature extraction network in the fusion network to obtain image feature information; the text feature extraction network is adjusted based on a hierarchical attention mechanism.
[0008] Based on the text feature information, the image feature information is classified at the pixel level through the multilayer perceptron to obtain the model output prediction result, which serves as the predicted question and answer information corresponding to the query information.
[0009] In one embodiment, the step of processing the visual question-answering image through the image feature extraction network in the fusion network to obtain image feature information includes:
[0010] Image information of the visual question-answering image is extracted through the image feature extraction network in the fusion network;
[0011] By combining the text association region features and text recognition features corresponding to the visual question-and-answer image, as well as the image information, visual representation information is obtained; the text association region features and the text recognition features are used to adjust image understanding to provide a comprehensive visual language view;
[0012] Based on the visual representation information and the visual question-and-answer image, a target feature image is obtained, which serves as the image feature information.
[0013] In one embodiment, the step of performing pixel-level classification processing on the image feature information based on the text feature information through the multilayer perceptron to obtain the model output prediction result includes:
[0014] The target feature image is restored by the multilayer perceptron to obtain the processed feature image.
[0015] The decoder of the multilayer perceptron is used to perform pixel-level classification on the processed feature image based on the text feature information to obtain the model output prediction result.
[0016] In one embodiment, the pre-trained visual question-answering prediction model is trained using the following method:
[0017] Obtain a training sample set; each training sample in the training sample set consists of a sample image and multiple question-and-answer information pairs contained in the sample image; both the sample image and the multiple question-and-answer information pairs are collected based on financial business scenarios;
[0018] A visual question answering prediction model to be trained is constructed based on the fusion model with a dual-branch structure and the multilayer perceptual network; the first branch of the fusion model is the text feature extraction network adjusted based on the hierarchical attention mechanism, and the second branch of the fusion model is the image feature extraction network.
[0019] The training sample set is used to train the visual question answering prediction model to be trained, thereby obtaining the pre-trained visual question answering prediction model.
[0020] In one embodiment, prior to the step of obtaining the training sample set, the method further includes:
[0021] Obtain an initial sample set collected under the financial business scenario; each initial sample in the initial sample set contains different question-and-answer information pairs with different query object types;
[0022] The initial sample set is processed according to preset processing information, and the training sample set and the test sample set are obtained based on the processed initial sample set; the preset processing information is used to instruct the initial sample set to perform data filtering operation and image size adjustment operation.
[0023] In one embodiment, after the step of obtaining the pre-trained visual question-answering prediction model, the method further includes:
[0024] Obtain preset evaluation information; the preset evaluation information is used to statistically analyze the accuracy of the predicted question-and-answer results during model testing;
[0025] The pre-trained visual question answering prediction model is tested using the test sample set. The model test result of the pre-trained visual question answering prediction model is obtained by combining the preset evaluation information and the predicted question answering results output by the pre-trained visual question answering prediction model.
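The accuracy statistic described for model testing can be illustrated with a minimal sketch. The function name `answer_accuracy` and the case-insensitive exact-match rule are assumptions made for illustration; the application does not specify how predicted and reference answers are compared.

```python
def answer_accuracy(predicted, expected):
    """Return the fraction of predicted answers that match the reference answers."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not predicted:
        return 0.0
    # Assumed convention: ignore case and surrounding whitespace when comparing.
    hits = sum(
        p.strip().lower() == e.strip().lower()
        for p, e in zip(predicted, expected)
    )
    return hits / len(predicted)


# Hypothetical test-set outputs and reference answers.
predicted_answers = ["Savings account", "3.5%", "monthly"]
reference_answers = ["savings account", "3.5%", "quarterly"]
accuracy = answer_accuracy(predicted_answers, reference_answers)
```

Other matching rules (for example, accepting any of several reference answers) could be substituted without changing the overall test procedure.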
[0026] In one embodiment, prior to the step of constructing a visual question-answering prediction model to be trained based on the fusion model with the dual-branch structure and the multilayer perceptron, the method further includes:
[0027] A text feature extraction network for processing text feature extraction tasks is constructed as the first branch, and an image feature extraction network for processing image feature extraction tasks is constructed as the second branch.
[0028] By merging the first branch and the second branch, the fusion model with the dual-branch structure is obtained.
[0029] In one embodiment, constructing a text feature extraction network for processing text feature extraction tasks includes:
[0030] An initial text feature extraction network is obtained, and the initial text feature extraction network is adjusted by combining a weighted accumulation method and a hierarchical structure attention mechanism to obtain the text feature extraction network.
[0031] The hierarchical attention mechanism is used to combine features of different text levels in the question-and-answer information with the self-attention mechanism, so as to enhance the network's ability to acquire structural information by utilizing the degree of correlation between different hierarchical structures.
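The idea of weighted accumulation over text levels can be sketched as follows. This is a simplified illustration, not the network of this application: it uses dot-product attention against fixed context vectors rather than self-attention inside an LSTM, and the names `attention_pool`, `hierarchical_pool`, and the toy two-dimensional embeddings are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, context):
    """Weighted accumulation: score each vector against `context`, then sum."""
    scores = [sum(v_i * c_i for v_i, c_i in zip(v, context)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


def hierarchical_pool(phrases, word_context, phrase_context):
    """Pool words into phrase vectors, then phrase vectors into one question vector."""
    phrase_vectors = [attention_pool(words, word_context)[0] for words in phrases]
    return attention_pool(phrase_vectors, phrase_context)


# Toy question: two phrases, each a list of 2-D word embeddings (assumed values).
phrases = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [1.0, 1.0], [0.0, 0.0]],
]
question_vector, phrase_weights = hierarchical_pool(phrases, [1.0, 0.0], [0.0, 1.0])
```

The phrase-level weights show how strongly each level contributes to the final representation, which is the kind of correlation between hierarchical structures that the mechanism is described as exploiting.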
[0032] In a second aspect, this application also provides a visual question-answering data processing apparatus, the apparatus comprising:
[0033] The data acquisition module is used to acquire the visual question-and-answer image to be predicted and the query information for the visual question-and-answer image, and input them into the pre-trained visual question-and-answer prediction model; the pre-trained visual question-and-answer prediction model includes a fusion network with a dual-branch structure and a multilayer perceptual network, and the visual question-and-answer image and the query information are both generated in the context of financial business scenarios.
[0034] The visual question answering prediction model processing module is used to process the query information through the text feature extraction network in the fusion network to obtain text feature information, and to process the visual question answering image through the image feature extraction network in the fusion network to obtain image feature information; the text feature extraction network is adjusted based on a hierarchical attention mechanism;
[0035] The predictive question-and-answer information acquisition module is used to perform pixel-level classification processing on the image feature information based on the text feature information through the multilayer perceptron, and obtain the model output prediction result as the predicted question-and-answer information corresponding to the query information.
[0036] In a third aspect, this application also provides a computer device. The computer device includes a memory and a processor, the memory storing a computer program, and the processor executing the computer program to implement the steps of the visual question-answering data processing method described above.
[0037] In a fourth aspect, this application also provides a computer-readable storage medium. The computer-readable storage medium stores a computer program which, when executed by a processor, implements the steps of the visual question-answering data processing method described above.
[0038] In a fifth aspect, this application also provides a computer program product. The computer program product includes a computer program that, when executed by a processor, implements the steps of the visual question-answering data processing method described above.
[0039] The aforementioned visual question-answering data processing method, apparatus, computer device, storage medium, and computer program product acquire a visual question-answering image to be predicted and query information for the visual question-answering image, and input them into a pre-trained visual question-answering prediction model. This pre-trained model includes a fusion network with a dual-branch structure and a multilayer perceptron. Both the visual question-answering image and the query information are generated in a financial business scenario. The query information is then processed by a text feature extraction network within the fusion network to obtain text feature information, and the visual question-answering image is processed by an image feature extraction network within the fusion network to obtain image feature information. The text feature extraction network is adjusted based on a hierarchical attention mechanism. Then, based on the text feature information, the image feature information is processed at the pixel level by the multilayer perceptron to obtain the model's output prediction result, which serves as the predicted question-answer information corresponding to the query information. This optimizes the visual question-answering prediction model. The dual-branch fusion network focuses on processing different feature extraction tasks, and the multilayer perceptron performs pixel-level classification processing. This provides a comprehensive visual language view for the input image to be predicted by enriching visual representation, thereby improving the diversity and accuracy of the generated predicted question-answer information.
Attached Figure Description
[0040] Figure 1 is a flowchart illustrating a visual question-answering data processing method in one embodiment;
[0041] Figure 2 is a schematic diagram of a model training and testing process in one embodiment;
[0042] Figure 3 is a flowchart illustrating a model training step in one embodiment;
[0043] Figure 4 is a flowchart illustrating another visual question-answering data processing method in one embodiment;
[0044] Figure 5 is a structural block diagram of a visual question-answering data processing device in one embodiment;
[0045] Figure 6 is an internal structural diagram of a computer device in one embodiment.
Detailed Implementation
[0046] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the scope of this application.
[0047] It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data used for display, data used for analysis, etc.) involved in this application are all information and data authorized by the user or fully authorized by all parties; correspondingly, this application also provides a corresponding user authorization entry point for users to choose to authorize or refuse.
[0048] In one embodiment, as shown in Figure 1, a visual question-answering data processing method is provided. This embodiment illustrates the method applied to a terminal, but it is understood that the method can also be applied to a server, or to a system including both a terminal and a server, and implemented through interaction between the terminal and the server. In this embodiment, the method includes the following steps:
[0049] Step 101: Obtain the visual question-answering image to be predicted and the query information for the visual question-answering image, and input them into the pre-trained visual question-answering prediction model;
[0050] The pre-trained visual question answering prediction model can include a fusion network with a dual-branch structure and a multilayer perceptual network. In the fusion network, the first branch can be a text feature extraction network and the second branch can be an image feature extraction network. By using the dual-branch structure, different feature extraction tasks can be addressed separately.
[0051] As an example, both the visual question-answering image and the query information are generated in the context of financial business scenarios. For example, a visual question-answering task to be processed can be obtained based on intelligent customer service, intelligent investment advisory services, etc. in the context of financial business scenarios. The visual question-answering task may include the visual question-answering image to be predicted and the query information for the visual question-answering image.
[0052] In practical applications, intelligent robot services in financial business scenarios, such as intelligent customer service and intelligent investment advisory services, can be used to obtain visual question-and-answer tasks to be predicted. Then, based on the visual question-and-answer task, the visual question-and-answer image to be predicted and the query information for the visual question-and-answer image can be obtained. The obtained visual question-and-answer image and query information can then be input into a pre-trained visual question-and-answer prediction model to generate predicted question-and-answer information based on the visual question-and-answer image and query information, which serves as the processing result corresponding to the visual question-and-answer task.
[0053] In one alternative embodiment, as shown in Figure 2, for the pre-trained visual question answering prediction model, the text feature extraction network of the first branch in its fusion network can be an improved LSTM (Long Short-Term Memory) model, and the image feature extraction network of the second branch can be an EAST (Efficient and Accurate Scene Text detector) model, which detects text in images and videos, for example as part of optical character recognition in natural scene images; the multilayer perceptron network can be an MLP (Multi-Layer Perceptron) module.
[0054] Step 102: Process the query information through the text feature extraction network in the fusion network to obtain text feature information, and process the visual question-and-answer image through the image feature extraction network in the fusion network to obtain image feature information;
[0055] As an example, the text feature extraction network can be obtained by adjustment based on a hierarchical attention mechanism; for instance, an LSTM model improved with a hierarchical attention mechanism can serve as the text feature extraction network in the first branch of the fusion network.
[0056] In a specific implementation, after inputting the visual question-answering image and the query information into the pre-trained visual question-answering prediction model, the query information can be processed through the text feature extraction network in the pre-trained visual question-answering prediction model to obtain text feature information, and the visual question-answering image can be processed through the image feature extraction network in the fusion network to obtain image feature information. Thus, different feature extraction tasks can be processed separately based on the dual-branch structure.
[0057] In one example, image information of the visual question-and-answer image can be extracted by using an image feature extraction network in the fusion network. Then, the visual representation information can be obtained by combining the text association region features and text recognition features corresponding to the visual question-and-answer image, such as the rich text region features and detailed OCR (optical character recognition) features extracted by the model, as well as the image information. Based on the visual representation information and the visual question-and-answer image, the target feature image can be obtained as image feature information. Thus, by using the EAST model to extract image information and utilizing richer text region features and detailed OCR features, the visual representation of image understanding can be improved. This can provide a comprehensive visual language view of the input image, which helps to improve the diversity and accuracy of the generated answers.
[0058] Step 103: Based on the text feature information, the image feature information is classified at the pixel level through the multilayer perceptron to obtain the model output prediction result, which serves as the predicted question and answer information corresponding to the query information.
[0059] After obtaining textual and image feature information, the target feature image can be restored using a multilayer perceptron to obtain a processed feature image. Then, by using a decoder of the multilayer perceptron, the processed feature image can be classified at the pixel level based on the textual feature information to obtain the model output prediction result, which serves as the predicted question and answer information corresponding to the query information.
[0060] For example, by using a decoder with an MLP structure, pixel-level classification can be performed on the feature map (i.e., the target feature image) output by the EAST model to output the prediction result, that is, the model output prediction result.
[0061] Traditional visual question answering models involve representation learning across different modalities and cross-modal fusion, which are challenging, and they have difficulty exploring the visual information in a given image. In contrast, the technical solution in this embodiment provides a dual-route network model, namely the pre-trained visual question answering prediction model. This model can use rich text region features and OCR features to enrich the visual representation, providing a comprehensive visual language view of the input image to be predicted. This improves the diversity and accuracy of the predicted question answer information output by the model, and can also inspire and realize new ways of thinking that are related to human intelligence.
[0062] In the aforementioned visual question-answering data processing method, the visual question-answering image to be predicted and the query information for the visual question-answering image are acquired and input into a pre-trained visual question-answering prediction model. Then, the query information is processed by the text feature extraction network in the fusion network to obtain text feature information, and the visual question-answering image is processed by the image feature extraction network in the fusion network to obtain image feature information. Based on the text feature information, the image feature information is classified at the pixel level by a multilayer perceptron to obtain the model output prediction result, which serves as the predicted question-answer information corresponding to the query information. This method optimizes the visual question-answering prediction model. The fusion network based on a dual-branch structure focuses on processing different feature extraction tasks, and the multilayer perceptron is used for pixel-level classification. This can provide a comprehensive visual language view for the input image to be predicted by enriching the visual representation, thereby improving the diversity and accuracy of the generated predicted question-answer information.
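The flow of steps 101 to 103 can be outlined as a structural sketch. The class `VqaPredictor` and the three toy branch functions are hypothetical stand-ins introduced for illustration; in the described embodiment the branches would be an improved LSTM, an EAST model, and an MLP decoder rather than the trivial functions used here.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]
FeatureMap = List[List[Vector]]  # height x width x channels


@dataclass
class VqaPredictor:
    """Dual-branch model skeleton: two feature extractors plus a pixel-level head."""
    text_branch: Callable[[str], Vector]
    image_branch: Callable[[List[List[float]]], FeatureMap]
    pixel_head: Callable[[FeatureMap, Vector], List[List[int]]]

    def predict(self, image, query):
        text_features = self.text_branch(query)      # step 102, first branch
        image_features = self.image_branch(image)    # step 102, second branch
        # Step 103: pixel-level classification conditioned on the text features.
        return self.pixel_head(image_features, text_features)


# Toy stand-ins (assumptions): real branches would be trained networks.
def toy_text_branch(query):
    return [float(len(query.split()))]


def toy_image_branch(image):
    return [[[pixel] for pixel in row] for row in image]


def toy_pixel_head(feature_map, text_features):
    threshold = text_features[0] / 10.0
    return [[1 if cell[0] > threshold else 0 for cell in row] for row in feature_map]


model = VqaPredictor(toy_text_branch, toy_image_branch, toy_pixel_head)
labels = model.predict([[0.1, 0.9], [0.4, 0.2]], "what is the balance")
```

Keeping the two branches as separate callables mirrors the stated design goal that each branch handles its own feature extraction task.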
[0063] In one embodiment, processing the visual question-answering image through the image feature extraction network in the fusion network to obtain image feature information may include the following steps:
[0064] Image information of the visual question-and-answer image is extracted through the image feature extraction network in the fusion network; visual representation information is obtained by combining the text association region features and text recognition features corresponding to the visual question-and-answer image with the image information; the text association region features and the text recognition features are used to adjust image understanding to provide a comprehensive visual language view; and a target feature image is obtained based on the visual representation information and the visual question-and-answer image, which serves as the image feature information.
[0065] Specifically, the EAST model (i.e., image feature extraction network) can be used to extract image information. Then, richer text region features (i.e., text association region features) and detailed OCR features (i.e., text recognition features) can be used to improve the visual representation of image understanding and obtain visual representation information. Then, based on this visual representation information and the visual question-and-answer image, the feature map output by the EAST model (i.e., target feature image) can be obtained, thus providing a comprehensive visual language view for the input image, which helps to improve the diversity and accuracy of the generated answers.
[0066] In one embodiment, the step of performing pixel-level classification processing on the image feature information based on the text feature information using the multilayer perceptron to obtain the model output prediction result may include the following steps:
[0067] The target feature image is restored by the multilayer perceptron to obtain the processed feature image; the decoder of the multilayer perceptron is used to perform pixel-level classification on the processed feature image based on the text feature information to obtain the model output prediction result.
[0068] In one example, an MLP (Multilayer Perceptron) structure can be used to restore the low-resolution feature map (i.e., the target feature image) output by the encoder to the original image size, thus obtaining the processed feature image. The feature map output from the last layer of the MLP structure can then be input into a fully connected layer to obtain pixel-level classification results, i.e., the model's output prediction. This allows for global information aggregation based on the comprehensive visual language view provided by rich visual representations, helping to improve the diversity and accuracy of the model's generated answers.
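The restoration and per-pixel classification described here can be sketched in a few lines. Nearest-neighbor upsampling and a single linear layer followed by an argmax are assumptions chosen for brevity; the application does not state which interpolation is used, and this sketch omits the conditioning on text features.

```python
def upsample_nearest(feature_map, out_h, out_w):
    """Restore a low-resolution feature map to (out_h, out_w) by nearest neighbor."""
    in_h, in_w = len(feature_map), len(feature_map[0])
    return [
        [feature_map[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def classify_pixels(feature_map, weights, biases):
    """Apply one fully connected layer to each pixel and keep the best class index."""
    labels = []
    for row in feature_map:
        label_row = []
        for features in row:
            logits = [
                sum(w * f for w, f in zip(w_row, features)) + b
                for w_row, b in zip(weights, biases)
            ]
            label_row.append(max(range(len(logits)), key=logits.__getitem__))
        labels.append(label_row)
    return labels


# Toy 1x2 feature map with 2 channels, restored to 2x4 (assumed sizes).
low_res = [[[1.0, 0.0], [0.0, 1.0]]]
restored = upsample_nearest(low_res, 2, 4)
weights = [[1.0, 0.0], [0.0, 1.0]]  # identity layer: class k reads channel k
biases = [0.0, 0.0]
labels = classify_pixels(restored, weights, biases)
```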
[0069] Specifically, for the MLP structure, in order to recover the low-level feature information lost in downsampling, the input feature map can be concatenated with the feature map passed through skip connections in the encoder and batch normalized. In order to reduce the number of parameters, the feature map can be input into a depthwise separable convolutional layer to reduce the number of channels. Then, the processed feature map can be fed into two consecutive cross-MLP modules. The purpose is to fuse the features in the height and width directions of the feature map with the channel features to achieve global information aggregation. Finally, the last layer feature map can be processed by two linear layers to obtain the output feature map.
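The parameter saving from a depthwise separable convolution can be checked with simple arithmetic. The kernel size and channel counts below are assumed values for illustration, and bias terms are ignored.

```python
def standard_conv_params(kernel, in_channels, out_channels):
    """Weights in a standard convolution: every output channel sees every input channel."""
    return kernel * kernel * in_channels * out_channels


def depthwise_separable_params(kernel, in_channels, out_channels):
    """Weights in a depthwise convolution followed by a 1x1 pointwise convolution."""
    depthwise = kernel * kernel * in_channels   # one spatial filter per input channel
    pointwise = in_channels * out_channels      # 1x1 convolution mixes channels
    return depthwise + pointwise


# Assumed example: 3x3 kernel reducing 256 channels to 64.
standard = standard_conv_params(3, 256, 64)
separable = depthwise_separable_params(3, 256, 64)
ratio = separable / standard
```

With these assumed sizes the separable layer needs 18,688 weights against 147,456 for the standard layer, roughly one eighth as many.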
[0070] In one embodiment, as shown in Figure 3, the pre-trained visual question-answering prediction model is trained using the following method, which may include the following steps:
[0071] Step 301: Obtain the training sample set;
[0072] Each training sample in the training sample set can consist of a sample image and multiple question-and-answer information pairs contained in the sample image.
[0073] As an example, the sample images and multiple question-and-answer information pairs are all collected based on financial business scenarios. For example, sample visual question-and-answer tasks can be obtained based on intelligent customer service, intelligent investment advisory services, etc. in financial business scenarios. Then, training samples can be obtained based on the sample images and multiple question-and-answer information pairs contained in the sample visual question-and-answer tasks.
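One way to represent a training sample, that is, one image with several question-and-answer pairs, is sketched below. The class names `QaPair` and `TrainingSample`, the field names, and the example file path and questions are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QaPair:
    question: str
    answer: str


@dataclass
class TrainingSample:
    image_path: str
    qa_pairs: List[QaPair] = field(default_factory=list)


def flatten(samples):
    """Expand samples into (image_path, question, answer) triples for training."""
    return [
        (sample.image_path, qa.question, qa.answer)
        for sample in samples
        for qa in sample.qa_pairs
    ]


# Hypothetical sample from a customer-service scenario.
sample = TrainingSample(
    image_path="statement_001.png",
    qa_pairs=[
        QaPair("What is the account type?", "savings"),
        QaPair("Which currency is shown?", "CNY"),
    ],
)
triples = flatten([sample])
```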
[0074] In one example, an initial sample set collected in a financial business scenario can be obtained (e.g., by collecting image question-and-answer datasets, as shown in Figure 2). Then, according to the preset processing information, data filtering and image size adjustment operations can be performed on the initial sample set to obtain a training sample set and a test sample set, which can then be used for model training and model validation respectively.
[0075] Step 302: Based on the fusion model with the dual-branch structure and the multilayer perceptual network, a visual question answering prediction model to be trained is constructed; the first branch of the fusion model is the text feature extraction network adjusted based on the hierarchical attention mechanism, and the second branch of the fusion model is the image feature extraction network.
[0076] In specific implementations, a text feature extraction network for handling text feature extraction tasks (such as the improved LSTM model in Figure 2) can be constructed as the first branch, and an image feature extraction network for handling image feature extraction tasks (such as the EAST model in Figure 2) can be constructed as the second branch. Then, the first branch and the second branch can be fused to obtain a fusion model with a dual-branch structure. Based on this fusion model and the multilayer perceptual network, a visual question answering prediction model to be trained can be constructed.
[0077] In an optional embodiment, the initial text feature extraction network, such as the LSTM base model, can be adjusted by combining a weighted accumulation method and a hierarchical attention mechanism to obtain a text feature extraction network, such as an improved LSTM model. Then, a fusion network model based on the EAST model and the improved LSTM model (i.e., a fusion model with a dual-branch structure) can be constructed. Furthermore, the constructed fusion network model can be combined with a multilayer perceptron (e.g., the MLP module in Figure 2) to construct the visual question-answering prediction model to be trained.
[0078] Step 303: Using the training sample set, train the visual question answering prediction model to be trained to obtain the pre-trained visual question answering prediction model.
[0079] In practical applications, as shown in Figure 2, during model training, the parameters of the visual question answering prediction model to be trained can be updated based on the loss value output by the fusion network model. For example, the loss value of model training can be obtained by using the cross-entropy loss function, which can be calculated in the following way:
[0080]
[0081] Where J(θ) is the cross-entropy loss value as a function of the parameter θ; y^(i) takes the value 0 or 1; m represents the number of categories (if a question has 3 answers, then m = 3); (i) denotes the i-th pixel; and h(*) denotes the model's prediction under the parameter θ.
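As a point of reference, a pixel-wise cross-entropy can be sketched as the mean, over all pixels, of the negative log-probability that the model assigns to the true class. This is a common form and an assumption here; it is not necessarily the exact expression used in this application. The function name `pixel_cross_entropy` and the small clamping constant are introduced for illustration.

```python
import math


def pixel_cross_entropy(probabilities, targets):
    """Mean negative log-likelihood of the true class over all pixels.

    probabilities: one list of m class probabilities per pixel
    targets: one one-hot list (entries 0 or 1) per pixel
    """
    total = 0.0
    for probs, one_hot in zip(probabilities, targets):
        # Clamp probabilities to avoid log(0); 1e-12 is an assumed safeguard.
        total -= sum(y * math.log(max(p, 1e-12)) for p, y in zip(probs, one_hot))
    return total / len(probabilities)


# Two pixels, m = 3 candidate answers (toy values).
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
targets = [[1, 0, 0], [0, 1, 0]]
loss = pixel_cross_entropy(probs, targets)
```

The loss falls toward zero as the probability assigned to each pixel's true class approaches 1, which is the quantity that parameter updates during training would aim to reduce.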
[0082] In this embodiment, a training sample set is obtained, in which each training sample consists of a sample image and multiple question-and-answer information pairs contained in the sample image. Both the sample image and the multiple question-and-answer information pairs are collected based on financial business scenarios. Then, a visual question-and-answer prediction model to be trained is constructed based on a fusion model with a dual-branch structure and a multilayer perceptron. The first branch of the fusion model is a text feature extraction network adjusted based on a hierarchical attention mechanism, and the second branch is an image feature extraction network. The training sample set is then used to train the visual question-and-answer prediction model to be trained, resulting in a pre-trained visual question-and-answer prediction model. This optimizes the visual question-and-answer prediction model. The dual-branch fusion network focuses on handling different feature extraction tasks, and the multilayer perceptron is used for pixel-level classification. By enriching the visual representation, a comprehensive visual language view can be provided for the input image to be predicted, improving the diversity and accuracy of the generated predicted question-and-answer information.
[0083] In one embodiment, prior to the step of obtaining the training sample set, the following steps may also be included:
[0084] Obtain an initial sample set collected under the financial business scenario; different question-and-answer information pairs in each initial sample of the initial sample set have different inquiry object types; process the initial sample set according to preset processing information, and obtain the training sample set and the test sample set based on the processed initial sample set; the preset processing information is used to instruct data filtering operations and image size adjustment operations to be performed on the initial sample set.
[0085] In practical implementations, within financial business scenarios, various types of image question-and-answer datasets (i.e., initial sample sets) can be collected. These datasets can be used to ask questions about different objects in images, such as animals, plants, and people. Different question-and-answer pairs represent different types of objects being asked. The image question-and-answer dataset can include multiple images, each with multiple question-and-answer pairs. To ensure the accuracy of both questions and answers in the acquired dataset, the various types of questions and responses in the collected image question-and-answer dataset can be verified, enabling the dataset to be used to support model training and testing.
[0086] For example, to obtain image question answering datasets containing various types of data, one can obtain sample visual question answering tasks based on intelligent customer service, intelligent investment advisory services, etc. in financial business scenarios. Alternatively, one can collect and download public datasets, such as DAQUAR, COCO-QA, FM-IQA, and Visual7W VQA (visual question answering) datasets, which can be used to support model training and model testing.
[0087] In one example, to improve the training accuracy and speed of the model, the collected image question-and-answer dataset can be organized and some data can be discarded (i.e., the data filtering operation) to ensure that the responses in the retained data are related to the images; at the same time, to keep the size of the images input to the model consistent, the images in the image question-and-answer dataset can be reduced or enlarged (i.e., the image size adjustment operation) so that the images input to the model have a consistent size ratio.
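The two preprocessing operations described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of the application: the `QAPair` type, its `relevant` flag, and the nearest-neighbour resizing rule are all hypothetical, and images are represented as plain lists of pixel rows so that the example has no external dependencies.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str
    relevant: bool  # hypothetical flag: whether the answer relates to the image


def filter_pairs(pairs):
    """Data filtering: keep pairs that are marked relevant and have a non-empty answer."""
    return [p for p in pairs if p.relevant and p.answer.strip()]


def resize_nearest(image, out_h, out_w):
    """Image size adjustment: nearest-neighbour resize of a list of pixel rows."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]
```

The same `resize_nearest` call covers both reduction and enlargement, since the target height and width are passed explicitly.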
[0088] In another example, the processed image question-answering dataset can be divided into a training set and a test set according to a preset ratio (e.g., 8:2), which results in a training sample set and a test sample set. For example, 80% of the samples in the constructed image question-answering dataset can be randomly selected as the training set, and the remaining 20% of the samples can be used as the test set.
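A random split at the preset ratio can be sketched as below. The function name `split_samples` and the fixed `seed` parameter are assumptions added for illustration and reproducibility; the application only specifies the ratio (e.g., 8:2) and random selection.

```python
import random


def split_samples(samples, train_ratio=0.8, seed=0):
    """Randomly split samples into a training set and a test set."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # shuffle a copy; the input is left unchanged
    cut = int(len(shuffled) * train_ratio)  # number of training samples
    return shuffled[:cut], shuffled[cut:]
```

For example, calling `split_samples(range(10))` yields 8 training samples and 2 test samples, with every sample appearing in exactly one of the two sets.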
[0089] In one embodiment, after the step of obtaining the pre-trained visual question-answering prediction model, the following steps may also be included:
[0090] Obtain preset evaluation information; the preset evaluation information is used to statistically analyze the accuracy of the predicted question-answering results during model testing; the pre-trained visual question-answering prediction model is tested using the test sample set, and the model test result of the pre-trained visual question-answering prediction model is obtained by combining the preset evaluation information and the predicted question-answering results output by the pre-trained visual question-answering prediction model.
[0091] In practical applications, as shown in Figure 2, the trained improved network model (i.e., the pre-trained visual question answering prediction model) can be evaluated using a test set (i.e., the test sample set). The test results of the model can then be obtained by combining the evaluation metrics; that is, by combining the preset evaluation information and the predicted question answering results output by the pre-trained visual question answering prediction model, the model test results of the pre-trained visual question answering prediction model can be obtained.
[0092] In one example, accuracy can be used as an evaluation metric for a pre-trained visual question-answering prediction model for model evaluation, and can be expressed by the following formula:
[0093]
[0094] Specifically, for any sample image containing the same query, if at least three different respondents give the same answer, then the answer can be considered 100% correct.
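The accuracy formula is not reproduced in this text. The rule stated above matches the widely used visual question answering accuracy measure, in which an answer scores min(n / 3, 1), where n is the number of respondents who gave that answer. The sketch below assumes that reading; the function names and the case-insensitive string comparison are illustrative choices rather than details from the application.

```python
def vqa_accuracy(predicted, human_answers):
    """Score one prediction: full credit once at least three respondents agree with it."""
    target = predicted.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == target)
    return min(matches / 3.0, 1.0)


def mean_accuracy(predictions, references):
    """Average the per-question scores over a test sample set."""
    scores = [vqa_accuracy(p, r) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

Under this measure, an answer given by only one or two respondents earns partial credit instead of being counted as simply wrong.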
[0095] In one embodiment, before the step of constructing the visual question-answering prediction model to be trained based on the fusion model with the dual-branch structure and the multilayer perceptual network, the following steps may be included:
[0096] A text feature extraction network for processing text feature extraction tasks is constructed as the first branch, and an image feature extraction network for processing image feature extraction tasks is constructed as the second branch; the first branch and the second branch are fused to obtain the fusion model with the dual-branch structure.
[0097] In practical implementation, the LSTM base model can be adjusted by combining a weighted accumulation method and a hierarchical attention mechanism to obtain a text feature extraction network for processing text feature extraction tasks, such as an improved LSTM model, which serves as the first branch. A fusion network model based on the EAST model and the improved LSTM model (i.e., a fusion model with a dual-branch structure) can then be constructed.
[0098] In one example, the EAST model is a fully convolutional network that can have three parts: a feature extraction layer, a feature fusion layer, and an output layer. To address the issue of varying text sizes in an image, feature maps from different levels can be fused. Predicting small text uses lower-level semantic information, while predicting large text uses higher-level semantic information. Furthermore, skip connections can be used to fuse feature maps from different levels, thus avoiding feature loss caused by drastic changes in text line scale.
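The idea of fusing feature maps from different levels through a skip connection can be sketched as follows. This is a simplified illustration, not the EAST implementation: feature maps are plain nested lists indexed as [channel][row][column], the coarse map is assumed to be exactly half the resolution of the fine map, and the convolutions that the published EAST design applies after concatenation are omitted. The function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a [channel][row][col] feature map."""
    out = []
    for channel in fmap:
        rows = []
        for row in channel:
            wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
            rows.append(wide)
            rows.append(list(wide))  # repeat each row vertically
        out.append(rows)
    return out


def fuse(coarse, fine):
    """Upsample the higher-level (coarse) map and concatenate it with the lower-level map."""
    up = upsample2x(coarse)
    if len(up[0]) != len(fine[0]) or len(up[0][0]) != len(fine[0][0]):
        raise ValueError("coarse map must be half the resolution of the fine map")
    return up + fine  # channel-wise concatenation
```

The fused map keeps the fine-level channels intact next to the upsampled coarse-level channels, so both small-text and large-text cues remain available to later layers.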
[0099] In one embodiment, constructing a text feature extraction network for processing text feature extraction tasks may include the following steps:
[0100] An initial text feature extraction network is obtained, and then adjusted by combining a weighted accumulation method and a hierarchical attention mechanism to obtain the text feature extraction network.
[0101] Here, the hierarchical structure attention mechanism can be used to combine the features of different text levels in the question-and-answer information with the self-attention mechanism, so as to enhance the network's ability to acquire structural information by utilizing the degree of correlation between different hierarchical structures.
[0102] In one example, the LSTM model can be improved by using a hierarchical attention mechanism. For instance, based on a designed weighted accumulation method, ordinary sentence vectors can be converted into structurally weakly related sentence vectors, and a hierarchical attention mechanism over words, sentences, and text can be constructed to improve the model's structural learning ability.
[0103] For example, LSTM models can use several "gates" simultaneously to selectively process data, such as a forget gate, an input gate, and an output gate, together with a new (candidate) value. The forget gate can take the previous state and the current word vector as input and use a neural network parameter matrix to record information. Its output vector can then be multiplied element-wise with the transmission band matrix (i.e., the cell state carried between time steps) to achieve selective memory and forgetting. The input gate can concatenate the previous state and the current input word vector, and then process the result through a switching function (for example, a sigmoid function). The new value can concatenate the previous state and the current data, and then process the resulting vector using the tanh function.
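A single time step of a standard LSTM cell, showing how the gates described above interact, can be sketched as follows. This is the textbook LSTM formulation, not the improved model of the application: the sigmoid gate activation, the `params` layout, and all function names are assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(weights, bias, vec):
    """Compute W @ vec + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM time step; params maps 'f', 'i', 'o', 'g' to (W, b) pairs."""
    z = h_prev + x  # concatenate the previous state and the current input vector
    f = [sigmoid(v) for v in affine(*params["f"], z)]    # forget gate
    i = [sigmoid(v) for v in affine(*params["i"], z)]    # input gate
    o = [sigmoid(v) for v in affine(*params["o"], z)]    # output gate
    g = [math.tanh(v) for v in affine(*params["g"], z)]  # new (candidate) value
    # The forget gate scales the old cell state; the input gate scales the new value.
    c = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c_prev, i, g)]
    h = [ov * math.tanh(cv) for ov, cv in zip(o, c)]
    return h, c
```

With all weights and biases set to zero, every gate outputs 0.5 and the candidate value is 0, so the new cell state is exactly half of the previous one. This makes the selective forgetting role of the forget gate easy to check by hand.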
[0104] In another example, a hierarchical attention mechanism can be used to combine the characteristics of Chinese text—characters forming words, words forming sentences, and sentences forming text—with a self-attention mechanism. By fully utilizing the connections between different levels, the ability to capture structural information contained in the overall features can be enhanced. Specifically, attention is calculated by performing a dot product operation on the text vector and each weakly related sentence vector to calculate similarity. The sentence vector can be the i-th sentence sequence in the training samples. By successively obtaining the importance of different sentence vectors in the text, the score can be divided by the scale (the square root of the word vector dimension) to prevent the inner product from becoming too large and to stabilize the gradient. Then, the scores of all words can be normalized using the Softmax function to obtain positive scores that sum to 1. Finally, the weight distribution of word vectors in the sentence vectors can be calculated and weighted summed to obtain the output related vector of attention at the current position.
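The attention calculation described above (dot-product similarity, division by the square root of the vector dimension, Softmax normalization, and weighted summation) can be sketched as follows. Using the sentence vectors as both the compared vectors and the summed values is an assumption, as are the function names; the application does not specify these details.

```python
import math


def softmax(scores):
    """Normalize scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(text_vec, sentence_vecs):
    """Scaled dot-product attention of a text vector over sentence vectors."""
    d = len(text_vec)
    scale = math.sqrt(d)  # keeps the inner products from growing with the dimension
    scores = [sum(q * k for q, k in zip(text_vec, s)) / scale for s in sentence_vecs]
    weights = softmax(scores)
    # Weighted sum of the sentence vectors gives the attention output vector.
    output = [sum(w * s[j] for w, s in zip(weights, sentence_vecs)) for j in range(d)]
    return weights, output
```

A sentence vector that is more similar to the text vector receives a larger weight and therefore contributes more to the output vector.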
[0105] In one embodiment, as shown in Figure 4, a flowchart of another visual question-answering data processing method is provided. In this embodiment, the method includes the following steps:
[0106] In step 401, an initial sample set is obtained from data collected in a financial business scenario; different question-and-answer information pairs in each initial sample of the initial sample set have different inquiry object types. In step 402, the initial sample set is processed according to preset processing information, resulting in a training sample set and a test sample set; the preset processing information is used to instruct data filtering and image size adjustment operations on the initial sample set. In step 403, a visual question-and-answer prediction model to be trained is constructed based on a fusion model with a dual-branch structure and a multilayer perceptron. In step 404, the visual question-and-answer prediction model to be trained is trained using the training sample set, resulting in a pre-trained visual question-and-answer prediction model. In step 405, preset evaluation information is obtained; this information is used to statistically analyze the accuracy of the predicted question-and-answer results during model testing. In step 406, the pre-trained visual question-and-answer prediction model is tested using the test sample set; the model test result of the pre-trained visual question-and-answer prediction model is obtained by combining the preset evaluation information and the predicted question-and-answer results output by the pre-trained model. It should be noted that the specific limitations of the above steps can be found in the specific limitations of a visual question-answering data processing method described above, and will not be repeated here.
[0107] It should be understood that although the steps in the flowcharts of the embodiments described above are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some steps in the flowcharts of the embodiments described above may include multiple steps or multiple stages. These steps or stages are not necessarily completed at the same time, but can be executed at different times. The execution order of these steps or stages is not necessarily sequential, but can be performed alternately or in turn with other steps or at least some of the steps or stages of other steps.
[0108] Based on the same inventive concept, this application also provides a visual question-answering data processing apparatus for implementing the visual question-answering data processing method described above. The solution provided by this apparatus is similar to the implementation scheme described in the above method; therefore, the specific limitations in one or more embodiments of the visual question-answering data processing apparatus provided below can be found in the limitations of the visual question-answering data processing method described above, and will not be repeated here.
[0109] In one embodiment, as shown in Figure 5, a visual question-answering data processing device is provided, comprising:
[0110] The data acquisition module 501 is used to acquire the visual question-and-answer image to be predicted and the query information for the visual question-and-answer image, and input them into the pre-trained visual question-and-answer prediction model; the pre-trained visual question-and-answer prediction model includes a fusion network with a dual-branch structure and a multilayer perceptual network, and the visual question-and-answer image and the query information are both generated in the context of financial business scenarios.
[0111] The visual question answering prediction model processing module 502 is used to process the query information through the text feature extraction network in the fusion network to obtain text feature information, and to process the visual question answering image through the image feature extraction network in the fusion network to obtain image feature information; the text feature extraction network is obtained by adjusting based on a hierarchical attention mechanism;
[0112] The prediction question-and-answer information acquisition module 503 is used to perform pixel-level classification processing on the image feature information based on the text feature information through the multilayer perceptron to obtain the model output prediction result, which serves as the prediction question-and-answer information corresponding to the query information.
[0113] In one embodiment, the visual question-answering prediction model processing module 502 includes:
[0114] The image information extraction submodule is used to extract the image information of the visual question-answering image through the image feature extraction network in the fusion network;
[0115] The visual representation information acquisition submodule is used to combine the text association region features and text recognition features corresponding to the visual question-and-answer image, as well as the image information, to obtain visual representation information; the text association region features and the text recognition features are used to adjust image understanding to provide a comprehensive visual language view;
[0116] The target feature image acquisition submodule is used to obtain the target feature image based on the visual representation information and the visual question-and-answer image, which serves as the image feature information.
[0117] In one embodiment, the predicted question-answering information obtaining module 503 includes:
[0118] The image restoration submodule is used to perform image restoration processing on the target feature image through the multilayer perceptron to obtain the processed feature image;
[0119] The pixel-level classification processing submodule is used to perform pixel-level classification processing on the processed feature image based on the text feature information using the decoder of the multilayer perceptron, and obtain the prediction result output by the model.
[0120] In one embodiment, the apparatus further includes:
[0121] The training sample set acquisition module is used to acquire a training sample set; each training sample in the training sample set consists of a sample image and multiple question-and-answer information pairs contained in the sample image; both the sample image and the multiple question-and-answer information pairs are collected based on financial business scenarios;
[0122] The model construction module is used to construct a visual question answering prediction model to be trained based on the fusion model with a dual-branch structure and the multilayer perceptual network; the first branch of the fusion model is the text feature extraction network adjusted based on the hierarchical attention mechanism, and the second branch of the fusion model is the image feature extraction network.
[0123] The model training module is used to train the visual question answering prediction model to be trained using the training sample set, so as to obtain the pre-trained visual question answering prediction model.
[0124] In one embodiment, the apparatus further includes:
[0125] An initial sample acquisition module is used to acquire an initial sample set collected under the financial business scenario; different question-and-answer information pairs in each initial sample of the initial sample set have different query object types;
[0126] The initial sample processing module is used to process the initial sample set according to preset processing information, and obtain the training sample set and the test sample set based on the processed initial sample set; the preset processing information is used to instruct data filtering operation and image size adjustment operation to be performed on the initial sample set.
[0127] In one embodiment, the apparatus further includes:
[0128] The evaluation information acquisition module is used to acquire preset evaluation information; the preset evaluation information is used to statistically analyze the accuracy of the predicted question-and-answer results during model testing.
[0129] The model testing module is used to test the pre-trained visual question answering prediction model using the test sample set, and to obtain the model test result of the pre-trained visual question answering prediction model by combining the preset evaluation information and the predicted question answering results output by the pre-trained visual question answering prediction model.
[0130] In one embodiment, the apparatus further includes:
[0131] The branch construction module is used to build a text feature extraction network for processing text feature extraction tasks as the first branch, and an image feature extraction network for processing image feature extraction tasks as the second branch.
[0132] The fusion model obtaining module is used to fuse the first branch and the second branch to obtain the fusion model with the dual-branch structure.
[0133] In one embodiment, the branch construction module includes:
[0134] The text feature extraction network submodule is used to obtain the initial text feature extraction network. By combining the weighted accumulation method and the hierarchical structure attention mechanism, the initial text feature extraction network is adjusted to obtain the text feature extraction network. The hierarchical structure attention mechanism is used to combine the features of different text levels in the question and answer information with the self-attention mechanism, so as to enhance the network's ability to obtain structural information by utilizing the degree of correlation between different hierarchical structures.
[0135] Each module in the aforementioned visual question-answering data processing device can be implemented entirely or partially through software, hardware, or a combination thereof. These modules can be embedded in or independent of the processor in a computer device, or stored in the memory of a computer device as software, so that the processor can call and execute the corresponding operations of each module.
[0136] In one embodiment, a computer device is provided, which may be a terminal, and its internal structure may be as shown in Figure 6. The computer device includes a processor, memory, communication interface, display screen, and input device connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs stored in the non-volatile storage media. The communication interface is used for wired or wireless communication with external terminals; wireless communication can be achieved through Wi-Fi, mobile cellular networks, NFC (Near Field Communication), or other technologies. When executed by the processor, the computer program implements a visual question-answering data processing method.
[0137] Those skilled in the art will understand that the structure shown in Figure 6 is merely a block diagram of a portion of the structure related to the present application and does not constitute a limitation on the computer device to which the present application is applied. Specific computer devices may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.
[0138] In one embodiment, a computer device is provided, including a memory and a processor, wherein the memory stores a computer program, and the processor executes the computer program to perform the following steps:
[0139] The visual question-answering image to be predicted and the query information for the visual question-answering image are obtained and input into a pre-trained visual question-answering prediction model. The pre-trained visual question-answering prediction model includes a fusion network with a dual-branch structure and a multilayer perceptual network. Both the visual question-answering image and the query information are generated in a financial business scenario.
[0140] The query information is processed by the text feature extraction network in the fusion network to obtain text feature information, and the visual question-and-answer image is processed by the image feature extraction network in the fusion network to obtain image feature information; the text feature extraction network is adjusted based on a hierarchical attention mechanism.
[0141] Based on the text feature information, the image feature information is classified at the pixel level through the multilayer perceptron to obtain the model output prediction result, which serves as the predicted question and answer information corresponding to the query information.
[0142] In one embodiment, the processor, when executing a computer program, also implements the steps of the visual question-answering data processing method in the other embodiments described above.
[0143] In one embodiment, a computer-readable storage medium is provided having a computer program stored thereon, the computer program performing the following steps when executed by a processor:
[0144] The visual question-answering image to be predicted and the query information for the visual question-answering image are obtained and input into a pre-trained visual question-answering prediction model. The pre-trained visual question-answering prediction model includes a fusion network with a dual-branch structure and a multilayer perceptual network. Both the visual question-answering image and the query information are generated in a financial business scenario.
[0145] The query information is processed by the text feature extraction network in the fusion network to obtain text feature information, and the visual question-and-answer image is processed by the image feature extraction network in the fusion network to obtain image feature information; the text feature extraction network is adjusted based on a hierarchical attention mechanism.
[0146] Based on the text feature information, the image feature information is classified at the pixel level through the multilayer perceptron to obtain the model output prediction result, which serves as the predicted question and answer information corresponding to the query information.
[0147] In one embodiment, when the computer program is executed by a processor, it also implements the steps of the visual question-answering data processing method in the other embodiments described above.
[0148] In one embodiment, a computer program product is provided, including a computer program that, when executed by a processor, performs the following steps:
[0149] The visual question-answering image to be predicted and the query information for the visual question-answering image are obtained and input into a pre-trained visual question-answering prediction model. The pre-trained visual question-answering prediction model includes a fusion network with a dual-branch structure and a multilayer perceptual network. Both the visual question-answering image and the query information are generated in a financial business scenario.
[0150] The query information is processed by the text feature extraction network in the fusion network to obtain text feature information, and the visual question-and-answer image is processed by the image feature extraction network in the fusion network to obtain image feature information; the text feature extraction network is adjusted based on a hierarchical attention mechanism.
[0151] Based on the text feature information, the image feature information is classified at the pixel level through the multilayer perceptron to obtain the model output prediction result, which serves as the predicted question and answer information corresponding to the query information.
[0152] In one embodiment, when the computer program is executed by a processor, it also implements the steps of the visual question-answering data processing method in the other embodiments described above.
[0153] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium, and when executed, it can include the processes of the embodiments of the above methods. Any references to memory, databases, or other media used in the embodiments provided in this application can include at least one of non-volatile and volatile memory. Non-volatile memory can include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetic random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, etc. Volatile memory can include random access memory (RAM) or external cache memory, etc. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM). The databases involved in the embodiments provided in this application may include at least one type of relational database and non-relational database. Non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors involved in the embodiments provided in this application may be general-purpose processors, central processing units, graphics processing units, digital signal processors, programmable logic devices, quantum computing-based data processing logic devices, etc., and are not limited to these.
[0154] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.
[0155] The embodiments described above are merely illustrative of several implementation methods of this application, and while the descriptions are specific and detailed, they should not be construed as limiting the scope of this patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the protection scope of this application. Therefore, the protection scope of this application should be determined by the appended claims.
Claims
1. A visual question answering data processing method, characterized in that the method includes: A visual question-answering image to be predicted and the query information for that image are acquired and input into a pre-trained visual question-answering prediction model. The pre-trained model includes a fusion network with a dual-branch structure and a multilayer perceptual network. Both the visual question-answering image and the query information are generated in a financial business scenario. The fusion network has a first branch that is a text feature extraction network and a second branch that is an image feature extraction network. The text feature extraction network is obtained by adjusting an initial text feature extraction network using a weighted accumulation method and a hierarchical attention mechanism. The hierarchical attention mechanism combines features from different text levels in the question-answering information with a self-attention mechanism to enhance the network's ability to acquire structural information by leveraging the correlation between different hierarchical structures. The query information is processed by the text feature extraction network in the fusion network to obtain text feature information, and the visual question-and-answer image is processed by the image feature extraction network in the fusion network to obtain image feature information; the text feature extraction network is adjusted based on a hierarchical attention mechanism. Based on the text feature information, the image feature information is classified at the pixel level through the multilayer perceptron to obtain the model output prediction result, which serves as the predicted question and answer information corresponding to the query information.
2. The method of claim 1, wherein, The step of processing the visual question-answering image through the image feature extraction network in the fusion network to obtain image feature information includes: Image information of the visual question-answering image is extracted through the image feature extraction network in the fusion network; By combining the text association region features and text recognition features corresponding to the visual question-and-answer image, as well as the image information, visual representation information is obtained; the text association region features and the text recognition features are used to adjust image understanding to provide a comprehensive visual language view; Based on the visual representation information and the visual question-and-answer image, a target feature image is obtained, which serves as the image feature information.
3. The method according to claim 2, characterized in that, The step of performing pixel-level classification processing on the image feature information based on the text feature information through the multilayer perceptron to obtain the model output prediction result includes: The target feature image is restored by the multilayer perceptron to obtain the processed feature image. The decoder of the multilayer perceptron is used to perform pixel-level classification on the processed feature image based on the text feature information to obtain the model output prediction result.
4. The method of claim 1, wherein, The pre-trained visual question answering prediction model was trained using the following method: Obtain a training sample set; each training sample in the training sample set consists of a sample image and multiple question-and-answer information pairs contained in the sample image; both the sample image and the multiple question-and-answer information pairs are collected based on financial business scenarios; Based on the fusion model with the dual-branch structure and the multilayer perception network, a visual question answering prediction model to be trained is constructed. The first branch of the fusion model is the text feature extraction network adjusted based on the hierarchical attention mechanism, and the second branch of the fusion model is the image feature extraction network. The training sample set is used to train the visual question answering prediction model to be trained, thereby obtaining the pre-trained visual question answering prediction model.
5. The method of claim 4, wherein, prior to the step of obtaining the training sample set, the method further comprises: obtaining an initial sample set collected under the financial business scenario, wherein each initial sample in the initial sample set contains different question-and-answer information pairs with different query object types; and processing the initial sample set according to preset processing information, and obtaining the training sample set and a test sample set based on the processed initial sample set, wherein the preset processing information is used to instruct that a data filtering operation and an image size adjustment operation be performed on the initial sample set.
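The preprocessing of claim 5 (data filtering, image size adjustment, then a split into training and test sets) can be sketched as follows. The filtering rule, the target size, the split ratio, and the names `Sample`, `resize_nearest`, and `preprocess` are all illustrative assumptions; the claims do not fix any of them.

```python
import random
from dataclasses import dataclass

@dataclass
class Sample:
    image: list      # 2-D list of pixel values (rows of columns)
    qa_pairs: list   # list of (question, answer) string tuples

def resize_nearest(image, out_h, out_w):
    # Nearest-neighbour resize of a 2-D list image to out_h x out_w.
    in_h, in_w = len(image), len(image[0])
    return [
        [image[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

def preprocess(samples, size=(4, 4), test_ratio=0.25, seed=0):
    kept = []
    for s in samples:
        # Assumed filtering rule: drop blank question/answer pairs, and drop
        # samples left with no pairs or with an empty image.
        pairs = [(q.strip(), a.strip()) for q, a in s.qa_pairs
                 if q.strip() and a.strip()]
        if s.image and s.image[0] and pairs:
            kept.append(Sample(resize_nearest(s.image, *size), pairs))
    # Deterministic shuffle, then split into training and test sets.
    rng = random.Random(seed)
    rng.shuffle(kept)
    n_test = int(len(kept) * test_ratio)
    return kept[n_test:], kept[:n_test]
```

A call such as `train_set, test_set = preprocess(initial_samples)` returns the two sets; a real pipeline would typically resize with an image library rather than nested lists.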
6. The method of claim 5, wherein, after obtaining the pre-trained visual question answering prediction model, the method further comprises: obtaining preset evaluation information, wherein the preset evaluation information is used to statistically analyze the accuracy of the predicted question-and-answer results during model testing; and testing the pre-trained visual question answering prediction model using the test sample set, wherein the model test result of the pre-trained visual question answering prediction model is obtained by combining the preset evaluation information with the predicted question-and-answer results output by the pre-trained visual question answering prediction model.
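One simple form the accuracy statistic of claim 6 could take is exact-match accuracy over the test sample set. The sketch below assumes that metric and a case- and whitespace-insensitive comparison; neither is specified by the claims, and the names `normalize` and `accuracy` are hypothetical.

```python
def normalize(answer):
    # Assumed normalisation: lowercase and collapse runs of whitespace.
    return " ".join(answer.lower().split())

def accuracy(predictions, references):
    # Fraction of predicted answers that match the reference answer exactly
    # after normalisation.
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)
```

For example, `accuracy(["Yes", " 12 "], ["yes", "13"])` counts one match out of two answers.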
7. The method of claim 4, wherein, before the step of constructing the visual question answering prediction model to be trained based on the fusion model with the dual-branch structure and the multilayer perceptual network, the method further comprises: constructing a text feature extraction network for processing text feature extraction tasks as the first branch, and constructing an image feature extraction network for processing image feature extraction tasks as the second branch; and merging the first branch and the second branch to obtain the fusion model with the dual-branch structure.
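The construction in claim 7 amounts to building two independent feature extractors and wrapping them in one model. A minimal structural sketch follows; the class name `DualBranchFusion` and the two toy encoders are placeholders for illustration only and stand in for the actual text and image feature extraction networks.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

@dataclass
class DualBranchFusion:
    # First branch: maps query text to a feature vector.
    text_branch: Callable[[str], List[float]]
    # Second branch: maps an image (2-D list of pixels) to a feature vector.
    image_branch: Callable[[Sequence[Sequence[float]]], List[float]]

    def __call__(self, query, image) -> Tuple[List[float], List[float]]:
        # Each input is routed through its own branch; both feature vectors
        # are returned for the downstream prediction head.
        return self.text_branch(query), self.image_branch(image)

def toy_text_encoder(query):
    # Placeholder encoder: character count and word count.
    return [float(len(query)), float(len(query.split()))]

def toy_image_encoder(image):
    # Placeholder encoder: mean pixel value.
    pixels = [p for row in image for p in row]
    return [sum(pixels) / len(pixels)]

fusion = DualBranchFusion(toy_text_encoder, toy_image_encoder)
text_feat, image_feat = fusion("what is total", [[0, 2], [4, 6]])
```

Keeping the branches as separate callables lets either one be replaced or retrained without touching the other.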
8. A visual question answering data processing apparatus, characterized in that the apparatus comprises: a data acquisition module, configured to acquire the visual question-answering image to be predicted and the query information for the visual question-answering image, and to input them into a pre-trained visual question answering prediction model. The pre-trained visual question answering prediction model includes a fusion network with a dual-branch structure and a multilayer perceptual network. Both the visual question-answering image and the query information are generated in a financial business scenario. The first branch of the fusion network is a text feature extraction network, and the second branch is an image feature extraction network. The text feature extraction network is obtained by adjusting an initial text feature extraction network in combination with a weighted accumulation method and a hierarchical structure attention mechanism. The hierarchical structure attention mechanism is used to combine the features of different text levels in the question-and-answer information with a self-attention mechanism, so as to enhance the network's ability to acquire structural information by utilizing the degree of correlation between different hierarchical structures.
The apparatus further comprises: a visual question answering prediction model processing module, configured to process the query information through the text feature extraction network in the fusion network to obtain text feature information, and to process the visual question-answering image through the image feature extraction network in the fusion network to obtain image feature information, wherein the text feature extraction network is adjusted based on a hierarchical attention mechanism; and a predicted question-and-answer information acquisition module, configured to perform pixel-level classification processing on the image feature information based on the text feature information through the multilayer perceptron, and to obtain the model output prediction result as the predicted question-and-answer information corresponding to the query information.
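The hierarchical structure attention mechanism and the weighted accumulation method recited in claim 8 can be sketched in plain Python. The sketch assumes one feature vector per text level (for example word, phrase, and sentence level), uses the level features directly as queries, keys, and values without learned projections, and normalises the accumulation weights with a softmax. These are simplifying assumptions, and `hierarchical_attention` and `level_weights` are hypothetical names.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hierarchical_attention(level_feats, level_weights):
    # level_feats: one feature vector per text level, all of dimension d.
    # level_weights: one raw score per level, normalised below.
    d = len(level_feats[0])
    attended = []
    for query in level_feats:
        # Self-attention across levels: scaled dot-product scores measure
        # the degree of correlation between this level and every level.
        scores = [dot(query, key) / math.sqrt(d) for key in level_feats]
        attn = softmax(scores)
        attended.append(
            [sum(a * v[i] for a, v in zip(attn, level_feats)) for i in range(d)]
        )
    # Weighted accumulation: combine the attended level features into a
    # single text feature vector.
    w = softmax(level_weights)
    return [sum(wl * feat[i] for wl, feat in zip(w, attended)) for i in range(d)]
```

In a trained network the projections and the accumulation weights would be learned parameters; here they are fixed so that the data flow stays visible.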
9. A computer device comprising a memory and a processor, wherein the memory stores a computer program, characterized in that, when the processor executes the computer program, the processor implements the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium having stored thereon a computer program, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 7 are implemented.
11. A computer program product comprising a computer program, characterized in that, when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 7 are implemented.