Statistical chart automatic question answering method based on weakly supervised alignment mechanism

CN118132720B · Active · Publication Date: 2026-06-23 · NANJING UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
NANJING UNIV
Filing Date
2024-03-20
Publication Date
2026-06-23

Abstract

The application provides a statistical chart automatic question answering method based on a weakly supervised alignment mechanism, comprising the following steps: step 1, collecting a statistical chart and a question and answer pair set corresponding to the statistical chart, wherein each question and answer pair comprises question text and answer text; step 2, performing alignment target identification to obtain a data point related to the question text in each question and answer pair in the corresponding statistical chart, and taking the data point as an alignment target; step 3, training a question and answer model based on the weakly supervised alignment mechanism by using the alignment target identified in step 2 to obtain a trained question and answer model; and step 4, using the trained question and answer model to perform reasoning on a target statistical chart and a question to obtain a predicted answer, and completing the statistical chart automatic question answering based on the weakly supervised alignment mechanism.

Description

Technical Field

[0001] This invention relates to an automatic question answering method, and more particularly to an automatic question answering method for statistical graphs based on a weakly supervised alignment mechanism.

Background Technology

[0002] In today's rapidly developing information age, data and information processing are increasingly important. Against this backdrop, statistical charts (including bar charts, line charts, pie charts, scatter plots, etc.) have become a key data visualization tool for transforming complex data into easily understandable visual elements. These charts play a vital role in fields such as financial analysis, educational presentation, medical data interpretation, and scientific research. By presenting data in an intuitive graphical form, they help people quickly grasp statistical indicators and development trends, and thus support decision-making and knowledge discovery. However, for most ordinary users, accurately extracting and deeply understanding the relevant information in these diverse statistical charts remains a challenge, because it requires not only the ability to accurately interpret the visual elements in the charts and their interrelationships, but also basic data analysis skills.

[0003] To address this problem, Chart Question Answering (CQA) technology emerged. The core goal of CQA is to build an intelligent system to which users submit a statistical chart image and a question in natural language, and which automatically infers the answer by calling models and algorithms. For example, as shown in Figure 1, the user uploads a line graph showing the temperature over a week and asks, "How many degrees Celsius higher is the temperature on July 4th compared to July 1st?" The system automatically infers the answer "6".

[0004] Existing technical solutions in the field of statistical graph question answering mainly fall into two categories: "table-based question answering" methods and "non-table-based question answering" methods. Table-based question answering methods require a series of deep learning models and rules to convert the statistical graph into tabular text, and then rely on a specially trained table-based question answering model to generate the answer based on the tabular text. In contrast, non-table-based question answering methods attempt to bypass the table conversion step, utilizing image feature extractors and text feature extractors to obtain the image features of the statistical graph and the text features of the user's question, respectively, and then directly predicting the answer through supervised learning of deep neural networks.

[0005] In statistical graph question-answering tasks, correctly matching data points in the statistical graph to the question text is crucial; this process is known as "alignment." Existing techniques, such as the paper PlotQA: Reasoning over Scientific Plots (Methani N, Ganguly P, Khapra MM, et al. PlotQA: Reasoning over scientific plots[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2020: 1527-1536.), convert statistical graphs into tables before question answering. The drawback of this method is that the visual elements and structural information contained in the statistical graph are indispensable for a deep understanding of its content, and the conversion process loses this visual information and the intuitive representation of the statistical graph, which reduces question-answering accuracy. Furthermore, the conversion process relies on manually written rules, which cannot cover all types of statistical graphs, resulting in significant cumulative errors. Incorrect table information further reduces question-answering accuracy.

[0006] Other existing technologies, such as patent document US11386114(B2), Structure-based transformers with localization and encoding for chart question answering, and the paper Classification-Regression for Chart Comprehension (Levy M, Ben-Ari R, Lischinski D. Classification-regression for chart comprehension[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 469-484.), do not convert statistical charts to tables. Instead, they design an image feature extractor and a text feature extractor, fuse the image features with the text features of the user's question, and directly predict the answer through supervised learning of deep neural networks.

[0007] However, all of the above methods have certain drawbacks. Specifically, the patent document mentioned above applies dynamic coding technology and focuses on aligning the question text with the text elements of the statistical chart (e.g., horizontal axis labels, vertical axis labels, and titles), but fails to fully consider the necessity of color and position alignment, both of which are crucial for handling questions requiring numerical reasoning. The aforementioned papers introduce cross-attention mechanisms to implicitly enhance the alignment between text and images, but these works often fail to achieve the desired results because the reasoning process lacks supervision, leading to inaccurate identification of the alignment target.

Summary of the Invention

[0008] Purpose of the invention: The technical problem to be solved by the present invention is to provide an automatic question answering method for statistical graphs based on a weakly supervised alignment mechanism, which addresses the shortcomings of the existing technology.

[0009] To address the aforementioned technical problems, this invention discloses an automatic question-answering method for statistical graphs based on a weakly supervised alignment mechanism, comprising the following steps:

[0010] Step 1: Collect the statistical chart and the corresponding set of question-and-answer pairs. Each question-and-answer pair includes the question text and the answer text.

[0011] Step 2: Perform alignment target identification to obtain the relevant data points of the question text in each question-answer pair in the corresponding statistical chart, and use these data points as alignment targets;

[0012] Step 3: Use the alignment targets identified in Step 2 to train a question-answering model based on a weakly supervised alignment mechanism to obtain a trained question-answering model;

[0013] Step 4: Using the trained question-answering model, reason about the target statistical graph and the question to obtain the predicted answer, thus completing the automatic question answering of the statistical graph based on the weakly supervised alignment mechanism.

[0014] Furthermore, the alignment target identification described in step 2 specifically includes:

[0015] Step 2-1: Let the set of n data points in the statistical graph be $D = \{d_1, d_2, \ldots, d_n\}$, where $d_n$ represents the n-th data point in the data point set D, and let Q and A represent the question text and the answer text, respectively, in any question-answer pair corresponding to this statistical chart.

[0016] Step 2-2: Set up an operator set O, which includes a set of single operators and a set of compound operators. The set of single operators contains only single operators, i.e., the most basic operators. The set of compound operators consists of two or more single operators combined.

[0017] Based on the number of single operators Count(o) in operator o∈O, recursively construct the set $D_A$ of combinations of Count(o)+1 data points from the data point set D, as follows:

[0018] $D_A = \{D_1, D_2, \ldots, D_i, \ldots, D_{n-\mathrm{Count}(o)}\}$

[0019] $D_i = \{d_i, d_{i+1}, \ldots, d_{i+\mathrm{Count}(o)}\}, \quad 1 \le i \le n - \mathrm{Count}(o)$

[0020] where $D_i$ represents the i-th combination.

[0021] For each combination $D_i$, construct (Count(o)+1)! permutations, fill the operator o with the data points of each permutation, in order, as operands, and thereby construct the expression set E. Each expression e∈E is used as a candidate solution to the question text Q, in order to find the expression that corresponds to Q.
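The construction in step 2-2 can be illustrated with a minimal Python sketch that follows the formulas for $D_A$ and $D_i$ literally: each combination is a window of Count(o)+1 consecutive data points, and every window is expanded into all of its orderings. The function name `build_operand_permutations` and the string labels are hypothetical and serve only as an illustration.

```python
from itertools import permutations

def build_operand_permutations(data_points, op_count):
    """Return every ordering of every window D_i of op_count + 1 consecutive points.

    data_points: the data point set D as a list [d_1, ..., d_n].
    op_count:    Count(o), the number of single operators in operator o.
    """
    size = op_count + 1
    # D_A = {D_1, ..., D_{n - Count(o)}}, with D_i = {d_i, ..., d_{i + Count(o)}}
    windows = [data_points[i:i + size] for i in range(len(data_points) - size + 1)]
    # Each window yields (Count(o) + 1)! operand orderings.
    return [list(p) for window in windows for p in permutations(window)]

orderings = build_operand_permutations(["d1", "d2", "d3", "d4"], op_count=1)
```

For four data points and a single operator, this gives three windows with two orderings each, so six candidate operand lists.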

[0022] Step 2-3: Calculate the result R(e) of each generated expression e, compare the result with the answer text A, and use the preset evaluation index to obtain the error Δ(e) corresponding to the expression e.

[0023] Step 2-4: Calculate the semantic similarity S(e,Q) between each expression e and the question text Q to determine the degree of matching between the expression and the question text;

[0024] Step 2-5: Sort all expressions in ascending order of error Δ(e) and descending order of semantic similarity S(e,Q). Select the highest-ranking expression e as the best match for the question text Q, and use the data points involved in the expression as the alignment target for the question-answer pair.

[0025] Furthermore, the question-answering model described in step 3 specifically includes:

[0026] The question-answering model takes question text and statistical graph as input and outputs predicted answer text.

[0027] The model consists of four modules: a feature encoding module, a feature fusion module, a weakly supervised alignment module, and an answer prediction module.

[0028] The feature encoding module includes a text feature encoder and an image feature encoder, which are used to encode the input question text and statistical graph, respectively, to obtain the text feature embedding sequence and the image feature embedding sequence.

[0029] The feature fusion module fuses the above text feature embedding sequence and image feature embedding sequence through a cross-attention mechanism to obtain the fused text feature sequence and image feature sequence;

[0030] The weakly supervised alignment module predicts the probability of each image feature in the image feature sequence as an alignment target based on the text features in the text feature sequence, thereby weighting the image features.

[0031] The answer prediction module combines text features and weighted image features, and uses a feedforward neural network to predict the final answer text.

[0032] Furthermore, the image feature encoder employs the Mask R-CNN model.

[0033] Furthermore, the text feature encoder described herein employs the BERT model.

[0034] Furthermore, the training of the question-answering model described in step 3 specifically includes:

[0035] Step 3-1: Collect training data, including: the set of data points D in the statistical graph, the question text Q, the answer text A, and the probability set $T = \{T_1, \ldots, T_n\}$ of each element of the data point set being an alignment target;

[0036] Step 3-2: Perform forward computation using a question-answering model;

[0037] Step 3-3: Calculate the prediction loss to obtain the total loss function L;

[0038] Step 3-4: Calculate the gradient based on the total loss function L and update the model parameters using the backpropagation algorithm, as follows:

[0039] $\theta_{new} = \theta_{old} - \eta \nabla_{\theta} L$

[0040] where $\theta_{new}$ represents the updated model parameters, $\theta_{old}$ represents the model parameters before the update, and η is the learning rate;

[0041] Step 3-5: Based on the preset performance requirements, determine whether the model has converged. If it has converged and reached the preset number of training rounds, proceed to step 3-6; otherwise, return to step 3-2 to continue training and increment the number of training rounds by 1.

[0042] Step 3-6: Complete the training.

[0043] Furthermore, the forward computation using a question-answering model described in step 3-2 specifically includes:

[0044] Step 3-2-1: Encode the statistical graph using an image feature encoder to obtain the image feature embedding sequence $C \in \mathbb{R}^{n \times h}$, where $\mathbb{R}^{n \times h}$ represents a real vector space of dimension n×h and h represents the dimension of the embedding vector of each feature;

[0045] Step 3-2-2: Encode the question text using a text feature encoder to obtain the question text feature embedding sequence $Q' \in \mathbb{R}^{m \times h}$, which has the same embedding dimension as the image feature embedding, where m represents the length of the question text;

[0046] Step 3-2-3: The image feature embedding sequence C and the text feature embedding sequence Q′ obtained above are processed through a k-layer cross-attention mechanism to obtain the text-to-image fusion feature $D_{text}$ and the image-to-text fusion feature $D_{img}$, as follows:

[0047]

[0048]

[0049] where $\mathrm{Pool}_{[CLS]}$ represents the pooling operation; Softmax represents the Softmax function; $W_{q1}$, $W_{k1}$, and $W_{v1}$ represent the query, key, and value matrices of the text-to-image cross-attention mechanism, respectively; and $W_{q2}$, $W_{k2}$, and $W_{v2}$ represent the query, key, and value matrices of the image-to-text cross-attention mechanism, respectively.

[0050] Step 3-2-4: Based on the text-to-image fusion feature $D_{text}$ and the image-to-text fusion feature $D_{img}$, predict the probability that each data point in the set D of data points in the statistical graph is the alignment target, as follows:

[0051]

[0052] Here the formula gives the probability that the j-th data point is the alignment target, sigmoid represents the sigmoid function, and the formula refers to the j-th feature of the image-to-text fusion feature $D_{img}$;

[0053] Step 3-2-5: Based on the predicted probabilities, the data point features in the statistical graph are weighted and fused to obtain the aligned image features $D_{align}$, as follows:

[0054]

[0055] Step 3-2-6: Based on the question text feature $D_{text}$ and the aligned image features $D_{align}$, predict the answer using a feedforward neural network.
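The weighting in steps 3-2-4 and 3-2-5 can be illustrated with a small sketch in plain Python. The exact formulas of these steps are not reproduced in the text, so the sketch rests on explicit assumptions: the alignment probability of the j-th data point is taken to be the sigmoid of the dot product between a pooled text feature vector and the j-th image feature, and the aligned feature is taken to be the probability-weighted sum of the image features. The names `sigmoid` and `align` are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def align(d_text, d_img):
    """Weight image features by their predicted alignment probability.

    d_text: pooled text feature, a list of h floats (assumed shape).
    d_img:  n image features, each a list of h floats.
    Returns (probabilities, aligned feature of length h).
    """
    # Assumed scoring rule: sigmoid of the text/image dot product.
    probs = [sigmoid(sum(t * v for t, v in zip(d_text, feat))) for feat in d_img]
    h = len(d_text)
    # Assumed fusion rule: probability-weighted sum over the n data points.
    d_align = [sum(p * feat[k] for p, feat in zip(probs, d_img)) for k in range(h)]
    return probs, d_align

probs, d_align = align([1.0, 0.0], [[0.0, 1.0], [0.0, 2.0]])
```

In this toy input both image features are orthogonal to the text feature, so both probabilities are 0.5 and the aligned feature is the half-weighted sum of the two image features.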

[0056] Furthermore, the calculation of the prediction loss described in step 3-3 specifically includes:

[0057] Step 3-3-1: Calculate the absolute error between the predicted answer $\hat{A}$ and the answer text A, and use it as the regression task loss $L_{answer}$, as follows:

[0058] $L_{answer} = |\hat{A} - A|$

[0059] Step 3-3-2: Calculate the cross-entropy between the predicted alignment target and the true alignment target, and use it as the classification task loss $L_{align}$, as follows:

[0060]

[0061] Step 3-3-3: Calculate the weighted sum of the regression task loss $L_{answer}$ and the classification task loss $L_{align}$, and use it as the total loss function, as follows:

[0062] $L = \alpha L_{answer} + \beta L_{align}$

[0063] Here, α and β are weighting parameters used to balance the losses of the two tasks.
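The loss computation of step 3-3 can be sketched as follows. The regression term is the absolute error described above. The cross-entropy formula is not reproduced in the text, so the sketch assumes a binary cross-entropy summed over the data points, with targets $T_i$ in [0, 1]. The function name `total_loss` and the small constant `eps` (for numerical stability) are hypothetical.

```python
import math

def total_loss(pred_answer, answer, probs, targets, alpha=1.0, beta=1.0, eps=1e-12):
    """L = alpha * L_answer + beta * L_align.

    pred_answer, answer: predicted and true numeric answers.
    probs:   predicted alignment probability per data point.
    targets: true alignment target T_i per data point (0 or 1).
    """
    # Regression task loss: absolute error between prediction and answer.
    l_answer = abs(pred_answer - answer)
    # Classification task loss: assumed binary cross-entropy, summed over points.
    l_align = -sum(
        t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
        for p, t in zip(probs, targets)
    )
    return alpha * l_answer + beta * l_align

loss = total_loss(5.0, 6.0, probs=[0.5, 0.5], targets=[1, 0])
```

With these inputs the regression term is 1 and each of the two alignment terms contributes ln 2.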

[0064] Furthermore, the evaluation index described in step 2-3 is the absolute error:

[0065] Δ(e)=|R(e)–A|

[0066] Furthermore, the semantic similarity described in step 2-4 uses the Jaccard similarity coefficient, calculated as follows:

[0067] S(e,Q)=|T(e)∩T(Q)| / |T(e)∪T(Q)|

[0068] Where T(e) represents the character set of expression e, T(Q) represents the character set of question text Q, ∩ represents taking the intersection, ∪ represents taking the union, and |·| represents taking the number of elements in the set.

[0069] Beneficial effects:

[0070] From a technical perspective, the technical solution of this invention (1) can obtain the alignment information between the question text and the statistical graph data points through the alignment target recognition algorithm, providing weak supervision signals for model training. (2) By introducing a weak supervision-based alignment mechanism during the training process of the statistical graph question answering model, the alignment performance of the model can be effectively improved, thereby improving the performance of the model on complex numerical reasoning problems.

[0071] From an application perspective, the technical solution of this invention (1) improves the efficiency of ordinary users in the data understanding and decision-making process. Regardless of their technical background, users can obtain direct interpretations of complex statistical data through simple queries, simplifying the data processing flow. (2) The improved accuracy of the statistical chart question-and-answer system is particularly important for data-based decision-making. In the financial field, it can help analysts and investors quickly interpret market trends and portfolio performance; in the medical field, it can help doctors and researchers quickly understand clinical trial results or patient data; in the education field, teachers and students can use these tools to analyze and understand complex educational statistics.

Attached Figure Description

[0072] The present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments, and the advantages of the present invention in the above and / or other aspects will become clearer.

[0073] Figure 1 is an example of statistical chart question answering in the existing technology.

[0074] Figure 2 is a schematic diagram of the statistical graph question-and-answer method proposed in this invention.

[0075] Figure 3 is a schematic diagram of the alignment target recognition method proposed in this invention.

[0076] Figure 4 is a schematic diagram of the model training process in this invention.

[0077] Figure 5 is a schematic diagram of the statistical graph question-answering model architecture based on the weakly supervised alignment mechanism proposed in this invention.

Detailed Implementation

[0078] This invention proposes an alignment target recognition algorithm specifically designed to address the alignment problem between question text and statistical graph elements in the field of statistical graph question answering. The algorithm employs heuristic rules to accurately identify the alignment relationship between the question text and data points in the statistical graph, thereby providing necessary weak supervision signals for complex multi-step reasoning processes. This method reduces the model's blind search in the vast solution space and effectively avoids alignment errors common in traditional methods. Building upon this, this invention further constructs a statistical graph question answering model. This model utilizes the weak supervision signals generated by the alignment target recognition algorithm to perform multi-task learning, enabling it to process question answering tasks while identifying alignment targets, effectively improving the accuracy of statistical graph question answering.

[0079] The basic principle of this invention is as follows: by designing an alignment target recognition algorithm to obtain the alignment relationship between the question text and the data points in the statistical graph, the necessary weak supervision signal is provided for the complex multi-step reasoning process, allowing the model to perform multi-task learning of alignment target recognition and question answering, thereby improving the accuracy of statistical graph question answering.

[0080] This invention proposes an automatic question-answering method for statistical graphs based on a weakly supervised alignment mechanism. By designing heuristic rules to obtain the corresponding computational expressions for question-answer pairs, statistical graph data points corresponding to the question text are selected. This provides necessary weakly supervised signals for the complex multi-step reasoning process, allowing the model to learn to focus on question-related data points during supervised training, thereby improving the accuracy of statistical graph question answering. The specific technical solution is as follows:

[0081] The flowchart of the statistical graph question-and-answer method proposed in this invention is shown in Figure 2. The specific steps include:

[0082] Step 101: The system receives two types of input: a statistical chart (a chart uploaded by the user, here taking a line chart of temperature changes over a week as an example; this chart should contain key information, such as dates and temperature values) and a question-and-answer pair (a question asked by the user and its corresponding answer, for example the question "How many degrees Celsius higher is the temperature on July 4th than on July 1st?" and the answer "6").

[0083] Step 102: Using the alignment target recognition algorithm proposed in this invention, obtain the relevant data points of the question text in the statistical graph in each question-answer pair as alignment targets.

[0084] Here, "data point" is defined as the area representing data in a statistical chart, including rectangles in a bar chart, line segments in a line chart, dots in a scatter plot, and sectors in a pie chart;

[0085] "Data points related to the question text" refers to the data points that determine the answer to the question. For example, in the question "How many degrees Celsius higher is the temperature on July 4th than on July 1st?", the relevant data points are the image areas corresponding to "temperature on July 4th" and "temperature on July 1st".

[0086] This invention is based on the assumption that each question Q can be expressed by a mathematical expression e∈E. For example, in a specific embodiment, the mathematical expression for "How many degrees Celsius higher is the temperature on 7/4 than on 7/1?" is "(7/4 temperature) minus (7/1 temperature)", and "7/4 temperature" and "7/1 temperature" are the alignment targets for this question. The specific alignment target identification process is shown in Figure 3:

[0087] Step 201: Input the set of statistical chart data points $D = \{d_1, d_2, \ldots, d_n\}$, the question Q, and the answer A.

[0088] Step 202: Set an operator set O, which includes a set of single operators {+,-,×,÷} and a set of compound operators {+×,-×,+-,…}, where compound operators are composed of two or more single operators; based on the number of single operators Count(o) in operator o∈O, recursively construct all permutations of Count(o)+1 data points from the data point set D:

[0089] $D_A = \{D_1, D_2, \ldots, D_i, \ldots, D_{n-\mathrm{Count}(o)}\}$

[0090] $D_i = \{d_i, d_{i+1}, \ldots, d_{i+\mathrm{Count}(o)}\}, \quad 1 \le i \le n - \mathrm{Count}(o)$

[0091] where $D_i$ represents the i-th combination. Each combination $D_i$ can form (Count(o)+1)! permutations. The data points of each permutation are filled into the operator o, in order, as operands to construct the expression set E, where each expression e∈E attempts to solve the question Q, in order to find the expression corresponding to the question.

[0092] Step 203: Calculate the result R(e) for each generated mathematical expression e and compare it with the true answer to evaluate the magnitude of the error. Here, the absolute value error Δ(e) = |R(e) - A| is used as the evaluation index.

[0093] Step 204: For each expression e, calculate the semantic similarity S(e,Q) with the question Q to ensure that the selected expression logically matches the question. To guarantee computational performance, the Jaccard similarity coefficient is used to measure the similarity between the expression text and the question text. Specifically, S(e,Q)=|T(e)∩T(Q)| / |T(e)∪T(Q)|, where T(x) represents the character set of text x. The larger the coefficient, the higher the text similarity. For the provided line graph, the expression set shown in Table 1 can be constructed:

[0094] Table 1. Expression Set Table

[0095]
Operator | Operand 1 | Operand 2 | Expression | Result error | Text similarity
+ | July 1st temperature | July 2nd temperature | July 1st temperature + July 2nd temperature | 42 | 0.294
+ | July 1st temperature | July 3rd temperature | July 1st temperature + July 3rd temperature | 43 | 0.294
…… | …… | …… | …… | …… | ……
- | July 1st temperature | July 2nd temperature | July 1st temperature - July 2nd temperature | 8 | 0.294
- | July 4th temperature | July 1st temperature | July 4th temperature - July 1st temperature | 0 | 0.375

[0096] Step 205: Sort all expressions e in ascending order of error Δ(e) and descending order of similarity S(e,Q), and select the expression with the highest ranking as the final solution, i.e. the best match. The data points involved in this expression are also the alignment targets of the question-answer pair.
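Steps 201 to 205 can be sketched end to end in Python. This is a minimal sketch under stated assumptions, not the implementation of the embodiment: only the four single operators are enumerated (compound operators are omitted for brevity); operands are drawn from all ordered pairs of data points, since Table 1 pairs non-adjacent days such as July 4th and July 1st; and the temperature values are illustrative numbers chosen to be consistent with the errors listed in Table 1. The names `OPERATORS`, `jaccard`, and `identify_alignment_targets` are hypothetical. The similarity values of Table 1 are not reproduced, because they depend on the original-language text of the question.

```python
from itertools import permutations
import operator

# Single binary operators only; compound operators are omitted for brevity.
OPERATORS = {"+": operator.add, "-": operator.sub,
             "*": operator.mul, "/": operator.truediv}

def jaccard(a, b):
    """Character-set Jaccard similarity S(e, Q)."""
    ta, tb = set(a), set(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def identify_alignment_targets(points, question, answer):
    """points maps a data point label to its value.

    Returns (expression text, operand labels) of the top-ranked expression.
    """
    ranked = []
    for symbol, fn in OPERATORS.items():
        # Assumption: operands come from all ordered pairs of data points.
        for left, right in permutations(points, 2):
            try:
                result = fn(points[left], points[right])
            except ZeroDivisionError:
                continue  # skip undefined expressions
            text = f"{left} {symbol} {right}"
            error = abs(result - answer)           # step 203: absolute error
            similarity = jaccard(text, question)   # step 204: S(e, Q)
            ranked.append((error, -similarity, text, (left, right)))
    # Step 205: ascending error, then descending similarity.
    ranked.sort(key=lambda item: (item[0], item[1]))
    _, _, best_text, operands = ranked[0]
    return best_text, operands

# Illustrative values chosen to be consistent with the errors in Table 1.
temperatures = {
    "July 1st temperature": 23,
    "July 2nd temperature": 25,
    "July 3rd temperature": 26,
    "July 4th temperature": 29,
}
question = "How many degrees Celsius higher is the temperature on July 4th than on July 1st?"
expression, targets = identify_alignment_targets(temperatures, question, 6)
print(expression)  # July 4th temperature - July 1st temperature
```

With these values only one expression reproduces the answer 6 exactly, so the error criterion alone decides the ranking; the similarity term acts as a tie-breaker when several expressions share the same error.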

[0097] Step 103: Use the alignment target identified in the previous step to train the question answering model.

[0098] The question-answering model architecture is shown in Figure 5. The input consists of question text and a statistical graph, and the output is the answer to the question. The model consists of four modules: a feature encoding module, a feature fusion module, a weakly supervised alignment module, and an answer prediction module. The feature encoding module encodes the input text and image respectively to obtain the corresponding embedding sequences. The feature fusion module obtains the fused text feature sequence and image feature sequence through a cross-attention mechanism. The weakly supervised alignment module predicts, from the text features, the probability of each image feature being an alignment target, and weights the image features accordingly. The answer prediction module combines the text features and the weighted image features and uses a feedforward neural network to predict the final answer.

[0099] The specific model training process is shown in Figure 4:

[0100] Step 301: Prepare all the data and information required for training, including the statistical graph and its data point set D, the question text Q, the answer A, and the probability set $T = \{T_1, \ldots, T_n\}$ of each element of the data point set being an alignment target.

[0101] Step 302, forward computation of the question-answering model: The statistical graph features are encoded using the Mask R-CNN model (He K, Gkioxari G, Dollár P, et al. Mask R-CNN[C]//Proceedings of the IEEE International Conference on Computer Vision. 2017: 2961-2969.) or a similar model to obtain the image feature embedding sequence. The text features are encoded using the BERT model (Devlin J, Chang MW, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.) or a similar model to obtain the text feature embedding sequence. The two sequences then pass through a k-layer cross-attention mechanism to obtain the text-to-image fusion feature $D_{text}$ and the image-to-text fusion feature $D_{img}$. Here, $\mathrm{Pool}_{[CLS]}$ represents the pooling operation, specifically taking the first and last features of the feature sequence and discarding the rest; Softmax represents the Softmax function; $W_{q1}$, $W_{k1}$, and $W_{v1}$ represent the query, key, and value matrices of the text-to-image cross-attention mechanism, respectively; and $W_{q2}$, $W_{k2}$, and $W_{v2}$ represent the query, key, and value matrices of the image-to-text cross-attention mechanism, respectively. Next, based on $D_{text}$ and $D_{img}$, the model predicts the probability of each data point being the alignment target. The model then performs weighted fusion of the image data point features according to the predicted probabilities to obtain the aligned image features $D_{align}$, which allows the model to dynamically focus on the data points most relevant to the current question. Finally, the model uses $D_{align}$ and $D_{text}$ to predict the answer to the question through a feedforward neural network.

[0102] Step 303, calculate the prediction loss: Calculate the absolute error between the predicted answer and the true answer as the regression task loss $L_{answer}$, and calculate the cross-entropy between the predicted alignment target and the true alignment target as the classification task loss $L_{align}$, where $T_i$ is the distribution of the true alignment target and is compared with the predicted distribution of the alignment target. The total loss function is a weighted sum of the two: $L = \alpha L_{answer} + \beta L_{align}$, where α and β are weighting parameters used to balance the losses of the two tasks.

[0103] Step 304: Calculate the gradient based on the total loss function L, and update the model parameters using the backpropagation algorithm to improve model performance, i.e., $\theta_{new} = \theta_{old} - \eta \nabla_{\theta} L$, where θ represents the model parameters and η is the learning rate.

[0104] Step 305, Determine model convergence: Evaluate whether the model has converged to a satisfactory performance level. If it has not converged, it may be necessary to return to step 302 to continue training.

[0105] Step 306, End Model Training: Once the model converges or reaches the preset number of training rounds, the training process ends.
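The control flow of steps 302 to 306 can be sketched on a toy one-parameter problem. The sketch assumes the standard gradient descent update for step 304 and, because the convergence criterion is not specified in the text, uses a hypothetical test on the gradient magnitude. The names `train`, `grad_fn`, and `tol` are illustrative only.

```python
def train(theta, grad_fn, lr, max_epochs, tol):
    """Gradient descent that stops on convergence or after max_epochs rounds."""
    epoch = 0
    for epoch in range(1, max_epochs + 1):
        grad = grad_fn(theta)        # step 304: gradient of the total loss
        theta = theta - lr * grad    # theta_new = theta_old - lr * grad
        if abs(grad) < tol:          # step 305: assumed convergence test
            break                    # step 306: end training
    return theta, epoch

# Toy loss L(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3).
theta, epochs = train(0.0, lambda t: 2.0 * (t - 3.0), lr=0.1, max_epochs=1000, tol=1e-6)
```

The loop ends as soon as either condition of step 306 holds: the convergence test passes, or the preset number of training rounds is exhausted.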

[0106] Step 104: The trained model is used to reason about a new input statistical graph and question. The model directly outputs a predicted answer based on the question text and the statistical graph. In addition, the model can predict the alignment target of the question, which improves the interpretability of the model.

[0107] Example:

[0108] The method proposed in this invention is entirely written in Python and implemented using the PyTorch framework. The machine used in this embodiment has an Nvidia GeForce RTX 3090 graphics card with 24GB of video memory, an Intel(R) Xeon(R) Silver 4210R CPU@2.40GHz processor, and 251GB of RAM.

[0109] The experimental dataset was prepared as follows: the experiments use the PlotQA dataset open-sourced by Methani et al. (Methani N, Ganguly P, Khapra MM, et al. PlotQA: Reasoning over scientific plots[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2020: 1527-1536.), which has two versions, V1 and V2, containing a total of 224,000 statistical images and 8.2 million question-answer pairs. Each statistical image is labeled with the region and value of each data point. In the experiment, the dataset was divided into a training set, a validation set, and a test set in a ratio of 8:1:1.

[0110] Comparative experiments were conducted against existing open-source methods, including Method 1: PlotQA (Methani N, Ganguly P, Khapra MM, et al. PlotQA: Reasoning over scientific plots[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2020: 1527-1536.) and Method 2: CRCT (Levy M, Ben-Ari R, Lischinski D. Classification-regression for chart comprehension[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 469-484.). The experimental results show that the method proposed in this invention achieves significantly higher accuracy.

[0111] The operation process of this embodiment is as follows:

[0112] 1. An alignment target recognition algorithm is used to obtain the alignment target for each question-answer pair. The maximum number of compound operators used in the implementation is binary operators. On the PlotQA dataset, the alignment target recognition accuracy is 91.25%.

[0113] 2. Construct a question-answering model based on weakly supervised alignment with the following parameters: the dimension of the question and statistical graph element embeddings is set to h = 768. The maximum lengths of the question text sequence and the statistical graph object sequence are set to m = 128 and n = 64, respectively. The model uses k = 3 collaborative attention modules to improve processing capacity. The weights assigned to the alignment and regression tasks are α = 1 and γ = 10.

[0114] 3. Train the question-answering model based on weakly supervised alignment according to the following process and parameters: during training, the number of training epochs is set to 10 and the batch size to 60, and a linear learning rate scheduling strategy is adopted, with an initial learning rate η = 2 × 10⁻⁵ and the first 1000 steps used as a warm-up phase.
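The parameters listed in items 2 and 3 can be gathered into a single configuration, as in the minimal Python sketch below. The field names and the `linear_lr` helper are hypothetical. The embodiment states only that the schedule is linear with a 1000-step warm-up, so the exact shape used here (linear rise from zero to η, then linear decay to zero at the last step) is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters quoted in the embodiment; the field names are illustrative."""
    hidden_dim: int = 768             # h, embedding dimension
    max_question_len: int = 128       # m, maximum question text sequence length
    max_object_len: int = 64          # n, maximum statistical graph object sequence length
    num_coattention_modules: int = 3  # k
    alpha: float = 1.0                # weight of the alignment task
    gamma: float = 10.0               # weight of the regression task
    epochs: int = 10
    batch_size: int = 60
    base_lr: float = 2e-5             # initial learning rate (eta)
    warmup_steps: int = 1000


def linear_lr(step: int, total_steps: int, cfg: TrainConfig) -> float:
    """Learning rate at a given step (assumed: linear warm-up, then linear decay)."""
    if step < cfg.warmup_steps:
        # Warm-up phase: rise linearly from 0 to the base learning rate.
        return cfg.base_lr * step / cfg.warmup_steps
    # Decay phase: fall linearly from the base learning rate to 0.
    decay_steps = max(1, total_steps - cfg.warmup_steps)
    return cfg.base_lr * max(0, total_steps - step) / decay_steps


cfg = TrainConfig()
```

In a PyTorch implementation, a function of this kind would typically be wrapped in a learning rate scheduler rather than called by hand.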

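[0115] The experimental results are shown in Table 2. On the statistical graph question-answering dataset, the method proposed in this invention achieves higher accuracy than the existing methods on both PlotQA-V1 and PlotQA-V2. Compared with CRCT, the stronger of the two baselines, accuracy rises from 76.94 to 78.34 on PlotQA-V1 and from 34.43 to 58.15 on PlotQA-V2, demonstrating the effectiveness of the method proposed in this invention.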

[0116] Table 2 Experimental Results

[0117]

| Method | PlotQA-V1 accuracy (%) | PlotQA-V2 accuracy (%) |
|---|---|---|
| The method proposed in this invention | 78.34 | 58.15 |
| PlotQA method | 53.96 | 22.52 |
| CRCT method | 76.94 | 34.43 |

[0118] In a specific implementation, this application provides a computer storage medium and a corresponding data processing unit. The computer storage medium can store a computer program which, when executed by the data processing unit, carries out the automatic question-answering method for statistical graphs based on a weakly supervised alignment mechanism provided by the invention, as well as some or all of the steps in the various embodiments. The storage medium can be a magnetic disk, an optical disk, read-only memory (ROM), random access memory (RAM), etc.

[0119] Those skilled in the art will clearly understand that the technical solutions in the embodiments of the present invention can be implemented using computer programs and their corresponding general-purpose hardware platforms. Based on this understanding, the technical solutions in the embodiments of the present invention, or the parts that contribute to the prior art, can be embodied in the form of computer programs, i.e., software products. These computer program software products can be stored in a storage medium and include several instructions to cause a device containing a data processing unit (which may be a personal computer, server, microcontroller (MCU), or network device, etc.) to execute the methods described in various embodiments or certain parts of the embodiments of the present invention.

[0120] This invention provides an approach and method for automatic question answering of statistical graphs based on a weakly supervised alignment mechanism. Many methods and approaches exist for implementing this technical solution; the above description is merely a preferred embodiment of the invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principles of this invention, and these improvements and modifications should also be considered within the scope of protection of this invention. All components not explicitly stated in this embodiment can be implemented using existing technologies.

Claims

1. A method for automatic question answering of statistical graphs based on a weakly supervised alignment mechanism, characterized in that it includes the following steps:
Step 1: Collect statistical charts and the corresponding sets of question-answer pairs, where each question-answer pair includes a question text and an answer text;
Step 2: Perform alignment target identification to obtain the data points in the corresponding statistical chart that are relevant to the question text of each question-answer pair, and use these data points as the alignment target;
Step 3: Train a question-answering model based on a weakly supervised alignment mechanism using the alignment targets identified in Step 2 to obtain the trained question-answering model; the input of the question-answering model is the question text and the statistical graph, and the output is the predicted answer text; the question-answering model consists of four modules: a feature encoding module, a feature fusion module, a weakly supervised alignment module, and an answer prediction module; the weakly supervised alignment module uses the text features in the text feature sequence to predict the probability that each image feature in the image feature sequence is an alignment target, and weights the image features accordingly;
Step 4: Use the trained question-answering model to perform inference on the target statistical graph and the question to obtain the predicted answer, thereby completing the automatic question answering of statistical graphs based on the weakly supervised alignment mechanism;
wherein the alignment target identification described in Step 2 specifically includes:
Step 2-1: Let the set of the $N$ data points contained in the statistical chart be $D = \{d_1, d_2, \dots, d_N\}$, where $d_i$ denotes the $i$-th data point in $D$; for any question-answer pair of this statistical chart, let the question text be $q$ and the answer text be $a$;
Step 2-2: Set an operator set $O$ that includes a set of single operators and a set of compound operators; the set of single operators contains only single operators, that is, the most basic operators, and each compound operator is a combination of two or more single operators; for an operator $o \in O$, according to its number of unit operators, recursively construct from the data point set $D$ all permutations of data points, each combination of data points constituting one permutation; fill the data points of each permutation, in order, into the operator $o$ as operands to construct an expression set $E$, in which each expression $e \in E$ is a candidate expression relating the question text $q$ to the answer text $a$;
Step 2-3: Calculate the result $r$ of each generated expression $e$, compare the result with the answer text $a$, and evaluate the error using a preset evaluation metric to obtain the error $\epsilon$ of the expression;
Step 2-4: Calculate the semantic similarity $s$ between each expression $e$ and the question text $q$, which measures how well the expression matches the question text;
Step 2-5: Sort all expressions in ascending order of error $\epsilon$ and descending order of semantic similarity $s$, select the top-ranked expression as the best match for the question text $q$, and use the data points involved in that expression as the alignment target of the question-answer pair.

2. The automatic question-answering method for statistical graphs based on a weakly supervised alignment mechanism according to claim 1, characterized in that, in the question-answering model described in Step 3, the feature encoding module includes a text feature encoder and an image feature encoder, which encode the input question text and statistical graph, respectively, to obtain the text feature embedding sequence and the image feature embedding sequence; the feature fusion module fuses the above text feature embedding sequence and image feature embedding sequence through a cross-attention mechanism to obtain the fused text feature sequence and image feature sequence; the answer prediction module combines the text features and the weighted image features, and uses a feedforward neural network to predict the final answer text.

3. The automatic question-answering method for statistical graphs based on a weakly supervised alignment mechanism according to claim 2, characterized in that the image feature encoder uses the Mask R-CNN model.

4. The automatic question-answering method for statistical graphs based on a weakly supervised alignment mechanism according to claim 3, characterized in that the text feature encoder uses the BERT model.

5. The automatic question-answering method for statistical graphs based on a weakly supervised alignment mechanism according to claim 4, characterized in that the training of the question-answering model described in Step 3 specifically includes:
Step 3-1: Collect training data, including: the set $D$ of data points in the statistical chart, the question text $q$, the answer text $a$, and the set $Y$, each element of which represents the probability that the corresponding data point is an alignment target;
Step 3-2: Perform forward computation using the question-answering model;
Step 3-3: Calculate the prediction loss to obtain the total loss function $\mathcal{L}$;
Step 3-4: Calculate the gradient based on the total loss function $\mathcal{L}$ and update the model parameters using the backpropagation algorithm, as follows: $\theta' = \theta - \eta \nabla_{\theta} \mathcal{L}$, where $\theta'$ represents the updated model parameters, $\theta$ represents the model parameters before the update, and $\eta$ is the learning rate;
Step 3-5: Based on the preset performance requirements, determine whether the model has converged; if it has converged and reached the preset number of training rounds, proceed to Step 3-6; otherwise, increment the number of training rounds by 1 and return to Step 3-2 to continue training;
Step 3-6: Complete the training.

6. The automatic question-answering method for statistical graphs based on a weakly supervised alignment mechanism according to claim 5, characterized in that the forward computation using the question-answering model described in Step 3-2 specifically includes:
Step 3-2-1: Encode the statistical graph using the image feature encoder to obtain the image feature embedding sequence $V \in \mathbb{R}^{n \times h}$, where $\mathbb{R}^{n \times h}$ is the real vector space of dimension $n \times h$ and $h$ represents the dimension of the embedding vector of each feature;
Step 3-2-2: Encode the question text using the text feature encoder to obtain the question text feature embedding sequence $T \in \mathbb{R}^{m \times h}$, whose embedding dimension is the same as that of the image feature embedding, where $m$ indicates the length of the question text;
Step 3-2-3: Pass the image feature embedding sequence $V$ and the text feature embedding sequence $T$ through a $k$-layer cross-attention mechanism to obtain the text-to-image fusion features $\tilde{T}$ and the image-to-text fusion features $\tilde{V}$; the computation involves a pooling operation together with the query, key, and value matrices of the text-to-image cross-attention mechanism and those of the image-to-text cross-attention mechanism;
Step 3-2-4: Based on the text-to-image fusion features $\tilde{T}$ and the image-to-text fusion features $\tilde{V}$, predict the probability that each data point in the data point set $D$ of the statistical chart is an alignment target; the probability $p_i$ that the $i$-th data point is an alignment target is obtained by applying the Sigmoid function $\sigma$ to a score computed from $\tilde{T}$ and $\tilde{V}_i$, where $\tilde{V}_i$ represents the $i$-th feature of the image-to-text fusion features $\tilde{V}$;
Step 3-2-5: Based on the predicted probabilities, weight and fuse the data points in the statistical graph to obtain the aligned image features $V_{\mathrm{align}}$;
Step 3-2-6: Based on the question text features and the aligned image features $V_{\mathrm{align}}$, predict the answer $\hat{a}$ using a feedforward neural network.
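The alignment and weighting of Steps 3-2-4 and 3-2-5 can be illustrated with a minimal pure-Python sketch. The source fixes only that a Sigmoid function produces one probability per data point and that the image features are then weighted by these probabilities. The dot-product score between a pooled text vector and each image feature, the unnormalized weighted sum, and the function names are assumptions made for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued score to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def align_and_fuse(text_feature, image_features):
    """Return per-data-point alignment probabilities and the fused image feature.

    text_feature:   pooled text-to-image fusion vector (list of floats)
    image_features: one image-to-text fusion vector per data point
    """
    # Assumed scoring: dot product of the text vector with each image feature.
    probs = [
        sigmoid(sum(t * v for t, v in zip(text_feature, feature)))
        for feature in image_features
    ]
    # Weighted fusion: sum of image features scaled by their probabilities.
    dim = len(text_feature)
    fused = [
        sum(p * feature[j] for p, feature in zip(probs, image_features))
        for j in range(dim)
    ]
    return probs, fused


probs, fused = align_and_fuse([1.0, 0.0], [[0.0, 1.0], [10.0, 0.0]])
```

In this example the second data point receives a probability close to 1 and dominates the fused feature, while the first contributes with weight 0.5.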

7. The automatic question-answering method for statistical graphs based on a weakly supervised alignment mechanism according to claim 6, characterized in that the calculation of the prediction loss described in Step 3-3 specifically includes:
Step 3-3-1: Calculate the absolute error between the predicted answer $\hat{a}$ and the answer text $a$ as the regression task loss: $\mathcal{L}_{\mathrm{reg}} = |\hat{a} - a|$;
Step 3-3-2: Calculate the cross-entropy between the predicted alignment target and the true alignment target as the classification task loss $\mathcal{L}_{\mathrm{cls}}$;
Step 3-3-3: Calculate the weighted sum of the regression task loss $\mathcal{L}_{\mathrm{reg}}$ and the classification task loss $\mathcal{L}_{\mathrm{cls}}$ as the total loss function $\mathcal{L}$, where $\alpha$ and $\gamma$ are the weighting parameters used to balance the losses of the two tasks.
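The loss of Steps 3-3-1 to 3-3-3 can be sketched in plain Python as follows. Several details are assumptions rather than statements of the source: the cross-entropy is taken as binary cross-entropy averaged over data points, and α is applied to the alignment (classification) loss and γ to the regression loss, following the order in which the embodiment lists α = 1 and γ = 10. The function names are hypothetical.

```python
import math


def regression_loss(predicted: float, target: float) -> float:
    """Absolute error between the predicted answer and the answer value."""
    return abs(predicted - target)


def alignment_loss(probs, labels, eps: float = 1e-12) -> float:
    """Binary cross-entropy between predicted and true alignment targets (mean)."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)


def total_loss(predicted, target, probs, labels, alpha=1.0, gamma=10.0) -> float:
    """Weighted sum of the two task losses (assumed assignment of alpha and gamma)."""
    return alpha * alignment_loss(probs, labels) + gamma * regression_loss(predicted, target)


loss = total_loss(7.5, 7.0, [0.5, 0.5], [1, 0])
```

With both probabilities at 0.5 the alignment term equals ln 2, and the regression term contributes 10 × 0.5.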

8. The automatic question-answering method for statistical graphs based on a weakly supervised alignment mechanism according to claim 7, characterized in that the evaluation metric described in Step 2-3 is the absolute error: $\epsilon = |r - a|$, where $r$ is the result of the expression and $a$ is the answer text.

9. The automatic question-answering method for statistical graphs based on a weakly supervised alignment mechanism according to claim 8, characterized in that the semantic similarity described in Step 2-4 uses the Jaccard similarity coefficient, calculated as follows: $s = \frac{|C_e \cap C_q|}{|C_e \cup C_q|}$, where $C_e$ represents the character set of the expression $e$, $C_q$ represents the character set of the question text $q$, $\cap$ denotes taking the intersection, $\cup$ denotes taking the union, and $|\cdot|$ denotes the number of elements in a set.
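The alignment target identification of Steps 2-1 to 2-5, with the absolute error and the Jaccard similarity coefficient as metrics, can be sketched in Python as below. This is an illustration under stated assumptions, not the claimed implementation: the operator set, the textual form of an expression, the tie-breaking by lexicographic sort, and all function names are hypothetical, since the source does not list its operators or how an expression is rendered as text.

```python
from itertools import permutations

# Hypothetical operator set: name -> (number of operands, function).
OPERATORS = {
    "value": (1, lambda x: x),
    "sum": (2, lambda x, y: x + y),
    "difference": (2, lambda x, y: x - y),
    "ratio": (2, lambda x, y: x / y if y != 0 else float("inf")),
}


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity coefficient between the character sets of two strings."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def identify_alignment_target(points: dict, question: str, answer: float):
    """Return (labels, expression_text, error, similarity) of the best expression."""
    candidates = []
    for name, (arity, fn) in OPERATORS.items():
        # Each ordered selection of `arity` data points is one operand list.
        for labels in permutations(points, arity):
            result = fn(*(points[label] for label in labels))
            error = abs(result - answer)          # absolute error against the answer
            text = f"{name} " + " ".join(labels)  # assumed textual form of the expression
            candidates.append((labels, text, error, jaccard(text, question)))
    # Ascending error first, then descending similarity to the question.
    candidates.sort(key=lambda c: (c[2], -c[3]))
    return candidates[0]


points = {"2010": 5.0, "2011": 8.0, "2012": 12.0}
best = identify_alignment_target(
    points, "What is the difference between 2012 and 2010?", 7.0
)
```

Here only the expression `difference 2012 2010` reproduces the answer 7.0, so the data points labeled 2012 and 2010 become the alignment target. The exhaustive enumeration grows quickly with the number of data points and operands, which is consistent with restricting compound operators in practice.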