Method and system for recognizing handwritten mathematical formula in test paper based on deep learning
By combining connected component and image projection methods for formula detection with a multi-scale attention encoding and decoding model, the problems of misidentification and omission in handwritten mathematical formula recognition are solved, achieving accurate segmentation and recognition of mathematical formulas in test paper images, which is suitable for intelligent grading in intelligent education.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANDONG UNIV
- Filing Date
- 2023-12-25
- Publication Date
- 2026-06-30
AI Technical Summary
Existing handwritten mathematical formula recognition technologies based on encoding and decoding models suffer from misidentification or omission when dealing with similar characters or characters of different sizes, and further exploration is needed, especially in the field of intelligent marking.
A deep learning-based method for recognizing handwritten mathematical formulas on exam papers is adopted. The formula detection is performed by combining connected component analysis and image projection. A multi-scale attention encoding and decoding model is used for recognition, including preprocessing, image augmentation, global character statistics, and local character classification, to improve the robustness and accuracy of the model.
It achieves accurate segmentation and recognition of mathematical formulas in test paper images, improves the accuracy of formula detection, reduces misidentification and omission, and is applicable to intelligent grading in the field of intelligent education, thereby improving grading efficiency and quality.
Description
Technical Field
[0001] This invention belongs to the field of text recognition technology, and in particular relates to a method and system for recognizing handwritten mathematical formulas on test papers based on deep learning.
Background Technology
[0002] The statements in this section are merely background information related to the present invention and do not necessarily constitute prior art.
[0003] Optical Character Recognition (OCR) is a subfield of computer vision and pattern recognition. It uses computer technology to analyze and recognize image files containing text information, extracting text and layout information and returning it as text. With the development of artificial intelligence and computer vision technologies, OCR technology has been widely applied and has made significant progress, finding broad applications in license plate recognition, ID card recognition, and invoice recognition. OCR technology typically consists of two processes: text detection and text recognition. Text detection aims to detect multiple recognition instances in an image and obtain the location regions of the text instances. Text recognition is used to identify and transcribe the text information contained in the recognition instances into a computer-usable sequence format.
[0004] Mathematical formulas are used to describe various concepts, theorems, and laws in mathematics and science, and are indispensable tools in mathematical and scientific research. With the widespread use of mobile devices such as styluses, tablets, and smartphones, people are increasingly using handwritten mathematical symbols as input. Handwritten mathematical formula recognition plays an important role in various application scenarios, including intelligent education, human-computer interaction, and academic paper writing assistance tools.
[0005] Recognizing handwritten mathematical formulas differs from traditional text recognition problems. Handwritten formulas present a more complex two-dimensional handwriting recognition problem, with their intricate internal two-dimensional spatial structure making them difficult to analyze. Recognizing handwritten mathematical formulas has been an active research area for many years. The recognition process can be viewed as translating the two-dimensional language of handwritten strokes in the formula image into a sequence form that computers can use, such as mathematical description languages like LaTeX or MathML.
[0006] Handwritten mathematical formula recognition involves two main problems: symbol recognition and structural analysis, which can be solved sequentially or globally.
[0007] Current approaches to handwritten mathematical formula recognition fall into two categories according to their framework: grammar-based and encoder-decoder-based. Traditional grammar-based methods, such as graph grammar, clause grammar, and relational grammar, recognize mathematical formulas through symbol segmentation, symbol recognition, and structural analysis. However, these methods have difficulty separating individual characters, their accuracy depends heavily on segmentation accuracy, and errors in earlier stages can severely affect later ones; for example, errors from symbol segmentation and recognition propagate to structural analysis.
[0008] With the development of deep learning, encoder-decoder models have performed very well in tasks such as scene text recognition. Since handwritten formula recognition is also an image-to-text modeling task, the problem of recognizing the tree structure of a mathematical formula has been transformed into recognizing the LaTeX string of the formula. This transformation simplifies the mathematical formula recognition problem and makes end-to-end recognition possible. However, existing handwritten mathematical formula recognition models based on encoder-decoder models still suffer from misrecognition or omissions when dealing with similar characters (such as the letter "X" and the operator "×") and with the large size differences between different parts of a formula caused by its two-dimensional structure (such as superscripts and subscripts). Therefore, handwritten mathematical formula recognition for intelligent marking remains a field worthy of further exploration.
Summary of the Invention
[0009] To overcome the shortcomings of the prior art, this invention provides a deep learning-based method for recognizing handwritten mathematical formulas on exam papers, which can accurately segment formulas in an image of the exam paper area and recognize the segmented independent formula images.
[0010] To achieve the above objectives, one or more embodiments of the present invention provide the following technical solutions:
[0011] In a first aspect, a deep learning-based method for recognizing handwritten mathematical formulas on exam papers is disclosed, including:
[0012] Acquire exam paper images, including images of the answer area and template.
[0013] Based on the template image, the mathematical formulas in the acquired test paper image are detected and segmented:
[0014] The answer area image of the exam paper typically contains answers to multiple questions, meaning the image includes multiple formulas, along with irrelevant information such as question numbers and answer boxes. Since the formula recognition module targets individual formulas, it's necessary to detect and acquire these individual formulas in the answer area image before recognizing the formulas themselves. Specifically, a combination of connected component analysis and image projection is used to calculate the bounding box of each individual formula within the answer area image, and then segment the image based on these bounding boxes to obtain the individual formula images.
[0015] The segmented independent formula images are recognized using a trained recognition model. The recognition model is a multi-scale attention encoding and decoding model that combines global and local features of the formula images. The encoder extracts the features of the formula images, and the decoder, based on the multi-scale attention mechanism, decodes the prediction sequence corresponding to the features of the formula images.
[0016] After being identified by the recognition model, the data is converted to LaTeX format, and the formula image and corresponding prediction results are saved.
[0017] As a further technical solution, the acquired test paper image is used for the fill-in-the-blank section of a math exam paper.
[0018] As a further technical solution, after acquiring the test paper image, image preprocessing and difference comparison steps are also included, specifically:
[0019] The image is converted to grayscale, denoised, and then transformed into a binary image.
[0020] The preprocessed test paper answer area image Img1 and template image Img2 are compared and analyzed to remove interference factors such as question numbers, resulting in a formula image Img containing only mathematical formulas to be segmented.
[0021] As a further technical solution, the independent formulas in the formula image Img to be segmented are detected and obtained, specifically including:
[0022] The connected component-based segmentation module initially segments the formula image Img to be segmented, and the projection-based segmentation module further segments the adhered parts. The independent formula bounding boxes obtained from the two modules contain the position information of all independent formulas in the test paper answer area image Img1. The original input test paper answer area image Img1 is then cropped according to these bounding boxes to obtain the independent formula images.
[0023] Among them, the test paper answer area image is a fill-in-the-blank question answer area image extracted from the actual test paper, which includes irrelevant information such as question number and answer box, and contains recognition instances of multiple questions. The purpose of the formula detection part is to detect each independent formula in the test paper answer area image and save it as an independent formula image.
[0024] Template image: An image containing only the answer box and question number is extracted from a blank test paper. This image is compared with the answer area image on the test paper, and irrelevant information is removed to obtain the formula image to be segmented.
[0025] Image of formulas to be segmented: Irrelevant information has been removed and only multiple formulas are contained, which will be used to obtain the bounding box of each formula in the next step;
[0026] Adhered formula images: Regions that the connected component-based segmentation method cannot separate; each such image still contains multiple formulas;
[0027] Independent formula images: All independent formula images obtained by the detection module, each image containing one formula.
[0028] As a further technical solution, the connected component-based segmentation module performs initial segmentation of the image Img to be segmented, including:
[0029] The formula image Img to be segmented is subjected to image dilation. Then, the connected components of the processed image are calculated, and the boundary coordinates of each connected component are obtained. According to the height threshold for independent formula images, these are divided into the set of boundary coordinates of independent formulas and the set of boundary coordinates of adhered formula blocks that need further processing.
[0030] As a further technical solution, a projection-based segmentation module further divides the adhered portion:
[0031] The adhered formula blocks are projected horizontally. The upper and lower boundaries of each independent formula in the formula block are calculated based on the projection array to obtain the formula rows. Then, the formula rows are projected vertically to obtain the left and right boundaries of the formulas. Finally, the set of boundary coordinates of each independent formula is obtained.
[0032] As a further technical solution, segmenting formulas based on the bounding boxes includes segmenting the independent formula images: the sets of independent formula boundary coordinates obtained by the connected component-based segmentation module and by the projection-based segmentation module are combined, and the independent formulas are cropped from the test paper answer area image Img1 for subsequent formula recognition.
[0033] As a further technical solution, the recognition model includes a preprocessing module, an image augmentation module, an encoder module, a decoder module, a global character statistics module, and a local character classification module;
[0034] The preprocessing module is used to preprocess formula images, annotations, and dictionaries;
[0035] The image augmentation module is used to randomly translate, rotate, and scale the segmentation formula image to enhance the diversity of the training data images;
[0036] The encoder module is used to extract high-dimensional features from the preprocessed formula image of the input, and uses a densely connected convolutional network as the encoder to extract image features.
[0037] The global character statistics module is used to predict the number of each symbol class in the formula image. It uses a weakly supervised method to provide global character information to the decoder module.
[0038] The decoder module is used to parse the prediction results corresponding to the high-dimensional features obtained by the encoder, and relies on the multi-scale attention mechanism to find the key parts in the high-dimensional features to achieve automatic segmentation of symbols. The most relevant region is selected to describe mathematical symbols or spatial operators, so that the GRU-based decoder can decode the LaTeX characters at the current time step.
[0039] The local character classification module directly classifies the feature vector at each position in the feature map obtained by the encoder module, and then calculates the probability of the current child node category by combining the multi-scale attention weights, thus providing local character category information.
[0040] In a second aspect, a deep learning-based system for recognizing handwritten mathematical formulas on exam papers is disclosed, including:
[0041] The test paper image acquisition module is configured to acquire test paper images, including acquiring answer images and template images;
[0042] The formula detection and segmentation module is configured to: detect and segment mathematical formulas in the acquired test paper image based on the template image; detect independent formulas in the test paper answer area image, use a combination of connected component and image projection method to calculate the bounding box of each independent formula in the test paper answer area image, and segment the formulas based on the bounding box.
[0043] The recognition module is configured to recognize the segmented independent mathematical formula images using a trained recognition model. The recognition model is a multi-scale attention encoding and decoding model that combines global and local features of the formula image. The encoder extracts the features of the formula image, and the decoder, based on the multi-scale attention mechanism, decodes the prediction sequence corresponding to the features of the formula image.
[0044] After being identified by the recognition model, the data is converted to LaTeX format, and the formula image and corresponding prediction results are saved.
[0045] The above one or more technical solutions have the following beneficial effects:
[0046] (1) This method combines two modules, formula detection and formula recognition, as two stages of the handwritten mathematical formula recognition method for test papers. The formula detection stage can obtain the position information and image information of each formula in the image of the mathematical fill-in-the-blank question area of the test paper. The formula recognition stage recognizes the independent formula images segmented in the detection stage into a computer-processable format (LaTeX format).
[0047] (2) This method integrates the connected component method and the projection method in the formula detection stage. Taking into account the distribution characteristics of the mathematical fill-in-the-blank questions in the test paper, it processes the easily segmented formulas and the difficult-to-segment adhered formulas separately, which allows the test paper image to be segmented completely and improves the accuracy of formula segmentation.
[0048] (3) This method adopts a data augmentation method suitable for mathematical formula recognition in mathematical fill-in-the-blank questions. Image augmentation is performed by image scaling, image translation and image rotation operations to simulate situations such as large differences in formula size, writing tilt and incomplete segmentation in actual scenarios, thereby improving the robustness of the model.
[0049] (4) In the formula recognition stage, this method uses an attention-based encoding and decoding model for end-to-end recognition. The DenseNet convolutional network is used as the encoder, and the decoder is implemented based on the attention mechanism and GRU. The multi-scale attention mechanism is adopted, and the feature map is combined with the position encoding for attention calculation, which improves the model's perception of the spatial position of the image. The multi-scale attention mechanism can handle symbols of different sizes in the formula image through convolution operations of different scales.
[0050] (5) This method uses global character statistics and local character classification modules to analyze the types and numbers of characters contained in the formula image based on image features. Simultaneously, during prediction, it directly predicts the character category based on the attention results, providing global and local information to the decoder module. The global character statistics module uses a weakly supervised method to provide global character information to the decoder module, thus achieving more accurate recognition results in the formula recognition stage. The local character classification module is used to obtain stronger feature extraction capabilities. It directly classifies the feature vector at each position in the feature map obtained by the encoder, and then calculates the probability of the current child node category by combining the attention weights. This loss function drives the GRU decoder based on a multi-scale attention mechanism to achieve better accuracy.
[0051] Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Attached Figure Description
[0052] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of the invention.
[0053] Figure 1 is a flowchart of a deep learning-based method for recognizing handwritten mathematical formulas on test papers according to an embodiment of the present invention.
[0054] Figure 2 is a flowchart illustrating the handwritten mathematical formula detection process in an embodiment of the present invention.
[0055] Figure 3 is a structural diagram of the handwritten mathematical formula recognition model for test papers according to an embodiment of the present invention.
[0056] Figure 4 is a diagram showing the internal structure of the global character statistics module in an embodiment of the present invention.
[0057] Figure 5 is a diagram illustrating the internal structure of the GRU decoder module based on a multi-scale attention mechanism according to an embodiment of the present invention.
[0058] Figure 6 is a diagram showing the internal structure of the local character classification module according to an embodiment of the present invention.
Detailed Implementation
[0059] It should be noted that the following detailed descriptions are exemplary and intended to provide further illustration of the invention. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.
[0060] It should be noted that the terminology used herein is for the purpose of describing particular implementations only and is not intended to limit the exemplary implementations of the present invention.
[0061] Where there is no conflict, the embodiments and features in the embodiments of the present invention can be combined with each other.
[0062] Example 1
[0063] This embodiment discloses a deep learning-based method for recognizing handwritten mathematical formulas on exam papers. It employs a formula detection method based on connected components and projection, which can accurately segment formulas in the answer area image of the exam paper. It then uses a multi-scale attention encoding and decoding model that combines global and local features of the formula image to recognize the segmented independent formula images and transcribe them into LaTeX sequences.
[0064] In the handwritten mathematical formula recognition system for exam papers, a trained recognition model is used to recognize the independent formula images obtained by the formula detection module. The two-dimensional images are converted into a computer-usable sequence, namely LaTeX format, to achieve automatic recognition of mathematical formulas in the exam paper images. As shown in Figure 1, the method consists of the following four steps:
[0065] Step 1: Obtain the test paper image: Obtain the image of the answer area and the template image of the test paper, and input the images into the system for further processing.
[0066] Step 2: Formula Detection: The mathematical formulas in the obtained test paper answer area image are detected and segmented to obtain the independent formulas in the test paper answer area image. The main method used is a combination of connected component and image projection. The bounding box of each independent formula in the test paper answer area image is calculated. The segmented formulas are ready for the next step of processing.
[0067] The image obtained from the test paper contains the answer areas for all fill-in-the-blank questions. It includes answers to multiple questions (multiple formulas in one image), along with question numbers, answer boxes, and other irrelevant information. The detection section aims to detect and segment each formula in the input test paper answer area image into independent formula images.
[0068] Step 3: Formula Recognition: The segmented independent mathematical formula images are recognized using a trained recognition model. The recognition model is a multi-scale attention encoding and decoding model that combines global and local features of the formula image. After the encoder extracts the features of the formula image, the decoder based on the multi-scale attention mechanism decodes the prediction sequence corresponding to the features of the formula image.
[0069] Step 4: Save the recognition results: After the formula image is recognized by the recognition model, it is converted into LaTeX format, and the formula image and the corresponding prediction results are saved.
[0070] In step two, formula detection:
[0071] In practical applications, handwritten mathematical formulas on exam papers are typically found in the answer areas of math fill-in-the-blank questions within the exam paper image. These answer areas often contain multiple formulas to be identified, along with irrelevant information such as question numbers and answer boxes. The formula detection component segments the acquired exam paper answer area image, filters out irrelevant information, and extracts the individual formula images from the exam paper image.
[0072] The main structure of the formula detection is shown in Figure 2. The image preprocessing module processes the input images. The test paper answer area image and the template image are compared by image differencing to remove interference such as question numbers, yielding the formula image to be segmented (generally containing multiple recognition targets, i.e., multiple formulas). The connected component-based segmentation module performs initial segmentation of the formula image to be segmented, and the projection-based segmentation module further segments the adhered parts. The independent formula bounding boxes obtained from the two modules contain the position information of all independent formulas in the test paper answer area image. The original input answer area image is then cropped according to these bounding boxes to obtain the independent formula images.
[0073] Step 2-1): Obtain the formula region image: Because the input math exam paper image has quality issues such as noise and contains irrelevant elements such as question numbers and answer boxes, the image preprocessing module performs preliminary processing on the obtained answer image and template image. Specifically, each image is converted to grayscale, denoised, and then converted into a binary image using the OTSU algorithm. The preprocessed exam paper answer region image Img1 and template image Img2 are compared by image differencing to remove interference such as question numbers and answer boxes, resulting in the formula image Img to be segmented, which contains only mathematical formulas.
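The binarization and template-difference step can be sketched as follows. This is a minimal illustration in pure Python, not the patented implementation: it assumes the answer image and the template image are already aligned and of equal size, represents a grayscale image as a list of rows of 8-bit values, and omits the denoising step. The function names `otsu_threshold`, `binarize`, and `remove_template` are hypothetical.

```python
def otsu_threshold(gray):
    """Return the Otsu threshold of a 2-D list of 8-bit gray values."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * n for i, n in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of gray values of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        # Between-class variance (up to a constant factor).
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray):
    """Dark (ink) pixels become 1, bright (paper) pixels become 0."""
    t = otsu_threshold(gray)
    return [[1 if v <= t else 0 for v in row] for row in gray]


def remove_template(answer_bin, template_bin):
    """Keep ink that appears in the answer image but not in the template."""
    return [[a & (1 - t) for a, t in zip(row_a, row_t)]
            for row_a, row_t in zip(answer_bin, template_bin)]


# Toy data: P = paper, I = ink. The bottom-left pixel stands for a question number.
P, I = 230, 20
answer = [[P, I, I, P],
          [P, P, I, P],
          [I, P, P, P]]
template = [[P, P, P, P],
            [P, P, P, P],
            [I, P, P, P]]
formula_img = remove_template(binarize(answer), binarize(template))
```

In this toy case the printed template mark is removed and only the handwritten ink remains in `formula_img`.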
[0074] Step 2-2) Preliminary segmentation by the connected component-based segmentation module: In practical application scenarios, the formulas in the answer area image of math fill-in-the-blank questions are clearly separated into rows and columns, so a connected component-based segmentation module can perform a preliminary segmentation of the formula image. Specifically, the formula image Img to be segmented is subjected to image dilation and other operations. Then the connected components of the processed image are calculated, and the boundary coordinates [left, top, right, bottom] of each connected component are obtained. Based on the height threshold of the formula detection module, these boundary coordinates are used to distinguish independent formulas from adhered formula blocks that need further processing: a connected component whose height is less than the height threshold is an independent formula, and its boundary coordinates are added to the set box, which contains the boundary coordinates of all independent formulas; a connected component whose height is greater than the height threshold is an adhered formula block, and its boundary coordinates are added to the set box', which contains the boundary coordinates of all adhered formula blocks.
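A minimal pure-Python sketch of this preliminary segmentation is given below. It is an illustration under assumptions, not the patented implementation: the image is a binary list of rows, dilation uses a square structuring element, components are 8-connected, right and bottom coordinates are exclusive, and a component whose height equals the threshold is treated as an adhered block (the description does not state this case). The names `dilate`, `component_boxes`, and `split_by_height` are hypothetical.

```python
from collections import deque


def dilate(img, radius=1):
    """Binary dilation with a (2 * radius + 1) square structuring element."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x]:
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = 1
    return out


def component_boxes(img):
    """Return [left, top, right, bottom] for every 8-connected component."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y0 in range(h):
        for x0 in range(w):
            if not img[y0][x0] or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue = deque([(y0, x0)])
            left, top, right, bottom = x0, y0, x0, y0
            while queue:  # breadth-first flood fill of one component
                y, x = queue.popleft()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and img[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            boxes.append([left, top, right + 1, bottom + 1])
    return boxes


def split_by_height(boxes, height_threshold):
    """Separate independent formulas from adhered formula blocks by height."""
    independent, adhered = [], []
    for b in boxes:
        height = b[3] - b[1]
        (independent if height < height_threshold else adhered).append(b)
    return independent, adhered


# Toy image: one short formula and one tall (adhered) block.
img = [[0] * 12 for _ in range(8)]
for y in range(1, 3):
    for x in range(1, 4):
        img[y][x] = 1
for y in range(1, 7):
    for x in range(7, 10):
        img[y][x] = 1
box, box_adhered = split_by_height(component_boxes(dilate(img)), height_threshold=6)
```

Dilation merges the strokes of one formula into a single component before labeling, at the cost of slightly enlarged bounding boxes.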
[0075] Step 2-3) Further segmentation by the projection-based segmentation module: Because candidates' handwriting styles vary, the answer formulas for different questions in the exam paper image may adhere to one another, which makes them difficult to separate with connected component-based methods. The projection-based segmentation module is used to segment the adhered formula blocks in box' that require further processing. A slicing operation is performed on the exam paper answer area image Img1 based on the coordinates contained in box' to obtain the adhered formula images. First, each adhered formula image is projected horizontally, and the upper and lower boundaries of each independent formula in the block are calculated from the projection array to obtain the formula rows. Then the formula rows are projected vertically to obtain the left and right boundaries of the formulas. Finally, the boundary coordinate set box″ of each independent formula is obtained. The specific process is as follows:
[0076] First, calculate the row projection of the adhered formula block. This is done by counting the number of pixels in each row to obtain the row projection array L = {l_1, l_2, ..., l_h}, l_i ∈ [0, w], where w and h represent the width and height of the adhered formula block, respectively. A mean filter is applied to the row projection array L to obtain a new, smoothed array L′.
[0077] Then, the starting point of a row is determined: starting from zero or from the previous ending point, the projection array L′ is traversed, and the first point whose value is greater than the set threshold t1 is taken as the starting point s of an independent formula block. Next, the ending point of that block is determined: starting from the selected starting point s, the ending point e = s + index(min{L′[s+t2, s+t3]}) is obtained according to the thresholds t2 and t3. This is repeated until the whole adhered formula block has been traversed and all formula rows have been split.
[0078] Finally, the left and right boundaries of the formula blocks are determined: In order to obtain more accurate bounding boxes, column projection is performed on the segmented formula rows to determine the left and right boundaries of the formulas, resulting in the bounding box set box″ for each formula.
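The projection procedure described above can be sketched as follows. This is one possible reading, stated as an assumption: t2 and t3 are interpreted as the minimum and maximum row heights, and the row ends at the smallest smoothed projection value found between s + t2 and s + t3, with the window offset added explicitly. Coordinates are relative to the adhered block and use exclusive right and bottom bounds. All function names are hypothetical.

```python
def row_projection(block):
    """Number of ink pixels in each row of a binary block."""
    return [sum(row) for row in block]


def mean_filter(values, k=3):
    """Moving average of odd width k; the window shrinks at the borders."""
    half = k // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out


def split_rows(smoothed, t1, t2, t3):
    """Return (start, end) row ranges, assuming 1 <= t2 < t3."""
    h = len(smoothed)
    rows, i = [], 0
    while i < h:
        if smoothed[i] <= t1:      # not yet inside a formula row
            i += 1
            continue
        s = i
        window = smoothed[s + t2: s + t3]
        if window:
            e = s + t2 + window.index(min(window))
        else:                      # block ends before the minimum height
            e = h
        rows.append((s, e))
        i = e
    return rows


def column_bounds(block, top, bottom):
    """Left and right ink bounds of a row band, or None if it is empty."""
    width = len(block[0])
    cols = [sum(block[y][x] for y in range(top, bottom)) for x in range(width)]
    inked = [x for x, c in enumerate(cols) if c > 0]
    return (inked[0], inked[-1] + 1) if inked else None


def split_adhered_block(block, t1, t2, t3, k=3):
    """Return [left, top, right, bottom] boxes for the formulas in a block."""
    smoothed = mean_filter(row_projection(block), k)
    boxes = []
    for top, bottom in split_rows(smoothed, t1, t2, t3):
        bounds = column_bounds(block, top, bottom)
        if bounds:
            boxes.append([bounds[0], top, bounds[1], bottom])
    return boxes


# Toy block: two formulas joined by a thin vertical stroke in rows 5 and 6.
block = [[0] * 8 for _ in range(13)]
for y in range(0, 5):
    for x in range(1, 6):
        block[y][x] = 1
for y in (5, 6):
    block[y][3] = 1
for y in range(7, 12):
    for x in range(2, 7):
        block[y][x] = 1
boxes = split_adhered_block(block, t1=3, t2=3, t3=8)
```

To map these boxes back to Img1, the block's own left and top offsets would be added to each coordinate.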
[0079] Step 2-4) Segmenting the independent formula image: The set box obtained by the connected component segmentation module and the set box″ obtained by the projection segmentation module contain the bounding boxes of all independent formulas in the test paper answer area image Img1 (the boundary coordinates of each bounding box are [left, top, right, bottom]). The independent formula image Result = Img1[top: bottom, left: right] is obtained by performing a slicing operation on the test paper answer area image Img1 and saved for subsequent formula recognition.
[0080] The formula detection module proposed in this invention combines the connected component method and the projection method. First, the connected component method is used to perform preliminary segmentation of independent formulas with non-overlapping handwriting and relatively regular shapes. Then, the projection method is used to further segment formulas with overlapping handwriting and complex shapes that cannot be handled by the connected component method. This makes the formula segmentation results more accurate and reduces the situation where multiple formulas are still in the same image after segmentation.
[0081] In step three, formula recognition:
[0082] The basic architecture of the handwritten mathematical formula recognition model used in this invention is an encoding and decoding model based on an attention mechanism. Specifically, it implements a multi-scale attention encoding and decoding model that combines global and local image features.
[0083] The handwritten mathematical formula recognition model recognizes the detected independent formula images. The model includes a preprocessing module, an image augmentation module, an encoder module that encodes image features, a GRU decoder module based on a multi-scale attention mechanism, a global character statistics module, and a local character classification module. Its basic structure is shown in Figure 3.
[0084] Preprocessing module:
[0085] The preprocessing module is used to preprocess the formula images, annotations, and dictionary: it converts the input formula images to grayscale, preprocesses the corresponding formula image labels, adds an end marker ("eos") to each formula annotation (LaTeX format), and constructs the correspondence between images and labels. The image dictionary contains all the characters in the dataset that need to be recognized (including the start marker "sos" and the end marker "eos"). During training, the model can map all the characters that appear to the dictionary index.
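The label side of this preprocessing can be sketched as follows. The sketch assumes that each annotation is a LaTeX string whose tokens are separated by spaces, which the description does not state, and it places "sos" and "eos" at indices 0 and 1 as an arbitrary choice. The names `build_dictionary` and `encode_label` are hypothetical.

```python
def build_dictionary(labels):
    """Map every token in the dataset, plus 'sos' and 'eos', to an index."""
    tokens = sorted({tok for label in labels for tok in label.split()})
    vocab = ["sos", "eos"] + tokens
    return {tok: i for i, tok in enumerate(vocab)}


def encode_label(label, dictionary):
    """Append the end marker and convert the tokens to dictionary indices."""
    return [dictionary[tok] for tok in label.split() + ["eos"]]


labels = ["x ^ { 2 } + 1", "\\frac { a } { b }"]
dictionary = build_dictionary(labels)
encoded = encode_label(labels[0], dictionary)
```

Each encoded sequence ends with the index of "eos", which lets the decoder learn when to stop emitting characters.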
[0086] Image augmentation module:
[0087] Because mathematical formula recognition, unlike traditional OCR, is a two-dimensional problem, traditional image augmentation operations such as image mirroring cannot be used when augmenting formula image data. Considering the application scenario of recognizing answers to mathematical fill-in-the-blank questions, the image augmentation module in this invention employs a set of image augmentation operations that preserve both the overall structure and the local feature information of the formula:
[0088] Random Image Translation: In math fill-in-the-blank questions, handwriting may overlap between questions and the segmentation process may introduce slight errors, so individual formula images may be incompletely segmented. A slight translation of the original image simulates this incomplete segmentation as it occurs in real-world scenarios, where Δx and Δy represent the horizontal and vertical translation distances of the formula image, respectively.
[0089]
[0090] Random Image Rotation: Mathematical formulas may be slanted due to different writing habits. The original formula image is slightly rotated around its center without disrupting the overall two-dimensional structure of the formula. Here, θ is the rotation angle, and (x_c, y_c) is the center point of the image.
[0091]
[0092] Random Image Scaling: Characters in formula image data may vary significantly in size due to differences in their meanings or their positions within a two-dimensional structure. The original formula image is randomly scaled down or up, where k is the scaling factor.
[0093]
[0094] By randomly combining the above image augmentation operations, the diversity of training data images is improved, mainly addressing the problems of large differences in character size, incomplete segmentation, and slanted writing in formula image data, thereby improving the robustness of the recognition model.
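The three operations can be written as homogeneous coordinate transforms, as in the sketch below. The source equations did not survive extraction, so the matrices here are the standard textbook forms rather than a reconstruction of the patent's formulas. The parameter ranges (`max_shift`, `max_angle`, `scale_range`) and the 50% selection probability are hypothetical values chosen only for illustration.

```python
import numpy as np

def translation(dx, dy):
    """Shift by dx horizontally and dy vertically."""
    return np.array([[1.0, 0.0, dx], [0.0, 1.0, dy], [0.0, 0.0, 1.0]])

def scaling(k):
    """Scale both axes by the factor k."""
    return np.array([[k, 0.0, 0.0], [0.0, k, 0.0], [0.0, 0.0, 1.0]])

def rotation(theta, xc, yc):
    """Rotate by theta (radians) around the center point (xc, yc)."""
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return translation(xc, yc) @ rot @ translation(-xc, -yc)

def random_augmentation(width, height, rng, max_shift=4.0,
                        max_angle=np.deg2rad(5.0), scale_range=(0.8, 1.2)):
    """Compose a random subset of the three operations into one 3x3 matrix."""
    ops = [
        translation(rng.uniform(-max_shift, max_shift),
                    rng.uniform(-max_shift, max_shift)),
        rotation(rng.uniform(-max_angle, max_angle), width / 2.0, height / 2.0),
        scaling(rng.uniform(*scale_range)),
    ]
    matrix = np.eye(3)
    for i in rng.permutation(len(ops)):   # random order
        if rng.random() < 0.5:            # each operation is applied with probability 0.5
            matrix = ops[i] @ matrix
    return matrix
```

The resulting matrix maps source pixel coordinates (x, y, 1) to augmented coordinates; an image library's affine warp would then resample the image accordingly.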
[0095] Encoder module:
[0096] The formula image feature encoding module takes the preprocessed formula image as input and extracts high-dimensional features from it. It uses DenseNet, a densely connected convolutional neural network composed of Dense Blocks, as the encoder. Within a Dense Block, each layer receives the feature maps of all preceding layers as additional input and passes its own feature map to all subsequent layers, which reduces the number of parameters and enhances feature propagation.
[0097]
[0098] The encoder's input is a grayscale image, where H′ and W′ are the height and width of the image. The output is the high-dimensional feature map F produced by the last convolutional layer. F can be regarded as a set a of feature vectors a_i, each corresponding to a local region of the feature map, i.e., a = {a_1, a_2, ..., a_M}, M = H × W.
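The dense connectivity and the flattening of F into the set a can be illustrated with a toy example. This is a simplified sketch, not DenseNet itself: each layer is reduced to a 1×1 convolution (a matrix product over channels) followed by ReLU, and all sizes (`H`, `W`, `C0`, `growth`, `layers`) are hypothetical.

```python
import numpy as np

def dense_block(x, weights):
    """x: (H, W, C0). Each layer sees the concatenation of all earlier feature maps."""
    features = [x]
    for w in weights:
        inputs = np.concatenate(features, axis=-1)       # input from all preceding layers
        features.append(np.maximum(inputs @ w, 0.0))     # 1x1 "convolution" + ReLU
    return np.concatenate(features, axis=-1)

rng = np.random.default_rng(0)
H, W, C0, growth, layers = 4, 6, 8, 4, 3
# Layer i receives C0 + i * growth channels and adds `growth` new ones.
weights = [rng.normal(size=(C0 + i * growth, growth)) for i in range(layers)]
F = dense_block(rng.normal(size=(H, W, C0)), weights)

# Flatten the feature map into M = H * W feature vectors a_i, one per position.
a = F.reshape(H * W, -1)
```

Here `a[i]` is the feature vector of position (i // W, i % W), which is the form the attention mechanism and the local character classification module operate on.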
[0099] Global character statistics module:
[0100] The global character statistics module is used to predict the number of occurrences of each symbol class in the formula image. It consists of a channel attention mechanism and a summation pooling layer; its basic structure is shown in Figure 4.
[0101] The feature map extracted by the encoder is transformed into the feature map F′ through a convolution operation, and the enhanced feature map F″ is then obtained through the channel attention mechanism. Here gap(·) is global average pooling, ReLU is the activation function, and W1, W2, b1, and b2 are weight parameters to be learned by the model.
[0102] N = ReLU(W1 · gap(F′) + b1)
[0103]
[0104] Finally, a 1×1 convolution is applied to the enhanced feature map F″ to reduce the number of feature channels to C, the number of character classes in the dictionary, so that each channel corresponds to one class. The Sigmoid activation function restricts each element to the range (0, 1), giving the global character statistics map M. Summing M over the H and W dimensions through summation pooling directly yields the global character statistics vector Global_V.
[0105]
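The module can be sketched in NumPy as follows. Only the first equation, N = ReLU(W1 · gap(F′) + b1), is given in the text. The gating step that turns N into F″ is an assumption here (a Sigmoid gate applied per channel, in the style of squeeze-and-excitation), and the matrix `W_out` standing in for the 1×1 convolution is a hypothetical name.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def global_character_statistics(F_prime, W1, b1, W2, b2, W_out):
    """F_prime: (H, W, D). Returns the statistics map M (H, W, C) and Global_V (C,)."""
    # N = ReLU(W1 gap(F') + b1), with gap = global average pooling over H and W.
    N = np.maximum(W1 @ F_prime.mean(axis=(0, 1)) + b1, 0.0)
    # Assumed gating step: one Sigmoid weight per channel rescales F' into F''.
    gate = sigmoid(W2 @ N + b2)
    F_enhanced = F_prime * gate
    # A 1x1 convolution to C channels is a matrix product over the channel axis.
    M = sigmoid(F_enhanced @ W_out)
    # Summation pooling over H and W gives one count estimate per character class.
    return M, M.sum(axis=(0, 1))
```

Because every element of M lies between 0 and 1, each entry of Global_V can be read as a soft count of how often the corresponding character class occurs in the image.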
[0106] Decoder module:
[0107] The GRU decoder module based on a multi-scale attention mechanism parses the prediction results corresponding to the high-dimensional features obtained by the encoder. Its basic structure is shown in Figure 5. Specifically, it relies on the multi-scale attention mechanism to find the key parts of the high-dimensional features, achieving automatic symbol segmentation. It selects the regions most relevant to describing mathematical symbols or spatial operators and passes them to the GRU-based decoder, which decodes the LaTeX character at the current time step until decoding is complete.
[0108] The Gated Recurrent Unit (GRU) is a variant of the LSTM that uses earlier information to influence the output of later nodes. Its basic structure stores the network's output in a memory unit, which is fed back into the network together with the next input; the current output is calculated through the update gate and the reset gate. The GRU acts as the decoder, transforming the feature map obtained from the encoder into a high-level representation. In the decoder, the GRU uses the previously transmitted state h_{t-1} and the input x_t of the current node to calculate the output h_t at the current time step.
[0109] h_t = GRU(h_{t-1}, x_t)
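For reference, a single GRU step can be sketched as below. The patent does not spell out the gate equations, so this uses the standard GRU formulation; the parameter dictionary `p` and its key names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(h_prev, x, p):
    """One step h_t = GRU(h_{t-1}, x_t) with the standard update and reset gates."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h_prev + p["bz"])             # update gate
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h_prev + p["br"])             # reset gate
    h_cand = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h_prev) + p["bh"])  # candidate state
    # The update gate interpolates between the old state and the candidate.
    return (1.0 - z) * h_prev + z * h_cand
```

Some references swap the roles of z and 1 − z in the last line; the two conventions are equivalent up to relabeling the gate.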
[0110] In the decoder, the attention mechanism scans the global image to identify the target area that needs to be focused on, and then devotes more attention resources to this area to obtain more detailed information related to the target, while ignoring other irrelevant information, helping the decoder to quickly filter out high-value information from a large amount of information.
[0111] Because symbols in a formula image differ greatly in size, the traditional coverage-based (overlay) attention mechanism cannot effectively focus on all symbols in the image at the same time. The multi-scale attention mechanism addresses this problem by using convolution operations of different sizes to obtain multi-scale features of the image. It attends to global information while also attending to local features, helping the recognition model to better handle the spatial structural relationships in the formula image.
[0112] To enhance the recognition model's perception of the spatial location of the formula image, the feature map F obtained by the encoder is convolved and then combined with position encoding to obtain an enhanced feature map.
[0113] The multi-scale attention mechanism is computed as follows. First, the sum A of all past attention is calculated: the overlay attention A is initialized as a zero vector and obtained by accumulating the attention scores of each time step. Convolution operations of different sizes are then applied to A to obtain A1 and A2. A1, A2, the output h_t of the decoder GRU, and the enhanced feature map are used together to calculate the current attention weight α_ti. The weighted sum of the current attention weights α_ti and the local feature vectors a_i of the feature map then yields the context vector c_t. Here tanh is the hyperbolic tangent activation function, exp is the exponential function with the natural constant e as its base, and W_a1, W_a2, and W_h1 are weight parameters that the model needs to learn.
[0114]
[0115]
[0116]
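Since the three attention equations were lost in extraction, the following is only a sketch of one plausible form, modeled on common coverage-attention decoders. The names W_a1, W_a2, and W_h1 come from the text. The kernels `K1` and `K2` (3×3 and 5×5 in the test), the feature projection `Wf`, and the scoring vector `v` are assumptions introduced for illustration.

```python
import numpy as np

def conv2d_same(x, kernel):
    """x: (H, W); kernel: (k, k, Q) with odd k. Zero-padded, returns (H, W, Q)."""
    k = kernel.shape[0]
    H, W = x.shape
    xp = np.pad(x, k // 2)
    out = np.zeros((H, W, kernel.shape[2]))
    for i in range(k):
        for j in range(k):
            out += xp[i:i + H, j:j + W, None] * kernel[i, j]
    return out

def attention_step(F, F_pos, h_t, A, p):
    """F, F_pos: (H, W, D) raw and position-enhanced features; A: (H, W) past attention."""
    A1 = conv2d_same(A, p["K1"]) @ p["Wa1"]    # small receptive field over past attention
    A2 = conv2d_same(A, p["K2"]) @ p["Wa2"]    # large receptive field over past attention
    e = np.tanh(F_pos @ p["Wf"] + p["Wh1"] @ h_t + A1 + A2) @ p["v"]   # scores, (H, W)
    e = e - e.max()                            # numerical stability
    alpha = np.exp(e) / np.exp(e).sum()        # attention weights alpha_ti
    c_t = (alpha[..., None] * F).sum(axis=(0, 1))   # context vector: sum_i alpha_ti * a_i
    return alpha, c_t, A + alpha               # A accumulates all past attention
```

Feeding the accumulated map A back in at two scales lets the score at each position depend on how much attention has already been spent on both its immediate and its wider neighborhood, which is the stated motivation for handling symbols of different sizes.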
[0117] Finally, a multilayer perceptron (MLP) predicts the current output from the context vector c_t obtained by the multi-scale attention mechanism, the global character statistics vector Global_V, the current state h_t of the GRU, and the embedding E(y_{t-1}) of the previous output y_{t-1}, where E is the embedding matrix.
[0118] Local character classification module:
[0119] The local character classification module is used to improve the feature extraction capability of the formula recognition model. It directly classifies the feature vector a_i at each position of the feature map F obtained by the encoder module, and then combines the results with the multi-scale attention weights to calculate the probability of the current child node category, providing local character category information for the recognition model.
[0120] This probability of the current child node category, denoted Local_p_t, is used to calculate the loss L_l.
[0121] As shown in Figure 6, each feature vector a_i extracted by the encoder is passed through two fully connected layers with Maxout and Softmax activation functions to obtain the local character classification probability o_i. At each decoding step t, the multi-scale attention weights α_ti are used to perform a weighted summation of the local character classification probabilities, giving the current local character classification probability Local_p_t. W3, W4, b3, and b4 are weight parameters that the model needs to learn.
[0122] o_i = softmax(W4 · maxout(W3 · a_i + b3) + b4)
[0123]
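The classification and the attention-weighted summation can be sketched as follows. The Maxout grouping (`pieces=2`) is an assumption, since the text does not state how many linear pieces are used; the weighted sum itself follows the verbal description in the preceding paragraphs.

```python
import numpy as np

def maxout(z, pieces=2):
    """Take the maximum over groups of `pieces` consecutive units."""
    return z.reshape(*z.shape[:-1], -1, pieces).max(axis=-1)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def local_classification(a, alpha, W3, b3, W4, b4):
    """a: (M, D) feature vectors a_i; alpha: (M,) attention weights at step t."""
    # o_i = softmax(W4 maxout(W3 a_i + b3) + b4), computed for all positions at once.
    o = softmax(maxout(a @ W3 + b3) @ W4 + b4)    # (M, C)
    # Local_p_t = sum_i alpha_ti * o_i
    return o, alpha @ o
```

When the attention weights sum to 1, Local_p_t is itself a probability distribution over the character classes, so it can be compared with the ground-truth character through a cross-entropy loss.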
[0124] Loss function:
[0125] The overall loss of the recognition model during the training phase is as follows:
[0126] L = λ1·L_p + λ2·L_g + λ3·L_l
[0127] Here, L_p is the cross-entropy loss between the decoder prediction of the recognition model and the ground truth, L_g is the Smooth L1 loss between the global character statistics and the ground truth, and L_l is the cross-entropy loss between the output of the local character classification module and the ground truth. λ1, λ2, and λ3 are the weights of the three loss terms. The loss function L drives the training of the recognition model.
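The combined loss can be sketched as below. The patent does not give values for λ1, λ2, and λ3, so the defaults here are placeholders, and `beta` is the usual Smooth L1 transition point, also an assumption.

```python
import numpy as np

def cross_entropy(probs, target):
    """Mean negative log-probability of the target class. probs: (T, C); target: (T,)."""
    return float(-np.log(probs[np.arange(len(target)), target]).mean())

def smooth_l1(pred, target, beta=1.0):
    """Quadratic for small errors, linear for large ones."""
    d = np.abs(pred - target)
    return float(np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean())

def total_loss(L_p, L_g, L_l, lambdas=(1.0, 1.0, 1.0)):
    """L = lambda1 * L_p + lambda2 * L_g + lambda3 * L_l (placeholder weights)."""
    l1, l2, l3 = lambdas
    return l1 * L_p + l2 * L_g + l3 * L_l
```

In this sketch L_p and L_l would both be computed with `cross_entropy` (on the decoder output and on Local_p_t respectively), while L_g compares Global_V with the true per-class character counts using `smooth_l1`.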
[0128] Compared with existing recognition models based on encoding and decoding models, this model improves the recognition effect of similar characters by using the loss of the global character statistics module and the local character classification module. At the same time, the model incorporates positional encoding to implement a multi-scale attention mechanism, which processes characters with large size differences and reduces the occurrence of misidentification and omission by the recognition model.
[0129] Example 2
[0130] The purpose of this embodiment is to provide a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the steps of the above-described method.
[0131] Example 3
[0132] The purpose of this embodiment is to provide a computer-readable storage medium.
[0133] A computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, performs the steps of the above method.
[0134] Example 4
[0135] The purpose of this embodiment is to provide a deep learning-based system for recognizing handwritten mathematical formulas on exam papers, including:
[0136] The test paper image acquisition module is configured to acquire test paper images, including acquiring answer images and template images;
[0137] The formula detection and segmentation module is configured to: detect and segment mathematical formulas in the acquired test paper image based on the template image; detect independent formulas in the test paper answer area image; use a combination of connected component and image projection to calculate the bounding box of each independent formula in the test paper answer area image; and segment the independent formula image based on the bounding box.
[0138] The recognition module is configured to recognize the segmented independent formula images using a trained recognition model. The recognition model is a multi-scale attention encoding and decoding model that combines global and local features of the formula images. The encoder extracts the features of the formula images, and the decoder, based on the multi-scale attention mechanism, decodes the prediction sequence corresponding to the features of the formula images.
[0139] After recognition by the recognition model, the result is converted to LaTeX format, and the formula image and the corresponding prediction result are saved.
[0140] The technical solution of this embodiment is mainly applied to intelligent marking technology in the field of intelligent education. Using intelligent marking technology for online marking ensures standardized marking, guarantees fairness and impartiality, reduces organizational costs and manpower input, and improves marking efficiency and quality. It also has significant application value in subsequent archiving and content analysis. Compared with manual marking, intelligent marking is not only faster but also compensates for the shortcomings of manual marking in handling identical or blank papers.
[0141] The steps and methods involved in the apparatuses of Embodiments 2, 3, and 4 above correspond to those in Embodiment 1. For specific implementation details, please refer to the relevant description section of Embodiment 1. The term "computer-readable storage medium" should be understood as a single medium or multiple media including one or more instruction sets; it should also be understood as including any medium capable of storing, encoding, or carrying an instruction set for execution by a processor and enabling the processor to perform any of the methods in this invention.
[0142] Those skilled in the art will understand that the modules or steps of the present invention described above can be implemented using general-purpose computer devices. Optionally, they can be implemented using computer-executable program code, thereby allowing them to be stored in a storage device for execution by a computer device, or they can be fabricated as separate integrated circuit modules, or multiple modules or steps can be fabricated as a single integrated circuit module. The present invention is not limited to any particular combination of hardware and software.
[0143] While the specific embodiments of the present invention have been described above in conjunction with the accompanying drawings, this is not intended to limit the scope of protection of the present invention. Those skilled in the art should understand that various modifications or variations that can be made by those skilled in the art without creative effort based on the technical solutions of the present invention are still within the scope of protection of the present invention.
Claims
1. A deep learning-based method for recognizing handwritten mathematical formulas on exam papers, characterized in that it comprises: acquiring exam paper images, including answer images and template images; detecting and segmenting, by a detection model, the mathematical formulas in the acquired test paper image based on the template image, wherein independent formulas are detected in the test paper answer area image, the bounding box of each independent formula in the test paper answer area image is calculated using a combination of connected components and image projection, and the formulas are segmented based on the bounding boxes; the detection model includes a segmentation module based on connected components and a segmentation module based on image projection; the connected component-based segmentation module is used for preliminary segmentation of the formula region image: it performs image dilation on the formula region image Img, calculates the connected components of the processed image, obtains the boundary coordinates of each connected component, and, based on the independent formula image height threshold, distinguishes the set of boundary coordinates of independent formulas from the set of boundary coordinates of adhered formulas that need further processing; the projection-based segmentation module is used to further segment the adhered parts: the adhered formula block is projected horizontally, the upper and lower boundaries of each independent formula in the formula block are calculated from the projection array to obtain the formula rows, each formula row is then projected vertically to obtain the left and right boundaries of the formulas, and the set of boundary coordinates of each independent formula is finally obtained;
by combining the sets of independent formula boundary coordinates obtained by the connected component-based segmentation module and the projection-based segmentation module, the independent formulas are segmented from the test paper answer area image for subsequent formula recognition; the segmented independent mathematical formula images are recognized using a trained recognition model, the recognition model being a multi-scale attention encoding and decoding model that combines global and local features of the formula images, in which the encoder extracts the features of the formula images and the decoder, based on the multi-scale attention mechanism, decodes the prediction sequence corresponding to those features; the recognition model adopts an attention-based encoding and decoding model, specifically a multi-scale attention encoding and decoding model that combines global and local features of the formula image; the encoder is used to extract high-dimensional features from the preprocessed input formula image, with a densely connected convolutional network used as the encoder to extract image features; the decoder based on the multi-scale attention mechanism is used to parse the prediction results corresponding to the high-dimensional features obtained by the encoder; in this process, after the encoder extracts the features of the formula image, the decoder, based on the multi-scale attention mechanism, decodes the prediction sequence corresponding to the features of the formula image; after recognition by the recognition model, the result is converted to LaTeX format, and the formula image and the corresponding prediction result are saved.
2. The deep learning-based method for recognizing handwritten mathematical formulas on exam papers as described in claim 1, characterized in that, after the exam paper image is acquired, the method further includes image preprocessing and difference comparison steps: the acquired images are converted to grayscale, denoised, and transformed into binary images; the preprocessed answer image Img1 and template image Img2 are compared to remove interference factors, resulting in a formula image Img that contains only the mathematical formulas to be segmented.
3. The deep learning-based method for recognizing handwritten mathematical formulas on exam papers as described in claim 1, characterized in that, before the recognition model recognizes the input image, image augmentation suited to recognizing mathematical formulas in the marking scenario is first performed on the image, including image scaling, image translation, and image rotation operations that simulate situations in the actual scenario where formulas have large size differences, are written at an angle, or are incompletely segmented.
4. The deep learning-based method for recognizing handwritten mathematical formulas on exam papers as described in claim 1, characterized in that the recognition model includes global character statistics and local character classification; global character statistics: using a weakly supervised method, global character information is provided to the decoder module, thereby obtaining more accurate recognition results in the formula recognition stage; local character classification: used to obtain stronger feature extraction capability by directly classifying the feature vector at each position of the feature map obtained by the encoder and then combining the attention weights to calculate the probability of the current child node category; wherein global character statistics and local character classification parse, from the image features, the types and number of characters contained in the formula image, and, during prediction, the character category is directly predicted based on the attention result, providing global and local information for the decoder.
5. The deep learning-based method for recognizing handwritten mathematical formulas on exam papers as described in claim 1, characterized in that the decoder based on the multi-scale attention mechanism is implemented based on the attention mechanism and a GRU, and combines the feature map with the position encoding for attention calculation, which improves the model's perception of the spatial position of the image; the multi-scale attention mechanism handles symbols of different sizes in the formula image through convolution operations of different scales.
6. A deep learning-based system for recognizing handwritten mathematical formulas on exam papers, characterized in that it comprises: a test paper image acquisition module configured to acquire test paper images, including images of the answer area and template images; a formula detection and segmentation module configured such that a detection model detects and segments the mathematical formulas in the acquired test paper image based on the template image, wherein independent formulas are detected in the acquired test paper answer area image, the bounding box of each independent formula in the test paper answer area image is calculated using a combination of connected components and image projection, and the formulas are segmented based on the bounding boxes; the detection model includes a segmentation module based on connected components and a segmentation module based on image projection; the connected component-based segmentation module is used for preliminary segmentation of the formula region image: it performs image dilation on the formula region image Img, calculates the connected components of the processed image, obtains the boundary coordinates of each connected component, and, based on the independent formula image height threshold, distinguishes the set of boundary coordinates of independent formulas from the set of boundary coordinates of adhered formulas that need further processing; the projection-based segmentation module is used to further segment the adhered parts: the adhered formula block is projected horizontally, the upper and lower boundaries of each independent formula in the formula block are calculated from the projection array to obtain the formula rows, each formula row is then projected vertically to obtain the left and right boundaries of the formulas, and the set of boundary coordinates of each independent formula is finally obtained;
by combining the sets of independent formula boundary coordinates obtained by the connected component-based segmentation module and the projection-based segmentation module, the independent formulas are segmented from the test paper answer area image for subsequent formula recognition; a recognition module configured to recognize the segmented independent mathematical formula images using a trained recognition model, the recognition model being a multi-scale attention encoding and decoding model that combines global and local features of the formula image, in which the encoder extracts the features of the formula image and the decoder, based on the multi-scale attention mechanism, decodes the prediction sequence corresponding to those features; the recognition model adopts an attention-based encoding and decoding model, specifically a multi-scale attention encoding and decoding model that combines global and local features of the formula image; the encoder is used to extract high-dimensional features from the preprocessed input formula image, with a densely connected convolutional network used as the encoder to extract image features; the decoder based on the multi-scale attention mechanism is used to parse the prediction results corresponding to the high-dimensional features obtained by the encoder; in this process, after the encoder extracts the features of the formula image, the decoder, based on the multi-scale attention mechanism, decodes the prediction sequence corresponding to the features of the formula image; after recognition by the recognition model, the result is converted to LaTeX format, and the formula image and the corresponding prediction result are saved.
7. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the program, it implements the steps of the method described in any one of claims 1-5.
8. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it performs the steps of the method described in any one of claims 1-5.