Formula recognition method, related device and program product

By introducing a discrete diffusion model for parallel decoding and visual feature fusion, the speed and accuracy problems of traditional autoregressive recognition methods in complex formula recognition are solved, achieving a more efficient formula recognition effect.

CN122200697A · Pending · Publication Date: 2026-06-12 · IFLYTEK CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
IFLYTEK CO LTD
Filing Date
2026-05-13
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Traditional autoregressive recognition methods are slow and inaccurate on complex formulas, and they struggle to exploit the spatial structure and nesting relationships of formulas.

Method used

A discrete diffusion model is introduced for parallel decoding, and the visual features and text features of the formula image are fused together. The formula characters are predicted in parallel by the discrete diffusion model, and a parallel denoising formula recognition paradigm is constructed.

Benefits of technology

It improves the speed and accuracy of formula recognition, effectively handles complex formulas, reduces reliance on previous decoding results, and improves recognition efficiency.



Abstract

The application discloses a formula recognition method, related equipment and program products, and relates to the technical field of artificial intelligence. The application introduces a discrete diffusion model into a formula recognition task and constructs a new "parallel denoising" formula recognition paradigm. The discrete diffusion model is used to replace the traditional autoregressive decoding, and the formula recognition is changed from a "serial prediction" problem to a "parallel denoising" problem. Since the discrete diffusion model can decode and predict the target formula character token corresponding to each mask position in the initial input text sequence in parallel, the decoding rate is greatly improved compared with the serial decoding mode. Compared with the autoregressive decoding mode which can only see the content at the previous time, the parallel decoding process of the discrete diffusion model can see all the text information before and after the current time, so that the accuracy of the formula recognition result can be improved.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and more specifically, to a formula recognition method, related equipment, and program products.

Background Technology

[0002] Formula recognition can be understood as parsing and processing images containing formulas, converting them into computer-processable symbol sequences or semantic expressions. It can generally be understood as converting formula images into sequences of formula characters. Traditional formula recognition methods typically employ autoregressive methods, which encode the input formula image and then decode and recognize it character by character sequentially using an autoregressive approach. The decoding at each time step depends on the results of one or more previous time steps.

[0003] This autoregressive decoding method often works well for recognizing the textual content of a formula. However, the spatial structure and nesting relationships of a formula are not themselves visual symbols, and because autoregressive decoding can only see the content of previous time steps, it is difficult to achieve good results on formulas with complex structures.

[0004] Furthermore, at each time step the autoregressive decoding method depends on the decoding result of the previous time step, or even of all earlier time steps, which results in a low recognition speed.

Summary of the Invention

[0005] In view of the above problems, this application is made to provide a formula recognition method, related equipment, and program products to improve the speed and accuracy of formula recognition. The specific solution is as follows:

[0006] In a first aspect, a formula recognition method is provided, including:

[0007] Obtain the formula image to be identified and extract the visual features of the formula image;

[0008] Obtain an initial input text sequence, which consists of a masked token sequence;

[0009] The text features of the initial input text sequence and the visual features are input into a feature fusion module to obtain text features fused with visual information;

[0010] The text features fused with visual information are input into a discrete diffusion model, which then predicts in parallel the target formula character token corresponding to each mask position in the initial input text sequence, thus obtaining the formula character sequence.

[0011] In one possible design, in another implementation of the first aspect of the embodiments of this application, the process of inputting the text features of the initial input text sequence and the visual features into the feature fusion module to obtain text features fused with visual information includes:

[0012] The text features of the initial input text sequence and the visual features are input into an attention network, and the text features fused with visual information are calculated through a cross-attention mechanism.

[0013] In one possible design, another implementation of the first aspect of the embodiments of this application further includes:

[0014] After obtaining the formula character sequence after the current round of decoding of the discrete diffusion model, based on the confidence of the target formula character token at each mask position, the tokens in the formula character sequence obtained in the current round of decoding with a confidence lower than the first confidence threshold are reset as masks to obtain the updated text sequence;

[0015] Using the updated text sequence as the input text sequence for the next round of decoding, the steps of fusing the text features and visual features of the input text sequence and decoding in parallel by the discrete diffusion model are repeated until the updated text sequence no longer contains the mask. The updated text sequence is then used as the final formula recognition result.

[0016] In one possible design, another implementation of the first aspect of the embodiments of this application further includes:

[0017] After obtaining the formula character sequence after the current round of decoding of the discrete diffusion model, for each non-masked token in the input text sequence of the current round of decoding of the discrete diffusion model: verify the confidence of the non-masked token in the current round of decoding result. If the confidence is lower than the second confidence threshold, reset the non-masked token to a mask to obtain the updated text sequence.
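The multi-round decoding described in [0014], [0015], and [0017] can be sketched as follows. This is a minimal illustration, not the applicant's implementation: `predict` is a hypothetical stand-in for the feature fusion module plus the discrete diffusion model, the threshold values are arbitrary, tokens kept from earlier rounds are assumed to stay unchanged when they pass the second threshold, and the `max_rounds` guard is an added assumption so the loop always terminates.

```python
from typing import Callable, List, Tuple

MASK = "<mask>"  # hypothetical mask symbol
Prediction = Tuple[str, float]  # (predicted token, confidence)


def iterative_decode(
    predict: Callable[[List[str]], List[Prediction]],
    length: int,
    first_threshold: float = 0.9,   # illustrative value
    second_threshold: float = 0.5,  # illustrative value
    max_rounds: int = 10,           # safety guard, not part of the source
) -> List[str]:
    """Decode every position in parallel and re-mask low-confidence tokens."""
    sequence = [MASK] * length
    for _ in range(max_rounds):
        predictions = predict(sequence)  # one parallel pass over all positions
        updated = []
        for current, (token, confidence) in zip(sequence, predictions):
            if current == MASK:
                # Newly decoded position: keep it only above the first threshold.
                updated.append(token if confidence >= first_threshold else MASK)
            else:
                # Previously kept token: re-check it against the second threshold.
                updated.append(current if confidence >= second_threshold else MASK)
        sequence = updated
        if MASK not in sequence:
            break
    return sequence


def toy_predict(sequence: List[str]) -> List[Prediction]:
    """Toy predictor whose confidence rises as more context is filled in."""
    target = ["x", "^", "2", "<end>"]
    base = [0.95, 0.6, 0.95, 0.7]
    filled = sum(token != MASK for token in sequence)
    return [(tok, min(1.0, conf + 0.2 * filled)) for tok, conf in zip(target, base)]


result = iterative_decode(toy_predict, length=4)
```

With the toy predictor, the first round keeps only the two high-confidence tokens, and the second round fills the remaining masks once that context is available.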

[0018] In one possible design, in another implementation of the first aspect of the embodiments of this application, the discrete diffusion model adopts a bidirectional causal Transformer network structure.

[0019] In one possible design, in another implementation of the first aspect of the embodiments of this application, the visual features of the formula image are extracted by an image feature extraction module, and the image feature extraction module, the feature fusion module and the discrete diffusion model constitute a formula recognition model;

[0020] The formula recognition model was trained in the following manner:

[0021] Obtain formula image samples and a formula text sample set. The formula text sample set includes the formula character sequence text labels corresponding to the formula image samples, and masked formula character sequence texts obtained by masking the formula character sequence text labels to different degrees.

[0022] Randomly sample formula character sequence text from the formula text sample set and use it together with the formula image sample as input to the formula recognition model to obtain the formula character sequence prediction result output by the formula recognition model;

[0023] The loss value is calculated based on the prediction result of the formula character sequence and the text label of the formula character sequence, and the parameters of the formula recognition model are updated according to the loss value.

[0024] In one possible design, in another implementation of the first aspect of the embodiments of this application, before extracting the visual features of the formula image, the following steps are further included:

[0025] The image height estimation module estimates the target height information after the formula image is normalized. The image height estimation module is configured to use the formula image sample as the training sample and the image height information after normalizing the formula image sample as the sample label for training. The height of the formula characters in the normalized image of the formula image sample is a uniformly set height value.

[0026] The formula image is normalized according to the target height information to obtain the normalized formula image.

[0027] In a second aspect, an electronic device is provided, comprising: a memory and a processor;

[0028] The memory is used to store programs;

[0029] The processor is configured to execute the program to implement the various steps of the formula recognition method described in any of the first aspects of this application.

[0030] In a third aspect, a readable storage medium is provided having a computer program stored thereon, which, when executed by a processor, implements the various steps of the formula recognition method described in any of the preceding first aspects of this application.

[0031] In a fourth aspect, a computer program product is provided, comprising a computer program that, when executed by a processor, implements the various steps of the formula recognition method described in any of the first aspects of this application.

[0032] By employing the aforementioned technical solution, this application introduces the discrete diffusion model into the formula recognition task, constructing a novel "parallel denoising" paradigm for formula recognition. Replacing traditional autoregressive decoding with the discrete diffusion model transforms formula recognition from a "serial prediction" problem into a "parallel denoising" problem. Because the discrete diffusion model can decode and predict the target formula character token corresponding to each mask position in the initial input text sequence in parallel, it significantly improves the decoding speed compared to serial decoding. Compared to autoregressive decoding, which only sees content from previous moments, the discrete diffusion model's parallel decoding process can see all text information before and after the current moment, thus improving the accuracy of formula recognition results.

[0033] Furthermore, this application fuses the text features of the initial input text sequence with the visual features of the formula image to obtain text features with fused visual information. By guiding the discrete diffusion model to decode the target formula characters through visual information, the accuracy of the decoding result can be improved.

Attached Figure Description

[0034] Various other advantages and benefits will become apparent to those skilled in the art upon reading the following detailed description of preferred embodiments. The accompanying drawings are for illustrative purposes only and are not intended to limit the scope of this application. Furthermore, the same reference numerals denote the same parts throughout the drawings. In the drawings:

[0035] Figure 1 is a schematic diagram of a system architecture for implementing the formula recognition method provided in an embodiment of this application;

[0036] Figure 2 is a schematic flowchart of a formula recognition method provided in an embodiment of this application;

[0037] Figure 3 is a schematic diagram of a formula recognition model structure provided in an embodiment of this application;

[0038] Figure 4 is a schematic diagram of a process of constructing training data by progressively adding noise, as provided in an embodiment of this application;

[0039] Figure 5 is a schematic diagram of the multi-round iterative decoding process of a formula recognition model provided in an embodiment of this application;

[0040] Figure 6 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0041] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0042] In formula recognition scenarios, relevant technologies typically employ autoregressive decoding algorithms. This involves encoding the formula image and then using an autoregressive decoder to decode and recognize each character sequentially based on the image's encoded features. However, autoregressive decoding methods are slow, and because they only consider data from previous moments, they struggle to achieve good results with complex formulas.

[0043] This application creatively introduces the discrete diffusion model into the formula recognition scenario, transforming formula recognition from a "serial prediction" problem into a "parallel denoising" problem.

[0044] Diffusion models are a class of mainstream generative deep learning models based on Markov chains. They achieve accurate modeling of the target data distribution and high-quality sample generation through a symmetric paradigm of forward stepwise perturbation and backward iterative reconstruction.

[0045] Diffusion models are divided into two main branches: continuous and discrete. Continuous diffusion models use Gaussian noise superposition to add and remove noise from continuous data such as images and audio, while discrete diffusion models rely on discrete state transition strategies such as mask replacement and class transfer, making them naturally suitable for modeling discrete data such as text, graph structures, and categorical variables.

[0046] This application utilizes a discrete diffusion model architecture to decode formula character information from a mask in parallel at all time steps. As a parallel, non-autoregressive algorithm, it can efficiently decode all formula character information simultaneously without relying on decoding results from previous time steps, significantly improving decoding speed.

[0047] The formula recognition method provided in this application can be applied to, for example, the system architecture shown in Figure 1, which includes terminal 100 and server 200.

[0048] Either terminal 100 or server 200 can be used independently to execute the formula recognition method provided in the embodiments of this application. Alternatively, terminal 100 and server 200 can also be used collaboratively to execute the formula recognition method provided in the embodiments of this application.

[0049] Terminal 100 can be a mobile phone, tablet computer, learning machine, scanning pen, translator, notebook, wearable device, etc.

[0050] This application provides a formula recognition method, which is illustrated by applying the method to a computer device. Specifically, the computer device may be terminal 100 or server 200 shown in Figure 1. Referring to Figure 2, the formula recognition method includes the following steps:

[0051] Step S100: Obtain the formula image to be identified and extract the visual features of the formula image.

[0052] The image of the formula to be identified in this step is an image containing the formula. Specifically, the region containing the formula can be located in the original image to be identified, and then the image of the formula region can be cropped out as the formula image to be identified.

[0053] This step uses an image feature extraction module to extract the visual features of the formula image, converting it into a visual feature map that a computer can understand.

[0054] The image feature extraction module, as a visual embedding network, can employ a convolutional neural network or a vision Transformer network, such as a deep residual network (ResNet), a densely connected network (DenseNet), a Vision Transformer (ViT), or a Swin Transformer.

[0055] Taking an image feature extraction module that uses a Swin Transformer network as an example, the main body of this network consists of 14 stacked Swin Transformer modules and performs 5 downsampling operations along the width and height dimensions of the feature map, for a cumulative downsampling factor of 32. Taking a 3-channel RGB text-line formula image of size 3×96×512 as an example input, the feature extraction network outputs a feature map of dimension 512×3×16.
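The shape arithmetic in this example can be checked with a short sketch. It assumes each of the 5 downsampling operations halves the height and width (which is consistent with the stated cumulative factor of 32) and takes the 512 output channels from the example; the function name is illustrative.

```python
def feature_map_shape(height: int, width: int,
                      num_downsamples: int = 5, channels: int = 512):
    """Return (channels, height, width) after repeated 2x spatial downsampling."""
    factor = 2 ** num_downsamples  # 2^5 = 32 for the example in the text
    if height % factor or width % factor:
        raise ValueError(f"height and width must be multiples of {factor}")
    return channels, height // factor, width // factor


print(feature_map_shape(96, 512))  # (512, 3, 16)
```

A 96×512 input therefore yields 3×16 = 48 spatial positions, each carrying a 512-dimensional visual feature.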

[0056] Step S110: Obtain the initial input text sequence, which consists of a mask token sequence.

[0057] The initial input text sequence consists of a mask token sequence. This mask token sequence can be of a preset fixed length, for example, a sufficiently large length L_max to cover most formula scenarios. The input text sequence is a sequence of pure mask (MASK) tokens of length L_max. Subsequently, a discrete diffusion model decodes the formula character corresponding to each mask in the input text sequence, and after predicting the last formula character, the subsequent masks are padded with a set terminator (such as "End" or other forms of terminator).

[0058] In another optional implementation, this application can also predict the length of the formula in the formula image to be identified using a configured formula length prediction model. This formula length prediction model can be pre-trained using sample data of formula images labeled with formula lengths. After obtaining the first length of the formula in the formula image to be identified output by the formula length prediction model, an initial input text sequence of the first length can be constructed. This initial input text sequence consists of a mask token sequence of the first length. Subsequently, a discrete diffusion model is used to decode the formula character corresponding to each mask in the input text sequence.
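Both ways of building the initial input can be sketched in a few lines. The `MASK` and `END` symbols and the function names are hypothetical; the same constructor serves the fixed-length case (pass L_max) and the predicted-length case (pass the first length output by the length prediction model).

```python
from typing import List

MASK = "<mask>"  # hypothetical mask symbol
END = "<end>"    # hypothetical terminator symbol


def initial_sequence(length: int) -> List[str]:
    """Build an all-mask input of the given length (L_max or a predicted length)."""
    if length <= 0:
        raise ValueError("length must be positive")
    return [MASK] * length


def strip_terminator(tokens: List[str]) -> List[str]:
    """Return the formula characters that precede the first terminator."""
    if END in tokens:
        return tokens[:tokens.index(END)]
    return list(tokens)


# Fixed-length case: positions after the last formula character decode to END.
decoded = ["x", "+", "y", END, END, END]
formula = strip_terminator(decoded)
```

With a fixed L_max, the trailing terminators carry no formula content and can be dropped after decoding; with a predicted length, the sequence is expected to contain few or no terminators.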

[0059] Step S120: Input the text features of the initial input text sequence and the visual features into the feature fusion module to obtain text features fused with visual information.

[0060] Specifically, the text features encoded from the initial input text sequence and the visual features of the formula image are fed into the feature fusion module to fuse the text features and visual features, resulting in text features that incorporate visual information, which serve as the input for the next step of the discrete diffusion model.

[0061] By incorporating visual information into text features, discrete diffusion models can decode under the guidance of visual information, thereby improving the accuracy of decoding results.

[0062] In some possible implementations, the feature fusion module can employ an attention network to calculate text features that fuse visual information through a cross-attention mechanism. Attention networks typically consist of fully connected layers, matrix multiplication operators, and softmax operators.

[0063] The attention network connects the visual features of the formula image output by the image feature extraction module with the text features generated by the discrete diffusion model during decoding. By jointly modeling the two types of heterogeneous features and dynamically calculating attention weights, it selects and locates the visual feature regions that the discrete diffusion model needs to focus on for the current decoding sequence. This provides targeted visual feature support for the subsequent decoding of the formula characters, allowing the text decoding process to reason in alignment with the visual structure of the formula.
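A minimal single-head cross-attention sketch in NumPy, built only from the components named in [0062] (fully connected projections, matrix multiplication, softmax). The text does not give the layer sizes, the number of heads, or the scaling, so the dimensions below, the random weights, and the 1/sqrt(d) scaling (standard scaled dot-product attention) are assumptions for illustration.

```python
import numpy as np


def cross_attention(text, visual, w_q, w_k, w_v):
    """Fuse visual features into text features.

    text:   (L, d_text) features of the (masked) input text sequence
    visual: (N, d_vis)  flattened visual feature map
    Returns the fused text features (L, d) and attention weights (L, N).
    """
    q = text @ w_q    # queries come from the text side
    k = visual @ w_k  # keys and values come from the image side
    v = visual @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over visual positions
    return weights @ v, weights


rng = np.random.default_rng(0)
text = rng.standard_normal((8, 64))       # 8 text positions, 64-dim (assumed)
visual = rng.standard_normal((48, 512))   # 3x16 positions x 512 channels
w_q = rng.standard_normal((64, 32)) * 0.1
w_k = rng.standard_normal((512, 32)) * 0.1
w_v = rng.standard_normal((512, 32)) * 0.1

fused, weights = cross_attention(text, visual, w_q, w_k, w_v)
```

Each row of `weights` is a distribution over the 48 visual positions, which is how a text position "locates" the image region it should attend to.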

[0064] In other possible implementations, the feature fusion module can also adopt other fusion strategies, which will not be elaborated here.

[0065] Step S130: Input the text features fused with visual information into the discrete diffusion model, and use the discrete diffusion model to predict, in parallel, the target formula character token corresponding to each mask position in the initial input text sequence, obtaining the formula character sequence.

[0066] Specifically, the discrete diffusion model performs parallel decoding based on text features fused with visual information, that is, it predicts the decoded target formula character token corresponding to each mask position in the initial input text sequence, and obtains the output formula character sequence.

[0067] The formula character sequence consists of consecutive formula characters. In some possible implementations, a terminator may be included at the end of the consecutive formula characters to indicate the end of decoding.

[0068] The formula recognition method provided in this application introduces the discrete diffusion model into the formula recognition task, constructing a novel "parallel denoising" paradigm for formula recognition. By replacing traditional autoregressive decoding with the discrete diffusion model, formula recognition is transformed from a "serial prediction" problem into a "parallel denoising" problem. Because the discrete diffusion model can decode and predict the target formula character token corresponding to each mask position in the initial input text sequence in parallel, it significantly improves the decoding speed compared to the serial decoding method. Compared to the autoregressive decoding method, which can only see content from previous moments, the discrete diffusion model's parallel decoding process can see all text information before and after the current moment, thus improving the accuracy of the formula recognition results.

[0069] Furthermore, this application fuses the text features of the initial input text sequence with the visual features of the formula image to obtain text features with fused visual information. By guiding the discrete diffusion module to decode the target formula character through visual information, the accuracy of the decoding result can be improved.

[0070] In some possible implementations, the discrete diffusion model can employ network architectures with parallel decoding capabilities, such as bidirectional causal Transformer network structures and bidirectional LSTM network structures.

[0071] Taking the discrete diffusion model with a bidirectional causal Transformer network structure as an example, the bidirectional causal Transformer network structure can fully adapt to the structural characteristics and recognition requirements of formula character sequences. It can focus on all text information instead of just the content of previous time steps, which can effectively improve the accuracy of formula recognition results.

[0072] In some embodiments of this application, a normalization process for the formula image may be added before extracting the visual features of the formula image in the aforementioned step S100.

[0073] Because formula recognition scenarios often involve complex structures such as multi-line formulas, matrices, and subscripts and superscripts, traditional recognition algorithms compress images to a fixed resolution, which can easily destroy the formula structure and lead to errors in the formula recognition results.

[0074] Examples of traditional fixed-resolution compression methods are shown in Table 1 below:

[0075] Table 1

[0076]

[0077] This application embodiment uses an image height estimation module to accurately predict the height of the normalized formula image, and then performs image normalization adjustment based on the height, thereby enabling the model to effectively cope with various complex formula recognition scenarios.

[0078] Specifically, the target height information of the formula image after normalization is estimated by the image height estimation module, and the formula image is normalized according to the target height information to obtain the normalized formula image.

[0079] The image height estimation module is configured to use formula image samples as training samples and the height information of the image after normalization of the formula image samples as sample labels for training. The height of the formula characters in the image after normalization of the formula image samples is a uniformly set height value.

[0080] The image height estimation module can employ a small convolutional neural network, such as the deep residual network ResNet or the densely connected network DenseNet.

[0081] The core task of the image height estimation module designed in this embodiment is to enable the model to "understand" the height requirements of the formula image. It analyzes the visual features of the original formula image, determines how many pixels of height are needed for a clear display, and then dynamically outputs a target height H_target, for example:

[0082] A simple single-line formula: H_target = 32;

[0083] Two lines of formula: H_target = 64;

[0084] Three-row matrix: H_target = 96;

[0085] Complex fractional structure: H_target = 128.

[0086] For example, suppose the original formula image input is:

[0087] Original: 3×128×512 (128 pixels high, 512 pixels wide, containing a 3-row matrix).

[0088] Traditional fixed compression:

[0089] Scaling to a fixed height of 32: 3×32×128. The three rows of the matrix are compressed into one row, and the inter-row structure is completely lost.

[0090] Using the image height estimation module in this case, the formula image is analyzed, and the height of the normalized image is predicted to be H_target = 96.

[0091] The original formula image was scaled to 3×96×384 (height 96, width scaled proportionally). This preserved the three-line spatial structure, allowing the subsequent discrete diffusion model to distinguish each line of characters and improve the accuracy of formula recognition.
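The two resizing outcomes in this example follow from aspect-preserving scaling, which can be verified with a small sketch. The function name is illustrative, and rounding the width to the nearest pixel is an assumption; the height estimation module itself is not modeled here, only the resize arithmetic applied to its output.

```python
def normalized_size(height: int, width: int, target_height: int):
    """Scale an image to target_height while preserving its aspect ratio."""
    if height <= 0 or width <= 0 or target_height <= 0:
        raise ValueError("all sizes must be positive")
    scale = target_height / height
    return target_height, max(1, round(width * scale))


print(normalized_size(128, 512, 32))  # (32, 128)
print(normalized_size(128, 512, 96))  # (96, 384)
```

At a fixed height of 32, each of the three matrix rows gets roughly 10 pixels; at the estimated height of 96, each row keeps about 32 pixels, matching the single-line height listed in [0082].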

[0092] Referring to Figure 3, an optional architecture of the formula recognition model is illustrated:

[0093] The formula recognition model includes an image height estimation module, an image feature extraction module, a feature fusion module, and a discrete diffusion model.

[0094] The formula image is processed by the image height estimation module to output target height information, and then the formula image is normalized according to the target height information to obtain the normalized formula image, which is then sent to the image feature extraction module to extract visual features.

[0095] The feature fusion module can use a cross-attention network, whose input includes the initial input text sequence (Mask token) and the visual features of the image. The text features fused with visual information are calculated through the cross-attention mechanism and then fed into the discrete diffusion model.

[0096] The discrete diffusion model performs parallel decoding based on the text features fused with visual information to obtain the target formula character token corresponding to each mask position in the initial input text sequence, and finally obtains the formula character sequence (Formula token).

[0097] In some embodiments of this application, the training process of the formula recognition model is described. The formula recognition model may include an image feature extraction module, a feature fusion module, and a discrete diffusion model. Optionally, an image height estimation module may also be added.

[0098] The training process of a formula recognition model may include the following steps:

[0099] S1. Obtain formula image samples and a formula text sample set. The formula text sample set includes the formula character sequence text labels corresponding to the formula image samples, and masked formula character sequence texts obtained by masking the formula character sequence text labels to different degrees.

[0100] The formula image samples and corresponding formula character sequence text labels in this application can be obtained through manual annotation. However, considering the inherent complexity and diversity of formula representations, manual annotation is not only costly but may also fail to guarantee consistency in the expression of identical formulas. Therefore, in some possible implementations, this application can extract the formula portion of the corpus (i.e., formula character sequence text labels) from massive amounts of publicly available text data on the internet (such as academic papers), and then standardize the formula corpus based on the rules of formula typesetting systems such as LaTeX rendering to ensure maximum consistency. Then, LaTeX is used to render the formula corpus into corresponding formula images, obtaining formula image samples and corresponding formula character sequence text labels. This method can ensure consistency between the formula corpus and the formula images as much as possible.

[0101] Based on the LaTeX formula rendering method, this application can also construct printed formula images and add them to the training dataset. Furthermore, high-quality, manually annotated formula images containing both printed and handwritten text can also be added to the training dataset.

[0102] Alternatively, after obtaining the above formula image samples, they can be further preprocessed uniformly: the part containing the formula in each formula image can be segmented and used as the final formula image, so as to avoid other information in the non-formula part from interfering with the model training, and at the same time, the size of the model input image can be reduced and the model inference efficiency can be improved.

[0103] As shown in Figure 4, after obtaining the formula character sequence text labels corresponding to the formula image samples, these labels can be used as the initial noise-free formula character sequence text (i.e., t=0). Following a preset time step sequence (0→1→…→T), the real text tokens in the sequence are gradually replaced with masks. As the time step increases, the number of text tokens covered by masks increases, meaning the noise level rises, until at the final time step (t=T) the text sequence consists entirely of masks, forming a pure noise sequence and completing the forward noise addition process. This process simulates the continuous degradation of text from a clean state to a completely noisy state, constructing a reasonable sample distribution for subsequent reverse denoising training.

[0104] It should be noted that each sample formula text shown in Figure 4 has a fixed length, and the positions beyond the formula character sequence are filled with a set terminator (End).
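The forward noise addition can be sketched as below. The text does not specify how many tokens are masked at each step or how positions are chosen, so the linear schedule, the random position order, and the symbol names are assumptions; the sketch only guarantees the properties stated above (clean at t=0, fully masked at t=T, and a growing set of masked positions in between).

```python
import random
from typing import List

MASK = "<mask>"  # hypothetical mask symbol
END = "<end>"    # hypothetical terminator symbol


def pad(tokens: List[str], length: int) -> List[str]:
    """Pad a formula character sequence to a fixed length with the terminator."""
    if len(tokens) > length:
        raise ValueError("sequence longer than the fixed length")
    return tokens + [END] * (length - len(tokens))


def forward_masking(tokens: List[str], num_steps: int, seed: int = 0) -> List[List[str]]:
    """Return the T+1 sequences for t = 0..T, with more masks at each step."""
    order = list(range(len(tokens)))
    random.Random(seed).shuffle(order)  # order in which positions get masked
    steps = []
    for t in range(num_steps + 1):
        # Linear schedule (assumed): ceil(t / T * L) masked positions at step t.
        n_masked = (t * len(tokens) + num_steps - 1) // num_steps
        masked = set(order[:n_masked])
        steps.append([MASK if i in masked else tok for i, tok in enumerate(tokens)])
    return steps


clean = pad(["x", "^", "2"], 6)
steps = forward_masking(clean, num_steps=3)
mask_counts = [seq.count(MASK) for seq in steps]
```

Drawing one element of `steps` at random and pairing it with the formula image corresponds to the random time-step sampling used during training.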

[0105] S2. Randomly sample formula character sequence text from the formula text sample set, and use it together with the formula image sample as input to the formula recognition model to obtain the formula character sequence prediction result output by the formula recognition model.

[0106] In the training process of the formula recognition model, a random time step sampling strategy is adopted. Noisy text (formula character sequence text with mask) at any time t is randomly extracted from the entire time step interval [0,T] corresponding to the formula text sample set. It is used as input to the formula recognition model together with the formula image sample. After inference by the formula recognition model, the formula character sequence prediction result is obtained.
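
A minimal sketch of the random time step sampling is given below. It assumes the formula text sample set is stored as a list indexed by time step, with index 0 holding the clean label; the function name `sample_training_input` and the hand-written sample set are hypothetical.

```python
import random

MASK = "[Mask]"  # assumed spelling of the mask token

# Hypothetical formula text sample set for one image: index t holds the text at time step t.
sample_set = [
    ["y", "=", "a", "x"],      # t = 0, clean label
    ["y", MASK, "a", "x"],     # t = 1
    ["y", MASK, "a", MASK],    # t = 2
    [MASK, MASK, MASK, MASK],  # t = T = 3, pure noise
]


def sample_training_input(text_sample_set, rng):
    """Draw a time step uniformly from [0, T] and return (t, noisy_text, clean_label)."""
    t = rng.randint(0, len(text_sample_set) - 1)  # randint is inclusive on both ends
    return t, text_sample_set[t], text_sample_set[0]


rng = random.Random(0)
t, noisy_text, clean_label = sample_training_input(sample_set, rng)
print(t, noisy_text, clean_label)
```

The noisy text would be passed to the formula recognition model together with the formula image sample, while the clean label is kept as the supervision signal.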

[0107] S3. Calculate the loss value based on the prediction results of the formula character sequence and the text label of the formula character sequence, and update the parameters of the formula recognition model according to the loss value.

[0108] Specifically, the training process uses the clean text (text labels of the formula character sequence) at time t=0 as the supervision signal, calculates the loss value between the prediction result of the formula character sequence and the supervision signal, and updates the parameters of the formula recognition model according to the loss value.

[0109] By guiding the formula recognition model to learn the reverse denoising mapping from any noisy time t to the initial noise-free time t=0, the model gradually acquires the ability to denoise across time steps, and can finally restore noise-free text from a noisy sequence with any degree of corruption.

[0110] In terms of loss function design, cross-entropy loss is used to achieve supervised training, taking into account the discrete characteristics of text.

[0111] Optionally, this embodiment can also introduce a targeted supervision mechanism, that is, only calculating the loss for the noisy Mask positions in the formula character sequence text, while the noise-free original formula character tokens in the formula character sequence text do not participate in the loss iteration. This design not only utilizes the adaptability of cross-entropy loss to discrete category prediction, accurately measuring the distribution difference between the model-predicted tokens and the real tokens, but also focuses on the core denoising task, avoiding redundant supervision of noise-free tokens, reducing interference from invalid training signals, and improving model training efficiency and denoising accuracy.
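
The targeted supervision described above can be illustrated with a small pure-Python sketch. The representation of model output as one probability dictionary per position, the function name `masked_cross_entropy`, and averaging over the masked positions are assumptions made for the example; the application does not specify how the per-position losses are aggregated.

```python
import math

MASK = "[Mask]"  # assumed spelling of the mask token


def masked_cross_entropy(probs, noisy_input, label):
    """Average cross-entropy over the masked positions only.

    probs[i] maps each vocabulary token to its predicted probability at position i.
    Positions whose input token is not a mask contribute nothing to the loss.
    """
    losses = [
        -math.log(max(probs[i].get(label[i], 0.0), 1e-12))  # clamp to avoid log(0)
        for i, tok in enumerate(noisy_input)
        if tok == MASK
    ]
    if not losses:
        return 0.0  # nothing is masked (e.g. t = 0), so there is nothing to supervise
    return sum(losses) / len(losses)


label = ["y", "=", "a", "x"]
noisy_input = ["y", MASK, "a", MASK]
probs = [
    {"y": 0.1, "x": 0.9},   # unmasked position: ignored even though the prediction is poor
    {"=": 0.5, "+": 0.5},
    {"a": 0.2, "b": 0.8},   # unmasked position: ignored
    {"x": 0.25, "y": 0.75},
]
loss = masked_cross_entropy(probs, noisy_input, label)
print(loss)  # equals (ln 2 + ln 4) / 2
```

Only positions 1 and 3 contribute here, so the loss is the mean of -ln 0.5 and -ln 0.25.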

[0112] In some embodiments of this application, in order to further improve inference efficiency and recognition performance during inference with the formula recognition model, a multi-round parallel text inference and recognition strategy based on confidence judgment is proposed.

[0113] In the aforementioned step S130, the text features fused with visual information are input into the discrete diffusion model, and the discrete diffusion model predicts in parallel the target formula character token corresponding to each mask position in the initial input text sequence to obtain the formula character sequence. If only one round of inference is performed, the formula character sequence obtained in step S130 can be used as the final formula recognition result.

[0114] In some other possible implementations, the formula recognition model can be configured to perform more than two rounds of inference. Then, after obtaining the formula character sequence decoded by the discrete diffusion model in step S130, based on the confidence level of the target formula character token at each mask position, valid formula character tokens are selected and retained, while the remaining invalid formula character tokens are reset to the mask, resulting in an updated text sequence.

[0115] Specifically, when decoding the formula character sequence, the discrete diffusion model can simultaneously output the confidence level corresponding to the target formula character token at each mask position. This confidence level can be used to filter valid formula character tokens in this step.

[0116] In one optional example, a first confidence threshold can be set. Formula character tokens with a confidence level not lower than the first confidence threshold are retained as valid formula character tokens, while formula character tokens with a confidence level lower than the first confidence threshold are treated as invalid formula character tokens and reset to a mask, resulting in an updated text sequence.

[0117] In another optional example, a maximum number of iterations, *m*, and a first confidence threshold can be set. After each decoding round, valid and invalid formula character tokens can be filtered based on the maximum number of iterations, *m*, and the first confidence threshold. For example, based on the maximum number of iterations, *m*, and the length *L* of the initial input text sequence, the minimum number of character tokens, *L / m*, that need to be decoded in each round can be calculated. First, candidate formula character tokens with a confidence level not lower than the first confidence threshold are filtered. Then, from these candidate tokens, the first *L / m* tokens are selected as valid formula character tokens, and the rest are considered invalid. The invalid formula character tokens are then reset to masks to obtain the updated text sequence for this round.
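
The selection rule in this example can be sketched as follows. The sketch reads "the first L / m tokens" as the L / m candidates with the highest confidence and rounds L / m up to an integer; both readings are assumptions, as are the function names and the confidence values.

```python
import math

MASK = "[Mask]"  # assumed spelling of the mask token


def select_valid_positions(confidences, masked_positions, threshold, max_iters):
    """Return the set of masked positions whose predictions are kept this round.

    Assumption: candidates passing the threshold are ranked by confidence and
    the top ceil(L / m) of them are kept, where L is the sequence length.
    """
    per_round = math.ceil(len(confidences) / max_iters)
    candidates = [i for i in masked_positions if confidences[i] >= threshold]
    candidates.sort(key=lambda i: confidences[i], reverse=True)
    return set(candidates[:per_round])


def apply_selection(input_seq, predictions, keep):
    """Fill kept mask positions with predictions; other mask positions stay masked."""
    return [
        predictions[i] if (tok == MASK and i in keep) else tok
        for i, tok in enumerate(input_seq)
    ]


input_seq = [MASK] * 6
predictions = ["y", "=", "a", "x", "+", "b"]
confidences = [0.95, 0.9, 0.4, 0.85, 0.7, 0.3]  # made-up values for illustration
masked = [i for i, tok in enumerate(input_seq) if tok == MASK]
keep = select_valid_positions(confidences, masked, threshold=0.6, max_iters=3)
updated = apply_selection(input_seq, predictions, keep)
print(" ".join(updated))
```

With L=6 and m=3, at most two tokens are kept per round, so although four predictions pass the threshold of 0.6, only the two most confident ones are retained.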

[0118] The updated text sequence is used as the input text sequence for the next round of decoding. The steps of fusing the text features and visual features of the input text sequence and decoding in parallel by the discrete diffusion model are repeated until the updated text sequence no longer contains the mask. The updated text sequence is then used as the final formula recognition result.

[0119] As shown in Figure 5, the decoding process of the formula recognition model starts from the pure noise text sequence composed entirely of masks at time t=T, and gradually works backward through multiple rounds of iterative denoising to obtain a clean formula text sequence without noise (masks) at time t=0.

[0120] When the formula recognition model starts inference, it executes the first round of inference, taking the pure noise text sequence at time t=T as input, and the formula recognition model decodes it in parallel to obtain the target formula character token and its confidence level at each mask position.

[0121] For tokens with a confidence level not lower than the first confidence level threshold, this application considers them to be valid tokens and retains them. For tokens with a confidence level lower than the first confidence level threshold, this application considers them to be noise and therefore resets them to masks for subsequent rounds of decoding and identification. After filtering according to the above confidence level, the updated text sequence can be obtained.

[0122] Tokens with a confidence level below the first confidence threshold are reset to masks, resulting in an updated text sequence.

[0123] Subsequently, the updated text sequence is used as the input text sequence for the next round of decoding, and the formula recognition model is reused for the next round of parallel decoding. The above process of "confidence-based filtering - retaining valid tokens - resetting noise positions to masks" is repeated multiple times until the updated text sequence no longer contains masks, at which point the iteration terminates, and the updated text sequence is used as the final formula recognition result.
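
The multi-round loop can be sketched with a stand-in for the formula recognition model. Here `toy_model` is purely hypothetical: it always predicts a fixed target and raises its confidence as more tokens become known, which only mimics the behaviour described in the text. The `max_rounds` guard is also an assumption; the application states only that iteration ends when no mask remains.

```python
MASK = "[Mask]"  # assumed spelling of the mask token

TARGET = ["y", "=", "a", "x", "+", "b"]
BASE_CONF = [0.9, 0.8, 0.5, 0.7, 0.6, 0.3]  # made-up per-position confidences


def toy_model(seq):
    """Hypothetical recognizer: predicts TARGET, with confidence growing with context."""
    known = sum(tok != MASK for tok in seq)
    confs = [min(1.0, c + 0.1 * known) for c in BASE_CONF]
    return list(TARGET), confs


def iterative_decode(model, length, threshold, max_rounds=10):
    """Decode from an all-mask sequence until no mask remains."""
    seq = [MASK] * length
    for round_idx in range(1, max_rounds + 1):
        tokens, confs = model(seq)
        updated = []
        for i, tok in enumerate(seq):
            if tok != MASK:
                updated.append(tok)        # keep tokens retained in earlier rounds
            elif confs[i] >= threshold:
                updated.append(tokens[i])  # valid token: retain it
            else:
                updated.append(MASK)       # low confidence: leave masked for later rounds
        seq = updated
        if MASK not in seq:
            return seq, round_idx
    raise RuntimeError("masks remain after max_rounds")  # guard added for this sketch


result, rounds = iterative_decode(toy_model, length=6, threshold=0.75)
print(" ".join(result), "in", rounds, "rounds")
```

In this toy run the high-confidence tokens are fixed first and the hardest position is decoded last, once the other five tokens are available as context.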

[0124] The confidence-based multi-round iterative decoding strategy provided in this embodiment can adapt to the formula image to be recognized. For simpler recognition scenarios, such as printed text or neatly handwritten formulas, the formula recognition model will provide a high confidence level during the recognition process, thus obtaining a large number of formula character tokens in one or a few decoding rounds, greatly improving the recognition speed. For more difficult recognition scenarios, such as poorly photographed natural scene images or illegible handwriting, the formula recognition model may provide a low confidence level in the first decoding round, especially for difficult-to-recognize formula characters. In this case, the tokens with low confidence levels are reset as masks. In subsequent multi-round iterative decoding rounds, simpler formula characters are recognized first, while difficult-to-recognize formula characters are recognized in the last few decoding rounds. This allows the recognition of difficult characters to refer to more contextual information (already recognized formula characters), improving the recognition accuracy of difficult-to-recognize formula characters. In summary, this application can achieve adaptive, dynamic, efficient, and high-precision recognition of formula images to be recognized.

[0125] In some embodiments of this application, a dynamic correction mechanism is proposed to further improve recognition performance during inference with the formula recognition model.

[0126] The structural rigor of the formula text places high demands on the accuracy of token positioning. Token misalignment may occur during the initial decoding process. Therefore, this embodiment, based on the multi-round iterative decoding strategy described in the previous embodiments, adds a dynamic correction mechanism during the iteration process:

[0127] For the formula character tokens retained in the previous round, their confidence level in the current round of decoding is checked in real time after decoding. If the confidence level is lower than the second confidence threshold, it is determined that the formula character token may have a deviation. It can be reset to a mask and handed over to subsequent rounds for re-decoding and optimization, so as to accurately correct the token position deviation and ensure the structural correctness and content integrity of the final recognition result.

[0128] One possible implementation is as follows: After obtaining the character sequence of the formula after the current round of decoding of the discrete diffusion model, for each non-masked character token in the input text sequence of the current round of decoding of the discrete diffusion model: verify the confidence level of the non-masked character token in the current round of decoding result; if the confidence level is lower than a second confidence threshold, reset the non-masked character token to a mask to obtain an updated text sequence. The updated text sequence is used as the input text sequence for the next round of decoding.

[0129] For example:

[0130] After the first round of decoding by the formula recognition model, formula character tokens with a confidence level not lower than the first confidence threshold are retained. The rest are reset to masks, resulting in the updated text sequence:

[0131] [y] [Mask] [a] [Mask] [Mask] [Mask].

[0132] Using the updated text sequence as the input text sequence for the next round of decoding, after the second round of decoding by the formula recognition model, the decoding results corresponding to each mask position in the input text sequence are selected, and those with a confidence level not lower than the first confidence threshold are retained; the rest are set as masks. Furthermore, assuming that the confidence level of the predicted token "a" at the third token position in the second round of decoding is 0.3, which is less than the second confidence threshold of 0.6, the formula character "a" at the third token position is reset as a mask, resulting in the updated text sequence after the second round of decoding:

[0133] [y] [=] [Mask] [x] [+] [Mask].

[0134] The updated text sequence after the second round of decoding is used as the input text sequence for the next round of decoding, and this process is repeated multiple times until the updated text sequence does not contain the mask.
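
The second-round update in this example can be reproduced with a short sketch. The second confidence threshold (0.6) and the confidence of "a" (0.3) come from the example; the first confidence threshold (0.7), the other confidence values, the predicted token at the last position, and the function name `update_with_correction` are assumptions chosen so that the result matches the sequence given above.

```python
MASK = "[Mask]"  # assumed spelling of the mask token


def update_with_correction(input_seq, tokens, confs, first_threshold, second_threshold):
    """Build the updated text sequence for one decoding round.

    Mask positions keep the new prediction only if its confidence reaches the
    first threshold. Previously retained tokens are re-checked against the
    second threshold and reset to a mask if their confidence has dropped.
    """
    updated = []
    for tok_in, tok_pred, conf in zip(input_seq, tokens, confs):
        if tok_in == MASK:
            updated.append(tok_pred if conf >= first_threshold else MASK)
        else:
            updated.append(tok_in if conf >= second_threshold else MASK)
    return updated


input_seq = ["y", MASK, "a", MASK, MASK, MASK]  # output of the first round
tokens = ["y", "=", "a", "x", "+", "b"]         # second-round predictions (last one assumed)
confs = [0.95, 0.9, 0.3, 0.85, 0.8, 0.4]        # only the 0.3 for "a" is from the example
updated = update_with_correction(
    input_seq, tokens, confs, first_threshold=0.7, second_threshold=0.6
)
print(" ".join(updated))
```

The token "a" retained in the first round is reset because its second-round confidence of 0.3 is below 0.6, while the last position stays masked because its prediction does not reach the first threshold.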

[0135] The embodiments of this application can correct errors made during recognition through the dynamic correction mechanism of remasking. This mechanism is especially suitable for complex formulas and can improve recognition accuracy in such scenarios.

[0136] The embodiments of this application also provide an electronic device. Figure 6 shows a structural schematic of an electronic device suitable for implementing the embodiments of this application. The electronic device in the embodiments of this application may include, but is not limited to, terminals such as mobile phones, tablet computers, learning machines, scanning pens, translators, and notebook computers. The electronic device shown in Figure 6 is merely an example and should not impose any limitation on the functionality and scope of use of the embodiments of this application.

[0137] As shown in Figure 6, the electronic device may include a processing unit (e.g., a central processing unit, a graphics processing unit, etc.) 1, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 2 or a program loaded from a storage device 8 into a random access memory (RAM) 3, to implement the formula recognition method of the foregoing embodiments of this application. When the electronic device is powered on, the RAM 3 also stores various programs and data required for the operation of the electronic device. The processing unit 1, ROM 2, and RAM 3 are interconnected via a bus 4. An input/output (I/O) interface 5 is also connected to the bus 4.

[0138] Typically, the following devices can be connected to the I/O interface 5: input devices 6 including, for example, touchscreens, touchpads, keyboards, mice, cameras, microphones, accelerometers, gyroscopes, etc.; output devices 7 including, for example, liquid crystal displays (LCDs), speakers, vibrators, etc.; storage devices 8 including, for example, memory cards, hard drives, etc.; and communication devices 9. The communication device 9 allows the electronic device to communicate with other devices, wirelessly or by wire, to exchange data. Although Figure 6 shows an electronic device with various devices, it should be understood that it is not required to implement or have all of the devices shown. More or fewer devices may alternatively be implemented or provided.

[0139] This application also provides a computer program product including computer-readable instructions, which, when executed on an electronic device, cause the electronic device to implement any of the formula recognition methods provided in this application.

[0140] This application also provides a computer-readable storage medium that carries one or more computer programs. When the one or more computer programs are executed by an electronic device, the electronic device can implement any of the formula recognition methods provided in this application.

[0141] It should also be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. In addition, in the device embodiment drawings provided in this application, the connection relationship between modules indicates that they have a communication connection, which can be implemented as one or more communication buses or signal lines.

[0142] Through the above description of the embodiments, those skilled in the art can clearly understand that this application can be implemented by means of software plus necessary general-purpose hardware, or it can be implemented by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memory, special-purpose components, etc. Generally, any function performed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structure used to implement the same function can also be diverse, such as analog circuits, digital circuits, or special-purpose circuits. However, for this application, software program implementation is more often the preferred implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, mobile hard disk, ROM, RAM, magnetic disk, or optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, training equipment, or network device, etc.) to execute the methods described in the various embodiments of this application.

[0143] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product.

[0144] The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave, etc.) means. The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a training device or data center, that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVDs), or semiconductor media (e.g., solid-state drives (SSDs)).

[0145] The various embodiments in this specification are described in a progressive manner. Each embodiment focuses on the differences from other embodiments. The various embodiments can be combined as needed, and the same or similar parts can be referred to each other.

Claims

1. A formula recognition method, characterized in that it comprises: obtaining a formula image to be recognized and extracting visual features of the formula image; obtaining an initial input text sequence, the initial input text sequence consisting of a sequence of mask tokens; inputting text features of the initial input text sequence and the visual features into a feature fusion module to obtain text features fused with visual information; and inputting the text features fused with visual information into a discrete diffusion model, the discrete diffusion model predicting in parallel the target formula character token corresponding to each mask position in the initial input text sequence, to obtain a formula character sequence.

2. The method according to claim 1, characterized in that inputting the text features of the initial input text sequence and the visual features into the feature fusion module to obtain the text features fused with visual information comprises: inputting the text features of the initial input text sequence and the visual features into an attention network, and computing the text features fused with visual information through a cross-attention mechanism.

3. The method according to claim 1, characterized in that it further comprises: after obtaining the formula character sequence from the current round of decoding by the discrete diffusion model, resetting to masks, based on the confidence of the target formula character token at each mask position, those tokens in the formula character sequence obtained in the current round of decoding whose confidence is lower than a first confidence threshold, to obtain an updated text sequence; and using the updated text sequence as the input text sequence for the next round of decoding, and repeating the steps of fusing the text features of the input text sequence with the visual features and decoding in parallel by the discrete diffusion model until the updated text sequence no longer contains a mask, the updated text sequence then being used as the final formula recognition result.

4. The method according to claim 3, characterized in that it further comprises: after obtaining the formula character sequence from the current round of decoding by the discrete diffusion model, for each non-masked token in the input text sequence of the current round of decoding: verifying the confidence of the non-masked token in the current round of decoding results, and, if the confidence is lower than a second confidence threshold, resetting the non-masked token to a mask to obtain the updated text sequence.

5. The method according to claim 1, characterized in that the discrete diffusion model employs a bidirectional causal Transformer network structure.

6. The method according to claim 1, characterized in that the visual features of the formula image are extracted by an image feature extraction module; the image feature extraction module, the feature fusion module, and the discrete diffusion model constitute a formula recognition model; and the formula recognition model is trained in the following manner: obtaining formula image samples and a formula text sample set, the formula text sample set including formula character sequence text labels corresponding to the formula image samples, and formula character sequence texts carrying masks obtained by applying different degrees of masking to the formula character sequence text labels; randomly sampling a formula character sequence text from the formula text sample set and using it together with the formula image sample as input to the formula recognition model, to obtain a formula character sequence prediction result output by the formula recognition model; and calculating a loss value based on the formula character sequence prediction result and the formula character sequence text label, and updating the parameters of the formula recognition model according to the loss value.

7. The method according to any one of claims 1-6, characterized in that, before extracting the visual features of the formula image, the method further comprises: estimating, by an image height estimation module, target height information of the formula image after normalization, wherein the image height estimation module is trained using formula image samples as training samples and the image height information of the normalized formula image samples as sample labels, and the height of the formula characters in the normalized image of each formula image sample is a uniformly set height value; and normalizing the formula image according to the target height information to obtain a normalized formula image.

8. An electronic device, characterized in that it comprises a memory and a processor; the memory is used to store a program; and the processor is used to execute the program to implement each step of the formula recognition method according to any one of claims 1 to 7.

9. A readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements each step of the formula recognition method according to any one of claims 1 to 7.

10. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a processor, it implements each step of the formula recognition method according to any one of claims 1 to 7.