A visual rich text layout restoration method based on an attention network

The attention network-based visual rich text layout restoration method addresses the difficulty of obtaining training data and the high computational resource requirements of existing technologies. It restores document images into editable documents efficiently and accurately, improving both the accuracy and the efficiency of layout restoration.

CN118072342B · Active · Publication Date: 2026-06-26 · HANGZHOU EBOYLAMP ELECTRONICS CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: HANGZHOU EBOYLAMP ELECTRONICS CO LTD
Filing Date: 2024-02-01
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing document image layout restoration methods suffer from training data that is difficult to obtain, high computational resource requirements, low accuracy, and limited overall accuracy when multiple models are chained in series.

Method used

A visual rich text layout restoration method based on attention networks is adopted. A classification network predicts the rotation angle of the document image, a projection correction algorithm performs a secondary correction, and a layout analysis network combining a multi-head self-attention encoder and decoder detects the coordinates and categories of the elements. The layout analysis network is optimized with a contrastive denoising training strategy, and single-line text detection and recognition networks are connected in series to generate editable documents.

Benefits of technology

It improves the accuracy and efficiency of layout restoration, shortens training time, and enhances the overall performance of multi-network models.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN118072342B_ABST

Abstract

The application discloses a visual rich text layout restoration method based on an attention network, which comprises the following steps: inputting a document image into a trained document image rotation angle classification network to obtain the angle by which the document image needs to be rotated and correcting the document image accordingly, then using a projection correction algorithm to perform a secondary correction on the document image; performing element detection on the document image through a layout analysis network and processing each detected element according to its category to obtain the corresponding element content; and using a plurality of deep learning network models connected in series to restore the document image into an editable document. A contrastive denoising training strategy is added to the training of the layout analysis network to improve the accuracy of layout analysis. Because the network models are connected in series, the weakest-link rule applies: the quality of layout restoration is limited by the layout analysis model. Adding the contrastive denoising strategy therefore improves the accuracy of the network, promotes convergence, and shortens the training time.

Description

Technical Field

[0001] This invention belongs to the field of text layout restoration, specifically relating to a visual rich text layout restoration method based on attention networks.

Background Technology

[0002] Page layout restoration technology refers to restoring a document image to its original page structure, converting an uneditable information carrier into an editable one. With the advent of the information age, the number and complexity of documents are constantly increasing, and traditional document processing methods can no longer meet people's needs. The development of artificial intelligence technology provides more efficient and accurate methods for document processing, enabling it to better adapt to the demands of the information age. The semantic structure of visual rich text is related to its text content, layout, table structure, font, and other visual elements. Therefore, page layout restoration of a document image requires the detection and recognition of text content, layout structure, table structure, etc. Page layout restoration technology can restore a document image to an editable document, maintaining consistency between content and structure. Page layout restoration technology is a preprocessing technique used for automated document processing, intelligent analysis, and artificial intelligence applications, which can improve the usability and efficiency of documents.

[0003] Accurately restoring a document image to its original form can effectively enhance the document's usability and application scope. Currently, there are two commonly used methods for restoring document layout.

[0004] Method 1: End-to-end layout reconstruction using a deep neural network model. This method typically employs an encoder-decoder generation model to directly generate text content from the input image. However, constructing the training data for this model is challenging, requiring additional processing of image elements and significant computational power, making layout reconstruction time-consuming. Furthermore, this end-to-end layout reconstruction based on a deep neural network model necessitates a large number of parameters in the network, inevitably increasing reconstruction time. End-to-end reconstruction also requires a large amount of data with complex annotations, resulting in a substantial workload for data labeling. Due to the inherent difficulty of the layout reconstruction task, accuracy will be significantly limited.

[0005] Method 2: Combining layout analysis, table recognition, optical character detection, and optical character recognition technologies, multiple neural network models are chained together, and pre-written rules are used to restore the recognized elements to a Word document. However, the overall accuracy of layout restoration by chaining multiple deep neural network models together is affected by the accuracy of the first task, namely the layout analysis task. Using complex models for layout analysis will slow down the overall layout restoration process.

Summary of the Invention

[0006] The purpose of this invention is to address the problems raised in the background art by proposing a visual rich text layout restoration method based on attention networks.

[0007] To achieve the above objectives, the technical solution adopted by the present invention is as follows:

[0008] The present invention proposes a visual rich text layout restoration method based on attention network, which includes inputting the document image into a trained document image rotation angle classification network to obtain the required rotation angle of the document image and correct the document image, and then using a projection correction algorithm to perform secondary correction on the document image;

[0009] The document image after secondary correction is input into the layout analysis network to detect elements in the layout and obtain the coordinates and categories of each element in the document image. The layout analysis network includes a lightweight visual self-attention network, a multi-head self-attention encoder and a multi-head self-attention decoder connected in sequence from data input to output.

[0010] Based on the coordinate information of each element in the document image, the corresponding element is cropped out, and then the element is processed according to its category to obtain the element content corresponding to each element.

[0011] Generate a blank document, and insert the content of each element into the blank document according to its position in the document image to restore the layout.

[0012] Preferably, the document image rotation angle classification network includes, in sequence from data input to output, a first convolutional layer, a first hardswish activation function, a first depthwise separable convolutional structure, a second depthwise separable convolutional structure, a third depthwise separable convolutional structure, a fourth depthwise separable convolutional structure, a fifth depthwise separable convolutional structure, a sixth depthwise separable convolutional structure, a seventh depthwise separable convolutional structure, an eighth depthwise separable convolutional structure, a ninth depthwise separable convolutional structure, a tenth depthwise separable convolutional structure, an eleventh depthwise separable convolutional structure, a twelfth depthwise separable convolutional structure, a thirteenth depthwise separable convolutional structure, a global average pooling layer, a first fully connected layer, a second hardswish activation function, and a second fully connected layer.

[0013] The document image rotation angle classification network is optimized using cross-entropy loss during training, with the loss function $L_{direction}$ as shown below:

[0014] $L_{direction} = -\sum_{i=1}^{4} p_i \log(q_i)$

[0015] Where i=1 represents the first category, which does not require rotation; i=2 represents the second category, which requires a 90-degree rotation; i=3 represents the third category, which requires a 180-degree rotation; and i=4 represents the fourth category, which requires a 270-degree rotation. $p_i$ indicates the actual label category, and $q_i$ represents the network's prediction result.
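The four-category cross-entropy described above can be illustrated with a minimal Python sketch. The function names, the softmax step, and the example scores are assumptions made for illustration; the patent only states that cross-entropy is computed between the label category and the network prediction.

```python
import math

# Category index i (1..4) -> rotation in degrees, as listed in the text.
ROTATION_DEGREES = {1: 0, 2: 90, 3: 180, 4: 270}


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def direction_loss(label_one_hot, predicted_probs, eps=1e-12):
    """Cross-entropy between the label distribution and the prediction."""
    return -sum(p * math.log(q + eps)
                for p, q in zip(label_one_hot, predicted_probs))


def predicted_rotation(predicted_probs):
    """Pick the most likely category and return its rotation in degrees."""
    best = max(range(len(predicted_probs)), key=lambda k: predicted_probs[k])
    return ROTATION_DEGREES[best + 1]


logits = [0.2, 2.5, 0.1, -1.0]   # hypothetical network outputs
probs = softmax(logits)
label = [0.0, 1.0, 0.0, 0.0]     # ground truth: category 2 (90 degrees)
print(predicted_rotation(probs), direction_loss(label, probs))
```

The loss shrinks toward zero as the probability assigned to the labeled category approaches 1.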

[0016] Preferably, the lightweight visual self-attention network includes a convolutional backbone, a first content local enhancement module, a first convolutional feedforward module, a second content local enhancement module, a second convolutional feedforward module, a third content local enhancement module, a third convolutional feedforward module, a fourth content local enhancement module, and a fourth convolutional feedforward module connected sequentially from data input to output, and the convolutional backbone includes four convolutional layers connected sequentially from data input to output.

[0017] Preferably, each of the content local enhancement modules includes a global branch and a local branch. The input of the content local enhancement module is first processed by a first normalization layer. The global branch first performs a linear transformation on the output of the first normalization layer through a third fully connected layer to generate a first query $Q_g$, a first key $K_g$, and a first value $V_g$. The first key $K_g$ and the first value $V_g$ are then downsampled through the first pooling layer and the second pooling layer, yielding the first result and the second result, respectively. The first query $Q_g$ is matrix-multiplied with the second result, and the product is processed by a softmax layer to obtain the third result. The third result is then matrix-multiplied with the first result to obtain the fourth result.

[0018] The local branch first performs a linear transformation on the output of the first normalization layer through the fourth fully connected layer to generate the second query Q, the second key K, and the second value V. The second query Q, the second key K, and the second value V are then processed by the first, second, and third depthwise separable convolutional blocks, respectively, to obtain the fifth, sixth, and seventh results. The sixth result and the seventh result are matrix-multiplied to obtain the eighth result. The eighth result is then passed through the fifth fully connected layer, the swish activation function, the sixth fully connected layer, and the tanh activation function to obtain the ninth result. The fifth result and the ninth result are matrix-multiplied and concatenated with the fourth result, and the concatenation is passed through the seventh fully connected layer. Finally, the input of the content local enhancement module is added to the output of the eighth fully connected layer to obtain the output of the content local enhancement module.

[0019] Preferably, each of the convolutional feedforward modules is a stage-preserving convolutional feedforward module or a downsampling convolutional feedforward module. The input of the stage-preserving convolutional feedforward module is processed sequentially through a second normalization layer, a ninth fully connected layer, a first GELU activation function, a fourth depthwise separable convolutional block, and a tenth fully connected layer. Then, the output of the eleventh fully connected layer is added to the input of the stage-preserving convolutional feedforward module as the output of the stage-preserving convolutional feedforward module.

[0020] The downsampling convolutional feedforward module includes a first branch and a second branch. The first branch includes a third normalization layer, a twelfth fully connected layer, a second GELU activation function, a fifth depthwise separable convolutional block, and a thirteenth fully connected layer, which are connected sequentially from data input to output. The second branch includes a sixth depthwise separable convolutional block, a first batch normalization layer, and a fourteenth fully connected layer, which are connected sequentially from data input to output. Then, the outputs of the fifteenth and sixteenth fully connected layers are added together as the output of the downsampling convolutional feedforward module.

[0021] Preferably, the layout analysis network uses contrastive denoising training during training to stabilize training and accelerate convergence:

[0022] Two hyperparameters λ1 and λ2 are set, where λ1 < λ2. Two types of contrastive denoising queries are generated simultaneously: positive query and negative query. Two concentric squares are drawn with half the side length of λ1 and λ2, respectively. The region within the smaller square is the positive query, whose noise scale is less than λ1. During training, it is expected to reconstruct the corresponding ground truth box. The region between the smaller and larger squares is the negative query, whose noise scale is greater than λ1 and less than λ2. During training, it is expected to be targetless. During training, λ2 is used to create difficult negative queries to accelerate network convergence. High-quality location queries are selected through contrastive denoising training to predict bounding boxes and thus suppress ambiguity.
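The generation of positive and negative denoising queries can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: boxes are taken to be (cx, cy, w, h) tuples, the noise is applied to both center and size, and the names `noised_box` and `make_cdn_queries` as well as the values 0.4 and 0.8 are hypothetical.

```python
import random


def noised_box(box, scale_low, scale_high, rng):
    """Perturb a (cx, cy, w, h) box with a noise scale between the two bounds."""
    cx, cy, w, h = box

    def signed_scale():
        s = rng.uniform(scale_low, scale_high)
        return s if rng.random() < 0.5 else -s

    return (
        cx + signed_scale() * w / 2,   # center shift relative to half width
        cy + signed_scale() * h / 2,   # center shift relative to half height
        w * (1 + signed_scale()),
        h * (1 + signed_scale()),
    )


def make_cdn_queries(gt_boxes, lambda1, lambda2, seed=0):
    """Create one positive and one negative query per ground truth box."""
    if not 0 < lambda1 < lambda2 < 1:
        raise ValueError("expected 0 < lambda1 < lambda2 < 1")
    rng = random.Random(seed)
    queries = []
    for index, box in enumerate(gt_boxes):
        # Positive query: noise below lambda1, should reconstruct box `index`.
        queries.append({"box": noised_box(box, 0.0, lambda1, rng),
                        "target": index})
        # Negative query: noise between lambda1 and lambda2, has no target.
        queries.append({"box": noised_box(box, lambda1, lambda2, rng),
                        "target": None})
    return queries


queries = make_cdn_queries([(0.5, 0.5, 0.2, 0.1)], lambda1=0.4, lambda2=0.8)
for q in queries:
    print(q)
```

Negative queries sit close to a real box yet carry no target, which is what makes them hard examples during training.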

[0023] The anchor box regression reconstruction training of the layout analysis network is optimized using L1 loss and GIOU loss, while the layout element classification training and the contrastive denoising training both use the focal loss function. The specific formulas are as follows:

[0024] $L_{1} = \sum_{i} |x_i - y_i|$

[0025] $IOU = \frac{|A \cap B|}{|A \cup B|}$

[0026] $L_{GIOU} = 1 - IOU + \frac{C - |A \cup B|}{C}$

[0027] $L_{focalloss} = -y(1-p)^{\gamma}\log(p) - (1-y)p^{\gamma}\log(1-p)$

[0028] In the L1 loss, $x_i$ represents the predicted coordinates output by the layout analysis network, and $y_i$ represents the true coordinate value. In the GIOU loss, A represents the predicted bounding box, B represents the label box, C represents the area of the minimum bounding rectangle of the predicted bounding box and the label box, and IOU represents the intersection-over-union ratio of the predicted bounding box and the label box. In the focal loss, p represents the output of the layout analysis network, y represents the true label, and γ represents an adjustable factor that is greater than 0.

[0029] The total loss $L_{layout}$ of the layout analysis network is:

[0030] $L_{layout} = \lambda_1 L_{1} + \lambda_2 L_{GIOU} + \lambda_3 L_{focalloss}$

[0031] Where λ1, λ2, and λ3 represent the weight coefficients of L1 loss, GIOU loss, and focal loss, respectively.
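The three loss terms and their weighted sum can be written out in a short Python sketch. It assumes boxes in (x1, y1, x2, y2) form with non-zero area, an L1 loss summed over coordinates, and a single scalar probability for the focal loss; the weight names `w_l1`, `w_giou`, and `w_focal` stand in for the three weight coefficients and are illustrative.

```python
import math


def l1_loss(pred, target):
    """Sum of absolute coordinate differences (sum vs. mean is an assumption)."""
    return sum(abs(x - y) for x, y in zip(pred, target))


def box_area(box):
    """Area of an (x1, y1, x2, y2) box; inverted boxes count as empty."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def giou_loss(a, b):
    """1 - GIoU for a predicted box a and a label box b."""
    inter = box_area((max(a[0], b[0]), max(a[1], b[1]),
                      min(a[2], b[2]), min(a[3], b[3])))
    union = box_area(a) + box_area(b) - inter
    # Area of the smallest rectangle enclosing both boxes (C in the text).
    enclosing = box_area((min(a[0], b[0]), min(a[1], b[1]),
                          max(a[2], b[2]), max(a[3], b[3])))
    iou = inter / union
    return 1.0 - (iou - (enclosing - union) / enclosing)


def focal_loss(p, y, gamma=2.0, eps=1e-12):
    """Focal loss for predicted probability p and binary label y."""
    return (-y * (1 - p) ** gamma * math.log(p + eps)
            - (1 - y) * p ** gamma * math.log(1 - p + eps))


def layout_loss(pred_box, gt_box, p, y, w_l1=1.0, w_giou=1.0, w_focal=1.0):
    """Weighted sum of the three terms."""
    return (w_l1 * l1_loss(pred_box, gt_box)
            + w_giou * giou_loss(pred_box, gt_box)
            + w_focal * focal_loss(p, y))


print(giou_loss((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes give a loss above 1
```

Unlike a plain IoU loss, the GIOU term still varies when the boxes do not overlap, because the enclosing rectangle grows with their separation.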

[0032] Preferably, the step of cropping out the corresponding elements based on their coordinate information in the document image, and then performing corresponding processing according to the element category to obtain the element content corresponding to each element includes:

[0033] Based on the coordinate information of each element in the document image, the corresponding element is directly cropped out, and the elements include at least the title, body text, table of contents, figure title, table title, header, footer, footnote, table and image;

[0034] When the element's category belongs to title, body text, table of contents, figure title, table title, header, footer, and footnote, the cut-out elements are sequentially passed through the single-line text detection network and the single-line text recognition network to obtain the four coordinates of all single-line text boxes in each element and the corresponding text content of each single-line text box in each element.

[0035] When the element's category is table, the cropped table is processed through a single-line text detection network to obtain the four-point coordinates of all single-line text boxes in the table. Then, it is processed through a single-line text recognition network to obtain the text content corresponding to each single-line text box in the table. The cropped table is then processed through a table detection network to obtain the four-point coordinates of each cell in the table and the table structure information. These are then combined with the four-point coordinates of the single-line text boxes and input into the table element coordinate aggregation module. Then, they are input into the table element text aggregation module to concatenate the text belonging to the same cell. Finally, the table is output by combining the table structure information.
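The table text aggregation step can be sketched as below. The patent does not specify the matching rule, so this example assumes that a text box belongs to the cell containing its center and that text within a cell is joined top to bottom, then left to right. The function names are hypothetical.

```python
def bounds(points):
    """Axis-aligned bounds (x1, y1, x2, y2) of a four-point box."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return min(xs), min(ys), max(xs), max(ys)


def center(points):
    """Center of the axis-aligned bounds of a four-point box."""
    x1, y1, x2, y2 = bounds(points)
    return (x1 + x2) / 2, (y1 + y2) / 2


def aggregate_table_text(cells, text_boxes):
    """Join the text of all single-line boxes that fall into the same cell.

    cells: list of four-point cell boxes.
    text_boxes: list of (four-point box, recognized text) pairs.
    Returns one string per cell, in the order of `cells`.
    """
    grouped = {i: [] for i in range(len(cells))}
    for points, text in text_boxes:
        cx, cy = center(points)
        for i, cell in enumerate(cells):
            x1, y1, x2, y2 = bounds(cell)
            if x1 <= cx <= x2 and y1 <= cy <= y2:
                grouped[i].append((cy, cx, text))
                break
    # Sort by vertical position first, then horizontal position.
    return [" ".join(t for _, _, t in sorted(grouped[i]))
            for i in range(len(cells))]


cells = [
    [(0, 0), (100, 0), (100, 40), (0, 40)],
    [(100, 0), (200, 0), (200, 40), (100, 40)],
]
text_boxes = [
    ([(5, 22), (60, 22), (60, 32), (5, 32)], "amount"),
    ([(110, 10), (130, 10), (130, 20), (110, 20)], "42"),
    ([(5, 5), (50, 5), (50, 15), (5, 15)], "Total"),
]
print(aggregate_table_text(cells, text_boxes))
```

The per-cell strings can then be placed into the table according to the structure information from the table detection network.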

[0036] Preferably, the single-line text detection network includes a first feature extraction backbone network, a first parallel branch fusion network, a second parallel branch fusion network, a third parallel branch fusion network, and a first prediction network. The three features output by the first feature extraction backbone network are respectively input into the three parallel branch fusion networks, and the outputs of the three parallel branch fusion networks are fused and then input into the first prediction network.

[0037] The single-line text detection network is jointly optimized using the probability map loss $L_s$, the threshold map loss $L_t$, and the binarization map loss $L_b$. The overall loss function $L_{det}$ of the single-line text detection network is:

[0038] $L_{det} = \alpha_1 L_s + \alpha_2 L_b + \alpha_3 L_t$

[0039] Where $\alpha_1$, $\alpha_2$, and $\alpha_3$ are the weighting coefficients of the probability map loss $L_s$, the binarization map loss $L_b$, and the threshold map loss $L_t$, respectively;

[0040] Among them, the probability map loss $L_s$ and the binarization map loss $L_b$ are calculated using the binary cross-entropy loss, with the following formula:

[0041] $L_s = L_b = -\sum_{i \in S_i} \left[ y_i' \log x_i' + (1 - y_i') \log(1 - x_i') \right]$

[0042] Among them, $S_i$ indicates the sample set obtained by hard sample mining, and $x_i'$ represents the probability map output by the single-line text detection network, or the binarized map calculated from the output probability map and the threshold map. When $x_i'$ is a probability map, $y_i'$ represents the label of the probability map; when $x_i'$ is a binarized map, $y_i'$ represents the label of the binarized map;

[0043] The threshold map loss $L_t$ is calculated based on the distance between the predicted value and the label:

[0044] $L_t = \sum_{i \in R_d} |y_i^{*} - x_i^{*}|$

[0045] Among them, $R_d$ represents the pixels within the label text box area, $x_i^{*}$ represents the threshold map output by the single-line text detection network, and $y_i^{*}$ represents the label threshold map.
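The joint detection loss can be illustrated with flat lists of pixel values. The text only says that hard sample mining is performed, so the mining rule used here (keep all positive pixels and the hardest negatives at a 3:1 ratio) is an assumption, as are the function names and default weights.

```python
import math


def bce(x, y, eps=1e-12):
    """Binary cross-entropy for one pixel with prediction x and label y."""
    return -(y * math.log(x + eps) + (1 - y) * math.log(1 - x + eps))


def mined_bce_loss(pred, label, negative_ratio=3):
    """Binary cross-entropy summed over positives and the hardest negatives."""
    losses = [bce(x, y) for x, y in zip(pred, label)]
    positives = [l for l, y in zip(losses, label) if y == 1]
    negatives = sorted((l for l, y in zip(losses, label) if y == 0),
                       reverse=True)
    keep = negative_ratio * max(len(positives), 1)
    return sum(positives) + sum(negatives[:keep])


def threshold_loss(pred, label, region):
    """L1 distance between threshold maps over the pixel indices in `region`."""
    return sum(abs(pred[i] - label[i]) for i in region)


def detection_loss(l_s, l_b, l_t, a1=1.0, a2=1.0, a3=1.0):
    """Weighted sum of probability, binarization, and threshold map losses."""
    return a1 * l_s + a2 * l_b + a3 * l_t


pred = [0.9, 0.2, 0.1, 0.8, 0.05, 0.3]
label = [1, 0, 0, 0, 0, 0]
print(mined_bce_loss(pred, label))
```

With one positive pixel, only the three negatives with the largest loss (predictions 0.8, 0.3, and 0.2) contribute, so easy background pixels do not dominate the sum.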

[0046] Preferably, the single-line character recognition network consists of an image block encoding module, a position encoding embedding module, a first feature mixing module, a first downsampling module, a second feature mixing module, a second downsampling module, a third feature mixing module, a feature integration module, and a seventeenth fully connected layer, connected sequentially from data input to output.

[0047] The single-line character recognition network is optimized using the CTC loss function and the SAR loss function. Therefore, the total loss function $L_{rec}$ of the single-line character recognition network is:

[0048] $L_{rec} = L_{CTC} + L_{SAR}$

[0049] Among them, $L_{CTC}$ and $L_{SAR}$ represent the CTC loss function and the SAR loss function, respectively.

[0050] Preferably, the table detection network includes a second feature extraction backbone network, a feature fusion neck network, and a feature decoding head network connected sequentially from data input to output;

[0051] The table detection network is optimized using structural loss and positional loss. Therefore, the loss function $L_{table}$ of the table detection network is:

[0052] $L_{table} = L_{struct} + 2L_{loc}$

[0053] Among them, $L_{struct}$ represents the structural loss, which is the difference between the XML element sequence of the table structure predicted by the table detection network and the true XML element sequence. $L_{loc}$ represents the positional loss, which is the difference between the four-point coordinates of the cells in the table predicted by the table detection network and the four-point coordinates of the actual cells.

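[0054] Structural and positional losses are calculated using smooth L1 loss:

[0055] $L_{struct} = L_{loc} = \mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^{2}, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$

[0056] Where x represents the input of the smooth L1 loss.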
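As a small worked example, the smooth L1 loss and the table loss weighting can be computed as follows. The standard smooth L1 definition with a transition point at 1 is assumed, and the inputs are taken to be lists of prediction-minus-label differences that are summed; both choices are illustrative.

```python
def smooth_l1(x):
    """Quadratic near zero, linear for large differences."""
    ax = abs(x)
    return 0.5 * x * x if ax < 1 else ax - 0.5


def table_loss(struct_diffs, loc_diffs):
    """Structural loss plus twice the positional loss."""
    l_struct = sum(smooth_l1(d) for d in struct_diffs)
    l_loc = sum(smooth_l1(d) for d in loc_diffs)
    return l_struct + 2 * l_loc


print(smooth_l1(0.5), smooth_l1(-2.0), table_loss([0.5], [2.0]))
```

The factor of 2 makes cell coordinate errors count twice as much as structure errors.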

[0057] Compared with the prior art, the beneficial effects of the present invention are as follows:

[0058] 1. This visual rich text layout restoration method based on attention network performs element detection on document image through layout analysis network to obtain the coordinates and category of each element in the document image. Based on the coordinate information of each element in the document image, the corresponding element is cropped out. Then, according to the category of the element, corresponding processing is performed to obtain the element content corresponding to each element. Finally, the element content corresponding to each element is inserted into the blank document according to its position in the document image to complete the layout restoration.

[0059] 2. This attention network-based visual rich text layout restoration method uses multiple deep learning network models in series to restore document images into editable documents. A contrastive denoising training strategy is added to the layout analysis network training to improve the accuracy of layout analysis. Because multiple network models are connected in series, according to the law of the weakest link, the layout restoration effect will be affected by the layout analysis model. Therefore, adding a contrastive denoising strategy can improve the accuracy of the network, promote network convergence, and shorten the training time.

Attached Figure Description

[0060] Figure 1 is a flowchart of the visual rich text layout restoration method based on attention networks according to the present invention;

[0061] Figure 2 is a schematic diagram of the depthwise separable convolutional structure of the present invention;

[0062] Figure 3 is a schematic diagram of the lightweight visual self-attention network of the present invention;

[0063] Figure 4 is a schematic diagram of the structure of the content local enhancement module of the present invention;

[0064] Figure 5 is a schematic diagram of the structure of the convolutional feedforward module of the present invention;

[0065] Figure 6 is a schematic diagram of the layout analysis network of the present invention;

[0066] Figure 7 is a schematic diagram of the single-line text detection network of the present invention;

[0067] Figure 8 is a schematic diagram of the parallel branch fusion network of the present invention;

[0068] Figure 9 is a schematic diagram of the structure of the single-line character recognition network of the present invention;

[0069] Figure 10 is a schematic diagram of the structure of the feature mixing module of the present invention;

[0070] Figure 11 is a schematic diagram of the structure of the table detection network of the present invention;

[0071] Figure 12 is a schematic diagram of the structure of the feature fusion neck network of the present invention;

[0072] Figure 13 is a schematic diagram of the high and low feature fusion module of the present invention;

[0073] Figure 14 is a schematic diagram of the feature decoding head network of the present invention;

Detailed Implementation

[0074] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0075] It should be noted that when a component is referred to as being "connected" to another component, it can be directly connected to the other component or there may be an intervening component. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the specification of this application is for the purpose of describing particular embodiments only and is not intended to limit the application.

[0076] As shown in Figures 1-14, a visual rich text layout restoration method based on attention networks includes:

[0077] Step 1: Input the document image into the trained document image rotation angle classification network to obtain the required rotation angle of the document image and correct the document image. Then, use the projection correction algorithm to perform secondary correction on the document image.

[0078] Specifically, it includes:

[0079] The document image is input into the trained document image rotation angle classification network, which outputs the angle that the document image needs to be rotated, and then corrects the document image based on the angle that the document image needs to be rotated.

[0080] The document image rotation angle classification network includes, in sequence from data input to output, a first convolutional layer, a first hardswish activation function, a first depthwise separable convolutional structure, a second depthwise separable convolutional structure, a third depthwise separable convolutional structure, a fourth depthwise separable convolutional structure, a fifth depthwise separable convolutional structure, a sixth depthwise separable convolutional structure, a seventh depthwise separable convolutional structure, an eighth depthwise separable convolutional structure, a ninth depthwise separable convolutional structure, a tenth depthwise separable convolutional structure, an eleventh depthwise separable convolutional structure, a twelfth depthwise separable convolutional structure, a thirteenth depthwise separable convolutional structure, a global average pooling layer, a first fully connected layer, a second hardswish activation function, and a second fully connected layer.

[0081] As shown in Figure 2, each depthwise separable convolutional structure differs from existing depthwise separable convolutional blocks. The depthwise separable convolutional structure includes a first channel-wise convolutional layer, a third hardswish activation function, a first pointwise convolutional layer, a fourth hardswish activation function, and an SE module, all connected sequentially from data input to output. The SE module includes a global average pooling layer, an eighteenth fully connected layer, a first ReLU activation function, a nineteenth fully connected layer, and a hardsigmoid activation function, all connected sequentially from data input to output. The main purpose of using this depthwise separable convolutional structure instead of conventional convolutions is to reduce the number of network parameters and improve the network's inference speed. Because layout restoration is achieved by chaining multiple deep neural network models in series, the layout restoration time is the sum of the processing time of all stages. Using the depthwise separable convolutional structure effectively reduces the number of network parameters and the computational cost, and allows for deeper networks with the same number of parameters. The third and fourth hardswish activation functions replace the sigmoid activation function, which reduces the computational overhead of the network and further improves computational speed. Compared with conventional depthwise separable convolutional blocks, the SE module here removes skip connections to improve network speed and reduce computational cost. These three optimizations effectively increase network computation speed and reduce layout restoration processing time.
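The parameter saving from replacing a conventional convolution with a channel-wise plus pointwise pair can be checked with simple arithmetic. The channel counts, kernel size, and SE reduction ratio below are hypothetical values chosen for illustration, and bias terms are ignored.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a conventional k x k convolution."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights of a channel-wise k x k convolution plus a 1 x 1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def se_params(channels, reduction=4):
    """Weights of the two fully connected layers in an SE module."""
    hidden = channels // reduction
    return channels * hidden + hidden * channels


c_in, c_out, k = 64, 128, 3
standard = standard_conv_params(c_in, c_out, k)
separable = depthwise_separable_params(c_in, c_out, k) + se_params(c_out)
print(standard, separable, round(standard / separable, 2))
```

For these example sizes the conventional convolution has 73,728 weights, while the separable structure with its SE module has 16,960, roughly a 4.3 times reduction.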

[0082] During the training of the document image rotation angle classification network, a large number of document images containing different rotation angles are collected and their rotation angles are labeled. The rotation angles are divided into four categories: 0 degrees, 90 degrees, 180 degrees, and 270 degrees. The labeled data is used to train the document image rotation angle classification network.

[0083] The document image rotation angle classification network is optimized using cross-entropy loss during training, with the loss function $L_{direction}$ as shown below:

[0084] $L_{direction} = -\sum_{i=1}^{4} p_i \log(q_i)$

[0085] Where i=1 represents the first category, which does not require rotation; i=2 represents the second category, which requires a 90-degree rotation; i=3 represents the third category, which requires a 180-degree rotation; and i=4 represents the fourth category, which requires a 270-degree rotation. $p_i$ indicates the actual label category, and $q_i$ represents the network's prediction result.

[0086] Then, a projection correction algorithm is used to perform a second correction on the document image. The first correction completes the large-angle correction of the document image, but small angles may still need correction, so the projection correction algorithm performs a second, small-angle correction on the image. The projection correction algorithm is an existing technology. The image is first converted to grayscale and binarized with an automatic threshold, and the resulting image is rotated through angles from -35° to 35°. Each rotated image is projected onto the x or y coordinate axis (here, the y axis is used), and the number of non-zero pixel rows along that axis is counted. The rotation angle with the fewest non-zero pixel rows is the angle by which the image needs to be rotated. The algorithm is described at https://www.jianshu.com/p/5a644ebcb62d.
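The projection-based angle search can be sketched in plain Python. This is a simplified illustration rather than the referenced implementation: the binarized image is assumed to be a list of rows with non-zero foreground pixels, only the y coordinate of each rotated pixel is computed, and the 1-degree step and the function names are assumptions.

```python
import math


def nonzero_rows_after_rotation(points, angle_deg):
    """Count distinct rows occupied by foreground pixels after rotation."""
    a = math.radians(angle_deg)
    sin_a, cos_a = math.sin(a), math.cos(a)
    # Only the rotated y coordinate matters for the projection onto y.
    return len({round(x * sin_a + y * cos_a) for x, y in points})


def estimate_skew(binary_image, max_angle=35, step=1):
    """Return the candidate angle (degrees) giving the fewest non-zero rows."""
    points = [(x, y)
              for y, row in enumerate(binary_image)
              for x, value in enumerate(row) if value]
    angles = range(-max_angle, max_angle + 1, step)
    return min(angles, key=lambda ang: nonzero_rows_after_rotation(points, ang))


def synthetic_page(width=200, height=120, skew_deg=5, baselines=(20, 50, 80)):
    """Draw thin text-like lines tilted by skew_deg on a blank binary image."""
    image = [[0] * width for _ in range(height)]
    slope = math.tan(math.radians(skew_deg))
    for y0 in baselines:
        for x in range(width):
            image[round(y0 + x * slope)][x] = 1
    return image


page = synthetic_page()
print(estimate_skew(page))
```

On the synthetic page tilted by 5 degrees, the search selects -5, the rotation that brings each line back onto as few pixel rows as possible.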

[0087] Step 2: Input the document image after secondary correction into the layout analysis network to detect elements in the layout, obtaining the coordinates and category of each element in the document image. As shown in Figure 6, the layout analysis network includes a lightweight visual self-attention network, a multi-head self-attention encoder, and a multi-head self-attention decoder connected sequentially from data input to output. The layout analysis network uses a lightweight visual self-attention network pre-trained on the ImageNet-1K dataset as the first feature extraction backbone to extract visual features from the image. The visual features are flattened and combined with position encoding before being input into the multi-head self-attention encoder for feature refinement. The output queries of the multi-head self-attention encoder are then filtered, and the selected subset of queries is input into the multi-head self-attention decoder, which outputs the coordinates and categories of the elements in the image. The entire network uses the anchor box regression L1 loss, the GIOU loss, and the layout element classification focal loss to optimize the network parameters. To accelerate and stabilize training, a focal loss for contrastive denoising training is added.

[0088] Specifically, it includes:

[0089] As shown in Figure 3, the lightweight visual self-attention network includes a convolutional backbone, a first content local enhancement module, a first convolutional feedforward module, a second content local enhancement module, a second convolutional feedforward module, a third content local enhancement module, a third convolutional feedforward module, a fourth content local enhancement module, and a fourth convolutional feedforward module, all connected sequentially from data input to output. The convolutional backbone comprises four convolutional layers connected sequentially from data input to output. Figure 3 H in

[0090] As shown in Figure 4, each content local enhancement module includes a global branch and a local branch. The input of the content local enhancement module is first processed by a first normalization layer. The global branch applies a linear transformation to the output of the first normalization layer through a third fully connected layer to generate the first query $Q_g$, the first key $K_g$, and the first value $V_g$. The first key $K_g$ and the first value $V_g$ are then downsampled by the first pooling layer and the second pooling layer, yielding the first result and the second result, respectively. The first query $Q_g$ is matrix-multiplied with the second result, and the product is processed by a softmax layer to obtain the third result. The third result is then matrix-multiplied with the first result to obtain the fourth result.

[0091] The local branch first applies a linear transformation to the output of the first normalization layer through the fourth fully connected layer to generate the second query Q, the second key K, and the second value V. The second query Q, the second key K, and the second value V are then processed by the first depthwise separable convolutional block, the second depthwise separable convolutional block, and the third depthwise separable convolutional block, respectively, to obtain the fifth, sixth, and seventh results. The sixth result and the seventh result are matrix-multiplied to obtain the eighth result. The eighth result is passed through the fifth fully connected layer, the swish activation function, the sixth fully connected layer, and the tanh activation function to obtain the ninth result. The fifth result and the ninth result are matrix-multiplied, and the product is concatenated with the fourth result. The concatenated result is passed through the seventh fully connected layer. Finally, the input of the content local enhancement module is added to the output of the eighth fully connected layer to obtain the output of the content local enhancement module.

[0092] It should be noted that each content local enhancement module contains two branches: a global branch and a local branch. The global branch first applies a linear transformation to the output of the first normalization layer to generate the first query $Q_g$, the first key $K_g$, and the first value $V_g$. The first key $K_g$ and the first value $V_g$ are then downsampled by the first and second pooling layers, respectively, and finally a conventional attention calculation is performed to obtain the output $X_{global}$ of the global branch. The process is as follows:

[0093] $X_{global} = \mathrm{Attention}(Q_g, \mathrm{Pool}(K_g), \mathrm{Pool}(V_g))$

[0094] Here, Pool(·) is the pooling operation, and Attention(·) is the regular attention operation. The global branch can extract low-frequency global information, reduce computation, and give the network a global receptive field through regular attention. The global branch focuses on global perception but struggles to effectively capture high-frequency local information. To enable the network to consider local information simultaneously, this method uses a convolutional attention module as a local branch.
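As a rough illustration of why pooling the keys and values reduces computation, the following sketch evaluates the global-branch attention on a flat token sequence. It rests on several assumptions not stated in the text: average pooling along the token axis with a stride of 4, a single attention head, and the usual 1/sqrt(d) scaling. The helper names are hypothetical.

```python
import numpy as np


def avg_pool_tokens(x, stride):
    """Average-pool a (tokens, channels) array along the token axis."""
    n, d = x.shape
    n_out = n // stride
    return x[: n_out * stride].reshape(n_out, stride, d).mean(axis=1)


def softmax(scores):
    """Row-wise softmax with the usual max subtraction for numerical stability."""
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)


def global_branch(q, k, v, stride=4):
    """Compute Attention(Q_g, Pool(K_g), Pool(V_g)) for one head."""
    k_pooled = avg_pool_tokens(k, stride)
    v_pooled = avg_pool_tokens(v, stride)
    # Every query still attends over the whole (pooled) sequence: a global receptive field.
    weights = softmax(q @ k_pooled.T / np.sqrt(q.shape[-1]))
    return weights @ v_pooled


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
x_global = global_branch(q, k, v)
print(x_global.shape)  # (16, 8)
```

In this toy setting the attention matrix has 16 x 4 entries instead of 16 x 16, which is the saving the pooling step is meant to provide.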

[0095] In the local branch, the output of the first normalization layer is first linearly transformed to generate the second query Q, the second key K, and the second value V. The calculation process is as follows:

[0096] $Q, K, V = \mathrm{FC}(X_{in})$

[0097] Here, $X_{in}$ denotes the input to the convolutional attention module, and FC denotes the fully connected layer (i.e., the fourth fully connected layer).

[0098] The second query Q, the second key K, and the second value V are processed by the first depthwise separable convolutional block, the second depthwise separable convolutional block, and the third depthwise separable convolutional block, respectively, to aggregate local information, yielding the fifth result $Q_l$, the sixth result $K_l$, and the seventh result $V_s$. The aggregation process is as follows:

[0099] $Q_l = \mathrm{DWconv}(Q)$

[0100] $K_l = \mathrm{DWconv}(K)$

[0101] $V_s = \mathrm{DWconv}(V)$

[0102] Then $Q_l$ and $K_l$ are multiplied, and the product is passed through a series of transformations to obtain the content attention matrix $Attn_t$:

[0103] $Attn_t = \mathrm{FC}(\mathrm{Swish}(\mathrm{FC}(Q_l \times K_l)))$

[0104] Finally, the attention matrix is normalized and multiplied to yield the output $X_{local}$ of the enhanced local branch:

[0105]

[0106] Here, d denotes the number of channels of each token. Compared with ordinary attention, the attention proposed in this method is more strongly non-linear and therefore produces higher-quality content weights.

[0107] Finally, a direct connection is used to fuse the outputs of the local branch and the global branch. Specifically, the two outputs are concatenated along the channel dimension, and an eighth fully connected layer is used to manage the number of channels. The calculation process is shown below:

[0108] $X_{out} = \mathrm{FC}(\mathrm{Concat}(X_{local}, X_{global}))$

[0109] Finally, the input of the content local enhancement module is added to the output of the eighth fully connected layer to obtain the output of the content local enhancement module.

[0110] As shown in Figure 5, each convolutional feedforward module is either a stage-preserving convolutional feedforward module or a downsampling convolutional feedforward module. The input of the stage-preserving convolutional feedforward module is processed sequentially through the second normalization layer, the ninth fully connected layer, the first GELU activation function, the fourth depthwise separable convolutional block, and the tenth fully connected layer. The output of the tenth fully connected layer is then added to the input of the stage-preserving convolutional feedforward module to form the output of the stage-preserving convolutional feedforward module.

[0111] The downsampling convolutional feedforward module includes a first branch and a second branch. The first branch includes a third normalization layer, a twelfth fully connected layer, a fourth GELU activation function, a fifth depthwise separable convolutional block, and a thirteenth fully connected layer, connected sequentially from data input to output. The second branch includes a sixth depthwise separable convolutional block, a first batch normalization layer, and a fourteenth fully connected layer, connected sequentially from data input to output. The outputs of the fifteenth and sixteenth fully connected layers are then added together as the output of the downsampling convolutional feedforward module.

[0112] It should be noted that the convolutional feedforward module is designed to integrate local information. It uses a convolutional feedforward structure instead of an ordinary feedforward structure; the main difference is that the GELU activation function is followed by a depthwise separable convolutional block, which allows the module to integrate local information. In the stage-preserving convolutional feedforward module, each depthwise separable convolutional block has a stride of 1, whereas the depthwise separable convolutional block in the downsampling convolutional feedforward module has a stride of 2.

[0113] The layout analysis network uses contrastive denoising training to stabilize training and accelerate convergence during training.

[0114] Two hyperparameters λ1 and λ2 are set, where λ1 < λ2. Two types of contrastive denoising queries are generated simultaneously: positive queries and negative queries. Two concentric squares are drawn with half side lengths of λ1 and λ2, respectively. Queries inside the smaller square are positive queries, whose noise scale is less than λ1; during training, each is expected to reconstruct its corresponding ground truth box. Queries between the smaller and larger squares are negative queries, whose noise scale is greater than λ1 and less than λ2; during training, each is expected to predict no target. A smaller λ2 is used during training to create hard negative queries, which accelerates network convergence. Contrastive denoising training selects high-quality position queries for predicting bounding boxes and thereby suppresses ambiguity.
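The two concentric squares can be made concrete with a small sketch that generates one positive and one negative query for a ground truth box. This is an illustrative simplification, not the training code of the method: it assumes boxes in (cx, cy, w, h) form, perturbs only the box center, and measures the noise scale as max(|dx|, |dy|). The names `sample_offset` and `make_cdn_queries` are hypothetical.

```python
import random


def sample_offset(low, high, rng):
    """Sample (dx, dy) whose noise scale max(|dx|, |dy|) lies between low and high."""
    scale = rng.uniform(low, high)
    other = rng.uniform(-scale, scale)
    sign = rng.choice((-1.0, 1.0))
    # Put the full-scale component on a random axis so all four sides of the square are covered.
    return (sign * scale, other) if rng.random() < 0.5 else (other, sign * scale)


def make_cdn_queries(gt_box, lam1, lam2, rng):
    """Return (positive, negative) denoising queries for one ground truth box."""
    cx, cy, w, h = gt_box
    dx, dy = sample_offset(0.0, lam1, rng)           # inside the smaller square
    positive = {"box": (cx + dx, cy + dy, w, h), "target": gt_box}
    dx, dy = sample_offset(lam1, lam2, rng)          # between the two squares
    negative = {"box": (cx + dx, cy + dy, w, h), "target": None}  # expected to predict no target
    return positive, negative


rng = random.Random(0)
pos, neg = make_cdn_queries((0.5, 0.5, 0.2, 0.1), lam1=0.05, lam2=0.10, rng=rng)
print(pos["target"], neg["target"])
```

Shrinking `lam2` toward `lam1` moves the negative queries closer to the ground truth box, which is what makes them hard negatives.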

[0115] The anchor box regression reconstruction training of the layout analysis network is optimized using the L1 loss and the GIOU loss, while the layout element classification training and the contrastive denoising training both use the focal loss function. The specific formulas are as follows:

[0116]

[0117]

[0118]

[0119] $L_{focalloss} = -y(1-p)^{\gamma}\log(p) - (1-y)p^{\gamma}\log(1-p)$

[0120] In the L1 loss, $x_i$ denotes the predicted coordinates output by the network and $y_i$ denotes the true coordinate values. In the GIOU loss, A denotes the predicted bounding box, B denotes the label bounding box, C denotes the area of the minimum bounding rectangle enclosing the predicted bounding box and the label bounding box, and IOU denotes the intersection-over-union ratio of the predicted bounding box and the label bounding box. In the focal loss, p denotes the network output, y denotes the true label, and γ denotes an adjustable factor greater than 0.

[0121] The total loss $L_{layout}$ of the layout analysis network is:

[0122]

[0123] Here, λ1, λ2, and λ3 denote the weight coefficients of the L1 loss, the GIOU loss, and the focal loss, respectively (in this method, their values are 5, 2, and 1, respectively).
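The following sketch shows how the three loss terms could be combined for a single predicted box, using the weights 5, 2, and 1 given above. The L1 and GIOU expressions are the commonly used definitions and are assumptions here, as are the (x1, y1, x2, y2) box format, the unnormalized L1 sum, and γ = 2; only the focal loss expression is taken directly from the text.

```python
import math


def _area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes count as zero."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def l1_loss(pred, target):
    """Sum of absolute coordinate differences (assumed unnormalized)."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def giou_loss(a, b):
    """1 - GIoU, where GIoU = IoU - (C - union) / C and C is the enclosing-box area."""
    inter = _area((max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3])))
    union = _area(a) + _area(b) - inter
    enclose = _area((min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3])))
    iou = inter / union
    return 1.0 - (iou - (enclose - union) / enclose)


def focal_loss(p, y, gamma=2.0):
    """Focal loss for one prediction p in (0, 1) and label y in {0, 1}."""
    return -y * (1 - p) ** gamma * math.log(p) - (1 - y) * p ** gamma * math.log(1 - p)


def layout_loss(pred_box, gt_box, p, y, weights=(5.0, 2.0, 1.0)):
    """Weighted sum of the L1, GIOU, and focal terms."""
    w1, w2, w3 = weights
    return (w1 * l1_loss(pred_box, gt_box)
            + w2 * giou_loss(pred_box, gt_box)
            + w3 * focal_loss(p, y))


print(layout_loss((0.1, 0.1, 0.5, 0.5), (0.1, 0.1, 0.5, 0.5), p=0.9, y=1))
```

For a perfectly localized box the first two terms vanish and only the small focal term for the confident, correct classification remains.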

[0124] Step 3: Based on the coordinate information of each element in the document image, crop out the corresponding element, and then perform corresponding processing according to the element category to obtain the element content corresponding to each element.

[0125] Specifically, it includes:

[0126] The process involves cropping out the corresponding elements based on their coordinate information within the document image, and then processing them according to their category to obtain the element content for each element, including:

[0127] Based on the coordinate information of each element in the document image, the corresponding element is directly cropped out, and the elements include at least the title, body text, table of contents, figure title, table title, header, footer, footnote, table and image;

[0128] When the element's category is title, body text, table of contents, figure title, table title, header, footer, or footnote, the cropped element is passed sequentially through the single-line text detection network and the single-line text recognition network to obtain the four-point coordinates of all single-line text boxes in the element and the text content of each single-line text box.

[0129] When the element's category is table, the cropped table is processed through a single-line text detection network to obtain the four-point coordinates of all single-line text boxes in the table. Then, it is processed through a single-line text recognition network to obtain the text content corresponding to each single-line text box in the table. Next, the cropped table is processed through a table detection network to obtain the four-point coordinates of each cell in the table and the table structure information (XML format table structure, i.e., the XML element sequence of the table structure). Then, it is combined with the four-point coordinates of the single-line text boxes and input into the table element coordinate aggregation module. Then, it is input into the table element text aggregation module to concatenate the text belonging to the same cell. Finally, the table (Excel table) is output by combining the table structure information.

[0130] The single-line document detection network includes a first feature extraction backbone network, a first parallel branch fusion network, a second parallel branch fusion network, a third parallel branch fusion network, and a first prediction network. The three features output from the first feature extraction backbone network are respectively input into the three parallel branch fusion networks. The outputs of the three parallel branch fusion networks are then fused and input into the first prediction network. The first feature extraction backbone network reuses part of the document image rotation angle classification network in step 1, specifically the structure from the first convolutional layer through the thirteenth depthwise separable convolutional structure (i.e., the first convolutional layer, the first hardswish activation function, and the first through thirteenth depthwise separable convolutional structures). The first feature extraction backbone network extracts shallow, middle, and deep features from the image, and these features are fused through the parallel branch fusion networks. As shown in Figure 8, each parallel branch fusion network (first parallel branch fusion network, second parallel branch fusion network, and third parallel branch fusion network) includes a second convolutional layer and a first transposed convolutional layer connected sequentially from data input to output; a first upsampling layer, a third convolutional layer, and a fourth convolutional layer connected sequentially from data input to output; and a second transposed convolutional layer. The output of the first transposed convolutional layer is connected to both the first upsampling layer and the second transposed convolutional layer. The output of the second transposed convolutional layer is concatenated with the output of the third convolutional layer and then input into the fourth convolutional layer. Finally, the output of the second transposed convolutional layer and the output of the fourth convolutional layer are added together as the output of the parallel branch fusion network. The input of the single-line document detection network passes through the first feature extraction backbone network, where the shallow, middle, and deep features are taken from the second, seventh, and thirteenth depthwise separable convolutional structures, respectively. The shallow, middle, and deep features are then input into the first, second, and third parallel branch fusion networks, respectively. The features output from the three parallel branch fusion networks are concatenated and input into the first prediction network. The first prediction network generates a probability map, from which the four-point coordinates of the single-line text boxes in the input can be obtained.

[0131] As shown in Figure 7, during the training phase the single-line document detection network includes two prediction networks (i.e., the first prediction network and the second prediction network). The first prediction network predicts the probability map, and the second prediction network predicts the threshold map. Each prediction network includes a fifth convolutional layer, a second batch normalization layer, a second ReLU activation function, a third transposed convolutional layer, a third batch normalization layer, a third ReLU activation function, a fourth transposed convolutional layer, and a first Sigmoid activation function, connected sequentially from data input to output. The probability map and the threshold map are then combined through differentiable binarization to obtain the binarized map. From the binarized map, the four-point coordinates of the single-line text boxes in the input can be obtained.
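The text does not give the binarization formula. A commonly used form of differentiable binarization is a steep sigmoid of the difference between the probability map and the threshold map, and the sketch below assumes that form with an amplification factor k = 50. Both the formula and the value of k are assumptions made for illustration.

```python
import math


def differentiable_binarization(prob_map, thresh_map, k=50.0):
    """Approximate binarization: B = 1 / (1 + exp(-k * (P - T))), applied per pixel."""
    return [
        [1.0 / (1.0 + math.exp(-k * (p - t))) for p, t in zip(prob_row, thresh_row)]
        for prob_row, thresh_row in zip(prob_map, thresh_map)
    ]


# Toy 1 x 3 maps (assumption): text pixel, background pixel, pixel exactly at the threshold.
prob = [[0.9, 0.1, 0.3]]
thresh = [[0.3, 0.3, 0.3]]
binary = differentiable_binarization(prob, thresh)
print([round(value, 3) for value in binary[0]])
```

Because the sigmoid is smooth, gradients can flow through the binarized map to both prediction networks during training, which a hard threshold would not allow.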

[0132] To train the single-line document detection network, a large number of document images are collected and given detection annotations, that is, the coordinates of the four points of each text box are annotated: upper left, upper right, lower right, and lower left. The specific text content inside each annotated text box is then annotated as well. The single-line document detection network is trained using the document images with detection annotations.

[0133] The single-line text detection network is jointly optimized using the probability map loss $L_s$, the threshold map loss $L_t$, and the binarization map loss $L_b$. The overall loss function $L_{det}$ of the single-line text detection network is:

[0134] $L_{det} = \alpha_1 L_s + \alpha_2 L_b + \alpha_3 L_t$

[0135] Here, α1, α2, and α3 denote the weighting coefficients of the probability map loss $L_s$, the binarization map loss $L_b$, and the threshold map loss $L_t$, respectively (e.g., α1=1, α2=1, α3=10);

[0136] The probability map loss $L_s$ and the binarization map loss $L_b$ are calculated using the binary cross-entropy loss, with the following formula:

[0137]

[0138] Here, $S_i$ indicates that hard sample mining is performed, and $x_i'$ denotes either the probability map output by the single-line text detection network or the binarized map calculated from the output probability map and the threshold map. When $x_i'$ is the probability map, $y_i'$ denotes the label of the probability map; when $x_i'$ is the binarized map, $y_i'$ denotes the label of the binarized map;

[0139] The threshold map loss $L_t$ is calculated from the distance between the predicted value and the label:

[0140]

[0141] Here, $R_d$ denotes the pixels within the label text box area; the other two symbols denote the threshold map output by the single-line text detection network and the label threshold map, respectively.

[0142] As shown in Figure 9, the single-line text recognition network consists of an image block encoding module, a position encoding embedding module, a first feature mixing module, a first downsampling module, a second feature mixing module, a second downsampling module, a third feature mixing module, a feature integration module, and a seventeenth fully connected layer, connected sequentially from data input to output.

[0143] The single-line text recognition network divides the input image into small blocks and encodes them through an image block encoding module. This is to meet the needs of subsequent self-attention calculation. Then, the position encoding is added to each small image block for marking and subsequent calculation. Through continuous self-attention and downsampling calculations, the connection between pixels at different positions is established.

[0144] The image block coding module includes a sixth convolutional layer, a fourth batch normalization layer, a third GELU activation function, a seventh convolutional layer, a fifth batch normalization layer, and a fourth GELU activation function, which are connected sequentially from data input to output.

[0145] Positional encoding embedding involves directly adding a set of learnable parameters to the features output by the image block encoding module. The channel dimension of this set of learnable parameters is consistent with that of the added features.

[0146] As shown in Figure 10, each feature mixing module includes a fourth normalization layer, a local-global mixing module, a fifth normalization layer, and a multilayer perceptron network connected sequentially from data input to output. The input of the feature mixing module is added to the output of the local-global mixing module, and this sum serves as the input of the fifth normalization layer. The same sum is then added to the output of the multilayer perceptron network to form the output of the feature mixing module.

[0147] The calculation process of the local-global mixing module is as follows:
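The two residual connections in the feature mixing module can be summarized in a few lines. This sketch only shows the wiring; the normalization layers, the local-global mixing module, and the multilayer perceptron are passed in as callables, and the toy stand-ins used in the example are hypothetical.

```python
def feature_mixing_block(x, norm1, mixer, norm2, mlp):
    """Apply the two residual connections described for the feature mixing module.

    x may be any value that supports "+", such as a float or a NumPy array.
    """
    h = x + mixer(norm1(x))      # module input + local-global mixing output
    return h + mlp(norm2(h))     # that sum + multilayer perceptron output


def identity(value):
    """Stand-in for a normalization layer (illustration only)."""
    return value


# Toy stand-ins: the "mixer" triples its input and the "MLP" adds one.
out = feature_mixing_block(2.0, identity, lambda v: 3.0 * v, identity, lambda v: v + 1.0)
print(out)  # 17.0
```

Here the intermediate sum is 2 + 6 = 8, and the output is 8 + 9 = 17, matching the order of additions described above.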

[0148]

[0149] Here, Q1, K1, and V1 denote the third query $Q_g$, the third key $K_g$, and the third value $V_g$ generated by the fully connected layer in the local-global mixing module, respectively; $d_k$ denotes the number of heads in the multi-head attention, and its purpose is to stabilize the gradient.

[0150] Each downsampling module includes an eighth convolutional layer and a sixth normalization layer connected sequentially from data input to output.

[0151] The feature integration module includes an adaptive average pooling layer, a 20th fully connected layer for feature integration, a 5th hardswish activation function, and a random dropout layer, connected sequentially from data input to output.

[0152] The single-line text recognition network continuously stacks feature mixing modules and downsampling modules, and calculates the self-attention of image blocks under different resolution conditions. As a result, the network can recognize text of different sizes in the image, thus improving the robustness of the single-line text recognition network.

[0153] The single-line text recognition network is trained using document images with recognition annotations. The annotated text includes all Chinese characters, Arabic numerals, English letters, punctuation marks, etc.

[0154] The single-line text recognition network is optimized using the CTC loss function and the SAR loss function. The total loss function $L_{rec}$ of the single-line text recognition network is therefore:

[0155] $L_{rec} = L_{CTC} + L_{SAR}$

[0156] Here, $L_{CTC}$ and $L_{SAR}$ denote the CTC loss function and the SAR loss function, respectively.

[0157] As shown in Figure 11, the table detection network includes a second feature extraction backbone network, a feature fusion neck network, and a feature decoding head network connected sequentially from data input to output. The second feature extraction backbone network has the same structure as the first feature extraction backbone network. The third, fifth, ninth, and thirteenth depthwise separable convolutional structures in the second feature extraction backbone network extract four levels of visual features from shallow to deep (namely, the first feature, the second feature, the third feature, and the fourth feature, respectively) and input them into the feature fusion neck network.

[0158] As shown in Figure 12, the feature fusion neck network includes a second upsampling layer, a third upsampling layer, a fourth upsampling layer, a first downsampling layer, a second downsampling layer, a third downsampling layer, a first high-low feature fusion module, a second high-low feature fusion module, a third high-low feature fusion module, a fourth high-low feature fusion module, a fifth high-low feature fusion module, and a sixth high-low feature fusion module. The first feature output from the second feature extraction backbone network is input to the second upsampling layer. The output of the second upsampling layer is concatenated with the second feature and then input to the sixth high-low feature fusion module. The output of the sixth high-low feature fusion module is then sent to the third upsampling layer. The output of the third upsampling layer is concatenated with the third feature and then input to the fifth high-low feature fusion module. The output of the fifth high-low feature fusion module is then sent to the sixth high-low feature fusion module. The output of the fourth upsampling layer is concatenated with the fourth feature and then input into the first high-low feature fusion module. The output of the first high-low feature fusion module is then fed to the first downsampling layer. The output of the first downsampling layer is concatenated with the output of the fifth high-low feature fusion module and then input into the second high-low feature fusion module. The output of the second high-low feature fusion module is then fed to the second downsampling layer. The output of the second downsampling layer is concatenated with the output of the sixth high-low feature fusion module and then input into the third high-low feature fusion module. The output of the third high-low feature fusion module is then fed to the third downsampling layer. 
The output of the third downsampling layer is concatenated with the first feature and then input into the fourth high-low feature fusion module. The output of the fourth high-low feature fusion module serves as the output of the feature fusion neck network. Each upsampling layer in the feature fusion neck network is achieved through nearest-neighbor interpolation of the features, while each downsampling layer is implemented using a convolutional layer with a stride of 2.

[0159] As shown in Figure 13, each high-low feature fusion module includes two paths. The first path includes a ninth convolutional layer, a sixth batch normalization layer, and a sixth hardswish activation function connected sequentially from data input to output. The second path includes a tenth convolutional layer, a seventh batch normalization layer, a seventh hardswish activation function, an eleventh convolutional layer, an eighth batch normalization layer, an eighth hardswish activation function, a second channel-wise convolutional layer, a ninth batch normalization layer, a ninth hardswish activation function, a second point-wise convolutional layer, a tenth batch normalization layer, and a tenth hardswish activation function connected sequentially from data input to output. The input of the high-low feature fusion module is fed to both the ninth and tenth convolutional layers, and the outputs of the sixth and tenth hardswish activation functions are concatenated to serve as the output of the high-low feature fusion module.

[0160] As shown in Figure 14, the feature decoding head network includes recurrent units, a 21st fully connected layer, a 22nd fully connected layer, a 23rd fully connected layer, and a 24th fully connected layer. First, all-zero intermediate hidden features and structure prediction features are initialized. The structure prediction features are one-hot encoded to produce the structure encoding features. Attention is calculated over the intermediate hidden features, the input features of the feature decoding head network, and the structure encoding features. The result of the attention calculation is then input into the recurrent unit, yielding the recurrent unit output features, the calculated intermediate hidden features, and the feature coefficients (the feature coefficients are obtained from the structure encoding features through the attention calculation and the recurrent unit calculation). The recurrent unit output features are passed sequentially through the 21st and 22nd fully connected layers to obtain the table structure information. The recurrent unit output features are also passed sequentially through the 23rd and 24th fully connected layers, and then through the second sigmoid activation function, to obtain the four coordinates of the cell. Simultaneously, the calculated intermediate hidden features are fed into the recurrent unit in the next iteration, and the feature coefficients are fed into the attention calculation in the next iteration. In the second iteration, the input features of the feature decoding head network and the feature coefficients retained from the previous iteration are used for attention calculation. 
The result, combined with the intermediate hidden layer features retained from the previous iteration, is fed into the recurrent unit for calculation, resulting in the recurrent unit output features, the calculated intermediate hidden layer features, and the feature coefficients. Simultaneously, the recurrent unit output features are passed sequentially through the 21st and 22nd fully connected layers to obtain the table structure information. The recurrent unit output features are then passed sequentially through the 23rd and 24th fully connected layers, and then through the second sigmoid activation function to obtain the four coordinates of the cells. The calculated intermediate hidden layer features are then fed into the recurrent unit in the next iteration, and the feature coefficients are fed into the attention calculation in the next iteration. This second iteration repeats until all table structure information and the four coordinates of all cells are obtained. The recurrent unit consists of the 25th, 26th, and 11th fully connected layers, connected sequentially from data input to output, and a recurrent neural network (i.e., a GRU network). The use of recurrent units is to reduce the network parameters and improve the network's computation speed. The gating mechanism controls the flow of information and enables long-range dependency modeling of sequences. Secondly, compared with ordinary RNN networks, it can better capture the long-range dependencies of long sequences and avoid the vanishing gradient phenomenon. Compared with LSTM, it has fewer parameters and simpler computation, thus resulting in faster training speed.

[0161] The training process of the table detection network involves using a table image with structural annotations (the table structure is annotated in XML format, and the coordinates of the four points of each element in the table are marked) to train the table detection network.

[0162] The table detection network is optimized using a structural loss and a positional loss. The loss function $L_{table}$ of the table detection network is therefore:

[0163] $L_{table} = L_{struct} + 2L_{loc}$

[0164] Here, $L_{struct}$ denotes the structural loss, which is the difference between the XML element sequence of the table structure predicted by the table detection network and the true XML element sequence; $L_{loc}$ denotes the positional loss, which is the difference between the four coordinates of the cells in the table predicted by the table detection network and the four coordinates of the actual cells.

[0165] Both the structural loss and the positional loss are calculated using the smooth L1 loss:

[0166] $L_{struct} = L_{loc} = \mathrm{smooth}_{L1}(x)$

[0167] Where x represents the input of the smoothL1 loss.
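As an illustration of how the two terms combine, the sketch below uses the widely used piecewise definition of the smooth L1 loss (quadratic below a threshold, linear above it). The threshold `beta = 1.0` and the averaging over elements are assumptions; only the weighting $L_{struct} + 2L_{loc}$ comes from the text.

```python
def smooth_l1(x, beta=1.0):
    """Smooth L1: 0.5 * x^2 / beta for |x| < beta, otherwise |x| - 0.5 * beta."""
    ax = abs(x)
    return 0.5 * ax * ax / beta if ax < beta else ax - 0.5 * beta


def table_loss(struct_diffs, loc_diffs):
    """L_table = L_struct + 2 * L_loc, each term averaged over its differences."""
    l_struct = sum(smooth_l1(d) for d in struct_diffs) / len(struct_diffs)
    l_loc = sum(smooth_l1(d) for d in loc_diffs) / len(loc_diffs)
    return l_struct + 2.0 * l_loc


# Hypothetical prediction-minus-label differences for one table.
print(table_loss(struct_diffs=[0.5], loc_diffs=[2.0]))  # 3.125
```

The quadratic region keeps gradients small near zero error, while the linear region limits the influence of large coordinate errors.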

[0168] The table element coordinate aggregation module primarily addresses the problem of reassembling text from multiple rows into a single cell. It aggregates text from single-line to multi-line by calculating the intersection-over-union (IoU) ratio and vertex distance between the four-point coordinates of the single-line text boxes obtained from the single-line text detection network and the four-point coordinates of the cells obtained from the table detection network. The IoU ratio is used to determine which text boxes obtained from the single-line text detection network belong to the same cell obtained from the table detection network, and the vertex distance and IoU ratio are used to determine the order of the text boxes obtained from the single-line text detection network model.

[0169] The table element text aggregation module takes the text box order determined by the table element coordinate aggregation module and concatenates the recognition results of the single-line text recognition network from top to bottom and left to right. In this way, the contents of a cell containing multiple lines of text are concatenated into a single string. Finally, combined with the table structure information output by the table detection network, the final Excel table is output.

[0170] Step 4: Generate a blank document. Insert the content of each element into the blank document according to its position in the document image to restore the layout.

[0171] It should be noted that at this point all elements in the layout have been processed. The processed elements are sorted by their coordinates from top left to bottom right, and a blank Word document is generated. The coordinates are also used to determine whether the elements are arranged in a single column or in two columns in the original document image. The elements are then inserted into the generated Word document according to the layout obtained. Once all elements have been inserted, the layout restoration of the document image is complete.
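One possible way to derive the insertion order is sketched below. The rule used here is an assumption: an element whose box crosses the vertical midline of the page is treated as spanning the full width, and the left-column and right-column elements collected so far are emitted before it. Writing the elements into a Word document is left out, and the names `reading_order` and `is_double_column` are hypothetical.

```python
def reading_order(elements, page_width):
    """Order elements top-left to bottom-right for single or double columns.

    elements: list of dicts with a "box" entry (x0, y0, x1, y1).
    """
    mid = page_width / 2
    ordered, left, right = [], [], []
    for elem in sorted(elements, key=lambda e: (e["box"][1], e["box"][0])):
        x0, _, x1, _ = elem["box"]
        if x0 < mid < x1:
            # Full-width element: flush both columns first, then emit it.
            ordered += left + right + [elem]
            left, right = [], []
        elif x1 <= mid:
            left.append(elem)
        else:
            right.append(elem)
    return ordered + left + right


def is_double_column(elements, page_width):
    """True if at least one element lies fully left and one fully right of the midline."""
    mid = page_width / 2
    has_left = any(e["box"][2] <= mid for e in elements)
    has_right = any(e["box"][0] >= mid for e in elements)
    return has_left and has_right


page = [
    {"name": "R2", "box": (310, 160, 550, 400)},
    {"name": "title", "box": (50, 10, 550, 40)},
    {"name": "L2", "box": (50, 210, 290, 400)},
    {"name": "R1", "box": (310, 60, 550, 150)},
    {"name": "footer", "box": (50, 420, 550, 440)},
    {"name": "L1", "box": (50, 60, 290, 200)},
]
names = [e["name"] for e in reading_order(page, page_width=600)]
```

On a single-column page every element crosses the midline, so the same function reduces to a plain top-to-bottom sort.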

[0172] This attention-network-based visual rich text layout restoration method uses a layout analysis network to detect the elements in a document image and obtain the coordinates and category of each element. Each element is cropped according to its coordinates and then processed according to its category to obtain its content. Finally, the content is inserted into a blank document according to its position in the document image, completing the layout restoration. The method connects multiple deep learning network models in series to restore the document image into an editable document. A contrastive denoising training strategy is added to the training of the layout analysis network to improve the accuracy of layout analysis. Because the network models are connected in series, the weakest-link principle applies: the quality of the layout restoration is limited by the layout analysis model. Adding the contrastive denoising strategy therefore improves the network's accuracy, promotes convergence, and shortens training time.
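The crop-and-dispatch flow summarized above can be sketched in a few lines. This is only an outline under stated assumptions: the image is a nested list of pixel rows, the recognizers are passed in as plain callables standing in for the trained networks, and image elements are kept as cropped pixels. The names `crop`, `restore_elements`, `recognize_text`, and `recognize_table` are hypothetical.

```python
TEXT_CATEGORIES = {
    "title", "body text", "table of contents", "figure title",
    "table title", "header", "footer", "footnote",
}


def crop(image, box):
    """Cut the rectangle (x0, y0, x1, y1) out of an image stored as rows of pixels."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


def restore_elements(image, detections, recognize_text, recognize_table):
    """Crop every detected element and process it according to its category."""
    results = []
    for det in detections:
        patch = crop(image, det["box"])
        if det["category"] in TEXT_CATEGORIES:
            content = recognize_text(patch)    # stand-in for detection + recognition
        elif det["category"] == "table":
            content = recognize_table(patch)   # stand-in for the table branch
        else:
            content = patch                    # assumed: images are kept as pixels
        results.append({**det, "content": content})
    return results


image = [[r * 10 + c for c in range(4)] for r in range(4)]
detections = [
    {"box": (0, 0, 2, 1), "category": "title"},
    {"box": (1, 1, 3, 3), "category": "image"},
]
restored = restore_elements(
    image, detections,
    recognize_text=lambda patch: "T",
    recognize_table=lambda patch: [["cell"]],
)
```

Passing the recognizers as arguments keeps the sketch independent of any particular model implementation.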

[0173] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0174] The embodiments described above are merely specific and detailed examples of the embodiments described in this application, and should not be construed as limiting the scope of the patent application. It should be noted that those skilled in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the scope of protection of this application. Therefore, the scope of protection of this patent application should be determined by the appended claims.

Claims

1. A visual rich text layout restoration method based on an attention network, characterized in that the method comprises: inputting a document image into a trained document image rotation angle classification network to obtain the angle by which the document image needs to be rotated and correcting the document image accordingly, and then performing a secondary correction on the document image using a projection correction algorithm; inputting the secondarily corrected document image into a layout analysis network to detect the elements in the layout and obtain the coordinates and category of each element in the document image, wherein the layout analysis network comprises a lightweight visual self-attention network, a multi-head self-attention encoder, and a multi-head self-attention decoder connected in sequence from data input to output; cropping out each element according to its coordinate information in the document image, and then processing each element according to its category to obtain the element content corresponding to each element; and generating a blank document and inserting the content of each element into the blank document according to its position in the document image to restore the layout; wherein the layout analysis network uses contrastive denoising training to stabilize training and accelerate convergence: two hyperparameters $\lambda_1$ and $\lambda_2$ are set, with $\lambda_1 < \lambda_2$, and two types of contrastive denoising queries, positive queries and negative queries, are generated simultaneously; two concentric squares are drawn with $\lambda_1$ and $\lambda_2$ respectively as half of their side lengths; the region inside the smaller square represents positive queries, whose noise scale is smaller than $\lambda_1$ and which are expected during training to reconstruct their corresponding ground-truth bounding boxes;
the region between the smaller and larger squares represents negative queries, whose noise scale is greater than $\lambda_1$ and less than $\lambda_2$ and which are expected during training to predict no object; hard negative queries are created during training to accelerate network convergence, and high-quality position queries are selected through contrastive denoising training to predict bounding boxes, thereby suppressing confusion. The anchor box regression and reconstruction training of the layout analysis network is optimized with the L1 loss and the GIoU loss, and both the element classification training and the contrastive denoising training use the focal loss; the formulas are as follows: $L_{L1} = |\hat{b} - b|$, $L_{GIoU} = 1 - IoU + \frac{C - |A \cup B|}{C}$, $L_{focal} = -y(1-p)^{\gamma}\log(p) - (1-y)\,p^{\gamma}\log(1-p)$. In the L1 loss, $\hat{b}$ denotes the predicted coordinate values output by the layout analysis network and $b$ denotes the actual coordinate values; in the GIoU loss, $A$ denotes the prediction box, $B$ denotes the label box, $C$ denotes the area of the smallest rectangle enclosing the prediction box and the label box, and $IoU$ denotes the intersection over union of the prediction box and the label box; in the focal loss, $p$ denotes the output of the layout analysis network, $y$ denotes the true label, and $\gamma$ denotes an adjustable factor with $\gamma > 0$. The total loss of the layout analysis network is $L = w_1 L_{L1} + w_2 L_{GIoU} + w_3 L_{focal}$, where $w_1$, $w_2$, and $w_3$ are in turn the weighting coefficients of the L1 loss, the GIoU loss, and the focal loss.

2. The visual rich text layout restoration method based on attention networks as described in claim 1, characterized in that: The document image rotation angle classification network includes, in sequence from data input to output, a first convolutional layer, a first hardswish activation function, a first depthwise separable convolutional structure, a second depthwise separable convolutional structure, a third depthwise separable convolutional structure, a fourth depthwise separable convolutional structure, a fifth depthwise separable convolutional structure, a sixth depthwise separable convolutional structure, a seventh depthwise separable convolutional structure, an eighth depthwise separable convolutional structure, a ninth depthwise separable convolutional structure, a tenth depthwise separable convolutional structure, an eleventh depthwise separable convolutional structure, a twelfth depthwise separable convolutional structure, a thirteenth depthwise separable convolutional structure, a global average pooling layer, a first fully connected layer, a second hardswish activation function, and a second fully connected layer. The document image rotation angle classification network is optimized using cross-entropy loss during training. The loss function is: $L_{cls} = -\sum_{i=1}^{4} y_i \log(p_i)$, where $i = 1$ denotes the first category, which requires no rotation; $i = 2$ denotes the second category, which requires a 90-degree rotation; $i = 3$ denotes the third category, which requires a 180-degree rotation; $i = 4$ denotes the fourth category, which requires a 270-degree rotation; $y_i$ denotes the actual label category; and $p_i$ denotes the network's prediction result.

3. The visual rich text layout restoration method based on attention networks as described in claim 1, characterized in that: The lightweight visual self-attention network includes a convolutional backbone, a first content local enhancement module, a first convolutional feedforward module, a second content local enhancement module, a second convolutional feedforward module, a third content local enhancement module, a third convolutional feedforward module, a fourth content local enhancement module, and a fourth convolutional feedforward module, which are connected sequentially from data input to output. The convolutional backbone includes four convolutional layers connected sequentially from data input to output.

4. The visual rich text layout restoration method based on attention networks as described in claim 3, characterized in that: Each content local enhancement module includes a global branch and a local branch. The input of the content local enhancement module is first processed by a first normalization layer. The global branch first applies a linear transformation to the output of the first normalization layer through a third fully connected layer to generate a first query, a first key, and a first value; the first key and the first value are then downsampled by a first pooling layer and a second pooling layer to obtain a first result and a second result, respectively; the first query is matrix-multiplied with the second result and the product is processed by a softmax layer to obtain a third result; the third result is then matrix-multiplied with the first result to obtain a fourth result. The local branch first applies a linear transformation to the output of the first normalization layer through a fourth fully connected layer to generate a second query, a second key, and a second value; the second query, the second key, and the second value are then processed by a first depthwise separable convolutional block, a second depthwise separable convolutional block, and a third depthwise separable convolutional block, respectively, to obtain a fifth result, a sixth result, and a seventh result. The sixth and seventh results are then matrix-multiplied to obtain an eighth result. The eighth result then passes in sequence through a fifth fully connected layer, a swish activation function, a sixth fully connected layer, and a tanh activation function to obtain a ninth result. The fifth and ninth results are then matrix-multiplied, and the product is concatenated with the fourth result and passed through a seventh fully connected layer.
Finally, the input of the content local enhancement module is added to the output of the eighth fully connected layer to obtain the output of the content local enhancement module.

5. The visual rich text layout restoration method based on attention networks as described in claim 3, characterized in that: Each of the convolutional feedforward modules is a stage-preserving convolutional feedforward module or a downsampling convolutional feedforward module. The input of the stage-preserving convolutional feedforward module is processed sequentially through the second normalization layer, the ninth fully connected layer, the first GELU activation function, the fourth depthwise separable convolutional block, and the tenth fully connected layer. Then, the output of the eleventh fully connected layer is added to the input of the stage-preserving convolutional feedforward module as the output of the stage-preserving convolutional feedforward module. The downsampling convolutional feedforward module includes a first branch and a second branch. The first branch includes a third normalization layer, a twelfth fully connected layer, a second GELU activation function, a fifth depthwise separable convolutional block, and a thirteenth fully connected layer, which are connected sequentially from data input to output. The second branch includes a sixth depthwise separable convolutional block, a first batch normalization layer, and a fourteenth fully connected layer, which are connected sequentially from data input to output. Then, the outputs of the fifteenth and sixteenth fully connected layers are added together as the output of the downsampling convolutional feedforward module.

6. The visual rich text layout restoration method based on attention networks as described in claim 1, characterized in that: cropping out each element according to its coordinate information in the document image and then processing each element according to its category to obtain the element content corresponding to each element comprises: directly cropping out each element according to its coordinate information in the document image, the elements including at least title, body text, table of contents, figure title, table title, header, footer, footnote, table, and image; when the category of an element is title, body text, table of contents, figure title, table title, header, footer, or footnote, passing the cropped element through the single-line text detection network and the single-line text recognition network in sequence to obtain the four-point coordinates of all single-line text boxes in the element and the text content corresponding to each single-line text box; and when the category of an element is table, passing the cropped table through the single-line text detection network to obtain the four-point coordinates of all single-line text boxes in the table, and then through the single-line text recognition network to obtain the text content corresponding to each single-line text box in the table; passing the cropped table through a table detection network to obtain the four-point coordinates of each cell in the table and the table structure information; inputting the four-point coordinates of the cells together with the four-point coordinates of the single-line text boxes into the table element coordinate aggregation module, and then into the table element text aggregation module, to concatenate the text belonging to the same cell; and finally outputting the table in combination with the table structure information.

7. The visual rich text layout restoration method based on attention networks as described in claim 6, characterized in that: The single-line text detection network includes a first feature extraction backbone network, a first parallel branch fusion network, a second parallel branch fusion network, a third parallel branch fusion network, and a first prediction network. The three features output by the first feature extraction backbone network are respectively input into the three parallel branch fusion networks, and the outputs of the three parallel branch fusion networks are fused and then input into the first prediction network. The single-line text detection network is jointly optimized with a probability map loss $L_s$, a threshold map loss $L_t$, and a binarized map loss $L_b$, and the overall loss function of the single-line text detection network is: $L_{det} = \alpha L_s + \beta L_t + \gamma L_b$, where $\alpha$, $\beta$, and $\gamma$ are in turn the weighting coefficients of the probability map loss, the threshold map loss, and the binarized map loss. The probability map loss $L_s$ and the binarized map loss $L_b$ are calculated using the binary cross-entropy loss: $L_s = L_b = -\sum_{i \in S} \left[ y_i \log x_i + (1 - y_i)\log(1 - x_i) \right]$, where $S$ denotes the sample set obtained by hard sample mining, and $x_i$ denotes the probability map output by the single-line text detection network or the binarized map calculated from the output probability map and the threshold map; when $x_i$ is the probability map, $y_i$ denotes the label of the probability map, and when $x_i$ is the binarized map, $y_i$ denotes the label of the binarized map. The threshold map loss $L_t$ is calculated from the distance between the predicted value and the label: $L_t = \sum_{i \in R} |y_i^{*} - x_i^{*}|$, where $R$ denotes the pixels within the label text box region, $x_i^{*}$ denotes the threshold map output by the single-line text detection network, and $y_i^{*}$ denotes the label threshold map.

8. The visual rich text layout restoration method based on attention networks as described in claim 6, characterized in that: The single-line text recognition network consists of an image block encoding module, a position encoding embedding module, a first feature mixing module, a first downsampling module, a second feature mixing module, a second downsampling module, a third feature mixing module, a feature integration module, and a seventeenth fully connected layer, connected sequentially from data input to output. The single-line text recognition network is jointly optimized with two loss functions, and the overall loss function of the single-line text recognition network is formed from these two loss functions.

9. The visual rich text layout restoration method based on attention networks as described in claim 6, characterized in that: The table detection network includes a second feature extraction backbone network, a feature fusion neck network, and a feature decoding head network, which are connected sequentially from data input to output. The table detection network is optimized using structural loss and positional loss. The loss function of the table detection network is: $L_{table} = L_{struct} + 2L_{loc}$, where $L_{struct}$ denotes the structural loss, which is the difference between the XML element sequence of the table structure predicted by the table detection network and the ground-truth XML element sequence, and $L_{loc}$ denotes the positional loss, which is the difference between the four-point coordinates of the cells predicted by the table detection network and the four-point coordinates of the ground-truth cells. The structural loss and the positional loss are both calculated using the smooth L1 loss: $L_{struct} = L_{loc} = \mathrm{smooth}_{L1}(x)$, where $x$ denotes the input of the smooth L1 loss.