Image processing method and device, model training method and device, electronic equipment and medium

By combining scaling processing and optical flow prediction models, the problem of poor document image correction effect is solved, achieving fast and efficient image correction and improving the accuracy of OCR recognition.

CN117237214B · Active · Publication Date: 2026-06-23 · BEIJING BAIDU NETCOM SCI & TECH CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: BEIJING BAIDU NETCOM SCI & TECH CO LTD
Filing Date: 2023-08-30
Publication Date: 2026-06-23

Patent Timeline

30 Aug 2023: Application
23 Jun 2026: Publication (CN117237214B)

IPC: G06T5/80; G06T5/60; G06T3/40; G06T7/269; G06N3/0464; G06N3/08

AI Tagging

Application Domain

Image enhancement; Image analysis

Technology Topics

Imaging processing; Radiology


AI Technical Summary

Technical Problem

Existing technologies are not effective in document image processing, especially in complex scenarios, and are time-consuming, which affects the accuracy of OCR recognition.

Method used

The image to be processed is scaled and input into a pre-trained optical flow prediction model. The predicted optical flow image is then upsampled into a mapped optical flow image, which is used to correct the original image through image mapping.

Benefits of technology

It improves the speed and effectiveness of image correction, ensures the content similarity between the corrected image and the original image, and enhances the accuracy of OCR recognition.

✦ Generated by Eureka AI based on patent content.


Abstract

The present disclosure provides an image processing method and device, a model training method and device, an electronic device and a medium, and relates to the technical field of image processing, in particular to the technical fields of artificial intelligence, deep learning, image anti-warping and the like. The specific implementation scheme is as follows: an image to be processed is acquired, scaling processing is performed on the image to be processed, and a scaling processing image corresponding to the image to be processed is acquired; the scaling processing image is input into a pre-trained optical flow prediction model, and a predicted optical flow image corresponding to the scaling processing image is acquired; pixels of the predicted optical flow image represent a corresponding relationship between pixels of the scaling processing image and pixels of a predicted correction image corresponding to the scaling processing image; up-sampling processing is performed on the predicted optical flow image, and a mapping optical flow image is acquired; image mapping processing is performed on the image to be processed according to the mapping optical flow image, and a correction image corresponding to the image to be processed is acquired.


Description

Technical Field

[0001] This disclosure relates to the field of image processing technology, and more particularly to the fields of artificial intelligence, deep learning, and image anti-distortion (dewarping). Specifically, this disclosure relates to an image processing method, a model training method, an apparatus, an electronic device, and a medium.

Background Technology

[0002] Document digitization is the process of processing, understanding, classifying, extracting, and summarizing images taken by users. However, in practical applications, limitations in shooting methods can lead to curved or folded images, which seriously affects the processing of document images and the accuracy of OCR (Optical Character Recognition).

[0003] Document anti-distortion algorithms can transform curved and folded images into flat images, improving the accuracy of OCR and the performance of document image processing tasks such as shadow removal, and thus play a crucial role in document digitization.

Summary of the Invention

[0004] This disclosure provides an image processing method, a model training method, an apparatus, an electronic device, and a medium.

[0005] According to a first aspect of this disclosure, an image processing method is provided, the method comprising:

[0006] Obtain the image to be processed, scale the image to be processed, and obtain the scaled image corresponding to the image to be processed;

[0007] The scaled image is input into a pre-trained optical flow prediction model to obtain a predicted optical flow image corresponding to the scaled image; the pixels of the predicted optical flow image represent the correspondence between the pixels of the scaled image and the pixels of the predicted corrected image corresponding to the scaled image.

[0008] The predicted optical flow image is upsampled to obtain the mapped optical flow image;

[0009] Based on the mapped optical flow image, image mapping processing is performed on the image to be processed to obtain the corrected image corresponding to the image to be processed.

[0010] According to a second aspect of this disclosure, an image processing apparatus is provided, the apparatus comprising:

[0011] The scaling module is used to acquire the image to be processed, scale the image to be processed, and acquire the scaled image corresponding to the image to be processed.

[0012] The prediction module is used to input the scaled image into a pre-trained optical flow prediction model to obtain a predicted optical flow image corresponding to the scaled image; the pixels of the predicted optical flow image represent the correspondence between the pixels of the scaled image and the pixels of the predicted corrected image corresponding to the scaled image.

[0013] The upsampling module is used to perform upsampling processing on the predicted optical flow image to obtain the mapped optical flow image;

[0014] The image mapping module is used to perform image mapping processing on the image to be processed based on the mapped optical flow image to obtain the corrected image corresponding to the image to be processed.

[0015] According to a third aspect of this disclosure, a method for model training is provided, the method comprising:

[0016] Acquire the image to be trained and the corresponding labeled optical flow image of the image to be trained;

[0017] The optical flow prediction model is trained based on the image to be trained and the labeled optical flow image to obtain a pre-trained optical flow prediction model.

[0018] The pixels in the labeled optical flow map represent the correspondence between the pixels of the image to be trained and the pixels of the corresponding corrected image.

[0019] According to a fourth aspect of this disclosure, an apparatus for model training is provided, the apparatus comprising:

[0020] The image module is used to acquire the image to be trained and the corresponding label optical flow image of the image to be trained;

[0021] The training module is used to train the optical flow prediction model based on the image to be trained and the labeled optical flow image to obtain the pre-trained optical flow prediction model.

[0022] The pixels in the labeled optical flow map represent the correspondence between the pixels of the image to be trained and the pixels of the corresponding corrected image.

[0023] According to a fifth aspect of this disclosure, an electronic device is provided, the electronic device comprising:

[0024] At least one processor; and

[0025] A memory communicatively connected to at least one of the aforementioned processors; wherein,

[0026] The memory stores instructions that can be executed by at least one processor, which, when executed by at least one processor, enable the at least one processor to perform the image processing method and / or the model training method.

[0027] According to a sixth aspect of this disclosure, a non-transitory computer-readable storage medium is provided storing computer instructions, wherein the computer instructions are used to cause a computer to perform the above-described image processing method and / or model training method.

[0028] According to a seventh aspect of this disclosure, a computer program product is provided, comprising a computer program that, when executed by a processor, implements the above-described image processing method and / or model training method.

[0029] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Description of the Drawings

[0030] The accompanying drawings are provided to better understand this solution and do not constitute a limitation of this disclosure. Wherein:

[0031] Figure 1 is a schematic flowchart of an image processing method provided in an embodiment of this disclosure;

[0032] Figure 2 is a flowchart illustrating some steps of another image processing method provided in an embodiment of this disclosure;

[0033] Figure 3 is a schematic diagram of the bottleneck structure in the method provided in the embodiments of this disclosure;

[0034] Figure 4 is a schematic diagram of the FPN structure in the method provided in the embodiments of this disclosure;

[0035] Figure 5 is a schematic diagram of the ASPP structure in the method provided in the embodiments of this disclosure;

[0036] Figure 6 is a schematic diagram of the structure of the Head in the method provided in the embodiments of this disclosure;

[0037] Figure 7 is a flowchart illustrating a model training method provided in an embodiment of this disclosure;

[0038] Figure 8 is a schematic diagram of the structure of an image processing apparatus provided in an embodiment of this disclosure;

[0039] Figure 9 is a schematic diagram of the structure of a model training apparatus provided in an embodiment of this disclosure;

[0040] Figure 10 is a block diagram of an electronic device used to implement the image processing method and model training method of the embodiments of this disclosure.

Detailed Implementation

[0041] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.

[0042] In some related technologies, traditional image processing algorithms can be used to achieve document distortion resistance. Specifically, document image correction can be achieved based on content segmentation. By analyzing the content of the document image, including tilt angles, text lines, character or phrase features, a corrected image is obtained based on certain rules.

[0043] While traditional image processing algorithms can correct document images, the correction effect is poor, especially for document images in complex scenes (such as documents with mixed text and images). They also heavily rely on the effectiveness of OCR and are very time-consuming.

[0044] In some related technologies, deep learning-based algorithms can be used to achieve document anti-distortion. Specifically, based on deep learning models, such as DewarpNet (a single-image document correction network based on 2D and 3D regression networks) and DocUnet (a stacked U-Net network), optical flow prediction of pixel boundaries can be achieved, and then the corrected image can be obtained through mapping.

[0045] Although deep learning-based algorithms can achieve document image correction and have good generalization ability, they are time-consuming and their effects cannot meet the needs of practical applications.

[0046] The image processing method, model training method, apparatus, electronic device, and medium provided in the embodiments of this disclosure are intended to solve at least one of the above-mentioned technical problems of the prior art.

[0047] The image processing method and model training method provided in this disclosure can be executed by electronic devices such as terminal devices or servers. The terminal device can be an in-vehicle device, user equipment (UE), mobile device, user terminal, terminal, cellular phone, cordless phone, personal digital assistant (PDA), handheld device, computing device, wearable device, etc. The method can be implemented by a processor calling computer-readable program instructions stored in memory. Alternatively, the method can be executed by a server.

[0048] Figure 1 shows a schematic flowchart of an image processing method provided in an embodiment of this disclosure. As shown in Figure 1, the image processing method provided in this embodiment may include steps S110, S120, S130 and S140.

[0049] In step S110, the image to be processed is obtained, the image to be processed is scaled, and the scaled image corresponding to the image to be processed is obtained.

[0050] In step S120, the scaled image is input into a pre-trained optical flow prediction model to obtain the predicted optical flow image corresponding to the scaled image;

[0051] In step S130, the predicted optical flow image is upsampled to obtain the mapped optical flow image;

[0052] In step S140, image mapping processing is performed on the image to be processed based on the mapped optical flow image to obtain the corrected image corresponding to the image to be processed;

[0053] Among them, the pixels of the predicted optical flow image represent the correspondence between the pixels of the scaled image and the pixels of the corresponding predicted corrected image.
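Steps S110 to S140 can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: images are small grayscale grids, scaling and upsampling use nearest-neighbour sampling, the flow stores per-pixel (dx, dy) offsets in pixel units and is read as "where to fetch each output pixel from", and `shift_model` is a hypothetical stand-in for the trained optical flow prediction model. All function names are illustrative.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]                    # H x W grayscale image
Flow = List[List[Tuple[float, float]]]     # H x W grid of (dx, dy) offsets


def downscale(image: Image, n: int) -> Image:
    """S110: keep every n-th pixel in both directions (nearest-neighbour)."""
    return [row[::n] for row in image[::n]]


def upsample_flow(flow: Flow, n: int, height: int, width: int) -> Flow:
    """S130: enlarge the flow grid n times and rescale offsets to full-size pixels."""
    last_y, last_x = len(flow) - 1, len(flow[0]) - 1
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            dx, dy = flow[min(y // n, last_y)][min(x // n, last_x)]
            row.append((dx * n, dy * n))   # assumption: offsets are in pixel units
        out.append(row)
    return out


def remap(image: Image, flow: Flow) -> Image:
    """S140: for each output pixel, read the input pixel the flow points at."""
    h, w = len(image), len(image[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            dx, dy = flow[y][x]
            src_x = min(max(int(round(x + dx)), 0), w - 1)  # clamp to the image
            src_y = min(max(int(round(y + dy)), 0), h - 1)
            row.append(image[src_y][src_x])
        out.append(row)
    return out


def correct_image(image: Image, model: Callable[[Image], Flow], n: int) -> Image:
    scaled = downscale(image, n)                                   # S110
    predicted_flow = model(scaled)                                 # S120
    mapped_flow = upsample_flow(predicted_flow, n,
                                len(image), len(image[0]))         # S130
    return remap(image, mapped_flow)                               # S140


def shift_model(scaled: Image) -> Flow:
    """Hypothetical stand-in for the trained network: fetch one scaled pixel to the right."""
    return [[(1.0, 0.0) for _ in row] for row in scaled]


image = [[x + 10 * y for x in range(4)] for y in range(4)]
corrected = correct_image(image, shift_model, n=2)
```

The model only sees the 2 x 2 scaled image, while the final mapping is applied to the full 4 x 4 image, which is the division of labour the steps describe.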

[0054] For example, the image to be processed could be a document image generated by taking a picture of a document with a camera.

[0055] Therefore, the image processing method provided in this disclosure can be applied to document digitization. The image processing method provided in this disclosure can be used to correct document images, achieve anti-distortion of documents, and transform curved and folded document images into flat images.

[0056] In some possible implementations, in step S110, acquiring the image to be processed can be done by receiving the image to be processed sent by the client. The client can be a personal computer, mobile device, or other device capable of interacting with the user.

[0057] In some possible implementations, there can be multiple images to be processed, that is, the image processing method provided in this disclosure can be applied to batch image processing.

[0058] In some possible implementations, scaling the image to be processed can be achieved by downsampling the image to be processed according to a certain reduction ratio to obtain a scaled image.

[0059] In some possible implementations, scaling the image to be processed can alternatively be achieved by downsampling it to a fixed size to obtain the scaled image.
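The two scaling options above (a reduction ratio, or a fixed target size) might look like this in a toy setting. The sketch assumes nearest-neighbour sampling on a grayscale grid; the disclosure does not name an interpolation method, and the function names are illustrative.

```python
from typing import List

Image = List[List[int]]


def downsample_by_ratio(image: Image, ratio: int) -> Image:
    """Reduce both sides by an integer ratio by keeping every ratio-th pixel."""
    return [row[::ratio] for row in image[::ratio]]


def resize_to_fixed(image: Image, out_h: int, out_w: int) -> Image:
    """Nearest-neighbour resize to a fixed out_h x out_w grid."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


image = [[10 * y + x for x in range(6)] for y in range(4)]   # 4 x 6 toy image
by_ratio = downsample_by_ratio(image, 2)                     # 2 x 3
fixed = resize_to_fixed(image, 2, 2)                         # 2 x 2
```

A fixed target size suits a model that expects a fixed input size, at the cost of different horizontal and vertical scale factors when the aspect ratio changes.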

[0060] In some possible implementations, in step S120, the optical flow prediction model can be a pre-trained deep learning model, whose input is an image (specifically, an image of a fixed size), and whose output is the predicted optical flow image corresponding to the input image.

[0061] The pixel value of each pixel in the predicted optical flow image represents the correspondence between the corresponding pixel in the input image and a pixel in the predicted corrected image corresponding to the input image.

[0062] The predicted corrected image corresponding to the input image can be the corrected (or anti-distortion processed) image obtained from the prediction results of the optical flow prediction model. In other words, the predicted corrected image can be the image obtained after image mapping based on the predicted optical flow image.

[0063] In some specific implementations, the pixel values of the pixels in the predicted optical flow image are used to characterize the motion trajectory (including the direction and distance of motion) of the corresponding pixels in the input image. After each pixel in the input image moves according to the motion trajectory corresponding to that pixel, the generated new image is the predicted and corrected image.
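The "motion trajectory" reading of the flow can be illustrated with a toy example. The sketch assumes integer (dx, dy) displacements per pixel and fills positions that no pixel lands on with a constant; both choices, and the name `move_pixels`, are illustrative assumptions.

```python
from typing import List, Tuple

Image = List[List[int]]
Flow = List[List[Tuple[int, int]]]   # per-pixel (dx, dy) displacement


def move_pixels(image: Image, flow: Flow, fill: int = 0) -> Image:
    """Move every input pixel along its own displacement to build the new image."""
    h, w = len(image), len(image[0])
    out = [[fill for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            nx, ny = x + dx, y + dy
            if 0 <= nx < w and 0 <= ny < h:   # pixels that leave the frame are dropped
                out[ny][nx] = image[y][x]
    return out


image = [[1, 2, 3],
         [4, 5, 6]]
# Every pixel moves one step to the left.
flow = [[(-1, 0)] * 3 for _ in range(2)]
moved = move_pixels(image, flow)
```

Positions that receive no pixel keep the fill value, which is why practical implementations often use the inverse lookup (for each output pixel, where to read from) instead.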

[0064] Therefore, the predicted corrected image corresponding to the scaled image can be the image obtained by inputting the scaled image into the optical flow prediction model and then mapping the predicted optical flow image output by the optical flow prediction model.

[0065] The pixel values of the predicted optical flow image represent the correspondence between the pixels in the scaled image and the pixels in the predicted and corrected image.

[0066] In some possible implementations, the optical flow prediction model can be any deep learning network capable of optical flow prediction.

[0067] In some possible implementations, in step S130, the upsampling of the predicted optical flow image can mirror the downsampling applied to the image to be processed.

[0068] In other words, if the image to be processed is downsampled by a factor of N to obtain the scaled image, the predicted optical flow image is enlarged by the same factor N to obtain the mapped optical flow image. Here, N is a positive number.

[0069] Therefore, the image size of the acquired mapped optical flow image is consistent with the image size of the image to be processed, and thus the corrected image acquired from the mapped optical flow image is also consistent with the image size of the image to be processed.
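One way to enlarge a flow grid by a factor N is bilinear interpolation, sketched below. Two assumptions are not stated in the text: the flow stores (dx, dy) offsets in pixel units, so the values are multiplied by N as well as the grid being enlarged, and N is an integer. If the flow instead stored normalized coordinates, the multiplication would be dropped.

```python
from typing import List, Tuple

Flow = List[List[Tuple[float, float]]]


def upsample_flow(flow: Flow, n: int) -> Flow:
    """Bilinear upsampling of an h x w flow grid to (h*n) x (w*n)."""
    h, w = len(flow), len(flow[0])
    out: Flow = []
    for y in range(h * n):
        sy = min(y / n, h - 1)            # source row position in the small grid
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for x in range(w * n):
            sx = min(x / n, w - 1)        # source column position
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            value = []
            for c in range(2):            # interpolate dx and dy separately
                top = flow[y0][x0][c] * (1 - fx) + flow[y0][x1][c] * fx
                bottom = flow[y1][x0][c] * (1 - fx) + flow[y1][x1][c] * fx
                value.append((top * (1 - fy) + bottom * fy) * n)  # rescale offsets
            row.append((value[0], value[1]))
        out.append(row)
    return out


small = [[(0.0, 0.0), (2.0, 0.0)]]        # 1 x 2 predicted flow
large = upsample_flow(small, 2)           # 2 x 4 mapped flow
dx_first_row = [p[0] for p in large[0]]
```

Interpolated values change gradually between neighbouring positions, so adjacent pixels of the enlarged flow receive similar offsets.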

[0070] In some possible implementations, since the pixels of the predicted optical flow image are pixel correspondences between images, and the mapped optical flow image obtained from the predicted optical flow image is also a pixel correspondence, in step S140, performing image mapping processing on the image to be processed means generating a corrected image corresponding to the image to be processed based on the pixel correspondences.

[0071] In the image processing method provided in this embodiment, since the input to the optical flow prediction model is the scaled image, fewer pixels need to be processed than when the image to be processed is input directly, which speeds up processing and therefore image correction. At the same time, since the mapped optical flow image is obtained by upsampling the predicted optical flow image, the pixel values of adjacent pixels in the mapped optical flow image are similar. The mappings of adjacent pixels in the image to be processed are therefore also similar, so the neighborhood relationships between pixels in the corrected image resemble those in the image to be processed. This preserves the content similarity between the image to be processed and the corrected image and improves the image correction effect.

[0072] The image processing method provided in the embodiments of this disclosure will be described in detail below.

[0073] In some possible implementations, the optical flow prediction model may include a backbone network, a neck network, and a head.

[0074] Figure 2 shows a flowchart of a specific implementation of inputting the scaled image into the pre-trained optical flow prediction model to obtain the predicted optical flow image, for the case where the optical flow prediction model includes a Backbone, a Neck, and a Head. As shown in Figure 2, inputting the scaled image into a pre-trained optical flow prediction model to obtain the predicted optical flow image corresponding to the scaled image may include steps S210, S220, and S230.

[0075] In step S210, the scaled image is input into the backbone network to obtain the hierarchical image features of the scaled image;

[0076] In step S220, the hierarchical image features are input into the neck network, and feature fusion is performed on the hierarchical image features to obtain the fused image features of the scaled image.

[0077] In step S230, the fused image features are input into the detection head to obtain the predicted optical flow image corresponding to the scaled image.

[0078] In some possible implementations, in step S210, the Backbone can be an image feature extraction network, which may include multiple cascaded feature extraction modules. The image features of different sizes output by the multiple feature extraction modules are the hierarchical image features.

[0079] In some possible implementations, the backbone can be a lightweight network called MobileNet (Mobile Neural Network).

[0080] MobileNet is faster than other image feature extraction networks. Using MobileNet as the backbone can further accelerate image processing while ensuring the correction effect.

[0081] In some specific implementations, when the Backbone is MobileNet, the Backbone network structure is shown in the table below:

[0082]-[0083] [Table not reproduced in this text.]

[0084] Wherein, input represents the size of the graph input to the structure, operator represents the operation corresponding to the structure, t represents the expansion factor in the bottleneck structure, c represents the number of channels in the graph, n represents the number of times the operation corresponding to the structure is executed, and s represents the stride of the convolution operation in the structure.
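To make the roles of t, c, n, and s concrete, the sketch below expands table rows into a flat list of bottleneck layers. The row values are illustrative placeholders, not the values of the table in this disclosure, and the rule that only the first of the n repeats uses stride s is an assumption borrowed from the published MobileNetV2 design.

```python
from typing import Dict, List, Tuple

# (t, c, n, s) rows. Illustrative values only, not taken from the source table.
EXAMPLE_ROWS: List[Tuple[int, int, int, int]] = [
    (1, 16, 1, 1),
    (6, 24, 2, 2),
    (6, 32, 3, 2),
]


def expand_rows(rows: List[Tuple[int, int, int, int]],
                in_channels: int) -> List[Dict[str, int]]:
    """Turn each (t, c, n, s) row into n per-layer specifications."""
    layers = []
    for t, c, n, s in rows:
        for i in range(n):
            layers.append({
                "in": in_channels,
                "hidden": in_channels * t,      # expansion factor widens the first 1x1 conv
                "out": c,
                "stride": s if i == 0 else 1,   # assumption: only the first repeat strides
            })
            in_channels = c
    return layers


layers = expand_rows(EXAMPLE_ROWS, in_channels=32)
```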

[0085] conv2d represents a two-dimensional convolution operation on the input.

[0086] Figure 3 shows a schematic diagram of the bottleneck structure. As shown in Figure 3, when the stride of the convolution operation in the bottleneck is 1, the bottleneck consists of a Conv (convolutional layer) 301 with a ReLU6 activation function and a kernel size of 1x1, a Dwise (DepthwiseConv, depthwise convolution) 302 with a ReLU6 activation function and a kernel size of 3x3, a Conv (convolutional layer) 303 with a Linear activation function and a kernel size of 1x1, and an Add 304 that fuses the output of Conv 303 with the input of the bottleneck.

[0087] When the stride of the convolution operation in the bottleneck is 2, the bottleneck consists of a Conv (convolutional layer) 305 with a ReLU6 activation function and a kernel size of 1x1, a Dwise 306 with a ReLU6 activation function, a kernel size of 3x3, and a convolution stride of 2, and a Conv (convolutional layer) 307 with a Linear activation function and a kernel size of 1x1.
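The two bottleneck variants can be sketched on plain nested lists as follows. This is a toy forward pass under assumptions: features are channels x height x width lists, the depthwise convolution uses zero padding of 1, the Add is applied only when stride is 1 and the channel counts match, and all weights are hand-picked illustrative values.

```python
from typing import List

Feature = List[List[List[float]]]   # channels x height x width


def relu6(x: Feature) -> Feature:
    return [[[min(max(v, 0.0), 6.0) for v in row] for row in ch] for ch in x]


def conv1x1(x: Feature, weight: List[List[float]]) -> Feature:
    """weight[o][i] mixes input channel i into output channel o at every pixel."""
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(wo[i] * x[i][r][c] for i in range(len(x))) for c in range(w)]
         for r in range(h)]
        for wo in weight
    ]


def depthwise3x3(x: Feature, kernels: List[List[List[float]]], stride: int) -> Feature:
    """One 3x3 kernel per channel, zero padding of 1."""
    h, w = len(x[0]), len(x[0][0])
    out = []
    for ch, k in zip(x, kernels):
        rows = []
        for r in range(0, h, stride):
            row = []
            for c in range(0, w, stride):
                acc = 0.0
                for dr in range(3):
                    for dc in range(3):
                        rr, cc = r + dr - 1, c + dc - 1
                        if 0 <= rr < h and 0 <= cc < w:
                            acc += k[dr][dc] * ch[rr][cc]
                row.append(acc)
            rows.append(row)
        out.append(rows)
    return out


def bottleneck(x: Feature, w_expand, k_dwise, w_project, stride: int) -> Feature:
    y = relu6(conv1x1(x, w_expand))                 # Conv 1x1, ReLU6
    y = relu6(depthwise3x3(y, k_dwise, stride))     # Dwise 3x3, ReLU6
    y = conv1x1(y, w_project)                       # Conv 1x1, Linear
    if stride == 1 and len(y) == len(x):            # Add: only when shapes match
        y = [[[a + b for a, b in zip(ry, rx)] for ry, rx in zip(cy, cx)]
             for cy, cx in zip(y, x)]
    return y


x = [[[1.0] * 4 for _ in range(4)]]                       # 1 channel, 4 x 4
w_expand = [[1.0], [1.0]]                                 # 1 -> 2 channels
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
k_dwise = [identity, identity]
w_project = [[0.5, 0.5]]                                  # 2 -> 1 channel
out_s1 = bottleneck(x, w_expand, k_dwise, w_project, stride=1)
out_s2 = bottleneck(x, w_expand, k_dwise, w_project, stride=2)
```

With stride 1 the output keeps the 4 x 4 size and includes the residual input; with stride 2 the output is 2 x 2 and no Add is applied.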

[0088] In some possible implementations, in step S220, the hierarchical image features may include at least one shallow image feature and at least one deep image feature.

[0089] Among them, shallow image features can be image features output by the shallow network structure in the Backbone, and deep image features can be image features output by the deep network structure in the Backbone.

[0090] In some specific implementations, deep image features can be the image features output from the last layer of the Backbone, while shallow image features can be the image features output from the penultimate, third-to-last, and fourth-to-last layers of the Backbone.

[0091] In some possible implementations, inputting hierarchical image features into the neck network and fusing the hierarchical image features to obtain the fused image features of the image to be processed may include inputting shallow image features and high-level image features into the neck network, upsampling the high-level image features to obtain sampled image features of the same size as the shallow image features, and fusing the shallow image features and sampled image features to obtain the fused image features of the scaled image.

[0092] In some possible implementations, feature fusion of shallow and high-level image features can be achieved through FPN (Feature Pyramid Networks).

[0093] Figure 4 shows a schematic diagram of the FPN structure. As shown on the left of Figure 4, the image features output by the Backbone are obtained, namely the first shallow image feature 403, the second shallow image feature 402, the third shallow image feature 401, and the high-level image feature 404. By upsampling the high-level image feature 404, a sampled image feature of the same size as the first shallow image feature 403 is obtained (named the first sampled image feature 405 to distinguish it from other sampled image features). The first shallow image feature 403 and the first sampled image feature 405 are fused to obtain the first fused image feature. The first fused image feature is upsampled to obtain the second sampled image feature 406 of the same size as the second shallow image feature 402. The second shallow image feature 402 and the second sampled image feature 406 are fused to obtain the second fused image feature. The second fused image feature is upsampled to obtain the third sampled image feature 407 of the same size as the third shallow image feature 401. The third shallow image feature 401 and the third sampled image feature 407 are fused to obtain the third fused image feature.

[0094] The first fused image features, the second fused image features, the third fused image features, and the high-level image features can all be used as the fused image features of the scaled image; alternatively, the fused image features of the scaled image can be obtained by processing the first fused image features, the second fused image features, the third fused image features, and the high-level image features.

[0095] In some specific implementations, when the Backbone is MobileNet, as shown in Figure 4, the upsampling can specifically be 2x upsampling, implemented through a 2xUP structure. Fusing the image features output by the Backbone with the image features obtained by upsampling can be achieved by using a 1x1Conv structure to perform a 1x1 convolution on the image features output by the Backbone, adding the convolved image features to the image features obtained by upsampling, and passing the result through two cascaded bottleneck structures to obtain the fused image features.
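The top-down fusion path can be sketched on single-channel maps. For brevity, and as a stated simplification, the sketch uses nearest-neighbour 2x upsampling and plain addition, and omits the 1x1 convolution on the Backbone features and the two cascaded bottlenecks; the numerals in the comments refer to the features named for Figure 4.

```python
from typing import List

Map = List[List[float]]   # a single-channel feature map, for brevity


def upsample2x(x: Map) -> Map:
    """2x nearest-neighbour upsampling (the '2xUP' step)."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]
        out.extend([wide, list(wide)])
    return out


def add(a: Map, b: Map) -> Map:
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down(shallow_maps: List[Map], deep: Map) -> List[Map]:
    """shallow_maps is ordered from the smallest map to the largest."""
    fused = []
    current = deep
    for shallow in shallow_maps:
        current = add(shallow, upsample2x(current))   # fuse, then carry downward
        fused.append(current)
    return fused


deep = [[1.0]]                                    # 1 x 1 (feature 404)
first = [[1.0] * 2 for _ in range(2)]             # 2 x 2 (feature 403)
second = [[1.0] * 4 for _ in range(4)]            # 4 x 4 (feature 402)
third = [[1.0] * 8 for _ in range(8)]             # 8 x 8 (feature 401)
fused = top_down([first, second, third], deep)
```

Each fused map accumulates the contributions of all deeper levels, which is how the largest map ends up carrying both shallow detail and high-level context.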

[0096] Since shallow image features are beneficial for optical flow prediction tasks, and high-level image features contain semantic information, the fused image features obtained by fusing high-level and shallow image features contain semantic features as well as features that are useful for optical flow prediction. Therefore, using fused image features can improve the optical flow prediction accuracy of the optical flow prediction model.

[0097] In some possible implementations, performing feature fusion on the hierarchical image features to obtain the fused image features of the scaled image can alternatively consist of processing the hierarchical image features to obtain receptive field image features of different receptive fields, and then fusing the receptive field image features to obtain the fused image features.

[0098] In some possible implementations, processing the hierarchical image features to obtain receptive field image features of different receptive fields, and fusing the receptive field image features to obtain fused image features may include obtaining multiple receptive field image features based on the hierarchical image features; fusing the multiple receptive field image features to obtain receptive field fused features; and fusing the receptive field fused features with the hierarchical image features to obtain fused image features of the scaled image.

[0099] Different receptive fields correspond to different receptive field image features.

[0100] In some possible implementations, obtaining multiple receptive field image features based on hierarchical image features can be achieved by performing convolution processing on the hierarchical image features using convolution kernels of different sizes to obtain multiple receptive field image features.

[0101] In some possible implementations, feature fusion is performed on multiple receptive field image features to obtain receptive field fusion features, including: channel stitching of multiple receptive field image features to obtain stitched image features; and convolution processing of the stitched image features to obtain receptive field fusion features with the same size as the hierarchical image features.
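Channel stitching followed by a convolution back to the original channel count can be sketched as below. The branch contents and the 1x1 convolution weights are hypothetical placeholders; a 1x1 convolution is used here as one possible choice of the convolution mentioned above.

```python
from typing import List

Feature = List[List[List[float]]]   # channels x height x width


def concat_channels(branches: List[Feature]) -> Feature:
    """Stack the channels of all branches; spatial size must already match."""
    return [channel for branch in branches for channel in branch]


def conv1x1(x: Feature, weight: List[List[float]]) -> Feature:
    """weight[o][i] mixes input channel i into output channel o at every pixel."""
    h, w = len(x[0]), len(x[0][0])
    return [
        [[sum(wo[i] * x[i][r][c] for i in range(len(x))) for c in range(w)]
         for r in range(h)]
        for wo in weight
    ]


def constant(channels: int, h: int, w: int, value: float) -> Feature:
    return [[[value] * w for _ in range(h)] for _ in range(channels)]


# Three receptive-field branches with 2 channels each (toy values).
branches = [constant(2, 3, 3, float(k)) for k in (1, 2, 3)]
stitched = concat_channels(branches)              # 6 channels
weight = [[0.5] * 6, [0.5] * 6]                   # hypothetical weights: 6 -> 2 channels
fused = conv1x1(stitched, weight)
```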

[0102] In some possible implementations, the ASPP (Atrous Spatial Pyramid Pooling) structure can be used to process hierarchical image features to obtain receptive field image features of different receptive fields, and then the receptive field image features can be fused to obtain fused image features.

[0103] Figure 5 shows a schematic diagram of the ASPP structure. As shown in Figure 5, a 1x1Conv, three 3x3Conv with rates (dilation, or hole, coefficients) of 12, 24, and 36 respectively, and Time Pooling can be used to process the high-level image features 501 to obtain receptive field image features 502, 503, 504, 505, and 506 with different receptive fields. Then, by channel stitching of the receptive field image features, a stitched image feature 507 is obtained. Finally, a 1x1Conv is used to transform the stitched image feature 507 into a fused image feature 508 with the same size as the hierarchical image features.
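The effect of the rate on the receptive field can be checked with a little arithmetic: a kernel of size k whose taps are spaced `rate` pixels apart spans k + (k - 1)(rate - 1) pixels. The sketch below computes that span for the three rates named above and shows a 1-D dilated convolution on a toy signal; the function names are illustrative.

```python
from typing import List


def effective_kernel_size(kernel: int, rate: int) -> int:
    """Span covered by a kernel whose taps are `rate` pixels apart."""
    return kernel + (kernel - 1) * (rate - 1)


def dilated_conv1d(signal: List[float], taps: List[float], rate: int) -> List[float]:
    """1-D dilated convolution with zero padding; output length equals input length."""
    centre = len(taps) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, t in enumerate(taps):
            j = i + (k - centre) * rate      # taps are spaced `rate` samples apart
            if 0 <= j < len(signal):
                acc += t * signal[j]
        out.append(acc)
    return out


spans = {rate: effective_kernel_size(3, rate) for rate in (12, 24, 36)}
impulse = [0.0] * 7
impulse[3] = 1.0
response = dilated_conv1d(impulse, [1.0, 1.0, 1.0], rate=2)
```

The three 3x3 branches therefore cover 25, 49, and 73 pixels per side while each still uses only nine weights per channel.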

[0104] By acquiring receptive field image features from different receptive fields, the receptive field of the acquired fused image features can be improved, thereby enhancing the accuracy of the predicted optical flow image obtained from the fused image features and thus improving the image correction effect.

[0105] In some possible implementations, inputting hierarchical image features into the neck network and fusing the hierarchical image features to obtain the fused image features of the image to be processed may also include processing the high-level image features to obtain receptive field image features of different receptive fields, fusing the receptive field image features to obtain the fused image features corresponding to the high-level image features, upsampling the fused image features corresponding to the high-level image features to obtain sampled image features with the same size as the shallow image features, and fusing the shallow image features and sampled image features to obtain the fused image features of the scaled image.

[0106] Specifically, ASPP can be used to process high-level image features to obtain receptive field image features of different receptive fields. The fused image features corresponding to the high-level image features can be obtained by fusing these receptive field image features. Upsampling the fused image features corresponding to the high-level image features yields sampled image features of the same size as the shallow image features. Finally, feature fusion of the shallow image features and the sampled image features yields the fused image features of the scaled image, which can be achieved through FPN. The structures and processing procedures of ASPP and FPN are as described above and will not be repeated here.

[0107] By using ASPP and FPN, we can obtain fused features, including shallow image features and features from different receptive fields, thereby improving the accuracy of the predicted optical flow image obtained from the fused image features, and thus improving the image correction effect.

[0108] Shallow image features and high-level image features are input into the neck network. The high-level image features are upsampled to obtain sampled image features with the same size as the shallow image features. The shallow image features and sampled image features are then fused to obtain the fused image features of the scaled image.

[0109] In some possible implementations, in step S230, the Head may correspond to the Backbone.

[0110] Figure 6 illustrates the structure of the Head when the Backbone is Mobilenet. As shown in Figure 6, the Head can include two bottlenecks cascaded in sequence and a 1x1 Conv. The specific components of the bottleneck and the 1x1 Conv are as described above and will not be repeated here.

[0111] The optical flow prediction model described above can improve the speed of image correction while ensuring the image correction effect, so that the image processing method provided in this embodiment can be applied to practical image anti-distortion processing.

[0112] Figure 7 shows a flowchart of the model training method provided in an embodiment of this disclosure. As shown in Figure 7, the model training method provided in this embodiment may include steps S710 and S720.

[0113] In step S710, the image to be trained and the corresponding labeled optical flow image are acquired;

[0114] In step S720, the optical flow prediction model is trained based on the image to be trained and the labeled optical flow image to obtain the pre-trained optical flow prediction model.

[0115] Here, the pixels in the labeled optical flow image represent the correspondence between the pixels of the image to be trained and the pixels of the corresponding corrected image.

[0116] For example, the training image could be a document image generated by taking a picture of a document with a camera.

[0117] Therefore, the optical flow prediction model obtained by the model training method provided in this embodiment can be applied to document digitization to correct document images, achieve document anti-distortion, and transform curved and folded document images into flat images.

[0118] In some possible implementations, in step S710, the acquired image to be trained may be a preprocessed image.

[0119] Preprocessing can involve adjusting the image size to ensure that the resulting training images have consistent dimensions.
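As an illustration of such size adjustment, the following sketch resizes an image, stored as a list of pixel rows, to a fixed height and width using nearest-neighbour sampling. The function name and the choice of nearest-neighbour sampling are assumptions made for the example; the disclosure does not specify the resizing method.

```python
def resize_nearest(image, out_h, out_w):
    """Resize a 2-D image (list of rows) to out_h x out_w by nearest-neighbour sampling."""
    in_h, in_w = len(image), len(image[0])
    return [
        [image[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


small = [[1, 2],
         [3, 4]]
resized = resize_nearest(small, 4, 4)
print(resized)
# [[1, 1, 2, 2], [1, 1, 2, 2], [3, 3, 4, 4], [3, 3, 4, 4]]
```

Images of different original sizes passed through the same call all come out with identical dimensions, which is the property the preprocessing step relies on.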

[0120] In some possible implementations, in step S720, the optical flow prediction model can be a deep learning model, whose input is an image (specifically, an image of a fixed size), and whose output is the predicted optical flow image corresponding to the input image.

[0121] The pixel value of each pixel in the predicted optical flow image represents the correspondence between the corresponding pixel in the input image and a pixel in the predicted corrected image corresponding to the input image.

[0122] The predicted corrected image corresponding to the input image can be the corrected (or anti-distortion processed) image obtained from the prediction results of the optical flow prediction model. In other words, the predicted corrected image can be the image obtained after image mapping based on the predicted optical flow image.

[0123] In some specific implementations, the pixel values of the pixels in the predicted optical flow image are used to characterize the motion trajectory (including the direction and distance of motion) of the corresponding pixels in the input image. After each pixel in the input image moves according to the motion trajectory corresponding to that pixel, the generated new image is the predicted and corrected image.
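The mapping described in this paragraph can be sketched as a forward warp: each input pixel is moved by its own displacement and written into the output. The sketch below assumes integer displacements stored as (dx, dy) pairs, drops pixels that leave the image, and fills unreached positions with a constant; these conventions and the function name are assumptions for illustration, not details given in the disclosure.

```python
def forward_warp(image, flow, fill=0):
    """Move every pixel of `image` by its (dx, dy) displacement in `flow`."""
    h, w = len(image), len(image[0])
    out = [[fill] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            ty, tx = y + dy, x + dx
            if 0 <= ty < h and 0 <= tx < w:  # discard pixels moved outside the frame
                out[ty][tx] = image[y][x]
    return out


image = [[1, 2, 3],
         [4, 5, 6]]
# Toy flow: every pixel moves one column to the right.
flow = [[(1, 0)] * 3 for _ in range(2)]
corrected = forward_warp(image, flow)
print(corrected)  # [[0, 1, 2], [0, 4, 5]]
```

Practical implementations often use sub-pixel displacements with interpolation, for example a backward lookup such as OpenCV's `cv2.remap`, rather than this integer forward warp.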

[0124] In some possible implementations, the loss value of the optical flow prediction model is computed from the predicted optical flow image and the labeled optical flow image. Backpropagation is then performed based on this loss value to update the model parameters of the optical flow prediction model and obtain the pre-trained optical flow prediction model.
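The disclosure does not name a specific loss function. As one possible choice, the sketch below computes a mean absolute (L1) difference between a predicted and a labeled optical flow image, each stored as rows of (dx, dy) pairs. The L1 loss and the function name are assumptions for illustration only.

```python
def l1_flow_loss(pred, label):
    """Mean absolute difference over all flow components (assumed loss, not from the source)."""
    total, count = 0.0, 0
    for pred_row, label_row in zip(pred, label):
        for (pdx, pdy), (ldx, ldy) in zip(pred_row, label_row):
            total += abs(pdx - ldx) + abs(pdy - ldy)
            count += 2
    return total / count


pred = [[(1.0, 0.0), (0.5, 0.5)]]
label = [[(1.0, 1.0), (0.0, 0.5)]]
loss = l1_flow_loss(pred, label)
print(loss)  # 0.375
```

In a deep learning framework the same quantity would be computed on tensors so that gradients can flow back to the model parameters.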

[0125] In the model training method provided in this embodiment, the input to the obtained optical flow prediction model can be a scaled image. Therefore, compared with directly inputting the image to be processed into the optical flow prediction model, fewer pixels need to be processed, the processing speed is faster, and the speed of image correction is improved. At the same time, since the mapped optical flow image is obtained by upsampling the predicted optical flow image, it can be guaranteed that the pixel values corresponding to nearby pixels in the mapped optical flow image are similar, which also guarantees that the mapping of nearby pixels in the image to be processed is similar. This ensures that the pixel proximity relationship of the corrected image obtained by the image mapping processing is similar to that of the image to be processed, thus ensuring the content similarity between the image to be processed and the corrected image, and improving the effect of image correction.
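The smoothness argument above can be seen in a small bilinear upsampling sketch applied to one component of a flow image. The function name and the corner-aligned sampling grid are assumptions; the disclosure does not specify the interpolation scheme.

```python
def upsample_bilinear(grid, out_h, out_w):
    """Bilinearly upsample a 2-D grid of floats (corner-aligned sampling)."""
    in_h, in_w = len(grid), len(grid[0])
    out = []
    for i in range(out_h):
        sy = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(sy)
        y1 = min(y0 + 1, in_h - 1)
        fy = sy - y0
        row = []
        for j in range(out_w):
            sx = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(sx)
            x1 = min(x0 + 1, in_w - 1)
            fx = sx - x0
            top = grid[y0][x0] * (1 - fx) + grid[y0][x1] * fx
            bottom = grid[y1][x0] * (1 - fx) + grid[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


# Horizontal displacement component of a tiny predicted flow image.
predicted_dx = [[0.0, 2.0]]
mapped_dx = upsample_bilinear(predicted_dx, 1, 3)
print(mapped_dx)  # [[0.0, 1.0, 2.0]]
```

Each new value lies between its neighbours, so adjacent pixels receive similar mappings. If the flow stores displacements in pixel units of the scaled image, the values would also need to be multiplied by the scale factor; the disclosure does not state which encoding is used.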

[0126] The method for model training provided in the embodiments of this disclosure will be described in detail below.

[0127] In some possible implementations, the optical flow prediction model may include a backbone network, a neck network, and a head.

[0128] When the optical flow prediction model includes Backbone, Neck, and Head, training the optical flow prediction model based on the image to be trained and the labeled optical flow image can yield a pre-trained optical flow prediction model, which may include:

[0129] The images to be trained are input into the backbone network to obtain the hierarchical image features of the images to be trained.

[0130] The hierarchical image features are input into the neck network, and the hierarchical image features are fused to obtain the fused image features of the image to be trained.

[0131] The fused image features are input into the head to obtain the predicted optical flow image corresponding to the image to be trained;

[0132] The model parameters of the optical flow prediction model are modified based on the predicted optical flow image and the labeled optical flow image to obtain a pre-trained optical flow prediction model.

[0133] In some possible implementations, the backbone can be an image feature extraction network, which may include multiple cascaded feature extraction modules. The image features of different sizes output by the multiple feature extraction modules are called hierarchical image features.

[0134] In some possible implementations, the backbone can be a lightweight network called MobileNet (Mobile Neural Network).

[0135] MobileNet is faster than other image feature extraction networks. Using MobileNet as the backbone can further accelerate image processing while ensuring the correction effect.

[0136] In some specific implementations, when the Backbone is Mobilenet, the Backbone network structure is as described above, and will not be repeated here.

[0137] In some possible implementations, the hierarchical image features may include at least one shallow image feature and at least one deep (high-level) image feature.

[0138] Among them, shallow image features can be image features output by the shallow network structure in the Backbone, and deep image features can be image features output by the deep network structure in the Backbone.

[0139] In some specific implementations, deep image features can be the image features output from the last layer of the Backbone, while shallow image features can be the image features output from the penultimate, third-to-last, and fourth-to-last layers of the Backbone.

[0140] In some possible implementations, inputting hierarchical image features into the neck network and fusing the hierarchical image features to obtain the fused image features of the image to be trained may include inputting shallow image features and high-level image features into the neck network, upsampling the high-level image features to obtain sampled image features of the same size as the shallow image features, and fusing the shallow image features and sampled image features to obtain the fused image features of the image to be trained.

[0141] In some possible implementations, feature fusion of shallow and high-level image features can be achieved through FPN (Feature Pyramid Networks).

[0142] Figure 4 shows a schematic diagram of the FPN structure. As shown on the left of Figure 4, the shallow image features output by the Backbone are obtained, namely the first shallow image feature 403, the second shallow image feature 402, and the third shallow image feature 401, together with the high-level image feature 404. By upsampling the high-level image feature 404, a sampled image feature of the same size as the first shallow image feature 403 is obtained (named the first sampled image feature 405 to distinguish it from other sampled image features). The first shallow image feature 403 and the first sampled image feature 405 are fused to obtain the first fused image feature. The first fused image feature is upsampled to obtain the second sampled image feature 406 of the same size as the second shallow image feature 402. The second shallow image feature 402 and the second sampled image feature 406 are fused to obtain the second fused image feature. The second fused image feature is upsampled to obtain the third sampled image feature 407 of the same size as the third shallow image feature 401. The third shallow image feature 401 and the third sampled image feature 407 are fused to obtain the third fused image feature.

[0143] The first fused image features, the second fused image features, the third fused image features, and the high-level image features can all be used as the fused image features of the image to be trained; alternatively, the fused image features of the image to be trained can be obtained by processing the first fused image features, the second fused image features, the third fused image features, and the high-level image features.

[0144] In some specific implementations, when the Backbone is Mobilenet, as shown in Figure 4, the upsampling process can specifically be a 2x upsampling process, implemented through a 2xUP structure. The image features output by the Backbone and the image features obtained by upsampling can be fused as follows: a 1x1Conv structure performs a 1x1 convolution on the image features output by the Backbone, the convolved image features are added to the image features obtained by upsampling, and the sum is passed through two cascaded bottleneck structures to obtain the fused image features.
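The top-down fusion can be sketched on single-channel feature maps, where a 1x1 convolution reduces to multiplying by one weight. The sketch below covers only the 2x upsampling and the additive fusion; the two cascaded bottlenecks are omitted, nearest-neighbour upsampling is an assumption, and all function names are hypothetical.

```python
def upsample_2x(feature):
    """Nearest-neighbour 2x upsampling of a 2-D feature map (list of rows)."""
    out = []
    for row in feature:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def lateral_1x1(feature, weight):
    """1x1 convolution on a single-channel map: a per-pixel scalar multiply."""
    return [[weight * v for v in row] for row in feature]


def fuse(shallow, deeper, weight=1.0):
    """Add the 1x1-convolved shallow map to the 2x-upsampled deeper map."""
    up = upsample_2x(deeper)
    lat = lateral_1x1(shallow, weight)
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(lat, up)]


high_level = [[10.0]]                       # 1x1 map from the last stage
first_shallow = [[1.0, 2.0], [3.0, 4.0]]    # 2x2 map
second_shallow = [[1.0] * 4 for _ in range(4)]  # 4x4 map

first_fused = fuse(first_shallow, high_level)
second_fused = fuse(second_shallow, first_fused)
print(first_fused)      # [[11.0, 12.0], [13.0, 14.0]]
print(second_fused[0])  # [12.0, 12.0, 13.0, 13.0]
```

Each fused map keeps the resolution of its shallow input while carrying information propagated down from the higher level.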

[0145] Since shallow image features are beneficial for optical flow prediction tasks, and high-level image features contain semantic information, the fused image features obtained by fusing high-level and shallow image features contain semantic features as well as features that are useful for optical flow prediction. Therefore, using fused image features can improve the optical flow prediction accuracy of the optical flow prediction model.

[0146] In some possible implementations, feature fusion of hierarchical image features to obtain fused image features of the image to be trained can also be performed by processing hierarchical image features to obtain receptive field image features of different receptive fields, and then fusing the receptive field image features to obtain fused image features.

[0147] In some possible implementations, processing the hierarchical image features to obtain receptive field image features of different receptive fields, and fusing the receptive field image features to obtain fused image features may include obtaining multiple receptive field image features based on the hierarchical image features; fusing the multiple receptive field image features to obtain receptive field fused features; and fusing the receptive field fused features with the hierarchical image features to obtain fused image features of the image to be trained.

[0148] Different receptive fields correspond to different receptive field image features.

[0149] In some possible implementations, obtaining multiple receptive field image features based on hierarchical image features can be achieved by performing convolution processing on the hierarchical image features using convolution kernels of different sizes to obtain multiple receptive field image features.

[0150] In some possible implementations, feature fusion is performed on multiple receptive field image features to obtain receptive field fusion features, including: channel stitching of multiple receptive field image features to obtain stitched image features; and convolution processing of the stitched image features to obtain receptive field fusion features with the same size as the hierarchical image features.
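Channel stitching followed by a convolution can be illustrated with features stored as [channel][row][column] lists. In this sketch the convolution is a 1x1 convolution, which mixes channels at each pixel and therefore keeps the spatial size; the weights and function names are hypothetical values chosen for the example.

```python
def concat_channels(features):
    """Stitch several [C][H][W] features along the channel dimension."""
    stitched = []
    for feature in features:
        stitched.extend(feature)
    return stitched


def conv1x1(feature, weights):
    """1x1 convolution: weights[o][c] mixes input channel c into output channel o."""
    height, width = len(feature[0]), len(feature[0][0])
    return [
        [
            [sum(wc * ch[y][x] for wc, ch in zip(w_out, feature)) for x in range(width)]
            for y in range(height)
        ]
        for w_out in weights
    ]


branch_a = [[[1.0, 2.0]]]   # 1 channel, 1x2
branch_b = [[[3.0, 4.0]]]   # 1 channel, 1x2
stitched = concat_channels([branch_a, branch_b])   # 2 channels, 1x2
fused = conv1x1(stitched, weights=[[0.5, 0.5]])    # back to 1 channel, 1x2
print(len(stitched), fused)  # 2 [[[2.0, 3.0]]]
```

The number of rows in `weights` sets the output channel count, which is how the stitched feature can be brought back to the size of the hierarchical image features.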

[0151] In some possible implementations, the ASPP (Atrous Spatial Pyramid Pooling) structure can be used to process hierarchical image features to obtain receptive field image features of different receptive fields, and then the receptive field image features can be fused to obtain fused image features.

[0152] Figure 5 shows a schematic diagram of the ASPP structure. As shown in Figure 5, a 1x1Conv, three 3x3Conv with rates (dilation coefficients) of 12, 24, and 36 respectively, and a pooling branch (labeled Time Pooling) can be used to process the high-level image features 501 to obtain receptive field image features 502, 503, 504, 505, and 506 with different receptive fields. The receptive field image features are then stitched along the channel dimension to obtain a stitched image feature 507. Finally, a 1x1Conv transforms the stitched image feature 507 into a fused image feature 508 with the same size as the hierarchical image features.

[0153] By acquiring receptive field image features from different receptive fields, the receptive field of the resulting fused image features can be enlarged, thereby enhancing the accuracy of the predicted optical flow image obtained from the fused image features and thus improving the image correction effect.

[0154] In some possible implementations, inputting hierarchical image features into the neck network and fusing the hierarchical image features to obtain the fused image features of the image to be trained may also include processing the high-level image features to obtain receptive field image features of different receptive fields, fusing the receptive field image features to obtain the fused image features corresponding to the high-level image features, upsampling the fused image features corresponding to the high-level image features to obtain sampled image features with the same size as the shallow image features, and fusing the shallow image features and sampled image features to obtain the fused image features of the image to be trained.

[0155] Specifically, ASPP can be used to process high-level image features to obtain receptive field image features of different receptive fields. The fused image features corresponding to the high-level image features can be obtained by fusing these receptive field image features. Upsampling the fused image features corresponding to the high-level image features yields sampled image features of the same size as the shallow image features. Finally, feature fusion of the shallow image features and the sampled image features, which can be achieved through FPN, yields the fused image features of the image to be trained. The structures and processing procedures of ASPP and FPN are as described above and will not be repeated here.

[0156] By using ASPP and FPN, we can obtain fused features, including shallow image features and features from different receptive fields, thereby improving the accuracy of the predicted optical flow image obtained from the fused image features, and thus improving the image correction effect.

[0157] In some possible implementations, the Head can correspond to the Backbone.

[0158] Figure 6 illustrates the structure of the Head when the Backbone is Mobilenet. As shown in Figure 6, the Head can include two bottlenecks cascaded in sequence and a 1x1 Conv. The specific components of the bottleneck and the 1x1 Conv are as described above and will not be repeated here.

[0159] The optical flow prediction model described above can improve the speed of image correction while ensuring the image correction effect, so that the image processing method provided in this embodiment can be applied to practical image anti-distortion processing.

[0160] Based on the same principle as the method shown in Figure 1, Figure 8 shows a schematic diagram of the structure of an image processing apparatus provided in an embodiment of this disclosure. As shown in Figure 8, the image processing apparatus 80 may include:

[0161] The scaling module 810 is used to acquire the image to be processed, perform scaling processing on the image to be processed, and acquire the scaled image corresponding to the image to be processed.

[0162] The prediction module 820 is used to input the scaled image into the pre-trained optical flow prediction model to obtain the predicted optical flow image corresponding to the scaled image; the pixels of the predicted optical flow image represent the correspondence between the pixels of the scaled image and the pixels of the corresponding predicted corrected image.

[0163] The upsampling module 830 is used to perform upsampling processing on the predicted optical flow image to obtain the mapped optical flow image;

[0164] The image mapping module 840 is used to perform image mapping processing on the image to be processed based on the mapped optical flow image, and obtain the corrected image corresponding to the image to be processed.
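The way the four modules hand data to one another can be sketched as a small pipeline. Everything below is a stand-in chosen for illustration: the class and function names are hypothetical, the model is replaced by a placeholder that predicts zero displacement, and flow values are assumed to be integer (dx, dy) pairs.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ImageProcessingApparatus:
    scale: Callable      # role of scaling module 810
    predict: Callable    # role of prediction module 820
    upsample: Callable   # role of upsampling module 830
    remap: Callable      # role of image mapping module 840

    def __call__(self, image):
        scaled = self.scale(image)
        predicted_flow = self.predict(scaled)
        mapped_flow = self.upsample(predicted_flow, len(image), len(image[0]))
        return self.remap(image, mapped_flow)


def downscale_2x(image):
    return [row[::2] for row in image[::2]]


def zero_flow_model(scaled):
    # Placeholder for the pre-trained model: predicts no displacement anywhere.
    return [[(0, 0) for _ in row] for row in scaled]


def upsample_nearest(flow, out_h, out_w):
    in_h, in_w = len(flow), len(flow[0])
    return [[flow[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]


def forward_remap(image, flow):
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dx, dy = flow[y][x]
            if 0 <= y + dy < h and 0 <= x + dx < w:
                out[y + dy][x + dx] = image[y][x]
    return out


apparatus = ImageProcessingApparatus(downscale_2x, zero_flow_model,
                                     upsample_nearest, forward_remap)
image = [[r * 4 + c for c in range(4)] for r in range(4)]
corrected = apparatus(image)
print(corrected == image)  # True, because the placeholder flow is zero
```

Only the scaled image reaches the model, while the final mapping is applied to the full-resolution image, which is the division of work the apparatus description relies on.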

[0165] In the image processing apparatus provided in this embodiment, since the input to the optical flow prediction model is a scaled image, fewer pixels need to be processed compared to directly inputting the image to be processed into the optical flow prediction model, resulting in faster processing speed and improved image correction speed. Simultaneously, since the mapped optical flow image is obtained by upsampling the predicted optical flow image, it ensures that the pixel values corresponding to adjacent pixels in the mapped optical flow image are similar. This also ensures that the mapping of adjacent pixels in the image to be processed is similar, thereby ensuring that the pixel proximity relationship of the corrected image obtained from the image mapping process is similar to that of the image to be processed. This ensures the content similarity between the image to be processed and the corrected image, improving the image correction effect.

[0166] It is understood that the above-described modules of the image processing apparatus in the embodiments of this disclosure have the functions of implementing the corresponding steps of the image processing method in the embodiment shown in Figure 1. These functions can be implemented in hardware or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the aforementioned functions. These modules can be software and/or hardware, and each module can be implemented individually or multiple modules can be integrated. For a detailed description of the functions of each module in the aforementioned image processing apparatus, refer to the corresponding description of the image processing method in the embodiment shown in Figure 1, which is not repeated here.

[0167] Based on the same principle as the method shown in Figure 7, Figure 9 shows a schematic diagram of the structure of a model training apparatus provided in an embodiment of this disclosure. As shown in Figure 9, the model training apparatus 90 may include:

[0168] Image module 910 is used to acquire the image to be trained and the corresponding labeled optical flow image;

[0169] Training module 920 is used to train the optical flow prediction model based on the image to be trained and the labeled optical flow image to obtain the pre-trained optical flow prediction model.

[0170] Here, the pixels in the labeled optical flow image represent the correspondence between the pixels of the image to be trained and the pixels of the corresponding corrected image.

[0171] In the model training apparatus provided in this embodiment, the input to the obtained optical flow prediction model can be a scaled image. Therefore, compared to directly inputting the image to be processed into the optical flow prediction model, fewer pixels need to be processed, the processing speed is faster, and the speed of image correction is improved. At the same time, since the mapped optical flow image is obtained by upsampling the predicted optical flow image, it can be guaranteed that the pixel values corresponding to nearby pixels in the mapped optical flow image are similar, which also guarantees that the mapping of nearby pixels in the image to be processed is similar. This ensures that the pixel proximity relationship of the corrected image obtained by the image mapping processing is similar to that of the image to be processed, thus ensuring the content similarity between the image to be processed and the corrected image, and improving the effect of image correction.

[0172] It is understood that the above-described modules of the model training apparatus in the embodiments of this disclosure have the functions of implementing the corresponding steps of the model training method in the embodiment shown in Figure 7. These functions can be implemented in hardware or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the above-described functions. These modules can be software and/or hardware, and each module can be implemented individually or multiple modules can be integrated. For a detailed description of the functions of each module in the above-described model training apparatus, refer to the corresponding description of the model training method in the embodiment shown in Figure 7, which is not repeated here.

[0173] The acquisition, storage, and application of user personal information involved in the technical solution disclosed herein comply with the provisions of relevant laws and regulations and do not violate public order and good morals.

[0174] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.

[0175] The electronic device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform image processing methods and / or model training methods as provided in embodiments of this disclosure.

[0176] Compared with existing technologies, this electronic device uses a scaled image as the input to the optical flow prediction model. Therefore, it requires processing fewer pixels and has a faster processing speed than directly inputting the image to be processed into the optical flow prediction model, thus improving the speed of image correction. At the same time, since the mapped optical flow image is obtained by upsampling the predicted optical flow image, it can ensure that the pixel values corresponding to nearby pixels in the mapped optical flow image are similar. This also ensures that the mapping of nearby pixels in the image to be processed is similar, thus ensuring that the pixel proximity relationship of the corrected image obtained by the image mapping process is similar to that of the image to be processed. This ensures the content similarity between the image to be processed and the corrected image, improving the image correction effect.

[0177] The readable storage medium is a non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform image processing methods and / or model training methods as provided in embodiments of the present disclosure.

[0178] Compared with existing technologies, this readable storage medium uses a scaled image as the input to the optical flow prediction model. Therefore, it requires fewer pixels to be processed and processes faster than directly inputting the image to be processed into the optical flow prediction model, thus improving the speed of image correction. Furthermore, since the mapped optical flow image is obtained by upsampling the predicted optical flow image, it ensures that the pixel values corresponding to adjacent pixels in the mapped optical flow image are similar. This also ensures that the mapping of adjacent pixels in the image to be processed is similar, thereby ensuring that the pixel proximity relationship of the corrected image obtained from the image mapping process is similar to that of the image to be processed. This guarantees the content similarity between the image to be processed and the corrected image, improving the image correction effect.

[0179] The computer program product includes a computer program that, when executed by a processor, implements the image processing method and / or model training method provided in embodiments of this disclosure.

[0180] Compared with existing technologies, this computer program product uses a scaled image as input to the optical flow prediction model. Therefore, it requires fewer pixels to be processed and processes faster than directly inputting the image to be processed into the optical flow prediction model, thus improving the speed of image correction. At the same time, since the mapped optical flow image is obtained by upsampling the predicted optical flow image, it can be ensured that the pixel values corresponding to nearby pixels in the mapped optical flow image are similar. This also ensures that the mapping of nearby pixels in the image to be processed is similar, thereby ensuring that the pixel proximity relationship of the corrected image obtained by image mapping processing is similar to that of the image to be processed. This ensures the content similarity between the image to be processed and the corrected image, improving the image correction effect.

[0181] Figure 10 shows a schematic block diagram of an example electronic device 1000 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and/or claimed herein.

[0182] As shown in Figure 10, device 1000 includes a computing unit 1001, which can perform various appropriate actions and processes according to a computer program stored in read-only memory (ROM) 1002 or a computer program loaded into random access memory (RAM) 1003 from storage unit 1008. RAM 1003 may also store various programs and data required for the operation of device 1000. The computing unit 1001, ROM 1002, and RAM 1003 are interconnected via bus 1004. Input/output (I/O) interface 1005 is also connected to bus 1004.

[0183] Multiple components in device 1000 are connected to I / O interface 1005, including: input unit 1006, such as keyboard, mouse, etc.; output unit 1007, such as various types of monitors, speakers, etc.; storage unit 1008, such as disk, optical disk, etc.; and communication unit 1009, such as network card, modem, wireless transceiver, etc. Communication unit 1009 allows device 1000 to exchange information / data with other devices through computer networks such as the Internet and / or various telecommunications networks.

[0184] The computing unit 1001 can be various general-purpose and / or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 1001 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 1001 performs the various methods and processes described above, such as image processing methods and / or model training methods. For example, in some embodiments, the image processing methods and / or model training methods can be implemented as computer software programs tangibly contained in a machine-readable medium, such as storage unit 1008. In some embodiments, part or all of the computer program can be loaded and / or installed on device 1000 via ROM 1002 and / or communication unit 1009. When the computer program is loaded into RAM 1003 and executed by the computing unit 1001, one or more steps of the image processing methods and / or model training methods described above can be performed. Alternatively, in other embodiments, the computing unit 1001 may be configured by any other suitable means (e.g., by means of firmware) to perform image processing methods and / or model training methods.

[0185] Various embodiments of the systems and techniques described herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0186] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions / operations specified in the flowcharts and / or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0187] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0188] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0189] The systems and techniques described herein can be implemented in a computing system that includes backend components (e.g., a data server), a computing system that includes middleware components (e.g., an application server), a computing system that includes frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with embodiments of the systems and techniques described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected by digital data communication in any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.

[0190] A computer system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. A server can be a cloud server, a server of a distributed system, or a server incorporating blockchain technology.

[0191] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in a different order, as long as the desired result of the technical solution of this disclosure can be achieved; no limitation is imposed herein.

[0192] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.

Claims

1. An image processing method, comprising: obtaining an image to be processed, and scaling the image to be processed to obtain a scaled image corresponding to the image to be processed; inputting the scaled image into a pre-trained optical flow prediction model to obtain a predicted optical flow image corresponding to the scaled image, wherein the pixels of the predicted optical flow image represent the correspondence between the pixels of the scaled image and the pixels of a predicted corrected image corresponding to the scaled image, and the optical flow prediction model includes a backbone network, a neck network, and a detection head; upsampling the predicted optical flow image to obtain a mapped optical flow image; and performing image mapping processing on the image to be processed based on the mapped optical flow image to obtain a corrected image corresponding to the image to be processed; wherein the step of inputting the scaled image into the pre-trained optical flow prediction model to obtain the predicted optical flow image corresponding to the scaled image includes: inputting the scaled image into the backbone network to obtain hierarchical image features of the scaled image, the hierarchical image features including at least one shallow image feature and at least one high-level image feature; processing the high-level image features to obtain receptive field image features with different receptive fields, and fusing the receptive field image features to obtain fused image features corresponding to the high-level image features; upsampling the fused image features corresponding to the high-level image features to obtain sampled image features with the same size as the shallow image features; fusing the shallow image features and the sampled image features to obtain fused image features of the scaled image; and inputting the fused image features into the detection head to obtain the predicted optical flow image corresponding to the scaled image.
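The final image mapping step of claim 1 can be illustrated with a minimal NumPy sketch. It is not the patented implementation. The claims only say that the flow pixels represent a correspondence between source and corrected pixels, so the encoding used here is an assumption: `flow[y, x] = (dx, dy)` is the offset from an output pixel to the source pixel it copies (a backward map), sampled with nearest-neighbour rounding. The function name `remap_with_flow` is hypothetical.

```python
import numpy as np

def remap_with_flow(image: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """Backward-warp `image` (H, W) or (H, W, C) using `flow` (H, W, 2).

    Assumption (not stated in the claims): flow[y, x] = (dx, dy) is the offset
    from output pixel (x, y) to the source pixel whose value it should take.
    """
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    # Source coordinates, rounded to the nearest pixel and clamped to the image.
    src_x = np.clip(np.rint(xs + flow[..., 0]).astype(int), 0, w - 1)
    src_y = np.clip(np.rint(ys + flow[..., 1]).astype(int), 0, h - 1)
    return image[src_y, src_x]

# Toy usage: every output pixel copies its right-hand neighbour.
image = np.arange(12).reshape(3, 4)
shift = np.zeros((3, 4, 2))
shift[..., 0] = 1.0
corrected = remap_with_flow(image, shift)
```

A production system would more likely use bilinear sampling than nearest-neighbour rounding; the rounding here only keeps the sketch short.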

2. The method according to claim 1, wherein the step of inputting the scaled image into the pre-trained optical flow prediction model to obtain the predicted optical flow image corresponding to the scaled image includes: inputting the scaled image into the backbone network to obtain the hierarchical image features of the scaled image; inputting the hierarchical image features into the neck network, and performing feature fusion on the hierarchical image features to obtain the fused image features of the scaled image; and inputting the fused image features into the detection head to obtain the predicted optical flow image corresponding to the scaled image.

3. The method according to claim 2, wherein the step of inputting the hierarchical image features into the neck network and performing feature fusion on the hierarchical image features to obtain the fused image features of the image to be processed includes: inputting the shallow image features and the high-level image features into the neck network, and upsampling the high-level image features to obtain sampled image features with the same size as the shallow image features; and fusing the shallow image features and the sampled image features to obtain the fused image features of the scaled image.
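The neck-network step of claim 3 can be sketched as follows, under stated assumptions. The claims do not name the upsampling method or the fusion operator, so nearest-neighbour upsampling and channel-wise concatenation are assumed here purely for illustration. Feature maps use a `(channels, height, width)` layout, and the helper names `upsample_to` and `fuse_shallow_and_high` are hypothetical.

```python
import numpy as np

def upsample_to(feature: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Nearest-neighbour resize of a (C, H, W) feature map to (C, out_h, out_w)."""
    _, h, w = feature.shape
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return feature[:, rows[:, None], cols[None, :]]

def fuse_shallow_and_high(shallow: np.ndarray, high: np.ndarray) -> np.ndarray:
    """Upsample `high` to the spatial size of `shallow`, then fuse the two.

    Assumption: fusion is channel-wise concatenation; addition would also
    fit the claim wording.
    """
    sampled = upsample_to(high, shallow.shape[1], shallow.shape[2])
    return np.concatenate([shallow, sampled], axis=0)

# Toy usage: an 8-channel 16x16 shallow map and a 32-channel 4x4 high-level map.
fused = fuse_shallow_and_high(np.zeros((8, 16, 16)), np.ones((32, 4, 4)))
```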

4. The method according to claim 2, wherein the step of performing feature fusion on the hierarchical image features to obtain the fused image features of the scaled image includes: obtaining multiple receptive field image features based on the hierarchical image features, wherein different receptive fields correspond to different receptive field image features; fusing the multiple receptive field image features to obtain receptive field fusion features; and fusing the receptive field fusion features with the hierarchical image features to obtain the fused image features of the scaled image.

5. The method according to claim 4, wherein the step of fusing the multiple receptive field image features to obtain the receptive field fusion features includes: stitching the multiple receptive field image features together to obtain stitched image features; and performing convolution processing on the stitched image features to obtain receptive field fusion features with the same size as the hierarchical image features.

6. The method according to claim 4, wherein the step of obtaining multiple receptive field image features based on the hierarchical image features includes: convolving the hierarchical image features at each level with convolution kernels of different sizes to obtain the multiple receptive field image features.
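Claims 4 to 6 describe a multi-receptive-field fusion that can be sketched in NumPy as below. This is an illustrative sketch, not the claimed network. Several points are assumptions: "stitching" is read as channel-wise concatenation, the kernel sizes are 1, 3, and 5, the fusing convolution is 1x1, the final fusion with the input feature is element-wise addition, and random weights stand in for learned parameters. The names `conv2d_same` and `receptive_field_fusion` are hypothetical.

```python
import numpy as np

def conv2d_same(x: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Zero-padded 'same' cross-correlation.

    x: (C_in, H, W); weight: (C_out, C_in, k, k) with odd k.
    """
    c_out, _, k, _ = weight.shape
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    h, w = x.shape[1:]
    out = np.zeros((c_out, h, w))
    for dy in range(k):
        for dx in range(k):
            patch = xp[:, dy:dy + h, dx:dx + w]  # shifted view, (C_in, H, W)
            out += np.einsum("oc,chw->ohw", weight[:, :, dy, dx], patch)
    return out

def receptive_field_fusion(feature, kernel_sizes=(1, 3, 5), seed=0):
    """Convolve with several kernel sizes, stitch, reduce, and fuse with the input."""
    rng = np.random.default_rng(seed)  # random weights stand in for trained ones
    c = feature.shape[0]
    branches = [
        conv2d_same(feature, rng.normal(scale=0.1, size=(c, c, k, k)))
        for k in kernel_sizes
    ]
    stitched = np.concatenate(branches, axis=0)           # (len(kernel_sizes) * C, H, W)
    w_fuse = rng.normal(scale=0.1, size=(c, stitched.shape[0], 1, 1))
    rf_fused = conv2d_same(stitched, w_fuse)              # back to (C, H, W)
    return rf_fused + feature                             # assumed fusion: addition

# Toy usage on a 4-channel 6x5 feature map.
feature = np.random.default_rng(1).normal(size=(4, 6, 5))
fused = receptive_field_fusion(feature)
```

Larger kernels see a wider neighbourhood of each position, which is why branches with different kernel sizes yield features with different receptive fields.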

7. The method according to claim 1, wherein the backbone network is a mobile neural network.

8. The method according to claim 1, wherein the step of upsampling the predicted optical flow image to obtain the mapped optical flow image includes: upsampling the predicted optical flow image to obtain a mapped optical flow image with the same size as the image to be processed.
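The upsampling in claim 8 can be sketched as a bilinear resize of the low-resolution flow to the size of the image to be processed. The claims do not specify the interpolation method or the flow encoding, so two assumptions are made here: interpolation is bilinear with an align-corners sampling grid, and the flow stores pixel displacements `(dx, dy)`, which therefore need rescaling by the size ratio. If the flow stored normalized coordinates instead, the rescaling would be dropped. The name `upsample_flow` is hypothetical.

```python
import numpy as np

def upsample_flow(flow: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Bilinearly resize a (h, w, 2) flow field to (out_h, out_w, 2).

    Assumption: flow[..., 0] and flow[..., 1] are x and y displacements in
    pixels of the low-resolution grid, so they are scaled to the new grid.
    """
    h, w = flow.shape[:2]
    # Positions of the output samples inside the low-resolution grid.
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, h - 1)
    x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]
    wx = (xs - x0)[None, :, None]
    top = flow[y0][:, x0] * (1 - wx) + flow[y0][:, x1] * wx
    bottom = flow[y1][:, x0] * (1 - wx) + flow[y1][:, x1] * wx
    up = top * (1 - wy) + bottom * wy
    # Rescale displacements from low-resolution pixels to full-resolution pixels.
    return up * np.array([out_w / w, out_h / h])

# Toy usage: a constant (dx, dy) = (1, 2) flow on a 2x3 grid, resized to 4x9.
low = np.tile(np.array([1.0, 2.0]), (2, 3, 1))
full = upsample_flow(low, 4, 9)
```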

9. The method according to claim 1, wherein the image to be processed is a document image generated by capturing a document with a camera.

10. A method for training a model, comprising: acquiring an image to be trained and a labeled optical flow image corresponding to the image to be trained; and training an optical flow prediction model based on the image to be trained and the labeled optical flow image to obtain a pre-trained optical flow prediction model, wherein the optical flow prediction model includes a backbone network, a neck network, and a detection head, and the pixels of the labeled optical flow image represent the correspondence between the pixels of the image to be trained and the pixels of the corresponding corrected image; wherein the step of training the optical flow prediction model based on the image to be trained and the labeled optical flow image to obtain the pre-trained optical flow prediction model includes: inputting the image to be trained into the backbone network to obtain hierarchical image features of the image to be trained, the hierarchical image features including at least one shallow image feature and at least one high-level image feature; processing the high-level image features to obtain receptive field image features with different receptive fields, and fusing the receptive field image features to obtain fused image features corresponding to the high-level image features; upsampling the fused image features corresponding to the high-level image features to obtain sampled image features with the same size as the shallow image features; fusing the shallow image features and the sampled image features to obtain fused image features of the image to be trained; inputting the fused image features into the detection head to obtain a predicted optical flow image corresponding to the image to be trained; and modifying the model parameters of the optical flow prediction model based on the predicted optical flow image and the labeled optical flow image to obtain the pre-trained optical flow prediction model.
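The parameter update at the end of claim 10 can be illustrated with a deliberately tiny stand-in model. This sketch is not the claimed network: the backbone and neck are omitted, the "detection head" is reduced to a per-pixel linear map from fused features to a two-channel flow, and the mean squared error loss and plain gradient descent are assumptions, since the claims do not name a loss function or optimizer. The names `predict_flow` and `train_step` are hypothetical.

```python
import numpy as np

def predict_flow(weights: np.ndarray, features: np.ndarray) -> np.ndarray:
    """Stand-in detection head: (2, C) weights applied to (C, H, W) features."""
    return np.einsum("oc,chw->ohw", weights, features)

def train_step(weights, features, labeled_flow, lr=0.1):
    """One gradient-descent update on the mean squared flow error (assumed loss)."""
    error = predict_flow(weights, features) - labeled_flow      # (2, H, W)
    loss = float(np.mean(error ** 2))
    # Gradient of mean(error^2) with respect to the (2, C) weight matrix.
    grad = 2.0 * np.einsum("ohw,chw->oc", error, features) / error.size
    return weights - lr * grad, loss

# Toy data: the labeled flow is produced by hidden "true" weights.
rng = np.random.default_rng(0)
features = rng.normal(size=(4, 8, 8))
true_weights = rng.normal(size=(2, 4))
labeled_flow = predict_flow(true_weights, features)

weights = np.zeros((2, 4))
losses = []
for _ in range(200):
    weights, loss = train_step(weights, features, labeled_flow)
    losses.append(loss)
```

In this toy setting the loss falls as the weights approach the hidden ones; in the full model the same comparison between predicted and labeled optical flow images would drive updates to the backbone, neck, and detection head together.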

11. The method according to claim 10, wherein the step of training the optical flow prediction model based on the image to be trained and the labeled optical flow image to obtain the pre-trained optical flow prediction model includes: inputting the image to be trained into the backbone network to obtain the hierarchical image features of the image to be trained; inputting the hierarchical image features into the neck network, and performing feature fusion on the hierarchical image features to obtain the fused image features of the image to be trained; inputting the fused image features into the detection head to obtain the predicted optical flow image corresponding to the image to be trained; and modifying the model parameters of the optical flow prediction model based on the predicted optical flow image and the labeled optical flow image to obtain the pre-trained optical flow prediction model.

12. The method according to claim 11, wherein the step of inputting the hierarchical image features into the neck network and performing feature fusion on the hierarchical image features to obtain the fused image features of the image to be trained includes: inputting the shallow image features and the high-level image features into the neck network, and upsampling the high-level image features to obtain sampled image features with the same size as the shallow image features; and fusing the shallow image features and the sampled image features to obtain the fused image features of the image to be trained.

13. The method according to claim 11, wherein the step of performing feature fusion on the hierarchical image features to obtain the fused image features of the image to be trained includes: obtaining multiple receptive field image features based on the hierarchical image features, wherein different receptive fields correspond to different receptive field image features; fusing the multiple receptive field image features to obtain receptive field fusion features; and fusing the receptive field fusion features with the hierarchical image features to obtain the fused image features of the image to be trained.

14. The method according to claim 13, wherein the step of fusing the multiple receptive field image features to obtain the receptive field fusion features includes: stitching the multiple receptive field image features together to obtain stitched image features; and performing convolution processing on the stitched image features to obtain receptive field fusion features with the same size as the hierarchical image features.

15. The method according to claim 13, wherein the step of obtaining multiple receptive field image features based on the hierarchical image features includes: convolving the hierarchical image features at each level with convolution kernels of different sizes to obtain the multiple receptive field image features.

16. An image processing apparatus, comprising: a scaling module configured to acquire an image to be processed and scale the image to be processed to obtain a scaled image corresponding to the image to be processed; a prediction module configured to input the scaled image into a pre-trained optical flow prediction model to obtain a predicted optical flow image corresponding to the scaled image, wherein the pixels of the predicted optical flow image represent the correspondence between the pixels of the scaled image and the pixels of a predicted corrected image corresponding to the scaled image, and the optical flow prediction model includes a backbone network, a neck network, and a detection head; an upsampling module configured to upsample the predicted optical flow image to obtain a mapped optical flow image; and an image mapping module configured to perform image mapping processing on the image to be processed based on the mapped optical flow image to obtain a corrected image corresponding to the image to be processed; wherein the prediction module is specifically configured to: input the scaled image into the backbone network to obtain hierarchical image features of the scaled image, the hierarchical image features including at least one shallow image feature and at least one high-level image feature; process the high-level image features to obtain receptive field image features with different receptive fields, and fuse the receptive field image features to obtain fused image features corresponding to the high-level image features; upsample the fused image features corresponding to the high-level image features to obtain sampled image features with the same size as the shallow image features; fuse the shallow image features and the sampled image features to obtain fused image features of the scaled image; and input the fused image features into the detection head to obtain the predicted optical flow image corresponding to the scaled image.

17. An apparatus for model training, comprising: an image module configured to acquire an image to be trained and a labeled optical flow image corresponding to the image to be trained; and a training module configured to train an optical flow prediction model based on the image to be trained and the labeled optical flow image to obtain a pre-trained optical flow prediction model, wherein the optical flow prediction model includes a backbone network, a neck network, and a detection head, and the pixels of the labeled optical flow image represent the correspondence between the pixels of the image to be trained and the pixels of the corresponding corrected image; wherein the training module is specifically configured to: input the image to be trained into the backbone network to obtain hierarchical image features of the image to be trained, the hierarchical image features including at least one shallow image feature and at least one high-level image feature; process the high-level image features to obtain receptive field image features with different receptive fields, and fuse the receptive field image features to obtain fused image features corresponding to the high-level image features; upsample the fused image features corresponding to the high-level image features to obtain sampled image features with the same size as the shallow image features; fuse the shallow image features and the sampled image features to obtain fused image features of the image to be trained; input the fused image features into the detection head to obtain a predicted optical flow image corresponding to the image to be trained; and modify the model parameters of the optical flow prediction model according to the predicted optical flow image and the labeled optical flow image to obtain the pre-trained optical flow prediction model.

18. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions, when executed by the at least one processor, enable the at least one processor to perform the method of any one of claims 1-9 and/or the method of any one of claims 10-15.

19. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-9 and/or the method according to any one of claims 10-15.

20. A computer program product comprising a computer program that, when executed by a processor, implements the method according to any one of claims 1-9 and/or the method according to any one of claims 10-15.