Text recognition model training method, text recognition method, and related apparatuses

By masking and adjusting the encoder of the text recognition model, the problem of poor recognition results caused by image blur in natural scenes is solved, and accurate text recognition under blurry conditions is achieved.

CN115620304B | Active | Publication Date: 2026-06-16 | ZHEJIANG DAHUA TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: ZHEJIANG DAHUA TECH CO LTD
Filing Date: 2022-10-11
Publication Date: 2026-06-16

IPC: G06V30/19; G06V10/82; G06N3/045; G06N3/084; G06N3/0442

CPC: G06V30/19147; G06V30/1918; G06V10/82; G06N3/084

AI Tagging

Application Domain

Biological models

AI Technical Summary

⚠Technical Problem

Image blur caused by human-shot images in natural scenes affects text recognition performance.

⚗Method used

By masking the text recognition model, features of the masked and unmasked regions are extracted, encoded using an encoder, and the features are predicted. The model parameters are then adjusted to improve recognition accuracy.

🎯Benefits of technology

It improves the recognition performance of text recognition models in the case of blurred images, enabling more accurate extraction and prediction of text content.

✦ Generated by Eureka AI based on patent content.

Figure CN115620304B_ABST

Abstract

The application discloses a text recognition model training method, a text recognition method and related devices. The method comprises: performing mask processing on a first sample text image to obtain first mask features of a first mask region image and a first non-mask region image in the first sample text image; using an encoder of a text recognition model to encode the first non-mask region image of the first sample text image to obtain first encoding features; predicting the first mask features and the first encoding features to obtain a first text recognition result of the first sample text image; and adjusting parameters of the encoder of the text recognition model based on at least the first text recognition result. In this way, the text recognition effect of the text recognition model can be improved.

Description

Technical Field

[0001] This application relates to the field of image processing technology, and in particular to a training method for a text recognition model, a text recognition method, and related apparatus.

Background Technology

[0002] Natural scenes contain a wealth of textual information, such as in card and ID card recognition, intelligent subtitle review for short videos, and industrial serial number recognition. Extracting and further processing this text will provide a highly valuable basis and rich information for understanding image semantics.

[0003] Text extraction relies on the acquisition of natural scene images. Currently, most natural scene images are captured by people holding mobile phones, tablets, or other electronic devices. Handheld capture is prone to shaking during shooting, which produces blurry images and consequently poor recognition performance.

Summary of the Invention

[0004] The main technical problem addressed in this application is to provide a training method for a text recognition model, a text recognition method, and related devices, which can improve the text recognition performance of the text recognition model.

[0005] To address the aforementioned technical problems, the first aspect of this application provides a training method for a text recognition model. This method includes: performing masking processing on a first sample text image to obtain a first mask feature of a first masked region image and a first non-masked region image in the first sample text image; encoding the first non-masked region image of the first sample text image using an encoder of the text recognition model to obtain a first encoded feature; predicting the first mask feature and the first encoded feature to obtain a first text recognition result for the first sample text image; and adjusting the parameters of the encoder of the text recognition model based at least on the first text recognition result.

[0006] To address the aforementioned technical problems, a second aspect of this application provides a text recognition method, comprising: acquiring a target image; encoding the target image using an encoder of a text recognition model to obtain target encoding features of the target image; and predicting the target encoding features of the target image using a prediction module of the text recognition model to obtain target text in the target image; wherein the text recognition model is a text recognition model trained using the method described in the first aspect above.

[0007] To address the aforementioned technical problems, a third aspect of this application provides an electronic device comprising a memory and a processor coupled to each other, the memory storing program instructions; the processor executing the program instructions stored in the memory to implement the training method of the text recognition model described in the first aspect, or to implement the text recognition method described in the second aspect.

[0008] To address the aforementioned technical problems, a fourth aspect of this application provides a computer-readable storage medium for storing program instructions that can be executed to implement the training method of the text recognition model described in the first aspect, or to implement the text recognition method described in the second aspect.

[0009] The beneficial effects of this application are as follows: Unlike existing technologies, this application performs masking processing on the first sample text image during the training process of the text recognition model, obtaining a first mask feature and a first non-masked region image of the first sample text image; the encoder of the text recognition model encodes the first non-masked region image of the first sample text image to obtain a first encoded feature; the first mask feature and the first encoded feature are predicted to obtain a first text recognition result for the first sample text image; and the parameters of the encoder of the text recognition model are adjusted, at least based on the first text recognition result. By adjusting the encoder parameters using the first text recognition result obtained from predicting the first mask feature and the first encoded feature, the encoder of the text recognition model can more accurately extract features from the text image even when the image is blurred, and then predict accurate text content based on the extracted features, thereby improving the recognition effect of the text recognition model.

Attached Figure Description

[0010] Figure 1 is a flowchart illustrating the first embodiment of the training method for the text recognition model provided in this application;

[0011] Figure 2 is a schematic diagram illustrating how the position masker provided in this application determines the fusion features of the first sample text image;

[0012] Figure 3 is a schematic diagram of the encoder provided in this application obtaining the first encoded feature;

[0013] Figure 4 is a flowchart illustrating the second embodiment of the training method for the text recognition model provided in this application;

[0014] Figure 5 is a schematic diagram of the overall framework of the second embodiment of the training method for the text recognition model provided in this application;

[0015] Figure 6 is a flowchart illustrating the third embodiment of the training method for the text recognition model provided in this application;

[0016] Figure 7 is a schematic diagram of the overall framework of the third embodiment of the training method for the text recognition model provided in this application;

[0017] Figure 8 is a flowchart illustrating one embodiment of the text recognition method provided in this application;

[0018] Figure 9 is a schematic diagram of the framework structure of one embodiment of the electronic device provided in this application;

[0019] Figure 10 is a schematic diagram of the framework of one embodiment of the computer-readable storage medium provided in this application.

Detailed Implementation

[0020] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those of ordinary skill in the art without creative effort are within the scope of protection of this application.

[0021] It should be noted that the embodiments of this application contain descriptions involving "first," "second," etc., which are for descriptive purposes only and should not be construed as indicating or implying their relative importance or implicitly specifying the number of technical features indicated. Therefore, a feature defined with "first" or "second" may explicitly or implicitly include at least one of that feature.

[0022] In this document, the term "embodiment" means that a particular feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of the invention. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.

[0023] Please refer to Figures 1 to 3. Figure 1 is a flowchart illustrating the first embodiment of the training method for the text recognition model provided in this application. Figure 2 is a schematic diagram illustrating how the position masker provided in this application determines the fusion features of the first sample text image. Figure 3 is a schematic diagram of the encoder provided in this application obtaining the first encoded feature. The training method of the text recognition model includes:

[0024] S11: Perform masking processing on the first sample text image to obtain the first mask feature of the first masked region image and the first non-masked region image in the first sample text image.

[0025] In one embodiment, step S11 can be performed by a position masker included in the text recognition model, where the first sample text image is labeled with the real text recognition result. The first masked region image in the first sample text image can be determined according to a mask ratio. In one specific embodiment, the mask ratio is preset, the first sample text image is divided into several image blocks along a preset direction according to the mask ratio, and at least one image block is randomly masked to obtain the first masked region image. For example, if the preset mask ratio is three-fifths, the first sample text image can be divided into five image blocks, and three of these image blocks can be randomly masked. In another specific embodiment, the first sample text image can first be randomly divided into several image blocks, and at least one image block is then masked based on the mask ratio to obtain the first masked region image. The area outside the first masked region image in the first sample text image is the first non-masked region image.
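The block-wise masking described in [0025] can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `split_and_mask`, the choice of the horizontal axis as the preset direction, equal-width blocks, and the rule of always leaving at least one block visible are all assumptions made for the example.

```python
import random

def split_and_mask(width, num_blocks, mask_ratio, seed=None):
    """Split a text-line image of `width` pixels into `num_blocks` vertical
    strips and randomly pick round(num_blocks * mask_ratio) strips to mask.

    Returns (masked, unmasked): lists of (start, end) column ranges,
    with `end` exclusive.
    """
    if not 0.0 < mask_ratio < 1.0:
        raise ValueError("mask_ratio must be strictly between 0 and 1")
    rng = random.Random(seed)
    # Block boundaries along the preset direction (assumed: horizontal axis).
    edges = [round(i * width / num_blocks) for i in range(num_blocks + 1)]
    blocks = [(edges[i], edges[i + 1]) for i in range(num_blocks)]
    # Mask at least one block and (assumption) keep at least one block visible.
    num_masked = min(max(1, round(num_blocks * mask_ratio)), num_blocks - 1)
    masked_ids = set(rng.sample(range(num_blocks), num_masked))
    masked = [b for i, b in enumerate(blocks) if i in masked_ids]
    unmasked = [b for i, b in enumerate(blocks) if i not in masked_ids]
    return masked, unmasked

# The three-fifths example from the text: five blocks, three of them masked.
masked, unmasked = split_and_mask(width=100, num_blocks=5, mask_ratio=3 / 5, seed=0)
print(len(masked), len(unmasked))  # 3 2
```

Which three blocks are masked depends on the random seed; only the counts are fixed by the mask ratio.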

[0026] While determining the first masked region image and the first non-masked region image in the first sample text image based on the preset mask ratio, the word embedding vector of the first masked region image can also be determined. In one specific embodiment, after the mask ratio is determined, mask ratio information can be obtained. The mask ratio information can contain information in multiple dimensions; for example, it can include the mask ratio and the text information corresponding to the first masked region image. Based on the mask ratio information, the embedding layer in the position masker returns the corresponding word embedding vector. In another specific embodiment, the first masked region image is determined first, the text information contained in that region is obtained, and the corresponding word embedding vector is obtained from the text information. After the word embedding vector of the first masked region image is obtained, it is fused with the regional features of the first masked region image to obtain the first mask feature of the first masked region image. The regional features can be obtained by having the position masker extract features from the first masked region image; alternatively, another device can extract image features from the first sample text image in advance, and the features corresponding to the first masked region image are then taken from those image features. For example, as shown in Figure 2, in one specific embodiment the word embedding vector of the first masked region image and the regional features of the first masked region image have different dimensions. Therefore, the word embedding vector is first mapped to a preset dimension through a fully connected layer, and the mapped word embedding vector is then fused with the regional features to obtain the first mask feature. Here, the preset dimension is the dimension of the regional features of the first masked region image. The regional features can be extracted by a deep convolutional neural network: in one embodiment, a deep convolutional neural network extracts the image features of the first sample text image, and the regional features of the first masked region image are obtained from those image features. In this embodiment, the first sample text image can be a target sample image; correspondingly, the first masked region image and the first non-masked region image can be a target masked region image and a target non-masked region image.
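The projection-then-fusion step in [0026] might look like the sketch below. The text does not specify the fusion operator, so element-wise addition is used here purely as an assumption. The helper names `linear` and `fuse_mask_feature` and the toy weights are hypothetical.

```python
def linear(x, weight, bias):
    """Fully connected layer: y = W x + b, with `weight` of shape (out, in)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def fuse_mask_feature(word_embedding, region_feature, weight, bias):
    """Map the word embedding to the region-feature dimension, then fuse.

    Element-wise addition is an assumed fusion operator; the source only
    says that the two features are fused.
    """
    projected = linear(word_embedding, weight, bias)
    if len(projected) != len(region_feature):
        raise ValueError("projection must match the region-feature dimension")
    return [p + r for p, r in zip(projected, region_feature)]

# Toy example: a 2-d word embedding fused with a 3-d region feature.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical learned weights
bias = [0.0, 0.0, 0.5]
mask_feature = fuse_mask_feature([2.0, 3.0], [1.0, 1.0, 1.0], weight, bias)
print(mask_feature)  # [3.0, 4.0, 6.5]
```

Concatenation followed by another projection would be an equally plausible reading of "fused"; the preset dimension only constrains the output of the fully connected layer.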

[0027] S12: The first non-masked region image of the first sample text image is encoded using the encoder of the text recognition model to obtain the first encoded feature.

[0028] In one embodiment, encoding the first non-masked region image of the first sample text image using the encoder of the text recognition model to obtain the first encoded feature includes: extracting features from the first non-masked region image using the encoder to obtain a target non-masked feature, referred to here as the first non-masked feature for ease of distinction. As shown in Figure 3, self-attention is computed on the first non-masked feature to obtain a self-attention feature, and the self-attention feature is then fused with the first non-masked feature to obtain the first encoded feature. Specifically, the first non-masked feature is passed through a self-attention layer to obtain the self-attention feature. The self-attention layer may include three fully connected layers, which map the first non-masked feature to a query vector, a key vector, and a value vector, respectively. The dot product of the query vector and the key vector gives a score. The score is normalized (e.g., using the SoftMax activation function) and then multiplied by the value vector to obtain the self-attention coefficient of the first non-masked feature, and the self-attention feature is obtained based on the self-attention coefficient. The self-attention feature and the first non-masked feature are then summed and normalized to obtain the first encoded feature. The self-attention feature obtained here can be multi-dimensional or single-dimensional.
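The self-attention step in [0028] can be illustrated with a single-head sketch in plain Python. It follows the common Transformer formulation, which is an assumption here: bias-free projections, scores scaled by the square root of the feature dimension, and layer normalization after the residual sum. None of these details is stated in the text, and all function names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply matrix a (n x k) by matrix b (k x m), both lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable SoftMax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(row, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]

def self_attention_block(x, w_q, w_k, w_v):
    """x: (tokens x dim) non-masked features. w_v must keep the dimension
    of x so that the residual sum is well defined."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    scale = math.sqrt(len(q[0]))  # assumed scaling factor
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    attended = matmul(weights, v)               # self-attention feature
    # Sum with the input feature, then normalize.
    return [layer_norm([a + b for a, b in zip(ar, xr)]) for ar, xr in zip(attended, x)]

identity = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical projection weights
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoded = self_attention_block(x, identity, identity, identity)
print(len(encoded), len(encoded[0]))  # 3 2
```

The output keeps the shape of the input, one encoded vector per non-masked token, which is what allows the residual sum described in the text.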

[0029] If the first non-masked feature is a multi-dimensional feature, the self-attention coefficients of the first non-masked feature obtained through the self-attention layer can also be multi-dimensional features. Multiplying the self-attention coefficient of each dimension by the corresponding first non-masked feature yields the self-attention feature for that dimension. Summing the self-attention feature of that dimension with the corresponding first non-masked feature yields the third encoding feature for that dimension. Normalizing the third encoding features of multiple dimensions and passing them through a feedforward neural network yields the fourth encoding features for each dimension. Summing and normalizing the fourth encoding features of each dimension with the third encoding features of each dimension yields the first encoding feature for each dimension. The feedforward neural network can consist of multiple fully connected layers. In this embodiment, the first non-masked region image can be the target non-masked region image, and the first encoding feature can be the target encoding feature.
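The feedforward stage in [0029] can be sketched as below, using the paragraph's own naming (third, fourth, and first encoding features). The ReLU activation, the two-layer shape of the feedforward network, and the helper names are assumptions. The toy output weights are zero so that the result reduces to the normalized input and is easy to check by hand.

```python
import math

def layer_norm(row, eps=1e-5):
    """Normalize one feature vector to zero mean and unit variance."""
    mean = sum(row) / len(row)
    var = sum((v - mean) ** 2 for v in row) / len(row)
    return [(v - mean) / math.sqrt(var + eps) for v in row]

def linear(x, weight, bias):
    """Fully connected layer: y = W x + b, with `weight` of shape (out, in)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def feed_forward_sublayer(third, w1, b1, w2, b2):
    """third: per-token features after the attention residual sum
    (the "third encoding features"). Returns the "first encoding features"."""
    out = []
    for row in third:
        normed = layer_norm(row)
        hidden = [max(0.0, h) for h in linear(normed, w1, b1)]  # ReLU (assumed)
        fourth = linear(hidden, w2, b2)  # "fourth encoding feature"
        # Sum the fourth and third features, then normalize.
        out.append(layer_norm([f + t for f, t in zip(fourth, row)]))
    return out

w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hidden size 3, feature size 2
b1 = [0.0, 0.0, 0.0]
w2 = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]    # zero weights: fourth feature is 0
b2 = [0.0, 0.0]
first = feed_forward_sublayer([[1.0, 3.0]], w1, b1, w2, b2)
print([round(v, 3) for v in first[0]])  # [-1.0, 1.0]
```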

[0030] S13: Predict the first mask feature and the first coding feature to obtain the first text recognition result of the first sample text image.

[0031] In one embodiment, a Long Short-Term Memory (LSTM) neural network can be used to predict the first mask feature and the first encoded feature to obtain the first text recognition result of the first sample text image. Understandably, in other embodiments, other neural networks can also be used to predict the first mask feature and the first encoded feature, and this is not limited here. This embodiment adds a first mask feature during the text prediction process, which makes the text recognition model with adjusted parameters more robust to interference; that is, when using the trained text recognition model to recognize contaminated text images, a more accurate recognition result can be obtained.

[0032] S14: Adjust the encoder parameters of the text recognition model based at least on the first text recognition result.

[0033] In one implementation, the encoder parameters can be adjusted based on the first text recognition result. Specifically, a first recognition loss can be obtained based on the difference between the first text recognition result and the real text recognition result, and the encoder parameters of the text recognition model are then adjusted based on the first recognition loss. The first recognition loss can be the CTC (Connectionist Temporal Classification) loss. The CTC loss function solves the problem of aligning input and output, so character-by-character annotation is not needed; only line-level annotation of the samples is required. During training, a special character is inserted between repeated characters when encoding the labeled text. During backpropagation, the Adam algorithm (an adaptive momentum stochastic optimization method) iteratively adjusts the weights and biases in the network to reduce the CTC loss, so that the text sequence predicted by the model moves closer to the real text sequence. During decoding, the optimal path is obtained by selecting the most probable character at each time step; consecutive repeated characters are then merged and all special characters are removed from the path. What remains is the first text recognition result.
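The CTC label encoding and greedy decoding described in [0033] can be sketched as follows. The special character is assumed to be the usual CTC blank at class index 0. The encoder uses the standard extended label sequence (a blank around every label), which also covers the repeated-character case the text mentions. Function names are hypothetical.

```python
BLANK = 0  # assumed index of the CTC special (blank) character

def ctc_insert_blanks(labels):
    """Interleave blanks around every label: [a, a] -> [_, a, _, a, _]."""
    out = [BLANK]
    for label in labels:
        out.extend([label, BLANK])
    return out

def ctc_greedy_decode(probs):
    """probs: one list of class probabilities per time step.

    Pick the most probable class at each step, merge consecutive repeats,
    then drop blanks.
    """
    best_path = [max(range(len(step)), key=step.__getitem__) for step in probs]
    decoded, previous = [], None
    for index in best_path:
        if index != previous and index != BLANK:
            decoded.append(index)
        previous = index
    return decoded

probs = [
    [0.1, 0.8, 0.1],    # class 1
    [0.1, 0.7, 0.2],    # class 1 again (consecutive repeat, merged)
    [0.9, 0.05, 0.05],  # blank
    [0.2, 0.6, 0.2],    # class 1 (kept, because a blank separates it)
    [0.1, 0.2, 0.7],    # class 2
]
print(ctc_greedy_decode(probs))      # [1, 1, 2]
print(ctc_insert_blanks([1, 1, 2]))  # [0, 1, 0, 1, 0, 2, 0]
```

The blank between the two occurrences of class 1 is what lets the decoder output a genuinely doubled character instead of merging it away.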

[0034] In another embodiment, the encoder parameters can be adjusted based on the first text recognition result and the second text recognition result. Specifically, a first recognition loss is obtained based on the difference between the first text recognition result and the true text recognition result; a second recognition loss is obtained based on the difference between the second text recognition result and the true text recognition result; and the encoder parameters are adjusted based on the first recognition loss and the second recognition loss. The second text recognition result is obtained by predicting the second encoded feature using the first prediction module of the text recognition model, and the second encoded feature is obtained by encoding the first sample text image by the encoder.

[0035] The above method masks the first sample text image during the training process of the text recognition model to obtain the first mask feature and the first non-masked region image. The encoder of the text recognition model then encodes the first non-masked region image to obtain the first encoded feature. Prediction is performed on the first mask feature and the first encoded feature to obtain the first text recognition result of the first sample text image, and the parameters of the encoder of the text recognition model are adjusted based on at least the first text recognition result. Because the encoder parameters are adjusted using a recognition result predicted from the first mask feature and the first encoded feature, the encoder can extract features from a text image more accurately even when the image is blurred. The text content can then be predicted accurately from the extracted features, which improves the recognition performance of the text recognition model.

[0036] Please refer to Figure 4 and Figure 5. Figure 4 is a flowchart illustrating the second embodiment of the training method for the text recognition model provided in this application. Figure 5 is a schematic diagram of the overall framework of the second embodiment of the text recognition model training method provided in this application. The method includes:

[0037] S41: Perform masking processing on the first sample text image to obtain the first mask feature of the first masked region image and the first non-masked region image in the first sample text image.

[0038] S42: The first non-masked region image of the first sample text image is encoded using the encoder of the text recognition model to obtain the first encoded feature.

[0039] For the specific implementation of steps S41-S42, please refer to steps S11-S12 of the first implementation of the text recognition model training method, which will not be repeated here.

[0040] S43: Predict the first mask feature and the first coding feature to obtain the first text recognition result of the first sample text image.

[0041] In one embodiment, step S43 can be performed by the second prediction module. Specifically, the second prediction module uses LSTM to predict the first mask feature and the first encoded feature to obtain the first text recognition result of the first sample text image.

[0042] S44: Use the encoder to encode the first sample text image to obtain the second encoded feature.

[0043] In one embodiment, the method of encoding the first sample text image to obtain the second encoded feature can be the same as the method of encoding the first non-masked region image of the first sample text image to obtain the first encoded feature, and will not be described again here. It is understood that in other embodiments, other methods can also be used to obtain the second encoded feature, and no specific limitation is made here.

[0044] S45: The first prediction module of the text recognition model is used to predict the second encoded features to obtain the second text recognition result of the first sample text image.

[0045] In one embodiment, the first prediction module uses LSTM to predict the second encoded features to obtain the second text recognition result of the first sample text image.

[0046] S46: Adjust the parameters of the first prediction module based on the second text recognition result.

[0047] S47: Adjust the encoder parameters based on the first text recognition result and the second text recognition result.

[0048] In one embodiment, a first recognition loss is obtained based on the difference between the first text recognition result and the real text recognition result, and a second recognition loss is obtained based on the difference between the second text recognition result and the real text recognition result; the encoder parameters are adjusted based on the first and second recognition losses. Further, after obtaining the first recognition loss, the parameters of the position mask and the second prediction module can be adjusted according to the first recognition loss; after obtaining the second recognition loss, the parameters of the first prediction module can be adjusted based on the second recognition loss. Both the first and second recognition losses can be CTC losses.

[0049] In this embodiment, the text recognition model can be trained using two branches. For example, as shown in Figure 5, in the first branch the encoder directly encodes the first sample text image to obtain the second encoded feature. The first prediction module of the text recognition model then predicts the second encoded feature to obtain the second text recognition result of the first sample text image. A second recognition loss is calculated based on the difference between the second text recognition result and the true text recognition result, and the parameters of the encoder and the first prediction module are adjusted based on the second recognition loss. In the second branch, the first sample text image is first randomly masked using the position masker to obtain the first mask feature and the first non-masked region image. The encoder encodes the first non-masked region image of the first sample text image to obtain the first encoded feature. The second prediction module predicts the first mask feature and the first encoded feature to obtain the first text recognition result. The first recognition loss is calculated based on the difference between the first text recognition result and the true text recognition result, and the parameters of the position masker, the second prediction module, and the encoder are adjusted based on the first recognition loss. The first and second branches can share the same encoder. With this dual-branch training method, the text recognition model can use both the visual texture features of the natural scene image in which the text is located and the linguistic information in the visual context. This implicitly guides the model to recognize text content accurately in complex scenarios such as occlusion and noise.
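Because the first recognition loss belongs to the second (masked) branch and the second recognition loss to the first (full-image) branch, the routing of updates in [0049] is easy to misread. The sketch below only records which loss drives which module. Adding the two losses for the shared encoder is an assumption, since the text says only that the encoder is adjusted based on both losses. The function and key names are hypothetical.

```python
def route_updates(first_loss, second_loss):
    """Return, per module, the loss value that drives its parameter update.

    Summing the two losses for the shared encoder is an assumption.
    """
    return {
        "encoder": first_loss + second_loss,     # shared by both branches
        "position_masker": first_loss,           # second (masked) branch only
        "second_prediction_module": first_loss,  # second (masked) branch only
        "first_prediction_module": second_loss,  # first (full-image) branch only
    }

signals = route_updates(first_loss=0.5, second_loss=0.25)
print(signals["encoder"])  # 0.75
```

In an automatic-differentiation framework the same routing typically falls out of backpropagating the summed loss, since each module only receives gradients from the branch it takes part in.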

[0050] Please refer to Figure 6 and Figure 7. Figure 6 is a flowchart illustrating the third embodiment of the training method for the text recognition model provided in this application. Figure 7 is a schematic diagram of the overall framework of the third embodiment of the text recognition model training method provided in this application. The method includes:

[0051] S61: Pre-train the position masker and encoder using the second sample text image.

[0052] The second sample text image can be either unlabeled or labeled. In actual training, the number of labeled sample text images is limited. In this case, the position masker and encoder can be pre-trained using unlabeled sample text images to give them a certain feature extraction capability. A second round of training can then be performed using a small number of labeled sample text images to give the text recognition model better text recognition capability.

[0053] In one embodiment, the second sample text image is an unlabeled image. The position masker masks the second sample text image to obtain a second mask feature of a second masked region image and a second non-masked region image in the second sample text image. The encoder encodes the second non-masked region image of the second sample text image to obtain a second encoded feature. A decoder reconstructs the pixel information of the second masked region image based on the second mask feature and the second encoded feature to obtain the reconstructed pixel information of the second masked region image. The parameters of the position masker, encoder, and decoder are adjusted based on the original pixel information and the reconstructed pixel information of the second masked region image.

[0054] Specifically, a mask ratio can be preset. Based on this ratio, the second sample text image is divided into several image blocks along a preset direction. At least one image block is randomly masked to obtain a second masked region image. The area outside the second masked region image in the second sample text image is the second unmasked region image. While determining the second masked region image and the second unmasked region image in the second sample text image based on the preset mask ratio, the word embedding vector corresponding to the second masked region image can also be determined. In one specific embodiment, after determining the mask ratio, mask ratio information can be obtained. This mask ratio information can contain information in multiple dimensions; for example, it may include the mask ratio and the text information of the second masked region image. Based on the mask ratio information, the embedding layer in the position masker can return the corresponding word embedding vector. In another specific embodiment, the second masked region image can be determined first, and the text information contained within that region can be obtained. The corresponding word embedding vector can then be obtained based on the text information.

[0055] After obtaining the word embedding vectors of the second masked region image, the word embedding vectors of the second masked region image and the regional features of the second masked region image are fused to obtain the second masking feature of the second masked region image. In one specific embodiment, if the word embedding vectors of the second masked region image and the image features of the second sample text image are in different dimensions, the word embedding vectors of the second masked region image are first mapped to a preset dimension through a fully connected layer. The word embedding vectors of the preset dimension are then fused with the regional features of the second masked region image to obtain the second masking feature of the second masked region image. Here, the preset dimension is the dimension of the regional features of the second masked region image. The regional features of the second masked region image can be extracted by a deep convolutional neural network.

[0056] Furthermore, the encoder extracts features from the second non-masked region image to obtain target non-masked features, referred to here as the second non-masked features. Self-attention is then computed on the second non-masked features to obtain self-attention features, and the self-attention features are fused with the second non-masked features to obtain the second encoded features. Specifically, the second non-masked features are mapped through three fully connected layers to obtain a query vector, a key vector, and a value vector. The dot product of the query vector and the key vector gives a score. The score is normalized with a SoftMax activation function and then multiplied by the value vector to obtain the self-attention coefficients of the second non-masked features. These self-attention coefficients are multiplied by the second non-masked features to obtain the self-attention features. Finally, the self-attention features and the second non-masked features are summed and normalized to obtain the second encoded features. The second non-masked features, self-attention features, and second encoded features can all be multi-dimensional features. In one specific implementation, the self-attention coefficient of each dimension is multiplied by the corresponding second non-masked feature to obtain the self-attention feature for that dimension; the self-attention feature of that dimension is then summed with the corresponding second non-masked feature to obtain the fifth encoding feature for that dimension. After the fifth encoding features of the multiple dimensions are normalized, they are passed through a feedforward neural network to obtain the sixth encoding features of each dimension. The sixth encoding features of each dimension are then summed with the fifth encoding features of each dimension and normalized to obtain the second encoded feature for each dimension.

[0057] The second mask feature and the second encoded feature are merged, and the decoder reconstructs the pixel information of the second mask region image based on the merged features to obtain the reconstructed pixel information of the second mask region image. The mean square error loss is calculated using the original pixel information and the reconstructed pixel information of the second mask region image, and the parameters of the position masker, encoder, and decoder are adjusted according to the mean square error loss.
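The mean square error loss used for reconstruction can be illustrated with a short sketch. The flattened pixel lists and their values are hypothetical; only the loss definition itself (the average of squared differences between original and reconstructed pixels of the masked region) follows from the text.

```python
def mse_loss(original, reconstructed):
    """Mean square error between original and reconstructed pixel values."""
    if not original or len(original) != len(reconstructed):
        raise ValueError("pixel lists must be non-empty and of equal length")
    return sum((o - r) ** 2 for o, r in zip(original, reconstructed)) / len(original)

# Hypothetical flattened pixel values of one masked region, scaled to [0, 1].
original_pixels = [0.0, 0.5, 1.0, 0.25]
reconstructed_pixels = [0.0, 0.25, 1.0, 0.75]

loss = mse_loss(original_pixels, reconstructed_pixels)
print(loss)  # 0.078125
```

In training, this scalar would be backpropagated to update the position masker, the encoder, and the decoder, as the paragraph describes.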

[0058] In this embodiment, the second sample text image can be the target sample image, and correspondingly, the second masked region image and the second unmasked region image can be the target masked region image and the target unmasked region image; the second encoding feature can be the target encoding feature.

[0059] S62: Perform masking processing on the first sample text image to obtain the first masking feature and the first non-masked region image of the first masked region in the first sample text image.

[0060] In one embodiment, the position masker trained in step S61 is used to mask the first sample text image, yielding the first mask feature and the first unmasked region image in the first sample text image. The first sample text image is labeled with the actual text recognition result. If the second sample text image is also labeled with the actual text recognition result, the first sample text image can be the same as the second sample text image; for example, both can be target sample images.
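The claims describe the masking as dividing the sample image into several image blocks along a preset direction according to a preset masking ratio, then randomly selecting at least one block as the masked region. A minimal sketch of that selection follows. The horizontal direction, the equal block widths, the rounding rule, and all names are assumptions made for illustration.

```python
import random

def split_and_mask(image_width, num_blocks, mask_ratio, rng):
    """Split [0, image_width) into equal blocks and pick masked blocks at random.

    Returns (masked, unmasked), each a list of (start, end) column ranges.
    """
    if image_width % num_blocks != 0:
        raise ValueError("this sketch assumes the width divides evenly into blocks")
    block_width = image_width // num_blocks
    blocks = [(i * block_width, (i + 1) * block_width) for i in range(num_blocks)]
    # At least one block is always masked.
    num_masked = max(1, round(num_blocks * mask_ratio))
    masked_ids = set(rng.sample(range(num_blocks), num_masked))
    masked = [b for i, b in enumerate(blocks) if i in masked_ids]
    unmasked = [b for i, b in enumerate(blocks) if i not in masked_ids]
    return masked, unmasked

rng = random.Random(0)
masked, unmasked = split_and_mask(image_width=128, num_blocks=8, mask_ratio=0.25, rng=rng)
print(len(masked), len(unmasked))  # 2 6
```

Only the unmasked column ranges would be passed to the encoder; the masked ranges are represented by their mask features instead.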

[0061] S63: The first non-masked region image in the first sample text image is encoded using the encoder of the text recognition model to obtain the first encoded feature.

[0062] S64: Predict the first mask feature and the first coding feature to obtain the first text recognition result of the first sample text image.

[0063] S65: Adjust the encoder parameters of the text recognition model based at least on the first text recognition result.

[0064] For details on the implementation of steps S62-S65, please refer to steps S11-S14 of the first implementation of the text recognition model training method, which will not be repeated here.

[0065] In one specific implementation, as shown in Figure 7, the first and second sample text images are identical, and both are labeled with the actual text recognition result. First, the text recognition model undergoes self-supervised training. The position masker of the text recognition model masks the sample text image to obtain the second masked region image (e.g., the region containing the characters L, d, and y in the sample text image of Figure 7) and its second mask feature. The region other than the second masked region image is the second non-masked region image (e.g., ... in Figure 7). The second non-masked region image is encoded using the encoder to obtain the second encoded feature. The second encoded feature and the second mask feature are then merged to obtain a merged feature, and the decoder reconstructs the second masked region image based on the merged feature to obtain the reconstructed pixel information of the second masked region image. The mean square error loss is calculated from the original pixel information and the reconstructed pixel information of the second masked region image, and the parameters of the position masker, encoder, and decoder are adjusted according to this loss. Next, the text recognition model undergoes supervised training, which can be performed using two branches. The first branch uses the trained encoder to encode the sample text image, and the first prediction module predicts the encoded features to obtain the second text recognition result. The second branch uses the position masker to mask the sample text image, obtaining the first masked region image and the first unmasked region image. The first masked and unmasked region images can be the same as or different from the second masked and unmasked region images obtained during self-supervised training. The encoder encodes the first unmasked region image to obtain the corresponding first encoded features, and the second prediction module predicts the first encoded features output by the encoder and the first mask features output by the position masker to obtain the first text recognition result. The encoder parameters are adjusted based on the first and second text recognition results; the parameters of the position masker and the second prediction module are adjusted based on the first text recognition result; and the parameters of the first prediction module are adjusted based on the second text recognition result. The text recognition model trained in this way can accurately recognize text content in natural scene images even when the text content is occluded.
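Which module is updated from which result in the two-branch supervised stage can be summarized in a small sketch. The routing follows the paragraph above; combining the two recognition losses for the encoder by an unweighted sum is an assumption, as are the function and key names.

```python
def route_losses(first_recognition_loss, second_recognition_loss):
    """Return the loss each module is trained on in the two-branch stage.

    The unweighted sum for the encoder is an assumption; the text only states
    that the encoder is adjusted based on both recognition results.
    """
    return {
        "encoder": first_recognition_loss + second_recognition_loss,
        "position_masker": first_recognition_loss,
        "second_prediction_module": first_recognition_loss,
        "first_prediction_module": second_recognition_loss,
    }

# Hypothetical loss values from the masked branch (first) and the full-image branch (second).
losses = route_losses(first_recognition_loss=0.5, second_recognition_loss=0.25)
print(losses["encoder"])  # 0.75
```

In an automatic-differentiation framework this routing usually falls out of the computation graph: each loss only reaches the modules that produced the corresponding prediction.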

[0066] Please refer to Figure 8, which is a flowchart illustrating an embodiment of the text recognition method provided in this application. The method includes:

[0067] S81: Acquire the target image.

[0068] The target image can be an image captured from any natural scene. In one embodiment, the target image includes a text area.

[0069] S82: Encode the target image using the encoder of the text recognition model to obtain the target encoded features of the target image.

[0070] For this step, refer to the above description of encoding the first non-masked region image of the first sample text image using the encoder; it will not be repeated here.

[0071] S83: The prediction module of the text recognition model is used to predict the target encoding features of the target image to obtain the target text in the target image.

[0072] The prediction module of the text recognition model can be either the first prediction module or the second prediction module described above. In this embodiment, the first prediction module is used as the prediction module of the text recognition model. The text recognition model is a text recognition model trained using any of the above-described text recognition model training methods. For specific training methods, please refer to any of the above-described embodiments, which will not be repeated here.
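Steps S81-S83 amount to a two-stage pipeline: encode the target image, then predict text from the encoded features. The sketch below shows that flow with stand-in callables. The per-position score lists, the greedy arg-max decoding, the character set, and every name are assumptions; the text does not specify how the prediction module turns features into characters.

```python
def recognize_text(image, encoder, prediction_module, charset):
    """Encode the image, predict per-position character scores, decode greedily."""
    features = encoder(image)                  # S82: target encoded features
    score_rows = prediction_module(features)   # S83: one score list per position
    chars = []
    for scores in score_rows:
        best = max(range(len(scores)), key=scores.__getitem__)
        chars.append(charset[best])
    return "".join(chars)

# Toy stand-ins so the sketch runs; a trained encoder and prediction module
# would replace these identity functions.
toy_encoder = lambda image: image
toy_prediction_module = lambda features: features

toy_image = [[0.1, 0.9, 0.0], [0.8, 0.1, 0.1]]  # already shaped like score rows
text = recognize_text(toy_image, toy_encoder, toy_prediction_module, charset="abc")
print(text)  # ba
```

No position masker appears at inference time in this embodiment: the whole target image goes through the encoder and the first prediction module.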

[0073] Please refer to Figure 9, which is a schematic diagram of the framework of one embodiment of the electronic device provided in this application.

[0074] The electronic device 90 includes a memory 91 and a processor 92 coupled to each other. The memory 91 stores program instructions, and the processor 92 executes the program instructions stored in the memory 91 to implement the steps of any of the above-described text recognition model training method implementations, or to implement the steps of the above-described text recognition method implementations. In a specific implementation scenario, the electronic device 90 may include, but is not limited to, a microcomputer or a server. In addition, the electronic device 90 may also include mobile devices such as laptops and tablets, which are not limited here.

[0075] Specifically, processor 92 controls itself and memory 91 to implement the steps of any of the above-described method embodiments. Processor 92 may also be referred to as a CPU (Central Processing Unit). Processor 92 may be an integrated circuit chip with signal processing capabilities. Processor 92 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. A general-purpose processor may be a microprocessor or any conventional processor. Furthermore, processor 92 may be implemented using integrated circuit chips.

[0076] Please refer to Figure 10, which is a schematic diagram of the framework of one embodiment of the computer-readable storage medium provided in this application.

[0077] The computer-readable storage medium 100 stores program instructions 101, which, when executed by a processor, are used to implement the steps of any of the above-described text recognition model training method implementations, or to implement the steps of the above-described text recognition method implementations.

[0078] The computer-readable storage medium 100 may specifically be a USB flash drive, a portable hard drive, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk, or a medium capable of storing computer programs. Alternatively, it may be a server storing the computer program, which can send the stored computer program to other devices for execution or can also execute the stored computer program itself.

[0079] The description of the various embodiments above tends to emphasize the differences between them. For parts that are the same or similar, the embodiments may be referred to one another; for the sake of brevity, they are not repeated here.

[0080] In the several embodiments provided in this application, it should be understood that the disclosed methods and apparatus can be implemented in other ways. For example, the apparatus implementations described above are merely illustrative. The division of modules or units is only a logical functional division, and in actual implementation there may be other division methods: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical, or in other forms.

[0081] The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment, depending on actual needs.

[0082] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0083] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) or processor to execute all or part of the steps of the methods of various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0084] If the technical solution of this application involves personal information, the product using this technical solution has clearly informed the user of the personal information processing rules and obtained the user's voluntary consent before processing the personal information. If the technical solution of this application involves sensitive personal information, the product using this technical solution has obtained the user's separate consent before processing the sensitive personal information, and also meets the requirement of "express consent". For example, at personal information collection devices such as cameras, clear and prominent signs are set up to inform users that they have entered the scope of personal information collection and that personal information will be collected. If an individual voluntarily enters the collection scope, it is deemed that they have agreed to the collection of their personal information; or on the personal information processing device, with clear signs / information informing users of the personal information processing rules, authorization is obtained from the individual through pop-up information or by asking the individual to upload their personal information; wherein, the personal information processing rules may include information such as the personal information processor, the purpose of personal information processing, the processing method, and the types of personal information processed.

[0085] The above description is merely an embodiment of this application and does not limit the patent scope of this application. Any equivalent structural or procedural transformations made using the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of this application.

Claims

1. A method for training a text recognition model, characterized in that, The method includes: The first sample text image is masked to obtain the first mask feature and the first non-masked region image of the first masked region image in the first sample text image. The first non-masked region image of the first sample text image is encoded using the encoder of the text recognition model to obtain the first encoded feature; Predicting the first mask feature and the first encoded feature yields the first text recognition result of the first sample text image; Based at least on the first text recognition result, adjust the parameters of the encoder of the text recognition model; The process of masking the first sample text image to obtain the first mask feature and the first non-masked region image of the first sample text image is performed using a position masker. Before performing masking processing on the first sample text image to obtain the first masked region image and the first unmasked region image in the first sample text image, the method further includes: The pre-training of the position masker and the encoder using a second sample text image includes: using the position masker to mask the second sample text image to obtain a second mask feature and a second non-masked region image in the second sample text image; using the encoder to encode the second non-masked region image of the second sample text image to obtain a second encoded feature; using the decoder to reconstruct pixel information of the second mask region image based on the second mask feature and the second encoded feature to obtain reconstructed pixel information of the second mask region image; and adjusting the parameters of the position masker, the encoder, and the decoder based on the original pixel information of the second mask region image and the reconstructed pixel information. 
The first sample text image is labeled with the real text recognition result, while the second sample text image is an unlabeled image. The step of masking a first sample text image to obtain a first mask feature and a first unmasked region image in the first sample text image, or the step of masking a second sample text image to obtain a second mask feature and a second unmasked region image in the second sample text image, includes: determining a target masked region image and a target unmasked region image in a target sample image based on a preset masking ratio, and determining the word embedding vector corresponding to the target masked region image; fusing the word embedding vector corresponding to the target masked region image and the region features of the target masked region image to obtain the target mask feature of the target masked region image; wherein, the target sample image is the first sample text image, the target masked region image is the first masked region image, the target unmasked region image is the first unmasked region image, and the target mask feature is the first mask feature; or, the target sample image is the second sample text image, the target masked region image is the second masked region image, the target unmasked region image is the second unmasked region image, and the target mask feature is the second mask feature.

2. The method according to claim 1, characterized in that, Before adjusting the parameters of the encoder of the text recognition model based at least on the first text recognition result, the method further includes: The encoder is used to encode the first sample text image to obtain the second encoded feature; The second encoded feature is predicted using the first prediction module of the text recognition model to obtain the second text recognition result of the first sample text image; The parameters of the first prediction module are adjusted based on the second text recognition result; The step of adjusting the encoder parameters of the text recognition model based at least on the first text recognition result includes: The encoder parameters are adjusted based on the first text recognition result and the second text recognition result.

3. The method according to claim 2, characterized in that, The first sample text image is labeled with the real text recognition result; The step of adjusting the encoder parameters based on the first text recognition result and the second text recognition result, and the step of adjusting the parameters of the first prediction module based on the second text recognition result, include: A first recognition loss is obtained based on the difference between the first text recognition result and the real text recognition result, and a second recognition loss is obtained based on the difference between the second text recognition result and the real text recognition result; Based on the first recognition loss and the second recognition loss, the parameters of the encoder are adjusted; and The parameters of the first prediction module are adjusted based on the second recognition loss.

4. The method according to claim 3, characterized in that, The process of masking the first sample text image to obtain the first mask feature and the first non-masked region image of the first sample text image is performed using a position masker; The step of predicting the first mask feature and the first encoded feature to obtain the first text recognition result of the first sample text image is performed using the second prediction module; After obtaining the first recognition loss based on the difference between the first text recognition result and the real text recognition result, the method further includes: Based on the first recognition loss, the parameters of the position masker and the second prediction module are adjusted.

5. The method according to claim 1, characterized in that, The process of determining the target masked region and the target unmasked region in the target sample image based on a preset mask ratio includes: Based on the preset mask ratio, the target sample image is divided into several image blocks along a preset direction, and at least one image block is randomly selected from the several image blocks as the target mask region image, and the remaining image blocks are used as the target non-mask region image. The process of fusing the word embedding vector corresponding to the target mask region image and the region features of the target mask region image to obtain the target mask features of the target mask region image includes: The word embedding vector is mapped to a preset dimension, where the preset dimension is the dimension of the region feature; The word embedding vector of the preset dimension is fused with the region feature to obtain the target mask feature.

6. The method according to claim 1, characterized in that, The step of encoding the first non-masked region image of the first sample text image using the encoder of the text recognition model to obtain a first encoded feature, or the step of encoding the second non-masked region image of the second sample text image using the encoder to obtain a second encoded feature, includes: Feature extraction is performed on the image of the non-masked region of the target to obtain the non-masked features of the target; The non-masked features of the target are subjected to self-attention processing to obtain self-attention features; The target non-masked features are fused with the self-attention features to obtain the target encoded features; Wherein, the target non-masked region image is a first non-masked region image and the target coding feature is a first coding feature, or the target non-masked region image is a second non-masked region image and the target coding feature is a second coding feature.

7. A text recognition method, characterized in that, The method includes: Acquire the target image; The target image is encoded using the encoder of a text recognition model to obtain the target encoded features of the target image; The target text in the target image is obtained by predicting the target encoding features of the target image using the prediction module of the text recognition model; wherein the text recognition model is a text recognition model trained using the method described in any one of claims 1-6.

8. An electronic device, characterized in that, Including interconnected memory and processor, The memory stores program instructions; The processor is used to execute program instructions stored in the memory to implement the training method of the text recognition model according to any one of claims 1-6, or to implement the text recognition method according to claim 7.

9. A computer-readable storage medium, characterized in that, The computer-readable storage medium is used to store program instructions that can be executed to implement the training method of the text recognition model as described in any one of claims 1-6, or to implement the text recognition method as described in claim 7.