Text recognition method and device, electronic equipment and storage medium
A text recognition method combines a first preset model and a second preset model to address the loss of recognition accuracy caused by blurred handwriting in the image to be recognized. The replacement character is determined from the intersection of the candidate character sets produced by the two models, which improves the accuracy of text recognition.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- BEIJING DAJIA INTERNET INFORMATION TECH CO LTD
- Filing Date
- 2022-09-22
- Publication Date
- 2026-06-12

Figure CN115641598B_ABST
Abstract
Description
Technical Field
[0001] This disclosure relates to the field of computer technology, and in particular to a text recognition method, apparatus, electronic device, and storage medium.
Background Technology
[0002] Image text recognition is a technology that uses optical techniques and text recognition models to scan and recognize text and characters in images, ultimately converting the text in the image into text format for further editing and processing by text processing software. However, due to factors such as blurred or occluded characters in some images, some text recognition errors occur, reducing the accuracy of text recognition.
Summary of the Invention
[0003] This disclosure provides a text recognition method, apparatus, electronic device, and storage medium to reduce miscorrections in text recognition results and improve text recognition accuracy. The technical solution of this disclosure is as follows:
[0004] According to a first aspect of the present disclosure, a text recognition method is provided. The method includes: performing text feature recognition processing on an image to be recognized to obtain an initial recognition result; the initial recognition result includes at least one initial character; each initial character has a corresponding first candidate character set; each initial character is a character in the first candidate character set corresponding to each initial character, and any first candidate character set is obtained by recognizing a preset position in the image to be recognized; determining a target character in the initial recognition result whose first confidence level is less than or equal to a first preset threshold; the first confidence level is obtained during the process of determining the initial recognition result; performing semantic feature extraction processing on the initial recognition result to predict the character at the target position in the initial recognition result, thereby obtaining a second candidate character set; the target position is the position of the target character in the initial recognition result; determining a replacement character for the target character based on the intersection of the second candidate character set and the first target candidate character set; the first target candidate character set is the first candidate character set corresponding to the target character; and determining the target recognition result of the image to be recognized based on the replacement character and the initial recognition result.
[0005] Optionally, the text in the image to be recognized is recognized to obtain an initial recognition result, including: inputting the image to be recognized into a first preset model for text feature recognition processing to obtain an initial recognition result; the first preset model is trained based on text feature recognition of multiple sample images.
[0006] Optionally, determining the target characters in the initial recognition results whose first confidence level is less than or equal to the first preset threshold includes: obtaining the first confidence level of each initial character in the initial recognition results; and determining the initial characters whose first confidence level is less than or equal to the first preset threshold as target characters.
[0007] Optionally, semantic feature extraction processing is performed on the initial recognition result to predict the character at the target position in the initial recognition result and obtain a second candidate character set, including: inputting the initial recognition result and the target position into a second preset model for semantic feature extraction processing to obtain a second candidate character set; the second preset model is trained based on text feature recognition of multiple sample texts.
[0008] Optionally, based on the intersection of the second candidate character set and the first candidate character set corresponding to the target character, the replacement character of the target character is determined, including: determining the target confidence of each candidate character in the intersection; the target confidence of each candidate character is the sum of the first confidence and the second confidence corresponding to each candidate character; the second confidence is obtained by the second preset model in the process of determining the second candidate character set; and the candidate characters in the intersection whose target confidence is greater than or equal to the preset confidence are determined as the replacement characters of the target character.
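The selection rule in [0008] can be sketched as follows. This is a minimal illustration rather than the disclosed implementation: the function name `select_replacements`, the dict representation of a candidate set (character mapped to confidence), and all sample numbers are assumptions introduced here.

```python
def select_replacements(first_candidates, second_candidates, preset_confidence):
    """Return intersection characters whose summed confidence meets the threshold.

    first_candidates / second_candidates map each candidate character to its
    first / second confidence (hypothetical representation).
    """
    # Only characters proposed by BOTH models are eligible.
    common = first_candidates.keys() & second_candidates.keys()
    # Target confidence = first confidence + second confidence.
    target_conf = {ch: first_candidates[ch] + second_candidates[ch] for ch in common}
    # Keep candidates at or above the preset confidence, most confident first.
    kept = [ch for ch in target_conf if target_conf[ch] >= preset_confidence]
    return sorted(kept, key=lambda ch: target_conf[ch], reverse=True)


# Hypothetical confidences for one low-confidence position.
first = {"M": 0.5, "N": 0.45, "H": 0.05}
second = {"N": 0.9, "n": 0.8, "C": 0.7, "M": 0.6}
print(select_replacements(first, second, preset_confidence=1.2))  # ['N']
```

Characters proposed by only one model ("H", "n", "C" in this sample) never reach the threshold test, which is what keeps the second preset model from introducing a character the first preset model did not see.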
[0009] Optionally, based on the replacement character and the initial recognition result, the target recognition result of the image to be recognized is determined, including: if the replacement character is different from the target character in the initial recognition result, the target character in the initial recognition result is replaced with the replacement character, and the initial recognition result after replacement is determined as the target recognition result of the image to be recognized; if the replacement character is the same as the target character in the initial recognition result, the initial recognition result is determined as the target recognition result of the image to be recognized.
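The two branches in [0009] can be sketched as below. The function name `apply_replacement` and the use of a plain Python string for the initial recognition result are assumptions made for illustration only.

```python
def apply_replacement(initial_result, target_position, replacement):
    """Return the target recognition result for one target position."""
    target_char = initial_result[target_position]
    if replacement == target_char:
        # Same character: the initial recognition result is kept unchanged.
        return initial_result
    # Different character: substitute it at the target position.
    return (initial_result[:target_position]
            + replacement
            + initial_result[target_position + 1:])


print(apply_replacement("MBA", 0, "N"))  # NBA
print(apply_replacement("NBA", 0, "N"))  # NBA
```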
[0010] Optionally, text feature recognition processing is performed on the image to be recognized to obtain an initial recognition result, including: for a preset position in the image to be recognized, text feature recognition processing is performed based on a first preset model to obtain a first candidate character set corresponding to the preset position; characters in the first candidate character set corresponding to the preset position with a first confidence level greater than or equal to a second preset threshold are determined as initial characters.
[0011] According to a second aspect of the present disclosure, a text recognition apparatus is provided, comprising a processing unit and a determining unit. The processing unit is configured to perform text feature recognition processing on an image to be recognized to obtain an initial recognition result. The initial recognition result includes at least one initial character. Each initial character has a corresponding first candidate character set. Each initial character is a character in the first candidate character set corresponding to the initial character, and any first candidate character set is obtained by recognizing a preset position in the image to be recognized. The determining unit is configured to determine a target character in the initial recognition result whose first confidence level is less than or equal to a first preset threshold. The first confidence level is obtained during the determination of the initial recognition result. The processing unit is further configured to perform semantic feature extraction processing on the initial recognition result, predict the character at the target position in the initial recognition result, and obtain a second candidate character set. The target position is the position of the target character in the initial recognition result. The determining unit is further configured to determine a replacement character for the target character based on the intersection of the second candidate character set and the first target candidate character set, and determine the target recognition result of the image to be recognized based on the replacement character and the initial recognition result. The first target candidate character set is the first candidate character set corresponding to the target character.
[0012] Optionally, the processing unit is specifically configured to perform: inputting the image to be recognized into a first preset model for text feature recognition processing to obtain an initial recognition result; the first preset model is trained based on text feature recognition of multiple sample images.
[0013] Optionally, the determining unit is specifically configured to perform: obtaining the first confidence score of each initial character in the initial recognition result; and determining the initial characters whose first confidence score is less than or equal to a first preset threshold as target characters.
[0014] Optionally, the processing unit is specifically configured to perform the following: inputting the initial recognition result and the target position into the second preset model for semantic feature extraction processing to obtain a second candidate character set; the second preset model is trained based on text feature recognition of multiple sample texts.
[0015] Optionally, the determining unit is specifically configured to perform: determining the target confidence of each candidate character in the intersection; the target confidence of each candidate character is the sum of the first confidence and the second confidence corresponding to each candidate character; the second confidence is obtained by the second preset model in the process of determining the second candidate character set; and determining the candidate characters in the intersection whose target confidence is greater than or equal to the preset confidence as the replacement characters of the target character.
[0016] Optionally, the determining unit is specifically configured to perform: if the replacement character is different from the target character in the initial recognition result, replace the target character in the initial recognition result with the replacement character, and determine the replaced initial recognition result as the target recognition result of the image to be recognized; if the replacement character is the same as the target character in the initial recognition result, determine the initial recognition result as the target recognition result of the image to be recognized.
[0017] Optionally, the processing unit is specifically configured to perform: for a preset position in the image to be recognized, perform text feature recognition processing based on a first preset model to obtain a first candidate character set corresponding to the preset position; and determine the characters in the first candidate character set corresponding to the preset position whose first confidence level is greater than or equal to a second preset threshold as initial characters.
[0018] According to a third aspect of the present disclosure, an electronic device is provided, comprising: a processor and a memory for storing processor-executable instructions; wherein the processor is configured to execute instructions to implement the text recognition method of the first aspect described above.
[0019] According to a fourth aspect of the present disclosure, a computer-readable storage medium is provided, on which instructions are stored, such that when the instructions in the computer-readable storage medium are executed by a processor of an electronic device, the electronic device is able to perform the text recognition method of the first aspect described above.
[0020] According to a fifth aspect of the present disclosure, a computer program product is provided, the computer program product including computer instructions, which, when executed by a processor, implement the text recognition method as described in the first aspect above.
[0021] The technical solution provided in this disclosure brings at least the following beneficial effects: The text recognition device recognizes the text in the image to be recognized based on a first preset model, obtaining an initial recognition result including at least one initial character. Since an initial character is a character in a first candidate character set; and the first candidate character set is obtained by the first preset model recognizing the same position in the image to be recognized, the first candidate character set is equivalent to the candidate character set predicted by the first preset model for a certain position in the image to be recognized. Further, the text recognition device determines the target character in the initial recognition result whose first confidence level is less than or equal to a preset threshold. Since the first confidence level is obtained by the first preset model in determining the initial recognition result, the target character is likely to be a target character that the first preset model has not accurately recognized. The text recognition device performs semantic analysis on the initial recognition result based on a second preset model, only needing to predict the character at the target position in the initial recognition result, to obtain a second candidate character set, which is equivalent to the candidate character set after the second preset model corrects the target position. The text recognition device determines the replacement character of the target character based on the intersection of the second candidate character set and the first candidate character set corresponding to the target character, and determines the target recognition result of the image to be recognized based on the replacement character and the initial recognition result. 
Compared with related technologies, in which relying solely on the language model (i.e., the second preset model) causes miscorrections, this disclosure combines the suggestions of the recognition model (i.e., the first preset model) and corrects only the target characters, thus reducing the miscorrection rate of the second preset model and improving text recognition accuracy.
[0022] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and are not intended to limit this disclosure.
Attached Figure Description
[0023] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this disclosure and, together with the description, serve to explain the principles of this disclosure, and are not intended to unduly limit this disclosure.
[0024] Figure 1 is an image to be recognized according to an exemplary embodiment;
[0025] Figure 2 is a schematic diagram of the structure of the TROCR model according to an exemplary embodiment;
[0026] Figure 3 is a schematic diagram of the structure of the ABINet model according to an exemplary embodiment;
[0027] Figure 4 is a schematic diagram of the structure of a text recognition system according to an exemplary embodiment;
[0028] Figure 5 is a first flowchart of a text recognition method according to an exemplary embodiment;
[0029] Figure 6 is a schematic diagram of the structure of a first preset model according to an exemplary embodiment;
[0030] Figure 7 is a schematic diagram of the structure of a second preset model according to an exemplary embodiment;
[0031] Figure 8 is a schematic diagram of the recognition effect according to an exemplary embodiment;
[0032] Figure 9 is a second flowchart of a text recognition method according to an exemplary embodiment;
[0033] Figure 10 is a third flowchart of a text recognition method according to an exemplary embodiment;
[0034] Figure 11 is a schematic diagram of the structure of a text recognition device according to an exemplary embodiment;
[0035] Figure 12 is a schematic diagram of the structure of an electronic device according to an exemplary embodiment.
Detailed Implementation
[0036] To enable those skilled in the art to better understand the technical solutions of this disclosure, the technical solutions in the embodiments of this disclosure will be clearly and completely described below with reference to the accompanying drawings.
[0037] It should be noted that the terms "first," "second," etc., used in the specification, claims, and accompanying drawings of this disclosure are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this disclosure described herein can be implemented in orders other than those illustrated or described herein. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with this disclosure. Rather, they are merely examples of apparatuses and methods consistent with some aspects of this disclosure as detailed in the appended claims.
[0038] In addition, in the description of the embodiments of the present disclosure, unless otherwise specified, " / " means "or". For example, A / B may mean A or B. "And / or" herein is merely a description of the association relationship of associated objects, indicating that there can be three relationships. For example, A and / or B may mean: A exists alone, A and B exist simultaneously, and B exists alone. In addition, in the description of the embodiments of the present disclosure, "multiple" means two or more than two.
[0039] It should be noted that the user information (including but not limited to user device information, user personal information, user behavior information, etc.) and data (including but not limited to program code, etc.) involved in the present disclosure are all information and data authorized by the user or fully authorized by all parties.
[0040] Before explaining the embodiments of the present disclosure in detail, some related technologies involved in the embodiments of the present disclosure will be introduced first.
[0041] Image text recognition is essentially optical character recognition (OCR), which uses optical technology and a text recognition model to scan and recognize the text and characters in an image, and finally converts the text in the image into a text format for further editing and processing by text processing software. Among them, the text recognition model mainly relies on visual information to recognize the text in the image. However, when there are some artistic deformations or blurs in the text in the image, it is very difficult for the text recognition model to recognize the correct text.
[0042] Because of large-scale pre-training, a language model learns a large amount of semantic knowledge about text and can therefore correct the recognition result of a text recognition model. For example, the text recognition model's recognition result for the image shown in Figure 1 is "MBA Top 100 Balls", because the artistic characters M and N in Figure 1 are visually indistinguishable, which makes recognition difficult for the text recognition model. After pre-training with tens of billions of parameters, the language model has learned human language expression patterns and co-occurrence relationships, so it can correct "MBA Top 100 Balls" to "NBA Top 100 Balls".
[0043] However, in actual applications, the frequencies of different phrases vary greatly, and relying on the language model alone usually causes miscorrections. The language model tends to correct rare phrases in the recognition result to common phrases, so it alters a recognition result that the text recognition model had originally gotten right and outputs the miscorrected result. Suppose an image contains the clearly legible text "xue diao", so that the text recognition model's recognition result for this image is "xue diao". The language model will nevertheless correct this result to "study", because it treats "study" as a commonly used word, which causes a miscorrection.
[0044] In some related technologies, text recognition models and language models can be fused to obtain a unified fusion model, such as the TROCR model and the ABINet model. The TROCR model models language implicitly, meaning that the text recognition model and the language model share parameters and are coupled together to perform text recognition on images. For example, Figure 2 shows the structure of the TROCR model, which includes an encoder and a decoder. After an image is input into the TROCR model, the model segments the image into multiple small image patches, encodes each patch, and then decodes the encoded results to output the final recognized text. The ABINet model blocks gradient flow between the vision model and the language model to achieve explicit language modeling. For example, Figure 3 shows the structure of the ABINet model, which includes a text recognition model and a language model. The recognition results of the text recognition model are used as the input of the language model to achieve end-to-end OCR.
[0045] However, both the TROCR and ABINet models essentially rely on the language model for their correction results. That is, both models apply language model prediction to every recognized character in the image without distinction, which ultimately leads to miscorrection of some clear and correctly recognized characters.
[0046] The text recognition method provided in this disclosure addresses the aforementioned technical problems in related technologies and can be applied to a text recognition system. Figure 4 shows a schematic diagram of one structure of the text recognition system. As shown in Figure 4, the text recognition system 10 includes a text recognition device 11 and an electronic device 12. The text recognition device 11 is connected to the electronic device 12. The text recognition device 11 and the electronic device 12 can be connected via a wired connection or a wireless connection; this embodiment does not limit the connection method.
[0047] The text recognition device 11 is used to recognize text in the image to be recognized based on a first preset model, obtain an initial recognition result, and determine target characters in the initial recognition result whose first confidence level is less than or equal to a preset threshold. The text recognition device 11 is also used to perform semantic analysis on the initial recognition result based on a second preset model, predict the character at the target position in the initial recognition result, and obtain a second candidate character set. The text recognition device 11 is further used to determine the replacement character of the target character based on the intersection of the second candidate character set and the first candidate character set corresponding to the target character, and to determine the target recognition result of the image to be recognized based on the replacement character and the initial recognition result.
[0048] The text recognition device 11 can implement the text recognition method of this disclosure embodiment in various electronic devices 12. For example, the electronic device 12 can be a scanner, digital camera, etc.
[0049] In different application scenarios, the text recognition device 11 and the electronic device 12 can be independent devices or integrated into the same device. This embodiment of the invention does not make specific limitations in this regard.
[0050] When the text recognition device 11 and the electronic device 12 are integrated into the same device, the data transmission method between the text recognition device 11 and the electronic device 12 is the data transmission between internal modules of the device. In this case, the data transmission process between the two is the same as that when the text recognition device 11 and the electronic device 12 are independent of each other.
[0051] In the following embodiments provided in this disclosure, the text recognition device 11 and the electronic device 12 are described as being configured independently of each other.
[0052] Figure 5 is a flowchart illustrating a text recognition method according to some exemplary embodiments. In some embodiments, the above-described text recognition method can be applied to the text recognition device and electronic device shown in Figure 4, and can also be applied to other similar devices.
[0053] As shown in Figure 5, the text recognition method provided in this embodiment includes the following steps S201-S205.
[0054] S201. The text recognition device performs text feature recognition processing on the image to be recognized to obtain the initial recognition result.
[0055] The initial recognition result includes at least one initial character; each initial character has a corresponding first candidate character set; each initial character is a character in the first candidate character set corresponding to that initial character, and any first candidate character set is obtained by recognizing a preset position in the image to be recognized.
[0056] As one possible implementation, the text recognition device uses a preset text recognition algorithm to perform text feature recognition processing on the image to be recognized, and obtains the initial recognition result.
[0057] As another possible implementation, the text recognition device inputs the image to be recognized into a first preset model for text feature recognition processing to obtain an initial recognition result.
[0058] It should be noted that the first preset model is a model pre-deployed by maintenance personnel in the text recognition device for preliminary recognition of text in the image to be recognized. In practical applications, the first preset model can be any text recognition model, such as a CRNN-structured text recognition model or a CNN-structured text recognition model.
[0059] Specifically, for a preset position in the image to be recognized, the text recognition device performs text feature recognition processing using the first preset model to obtain a first candidate character set corresponding to the preset position. Further, the text recognition device determines the characters in the first candidate character set corresponding to the preset position whose first confidence level is greater than or equal to a second preset threshold as initial characters. It should be noted that the second preset threshold is set in advance by maintenance personnel in the text recognition device.
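The thresholding step in [0059] might look like the sketch below. The function name `pick_initial_character`, the dict representation of the first candidate character set, and the two tie-breaking policies marked in the comments are assumptions; the disclosure itself only states the "greater than or equal to the second preset threshold" condition.

```python
def pick_initial_character(first_candidates, second_preset_threshold):
    """Pick the initial character for one preset position.

    first_candidates maps each candidate character to its first confidence
    (a hypothetical representation of the first candidate character set).
    """
    # Keep only candidates at or above the second preset threshold.
    eligible = {ch: conf for ch, conf in first_candidates.items()
                if conf >= second_preset_threshold}
    if not eligible:
        # Assumption: no character is emitted when nothing clears the threshold.
        return None
    # Assumption: if several candidates qualify, take the most confident one.
    best = max(eligible, key=eligible.get)
    return best, eligible[best]


# Sample confidences for a single position.
candidates_at_position = {"A": 0.05, "B": 0.15, "C": 0.8}
print(pick_initial_character(candidates_at_position, 0.5))  # ('C', 0.8)
print(pick_initial_character(candidates_at_position, 0.9))  # None
```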
[0060] Figure 6 shows a structural schematic in which the first preset model is a CRNN-structured text recognition model. For the image to be recognized, the first preset model first uses its convolutional neural network (CNN) part for feature extraction, then inputs the extracted features into its recurrent neural network (RNN) part for temporal encoding to obtain an encoded feature matrix. Finally, the decoder part decodes the feature matrix to obtain the initial recognition result.
[0061] Specifically, the feature matrix is denoted as [Length×Chars]. The length of the matrix, Length, is proportional to the length of the text line in the image to be recognized, and the height of the matrix, Chars, is the total number of characters in the recognition dictionary of the text recognition model (i.e., each column corresponds to one character to be recognized in the image, and each row corresponds to one dictionary character). Each number in a column is the first confidence level that the character to be recognized in that column is the dictionary character of the corresponding row. The decoder decodes the feature matrix by selecting the element with the highest first confidence level in each column, taking the row index of that element, and then merging adjacent identical characters according to the connectionist temporal classification (CTC) decoding principle to output the initial recognition result.
[0062] For example, consider a [3×3] feature matrix:
[0063] [[0.9, 0.8, 0.05],
[0064] [0.05, 0.15, 0.15],
[0065] [0.05, 0.05, 0.8]].
[0066] For the aforementioned feature matrix, the decoder takes the element with the highest first confidence score in each column as the object to be decoded, i.e., 0.9, 0.8, and 0.8, which appear in rows 1, 1, and 3 respectively. Therefore, the initial recognition result before decoding is denoted as [1, 1, 3]. Assume that in the text recognition model's recognition dictionary, row 1 corresponds to character A, row 2 corresponds to character B, and row 3 corresponds to character C. The decoder can therefore parse [1, 1, 3] into [A, A, C] by looking up the correspondence in the recognition dictionary. Furthermore, the decoder merges the adjacent identical characters [A, A] using the CTC decoding principle, finally outputting the initial recognition result [A, C], where the first confidence score for character A is 0.85 (which matches the mean of the merged values 0.9 and 0.8), and the first confidence score for character C is 0.8.
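The worked example above can be reproduced with a short sketch. Several points are assumptions rather than statements from the disclosure: the function name `greedy_decode`, the 0-based row indexing (the text counts rows from 1), and averaging the confidences of merged characters, which is chosen only because it yields the 0.85 quoted for character A. Following the example in the text, no CTC blank symbol is modeled here, although standard CTC decoding uses one.

```python
def greedy_decode(matrix, dictionary):
    """Greedy per-column decoding followed by merging of adjacent repeats.

    matrix[r][c] is the first confidence that column c shows dictionary[r].
    """
    best = []
    for col in range(len(matrix[0])):
        column = [row[col] for row in matrix]
        # Row index of the highest first confidence in this column.
        row_idx = max(range(len(column)), key=column.__getitem__)
        best.append((dictionary[row_idx], column[row_idx]))

    # Merge runs of adjacent identical characters, collecting their confidences.
    merged = []
    for ch, conf in best:
        if merged and merged[-1][0] == ch:
            merged[-1][1].append(conf)
        else:
            merged.append((ch, [conf]))
    # Assumption: a merged character's confidence is the mean of its run.
    return [(ch, sum(confs) / len(confs)) for ch, confs in merged]


feature_matrix = [
    [0.9, 0.8, 0.05],
    [0.05, 0.15, 0.15],
    [0.05, 0.05, 0.8],
]
result = greedy_decode(feature_matrix, ["A", "B", "C"])
print([(ch, round(conf, 2)) for ch, conf in result])  # [('A', 0.85), ('C', 0.8)]
```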
[0067] S202. The text recognition device determines the target character in the initial recognition result whose first confidence level is less than or equal to the first preset threshold.
[0068] The first confidence level is obtained during the process of determining the initial recognition result.
[0069] As one possible implementation, the text recognition device obtains the first confidence level of each initial character in the initial recognition result, and determines the initial characters whose first confidence level is less than or equal to a first preset threshold as target characters.
[0070] It should be noted that the first preset threshold is set in advance by the maintenance personnel in the text recognition device. The first preset threshold and the second preset threshold may be the same or different, and this embodiment does not limit this.
[0071] For example, taking the initial recognition result [A, C] output in the above S201 embodiment, the text recognition device obtains the first confidence level of the initial character A (0.85) and the first confidence level of the initial character C (0.8) from the first preset model. If the first preset threshold is 0.81, the text recognition device determines the initial character C as the target character (that is, a character in which the first preset model has low confidence).
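Step S202 on this example can be sketched as follows. The function name `find_target_positions` and the list-of-pairs representation of the initial recognition result are assumptions made for illustration.

```python
def find_target_positions(initial_result, first_preset_threshold):
    """Return positions whose first confidence is <= the first preset threshold.

    initial_result is a list of (character, first_confidence) pairs
    (hypothetical representation).
    """
    return [pos for pos, (_, conf) in enumerate(initial_result)
            if conf <= first_preset_threshold]


initial_result = [("A", 0.85), ("C", 0.8)]
targets = find_target_positions(initial_result, 0.81)
print([(pos, initial_result[pos][0]) for pos in targets])  # [(1, 'C')]
```

Returning positions rather than characters keeps the target position available for the later prediction step, which needs to know where in the initial recognition result the target character sits.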
[0072] S203. The text recognition device performs semantic feature extraction processing on the initial recognition result, predicts the character at the target position in the initial recognition result, and obtains the second candidate character set.
[0073] The target position is the location of the target character in the initial recognition result.
[0074] As one possible implementation, the text recognition device uses a preset semantic correction algorithm to perform semantic feature extraction processing on the initial recognition result, predicts the character at the target position in the initial recognition result, and obtains a second candidate character set.
[0075] As another possible implementation, the text recognition device inputs the initial recognition result and the target position into a second preset model, performs semantic feature extraction processing on the initial recognition result, predicts the character at the target position in the initial recognition result, and obtains a second candidate character set.
[0076] It should be noted that the second preset model is a model pre-deployed by maintenance personnel in the text recognition device to predict the character at the target position. In practical applications, the second preset model can be any language model, such as a GPT-1 or GPT-2 language model.
[0077] As shown in Figure 7, one possible structure of the second preset model is a language model with a GPT-2 structure. The GPT-2 model consists of multiple layers of masked self-attention modules and feedforward neural network modules. The text recognition device inputs the initial recognition result and the target position into the second preset model. The second preset model predicts the character at the target position and, through the multi-layer feedforward neural network, obtains the prediction result for the target position and a second candidate character set.
[0078] Specifically, the second preset model predicts the character at the target position, obtaining a second candidate character set corresponding to that target position, and then selects the character with the highest second confidence level from this second candidate character set as the predicted character. The second confidence level is obtained by the second preset model during the process of determining the prediction result for the target position. For example, if the second candidate character set is (N, n, C, M), and the corresponding second confidence levels are 0.9, 0.8, 0.7, and 0.6 respectively, then the prediction result of the second preset model for the target position is N.
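As a small illustration of how the predicted character follows from the second candidate character set, the sketch below represents the set as a mapping from character to second confidence. The mapping and the name `predict_character` are assumptions for illustration; no real language model is called.

```python
from typing import Dict


def predict_character(second_candidates: Dict[str, float]) -> str:
    """Return the candidate with the highest second confidence."""
    return max(second_candidates, key=second_candidates.get)


# Second candidate character set and confidences from the example in paragraph [0078].
second_candidates = {"N": 0.9, "n": 0.8, "C": 0.7, "M": 0.6}
predicted = predict_character(second_candidates)
print(predicted)  # N
```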
[0079] S204. The text recognition device determines the replacement character of the target character based on the intersection of the second candidate character set and the first target candidate character set.
[0080] The first target candidate character set is the first candidate character set corresponding to the target character.
[0081] As one possible implementation, the text recognition device determines the target confidence level of each candidate character in the intersection; the target confidence level of a candidate character is the sum of the first confidence level and the second confidence level of the candidate character. Furthermore, the text recognition device identifies the candidate character with the highest target confidence level in the intersection as the replacement character for the target character.
[0082] It should be noted that after the text recognition device identifies the target character, it re-decodes the target position and retains the recognition result of the first preset model for the target position to obtain the first candidate character set corresponding to the target character.
[0083] As another possible implementation, the text recognition device randomly selects a candidate character from the intersection of the second candidate character set and the first candidate character set corresponding to the target character, and uses the selected candidate character as the replacement character for the target character.
[0084] As another possible implementation, the text recognition device scores each candidate character in the first candidate character set and each candidate character in the second candidate character set. Furthermore, the text recognition device selects the candidate character with the highest overall score from the intersection as the replacement character.
[0085] Specifically, for the first candidate character set, the text recognition device sorts each candidate character from highest to lowest based on its first confidence level, and scores each candidate character based on the sorting result. For example, if the sorting result of the first candidate character set is Top 1, 2, 3, ... 10, the corresponding scores are 15, 12, 8, 7, 6, 5, 4, 3, 2, 1. Similarly, for the second candidate character set, the text recognition device sorts each candidate character from highest to lowest based on its second confidence level, and scores each candidate character based on the sorting result. For example, if the sorting result of the second candidate character set is Top 1, 2, 3, ... 10, the corresponding scores are 15, 12, 8, 7, 6, 5, 4, 3, 2, 1. If the intersection is candidate character a and candidate character b, and candidate character a scores 15 in the first candidate character set and 5 in the second candidate character set; and candidate character b scores 8 in the first candidate character set and 6 in the second candidate character set, then the overall score of candidate character a is 20, and the overall score of candidate character b is 14. Therefore, the text recognition device identifies candidate character a as the replacement character for the target character.
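The rank-based scoring described above can be sketched as follows. The score table is the one given in the text; the helper names, the placeholder candidates (x, p, q, r, s), the concrete confidence values, and the choice to give a score of 0 to candidates ranked below tenth place are assumptions made for illustration.

```python
from typing import Dict, Optional

# Scores for ranks Top 1 to Top 10, as listed in paragraph [0085].
RANK_SCORES = [15, 12, 8, 7, 6, 5, 4, 3, 2, 1]


def rank_scores(candidates: Dict[str, float]) -> Dict[str, int]:
    """Sort candidates by confidence (high to low) and map each to its rank score."""
    ranked = sorted(candidates, key=candidates.get, reverse=True)
    return {ch: RANK_SCORES[i] for i, ch in enumerate(ranked[: len(RANK_SCORES)])}


def pick_by_overall_score(first: Dict[str, float], second: Dict[str, float]) -> Optional[str]:
    """Pick the character in the intersection with the highest overall score."""
    s1, s2 = rank_scores(first), rank_scores(second)
    common = sorted(first.keys() & second.keys())  # sorted for deterministic ties
    if not common:
        return None  # assumption: the text does not say what happens here
    return max(common, key=lambda ch: s1.get(ch, 0) + s2.get(ch, 0))


# Hypothetical confidences chosen so that 'a' ranks 1st/6th and 'b' ranks 3rd/5th,
# reproducing the scores in the text (a: 15 + 5 = 20, b: 8 + 6 = 14).
first_set = {"a": 0.9, "x": 0.8, "b": 0.7}
second_set = {"p": 0.9, "q": 0.8, "r": 0.7, "s": 0.6, "b": 0.5, "a": 0.4}

replacement = pick_by_overall_score(first_set, second_set)
print(replacement)  # a
```

Because the scores depend only on rank, this variant is insensitive to differences in how the two models calibrate their confidence values.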
[0086] S205. The text recognition device determines the target recognition result of the image to be recognized based on the replacement character and the initial recognition result.
[0087] As one possible implementation, the text recognition device replaces the target character in the initial recognition result with a replacement character, and determines the replaced initial recognition result as the target recognition result of the image to be recognized.
[0088] For example, the initial recognition result is "MBA Top 100 Plays", where "M" is the target character. When the replacement character is "N", the text recognition device replaces "M" with "N" and outputs "NBA Top 100 Plays".
[0089] Figure 8 shows the initial recognition results and target recognition results obtained after recognizing some images to be recognized according to embodiments of this disclosure. It is evident that the text recognition method provided by embodiments of this disclosure reduces the miscorrections that related technologies make to text recognition results, thereby improving the accuracy of text recognition.
[0090] The technical solution provided by this disclosure provides at least the following beneficial effects: The text recognition device recognizes text in the image to be recognized based on a first preset model, obtaining an initial recognition result including at least one initial character. Since an initial character is a character in a first candidate character set, and the first candidate character set is obtained by the first preset model recognizing the same position in the image to be recognized, the first candidate character set is equivalent to the candidate character set predicted by the first preset model for a certain position in the image to be recognized. Further, the text recognition device determines a target character in the initial recognition result whose first confidence level is less than or equal to a preset threshold. Since the first confidence level is obtained by the first preset model in determining the initial recognition result, the target character is likely a target character that the first preset model has not accurately recognized. The text recognition device performs semantic analysis on the initial recognition result based on a second preset model, only needing to predict the character at the target position in the initial recognition result to obtain a second candidate character set, which is equivalent to the candidate character set after the second preset model corrects the target position. The text recognition device determines a replacement character for the target character based on the intersection of the second candidate character set and the first candidate character set corresponding to the target character, and determines the target recognition result of the image to be recognized based on the replacement character and the initial recognition result. 
Compared with the miscorrection problem caused by relying solely on a language model (i.e., the second preset model) in related technologies, this disclosure combines the suggestions of the recognition model (i.e., the first preset model) and corrects only the target characters, thus reducing the miscorrection rate of the second preset model and improving text recognition accuracy.
[0091] In one design, as shown in Figure 9, in order to determine the replacement character for the target character, the above-mentioned S204 provided in this embodiment of the disclosure specifically includes the following S2041-S2042:
[0092] S2041. The text recognition device determines the target confidence level of each candidate character in the intersection.
[0093] The target confidence level of a candidate character is the sum of the first confidence level and the second confidence level of the candidate character; the second confidence level is obtained by the second preset model in the process of determining the second candidate character set.
[0094] As one possible implementation, for any candidate character in the intersection, the text recognition device obtains a first confidence level and a second confidence level for that candidate character. Further, the text recognition device determines the sum of the first and second confidence levels as the target confidence level for that candidate character.
[0095] S2042. The text recognition device determines the candidate character with the highest target confidence level in the intersection as the replacement character for the target character.
[0096] As one possible implementation, the text recognition device selects the candidate character with the highest target confidence from the intersection and uses it as the replacement character for the target character.
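Steps S2041-S2042 can be sketched as below. The candidate sets are represented as mappings from character to confidence; the first-model confidence values are hypothetical, the second-model values reuse the example in paragraph [0078], and returning `None` for an empty intersection is an assumption, since the text does not describe that case.

```python
from typing import Dict, Optional


def pick_replacement(first: Dict[str, float], second: Dict[str, float]) -> Optional[str]:
    """S2041: target confidence = first + second confidence for each common candidate.
    S2042: return the common candidate with the highest target confidence."""
    common = sorted(first.keys() & second.keys())  # sorted for deterministic ties
    if not common:
        return None  # assumption: behaviour for an empty intersection is not specified
    return max(common, key=lambda ch: first[ch] + second[ch])


# First target candidate character set (hypothetical values) and
# second candidate character set (values from paragraph [0078]).
first_target_set = {"M": 0.45, "N": 0.40, "H": 0.10}
second_set = {"N": 0.9, "n": 0.8, "C": 0.7, "M": 0.6}

# Intersection is {M, N}: M -> 0.45 + 0.6, N -> 0.40 + 0.9, so N wins.
print(pick_replacement(first_target_set, second_set))  # N
```

Restricting the choice to the intersection means the language model can only propose characters that the recognition model also considered plausible for that image position.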
[0097] In one design, as shown in Figure 10, in order to determine the target recognition result of the image to be recognized, the above-mentioned S205 provided in this embodiment of the disclosure specifically includes the following S2051-S2053:
[0098] S2051. The text recognition device determines whether the replacement character is the same as the target character.
[0099] As one possible implementation, the text recognition device compares the replacement character with the target character to determine whether the replacement character and the target character are the same.
[0100] S2052. When the replacement character is different from the target character in the initial recognition result, the text recognition device replaces the target character in the initial recognition result with the replacement character and determines the replaced initial recognition result as the target recognition result of the image to be recognized.
[0101] For example, the initial recognition result is "MBA Top 100 Plays", where "M" is the target character. The replacement character is "N", which is different from the target character in the initial recognition result. Therefore, the text recognition device replaces "M" with "N" and outputs "NBA Top 100 Plays".
[0102] S2053. If the replacement character is the same as the target character in the initial recognition result, the text recognition device determines the initial recognition result as the target recognition result of the image to be recognized.
[0103] For example, the initial recognition result is "NBA Top 100 Plays", where "N" is the target character. The replacement character is "N", meaning the replacement character is the same as the target character in the initial recognition result. Therefore, the text recognition device directly outputs "NBA Top 100 Plays".
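The branch in S2051-S2053 amounts to a comparison followed by a single-position substitution. A minimal sketch, assuming the initial recognition result is a string and the target position is a zero-based index (both are illustrative assumptions):

```python
def apply_replacement(initial_result: str, target_position: int, replacement: str) -> str:
    """Return the target recognition result for one target position."""
    # S2051: compare the replacement character with the target character.
    if initial_result[target_position] == replacement:
        # S2053: identical, so the initial recognition result is output unchanged.
        return initial_result
    # S2052: different, so substitute the replacement at the target position.
    return initial_result[:target_position] + replacement + initial_result[target_position + 1:]


print(apply_replacement("MBA Top 100 Plays", 0, "N"))  # NBA Top 100 Plays
print(apply_replacement("NBA Top 100 Plays", 0, "N"))  # NBA Top 100 Plays
```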
[0104] The above embodiments primarily describe the solutions provided by the embodiments of this disclosure from the perspective of an apparatus (device). It is understood that, in order to implement the above methods, the apparatus or device includes hardware structures and / or software modules corresponding to the execution of each method flow, and these hardware structures and / or software modules corresponding to the execution of each method flow can constitute an electronic device. Those skilled in the art should readily recognize that, in conjunction with the algorithm steps of the various examples described in the embodiments disclosed herein, this disclosure can be implemented in hardware or a combination of hardware and computer software. Whether a function is executed in a hardware or computer software-driven hardware manner depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this disclosure.
[0105] This disclosure embodiment can divide the apparatus or device into functional modules according to the above method examples. For example, the apparatus or device can be divided into functional modules corresponding to each function, or two or more functions can be integrated into one processing module. The integrated module can be implemented in hardware or as a software functional module. It should be noted that the module division in this disclosure embodiment is illustrative and only represents one logical functional division; other division methods may be used in actual implementation.
[0106] Figure 11 is a schematic diagram illustrating the structure of a text recognition device according to an exemplary embodiment. As shown in Figure 11, the text recognition device 30 provided in this embodiment includes a processing unit 301 and a determination unit 302.
[0107] Processing unit 301 is used to recognize text in the image to be recognized based on a first preset model to obtain an initial recognition result; the initial recognition result includes at least one initial character; an initial character is a character in a first candidate character set; the first candidate character set is obtained by the first preset model recognizing the same position in the image to be recognized; determining unit 302 is used to determine the target character in the initial recognition result whose first confidence is less than or equal to a preset threshold; the first confidence is obtained by the first preset model in the process of determining the initial recognition result; processing unit 301 is also used to perform semantic analysis on the initial recognition result based on a second preset model, predict the character at the target position in the initial recognition result, and obtain a second candidate character set; the target position is the position of the target character in the initial recognition result; determining unit 302 is also used to determine the replacement character of the target character based on the intersection of the second candidate character set and the first candidate character set corresponding to the target character, and determine the target recognition result of the image to be recognized based on the replacement character and the initial recognition result.
[0108] Optionally, the processing unit 301 is specifically used to: input the image to be recognized into the first preset model to obtain the initial recognition result.
[0109] Optionally, the determining unit 302 is specifically used to: obtain the first confidence level of each initial character in the initial recognition result, and determine the initial characters whose first confidence level is less than or equal to a preset threshold as target characters.
[0110] Optionally, the processing unit 301 is specifically used to: input the initial recognition result and the target position into the second preset model to obtain the second candidate character set.
[0111] Optionally, the determining unit 302 is specifically used to: determine the target confidence of each candidate character in the intersection; the target confidence of a candidate character is the sum of the first confidence and the second confidence of the candidate character; the second confidence is obtained by the second preset model in the process of determining the second set of candidate characters; and determine the candidate character with the highest target confidence in the intersection as the replacement character of the target character.
[0112] Optionally, the determining unit 302 is specifically used to: replace the target character in the initial recognition result with a replacement character, and determine the replaced initial recognition result as the target recognition result of the image to be recognized.
[0113] Optionally, the determining unit 302 is specifically used to: replace the target character in the initial recognition result with the replacement character when the replacement character is different from the target character in the initial recognition result, and determine the initial recognition result after replacement as the target recognition result of the image to be recognized; and determine the initial recognition result as the target recognition result of the image to be recognized when the replacement character is the same as the target character in the initial recognition result.
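The division of work between the processing unit and the determining unit can be sketched end to end as follows. This is an illustrative arrangement under stated assumptions, not the disclosed implementation: the two models are replaced by hypothetical stub functions, the initial character is taken to be the highest-confidence candidate at each position, and an empty intersection leaves the initial character unchanged.

```python
from typing import Callable, Dict, List

CandidateSet = Dict[str, float]  # character -> confidence


class ProcessingUnit:
    """Holds the first (recognition) and second (language) models as callables."""

    def __init__(self, recognizer: Callable[[object], List[CandidateSet]],
                 predictor: Callable[[str, int], CandidateSet]) -> None:
        self.recognizer = recognizer
        self.predictor = predictor


class DeterminingUnit:
    """Selects target characters and their replacement characters."""

    def __init__(self, first_threshold: float) -> None:
        self.first_threshold = first_threshold

    def target_positions(self, chars: List[str], sets: List[CandidateSet]) -> List[int]:
        return [i for i, ch in enumerate(chars) if sets[i][ch] <= self.first_threshold]

    def replacement(self, first: CandidateSet, second: CandidateSet, current: str) -> str:
        common = sorted(first.keys() & second.keys())
        if not common:
            return current  # assumption: keep the initial character
        return max(common, key=lambda ch: first[ch] + second[ch])


class TextRecognitionDevice:
    def __init__(self, processing: ProcessingUnit, determining: DeterminingUnit) -> None:
        self.processing = processing
        self.determining = determining

    def recognize(self, image: object) -> str:
        sets = self.processing.recognizer(image)
        chars = [max(cs, key=cs.get) for cs in sets]  # assumption: top-1 is the initial character
        text = "".join(chars)
        result = list(chars)
        for pos in self.determining.target_positions(chars, sets):
            second = self.processing.predictor(text, pos)
            result[pos] = self.determining.replacement(sets[pos], second, chars[pos])
        return "".join(result)


# Hypothetical stubs standing in for the first and second preset models.
def stub_recognizer(image: object) -> List[CandidateSet]:
    return [{"M": 0.45, "N": 0.40}, {"B": 0.95}, {"A": 0.97}]


def stub_predictor(text: str, position: int) -> CandidateSet:
    return {"N": 0.9, "n": 0.8, "C": 0.7, "M": 0.6}


device = TextRecognitionDevice(ProcessingUnit(stub_recognizer, stub_predictor),
                               DeterminingUnit(first_threshold=0.81))
print(device.recognize(None))  # NBA
```

Passing the models in as callables keeps the sketch independent of any particular recognition or language model library.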
[0114] Figure 12 is a schematic diagram of the structure of an electronic device provided in this disclosure. As shown in Figure 12, the electronic device 40 may include at least one processor 401 and a memory 402 for storing processor-executable instructions, wherein the processor 401 is configured to execute the instructions in the memory 402 to implement the text recognition method in the above embodiments.
[0115] In addition, the electronic device 40 may also include a communication bus 403 and at least one communication interface 404.
[0116] Processor 401 may be a central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of programs according to the present disclosure.
[0117] The communication bus 403 may include a path for transmitting information between the aforementioned components.
[0118] Communication interface 404 uses any transceiver-like device to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN), or a wireless local area network (WLAN).
[0119] Memory 402 may be a read-only memory (ROM) or other type of static storage device capable of storing static information and instructions, a random access memory (RAM) or other type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium capable of carrying or storing desired program code in the form of instructions or data structures and accessible by a computer, but is not limited thereto. The memory may exist independently and be connected to the processor via a bus. The memory may also be integrated with the processor.
[0120] The memory 402 stores instructions for executing the present invention, and the processor 401 controls the execution of these instructions. The processor 401 executes the instructions stored in the memory 402 to implement the functions of the text recognition method of the present invention.
[0121] As an example, with reference to Figure 11, the functions implemented by the processing unit 301 and the determining unit 302 in the text recognition device 30 are the same as the functions of the processor 401 in Figure 12.
[0122] In a specific implementation, as one example, processor 401 may include one or more CPUs, for example, CPU0 and CPU1 in Figure 12.
[0123] In a specific implementation, as one example, the electronic device 40 may include multiple processors, such as processors 401 and 407 in Figure 12. Each of these processors may be a single-core (single-CPU) processor or a multi-core (multi-CPU) processor. A processor here may refer to one or more devices, circuits, and / or processing cores used to process data (e.g., computer program instructions).
[0124] In a specific implementation, as one embodiment, the electronic device 40 may further include an output device 405 and an input device 406. The output device 405 communicates with the processor 401 and can display information in various ways. For example, the output device 405 may be a liquid crystal display (LCD), a light-emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector, etc. The input device 406 communicates with the processor 401 and can accept input from user objects in various ways. For example, the input device 406 may be a mouse, keyboard, touchscreen device, or sensing device, etc.
[0125] Those skilled in the art will understand that the structure shown in Figure 12 does not constitute a limitation on the electronic device 40, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
[0126] In addition, this disclosure also provides a computer-readable storage medium that, when the instructions in the computer-readable storage medium are executed by the processor of an electronic device, enables the electronic device to perform the text recognition method provided in the above embodiments.
[0127] In addition, this disclosure also provides a computer program product, including computer instructions, which, when executed on an electronic device, cause the electronic device to perform the text recognition method provided in the above embodiments.
[0128] Other embodiments of this disclosure will readily occur to those skilled in the art upon consideration of the specification and practice of the disclosure herein. This disclosure is intended to cover any variations, uses, or adaptations of this disclosure that follow the general principles of this disclosure and include common knowledge or customary techniques in the art not disclosed herein. The specification and examples are to be considered exemplary only, and the true scope and spirit of this disclosure are indicated by the claims.
Claims
1. A text recognition method, characterized in that the method includes: performing text feature recognition processing on the image to be recognized to obtain an initial recognition result; the initial recognition result includes at least one initial character; each initial character has a corresponding first candidate character set; each initial character is a character in the first candidate character set corresponding to each initial character, and any first candidate character set is obtained by recognizing a preset position in the image to be recognized; determining target characters in the initial recognition result whose first confidence level is less than or equal to a first preset threshold; the first confidence level is obtained during the process of determining the initial recognition result; performing semantic feature extraction processing on the initial recognition result to predict the character at the target position in the initial recognition result, thereby obtaining a second candidate character set; the target position is the position of the target character in the initial recognition result; determining the replacement character for the target character based on the intersection of the second candidate character set and the first target candidate character set; the first target candidate character set is the first candidate character set corresponding to the target character; if the replacement character is different from the target character in the initial recognition result, replacing the target character in the initial recognition result with the replacement character, and determining the replaced initial recognition result as the target recognition result of the image to be recognized; if the replacement character is the same as the target character in the initial recognition result, determining the initial recognition result as the target recognition result of the image to be recognized.
2. The text recognition method according to claim 1, characterized in that, The process of recognizing the text in the image to be recognized to obtain an initial recognition result includes: The image to be identified is input into a first preset model for text feature recognition processing to obtain the initial recognition result; the first preset model is trained based on text feature recognition of multiple sample images.
3. The text recognition method according to claim 1, characterized in that, The step of determining the target character in the initial recognition result whose first confidence level is less than or equal to a first preset threshold includes: Obtain the first confidence level of each initial character in the initial recognition result; The initial character whose first confidence level is less than or equal to the first preset threshold is determined as the target character.
4. The text recognition method according to claim 1, characterized in that, The initial recognition result is subjected to semantic feature extraction processing to predict the character at the target position in the initial recognition result, resulting in a second candidate character set, including: The initial recognition result and the target position are input into the second preset model for semantic feature extraction processing to obtain the second candidate character set; the second preset model is trained based on text feature recognition of multiple sample texts.
5. The text recognition method according to claim 4, characterized in that, The step of determining the replacement character for the target character based on the intersection of the second candidate character set and the first candidate character set corresponding to the target character includes: The target confidence level of each candidate character in the intersection is determined; the target confidence level of each candidate character is the sum of the first confidence level and the second confidence level corresponding to each candidate character; the second confidence level is obtained by the second preset model in the process of determining the second candidate character set; Candidate characters in the intersection whose target confidence is greater than or equal to a preset confidence are determined as replacement characters for the target character.
6. The text recognition method according to any one of claims 2-4, characterized in that, The text feature recognition processing of the image to be recognized to obtain the initial recognition result includes: For the preset position in the image to be identified, text feature recognition processing is performed based on the first preset model to obtain the first candidate character set corresponding to the preset position; The characters in the first candidate character set corresponding to the preset position whose first confidence level is greater than or equal to the second preset threshold are determined as the initial characters.
7. A text recognition device, characterized in that, The text recognition device includes a processing unit and a determination unit; The processing unit is configured to perform text feature recognition processing on the image to be recognized to obtain an initial recognition result; the initial recognition result includes at least one initial character; each initial character has a corresponding first candidate character set; Each initial character is a character in the first candidate character set corresponding to each initial character, and any first candidate character set is obtained by recognizing a preset position in the image to be recognized; The determining unit is configured to determine target characters in the initial recognition result whose first confidence level is less than or equal to a first preset threshold; the first confidence level is obtained during the process of determining the initial recognition result; The processing unit is further configured to perform semantic feature extraction processing on the initial recognition result, predict the character at the target position in the initial recognition result, and obtain a second candidate character set; the target position is the position of the target character in the initial recognition result; The determining unit is further configured to perform a determination of the replacement character for the target character based on the intersection of the second candidate character set and the first target candidate character set; If the replacement character is different from the target character in the initial recognition result, the target character in the initial recognition result is replaced with the replacement character, and the replaced initial recognition result is determined as the target recognition result of the image to be recognized. 
If the replacement character is the same as the target character in the initial recognition result, the initial recognition result is determined as the target recognition result of the image to be recognized; The first target candidate character set is the first candidate character set corresponding to the target character.
8. An electronic device, characterized in that it includes: a processor and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the instructions to implement the text recognition method according to any one of claims 1-6.
9. A computer-readable storage medium storing instructions thereon, characterized in that, When the instructions in the computer-readable storage medium are executed by the processor of the electronic device, the electronic device is enabled to perform the text recognition method as described in any one of claims 1-6.