Text recognition method and device, model training method, electronic equipment and medium

By pre-training the visual and semantic feature extraction models separately and then fusing their features, the method addresses the low recognition accuracy of OCR models under lighting variation, noise, and character occlusion, improving the robustness and recognition performance of the model and broadening its application scope.

CN115631502B · Active · Publication Date: 2026-06-26 · BEIJING BAIDU NETCOM SCI & TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING BAIDU NETCOM SCI & TECH CO LTD
Filing Date
2022-10-21
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing OCR recognition technologies are not robust enough when faced with external interference such as lighting and noise, and their recognition performance for incomplete characters is poor, resulting in low recognition accuracy and making it difficult to apply them effectively in natural scenes.

Method used

By pre-training the visual feature extraction model and the semantic feature extraction model separately, and then combining the visual and semantic features for feature fusion, the robustness and recognition accuracy of the model are improved.

Benefits of technology

It improves the recognition accuracy of OCR models under lighting, noise, and character occlusion conditions, enhances the versatility and portability of the models, and expands the application areas of OCR.

Abstract

The present disclosure relates to the technical field of artificial intelligence, in particular to deep learning, image processing, and computer vision, and can be applied to OCR scenarios. The specific implementation scheme is: training a first neural network based on image samples to obtain a visual feature extraction model; training a second neural network based on text samples to obtain a semantic feature extraction model; training the visual feature extraction model based on the image samples; obtaining the text corresponding to the image samples based on the visual features output in the training process of the visual feature extraction model; and training the semantic feature extraction model based on the text until the visual feature extraction model and the semantic feature extraction model converge. Because the sub-model for extracting visual features and the sub-model for extracting semantic features are pre-trained separately before the text recognition model is trained, the robustness of the text recognition model is improved, and thus the accuracy of text recognition is improved.

Description

Technical Field

[0001] This disclosure relates to the field of artificial intelligence technology, specifically to the fields of deep learning, image processing, and computer vision, and can be applied to scenarios such as OCR (optical character recognition).

Background Technology

[0002] With the continuous upgrading of computing resources and the development of deep learning, OCR recognition technology has become increasingly mature and plays an important role in scenarios such as transportation, card verification, and traffic control. However, images in natural scenes inevitably suffer from interference such as lighting and noise. In addition, due to manual shooting, there are sometimes incomplete or obstructed images, which affect the recognition effect of OCR.

[0003] Existing text recognition methods involve: inputting an image containing text, selecting candidate text regions, then extracting the corresponding text regions from the original image based on the candidate regions, inputting them into a text recognition model for text recognition, and obtaining the final recognition result. On one hand, when facing external interference such as lighting and noise, the robustness of the model to certain interferences can be improved by increasing the corresponding training data. However, data collection/generation, data labeling, and model training all require time and manpower, resulting in high costs. Furthermore, the model's learning ability is limited by its structure and may not be able to learn all scenarios. On the other hand, when characters in an image are incomplete, it may lead to misrecognition or missed recognition by the model. Currently, a relatively effective method is to correct this through error correction of the recognition results. However, this two-stage recognition method has poor performance, and the correction effect depends entirely on the error correction module, making practical application difficult.

Summary of the Invention

[0004] This disclosure provides a character recognition method, a character recognition device, a training method for a character recognition model, a training device, an electronic device, and a storage medium.

[0005] According to a first aspect of this disclosure, a model training method is provided, comprising:

[0006] A visual feature extraction model is obtained by training the first neural network based on image samples;

[0007] A semantic feature extraction model is obtained by training a second neural network based on text samples;

[0008] The visual feature extraction model is trained based on the image samples;

[0009] Based on the visual features output during the training of the visual feature extraction model, the text corresponding to the image sample is obtained;

[0010] The semantic feature extraction model is trained based on the text until the visual feature extraction model and the semantic feature extraction model converge.

[0011] According to a second aspect of this disclosure, a character recognition method is provided, comprising:

[0012] Obtain the image of the text to be recognized;

[0013] Extract the visual features of the text image to be identified;

[0014] Semantic features are extracted based on the visual features;

[0015] The visual features and the semantic features are fused;

[0016] Text prediction is performed based on the fused features to obtain the text recognition result.

[0017] According to a third aspect of this disclosure, a model training apparatus is provided, comprising:

[0018] The first training module is configured to train the first neural network based on image samples to obtain a visual feature extraction model;

[0019] The second training module is configured to train the second neural network based on text samples to obtain a semantic feature extraction model;

[0020] The third training module is configured to train the visual feature extraction model based on the image samples;

[0021] The third training module obtains the text corresponding to the image sample based on the visual features output during the training process of the visual feature extraction model;

[0022] The third training module trains the semantic feature extraction model based on the text until the visual feature extraction model and the semantic feature extraction model converge.

[0023] According to a fourth aspect of this disclosure, a character recognition device is provided, comprising:

[0024] The acquisition module is configured to acquire images of the text to be recognized.

[0025] The first feature extraction module is configured to extract the visual features of the text image to be identified;

[0026] The second feature extraction module is configured to extract semantic features based on the visual features;

[0027] The feature fusion module is configured to fuse the visual features with the semantic features;

[0028] The text recognition module is configured to perform text prediction based on the features after feature fusion to obtain text recognition results.

[0029] According to a fifth aspect of this disclosure, an electronic device is provided, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method described in any one of the above technical solutions.

[0030] According to a sixth aspect of this disclosure, a non-transitory computer-readable storage medium is provided storing computer instructions, wherein the computer instructions are used to cause the computer to perform the method according to any one of the above-described technical solutions.

[0031] According to a seventh aspect of this disclosure, a computer program product is provided, comprising a computer program that, when executed by a processor, implements the method according to any one of the above-described technical solutions.

[0032] This disclosure provides a character recognition method, a character recognition device, a model training method, a model training device, an electronic device, and a storage medium. Before training the character recognition model, the sub-models for extracting visual features and the sub-models for extracting semantic features are pre-trained separately to improve the robustness of the character recognition model and the accuracy of character recognition.

[0033] It should be understood that the description in this section is not intended to identify key or essential features of the embodiments of this disclosure, nor is it intended to limit the scope of this disclosure. Other features of this disclosure will become readily apparent from the following description.

Attached Figure Description

[0034] The accompanying drawings are provided to better understand this solution and do not constitute a limitation of this disclosure. Wherein:

[0035] Figure 1 is a schematic diagram of the steps of the text recognition model training method in an embodiment of this disclosure;

[0036] Figure 2 is a schematic diagram of the training steps of the visual feature extraction model in an embodiment of this disclosure;

[0037] Figure 3 is a schematic diagram of the training steps of the semantic feature extraction model in an embodiment of this disclosure;

[0038] Figure 4 is a flowchart illustrating the text recognition method in an embodiment of this disclosure;

[0039] Figure 5 is a schematic diagram of the steps of the character recognition method in an embodiment of this disclosure;

[0040] Figure 6 is a schematic block diagram of the model training device in an embodiment of this disclosure;

[0041] Figure 7 is a schematic block diagram of the character recognition device in an embodiment of this disclosure;

[0042] Figure 8 is a schematic block diagram of an example electronic device in an embodiment of this disclosure.

Detailed Implementation

[0043] The exemplary embodiments of this disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to aid understanding, and should be considered merely exemplary. Therefore, those skilled in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of this disclosure. Similarly, for clarity and brevity, descriptions of well-known functions and structures are omitted in the following description.

[0044] This disclosure provides a model training method which, as shown in Figure 1, includes:

[0045] Step S101: Train the first neural network based on image samples to obtain a visual feature extraction model. Image samples can be image data containing text. As shown in Figure 2, the input to the first neural network can be an image containing text. Following the principle of self-supervision, the basic structure of the first neural network can include an Encoder module 201 and a Decoder module 202. The Encoder module 201 extracts image features, and the Decoder module 202 reconstructs the input from them. The training loss is the reconstruction loss between the output and the input (generally an L2 loss). Once training converges, the model is able to extract visual features.
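The reconstruction objective described in step S101 can be illustrated with a minimal sketch. It assumes the image is flattened to a list of pixel values and that `encoder` and `decoder` are arbitrary callables; these names, and the use of a mean over pixels, are illustrative assumptions and are not taken from the disclosure.

```python
from typing import Callable, List, Sequence


def l2_reconstruction_loss(original: Sequence[float],
                           reconstructed: Sequence[float]) -> float:
    """Mean squared (L2) error between an input and its reconstruction."""
    if len(original) != len(reconstructed):
        raise ValueError("input and reconstruction must have the same length")
    if not original:
        raise ValueError("input must not be empty")
    squared = ((x - y) ** 2 for x, y in zip(original, reconstructed))
    return sum(squared) / len(original)


def autoencoder_loss(pixels: Sequence[float],
                     encoder: Callable[[Sequence[float]], List[float]],
                     decoder: Callable[[List[float]], List[float]]) -> float:
    """Encode the input, decode it again, and score the reconstruction.

    `encoder` and `decoder` are hypothetical stand-ins for the Encoder
    module 201 and Decoder module 202.
    """
    features = encoder(pixels)          # visual features
    reconstruction = decoder(features)  # attempt to rebuild the input
    return l2_reconstruction_loss(pixels, reconstruction)
```

A perfect reconstruction yields a loss of zero; in an actual training loop this value would be minimized with respect to the encoder and decoder weights.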

[0046] Step S102: Train the second neural network based on text samples to obtain a semantic feature extraction model. Text samples can be plain text data. As shown in Figure 3, the input to the second neural network can be the plain text "Hello World". The second neural network can be an MLM (Masked Language Model), trained specifically as a language model. Compared with using image data as training samples, training on plain text can further improve the robustness of the language model.

[0047] Step S103: Train the visual feature extraction model based on image samples. After pre-training in steps S101 and S102, the visual feature extraction model and the semantic feature extraction model are obtained respectively. Then, the text recognition model is obtained by training the two sub-models simultaneously.

[0048] Step S104: Based on the visual features output during the training of the visual feature extraction model, obtain the text corresponding to the image sample. The image sample is input into the visual feature extraction model, which outputs the corresponding visual features. The corresponding text is then obtained by classifying the visual features.

[0049] Step S105: Train the semantic feature extraction model based on the text until both the visual feature extraction model and the semantic feature extraction model converge. As shown in Figure 4, the visual feature extraction model and the semantic feature extraction model are first imported. Then text-line images are input, and the entire pipeline is trained jointly until both sub-models converge. The text recognition model shown in Figure 4 is used for text recognition.

[0050] Before the text recognition model of this disclosure is trained, its sub-models, namely the visual feature extraction model and the semantic feature extraction model, are pre-trained. These two sub-models extract visual and semantic features, respectively, and text recognition is performed on the fused visual and semantic features. Pre-training the visual and semantic feature extraction models improves the model's robustness. Because the trained model fuses visual and semantic features, the impact of noise or occlusion on text recognition is significantly reduced, which improves accuracy. Furthermore, the model exhibits strong versatility and portability, making it widely applicable in OCR scenarios and expanding the application areas of OCR.

[0051] As an optional implementation, step S102, which trains the second neural network based on text samples to obtain a semantic feature extraction model, includes: randomly masking at least one character in the text sample, and converting all characters in the randomly masked text sample into corresponding first character identification codes; and inputting all the first character identification codes into the second neural network (MLM) for training to obtain the semantic feature extraction model.

[0052] As shown in Figure 3, the semantic feature extraction model can be used to extract semantic information. The pre-training process of the semantic feature extraction model includes: Step S301, the plain text "Hello World" is input to the MLM model; Step S302, the "e" character in "Hello World" is randomly masked to obtain "H[M]llo World"; Step S303, all characters in "H[M]llo World" are converted into corresponding character IDs (identifiers, i.e., character identification codes) and sent to the language model (Masked Language Model, MLM); Step S304, the Head module then predicts the characters as a classification task. The training loss is cross-entropy. Once training converges, the text model is able to extract semantic information: when characters are randomly masked, it can correctly predict the masked characters from the semantic information. This disclosure enhances the robustness of the language model by training on pure text data during pre-training and randomly masking a certain proportion of the characters in the text data. When extracting semantic features, the language model can predict characters that are incomplete or occluded in the image by referring to the context, thereby improving the accuracy of text recognition.
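Steps S302 and S303 (random masking followed by conversion to character IDs) can be sketched as follows. The vocabulary layout, the `mask_ratio` parameter, and the function names are assumptions made for illustration; the disclosure only states that a certain proportion of characters is masked and that characters are converted to IDs.

```python
import random
from typing import Dict, Iterable, List, Tuple

MASK_TOKEN = "[M]"  # mask symbol as written in the "H[M]llo World" example


def build_vocab(texts: Iterable[str]) -> Dict[str, int]:
    """Assign ID 0 to the mask token and consecutive IDs to sorted characters."""
    vocab = {MASK_TOKEN: 0}
    for ch in sorted({ch for text in texts for ch in text}):
        vocab[ch] = len(vocab)
    return vocab


def mask_characters(text: str, mask_ratio: float,
                    rng: random.Random) -> Tuple[List[str], List[int]]:
    """Replace at least one randomly chosen character with the mask token."""
    if not text:
        raise ValueError("text must not be empty")
    tokens = list(text)
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_ratio)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    for pos in positions:
        tokens[pos] = MASK_TOKEN
    return tokens, positions


def to_ids(tokens: Iterable[str], vocab: Dict[str, int]) -> List[int]:
    """Convert characters (and mask tokens) to their character IDs."""
    return [vocab[token] for token in tokens]


text = "Hello World"
vocab = build_vocab([text])
tokens, positions = mask_characters(text, mask_ratio=0.1, rng=random.Random(0))
input_ids = to_ids(tokens, vocab)                 # fed to the language model
target_ids = [vocab[text[p]] for p in positions]  # labels for the masked slots
```

The model would be trained with cross-entropy to predict `target_ids` at the masked positions given `input_ids`.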

[0053] As an optional implementation, step S103, training the semantic feature extraction model based on text, includes: converting characters in the text into corresponding second character identification codes; and inputting the second character identification codes into the semantic feature extraction model for training.

[0054] As shown in Figure 4, when training the text recognition model, the encoding module 401 (Encoder) of the visual feature extraction model and the semantic feature extraction model 402 (MLM model) are first imported. The entire training process shown in Figure 4 is as follows: An image containing text is input into the model. First, visual features are extracted by the Encoder module 401. After extraction, the training is divided into two parallel paths. One path classifies the visual features extracted by the Encoder module 401 using the first classification module 403, then predicts the character IDs corresponding to the characters using the calculation module 404 (including Softmax and Argmax functions). The character IDs are then input into the MLM model 402 for semantic feature extraction. The other path maps the visual features extracted by the Encoder module 401 using the mapping module 405 so that the visual and semantic features lie in the same feature space. The feature fusion module 406 then fuses the visual and semantic information, and the second classification module 407 predicts the text. During the training process shown in Figure 4, the weights of the encoding module 401 and the semantic feature extraction model 402 are fixed and not updated in the first two rounds; only the weights of the other modules are updated. After two rounds of training, the weights of all modules are updated together.
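The freezing schedule at the end of this paragraph can be expressed as a small helper that reports which modules receive weight updates in a given round. The module names are hypothetical labels derived from the reference numerals in Figure 4, and the sketch assumes a zero-based round index; the calculation module 404 is omitted because Softmax and Argmax have no trainable weights.

```python
from typing import Set

FROZEN_ROUNDS = 2  # pre-trained modules stay fixed for the first two rounds

# Hypothetical labels based on the reference numerals in Figure 4.
PRETRAINED_MODULES = ("encoder_401", "mlm_402")
NEW_MODULES = ("classifier_403", "mapping_405", "fusion_406", "classifier_407")


def trainable_modules(round_index: int) -> Set[str]:
    """Return the modules whose weights are updated in a zero-based round."""
    if round_index < 0:
        raise ValueError("round_index must be non-negative")
    if round_index < FROZEN_ROUNDS:
        return set(NEW_MODULES)
    return set(PRETRAINED_MODULES) | set(NEW_MODULES)
```

In a deep learning framework, the returned set would typically drive whether gradients are enabled for each module's parameters before the round starts.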

[0055] It is worth noting that after feature extraction in the Encoder module 401, if visual features are directly used for text prediction, the model may misidentify characters due to incomplete characters in the image. However, this disclosure integrates semantic features into the visual features, allowing for timely adjustment of the recognition results and output of the correct characters. This eliminates the need for error correction after obtaining the recognition results, effectively improving the accuracy of text recognition and reducing misidentification or missed recognition.

[0056] This disclosure provides a character recognition method which, as shown in Figure 5, includes: Step S501: Acquire an image of the text to be recognized, for example the image shown in Figure 4, which includes the text "a toy bear"; Step S502: Extract visual features from the image of the text to be recognized; Step S503: Extract semantic features based on the visual features; Step S504: Fuse the visual features and semantic features; Step S505: Perform text prediction based on the fused features to obtain the text recognition result. Steps S502 to S505 are executed by the text recognition model trained using the model training method in the above embodiment. The visual feature extraction model in the text recognition model extracts the visual features of the image of the text to be recognized, and the semantic feature extraction model in the text recognition model further extracts semantic features based on the visual features. Finally, the text recognition model fuses the visual features and semantic features to perform text prediction.
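The data flow of steps S502 to S505 can be sketched as a function that chains pluggable components. All parameter names are hypothetical, and the element-wise addition in `add_fuse` is only one possible fusion operator chosen for illustration; the disclosure does not specify how the fusion is computed.

```python
from typing import Callable, List

Features = List[List[float]]  # one feature vector per character position


def add_fuse(visual: Features, semantic: Features) -> Features:
    """Illustrative fusion: element-wise sum of aligned feature vectors."""
    return [[a + b for a, b in zip(v, s)] for v, s in zip(visual, semantic)]


def recognize(image: object,
              encoder: Callable[[object], Features],
              classify: Callable[[Features], List[int]],
              language_model: Callable[[List[int]], Features],
              project: Callable[[Features], Features],
              fuse: Callable[[Features, Features], Features],
              predict: Callable[[Features], str]) -> str:
    """Run the recognition flow on an already acquired image (S501)."""
    visual = encoder(image)                  # S502: visual features
    char_ids = classify(visual)              # S503: provisional character IDs
    semantic = language_model(char_ids)      # S503: semantic features
    fused = fuse(project(visual), semantic)  # S504: fuse in a shared space
    return predict(fused)                    # S505: final text prediction
```

Passing the components as arguments keeps the sketch independent of any particular network architecture; trained modules would be supplied in place of the callables.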

[0057] This disclosure combines visual and semantic features. When some characters in an image are occluded or the image is incomplete, the extracted semantic features can be used for prediction. During the semantic feature extraction process, the language model can predict the missing characters through contextual relationships. Thus, combining semantic features can improve the accuracy of character recognition.

[0058] As an optional implementation, extracting semantic features based on the visual features includes: classifying the visual features to obtain the characters corresponding to the text image; calculating the corresponding character identification codes based on the characters; and extracting semantic features based on the character identification codes.

[0059] As shown in Figure 4, an image containing text is input into the model. First, visual features are extracted. After extraction, the visual features extracted by the Encoder module 401 are classified by the first classification module 403 to obtain the characters in the image. Then, the SA (Softmax+Argmax) module 404 calculates the character ID corresponding to each character. The character IDs are then input into the MLM model 402 for semantic feature extraction. Furthermore, before fusing visual and semantic features, a mapping module maps the visual features to the same feature space as the semantic features. Even if the text in the image is occluded, the language model can predict the occluded character based on the context. Therefore, fusing semantic features into visual features in this disclosure can effectively improve the accuracy of recognizing occluded characters.
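The Softmax+Argmax computation performed by module 404 can be sketched as follows. The sketch assumes the first classification module outputs one row of raw scores (logits) per character position, with one score per character ID; that layout and the function names are illustrative assumptions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one position's scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (the first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def logits_to_char_ids(logits_per_position: Sequence[Sequence[float]]) -> List[int]:
    """Pick the most probable character ID at each position."""
    return [argmax(softmax(row)) for row in logits_per_position]
```

Because softmax preserves the ordering of the scores, the argmax of the probabilities equals the argmax of the raw logits; the softmax step matters when the probabilities themselves are needed, for example for a loss.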

[0060] This disclosure provides a model training device 600 which, as shown in Figure 6, includes:

[0061] The first training module 601 is configured to train the first neural network based on image samples to obtain a visual feature extraction model. Image samples can be image data containing text. As shown in Figure 2, the input to the first neural network can be an image containing text. Following the principle of self-supervision, the basic structure of the first neural network can include an Encoder module 201 and a Decoder module 202. The Encoder module 201 extracts image features, and the Decoder module 202 reconstructs the input from them. The training loss is the reconstruction loss between the output and the input (generally an L2 loss). Once training converges, the model is able to extract visual features.

[0062] The second training module 602 is configured to train the second neural network based on text samples to obtain a semantic feature extraction model. The text samples can be plain text data. As shown in Figure 3, the input to the second neural network can be the plain text "Hello World". The second neural network can be an MLM (Masked Language Model), trained specifically as a language model. Compared with using image data as training samples, training on plain text can further improve the robustness of the language model.

[0063] The third training module 603 is configured to train the visual feature extraction model based on image samples. After pre-training by the first training module 601 and the second training module 602, the visual feature extraction model and the semantic feature extraction model are obtained respectively. The third training module 603 then trains the text recognition model simultaneously based on the two sub-models, the visual feature extraction model and the semantic feature extraction model.

[0064] The third training module 603 obtains the text corresponding to the image samples based on the visual features output during the training of the visual feature extraction model. The image samples are input into the visual feature extraction model, which outputs the corresponding visual features. The corresponding text is then obtained by classifying the visual features.

[0065] The third training module 603 trains the semantic feature extraction model based on text until both the visual feature extraction model and the semantic feature extraction model converge. As shown in Figure 4, the visual feature extraction model and the semantic feature extraction model are first imported. Then text-line images are input, and the entire pipeline is trained jointly until both sub-models converge. The text recognition model shown in Figure 4 is used for text recognition.

[0066] Before the text recognition model of this disclosure is trained, its sub-models, namely the visual feature extraction model and the semantic feature extraction model, are pre-trained. These two trained sub-models extract visual and semantic features, respectively, and text recognition is performed on the fused visual and semantic features. Pre-training the visual and semantic feature extraction models improves the model's robustness. Because the trained model fuses visual and semantic features, the impact of noise or occlusion on text recognition is significantly reduced, which improves accuracy. Furthermore, the model exhibits strong versatility and portability, making it widely applicable in OCR scenarios and expanding the application areas of OCR.

[0067] As an optional implementation, the second training module 602 trains the second neural network based on text samples to obtain a semantic feature extraction model by: randomly masking at least one character in the text sample, and converting all characters in the randomly masked text sample into corresponding first character identification codes; and inputting all the first character identification codes into the second neural network for training to obtain the semantic feature extraction model. As shown in Figure 3, the semantic feature extraction model can be used to extract semantic information. The pre-training process of the semantic feature extraction model includes: Step S301, the plain text "Hello World" is input to the MLM model; Step S302, the "e" character in "Hello World" is randomly masked to obtain "H[M]llo World"; Step S303, all characters in "H[M]llo World" are converted into corresponding character IDs and sent to the MLM model; Step S304, the Head module then predicts the characters as a classification task and outputs the text prediction result "Hello World". The training loss is cross-entropy. Once training converges, the text model is able to extract semantic information: when characters are randomly masked, it can correctly predict the masked characters from the semantic information. This disclosure enhances the robustness of the language model by training on pure text data during pre-training and randomly masking a certain proportion of the characters in the text data. When extracting semantic features, the language model can predict characters that are incomplete or occluded in the image by referring to the context, thereby improving the accuracy of text recognition.

[0068] As an optional implementation, the third training module 603 trains the semantic feature extraction model based on text by: converting characters in the text into corresponding second character identification codes; and inputting the second character identification codes into the semantic feature extraction model for training. As shown in Figure 4, when training the text recognition model, the pre-trained encoding module 401 and the MLM model 402 (i.e., the semantic feature extraction model) are first imported. The entire training process shown in Figure 4 is as follows: An image containing text is input into the model. First, visual features are extracted by the Encoder module 401. After extraction, the training is divided into two parallel paths. One path classifies the visual features extracted by the Encoder module 401 using the first classification module 403, then predicts the character IDs corresponding to the characters using the SA (Softmax+Argmax) module 404. The character IDs are then input into the MLM model 402 for semantic feature extraction. The other path maps the visual features extracted by the Encoder module 401 using the mapping module 405 so that the visual and semantic features lie in the same feature space. The feature fusion module 406 then fuses the visual and semantic information, and the second classification module 407 predicts the text. During the training process shown in Figure 4, the weights of the encoding module 401 and the semantic feature extraction model 402 are fixed and not updated in the first two rounds; only the weights of the other modules are updated. After two rounds of training, the weights of all modules are updated together.

[0069] It is worth noting that after feature extraction in the Encoder module 401, if the visual features are directly input into the classification module (Head) for text prediction (as shown by the dashed line in Figure 4), the model may be affected by incomplete characters in the image, leading to misidentification. However, this disclosure integrates semantic features into the visual features, allowing for timely adjustments to the recognition results and outputting the correct characters. This eliminates the need for post-recognition error correction, effectively improving the accuracy of text recognition and reducing misidentification or missed recognition.

[0070] This disclosure provides a character recognition device 700 which, as shown in Figure 7, includes: an acquisition module 701 configured to acquire an image of text to be recognized; a first feature extraction module 702 configured to extract visual features from the image of text to be recognized; a second feature extraction module 703 configured to extract semantic features based on the visual features; a feature fusion module 704 configured to fuse the visual features and semantic features; and a text recognition module 705 configured to perform text prediction based on the fused features to obtain the text recognition result. This disclosure combines visual and semantic features. When some characters in an image are occluded or the image is incomplete, prediction can be made using the extracted semantic features. During the semantic feature extraction process, the language model can predict missing characters through contextual relationships, thereby improving the accuracy of text recognition by combining semantic features.

[0071] As an optional implementation, the second feature extraction module 703 extracts semantic features based on visual features by: classifying the visual features to obtain the characters corresponding to the text image; calculating the corresponding character identification codes based on the characters; and extracting semantic features based on the character identification codes. As shown in Figure 4, an image containing text is input into the model. First, visual features are extracted. After extraction, the first classification module 403 classifies the visual features extracted by the Encoder module 401 to obtain the characters in the image. Then, the SA (Softmax+Argmax) module 404 calculates the character ID corresponding to each character. The character IDs are then input into the MLM model 402 for semantic feature extraction. Furthermore, the text recognition device also includes a mapping module. Before fusing visual and semantic features, the mapping module maps the visual features to the same feature space as the semantic features. Even if the text in the image is occluded, the language model can predict the occluded character based on the context. Therefore, fusing semantic features into visual features in this disclosure can effectively improve the accuracy of recognizing occluded characters.

[0072] The acquisition, storage, and application of user personal information involved in the technical solution disclosed herein comply with the provisions of relevant laws and regulations and do not violate public order and good morals.

[0073] According to embodiments of this disclosure, this disclosure also provides an electronic device, a readable storage medium, and a computer program product.

[0074] Figure 8 shows a schematic block diagram of an example electronic device 800 that can be used to implement embodiments of the present disclosure. The electronic device is intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital processors, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are merely illustrative and are not intended to limit the implementation of the present disclosure described and/or claimed herein.

[0075] As shown in Figure 8, device 800 includes a computing unit 801, which can perform various appropriate actions and processes based on a computer program stored in read-only memory (ROM) 802 or a computer program loaded from storage unit 808 into random access memory (RAM) 803. RAM 803 may also store various programs and data required for the operation of device 800. The computing unit 801, ROM 802, and RAM 803 are interconnected via bus 804. An input/output (I/O) interface 805 is also connected to bus 804.

[0076] Multiple components in device 800 are connected to I/O interface 805, including: an input unit 806, such as a keyboard or mouse; an output unit 807, such as various types of monitors and speakers; a storage unit 808, such as a magnetic disk or optical disk; and a communication unit 809, such as a network card, modem, or wireless transceiver. Communication unit 809 allows device 800 to exchange information/data with other devices through computer networks such as the Internet and/or various telecommunications networks.

[0077] The computing unit 801 can be any of a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various special-purpose artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the various methods and processes described above, such as the character recognition method or the model training method. For example, in some embodiments, the character recognition method or model training method may be implemented as a computer software program tangibly contained in a machine-readable medium, such as storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed on device 800 via ROM 802 and/or communication unit 809. When the computer program is loaded into RAM 803 and executed by the computing unit 801, one or more steps of the character recognition method or model training method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the character recognition method or the model training method by any other suitable means (e.g., by means of firmware).

[0078] Various embodiments of the systems and techniques described herein can be implemented in digital electronic circuit systems, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems-on-a-chip (SoCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include implementations in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a dedicated or general-purpose programmable processor, capable of receiving data and instructions from a storage system, at least one input device, and at least one output device, and transmitting data and instructions to the storage system, the at least one input device, and the at least one output device.

[0079] The program code used to implement the methods of this disclosure may be written in any combination of one or more programming languages. This program code may be provided to a processor or controller of a general-purpose computer, special-purpose computer, or other programmable data processing apparatus, such that when executed by the processor or controller, the program code causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may be executed entirely on a machine, partially on a machine, as a standalone software package partially on a machine and partially on a remote machine, or entirely on a remote machine or server.

[0080] In the context of this disclosure, a machine-readable medium can be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device. A machine-readable medium can be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium can be, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include electrical connections based on one or more wires, portable computer disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.

[0081] To provide interaction with a user, the systems and techniques described herein can be implemented on a computer having: a display device for displaying information to the user (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing device (e.g., a mouse or trackball) through which the user provides input to the computer. Other types of devices can also be used to provide interaction with the user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user can be received in any form (including sound input, voice input, or tactile input).

[0082] The systems and technologies described herein can be implemented in computing systems that include backend components (e.g., as a data server), or computing systems that include middleware components (e.g., an application server), or computing systems that include frontend components (e.g., a user computer with a graphical user interface or web browser through which a user can interact with embodiments of the systems and technologies described herein), or any combination of such backend, middleware, or frontend components. The components of the system can be interconnected via digital data communication of any form or medium (e.g., a communication network). Examples of communication networks include local area networks (LANs), wide area networks (WANs), and the Internet.

[0083] Computer systems can include clients and servers. Clients and servers are generally remote from each other and typically interact via a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. Servers can be cloud servers, servers in distributed systems, or servers incorporating blockchain technology.

[0084] It should be understood that steps may be reordered, added, or deleted using the various forms of processes shown above. For example, the steps described in this disclosure can be executed in parallel, sequentially, or in different orders, as long as the desired result of the technical solution disclosed herein can be achieved; no limitation is imposed in this respect.

[0085] The specific embodiments described above do not constitute a limitation on the scope of protection of this disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions can be made according to design requirements and other factors. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of this disclosure should be included within the scope of protection of this disclosure.

Claims

1. A model training method, comprising:
training a first neural network based on image samples to obtain a visual feature extraction model, wherein the first neural network includes an encoding module and a decoding module, the encoding module is used to extract image features, the decoding module is used to reconstruct the input image, and the first neural network is trained based on the reconstruction loss between the output of the decoding module and the input image to obtain the visual feature extraction model;
randomly masking at least one character in a text sample; converting all characters in the text sample after the random masking into corresponding first character identification codes; inputting the first character identification codes into an MLM model serving as a second neural network for training; and adjusting the MLM model based on the cross-entropy loss between the predicted characters and the real characters to obtain a semantic feature extraction model;
inputting an image containing a line of text into the visual feature extraction model, and extracting visual features through the encoding module of the visual feature extraction model, wherein, after the visual features are extracted, processing is divided into two parallel paths: in the first path, the visual features extracted by the encoding module are classified by a first classification module, the character identification code corresponding to each character is then predicted by a calculation module, and the character identification code is input into the semantic feature extraction model for semantic feature extraction; in the second path, the visual features extracted by the encoding module are mapped by a mapping module so that the visual features and the semantic features are kept in the same feature space;
fusing the visual features and the semantic information through a feature fusion module, predicting the text through a second classification module, and calculating the loss between the prediction result and the true label, wherein, during training, the weights of the encoding module and the semantic feature extraction model are fixed and not updated in the first two rounds, and only the weights of the first classification module, the calculation module, the mapping module, the feature fusion module, and the second classification module are updated; and after two rounds of training, the weights of all modules are updated together; and
adjusting the parameters of the visual feature extraction model and the semantic feature extraction model according to the loss until the visual feature extraction model and the semantic feature extraction model converge.
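
By way of illustration only, and not as part of the claims, the random masking and character-to-identification-code conversion recited in claim 1 can be sketched in Python. The `[MASK]` token, the per-character masking probability of 0.15, the vocabulary construction, and all function names are assumptions introduced for this sketch; the claim only requires that at least one character be masked.

```python
import random
from typing import Dict, List, Tuple

MASK_TOKEN = "[MASK]"  # hypothetical placeholder for a masked character

def build_vocab(samples: List[str]) -> Dict[str, int]:
    """Assign an integer identification code to every character seen."""
    vocab = {MASK_TOKEN: 0}
    for text in samples:
        for ch in text:
            vocab.setdefault(ch, len(vocab))
    return vocab

def mask_and_encode(
    text: str,
    vocab: Dict[str, int],
    rng: random.Random,
    mask_prob: float = 0.15,  # assumed value, not from the disclosure
) -> Tuple[List[int], Dict[int, int]]:
    """Mask at least one character, then convert all characters to codes.

    Returns the input codes and a {position: true code} map that serves as
    the targets for a cross-entropy loss at the masked positions.
    """
    if not text:
        raise ValueError("text sample must be non-empty")
    chars = list(text)
    positions = [i for i in range(len(chars)) if rng.random() < mask_prob]
    if not positions:
        # Guarantee that at least one character is masked.
        positions = [rng.randrange(len(chars))]
    targets = {i: vocab[chars[i]] for i in positions}
    for i in positions:
        chars[i] = MASK_TOKEN
    input_ids = [vocab[ch] for ch in chars]
    return input_ids, targets

sample = "hello"
vocab = build_vocab([sample])
input_ids, targets = mask_and_encode(sample, vocab, random.Random(0))
```

The `input_ids` sequence would be fed to the MLM model, and the model's predictions at the positions in `targets` would be compared with the true codes.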

2. The method according to claim 1, wherein the method further comprises: converting the characters in the text into corresponding second character identification codes; and inputting the second character identification codes into the semantic feature extraction model for training.

3. A model training device, comprising:
a first training module configured to train a first neural network based on image samples to obtain a visual feature extraction model, wherein the first neural network includes an encoding module and a decoding module, the encoding module is used to extract image features, the decoding module is used to reconstruct the input image, and the first neural network is trained based on the reconstruction loss between the output of the decoding module and the input image to obtain the visual feature extraction model;
a second training module configured to randomly mask at least one character in a text sample; convert all characters in the text sample after the random masking into corresponding first character identification codes; input the first character identification codes into an MLM model serving as a second neural network for training; and adjust the MLM model based on the cross-entropy loss between the predicted characters and the real characters to obtain a semantic feature extraction model; and
a third training module configured to input an image containing a line of text into the visual feature extraction model and extract visual features through the encoding module of the visual feature extraction model, wherein, after the visual features are extracted, processing is divided into two parallel paths: in the first path, the visual features extracted by the encoding module are classified by a first classification module, the character identification code corresponding to each character is then predicted by a calculation module, and the character identification code is input into the semantic feature extraction model for semantic feature extraction; in the second path, the visual features extracted by the encoding module are mapped by a mapping module so that the visual features and the semantic features are kept in the same feature space; a feature fusion module then fuses the visual features and the semantic information, and a second classification module predicts the text and the loss between the prediction result and the true label is calculated; during training, the weights of the encoding module and the semantic feature extraction model are fixed and not updated in the first two rounds, and only the weights of the first classification module, the calculation module, the mapping module, the feature fusion module, and the second classification module are updated; after two rounds of training, the weights of all modules are updated together; and the parameters of the visual feature extraction model and the semantic feature extraction model are adjusted according to the loss until the visual feature extraction model and the semantic feature extraction model converge.
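
By way of illustration only, and not as part of the claims, the two-stage weight-update schedule recited above can be sketched in Python. The module name strings and the function `trainable_modules` are hypothetical, and a "round" is assumed here to be indexed from zero.

```python
from typing import Set

# Pre-trained parts whose weights stay fixed during the first rounds.
PRETRAINED: Set[str] = {
    "encoding_module",
    "semantic_feature_extraction_model",
}

# Modules whose weights are updated from the first round onward.
NEW_MODULES: Set[str] = {
    "first_classification_module",
    "calculation_module",
    "mapping_module",
    "feature_fusion_module",
    "second_classification_module",
}

def trainable_modules(round_index: int, frozen_rounds: int = 2) -> Set[str]:
    """Return the modules whose weights are updated in a given round.

    round_index is zero-based, so with frozen_rounds=2 the pre-trained
    parts stay fixed in rounds 0 and 1 and are released from round 2.
    """
    if round_index < 0:
        raise ValueError("round_index must be non-negative")
    if round_index < frozen_rounds:
        return set(NEW_MODULES)
    return NEW_MODULES | PRETRAINED

schedule = [sorted(trainable_modules(r)) for r in range(4)]
```

In a deep learning framework, the returned set would typically be used to switch gradient updates on or off for each module before a round starts.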

4. The device according to claim 3, wherein the third training module is further configured to: convert the characters in the text into corresponding second character identification codes; and input the second character identification codes into the semantic feature extraction model for training.

5. An electronic device, comprising: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions enabling the at least one processor to perform the method according to any one of claims 1-2.

6. A non-transitory computer-readable storage medium storing computer instructions, wherein the computer instructions are used to cause a computer to perform the method according to any one of claims 1-2.

7. A computer program product comprising a computer program that, when executed by a processor, implements the method according to any one of claims 1-2.