Handwritten text recognition method and apparatus

By using feature extraction and attention mechanisms in a handwritten text classification model, combined with upsampling and downsampling fusion, the problem of low accuracy in handwritten text recognition was solved, achieving higher recognition accuracy.

CN116030474B · Active · Publication Date: 2026-06-16 · SF TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: SF TECH CO LTD
Filing Date: 2021-10-25
Publication Date: 2026-06-16

Application Information

IPC: G06V30/19; G06V30/148; G06V30/18; G06N3/048; G06N3/08; G06N3/0464

AI Tagging

Application Domain

Neural learning methods

AI Technical Summary

⚠Technical Problem

The accuracy of handwritten text recognition in existing technologies is not high. Complex Chinese characters, differences in individual handwriting, and the shooting environment make it difficult to accurately recognize text such as names and dates.

⚗Method used

A handwritten text classification model is adopted, which uses the first and second feature extraction layers for feature extraction. It combines channel attention and spatial attention mechanisms, and improves feature representation ability and enhances model prediction accuracy by fusing feature maps through upsampling and downsampling.

🎯Benefits of technology

It improves the accuracy of handwritten text recognition, enhances the expression of global and local features, and improves the model's recognition performance.

✦ Generated by Eureka AI based on patent content.

Abstract

The application provides a handwritten text recognition method and apparatus. The handwritten text recognition method comprises: obtaining a first handwritten text image to be recognized; obtaining a first feature map and a second feature map of the first handwritten text image to be recognized, wherein the first feature map is obtained by performing feature extraction on the first handwritten text image to be recognized through a first feature extraction layer, the second feature map is obtained by performing feature extraction through the first feature extraction layer and a second feature extraction layer in sequence, and the dimension of the convolution operation of the second feature extraction layer is higher than that of the first feature extraction layer; performing upsampling fusion on the second feature map and the first feature map to obtain a third feature map; performing downsampling fusion on the third feature map and the second feature map to obtain a fourth feature map; and recognizing the first handwritten text image to be recognized based on the fourth feature map to obtain a handwritten text recognition result. The application can improve the accuracy of handwritten text recognition.

Description

Technical Field

[0001] This application mainly relates to the field of image recognition technology, specifically to a method and apparatus for handwritten text recognition.

Background Technology

[0002] To facilitate the collection, organization, storage, and transmission of important paper documents, video recording equipment can be used to capture and disseminate these documents. In some scenarios, it is necessary to review the quality of the content on electronic images to ensure that the document images contain the necessary signature information, such as the recipient's name, ID number, and signing time. For document images lacking the required information, feedback from the review algorithm can supervise the parties involved to fill in the required information and re-capture the image, thus automating the quality review of paper documents. This automated method, which replaces manual review, saves significant labor and time costs, and the standardized review criteria avoid the inconsistencies in human judgment.

[0003] Existing methods for classifying handwritten text based on semantic information are built upon the semantic recognition and translation of handwritten text. OCR recognition networks such as Long Short-Term Memory (LSTM) networks and Recurrent Neural Networks (RNNs) are used to recognize handwritten digits and printed text. RNNs perform well in processing simple data such as handwritten digits and characters, but struggle with complex Chinese characters, and the network's inference process is time-consuming. The quality of handwritten text images is affected by the individual writer, the shooting environment, and the parameters of the handheld device. Different writers exhibit significant differences in the features of the same character, while similar characters show smaller differences. The shooting environment can also lead to blurring and illegible handwriting, increasing the difficulty of text detection and recognition. For target categories other than names, dates, and ID numbers, there are many types of handwriting, with significant intra-category differences and dispersed features. Based on these challenges, existing OCR recognition technologies struggle to solve the problem of recognizing Chinese characters in names, dates, and similar text.

[0004] In other words, the accuracy of handwritten text recognition in existing technologies is not high.

Summary of the Invention

[0005] This application provides a handwritten text recognition method and apparatus, aiming to solve the problem of low accuracy in handwritten text recognition in the prior art.

[0006] In a first aspect, this application provides a handwritten text recognition method applied to a handwritten text classification model, wherein the handwritten text classification model includes a first feature extraction layer and a second feature extraction layer, and the handwritten text recognition method includes:

[0007] A first handwritten text image to be recognized is acquired;

[0008] A first feature map and a second feature map of the first handwritten text image to be recognized are obtained, wherein the first feature map is obtained by performing feature extraction on the first handwritten text image to be recognized through the first feature extraction layer, and the second feature map is obtained by performing feature extraction on the first handwritten text image to be recognized through the first feature extraction layer and the second feature extraction layer in sequence, wherein the dimension of the convolution operation performed by the second feature extraction layer is higher than the dimension of the convolution operation performed by the first feature extraction layer.

[0009] The second feature map and the first feature map are upsampled and fused to obtain the third feature map;

[0010] The third feature map and the second feature map are downsampled and fused to obtain the fourth feature map;

[0011] The first handwritten text image to be recognized is recognized based on the fourth feature map to obtain the handwritten text recognition result.
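
The two fusion steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the text does not name the resampling or fusion operators, so nearest-neighbour upsampling, 2×2 max pooling, and element-wise addition are assumed here, and all function names are hypothetical. Feature maps are reduced to a single channel to keep the example small.

```python
# Single-channel feature maps as H x W lists of floats (illustrative only).

def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling (assumed operator)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def downsample2x(fmap):
    """2x2 max pooling with stride 2 (assumed operator)."""
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, len(fmap[0]), 2)]
        for r in range(0, len(fmap), 2)
    ]

def fuse(a, b):
    """Element-wise addition of two equally sized maps (assumed fusion)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def fuse_features(first, second):
    # Upsampling fusion: bring the deeper map up to the shallower resolution.
    third = fuse(upsample2x(second), first)
    # Downsampling fusion: bring the fused map back down to the deeper resolution.
    fourth = fuse(downsample2x(third), second)
    return third, fourth

first = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0],
         [13.0, 14.0, 15.0, 16.0]]
second = [[0.5, 1.5],
          [2.5, 3.5]]
third, fourth = fuse_features(first, second)
print(fourth)  # [[7.0, 11.0], [19.0, 23.0]]
```

The third feature map has the resolution of the first, and the fourth has the resolution of the second, which matches the order of the steps described above.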

[0012] Optionally, obtaining the first feature map and the second feature map of the first handwritten text image to be recognized includes:

[0013] The first handwritten text image to be recognized is subjected to feature extraction to obtain the fifth feature map;

[0014] Channel attention feature extraction is performed on the fifth feature map to obtain a channel attention feature map;

[0015] The first feature map is determined based on the channel attention feature map and the fifth feature map.

[0016] Optionally, determining the first feature map based on the channel attention feature map and the fifth feature map includes:

[0017] The sixth feature map is obtained by fusing the channel attention feature map and the fifth feature map;

[0018] Spatial attention feature extraction is performed on the sixth feature map to obtain a spatial attention feature map;

[0019] The spatial attention feature map and the sixth feature map are fused to obtain the first feature map.

[0020] Optionally, the handwritten text classification model uses a weighted sum of circle loss and cross-entropy loss as the total loss function during the training phase, wherein the weight coefficient of the cross-entropy loss is greater than the weight coefficient of the circle loss.
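
As a rough illustration of the weighted total loss, the sketch below assumes that the second loss term is the pair-similarity Circle Loss formulation; the text itself gives no formula. The weights `w_ce` and `w_circle`, the margin, and the scale `gamma` are hypothetical values chosen only so that the cross-entropy weight is the larger one.

```python
import math

def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample, computed stably."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]

def circle_loss(sim_pos, sim_neg, margin=0.25, gamma=32.0):
    """Circle loss over within-class and between-class similarity scores."""
    o_p, o_n = 1.0 + margin, -margin   # optimum for positives / negatives
    d_p, d_n = 1.0 - margin, margin    # decision margins
    pos = sum(math.exp(-gamma * max(o_p - s, 0.0) * (s - d_p)) for s in sim_pos)
    neg = sum(math.exp(gamma * max(s - o_n, 0.0) * (s - d_n)) for s in sim_neg)
    return math.log1p(pos * neg)

def total_loss(logits, target, sim_pos, sim_neg, w_ce=0.7, w_circle=0.3):
    """Weighted sum in which the cross-entropy weight must be the larger one."""
    if not w_ce > w_circle:
        raise ValueError("cross-entropy weight must exceed circle-loss weight")
    return (w_ce * cross_entropy(logits, target)
            + w_circle * circle_loss(sim_pos, sim_neg))

loss = total_loss([2.0, 0.5], 0, sim_pos=[0.9], sim_neg=[0.1])
```

Giving cross-entropy the larger weight keeps classification as the primary objective, while the similarity term acts as an auxiliary signal that pulls samples of the same class together.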

[0021] Optionally, acquiring the first handwritten text image to be recognized includes:

[0022] A second handwritten text image to be recognized and a handwritten text detection model are obtained, wherein the handwritten text detection model is obtained by training a preset target detection model using a preset image set, the preset image set includes multiple sample images, and the sample images are marked with annotation boxes of handwritten text regions;

[0023] Based on the handwritten text detection model, the second handwritten text image to be recognized is detected to obtain a handwritten text detection box;

[0024] The image located within the handwritten text detection box in the second handwritten text image to be recognized is cropped to obtain the first handwritten text image to be recognized.
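
The cropping step can be sketched as follows. The box format `(x1, y1, x2, y2)` with exclusive end coordinates and the clamping to the image border are assumptions made for this example; the detection model itself is not shown.

```python
def crop(image, box):
    """Return the pixels of `image` inside `box` = (x1, y1, x2, y2), end-exclusive."""
    x1, y1, x2, y2 = box
    height, width = len(image), len(image[0])
    # Clamp the box so a detection that slightly overshoots the image still works.
    x1, x2 = max(0, x1), min(width, x2)
    y1, y2 = max(0, y1), min(height, y2)
    return [row[x1:x2] for row in image[y1:y2]]

# A 4 x 6 toy image whose pixel value encodes its (row, column) position.
image = [[10 * r + c for c in range(6)] for r in range(4)]
patch = crop(image, (2, 1, 5, 3))
print(patch)  # [[12, 13, 14], [22, 23, 24]]
```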

[0025] Optionally, the step of obtaining the second handwritten text image to be recognized and the handwritten text detection model includes:

[0026] The image within the annotation box of the first target sample image in the preset image set is cropped to obtain the cropped image;

[0027] The cropped image is moved to a random position within the second target sample image;

[0028] If the intersection over union (IoU) between the cropped image and the annotation boxes of the second target sample image is less than a preset value, then the cropped image is pasted onto the second target sample image to obtain a third target sample image;

[0029] The third target sample image is added to the preset image set;

[0030] The preset target detection model is trained based on the preset image set to obtain the handwritten text detection model.
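
The copy-and-paste augmentation above can be sketched as below. The IoU threshold `max_iou`, the retry count `tries`, and the function names are hypothetical; the text only requires that the paste happens when the IoU is below a preset value, and the retry loop is an added convenience.

```python
import random

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def paste_crop(target, target_boxes, patch, max_iou=0.1, rng=random, tries=20):
    """Paste `patch` at a random position of a copy of `target`.

    Returns (new_image, new_boxes), or None if no acceptable position is found.
    """
    ph, pw = len(patch), len(patch[0])
    th, tw = len(target), len(target[0])
    if ph > th or pw > tw:
        return None
    for _ in range(tries):
        x = rng.randint(0, tw - pw)
        y = rng.randint(0, th - ph)
        new_box = (x, y, x + pw, y + ph)
        # Accept only positions that barely overlap the existing annotations.
        if all(iou(new_box, b) < max_iou for b in target_boxes):
            out = [row[:] for row in target]
            for dy in range(ph):
                out[y + dy][x:x + pw] = patch[dy]
            return out, list(target_boxes) + [new_box]
    return None

target = [[0] * 8 for _ in range(8)]
patch = [[1, 1], [1, 1]]
result = paste_crop(target, [(0, 0, 2, 2)], patch, rng=random.Random(0))
```

The IoU check keeps the pasted text from covering an existing annotated region, so the labels of the second target sample image remain valid in the new sample.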

[0031] Optionally, the handwritten text detection model includes multiple sub-detection models, which are trained on multiple different sub-training sets, where the sub-training sets are subsets of the preset image set. The step of detecting the second handwritten text image to be recognized based on the handwritten text detection model to obtain a handwritten text detection box includes:

[0032] The second handwritten text image to be recognized is input into multiple sub-detection models to obtain the detection boxes and the confidence scores corresponding to the detection boxes of multiple sub-detection models.

[0033] The handwritten text detection boxes are obtained by weighting the detection boxes of the multiple sub-detection models based on the confidence of the multiple detection boxes.
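
A minimal sketch of the confidence-weighted combination follows. It assumes the boxes passed in have already been matched to the same text region (for example by an IoU grouping step that is not shown), and that the fused confidence is the plain mean of the inputs; neither detail is specified in the text.

```python
def fuse_boxes(boxes, scores):
    """Confidence-weighted average of boxes (x1, y1, x2, y2) covering one region."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("at least one box needs a positive confidence")
    fused = tuple(
        sum(s * b[i] for b, s in zip(boxes, scores)) / total
        for i in range(4)
    )
    return fused, total / len(scores)

# Two sub-detection models report slightly different boxes for the same signature.
box, confidence = fuse_boxes([(10, 10, 50, 30), (14, 12, 54, 34)], [0.75, 0.25])
print(box, confidence)  # (11.0, 10.5, 51.0, 31.0) 0.5
```

The more confident sub-detection model pulls the fused box toward its own prediction, which is the intended effect of weighting by confidence.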

[0034] In a second aspect, this application provides a handwritten text recognition device, which stores a handwritten text classification model, wherein the handwritten text classification model includes a first feature extraction layer and a second feature extraction layer, and the handwritten text recognition device includes:

[0035] The first acquisition unit is used to acquire a first handwritten text image to be recognized;

[0036] The second acquisition unit is used to acquire a first feature map and a second feature map of the first handwritten text image to be recognized. The first feature map is obtained by extracting features from the first handwritten text image to be recognized through the first feature extraction layer. The second feature map is obtained by extracting features from the first handwritten text image to be recognized through the first feature extraction layer and the second feature extraction layer in sequence. The dimension of the convolution operation performed by the second feature extraction layer is higher than the dimension of the convolution operation performed by the first feature extraction layer.

[0037] An upsampling fusion unit is used to upsample and fuse the second feature map and the first feature map to obtain a third feature map;

[0038] A downsampling fusion unit is used to downsample and fuse the third feature map and the second feature map to obtain a fourth feature map;

[0039] The recognition unit is used to recognize the first handwritten text image to be recognized based on the fourth feature map, and obtain the handwritten text recognition result.

[0040] Optionally, the second acquisition unit is configured to:

[0041] The first handwritten text image to be recognized is subjected to feature extraction to obtain the fifth feature map;

[0042] Channel attention feature extraction is performed on the fifth feature map to obtain a channel attention feature map;

[0043] The first feature map is determined based on the channel attention feature map and the fifth feature map.

[0044] Optionally, the second acquisition unit is configured to:

[0045] The sixth feature map is obtained by fusing the channel attention feature map and the fifth feature map;

[0046] Spatial attention feature extraction is performed on the sixth feature map to obtain a spatial attention feature map;

[0047] The spatial attention feature map and the sixth feature map are fused to obtain the first feature map.

[0048] Optionally, the handwritten text classification model uses a weighted sum of circle loss and cross-entropy loss as the total loss function during the training phase, wherein the weight coefficient of the cross-entropy loss is greater than the weight coefficient of the circle loss.

[0049] Optionally, the first acquisition unit is configured to:

[0050] A second handwritten text image to be recognized and a handwritten text detection model are obtained, wherein the handwritten text detection model is obtained by training a preset target detection model using a preset image set, the preset image set includes multiple sample images, and the sample images are marked with annotation boxes of handwritten text regions;

[0051] Based on the handwritten text detection model, the second handwritten text image to be recognized is detected to obtain a handwritten text detection box;

[0052] The image located within the handwritten text detection box in the second handwritten text image to be recognized is cropped to obtain the first handwritten text image to be recognized.

[0053] Optionally, the first acquisition unit is configured to:

[0054] The image within the annotation box of the first target sample image in the preset image set is cropped to obtain the cropped image;

[0055] The cropped image is moved to a random position within the second target sample image;

[0056] If the intersection over union (IoU) between the cropped image and the annotation boxes of the second target sample image is less than a preset value, then the cropped image is pasted onto the second target sample image to obtain a third target sample image;

[0057] The third target sample image is added to the preset image set;

[0058] The preset target detection model is trained based on the preset image set to obtain the handwritten text detection model.

[0059] Optionally, the handwritten text detection model includes multiple sub-detection models, which are trained on multiple different sub-training sets, where the sub-training sets are subsets of the preset image set. The first acquisition unit is used for:

[0060] The second handwritten text image to be recognized is input into multiple sub-detection models to obtain the detection boxes and the confidence scores corresponding to the detection boxes of multiple sub-detection models.

[0061] The handwritten text detection boxes are obtained by weighting the detection boxes of the multiple sub-detection models based on the confidence of the multiple detection boxes.

[0062] In a third aspect, this application provides a computer device, the computer device comprising:

[0063] One or more processors;

[0064] Memory; and

[0065] One or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the processor to implement the handwritten text recognition method described in any implementation of the first aspect.

[0066] In a fourth aspect, this application provides a computer-readable storage medium storing a plurality of instructions adapted for loading by a processor to perform the steps of the handwritten text recognition method described in any implementation of the first aspect.

[0067] This application provides a handwritten text recognition method and apparatus. The handwritten text recognition method is applied to a handwritten text classification model, wherein the handwritten text classification model includes a first feature extraction layer and a second feature extraction layer. The handwritten text recognition method includes: acquiring a first handwritten text image to be recognized; acquiring a first feature map and a second feature map of the first handwritten text image to be recognized, wherein the first feature map is obtained by performing feature extraction on the first handwritten text image to be recognized through the first feature extraction layer, and the second feature map is obtained by performing feature extraction on the first handwritten text image to be recognized sequentially through the first feature extraction layer and the second feature extraction layer, wherein the dimension of the convolution operation performed by the second feature extraction layer is higher than the dimension of the convolution operation performed by the first feature extraction layer; upsampling and fusing the second feature map and the first feature map to obtain a third feature map; downsampling and fusing the third feature map and the second feature map to obtain a fourth feature map; and recognizing the first handwritten text image to be recognized based on the fourth feature map to obtain a handwritten text recognition result. This application uses cascaded first and second feature extraction layers to extract features from the first handwritten text image to be recognized, obtaining a second feature map produced by higher-dimensional convolution operations and a first feature map produced by lower-dimensional convolution operations. The second and first feature maps are upsampled and fused from high to low dimensions, and the resulting third feature map is then downsampled and fused from low to high dimensions. This network structure conveys strong semantic features from high to low dimensions and strong localization features from low to high dimensions. Parameter aggregation across different detection layers enhances the representation of both global and local features, thereby improving model prediction accuracy and the overall accuracy of handwritten text recognition.

Attached Figure Description

[0068] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0069] Figure 1 is a schematic diagram of a scenario for the handwritten text recognition system provided in an embodiment of this application;

[0070] Figure 2 is a schematic flowchart of an embodiment of the handwritten text recognition method provided in this application;

[0071] Figure 3 is a flowchart of step S201 in one embodiment of the handwritten text recognition method provided in this application;

[0072] Figure 4 is a schematic diagram of the network structure of an embodiment of the handwritten text classification model in this application;

[0073] Figure 5 is a schematic diagram of an embodiment of the handwritten text recognition device provided in this application;

[0074] Figure 6 is a schematic diagram of an embodiment of the computer device provided in this application.

Detailed Implementation

[0075] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0076] In the description of this application, it should be understood that the terms "center," "longitudinal," "lateral," "length," "width," "thickness," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," and "outer," etc., indicating orientation or positional relationships based on the orientation or positional relationships shown in the accompanying drawings, are used only for the convenience of describing this application and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation of this application. Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Thus, features defined with "first" and "second" may explicitly or implicitly include one or more features. In the description of this application, "a plurality of" means two or more, unless otherwise explicitly specified.

[0077] In this application, the term "exemplary" is used to mean "used as an example, illustration, or description." Any embodiment described as "exemplary" in this application is not necessarily to be construed as being more preferred or advantageous than other embodiments. The following description is provided to enable any person skilled in the art to make and use this application. Details are set forth in the following description for purposes of explanation. It should be understood that those skilled in the art will recognize that this application can be made without using these specific details. In other instances, well-known structures and processes are not described in detail to avoid obscuring the description of this application with unnecessary detail. Therefore, this application is not intended to be limited to the embodiments shown, but is consistent with the broadest scope of the principles and features disclosed in this application.

[0078] This application provides a handwritten text recognition method and apparatus, which will be described in detail below.

[0079] Referring to Figure 1, Figure 1 is a schematic diagram of a handwritten text recognition system provided in an embodiment of this application. The handwritten text recognition system may include a computer device 100, which integrates a handwritten text recognition device.

[0080] In this embodiment, the computer device 100 can be a standalone server, a server network, or a server cluster. For example, the computer device 100 described in this embodiment includes, but is not limited to, a computer, a network host, a single network server, a set of multiple network servers, or a cloud server composed of multiple servers. The cloud server is composed of a large number of computers or network servers based on cloud computing.

[0081] In this embodiment, the computer device 100 described above can be a general-purpose computer device or a special-purpose computer device. In specific implementations, the computer device 100 can be a desktop computer, a portable computer, a network server, a handheld computer (Personal Digital Assistant, PDA), a mobile phone, a tablet computer, a wireless terminal device, a communication device, an embedded device, etc. This embodiment does not limit the type of computer device 100.

[0082] Those skilled in the art will understand that the application environment shown in Figure 1 is merely one application scenario of the solution in this application and does not constitute a limitation on its application scenarios. Other application environments may include more or fewer computer devices than shown in Figure 1; for example, only one computer device is shown in Figure 1. It is understood that the handwritten text recognition system may also include one or more other computer devices capable of processing data, which are not specifically limited here.

[0083] In addition, as shown in Figure 1, the handwritten text recognition system may also include a memory 200 for storing data.

[0084] It should be noted that the schematic diagram of the handwritten text recognition system shown in Figure 1 is merely an example. The handwritten text recognition system and scenarios described in this application are intended to more clearly illustrate the technical solutions of this application and do not constitute a limitation on the technical solutions provided in this application. As those skilled in the art will know, with the evolution of handwritten text recognition systems and the emergence of new business scenarios, the technical solutions provided in this application are also applicable to similar technical problems.

[0085] First, this application provides a handwritten text recognition method. This handwritten text recognition method is applied to a handwritten text classification model, wherein the handwritten text classification model includes a first feature extraction layer and a second feature extraction layer. The handwritten text recognition method includes: acquiring a first handwritten text image to be recognized; acquiring a first feature map and a second feature map of the first handwritten text image to be recognized, wherein the first feature map is obtained by performing feature extraction on the first handwritten text image to be recognized through the first feature extraction layer, and the second feature map is obtained by performing feature extraction on the first handwritten text image to be recognized sequentially through the first feature extraction layer and the second feature extraction layer, wherein the dimension of the convolution operation performed by the second feature extraction layer is higher than the dimension of the convolution operation performed by the first feature extraction layer; upsampling and fusing the second feature map and the first feature map to obtain a third feature map; downsampling and fusing the third feature map and the second feature map to obtain a fourth feature map; and recognizing the first handwritten text image to be recognized based on the fourth feature map to obtain a handwritten text recognition result.

[0086] Figure 2 is a schematic flowchart of an embodiment of the handwritten text recognition method provided in this application. As shown in Figure 2, the handwritten text recognition method is applied to a handwritten text classification model, which includes a first feature extraction layer and a second feature extraction layer. The handwritten text recognition method includes the following steps S201 to S205:

[0087] S201. Obtain the first handwritten text image to be recognized.

[0088] In this embodiment of the application, the first handwritten text image to be recognized can be an image obtained by photographing a waybill, document, or similar item that carries handwritten text. Handwritten text refers to text written by hand, as opposed to printed text, such as a handwritten signature, date, or ID number.

[0089] S202. Obtain the first feature map and the second feature map of the first handwritten text image to be recognized.

[0090] The first feature map is obtained by extracting features from the first handwritten text image to be recognized through the first feature extraction layer, and the second feature map is obtained by extracting features from the first handwritten text image to be recognized through the first feature extraction layer and the second feature extraction layer in sequence. The dimension of the convolution operation performed by the second feature extraction layer is higher than the dimension of the convolution operation performed by the first feature extraction layer.

[0091] In one specific embodiment, the first feature extraction layer includes MBConv1(k3x3) and the second feature extraction layer includes MBConv6(k3x3). The first feature extraction layer performs 1-dimensional matrix convolution operations, and the second feature extraction layer performs 6-dimensional matrix convolution operations.

[0092] To allocate limited computational resources to more important features, this application employs an attention mechanism during feature extraction. In a specific embodiment, obtaining a first feature map and a second feature map of a first handwritten text image to be recognized may include:

[0093] (1) Extract features from the first handwritten text image to be recognized to obtain the fifth feature map.

[0094] Specifically, the size of the first handwritten text image to be recognized is preprocessed to 224×224×40. The 224×224×40 first handwritten text image to be recognized is input into Conv3×3 for feature extraction, resulting in a fifth feature map of 112×112×24.

[0095] (2) Extract channel attention features from the fifth feature map to obtain the channel attention feature map.

[0096] Specifically, the MBConv1(k3x3) of the first feature extraction layer sequentially applies the following to the fifth feature map: 1×1 convolution for dimensionality increase, batch normalization (BN), the Swish activation function, 3×3 depthwise separable convolution, an SENet attention layer, and 1×1 convolution for dimensionality reduction. The result is the channel attention feature map.

[0097] Attention mechanisms are a data processing method in machine learning, widely used in tasks such as natural language processing, image recognition, and speech recognition. Simply put, attention mechanisms aim to enable the network to automatically learn the areas of attention within an image or text sequence. For example, when the human eye views a painting, it does not distribute attention equally among all pixels, but focuses more on the areas that attract attention. From an implementation perspective, attention mechanisms generate a mask through neural network operations, where each value on the mask is a score indicating how much attention the corresponding point should receive. Attention mechanisms can be divided into channel attention mechanisms and spatial attention mechanisms. Channel attention mechanisms generate masks over channels; examples include SENet and Channel Attention Modules. Spatial attention mechanisms generate masks over spatial positions; examples include Spatial Attention Modules.

[0098] (3) Determine the first feature map based on the channel attention feature map and the fifth feature map.

[0099] The size of the first feature map is 112×112×32.

[0100] In one specific embodiment, the channel attention feature map and the fifth feature map are multiplied element-wise to obtain the first feature map. Adding a channel attention module extracts features along the channel dimension of the image, emphasizes meaningful features in that dimension, determines the importance of each feature channel, and then enhances or suppresses different channels for different tasks, thereby improving the computational efficiency and accuracy of the model.
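
The channel attention step described above can be sketched as follows. This is a minimal, framework-free illustration that assumes a squeeze-and-excitation style gate (global average pooling, two small fully connected layers, sigmoid). The function names `channel_attention` and `apply_channel_attention`, the weight matrices `w1` and `w2`, and the ReLU between the two layers are illustrative assumptions and are not taken from this application.

```python
import math

def channel_attention(fmap, w1, w2):
    """Return one weight in (0, 1) per channel.

    fmap: list of C channels, each an H x W list of lists.
    w1:   R x C reduction weights, w2: C x R expansion weights (hypothetical).
    """
    # Squeeze: global average pooling over each channel.
    squeezed = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in fmap]
    # Excitation: reduce, ReLU, expand, sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in w2]
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]

def apply_channel_attention(fmap, weights):
    """Multiply every value of channel c by weights[c] (element-wise product)."""
    return [[[v * a for v in row] for row in ch] for ch, a in zip(fmap, weights)]

# Toy stand-in for the fifth feature map: 2 channels of 2 x 2.
fifth = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
w1 = [[1.0, 1.0]]        # reduce 2 channels to 1 hidden unit
w2 = [[1.0], [-1.0]]     # expand back to 2 channels
mask = channel_attention(fifth, w1, w2)
first = apply_channel_attention(fifth, mask)
```

In this toy case the first channel receives a weight close to 1 and the second a weight close to 0, which is the "enhance or suppress" behaviour the paragraph describes.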

[0101] In another specific embodiment, determining the first feature map based on the channel attention feature map and the fifth feature map may include:

[0102] (1) The channel attention feature map and the fifth feature map are fused to obtain the sixth feature map.

[0103] Specifically, the sixth feature map is obtained by multiplying the channel attention feature map and the fifth feature map.

[0104] (2) Spatial attention features are extracted from the sixth feature map to obtain the spatial attention feature map.

[0105] Specifically, in one embodiment, max pooling and average pooling are performed on the sixth feature map along the channel dimension to obtain a channel max pooling result and a channel average pooling result. The two results are concatenated and used as input to a pointwise convolutional layer, and the final spatial attention feature map is then generated through the sigmoid function.

[0106] (3) The spatial attention feature map and the sixth feature map are fused to obtain the first feature map.

[0107] Specifically, the spatial attention feature map and the sixth feature map are multiplied to obtain the first feature map.

[0108] By incorporating channel attention and spatial attention modules, features are extracted from images in both channel and spatial dimensions. This emphasizes meaningful features in both dimensions, determines the importance of each feature channel and feature space, and then enhances or suppresses different channels and spaces for different tasks, thereby improving the computational efficiency and accuracy of the model.
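
The spatial attention steps above (pooling along the channel dimension, concatenation, pointwise convolution, sigmoid, then multiplication) can be sketched as follows. This is an illustrative sketch only: the function names and the scalar parameters `w_max`, `w_avg` and `bias`, which stand in for the weights of the pointwise convolution, are assumptions.

```python
import math

def spatial_attention(fmap, w_max, w_avg, bias=0.0):
    """Return an H x W mask with values in (0, 1).

    fmap: list of C channels, each an H x W list of lists.
    """
    channels = len(fmap)
    height, width = len(fmap[0]), len(fmap[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            values = [fmap[c][y][x] for c in range(channels)]
            pooled_max = max(values)             # max pooling over channels
            pooled_avg = sum(values) / channels  # average pooling over channels
            # Pointwise (1x1) convolution over the two concatenated pooled maps.
            z = w_max * pooled_max + w_avg * pooled_avg + bias
            row.append(1.0 / (1.0 + math.exp(-z)))
        mask.append(row)
    return mask

def apply_spatial_attention(fmap, mask):
    """Multiply every channel element-wise by the same spatial mask."""
    return [[[v * m for v, m in zip(row, mrow)] for row, mrow in zip(ch, mask)]
            for ch in fmap]

# Toy stand-in for the sixth feature map: 2 channels of 2 x 2.
sixth = [
    [[1.0, 0.0], [0.0, 0.0]],
    [[3.0, 0.0], [0.0, 0.0]],
]
mask = spatial_attention(sixth, w_max=1.0, w_avg=1.0)
first = apply_spatial_attention(sixth, mask)
```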

[0109] Furthermore, after obtaining the first feature map, the 112×112×32 first feature map is input into MBConv6(k3x3) in the second feature extraction layer to obtain a 56×56×48 second feature map. Specifically, MBConv6(k3x3) sequentially processes the first feature map through a 1x1 convolution, BN normalization, and a Swish activation function, increasing the number of channels by a factor of 6. The result then passes through a 3x3 depthwise separable convolution, a SeNet attention layer, and a 1x1 convolution for dimensionality reduction, resulting in a 56×56×48 second feature map.

[0110] S203. Upsample and fuse the second feature map and the first feature map to obtain the third feature map.

[0111] In one specific embodiment, upsampling and fusing the second feature map and the first feature map to obtain the third feature map includes: upsampling the 56×56×48 second feature map by a factor of 2 to obtain an upsampled feature map of 112×112×48; and fusing the 112×112×48 upsampled feature map with the 112×112×32 first feature map to obtain the third feature map. In general image processing, the main purpose of upsampling is to enlarge an image so that it can be displayed on a higher-resolution display device; here, it brings the second feature map to the same spatial size as the first feature map so that the two can be fused.

[0112] S204. Downsample and fuse the third feature map and the second feature map to obtain the fourth feature map.

[0113] In one specific embodiment, downsampling and fusing the third feature map and the second feature map to obtain the fourth feature map includes: downsampling the third feature map by a factor of 2 to obtain the downsampled feature map; and fusing the downsampled feature map and the second feature map to obtain the fourth feature map. In general image processing, downsampling has two main purposes: (1) to make the image conform to the size of the display area; and (2) to generate a thumbnail of the corresponding image. Here, it reduces the third feature map to the spatial size of the second feature map so that the two can be fused.
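
Steps S203 and S204 can be sketched on toy feature maps as follows. The application does not specify the resampling or fusion operators, so this sketch assumes nearest-neighbour upsampling, 2×2 average pooling for downsampling, and channel concatenation as the fusion operation; all function names are hypothetical. Under the concatenation assumption, fusing a 112×112×48 map with a 112×112×32 map would give 80 channels.

```python
def shape(fmap):
    """Return (channels, height, width) of a nested-list feature map."""
    return (len(fmap), len(fmap[0]), len(fmap[0][0]))

def upsample2x(fmap):
    """Nearest-neighbour upsampling by a factor of 2 (assumed operator)."""
    out = []
    for ch in fmap:
        rows = []
        for row in ch:
            wide = [v for v in row for _ in range(2)]
            rows.append(wide)
            rows.append(list(wide))
        out.append(rows)
    return out

def downsample2x(fmap):
    """2x2 average pooling with stride 2 (assumed operator)."""
    out = []
    for ch in fmap:
        rows = []
        for y in range(0, len(ch) - 1, 2):
            rows.append([
                (ch[y][x] + ch[y][x + 1] + ch[y + 1][x] + ch[y + 1][x + 1]) / 4.0
                for x in range(0, len(ch[0]) - 1, 2)
            ])
        out.append(rows)
    return out

def fuse(a, b):
    """Concatenate along the channel axis (assumed fusion); sizes must match."""
    if shape(a)[1:] != shape(b)[1:]:
        raise ValueError("spatial sizes differ")
    return a + b

# Toy stand-ins: first map is 1 channel of 4 x 4, second map is 2 channels of 2 x 2.
first = [[[0.0] * 4 for _ in range(4)]]
second = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]

third = fuse(upsample2x(second), first)     # S203: upsample, then fuse
fourth = fuse(downsample2x(third), second)  # S204: downsample, then fuse
```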

[0114] S205. Based on the fourth feature map, the first handwritten text image to be identified is recognized to obtain the handwritten text recognition result.

[0115] Specifically, the fourth feature map is sequentially input to the pooling layer, fully connected layer, and softmax layer to obtain the handwritten text recognition result. The handwritten text recognition result includes the handwritten text categories and their corresponding confidence scores. The handwritten text categories include name, date, and ID card number. For example, after a first handwritten text image to be recognized is input, the handwritten text recognition result is: name category, confidence score 0.1; date category, confidence score 0.8; ID card number category, confidence score 0.1. The handwritten text category with the highest confidence score is output.
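
The final softmax and category selection can be illustrated with a short sketch. The logits below are chosen purely so that the softmax reproduces the 0.1 / 0.8 / 0.1 example above; the names `CATEGORIES`, `softmax` and `classify` are illustrative and not part of this application.

```python
import math

CATEGORIES = ["name", "date", "ID card number"]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits):
    """Return the category with the highest confidence and all confidences."""
    scores = softmax(logits)
    best = max(range(len(scores)), key=scores.__getitem__)
    return CATEGORIES[best], scores

# Hypothetical fully connected layer outputs for one image.
category, scores = classify([0.0, math.log(8.0), 0.0])
```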

[0116] In one specific embodiment, the handwritten text classification model uses a weighted sum of circle loss and cross-entropy loss as the total loss function during the training phase, where the weight coefficient of the cross-entropy loss is greater than that of the circle loss. For example, the weight coefficient of the cross-entropy loss is 1, and the weight coefficient of the circle loss is 0.001.

[0117] Specifically, the total loss function is obtained by solving formulas (1), (2), and (3) simultaneously.

[0118] $L_{loss} = L_{classify} + 0.001\,L_{circle}$ (1)

[0119]

[0120] $\alpha_p^i = [O_p - s_p^i]_+, \quad \alpha_n^j = [s_n^j - O_n]_+$ (3)

[0121] Here, $\{s_n^j\}\ (j = 1, 2, \ldots, L)$ denotes the L between-class similarity scores of a single sample x in the feature space, and $\{s_p^i\}\ (i = 1, 2, \ldots, K)$ denotes the K within-class similarity scores of x. $\gamma$ is a scaling factor, and $[\cdot]_+$ denotes truncation at zero. In the calculation, $O_p = 1 + m$ and $O_n = -m$, where m is a threshold. $L_{loss}$ is the total loss function, $L_{classify}$ is the cross-entropy loss, and $L_{circle}$ is the circle loss.
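
Formula (2) is not reproduced in the text above. Assuming it follows the published Circle loss formulation (an assumption, since this text defines only $O_p$, $O_n$, $m$ and $\gamma$), it would take the following form. The margins $\Delta_p = 1 - m$ and $\Delta_n = m$ come from that publication rather than from this application.

```latex
L_{circle} = \log\left[ 1
  + \sum_{j=1}^{L} \exp\left( \gamma\, \alpha_n^j \left( s_n^j - \Delta_n \right) \right)
    \sum_{i=1}^{K} \exp\left( -\gamma\, \alpha_p^i \left( s_p^i - \Delta_p \right) \right)
\right],
\qquad \Delta_p = 1 - m, \quad \Delta_n = m
```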

[0122] Circle loss reweights underoptimized similarity scores, penalizing individual similarity scores more flexibly, resulting in a more reasonable distribution of learned features and facilitating feature differentiation. Using a joint loss function better learns the distribution of features while optimizing the model's classification, accelerating the learning process and improving classification accuracy.
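
The joint loss of formula (1) can be sketched numerically as follows. The cross-entropy term and the 0.001 weight follow the text. The circle loss term assumes the published Circle loss definition with margins 1 - m and m, and the defaults `m=0.25` and `gamma=64.0` are illustrative values, not values stated in this application. All function names are hypothetical.

```python
import math

def logsumexp(values):
    """Stable log(sum(exp(v))) for a non-empty list."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))

def cross_entropy(probabilities, target_index):
    """Cross-entropy of softmax outputs against one true class."""
    return -math.log(probabilities[target_index])

def circle_loss(s_p, s_n, m=0.25, gamma=64.0):
    """Circle loss for within-class scores s_p and between-class scores s_n."""
    o_p, o_n = 1.0 + m, -m          # optima, as in formula (3)
    delta_p, delta_n = 1.0 - m, m   # margins (assumed from the published loss)
    pos_logits = [-gamma * max(0.0, o_p - s) * (s - delta_p) for s in s_p]
    neg_logits = [gamma * max(0.0, s - o_n) * (s - delta_n) for s in s_n]
    # log(1 + sum(exp(neg)) * sum(exp(pos))), computed as a stable softplus.
    z = logsumexp(neg_logits) + logsumexp(pos_logits)
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def total_loss(probabilities, target_index, s_p, s_n, circle_weight=0.001):
    """Formula (1): cross-entropy (weight 1) plus weighted circle loss."""
    return (cross_entropy(probabilities, target_index)
            + circle_weight * circle_loss(s_p, s_n))

loss = total_loss([0.1, 0.8, 0.1], 1, s_p=[0.9], s_n=[0.1, 0.2])
```

Because the circle loss weight is small, the total is dominated by the classification term, while the circle loss term still penalizes poorly separated similarity scores.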

[0123] Referring to Figure 3, Figure 3 is a flowchart illustrating step S201 of one embodiment of the handwritten text recognition method provided in this application. In this embodiment, acquiring the first handwritten text image to be recognized includes the following steps S301 to S303:

[0124] S301. Obtain the second handwritten text image to be recognized and the handwritten text detection model. The handwritten text detection model is obtained by training the preset target detection model using a preset image set. The preset image set includes multiple sample images, and the sample images are marked with annotation boxes of handwritten text regions.

[0125] Because the captured images sometimes don't perfectly capture the handwritten text area, they may contain other content, necessitating the identification of the handwritten text area. The second image to be identified is one that contains the handwritten text area, but may also contain other content such as seals, printed fonts, etc.

[0126] In this embodiment, sample images in the preset image set can be manually annotated. The annotation box is the bounding rectangle of the handwritten text. The position of the annotation box is represented by the coordinates of its four corner points: upper left, upper right, lower right, and lower left. The annotation of the sample image also includes the category to which the annotation box belongs, i.e., whether the annotation box belongs to a seal area or a handwritten text area.

[0127] In one specific embodiment, obtaining the second handwritten text image to be recognized and the handwritten text detection model may include:

[0128] (1) Crop the image within the annotation box of the first target sample image in the preset image set to obtain the cropped image.

[0129] The first target sample image can be any sample image from a preset image set. Alternatively, the first target sample image can be a sample image from the preset image set with a resolution higher than a predetermined resolution.

[0130] (2) Move the cropped image to a random position in the second target sample image.

[0131] (3) If the intersection-over-union (IoU) ratio of the cropped image and the annotation box of the second target sample image is less than the preset value, paste the cropped image onto the second target sample image to obtain the third target sample image.

[0132] The intersection-over-union (IoU) ratio is the ratio of the area of intersection to the area of union of the bounding box of the cropped image and the annotation box of the second target sample image. Specifically, the preset value can be 0.9. If the IoU ratio is less than the preset value, placing the cropped image into the second target sample image will not affect the recognition of the existing annotation box. The cropped image is then pasted into the second target sample image to obtain the third target sample image. In this way, a new sample is generated from the existing first and second target sample images, expanding the preset image set.

[0133] (4) Add the third target sample image to the preset image set.

[0134] The third target sample image is a newly generated sample. Adding the third target sample image to the preset image set expands the preset image set, thereby improving the model training effect.

[0135] (5) Train the preset target detection model based on the preset image set to obtain the handwritten text detection model.

[0136] The preset target detection model can be a target detection model such as YOLOv5 or SSD, which can be selected according to the specific situation.
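
The IoU check in steps (2) and (3) can be sketched as follows. For brevity the sketch represents boxes as axis-aligned `(x_min, y_min, x_max, y_max)` tuples rather than four corner points, and only decides where a crop may be pasted instead of manipulating pixels. The function names `iou` and `try_paste` and their parameters are hypothetical.

```python
import random

def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def try_paste(crop_size, target_size, target_boxes, threshold=0.9, rng=random):
    """Pick a random position; return the pasted box, or None if rejected.

    crop_size and target_size are (width, height) in pixels; target_boxes are
    the annotation boxes already present in the second target sample image.
    """
    crop_w, crop_h = crop_size
    target_w, target_h = target_size
    if crop_w > target_w or crop_h > target_h:
        return None
    x = rng.randint(0, target_w - crop_w)
    y = rng.randint(0, target_h - crop_h)
    candidate = (x, y, x + crop_w, y + crop_h)
    # Paste only if the IoU with every existing annotation box is below the preset value.
    if all(iou(candidate, box) < threshold for box in target_boxes):
        return candidate
    return None

# Hypothetical sizes: a 40 x 20 crop pasted into a 200 x 100 image with one annotation box.
pasted = try_paste((40, 20), (200, 100), [(10, 10, 60, 40)])
```

When a box is returned, it would be added to the annotations of the newly generated third target sample image.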

[0137] S302. Detect the second handwritten text image to be recognized based on the handwritten text detection model to obtain the handwritten text detection box.

[0138] To further improve the accuracy of handwritten text detection, in one specific embodiment, the handwritten text detection model includes multiple sub-detection models, which are trained on multiple different sub-training sets. The sub-training sets are subsets of a preset image set. Detecting a second handwritten text image to be recognized based on the handwritten text detection model to obtain handwritten text detection boxes includes: inputting the second handwritten text image to be recognized into the multiple sub-detection models respectively to obtain the detection boxes and corresponding confidence scores of the multiple sub-detection models; and weighting the detection boxes of the multiple sub-detection models based on the confidence scores of the multiple detection boxes to obtain the handwritten text detection boxes.

[0139] Specifically, the preset image set is divided into 10 equal parts, resulting in 10 sub-image sets. One sub-image set 'a' is extracted as the validation set, and the rest are used as the training set. The preset object detection model is trained using this training set to obtain sub-detection model A. Another sub-image set 'b', which is different from sub-image set 'a', is extracted from the 10 equal parts and used as the validation set. The rest are used as the training set. The preset object detection model is trained using this training set to obtain sub-detection model B. A third sub-image set 'c', which is different from both 'a' and 'b', is extracted from the 10 equal parts and used as the validation set. The rest are used as the training set. The preset object detection model is trained using this training set to obtain sub-detection model C. In this way, three sub-detection models are obtained: sub-detection model A, sub-detection model B, and sub-detection model C. This prevents the bias in detection results caused by inconsistent distribution of training and validation set data and improves the generalization performance of the model.
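
The data split described above can be sketched as follows. Shuffling before splitting and the fixed seed are assumptions added for illustration, and integer IDs stand in for sample images. The function names are hypothetical.

```python
import random

def make_folds(samples, num_folds=10, seed=0):
    """Shuffle the samples and divide them into num_folds parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[k::num_folds] for k in range(num_folds)]

def split_for_sub_model(folds, validation_index):
    """Use one fold for validation and all remaining folds for training."""
    validation = folds[validation_index]
    training = [
        sample
        for k, fold in enumerate(folds)
        if k != validation_index
        for sample in fold
    ]
    return training, validation

# 100 hypothetical sample IDs; sub-models A, B, C validate on folds 0, 1, 2.
folds = make_folds(range(100))
splits = {
    name: split_for_sub_model(folds, index)
    for name, index in zip("ABC", (0, 1, 2))
}
```

Each entry of `splits` holds the training and validation sets on which the corresponding sub-detection model would be trained and validated.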

[0140] S303. Crop the image located in the handwritten text detection box in the second handwritten text image to be recognized to obtain the first handwritten text image to be recognized.

[0141] The image located within the handwritten text detection box in the second handwritten text image to be recognized is cropped to obtain the first handwritten text image to be recognized. The first handwritten text image to be recognized removes other content outside the handwritten text area, and the area for recognition by the handwritten text classification model is located in the handwritten text area, which can improve the classification accuracy.
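
The cropping in step S303 can be sketched as follows, assuming the image is a row-major nested list of pixels and the detection box is an axis-aligned `(x_min, y_min, x_max, y_max)` tuple in integer pixel coordinates. The function name `crop` is hypothetical.

```python
def crop(image, box):
    """Return the pixels of image inside box, clamped to the image bounds."""
    height, width = len(image), len(image[0])
    x_min, y_min, x_max, y_max = box
    x_min, y_min = max(0, x_min), max(0, y_min)
    x_max, y_max = min(width, x_max), min(height, y_max)
    return [row[x_min:x_max] for row in image[y_min:y_max]]

# Toy 4 x 4 "image" whose pixel values are 0..15 in row-major order.
second_image = [[4 * y + x for x in range(4)] for y in range(4)]
first_image = crop(second_image, (1, 1, 3, 3))
```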

[0142] The above embodiments only illustrate the case where the handwritten text classification model includes a first feature extraction layer and a second feature extraction layer. The handwritten text classification model may also include more feature extraction layers.

[0143] Referring to Figure 4, Figure 4 is a schematic diagram of the network structure of an embodiment of the handwritten text classification model in this application.

[0144] In this embodiment, the parameters of each module of the handwritten text classification model are shown in Table 1. The handwritten text classification model includes a first feature extraction layer MBConv1 (k3x3), a second feature extraction layer MBConv6 (k3x3), a third feature extraction layer MBConv6 (k5x5), a fourth feature extraction layer MBConv6 (k3x3), a fifth feature extraction layer MBConv6 (k5x5), a sixth feature extraction layer MBConv6 (k5x5), and a seventh feature extraction layer MBConv6 (k3x3).

[0145]
| Stage | Operator | Resolution | Channels | Layers |
|---|---|---|---|---|
| 1 | Conv 3x3 | 224x224 | 40 | 1 |
| 2 | MBConv1(k3x3) | 112x112 | 24 | 2 |
| 3 | MBConv6(k3x3) | 112x112 | 32 | 3 |
| 4 | MBConv6(k5x5) | 56x56 | 48 | 3 |
| 5 | MBConv6(k3x3) | 28x28 | 96 | 5 |
| 6 | MBConv6(k5x5) | 14x14 | 136 | 5 |
| 7 | MBConv6(k5x5) | 14x14 | 232 | 6 |
| 8 | MBConv6(k3x3) | 7x7 | 384 | 2 |
| 9 | Conv1x1&Pooling&FC | 7x7 | 1536 | 1 |

[0146] Table 1: Parameter Table of Each Module of the Handwritten Text Classification Model

[0147] With reference to Figure 4 and Table 1, the first handwritten text image to be recognized is preprocessed to a size of 224×224×40. The 224×224×40 first handwritten text image to be recognized is input into Conv3×3 for feature extraction, resulting in a fifth feature map of 112×112×24. The 112×112×24 fifth feature map is then sequentially input into the first feature extraction layer MBConv1(k3x3), the second feature extraction layer MBConv6(k3x3), the third feature extraction layer MBConv6(k5x5), the fourth feature extraction layer MBConv6(k3x3), the fifth feature extraction layer MBConv6(k5x5), the sixth feature extraction layer MBConv6(k5x5), and the seventh feature extraction layer MBConv6(k3x3).

[0148] The feature map P5 output by the seventh feature extraction layer MBConv6(k3x3) is upsampled by a factor of 2 and fused with the feature map output by the fourth feature extraction layer MBConv6(k3x3) to obtain feature map P4. Feature map P4 is upsampled by a factor of 2 and fused with the feature map output by the second feature extraction layer MBConv6(k3x3) to obtain feature map P3. Feature map P3 is upsampled by a factor of 2 and fused with the feature map output by the first feature extraction layer MBConv1(k3x3) to obtain feature map P2. This completes the pyramid feature extraction from high to low.

[0149] Perform a Conv1x1 convolution on feature map P2 to obtain feature map N2; downsample feature map N2 by a factor of 2 and fuse it with feature map P3 to obtain feature map N3. Downsample feature map N3 by a factor of 2 and fuse it with feature map P4 to obtain feature map N4. Downsample feature map N4 by a factor of 2 and fuse it with feature map P5 to obtain feature map N5. This completes the pyramid feature extraction from low to high.

[0150] The feature map N5 is sequentially input to the pooling layer, the fully connected layer, and the softmax layer to obtain the handwritten text recognition result.

[0151] To better implement the handwritten text recognition method in the embodiments of this application, the embodiments of this application also provide, on the basis of that method, a handwritten text recognition device. As shown in Figure 5, the handwritten text recognition device 500 includes:

[0152] The first acquisition unit 501 is used to acquire a first handwritten text image to be recognized;

[0153] The second acquisition unit 502 is used to acquire a first feature map and a second feature map of a first handwritten text image to be recognized. The first feature map is obtained by extracting features from the first handwritten text image to be recognized through a first feature extraction layer. The second feature map is obtained by extracting features from the first handwritten text image to be recognized through a first feature extraction layer and a second feature extraction layer in sequence. The dimension of the convolution operation performed by the second feature extraction layer is higher than the dimension of the convolution operation performed by the first feature extraction layer.

[0154] The upsampling fusion unit 503 is used to upsample and fuse the second feature map and the first feature map to obtain the third feature map;

[0155] The downsampling fusion unit 504 is used to downsample and fuse the third feature map and the second feature map to obtain the fourth feature map;

[0156] The recognition unit 505 is used to recognize the first handwritten text image to be recognized based on the fourth feature map, and obtain the handwritten text recognition result.

[0157] Optionally, the second acquisition unit 502 is used for:

[0158] Feature extraction is performed on the first handwritten text image to be recognized to obtain the fifth feature map;

[0159] Channel attention feature extraction is performed on the fifth feature map to obtain the channel attention feature map;

[0160] The first feature map is determined based on the channel attention feature map and the fifth feature map.

[0161] Optionally, the second acquisition unit 502 is used for:

[0162] The channel attention feature map and the fifth feature map are fused to obtain the sixth feature map;

[0163] Spatial attention features are extracted from the sixth feature map to obtain the spatial attention feature map;

[0164] The first feature map is obtained by fusing the spatial attention feature map and the sixth feature map.

[0165] Optionally, the handwritten text classification model uses a weighted sum of circle loss and cross-entropy loss as the total loss function during the training phase, where the weight coefficient of the cross-entropy loss is greater than the weight coefficient of the circle loss.

[0166] Optionally, the first acquisition unit 501 is used for:

[0167] Obtain a second handwritten text image to be identified and a handwritten text detection model. The handwritten text detection model is obtained by training a preset target detection model using a preset image set. The preset image set includes multiple sample images, and the sample images are marked with bounding boxes indicating the handwritten text regions.

[0168] The handwritten text detection model is used to detect the second handwritten text image to be identified, and the handwritten text detection box is obtained.

[0169] The image located within the handwritten text detection box in the second handwritten text image to be recognized is cropped to obtain the first handwritten text image to be recognized.

[0170] Optionally, the first acquisition unit 501 is used for:

[0171] The image within the labeled box of the first target sample image in the preset image set is cropped to obtain the cropped image;

[0172] Move the cropped image to a random position within the second target sample image;

[0173] If the intersection-over-union (IoU) ratio of the cropped image and the annotation box of the second target sample image is less than a preset value, the cropped image is pasted onto the second target sample image to obtain the third target sample image;

[0174] Add the third target sample image to the preset image set;

[0175] A handwritten text detection model is obtained by training a preset target detection model on a preset image set.

[0176] Optionally, the handwritten text detection model includes multiple sub-detection models, which are trained on multiple different sub-training sets. The sub-training sets are subsets of a preset image set. The first acquisition unit 501 is used for:

[0177] The second handwritten text image to be recognized is input into multiple sub-detection models to obtain the detection boxes and the confidence scores of the detection boxes of multiple sub-detection models.

[0178] The detection boxes of multiple sub-detection models are weighted based on the confidence of multiple detection boxes to obtain the handwritten text detection boxes.

[0179] This application also provides a computer device that integrates any of the handwritten text recognition devices provided in this application. The computer device includes:

[0180] One or more processors;

[0181] Memory; and

[0182] One or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the processor to perform the steps of the handwritten text recognition method in any of the embodiments described above.

[0183] Figure 6 shows a structural schematic diagram of the computer device involved in the embodiments of this application. Specifically:

[0184] The computer device may include components such as a processor 601 with one or more processing cores, a memory 602 with one or more computer-readable storage media, a power supply 603, and an input unit 604. Those skilled in the art will understand that the computer device structure shown in the figures does not constitute a limitation on the computer device, and may include more or fewer components than shown, or combine certain components, or have different component arrangements. Wherein:

[0185] Processor 601 is the control center of the computer device. It connects various parts of the computer device via various interfaces and lines, and performs various functions and processes data by running or executing software programs and / or modules stored in memory 602, and by calling data stored in memory 602, thereby providing overall monitoring of the computer device. Optionally, processor 601 may include one or more processing cores; processor 601 may be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or any conventional processor. Preferably, processor 601 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, and applications, and the modem processor mainly handles wireless communication. It is understood that the aforementioned modem processor may not be integrated into processor 601.

[0186] The memory 602 can be used to store software programs and modules. The processor 601 executes various functional applications and data processing by running the software programs and modules stored in the memory 602. The memory 602 may mainly include a program storage area and a data storage area. The program storage area may store the operating system, application programs required for at least one function (such as sound playback function, image playback function, etc.), etc.; the data storage area may store data created according to the use of the computer device, etc. In addition, the memory 602 may include high-speed random access memory, and may also include non-volatile memory, such as at least one disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory 602 may also include a memory controller to provide the processor 601 with access to the memory 602.

[0187] The computer device also includes a power supply 603 that supplies power to the various components. Preferably, the power supply 603 can be logically connected to the processor 601 through a power management system, thereby enabling functions such as charging, discharging, and power consumption management through the power management system. The power supply 603 may also include one or more DC or AC power supplies, recharging systems, power fault detection circuits, power converters or inverters, power status indicators, and other arbitrary components.

[0188] The computer device may also include an input unit 604, which can be used to receive input digital or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.

[0189] Although not shown, the computer device may also include a display unit, etc., which will not be described in detail here. Specifically, in this embodiment, the processor 601 in the computer device loads the executable files corresponding to the processes of one or more application programs into the memory 602 according to the following instructions, and the processor 601 runs the application programs stored in the memory 602 to realize various functions, as follows:

[0190] Acquire the first handwritten text image to be recognized;

[0191] A first feature map and a second feature map of a first handwritten text image to be recognized are obtained. The first feature map is obtained by extracting features from the first handwritten text image to be recognized through a first feature extraction layer. The second feature map is obtained by extracting features from the first handwritten text image to be recognized through the first feature extraction layer and the second feature extraction layer in sequence. The dimension of the convolution operation performed by the second feature extraction layer is higher than the dimension of the convolution operation performed by the first feature extraction layer.

[0192] The second feature map and the first feature map are upsampled and fused to obtain the third feature map;

[0193] The third feature map and the second feature map are downsampled and fused to obtain the fourth feature map;

[0194] The first handwritten text image to be identified is identified based on the fourth feature map, and the handwritten text recognition result is obtained.

[0195] Those skilled in the art will understand that all or part of the steps in the various methods of the above embodiments can be performed by instructions, or by instructions controlling related hardware. These instructions can be stored in a computer-readable storage medium and loaded and executed by a processor.

[0196] Therefore, embodiments of this application provide a computer-readable storage medium, which may include: read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disk, etc. A computer program is stored thereon, and the computer program is loaded by a processor to execute the steps in any of the handwritten text recognition methods provided in embodiments of this application. For example, the computer program loaded by the processor can execute the following steps:

[0197] Acquire the first handwritten text image to be recognized;

[0198] A first feature map and a second feature map of a first handwritten text image to be recognized are obtained. The first feature map is obtained by extracting features from the first handwritten text image to be recognized through a first feature extraction layer. The second feature map is obtained by extracting features from the first handwritten text image to be recognized through the first feature extraction layer and the second feature extraction layer in sequence. The dimension of the convolution operation performed by the second feature extraction layer is higher than the dimension of the convolution operation performed by the first feature extraction layer.

[0199] The second feature map and the first feature map are upsampled and fused to obtain the third feature map;

[0200] The third feature map and the second feature map are downsampled and fused to obtain the fourth feature map;

[0201] The first handwritten text image to be identified is identified based on the fourth feature map, and the handwritten text recognition result is obtained.

[0202] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the detailed descriptions of other embodiments above, which will not be repeated here.

[0203] In practice, each of the above units or structures can be implemented as an independent entity or can be arbitrarily combined to be implemented as the same or several entities. For the specific implementation of each of the above units or structures, please refer to the previous method embodiments, which will not be repeated here.

[0204] For details on the implementation of each of the above operations, please refer to the previous examples, which will not be repeated here.

[0205] The present application provides a detailed description of a handwritten text recognition method and apparatus. Specific examples have been used to illustrate the principles and implementation methods of the present application. The description of the above embodiments is only for the purpose of helping to understand the method and core ideas of the present application. At the same time, those skilled in the art will recognize that there will be changes in the specific implementation methods and application scope based on the ideas of the present application. Therefore, the content of this specification should not be construed as a limitation of the present application.

Claims

1. A handwritten text recognition method, characterized in that the method is applied to a handwritten text classification model, wherein the handwritten text classification model includes a first feature extraction layer and a second feature extraction layer, and the handwritten text recognition method includes: acquiring a first handwritten text image to be recognized; obtaining a first feature map and a second feature map of the first handwritten text image to be recognized, wherein the first feature map is obtained by performing feature extraction on the first handwritten text image to be recognized through the first feature extraction layer, and the second feature map is obtained by performing feature extraction on the first handwritten text image to be recognized through the first feature extraction layer and the second feature extraction layer in sequence, wherein the dimension of the convolution operation performed by the second feature extraction layer is higher than the dimension of the convolution operation performed by the first feature extraction layer; upsampling and fusing the second feature map and the first feature map, wherein the upsampling fusion includes upsampling the second feature map and fusing the upsampled second feature map with the first feature map to obtain a third feature map; downsampling and fusing the third feature map and the second feature map, wherein the downsampling fusion includes downsampling the third feature map and fusing the downsampled third feature map with the second feature map to obtain a fourth feature map; and recognizing the first handwritten text image to be recognized based on the fourth feature map to obtain a handwritten text recognition result.

2. The method of claim 1, wherein the step of obtaining the first feature map and the second feature map of the first handwritten text image to be recognized includes: performing feature extraction on the first handwritten text image to be recognized to obtain a fifth feature map; performing channel attention feature extraction on the fifth feature map to obtain a channel attention feature map; and determining the first feature map based on the channel attention feature map and the fifth feature map.

3. The method of claim 2, wherein determining the first feature map based on the channel attention feature map and the fifth feature map includes: fusing the channel attention feature map and the fifth feature map to obtain a sixth feature map; performing spatial attention feature extraction on the sixth feature map to obtain a spatial attention feature map; and fusing the spatial attention feature map and the sixth feature map to obtain the first feature map.
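The sequence in claims 2 and 3 (channel attention, fusion, spatial attention, fusion) can be sketched as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the channel attention weight is the sigmoid of each channel's global average (practical modules usually add a small learned network), the spatial attention weight is the sigmoid of the per-pixel mean over channels, and "fusing" is taken to mean element-wise multiplication. All function names are hypothetical.

```python
import math

# Illustrative sketch only: attention formulas and fusion operator are assumptions.


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_attention(fm):
    """One weight per channel: sigmoid of the channel's global average."""
    weights = []
    for ch in fm:
        total = sum(sum(row) for row in ch)
        count = len(ch) * len(ch[0])
        weights.append(sigmoid(total / count))
    return weights


def apply_channel(fm, weights):
    """Fuse by scaling every value of a channel by that channel's weight."""
    return [[[v * w for v in row] for row in ch] for ch, w in zip(fm, weights)]


def spatial_attention(fm):
    """One weight per pixel: sigmoid of the mean over channels at that pixel."""
    n, h, w = len(fm), len(fm[0]), len(fm[0][0])
    return [
        [sigmoid(sum(fm[k][r][c] for k in range(n)) / n) for c in range(w)]
        for r in range(h)
    ]


def apply_spatial(fm, mask):
    """Fuse by scaling every pixel of every channel by the spatial weight."""
    return [
        [[v * m for v, m in zip(row, mrow)] for row, mrow in zip(ch, mask)]
        for ch in fm
    ]


def first_feature_map(fifth):
    """fifth -> channel attention -> sixth -> spatial attention -> first."""
    ch_att = channel_attention(fifth)
    sixth = apply_channel(fifth, ch_att)
    sp_att = spatial_attention(sixth)
    return apply_spatial(sixth, sp_att)


# Toy "fifth" feature map with 2 channels of size 2x2 (channels, rows, columns).
fifth = [
    [[0.0, 0.0], [0.0, 0.0]],
    [[2.0, 2.0], [2.0, 2.0]],
]
first = first_feature_map(fifth)
```

The output has the same shape as the input, so it can be passed on to the second feature extraction layer unchanged.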

4. The method of claim 1, wherein the handwritten text classification model uses a weighted sum of a circular loss and a cross-entropy loss as the total loss function during the training phase, and the weight coefficient of the cross-entropy loss is greater than the weight coefficient of the circular loss.
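The weighted total loss of claim 4 can be sketched as below. The claim only requires that the cross-entropy weight exceed the circular-loss weight; the default values `0.7` and `0.3` are hypothetical, and the circular loss is passed in as an already computed number because the claim does not define its formula. The function names are assumptions made for illustration.

```python
import math

# Illustrative sketch only: weight values are hypothetical and the circular
# loss is treated as a precomputed input.


def cross_entropy(logits, target):
    """Cross-entropy of one sample from raw logits (log-sum-exp form for stability)."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def total_loss(ce_loss, circular_loss, w_ce=0.7, w_circ=0.3):
    """Weighted sum of the two losses, enforcing w_ce > w_circ as in claim 4."""
    if not w_ce > w_circ:
        raise ValueError("cross-entropy weight must exceed circular-loss weight")
    return w_ce * ce_loss + w_circ * circular_loss


# Example: three-class logits with class 0 as the label and a made-up circular loss value.
example = total_loss(cross_entropy([2.0, 0.5, 0.1], 0), circular_loss=0.4)
```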

5. The handwritten text recognition method according to any one of claims 1-4, characterized in that the step of acquiring the first handwritten text image to be recognized includes: obtaining a second handwritten text image to be recognized and a handwritten text detection model, wherein the handwritten text detection model is obtained by training a preset target detection model using a preset image set, the preset image set includes multiple sample images, and the sample images are marked with annotation boxes of handwritten text regions; detecting the second handwritten text image to be recognized based on the handwritten text detection model to obtain a handwritten text detection box; and cropping the image located within the handwritten text detection box in the second handwritten text image to be recognized to obtain the first handwritten text image to be recognized.
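The final cropping step of claim 5 can be sketched as follows, assuming the image is a nested list of pixel rows and the detection box is given as `(x1, y1, x2, y2)` with exclusive right and bottom edges. Both the box convention and the function name `crop` are assumptions; the detection model itself is outside the scope of this sketch.

```python
# Illustrative sketch only: box format (x1, y1, x2, y2) is an assumption.


def crop(image, box):
    """Return the pixels inside the box, clamped to the image bounds."""
    height, width = len(image), len(image[0])
    x1, y1, x2, y2 = box
    x1, y1 = max(0, int(x1)), max(0, int(y1))
    x2, y2 = min(width, int(x2)), min(height, int(y2))
    if x1 >= x2 or y1 >= y2:
        raise ValueError("detection box does not overlap the image")
    return [row[x1:x2] for row in image[y1:y2]]


# Toy 3x4 "second" image and a detection box covering columns 1-2 of rows 0-1.
second_image = [[0, 1, 2, 3],
                [4, 5, 6, 7],
                [8, 9, 10, 11]]
first_image = crop(second_image, (1, 0, 3, 2))
```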

6. The handwritten text recognition method according to claim 5, characterized in that the acquisition of the second handwritten text image to be recognized and the handwritten text detection model includes: cropping the image within the annotation box of a first target sample image in the preset image set to obtain a cropped image; moving the cropped image to a random position within a second target sample image; if the intersection over union (IoU) of the bounding boxes of the cropped image and the second target sample image is less than a preset value, pasting the cropped image onto the second target sample image to obtain a third target sample image; adding the third target sample image to the preset image set; and training the preset target detection model based on the preset image set to obtain the handwritten text detection model.
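The crop-and-paste augmentation of claim 6 can be sketched as below. This is an illustration under assumptions, not the patented procedure: boxes use the `(x1, y1, x2, y2)` format, the IoU test is applied against every existing annotation box of the second target sample image, and the threshold `max_iou=0.1` is a hypothetical preset value. The names `iou` and `try_paste` are also hypothetical.

```python
import random

# Illustrative sketch only: box format, threshold and comparison target are assumptions.


def iou(a, b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format."""
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def try_paste(patch, target, target_boxes, max_iou=0.1, rng=random):
    """Paste `patch` at a random position in `target` if it overlaps no box too much.

    Returns (new_image, new_boxes) on success or None if the position is rejected.
    """
    ph, pw = len(patch), len(patch[0])
    th, tw = len(target), len(target[0])
    if ph > th or pw > tw:
        return None
    x = rng.randint(0, tw - pw)
    y = rng.randint(0, th - ph)
    new_box = (x, y, x + pw, y + ph)
    if any(iou(new_box, box) >= max_iou for box in target_boxes):
        return None
    pasted = [list(row) for row in target]       # copy so the input stays unchanged
    for r in range(ph):
        pasted[y + r][x:x + pw] = patch[r]
    return pasted, list(target_boxes) + [new_box]


# Toy example: a 2x2 patch pasted into an 8x8 blank image with one annotation box.
patch = [[1, 1], [1, 1]]
target = [[0] * 8 for _ in range(8)]
result = try_paste(patch, target, [(0, 0, 3, 3)], rng=random.Random(0))
```

A rejected position returns `None`, so a caller would typically retry a few times before giving up on a given pair of sample images.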

7. The handwritten text recognition method according to claim 5, characterized in that the handwritten text detection model includes multiple sub-detection models, which are trained on multiple different sub-training sets, the sub-training sets being subsets of the preset image set; and the step of detecting the second handwritten text image to be recognized based on the handwritten text detection model to obtain a handwritten text detection box includes: inputting the second handwritten text image to be recognized into the multiple sub-detection models to obtain the detection boxes of the multiple sub-detection models and the confidence scores corresponding to the detection boxes; and weighting the detection boxes of the multiple sub-detection models based on the confidence scores of the detection boxes to obtain the handwritten text detection box.
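The confidence-weighted combination in claim 7 can be sketched as a weighted average of box coordinates. This sketch assumes that the boxes passed in have already been matched to the same text region (the matching step is not described in the claim and is omitted here) and that boxes use the `(x1, y1, x2, y2)` format; the function name `fuse_boxes` is hypothetical.

```python
# Illustrative sketch only: assumes the boxes all refer to the same text region.


def fuse_boxes(boxes, scores):
    """Average box coordinates, weighting each box by its confidence score."""
    if not boxes or len(boxes) != len(scores):
        raise ValueError("need one confidence score per detection box")
    total = sum(scores)
    if total <= 0:
        raise ValueError("confidence scores must sum to a positive value")
    return tuple(
        sum(box[i] * score for box, score in zip(boxes, scores)) / total
        for i in range(4)
    )


# Two sub-detection models propose slightly different boxes for the same region.
fused = fuse_boxes([(0, 0, 10, 10), (4, 4, 14, 14)], [0.75, 0.25])
```

With these inputs the fused box lies closer to the first proposal, because its confidence score is three times larger.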

8. A handwritten text recognition device, characterized in that the device stores a handwritten text classification model, which includes a first feature extraction layer and a second feature extraction layer, and the handwritten text recognition device includes: a first acquisition unit, used to acquire a first handwritten text image to be recognized; a second acquisition unit, used to acquire a first feature map and a second feature map of the first handwritten text image to be recognized, wherein the first feature map is obtained by extracting features from the first handwritten text image to be recognized through the first feature extraction layer, the second feature map is obtained by extracting features from the first handwritten text image to be recognized through the first feature extraction layer and the second feature extraction layer in sequence, and the dimension of the convolution operation performed by the second feature extraction layer is higher than the dimension of the convolution operation performed by the first feature extraction layer; an upsampling fusion unit, used to perform upsampling fusion on the second feature map and the first feature map, wherein the upsampling fusion includes upsampling the second feature map and fusing the upsampled second feature map with the first feature map to obtain a third feature map; a downsampling fusion unit, used to perform downsampling fusion on the third feature map and the second feature map, wherein the downsampling fusion includes downsampling the third feature map and fusing the downsampled third feature map with the second feature map to obtain a fourth feature map; and a recognition unit, used to recognize the first handwritten text image to be recognized based on the fourth feature map to obtain a handwritten text recognition result.

9. A computer device, characterized in that the computer device includes: one or more processors; a memory; and one or more applications, wherein the one or more applications are stored in the memory and configured to be executed by the one or more processors to implement the handwritten text recognition method according to any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that it stores a computer program, which is loaded by a processor to perform the steps of the handwritten text recognition method according to any one of claims 1 to 7.