Text recognition method, device and equipment and storage medium

By acquiring and fusing point-level and stroke-level trajectory information of handwritten text, and combining the Transformer model and cross-attention mechanism, the problem of poor accuracy in handwritten text recognition is solved, and accurate recognition of complex writing scenarios is achieved.

CN116343235B · Active · Publication Date: 2026-06-26 · IFLYTEK CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: IFLYTEK CO LTD
Filing Date: 2023-02-13
Publication Date: 2026-06-26


Abstract

The application provides a text recognition method and device, equipment and a storage medium. The specific implementation scheme is as follows: point-level trajectory information and stroke-level trajectory information of dynamic handwriting of a text to be recognized are acquired; corresponding trajectory point features are determined by using the point-level trajectory information and the stroke-level trajectory information; and a recognition result of the text to be recognized is obtained based on the trajectory point features. According to the technical scheme of the application, the accuracy of handwriting text recognition can be effectively improved.

Description

Technical Field

[0001] This application relates to the field of deep learning technology, and in particular to a text recognition method, apparatus, device, and storage medium.

Background Technology

[0002] Handwriting is one of the most natural and effective ways for humans to record information. The advent of a series of electronic devices has made information recording extremely convenient. However, handwritten text has the characteristics of diversity, uncertainty, and long-tail distribution. In addition, the writing styles of different people's handwritten texts vary greatly, which makes the recognition effect of handwritten text poor.

Summary of the Invention

[0003] To address the aforementioned issues, this application proposes a text recognition method, apparatus, device, and storage medium that can significantly improve the accuracy of handwritten text recognition.

[0004] According to a first aspect of the embodiments of this application, a text recognition method is provided, comprising:

[0005] Obtain point-level and stroke-level trajectory information of the dynamic handwriting of the text to be recognized;

[0006] The corresponding trajectory point features are determined using the point-level trajectory information and the stroke-level trajectory information;

[0007] The recognition result of the text to be recognized is obtained based on the trajectory point features.

[0008] According to a second aspect of the embodiments of this application, a text recognition device is provided, comprising:

[0009] The acquisition module is used to acquire point-level trajectory information and stroke-level trajectory information of the dynamic handwriting of the text to be recognized;

[0010] The feature extraction module is used to determine the corresponding trajectory point features using the point-level trajectory information and the stroke-level trajectory information;

[0011] The recognition module is used to obtain the recognition result of the text to be recognized based on the features of the trajectory points.

[0012] A third aspect of this application provides an electronic device, comprising:

[0013] Memory and processor;

[0014] The memory is connected to the processor and is used to store programs;

[0015] The processor implements the text recognition method described above by running the program in the memory.

[0016] A fourth aspect of this application provides a storage medium storing a computer program, which, when executed by a processor, implements the aforementioned text recognition method.

[0017] One embodiment of the above application has the following advantages or beneficial effects:

[0018] The system acquires point-level and stroke-level trajectory information of the dynamic handwriting of the text to be recognized. By combining the point-level and stroke-level trajectory information, the trajectory point features of the text to be recognized can be determined more accurately through different levels of information, thereby enabling accurate recognition of the content of the handwritten text based on the trajectory point features.

Attached Figure Description

[0019] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only embodiments of this application. For those skilled in the art, other drawings can be obtained based on the provided drawings without creative effort.

[0020] Figure 1 is a flowchart of a text recognition method provided in an embodiment of this application;

[0021] Figure 2 is a schematic diagram of step S110 in a text recognition method provided in an embodiment of this application;

[0022] Figure 3 is a flowchart of step S130 of a text recognition method provided in an embodiment of this application;

[0023] Figure 4 is a schematic diagram of the process of fusing image features and trajectory point features provided in an embodiment of this application;

[0024] Figure 5 is a schematic diagram of the specific process of another text recognition method provided in an embodiment of this application;

[0025] Figure 6 is a schematic diagram of the structure of a text recognition device provided in an embodiment of this application;

[0026] Figure 7 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0027] The technical solutions of this application are applicable to various handwritten text recognition scenarios, such as meeting scenarios and online education scenarios. Using the technical solutions of this application can improve the accuracy of handwritten text recognition.

[0028] The technical solutions of this application can be applied, by way of example, to hardware devices such as processors, electronic devices, and servers (including cloud servers), or packaged as software programs and run. When the hardware device executes the processing procedure of the technical solutions of this application, or when the aforementioned software program is run, the accuracy of handwritten text recognition can be improved. This application only provides illustrative descriptions of the specific processing procedure of the technical solutions of this application and does not limit the specific implementation form of the technical solutions of this application. Any technical implementation form that can execute the processing procedure of the technical solutions of this application can be adopted by this application.

[0029] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0030] Exemplary methods

[0031] Figure 1 is a flowchart of a text recognition method according to an embodiment of this application. In an exemplary embodiment, a text recognition method is provided, including:

[0032] S110. Obtain the point-level trajectory information and stroke-level trajectory information of the dynamic handwriting of the text to be recognized;

[0033] S120. Determine the corresponding trajectory point features using the point-level trajectory information and the stroke-level trajectory information;

[0034] S130. Obtain the recognition result of the text to be recognized based on the trajectory point features.

[0035] In step S110, exemplarily, the text to be recognized represents the text data to be recognized, which is formed by a number of trajectory points. Optionally, the text to be recognized can be text data formed by trajectory points acquired on a terminal device, wherein the terminal device includes devices with writing and communication functions such as mobile phones, tablets, and electronic whiteboards. Dynamic handwriting refers to the handwriting formed by a user writing text on a terminal device. It is understood that different people write the same characters in different styles; therefore, the same text can correspond to multiple different handwritings.

[0036] For example, point-level trajectory information represents the features of the text to be recognized determined by trajectory points. Optionally, multiple trajectory points are selected in the dynamic handwriting, and point-level trajectory information is formed based on these multiple trajectory points. Stroke-level trajectory information represents the features of the text to be recognized determined based on the trajectory points of each stroke. Specifically, when a user writes text on a terminal device, the trajectory points of the text to be recognized are received in real time, and point-level trajectory information is determined. Then, based on pre-stored strokes, stroke-level trajectory information is determined from the trajectory points of the text to be recognized.

[0037] In step S120, for example, trajectory point features are used to represent the feature information of the dynamic handwriting of the text to be recognized. To comprehensively represent the feature information of the dynamic handwriting of the text to be recognized, point-level trajectory information and stroke-level trajectory information are fused. Through these two levels of features, the features of the text to be recognized are determined more accurately. Specifically, point-level trajectory information and stroke-level trajectory information are concatenated, and the concatenation result is input into a feature extraction network to obtain trajectory point features. The feature extraction network can use a Transformer model, or other models, which are not limited here.

[0038] In step S130, for example, a correspondence between trajectory point features and text recognition results can be pre-set. This way, after determining the trajectory point features of the text to be recognized, the recognition result can be determined based on the aforementioned correspondence. Optionally, the trajectory point features can be used as training data input to the neural network model, and the text recognition result can be used as the output of the neural network model. This trains the neural network model, resulting in a trained model. The trained model is then used to recognize the trajectory point features to obtain the recognition result of the text to be recognized.

[0039] In related technologies, existing handwritten text recognition methods are typically based on supervised learning, which requires a large amount of labeled data to train the model. However, acquiring large amounts of labeled data is extremely costly, which hinders the rapid implementation and deployment of a mature handwritten text detection or recognition system. Furthermore, although some work has explored weakly supervised training for text recognition, the datasets used by these methods mostly come from similar texts, which differ somewhat from the text lines commonly found in real-world applications. Therefore, it is essential to use weakly supervised pre-trained models for text lines. Weakly supervised text recognition methods typically do not train the model on large amounts of manually labeled data, but instead obtain pseudo-labels as supervision signals through various means. For example, a model can be trained in a supervised manner on a small amount of labeled data and then used to generate pseudo-labels as supervision signals. However, such methods are often limited by the performance of the existing model: if the current model recognizes certain complex cases poorly, the generated supervision signals will be unreliable, and the resulting weakly supervised model will also fail in those scenarios. Therefore, existing supervised and weakly supervised techniques struggle to handle complex writing scenarios and content when labeled data is limited.

[0040] In the technical solution of this application, point-level trajectory information and stroke-level trajectory information of the dynamic handwriting of the text to be recognized are obtained. By combining point-level trajectory information and stroke-level trajectory information, the trajectory point features of the text to be recognized can be determined more accurately through information at different levels. In this way, the writing content in various complex writing scenarios can be recognized through trajectory point features, such as uncommon characters and handwritten text.

[0041] In one implementation, as shown in Figure 2, step S110 of acquiring the point-level trajectory information and stroke-level trajectory information of the dynamic handwriting of the text to be recognized includes:

[0042] S210. Obtain the trajectory point sequence of the dynamic handwriting of the text to be recognized;

[0043] S220. Determine the point-level trajectory information and stroke-level trajectory information in the trajectory point sequence.

[0044] For example, the trajectory point sequence represents the sequence of trajectory points corresponding to the input text. Specifically, the dynamic handwriting of the text to be recognized consists of a series of trajectory points $x_i, y_i \in \mathbb{R}^{T}$, where $T$ represents the number of trajectory points. A series of trajectory points can form different strokes $S_i \in \mathbb{R}^{n}$, where $n$ represents the total number of strokes in the text line. To utilize information at different levels, the trajectory point sequence needs to be preprocessed. Specifically, an adaptive normalization method is used to preprocess the trajectory point sequence. The method is as follows: given any two points on a stroke, calculate the ratios $r_{xj}$ and $r_{yj}$ ($j = 1, 2, \ldots, N-1$) of the projections of the stroke between the two points onto the x-axis and y-axis, respectively, to the length of its chord, where $N$ is the number of handwriting points. The median is used as the normalization scale.

[0045] Preferably, step S220 includes: dividing the trajectory point sequence into multiple trajectory point sets, and performing masking processing on each trajectory point set to obtain a trajectory point vector;

[0046] The trajectory points of each stroke are determined in the trajectory point sequence, and the trajectory points of each stroke are masked to obtain the stroke trajectory vector.

[0047] For example, m trajectory points are randomly selected from the trajectory point sequence each time to form a token group. The selection is made without repetition, and m can be 32 or any other positive integer, without limitation. This results in T / m token groups. Then, 70% of the T / m token groups are randomly masked and represented using learnable embeddings. The corresponding trajectory point vectors are output and serve as the point-level trajectory information.
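The grouping and masking described in this paragraph can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `make_point_tokens`, the fixed seed, and the choice to drop a remainder smaller than m are assumptions, and the learnable embedding that replaces a masked group inside the model is only represented by a set of masked group positions.

```python
import random

def make_point_tokens(points, m=32, mask_ratio=0.7, seed=0):
    """Group trajectory points into tokens of m points and pick groups to mask.

    points: list of (x, y) tuples.
    Returns (groups, masked): groups is a list of index lists, masked is the
    set of group positions that a model would replace with a learnable
    mask embedding.
    """
    rng = random.Random(seed)
    indices = list(range(len(points)))
    rng.shuffle(indices)              # random selection without repetition
    n_groups = len(indices) // m      # T / m groups; a remainder is dropped (assumption)
    groups = [indices[g * m:(g + 1) * m] for g in range(n_groups)]
    n_masked = round(mask_ratio * n_groups)
    masked = set(rng.sample(range(n_groups), n_masked))
    return groups, masked

# 320 synthetic points give 10 groups of 32, of which 7 are masked.
points = [(float(i), float(i % 7)) for i in range(320)]
groups, masked = make_point_tokens(points, m=32)
print(len(groups), len(masked))
```

Shuffling the indices once and cutting them into consecutive chunks is one simple way to draw m points at a time without reusing any point.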

[0048] The strokes are likewise grouped into a token group. Since the strokes differ in length, each stroke is first resampled: strokes with more than k trajectory points are uniformly sampled down to k trajectory points, and strokes with fewer than k trajectory points are padded to k points through interpolation. Here, k can be 32 or any other positive integer, without limitation. Then, 70% of the stroke tokens in this group are randomly masked and represented using learnable embeddings, and the corresponding stroke trajectory vectors are output. These stroke trajectory vectors serve as the stroke-level trajectory information. In this way, masking the trajectory points and stroke trajectories allows the model to better learn the features of the text to be recognized during model training.
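A minimal sketch of bringing every stroke to exactly k points is shown below. The source does not specify the sampling or interpolation scheme, so the sketch assumes a single index-based linear interpolation that both shortens long strokes and pads short ones; the name `resample_stroke` is hypothetical.

```python
def resample_stroke(stroke, k=32):
    """Resample a stroke (list of (x, y) tuples) to exactly k points.

    Sample positions are spaced evenly over the point index and linearly
    interpolated, so long strokes are thinned and short ones are padded.
    """
    if not stroke:
        raise ValueError("stroke must contain at least one point")
    if len(stroke) == 1:
        return [stroke[0]] * k
    last = len(stroke) - 1
    out = []
    for j in range(k):
        pos = j * last / (k - 1) if k > 1 else 0.0   # fractional index into the stroke
        lo = int(pos)
        hi = min(lo + 1, last)
        t = pos - lo
        x = stroke[lo][0] + t * (stroke[hi][0] - stroke[lo][0])
        y = stroke[lo][1] + t * (stroke[hi][1] - stroke[lo][1])
        out.append((x, y))
    return out

short = resample_stroke([(0.0, 0.0), (2.0, 2.0)], k=5)       # padded by interpolation
long_stroke = [(float(i), 0.0) for i in range(100)]
thinned = resample_stroke(long_stroke, k=32)                   # reduced to 32 points
```

After resampling, every stroke has the same length, so the strokes can be stacked into fixed-size tokens and masked in the same way as the point-level groups.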

[0049] In one implementation, as shown in Figure 3, step S130 of obtaining the recognition result of the text to be recognized based on the trajectory point features includes:

[0050] S310. Determine the image features corresponding to the text image of the text to be recognized;

[0051] S320. Based on the image features and the trajectory point features, the recognition result of the text to be recognized is obtained.

[0052] For example, a text image represents an image containing the text to be identified. The text image can be a screenshot taken from a terminal device or captured by another device. Optionally, image features can be extracted from the text image using a trained image processing model. Alternatively, it can be obtained through...

[0053] Specifically, after obtaining the text image, it is preprocessed by normalizing the pixel values to between -1 and 1. Then, the image is segmented into non-overlapping patches of 16×16 pixels each, which results in ... image patches. Then, 70% of these patches are randomly masked and represented using learnable embeddings to obtain image vectors. These image vectors are then input into a feature extraction network to obtain image features. The feature extraction network can be a Transformer model, or other models; no specific limitation is made here.
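The image preprocessing in this paragraph can be sketched with plain Python lists. Several details are assumptions made for illustration: the input is taken to be an 8-bit grayscale image with values in [0, 255], rows and columns that do not fill a whole patch are dropped, and the helper names `normalize`, `to_patches`, and `choose_masked` are hypothetical.

```python
import random

def normalize(image):
    """Map 8-bit pixel values in [0, 255] to the range [-1, 1]."""
    return [[p / 127.5 - 1.0 for p in row] for row in image]

def to_patches(image, size=16):
    """Split a 2-D image (list of rows) into non-overlapping size x size patches."""
    h, w = len(image), len(image[0])
    patches = []
    for top in range(0, h - h % size, size):
        for left in range(0, w - w % size, size):
            patches.append([row[left:left + size] for row in image[top:top + size]])
    return patches

def choose_masked(n_patches, ratio=0.7, seed=0):
    """Pick the patch positions a model would replace with a mask embedding."""
    rng = random.Random(seed)
    return set(rng.sample(range(n_patches), round(ratio * n_patches)))

# A synthetic 32 x 64 image yields 2 x 4 = 8 patches, 6 of which are masked.
image = [[(r + c) % 256 for c in range(64)] for r in range(32)]
patches = to_patches(normalize(image))
masked = choose_masked(len(patches))
print(len(patches), len(masked))
```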

[0054] Furthermore, after acquiring the point-level and stroke-level trajectory information of the dynamic handwriting of the text to be recognized, the point-level and stroke-level trajectory information are concatenated. The concatenated result and the image vector are then input into the feature extraction network, which uses a symmetric native Transformer model (i.e., a two-stream structure) as its backbone network. The concatenated result and the image vector are respectively input into their respective encoders. After passing through multiple layers of Transformer networks, the trajectory point features and image features (i.e., high-level feature representations of their respective modalities) are output, respectively. In addition, the Transformer can capture long-distance dependencies through multi-head self-attention, thereby constructing global contextual semantics, which enables the extraction of more accurate trajectory point features and image features.

[0055] For example, a pre-defined correspondence between trajectory point features, image features, and text recognition results can be established. After determining the trajectory point features and image features, the recognition result of the text to be recognized can be determined based on this correspondence. Alternatively, the trajectory point features and image features can be used as training data input to a neural network model, and the text recognition result can be used as the output of the neural network model. This allows for training of the neural network model, resulting in a trained model. The trained model can then be used to recognize the trajectory point features and image features to obtain the recognition result of the text to be recognized.

[0056] In one implementation, as shown in Figure 4, step S320 of obtaining the recognition result of the text to be recognized based on the image features and the trajectory point features includes:

[0057] S410. Perform cross-attention calculation on the image features and the trajectory point features to obtain fused features;

[0058] S420. Based on the fusion features, obtain the recognition result of the text to be recognized.

[0059] For example, fusion features are used to represent the result of interaction between information from multiple different modalities. Specifically, fusion features are obtained by processing image features and trajectory point features through a cross-attention mechanism. In this way, information from different modalities is used to effectively compensate for the information loss of the current modality, thereby realizing information interaction between different modalities.

[0060] For example, a correspondence between fusion features and text recognition results can be pre-defined. After determining the fusion features, the recognition result of the text to be recognized can be determined based on this correspondence. Alternatively, the fusion features can be used as training data input to a neural network model, and the text recognition result can be used as the output of the neural network model. This trains the neural network model, resulting in a trained model. The trained model is then used to recognize the fusion features to obtain the recognition result of the text to be recognized.

[0061] In one implementation, the step S410 of performing cross-attention calculation on the image features and the trajectory point features to obtain fused features includes:

[0062] Based on the relationship between each vector in the trajectory point features and the image features, and the relationship between each vector in the trajectory point features, a first fusion feature is determined;

[0063] The second fusion feature is determined based on the relationship between each vector in the image features and the trajectory point features, and the relationship between each vector in the image features.

[0064] In this embodiment, since information from different modalities can effectively compensate for the information loss of the current modality, when the information of the trajectory point features is supplemented based on the image features, the first fusion feature corresponding to the trajectory point features is output; when the information of the image features is supplemented based on the trajectory point features, the second fusion feature corresponding to the image features is output. In this way, information interaction can be achieved for both features.

[0065] The formula for the first fusion feature is as follows:

[0066] $$F_s = \mathrm{softmax}\left(\frac{Q_1 K_1^{\top}}{\sqrt{d_k}}\right) V_1$$

[0067] where $F_s$ denotes the first fusion feature, $d_k$ represents the feature dimension, $V_1$ and $K_1$ represent the relationship between the vectors in the trajectory point features, and $Q_1$ is the query vector, which represents the relationship between the vectors in the trajectory point features and the image features.

[0068] The formula for the second fusion feature is as follows:

[0069] $$F_v = \mathrm{softmax}\left(\frac{Q_2 K_2^{\top}}{\sqrt{d_k}}\right) V_2$$

[0070] where $F_v$ denotes the second fusion feature, $d_k$ represents the feature dimension, $V_2$ and $K_2$ represent the relationship between the vectors in the image features, and $Q_2$ is the query vector, which represents the relationship between the vectors in the image features and the trajectory point features.
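The cross-attention step of S410 can be illustrated with standard scaled dot-product attention. This is a sketch under stated assumptions rather than the applicant's implementation: which modality supplies the queries and which supplies the keys and values is assumed here (queries from the modality being supplemented, keys and values from the other one), the learned projections that normally produce Q, K, and V are omitted, and the helper names are hypothetical.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    peak = max(row)                      # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    scores = matmul(queries, transpose(keys))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, values)

traj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # 3 trajectory tokens, d = 2
img = [[0.5, 0.5], [1.0, -1.0]]               # 2 image patch tokens, d = 2
f_s = cross_attention(traj, img, img)         # trajectory tokens attend to the image
f_v = cross_attention(img, traj, traj)        # image tokens attend to the trajectory
```

Each output row is a weighted average of the other modality's vectors, which is how one modality can fill in information that the other lacks.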

[0071] In one implementation, the step S420 of obtaining the recognition result of the text to be recognized based on the fusion features includes:

[0072] The fused features are then subjected to nonlinear mapping to obtain the corresponding enhanced features;

[0073] The recognition result of the text to be recognized is obtained based on the enhanced features.

[0074] For example, since it is desired to further enhance the features of the dynamic handwriting of the text to be recognized, a nonlinear mapping method is used to process the fusion features, thereby obtaining enhanced features that fuse visual information and trajectory point information. In this way, by using cross-modal information aggregation to enhance features, the credibility of the features (i.e., enhanced features) of the dynamic handwriting of the text to be recognized can be further improved, and thus the recognition result of the text to be recognized can be determined more accurately.

[0075] For example, a correspondence between enhanced features and text recognition results can be pre-defined. After determining the enhanced features, the recognition result of the text to be recognized can be determined based on this correspondence. Alternatively, the enhanced features can be used as training data input to a neural network model, and the text recognition result can be used as the output of the neural network model. This trains the neural network model, resulting in a trained model. The trained model is then used to recognize the enhanced features to obtain the recognition result of the text to be recognized.

[0076] In one embodiment, the step of performing nonlinear mapping processing on the fused features to obtain corresponding enhanced features includes:

[0077] The first fusion feature and the second fusion feature are concatenated to obtain the concatenated feature;

[0078] The concatenated feature is processed with a nonlinear mapping function to obtain the feature enhancement factor;

[0079] Based on the feature enhancement factor and the first fusion feature, the trajectory point enhancement feature is calculated;

[0080] Image enhancement features are obtained by calculating based on the feature enhancement factor and the second fusion feature.

[0081] In this embodiment, if the fusion feature includes a first fusion feature determined based on the relationship between each vector in the trajectory point feature and the image feature, and a second fusion feature determined based on the relationship between each vector in the image feature and the trajectory point feature, then the first fusion feature and the second fusion feature are concatenated to obtain a concatenated feature. The concatenated feature is then input into a nonlinear mapping function to obtain a feature enhancement factor for cross-modal feature selection. The nonlinear mapping function can be, for example, the sigmoid function, the exponential linear unit (ELU), or the rectified linear unit (ReLU). Then, the first and second fusion features are further enhanced using an element-wise product. The formula is expressed as follows:

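[0082] $$\hat{F}_s = \sigma\big(\mathrm{concat}(F_s, F_v)\big) \odot F_s$$

[0083] $$\hat{F}_v = \sigma\big(\mathrm{concat}(F_s, F_v)\big) \odot F_v$$

[0084] where $\sigma$ represents the nonlinear mapping function, concat represents the concatenation operation, $\odot$ represents the element-wise product, $\hat{F}_s$ denotes the trajectory point enhancement feature, and $\hat{F}_v$ denotes the image enhancement feature.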
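The enhancement step can be sketched as a gating operation: the two fusion features are concatenated, passed through a nonlinear mapping to obtain an enhancement factor, and the factor is multiplied element-wise with each fusion feature. The sketch makes assumptions the source does not state: both fusion features have the same number of tokens and the same width d, the sigmoid is chosen as the nonlinear mapping, and a hypothetical 2d x d projection `weight` is applied before the sigmoid so that the factor matches the width of each fusion feature.

```python
import math

def sigmoid(v):
    """Numerically safe logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def enhance(f_s, f_v, weight):
    """Gate both fusion features with one shared enhancement factor.

    f_s, f_v: lists of n tokens with d features each (equal shapes assumed).
    weight:   hypothetical 2d x d projection applied before the sigmoid.
    """
    enhanced_s, enhanced_v = [], []
    for s_tok, v_tok in zip(f_s, f_v):
        joined = s_tok + v_tok                       # concatenation, 2d values
        factor = [sigmoid(sum(x * w for x, w in zip(joined, col)))
                  for col in zip(*weight)]           # d gate values in (0, 1)
        enhanced_s.append([g * x for g, x in zip(factor, s_tok)])
        enhanced_v.append([g * x for g, x in zip(factor, v_tok)])
    return enhanced_s, enhanced_v

f_s = [[1.0, 2.0], [3.0, 4.0]]
f_v = [[0.5, -1.0], [2.0, 0.0]]
weight = [[0.0, 0.0] for _ in range(4)]   # zero weights make every gate value 0.5
e_s, e_v = enhance(f_s, f_v, weight)
```

With trained weights the gate values differ per feature, so the factor acts as a soft selector of which cross-modal features to keep.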

[0085] In one implementation, obtaining the recognition result of the text to be recognized based on the trajectory point features includes:

[0086] The trajectory point features are input into a preset text recognition model to obtain the recognition result. The preset text recognition model is trained based on the reconstructed text trajectory points. The reconstructed text trajectory points are reconstructed based on the trajectory point features of the training text. The trajectory point features of the training text are determined based on the point-level trajectory information and stroke-level trajectory information of the training text.

[0087] For example, as shown in Figure 5, during the training of the text recognition model, point-level trajectory information and stroke-level trajectory information of the training text are determined based on the acquired trajectory points, and the corresponding text vector is determined by masking the acquired text image. The point-level trajectory information and stroke-level trajectory information are concatenated, and feature extraction is performed on the concatenated result and the text vector to obtain the trajectory point features and image features of the training text. Then, cross-attention calculation is performed on the image features and trajectory point features to obtain the first fusion feature and the second fusion feature. The first fusion feature and the second fusion feature are concatenated to obtain the concatenated feature, and a nonlinear mapping function is applied to the concatenated feature to obtain the feature enhancement factor. The feature enhancement factor is then combined with the first fusion feature and the second fusion feature to obtain the trajectory point enhancement feature and the image enhancement feature, respectively. The trajectory point enhancement feature and the image enhancement feature are reconstructed by the text reconstruction model to generate the reconstructed text trajectory points and the reconstructed text image. The text reconstruction model can use a multi-layer feedforward network (FFN) combined with the sigmoid function and the mean square error (MSE) function. This pre-training enables the text reconstruction model to learn features from various texts. The reconstructed text trajectory points and images generated by the text reconstruction model contain trajectory point enhancement features and image enhancement features. Therefore, using the reconstructed text trajectory points and images as training data to train the neural network model yields a text recognition model. This eliminates the need for a large amount of labeled data, achieving good results in downstream tasks with only a small amount of data. Furthermore, it is not limited by the performance of existing models, ensuring the accuracy of the text recognition model.
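The reconstruction objective mentioned above (a feedforward network with a sigmoid and an MSE loss) can be sketched for a single token. For brevity the sketch uses one feedforward layer instead of several, assumes the reconstruction targets are normalized to [0, 1], and uses hypothetical names and hand-picked weights; it only illustrates how the loss is formed, not how the applicant's model is trained.

```python
import math

def sigmoid(v):
    """Numerically safe logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)

def reconstruct(feature, weights, bias):
    """One feedforward layer followed by a sigmoid; outputs lie in (0, 1)."""
    return [sigmoid(sum(f * w for f, w in zip(feature, col)) + b)
            for col, b in zip(zip(*weights), bias)]

def mse(pred, target):
    """Mean squared error between a prediction and its target."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

feature = [0.2, -0.4, 0.1]                          # enhanced feature of one masked token
weights = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]      # hypothetical 3 x 2 layer
bias = [0.0, 0.0]
pred = reconstruct(feature, weights, bias)          # zero weights give [0.5, 0.5]
loss = mse(pred, [0.5, 1.0])                        # target (x, y) scaled to [0, 1]
print(pred, loss)
```

Minimizing this loss over masked tokens requires no text labels, which is why the pre-training stage can use unlabeled handwriting.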

[0088] Exemplary device

[0089] Correspondingly, Figure 6 is a schematic diagram of the structure of a text recognition device according to an embodiment of this application. In an exemplary embodiment, a text recognition device is provided, comprising:

[0090] The acquisition module 610 is used to acquire point-level trajectory information and stroke-level trajectory information of the dynamic handwriting of the text to be recognized;

[0091] The feature extraction module 620 is used to determine the corresponding trajectory point features using the point-level trajectory information and the stroke-level trajectory information;

[0092] The recognition module 630 is used to obtain the recognition result of the text to be recognized based on the features of the trajectory points.

[0093] In one implementation, the recognition module includes:

[0094] The image feature determination module is used to determine the image features corresponding to the text image of the text to be recognized;

[0095] The first processing module is used to obtain the recognition result of the text to be recognized based on the image features and the trajectory point features.

[0096] In one embodiment, the first processing module includes:

[0097] The cross-attention calculation module is used to perform cross-attention calculation on the image features and the trajectory point features to obtain fused features;

[0098] The second processing module is used to obtain the recognition result of the text to be recognized based on the fusion features.

[0099] In one implementation, the cross-attention calculation module includes:

[0100] The first fusion module is used to determine the first fusion feature based on the relationship between each vector in the trajectory point feature and the image feature and the relationship between each vector in the trajectory point feature;

[0101] The second fusion module is used to determine the second fusion feature based on the relationship between each vector in the image features and the trajectory point features, and the relationship between each vector in the image features.

[0102] In one embodiment, the second processing module includes:

[0103] The feature enhancement module is used to perform nonlinear mapping processing on the fused features to obtain the corresponding enhanced features;

[0104] The third processing module is used to obtain the recognition result of the text to be recognized based on the enhanced features.

[0105] In one implementation, the feature enhancement module is further configured to:

[0106] The first fusion feature and the second fusion feature are concatenated to obtain the concatenated feature;

[0107] The concatenated feature is processed with a nonlinear mapping function to obtain the feature enhancement factor;

[0108] The trajectory point enhancement feature is calculated based on the feature enhancement factor and the first fusion feature;

[0109] The image enhancement feature is calculated based on the feature enhancement factor and the second fusion feature.
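
The four steps above can be sketched as a simple gating operation. This is only one possible reading: the sketch assumes each fusion feature has been reduced to a single vector, that the nonlinear mapping is a sigmoid applied to a linear projection, and that the factor is applied by element-wise scaling. The application does not specify any of these, and `enhancement_factor`, `enhance`, `weights`, and `bias` are hypothetical names.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def enhancement_factor(first, second, weights, bias):
    # Step 1: concatenate the first and second fusion features.
    concat = first + second
    # Step 2: nonlinear mapping (assumed: sigmoid of a linear projection).
    return sigmoid(sum(w * x for w, x in zip(weights, concat)) + bias)

def enhance(first, second, weights, bias):
    g = enhancement_factor(first, second, weights, bias)
    # Steps 3 and 4: scale each fusion feature by the factor (assumed rule).
    traj_enhanced = [g * x for x in first]
    img_enhanced = [g * x for x in second]
    return g, traj_enhanced, img_enhanced

# Toy example: zero weights and zero bias give a factor of sigmoid(0).
factor, traj_enh, img_enh = enhance([2.0, 4.0], [-2.0], [0.0, 0.0, 0.0], 0.0)
```

In a trained model the projection weights would be learned, so the factor could emphasise or suppress the fused features depending on the input.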

[0110] In one embodiment, the acquisition module includes:

[0111] The trajectory point sequence acquisition module is used to acquire the trajectory point sequence of the dynamic handwriting of the text to be recognized;

[0112] The information extraction module is used to determine the point-level trajectory information and stroke-level trajectory information in the trajectory point sequence.

[0113] In one embodiment, the information extraction module is further configured to:

[0114] The trajectory point sequence is divided into multiple trajectory point sets, and each trajectory point set is masked to obtain a trajectory point vector.

[0115] The trajectory points of each stroke are determined in the trajectory point sequence, and the trajectory points of each stroke are masked to obtain the stroke trajectory vector.
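
The grouping and masking described above can be sketched as follows. The sketch assumes each trajectory point is a tuple `(x, y, pen_up)` where `pen_up` marks the last point of a stroke, that point-level sets are fixed-size chunks, and that masking replaces a selected group with placeholder values. The application does not state the point format, the set size, or the masking rule, so these and the function names are hypothetical.

```python
def split_point_sets(points, set_size):
    # Point level: divide the sequence into consecutive fixed-size sets.
    return [points[i:i + set_size] for i in range(0, len(points), set_size)]

def split_strokes(points):
    # Stroke level: a stroke ends at the point whose pen_up flag is set.
    strokes, current = [], []
    for point in points:
        current.append(point)
        if point[2]:
            strokes.append(current)
            current = []
    if current:  # trailing points without a closing pen_up flag
        strokes.append(current)
    return strokes

def mask_groups(groups, masked_indices, mask_value=(0.0, 0.0, 0)):
    # Replace every point of the selected groups with a placeholder value.
    masked = set(masked_indices)
    return [[mask_value] * len(g) if i in masked else list(g)
            for i, g in enumerate(groups)]

points = [(0, 0, 0), (1, 0, 0), (2, 0, 1), (0, 1, 0), (1, 1, 1)]
point_sets = split_point_sets(points, 2)    # sets of sizes 2, 2, 1
strokes = split_strokes(points)             # strokes of sizes 3, 2
masked_strokes = mask_groups(strokes, [1])  # second stroke hidden
```

Keeping both groupings over the same sequence lets a model see the handwriting at two granularities: local point neighbourhoods and whole strokes.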

[0116] In one embodiment, the recognition module is further configured to:

[0117] The trajectory point features are input into a preset text recognition model to obtain the recognition result. The preset text recognition model is trained based on the reconstructed text trajectory points. The reconstructed text trajectory points are reconstructed based on the trajectory point features of the training text. The trajectory point features of the training text are determined based on the point-level trajectory information and stroke-level trajectory information of the training text.
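
As an illustration of training on reconstructed trajectory points, the following sketch computes a reconstruction error over masked positions. The use of mean squared error and the restriction to masked positions are assumptions of this sketch, not details given in this application, and `reconstruction_loss` is a hypothetical name.

```python
def reconstruction_loss(original, reconstructed, masked_positions):
    # Mean squared error between original and reconstructed (x, y) points,
    # averaged over the masked positions only (assumed objective).
    positions = list(masked_positions)
    if not positions:
        return 0.0
    total = 0.0
    for i in positions:
        (x, y), (rx, ry) = original[i], reconstructed[i]
        total += (x - rx) ** 2 + (y - ry) ** 2
    return total / len(positions)

original = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
reconstructed = [(0.0, 0.0), (1.0, 2.0), (2.0, 2.0)]
loss = reconstruction_loss(original, reconstructed, [1, 2])
```

A lower value means the model recovers the hidden points more faithfully from the remaining point-level and stroke-level context.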

[0118] The text recognition device provided in this embodiment belongs to the same concept as the text recognition method provided in the above embodiments of this application. It can execute the text recognition method provided in any of the above embodiments of this application and has the corresponding functional modules and beneficial effects for executing the text recognition method. Technical details not described in detail in this embodiment can be found in the specific processing content of the text recognition method provided in the above embodiments of this application, and will not be repeated here.

[0119] Exemplary electronic devices

[0120] Another embodiment of this application also provides an electronic device. As shown in Figure 7, the device includes:

[0121] Memory 700 and processor 710;

[0122] The memory 700 is connected to the processor 710 and is used to store programs;

[0123] The processor 710 is configured to implement the text recognition method disclosed in any of the above embodiments by running the program stored in the memory 700.

[0124] Specifically, the aforementioned electronic device may also include: a bus, a communication interface 720, an input device 730, and an output device 740.

[0125] The processor 710, memory 700, communication interface 720, input device 730, and output device 740 are interconnected via a bus. Among them:

[0126] A bus can include a pathway for transmitting information between various components of a computer system.

[0127] The processor 710 can be a general-purpose processor, such as a central processing unit (CPU) or a microprocessor, or it can be an application-specific integrated circuit (ASIC) or one or more integrated circuits used to control the execution of the program of the present invention. It can also be a digital signal processor (DSP), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.

[0128] The processor 710 may include a main processor, as well as a baseband chip, modem, etc.

[0129] The memory 700 stores a program that executes the technical solution of this invention, and may also store an operating system and other key business functions. Specifically, the program may include program code, which includes computer operation instructions. More specifically, the memory 700 may include read-only memory (ROM), other types of static storage devices capable of storing static information and instructions, random access memory (RAM), other types of dynamic storage devices capable of storing information and instructions, disk storage, flash memory, etc.

[0130] Input device 730 may include a device for receiving data and information input by a user, such as a keyboard, mouse, camera, scanner, light pen, voice input device, touch screen, pedometer, or gravity sensor.

[0131] Output device 740 may include devices that allow information to be output to a user, such as a display screen, printer, speaker, etc.

[0132] The communication interface 720 may include a device that uses any transceiver to communicate with other devices or communication networks, such as Ethernet, Radio Access Network (RAN), Wireless Local Area Network (WLAN), etc.

[0133] The processor 710 executes the program stored in the memory 700 and invokes the other devices to implement the steps of any of the text recognition methods provided in the above embodiments of this application.

[0134] Exemplary computer program products and storage media

[0135] In addition to the methods and apparatus described above, embodiments of this application may also be computer program products, which include computer program instructions that, when executed by a processor, cause the processor to perform the steps of the text recognition methods according to various embodiments of this application described in the "Exemplary Methods" section of this specification.

[0136] The computer program product can be written in any combination of one or more programming languages to perform the operations of the embodiments of this application. The programming languages include object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as C or similar languages. The program code can be executed entirely on the user's computing device, partially on the user's computing device, as a standalone software package, partially on the user's computing device and partially on a remote computing device, or entirely on a remote computing device or server.

[0137] Furthermore, embodiments of this application may also be storage media storing computer programs which, when executed by a processor, cause the processor to perform the steps of the text recognition methods according to various embodiments of this application described in the "Exemplary Methods" section above. The specific operation of the above-described electronic device, and of the computer program product and the computer program on the storage medium when run by a processor, is described in the above method embodiments and is not repeated here.

[0138] For the foregoing method embodiments, in order to simplify the description, they are all described as a series of actions. However, those skilled in the art should understand that this application is not limited to the described order of actions, because according to this application, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily essential to this application.

[0139] It should be noted that the various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from the other embodiments. For parts that are identical or similar between embodiments, the embodiments can be referred to one another. Because the apparatus embodiments are basically similar to the method embodiments, their description is relatively brief; for relevant parts, refer to the descriptions of the method embodiments.

[0140] The steps in the methods of the various embodiments of this application can be reordered, merged, or deleted according to actual needs, and the technical features described in each embodiment can be replaced or combined.

[0141] The modules and sub-modules in the various embodiments of the present application's devices and terminals can be merged, divided, and deleted according to actual needs.

[0142] It should be understood that the disclosed terminals, devices, and methods can be implemented in other ways, given the several embodiments provided in this application. For example, the terminal embodiments described above are merely illustrative. For instance, the division of modules or sub-modules is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple sub-modules or modules may be combined or integrated into another module, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or modules, and may be electrical, mechanical, or other forms.

[0143] The modules or submodules described as separate components may or may not be physically separate. The components that constitute a module or submodule may or may not be physical modules or submodules; that is, they may be located in one place or distributed across multiple network modules or submodules. Some or all of the modules or submodules can be selected to achieve the purpose of this embodiment's solution, depending on actual needs.

[0144] Furthermore, the functional modules or sub-modules in the various embodiments of this application can be integrated into one processing module, or each module or sub-module can exist physically separately, or two or more modules or sub-modules can be integrated into one module. The integrated modules or sub-modules described above can be implemented in hardware or in the form of software functional modules or sub-modules.

[0145] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0146] The steps of the methods or algorithms described in conjunction with the embodiments disclosed herein can be implemented directly by hardware, a software unit executed by a processor, or a combination of both. The software unit can be located in random access memory (RAM), main memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, removable disk, CD-ROM, or any other form of storage medium known in the art.

[0147] Finally, it should be noted that in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0148] The above description of the disclosed embodiments enables those skilled in the art to make or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A text recognition method, characterized in that it comprises: obtaining the trajectory point sequence of the dynamic handwriting of the text to be recognized; determining the point-level trajectory information and stroke-level trajectory information in the trajectory point sequence; fusing the point-level trajectory information and the stroke-level trajectory information to obtain the trajectory point features corresponding to the text to be recognized; and obtaining the recognition result of the text to be recognized based on the trajectory point features.

2. The method according to claim 1, characterized in that obtaining the recognition result of the text to be recognized based on the trajectory point features includes: determining the image features corresponding to the text image of the text to be recognized; and obtaining the recognition result of the text to be recognized based on the image features and the trajectory point features.

3. The method according to claim 2, characterized in that obtaining the recognition result of the text to be recognized based on the image features and the trajectory point features includes: performing cross-attention calculation on the image features and the trajectory point features to obtain fused features; and obtaining the recognition result of the text to be recognized based on the fused features.

4. The method according to claim 3, characterized in that performing cross-attention calculation on the image features and the trajectory point features to obtain fused features includes: determining a first fusion feature based on the relationship between each vector in the trajectory point features and the image features, and the relationships among the vectors in the trajectory point features; and determining a second fusion feature based on the relationship between each vector in the image features and the trajectory point features, and the relationships among the vectors in the image features.

5. The method according to claim 3, characterized in that obtaining the recognition result of the text to be recognized based on the fused features includes: performing nonlinear mapping processing on the fused features to obtain the corresponding enhanced features; and obtaining the recognition result of the text to be recognized based on the enhanced features.

6. The method according to claim 5, characterized in that, if the fused features include a first fusion feature determined based on the relationship between each vector in the trajectory point features and the image features and the relationships among the vectors in the trajectory point features, and a second fusion feature determined based on the relationship between each vector in the image features and the trajectory point features and the relationships among the vectors in the image features, then performing nonlinear mapping processing on the fused features to obtain the corresponding enhanced features includes: concatenating the first fusion feature and the second fusion feature to obtain a concatenated feature; processing the concatenated feature with a nonlinear mapping function to obtain a feature enhancement factor; calculating a trajectory point enhancement feature based on the feature enhancement factor and the first fusion feature; and calculating an image enhancement feature based on the feature enhancement factor and the second fusion feature.

7. The method according to claim 1, characterized in that determining the point-level trajectory information and stroke-level trajectory information in the trajectory point sequence includes: dividing the trajectory point sequence into multiple trajectory point sets, and masking each trajectory point set to obtain a trajectory point vector; and determining the trajectory points of each stroke in the trajectory point sequence, and masking the trajectory points of each stroke to obtain a stroke trajectory vector.

8. The method according to claim 1, characterized in that obtaining the recognition result of the text to be recognized based on the trajectory point features includes: inputting the trajectory point features into a preset text recognition model to obtain the recognition result, wherein the preset text recognition model is trained based on reconstructed text trajectory points, the reconstructed text trajectory points are reconstructed based on the trajectory point features of a training text, and the trajectory point features of the training text are determined based on the point-level trajectory information and stroke-level trajectory information of the training text.

9. A text recognition device, characterized in that it comprises: an acquisition module, used to acquire the trajectory point sequence of the dynamic handwriting of the text to be recognized and to determine the point-level trajectory information and stroke-level trajectory information in the trajectory point sequence; a feature extraction module, used to fuse the point-level trajectory information and the stroke-level trajectory information to obtain the trajectory point features corresponding to the text to be recognized; and a recognition module, used to obtain the recognition result of the text to be recognized based on the trajectory point features.

10. An electronic device, characterized in that it comprises: a memory and a processor; the memory is connected to the processor and is used to store programs; and the processor, by running the program in the memory, implements the text recognition method as described in any one of claims 1 to 8.

11. A storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the text recognition method as described in any one of claims 1 to 8.