A weakly supervised character recognition method and device with multi-granularity perception

A weakly supervised character recognition method with multi-granularity perception combines feature extraction and a multi-granularity feature fusion network with a sequential recursive decoder, addressing the high annotation cost of fully supervised methods and achieving efficient character recognition.

CN117935286B · Active · Publication Date: 2026-06-30 · SUN YAT SEN UNIV +1

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: SUN YAT SEN UNIV
Filing Date: 2024-01-19
Publication Date: 2026-06-30

AI Technical Summary

Technical Problem

Existing fully supervised scene text recognition methods have high annotation costs, resulting in low text recognition efficiency and difficulty in achieving automation.

Method used

We employ a weakly supervised text recognition method with multi-granularity perception. By combining feature extraction, multi-granularity feature fusion network and sequential recursive decoder with weakly supervised training, we improve the text region perception capability and text recognition efficiency.

Benefits of technology

It reduces the workload of manual annotation, saves labor costs, and improves the efficiency and accuracy of text recognition. It is suitable for weakly supervised text recognition scenarios under single-point supervision.

✦ Generated by Eureka AI based on patent content.

Abstract

This application discloses a weakly supervised text recognition method and apparatus with multi-granularity perception. The method includes: acquiring an image to be recognized; performing feature extraction processing on the image to be recognized to obtain visual features; performing multi-granularity fusion processing on the visual features through a multi-granularity feature fusion network to obtain fused features; and performing autoregressive decoding processing on the fused features through a sequential recursive decoder to obtain a text recognition result, wherein the sequential recursive decoder is obtained through weakly supervised training. The embodiments of this application perform multi-granularity fusion processing on the visual features through a multi-granularity feature fusion network, which can improve the perception ability of text regions, is suitable for weakly supervised text recognition scenarios, and can be widely applied in the field of computer vision technology.

Description

Technical Field

[0001] This application relates to the field of computer vision technology, and in particular to a weakly supervised character recognition method and apparatus with multi-granularity perception.

Background Technology

[0002] Images, as the mainstream information carrier in life, are widely present in cyberspace. Images that carry text information make it easier for people to communicate and spread information. Therefore, the detection and recognition of text in images has very important application value in real life.

[0003] In related technologies, fully supervised scene text recognition methods are commonly used to identify text in images. However, these methods have high annotation costs, significantly increasing manual labor costs and affecting the efficiency of text recognition. In summary, the technical problems existing in these technologies need to be solved.

Summary of the Invention

[0004] The main objective of this application is to propose a weakly supervised character recognition method and apparatus with multi-granularity perception, which can improve the efficiency of character recognition.

[0005] To achieve the above objectives, one aspect of this application proposes a weakly supervised character recognition method with multi-granularity perception, the method comprising:

[0006] An image to be recognized is acquired;

[0007] The image to be recognized is subjected to feature extraction processing to obtain visual features;

[0008] The visual features are subjected to multi-granularity fusion processing through a multi-granularity feature fusion network to obtain fused features;

[0009] The fused features are autoregressively decoded using a sequential recursive decoder to obtain the text recognition result, wherein the sequential recursive decoder is obtained through weakly supervised training.

[0010] In some embodiments, the step of performing multi-granularity fusion processing on the visual features through a multi-granularity feature fusion network to obtain fused features includes:

[0011] The visual features are input into the multi-granularity feature fusion network, wherein the visual features include a set of high-order features and a set of low-order features, and the multi-granularity feature fusion network includes a low-order feature fusion module and an intra-layer interaction module.

[0012] The low-order feature set is subjected to feature fusion processing by the low-order feature fusion module to obtain low-order enhanced features;

[0013] The high-order features are subjected to feature interaction processing by the intra-layer interaction module to obtain high-order enhanced features;

[0014] The low-order enhanced features and the high-order enhanced features are concatenated to obtain the fused features.

[0015] In some embodiments, the step of performing feature fusion processing on the low-order feature set through the low-order feature fusion module to obtain low-order enhanced features includes:

[0016] The first low-order feature and the second low-order feature are obtained from the set of low-order features;

[0017] The first low-order feature is subjected to feature enhancement processing to obtain the first intermediate feature;

[0018] The first intermediate feature is upsampled to obtain the upsampled feature;

[0019] The upsampled feature is concatenated with the second low-order feature to obtain the first concatenated feature;

[0020] The first concatenated feature is subjected to feature enhancement processing to obtain the second intermediate feature;

[0021] The first intermediate feature and the second intermediate feature are concatenated to obtain the second concatenated feature;

[0022] The second concatenated feature is subjected to feature enhancement processing to obtain a low-order enhanced feature.

[0023] In some embodiments, the step of performing feature interaction processing on the high-order features through the intra-layer interaction module to obtain high-order enhanced features includes:

[0024] A positional encoding is obtained;

[0025] The high-order features are added to the positional encoding to obtain the encoded features;

[0026] The encoded features are input into a multi-head self-attention module for attention calculation to obtain an attention score;

[0027] The attention score is passed in sequence through the first normalization layer and the feedforward neural network layer, with residual summation, to obtain the summed features;

[0028] The summed features are input into the second normalization layer for normalization processing to obtain the high-order enhanced features.
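The intra-layer interaction steps listed above can be sketched in plain Python as follows. This is only an illustrative sketch, not the patented implementation: the identity query/key/value projections, the ReLU stand-in for the feedforward layer, the placement of the residual additions (a conventional post-normalization layout), and the names `intra_layer_interaction` and `num_heads` are all assumptions that do not appear in the source.

```python
import math

def add(a, b):
    """Element-wise sum of two equally shaped lists of rows."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(rows, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale or bias)."""
    out = []
    for r in rows:
        mean = sum(r) / len(r)
        var = sum((v - mean) ** 2 for v in r) / len(r)
        out.append([(v - mean) / math.sqrt(var + eps) for v in r])
    return out

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(rows, num_heads):
    """Multi-head scaled dot-product self-attention.
    Assumption: Q, K and V are the input itself (identity projections)."""
    head_dim = len(rows[0]) // num_heads
    out = [[] for _ in rows]
    for h in range(num_heads):
        heads = [r[h * head_dim:(h + 1) * head_dim] for r in rows]
        for i, q in enumerate(heads):
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(head_dim)
                      for k in heads]
            w = softmax(scores)
            out[i].extend(sum(w[j] * heads[j][c] for j in range(len(heads)))
                          for c in range(head_dim))
    return out

def feed_forward(rows):
    """Stand-in feedforward layer: ReLU only, no learned weights (assumption)."""
    return [[max(0.0, v) for v in r] for r in rows]

def intra_layer_interaction(high_order, pos_code, num_heads=2):
    encoded = add(high_order, pos_code)              # add the positional encoding
    attended = self_attention(encoded, num_heads)    # multi-head self-attention
    normed = layer_norm(add(encoded, attended))      # first normalization layer
    summed = add(normed, feed_forward(normed))       # feedforward + residual sum
    return layer_norm(summed)                        # second normalization layer

# Three flattened spatial positions with four channels each (toy values).
features = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0], [3.0, 1.0, -2.0, 1.5]]
positions = [[0.0, 0.1, 0.2, 0.3], [0.1, 0.2, 0.3, 0.4], [0.2, 0.3, 0.4, 0.5]]
enhanced = intra_layer_interaction(features, positions)
```

The output keeps the shape of the input, so the high-order enhanced features can later be concatenated with other features of matching layout.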

[0029] In some embodiments, the step of performing autoregressive decoding on the fused features using a sequential recursive decoder to obtain the text recognition result includes:

[0030] The image to be recognized is subjected to text sequence discretization processing to obtain the target sequence;

[0031] The target sequence is processed by sequence feature calculation to obtain the target sequence features;

[0032] The fused features and the target sequence features are subjected to multi-head attention processing to obtain the text recognition result.

[0033] In some embodiments, the step of performing text sequence discretization processing on the image to be recognized to obtain the target sequence includes:

[0034] Discrete labeling is performed on the text sequence in the image to be recognized to obtain text instances;

[0035] The text instances are serialized to obtain a text instance sequence;

[0036] The text instance sequence is subjected to marker insertion processing to obtain the target sequence.
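One possible reading of the discretization, serialization, and marker insertion steps above is sketched below. The source does not specify the token layout, so everything concrete here is an assumption made for illustration: each text instance is taken to be a single annotated point plus its transcription, coordinates are quantized into `num_bins` bins, characters are used as tokens, and the marker names `<SOS>` and `<EOS>` are hypothetical.

```python
SOS, EOS = "<SOS>", "<EOS>"  # hypothetical start/end markers

def discretize(value, size, num_bins):
    """Map a pixel coordinate in [0, size) to an integer bin in [0, num_bins)."""
    return min(num_bins - 1, int(value / size * num_bins))

def build_target_sequence(instances, width, height, num_bins=1000):
    """instances: list of (x, y, text) tuples, one per text instance (assumed format)."""
    tokens = []
    for x, y, text in instances:
        # Discrete labeling: quantize the single annotated point.
        tokens.append(discretize(x, width, num_bins))
        tokens.append(discretize(y, height, num_bins))
        # Serialization: append the transcription one character at a time.
        tokens.extend(text)
    # Marker insertion: wrap the serialized instances with start and end markers.
    return [SOS] + tokens + [EOS]

sequence = build_target_sequence([(50, 25, "AB")], width=100, height=50, num_bins=10)
print(sequence)  # ['<SOS>', 5, 5, 'A', 'B', '<EOS>']
```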

[0037] In some embodiments, the step of performing sequence feature calculation on the target sequence to obtain target sequence features includes:

[0038] The target sequence is subjected to position embedding processing to obtain a position matrix;

[0039] The position matrix is processed according to the first formula to obtain the hidden matrix;

[0040] The position matrix and the hidden matrix are proportionally concatenated to obtain the target sequence features.

[0041] In some embodiments, the step of performing feature extraction processing on the image to be recognized to obtain visual features includes:

[0042] An image training dataset is obtained;

[0043] The image training dataset is input into the feature extraction network for pre-training to obtain the backbone network;

[0044] The backbone network performs visual feature extraction processing on the image to be recognized to obtain the visual features.

[0045] In some embodiments, the method further includes performing weakly supervised training on the sequential recursive decoder, specifically including:

[0046] Obtain the sequence dataset;

[0047] Single-coordinate point annotation is performed on a portion of the data in the sequence dataset to obtain the sequence training dataset;

[0048] The sequence training dataset is input into the sequential recursive decoder for training, resulting in a trained sequential recursive decoder.

[0049] To achieve the above objectives, another aspect of this application proposes a weakly supervised character recognition device with multi-granularity perception, the device comprising:

[0050] The first module is used to acquire the image to be recognized;

[0051] The second module is used to perform feature extraction processing on the image to be identified to obtain visual features;

[0052] The third module is used to perform multi-granularity fusion processing on the visual features through a multi-granularity feature fusion network to obtain fused features;

[0053] The fourth module is used to perform autoregressive decoding on the fused features through a sequential recursive decoder to obtain the text recognition result. The sequential recursive decoder is obtained through weakly supervised training.

[0054] To achieve the above objectives, another aspect of this application provides an electronic device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the method described above.

[0055] To achieve the above objectives, another aspect of the embodiments of this application proposes a computer-readable storage medium storing a computer program that, when executed by a processor, implements the methods described above.

[0056] The embodiments of this application include at least the following beneficial effects: This application provides a weakly supervised text recognition method and apparatus with multi-granularity perception. This scheme obtains visual features by performing feature extraction processing on the acquired image to be recognized, and then performs multi-granularity fusion processing on the visual features through a multi-granularity feature fusion network to obtain fused features. This can improve the perception ability of text regions in the image, thereby improving the efficiency of text recognition. In addition, this scheme obtains the text recognition result by performing autoregressive decoding processing on the fused features through a sequential recursive decoder. The sequential recursive decoder can capture the local dependencies and global relationships between characters and correct prediction errors, thereby improving the efficiency of text recognition. Furthermore, this scheme obtains the sequential recursive decoder through weakly supervised training, which can reduce the workload of manual annotation and save labor costs.

Description of the Drawings

[0057] Figure 1 is a flowchart of a weakly supervised character recognition method with multi-granularity perception provided in an embodiment of this application;

[0058] Figure 2 is a data processing flowchart of a character recognition system provided in an embodiment of this application;

[0059] Figure 3 is a data processing flowchart of a multi-granularity feature fusion network provided in an embodiment of this application;

[0060] Figure 4 is a schematic diagram of the structure of a feature enhancement module provided in an embodiment of this application;

[0061] Figure 5 is a data processing flowchart of a sequential recursive decoder provided in an embodiment of this application;

[0062] Figure 6 is a schematic diagram of the structure of a weakly supervised character recognition device with multi-granularity perception provided in an embodiment of this application;

[0063] Figure 7 is a schematic diagram of the hardware structure of the electronic device provided in the embodiments of this application.

Detailed Implementation

[0064] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to limit it. In the following description, when referring to the accompanying drawings, unless otherwise indicated, the same numbers in different drawings represent the same or similar elements. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with those of this application; they are merely examples of apparatuses and methods consistent with some aspects of the embodiments of this application as detailed in the appended claims.

[0065] It is understood that the terms “first,” “second,” etc., used in this application may describe various concepts, but unless otherwise stated, these concepts are not limited by these terms. These terms are only used to distinguish one concept from another. For example, without departing from the scope of the embodiments of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word “if” as used herein may be interpreted as “when,” “upon,” or “in response to a determination.”

[0066] As used in this application, "at least one" includes one, two, or more; "multiple" includes two or more; "each" refers to every one of the corresponding multiples; and "any" refers to any one of the multiples.

[0067] Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of this application only and is not intended to limit this application.

[0068] Before providing a detailed description of the embodiments of this application, some of the nouns and terms involved in the embodiments of this application will be explained first. The nouns and terms involved in the embodiments of this application are subject to the following interpretations.

[0069] Artificial intelligence (AI) is the theory, methods, technology, and application systems that use digital computers or machines controlled by digital computers to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to achieve optimal results. In other words, AI is a comprehensive technology within computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence. AI studies the design principles and implementation methods of various intelligent machines, enabling them to possess the functions of perception, reasoning, and decision-making.

[0070] Artificial intelligence (AI) is a comprehensive discipline encompassing a wide range of fields, including both hardware and software technologies. Fundamental AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating / interactive systems, and mechatronics. AI software technologies primarily include computer vision, speech processing, natural language processing, and machine learning / deep learning.

[0071] Images, as the mainstream information carrier in modern life, are widely present in cyberspace. Text, on the other hand, consists of characters formed by mapping symbols according to certain rules to represent specific meanings. Compared to text, which is an abstract concept requiring a symbolic system for understanding, images are undoubtedly more concise and clear. Therefore, images carrying textual information facilitate communication and dissemination. Consequently, the detection and recognition of text within images has significant practical value in real life.

[0072] The first step in recognizing text in an image is to locate it. Accurate localization not only greatly aids subsequent text recognition but also enables techniques such as style transformation, text erasure, and text manipulation within the text region. However, text detection and recognition in natural scenes present significant challenges. Firstly, the text itself exhibits diversity and variability in natural scenes, with numerous variations in attributes such as color, size, font, shape, orientation, and aspect ratio, increasing the difficulty of the detection and recognition task. Secondly, the background also presents complexity and interference, including objects with similar shapes to the text (such as bricks, windows, traffic signs, etc.) and occlusion issues. These factors lead to missed detections, false detections, or incomplete boundary localization of text in the image. Furthermore, current imperfect imaging conditions, such as low resolution, distortion, blur, low / high brightness, and shadows, are all significant factors interfering with text detection and recognition. In short, text detection and recognition in natural scenes is an extremely challenging task in the field of computer vision.

[0073] In related technologies, end-to-end text recognition models aim to achieve text localization and recognition within a unified model. Current fully supervised scene text recognition methods mainly fall into two categories: methods based on regions of interest (ROIs) and methods without ROIs. ROI-based methods first generate candidate proposal regions, then classify or regress them to detect objects, and finally use a recognition head to identify the text within the detected regions. Currently, mainstream recognition methods utilize various ROI techniques to enhance detection and recognition capabilities. For curved text recognition, some researchers have proposed using Bézier curves to fit text regions of arbitrary shapes, thus successfully recognizing curved text. Others have constructed a language model to associate character-level proposal boxes. Still other methods decouple the recognition model into a visual model and a language model to further improve model performance. However, all these methods rely on proposal boxes and involve complex post-processing, which slows down the entire algorithm. ROI-free methods avoid ROI pooling operations and heuristic post-processing, simplifying the intermediate process and improving model robustness. These methods extract text features and inter-feature correlations by designing specific attention modules and combining them with multi-scale backbone networks to form multi-granularity feature descriptions. Meanwhile, some scholars have designed decoders to decode text regions and content to obtain prediction results. However, regardless of whether ROIs are used, both types of methods constrain the network to recognize text through fully supervised learning, and they have achieved satisfactory accuracy. In practice, however, the fully supervised approach suffers from high annotation costs, consumes considerable manual labor, and remains some distance from the goal of automated text recognition.

[0074] In view of this, this application provides a weakly supervised text recognition method and apparatus with multi-granularity perception. This scheme extracts visual features from the acquired image to be recognized and then fuses these visual features using a multi-granularity feature fusion network to obtain fused features. This improves the perception of text regions in the image, thereby increasing the efficiency of text recognition. Furthermore, this scheme uses a sequential recursive decoder to perform autoregressive decoding on the fused features to obtain the text recognition result. The sequential recursive decoder can acquire local dependencies and global relationships between characters, correcting prediction errors and further improving text recognition efficiency. Moreover, this scheme uses weakly supervised training to obtain the sequential recursive decoder, reducing the workload of manual annotation and saving labor costs.

[0075] This application provides a multi-granularity perception-based weakly supervised character recognition method, relating to the field of computer vision technology. The multi-granularity perception-based weakly supervised character recognition method provided in this application can be applied to a terminal, a server, or software running on a terminal or server. In some embodiments, the terminal can be a smartphone, tablet, laptop, desktop computer, smart speaker, smartwatch, or in-vehicle terminal, but is not limited to these. The server can be configured as an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The server can also be a node server in a blockchain network. The software can be an application implementing the multi-granularity perception-based weakly supervised character recognition method, but is not limited to the above forms.

[0076] This application can be used in a wide variety of general-purpose or special-purpose computer system environments or configurations. Examples include: personal computers, server computers, handheld or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, and distributed computing environments including any of the above systems or devices. This application can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform specific tasks or implement specific abstract data types. This application can also be practiced in distributed computing environments where tasks are performed by remote processing devices connected via a communication network. In distributed computing environments, program modules can reside in local and remote computer storage media, including storage devices.

[0077] Figure 1 is an optional flowchart of a weakly supervised character recognition method with multi-granularity perception provided in an embodiment of this application. The method in Figure 1 may include, but is not limited to, steps S101 to S104.

[0078] Step S101: Obtain the image to be recognized;

[0079] Step S102: Perform feature extraction processing on the image to be identified to obtain visual features;

[0080] Step S103: Perform multi-granularity fusion processing on the visual features using a multi-granularity feature fusion network to obtain fused features;

[0081] Step S104: The fused features are subjected to autoregressive decoding processing through a sequential recursive decoder to obtain the text recognition result. The sequential recursive decoder is obtained through weakly supervised training.
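As a rough illustration of how steps S101 to S104 chain together, the following sketch wires three interchangeable stages into one function. The function name `recognize` and the stand-in stages are hypothetical; real stages would be the backbone network, the multi-granularity feature fusion network, and the sequential recursive decoder described in this application.

```python
def recognize(image, extract_features, fuse_features, decode):
    """Run the acquired image (S101) through the three processing stages."""
    visual = extract_features(image)   # S102: feature extraction
    fused = fuse_features(visual)      # S103: multi-granularity fusion
    return decode(fused)               # S104: autoregressive decoding

# Stand-in stages that only tag the data they receive (illustration only).
def toy_backbone(image):
    return ("visual", image)

def toy_fusion(visual):
    return ("fused", visual)

def toy_decoder(fused):
    return "text recognized from " + fused[1][1]

result = recognize("scene.jpg", toy_backbone, toy_fusion, toy_decoder)
```

Keeping the stages as parameters mirrors the separation in the method: the decoder is the part obtained through weakly supervised training, while the backbone is pre-trained separately.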

[0082] In the embodiments of this application, referring to Figure 2, an image to be recognized is acquired and input into a backbone network for feature extraction to obtain visual features. A multi-granularity feature fusion network then fuses these visual features at multiple granularities to obtain fused features. Finally, a sequential recursive decoder performs autoregressive decoding on the fused features to obtain the text recognition result. The sequential recursive decoder is trained under weak supervision. Weak supervision is a branch of machine learning that, compared to traditional supervised learning, uses limited, noisy, or inaccurately labeled data to train model parameters. This application embodiment targets natural scene images and performs single-point scene text recognition based on scale-robust sequential recursive self-attention. Specifically, this application embodiment first integrates the global interaction of high-level features with low-level local features through a multi-granularity feature fusion network to enhance text feature representations at different scales. Then, based on the scale-robust text features, a sequential recursive decoder decodes them into text transcriptions using sequential recursive self-attention. Furthermore, because the sequential recursive decoder is obtained through weakly supervised training, it can be applied to weakly supervised text recognition scenarios under single-point supervision. This approach models text recognition in weakly supervised scenarios as a simple sequence prediction problem, thereby improving the efficiency of text recognition while reducing annotation costs.
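Treating recognition as sequence prediction means the decoder emits one token at a time, each conditioned on the fused features and on the tokens produced so far. The loop below is a minimal sketch of such autoregressive decoding. The greedy one-token-per-step policy, the `<SOS>`/`<EOS>` markers, the `max_len` cap, and the `next_token` callback are assumptions for illustration; the source does not describe the stopping rule or token selection strategy.

```python
def greedy_decode(next_token, fused, sos="<SOS>", eos="<EOS>", max_len=32):
    """Generate tokens one by one until the end marker or the length cap is reached."""
    sequence = [sos]
    while len(sequence) < max_len:
        token = next_token(fused, sequence)  # prediction conditioned on the prefix
        if token == eos:
            break
        sequence.append(token)
    return sequence[1:]  # drop the start marker

# Toy predictor: the "fused features" are just the characters to emit.
def toy_next_token(fused, prefix):
    position = len(prefix) - 1
    return fused[position] if position < len(fused) else "<EOS>"

print(greedy_decode(toy_next_token, list("TEXT")))  # ['T', 'E', 'X', 'T']
```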

[0083] In step S101 of some embodiments, the image to be recognized can be captured by a shooting device such as a camera, or retrieved over a network, for example from a database. The image to be recognized is any natural scene image containing text.

[0084] In step S102 of some embodiments, the step of performing feature extraction processing on the image to be recognized to obtain visual features includes:

[0085] An image training dataset is obtained;

[0086] The image training dataset is input into the feature extraction network for pre-training to obtain the backbone network;

[0087] The backbone network performs visual feature extraction processing on the image to be recognized to obtain the visual features.

[0088] In this embodiment, the image training dataset is the ImageNet dataset, a large-scale database in the field of computer vision used for recognizing objects in images. This embodiment obtains the backbone network by pre-training a feature extraction network on the image training dataset. The feature extraction network can be a ResNet50 network, a variant of the deep residual network and a classic feature extraction network structure. The backbone network then performs visual feature extraction processing on the image to be recognized to obtain visual features. It can be understood that the input in this embodiment is an RGB image of size H×W. During training, the image is randomly cropped, with the shorter side randomly set within the range of 640 to 896 pixels, while the longer side remains unchanged at 1600 pixels. During testing, the shortest side of the input image is set to 1000 pixels, and the longest side is kept above 1824 pixels. By obtaining visual features through visual feature extraction processing of the image to be recognized, this embodiment can improve the accuracy of text recognition.

[0089] In step S103 of some embodiments, the step of performing multi-granularity fusion processing on the visual features through a multi-granularity feature fusion network to obtain fused features includes:

[0090] Step S1031: Input the visual features into the multi-granularity feature fusion network. The visual features include a set of high-order features and a set of low-order features. The multi-granularity feature fusion network includes a low-order feature fusion module and an intra-layer interaction module.

[0091] Step S1032: The low-order feature set is subjected to feature fusion processing by the low-order feature fusion module to obtain low-order enhanced features;

[0092] Step S1033: Perform feature interaction processing on the high-order features through the intra-layer interaction module to obtain high-order enhanced features;

[0093] Step S1034: The low-order enhanced features and the high-order enhanced features are concatenated to obtain the fused features.

[0094] In this embodiment, a backbone network is used to extract the visual features of the input image to be recognized. The visual features are represented as a set of per-layer feature maps, where L represents the number of feature layers; in this embodiment, L is set to 5. The visual features include a set of high-order features and a set of low-order features, where the high-order feature is F5 and the low-order feature set includes F3 and F4. D3, D4, and D5 are 512, 1024, and 2048, respectively. Here, F3 to F5 represent the 3rd to 5th feature layers output from the ResNet50 network, H represents the height of the image to be recognized, and W represents the width of the image to be recognized. In this embodiment, the visual features are input into a Multi-granularity Feature Fusion (MGFF) network. As shown in Figure 3, this multi-granularity feature fusion network includes a Low-Level Feature Fusion (LLFF) module and an Intra-Layer Interaction (ILI) module. The LLFF module performs feature fusion processing on the low-order feature set to obtain low-order enhanced features. The ILI module then performs feature interaction processing on the high-order features to obtain high-order enhanced features. Finally, the low-order enhanced features and the high-order enhanced features are concatenated to obtain fused features. This multi-granularity feature fusion network can generate scale-robust feature representations, thereby improving the ability to perceive text regions under single-point supervision.
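To make the sizes of F3 to F5 concrete, the sketch below computes the feature-map shapes for a given input size. The formula that defines these shapes is not reproduced in this text, so the sketch rests on two assumptions: D3, D4, and D5 are read as channel counts, and layer l is taken to have a spatial stride of 2^l, which matches the usual ResNet50 stages. The helper name `feature_shapes` is hypothetical.

```python
import math

CHANNELS = {3: 512, 4: 1024, 5: 2048}  # D3, D4, D5 as stated in the text

def feature_shapes(height, width):
    """Return {layer: (channels, rows, cols)}, assuming layer l has stride 2**l."""
    return {
        layer: (channels,
                math.ceil(height / 2 ** layer),
                math.ceil(width / 2 ** layer))
        for layer, channels in CHANNELS.items()
    }

shapes = feature_shapes(640, 1600)
low_order = {layer: shapes[layer] for layer in (3, 4)}  # F3 and F4
high_order = shapes[5]                                  # F5
print(shapes)  # {3: (512, 80, 200), 4: (1024, 40, 100), 5: (2048, 20, 50)}
```

Under these assumptions F5 is the coarsest map with the most channels, which is consistent with routing it to the attention-based intra-layer interaction module while the finer F3 and F4 maps go to the convolutional low-order fusion module.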

[0095] In step S1032 of some embodiments, the step of performing feature fusion processing on the low-order feature set through the low-order feature fusion module to obtain low-order enhanced features includes:

[0096] The first low-order feature and the second low-order feature are obtained from the set of low-order features;

[0097] The first low-order feature is subjected to feature enhancement processing to obtain the first intermediate feature;

[0098] The first intermediate feature is upsampled to obtain the upsampled feature;

[0099] The upsampled feature is concatenated with the second low-order feature to obtain the first concatenated feature;

[0100] The first concatenated feature is subjected to feature enhancement processing to obtain the second intermediate feature;

[0101] The first intermediate feature and the second intermediate feature are concatenated to obtain the second concatenated feature;

[0102] The second concatenated feature is subjected to feature enhancement processing to obtain a low-order enhanced feature.

[0103] In the embodiments of this application, referring to Figure 3, the low-order feature fusion module corresponds to the F3-F4 branches shown in Figure 3 and contains multiple RepBlock modules, 1×1 convolutional blocks, and 3×3 convolutional blocks. The structure of the RepBlock module, which is composed of multiple stacked convolutional layers, is shown in Figure 4. First, a first low-order feature F4 and a second low-order feature F3 are obtained from the low-order feature set. The first low-order feature F4 is enhanced by a RepBlock module and then reduced in dimensionality by a 1×1 convolution to obtain the first intermediate feature. Then, the first intermediate feature is upsampled and concatenated with the second low-order feature F3 to obtain the first concatenated feature. The same feature enhancement operation that was applied to F4 is performed on the first concatenated feature to obtain the second intermediate feature. Finally, the first intermediate feature and the second intermediate feature are concatenated to obtain the second concatenated feature, which is enhanced by a RepBlock module and passed through a 3×3 convolution to output the low-order enhanced feature. The calculation formula of the low-order feature fusion module in this embodiment is expressed as follows:

[0104]

[0105]

[0106]

[0107] In the formula, C and U represent the concatenation operation and the upsampling operation, respectively, S represents a feature, F represents the flattening operation, and L represents the number of feature layers. The formula also involves a semantic feature interaction network with learnable parameters Θ. The first enhancement network is a learnable network with weights Θ1, consisting of RepBlocks and 1×1 convolutional layers. Φ2 is a network with learnable parameters that contains a RepBlock followed by a 3×3 convolutional layer. The two enhancement networks that consist of RepBlocks and 1×1 convolutional layers have the same network structure but different weights. This embodiment fuses the low-order features through the low-order feature fusion module, which improves the model's understanding and perception capabilities and thereby increases the efficiency of text recognition.
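The data flow of the low-order feature fusion module can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the patented implementation: the `enhance` function is a hypothetical stand-in for the RepBlock and convolution layers (a random per-pixel channel projection followed by ReLU), nearest-neighbour upsampling is assumed, and the second concatenation is assumed to use the upsampled copy of the first intermediate feature so that the spatial sizes match.

```python
import numpy as np

rng = np.random.default_rng(0)


def enhance(x, out_channels):
    """Hypothetical stand-in for RepBlock + convolution.

    Applies a random per-pixel channel projection followed by ReLU to a
    feature map of shape (channels, height, width).
    """
    w = rng.standard_normal((x.shape[0], out_channels)) / np.sqrt(x.shape[0])
    return np.maximum(np.einsum("chw,co->ohw", x, w), 0.0)


def upsample2x(x):
    """Nearest-neighbour 2x upsampling over the two spatial axes (assumed)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)


def low_order_fusion(f3, f4, mid_channels=256):
    p4 = enhance(f4, mid_channels)              # first intermediate feature
    u4 = upsample2x(p4)                         # upsampled feature
    s1 = np.concatenate([u4, f3], axis=0)       # first concatenated feature
    p3 = enhance(s1, mid_channels)              # second intermediate feature
    # Assumption: the upsampled first intermediate feature is used here so
    # that both operands share the F3 resolution.
    s2 = np.concatenate([u4, p3], axis=0)       # second concatenated feature
    return enhance(s2, mid_channels)            # low-order enhanced feature


f3 = rng.standard_normal((512, 8, 16))    # F3: 512 channels
f4 = rng.standard_normal((1024, 4, 8))    # F4: 1024 channels, half resolution
print(low_order_fusion(f3, f4).shape)
```

The intermediate width of 256 channels is an arbitrary choice for the sketch; the text does not state the reduced dimensionality.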

[0108] In step S1033 of some embodiments, the step of performing feature interaction processing on the higher-order features through the intra-layer interaction module to obtain higher-order enhanced features includes:

[0109] Obtain the positional code;

[0110] The higher-order features are added to the positional codes to obtain the encoded features;

[0111] The encoded features are input into a multi-head self-attention module for attention calculation to obtain an attention score.

[0112] The attention scores are sequentially input into the first normalization layer and the feedforward neural network layer for residual summation to obtain the additive features;

[0113] The summed features are input into the second normalization layer for normalization processing to obtain higher-order enhanced features.

[0114] In this embodiment, the positional code is first obtained, for example through manual annotation. Referring to Figure 3, the positional encoding P is added to the high-order feature F5 to obtain the encoded feature. The encoded feature is input into a multi-head self-attention module, then passed through a normalization layer and a feedforward neural network layer with residual summation, and finally passed through another normalization layer to obtain the high-order enhanced feature. The calculation expression for the intra-layer interaction module is shown below:

[0115]

[0116] In the formula, F represents the flattening operation, and the intra-layer interaction module is a semantic feature interaction network with learnable parameters Θ. In this embodiment, the intra-layer interaction module generates high-order enhanced features that can be aggregated with the low-order enhanced features fused across layers to produce robust encoded features, thereby improving the ability to perceive text regions under single-point supervision.
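The intra-layer interaction steps above can be illustrated with the following NumPy sketch. It is an assumed reading of the text that follows the usual post-normalization transformer encoder layer: attention output added to the input and normalized, then a feedforward layer with a second residual sum and normalization. The random weights, the small channel count, the random placeholder for the positional encoding P, and all function names are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each token (row) to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def multi_head_self_attention(x, wq, wk, wv, wo, heads):
    n, d = x.shape
    dh = d // heads

    def split(t):
        return t.reshape(n, heads, dh).transpose(1, 0, 2)  # (heads, n, dh)

    q, k, v = split(x @ wq), split(x @ wk), split(x @ wv)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(dh))  # (heads, n, n)
    merged = (weights @ v).transpose(1, 0, 2).reshape(n, d)
    return merged @ wo


def intra_layer_interaction(f5, pos, params, heads=4):
    c, h, w = f5.shape
    x = f5.reshape(c, h * w).T + pos      # flatten to tokens, add positional encoding
    attn = multi_head_self_attention(x, *params["attn"], heads)
    y = layer_norm(x + attn)              # first normalization layer
    w1, w2 = params["ffn"]
    ffn = np.maximum(y @ w1, 0.0) @ w2    # feedforward layer with ReLU
    return layer_norm(y + ffn)            # second normalization layer


def init_params(d, hidden, rng):
    scale = 1.0 / np.sqrt(d)
    attn = tuple(rng.standard_normal((d, d)) * scale for _ in range(4))
    ffn = (rng.standard_normal((d, hidden)) * scale,
           rng.standard_normal((hidden, d)) / np.sqrt(hidden))
    return {"attn": attn, "ffn": ffn}


rng = np.random.default_rng(0)
f5 = rng.standard_normal((64, 4, 6))      # small stand-in for the 2048-channel F5
pos = rng.standard_normal((4 * 6, 64))    # placeholder positional encoding P
out = intra_layer_interaction(f5, pos, init_params(64, 128, rng))
print(out.shape)  # (24, 64)
```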

[0117] In step S104 of some embodiments, the step of performing autoregressive decoding on the fused features through a sequential recursive decoder to obtain the text recognition result includes:

[0118] Step S1041: Perform text sequence discretization processing on the image to be recognized to obtain the target sequence;

[0119] Step S1042: Perform sequence feature calculation on the target sequence to obtain target sequence features;

[0120] Step S1043: Perform multi-head attention processing on the fused features and the target sequence features to obtain the text recognition result.

[0121] In this embodiment, the target sequence is obtained by discretizing the text sequence of the image to be recognized. Sequence feature calculation is then performed on the target sequence to obtain the target sequence features. Finally, multi-head attention processing is applied to the fused features and the target sequence features to obtain the text recognition result. For scene text localization and recognition, this embodiment uses a transformer decoder with a sequential recursive self-attention mechanism to model the global-local sequence relationships between features. It performs autoregressive decoding on the fused features, which contain rich semantic information, and generates the transcribed text, thereby obtaining the scene text recognition result. Referring to Figure 5, the sequential recursive decoder takes the constructed target sequence as input and performs position embedding through the position matrix $M_{pos}$. The outputs $Q_m$ and $K_m$ are multiplied, a softmax operation is applied, and the result is multiplied by $V_m$. Of the two branches, one passes through the position matrix $M_{pos}$ embedding and the other does not; their feature values are combined according to the ratio μ into the target sequence feature $F_{sf}$. Finally, a multi-head attention operation with the fused feature $F_{enc}$ yields the decoded sequence representation, where $t_n$ indicates the sequence position and $x$, $y$, and $d_n$ represent the values at the corresponding positions in the sequence. The character recognition result is obtained by decoding the scale-robust text features into a text script using sequential recursive self-attention; the text script includes coordinate information and transcription information. By introducing the sequential recursive decoder, this embodiment captures local dependencies and global relationships between characters, as well as the position of the single point relative to the text, thereby correcting prediction errors and improving the accuracy of character recognition.

[0122] In step S1041 of some embodiments, the step of performing text sequence discretization processing on the image to be recognized to obtain the target sequence includes:

[0123] Discrete labeling is performed on the text sequence in the image to be identified to obtain text instances;

[0124] The text instances are serialized to obtain a text instance sequence;

[0125] The text instance sequence is subjected to marker insertion processing to obtain the target sequence.

[0126] In this embodiment, the text sequence of the image to be recognized is represented by converting the position and transcription of each text instance into a sequence of discrete tokens and linking them within each instance. Each continuous coordinate of the center point of a text instance is uniformly discretized into the range [1, n], where n controls the degree of discretization. The text instances are then serialized to obtain a text instance sequence, represented as [x, y, d], where (x, y) is the coordinate point and d is the transcription of the text instance. Furthermore, token insertion processing is performed on the text instance sequence: the tokens <sos> and <eos> are inserted into the sequence to indicate its beginning and end. By constructing the target sequence in this way, the decoder can, in subsequent processing, use the information embedded in the input tokens to map the sequence information to words and generate the final text output, thus improving the efficiency of text recognition.
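The construction of the target sequence can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the token layout (integer bins for x and y followed by one token per character), the binning rule, the default of 1000 bins, and the function names are hypothetical, since the text only specifies uniform discretization into [1, n] and the insertion of <sos> and <eos>.

```python
SOS, EOS = "<sos>", "<eos>"


def discretize(value, size, n_bins):
    """Uniformly map a continuous coordinate in [0, size] to a bin in [1, n_bins]."""
    ratio = min(max(value / size, 0.0), 1.0)
    return min(int(ratio * n_bins) + 1, n_bins)


def build_target_sequence(instances, width, height, n_bins=1000):
    """Serialize (center_x, center_y, transcription) instances as [x, y, d] tokens.

    Assumed layout: <sos>, then for each instance the x bin, the y bin and
    one token per character of the transcription, then <eos>.
    """
    tokens = [SOS]
    for cx, cy, text in instances:
        tokens.append(discretize(cx, width, n_bins))
        tokens.append(discretize(cy, height, n_bins))
        tokens.extend(text)
    tokens.append(EOS)
    return tokens


print(build_target_sequence([(50.0, 25.0, "EXIT")], width=100, height=50))
# ['<sos>', 501, 501, 'E', 'X', 'I', 'T', '<eos>']
```

A larger n gives finer localization at the cost of a larger coordinate vocabulary for the decoder.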

[0127] In step S1042 of some embodiments, the step of performing sequence feature calculation on the target sequence to obtain target sequence features includes:

[0128] The target sequence is subjected to position embedding processing to obtain a position matrix;

[0129] The hidden matrix is obtained by calculating and processing the position matrix according to the first formula;

[0130] The position matrix and the hidden matrix are proportionally concatenated to obtain the target sequence features.

[0131] In this embodiment, position embedding is performed on the target sequence within the self-attention module of the transformer decoder structure to obtain a position matrix. Specifically, the relative positions within the sequence are encoded into a position matrix, which is expressed as follows:

[0132]

[0133] Where i and j represent the row and column indices of the matrix, respectively. T is the length of the sequence, and α is a hyperparameter.

[0134] Next, the hidden matrix H is calculated according to the first formula, the expression of which is as follows:

[0135]

[0136] In the formula, $V_m$ represents the value matrix in the m-th decoder layer, that is, the input sequence $X_m$ projected into the embedding by learnable parameters, and M is the total number of decoder layers.

[0137] Subsequently, the target sequence features, i.e., the global-local sequence features, are calculated as follows:

[0138] $F_{sf} = (1-\mu)\,\mathrm{softmax}(Q \cdot K)\cdot V + \mu H$;

[0139] In the formula, Q, K, and V represent the matrices obtained by applying three linear projections with different weights to the target sequence, and μ represents the gating factor between the global features and the local sequence-dependent hidden state features.
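The gated combination of the global attention term and the hidden matrix H can be illustrated with the NumPy sketch below. It follows the written formula without a scaling factor or attention mask, and takes the transpose of K so that the product is defined; these points, the random weights, and the function names are assumptions. The hidden matrix H is passed in as a given input because its own formula is not reproduced in the text.

```python
import numpy as np


def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def sequence_feature(x, wq, wk, wv, hidden, mu):
    """Compute F_sf = (1 - mu) * softmax(Q K^T) V + mu * H.

    x      : target sequence embeddings, shape (T, d)
    hidden : hidden matrix H, shape (T, d), assumed to be precomputed
    mu     : gating factor between the global and local terms
    """
    q, k, v = x @ wq, x @ wk, x @ wv
    global_term = softmax(q @ k.T) @ v      # global self-attention features
    return (1.0 - mu) * global_term + mu * hidden


rng = np.random.default_rng(0)
T, d = 6, 8
x = rng.standard_normal((T, d))
wq, wk, wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
hidden = rng.standard_normal((T, d))
print(sequence_feature(x, wq, wk, wv, hidden, mu=0.3).shape)  # (6, 8)
```

With μ = 0 the feature reduces to plain self-attention, and with μ = 1 it reduces to the hidden matrix alone, so μ directly controls how much local sequence dependence is mixed in.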

[0140] Finally, a multi-head attention network is used, and its output O is represented as:

[0141]

[0142] Where $\Theta_{sf}$ represents the learnable parameters.

[0143] During training, the entire network is optimized end-to-end using the cross-entropy loss function based on the output sequence O. The total objective function L is expressed as:

[0144]

[0145] Where $O_t$ represents the current output, $O_{<t}$ represents the output sequence before the current step t, and P(·) represents the prediction probability.
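As an illustration of the training objective, the following Python sketch computes a sequence-level cross-entropy loss from per-step prediction probabilities. The exact expression is not reproduced in the text, so the standard autoregressive form, the negative sum over steps t of log P(O_t | O_<t), is assumed here; the function name and the dictionary representation of the probabilities are hypothetical.

```python
import math


def sequence_cross_entropy(step_probs, targets):
    """Negative log-likelihood of a target token sequence.

    step_probs[t] maps each candidate token to the predicted probability
    P(token | O_<t), and targets[t] is the ground-truth token O_t.
    """
    return -sum(math.log(probs[token]) for probs, token in zip(step_probs, targets))


# Two decoding steps over a toy vocabulary {"a", "b"}.
step_probs = [
    {"a": 0.5, "b": 0.5},
    {"a": 0.25, "b": 0.75},
]
loss = sequence_cross_entropy(step_probs, ["a", "a"])
print(loss)
```

In this toy case the loss equals log 2 + log 4 = log 8, because the target tokens were assigned probabilities 0.5 and 0.25.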

[0146] During inference, the multi-granularity feature fusion module encodes the features of the input image. The initial input to the decoder is the start token <sos>, which tells the text sequence decoder where to start decoding. The text sequence decoder then reads the [x, y, d] tokens of the target sequence one by one. Finally, the tokens are translated into point coordinates and transcriptions, which gives the text localization and recognition result. This embodiment introduces sequential recursive self-attention to capture local dependencies and global relationships between characters, as well as the position of the single point relative to the text, thereby correcting prediction errors and improving the accuracy of text recognition.
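The inference procedure can be sketched as a greedy decoding loop followed by a parsing step. This is a minimal illustration, not the patented decoder: the `next_token` callable is a hypothetical stand-in for the trained sequential recursive decoder, the token layout (integer x and y bins followed by character tokens) is assumed, the output is assumed to be well formed, and each bin is mapped back to the center of its interval.

```python
SOS, EOS = "<sos>", "<eos>"


def greedy_decode(next_token, max_len=64):
    """Start from <sos> and append predicted tokens until <eos> or max_len."""
    tokens = [SOS]
    while len(tokens) < max_len:
        token = next_token(tokens)
        if token == EOS:
            break
        tokens.append(token)
    return tokens[1:]  # drop the start token


def parse_instances(tokens, width, height, n_bins=1000):
    """Translate decoded tokens into ((x, y), transcription) pairs."""
    instances = []
    i = 0
    while i + 1 < len(tokens):
        x_bin, y_bin = tokens[i], tokens[i + 1]
        i += 2
        chars = []
        while i < len(tokens) and isinstance(tokens[i], str):
            chars.append(tokens[i])
            i += 1
        # Map each bin in [1, n_bins] back to the center of its interval.
        point = ((x_bin - 0.5) / n_bins * width, (y_bin - 0.5) / n_bins * height)
        instances.append((point, "".join(chars)))
    return instances


# Scripted stand-in for the decoder: always emits the next token of a fixed script.
script = iter([501, 501, "E", "X", "I", "T", EOS])
decoded = greedy_decode(lambda prefix: next(script))
print(parse_instances(decoded, width=100, height=50))
```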

[0147] In some embodiments, the method further includes performing weakly supervised training on the sequential recursive decoder, specifically including:

[0148] Obtain the sequence dataset;

[0149] Single-coordinate point annotation is performed on a portion of the data in the sequence dataset to obtain the sequence training dataset;

[0150] The sequence training dataset is input into the sequential recursive decoder for training, resulting in a trained sequential recursive decoder.

[0151] In this embodiment, the sequential recursive decoder is trained using a weakly supervised training method. Specifically, single-coordinate point annotation is performed on a portion of the data in the sequence dataset. Through weakly supervised training, this embodiment can annotate text in any region with only a single coordinate point, greatly saving manpower costs.
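One possible way to prepare single-coordinate point annotations is sketched below. It assumes, purely for illustration, that a dataset already carries polygon annotations and derives one point per text instance as the mean of the polygon vertices; the text itself does not prescribe how the single point is chosen, and the function names and record layout are hypothetical.

```python
def polygon_to_point(polygon):
    """Reduce a polygon [(x, y), ...] to a single point (mean of its vertices)."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (sum(xs) / len(xs), sum(ys) / len(ys))


def to_single_point_labels(annotations):
    """Convert polygon-level annotations into single-point annotations."""
    return [
        {"point": polygon_to_point(item["polygon"]), "text": item["text"]}
        for item in annotations
    ]


annotations = [
    {"polygon": [(0, 0), (4, 0), (4, 2), (0, 2)], "text": "EXIT"},
]
print(to_single_point_labels(annotations))
# [{'point': (2.0, 1.0), 'text': 'EXIT'}]
```

When annotating new data, the same record layout could be filled directly with one clicked point per text instance, which is where the saving in annotation effort comes from.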

[0152] The following is a detailed description and explanation of the solutions in the embodiments of this application, in conjunction with specific scenarios:

[0153] This application embodiment acquires an image to be recognized, performs feature extraction on the image to obtain visual features, then uses a multi-granularity feature fusion network to perform multi-granularity fusion processing on the visual features to obtain fused features, and finally uses a sequential recursive decoder to perform autoregressive decoding processing on the fused features to obtain the text recognition result. This application embodiment can model scene text localization as a simple sequence prediction problem and apply it to weakly supervised text localization and recognition scenarios under single-point supervision. In addition, this application embodiment focuses more on modeling the global relationships and local dependencies of the target sequence composed of a single point and transcribed text, without the need to design prior boxes, thereby making the model have better generalization ability and strong recognition ability, and exhibiting excellent performance for text of arbitrary shapes such as horizontal text, multi-directional text, and curved text.

[0154] This application's embodiments can be applied to human-computer interaction scenarios, such as automatic data entry in offices, express delivery orders written by customers in the courier industry, and handwritten information forms in the financial and insurance industries. Applying text recognition technology can accelerate the data input process while protecting customer privacy. Another potential application is note-taking software, which can instantly transcribe notes as users write them. Furthermore, there are applications such as automatic identity authentication and automatic license plate and vehicle logo recognition. The methods described in this application's embodiments can improve people's work and life efficiency. This application's embodiments can also be applied to deep intelligent recognition question-and-answer scenarios. For example, given an image, the machine can intelligently combine the text information in the image to answer or describe a deeper meaning. In the fields of autonomous driving and robotics, text recognition is used for interaction and guidance with machines, enabling timely and accurate feedback. It can also improve the efficiency of standardized question-and-answer processes in online customer service. This application's embodiments can also be applied to intelligent content understanding scenarios. The text recognition method of this application allows industries to perform more intelligent analysis, primarily for platforms such as video sharing websites and e-commerce, extracting text from images, captions, and real-time comment captions. On one hand, this extracted text can be used for automatic content tagging and recommendation systems. They can also be used to perform user sentiment analysis, such as identifying which parts of a video are most engaging to users. On the other hand, website administrators can monitor and filter inappropriate and illegal content.

[0155] The proposed weakly supervised character recognition method with multi-granularity perception is tested in the following way:

[0156] The test environment uses Ubuntu 20.04 as the system environment and has the following hardware: 80GB memory, NVIDIA A100 GPU, and 2TB hard drive.

[0157] The following are the experimental data for this invention:

[0158] This invention first uses the SynthText-150K dataset (94,723 multi-directional images and 54,327 curved images) for pre-training. Then, experiments were conducted on three datasets: ICDAR 2015 (1000 training images and 500 test images), Total-Text (1255 training images and 300 test images), and Inverse-Text (500 test images).

[0159] This embodiment uses the AdamW optimizer. The initial learning rate is 5×10⁻⁴ and decays linearly to 1×10⁻⁵. After pre-training, the model is fine-tuned on the training split of each target dataset for 200 epochs with a fixed learning rate of 1×10⁻⁵. The entire model was trained on four NVIDIA A100 GPUs with a batch size of 4.
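The learning rate schedule described above can be sketched as follows. The text does not state the number of pre-training steps or whether the decay is applied per step or per epoch, so `total_steps` and the per-step decay are assumptions, as is reading the fixed fine-tuning rate as 1×10⁻⁵; the function names are hypothetical.

```python
def pretrain_lr(step, total_steps, start=5e-4, end=1e-5):
    """Linear decay from `start` to `end` over `total_steps` pre-training steps."""
    fraction = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * fraction


def finetune_lr(epoch, lr=1e-5):
    """Fixed learning rate for every fine-tuning epoch."""
    return lr


total_steps = 100  # hypothetical value, not given in the text
for step in (0, 50, 100):
    print(step, pretrain_lr(step, total_steps))
print([finetune_lr(epoch) for epoch in range(3)])
```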

[0160] The following are the experimental results of this application:

[0161] The ablation experiments conducted on the Total-Text dataset are shown in Tables 1 to 3, and the experiments conducted on the ICDAR 2015 dataset are shown in Table 4. In the experiments, the baseline model directly uses an encoder-decoder-based autoregressive method for prediction. Table 1 validates the effectiveness of the proposed modules. As shown in the first row of Table 1, the baseline achieves an F-measure of 66.98% with a dictionary and 56.37% without a dictionary. Adding the multi-granularity feature fusion (MGFF) network to the baseline improves the F-measure by 1.19% with a dictionary and by 2.66% without a dictionary. Adding the SRSA module to the baseline improves the F-measure by 1.49% with a dictionary and by 1.17% without a dictionary. Adding both the MGFF and SRSA modules to the baseline improves performance over the baseline both with a dictionary (68.74% vs 66.98%) and without a dictionary (59.07% vs 56.37%). Furthermore, the model's performance after fine-tuning is similar to that of the pre-trained model and remains better than the baseline. Table 1 is shown below:

[0162]

[0163] Table 1 Module Validation Table

[0164] Table 2 compares the performance of the MGFF module with various feature enhancement modules. Two alternative feature fusion methods are compared. "Fusion1" applies an FPN-like structure to $F_{L-2}$, $F_{L-1}$, and $F_L$ for multi-scale feature fusion instead of using the Intra-Layer Interaction (ILI) module for encoding. "Fusion2" first fuses the high-level semantic features with the low-level features and then uses the interaction features of the high-level information for the concatenation operation. Experimental results show that the "Fusion1" method results in a slight performance decrease, accompanied by a larger number of parameters (91.4M). The "Fusion2" method with only pre-training still results in a performance decrease, but after fine-tuning it shows a slight improvement over the baseline of 0.21% with a dictionary and 0.36% without. Finally, the MGFF module in this embodiment improves performance by 1.19% and 2.66% with and without a dictionary, respectively, without fine-tuning, and still shows improvements of 0.58% and 0.51% after fine-tuning. Table 2 is shown below:

[0165]

[0166] Table 2. Performance Comparison of MGFF Module and Various Feature Enhancement Modules

[0167] Table 3 shows the impact of the feature ratio on the SRSA module. The Sequential Recursive Self-Attention (SRSA) mechanism proposed in this application improves performance by 1.49% and 1.17% with and without a dictionary, respectively, in the pre-training mode. In fine-tuning mode, SRSA also outperforms the baseline by 1.89% and 1.39% with and without a dictionary, respectively. Based on scale-robust features, the average F-measure improvement of the proposed SRSA exceeds 1%. The combination of MGFF and SRSA can further enhance performance. Table 3 is shown below:

[0168]

[0169] Table 3. Influence of Feature Ratio in SRSA Module

[0170] Table 4 shows the performance comparison of different scales tested on the ICDAR2015 dataset. As can be seen from Table 4, the method of this invention achieves state-of-the-art performance. For scene text of arbitrary shapes, on the Total-Text dataset, the method of this embodiment is competitive with some fully supervised methods and outperforms most weakly supervised methods. For multi-directional text instances, on ICDAR 2015, the method of this embodiment outperforms the single-point supervised method SPTS, improving the S, W, and G metrics by 5.3%, 6.8%, and 5.9%, respectively. It also outperforms SPTSv2 by 1.1%, 1.4%, and 1.4% in the S, W, and G metrics, respectively. In addition, quantitative results on inverse text examples are shown in Table 4: the method of this embodiment outperforms SPTS by 4.8% with a dictionary and 6.3% without. Table 4 is shown below:

[0171]

[0172] Table 4. Performance comparison of different scales on the ICDAR2015 dataset.

[0173] Furthermore, referring to Table 5, the method of this embodiment outperforms SPTSv2 by 4.6% and 4.3% in these two metrics, respectively, demonstrating that it is strongly competitive. Table 5 compares performance across the datasets, as shown below:

[0174]

[0175] Table 5 Performance comparison table for each dataset

[0176] Referring to Figure 6, this application also provides a multi-granularity perception weakly supervised character recognition device, which can implement the above-mentioned multi-granularity perception weakly supervised character recognition method. The device includes:

[0177] The first module 601 is used to acquire the image to be recognized;

[0178] The second module 602 is used to perform feature extraction processing on the image to be identified to obtain visual features;

[0179] The third module 603 is used to perform multi-granularity fusion processing on the visual features through a multi-granularity feature fusion network to obtain fused features;

[0180] The fourth module 604 is used to perform autoregressive decoding on the fused features through a sequential recursive decoder to obtain the text recognition result. The sequential recursive decoder is obtained through weakly supervised training.
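The four-module structure of the device can be sketched as a simple composition of callables. This is only an illustration of how the modules hand data to one another; the class name, field names, and the stub functions are hypothetical and stand in for the real acquisition, backbone, fusion, and decoder components.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class RecognitionDevice:
    acquire: Callable[[Any], Any]            # first module 601
    extract_features: Callable[[Any], Any]   # second module 602
    fuse: Callable[[Any], Any]               # third module 603
    decode: Callable[[Any], Any]             # fourth module 604

    def recognize(self, source):
        image = self.acquire(source)
        visual_features = self.extract_features(image)
        fused_features = self.fuse(visual_features)
        return self.decode(fused_features)


# Stub modules that only record the order in which they are applied.
device = RecognitionDevice(
    acquire=lambda source: [source],
    extract_features=lambda image: image + ["visual"],
    fuse=lambda features: features + ["fused"],
    decode=lambda features: features + ["text"],
)
print(device.recognize("image.png"))
# ['image.png', 'visual', 'fused', 'text']
```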

[0181] It is understood that the content of the above method embodiments is applicable to the present device embodiments. The specific functions implemented by the present device embodiments are the same as those of the above method embodiments, and the beneficial effects achieved are also the same as those achieved by the above method embodiments.

[0182] This application also provides an electronic device, which includes a memory and a processor. The memory stores a computer program, and the processor executes the computer program to implement the aforementioned multi-granularity perception weakly supervised character recognition method. This electronic device can be any smart terminal, including tablet computers, in-vehicle computers, etc.

[0183] It is understood that the content of the above method embodiments is applicable to this device embodiment. The specific functions implemented by this device embodiment are the same as those of the above method embodiments, and the beneficial effects achieved are also the same as those achieved by the above method embodiments.

[0184] Referring to Figure 7, which illustrates the hardware structure of an electronic device according to another embodiment, the electronic device includes:

[0185] The processor 701 can be implemented using a general-purpose CPU (Central Processing Unit), microprocessor, application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute relevant programs to implement the technical solutions provided in the embodiments of this application.

[0186] The memory 702 can be implemented as a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 702 can store the operating system and other application programs. When the technical solutions provided in the embodiments of this specification are implemented through software or firmware, the relevant program code is stored in the memory 702 and is called and executed by the processor 701 to implement a multi-granularity perceptual weakly supervised character recognition method according to an embodiment of this application.

[0187] The input / output interface 703 is used to implement information input and output;

[0188] The communication interface 704 is used to enable communication and interaction between this device and other devices. Communication can be achieved through wired means (such as USB, Ethernet cable, etc.) or wireless means (such as mobile network, WIFI, Bluetooth, etc.).

[0189] Bus 705 transmits information between various components of the device (e.g., processor 701, memory 702, input / output interface 703, and communication interface 704);

[0190] The processor 701, memory 702, input / output interface 703, and communication interface 704 are connected to each other within the device via bus 705.

[0191] This application also provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the aforementioned multi-granularity perception weakly supervised character recognition method.

[0192] It is understood that the content of the above method embodiments is applicable to this storage medium embodiment. The specific functions implemented in this storage medium embodiment are the same as those in the above method embodiments, and the beneficial effects achieved are also the same as those achieved in the above method embodiments.

[0193] Memory, as a non-transitory computer-readable storage medium, can be used to store non-transitory software programs and non-transitory computer-executable programs. Furthermore, memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, memory may optionally include memory remotely located relative to the processor, and these remote memories can be connected to the processor via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

[0194] This application provides a multi-granularity perceptual weakly supervised character recognition method and apparatus. It extracts visual features from the acquired image to be recognized and then fuses these visual features using a multi-granularity feature fusion network to obtain fused features. This improves the perception of text regions in the image, thereby increasing the efficiency of character recognition. Furthermore, the scheme uses a sequential recursive decoder to perform autoregressive decoding on the fused features to obtain the character recognition result. The sequential recursive decoder can acquire local dependencies and global relationships between characters, correcting prediction errors and further improving the efficiency of character recognition. Moreover, this scheme uses weakly supervised training to obtain the sequential recursive decoder, reducing the workload of manual annotation and saving labor costs.

[0195] The embodiments described in this application are for the purpose of more clearly illustrating the technical solutions of the embodiments of this application, and do not constitute a limitation on the technical solutions provided by the embodiments of this application. As those skilled in the art will know, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of this application are also applicable to similar technical problems.

[0196] Those skilled in the art will understand that the technical solutions shown in the figures do not constitute a limitation on the embodiments of this application, and may include more or fewer steps than shown, or combine certain steps, or different steps.

[0197] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.

[0198] Those skilled in the art will understand that all or some of the steps in the methods disclosed above, as well as the functional modules / units in the systems and devices, can be implemented as software, firmware, hardware, or suitable combinations thereof.

[0199] The terms "first," "second," "third," "fourth," etc. (if present) in the specification and accompanying drawings of this application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that the embodiments of this application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or apparatus that comprises a series of steps or units is not necessarily limited to those steps or units explicitly listed, but may include other steps or units not explicitly listed or inherent to such processes, methods, products, or apparatus.

[0200] It should be understood that in this application, "at least one (item)" means one or more, and "multiple" means two or more. "And / or" describes the relationship between related objects and indicates that three relationships can exist. For example, "A and / or B" can represent three cases: only A exists, only B exists, or both A and B exist, where A and B can be singular or plural. The character " / " generally indicates that the preceding and following related objects are in an "or" relationship. "At least one (item) of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one (item) of a, b, or c can represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c can be single or multiple.

[0201] In the several embodiments provided in this application, it should be understood that the disclosed apparatus and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of the units described above is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical, or other forms.

[0202] The units described above as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units can be selected to achieve the purpose of this embodiment according to actual needs.

[0203] Furthermore, the functional units in the various embodiments of this application can be integrated into one processing unit, or each unit can exist physically separately, or two or more units can be integrated into one unit. The integrated unit can be implemented in hardware or as a software functional unit.

[0204] If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes beyond the prior art, or all or part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes multiple instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0205] The preferred embodiments of the present application have been described above with reference to the accompanying drawings, but this does not limit the scope of the claims of the present application. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and substance of the embodiments of the present application shall be within the scope of the claims of the present application.

Claims

1. A weakly supervised character recognition method with multi-granularity perception, characterized in that the method includes: acquiring an image to be recognized; performing feature extraction processing on the image to be recognized to obtain visual features; performing multi-granularity fusion processing on the visual features through a multi-granularity feature fusion network to obtain fused features; and performing autoregressive decoding processing on the fused features through a sequential recursive decoder to obtain a text recognition result, wherein the sequential recursive decoder is obtained through weakly supervised training.
The step of performing multi-granularity fusion processing on the visual features through the multi-granularity feature fusion network to obtain the fused features includes: inputting the visual features into the multi-granularity feature fusion network, wherein the visual features include a set of high-order features and a set of low-order features, and the multi-granularity feature fusion network includes a low-order feature fusion module and an intra-layer interaction module; performing feature fusion processing on the set of low-order features through the low-order feature fusion module to obtain low-order enhanced features; performing feature interaction processing on the high-order features through the intra-layer interaction module to obtain high-order enhanced features; and concatenating the low-order enhanced features and the high-order enhanced features to obtain the fused features.
The step of performing feature fusion processing on the set of low-order features through the low-order feature fusion module to obtain the low-order enhanced features includes: obtaining a first low-order feature and a second low-order feature from the set of low-order features; performing feature enhancement processing on the first low-order feature to obtain a first intermediate feature; upsampling the first intermediate feature to obtain an upsampled feature; concatenating the upsampled feature with the second low-order feature to obtain a first concatenated feature; performing feature enhancement processing on the first concatenated feature to obtain a second intermediate feature; concatenating the first intermediate feature and the second intermediate feature to obtain a second concatenated feature; and performing feature enhancement processing on the second concatenated feature to obtain the low-order enhanced features.
The step of performing feature interaction processing on the high-order features through the intra-layer interaction module to obtain the high-order enhanced features includes: obtaining a positional encoding; adding the high-order features to the positional encoding to obtain encoded features; inputting the encoded features into a multi-head self-attention module for attention calculation to obtain attention scores; sequentially inputting the attention scores into a first normalization layer and a feedforward neural network layer for residual summation to obtain summed features; and inputting the summed features into a second normalization layer for normalization processing to obtain the high-order enhanced features.
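
The intra-layer interaction steps recited in claim 1 can be illustrated with a minimal NumPy sketch. It is one possible reading of the claim language, not the patented implementation: the high-order features are assumed to be flattened into a token matrix, the weights are random, the layer normalization has no learned parameters, the residual sum is assumed to add the feedforward output to the output of the first normalization layer, and every name (`intra_layer_interaction`, `params`, `num_heads`, the sizes) is hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean and unit variance (no learned scale or shift).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    n, d = x.shape
    d_head = d // num_heads

    def split(t):
        # (n, d) -> (heads, n, d_head)
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))  # (heads, n, n)
    heads = weights @ v                                            # (heads, n, d_head)
    return heads.transpose(1, 0, 2).reshape(n, d) @ w_o            # merge heads

def intra_layer_interaction(high, pos, params, num_heads=4):
    encoded = high + pos                                    # add the positional encoding
    attended = multi_head_self_attention(encoded, *params["attn"], num_heads)
    normed = layer_norm(attended)                           # first normalization layer
    w1, w2 = params["ffn"]
    ffn_out = np.maximum(normed @ w1, 0.0) @ w2             # feedforward layer with ReLU
    summed = normed + ffn_out                               # residual summation (assumed placement)
    return layer_norm(summed)                               # second normalization layer

# Toy inputs: 16 tokens of dimension 32 (hypothetical sizes).
rng = np.random.default_rng(0)
n_tokens, dim = 16, 32
params = {
    "attn": [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(4)],
    "ffn": (rng.normal(scale=0.1, size=(dim, 4 * dim)),
            rng.normal(scale=0.1, size=(4 * dim, dim))),
}
high = rng.normal(size=(n_tokens, dim))
pos = rng.normal(scale=0.02, size=(n_tokens, dim))
out = intra_layer_interaction(high, pos, params)
```

In this sketch the output keeps the input shape, so the high-order enhanced features could be concatenated with low-order enhanced features once both are brought to a compatible layout; how the claimed method aligns them is not specified in the claim text.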

2. The method according to claim 1, characterized in that the step of performing autoregressive decoding processing on the fused features through the sequential recursive decoder to obtain the text recognition result includes: performing text sequence discretization processing on the image to be recognized to obtain a target sequence; performing sequence feature calculation on the target sequence to obtain target sequence features; and performing multi-head attention processing on the fused features and the target sequence features to obtain the text recognition result.
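
The autoregressive decoding named in claims 1 and 2 can be pictured as a loop that feeds each predicted token back in as input for the next prediction. The following is a minimal Python sketch under stated assumptions, not the patented decoder: `greedy_decode`, `scripted_step`, `max_len`, and the `<sos>`/`<eos>` marker names are hypothetical, and the step function is a scripted stand-in for the multi-head attention between the fused features and the target sequence features.

```python
from typing import Callable, List, Sequence

SOS, EOS = "<sos>", "<eos>"  # assumed start and end markers

def greedy_decode(fused, step: Callable[[object, Sequence[str]], str],
                  max_len: int = 32) -> List[str]:
    """Generate tokens one at a time, conditioning each step on the tokens so far."""
    tokens = [SOS]
    while len(tokens) < max_len:
        next_token = step(fused, tokens)  # predict from fused features and current prefix
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens

def scripted_step(fused, prefix: Sequence[str]) -> str:
    # Stand-in predictor: treats `fused` as the ground-truth string and replays it.
    script = list(fused) + [EOS]
    return script[len(prefix) - 1]

result = greedy_decode("cafe", scripted_step)
```

A real decoder would replace `scripted_step` with a model that scores every vocabulary token given the fused features and the prefix; the loop structure itself would stay the same.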

3. The method according to claim 2, characterized in that the step of performing text sequence discretization processing on the image to be recognized to obtain the target sequence includes: performing discrete labeling on the text sequence in the image to be recognized to obtain text instances; serializing the text instances to obtain a text instance sequence; and performing marker insertion processing on the text instance sequence to obtain the target sequence.
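
The three steps of claim 3 (discrete labeling, serialization, marker insertion) can be sketched as follows. This is an illustrative assumption about the token layout rather than the format defined by the patent: each text instance is taken to be a single coordinate point plus its transcription, the coordinates are quantized into `NUM_BINS` bins, and the `<sos>`/`<eos>` markers, the `<x_…>`/`<y_…>` token names, and all function names are hypothetical.

```python
from typing import List, Tuple

NUM_BINS = 1000              # assumed number of coordinate bins
SOS, EOS = "<sos>", "<eos>"  # assumed sequence markers

Instance = Tuple[Tuple[float, float], str]  # ((x, y) point, transcription)

def quantize(value: float, size: int, num_bins: int = NUM_BINS) -> int:
    """Map a pixel coordinate in [0, size] to an integer bin in [0, num_bins - 1]."""
    bin_index = int(value / size * num_bins)
    return min(max(bin_index, 0), num_bins - 1)

def build_target_sequence(instances: List[Instance], width: int, height: int) -> List[str]:
    tokens = [SOS]                                   # marker insertion: start of sequence
    for (x, y), text in instances:                   # serialization: one instance after another
        tokens.append(f"<x_{quantize(x, width)}>")   # discrete labeling of the point
        tokens.append(f"<y_{quantize(y, height)}>")
        tokens.extend(text)                          # one token per character
    tokens.append(EOS)                               # marker insertion: end of sequence
    return tokens

instances = [((120.0, 40.0), "EXIT"), ((300.0, 200.0), "12")]
sequence = build_target_sequence(instances, width=640, height=480)
# sequence == ["<sos>", "<x_187>", "<y_83>", "E", "X", "I", "T",
#              "<x_468>", "<y_416>", "1", "2", "<eos>"]
```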

4. The method according to claim 2, characterized in that the step of performing sequence feature calculation on the target sequence to obtain the target sequence features includes: performing position embedding processing on the target sequence to obtain a position matrix; calculating a hidden matrix from the position matrix according to the first formula; and proportionally concatenating the position matrix and the hidden matrix to obtain the target sequence features.

5. The method according to any one of claims 1 to 4, characterized in that the step of performing feature extraction processing on the image to be recognized to obtain the visual features includes: obtaining an image training dataset; inputting the image training dataset into a feature extraction network for pre-training to obtain a backbone network; and extracting visual features from the image to be recognized through the backbone network to obtain the visual features.

6. The method according to any one of claims 1 to 4, characterized in that the method further includes weakly supervised training of the sequential recursive decoder, specifically including: obtaining a sequence dataset; performing single-coordinate point annotation on a portion of the data in the sequence dataset to obtain a sequence training dataset; and inputting the sequence training dataset into the sequential recursive decoder for training to obtain a trained sequential recursive decoder.
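
The data preparation in claim 6, where only a portion of the sequence dataset receives single-coordinate point annotation, might look like the sketch below. It is an interpretation under stated assumptions: the annotated portion is treated as the training set, the portion is chosen at random, and `build_point_supervised_set`, `fraction`, `fake_annotator`, and the sample fields are hypothetical. In practice the point would come from a human annotator rather than a function.

```python
import random
from typing import Any, Callable, Dict, List, Tuple

Point = Tuple[float, float]
Sample = Dict[str, Any]

def build_point_supervised_set(dataset: List[Sample],
                               annotate_point: Callable[[Sample], Point],
                               fraction: float,
                               seed: int = 0) -> List[Sample]:
    """Pick a portion of the dataset and attach one (x, y) point to each picked sample."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    if not dataset:
        return []
    count = max(1, round(len(dataset) * fraction))
    chosen = random.Random(seed).sample(dataset, count)  # reproducible random subset
    # Copy each chosen sample so the original dataset is left unmodified.
    return [{**sample, "point": annotate_point(sample)} for sample in chosen]

def fake_annotator(sample: Sample) -> Point:
    # Stand-in for manual labeling: returns an arbitrary point per image.
    return (10.0 * sample["image_id"], 5.0)

dataset = [{"image_id": i, "text": f"word{i}"} for i in range(10)]
training_set = build_point_supervised_set(dataset, fake_annotator, fraction=0.3)
```

Each resulting sample carries a transcription and one point, which is far cheaper to label than a bounding box or polygon.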

7. A weakly supervised character recognition device with multi-granularity perception, characterized in that the device is applied to the weakly supervised character recognition method with multi-granularity perception according to claim 1, and the device comprises: a first module for acquiring an image to be recognized; a second module for performing feature extraction processing on the image to be recognized to obtain visual features; a third module for performing multi-granularity fusion processing on the visual features through a multi-granularity feature fusion network to obtain fused features; and a fourth module for performing autoregressive decoding processing on the fused features through a sequential recursive decoder to obtain a text recognition result, wherein the sequential recursive decoder is obtained through weakly supervised training.