Scene text extraction method and system without fine-grained detection

By using a pre-trained text block detector and recognizer, combined with feature fusion and splicing techniques, the challenge of fine-grained detection in scene text localization and recognition is solved, achieving high-precision text extraction, reducing the detection burden and improving the flexibility and generalization ability of recognition.

CN115879462B · Active · Publication Date: 2026-06-26 · COMMUNICATION UNIVERSITY OF CHINA +1


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: COMMUNICATION UNIVERSITY OF CHINA
Filing Date: 2022-10-10
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing technologies struggle with fine-grained detection in scene text localization and recognition: when text spans multiple rows and columns or forms dense clusters, word boundaries are difficult to distinguish accurately. Furthermore, the recognition module is sensitive to the detection results, so inaccurate boundaries tend to introduce background interference or destroy character integrity. In addition, the recognizer lacks flexibility and generalization ability.

Method used

We employ a pre-trained text block detector and recognizer, construct a text block dataset using a heuristic text block generation method, utilize a feature pyramid network and a region selection network for text block detection, and combine an LSTM attention module and a positional attention module for feature fusion and concatenation to achieve coarse-grained detection and multi-instance recognition.

Benefits of technology

It reduces the detection burden, utilizes rich contextual information for recognition, improves the flexibility and generalization ability of recognition, and achieves high-precision text extraction.

✦ Generated by Eureka AI based on patent content.


Abstract

The application provides a scene text extraction method without fine-grained detection. First, the obtained text image is input into a pre-trained text block detector to enable the text block detector to detect and crop the text image to form a text block image. Then, a pre-trained text block recognizer is used to obtain a semantic feature vector and a position feature vector of the text block image based on a text block feature map. Feature fusion and splicing are performed based on the semantic feature vector and the position feature vector to obtain a predicted feature, and a predicted text corresponding to the predicted feature is obtained. This framework combining coarse-grained detection and multi-instance recognition can reduce the detection burden, and the rich context information can be used for recognition. The text block detector can be trained by a text block level data set generated by a heuristic text block generation method based on a real data set, and high-precision text extraction can be achieved without fine-grained detection.


Description

Technical Field

[0001] This invention relates to the field of text extraction technology, and more specifically, to a method, system, and electronic device for extracting scene text without the need for fine-grained detection.

Background Technology

[0002] In recent years, scene text localization and recognition systems have achieved great success, playing a significant role in numerous practical applications such as identity authentication, license plate recognition, and visual question answering. With the help of deep learning, text extraction techniques have achieved impressive results using finely annotated datasets. Traditional scene text localization and recognition systems typically involve two separate tasks: text detection and text recognition. Specifically, the detection goal is to provide accurate and tight contours for fine-grained text instances; the recognizer aims to transcribe the cropped text image into a readable character sequence. This requires the detector to be as accurate as possible to provide appropriate text region features for subsequent recognition. Most existing work follows a framework of fine-grained word / character level and single instance recognition, which overemphasizes the role of the detector while neglecting the importance of rich contextual information in recognition.

[0003] On the one hand, fine-grained and accurate detection presents significant challenges in real-world scenarios. For example, when text is distributed across multiple rows and columns, ambiguous detection results are prone to occur; when text is densely packed, the detector struggles to distinguish word boundaries. Furthermore, because the recognition module is highly sensitive to detection results, if the detection boundaries are too loose, background interference will be introduced; if the detection boundaries are too tight, the integrity of the characters will be compromised.

[0004] On the other hand, the input to the recognizer is usually an isolated instance (such as a word), which loses rich contextual information from the surrounding text and can lead to recognition errors under conditions such as occlusion and reflection. Although some works involve dictionary-based post-processing or additional language models, they lack flexibility and have limited generalization ability.

[0005] Therefore, there is an urgent need for a scene text extraction method or system that can reduce the pressure on the detector, make full use of contextual semantic information, and improve flexibility and generalization ability without fine-grained detection.

Summary of the Invention

[0006] In view of the above problems, the purpose of this invention is to provide a scene text extraction method that does not require fine-grained detection, so as to solve the following problems in the current prior art: First, fine-grained accurate detection is very challenging in real-world scenarios. For example, when text is distributed across multiple rows and columns, ambiguous detection results are prone to occur; when text is densely clustered, the detector has difficulty distinguishing word boundaries. At the same time, since the recognition module is highly sensitive to the detection results, if the detection boundary is too loose, it will introduce background interference; if the detection boundary is too tight, it will also destroy the integrity of the characters. On the other hand, the input of the recognizer is usually an isolated instance (such as a word), which will lose the rich contextual information of the surrounding text, leading to recognition errors in cases of occlusion, reflection, etc. Although some works involve dictionary-based post-processing or additional language models, they lack flexibility and have limited generalization ability.

[0007] This invention provides a method for extracting scene text without fine-grained detection, comprising:

[0008] The acquired text image is input into a pre-trained text block detector, which detects and crops the text image to form text block images; wherein, the text block detector is trained from a pre-established text block dataset; the text block dataset is generated using a heuristic text block generation method;

[0009] The text block image is subjected to feature extraction by a pre-trained text block recognizer to obtain a text block feature map. Based on the text block feature map, the semantic feature vector and the positional feature vector of the text block image are obtained. Based on the semantic feature vector and the positional feature vector, feature fusion and concatenation are performed to obtain predicted features, and the predicted text corresponding to the predicted features is obtained.

[0010] Preferably, the step of generating the text block dataset using a heuristic text block generation method includes:

[0011] Text block annotations for training the text block detector are generated on a pre-acquired public benchmark dataset annotated at the word or text-line level; the text block annotations include positional information and text information.

[0012] Based on the positional information, the samples in the public benchmark dataset are sorted by vertical and horizontal position, and a minimum bounding rectangle annotation is generated for each original annotation carried by the samples.

[0013] Text boxes for each sample are generated from the minimum bounding rectangle annotations to form sample data; wherein, if the intersection-over-union (IoU) of two text boxes in the same sample is greater than a preset text box threshold, the two text boxes are merged into one text box.

[0014] Sample data with text box and text block annotations are aggregated into a dataset called the text block dataset.

[0015] Preferably, the step of inputting the acquired text image into a pre-trained text block detector so that the text block detector detects and crops the text image to form a text block image includes:

[0016] The text image is input into the residual backbone network of the text block detector, which is equipped with a feature pyramid network, to obtain the full-image feature map of the text image;

[0017] The region selection network in the text block detector generates detection boxes for the text image based on the full-image feature map;

[0018] The feature network module in the text block detector selects, from the full-image feature map, the block features corresponding to each detection box;

[0019] The text block detector classifies the detection boxes based on the block features using a fully connected layer to determine the text boxes of each category, and then crops the text image according to these text boxes to generate text block images.

[0020] Preferably, the steps of extracting features from the text block image using a pre-trained text block recognizer to obtain a text block feature map, and obtaining the semantic feature vector and positional feature vector of the text block image based on the text block feature map, include:

[0021] The text block image is used to extract features through the backbone network in the text block recognizer to obtain a text block feature map.

[0022] The LSTM-based attention module and the positional attention module in the text block recognizer obtain the semantic feature vector and positional feature vector of the text block image based on the text block feature map, respectively.

[0023] Preferably, the step of performing feature fusion and concatenation based on the semantic feature vector and the positional feature vector to obtain predicted features, and obtaining the predicted text corresponding to the predicted features, includes:

[0024] The semantic feature vector and the positional feature vector are fused by the fusion module in the text block recognizer to obtain a fused feature vector, and the fused feature vector is used as the prediction feature.

[0025] The predicted features are decoded using a pre-trained feedforward neural network to output the predicted text.

[0026] Preferably, the text block recognizer is trained using a synthetic dataset; the synthetic dataset includes text block images labeled with contextual and visual tags.

[0027] Preferably, during the training of the text block recognizer, the loss function is calculated from a cross-entropy loss combined with character count supervision provided by an aggregation cross-entropy (ACE) loss; wherein the step of calculating the loss function includes:

[0028] Extract the text block feature map from the backbone network;

[0029] Dense prediction is performed based on the text block feature map to obtain prediction parameters, and prediction statistics are obtained based on the prediction parameters.

[0030] The difference parameter between the text prediction generated by the trained text block recognizer and the known labels is calculated using a preset ACE loss function, and the difference parameter is used as a character counting supervision.

[0031] The cross-entropy loss between the text prediction generated by the trained text block recognizer and the known labels is calculated using a preset cross-entropy algorithm, and the loss function is obtained based on the cross-entropy loss and the character count supervision calculation.

[0032] This invention also provides a scene text extraction system without fine-grained detection, which implements the scene text extraction method described above and includes:

[0033] A text block detector is used to detect and crop text images to form text block images; wherein, the text block detector is trained from a pre-established text block dataset; the text block dataset is generated using a heuristic text block generation method;

[0034] A text block recognizer is used to extract features from the text block image to obtain a text block feature map, obtain semantic feature vectors and positional feature vectors of the text block image based on the text block feature map, perform feature fusion and concatenation based on the semantic feature vectors and the positional feature vectors to obtain predicted features, and obtain predicted text corresponding to the predicted features.

[0035] Preferably, the text block recognizer includes:

[0036] A backbone network is used to extract features from the text block image to obtain a text block feature map;

[0037] An LSTM-based attention module is used to obtain the semantic feature vector of the text block image based on the text block feature map;

[0038] The positional attention module is used to obtain the positional feature vector of the text block image based on the text block feature map;

[0039] The fusion module is used to perform feature fusion on the semantic feature vector and the positional feature vector to obtain a fused feature vector, and use the fused feature vector as a prediction feature;

[0040] A feedforward neural network is used to decode the predicted features to output predicted text.

[0041] Preferably, the text block detector includes:

[0042] The backbone network of the residual network is used to obtain the full-image feature map of the text image;

[0043] A region selection network is used to generate detection boxes for the text image based on the full-image feature map;

[0044] The feature network module is used to select block features corresponding to each block in the full-image feature map based on the box to be detected;

[0045] A fully connected layer is used to classify the detection box based on the block features to determine the text boxes of each category, and to crop the text image based on the text boxes of each category to generate a text block image.

[0046] As can be seen from the above technical solution, the scene text extraction method without fine-grained detection provided by the present invention first inputs the acquired text image into a pre-trained text block detector, so that the text block detector detects and crops the text image to form a text block image; wherein the text block detector is trained on a pre-established text block dataset, and the text block dataset is generated by a heuristic text block generation method. Then, a pre-trained text block recognizer extracts features from the text block image to obtain a text block feature map; based on the text block feature map, the semantic feature vector and positional feature vector of the text block image are obtained; based on these two vectors, feature fusion and concatenation are performed to obtain predicted features, and the predicted text corresponding to the predicted features is obtained. Thus, after reconsidering the traditional scene text extraction framework, which combines fine-grained word / character level detection with independent single-instance recognition, the invention proposes a unified framework of coarse-grained detection and multi-instance recognition that reduces the detection burden while using rich contextual information for recognition. A text block-level dataset can be generated from a real dataset through the heuristic text block generation method to train the text block detector, achieving high-precision text extraction without fine-grained detection.

Attached Figure Description

[0047] Other objects and results of the invention will become more apparent and more readily understood from the following description taken in conjunction with the accompanying drawings. In the drawings:

[0048] Figure 1 is a flowchart of a scene text extraction method without fine-grained detection according to an embodiment of the present invention;

[0049] Figure 2 is a schematic diagram of a scene text extraction system without fine-grained detection according to an embodiment of the present invention.

Detailed Implementation

[0050] Current text extraction methods face several challenges. Firstly, achieving fine-grained, accurate detection is extremely difficult in real-world scenarios. For example, ambiguous detection results can easily occur when text is distributed across multiple rows and columns; and when text is densely packed, detectors struggle to distinguish word boundaries. Furthermore, because the recognition module is highly sensitive to detection results, overly loose detection boundaries introduce background interference, while overly tight boundaries compromise character integrity. Secondly, the input to the recognizer is typically an isolated instance (such as a word), which loses rich contextual information from surrounding text, leading to recognition errors under conditions of occlusion or reflection. While some works involve dictionary-based post-processing or additional language models, these lack flexibility and have limited generalization capabilities.

[0051] To address the aforementioned problems, this invention provides a method and system for extracting scene text without requiring fine-grained detection. The specific embodiments of this invention will be described in detail below with reference to the accompanying drawings.

[0052] To illustrate the scene text extraction method and system without fine-grained detection provided by this invention, Figures 1 and 2 show embodiments of the present invention by way of example.

[0053] The following description of exemplary embodiments is merely illustrative and is in no way intended to limit the invention or its application or use. Techniques and equipment known to those skilled in the art may not be discussed in detail, but where appropriate, such techniques and equipment should be considered part of the specification.

[0054] As shown in Figure 1, the scene text extraction method without fine-grained detection provided by the present invention includes:

[0055] S1: The acquired text image is input into a pre-trained text block detector so that the text block detector detects and crops the text image to form text block images; wherein, the text block detector is trained from a pre-established text block dataset; the text block dataset is generated using a heuristic text block generation method;

[0056] S2: Extract features from the text block image using a pre-trained text block recognizer to obtain a text block feature map. Obtain the semantic feature vector and positional feature vector of the text block image based on the text block feature map. Perform feature fusion and concatenation based on the semantic feature vector and the positional feature vector to obtain predicted features, and obtain the predicted text corresponding to the predicted features.
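The two steps S1 and S2 can be sketched as a small pipeline. This is only an illustrative outline: the names `SceneTextExtractor`, `detect_blocks`, `recognize_block`, and `crop` are hypothetical, the image is a plain nested list standing in for a real image type, and the detector and recognizer are passed in as callables rather than implemented.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[List[int]]          # rows of pixel values; a stand-in for a real image type
Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max), hypothetical box format


def crop(image: Image, box: Box) -> Image:
    """Cut the region covered by a block-level text box out of the image."""
    x0, y0, x1, y1 = box
    return [row[x0:x1] for row in image[y0:y1]]


@dataclass
class SceneTextExtractor:
    # S1: coarse-grained detector returning block-level boxes (assumed interface)
    detect_blocks: Callable[[Image], List[Box]]
    # S2: recognizer turning one cropped block image into text (assumed interface)
    recognize_block: Callable[[Image], str]

    def extract(self, image: Image) -> List[Tuple[Box, str]]:
        results = []
        for box in self.detect_blocks(image):
            block_image = crop(image, box)  # text block image handed to S2
            results.append((box, self.recognize_block(block_image)))
        return results


# Usage with dummy components in place of trained models.
image = [[0] * 8 for _ in range(4)]
extractor = SceneTextExtractor(
    detect_blocks=lambda img: [(0, 0, 4, 2), (4, 2, 8, 4)],
    recognize_block=lambda block: f"{len(block[0])}x{len(block)}",
)
results = extractor.extract(image)
```

Keeping the two stages behind plain callables reflects the division of labor described here: the detector only needs to propose block-level regions, and all word and line structure is left to the recognizer.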

[0057] As shown in Figure 1, step S1 is the process of inputting the acquired text image into a pre-trained text block detector so that the text block detector can detect and crop the text image to form a text block image; wherein, the text block detector is trained by a pre-established text block dataset; the text block dataset is generated by a heuristic text block generation method;

[0058] That is, step S1 is the process of obtaining text block images; the process of obtaining text block images is completed by a text block detector, which is trained from a pre-established text block dataset; and the text block dataset is generated by a heuristic text block generation method.

[0059] Specifically, the steps for generating the text block dataset using a heuristic text block generation method include:

[0060] S01: Label text blocks for text detector training on a pre-acquired public benchmark dataset based on words or lines of text; the text block labels include positional information and text information;

[0061] S02: Based on the positional information, sort the samples in the public benchmark dataset by vertical and horizontal position, and generate a minimum bounding rectangle annotation for each original annotation carried by the samples;

[0062] S03: Generate text boxes for each sample from the minimum bounding rectangle annotations to form sample data; wherein, if the intersection-over-union (IoU) of two text boxes in the same sample is greater than a preset text box threshold, the two text boxes are merged into one text box.

[0063] S04: Summarize the sample data with text box and text block annotations into a dataset called the text block dataset.

[0064] More specifically, in this embodiment, to enable the detector to locate text blocks, a text block dataset is first established to train the text block detector. Annotating text blocks manually would incur additional cost; therefore, in this embodiment, text block annotations are generated from a public benchmark dataset. Specifically, a heuristic text block generation method derives text block-level annotations for detector training from the word-level or text-line-level annotations of the public benchmark dataset. First, the minimum bounding rectangle of each word or text-line position annotation is generated, and these rectangles are then combined according to their IoU (Intersection over Union) values. The polygon / quadrilateral position annotations are the original annotations (the original detection boxes) of the public benchmark dataset, and a minimum bounding rectangle can be generated for any polygon / quadrilateral. The steps of the text block generation algorithm are as follows:

(1) Input the word-level or text-line-level data annotations, including position information and text information;

(2) Sort the input data from left to right and from top to bottom according to the horizontal and vertical positions, to ensure the consistency of subsequent text block generation;

(3) Generate minimum bounding rectangle annotations for all quadrilateral / polygon annotations;

(4) For any two text boxes in an image, if the IoU of their bounding rectangles is greater than the threshold, combine the two text boxes into a larger text block; for example, the threshold can be set to 0.2;

(5) Repeat step (4) until there are no overlapping text boxes in the same image.
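The text block generation steps can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names, the `(x_min, y_min, x_max, y_max)` box format, and the choice to keep the merged texts as a list in sorted order are illustrative and not taken from the patent. Step (5) is interpreted here as repeating the merge until no pair of boxes exceeds the IoU threshold.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def bounding_rect(polygon: List[Point]) -> Box:
    """Step (3): minimum axis-aligned bounding rectangle of a polygon."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned rectangles."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def merge(a: Box, b: Box) -> Box:
    """Smallest rectangle that encloses both boxes."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def generate_text_blocks(annotations, threshold: float = 0.2):
    """annotations: list of (polygon, text) pairs at word or text-line level."""
    # Steps (2) and (3): bounding rectangles, sorted top-to-bottom then left-to-right.
    rects = [(bounding_rect(poly), text) for poly, text in annotations]
    rects.sort(key=lambda item: (item[0][1], item[0][0]))
    blocks = [(rect, [text]) for rect, text in rects]
    # Steps (4) and (5): merge pairs above the threshold until none remain.
    merged = True
    while merged:
        merged = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                if iou(blocks[i][0], blocks[j][0]) > threshold:
                    blocks[i] = (merge(blocks[i][0], blocks[j][0]),
                                 blocks[i][1] + blocks[j][1])
                    del blocks[j]
                    merged = True
                    break
            if merged:
                break
    return blocks


annotations = [
    ([(0, 0), (10, 0), (10, 10), (0, 10)], "hello"),
    ([(0, 5), (10, 5), (10, 15), (0, 15)], "world"),
    ([(100, 100), (110, 100), (110, 110), (100, 110)], "far"),
]
blocks = generate_text_blocks(annotations)  # first two boxes merge, third stays alone
```

The scan restarts after every merge because a merged rectangle is larger than its parts and may now exceed the threshold with boxes that were previously too far apart.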

[0065] The step of inputting the acquired text image into a pre-trained text block detector so that the text block detector can detect and crop the text image to form a text block image includes:

[0066] S11: The text image is input into the residual backbone network of the text block detector, which is equipped with a feature pyramid network, to obtain the full-image feature map of the text image; wherein the residual network is ResNet-50;

[0067] S12: The region selection network in the text block detector generates detection boxes for the text image based on the full-image feature map; wherein the region selection network is a Region Proposal Network (RPN);

[0068] S13: The feature network module in the text block detector selects, from the full-image feature map, the block features corresponding to each detection box; wherein the feature network module is a RoI pooling layer;

[0069] S14: The text block detector classifies the detection box based on the block features using the fully connected layer to determine the text boxes of each category, and crops the text image based on the text boxes of each category to generate a text block image.

[0070] Specifically, in this embodiment, to verify the effectiveness and robustness of the block-level framework, a Faster R-CNN model is used as the text block detector, and the detector is trained on the text block dataset constructed with the text block generation algorithm. Training and inference follow the same process, except that training adds a detection-feedback-retraining loop. When the trained text block detector is applied, the input text image is first fed into the ResNet-50 backbone network with a Feature Pyramid Network (FPN) to extract features from the entire image and obtain a full-image feature map. The full-image feature map is then fed into the Region Proposal Network (RPN) to generate detection boxes. Subsequently, the RoI pooling layer selects the block features corresponding to each RoI on the feature map based on the detection boxes output by the RPN and resizes the block features to a fixed dimension. Finally, a fully connected layer classifies the boxes to generate block-level text boxes. The text block images cropped according to the block-level text boxes are then sent to the subsequent text block recognition module.

[0071] Step S2 involves extracting features from the text block image using a pre-trained text block recognizer to obtain a text block feature map, obtaining semantic feature vectors and positional feature vectors of the text block image based on the text block feature map, performing feature fusion and concatenation based on the semantic feature vectors and the positional feature vectors to obtain predicted features, and obtaining the predicted text corresponding to the predicted features.

[0072] Step S2 is performed by the text block recognizer. The difficulty of text block recognition mainly stems from more flexible text arrangements and longer sequence lengths. For text block recognition, an LSTM-based attention module can be used as the encoder-decoder framework. Although attention-based methods offer the flexibility of implicit context guidance, existing models trained on public synthetic datasets struggle to predict blank positions and often fail to perceive word endings or line endings within text blocks. Therefore, in this embodiment, the text block recognizer is trained on a synthetic dataset that includes text block images labeled with contextual and visual labels; that is, contextual labels are integrated into block-level recognition, and a synthetic dataset of 800K text block images, called SynthBlock, is generated to train the text block recognizer. The contextual labels are the End-Of-Word (<eow>) and End-Of-Line (<eol>) markers; labeling text blocks with them makes the model pay more attention to text arrangement. To mitigate attention drift and character loss in long text sequences, a positional attention module is further employed to encode positional cues, and the Aggregation Cross-Entropy (ACE) loss is used as additional counting supervision. The positional attention module transcribes character index information into positional glimpses in parallel and enhances positional cues during decoding. ACE is a dense prediction scheme that performs well on counting problems, interpreting each pixel in the feature map as a probability distribution. The designed recognizer can exploit the rich contextual features of blocks, making it better suited to recognizing varied arrangements and long texts.

[0073] More specifically, when the text block recognizer is trained on the synthetic dataset, there are two types of labels: visual labels and contextual labels. Visual labels are the readable characters in the text, including digits and lowercase letters. For contextual labels, in addition to the commonly used End-Of-Sequence (<eos>) label, extra End-Of-Word (<eow>) and End-Of-Line (<eol>) labels are added. Together, these labels constitute the entire classification space of the output layer.
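As an illustration of this label space, the sketch below builds a target sequence for one text block. The exact placement of the contextual labels is an assumption made here, not something the text specifies: `<eow>` is placed between words on the same line, `<eol>` closes every line, and `<eos>` closes the block. The names `VOCAB`, `block_to_tokens`, and `encode` are hypothetical.

```python
import string
from typing import Dict, List

# Visual labels: digits and lowercase letters, as described above.
VISUAL: List[str] = list(string.digits + string.ascii_lowercase)
# Contextual labels: end of word, end of line, end of sequence.
CONTEXT: List[str] = ["<eow>", "<eol>", "<eos>"]
VOCAB: List[str] = VISUAL + CONTEXT
TOKEN_TO_ID: Dict[str, int] = {tok: i for i, tok in enumerate(VOCAB)}


def block_to_tokens(lines: List[List[str]]) -> List[str]:
    """lines: the text block as a list of lines, each line a list of words."""
    tokens: List[str] = []
    for words in lines:
        for idx, word in enumerate(words):
            # Keep only characters that belong to the visual label set.
            tokens.extend(ch for ch in word.lower() if ch in TOKEN_TO_ID)
            if idx < len(words) - 1:
                tokens.append("<eow>")  # assumed: separates words within a line
        tokens.append("<eol>")          # assumed: closes every line
    tokens.append("<eos>")              # closes the whole block
    return tokens


def encode(tokens: List[str]) -> List[int]:
    """Map tokens to class indices of the output layer."""
    return [TOKEN_TO_ID[tok] for tok in tokens]


tokens = block_to_tokens([["Exit", "12"], ["open"]])
ids = encode(tokens)
```

With this encoding a multi-line block becomes a single sequence, so the decoder can learn where words and lines end instead of relying on the detector to separate them.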

[0074] Furthermore, in this embodiment, during the training of the text block recognizer, the loss function is calculated from a cross-entropy loss combined with character count supervision provided by the aggregation cross-entropy (ACE) loss; wherein the step of calculating the loss function includes:

[0075] Extract the text block feature map from the backbone network;

[0076] Dense prediction is performed based on the text block feature map to obtain prediction parameters, and prediction statistics are obtained based on the prediction parameters.

[0077] The difference parameter between the text prediction generated by the trained text block recognizer and the known labels is calculated using a preset ACE loss function, and the difference parameter is used as a character counting supervision.

[0078] The cross-entropy loss between the text prediction generated by the trained text block recognizer and the known labels is calculated using a preset cross-entropy algorithm, and the loss function is obtained based on the cross-entropy loss and the character count supervision calculation.

[0079] More specifically, the first step is to perform dense prediction on the feature map F extracted from the backbone network, that is, to make a prediction at each pixel position (i, j) of F:

[0080] M = FFN(F)

[0081] M is the prediction result at each pixel position. From M, a statistic for the k-th character class is accumulated over the entire feature map, and this statistic is then normalized.

[0082] Compared to ordinary character recognition tasks, the ACE loss function ignores the order of character appearance and serializes the output to the final string. The ACE loss is defined as follows: it calculates the difference between the predicted distribution and the label distribution.

[0083]

[0084] Where K is the total number of character classes plus a blank symbol. The two quantities in the formula are, for the k-th class, the normalized character occurrence count taken from the label and the normalized dense prediction result over the feature map. The occurrence count can be obtained easily from the original sequence annotation by counting.
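For orientation, the description above matches the commonly published form of the ACE loss. The block below is a sketch under the assumption that the invention uses that standard form; the symbol names $y_k$, $\bar{y}_k$, $N_k$, and $\bar{N}_k$, and the feature map size $H \times W$, are chosen here for illustration and do not come from the patent text.

```latex
% Assumed notation: M_{i,j,k} is the predicted probability of class k at pixel (i,j),
% N_k is the number of occurrences of class k in the label sequence, and the blank
% class is assigned N_blank = HW - (label length) so that both distributions sum to 1.
\begin{aligned}
y_k &= \sum_{i=1}^{H}\sum_{j=1}^{W} M_{i,j,k}, &
\bar{y}_k &= \frac{y_k}{HW}, &
\bar{N}_k &= \frac{N_k}{HW}, \\
L_{ACE} &= -\sum_{k=1}^{K} \bar{N}_k \,\ln \bar{y}_k .
\end{aligned}
```

Because only per-class totals enter the loss, the order of characters plays no role, which is why it acts as counting supervision rather than as a sequence loss.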

[0085] Then, the loss function is calculated, which consists of two parts:

[0086] L = L_CE + λ·L_ACE

[0087] Here, L_CE is the cross-entropy loss between the predicted output and the training labels, L_ACE is the character count supervision calculated using the formula above, and λ is the balance parameter between the two losses.
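A small pure-Python sketch of the combined loss follows. It assumes the standard ACE formulation (normalized class counts compared with normalized summed predictions) and a per-step mean negative log-likelihood for the cross-entropy term. The function names, the flat list-of-lists input format, the clamping constant, and the default `lam=1.0` are all illustrative assumptions; the text gives no value for λ.

```python
import math
from typing import List, Sequence

EPS = 1e-10  # clamp to avoid log(0); an implementation detail assumed here


def ace_loss(dense_probs: Sequence[Sequence[float]],
             label_ids: Sequence[int], blank_id: int) -> float:
    """dense_probs: one class distribution per feature-map position (H*W entries)."""
    positions = len(dense_probs)
    num_classes = len(dense_probs[0])
    if len(label_ids) > positions:
        raise ValueError("label is longer than the number of prediction positions")
    counts = [0.0] * num_classes
    for k in label_ids:
        counts[k] += 1.0
    # Positions not accounted for by any character are counted as blank.
    counts[blank_id] += positions - len(label_ids)
    loss = 0.0
    for k in range(num_classes):
        n_bar = counts[k] / positions
        y_bar = sum(p[k] for p in dense_probs) / positions
        if n_bar > 0:
            loss -= n_bar * math.log(max(y_bar, EPS))
    return loss


def cross_entropy(step_probs: Sequence[Sequence[float]],
                  label_ids: Sequence[int]) -> float:
    """Mean negative log-likelihood of the labels under per-step predictions."""
    nll = [-math.log(max(p[k], EPS)) for p, k in zip(step_probs, label_ids)]
    return sum(nll) / len(label_ids)


def total_loss(step_probs, dense_probs, label_ids, blank_id, lam: float = 1.0) -> float:
    """L = L_CE + lambda * L_ACE."""
    return cross_entropy(step_probs, label_ids) + lam * ace_loss(dense_probs, label_ids, blank_id)


# Toy example: classes 0 = 'a', 1 = 'b', 2 = blank; label "ab"; 4 positions.
dense = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
steps = [[0.5, 0.5, 0.0], [0.5, 0.5, 0.0]]
loss = total_loss(steps, dense, label_ids=[0, 1], blank_id=2, lam=2.0)
```

Note that the ACE term is a cross-entropy between two distributions, so its minimum is the entropy of the label distribution rather than zero, even when the counts match exactly as in the toy example.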

[0088] Finally, whether training of the text block recognizer is complete is determined from the loss function: when the loss falls below the preset loss threshold, the most recently trained text block recognizer is taken as the final text block recognizer.

[0089] After training the text block recognizer, proceed to step S2, where the pre-trained text block recognizer extracts features from the text block image to obtain a text block feature map. The steps for obtaining the semantic feature vector and positional feature vector of the text block image based on the text block feature map include:

[0090] S211: The text block image is used to extract features through the backbone network in the text block recognizer to obtain a text block feature map;

[0091] S212: The semantic feature vector and positional feature vector of the text block image are obtained by the LSTM-based attention module and positional attention module in the text block recognizer based on the text block feature map, respectively.

[0092] The steps of performing feature fusion and concatenation based on the semantic feature vector and the positional feature vector to obtain predicted features, and obtaining the predicted text corresponding to the predicted features, include:

[0093] S221: The semantic feature vector and the positional feature vector are fused by the fusion module in the text block recognizer to obtain a fused feature vector, and the fused feature vector is used as the prediction feature;

[0094] S222: Decode the predicted features using a pre-trained feedforward neural network to output the predicted text.

[0095] Specifically, in step S211, features are extracted from the text block image by the backbone network in the text block recognizer to obtain a text block feature map;

[0096] More specifically, an FPN is used to extract visual features through the ResNet-50 backbone network. To obtain a larger receptive field and to distinguish foreground from background information, in this embodiment two Transformer units are stacked after the ResNet-50. Given a text block image x, let f denote the feature extractor. The extracted text block feature map can be represented as:

[0097]

[0098] Here F denotes the overall text block feature map, and $f_{i,j}$ denotes a subset of F, namely the part of the feature map corresponding to each small text block;

[0099] In step S212, the semantic feature vector and positional feature vector of the text block image are obtained by the LSTM-based attention module and the positional attention module in the text block recognizer based on the text block feature map, respectively.

[0100] Specifically, at each time step t, the LSTM-based attention module first uses the character $y_{t-1}$ predicted at the previous time step and the hidden state $h_{t-1}$ to generate the hidden state $h_t$. $h_t$ then serves as the query vector of the attention module and is combined with the feature map $f_{i,j}$ extracted by the backbone network to estimate an attention map. The attention map is used to compute a weighted sum over the feature map $f_{i,j}$, which yields a semantic feature vector. The calculation process is as follows:

[0101] $h_t = \mathrm{LSTM}(y_{t-1}, h_{t-1})$

[0102]

[0103]

[0104] Here, $y_{t-1}$ and $h_{t-1}$ are the output and hidden state of the LSTM at time step t-1, respectively, and $f_{i,j}$ is the local feature vector at position (i, j) in the feature map F. An attention weight is computed at each position (i, j), and the weighted sum of the local features at time step t is taken as the semantic feature vector. $W_f$, $W_h$, and $W_g$ are trainable parameters.
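The formula images for [0102] and [0103] are not reproduced in this text. The sketch below assumes a standard additive attention form, in which the score at each position is a learned projection of tanh(W_f f + W_h h_t), the scores are normalized with a softmax, and the semantic vector is the resulting weighted sum of local features. The function names and the exact scoring form are assumptions; only the roles of W_f, W_h, W_g, h_t, and f_{i,j} come from the description.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def semantic_vector(features, h_t, w_f, w_h, w_g):
    """Return (attention weights, semantic feature vector) for one time step.

    features: local feature vectors f_ij, flattened over positions (i, j).
    h_t:      LSTM hidden state used as the query.
    w_f, w_h: projection matrices; w_g: projection vector (all assumed shapes).
    """
    proj_h = matvec(w_h, h_t)
    scores = []
    for f in features:
        proj_f = matvec(w_f, f)
        hidden = [math.tanh(a + b) for a, b in zip(proj_f, proj_h)]
        scores.append(sum(w * x for w, x in zip(w_g, hidden)))
    alpha = softmax(scores)
    dim = len(features[0])
    g_t = [sum(a * f[d] for a, f in zip(alpha, features)) for d in range(dim)]
    return alpha, g_t
```

With all weights set to zero every position receives the same score, so the attention is uniform and the semantic vector reduces to the mean of the local features.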

[0105] Positional attention modules play a crucial role in locating character index positions using positional information, especially when dealing with long texts and irregular shapes. For this purpose, a positional attention module is employed that, based on a query paradigm, transcribes character index information in parallel into a positional feature vector $g_p$. The positional information is enhanced through the following process:

[0106]

[0107] Here Q is the position embedding of the character sequence of length T, K and V are feature maps obtained from the backbone network, and C is the number of channels in the feature map. That is, character index information is used as the query, and the feature map is used as the key and value. The positional feature vector $g_p$ is calculated from the above formula.
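The formula image for [0106] is not reproduced in this text. Given the roles described for Q, K, V, and C, a plausible reading is scaled dot-product attention, softmax(Q K^T / sqrt(C)) V; the sketch below assumes that form. The function names are illustrative, and the feature map is assumed to be flattened into a list of C-dimensional vectors.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def positional_attention(queries, keys, values):
    """Assumed scaled dot-product attention for the positional branch.

    queries: T position embeddings (one per character index).
    keys, values: flattened feature map, one C-dimensional vector per pixel.
    Returns T positional feature vectors, computed independently per index.
    """
    channels = len(keys[0])
    scale = math.sqrt(channels)
    out_dim = len(values[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(out_dim)]
        )
    return outputs
```

Each query row is processed independently of the others, which is what allows all T character positions to be computed in parallel rather than step by step.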

[0108] Then, step S221 is performed: the semantic feature vector and the position feature vector are fused by the fusion module in the text block recognizer to obtain a fused feature vector, and the fused feature vector is used as the prediction feature.

[0109] In this step, the semantic features and the positional features are fused dynamically to obtain the fused feature vector:

[0110]

[0111]

[0112] where the result of the above formulas is the fused feature vector (the predicted feature) at time step t;
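The formula images for [0110] and [0111] are not reproduced in this text. One common way to realize dynamic fusion with concatenation is a learned gate: the two vectors are concatenated, passed through a linear layer and a sigmoid, and the gate then mixes them element by element. The sketch below assumes that form; `w_z`, `b_z`, and the function names are hypothetical and may differ from the patent's actual formulas.

```python
import math

def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))

def fuse(semantic, positional, w_z, b_z):
    """Gated fusion of a semantic and a positional feature vector.

    w_z has one row per output dimension and 2 * dim columns, because it is
    applied to the concatenation [semantic; positional].
    """
    concat = semantic + positional  # list concatenation
    gate = [
        sigmoid(sum(w * x for w, x in zip(row, concat)) + b)
        for row, b in zip(w_z, b_z)
    ]
    # Element-wise convex combination controlled by the gate.
    return [z * s + (1.0 - z) * p for z, s, p in zip(gate, semantic, positional)]
```

With zero weights and zero bias the gate is 0.5 everywhere and the fused vector is the plain average of the two inputs; a strongly positive bias pushes the result toward the semantic vector.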

[0113] Finally, step S222 is performed: the predicted features are decoded using a pre-trained feedforward neural network to output the predicted text;

[0114] The predicted features are then fed into an FFN (feed-forward network) for the final prediction, where $y_t$ is the output of the decoding process at time step t:

[0115]

[0116] In this way, after decoding, the predicted text corresponding to the predicted features is obtained, completing the entire process of scene text extraction.

[0117] To verify the effectiveness of the scene text extraction method without fine-grained detection in this embodiment of the invention, the experiments compare it fairly and objectively with other end-to-end methods. The EEM metric is adopted, and the general end-to-end evaluation metric, the F-measure, is modified. EEM selects the best-matching detection result for each ground truth box by finding the detection box that has the largest intersection area with it. After the matching step, a merging step combines the matching sets that share common elements into larger groups. Then, for each matching group, the edit distance between the merged ground truth and the recognition result is calculated.

[0118]
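The matching, merging, and edit-distance steps of this evaluation can be sketched as below. This is a simplified illustration: each ground truth box is matched to exactly one detection, so sets that share a common element are those that share the same detection. The axis-aligned `(x1, y1, x2, y2)` box format, the space used to join ground-truth words, and all function names are assumptions of the sketch, and the final score normalization from [0118] is not reproduced because that formula is missing here.

```python
def intersection_area(a, b):
    """Intersection area of two axis-aligned boxes (x1, y1, x2, y2)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return max(width, 0) * max(height, 0)

def edit_distance(a, b):
    """Levenshtein distance computed with a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def match_and_merge(ground_truths, detections):
    """Group ground truth indices by the detection they overlap most."""
    groups = {}
    for g, (gt_box, _) in enumerate(ground_truths):
        best, best_area = None, 0
        for d, (det_box, _) in enumerate(detections):
            area = intersection_area(gt_box, det_box)
            if area > best_area:
                best, best_area = d, area
        if best is not None:
            groups.setdefault(best, []).append(g)
    return groups

def group_edit_distances(ground_truths, detections):
    """Edit distance between merged ground truth text and recognized text."""
    distances = {}
    for d, members in match_and_merge(ground_truths, detections).items():
        merged_text = " ".join(ground_truths[g][1] for g in members)
        distances[d] = edit_distance(merged_text, detections[d][1])
    return distances
```

For example, two word-level ground truths "hello" and "world" that fall inside one detected block recognized as "hello wor1d" form a single group with edit distance 1.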

[0119] Considering that the F-measure is not applicable to block-level frameworks, the original F-measure is modified into a generalized F-measure: a word is considered correctly located and recognized when it matches a detection block and is recognized accurately. The matching condition is defined as follows, and in the experiments, thr is set to 0.4.

[0120]

[0121] In the experiments, the detection model was trained on the RealBlock dataset; the recognition model was pre-trained on the synthetic datasets Synth90K and SynthText, fine-tuned on SynthBlock, and tested on mainstream end-to-end extraction datasets.

[0122] Table 1 compares the results with and without block-level post-processing. In this experiment, Faster R-CNN and EAST are used as word-level detectors. For the block-level post-processing experiment, block-level results are generated according to the algorithm and then input into the text block recognizer. As shown in Table 1, with the additional text block post-processing, the F-measure of Faster R-CNN and EAST improves by 2.3 and 3.5 percentage points, respectively, and the NS score improves by 2.4 and 1.9 percentage points, respectively. As mentioned earlier, the design of text blocks can mitigate the negative impact of incomplete detection results and provide contextual information to the recognizer.

[0123]

| Method | Block-level post-processing | NS (%) | F-measure (%) |
| --- | --- | --- | --- |
| Faster R-CNN + Rec. | no | 72.7 | 64.7 |
| Faster R-CNN + Block Rec. | yes | 75.1 | 67.0 |
| EAST + Rec. | no | 71.2 | 59.4 |
| EAST + Block Rec. | yes | 73.1 | 62.9 |

[0124] Table 1. Comparison of results with and without block-level post-processing.

[0125] Table 2 compares the performance of the scene text extraction method without fine-grained detection proposed in this embodiment of the invention with previous methods on three specific data subsets, which mainly include ambiguous, dense, and low-quality text instances, respectively. As shown in Table 2, TextBlock has a significant performance improvement over previous methods on the three data subsets. This demonstrates that TextBlock has good robustness and can achieve more effective recognition while avoiding the shortcomings of detection.

[0126]

[0127] Table 2. Performance comparison with previous methods on three specific data subsets.

[0128] Table 3 compares the performance of TextBlock with previous end-to-end extraction frameworks on the end-to-end extraction dataset. Although TextBlock only requires coarse block-level box annotations, it still achieves results comparable to or even better than state-of-the-art methods.

[0129]

[0130]

[0131] Table 3 Performance comparison on the end-to-end extraction dataset ICDAR2015

[0132] As described above, the scene text extraction method without fine-grained detection provided by this invention first inputs the acquired text image into a pre-trained text block detector to detect and crop the text image to form text block images. The text block detector is trained from a pre-established text block dataset, which is generated using a heuristic text block generation method. Then, a pre-trained text block recognizer extracts features from the text block image to obtain a text block feature map. Based on the text block feature map, semantic feature vectors and positional feature vectors of the text block image are obtained. Feature fusion and concatenation are performed based on the semantic feature vectors and the positional feature vectors to obtain predicted features, and the predicted text corresponding to the predicted features is obtained. This unified framework of coarse-grained detection and multi-instance recognition, proposed based on a reflection on the traditional scene text extraction framework combining fine-grained detection and independent instance recognition based on words / characters, reduces the detection burden. Simultaneously, it utilizes rich contextual information for recognition and can train the text block detector by generating a text block-level dataset based on a real dataset using a heuristic text block generation method, achieving high-precision text extraction without fine-grained detection.
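The heuristic text block generation referred to above is spelled out in claim 1 as sorting the annotations by position, taking the minimum bounding box of each original label, and merging boxes whose intersection-over-union exceeds a preset threshold. A minimal sketch of those steps follows. The polygon input format, the repeated pairwise merge loop, the space used to join merged texts, and all function names are assumptions of this sketch rather than details from the patent.

```python
def bounding_rect(polygon):
    """Minimum axis-aligned bounding rectangle of a list of (x, y) points."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return (min(xs), min(ys), max(xs), max(ys))

def iou(a, b):
    """Intersection-over-union of two boxes (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    inter = max(iw, 0) * max(ih, 0)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def generate_text_blocks(annotations, iou_threshold):
    """Build block-level labels from word or line annotations.

    annotations: list of (polygon, text) pairs from the original dataset.
    Returns a list of (box, text) block labels.
    """
    blocks = [(bounding_rect(poly), text) for poly, text in annotations]
    # Sort by vertical position first, then by horizontal position.
    blocks.sort(key=lambda item: (item[0][1], item[0][0]))

    merged = True
    while merged:  # repeat until no pair exceeds the threshold
        merged = False
        for i in range(len(blocks)):
            for j in range(i + 1, len(blocks)):
                (a, text_a), (b, text_b) = blocks[i], blocks[j]
                if iou(a, b) > iou_threshold:
                    box = (min(a[0], b[0]), min(a[1], b[1]),
                           max(a[2], b[2]), max(a[3], b[3]))
                    blocks[i] = (box, text_a + " " + text_b)
                    del blocks[j]
                    merged = True
                    break
            if merged:
                break
    return blocks
```

Sorting before merging keeps the concatenated text in reading order, so each block label carries both the location of the merged box and the text of the words it absorbed.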

[0133] As shown in Figure 2, the present invention also provides a scene text extraction system 100 that does not require fine-grained detection, which implements the scene text extraction method without fine-grained detection described above and includes:

[0134] A text block detector 101 is used to detect and crop text images to form text block images; wherein, the text block detector is trained from a pre-established text block dataset; the text block dataset is generated using a heuristic text block generation method;

[0135] The text block recognizer 102 is used to extract features from the text block image to obtain a text block feature map, obtain semantic feature vectors and positional feature vectors of the text block image based on the text block feature map, perform feature fusion and concatenation based on the semantic feature vectors and the positional feature vectors to obtain predicted features, and obtain predicted text corresponding to the predicted features.

[0136] The text block detector 101 includes:

[0137] The backbone network 1011 of the residual network is used to obtain the full-image feature map of the text image;

[0138] Region selection network 1012 is used to generate the detection bounding box of the text image based on the full-image feature map;

[0139] The feature network module 1013 selects block features corresponding to each block in the full-image feature map based on the box to be detected; wherein, the feature network module is an RoI pooling layer network;

[0140] Fully connected layer 1014 is used to classify the detection box based on the block features to determine the text boxes of each category, and to crop the text image based on the text boxes of each category to generate a text block image.

[0141] The text block recognizer 102 includes:

[0142] Backbone network 1021 is used to extract features from the text block image to obtain a text block feature map;

[0143] The LSTM-based attention module 1022 is used to obtain the semantic feature vector of the text block image based on the text block feature map;

[0144] The position attention module 1023 is used to obtain the position feature vector of the text block image based on the text block feature map;

[0145] The fusion module 1024 is used to perform feature fusion on the semantic feature vector and the position feature vector to obtain a fused feature vector, and use the fused feature vector as a prediction feature;

[0146] Feedforward neural network 1025 is used to decode the predicted features to output predicted text.
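How the two components of the system cooperate can be sketched as a small pipeline: the detector proposes block-level boxes, each box is cropped from the image, and the recognizer turns every crop into text. The sketch treats both components as hypothetical callables and the image as a list of pixel rows; the internal modules 1011 to 1014 and 1021 to 1025 are deliberately not modeled.

```python
def crop(image, box):
    """Crop a region (x1, y1, x2, y2) from an image stored as a list of rows."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]

def extract_scene_text(image, detector, recognizer):
    """Run coarse block detection followed by block-level recognition.

    detector:   hypothetical callable, image -> list of boxes (x1, y1, x2, y2).
    recognizer: hypothetical callable, cropped block image -> predicted text.
    Returns a list of (box, predicted text) pairs, one per detected block.
    """
    results = []
    for box in detector(image):
        block_image = crop(image, box)
        results.append((box, recognizer(block_image)))
    return results
```

Keeping the detector and recognizer behind plain callables mirrors the separation between text block detector 101 and text block recognizer 102, so either side can be replaced without touching the other.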

[0147] For the detailed implementation of the scene text extraction system without fine-grained detection, refer to the description of the relevant steps in the embodiment corresponding to Figure 1; it is not repeated here.

[0148] The present invention provides a method for extracting scene text without requiring fine-grained detection. First, the acquired text image is input into a pre-trained text block detector 101, which detects and crops the text image to form text block images. The text block detector is trained on a pre-established text block dataset, which is generated using a heuristic text block generation method. Then, a pre-trained text block recognizer 102 extracts features from the text block images to obtain text block feature maps. Based on these feature maps, semantic feature vectors and positional feature vectors are obtained. Feature fusion and concatenation are performed based on the semantic and positional feature vectors to obtain predicted features, and the predicted text corresponding to these features is obtained. Reflecting on the traditional scene text extraction framework, which combines fine-grained detection with independent instance recognition based on words or characters, this method proposes a unified framework of coarse-grained detection and multi-instance recognition. This reduces the detection burden and uses rich contextual information for recognition. A text block-level dataset can be generated from a real dataset using the heuristic text block generation method to train the text block detector, achieving high-precision text extraction without fine-grained detection.

[0149] The scene text extraction method, system, and electronic device without fine-grained detection according to the present invention have been described above by way of example with reference to the accompanying drawings. However, those skilled in the art should understand that various modifications can be made to the scene text extraction method, system, and electronic device without fine-grained detection proposed by the present invention without departing from the scope of the invention. Therefore, the scope of protection of the present invention should be determined by the content of the appended claims.

Claims

1. A method for extracting scene text without fine-grained detection, characterized in that it includes: The acquired text image is input into a pre-trained text block detector, which detects and crops the text image to form text block images; wherein the text block detector is trained on a pre-established text block dataset; the text block dataset is generated using a heuristic text block generation method; The process involves labeling text blocks for text block detector training on a pre-acquired public benchmark dataset based on words or lines of text. These text block labels include both location and text information. Based on the location information, the public benchmark data in the dataset is sorted according to vertical and horizontal positions, and a minimum bounding rectangle label is generated for the original labels carried by the public benchmark data. Text boxes for the public benchmark data are generated based on the minimum bounding rectangle label to form sample data. If the intersection-over-union (IoU) of two text boxes in the public benchmark dataset is greater than a preset text box threshold, the two text boxes are merged into one text box. The sample data with text boxes and text block labels are then aggregated into a dataset called the text block dataset. Features are extracted from the text block image by the backbone network in the pre-trained text block recognizer to obtain a text block feature map. Based on the text block feature map, the semantic feature vector of the text block image is obtained by the LSTM-based attention module in the text block recognizer, and the positional feature vector is obtained by the positional attention module in the text block recognizer. Based on the semantic feature vector and the positional feature vector, feature fusion and concatenation are performed to obtain the predicted features. The predicted features are then decoded by the pre-trained feedforward neural network to obtain the predicted text corresponding to the predicted features.

2. The scene text extraction method without fine-grained detection as described in claim 1, characterized in that the step of inputting the acquired text image into a pre-trained text block detector so that the text block detector can detect and crop the text image to form a text block image includes: The text image is input into the backbone network of the residual network of the text block detector through the feature pyramid network to obtain the full-image feature map of the text image; The region selection network in the text block detector generates the detection boxes of the text image based on the full-image feature map; The feature network module in the text block detector selects the block features corresponding to each block in the full-image feature map based on the detection boxes; The text block detector classifies the detection boxes based on the block features using a fully connected layer to determine the text boxes of each category, and then crops the text image based on the text boxes of each category to generate a text block image.

3. The scene text extraction method without fine-grained detection as described in claim 2, characterized in that, Based on the semantic feature vector and the positional feature vector, feature fusion and concatenation are performed to obtain predicted features, including: The semantic feature vector and the positional feature vector are fused by the fusion module in the text block recognizer to obtain a fused feature vector, and the fused feature vector is used as the prediction feature.

4. The scene text extraction method without fine-grained detection as described in claim 3, characterized in that, The text block recognizer is trained on a synthetic dataset, which includes text block images labeled with contextual and visual tags.

5. The scene text extraction method without fine-grained detection as described in claim 2, characterized in that, during the training of the text block recognizer, a loss function is calculated based on character count supervision and cross-entropy loss using aggregated cross-entropy (ACE) loss; wherein the steps for calculating the loss function include: The text block feature map is extracted by the backbone network; Dense prediction is performed based on the text block feature map to obtain prediction parameters, and prediction statistics are obtained based on the prediction parameters; The difference parameter between the text prediction generated by the text block recognizer being trained and the known labels is calculated using a preset ACE loss function, and the difference parameter is used as the character count supervision; The cross-entropy loss between the text prediction generated by the text block recognizer being trained and the known labels is calculated using a preset cross-entropy algorithm, and the loss function is obtained from the cross-entropy loss and the character count supervision.

6. A scene text extraction system without fine-grained detection, implementing the scene text extraction method without fine-grained detection as described in any one of claims 1-5, comprising: A text block detector is used to detect and crop text images to form text block images; wherein the text block detector is trained on a pre-established text block dataset; the text block dataset is generated using a heuristic text block generation method; The process involves labeling text blocks for text block detector training on a pre-acquired public benchmark dataset based on words or lines of text. These text block labels include both location and text information. Based on the location information, the public benchmark data in the dataset is sorted according to vertical and horizontal positions, and a minimum bounding rectangle label is generated for the original labels carried by the public benchmark data. Text boxes for the public benchmark data are generated based on the minimum bounding rectangle label to form sample data. If the intersection-over-union (IoU) of two text boxes in the public benchmark dataset is greater than a preset text box threshold, the two text boxes are merged into one text box. The sample data with text boxes and text block labels are then aggregated into a dataset called the text block dataset. A text block recognizer is configured to extract features from a text block image to obtain a text block feature map, obtain semantic feature vectors and positional feature vectors of the text block image based on the text block feature map, perform feature fusion and concatenation based on the semantic feature vectors and the positional feature vectors to obtain predicted features, and obtain predicted text corresponding to the predicted features; the text block recognizer includes: A backbone network is used to extract features from the text block image to obtain a text block feature map; An LSTM-based attention module is used to obtain the semantic feature vector of the text block image based on the text block feature map; A positional attention module is used to obtain the positional feature vector of the text block image based on the text block feature map; A fusion module is used to perform feature fusion on the semantic feature vector and the positional feature vector to obtain a fused feature vector, and use the fused feature vector as a prediction feature; A feedforward neural network is used to decode the predicted features to output predicted text.

7. The scene text extraction system without fine-grained detection as described in claim 6, characterized in that, The text block detector includes: The backbone network of the residual network is used to obtain the full-image feature map of the text image; A region selection network is used to generate detection boxes for the text image based on the full-image feature map; The feature network module is used to select block features corresponding to each block in the full-image feature map based on the box to be detected; A fully connected layer is used to classify the detection box based on the block features to determine the text boxes of each category, and to crop the text image based on the text boxes of each category to generate a text block image.

Citation Information

Patent Citations

Text recognition model training method and device and text recognition method and device
CN113111871A
