A context-guided and semantic-compensated arbitrary scene text detection method

The method uses context guidance and semantic compensation to reduce erroneous segmentation of text boundary regions. A context-guided feature enhancement module and a high-level semantic information compensation module address two weaknesses of existing algorithms, boundary missegmentation and the loss of high-level semantic information, with the aim of improving both the accuracy and the speed of text detection.

CN120451986B · Active · Publication Date: 2026-06-26 · TIANJIN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: TIANJIN UNIV
Filing Date: 2025-04-25
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing text detection algorithms based on segmentation are prone to missegmentation of text boundary regions when processing natural scene text, especially when processing adjacent text instances, which can easily lead to oversegmentation or undersegmentation. Furthermore, feature pyramid networks ignore the problem that high-level semantic information is gradually submerged by low-level semantic information during the fusion process.

Method used

We adopt a context-guided and semantic compensation approach. By designing a backbone network, a context-guided feature enhancement module (CFEM), and a high-level semantic information compensation module (HSCM), we capture local and global contextual information using convolutional operations and attention mechanisms to enhance text feature representation. Furthermore, we achieve adaptive fusion through resampling techniques and gating mechanisms to compensate for missing high-level semantic information.

Benefits of technology

It effectively reduces erroneous segmentation of text boundary regions, improves the accuracy and speed of text detection, enhances the semantic richness and accuracy of feature representation, and significantly improves the performance of multi-scale text detection.

✦ Generated by Eureka AI based on patent content.

Abstract

The application discloses an arbitrary scene text detection method based on context guidance and semantic compensation, belonging to the technical field of computer vision. The method is proposed to solve the problem of erroneous segmentation of text boundary regions so that text instances in scene images can be located accurately. It mainly comprises a context-guided feature enhancement module and a high-level semantic information compensation module. The context-guided feature enhancement module learns local and global context information by combining convolution and attention, fully modeling complex text features; the high-level semantic compensation module compensates for the high-level semantic information missing from the fused features, enhancing the semantic richness and accuracy of the feature representation.

Description

Technical Field

[0001] This invention relates to the field of computer vision technology, specifically to a method for arbitrary scene text detection based on context guidance and semantic compensation.

Background Technology

[0002] In recent years, with the widespread adoption of modern smartphones, action cameras, drones, and other mobile devices, the number of natural scene images has exploded. Natural scene images typically contain rich textual information, such as traffic signs, sender and recipient information on express delivery slips, product descriptions, and shop sign advertisements. This textual information carries a large amount of semantic information; therefore, understanding and analyzing the textual information in scene images can significantly improve work efficiency in various application scenarios and assist in other computer vision tasks. Extracting textual information from scene images can be divided into two steps: text detection and text recognition. The task of text detection is to locate and detect text regions in the image, while the task of text recognition is to convert the text in the image into a sequence of characters or words that can be processed by a computer. As the first step in text information extraction, the accuracy of text detection results directly affects subsequent text recognition tasks. Therefore, designing a natural scene text detection algorithm with high detection accuracy, fast detection speed, and strong generalization is of significant research value and practical significance.

[0003] Currently, scene text detection technology has been widely applied in multiple fields. In autonomous driving: By using text detection and recognition technology to identify traffic signs, directional signs, license plates, and other markings on the road in real time, autonomous driving systems can make decisions such as adjusting speed, choosing the correct lane, or avoiding hazards according to traffic rules. In real-time translation: When faced with videos, menus, road signs, etc., containing foreign languages, text detection can automatically identify the text information and provide it to a translation system to translate it into the target language, greatly facilitating people's work, life, and cultural communication. In accessibility technology: Portable vision systems capture images of the user's surrounding scene, perform text detection, and extract the text information for voice broadcasting. This technology can greatly facilitate the lives of visually impaired individuals.

[0004] With the rapid development of deep learning technology, scene text detection has also made breakthrough progress. More and more researchers are focusing on this field and have proposed many innovative text detection algorithms. Deep learning-based text detection methods can be divided into regression-based algorithms, segmentation-based algorithms, and contour modeling-based algorithms. Because they place no restriction on text shape and follow a simple pipeline, segmentation-based text detection algorithms have become one of the mainstream approaches in natural scene text detection.

[0005] Inspired by tasks such as semantic segmentation and instance segmentation, researchers have transformed text detection into a pixel-level segmentation problem. To address the difficulty of separating dense text instances, Wang et al. proposed the Progressive Scale Expansion Network (PSENet) in 2019. This network generates a series of multi-scale text kernels for each text instance and progressively expands from the smallest kernel to the entire text region, where the smallest kernel is used to separate adjacent text. However, the progressive scale expansion algorithm makes PSENet's post-processing time-consuming, and PSENet struggles with small text instances. The Pixel Aggregation Network (PAN) predicts similarity vectors between pixels and text kernels, guiding text pixels to cluster towards the correct kernel and achieving accurate instance segmentation. Furthermore, PAN uses depthwise separable convolutions instead of conventional convolutions in multi-scale feature fusion, significantly reducing computation and improving detection speed. In 2020, Liao et al. proposed an efficient and accurate text detection algorithm, the Differentiable Binarization Network (DBNet). DBNet introduces a differentiable binarization function that can be optimized together with the segmentation network during training. The optimized network can adaptively set the binarization threshold, simplifying post-processing while improving text detection performance. However, the above algorithms directly use the coarse-grained boundary annotations provided by the dataset and ignore background noise in the annotations, resulting in inaccurate predicted text boundaries. In 2022, Zhang et al. proposed a segmentation network based on probability maps (Text Detection via Segmentation with Probability Maps, TextPMs). TextPMs uses a sigmoid variant function to map the distance between the boundary and its internal pixels to a probability map, and generates a series of probability maps by adjusting the hyperparameter values to describe the possible probability distributions. In the post-processing stage, a simple region growing algorithm aggregates the probability maps into complete text instances.

[0006] While the aforementioned text detection algorithms have achieved good performance, they still face some challenges. Segmentation-based text detection algorithms often suffer from incorrect segmentation of text boundary regions. Specifically, because scene text lacks clearly defined closed geometric boundaries, segmentation-based algorithms struggle to separate adjacent text instances, leading to over-segmentation or under-segmentation due to excessive refinement or insufficient segmentation when dealing with text boundaries. Furthermore, to detect text of varying sizes, most existing segmentation-based text detection algorithms employ a Feature Pyramid Network (FPN) structure. This FPN gradually fuses features from different levels upwards through a top-down path, allowing the network to utilize both high-level semantic information and low-level detail information. Although this method achieves good performance, it overlooks the issue of high-level semantic information being gradually overwhelmed by low-level semantic information during the fusion process, as well as the semantic differences between features at different levels, resulting in incorrect segmentation of text boundary regions.

[0007] To address the aforementioned issues, this invention proposes an arbitrary scene text detection method based on context guidance and semantic compensation.

Summary of the Invention

[0008] The purpose of this invention is to propose an arbitrary scene text detection method based on context guidance and semantic compensation to solve the problem of erroneous segmentation of text boundary regions, so as to accurately locate text instances in scene images.

[0009] To achieve the above objectives, the present invention adopts the following technical solution:

[0010] An arbitrary scene text detection method based on context guidance and semantic compensation is disclosed. The method is implemented through an arbitrary scene text detection system based on context guidance and semantic compensation. The system includes a backbone network, a context-guided feature enhancement module (CFEM), a high-level semantic compensation module (HSCM), and an output module.

[0011] The method includes the following steps:

[0012] S1. Design a backbone network and use the improved backbone network to extract multi-level features.

[0013] S2. Based on convolution operations and attention mechanisms, design a context attention-guided feature enhancement module (CFEM), and feed the output features of the backbone network into it to capture local and global context and thereby enhance the text feature representation.

[0014] S3. The features output by the Context Attention-Guided Feature Enhancement Module (CFEM) at each level are fused with the features output by the backbone network to obtain multi-scale fused features.

[0015] S4. Design a high-level semantic information compensation module (HSCM), and feed the high-level features output by the context attention-guided feature enhancement module (CFEM) and the multi-scale fused features obtained in S3 into it. Align the features through resampling, and design a gating mechanism that dynamically adjusts the weights of the aligned high-level features and the fused features to achieve adaptive fusion.

[0016] S5. Generate text detection results through the output module to complete the text detection work.

[0017] Preferably, S1 specifically includes the following:

[0018] ResNet-50 is used as the backbone network. The 3×3 convolutions in stages 3 to 5 of the backbone network are replaced with deformable convolutions, and weights pre-trained on ImageNet are loaded during training. The output features of stages 2 to 5 of the backbone network are denoted {F_2, F_3, F_4, F_5}; their resolutions are {1/4, 1/8, 1/16, 1/32} of the input image, and their numbers of channels are {256, 512, 1024, 2048}, respectively. Multi-level features are extracted using the improved backbone network described above.
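As a quick check of the sizes listed in [0018], the following sketch computes the channel count and spatial size of each stage output for a given input size. The helper name `backbone_feature_shapes` and the 640×640 input are illustrative assumptions and do not come from the patent.

```python
# Strides and channel counts of ResNet-50 stages 2 to 5, as listed in [0018].
STAGE_STRIDES = {"F_2": 4, "F_3": 8, "F_4": 16, "F_5": 32}
STAGE_CHANNELS = {"F_2": 256, "F_3": 512, "F_4": 1024, "F_5": 2048}


def backbone_feature_shapes(height, width):
    """Return {name: (channels, height, width)} for each stage output.

    Hypothetical helper: assumes the input size is divisible by 32 so that
    every stage has an exact integer resolution.
    """
    if height % 32 or width % 32:
        raise ValueError("input height and width should be multiples of 32")
    return {
        name: (STAGE_CHANNELS[name], height // stride, width // stride)
        for name, stride in STAGE_STRIDES.items()
    }


shapes = backbone_feature_shapes(640, 640)
for name, shape in shapes.items():
    print(name, shape)
```

For a 640×640 input, F_2 is 160×160 with 256 channels and F_5 is 20×20 with 2048 channels.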

[0019] Preferably, the context attention-guided feature enhancement module (CFEM) includes a wavelet convolution-based shape calibration branch, a global self-attention branch, and a channel attention branch, wherein:

[0020] The wavelet convolution-based shape calibration branch performs pooling operations in the horizontal and vertical directions to capture axial contextual information. By adding the vectors in the horizontal and vertical directions, it achieves rectangular modeling of the text region. The wavelet convolution-based shape calibration branch further designs a shape calibration function to capture local contextual information, thereby calibrating the text region to better fit the text boundary contour. Specifically, this includes the following:

[0021] Use k×1 convolution to adjust the elements in each row so that the modeling area is closer to the shape of the text in the horizontal direction;

[0022] Feature normalization is performed using BN, and nonlinearity is added using the ReLU function;

[0023] The shape calibration function uses a 1×k convolution to calibrate the modeled shape in the vertical direction;

[0024] The calibrated features are mapped to the (0, 1) range using the Sigmoid function to obtain the shape calibration weights W_B. The formula for the shape calibration function is as follows:

[0025]

[0026] where δ denotes the Sigmoid activation function; φ_{k×1} and φ_{1×k} denote k×1 and 1×k convolutions, respectively; β denotes normalization; and ReLU denotes the ReLU activation function;
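Read together, steps [0021] to [0024] and the symbols defined in [0026] suggest the following composition for the shape calibration function. This is a reconstruction under assumptions rather than the patent's own formula: the input symbol X, standing for the sum of the horizontal and vertical pooled vectors, is a placeholder name that does not appear in the source.

```latex
W_B = \delta\left( \varphi_{1\times k}\left( \mathrm{ReLU}\left( \beta\left( \varphi_{k\times 1}(X) \right) \right) \right) \right)
```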

[0027] The global self-attention branch maps the features at each level output by the backbone network through a fully connected layer to obtain three weight matrices; multiplying the features at each level by these weight matrices yields the query matrix Q_i, the key matrix K_i, and the value matrix V_i;

[0028] The similarity between the query matrix Q_i and the key matrix K_i is calculated as the dot product of Q_i and K_i; the dot product result is then scaled by dividing it by the square root of the dimension of the key matrix K_i;

[0029] The attention score for each query is normalized using the Softmax function and converted into a probability distribution. This assigns a weight to each pixel, reflecting its level of attention to other pixels;

[0030] A weighted summation of the value matrix V_i with the normalized attention weight matrix gives the output of the global self-attention branch. The functional representation of the above process is as follows:

[0031]

[0032]

[0033]
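The steps in [0027] to [0030] follow the standard scaled dot-product attention pattern, which can be sketched in plain Python as below. This is a minimal illustration under assumptions: the function names, the toy 3-pixel input, and the identity projection matrices are all hypothetical, and the features are assumed to be flattened to one row per pixel.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(features, w_q, w_k, w_v):
    """Scaled dot-product self-attention over flattened pixel features.

    features: N x C list (one row per pixel); w_q, w_k, w_v: C x D weights.
    """
    q = matmul(features, w_q)
    k = matmul(features, w_k)
    v = matmul(features, w_v)
    scale = math.sqrt(len(k[0]))  # square root of the key dimension
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / scale for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, v), weights


# Toy input: 3 pixels with 2 channels, identity projections (assumed values).
identity = [[1.0, 0.0], [0.0, 1.0]]
pixels = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output, weights = self_attention(pixels, identity, identity, identity)
```

Each row of `weights` is a probability distribution describing how much one pixel attends to every other pixel, and `output` is the corresponding weighted sum of value vectors.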

[0034] The channel attention branch smooths the feature map through a 3×3 convolutional layer;

[0035] Global max pooling and global average pooling operations are performed on the processed feature map in the spatial dimension to compress the features into a one-dimensional vector. Then, after passing through a fully connected layer and an activation function, the information of the two branches is fused by adding pixel by pixel.

[0036] The vector is mapped to the range (0, 1) using the Sigmoid function to obtain the channel attention weights. The channel attention features are obtained by pixel-wise multiplication of the channel attention weights with the input feature map F_i. The functional representation of the above process is as follows:

[0037]

[0038]

[0039]

[0040]

[0041] where GMP_S and GAP_S denote the global max pooling and global average pooling operations in the spatial dimension, respectively; φ_FC and φ_{3×3} denote the fully connected layer and the 3×3 convolutional layer, respectively; and δ denotes the Sigmoid activation function.
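The channel attention steps in [0035] and [0036] can be illustrated with the sketch below. It is a simplified illustration, not the patented implementation: the 3×3 smoothing convolution is omitted, a single fully connected layer is assumed to be shared by both pooled vectors, ReLU is assumed as the activation, and the weights `identity_fc` are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def fully_connected(vector, weight):
    """Apply a bias-free linear layer followed by ReLU (assumed activation)."""
    return [max(0.0, sum(w * v for w, v in zip(row, vector))) for row in weight]


def channel_attention(feature, fc_weight):
    """Reweight the channels of a C x H x W feature map (nested lists)."""
    # Squeeze each channel to one number with global max / average pooling.
    flat = [[v for row in channel for v in row] for channel in feature]
    gmp = [max(values) for values in flat]
    gap = [sum(values) / len(values) for values in flat]
    # Pass both pooled vectors through the FC layer and add them element-wise.
    fused = [a + b for a, b in zip(fully_connected(gmp, fc_weight),
                                   fully_connected(gap, fc_weight))]
    weights = [sigmoid(v) for v in fused]  # one weight in (0, 1) per channel
    # Scale every pixel of channel c by weights[c].
    out = [[[w * v for v in row] for row in channel]
           for w, channel in zip(weights, feature)]
    return out, weights


feature = [
    [[1.0, 2.0], [3.0, 4.0]],   # channel 0
    [[0.0, 0.0], [0.0, 0.0]],   # channel 1
]
identity_fc = [[1.0, 0.0], [0.0, 1.0]]  # hypothetical FC weights
out, weights = channel_attention(feature, identity_fc)
```

In this toy case the all-zero channel receives the neutral weight 0.5, while the channel with strong responses receives a weight close to 1.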

[0042] Preferably, step S2 specifically includes the following:

[0043] The output features {F_3, F_4, F_5} of stages 3 to 5 of the backbone network are fed into the Context-Guided Feature Enhancement Module (CFEM), which captures local and global context through the wavelet convolution-based shape calibration branch, the global self-attention branch, and the channel attention branch. By using contextual information and expanding the receptive field, it enhances the text feature representation and distinguishes text from background regions.

[0044] Preferably, S3 specifically includes the following:

[0045] The features at each level output by the Context Attention-Guided Feature Enhancement Module (CFEM) and the feature F_2 output by the second stage of the backbone network are fused top-down using pixel-level addition, and a 3×3 convolutional layer is applied to smooth the fused features;

[0046] The four levels of features are resized to 1/4 of the original image size using bilinear interpolation. Then, a 3×3 convolutional layer reduces the number of feature channels to 64. Finally, the four levels of features are concatenated along the channel dimension to obtain the multi-scale fused feature F.
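The top-down fusion by pixel-level addition can be sketched on single-channel toy maps as follows. This is only an illustration under assumptions: nearest-neighbour 2× upsampling is used because the source does not state how coarser maps are enlarged inside the top-down path, the 3×3 smoothing convolution and channel reduction are omitted, and all function names are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of an H x W map (assumed method)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add(a, b):
    """Pixel-level addition of two maps with the same size."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_down_fuse(levels):
    """Fuse maps ordered fine to coarse, each half the size of the previous.

    Returns the fused maps in the same order; the coarsest map is unchanged.
    """
    fused = [levels[-1]]
    for fmap in reversed(levels[:-1]):
        fused.insert(0, add(fmap, upsample2x(fused[0])))
    return fused


# Single-channel toy maps standing in for the four feature levels.
level2 = [[1.0] * 8 for _ in range(8)]
level3 = [[1.0] * 4 for _ in range(4)]
level4 = [[1.0] * 2 for _ in range(2)]
level5 = [[1.0]]
fused = top_down_fuse([level2, level3, level4, level5])
```

With four levels reduced to 64 channels each, the concatenated feature F would have 4 × 64 = 256 channels at 1/4 of the input resolution.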

[0047] Preferably, S4 specifically includes the following:

[0048] The numbers of channels of the high-level feature output by the context-guided feature enhancement module (CFEM) and of the fused feature F are unified to 256 using 3×3 convolutional layers, followed by batch normalization and an activation function that adds non-linearity; the high-level feature is then upsampled by bilinear interpolation to the same size as the fused feature F. The functional representation of the above process is:

[0049]

[0050]

[0051] The high-level semantic information compensation module (HSCM) divides the channel dimension into multiple groups and, for each group, performs the calibration and alignment operation and applies the gating mechanism to adaptively fuse the high-level feature and the fused feature F. Specifically, it includes:

[0052] The numbers of channels of the two features are reduced to 128 using 1×1 convolutional layers, and the two are concatenated along the channel dimension; the result is fed into a weight generation block to learn two sets of offsets and two gating masks. The calibration and alignment operation is implemented by resampling: for each spatial coordinate of a feature map, the value at the coordinate shifted by the learned offset replaces the original value, where the value at the shifted coordinate is obtained through bilinear interpolation; the other feature map is calibrated and aligned in the same way. The calibration and alignment process described above is represented by the following function:

[0053]

[0054]

[0055] The calibrated feature map is thus obtained, and the two are adaptively fused using a gated mask. The process function is represented as follows:

[0056]

[0057] The terms in this formula denote the two calibrated feature maps and the pixel-level multiplication operation.
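The resampling-based alignment and gated fusion described in [0052] to [0057] can be sketched on single-channel maps as follows. This is a sketch under stated assumptions: in the patent the offsets and gating masks are learned by a weight generation block, whereas here they are fixed toy values; the fusion is assumed to be a sum of the two gated, calibrated maps; channel grouping is omitted; and all names are hypothetical.

```python
import math


def bilinear_sample(fmap, y, x):
    """Sample an H x W map at a fractional (y, x), clamping to the border."""
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def resample(fmap, offsets):
    """Replace each pixel with the value at its offset position.

    offsets[i][j] is a (dy, dx) pair, standing in for the learned offsets.
    """
    return [[bilinear_sample(fmap, i + offsets[i][j][0], j + offsets[i][j][1])
             for j in range(len(fmap[0]))] for i in range(len(fmap))]


def gated_fusion(high, fused, gate_high, gate_fused):
    """Assumed fusion form: gate_high * high + gate_fused * fused, per pixel."""
    return [[gh * a + gf * b for a, b, gh, gf in zip(ra, rb, rgh, rgf)]
            for ra, rb, rgh, rgf in zip(high, fused, gate_high, gate_fused)]


high = [[0.0, 2.0], [4.0, 6.0]]
low = [[1.0, 1.0], [1.0, 1.0]]
shift_right_half = [[(0.0, 0.5)] * 2 for _ in range(2)]
no_shift = [[(0.0, 0.0)] * 2 for _ in range(2)]
half = [[0.5, 0.5], [0.5, 0.5]]

aligned_high = resample(high, shift_right_half)
aligned_low = resample(low, no_shift)
result = gated_fusion(aligned_high, aligned_low, half, half)
```

Shifting every sampling position half a pixel to the right blends horizontally adjacent values, which is how a learned offset can move a coarse feature into register with a finer one before the gates weigh the two maps.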

[0058] Compared with the prior art, the present invention has the following beneficial effects:

[0059] This invention addresses the problem of erroneous segmentation of text boundary regions in existing scene text detection algorithms based on segmentation principles. It proposes an arbitrary scene text detection method based on context guidance and semantic compensation. The proposed method is implemented using a corresponding text detection system and mainly includes a context-guided feature enhancement module and a high-level semantic information compensation module. The context-guided feature enhancement module learns local and global contextual information by combining convolution and attention, fully modeling complex text features. The high-level semantic compensation module compensates for the missing high-level semantic information in the fused features, thereby enhancing the semantic richness and accuracy of feature representation. Furthermore, extensive experiments on four text detection benchmark datasets demonstrate the effectiveness and advantages of the proposed method.

Attached Figure Description

[0060] Figure 1 is the overall network structure diagram of the arbitrary scene text detection method based on context guidance and semantic compensation described in Embodiment 1 of the present invention;

[0061] Figure 2 is a detailed structural diagram of the Context-Guided Feature Enhancement Module (CFEM) described in Embodiment 1 of the present invention;

[0062] Figure 3 is a detailed structural diagram of the High-Level Semantic Information Compensation Module (HSCM) described in Embodiment 1 of the present invention;

[0063] Figure 4 is a comparison of the visualization results of the method described in Embodiment 2 of the present invention and other methods.

Detailed Implementation

[0064] To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention will be described in further detail below.

[0065] First, the definitions of abbreviations and key terms mentioned in this invention will be explained, specifically as follows:

[0066] PSENet: Progressive Scale Expansion Network

[0067] PAN: Pixel Aggregation Network

[0068] DBNet: Differentiable Binarization Network

[0069] TextPMs: Text Detection via Segmentation with Probability Maps.

[0070] CFEM: Context-guided Feature Enhancement Module

[0071] HSCM: High-level Semantic Compensation Module

[0072] Based on the above, the following explanation will be provided in conjunction with relevant accompanying drawings and specific examples.

[0073] Example 1:

[0074] This invention proposes an arbitrary scene text detection method based on context guidance and semantic compensation, the overall network structure of which is shown in Figure 1. It mainly consists of a backbone network, a context-guided feature enhancement module (CFEM), a high-level semantic information compensation module (HSCM), and an output module.

[0075] Specifically, this invention first uses the ResNet-50 backbone network to extract multi-level features. To enhance the feature extraction capability of the backbone network, the 3×3 convolutions in stages 3 to 5 of the backbone network are replaced with deformable convolutions, and weights pre-trained on ImageNet are loaded during training. The output features of stages 2 to 5 of the backbone network are denoted {F_2, F_3, F_4, F_5}; their resolutions are {1/4, 1/8, 1/16, 1/32} of the input image, and their numbers of channels are {256, 512, 1024, 2048}, respectively. The output features {F_3, F_4, F_5} of stages 3 to 5 of the backbone network are fed into the Context-Guided Feature Enhancement Module (CFEM), which enhances the text feature representation by using contextual information and expanding the receptive field, thus distinguishing text from background regions.

[0076] Subsequently, the features at each level output by CFEM and the feature F_2 output by the second stage of the backbone network are fused top-down using pixel-level addition, and a 3×3 convolutional layer is applied to smooth the fused features.

[0077] Next, the four levels of features are integrated. Specifically, the resolution of all four levels of features is first adjusted to 1/4 of the original image size using bilinear interpolation. Then, a 3×3 convolutional layer is used to reduce the number of feature channels to 64. Finally, the four levels of features are concatenated along the channel dimension to obtain the multi-scale fused feature F.

[0078] The detailed structure of the Context-Guided Feature Enhancement Module (CFEM) is shown in Figure 2. The input to CFEM is the features {F_3, F_4, F_5} obtained from the backbone network at each level. CFEM consists of three branches: a shape calibration branch based on wavelet convolution, a global self-attention branch, and a channel attention branch.

[0079] To concentrate the model's features within the text region, the wavelet convolution-based shape calibration branch performs pooling operations in both the horizontal and vertical directions, capturing axial contextual information in both directions. Summing these two vectors yields a rectangular model of the text region. This branch then applies a shape calibration function to capture local contextual information, thereby calibrating the text region to better fit the text boundary contour. Specifically, a k×1 convolution is first used to adjust the elements of each row, making the modeled region closer to the text shape in the horizontal direction. Then, feature normalization is performed using Batch Normalization (BN), and non-linearity is added using the ReLU function. Subsequently, the shape calibration function uses a 1×k convolution to calibrate the modeled shape in the vertical direction. Finally, the calibrated features are mapped to the (0, 1) range using the Sigmoid function to obtain the shape calibration weights W_B. This method decouples the horizontal and vertical directions, allowing the model to adapt to various shapes. The formula for the shape calibration function is as follows:

[0080]

[0081] where δ denotes the Sigmoid activation function; φ_{k×1} and φ_{1×k} denote k×1 and 1×k convolutions, respectively; β denotes normalization; and ReLU denotes the ReLU activation function. Inspired by the large receptive field wavelet convolution WTConv, which captures feature maps at different scales and further expands the receptive field, the shape calibration branch based on wavelet convolution captures local contextual information at different scales. Through wavelet convolution, the model can simultaneously capture local features at different scales, thereby enhancing the model's understanding of complex images and further expanding the receptive field.
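The rectangular modeling step in [0079], pooling along each axis and summing the two vectors, can be sketched on a single-channel map as follows. Average pooling is an assumption, since the source only says "pooling", and the function name `axial_pool_sum` is hypothetical; the subsequent convolutions, normalization, and wavelet convolution are not modeled here.

```python
def axial_pool_sum(fmap):
    """Pool an H x W map along each axis and add the two vectors.

    Average pooling is an assumption; the source only says "pooling".
    """
    h, w = len(fmap), len(fmap[0])
    row_vec = [sum(row) / w for row in fmap]  # one value per row (H values)
    col_vec = [sum(fmap[i][j] for i in range(h)) / h for j in range(w)]  # W values
    # Broadcast-add: entry (i, j) combines the row-i and column-j summaries.
    return [[row_vec[i] + col_vec[j] for j in range(w)] for i in range(h)]


# A bright 2 x 3 rectangle inside a 4 x 5 map of zeros.
fmap = [[0.0] * 5 for _ in range(4)]
for i in (1, 2):
    for j in (1, 2, 3):
        fmap[i][j] = 1.0

response = axial_pool_sum(fmap)
```

The response is largest where an active row and an active column intersect, which is why the sum of the two axial vectors outlines a rectangle around the text region.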

[0082] In the global self-attention branch, the features at each level output by the backbone network are first mapped through a fully connected layer to obtain three weight matrices. Multiplying each level of features by these weight matrices yields the query matrix Q_i, the key matrix K_i, and the value matrix V_i. Next, the similarity between the query matrix Q_i and the key matrix K_i is calculated as the dot product of Q_i and K_i. To improve numerical stability and ensure the validity of the calculation results, the dot product result is scaled by dividing it by the square root of the dimension of the key matrix K_i. Subsequently, the attention score for each query is normalized using the Softmax function, transforming it into a probability distribution. This assigns a weight to each pixel, reflecting its level of attention to other pixels. Finally, a weighted summation of the value matrix V_i with the normalized attention weight matrix gives the output of the global self-attention branch. The process of generating a global self-attention feature map is expressed by the following formulas:

[0083]

[0084]

[0085]

[0086] In the channel attention branch, the feature map is first smoothed using a 3×3 convolutional layer. Then, global max pooling and global average pooling operations are performed on the processed feature map to compress it into one-dimensional vectors. After passing through a fully connected layer and an activation function, the two vectors are fused by element-wise addition. Finally, the fused vector is mapped to the (0, 1) range using the Sigmoid function to obtain the channel attention weights. The channel attention features are obtained by pixel-wise multiplication of the channel attention weights with the input feature map F_i. The process of generating the channel attention feature map can be represented as follows:

[0087]

[0088]

[0089]

[0090]

[0091] where GMP_S and GAP_S denote the global max pooling and global average pooling operations in the spatial dimension, respectively; φ_FC and φ_{3×3} denote the fully connected layer and the 3×3 convolutional layer, respectively; and δ denotes the Sigmoid activation function.

[0092] The detailed structure of the high-level semantic information compensation module is shown in Figure 3. Although some existing algorithms supplement low-level features with high-level features, they neglect the spatial misalignment and semantic gap between features of different levels. To alleviate this problem, this invention proposes a high-level semantic information compensation module (HSCM), which is built on the FPN structure and uses resampling to align the high-level features output by the context-guided feature enhancement module with the fused feature F. This ensures accurate matching of information from the different features. The module then incorporates a gating mechanism to dynamically adjust the weights of the aligned high-level features and the fused features, enabling adaptive fusion. This process compensates for the missing high-level semantic information in the fused features, thereby enhancing the semantic richness and accuracy of the feature representation.

[0093] To reduce computational overhead and improve operational efficiency, the numbers of channels of the high-level feature output by the context-guided feature enhancement module and of the fused feature F are first unified to 256 using 3×3 convolutional layers, followed by batch normalization and an activation function that adds non-linearity. Subsequently, the high-level feature is upsampled by bilinear interpolation to the same size as the fused feature F. This process can be represented by the following formulas:

[0094]

[0095]

[0096] However, due to spatial misalignment and significant representational differences, directly fusing the two does not yield ideal results. Furthermore, because the fused feature F contains rich spatial details while the high-level feature contains more semantic information, fusing the two after channel-level calibration may negatively impact performance. To address this issue, HSCM divides the channel dimension into multiple groups and, for each group, performs the calibration and alignment operation and applies the gating mechanism to adaptively fuse the high-level feature and F. Specifically, the numbers of channels of the two features are first reduced to 128 using 1×1 convolutional layers, and the two are concatenated along the channel dimension. The result is then fed into a weight generation block to learn two sets of offsets and two gating masks. The calibration and alignment operation uses resampling: for each spatial coordinate of a feature map, the value at the coordinate shifted by the learned offset replaces the original value, where the value at the shifted coordinate is obtained through bilinear interpolation. The other feature map is calibrated and aligned in the same way. This calibration process can be represented by the following formulas:

[0097]

[0098]

[0099] The calibrated feature map is thus obtained, and the two are adaptively fused using a gated mask. The process function is represented as follows:

[0100]

[0101] The terms in this formula denote the two calibrated feature maps and the pixel-level multiplication operation.

[0102] Example 2:

[0103] Based on Example 1, but with some differences, the following specific experiments are designed to verify the performance of the arbitrary scene text detection method based on context guidance and semantic compensation proposed in this invention. The specific content is as follows.

[0104] 1. Dataset

[0105] The arbitrary scene text detection method proposed in Example 1, based on context guidance and semantic compensation, was experimentally validated on four benchmark datasets. The datasets used include:

[0106] (1) SynthText. SynthText is an artificially generated dataset containing over 850,000 (858,750) natural scene images. This dataset collects 8 million common English words and synthesizes text images by rendering the text onto natural images using methods such as random transformations. To make the generated text images more realistic, deep learning and semantic segmentation are used to align the geometric shapes of the text with the background image during image synthesis. In this invention, this dataset is used only for pre-training.

[0107] (2) MSRA-TD500. MSRA-TD500 is a multilingual dataset containing Chinese and English, with a total of 500 images, of which 300 are used as training images and the remaining 200 are used as test images. The image resolutions range from 1296×864 to 1920×1280. In addition, since the amount of training data in MSRA-TD500 is relatively small, 400 training images from HUST-TR400 are added during network training.

[0108] (3) CTW1500. CTW1500 is a Chinese arbitrary shape text dataset containing 1000 images for training and 500 images for testing. This dataset is highly diverse, including planar text, convex text, urban street view text, rural street view text, text under low lighting conditions, distant text, partially displayed text, etc.

[0109] (4) Total-Text. The Total-Text dataset contains 1555 images, of which 1255 were used for training and the remaining 300 for testing. The text shapes in the dataset are diverse, including horizontal, slanted, and curved text, with more than half of the images containing text of two or more shapes. Curved text instances account for nearly half of the total text instances. Furthermore, the dataset images contain various font styles, text sizes, and complex backgrounds, which places higher demands on the performance of the text detection model.

[0110] 2. Experimental setup

[0111] The model proposed in this invention is implemented using Python, and the experimental environment is shown in Table 1. The SGD optimizer was used during training. The backbone network used weights from a ResNet-50 model pre-trained on ImageNet, while other network layers were initialized using the Kaiming initialization method. During data preprocessing, images were randomly cropped to a uniform size of 640×640, and data augmentation techniques such as random rotation and flipping were also employed. Pre-training was performed on the synthetic dataset SynthText for 3 epochs with a fixed learning rate of 0.007 and a batch size of 12. After pre-training, fine-tuning was performed on four benchmark datasets for a total of 1500 epochs, using a "poly" strategy with an initial learning rate of 0.007.
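The "poly" learning-rate strategy mentioned above is commonly written as base_lr × (1 − step / total_steps)^power. The following is a minimal sketch of that rule, using the initial learning rate of 0.007 and the 1500 fine-tuning epochs stated above. The exponent `power=0.9` and the choice to step the schedule per epoch rather than per iteration are assumptions; the source specifies neither.

```python
def poly_lr(base_lr, step, total_steps, power=0.9):
    """Polynomial ("poly") decay of the learning rate.

    Assumptions: power=0.9 is a common default, not a value from the source,
    and `step` may count either epochs or iterations.
    """
    if not 0 <= step <= total_steps:
        raise ValueError("step must lie in [0, total_steps]")
    return base_lr * (1.0 - step / total_steps) ** power


# Learning rate at the start, middle, and end of 1500 fine-tuning epochs.
schedule = [poly_lr(0.007, epoch, 1500) for epoch in (0, 750, 1500)]
```

The rate starts at the initial value and decays smoothly to zero at the final step, in contrast to the fixed rate used during pre-training on SynthText.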

[0112] Table 1: Experimental Environment

[0113]

[0114] 3. Experimental Results

[0115] The proposed method was compared with previous scene text detection methods on four benchmark datasets. Experiments were conducted on multi-directional text datasets (MSRA-TD500 and ICDAR2015) and arbitrary-shape text datasets (Total-Text and CTW1500), and the performance of the proposed method was evaluated using four metrics: precision, recall, the harmonic mean of precision and recall (Hmean), and frames per second (FPS). Furthermore, ablation experiments were performed on the CTW1500 dataset to illustrate the function of each component of the proposed method.
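As a small worked example of the Hmean metric, the sketch below computes the harmonic mean of precision and recall, Hmean = 2PR / (P + R). The function name `hmean` is chosen for illustration; the input values are the precision and recall figures quoted in the ablation discussion of this description.

```python
def hmean(precision, recall):
    """Harmonic mean of precision and recall (also called F-measure)."""
    if precision + recall == 0:
        return 0.0  # avoid division by zero when both metrics are zero
    return 2.0 * precision * recall / (precision + recall)


# Values in percent, taken from the ablation discussion (CTW1500).
full_model = hmean(88.71, 85.04)
baseline = hmean(86.90, 80.20)
```

These evaluate to roughly 86.84 and 83.42, close to the 86.83% and 83.40% reported for the full model and the baseline; the small differences presumably come from rounding in the reported figures.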

[0116] Tables 2, 3, 4, and 5 present the quantitative evaluation results of each method on the text datasets. The best performance for each metric is bolded, and the second-best performance is underlined. Tables 2 and 3 show that the proposed method achieves excellent Hmean on the multi-directional text detection datasets. Tables 4 and 5 show the detection results of the proposed method on the arbitrary-shape text datasets, where it achieves the best or second-best results on both curved text datasets. This is because the proposed method exploits local and global contextual information, effectively reducing the false positives and false negatives caused by small text spacing. At the same time, compensating for high-level semantic information in the fused features enhances the network's discriminative ability, enabling it to generate more accurate text contours.

[0117] Table 2: Experimental results on the MSRA-TD500 dataset, with best performance highlighted in bold.

[0118]

[0119] Table 3: Experimental results on the ICDAR2015 dataset, with best performance highlighted in bold.

[0120]

[0121] Table 4: Experimental results on the Total-Text dataset, with best performance highlighted in bold.

[0122]

[0123] Table 5: Experimental results on the CTW1500 dataset, with best performance highlighted in bold.

[0124]

[0125] To demonstrate the effectiveness of the proposed method more intuitively, its detection results were compared visually with those of the classic text detection methods DBNet and DBNet++. The ground-truth labels and the visualization results of each detection method are shown in Figure 4. As Figure 4 clearly shows, the method proposed in this invention can accurately distinguish text from non-text regions and generate text boxes that tightly enclose text instances.

[0126] Table 6 shows the ablation experiments conducted on the CTW1500 dataset for the main modules of the method of this invention. As can be seen from Table 6, the precision, recall, and Hmean of the baseline model were 86.90%, 80.20%, and 83.40%, respectively. When only CFEM was added to the baseline model, Hmean improved by 2.32 percentage points. This result indicates that CFEM, by learning local and global contextual information, fully models complex text features, thus enabling more accurate prediction of text boundary regions. When only HSCM was added to the baseline model, Hmean improved by 1.29 percentage points. This result indicates that the HSCM module effectively compensates for the missing high-level semantic information in the fused features, thereby enhancing the semantic richness and accuracy of the feature representation. After adding both CFEM and HSCM to the baseline model, the model ultimately achieved a precision of 88.71%, a recall of 85.04%, and an Hmean of 86.83%, which are improvements of 1.81, 4.84, and 3.43 percentage points over the corresponding metrics of the baseline model, respectively. The experimental results verify the effectiveness of the Context-Guided Feature Enhancement Module (CFEM) and the High-Level Semantic Information Compensation Module (HSCM) proposed in this invention.

[0127] Table 6: Ablation test results for each module, with best performance highlighted in bold.

[0128]

[0129] Tables 7, 8, and 9 present ablation experiments on the various components of the context attention-guided feature enhancement module. Table 7, based on the CTW1500 dataset, demonstrates the impact of each branch in CFEM on model performance. Table 8 illustrates the rationale for using wavelet convolution, and Table 9, based on the CTW1500 dataset, compares the impact of the convolution kernel size of the shape self-calibration function in CFEM.

[0130] Table 7: Impact of each branch in CFEM on model performance, with the best performance highlighted in bold.

[0131]

[0132] Table 8: The impact of convolution selection on CFEM performance; best performance is highlighted in bold.

[0133]

[0134] Table 9: The impact of kernel size on model performance in the shape self-calibration function; best performance is highlighted in bold.

[0135]

[0136] The above description represents the preferred embodiments of the present invention. It should be noted that those skilled in the art can make various improvements and modifications without departing from the principles of the present invention, and these improvements and modifications should also be considered within the scope of protection of the present invention.

Claims

1. An arbitrary scene text detection method based on context guidance and semantic compensation, characterized in that, the method is implemented through an arbitrary scene text detection system based on context guidance and semantic compensation. The system includes a backbone network, a context-guided feature enhancement module, a high-level semantic information compensation module, and an output module. The method includes the following steps: S1. Design a backbone network and use the improved backbone network to extract multi-level features. S2. Based on convolution operations and attention mechanisms, design a context attention-guided feature enhancement module. The output features of the backbone network are fed into the context attention-guided feature enhancement module to capture local and global contextual information, thereby enhancing text feature representation. The context attention-guided feature enhancement module includes a wavelet-convolution-based shape calibration branch, a global self-attention branch, and a channel attention branch, wherein: the wavelet-convolution-based shape calibration branch performs pooling operations in the horizontal and vertical directions to capture axial contextual information, and adds the horizontal and vertical vectors to achieve rectangular modeling of the text region. The wavelet-convolution-based shape calibration branch further designs a shape calibration function to capture local contextual information, thereby calibrating the modeled region so that it better fits the text boundary contour.
Specifically, this includes the following: a k×1 convolution adjusts the elements in each row so that the modeled region is closer to the shape of the text in the horizontal direction; feature normalization is performed using BN, and nonlinearity is added using the ReLU function; a 1×k convolution calibrates the modeled shape in the vertical direction; the calibrated features are mapped to the range (0, 1) using the Sigmoid function to obtain the shape calibration weights W_B. The formula for the shape calibration function is as follows: where δ denotes the Sigmoid activation function; φ_{k×1} and φ_{1×k} denote the k×1 and 1×k convolutions, respectively; β denotes normalization; and ReLU denotes the ReLU activation function. The global self-attention branch maps the features at each level output by the backbone network through fully connected layers to obtain three weight matrices; the query matrix Q_i, key matrix K_i, and value matrix V_i are obtained by multiplying the features at each level with the weight matrices; the similarity between the query matrix Q_i and the key matrix K_i is calculated as the dot product of Q_i and K_i; the dot product result is scaled by dividing it by the square root of the dimension of the key matrix K_i; the attention scores for each query are normalized using the Softmax function and converted into a probability distribution, which assigns each pixel a weight reflecting its level of attention to other pixels; the normalized attention weight matrix and the value matrix V_i are combined by weighted summation to obtain the output of the global self-attention branch.
The functional representation of the above process is as follows: The channel attention branch smooths the feature map through a 3×3 convolutional layer; global max pooling and global average pooling are performed on the processed feature map in the spatial dimension to compress the features into one-dimensional vectors; each vector then passes through a fully connected layer and an activation function, and the information of the two branches is fused by element-wise addition; the resulting vector is mapped to the range (0, 1) using the Sigmoid function to obtain the channel attention weights; the channel attention weights and the input feature map F_i are multiplied pixel-wise to obtain the channel attention features. The functional representation of the above process is as follows: where GMP_S and GAP_S denote global max pooling and global average pooling in the spatial dimension, respectively; φ_FC and φ_{3×3} denote the fully connected layer and the 3×3 convolutional layer, respectively; and δ denotes the Sigmoid activation function. S3. The features output by the context attention-guided feature enhancement module at each level are fused with the features output by the backbone network to obtain multi-scale fused features. S4. Design a high-level semantic information compensation module. The high-level features output by the context attention-guided feature enhancement module and the multi-scale fused features obtained in S3 are fed into the high-level semantic information compensation module, and the features are aligned by a resampling technique. Furthermore, a gating mechanism is designed to dynamically adjust the weights of the aligned high-level features and the fused features to achieve adaptive fusion. S5. Generate text detection results through the output module to complete the text detection.

2. The arbitrary scene text detection method based on context guidance and semantic compensation according to claim 1, characterized in that, S1 specifically includes the following: ResNet-50 is used as the backbone network. The 3×3 convolutions in stages 3 to 5 of the backbone network are replaced with deformable convolutions, and weights pre-trained on ImageNet are loaded during training. The output features of stages 2 to 5 of the backbone network are denoted {F_2, F_3, F_4, F_5}; the resolutions of the output features at each level are {1/4, 1/8, 1/16, 1/32} of the input image, and the numbers of channels are {256, 512, 1024, 2048}, respectively. Multi-level features are extracted using the improved backbone network described above.

3. The arbitrary scene text detection method based on context guidance and semantic compensation according to claim 2, characterized in that, S2 specifically includes the following: The output features {F_3, F_4, F_5} of stages 3 to 5 of the backbone network are fed into the context-guided feature enhancement module, which captures local and global contextual information through the wavelet-convolution-based shape calibration branch, the global self-attention branch, and the channel attention branch. By using contextual information and expanding the receptive field, the module enhances the text feature representation and distinguishes text regions from background regions.

4. The arbitrary scene text detection method based on context guidance and semantic compensation according to claim 3, characterized in that, S3 specifically includes the following: The features output by the context attention-guided feature enhancement module at each level and the feature F_2 output by the second stage of the backbone network are fused top-down using pixel-wise addition, and a 3×3 convolutional layer is applied to smooth the fused features; the four levels of features are resized to 1/4 of the original image size using bilinear interpolation; a 3×3 convolutional layer then reduces the number of feature channels to 64; finally, the four levels of features are concatenated along the channel dimension to obtain the multi-scale fused feature F.

5. The arbitrary scene text detection method based on context guidance and semantic compensation according to claim 4, characterized in that, S4 specifically includes the following: The high-level feature output by the context-guided feature enhancement module and the fused feature F are unified to 256 channels using 3×3 convolutional layers, normalized by batch normalization, and passed through an activation function to add non-linearity; the high-level feature is then upsampled by bilinear interpolation to the same size as the fused feature F; the function representation of the above process is: The high-level semantic information compensation module divides the channel dimension into multiple groups and, for each group, performs a calibration and alignment operation and applies a gating mechanism to adaptively fuse the high-level feature and F. Specifically, it includes: the number of channels of the high-level feature and of F is reduced to 128 using 1×1 convolutional layers, and the two are joined by channel concatenation; the result is then fed into a weight generation block that learns two sets of offsets and two gating masks. The calibration and alignment operation is implemented using a resampling technique: for each spatial coordinate of a feature map, the learned offset is applied to that coordinate, and the value at the offset coordinate, obtained through bilinear interpolation, replaces the original value at that coordinate; the other feature map is calibrated and aligned in the same way. The calibration and alignment process described above is represented by the following function: The calibrated feature maps are thus obtained, and the two are adaptively fused using the gating masks. This process is expressed as follows: where the calibrated terms denote the calibrated feature maps, and the multiplication operator denotes pixel-wise multiplication.