Method, apparatus and electronic device for recognizing traffic signs

By using a feature map extraction and fusion classification module of a convolutional neural network and actively discarding information using a Gaussian-like operator, the recognition accuracy of traffic signs under complex road conditions is improved, solving the problem of poor recognition performance in existing technologies and enhancing the safety of autonomous driving.

CN116259040B · Active · Publication Date: 2026-06-19 · ZHIDAO NETWORK TECH (BEIJING) CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: ZHIDAO NETWORK TECH (BEIJING) CO LTD
Filing Date: 2023-03-20
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing technologies are ineffective at recognizing partially obscured or poor-quality traffic signs under complex road conditions, resulting in insufficient recognition accuracy and potential safety hazards.

Method used

A traffic sign recognition model based on convolutional neural networks is adopted. Through feature map extraction and fusion classification modules, a Gaussian-like operator is used to actively discard some information to simulate scenarios where traffic signs are incomplete or of poor quality, thereby improving the recognition effect.

Benefits of technology

It effectively improves the recognition accuracy in scenarios where traffic sign images are partially missing or of poor quality, thereby enhancing the safety of autonomous driving.



Abstract

This application relates to a method, apparatus, and electronic device for recognizing traffic signs. The method includes: obtaining an image to be recognized; processing the image to be recognized using a trained traffic sign recognition model to obtain traffic signs; wherein the traffic sign recognition model includes: a feature map extraction module for extracting feature maps from the image to be recognized, and processing the image to be recognized and / or at least some of the feature maps to obtain adjusted feature maps with partially missing information; and a fusion classification module for fusing the feature maps and the adjusted feature maps, and determining the traffic signs based on the fused feature maps and the adjusted feature maps. This application can improve recognition performance for traffic signs whose images are incomplete or partly of poor quality.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to a method, apparatus and electronic device for recognizing traffic signs.

Background Technology

[0002] With the rapid development of computer technology and artificial intelligence technology, artificial intelligence technology is being applied to more and more scenarios, such as intelligent transportation and image recognition.

[0003] Accurate traffic sign recognition is crucial for achieving autonomous driving. For example, autonomous vehicles can change lanes, turn, and limit their speed according to traffic regulations and signs. Related technologies can identify traffic signs through image recognition.

[0004] The applicant found that the relevant technology's ability to recognize traffic signs in certain special scenarios needs improvement. For example, the traffic sign image in the captured image may be partially obscured, or at least some areas of the traffic sign image may be unclear or deformed due to road conditions, resulting in the accuracy of the traffic sign recognition results failing to meet user needs.

Summary of the Invention

[0005] To address or partially address the problems existing in related technologies, this application provides a method, apparatus, and electronic device for recognizing traffic signs, which can effectively improve the recognition effect in scenarios where traffic sign images are partially missing or where some parts of the traffic sign image have poor image quality.

[0006] The first aspect of this application provides a method for recognizing traffic signs, comprising: obtaining an image to be recognized; processing the image to be recognized using a trained traffic sign recognition model to obtain a traffic sign; wherein the traffic sign recognition model includes: a feature map extraction module for extracting feature maps from the image to be recognized, and processing the image to be recognized and / or at least some of the feature maps to obtain an adjusted feature map with partially missing information; and a fusion classification module for fusing the feature map and the adjusted feature map, and determining the traffic sign based on the fused feature map and the adjusted feature map.

[0007] According to certain embodiments of this application, the feature map extraction module includes: a convolutional neural network, including an input layer and at least two convolutional layers connected in series, used to perform convolution operations on the image to be recognized to obtain a feature map; and a graph processing unit, including multiple processing subunits, respectively connected to the input layer or the convolutional layer, used to process the image to be recognized and / or at least some of the feature maps output by the convolutional layers to obtain the image to be recognized and / or the feature map with partially missing information.

[0008] According to certain embodiments of this application, the fusion classification module includes: a first feature fusion unit, used to fuse partially missing information in the image to be identified and / or feature map to obtain an adjusted feature map; and a second feature fusion unit, used to stitch the feature map and the adjusted feature map to determine the traffic sign based on the stitched feature map and the adjusted feature map.

[0009] According to certain embodiments of this application, each of the plurality of processing sub-units corresponds to a convolution kernel, and at least one element of the convolution kernel has a value of zero.

[0010] According to certain embodiments of this application, the elements of the convolution kernel, excluding zero, conform to a Gaussian distribution.
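The kernel described in [0009] and [0010] can be pictured with a minimal sketch. The function names, the choice of which positions to zero, the value of sigma, and the normalization step are all assumptions made for illustration; the application does not specify them.

```python
import math


def gaussian_like_kernel(size=3, sigma=1.0, zero_positions=((0, 0),)):
    """Build a size x size kernel whose non-zero elements follow a 2-D
    Gaussian centered on the kernel, with selected elements forced to zero.
    The zeroed positions and the normalization are assumptions of this sketch."""
    center = (size - 1) / 2.0
    kernel = [
        [math.exp(-((r - center) ** 2 + (c - center) ** 2) / (2.0 * sigma ** 2))
         for c in range(size)]
        for r in range(size)
    ]
    for r, c in zero_positions:
        kernel[r][c] = 0.0  # information at this position is discarded
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]


def apply_kernel(image, kernel):
    """Slide the kernel over a list-of-lists image without padding."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    return [
        [sum(kernel[i][j] * image[r + i][c + j]
             for i in range(k) for j in range(k))
         for c in range(out_w)]
        for r in range(out_h)
    ]
```

Because the zeroed positions contribute nothing to the weighted sum, each output value is computed as if those input pixels were missing, which is one way to read "actively discarding information".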

[0011] According to certain embodiments of this application, the image processing module further includes: an interpolation unit, used to perform bilinear interpolation on the feature map with partially missing information to obtain an image to be identified and / or a feature map of the same size; and a feature fusion unit specifically used to fuse the image to be identified and / or the feature map of the same size to obtain an adjusted feature map.
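As an illustration of the interpolation unit in [0011], the following sketch resizes a single-channel feature map with bilinear interpolation so that maps of different resolutions can be brought to the same size before fusion. The function name and the corner-aligned sampling convention are assumptions of this sketch, not details taken from the application.

```python
def bilinear_resize(fmap, out_h, out_w):
    """Resize a 2-D list-of-lists feature map with bilinear interpolation.
    Corner-aligned sampling is used here as an assumption of the sketch."""
    in_h, in_w = len(fmap), len(fmap[0])
    out = []
    for r in range(out_h):
        # Map the output row back to a fractional input row.
        y = r * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        wy = y - y0
        row = []
        for c in range(out_w):
            # Map the output column back to a fractional input column.
            x = c * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            wx = x - x0
            # Interpolate along x on the two neighboring rows, then along y.
            top = fmap[y0][x0] * (1 - wx) + fmap[y0][x1] * wx
            bottom = fmap[y1][x0] * (1 - wx) + fmap[y1][x1] * wx
            row.append(top * (1 - wy) + bottom * wy)
        out.append(row)
    return out
```

For example, resizing the 2×2 map `[[0.0, 2.0], [4.0, 6.0]]` to 3×3 fills the new center cell with the average of the four corners.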

[0012] According to certain embodiments of this application, the above method further includes: associating traffic sign image data and annotation data to generate sample data; randomly grouping the sample data to obtain training data and test data; training a traffic sign recognition model using the training data, processing the test data using the trained traffic sign recognition model to obtain test results; and comparing the test results with the label data of the test data to determine the accuracy of the recognition results output by the traffic sign recognition model.
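The data handling in [0012] can be sketched as follows. The 80/20 split ratio, the fixed seed, and the function names are assumptions chosen for illustration; model training itself is omitted.

```python
import random


def split_samples(samples, test_fraction=0.2, seed=0):
    """Randomly group (image, annotation) samples into training and test sets.
    The test fraction and seed are illustrative assumptions."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predictions, labels):
    """Fraction of test results that match the annotation of the test data."""
    if not labels:
        return 0.0
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)
```

A trained model would produce `predictions` for the test images; comparing them with the test annotations via `accuracy` gives the figure referred to in the paragraph above.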

[0013] A second aspect of this application provides an apparatus for recognizing traffic signs, comprising: an image acquisition module and an image recognition module. The image acquisition module acquires an image to be recognized; the image recognition module processes the image to be recognized using a trained traffic sign recognition model to obtain a traffic sign; wherein the traffic sign recognition model includes: a feature map extraction module, which extracts feature maps from the image to be recognized and processes the image to be recognized and / or at least some of the feature maps to obtain an adjusted feature map with partially missing information; and a fusion classification module, which fuses the feature map and the adjusted feature map, and determines the traffic sign based on the fused feature map and the adjusted feature map.

[0014] A third aspect of this application provides an electronic device, including: a processor; and a memory having executable code stored thereon, which, when executed by the processor, causes the processor to perform the method described above.

[0015] A fourth aspect of this application also provides a computer-readable storage medium having executable code stored thereon, which, when executed by a processor of an electronic device, causes the processor to perform the above-described method.

[0016] The fifth aspect of this application also provides a computer program product including executable code that, when executed by a processor, implements the above-described method.

[0017] The method, apparatus, and electronic device for recognizing traffic signs provided in this application identify traffic signs by extracting features of target objects in an image to be recognized. When extracting features of the target objects, issues such as incomplete traffic sign images or poor image quality due to occlusion, road conditions, etc., are considered. The traffic sign recognition model actively discards some information from the image to be recognized and / or some information from the feature map when extracting image features, simulating scenarios such as incomplete or poor-quality traffic sign images. This method enables the adjusted feature map extracted by the trained traffic sign recognition model to better handle scenarios with incomplete or poor-quality traffic signs. The embodiments of this application can effectively improve the recognition performance in scenarios where traffic sign images are partially missing or have poor image quality.

[0018] It should be understood that the above general description and the following detailed description are exemplary and explanatory only, and do not limit this application.

Attached Figure Description

[0019] The above and other objects, features and advantages of this application will become more apparent from the more detailed description of exemplary embodiments thereof in conjunction with the accompanying drawings, wherein the same reference numerals generally represent the same components in the exemplary embodiments thereof.

[0020] Figure 1 schematically shows an exemplary system architecture to which the method, apparatus, and electronic device for recognizing traffic signs according to embodiments of this application can be applied;

[0021] Figure 2 schematically shows an application scenario for recognizing traffic signs according to an embodiment of this application;

[0022] Figure 3 schematically shows a flowchart of a method for recognizing traffic signs according to an embodiment of this application;

[0023] Figure 4 schematically shows a topology diagram of a traffic sign recognition model according to an embodiment of this application;

[0024] Figure 5 schematically shows the structure of a feature map extraction module according to an embodiment of this application;

[0025] Figure 6 schematically shows a structural diagram of a traffic sign recognition model according to an embodiment of this application;

[0026] Figure 7 schematically shows the adjustment of the feature map according to an embodiment of this application;

[0027] Figure 8 schematically shows the bilinear interpolation calculation process according to an embodiment of this application;

[0028] Figure 9 schematically shows a block diagram of an apparatus for recognizing traffic signs according to an embodiment of this application;

[0029] Figure 10 schematically shows a block diagram of an electronic device according to an embodiment of this application.

Detailed Implementation

[0030] Embodiments of this application will now be described in more detail with reference to the accompanying drawings. While embodiments of this application are shown in the drawings, it should be understood that this application may be implemented in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided to make this application more thorough and complete, and to fully convey the scope of this application to those skilled in the art.

[0031] The terminology used in this application is for the purpose of describing particular embodiments only and is not intended to limit the application. The terms "comprising," "including," etc., as used herein indicate the presence of features, steps, operations, and / or components, but do not exclude the presence or addition of one or more other features, steps, operations, or components.

[0032] All terms used herein (including technical and scientific terms) have the meanings commonly understood by those skilled in the art, unless otherwise defined. It should be noted that the terms used herein are to be interpreted in a manner consistent with the context of this specification, and not in an idealized or overly rigid way.

[0033] It should be understood that although the terms "first," "second," "third," etc., may be used in this application to describe various information, such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this application, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature. In the description of this application, "multiple" means two or more, unless otherwise explicitly specified.

[0034] With the development of deep learning, image-based deep learning algorithms such as Fully Convolutional Networks (FCNs), FaceNet, and Lightweight Convolutional Networks (such as MobileNet) have emerged. These technologies perform well in recognizing complete and clear traffic sign images from captured images. However, through real-world driving tests, applicants have found that road conditions vary greatly, and traffic sign conditions are also diverse, making it difficult for these technologies to accurately identify partially obscured, incomplete, or poor-quality traffic signs. This necessitates improving the recognition performance of incomplete or poor-quality traffic sign images from captured images. For example, the image of a car or other obstacles may obscure part of the traffic sign image. Uneven road surfaces may cause partial loss or distortion of traffic sign images in captured images. Ruts or aging / damaged traffic signs may result in partial loss or blurring of traffic sign images. In these scenarios, the technologies are prone to providing incorrect traffic sign information, posing safety hazards. Therefore, it is important to identify traffic signs as accurately as possible, such as not outputting incorrect recognition results for traffic signs with missing parts or poor image quality, in order to better improve driving safety.

[0035] For example, in the recognition of ground markings on high-precision maps, there are very complex situations, such as vehicle overlays, road surface aging, and different road materials. However, high-precision data cannot fully cover these complex scenarios, and traffic sign recognition models must be used to simulate various situations.

[0036] To improve recognition performance in various scenarios, related technologies can employ convolutional neural networks to extract image features of targets with different sizes through layer-by-layer abstraction. Then, features from different dimensions are fused using methods such as skip connections. This approach helps achieve better recognition of objects of varying sizes within an image. However, this method still has limitations. For example, experiments show a significant improvement in the quality of the reconstructed image, but it can only effectively recognize relatively complete traffic signs; the recognition performance for incomplete or poor-quality traffic signs needs further improvement.

[0037] In addition, related technologies can also use traditional machine learning algorithms such as support vector machines (SVM), iterative algorithms (such as AdaBoost), and decision trees for recognition. However, these algorithms suffer from problems such as poor recognition performance, low recognition efficiency, and inability to process image data in parallel, which makes it impossible to complete high-precision map image recognition tasks.

[0038] Furthermore, deep learning algorithms trained on image data have emerged one after another. Examples include U-net and semantic segmentation algorithms (such as SegNet and DeepLab). By testing these algorithms on the experimental data provided in this application, the applicant found their recognition performance to be very poor. This is mainly reflected in incomplete or unclear road markings, making accurate identification difficult. This recognition performance cannot meet the accuracy requirements of practical application scenarios, thus necessitating a new recognition scheme suitable for this scenario.

[0039] This application proposes a high-precision map ground marker recognition scheme based on confidence convolution. Image information is extracted using convolution and a Gaussian-like operator, and the extracted multi-layer information is fused before output. This scheme effectively improves recognition accuracy and has already produced good recognition results in engineering applications.

[0040] Specifically, this application utilizes a trained traffic sign recognition model to process the image to be recognized. This model employs a Gaussian-like operator to actively discard some information from the image to be recognized and / or some information from the feature map when extracting image features, simulating scenarios where traffic sign images are incomplete or of poor quality. This allows the adjusted feature map extracted by the trained traffic sign recognition model to better handle scenarios where traffic signs are incomplete or where parts of the image are of poor quality, effectively improving traffic sign recognition performance, especially for application scenarios where parts of the image are missing or where parts of the traffic sign image are of poor quality.

[0041] The method, apparatus, and electronic device for recognizing traffic signs according to embodiments of this application are described in detail below with reference to Figures 1 to 10.

[0042] Figure 1 shows an exemplary system architecture to which the method, apparatus, and electronic device for recognizing traffic signs according to embodiments of this application can be applied. It should be noted that Figure 1 shows merely one example of a system architecture to which the embodiments of this application can be applied, in order to help those skilled in the art understand the technical content of this application; it does not mean that the embodiments of this application cannot be used in other devices, systems, environments, or scenarios.

[0043] Referring to Figure 1, the system architecture 100 according to this embodiment may include mobile platforms 101, 102, and 103, a network 104, and a cloud 105. The network 104 serves as a medium for providing communication links between the mobile platforms 101, 102, and 103 and the cloud 105. The network 104 may include various connection types, such as wired or wireless communication links, or fiber optic cables, etc. Mobile platforms 101, 102, and 103 may be equipped with mobile terminals, such as cameras and LiDAR, to achieve functions such as recognizing traffic signs, recognizing obstacles, and capturing video.

[0044] Users can use mobile platforms 101, 102, and 103 to interact with other mobile platforms and the cloud 105 via network 104 to receive or send information, such as sending model training requests, model parameter download requests, and receiving trained model parameters. Mobile platforms 101, 102, and 103 can have various communication client applications installed, such as driver assistance applications, autonomous driving applications, in-vehicle applications, web browser applications, database applications, search applications, instant messaging tools, email clients, social media platform software, and so on.

[0045] Mobile platforms 101, 102, and 103 include, but are not limited to, electronic devices such as cars, robots, tablets, and laptops that can support functions such as internet access, point cloud data acquisition, video recording, and human-computer interaction.

[0046] Cloud platform 105 can receive model training requests and model parameter download requests, adjust model parameters for model training, distribute model topology structures, and distribute trained model parameters. It can also send road weather information and real-time traffic information to mobile platforms 101, 102, and 103. For example, cloud platform 105 can serve as a backend management server, server cluster, or vehicle network.

[0047] It should be noted that the numbers of mobile platforms, networks, and cloud servers are merely illustrative. Depending on implementation needs, any number of mobile platforms, networks, and cloud servers can be used.

[0048] Figure 2 schematically shows an application scenario for recognizing traffic signs according to an embodiment of this application.

[0049] Referring to Figure 2, the image is a partial view captured by a camera (which can be positioned in a fixed location or on a mobile platform). Related technologies can identify clear road markings in the image relatively well. However, if the road markings are overlaid by tire tracks or partially obscured by other objects, the accuracy of the recognition results cannot meet application requirements. For example, the right-turn sign at position 1 in Figure 2 may be judged as two separate parts due to the influence of ruts, and thus cannot be identified as a right-turn sign. The sign at position 2 in Figure 2 is placed on a colored road surface. The color (grayscale value) of the traffic sign and that of the road surface are quite similar, and the tire ruts superimposed on the sign make it even more difficult to identify correctly. The sign at position 3 in Figure 2 is also very difficult to recognize: in addition to the difficulties at position 2, part of the sign is obscured by a moving vehicle, so the traffic sign image is incomplete.

[0050] In this embodiment, an additional processing procedure for the image to be recognized and / or the feature map is added to the feature map extraction process of the recognition models in related technologies. By actively discarding some information, an adjusted feature map is obtained that simulates complex application scenarios such as those shown in Figure 2, where traffic sign images are partially missing or of poor quality. This approach improves the accuracy of recognition results based on the stitched feature map and the adjusted feature map.

[0051] Figure 3 schematically shows a flowchart of a method for recognizing traffic signs according to an embodiment of this application.

[0052] Referring to Figure 3, this embodiment provides a method for recognizing traffic signs, which includes operations S310 to S320, as detailed below.

[0053] In operation S310, an image to be identified is obtained, which includes at least one traffic sign image.

[0054] In this embodiment, the image to be identified can be a frame from a video captured by a camera device mounted on a mobile platform. The mobile platform includes, but is not limited to, any one of the following: a vehicle, a robot, a ship, or an aircraft. For example, the image to be identified can be captured by a camera device mounted on a vehicle. For example, the camera device can be a dashcam, etc.

[0055] The imaging device can be a monocular imaging device. Alternatively, a binocular imaging device can be used, where two images to be identified captured by the binocular imaging device can be fused together, and then traffic sign recognition can be performed on the stitched image.

[0056] The image to be recognized may include, but is not limited to, traffic sign images, such as turn signs, U-turn signs, and road signs. The image to be recognized may include at least a portion of the image of the mobile platform itself, or may not include an image of the mobile platform itself. In addition, it may include various man-made objects and non-man-made objects, such as buildings, vehicles, pedestrians, and trees.

[0057] In operation S320, the trained traffic sign recognition model is used to process the image to be recognized to obtain the traffic signs.

[0058] In this embodiment, the traffic sign recognition model can be a pre-trained model capable of determining whether the input image to be recognized contains a traffic sign image, or segmenting the traffic sign image from the image to be recognized. The traffic sign recognition model can be various types of neural networks, etc.

[0059] Figure 4 schematically shows a topology diagram of a traffic sign recognition model according to an embodiment of this application. Referring to Figure 4, the traffic sign recognition model includes a feature map extraction module and a fusion classification module.

[0060] The feature map extraction module is used to extract feature maps from the image to be recognized. Furthermore, the feature map extraction module can also be used to process the image to be recognized and / or at least some of the feature maps to obtain adjusted feature maps with missing information. For example, image information from some regions of the image to be recognized can be discarded according to preset rules and algorithms. Similarly, feature information from at least some of the feature maps can be discarded according to preset rules and algorithms.

[0061] Figure 5 schematically shows the structure of a feature map extraction module according to an embodiment of this application.

[0062] Referring to Figure 5, the feature map extraction module may include a convolutional neural network, comprising an input layer and at least two convolutional layers connected in series, used to perform convolution operations on the image to be recognized to obtain a feature map. The convolutional neural network may include at least one convolutional pair, and each convolutional pair may include a convolutional layer paired with a pooling layer.

[0063] In addition, the feature map extraction module may also include a graph processing unit. This graph processing unit comprises multiple processing subunits, each connected to the input layer or convolutional layer, for processing the image to be recognized and / or at least part of the feature maps output by the convolutional layers, to obtain a partially missing image to be recognized and / or feature maps. For example, each convolutional layer may have a unique corresponding processing subunit for processing the feature maps output by that convolutional layer to obtain a partially missing image to be recognized and / or feature maps.

[0064] It should be noted that the feature map extraction module can also be a more complex network structure. For example, the feature map extraction module can employ an encoding module, where the encoding blocks output feature maps in at least two dimensions. For instance, the encoding module can include multiple cascaded encoding blocks, starting with the encoding block of the input image to be recognized, with the dimensions of the features extracted by each encoding block ranging from low to high levels.

[0065] In deep learning, high-level coding blocks have larger receptive fields and stronger feature map representation capabilities, but their feature map resolution is low, resulting in weak representation capabilities of geometric information (lacking spatial geometric details). Low-level coding blocks have smaller receptive fields and stronger representation capabilities of geometric details; although their resolution is high, their feature map representation capabilities are weak. High-level feature maps can help accurately identify or segment targets. Therefore, fusing features from at least some different dimensions in deep learning can effectively improve recognition and segmentation performance.

[0066] Referring again to Figure 4, the fusion classification module is used to fuse feature maps and adjusted feature maps, and to determine traffic signs based on the fused feature maps. For example, the fusion classification module can fuse feature maps from different channels and perform classification based on the fused feature maps. Fusion operations include, but are not limited to, addition (add) or concatenation (concat). Addition is equivalent to fusing information from corresponding channels, while concatenation is equivalent to fusing information from all channels together (using convolutional kernels). Addition has lower computational cost than concatenation.
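The difference between the two fusion operations in [0066] can be made concrete with a small sketch. Feature maps are represented here as a list of channels, each channel a 2-D list; this representation and the function names are assumptions for illustration only.

```python
def fuse_add(a, b):
    """Element-wise addition: channel i of a is summed with channel i of b,
    so the channel count is unchanged."""
    return [
        [[x + y for x, y in zip(row_a, row_b)]
         for row_a, row_b in zip(ch_a, ch_b)]
        for ch_a, ch_b in zip(a, b)
    ]


def fuse_concat(a, b):
    """Channel concatenation: every channel of both inputs is kept, so the
    channel count is the sum of the two inputs' channel counts."""
    return a + b
```

Concatenation leaves the mixing of channels to a later convolution over the larger channel count, which is consistent with the paragraph's note that addition is the cheaper of the two.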

[0067] For example, a high-dimensional feature map can be gradually restored to the same size as the image to be recognized by upsampling, which can include the segmented traffic signs and recognition results.

[0068] In one specific embodiment, semantic information is extracted from video frames via convolutional layers. Specifically, the convolutional neural network may include multiple convolutional pairs and pooling layers positioned between some adjacent convolutional pairs. For example, each convolutional pair includes an adjacent convolutional layer and an activation layer, and the convolutional kernel size of the convolutional layer may be 3, etc. The feature map padding width may be 1, and the stride may be 1. Furthermore, a normalization layer may be included between the convolutional layer and the activation layer to normalize the extracted features. The pooling layer may have a pooling window (kernel) size of 2, a feature map padding width of 0, and a stride of 2.

[0069] Furthermore, convolutional neural networks can include more or fewer pooling layers. Multiple convolutional pairs and pooling layers can be used to extract features from video frames and output feature maps. For example, convolutional layer parameters: kernel size=3, padding=1, stride=1. Pooling layer parameters: kernel size=2, padding=0, stride=2.

[0070] Here, `padding=1` pads a video frame of resolution X×Y to (X+2)×(Y+2). After convolution with a 3×3 kernel at stride 1, the resolution of the output matrix is X×Y. The above convolutional layer parameter settings therefore ensure that the input image and output matrix of the convolutional layer have the same size.
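The size arithmetic in [0069] and [0070] follows the standard output-size formula for convolution and pooling layers, floor((n + 2·padding − kernel) / stride) + 1. A minimal sketch, with a hypothetical helper name and an example input size of 64 chosen only for illustration:

```python
def output_size(n, kernel, padding, stride):
    """Spatial output size of a convolution or pooling layer along one axis."""
    return (n + 2 * padding - kernel) // stride + 1


# Convolutional layer from the text: kernel size=3, padding=1, stride=1.
conv_out = output_size(64, kernel=3, padding=1, stride=1)

# Pooling layer from the text: kernel size=2, padding=0, stride=2.
pool_out = output_size(conv_out, kernel=2, padding=0, stride=2)
```

With these settings the convolution keeps the size unchanged and the pooling layer halves it.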

[0071] Pooling replaces the network's output at a given location with the overall statistical features of its neighboring outputs. Its advantage is that when the input data undergoes a small shift, most of the outputs remain unchanged after pooling. Pooling can also compress images: larger images reduce processing speed and increase recognition difficulty, and pooling reduces the image size. While reducing the dimensionality of feature maps, pooling retains most of the important information.

[0072] For example, when identifying whether an image contains a turn sign, if the image to be identified contains a polyline and a triangle at one end of that polyline, but the precise location of the turn sign is not required, pooling the pixels of a certain region to obtain the overall statistical features can be very useful. Since the feature map becomes smaller after pooling, a fully connected layer that follows it needs fewer neurons, which saves storage space and improves computational efficiency.

[0073] Currently, the main pooling methods include max pooling, average pooling, and sum pooling. Spatial pooling aggregates different features to obtain a relatively lower dimensionality while avoiding overfitting. Max pooling selects the maximum value of an image region and uses it as the pooled value for that region; for example, 2×2 max pooling selects the largest of four pixels and discards the other three. Average pooling calculates the average value of an image region and uses it as the pooled value for that region. In other words, a spatial neighborhood is defined, and either the largest element within that neighborhood of the feature map is extracted or the average value is taken.
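The pooling methods described above can be illustrated with a small sketch. It assumes a single-channel feature map, a 2×2 window, and a stride of 2; the function name `pool2x2` is hypothetical and is not taken from the application.

```python
import numpy as np


def pool2x2(feature_map: np.ndarray, mode: str = "max") -> np.ndarray:
    """Apply 2x2 pooling with stride 2 to a single-channel feature map."""
    h, w = feature_map.shape
    if h % 2 or w % 2:
        raise ValueError("height and width must be even for 2x2 pooling")
    # Group pixels into non-overlapping 2x2 blocks: blocks[i, a, j, b] = map[2i+a, 2j+b].
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    if mode == "average":
        return blocks.mean(axis=(1, 3))
    if mode == "sum":
        return blocks.sum(axis=(1, 3))
    raise ValueError(f"unknown pooling mode: {mode}")


feature_map = np.array([[1, 2, 5, 6],
                        [3, 4, 7, 8],
                        [9, 10, 13, 14],
                        [11, 12, 15, 16]], dtype=float)

max_pooled = pool2x2(feature_map, "max")
avg_pooled = pool2x2(feature_map, "average")
```

In this example each 2×2 block of the 4×4 map is reduced to a single value, so the pooled maps are 2×2.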

[0074] Pooling operations can gradually reduce the spatial scale of the input representation, decrease the feature dimension, and more controllably reduce the number of parameters and computations in the network. This makes the network invariant to smaller changes, redundancies, and transformations in the input image, helping to achieve maximum scale invariance of the image.

[0075] In some embodiments, the above-mentioned fusion classification module may include: a first feature fusion unit and a second feature fusion unit.

[0076] The first feature fusion unit is used to fuse the image to be identified with partially missing information and / or the feature maps with partially missing information to obtain an adjusted feature map.

[0077] The second feature fusion unit is used to concatenate the feature maps and the adjusted feature maps, and to determine traffic signs based on the concatenated feature maps and adjusted feature maps.

[0078] Figure 6 schematically illustrates the structure of a traffic sign recognition model according to an embodiment of this application.

[0079] Referring to Figure 6, the first feature fusion unit is connected to each sub-processing unit of the image processing unit. After the sub-processing units process the image in the channel by actively discarding some information, they transmit the output results to the first feature fusion unit. The first feature fusion unit fuses the feature maps output by each sub-processing unit and transmits the fusion result to the second feature fusion unit.

[0080] To reduce computational cost, the first feature fusion unit can use the add fusion method (element-wise addition). To prevent the feature maps output by the convolutional layers from overwriting the discarded information in the adjusted feature maps output by the first feature fusion unit, the second feature fusion unit can use the concat fusion method (channel-wise concatenation). Furthermore, using the concat fusion method in the second feature fusion unit increases the number of channels, providing feature maps with more dimensions and improving the accuracy of traffic sign recognition.
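The difference between the two fusion methods can be seen from the resulting shapes. The following is a minimal sketch under assumed shapes (8 channels, 60×100 maps, three sub-processing units); these numbers and all variable names are illustrative and do not come from the application.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical outputs of three sub-processing units, each (channels, height, width).
sub_unit_outputs = [rng.random((8, 60, 100)) for _ in range(3)]
# Hypothetical feature map output by the convolutional layers, same layout.
conv_feature_map = rng.random((8, 60, 100))

# First feature fusion unit: element-wise addition keeps the channel count unchanged.
adjusted_feature_map = np.sum(sub_unit_outputs, axis=0)

# Second feature fusion unit: concatenation along the channel axis keeps both
# inputs intact and doubles the number of channels.
fused = np.concatenate([conv_feature_map, adjusted_feature_map], axis=0)

print(adjusted_feature_map.shape, fused.shape)
```

Because concatenation stores the two inputs in separate channels, the convolutional features cannot overwrite the positions where information was discarded, which is the property the paragraph above relies on.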

[0081] The following provides an example of how to actively discard some information from the image to be processed and / or the feature map.

[0082] In some embodiments, the operation of actively discarding some information can be achieved through a specific convolution algorithm. Specifically, each of the multiple processing sub-units corresponds to a convolution kernel, and at least one element of the convolution kernel has a value of zero. The convolution kernel may employ a Gaussian-like operator. For example, the elements of the convolution kernel, excluding zero, conform to a Gaussian distribution.

[0083] Figure 7 shows a schematic diagram of the adjusted feature map according to an embodiment of this application. This embodiment uses a specific convolution kernel to perform a convolution operation on the image to be recognized and / or the feature map. A weight element in this convolution kernel is set to 0, thus simulating the absence of image information at the position corresponding to the 0-weight element.

[0084] Referring to Figure 7, the following example takes the sub-processing unit that processes the image to be recognized as an example. The image to be recognized contains traffic signs that guide traffic order, such as lane lines. In Figure 7, a section of the lane line is missing at the location of the convolution kernel (the area indicated by the thin dashed line on the lane line), which can easily lead to incorrect identification of the lane line. This embodiment uses a Gaussian-like operator for the convolution operation. This Gaussian-like operator is obtained from a processed Gaussian template, for example by normalizing the Gaussian template, and the weight of the lower right corner of the processed Gaussian template is 0. This allows the model to actively discard some lane line information during training using the corresponding Gaussian-like operator, simulating scenarios where lane lines are partially missing in reality. The trained traffic sign recognition model can then better handle images with partially missing lane lines in real-world environments, resulting in more accurate recognition results.

[0085] The Gaussian template after the above processing can be designed according to a preset information discard ratio. For example, if the confidence level is preset to 0.9, that is, to retain about 90% of the information, a 3×3 convolution kernel can be used, with one weight in the convolution kernel set to 0.
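A Gaussian-like operator of the kind described above can be sketched as follows. This is an illustration under assumptions, not the application's implementation: the standard deviation `sigma`, the choice to normalize after zeroing, and the function name `gaussian_like_kernel` are all assumed here.

```python
import numpy as np


def gaussian_like_kernel(size: int = 3, sigma: float = 1.0,
                         zero_position: tuple = (2, 2)) -> np.ndarray:
    """Build a Gaussian template, set one weight to 0, then normalize to sum 1."""
    axis = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(axis, axis)
    kernel = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2))
    kernel[zero_position] = 0.0  # actively discard information at this position
    return kernel / kernel.sum()  # assumption: normalize after zeroing


# Lower right corner set to 0, as in the Figure 7 example.
kernel = gaussian_like_kernel()

# Fraction of kernel positions that still contribute information.
retained = np.count_nonzero(kernel) / kernel.size
print(kernel.round(3), round(retained, 3))
```

For a 3×3 kernel with one zero weight, 8 of the 9 positions are retained (about 0.89), which is close to the preset level of 0.9 mentioned above.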

[0086] In addition, referring again to Figure 6, since there are multiple sub-processing units, the Gaussian-like operators used by the sub-processing units can be the same or different. For example, in Figure 6, the Gaussian-like operator used in all sub-processing units may have a weight of 0 for the bottom-left element. As another example, in Figure 6, the weight of the bottom-left element of the Gaussian-like operator used in the first sub-processing unit from left to right is 0, and the weight of the middle element of the Gaussian-like operator used in the second sub-processing unit is 0.

[0087] It should be noted that the sub-processing unit uses a Gaussian-like convolution operation to actively discard some information, which also helps to remove at least some noise information in the current step and further improves the quality of the adjusted feature map.

[0088] In addition, the fusion classification module may also include a decoder section. The decoder section performs upsampling operations. For example, if the encoder input image is 480×800, a downsampling operation is performed at each layer, doubling the number of channels and reducing the image's width and height to half their original values. Upsampling is the reverse of downsampling.

[0089] Specifically, the decoder's task is to semantically map the discriminative features (feature maps, which have a lower resolution) learned by the encoder to the pixel space (a higher resolution) to obtain dense classification.

[0090] The decoder can have a complex or simple structure. For example, a simple decoder might include a classification head (such as MLP) followed by an activation layer (such as softmax). The convolutional neural network and graph processing unit described above are used to extract features, acting as the encoder. The decoder can then decode the feature map into the desired segmentation and / or classification results.

[0091] For example, the decoder upsamples its lower-resolution input feature map. This can be achieved by using a pooling index computed in the corresponding encoder's max-pooling step to perform non-linear upsampling. This method eliminates the need to learn the upsampling process. The upsampled feature map is sparse, so a trainable convolutional kernel is then used to generate a denser feature map.
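The index-based upsampling described above can be sketched with a 2×2 max pooling that records where each maximum came from, followed by an unpooling step that writes each value back to that position. This is a simplified single-channel illustration; the function names are hypothetical, and the trainable convolution that densifies the sparse result is omitted.

```python
import numpy as np


def max_pool_with_indices(x: np.ndarray):
    """2x2 max pooling (stride 2) that also returns the in-block argmax index."""
    h, w = x.shape
    blocks = (x.reshape(h // 2, 2, w // 2, 2)
               .transpose(0, 2, 1, 3)
               .reshape(h // 2, w // 2, 4))
    indices = blocks.argmax(axis=2)  # index 2*a + b inside each 2x2 block
    return blocks.max(axis=2), indices


def max_unpool(pooled: np.ndarray, indices: np.ndarray) -> np.ndarray:
    """Place each pooled value back at its recorded position; other entries are 0."""
    ph, pw = pooled.shape
    blocks = np.zeros((ph, pw, 4), dtype=pooled.dtype)
    rows, cols = np.indices((ph, pw))
    blocks[rows, cols, indices] = pooled
    return (blocks.reshape(ph, pw, 2, 2)
                  .transpose(0, 2, 1, 3)
                  .reshape(ph * 2, pw * 2))


x = np.array([[1, 9, 2, 3],
              [4, 5, 8, 6],
              [7, 1, 0, 2],
              [3, 2, 5, 4]])
pooled, indices = max_pool_with_indices(x)
upsampled = max_unpool(pooled, indices)
```

The upsampled map is sparse: only the positions of the original maxima are non-zero, which is why a convolution is applied afterwards to produce a denser feature map.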

[0092] For example, the fusion classification module can consist of a classification layer following the decoding network after the second feature fusion unit. The encoding network can consist of multiple convolutional layers, with each encoder layer corresponding to a decoder layer. A multi-class softmax classifier is then applied to the decoder network output to generate class probabilities for each pixel.
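The per-pixel classification step can be sketched as a softmax taken over the class axis of the decoder output. The shapes below (3 classes, a 4×5 map) and the name `pixelwise_softmax` are assumptions made for illustration only.

```python
import numpy as np


def pixelwise_softmax(logits: np.ndarray) -> np.ndarray:
    """Softmax over the class axis of a (classes, height, width) array."""
    shifted = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)


rng = np.random.default_rng(0)
decoder_output = rng.normal(size=(3, 4, 5))  # hypothetical decoder logits

class_probabilities = pixelwise_softmax(decoder_output)
predicted_classes = class_probabilities.argmax(axis=0)  # one class label per pixel
```

Each pixel receives a probability distribution over the classes, and the most probable class gives the dense segmentation result.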

[0093] For example, the encoding network can use the same convolutional layers as the first 13 layers of VGG16, and use the weight values obtained by training VGG16 on a large dataset as the initial weight values of the encoding network. In order to preserve the high-resolution feature maps output by the deepest layer of the encoder, the encoding network can consist of only these 13 convolutional layers.

[0094] In this embodiment, confidence level (information retention rate) and a Gaussian-like operator are used to extract adjusted feature maps from the image to be identified and the feature map, fitting scenarios where traffic sign image information is missing, effectively improving the recognition performance in scenarios where traffic sign images are partially missing. Furthermore, an appropriate confidence level can prevent the loss of too much information that could lead to incorrect recognition results.

[0095] In some embodiments, since the size of the image to be identified and the feature map may be different, the feature map needs to be processed to make its size consistent with that of the image to be identified, so that fusion can be performed. Specifically, the feature map can be interpolated to make the image to be identified and the feature map have the same size, so that feature fusion can be performed. For example, linear interpolation, bilinear interpolation, and nearest neighbor interpolation algorithms can be used.

[0096] For example, the graph processing unit may further include an interpolation subunit. This interpolation subunit performs bilinear interpolation on the feature map with partially missing information to obtain a target image and / or feature map of the same size. Correspondingly, the first feature fusion unit is specifically used to fuse the target image and / or feature map of the same size to obtain an adjusted feature map.

[0097] The following example illustrates the bilinear interpolation process. Figure 8 shows a schematic diagram of the bilinear interpolation calculation process according to an embodiment of this application.

[0098] The calculation for bilinear interpolation is similar to that of the nearest neighbor method. The difference is that instead of finding the single nearest point, it finds the four nearest points based on the correspondence. Referring to Figure 8, bilinear interpolation involves a total of three linear interpolations in two directions (two along the x-axis and one along the y-axis). As shown in Figure 8, first perform two linear interpolations in the x-direction to obtain two temporary points R1(x, y1) and R2(x, y2). Then perform one linear interpolation in the y-direction to obtain P(x, y). Swapping the order of the two axes (first y, then x) yields the same result.

[0099] The weight of each point is related to the distance between the point to be determined and the diagonal point. For example, the weight of f(Q11) is related to the coordinates of f(Q22), and the weight of f(Q12) is related to the coordinates of f(Q21).
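The three linear interpolations can be written out directly. The sketch below assumes the usual corner naming, Q11 = (x1, y1), Q21 = (x2, y1), Q12 = (x1, y2), and Q22 = (x2, y2); the function name and the sample values are illustrative.

```python
def bilinear_interpolate(x, y, x1, y1, x2, y2, q11, q21, q12, q22):
    """Interpolate f(x, y) from the four corner values of the enclosing cell.

    q11 = f(x1, y1), q21 = f(x2, y1), q12 = f(x1, y2), q22 = f(x2, y2).
    """
    # Two linear interpolations along x give the temporary points R1 and R2.
    r1 = ((x2 - x) * q11 + (x - x1) * q21) / (x2 - x1)  # value at (x, y1)
    r2 = ((x2 - x) * q12 + (x - x1) * q22) / (x2 - x1)  # value at (x, y2)
    # One linear interpolation along y gives the value at P(x, y).
    return ((y2 - y) * r1 + (y - y1) * r2) / (y2 - y1)


# Centre of a unit cell with corner values 0, 10, 20, 30.
center = bilinear_interpolate(0.5, 0.5, 0, 0, 1, 1, q11=0, q21=10, q12=20, q22=30)
# A point closer to Q11 is weighted more heavily towards q11.
near_q11 = bilinear_interpolate(0.25, 0.25, 0, 0, 1, 1, q11=0, q21=10, q12=20, q22=30)
```

Expanding the expression shows that the weight of q11 is (x2 - x)(y2 - y) / ((x2 - x1)(y2 - y1)), that is, it depends on the distance to the diagonal point Q22, as stated above.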

[0100] In this embodiment, bilinear interpolation is used to ensure that the feature map and the image to be identified are the same size without reducing the quality of the feature map, which helps to improve the convenience of feature fusion.

[0101] In some embodiments, object edge information in the image can be further identified, and this object edge information, as edge features, can be fused with the feature map and adjusted feature map of the image to be recognized to enhance the recognition effect of traffic signs. Referring to Figure 2, if a better edge recognition algorithm is used, the edges of the turn signs in circle 1 can be accurately identified, which helps improve the accuracy of traffic sign recognition.

[0102] Specifically, edge features can be extracted by finding the locations in an image where the grayscale intensity changes most strongly. The direction of the strongest grayscale intensity change refers to the gradient direction.

[0103] In some embodiments, the feature map extraction module described above may include: an intensity gradient determination unit, a candidate pixel acquisition unit, and an edge feature determination unit.

[0104] The intensity gradient determination unit is used to determine the gradient magnitude and gradient direction of each pixel in the grayscale image of the image to be recognized.

[0105] The candidate pixel acquisition unit is used to obtain one or more (Top N) candidate pixels with the highest gradient magnitude along the gradient direction.

[0106] The edge feature determination unit is used to use the first type of candidate pixels as edge features and delete the second type of candidate pixels. The gradient magnitude of the first type of candidate pixels is greater than or equal to the upper threshold, the gradient magnitude of the second type of candidate pixels is less than the lower threshold, and the upper threshold is greater than the lower threshold.

[0107] In one specific embodiment, the gradient of each pixel in the image can be obtained by an operator (convolution kernel), such as the Laplacian operator or the Sobel operator. For example, image and video processing libraries in the field of computer vision (such as OpenCV) provide functions that can calculate the nth derivative of each pixel in the image. First, the gradients G_X and G_Y along the horizontal (x) and vertical (y) directions are obtained using the aforementioned convolution kernel. The gradient magnitude G of each pixel can then be calculated from G_X and G_Y, for example as their L2 norm.

[0108] In addition, for simplicity, the infinity norm of G_X and G_Y can be used in place of the L2 norm. When each pixel in the grayscale image is replaced with its gradient value G, the new image has larger values of G where the pixel brightness changes drastically (at the edges).
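The gradient computation can be sketched with the Sobel operator using NumPy alone. This is an illustration under assumptions: zero padding at the image border and the helper name `filter2d` are choices made here, and a library routine would normally be used in practice.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T


def filter2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Slide the kernel over a zero-padded image; output has the input's size."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros(image.shape, dtype=float)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + image.shape[0], j:j + image.shape[1]]
    return out


def gradient(gray: np.ndarray):
    """Return L2 magnitude, infinity-norm magnitude, and direction per pixel."""
    gx = filter2d(gray, SOBEL_X)
    gy = filter2d(gray, SOBEL_Y)
    magnitude_l2 = np.hypot(gx, gy)
    magnitude_inf = np.maximum(np.abs(gx), np.abs(gy))
    direction = np.arctan2(gy, gx)
    return magnitude_l2, magnitude_inf, direction


# Grayscale test image with a vertical step edge between columns 2 and 3.
gray = np.zeros((5, 6))
gray[:, 3:] = 1.0
magnitude_l2, magnitude_inf, direction = gradient(gray)
```

The infinity norm never exceeds the L2 norm and avoids the square root, which is the simplification referred to above.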

[0109] However, the edges in this image may be quite coarse, making it difficult to pinpoint their true locations. To address this issue, more precise edge information is determined based on gradient direction information and the gradient value G. Specifically, the maximum gradient intensity at each pixel is retained, while other values are discarded. For example, the gradient intensities of adjacent pixels along the gradient direction of a specific pixel can be compared with the gradient intensity of that specific pixel to find the top n (e.g., top 1) pixels with the strongest gradient intensities as candidate pixels. Alternatively, pixels other than the top n pixels can be deleted, such as by setting their values to zero.

[0110] Even after the above processing, some noise may still exist in the image. This noise can be removed using a dual-threshold method, while avoiding the accidental deletion of edge pixels. Specifically, an upper threshold and a lower threshold are set. If the gradient intensity of a pixel in the image is greater than the upper threshold, it is considered an edge (also called a strong edge); if the gradient intensity of a pixel in the image is less than the lower threshold, it is considered not an edge.

[0111] In some embodiments, considering that the gradient intensity of some pixels in the image is between an upper threshold and a lower threshold, these may include some edge pixels. This application determines whether a pixel is an edge pixel by judging whether the line formed by these pixels is connected to the line formed by strong edge pixels.

[0112] Specifically, the feature map extraction module may further include a weak edge determination unit. This weak edge determination unit is used to use third-class candidate pixels as edge features, wherein the gradient magnitude of the third-class candidate pixels is greater than or equal to the lower threshold, the gradient magnitude of the third-class candidate pixels is less than the upper threshold, and the lines formed by the third-class candidate pixels are connected to the lines formed by the first-class candidate pixels.
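The dual-threshold step with weak-edge recovery can be sketched as follows. The sketch assumes 8-connectivity when deciding whether a weak pixel is connected to a strong edge; the application does not specify the connectivity, and the function name is hypothetical.

```python
from collections import deque

import numpy as np


def hysteresis_threshold(magnitude: np.ndarray, lower: float, upper: float) -> np.ndarray:
    """Keep strong pixels, plus weak pixels connected to a strong pixel."""
    strong = magnitude >= upper                  # first-class candidates
    weak = (magnitude >= lower) & ~strong        # third-class candidates
    edges = strong.copy()                        # pixels below `lower` stay False
    queue = deque(zip(*np.nonzero(strong)))
    height, width = magnitude.shape
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                inside = 0 <= rr < height and 0 <= cc < width
                if inside and weak[rr, cc] and not edges[rr, cc]:
                    edges[rr, cc] = True         # weak pixel joins the edge
                    queue.append((rr, cc))
    return edges


magnitude = np.array([[0.9, 0.5, 0.5, 0.0, 0.5],
                      [0.0, 0.0, 0.0, 0.0, 0.1]])
edges = hysteresis_threshold(magnitude, lower=0.3, upper=0.8)
```

In this example the two weak pixels adjacent to the strong pixel are kept, while the isolated weak pixel in the last column is discarded.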

[0113] In this embodiment, edge information in the image can be effectively extracted. When this edge information is fused with the feature map and adjusted feature map, it is equivalent to redrawing the edges in the feature map, making the edges in the feature map more obvious, which helps to improve the recognition effect of traffic signs.

[0114] In some embodiments, the traffic sign recognition model can be trained in the following manner. Specifically, the method may further include the following operations: first, traffic sign image data and labeled data are associated to generate sample data; then, the sample data is randomly grouped to obtain training data and test data; next, the traffic sign recognition model is trained using the training data, and the trained traffic sign recognition model is used to process the test data to obtain test results; finally, the test results are compared with the labeled data of the test data to determine the accuracy of the recognition results output by the traffic sign recognition model. For example, a recognition model with sufficiently high prediction accuracy can be trained using the backpropagation algorithm.

[0115] Dividing the labeled sample images into training and testing sets according to a preset ratio effectively improves training results. The training set is used for model training, while the testing set is used to test the trained model to ensure that it achieves the expected recognition performance.
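The random grouping and accuracy check can be sketched in a few lines. The 80/20 ratio, the fixed seed, the file names, and the function names below are assumptions for illustration; the application only states that a preset ratio is used.

```python
import random


def split_samples(samples, train_ratio=0.8, seed=0):
    """Randomly group samples into training data and test data."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predictions, labels):
    """Fraction of test results that match the labeled data."""
    correct = sum(p == t for p, t in zip(predictions, labels))
    return correct / len(labels)


# Hypothetical sample data: (image file name, traffic sign class label).
samples = [(f"img_{i}.jpg", i % 3) for i in range(10)]
train_data, test_data = split_samples(samples)

# Hypothetical model outputs compared against the test labels.
example_accuracy = accuracy([0, 1, 2, 1], [0, 1, 1, 1])
```

Every sample ends up in exactly one of the two groups, so the test data remains unseen during training.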

[0116] It should be noted that the technical solution of this application is well applicable to scenarios where a vehicle is moving and the video is captured by a camera. In such scenarios, the vehicle may be moving at high speed, and the position of traffic signs in the captured video frames changes rapidly, requiring the ability to quickly determine the traffic signs from the video frames.

[0117] In embodiments of this application, the shooting device may be a monocular camera, a binocular camera, a tri-lens camera, or more cameras. For example, a multi-lens camera includes multiple cameras with different shooting ranges; exemplarily, a tri-lens camera may include a first camera, a second camera, and a third camera.

[0118] During model training, a separate traffic sign recognition model can be trained for each camera in the tri-lens camera system. For example, a first traffic sign recognition model can be trained for the first camera to recognize images captured by it, a second traffic sign recognition model can be trained for the second camera to recognize images captured by it, and a third traffic sign recognition model can be trained for the third camera to recognize images captured by it. The weighted sum of the outputs of the three traffic sign recognition models is used as the final result. Alternatively, a general traffic sign recognition model can be trained for all cameras in the tri-lens camera system, capable of recognizing images captured by each camera.
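The weighted combination of the three per-camera models can be sketched as follows, assuming each model outputs a class-probability vector for the same traffic sign. The weights, probabilities, and function name are hypothetical; the application does not state how the weights are chosen.

```python
import numpy as np


def fuse_camera_predictions(probabilities, weights) -> np.ndarray:
    """Weighted sum of per-camera class probabilities (weights normalized to 1)."""
    probs = np.asarray(probabilities, dtype=float)  # (num_cameras, num_classes)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    return w @ probs


# Hypothetical outputs of the first, second, and third models for three classes.
per_camera = [[0.7, 0.2, 0.1],
              [0.4, 0.5, 0.1],
              [0.6, 0.3, 0.1]]
fused = fuse_camera_predictions(per_camera, weights=[0.5, 0.3, 0.2])
final_class = int(fused.argmax())
```

Here the second camera alone would favor class 1, but the weighted sum still selects class 0.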

[0119] In the embodiments of this application, corresponding sample images can be obtained according to the traffic sign recognition model that needs to be trained. For example, when the traffic sign recognition model to be trained is a first traffic sign recognition model corresponding to a first camera, the sample image is the image captured by the first camera. When the traffic sign recognition model to be trained is a second traffic sign recognition model corresponding to a second camera, the sample image is the image captured by the second camera. When the traffic sign recognition model to be trained is a third traffic sign recognition model corresponding to a third camera, the sample image is the image captured by the third camera. As another example, when the traffic sign recognition model to be trained is a traffic sign recognition model common to all cameras of a three-lens camera, the sample image is the image captured by all three cameras.

[0120] In some implementations, the acquired sample images can be time-synchronized, that is, the acquired images can be sorted according to the order of their acquisition time. When the sample images are acquired by three cameras, the images acquired by the three cameras at the same time can be regarded as the same group of images, and then multiple groups of images can be sorted according to the order of their acquisition time.

[0121] In some implementations, in order to use the acquired sample images for model training, it is necessary to pre-label the traffic signs in the sample images, such as the location and range of the traffic signs, the color information of the traffic signs, etc. The specific labeling can be selected according to the purpose of model training.

[0122] In some embodiments, the traffic sign recognition model can be trained using a backpropagation algorithm. Specific examples can be found in neural network training methods. During the training of the base model, the required traffic sign location information can be input externally.

[0123] It should be noted that the traffic sign recognition model can be trained offline or online, and can be trained in the cloud. The mobile platform's computing device can download the trained traffic sign recognition model's topology and parameters from the cloud to enable traffic sign recognition locally on the mobile platform. Alternatively, the mobile platform can send video streams to the cloud, allowing the cloud to process the image using the trained traffic sign recognition model, obtain the traffic sign information in the image, and then send (or broadcast) the traffic sign location information to the mobile platform.

[0124] In a specific embodiment of traffic sign recognition, referring to Figure 6, the confidence level is first set to 0.9 (retaining about 90% of the information). Then, a 3×3 Gaussian operator is used, and each time information is extracted using the Gaussian operator, a randomly selected position of the operator is set to 0 (mimicking a scenario where image information is missing). Next, feature maps are extracted using the 3×3 Gaussian operator at layer 1 (input layer), layer 2 (convolutional layer), and layer 3 (convolutional layer), resulting in three feature maps. These three feature maps are then added together, and the result is concatenated with the feature map output by the network's convolutional layers.

[0125] A specific model training process may include the following steps: First, combine roadside vehicle image data and JSON data (such as annotations of traffic sign locations and types) to generate the required sample data. Then, process any non-compliant data to obtain compliant data. Next, randomly group the obtained data into test data and training data, and save these two datasets separately to an MDB database. Then, read the MDB data, parse it into a 480×800×3 matrix, and input it into the network for training to obtain the trained model. Next, use the trained model to make predictions, and compare the prediction results with the actual image labels. The comparison reveals a significant improvement in recognition accuracy, and the model is then tested on a larger range of real-world data.

[0126] In this embodiment, when extracting image features using confidence scores and Gaussian-like operators, partial information from the image to be recognized and / or partial information from the feature map is actively discarded to simulate a scenario where traffic sign images are incomplete. This method enables the adjusted feature map extracted by the trained traffic sign recognition model to better handle scenarios with incomplete traffic signs or partially poor image quality, thus making it applicable to various complex scenarios. Furthermore, using a shallower neural network helps reduce network complexity, improve response speed, and reduce computational resource consumption.

[0127] Another aspect of this application provides a device for recognizing traffic signs.

[0128] Figure 9 schematically shows a block diagram of a device for recognizing traffic signs according to an embodiment of this application.

[0129] Referring to Figure 9, the traffic sign recognition device 900 may include an image acquisition module 910 and an image recognition module 920.

[0130] The image acquisition module 910 is used to acquire the image to be recognized.

[0131] The image recognition module 920 is used to process the image to be recognized using a trained traffic sign recognition model to obtain traffic signs. The traffic sign recognition model may include a feature map extraction module and a fusion classification module.

[0132] The feature map extraction module is used to extract feature maps from the image to be identified, and to process the image to be identified and / or at least some of the feature maps to obtain an adjusted feature map with some missing information.

[0133] The fusion classification module is used to fuse the feature map and the adjusted feature map, and to determine traffic signs based on the fused feature map and the adjusted feature map.

[0134] In some embodiments, the feature map extraction module includes a convolutional neural network and a graph processing unit.

[0135] A convolutional neural network consists of an input layer and at least two convolutional layers connected in series, used to perform convolution operations on the image to be recognized to obtain a feature map.

[0136] The image processing unit includes multiple processing sub-units, which are connected to the input layer or convolutional layer respectively, and are used to process the image to be recognized and / or at least part of the feature map output by the convolutional layer to obtain the image to be recognized and / or the feature map with missing information.

[0137] In some embodiments, each of the multiple processing sub-units corresponds to a convolution kernel, and at least one element of the convolution kernel has a value of zero.

[0138] In some embodiments, the elements of the convolution kernel, excluding zero, follow a Gaussian distribution.

[0139] Regarding the apparatus 900 in the above embodiments, the specific manner in which each module and unit performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated further here.

[0140] Another aspect of this application provides an electronic device.

[0141] Figure 10 schematically shows a block diagram of an electronic device according to an embodiment of this application.

[0142] Referring to Figure 10, the electronic device 1000 includes a memory 1010 and a processor 1020.

[0143] The processor 1020 can be a Central Processing Unit (CPU), or other general-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor can be a microprocessor or any conventional processor.

[0144] Memory 1010 may include various types of storage units, such as system memory, read-only memory (ROM), and permanent storage devices. ROM may store static data or instructions required by processor 1020 or other modules of the computer. A permanent storage device may be a read-write storage device, and may be a non-volatile storage device that retains stored instructions and data even when the computer is powered off. In some embodiments, a mass storage device (e.g., magnetic or optical disk, flash memory) is used as the permanent storage device. In other embodiments, the permanent storage device may be a removable storage device (e.g., floppy disk, optical drive). System memory may be a read-write storage device or a volatile read-write storage device, such as dynamic random access memory. System memory may store some or all of the instructions and data required by the processor during operation. Furthermore, memory 1010 may include any combination of computer-readable storage media, including various types of semiconductor memory chips (e.g., DRAM, SRAM, SDRAM, flash memory, programmable read-only memory); magnetic disks and / or optical disks may also be used. In some embodiments, the memory 1010 may include a removable storage device that is readable and / or writable, such as a compact disc (CD), a read-only digital versatile disc (e.g., DVD-ROM, dual-layer DVD-ROM), a read-only Blu-ray disc, an ultra-high density optical disc, a flash memory card (e.g., SD card, mini SD card, Micro-SD card, etc.), a magnetic floppy disk, etc. Computer-readable storage media do not include carrier waves or transient electronic signals transmitted wirelessly or via wired connections.

[0145] The memory 1010 stores executable code, which, when executed by the processor 1020, can cause the processor 1020 to perform part or all of the methods described above.

[0146] Furthermore, the method according to this application can also be implemented as a computer program or computer program product, which includes computer program code instructions for performing some or all of the steps in the method described above.

[0147] Alternatively, this application may be implemented as a computer-readable storage medium (or a non-transitory machine-readable storage medium or a machine-readable storage medium) storing executable code (or computer program or computer instruction code) thereon, which, when executed by a processor of an electronic device (or server, etc.), causes the processor to perform part or all of the steps of the methods described above according to this application.

[0148] The various embodiments of this application have been described above. These descriptions are exemplary and not exhaustive, nor are they limited to the disclosed embodiments. Many modifications and variations will be apparent to those skilled in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen to best explain the principles, practical application, or improvement of the technology in the market, or to enable others skilled in the art to understand the embodiments disclosed herein.

Claims

1. A method for recognizing traffic signs, characterized by comprising: obtaining an image to be recognized; and processing the image to be recognized using a trained traffic sign recognition model to obtain traffic signs; wherein the traffic sign recognition model includes: a feature map extraction module, used to extract feature maps from the image to be recognized, and to process the image to be recognized and / or at least some of the feature maps with a Gaussian-like operator convolution kernel to obtain an adjusted feature map with partially missing information, wherein at least one element of the Gaussian-like operator convolution kernel has a value of zero and the elements other than zero conform to a Gaussian distribution; and a fusion classification module, used to fuse the feature map and the adjusted feature map, and to determine the traffic sign based on the fused feature map and adjusted feature map.

2. The method of claim 1, wherein, The feature map extraction module includes: A convolutional neural network includes an input layer and at least two convolutional layers connected in series, used to perform convolution operations on the image to be recognized to obtain the feature map; The image processing unit includes multiple processing subunits, which are respectively connected to the input layer and / or at least part of the convolutional layer, for processing the image to be recognized and / or at least part of the feature maps output by the convolutional layer to obtain the image to be recognized and / or the feature maps with partially missing information.

3. The method of claim 2, wherein the fusion classification module comprises: a first feature fusion unit, used to fuse the image to be recognized and / or the feature maps with missing information to obtain the adjusted feature map; and a second feature fusion unit, used to concatenate the feature maps and the adjusted feature map, so as to determine the traffic sign based on the concatenated feature maps and adjusted feature map.

4. The method of claim 2, wherein the Gaussian-like operator convolution kernel is the convolution kernel corresponding to each of the plurality of processing subunits, and at least some of the convolution kernels have at least one element with a value of zero.

5. The method of claim 3, wherein the image processing unit further comprises: an interpolation subunit, used to perform bilinear interpolation on the feature maps with missing information so that the image to be recognized and the feature maps have the same size; and wherein the first feature fusion unit is specifically used to fuse the image to be recognized and the feature maps of the same size to obtain the adjusted feature map.

6. The method according to any one of claims 1 to 5, characterized by further comprising: associating traffic sign image data with annotation data to generate sample data; randomly grouping the sample data to obtain training data and test data; after training the traffic sign recognition model using the training data, processing the test data using the trained traffic sign recognition model to obtain test results; and determining the accuracy of the recognition results output by the traffic sign recognition model by comparing the test results with the annotation data of the test data.

7. An apparatus for recognizing traffic signs, characterized by comprising: an image acquisition module, used to acquire an image to be recognized; and an image recognition module, used to process the image to be recognized using a trained traffic sign recognition model to obtain a traffic sign; wherein the traffic sign recognition model comprises: a feature map extraction module, used to extract feature maps from the image to be recognized, and to process the image to be recognized and / or at least some of the feature maps with a Gaussian-like operator convolution kernel to obtain an adjusted feature map with partially missing information, wherein at least one element of the Gaussian-like operator convolution kernel has a value of zero and the non-zero elements conform to a Gaussian distribution; and a fusion classification module, used to fuse the feature maps and the adjusted feature map, and to determine the traffic sign based on the fused feature maps and adjusted feature map.

8. An electronic device, comprising: a processor; and a memory having executable code stored thereon, which, when executed by the processor, causes the processor to perform the method according to any one of claims 1 to 6.

9. A computer storage medium, characterized in that it stores executable code which, when executed, performs the method according to any one of claims 1 to 6.
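
For illustration only, and not as part of the claims, the Gaussian-like operator convolution kernel recited in claims 1 and 7 might be sketched as follows. The sketch builds a Gaussian kernel, forces chosen elements to zero, applies it to a single-channel feature map, and stacks the original and adjusted maps as a simple stand-in for fusion. The helper names (`gaussian_like_kernel`, `conv2d_same`, `fuse`), the kernel size, the sigma value, the choice of zeroed positions, the renormalization step, and the stacking-based fusion are all assumptions made for this example; the patent text does not specify them.

```python
import numpy as np


def gaussian_like_kernel(size=3, sigma=1.0, zero_positions=((0, 0),)):
    """Hypothetical helper: Gaussian kernel with selected elements set to zero.

    `size` is assumed to be odd. Which positions are zeroed is an
    assumption; the claims only require at least one zero element.
    """
    half = size // 2
    ys, xs = np.mgrid[-half:half + 1, -half:half + 1]
    kernel = np.exp(-(xs ** 2 + ys ** 2) / (2.0 * sigma ** 2))
    for r, c in zero_positions:
        kernel[r, c] = 0.0  # discard the contribution at this position
    # Renormalizing is an assumption; the remaining non-zero elements
    # stay proportional to the Gaussian values.
    return kernel / kernel.sum()


def conv2d_same(image, kernel):
    """Plain single-channel 2-D correlation with zero padding ('same' size)."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    out = np.zeros_like(image, dtype=float)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + kh, j:j + kw] * kernel)
    return out


def fuse(feature_map, adjusted_map):
    """Stack the original and adjusted maps along a new channel axis."""
    return np.stack([feature_map, adjusted_map], axis=0)


rng = np.random.default_rng(0)
feature_map = rng.random((8, 8))  # stand-in for a CNN feature map
kernel = gaussian_like_kernel(size=3, sigma=1.0, zero_positions=((0, 0), (2, 2)))
adjusted_map = conv2d_same(feature_map, kernel)  # map with partially missing information
fused = fuse(feature_map, adjusted_map)
print(kernel.round(3))
print(fused.shape)
```

In a trained model the zeroed kernel would more likely be applied as a fixed convolution layer inside the network, with the fused maps passed on to a classifier; the NumPy loop above only shows the arithmetic.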